# Data preparation using a vectorizer
In this notebook we use the sklearn vectorizer to build a bag of words.
## Bag of words
Bag of words ("BoW") is a technique where we extract the words of a text as features and build vectors from those words.
*For example*:
- "this is good"
- "good day"
- "this is a long day"

When we vectorize these lines of text we get a bag of words: the list of words **[this, is, good, day, a, long]**. For each line of text, the vectorizer records the number of occurrences of each of these words in that line.

The vectorizer output will be:
- [1 1 1 0 0 0]
- [0 0 1 1 0 0]
- [1 1 0 1 1 1]

Now if we vectorize a new line, e.g.:
- '*this day is a good long day*'

the output will be:
- [1 1 1 2 1 1]

What will the output be if we enter a line made up entirely of new words? For example:

'*I love cats*'

The output is [0 0 0 0 0 0], because none of its words are in the bag of words.

Look at it as a table, where the ***features*** (the columns) are the words of the given text and each row contains the number of occurrences in one sentence.

| input lines | this | is | good | day | a | long|
|:------|:---:|:---:|:---:|:---:|:---:|:---:|
| "this is good" | 1 | 1 | 1 | 0 | 0 | 0 |
| "good day" | 0 | 0 | 1 | 1 | 0 | 0 |
| "this is a long day" | 1 | 1 | 0 | 1 | 1 | 1 |
| "this day is a good long day" | 1 | 1 | 1 | 2 | 1 | 1 |
| "I love cats" | 0 | 0 | 0 | 0 | 0 | 0 |

**Drawback:** word order is not taken into account.
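
The counting described in this section can be sketched in a few lines of plain Python. This is only an illustration, not what sklearn does internally: the helper names `build_vocabulary` and `vectorize` are made up for this example, and the vocabulary is kept in order of first appearance so that it matches the table. sklearn's `CountVectorizer` sorts its vocabulary alphabetically and, with its default settings, ignores one-letter tokens such as "a", so its columns would come out differently.

```python
from collections import Counter

corpus = ["this is good", "good day", "this is a long day"]


def build_vocabulary(lines):
    """Collect each distinct word, in order of first appearance (hypothetical helper)."""
    vocabulary = []
    for line in lines:
        for word in line.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def vectorize(line, vocabulary):
    """Count how often each vocabulary word occurs in the line (hypothetical helper)."""
    counts = Counter(line.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = build_vocabulary(corpus)
print(vocabulary)

# Vectorize the training lines, then two unseen lines.
for line in corpus + ["this day is a good long day", "I love cats"]:
    print(vectorize(line, vocabulary), line)
```

Words that were never seen while building the vocabulary, such as "love" or "cats", simply have no column, which is why the last line comes out as all zeros.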

### n_grams
The n_grams technique takes n successive words, instead of a single word, as one feature:
- 1_gram == unigram (called monogram in this notebook): 1 word at a time
- 2_grams == bigram: 2 successive words at a time
- 3_grams == trigram: 3 successive words at a time

This technique keeps short word sequences, and so part of the word order, in the dataframe.

With **bigrams** the features will be **[this is, is good, good day, is a, a long, long day]**, so the output also changes: instead of counting the occurrences of a word, we count the occurrences of each two-word subsequence.

The *output*:

| line | this is | is good | good day | is a | a long | long day |
|:------|:---:|:---:|:---:|:---:|:---:|:---:|
| "this is good" | 1 | 1 | 0 | 0 | 0 | 0 |
| "good day" | 0 | 0 | 1 | 0 | 0 | 0 |
| "this is a long day" | 1 | 0 | 0 | 1 | 1 | 1 |
| "this day is a good long day" | 0 | 0 | 0 | 1 | 0 | 1 |
| "I love cats" | 0 | 0 | 0 | 0 | 0 | 0 |

**Drawback:** we lose words whenever the exact same subsequence does not appear in the new text.
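
A minimal plain-Python sketch of the bigram counting above. The helpers `ngrams` and `vectorize` are hypothetical, written only for this illustration; they are not part of sklearn or of this project.

```python
from collections import Counter


def ngrams(line, n):
    """Return every run of n successive words in the line, joined by spaces."""
    words = line.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


corpus = ["this is good", "good day", "this is a long day"]

# Bigram features, kept in order of first appearance to match the table.
features = []
for line in corpus:
    for gram in ngrams(line, 2):
        if gram not in features:
            features.append(gram)


def vectorize(line, features, n=2):
    """Count how often each known n-gram occurs in the line."""
    counts = Counter(ngrams(line, n))
    return [counts[feature] for feature in features]


print(features)
print(vectorize("this day is a good long day", features))
```

In the new sentence only "is a" and "long day" match known bigrams; every other word pair is new and is dropped.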

### ngrams range
n_grams of a single size are often not good enough, so we use ngram_range.\
ngram_range is a tuple (s, e): all the n_grams from s_gram to e_gram, inclusive, are taken into account.\
ngram_range == (1, 4) --> 1_gram, 2_grams, 3_grams and 4_grams are all added to the bag of words

For our example, let's use an ngram_range of (1, 3).\
The bag of words will contain **[this, this is, this is good, this is a, is, is good, is a, is a long, good, good day, day, a, a long, a long day, long, long day]**

The *output* table:

| line | this| this is | this is good | this is a | is | is good | is a | is a long | good | good day | day | a | a long | a long day | long | long day |
|:------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| "this is good" | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "good day" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| "this is a long day" | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| "this day is a good long day" | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 1 | 1 |
| "I love cats" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

**Drawback:** the dataframe grows quickly, and spelling errors, abbreviations or SMS-style writing such as [lol, amaaaaazing, hhh] all become features.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re

# for relative imports
import sys
sys.path.insert(0, '..')
from src.data.text_2_dataframe import Text2DF

## initializing the vectorizers

In [6]:
# bag of words, unigram (1_gram)
CV1 = Text2DF(vectorizer=True, ngrams_range=(1, 1), use_wordlist=False, use_stopwords=False)

# bag of words, unigram, using a vocabulary
# (using a vocabulary means providing the features at instantiation)
CV2 = Text2DF(vectorizer=True, ngrams_range=(1, 1), use_wordlist=True, use_stopwords=False)

# bag of words, unigram, using a vocabulary and removing stopwords
CV3 = Text2DF(vectorizer=True, ngrams_range=(1, 1), use_wordlist=True, use_stopwords=True)

# bag of words, unigram, removing stopwords
CV4 = Text2DF(vectorizer=True, ngrams_range=(1, 1), use_wordlist=False, use_stopwords=True)

# bag of words, bigram
CV5 = Text2DF(vectorizer=True, ngrams_range=(2, 2), use_wordlist=False, use_stopwords=False)

## reading the data

In [7]:
cleaned_data= pd.read_csv('../data/interim/cleaned_dataset.csv')
filtered_data= pd.read_csv('../data/interim/filtered_dataset.csv')
display(cleaned_data.head())
filtered_data.head()

| Unnamed: 0 | Sentiment | tweet |
|:---:|:---:|:------|
| 0 | 0 | my poor little dumpling in holmdel vids he was... |
| 1 | 0 | i m off too bed i gotta wake up hella early to... |
| 2 | 0 | i havent been able to listen to it yet my spea... |
| 3 | 0 | now remembers why solving a relatively big equ... |
| 4 | 0 | ate too much feel sick |


| Unnamed: 0 | Sentiment | tweet |
|:---:|:---:|:------|
| 0 | 0 | poor little vids trying hope dont try hard ton... |
| 1 | 0 | bed wake early tomorrow morning |
| 2 | 0 | able listen speakers |
| 3 | 0 | now solving big equation total pain butt |
| 4 | 0 | ate feel sick |


We are going to use:
- the cleaned version with vectorizers initialized with a vocabulary and/or stopwords
- the filtered version with no vocabulary and no stopwords

so that we can see the difference.
The bigram bag-of-words vectorizer will be used with the filtered dataset.

In [8]:
# every vectorizer created with use_wordlist=False must be fitted on its text data
CV1.count_vectorizer.fit(filtered_data['tweet'])
CV4.count_vectorizer.fit(cleaned_data['tweet'])
CV5.count_vectorizer.fit(filtered_data['tweet'])

