# 1. Traditional NLP Preprocessing

## 1.1. Stop words

In [1]:
## You might need to run the following two lines to download stopwords for NLTK
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\86138\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
from nltk.corpus import stopwords

In [3]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

<font color='red'>Question:</font> What are stopwords? Why do we need to remove them?

### Stop words are a set of very commonly used words in a language (any language, not just English), such as "the", "is", and "and". They carry little information on their own, so removing them shrinks the vocabulary and lets us focus on the words that actually distinguish one document from another.
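As a minimal sketch of the idea, the snippet below filters a sentence against a small hand-picked stop list. The `STOP_WORDS` set and the `remove_stop_words` helper are illustrative names, not part of NLTK; the words themselves all appear in NLTK's English list printed above.

```python
# Hand-picked subset of stop words, used here instead of the full NLTK list
STOP_WORDS = {"this", "than", "all", "the", "other", "but", "it", "a"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stop_words]

sentence = "This duck runs than all the other ducks but it still runs slower than a rabbit"
filtered = remove_stop_words(sentence)
print(filtered)
# ['duck', 'runs', 'ducks', 'still', 'runs', 'slower', 'rabbit']
```

Only the content-bearing words survive, which is what the vectorizers later in this section count.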

## 1.2 Stemming

In [4]:
from nltk.stem.snowball import EnglishStemmer

In [5]:
stemmer = EnglishStemmer(ignore_stopwords=False)

In [6]:
stemmer.stem('walk')

'walk'

In [7]:
stemmer.stem('walking')

'walk'

In [8]:
stemmer.stem('walked')

'walk'

<font color='red'>Question:</font> What is stemming? Why is it helpful?

### Stemming reduces the inflected forms of a word to a common stem, so that "walk", "walking", and "walked" all map to "walk". In the context of machine learning based NLP, this makes your training data denser: it can reduce the size of the dictionary (the number of distinct words used in the corpus) two- or three-fold, or even more for highly inflected languages like French, where a single verb stem can generate dozens of word forms.
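The vocabulary-shrinking effect can be illustrated with a deliberately crude suffix stripper. This is a toy stand-in, not the Snowball algorithm used by `EnglishStemmer`; the `toy_stem` function and its suffix list are assumptions made for this sketch.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["walk", "walks", "walking", "walked", "duck", "ducks"]
vocab_before = set(words)
vocab_after = {toy_stem(w) for w in words}

print(len(vocab_before), "->", len(vocab_after))  # 6 -> 2
print(sorted(vocab_after))                        # ['duck', 'walk']
```

Six surface forms collapse into two stems, so counts that were spread over several columns end up in one.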

## 1.3 CountVectorizer

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

**Example**: Build a stem_tokenizer function

In [10]:
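import re

def stem_tokenizer(text):
    # Snowball stemmer for English; ignore_stopwords=True returns stop words unstemmed
    stemmer = EnglishStemmer(ignore_stopwords=True)
    # Replace everything except letters, digits, and hyphens with spaces, then lowercase and split
    words = re.sub(r"[^A-Za-z0-9\-]", " ", text).lower().split()
    # Reduce each token to its stem
    words = [stemmer.stem(word) for word in words]
    return words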

**Example**: Initialize a vectorizer

In [11]:
cv = CountVectorizer(stop_words=stopwords.words('english'),
                     tokenizer=stem_tokenizer,
                     lowercase=True,
                     max_df=0.9,
                     min_df=1
                    )

**Example**: Fit vectorizer using texts

In [12]:
texts = ['Rabbit runs fast', 
         'Rabbit runs very very fast', 
         'Duck runs fast too', 
         'Rabbit runs faster than Duck runs',
         'Duck runs slower than Rabbit runs',
         'Rabbit runs faster',
         'Duck runs slower',
         'Duck runs every day',
         'This duck runs than all the other ducks, but it still runs slower than a rabbit'
        ]

In [13]:
cv.fit(texts)



**Example**: Vocabulary of the vectorizer

In [14]:
cv.vocabulary_

{'rabbit': 5,
 'fast': 3,
 'duck': 1,
 'faster': 4,
 'slower': 6,
 'everi': 2,
 'day': 0,
 'still': 7}

**Example**: Vectorized texts

In [15]:
import pandas as pd

In [18]:
df = pd.DataFrame(cv.transform(texts).todense())
df

Unnamed: 0,0,1,2,3,4,5,6,7
0,0,0,0,1,0,1,0,0
1,0,0,0,1,0,1,0,0
2,0,1,0,1,0,0,0,0
3,0,1,0,0,1,1,0,0
4,0,1,0,0,0,1,1,0
5,0,0,0,0,1,1,0,0
6,0,1,0,0,0,0,1,0
7,1,1,1,0,0,0,0,0
8,0,2,0,0,0,1,1,1


In [20]:
df.columns = list(zip(*sorted(cv.vocabulary_.items(), key=lambda x: x[-1])))[0]

In [21]:
df

Unnamed: 0,day,duck,everi,fast,faster,rabbit,slower,still
0,0,0,0,1,0,1,0,0
1,0,0,0,1,0,1,0,0
2,0,1,0,1,0,0,0,0
3,0,1,0,0,1,1,0,0
4,0,1,0,0,0,1,1,0
5,0,0,0,0,1,1,0,0
6,0,1,0,0,0,0,1,0
7,1,1,1,0,0,0,0,0
8,0,2,0,0,0,1,1,1


<font color='red'>Question:</font> How does **CountVectorizer** work? What is **Bag of Words**?

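### CountVectorizer tokenizes each text, builds a vocabulary of the unique tokens, and counts how often each vocabulary term occurs in each document. The result is a document-term matrix X with one row per document and one column per term. This is the Bag of Words representation: a document is described only by its word counts, and word order is discarded. The vocabulary terms are available through the get_feature_names_out method (get_feature_names in scikit-learn versions before 1.0) or the vocabulary_ attribute. For easier visualization, we convert the matrix into a pandas data frame and label its columns with the vocabulary terms; apart from the column labels it is identical to the unlabeled output shown before.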
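A minimal pure-Python sketch of a Bag of Words count matrix, assuming simple whitespace tokenization and no stop-word removal or stemming. It illustrates the idea only and is not how `CountVectorizer` is implemented internally; the variable names are chosen for this example.

```python
from collections import Counter

docs = [
    "rabbit runs fast",
    "duck runs fast too",
    "rabbit runs faster than duck runs",
]

# Tokenize on whitespace and collect the sorted set of unique terms
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# One row per document, one column per vocabulary term
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
# ['duck', 'fast', 'faster', 'rabbit', 'runs', 'than', 'too']
for row in matrix:
    print(row)
# [0, 1, 0, 1, 1, 0, 0]
# [1, 1, 0, 0, 1, 0, 1]
# [1, 0, 1, 1, 2, 1, 0]
```

The last row would be the same for any reordering of the third sentence, which is exactly the information a Bag of Words throws away.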

<font color='red'>Question:</font> What do **min_df** and **max_df** do? Why do we want to remove the most and least frequent words?

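#### max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
#### min_df = 0.01 means "ignore terms that appear in fewer than 1% of the documents".
#### The default min_df is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.
#### Terms that appear in almost every document (such as "run" in the texts above) do not help tell documents apart, and very rare terms are often noise that only inflates the vocabulary.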
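A rough sketch of the filtering rule, assuming (as in scikit-learn) that an integer threshold is an absolute document count and a float threshold is a proportion of the documents. The helper name `filter_by_document_frequency` is hypothetical, and the code is an illustration rather than the library's implementation.

```python
from collections import Counter
from numbers import Integral

def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain the term at least once
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    low = min_df if isinstance(min_df, Integral) else min_df * n_docs
    high = max_df if isinstance(max_df, Integral) else max_df * n_docs
    return sorted(term for term, count in df.items() if low <= count <= high)

stemmed_docs = [
    ["rabbit", "run", "fast"],
    ["duck", "run", "fast"],
    ["duck", "run", "everi", "day"],
    ["rabbit", "run", "faster"],
]

# "run" occurs in 4 of 4 documents, above 0.9 * 4 = 3.6, so it is dropped
kept_default = filter_by_document_frequency(stemmed_docs, min_df=1, max_df=0.9)
print(kept_default)  # ['day', 'duck', 'everi', 'fast', 'faster', 'rabbit']

# min_df=2 additionally drops terms that occur in only one document
kept_strict = filter_by_document_frequency(stemmed_docs, min_df=2, max_df=0.9)
print(kept_strict)   # ['duck', 'fast', 'rabbit']
```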

## 1.4 Ngrams

**Example**: Add ngrams in the vectorizer

In [22]:
cv = CountVectorizer(stop_words=stopwords.words('english'),
                     tokenizer=stem_tokenizer,
                     lowercase=True,
                     max_df=0.9,
                     min_df=2,
                     ngram_range=(1, 3)
                    )

In [23]:
cv.fit(texts)



In [24]:
cv.vocabulary_

{'rabbit': 5,
 'fast': 3,
 'rabbit run': 6,
 'run fast': 9,
 'rabbit run fast': 7,
 'duck': 0,
 'duck run': 1,
 'faster': 4,
 'run faster': 10,
 'rabbit run faster': 8,
 'slower': 13,
 'run slower': 11,
 'slower rabbit': 14,
 'duck run slower': 2,
 'run slower rabbit': 12}

In [25]:
df = pd.DataFrame(cv.transform(texts).todense())

In [23]:
df.columns = list(zip(*sorted(cv.vocabulary_.items(), key=lambda x: x[-1])))[0]

In [24]:
df

Unnamed: 0,duck,duck run,duck run slower,fast,faster,rabbit,rabbit run,rabbit run fast,rabbit run faster,run fast,run faster,run slower,run slower rabbit,slower,slower rabbit
0,0,0,0,1,0,1,1,1,0,1,0,0,0,0,0
1,0,0,0,1,0,1,1,1,0,1,0,0,0,0,0
2,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0
3,1,1,0,0,1,1,1,0,1,0,1,0,0,0,0
4,1,1,1,0,0,1,1,0,0,0,0,1,1,1,1
5,0,0,0,0,1,1,1,0,1,0,1,0,0,0,0
6,1,1,1,0,0,0,0,0,0,0,0,1,0,1,0
7,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
8,2,1,0,0,0,1,0,0,0,0,0,1,1,1,1


<font color='red'>Question:</font> What is an n-gram? What problem can it solve?
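An n-gram is a sequence of n consecutive tokens. The sketch below shows one way word n-grams for a range such as `ngram_range=(1, 3)` can be generated from a token list; `word_ngrams` is a hypothetical helper written for this example, not scikit-learn's code.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """Return all space-joined n-grams for every n in the inclusive range."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

grams = word_ngrams(["rabbit", "run", "fast"])
print(grams)
# ['rabbit', 'run', 'fast', 'rabbit run', 'run fast', 'rabbit run fast']
```

Because n-grams keep short runs of adjacent words, they recover some of the word order that single-word counts discard. In the vocabulary above, 'rabbit run faster' and 'duck run slower' are separate features even though both documents contain 'run'.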

## 1.5 TFIDF

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tfidf = TfidfVectorizer(stop_words=stopwords.words('english'))
tfidf

In [29]:
tfidf.fit(texts)

In [31]:
df = pd.DataFrame(tfidf.transform(texts).todense())
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,0.0,0.0,0.750896,0.0,0.531611,0.391849,0.0,0.0
1,0.0,0.0,0.0,0.0,0.750896,0.0,0.531611,0.391849,0.0,0.0
2,0.0,0.531611,0.0,0.0,0.750896,0.0,0.0,0.391849,0.0,0.0
3,0.0,0.383134,0.0,0.0,0.0,0.622417,0.383134,0.564813,0.0,0.0
4,0.0,0.402638,0.0,0.0,0.0,0.0,0.402638,0.593566,0.568722,0.0
5,0.0,0.0,0.0,0.0,0.0,0.794357,0.488973,0.36042,0.0,0.0
6,0.0,0.531611,0.0,0.0,0.0,0.0,0.0,0.391849,0.750896,0.0
7,0.643201,0.334407,0.0,0.643201,0.0,0.0,0.0,0.24649,0.0,0.0
8,0.0,0.271489,0.522184,0.0,0.0,0.0,0.271489,0.400227,0.383476,0.522184


In [32]:
df.columns = list(zip(*sorted(tfidf.vocabulary_.items(), key=lambda x: x[-1])))[0]

In [33]:
df

Unnamed: 0,day,duck,ducks,every,fast,faster,rabbit,runs,slower,still
0,0.0,0.0,0.0,0.0,0.750896,0.0,0.531611,0.391849,0.0,0.0
1,0.0,0.0,0.0,0.0,0.750896,0.0,0.531611,0.391849,0.0,0.0
2,0.0,0.531611,0.0,0.0,0.750896,0.0,0.0,0.391849,0.0,0.0
3,0.0,0.383134,0.0,0.0,0.0,0.622417,0.383134,0.564813,0.0,0.0
4,0.0,0.402638,0.0,0.0,0.0,0.0,0.402638,0.593566,0.568722,0.0
5,0.0,0.0,0.0,0.0,0.0,0.794357,0.488973,0.36042,0.0,0.0
6,0.0,0.531611,0.0,0.0,0.0,0.0,0.0,0.391849,0.750896,0.0
7,0.643201,0.334407,0.0,0.643201,0.0,0.0,0.0,0.24649,0.0,0.0
8,0.0,0.271489,0.522184,0.0,0.0,0.0,0.271489,0.400227,0.383476,0.522184


<font color='red'>Question:</font> What does TFIDF mean? Why is it designed in this way?

#### Conclusion. TF-IDF (Term Frequency - Inverse Document Frequency) weights each word by how often it occurs in a document (TF), scaled down by how many documents in the corpus contain it (IDF). Words that are frequent in one document but rare elsewhere get the highest weights, so the score reflects how relevant a word is to a given document. It is a relatively simple but intuitive approach to weighting words, which makes it a good starting point for a variety of tasks.
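The weights in the table above can be reproduced by hand. The sketch below assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The token lists are the nine example texts with stop words removed, written out manually for this example, and `tfidf_weights` is a hypothetical helper.

```python
import math
from collections import Counter

# The nine example texts after lowercasing and stop-word removal (written out by hand)
docs = [
    ["rabbit", "runs", "fast"],
    ["rabbit", "runs", "fast"],
    ["duck", "runs", "fast"],
    ["rabbit", "runs", "faster", "duck", "runs"],
    ["duck", "runs", "slower", "rabbit", "runs"],
    ["rabbit", "runs", "faster"],
    ["duck", "runs", "slower"],
    ["duck", "runs", "every", "day"],
    ["duck", "runs", "ducks", "still", "runs", "slower", "rabbit"],
]

def tfidf_weights(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for doc in tokenized_docs:
        raw = {term: tf * idf[term] for term, tf in Counter(doc).items()}
        norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
        rows.append({term: weight / norm for term, weight in raw.items()})
    return rows

weights = tfidf_weights(docs)
print(round(weights[0]["fast"], 4))  # 0.7509
print(round(weights[0]["runs"], 4))  # 0.3918
```

In the first document, 'runs' occurs in every text and gets the lowest weight, while 'fast' occurs in only three of the nine and gets the highest.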