```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§2 Sentiment Analysis in Python

§2.2 Numeric features from reviews
```

# Getting granular with n-grams

## Why does context matter?

* Putting 'not' in front of a word (negation) is one example of how context matters.

* E.g.,

    > *I am happy, **not sad**.*
    > 
    > *I am sad, **not happy**.*
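
A minimal sketch of why this matters for a BOW, using the two example sentences above. It assumes the default `CountVectorizer` tokenizer, which drops single-character tokens such as "I". The variable names are illustrative only. With unigrams the two sentences get identical count vectors, while bigrams tell them apart.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['I am happy, not sad.', 'I am sad, not happy.']

# Unigrams only: both sentences contain exactly the same tokens
unigram_vect = CountVectorizer(ngram_range=(1, 1))
X_uni = unigram_vect.fit_transform(sentences).toarray()
print(sorted(unigram_vect.vocabulary_))
print(X_uni)

# Bigrams only: word order now shows up in the features
bigram_vect = CountVectorizer(ngram_range=(2, 2))
X_bi = bigram_vect.fit_transform(sentences).toarray()
print(sorted(bigram_vect.vocabulary_))
print(X_bi)
```

The bigram vocabulary contains both 'not sad' and 'not happy', so the negation is no longer lost.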

## How to capture context with a BOW?

* **Unigrams**: single tokens.

* **Bigrams**: pairs of tokens.

* **Trigrams**: triples of tokens.

* **N-grams**: a sequence of n tokens.
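
The definitions above can be sketched in plain Python. The helper `ngrams` below is hypothetical and written only for illustration. It is not how `CountVectorizer` is implemented, and it assumes the text is already lowercased and split into tokens.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = 'the weather today is wonderful'.split()

# n = 1 gives unigrams, n = 2 bigrams, n = 3 trigrams
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence with k tokens yields k - n + 1 n-grams, so five tokens give four bigrams and three trigrams.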

## Coding n-grams with `CountVectorizer`:

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

sentence = ['The weather today is wonderful.']

# Only unigrams
vect_1 = CountVectorizer(ngram_range=(1, 1))
vect_1.fit(sentence)
X = vect_1.transform(sentence)
vect_1.get_feature_names()

['is', 'the', 'today', 'weather', 'wonderful']

In [2]:
# Uni- and bigrams
vect_1_2 = CountVectorizer(ngram_range=(1, 2))
vect_1_2.fit(sentence)
X = vect_1_2.transform(sentence)
vect_1_2.get_feature_names()

['is',
 'is wonderful',
 'the',
 'the weather',
 'today',
 'today is',
 'weather',
 'weather today',
 'wonderful']

In [3]:
# Only bigrams
vect_2 = CountVectorizer(ngram_range=(2, 2))
vect_2.fit(sentence)
X = vect_2.transform(sentence)
vect_2.get_feature_names()

['is wonderful', 'the weather', 'today is', 'weather today']

In [4]:
# Only trigrams
vect_3 = CountVectorizer(ngram_range=(3, 3))
vect_3.fit(sentence)
X = vect_3.transform(sentence)
vect_3.get_feature_names()

['the weather today', 'today is wonderful', 'weather today is']

## What is the best n?

* A longer sequence of tokens:

    * results in more features

    * can lead to higher precision of machine learning models

    * has a risk of overfitting
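
The first point can be checked on a toy corpus. The three sentences below are made up for illustration, and the sketch only counts features. It does not measure precision or overfitting.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the review data sets
corpus = [
    'The weather today is wonderful.',
    'The weather yesterday was terrible.',
    'Today is a wonderful day.',
]

# Vocabulary size for unigrams, then up to bigrams, then up to trigrams
feature_counts = {}
for max_n in (1, 2, 3):
    vect = CountVectorizer(ngram_range=(1, max_n))
    vect.fit(corpus)
    feature_counts[max_n] = len(vect.vocabulary_)
    print('ngram_range=(1, {}): {} features'.format(max_n, feature_counts[max_n]))
```

Even on three short sentences the vocabulary grows with every extra token length, which is why longer n-grams are usually combined with a limit on the vocabulary size.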

## How to specify the vocabulary size?

* `CountVectorizer(max_features, max_df, min_df)`:

    * **`max_features`**: if specified, the vocabulary includes only the `max_features` most frequent terms across the corpus

        * if `max_features = None`, all terms are included
        
    * **`max_df`**: ignore terms whose document frequency is higher than the specified threshold
        
        * if it is set to an integer, it is an absolute document count; if it is set to a float, it is a proportion of documents
        
        * default is `1.0`, which means it does not ignore any terms
        
    * **`min_df`**: ignore terms whose document frequency is lower than the specified threshold
    
        * if it is set to an integer, it is an absolute document count; if it is set to a float, it is a proportion of documents
        
        * default is `1`, which means it does not ignore any terms
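
A small sketch of the three parameters on a made-up four-document corpus, where 'good' appears in three documents, 'movie' in two, and every other term in one. The corpus and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: 'good' is in 3 documents, 'movie' in 2, the rest in 1
corpus = ['good movie', 'good plot', 'good acting', 'bad movie']

# Keep only the 2 most frequent terms across the corpus
vocab_top = sorted(CountVectorizer(max_features=2).fit(corpus).vocabulary_)

# Integer max_df: drop terms that appear in more than 2 documents
vocab_max_int = sorted(CountVectorizer(max_df=2).fit(corpus).vocabulary_)

# Float max_df: drop terms that appear in more than 50% of the documents
vocab_max_float = sorted(CountVectorizer(max_df=0.5).fit(corpus).vocabulary_)

# Integer min_df: drop terms that appear in fewer than 2 documents
vocab_min = sorted(CountVectorizer(min_df=2).fit(corpus).vocabulary_)

print(vocab_top)
print(vocab_max_int)
print(vocab_max_float)
print(vocab_min)
```

With four documents, `max_df=2` and `max_df=0.5` describe the same cut-off, so both remove only 'good'.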

## Practice exercises for getting granular with n-grams:

$\blacktriangleright$ **Package pre-loading:**

In [5]:
import pandas as pd

$\blacktriangleright$ **Data pre-loading:**

In [6]:
reviews_all = pd.read_csv('ref2. Amazon product reviews sample.csv')[[
    'score', 'review'
]]
reviews = reviews_all[:100]

$\blacktriangleright$ **BOW with a specified token sequence length practice:**

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify token sequence and fit
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   10  10 95  10 cups  100  100 years  110  110 years  114622  \
0   0      0        0    0          0    0          0       0   
1   0      0        0    0          0    0          0       0   
2   0      0        0    0          0    0          0       0   
3   0      0        0    0          0    0          0       0   
4   0      0        0    0          0    0          0       0   

   114622 excellent  12  ...  youtube video  yr  yr old  yucky  yucky thick  \
0                 0   0  ...              0   0       0      0            0   
1                 0   0  ...              0   0       0      0            0   
2                 0   0  ...              0   0       0      0            0   
3                 0   0  ...              0   0       0      0            0   
4                 0   0  ...              0   0       0      0            0   

   zelbessdisk  zelbessdisk three  zen  zen baseball  zen motorcycle  
0            0                  0    0             0           

$\blacktriangleright$ **Data pre-loading for the movie reviews:**

In [8]:
movies_all = pd.read_csv('ref3. IMDB movie reviews sample.csv')[[
    'review', 'label'
]]
movies = movies_all[:1000]

$\blacktriangleright$ **Movie reviews with vocabulary size practice:**

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify size of vocabulary and fit
vect = CountVectorizer(max_features=100)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   about  all  also  an  and  any  are  as  at  bad  ...  well  were  what  \
0      0    0     0   0    1    0    0   2   0    0  ...     0     0     0   
1      0    3     1   1   11    0    3   3   4    0  ...     0     0     1   
2      0    0     0   1    7    0    1   2   1    0  ...     0     0     0   
3      0    0     0   2    1    0    1   2   2    0  ...     1     0     0   
4      0    3     0   0    8    0    3   1   0    0  ...     2     1     0   

   when  which  who  will  with  would  you  
0     0      0    0     0     1      1    0  
1     1      2    0     2     7      2    3  
2     0      0    0     0     2      0    0  
3     0      0    1     0     0      0    1  
4     1      1    0     0     2      0    0  

[5 rows x 100 columns]


In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# Build and fit the vectorizer
vect = CountVectorizer(max_df=200)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   00  000  000s  007  00s  01  06  07  08  10  ...  zombification  zombified  \
0   0    0     0    0    0   0   0   0   0   0  ...              0          1   
1   0    0     0    0    0   0   0   0   0   1  ...              0          0   
2   0    0     0    0    0   0   0   0   0   0  ...              0          0   
3   0    0     0    0    0   0   0   0   0   0  ...              0          0   
4   0    0     0    0    0   0   0   0   0   1  ...              0          0   

   zone  zoo  zoom  zooms  zsigmond  zulu  zuniga  zvyagvatsev  
0     0    0     0      0         0     0       0            0  
1     0    0     0      0         0     0       0            0  
2     0    0     0      0         0     0       0            0  
3     0    0     0      0         0     0       0            0  
4     0    0     0      0         0     0       0            0  

[5 rows x 17669 columns]


In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# Build and fit the vectorizer
vect = CountVectorizer(min_df=50)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   10  about  absolutely  acting  action  actor  actors  actually  after  \
0   0      0           0       0       0      0       0         0      0   
1   1      0           0       1       0      0       1         0      0   
2   0      0           0       0       0      0       0         0      1   
3   0      0           0       0       1      0       0         0      0   
4   1      0           0       0       1      0       0         0      0   

   again  ...  wouldn  written  wrong  year  years  yes  yet  you  young  your  
0      0  ...       0        0      0     0      0    0    0    0      0     0  
1      0  ...       0        0      0     2      0    0    1    3      0     2  
2      0  ...       0        0      0     0      0    0    0    0      1     0  
3      0  ...       0        0      0     0      0    0    0    1      1     0  
4      0  ...       0        0      1     0      0    0    0    0      0     0  

[5 rows x 434 columns]


$\blacktriangleright$ **BOW with n-grams and vocabulary size practice:**

In [12]:
# Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(reviews.review)

# Transform the review
X_review = vect.transform(reviews.review)

# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   1980 style  aa batteries  aaa batteries  able to  about the  about this  \
0           0             0              0        0          0           0   
1           0             0              0        0          0           0   
2           0             0              0        0          0           0   
3           0             0              0        0          0           0   
4           0             0              0        0          0           0   

   across the  after that  again the  ahead of  ...  you know  you look  \
0           0           0          0         0  ...         0         0   
1           0           0          0         0  ...         0         0   
2           0           0          0         0  ...         0         0   
3           0           0          0         0  ...         0         0   
4           0           0          0         0  ...         1         0   

   you need  you should  you ve  you want  you will  your imagination  \
0      

## Version checking:

In [13]:
import sys
import sklearn

print('The Python version is {}.'.format(sys.version.split()[0]))
print('The pandas version is {}.'.format(pd.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The Python version is 3.7.9.
The pandas version is 1.2.1.
The scikit-learn version is 0.24.1.
