```
#############################################
##                                         ##
##  Natural Language Processing in Python  ##
##                                         ##
#############################################

§2 Sentiment Analysis in Python

§2.2 Numeric features from reviews
```

# Bag-of-words

## What is a bag-of-words (BOW)?

* It describes the occurrence of words within a document or a collection of documents (corpus).

* It builds a vocabulary of the words and a measure of their presence.

* Word order and grammar rules are discarded: only which words occur, and how often, is kept.
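
To illustrate the last point, here is a minimal sketch that uses only the standard library. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for this illustration, not part of the course code. Two sentences with opposite meanings end up with the same bag:

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercased, whitespace-separated tokens (illustrative helper)."""
    return Counter(text.lower().split())


bag_a = bag_of_words('the dog bit the man')
bag_b = bag_of_words('the man bit the dog')

# Both sentences contain the same words the same number of times,
# so their bags are equal even though the meaning differs.
print(bag_a == bag_b)
```

This is why a BOW model cannot, on its own, tell who bit whom.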

## Code for a simple BOW with `Counter`:

In [1]:
from nltk.tokenize import word_tokenize
from collections import Counter

Counter(
    word_tokenize(
        'This is the best book ever. I loved the book and highly recommend it!!!'
    ))

Counter({'This': 1,
         'is': 1,
         'the': 2,
         'best': 1,
         'book': 2,
         'ever': 1,
         '.': 1,
         'I': 1,
         'loved': 1,
         'and': 1,
         'highly': 1,
         'recommend': 1,
         'it': 1,
         '!': 3})
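
The raw counts above keep capitalisation and punctuation as separate tokens (`'This'`, `'.'`, `'!'`). A possible clean-up is sketched below. It assumes a simple regular-expression tokenizer in place of `word_tokenize`, chosen only so that the snippet runs without NLTK data, and it will not split text exactly as NLTK does.

```python
import re
from collections import Counter

text = 'This is the best book ever. I loved the book and highly recommend it!!!'

# Lowercase the text, then keep only runs of letters so punctuation is dropped.
tokens = re.findall(r'[a-z]+', text.lower())
clean_bow = Counter(tokens)

print(clean_bow['the'], clean_bow['book'])
print('!' in clean_bow)
```

Whether to lowercase or drop punctuation depends on the task: for sentiment, repeated exclamation marks may themselves carry signal.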

## What will the BOW end result be?

* The output will look something like this:

    ![BOW end result](ref1.%20BOW%20end%20result.jpg)

## Code for the `CountVectorizer` class:

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv('ref2. Amazon product reviews sample.csv')[[
    'score', 'review'
]]
vect = CountVectorizer(max_features=1000)
vect

CountVectorizer(max_features=1000)
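
The `max_features` argument keeps only the terms with the highest total count across the corpus. The sketch below shows this on a toy corpus invented for illustration (`docs` and `vect_small` are not part of the review data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: 'good' occurs 4 times, 'book' twice, every other word once.
docs = ['good book good story', 'good book bad ending', 'good plot']

vect_small = CountVectorizer(max_features=2)
vect_small.fit(docs)

# Only the two most frequent terms survive in the vocabulary.
print(sorted(vect_small.vocabulary_))

# Words outside the vocabulary ('story') are ignored at transform time.
print(vect_small.transform(['good book good story']).toarray())
```

Reading the vocabulary from the `vocabulary_` attribute avoids depending on the feature-name method, whose name differs between scikit-learn versions.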

In [3]:
vect.fit(data.review)
X = vect.transform(data.review)
X

<10000x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 406668 stored elements in Compressed Sparse Row format>
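
The summary above explains why the result is stored as a sparse matrix. As a quick back-of-the-envelope check, using only the numbers printed in that output and assuming each stored element is a non-zero count:

```python
# Figures taken from the sparse matrix summary above.
n_rows, n_cols = 10000, 1000
stored = 406668

# Fraction of cells that hold a count.
density = stored / (n_rows * n_cols)
# Average number of distinct vocabulary words per review.
avg_terms_per_review = stored / n_rows

print(f'{density:.2%}')
print(round(avg_terms_per_review, 1))
```

Only about 4% of the cells hold a value, so the dense array produced by `toarray()` in the next cell consists mostly of zeros.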

In [4]:
# Convert the sparse matrix to a dense NumPy array
my_array = X.toarray()
my_array

array([[0, 0, 0, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [5]:
# Convert the array to a DataFrame, using the vocabulary as column names
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names())
X_df

       10  100   12   15  1984   20   30   40  451   50  ...  wrong  wrote  year  years  yes  yet  you  young  your  yourself
0       0    0    0    0     0    0    0    0    0    0  ...      0      0     0      0    0    0    0      0     1         0
1       0    0    0    0     0    0    0    0    0    0  ...      0      0     0      1    0    0    1      0     0         0
2       0    0    0    0     0    0    0    0    0    0  ...      0      0     0      1    0    0    2      0     0         0
3       0    0    0    0     0    0    0    0    0    0  ...      0      0     0      0    0    0    0      0     0         0
4       0    0    0    0     0    0    0    0    0    0  ...      0      1     0      0    0    0    3      0     1         0
...   ...  ...  ...  ...   ...  ...  ...  ...  ...  ...  ...    ...    ...   ...    ...  ...  ...  ...    ...   ...       ...
9995    0    0    0    0     0    0    0    0    0    0  ...      0      0     0      0    0    0    0      0     0         0
9996    0    0    0    0     0    0    0    0    0    0  ...      0      0     0      0    0    0    0      0     0         0
9997    0    0    0    0     0    0    0    0    0    0  ...      0      0     0      0    0    0    1      0     0         0
9998    0    0    0    0     0    0    0    0    0    0  ...      0      0     0      0    0    0    0      0     0         0
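
The cell above calls `get_feature_names()`, which matches the scikit-learn 0.24.1 used in this notebook. From scikit-learn 1.0 the method is deprecated in favour of `get_feature_names_out()`, and it was removed in 1.2. The helper below is one possible way to support both. The names `feature_names` and `LegacyVectorizerStub` are hypothetical, and the stub exists only so the snippet runs without fitting a real vectorizer.

```python
def feature_names(vectorizer):
    """Return the fitted vocabulary as a list, across scikit-learn versions."""
    if hasattr(vectorizer, 'get_feature_names_out'):
        # Newer releases (scikit-learn >= 1.0)
        return list(vectorizer.get_feature_names_out())
    # Older releases, such as the 0.24.1 used in this notebook
    return list(vectorizer.get_feature_names())


class LegacyVectorizerStub:
    """Hypothetical stand-in for an old-style fitted vectorizer (demo only)."""

    def get_feature_names(self):
        return ['you', 'your']


print(feature_names(LegacyVectorizerStub()))
```

With a real fitted vectorizer, the call would be `pd.DataFrame(my_array, columns=feature_names(vect))`.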


## Practice question about BOW:

* Which of the following statements about BOW is true?

    $\Box$ Bag-of-words preserves the word order and grammar rules.
    
    $\Box$ Bag-of-words describes the order and frequency of words or tokens within a corpus of documents.
    
    $\boxtimes$ Bag-of-words is a simple but effective method to build a vocabulary of all the words occurring in a document.
    
    $\Box$ Bag-of-words can only be applied to a large document, not to shorter documents or single sentences.

## Practice exercises for Bag-of-words:

$\blacktriangleright$ **The first BOW practice:**

In [6]:
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer

annak = [
    'Happy families are all alike;',
    'every unhappy family is unhappy in its own way'
]

# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result
print(anna_bow.toarray())

[[1 1 1 0 1 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 1 0 1 1 1 1 2 1]]
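
The printed matrix does not show which column belongs to which word. The pure-Python sketch below mimics the `CountVectorizer` defaults (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) to make the mapping visible. It is an approximation for illustration, not the library's implementation.

```python
import re
from collections import Counter

annak = [
    'Happy families are all alike;',
    'every unhappy family is unhappy in its own way'
]

# Same shape as CountVectorizer's default token pattern: 2+ word characters.
token_pattern = re.compile(r'\b\w\w+\b')
tokenized = [token_pattern.findall(doc.lower()) for doc in annak]

# Columns follow the alphabetically sorted vocabulary.
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row of counts per document, one column per vocabulary term.
rows = []
for doc in tokenized:
    counts = Counter(doc)
    rows.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

The `2` in the second row lines up with `'unhappy'`, the only word that appears twice in a sentence.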


$\blacktriangleright$ **Package pre-loading:**

In [7]:
import pandas as pd

$\blacktriangleright$ **Data pre-loading:**

In [8]:
reviews = pd.read_csv('ref2. Amazon product reviews sample.csv')[[
    'score', 'review'
]]

$\blacktriangleright$ **BOW practice on product reviews:**

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Convert the BOW matrix to a DataFrame, using the vocabulary as column names
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   about  after  all  also  am  an  and  any  are  as  ...  what  when  which  \
0      0      0    1     0   0   0    2    0    0   0  ...     0     0      0   
1      0      0    0     0   0   0    3    1    1   0  ...     0     0      0   
2      0      0    3     0   0   1    4    0    1   1  ...     0     0      1   
3      0      0    0     0   0   0    9    0    1   0  ...     0     0      0   
4      0      1    0     0   0   0    3    0    1   0  ...     0     0      0   

   who  will  with  work  would  you  your  
0    2     0     1     0      2    0     1  
1    0     0     0     0      1    1     0  
2    1     0     0     1      1    2     0  
3    0     0     0     0      0    0     0  
4    0     0     0     0      0    3     1  

[5 rows x 100 columns]


## Version checking:

In [10]:
import sys
import nltk
import sklearn

print('The Python version is {}.'.format(sys.version.split()[0]))
print('The NLTK version is {}.'.format(nltk.__version__))
print('The pandas version is {}.'.format(pd.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The Python version is 3.7.9.
The NLTK version is 3.5.
The pandas version is 1.2.1.
The scikit-learn version is 0.24.1.
