## Bag of Words Method

In this notebook, we go over the Bag of Words (BoW) method to convert text data into numerical values. 

A vocabulary of known words (tokens) is extracted from the text, the occurrence of each word is scored, and the resulting values are stored in a vector whose length equals the vocabulary size. There are several versions of BoW, corresponding to different word-scoring methods. We use the Sklearn library to calculate the BoW numerical values using:

1. Binary Scoring
2. Word Counts
3. Term Frequencies
4. Term Frequency-Inverse Document Frequencies
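
The steps described above (extract a vocabulary, score the words, store the scores in a vector) can be sketched in plain Python. This is only an illustration: the helper names `tokenize`, `build_vocabulary`, and `bow_vector` are made up for this sketch and are not part of Sklearn, and the tokenization rule (lowercase, tokens of two or more word characters) is an assumption that only approximates the library's default behavior.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed rule: lowercase, keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs):
    # Map each distinct token to a column index, in alphabetical order
    words = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}


def bow_vector(doc, vocab, binary=False):
    # One slot per vocabulary word; tokens outside the vocabulary are ignored
    counts = Counter(tokenize(doc))
    vec = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:
            vec[vocab[word]] = 1 if binary else n
    return vec


docs = ["This is the first document", "This is the second document", "and the third one"]
vocab = build_vocabulary(docs)
vectors = [bow_vector(d, vocab, binary=True) for d in docs]
print(vocab)
print(vectors)
```

Passing `binary=False` returns raw counts instead of 0/1 flags, which corresponds to the second scoring method in the list.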


### 1. Binary BoW 

Let's calculate the first type of BoW. We will also go over some useful features of Sklearn's vectorizers here.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This is the first document", "This is the second document", "and the third one"]

# Initialize the count vectorizer with the parameter: binary=True
binary_vectorizer = CountVectorizer(binary=True)

# fit_transform() function fits the text data and gets the BoW vectors
x = binary_vectorizer.fit_transform(sentences)

As the vocabulary size grows, the BoW vectors also get very large. They usually consist of many zeros and very few non-zero values, so Sklearn stores them in a compressed (sparse) form. If we want to use them as Numpy arrays, we call the __toarray()__ function. Here are our BoW features. Each row corresponds to one sentence, and each column corresponds to one word in the vocabulary.

In [2]:
x.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0]], dtype=int64)
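
To see the compressed storage for ourselves, we can inspect the object before converting it. The sketch below assumes SciPy is installed (Sklearn depends on it) and rebuilds the same vectorizer so that it runs on its own; `density` is just an illustrative variable name.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This is the first document", "This is the second document", "and the third one"]
x = CountVectorizer(binary=True).fit_transform(sentences)

# The result is a SciPy sparse matrix, not a dense Numpy array
print(issparse(x))

# (number of sentences, vocabulary size)
print(x.shape)

# Only the non-zero entries are actually stored
print(x.nnz)

# Fraction of entries that are non-zero
density = x.nnz / (x.shape[0] * x.shape[1])
print(density)
```

On a toy example the savings are negligible, but with a vocabulary of tens of thousands of words almost every entry is zero and the dense array can become very large.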

Let's check out our vocabulary using the __vocabulary___ attribute. It returns a dictionary with each word as key and its column index as value. Note that the indices follow the alphabetical order of the words.

In [3]:
binary_vectorizer.vocabulary_

{'this': 8,
 'is': 3,
 'the': 6,
 'first': 2,
 'document': 1,
 'second': 5,
 'and': 0,
 'third': 7,
 'one': 4}
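
Since the dictionary is printed in insertion order, the alphabetical indexing is easier to see after sorting by index. The sketch below copies the dictionary and the first BoW row from the outputs above, so it needs no libraries; `columns` and `present` are illustrative names.

```python
# Vocabulary and first BoW row, copied from the outputs above
vocabulary = {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1,
              'second': 5, 'and': 0, 'third': 7, 'one': 4}
first_row = [0, 1, 1, 1, 0, 0, 1, 0, 1]

# Words in column order: sort the keys by their index value
columns = sorted(vocabulary, key=vocabulary.get)
print(columns)

# Words flagged as present in the first sentence
present = [word for word, flag in zip(columns, first_row) if flag]
print(present)
```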

How can we calculate BoW for a new text? We use the __transform()__ function this time. As you can see below, this doesn't change the vocabulary: words that were not seen during fitting are simply skipped.

In [4]:
new_sentence = ["This is the new sentence"]

new_vectors = binary_vectorizer.transform(new_sentence)

In [5]:
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])
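
To find out which words of the new sentence were skipped, one option is to run the vectorizer's own tokenizer and compare the tokens against the vocabulary. This is a small sketch that refits the vectorizer so it runs on its own; `unseen` and `kept` are illustrative names.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This is the first document", "This is the second document", "and the third one"]
vectorizer = CountVectorizer(binary=True)
vectorizer.fit(sentences)

# build_analyzer() returns the preprocessing + tokenizing function the vectorizer uses
analyzer = vectorizer.build_analyzer()
tokens = analyzer("This is the new sentence")

# Split the tokens into those in the fitted vocabulary and those that are not
kept = [t for t in tokens if t in vectorizer.vocabulary_]
unseen = [t for t in tokens if t not in vectorizer.vocabulary_]
print(kept)
print(unseen)

# transform() leaves the fitted vocabulary untouched
size_before = len(vectorizer.vocabulary_)
vectorizer.transform(["This is the new sentence"])
size_after = len(vectorizer.vocabulary_)
print(size_before, size_after)
```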

### 2. Word Counts

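Word counts can be calculated using the same __CountVectorizer()__ function __without__ the __binary__ parameter. Because no word appears more than once within any of our sentences, the counts below are identical to the binary vectors.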

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This is the first document", "This is the second document", "and the third one"]

# Initialize the count vectorizer
count_vectorizer = CountVectorizer()

x = count_vectorizer.fit_transform(sentences)

x.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0]], dtype=int64)

In [7]:
new_sentence = ["This is the new sentence"]
new_vectors = count_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])
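
Binary scoring and word counts only differ when a word appears more than once in a text. As a quick illustration, here is a made-up sentence (not part of the data above) in which two words repeat:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentence: "the" and "cat" each appear twice
repeated = ["the cat saw the other cat"]

count_vec = CountVectorizer()
binary_vec = CountVectorizer(binary=True)

counts = count_vec.fit_transform(repeated).toarray().tolist()
flags = binary_vec.fit_transform(repeated).toarray().tolist()

# Column order is alphabetical: cat, other, saw, the
print(sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get))
print(counts)  # number of occurrences of each word
print(flags)   # 1 if the word occurs at all
```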

### 3. Term Frequency (TF) 

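Term Frequency (TF) vectors use the __TfidfVectorizer()__ function with the parameter: __use_idf=False__. With the default settings, each row is also scaled to unit Euclidean (L2) length, which is why a sentence with five distinct words gets the value 1/√5 ≈ 0.447 for each of them. We will set the __use_idf__ parameter to True in the next example.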

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = TfidfVectorizer(use_idf=False)

x = tf_vectorizer.fit_transform(sentences)

x.toarray()

array([[0.       , 0.4472136, 0.4472136, 0.4472136, 0.       , 0.       ,
        0.4472136, 0.       , 0.4472136],
       [0.       , 0.4472136, 0.       , 0.4472136, 0.       , 0.4472136,
        0.4472136, 0.       , 0.4472136],
       [0.5      , 0.       , 0.       , 0.       , 0.5      , 0.       ,
        0.5      , 0.5      , 0.       ]])

In [9]:
new_sentence = ["This is the new sentence"]
new_vectors = tf_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0.        , 0.        , 0.        , 0.57735027, 0.        ,
        0.        , 0.57735027, 0.        , 0.57735027]])
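
As a sanity check on where these numbers come from, we can rebuild the TF vectors by hand: take the raw word counts and divide each row by its Euclidean length. This sketch assumes the vectorizer's default `norm='l2'` setting and uses Numpy for the arithmetic; `manual_tf` is an illustrative name.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["This is the first document", "This is the second document", "and the third one"]

# Raw word counts, one row per sentence
counts = CountVectorizer().fit_transform(sentences).toarray()

# Divide each row by its L2 norm
manual_tf = counts / np.linalg.norm(counts, axis=1, keepdims=True)

# Compare with the library result
tf = TfidfVectorizer(use_idf=False).fit_transform(sentences).toarray()
print(np.allclose(manual_tf, tf))

# Every row has unit length after normalization
print(np.linalg.norm(tf, axis=1))
```

If you prefer term frequencies that sum to 1 within each sentence, passing `norm='l1'` to the vectorizer should give that scaling instead.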

### 4. Term Frequency Inverse Document Frequency (TF-IDF)

Term Frequency Inverse Document Frequency (TF-IDF) vectors use the __TfidfVectorizer()__ function with the parameter: __use_idf=True__. We can also omit this parameter, since it is True by default.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(use_idf=True)

x = tfidf_vectorizer.fit_transform(sentences)

x.toarray()

array([[0.        , 0.43306685, 0.56943086, 0.43306685, 0.        ,
        0.        , 0.33631504, 0.        , 0.43306685],
       [0.        , 0.43306685, 0.        , 0.43306685, 0.        ,
        0.56943086, 0.33631504, 0.        , 0.43306685],
       [0.54645401, 0.        , 0.        , 0.        , 0.54645401,
        0.        , 0.32274454, 0.54645401, 0.        ]])

In [11]:
new_sentence = ["This is the new sentence"]
new_vectors = tfidf_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0.        , 0.        , 0.        , 0.61980538, 0.        ,
        0.        , 0.48133417, 0.        , 0.61980538]])
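
The TF-IDF values can also be reproduced by hand. The sketch below assumes the vectorizer's default settings (smoothed IDF and L2 normalization), under which the IDF of a word is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of sentences and `df` is the number of sentences containing the word. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ["This is the first document", "This is the second document", "and the third one"]

# Raw word counts, one row per sentence
counts = CountVectorizer().fit_transform(sentences).toarray()

# Document frequency: in how many sentences does each word appear?
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (assumed default formula)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then scale each row to unit L2 length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(sentences).toarray()
print(np.allclose(idf, tfidf_vectorizer.idf_))
print(np.allclose(manual_tfidf, tfidf))
```

Words that appear in every sentence, such as "the", get the smallest IDF (exactly 1 here), so they receive less weight than rarer words like "first" or "second".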