<a href="https://colab.research.google.com/github/LxYuan0420/aws-machine-learning-university-accelerated-nlp/blob/master/colab_notebooks/MLA_NLP_Lecture1_BOW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Bag of Words Method**

In this notebook, we go over the Bag of Words (BoW) method for converting text data into numerical values that machine learning algorithms can later use for predictions.

To convert text data to vectors of numbers, a vocabulary of known words (tokens) is extracted from the text, the occurrence of each word is scored, and the resulting numerical values are saved in vectors as long as the vocabulary. There are a few versions of BoW, corresponding to different word-scoring methods. We use the Sklearn library to calculate the BoW numerical values using:

1. Binary
2. Word Counts
3. Term Frequencies
4. Term Frequency-Inverse Document Frequencies
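
The three steps described above (extract a vocabulary, score each word, store the scores in a vocabulary-long vector) can be sketched in plain Python. This is only an illustration, not how Sklearn is implemented: the helper names `tokenize`, `build_vocabulary`, and `count_vector` and the two toy documents are made up for this example, and the tokenizer assumes lowercasing plus tokens of two or more word characters, which mirrors `CountVectorizer`'s defaults.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (assumed here to mirror CountVectorizer's default behavior).
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(docs):
    # Map each distinct token to a column index, in alphabetical order.
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def count_vector(doc, vocabulary):
    # One slot per vocabulary word; tokens outside the vocabulary are skipped.
    vec = [0] * len(vocabulary)
    for tok, n in Counter(tokenize(doc)).items():
        if tok in vocabulary:
            vec[vocabulary[tok]] = n
    return vec


# Hypothetical toy corpus, used only for this sketch.
docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)
vectors = [count_vector(d, vocab) for d in docs]

print(vocab)    # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each of the four scoring methods below changes only what number goes into each slot; the vocabulary and the vector length stay the same.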


**1. Binary**

Let's calculate the first type of BoW, recording whether the word is in the sentence or not. We will also go over some useful features of Sklearn's vectorizers here.

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

bow_vectorizer = CountVectorizer(binary=True)

x = bow_vectorizer.fit_transform(sentences)

In [2]:
x.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])

In [3]:
# Check out the vocabulary: each word maps to its column index
bow_vectorizer.vocabulary_

{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'one': 4,
 'second': 5,
 'the': 6,
 'third': 7,
 'this': 8}

In [4]:
print(f"There are {len(bow_vectorizer.vocabulary_)} words in the vocabulary, so each sentence is transformed into a {len(bow_vectorizer.vocabulary_)}-dimensional vector")

There are 9 words in the vocabulary, so each sentence is transformed into a 9-dimensional vector


In [6]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(bow_vectorizer.get_feature_names_out().tolist())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


In [8]:
new_sentences = ["This is a new sentence written by Donald Trump.",
                 "Another short sentences"]


new_vector = bow_vectorizer.transform(new_sentences)

new_vector.toarray()

array([[0, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])
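
The output above shows that words not seen during fitting are silently dropped when new sentences are transformed. A minimal sketch of that behavior in plain Python follows; the helpers `binary_vector` and `unknown_tokens` are hypothetical names for this illustration, and the regular expression is assumed to match `CountVectorizer`'s default tokenization (lowercase, tokens of two or more word characters, which is why "a" never appears).

```python
import re

# Vocabulary learned from the three training sentences above.
vocabulary = {'and': 0, 'document': 1, 'first': 2, 'is': 3, 'one': 4,
              'second': 5, 'the': 6, 'third': 7, 'this': 8}


def tokens_of(text):
    # Assumed to mirror CountVectorizer's defaults.
    return re.findall(r"\b\w\w+\b", text.lower())


def binary_vector(text, vocabulary):
    # 1 if the vocabulary word occurs in the text, else 0.
    vec = [0] * len(vocabulary)
    for tok in set(tokens_of(text)):
        if tok in vocabulary:
            vec[vocabulary[tok]] = 1
    return vec


def unknown_tokens(text, vocabulary):
    # Tokens that have no column in the fitted vocabulary.
    return sorted(set(tokens_of(text)) - vocabulary.keys())


first = "This is a new sentence written by Donald Trump."
second = "Another short sentences"

print(binary_vector(first, vocabulary))   # [0, 0, 0, 1, 0, 0, 0, 0, 1]
print(unknown_tokens(first, vocabulary))  # ['by', 'donald', 'new', 'sentence', 'trump', 'written']
print(binary_vector(second, vocabulary))  # [0, 0, 0, 0, 0, 0, 0, 0, 0]
```

A sentence made up entirely of unseen words, like the second one, becomes an all-zero vector, so a training corpus needs to cover the words expected at prediction time.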

**2. Word Counts**

Word counts can be calculated with the same CountVectorizer class by leaving out the binary parameter, which defaults to False.

In [9]:
sentences = ["This document is the first document", "This document is the second document", "and this is the third one"]

# Initialize the count vectorizer
count_vectorizer = CountVectorizer()

xc = count_vectorizer.fit_transform(sentences)

xc.toarray()

array([[0, 2, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1]])

In [10]:
count_vectorizer.vocabulary_

{'and': 0,
 'document': 1,
 'first': 2,
 'is': 3,
 'one': 4,
 'second': 5,
 'the': 6,
 'third': 7,
 'this': 8}

In [11]:
new_sentence = ["This is the new sentence"]
new_vectors = count_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0, 0, 0, 1, 0, 0, 1, 0, 1]])
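
To relate the word counts to the binary scores from section 1, the short sketch below clips each count at 1. The `counts` matrix is copied from the output above; the variable names are only for this illustration.

```python
# Word-count matrix copied from the CountVectorizer output above.
counts = [[0, 2, 1, 1, 0, 0, 1, 0, 1],
          [0, 2, 0, 1, 0, 1, 1, 0, 1],
          [1, 0, 0, 1, 1, 0, 1, 1, 1]]

# Binary BoW records presence only, so every nonzero count becomes 1.
binary = [[int(c > 0) for c in row] for row in counts]

# Row sums of the count matrix give the number of tokens per sentence.
lengths = [sum(row) for row in counts]

print(binary)   # [[0, 1, 1, 1, 0, 0, 1, 0, 1], [0, 1, 0, 1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 0, 1, 1, 1]]
print(lengths)  # [6, 6, 6]
```

The only entries that differ between the two matrices are those for "document", which appears twice in the first two sentences.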

**3. Term Frequency (TF)**

Term Frequency (TF) vectors, which show how important words are to the documents, are computed using

$$tf(term, doc) = \frac{\text{number of times the term occurs in the doc}}{\text{total number of terms in the doc}}$$

From sklearn we use the TfidfVectorizer class with the parameter use_idf=False. It additionally normalizes each term frequency vector by its Euclidean ($l2$) norm, so every output row has unit length. Because of this step, the values returned by sklearn differ from the plain ratio in the formula by a per-document scale factor.

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vectorizer = TfidfVectorizer(use_idf=False)

x = tf_vectorizer.fit_transform(sentences)

x.toarray()

array([[0.        , 0.70710678, 0.35355339, 0.35355339, 0.        ,
        0.        , 0.35355339, 0.        , 0.35355339],
       [0.        , 0.70710678, 0.        , 0.35355339, 0.        ,
        0.35355339, 0.35355339, 0.        , 0.35355339],
       [0.40824829, 0.        , 0.        , 0.40824829, 0.40824829,
        0.        , 0.40824829, 0.40824829, 0.40824829]])

In [13]:
new_sentence = ["This is the new sentence"]
new_vectors = tf_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0.        , 0.        , 0.        , 0.57735027, 0.        ,
        0.        , 0.57735027, 0.        , 0.57735027]])
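
To see where values such as 0.70710678 come from, the sketch below compares the plain ratio from the formula with the $l2$-normalized version, using the word counts of the first sentence. The helper names `relative_frequency` and `l2_normalize` are made up for this illustration and are not part of Sklearn.

```python
import math


def relative_frequency(counts):
    # Plain TF from the formula: count divided by total number of terms.
    total = sum(counts)
    return [c / total for c in counts]


def l2_normalize(values):
    # Divide by the Euclidean norm so the vector has unit length.
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm if norm else 0.0 for v in values]


# Word counts of "This document is the first document" (from section 2).
counts = [0, 2, 1, 1, 0, 0, 1, 0, 1]

tf = relative_frequency(counts)
tf_l2 = l2_normalize(counts)

print([round(v, 4) for v in tf])     # [0.0, 0.3333, 0.1667, 0.1667, 0.0, 0.0, 0.1667, 0.0, 0.1667]
print([round(v, 4) for v in tf_l2])  # [0.0, 0.7071, 0.3536, 0.3536, 0.0, 0.0, 0.3536, 0.0, 0.3536]
```

Normalizing the plain ratios instead of the raw counts gives the same unit-length vector, since dividing by the total number of terms only rescales the row.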

**4. Term Frequency Inverse Document Frequency (TF-IDF)**

Term Frequency Inverse Document Frequency (TF-IDF) vectors are computed using the TfidfVectorizer class with the parameter use_idf=True. We can also omit this parameter, as it is True by default.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(use_idf=True)

sentences = ["This document is the first document",
             "This document is the second document",
             "and this is the third one"]

xf = tfidf_vectorizer.fit_transform(sentences)

xf.toarray()

array([[0.        , 0.7284449 , 0.47890875, 0.28285122, 0.        ,
        0.        , 0.28285122, 0.        , 0.28285122],
       [0.        , 0.7284449 , 0.        , 0.28285122, 0.        ,
        0.47890875, 0.28285122, 0.        , 0.28285122],
       [0.49711994, 0.        , 0.        , 0.29360705, 0.49711994,
        0.        , 0.29360705, 0.49711994, 0.29360705]])

In [15]:
new_sentence = ["This is the new sentence"]
new_vectors = tfidf_vectorizer.transform(new_sentence)
new_vectors.toarray()

array([[0.        , 0.        , 0.        , 0.57735027, 0.        ,
        0.        , 0.57735027, 0.        , 0.57735027]])

In [16]:
tfidf_vectorizer.idf_

array([1.69314718, 1.28768207, 1.69314718, 1.        , 1.69314718,
       1.69314718, 1.        , 1.69314718, 1.        ])
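
The `idf_` values above can be reproduced by hand. With its default settings, `TfidfVectorizer` uses the smoothed form $idf(t) = \ln\frac{1 + n}{1 + df(t)} + 1$, where $n$ is the number of documents and $df(t)$ is the number of documents containing term $t$. The sketch below applies it to the three sentences; the document frequencies are read off the binary matrix in section 1, and the helper name `smoothed_idf` is hypothetical.

```python
import math


def smoothed_idf(df, n_docs):
    # Smoothed inverse document frequency, as used by TfidfVectorizer's defaults.
    return math.log((1 + n_docs) / (1 + df)) + 1


# Number of sentences containing each vocabulary word, in column order:
# and, document, first, is, one, second, the, third, this
doc_freq = [1, 2, 1, 3, 1, 1, 3, 1, 3]
idf = [smoothed_idf(df, n_docs=3) for df in doc_freq]

# TF-IDF for the first sentence: counts times idf, then l2-normalized.
counts = [0, 2, 1, 1, 0, 0, 1, 0, 1]
weighted = [c * w for c, w in zip(counts, idf)]
norm = math.sqrt(sum(v * v for v in weighted))
tfidf = [v / norm for v in weighted]

print([round(v, 8) for v in idf])
print([round(v, 8) for v in tfidf])
```

Words that appear in every sentence ("is", "the", "this") get the minimum weight of 1, which is why a new sentence containing only those known words produces the same vector under TF and TF-IDF.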