# <span style = "color:green; font-size:40px"> Bag Of Words </span>

***

To understand the bag of words approach, let's start with an example.

Suppose we have a corpus with three sentences:
   * "I like to play football"
   * "Did you go outside to play tennis"
   * "John and I play tennis"

If we want to perform text classification, or any other task, on this data using statistical techniques, we cannot work with the raw text, because statistical techniques operate only on numbers. We therefore need to convert these sentences into numbers.

### Step 1: Tokenize the Sentences

The first step is to convert the sentences in our corpus into tokens, or individual words. Look at the table below:

| Sentence 1 | Sentence 2 | Sentence 3 |
| --- | --- | --- |
| I | Did | John |
| like | you | and |
| to | go | I |
| play | outside | play |
| football | to | tennis |
|  | play |  |
|  | tennis |  | 
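
As a minimal sketch of this step, the three example sentences can be tokenized with plain whitespace splitting. This works here only because the sentences contain no punctuation; the variable names corpus and tokenized are chosen for illustration.

```python
# Toy corpus from the example above.
corpus = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Whitespace splitting is enough because these sentences have no punctuation.
tokenized = [sentence.split() for sentence in corpus]

for tokens in tokenized:
    print(tokens)
```

Each printed list corresponds to one column of the table above.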

### Step 2: Create a Dictionary of Word Frequency

The next step is to create a dictionary that contains all the words in our corpus as keys and the number of times each word occurs as values. In other words, we need to create a histogram of the words in our corpus. Look at the following table:

| Word | Frequency |
| --- | --- | 
| I | 2 | 
| like | 1 | 
| to | 2 | 
| play | 3 |
| football | 1 |
| Did | 1 | 
| you | 1 | 
| go | 1 |
| outside | 1 |
| tennis | 2 |
| John | 1 |
| and | 1 |

In the table above, you can see each word in our corpus along with its frequency of occurrence. For instance, since the word "play" occurs three times in the corpus (once in each sentence), its frequency is 3.
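
One way to build such a histogram for the toy corpus is with collections.Counter from the Python standard library. This is only a sketch: the name word_freq is illustrative, and the counts are case-sensitive, so "Did" and "did" would be counted separately.

```python
from collections import Counter

corpus = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# Count every whitespace-separated token across all sentences.
word_freq = Counter(token for sentence in corpus for token in sentence.split())

print(word_freq["play"])    # 3
print(word_freq["tennis"])  # 2
print(len(word_freq))       # 12 distinct words
```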

Our corpus has only three sentences, so it is easy to create a dictionary that contains all the words. In real-world scenarios, the dictionary can contain millions of words. Words with a very low frequency are usually not very useful, so they are removed. One way to do this is to sort the word frequency dictionary in decreasing order of frequency and then keep only the words whose frequency is above a certain threshold.

Let's sort our word frequency dictionary:

| Word | Frequency |
| --- | --- | 
| play | 3 |
| tennis | 2 |
| to | 2 |
| I | 2 |
| football | 1 |
| Did | 1 |
| you | 1 |
| go | 1 |
| outside | 1 |
| like | 1 |
| John | 1 | 
| and | 1 |
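
The sort-and-filter idea can be sketched as follows. The threshold min_count is a hypothetical value picked for illustration. Words with equal counts keep their original dictionary order, so ties may appear in a different order than in the table above.

```python
# Word frequencies from the toy corpus (same values as the table above).
word_freq = {
    "I": 2, "like": 1, "to": 2, "play": 3, "football": 1, "Did": 1,
    "you": 1, "go": 1, "outside": 1, "tennis": 2, "John": 1, "and": 1,
}

# Sort (word, count) pairs by count, highest first. The sort is stable,
# so words with the same count stay in dictionary order.
sorted_freq = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)

# Keep only words that occur at least min_count times (illustrative threshold).
min_count = 2
frequent_words = [word for word, count in sorted_freq if count >= min_count]

print(frequent_words)  # ['play', 'I', 'to', 'tennis']
```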

### Step 3: Create the Bag of Words Model

To create the bag of words model, we need to create a matrix in which the columns correspond to the most frequent words in our dictionary and the rows correspond to the documents or sentences.

Suppose we keep the 8 most frequently occurring words from our dictionary. The resulting matrix will then look like this:

|  | play | tennis | to | I | football | Did | you | go |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sentence 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| Sentence 2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 |
| Sentence 3 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |

It is important to understand how the matrix is created. The first row corresponds to the first sentence. In the first sentence, the word "play" occurs once, so we put a 1 in the first column. The word in the second column is "tennis"; it doesn't occur in the first sentence, so we put a 0 in the second column for Sentence 1. Similarly, in the second sentence, both "play" and "tennis" occur once, so we put a 1 in each of the first two columns. In the fifth column, however, we put a 0, since the word "football" doesn't occur in the second sentence. In this way, every cell in the matrix is filled with either 0 or 1, depending on whether the word occurs in the sentence. The final matrix corresponds to the bag of words model.

In each row, you can see the numeric representation of the corresponding sentence. For instance, the first row shows the numeric representation of Sentence 1. This numeric representation can now be used as input to the statistical models.
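
The matrix above can be reproduced with a few lines of plain Python. This is a sketch under the assumptions of the example: the vocabulary is typed in by hand in the same column order as the table, and matching is case-sensitive.

```python
corpus = [
    "I like to play football",
    "Did you go outside to play tennis",
    "John and I play tennis",
]

# The 8 words chosen as columns, in the same order as the table.
vocabulary = ["play", "tennis", "to", "I", "football", "Did", "you", "go"]

matrix = []
for sentence in corpus:
    tokens = set(sentence.split())  # set gives fast membership tests
    # 1 if the vocabulary word occurs in the sentence, otherwise 0
    matrix.append([1 if word in tokens else 0 for word in vocabulary])

for row in matrix:
    print(row)
```

The three printed rows match the rows for Sentence 1, Sentence 2, and Sentence 3 in the table.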

### Bag of words model in Python

#### Let's import the required libraries

In [2]:
import nltk
import numpy as np
import re

The re (regular expression) module will be used for some preprocessing of the text.

The first thing we need to create our Bag of words model is a dataset.

In [3]:
article_text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."""

In [4]:
article_text

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.'

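The next step is to split the corpus into individual sentences. To do so, we will use the sent_tokenize function from the NLTK library. Note that sent_tokenize relies on NLTK's Punkt tokenizer data, which has to be downloaded once with nltk.download before this cell will run.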

In [5]:
corpus = nltk.sent_tokenize(article_text)

In [6]:
corpus

['Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.',
 'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.',
 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.']

Our text contains punctuation, which we don't want to become part of our word frequency dictionary. In the following script, we first convert the text to lower case and then remove the punctuation. The re module provides support for regular expressions, and re.sub() replaces every substring that matches a pattern. The first re.sub() call replaces each non-word character with a space. Because this can leave several consecutive spaces, the second call collapses every run of whitespace into a single space.

Look at the following script:

In [7]:
for i in range(len(corpus)):
    corpus[i] = corpus[i].lower()
    corpus[i] = re.sub(r'\W', ' ', corpus[i])
    corpus[i] = re.sub(r'\s+',' ', corpus[i])

In [8]:
corpus

['natural language processing nlp is a subfield of linguistics computer science and artificial intelligence concerned with the interactions between computers and human language in particular how to program computers to process and analyze large amounts of natural language data ',
 'the goal is a computer capable of understanding the contents of documents including the contextual nuances of the language within them ',
 'the technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves ']

You can see that the text no longer contains any special characters or runs of multiple spaces.
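
The cleaned sentences in the output above still end with a single trailing space, left behind where the final period was replaced. That is harmless for tokenization, but if it is unwanted, the same steps can be wrapped in a small helper that also strips the ends. The function name clean_sentence is hypothetical and not part of the notebook.

```python
import re

def clean_sentence(text):
    """Lower-case the text, replace non-word characters, and tidy whitespace."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)    # punctuation and symbols become spaces
    text = re.sub(r"\s+", " ", text)   # collapse runs of whitespace
    return text.strip()                # drop leading and trailing spaces

cleaned = clean_sentence('The goal is a computer capable of "understanding" the contents.')
print(repr(cleaned))
# 'the goal is a computer capable of understanding the contents'
```

Note that \W leaves digits and underscores in place, since both count as word characters in Python regular expressions.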

Now we have our own corpus. The next step is to tokenize the sentences in the corpus and create a dictionary that contains the words and their corresponding frequencies. Look at the following script:

In [21]:
wordfreq = {}
for sentence in corpus:
    tokens = nltk.word_tokenize(sentence)
    for token in tokens:
        # Membership tests work directly on the dict; .keys() is not needed
        if token not in wordfreq:
            wordfreq[token] = 1
        else:
            wordfreq[token] += 1

In the script above, we created a dictionary called wordfreq. We then iterate through each sentence in the corpus and tokenize it into words. Next, we iterate through each word in the sentence. If the word doesn't exist in the wordfreq dictionary, we add it as a key and set its value to 1. Otherwise, if the word already exists in the dictionary, we simply increment its count by 1.

In [22]:
wordfreq

{'natural': 2,
 'language': 4,
 'processing': 1,
 'nlp': 1,
 'is': 2,
 'a': 2,
 'subfield': 1,
 'of': 5,
 'linguistics': 1,
 'computer': 2,
 'science': 1,
 'and': 5,
 'artificial': 1,
 'intelligence': 1,
 'concerned': 1,
 'with': 1,
 'the': 8,
 'interactions': 1,
 'between': 1,
 'computers': 2,
 'human': 1,
 'in': 2,
 'particular': 1,
 'how': 1,
 'to': 2,
 'program': 1,
 'process': 1,
 'analyze': 1,
 'large': 1,
 'amounts': 1,
 'data': 1,
 'goal': 1,
 'capable': 1,
 'understanding': 1,
 'contents': 1,
 'documents': 3,
 'including': 1,
 'contextual': 1,
 'nuances': 1,
 'within': 1,
 'them': 1,
 'technology': 1,
 'can': 1,
 'then': 1,
 'accurately': 1,
 'extract': 1,
 'information': 1,
 'insights': 1,
 'contained': 1,
 'as': 2,
 'well': 1,
 'categorize': 1,
 'organize': 1,
 'themselves': 1}

In [23]:
len(wordfreq)

54

Depending on the task at hand, not all words are useful. In huge corpora, you can have millions of words, so we can keep only the most frequently occurring ones. Our corpus has 54 unique words in total. Let us filter down to the 20 most frequently occurring words. To do so, we can make use of Python's heapq module.

Look at the following script:

In [24]:
import heapq

most_freq = heapq.nlargest(20, wordfreq, key = wordfreq.get)

In [25]:
most_freq

['the',
 'of',
 'and',
 'language',
 'documents',
 'natural',
 'is',
 'a',
 'computer',
 'computers',
 'in',
 'to',
 'as',
 'processing',
 'nlp',
 'subfield',
 'linguistics',
 'science',
 'artificial',
 'intelligence']

Now our most_freq list contains the 20 most frequently occurring words, ordered from most to least frequent. Note that it holds only the words themselves; their counts remain in the wordfreq dictionary.
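
Calling heapq.nlargest on a dictionary iterates over its keys, so the result contains words only. If the counts are wanted alongside the words, one option is to pass the dictionary's items instead. The sketch below uses a small hand-copied subset of wordfreq so that it runs on its own; the names top_pairs and top_words are illustrative.

```python
import heapq

# A small subset of the wordfreq dictionary, copied from the output above.
wordfreq = {"the": 8, "of": 5, "and": 5, "language": 4, "documents": 3, "goal": 1}

# Passing items() keeps each word together with its count.
top_pairs = heapq.nlargest(3, wordfreq.items(), key=lambda item: item[1])
top_words = [word for word, _count in top_pairs]

print(top_pairs)
print(top_words)
```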

The final step is to convert the sentences in our corpus into their corresponding vector representation. The idea is straightforward: for each word in the most_freq list, if the word exists in the sentence, a 1 is added for the word; otherwise, a 0 is added.

In [26]:
sentence_vectors = []
for sentence in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)
    sent_vec = []
    for token in most_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)

In [27]:
sentence_vectors

[[1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]]

In the script above, we created an empty list, sentence_vectors, which will store the vectors for all the sentences in the corpus. Next, we iterate through each sentence in the corpus, tokenize it, and create an empty list, sent_vec, for that sentence. We then iterate through each word in the most_freq list and check whether the word exists in the tokens for the sentence. If the word is part of the sentence, 1 is appended to the individual sentence vector sent_vec; otherwise, 0 is appended. Finally, the sentence vector is added to the list sentence_vectors, which contains the vectors for all sentences. This sentence_vectors list is our bag of words model.

However, the bag of words model that we saw in the theory section was in the form of a matrix, whereas our model is in the form of a list of lists. We can convert our model into matrix form using this script:

In [30]:
sentence_vectors = np.asarray(sentence_vectors)

In [31]:
sentence_vectors

array([[1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

Here we converted our list of lists into a two-dimensional NumPy array using the np.asarray function.

### Creating the Bag of Words model with CountVectorizer from sklearn.feature_extraction.text

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [9]:
cv = CountVectorizer()

In [16]:
x = cv.fit_transform(corpus).toarray()

In [17]:
print(x)

[[0 1 1 3 1 0 1 0 0 0 1 2 1 0 0 0 1 0 0 0 1 1 1 0 0 0 1 1 1 3 1 1 2 1 0 2
  0 1 1 1 1 1 1 0 1 0 0 0 2 0 0 1 0]
 [0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 3
  0 0 0 0 0 0 0 0 4 1 0 0 0 1 0 0 1]
 [1 0 0 2 0 2 0 1 0 1 0 0 0 1 0 0 0 2 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
  1 0 0 0 0 0 0 1 3 0 1 1 0 0 1 0 0]]
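
Unlike the manual model, this array holds word counts rather than 0/1 flags, and its columns follow the alphabetically sorted vocabulary. With default settings, CountVectorizer also ignores single-character tokens, which would explain why the array has 53 columns while wordfreq has 54 entries (the word "a" is dropped). The sketch below is a hand-rolled, standard-library approximation of that default behaviour on two made-up sentences; it is not scikit-learn's implementation.

```python
import re
from collections import Counter

# Two hypothetical documents, chosen so that some words repeat.
docs = [
    "a computer reads the documents",
    "the documents and the documents themselves",
]

# Tokens of two or more word characters, mirroring CountVectorizer's default pattern.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Columns are the sorted set of all tokens seen in the corpus.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Each row stores how often every vocabulary word occurs in that document.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)    # ['and', 'computer', 'documents', 'reads', 'the', 'themselves']
print(count_matrix)  # [[0, 1, 1, 1, 1, 0], [1, 0, 2, 0, 2, 1]]
```

Replacing each count with 1 whenever it is greater than 0 would give the binary representation built by hand earlier.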


<b>The bag of words model is one of the three most commonly used word embedding approaches, with TF-IDF and Word2Vec being the other two.</b>

***