# Implementing Bag of Words

<font face='georgia'>
    <h3><strong>Fit method:</strong></h3>

<ol>
    <li> With this function, we will find all unique words in the data and we will assign a dimension-number to each unique word. </li>
    <br>
    <li> We will create a Python dictionary to save all the unique words, such that each key of the dictionary represents a unique word and the corresponding value represents its dimension-number. </li><br>
    <li> For example, if you have a review, <strong>'very bad pizza'</strong>, then you can represent each unique word with a dimension-number as, <br>
        <strong>dict</strong> = { 'very' : 1, 'bad' : 2, 'pizza' : 3}     </li>
    </ol>
        

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from tqdm import tqdm
import os

In [2]:
from tqdm import tqdm # tqdm is a library that helps us visualize the progress of a for loop. Refer to the link below to know more about tqdm
#https://tqdm.github.io/

# it accepts only a list of sentences
def fit(dataset):    
    unique_words = set() # at first we will initialize an empty set
    # check whether it is a list or not
    if isinstance(dataset, (list,)):
        for row in dataset: # for each review in the dataset
            for word in row.split(" "): # for each word in the review. #split method converts a string into list of words
                if len(word) < 2:
                    continue
                unique_words.add(word)
        unique_words = sorted(list(unique_words))
        vocab = {j:i for i,j in enumerate(unique_words)}
        #print(unique_words)    
        return vocab
    else:
        print("you need to pass a list of sentences")
        
        

In [3]:
datset = ["abc def aaa prq", "lmn pqr aaaaaaa aaa abbb baaa"]
vocab = fit(datset)
print(vocab)

{'aaa': 0, 'aaaaaaa': 1, 'abbb': 2, 'abc': 3, 'baaa': 4, 'def': 5, 'lmn': 6, 'pqr': 7, 'prq': 8}


<font face='georgia'>
    <h4><strong>What is a Sparse Matrix?</strong></h4>

<ol>
    <li>Before going into the details of the Transform method, we will understand what a sparse matrix is.</li>
    <br>
    <li> A sparse matrix stores only the non-zero elements, so it occupies less RAM compared to a dense matrix in which most entries are zero. You can refer to this <a href="http://btechsmartclass.com/data_structures/sparse-matrix.html"><u>link</u>.</a> </li><br>
    <li> For example, assume you have a matrix,
        <pre>
[[1, 0, 0, 0, 0], 
[0, 0, 0, 1, 0], 
[0, 0, 4, 0, 0]] 
</pre>   </li>
    </ol>
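
As a rough illustration of the idea above, the non-zero entries of the example matrix can be collected as (row, column, value) triples. This is a minimal pure-Python sketch; the helper name `to_triples` is hypothetical and not part of any library.

```python
def to_triples(dense):
    """Collect (row, column, value) for every non-zero entry of a nested list."""
    triples = []
    for r, row in enumerate(dense):
        for c, value in enumerate(row):
            if value != 0:
                triples.append((r, c, value))
    return triples

dense = [[1, 0, 0, 0, 0],
         [0, 0, 0, 1, 0],
         [0, 0, 4, 0, 0]]

triples = to_triples(dense)
print(triples)  # [(0, 0, 1), (1, 3, 1), (2, 2, 4)]
```

Only three triples are kept for a matrix with fifteen cells, which is where the memory saving comes from.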
        

In [4]:
from sys import getsizeof  # will tell amount of memory occupied.
import numpy as np
# we store every element here
a = np.array([[1, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 4, 0, 0]])
print("Bytes",getsizeof(a))

# here we store only the non-zero elements as (row, col, value)
a = [ (0, 0, 1), (1, 3, 1), (2,2,4)]  # (row, column, value). pass a list of tuples
# for this example the list reports roughly 50% of the array's size; note that getsizeof counts only
# the list object itself, not the tuples it refers to, so this is a rough comparison
print(getsizeof(a))

Bytes 240
120
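
`getsizeof` reports the size of the Python object it is given, which for a list excludes the tuples stored in it. A possibly fairer comparison, sketched below under the assumption that NumPy and SciPy are installed, sums the bytes of the three arrays that back a CSR matrix and compares them with `nbytes` of the dense array. The 1000 x 1000 shape is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero matrix: only three entries are non-zero.
dense = np.zeros((1000, 1000), dtype=np.int64)
dense[0, 0] = 1
dense[1, 3] = 1
dense[2, 2] = 4

sparse = csr_matrix(dense)

# Bytes held by the dense buffer versus the three arrays behind the CSR matrix.
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The exact sparse figure depends on the index dtype SciPy picks, but the gap widens as the share of zero entries grows.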


<font face='georgia'>
    <h4><strong>How to write a Sparse Matrix?</strong></h4>

<ol>
    <li> You can use the csr_matrix class of scipy.sparse to create a sparse matrix.</li>
    <li> You need to pass the indices of the non-zero elements into csr_matrix() to create a sparse matrix. </li>
    <li> You also need to pass the element value for each pair of indices. </li>
    <li> You can use lists to save the indices of non-zero elements and their corresponding element values. </li>
    <li> For example, 
        <ul>
            <li>Assume you have a matrix,
        <pre>
    [[1, 0, 0], 
    [0, 0, 1], 
    [4, 0, 6]] 
    </pre></li>
        <li> Then you can save the indices using a list as,<br><strong>list_of_indices</strong> =  [(0,0), (1,2), (2,0), (2,2)]</li>
            <li> And you can save the corresponding element values as, <br><strong>element_values</strong> = [1, 1, 4, 6]  </li>
        </ul></li>
    <li> Further you can refer to the documentation  <a href="https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.csr_matrix.html"><u>here</u>.</a> </li>
    </ol>
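
The steps above can be sketched as follows, using the same `list_of_indices` and `element_values` as in the example. The only assumption is that SciPy is installed; `csr_matrix` takes the row indices and the column indices as two separate sequences, so the pairs are split first.

```python
from scipy.sparse import csr_matrix

list_of_indices = [(0, 0), (1, 2), (2, 0), (2, 2)]  # (row, column) of each non-zero element
element_values = [1, 1, 4, 6]                        # value stored at each pair of indices

# Split the (row, column) pairs into one list of rows and one list of columns.
rows = [r for r, _ in list_of_indices]
cols = [c for _, c in list_of_indices]

sparse = csr_matrix((element_values, (rows, cols)), shape=(3, 3))
dense = sparse.toarray()  # back to a dense array, only to check the result
print(dense)
```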

In [5]:
#example from web
import numpy as np
from scipy.sparse import csr_matrix
csr_matrix((3, 4), dtype=np.int8).toarray()     # 3 rows and 4 columns.

# the output will be a 3*4 zero matrix

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int8)

In [6]:
import numpy as np
from scipy.sparse import csr_matrix
X = csr_matrix((3, 4), dtype=np.int32)
print(type(X))

<class 'scipy.sparse.csr.csr_matrix'>


In [7]:
X.toarray()

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]], dtype=int32)

In [8]:
row = np.array([0, 0, 1, 2, 2, 2])  #row position
col = np.array([0, 2, 2, 0, 1, 2])  #column position
data = np.array([1, 2, 3, 4, 5, 6]) #corresponding element in the matrix.

# at (0,0) the element stored is 1.
Matrix = csr_matrix((data, (row, col)), shape=(3, 3)).toarray()  # it will create a 3*3 matrix.
print(Matrix)

[[1 0 2]
 [0 0 3]
 [4 5 6]]
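
Because `.toarray()` was called, `Matrix` above is an ordinary dense array. If the call is left out, the object stays in CSR form, and its three backing arrays can be inspected directly. A small sketch with the same `row`, `col` and `data`:

```python
import numpy as np
from scipy.sparse import csr_matrix

row = np.array([0, 0, 1, 2, 2, 2])
col = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])

sparse = csr_matrix((data, (row, col)), shape=(3, 3))  # no .toarray(), so it stays sparse

print(sparse.data)     # non-zero values, stored row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i+1]]
print(sparse.nnz)      # number of stored values
```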


In [9]:
print(getsizeof(Matrix))

192


<font face='georgia'>
    <h3><strong>Transform method:</strong></h3>

<ol>
    <li>With this function, we will build a feature matrix using a sparse matrix.</li>
    </ol>
        

In [10]:
#example from web
from collections import Counter   
#https://docs.python.org/3/library/collections.html#collections.Counter
'''A Counter is a dict subclass for counting hashable objects. It is a collection where 
elements are stored as dictionary keys and their counts are stored as dictionary values. 
Counts are allowed to be any integer value including zero or negative counts. 
The Counter class is similar to bags or multisets in other languages.'''

from scipy.sparse import csr_matrix
test = 'ABC def ABC def zzz zzz pqr ABC abc'
a = dict(Counter(test.split()))     # will compute the frequency and store in the dict as value.
print(a)
print(type(a))  #note this 
for i,j in a.items():
    print(i, j)

{'ABC': 3, 'def': 2, 'zzz': 2, 'pqr': 1, 'abc': 1}
<class 'dict'>
ABC 3
def 2
zzz 2
pqr 1
abc 1


In [11]:
print(Counter('abracadabra').most_common(3))
Z = Counter('abracadabra').most_common()
print(Z)

[('a', 5), ('b', 2), ('r', 2)]
[('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]


In [12]:
from scipy.sparse import csr_matrix
test = 'ABC def ABC def zzz zzz pqr ABC abc'
C = Counter(test.split())  # returns a Counter, which is a subclass of dict
print(C)
print(type(C))      # compare with the plain dict type printed earlier


Counter({'ABC': 3, 'def': 2, 'zzz': 2, 'pqr': 1, 'abc': 1})
<class 'collections.Counter'>
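
The two outputs differ only in type: `Counter` is a subclass of `dict`, so it can be used wherever a dictionary is expected. A short sketch of the behaviour that matters here:

```python
from collections import Counter

counts = Counter('ABC def ABC def zzz zzz pqr ABC abc'.split())

print(isinstance(counts, dict))  # True: Counter is a dict subclass
print(counts['ABC'])             # 3
print(counts['missing'])         # 0: a missing key counts as zero instead of raising KeyError
print('missing' in counts)       # False: looking up a missing key does not insert it
```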


In [13]:
datset = ["abc def aaa prq", "lmn pqr aaaaaaa aaa abbb baaa"]
vocab = fit(datset)
print(vocab)

{'aaa': 0, 'aaaaaaa': 1, 'abbb': 2, 'abc': 3, 'baaa': 4, 'def': 5, 'lmn': 6, 'pqr': 7, 'prq': 8}


In [14]:
# https://stackoverflow.com/questions/9919604/efficiently-calculate-word-frequency-in-a-string
# https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.csr_matrix.html
# note that we need to pass preprocessed text here; the preprocessing step is not included

def transform(dataset,vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset, (list,)):
        for idx, row in enumerate((dataset)): # for each review in the dataset
        # idx is the index of the review, so it becomes the row index in the matrix (0 or 1 for a dataset of two reviews).
            # Counter gives a dict-like object where the key is the word and the value is its frequency, {word:frequency}
            word_freq = dict(Counter(row.split())) # word frequencies of the current review
            print("The word frequency are ",word_freq) 
            # for every unique word in the document
            for word, freq in word_freq.items():  # for each unique word in the review.                
                if len(word) < 2:
                    continue
                # we will check if it is in the vocabulary that we built in the fit() function
                # dict.get() will return the value (the column index); if the key doesn't exist it will return -1
                col_index = vocab.get(word, -1) # column index of the word, or -1 if it is not in the vocabulary
                # if the word exists
                if col_index !=-1:
                    # we are storing the index of the document
                    rows.append(idx)
                    # we are storing the dimension (column index) of the word
                    columns.append(col_index)
                    # we are storing the frequency of the word
                    values.append(freq)
                    
        print("the row index are:    ",rows)
        print("the column index are: ",columns)
        print("the values are:       ",values) 
        print("The tranformed matrix is ")
        return csr_matrix((values, (rows,columns)), shape=(len(dataset),len(vocab)))
    else:
        print("you need to pass list of strings")

In [15]:
Review1 = "This pasta is very very testy"
Review2 = "This pasta is cheap but delecious pasta"
Text = [Review1,Review2]
vocab = fit(Text)
print("values in dictionary canbe used as colmn indx",vocab)
print("*"*30)

print(transform(Text, vocab).toarray())

#first row for the first review.
#second row for the second review.

values in dictionary canbe used as colmn indx {'This': 0, 'but': 1, 'cheap': 2, 'delecious': 3, 'is': 4, 'pasta': 5, 'testy': 6, 'very': 7}
******************************
The word frequency are  {'This': 1, 'pasta': 1, 'is': 1, 'very': 2, 'testy': 1}
The word frequency are  {'This': 1, 'pasta': 2, 'is': 1, 'cheap': 1, 'but': 1, 'delecious': 1}
the row index are:     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
the column index are:  [0, 5, 4, 7, 6, 0, 5, 4, 2, 1, 3]
the values are:        [1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1]
The tranformed matrix is 
[[1 0 0 0 1 1 1 2]
 [1 1 1 1 1 2 0 0]]


In [16]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()

vec.fit(Text)
feature_matrix_2 = vec.transform(Text)
print(feature_matrix_2.toarray())

[[0 0 0 1 1 1 1 2]
 [1 1 1 1 2 0 1 0]]
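
The columns differ from the custom `transform` output because `CountVectorizer` lowercases the text by default (`lowercase=True`), so 'This' becomes 'this' and sorts after 'testy' instead of coming first. The pure-Python sketch below only mimics that lowercasing step to show the resulting column order; it is an illustration, not CountVectorizer's actual implementation.

```python
from collections import Counter

Review1 = "This pasta is very very testy"
Review2 = "This pasta is cheap but delecious pasta"
Text = [Review1, Review2]

# Lowercase every word before collecting the vocabulary; keep words of length >= 2 as in fit().
words = {word.lower() for review in Text for word in review.split() if len(word) >= 2}
lowered_vocab = {word: index for index, word in enumerate(sorted(words))}
print(lowered_vocab)

# Counts of the first review laid out in the lowercased column order.
first_counts = Counter(word.lower() for word in Review1.split())
first_row = [first_counts[word] for word in lowered_vocab]
print(first_row)  # [0, 0, 0, 1, 1, 1, 1, 2]
```

With this ordering the first row agrees with the first row printed by `CountVectorizer` above.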


In [17]:
strings = ["the method of lagrange multipliers is the economists workhorse for solving optimization problems",
           "the technique is a centerpiece of economic theory but unfortunately its usually taught poorly"]
vocab = fit(strings)
print(list(vocab.keys()))
print(transform(strings, vocab).toarray())

['but', 'centerpiece', 'economic', 'economists', 'for', 'is', 'its', 'lagrange', 'method', 'multipliers', 'of', 'optimization', 'poorly', 'problems', 'solving', 'taught', 'technique', 'the', 'theory', 'unfortunately', 'usually', 'workhorse']
The word frequency are  {'the': 2, 'method': 1, 'of': 1, 'lagrange': 1, 'multipliers': 1, 'is': 1, 'economists': 1, 'workhorse': 1, 'for': 1, 'solving': 1, 'optimization': 1, 'problems': 1}
The word frequency are  {'the': 1, 'technique': 1, 'is': 1, 'a': 1, 'centerpiece': 1, 'of': 1, 'economic': 1, 'theory': 1, 'but': 1, 'unfortunately': 1, 'its': 1, 'usually': 1, 'taught': 1, 'poorly': 1}
the row index are:     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
the column index are:  [17, 8, 10, 7, 9, 5, 3, 21, 4, 14, 11, 13, 17, 16, 5, 1, 10, 2, 18, 0, 19, 6, 20, 15, 12]
the values are:        [2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
The tranformed matrix is 
[[0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 2 0 0 0 1]
 [1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0]]

## Comparing results with CountVectorizer

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(analyzer='word')

vec.fit(strings)
feature_matrix_2 = vec.transform(strings)
print(feature_matrix_2.toarray())

[[0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 2 0 0 0 1]
 [1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0]]


In [19]:
def fit(dataset):    
    unique_words = set()  
    if isinstance(dataset, (list,)):
        for row in dataset: 
            for word in row.split(" "):
                if len(word) < 2:
                    continue
                unique_words.add(word)
        unique_words = sorted(list(unique_words))
        vocab = {j:i for i,j in enumerate(unique_words)}
        #print(unique_words)    
        return vocab
    else:
        print("you need to pass a list of sentences")

def transform(dataset,vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset, (list,)):
        for idx, row in enumerate((dataset)):  
            word_freq = dict(Counter(row.split())) 
            print("The word frequency are ",word_freq) 
            for word, freq in word_freq.items():  
                if len(word) < 2:
                    continue
                col_index = vocab.get(word, -1) 
                if col_index !=-1:  
                    rows.append(idx)
                    columns.append(col_index)
                    values.append(freq)            
        print("the row index are:    ",rows)
        print("the column index are: ",columns)
        print("the values are:       ",values) 
        print("The tranformed matrix is ")
        return csr_matrix((values, (rows,columns)), shape=(len(dataset),len(vocab)))
    else:
        print("you need to pass list of strings")

#Calling the function
Review1 = "This pasta is very very testy"
Review2 = "This pasta is cheap but delecious pasta"
Text = [Review1,Review2]
vocab = fit(Text)
print("values in dictionary canbe used as colmn indx",vocab)
print("*"*30)

print(transform(Text, vocab).toarray())       

values in dictionary canbe used as colmn indx {'This': 0, 'but': 1, 'cheap': 2, 'delecious': 3, 'is': 4, 'pasta': 5, 'testy': 6, 'very': 7}
******************************
The word frequency are  {'This': 1, 'pasta': 1, 'is': 1, 'very': 2, 'testy': 1}
The word frequency are  {'This': 1, 'pasta': 2, 'is': 1, 'cheap': 1, 'but': 1, 'delecious': 1}
the row index are:     [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
the column index are:  [0, 5, 4, 7, 6, 0, 5, 4, 2, 1, 3]
the values are:        [1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1]
The tranformed matrix is 
[[1 0 0 0 1 1 1 2]
 [1 1 1 1 1 2 0 0]]
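
One consequence of `vocab.get(word, -1)` is that words which were never seen by `fit()` are silently skipped by `transform()`. The sketch below illustrates this with compact stand-ins for the two functions. The names `fit_vocab` and `transform_counts` are hypothetical, the debug prints are left out, and raising `TypeError` for non-list input is a design choice of this sketch rather than the behaviour of the code above.

```python
from collections import Counter
from scipy.sparse import csr_matrix

def fit_vocab(dataset):
    """Map every word of length >= 2 to a column index (sorted order)."""
    if not isinstance(dataset, list):
        raise TypeError("dataset must be a list of strings")
    words = {w for row in dataset for w in row.split() if len(w) >= 2}
    return {word: index for index, word in enumerate(sorted(words))}

def transform_counts(dataset, vocab):
    """Build a documents x vocabulary count matrix in CSR format."""
    if not isinstance(dataset, list):
        raise TypeError("dataset must be a list of strings")
    rows, columns, values = [], [], []
    for idx, row in enumerate(dataset):
        for word, freq in Counter(row.split()).items():
            col_index = vocab.get(word, -1)
            if col_index != -1:          # words outside the vocabulary are skipped
                rows.append(idx)
                columns.append(col_index)
                values.append(freq)
    return csr_matrix((values, (rows, columns)), shape=(len(dataset), len(vocab)))

train = ["This pasta is very very testy"]
test = ["This pizza is very cheap"]   # 'pizza' and 'cheap' were never seen by fit_vocab

vocab = fit_vocab(train)
matrix = transform_counts(test, vocab).toarray()
print(vocab)   # {'This': 0, 'is': 1, 'pasta': 2, 'testy': 3, 'very': 4}
print(matrix)  # [[1 1 0 0 1]]
```

The test review has five words, but only the three that exist in the training vocabulary contribute to the row; the matrix always has `len(vocab)` columns regardless of what is passed to the transform step.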
