# Bag of words feature extraction

In this section, we will construct a Bag of Words Term Frequency Matrix for an example corpus.


Let us use the same corpus that we used during training:

```Python
X = ['The cat and the mouse ate together',
          'The old man put the cigarette in the ashtray and placed it on the table',
          'The cat and mouse game']
```


In [1]:
X = [
    "The cat and the mouse ate together",
    "The old man put the cigarette in the ashtray and placed it on the table",
    "The cat and mouse game",
]

In [6]:
%pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.4.1.post1-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Using cached scipy-1.12.0-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Using cached threadpoolctl-3.3.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.4.1.post1-cp311-cp311-win_amd64.whl (10.6 MB)

In [7]:
# Import the CountVectorizer class (Bag of Words) from sklearn.feature_extraction.text
from sklearn.feature_extraction.text import CountVectorizer


In [8]:
# Create an object from this class
cv = CountVectorizer()

`CountVectorizer` is a transformer in scikit-learn: it converts data from one representation to another (here, from a corpus of sentences to their word counts). With transformer objects, the workflow is to call the `fit` method and then the `transform` method. (Recall the `Fit-Transform` workflow from the Data Science training.)

The `fit` method learns the characteristics of the data that the transformation needs (for Bag of Words, the unique words in the corpus). The `transform` method then uses them to convert the data (for Bag of Words, to build the Term Frequency matrix).
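
To make the two steps concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: the `tokenize`, `fit`, and `transform` helpers are hypothetical names, and the tokenization rule (lowercase, keep tokens of two or more word characters) is an assumption chosen to mimic `CountVectorizer`'s default behaviour on this corpus.

```python
import re
from collections import Counter

corpus = [
    "The cat and the mouse ate together",
    "The old man put the cigarette in the ashtray and placed it on the table",
    "The cat and mouse game",
]

# Assumption: lowercase the text and keep tokens of 2+ word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    """Split one document into lowercase tokens."""
    return TOKEN.findall(doc.lower())


def fit(docs):
    """Learn the vocabulary: sorted unique terms mapped to column indices."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(docs, vocabulary):
    """Count, for each document, how often each vocabulary term occurs."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # Dicts keep insertion order, so terms come out in column order.
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows


vocabulary = fit(corpus)                # step 1: learn the columns
matrix = transform(corpus, vocabulary)  # step 2: build the count matrix
```

Here `vocabulary` plays the role of the learned state, and `matrix` is a dense list-of-lists version of the Term Frequency matrix.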


In [9]:
# Call the fit method on the cv object and pass the corpus
# TYPE YOUR CODE HERE
cv.fit(X)

In [10]:
# After fitting, the cv object holds the unique terms in our corpus; these become the columns of the Bag of Words matrix.
# The get_feature_names_out() method returns these feature names (here, the column names of the Term Frequency Matrix).
cv.get_feature_names_out()

array(['and', 'ashtray', 'ate', 'cat', 'cigarette', 'game', 'in', 'it',
       'man', 'mouse', 'old', 'on', 'placed', 'put', 'table', 'the',
       'together'], dtype=object)

In [16]:
# Transform the corpus X using the fitted model cv and store it as Xt
# TYPE YOUR CODE HERE
Xt = cv.transform(X)

In [17]:
# Inspect Xt
Xt

<3x17 sparse matrix of type '<class 'numpy.int64'>'
	with 23 stored elements in Compressed Sparse Row format>

In [18]:
# Printing a sparse matrix lists only its non-zero entries, one per line, as (row, column) followed by the count
print(Xt)

  (0, 0)	1
  (0, 2)	1
  (0, 3)	1
  (0, 9)	1
  (0, 15)	2
  (0, 16)	1
  (1, 0)	1
  (1, 1)	1
  (1, 4)	1
  (1, 6)	1
  (1, 7)	1
  (1, 8)	1
  (1, 10)	1
  (1, 11)	1
  (1, 12)	1
  (1, 13)	1
  (1, 14)	1
  (1, 15)	4
  (2, 0)	1
  (2, 3)	1
  (2, 5)	1
  (2, 9)	1
  (2, 15)	1
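
To read this listing, each `(row, column)` pair can be mapped back to a document and a term. The following sketch does this by hand for the first document, using the entries and feature names copied from the outputs above; the variable names are illustrative only.

```python
# (row, column, count) triples copied from the printed entries of document 0.
entries = [(0, 0, 1), (0, 2, 1), (0, 3, 1), (0, 9, 1), (0, 15, 2), (0, 16, 1)]

# Column names as returned by get_feature_names_out().
feature_names = [
    "and", "ashtray", "ate", "cat", "cigarette", "game", "in", "it",
    "man", "mouse", "old", "on", "placed", "put", "table", "the",
    "together",
]

# Rebuild the dense row: every column not listed stays at zero.
dense_row = [0] * len(feature_names)
for _row, col, count in entries:
    dense_row[col] = count

# Translate column indices into the terms they stand for.
nonzero_terms = {feature_names[col]: count for _row, col, count in entries}
print(nonzero_terms)
```

Only 6 of the 17 columns are stored for this document, which is why the sparse format saves memory on larger vocabularies.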


In [19]:
# Displaying all the elements (dense format) of the sparse matrix using the .toarray() method
# Note, we are only showing this for demonstration purposes. When the data is large, we don't convert it into a dense format unless absolutely necessary.
Xt.toarray()

array([[1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1],
       [1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 4, 0],
       [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)

In [20]:
# Displaying the column names of the term frequency matrix using get_feature_names_out() method
cv.get_feature_names_out()

array(['and', 'ashtray', 'ate', 'cat', 'cigarette', 'game', 'in', 'it',
       'man', 'mouse', 'old', 'on', 'placed', 'put', 'table', 'the',
       'together'], dtype=object)

In [22]:
# Put the data in a Pandas DataFrame with the column names given by the output of get_feature_names_out() method
# TYPE YOUR CODE HERE
import pandas as pd
df = pd.DataFrame(Xt.toarray(), columns = cv.get_feature_names_out())
display(df)


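|   | and | ashtray | ate | cat | cigarette | game | in | it | man | mouse | old | on | placed | put | table | the | together |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 1 |
| 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 4 | 0 |
| 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |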


In [24]:
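# Compute the pairwise cosine similarity between the three documents
# Entry [i, j] of the result is the similarity between document i and document j
# cosine_similarity accepts sparse input, so the dense conversion is not needed
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(Xt)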

array([[1.        , 0.57735027, 0.74535599],
       [0.57735027, 1.        , 0.43033148],
       [0.74535599, 0.43033148, 1.        ]])
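
As a check on one entry of this matrix, the cosine similarity between the first and third documents can be worked out by hand: the dot product of the two count vectors divided by the product of their lengths. The sketch below uses plain Python with the rows copied from the dense matrix above; `cosine` is a hypothetical helper, not the scikit-learn function.

```python
import math

# Rows 0 and 2 of the Term Frequency matrix shown earlier.
doc0 = [1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1]
doc2 = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# dot = 5, |doc0| = 3, |doc2| = sqrt(5), so the result is 5 / (3 * sqrt(5)).
similarity = cosine(doc0, doc2)
print(round(similarity, 8))
```

This agrees with the value in row 0, column 2 of the matrix above. The first and third documents score highest because they share "and", "cat", "mouse", and "the".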