# Bag of Words

#### The basic idea of Bag of Words is to take a piece of text and count the frequency of the words in that text. It is important to note that the BoW concept treats each word individually and the order in which the words occur does not matter.

#### Using the process described below, we can convert a collection of documents into a matrix in which each document is a row, each word (token) is a column, and each (row, column) value is the number of times that token occurs in that document.
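
#### As a minimal sketch of such a matrix, the snippet below uses two made-up documents (not taken from the SMS dataset) and only the standard library. The names `toy_documents`, `vocabulary` and `matrix` are illustrative assumptions.

```python
from collections import Counter

# Two made-up documents (illustrative only, not taken from the SMS dataset)
toy_documents = ["call me now", "call me maybe call"]

# Tokenize on whitespace and count the tokens of each document
counts = [Counter(doc.split()) for doc in toy_documents]

# The sorted vocabulary fixes the column order; a token's index is its integer ID
vocabulary = sorted(set().union(*counts))

# One row per document, one column per token (missing tokens count as 0)
matrix = [[c[token] for token in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

#### Each row has the same length as the vocabulary, so documents of different lengths become vectors of the same size.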

#### To handle this, we will use scikit-learn's `CountVectorizer` class, which does the following:

##### 1) It tokenizes the string (separates it into individual words) and assigns an integer ID to each token.

##### 2) It converts all text to lower case (`lowercase=True` by default).
##### 3) It ignores punctuation: by default, a token is a run of two or more alphanumeric characters, and punctuation acts as a separator.
##### 4) It can ignore stop words: passing `stop_words='english'` removes the words in scikit-learn's built-in English stop-word list. By default (`stop_words=None`) no stop words are removed.
##### 5) It counts the occurrences of each of those tokens.
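
#### The tokenization behaviour listed above can be approximated with the standard library. The sketch below assumes the regular expression `(?u)\b\w\w+\b`, which scikit-learn documents as the default `token_pattern` of `CountVectorizer`. The helper `simple_tokenize` is hypothetical and not part of scikit-learn.

```python
import re

# Default token_pattern documented for CountVectorizer: runs of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(text):
    """Hypothetical helper: lowercase the text, then extract the tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_tokenize("Ok lar... Joking wif u oni...")
print(tokens)
```

#### The single-character token `u` does not match the pattern and is dropped, which is one reason a vocabulary built by splitting on whitespace can differ from the one `CountVectorizer` learns.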

#### Before we let scikit-learn's `CountVectorizer` do the dirty work for us, let's implement Bag of Words (BoW) ourselves so that we understand what happens behind the scenes.


In [98]:
import pandas as pd
import string
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [172]:
# Read the tab-separated dataset (adjust the path if the file lives under 'smsspamcollection/')
df = pd.read_table('SMSSpamCollection', names=['label', 'sms_message'])

# Show the first 5 rows
df.head()

| | label | sms_message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [173]:
df.shape

(5572, 2)

## Implementing Bag of Words from scratch

In [174]:
documents = df.sms_message.to_list()

In [175]:
print(documents[0:2])

['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...']


### Converting all strings to their lower case form

In [176]:
lower_case_documents = [d.lower() for d in documents]

In [177]:
lower_case_documents[0]

'go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...'

### Removing all punctuation

In [178]:
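# Build a translation table that deletes every ASCII punctuation character
punctuation_table = str.maketrans('', '', string.punctuation)

# Apply the table to every lower-cased document
no_punctuation_documents = [d.translate(punctuation_table) for d in lower_case_documents]

len(no_punctuation_documents)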

5572

### Tokenization

In [179]:
# Split each document into individual words on whitespace
splitted_documents = [d.split() for d in no_punctuation_documents]
print(splitted_documents[0:2])
len(splitted_documents)

[['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat'], ['ok', 'lar', 'joking', 'wif', 'u', 'oni']]


5572

### Removing stop words with the NLTK library

In [180]:
# Get the English stop words from NLTK (requires a one-time nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stop words
filtered_documents = [[w for w in doc if w not in stop_words] for doc in splitted_documents]

print(filtered_documents[0:2])
len(filtered_documents)

[['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat'], ['ok', 'lar', 'joking', 'wif', 'u', 'oni']]


5572

### Counting frequencies

In [181]:
from collections import Counter

# Count how often each remaining token occurs in each document
frequency_list = [Counter(d) for d in filtered_documents]

In [182]:
print(frequency_list[0:2])

[Counter({'go': 1, 'jurong': 1, 'point': 1, 'crazy': 1, 'available': 1, 'bugis': 1, 'n': 1, 'great': 1, 'world': 1, 'la': 1, 'e': 1, 'buffet': 1, 'cine': 1, 'got': 1, 'amore': 1, 'wat': 1}), Counter({'ok': 1, 'lar': 1, 'joking': 1, 'wif': 1, 'u': 1, 'oni': 1})]


#### We have implemented the Bag of Words process from scratch! As the output shows, each document is now represented by a `Counter` (a dictionary subclass) that maps every remaining token to the number of times it occurs in that document.
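
#### The four steps can also be wrapped in a single function. This is a sketch rather than the notebook's own code: `bag_of_words` is a hypothetical helper, and `TOY_STOP_WORDS` is a tiny made-up stop-word set that stands in for the NLTK list so the example runs without any downloads.

```python
import string
from collections import Counter

# Tiny stand-in for the NLTK stop-word list (assumption, for illustration only)
TOY_STOP_WORDS = {"until", "only", "in", "there"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def bag_of_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words, count."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return Counter(t for t in tokens if t not in stop_words)

bow = bag_of_words("Ok lar... Joking wif u oni...")
print(bow)
```

#### Passing `set(stopwords.words('english'))` as `stop_words` would apply the same NLTK list used in the cells above.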

## Implementing Bag of Words in scikit-learn using CountVectorizer

### Preprocessing the documents

In [188]:
# Instantiate CountVectorizer with its default settings
count_vector = CountVectorizer()

# Fit on the documents to learn the vocabulary
count_vector.fit(documents)

# Get the column names (the learned vocabulary, in column order)
column_names = count_vector.get_feature_names_out()

print(column_names)
len(column_names)

['00' '000' '000pes' ... 'èn' 'ú1' '〨ud']


8713

### Matrix of frequencies

In [189]:
# Transform the documents into a dense array of token counts
doc_array = count_vector.transform(documents).toarray()
doc_array

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [190]:
# create frequency matrix
frequency_matrix = pd.DataFrame(doc_array, columns=column_names)
frequency_matrix

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585236,01223585334,0125698789,02,...,zhong,zindgi,zoe,zogtorius,zoom,zouk,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5567,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5568,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5569,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5570,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [191]:
frequency_matrix.shape

(5572, 8713)
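
#### A matrix of this shape holds mostly zeros, as the output above suggests, so the dense copy produced by `toarray()` is costly. The following back-of-the-envelope sketch assumes 8 bytes per `int64` entry and ignores any array overhead.

```python
# Shape reported above and the size of one int64 entry in bytes
n_documents, n_tokens = 5572, 8713
bytes_per_entry = 8

# Total number of cells and their dense storage cost
n_entries = n_documents * n_tokens
dense_megabytes = n_entries * bytes_per_entry / 1e6

print(f"{n_entries:,} entries, about {dense_megabytes:.0f} MB when stored densely")
```

#### `count_vector.transform(documents)` already returns a SciPy sparse matrix, so leaving out `.toarray()` keeps only the non-zero counts in memory.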