# BoW Concept


In this notebook we will use the Bag of Words (BoW) concept. BoW takes a piece of text and counts the frequency of each word in it. Each word is treated individually, and the order in which the words occur does not matter. This is the same idea that `CountVectorizer()` from sklearn implements.
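To make the "order does not matter" point concrete, here is a minimal sketch using `collections.Counter` from the standard library. The two sample sentences are made up for illustration and are not part of the notebook's data.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order.
sentence_a = "call me now"
sentence_b = "now call me"

# A bag of words keeps only the counts, so both sentences give the same bag.
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a == bag_b)
```

The comparison is true because a `Counter` compares keys and counts only, not the order in which the words were seen.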


In [1]:
#The Text we will use to build our BoW.
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

## Step1:
The first step is to convert all the words to lower case and store the results in a new list.

In [2]:
def lower_case(document):
    '''This function takes a list of texts and converts all the words to lower case.'''
    lower_case_documents = []
    # Iterate over the argument, not the global `documents` list.
    for i in document:
        doc = []
        words = i.split()
        for word in words:
            doc.append(word.lower())
        lower_case_documents.append(' '.join(doc))
    return lower_case_documents

In [3]:
low_doc=lower_case(documents)
print(low_doc)

['hello, how are you!', 'win money, win from home.', 'call me now.', 'hello, call hello you tomorrow?']
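As a shorter alternative, `str.lower()` can be applied to each whole string, so the word-by-word loop is optional. This is only a sketch: the name `low_doc_short` is introduced here for illustration. Unlike the split-and-join version above, it leaves the original whitespace untouched, which makes no difference for these sample documents.

```python
documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call hello you tomorrow?']

# str.lower() lower-cases every character of the string in one call.
low_doc_short = [doc.lower() for doc in documents]
print(low_doc_short)
```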


## Step2: 
Removing all punctuation from the text.

In [4]:
import string
def remove_pun(document):
    ''' This function removes all Punctuation from a text document.'''
    sans_punctuation_documents = []
    for i in document:
        doc = []
        for letter in i:
            if letter not in string.punctuation:
                doc.append(letter)
        sans_punctuation_documents.append(''.join(doc))
    return sans_punctuation_documents


In [5]:
no_pun_doc=remove_pun(low_doc)
print(no_pun_doc)

['hello how are you', 'win money win from home', 'call me now', 'hello call hello you tomorrow']
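The same result can be obtained without a character-by-character loop by using `str.maketrans` and `str.translate`. The sketch below assumes the lower-cased documents from Step 1; the names `table` and `no_pun_short` are introduced here for illustration.

```python
import string

low_doc = ['hello, how are you!',
           'win money, win from home.',
           'call me now.',
           'hello, call hello you tomorrow?']

# Build a translation table that deletes every character in string.punctuation.
table = str.maketrans('', '', string.punctuation)

# Apply the table to each document.
no_pun_short = [doc.translate(table) for doc in low_doc]
print(no_pun_short)
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as curly quotes would survive both versions.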


## Step3:
Tokenization means splitting a sentence into individual words. A sentence is split on a delimiter, and for ordinary text the delimiter is usually whitespace.

In [8]:
def split_doc(document):
    '''This function splits each text on whitespace and saves the words in a list.'''
    preprocessed_documents = []
    for i in document:
        preprocessed_documents.append(i.split())
    return preprocessed_documents

In [9]:
splitted_doc=split_doc(no_pun_doc)
print(splitted_doc)

[['hello', 'how', 'are', 'you'], ['win', 'money', 'win', 'from', 'home'], ['call', 'me', 'now'], ['hello', 'call', 'hello', 'you', 'tomorrow']]
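One detail worth knowing about the delimiter: `split()` with no argument splits on any run of whitespace, while `split(' ')` splits on every single space character. The sketch below uses a made-up string with a double space and a tab to show the difference.

```python
# Hypothetical text containing two spaces and a tab.
text = "call  me\tnow"

# No argument: any run of whitespace is treated as one delimiter.
tokens_default = text.split()

# Explicit ' ': every single space is a delimiter, and tabs are not.
tokens_space = text.split(' ')

print(tokens_default)
print(tokens_space)
```

The first call yields three clean tokens. The second yields an empty string for the double space and keeps the tab inside a token, which is why the argument-free form used in `split_doc` is the safer default here.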


## Step4:
Counting the frequency of each word, that is, how many times each word occurs.
`Counter` counts the occurrences of each item in a list and returns a dictionary-like object in which each key is an item and the corresponding value is the count of that item in the list.

In [11]:
import pprint
from collections import Counter

frequency_list = []

for i in splitted_doc:
    frequency_counts = Counter(i)
    frequency_list.append(frequency_counts)

pprint.pprint(frequency_list)

[Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1}),
 Counter({'win': 2, 'money': 1, 'from': 1, 'home': 1}),
 Counter({'call': 1, 'me': 1, 'now': 1}),
 Counter({'hello': 2, 'call': 1, 'you': 1, 'tomorrow': 1})]
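The per-document counts can be arranged into a document-term matrix, with one row per document and one column per vocabulary word. The sketch below is an illustration rather than part of the original steps: the names `vocabulary` and `matrix` and the alphabetical column order are assumptions made here.

```python
from collections import Counter

splitted_doc = [['hello', 'how', 'are', 'you'],
                ['win', 'money', 'win', 'from', 'home'],
                ['call', 'me', 'now'],
                ['hello', 'call', 'hello', 'you', 'tomorrow']]
frequency_list = [Counter(doc) for doc in splitted_doc]

# Vocabulary: every distinct word across all documents, sorted alphabetically.
vocabulary = sorted({word for doc in splitted_doc for word in doc})

# One row per document, one column per vocabulary word.
# A Counter returns 0 for a word it has not seen, so no extra check is needed.
matrix = [[counts[word] for word in vocabulary] for counts in frequency_list]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row has the same length regardless of how many words the document contains, which makes the rows usable as fixed-size feature vectors.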
