# Bag Of Words

Bag of Words (BoW) is a text representation technique used in natural language processing (NLP). It represents a document as a bag (multiset) of its words, disregarding grammar and word order but keeping track of how often each word occurs.

The basic idea behind BoW is to collect the words that occur in a document and count how many times each one appears. This yields a vector as long as the vocabulary, where each dimension corresponds to one unique word and holds that word's count. Because most documents use only a small fraction of the vocabulary, the vector is usually sparse.
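
As a minimal sketch of this idea, the snippet below builds a count vector with `collections.Counter`. The toy document, the six-word vocabulary, and the whitespace tokenization are all assumptions chosen for illustration, not part of the notebook's data.

```python
from collections import Counter

# Toy document and vocabulary, chosen only for illustration.
document = "the cat sat on the mat"
vocabulary = ["cat", "dog", "mat", "on", "sat", "the"]

# Count how often each token occurs (whitespace tokenization is an assumption).
counts = Counter(document.split())

# One dimension per vocabulary word; words absent from the document get 0.
vector = [counts[word] for word in vocabulary]
print(vector)  # [1, 0, 1, 1, 1, 2]
```

Note that `dog` is in the vocabulary but not in the document, so its dimension stays at 0.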

The BoW technique is often used as a feature extraction method in various NLP tasks, such as text classification, sentiment analysis, and information retrieval. However, it suffers from some limitations, such as not capturing the semantic relationships between words and the context in which they appear. This has led to the development of more advanced techniques, such as word embeddings and deep learning models, that attempt to overcome these limitations.
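
To make the word-order limitation concrete, here is a small sketch with two illustrative sentences (not from the notebook's data) that mean opposite things yet produce identical bags of words.

```python
from collections import Counter

# Two sentences with opposite meanings (illustrative examples only).
first = "dog bites man".split()
second = "man bites dog".split()

# Counter ignores order, so both sentences yield the same word counts.
print(Counter(first) == Counter(second))  # True
```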

**Note**
* This notebook is highly inspired by:
  * [Bag Of Word: Natural Language Processing](https://youtu.be/irzVuSO8o4g)
  * [Creating Bag Of Words From S](https://www.askpython.com/python/examples/bag-of-words-model-from-scratch)

In [4]:
import numpy as np 
from collections import defaultdict
import nltk
nltk.download('punkt') 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

Let's assume we have these sentences:
* She loves pizza, pizza is delicious.
* She is a good person.
* good people are the best.

In [2]:
data = ["She loves pizza, pizza is delicious.",
        "She is a good person.",
        "good people are the best."]

It's a small case, so let's just manually type the unique words from the combined sentences into a set.

In [3]:
identifiers = {"she", "loves", "pizza", "is", "delicious", "a", "good", "person", "people", "are", "the", "best"}

Now we will split each sentence into words. For example, for the sentence `She loves pizza, pizza is delicious.` we want `["She", "loves", "pizza", "pizza", "is", "delicious"]`.

To see how, let's start with a simpler sentence, `Swag Like Ohio`, for which we want `["Swag", "Like", "Ohio"]`.

The first step is to tokenize the sentence:

In [5]:
nltk.tokenize.word_tokenize("Swag Like Ohio")

['Swag', 'Like', 'Ohio']

That was pretty straightforward.

Now what if the sentence was `Dang Like Ohio , Swag Like Ohio` and the expected output was `["Dang", "Like", "Ohio", "Swag", "Like", "Ohio"]`? You would say: just tokenize it again.

In [6]:
nltk.tokenize.word_tokenize("Dang Like Ohio , Swag Like Ohio")

['Dang', 'Like', 'Ohio', ',', 'Swag', 'Like', 'Ohio']

Oops, we got an extra `,` here, and we don't want commas. What we can do is keep only the tokens that consist entirely of letters, using `str.isalpha()`, and lowercase them at the same time.

In [8]:
[word.lower() for word in nltk.tokenize.word_tokenize("Dang Like Ohio , Swag Like Ohio") if word.isalpha()]

['dang', 'like', 'ohio', 'swag', 'like', 'ohio']
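
One thing to keep in mind is that `str.isalpha()` filters out more than punctuation. The sketch below uses a hand-written, hypothetical token list (so it runs without NLTK) to show that numbers and hyphenated words are dropped too.

```python
# Hypothetical token list, written by hand so the example runs without NLTK.
tokens = ["Pizza", "costs", "10", "dollars", ",", "state-of-the-art", "!"]

# Keep purely alphabetic tokens (lowercased); collect everything else separately.
kept = [t.lower() for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(kept)     # ['pizza', 'costs', 'dollars']
print(dropped)  # ['10', ',', 'state-of-the-art', '!']
```

For the three sentences in this notebook that is fine, but for other text it may be worth checking what ends up in `dropped`.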

And that is what we wanted. But we need to do this for all sentences, collecting every new word into a vocabulary as we go.

In [9]:
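sentences = []  # tokenized, lowercased sentences
vocab = []      # unique words, in order of first appearance
for sent in data:
    # Tokenize, drop non-alphabetic tokens such as punctuation, and lowercase
    sentence = [w.lower() for w in nltk.tokenize.word_tokenize(sent) if w.isalpha()]
    sentences.append(sentence)
    for word in sentence:
        if word not in vocab:
            vocab.append(word)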

In [10]:
vocab

['she',
 'loves',
 'pizza',
 'is',
 'delicious',
 'a',
 'good',
 'person',
 'people',
 'are',
 'the',
 'best']

In [12]:
sentences

[['she', 'loves', 'pizza', 'pizza', 'is', 'delicious'],
 ['she', 'is', 'a', 'good', 'person'],
 ['good', 'people', 'are', 'the', 'best']]
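
As a quick sanity check, the sketch below rebuilds the vocabulary from the tokenized sentences and compares it with the hand-typed `identifiers` set. It uses `dict.fromkeys` as one possible alternative to the loop above (an assumption of this sketch, not the notebook's method), and the data is copied in so the block runs on its own.

```python
# Tokenized sentences copied from the output above.
sentences = [
    ["she", "loves", "pizza", "pizza", "is", "delicious"],
    ["she", "is", "a", "good", "person"],
    ["good", "people", "are", "the", "best"],
]
# The hand-typed set from earlier in the notebook.
identifiers = {"she", "loves", "pizza", "is", "delicious", "a",
               "good", "person", "people", "are", "the", "best"}

# dict.fromkeys keeps first-seen order and drops duplicates in one pass.
vocab = list(dict.fromkeys(word for sentence in sentences for word in sentence))

print(len(vocab))                 # 12
print(set(vocab) == identifiers)  # True
```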

Now we need to assign an index to each word and store it in a dictionary, so that we can look up each word's position in the vector later.

In [13]:
# Map each vocabulary word to its position in the count vector
index_word = {}
i = 0
for word in vocab:
    index_word[word] = i
    i += 1
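
The same mapping can be written more compactly with `enumerate`. The sketch below also builds an inverse mapping, here called `word_at` (a hypothetical name, not used elsewhere in the notebook), which can help when reading a count vector back into words.

```python
# Vocabulary copied from the output above.
vocab = ["she", "loves", "pizza", "is", "delicious", "a",
         "good", "person", "people", "are", "the", "best"]

# Word -> index, equivalent to the explicit loop.
index_word = {word: i for i, word in enumerate(vocab)}

# Hypothetical inverse mapping: index -> word.
word_at = dict(enumerate(vocab))

print(index_word["pizza"])  # 2
print(word_at[2])           # pizza
```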

Now we just need to define a function that counts the words in a tokenized sentence and writes those counts into a vector.

In [14]:
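def bag_of_words(sent):
    # sent is a list of lowercased tokens whose words all appear in vocab
    count_dict = defaultdict(int)
    vec = np.zeros(len(vocab))  # one dimension per vocabulary word
    for item in sent:
        count_dict[item] += 1
    # Write each word's count at that word's index
    for key, item in count_dict.items():
        vec[index_word[key]] = item
    return vec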

And we have made our `Bag Of Words`.
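
To see what the vectors look like, here is a self-contained sketch that applies the same counting logic to the three tokenized sentences. It uses a plain-list stand-in, `bag_of_words_list` (a hypothetical helper introduced here so the block runs without NumPy); the notebook's `bag_of_words` should return the same values as a NumPy float array.

```python
from collections import Counter

# Tokenized sentences and vocabulary copied from the outputs above.
sentences = [
    ["she", "loves", "pizza", "pizza", "is", "delicious"],
    ["she", "is", "a", "good", "person"],
    ["good", "people", "are", "the", "best"],
]
vocab = ["she", "loves", "pizza", "is", "delicious", "a",
         "good", "person", "people", "are", "the", "best"]
index_word = {word: i for i, word in enumerate(vocab)}

def bag_of_words_list(sent):
    """Plain-list stand-in for bag_of_words (hypothetical helper, no NumPy)."""
    vec = [0] * len(vocab)
    for word, count in Counter(sent).items():
        vec[index_word[word]] = count
    return vec

for sent in sentences:
    print(bag_of_words_list(sent))
# [1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0]
# [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
```

The `2` in the first row is `pizza`, which appears twice in the first sentence. A word that is not in `vocab` would raise a `KeyError`, so unseen words need to be skipped or handled separately.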