## **BAG OF WORDS**

---

## **What is Bag of Words?**

Bag of words is a text-modelling technique from Natural Language Processing. In technical terms, it is a method of feature extraction from text data.
It represents a text as a "bag" of its words: it records how many times each word (often only the most frequent words) occurs, and it disregards grammar and word order.

### **Why is Bag of Words used?**

With bag of words we can convert variable-length texts into fixed-length vectors. Machine learning models work with numerical data, so the technique lets us turn any text into an equivalent vector of numbers whose length equals the vocabulary size, however long the text is.
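
A minimal sketch of the fixed-length idea, assuming a tiny hand-picked vocabulary and two made-up sentences (`vocabulary`, `to_vector` and both sentences are illustrative names and data, not part of the walkthrough below):

```python
from collections import Counter

# Hypothetical vocabulary and sentences, chosen only for illustration.
vocabulary = ["ai", "is", "learning", "machine", "safe"]

def to_vector(sentence, vocab):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocab]

short_text = "AI is safe"
long_text = "machine learning is learning from data and AI is machine learning"

short_vec = to_vector(short_text, vocabulary)
long_vec = to_vector(long_text, vocabulary)

print(short_vec)  # [1, 1, 0, 0, 1]
print(long_vec)   # [1, 2, 3, 2, 0]
```

Both vectors have five entries even though the sentences differ in length, and words outside the vocabulary ("from", "data", "and") are simply ignored.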


### **How to apply bag of words?**

---

<hr>

### **1. Tokenization**

<hr/>

Split the text into sentence tokens, so that each sentence can later be turned into its own vector.

In [None]:
import re
import nltk
import numpy as np
# Fetch the Punkt sentence tokenizer models (newer NLTK releases may ask for 'punkt_tab' instead).
nltk.download('punkt')
text = "Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity. At Robofied, we are doing research in speech, natural language, and machine learning. We develop open-source solutions for developers which empowers them so that they can make better products for the world. We educate people about Artificial Intelligence, its scope and impact via resources and tutorials."
dataset = nltk.sent_tokenize(text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


<hr>

### **2. Preprocessing the Data**

<hr/>

1. Convert text to lower case.

2. Replace every non-word character (punctuation, hyphens and so on) with a space.

3. Collapse runs of whitespace into a single space.
  

In [None]:
for i in range(len(dataset)):
    dataset[i] = dataset[i].lower()                # lower-case the sentence
    dataset[i] = re.sub(r'\W', ' ', dataset[i])    # non-word characters -> space
    dataset[i] = re.sub(r'\s+', ' ', dataset[i])   # collapse repeated whitespace
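
To see what these substitutions do, here is a small self-contained sketch that applies the same three steps to one sentence taken from the text. The helper name `clean` is an assumption made for this illustration:

```python
import re

def clean(sentence):
    """Lower-case, replace non-word characters with spaces, collapse whitespace."""
    sentence = sentence.lower()
    sentence = re.sub(r'\W', ' ', sentence)
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence

sample = "We develop open-source solutions for developers."
cleaned = clean(sample)
print(repr(cleaned))  # 'we develop open source solutions for developers '
```

The hyphen in "open-source" becomes a space, so "open" and "source" end up as separate words. The final full stop leaves a trailing space, which is harmless for the word tokenizer; `.strip()` would remove it. Note that `\W` does not match digits or underscores, so those are kept.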

<hr>

### **3. Finding most frequent words**
<hr/>

1. Declare a dictionary to hold our bag of words.

2. Tokenize each sentence to words.

3. Check if the word exists in our dictionary.
If it does, then increment its count by 1. If it doesn’t, add it to our dictionary and set its count as 1.



In [None]:

word2count = {}  # word -> number of occurrences across all sentences
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
word2count

{'a': 2,
 'about': 1,
 'and': 3,
 'are': 1,
 'artificial': 3,
 'at': 1,
 'based': 1,
 'better': 1,
 'can': 1,
 'common': 1,
 'comprehensive': 1,
 'democratizing': 1,
 'develop': 1,
 'developers': 1,
 'doing': 1,
 'educate': 1,
 'empowers': 1,
 'for': 2,
 'goal': 1,
 'gurugram': 1,
 'haryana': 1,
 'impact': 1,
 'in': 2,
 'intelligence': 3,
 'is': 1,
 'its': 1,
 'language': 1,
 'learning': 1,
 'machine': 1,
 'make': 1,
 'natural': 1,
 'of': 1,
 'open': 1,
 'people': 1,
 'platform': 1,
 'products': 1,
 'research': 1,
 'resources': 1,
 'robofied': 2,
 'safe': 1,
 'scope': 1,
 'singularity': 1,
 'so': 1,
 'solutions': 1,
 'source': 1,
 'speech': 1,
 'that': 1,
 'the': 1,
 'them': 1,
 'they': 1,
 'towards': 2,
 'tutorials': 1,
 'via': 1,
 'we': 3,
 'which': 1,
 'working': 1,
 'world': 1}
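
The same counting logic can be written more compactly with `collections.Counter` from the standard library. This is only a sketch on two made-up, already-cleaned sentences, and it uses `str.split()` as a stand-in for `nltk.word_tokenize`, which is an assumption that holds only for text without punctuation:

```python
from collections import Counter

# Two already-cleaned sentences, made up for this illustration.
sentences = [
    "we educate people about artificial intelligence",
    "we are doing research in artificial intelligence",
]

word_counts = Counter()
for sentence in sentences:
    # str.split() stands in for nltk.word_tokenize on cleaned text.
    word_counts.update(sentence.split())

print(word_counts["artificial"])  # 2
print(word_counts["people"])      # 1
print(word_counts.most_common(3))
```

`Counter` returns 0 for unseen words, so the explicit "is the word already in the dictionary?" check is no longer needed.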

<hr>

### **4. Building the Bag of words model**
<hr/>

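Construct one vector per sentence, with one entry for each frequent word. An entry is set to 1 if that frequent word occurs in the sentence and to 0 otherwise, so the vectors record presence rather than counts. The code keeps at most the 100 most frequent words; this text contains only 57 distinct words, so all of them are kept and each vector has 57 entries.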

In [None]:
import heapq
# Keep (at most) the 100 most frequent words as the vocabulary.
freq_words = heapq.nlargest(100, word2count, key=word2count.get)
X = []
for data in dataset:
    tokens = set(nltk.word_tokenize(data))  # tokenize each sentence only once
    vector = []
    for word in freq_words:
        if word in tokens:
            vector.append(1)
        else:
            vector.append(0)
    X.append(vector)
X = np.asarray(X)
X

array([[1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
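
The matrix above stores only 0 or 1. A common variant stores how many times each word occurs instead. The sketch below contrasts the two on a small hypothetical vocabulary (`vocab`) and two cleaned sentences adapted from the text; it uses `str.split()` rather than NLTK to stay self-contained:

```python
from collections import Counter

# Hypothetical vocabulary and cleaned sentences, for illustration only.
vocab = ["and", "artificial", "intelligence", "resources", "we"]
sentences = [
    "we educate people about artificial intelligence its scope and impact via resources and tutorials",
    "we are doing research in speech natural language and machine learning",
]

binary_rows, count_rows = [], []
for sentence in sentences:
    counts = Counter(sentence.split())
    binary_rows.append([1 if counts[w] else 0 for w in vocab])  # presence only
    count_rows.append([counts[w] for w in vocab])               # occurrences

print(binary_rows)  # [[1, 1, 1, 1, 1], [1, 0, 0, 0, 1]]
print(count_rows)   # [[2, 1, 1, 1, 1], [1, 0, 0, 0, 1]]
```

The two representations differ only where a word repeats within a sentence, as "and" does in the first one.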

### **Bag of Words Model with Sklearn**

---

We can use the `CountVectorizer` class from the scikit-learn library to build the bag-of-words model in Python.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
text = "Robofied is a comprehensive Artificial Intelligence platform based in Gurugram,Haryana working towards democratizing safe artificial intelligence towards a common goal of Singularity. At Robofied, we are doing research in speech, natural language, and machine learning. We develop open-source solutions for developers which empowers them so that they can make better products for the world. We educate people about Artificial Intelligence, its scope and impact via resources and tutorials."
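CountVec = CountVectorizer(ngram_range=(1, 1),  # unigrams; use ngram_range=(2, 2) for bigrams
                           stop_words='english')
# Learn the vocabulary and count the words of the single document
Count_data = CountVec.fit_transform([text])

# Create a dataframe with one column per vocabulary word.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out())
print(cv_dataframe)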

   artificial  based  better  common  ...  speech  tutorials  working  world
0           3      1       1       1  ...       1          1        1      1

[1 rows x 37 columns]
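
The `ngram_range=(2, 2)` option mentioned in the code comment counts pairs of adjacent words instead of single words. The following is a rough pure-Python sketch of that idea, not scikit-learn's actual implementation (which also applies its own tokenization and stop-word filtering); `bigram_counts` and the sample sentence are made up for illustration:

```python
from collections import Counter

def bigram_counts(sentence):
    """Count adjacent word pairs in a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    pairs = zip(words, words[1:])
    return Counter(" ".join(pair) for pair in pairs)

# Made-up sentence for illustration.
counts = bigram_counts("artificial intelligence and safe artificial intelligence")
print(counts["artificial intelligence"])  # 2
print(counts["intelligence and"])         # 1
print(len(counts))                        # 4
```

Bigrams retain a little of the word order that a plain bag of words throws away, at the cost of a larger vocabulary.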
