<a href="https://colab.research.google.com/github/Gladiator07/Natural-Language-Processing/blob/main/Basics/Text-Preprocessing/Bag-of-Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bag Of Words

## Bag of Words (using plain Python and nltk)

A bag-of-words model ignores word order and keeps only a count of how often each of the most frequently used words occurs.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
import re
import numpy as np

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
text = """
Beans. I was trying to explain to somebody as we were flying in, that’s corn. 
That’s beans. And they were very impressed at my agricultural knowledge. 
Please give it up for Amaury once again for that outstanding introduction. 
I have a bunch of good friends here today, including somebody who I served with, 
who is one of the finest senators in the country, and we’re lucky to have him, 
your Senator, Dick Durbin is here. I also noticed, by the way, former Governor 
Edgar here, who I haven’t seen in a long time, and somehow he has not aged and 
I have. And it’s great to see you, Governor. I want to thank President Killeen 
and everybody at the U of I System for making it possible for me to be here today. 
And I am deeply honored at the Paul Douglas Award that is being given to me. 
He is somebody who set the path for so much outstanding public service here in Illinois. 
Now, I want to start by addressing the elephant in the room. I know people are 
still wondering why I didn’t speak at the commencement."""

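#### Step-1:
- Split the text into sentences
- Convert each sentence to lower case
- Replace every non-word character (this includes punctuation) with a space
- Collapse runs of whitespace into a single space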
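
The cleaning steps in the list above can be tried on a single sentence first. This is a minimal sketch using only `re`; the helper name `clean_sentence` is made up for illustration and is not part of the notebook. It also shows why the cleaned text contains a stray `s`: the curly apostrophe in "That’s" is a non-word character, so it is replaced by a space.

```python
import re

def clean_sentence(sentence):
    """Lower-case a sentence and replace non-word characters with single spaces."""
    lowered = sentence.lower()
    # \W matches anything that is not a letter, digit or underscore,
    # so punctuation and apostrophes both become spaces
    no_punct = re.sub(r"\W", " ", lowered)
    # Collapse the runs of spaces left behind by the previous step
    return re.sub(r"\s+", " ", no_punct)

cleaned = clean_sentence("That’s beans.")
print(repr(cleaned))  # 'that s beans '
```

The trailing space is kept on purpose here, because the notebook's own cleaning does not strip it either.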

In [3]:
data = nltk.sent_tokenize(text)
for i in range(len(data)):
    data[i] = data[i].lower()
    data[i] = re.sub(r'\W', ' ', data[i])
    data[i] = re.sub(r'\s+', ' ', data[i])

In [4]:
data

[' beans ',
 'i was trying to explain to somebody as we were flying in that s corn ',
 'that s beans ',
 'and they were very impressed at my agricultural knowledge ',
 'please give it up for amaury once again for that outstanding introduction ',
 'i have a bunch of good friends here today including somebody who i served with who is one of the finest senators in the country and we re lucky to have him your senator dick durbin is here ',
 'i also noticed by the way former governor edgar here who i haven t seen in a long time and somehow he has not aged and i have ',
 'and it s great to see you governor ',
 'i want to thank president killeen and everybody at the u of i system for making it possible for me to be here today ',
 'and i am deeply honored at the paul douglas award that is being given to me ',
 'he is somebody who set the path for so much outstanding public service here in illinois ',
 'now i want to start by addressing the elephant in the room ',
 'i know people are still wond

#### Step-2:
- We declare a dictionary to hold our bag of words
- Next, we tokenize each sentence into words
- For each word in a sentence, we check whether it already exists in our dictionary
- If it does, we increment its count by 1. If it doesn't, we add it to the dictionary with a count of 1

In [5]:
# Map each word to the number of times it occurs in the whole text
word2count = {}
for d in data:
    words = nltk.word_tokenize(d)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1

In [6]:
word2count

{'a': 2,
 'addressing': 1,
 'again': 1,
 'aged': 1,
 'agricultural': 1,
 'also': 1,
 'am': 1,
 'amaury': 1,
 'and': 7,
 'are': 1,
 'as': 1,
 'at': 4,
 'award': 1,
 'be': 1,
 'beans': 2,
 'being': 1,
 'bunch': 1,
 'by': 2,
 'commencement': 1,
 'corn': 1,
 'country': 1,
 'deeply': 1,
 'dick': 1,
 'didn': 1,
 'douglas': 1,
 'durbin': 1,
 'edgar': 1,
 'elephant': 1,
 'everybody': 1,
 'explain': 1,
 'finest': 1,
 'flying': 1,
 'for': 5,
 'former': 1,
 'friends': 1,
 'give': 1,
 'given': 1,
 'good': 1,
 'governor': 2,
 'great': 1,
 'has': 1,
 'have': 3,
 'haven': 1,
 'he': 2,
 'here': 5,
 'him': 1,
 'honored': 1,
 'i': 12,
 'illinois': 1,
 'impressed': 1,
 'in': 5,
 'including': 1,
 'introduction': 1,
 'is': 4,
 'it': 3,
 'killeen': 1,
 'know': 1,
 'knowledge': 1,
 'long': 1,
 'lucky': 1,
 'making': 1,
 'me': 2,
 'much': 1,
 'my': 1,
 'not': 1,
 'noticed': 1,
 'now': 1,
 'of': 3,
 'once': 1,
 'one': 1,
 'outstanding': 2,
 'path': 1,
 'paul': 1,
 'people': 1,
 'please': 1,
 'possible': 1,
 'p

In [7]:
len(word2count)

118

We have 118 distinct words in our vocabulary.

However, when processing large texts, the number of distinct words could reach millions, and we do not need all of them. Instead, we keep only a fixed number of the most frequently used words, here the top 100:



In [8]:
import heapq
freq_words = heapq.nlargest(100, word2count, key=word2count.get)
len(freq_words)

100
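
As an aside, the counting and the top-N selection can also be done with `collections.Counter` from the standard library. The sketch below uses a made-up toy word list (`toy_words`) rather than the speech, and checks that `Counter.most_common` picks the same words as `heapq.nlargest` for this input.

```python
from collections import Counter
import heapq

# Toy input, not taken from the notebook's text
toy_words = "the cat sat on the mat the cat".split()

# Counter builds the word -> count mapping in one step
toy_counts = Counter(toy_words)

# most_common(n) returns (word, count) pairs, highest count first
top_two = [word for word, _ in toy_counts.most_common(2)]

# The notebook's approach: n largest keys, ranked by their counts
same_with_heapq = heapq.nlargest(2, toy_counts, key=toy_counts.get)

print(top_two)          # ['the', 'cat']
print(same_with_heapq)  # ['the', 'cat']
```

Either approach works; `heapq.nlargest` avoids sorting the whole vocabulary, which matters more as the vocabulary grows.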

#### Step-3: Building the Bag of Words model

In this step we build one vector per sentence, with one entry for each frequent word. An entry is 1 if that frequent word appears in the sentence and 0 otherwise.


In [9]:
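X = []
for d in data:
    # Tokenize each sentence once and keep the tokens in a set for fast lookups
    tokens = set(nltk.word_tokenize(d))
    # 1 if the frequent word occurs in this sentence, else 0
    vector = [1 if word in tokens else 0 for word in freq_words]
    X.append(vector)
X = np.asarray(X)
print(X)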

[[0 0 0 ... 0 0 0]
 [1 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 1 0 ... 1 1 1]
 [1 1 1 ... 0 0 0]
 [1 1 0 ... 0 0 0]]


In [10]:
X.shape

(13, 100)

So here we have 13 sentences, each represented by a 100-dimensional vector of 0s and 1s, with one entry per word in the 100-word vocabulary.
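
To make the shape of the result concrete, here is a small self-contained sketch of the same idea on toy data. The function `binary_vectors`, the toy sentences and the three-word vocabulary are all invented for illustration, and a plain whitespace split stands in for `nltk.word_tokenize`.

```python
def binary_vectors(sentences, vocabulary):
    """Return one 0/1 vector per sentence, with one entry per vocabulary word."""
    vectors = []
    for sentence in sentences:
        # Whitespace split is an assumption here, standing in for a real tokenizer
        tokens = set(sentence.split())
        vectors.append([1 if word in tokens else 0 for word in vocabulary])
    return vectors

toy_sentences = ["that s beans", "i know beans and corn", "corn corn corn"]
toy_vocabulary = ["beans", "corn", "i"]

toy_binary = binary_vectors(toy_sentences, toy_vocabulary)
print(toy_binary)  # [[1, 0, 0], [1, 1, 1], [0, 1, 0]]
```

Three sentences and a three-word vocabulary give a 3 x 3 result, just as 13 sentences and 100 words give (13, 100). Note that the last sentence repeats "corn" three times but still gets a 1, because this representation records presence, not frequency.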

## Bag of Words (using sklearn and nltk)

In [11]:
import nltk

In [12]:
paragraph =  """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""

In [13]:
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer

In [14]:
# creating objects of stemmers and lemmatizers
stemmer = SnowballStemmer(language="english")
lemmatizer = WordNetLemmatizer()

In [15]:
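sentences = nltk.sent_tokenize(paragraph)
corpus = []
# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))

for i in range(len(sentences)):
    # Keep letters only, then lower-case and tokenize
    text = re.sub('[^a-zA-Z]', ' ', sentences[i])
    text = text.lower()
    text = nltk.word_tokenize(text)
    # Drop stop words and stem what remains
    text = [stemmer.stem(word) for word in text if word not in stop_words]
    text = ' '.join(text)
    corpus.append(text)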

In [16]:
corpus

['three vision india',
 'year histori peopl world come invad us captur land conquer mind',
 'alexand onward greek turk mogul portugues british french dutch came loot us took',
 'yet done nation',
 'conquer anyon',
 'grab land cultur histori tri enforc way life',
 '',
 'respect freedom other first vision freedom',
 'believ india got first vision start war independ',
 'freedom must protect nurtur build',
 'free one respect us',
 'second vision india develop',
 'fifti year develop nation',
 'time see develop nation',
 'among top nation world term gdp',
 'percent growth rate area',
 'poverti level fall',
 'achiev global recognis today',
 'yet lack self confid see develop nation self reliant self assur',
 'incorrect',
 'third vision',
 'india must stand world',
 'believ unless india stand world one respect us',
 'strength respect strength',
 'must strong militari power also econom power',
 'must go hand hand',
 'good fortun work three great mind',
 'dr vikram sarabhai dept',
 'space profess

In [23]:
# Create the bag-of-words matrix, keeping at most the 100 most frequent terms
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=100)
X = cv.fit_transform(corpus).toarray()

In [24]:
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [25]:
X.shape

(31, 100)