In [1]:
import nltk


**BoW, TF-IDF, Word2Vec**

**Bag of words**

#### Bag of Words (BoW) Model

Machine learning models cannot take raw text as input, so in Natural Language Processing we first convert text into numbers that a model can work with. The Bag of Words (BoW) model is one of the oldest and most fundamental ways of doing this.

The BoW model is very simple: it discards word order and grammar and keeps only how often each word occurs. In short, it turns a sentence or a paragraph into an unordered bag of words and converts every document into a fixed-length vector of numbers, with one position per vocabulary word.

Each word in the vocabulary is assigned a unique number (generally an array index), and the value stored at that index is the number of occurrences of the word in the document. This encoding captures which words are present and how often, not the order in which they appear.
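The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `bag_of_words`, the whitespace tokenizer, and the two toy documents are assumptions made for the example, not part of scikit-learn.

```python
from collections import Counter


def bag_of_words(documents):
    """Build a sorted vocabulary and one count vector per document."""
    # Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # Give every distinct word an index: its position in the sorted vocabulary.
    distinct_words = sorted({word for tokens in tokenized for word in tokens})
    vocabulary = {word: index for index, word in enumerate(distinct_words)}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One slot per vocabulary word, holding that word's count in this document.
        vectors.append([counts.get(word, 0) for word in vocabulary])
    return vocabulary, vectors


docs = ["the cat sat on the mat", "the dog sat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Both documents end up with vectors of the same length, even though the sentences themselves differ in length.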

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]

CountVectorizer tokenizes the text (tokenization means breaking a sentence, paragraph, or any other text into words) and performs very basic preprocessing along the way: it converts all words to lowercase, strips punctuation, and, with its default settings, drops single-character tokens such as "a".
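To see what this preprocessing amounts to, here is a rough stand-in written with the standard `re` module. It assumes CountVectorizer's default `token_pattern`, `(?u)\b\w\w+\b`, together with lowercasing; the helper name `simple_tokenize` is made up for this sketch.

```python
import re

# Assumption: mirrors CountVectorizer's default token_pattern, which keeps
# only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_tokenize(text):
    """Lowercase the text and extract tokens of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_tokenize("Natural Language Processing is a complex field.")
print(tokens)  # ['natural', 'language', 'processing', 'is', 'complex', 'field']
```

The full stop disappears and so does the one-letter word "a", which is why "a" never shows up in a vocabulary built with the default settings.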

In [3]:
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# each word is mapped to a unique column index

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


In [4]:
sentences = ["Machine learning is great ","Natural Language Processing is a complex field",
"Natural Language Processing is used in machine learning"]


vectorizer = CountVectorizer() 

# learn the vocabulary and build the document-term count matrix in one step
train_data_features = vectorizer.fit_transform(sentences)
print(vectorizer.vocabulary_)
# rows are sentences, columns are vocabulary words in index order
train_data_features.toarray()

{'machine': 7, 'learning': 6, 'is': 4, 'great': 2, 'natural': 8, 'language': 5, 'processing': 9, 'complex': 0, 'field': 1, 'used': 10, 'in': 3}


array([[0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0],
       [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0],
       [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int64)
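To read a row of this matrix, invert the vocabulary so that column indices map back to words. The sketch below copies the vocabulary and the first row from the output above; the variable names are chosen for the example.

```python
# Vocabulary and first matrix row, copied from the output above.
vocabulary = {'machine': 7, 'learning': 6, 'is': 4, 'great': 2, 'natural': 8,
              'language': 5, 'processing': 9, 'complex': 0, 'field': 1,
              'used': 10, 'in': 3}
first_row = [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0]

# Invert the mapping: column index -> word.
index_to_word = {index: word for word, index in vocabulary.items()}

# List the words whose count in the first sentence is non-zero.
present = [index_to_word[i] for i, count in enumerate(first_row) if count > 0]
print(present)  # ['great', 'is', 'learning', 'machine']
```

These are exactly the words of "Machine learning is great", listed in column order rather than sentence order.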

**Drawbacks of using a Bag-of-Words (BoW) Model**

In the above example, each sentence becomes a vector of length 11, one entry per vocabulary word. However, we start facing issues when we come across new sentences:

If the new sentences contain new words and we refit the vectorizer on them, the vocabulary grows and the vectors grow with it. (A vectorizer that has already been fitted simply ignores words it has not seen.)

Additionally, with a large vocabulary each document uses only a small fraction of the words, so the vectors contain mostly 0s, resulting in a sparse matrix that is costly to store and to learn from.

We retain no information about the grammar of the sentences or the ordering of the words in the text.
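Two of these drawbacks are easy to demonstrate with a toy example. The sentences, the helper `count_vector`, and the whitespace tokenizer below are assumptions made for illustration only.

```python
from collections import Counter


def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    return [counts.get(word, 0) for word in vocabulary]


vocabulary = ["bites", "dog", "man"]

# Loss of word order: two sentences with opposite meanings get the same vector.
a = count_vector("dog bites man", vocabulary)
b = count_vector("man bites dog", vocabulary)
print(a, b, a == b)  # [1, 1, 1] [1, 1, 1] True

# Vocabulary growth: a single unseen word makes every vector one entry longer.
grown_vocabulary = sorted(set(vocabulary) | set("man bites angry dog".split()))
print(len(vocabulary), len(grown_vocabulary))  # 3 4
```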
 



**TF-IDF**

**Term Frequency (TF)**: This summarizes how often a given word appears within a document.

**Document Frequency (DF)**: This counts how many documents in the corpus contain a given word.

**Inverse Document Frequency (IDF):** This is a weight that downscales words that appear in many documents. The more documents a word appears in, the lower its IDF score, and the less that word helps to distinguish one document from another.

For example, the word "the" appears in almost all English texts and would thus have a very low IDF score, as it carries very little "topic" information. In contrast, the word "coffee", while common, is not used as widely as "the". Thus, "coffee" would have a higher IDF score than "the".

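**TF-IDF**: This is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is the product of the word's term frequency in the document and its inverse document frequency across the corpus.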
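In its common textbook form, the definitions above combine as follows. Here $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$. Libraries often use smoothed or normalized variants, so treat this as a reference point rather than the exact formula behind any particular implementation.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

A word that appears in every document has $\mathrm{df}(t) = N$, so its IDF is $\log 1 = 0$ and it contributes nothing, however often it occurs.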

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The car is driven on the road.","The truck is driven on the highway"]

In [6]:
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)

TfidfVectorizer()

In [7]:
# the vectorizer was already fitted above, so transform is enough here
X = vectorizer.transform(text).toarray()
print(X)

[[0.42471719 0.30218978 0.         0.30218978 0.30218978 0.42471719
  0.60437955 0.        ]
 [0.         0.30218978 0.42471719 0.30218978 0.30218978 0.
  0.60437955 0.42471719]]


In [8]:
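# IDF values, one per vocabulary word: words found in both documents get 1.0,
# words found in only one document get about 1.405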
print(vectorizer.idf_)

[1.40546511 1.         1.40546511 1.         1.         1.40546511
 1.         1.40546511]
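The numbers above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for TfidfVectorizer: a smoothed IDF of ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The whitespace tokenizer and variable names are simplifications made for this example.

```python
import math
from collections import Counter

docs = ["the car is driven on the road", "the truck is driven on the highway"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each word occur?
df = {word: sum(word in tokens for tokens in tokenized) for word in vocabulary}
# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}


def tfidf_row(tokens):
    """Raw count times IDF for every vocabulary word, scaled to unit L2 norm."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


rows = [tfidf_row(tokens) for tokens in tokenized]
print(vocabulary)
print([round(idf[word], 8) for word in vocabulary])
print([round(value, 8) for value in rows[0]])
```

"the" gets the largest weight in both rows (about 0.604) even though its IDF is the lowest possible, because it occurs twice in each sentence. With only two documents, IDF cannot do much to suppress it.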


In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91956\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [10]:
nltk.download('stopwords')
 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\91956\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
nltk.download('wordnet')
 

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\91956\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**End Notes**

Let me summarize what we've covered so far:

Bag of Words just creates a set of vectors containing the count of word occurrences in each document, while the TF-IDF model also weights those counts so that more informative words score higher and very common words score lower.
Bag of Words vectors are easy to interpret. However, TF-IDF often performs better in machine learning models.


**word2vec**

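In both the BoW and TF-IDF approaches, semantic information is not stored: each word is just a column index, so two words with related meanings look as unrelated as any other pair.

TF-IDF gives more weight to uncommon words, but it still treats every word as independent of the others.

Because the vectors are sparse and as long as the vocabulary, there is a definite risk of overfitting.


In the word2vec model, each word is represented as a dense vector, typically with tens to hundreds of dimensions (gensim's default is 100).

Here, the semantic information and the relations between different words are also preserved.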



In [12]:
!pip install gensim



In [13]:
import re

import nltk
from gensim.models import Word2Vec
from nltk.corpus import stopwords
# if gensim is missing (for example under Anaconda), install it first with: !pip install gensim

In [14]:

paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""




In [15]:

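# Preprocessing the data
# remove bracketed reference markers such as [1]
text = re.sub(r'\[[0-9]*\]', ' ', paragraph)
# collapse runs of whitespace (including newlines) into a single space
text = re.sub(r'\s+', ' ', text)
text = text.lower()
text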


'i have three visions for india. in 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds. from alexander onwards, the greeks, the turks, the moguls, the portuguese, the british, the french, the dutch, all of them came and looted us, took over what was ours. yet we have not done this to any other nation. we have not conquered anyone. we have not grabbed their land, their culture, their history and tried to enforce our way of life on them. why? because we respect the freedom of others.that is why my first vision is that of freedom. i believe that india got its first vision of this in 1857, when we started the war of independence. it is this freedom that we must protect and nurture and build on. if we are not free, no one will respect us. my second vision for india’s development. for fifty years we have been a developing nation. it is time we see ourselves as a developed nation. we are among the top 5 nations of the wo

In [16]:
# Preparing the dataset
# split the text into sentences, then split each sentence into word tokens
sentences = nltk.sent_tokenize(text)

sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
sentences

[['i', 'have', 'three', 'visions', 'for', 'india', '.'],
 ['in',
  '3000',
  'years',
  'of',
  'our',
  'history',
  ',',
  'people',
  'from',
  'all',
  'over',
  'the',
  'world',
  'have',
  'come',
  'and',
  'invaded',
  'us',
  ',',
  'captured',
  'our',
  'lands',
  ',',
  'conquered',
  'our',
  'minds',
  '.'],
 ['from',
  'alexander',
  'onwards',
  ',',
  'the',
  'greeks',
  ',',
  'the',
  'turks',
  ',',
  'the',
  'moguls',
  ',',
  'the',
  'portuguese',
  ',',
  'the',
  'british',
  ',',
  'the',
  'french',
  ',',
  'the',
  'dutch',
  ',',
  'all',
  'of',
  'them',
  'came',
  'and',
  'looted',
  'us',
  ',',
  'took',
  'over',
  'what',
  'was',
  'ours',
  '.'],
 ['yet',
  'we',
  'have',
  'not',
  'done',
  'this',
  'to',
  'any',
  'other',
  'nation',
  '.'],
 ['we', 'have', 'not', 'conquered', 'anyone', '.'],
 ['we',
  'have',
  'not',
  'grabbed',
  'their',
  'land',
  ',',
  'their',
  'culture',
  ',',
  'their',
  'history',
  'and',
  'tried',
  

In [18]:
# build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
sentences = [[word for word in sentence if word not in stop_words]
             for sentence in sentences]
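Stop-word removal leaves punctuation tokens such as "." and "," in place, and they can end up in the Word2Vec vocabulary. One optional extra step, which the notebook itself does not perform, is to drop tokens that contain no letters or digits. The mini stop list and the two sentences below are hypothetical stand-ins so that the sketch runs without NLTK data.

```python
# Hypothetical mini stop list for illustration; the notebook uses NLTK's English list.
stop_words = {"i", "have", "for", "we", "not", "the", "of"}

sample_sentences = [
    ["i", "have", "three", "visions", "for", "india", "."],
    ["we", "have", "not", "conquered", "anyone", "."],
]

# Keep a token only if it is not a stop word and has at least one alphanumeric character.
cleaned = [
    [word for word in sentence
     if word not in stop_words and any(ch.isalnum() for ch in word)]
    for sentence in sample_sentences
]
print(cleaned)  # [['three', 'visions', 'india'], ['conquered', 'anyone']]
```

Whether to strip punctuation is a design choice. For a corpus this small it mainly keeps symbols out of the similarity results.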
    


In [20]:
# Training the Word2Vec model: every word kept in the vocabulary is mapped to a vector
# min_count=2 drops words that occur fewer than two times; the threshold can be changed
model = Word2Vec(sentences, min_count=2)
# word2vec is meant for very large corpora, e.g. a full Wikipedia dump; this paragraph is only a toy example

print(model)

Word2Vec(vocab=29, vector_size=100, alpha=0.025)
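The vocabulary holds only 29 words because `min_count=2` makes gensim ignore every word whose total frequency is below 2. The pruning step alone can be sketched with a counter. The helper `prune_vocabulary` and the toy sentences are assumptions for illustration and are not how gensim implements it internally.

```python
from collections import Counter


def prune_vocabulary(tokenized_sentences, min_count=2):
    """Return the words, with their counts, that occur at least min_count times."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return {word: count for word, count in counts.items() if count >= min_count}


toy_sentences = [
    ["freedom", "vision", "india"],
    ["vision", "nation"],
    ["india", "nation", "respect"],
]
kept = prune_vocabulary(toy_sentences, min_count=2)
print(sorted(kept))  # ['india', 'nation', 'vision']
```

Raising `min_count` shrinks the vocabulary further. On a short text like this one, most words occur only once, so even a threshold of 2 removes the majority of them.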


In [22]:
# words = model.wv.key_to_index  # gensim 4.x; older versions exposed this as model.wv.vocab
# words
# maps each vocabulary word to its index; every word has its own vector in model.wv

In [23]:



# Finding Word Vectors
vector = model.wv['vision']
vector
# each word is represented as a dense vector; here it has 100 dimensions (vector_size=100)


array([-8.2470402e-03,  9.3104085e-03, -1.9035455e-04, -1.9678909e-03,
        4.6107038e-03, -4.0970482e-03,  2.7569898e-03,  6.9491598e-03,
        6.0659950e-03, -7.5209257e-03,  9.3941428e-03,  4.6645785e-03,
        3.9729690e-03, -6.2513947e-03,  8.4632728e-03, -2.1494904e-03,
        8.8308081e-03, -5.3665224e-03, -8.1366980e-03,  6.8107611e-03,
        1.6734540e-03, -2.1905121e-03,  9.5245503e-03,  9.4974982e-03,
       -9.7854072e-03,  2.5109882e-03,  6.1551095e-03,  3.8626830e-03,
        2.0276017e-03,  4.2483912e-04,  6.7750097e-04, -3.8248522e-03,
       -7.1342522e-03, -2.0816177e-03,  3.9217160e-03,  8.8225566e-03,
        9.2622126e-03, -5.9755617e-03, -9.4068600e-03,  9.7596208e-03,
        3.4338485e-03,  5.1686266e-03,  6.2874253e-03, -2.8009326e-03,
        7.3261638e-03,  2.8237968e-03,  2.8704628e-03, -2.3871814e-03,
       -3.1312704e-03, -2.3685780e-03,  4.2830193e-03,  7.6171578e-05,
       -9.5888041e-03, -9.6761463e-03, -6.1491285e-03, -1.3418916e-04,
      

In [24]:
# Most similar words, ranked by cosine similarity
similar = model.wv.most_similar('nation')
similar

[('dr.', 0.3039070665836334),
 ('great', 0.17799635231494904),
 ('years', 0.16379284858703613),
 ('?', 0.16260458528995514),
 ('india', 0.1460980921983719),
 ('developed', 0.11138284206390381),
 ('minds', 0.09419557452201843),
 ('strength', 0.07514531910419464),
 ('three', 0.05126914009451866),
 ('vision', 0.04206670820713043)]
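The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes the same measure in plain Python. The two-dimensional vectors are invented for illustration and do not come from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-d vectors; real word2vec vectors here have 100 dimensions.
toy_vectors = {
    "nation": [1.0, 0.0],
    "country": [0.9, 0.1],
    "coffee": [0.0, 1.0],
}

query = "nation"
ranking = sorted(
    ((word, cosine_similarity(toy_vectors[query], vector))
     for word, vector in toy_vectors.items() if word != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated. The scores in the output above are all low, which is what one would expect from a model trained on a single paragraph: there is far too little text for meaningful neighbours to emerge.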