# Complete NLP Overview

### **NLP Overview**

Text Preprocessing 1: Goal: Cleaning the input: Tokenization, Lemmatization, Stemming

Text Preprocessing 2: Converting input text to vectors: Bag of Words, TF-IDF, Unigrams, Bigrams

Text Preprocessing 3: Converting input text to vectors: Word2Vec, AvgWord2Vec

Neural Networks: RNN, LSTM, GRU

Word Embedding: Convert input text to vectors

Transformers

BERT

### **Basic Terminologies**:
- Corpus - the whole text (here, a paragraph)
- Documents - the individual sentences in the corpus
- Vocabulary - the set of unique words in the corpus
- Words - all the words in the corpus
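
The four terms can be made concrete with a few lines of plain Python. This is only a sketch: the toy text and the naive split on `". "` are assumptions for illustration, not how NLTK splits sentences.

```python
# Toy corpus: one paragraph made of two sentences (documents).
corpus = "I like NLP. I like Python and NLP."

# Naive split on ". " purely for illustration; real sentence splitting needs more care.
documents = [s.strip(" .") for s in corpus.split(". ") if s.strip(" .")]

# Words: every whitespace-separated token in the corpus, lowercased.
words = [w.lower() for doc in documents for w in doc.split()]

# Vocabulary: the unique words, sorted for a stable order.
vocabulary = sorted(set(words))

print(documents)
print(len(words), "words,", len(vocabulary), "unique")
print(vocabulary)
```

The word count includes repeats, while the vocabulary counts each distinct word once.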


### **Tokenization:**

Paragraphs >> Sentences: splits a paragraph into sentences

Sentences >> Words: splits each sentence into words

Vocabulary >> the unique words (and their count)

Using the NLTK library

Tokenization is the process of converting paragraphs or sentences into tokens.

In [1]:
corpus="""Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.
"""

In [2]:
print(corpus)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course! to become expert in NLP.



In [5]:
#Tokenization: Paragraphs >> Sentences
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
documents  = sent_tokenize(corpus)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


In [6]:
for sentence in documents:
    print(sentence)

Hello Welcome,to Krish Naik's NLP Tutorials.
Please do watch the entire course!
to become expert in NLP.


In [7]:
#Tokenization: Paragraphs >> Words
from nltk.tokenize import word_tokenize
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [8]:
#Tokenization: Sentences >> Words
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'Krish', 'Naik', "'s", 'NLP', 'Tutorials', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


In [9]:
#punctuations
from nltk.tokenize import wordpunct_tokenize
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'",
 's',
 'NLP',
 'Tutorials',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [10]:
# Treebank tokenizer: a full stop is split off only at the end of the text, so 'Tutorials.' stays one token
from nltk.tokenize import TreebankWordTokenizer
tokenizer=TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'Krish',
 'Naik',
 "'s",
 'NLP',
 'Tutorials.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
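
The difference between `word_tokenize` and `wordpunct_tokenize` above comes down to how punctuation is split. A rough sketch of the punctuation-splitting behaviour with the standard `re` module is shown here; `simple_wordpunct` is a hypothetical helper, and the pattern is an assumption that reproduces the output above, not a statement about NLTK's internals.

```python
import re

def simple_wordpunct(text):
    # Match either a run of word characters, or a run of characters
    # that are neither word characters nor whitespace (punctuation).
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = simple_wordpunct("Hello Welcome,to Krish Naik's NLP Tutorials.")
print(tokens)
```

Because every punctuation run becomes its own token, `Naik's` is split into `Naik`, `'` and `s`, whereas `word_tokenize` keeps the clitic `'s` together.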

### **Stemming**

Stemming is the process of reducing a word to its stem by stripping affixes (suffixes and prefixes). The stem need not be a valid dictionary word, unlike the lemma produced by lemmatization. Stemming is widely used in natural language understanding (NLU) and natural language processing (NLP).


In [11]:
## Classification problem:
## is a product comment a positive review or a negative review?
## Reviews ----> [eating, eat, eaten] ---> eat, [going, gone, goes] ---> go

words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

**PorterStemmer**

In [12]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

In [13]:
for word in words:
    print(word + "----->"+stemming.stem(word))

eating----->eat
eats----->eat
eaten----->eaten
writing----->write
writes----->write
programming----->program
programs----->program
history----->histori
finally----->final
finalized----->final


With stemming, the result may not be a valid word and the original meaning can be lost, e.g. history > histori.

### RegexpStemmer class
NLTK's RegexpStemmer class implements regular-expression stemming. It takes a single regular expression and removes any part of the word that matches it, typically a prefix or suffix. Words shorter than `min` are left unchanged. For example:

In [14]:
from nltk.stem import RegexpStemmer
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$', min=4)

In [15]:
reg_stemmer.stem('eating')

'eat'

In [16]:
reg_stemmer.stem('enable')

'en'

In [17]:
reg_stemmer.stem('ingeating')

'ingeat'
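
The three results above follow directly from how the pattern is anchored. A minimal sketch of the same idea with the standard `re` module is given below; `regexp_stem` and its parameter names are hypothetical, and the behaviour is an approximation of `RegexpStemmer`, not its actual source.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    # Words shorter than min_length are returned unchanged.
    if len(word) < min_length:
        return word
    # Delete every match of the pattern. Each alternative ends in `$`,
    # so only text at the end of the word can match.
    return re.sub(pattern, "", word)

for w in ["eating", "enable", "ingeating", "is"]:
    print(w, "->", regexp_stem(w))
```

The leading `ing` in `ingeating` survives because `ing$` can only match at the end of the word, and `is` is returned as-is because it is shorter than the minimum length.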

### Snowball Stemmer
The Snowball stemmer, also known as the Porter2 stemming algorithm, is an improved version of the Porter stemmer that fixes some of its issues.

In [18]:
from nltk.stem import SnowballStemmer
snowballsstemmer=SnowballStemmer('english')

In [20]:
for word in words:
    print(word + '----->' + snowballsstemmer.stem(word))

eating----->eat
eats----->eat
eaten----->eaten
writing----->write
writes----->write
programming----->program
programs----->program
history----->histori
finally----->final
finalized----->final


In [21]:
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [22]:
snowballsstemmer.stem("fairly"),snowballsstemmer.stem("sportingly")

('fair', 'sport')

In [23]:
stemming.stem('goes'),snowballsstemmer.stem('goes')

('goe', 'goe')

### Lemmatization

Lemmatization is similar to stemming, but its output, called a ‘lemma’, is a valid root word rather than a root stem. After lemmatization we get a valid word with the same meaning.

NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. It uses the morphy() function of the WordNet CorpusReader class to find a lemma. For example:

In [26]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...


In [27]:
lemmatizer.lemmatize('going')

'going'

In [28]:
# POS tags accepted by lemmatize():
# noun - 'n' (default), verb - 'v', adjective - 'a', adverb - 'r'
lemmatizer.lemmatize("going",pos='v')

'go'

In [29]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]

In [30]:
for word in words:
    print(word+"---->"+lemmatizer.lemmatize(word,pos='v'))

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize


In [31]:
lemmatizer.lemmatize("goes",pos='v')

'go'

In [32]:
lemmatizer.lemmatize("fairly",pos='v'),lemmatizer.lemmatize("sportingly")

('fairly', 'sportingly')

**Compared to stemming, lemmatization takes longer to find the root words, because it looks words up in WordNet instead of applying simple suffix rules.**
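
The role of the `pos` argument in the examples above can be illustrated with a toy dictionary lookup. This is a sketch only: `TOY_LEXICON` and `toy_lemmatize` are hypothetical stand-ins, and the real WordNetLemmatizer consults WordNet rather than a hand-written table.

```python
# Hypothetical miniature lexicon: (inflected form, POS) -> lemma.
TOY_LEXICON = {
    ("going", "v"): "go",
    ("goes", "v"): "go",
    ("gone", "v"): "go",
    ("eaten", "v"): "eat",
    ("eating", "v"): "eat",
}

def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when the (word, pos) pair is unknown.
    return TOY_LEXICON.get((word.lower(), pos), word)

print(toy_lemmatize("going"))             # looked up as a noun: unchanged
print(toy_lemmatize("going", pos="v"))    # looked up as a verb: mapped to its lemma
print(toy_lemmatize("fairly", pos="v"))   # not in the lexicon: unchanged
```

A lookup keyed on both word and part of speech explains why `going` is returned unchanged with the default noun tag, and why words the lexicon does not know pass through untouched.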

### Text Processing: Stopwords

In [42]:
## Speech Of DR APJ Abdul Kalam
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
            the world have come and invaded us, captured our lands, conquered our minds. 
            From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
            the French, the Dutch, all of them came and looted us, took over what was ours. 
            Yet we have not done this to any other nation. We have not conquered anyone. 
            We have not grabbed their land, their culture, 
            their history and tried to enforce our way of life on them. 
            Why? Because we respect the freedom of others.That is why my 
            first vision is that of freedom. I believe that India got its first vision of 
            this in 1857, when we started the War of Independence. It is this freedom that
            we must protect and nurture and build on. If we are not free, no one will respect us.
            My second vision for India’s development. For fifty years we have been a developing nation.
            It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
            in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
            Our achievements are being globally recognised today. Yet we lack the self-confidence to
            see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
            I have a third vision. India must stand up to the world. Because I believe that unless India 
            stands up to the world, no one will respect us. Only strength respects strength. We must be 
            strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
            My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
            space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
            I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
            I see four milestones in my career"""

In [33]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [34]:
# Use the lowercase name: 'English' only resolves on case-insensitive file systems
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [43]:
# Tokenizing the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)

In [38]:
# Stemmer used after stopword removal
stemmer = PorterStemmer()

In [48]:
# Filter out stopwords, then apply stemming
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
    # Tokenize the sentence into words
    words = nltk.word_tokenize(sentences[i])
    # Stem the words that are not stopwords
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    # Join the words back into a sentence
    sentences[i] = ' '.join(words)

In [40]:
sentences

['i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom others.that first vision freedom .',
 'i believ india got first vision 1857 , start war independ .',
 'it freedom must protect nurtur build .',
 'if free , one respect us .',
 'my second vision india ’ develop .',
 'for fifti year develop nation .',
 'it time see develop nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recognis today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onl
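
Tokens such as `i`, `in` and `we` survive in the output above even though they are stopwords. The stopword list is lowercase, while the tokens are compared before being lowercased. A minimal sketch of the effect, using a small hand-written stopword set as a stand-in for NLTK's list (`STOP_WORDS` and `remove_stopwords` are hypothetical names):

```python
# Small hand-written stopword set, standing in for NLTK's English list.
STOP_WORDS = {"i", "have", "for", "in", "of", "our", "the", "and"}

def remove_stopwords(tokens, lowercase_first=False):
    # Optionally lowercase before the membership test.
    if lowercase_first:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["I", "have", "three", "visions", "for", "India", "."]

case_sensitive = remove_stopwords(tokens)
case_insensitive = remove_stopwords(tokens, lowercase_first=True)
print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering is a design choice; lowercasing first removes capitalised stopwords at the start of sentences, at the cost of losing case information.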

**Using Snowball Stemmer**

In [41]:
from nltk.stem import SnowballStemmer
snow_stemmer = SnowballStemmer('english')

In [44]:
# Tokenizing the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)
# Filter out stopwords, then apply stemming
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
    # Tokenize the sentence into words
    words = nltk.word_tokenize(sentences[i])
    # Stem the words that are not stopwords
    words = [snow_stemmer.stem(word) for word in words if word not in stop_words]
    # Join the words back into a sentence
    sentences[i] = ' '.join(words)

sentences

['i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom others.that first vision freedom .',
 'i believ india got first vision 1857 , start war independ .',
 'it freedom must protect nurtur build .',
 'if free , one respect us .',
 'my second vision india ’ develop .',
 'for fifti year develop nation .',
 'it time see develop nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recognis today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onl

**Using Lemmatization**

In [45]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [50]:
# Tokenizing the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)
# Filter out stopwords, then apply lemmatization
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
    # Tokenize the sentence into words
    words = nltk.word_tokenize(sentences[i])
    # Lemmatize (as verbs) the words that are not stopwords
    words = [lemmatizer.lemmatize(word.lower(), pos='v') for word in words if word not in stop_words]
    # Join the words back into a sentence
    sentences[i] = ' '.join(words)

sentences

['i three visions india .',
 'in 3000 years history , people world come invade us , capture land , conquer mind .',
 'from alexander onwards , greeks , turks , moguls , portuguese , british , french , dutch , come loot us , take .',
 'yet do nation .',
 'we conquer anyone .',
 'we grab land , culture , history try enforce way life .',
 'why ?',
 'because respect freedom others.that first vision freedom .',
 'i believe india get first vision 1857 , start war independence .',
 'it freedom must protect nurture build .',
 'if free , one respect us .',
 'my second vision india ’ development .',
 'for fifty years develop nation .',
 'it time see develop nation .',
 'we among top 5 nations world term gdp .',
 'we 10 percent growth rate areas .',
 'our poverty level fall .',
 'our achievements globally recognise today .',
 'yet lack self-confidence see develop nation , self-reliant self-assured .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'because i believe unle

### Text PreProcessing: Parts of Speech Tagging

In [51]:
from nltk.corpus import stopwords
sentences=nltk.sent_tokenize(paragraph)

In [52]:
sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over \n            the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n            the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, \n            their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my \n            first vision is that of freedom.',
 'I believe that India got its first vision of \n            this in 1857, when we started the War of Independence.',
 'It is this freedom that\n            we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s development.',
 'For fifty 

In [54]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [55]:
# Identifying POS tags
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [word for word in words if word not in stop_words]
    pos_tag = nltk.pos_tag(words)
    print(pos_tag)

[('I', 'PRP'), ('three', 'CD'), ('visions', 'NNS'), ('India', 'NNP'), ('.', '.')]
[('In', 'IN'), ('3000', 'CD'), ('years', 'NNS'), ('history', 'NN'), (',', ','), ('people', 'NNS'), ('world', 'NN'), ('come', 'VBP'), ('invaded', 'VBN'), ('us', 'PRP'), (',', ','), ('captured', 'VBD'), ('lands', 'NNS'), (',', ','), ('conquered', 'VBD'), ('minds', 'NNS'), ('.', '.')]
[('From', 'IN'), ('Alexander', 'NNP'), ('onwards', 'NNS'), (',', ','), ('Greeks', 'NNP'), (',', ','), ('Turks', 'NNP'), (',', ','), ('Moguls', 'NNP'), (',', ','), ('Portuguese', 'NNP'), (',', ','), ('British', 'NNP'), (',', ','), ('French', 'NNP'), (',', ','), ('Dutch', 'NNP'), (',', ','), ('came', 'VBD'), ('looted', 'JJ'), ('us', 'PRP'), (',', ','), ('took', 'VBD'), ('.', '.')]
[('Yet', 'RB'), ('done', 'VBN'), ('nation', 'NN'), ('.', '.')]
[('We', 'PRP'), ('conquered', 'VBD'), ('anyone', 'NN'), ('.', '.')]
[('We', 'PRP'), ('grabbed', 'VBD'), ('land', 'NN'), (',', ','), ('culture', 'NN'), (',', ','), ('history', 'NN'), ('tried'

In [56]:
"Taj Mahal is a beautiful Monument".split()

['Taj', 'Mahal', 'is', 'a', 'beautiful', 'Monument']

In [57]:
print(nltk.pos_tag("Taj Mahal is a beautiful Monument".split()))

[('Taj', 'NNP'), ('Mahal', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('beautiful', 'JJ'), ('Monument', 'NN')]
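
The Penn Treebank tags printed above can be connected to the `pos` argument used earlier for lemmatization. The sketch below maps each tag to one of the four codes `'n'`, `'v'`, `'a'`, `'r'` by its first letter; `penn_to_wordnet` is a hypothetical helper, and defaulting to noun for all other tags is an assumption.

```python
def penn_to_wordnet(tag, default="n"):
    # Penn Treebank tags: JJ* = adjective, VB* = verb, NN* = noun, RB* = adverb.
    first = tag[:1]
    if first == "J":
        return "a"
    if first == "V":
        return "v"
    if first == "N":
        return "n"
    if first == "R":
        return "r"
    # Determiners, prepositions, etc. have no matching code; fall back to the default.
    return default

# Tagged output copied from the cell above.
tagged = [('Taj', 'NNP'), ('Mahal', 'NNP'), ('is', 'VBZ'),
          ('a', 'DT'), ('beautiful', 'JJ'), ('Monument', 'NN')]

mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

With such a mapping, each word could be lemmatized with its own part of speech instead of a fixed `pos='v'`.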


### Text PreProcessing: Named Entity Recognition(NER)

In [58]:
sentence="""The Eiffel Tower was built from 1887 to 1889 by Gustave Eiffel, 
            whose company specialized in building metal frameworks and structures."""

In [59]:
import nltk
nltk.word_tokenize(sentence)

['The',
 'Eiffel',
 'Tower',
 'was',
 'built',
 'from',
 '1887',
 'to',
 '1889',
 'by',
 'Gustave',
 'Eiffel',
 ',',
 'whose',
 'company',
 'specialized',
 'in',
 'building',
 'metal',
 'frameworks',
 'and',
 'structures',
 '.']

In [61]:
# getting pos tags
tag_elements = nltk.pos_tag(nltk.word_tokenize(sentence))
tag_elements

[('The', 'DT'),
 ('Eiffel', 'NNP'),
 ('Tower', 'NNP'),
 ('was', 'VBD'),
 ('built', 'VBN'),
 ('from', 'IN'),
 ('1887', 'CD'),
 ('to', 'TO'),
 ('1889', 'CD'),
 ('by', 'IN'),
 ('Gustave', 'NNP'),
 ('Eiffel', 'NNP'),
 (',', ','),
 ('whose', 'WP$'),
 ('company', 'NN'),
 ('specialized', 'VBD'),
 ('in', 'IN'),
 ('building', 'NN'),
 ('metal', 'NN'),
 ('frameworks', 'NNS'),
 ('and', 'CC'),
 ('structures', 'NNS'),
 ('.', '.')]

In [65]:
#Getting the NER
nltk.download('maxent_ne_chunker')
nltk.download('words')


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [67]:
# Draw the named-entity tree (opens a separate window, so it is left commented out)
#nltk.ne_chunk(tag_elements).draw()

## Encoding

### One Hot Encoding
- Get the vocabulary of words from the sentences
- Represent each word in the vocabulary as a vector
- Create a vector for each word in the sentence
![image.png](attachment:image.png)

Advantages:
- Easy to implement in Python, e.g. using scikit-learn's OneHotEncoder or pandas get_dummies

Disadvantages:
- Creates a sparse matrix (mostly 0s with a few 1s), which can lead to overfitting
- Inputs have different sizes, because each sentence yields one vector per word and sentences differ in length
- No semantic meaning is captured
- Out of Vocabulary (OOV): words not seen in the training vocabulary cannot be encoded
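
The steps and drawbacks listed above can be sketched in plain Python. The sentence and the helper name `one_hot` are assumptions made for illustration; libraries such as scikit-learn or pandas would normally do this work.

```python
sentence = "the food is good"
tokens = sentence.split()

# Vocabulary in sorted order, with a position for each word.
vocabulary = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's vocabulary position.
    vec = [0] * len(vocabulary)
    vec[index[word]] = 1  # raises KeyError for an out-of-vocabulary word
    return vec

# One vector per word, so the number of rows depends on sentence length.
encoded = [one_hot(w) for w in tokens]
for word, vec in zip(tokens, encoded):
    print(word, vec)

print("pizza" in index)  # an unseen word has no position in the vocabulary
```

Each row contains exactly one 1, which is where the sparsity comes from, and a longer sentence would produce more rows, which is why the input size is not fixed.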

### Bag of Words:
- Remove stopwords, then get the vocabulary of the corpus
- Binary BoW (the value is 1 if the word occurs, regardless of how often) and BoW (the value is the word's frequency)
![image.png](attachment:image.png)

Advantages:
- Simple & Intuitive
- Input has fixed size - because we are creating vectors for the vocabulary

Disadvantages:
- Sparse matrix, hence might lead to overfitting
- Word order is discarded, so sentences with different meanings can get the same vector
- Out of Vocabulary: new words in the test data are ignored
- Semantic meaning is not captured.
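
A small pure-Python sketch of the two variants and of the drawbacks listed above. The three toy documents and the helper `bow_vector` are assumptions for illustration only.

```python
from collections import Counter

docs = ["good boy good", "good girl", "boy girl good"]

# Fixed vocabulary built from the whole corpus.
vocabulary = sorted({w for d in docs for w in d.split()})

def bow_vector(doc, binary=False):
    counts = Counter(doc.split())
    if binary:
        # Binary BoW: 1 if the word occurs at all, else 0.
        return [1 if counts[w] else 0 for w in vocabulary]
    # Count BoW: the number of occurrences of each vocabulary word.
    return [counts[w] for w in vocabulary]

print(vocabulary)
print(bow_vector("good boy good"))               # counts
print(bow_vector("good boy good", binary=True))  # presence only
print(bow_vector("boy good good"))               # same vector: order is lost
print(bow_vector("bad boy"))                     # 'bad' is out of vocabulary and ignored
```

Every document maps to a vector of the same length as the vocabulary, which gives the fixed input size, while reordering the words leaves the vector unchanged.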


### Ngrams
N-grams help capture local word order and context by treating sequences of adjacent words as features
- Bigrams
- Trigrams

![image.png](attachment:image.png)
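
Bigrams and trigrams can be generated with a sliding window over the tokens. The sketch below uses a hypothetical helper `ngrams` and a made-up sentence.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the tokens and join each window into one feature.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the food is not good".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

The bigram `not good` keeps the negation next to the word it modifies, which a unigram bag of words cannot do.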




**Practical Implementation**

In [1]:
paragraph = """
Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] ⓘ; born 17 September 1950)[a] is an Indian politician who has served as the 14th Prime Minister of India since 26 May 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.[3]

Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the party hierarchy, becoming general secretary in 1998.[b] In 2001, Modi was appointed Chief Minister of Gujarat and elected to the legislative assembly soon after. His administration is considered complicit in the 2002 Gujarat riots,[c] and has been criticised for its management of the crisis. According to official records, a little over 1,000 people were killed, three-quarters of whom were Muslim; independent sources estimated 2,000 deaths, mostly Muslim.[12] A Special Investigation Team appointed by the Supreme Court of India in 2012 found no evidence to initiate prosecution proceedings against him.[d] While his policies as chief minister were credited for encouraging economic growth, his administration was criticised for failing to significantly improve health, poverty and education indices in the state.[e]

In the 2014 Indian general election, Modi led the BJP to a parliamentary majority, the first for a party since 1984. His administration increased direct foreign investment, and it reduced spending on healthcare, education, and social-welfare programmes. Modi began a high-profile sanitation campaign, controversially initiated a demonetisation of banknotes and introduced the Goods and Services Tax, and weakened or abolished environmental and labour laws. Modi's administration launched the 2019 Balakot airstrike against an alleged terrorist training camp in Pakistan. The airstrike failed,[15][16] but the action had nationalist appeal.[17] Modi's party won the 2019 general election which followed.[18] In its second term, his administration revoked the special status of Jammu and Kashmir,[19][20] and introduced the Citizenship Amendment Act, prompting widespread protests, and spurring the 2020 Delhi riots in which Muslims were brutalised and killed by Hindu mobs.[21][22][23] Three controversial farm laws led to sit-ins by farmers across the country, eventually causing their formal repeal. Modi oversaw India's response to the COVID-19 pandemic, during which, according to the World Health Organization's estimates, 4.7 million Indians died.[24][25] In the 2024 general election, Modi's party lost its majority in the lower house of Parliament and formed a government leading the National Democratic Alliance coalition.[26][27]

Under Modi's tenure, India has experienced democratic backsliding, or the weakening of democratic institutions, individual rights, and freedom of expression.[28][29][f] As prime minister, he has received consistently high approval ratings.[35][36][37] Modi has been described as engineering a political realignment towards right-wing politics. He remains a controversial figure domestically and internationally, over his Hindu nationalist beliefs and handling of the Gujarat riots, which have been cited as evidence of a majoritarian and exclusionary social agenda.[g]


"""


In [2]:
print(paragraph)


Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] ⓘ; born 17 September 1950)[a] is an Indian politician who has served as the 14th Prime Minister of India since 26 May 2014. Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister outside the Indian National Congress.[3]

Modi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education. He was introduced to the RSS at the age of eight. At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so. Modi became a full-time worker for the RSS in Gujarat in 1971. The RSS assigned him to the BJP in 1985 and he rose through the par

In [3]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [4]:
# Tokenization: Convert paragraph to sentences
nltk.download('punkt')
sentences = nltk.sent_tokenize(paragraph)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
print(sentences)

['\nNarendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] ⓘ; born 17 September 1950)[a] is an Indian politician who has served as the 14th Prime Minister of India since 26 May 2014.', 'Modi was the chief minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi.', 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right wing Hindu nationalist paramilitary volunteer organisation.', 'He is the longest-serving prime minister outside the Indian National Congress.', '[3]\n\nModi was born and raised in Vadnagar in northeastern Gujarat, where he completed his secondary education.', 'He was introduced to the RSS at the age of eight.', 'At the age of 18, he was married to Jashodaben Modi, whom he abandoned soon after, only publicly acknowledging her four decades later when legally required to do so.', 'Modi became a full-time worker for the RSS in Gujarat in 1971.', 'The RSS assigned him to the BJP in 19

In [6]:
#Stemming
stemmer = PorterStemmer()

In [7]:
stemmer.stem('history')

'histori'

In [8]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [9]:
lemmatizer.lemmatize('history')

'history'

In [10]:
import re
corpus = []

In [11]:
# Cleaning the corpus: replace every non-letter character with a space, then lowercase
for i in range(len(sentences)):
    text = re.sub('[^a-zA-Z]',' ',sentences[i])
    text = text.lower()
    corpus.append(text)

In [36]:
print(corpus)

[' narendra damodardas modi  gujarati    n  end   d mod   d s  modi      born    september       a  is an indian politician who has served as the   th prime minister of india since    may      ', 'modi was the chief minister of gujarat from      to      and is the member of parliament  mp  for varanasi ', 'he is a member of the bharatiya janata party  bjp  and of the rashtriya swayamsevak sangh  rss   a right wing hindu nationalist paramilitary volunteer organisation ', 'he is the longest serving prime minister outside the indian national congress ', '     modi was born and raised in vadnagar in northeastern gujarat  where he completed his secondary education ', 'he was introduced to the rss at the age of eight ', 'at the age of     he was married to jashodaben modi  whom he abandoned soon after  only publicly acknowledging her four decades later when legally required to do so ', 'modi became a full time worker for the rss in gujarat in      ', 'the rss assigned him to the bjp in      

In [None]:
# Tokenizing the paragraph into sentences
sentences = nltk.sent_tokenize(paragraph)
# Filter out stopwords, then apply lemmatization
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
    # Tokenize the sentence into words
    words = nltk.word_tokenize(sentences[i])
    # Lemmatize (as verbs) the words that are not stopwords
    words = [lemmatizer.lemmatize(word.lower(), pos='v') for word in words if word not in stop_words]
    # Join the words back into a sentence
    sentences[i] = ' '.join(words)

In [37]:
# Applying lemmatization
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(corpus)):
    # Tokenize the sentence into words
    words = nltk.word_tokenize(corpus[i])
    # Lemmatize the words that are not stopwords
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # Join the words back into a sentence
    corpus[i] = ' '.join(words)

In [38]:
corpus

['narendra damodardas modi gujarati n end mod modi born september indian politician served th prime minister india since may',
 'modi chief minister gujarat member parliament mp varanasi',
 'member bharatiya janata party bjp rashtriya swayamsevak sangh r right wing hindu nationalist paramilitary volunteer organisation',
 'longest serving prime minister outside indian national congress',
 'modi born raised vadnagar northeastern gujarat completed secondary education',
 'introduced r age eight',
 'age married jashodaben modi abandoned soon publicly acknowledging four decade later legally required',
 'modi became full time worker r gujarat',
 'r assigned bjp rose party hierarchy becoming general secretary',
 'b modi appointed chief minister gujarat elected legislative assembly soon',
 'administration considered complicit gujarat riot c criticised management crisis',
 'according official record little people killed three quarter muslim independent source estimated death mostly muslim',
 'sp

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

In [47]:
cv=CountVectorizer(binary=True)

In [48]:
X = cv.fit_transform(corpus)

In [49]:
cv.vocabulary_

{'narendra': 149,
 'damodardas': 50,
 'modi': 145,
 'gujarati': 95,
 'end': 66,
 'mod': 144,
 'born': 28,
 'september': 199,
 'indian': 109,
 'politician': 167,
 'served': 200,
 'th': 221,
 'prime': 170,
 'minister': 142,
 'india': 108,
 'since': 204,
 'may': 139,
 'chief': 33,
 'gujarat': 94,
 'member': 140,
 'parliament': 161,
 'mp': 147,
 'varanasi': 227,
 'bharatiya': 26,
 'janata': 119,
 'party': 163,
 'bjp': 27,
 'rashtriya': 180,
 'swayamsevak': 215,
 'sangh': 194,
 'right': 191,
 'wing': 233,
 'hindu': 101,
 'nationalist': 151,
 'paramilitary': 160,
 'volunteer': 228,
 'organisation': 154,
 'longest': 132,
 'serving': 202,
 'outside': 156,
 'national': 150,
 'congress': 39,
 'raised': 179,
 'vadnagar': 226,
 'northeastern': 152,
 'completed': 37,
 'secondary': 197,
 'education': 61,
 'introduced': 115,
 'age': 8,
 'eight': 62,
 'married': 138,
 'jashodaben': 120,
 'abandoned': 0,
 'soon': 207,
 'publicly': 177,
 'acknowledging': 3,
 'four': 87,
 'decade': 52,
 'later': 124,
 'l

In [43]:
corpus[0]

'narendra damodardas modi gujarati n end mod modi born september indian politician served th prime minister india since may'

In [45]:
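# Note: this output was produced before binary=True was set (In [45] ran before In [47]),
# so 'modi' (index 145), which occurs twice, is counted as 2; with binary=True it would be 1
X[0].toarray()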

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 2, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
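
The difference between the default count encoding and `binary=True` can be sketched without scikit-learn. This is only an illustration, not how `CountVectorizer` is implemented: the helper `bag_of_words` and the four-word vocabulary are made up for the example, while the sentence is `corpus[0]` from above.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary, binary=False):
    """Return one count (or 0/1 flag) per vocabulary term, in vocabulary order."""
    counts = Counter(sentence.split())
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]

sentence = ("narendra damodardas modi gujarati n end mod modi born september "
            "indian politician served th prime minister india since may")

# Toy vocabulary (hypothetical), sorted alphabetically like cv.vocabulary_ indices.
vocabulary = sorted({"modi", "born", "india", "gujarat"})

count_vector = bag_of_words(sentence, vocabulary)
binary_vector = bag_of_words(sentence, vocabulary, binary=True)

print(vocabulary)
print(count_vector)   # 'modi' occurs twice, so its count is 2
print(binary_vector)  # with binary=True the same position is capped at 1
```

With counts, a repeated word gets a larger value; with the binary variant, only presence or absence is recorded.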

**Bag of Words**

In [52]:
#Getting trigrams
cv=CountVectorizer(binary=True,ngram_range=(3,3))
X = cv.fit_transform(corpus)
cv.vocabulary_

{'narendra damodardas modi': 183,
 'damodardas modi gujarati': 54,
 'modi gujarati end': 175,
 'gujarati end mod': 103,
 'end mod modi': 72,
 'mod modi born': 165,
 'modi born september': 172,
 'born september indian': 33,
 'september indian politician': 233,
 'indian politician served': 124,
 'politician served th': 204,
 'served th prime': 234,
 'th prime minister': 254,
 'prime minister india': 206,
 'minister india since': 162,
 'india since may': 121,
 'modi chief minister': 173,
 'chief minister gujarat': 38,
 'minister gujarat member': 161,
 'gujarat member parliament': 100,
 'member parliament mp': 157,
 'parliament mp varanasi': 195,
 'member bharatiya janata': 156,
 'bharatiya janata party': 28,
 'janata party bjp': 136,
 'party bjp rashtriya': 197,
 'bjp rashtriya swayamsevak': 30,
 'rashtriya swayamsevak sangh': 215,
 'swayamsevak sangh right': 248,
 'sangh right wing': 230,
 'right wing hindu': 224,
 'wing hindu nationalist': 264,
 'hindu nationalist paramilitary': 112,
 '

In [53]:
#Getting bigrams and trigrams
cv=CountVectorizer(binary=True,ngram_range=(2,3))
X = cv.fit_transform(corpus)
cv.vocabulary_

{'narendra damodardas': 375,
 'damodardas modi': 112,
 'modi gujarati': 358,
 'gujarati end': 211,
 'end mod': 148,
 'mod modi': 339,
 'modi born': 351,
 'born september': 70,
 'september indian': 477,
 'indian politician': 255,
 'politician served': 420,
 'served th': 479,
 'th prime': 521,
 'prime minister': 424,
 'minister india': 333,
 'india since': 248,
 'since may': 487,
 'narendra damodardas modi': 376,
 'damodardas modi gujarati': 113,
 'modi gujarati end': 359,
 'gujarati end mod': 212,
 'end mod modi': 149,
 'mod modi born': 340,
 'modi born september': 353,
 'born september indian': 71,
 'september indian politician': 478,
 'indian politician served': 256,
 'politician served th': 421,
 'served th prime': 480,
 'th prime minister': 522,
 'prime minister india': 425,
 'minister india since': 334,
 'india since may': 249,
 'modi chief': 354,
 'chief minister': 79,
 'minister gujarat': 330,
 'gujarat member': 206,
 'member parliament': 324,
 'parliament mp': 401,
 'mp varanasi
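
What `ngram_range` produces can be sketched in plain Python. The function `ngrams` below is a hypothetical helper written for illustration; it mimics the effect of `ngram_range=(n_min, n_max)` on a single tokenized sentence and is not scikit-learn's code.

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "narendra damodardas modi gujarati".split()

trigrams = ngrams(tokens, 3, 3)      # like ngram_range=(3,3)
bi_and_tri = ngrams(tokens, 2, 3)    # like ngram_range=(2,3)

print(trigrams)
print(bi_and_tri)
```

The vocabulary grows quickly as the range widens, which is why the indices in the bigram and trigram output above are much larger than in the unigram case.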

### Term Frequency - Inverse Document Frequency (TF-IDF)

 - Gives higher weights to words that are frequent within a sentence but rare across the corpus
 - Term Frequency captures how often a word occurs within a sentence
 - Inverse Document Frequency captures how rare a word is across sentences, so rare words are weighted up and common words are weighted down
 ![image.png](attachment:image.png)

 Term Frequency is computed for every word in every sentence, while IDF is computed once per word across the whole corpus
 ![image-2.png](attachment:image-2.png)
 ![image-3.png](attachment:image-3.png)

 **Advantages**
 - Intuitive
 - Word importance is captured

 **Disadvantages**
 - Sparsity
 - Out-of-vocabulary (OOV) words are not handled
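
A small worked example of the computation, as a sketch only. The formulas in the images are not reproduced here, so this assumes the common textbook definitions TF = (count of the word in the sentence) / (number of words in the sentence) and IDF = log(number of sentences / number of sentences containing the word). The three toy sentences and the function names are made up for the example. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

def term_frequency(term, document):
    """Share of the document's tokens that are `term`."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, documents):
    """log(N / df): zero for a term found in every document, larger for rarer terms."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

# Toy corpus (hypothetical), already tokenized.
documents = [
    "good boy".split(),
    "good girl".split(),
    "boy girl good".split(),
]

tfidf = [
    {term: term_frequency(term, doc) * inverse_document_frequency(term, documents)
     for term in doc}
    for doc in documents
]

for row in tfidf:
    print(row)
```

The word `good` appears in every sentence, so its IDF and therefore its TF-IDF weight is zero, while `boy` and `girl` keep a positive weight.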









**Practical Implementation of TF-IDF**

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus)

In [55]:
corpus[0]

'narendra damodardas modi gujarati n end mod modi born september indian politician served th prime minister india since may'

In [56]:
X[0].toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.23266678, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.26123175, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.26123175, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

In [57]:
#Using Max Features
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer(max_features=3)
X = cv.fit_transform(corpus)
X[0].toarray()

array([[0.        , 0.58700265, 0.80958501]])

### Word Embedding

Word embedding is a technique that converts words into vectors. There are two broad approaches:
- Count and frequency based (BoW, TF-IDF)
- Deep learning trained models (Word2Vec)

**Word2Vec**: Feature representation
- Uses a limited, fixed number of dimensions
- Sparsity is reduced
- Retains semantic meaning

The Word2Vec objective function causes words that occur in similar contexts to have similar embeddings.

**Types of Word2Vec**

**Continuous Bag of Words(CBoW)**
- A window of a chosen size slides over the text; the center word is the target and the words to its left and right are the context.
- The goal is to predict the target word from the context
![image.png](attachment:image.png)


**Skipgram**
 - The input & output from CBoW are interchanged
 - Predict the context words from the target word
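
The two setups can be sketched by generating their training pairs from one sentence. This is a simplified illustration: `cbow_pairs` and `skipgram_pairs` are hypothetical helper names, the sentence is a toy example, and real Word2Vec training adds details such as subsampling and negative sampling that are left out here.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) pairs: context is the input, target the output."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def skipgram_pairs(tokens, window):
    """Return (target_word, context_word) pairs: CBoW with input and output swapped."""
    return [(target, ctx)
            for context, target in cbow_pairs(tokens, window)
            for ctx in context]

tokens = "please do watch the entire course".split()

cbow = cbow_pairs(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)

print(cbow[:3])
print(skip[:3])
```

CBoW produces one example per position with several context words as input, while skip-gram produces one example per (target, context) pair.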



### Practical Implementation

In [59]:
import pandas as pd
messages=pd.read_csv('SMSSpamCollection.txt',
                    sep='\t',names=["label","message"])

In [60]:
messages.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [61]:
messages.shape

(5572, 2)

In [62]:
messages['message'].loc[100]

"Please don't text me anymore. I have nothing else to say."

**Data Cleaning & Preprocessing**
- Tokenization, Stopwords, Stemming/Lemmatization
- Convert Text to vectors: BoW, TFIDF, Word2Vec, AvgWord2Vec

In [63]:
## Data Cleaning And Preprocessing
import re
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [64]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()

In [65]:
corpus = []
# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
for i in range(0,len(messages)):
    # Replace everything except letters with a space
    review=re.sub('[^a-zA-Z]',' ',messages['message'][i])
    # Lowercase to prevent duplicate words
    review=review.lower()
    # Splitting into words
    review=review.split()
    # Stemming and removing stopwords
    review=[ps.stem(word) for word in review if word not in stop_words]
    # Joining back the sentence
    review=' '.join(review)
    corpus.append(review)
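
One detail of the character class used in this kind of cleaning step is worth checking: the range `A-z` (lowercase `z`) is not the same as `A-Z`. The snippet below is a small illustration on a made-up string; it only uses the standard `re` module.

```python
import re

text = "Win_a [FREE] prize^now"

# 'A-z' spans ASCII 65..122, which also covers the symbols [ \ ] ^ _ `
# that sit between 'Z' and 'a', so they are treated as letters and kept.
loose = re.sub('[^a-zA-z]', ' ', text)

# 'A-Z' covers only the 26 uppercase letters, so those symbols are replaced.
strict = re.sub('[^a-zA-Z]', ' ', text)

print(loose)
print(strict.split())
```

With the looser range, tokens such as `win_a` survive into the vocabulary as if they were single words.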

In [66]:
corpus

['go jurong point crazi avail bugi n great world la e buffet cine got amor wat',
 'ok lar joke wif u oni',
 'free entri wkli comp win fa cup final tkt st may text fa receiv entri question std txt rate c appli',
 'u dun say earli hor u c alreadi say',
 'nah think goe usf live around though',
 'freemsg hey darl week word back like fun still tb ok xxx std chg send rcv',
 'even brother like speak treat like aid patent',
 'per request mell mell oru minnaminungint nurungu vettam set callertun caller press copi friend callertun',
 'winner valu network custom select receivea prize reward claim call claim code kl valid hour',
 'mobil month u r entitl updat latest colour mobil camera free call mobil updat co free',
 'gonna home soon want talk stuff anymor tonight k cri enough today',
 'six chanc win cash pound txt csh send cost p day day tsandc appli repli hl info',
 'urgent week free membership prize jackpot txt word claim c www dbuk net lccltd pobox ldnw rw',
 'search right word thank breather

### **Applying Bag of Words**

In [84]:
## Create the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
## For binary BoW, pass binary=True
cv=CountVectorizer(max_features=2500,ngram_range=(1,2))

In [85]:
## independent features
X=cv.fit_transform(corpus).toarray()

In [86]:
cv.vocabulary_

{'go': np.int64(819),
 'point': np.int64(1629),
 'crazi': np.int64(457),
 'avail': np.int64(123),
 'bugi': np.int64(234),
 'great': np.int64(867),
 'world': np.int64(2420),
 'la': np.int64(1104),
 'cine': np.int64(356),
 'got': np.int64(858),
 'wat': np.int64(2324),
 'ok': np.int64(1476),
 'lar': np.int64(1117),
 'joke': np.int64(1062),
 'wif': np.int64(2376),
 'oni': np.int64(1498),
 'ok lar': np.int64(1479),
 'free': np.int64(746),
 'entri': np.int64(623),
 'wkli': np.int64(2403),
 'comp': np.int64(407),
 'win': np.int64(2381),
 'cup': np.int64(472),
 'final': np.int64(703),
 'st': np.int64(1964),
 'may': np.int64(1272),
 'text': np.int64(2087),
 'receiv': np.int64(1714),
 'question': np.int64(1683),
 'std': np.int64(1980),
 'txt': np.int64(2200),
 'rate': np.int64(1695),
 'appli': np.int64(87),
 'free entri': np.int64(752),
 'entri wkli': np.int64(625),
 'wkli comp': np.int64(2404),
 'std txt': np.int64(1981),
 'txt rate': np.int64(2206),
 'rate appli': np.int64(1696),
 'dun': np.in

In [70]:
X.shape

(5572, 2500)

In [71]:
## Output Features
y=pd.get_dummies(messages['label'])

In [72]:
y

| | ham | spam |
|---|---|---|
| 0 | True | False |
| 1 | True | False |
| 2 | False | True |
| 3 | True | False |
| 4 | True | False |
| ... | ... | ... |
| 5567 | False | True |
| 5568 | True | False |
| 5569 | True | False |
| 5570 | True | False |


In [73]:
# Column 0 is 'ham', so y is True for ham and False for spam
y=y.iloc[:,0].values

In [74]:
y.shape

(5572,)

In [75]:
## Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20)

In [76]:
X_train,y_train

(array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]),
 array([ True,  True,  True, ..., False,  True,  True]))

In [79]:
#Using NaiveBayes
import numpy as np
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train,y_train)

In [80]:
#Predictions
y_pred = spam_detect_model.predict(X_test)

In [81]:
from sklearn.metrics import accuracy_score,classification_report
accuracy_score(y_test,y_pred)

0.9820627802690582

In [82]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       False       0.94      0.93      0.94       162
        True       0.99      0.99      0.99       953

    accuracy                           0.98      1115
   macro avg       0.97      0.96      0.96      1115
weighted avg       0.98      0.98      0.98      1115



### Applying TFIDF

In [88]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=2500,ngram_range=(1,2))
X = tv.fit_transform(corpus).toarray()

In [89]:
#Train Test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state=0)

In [90]:
#Applying Naive Bayes
from sklearn.naive_bayes import MultinomialNB
spam_detect_model = MultinomialNB().fit(X_train,y_train)

In [92]:
#Predictions
y_pred = spam_detect_model.predict(X_test)

In [93]:
score = accuracy_score(y_test,y_pred)
print(score)

0.979372197309417


In [94]:
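# classification_report expects (y_true, y_pred). The report below was generated with
# the two arguments swapped, so its precision and recall columns are interchanged.
print(classification_report(y_test,y_pred))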

              precision    recall  f1-score   support

       False       0.86      1.00      0.92       137
        True       1.00      0.98      0.99       978

    accuracy                           0.98      1115
   macro avg       0.93      0.99      0.96      1115
weighted avg       0.98      0.98      0.98      1115
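
Since scikit-learn's `classification_report` takes `(y_true, y_pred)` in that order, it helps to see why the order matters. The sketch below computes precision and recall by hand; `precision_recall` is a hypothetical helper and the labels are made up for illustration, they are not taken from the SMS data.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, computed from raw label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted = sum(1 for p in y_pred if p == positive)   # denominator of precision
    actual = sum(1 for t in y_true if t == positive)      # denominator of recall
    return tp / predicted, tp / actual

# Made-up labels for illustration only (True = ham, False = spam).
y_true = [False, False, False, True, True, True, True, True]
y_pred = [False, False, True, True, True, True, True, True]

correct_order = precision_recall(y_true, y_pred, positive=False)
swapped_order = precision_recall(y_pred, y_true, positive=False)

print(correct_order)
print(swapped_order)
```

Swapping the two arguments exchanges precision and recall for each class, and the support column then counts predicted labels instead of true labels. Accuracy is unaffected because it is symmetric.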



**Using Random Forest Classifier**

In [96]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train,y_train)

In [97]:
y_pred = classifier.predict(X_test)

In [98]:
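# Both functions expect (y_true, y_pred). The report below was generated with the
# arguments swapped, so its precision and recall columns are interchanged.
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))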

0.9847533632286996
              precision    recall  f1-score   support

       False       0.89      1.00      0.94       143
        True       1.00      0.98      0.99       972

    accuracy                           0.98      1115
   macro avg       0.95      0.99      0.97      1115
weighted avg       0.99      0.98      0.99      1115



### AvgWord2Vec

- The underlying Word2Vec model can be either Skipgram or CBoW
- Pretrained vs. training your own model: train your own model if the pretrained model covers less than about 70% of the words in your corpus
![image.png](attachment:image.png)

- Average the vectors of all words in a sentence to get a single vector, of the same dimension as the word vectors, that represents the sentence.
![image-2.png](attachment:image-2.png)
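
The averaging step can be sketched in plain Python. The embeddings below are hypothetical 3-dimensional vectors chosen so the arithmetic is easy to follow, and `average_vector` is an illustrative helper, not part of gensim.

```python
def average_vector(tokens, embeddings, dimension):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dimension
    # zip(*known) groups the values of each dimension across all word vectors.
    return [sum(values) / len(known) for values in zip(*known)]

# Hypothetical embeddings; a trained model would supply these.
embeddings = {
    "free":  [1.0, 0.0, 2.0],
    "prize": [3.0, 2.0, 0.0],
}

sentence_vector = average_vector(["free", "prize", "unknownword"], embeddings, 3)
empty_vector = average_vector(["unknownword"], embeddings, 3)

print(sentence_vector)
print(empty_vector)
```

Every sentence ends up with a vector of the same length regardless of how many words it has, which is what a downstream classifier needs. Out-of-vocabulary words are skipped here; that is one possible choice, not the only one.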


### Word2Vec Practical Implementation

In [99]:
#Applying Lemmatization to the corpus
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


In [100]:
corpus = []
for i in range(0, len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['message'][i])
    review = review.lower()
    review = review.split()
    
    review = [lemmatizer.lemmatize(word) for word in review]
    review = ' '.join(review)
    corpus.append(review)

In [101]:
corpus

['go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat',
 'ok lar joking wif u oni',
 'free entry in a wkly comp to win fa cup final tkts st may text fa to to receive entry question std txt rate t c s apply over s',
 'u dun say so early hor u c already then say',
 'nah i don t think he go to usf he life around here though',
 'freemsg hey there darling it s been week s now and no word back i d like some fun you up for it still tb ok xxx std chgs to send to rcv',
 'even my brother is not like to speak with me they treat me like aid patent',
 'a per your request melle melle oru minnaminunginte nurungu vettam ha been set a your callertune for all caller press to copy your friend callertune',
 'winner a a valued network customer you have been selected to receivea prize reward to claim call claim code kl valid hour only',
 'had your mobile month or more u r entitled to update to the latest colour mobile with camera for free call the mobile up

In [127]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [104]:
words=[]
for message in corpus:
    # Split each cleaned message into sentences, then tokenize each sentence
    for sent in sent_tokenize(message):
        words.append(simple_preprocess(sent))

In [105]:
words[0]

['go',
 'until',
 'jurong',
 'point',
 'crazy',
 'available',
 'only',
 'in',
 'bugis',
 'great',
 'world',
 'la',
 'buffet',
 'cine',
 'there',
 'got',
 'amore',
 'wat']

**Creating Word2Vec**

In [129]:
from gensim.models import Word2Vec

In [130]:
## Let's train Word2Vec from scratch
model=Word2Vec(words)

In [None]:
## To Get All the Vocabulary
model.wv.index_to_key

In [None]:
model.corpus_count

In [None]:
model.epochs

In [None]:
#Getting similar words
model.wv.similar_by_word('good')
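
Similarity between word vectors is typically measured with cosine similarity, which is what gensim's similarity queries rank by. A minimal sketch on hypothetical 2-dimensional vectors (the words and values are made up, and `cosine_similarity` is an illustrative helper):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors; a trained model would provide real ones.
vectors = {
    "good":  [1.0, 1.0],
    "great": [2.0, 2.0],
    "bad":   [1.0, -1.0],
}

# Rank every other word by its similarity to 'good', most similar first.
ranked = sorted(
    (word for word in vectors if word != "good"),
    key=lambda word: cosine_similarity(vectors["good"], vectors[word]),
    reverse=True,
)
print(ranked)
```

Cosine similarity ignores vector length, so `great` counts as maximally similar to `good` even though its vector is twice as long.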

### Applying AvgWord2Vec

In [None]:
def avg_word2vec(doc):
    # Keep only in-vocabulary words (key_to_index is a dict, so lookups are fast)
    vectors = [model.wv[word] for word in doc if word in model.wv.key_to_index]
    # Fall back to a zero vector when no word of the document is in the vocabulary
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

In [124]:
!pip install tqdm



In [None]:
from tqdm import tqdm