In [1]:
import nltk

In [2]:
#nltk.download()

In [3]:
paragraph = """I have three visions for India. In 3000 years of our history, people from all over 
               the world have come and invaded us, captured our lands, conquered our minds. 
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours. 
               Yet we have not done this to any other nation. We have not conquered anyone. 
               We have not grabbed their land, their culture, 
               their history and tried to enforce our way of life on them. 
               Why? Because we respect the freedom of others.That is why my 
               first vision is that of freedom. I believe that India got its first vision of 
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India 
               stands up to the world, no one will respect us. Only strength respects strength. We must be 
               strong not only as a military power but also as an economic power. Both must go hand-in-hand. 
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of 
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life. 
               I see four milestones in my career"""
            

## Tokenize

**Tokenize**: splitting the paragraph into sentences, and splitting text into individual words.

In [4]:
# Tokenizing sentences - split the paragraph into a list of sentences
sentences = nltk.sent_tokenize(paragraph)

# Tokenizing words - split the whole paragraph into a list of words and punctuation marks
words = nltk.word_tokenize(paragraph)
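
To make the two levels of splitting concrete, here is a minimal sketch that uses only regular expressions. It is a simplified stand-in and not how NLTK tokenizes; the helper names `split_sentences` and `split_words` and the short sample text are assumptions made for illustration.

```python
import re

def split_sentences(text):
    # Naive rule (assumption): a sentence ends at ., ! or ? followed by whitespace.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    # A run of word characters is one token; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "I have three visions for India. We have not conquered anyone. Why?"
toy_sentences = split_sentences(sample)
toy_words = split_words(toy_sentences[0])

print(toy_sentences)
print(toy_words)
```

A rule this simple would wrongly split after abbreviations such as "Dr." in the paragraph above, which is one reason to prefer a trained sentence tokenizer like the one behind `nltk.sent_tokenize`.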

## Stemming and Lemmatization 

- **Stemming:** reduces a word to its word stem. The stem is not always a meaningful word, but stemming is fast. Example use: sentiment analysis.
- **Lemmatization:** also reduces a word to a base form, but the result is always a meaningful word. It is slower than stemming. Example use: chatbots.
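
The difference between the two can be sketched with a toy example. This is not the Porter algorithm or WordNet: `toy_stem` strips a few hard-coded suffixes and `TOY_LEMMAS` is a hypothetical three-entry dictionary standing in for a real lexical database.

```python
def toy_stem(word):
    # Strip the first matching suffix; the result need not be a real word.
    for suffix in ("ies", "ing", "ed", "y", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary (assumption): maps an inflected form to its base word.
TOY_LEMMAS = {"histories": "history", "invaded": "invade", "visions": "vision"}

def toy_lemmatize(word):
    # Look the word up; unknown words are returned unchanged.
    return TOY_LEMMAS.get(word, word)

for w in ["histories", "invaded", "visions"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer is a handful of string operations and can produce non-words such as "histor" or "invad", while the lemmatizer needs a lookup resource but always returns a dictionary word.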

In [5]:
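# Stemming
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords  # common function words such as "the", "is", "of"

sentences = nltk.sent_tokenize(paragraph)
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word

# Stem every non-stop word and rebuild each sentence.
# The stop-word check is case-sensitive, so capitalised words such as "We" are kept.
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)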


In [6]:
sentences

['I three vision india .',
 'In 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'We conquer anyon .',
 'We grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom others.that first vision freedom .',
 'I believ india got first vision 1857 , start war independ .',
 'It freedom must protect nurtur build .',
 'If free , one respect us .',
 'My second vision india ’ develop .',
 'for fifti year develop nation .',
 'It time see develop nation .',
 'We among top 5 nation world term gdp .',
 'We 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recognis today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 'isn ’ incorrect ?',
 'I third vision .',
 'india must stand world .',
 'becaus I believ unless india stand world , one respect us .',
 'onl

In [7]:
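from nltk.stem import WordNetLemmatizer

sentences = nltk.sent_tokenize(paragraph)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word

# Lemmatization
# With no part-of-speech tag, lemmatize() treats every word as a noun,
# so verb forms such as "invaded" are left unchanged.
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)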

sentences

['I three vision India .',
 'In 3000 year history , people world come invaded u , captured land , conquered mind .',
 'From Alexander onwards , Greeks , Turks , Moguls , Portuguese , British , French , Dutch , came looted u , took .',
 'Yet done nation .',
 'We conquered anyone .',
 'We grabbed land , culture , history tried enforce way life .',
 'Why ?',
 'Because respect freedom others.That first vision freedom .',
 'I believe India got first vision 1857 , started War Independence .',
 'It freedom must protect nurture build .',
 'If free , one respect u .',
 'My second vision India ’ development .',
 'For fifty year developing nation .',
 'It time see developed nation .',
 'We among top 5 nation world term GDP .',
 'We 10 percent growth rate area .',
 'Our poverty level falling .',
 'Our achievement globally recognised today .',
 'Yet lack self-confidence see developed nation , self-reliant self-assured .',
 'Isn ’ incorrect ?',
 'I third vision .',
 'India must stand world .',
 'Bec

## Bag of words

- Converts text into numeric data.
- Usually the sentences are lowercased, stop words are removed, and the remaining words are stemmed or lemmatized first.
- Count how often each word occurs and sort the words in descending order of frequency.
- Each word becomes a feature [column] and each sentence becomes a row. A cell is 1 if the word is present in that sentence and 0 otherwise. This is **Binary Bag of Words**.
- If a cell instead holds the number of times the word occurs in the sentence (2, 3, 4, ...), this is **Normal Bag of Words**.
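
The two variants described above can be sketched in plain Python. The three-sentence `toy_corpus` and the helper names are assumptions made for illustration, and the sketch skips the cleaning steps.

```python
from collections import Counter

# Hypothetical, already-cleaned sentences (assumption for illustration).
toy_corpus = ["good boy", "good girl", "boy girl good good"]

# Vocabulary: every distinct word, sorted so the columns have a stable order.
vocab = sorted({w for doc in toy_corpus for w in doc.split()})

def count_vector(doc):
    # Normal Bag of Words: how many times each vocabulary word occurs.
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

def binary_vector(doc):
    # Binary Bag of Words: 1 if the word is present, 0 otherwise.
    return [1 if c > 0 else 0 for c in count_vector(doc)]

count_bow = [count_vector(d) for d in toy_corpus]
binary_bow = [binary_vector(d) for d in toy_corpus]

print(vocab)
print(count_bow)
print(binary_bow)
```

The two tables differ only in the last row, where "good" occurs twice: the count version stores 2 and the binary version stores 1.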

**Disadvantages:**

- In Binary BOW every word that is present gets the same value, 1, so nothing distinguishes important words from unimportant ones. In sentiment analysis some words matter more than others, so giving all words equal weight is a problem.
- Slower than word2vec.

In [8]:
import re  # regular expressions
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer  # stemming
from nltk.stem import WordNetLemmatizer  # lemmatization

ps = PorterStemmer()
wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
sentences = nltk.sent_tokenize(paragraph)
corpus = []  # cleaned sentences; this is the input to the BOW model

for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # replace everything except a-z, A-Z with a space
    review = review.lower()  # convert to lower case
    review = review.split()  # split the sentence into words

    # drop stop words, then either stem or lemmatize what is left
    # review = [ps.stem(word) for word in review if word not in stop_words]
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]

    review = ' '.join(review)  # join the remaining words back into one string
    corpus.append(review)

In [9]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(corpus).toarray()

In [10]:
#now, X can be passed to the model as it is all numeric 
X

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

## TF-IDF [Term Frequency Inverse Document Frequency]

TF = Number of repetitions of a word in a sentence / Number of words in the same sentence

IDF = log(Number of sentences / Number of sentences containing the word)

Multiply TF * IDF to get the value for each cell of the table, where each word is a feature [column] and each sentence is a row. The advantage over Bag of Words is that each word gets its own weight: a word that appears in every sentence gets an IDF of 0, while a word that appears in only a few sentences is weighted more heavily. This helps in understanding the data better.
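
The two formulas above can be applied by hand to a tiny example. This is a minimal sketch: `toy_docs` is a hypothetical corpus, the helper names are illustrative, and the natural logarithm is an assumption because the formula does not name a base.

```python
import math

# Hypothetical, already-tokenized sentences (assumption for illustration).
toy_docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
tfidf_vocab = sorted({w for doc in toy_docs for w in doc})

def tf(word, doc):
    # Occurrences of the word in the sentence / number of words in the sentence.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # log(number of sentences / number of sentences containing the word).
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

# One row per sentence, one column per vocabulary word.
tfidf = [[tf(w, d) * idf(w, toy_docs) for w in tfidf_vocab] for d in toy_docs]

for row in tfidf:
    print([round(v, 3) for v in row])
```

Because "good" occurs in all three sentences, its IDF is log(3/3) = 0 and its column is zero everywhere. The values will not match scikit-learn's `TfidfVectorizer`, which by default uses raw counts for TF, a smoothed IDF, and L2-normalized rows.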

In [13]:
# Cleaning the texts -> same steps as for Bag of Words
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

ps = PorterStemmer()
wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
sentences = nltk.sent_tokenize(paragraph)
corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-Z]', ' ', sentences[i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)


In [14]:
# Creating the TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer
cv = TfidfVectorizer()
X = cv.fit_transform(corpus).toarray()