In [1]:
corpus = """I have three visions for India. In 3000 years of history, people from all over the world have come and conquered us — ruled us, looted us — yet we have not done this to others. Why? Because we respect freedom. That is why my first vision is freedom. I believe that India got its first vision of this in 1947. We earned our freedom; we should protect and build on it.
My second vision is development. For fifty years, we have been a developing nation. It is time we see ourselves as a developed nation. We are among the top five nations in terms of GDP. Yet we see ourselves as a poor country. It's time to dream big. We must stand up and take responsibility for making India a developed nation.
My third vision is India must stand up to the world. Unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic one. We must realize that self-respect comes only when the nation is strong and self-reliant.
I am confident that India has the potential. The youth of India must take charge, innovate, work hard, and stay committed to the nation. If we want to be a great nation, each one of us must contribute with dedication. Let's ignite our minds, take pride in our country, and work towards a strong, self-reliant India.
"""

# Cleaning steps: 1) tokenize the corpus, 2) remove stop words, 3) apply stemming or lemmatization



In [2]:
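import nltk
from nltk.corpus import stopwords

# Fetch the stop word lists. The tokenizers used below also need the Punkt
# models ('punkt', or 'punkt_tab' in newer NLTK releases) and the lemmatizer
# needs 'wordnet'; download them the same way if they are missing.
nltk.download('stopwords')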

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rohan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
stop_word_list = set(stopwords.words("english"))

In [4]:
# 1. Tokenization: split the corpus into sentences

sentences = nltk.sent_tokenize(corpus)
sentences

['I have three visions for India.',
 'In 3000 years of history, people from all over the world have come and conquered us — ruled us, looted us — yet we have not done this to others.',
 'Why?',
 'Because we respect freedom.',
 'That is why my first vision is freedom.',
 'I believe that India got its first vision of this in 1947.',
 'We earned our freedom; we should protect and build on it.',
 'My second vision is development.',
 'For fifty years, we have been a developing nation.',
 'It is time we see ourselves as a developed nation.',
 'We are among the top five nations in terms of GDP.',
 'Yet we see ourselves as a poor country.',
 "It's time to dream big.",
 'We must stand up and take responsibility for making India a developed nation.',
 'My third vision is India must stand up to the world.',
 'Unless India stands up to the world, no one will respect us.',
 'Only strength respects strength.',
 'We must be strong not only as a military power but also as an economic one.',
 'We must 

In [5]:
# 2. Remove stop words and stem the remaining tokens
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Note: the stop word check is case-sensitive, so capitalized stop words
    # such as "I" or "We" pass the filter and are only lowercased by the stemmer.
    words = [stemmer.stem(word) for word in words if word not in stop_word_list]
    sentences[i] = ' '.join(words)  # join the stemmed tokens back into a sentence
sentences

['i three vision india .',
 'in 3000 year histori , peopl world come conquer us — rule us , loot us — yet done other .',
 'whi ?',
 'becaus respect freedom .',
 'that first vision freedom .',
 'i believ india got first vision 1947 .',
 'we earn freedom ; protect build .',
 'my second vision develop .',
 'for fifti year , develop nation .',
 'it time see develop nation .',
 'we among top five nation term gdp .',
 'yet see poor countri .',
 "it 's time dream big .",
 'we must stand take respons make india develop nation .',
 'my third vision india must stand world .',
 'unless india stand world , one respect us .',
 'onli strength respect strength .',
 'we must strong militari power also econom one .',
 'we must realiz self-respect come nation strong self-reli .',
 'i confid india potenti .',
 'the youth india must take charg , innov , work hard , stay commit nation .',
 'if want great nation , one us must contribut dedic .',
 "let 's ignit mind , take pride countri , work toward strong 
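
The output above still contains tokens such as 'i', 'we' and 'my' because the membership test compares the original, capitalized token against a lowercase stop word list. A minimal sketch of a case-insensitive filter follows. The small `demo_stop_words` set, the `filter_tokens` helper and the whitespace split are stand-ins chosen for illustration (assumptions, not NLTK's actual list or tokenizer).

```python
# Hypothetical, hand-picked stop words for illustration only; the notebook
# itself uses the full list from stopwords.words("english").
demo_stop_words = {"i", "have", "for", "we", "the", "is", "my"}

def filter_tokens(tokens, stop_words, ignore_case):
    """Drop stop words, optionally comparing tokens in lowercase."""
    kept = []
    for token in tokens:
        key = token.lower() if ignore_case else token
        if key not in stop_words:
            kept.append(token)
    return kept

tokens = "I have three visions for India".split()

case_sensitive = filter_tokens(tokens, demo_stop_words, ignore_case=False)
case_insensitive = filter_tokens(tokens, demo_stop_words, ignore_case=True)

print(case_sensitive)    # ['I', 'three', 'visions', 'India']
print(case_insensitive)  # ['three', 'visions', 'India']
```

In the notebook's loops, the equivalent change would be to test `word.lower() not in stop_word_list`.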

In [6]:
# Repeating the process with SnowballStemmer

corpus = """I have three visions for India. In 3000 years of history, people from all over the world have come and conquered us — ruled us, looted us — yet we have not done this to others. Why? Because we respect freedom. That is why my first vision is freedom. I believe that India got its first vision of this in 1947. We earned our freedom; we should protect and build on it.
My second vision is development. For fifty years, we have been a developing nation. It is time we see ourselves as a developed nation. We are among the top five nations in terms of GDP. Yet we see ourselves as a poor country. It's time to dream big. We must stand up and take responsibility for making India a developed nation.
My third vision is India must stand up to the world. Unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic one. We must realize that self-respect comes only when the nation is strong and self-reliant.
I am confident that India has the potential. The youth of India must take charge, innovate, work hard, and stay committed to the nation. If we want to be a great nation, each one of us must contribute with dedication. Let's ignite our minds, take pride in our country, and work towards a strong, self-reliant India.
"""

from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")

# 1. Tokenization: split the corpus into sentences
sentences = nltk.sent_tokenize(corpus)

# 2. Remove stop words and apply Snowball stemming
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [snowball_stemmer.stem(word) for word in words if word not in stop_word_list]
    sentences[i] = ' '.join(words)  # join the stemmed tokens back into a sentence
sentences

['i three vision india .',
 'in 3000 year histori , peopl world come conquer us — rule us , loot us — yet done other .',
 'whi ?',
 'becaus respect freedom .',
 'that first vision freedom .',
 'i believ india got first vision 1947 .',
 'we earn freedom ; protect build .',
 'my second vision develop .',
 'for fifti year , develop nation .',
 'it time see develop nation .',
 'we among top five nation term gdp .',
 'yet see poor countri .',
 "it 's time dream big .",
 'we must stand take respons make india develop nation .',
 'my third vision india must stand world .',
 'unless india stand world , one respect us .',
 'onli strength respect strength .',
 'we must strong militari power also econom one .',
 'we must realiz self-respect come nation strong self-reli .',
 'i confid india potenti .',
 'the youth india must take charg , innov , work hard , stay commit nation .',
 'if want great nation , one us must contribut dedic .',
 "let 's ignit mind , take pride countri , work toward strong 
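
On this corpus the two stemmers print the same visible results, but they are not the same algorithm. The sketch below compares them on a few extra words, chosen here for illustration rather than taken from any particular source. It only needs the `nltk` package itself, no downloaded data.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Illustrative words; "fairly" and "sportingly" do not occur in the corpus.
sample_words = ["fairly", "sportingly", "history", "freedom"]

# Map each word to its (Porter stem, Snowball stem) pair.
comparison = {w: (porter.stem(w), snowball.stem(w)) for w in sample_words}

for word, (p, s) in comparison.items():
    marker = "" if p == s else "  <- differs"
    print(f"{word:12} porter={p:12} snowball={s}{marker}")
```

Adverbs ending in -ly are a typical place where the two diverge: the Porter stemmer leaves "fairly" as "fairli", while Snowball reduces it to "fair".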

In [7]:
# Repeating the process with lemmatization

corpus = """I have three visions for India. In 3000 years of history, people from all over the world have come and conquered us — ruled us, looted us — yet we have not done this to others. Why? Because we respect freedom. That is why my first vision is freedom. I believe that India got its first vision of this in 1947. We earned our freedom; we should protect and build on it.
My second vision is development. For fifty years, we have been a developing nation. It is time we see ourselves as a developed nation. We are among the top five nations in terms of GDP. Yet we see ourselves as a poor country. It's time to dream big. We must stand up and take responsibility for making India a developed nation.
My third vision is India must stand up to the world. Unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic one. We must realize that self-respect comes only when the nation is strong and self-reliant.
I am confident that India has the potential. The youth of India must take charge, innovate, work hard, and stay committed to the nation. If we want to be a great nation, each one of us must contribute with dedication. Let's ignite our minds, take pride in our country, and work towards a strong, self-reliant India.
"""

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# 1. Tokenization: split the corpus into sentences
sentences = nltk.sent_tokenize(corpus)

# 2. Remove stop words and lemmatize the remaining tokens
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # pos='v' treats every token as a verb, so verb forms are reduced
    # ("got" -> "get") while plural nouns such as "visions" are left unchanged.
    words = [lemmatizer.lemmatize(word.lower(), pos='v') for word in words if word not in stop_word_list]
    sentences[i] = ' '.join(words)  # join the lemmas back into a sentence
sentences

['i three visions india .',
 'in 3000 years history , people world come conquer us — rule us , loot us — yet do others .',
 'why ?',
 'because respect freedom .',
 'that first vision freedom .',
 'i believe india get first vision 1947 .',
 'we earn freedom ; protect build .',
 'my second vision development .',
 'for fifty years , develop nation .',
 'it time see develop nation .',
 'we among top five nations term gdp .',
 'yet see poor country .',
 "it 's time dream big .",
 'we must stand take responsibility make india develop nation .',
 'my third vision india must stand world .',
 'unless india stand world , one respect us .',
 'only strength respect strength .',
 'we must strong military power also economic one .',
 'we must realize self-respect come nation strong self-reliant .',
 'i confident india potential .',
 'the youth india must take charge , innovate , work hard , stay commit nation .',
 'if want great nation , one us must contribute dedication .',
 "let 's ignite mind , t