# Stopwords
In NLP, stop words are very common words (such as "the", "is", and "of") that are usually filtered out while preprocessing text data. They carry little meaning on their own, so removing them shrinks the vocabulary and speeds up text analysis.
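
As a minimal illustration of the idea, the sketch below filters a sentence against a tiny hand-written stop-word set. The set `TOY_STOP_WORDS` and the helper `remove_stop_words` are made up for this example; the rest of this notebook uses NLTK's much larger English list.

```python
# Toy stop-word list for illustration only (not NLTK's list).
TOY_STOP_WORDS = {"i", "have", "for", "the", "is", "of", "a", "that"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Split on whitespace and drop tokens whose lowercase form is a stop word."""
    return [token for token in text.split() if token.lower() not in stop_words]

print(remove_stop_words("I have three visions for India"))
# ['three', 'visions', 'India']
```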

In [1]:
speech = """"I have three visions for India. In 3000 years of our history, people from all over the world have come and invaded us, captured our lands, conquered our minds. From Alexander onwards, the Greeks, the Turks, the Mughals, the Portuguese, the British, the French, the Dutch – all of them came and looted us, took over what was ours. Yet, we have not done this to any other nation. We have not conquered anyone. We have not grabbed their land, their culture, and their history. Why? Because we respect the freedom of others. That is why my first vision is that of freedom.

My second vision for India’s development. For fifty years, we have been a developing nation. It is time we see ourselves as a developed nation. We are among the top five nations in the world in terms of GDP. We have ten percent growth rate in most areas. Our poverty levels are falling. Our achievements are being globally recognized today. Yet we lack the self-confidence to see ourselves as a developed nation, as a self-reliant and self-assured nation.

My third vision is that India must stand up to the world. Because I believe that unless India stands up to the world, no one will respect us. Only strength respects strength. We must be strong not only as a military power but also as an economic power. Both must go hand-in-hand.

Why is the media here so negative? Why are we in India so embarrassed to recognize our own strengths, our own achievements? We are such a great nation. We have so many amazing success stories but we refuse to acknowledge them. Why?"

"""



In [2]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [3]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on
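
The list includes negations such as 'no', 'nor', and 'not', so filtering with it unchanged can flip the meaning of a sentence: "We have not conquered anyone" loses its "not". One possible workaround, sketched here with a small stand-in set rather than the full NLTK list, is to subtract the words you want to keep before filtering. The names `BASE_STOP_WORDS`, `KEEP`, and `custom_stop_words` are illustrative assumptions.

```python
# Stand-in for a stop-word list; a real one could come from stopwords.words('english').
BASE_STOP_WORDS = {"we", "have", "not", "no", "nor", "the", "a"}

# Words that should survive filtering because they carry the negation.
KEEP = {"not", "no", "nor"}

# Set difference: everything in the base list except the words to keep.
custom_stop_words = BASE_STOP_WORDS - KEEP

tokens = "we have not conquered anyone".split()
filtered = [t for t in tokens if t not in custom_stop_words]
print(filtered)  # ['not', 'conquered', 'anyone']
```

Whether negations should be kept depends on the task: for sentiment analysis they usually matter, while for topic modelling they often do not.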

In [7]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
sentences = nltk.sent_tokenize(speech)

In [8]:
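# Remove stop words, then stem what is left.
# Build the stop-word set once instead of once per word.
# Note: the check is case-sensitive, so capitalized words such as "In" or "We" are kept.
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the tokens back into a sentence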

In [9]:
sentences

['`` i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mughal , portugues , british , french , dutch – came loot us , took .',
 'yet , done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori .',
 'whi ?',
 'becaus respect freedom other .',
 'that first vision freedom .',
 'my second vision india ’ develop .',
 'for fifti year , develop nation .',
 'it time see develop nation .',
 'we among top five nation world term gdp .',
 'we ten percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recogn today .',
 'yet lack self-confid see develop nation , self-reli self-assur nation .',
 'my third vision india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onli strength respect strength .',
 'we must strong militari power also econom power .',
 'both must go hand-in-hand .',
 'whi media neg ?',
 'whi india embarrass rec
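
Tokens such as 'in', 'from', and 'we' still appear in the output even though they are on the stop-word list. This happens because the membership test runs on the original token ("In", "From", "We"), which does not match the lowercase list entries, and the stemmer lowercases the token only afterwards. A minimal sketch of the difference, using a small stand-in set (`toy_stop_words` is a hypothetical subset, not the full NLTK list):

```python
toy_stop_words = {"in", "of", "our", "we", "have", "not"}  # hypothetical subset
tokens = ["In", "3000", "years", "of", "our", "history"]

# Compare the raw token against the list: "In" slips through.
case_sensitive = [t for t in tokens if t not in toy_stop_words]

# Compare the lowercased token instead: "In" is removed.
case_insensitive = [t for t in tokens if t.lower() not in toy_stop_words]

print(case_sensitive)    # ['In', '3000', 'years', 'history']
print(case_insensitive)  # ['3000', 'years', 'history']
```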

# Using Snowball Stemming

In [12]:
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')

In [14]:
# Same pipeline as before, using the Snowball stemmer.
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
snowball_sentences = nltk.sent_tokenize(speech)
for i in range(len(snowball_sentences)):
    snowball_words = nltk.word_tokenize(snowball_sentences[i])
    snowball_words = [snowball_stemmer.stem(word) for word in snowball_words if word not in stop_words]
    snowball_sentences[i] = ' '.join(snowball_words)

In [16]:
snowball_sentences


['`` i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mughal , portugues , british , french , dutch – came loot us , took .',
 'yet , done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori .',
 'whi ?',
 'becaus respect freedom other .',
 'that first vision freedom .',
 'my second vision india ’ develop .',
 'for fifti year , develop nation .',
 'it time see develop nation .',
 'we among top five nation world term gdp .',
 'we ten percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recogn today .',
 'yet lack self-confid see develop nation , self-reli self-assur nation .',
 'my third vision india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onli strength respect strength .',
 'we must strong militari power also econom power .',
 'both must go hand-in-hand .',
 'whi media negat ?',
 'whi india embarrass r
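
The Porter and Snowball outputs are almost identical. The one visible difference is "negative", which Porter reduces to 'neg' and Snowball to 'negat'. A quick way to compare the two stemmers on individual words is sketched below. It assumes NLTK is installed; neither stemmer needs a corpus download. The word list is an arbitrary selection from the speech.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')

# Print each word next to its stem from both algorithms.
for word in ["negative", "history", "achievements", "recognized"]:
    print(f"{word:>12}  porter={porter.stem(word):<8}  snowball={snowball.stem(word)}")
```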

In [17]:
from nltk.stem import WordNetLemmatizer
# The lemmatizer needs the WordNet corpus: run nltk.download('wordnet') once if it is missing.
lemmatizer = WordNetLemmatizer()

In [18]:
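# Remove stop words, then lemmatize. lemmatize() treats every word as a noun
# unless a pos argument is given, so verb forms such as "invaded" are left unchanged.
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
lemma_sentences = nltk.sent_tokenize(speech)
for i in range(len(lemma_sentences)):
    lemma_words = nltk.word_tokenize(lemma_sentences[i])
    lemma_words = [lemmatizer.lemmatize(word) for word in lemma_words if word not in stop_words]
    lemma_sentences[i] = ' '.join(lemma_words)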

In [19]:
lemma_sentences

['`` I three vision India .',
 'In 3000 year history , people world come invaded u , captured land , conquered mind .',
 'From Alexander onwards , Greeks , Turks , Mughals , Portuguese , British , French , Dutch – came looted u , took .',
 'Yet , done nation .',
 'We conquered anyone .',
 'We grabbed land , culture , history .',
 'Why ?',
 'Because respect freedom others .',
 'That first vision freedom .',
 'My second vision India ’ development .',
 'For fifty year , developing nation .',
 'It time see developed nation .',
 'We among top five nation world term GDP .',
 'We ten percent growth rate area .',
 'Our poverty level falling .',
 'Our achievement globally recognized today .',
 'Yet lack self-confidence see developed nation , self-reliant self-assured nation .',
 'My third vision India must stand world .',
 'Because I believe unless India stand world , one respect u .',
 'Only strength respect strength .',
 'We must strong military power also economic power .',
 'Both must go ha