### Stop Words
**Stop words** are common words (such as "the," "is," and "and") that are often filtered out during text processing because they carry little meaningful information for search or analysis.

Example: In the phrase "A quick brown fox over the lazy dog," the words "A," "over," and "the" are stop words and would be removed, leaving the content words "quick," "brown," "fox," "lazy," and "dog."
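The example phrase can be filtered with a few lines of plain Python. This is only a minimal sketch: the three-word `STOP_WORDS` set is hand-written for this phrase (an assumption, not NLTK's list), and punctuation is stripped with the standard `string.punctuation` constant.

```python
import string

# Hand-written stop-word set for this example only (an assumption, not NLTK's list)
STOP_WORDS = {"a", "over", "the"}

phrase = "A quick brown fox over the lazy dog,"

# Split on whitespace and strip punctuation from both ends of each token
tokens = [word.strip(string.punctuation) for word in phrase.split()]

# Compare in lowercase so that "A" matches the lowercase entry "a"
keywords = [tok for tok in tokens if tok and tok.lower() not in STOP_WORDS]

print(keywords)  # ['quick', 'brown', 'fox', 'lazy', 'dog']
```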

In [1]:
paragraph = """I have three vision for India. In 3000 years of our history, people from all over the world have come and invaded us,
                 captured our lands, conquered our minds. From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, 
                 the British, the French, the Dutch, all of them came and looted us, took over what was ours. Yet we have not done this to any other nation. We have not conquered anyone. 
                 We have not grabbed their lands, their culture, their history and tried to enforce our way of life on them. Why? Because we respect the freedom of others.
                 That is why my first vision is that of freedom. I believe that India got its first vision of this in 1857, when we started the War of Independence. 
                 It is this freedom that we must protect and nurture and build on. If we are not free, no one will respect us."""

In [9]:
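import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# Download the stop word lists (does nothing if they are already up to date).
# sent_tokenize and word_tokenize, used later, also need the Punkt tokenizer data.
nltk.download('stopwords')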

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hyashwanth/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
stopwords_english = stopwords.words('english')  ## load NLTK's English stop word list (all entries are lowercase)
print(stopwords_english)

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [5]:
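## We can also add our own stop words to the existing list.
## Note: matching is exact and NLTK's own entries are all lowercase, so the capitalized 'India'
## only matches tokens that are compared without lowercasing; a filter that lowercases each word first needs 'india'.
custom_stopwords = ['India', 'freedom', 'vision']
stopwords_english.extend(custom_stopwords)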
print("\nUpdated Stop Words List:")
print(stopwords_english)


Updated Stop Words List:
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'sam
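One way to avoid case mismatches between custom entries and the tokens being checked is to lowercase everything once and keep the result in a set. The sketch below assumes a short `base_stopwords` excerpt in place of the full NLTK list, and `is_stop_word` is a hypothetical helper name introduced for illustration.

```python
# Short excerpt standing in for the full NLTK list (an assumption for illustration)
base_stopwords = ["a", "the", "is", "of", "for", "have", "i"]
custom_stopwords = ["India", "freedom", "vision"]

# Lowercase every entry and store the result in a set for fast membership checks
stop_set = {w.lower() for w in base_stopwords} | {w.lower() for w in custom_stopwords}

def is_stop_word(word, stop_set=stop_set):
    """Return True if the word, compared in lowercase, is a stop word."""
    return word.lower() in stop_set

print(is_stop_word("India"), is_stop_word("india"), is_stop_word("three"))
```

A set also makes each lookup constant time on average, whereas `in` on a list scans the whole list.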

In [6]:
## It is good practice to remove stop words before stemming or lemmatization, so that only content words are processed.
## Note: str.split() leaves punctuation attached, so tokens such as 'India.' and 'freedom.' do not match the list and are kept.
words = paragraph.split()
filtered_words = [word for word in words if word.lower() not in stopwords_english]
print("\nFiltered Words:")
print(filtered_words)


Filtered Words:
['three', 'India.', '3000', 'years', 'history,', 'people', 'world', 'come', 'invaded', 'us,', 'captured', 'lands,', 'conquered', 'minds.', 'Alexander', 'onwards,', 'Greeks,', 'Turks,', 'Moguls,', 'Portuguese,', 'British,', 'French,', 'Dutch,', 'came', 'looted', 'us,', 'took', 'ours.', 'Yet', 'done', 'nation.', 'conquered', 'anyone.', 'grabbed', 'lands,', 'culture,', 'history', 'tried', 'enforce', 'way', 'life', 'them.', 'Why?', 'respect', 'others.', 'first', 'freedom.', 'believe', 'India', 'got', 'first', '1857,', 'started', 'War', 'Independence.', 'must', 'protect', 'nurture', 'build', 'on.', 'free,', 'one', 'respect', 'us.']
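The filtered list above still contains tokens such as 'India.' and 'freedom.' because whitespace splitting keeps punctuation attached to the word. A minimal sketch of one alternative is to extract word characters with a regular expression before the lookup. The `STOP_WORDS` set here is a small hand-picked subset with lowercase entries, assumed for illustration only.

```python
import re

# Small hand-picked subset with lowercase entries (an assumption, not the full NLTK list)
STOP_WORDS = {"i", "have", "for", "that", "is", "why", "my", "of",
              "india", "freedom", "vision"}

text = "I have three vision for India. That is why my first vision is that of freedom."

# Keep runs of letters/digits (with an optional apostrophe suffix such as "don't"),
# which drops surrounding punctuation like '.' and ','
tokens = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

# Lowercase each token for the lookup so that "India" and "That" are matched too
filtered = [tok for tok in tokens if tok.lower() not in STOP_WORDS]

print(filtered)  # ['three', 'first']
```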


In [7]:
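## Let's perform stemming using PorterStemmer.
## Tokens that still carry punctuation (e.g. 'history,') are only lowercased, not stemmed; compare 'history' --> 'histori'.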
ps = PorterStemmer()
for word in filtered_words:
    print(f"{word} --> {ps.stem(word)}")

three --> three
India. --> india.
3000 --> 3000
years --> year
history, --> history,
people --> peopl
world --> world
come --> come
invaded --> invad
us, --> us,
captured --> captur
lands, --> lands,
conquered --> conquer
minds. --> minds.
Alexander --> alexand
onwards, --> onwards,
Greeks, --> greeks,
Turks, --> turks,
Moguls, --> moguls,
Portuguese, --> portuguese,
British, --> british,
French, --> french,
Dutch, --> dutch,
came --> came
looted --> loot
us, --> us,
took --> took
ours. --> ours.
Yet --> yet
done --> done
nation. --> nation.
conquered --> conquer
anyone. --> anyone.
grabbed --> grab
lands, --> lands,
culture, --> culture,
history --> histori
tried --> tri
enforce --> enforc
way --> way
life --> life
them. --> them.
Why? --> why?
respect --> respect
others. --> others.
first --> first
freedom. --> freedom.
believe --> believ
India --> india
got --> got
first --> first
1857, --> 1857,
started --> start
War --> war
Independence. --> independence.
must --> must
protect -->

In [10]:
## Now let's combine tokenizing, stop word removal and stemming into a single function
def preprocess_text(paragraph):
    stop_set = set(stopwords_english)           # build the lookup set once instead of once per word
    sentences = sent_tokenize(paragraph)        # tokenizing into sentences
    for i in range(len(sentences)):             # iterating through each sentence
        words = word_tokenize(sentences[i])     # tokenizing into words
        # the lookup is case-sensitive, so capitalized stop words such as "I" or "We" are kept
        filtered_words = [ps.stem(word) for word in words if word not in stop_set]  # removing stop words and stemming
        sentences[i] = " ".join(filtered_words)  # joining the words back to form the sentence
    return sentences

processed_sentences = preprocess_text(paragraph)
for sentence in processed_sentences:
    print(sentence)

i three .
in 3000 year histori , peopl world come invad us , captur land , conquer mind .
from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .
yet done nation .
we conquer anyon .
we grab land , cultur , histori tri enforc way life .
whi ?
becaus respect other .
that first .
i believ got first 1857 , start war independ .
it must protect nurtur build .
if free , one respect us .


In [None]:
## Stemming with Snowball Stemmer
from nltk.stem import SnowballStemmer
ss = SnowballStemmer('english')

## Same pipeline as before (tokenizing, stop word removal, stemming), using the Snowball stemmer
def preprocess_text(paragraph):
    stop_set = set(stopwords_english)           # build the lookup set once instead of once per word
    sentences = sent_tokenize(paragraph)        # tokenizing into sentences
    for i in range(len(sentences)):             # iterating through each sentence
        words = word_tokenize(sentences[i])     # tokenizing into words
        filtered_words = [ss.stem(word) for word in words if word not in stop_set]  # removing stop words and stemming with the Snowball stemmer
        sentences[i] = " ".join(filtered_words)  # joining the words back to form the sentence
    return sentences

processed_sentences = preprocess_text(paragraph)
for sentence in processed_sentences:
    print(sentence)

i three .
in 3000 year histori , peopl world come invad us , captur land , conquer mind .
from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .
yet done nation .
we conquer anyon .
we grab land , cultur , histori tri enforc way life .
whi ?
becaus respect other .
that first .
i believ got first 1857 , start war independ .
it must protect nurtur build .
if free , one respect us .


In [22]:
## applying lemmatization
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

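## Now let's combine tokenizing, stop word removal and lemmatization into a single function
def preprocess_text(paragraph):
    stop_set = set(stopwords_english)           # build the lookup set once instead of once per word
    sentences = sent_tokenize(paragraph)        # tokenizing into sentences
    for i in range(len(sentences)):             # iterating through each sentence
        sentences[i] = sentences[i].lower()     # lowercasing so tokens match the lowercase stop word entries
        words = word_tokenize(sentences[i])     # tokenizing into words
        filtered_words = [wnl.lemmatize(word, pos='v') for word in words if word not in stop_set]  # removing stop words and lemmatizing the rest as verbs
        # NOTE: filtered_words is never written back, so the function returns only the lowercased sentences.
        # Adding `sentences[i] = " ".join(filtered_words)` here would return the lemmatized text instead.
    return sentences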

processed_sentences = preprocess_text(paragraph)
for sentence in processed_sentences:
    print(sentence)

i have three vision for india.
in 3000 years of our history, people from all over the world have come and invaded us,
                 captured our lands, conquered our minds.
from alexander onwards, the greeks, the turks, the moguls, the portuguese, 
                 the british, the french, the dutch, all of them came and looted us, took over what was ours.
yet we have not done this to any other nation.
we have not conquered anyone.
we have not grabbed their lands, their culture, their history and tried to enforce our way of life on them.
why?
because we respect the freedom of others.
that is why my first vision is that of freedom.
i believe that india got its first vision of this in 1857, when we started the war of independence.
it is this freedom that we must protect and nurture and build on.
if we are not free, no one will respect us.
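The output above is identical to the lowercased input because the loop computes `filtered_words` but never stores it back in `sentences[i]`. The sketch below shows the write-back step using only the standard library. Everything in it is a stand-in assumed for illustration: the tiny `STOP_WORDS` set replaces NLTK's list, the `TOY_LEMMAS` lookup replaces `WordNetLemmatizer`, and the regex sentence split is a crude substitute for `sent_tokenize`.

```python
import re

# Hypothetical stand-ins for illustration only (not NLTK data)
STOP_WORDS = {"we", "have", "not", "their", "the", "of", "our", "and"}
TOY_LEMMAS = {"conquered": "conquer", "grabbed": "grab", "lands": "land"}

def preprocess_text(paragraph, stop_words=STOP_WORDS, lemmas=TOY_LEMMAS):
    # Crude sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    processed = []
    for sentence in sentences:
        # Lowercase, then keep only runs of letters, digits and apostrophes
        tokens = re.findall(r"[a-z0-9']+", sentence.lower())
        # Drop stop words and map each remaining token to its lemma if one is known
        kept = [lemmas.get(tok, tok) for tok in tokens if tok not in stop_words]
        # The step that matters: store the processed tokens, not the raw sentence
        processed.append(" ".join(kept))
    return processed

result = preprocess_text("We have not conquered anyone. We have not grabbed their lands.")
print(result)  # ['conquer anyone', 'grab land']
```

Because the text is lowercased before the lookup, stop word entries must be lowercase as well; a capitalized custom entry such as 'India' would not match 'india'.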
