**Tokenization**

In Natural Language Processing (NLP), **tokenization** is the process of breaking down a continuous stream of text into smaller, meaningful units called **tokens**.

These tokens can be:

* **Words:** The most common form, where text is split into individual words (e.g., "Hello, world!" -> ["Hello", ",", "world", "!"]).
* **Characters:** Splitting text into individual characters.
* **Subwords:** Units larger than characters but smaller than words, useful for handling rare words or languages with complex word structures (e.g., "unusual" -> ["un", "usual"]).
* **Sentences:** Breaking a large text into individual sentences.

Tokenization is a fundamental first step in almost any NLP task, as it converts raw, unstructured text into a format that machines can process and analyze more easily.
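
To make the character, word, and sentence granularities concrete, here is a minimal sketch that uses only Python's standard library. The regular expressions are simplifying assumptions chosen for illustration and the sample text is made up; this is not how NLTK or spaCy tokenize. Subword tokenization is left out because it normally relies on a learned vocabulary.

```python
import re

text = "Hello, world! Tokenization is fun."

# Character tokens: every character, including spaces and punctuation.
chars = list(text)

# Word tokens: a run of word characters, or a single punctuation mark.
words = re.findall(r"\w+|[^\w\s]", text)

# Sentence tokens: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words)      # ['Hello', ',', 'world', '!', 'Tokenization', 'is', 'fun', '.']
print(sentences)  # ['Hello, world!', 'Tokenization is fun.']
```

The same text yields very different token counts depending on the granularity, which is why the choice of tokenizer is usually driven by the downstream task.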


---

## NLP Terminology

* **Corpus:** A large, structured collection of texts, often used for linguistic analysis and training NLP models. Think of it as a dataset of language.
* **Documents:** Individual pieces of text contained within a corpus. A document could be anything from a single article, book chapter, email, or even a tweet.
* **Vocabulary:** The complete set of all unique words or tokens found across an entire corpus or within a specific document.
* **Words:** The fundamental individual units of language that make up documents and, collectively, form the vocabulary. These are typically the result of a tokenization process.
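
The relationship between these four terms can be sketched in a few lines. The two-document corpus below is invented for illustration, and plain whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# A toy corpus: each string is one document.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Whitespace splitting stands in for a real tokenizer here.
tokenized_docs = [doc.split() for doc in corpus]

# Word counts across the whole corpus; the distinct keys form the vocabulary.
counts = Counter(word for doc in tokenized_docs for word in doc)
vocabulary = sorted(counts)

print(len(corpus), "documents")
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(counts["the"])  # 4
```

Note that the corpus contains 12 words in total but only 7 vocabulary entries, because the vocabulary keeps each unique word once.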

---

## NLTK (Natural Language Toolkit)

NLTK is a powerful and widely used open-source Python library for Natural Language Processing (NLP).

It provides a comprehensive suite of tools, data, and algorithms for various NLP tasks, including:

* **Tokenization:** Breaking text into words or sentences.
* **Stemming and Lemmatization:** Reducing words to their root forms.
* **Part-of-Speech (POS) Tagging:** Identifying the grammatical role of words (e.g., noun, verb).
* **Named Entity Recognition (NER):** Identifying and classifying named entities like people, organizations, and locations.
* **Text Classification:** Categorizing text into predefined classes.
* **Parsing:** Analyzing the grammatical structure of sentences.
* **Access to Corpora:** Provides a large collection of linguistic datasets for research and development.

NLTK is particularly popular for teaching and research in computational linguistics and is a great starting point for anyone learning NLP.

---

## spaCy (pronounced "spay-see")

spaCy is a free, open-source Python library designed for **advanced Natural Language Processing (NLP)**, particularly focused on **production-ready applications**.

While NLTK is often favored for teaching and research due to its breadth of algorithms and corpora, spaCy excels in:

* **Speed and Efficiency:** It's built for performance, written in Python and Cython, making it very fast for processing large volumes of text.
* **Production Use:** spaCy is engineered with a focus on ease of use in real-world applications, offering streamlined workflows and robust pre-trained models.
* **Modern Design:** It provides opinionated, "industrial-strength" NLP solutions, often with a clear and consistent API.
* **Pre-trained Models:** spaCy offers highly optimized statistical models for various languages (e.g., English, German, Spanish) that can perform tasks like:
    * **Tokenization:** Efficiently breaking text into tokens.
    * **Part-of-Speech (POS) Tagging:** Assigning grammatical tags to words.
    * **Dependency Parsing:** Analyzing the grammatical structure of sentences to show relationships between words.
    * **Named Entity Recognition (NER):** Identifying and classifying named entities (people, organizations, locations, dates, etc.).
    * **Text Classification:** Categorizing text.
    * **Word Vectors/Embeddings:** Providing numerical representations of words that capture semantic meaning.
* **Integration with Deep Learning:** spaCy seamlessly integrates with popular deep learning frameworks like TensorFlow and PyTorch, allowing users to leverage state-of-the-art transformer models (like BERT, GPT-2).
* **Rule-based Matching:** A powerful system for finding specific patterns in text beyond what statistical models might catch.

In essence, if you're looking to build an NLP application that needs to be fast, scalable, and deployed in a real-world setting, spaCy is often the go-to choice.

In [1]:
!pip3 install nltk



In [33]:
corpus = """Hello!, My Name is Karthikraja, Doing Data Scientist and AI Engineer. I am excel in Building AI Applications.
And, I'm a Civil Engineer
"""

In [34]:
corpus

"Hello!, My Name is Karthikraja, Doing Data Scientist and AI Engineer. I am excel in Building AI Applications.\nAnd, I'm a Civil Engineer\n"

In [35]:
print(corpus)

Hello!, My Name is Karthikraja, Doing Data Scientist and AI Engineer. I am excel in Building AI Applications.
And, I'm a Civil Engineer



In [36]:
# Tokenization
# Paragraph ---> Sentences
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [37]:
from nltk.tokenize import sent_tokenize
documents = sent_tokenize(corpus)
#documents

In [38]:
documents

['Hello!, My Name is Karthikraja, Doing Data Scientist and AI Engineer.',
 'I am excel in Building AI Applications.',
 "And, I'm a Civil Engineer"]

In [39]:
type(documents)

list

In [40]:
for sentence in documents:
  print(sentence)

Hello!, My Name is Karthikraja, Doing Data Scientist and AI Engineer.
I am excel in Building AI Applications.
And, I'm a Civil Engineer
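
For contrast, a naive rule such as "split after every `.`, `!` or `?`" breaks on abbreviations, which is one reason a trained sentence tokenizer is usually preferred. The sketch below uses only the standard library; the sample sentence and the helper name `naive_sent_split` are made up for illustration.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' followed by whitespace (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "Dr. Kalam spoke. We listened."
print(naive_sent_split(sample))  # ['Dr.', 'Kalam spoke.', 'We listened.']
```

The title "Dr." is wrongly treated as a complete sentence, so the naive rule returns three pieces for a two-sentence text.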


In [41]:
# Tokenization
# paragraph --> words
# sentence --> words
from nltk.tokenize import word_tokenize
words = word_tokenize(corpus)

In [42]:
words

['Hello',
 '!',
 ',',
 'My',
 'Name',
 'is',
 'Karthikraja',
 ',',
 'Doing',
 'Data',
 'Scientist',
 'and',
 'AI',
 'Engineer',
 '.',
 'I',
 'am',
 'excel',
 'in',
 'Building',
 'AI',
 'Applications',
 '.',
 'And',
 ',',
 'I',
 "'m",
 'a',
 'Civil',
 'Engineer']

In [43]:
for sentence in documents:
  print(word_tokenize(sentence))

['Hello', '!', ',', 'My', 'Name', 'is', 'Karthikraja', ',', 'Doing', 'Data', 'Scientist', 'and', 'AI', 'Engineer', '.']
['I', 'am', 'excel', 'in', 'Building', 'AI', 'Applications', '.']
['And', ',', 'I', "'m", 'a', 'Civil', 'Engineer']


---
**wordpunct_tokenize**

`wordpunct_tokenize` is a simple, regular-expression-based NLTK tokenizer. It splits text into runs of alphanumeric characters and runs of other non-whitespace characters (the pattern is `\w+|[^\w\s]+`), so punctuation is always separated from words (e.g., "hello!" becomes ["hello", "!"]), while consecutive punctuation marks stay together as one token (e.g., "!," in the output below). Compared with `word_tokenize`, which relies on Treebank-style rules, it also splits contractions at the apostrophe: "I'm" becomes ["I", "'", "m"] rather than ["I", "'m"].

In [45]:
from nltk.tokenize import wordpunct_tokenize

In [46]:
for sentence in documents:
  print(wordpunct_tokenize(sentence))        # the apostrophe in "I'm" is split off as its own token

['Hello', '!,', 'My', 'Name', 'is', 'Karthikraja', ',', 'Doing', 'Data', 'Scientist', 'and', 'AI', 'Engineer', '.']
['I', 'am', 'excel', 'in', 'Building', 'AI', 'Applications', '.']
['And', ',', 'I', "'", 'm', 'a', 'Civil', 'Engineer']
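
The `'!,'` token in the first line of output above follows from the pattern behind `wordpunct_tokenize`, which matches either a run of word characters or a run of other non-space characters. A rough standard-library equivalent, assuming the pattern `\w+|[^\w\s]+` (the function name `wordpunct_like` is hypothetical):

```python
import re

def wordpunct_like(text):
    # A run of word characters, or a run of non-word, non-space characters.
    return re.findall(r"\w+|[^\w\s]+", text)

print(wordpunct_like("Hello!, My Name"))
# ['Hello', '!,', 'My', 'Name']
print(wordpunct_like("And, I'm a Civil Engineer"))
# ['And', ',', 'I', "'", 'm', 'a', 'Civil', 'Engineer']
```

Because the second alternative ends in `+`, adjacent punctuation marks are grouped into a single token instead of being emitted one by one.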


---
**TreebankWordTokenizer**

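The **TreebankWordTokenizer** (from NLTK) is a word tokenizer that follows the conventions of the **Penn Treebank** corpus. It handles **contractions** consistently (e.g., "don't" becomes "do", "n't") and separates most **punctuation** into individual tokens. It assumes the text has already been split into sentences, so a period is split off only at the end of the input; that is why "Engineer." and "Applications." keep their periods in the output below, where the whole corpus is passed in at once. This makes it a popular choice for NLP tasks requiring linguistically informed word boundaries.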
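
The contraction handling can be illustrated with a small standard-library sketch. It is a heavily simplified imitation of the Treebank idea, not NLTK's actual rule set; the function name `split_contractions` and the list of clitics are assumptions made for this example.

```python
import re

def split_contractions(text):
    # Separate "n't" from its host word: "don't" -> "do n't".
    text = re.sub(r"(?i)(\w)(n't)\b", r"\1 \2", text)
    # Separate clitics such as 'm, 's, 're, 've, 'll, 'd: "I'm" -> "I 'm".
    text = re.sub(r"(?i)(\w)('(?:m|s|re|ve|ll|d))\b", r"\1 \2", text)
    # Emit "n't", clitics, words, and single punctuation marks as tokens.
    return re.findall(r"n't|'\w+|\w+|[^\w\s]", text)

print(split_contractions("I don't know, I'm here."))
# ['I', 'do', "n't", 'know', ',', 'I', "'m", 'here', '.']
```

Keeping "n't" and "'m" as separate tokens lets later steps treat negation and auxiliary verbs as words in their own right.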

In [53]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize(corpus)

['Hello',
 '!',
 ',',
 'My',
 'Name',
 'is',
 'Karthikraja',
 ',',
 'Doing',
 'Data',
 'Scientist',
 'and',
 'AI',
 'Engineer.',
 'I',
 'am',
 'excel',
 'in',
 'Building',
 'AI',
 'Applications.',
 'And',
 ',',
 'I',
 "'m",
 'a',
 'Civil',
 'Engineer']



---




## Stemming

**Stemming** is a text normalization technique in NLP that reduces inflected (or sometimes derived) words to their **root or base form**, often called a "stem." This process primarily involves **removing affixes (suffixes and prefixes)** to group together different forms of a word (e.g., "running," "runs," "runner" might all reduce to "run"). While important for **Natural Language Understanding (NLU)** and **Natural Language Processing (NLP)**, it's a heuristic process that doesn't guarantee the resulting stem is a grammatically correct word.
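
To show what "heuristic" means here, the following toy suffix stripper removes the first matching suffix from a short, made-up list. It is only a sketch of the general idea; real stemmers such as Porter apply many more rules and conditions, and the function name `naive_stem` is hypothetical.

```python
def naive_stem(word, suffixes=("ing", "es", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["eating", "eats", "running", "history"]:
    print(w, "--->", naive_stem(w))
# eating ---> eat
# eats ---> eat
# running ---> runn
# history ---> history
```

The result "runn" is not a real word, which illustrates why a stem is only a grouping key and not a dictionary form.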

In [60]:
## Classification problem:
## decide whether a product comment is a positive or a negative review
## Reviews ----> [eating, eats, eaten] ---> eat, [going, gone, goes] ---> go
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized","running","runs"]

## PorterStemmer

In [61]:
from nltk.stem import PorterStemmer

In [62]:
stemming = PorterStemmer()

In [63]:

for word in words:
  print(word, "--->",stemming.stem(word))

eating ---> eat
eats ---> eat
eaten ---> eaten
writing ---> write
writes ---> write
programming ---> program
programs ---> program
history ---> histori
finally ---> final
finalized ---> final
running ---> run
runs ---> run


Note that `history ---> histori`: the stem is not a real word. Stemming can distort or lose a word's meaning, which is its main drawback.

In [64]:
# another example
stemming.stem("Congratulations")

'congratul'



---



## `RegexpStemmer` (NLTK)

The `RegexpStemmer` class in NLTK lets you perform **custom stemming** using a **regular expression**.

Unlike algorithmic stemmers (like Porter or Snowball) that ship with predefined rules, `RegexpStemmer` takes a single regular expression as input; alternatives can be combined with `|`. Any part of a word that matches the expression is **removed** to produce the stem, so patterns are usually anchored with `^` or `$` to target prefixes or suffixes.

This provides **flexibility** to define very specific stemming logic, which can be useful for niche applications or languages where standard stemmers might not perform optimally. However, it requires a good understanding of regular expressions and the linguistic patterns you wish to target. The optional `min` argument sets the minimum length a word must have to be stemmed; shorter words are returned unchanged, which helps prevent over-stemming.

In [72]:
from nltk.stem import RegexpStemmer
reg_stemming = RegexpStemmer('ing$|s$|able$|e$')  # strip a trailing "ing", "s", "able" or "e"

In [73]:
reg_stemming.stem("eating")

'eat'

In [74]:
reg_stemming.stem("materials")

'material'
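
The behaviour of a regular-expression stemmer can be approximated with `re.sub`. The sketch below is a standard-library stand-in written for illustration, not NLTK's code; the function name `regexp_stem` and its `min_length` parameter are assumptions that mirror the idea of a minimum word length.

```python
import re

def regexp_stem(word, pattern, min_length=0):
    """Remove every match of `pattern`; leave words shorter than `min_length` alone."""
    if len(word) < min_length:
        return word
    return re.sub(pattern, "", word)

suffixes = r"ing$|s$|able$"
print(regexp_stem("eating", suffixes))        # eat
print(regexp_stem("materials", suffixes))     # material
print(regexp_stem("ingest", r"ing"))          # est  (unanchored pattern hits the prefix)
print(regexp_stem("is", r"s$", min_length=3))  # is   (too short to stem)
```

The third call shows why the `$` anchors matter: without them the pattern removes "ing" wherever it occurs in the word.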



---




## Snowball Stemmer

The **Snowball Stemmer** is an improved and often more accurate successor to the original Porter Stemmer. Its English algorithm is also known as the **Porter2 Stemmer**.

Key characteristics include:

* **Multi-language support:** Unlike the basic Porter Stemmer, Snowball offers stemming algorithms for a variety of languages beyond English (e.g., German, French, Spanish, Russian, etc.).
* **More aggressive/refined rules:** It often produces better stems than the original Porter Stemmer due to a more extensive and refined set of stemming rules.
* **Foundation:** It's built using the "Snowball" string processing language, specifically designed for creating stemming algorithms.
* **Efficiency:** It's designed to be efficient and fast, making it suitable for larger-scale NLP applications.

It's a popular choice in NLTK for tasks where a more robust and multilingual stemming solution is required.

In [77]:
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer("english")

for word in words:
  print(word+'---->'+snowball_stemmer.stem(word))

eating---->eat
eats---->eat
eaten---->eaten
writing---->write
writes---->write
programming---->program
programs---->program
history---->histori
finally---->final
finalized---->final
running---->run
runs---->run


In [78]:
# using the Porter stemmer
stemming.stem("fairly"),stemming.stem("sportingly")

('fairli', 'sportingli')

In [80]:
# by using snowball stemming
snowball_stemmer.stem("fairly"),snowball_stemmer.stem("sportingly")

('fair', 'sport')



---




## Lemmatization

**Lemmatization** in NLP is a text normalization technique that reduces words to their **base or dictionary form, known as a lemma**. Unlike stemming, lemmatization uses **linguistic knowledge** (like a dictionary or morphological analysis) to ensure that the resulting lemma is a grammatically correct word and has meaning. For example, "running," "runs," and "ran" would all be lemmatized to "run," while "better" would become "good." It is more computationally intensive but provides more accurate and meaningful root forms than stemming.

NLTK provides the `WordNetLemmatizer` class, a thin wrapper around the WordNet corpus. It uses the `morphy()` function of the WordNet corpus reader to find a lemma, as the examples below show.

**Uses:**

Lemmatization suits applications such as chatbots, machine translation, and deep text analysis, where preserving the precise dictionary form and semantic meaning of words is paramount for accurate understanding and generation.
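
The role of the part-of-speech tag can be sketched with a toy lookup table. This is only an illustration of dictionary-based lemmatization under the assumption of a tiny hand-written table; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and WordNet's real lookup is far richer.

```python
# Tiny hand-written dictionary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("going", "v"): "go",
    ("ran", "v"): "run",
    ("better", "a"): "good",
    ("programs", "n"): "program",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if (word, pos) is known, else the word unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("going"))        # going  (no noun entry, so unchanged)
print(toy_lemmatize("going", "v"))   # go
print(toy_lemmatize("better", "a"))  # good
```

The same surface form maps to different lemmas depending on the tag, so passing the wrong part of speech often leaves the word untouched.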

In [82]:
from nltk.stem import WordNetLemmatizer

In [86]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [87]:
wordnet_lemmatizer = WordNetLemmatizer()

wordnet_lemmatizer.lemmatize("going")

'going'

In [94]:
# POS tags accepted by lemmatize():
#   'n' = noun (default), 'v' = verb, 'a' = adjective, 'r' = adverb
wordnet_lemmatizer.lemmatize("going", 'v')

'go'

In [93]:
wordnet_lemmatizer.lemmatize("going",pos= 'n')

'going'

In [95]:
words=["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized","running","runs"]

for word in words:
  print(word+'---->'+wordnet_lemmatizer.lemmatize(word))

eating---->eating
eats---->eats
eaten---->eaten
writing---->writing
writes---->writes
programming---->programming
programs---->program
history---->history
finally---->finally
finalized---->finalized
running---->running
runs---->run


In [96]:
for word in words:
  print(word+'---->'+wordnet_lemmatizer.lemmatize(word, pos = 'n'))

eating---->eating
eats---->eats
eaten---->eaten
writing---->writing
writes---->writes
programming---->programming
programs---->program
history---->history
finally---->finally
finalized---->finalized
running---->running
runs---->run


In [97]:
for word in words:
  print(word+'---->'+wordnet_lemmatizer.lemmatize(word, pos = 'v'))

eating---->eat
eats---->eat
eaten---->eat
writing---->write
writes---->write
programming---->program
programs---->program
history---->history
finally---->finally
finalized---->finalize
running---->run
runs---->run


In [99]:
for word in words:
  print(word+'---->'+wordnet_lemmatizer.lemmatize(word, pos = 'r'))

eating---->eating
eats---->eats
eaten---->eaten
writing---->writing
writes---->writes
programming---->programming
programs---->programs
history---->history
finally---->finally
finalized---->finalized
running---->running
runs---->runs


In [106]:
wordnet_lemmatizer.lemmatize("fairly",pos='v'),wordnet_lemmatizer.lemmatize("fairly",pos= 'a'),wordnet_lemmatizer.lemmatize("sportingly")

('fairly', 'fairly', 'sportingly')


---

## Stopwords

**Stopwords** are very common words (like "the", "a", "is", "and", "in", "of") that appear frequently in almost any text but often carry **little to no unique meaning or analytical value** on their own, especially in tasks like text classification, information retrieval, or topic modeling.

### Why are they used (or removed)?

The primary reasons for identifying and often removing stopwords in NLP are:

1.  **Reduce Noise:** They clutter the data with highly frequent but non-informative terms, making it harder to identify the truly significant words.
2.  **Improve Efficiency:** By removing them, you reduce the vocabulary size and the dimensionality of your data, which speeds up processing and reduces computational costs for many NLP algorithms.
3.  **Enhance Performance/Accuracy:** For tasks focused on content words, removing stopwords helps algorithms focus on the more discriminative terms, potentially improving the accuracy of models in text classification, search relevance, or sentiment analysis.
4.  **Feature Reduction:** In machine learning models, each unique word often becomes a feature. Removing stopwords significantly reduces the number of features, leading to leaner and sometimes more robust models.

It's important to note that what constitutes a "stopword" can be context-dependent. While general lists exist (e.g., in NLTK), sometimes domain-specific stopwords are added or general ones are kept if they become important for a particular task (e.g., "not" for sentiment analysis).
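
As a minimal sketch of context-dependent filtering, the function below removes stopwords case-insensitively but lets the caller keep selected words such as "not". The short stopword set is made up for the example, and `remove_stopwords` is a hypothetical helper, not an NLTK function.

```python
# A tiny, hand-picked stopword set used only for this demonstration.
STOP_WORDS = {"the", "a", "is", "and", "in", "of", "this", "not"}

def remove_stopwords(tokens, stop_words=STOP_WORDS, keep=()):
    """Drop stopwords (case-insensitive) unless they are listed in `keep`."""
    keep = set(keep)
    return [t for t in tokens if t.lower() not in stop_words or t.lower() in keep]

tokens = ["This", "movie", "is", "not", "good"]
print(remove_stopwords(tokens))                # ['movie', 'good']
print(remove_stopwords(tokens, keep={"not"}))  # ['movie', 'not', 'good']
```

Without the `keep` override the negative review collapses to "movie good", which shows how a generic list can reverse the sentiment of a sentence.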

In [110]:
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [112]:
from nltk.corpus import stopwords

In [114]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [115]:
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 '

In [201]:
## Speech Of DR APJ Abdul Kalam
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my
               first vision is that of freedom. I believe that India got its first vision of
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India
               stands up to the world, no one will respect us. Only strength respects strength. We must be
               strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career"""

In [202]:
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()

In [203]:
from nltk.tokenize import sent_tokenize

In [204]:
sentences = sent_tokenize(paragraph)

In [183]:
type(sentences)

list

In [184]:
sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my \n               first vision is that of freedom.',
 'I believe that India got its first vision of \n               this in 1857, when we started the War of Independence.',
 'It is this freedom that\n               we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s developme

In [185]:
len(sentences)

31

In [186]:
## Tokenize each sentence into words
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  print(words)

['I', 'have', 'three', 'visions', 'for', 'India', '.']
['In', '3000', 'years', 'of', 'our', 'history', ',', 'people', 'from', 'all', 'over', 'the', 'world', 'have', 'come', 'and', 'invaded', 'us', ',', 'captured', 'our', 'lands', ',', 'conquered', 'our', 'minds', '.']
['From', 'Alexander', 'onwards', ',', 'the', 'Greeks', ',', 'the', 'Turks', ',', 'the', 'Moguls', ',', 'the', 'Portuguese', ',', 'the', 'British', ',', 'the', 'French', ',', 'the', 'Dutch', ',', 'all', 'of', 'them', 'came', 'and', 'looted', 'us', ',', 'took', 'over', 'what', 'was', 'ours', '.']
['Yet', 'we', 'have', 'not', 'done', 'this', 'to', 'any', 'other', 'nation', '.']
['We', 'have', 'not', 'conquered', 'anyone', '.']
['We', 'have', 'not', 'grabbed', 'their', 'land', ',', 'their', 'culture', ',', 'their', 'history', 'and', 'tried', 'to', 'enforce', 'our', 'way', 'of', 'life', 'on', 'them', '.']
['Why', '?']
['Because', 'we', 'respect', 'the', 'freedom', 'of', 'others.That', 'is', 'why', 'my', 'first', 'vision', 'is'

In [187]:
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer('english')

In [188]:
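## Remove stopwords, then apply Porter stemming
## NOTE: the stopword check is case-sensitive, so capitalized stopwords such as "I" or "We" are kept
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  words = [porter_stemmer.stem(word) for word in words if word not in stop_words]
  sentences[i] = ' '.join(words)  # join the stemmed words back into a sentence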

In [189]:
sentences

['i three vision india .',
 'in 3000 year histori , peopl world come invad us , captur land , conquer mind .',
 'from alexand onward , greek , turk , mogul , portugues , british , french , dutch , came loot us , took .',
 'yet done nation .',
 'we conquer anyon .',
 'we grab land , cultur , histori tri enforc way life .',
 'whi ?',
 'becaus respect freedom others.that first vision freedom .',
 'i believ india got first vision 1857 , start war independ .',
 'it freedom must protect nurtur build .',
 'if free , one respect us .',
 'my second vision india ’ develop .',
 'for fifti year develop nation .',
 'it time see develop nation .',
 'we among top 5 nation world term gdp .',
 'we 10 percent growth rate area .',
 'our poverti level fall .',
 'our achiev global recognis today .',
 'yet lack self-confid see develop nation , self-reli self-assur .',
 'isn ’ incorrect ?',
 'i third vision .',
 'india must stand world .',
 'becaus i believ unless india stand world , one respect us .',
 'onl

In [205]:
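## Remove stopwords, then apply Snowball stemming
## NOTE: sent_tokenize returns whole sentences, not words, so each "word" below is
## an entire sentence. The stopword filter never matches and the stemmer does little
## more than lowercase the text. Use nltk.word_tokenize here to stem word by word.
for i in range(len(sentences)):
  words = nltk.sent_tokenize(sentences[i])
  stem_words = [snowball_stemmer.stem(word) for word in words if word not in set(stopwords.words('english'))]
  sentences[i] = ' '.join(stem_words)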
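
The loop above passes each sentence to `nltk.sent_tokenize`, so the stemmer receives whole sentences instead of words. A word-level version of the same pipeline might look like the sketch below. It uses only the standard library so that it runs on its own: the regex is a stand-in for `nltk.word_tokenize`, the stopword set is made up, and `preprocess` and its `normalize` parameter are hypothetical names.

```python
import re

def preprocess(sentence, stop_words, normalize=str.lower):
    """Tokenize into words, drop stopwords, normalize each kept token, re-join."""
    # Regex stand-in for nltk.word_tokenize (an assumption for this sketch).
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    # Compare in lowercase so capitalized stopwords are removed too.
    kept = [tok for tok in tokens if tok.lower() not in stop_words]
    return " ".join(normalize(tok) for tok in kept)

stop_words = {"we", "have", "not"}  # tiny made-up list for the demo
print(preprocess("We have not conquered anyone.", stop_words))
# conquered anyone .
```

In the notebook, `snowball_stemmer.stem` could be passed as `normalize` and `set(stopwords.words('english'))` as `stop_words` to apply the same steps with NLTK's stemmer and stopword list.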

In [206]:
sentences

['i have three visions for india.',
 'in 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'from alexander onwards, the greeks, the turks, the moguls, the portuguese, the british,\n               the french, the dutch, all of them came and looted us, took over what was ours.',
 'yet we have not done this to any other nation.',
 'we have not conquered anyone.',
 'we have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'why?',
 'because we respect the freedom of others.that is why my \n               first vision is that of freedom.',
 'i believe that india got its first vision of \n               this in 1857, when we started the war of independence.',
 'it is this freedom that\n               we must protect and nurture and build on.',
 'if we are not free, no one will respect us.',
 "my second vision for india's developme

In [247]:
## Speech Of DR APJ Abdul Kalam
paragraph = """I have three visions for India. In 3000 years of our history, people from all over
               the world have come and invaded us, captured our lands, conquered our minds.
               From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,
               the French, the Dutch, all of them came and looted us, took over what was ours.
               Yet we have not done this to any other nation. We have not conquered anyone.
               We have not grabbed their land, their culture,
               their history and tried to enforce our way of life on them.
               Why? Because we respect the freedom of others.That is why my
               first vision is that of freedom. I believe that India got its first vision of
               this in 1857, when we started the War of Independence. It is this freedom that
               we must protect and nurture and build on. If we are not free, no one will respect us.
               My second vision for India’s development. For fifty years we have been a developing nation.
               It is time we see ourselves as a developed nation. We are among the top 5 nations of the world
               in terms of GDP. We have a 10 percent growth rate in most areas. Our poverty levels are falling.
               Our achievements are being globally recognised today. Yet we lack the self-confidence to
               see ourselves as a developed nation, self-reliant and self-assured. Isn’t this incorrect?
               I have a third vision. India must stand up to the world. Because I believe that unless India
               stands up to the world, no one will respect us. Only strength respects strength. We must be
               strong not only as a military power but also as an economic power. Both must go hand-in-hand.
               My good fortune was to have worked with three great minds. Dr. Vikram Sarabhai of the Dept. of
               space, Professor Satish Dhawan, who succeeded him and Dr. Brahm Prakash, father of nuclear material.
               I was lucky to have worked with all three of them closely and consider this the great opportunity of my life.
               I see four milestones in my career"""

In [248]:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(paragraph)
sentences

['I have three visions for India.',
 'In 3000 years of our history, people from all over \n               the world have come and invaded us, captured our lands, conquered our minds.',
 'From Alexander onwards, the Greeks, the Turks, the Moguls, the Portuguese, the British,\n               the French, the Dutch, all of them came and looted us, took over what was ours.',
 'Yet we have not done this to any other nation.',
 'We have not conquered anyone.',
 'We have not grabbed their land, their culture, \n               their history and tried to enforce our way of life on them.',
 'Why?',
 'Because we respect the freedom of others.That is why my \n               first vision is that of freedom.',
 'I believe that India got its first vision of \n               this in 1857, when we started the War of Independence.',
 'It is this freedom that\n               we must protect and nurture and build on.',
 'If we are not free, no one will respect us.',
 'My second vision for India’s developme

In [254]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

In [255]:
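## Remove stopwords, then lemmatize each remaining word as a verb
## NOTE: the stopword check runs on the original token and is case-sensitive,
## so capitalized stopwords such as "I" or "We" are only removed if the text is already lowercase
stop_words = set(stopwords.words('english'))  # build the set once, not once per word
for i in range(len(sentences)):
  words = nltk.word_tokenize(sentences[i])
  lemma_word = [wordnet_lemmatizer.lemmatize(word.lower(), pos='v') for word in words if word not in stop_words]
  sentences[i] = ' '.join(lemma_word)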

In [256]:
sentences

['three visions india .',
 '3000 years history , people world come invade us , capture land , conquer mind .',
 'alexander onwards , greeks , turks , moguls , portuguese , british , french , dutch , come loot us , take .',
 'yet nation .',
 'conquer anyone .',
 'grab land , culture , history try enforce way life .',
 '?',
 'respect freedom others.that first vision freedom .',
 'believe india get first vision 1857 , start war independence .',
 'freedom must protect nurture build .',
 'free , one respect us .',
 'second vision india ’ development .',
 'fifty years develop nation .',
 'time see develop nation .',
 'among top 5 nations world term gdp .',
 '10 percent growth rate areas .',
 'poverty level fall .',
 'achievements globally recognise today .',
 'yet lack self-confidence see develop nation , self-reliant self-assured .',
 '’ incorrect ?',
 'third vision .',
 'india must stand world .',
 'believe unless india stand world , one respect us .',
 'strength respect strength .',
 'mus