***Stemming:***<br>
Stemming is the process of reducing inflected words to their word stem. It strips word endings by rule so that words with a similar meaning can be grouped under one stem, which need not be a dictionary word. Idealized examples (the exact stems depend on the stemming algorithm):<br>
1. Final, Finally, Finalized ==> Fina (After stemming)<br>
2. History, Historical ==> Histori <br>
3. Going, goes, gone ==> go
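
To make the idea of rule-based suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm: the suffix list, the function name `toy_stem`, and the minimum stem length are all assumptions chosen for this illustration.

```python
# Toy suffix-stripping stemmer (NOT the Porter algorithm).
# The suffix list and the minimum stem length are illustrative assumptions.
SUFFIXES = ("ized", "ical", "ing", "ly", "es", "ed", "y", "s")  # longer suffixes first


def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["Final", "Finally", "Finalized", "History", "Historical", "Going", "goes", "gone"]:
    print(f"{w:>10} -> {toy_stem(w)}")
```

With these rules, "History" and "Historical" both become "histor", which is not a dictionary word, and the irregular form "gone" is left untouched because no suffix rule applies to it.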

***Lemmatization:***<br>
Lemmatization is similar to stemming, but it maps each word to its dictionary form (the lemma), so the result is a meaningful word, whereas a stem may not be. Idealized examples (a real lemmatizer also needs the part of speech of each word and may leave some of these forms unchanged):<br>
1. History, Historical ==> History (After Lemmatization)<br>
2. Final, Finally, Finalized ==> Final<br>
3. Going, goes, gone ==> go
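
The difference from stemming can be sketched with a toy dictionary lookup. The table `LEMMA_TABLE` and the function `toy_lemmatize` are hand-written assumptions for this example; a real lemmatizer consults a full lexicon such as WordNet.

```python
# Toy dictionary-based lemmatizer. The lookup table is a hand-written
# assumption; real lemmatizers use a full lexicon.
LEMMA_TABLE = {
    "going": "go",
    "goes": "go",
    "gone": "go",
    "went": "go",
}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the lowercased word."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


print([toy_lemmatize(w) for w in ["Going", "goes", "gone", "went"]])
# ['go', 'go', 'go', 'go']
```

Because the mapping comes from a dictionary rather than from suffix rules, irregular forms such as "gone" and "went" can be handled, and every result is a real word.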

***So why do we need these methods?***<br>
We use these methods when classifying text, for example when classifying reviews by sentiment. A retailer such as Amazon could, for instance, reduce its review text to stem words and then predict the number of positive and negative reviews.

In [1]:
# Lemmatization generally takes more time than stemming, because it looks words up in a dictionary instead of applying suffix rules

In [2]:
# Stemming suits tasks such as spam classification (e.g. in Gmail) and positive/negative sentiment classification,
# because all that is needed there are the stem words

In [3]:
# Lemmatization suits chatbots and question-answering systems, because the response the system produces
# has to be meaningful

***Stop Words:***<br>
Stop words are words that have little impact on the meaning of a sentence for tasks such as positive/negative sentiment analysis. Examples: I, me, he, they, a, the, my, your, of, from, and. In most scenarios we remove stop words because they do not affect the sentiment, but there are a few cases where they do: dropping a negation such as "not", for example, turns "not good" into "good".
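
A minimal sketch of stop-word removal, and of the negation pitfall, is shown below. The stop list `STOP_WORDS` is a small hand-picked assumption for this example, not NLTK's list, and `remove_stop_words` is a hypothetical helper.

```python
# A small hand-picked stop list for illustration (an assumption, not NLTK's list).
STOP_WORDS = {"i", "the", "a", "is", "this", "it", "was", "not", "and", "of"}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split on whitespace, and drop tokens found in stop_words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


review = "The movie was not good"
print(remove_stop_words(review, STOP_WORDS))            # ['movie', 'good']
print(remove_stop_words(review, STOP_WORDS - {"not"}))  # ['movie', 'not', 'good']
```

The first call loses the negation and makes the review look positive. Keeping "not" out of the stop list, as in the second call, is one simple way to preserve it.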

# Stemming

In [4]:
paragraph="Mathematical analysis formally developed in the 17th century during the Scientific Revolution,but many of its ideas can be traced back to earlier mathematicians. Early results in analysis were implicitly present in the early days of ancient Greek mathematics. For instance, an infinite geometric sum is implicit in Zeno's paradox of the dichotomy. Later, Greek mathematicians such as Eudoxus and Archimedes made more explicit, but informal, use of the concepts of limits and convergence when they used the method of exhaustion to compute the area and volume of regions and solids. The explicit use of infinitesimals appears in Archimedes' The Method of Mechanical Theorems, a work rediscovered in the 20th century. In Asia, the Chinese mathematician Liu Hui used the method of exhaustion in the 3rd century AD to find the area of a circle. Zu Chongzhi established a method that would later be called Cavalieri's principle to find the volume of a sphere in the 5th century. The Indian mathematician Bhāskara II gave examples of the derivative and used what is now known as Rolle's theorem in the 12th century."

In [5]:
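import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

# One-time setup if the NLTK data is missing (resource names can vary by NLTK version):
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')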

In [10]:
dir(stopwords)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_encoding',
 '_fileids',
 '_get_root',
 '_root',
 '_tagset',
 '_unload',
 'abspath',
 'abspaths',
 'citation',
 'encoding',
 'ensure_loaded',
 'fileids',
 'license',
 'open',
 'raw',
 'readme',
 'root',
 'words']

In [14]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [6]:
stemmer=PorterStemmer()

In [7]:
sentences=nltk.sent_tokenize(paragraph)

In [17]:
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Drop stop words, then stem what is left. The check is case-sensitive,
    # so capitalized tokens such as "The" or "In" are not removed.
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    # Replace the original sentence with its stemmed words joined back together
    sentences[i] = ' '.join(words)

In [18]:
sentences

['mathemat analysi formal develop 17th centuri scientif revolut , mani idea trace back earlier mathematician .',
 'earli result analysi implicitli present earli day ancient greek mathemat .',
 "for instanc , infinit geometr sum implicit zeno 's paradox dichotomi .",
 'later , greek mathematician eudoxu archimed made explicit , inform , use concept limit converg use method exhaust comput area volum region solid .',
 "the explicit use infinitesim appear archimed ' the method mechan theorem , work rediscov 20th centuri .",
 'In asia , chines mathematician liu hui use method exhaust 3rd centuri AD find area circl .',
 "Zu chongzhi establish method would later call cavalieri 's principl find volum sphere 5th centuri .",
 "the indian mathematician bhāskara II gave exampl deriv use known roll 's theorem 12th centuri ."]

In [19]:
# Now you can see that the remaining words are stemmed. Many stems (e.g. 'analysi', 'centuri') are not dictionary words.
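
In the output above, "for", "the" and "In" survive at the start of some sentences because the stop-word check compares tokens case-sensitively against a lowercase list. The sketch below shows one possible variant that lowercases each token before the check. The `preprocess` function, the regex tokenizer, and the small stop list are assumptions made to keep the example self-contained; in the notebook, a callable such as `PorterStemmer().stem` and NLTK's stop list could be passed in instead.

```python
import re


def preprocess(sentence, normalize, stop_words):
    """Tokenize, drop stop words case-insensitively, then normalize each kept token."""
    # Simple stand-in tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return " ".join(normalize(tok) for tok in tokens if tok.lower() not in stop_words)


demo_stop_words = {"the", "in", "of", "a", "to"}  # assumed small list for the demo
text = "In Asia, the Chinese mathematician Liu Hui used the method of exhaustion."
print(preprocess(text, str.lower, demo_stop_words))
# asia , chinese mathematician liu hui used method exhaustion .
```

Here `str.lower` stands in for the stemmer so that the example runs without NLTK. The point of the sketch is the `tok.lower()` comparison, which removes "In" as well as "the".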

# Lemmatization

In [28]:
sentences_lemma=nltk.sent_tokenize(paragraph)
sentences_lemma

['Mathematical analysis formally developed in the 17th century during the Scientific Revolution,but many of its ideas can be traced back to earlier mathematicians.',
 'Early results in analysis were implicitly present in the early days of ancient Greek mathematics.',
 "For instance, an infinite geometric sum is implicit in Zeno's paradox of the dichotomy.",
 'Later, Greek mathematicians such as Eudoxus and Archimedes made more explicit, but informal, use of the concepts of limits and convergence when they used the method of exhaustion to compute the area and volume of regions and solids.',
 "The explicit use of infinitesimals appears in Archimedes' The Method of Mechanical Theorems, a work rediscovered in the 20th century.",
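 'In Asia, the Chinese mathematician Liu Hui used the method of exhaustion in the 3rd century AD to find the area of a circle.',
 "Zu Chongzhi established a method that would later be called Cavalieri's principle to find the volume of a sphere in the 5th century.",
 "The Indian mathematician Bhāskara II gave examples of the derivative and used what is now known as Rolle's theorem in the 12th century."]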

In [23]:
from nltk.stem import WordNetLemmatizer

In [24]:
lemmatizer=WordNetLemmatizer()

In [29]:
stop_words=set(stopwords.words('english'))  # build the stop-word set once
for i in range(len(sentences_lemma)):
    words_lemma=nltk.word_tokenize(sentences_lemma[i])
    # lemmatize() treats every word as a noun unless a pos argument is given
    words_lemma=[lemmatizer.lemmatize(word_lemma) for word_lemma in words_lemma if word_lemma not in stop_words]
    sentences_lemma[i]=' '.join(words_lemma)

In [30]:
sentences_lemma

['Mathematical analysis formally developed 17th century Scientific Revolution , many idea traced back earlier mathematician .',
 'Early result analysis implicitly present early day ancient Greek mathematics .',
 "For instance , infinite geometric sum implicit Zeno 's paradox dichotomy .",
 'Later , Greek mathematician Eudoxus Archimedes made explicit , informal , use concept limit convergence used method exhaustion compute area volume region solid .',
 "The explicit use infinitesimal appears Archimedes ' The Method Mechanical Theorems , work rediscovered 20th century .",
 'In Asia , Chinese mathematician Liu Hui used method exhaustion 3rd century AD find area circle .',
 "Zu Chongzhi established method would later called Cavalieri 's principle find volume sphere 5th century .",
 "The Indian mathematician Bhāskara II gave example derivative used known Rolle 's theorem 12th century ."]

In [31]:
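# Now you can see that our sentences are lemmatized. Plural nouns are reduced (ideas -> idea), but verb forms
# such as 'used' and 'traced' are unchanged, because lemmatize() treats each word as a noun by default.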
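
The output above keeps "developed", "traced" and "used" as they are while "ideas" becomes "idea". The toy sketch below illustrates why the part of speech matters: the same word form can have an entry as a verb but none as a noun. The table `LEMMAS` and the function `toy_lemmatize_pos` are hand-written assumptions for this illustration, not NLTK's implementation.

```python
# Toy part-of-speech-aware lemmatizer. The table is a hand-written assumption
# that mimics looking a word up under a part of speech ("n" noun, "v" verb).
LEMMAS = {
    ("ideas", "n"): "idea",
    ("mathematicians", "n"): "mathematician",
    ("used", "v"): "use",
    ("traced", "v"): "trace",
    ("developed", "v"): "develop",
}


def toy_lemmatize_pos(word, pos="n"):
    """Look the word up under the given part of speech; return it unchanged if absent."""
    return LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize_pos("ideas"))          # idea
print(toy_lemmatize_pos("used"))           # used  (no noun entry, so unchanged)
print(toy_lemmatize_pos("used", pos="v"))  # use
```

With NLTK, the analogous step is to pass a part of speech explicitly, for example `WordNetLemmatizer().lemmatize("used", pos="v")`, usually after running a part-of-speech tagger over the tokens.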