### Stemming
Stemming is the process of reducing a word to its stem, the base form to which prefixes and suffixes attach. Unlike a lemma, a stem is not guaranteed to be a valid dictionary word. Stemming is a common preprocessing step in natural language understanding (NLU) and natural language processing (NLP).

### RegexpStemmer class

NLTK provides the RegexpStemmer class for building regular-expression stemmers. It takes a single regular expression and removes any prefix or suffix that matches it. The optional min argument sets the minimum word length; shorter words are returned unchanged.

### Porter stemmer

In [1]:
from nltk.stem import PorterStemmer

In [2]:
stemming=PorterStemmer()

In [3]:
Words= ['eating','eats','eaten','writing','programs','programming']

In [4]:
for word in Words:
    print(word+"------>"+stemming.stem(word))

eating------>eat
eats------>eat
eaten------>eaten
writing------>write
programs------>program
programming------>program


In [5]:
stemming.stem('congratulations')

'congratul'

In [6]:
stemming.stem('sitting')

'sit'

In [7]:
from nltk.stem import RegexpStemmer

In [8]:
reg_stem=RegexpStemmer('ing$|s$|e$|able$', min=4)

In [9]:
reg_stem.stem("eating")

'eat'

In [10]:
reg_stem.stem("ingeating")

'ingeat'
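
The behaviour of the regular-expression stemmer shown above can be approximated with the standard `re` module alone. This is a minimal sketch, not NLTK's code: the function name `regexp_stem` and the parameter `min_length` are made up for illustration. It shows why "ingeating" keeps its leading "ing": every alternative in the pattern is anchored with `$`, so only the end of the word can match.

```python
import re

def regexp_stem(word, pattern=r"ing$|s$|e$|able$", min_length=4):
    """Remove whatever `pattern` matches, unless the word is shorter than `min_length`.

    Hypothetical helper that mimics the idea behind a regexp stemmer.
    """
    if len(word) < min_length:
        return word          # too short: leave untouched
    return re.sub(pattern, "", word)

for w in ["eating", "ingeating", "cars", "is"]:
    print(w, "->", regexp_stem(w))
```

Because the rule is purely textual, it also strips endings that are not real suffixes, which is the main limitation of this kind of stemmer.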

### Snowball Stemmer

In [11]:
from nltk.stem import SnowballStemmer  

The Snowball stemmer (also known as Porter2) generally produces better stems than the original Porter stemmer, and lemmatization generally produces more accurate results than stemming.

In [12]:
stemmm= SnowballStemmer("english")

In [13]:
for word in Words:
    print(word+"------>"+stemmm.stem(word))

eating------>eat
eats------>eat
eaten------>eaten
writing------>write
programs------>program
programming------>program


In [14]:
stemming.stem('fairly')

'fairli'

In [15]:
stemmm.stem("Fairly"),stemmm.stem("goes"),stemmm.stem("going")

('fair', 'goe', 'go')
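
To see where the two stemmers disagree, it can help to print their output side by side. The sketch below assumes NLTK is installed (neither stemmer needs a corpus download); the helper name `compare_stemmers` is hypothetical.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) for each input word."""
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]

for word, p, s in compare_stemmers(["fairly", "goes", "going", "eaten"]):
    flag = "" if p == s else "  <- differs"
    print(f"{word:10} porter={p:10} snowball={s:10}{flag}")
```

Neither stemmer maps "eaten" to "eat", since both apply suffix rules and know nothing about irregular verb forms.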

### Wordnet Lemmatizer

Lemmatization is similar to stemming, but its output, called a lemma, is a valid root word rather than a root stem. After lemmatization we get a valid dictionary word that means the same thing.

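NLTK provides the WordNetLemmatizer class, a thin wrapper around the WordNet corpus. It uses the morphy() function of the WordNetCorpusReader class to find a lemma. The pos argument tells it which part of speech to assume ('n' for noun, 'v' for verb, 'a' for adjective, 'r' for adverb) and defaults to noun.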
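
The general idea behind a morphy-style lookup can be sketched without WordNet: check a list of irregular forms, then try suffix-detachment rules and accept a candidate only if it is a known word. Everything below is a toy under stated assumptions: the lexicon, the exception table, and the function name `toy_lemmatize_verb` are hypothetical, and the rules are only loosely modelled on WordNet's verb rules.

```python
# Hypothetical miniature data; the real WordNet lexicon and exception lists are far larger.
VERB_LEXICON = {"eat", "go", "take", "write", "program"}
VERB_EXCEPTIONS = {"eaten": "eat", "took": "take", "programming": "program"}

# (suffix, replacement) detachment rules, tried in order.
VERB_RULES = [("ies", "y"), ("es", "e"), ("es", ""), ("ed", "e"),
              ("ed", ""), ("ing", "e"), ("ing", ""), ("s", "")]

def toy_lemmatize_verb(word):
    """Return a lemma from the toy lexicon, or the word unchanged if none is found."""
    word = word.lower()
    if word in VERB_LEXICON:          # already a base form
        return word
    if word in VERB_EXCEPTIONS:       # irregular form
        return VERB_EXCEPTIONS[word]
    for suffix, replacement in VERB_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            if candidate in VERB_LEXICON:   # only accept real words
                return candidate
    return word

for w in ["eating", "eats", "eaten", "writing", "goes", "xyzzing"]:
    print(w, "->", toy_lemmatize_verb(w))
```

The lexicon check is what separates this from stemming: a candidate such as "goe" is rejected because it is not a known word, so "goes" resolves to "go".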

##### Use cases: Q&A, chatbots, text summarization.

In [16]:
from nltk.stem import WordNetLemmatizer

In [17]:
Lemma=WordNetLemmatizer()

In [18]:
Lemma.lemmatize('goes')

'go'

In [19]:
Lemma.lemmatize('goes',pos='r')

'goes'

In [20]:
for word in Words:
    print(word+"------>"+Lemma.lemmatize(word,pos='v'))

eating------>eat
eats------>eat
eaten------>eat
writing------>write
programs------>program
programming------>program


Lemmatization is slower than stemming because it looks words up in the WordNet corpus instead of only applying suffix rules.

## StopWords

In [21]:
Paragraph = '''In 1991, he took over as chairman from JRD Tata. Under his chairmanship, the company saw mergers like Land Rover Jaguar's merger with Tata Motors, Corus's merger with Tata Steel, Tetley's merger with Tata Tea, Brunner Mond, General Chemical Industrial Products and Daewoo. He launched the Tata Nano, India’s most affordable car at Rs. 1 lakh.'''

In [22]:
Paragraph

"In 1991, he took over as chairman from JRD Tata. Under his chairmanship, the company saw mergers like Land Rover Jaguar's merger with Tata Motors, Corus's merger with Tata Steel, Tetley's merger with Tata Tea, Brunner Mond, General Chemical Industrial Products and Daewoo. He launched the Tata Nano, India’s most affordable car at Rs. 1 lakh."

In [23]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\priya\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [25]:
stopwords.words('german')

['aber',
 'alle',
 'allem',
 'allen',
 'aller',
 'alles',
 'als',
 'also',
 'am',
 'an',
 'ander',
 'andere',
 'anderem',
 'anderen',
 'anderer',
 'anderes',
 'anderm',
 'andern',
 'anderr',
 'anders',
 'auch',
 'auf',
 'aus',
 'bei',
 'bin',
 'bis',
 'bist',
 'da',
 'damit',
 'dann',
 'der',
 'den',
 'des',
 'dem',
 'die',
 'das',
 'dass',
 'daß',
 'derselbe',
 'derselben',
 'denselben',
 'desselben',
 'demselben',
 'dieselbe',
 'dieselben',
 'dasselbe',
 'dazu',
 'dein',
 'deine',
 'deinem',
 'deinen',
 'deiner',
 'deines',
 'denn',
 'derer',
 'dessen',
 'dich',
 'dir',
 'du',
 'dies',
 'diese',
 'diesem',
 'diesen',
 'dieser',
 'dieses',
 'doch',
 'dort',
 'durch',
 'ein',
 'eine',
 'einem',
 'einen',
 'einer',
 'eines',
 'einig',
 'einige',
 'einigem',
 'einigen',
 'einiger',
 'einiges',
 'einmal',
 'er',
 'ihn',
 'ihm',
 'es',
 'etwas',
 'euer',
 'eure',
 'eurem',
 'euren',
 'eurer',
 'eures',
 'für',
 'gegen',
 'gewesen',
 'hab',
 'habe',
 'haben',
 'hat',
 'hatte',
 'hatten',
 ...]

In [26]:
stemmer=PorterStemmer()

In [27]:
Sentence=nltk.sent_tokenize(Paragraph)

In [28]:
Sentence

['In 1991, he took over as chairman from JRD Tata.',
 "Under his chairmanship, the company saw mergers like Land Rover Jaguar's merger with Tata Motors, Corus's merger with Tata Steel, Tetley's merger with Tata Tea, Brunner Mond, General Chemical Industrial Products and Daewoo.",
 'He launched the Tata Nano, India’s most affordable car at Rs.',
 '1 lakh.']

In [29]:
type(Sentence)

list

In [30]:
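## Remove stop words, then apply Porter stemming.
## Note: Sentence is overwritten in place, and the stop-word check is case-sensitive,
## so capitalized words such as 'In' and 'He' are kept (and lowercased by the stemmer).

stop_words=set(stopwords.words('english'))  ## build the set once, not once per word

for i in range(len(Sentence)):
    words=nltk.word_tokenize(Sentence[i])
    words=[stemmer.stem(word) for word in words if word not in stop_words]
    Sentence[i]=' '.join(words)  ## join the stemmed words back into a sentence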

In [31]:
Sentence

['in 1991 , took chairman jrd tata .',
 "under chairmanship , compani saw merger like land rover jaguar 's merger tata motor , coru 's merger tata steel , tetley 's merger tata tea , brunner mond , gener chemic industri product daewoo .",
 'he launch tata nano , india ’ afford car rs .',
 '1 lakh .']

In [32]:
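## Remove stop words, then apply Snowball stemming.
## Note: Sentence already holds the Porter-stemmed text from the previous cell,
## so this pass runs on stemmed input rather than on the original paragraph.

stemmm= SnowballStemmer("english")
stop_words=set(stopwords.words('english'))  ## build the set once, not once per word

for i in range(len(Sentence)):
    words=nltk.word_tokenize(Sentence[i])
    words=[stemmm.stem(word) for word in words if word not in stop_words]
    Sentence[i]=' '.join(words)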

In [33]:
Sentence

['1991 , took chairman jrd tata .',
 "chairmanship , compani saw merger like land rover jaguar 's merger tata motor , coru 's merger tata steel , tetley 's merger tata tea , brunner mond , gener chemic industri product daewoo .",
 'launch tata nano , india ’ afford car rs .',
 '1 lakh .']

In [34]:
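## Remove stop words, then apply lemmatization.
## Note: Sentence still holds the stemmed text from the previous cells, which is why
## non-words such as 'compani' survive; re-run sent_tokenize(Paragraph) to start from raw text.

Lemma=WordNetLemmatizer()
stop_words=set(stopwords.words('english'))  ## build the set once, not once per word

for i in range(len(Sentence)):
    words=nltk.word_tokenize(Sentence[i])
    words=[Lemma.lemmatize(word,pos='v') for word in words if word not in stop_words]
    Sentence[i]=' '.join(words)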

In [35]:
Sentence

['1991 , take chairman jrd tata .',
 "chairmanship , compani saw merger like land rover jaguar 's merger tata motor , coru 's merger tata steel , tetley 's merger tata tea , brunner mond , gener chemic industri product daewoo .",
 'launch tata nano , india ’ afford car rs .',
 '1 lakh .']
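
The three passes above each overwrite `Sentence`, so every later pass sees the output of the earlier one. One way to avoid this, sketched here under assumptions, is a helper that returns a new list and takes the normalizer as an argument. The name `normalize_sentences` is hypothetical, and `str.split`, `str.upper`, and the tiny stop-word set are stand-ins so the sketch runs without NLTK data. With NLTK you would pass `nltk.word_tokenize`, a function such as `stemmer.stem`, and `set(stopwords.words('english'))` instead.

```python
def normalize_sentences(sentences, normalizer, stop_words, tokenizer=str.split):
    """Return new sentences with stop words dropped and remaining tokens normalized.

    The input list is never modified, so several normalizers can be
    compared on the same raw text.
    """
    result = []
    for sentence in sentences:
        tokens = tokenizer(sentence)
        kept = [normalizer(tok) for tok in tokens if tok not in stop_words]
        result.append(" ".join(kept))
    return result

raw = ["he took over as chairman", "the company saw mergers"]
stop_words = {"he", "over", "as", "the"}  # hypothetical subset of a stop-word list

print(normalize_sentences(raw, str.upper, stop_words))
print(normalize_sentences(raw, str.title, stop_words))
print(raw)  # still the original sentences
```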

In [36]:
Stopwords=set(stopwords.words('english'))

In [38]:
words=nltk.word_tokenize(Paragraph)

In [40]:
words_filtered=[]

In [41]:
for w in words:
    if w not in Stopwords:  # case-sensitive: 'In', 'Under' and 'He' are not removed
        words_filtered.append(w)

In [42]:
print(words_filtered)

['In', '1991', ',', 'took', 'chairman', 'JRD', 'Tata', '.', 'Under', 'chairmanship', ',', 'company', 'saw', 'mergers', 'like', 'Land', 'Rover', 'Jaguar', "'s", 'merger', 'Tata', 'Motors', ',', 'Corus', "'s", 'merger', 'Tata', 'Steel', ',', 'Tetley', "'s", 'merger', 'Tata', 'Tea', ',', 'Brunner', 'Mond', ',', 'General', 'Chemical', 'Industrial', 'Products', 'Daewoo', '.', 'He', 'launched', 'Tata', 'Nano', ',', 'India', '’', 'affordable', 'car', 'Rs', '.', '1', 'lakh', '.']
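
The filtered list above still contains 'In', 'Under', 'He' and punctuation tokens, because NLTK's stop words are lowercase and the membership test is case-sensitive. A minimal sketch of a stricter filter follows. It assumes a small hand-written stop-word set and a pre-tokenized list so that it runs on its own; in the notebook you would use `Stopwords` and `words` instead.

```python
import string

STOP_WORDS = {"in", "he", "over", "as", "from"}   # hypothetical subset of the English list
PUNCTUATION = set(string.punctuation)

tokens = ['In', '1991', ',', 'he', 'took', 'over', 'as',
          'chairman', 'from', 'JRD', 'Tata', '.']

filtered = [
    tok for tok in tokens
    if tok.lower() not in STOP_WORDS      # compare in lowercase, keep original casing
    and tok not in PUNCTUATION            # drop single-character punctuation tokens
]
print(filtered)
```

Only the comparison is lowercased, so proper nouns such as 'JRD' and 'Tata' keep their original form in the result.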
