In [32]:
!pip install --upgrade nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
     ---------------------------------------- 1.5/1.5 MB 2.6 MB/s eta 0:00:00
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.7
    Uninstalling nltk-3.7:
      Successfully uninstalled nltk-3.7
Successfully installed nltk-3.8.1



The next cell imports the Natural Language Toolkit (nltk) library in Python. 
It specifically imports the PorterStemmer class for word stemming and the stopwords corpus for filtering out common words. 
The code that follows performs text preprocessing: stemming reduces words to a root form (which is not always a dictionary word), and stop words are very common words, such as "the" or "is", that are often removed in language processing tasks.

In [33]:
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [34]:
paragraph = "Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] ⓘ; born 17 September 1950)[b] is an Indian politician who has served as the 14th prime minister of India since May 2014. Modi was the Chief Minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister from outside the Indian National Congress."

In [35]:
paragraph

'Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] ⓘ; born 17 September 1950)[b] is an Indian politician who has served as the 14th prime minister of India since May 2014. Modi was the Chief Minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi. He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right wing Hindu nationalist paramilitary volunteer organisation. He is the longest-serving prime minister from outside the Indian National Congress.'

In [36]:
#Tokenization
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [37]:
sentences = nltk.sent_tokenize(paragraph)

In [38]:
print(sentences)

['Narendra Damodardas Modi (Gujarati: [ˈnəɾendɾə dɑmodəɾˈdɑs ˈmodiː] ⓘ; born 17 September 1950)[b] is an Indian politician who has served as the 14th prime minister of India since May 2014.', 'Modi was the Chief Minister of Gujarat from 2001 to 2014 and is the Member of Parliament (MP) for Varanasi.', 'He is a member of the Bharatiya Janata Party (BJP) and of the Rashtriya Swayamsevak Sangh (RSS), a right wing Hindu nationalist paramilitary volunteer organisation.', 'He is the longest-serving prime minister from outside the Indian National Congress.']


In [39]:
stemmer=PorterStemmer()

In [40]:
stemmer.stem('thinking')

'think'
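
A few more stems help show what the Porter algorithm does: it strips suffixes by rule, so the result is often not a dictionary word. The sketch below stems a handful of words that also occur in the paragraph; the word list is chosen for illustration only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words chosen for illustration; stemming needs no downloaded corpus data.
words = ["thinking", "september", "served", "minister", "party", "national"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as 'parti' and 'septemb' are fine as internal keys for matching word variants, but they are awkward to show to a reader, which is one reason to consider lemmatization instead.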

The next cell imports the WordNetLemmatizer class from nltk. Lemmatization reduces a word to its base or dictionary form (its lemma), which is useful in text analysis and natural language processing tasks. WordNetLemmatizer is the lemmatizer most commonly used with nltk; it treats each word as a noun unless a part-of-speech tag is passed.

In [43]:
from nltk.stem import WordNetLemmatizer

In [44]:
lemmatizer = WordNetLemmatizer()

In [52]:
lemmatizer.lemmatize('being')

'being'

In [53]:
#regular expression
import re

In [54]:
corpus = []
for sentence in sentences:
    text = re.sub("[^a-zA-Z]", " ", sentence)  # replace every non-letter character with a space
    text = text.lower()
    corpus.append(text)

In [55]:
corpus

['narendra damodardas modi  gujarati    n  end   d mod   d s  modi      born    september       b  is an indian politician who has served as the   th prime minister of india since may      ',
 'modi was the chief minister of gujarat from      to      and is the member of parliament  mp  for varanasi ',
 'he is a member of the bharatiya janata party  bjp  and of the rashtriya swayamsevak sangh  rss   a right wing hindu nationalist paramilitary volunteer organisation ',
 'he is the longest serving prime minister from outside the indian national congress ']
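
The cleaned sentences above contain runs of spaces because every digit, bracket, and punctuation mark is replaced by its own space. A small sketch on a made-up fragment shows the effect and one way to collapse the extra whitespace; the collapsing step is an optional addition, not part of the original cell.

```python
import re

fragment = "the 14th prime minister (MP)"  # made-up sample for illustration

letters_only = re.sub("[^a-zA-Z]", " ", fragment).lower()
# Each non-letter character became one space, so gaps of varying width remain.

collapsed = " ".join(letters_only.split())  # split() discards the empty pieces

print(repr(letters_only))
print(repr(collapsed))
```

The extra spaces are harmless for the tokenizers used later, since they split on whitespace anyway, so collapsing them mainly makes the intermediate text easier to read.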

In [56]:
#stemming
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per word
for i in corpus:
    words = nltk.word_tokenize(i)  # split each sentence into words
    for word in words:
        if word not in stop_words:  # stem only the words that are not English stop words
            print(stemmer.stem(word))

narendra
damodarda
modi
gujarati
n
end
mod
modi
born
septemb
b
indian
politician
serv
th
prime
minist
india
sinc
may
modi
chief
minist
gujarat
member
parliament
mp
varanasi
member
bharatiya
janata
parti
bjp
rashtriya
swayamsevak
sangh
rss
right
wing
hindu
nationalist
paramilitari
volunt
organis
longest
serv
prime
minist
outsid
indian
nation
congress


In [57]:
#lemmatization
stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per word
for i in corpus:
    words = nltk.word_tokenize(i)  # split each sentence into words
    for word in words:
        if word not in stop_words:  # lemmatize only the words that are not English stop words
            print(lemmatizer.lemmatize(word))

narendra
damodardas
modi
gujarati
n
end
mod
modi
born
september
b
indian
politician
served
th
prime
minister
india
since
may
modi
chief
minister
gujarat
member
parliament
mp
varanasi
member
bharatiya
janata
party
bjp
rashtriya
swayamsevak
sangh
r
right
wing
hindu
nationalist
paramilitary
volunteer
organisation
longest
serving
prime
minister
outside
indian
national
congress
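
Both loops above share one shape: split into words, drop stop words, then apply a word-level transform. A sketch of that pattern as a function follows. The name normalise_words and the tiny stop-word set are hypothetical stand-ins (the notebook uses the full NLTK English list), str.split() replaces nltk.word_tokenize so no data download is needed, and the function returns a list instead of printing.

```python
from nltk.stem import PorterStemmer


def normalise_words(sentence, transform, stop_words):
    """Apply `transform` to every word of `sentence` that is not a stop word."""
    return [transform(word) for word in sentence.split() if word not in stop_words]


# Hypothetical miniature stop-word set, standing in for stopwords.words('english').
stop_words = {"he", "is", "the", "from"}

sentence = "he is the longest serving prime minister from outside the indian national congress"
stems = normalise_words(sentence, PorterStemmer().stem, stop_words)
print(stems)
```

Passing lemmatizer.lemmatize as the transform instead would reproduce the second loop, and returning a list makes the result usable in later cells.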


In [74]:
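corpus=[]
stop_words = set(stopwords.words('english'))  # build the stop-word set once
for i in range(len(sentences)):
    test = re.sub("[^a-zA-Z]", " ", sentences[i])  # keep letters only
    test=test.lower()
    test=test.split()
    test = [lemmatizer.lemmatize(word) for word in test if word not in stop_words]
    # ''.join puts no separator between the words, so each sentence collapses into one long token;
    # ' '.join(test) would keep the words separate for CountVectorizer
    test = ''.join(test)
    corpus.append(test)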

In [75]:
from sklearn.feature_extraction.text import CountVectorizer

In [76]:
cv = CountVectorizer()

In [77]:
x = cv.fit_transform(corpus)

In [78]:
cv.vocabulary_

{'narendradamodardasmodigujaratinendmodmodibornseptemberbindianpoliticianservedthprimeministerindiasincemay': 3,
 'modichiefministergujaratmemberparliamentmpvaranasi': 2,
 'memberbharatiyajanatapartybjprashtriyaswayamsevaksanghrrightwinghindunationalistparamilitaryvolunteerorganisation': 1,
 'longestservingprimeministeroutsideindiannationalcongress': 0}

In [79]:
corpus[3]

'longestservingprimeministeroutsideindiannationalcongress'
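
The vocabulary above has only four entries, one per sentence, because the lemmatized words were joined with an empty string before vectorization. The sketch below shows the difference, using two token lists copied by hand from the lemmatization output so that no NLTK data is needed. Joining with a space is the usual choice when each word should become its own feature.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Token lists copied by hand from the lemmatization output (sentences 2 and 4).
token_lists = [
    ["modi", "chief", "minister", "gujarat", "member", "parliament", "mp", "varanasi"],
    ["longest", "serving", "prime", "minister", "outside", "indian", "national", "congress"],
]

glued = ["".join(tokens) for tokens in token_lists]    # no separator: one token per sentence
spaced = [" ".join(tokens) for tokens in token_lists]  # space separator: one token per word

glued_cv = CountVectorizer().fit(glued)
spaced_cv = CountVectorizer().fit(spaced)

print(len(glued_cv.vocabulary_), len(spaced_cv.vocabulary_))
print(sorted(spaced_cv.vocabulary_))
```

With the space-separated input, 'minister' appears once in the vocabulary even though it occurs in both sentences, which is the behaviour a bag-of-words model relies on.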

In [65]:
x[0].toarray()

array([[1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0,
        1, 0, 1, 1, 2, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
        1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0]], dtype=int64)