### Topic 1 : Tokenization

KEY NOTES

    * NLTK -> Offers flexible algorithms for tasks like tokenization.
    * spaCy -> Renowned for its speed and performance; ideal for efficient NLP solutions.

In [1]:
%pip install nltk

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
corpus = """Hello Welcome,to karthik's NLP Notebook.
Please do watch the entire course! to become expert in NLP.
"""

In [3]:
print(corpus)

Hello Welcome,to karthik's NLP Notebook.
Please do watch the entire course! to become expert in NLP.



In [4]:
# Tokenization
# Step 1 -> convert paragraph into sentences.

from nltk.tokenize import sent_tokenize

In [5]:
documents = sent_tokenize(corpus)
documents

["Hello Welcome,to karthik's NLP Notebook.",
 'Please do watch the entire course!',
 'to become expert in NLP.']

`sent_tokenize` splits the text at sentence-ending punctuation such as `.` and `!`, so the corpus becomes three sentences.
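To make the splitting rule concrete, here is a minimal standard-library sketch. It is not how `sent_tokenize` works internally (NLTK uses the pre-trained Punkt model); the helper name `naive_sent_split` is made up for illustration. It simply splits after `.`, `!` or `?` when whitespace follows.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

corpus = """Hello Welcome,to karthik's NLP Notebook.
Please do watch the entire course! to become expert in NLP.
"""

# Same three sentences as the NLTK output above.
print(naive_sent_split(corpus))

# The naive rule also breaks after abbreviations, which is one reason
# a trained sentence tokenizer is usually preferred.
print(naive_sent_split("Dr. Rao teaches NLP."))
```

On this corpus the naive rule agrees with NLTK, but the second call shows where it goes wrong.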

In [6]:
type(documents)

list

In [7]:
## Tokenization
## step 2 -> sentences into words.

from nltk.tokenize import word_tokenize

In [8]:
word_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'karthik',
 "'s",
 'NLP',
 'Notebook',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']

In [9]:
for sentence in documents:
    print(word_tokenize(sentence))

['Hello', 'Welcome', ',', 'to', 'karthik', "'s", 'NLP', 'Notebook', '.']
['Please', 'do', 'watch', 'the', 'entire', 'course', '!']
['to', 'become', 'expert', 'in', 'NLP', '.']


In [10]:
# wordpunct_tokenize -> every punctuation mark, including ' and , becomes its own token.

from nltk.tokenize import wordpunct_tokenize

In [11]:
wordpunct_tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'karthik',
 "'",
 's',
 'NLP',
 'Notebook',
 '.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
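The behaviour above can be approximated with one regular expression: a token is either a run of word characters or a run of punctuation. To my understanding this is the pattern NLTK's `WordPunctTokenizer` is built on, but treat that as an assumption; `wordpunct_sketch` is a hypothetical name.

```python
import re

# A token is a run of word characters OR a run of non-word, non-space characters.
WORDPUNCT_PATTERN = r"\w+|[^\w\s]+"

def wordpunct_sketch(text):
    return re.findall(WORDPUNCT_PATTERN, text)

# The apostrophe is split away from "s", unlike word_tokenize which keeps "'s".
print(wordpunct_sketch("Welcome,to karthik's NLP Notebook."))

# Adjacent punctuation marks stay together as a single token.
print(wordpunct_sketch("Wait...what?!"))
```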

In [15]:
# Treebank word tokenizer (a full stop inside the text stays attached to its word, e.g. 'Notebook.'; only the final full stop becomes a separate token)

from nltk.tokenize import TreebankWordTokenizer

In [16]:
tokenizer = TreebankWordTokenizer()

In [17]:
tokenizer.tokenize(corpus)

['Hello',
 'Welcome',
 ',',
 'to',
 'karthik',
 "'s",
 'NLP',
 'Notebook.',
 'Please',
 'do',
 'watch',
 'the',
 'entire',
 'course',
 '!',
 'to',
 'become',
 'expert',
 'in',
 'NLP',
 '.']
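The output above keeps `'Notebook.'` as one token because the Treebank tokenizer expects a single sentence as input and only detaches a full stop at the very end of the string. The sketch below imitates just that one rule with the standard library (the function name is hypothetical and it ignores every other Treebank rule) to show why tokenizing sentence by sentence gives cleaner results.

```python
import re

def split_final_period(text):
    """Whitespace-split, then detach a period only from the very last token."""
    tokens = text.split()
    if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("."):
        tokens[-1:] = [tokens[-1][:-1], "."]
    return tokens

text = "NLP Notebook. Please watch."

# Whole text at once: the inner full stop stays glued to 'Notebook.'.
whole = split_final_period(text)
print(whole)

# One sentence at a time: every sentence-final full stop is separated.
per_sentence = [split_final_period(s) for s in re.split(r"(?<=[.!?])\s+", text)]
print(per_sentence)
```

With NLTK the same idea would be to run `sent_tokenize` first and then apply the Treebank tokenizer to each sentence.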

### Topic 2 : Stemming

* Stemming is the process of reducing a word to its stem (its base form).

* For example : [going, go, goes] -> go , [watching, watch, watched] -> watch

* Instead of keeping several similar forms of a word, we keep a single stem.

* Use case : review classification, among others.
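As a rough illustration of the idea in the bullets above, the sketch below strips a few hand-picked suffixes and counts how many word forms collapse onto each stem. It is a toy (the suffix list and `toy_stem` are assumptions made for this example), not the Porter algorithm.

```python
from collections import Counter

# Hand-picked suffixes, checked in this order; purely illustrative.
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

words = ["going", "go", "goes", "watching", "watch", "watched"]

# Six surface forms shrink to two vocabulary entries.
counts = Counter(toy_stem(w) for w in words)
print(counts)
```

A smaller vocabulary is the main benefit for tasks such as review classification, where "watched" and "watching" should count as the same feature.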

In [22]:
words = ['eating','eats','eaten','writing','writes','programming','programs']

In [23]:
# 1. PorterStemmer

In [24]:
from nltk.stem import PorterStemmer

In [25]:
stemming = PorterStemmer()

In [26]:
for word in words:
    print(word+'--->'+stemming.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program


#### Disadvantage:

* The stem may not be a valid word, e.g. 'congratulations' becomes 'congratul'.

In [27]:
stemming.stem('congratulations')

'congratul'

In [34]:
# 2. RegexpStemmer class

# It lets us build a simple stemmer from a regular expression.
# It takes a single regular expression and removes any part of the word that matches it
# (typically a prefix or suffix); words shorter than `min` are left unchanged.

In [31]:
from nltk.stem import RegexpStemmer

In [35]:
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$',min=4)

In [36]:
reg_stemmer.stem('eating')

'eat'

In [37]:
reg_stemmer.stem('ingeating')

'ingeat'

    ing$ -> removes "ing" only at the end of the word.
    ing  -> removes "ing" wherever it occurs in the word.
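The difference between the anchored and unanchored pattern can be reproduced with `re.sub`. This is a sketch of the behaviour as I understand it (delete every match, skip words shorter than a minimum length); `regexp_stem` and `min_len` are hypothetical names, not NLTK's API.

```python
import re

SUFFIX_PATTERN = r"ing$|s$|e$|able$"  # same expression as in the cell above

def regexp_stem(word, pattern, min_len=0):
    """Delete every match of `pattern`, unless the word is shorter than `min_len`."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

# '$' anchors the match to the end, so the leading 'ing' survives.
print(regexp_stem("ingeating", SUFFIX_PATTERN, min_len=4))

# Without '$', every occurrence of 'ing' is removed.
print(regexp_stem("ingeating", r"ing", min_len=4))

# Short words are protected by the minimum length.
print(regexp_stem("was", SUFFIX_PATTERN, min_len=4))
```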

In [38]:
# 3. Snowball Stemmer

# It often performs better than the Porter stemmer.
# It handles some words that the Porter stemmer mangles (e.g. 'fairly').
# Its stems are generally more accurate than the Porter stemmer's.

In [39]:
from nltk.stem import SnowballStemmer

In [41]:
snowballstemmer = SnowballStemmer('english')

In [42]:
for word in words:
    print(word+'--->'+snowballstemmer.stem(word))

eating--->eat
eats--->eat
eaten--->eaten
writing--->write
writes--->write
programming--->program
programs--->program


#### Porter Stemmer vs SnowballStemmer

In [43]:
# Porter stemmer

stemming.stem('fairly'),stemming.stem("sportingly")

('fairli', 'sportingli')

In [44]:
# Snowball stemmer

snowballstemmer.stem('fairly'),snowballstemmer.stem('sportingly')

('fair', 'sport')

### Topic 3 : Lemmatization

    * Lemmatization is similar to stemming.
    * The output of lemmatization is a lemma,
    * which is a root word rather than a root stem.
    * So we get a valid word that means the same thing.
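One way to picture the points above is a dictionary lookup keyed by the word and its part of speech. The table below is a tiny hand-written stand-in for WordNet (all entries and the name `toy_lemmatize` are assumptions for illustration), but it shows two properties of lemmatization: the result is a real word, and the answer depends on the POS tag.

```python
# Hypothetical miniature lemma table: (word, pos) -> lemma.
LEMMA_TABLE = {
    ("going", "v"): "go",
    ("eaten", "v"): "eat",
    ("eats", "v"): "eat",
    ("better", "a"): "good",
    ("programs", "n"): "program",
}

def toy_lemmatize(word, pos="n"):
    """Look up the lemma; unknown (word, pos) pairs are returned unchanged."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("going", pos="v"))   # verb reading is found in the table
print(toy_lemmatize("going"))            # default noun reading has no entry
print(toy_lemmatize("better", pos="a"))  # irregular forms need a lookup, not suffix stripping
```

`WordNetLemmatizer.lemmatize` also defaults to the noun tag, which is why passing `pos='v'` matters for verbs.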

In [45]:
from nltk.stem import WordNetLemmatizer

In [46]:
lemmatizer = WordNetLemmatizer()

In [55]:
# POS tags: noun -> 'n', verb -> 'v', adjective -> 'a', adverb -> 'r'
lemmatizer.lemmatize("going",pos='v')

'go'

In [56]:
words = ['eating','eats','eaten','writing','writes','programming','programs']

In [60]:
for word in words:
    print(word+"--->"+lemmatizer.lemmatize(word,pos='v'))

eating--->eat
eats--->eat
eaten--->eat
writing--->write
writes--->write
programming--->program
programs--->program


 Use cases -> Q&A systems, chatbots, text summarization

### Topic 4 : Stopwords

Very common words such as i, was, am, where carry little meaning on their own; these words are called stopwords.
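A minimal sketch of stopword removal, using a small hand-written set in place of NLTK's list (the set contents and the function name are assumptions for this example):

```python
# Tiny illustrative stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = frozenset({"this", "is", "the", "to", "an", "in"})

def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]

tokens = "this is the course to become an expert in nlp".split()
filtered = remove_stopwords(tokens)
print(filtered)
```

Only the content-bearing words remain, which is usually what a bag-of-words style model needs.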

In [88]:
paragraph="""I am indeed delighted to be here with the Future of India on this Children's Day, 14th November, 2002.
            My greetings to the children, teachers and members of Industrial community particularly members of CII 
            who have organized this event and to release this vision document "Developed India : Mission for the Young".
            I am happy to inform you that today I have completed my mission of interacting with 100,000 school children 
            across the length and breadth of the nation. Wherever I went, be it Arunachal Pradesh, Nagaland, 
            Madhya Pradesh, Gujarat, Karnataka or any other part of India, the voice of the youth is unique and 
            strong in articulating their vision and dream. Everyone dreams of living in a prosperous India, a happy
            India and a peaceful India. The combination of prosperity, happiness and peace to a nation always come together.
            When all three of them converge on to India, then India truly be a Developed Nation. Today I am going to talk to you
            about how India can be transformed into a Developed Nation. There are more than 300 million children of your age group
            in India. There are about 2000 children in this hall. What can this smaller fraction do to realize the dream of Developed 
            India."""

In [62]:
from nltk.stem import PorterStemmer

In [63]:
from nltk.corpus import stopwords

In [65]:
stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [66]:
from nltk.stem import PorterStemmer

In [67]:
stemmer=PorterStemmer()

In [89]:
# tokenize
import nltk  # needed for the nltk.sent_tokenize / nltk.word_tokenize calls used from here on

sentences = nltk.sent_tokenize(paragraph)
sentences

["I am indeed delighted to be here with the Future of India on this Children's Day, 14th November, 2002.",
 'My greetings to the children, teachers and members of Industrial community particularly members of CII \n            who have organized this event and to release this vision document "Developed India : Mission for the Young".',
 'I am happy to inform you that today I have completed my mission of interacting with 100,000 school children \n            across the length and breadth of the nation.',
 'Wherever I went, be it Arunachal Pradesh, Nagaland, \n            Madhya Pradesh, Gujarat, Karnataka or any other part of India, the voice of the youth is unique and \n            strong in articulating their vision and dream.',
 'Everyone dreams of living in a prosperous India, a happy\n            India and a peaceful India.',
 'The combination of prosperity, happiness and peace to a nation always come together.',
 'When all three of them converge on to India, then India truly be a D

In [71]:
type(sentences)

list

In [73]:
## Remove stopwords, then apply stemming

In [76]:
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the remaining stems back into one sentence string

In [77]:
sentences

["i inde delight futur india children 's day , 14th novemb , 2002 .",
 "my greet children , teacher member industri commun particularli member cii organ event releas vision document `` develop india : mission young '' .",
 'i happi inform today i complet mission interact 100,000 school children across length breadth nation .',
 'wherev i went , arunach pradesh , nagaland , madhya pradesh , gujarat , karnataka part india , voic youth uniqu strong articul vision dream .',
 'everyon dream live prosper india , happi india peac india .',
 'the combin prosper , happi peac nation alway come togeth .',
 'when three converg india , india truli develop nation .',
 'today i go talk india transform develop nation .',
 'there 300 million children age group india .',
 'there 2000 children hall .',
 'what smaller fraction realiz dream develop india .']
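Words such as 'i', 'the' and 'when' are still present in the output above. NLTK's English stopword list is all lowercase, the membership test runs on the original casing ('I', 'The', 'When'), and the stemmer lowercases the word only afterwards. The sketch below reproduces the effect with a toy stopword set; both function names are hypothetical.

```python
TOY_STOPWORDS = frozenset({"i", "the", "of", "am"})

def filter_then_lower(tokens):
    # Same order as the loop above: test the original casing, lowercase afterwards.
    return [t.lower() for t in tokens if t not in TOY_STOPWORDS]

def lower_then_filter(tokens):
    # Lowercase first, so capitalised stopwords are caught as well.
    return [t.lower() for t in tokens if t.lower() not in TOY_STOPWORDS]

tokens = ["The", "combination", "of", "prosperity"]

print(filter_then_lower(tokens))  # 'the' slips through
print(lower_then_filter(tokens))  # 'the' is removed
```

Whether to lowercase before filtering is a design choice; it is the usual one unless capitalisation carries meaning for the task.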

In [83]:
# using snowballstemmer

from nltk.stem import SnowballStemmer

In [84]:
snowballstemmer = SnowballStemmer('english')

In [85]:
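sentences = nltk.sent_tokenize(paragraph)  # start again from the raw text; the previous loop overwrote `sentences`
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [snowballstemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the remaining stems back into one sentence string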

In [86]:
sentences

["i inde delight futur india children 's day , 14th novemb , 2002 .",
 "my greet children , teacher member industri communiti particular member cii organ event releas vision document `` develop india : mission young '' .",
 'i happi inform today i complet mission interact 100,000 school children across length breadth nation .',
 'wherev i went , arunach pradesh , nagaland , madhya pradesh , gujarat , karnataka part india , voic youth uniqu strong articul vision dream .',
 'everyon dream live prosper india , happi india peac india .',
 'the combin prosper , happi peac nation alway come togeth .',
 'when three converg india , india truli develop nation .',
 'today i go talk india transform develop nation .',
 'there 300 million children age group india .',
 'there 2000 children hall .',
 'what smaller fraction realiz dream develop india .']

In [96]:
# using lemmatization

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [94]:
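sentences = nltk.sent_tokenize(paragraph)  # start again from the raw text; the previous loop overwrote `sentences`
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    words = [lemmatizer.lemmatize(word.lower(),pos='v') for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)  # join the remaining lemmas back into one sentence string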

In [95]:
sentences

["i indeed delight future india children 's day , 14th november , 2002 .",
 'my greet child , teacher member industrial community particularly member cii organize event release vision document `` develop india : mission young `` .',
 'i happy inform today i complete mission interact 100,000 school child across length breadth nation .',
 'wherever i go , arunachal pradesh , nagaland , madhya pradesh , gujarat , karnataka part india , voice youth unique strong articulate vision dream .',
 'everyone dream live prosperous india , happy india peaceful india .',
 'the combination prosperity , happiness peace nation always come together .',
 'when three converge india , india truly develop nation .',
 'today i go talk india transform develop nation .',
 'there 300 million child age group india .',
 'there 2000 child hall .',
 'what smaller fraction realize dream develop india .']
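The three loops above all overwrite `sentences` in place, so the result of each cell depends on which cells ran before it. One way around that is a function that takes the raw paragraph and returns a new list. The sketch below uses only the standard library, with naive regex tokenization and a toy stopword set standing in for NLTK; `preprocess`, `normalize` and `crude_plural_strip` are hypothetical names.

```python
import re

def preprocess(paragraph, normalize, stopwords):
    """Return a new list of cleaned sentences; the input text is left untouched."""
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    cleaned = []
    for sentence in sentences:
        tokens = re.findall(r"\w+|[^\w\s]+", sentence)
        kept = [normalize(t.lower()) for t in tokens if t.lower() not in stopwords]
        cleaned.append(" ".join(kept))
    return cleaned

def crude_plural_strip(token):
    # Deliberately crude: drops a trailing 's' from longer tokens.
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

toy_stopwords = frozenset({"i", "am", "to", "the", "of", "in", "a"})
text = "I am happy to inform you. Everyone dreams of living in a prosperous India."

# Each call starts from the same raw text, so the runs do not interfere.
unchanged = preprocess(text, lambda t: t, toy_stopwords)
stripped = preprocess(text, crude_plural_strip, toy_stopwords)
print(unchanged)
print(stripped)
```

With NLTK installed, `normalize` could be `PorterStemmer().stem`, `SnowballStemmer('english').stem`, or a small wrapper around `lemmatizer.lemmatize`, which makes it easy to compare the three on identical input.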