### Understanding NLTK
* Tokenizing
* Stop-words
* Stemming
* Lemmatizing
* ML algorithms cannot process raw text directly
* Text needs to be pre-processed first
* Its features need to be reduced
* NLTK is a foundational library that provides all of these
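
The list above mentions stop-words: very common words (such as "the" or "is") that carry little meaning and are usually dropped before modelling. A minimal sketch of the idea, assuming a tiny hand-picked stop-word set (`TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the set is far smaller than the English list NLTK ships in `nltk.corpus.stopwords`) and plain whitespace splitting instead of a real tokenizer:

```python
# Hypothetical, hand-picked stop-word set for illustration only;
# NLTK's own English list is much longer.
TOY_STOP_WORDS = {"the", "is", "are", "all", "how", "on"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop-words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in TOY_STOP_WORDS]

filtered = remove_stop_words("Hope the lockdown solves all the issues")
print(filtered)
```

Only the content-bearing tokens remain, which is one way of reducing the number of features before vectorizing.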

In [1]:
import nltk

In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [3]:
my_txt = "Hello Mr. Learners, how is learning going on? Hope things are fine. Hope the lockdown solves all the issues."

In [4]:
sent_tokenize(my_txt)

['Hello Mr. Learners, how is learning going on?',
 'Hope things are fine.',
 'Hope the lockdown solves all the issues.']

In [5]:
word_tokenize(my_txt)

['Hello',
 'Mr.',
 'Learners',
 ',',
 'how',
 'is',
 'learning',
 'going',
 'on',
 '?',
 'Hope',
 'things',
 'are',
 'fine',
 '.',
 'Hope',
 'the',
 'lockdown',
 'solves',
 'all',
 'the',
 'issues',
 '.']

### Stemming 
* Many variations of a word carry the same meaning, apart from tense.
* The objective is to reduce the dimensionality of the data
* Curse of dimensionality: many algorithms do not work well when there are too many dimensions
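
To see why stemming reduces dimensionality, the sketch below collapses word variants with a deliberately crude suffix rule and compares vocabulary sizes. `toy_stem` is a hypothetical helper written for this illustration; it is not the Porter algorithm and handles far fewer cases.

```python
def toy_stem(word):
    """Strip one common suffix; a toy rule, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words stay intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = ["play", "plays", "played", "playing", "walk", "walked"]
raw_vocab = set(tokens)
stemmed_vocab = {toy_stem(tok) for tok in tokens}
print(len(raw_vocab), "->", len(stemmed_vocab))
```

Each distinct token becomes one feature column after vectorizing, so fewer distinct tokens means fewer dimensions.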

In [6]:
from nltk.stem import PorterStemmer

In [14]:
ps  =  PorterStemmer()

In [15]:
words = ['run','runner','runing','run']

In [16]:
for word in words:
    print(ps.stem(word))

run
runner
rune
run


In [19]:
words = ['go','went','gone','going','goes']

In [20]:
for word in words:
    print(ps.stem(word))

go
went
gone
go
goe


In [62]:
text_data = ['I runs verying is fast','I was very running fast veries veried']

In [63]:
import pandas as pd

In [64]:
df = pd.DataFrame({'Text': text_data})

In [65]:
from sklearn.feature_extraction.text import CountVectorizer

In [66]:
cv = CountVectorizer()

In [67]:
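def f(r):
    # Tokenize the sentence, stem each token, and re-join into one string
    words = word_tokenize(r)
    res = []
    for word in words:
        res.append(ps.stem(word))
    return (' '.join(res))
# Replace every row of the Text column with its stemmed version
df.Text = df.Text.map(f)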

In [68]:
df.Text

0              I run veri is fast
1    I wa veri run fast veri veri
Name: Text, dtype: object

In [69]:
cv.fit_transform(df.Text)

<2x5 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [70]:
cv.vocabulary_

{'run': 2, 'veri': 3, 'is': 1, 'fast': 0, 'wa': 4}

In [72]:
cv.fit_transform(df.Text).toarray()

array([[1, 1, 1, 1, 0],
       [1, 0, 1, 3, 1]], dtype=int64)
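
The count matrix above can be reproduced by hand, which may help clarify what `CountVectorizer` is doing. This is a sketch using only the standard library, assuming the default settings shown elsewhere in this notebook (lowercasing, and a token pattern that keeps only tokens of two or more word characters, which is why the single letter "I" gets no column). The helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

docs = ["I run veri is fast", "I wa veri run fast veri veri"]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

# Sorted vocabulary: each distinct token gets one column index
vocab = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one count per vocabulary entry
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts[tok] for tok in vocab])

print(vocab)
print(matrix)
```

The column order follows the alphabetically sorted vocabulary, matching the indices reported by `cv.vocabulary_`.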

### Lemmatizing
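* Similar to stemming
* Stemming applies suffix-stripping rules, so it also works on misspelled or made-up words, but the resulting stem may not be a real word
* Lemmatizing looks words up in WordNet and maps known words to their dictionary form (lemma)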

In [73]:
from nltk.stem import WordNetLemmatizer

In [74]:
wl = WordNetLemmatizer()

In [76]:
wl.lemmatize('cat')

'cat'

In [77]:
wl.lemmatize('runs')

'run'

In [78]:
wl.lemmatize('goose')

'goose'

In [81]:
wl.lemmatize('better',pos='a')

'good'

In [83]:
wl.lemmatize('good',pos = "a")

'good'

In [86]:
ps.stem('paying')

'pay'

In [87]:
ps.stem('pays')

'pay'

In [88]:
ps.stem('payed')

'pay'

In [90]:
from nltk.stem import LancasterStemmer

In [93]:
ls = LancasterStemmer()

In [94]:
ls.stem('trouble')

'troubl'

In [95]:
ls.stem('troubling')

'troubl'

In [97]:
text = 'He was running and eating at the, same time. He also has a very bad habbit of playing in the Sun after having food?'

In [98]:
punctuation = ',.?'

In [99]:
for ch in punctuation:
    text = text.replace(ch, '')

In [100]:
words = word_tokenize(text)

In [101]:
words

['He',
 'was',
 'running',
 'and',
 'eating',
 'at',
 'the',
 'same',
 'time',
 'He',
 'also',
 'has',
 'a',
 'very',
 'bad',
 'habbit',
 'of',
 'playing',
 'in',
 'the',
 'Sun',
 'after',
 'having',
 'food']

In [103]:
for word in words:
    print(wl.lemmatize(word,pos = 'v'))

He
be
run
and
eat
at
the
same
time
He
also
have
a
very
bad
habbit
of
play
in
the
Sun
after
have
food


In [132]:
horror_data = pd.read_csv('https://raw.githubusercontent.com/edyoda/data-science-complete-tutorial/master/Data/horror-train.csv')

In [133]:
horror_data.columns

Index(['id', 'text', 'author'], dtype='object')

In [134]:
horror_data.head()

| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


In [135]:
horror_data = horror_data[['text']]

In [136]:
horror_data[:5]

| | text |
|---|---|
| 0 | This process, however, afforded me no means of... |
| 1 | It never once occurred to me that the fumbling... |
| 2 | In his left hand was a gold snuff box, from wh... |
| 3 | How lovely is spring As we looked from Windsor... |
| 4 | Finding nothing else, not even gold, the Super... |


* Using NearestNeighbors with cosine distance as the metric, we will find similar texts
* We can use regex to remove punctuation
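
A minimal sketch of the regex approach mentioned above, assuming that "punctuation" means every character that is neither a word character nor whitespace. The helper name `strip_punctuation` is hypothetical. Note that `\w` includes the underscore, so underscores are kept.

```python
import re

def strip_punctuation(text):
    # Remove every character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text)

cleaned = strip_punctuation("Finding nothing else, not even gold; he sat. Why?")
print(cleaned)
```

Compared with chaining `str.replace` calls for individual characters, a single pattern also catches semicolons, quotes, and any other symbol without listing each one.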

In [137]:
def f(t):
    # Remove commas, question marks, and full stops (other punctuation is kept)
    return t.replace(',','').replace('?','').replace('.','')
horror_data['new_text'] = horror_data.text.map(f)

In [138]:
def stem_func(r):
    # Tokenize the cleaned text, stem every token, and re-join with spaces
    words = word_tokenize(r)
    sent = []
    for word in words:
        sent.append(ps.stem(word))
    return ' '.join(sent)
horror_data['stem_words'] = horror_data.new_text.map(stem_func)

In [141]:
cv = CountVectorizer(stop_words='english')

In [142]:
cv.fit(horror_data.text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [143]:
len(cv.vocabulary_)

24764

In [144]:
cv.fit(horror_data.stem_words)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [145]:
out = cv.transform(horror_data.stem_words)

In [146]:
len(cv.vocabulary_)

15355

In [147]:
from sklearn.neighbors import NearestNeighbors

In [148]:
nn = NearestNeighbors(metric='cosine')

In [151]:
nn.fit(out)

NearestNeighbors(algorithm='auto', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [152]:
nn.kneighbors(out[4:5])

(array([[1.11022302e-16, 4.57917835e-01, 4.79516561e-01, 4.96637990e-01,
         4.98449609e-01]]),
 array([[    4, 15457,  7409, 18122, 13440]], dtype=int64))
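
The first array holds cosine distances and the second holds row indices. The nearest neighbour is the query row itself (index 4), whose distance is zero up to floating-point rounding. As a sketch of how such a distance is computed, the function below implements cosine distance as one minus cosine similarity for two dense count vectors. The name `cosine_distance` is hypothetical, the example vectors are the two small count rows from the earlier `CountVectorizer` output, and all-zero vectors are not handled.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

row0 = [1, 1, 1, 1, 0]
row1 = [1, 0, 1, 3, 1]
print(cosine_distance(row0, row0))  # identical vectors
print(cosine_distance(row0, row1))  # partially overlapping vectors
```

Because the measure depends only on the angle between vectors, a long and a short text with the same word proportions are treated as close.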

In [124]:
horror_data[:1].text[0]

'This process, however, afforded me no means of ascertaining the dimensions of my dungeon; as I might make its circuit, and return to the point whence I set out, without being aware of the fact; so perfectly uniform seemed the wall.'

In [154]:
horror_data.loc[4].text

'Finding nothing else, not even gold, the Superintendent abandoned his attempts; but a perplexed look occasionally steals over his countenance as he sits thinking at his desk.'

In [155]:
horror_data.loc[15457].text

'His countenance was rough but intelligent his ample brow and quick grey eyes seemed to look out, over his own plans, and the opposition of his enemies.'

In [127]:
horror_data.loc[18122].text

'The smile of triumph shone on his countenance; determined to pursue his object to the uttermost, his manner and expression seem ominous of the accomplishment of his wishes.'

In [128]:
horror_data.loc[18122]

text          The smile of triumph shone on his countenance;...
new_text      The smile of triumph shone on his countenance;...
stem_words    the smile of triumph shone on hi counten ; det...
Name: 18122, dtype: object

In [129]:
horror_data.loc[15457]

text          His countenance was rough but intelligent his ...
new_text      His countenance was rough but intelligent his ...
stem_words    hi counten wa rough but intellig hi ampl brow ...
Name: 15457, dtype: object

In [130]:
horror_data.loc[4]

text          Finding nothing else, not even gold, the Super...
new_text      Finding nothing else not even gold the Superin...
stem_words    find noth els not even gold the superintend ab...
Name: 4, dtype: object