# **Tokenization using spaCy**


In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')
example1 = nlp("This is an example of tokenization")
for token in example1:
    print(token.text)

This
is
an
example
of
tokenization


In [None]:
example2 = nlp("The quick brown fox jumped over the lazy dog")
for token in example2:
    print(token.text)

The
quick
brown
fox
jumped
over
the
lazy
dog


In [None]:
example3 = nlp("We're the champions")
for token in example3:
    print(token.text)

We
're
the
champions


# **Stemming words using NLTK**



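1.   The output of tokenization is a list of words (tokens)
2.   Stemming reduces each word to a common base form (its stem) by stripping suffixes, so that variants such as "cats" and "cat" are treated as the same word
3.   PorterStemmer is the most commonly used stemmer
     * Its output is not always a real word (for example, "was" becomes "wa")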
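
To make the idea of suffix stripping concrete, here is a minimal sketch of a toy stemmer. It is not the Porter algorithm: the function name `naive_stem` and its three-suffix rule list are assumptions made purely for illustration, and they show why a stem need not be a dictionary word.

```python
def naive_stem(word):
    """Toy stemmer (NOT the Porter algorithm): strip the first matching suffix."""
    word = word.lower()
    # Hypothetical rule list, chosen only for this illustration.
    for suffix in ("ing", "ed", "s"):
        # Keep at least two characters so very short words are not emptied.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


for word in ["Cats", "running", "was", "jumped"]:
    print(word, "->", naive_stem(word))
# Cats -> cat
# running -> runn
# was -> wa
# jumped -> jump
```

Blind suffix rules give "runn" and "wa", neither of which is a word. A real stemmer applies many more rules in sequence, but its output can still be hard to read for the same reason.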





In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
example = "Cats running was"
example = [stemmer.stem(token) for token in example.split(" ")]
print(' '.join(example))

cat run wa


# **Lemmatization using spaCy**



1.   Finds the dictionary form (lemma) of a word, not just its stem
2.   Uses the word's part of speech and its meaning in context to choose the lemma
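
The role of the part of speech can be sketched with a tiny lookup table. This is only an illustration under stated assumptions: the `LEMMAS` table, its entries, and the `lemmatize` helper are hypothetical, and a real lemmatizer such as spaCy's relies on much larger rules and data.

```python
# Hypothetical lookup table keyed by (word, part of speech); invented for illustration.
LEMMAS = {
    ("animals", "NOUN"): "animal",
    ("is", "AUX"): "be",
    ("are", "AUX"): "be",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


print(lemmatize("Animals", "NOUN"))   # animal
print(lemmatize("meeting", "VERB"))   # meet
print(lemmatize("meeting", "NOUN"))   # meeting
```

The same surface form "meeting" maps to two different lemmas depending on its part of speech, which is something a stemmer cannot express.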



In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
example4 = nlp("Animals")
for token in example4:
    print(token.lemma_)

animal


In [None]:
example41 = nlp("is am are")
for token in example41:
    print(token.lemma_)

be
am
be


# **Vectorization using scikit-learn**

**It is the process of turning a document into a numerical vector**
1.  The vector carries no linguistic information about the words themselves <br/>
    * Example: It does not capture that 'am' and 'are' are conjugations of the same verb <br/>
    * Example: It does not capture that 'I' and 'you' are both subjects
2.  The most basic approach is the algorithm called 'Bag of words'
    * First define a fixed-length vocabulary (matching ignores case) <br />
    Example: ['I', 'am', 'you', 'are', 'john', 'jack']

    * Map each word to an index in this vocabulary <br />
    ['I' => 1, 'am' => 2, 'you' => 3, 'are' => 4, 'john' => 5, 'jack' => 6]

    * Based on this index, construct a vector whose entry at a word's index is '*1*' if the word appears in the document, else '*0*' <br />
    Input: "I am John" => [1, 1, 0, 0, 1, 0] <br />
    Input: "You are jack" => [0, 0, 1, 1, 0, 1] <br />
    Input: "I am jack" => [1, 1, 0, 0, 0, 1]
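
The three steps above can be sketched in plain Python before turning to scikit-learn. This is a minimal sketch, assuming whitespace tokenization and case-insensitive matching; the helper name `bag_of_words` is hypothetical, and list positions here start at 0 rather than 1.

```python
def bag_of_words(document, vocabulary):
    """Return a binary vector: 1 if the vocabulary word occurs in the document, else 0."""
    # Assumption: split on whitespace and ignore case.
    words = set(document.lower().split())
    return [1 if term.lower() in words else 0 for term in vocabulary]


vocabulary = ["I", "am", "you", "are", "john", "jack"]
for text in ["I am John", "You are jack", "I am jack"]:
    print(text, "=>", bag_of_words(text, vocabulary))
# I am John => [1, 1, 0, 0, 1, 0]
# You are jack => [0, 0, 1, 1, 0, 1]
# I am jack => [1, 1, 0, 0, 0, 1]
```

Note that the vector only records presence, so word order and repetition are lost.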

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(binary=True, token_pattern=r'\b[^\d\W]+\b')

In [None]:
corpus = ["The dog is on the table", "the cats now are on the table"]
vectorizer.fit(corpus)
print(vectorizer.transform(["The dog is on the table"]).toarray())

[[0 0 1 1 1 1 1]]


In [None]:
vocab = vectorizer.vocabulary_
for key in sorted(vocab.keys()):
    print("{}: {}".format(key, vocab[key]))

are: 0
cats: 1
dog: 2
is: 3
on: 4
table: 5
the: 6


In [None]:
corpus2 = ["I am jack", "You are jjohn", "I am john"]
vectorizer.fit(corpus2)
print(vectorizer.transform(corpus2).toarray())

[[1 0 1 1 0 0 0]
 [0 1 0 0 1 0 1]
 [1 0 1 0 0 1 0]]


In [None]:
vocab = vectorizer.vocabulary_
for key in sorted(vocab.keys()):
    print("{}: {}".format(key, vocab[key]))

am: 0
are: 1
i: 2
jack: 3
jjohn: 4
john: 5
you: 6


# **Word Embedding using spaCy**



*   word2vec
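
Word embeddings represent each word as a vector, and similarity between two words is then measured between their vectors. By default spaCy's `similarity` uses cosine similarity, which can be sketched as follows. The three-dimensional vectors in `embeddings` are invented for illustration and are not taken from any real model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings, invented for illustration only.
embeddings = {
    "walking": [0.9, 0.1, 0.0],
    "walked": [0.8, 0.3, 0.1],
    "madrid": [0.0, 0.2, 0.9],
}

print(cosine_similarity(embeddings["walking"], embeddings["walked"]))
print(cosine_similarity(embeddings["walking"], embeddings["madrid"]))
```

With these made-up vectors, "walking" scores much closer to "walked" than to "madrid". Real embeddings have hundreds of dimensions, but the comparison works the same way.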



In [None]:
!python -m spacy download en_core_web_lg

In [6]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [7]:
example1 = "walking walked swimming swim ran running"
tokens = nlp(example1)
for token1 in tokens:
    for token2 in tokens:
        if token1.text == token2.text:
            continue
        print(token1.text, token2.text, token1.similarity(token2))

walking walked 0.64602304
walking swimming 0.5131937
walking swim 0.44711038
walking ran 0.41761836
walking running 0.51972055
walked walking 0.64602304
walked swimming 0.27055323
walked swim 0.3237321
walked ran 0.6378045
walked running 0.36874038
swimming walking 0.5131937
swimming walked 0.27055323
swimming swim 0.7757358
swimming ran 0.26132163
swimming running 0.3509788
swim walking 0.44711038
swim walked 0.3237321
swim swimming 0.7757358
swim ran 0.35809636
swim running 0.3771573
ran walking 0.41761836
ran walked 0.6378045
ran swimming 0.26132163
ran swim 0.35809636
ran running 0.7287288
running walking 0.51972055
running walked 0.36874038
running swimming 0.3509788
running swim 0.3771573
running ran 0.7287288


In [8]:
example2 = "spain russia nepal kathmandu madrid moscow"
tokens = nlp(example2)
for token1 in tokens:
    for token2 in tokens:
        if token1.text == token2.text:
            continue
        print(token1.text, token2.text, token1.similarity(token2))

spain russia 0.57819444
spain nepal 0.39723638
spain kathmandu 0.26612234
spain madrid 0.71929735
spain moscow 0.51622057
russia spain 0.57819444
russia nepal 0.47066462
russia kathmandu 0.30335176
russia madrid 0.43594518
russia moscow 0.7492537
nepal spain 0.39723638
nepal russia 0.47066462
nepal kathmandu 0.7108279
nepal madrid 0.30825683
nepal moscow 0.4314682
kathmandu spain 0.26612234
kathmandu russia 0.30335176
kathmandu nepal 0.7108279
kathmandu madrid 0.28801867
kathmandu moscow 0.45484468
madrid spain 0.71929735
madrid russia 0.43594518
madrid nepal 0.30825683
madrid kathmandu 0.28801867
madrid moscow 0.5473875
moscow spain 0.51622057
moscow russia 0.7492537
moscow nepal 0.4314682
moscow kathmandu 0.45484468
moscow madrid 0.5473875


# **Named Entity Recognition (NER)**



*   A task of NLP
*   Used to locate named entities in text and classify them into pre-defined categories <br/>
Example: "Google, a company founded by Larry Page and Sergey Brin in the United States of America, has one of the world's most advanced search engines." <br/>
     * Google => Organization
     * Larry Page => Person
     * Sergey Brin => Person
     * United States of America => Location

*   NER Pipeline
     1. Tokenization
     2. Word Embedding
     3. Sequence labelling  
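
The sequence-labelling step can be sketched as turning one tag per token into entity spans. The sketch below assumes the common BIO scheme (`B-` begins an entity, `I-` continues it, `O` is outside), which the notes above do not specify. The tags are written by hand to stand in for a model's predictions, and `extract_entities` is a hypothetical helper.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity that is currently open.
            current_words.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag: close any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


tokens = ["Google", "was", "founded", "by", "Larry", "Page", "and", "Sergey", "Brin"]
# Hand-written tags standing in for model output.
tags = ["B-ORG", "O", "O", "O", "B-PERSON", "I-PERSON", "O", "B-PERSON", "I-PERSON"]
print(extract_entities(tokens, tags))
# [('Google', 'ORG'), ('Larry Page', 'PERSON'), ('Sergey Brin', 'PERSON')]
```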






In [10]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [12]:
example = "Google, a company found by Larry Page and Sergey Brin in the United State of America has one the world's most advnaced search engines."
doc = nlp(example)
for ent in doc.ents:
    print(ent.text, ent.label_)

Google ORG
Larry Page PERSON
Sergey Brin PERSON
the United State of America GPE
one CARDINAL


In [13]:
example2 = "The top U.S. commander in Afghanistan, General Austin Miller, accompanied Zalmay Khalilzad, the architect of the U.S.-Taliban deal, to the meeting hosted by Qatar’s foreign minister,"\
+ "Sheikh Mohammed bin Abdur Rahman al-Thani. The Taliban delegation was led by Mullah Abdul Ghani Baradar, the head of its political office.  The top U.S. commander in Afghanistan, General Austin"\
+ "Miller, accompanied Zalmay Khalilzad, the architect of the U.S.-Taliban deal, to the meeting hosted by Qatar’s foreign minister, Sheikh Mohammed bin Abdur Rahman al-Thani. The Taliban delegation"\
+ "was led by Mullah Abdul Ghani Baradar, the head of its political office. The militant group has repeatedly asserted that the Americans have violated their end of the agreement by attacking either"\
+ "civilians or Taliban forces not involved in fighting. U.S. officials deny this. The militant group has repeatedly asserted that the Americans have violated their end of the agreement by attacking"\
+ "either civilians or Taliban forces not involved in fighting. U.S. officials deny this."

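# Note: the string pieces above are joined without separating spaces, so the text
# contains "AustinMiller" and "eithercivilians"; both appear as entities in the output.
doc = nlp(example2)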
for ent in doc.ents:
    print(ent.text, ent.label_)

U.S. GPE
Afghanistan GPE
Austin Miller PERSON
Zalmay Khalilzad PERSON
Qatar GPE
Mohammed bin Abdur PERSON
Rahman al-Thani PERSON
Taliban ORG
Mullah Abdul Ghani Baradar PERSON
U.S. GPE
Afghanistan GPE
AustinMiller PERSON
Zalmay Khalilzad PERSON
Qatar GPE
Mohammed bin Abdur PERSON
Rahman al-Thani PERSON
Taliban ORG
Mullah Abdul Ghani Baradar PERSON
Americans NORP
eithercivilians NORP
Taliban ORG
U.S. GPE
Americans NORP
Taliban ORG
U.S. GPE


# **Text Classification**

*  Assigns a text to one of several pre-defined classes
*  Is applied to:<br/>
   * Sentiment analysis
   * Chatbots

*   Text classification Pipeline
     1. Tokenization
     2. Stemming or lemmatization
     3. Word embedding
     4. Classification of the resulting vectors
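
As a rough illustration of how such a pipeline ends in a class label, here is a minimal sketch of a sentiment classifier. It deliberately swaps the embedding and learned-classifier stages for a simple word-score lookup, and it is not how VADER works internally. The `LEXICON` entries and the helpers `tokenize` and `classify` are hypothetical.

```python
import re

# Hypothetical mini-lexicon, invented for illustration; real lexicons are far larger.
LEXICON = {
    "love": 1.0,
    "best": 1.0,
    "masterpiece": 1.0,
    "awful": -1.0,
    "predictable": -0.5,
}


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify(text):
    """Sum the lexicon scores of all tokens and map the total to a class label."""
    score = sum(LEXICON.get(token, 0.0) for token in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I love this movie so much"))               # positive
print(classify("The reveal was just unbearably awful."))   # negative
print(classify("It is a movie"))                           # neutral
```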

In [15]:
!pip install vadersentiment

Collecting vadersentiment
  Downloading https://files.pythonhosted.org/packages/76/fc/310e16254683c1ed35eeb97386986d6c00bc29df17ce280aed64d55537e9/vaderSentiment-3.3.2-py2.py3-none-any.whl (125kB)
Installing collected packages: vadersentiment
Successfully installed vadersentiment-3.3.2


In [16]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [17]:
analyzer = SentimentIntensityAnalyzer()
sentences = ["This movie, and trilogy in general, is a cinematic (and literary) masterpiece, and simply refuses to get old. I love this movie so much it's become a tradition to watch the Lord of the Rings series at least once a year! It's just as good as Christmas",
             "It is the best movie I can remember I've watched while I was a kid!", "I think the regional crowd seems to be the cause for its high rating than it deserves. The reveal was just unbearably awful.",
             "This is an average commercial movie disguised as a 'psycho thriller', with a plot twist so predictable, that you will definitely see it coming within the 1st half hour."]

for sentence in sentences:
    print(analyzer.polarity_scores(sentence)['compound'])

0.9098
0.6696
-0.4588
0.25
