### POS TAGGING 

In [4]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [5]:
doc = nlp(u'I am learning how to build chatbots')
for token in doc:
    print(token.text, token.pos_)

I PRON
am VERB
learning VERB
how ADV
to PART
build VERB
chatbots NOUN


In [8]:
doc1 = nlp(u'Google release "Move Mirror" AI experiment that matches your pose from 80,100 images')
for token in doc1:
    print(token.text, token.lemma_, token.pos, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

Google Google 96 NNP compound Xxxxx True False
release release 92 NN ROOT xxxx True False
" " 97 `` punct " False False
Move Move 96 NNP nmod Xxxx True True
Mirror Mirror 96 NNP nmod Xxxxx True False
" " 97 '' punct " False False
AI AI 96 NNP compound XX True False
experiment experiment 92 NN appos xxxx True False
that that 90 WDT nsubj xxxx True True
matches match 100 VBZ relcl xxxx True False
your -PRON- 90 PRP$ poss xxxx True True
pose pose 92 NN dobj xxxx True False
from from 85 IN prep xxxx True True
80,100 80,100 93 CD nummod dd,ddd False False
images image 92 NNS pobj xxxx True False


In [9]:
doc2 = nlp(u'I am learning how to build chatbots')
for token in doc2:
    print(token.text, token.lemma_, token.pos, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

I -PRON- 95 PRP nsubj X True True
am be 100 VBP aux xx True True
learning learn 100 VBG ROOT xxxx True False
how how 86 WRB advmod xxx True True
to to 94 TO aux xx True True
build build 100 VB xcomp xxxx True False
chatbots chatbot 92 NNS dobj xxxx True False


| Attribute | Meaning |
| ---- | ----------------------------------- |
| TEXT | Actual text or word being processed |
| LEMMA | Root form of the word being processed |
| POS | Coarse-grained part of speech of the word |
| TAG | Fine-grained part-of-speech tag |
| DEP | Syntactic dependency relation |
| SHAPE | Shape of the word (capitalization, digits, punctuation) |
| ALPHA | Does the token consist only of alphabetic characters? |
| STOP | Is the word part of the stop list? |
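
The SHAPE and ALPHA columns can be approximated in plain Python. The sketch below is an assumption inferred from the shapes printed above (letters become X or x, digits become d, and runs of the same symbol are capped at four); `word_shape` is a hypothetical helper, not spaCy's own function.

```python
def word_shape(text):
    """Approximate a token's shape: X/x for letters, d for digits, runs capped at 4."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation is kept as-is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        if run <= 4:  # drop anything beyond four repeats of the same symbol
            shape.append(shape_char)
    return "".join(shape)


for word in ["Google", "AI", "80,100", "chatbots"]:
    # str.isalpha() plays the role of the ALPHA column here
    print(word, word_shape(word), word.isalpha())
```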

### Stemming and Lemmatization

**Stemming**: Does the job in a crude, heuristic way that chops off the ends of words, assuming that the remaining word is what we are actually looking for. This often includes the removal of derivational affixes.<br><br>
**Lemmatization**: Tries to do the job more elegantly with the use of a vocabulary and morphological analysis of words. It tries its best to remove inflectional endings only and return the dictionary form of a word, known as the lemma.

In [14]:
from spacy.lemmatizer import Lemmatizer
from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
lemmatizer('chuckles', 'NOUN')                                 # 2nd param is token's part-of-speech tag

['chuckle']

In [15]:
lemmatizer('blazing', 'VERB')

['blaze']

In [16]:
lemmatizer('fastest', 'ADJ')

['fast']

#### To understand the difference between a stemmer and a lemmatizer, use NLTK

In [17]:
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
print(porter_stemmer.stem('fastest'))
print(snowball_stemmer.stem('fastest'))


fastest
fastest
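
To make the contrast concrete without depending on either library, here is a minimal sketch. Both `crude_stem` and `toy_lemmatize` are hypothetical toy functions, and the suffix list and lookup table are assumptions chosen for illustration; they are not how NLTK or spaCy work internally.

```python
def crude_stem(word):
    """Toy stemmer: strip the first matching suffix without consulting a vocabulary."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical mini-vocabulary mapping (word, part of speech) to its lemma.
LEMMA_TABLE = {
    ("chuckles", "NOUN"): "chuckle",
    ("blazing", "VERB"): "blaze",
    ("fastest", "ADJ"): "fast",
}


def toy_lemmatize(word, pos):
    """Toy lemmatizer: look the word up, fall back to the word itself."""
    return LEMMA_TABLE.get((word, pos), word)


for word, pos in [("chuckles", "NOUN"), ("blazing", "VERB"), ("fastest", "ADJ")]:
    print(word, crude_stem(word), toy_lemmatize(word, pos))
```

The stemmer returns fragments that need not be real words, while the lookup returns dictionary forms but only for words it knows about.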


### Named Entity Recognition
- The process of finding named entities in the given text and classifying them into pre-defined categories.
- Hugely dependent on the knowledge base used to train the NE extraction algorithm.
- spaCy comes with a very fast entity recognition model that is capable of identifying entity phrases from a given document.
- Entities can be of different types, such as person, location, organization, dates, numerals, etc.
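
The idea of classifying spans into pre-defined categories can be illustrated with a few hand-written rules. This is only a sketch under stated assumptions: the two patterns and the `find_entities` helper are hypothetical, and a trained model such as spaCy's learns these cues from data rather than from regular expressions.

```python
import re

# Hypothetical hand-written rules, one per pre-defined category.
ENTITY_PATTERNS = [
    ("TIME", re.compile(r"\b\d{1,2}:\d{2}\s?(?:AM|PM)\b")),
    ("PERCENT", re.compile(r"\b\d+(?:\.\d+)?%")),
]


def find_entities(text):
    """Return (span text, label) pairs in order of appearance."""
    found = []
    for label, pattern in ENTITY_PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    found.sort()  # order by start position in the text
    return [(span, label) for _, span, label in found]


print(find_entities("I usually wake up at 9:00 AM. 90% of my daytime goes in learning new things."))
```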

In [19]:
my_string = u"Google has its headquarters in Mountain View, California having revenue iamounted to 109.65 billion uS dollars"
doc = nlp(my_string)
for ent in doc.ents:
    print(ent.text, ent.label_)

Google ORG
Mountain View GPE
California GPE
109.65 billion uS dollars MONEY


In [27]:
my_string = (u"Mark Zuckerberg born May 14,1984 in New York is an American technology entreprenuer and philanthropist best known for co-funding and leading Facebook as it's chairman and CEO.")
doc = nlp(my_string)
for ent in doc.ents:
    print(ent.text, ent.label_)

Mark Zuckerberg PERSON
May 14,1984 DATE
New York GPE
American NORP
Facebook PERSON


In [28]:
my_string = u"I usually wake up at 9:00 AM. 90% of my daytime goes in learning new things."
doc = nlp(my_string)
for ent in doc.ents:
    print(ent.text, ent.label_)

9:00 AM TIME
90% PERCENT


In [30]:
my_string1 = u"Imagine Dragons are the best band."
my_string2 = u"Imagine Dragons come and take over the city."

doc1 = nlp(my_string1)
doc2 = nlp(my_string2)

for ent in doc1.ents:
    print(ent.text, ent.label_)

In [31]:
for ent in doc2.ents:
    print(ent.text, ent.label_)

#### Observations:
- Suppose we had to extract the context of the two strings above in a live environment.
- What to do? With the help of an entity extractor, one can figure out the context of the statement and intelligently take the conversation further.

### STOP WORDS

Stop words are high-frequency words like a, an, the, to, and also that we sometimes want to filter out of a document before further processing.
- These usually have little lexical content and do not hold much meaning.

In [32]:
from spacy.lang.en.stop_words import STOP_WORDS
print(STOP_WORDS)

{'beforehand', 'did', 'his', 'should', 'cannot', "'re", 'our', 'therefore', '‘m', 'thereupon', 'which', 'same', 'former', 'per', 'part', 'whole', 'throughout', 'might', 'was', 'more', 'anywhere', 'quite', 'had', 'nowhere', 'always', 'thence', 'either', 'still', 'seems', 'rather', 'a', 'less', 'own', 'mostly', 'everything', 'been', 'now', 'thus', 'us', '’m', 'wherever', 'eight', 'whether', 'move', 'whereupon', "'s", '’ll', 'something', 'further', 'what', 'where', 'please', 'not', 'get', "'d", 'be', 'seeming', 'you', 'must', 'has', 'twelve', 'only', 'behind', 'afterwards', 'front', 'it', 'who', 'unless', 'onto', 'herein', 'fifteen', 'down', 'whoever', 'otherwise', 'two', 'their', 'no', 'last', 'ever', 'above', 'latter', 'her', 'ourselves', 'therein', 'can', 'yours', 'n’t', 'will', 'when', 'am', 'your', 'three', 'fifty', 'almost', 'being', 'whose', 'over', 'thereby', 'seem', 'them', 'against', 'under', 'myself', 'keep', 'an', 'to', 'by', 'thru', 'hereupon', 'most', 'they', 'such', 'yourse

In [33]:
# To check if a word is stop word or not
nlp.vocab[u'is'].is_stop

True

In [34]:
nlp.vocab[u'hello'].is_stop

False

In [35]:
nlp.vocab[u'with'].is_stop

True

#### Observations:
- Removing stop words is an important part of text cleanup. It helps remove meaningless data before we try to do the actual processing to make sense of the text.
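
A minimal sketch of that cleanup step, assuming whitespace tokenization and a small hand-picked stop list (`TOY_STOP_WORDS` and `remove_stop_words` are hypothetical names; spaCy's STOP_WORDS set is far larger):

```python
# Small hand-picked stop list for illustration only (an assumption).
TOY_STOP_WORDS = {"a", "an", "the", "to", "is", "am", "i", "how", "of", "and"}


def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Keep only the words that are not in the stop list (case-insensitive)."""
    return [word for word in text.split() if word.lower() not in stop_words]


print(remove_stop_words("I am learning how to build chatbots"))
```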

### DEPENDENCY PARSING

In [36]:
doc = nlp(u"Book me a flight from Bangalore to Goa")
blr, goa = doc[5], doc[7]
list(blr.ancestors)

[from, flight, Book]

In [37]:
list(goa.ancestors)

[to, flight, Book]

**What are ancestors in Dependency parsing?**

- The ancestors of a token are the tokens that dominate it in the dependency tree: its head, its head's head, and so on up to the root. In the example above, the ancestors of blr are from, flight, and Book.
- You can always list the ancestors of any token in a doc object using the ancestors attribute.
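
The walk up the tree can be sketched with a plain list of head indices. The `heads` list below is hand-written to agree with the ancestor lists printed above; the heads chosen for me and a are assumptions, and `ancestors` and `is_ancestor` are hypothetical helpers rather than spaCy's implementation.

```python
words = ["Book", "me", "a", "flight", "from", "Bangalore", "to", "Goa"]
# heads[i] is the index of token i's head; the root points to itself.
heads = [0, 0, 3, 0, 3, 4, 3, 6]


def ancestor_indices(index):
    """Yield the indices of the head, the head's head, and so on up to the root."""
    while heads[index] != index:
        index = heads[index]
        yield index


def ancestors(index):
    return [words[i] for i in ancestor_indices(index)]


def is_ancestor(candidate, index):
    """True if the token at `candidate` dominates the token at `index`."""
    return candidate in ancestor_indices(index)


print(ancestors(5))       # Bangalore
print(ancestors(7))       # Goa
print(is_ancestor(3, 5))  # does flight dominate Bangalore?
```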

In [38]:
list(doc[4].ancestors)                        # doc[4] == from

[flight, Book]

In [39]:
doc[3].is_ancestor(doc[5])

True

**In real life scenario**<br><br>
- If we think of a real-world scenario that we might actually face while trying to build a chatbot, we may come across a sentence like: "I want to book a cab to the hotel and a table at a restaurant"
- Here, it's important to know which tasks are requested and where they are targeted.

In [45]:
doc = nlp(u"Book a table at the restaurant and the taxi to the hotel")
tasks = doc[2], doc[8]           #(table, taxi)
tasks_target = doc[5], doc[11]   #(restaurant, hotel)

for task in tasks_target:
    for tok in task.ancestors:
        if tok in tasks:
            print("Booking of {} belongs to {}".format(tok,task))
    

Booking of table belongs to restaurant
Booking of table belongs to hotel


**What are Children in Dependency Parsing?**<br>
Children are the immediate syntactic dependents of a token. We can see the children of a word by using the children attribute, just as we used ancestors.

In [46]:
list(doc[3].children)

[restaurant]

**Interactive Visualization for Dependency Parsing**

In [47]:
from spacy import displacy
doc = nlp(u"Book a table at the restaurant and the taxi to the hotel")
displacy.serve(doc, style='dep')

Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


Let's see one more example of dependency parsing where we assume a user is asking the following sentence:<br>
*What are some places to visit in Berlin and stay in Lubeck?*

In [None]:
doc = nlp(u"What are some places to visit in Berlin and stay in Lubeck?")
places = [doc[7], doc[11]]                     #[Berlin, Lubeck]
actions = [doc[5], doc[9]]                     #[visit, stay]

In [None]:
for place in places:
    for tok in place.ancestors:                # iterate over this place's ancestors
        if tok in actions:
            print("User is referring {} to {}".format(place, tok))
            break                              # stop at the nearest matching action

**What is the use of Dependency Parsing in Chatbots?**

- Dependency parsing is one of the most important parts when handling chatbots from scratch. It becomes far more important when you want to figure out the meaning of a text input from your user to your chatbot.
- There can be cases when we haven't trained our chatbots, but still we don't want to lose our customer or reply like a dumb machine.
- Areas of help for dependency parsing:
    - It helps in finding relationships between words of grammatically correct sentences.
    - It can be used for sentence boundary detection.
    - It is quite useful to find out if the user is talking about more than one context simultaneously.

### Noun Chunks

In [50]:
doc = nlp(u"Boston Dynamics is geairing up to produce thousands of robot dogs")
list(doc.noun_chunks)

[Boston Dynamics, thousands, robot dogs]

In [51]:
doc = nlp(u"Deep Learning cracks the code of messenger RNAs and protein coding potential")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Deep Learning Learning nsubj cracks
the code code dobj cracks
messenger RNAs RNAs pobj of
protein coding potential potential conj RNAs


| Text | Root.Text | Root.Dep_ | Root.head.text |
|---------- | ------ | ------ | ------- |
| Deep Learning | Learning | nsubj | cracks |
| the code | code | dobj | cracks |
| messenger RNAs | RNAs | pobj | of |
| protein coding potential | potential | conj | RNAs |


| Column | Meaning |
| ----- | ------ |
| Text | Text of the original noun chunk |
| Root text | Text of the word that connects the noun chunk to the rest of the parse |
| Root dep | Dependency relation that connects the root to its head |
| Root head text | Text of the root token's head |

### Finding Similarity

- spaCy uses high-quality word vectors, obtained with the GloVe algorithm (*Global Vectors for Word Representation*), to find the similarity between two words.
- GloVe is an unsupervised learning algorithm for obtaining vector representations of words.
- It uses aggregated global word-word co-occurrence statistics from a corpus to train the model.
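
To give a feel for what word-word co-occurrence statistics are, the sketch below counts how often two words appear within a small window of each other. It is a simplification with plain, unweighted counts on a made-up two-sentence corpus; the `cooccurrence_counts` helper and the window size are assumptions, and no vectors are trained here.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count unordered word pairs that occur within `window` tokens of each other."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return counts


corpus = ["the cat sat on the mat", "the dog sat on the rug"]
counts = cooccurrence_counts(corpus)
print(counts[("on", "sat")])   # pair seen in both sentences
print(counts[("cat", "dog")])  # pair never seen together
```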

In [52]:
doc = nlp(u"How are you doing today?")
for token in doc:
    print(token.text, token.vector[:5])

How [-2.0411305  1.3408571 -0.4233122 -3.108604   0.836762 ]
are [-1.3876543  0.6120126 -1.4433626  3.5829468 -0.7003968]
you [-3.845181   -0.26770237  1.0238774  -1.9951785  -1.9080703 ]
doing [-3.2571833  -3.1260393  -1.4001853  -0.36002517 -2.039889  ]
today [ 2.7653427  -2.6375554   0.00717561 -0.3329475  -4.96414   ]
? [ 0.7992259 -2.4167218  2.8654325  5.255692  -2.7328923]


In [53]:
hello_doc = nlp(u"Hello")
hi_doc = nlp(u"hi")
hella_doc = nlp(u"hella")
print(hello_doc.similarity(hi_doc))
print(hello_doc.similarity(hella_doc))

0.6450684542191895
0.3945482191675344


#### Observation:
- The word hello is more closely related to hi than to hella, even though hello and hella differ by only one character.
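
Scores like these are commonly computed as the cosine similarity between two vectors, which is also what spaCy's documentation describes as its default. The sketch below uses made-up three-dimensional vectors (pure assumptions, not real embeddings) to show how spelling plays no part in the score.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors for illustration only.
hello = [0.9, 0.1, 0.3]
hi = [0.8, 0.2, 0.4]
hella = [0.1, 0.9, 0.2]

print(cosine_similarity(hello, hi) > cosine_similarity(hello, hella))
```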

In [54]:
GoT_str1 = nlp(u"When will next season of Game of Thrones be releasing?")
GoT_str2 = nlp(u"Game of Thrones next season release date?")
GoT_str1.similarity(GoT_str2)

0.7016174785258795

In [59]:
# Find similarity between words
example_doc = nlp(u"car truck google")
for t1 in example_doc:
    for t2 in example_doc:
        similarity_perc = int(t1.similarity(t2) * 100)
        print("Word {} is {}% similar to word {}".format(t1.text, similarity_perc, t2.text))

Word car is 100% similar to word car
Word car is 52% similar to word truck
Word car is 37% similar to word google
Word truck is 52% similar to word car
Word truck is 100% similar to word truck
Word truck is 38% similar to word google
Word google is 37% similar to word car
Word google is 38% similar to word truck
Word google is 100% similar to word google


Finding similarity between words or sentences becomes quite important when we intend to build any application that depends heavily on NLP.<br><br>
When building chatbots, finding similarity can be very handy for the following:
- Building chatbots for recommendations
- Removing duplicates
- Building a spell-checker

### Good to know things in NLP for Chatbots

1. TOKENIZATION
2. REGULAR EXPRESSIONS

#### Tokenization

- One of the simplest yet most basic concepts of NLP, where we split text into meaningful segments.
- spaCy first tokenizes the text (i.e., segments it into words, punctuation, and so on).
- **One question**: Why can't we just use the built-in *split* method of the Python language to do the tokenization?
    - Answer: The split method is just a raw method that splits the sentence into tokens given a separator. It does not take any meaning into account, so, for example, punctuation stays attached to the neighboring word.
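
The difference is easy to see in plain Python. The regular expression below is a hypothetical stand-in for a tokenizer rule set (it is not what spaCy uses): it keeps dotted abbreviations together and separates other punctuation.

```python
import re

text = "Brexit is the impending withdrawal of the U.K. from the European Union."

# Naive approach: split on whitespace only.
naive_tokens = text.split()

# Toy tokenizer: dotted abbreviations first, then words, then single punctuation marks.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(naive_tokens[-1])   # the final period stays glued to the word
print(regex_tokens[-2:])  # the word and the period are separate tokens
```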

In [60]:
doc = nlp(u"Brexit is the impending withdrawal of the U.K. from the European Union.")
for token in doc:
    print(token.text)

Brexit
is
the
impending
withdrawal
of
the
U.K.
from
the
European
Union
.


#### Observations:
- See, U.K. comes out as a single token after tokenization, which makes sense, as U.K. is a country name and splitting it would be wrong.
- If you are still not happy with spaCy's tokenization, you can use the tokenizer's add_special_case method to add your own rule before relying completely on spaCy's tokenization method.

#### Regular Expressions

In [69]:
sentence2 = "Book me a metro from Airport Station to Hong Kong Station."
sentence1 = "Book me a cab to Hong Kong Airport from AsiaWorld-Expo."

In [70]:
import re
from_to = re.compile('.* from (.*) to (.*)')
to_from = re.compile('.* to (.*) from (.*)')

from_to_match = from_to.match(sentence2)
to_from_match = to_from.match(sentence2)

if from_to_match and from_to_match.groups():
    _from = from_to_match.groups()[0]
    _to = from_to_match.groups()[1]
    print("from_to pattern matched correctly. Printing values\n")
    print("From: {}, To: {}".format(_from, _to))
    
elif to_from_match and to_from_match.groups():
    _to = to_from_match.groups()[0]
    _from = to_from_match.groups()[1]
    print("to_from pattern matched correctly. Printing values\n")
    print("From: {}, To: {}".format(_from, _to))

from_to pattern matched correctly. Printing values

From: Airport Station, To: Hong Kong Station.
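
The same idea can be packaged as a small function with named groups, which avoids tracking group positions by hand and also strips the trailing period seen in the output above. This is a sketch; `extract_route` and the group names `origin` and `destination` are hypothetical choices, not part of the original code.

```python
import re

FROM_TO = re.compile(r".* from (?P<origin>.+) to (?P<destination>.+)")
TO_FROM = re.compile(r".* to (?P<destination>.+) from (?P<origin>.+)")


def extract_route(sentence):
    """Return {'origin': ..., 'destination': ...} or None if no pattern matches."""
    for pattern in (FROM_TO, TO_FROM):
        match = pattern.match(sentence)
        if match:
            # Drop trailing spaces and sentence-final punctuation from each value.
            return {name: value.rstrip(" .?!") for name, value in match.groupdict().items()}
    return None


print(extract_route("Book me a metro from Airport Station to Hong Kong Station."))
print(extract_route("Book me a cab to Hong Kong Airport from AsiaWorld-Expo."))
```

Patterns like these only work for the sentence structures they were written for, so they are best treated as a fallback for well-known phrasings.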
