In this section, we will integrate everything we have learned in the past five chapters. Let's **put it all together**.

We will cover:
* Extracting named entities
* Using dependency relations for intent recognition
* Semantic similarity methods for semantic parsing
* Putting it all together

[**Code**](https://github.com/PacktPublishing/Mastering-spaCy/tree/main/Chapter06)

## Semantic Parsing with spaCy 

We are going to work with the **Airline Travel Information System (ATIS)** dataset, also known as the airplane ticket reservation system dataset.


#### Dataset Overview

This dataset is one of the standard benchmark datasets for intent classification. It contains the utterances of customers who want to book
a flight or get information about flights, including flight costs, flight
destinations, and timetables.

**Before processing any dataset, we should first inspect it with our own eyes:**

* What kind of utterances are there? Is it a short text corpus or does the
    corpus consist of long documents or medium-length paragraphs?
* What sort of entities does the corpus include? People's names, city
    names, country names, organization names, and so on. Which ones do we
    want to extract?
* How is punctuation used? Is the text correctly punctuated, or is no
    punctuation used at all?
* How are the grammatical rules followed? Is the capitalization correct?
* Did users follow the grammatical rules? Are there misspelled words?
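
Some of these questions can be answered with a quick script before any NLP pipeline is involved. The following is a minimal sketch using only the standard library; the `describe_corpus` helper and the two-utterance sample are illustrative assumptions, not part of the original notebook.

```python
import string


def describe_corpus(utterances):
    """Return rough surface statistics for a list of utterances (hypothetical helper)."""
    lengths = [len(u.split()) for u in utterances]
    total = len(utterances)
    return {
        "count": total,
        "avg_tokens": sum(lengths) / total,
        "max_tokens": max(lengths),
        # share of utterances containing at least one uppercase letter
        "capitalized": sum(any(c.isupper() for c in u) for u in utterances) / total,
        # share of utterances containing at least one punctuation character
        "punctuated": sum(any(c in string.punctuation for c in u) for u in utterances) / total,
    }


# Two utterances copied from the dataset preview
sample = [
    "cheapest airfare from tacoma to orlando",
    "round trip fares from pittsburgh to philadelphia under 1000 dollars",
]
stats = describe_corpus(sample)
print(stats)
```

Whitespace splitting is only a rough stand-in for real tokenization, but it is usually enough to tell short utterances from long documents.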


In [29]:
from spacy.matcher import Matcher
from collections import Counter
from spacy import displacy
import en_core_web_md
import pandas as pd
import spacy 


nlp = en_core_web_md.load()

In [2]:
dataset = pd.read_csv("https://raw.githubusercontent.com/PacktPublishing/Mastering-spaCy/main/Chapter06/data/atis_intents.csv", header = None) 
# dataset.columns = ['intents', 'utterance']
dataset.head()

Unnamed: 0,0,1
0,atis_flight,i want to fly from boston at 838 am and arriv...
1,atis_flight,what flights are available from pittsburgh to...
2,atis_flight_time,what is the arrival time in san francisco for...
3,atis_airfare,cheapest airfare from tacoma to orlando
4,atis_airfare,round trip fares from pittsburgh to philadelp...


In [3]:
# let's print some text 

for text in dataset[1].head(): 
    print(text)

 i want to fly from boston at 838 am and arrive in denver at 1110 in the morning
 what flights are available from pittsburgh to baltimore on thursday morning
 what is the arrival time in san francisco for the 755 am flight leaving washington
 cheapest airfare from tacoma to orlando
 round trip fares from pittsburgh to philadelphia under 1000 dollars


**A small analysis**

As we can see, the first user wants to book a flight; they included the
destination, the source cities, and the flight time. The third user is asking
about the arrival time of a specific flight and the fifth user made a query
with a price limit. The utterances are not capitalized or punctuated. This
is because these utterances are an output of a speech-to-text engine.


In [4]:
grouped = dataset.groupby(0).size() 

print(grouped)

0
atis_abbreviation                            147
atis_aircraft                                 81
atis_aircraft#atis_flight#atis_flight_no       1
atis_airfare                                 423
atis_airfare#atis_flight_time                  1
atis_airline                                 157
atis_airline#atis_flight_no                    2
atis_airport                                  20
atis_capacity                                 16
atis_cheapest                                  1
atis_city                                     19
atis_distance                                 20
atis_flight                                 3666
atis_flight#atis_airfare                      21
atis_flight_no                                12
atis_flight_time                              54
atis_ground_fare                              18
atis_ground_service                          255
atis_ground_service#atis_ground_fare           1
atis_meal                                      6
atis_quantity     
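
Some labels in the output above contain a `#`. A plausible reading, treated here as an assumption, is that `#` joins several intents carried by one utterance. Under that assumption, the sketch below splits the combined labels and recounts single intents; the dictionary holds only a few rows copied from the output, not the full distribution.

```python
from collections import Counter

# A few (label, count) rows copied from the grouped output above
label_counts = {
    "atis_flight": 3666,
    "atis_airfare": 423,
    "atis_flight#atis_airfare": 21,
    "atis_airfare#atis_flight_time": 1,
    "atis_flight_time": 54,
}

single_intents = Counter()
for label, n in label_counts.items():
    # Assumption: '#' separates the intents of a multi-intent utterance
    for intent in label.split("#"):
        single_intents[intent] += n

print(single_intents.most_common())
```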

#### Extracting named entities with Matcher

As we have already seen, this is a flights dataset. Hence, we expect to see
city/country names, airport names, and airline names:

In [5]:
for text in dataset[1].head(): 
    displacy.render(nlp(text), style = 'ent')

In [8]:
dataset[1].to_csv('utterance.txt', header=False, index=False, sep=' ', mode='a')  # save only the utterance column (mode='a' appends on every run)

In [25]:
corpus = open('utterance.txt', 'r').read()  # reading the file 
    

In [26]:
corpus = corpus.replace('"', '')  # remove the double quotes 
corpus = corpus.split('\n')   # split the text into one utterance per line 

In [27]:
all_ent_labels = []

for sentence in corpus: 
    doc = nlp(sentence.strip()) 
    ents = doc.ents   
    all_ent_labels += [ent.label_ for ent in ents] 
c = Counter(all_ent_labels)

```Python
all_ent_labels = []

for sentence in corpus: 
    doc = nlp(sentence.strip())  # strip -> remove leading and trailing whitespace 
    ents = doc.ents   # get the entities 
    all_ent_labels += [ent.label_ for ent in ents]   # add each entity label to the list 
c = Counter(all_ent_labels)  # count how many times each label occurs
```

In [28]:
print(c) 

Counter({'GPE': 9139, 'DATE': 1448, 'TIME': 1052, 'ORG': 452, 'ORDINAL': 188, 'CARDINAL': 169, 'FAC': 52, 'NORP': 51, 'MONEY': 37, 'PRODUCT': 12, 'PERSON': 8, 'LOC': 2, 'LAW': 1, 'QUANTITY': 1, 'PERCENT': 1})


The most frequent entity labels are GPE
(location names), DATE, TIME, and ORG (organizations). The
location entities refer to destination and source cities/countries, hence
they play a very important role in the overall semantic success of our
application.

#### Let's extract the location entities with Matcher

We are going to extract each location together with its **preposition** (for example, from denver, to chennai, at bangalore).

In [32]:
matcher = Matcher(nlp.vocab)  # usual way to initialize the matcher class 

pattern = [[ 
    {"POS": "ADP"},   # adposition = preposition or postposition (captures the preposition in front of the GPE) 
    {"ENT_TYPE": "GPE"} 
]] 

matcher.add('prepositionLocation', pattern) 

doc = nlp("show me flights from denver to boston on tuesday")
matches = matcher(doc) 

for mid, start, end in matches: 
    print(doc[start:end])

from denver
to boston


In [31]:
spacy.explain('ADP')

'adposition'

In [36]:
### Some tries with the added pattern 

# 1st try 
doc = nlp("yes i'd like a flight from long beach to st. louis by way of dallas")
matches = matcher(doc)

for mid, start, end in matches:
    print(doc[start:end])
    
# 2nd try 
doc = nlp("what are the evening flights flying out of dallas")
matches = matcher(doc)

for mid, start, end in matches:
    print(doc[start:end])
    
# 3rd try 
doc = nlp("i'm looking for a flight that goes from ontario to westchester and stops in chicago")
matches = matcher(doc)

for mid, start, end in matches:
    print(doc[start:end])

to st
of dallas
of dallas
in chicago


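The pattern finds a good number of locations, although the output above also shows its limits: `to st` is cut short, and some locations (for example, long beach in the first utterance) are missing.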


After extracting the locations, we can now extract the airline information.
The ORG entity label means an organization and it corresponds to
airline company names in our dataset. 

In [38]:
# Now let's get the org name also 

pattern = [[ 
    {'ENT_TYPE': 'ORG', 'OP': '+'}   # '+' one or more 
]]

matcher.add('AirlineName', pattern) 

doc = nlp("what is the earliest united airlines flight flying from denver")
matches = matcher(doc)

for mid,start,end in matches:
    print(doc[start:end])


united
united airlines
airlines
from denver


In [41]:
# Now let's try to get date and time 

pattern = [[
    {'ENT_TYPE': {'IN': ['DATE', 'TIME']}, 'OP': '+'}  # a dict cannot repeat the 'ENT_TYPE' key, so use IN to accept either label
]]

matcher.add('TimePattern', pattern) 

doc = nlp("what is the earliest united date atlanta to denver flight flying from denver")
matches = matcher(doc)

for mid,start,end in matches:
    print(doc[start:end])
    

to denver
from denver


Now let's try to extract abbreviation entities. Extracting abbreviations is a difficult task, so we start from a few observations:
 
a) An abbreviation can be broken into two parts – letters, and digits.

b) The letter part can be 1-2 characters long.

c) The digit part is also 1-2 characters long.

d) The presence of digits indicates an abbreviation entity.

e) The presence of the following words indicates an abbreviation entity:
class, code, abbreviation.

f) The POS tag of an abbreviation is a noun. If the candidate word is a
1-letter or 2-letter word, then we can look at the POS tag and see whether
it's a noun. This approach eliminates false positives, such as us
(pronoun), me (pronoun), a (determiner), and an (determiner).
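
Rules (a) to (e) can be tried out with plain regular expressions before writing Matcher patterns. This is a minimal sketch under stated assumptions: the `find_abbreviations` helper is hypothetical, it splits on whitespace instead of using spaCy's tokenizer, it uses `[a-z]` where the Matcher pattern uses `\w`, and it ignores rule (f) because that rule needs POS tags.

```python
import re

SINGLE_TOKEN = re.compile(r"[a-z]{1,2}\d{1,2}")  # letters and digits in one token, e.g. d10
LETTERS = re.compile(r"[a-z]{1,2}")              # rule (b): 1-2 letters
DIGITS = re.compile(r"\d{1,2}")                  # rule (c): 1-2 digits
CLUE_WORDS = {"class", "code", "abbreviation"}   # rule (e): context clue words


def find_abbreviations(utterance):
    """Return abbreviation candidates found by rules (a)-(e); hypothetical helper."""
    tokens = utterance.split()
    found = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if SINGLE_TOKEN.fullmatch(tok):
            found.append(tok)
        elif LETTERS.fullmatch(tok) and DIGITS.fullmatch(nxt):
            found.append(f"{tok} {nxt}")
        elif tok in CLUE_WORDS and LETTERS.fullmatch(nxt):
            found.append(nxt)
    return found


print(find_abbreviations("what does restriction ap 57 mean"))  # ['ap 57']
print(find_abbreviations("what is the abbreviation d10"))      # ['d10']
print(find_abbreviations("what does code y mean"))             # ['y']
```

Without the POS check from rule (f), short words can only be accepted when a clue word or a digit part supports them, which is why this sketch never matches a bare 1-2 letter token.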



In [45]:
pattern1 = [{"TEXT": {"REGEX": r"\w{1,2}\d{1,2}"}}]  # raw string avoids invalid escape sequence warnings

pattern2 = [{"SHAPE": { "IN": ["x", "xx"]}}, {"SHAPE": { "IN": ["d", "dd"]}}]

pattern3 = [{"TEXT": {"IN": ["class", "code", "abbrev", "abbreviation"]}}, {"SHAPE": { "IN": ["x", "xx"]}}]

pattern4 = [{"POS": "NOUN", "SHAPE": { "IN": ["x", "xx"]}}]

matcher.add("abbrevEntities", [pattern1, pattern2, pattern3, pattern4])

sentences = [
'what does restriction ap 57 mean',
'what does the abbreviation co mean',
'what does fare code qo mean',
'what is the abbreviation d10',
'what does code y mean',
'what does the fare code f and fn mean',
'what is booking class c'
]

for sent in sentences:
    doc = nlp(sent)
    matches = matcher(doc)

    for mid, start, end in matches:
        print(doc[start:end])




ap 57
57
abbreviation co
co
code qo
d10
code y
code f
fn
class c
c


a) The first pattern matches a single token that consists of 1-2
letters and 1-2 digits. For example, d1, d10, ad1, and ad21 will match
this pattern.

b) The second pattern matches 2-token abbreviations where the first
token is 1-2 letters and the second token is 1-2 digits. The abbreviations ap
5, ap 57, a 5, and a 57 will match this pattern.

c) The third pattern matches two tokens too. The first token is a context
clue word, such as class or code, and the second token should be a 1-2
letter token. Some example matches are code f, code y, and class c.

d) The fourth pattern extracts 1-2 letter short words whose POS tag is
NOUN. Some example matches from the preceding sentences are c and
co.


## Using dependency trees for extracting entities


In the last example, we extracted entities from very simple sentences. What if the user gives us sentences like these?

```Text 
" I'm going to a conference in Munich. I need an air ticket.

My sister's wedding will be held in Munich. I'd like to book a flight. "
```

Here, the preposition **to** refers to the **conference**, not to Munich (we will see how in the code). Instead of the flat pattern from the last example, we would need a pattern like **to + ... + GPE**. Then we have to be careful about which words can come between "to" and the city name, as well as which words should not.

Let's look at another example: 
```Text 
" I want to book a flight to my conference without stopping at Berlin. "
```

Here, **to** again refers to the **conference** and not to Berlin, so a flat preposition + GPE pattern cannot extract this information reliably. To handle such sentences, we need the **dependency trees** that we covered in Chapter 3.

#### Quick Recap 

Before proceeding to the code, let's first recall some concepts about dependency trees.
ROOT is a special dependency label and is always assigned to the main verb of the sentence.
spaCy shows syntactic relations with arcs. One of the tokens is the syntactic parent (called
the HEAD) and the other is the dependent (called the CHILD). For example, in the figure
below, going has 3 syntactic children: I, 'm, and to. Equivalently, the syntactic head of to is
going (the same applies to I and 'm).

<center><img src="images/dep.png" width="600"/></center>


There are no incoming arcs into the verb going, so going is the root of the
dependency tree (when we examine the code, we'll see that the dependency
label is ROOT). This is supposed to happen because going is the main verb
of the sentence. If we follow the arc to its immediate right, we encounter to;
jumping over the arcs to the right we reach Munich. This shows that there's a
syntactic relation between to and Munich.

In code, we will travel **right to left** to check whether the preposition **to** governs the city name.

We start from Munich, jump onto its head, then follow the
head's head, and so on. Since each word has exactly one head, it's
guaranteed that there will be only one path. Then we determine whether
to is on this path or not.


```Python 
def reach_parent(source_token, dest_token): 
    """Walk up the dependency tree from source_token; return dest_token if it is an ancestor, else None."""
    source_token = source_token.head   # start from the head of the source token 
    
    while source_token != dest_token: 
        if source_token.head == source_token:   # the root is its own head, so dest_token is not on the path 
            return None 
        source_token = source_token.head  # move one step up the tree 

    return source_token   # the loop ended because dest_token was reached 
```

In [21]:
import spacy 
import en_core_web_md 

nlp = en_core_web_md.load() 
doc = nlp("I'm going to a conference in Munich.") 

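def reach_parent(source_token, dest_token): 
    source_token = source_token.head  
    
    while source_token != dest_token: 
        if source_token.head == source_token:  # reached the root without finding dest_token 
            return None 
        source_token = source_token.head 

    return source_token  # must sit outside the loop, otherwise the walk stops after one step 
    
reach_parent(doc[-2], doc[3])  # Munich -> to 

to

Interesting, right? Walking up the head chain from Munich reaches to, which confirms that the two tokens are syntactically related. This gives us more accurate information than a flat token pattern.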

In [20]:
from spacy import displacy 

displacy.render(doc)
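
The same right-to-left walk can be tried without loading a model. The sketch below uses a hand-written head map for the example sentence; the map is an assumption that mirrors the parse described above (it is not produced by spaCy), and `on_head_path` is a hypothetical helper.

```python
def on_head_path(heads, source, dest):
    """Walk from `source` up the head chain; return True if `dest` is reached."""
    current = heads[source]
    while current != dest:
        if heads[current] == current:  # the root is its own head
            return False
        current = heads[current]
    return True


# Hand-written head map for "I'm going to a conference in Munich."
heads = {
    "I": "going", "'m": "going", "going": "going", "to": "going",
    "a": "conference", "conference": "to", "in": "conference", "Munich": "in",
}

print(on_head_path(heads, "Munich", "to"))  # True
print(on_head_path(heads, "Munich", "a"))   # False
```

Using token texts as dictionary keys only works because no word repeats in this sentence; with real data, token indices are the safer key.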

## Using dependency relations for intent recognition

After extracting the entities, we want to find out what sort of intent the user
carries – to book a flight, to purchase a meal on their already booked flight,
cancel their flight, and so on. If you look at the intents list again, you will see
that every intent includes a verb (to book) and an object that the verb acts on
(flight, hotel, meal).

All of these utterances contain **transitive verbs** and **direct/indirect objects**, so let's review those concepts first.

### Transitive verbs and direct/indirect objects

#### Transitive / Intransitive Verbs

A **verb** is a very important component of the sentence as it indicates the **action** in the sentence. The **object** of the sentence is the **thing/person** that is affected by the action of the verb.

A **transitive verb** is a verb that needs an **object** to **act upon**.

```Text 
Examples: 

I bought flowers.
He loved his cat.
He borrowed my book.
```

In these example sentences, **bought**, **loved**, and **borrowed** are **transitive
verbs**. In the first sentence, **bought** is the **transitive verb** and **flowers** is its
object, the thing that has been bought by the sentence subject, I. Loved – his
cat and borrowed – my book are transitive verb-object examples.

Let's look at the first sentence again. What if the **verb** is not followed by an **object**?
```Text 
I bought
```
This sentence is incomplete, because **bought** is a transitive verb.

Bought what? Without an object, this sentence doesn't carry any meaning at
all. In the preceding sentences, each of the objects completes the meaning of
the verb. This is a way of understanding whether a verb is transitive or not –
erase the object and check whether the sentence remains semantically intact.

Some verbs are transitive and some verbs are intransitive. An intransitive
verb is the opposite of a transitive verb; it doesn't need an object to act upon.


```Text
Yesterday I slept for 8 hours.
The cat ran towards me.
When I went out, the sun was shining.
Her cat died 3 days ago.
```

In all the preceding sentences, the verbs make sense without an object. If we
erase all the words other than the subject and the verb, these sentences are still
meaningful:

```Text
I slept.
The cat ran.
The sun was shining.
Her cat died.

```

#### Direct / Indirect Objects

As we remarked before, the object is the thing or person that is affected by the verb's action. An object can be direct or indirect.

##### Direct objects
A direct object answers the questions whom? or what?. You can find the direct object by asking subject + verb + what/whom?

```
I bought flowers. I bought what? - flowers
He loved his cat. He loved who? - his cat
He borrowed my book. He borrowed what? - my book
```

##### Indirect objects
An indirect object answers the questions for what?, for whom?, or to whom?. Let's see some examples:

```
He gave me his book. He gave his book to whom? - me
He gave his book to me. He gave his book to whom? -me
```

An indirect object often comes with a preposition such as to, for, or from. As you can see from these examples, an indirect object is also an object
and is affected by the verb's action, but its role in the sentence is a bit
different. An indirect object is sometimes viewed as the recipient of the
direct object.

If you want to read more about **sentence syntax**, you can [**refer**](https://www.amazon.in/Linguistic-Fundamentals-Natural-Language-Processing/dp/1681736713/ref=sr_1_1_sspa?crid=3PM0YPXKV4GE7&keywords=linguistic+fundamentals&qid=1649378998&sprefix=linguistic+fundamental%2Caps%2C240&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUFCNFFQRUtVWVFSSksmZW5jcnlwdGVkSWQ9QTA3NzYyOTExUjM0NkpVT0FMQkpSJmVuY3J5cHRlZEFkSWQ9QTA5MjgzOTIyQzRXTFZONzhSQjlCJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ==) to this book for a deeper treatment.

##### Single Intent Recognition

In [40]:
doc = nlp("find a flight from washington to sf")

displacy.render(doc)

In this example sentence, the transitive verb is find and the direct object is a
flight. The relation dobj connects a transitive verb to its direct object. If we
follow the arc, semantically, we see that the user wants to commit the action
of finding and the object they want to find is a flight. We can merge find and
a flight into a single word, findAflight or findFlight, which can be this
intent's name. Other intents can be bookFlight, cancelFlight, bookMeal, and
so on.

Let's extract the verb and the direct object in a more systematic way. We'll
first spot the direct object by looking for the dobj label in the sentence. To
locate the transitive verb, we look at the direct object's syntactic head. A
sentence can include more than one verb, hence we're careful while
processing the verbs. Here is the code:

In [24]:
doc = nlp("find a flight from washington to sf") 

for token in doc:
    if token.dep_ == 'dobj': 
        print(token.head.text + token.text.capitalize())
        print('----')

findFlight
----


In [36]:
doc[2].dep_

'dobj'

In [37]:
doc[2].head

find

In [38]:
doc[2].text

'flight'

##### Multiple Intent Recognition

In [41]:
displacy.render(nlp("show all flights and fares from denver to san francisco"))

In the preceding diagram, we see that the dobj arc connects show and flights.
The conj arc connects flights and fares to indicate the conjunction relation
between them. The conjunction relation is built by a conjunction such as and
or or and indicates that a noun is joined to another noun by this conjunction.
In this situation, we extract the direct object and its conjuncts. Let's now see
how we can turn this process into code:


In [52]:
doc = nlp("show all flights and fares from denver to san francisco")

for token in doc: 
    if token.dep_ == 'dobj': 
        dobj = token.text 
        conj = [t.text for t in token.conjuncts] 
        verb = token.head 
print(verb, dobj, conj)

show flights ['fares']
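
Once the verb, the direct object, and its conjuncts are known, each object can be paired with the verb to name one intent per object. A minimal sketch, assuming plain strings as input; `intent_names` is a hypothetical helper and the camel-case naming simply follows the `findFlight` convention used earlier.

```python
def intent_names(verb, dobj, conjuncts):
    """Build one verb+Object intent name for the direct object and each conjunct."""
    return [verb + obj.capitalize() for obj in [dobj, *conjuncts]]


# Values taken from the output printed above
print(intent_names("show", "flights", ["fares"]))  # ['showFlights', 'showFares']
```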


### Recognizing the intent using wordlists


In some cases, tokens other than the transitive verb and the direct object
contain the semantics of the user intent. In that case, you need to go further
down in the syntactic relations and explore the sentence structure deeper.

$$\heartsuit$$

```
Example: 
i want to make a reservation for a flight
```

In [81]:
doc = nlp("i want to make a reservation for a flight") 
displacy.render(doc)

In this sentence, the verb-object pair that best describes the user intent is
want-flight. However, if we look at the parse tree in the figure above, we see that
want and flight are not directly related in the parse tree. want is related to
the transitive verb make, and flight is related to the direct object
reservation.

What will we do then? We can play a trick and keep a list of helper verbs
such as would like, want, make, and need.

Please refer to the book for the code (pages 293-297).
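
To make the idea concrete without reproducing the book's code, here is a minimal sketch that works on a hand-written parse instead of a spaCy `Doc`. Everything in it is an assumption for illustration: the `(text, dep, head_index)` triples approximate the parse shown in the figure, `wordlist_intent` and `children` are hypothetical helpers, and both word lists are made up (note that make is left out of the verb list here so that the walk continues up to want).

```python
# (text, dependency label, index of head); a hand-written approximation of the parse
parse = [
    ("i", "nsubj", 1),
    ("want", "ROOT", 1),
    ("to", "aux", 3),
    ("make", "xcomp", 1),
    ("a", "det", 5),
    ("reservation", "dobj", 3),
    ("for", "prep", 5),
    ("a", "det", 8),
    ("flight", "pobj", 6),
]

INTENT_VERBS = {"want", "like", "need"}         # hypothetical verb list
INTENT_OBJECTS = {"flight", "meal", "booking"}  # hypothetical object list


def children(parse, i):
    """Indices of the tokens whose head is token i."""
    return [j for j, (_, _, head) in enumerate(parse) if head == i and j != i]


def wordlist_intent(parse):
    dobj = next(i for i, (_, dep, _) in enumerate(parse) if dep == "dobj")
    verb = parse[dobj][2]
    # Climb from the transitive verb towards the root until a listed verb is found
    while parse[verb][0] not in INTENT_VERBS and parse[verb][2] != verb:
        verb = parse[verb][2]
    # If the direct object is not listed, fall back to the object of its preposition
    obj = dobj
    if parse[obj][0] not in INTENT_OBJECTS:
        for child in children(parse, dobj):
            if parse[child][1] == "prep" and children(parse, child):
                obj = children(parse, child)[0]
                break
    return parse[verb][0] + parse[obj][0].capitalize()


print(wordlist_intent(parse))  # wantFlight
```

With a real `Doc`, the same walk would use `token.head` and `token.children`; the word lists would have to be tuned to the utterances in the dataset.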

## Semantic similarity methods for semantic parsing

As an NLP developer, while developing a semantic parser for a chatbot
application, text classification, or any other semantic application, you should
keep in mind that users use a fairly wide set of phrases and expressions
for each intent. In fact, if you're building a chatbot by using a platform such as
RASA (https://rasa.com/) or Dialogflow
(https://dialogflow.cloud.google.com/), you're asked to provide as many
utterance examples as you can for each intent. Then, these utterances
are used to train the intent classifier behind the scenes.

There are usually two ways to recognize semantic similarity, either with a
synonyms dictionary or with word vector-based semantic similarity methods.
In this section, we will discuss both approaches. Let's start with how to use a
synonyms dictionary to detect semantic similarity.

### Using Synonyms dictionary: 

We already went through our dataset and saw that different verbs are used to
express the same actions. For instance, the verbs landing, arriving, and flying to
carry the same meaning, whereas the verbs leaving, departing, and flying from
form another semantic group.

We already saw that in most cases, the transitive verbs and direct objects
express the intent. An easy way to **determine whether two utterances
represent the same intent** is to check whether the **verbs and the direct objects
are synonyms**.

Let's take an example and compare two example utterances from the dataset.
First, we prepare a small synonyms dictionary. We include only the base
forms of the verbs and nouns. While doing the comparison, we also use the
base form of the words:

In [63]:
verbSynsets = [
("show", "list"),
("book", "make a reservation", "buy", "reserve")  # all entries in a tuple share the same meaning 
]

objSynsets = [
("meal", "food"),
("aircraft", "airplane", "plane")
]


In [54]:
verbSynsets

[('show', 'list'), ('book', 'make a reservation', 'buy', 'reserve')]

In [70]:
# first create two Doc objects, one per utterance 
doc = nlp("show me all aircrafts that cp uses")
doc2 = nlp("list all meals on my flight")

# Then we extract the transitive verb and direct objects(dobj) from the first utterance 
for token in doc: 
    if token.dep_ == 'dobj': 
        obj = token.lemma_   # direct object 
        verb = token.head.lemma_  # transitive verb 
        break 
        
# Let's do this same for doc2 
for token in doc2: 
    if token.dep_ == 'dobj': 
        obj2 = token.lemma_   # direct object 
        verb2 = token.head.lemma_  # transitive verb 
        break 
        
# check which synonym group each verb belongs to
vsyn = [syn for syn in verbSynsets if verb in syn]
vsyn1 = [syn for syn in verbSynsets if verb2 in syn]

print (vsyn)
print (vsyn1) 

# do the same for the objects, this time looking them up in objSynsets
osyn = [syn for syn in objSynsets if obj in syn]
osyn2 = [syn for syn in objSynsets if obj2 in syn]

print (osyn == osyn2)  # False: the two objects are not in the same synonym group 

[('show', 'list')]
[('show', 'list')]
False
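
The comparison above can be wrapped into a small reusable check. This is a minimal sketch, assuming the lemmas have already been extracted as plain strings; `same_synset` and `same_intent` are hypothetical helpers, and the synonym lists are copied from the cell above under new snake_case names.

```python
def same_synset(word_a, word_b, synsets):
    """True if both words are identical or appear in the same synonym tuple."""
    if word_a == word_b:
        return True
    return any(word_a in syn and word_b in syn for syn in synsets)


verb_synsets = [("show", "list"), ("book", "make a reservation", "buy", "reserve")]
obj_synsets = [("meal", "food"), ("aircraft", "airplane", "plane")]


def same_intent(verb_a, obj_a, verb_b, obj_b):
    """Two utterances share an intent if both their verbs and their objects are synonyms."""
    return (same_synset(verb_a, verb_b, verb_synsets)
            and same_synset(obj_a, obj_b, obj_synsets))


print(same_intent("show", "aircraft", "list", "plane"))  # True
print(same_intent("show", "aircraft", "list", "meal"))   # False
```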


This method does not scale well, because we would need to build a very large list that tries to cover every word. Instead, we can use another method: word vectors.

### Using word vectors to recognize semantic similarity 

Unlike the synonyms dictionary approach, word vectors do not require us to maintain a large dictionary; we can use them directly to measure similarity.


In [72]:
# create two Doc objects 
doc = nlp("show me all aircrafts that cp uses")
doc2 = nlp("list all meals on my flight")

# extract the direct object and the verb of the first utterance 
for token in doc: 
    if token.dep_ == 'dobj': 
        obj = token  # direct object 
        verb = token.head  # transitive verb 
        break 
        
# do the same for the second utterance 
for token in doc2: 
    if token.dep_ == 'dobj': 
        obj2 = token  # direct object 
        verb2 = token.head  # transitive verb 
        break 

# calculate the semantic similarity between the two direct objects using their vectors 
print(obj.similarity(obj2))  # a low score means the objects are not related 

# calculate the similarity between the two verbs 
print(verb.similarity(verb2)) 

0.1502587
0.33161193
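
By default, spaCy's `similarity` is the cosine similarity of the two vectors. The sketch below shows that computation on tiny hand-made vectors; the three-dimensional vectors are invented for illustration and have nothing to do with the real `en_core_web_md` vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values, for illustration only)
show_vec = [0.9, 0.1, 0.3]
list_vec = [0.8, 0.2, 0.4]
meal_vec = [0.1, 0.9, 0.0]

print(cosine_similarity(show_vec, list_vec))  # close to 1: similar direction
print(cosine_similarity(show_vec, meal_vec))  # much lower: different direction
```

In practice, a threshold on this score decides whether two words count as similar; the right threshold depends on the vectors and the data, so it has to be chosen by inspecting real examples.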


## Putting it all together

In [76]:
# Extract entities

import spacy 
import en_core_web_md
from spacy.matcher import Matcher

nlp = en_core_web_md.load()
matcher = Matcher(nlp.vocab) 

pattern = [{"POS":"ADP"}, {"ENT_TYPE":"GPE"}]  # taking adposition 
matcher.add("prepositionLocation", [pattern])  # adding to matcher 

doc = nlp("show me flights from denver to philadelphia on tuesday")
matches = matcher(doc)

for mid, start, end in matches:
    print(doc[start:end])

from denver
to philadelphia


In [78]:
# Extract intent 

for token in doc: 
    if token.dep_ == 'dobj': 
        print(token.head.lemma_ + token.lemma_.capitalize()) 

showFlight


The final result is a complete semantic representation of this utterance:
its intent and its entities. This is a machine-readable and usable output.
We pass this result to the system component that made the call to the NLP
application to generate a response action.
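
One way to hand this result over is as a single dictionary. The sketch below assembles it from the strings printed above; the `build_semantic_frame` helper, the dictionary layout, and the mapping of from/to onto origin/destination are all assumptions made for illustration.

```python
def build_semantic_frame(intent, location_spans):
    """Combine an intent name and 'preposition city' spans into one dictionary."""
    # Assumption: 'from' marks the origin and 'to' marks the destination
    roles = {"from": "origin", "to": "destination"}
    entities = {}
    for span in location_spans:
        prep, _, city = span.partition(" ")
        role = roles.get(prep)
        if role:
            entities[role] = city
    return {"intent": intent, "entities": entities}


frame = build_semantic_frame("showFlight", ["from denver", "to philadelphia"])
print(frame)
# {'intent': 'showFlight', 'entities': {'origin': 'denver', 'destination': 'philadelphia'}}
```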
