## Flight Booking Data Exploration


In [None]:
# Overall TODO list:
# The general process of named entity recognition is as follows:
## - Use high precision rules to extract unambiguous entities e.g. 
## - Use application specific name lists such as airlines, cities etc.
## - Finally apply probabilistic models such as CRFs, HMMs etc.

In [588]:
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import random, datetime

In [589]:
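# flights dataset: one query per line
# NOTE: '/n' is not the newline escape; it is a two-character separator that is
# not expected to occur in the text, so each line is read into a single column
df = pd.read_csv('flightdata.txt', sep='/n', header=None)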
df.columns = ['sentence']
df.head()

  


| | sentence |
|---|---|
| 0 | Show me the cheapest flights from Raipur to Ra... |
| 1 | Book a flight to Patna on 17 May 2018 or 18 Ma... |
| 2 | Book a flight to Imphal on May 14, 2018 or May... |
| 3 | I want to check availability on SpiceJet on Ma... |
| 4 | Any flights between Calicut and Gaya on 15 May... |


In [590]:
# sample sentences
df.sample(frac=0.05)['sentence']

4413    Any flights between Nagpur and Pune on May 18,...
3516    Book a flight from Bhubaneswar on 17 May 2018 ...
954     Show me the cheapest flights from Pune to Raip...
548     Show me the cheapest flights from Bengaluru to...
771     Book a flight from Imphal on 27 May after 17 h...
3549    I want to check availability on Air India on M...
1616    Any flights between Nagpur and Shillong on 04 ...
49      Book a flight to Aizawl on 04 June or 05 June ...
2463    Book a flight from Patna on May 05, 2018 after...
659     Can you please show me flights from Visakhapat...
2179    Book a flight to Guwahati on May 20, 2018 or M...
1218    Show me the cheapest flights from Patna to Guw...
4613    Show me the cheapest flights from Patna to Shi...
645     Book a flight to Gaya on May 17, 2018 or May 1...
2419    Book a flight to Diu on May 27, 2018 or May 28...
3184    Please help me in booking flight from Guwahati...
3512    Show me the cheapest flights from New Delhi to...
4264    I want

## Broad Approach

The objective is to write a program that takes a sentence as input and converts it into a semi-structured/structured format (such as a dict) containing fields such as from_destination, to_destination, date_of_travel, airline_type etc.

We'll first do two basic preprocessing tasks for each input sentence: tokenization and POS tagging. After preprocessing, we'll identify named entities such as locations (cities), dates, amounts/money, flight companies etc. Once the named entities are identified, we can parse the sentence and try extracting relations between entities (from X to Y on date Z etc.).

In [591]:
# preprocessing function
def preprocess(sentence):
    words = nltk.word_tokenize(sentence)
    tagged_words = nltk.pos_tag(words)
    return tagged_words

In [593]:
# preprocessing a sample sentence
# (index 67 is fixed rather than drawn at random because it shows a tagging error)
i = 67
tagged_sent = preprocess(df.loc[i, 'sentence'])
tagged_sent

[('Show', 'VB'),
 ('me', 'PRP'),
 ('the', 'DT'),
 ('cheapest', 'JJS'),
 ('flights', 'NNS'),
 ('from', 'IN'),
 ('Pune', 'NNP'),
 ('to', 'TO'),
 ('Bengaluru', 'VB'),
 ('on', 'IN'),
 ('June', 'NNP'),
 ('03', 'CD'),
 (',', ','),
 ('2018', 'CD'),
 ('.', '.')]

In [530]:
# example sentence tagging city_2 as VB
# i = 67

In [594]:
# printing in string format
' '.join(['{0}/{1}'.format(s[0], s[1]) for s in tagged_sent])

'Show/VB me/PRP the/DT cheapest/JJS flights/NNS from/IN Pune/NNP to/TO Bengaluru/VB on/IN June/NNP 03/CD ,/, 2018/CD ./.'

### Identifying Dates, Cities, Price Range etc.

Next, we need to identify important categories of information, such as the date and time of travel, to and from destination cities, price constraints (if any), airline provider (if any), etc. A fully specified dictionary for the following query will look something like this:

**Example query:** <br>*I want to book a flight from patna to bangalore between 21 May and 23 May after 5 PM on either Air India or Jet Airways*.

Note that we are assuming that the current year is 2018 and that *after 5 PM* refers to the departure time from city_1 (not arrival time at city_2).

In [532]:
# sample fully specified dict
sample_dict = {'from_location': 'patna', 
               'to_location': 'bengaluru', 
               'provider': ['airindia', 'jet'],
               'depart_day': ['21-05-2018', '23-05-2018'],
               'skip_day': None,
               'depart_time_after': 1700,
               'depart_time_before': None
              }

## Regular Expressions Based Approach



In [613]:
# TODO

# - dictionary
#      - dates: ['today', 'tomorrow']
#      - cities: ['patna', 'bengaluru']
#      - airlines: [...]
#      - time: [3 pm, noon]
#      - money: [INR xyz]
    
# - search regexes, e.g. from source_x to dest_y, on date xyz etc.
# - INR/Rs followed by CD is money
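
The ideas in the TODO above can be sketched with the standard `re` module. This is only a minimal illustration under stated assumptions, not the notebook's final implementation: the short `CITIES` list, the `max_price` field and the helper name `extract_with_regex` are all hypothetical and introduced for the example.

```python
import re

# Hypothetical lookup list; a real system would load a full list of cities
CITIES = ['new delhi', 'patna', 'bengaluru', 'pune', 'imphal']

# Longest names first so that multi-word names are tried before shorter ones
city_alt = '|'.join(re.escape(c) for c in sorted(CITIES, key=len, reverse=True))

FROM_RE = re.compile(r'\bfrom\s+(' + city_alt + r')\b', re.IGNORECASE)
TO_RE = re.compile(r'\bto\s+(' + city_alt + r')\b', re.IGNORECASE)
# 'INR' or 'Rs' (optionally 'Rs.') followed by a number is treated as money
MONEY_RE = re.compile(r'\b(?:INR|Rs)\.?\s*(\d[\d,]*)', re.IGNORECASE)


def extract_with_regex(sentence):
    """Return the fields that the simple patterns above can find."""
    result = {'from_location': None, 'to_location': None, 'max_price': None}
    m = FROM_RE.search(sentence)
    if m:
        result['from_location'] = m.group(1).lower()
    m = TO_RE.search(sentence)
    if m:
        result['to_location'] = m.group(1).lower()
    m = MONEY_RE.search(sentence)
    if m:
        result['max_price'] = int(m.group(1).replace(',', ''))
    return result


print(extract_with_regex('Show me flights from New Delhi to Patna up to INR 4,000.'))
```

Because the `to` pattern only matches when a known city follows, the *to* in *up to INR 4,000* is not mistaken for a destination. The cost of this approach is that every city has to be present in the lookup list.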

## Chunking Based Approach

Chunking is used to identify meaningful groups (or chunks) of a sentence and is commonly used for entity recognition. 

In our example, we would want to identify **date chunks** (e.g. on Feb 21, 2018), or **noun (source/destination city) chunks** (e.g. from patna to mumbai) etc. 

It is sometimes also called shallow parsing, since we are only interested in identifying noun or verb phrases, but not necessarily in knowing that the noun phrase is the subject of the sentence.

The following link describes chunking briefly.

- <a href="https://stackoverflow.com/questions/1598940/in-natural-language-processing-what-is-the-purpose-of-chunking?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa">A brief description of chunking (SO answer)</a>

The NLTK book explains chunking in detail (a highly recommended read before moving on):
- <a href="https://www.safaribooksonline.com/library/view/natural-language-processing/9780596803346/ch07s02.html">NLTK book: Information Extraction using Chunking</a>

### Extracting Dates

Let's first look at extracting dates from the queries. We'll use 'chunking' to do that. Chunks are defined as (regex-like) patterns over the POS tags. For example, look at the following sentences:

In [595]:
# date example-1 (the sentence at index 1621)
tagged_sent = preprocess(df.loc[1621, 'sentence'])
' '.join(['{0}/{1}'.format(s[0], s[1]) for s in tagged_sent])

'Show/VB me/PRP the/DT cheapest/JJS flights/NNS from/IN Imphal/NNP to/TO Patna/VB on/IN 16/CD May/NNP ./.'

The date chunk in the above query is ```on/IN 16/CD May/NNP```. We can extract the date chunk by defining a chunk pattern as follows: *an optional preposition IN followed by a cardinal number CD followed by a proper noun NNP*.


In [596]:
# defining a simple date chunk
# try extracting the dates using a chunk grammar
grammar = 'date_chunk: {<IN>?<CD><NNP>}'
cp = nltk.RegexpParser(grammar)

# choose a sentence s
s = df.loc[1621, 'sentence']
chunks = cp.parse(preprocess(s))
print(chunks)

(S
  Show/VB
  me/PRP
  the/DT
  cheapest/JJS
  flights/NNS
  from/IN
  Imphal/NNP
  to/TO
  Patna/VB
  (date_chunk on/IN 16/CD May/NNP)
  ./.)


This works on the given query because it contains a single date in the simple *day month* form. It is not sufficient for queries such as the ones below, which contain several dates or a year.

In [597]:
# date example-2: the query at index 669
tagged_sent = preprocess(df.loc[669, 'sentence'])
' '.join(['{0}/{1}'.format(s[0], s[1]) for s in tagged_sent])

'Please/VB help/VB me/PRP in/IN booking/VBG flight/NN from/IN Cochin/NNP to/TO Bhopal/NNP on/IN either/CC AirAsia/NNP India/NNP or/CC SpiceJet/NNP between/IN 01/CD June/NNP 2018/CD and/CC 25/CD June/NNP 2018/CD ./.'

In the sentence above, the date chunk we're interested in is ```between/IN 01/CD June/NNP 2018/CD and/CC 25/CD June/NNP 2018/CD```. Another similar example is given below:

In [598]:
# date example-3
tagged_sent = preprocess(df.loc[3334, 'sentence'])
' '.join(['{0}/{1}'.format(s[0], s[1]) for s in tagged_sent])

'Book/VB a/DT flight/NN to/TO Calicut/VB on/IN 15/CD May/NNP or/CC 16/CD May/NNP before/IN 17/CD hours/NNS ./.'

In this sentence, there are two date chunks of interest, ```on/IN 15/CD May/NNP``` and ```16/CD May/NNP```, joined by ```or/CC```.

Also, since dates can be written both as *28 May, 2018* and as *28 May 2018*, the grammar should allow an optional comma (note that the POS tag of a comma is the comma itself). The order of CD and NNP can also be reversed (e.g. *May 20, 2018* or *20 May, 2018*).

The grammar covering these cases can be defined as follows:

In [599]:
# modifying the grammar further
# put an optional comma
# include May 20, 2018 and 20 May, 2018
grammar = r'''
date_chunk: {<IN>?<CD><NNP><,>?<CD>?}   # e.g. on 28 May, 2018
            {<IN>?<NNP><CD><,>?<CD>?}   # e.g. on May 28, 2018            
            '''

cp = nltk.RegexpParser(grammar)

# Note that May 28, 2018 is also correctly parsed now
s = df.loc[1846, 'sentence']
chunks = cp.parse(preprocess(s))
print(chunks)

(S
  I/PRP
  want/VBP
  to/TO
  check/VB
  availability/NN
  on/IN
  IndiGo/NNP
  (date_chunk on/IN May/NNP 28/CD ,/, 2018/CD)
  for/IN
  flights/NNS
  from/IN
  Trivandrum/NNP
  to/TO
  Aizawl/NNP
  ./.)


In [600]:
# testing the grammar on some random sentences
i = random.randrange(len(df.index))
s = df.loc[i, 'sentence']
chunks = cp.parse(preprocess(s))
print(chunks)

(S
  Show/VB
  me/PRP
  the/DT
  cheapest/JJS
  flights/NNS
  from/IN
  Raipur/NNP
  to/TO
  Visakhapatnam/NNP
  (date_chunk on/IN 18/CD May/NNP 2018/CD)
  or/CC
  (date_chunk 19/CD May/NNP 2018/CD)
  or/CC
  (date_chunk 20/CD May/NNP 2018/CD)
  ./.)
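
Once date chunks such as the ones above are extracted, their text still has to be converted into the `dd-mm-yyyy` strings used in the sample dict. The sketch below is one possible way to do that with the standard `datetime` module. The helper name `normalise_date` and the list of formats are assumptions made for illustration, and month names are assumed to be in English (`%B` and `%b` depend on the locale).

```python
import datetime

# Formats assumed for the queries: day-month-year and month-day-year,
# with full or abbreviated month names
DATE_FORMATS = ['%d %B %Y', '%B %d %Y', '%d %b %Y', '%b %d %Y']
DEFAULT_YEAR = 2018  # the notebook assumes the current year is 2018


def normalise_date(text):
    """Convert e.g. 'on May 28, 2018' or '16 May' to 'dd-mm-yyyy', else None."""
    parts = text.replace(',', ' ').split()
    # drop a leading preposition such as 'on'
    if parts and parts[0].lower() == 'on':
        parts = parts[1:]
    # a date without a year gets the default year appended before parsing
    if len(parts) == 2:
        parts.append(str(DEFAULT_YEAR))
    candidate = ' '.join(parts)
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.datetime.strptime(candidate, fmt)
        except ValueError:
            continue
        return parsed.strftime('%d-%m-%Y')
    return None


print(normalise_date('on May 28, 2018'))
print(normalise_date('16 May'))
print(normalise_date('INR 4000'))
```

Text that does not parse as a date returns `None`, so the same helper can double as a cheap sanity check on whatever the chunker labels as a date.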


The date grammar seems to work in most cases, apart from queries that use relative words such as *today* or *tomorrow*, which the grammar does not cover.

There are also some **false positives**, such as the one below, where the amount in ```up to INR xxx``` is chunked as a date:





In [601]:
# false positive example
i = 1216
s = df.loc[i, 'sentence']
chunks = cp.parse(preprocess(s))
print(chunks)

(S
  Any/DT
  flights/NNS
  between/IN
  Bhopal/NNP
  and/CC
  Imphal/NNP
  (date_chunk on/IN April/NNP 29/CD ,/, 2018/CD)
  up/IN
  to/TO
  (date_chunk INR/NNP 4000/CD)
  ./.)


In [None]:
# ideas:
## - NNP but not INR
## - NNP should not be followed by TO


In the query above, the phrase INR 4000 matches with the expression ```{<IN>?<NNP><CD><,>?<CD>?}```, so we need to find a way to avoid parsing INR 4000 as a date chunk.

One way to do that is based on the observation that *INR 4000* is usually preceded by a ```TO``` tag (often with an ```RB``` (adverb) tag before it), e.g. *up to INR 3000*, while a date chunk starts with an ```IN``` tag (but not ```TO```), e.g. *on May 28, 2018*.

So we can specify a new chunk rule that matches price phrases of the form *up to INR 3000* **before the date chunk rule**. The parser applies the rules in the order they are written, so the phrase is claimed by the price chunk first, and the date rules then only see the remaining tokens.

**Tip**: The ```trace = True``` argument lets us see the order in which the components of the grammar are parsed

In [602]:
# adding a price chunk before the date chunk
# In 'up to INR 3k', up can either be tagged as RB or IN
# put an optional comma
# include May 20, 2018 and 20 May, 2018
grammar = r'''
price_chunk: {<RB>?<IN>?<TO><NNP><CD>}  # price - up to INR 3000
             {<RB>?<IN>?<TO><CD><NNP>}  # price - up to 3000 INR

date_chunk: {<IN>?<CD><NNP><,>?<CD>?}   # date - e.g. on 28 May, 2018
            {<IN>?<NNP><CD><,>?<CD>?}   # date - e.g. on May 28, 2018            
            '''

cp = nltk.RegexpParser(grammar)


# trace = 1 argument lets us see the order in which the components of 
# the grammar are parsed
s = df.loc[2005, 'sentence']
chunks = cp.parse(preprocess(s), trace=True)
print(chunks)

# Input:
 <DT>  <NNS>  <IN>  <NNP>  <CC>  <NNP>  <IN>  <CD>  <NNP>  <RB>  <TO>  <NNP>  <CD>  <.> 
# price - up to INR 3000:
 <DT>  <NNS>  <IN>  <NNP>  <CC>  <NNP>  <IN>  <CD>  <NNP> {<RB>  <TO>  <NNP>  <CD>} <.> 
# price - up to 3000 INR:
 <DT>  <NNS>  <IN>  <NNP>  <CC>  <NNP>  <IN>  <CD>  <NNP> {<RB>  <TO>  <NNP>  <CD>} <.> 
# Input:
 <DT>  <NNS>  <IN>  <NNP>  <CC>  <NNP>  <IN>  <CD>  <NNP>  <price_chunk>  <.> 
# date - e.g. on 28 May, 2018:
 <DT>  <NNS>  <IN>  <NNP>  <CC>  <NNP> {<IN>  <CD>  <NNP>} <price_chunk>  <.> 
# date - e.g. on May 28, 2018:
 <DT>  <NNS>  <IN>  <NNP>  <CC>  <NNP> {<IN>  <CD>  <NNP>} <price_chunk>  <.> 
(S
  Any/DT
  flights/NNS
  between/IN
  Dimapur/NNP
  and/CC
  Mumbai/NNP
  (date_chunk on/IN 08/CD May/NNP)
  (price_chunk up/RB to/TO INR/NNP 7500/CD)
  ./.)


The trace above shows that the price chunk is matched first (it matches ```{<RB>  <TO>  <NNP>  <CD>}```), and then the date chunk is matched.
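
Once a sentence has been chunked, the text of each chunk can be read back from the resulting tree. The sketch below is a minimal illustration: the tree is a hand-built copy of the parse printed above (so no tagger is needed), and the helper name `chunk_texts` is hypothetical.

```python
from nltk import Tree

# Hand-built copy of the parse printed above
tree = Tree('S', [
    ('Any', 'DT'), ('flights', 'NNS'), ('between', 'IN'),
    ('Dimapur', 'NNP'), ('and', 'CC'), ('Mumbai', 'NNP'),
    Tree('date_chunk', [('on', 'IN'), ('08', 'CD'), ('May', 'NNP')]),
    Tree('price_chunk', [('up', 'RB'), ('to', 'TO'), ('INR', 'NNP'), ('7500', 'CD')]),
    ('.', '.'),
])


def chunk_texts(tree, label):
    """Return the words of every subtree with the given label, joined by spaces."""
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == label)]


print(chunk_texts(tree, 'date_chunk'))   # ['on 08 May']
print(chunk_texts(tree, 'price_chunk'))  # ['up to INR 7500']
```

The same helper works on the output of `cp.parse(...)`, since that is also an `nltk.Tree` whose chunks are labelled subtrees.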

In [606]:
# testing the new grammar on some random sentences
i = random.randrange(len(df.index))
s = df.loc[i, 'sentence']
chunks = cp.parse(preprocess(s))
print(chunks)

(S
  Can/MD
  you/PRP
  please/VB
  show/VB
  me/PRP
  flights/NNS
  from/IN
  New/NNP
  Delhi/NNP
  to/TO
  Pondicherry/NNP
  tomorrow/NN
  ?/.
  I/PRP
  want/VBP
  to/TO
  avoid/VB
  early/JJ
  morning/NN
  flights/NNS
  ./.)


In [542]:
## Date Chunks TODO list

# today: 242
# tomorrow: 2208
# false positive INR 3500: i=348, 1216

# Ashish:
## today/tomorrow: 
## early morning: 0400-1000


### Extracting Source and Destination Cities

Let's now try extracting the source and destination cities. The most typical phrase is ```from city_1 to city_2```, though both city_1 and city_2 are optional (some queries specify only the source or only the destination).


Note that we need to **specify an optional NNP** in the pattern, ```<NNP><NNP>?```, to capture two-word names such as ```New/NNP Delhi/NNP```.

In [607]:
# modifying the grammar further
grammar = r'''
price_chunk: {<RB>?<IN>?<TO><NNP><CD>}  # price - up to INR 3000
             {<RB>?<IN>?<TO><CD><NNP>}  # price - up to 3000 INR

date_chunk: {<IN>?<CD><NNP><,>?<CD>?}   # e.g. on 28 May, 2018
            {<IN>?<NNP><CD><,>?<CD>?}   # e.g. on May 28, 2018  

cities: {<IN><NNP><NNP>?<TO><NNP><NNP>?} # from city_1 to city_2
        {<IN><NNP><NNP>?<CC><NNP><NNP>?} # between city_1 and city_2
        {<IN><NNP><NNP>?}                # from city_1
        {<TO><NNP><NNP>?}                # to city_2
        
            
            '''
cp = nltk.RegexpParser(grammar)

In [612]:
# parse a chosen sentence (index 1992 shows a false positive; use
# random.randrange(len(df.index)) to pick one at random instead)
i = 1992
s = df.loc[i, 'sentence']
chunks = cp.parse(preprocess(s))
print(chunks)

(S
  I/PRP
  want/VBP
  to/TO
  check/VB
  availability/NN
  (cities on/IN Vistara/NNP)
  (date_chunk on/IN 02/CD May/NNP)
  for/IN
  flights/NNS
  (cities from/IN Ahmedabad/NNP to/TO Cochin/NNP)
  ./.)


In [547]:
# City chunks TODO  

# i = 1992 - false positive - (on vistara, on goair etc. are false positives)
# i = 4353  (to city_2 tagged as VB)

# ideas:
## - for false positives such as on goair: we can just lookup a list of providers 
##   to check if the phrase contains a name such as goair, airindia etc.
airlines = ['vistara', 'air india', 'jet airways', 
            'indigo', 'spice jet', 'go air', 'air asia',
            'air asia india', 'air deccan']
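
The lookup idea above could be sketched as follows. This is an illustration only: the `AIRLINES` set (written without spaces so that tokens such as *GoAir* or *Air India* compare equal after normalisation) and the helper names `is_airline` and `split_city_chunk` are assumptions, and the helper only handles single-preposition chunks such as *on Vistara* or *from Ahmedabad*.

```python
# Hypothetical provider set, stored without spaces and in lower case
AIRLINES = {'vistara', 'airindia', 'jetairways', 'indigo',
            'spicejet', 'goair', 'airasia', 'airasiaindia', 'airdeccan'}


def is_airline(words):
    """True if the words, joined and lower-cased, name a known airline."""
    return ''.join(words).replace(' ', '').lower() in AIRLINES


def split_city_chunk(chunk_words):
    """Classify a chunk such as ['on', 'Vistara'] as a provider or a city.

    The first word is assumed to be the preposition (on/from/to).
    """
    name = chunk_words[1:]
    kind = 'provider' if is_airline(name) else 'city'
    return kind, ' '.join(name)


print(split_city_chunk(['on', 'Vistara']))
print(split_city_chunk(['from', 'Ahmedabad']))
```

Stripping spaces before the comparison means that *Air India*, *AirIndia* and *air india* all map to the same entry, so the provider list does not need one spelling per variant.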
