## Flight Booking Data Exploration


In [49]:
import numpy as np
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import random
import IPython

In [93]:
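# flights dataset: one sentence per line
# the '/n' separator is not expected to occur in the text, so each line is read
# whole into a single column; a multi-character separator needs the python engine
df = pd.read_csv('flightdata.txt', sep='/n', header=None, engine='python')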
df.columns = ['sentence']
df.head()

  


                                             sentence
0  I want to book flight between Visakhapatnam an...
1  Any flights between Visakhapatnam and Mumbai o...
2  Book a flight to Visakhapatnam on May 31, 2018...
3  Show me the cheapest flights from Raipur to Co...
4  Show me the cheapest flights from Aizawl to Bh...


In [94]:
# sample sentences
df.sample(frac=0.05)['sentence']

1915    I want to check availability on Jet Airways on...
3119    Show me the cheapest flights from Indore to Na...
3450    Please help me in booking flight from Guwahati...
1670    Can you please show me flights from Ahmedabad ...
2614    Show me the cheapest flights from Bhubaneswar ...
4748    Show me the cheapest flights from Nagpur to Mu...
752     Please show me the cheapest flight from Pune t...
2457    Please show me the cheapest flight from Trivan...
2796    I want to book flight between Ranchi and Cochi...
2518    Please show me the cheapest flight from Cochin...
3097    Show me the cheapest flights from Bhubaneswar ...
4599    I want to check availability on Vistara on 28 ...
3038    Book a flight from Diu on April 28, 2018 after...
1303    Show me the cheapest flights from Bhubaneswar ...
2296    Show me the cheapest flights from Patna to Imp...
4118    Show me the cheapest flights from Cochin to Bh...
1164    Book a flight to Pune on 18 May 2018 or 2018-0...
111     I want

## Broad Approach

The objective is to write a program that takes a sentence as input and converts it into a semi-structured or structured format (such as a dict) containing fields such as from_destination, to_destination, date_of_travel, and airline_type.
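
To make the target concrete, here is a minimal sketch of what the structured output might look like for a sentence of the kind found in the dataset. The field names loosely follow the ones listed above; the `FlightQuery` class and the choice of a list for the dates are illustrative assumptions, not something the notebook defines.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class FlightQuery:
    """Hypothetical container for the fields we hope to extract."""
    from_destination: Optional[str] = None
    to_destination: Optional[str] = None
    # a sentence can offer several alternative dates, so keep a list
    date_of_travel: List[str] = field(default_factory=list)
    airline_type: Optional[str] = None


# Hand-filled example for the sentence:
# "Show me the cheapest flights from Jammu to Bengaluru on 27 May or 2018-05-28 or 2018-05-29."
example = FlightQuery(
    from_destination="Jammu",
    to_destination="Bengaluru",
    date_of_travel=["27 May", "2018-05-28", "2018-05-29"],
)

# asdict gives the plain dict form mentioned above
print(asdict(example))
```

Fields that a sentence does not mention (here, the airline) simply stay `None`, which keeps the output shape the same for every input.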

We'll first do two basic preprocessing tasks for each input sentence: tokenization and POS tagging. After preprocessing, we'll identify named entities such as locations (cities), dates, amounts of money, and airlines. Once the named entities are identified, we can parse the sentence and try to extract relations between them (from X to Y on date Z, etc.).
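
As a rough illustration of the relation-extraction step, the sketch below reads the origin and destination off a list of POS-tagged tokens. It is only one possible approach: `extract_route` is a hypothetical helper, it handles only the "from X to Y" pattern (not "between X and Y"), and it assumes city names are single tokens tagged `NNP`.

```python
from typing import Dict, List, Optional, Tuple

TaggedToken = Tuple[str, str]


def extract_route(tagged: List[TaggedToken]) -> Dict[str, Optional[str]]:
    """Pick the proper noun that directly follows 'from' or 'to'."""
    route: Dict[str, Optional[str]] = {
        "from_destination": None,
        "to_destination": None,
    }
    # walk over consecutive token pairs: (current token, next token)
    for (word, _), (next_word, next_tag) in zip(tagged, tagged[1:]):
        if next_tag != "NNP":
            continue
        if word.lower() == "from":
            route["from_destination"] = next_word
        elif word.lower() == "to":
            route["to_destination"] = next_word
    return route


# Tagged tokens in the (word, tag) format that nltk.pos_tag returns
tagged_example = [
    ("Show", "VB"), ("me", "PRP"), ("the", "DT"), ("cheapest", "JJS"),
    ("flights", "NNS"), ("from", "IN"), ("Jammu", "NNP"), ("to", "TO"),
    ("Bengaluru", "NNP"), ("on", "IN"), ("27", "CD"), ("May", "NNP"),
    ("or", "CC"), ("2018-05-28", "CD"), ("or", "CC"), ("2018-05-29", "CD"),
    (".", "."),
]

print(extract_route(tagged_example))
# {'from_destination': 'Jammu', 'to_destination': 'Bengaluru'}
```

Relying on the `NNP` tag is fragile, since the tagger can mislabel a city name; checking tokens against a list of known cities would likely be more robust.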

In [52]:
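# preprocessing function: tokenize a sentence and POS-tag each token
# (requires the NLTK tokenizer and tagger data to be downloaded)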
def preprocess(sentence):
    words = nltk.word_tokenize(sentence)
    tagged_words = nltk.pos_tag(words)
    return tagged_words

In [101]:
# preprocess one randomly chosen sentence
tagged_sent = preprocess(df.loc[random.randrange(len(df.index)), 'sentence'])
tagged_sent

[('Show', 'VB'),
 ('me', 'PRP'),
 ('the', 'DT'),
 ('cheapest', 'JJS'),
 ('flights', 'NNS'),
 ('from', 'IN'),
 ('Jammu', 'NNP'),
 ('to', 'TO'),
 ('Bengaluru', 'NNP'),
 ('on', 'IN'),
 ('27', 'CD'),
 ('May', 'NNP'),
 ('or', 'CC'),
 ('2018-05-28', 'CD'),
 ('or', 'CC'),
 ('2018-05-29', 'CD'),
 ('.', '.')]

In [102]:
' '.join(['{0}/{1}'.format(s[0], s[1]) for s in tagged_sent])

'Show/VB me/PRP the/DT cheapest/JJS flights/NNS from/IN Jammu/NNP to/TO Bengaluru/NNP on/IN 27/CD May/NNP or/CC 2018-05-28/CD or/CC 2018-05-29/CD ./.'

## Chunking

In [90]:
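# try extracting the dates using a chunk grammar:
# a preposition (IN), one or more numbers (CD), then an optional
# month name (NNP), number, comma, and number
grammar = 'Date Chunk: {<IN><CD>+<NNP>?<CD>?<,>?<CD>?}'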
cp = nltk.RegexpParser(grammar)

# choose a random sentence s
s = df.loc[random.randrange(len(df.index)), 'sentence']
chunks = cp.parse(preprocess(s))

In [91]:
print(chunks)

(S
  Book/VB
  a/DT
  flight/NN
  to/TO
  Guwahati/VB
  (Date Chunk on/IN 28/CD May/NNP)
  or/CC
  2018-05-29/JJ
  (Date Chunk between/IN 15/CD)
  hours/NNS
  and/CC
  23/CD
  hours/NNS
  ./.)
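
The chunk output above shows two weaknesses of the grammar: the ISO date `2018-05-29` is tagged `JJ` and missed, while `between 15` (the start of an hour range) is picked up as a date. One possible complement, sketched here, is to match date surface forms with a regular expression on the raw sentence. The patterns cover only the three formats visible in the samples ("28 May", "May 31, 2018", "2018-05-29"); `DATE_PATTERN` and `find_dates` are hypothetical names.

```python
import re
from typing import List

MONTHS = (
    r"(?:January|February|March|April|May|June|July|"
    r"August|September|October|November|December)"
)

# Alternatives are tried in order at each position in the sentence
DATE_PATTERN = re.compile("|".join([
    r"\b\d{4}-\d{2}-\d{2}\b",                          # 2018-05-29
    r"\b" + MONTHS + r"\s+\d{1,2}(?:,\s*\d{4})?\b",    # May 31, 2018
    r"\b\d{1,2}\s+" + MONTHS + r"(?:\s+\d{4})?\b",     # 28 May, 18 May 2018
]))


def find_dates(sentence: str) -> List[str]:
    """Return every date-like substring, in order of appearance."""
    return DATE_PATTERN.findall(sentence)


sentence = (
    "Book a flight to Guwahati on 28 May or 2018-05-29 "
    "between 15 hours and 23 hours."
)
print(find_dates(sentence))
# ['28 May', '2018-05-29']
```

Because the hour values are not followed by a month name, they are left alone and could be handled by a separate time pattern.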


In [92]:
# draw the chunk tree (opens in a separate window)
chunks.draw()

In [58]:
# from nltk.tag import StanfordPOSTagger
# from nltk.tag.stanford import CoreNLPPOSTagger, CoreNLPNERTagger

In [59]:
# c = CoreNLPPOSTagger()