# Creation of Intents

#### Loading Data

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv("question_responce.csv").drop(columns='Unnamed: 0')
data = data.drop_duplicates()
data.head(5)

Unnamed: 0,responce,question
0,@115904 We'll be sure to pass along your kind ...,@AmericanAir Erica on the lax team is amazing ...
1,@115904 Our apologies for the delay in respond...,@AmericanAir Could you have someone on your la...
2,"@115905 Aww, that's definitely a future pilot ...",Ben Tennyson and an American Airlines pilot. 🎃...
3,@115906 We're sorry for your frustration.,"@AmericanAir Right, but I earned those. I also..."
4,@115909 We're glad you got to kick back and en...,"Thank you, @AmericanAir for playing #ThisIsUs ..."


#### Preprocessing 

**Steps :**

- Tokenization: Split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
- Words that have fewer than 3 characters are removed.
- All stopwords are removed.
- All words starting by '@' and '#' are removed
- Words are lemmatized — words in third person are changed to first person and verbs in past and future tenses are changed into present.
- Words are stemmed — words are reduced to their root form.

In [3]:
import gensim
import nltk
import re
import numpy as np
import json
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.corpus import stopwords, wordnet
from nltk.stem.porter import *

lemmatizer = nltk.stem.WordNetLemmatizer()
stemmer = SnowballStemmer("english")

np.random.seed(2018)
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('words')
words = set(nltk.corpus.words.words())

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mathy\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mathy\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\mathy\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [4]:
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    text = " ".join(filter(lambda x:x[0]!='@', text.split()))
    text = " ".join(filter(lambda x:x[0]!='#', text.split()))
    text = " ".join(filter(lambda x:x[0:4]!='http', text.split()))
    text = " ".join(w for w in nltk.wordpunct_tokenize(text) \
        if w.lower() in words or not w.isalpha())
    for token in gensim.utils.simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [5]:
data['question_prepro'] = data['question'].apply(preprocess)
data['question_prepro']

0                                      [team, amaz, rais]
1                               [team, avail, guid, gate]
2                                                 [pilot]
3              [right, pass, spous, need, chang, program]
4                    [thank, great, flight, flight, home]
                              ...                        
1847       [transatlant, servic, abl, join, daili, scrum]
1848                              [averag, price, ticket]
1849                                 [month, claim, tell]
1850    [terribl, servic, wait, tri, number, go, respons]
1851                               [chang, time, airport]
Name: question_prepro, Length: 1852, dtype: object

We need to clean the responce too

In [6]:
def response_preprocess(text):
    text = " ".join(filter(lambda x:x[0]!='@', text.split()))
    text = " ".join(filter(lambda x:x[0]!='#', text.split()))
    return text

In [7]:
data['responce'] = data['responce'].apply(response_preprocess)
data['responce']

0            We'll be sure to pass along your kind words!
1       Our apologies for the delay in responding to y...
2       Aww, that's definitely a future pilot in the m...
3                       We're sorry for your frustration.
4       We're glad you got to kick back and enjoy a sh...
                              ...                        
1847    We know staying connected is important, why no...
1848    We've capped our fares for nonstop flights at ...
1849    Please give our Baggage team a call at 800-866...
1850    Our apologies for the hold. Our Central Baggag...
1851    We're providing waivers for St Croix, Gillian....
Name: responce, Length: 1852, dtype: object

Drop empty questions

In [8]:
data = data[data['question_prepro'].map(lambda d: len(d)) > 1]
data

Unnamed: 0,responce,question,question_prepro
0,We'll be sure to pass along your kind words!,@AmericanAir Erica on the lax team is amazing ...,"[team, amaz, rais]"
1,Our apologies for the delay in responding to y...,@AmericanAir Could you have someone on your la...,"[team, avail, guid, gate]"
3,We're sorry for your frustration.,"@AmericanAir Right, but I earned those. I also...","[right, pass, spous, need, chang, program]"
4,We're glad you got to kick back and enjoy a sh...,"Thank you, @AmericanAir for playing #ThisIsUs ...","[thank, great, flight, flight, home]"
5,We never want your experience to be anything l...,@AmericanAir's wifi makes Amtrak's wifi look p...,"[look, their, free, decent, reason]"
...,...,...,...
1847,"We know staying connected is important, why no...",@AmericanAir and @172 have nailed in the trans...,"[transatlant, servic, abl, join, daili, scrum]"
1848,We've capped our fares for nonstop flights at ...,@AmericanAir Average price of ticket out: $250...,"[averag, price, ticket]"
1849,Please give our Baggage team a call at 800-866...,@AmericanAir Really annoyed been over a month ...,"[month, claim, tell]"
1850,Our apologies for the hold. Our Central Baggag...,@AmericanAir terrible service wait ages trying...,"[terribl, servic, wait, tri, number, go, respons]"


#### Building LDA

In [9]:
import gensim.corpora as corpora
from pprint import pprint

In [10]:
id2word = corpora.Dictionary(data['question_prepro'])
id2word.filter_extremes(no_below=5)
corpus = [id2word.doc2bow(text) for text in data['question_prepro']]

In [30]:
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=90)

In [31]:
lda_model[corpus[0]]

[(10, 0.67036855)]

#### Try to extract topic from new text 

In [32]:
from operator import itemgetter

In [33]:
def get_highest_topic(text):
  text_cleaned = preprocess(text)
  result = lda_model[id2word.doc2bow(text_cleaned)]
  #print("Max topic topic : ", max(result, key=itemgetter(1))[0], 
  #     "\nWith : ", max(result, key=itemgetter(1))[1])
  return max(result, key=itemgetter(1))[0]

get_highest_topic('Ben Tennyson  Airlines pilot.')


51

In [34]:
data['topic'] = data['question'].apply(get_highest_topic)
data.head(5)

Unnamed: 0,responce,question,question_prepro,topic
0,We'll be sure to pass along your kind words!,@AmericanAir Erica on the lax team is amazing ...,"[team, amaz, rais]",10
1,Our apologies for the delay in responding to y...,@AmericanAir Could you have someone on your la...,"[team, avail, guid, gate]",69
3,We're sorry for your frustration.,"@AmericanAir Right, but I earned those. I also...","[right, pass, spous, need, chang, program]",21
4,We're glad you got to kick back and enjoy a sh...,"Thank you, @AmericanAir for playing #ThisIsUs ...","[thank, great, flight, flight, home]",4
5,We never want your experience to be anything l...,@AmericanAir's wifi makes Amtrak's wifi look p...,"[look, their, free, decent, reason]",88


#### Building Intents

In [35]:
intents = {"Greeting": ["Hello", "How are you doing?", "Greetings!", "How do you do?"],
          "Apology": ["No problem"],
          "Thanks": ["No problem", "You're welcome"],
          "Goodbye": ["It was nice speaking to you", "See you later", "Speak soon!"]
             }

In [36]:
def update_intents(df):
  for index, row in df.iterrows():
    if str(row['topic']) not in intents :
      intents[str(row['topic'])] = []
    if row['responce'] != "":
      intents[str(row['topic'])].append(str(row['responce']))

In [37]:
update_intents(data)

Save Intents in order to use it again

In [38]:
with open('intents.json', 'w') as fp:
    json.dump(intents, fp)

#### Doc2Vec

In [39]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [40]:
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(data['responce'])]
model = Doc2Vec(tagged_data, window=1, min_count=2, epochs = 200)

#### First Results 

In [41]:
def chatbot(text):
  result = model.docvecs.most_similar(positive=[model.infer_vector(preprocess(text))], topn=1800)
  topic = get_highest_topic('text')
  for l in result:
    if data['topic'].iloc[int(l[0])] == topic:
      return data['responce'].iloc[int(l[0])]

In [42]:
list_word = ["@AmericanAir what’s going on with flight 301 DTW &gt; DFW? Delayed over 3 hours. Will it be canceled?", 
             "@AmericanAir awful service, more than 1 hour delay due to logbook maintenance issues. Where is preventive maintenance?",
             "@AmericanAir Yes. Refund my plane ticket😡 with your racist ass workers",
             "@AmericanAir I'm aware. Fits in overhead.This was a return trip. Had same luggage going. This was pre-security. Gate chck, ok, but Empty overheads on flt"]

list_resp = ["Looks like it has a maintenance delay and it's currently scheduled to take off at 10:10p.",
             "We do many checks pre and post departure. Please share your flight number if we can provide an update.",
             "We don't tolerate discrimination of any kind. Please DM your record locator and contact details (phone and email)",
             "We'd like to share this feedback. Please DM your record locator."]

for q in list_word:
  print("Prediction :", chatbot(q))

for r in list_resp:
  print("Expected :", r)

Prediction : Enjoying the view?
Prediction : Please follow and meet us in DMs with your record locator and we'll be happy to take a peek.
Prediction : Purchasing Main Cabin Extra or Preferred seats is an option offered in advance. Seats will be assigned once you arrive to the airport!
Prediction : Please follow and DM your record locator, Thomas. We'd like let our App Team know.
Expected : Looks like it has a maintenance delay and it's currently scheduled to take off at 10:10p.
Expected : We do many checks pre and post departure. Please share your flight number if we can provide an update.
Expected : We don't tolerate discrimination of any kind. Please DM your record locator and contact details (phone and email)
Expected : We'd like to share this feedback. Please DM your record locator.


  


In [49]:
question = input('Enter your question : ')
chatbot(question)

  


'Fabulous indeed!'