# Project Needs

1. Understand how to use NER (Named Entity Recognition)
  * Will need this to pull out places and times of trips
  * Also need to develop way to extract the customers desired price range 
2. Work on building a basic chatbot in python to help build understanding
3. Figure out how text can be scraped from websites for these purposes.
4. Pull some training data from different websites. Examples include:
  * Twitter/Instagram/Facebook
  * Wikipedia
  * Travel websites
5. Will need to find some sort of database from which the chat bot can offer solutions and trips to customers for the information given.

Need some APIs for this
1. SpaCy
2. Dialogue Flow
3. QA Net
4. Gensim

Try to find some pre-tuned model weights. 



# Start with exploring NLP

In [19]:
!pip install --upgrade nltk

Collecting nltk
[?25l  Downloading https://files.pythonhosted.org/packages/92/75/ce35194d8e3022203cca0d2f896dbb88689f9b3fce8e9f9cff942913519d/nltk-3.5.zip (1.4MB)
[K     |████████████████████████████████| 1.4MB 3.3MB/s 
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py) ... [?25l[?25hdone
  Created wheel for nltk: filename=nltk-3.5-cp36-none-any.whl size=1434676 sha256=00cc6e7467dd42e83d4d645dc2df0ba20ae7d79c6dc9551399d1da14be769afe
  Stored in directory: /root/.cache/pip/wheels/ae/8c/3f/b1fe0ba04555b08b57ab52ab7f86023639a526d8bc8d384306
Successfully built nltk
Installing collected packages: nltk
  Found existing installation: nltk 3.2.5
    Uninstalling nltk-3.2.5:
      Successfully uninstalled nltk-3.2.5
Successfully installed nltk-3.5


In [0]:
import nltk

In [2]:
nltk.__version__

'3.5'

In [3]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [4]:
sentence = """At eight o'clock on Thursday morning Arthur didn't feel very good."""
# tokenize the sentence
tokens = nltk.word_tokenize(sentence)
tokens

['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 'Arthur',
 'did',
 "n't",
 'feel',
 'very',
 'good',
 '.']

In [5]:
tagged = nltk.pos_tag(tokens)
tagged

[('At', 'IN'),
 ('eight', 'CD'),
 ("o'clock", 'NN'),
 ('on', 'IN'),
 ('Thursday', 'NNP'),
 ('morning', 'NN'),
 ('Arthur', 'NNP'),
 ('did', 'VBD'),
 ("n't", 'RB'),
 ('feel', 'VB'),
 ('very', 'RB'),
 ('good', 'JJ'),
 ('.', '.')]

In [8]:
nltk.chunk.ne_chunk(tagged)

TclError: ignored

Tree('S', [('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), Tree('PERSON', [('Arthur', 'NNP')]), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')])

In [0]:
import spacy
from spacy import displacy
from collections import Counter
from pprint import pprint
import en_core_web_sm
nlp = en_core_web_sm.load()

In [17]:
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
doc2 = nlp(sentence)
pprint([(X.text, X.label_) for X in doc2.ents])

[("eight o'clock", 'TIME'),
 ('Thursday', 'DATE'),
 ('morning', 'TIME'),
 ('Arthur', 'PERSON')]


In [16]:
doc2.ents

(eight o'clock, Thursday, morning, Arthur)

In [21]:
query = 'I would like to go to toronto on thursday jUne 17th'
doc=nlp(query)
[(X.text, X.label_) for X in doc.ents]

[('toronto', 'GPE'), ('thursday jUne 17th', 'DATE')]

## Pasted chatbot from net

In [25]:
from nltk.chat.util import Chat, reflections
pairs = [
    [
        r"my name is (.*)",
        ["Hello %1, How are you today ?",]
    ],
     [
        r"what is your name ?",
        ["My name is Chatty and I'm a chatbot ?",]
    ],
    [
        r"how are you ?",
        ["I'm doing good\nHow about You ?",]
    ],
    [
        r"sorry (.*)",
        ["Its alright","Its OK, never mind",]
    ],
    [
        r"i'm (.*) doing good",
        ["Nice to hear that","Alright :)",]
    ],
    [
        r"hi|hey|hello",
        ["Hello", "Hey there",]
    ],
    [
        r"(.*) age?",
        ["I'm a computer program dude\nSeriously you are asking me this?",]
        
    ],
    [
        r"what (.*) want ?",
        ["Make me an offer I can't refuse",]
        
    ],
    [
        r"(.*) created ?",
        ["Nagesh created me using Python's NLTK library ","top secret ;)",]
    ],
    [
        r"(.*) (location|city) ?",
        ['Chennai, Tamil Nadu',]
    ],
    [
        r"how is weather in (.*)?",
        ["Weather in %1 is awesome like always","Too hot man here in %1","Too cold man here in %1","Never even heard about %1"]
    ],
    [
        r"i work in (.*)?",
        ["%1 is an Amazing company, I have heard about it. But they are in huge loss these days.",]
    ],
[
        r"(.*)raining in (.*)",
        ["No rain since last week here in %2","Damn its raining too much here in %2"]
    ],
    [
        r"how (.*) health(.*)",
        ["I'm a computer program, so I'm always healthy ",]
    ],
    [
        r"(.*) (sports|game) ?",
        ["I'm a very big fan of Football",]
    ],
    [
        r"who (.*) sportsperson ?",
        ["Messy","Ronaldo","Roony"]
],
    [
        r"who (.*) (moviestar|actor) ?",
        ["Brad Pitt"]
],
    [
        r"quit",
        ["BBye take care. See you soon :) ","It was nice talking to you. See you soon :)"]
],
]
def chatty():
    print("Hi, I'm Chatty and I chat alot ;)\nPlease type lowercase English language to start a conversation. Type quit to leave ") #default message at the start
    chat = Chat(pairs, reflections)
    chat.converse()
if __name__ == "__main__":
    chatty()

Hi, I'm Chatty and I chat alot ;)
Please type lowercase English language to start a conversation. Type quit to leave 
>who are you
None
>who you actor
Brad Pitt
>who you sports
I'm a very big fan of Football
>quit
It was nice talking to you. See you soon :)


# Beautiful Soup

In [0]:
import requests
from bs4 import BeautifulSoup

In [0]:
list_of_training_websites = [
                             'https://magazine.trivago.ca/future-travel-plans/',
                             'https://magazine.trivago.ca/lgbtq-family-vacations-canada/',
                             'https://magazine.trivago.ca/live-hotel-webcams-sofa-to-suite/',
                             'https://magazine.trivago.ca/best-family-friendly-resorts-mexico/',
                             'https://magazine.trivago.ca/puerto-rico-resorts/',
                             'https://magazine.trivago.ca/caribbean-best-all-inclusive-resorts/',
                             'https://magazine.trivago.ca/ontario-road-trip-thunder-bay-to-ottawa/',
                             'https://magazine.trivago.ca/victoria-to-golden-bc-road-trip/',
                             'https://magazine.trivago.ca/best-ski-resorts-us/',
                             'https://magazine.trivago.ca/best-family-vacations-newborn-babies/',
                             'https://magazine.trivago.ca/chicago-hotels/',
                             'https://magazine.trivago.ca/420-friendly-cities-us/',
                             'https://magazine.trivago.ca/luxury-hotels-in-toronto/',
                             'https://magazine.trivago.ca/family-hotels-nyc/',
                             'https://magazine.trivago.ca/boston-hotels/',
                             'https://magazine.trivago.ca/bc-lake-resorts/',
                             'https://magazine.trivago.ca/meditation-retreats/',
                             'https://magazine.trivago.ca/hawaii-honeymoon-hotels/',
                             'https://magazine.trivago.ca/romantic-hotels-london-england/',
                             'https://magazine.trivago.ca/paris-honeymoon/',
                             'https://magazine.trivago.ca/us-in-room-jacuzzi-hotels/',
                             'https://magazine.trivago.ca/romantic-getaways-montreal/',
                             'https://magazine.trivago.ca/private-pool-hotels-santorini/',
                             'https://magazine.trivago.ca/french-chateau-hotel/',
                             'https://www.facebook.com/WTGTravelGuide',
                             'https://twitter.com/search?q=world%20travel&src=typed_query&f=live',
                             'https://s3.amazonaws.com/my89public/quac/val_v0.2.json',
                            ]

In [0]:
# Make the request to a url
r = requests.get(list_of_training_websites[0])

# Create soup from content of request
c = r.content

soup = BeautifulSoup(c)

In [18]:
main_content = soup.find('div', attrs = {'class': 'container-wide article'})
main_content

<div class="container-wide article"><row class="performance-top-part"><column md="11" sm="11"><a class="destination-tag" href="/destination/asia/" target="_self">Asia</a><h1 class="post-title">Places We Wish to Go: 5 Places in Our Future Travel Plans</h1><div class="excerpt">From Kenya to Japan, find out what destinations we've been dreaming of here at trivago.</div></column></row><div class="post_content"><div></div></div><div class="post_content"><div><div class="intro-text performance-intro"><p></p><p>How many ways can you travel from home?</p><p></p><p></p><p>In one Facebook group, one of the participants suggested a game: click on a button in Google Maps to discover a new travel destination. Each person would then be taken to a random place on Earth without having to leave the house, just exploring satellite photos of the location. In virtual communities, people are posting photos of what they can see from their homes. Many have taken to social media to share tips on how to recrea

In [19]:
# Extract the relevant information
content = main_content.find('ul').text

import pprint
pprint.pprint(content)

AttributeError: ignored

# Import a textfile

In [0]:
import pandas as pd

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [0]:
# Importing User dialogues. 

df=pd.read_csv('/content/drive/My Drive/1000ml/Project 7 - Chatbot/Data/dialogueText.csv')
df2=pd.read_csv('/content/drive/My Drive/1000ml/Project 7 - Chatbot/Data/dialogueText_196.csv')
df3=pd.read_csv('/content/drive/My Drive/1000ml/Project 7 - Chatbot/Data/dialogueText_301.csv')

In [0]:
# Importing QA datasets for travel
for i in range(1,9):
  for j in range(1,5):
    df_temp=pd.read_csv(f'/content/drive/My Drive/1000ml/Project 7 - Chatbot/Data/{i}_{j}_align.csv')
    if (j==1&i==1):
      QA_df = df_temp
    else:
      QA_df=pd.concat([QA_df, df_temp], ignore_index=True)

In [24]:
QA_df['Text'].iloc[1]

'Hi, Could someone please confirm if CX 884 - HKG-LAX business class seats are truly lie flat seats? The aircraft code shows 773, travel agent says they are lie flat but Seatguru shows them as recliners. Help is appreciated. Cheers'

In [25]:
QA_df.head()

Unnamed: 0,Annotator A ID,Annotator B ID,Parition ID,Corpora ID,Sentence ID,Text,Annotator A Text,Annotator B Text,Length,Error,Alignment Score,Agreement
0,7,3.0,2.0,1.0,6507.0,What advantage is there in booking directly wi...,What advantage is there in booking directly wi...,What advantage is there in booking directly wi...,242.0,152.0,0.371901,0.0
1,7,3.0,2.0,1.0,6508.0,"Hi, Could someone please confirm if CX 884 - H...","Hi, Could someone please confirm if CX 884 - H...","Hi, Could someone please confirm if CX 884 - H...",230.0,130.0,0.434783,0.0
2,7,3.0,2.0,1.0,6509.0,I will be transiting Dubai soon en route to Oz...,I will be transiting Dubai soon en route to Oz...,[I will be transiting Dubai soon en route to O...,448.0,175.0,0.609375,1.0
3,7,3.0,2.0,1.0,6514.0,Does anyone know where I'd find estimated pric...,Does anyone know where I'd find estimated pric...,Does anyone know where I'd find estimated pric...,274.0,134.0,0.510949,1.0
4,7,3.0,2.0,1.0,6518.0,It's from BA and finds the cheapest BA flight ...,It's from BA and finds the cheapest BA flight ...,It's from BA and finds the cheapest BA flight ...,89.0,0.0,1.0,1.0


In [6]:
df.text.iloc[1]

'Did I choose a bad channel? I ask because you seem to be dumb like windows user'

In [11]:
len(df3)

16587830

In [17]:
df[df['dialogueID'].str.contains('126125')]

Unnamed: 0,folder,dialogueID,date,from,to,text
0,3,126125.tsv,2008-04-23T14:55:00.000Z,bad_image,,"Hello folks, please help me a bit with the fol..."
1,3,126125.tsv,2008-04-23T14:56:00.000Z,bad_image,,Did I choose a bad channel? I ask because you ...
2,3,126125.tsv,2008-04-23T14:57:00.000Z,lordleemo,bad_image,the second sentence is better english and we...


In [15]:
df.dtypes

folder         int64
dialogueID    object
date          object
from          object
to            object
text          object
dtype: object