<a href="https://colab.research.google.com/github/KalikaKay/Thinkful-Notebooks/blob/main/Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [20]:
#Setup dataframe and visualizations
import math
import pandas as pd
import numpy as np
from scipy.stats.mstats import winsorize
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import datetime 


#Figure set up for dark theme:
plt.style.use(['dark_background'])
#Color to set all my graphs.
color = '#F9EDF5'
sns.set()

#suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [21]:
#Natural Language Dating Cleaning Import 
from collections import Counter
import nltk
import spacy
import re

# Build a Chatbot

First, do some data preprocessing to clean up the data. You can use your solution to the assignment of the Text preprocessing checkpoint.

# Natural Language Processing (NLP) 
## Data Cleaning

*Note: Because the memory requirements of the datasets are relatively large, it's best to use Google Colaboratory for this assignment.*

Note: When parsing the data using spaCy, you may run into some memory issues, even in Google Colaboratory. If you're having memory issues, try parsing your text as follows:

```
nlp = spacy.load('en', disable=['parser', 'ner'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))
nlp.max_length = 20000000
doc = nlp(the_dialogs_come_here)
```


## Cornell Movie--Dialogs Corpus.

The first dataset is a dialogue dataset called Cornell Movie--Dialogs Corpus. This corpus includes conversations between the characters of more than 600 movies.

Apply the data preprocessing techniques that you learned here to the Cornell Movie--Dialogs Corpus data. 

You'll use this dataset when developing a chatbot in an upcoming checkpoint. 

Access the dataset from the Thinkful database using the following credentials:

 

```
 postgres_user = 'dsbc_student'
 postgres_pw = '7*.8G9QH21'
 postgres_host = '142.93.121.174'
 postgres_port = '5432'
 postgres_db = 'cornell_movie_dialogs'
# The data is in the table called dialogs.
```



In [22]:
from sqlalchemy import create_engine
def get_df(table, database=None):

  postgres_user = 'dsbc_student'
  postgres_pw = '7*.8G9QH21'
  postgres_host = '142.93.121.174'
  postgres_port = '5432'

  if database is None:
    database = table
  # INSERT WITH DB NAME
  postgres_db = database
  dt = table

  # ASSUMED SQL STATEMENT, UPDATE AS NECESSARY.
  sql = 'select * from ' + dt

  engine = create_engine('postgresql://{}:{}@{}:{}/{}'.format(
      postgres_user, postgres_pw, postgres_host, postgres_port, postgres_db))

  df = pd.read_sql_query(sql,con=engine)

  # no need for an open connection, 
  # as we're only doing a single query
  engine.dispose()

  return df

In [23]:
mov = get_df('dialogs', 'cornell_movie_dialogs')
mov.drop(columns='index', inplace=True)
mov.head()

Unnamed: 0,dialogs
0,Can we make this quick? Roxanne Korrine and A...
1,"Well, I thought we'd start with pronunciation,..."
2,Not the hacking and gagging and spitting part....
3,Okay... then how 'bout we try out some French ...
4,You're asking me out. That's so cute. What's ...


The steps below were pulled from [here](https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html). I was looking at - wanted to look at - a few alternative cleaning techniques. 

After taking a look at spacy's stop_word results (info not saved, just take my own word for it); I found that I prefer to use this cleaning method prior to running my data through the natural language processor. 

This function could be updated. 

In [36]:
#Clean the data frame
def cleaner(df, field, limit=0):
    "Extract relevant text from DataFrame using a regex"
    # Regex pattern for only alphanumeric, hyphenated text with 3 or more chars
    pattern = re.compile(r"[A-Za-z0-9\-]{3,50}")
    df['clean'] = df[field].str.findall(pattern).str.join(' ')
    if limit > 0:
        return df.iloc[:limit, :].copy()
    else:
        return df

In [37]:
# Utility function for standard text cleaning
def text_cleaner(text):
    type(text)
    # Visual inspection identifies a form of punctuation that spaCy does not
    # recognize: the double dash --.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    text = ' '.join(text.split())
    return text

In [38]:
nlp = spacy.load('en')

In [39]:
mov['clean'] = cleaner(mov, 'dialogs')

In [40]:
dialogs = []
#i = 1
print("Processing {} elements. start time: {}".format(mov['dialogs'].count(), datetime.datetime.now()))
for dialog in mov['dialogs']:
  #print("item {} start time: {}".format(i, datetime.datetime.now()))
  dialog = text_cleaner(dialog)
  dialogs.append(nlp(dialog))
  #i += 1
print("End time: {}".format(datetime.datetime.now()))

Processing 304446 elements. start time: 2020-12-21 21:38:01.020571
End time: 2020-12-21 22:22:37.778316


In [41]:
# Explore the objects that you've built.
print("Object Type: {}".format(type(dialogs)))
print("Object Length: {} ".format(len(dialogs)))
print("First Three: {}'".format(dialogs[:3]))
print("Elements Type: {}".format(type(dialogs[0])))

Object Type: <class 'list'>
Object Length: 304446 
First Three: [Can we make this quick? Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad. Again., Well, I thought we'd start with pronunciation, if that's okay with you., Not the hacking and gagging and spitting part. Please.]'
Elements Type: <class 'spacy.tokens.doc.Doc'>


Develop a chatbot using this corpus. In doing this, you're free to choose a chatbot development library like ChatterBot or write your own code from scratch.



In [49]:
import random
GREETING_INPUTS = ["hello", "hi", "greetings", "what's up","hey"]
GREETING_RESPONSES = ["hello", "hi", "hey", "hi there", "howdy"]
def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

In [50]:
def response(user_input, persuasion_sents):
    
    response = ""
    # Use spaCy to parse the user's input
    input_doc = nlp(user_input)
    # Then split it into sentences
    input_sents = [sent.text for sent in input_doc.sents]

    # Then append the user's sentence into your list of sentences
    for sentence in input_sents:
        persuasion_sents.append(sentence)
    
    # The next step is to vectorize your new corpus using TF-IDF
    TfidfVec = TfidfVectorizer(max_df=0.5, min_df=1, use_idf=True, norm=u'l2', smooth_idf=True, lowercase=False)
    tfidf = TfidfVec.fit_transform(persuasion_sents)
    
    # Remove the user's input from the corpus
    persuasion_sents.pop(-1)
    
    # Calculate the cosine similarity
    # between the user input and all of the other sentences in the corpus
    similarities = cosine_similarity(tfidf[-1], tfidf[:-1])
    # Get the index of most similar sentence
    idx = np.argmax(similarities)
        
    if(idx):
        response = response + persuasion_sents[idx]
        return response
    else:
        response = response + "I'm sorry! I don't know how to respond :("
        return response

In [69]:
vocabulary = []
for dialog in dialogs:
  if type(dialog) == str:
    vocabulary.append(dialog)
  else: 
    vocabulary.append(dialog.text)

In [70]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

print("The Dialogian: I will try to respond to you reasonably. If you want to exit, type bye.")

while(True):
    
    user_input = input("User: ")
    user_input=user_input.lower()
    
    if(user_input!='bye'):
        if(user_input == 'thanks' or user_input == 'thank you'):
            break
            print("Persuasion: You're welcome.")
        else:
            if(greeting(user_input) != None):
                print("Persuasion: " + greeting(user_input))
            else:
                print("Persuasion: ", end = "")
                print(response(user_input, vocabulary))
    else:
        print("Persuasion: Bye! It was a great chat.")
        break

The Dialogian: I will try to respond to you reasonably. If you want to exit, type bye.
User: hello
Persuasion: howdy
User: how are you today?
Persuasion: how are you today?
User: what?
Persuasion: A what?
User: it's cold outside.
Persuasion: I'm cold.
User: merry christmas!
Persuasion: A merry Christmas to you, Amy.
User: who is Amy?
Persuasion: ...You know... who *who* is?
User: you're so funny. I'm Kalika. What's your name?
Persuasion: And what is your name?
User: kalika
Persuasion: i'm kalika.
User: You've stolen my identity!
Persuasion: It was stolen!
User: Give it back!
Persuasion: Okay, give it back.
User: now. what is  your name?
Persuasion: And what is your name?
User: jon. who are you?
Persuasion: I know who you are.
User: you're terrible is who you are! absolutely terrible. By!
Persuasion: ...by you?
User: no. i mean, good bye.
Persuasion: ...good-bye then.
User: bye
Persuasion: Bye! It was a great chat.


Start a conversation with your chatbot, and discuss its strengths and weaknesses.

It's a funny little bot. It cannot identify itself, but it does have a little bit of character. Of course there's work to be d one, but all in all - it's cute. I had a lot of fun with it. 
