# Building a chatbot that texts for me

Hey there! Through this notebook I am going to be building a chatbot that talks to people for me. It pulles in a CSV files of text messages that have been created through another notebook, selects a chat with one other person, and learns how we talk. NLTK is the package that does all the heavy lifting in this.

Let's start by importing our necessary packages:

In [463]:
import os # file loading
import io # file loading
import nltk # our core library
import random
import string
import warnings
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer # TFIDF
from sklearn.metrics.pairwise import cosine_similarity # How we find similar texts (more on this later)

warnings.filterwarnings('ignore') # cleans up some stuff

**Loading my language**
Using open-source code in another notebook, I managed to extract the text messages from my computer's SQL database file, and load them into a CSV file. This contains all of the texts that I've ever sent or received since I signed into my Apple ID. It looks like the oldest texts date back to September of 2015!

In [464]:
all_msgs = pd.read_csv("texts.csv") # load file
all_msgs = all_msgs.dropna() # clean up
all_msgs = all_msgs.sort_values(by=['message_date'])
all_msgs

Unnamed: 0,text,handle_id,date,message_date,timestamp,month,year,is_sent,message_id,phone_number,chat_id
89468,Sure!,325,2015-09-10,463636977000000000,2015-09-10 21:02:57,9,2015,0,89466,coliestoff@comcast.net,397.0
89598,Awesome! Thank you!,325,2015-09-10,463637031000000000,2015-09-10 21:03:51,9,2015,0,89596,coliestoff@comcast.net,397.0
89523,Ok thanks again!,325,2015-09-10,463637531000000000,2015-09-10 21:12:11,9,2015,0,89521,coliestoff@comcast.net,397.0
89588,Gtg Netflix awaits,325,2015-09-10,463637536000000000,2015-09-10 21:12:16,9,2015,0,89586,coliestoff@comcast.net,397.0
89519,Why so early?,16,2015-09-11,463675257000000000,2015-09-11 07:40:57,9,2015,0,89517,+15037307966,15.0
89584,I'm almost to the turnaround,8,2015-09-11,463702094000000000,2015-09-11 15:08:14,9,2015,0,89582,+15037305828,5.0
89580,hahahha cool,35,2015-09-11,463710339000000000,2015-09-11 17:25:39,9,2015,0,89578,+15033335141,31.0
89511,ik,35,2015-09-11,463710365000000000,2015-09-11 17:26:05,9,2015,0,89509,+15033335141,31.0
89576,I didn't know she went to oes,35,2015-09-11,463710380000000000,2015-09-11 17:26:20,9,2015,0,89574,+15033335141,31.0
89507,cool,35,2015-09-11,463710395000000000,2015-09-11 17:26:35,9,2015,0,89505,+15033335141,31.0


Because we want to build this to talk to someone for me, we need to learn how I talk with that person. For example, my texts with Bevin Daglen, my dad, and my girlfrind are all very differnt. Let's pull out just the chats between me and one other person, based on their phone number. For this example, we're going to pull out all the messages between me and my girlfriend.

In [465]:
phone_number = '+15037461083' # emily
#phone_number = '+15035507677' # hudson
#phone_number = '+15038679170' # aida

person_msgs = all_msgs[all_msgs['phone_number'] == phone_number] # pull out just rows with that phone number
#person_msgs

Great! We have all the messages between me and Emily, but there seems to be a problem... The texts between us go all the way back to March of 2017, but for some reason there aren't any older texts that **I sent**? This seems to be a problem with the data, so let's figure out when we start to have complete data:

In [466]:
print('First timestatmp:', all_msgs[all_msgs['is_sent'] == 1]['timestamp'].min())

first_sent = all_msgs[all_msgs['is_sent'] == 1]['message_date'].min() # take the oldest timestamp where is_sent == 1
print('Computer-readable timestsamp:',first_sent)

First timestatmp: 2019-07-23 15:38:18
Computer-readable timestsamp: 585614298952704896


Unfortunatly, it looks like we only have complete data going back to July of 2019. I've tried pulling out all the texts, but it seems my iCloud accout just have no recollection of old texts that I sent. Additionally, my phone is set to auto-reply when I'm driving. Because of these two things, we need to drop old (incomplete) data, and rows where my phone auto-replied.

In [467]:
# remove rows where data is incomplete:
person_msgs = person_msgs[person_msgs['message_date'] > first_sent]

# remove rows where my phone auto-replied:
person_msgs = person_msgs[~person_msgs['text'].str.contains('I’m driving with Do Not Disturb While Driving turned on.')]
person_msgs = person_msgs[~person_msgs['text'].str.contains('I’m not receiving notifications. If this is urgent, reply “urgent” to send a notification through with your original message.')]

#person_msgs

Now it's time to take our data, put all the text messages into a string.

In [468]:
pd.set_option("display.max_colwidth", 10000000) # If we don't use this, our texts won't be complete
    
raw = person_msgs['text'].to_string(header = False, index = False)
raw = raw.lower() # convert it to lowercase
raw = raw.replace('\n', '. ') # cut out the newlines (between each message) and replace with a period
#print(raw[0:20000])

Because of the way that NLTK works, we need to convert our big string into a list of sentences, then into a list of words. Let's do that here:

In [469]:
sent_tokens = nltk.sent_tokenize(raw)# converts to list of sentences
word_tokens = nltk.word_tokenize(raw)# converts to list of words
print(word_tokens[0:10])

['uhmmm', 'idk', 'it', 'really', 'depends', 'it', '’', 's', 'super', 'wack']


Now let's get out lemminizer set up and working. This is much like stemming, but more practical. In stemming, we take the stem of each word. E.g. stemming, stemmer, and stemmed, all return 'stem'. The problem with this is that we can end up with stems that aren't real words. A lemminizer, in contrast, always returns a real word. For example, 'good,' 'great,' and 'better' will return 'good.'

In [470]:
lemmer = nltk.stem.WordNetLemmatizer() # The built-in NLTK lemmer

def LemTokens(tokens): # recursive function that returns the lemma for each word in the list
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation) # built-in-function to remove punctuation.
# Sice this is a texting bot, we're going to leave in other things like emojis.

def LemNormalize(text): # function to normalize the case of the text using the built-in .lower() function.
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

Here is where all the magic happens, we're going to define our first response function:

In [471]:
def response_v1(user_response): # define a function that can take in a string
    nlp_response='' # define the NLP response
    sent_tokens.append(user_response) # since we just got another message, let's treat it like any other sentence in our corpus.
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english') # now we're going to vectorize all of our words
    tfidf = TfidfVec.fit_transform(sent_tokens) # Apply the feature extractor
    
    vals = cosine_similarity(tfidf[-1], tfidf) # take the dot product between the last two vectors and find cosine difference
    
    idx = vals.argsort()[0][-2] #Perform an indirect sort along the second to last axis
    flat = vals.flatten() # Transform into 1D array
    flat.sort() # sort in ascending order
    
    req_tfidf = flat[-2]
    
    if(req_tfidf == 0):
        nlp_response = nlp_response + "Wtf?"
        return nlp_response
    else:
        nlp_response = nlp_response + sent_tokens[idx]
        return nlp_response

In [472]:
while(True):
    user_response = input()
    user_response=user_response.lower() #set it all to lower case

    print("NLP: ", end="")
    print(response_v2(user_response))
    break # comment this line to run through to the next code block

Hey!
NLP: awh that’s good to know ❤️ that’s actually that first thing i thought about when i woke up and it was nice to see that message from you


The above function works ok, but it really just reads through **all** of our messages and learns from that. What if we split the messages up, and it learned how **I** responded to things that other people say? 🤔 Let's try it!

In [473]:
msgs_them = person_msgs[person_msgs['is_sent'] == 0] # other person's messages are is_sent = 0
msgs_me = person_msgs[person_msgs['is_sent'] == 1] # my messages are is_sent = 1

# Normalize the texts using .lower():
msgs_them['text'] = msgs_them['text'].str.lower()
msgs_me['text'] = msgs_me['text'].str.lower()

# Split into sentence tokens (each text message is treated as one sentence):
sent_tokens_them = msgs_them['text'].to_list()
sent_tokens_me = msgs_me['text'].to_list()

# Put into new-line seperated lists:
raw_them = '.\n'.join(sent_tokens_them)
raw_me = '.\n'.join(sent_tokens_me)

# Take out each word and make their own list:
word_tokens_them = nltk.word_tokenize(raw_them)# converts to list of words
word_tokens_me = nltk.word_tokenize(raw_me)# converts to list of words

# Exact same lemmatizer code as above:
lemmer = nltk.stem.WordNetLemmatizer()

def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))

So now we have the messages of each person stored in different dataframes. We'll use NLTK to find the closest vectors between the other person's message and all the messages that they have previously sent. We will then find when that message was sent, and find how I responded to it. We will then respond with how I have previously responded.

In [474]:
def response_v2(user_response):
    nlp_response=''
    sent_tokens_them.append(user_response)
    
    # Same TFIDF code as above
    TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens_them)
    vals = cosine_similarity(tfidf[-1], tfidf)
    
    idx = vals.argsort()[0][-2]
    
    sentdate = msgs_them.iloc[idx]['message_date'] # Now that we have the index of their closest message, find when it was sent
    
    # Find the message that I used to respond to that text, and return it:
    messages_after = msgs_me[msgs_me['message_date'] >= sentdate]
    my_response = messages_after.iloc[0]['text']
    
    flat = vals.flatten() # flatten the 2D list to a 1D list
    flat.sort() # sort the list
    
    req_tfidf = flat[-2]
    
    if(req_tfidf == 0):
        nlp_response = nlp_response + "Wtf?"
        return nlp_response
    else:
        return my_response

In [None]:
while(True):
    user_response = input()
    user_response=user_response.lower() #set it all to lower case

    print("NLP:", response_v2(user_response))
    sent_tokens_them.remove(user_response)

what's up?
NLP: what's the order
huh?
NLP: yes it was
hahah ok
NLP: oh yeah totally! leaving right now
where are you going?
NLP: yes i am super hyped
hahah ok
NLP: oh yeah totally! leaving right now


And there you have it! We took in our corpus, pre-processed it, split it into two different corpuses (one for me and one for the other person), built the vectors, and created an interface for it. I'm excited to take this even further!