# Vivabot

![](https://images.unsplash.com/photo-1527430253228-e93688616381?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1191&q=80)

Photo by [Rock'n Roll Monkey](https://unsplash.com/photos/R4WCbazrD1g)

In this exercise, you will build your own bot: Vivabot. To do so, we will apply our knowledge about text preprocessing, TF-IDF and similarity, but also basic Python code.

Begin by importing the needed libraries:

## 12:10 start Vivabot

In [392]:
# TODO: import needed libraries
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
import string
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


First let's load our sentence database, stored in the file *chatbot_database.txt* and have a look at the data.

Warning: the file is not a CSV, so you might need to play with the parameters of `pd.read_csv()` to open it correctly.

In [393]:
# TODO: load chatbot_database.txt
##df = pd.read_csv('chatbot_database.txt', delimiter='\\', header=None)
df = pd.read_csv('chatbot_database.txt', sep='\\', header=None)

In [394]:
df.columns # no column names because header=None

Int64Index([0], dtype='int64')

In [395]:
df.shape

(86, 1)

In [396]:
df ## the blank lines are gone ... how can the blocks of text be grouped?

| | 0 |
|---|---|
| 0 | A chatbot (also known as a talkbot, chatterbot... |
| 1 | Such programs are often designed to convincing... |
| 2 | Chatbots are typically used in dialog systems ... |
| 3 | Some chatterbots use sophisticated natural lan... |
| 4 | The term "ChatterBot" was originally coined by... |
| ... | ... |
| 81 | These cloud platforms provide Natural Language... |
| 82 | There are many APIs available for building you... |
| 83 | Malicious chatbots are frequently used to fill... |
| 84 | They are commonly found on Yahoo! Messenger, W... |


In [397]:
df.columns = ['Text']
df

| | Text |
|---|---|
| 0 | A chatbot (also known as a talkbot, chatterbot... |
| 1 | Such programs are often designed to convincing... |
| 2 | Chatbots are typically used in dialog systems ... |
| 3 | Some chatterbots use sophisticated natural lan... |
| 4 | The term "ChatterBot" was originally coined by... |
| ... | ... |
| 81 | These cloud platforms provide Natural Language... |
| 82 | There are many APIs available for building you... |
| 83 | Malicious chatbots are frequently used to fill... |
| 84 | They are commonly found on Yahoo! Messenger, W... |


In [398]:
# Build the list of quotes: attempt to group the blocks of text separated by blank lines
"""
quotes = []
for row in df.Text:
    quote = ""
    print(f'row: {row}')
    if row == '':
        quotes.append(quote)
    quote += '\n' + row
quotes
"""


'\nquotes = []\nfor row in df.Text:\n    quote = ""\n    print(f\'row: {row}\')\n    if row == \'\':\n        quotes.append(quote)\n    quote += \'\n\' + row\nquotes\n'
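The disabled attempt above cannot work as written: `pd.read_csv` skips blank lines by default (`skip_blank_lines=True`), so `row == ''` never occurs, and `quote` is reset on every iteration. One possible alternative, sketched below, reads the raw text directly and starts a new block at each blank line. The helper name `group_blocks` and the sample text are hypothetical and only serve as illustration.

```python
def group_blocks(text):
    """Split raw text into blocks of consecutive non-blank lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            # Non-blank line: it belongs to the block being built
            current.append(line.strip())
        elif current:
            # Blank line: close the current block, if any
            blocks.append(" ".join(current))
            current = []
    if current:
        # The text may end without a trailing blank line
        blocks.append(" ".join(current))
    return blocks

sample = "First block, line one.\nFirst block, line two.\n\nSecond block.\n"
blocks = group_blocks(sample)
print(blocks)
```

With the real file, and assuming it is UTF-8 encoded, `group_blocks(open('chatbot_database.txt', encoding='utf-8').read())` would return one string per block.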

It is necessary to compute the TF-IDF on this database. First, do not forget to preprocess the data, and then compute and store the TF-IDF.

### PREPROCESSING

In [399]:
# TODO: preprocess and compute the TF-IDF of this database
def preprocessing(document):
    # 1- tokenization
    tokens = word_tokenize(document)
    # 2- lowercase alphabetic tokens and leave the others unchanged
    ###tokens = [t.lower() if t.isalpha() else t for t in tokens] ## problem with similarity.max(): to be investigated
    # 3- remove stopwords and punctuation
    ##stop_words = set(stopwords.words('english') + list(string.punctuation))
    ###tokens = [t for t in tokens if not t in stop_words]
    # 4- stemming
    stemmer = PorterStemmer()  # reduces each word to its stem by stripping known suffixes
    tokens_stem = [stemmer.stem(w) for w in tokens]
    return tokens_stem

In [400]:
## apply preproc to df
df['Text_preproc'] = df['Text'].apply(preprocessing)

### TF-IDF on the preprocessed data

In [401]:
## TF-IDF fitted on the whole corpus
## !!! OK with sklearn's default preprocessing on column 'Text' for the function get_closest_sentence() !!!
TFIDF_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
TF_IDF = TFIDF_vectorizer.fit_transform(df.Text).toarray()

### Not OK with df.Text_preproc in the similarity computation: the max is always 0.0 when I use t.lower() if t.isalpha(); to be investigated
## OK when the t.lower() if t.isalpha() step is left out, but the results are worse than with df.Text
#TFIDF_vectorizer = TfidfVectorizer(analyzer=lambda x: x)
#TF_IDF = TFIDF_vectorizer.fit_transform(df.Text_preproc).toarray()
TF_IDF

# View the TF-IDF matrix as a DataFrame
df_TFIDF = pd.DataFrame(data=TF_IDF, columns=TFIDF_vectorizer.get_feature_names_out())
df_TFIDF #86 x 639

Unnamed: 0,000,100,16,1950,1966,1972,1984,1994,2006,2008,...,worker,workings,would,written,xico,yahoo,yekaliva,yet,york,zuckerberg
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.243849,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.282888,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
82,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
83,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0
84,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.0,0.253146,0.0,0.0,0.0,0.0


In [402]:
## function meant to generalize the vivabot function, with the 'database' file in the signature
def read_data_preproc_dataframe(database):
    # database is a txt file
    # kept in the same directory as the code to make things easier
    df = pd.read_csv(database, sep='\\', header=None)

    df.columns = ['Text']
    df['Text_preproc'] = df['Text'].apply(preprocessing) ## not actually used
    print(df)
    print(f'df.shape: {df.shape}')

    ## TF-IDF fitted on the whole corpus
    ## !!! OK with sklearn's default preprocessing on column 'Text' for the function get_closest_sentence() !!!
    TFIDF_vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
    TF_IDF = TFIDF_vectorizer.fit_transform(df.Text).toarray()

    return TFIDF_vectorizer, TF_IDF, df.Text

The next step is to find the database sentence closest to a user query, using cosine similarity. Implement this in the function `get_closest_sentence(query, tf_idf, vectorizer)`. It returns the index of the sentence closest to `query` in the database's TF-IDF matrix `tf_idf`, using `vectorizer` to compute the TF-IDF of the query.
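To make the computation concrete, here is a dependency-free sketch of a TF-IDF plus cosine-similarity lookup. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` (the formula scikit-learn documents for its default settings) and a simplified tokenizer. The three-sentence corpus and all helper names are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer: lowercase words of two or more characters
    return re.findall(r"\w\w+", text.lower())

def fit_idf(corpus):
    # Smoothed inverse document frequency for every term in the corpus
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf(text, idf):
    # Term count times IDF; terms unseen when fitting are ignored
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}

def cosine(u, v):
    # Cosine of the angle between two sparse vectors stored as dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def closest_index(query, corpus):
    # Index and score of the corpus sentence most similar to the query
    idf = fit_idf(corpus)
    query_vec = tfidf(query, idf)
    scores = [cosine(query_vec, tfidf(doc, idf)) for doc in corpus]
    best = max(range(len(corpus)), key=scores.__getitem__)
    return best, scores[best]

corpus = [
    "Chatbots are used in dialog systems.",
    "ELIZA was an early chatterbot program.",
    "Natural language processing is a field of AI research.",
]
best, score = closest_index("field of AI research", corpus)
print(best, round(score, 3))
```

Because cosine similarity ignores vector length, the L2 normalization that `TfidfVectorizer` applies by default does not change which sentence ranks first.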

Do not forget to preprocess the query before computing the TF-IDF.
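This reminder may also explain the similarity stuck at 0.0 noted in the TF-IDF cell. With `analyzer=lambda x: x`, the vectorizer iterates over whatever it receives, so a raw query string is split into single characters. The sketch below illustrates this on a made-up two-sentence corpus; `toy_preprocess` is a hypothetical stand-in for the notebook's `preprocessing` function, and this is one plausible cause rather than a confirmed diagnosis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def toy_preprocess(text):
    # Hypothetical stand-in for the notebook's preprocessing(): lowercase + split
    return text.lower().split()

corpus = ["Chatbots answer questions", "ELIZA was an early program"]
corpus_tokens = [toy_preprocess(doc) for doc in corpus]

# The identity analyzer expects each document to already be a list of tokens
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
tf_idf = vectorizer.fit_transform(corpus_tokens)

query = "early program"

# Raw string: the analyzer iterates over characters, none of which is in the vocabulary
raw_sim = cosine_similarity(vectorizer.transform([query]), tf_idf)

# Preprocessed query: its tokens match the vocabulary learned from the corpus
tok_sim = cosine_similarity(vectorizer.transform([toy_preprocess(query)]), tf_idf)

print(raw_sim.max(), tok_sim.argmax())
```

If single-character tokens such as `a` or punctuation remain in the vocabulary, a raw string can still match them by accident. That would be consistent with non-zero but poor results when stopwords and punctuation are kept.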

In [403]:
# TODO: implement get_closest_sentence(query, tf_idf, vectorizer)
# see exercise 3
"""
- Purpose: find the corpus sentence closest to the query
- Input: query = list containing a single string
- Return: the maximum cosine similarity and the closest sentence
"""
def get_closest_sentence(query, tf_idf, vectorizer, df_text):
    closest_sentence = ""
    query_TFIDF = vectorizer.transform(query).toarray()
    #print(pd.DataFrame(data=query_TFIDF, columns=TFIDF_vectorizer.get_feature_names_out()))

    # cosine similarity between the query and the corpus sentences
    similarity = cosine_similarity(query_TFIDF, tf_idf)

    #print(f"Similarity.max(): {similarity.max()}") # max = 0.0 in the tests
    #print(f'similarity.shape: {similarity.shape}')
    #print(f'similarity.argmax(): {similarity.argmax()}')

    ## df.Text must be passed as an argument!
    #closest_sentence = df.Text[similarity.argmax()]
    closest_sentence = df_text[similarity.argmax()]
    print(f"closest sentence to ** {query[0]} **:\n {closest_sentence}")
    return similarity.max(), closest_sentence


In [406]:
##### TESTS ####
#print(df)

## using the function that takes the database as an argument:
#TFIDF_vectorizer, TF_IDF = read_data_preproc_dataframe('chatbot_database.txt')

test_queries = [
    'Do you think chatbots are really intelligent?',
    "'%' of companies using chatbots",
    "2017 chatbots",
    "According to Forrester (2015)",
    "Aeromexico airline chatbot",
    "field of AI research is natural",
    "which is specific conversational agent."
]

dict_query_simmax = {}

for query in test_queries:
    ##sim_max, closest = get_closest_sentence([query], df_TFIDF, TFIDF_vectorizer)
    ## df.Text added as an argument
    sim_max, closest = get_closest_sentence([query], TF_IDF, TFIDF_vectorizer, df.Text)
    dict_query_simmax[query] = [closest, sim_max]

dict_query_simmax

closest sentence to ** Do you think chatbots are really intelligent? **:
 However Weizenbaum himself did not claim that ELIZA was genuinely intelligent, and the Introduction to his paper presented it more as a debunking exercise:
closest sentence to ** '%' of companies using chatbots **:
 A 2017 study showed 4% of companies used chatbots.
closest sentence to ** 2017 chatbots **:
 A 2017 study showed 4% of companies used chatbots.
closest sentence to ** According to Forrester (2015) **:
 According to Forrester (2015), AI will replace 16 percent of American jobs by the end of the decade.
closest sentence to ** Aeromexico airline chatbot **:
 Aeromexico airline chatbot running on Facebook Messenger, March 2018
closest sentence to ** field of AI research is natural **:
 One pertinent field of AI research is natural language processing.
closest sentence to ** which is specific conversational agent. **:
 uses a markup language called AIML, which is specific to its function as a

{'Do you think chatbots are really intelligent?': ['However Weizenbaum himself did not claim that ELIZA was genuinely intelligent, and the Introduction to his paper presented it more as a debunking exercise:',
  0.24132947726326626],
 "'%' of companies using chatbots": ['A 2017 study showed 4% of companies used chatbots.',
  0.3362248410272699],
 '2017 chatbots': ['A 2017 study showed 4% of companies used chatbots.',
  0.5144475319394324],
 'According to Forrester (2015)': ['According to Forrester (2015), AI will replace 16 percent of American jobs by the end of the decade.',
  0.5109771284273031],
 'Aeromexico airline chatbot': ['Aeromexico airline chatbot running on Facebook Messenger, March 2018',
  0.5939616834385282],
 'field of AI research is natural': ['One pertinent field of AI research is natural language processing.',
  0.7303984456220012],
 'which is specific conversational agent.': ['uses a markup language called AIML, which is specific to its function as a conversational a

Let's define greetings words and greetings answers in two separate variables.

Greeting words should be words or short sentences like "Hello", "Hey", "Hi", "What's up?" and so on.
The greeting answers can be any words or short sentences you like.

In [367]:
# TODO: Define the greetings words and answers in two variables
# We can define the input and answers
greetings_inputs = ['Hello', 'Hi', 'Good morning', 'Hey']
greetings_answers = ['Hey there, I am Vivabot, how can I help you?', 'Hello, my name is Vivabot, nice to meet you.',
                     'Vivabot at your service, sir.', 'Hi Master, I am Vivabot.']


Now create a Greetings function, called `greetings(sentence, greetings_inputs, greetings_outputs)`. If the variable `sentence` is in `greetings_inputs`, the function returns randomly a sentence from `greetings_outputs`. Otherwise the function returns nothing.

Take into account when the case does not match too: for example 'hello' or 'Hello' should both work!
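As a minimal sketch of the behavior described above (case-insensitive match, random answer, nothing returned otherwise), one could write something like the following. The name `pick_greeting` and the sample lists are hypothetical and are not part of the exercise.

```python
import random

def pick_greeting(sentence, greetings_inputs, greetings_outputs):
    """Return a random greeting answer, or None if `sentence` is not a greeting."""
    # Compare in lowercase so that 'hello' and 'Hello' both match
    known = {g.lower() for g in greetings_inputs}
    if sentence.strip().lower() in known:
        return random.choice(greetings_outputs)
    return None

inputs = ["Hello", "Hi", "Hey"]
outputs = ["Hey there!", "Hello, nice to meet you."]
reply = pick_greeting("hello", inputs, outputs)
no_reply = pick_greeting("What is a chatbot?", inputs, outputs)
print(reply, no_reply)
```

Returning the answer instead of printing it leaves the caller free to decide how to display it, and `None` makes the "not a greeting" case easy to test.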

In [368]:
# TODO: Implement the function greetings
# function from the course
def greetings(sentence, greetings_inputs, greetings_outputs):
    ##sentence = input('User input :\n>> ')

    # Compare case-insensitively so that 'hello' and 'Hello' both match
    if sentence.lower() in [g.lower() for g in greetings_inputs]:
        # Pick a random answer from the list passed in, not from the global variable
        output = np.random.randint(len(greetings_outputs))
        print(greetings_outputs[output])
    else:
        print('It was a pleasure. Bye!')

## Use text input/output!

In [369]:
greeting = input('enter a greeting')
greetings(greeting, greetings_inputs, greetings_answers)

It was a pleasure. Bye!


The next step is to put it all together: define a function `vivabot(greetings_inputs, greetings_outputs, tf_idf, vectorizer, database)` that does the following:
<ol>
<li> Print some generic presentation </li>
<li> Ask for text input </li>
<li> If the text input is in greetings: call the function `greetings` and print its output using `greetings_inputs` and `greetings_outputs`</li>
<li> If the text input is not in greetings: call the function `get_closest_sentence` and print the closest sentence using `tf_idf`, `vectorizer` and `database`</li>
<li> Go back to step 2 unless the text input is "Bye" </li>
</ol>
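The five steps above can be sketched as a single loop. To keep the example runnable without a keyboard or a TF-IDF model, the input reader and the lookup are passed in as functions. `chat_loop`, `answer_query` and `read_input` are hypothetical names, and the lambda is a placeholder for a call to `get_closest_sentence`.

```python
import random

def chat_loop(greetings_inputs, greetings_outputs, answer_query, read_input=input):
    """Run the greeting / lookup loop until the user types "Bye"."""
    print("Hi, I am a small retrieval bot. Type 'Bye' to quit.")  # step 1
    known = {g.lower() for g in greetings_inputs}
    replies = []
    while True:
        text = read_input(">> ").strip()                           # step 2
        if text.lower() == "bye":                                  # step 5
            break
        if text.lower() in known:                                  # step 3
            reply = random.choice(greetings_outputs)
        else:                                                      # step 4
            reply = answer_query(text)
        print(reply)
        replies.append(reply)
    return replies

# Scripted run: an iterator stands in for keyboard input
scripted = iter(["hello", "what is ELIZA?", "Bye"])
replies = chat_loop(
    ["Hello", "Hi"],
    ["Hello, nice to meet you."],
    answer_query=lambda q: f"(closest sentence for: {q})",
    read_input=lambda prompt: next(scripted),
)
```

Injecting `read_input` is only a testing convenience; in the notebook the default `input` would be used.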



In [407]:
# TODO: implement the function vivabot
## df.Text added as an argument
def vivabot(greetings_inputs, greetings_outputs, tf_idf, vectorizer, split_response, df_text):
    ## Steps:
    # 1 - Print some generic presentation
    # 2 - Ask for text input
    # 3 - If the text input is in greetings:
    #   -> call the function 'greetings' and print its output using 'greetings_inputs' and 'greetings_outputs'
    # 4 - If the text input is not in greetings:
    #   -> call the function 'get_closest_sentence' and print the closest sentence using 'tf_idf', 'vectorizer' and 'database'
    # 5 - Go back to step 2 unless the text input is "Bye"

    #### open question: should the database be an argument of get_closest_sentence?
    ## VERSION-1: with the chatbot_database database, without database in the signature of get_closest_sentence
    ## VERSION-2: read the database and preprocess the dataframe in functions called before get_closest_sentence()

    print('This bot helps you find the sentence about chatbots closest to your query')
    dict_query_simmax = {}
    text_input = ""
    iterator = 0
    max_iters = 5
    # lowercase the greetings once so that 'hello' and 'Hello' both match
    greetings_lower = [g.lower() for g in greetings_inputs]

    print(f'tf_idf.shape: {tf_idf.shape}')

    while text_input.lower() != "bye" and iterator < max_iters:
        text_input = input('User input :\n>> ') ## TODO: handle the empty-string case
        if text_input.lower() == "bye":
            break
        elif text_input.lower() in greetings_lower:
            # pass the function's own parameters rather than the global greetings_answers
            greetings(text_input.lower(), greetings_lower, greetings_outputs)
        else:
            sim_max, closest = get_closest_sentence([text_input], tf_idf, vectorizer, df_text)
            # keep only the second part of closest, split on '?' -- work out how to do it on '.' without breaking anything
            #if split_response:
            #    closest = closest.split('?')[1]
            dict_query_simmax[text_input] = [closest, sim_max]
        iterator += 1

    return dict_query_simmax

Finally, call the function `vivabot` and see your chatbot coming to life!

If it does not work well, call the functions one by one and check they all work properly independently first.

### CHATBOT TEST ON chatbot_database.txt or dialogs.txt

In [412]:
## Problem with dialogs.txt -> solution: pass df.Text as an argument -> OK now
databases = ['chatbot_database.txt', 'dialogs.txt']

idx = 0
database = databases[idx]
database

'chatbot_database.txt'

In [413]:
# TODO: use your chatbot!
TFIDF_vectorizer = None
TF_IDF = None
split_response = idx==1
TFIDF_vectorizer, TF_IDF, df_Text = read_data_preproc_dataframe(database)
TF_IDF

dict_query_simmax = vivabot(greetings_inputs, greetings_answers, TF_IDF, TFIDF_vectorizer, split_response, df_Text)
dict_query_simmax

                                                 Text  \
0   A chatbot (also known as a talkbot, chatterbot...   
1   Such programs are often designed to convincing...   
2   Chatbots are typically used in dialog systems ...   
3   Some chatterbots use sophisticated natural lan...   
4   The term "ChatterBot" was originally coined by...   
..                                                ...   
81  These cloud platforms provide Natural Language...   
82  There are many APIs available for building you...   
83  Malicious chatbots are frequently used to fill...   
84  They are commonly found on Yahoo! Messenger, W...   
85  There has also been a published report of a ch...   

                                         Text_preproc  
0   [a, chatbot, (, also, known, as, a, talkbot, ,...  
1   [such, program, are, often, design, to, convin...  
2   [chatbot, are, typic, use, in, dialog, system,...  
3   [some, chatterbot, use, sophist, natur, langua...  
4   [the, term, ``, chatterbot, '',

{'of companies using chatbots': ['A 2017 study showed 4% of companies used chatbots.',
  0.3362248410272699],
 '2017 chatbots': ['A 2017 study showed 4% of companies used chatbots.',
  0.5144475319394324],
 'According to Forrester (2015': ['According to Forrester (2015), AI will replace 16 percent of American jobs by the end of the decade.',
  0.5109771284273031],
 'Aeromexico airline chatbot': ['Aeromexico airline chatbot running on Facebook Messenger, March 2018',
  0.5939616834385282]}

**\[BONUS\]**: Let's implement some sentiment analysis features on our brand new chatbot:


If the chatbot does not understand the user query (meaning the similarity is under a pre-defined threshold) implement a small talk function. The small talk function will take as input the query and return a positive or negative message depending on the tone (polarity) of the user.

### I did not understand what is being asked in this part.
### How can the polarity be determined from the query?
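One simple way to estimate polarity, offered here as an assumption rather than the method the exercise intends, is to count the query's words that appear in small positive and negative word lists. The word lists, the 0.25 threshold, the function names and the reply texts below are all hypothetical. A sentiment-analysis library could replace `polarity` without changing the rest.

```python
import re

# Hypothetical, deliberately tiny word lists; a real lexicon would be much larger
POSITIVE_WORDS = {"good", "great", "love", "nice", "thanks", "happy"}
NEGATIVE_WORDS = {"bad", "hate", "awful", "useless", "stupid", "angry"}

def polarity(query):
    """Score in [-1, 1]: (positive hits - negative hits) / total hits."""
    words = re.findall(r"[a-z']+", query.lower())
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

def small_talk_reply(query, similarity, threshold=0.25):
    """Return None when the bot understood the query, else a tone-matched message."""
    if similarity >= threshold:
        return None
    if polarity(query) < 0:
        return "I am sorry this is frustrating. Could you rephrase your question?"
    return "Thanks for getting in touch! Could you be more specific?"

print(small_talk_reply("this bot is useless", similarity=0.05))
print(small_talk_reply("thanks, you are great", similarity=0.05))
print(small_talk_reply("what is ELIZA?", similarity=0.6))
```

In the bot, `similarity` would be the maximum cosine similarity returned by `get_closest_sentence`, and a `None` result would mean the closest sentence should be shown as usual.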

In [4]:
small_talks_good = ["Thanks for getting in touch with me", "I am so sorry I do not understand your point", 
                   "I'll make sure to understand you after my next update"]

In [None]:
small_talks_bad = ["I can not understand a word of what you are saying", "Please be more specific"]

In [None]:
# TODO: implement the function vivabot
SIMILARITY_THRESHS = [0.1, 0.25, 0.4]

def small_talk(query, similarity):
    # Return the message paired with the first threshold the similarity falls under
    # (empty string if it is above every threshold); query is not used yet
    for thresh, message in zip(SIMILARITY_THRESHS, small_talks_good):
        if similarity <= thresh:
            return message
    return ""

Many improvements are possible if you have time: improve the preprocessing, or change the database if you want to use the bot for another purpose.

This is your bot!