<h1 id="simple-chat-bot"><center>💬 Simple Chat Bot 💬</center></h1>
<i><center>Create your first Chat Bot with Python</center></i>

----

<h2 id="problem-description">📝 Problem Description</h2>

<br />

<figure>
    <img src="https://res.cloudinary.com/dte7upwcr/image/upload/blog/blog2/chatbot-o-que-e/chatbot-o-que-e-img_header.jpg" alt="Chat Bot Image" />
    <figcaption><i>Fig. 1 - Chat Bot. <sup>©</sup><a href="https://cloudinary.com" target="_blank">Cloudinary</a></i></figcaption>
</figure>

<br />

You have been hired by CUMI (*Central Unity of My Imagination*) to build a Chat Bot model that talks with people in **English**, **Japanese** and **Portuguese**. The bot does not have a specific purpose; it should just answer simple small talk questions.

You received a dataset containing more than 3k questions and answers to train the model. You can use any programming language and library, but you have to explain each step of the project.

In the end, save the model in a pickle file so that others can use it!
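
As a rough sketch of that last step, the round trip below saves an object with Python's built-in `pickle` module and loads it back. The `model` dictionary and the `chatbot_model.pkl` file name are hypothetical placeholders, not the project's actual model or path.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: a question -> answer lookup.
model = {"hi, how are you doing?": "i'm fine. how about yourself?"}

# Hypothetical file name, written to the temporary directory for this sketch.
model_path = os.path.join(tempfile.gettempdir(), "chatbot_model.pkl")

# Serialize the object to disk (pickle files are binary, hence "wb").
with open(model_path, "wb") as file:
    pickle.dump(model, file)

# Anyone with the file can restore an equal object.
with open(model_path, "rb") as file:
    restored_model = pickle.load(file)

print(restored_model == model)
```

Keep in mind that `pickle.load` can run arbitrary code, so only load pickle files that come from a source you trust.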

----

<h2 id="files-description">📁 Files Description</h2>

> **conversation.csv** - contains around 3K simple small talk questions and answers to train the model.

----

<h2 id="features">❓ Features</h2>

> **Id** - row id to identify the (*question, answer*) pair;

> **Question** - small talk question.

----

<h2 id="target-feature">🌟 Target Feature</h2>

> **Answer** - small talk answer.

----

<h2 id="metric">📏 Metric</h2>

There are numerous metrics we can use to evaluate Chat Bot models, such as *Number of Active Users*, *User Satisfaction Score*, *Number of Total Interactions per Conversation* and *Time/Length of Conversation*. In this project, we will be using **Goal Completion Rate (*GCR*)**.

*GCR* measures how well the model fulfills its role. In our case, the model's goal is to answer simple small talk questions, so, after training it, we will run some conversation tests and see how well and how coherently the model answers.
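
As a rough illustration, GCR can be expressed as the share of test interactions in which the bot met its goal. The function name and the manually labeled outcomes below are hypothetical; in this project the answers are judged by hand.

```python
def goal_completion_rate(outcomes):
    """Return the fraction of test interactions judged successful."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return sum(outcomes) / len(outcomes)


# Hypothetical manual labels: True when the bot's answer was coherent.
manual_labels = [True, True, False, True]
gcr = goal_completion_rate(manual_labels)

print(f"GCR: {gcr:.0%}")  # prints "GCR: 75%"
```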

----

<h2 id="limitations">🛑 Limitations</h2>

Nothing is perfect in this world, and Artificial Intelligence and Data Science are no exceptions. There will always be limitations in our projects.

The first - and most obvious - limitation we have here is the languages! Our bot will be able to understand English, Japanese and Portuguese only, so, if the model becomes publicly available to anyone around the world, users must know one of these three languages or use a translator (like *Google Translate*) to interact with it.

Besides, even though our dataset contains around 3k small talk questions, it does not cover every possible small talk question, and our model will probably not answer some specific questions well - yeah, I believe that we won't be able to ask about anime and show spoilers. 😥

----

<h2 id="goals">🎯 Goals</h2>

> **Goal 1** - make a word cloud of the questions and another one of the answers;

> **Goal 2** - make the model answer simple small talk questions based on our dataset;

> **Goal 3** - make the model be able to understand and answer in **English**, **Japanese** and **Portuguese**.
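
For Goal 1, a word cloud draws each word at a size that grows with how often it appears. Here is a minimal sketch of that counting step, using only the standard library and three sample questions from the dataset. The regular expression is a simplifying assumption; the `WordCloud` library does its own tokenization, so its counts may differ.

```python
import re
from collections import Counter

# Sample questions copied from conversation.csv.
questions = [
    "hi, how are you doing?",
    "i'm fine. how about yourself?",
    "no problem. so how have you been?",
]

# Treat every run of letters and apostrophes as one word (simplifying assumption).
words = re.findall(r"[a-z']+", " ".join(questions).lower())
frequencies = Counter(words)

print(frequencies.most_common(2))  # prints [('how', 3), ('you', 2)]
```

A frequency mapping like this one can also be passed to `WordCloud.generate_from_frequencies` instead of raw text.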

----

<h2 id="setup">⚙️ Setup</h2>

***Tools***

> Python Version 3.9.x+;

> Jupyter Notebook;

<br />

***Libraries***

> Pandas, Numpy, WordCloud;

> ChatterBot, NLTK, Tensorflow, Keras;

> Pickle.

----

<h2 id="acknowledgments">🎉 Acknowledgements</h2>

> Dataset by [*Kreesh Rajani*](https://www.kaggle.com/kreeshrajani) from Kaggle;

> Dataset URL: [3k Conversations Dataset for Chat Bot](https://www.kaggle.com/datasets/kreeshrajani/3k-conversations-dataset-for-chatbot).

----

In [5]:
import pandas as pd # pip install pandas
import numpy as np # pip install numpy
from wordcloud import WordCloud # pip install wordcloud

SEED = 4353

np.random.seed(SEED)
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 5)

----

<h2 id="reading-the-dataset">0) Reading the Dataset</h2>

In [10]:
dataset_path = './dataset/conversation.csv'
df = pd.read_csv(dataset_path, index_col='Unnamed: 0')
df.head()

Unnamed: 0,question,answer
0,"hi, how are you doing?",i'm fine. how about yourself?
1,i'm fine. how about yourself?,i'm pretty good. thanks for asking.
2,i'm pretty good. thanks for asking.,no problem. so how have you been?
3,no problem. so how have you been?,i've been great. what about you?
4,i've been great. what about you?,i've been good. i'm in school right now.


----

<h2 id="exploring-the-dataset">1) Exploring the Dataset</h2>

In [11]:
# Statistic Overview
df.describe()

Unnamed: 0,question,answer
count,3725,3725
unique,3510,3512
top,what do you mean?,what do you mean?
freq,22,22


Looking at the table above, we see that there are duplicated questions and answers, with **what do you mean?** being the most frequent one in both features. Let's take a look at all rows that contain this sentence and decide whether or not to drop them.
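
Before filtering, a quick way to see which sentences repeat is `value_counts`. The sketch below runs on a small hand-made frame that reuses the dataset's column names; the rows are illustrative only, not the full data.

```python
import pandas as pd

# Toy frame with the dataset's column names; the rows are illustrative only.
sample = pd.DataFrame({
    "question": [
        "what do you mean?",
        "don't be ridiculous.",
        "what do you mean?",
    ],
    "answer": [
        "i mean you're wasting your life.",
        "what do you mean?",
        "you would float into the sky like a balloon.",
    ],
})

# Count how often each question appears and keep those seen more than once.
question_counts = sample["question"].value_counts()
repeated_questions = question_counts[question_counts > 1]

print(repeated_questions.to_dict())
```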

In [44]:
# Listing all rows containing "what do you mean?" in the question or the answer
value = 'what do you mean?'
filtered_rows = df.loc[(df['question'] == value) | (df['answer'] == value)]
filtered_rows_indexes = filtered_rows.index


for index, (question, answer) in enumerate(
    zip(filtered_rows['question'], filtered_rows['answer'])
):
    print(f'{index}) {question} >> {answer}')

0) you're watching too much tv. >> what do you mean?
1) what do you mean? >> i mean you're wasting your life.
2) no, when i call him on his cell phone. >> what do you mean?
3) what do you mean? >> i buried him with his cell phone.
4) no. that's incomplete. >> what do you mean?
5) what do you mean? >> what's your mailing address?
6) oh, no, you don't. >> what do you mean?
7) what do you mean? >> every morning you get up late and rush off to work late.
8) which would you prefer? >> what do you mean?
9) what do you mean? >> when you die and go to heaven, they will offer you beer or cigarettes.
10) don't be ridiculous. >> what do you mean?
11) what do you mean? >> if you're going to make a wish, wish that you were really rich or famous.
12) without gravity, you would go up. >> what do you mean?
13) what do you mean? >> you would float into the sky like a balloon.
14) and lots of thieves. >> what do you mean?
15) what do you mean? >> i mean, keep your belongings close to you.
16) that's a l

Well, let's get the boring part done quickly. The code below changes the relevant "what do you mean?" questions and answers to proper ones and writes them back into the main dataset.