This exercise may be done in groups of up to 3 people.

Write a chatbot that, given a question in English, finds the most similar question in the question-and-answer corpus available on Kaggle (https://www.kaggle.com/rtatman/questionanswer-dataset#S08_question_answer_pairs.txt) and displays its answer.
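
The task described above can be sketched with the standard library alone. This is a minimal sketch, not the expected solution: it assumes TF-IDF weighting with cosine similarity as the notion of "most similar", and the helper names (`tokenize`, `build_index`, `weigh`, `cosine`, `answer`) as well as the toy corpus are hypothetical stand-ins for the real Question/Answer columns.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weigh(tokens, idf):
    """TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def build_index(questions):
    """Return (idf, vectors): IDF weights and one TF-IDF vector per question."""
    tokenized = [tokenize(q) for q in questions]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF so that every term keeps a positive weight
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return idf, [weigh(tokens, idf) for tokens in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def answer(user_question, answers, idf, vectors):
    """Return the answer paired with the most similar corpus question, or None."""
    query = weigh(tokenize(user_question), idf)
    scores = [cosine(query, vec) for vec in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best] if scores[best] > 0 else None


# Hypothetical toy corpus standing in for the Question/Answer columns
questions = ["Was Lincoln a president of the United States?",
             "Do kangaroos live in Australia?",
             "Is the violin a string instrument?"]
answers = ["yes", "yes, kangaroos are native to Australia", "yes, it has four strings"]

idf, vectors = build_index(questions)
print(answer("Where do kangaroos live?", answers, idf, vectors))
```

Returning `None` when no term overlaps lets the caller decide how to say "I don't understand" instead of replying with an arbitrary row.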

Solve it on Kaggle, share it only with fernandojvdasilva, and send the link when you submit your solution through Edmodo.

#### Test 1

This section contains:

- WordCloud generation
- Topic modeling
- Question length plots
- Structure estimation

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt  # plt is used by the plotting cells below
from textblob import TextBlob
from wordcloud import WordCloud
import sklearn
# assert sklearn.__version__ == '0.18' # Make sure we are in the modern age
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
raw = pd.read_csv('S08_question_answer_pairs.txt', encoding='utf-8', delimiter='\t', quotechar='"')
raw.head(5)

In [None]:
raw.info()

WordCloud!

A word cloud of all the text present (here, the article titles).

In [None]:
text = ' '.join(raw['ArticleTitle'])
cloud = WordCloud(background_color='white', width=1920, height=1080).generate(text)
plt.figure(figsize=(32, 18))
plt.axis('off')
plt.imshow(cloud)
plt.savefig('questions_wordcloud.png')

Topic Models

As in any self-respecting analysis of unlabeled text data, we run some topic modeling here, applying Non-Negative Matrix Factorization to the questions.

This lets us learn about the different categories of question jokes.

In [None]:
raw['Question'].values

In [None]:
raw['Question'].values.astype('U').nbytes

In [None]:
raw['Question'].values.nbytes

In [None]:
raw.dropna(inplace=True)

In [None]:
# Some defaults
max_features = 1715
max_df = 0.95
min_df = 2
stop_words = 'english'

# document-term matrix A
vectorized = CountVectorizer(max_features=max_features, max_df=max_df,
                             min_df=min_df, stop_words=stop_words)

a = vectorized.fit_transform(raw.Question)
a.shape

In [None]:
from sklearn.decomposition import NMF
model = NMF(init="nndsvd",
            n_components=10,
            max_iter=200)

# Get W and H, the factors
W = model.fit_transform(a)
H = model.components_

print("W:", W.shape)
print("H:", H.shape)

Get the list of all terms, ordered so that each term's index matches its column in the document-term matrix.
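
As a small illustration of that mapping, the sketch below inverts a dictionary with the same shape as the vectorizer's `vocabulary_` attribute (term to column index). The dictionary contents are made up for the example.

```python
# Hypothetical vocabulary with the same shape as CountVectorizer.vocabulary_:
# each term maps to the index of its column in the document-term matrix.
vocabulary = {"lincoln": 2, "kangaroo": 1, "violin": 3, "beetle": 0}

# Sorting the terms by their column index yields a list where terms[j]
# is the term counted in column j.
terms = sorted(vocabulary, key=vocabulary.get)

print(terms)
```

Recent scikit-learn versions also expose `get_feature_names_out()` on the vectorizer, which should return the same ordering as an array.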

In [None]:
vectorizer = vectorized

terms = [""] * len(vectorizer.vocabulary_)
for term in vectorizer.vocabulary_.keys():
    terms[vectorizer.vocabulary_[term]] = term
    
# Have a look that some of the terms
terms[-5:]

In [None]:
for topic_index in range(H.shape[0]):  # H.shape[0] is k
    top_indices = np.argsort(H[topic_index,:])[::-1][0:10]
    term_ranking = [terms[i] for i in top_indices]
    print("Topic {}: {}".format(topic_index, ", ".join(term_ranking)))

We can see some popular types of question jokes. To name a few that I have heard:

- Cross the road
- Change a light bulb
- What is the difference b/w A and B
- My favorite joke

Sentiment Analysis

We assign sentiment scores to the questions and answers.

In [None]:
get_polarity = lambda x: TextBlob(x).sentiment.polarity
get_subjectivity = lambda x: TextBlob(x).sentiment.subjectivity

raw['q_polarity'] = raw.Question.apply(get_polarity)
raw['a_polarity'] = raw.Answer.apply(get_polarity)
raw['q_subjectivity'] = raw.Question.apply(get_subjectivity)
raw['a_subjectivity'] = raw.Answer.apply(get_subjectivity)

In [None]:
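plt.figure(figsize=(7, 4))
# sns.distplot is deprecated in recent seaborn releases;
# sns.histplot(..., kde=True) is the suggested replacement.
sns.distplot(raw.q_polarity, label='Question Polarity')
sns.distplot(raw.q_subjectivity, label='Question Subjectivity')
sns.distplot(raw.a_polarity, label='Answer Polarity')
sns.distplot(raw.a_subjectivity, label='Answer Subjectivity')
plt.legend()  # show which curve is which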

Maybe a joke is good if the sentiment changes when the question is answered? Unfortunately there is no way to check this, because this version of the dataset has no joke score or upvote feature.

About the jokes themselves

What can we say about the jokes themselves? Let's look at their length first.

In [None]:
daf = raw.loc[raw.Answer.str.len() < 150]  # There appear to be some outliers in the dataset
sns.distplot(daf.Question.str.len(), label='Question Length')
sns.distplot(daf.Answer.str.len(), label='Answer Length')
plt.legend()  # show which curve is which

In [None]:
raw.loc[raw.Answer.str.len() > 400].shape[0]

We know that answers are generally shorter than questions. Are there questions whose answers are longer than the questions themselves? And the reverse?

Similar results hold. A better comparison of question length versus answer length would be a scatter plot. So far we have plotted the difference, but that loses the exact lengths of Q and A: a pair with lengths 500 and 550 has the same difference as a pair with lengths 10 and 60.
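
The point about differences can be checked with a few numbers. The sketch below uses made-up (question length, answer length) pairs, not values from the dataset, to show that the difference collapses pairs that a scatter plot would keep apart.

```python
# Hypothetical (question length, answer length) pairs, in characters
lengths = [(500, 550), (10, 60), (40, 3)]

for q_len, a_len in lengths:
    print(f"Q={q_len:>3}  A={a_len:>3}  A-Q={a_len - q_len:>4}")

# The first two pairs collapse to the same difference ...
differences = [a_len - q_len for q_len, a_len in lengths]

# ... while counting on the raw lengths still tells the pairs apart.
longer_answers = sum(a_len > q_len for q_len, a_len in lengths)
shorter_answers = sum(a_len < q_len for q_len, a_len in lengths)
print(differences, longer_answers, shorter_answers)
```

Plotting the pairs themselves as points (question length on one axis, answer length on the other) keeps both coordinates visible.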

In [None]:
ql, al = 'Question Length', 'Answer Length'
raw[ql] = raw.Question.str.len()
raw[al] = raw.Answer.str.len()
daf = raw.loc[raw[al] < 250]
sns.jointplot(x=ql, y=al, data=daf, kind='kde', space=0, color='g')

#### Test 2

In [None]:
# https://medium.com/analytics-vidhya/building-a-simple-chatbot-in-python-using-nltk-7c8c8215ac6e

import nltk
import numpy as np
import random
import string
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [None]:
raw = pd.read_csv('S08_question_answer_pairs.txt', encoding='utf-8', delimiter='\t', quotechar='"')
raw.head(5)

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

In [None]:
# Removing Non-ASCII characters
def remove_non_ascii_1(text):
    # Replace every non-ASCII character with a space
    return ''.join([i if ord(i) < 128 else ' ' for i in text])

In [None]:
#### testing several columns

In [None]:
ArticleTitle = raw['ArticleTitle'].astype(str)

# Sentence and word tokens for every row
ArticleTitlesentence_tokens = [nltk.sent_tokenize(text) for text in ArticleTitle]
ArticleTitleword_tokens = [nltk.word_tokenize(text) for text in ArticleTitle]

[ArticleTitlesentence_tokens[:2], ArticleTitleword_tokens[:2]]

In [None]:
Question = raw['Question'].astype(str)

# Sentence and word tokens for every row
Questionsentence_tokens = [nltk.sent_tokenize(text) for text in Question]
Questionword_tokens = [nltk.word_tokenize(text) for text in Question]

[Questionsentence_tokens[:2], Questionword_tokens[:2]]

In [None]:
Answer = raw['Answer'].astype(str)

# Sentence and word tokens for every row
Answersentence_tokens = [nltk.sent_tokenize(text) for text in Answer]
Answerword_tokens = [nltk.word_tokenize(text) for text in Answer]

[Answersentence_tokens[:2], Answerword_tokens[:2]]

In [None]:
DifficultyFromQuestioner = raw['DifficultyFromQuestioner'].astype(str)

# Sentence and word tokens for every row
DifficultyFromQuestionersentence_tokens = [nltk.sent_tokenize(text) for text in DifficultyFromQuestioner]
DifficultyFromQuestionerword_tokens = [nltk.word_tokenize(text) for text in DifficultyFromQuestioner]

[DifficultyFromQuestionersentence_tokens[:2], DifficultyFromQuestionerword_tokens[:2]]

In [None]:
DifficultyFromAnswerer = raw['DifficultyFromAnswerer'].astype(str)

# Sentence and word tokens for every row
DifficultyFromAnswerersentence_tokens = [nltk.sent_tokenize(text) for text in DifficultyFromAnswerer]
DifficultyFromAnswererword_tokens = [nltk.word_tokenize(text) for text in DifficultyFromAnswerer]

[DifficultyFromAnswerersentence_tokens[:2], DifficultyFromAnswererword_tokens[:2]]

In [None]:
ArticleFile = raw['ArticleFile'].astype(str)

# Sentence and word tokens for every row
ArticleFilesentence_tokens = [nltk.sent_tokenize(text) for text in ArticleFile]
ArticleFileword_tokens = [nltk.word_tokenize(text) for text in ArticleFile]

[ArticleFilesentence_tokens[:2], ArticleFileword_tokens[:2]]

In [None]:
#### testing several columns

In [None]:
ArticleTitlecorpus = []
for i in range(0, 1715):
    review = str(raw['ArticleTitle'][i])
    #review = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    ArticleTitlecorpus.append(review)

In [None]:
Questioncorpus = []
for i in range(0, 1715):
    review = str(raw['Question'][i])
    #review = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    Questioncorpus.append(review)

In [None]:
Answercorpus = []
for i in range(0, 1715):
    review = str(raw['Answer'][i])
    #review = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    Answercorpus.append(review)

In [None]:
DifficultyFromQuestionercorpus = []
for i in range(0, 1715):
    review = str(raw['DifficultyFromQuestioner'][i])
    #review = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    DifficultyFromQuestionercorpus.append(review)

In [None]:
DifficultyFromAnswerercorpus = []
for i in range(0, 1715):
    review = str(raw['DifficultyFromAnswerer'][i])
    #review = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    DifficultyFromAnswerercorpus.append(review)

In [None]:
ArticleFilecorpus = []
for i in range(0, 1715):
    review = str(raw['ArticleFile'][i])
    #review = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    ArticleFilecorpus.append(review)

In [None]:
lemmer = nltk.stem.WordNetLemmatizer()

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def lem_tokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

def lem_normalize(text):
    return lem_tokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
GREETING_INPUTS = ('hello', 'hi', 'greetings', 'sup', 'what\'s up', 'hey',)
GREETING_RESPONSES = ['hi', 'hey', '*nods*', 'hi there', 'hello', 'I am glad! You are talking to me']

def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# One question and one answer per row, kept index-aligned
questions = raw['Question'].astype(str).tolist()
answers = raw['Answer'].astype(str).tolist()


def response(user_response):
    # Fit TF-IDF on the corpus questions plus the user's question (last row)
    vectorizer = TfidfVectorizer(tokenizer=lem_normalize, stop_words='english')
    tfidf = vectorizer.fit_transform(questions + [user_response])

    # Similarity of the user's question to every corpus question
    values = cosine_similarity(tfidf[-1], tfidf[:-1]).flatten()
    idx = values.argmax()

    if values[idx] == 0:
        return "Sorry, I don't understand you"
    # Reply with the answer paired with the most similar question
    return answers[idx]

In [None]:
import os

In [None]:
print('BOT: My name is Robo, I will answer your questions about chatbots. If you want to exit, type Bye')

# One question and one answer per row, kept index-aligned
questions = raw['Question'].astype(str).tolist()
answers = raw['Answer'].astype(str).tolist()

while True:
    os.system('clear')
    per_usr = input('[bot] Go ahead! ')
    if per_usr.lower() == 'bye':
        print('BOT: bye!')
        break
    # Exact (case-insensitive) match against the corpus questions
    for idx, pergs in enumerate(questions):
        if per_usr.lower() == pergs.lower():
            print(answers[idx])
            break

    input('Press ENTER to continue:')

In [None]:
raw.head(2)

In [None]:
#### normal continuation

In [None]:
#sentence_tokens = nltk.sent_tokenize(raw)
#word_tokens = nltk.word_tokenize(raw)

#str(raw[i])    

In [None]:
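corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stop-word set once
for i in range(len(raw)):
    review = str(raw['ArticleTitle'][i])
    #review = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

# The nltk tokenizers expect a string, not a DataFrame.
# Assumption: the text to tokenize is the Question column, joined into one string.
all_text = ' '.join(raw['Question'].astype(str))
sentence_tokens = nltk.sent_tokenize(all_text)
word_tokens = nltk.word_tokenize(all_text)

[sentence_tokens[:2], word_tokens[:2]]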

In [None]:
lemmer = nltk.stem.WordNetLemmatizer()

remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)

def lem_tokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]

def lem_normalize(text):
    return lem_tokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
GREETING_INPUTS = ('hello', 'hi', 'greetings', 'sup', 'what\'s up', 'hey',)
GREETING_RESPONSES = ['hi', 'hey', '*nods*', 'hi there', 'hello', 'I am glad! You are talking to me']

def greeting(sentence):
    for word in sentence.split():
        if word.lower() in GREETING_INPUTS:
            return random.choice(GREETING_RESPONSES)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def response(user_response):
    robo_response = ''
    sentence_tokens.append(user_response)
    
    vectorizer = TfidfVectorizer(tokenizer=lem_normalize, stop_words='english')
    tfidf = vectorizer.fit_transform(sentence_tokens)
    
    values = cosine_similarity(tfidf[-1], tfidf)
    idx = values.argsort()[0][-2]
    flat = values.flatten()
    flat.sort()
    req_tfidf = flat[-2]
    
    if req_tfidf == 0:  # compare by value; `is` tests object identity
        robo_response = '{} Sorry, I don\'t understand you'.format(robo_response)
    else:
        robo_response = robo_response + sentence_tokens[idx]
    return robo_response

In [None]:
flag = True
print('BOT: My name is Robo, I will answer your questions about chatbots. If you want to exit, type Bye')

interactions = [
    'hi',
    'what is chatbot?',
    'describe its design, please',
    'what about alan turing?',
    'and facebook?',
    'sounds awesome',
    'bye',
]

while flag:
    user_response = interactions.pop(0)
    print('USER: {}'.format(user_response))
    if user_response != 'bye':  # compare strings by value, not identity
        if user_response in ('thanks', 'thank you'):
            flag = False
            print('BOT: You are welcome...')
        else:
            if greeting(user_response) is not None:
                print('ROBO: {}'.format(greeting(user_response)))
            else:
                print('ROBO: ', end='')
                print(response(user_response))
                sentence_tokens.remove(user_response)
    else:
        flag = False
        print('BOT: bye!')

#### Teste 3

Example given at the meetup session, which works

In [None]:
# chat bot
import os

In [None]:
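# list of questions
perguntas = ['hi', 'Is it going to rain today?']
# list of answers, index-aligned with perguntas
# (placeholder values, assumed for illustration)
respostas = ['hi', "I don't know"]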

In [None]:
while True:
    os.system('clear')
    per_usr = input('[bot] Go ahead! ')
    if per_usr.lower() in [p.lower() for p in perguntas]:
        # Known question: print the answer stored at the same index
        for idx, pergs in enumerate(perguntas):
            if per_usr.lower() == pergs.lower():
                print(respostas[idx])
                break
    else:
        # Unknown question: optionally learn a new question/answer pair
        print("I didn't understand.")
        ens_usr = input('Do you want me to learn it? (yes/no) ')
        if ens_usr.lower() == 'yes':
            perguntas.append(per_usr)
            resp_usr = input('What should I answer: ')
            respostas.append(resp_usr)

    input('Press ENTER to continue:')

In [None]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
# Importing the dataset
#dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

# Importing the dataset
cols = ["ArticleTitle","Question","Answer"]
# on_bad_lines='skip' replaces the removed error_bad_lines=False (pandas >= 1.3)
dataset = pd.read_csv('S08_question_answer_pairs.txt', delimiter = '\t', usecols=cols,
                      quoting = 3, on_bad_lines='skip', low_memory=False)

In [None]:
dataset.head(2)

In [None]:
# Removing Non-ASCII characters
def remove_non_ascii_1(text):
    # Replace every non-ASCII character with a space
    return ''.join([i if ord(i) < 128 else ' ' for i in text])

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [None]:
corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # build the stop-word set once
for i in range(len(dataset)):
    # 'requisicoes' is not a column of this dataset; the Question column is used
    review = str(dataset['Question'][i])
    #review = re.sub('[^a-zA-Z]', ' ', dataset['Question'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)

In [None]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

In [None]:
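# Encode a categorical column as integer labels.
# Assumption: `df`, `cat` and `le` are not defined anywhere above, so the
# ArticleTitle column of `dataset` is encoded here as an example.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dataset['ArticleTitle'] = le.fit_transform(dataset['ArticleTitle'].astype(str))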

In [None]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

In [None]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [None]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

In [None]:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
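# Print both values; a notebook cell only displays its last expression
print('mean accuracy:', accuracies.mean())
print('std of accuracy:', accuracies.std())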

#### links

https://www.kaggle.com/cristianossd/nlp-chatbot-using-nltk

https://www.kaggle.com/quora/question-pairs-dataset

https://www.kaggle.com/stanfordu/stanford-question-answering-dataset

https://www.kaggle.com/jiriroz/qa-jokes

https://www.kaggle.com/bharathsh/stanford-q-a-json-to-clean-dataframe

https://www.kaggle.com/onesz19/scout-script-about-the-jokes

https://www.kaggle.com/karanabhishek/chatbot-try