# Goal 

There are primarily 3 types of chatbots:

* Rule-based: can answer a pre-defined set of statements (questions, chats or requests), and defaults to a base response when an unknown statement is provided. This type of chatbot can be useful and accurate when the conversation topic (and potentially the questions) is known in advance.
* AI-based: these chatbots are trained on a provided corpus and can learn to respond to novel questions by generating responses from that corpus. This type of chatbot can be more useful when the discussion topic is unknown, as is the case with general-purpose chatbots.
* Hybrid: if the statement provided fits the criteria of a pre-defined set of answers, this type of chatbot replies with a pre-defined answer. Otherwise, it can answer in an AI-based method.

Here we want to build a rule-based chatbot focused on customer service.
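To make the rule-based behaviour concrete, here is a minimal sketch of a keyword-driven bot with a default fallback. The intents, keywords and responses are hypothetical and only serve as an illustration; they are not taken from the dataset.

```python
# Hypothetical intents and canned responses, for illustration only.
RESPONSES = {
    "baggage": "Please contact the baggage desk with your claim number.",
    "delay": "We are sorry for the delay. Check the flight status page for updates.",
}
DEFAULT_RESPONSE = "Sorry, I do not understand. An agent will get back to you."

# Hypothetical keyword rules mapping trigger words to an intent.
KEYWORDS = {
    "bag": "baggage",
    "luggage": "baggage",
    "delay": "delay",
    "delayed": "delay",
}


def rule_based_reply(message: str) -> str:
    """Return the canned response of the first matching keyword, else the default."""
    for word in message.lower().split():
        intent = KEYWORDS.get(word.strip(".,!?"))
        if intent is not None:
            return RESPONSES[intent]
    return DEFAULT_RESPONSE


print(rule_based_reply("My bag is missing!"))
print(rule_based_reply("What is the weather like?"))
```

The second call matches no rule and therefore falls back to the base response, which is the defining trait of a rule-based bot.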

We have unlabeled messages between users and an airline company. Is it possible to label user intents without doing it by hand? An intent is the topic of a conversation. We can use the “question” column from the training data to extract intents.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [48]:
# ! pip install langid
# ! pip install gensim
# ! pip install --upgrade numpy
# ! pip uninstall numpy
# ! pip install numpy==1.20.0
# ! pip install wordcloud
# ! pip install keybert
# ! pip install scikit-learn
# ! pip install matplotlib
# nltk.download('words')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')
# ! pip install sentence-transformers

In [119]:
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
import nltk
import langid, re
import pandas as pd
import spacy
import en_core_web_sm
import numpy as np 
from nltk.corpus import stopwords
nltk.download("stopwords")
eng_corpus = set(nltk.corpus.words.words())
nlp = spacy.load("en_core_web_sm")
pd.set_option('display.max_colwidth', 500)
pd.set_option('display.max_rows', 500)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [120]:
path = '/content/drive/MyDrive/datascience/'
data = pd.read_csv(path+'question_responce.csv', index_col='Unnamed: 0')
data.head()

Unnamed: 0,responce,question
603,@115904 We'll be sure to pass along your kind words! #AATeam,@AmericanAir Erica on the lax team is amazing give her a raise ty
605,@115904 Our apologies for the delay in responding to you. Have you made it to LAX? Let us know if you still need assistance.,@AmericanAir Could you have someone on your lax team available to guide me to my gate ASAP
608,"@115905 Aww, that's definitely a future pilot in the making! #HappyHalloween",Ben Tennyson and an American Airlines pilot. 🎃 #trunkortreat #halloween #2017 #diycostume #parenting @americanair … https://t.co/f1nNHQ0iLa https://t.co/lDViDkRdB1
612,@115906 We're sorry for your frustration.,"@AmericanAir Right, but I earned those. I also shouldn’t have to pay to pass them to my own spouse. You need to change your program."
618,@115909 We're glad you got to kick back and enjoy a show while flying! Thanks for your kind words.,"Thank you, @AmericanAir for playing #ThisIsUs and for having great flight attendants on my flight back home!"


# Preprocessing steps 

These messages are really messy and require a lot of specific preprocessing: removing hashtags, URLs, @ mentions and more.
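The order of these cleaning steps matters. The sketch below, using only the standard library and a made-up tweet, compares stripping mentions, hashtags and URLs before punctuation with doing it the other way round. The helper names are hypothetical and independent of the functions defined in this notebook.

```python
import re

MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\S+")
URL = re.compile(r"https?://\S+|www\.\S+")
PUNCT = re.compile(r"[^\w\s]")


def clean_in_order(text: str) -> str:
    """Strip mentions, hashtags and URLs first, then punctuation."""
    for pattern in (MENTION, HASHTAG, URL, PUNCT):
        text = pattern.sub("", text)
    return " ".join(text.split())


def clean_wrong_order(text: str) -> str:
    """Strip punctuation first: the '@', '#' and '://' markers are lost too early."""
    for pattern in (PUNCT, MENTION, HASHTAG, URL):
        text = pattern.sub("", text)
    return " ".join(text.split())


tweet = "@AmericanAir my bag is lost! #fail https://t.co/abc123"
print(clean_in_order(tweet))     # my bag is lost
print(clean_wrong_order(tweet))  # AmericanAir my bag is lost fail httpstcoabc123
```

When punctuation is removed first, the handle, the hashtag text and the URL remnants stay in the sentence and pollute the later embedding step.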

In [121]:
#Remove URLs (http://, https:// and www. forms)
def clean_url(df):
    tag_url= re.compile(r"https?://\S+|www\.\S+")
    df=tag_url.sub(r'',df)
    return df

#Remove HTML tags
def clean_html(df):
    tag_html=re.compile(r'<.*?>')
    df=tag_html.sub(r'',df)
    return df

#Remove all the most recent emojis
def remove_emoji(df):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing, geometric shapes, misc symbols, dingbats, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"   
                               u"\u3030"
                               "]+", flags=re.UNICODE)
                               
    return emoji_pattern.sub(r'', df)


def clean_punctuation(df):
    tag_punct=re.compile(r'[^\w\s]')
    df=tag_punct.sub(r'',df)
    return df


def get_english(df):
    ''' Return 1 if the sentence is in English, 0 otherwise'''
    return 1 if langid.classify(df)[0] == 'en' else 0


def remove_stops(df):
    custom_stopwords = set(stopwords.words("english") + ['amp', 'aa', 'lax', 'flight', 'flying', 'plane', 'flights', 'fly', 'american', 'airlines', 'american_airlines'])
    return ' '.join([t for t in word_tokenize(df) if t not in custom_stopwords])
  
    
def remove_names(df):
    tagged_sentence = nltk.tag.pos_tag(df.split())
    edited_sentence = [word for word,tag in tagged_sentence if tag != 'NNP' and tag != 'NNPS']
    df_final = ' '.join(edited_sentence)
    return df_final


def remove_non_english(df): 
    only_english = " ".join(w for w in nltk.wordpunct_tokenize(df) \
             if w.lower() in eng_corpus)
    return only_english

In [122]:
# Apply preprocessing (the order is very important here)
data['question']=data['question'].apply(lambda x: re.sub(r'@[\w]+','',x)) # delete @ mentions
data['question']=data['question'].apply(lambda x: re.sub(r'#[^\s]+','',x)) # delete hashtags

data['question']=data['question'].apply(lambda x: clean_url(x))
data['question']=data['question'].apply(lambda x: clean_html(x))
data['question']=data['question'].apply(lambda x: remove_emoji(x))
data['question']=data['question'].apply(lambda x: clean_punctuation(x))
data['question'] = data['question'].apply(lambda x :remove_names(x))

# marks english questions with 1 else 0
data['langue']=data['question'].apply(lambda x: get_english(x)) 
print(data.shape)
data = data.query('langue == 1').copy()  # copy to avoid SettingWithCopyWarning on the assignments below
print(data.shape)
data['question']=data['question'].apply(lambda x: remove_stops(x.lower()))
# data['question']=data['question'].apply(lambda x: remove_non_english(x))

# remove empty question
data.drop(data[data.question == ""].index, inplace=True)
print(data.shape)
data.head()

(1852, 3)
(1817, 3)
(1795, 3)


Unnamed: 0,responce,question,langue
603,@115904 We'll be sure to pass along your kind words! #AATeam,team amazing give raise ty,1
605,@115904 Our apologies for the delay in responding to you. Have you made it to LAX? Let us know if you still need assistance.,could someone team available guide gate,1
608,"@115905 Aww, that's definitely a future pilot in the making! #HappyHalloween",pilot,1
612,@115906 We're sorry for your frustration.,right earned also shouldnt pay pass spouse need change program,1
618,@115909 We're glad you got to kick back and enjoy a show while flying! Thanks for your kind words.,playing great attendants back home,1


In [123]:
# spaCy lets us lemmatize the text and filter tokens by POS tag at the same time
def lemmat_filter(texts): # , allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']
    doc = nlp(" ".join(texts)) 
    return [token.lemma_ for token in doc] # if token.pos_ in allowed_postags

data['question']=data['question'].apply(lambda x: lemmat_filter(word_tokenize(x)))
data['question'] = data['question'].apply(lambda x: " ".join(x))

data.head()

Unnamed: 0,responce,question,langue
603,@115904 We'll be sure to pass along your kind words! #AATeam,team amazing give raise ty,1
605,@115904 Our apologies for the delay in responding to you. Have you made it to LAX? Let us know if you still need assistance.,could someone team available guide gate,1
608,"@115905 Aww, that's definitely a future pilot in the making! #HappyHalloween",pilot,1
612,@115906 We're sorry for your frustration.,right earn also should not pay pass spouse need change program,1
618,@115909 We're glad you got to kick back and enjoy a show while flying! Thanks for your kind words.,play great attendant back home,1


## With Embedding and clustering using Bert

Clustering is another very common approach to unsupervised learning problems. We need to encode the text data before clustering it. There are several alternatives here, such as document embeddings with Doc2Vec or transformer-based embeddings built on BERT-style models.

Here, we use the pretrained `all-mpnet-base-v2` model from the sentence-transformers package to encode each cleaned question as a dense vector. The package also offers several multilingual models.
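Sentence embeddings are useful for clustering because semantically close sentences end up with vectors that point in similar directions, which is usually measured with cosine similarity. The sketch below illustrates this with made-up 3-dimensional vectors; real sentence embeddings have hundreds of dimensions, and the values here are assumptions chosen for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings (hypothetical values).
bag_lost = [0.9, 0.1, 0.0]
luggage_missing = [0.8, 0.2, 0.1]
great_crew = [0.0, 0.2, 0.9]

print(cosine_similarity(bag_lost, luggage_missing))  # close to 1
print(cosine_similarity(bag_lost, great_crew))       # close to 0
```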

In [124]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')
embeddings = model.encode(data.question.tolist(), show_progress_bar=True)

Batches:   0%|          | 0/57 [00:00<?, ?it/s]

In [125]:
data['bert_vectors'] = list(embeddings)
data.head()

Unnamed: 0,responce,question,langue,bert_vectors
603,@115904 We'll be sure to pass along your kind words! #AATeam,team amazing give raise ty,1,"[-0.029498337, -0.014250554, -0.009332876, -0.023147868, 0.04312467, 0.05625785, -0.09698202, 0.016546456, -0.03381597, -0.07710057, 0.04916483, 0.040066127, -0.02481182, 0.12294168, 0.02427972, -0.023094539, -0.009931857, 0.036658112, 0.040020887, -0.02842706, 0.034125756, -0.019560639, 0.009754792, 0.011753749, -0.01775819, 0.0024813656, 0.026285049, 0.014820078, -0.04183885, -0.036151603, -0.024738379, 0.0048107253, -0.02827559, -0.004314978, 1.6164882e-06, -0.028584339, -0.018802507, -0...."
605,@115904 Our apologies for the delay in responding to you. Have you made it to LAX? Let us know if you still need assistance.,could someone team available guide gate,1,"[0.015011241, 0.04078075, -0.033850897, 0.04303636, -0.012707921, 0.008320153, 0.015971292, -0.00875302, 0.018422062, -0.03187447, 0.023016598, 0.020789329, 0.0016196866, 0.034994166, -0.015649967, -0.06637329, -0.004990005, -0.026310528, -0.017631143, -0.03052558, 0.0029296796, 0.050501507, -0.060690213, 0.009163956, 0.02030041, -0.011259294, 0.002413185, 0.032271035, 0.050747674, 0.008557845, -0.010444945, -0.025191272, -0.023651846, -0.02090349, 1.5040027e-06, 0.0037365672, -0.02965517, 0..."
608,"@115905 Aww, that's definitely a future pilot in the making! #HappyHalloween",pilot,1,"[-0.004280298, 0.06640515, 0.022615544, 0.0043656905, -0.041951634, -0.011471828, -0.013293675, 0.020290699, -0.100495495, -0.017038634, 0.05247038, -0.039941724, -0.0010023431, 0.049924318, 0.05782291, 0.008859276, -0.016875455, -0.035797328, 0.006330242, -0.015974293, -0.019689674, 0.029963912, -0.04174347, 0.00894954, 0.034827016, -0.010962871, 0.042294472, 0.003940598, -0.005653465, 0.049489405, 0.054256346, -0.002337655, 0.07346018, 0.029117353, 1.9630077e-06, 0.041573256, 0.047529284, ..."
612,@115906 We're sorry for your frustration.,right earn also should not pay pass spouse need change program,1,"[-0.040791407, 0.028858567, -0.015226508, 0.0038935214, -0.0067359908, 0.042162698, 0.020842206, 0.050628725, 0.038898896, -0.023537043, 0.032367203, -0.020873092, 0.028165994, 0.02617366, -0.03434817, 0.03087191, -0.05619934, 0.026540132, 0.0018909733, -0.014377415, 0.017831897, 0.0107407635, -0.035767987, 0.026542976, -0.029448293, -0.047546502, 0.05899227, -0.003827479, 0.02989908, 0.025214143, 0.08680133, 0.05347762, -0.014611753, -0.001414392, 1.5345673e-06, -0.01747745, -0.008379599, 0..."
618,@115909 We're glad you got to kick back and enjoy a show while flying! Thanks for your kind words.,play great attendant back home,1,"[0.004833184, 0.029627716, -0.017063215, 0.0039171367, -0.040716927, 0.023051921, -0.011337438, -0.017963113, -0.0368129, -0.021679124, 0.014425822, -0.057148132, 0.008131054, 0.031686913, 0.012578369, -0.006461406, -0.03072653, -0.005614795, 0.016200649, -0.04629601, -0.021347158, 0.0010469275, -0.014734916, 0.007145456, -0.033171482, 0.03879969, 0.07677206, 0.021947743, 0.055790663, 0.030053912, 0.09779764, 0.0065991073, 0.03432564, 0.009819334, 1.8311924e-06, 0.024444006, 0.0066780057, -0..."


### Clustering steps

We want to make sure that documents with similar topics are clustered together so that we can find the topics within these clusters. <br> 
Before doing so, it often helps to lower the dimensionality of the embeddings, as many clustering algorithms handle high dimensionality poorly (curse of dimensionality: distance measures such as Euclidean and Manhattan become less discriminative in high dimensions). A common pipeline reduces the document embeddings with UMAP and then clusters them with HDBSCAN, a density-based algorithm that works quite well with UMAP since UMAP maintains a lot of local structure even in lower-dimensional space.

We can skip the dimensionality reduction step if we use a clustering algorithm that copes with high dimensionality, like a cosine-based k-means. The cells below cluster the raw embeddings directly, first with DBSCAN and then with MiniBatchKMeans.
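One common way to obtain a cosine-based k-means is to L2-normalize the embeddings and then run ordinary Euclidean k-means, because for unit vectors the squared Euclidean distance equals `2 - 2 * cosine similarity`. The sketch below checks this identity on a toy pair of vectors; it illustrates the relationship and is not the clustering code used in this notebook.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])

cosine = dot(u, v)  # for unit vectors the dot product is the cosine similarity
print(squared_euclidean(u, v), 2 - 2 * cosine)  # the two values agree
```

Ranking points by Euclidean distance on normalized vectors therefore gives the same ordering as ranking them by cosine similarity.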

#### With DBSCAN 

In [126]:
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score

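losses = []
silhouette_scores = []
# DBSCAN derives the number of clusters from eps and min_samples, so there is
# no cluster count to sweep over: a single fit is enough.
dbscan = DBSCAN(eps=0.3, min_samples=10)
dbscan.fit(embeddings)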


In [127]:
# print(len(losses))
# plt.plot(losses)

Explore interesting subspaces.

In [128]:
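# Note: DBSCAN has no n_clusters or random_state parameter and no score() method;
# a sweep over cluster counts like this one only works with a k-means estimator.
# for cluster in np.arange(20, 24):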
#     dbscan = DBSCAN(n_clusters = cluster, random_state=1)
#     dbscan.fit(embeddings)
#     print(f'Inertia: {dbscan.score(embeddings)}')
#     losses.append(dbscan.score(embeddings))
#     print(f'Silhouette scores: {silhouette_score(embeddings, dbscan.labels_)}')
#     silhouette_scores.append(silhouette_score(embeddings, dbscan.labels_))

In [129]:
# plt.plot(silhouette_scores)

In [130]:
# for cluster in np.arange(45, 50):
#     dbscan = DBSCAN(n_clusters = cluster, random_state=1)
#     dbscan.fit(embeddings)
#     print(f'Inertia: {dbscan.score(embeddings)}')
#     losses.append(dbscan.score(embeddings))
#     print(f'Silhouette scores: {silhouette_score(embeddings, dbscan.labels_)}')
#     silhouette_scores.append(silhouette_score(embeddings, dbscan.labels_))

In [131]:
# plt.plot(silhouette_scores)

It looks like 20 and 47 clusters are interesting values to explore further.
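The silhouette score used to compare cluster counts rewards points that are close to their own cluster and far from the nearest other cluster. Below is a simplified pure-Python sketch of the metric on toy 2-D points. It is a re-implementation for illustration, not scikit-learn's code, and it assumes at least two clusters and Euclidean distance.

```python
import math
from collections import defaultdict


def silhouette(points, labels):
    """Mean silhouette coefficient; requires at least two distinct labels."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(points, [0, 0, 1, 1]))  # well separated: close to 1
print(silhouette(points, [0, 1, 0, 1]))  # poorly grouped: negative
```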

In [132]:
# print('Inertia:')
# plt.plot(x, losses)

In [133]:
# print('Silhouette score:')
# plt.plot(x, silhouette_scores)

In [134]:
# from sklearn.cluster import DBSCAN
# from sklearn.metrics import silhouette_score

# dbscan = DBSCAN(algorithm='auto', eps=0.3, leaf_size=30, metric='euclidean',
#        metric_params=None, min_samples=10, n_jobs=None, p=None)
# dbscan.fit(embeddings)

# print(dbscan.labels_)


# labels = dbscan.labels_
 
# # Number of clusters in labels, ignoring noise if present.
# n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# n_noise_ = list(labels).count(-1)
 
 
# print('Estimated number of clusters: %d' % n_clusters_)

In [135]:
from sklearn.cluster import MiniBatchKMeans

random_state = 0
cls = MiniBatchKMeans(n_clusters=35, random_state=random_state)
cls.fit(embeddings)

# cls.predict(embeddings)

print(cls.labels_)

[24  4 22 ... 21 21 13]


In [136]:
# data['bert_cls_clusters'] = dbscan.labels_
# data['bert_dbscan_47clusters'] = dbscan_47.labels_
data['cls_cluster'] = cls.labels_
data.head()

Unnamed: 0,responce,question,langue,bert_vectors,cls_cluster
603,@115904 We'll be sure to pass along your kind words! #AATeam,team amazing give raise ty,1,"[-0.029498337, -0.014250554, -0.009332876, -0.023147868, 0.04312467, 0.05625785, -0.09698202, 0.016546456, -0.03381597, -0.07710057, 0.04916483, 0.040066127, -0.02481182, 0.12294168, 0.02427972, -0.023094539, -0.009931857, 0.036658112, 0.040020887, -0.02842706, 0.034125756, -0.019560639, 0.009754792, 0.011753749, -0.01775819, 0.0024813656, 0.026285049, 0.014820078, -0.04183885, -0.036151603, -0.024738379, 0.0048107253, -0.02827559, -0.004314978, 1.6164882e-06, -0.028584339, -0.018802507, -0....",24
605,@115904 Our apologies for the delay in responding to you. Have you made it to LAX? Let us know if you still need assistance.,could someone team available guide gate,1,"[0.015011241, 0.04078075, -0.033850897, 0.04303636, -0.012707921, 0.008320153, 0.015971292, -0.00875302, 0.018422062, -0.03187447, 0.023016598, 0.020789329, 0.0016196866, 0.034994166, -0.015649967, -0.06637329, -0.004990005, -0.026310528, -0.017631143, -0.03052558, 0.0029296796, 0.050501507, -0.060690213, 0.009163956, 0.02030041, -0.011259294, 0.002413185, 0.032271035, 0.050747674, 0.008557845, -0.010444945, -0.025191272, -0.023651846, -0.02090349, 1.5040027e-06, 0.0037365672, -0.02965517, 0...",4
608,"@115905 Aww, that's definitely a future pilot in the making! #HappyHalloween",pilot,1,"[-0.004280298, 0.06640515, 0.022615544, 0.0043656905, -0.041951634, -0.011471828, -0.013293675, 0.020290699, -0.100495495, -0.017038634, 0.05247038, -0.039941724, -0.0010023431, 0.049924318, 0.05782291, 0.008859276, -0.016875455, -0.035797328, 0.006330242, -0.015974293, -0.019689674, 0.029963912, -0.04174347, 0.00894954, 0.034827016, -0.010962871, 0.042294472, 0.003940598, -0.005653465, 0.049489405, 0.054256346, -0.002337655, 0.07346018, 0.029117353, 1.9630077e-06, 0.041573256, 0.047529284, ...",22
612,@115906 We're sorry for your frustration.,right earn also should not pay pass spouse need change program,1,"[-0.040791407, 0.028858567, -0.015226508, 0.0038935214, -0.0067359908, 0.042162698, 0.020842206, 0.050628725, 0.038898896, -0.023537043, 0.032367203, -0.020873092, 0.028165994, 0.02617366, -0.03434817, 0.03087191, -0.05619934, 0.026540132, 0.0018909733, -0.014377415, 0.017831897, 0.0107407635, -0.035767987, 0.026542976, -0.029448293, -0.047546502, 0.05899227, -0.003827479, 0.02989908, 0.025214143, 0.08680133, 0.05347762, -0.014611753, -0.001414392, 1.5345673e-06, -0.01747745, -0.008379599, 0...",13
618,@115909 We're glad you got to kick back and enjoy a show while flying! Thanks for your kind words.,play great attendant back home,1,"[0.004833184, 0.029627716, -0.017063215, 0.0039171367, -0.040716927, 0.023051921, -0.011337438, -0.017963113, -0.0368129, -0.021679124, 0.014425822, -0.057148132, 0.008131054, 0.031686913, 0.012578369, -0.006461406, -0.03072653, -0.005614795, 0.016200649, -0.04629601, -0.021347158, 0.0010469275, -0.014734916, 0.007145456, -0.033171482, 0.03879969, 0.07677206, 0.021947743, 0.055790663, 0.030053912, 0.09779764, 0.0065991073, 0.03432564, 0.009819334, 1.8311924e-06, 0.024444006, 0.0066780057, -0...",12


In [137]:
data.cls_cluster.value_counts()

29    125
28     93
15     88
20     79
21     70
10     70
6      66
8      66
33     61
7      60
17     58
24     57
30     56
32     55
12     54
31     50
4      50
13     50
1      49
25     49
16     47
9      47
3      45
23     41
19     41
18     40
2      35
26     33
34     31
27     28
22     27
5      25
14     21
0      15
11     13
Name: cls_cluster, dtype: int64

Topics look much more balanced with this MiniBatchKMeans clustering.

In [138]:
# data.to_csv(path+'k_means_clusters.csv')

## Intents extraction

In [139]:
import collections


def most_common(lst, n_words):
    
    counter=collections.Counter(lst)
    return counter.most_common(n_words)

def extract_labels(category_docs):
    """
    category_docs: list of documents, all from the same category or clustering
    """

    verbs = []
    dobjs = []
    nouns = []
    adjs = []
    
    verb = ''
    dobj = ''
    noun1 = ''
    noun2 = ''

    # for each document, append verbs, dobjs, nouns, and adjectives to 
    # running lists for whole cluster
    for i in range(len(category_docs)):
        doc = nlp(category_docs[i])
        for token in doc:
            if not token.is_stop:
                if token.dep_ == 'ROOT':
                    verbs.append(token.text.lower())
                elif token.dep_=='dobj':
                    dobjs.append(token.lemma_.lower())
                elif token.pos_=='NOUN':
                    nouns.append(token.lemma_.lower())     
                elif token.pos_=='ADJ':
                    adjs.append(token.lemma_.lower())
    
    # take most common words of each form
    if len(verbs) > 0:
        verb = most_common(verbs, 1)[0][0]
    if len(dobjs) > 0:
        dobj = most_common(dobjs, 1)[0][0]
    if len(nouns) > 0:
        noun1 = most_common(nouns, 1)[0][0]
    if len(set(nouns)) > 1:
        noun2 = most_common(nouns, 2)[1][0]
    
    # concatenate the most common verb-dobj-noun1-noun2 (if they exist)
    label_words = [verb, dobj]
    
    for word in [noun1, noun2]:
        if word not in label_words:
            label_words.append(word)

    # drop every empty slot (a plain remove('') would only drop the first one)
    label_words = [word for word in label_words if word]
    
    label = '_'.join(label_words)
    
    return label

For BERT/MiniBatchKmeans:

In [140]:
# data['question'] = data['question'].apply(lambda x: " ".join(x))

intents = {}
for cluster in data['cls_cluster'].unique().tolist():
    intents['cluster_' + str(cluster)] = extract_labels(data[data['cls_cluster']==cluster]['question'].tolist())

In [141]:
intents

{'cluster_0': 'come_plane_wifi_price',
 'cluster_1': 'help_thing_love_time',
 'cluster_10': 'try_help_problem_record',
 'cluster_11': 'hey_notch_board_playlist',
 'cluster_12': 'carry_bag_passenger',
 'cluster_13': 'charge_use_class',
 'cluster_14': 'sky_view_day_morning',
 'cluster_15': 'check_bag',
 'cluster_16': 'offer_option_voucher_food',
 'cluster_17': 'try_work_year_family',
 'cluster_18': 'wait_discuss_time_hour',
 'cluster_19': 'thank_thanksgiving_today_weekend',
 'cluster_2': 'day_plenty_today_morning',
 'cluster_20': 'pay_ticket_day',
 'cluster_21': 'receive_service_customer',
 'cluster_22': 'pilot_pilot_time',
 'cluster_23': 'mile_use_flyer',
 'cluster_24': 'thank_trip_today_day',
 'cluster_25': 'delay_delay_hour',
 'cluster_26': 'upgrade_upgrade_list',
 'cluster_27': 'delay_connection_hour_time',
 'cluster_28': 'seat_seat_row',
 'cluster_29': 'want_honor_time_year',
 'cluster_3': 'need_trip_day_pilot',
 'cluster_30': 's_airport_hour',
 'cluster_31': 'wait_seat_hour_minute'

In [142]:
data[data['cls_cluster']==30].head(50)

Unnamed: 0,responce,question,langue,bert_vectors,cls_cluster
1334,@116145 An airport agent will be happy to take a look at your bag to assist with a claim but damage must be assessed in person.,long go airport,1,"[-0.018498687, 0.014077653, 0.013161911, 0.052902635, 0.008852411, -0.01767223, 0.0052313097, -0.006477188, -0.06387926, 0.0004113048, -0.032108, -0.004744607, 0.057753626, 0.020941779, 0.027012048, -0.012972196, -0.0038482244, -0.044839483, -0.05014354, -0.025041005, -0.020902848, 0.019220905, -0.022263318, 0.009113784, 0.036568616, -0.037376318, -0.030225636, 0.008754982, -0.00996854, -0.03649717, -0.032541893, 0.039727982, 0.0004858537, 0.05904598, 1.5778048e-06, -0.0041134884, 0.02417710...",30
7266,@117992 No problem at all. It’s a huge honor for us to support our troops.,extend courtesy airport spoil ur patriotism,1,"[-0.018746596, 0.023771912, -0.013543056, 0.009135544, -0.031010361, -0.02913432, -0.024022296, -0.020470044, -0.026584271, 0.029034948, 0.033309348, 0.0019057052, 0.038348354, -0.008907932, -0.034982193, 0.042701405, 0.01595521, -0.054300588, -0.013550832, -0.02487417, -0.03523684, 0.03557808, -0.01935094, -0.017879402, 0.016784783, 0.042360917, 0.04378834, 0.0116741005, 0.06694592, 0.0069498103, 0.0077413768, 0.003554901, 0.04696976, -0.035744675, 1.5924353e-06, -0.016101038, 0.035786968, ...",30
8502,"@118402 We're sorry that this has been your experience, is there some way we can help?",believe use favorite airline take one time leave bad taste mouth smh,1,"[0.0073497957, 0.06561619, -0.010235273, -0.033882786, -0.017940542, -0.04746576, 0.004492285, -0.012064098, 0.019421171, -0.006845434, 0.004186897, -0.0168757, 0.03102104, 0.07082943, 0.03488846, -0.002948817, 0.0032606446, -0.02843467, -0.016047733, -0.044071004, -0.056757618, -0.0029221077, -0.084189154, 0.010772, 0.019722754, -0.022841386, -0.019603958, 0.0068938513, 0.03745781, -0.057330497, 0.067353144, 0.029650275, 0.012393385, 0.07402463, 1.3238256e-06, 0.02114427, -0.0032266711, 0.0...",30
9559,"@118749 It is indeed and we're so proud! By the way, our team in ORD loves you too, Matt. #AAfamily",amazing picture country flagship carrier jet,1,"[-0.0027655705, 0.042370435, -0.0054862415, 0.056895252, 0.020611161, -0.02170144, -0.048269384, -0.0067565097, -0.0657471, 0.020278685, -0.018346546, 0.0027443608, 0.03240814, 0.019775469, 0.062918745, -0.053669717, 0.018827911, -0.010153792, -0.0033987912, 0.013112178, -0.0025931702, 0.02685715, -0.018566633, -0.0066994666, 0.012546232, -0.011911684, 0.0019304403, 0.0075110076, -0.025519378, 0.006748888, 0.0470337, -0.0076355585, 0.018613875, 0.025711946, 1.4800911e-06, -0.035939436, -0.00...",30
16690,@120713 Sorry for the wait. We're working hard to get everyone taken care of as quickly as possible.,airport 2a line checkini 2h30 takeoff,1,"[-0.03464372, -0.020069322, -0.006347061, 0.035936665, 0.0047481, -0.015610999, 0.0011046374, 0.021123337, 0.0046546804, 0.029106135, 0.029737802, -0.036027335, 0.040535476, 0.07923973, -0.012258624, 0.046871476, -0.028352138, -0.029350422, -0.068027034, -0.0040296945, -0.037234552, 0.06479944, -0.053793572, -0.0016235859, -0.022457834, 0.052898515, 0.03539319, -0.010467232, 0.050464727, -0.029336384, -0.04260744, 0.035773937, 0.08527472, 0.014032161, 1.5490797e-06, -0.038027037, 0.012697881...",30
18893,@121307 You're good to go as long as you have your British passport and return/onward ticket.,question travel transit need esta visa hold british passport,1,"[0.018417714, -0.05860327, -0.07479403, -0.035385024, -0.034019664, -0.0057693035, -0.017722279, 0.00973593, 0.0068424908, -0.027044578, -0.01454753, 0.009360624, 0.010402992, 0.033269934, 0.002474868, -0.016039718, 0.033147834, 0.0062395185, 0.005377049, -0.009888898, 0.024659485, -0.018888086, 0.0031870543, -0.015175114, 0.029093748, 0.027782485, 0.006833736, 0.00934302, 0.059537154, 0.09215866, 0.013032119, 0.039033804, 0.040637992, -0.06054938, 1.0227052e-06, 0.030592427, 0.060125865, -0...",30
28259,@123559 We currently have an estimated departure time of 7a out of LAX. It shouldn't be too long for us to get in the air. Thanks for your patience.,deplane tell bring boarding pass except website can not retrieve mobile pass,1,"[0.008931053, -0.0017154275, -0.009416699, -0.019530326, 0.010560251, -0.014514201, 0.050713293, 0.014768281, -0.018250076, -0.005784422, 0.014398927, -0.018862268, 0.07682693, 0.021573147, -0.0088352095, 0.05239705, -0.071110554, -0.02196478, -0.011405037, -0.014634074, -0.027509863, 0.04320184, 0.02307491, -0.019276021, 0.035696395, 0.023270663, 0.045323472, -0.009210341, 0.06521782, -0.06112886, 0.05539188, -0.009256488, 0.028548244, -0.041954048, 1.2232825e-06, -0.0023430812, 0.02209633,...",30
28805,@124030 Restrictions are shared on the right side of the purchase screen before purchase. Please contact Orbitz for more information.,1 there s restriction site 50 2 ticket say nee check carryon bag ticket counter oppose gate way,1,"[0.03446045, -0.05881391, -0.0142389955, 0.04366784, 0.015018857, -0.03812114, 0.022517327, -0.011215414, 0.0014440534, 0.031007864, -0.010252113, 0.021726053, 0.015500064, 0.07190354, 0.0027980306, -0.007846658, -0.0051360717, -0.07008875, -0.045057446, -0.029275378, -0.029969677, 0.029990874, -0.03309901, -0.00073299283, 0.038079247, -0.00011286123, 0.016417122, -0.014043083, 0.01344327, -0.09322621, 0.0126157645, -0.007300001, 0.06816212, -0.01566288, 1.2736404e-06, -0.043900296, -0.01146...",30
35153,@125896 We understand how tiring these long trips are and we'll have you in the air as soon as we can.,can not airport much longer get 15 hour sleepy hungry,1,"[-0.01571792, 0.02725907, -0.015857834, -0.024296487, 0.046185967, 0.0067706257, -0.047061358, 0.044143833, -0.030407894, -0.043864504, 0.013263362, -0.011751255, -0.0014907338, 0.012616465, 0.022519555, 0.0355324, -0.0011076091, -0.009783779, -0.046055008, -0.016949747, -0.06396617, -0.011161332, -0.040486246, 0.029704751, 0.050159965, -0.0003638257, 0.08338556, 0.021930514, -0.000431306, 0.006648303, 0.03357894, 0.060112257, -0.00030711695, -0.040697712, 1.6174137e-06, 0.012823138, 0.00107...",30
50643,@130335 Boarding passes aren't issued until you check-in 24 hrs. of your flight.,later still can not get boarding pass website,1,"[0.012151977, 0.011500896, -0.005311244, -0.028977532, -0.0048077493, -0.009708648, 0.026804991, 0.025596851, -0.028019806, -0.0120775, 0.012514566, -0.07601488, 0.056285225, 0.018643847, -0.035214398, 0.039545562, -0.017587097, -0.043782778, -0.018371347, 0.008367332, -0.05056124, 0.03650797, -0.014190113, -0.019501535, 0.009865173, 0.039217804, 0.023493996, -0.038983136, 0.06405238, -0.053563, 0.05400351, 0.018265994, 0.002052421, -0.038481865, 1.1350954e-06, -0.0068734717, 0.028448297, 0....",30


However, this approach relies on statistical properties of the text (such as word counts) and not so much on semantic similarity. Let's try something a bit more advanced.
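To illustrate the limitation of a purely count-based label, here is a minimal sketch that names a cluster after its most frequent words. The function name and the sample documents are hypothetical, and it is a simplified stand-in for the spaCy-based labelling above.

```python
from collections import Counter


def label_from_counts(docs, n_words=3):
    """Join the n most frequent words across a cluster's documents into a label."""
    counts = Counter(word for doc in docs for word in doc.split())
    return "_".join(word for word, _ in counts.most_common(n_words))


# Hypothetical cluster of already cleaned questions.
cluster_docs = ["check bag gate", "bag lose check", "bag damage"]
print(label_from_counts(cluster_docs, n_words=2))  # bag_check

# Synonyms are counted separately, so the shared topic (baggage) is missed.
synonym_docs = ["bag lose", "luggage lose", "suitcase lose"]
print(label_from_counts(synonym_docs, n_words=1))  # lose
```

In the second cluster every message is about baggage, yet no single baggage word dominates the counts, which is where an embedding-based keyword extractor can help.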


### Intents extraction with Bert

In [143]:
from keybert import KeyBERT

kw_model = KeyBERT()

In [144]:
def create_patterns(question):
    
    import operator
    
    two = kw_model.extract_keywords(question, keyphrase_ngram_range=(1, 2),stop_words=None)
    three = kw_model.extract_keywords(question, keyphrase_ngram_range=(1, 3),stop_words=None)
    four = kw_model.extract_keywords(question, keyphrase_ngram_range=(1, 4),stop_words=None)

    final_patterns = two + three + four
    if len(final_patterns) != 0 :
        result = max(final_patterns,key=operator.itemgetter(1))[0]
    else :
        result = ''

    return result


def create_tags(question):
    
    import operator
    tag = kw_model.extract_keywords(question, keyphrase_ngram_range=(1, 1),stop_words=None)
    if len(tag) != 0 :
        result = max(tag,key=operator.itemgetter(1))[0]
    else :
        result = ''
    return result

In [145]:
from tqdm import tqdm

intents = {}
for cluster in tqdm(data['cls_cluster'].unique().tolist()):
    intents['cluster_' + str(cluster)] = create_patterns(re.sub(r"\d+", " ", " ".join(data[data['cls_cluster']==cluster]['question'].tolist())))

100%|██████████| 35/35 [06:32<00:00, 11.20s/it]


In [146]:
intents

{'cluster_0': 'least free decent wifi',
 'cluster_1': 'be never pron all',
 'cluster_10': 'clearly try screen lock',
 'cluster_11': 'absolute good month playlist',
 'cluster_12': 'bag plenty room attendant',
 'cluster_13': 'class customer pay extra',
 'cluster_14': 'wonderful club high sky',
 'cluster_15': 'bag check gate',
 'cluster_16': 'get food poisoning inflight',
 'cluster_17': 'issue family vacation empathetic',
 'cluster_18': 'awesome customer service treat',
 'cluster_19': 'board happy thanksgiving thank',
 'cluster_2': 'morning min early hello',
 'cluster_20': 'stillonholdthey make booking try',
 'cluster_21': 'get good customer service',
 'cluster_22': 'pilot kinda need pilot',
 'cluster_23': 'membership way trip',
 'cluster_24': 'today great crew need',
 'cluster_25': 'delay hour provide',
 'cluster_26': 'upgrade like upgrade mile',
 'cluster_27': 'issue online system hour',
 'cluster_28': 'treat like upgrade seat',
 'cluster_29': 'give grief thing go',
 'cluster_3': 'witho

In [147]:
from tqdm import tqdm

tags = {}
for cluster in tqdm(data['cls_cluster'].unique().tolist()):
    tags['cluster_' + str(cluster)] = create_tags(re.sub(r"\d+", " ", " ".join(data[data['cls_cluster']==cluster]['question'].tolist())))

100%|██████████| 35/35 [00:27<00:00,  1.27it/s]


In [148]:
tags

{'cluster_0': 'wifi',
 'cluster_1': 'pron',
 'cluster_10': 'screen',
 'cluster_11': 'summer',
 'cluster_12': 'airline',
 'cluster_13': 'surcharge',
 'cluster_14': 'sky',
 'cluster_15': 'baggage',
 'cluster_16': 'meal',
 'cluster_17': 'holiday',
 'cluster_18': 'crew',
 'cluster_19': 'thankful',
 'cluster_2': 'morning',
 'cluster_20': 'booking',
 'cluster_21': 'customer',
 'cluster_22': 'piloting',
 'cluster_23': 'club',
 'cluster_24': 'senior',
 'cluster_25': 'delay',
 'cluster_26': 'upgrade',
 'cluster_27': 'scheduling',
 'cluster_28': 'seating',
 'cluster_29': 'grief',
 'cluster_3': 'airline',
 'cluster_30': 'airport',
 'cluster_31': 'taxiway',
 'cluster_32': 'pilot',
 'cluster_33': 'helpfulinvaluable',
 'cluster_34': 'delay',
 'cluster_4': 'gate',
 'cluster_5': 'send',
 'cluster_6': 'cancelling',
 'cluster_7': 'airline',
 'cluster_8': 'attendant',
 'cluster_9': 'airline'}

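It's hard to tell, but I think the first method (multi-word patterns from create_patterns) works best here. Labels such as 'bag check gate' or 'delay hour provide' carry more context than a single keyword, and the keyword method gives the same tag, 'airline', to four different clusters (3, 7, 9 and 12).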
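
To make this comparison less of a judgment call, the two labelings can be put side by side, one row per cluster. This is a minimal sketch: `compare_labels` is a hypothetical helper, and the sample dictionaries are a three-cluster excerpt copied from the outputs above, not the full `intents` and `tags`.

```python
import pandas as pd

def compare_labels(intents, tags):
    """Return one row per cluster with both candidate labels."""
    table = pd.DataFrame({"pattern": pd.Series(intents), "keyword": pd.Series(tags)})
    table.index.name = "cluster"
    return table.sort_index()

# Excerpt of the two dictionaries printed above (not the full output).
intents_sample = {"cluster_0": "least free decent wifi",
                  "cluster_15": "bag check gate",
                  "cluster_25": "delay hour provide"}
tags_sample = {"cluster_0": "wifi", "cluster_15": "baggage", "cluster_25": "delay"}

comparison = compare_labels(intents_sample, tags_sample)
print(comparison)
```

In the notebook, `compare_labels(intents, tags)` would give the same table for all 35 clusters.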

## Chatbot

In [149]:
def process_input(question):
 
    import re
    question = re.sub(r'@[\w]+', '', question)  # delete @mentions
    question = re.sub(r'#[^\s]+', '', question)  # delete hashtags
    question = clean_url(question)
    question = clean_html(question)
    question = remove_emoji(question)
    question = clean_punctuation(question)
    question = remove_names(question)
    
    if get_english(question):
        question = remove_stops(question.lower())
        # question = remove_non_english(question)

        if question == "":
            print('Sorry, I do not understand!')
    else:
        # The original line was a bare string expression, so nothing was printed.
        print('Please, talk to me in English!')
 
    return question

def encode_input(cleaned_question):
    embedding = model.encode(cleaned_question, show_progress_bar=True)
    return embedding


def predict_intent(embedding):
    intent = cls.predict(embedding.reshape(1, -1))
    return intent
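
Note that process_input can return an empty string (for example when only a mention or stop words were sent), and that string would still be encoded and assigned to some cluster. A minimal sketch of a guard, assuming the three steps are passed in as functions; `intent_or_none` and the stand-in lambdas are hypothetical and only illustrate the control flow.

```python
def intent_or_none(raw_text, clean, encode, predict):
    """Return the predicted cluster id, or None when cleaning leaves no text."""
    cleaned = clean(raw_text)
    if not cleaned or not cleaned.strip():
        return None
    # predict is assumed to return a one-element sequence, as a scikit-learn estimator does.
    return int(predict(encode(cleaned))[0])

# Stand-ins for process_input, encode_input and predict_intent (assumptions, not the real ones).
fake_clean = lambda text: text.replace("@AmericanAir", "").strip()
fake_encode = lambda text: [float(len(text))]
fake_predict = lambda vector: [29]

print(intent_or_none("@AmericanAir", fake_clean, fake_encode, fake_predict))
print(intent_or_none("@AmericanAir late flight", fake_clean, fake_encode, fake_predict))
```

With the real functions, a `None` result is the point where the bot could fall back to a default message instead of searching a cluster.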



In [150]:
test1 = "@AmericanAir I am a regular client of your company and I was sitted right next to a woman with a huge dog. And guess what ? I am allergic, how could you allow a 40lb dog to travel among all passengers ? Seriously It's ridiculous..."
test2 = "@AmericanAir You guys are always late, my flight is reschedule for the third time now... I can't believe this is happing to me again... I can afford to be late at work!"
test3 = "HOW MUCH IS THE TICKET TO AMSTERDAM"
test4 = "Suck my cock"

vectorized_query = encode_input(process_input(test4))
int(predict_intent(vectorized_query))

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

29

We have the predicted intent, so we can now subset the dataset to the rows that belong to that cluster.

In [151]:
subset_intent = data[data['cls_cluster'] == int(predict_intent(vectorized_query))]
subset_intent.head()

Unnamed: 0,responce,question,langue,bert_vectors,cls_cluster
1336,"@116145 Our team will need to assess the damage in person, please see an agent for assistance.",able,1,"[-0.020353716, 0.04593392, -0.0028172068, -0.013853856, 0.055000857, -0.00067719264, -0.011381571, -0.042770065, -0.021583492, 0.004822091, -0.024313634, 0.03455315, -0.018316422, 0.007948827, 0.023381734, -0.04136052, -0.03815385, -0.018485868, -0.015769016, -0.021451773, -0.024772778, 0.02426907, -0.044639, -0.04265779, 0.036982443, -0.045074716, 0.041288085, -0.006785333, -0.074694954, 0.03147345, 0.008830159, 0.032759976, -0.018436776, -0.042628713, 1.9219574e-06, 0.015931372, 0.00092020...",29
2222,"@116417 Oh my, that #guac looks delish! Thanks for always sharing the good and the bad, Kevin. #yummy",give grief thing go poorly credit guacamole outstanding addition,1,"[0.019835008, 0.050782442, 0.022539701, -0.026484992, 0.035954516, 0.05543753, -0.03252636, -0.008919937, -0.022126015, -0.012508035, -0.0017122085, 0.030908672, 0.027041981, 0.005563695, 0.0076721925, -0.04683098, -0.005309765, 0.07120099, -0.0060006133, -0.011603085, -0.01886293, -0.008489992, 0.036515605, -0.003291932, -0.017369848, -0.017260911, 0.043230373, 0.0053055636, -0.0014716038, -0.07092183, -0.03958128, -0.023780473, -0.1069331, -0.0120184235, 1.9266909e-06, -0.019401144, 0.0178...",29
2694,"@116577 We apologize for the inconvenience. Travelers submit their paperwork to us, but aren't required to show it to everyone on board.",exactly see,1,"[-0.023471693, -0.049312487, -0.010000339, 0.028928878, -0.046554483, 0.045241512, -0.011108036, 0.0035881405, -0.03326637, 0.035132177, 0.042992312, 0.04214979, -0.012999944, 0.027551938, -0.011099375, 0.00029231858, -0.017419962, 0.0022950217, -0.010928286, -0.062451933, -0.03526957, 0.0021336197, -0.0036606882, 0.00951045, -0.015165524, 0.033507034, 0.009258552, 0.046980508, -0.02235027, -0.025900401, 0.002829037, -0.06862263, -0.01434591, -0.06417404, 2.039808e-06, 0.016420469, 0.0584509...",29
5872,@117544 We're sorry you feel that way. We appreciate your feedback.,do not like new livery,1,"[0.0046906937, 0.05810883, 0.015199645, 0.036833305, -0.038625862, 0.018442264, -0.04927703, 0.060897686, -0.065442815, -0.009487646, -0.05220299, -0.015931446, 0.042888213, 0.003186624, -0.046424817, 0.015366978, -0.05345476, 0.003715657, -0.0066085113, -0.018789276, 0.017298043, -0.014108309, 0.014645481, -0.0027164407, -0.040861815, -0.052748445, 0.004983523, -0.03285995, -0.015230686, 0.009718367, 0.060370382, 0.07766298, -0.021088807, 0.03770109, 1.651399e-06, 0.004103672, 0.005441643, ...",29
7662,@118124 Please send the requested info via DM. We need your contact info (phone and email).,12 year since will not address concern who s interested loyalty,1,"[0.023082662, 0.122966744, 0.00046083913, 0.0009780918, -0.039973676, -0.0010871692, -0.0034097987, 0.033533987, 0.0215643, -0.010972419, 0.018343123, -0.0016017708, 0.047237385, 0.0053387103, -0.095208555, 0.089953795, 0.05929283, 0.044383056, 0.047167357, -0.012964101, 0.026339479, 0.033261336, 0.015548939, -0.031434402, -0.06808653, -0.026518976, 0.0269157, -0.029254021, -0.007454933, -0.05261607, -0.02081515, 0.011926282, -0.02036843, -0.017682597, 1.7830118e-06, -0.055525202, -0.0423927...",29


In [152]:
from scipy.spatial import distance

subset_intent['cosine_similarity'] = subset_intent['bert_vectors'].apply(lambda x: 1 - distance.cosine(x, vectorized_query))
subset_intent.sort_values(by=['cosine_similarity'], ascending=False).head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,responce,question,langue,bert_vectors,cls_cluster,cosine_similarity
145940,"@157252 We love you too, Mikey!",love,1,"[0.0065462426, 0.0009857492, 0.009143514, -0.033039212, 0.05815943, 0.059181362, -0.10237369, 0.005193506, 0.022337392, -0.0017181454, 0.06339493, 0.01646664, 0.015478837, 0.010403791, -0.019004667, -0.06548845, 0.015427878, 0.08169727, -0.045819484, 0.004791352, 0.019910386, 0.007305156, 0.010617565, -0.06344178, -0.010452626, 0.0051551466, -0.013441716, -0.01988926, -0.020912038, 0.028157063, -0.025741383, -0.022176925, 0.017519457, -0.066620536, 2.2660806e-06, -0.026318373, 0.025323808, -...",29,0.359261
160309,@161193 What's going on?,way cuss get,1,"[-0.0031362567, 0.05735538, 0.02492682, 0.0058051604, -0.02894875, 0.03395764, -0.013664998, 0.024706172, -0.022642616, -0.015120872, 0.027789215, -0.04040948, 0.019341484, 0.06572003, 0.010688229, -0.016840298, 0.003246278, 0.00027494336, 0.021169715, 0.021826409, -0.012828303, -0.019951638, -0.017579002, 0.03166021, -0.0038612334, -0.058278985, 0.015489256, 0.023395183, -0.0181157, 0.006894765, 0.0012238538, -0.0061624204, 0.013628924, -0.029195921, 2.0369325e-06, -0.050848894, -0.05271085...",29,0.345812
72771,@136436 We'll need a little bit more info in order to check this. The equipment can be different on different dates.,red eye,1,"[0.020723222, -0.042870298, 0.024496961, -0.061978668, 0.038333543, 0.00036557863, -0.026390035, 0.020090513, 0.003906287, -0.046775434, 0.007276601, 0.051103994, 0.02297241, 0.013551311, 0.042790927, 0.00029188945, -0.012112413, 0.006710989, 0.008575091, 0.016706098, 0.0040668338, 0.00476765, 0.045295406, 0.02838117, -0.006501023, -0.075509205, -0.02260967, 0.048065647, -0.014173288, -0.0018538776, 0.018467177, 0.0054922258, -0.059839427, -0.03745993, 1.8045346e-06, 0.03258443, 0.0012753668...",29,0.308271
169127,"@163637 We're ready to go, Bill! We'll have you in the air and soaring the skies soon.",preppe,1,"[-0.036589276, 0.061206274, 0.0045469482, -0.065309994, 0.021194305, 0.0050465255, -0.01514734, 0.021842862, -0.044836625, 0.014735166, 0.039026644, 0.036832165, -0.012414906, 0.063797414, 0.005049367, -0.007720811, -0.012923147, 0.017830947, -0.011865436, -0.0044904756, -0.03388822, 0.025489802, -0.0028722961, 0.02921095, 0.026907194, -0.06449522, 0.028530957, 0.027383344, -0.0626602, -0.0061473795, 0.06988765, 0.057509292, -0.0044924663, -0.04756378, 1.994226e-06, 0.014000578, -0.016905509...",29,0.293863
72805,"@136448 We're thrilled to have you on board with us today, Jennifer. Have fun in the sun!",board,1,"[0.0067980634, -0.034887366, -0.013828047, 0.036027517, -0.018792475, -0.014560495, 0.0051818476, -0.017162062, -0.0020061873, 0.03003556, 0.0052354666, 0.0044235038, 0.032274965, 0.06719866, 0.0037375817, 0.016188968, -0.019435257, 0.024202973, -0.0509266, -0.018533567, -0.0069626467, 0.027683416, -0.011540201, 0.017294293, -0.002113874, 0.0014525686, -0.02554939, -0.012613163, -0.00083795487, 0.0204351, 0.024718396, -0.0066662068, -0.03715831, 0.048466444, 2.2825366e-06, -0.0022419076, 0.0...",29,0.288116
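
The SettingWithCopyWarning printed above appears because subset_intent was created by filtering data, and a new column is then assigned to it. One common way to avoid it is to take an explicit copy of the subset first. A minimal sketch on a toy DataFrame: the column names mirror the notebook, while the values, `toy`, and `query` are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Toy stand-in for `data`; real rows hold sentence embeddings, not 2-d vectors.
toy = pd.DataFrame({
    "responce": ["reply a", "reply b", "reply c"],
    "bert_vectors": [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])],
    "cls_cluster": [29, 29, 3],
})
query = np.array([1.0, 0.0])

# .copy() makes the subset an independent DataFrame, so adding a column is unambiguous.
subset = toy[toy["cls_cluster"] == 29].copy()
subset["cosine_similarity"] = subset["bert_vectors"].apply(
    lambda v: 1 - distance.cosine(v, query)
)
best = subset.sort_values(by="cosine_similarity", ascending=False)["responce"].iat[0]
print(best)
```

Selecting the reply by column name rather than by position also keeps the lookup correct if the column order changes.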


In [153]:
answer = subset_intent.sort_values(by=['cosine_similarity'], ascending=False).responce.iat[0]
print(test4)
print(answer)

Suck my cock
@157252 We love you too, Mikey!
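
The best match here has a cosine similarity of only about 0.36, and the reply does not suit the message. One option is to return a default response when no stored question is close enough, which is what a rule-based bot does for unknown statements. A minimal sketch: `pick_answer` is a hypothetical helper, the 0.5 threshold is an arbitrary assumption that would need tuning, and the second call uses made-up similarities.

```python
import numpy as np

def pick_answer(similarities, answers, threshold=0.5,
                fallback="Sorry, I do not understand!"):
    """Return the best-matching answer, or `fallback` if no match is close enough."""
    similarities = np.asarray(similarities, dtype=float)
    if similarities.size == 0:
        return fallback
    best = int(np.argmax(similarities))
    if similarities[best] < threshold:
        return fallback
    return answers[best]

# Similarities from the table above: the best match is too weak, so the fallback is used.
weak = pick_answer([0.359261, 0.345812],
                   ["@157252 We love you too, Mikey!", "@161193 What's going on?"])
# Made-up similarities where the best match clears the threshold.
strong = pick_answer([0.82, 0.40], ["We'll check the status.", "Good morning!"])
print(weak)
print(strong)
```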


In [174]:
def response(input_data, show_details=False):
    vectorized_query = encode_input(process_input(input_data))
    mask = data['cls_cluster'] == int(predict_intent(vectorized_query))
    # subset_intent = data[mask]
    subset_intent = data.loc[mask, :]
    subset_intent['cosine_similarity'] = subset_intent['bert_vectors'].apply(lambda x: 1 - distance.cosine(x, vectorized_query))
    # answer = subset_intent.sort_values(by=['cosine_similarity'], ascending=False).responce.iat[0]
    answer = subset_intent.sort_values(by=['cosine_similarity'], ascending=False).iloc[0, 0]
    return answer

In [155]:
subset_intent.sort_values(by=['cosine_similarity'], ascending=False).iloc[0, 0]

'@157252 We love you too, Mikey!'

In [181]:
question = "@AmericanAir You guys are always late, my flight is reschedule for the third time now... I can't believe this is happening to me again... I can afford to be late at work!"
question2 = "@AmericanAir I am a regular client of your company and I was sitted right next to a woman with a huge dog. And guess what ? I am allergic, how could you allow a 40lb dog to travel among all passengers ? Seriously It's ridiculous..."
response(question)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]


"@155978 We know delays aren't fun and we do our best to avoid them if we can. What's your flight number? We'll check the status."

In [189]:
while True:
    input_data = input("Customer- ")
    answer = "Bot- " + response(input_data)
    print(answer)

Customer- hi


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Bot- @159672 Good morning! *waves intensely*
Customer- what time 


Batches:   0%|          | 0/1 [00:00<?, ?it/s]


Bot- @167209 Treat yo self, Will. Always great to have you on board.


KeyboardInterrupt: ignored
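
The loop above can only be stopped by interrupting the kernel, and the replies still start with the handle of the customer they were originally sent to. A minimal sketch of a loop that ends on an exit word and strips that leading handle; `strip_handle`, `chat`, the exit words and the goodbye message are all assumptions, and the scripted run uses a stand-in for response().

```python
import re

def strip_handle(reply):
    """Remove a leading @handle (e.g. '@159672 ') from a stored reply."""
    return re.sub(r"^@\w+\s*", "", reply)

def chat(respond, read=input, write=print, exit_words=("quit", "exit")):
    """Run the chat loop until the customer types one of `exit_words`."""
    while True:
        message = read("Customer- ").strip()
        if message.lower() in exit_words:
            write("Bot- Goodbye!")
            break
        write("Bot- " + strip_handle(respond(message)))

# Scripted run with a stand-in for response(); a real session would call chat(response).
scripted = iter(["hi", "quit"])
transcript = []
chat(lambda _: "@159672 Good morning! *waves intensely*",
     read=lambda prompt: next(scripted), write=transcript.append)
print(transcript)
```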

In [None]:
def process_input(question):
 
    import re
    question = re.sub(r'@[\w]+', '', question)  # delete @mentions
    question = re.sub(r'#[^\s]+', '', question)  # delete hashtags
    question = clean_url(question)
    question = clean_html(question)
    question = remove_emoji(question)
    question = clean_punctuation(question)
    
    if get_english(question):
        question = remove_stops(question.lower())
        # question = remove_non_english(question)
        
        if question == "":
            print('Sorry, I do not understand!')
    else:
        # The original line was a bare string expression, so nothing was printed.
        print('Please, talk to me in English!')
 
    return question

############################################################

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('all-mpnet-base-v2')
embeddings = model.encode(data.question.tolist(), show_progress_bar=True)

data['bert_vectors'] = list(embeddings)  # -> [-0.042830452, 0.0033645306, -0.00762816, ...]

kmeans_35 = KMeans(n_clusters = 35, random_state=1)
kmeans_35.fit(embeddings)

data['bert_kmeans_35clusters'] = kmeans_35.labels_  # -> 6

############################################################

def encode_input(cleaned_question):
    # Reuse the module-level `model` instead of reloading the transformer on every call.
    embedding = model.encode(cleaned_question, show_progress_bar=True)
    return embedding

def predict_intent(embedding):
    intent = kmeans_35.predict(embedding.reshape(1, -1))
    return intent

############################################################

test1 = "@AmericanAir I am a regular client of your company and I was sitted right next to a woman with a huge dog."

# cleaning and vectorizing the user input
vectorized_query = encode_input(process_input(test1))

subset_intent = data[data['bert_kmeans_35clusters'] == int(predict_intent(vectorized_query))]

############################################################

def response(input_data, show_details=False):
    vectorized_query = encode_input(process_input(input_data))
    mask = data['bert_kmeans_35clusters'] == int(predict_intent(vectorized_query))
    # Copy the subset so that adding a column does not raise SettingWithCopyWarning.
    subset_intent = data.loc[mask, :].copy()
    subset_intent['cosine_similarity'] = subset_intent['bert_vectors'].apply(lambda x: 1 - distance.cosine(x, vectorized_query))
    # Column 0 is 'responce': return the reply attached to the most similar question.
    answer = subset_intent.sort_values(by=['cosine_similarity'], ascending=False).iloc[0, 0]
    return answer

while True:
    input_data = input("Customer- ")
    answer = "Bot- " + response(input_data)
    print(answer)