## Arabic CHATBOT - PROJASK Academy (Trial)


This is an Example on Retrieval chatbots of PROJASK academy with very simple dataset as trial using two methods TF-IDF and LSTM 



#### Our Toolbox
We’re going to use Python 3 along some libraries which include:

* Keras
* Scikit-learn
* Pandas

## Types of Chat Bots

* Retrieval Based
* Generative

Each approach has its use case, based on your application and your abilities you decide which one to use.



### The Retrieval bots 
It works by having a pre defined data set of Questions/Answers and a similarity measure to decide which question in the data set is most similar to the one asked.

* No grammatical mistakes
* No Irrelavent Answers
* Less Data to work

Steps:
* Encode questions in vectors using pre-defined method
* Using Predefined similarity method, find the most similar question in DB
* Return the answer of the choosen questions 



### The Generative
Models work by training a neural network using NLP techniques to output an answer given a certain input question without needing a data set to lookup.


#### Open Domain & Closed Domain


## Data set 
Arabic NLP is a competitive field with new breakthroughs everyday. The complexity of Arabic makes it hard to perform cognitive tasks without being lost in the details.

It is well known that the more data we have the better we can draw analytics from and the better models we can build. I’ve recently found many very active accounts in Ask.fm that provide knowledge to people for free, this website works by submitting a question to an author and having it displayed once they reply. I gathered a list of some of these accounts and gathered as many as possible of their questions/answers pairs. They’re all Islamic questions so with a single domain I guess the data is consistent and can be useful.

This dataset can be used as a base dataset for a more advanced question answering data set or it can just be used as a knowledge base where search algorithms are applied to extract useful information or even as the training data for a Chat bot.

In [1]:
import pandas as pd

data = pd.read_csv("full_dataset2.csv")

In [2]:
data.columns

Index(['Question', 'Answer'], dtype='object')

In [3]:
data.head()

Unnamed: 0,Question,Answer
0,مرحبا,مرحبا بك
1,السلام عليكم,و عليكم السلام و رحمه الله و بركاته
2,مرحبا,اهلا و سهلا
3,كيفك,الحمدلله
4,ازيك,كويس الحمدلله


In [4]:
data.describe()

Unnamed: 0,Question,Answer
count,31,31
unique,29,17
top,مرحبا,8 شارع سابا باشا الاسكندريه
freq,3,6


## Retrieval based bots
These bots rely on the similarity between the input question and all the questions in the data set. In order to compute this similarity we need to choose a similarity measure that would rate the similarity of two sentences, there are a lot of similarity measures for text but we will choose the cosine similarity for this one since it’s one of the most common measures in NLP.
### Cosine similarity
It works by measuring the cosine of the angle between two vectors, thus it is concerned by their directions rather than their magnitudes, which in text represents the term frequency in each question regardless of the document length, in other words the length of the documents(questions) will not affect the computation but only. You can see that in the equation:



### Vectorizing questions
To go from a text question to a vector that represents the question so we can compute the similarity we need to transform it, in order to transform a text document into a vector [Process of encoding text as integers to create feature vectors [numbers represent the data] ] we need to use a feature extraction technique, we will use TF-IDF because it’s the most common in NLP.

# TF-IDF

TF-IDF is computed by first computing two values for each term:

Term Frequency: The frequency of the term in the document
Document Frequency: The fraction of the documents that contain the term
Inverse Document Frequency: The logarithmically scaled inverse Document Frequency
The term frequency is used because we are concerned with finding documents that have similar terms, because if two documents have the same terms then they are probably very similar.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


vectorizer = TfidfVectorizer()

vectorizer.fit(data.values.ravel())

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

Now our vectorizer is ready to transform any question into a vector using TFIDF!

In [6]:
# Read a question from the user
question = [input('Please enter a question: \n')]
question = vectorizer.transform(question)    #Convert a collection of raw documents to a matrix of TF-IDF features.

# Rank all the questions using cosine similarity to the input question
rank = cosine_similarity(question, vectorizer.transform(data['Question'].values))
# Grab the top 5
top = np.argsort(rank, axis=-1).T[-1:].tolist()

# Print top 5
for item in top:
    print(data['Answer'].iloc[item].values[0])
    print("\n ########## \n")

Please enter a question: 
هل لديكم كورسات؟
لدينا كرسات برمجه python, Java, C, C++

 ########## 



This is a very simple question and therefore the results are good given the very simple approach we’re following.

### LSTM Encoding

In [7]:
import pandas as pd
import numpy as np
import re
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM, RepeatVector
from keras.utils import np_utils
from nltk.stem import ISRIStemmer
from six.moves import cPickle

BATCH_SIZE = 32 # Batch size for GPU
NUM_WORDS = 10000 # Vocab length
MAX_LEN = 20 # Padding length (# of words)
LSTM_EMBED = 8 # Number of LSTM nodes

Using TensorFlow backend.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])


In [9]:
def batches_generator(train_data, batch_size=32):
    # For OHE inputs
    num_words = np.max(train_data) + 1
    timesteps = train_data.shape[1]
    while True:
        indices = np.random.choice(len(train_data), size=batch_size)
        X = train_data[indices]
        X = np_utils.to_categorical(X, num_words)
        X = X.reshape((batch_size, timesteps, num_words))
        yield (X, X)

In [10]:
train_data = pd.read_csv("full_dataset2.csv")
stemmer = ISRIStemmer()         #it is the process to return the word to its roots

In [11]:
# We don't need the answers, so let's drop them
train_data.drop('Answer', inplace=True, axis=1)

train_data = train_data[train_data.Question.apply(lambda x: len(x.split())) < MAX_LEN]  #Apply the max length

train_data.Question = train_data.Question.apply(lambda x: (re.sub('[^\u0620-\uFEF0\s]', '', x)).strip()) #remove not arabic word

train_data = train_data[train_data.Question.apply(len) > 0]  #remove empty questions if founded

In [12]:
# Stem the words
train_data.Question = train_data.Question.apply(lambda x: " ".join([stemmer.stem(i) for i in x.split()]))

In [13]:
tokenizer = Tokenizer(num_words=NUM_WORDS, lower=False)

tokenizer.fit_on_texts(train_data["Question"].values)

In [14]:
# Save the tokenizer for later use
cPickle.dump(tokenizer, open("lstm-autoencoder-tokenizer.pickle", "wb"))

In [15]:
train_data = tokenizer.texts_to_sequences(train_data["Question"].values)

train_data = pad_sequences(train_data, padding='post', truncating='post', maxlen=MAX_LEN)

In [16]:
model = Sequential()
model.add(Embedding(NUM_WORDS, 100, input_length=MAX_LEN))
model.add(LSTM(LSTM_EMBED, dropout=0.2, recurrent_dropout=0.2, input_shape=(train_data.shape[1], NUM_WORDS)))
model.add(RepeatVector(train_data.shape[-1]))
model.add(LSTM(LSTM_EMBED, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(Dense(NUM_WORDS, activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit_generator(batches_generator(train_data), steps_per_epoch=(len(train_data) // BATCH_SIZE))
model.fit(train_data, np.expand_dims(train_data, -1), epochs=3, batch_size=BATCH_SIZE)


Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use tf.cast instead.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x12635b1e848>

In [17]:
model.save("lstm-encoder.h5")

In [18]:
import pandas as pd
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from six.moves import cPickle
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
from keras import backend as K
from nltk.stem import ISRIStemmer
from keras.utils import np_utils

BATCH_SIZE = 32 # Batch size for GPU
NUM_WORDS = 10000 # Vocab length
MAX_LEN = 20 # Padding length (# of words)
LSTM_EMBED = 8 # Number of LSTM nodes

K.set_learning_phase(False)


data = pd.read_csv("full_dataset2.csv")
tokenizer = cPickle.load(open("lstm-autoencoder-tokenizer.pickle", "rb"))

stemmer = ISRIStemmer()




In [19]:
# Read the encoder model
model = load_model("lstm-encoder.h5")


In [20]:
# Create the encoding function
encode = K.function([model.input, K.learning_phase()], [model.layers[1].output])


Questions = tokenizer.texts_to_sequences(data.Question)
# We pad sequences that are shorter than MAX_LEN
Questions = pad_sequences(Questions, padding='post', truncating='post', maxlen=MAX_LEN)
Questions = np.squeeze(np.array(encode([Questions])))

In [21]:
question = input('Please enter a question: \n')
question = stemmer.stem(question)
question = tokenizer.texts_to_sequences([question])
question = pad_sequences(question, padding='post', truncating='post', maxlen=MAX_LEN)
question = np.squeeze(encode([question]))

rank = cosine_similarity(question.reshape(1, -1), Questions)
top = np.argsort(rank, axis=-1).T[-5:].tolist()
for item in top:
    print(data['Answer'].iloc[item].values[0])

Please enter a question: 
مواعيد العمل الرسميه؟
لا يا فندم 
إن شاء الله
مساء النور
8 شارع سابا باشا الاسكندريه
من التاسعه صباحا  و حتي العاشره مساءا


In [None]:
question = input('Please enter a question: \n')
question = stemmer.stem(question)
question = tokenizer.texts_to_sequences([question])
question = pad_sequences(question, padding='post', truncating='post', maxlen=MAX_LEN)
question = np.squeeze(encode([question]))

rank = cosine_similarity(question.reshape(1, -1), Questions)
top = np.argsort(rank, axis=-1).T[0].tolist()
for item in top:
    print(data['Answer'].iloc[item].values[0])