**Initialization**
* I put these three lines at the top of each of my Notebooks. The first two reload imported modules automatically before code runs, which prevents stale-code problems when I return to a Project or Problem. The third line renders visualizations inline within the Notebook.

In [4]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Downloading the Dependencies**
* All the Libraries and Dependencies required for this Project are installed and imported in a single cell.

In [5]:
#@ Downloading the Libraries and Dependencies:
# !pip install -q -U trax                         # Installing Trax: uncomment on first run.
import pandas as pd
import numpy as np
import os
import nltk
# nltk.download("punkt")                          # Uncomment on first run; must come after `import nltk`.
import trax
from trax import layers as tl
from trax.supervised import training
from trax.fastmath import numpy as fastnp
import random
from collections import defaultdict

random.seed(111)

**Getting the Data**
* I used Google Colab for this Project, so the process of downloading and reading the Data might differ on other platforms. I will be using the **Quora Answer Question Dataset**, which is labeled. I will build a Model that identifies similar or duplicate Questions, which is useful when several versions of the same Question exist.

In [6]:
#@ Getting the Data:
PATH = "/content/drive/My Drive/Colab Notebooks/Questions"
data = pd.read_csv(os.path.join(PATH, "Questions.zip"))

#@ Inspecting the Data:
print(f"Number of Question Pairs: {len(data)}")
data.head(10)                                                        # Inspecting the DataFrame.

Number of Question Pairs: 404351


| Unnamed: 0 | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |
| 5 | 5 | 11 | 12 | Astrology: I am a Capricorn Sun Cap moon and c... | I'm a triple Capricorn (Sun, Moon and ascendan... | 1 |
| 6 | 6 | 13 | 14 | Should I buy tiago? | What keeps childern active and far from phone ... | 0 |
| 7 | 7 | 15 | 16 | How can I be a good geologist? | What should I do to be a great geologist? | 1 |
| 8 | 8 | 17 | 18 | When do you use シ instead of し? | When do you use "&" instead of "and"? | 0 |
| 9 | 9 | 19 | 20 | Motorola (company): Can I hack my Charter Moto... | How do I hack Motorola DCX3400 for free internet? | 0 |
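
Outside Colab, the same loading step can be tried on a small local archive. The sketch below is an illustration rather than part of the Project: the file name `sample_questions.zip` and its two rows are hypothetical, and it assumes the archive holds a single CSV file, which is what `pd.read_csv` needs to read a zip directly.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical two-row sample standing in for the real Questions.zip.
csv_text = (
    "id,qid1,qid2,question1,question2,is_duplicate\n"
    "0,1,2,How can I be a good geologist?,What should I do to be a great geologist?,1\n"
    "1,3,4,Should I buy tiago?,Which fish would survive in salt water?,0\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "sample_questions.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("sample_questions.csv", csv_text)

    # Compression is inferred from the ".zip" suffix by default.
    sample = pd.read_csv(zip_path)

print(sample.shape)
print(sample["is_duplicate"].tolist())
```

Only the path changes between platforms; the `pd.read_csv` call itself stays the same.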


**Processing the Data**
* I will split the Data into a Training Set and a Test Set; the Test Set will be used later to evaluate the Model. Only the Question Pairs marked as duplicates are selected to train the Model, and they will form the two batches fed to the Siamese Network. The Test Set keeps the original Question Pairs together with the label stating whether each pair is a duplicate.

In [7]:
#@ Processing the Data:
N_train = 300000                                               
N_test = 10240                                                 
data_train = data[:N_train]                                                    # Training pairs.
data_test = data[N_train:N_train+N_test]                                       # Test pairs.
del data                                                                       # Dropping the reference to the full DataFrame.

#@ Inspecting the Data:
print(f"Training Set: {len(data_train)} and Test Set: {len(data_test)}")

#@ Selecting the Question Pairs for Training:
train_idx = (data_train["is_duplicate"] == 1).to_numpy()
train_idx = [i for i,x in enumerate(train_idx) if x]
print(f"Number of Duplicate Questions: {len(train_idx)}")
print(f"Indexes of first Duplicate Questions: {train_idx[:10]}")

Training Set: 300000 and Test Set: 10240
Number of Duplicate Questions: 111486
Indexes of first Duplicate Questions: [5, 7, 11, 12, 13, 15, 16, 18, 20, 29]
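
The index selection above can also be written without the Python loop. This is a minimal sketch on hypothetical toy labels, not the Project's data; it assumes only that `np.flatnonzero` returns the positions of the `True` entries of a boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels; the real column is data_train["is_duplicate"].
toy = pd.DataFrame({"is_duplicate": [0, 1, 0, 1, 1, 0]})

# List-comprehension form, as used in the cell above.
mask = (toy["is_duplicate"] == 1).to_numpy()
idx_loop = [i for i, flag in enumerate(mask) if flag]

# Vectorized equivalent: positions of the True entries of the mask.
idx_vec = np.flatnonzero(mask)

print(idx_loop)
print(idx_vec.tolist())
```

Both forms give the same positions; the vectorized one is mainly a convenience on a 300000-row Training Set.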


In [8]:
#@ Inspecting the Duplicate Questions:
print(data_train["question1"][20])                                 # Index 20 has Duplicate Questions pairs.
print(data_train["question2"][20])                                 # Index 20 has Duplicate Questions pairs.
print("Index 20 is duplicate:", data_train["is_duplicate"][20])

Why do rockets look white?
Why are rockets and boosters painted white?
Index 20 is duplicate: 1


**Preparing the Data**

In [9]:
#@ Preparing the Data: Training the Model:
Q1_train_words = np.array(data_train["question1"][train_idx])
Q2_train_words = np.array(data_train["question2"][train_idx])

#@ Preparing the Data: Evaluating the Model:
Q1_test_words = np.array(data_test["question1"])
Q2_test_words = np.array(data_test["question2"])
y_test = np.array(data_test["is_duplicate"])

#@ Inspecting the Data:
print("TRAINING QUESTIONS:\n")
print("Question 1:", Q1_train_words[7])
print("Question 2:", Q2_train_words[7], "\n")

print("TESTING QUESTIONS:\n")
print("Question 1:", Q1_test_words[7])
print("Question 2:", Q2_test_words[7], "\n")
print("Label of the first Testing pair (index 0):", y_test[0])

TRAINING QUESTIONS:

Question 1: Why are so many Quora users posting questions that are readily answered on Google?
Question 2: Why do people ask Quora questions which can be answered easily by Google? 

TESTING QUESTIONS:

Question 1: Which is the best digital photo frame?
Question 2: What are the best 12-inch digital photo frames? 

Label of the first Testing pair (index 0): 0


**Preparing the Data**
* I will encode each Question as a list of numbers, with one Index per word. First, I will tokenize each Question using NLTK. The Vocabulary is built with Python's `defaultdict`, which assigns the value 0 to all Out of Vocabulary words.

In [10]:
#@ Preparing the Data:
Q1_train = np.empty_like(Q1_train_words)                                # Creating new Training array.
Q2_train = np.empty_like(Q2_train_words)                                # Creating new Training array.
Q1_test = np.empty_like(Q1_test_words)                                  # Creating new Test array.
Q2_test = np.empty_like(Q2_test_words)                                  # Creating new Test array.

#@ Building Vocabulary with Training Dataset:
vocab = defaultdict(lambda: 0)
vocab["<PAD>"] = 1
for idx in range(len(Q1_train_words)):
  Q1_train[idx] = nltk.word_tokenize(Q1_train_words[idx])               # Tokenizing the Training Set.
  Q2_train[idx] = nltk.word_tokenize(Q2_train_words[idx])               # Tokenizing the Training Set.
  q = Q1_train[idx] + Q2_train[idx]
  for word in q:
    if word not in vocab:
      vocab[word] = len(vocab) + 1
print("The length of the Vocabulary is:", len(vocab))

#@ Testing Dataset:
for idx in range(len(Q1_test_words)):
  Q1_test[idx] = nltk.word_tokenize(Q1_test_words[idx])                 # Tokenizing the Test Set.
  Q2_test[idx] = nltk.word_tokenize(Q2_test_words[idx])                 # Tokenizing the Test Set.

#@ Inspecting the Final Prepared Dataset:
print("Training Set is reduced to:", len(Q1_train))
print("Test Set is:", len(Q1_test))

The length of the Vocabulary is: 36342
Training Set is reduced to: 111486
Test Set is: 10240
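
To illustrate how such a Vocabulary turns tokens into indexes, here is a minimal sketch on toy questions. It is an assumption-laden illustration: `str.split` stands in for `nltk.word_tokenize`, and the helper `encode` is hypothetical, not a function from this Notebook. It uses `vocab.get(word, 0)` because indexing a `defaultdict` with an unseen word would insert that word into the Vocabulary.

```python
from collections import defaultdict

# Toy tokenized questions; str.split stands in for nltk.word_tokenize here.
train_tokens = [q.split() for q in ["why do rockets look white ?",
                                    "why are rockets painted white ?"]]
test_tokens = "why are boosters white ?".split()

# Same construction as in the cell above: 0 for unseen words, 1 for <PAD>.
vocab = defaultdict(lambda: 0)
vocab["<PAD>"] = 1
for tokens in train_tokens:
    for word in tokens:
        if word not in vocab:
            vocab[word] = len(vocab) + 1

def encode(tokens, vocab):
    """Map tokens to indexes; .get avoids adding unseen words to the defaultdict."""
    return [vocab.get(word, 0) for word in tokens]

encoded = encode(test_tokens, vocab)
print(encoded)   # "boosters" was never seen in training, so it maps to 0
```

With this numbering, 0 stays reserved for Out of Vocabulary words, 1 for `<PAD>`, and real words start at 2.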
