# Quora Question Pairs
## By Aschwin Schilperoort

link: https://www.kaggle.com/c/quora-question-pairs/

The goal of this competition is to predict which of the provided pairs of questions contain two questions with the same meaning. The ground truth is the set of labels that have been supplied by human experts. The ground truth labels are inherently subjective, as the true meaning of sentences can never be known with certainty. Human labeling is also a 'noisy' process, and reasonable people will disagree. As a result, the ground truth labels on this dataset should be taken to be 'informed' but not 100% accurate, and may include incorrect labeling. We believe the labels, on the whole, to represent a reasonable consensus, but this may often not be true on a case by case basis for individual items in the dataset.

**Data fields**
- id - the id of a training set question pair
- qid1, qid2 - unique ids of each question (only available in train.csv)
- question1, question2 - the full text of each question
- is_duplicate - the target variable, set to 1 if question1 and question2 have essentially the same meaning, and 0 otherwise.

Let's take a look at what we have!

In [18]:
# importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\1asch\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [3]:
# importing the train & test dataset
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.head())
print(test.head())

   id  qid1  qid2                                          question1  \
0   0     1     2  What is the step by step guide to invest in sh...   
1   1     3     4  What is the story of Kohinoor (Koh-i-Noor) Dia...   
2   2     5     6  How can I increase the speed of my internet co...   
3   3     7     8  Why am I mentally very lonely? How can I solve...   
4   4     9    10  Which one dissolve in water quikly sugar, salt...   

                                           question2  is_duplicate  
0  What is the step by step guide to invest in sh...             0  
1  What would happen if the Indian government sto...             0  
2  How can Internet speed be increased by hacking...             0  
3  Find the remainder when [math]23^{24}[/math] i...             0  
4            Which fish would survive in salt water?             0  
   test_id                                          question1  \
0        0  How does the Surface Pro himself 4 compare wit...   
1        1  Should I ha

So we can see we have two datasets, train and test, which contain questions asked on Quora. The problem is that some of these question pairs are duplicates: the two questions ask the same thing, even if the wording differs. Repost alert!!! Our goal is to find these duplicates, not by hand, but with the machines!

To have a machine learn what the text says and decide whether a pair is a duplicate, we have to clean the data first. We have to:

- Remove numbers & punctuation!
- Reduce words to their stems (stemming)!
- Remove stopwords!


After this we can build a bag-of-words representation and train a model on it to classify duplicate questions! Let's start off by cleaning the dataset.
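As a preview of where the cleaning is heading, here is a minimal sketch of what a bag of words is: every cleaned text becomes a vector of word counts over a shared vocabulary. The function name `bag_of_words` and the two example strings are hypothetical, chosen only for illustration; in a real pipeline a library vectorizer such as scikit-learn's `CountVectorizer` would probably be the more practical choice.

```python
from collections import Counter

def bag_of_words(documents):
    """Turn whitespace-separated documents into count vectors over a shared vocabulary."""
    tokenised = [doc.split() for doc in documents]
    # The vocabulary is every distinct word across all documents, sorted for a stable column order
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # One count per vocabulary word; words missing from this document count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Two already-cleaned example strings (hypothetical, for illustration only)
docs = ["step step guid invest share market india",
        "step step guid invest share market"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

The two vectors differ only in the column for the word that appears in one text and not the other, which is the kind of signal a classifier can pick up on.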

## Cleaning the dataset

In [None]:
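# Clean the first 1000 question pairs: letters only, lowercase, no stopwords, stemmed
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
corpus = []
for i in range(0, 1000):
    pair = []
    for column in ('question1', 'question2'):
        # str() guards against missing (NaN) questions
        question = re.sub('[^a-zA-Z]', ' ', str(train[column][i]))
        question = question.lower().split()
        question = [ps.stem(word) for word in question if word not in stop_words]
        pair.append(' '.join(question))
    # each corpus entry holds the cleaned [question1, question2] of one pair
    corpus.append(pair)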

In [4]:
train.shape

(404290, 6)

In [5]:
train.columns

Index(['id', 'qid1', 'qid2', 'question1', 'question2', 'is_duplicate'], dtype='object')

'What is the step by step guide to invest in share market in india?'

In [15]:
quest0 = [train['question1'][0],train['question2'][0]]

In [16]:
quest0

['What is the step by step guide to invest in share market in india?',
 'What is the step by step guide to invest in share market?']

In [21]:
quest0 = re.sub('[^a-zA-Z]', ' ', quest0[0])

In [22]:
quest0 = quest0.lower()

In [23]:
quest0

'what is the step by step guide to invest in share market in india '

In [24]:
quest0 = quest0.split()

In [25]:
quest0

['what',
 'is',
 'the',
 'step',
 'by',
 'step',
 'guide',
 'to',
 'invest',
 'in',
 'share',
 'market',
 'in',
 'india']

In [26]:
ps = PorterStemmer()

In [27]:
stop_words = set(stopwords.words('english'))
quest0 = [ps.stem(word) for word in quest0 if word not in stop_words]

In [28]:
quest0

['step', 'step', 'guid', 'invest', 'share', 'market', 'india']
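
The cell-by-cell steps above can be folded into one reusable function. This is only a sketch: the name `clean_question`, its parameters, and the tiny `demo_stop_words` set are assumptions made so the example runs without downloading NLTK data. They are not part of the notebook.

```python
import re

def clean_question(text, stop_words, stem=lambda word: word):
    """Letters only -> lowercase -> split -> drop stopwords -> stem."""
    # Replace everything that is not a letter with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', str(text))
    words = letters_only.lower().split()
    # Keep only non-stopwords and pass each one through the stemmer
    return [stem(word) for word in words if word not in stop_words]

# Tiny stand-in stopword set (an assumption; the notebook uses stopwords.words('english'))
demo_stop_words = {'what', 'is', 'the', 'by', 'to', 'in'}
# No stemmer is passed here, so words are kept unstemmed
print(clean_question('What is the step by step guide to invest in share market?',
                     demo_stop_words))
```

Passing `set(stopwords.words('english'))` as `stop_words` and `PorterStemmer().stem` as `stem` should reproduce the result of the cells above for the same question.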