## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many of them ask similar (or identical) questions. Multiple questions with the same intent make seekers spend extra time searching for the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which provides a better experience for active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build the appropriate classifier model. 


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a labeled .csv file** and **a presentation that describes the model you built** and its **performance**. 


In [152]:
import pandas as pd
import numpy as np

In [153]:
df = pd.read_csv("data/train.csv")

### Exploration

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 404290 entries, 0 to 404289
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            404290 non-null  int64 
 1   qid1          404290 non-null  int64 
 2   qid2          404290 non-null  int64 
 3   question1     404289 non-null  object
 4   question2     404288 non-null  object
 5   is_duplicate  404290 non-null  int64 
dtypes: int64(4), object(2)
memory usage: 18.5+ MB


In [4]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming

In [10]:
import nltk

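In [19]:
# Opens the interactive NLTK downloader; the cleaning functions below need
# the tokenizer models (e.g. 'punkt') and the 'stopwords' corpus
nltk.download()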

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [11]:
from nltk.tokenize import word_tokenize

def rm_punc(text):
    """
    Input: text (str)
    Return: list of word tokens with punctuation removed
    """
    tokens = word_tokenize(text)
    # keep only purely alphabetic tokens (drops punctuation, numbers, and fragments like "'s")
    words = [word for word in tokens if word.isalpha()]
    return words

In [21]:
text = "hello's to all the peeps. This is a test?"
rm_punc(text)

['hello', 'to', 'all', 'the', 'peeps', 'This', 'is', 'a', 'test']

In [12]:
def lowercase(tokens):
    """
    Input: list of tokens
    Return: list of lowercased tokens
    """
    # convert to lower case
    lowercase_tokens = [w.lower() for w in tokens]
    return lowercase_tokens

In [28]:
lowercase(rm_punc(text))

['hello', 'to', 'all', 'the', 'peeps', 'this', 'is', 'a', 'test']

In [13]:
from nltk.corpus import stopwords

# build the stop-word set once rather than on every call
STOP_WORDS = set(stopwords.words('english'))

def filter_stop_words(tokens):
    """Return tokens with English stop words removed."""
    return [w for w in tokens if w not in STOP_WORDS]

In [31]:
filter_stop_words(lowercase(rm_punc(text)))

['hello', 'peeps', 'test']

In [14]:
from nltk.stem.porter import PorterStemmer

# create the stemmer once and reuse it for every question
porter = PorterStemmer()

def stemming(tokens):
    """Return the Porter stem of each token."""
    return [porter.stem(word) for word in tokens]

In [33]:
stemming(filter_stop_words(lowercase(rm_punc(text))))

['hello', 'peep', 'test']

In [15]:
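def clean_text(text):
    """Run the full cleaning pipeline; return NaN for missing (non-string) questions."""
    # missing questions are read in as float NaN, which cannot be tokenized
    if not isinstance(text, str):
        return np.nan
    return stemming(filter_stop_words(lowercase(rm_punc(text))))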

In [18]:
df['q1_clean'] = df['question1'].transform(clean_text)


In [19]:
df.dropna()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate | q1_clean |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 | [step, step, guid, invest, share, market, india] |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 | [stori, kohinoor, diamond] |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 | [increas, speed, internet, connect, use, vpn] |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 | [mental, lone, solv] |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 | [one, dissolv, water, quikli, sugar, salt, met... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 404285 | 404285 | 433578 | 379845 | How many keywords are there in the Racket prog... | How many keywords are there in PERL Programmin... | 0 | [mani, keyword, racket, program, languag, late... |
| 404286 | 404286 | 18840 | 155606 | Do you believe there is life after death? | Is it true that there is life after death? | 1 | [believ, life, death] |
| 404287 | 404287 | 537928 | 537929 | What is one coin? | What's this coin? | 0 | [one, coin] |
| 404288 | 404288 | 537930 | 537931 | What is the approx annual cost of living while... | I am having little hairfall problem but I want... | 0 | [approx, annual, cost, live, studi, uic, chica... |


In [58]:
df['q2_clean'] = df['question2'].transform(clean_text)

In [59]:
df[df['q2_clean'].isnull()]

| | id | qid1 | qid2 | question1 | question2 | is_duplicate | q1_clean | q2_clean |
|---|---|---|---|---|---|---|---|---|
| 105780 | 105780 | 174363 | 174364 | How can I develop android app? | | 0 | [develop, android, app] | |
| 201841 | 201841 | 303951 | 174364 | How can I create an Android app? | | 0 | [creat, android, app] | |


### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
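
Two of the features listed above, word count and the number of words shared by both questions, can be computed directly from cleaned token lists. The following is a minimal sketch rather than this notebook's own implementation: `pair_features` and the feature names are hypothetical, the second token list is made up, and the Jaccard ratio is an extra feature added here as an assumption.

```python
def pair_features(tokens1, tokens2):
    """Compute simple overlap features for two token lists (e.g. cleaned questions)."""
    set1, set2 = set(tokens1), set(tokens2)
    common = set1 & set2
    union = set1 | set2
    return {
        "q1_word_count": len(tokens1),
        "q2_word_count": len(tokens2),
        "common_words": len(common),
        # Jaccard ratio: shared unique words / all unique words (0.0 if both are empty)
        "jaccard": len(common) / len(union) if union else 0.0,
    }


# Token lists in the style of the q1_clean column; the second one is illustrative
features = pair_features(["believ", "life", "death"], ["true", "life", "death"])
print(features)
```

A function like this could be applied row by row once both cleaned columns are populated, giving numeric inputs for any of the models listed in the next section.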

### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [92]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def check_similarity(sentences, model):
    
    sentence_embeddings = model.encode(sentences)
    result = cosine_similarity(
        [sentence_embeddings[0]],
        sentence_embeddings[1:]
    )
    return result

models = ['distilbert-base-nli-stsb-quora-ranking',
          'all-MiniLM-L12-v2',
          'all-distilroberta-v1',
          'multi-qa-mpnet-base-dot-v1',
          'all-mpnet-base-v2',
          'paraphrase-MiniLM-L3-v2']

# number of questions to check before writing results to file; lets me see progress as the script runs
step = 1000
# ceiling division: the final partial chunk gets its own cycle and no empty cycle is run
max_cycles = -(-df.shape[0] // step)

# a few questions are missing (NaN); replace them with empty strings so model.encode() only sees str
questions = df[['question1', 'question2']].fillna('')

for model in models:
    print(model)
    STmodel = SentenceTransformer(model)

    for cycle in range(max_cycles):
        start = cycle * step
        stop = start + step
        # slicing past the end of the frame is safe, so the last cycle needs no special case
        result = [check_similarity(sentences, STmodel)[0][0]
                  for sentences in questions[start:stop].values]
        # cycle is 0-indexed; max_cycles is the total number of cycles
        print(cycle, '/', max_cycles)

        df_result = pd.DataFrame({'comparision': result})
        df_result.to_csv(f'{model}.csv', mode='a', header=False, index=False)
    


distilbert-base-nli-stsb-quora-ranking


0 / 405
...
64 / 405


KeyboardInterrupt: 
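
For reference, the score that `check_similarity` returns for a question pair is the cosine of the angle between the two embedding vectors. The following plain-Python sketch shows that computation, using short made-up vectors in place of real sentence embeddings. The helper name `cosine` and the zero-vector convention are assumptions made for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero-length vectors
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
print(same_direction, orthogonal)
```

Vectors pointing the same way score close to 1 and perpendicular vectors score 0, which is why a threshold near 1 is used later to flag likely duplicates.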

In [None]:
from sklearn.metrics import recall_score, precision_score, accuracy_score

model = 'distilbert-base-nli-stsb-quora-ranking'

df = pd.read_csv('data/train.csv')
results = pd.read_csv(f'{model}.csv', header=None, names=['comparision'])

In [None]:
df.shape

In [None]:
results.shape

In [None]:
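THRESHOLD = 0.85  # cosine similarity above this value counts as a duplicate

def is_duplicate(x):
    return 1 if float(x) > THRESHOLD else 0

df[model] = results['comparision']
# if the encoding run was interrupted, results has fewer rows than df;
# keep only the rows that actually received a score
df = df[df[model].notna()].copy()
df[model] = df[model].transform(is_duplicate)
df['correct_pred'] = np.where(df['is_duplicate'] == df[model], 1, 0)

# score against df itself; df_test is not defined until a later cell
print('accuracy', accuracy_score(df['is_duplicate'], df[model]))
print('recall', recall_score(df['is_duplicate'], df[model]))
print('precision', precision_score(df['is_duplicate'], df[model]))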
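
The 0.85 cut-off above is a fixed choice. One way to examine it is to sweep several thresholds and compare precision and recall at each one. The sketch below uses tiny made-up scores and labels, not results from this dataset, and `precision_recall` is a hypothetical helper written for this illustration.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores above threshold are predicted as duplicates."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up similarity scores and ground-truth labels, for illustration only
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
toy_labels = [1, 1, 0, 1, 0, 0]

sweep = {t: precision_recall(toy_labels, toy_scores, t) for t in (0.5, 0.75, 0.85)}
for t, (p, r) in sweep.items():
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold trades recall for precision. The same sweep over the real `comparision` scores would show where that trade-off sits for each model.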

In [None]:
df.to_csv(f'test-{model}.csv')

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

df_test = pd.read_csv(f'test-{model}.csv')
# rows are true labels, columns are predicted labels
ConfusionMatrixDisplay.from_predictions(df_test['is_duplicate'], df_test[model])
plt.show()

### Comparing New Questions to All Past Questions

In [102]:
# encode all questions and store in file

from sentence_transformers import SentenceTransformer
import pickle

model = SentenceTransformer('all-MiniLM-L6-v2')
# drop missing questions so that only strings are passed to model.encode()
sentences = df['question1'].dropna().tolist()

embeddings = model.encode(sentences)

#Store sentences & embeddings on disc
with open('embeddings.pkl', "wb") as fOut:
    pickle.dump({'sentences': sentences, 'embeddings': embeddings}, fOut, protocol=pickle.HIGHEST_PROTOCOL)


In [132]:
#Load sentences & embeddings from disc
with open('embeddings.pkl', "rb") as fIn:
    stored_data = pickle.load(fIn)
    stored_sentences = stored_data['sentences']
    stored_embeddings = stored_data['embeddings']

In [135]:
new_sentence = 'should i buy tiago'
new_embedding = model.encode(new_sentence)

In [147]:
def check_new_question(new_sentence, threshold=0.8):
    """Return every stored question whose cosine similarity to new_sentence exceeds threshold."""
    # Load sentences & embeddings from disk
    with open('embeddings.pkl', "rb") as fIn:
        stored_data = pickle.load(fIn)
    stored_sentences = stored_data['sentences']
    stored_embeddings = stored_data['embeddings']

    new_embedding = model.encode(new_sentence)
    # one vectorized call scores the new question against all stored embeddings at once
    scores = cosine_similarity([new_embedding], stored_embeddings)[0]
    similar_questions = [stored_sentences[i] for i in np.flatnonzero(scores > threshold)]

    return similar_questions

In [148]:
results = check_new_question('how can i be happy?')

In [149]:
len(results)

42
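
A fixed threshold can return dozens of near-identical matches, as the 42 results here show. An alternative is to keep the similarity scores and return only the k highest-scoring questions. The sketch below uses the standard library and made-up scores in place of real model output. `top_k_similar` is a hypothetical helper, not part of this notebook.

```python
import heapq


def top_k_similar(questions, scores, k=3):
    """Return the k (score, question) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(scores, questions))


# Made-up scores for illustration; real scores would come from cosine_similarity
toy_questions = ["How do I become happy?", "How can I be happy again?",
                 "What is one coin?", "How can I be happy?"]
toy_scores = [0.91, 0.88, 0.12, 1.00]

best = top_k_similar(toy_questions, toy_scores, k=2)
print(best)
```

Returning scored pairs also makes it possible to show the user how close each suggested question is, instead of a flat list.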

In [150]:
results

['What is the best way to make ourselves happy?',
 'What should I do? How can I be happy?',
 'How can I ever become happy?',
 'What are some ways to be happy all the time?',
 'How can I be happy?',
 'How do I become happy?',
 'What are the simple ways to be happy?',
 'How can I stay positive and happy all the time?',
 'How can we be happy in life?',
 'How do I be happier?',
 'How can I be happy?',
 'How can I be happy? (read details)',
 'How can I make my self happy at all situation?',
 'What should I do? How can I be happy?',
 'What is the possible way to be happy in personal life?',
 'How can you be happy in your life?',
 'How do I become happier with myself?',
 'How can I become a happier person?',
 'How can I be happy for no reason?',
 'How can I ever become happy?',
 'How do I become happier with myself?',
 'What should I do when I am so happy?',
 'How do I become happy in life?',
 'How can I stay positive and happy all the time?',
 'How can I be happy again?',
 'What is the best 