<a href="https://colab.research.google.com/github/Samgoles/DuplicateQuestionQuaro/blob/main/Copy_of_mini_project_V.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Identifying Duplicate Questions

Over 100 million people visit Quora every month, so it's no surprise that many people ask similar (or identical) questions. Multiple questions with the same intent make readers spend extra time searching for the best answer, and lead writers to answer several versions of the same question. Quora uses a random forest model to identify duplicate questions, which gives a better experience to active seekers and writers and offers more value to both groups in the long term.
Follow the steps outlined below to build an appropriate classifier model.


Steps:
- Download data
- Exploration
- Cleaning
- Feature Engineering
- Modeling

By the end of this project you should have **a presentation that describes the model you built** and its **performance**. 


In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv("/content/drive/MyDrive/train.csv")

In [None]:
df.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


In [None]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

In [None]:
df.dropna(inplace=True)

In [None]:
df.isnull().sum()

id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64

#### Note
There is no designated test.csv file. The train.csv file is the entire dataset. Part of the data in the train.csv file should be set aside to act as the final testing data.

In [None]:
# train test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[['question1','question2']], df['is_duplicate'], 
                                                    test_size=0.20, random_state=68)
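
The split above samples rows purely at random. A possible refinement, sketched here on a made-up toy frame, is to pass `stratify` so that both splits keep the same share of duplicates. The toy rows and the 37% duplicate rate are assumptions chosen for illustration, not values read from train.csv.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train.csv: 1000 question pairs, 370 labelled as duplicates (assumed rate)
toy = pd.DataFrame({
    "question1": [f"question a{i}" for i in range(1000)],
    "question2": [f"question b{i}" for i in range(1000)],
    "is_duplicate": [1] * 370 + [0] * 630,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["question1", "question2"]], toy["is_duplicate"],
    test_size=0.20, random_state=68,
    stratify=toy["is_duplicate"],  # keep the class ratio the same in both splits
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Without `stratify` the two duplicate rates can drift apart by chance, which matters more on small datasets than on one of this size.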

### Exploration

In [None]:
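# Rough count of individual questions in the training split:
# 404,289 rows x 80% kept for training x 2 questions per row
404289*0.8*2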

646862.4
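
The exploration step above is limited to a size estimate. A minimal sketch of two further checks that are often useful for this kind of data, the class balance and the question lengths, is shown below on a made-up frame. The four rows are hypothetical and are not taken from train.csv.

```python
import pandas as pd

# Hypothetical question pairs, for illustration only
toy = pd.DataFrame({
    "question1": ["How do I learn Python?", "What is AI?",
                  "Why is the sky blue?", "How to cook rice?"],
    "question2": ["What is the best way to learn Python?", "What is machine learning?",
                  "Why does the sky look blue?", "Which fish live in salt water?"],
    "is_duplicate": [1, 0, 1, 0],
})

# Share of duplicate vs. non-duplicate pairs
class_balance = toy["is_duplicate"].value_counts(normalize=True)

# Words per question, useful for spotting empty or extremely long questions
q1_words = toy["question1"].str.split().str.len()
q2_words = toy["question2"].str.split().str.len()

print(class_balance)
print(q1_words.describe())
# Average absolute difference in length, split by label
print((q1_words - q2_words).abs().groupby(toy["is_duplicate"]).mean())
```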

### Cleaning

- Tokenization
- Stopwords cleaning
- Removing punctuation
- Normalizing
- Stemming
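
The order of these steps matters. The sketch below walks one question through punctuation removal, optional lowercasing, tokenization and stop word removal, and shows that a capitalized stop word such as "What" survives when stop words are removed before the text is lowercased. The stop word set is a tiny hand-picked assumption (NLTK's English list is much longer), and stemming or lemmatization is left out to keep the example dependency-free.

```python
import string

# Tiny hand-picked stop word list, for illustration only
STOP_WORDS = {"what", "is", "the", "to", "a", "of", "in"}

def clean(text, lowercase_first=True):
    # 1. strip punctuation
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # 2. normalize case (optional, to show its effect on stop word removal)
    if lowercase_first:
        text = text.lower()
    # 3. tokenize on whitespace and drop stop words (case-sensitive match)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    # 4. lowercase whatever is left so both variants return comparable tokens
    return [tok.lower() for tok in tokens]

question = "What is the best way to learn Python?"
print(clean(question, lowercase_first=True))   # ['best', 'way', 'learn', 'python']
print(clean(question, lowercase_first=False))  # ['what', 'best', 'way', 'learn', 'python']
```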

In [None]:

from gensim.utils import simple_preprocess
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
!python3 -m nltk.downloader stopwords
!python -m nltk.downloader all
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

def punc_remove(docs):
    review_no_puncs = []
    for review in docs:
        review_no_punc = ''.join([char for char in review if char not in string.punctuation])
        review_no_puncs.append(review_no_punc)
    return review_no_puncs

def stop_word_remove(docs):
    review_no_stops = []
    for review in docs:
        tokens = review.split()
        review_no_stop = ' '.join([word for word in tokens if word not in stop_words])
        review_no_stops.append(review_no_stop)
    return review_no_stops

def lemmitization(docs):
    review_lemms = []
    for review in docs:
        tokens = review.split()
        review_lemm = ' '.join([lemmatizer.lemmatize(word) for word in tokens])
        review_lemms.append(review_lemm)
    return review_lemms

def tokenize(texts):
    tokenized = []
    for doc in texts:
        tokenized.append(simple_preprocess(doc, min_len=2))
    return tokenized

def to_word_string(tokens):
    texts = []
    for doc in tokens:
        texts.append(' '.join([word for word in doc]))
    return texts

def preprocess(texts):
    # punctuation -> stop words -> lemmatization -> lowercase tokens
    texts = punc_remove(texts)
    texts = stop_word_remove(texts)
    texts = lemmitization(texts)
    texts = tokenize(texts)
    return texts

def make_word_list(*args):
    # Collect the unique words across one or more lists of token lists
    word_list = set()
    for docs in args:
        for tokens in docs:
            word_list.update(tokens)
    return list(word_list)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading collection 'all'

In [None]:
m=pd.concat([X_train['question1'], X_train['question2']])

In [None]:
qs_tokens = preprocess(m)

In [None]:
qs_tokens

[['what', 'think', 'donald', 'trump', 'pick'],
 ['going', 'prison'],
 ['what', 'best', 'way', 'initiate', 'conversation', 'stranger'],
 ['what', 'different', 'type', 'coffee', 'drink'],
 ['what',
  'would',
  'happen',
  'ceiling',
  'fan',
  'set',
  'turn',
  'clockwise',
  'vice',
  'versa'],
 ['which', 'san', 'francisco', 'restaurant', 'best', 'okonomiyaki'],
 ['what',
  'advantage',
  'disadvantage',
  'child',
  'whose',
  'parent',
  'moved',
  'every',
  'couple',
  'year'],
 ['how', 'someone', 'new', 'san', 'francisco', 'meet', 'people'],
 ['what', 'ups', 'package', 'isnt', 'delivered', 'time'],
 ['what', 'best', 'flavour', 'condom'],
 ['what', 'repercussion', 'banning', 'rs', 'rs', 'note', 'indian', 'economy'],
 ['who', 'grateful', 'life'],
 ['what', 'government', 'old', 'note', 'deposited', 'bank', 'everyday'],
 ['is',
  'illegal',
  'use',
  'someone',
  'el',
  'picture',
  'photo',
  'random',
  'person',
  'found',
  'internet',
  'facebook',
  'profile',
  'picture'],
 

In [None]:
qs = to_word_string(qs_tokens)

In [None]:
qs

['what think donald trump pick',
 'going prison',
 'what best way initiate conversation stranger',
 'what different type coffee drink',
 'what would happen ceiling fan set turn clockwise vice versa',
 'which san francisco restaurant best okonomiyaki',
 'what advantage disadvantage child whose parent moved every couple year',
 'how someone new san francisco meet people',
 'what ups package isnt delivered time',
 'what best flavour condom',
 'what repercussion banning rs rs note indian economy',
 'who grateful life',
 'what government old note deposited bank everyday',
 'is illegal use someone el picture photo random person found internet facebook profile picture',
 'what nikola teslas iq',
 'why india search engine like baidu china yandex russia',
 'what difference following english word',
 'how use network sim mobile',
 'what nominal diameter pipe',
 'do artificial neural network work kind data',
 'whats best way ship philippines us',
 'can lumia xl battery fit lumia',
 'how make carbo

In [None]:
# Character length of each cleaned question, in the original order
h = pd.Series([len(q) for q in qs]).to_numpy()
h

array([28, 12, 44, ..., 28, 31, 51])

In [None]:
unique_word_list = make_word_list(qs_tokens)

In [None]:
len(unique_word_list)

81587

In [None]:
q1_train_tokens = preprocess(X_train['question1'])
q2_train_tokens = preprocess(X_train['question2'])
q1_test_tokens = preprocess(X_test['question1'])
q2_test_tokens = preprocess(X_test['question2'])

In [None]:
q1_train_str = to_word_string(q1_train_tokens)
q2_train_str = to_word_string(q2_train_tokens)
q1_test_str = to_word_string(q1_test_tokens)
q2_test_str = to_word_string(q2_test_tokens)

### Feature Engineering

- tf-idf
- word2vec
- word count
- number of the same words in both questions
- ....
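
The "word count" and "number of the same words in both questions" items can be sketched in a few lines of plain Python. The function name `pair_features` and the extra Jaccard ratio are assumptions added for illustration; the inputs are token lists like those produced by `preprocess`.

```python
def pair_features(q1_tokens, q2_tokens):
    """Hand-crafted features for one pair of tokenized questions."""
    s1, s2 = set(q1_tokens), set(q2_tokens)
    shared = s1 & s2
    union = s1 | s2
    return {
        "q1_word_count": len(q1_tokens),
        "q2_word_count": len(q2_tokens),
        "shared_words": len(shared),
        # Jaccard overlap: shared unique words divided by all unique words
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

feats = pair_features(["going", "prison"], ["why", "go", "prison"])
print(feats)
```

Each dictionary can become one row of a DataFrame, so these columns can sit next to the similarity feature in the model input.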

In [None]:
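from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer()
def cosine_sim(text1, text2):
    # Fit TF-IDF on just this pair; rows are L2-normalized, so the
    # off-diagonal entry of tfidf * tfidf.T is the cosine similarity
    tfidf = vect.fit_transform([text1, text2])
    return (tfidf * tfidf.T).toarray()[0, 1]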
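
Calling `cosine_sim` once per pair refits the vectorizer for every row. A possible alternative, sketched below, fits TF-IDF once on all questions and computes every pair's similarity in a single sparse operation. Note that this is a different technique from the per-pair fit above: the IDF weights come from the whole corpus rather than from two sentences, so the scores will not be identical. The helper name `pairwise_cosine` and the sample sentences are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def pairwise_cosine(texts1, texts2):
    # Learn one vocabulary and one set of IDF weights from all questions
    vect = TfidfVectorizer()
    vect.fit(list(texts1) + list(texts2))
    a = vect.transform(texts1)
    b = vect.transform(texts2)
    # TfidfVectorizer L2-normalizes rows by default, so the row-wise
    # dot product equals the cosine similarity
    return np.asarray(a.multiply(b).sum(axis=1)).ravel()

q1 = ["how do i learn python", "what is the capital of france"]
q2 = ["how do i learn python", "best pizza recipe"]
sims = pairwise_cosine(q1, q2)
print(sims)
```

Identical questions score 1 and questions with no words in common score 0, as with the per-pair version.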

In [None]:
#text1='Hamed has brought me a delicious food'
#text2='my husband cooked me a tasty meal'

In [None]:
#tfidf = vect.fit_transform([text1, text2])
#print((tfidf * tfidf.T).toarray()[0,1])

In [None]:
train_texts1 = X_train.question1.values.tolist()
train_texts2 = X_train.question2.values.tolist()

In [None]:
X_train

| | question1 | question2 |
|---|---|---|
| 244579 | What do you think about Donald Trump pick? | What do intelligent people think about Donald ... |
| 71837 | Going to prison? | Why did you go to prison? |
| 352948 | What is the best way to initiate conversation ... | What's the best way to initiate a conversation... |
| 78115 | What are all the different types of coffee dri... | Why do I often pee when I drink coffee or tea? |
| 252215 | What would happen if the ceiling fan were set ... | Which way should a ceiling fan turn in the sum... |
| ... | ... | ... |
| 366835 | How does the quote rate for concrete brick wal... | How do I calculate number of bricks and cement... |
| 338158 | What are your expectations for Christopher Nol... | What expectations do you have for Christopher ... |
| 52132 | Why do I keep dreaming about my ex boyfriend? | Why do I keep dreaming about my ex husband? |
| 112040 | How much does a full sleeve tattoo (from wrist... | How much would a tattoo like this cost? |


In [None]:
train_texts1


['What do you think about Donald Trump pick?',
 'Going to prison?',
 'What is the best way to initiate conversation with a stranger?',
 'What are all the different types of coffee drinks?',
 'What would happen if the ceiling fan were set to turn clockwise or vice versa?',
 'Which San Francisco restaurant has the best okonomiyaki?',
 'What are some of the advantages, and disadvantages, of being a child whose parents moved every couple of years?',
 'How can someone new to San Francisco meet people?',
 "What does UPS do if a package isn't delivered on time?",
 'What is the best flavour of condom?',
 'What will be the repercussions of banning Rs 500 and Rs 1000 notes on Indian economy?',
 'Who are the most grateful about life?',
 'What will government do with the old 500/1000 notes that is being deposited in the banks everyday?',
 "Is it illegal to use someone else's picture (a photo of a random person you found on the internet) as your Facebook profile picture?",
 "What was Nikola Tesla's

In [None]:
X_train_targets = []
for text1, text2 in zip(train_texts1, train_texts2):
    #print(text1)
    #print(text2)
    X_train_targets.append(cosine_sim(text1, text2))
    #print()

In [None]:
X_train = X_train.reset_index(drop=True)

In [None]:
X_train = pd.concat([X_train, pd.Series(X_train_targets, name='cos_sim')], axis=1)

In [None]:
test_texts1 = X_test.question1.values.tolist()
test_texts2 = X_test.question2.values.tolist()

In [None]:
X_test_targets = []
for text1, text2 in zip(test_texts1, test_texts2):
    X_test_targets.append(cosine_sim(text1, text2))

In [None]:
X_test = X_test.reset_index(drop=True)

X_test = pd.concat([X_test, pd.Series(X_test_targets, name='cos_sim')], axis=1)

In [None]:
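from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words counts over the 1,500 most frequent words of the cleaned training questions
vectorizer = CountVectorizer(min_df=1, lowercase=False, max_features=1500)
vectorizer.fit(qs)

CountVectorizer(lowercase=False, max_features=1500)

# Note: the vocabulary comes from the cleaned, lowercased strings in qs, but the raw
# questions are transformed here with lowercase=False, so capitalized words are not counted
x_train_q1 = vectorizer.transform(X_train.question1)
x_train_q2 = vectorizer.transform(X_train.question2)
x_test_q1 = vectorizer.transform(X_test.question1)
x_test_q2 = vectorizer.transform(X_test.question2)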

In [None]:
import numpy as np
from scipy.sparse import hstack
from scipy import sparse
X_train_tfidf = hstack((x_train_q1,x_train_q2, 
                        sparse.csr_matrix(np.array(X_train_targets).reshape(-1, 1))))
X_test_tfidf = hstack((x_test_q1,x_test_q2, 
                       sparse.csr_matrix(np.array(X_test_targets).reshape(-1, 1))))

 

In [None]:
 

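from gensim.models import Word2Vec
# Word2Vec expects an iterable of token lists, so pass qs_tokens rather than the joined strings in qs
model = Word2Vec(sentences=qs_tokens, vector_size=100, window=5, min_count=1, workers=4)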

word_vectors = model.wv

word_vectors

<gensim.models.keyedvectors.KeyedVectors at 0x7f92b93331f0>
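
The trained word vectors are not yet turned into features. One common option, sketched here as an assumption rather than as the notebook's method, is to average the vectors of a question's tokens and compare the two sentence vectors by cosine similarity. The 3-dimensional `toy_vectors` are hypothetical; with a trained model the lookup would use `model.wv[word]` instead.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for trained word vectors
toy_vectors = {
    "best":   np.array([1.0, 0.0, 0.0]),
    "way":    np.array([0.0, 1.0, 0.0]),
    "learn":  np.array([0.0, 0.0, 1.0]),
    "python": np.array([1.0, 1.0, 0.0]),
}

def sentence_vector(tokens, vectors, dim=3):
    # Average the vectors of known tokens; fall back to zeros if none are known
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

v1 = sentence_vector(["best", "way", "learn", "python"], toy_vectors)
v2 = sentence_vector(["learn", "python"], toy_vectors)
print(v1, v2, cosine(v1, v2))
```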

In [None]:
print(X_test_tfidf.toarray())

[[0.         0.         0.         ... 0.         0.         0.11671774]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.63923062]
 ...
 [0.         0.         0.         ... 0.         0.         0.06502982]
 [0.         0.         0.         ... 0.         0.         0.33609693]
 [0.         0.         0.         ... 0.         0.         0.71681174]]


### Modeling

Different modeling techniques can be used:

- logistic regression
- XGBoost
- LSTMs
- etc

In [None]:
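import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import hstack
from scipy import sparse

def main_process(X):
    # Drop rows with a missing question (modifies X in place)
    X.dropna(inplace=True)

    # Build the vocabulary from the cleaned questions.
    # Note: the vectorizer is refit on every call, so the columns produced
    # at predict time are not guaranteed to match those seen during fit.
    tokens = preprocess(pd.concat([X['question1'], X['question2']]))
    qs = to_word_string(tokens)

    vectorizer = CountVectorizer(min_df=1, lowercase=False, max_features=3000)
    vectorizer.fit(qs)

    # Word counts for each question, taken from the raw text
    x_q1 = vectorizer.transform(X.question1)
    x_q2 = vectorizer.transform(X.question2)

    train_texts1 = X.question1.values.tolist()
    train_texts2 = X.question2.values.tolist()

    # TF-IDF cosine similarity of each pair, appended as one extra column
    X_targets = []
    for text1, text2 in zip(train_texts1, train_texts2):
        X_targets.append(cosine_sim(text1, text2))

    X_tfidf = hstack((x_q1, x_q2,
                      sparse.csc_matrix(np.array(X_targets).reshape(-1, 1))))
    return X_tfidf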
    

In [None]:
from scipy.sparse import coo_matrix, hstack
A = coo_matrix([[1, 2], [3, 4]])
B = coo_matrix([[5], [6]])
#hstack([A,B]).toarray()
print(hstack([A,B]).toarray())

[[1 2 5]
 [3 4 6]]


In [None]:


from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

prep = FunctionTransformer(main_process)
model = LogisticRegression(max_iter=1700)

pipe = Pipeline([
    ('pre', prep),
    ('model', model)
])
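
Wrapping a plain function in `FunctionTransformer` gives the pipeline no place to remember a vocabulary between `fit` and `predict`. One possible alternative, sketched here under assumptions, is a small transformer class whose `fit` learns the vocabulary and whose `transform` reuses it. `PairVectorizer`, its shared-word column and the toy frames are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer

class PairVectorizer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: word counts for both questions plus an overlap count."""

    def __init__(self, max_features=3000):
        self.max_features = max_features

    def fit(self, X, y=None):
        # Learn the vocabulary once, from the training questions only
        self.vectorizer_ = CountVectorizer(max_features=self.max_features)
        self.vectorizer_.fit(list(X["question1"]) + list(X["question2"]))
        return self

    def transform(self, X):
        # Reuse the fitted vocabulary so train and test share the same columns
        q1 = self.vectorizer_.transform(X["question1"])
        q2 = self.vectorizer_.transform(X["question2"])
        # Number of word occurrences the two questions have in common
        shared = np.asarray(q1.minimum(q2).sum(axis=1))
        return hstack([q1, q2, csr_matrix(shared)]).tocsr()

train = pd.DataFrame({"question1": ["how to learn python", "what is ai"],
                      "question2": ["best way to learn python", "who won the world cup"]})
test = pd.DataFrame({"question1": ["how to learn java"],
                     "question2": ["how to learn rust"]})

pv = PairVectorizer().fit(train)
X_tr_demo, X_te_demo = pv.transform(train), pv.transform(test)
print(X_tr_demo.shape, X_te_demo.shape)
```

Because the class follows the scikit-learn estimator interface, it could take the place of the `'pre'` step in a `Pipeline`.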



In [None]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('pre',
                 FunctionTransformer(func=<function main_process at 0x7f925dd7f5e0>)),
                ('model', LogisticRegression(max_iter=1700))])

In [None]:
y_pred = pipe.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix


print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))
print('Confusion Matrix : ') 
print(confusion_matrix(y_test, y_pred))



Accuracy:  0.6347671226100077
Recall:  0.3526658866789236
Precision:  0.5092436163537192
F1 Score:  0.41673250118502136
Confusion Matrix : 
[[40776 10167]
 [19365 10550]]
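
The reported scores can be re-derived from the confusion matrix, which also gives a baseline for comparison. The sketch below uses the four counts printed above (scikit-learn puts true labels in rows and predictions in columns) and adds the accuracy of always predicting "not duplicate". It suggests the logistic regression pipeline is less than half a percentage point above that baseline.

```python
# Confusion matrix as printed above: rows are true labels, columns are predictions
tn, fp = 40776, 10167
fn, tp = 19365, 10550

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting "not duplicate", the majority class
baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"majority-class baseline={baseline:.4f}")
```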


In [None]:
model = LogisticRegression(max_iter=1700)
model.fit(X_train_tfidf, y_train)

KeyboardInterrupt: 

In [None]:
X_train_tfidf

<323429x3001 sparse matrix of type '<class 'numpy.float64'>'
	with 3857931 stored elements in COOrdinate format>

In [None]:


y_pred = model.predict(X_test_tfidf)

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))
print('Confusion Matrix : ') 
print(confusion_matrix(y_test, y_pred))

In [None]:
from xgboost import XGBClassifier

xgb = XGBClassifier(objective='reg:squarederror', n_estimators=500, use_label_encoder=False, 
                    max_depth=3)
xgb.fit(X_train_tfidf, y_train)
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

y_pred = xgb.predict(X_test_tfidf)

print('Accuracy: ', accuracy_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('F1 Score: ', f1_score(y_test, y_pred))
print('Confusion Matrix : ') 
print(confusion_matrix(y_test, y_pred))

Accuracy:  0.7367360063320884
Recall:  0.5964900551562761
Precision:  0.6594235033259424
F1 Score:  0.6263799912242212
Confusion Matrix : 
[[41727  9216]
 [12071 17844]]


In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
cross_val_score(xgb, X_train_tfidf, y_train)


array([0.75970071, 0.75829391, 0.75868039, 0.75611415, 0.75638865])
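
`cross_val_score` returns one score per fold, here the classifier's default score, which for scikit-learn style classifiers is accuracy. A short sketch of summarizing the five values printed above as a mean and a spread:

```python
import statistics

# Fold scores copied from the output above
fold_scores = [0.75970071, 0.75829391, 0.75868039, 0.75611415, 0.75638865]

mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)  # sample standard deviation across folds

print(f"CV accuracy: {mean_score:.4f} +/- {std_score:.4f}")
```

A small spread across folds indicates that the score does not depend much on which rows end up in which fold.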

In [None]:
X_train

| | question1 | question2 | cos_sim | cos_sim.1 |
|---|---|---|---|---|
| 0 | What do you think about Donald Trump pick? | What do intelligent people think about Donald ... | 0.602975 | 0.602975 |
| 1 | Going to prison? | Why did you go to prison? | 0.318784 | 0.318784 |
| 2 | What is the best way to initiate conversation ... | What's the best way to initiate a conversation... | 0.905550 | 0.905550 |
| 3 | What are all the different types of coffee dri... | Why do I often pee when I drink coffee or tea? | 0.059514 | 0.059514 |
| 4 | What would happen if the ceiling fan were set ... | Which way should a ceiling fan turn in the sum... | 0.314367 | 0.314367 |
| ... | ... | ... | ... | ... |
| 323424 | How does the quote rate for concrete brick wal... | How do I calculate number of bricks and cement... | 0.101123 | 0.101123 |
| 323425 | What are your expectations for Christopher Nol... | What expectations do you have for Christopher ... | 0.421136 | 0.421136 |
| 323426 | Why do I keep dreaming about my ex boyfriend? | Why do I keep dreaming about my ex husband? | 0.779915 | 0.779915 |
| 323427 | How much does a full sleeve tattoo (from wrist... | How much would a tattoo like this cost? | 0.300698 | 0.300698 |


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Every value in a parameter grid must be a list, including single options
param_grid = [
    {'model': [SVC(probability=True)],
     'model__C': [0.1, 0.5, 1.0],
     'model__gamma': [0.1, 0.5, 1],
    },
    {'model': [LogisticRegression()],
     'model__C': [0.1, 0.5, 1.0],
    },
    {'model': [XGBClassifier()],
     'model__objective': ['binary:logistic'],  # binary target, so use a classification objective
     'model__n_estimators': [10, 20, 30],
     'model__use_label_encoder': [False]
    },
    {'model': [RandomForestClassifier(max_depth=4, n_estimators=30)],
     'model__max_depth': [2, 4, 8],
     'model__n_estimators': [1, 5, 10],
    }
]

# GridSearchCV needs a pipeline instance, not the Pipeline class; the
# placeholder 'model' step is replaced by each candidate in param_grid
search_pipe = Pipeline([('model', LogisticRegression())])
grid = GridSearchCV(search_pipe, param_grid=param_grid, scoring='roc_auc', cv=5, n_jobs=-1, refit=True)
grid.fit(X_train_tfidf, y_train)
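
Searching over several model families with one `GridSearchCV` relies on a pipeline whose `'model'` step is swapped by each grid entry. A small sketch of that pattern follows, on synthetic data so it finishes in seconds. The data from `make_classification`, the names `demo_pipe` and `demo_grid`, and the reduced grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the engineered features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# The 'model' step is a placeholder that each grid entry replaces
demo_pipe = Pipeline([("model", LogisticRegression())])

demo_grid = [
    {"model": [LogisticRegression(max_iter=500)], "model__C": [0.1, 1.0]},
    {"model": [RandomForestClassifier(random_state=0)],
     "model__max_depth": [2, 4], "model__n_estimators": [5, 10]},
]

search = GridSearchCV(demo_pipe, param_grid=demo_grid, scoring="roc_auc", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.best_estimator_` then holds the refitted pipeline with the winning model, ready for `predict` on held-out data.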

In [None]:
from keras.models import Sequential
from keras import layers

input_dim = 4  # Number of features


embedding_dim = 30

model = Sequential()
#model.add(layers.Embedding(input_dim=81587 ,#vocab_size, 
 #                          output_dim=embedding_dim, 
  #                         input_length=100))
#model.add(layers.Flatten())
model = Sequential()
model.add(layers.Dense(10, input_dim=input_dim, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', 
            optimizer='adam', 
              metrics=['accuracy'])
model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_6 (Dense)             (None, 10)                50        
                                                                 
 dense_7 (Dense)             (None, 1)                 11        
                                                                 
Total params: 61
Trainable params: 61
Non-trainable params: 0
_________________________________________________________________


In [None]:


# compile() configures the model in place and returns None, so there is nothing to assign
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
pipe = Pipeline([
    ('pre', prep),
    ('model', model)
])

In [None]:
pipe.fit(X_train, y_train)

ValueError: ignored

In [None]:
parameters = {
	'model__epochs': 50,
	'model__verbose':0,
	'model__batch_size': 40
}

In [None]:
history = model.fit(X_test_tfidf, y_train,
                  epochs=100,
                   verbose=False,
                    batch_size=10)

ValueError: ignored

In [None]:
X_train_tfidf

NameError: ignored

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(qs)

X_train = tokenizer.texts_to_sequences(sentences_train)
# X_test = tokenizer.texts_to_sequences(sentences_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index

NameError: name 'sentences_train' is not defined

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

maxlen = 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
# X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)
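
To make the effect of `padding='post'` with a fixed `maxlen` concrete, the sketch below reimplements the idea in NumPy. It is an illustration, not the Keras implementation: short sequences are right-padded with zeros, and long ones keep their last `maxlen` ids, since Keras truncates from the front unless `truncating='post'` is passed. The function name `pad_post` is hypothetical.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    # Right-pad short sequences with `value`; keep the last `maxlen` ids of long ones
    out = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        out[row, :len(trimmed)] = trimmed
    return out

padded = pad_post([[4, 9], [7, 1, 3, 8, 2, 6]], maxlen=4)
print(padded)
# [[4 9 0 0]
#  [3 8 2 6]]
```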

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers

vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 50


model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size, 
                           output_dim=embedding_dim, 
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

In [None]:
history = model.fit(X_train, y_train,
                    epochs=3,
                    verbose=False,
                    validation_data=(X_test, y_test),
                    batch_size=10)
loss, accuracy = model.evaluate(X_train, y_train, verbose=False)
print("Training Accuracy: {:.4f}".format(accuracy))
loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print("Testing Accuracy:  {:.4f}".format(accuracy))
plot_history(history)
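
`plot_history` is called above but never defined in the notebook. A hypothetical stand-in is sketched below: it assumes a Keras-style object whose `.history` attribute maps metric names to per-epoch lists and draws accuracy and loss side by side. The numbers in `fake_history` are made up to exercise the function and are not results from this project.

```python
from types import SimpleNamespace
import matplotlib.pyplot as plt

def plot_history(history):
    """Hypothetical helper: plot accuracy and loss per epoch from a Keras-style History."""
    metrics = history.history  # dict mapping metric name -> list of per-epoch values
    epochs = range(1, len(metrics["loss"]) + 1)
    fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))
    for key, ax in (("accuracy", ax_acc), ("loss", ax_loss)):
        ax.plot(epochs, metrics[key], label=f"training {key}")
        if f"val_{key}" in metrics:
            ax.plot(epochs, metrics[f"val_{key}"], label=f"validation {key}")
        ax.set_xlabel("epoch")
        ax.set_title(key)
        ax.legend()
    return fig

# Made-up values, only to show the expected input shape
fake_history = SimpleNamespace(history={
    "accuracy": [0.60, 0.70, 0.75], "val_accuracy": [0.58, 0.66, 0.68],
    "loss": [0.68, 0.55, 0.48], "val_loss": [0.69, 0.60, 0.59],
})
fig = plot_history(fake_history)
```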