
Quora-Question-Pair-Similarity-Deployed-on-Flask

Quora Question-Pair Similarity

NOTE: Link of the deployed model.

1. Problem Statement

  • Identify which questions asked on Quora are duplicates of questions that have already been asked.
  • This could be useful to instantly provide answers to questions that have already been answered.
  • We are tasked with predicting whether a pair of questions are duplicates or not.

Note: Data Source

2. Machine Learning Problem

  • It is a binary classification problem: for a given pair of questions, we need to predict whether they are duplicates or not.
  • Performance metric: log-loss (the metric used to compare all models below).
  • We build the train and test sets by randomly splitting the data in a 70:30 or 80:20 ratio, which is fine because we have sufficient data points to work with. A minimal sketch of the split and metric is shown after this list.
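
The split and metric can be sketched as follows. This is only an illustration: the toy dataframe, the single length-difference feature and the LogisticRegression stand-in are placeholders, not the project's code; the question1/question2/is_duplicate column names follow the public Quora dataset.

```python
# Minimal sketch of the 70:30 split and the log-loss metric.
# The toy dataframe, the length-difference feature and the
# LogisticRegression stand-in are placeholders, not the project's code.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "question1": [
        "How do I learn Python?", "What is AI?", "Best way to lose weight?",
        "What is machine learning?", "How old is the earth?", "Is tea healthy?",
    ],
    "question2": [
        "How can I learn Python?", "Who invented AI?", "How do I lose weight fast?",
        "What is ML?", "What is the age of the earth?", "Does tea have health benefits?",
    ],
    "is_duplicate": [1, 0, 1, 1, 1, 0],
})

# Placeholder feature: absolute difference in question length.
X = np.abs(df["question1"].str.len() - df["question2"].str.len()).to_frame("len_diff")
y = df["is_duplicate"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
print("log-loss:", log_loss(y_test, model.predict_proba(X_test), labels=[0, 1]))
```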

3. EDA

Note: Always refer to the notebook for details about each plot and its outcome.

3.1 Class Distribution

Class Distribution

This plot shows the class distribution of the dataset. From it we can conclude that the dataset is imbalanced, because the number of data points for class 1 (similar) is smaller than for class 0 (not similar).
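
A quick way to inspect the class balance is sketched below; the "train.csv" path and the is_duplicate column name are assumptions based on the public Quora dataset layout.

```python
# Sketch: class distribution of the is_duplicate label.
# "train.csv" and the column name are assumptions (public Quora dataset layout).
import pandas as pd

df = pd.read_csv("train.csv")
counts = df["is_duplicate"].value_counts()
print(counts)
print("share of duplicates: {:.2%}".format(counts.get(1, 0) / len(df)))
```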

3.2 Unique vs. Repeated Questions

Unique vs Repeated-Questions

The above plot shows that the number of unique questions is almost five times the number of repeated questions.
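
The counts behind this plot can be reproduced roughly as follows; the qid1/qid2 column names and the CSV path are assumptions based on the public Quora dataset.

```python
# Sketch: how many distinct questions appear exactly once vs. more than once.
# "train.csv" and the qid1/qid2 columns are assumptions (public Quora dataset layout).
import pandas as pd

df = pd.read_csv("train.csv")
occurrences = pd.concat([df["qid1"], df["qid2"]]).value_counts()

print("unique questions:  ", int((occurrences == 1).sum()))
print("repeated questions:", int((occurrences > 1).sum()))
```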

3.3 Designed Features

There are 24 designed (hand-crafted) features. Analysis of all of them has been done in the Quora_EDA.ipynb notebook.

3.3.1 word_share Feature

Violin & box plot of the word_share feature

This feature looks important (though not strongly so), because its class-wise distributions do not completely overlap. A sketch of how word_share can be computed is shown below.
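
One common way to define word_share is the fraction of words the two questions have in common; the exact definition used in the notebook may differ, so treat this as a sketch.

```python
# Sketch of a typical word_share feature: shared words divided by the
# total word count of both questions. The notebook's definition may differ.
def word_share(q1: str, q2: str) -> float:
    w1 = set(q1.lower().split())
    w2 = set(q2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (len(w1) + len(w2))

print(word_share("How do I learn Python?", "How can I learn Python fast?"))
```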

4. ML Models

Various models have been tried: Logistic Regression, Linear SVM and GBDT, each with BoW, TF-IDF and W2V vectorizers combined with the hand-crafted features. Out of all of these, GBDT with the TF-IDF vectorizer gave the best performance. For the training, calibration and tuning of the other models, see the notebook. A sketch of the TF-IDF feature construction is shown below.
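
A rough sketch of how the TF-IDF features can be built and stacked with the hand-crafted features; the file path, column names and the empty placeholder matrix for the 24 designed features are assumptions about the notebook, not its exact code.

```python
# Sketch: TF-IDF vectors for both questions stacked with the hand-crafted
# features. Paths, column names and the placeholder matrix are assumptions.
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("train.csv")
q1 = df["question1"].fillna("")
q2 = df["question2"].fillna("")

tfidf = TfidfVectorizer()
tfidf.fit(pd.concat([q1, q2]))            # fit on all question text

q1_vecs = tfidf.transform(q1)
q2_vecs = tfidf.transform(q2)

hand_crafted = csr_matrix((len(df), 24))  # placeholder for the 24 designed features
X = hstack([q1_vecs, q2_vecs, hand_crafted]).tocsr()
print(X.shape)
```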

4.1 GBDT

The dataset was large, so I used the lightgbm library to train the GBDT model. Grid search is used to find the best combination of the max_depth and n_estimators hyperparameters; all other hyperparameters are kept at their defaults.

Tuning Plot for GBDT

In the plot above, each cell of the heatmap shows the log-loss value for a (max_depth, n_estimators) pair. The best hyperparameter pair is max_depth=10, n_estimators=500. A sketch of the grid search is shown below.
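
The grid search can be sketched like this; the synthetic data and the exact grid values are placeholders (only the reported best pair, max_depth=10 and n_estimators=500, comes from the results above).

```python
# Sketch of the grid search over max_depth and n_estimators for the
# LightGBM GBDT, scored by log-loss. The synthetic data stands in for
# the real TF-IDF + hand-crafted feature matrix.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=500, n_features=30, random_state=42)

grid = GridSearchCV(
    estimator=lgb.LGBMClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 10], "n_estimators": [100, 300, 500]},
    scoring="neg_log_loss",
    cv=3,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print("best params:", grid.best_params_)
print("best log-loss:", -grid.best_score_)
```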

4.2 Performance Summary of All Models

Performance Summary

Requirements

  • python == 3.6
  • flask == 1.1.1
  • gunicorn == 20.0.4
  • scikit-learn == 0.22.1
  • numpy == 1.16.4
  • pandas == 0.24.2
  • lightgbm == 2.2.3
  • pickle == 4.0
  • scipy == 1.2.0
  • nltk == 3.4.3
  • fuzzywuzzy == 0.17.0
  • distance == 0.1.3

File Details

Tree Structure of Files

Flask API (folder)

It contains all the files required to build a UI. This folder is used to deploy the model on Heroku. Link of the deployed model. A minimal sketch of a prediction endpoint is shown below.
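
A minimal sketch of what the prediction endpoint can look like; the model file name, the form field names and the tiny feature helper are assumptions for illustration, not the repository's actual code.

```python
# Minimal Flask prediction endpoint (sketch). model.pkl, the form field
# names and build_features() are placeholders, not the project's real code.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:      # assumed pickled, trained model
    model = pickle.load(f)

def build_features(q1: str, q2: str):
    """Placeholder for the real feature.py-style feature extraction."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    share = len(w1 & w2) / (len(w1) + len(w2)) if w1 and w2 else 0.0
    return [[share, abs(len(q1) - len(q2))]]

@app.route("/predict", methods=["POST"])
def predict():
    q1 = request.form.get("question1", "")
    q2 = request.form.get("question2", "")
    proba = model.predict_proba(build_features(q1, q2))[0, 1]
    return jsonify({"duplicate_probability": float(proba)})

if __name__ == "__main__":
    app.run(debug=True)
```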

.ipynb Extension

Files with this extension are IPython notebooks. Quora_EDA.ipynb contains the exploratory data analysis: various distribution plots, bar plots, t-SNE plots, etc. are drawn, and some additional features are designed. In quora_vectorizing_and_models.ipynb, TF-IDF and TFIDF-W2V vectorization is done first; the final TF-IDF and TFIDF-W2V feature sets are then created by merging the vectorized features with the features designed in Quora_EDA.ipynb. Logistic Regression, Linear SVM and GBDT (with LightGBM) models are applied, hyperparameter tuning is done for each model, and at the end the log-loss values of all models are compared. A sketch of the TFIDF-W2V idea is shown below.
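
The TFIDF-W2V idea (averaging a question's word vectors, weighted by their IDF/TF-IDF scores) can be sketched as follows; the tiny random "embeddings" stand in for real pre-trained word vectors, and the notebook's exact weighting may differ.

```python
# Sketch of TF-IDF weighted Word2Vec: a question vector is the weighted
# average of its word vectors, using IDF weights. Random vectors stand in
# for real pre-trained embeddings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["how do i learn python", "how can i learn python fast"]
tfidf = TfidfVectorizer().fit(corpus)
idf = {word: tfidf.idf_[idx] for word, idx in tfidf.vocabulary_.items()}

DIM = 50
word_vecs = {w: np.random.RandomState(i).rand(DIM) for i, w in enumerate(tfidf.vocabulary_)}

def tfidf_w2v(sentence: str) -> np.ndarray:
    vec, weight_sum = np.zeros(DIM), 0.0
    for word in sentence.lower().split():
        if word in word_vecs:
            vec += idf[word] * word_vecs[word]
            weight_sum += idf[word]
    return vec / weight_sum if weight_sum else vec

print(tfidf_w2v("How do I learn Python").shape)
```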

'test.py' and 'feature.py'

These files are for testing the model.
