Deep learning tool to determine the similarity of two given questions. The tool is able to understand the context of a question when determining similarity.
e.g., Question 1: What is your age? Question 2: How old are you? --> They are similar.
- How to use it
- Download the Weights and embeddings
- How to fine-tune one of the trained models on your own dataset
- ToDo
This is a two-part deep-learning-based NLP solution to determine the resemblance, or similarity, of two given questions.
Part one extracts sentence embeddings using the Universal Sentence Encoder, and part two predicts whether the two given questions are similar or not.
The main goal of this project is to try an alternative to traditional approaches such as TF-IDF, minimum edit distance, co-occurrence matrices, or even Word2Vec, GloVe, etc., and achieve state-of-the-art results.
The Quora Question Pairs dataset was used for training. The repository also provides pretrained weights and sentence embeddings (in .npy format).
The architecture can be divided into two parts:
- Embeddings generation:
  - Two options were explored, Word2Vec and the Universal Sentence Encoder, and the latter was chosen. Word2Vec generates one embedding per word, which makes it harder for the network to capture the overall context of the sentence (here, the question). The Universal Sentence Encoder generates an embedding for the complete sentence, which helps the network understand the overall context of the question.
  - The Universal Sentence Encoder outputs a vector of size 512 regardless of the input, whether a single word, a sentence, or even a small paragraph (there is no fixed limit on paragraph size).
  - A directly consumable form of the Universal Sentence Encoder is available on TensorFlow Hub. More details are given on its page.
- Prediction using SimilarityNet:
  - SimilarityNet is a classifier (with multiple backends to choose from) that takes sentence embeddings as input and classifies them as 0: Not Similar or 1: Similar.
  - It currently supports three backend classifiers:
    - ANN
    - Random Forest
    - CatBoost
  - Note: SimilarityNet with CatBoost as the backend currently works best, so it is the default backend.
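A minimal sketch of one such backend, using scikit-learn's `RandomForestClassifier` on synthetic stand-ins for the 512-dim embeddings. The feature construction (concatenating the two question embeddings) is an assumption for illustration, not necessarily what SimilarityNet does internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for Universal Sentence Encoder outputs:
# 200 question pairs, each question a 512-dim embedding.
q1 = rng.normal(size=(200, 512))
q2 = rng.normal(size=(200, 512))
y = rng.integers(0, 2, size=200)  # 0: not similar, 1: similar

# One possible feature construction: concatenate the pair's embeddings.
X = np.hstack([q1, q2])  # shape: (200, 1024)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
preds = clf.predict(X[:5])  # 0/1 labels for the first five pairs
```

Swapping the backend then amounts to replacing `RandomForestClassifier` with an ANN or a CatBoost classifier trained on the same features.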
- Embeddings generator: it takes about 0.1 sec per embedding.
- Training performance:
  - ANN (GPU): about 70-90 sec per epoch.
  - Random Forest (CPU only): about 20 min on 75% of the Quora Question Pairs dataset.
  - CatBoost (GPU): fastest of all, about 5 min.
- Benchmark results:
  - ANN: 80%
  - Random Forest: 83.34%
  - CatBoost: 86.7%
This repository provides Jupyter notebook tutorials that explain training, inference, and evaluation; the subsequent sections complement the notebooks with additional explanations.
Training is performed in two steps:
- Generating embeddings using any desired text embedding method.
- Feeding the text embeddings to the network and training the model.
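Since the repository ships embeddings in `.npy` format, the hand-off between the two steps above can be sketched with plain NumPy (the file name and the random stand-in embeddings are illustrative):

```python
import numpy as np

# Step 1 output: pretend these embeddings came from the sentence encoder,
# then persist them to disk in .npy format (file name is illustrative).
emb = np.random.default_rng(1).normal(size=(100, 512))
np.save("q1_embeddings.npy", emb)

# Step 2 input: load the precomputed embeddings instead of re-encoding,
# then hand them to the classifier for training.
loaded = np.load("q1_embeddings.npy")
print(loaded.shape)  # (100, 512)
```

Caching embeddings this way means the (slow) encoding step runs once, while the classifier can be retrained or swapped cheaply.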
The Quora Question Pairs dataset was used for training; it can be downloaded from here.
Note: any other question-pair dataset will also work.
- Text embeddings generated from the Universal Sentence Encoder for Quora Question Pairs can be downloaded from here.
- Weights can be downloaded from here.
Fine-tuning can be done by changing two configurations:
- Change the text embedding generator.
- Change the hidden layer configuration.
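To illustrate the second knob, here is a toy forward pass where the hidden-layer configuration is just a list of layer sizes. This is a sketch only; the actual ANN backend's layout and weights are defined in the repository, not here.

```python
import numpy as np

def mlp_forward(x, hidden_layers, rng):
    """Toy MLP forward pass; hidden_layers is the tunable configuration."""
    out = x
    for size in hidden_layers:
        w = rng.normal(size=(out.shape[1], size))
        out = np.maximum(out @ w, 0.0)          # ReLU hidden layer
    w_out = rng.normal(size=(out.shape[1], 1))
    return 1.0 / (1.0 + np.exp(-(out @ w_out)))  # sigmoid: P(similar)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1024))  # 4 question pairs, concatenated embeddings
p = mlp_forward(x, hidden_layers=[256, 64], rng=rng)
print(p.shape)  # (4, 1)
```

Changing `hidden_layers=[256, 64]` to, say, `[512, 128, 32]` is the kind of configuration change this fine-tuning step refers to.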
- CLI tool
- Universal generic notebook to carry out all functions.
- Add examples
- Make TF 2.0 compatible