Deep learning tool to determine the similarity of two given questions. The tool is able to understand the context of a question when determining similarity.
e.g., Question 1: What is your age? Question 2: How old are you? --> They are similar.
- How to use it
- Download the Weights and embeddings
- How to fine-tune one of the trained models on your own dataset
- ToDo
This is a two-part deep-learning-based NLP solution to determine the resemblance, or similarity, of two given questions.
Part one extracts sentence embeddings using the Universal Sentence Encoder, and part two predicts whether the two given questions are similar or not.
The main goal of this project is to try an alternative to traditional approaches such as TF-IDF, minimum edit distance, co-occurrence matrices, or even Word2Vec, GloVe, etc., and achieve state-of-the-art results.
The Quora Question Pairs dataset was used for training. The repository also provides pretrained weights and sentence embeddings (in .npy format).
The architecture can be divided into two parts:
- Embeddings generation:
  - Two options were explored, Word2Vec and the Universal Sentence Encoder, and the latter was chosen. Word2Vec generates one embedding per word, which makes it harder for the network to capture the overall context of the sentence (here, the question). The Universal Sentence Encoder generates an embedding for the complete sentence, which helps the network understand the overall context of the question.
  - The Universal Sentence Encoder outputs a vector of size 512 regardless of the input, whether a single word, a sentence, or even a small paragraph (there is no fixed limit on paragraph size).
  - A directly consumable form of the Universal Sentence Encoder is available on TensorFlow Hub. More details are given on its page.
- Prediction using SimilarityNet:
  - SimilarityNet is a classifier (with multiple backends to choose from) that takes sentence embeddings as input and classifies them as 0: Not Similar or 1: Similar.
  - It currently supports three backend classifiers:
    - ANN
    - Random Forest
    - CatBoost
  - Note: SimilarityNet with CatBoost as the backend currently works best, so it is the default backend.
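A minimal sketch of one such backend, using scikit-learn's `RandomForestClassifier` on synthetic stand-ins for the 512-dim embeddings. The feature construction (concatenating the two question embeddings) is an assumption for illustration, not necessarily what SimilarityNet does internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for Universal Sentence Encoder outputs:
# 200 question pairs, each question a 512-dim embedding.
q1 = rng.normal(size=(200, 512))
q2 = rng.normal(size=(200, 512))
y = rng.integers(0, 2, size=200)  # 0: not similar, 1: similar

# One possible feature construction: concatenate the pair's embeddings.
X = np.hstack([q1, q2])  # shape: (200, 1024)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
preds = clf.predict(X[:5])  # 0/1 labels for the first five pairs
```

Swapping the backend then amounts to replacing `RandomForestClassifier` with an ANN or a CatBoost classifier trained on the same features.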
- Embeddings generator: it takes about 0.1 sec per embedding.
- Training performance:
  - ANN (GPU): about 70-90 sec per epoch.
  - Random Forest (CPU only): about 20 min on 75% of the Quora Question Pairs dataset.
  - CatBoost (GPU): fastest of all, about 5 min.
- Benchmark results:
  - ANN: 80%
  - Random Forest: 83.34%
  - CatBoost: 86.7%
This repository provides Jupyter notebook tutorials that explain training, inference, and evaluation; the subsequent sections complement the notebooks with additional explanations.
Training is performed in two steps:
- Generating embeddings using any desired text embedding method.
- Feeding the text embeddings to the network and training the model.
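Since the repository ships embeddings in `.npy` format, the hand-off between the two steps above can be sketched with plain NumPy (the file name and the random stand-in embeddings are illustrative):

```python
import numpy as np

# Step 1 output: pretend these embeddings came from the sentence encoder,
# then persist them to disk in .npy format (file name is illustrative).
emb = np.random.default_rng(1).normal(size=(100, 512))
np.save("q1_embeddings.npy", emb)

# Step 2 input: load the precomputed embeddings instead of re-encoding,
# then hand them to the classifier for training.
loaded = np.load("q1_embeddings.npy")
print(loaded.shape)  # (100, 512)
```

Caching embeddings this way means the (slow) encoding step runs once, while the classifier can be retrained or swapped cheaply.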
The Quora Question Pairs dataset was used for training; it can be downloaded from here.
Note: any other question-pair dataset will also work.
- Text embeddings generated from the Universal Sentence Encoder for Quora Question Pairs can be downloaded from here.
- Weights can be downloaded from here.
Fine-tuning can be done by changing two configurations:
- Change the text embedding generator.
- Change the hidden layer configuration.
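To illustrate the second knob, here is a toy forward pass where the hidden-layer configuration is just a list of layer sizes. This is a sketch only; the actual ANN backend's layout and weights are defined in the repository, not here.

```python
import numpy as np

def mlp_forward(x, hidden_layers, rng):
    """Toy MLP forward pass; hidden_layers is the tunable configuration."""
    out = x
    for size in hidden_layers:
        w = rng.normal(size=(out.shape[1], size))
        out = np.maximum(out @ w, 0.0)          # ReLU hidden layer
    w_out = rng.normal(size=(out.shape[1], 1))
    return 1.0 / (1.0 + np.exp(-(out @ w_out)))  # sigmoid: P(similar)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1024))  # 4 question pairs, concatenated embeddings
p = mlp_forward(x, hidden_layers=[256, 64], rng=rng)
print(p.shape)  # (4, 1)
```

Changing `hidden_layers=[256, 64]` to, say, `[512, 128, 32]` is the kind of configuration change this fine-tuning step refers to.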
- CLI tool
- Universal generic notebook to carry out all functions.
- Add examples
- Make TF 2.0 compatible