Sentence Similarity Calculator
This repo contains various ways to calculate the similarity between source and target sentences. You can choose which pre-trained model to use, such as ELMo, BERT, or Universal Sentence Encoder (USE).
You can also choose the method used to compute the similarity:
1. Cosine similarity
2. Manhattan distance
3. Euclidean distance
4. Angular distance
5. Inner product
6. TS-SS score
7. Pairwise-cosine similarity
8. Pairwise-cosine similarity + IDF
You can experiment with (The number of models) x (The number of methods) combinations!
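As a rough illustration of the vector-based methods in the list, the sketch below computes five of them for two embedding vectors in plain Python. It is not the code used in this repo: the function names are made up for the example, and angular distance is taken to be arccos(cosine similarity) / pi, which is one common definition and an assumption here.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(inner_product(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1 means same direction)."""
    return inner_product(a, b) / (norm(a) * norm(b))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line (L2) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def angular_distance(a, b):
    """Angle between a and b, scaled to [0, 1] (assumed definition)."""
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(cos) / math.pi


# Toy vectors standing in for sentence embeddings (hypothetical values).
source = [0.2, 0.7, 0.1]
target = [0.3, 0.6, 0.4]
for metric in (cosine_similarity, manhattan_distance, euclidean_distance,
               angular_distance, inner_product):
    print(metric.__name__, round(metric(source, target), 4))
```

Note that cosine similarity and inner product grow with similarity, while the three distances shrink, so scores from different methods are not directly comparable.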
- After cloning this repository, install all the dependencies listed in requirements.txt:
git clone https://github.com/Huffon/sentence-similarity.git
cd sentence-similarity
pip install -r requirements.txt
- To test your own sentences, fill corpus.txt with sentences as below.
I ate an apple.
I went to the Apple.
I ate an orange.
...
- Then, choose the model and method to be used to calculate the similarity between source and target sentences.
python sensim.py --model MODEL_NAME --method METHOD_NAME --verbose LOG_OPTION (bool)
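For reference, a command line of this shape could be parsed with argparse roughly as follows. This is a hypothetical sketch, not the actual sensim.py: only the three flag names come from the command above, while build_parser, str2bool, and the accepted boolean spellings are assumptions. A custom converter is used because argparse's type=bool treats any non-empty string, including "False", as True.

```python
import argparse


def str2bool(value):
    """Convert common textual booleans to bool (hypothetical helper)."""
    if isinstance(value, bool):
        return value
    text = value.strip().lower()
    if text in {"true", "t", "yes", "y", "1"}:
        return True
    if text in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Build a parser for the --model / --method / --verbose flags."""
    parser = argparse.ArgumentParser(description="Sentence similarity (sketch)")
    parser.add_argument("--model", required=True, help="name of the encoder")
    parser.add_argument("--method", required=True, help="name of the metric")
    parser.add_argument("--verbose", type=str2bool, default=False,
                        help="whether to print detailed logs")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["--model", "MODEL_NAME", "--method", "METHOD_NAME", "--verbose", "true"]
)
print(args.model, args.method, args.verbose)
```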
- In the following section, you can see the result of
- As you know, there is no silver bullet that computes a perfect similarity between sentences. You should run a variety of experiments on your own dataset.
- TS-SS score might not suit short-sentence similarity tasks, since the method was originally devised to calculate the similarity between documents.
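To make the TS-SS caution more concrete, here is a minimal sketch of the score as it is commonly stated from the Hybrid Geometric Approach paper listed in the references: the area of the triangle spanned by the two vectors (TS) multiplied by the area of a sector (SS) whose radius combines Euclidean distance and magnitude difference. The 10-degree offset added to the angle is part of that formulation as understood here; the function name ts_ss is an assumption, and this is not necessarily how the repo implements it.

```python
import math


def ts_ss(a, b):
    """TS-SS score of two equal-length vectors (lower means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    # Angle between the vectors in degrees, plus the 10-degree offset.
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    theta_deg = math.degrees(math.acos(cos)) + 10.0

    # TS: area of the triangle formed by a and b with angle theta.
    triangle = norm_a * norm_b * math.sin(math.radians(theta_deg)) / 2.0

    # SS: area of a sector with radius ED + MD and angle theta.
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    magnitude_diff = abs(norm_a - norm_b)
    sector = math.pi * (euclidean + magnitude_diff) ** 2 * theta_deg / 360.0

    return triangle * sector


# Same direction, different magnitude: cosine similarity would be 1 here,
# but TS-SS still reports a non-zero score.
print(ts_ss([1.0, 0.0], [2.0, 0.0]))
```

Because the score depends on vector magnitudes as well as the angle, embeddings of very different norms can dominate the result, which is one more reason to compare methods on your own data.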
- Python version should be higher than 3.6.x
- You should install PyTorch via the official installation guide
- To use the spaCy model that tokenizes input sentences, download the English model by running:
python -m spacy download en_core_web_sm
allennlp==0.9.0
bert-score==0.2.1
numpy==1.17.3
scikit-learn==0.21.3
scipy==1.3.1
seaborn==0.9.0
sentence-transformers==0.2.3
spacy==2.1.9
tensorflow==1.15.0
tensorflow-hub==0.7.0
torch==1.3.0
- Upgrade TF to TF2.0 to use
- Add pairwise cosine similarity method in
- Universal Sentence Encoder
- Deep contextualized word representations
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering
- TF-hub's Universal Sentence Encoder
- Allen NLP's ELMo
- Sentence Transformers
- Vector Similarity