<a href="https://colab.research.google.com/github/Lab-of-Infinity/Advanced-Deep-Learning-Based-NLP-Image-Processing-Projects/blob/main/Project_5_Similar_Research_Paper_Recommendation_using_SBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Similar Research Paper Recommendation using `SBERT`**

- The task of a recommendation system is to produce a list of recommendations for a user.
- Recommendation systems try to show what the user is most likely to be interested in.

- This is a symmetric search task: the search queries have the same form and roughly the same length as the entries in the corpus (a title plus an abstract).

- For a given research paper, this simple recommendation system suggests the most similar papers.

- We use the paper title and abstract to match similar papers.

### **SPECTER Model**
- **SPECTER is a model trained on scientific citations that can be used to estimate the similarity of two publications. We can use it to find similar papers.**

- As the model, we use SPECTER (https://github.com/allenai/specter), which encodes paper titles and abstracts into a vector space. Paper: https://arxiv.org/pdf/2004.07180.pdf

- SPECTER can be easily applied to downstream applications without task-specific fine-tuning.

In [None]:
%pip install -U sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)

In [None]:
from sentence_transformers import SentenceTransformer, util
import os
import json
import requests

In [None]:
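# Download the EMNLP 2016-2018 paper metadata as a list of dictionaries
response = requests.get('https://sbert.net/datasets/emnlp2016-2018.json')
response.raise_for_status()  # stop early if the download failed
papers = json.loads(response.text)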

In [None]:
len(papers)

974

In [None]:
papers[0]

{'title': 'Rule Extraction for Tree-to-Tree Transducers by Cost Minimization',
 'abstract': 'Finite-state transducers give efficient representations of many Natural Language phenomena. They allow to account for complex lexicon restrictions encountered, without involving the use of a large set of complex rules difficult to analyze. We here show that these representations can be made very compact, indicate how to perform the corresponding minimization, and point out interesting linguistic side-effects of this operation.',
 'url': 'http://aclweb.org/anthology/D16-1002',
 'venue': 'EMNLP',
 'year': '2016'}
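
Each record is a plain dictionary with `title`, `abstract`, `url`, `venue`, and `year` keys, so the corpus can be summarized with standard-library tools before encoding. As a minimal sketch, the snippet below counts papers per year. The records in `sample_papers` are hypothetical placeholders (not taken from the dataset), and `papers_per_year` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical stand-in records; the real `papers` list uses the same keys.
sample_papers = [
    {'title': 'Example Paper A', 'abstract': 'Placeholder abstract.',
     'url': 'http://example.org/a', 'venue': 'EMNLP', 'year': '2016'},
    {'title': 'Example Paper B', 'abstract': 'Placeholder abstract.',
     'url': 'http://example.org/b', 'venue': 'EMNLP', 'year': '2018'},
    {'title': 'Example Paper C', 'abstract': 'Placeholder abstract.',
     'url': 'http://example.org/c', 'venue': 'EMNLP', 'year': '2018'},
]

def papers_per_year(records):
    """Count records by their 'year' field (stored as a string in this dataset)."""
    return Counter(record['year'] for record in records)

year_counts = papers_per_year(sample_papers)
print(sorted(year_counts.items()))
```

With the real data, calling `papers_per_year(papers)` should give the same kind of breakdown for the downloaded corpus.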

In [None]:
model = SentenceTransformer('allenai-specter')


In [None]:

# To encode the papers, we combine each title and abstract into a single string
paper_texts = [paper['title'] + '[SEP]' + paper['abstract'] for paper in papers]
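
The concatenation above assumes that every record has both a title and an abstract. A slightly more defensive version might look like the sketch below. `build_paper_text` is a hypothetical helper, not part of sentence-transformers, and it assumes that a missing or empty field should be treated as an empty string.

```python
def build_paper_text(paper, sep='[SEP]'):
    """Join title and abstract with the separator string used in this notebook."""
    # Assumption: a missing or None field is treated as an empty string.
    title = (paper.get('title') or '').strip()
    abstract = (paper.get('abstract') or '').strip()
    return title + sep + abstract

# A record with surrounding whitespace in the title and no abstract.
example = {'title': ' Digital Voicing of Silent Speech ', 'abstract': None}
print(build_paper_text(example))
```

The corpus strings could then be built with `[build_paper_text(paper) for paper in papers]`.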

In [None]:
# Compute embeddings for all papers
corpus_embeddings = model.encode(paper_texts, show_progress_bar=True, convert_to_tensor=True)

Batches:   0%|          | 0/31 [00:00<?, ?it/s]

In [None]:
# Define a function that, given a title and an abstract, searches the corpus for the most similar papers
def search_papers(title, abstract):
  # Encode the query the same way as the corpus: title + [SEP] + abstract
  query_embedding = model.encode(title + "[SEP]" + abstract, convert_to_tensor=True)
  # semantic_search returns one list of hits per query; we passed a single query
  search_hits = util.semantic_search(query_embedding, corpus_embeddings)
  search_hits = search_hits[0]
  print('Paper:', title)
  print('Most similar papers:')
  for hit in search_hits:
    related_paper = papers[hit['corpus_id']]
    print("{:.2f}\t{}\t{} {}".format(hit['score'], related_paper['title'], related_paper['venue'], related_paper['year']))
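
By default, `util.semantic_search` ranks corpus entries by cosine similarity to the query and returns, for each query, a list of dictionaries with `corpus_id` and `score` keys (ten hits unless `top_k` is changed). The following is a rough pure-Python sketch of that idea on toy 3-dimensional vectors. It is not the library's actual implementation, and `cosine` and `top_k_search` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_search(query, corpus, top_k=10):
    """Return the top_k corpus entries as {'corpus_id', 'score'} dicts, best first."""
    scored = [{'corpus_id': i, 'score': cosine(query, vec)}
              for i, vec in enumerate(corpus)]
    scored.sort(key=lambda hit: hit['score'], reverse=True)
    return scored[:top_k]

# Toy embeddings standing in for the real SPECTER vectors.
toy_corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
toy_hits = top_k_search([1.0, 0.1, 0.0], toy_corpus, top_k=2)
for hit in toy_hits:
    print(hit['corpus_id'], round(hit['score'], 2))
```

The query points mostly along the first axis, so the first toy vector ranks highest and the diagonal one comes second.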


### Search
Now we search with some papers that were presented at EMNLP 2019 and 2020. These papers are not part of the corpus, which covers EMNLP 2016-2018.

In [None]:
# This paper was the EMNLP 2019 Best Paper
search_papers(title='Specializing Word Embeddings (for Parsing) by Information Bottleneck', 
              abstract='Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.')


Paper: Specializing Word Embeddings (for Parsing) by Information Bottleneck
Most similar papers:
0.88	An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing	EMNLP 2018
0.87	NORMA: Neighborhood Sensitive Maps for Multilingual Word Embeddings	EMNLP 2018
0.87	Generalizing Word Embeddings using Bag of Subwords	EMNLP 2018
0.87	Word Embeddings for Code-Mixed Language Processing	EMNLP 2018
0.87	LAMB: A Good Shepherd of Morphologically Rich Languages	EMNLP 2016
0.87	Word Mover's Embedding: From Word2Vec to Document Embedding	EMNLP 2018
0.87	Charagram: Embedding Words and Sentences via Character n-grams	EMNLP 2016
0.87	Segmentation-Free Word Embedding for Unsegmented Languages	EMNLP 2017
0.86	Addressing Troublesome Words in Neural Machine Translation	EMNLP 2018
0.86	Conditional Word Embedding and Hypothesis Testing via Bayes-by-Backprop	EMNLP 2018


In [None]:
# This paper was the EMNLP 2020 Best Paper
search_papers(title='Digital Voicing of Silent Speech',
              abstract='In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech. We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals. Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data, decreasing transcription word error rate from 64% to 4% in one data condition and 88% to 68% in another. To spur further development on this task, we share our new dataset of silent and vocalized facial EMG measurements.')


Paper: Digital Voicing of Silent Speech
Most similar papers:
0.82	Session-level Language Modeling for Conversational Speech	EMNLP 2018
0.79	Neural Multitask Learning for Simile Recognition	EMNLP 2018
0.78	Speech segmentation with a neural encoder model of working memory	EMNLP 2017
0.77	MSMO: Multimodal Summarization with Multimodal Output	EMNLP 2018
0.77	Estimating Marginal Probabilities of n-grams for Recurrent Neural Language Models	EMNLP 2018
0.76	A Co-Attention Neural Network Model for Emotion Cause Analysis with Emotional Context Awareness	EMNLP 2018
0.76	Learning Unsupervised Word Translations Without Adversaries	EMNLP 2018
0.75	Large Margin Neural Language Model	EMNLP 2018
0.75	Phrase-Based & Neural Unsupervised Machine Translation	EMNLP 2018
0.75	Multimodal Language Analysis with Recurrent Multistage Fusion	EMNLP 2018
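
The scores for the second query are noticeably lower than for the first (top score 0.82 versus 0.88), which suggests that the corpus holds fewer close matches for it. One possible way to act on that is to drop hits below a cutoff. In the sketch below, `filter_hits` and the 0.80 threshold are illustrative assumptions, not part of the original notebook.

```python
def filter_hits(hits, min_score):
    """Keep only hits whose similarity score is at least min_score."""
    return [hit for hit in hits if hit['score'] >= min_score]

# Hits shaped like the output of util.semantic_search. The scores are copied
# from the first three rows printed above; the corpus ids are placeholders.
example_hits = [
    {'corpus_id': 0, 'score': 0.82},
    {'corpus_id': 1, 'score': 0.79},
    {'corpus_id': 2, 'score': 0.78},
]
strong_hits = filter_hits(example_hits, min_score=0.80)
print(len(strong_hits))
```

A suitable threshold depends on the corpus and the model, so it would need to be chosen by inspecting results like the ones above.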
