<a href="https://www.kaggle.com/code/aisuko/semantic-search-in-publications?scriptVersionId=162137361" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

In the notebook [Similar Questions Retrieval](https://www.kaggle.com/code/aisuko/similar-questions-retrieval) we show an example based on the Quora duplicate question dataset. 

In this notebook, we will try to find similar publications. As the corpus, we use all EMNLP publications from 2016-2018. We then search for similar papers, using papers presented at EMNLP 2019 and 2020 as queries. This is a **symmetric search task**, since each query (title plus abstract) has the same form and roughly the same length as the documents in the corpus.

In [1]:
!pip install sentence-transformers==2.3.1

Collecting sentence-transformers==2.3.1
  Downloading sentence_transformers-2.3.1-py3-none-any.whl.metadata (11 kB)
Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.3.1


In [2]:
import os
import json
from sentence_transformers.util import http_get

dataset_file='emnlp2016-2018.json'

http_get('https://sbert.net/datasets/emnlp2016-2018.json', dataset_file)

with open(dataset_file) as fIn:
    papers=json.load(fIn)
    
print(len(papers),'papers loaded')

974 papers loaded
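
Each entry in `papers` is a dictionary. The later cells read its `title`, `abstract`, `venue`, and `year` fields, so checking that every record carries them can catch a truncated download early. A minimal sketch, using made-up sample records in place of the downloaded file; `REQUIRED_FIELDS` and `find_incomplete` are hypothetical names, not part of any library:

```python
# Fields that the later cells read from each paper record.
REQUIRED_FIELDS = ('title', 'abstract', 'venue', 'year')

def find_incomplete(records):
    """Return the indices of records missing any required field."""
    return [
        i for i, record in enumerate(records)
        if any(field not in record for field in REQUIRED_FIELDS)
    ]

# Made-up sample data for illustration; in the notebook, pass `papers` instead.
sample = [
    {'title': 'A', 'abstract': 'Text.', 'venue': 'EMNLP', 'year': 2018},
    {'title': 'B', 'venue': 'EMNLP', 'year': 2017},  # no abstract
]
print(find_incomplete(sample))
```

An empty result on the real data would mean that every record can be turned into a `title + '[SEP]' + abstract` string.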


In [3]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import normalize_embeddings

model=SentenceTransformer('allenai-specter')

# to encode the papers, we combine the title and the abstract into a single string
paper_texts=[paper['title']+'[SEP]'+paper['abstract'] for paper in papers]

# compute embeddings for all papers
corpus_embeddings=model.encode(paper_texts, convert_to_tensor=True,show_progress_bar=True, device='cuda')
corpus_embeddings=normalize_embeddings(corpus_embeddings)
corpus_embeddings

tensor([[ 0.0014,  0.0133,  0.0073,  ...,  0.0274,  0.0193,  0.0124],
        [ 0.0124,  0.0127,  0.0218,  ...,  0.0373, -0.0011,  0.0385],
        [-0.0338,  0.0804,  0.0237,  ...,  0.0393,  0.0109,  0.0780],
        ...,
        [ 0.0026,  0.0254,  0.0083,  ...,  0.0118,  0.0015,  0.0802],
        [ 0.0101,  0.0669,  0.0144,  ...,  0.0295,  0.0195,  0.0173],
        [-0.0221,  0.0449,  0.0179,  ..., -0.0191,  0.0037,  0.0604]],
       device='cuda:0')
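
`normalize_embeddings` scales each row to unit length. For unit-length vectors, the dot product equals the cosine similarity, so normalized embeddings give the same ranking under either score. A small pure-Python sketch of that identity, with made-up three-dimensional vectors standing in for the real embeddings (the helper names are illustrative and not part of sentence-transformers):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors, assumed for illustration only.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# After normalization, the plain dot product matches the cosine similarity.
print(math.isclose(dot(l2_normalize(a), l2_normalize(b)), cosine(a, b)))
```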

In [4]:
# semantic_search scores with cosine similarity by default, so no explicit score function is needed
from sentence_transformers.util import semantic_search

def search_papers(title, abstract, top_k=5):
    query_embedding=model.encode(title+'[SEP]'+abstract, convert_to_tensor=True, show_progress_bar=True, device='cuda')
    
    search_hits=semantic_search(
        query_embedding, 
        corpus_embeddings,
        query_chunk_size=100,
        corpus_chunk_size=500000,
        top_k=top_k
    )
    # get the hits for the first query
    search_hits=search_hits[0]
    
    print('Paper:',title)
    print('Most similar papers:')
    for hit in search_hits:
        related_paper=papers[hit['corpus_id']]
        print("{:.2f}\t{}\t{} {}".format(hit['score'], related_paper['title'], related_paper['venue'], related_paper['year']))
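
For every query, `semantic_search` returns a list of hits, each a dictionary with a `corpus_id` and a `score`; by default the score is cosine similarity. Conceptually, this is a top-k ranking of the corpus by similarity to the query. The sketch below illustrates that idea in pure Python. It is a simplified stand-in, not the library's implementation (which operates on tensors in chunks), and `top_k_hits` and the toy vectors are hypothetical:

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_hits(query, corpus, top_k=5):
    """Rank corpus vectors by cosine similarity to the query, best first."""
    scored = (
        {'corpus_id': i, 'score': cosine(query, vec)}
        for i, vec in enumerate(corpus)
    )
    return heapq.nlargest(top_k, scored, key=lambda hit: hit['score'])

# Toy two-dimensional "embeddings", assumed for illustration only.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_hits([1.0, 0.1], toy_corpus, top_k=2)
for hit in hits:
    print("{:.2f}\t{}".format(hit['score'], hit['corpus_id']))
```

The `corpus_id` in each hit is an index into the corpus, which is how `search_papers` maps a hit back to its entry in `papers`.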

# Searching the papers

Now we search for some papers that have been presented at EMNLP 2019 and 2020.

In [5]:
# this paper was the EMNLP 2019 Best Paper
search_papers(
    title='Specializing Word Embeddings (for Parsing) by Information Bottleneck',
    abstract='Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.'
)

Paper: Specializing Word Embeddings (for Parsing) by Information Bottleneck
Most similar papers:
0.88	An Investigation of the Interactions Between Pre-Trained Word Embeddings, Character Models and POS Tags in Dependency Parsing	EMNLP 2018
0.87	NORMA: Neighborhood Sensitive Maps for Multilingual Word Embeddings	EMNLP 2018
0.87	Generalizing Word Embeddings using Bag of Subwords	EMNLP 2018
0.87	Word Embeddings for Code-Mixed Language Processing	EMNLP 2018
0.87	LAMB: A Good Shepherd of Morphologically Rich Languages	EMNLP 2016


In [6]:
# This paper was the EMNLP 2020 Best Paper
search_papers(title='Digital Voicing of Silent Speech',
              abstract='In this paper, we consider the task of digitally voicing silent speech, where silently mouthed words are converted to audible speech based on electromyography (EMG) sensor measurements that capture muscle impulses. While prior work has focused on training speech synthesis models from EMG collected during vocalized speech, we are the first to train from EMG collected during silently articulated speech. We introduce a method of training on silent EMG by transferring audio targets from vocalized to silent signals. Our method greatly improves intelligibility of audio generated from silent EMG compared to a baseline that only trains with vocalized data, decreasing transcription word error rate from 64% to 4% in one data condition and 88% to 68% in another. To spur further development on this task, we share our new dataset of silent and vocalized facial EMG measurements.')

Paper: Digital Voicing of Silent Speech
Most similar papers:
0.82	Session-level Language Modeling for Conversational Speech	EMNLP 2018
0.79	Neural Multitask Learning for Simile Recognition	EMNLP 2018
0.78	Speech segmentation with a neural encoder model of working memory	EMNLP 2017
0.77	MSMO: Multimodal Summarization with Multimodal Output	EMNLP 2018
0.77	Estimating Marginal Probabilities of n-grams for Recurrent Neural Language Models	EMNLP 2018


In [7]:
# This paper was an EMNLP 2020 Honourable Mention paper
search_papers(title='Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems',
              abstract='The lack of time efficient and reliable evaluation methods is hampering the development of conversational dialogue systems (chat bots). Evaluations that require humans to converse with chat bots are time and cost intensive, put high cognitive demands on the human judges, and tend to yield low quality results. In this work, we introduce Spot The Bot, a cost-efficient and robust evaluation framework that replaces human-bot conversations with conversations between bots. Human judges then only annotate for each entity in a conversation whether they think it is human or not (assuming there are humans participants in these conversations). These annotations then allow us to rank chat bots regarding their ability to mimic conversational behaviour of humans. Since we expect that all bots are eventually recognized as such, we incorporate a metric that measures which chat bot is able to uphold human-like behavior the longest, i.e. Survival Analysis. This metric has the ability to correlate a bot’s performance to certain of its characteristics (e.g. fluency or sensibleness), yielding interpretable results. The comparably low cost of our framework allows for frequent evaluations of chatbots during their evaluation cycle. We empirically validate our claims by applying Spot The Bot to three domains, evaluating several state-of-the-art chat bots, and drawing comparisons to related work. The framework is released as a ready-to-use tool.')

Paper: Spot The Bot: A Robust and Efficient Framework for the Evaluation of Conversational Dialogue Systems
Most similar papers:
0.86	Multi-view Response Selection for Human-Computer Conversation	EMNLP 2016
0.84	Patterns of Argumentation Strategies across Topics	EMNLP 2017
0.84	Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog	EMNLP 2017
0.83	Towards Exploiting Background Knowledge for Building Conversation Systems	EMNLP 2018
0.83	AirDialogue: An Environment for Goal-Oriented Dialogue Research	EMNLP 2018


In [8]:
# EMNLP 2020 paper on making Sentence-BERT multilingual
search_papers(title='Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation',
              abstract='We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows to create multilingual versions from previously monolingual models. The training is based on the idea that a translated sentence should be mapped to the same location in the vector space as the original sentence. We use the original (monolingual) model to generate sentence embeddings for the source language and then train a new system on translated sentences to mimic the original model. Compared to other methods for training multilingual sentence embeddings, this approach has several advantages: It is easy to extend existing models with relatively few samples to new languages, it is easier to ensure desired properties for the vector space, and the hardware requirements for training is lower. We demonstrate the effectiveness of our approach for 50+ languages from various language families. Code to extend sentence embeddings models to more than 400 languages is publicly available.')

Paper: Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Most similar papers:
0.90	Sentence Compression for Arbitrary Languages via Multilingual Pivoting	EMNLP 2018
0.90	Learning Crosslingual Word Embeddings without Bilingual Corpora	EMNLP 2016
0.89	Unsupervised Multilingual Word Embeddings	EMNLP 2018
0.89	InferLite: Simple Universal Sentence Representations from Natural Language Inference Data	EMNLP 2018
0.88	Improving Cross-Lingual Word Embeddings by Meeting in the Middle	EMNLP 2018
