# Retrieve & Re-Rank Demo over Simple Wikipedia

This example demonstrates the Retrieve & Re-Rank setup and lets you search over [Simple Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).

You can input a query or a question. The script then uses semantic search
to find relevant passages in Simple English Wikipedia (as it is smaller and fits better in RAM).

For semantic search, we use `SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')` and retrieve
32 passages that potentially answer the input query.

Next, we use a more powerful CrossEncoder (`cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')`) that
scores the query against each retrieved passage for relevance. The cross-encoder further boosts performance,
especially when you search over a corpus on which the bi-encoder was not trained.
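
The two-stage flow described above can be sketched without loading any model. In this minimal sketch, `cheap_score` and `expensive_score` are hypothetical stand-ins for the bi-encoder similarity and the cross-encoder relevance score. Only the control flow (retrieve a top-k candidate list, then re-score just those candidates) mirrors the notebook.

```python
def cheap_score(query, passage):
    """Hypothetical stand-in for bi-encoder similarity: word-set overlap."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q | p), 1)


def expensive_score(query, passage):
    """Hypothetical stand-in for a cross-encoder: rewards an exact phrase match."""
    bonus = 1.0 if query.lower() in passage.lower() else 0.0
    return cheap_score(query, passage) + bonus


def retrieve_and_rerank(query, passages, top_k=32, top_n=3):
    # Stage 1: score every passage cheaply and keep the top_k candidates.
    candidates = sorted(range(len(passages)),
                        key=lambda i: cheap_score(query, passages[i]),
                        reverse=True)[:top_k]
    # Stage 2: re-score only those candidates with the expensive scorer.
    reranked = sorted(candidates,
                      key=lambda i: expensive_score(query, passages[i]),
                      reverse=True)
    return [passages[i] for i in reranked[:top_n]]


toy_passages = [
    "pair sentence",
    "a pipeline for sentence pair classification of reports",
    "audio to audio template",
]
print(retrieve_and_rerank("sentence pair", toy_passages, top_k=2, top_n=1))
```

In this toy corpus the first stage ranks the short passage highest, while the second stage promotes the passage that contains the query as a phrase, which is the kind of reordering the re-ranker is meant to provide.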


In [2]:
%%capture
!pip install -U sentence-transformers rank_bm25 marko

In [4]:
from google.colab import drive
drive.mount('/content/drive')

%cd /content/drive/MyDrive/projects/hf_search

Mounted at /content/drive
/content/drive/MyDrive/projects/hf_search


In [67]:
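import pandas as pd
from marko import Markdown
from tqdm.notebook import tqdm

md = Markdown()

# `df` is assumed to be loaded earlier (not shown here) with one row per model
# and the columns "modelId" and "readme".
skip_types = {"BlankLine", "ThematicBreak"}

# Split every README into top-level Markdown blocks; each block becomes one passage.
passage_rows = []
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row["readme"]
    if isinstance(text, str) and text:  # skip missing or empty READMEs
        doc = md.parse(text)
        passage_rows.extend(
            {"id": i, "modelId": row["modelId"], "passage": md.render(e)}
            for e in doc.children
            if e.get_type() not in skip_types
        )
passages_df = pd.DataFrame(passage_rows)
passages_df.to_json('passages.jsonl', orient='records', lines=True)

passages_df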


| | id | modelId | passage |
|---|---|---|---|
| 0 | 0 | tizaino/bert-base-uncased-finetuned-Pisa | license: apache-2.0\ntags:\n |
| 1 | 0 | tizaino/bert-base-uncased-finetuned-Pisa | - generated_from_trainer\nmodel-index:\n- name... |
| 2 | 0 | tizaino/bert-base-uncased-finetuned-Pisa | <!-- This model card has been generated automa... |
| 3 | 0 | tizaino/bert-base-uncased-finetuned-Pisa | # bert-base-uncased-finetuned-Pisa\n |
| 4 | 0 | tizaino/bert-base-uncased-finetuned-Pisa | This model is a fine-tuned version of [bert-ba... |
| ... | ... | ... | ... |
| 254360 | 28912 | SongRb/distilbert-base-uncased-finetuned-cola | - learning_rate: 2e-05\n- train_batch_size: 16... |
| 254361 | 28912 | SongRb/distilbert-base-uncased-finetuned-cola | ### Training results\n |
| 254362 | 28912 | SongRb/distilbert-base-uncased-finetuned-cola | \| Training Loss \| Epoch \| Step \| Validation Lo... |
| 254363 | 28912 | SongRb/distilbert-base-uncased-finetuned-cola | ### Framework versions\n |
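
To make the passage splitting concrete, here is a minimal sketch that approximates it with the standard library only. It splits on blank lines and drops thematic breaks, which is an assumption that only roughly matches what a Markdown parser such as marko produces (for example, it does not keep a fenced code block containing blank lines together as one unit). `split_readme` is a hypothetical helper, not part of the notebook.

```python
import re


def split_readme(text):
    """Split Markdown text into block-level passages (rough approximation)."""
    passages = []
    for block in re.split(r"\n\s*\n", text):
        block = block.strip()
        # Skip empty blocks and thematic breaks such as '---' or '***'.
        if not block or re.fullmatch(r"([-*_])(\s*\1){2,}", block):
            continue
        passages.append(block)
    return passages


readme = "# My model\n\nThis model is fine-tuned.\n\n---\n\n### Training results\n"
for passage in split_readme(readme):
    print(repr(passage))
```

Each heading and paragraph becomes its own passage, which is why many rows in the table are short lines such as section titles.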


In [70]:
import warnings
warnings.filterwarnings("ignore")

import json
from sentence_transformers import SentenceTransformer, CrossEncoder, util
import gzip
import os
import torch
import pandas as pd
from marko import Markdown
md = Markdown()

if not torch.cuda.is_available():
    print("Warning: No GPU found. Please add GPU to your notebook")


# We use the bi-encoder to encode all passages so that we can use them for semantic search
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
bi_encoder.max_seq_length = 256     # Truncate long passages to 256 tokens
top_k = 32                          # Number of passages we want to retrieve with the bi-encoder

# The bi-encoder retrieves top_k (32) passages. We use a cross-encoder to re-rank this candidate list and improve its quality
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# As corpus, we use the passages that were extracted from the model READMEs
# and saved to passages.jsonl. They are encoded with the bi-encoder further below

passages_df = pd.read_json('passages.jsonl', lines=True)
passages = passages_df["passage"].values.tolist()

print("Passages:", len(passages))

Passages: 254365


In [71]:
# We encode all passages into our vector space. This takes about 5 minutes (depends on your GPU speed)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)

Batches:   0%|          | 0/7949 [00:00<?, ?it/s]
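
Conceptually, the semantic search step ranks passages by the cosine similarity between the query embedding and each passage embedding. The sketch below illustrates that idea with plain Python lists and made-up 3-dimensional vectors. It is an illustration only, not the `util.semantic_search` implementation, and `cosine` and `top_k_by_cosine` are hypothetical helpers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_by_cosine(query_vec, corpus_vecs, top_k):
    """Return the top_k hits as dicts with 'corpus_id' and 'score'."""
    hits = [{"corpus_id": i, "score": cosine(query_vec, vec)}
            for i, vec in enumerate(corpus_vecs)]
    hits.sort(key=lambda hit: hit["score"], reverse=True)
    return hits[:top_k]


# Made-up embeddings, purely for illustration.
corpus_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]
for hit in top_k_by_cosine(query_vec, corpus_vecs, top_k=2):
    print(hit["corpus_id"], round(hit["score"], 3))
```

The hit dictionaries use the same `corpus_id` and `score` keys that the search function in this notebook reads from its results.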

In [72]:
import joblib
joblib.dump(corpus_embeddings, 'corpus_embeddings.pkl')

['corpus_embeddings.pkl']

In [73]:
# We also compare the results to lexical search (keyword search). Here, we use 
# the BM25 algorithm which is implemented in the rank_bm25 package.

from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
from tqdm.autonotebook import tqdm
import numpy as np


# We lowercase the text, strip punctuation, and drop stop words before indexing
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc


tokenized_corpus = []
for passage in tqdm(passages):
    tokenized_corpus.append(bm25_tokenizer(passage))

bm25 = BM25Okapi(tokenized_corpus)


  0%|          | 0/254365 [00:00<?, ?it/s]
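
For intuition about what the lexical baseline computes, the sketch below scores tokenized documents with a textbook Okapi BM25 formula. It rests on assumptions: the parameter values `k1=1.5` and `b=0.75` and the IDF variant are common choices and may differ from what `rank_bm25` uses internally, and `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Non-negative IDF variant: log(1 + (N - n + 0.5) / (n + 0.5)).
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalised via b.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [["audio", "classification", "pipeline"],
        ["translate", "french", "text"],
        ["audio", "audio", "template"]]
print(bm25_scores(["audio", "template"], docs))
```

A document with no query term scores zero, and rare terms contribute more than frequent ones, which is why exact keyword matches dominate the BM25 results.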

In [90]:
# This function searches all passages for those that best match the query
def search(query):
    print("Input question:", query)

    ##### BM25 search (lexical search) #####
    bm25_scores = bm25.get_scores(bm25_tokenizer(query))
    # Pick the 5 best-scoring passages (unordered), then sort them by score
    top_n = np.argpartition(bm25_scores, -5)[-5:]
    bm25_hits = [{'corpus_id': idx, 'score': bm25_scores[idx]} for idx in top_n]
    bm25_hits = sorted(bm25_hits, key=lambda x: x['score'], reverse=True)

    print("Top-3 lexical search (BM25) hits")
    for hit in bm25_hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    ##### Semantic Search #####
    # Encode the query using the bi-encoder and find potentially relevant passages
    question_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Keep the query on the same device as the corpus embeddings (GPU or CPU)
    question_embedding = question_embedding.to(corpus_embeddings.device)
    hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)
    hits = hits[0]  # Get the hits for the first query

    ##### Re-Ranking #####
    # Now, score all retrieved passages with the cross_encoder
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)

    # Attach the cross-encoder score to each hit
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]

    # Output of top-3 hits from bi-encoder
    print("\n-------------------------\n")
    print("Top-3 Bi-Encoder Retrieval hits")
    hits = sorted(hits, key=lambda x: x['score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['score'], passages[hit['corpus_id']].replace("\n", " ")))

    # Output of top-3 hits from re-ranker
    print("\n-------------------------\n")
    print("Top-3 Cross-Encoder Re-ranker hits")
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    for hit in hits[0:3]:
        print("\t{:.3f}\t{}".format(hit['cross-score'], passages[hit['corpus_id']].replace("\n", " ")))
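
The BM25 step in `search` picks the five best-scoring passages without sorting the whole score array. A standard-library sketch of the same idea follows. It swaps in `heapq.nlargest` for `np.argpartition`, so it is an alternative illustration and not what the notebook runs, and `top_n_hits` is a hypothetical helper.

```python
import heapq


def top_n_hits(scores, n=5):
    """Return the n best hits as dicts, highest score first."""
    best = heapq.nlargest(n, enumerate(scores), key=lambda pair: pair[1])
    return [{"corpus_id": idx, "score": score} for idx, score in best]


toy_scores = [0.2, 9.4, 0.0, 14.9, 3.1, 8.5]
for hit in top_n_hits(toy_scores, n=3):
    print(hit["corpus_id"], hit["score"])
```

Unlike `np.argpartition`, `heapq.nlargest` already returns the hits in descending order, so no separate sort is needed in this variant.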


In [91]:
search(query = "translate arabic to french")

Input question: translate arabic to french
Top-3 lexical search (BM25) hits
	14.908	Here is how to use this model to translate legal text from Deustch to French in PyTorch: 
	14.908	Here is how to use this model to translate legal text from French to Deustch in PyTorch: 
	14.908	Here is how to use this model to translate legal text from Cszech to French in PyTorch: 

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.775	language: "Arabic" 
	0.658	# Arabic QA 
	0.630	# arabic-iti 

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	3.077	This is a model for translating English to Arabic. The special about this model that is take into considration the using of additional Arabic characters like `پ` or `گ`. 
	-2.486	language: "Arabic" 
	-3.603	# An English-Arabic Bilingual Encoder 
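
The cross-encoder scores above are unbounded and can be negative, so they are mainly meaningful relative to each other. If a bounded value is easier to read, one option (an assumption of this sketch, not something the notebook does) is to pass them through a logistic sigmoid, which preserves the ranking. Whether the result can be read as a calibrated probability is not established here.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function mapping a real score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Cross-encoder scores copied from the output above.
cross_scores = [3.077, -2.486, -3.603]
for score in cross_scores:
    print("{:.3f} -> {:.3f}".format(score, sigmoid(score)))
```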


In [77]:
search(query = "sentence pair classification")

Input question: sentence pair classification
Top-3 lexical search (BM25) hits
	15.199	CCaligned (en-id sentence pair) 
	14.940	This model can be used using Huggingface Transformers. I have created a pipeline for a sentence pair classification. Hope it will be useful. 
	14.381	### Finetuned on annual report sentence pair 

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.730	**Multi-class sentence classification:** 
	0.669	The model will predict scores for the pairs `('Sentence 1', 'Sentence 2')` and `('Sentence 3', 'Sentence 4')`. 
	0.669	The model will predict scores for the pairs `('Sentence 1', 'Sentence 2')` and `('Sentence 3', 'Sentence 4')`. 

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	5.674	This model can be used using Huggingface Transformers. I have created a pipeline for a sentence pair classification. Hope it will be useful. 
	1.662	**Multi-class sentence classification:** 
	0.285	The model will predict scores for the pairs `('Sentence 1', 'S

In [78]:
search(query = "audio preprocessing")

Input question: audio preprocessing
Top-3 lexical search (BM25) hits
	9.432	# Audio to Audio repository template 
	9.432	# Audio to Audio repository template 
	9.024	### Preprocessing 

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.638	# Audio-to-Audio Test 
	0.611	Update the audio_path as per your local file structure. 
	0.609	- asteroid - audio - ConvTasNet - audio-to-audio datasets: - wham - sep_clean license: cc-by-sa-4.0 widget: - example_title: Librispeech sample 1 src: https://cdn-media.huggingface.co/speech_samples/sample1.flac 

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	-1.592	```python import torch import torchaudio import librosa from datasets import load_dataset, load_metric from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor import re  # test_data = #TODO: WRITE YOUR CODE TO LOAD THE TEST DATASET. For sample see the Colab link in Training Section.  wer = load_metric("wer") processor = Wav2Vec2Processor.from_pretrained("gchhablani

In [82]:
search(query = "audio to text model")

Input question: audio to text model
Top-3 lexical search (BM25) hits
	9.432	# Audio to Audio repository template 
	9.432	# Audio to Audio repository template 
	8.503	You can use the model via the Audio Classification pipeline: 

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.715	You can use the model via the Audio Classification pipeline: 
	0.715	You can use the model via the Audio Classification pipeline: 
	0.715	You can use the model via the Audio Classification pipeline: 

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	3.205	# Text to Speech Model 
	2.942	# Personal speech to text model 
	0.871	To transcribe audio files the model can be used as a standalone acoustic model as follows: 


In [83]:
search(query = "image to text model")

Input question: image to text model
Top-3 lexical search (BM25) hits
	12.540	You can use the model for image and text retrieval. 
	12.540	You can use the model for image and text retrieval. 
	11.632	# Text To Image repository template 

-------------------------

Top-3 Bi-Encoder Retrieval hits
	0.735	You can use the model for image and text retrieval. 
	0.735	You can use the model for image and text retrieval. 
	0.659	## Generate images from text 

-------------------------

Top-3 Cross-Encoder Re-ranker hits
	7.029	Check out the model text-to-image and image-to-image capabilities using [this demo](https://huggingface.co/spaces/sujitpal/clip-rsicd-demo). 
	7.029	Check out the model text-to-image and image-to-image capabilities using [this demo](https://huggingface.co/spaces/sujitpal/clip-rsicd-demo). 
	5.815	You can use the model for image and text retrieval. 
