<a href="https://colab.research.google.com/github/ChenKua/xir/blob/main/robust04_Processed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load Dataset

Connect to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

Pickle file locations in my Drive:
<br>/content/drive/MyDrive/robust04/docs.pkl      contains the documents
<br>/content/drive/MyDrive/robust04/queries.pkl   contains the queries
<br>/content/drive/MyDrive/robust04/qrels.pkl     contains the query relevance judgments (qrels)
<br>For more details on the dataset, please refer to https://ir-datasets.com/trec-robust04.html

<br>Note that the official website only offers the data as a .tar file.

In [None]:
# queries
queries_df = pd.read_pickle("/content/drive/MyDrive/robust04/queries.pkl")

In [None]:
# documents
docs_df = pd.read_pickle("/content/drive/MyDrive/robust04/docs.pkl")

In [None]:
# query relevance judgments (qrels)
qrels_df = pd.read_pickle("/content/drive/MyDrive/robust04/qrels.pkl")

In [None]:
# Example
queries_df.head(2)

Unnamed: 0,query_id,title,description,narrative
0,301,International Organized Crime,Identify organizations that participate in int...,A relevant document must as a minimum identify...
1,302,Poliomyelitis and Post-Polio,Is the disease of Poliomyelitis (polio) under ...,Relevant documents should contain data or outb...


In [None]:
docs_df.head(1)

Unnamed: 0,doc_id,text,marked_up_doc
0,FBIS3-1,"\n\nPOLITICIANS, PARTY PREFERENCES \n\n Sum...","<TEXT>\nPOLITICIANS, PARTY PREFERENCES \n\n ..."


In [None]:
docs_df.shape

(528155, 3)

In [None]:
# Keep only the first 10,000 documents; otherwise Colab will easily crash in later steps.
docs_df = docs_df.head(10000)
docs_df.shape

(10000, 3)
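
After this truncation, relevance judgments in `qrels_df` may point at documents that are no longer in `docs_df`. A minimal sketch of restricting the judgments to the retained documents, assuming the qrels carry `query_id`, `doc_id`, and `relevance` fields as described on the ir_datasets page (the pickle's actual columns are not shown above, and `filter_qrels` is a hypothetical helper):

```python
def filter_qrels(qrels, kept_doc_ids):
    """Keep only the judgments whose document survived the truncation."""
    kept = set(kept_doc_ids)  # set membership is O(1) per lookup
    return [row for row in qrels if row["doc_id"] in kept]


# Toy records standing in for qrels_df.to_dict("records") (field names assumed).
toy_qrels = [
    {"query_id": "301", "doc_id": "FBIS3-1", "relevance": 1},
    {"query_id": "301", "doc_id": "LA010189-0001", "relevance": 0},
    {"query_id": "302", "doc_id": "FBIS3-2", "relevance": 2},
]
toy_kept = ["FBIS3-1", "FBIS3-2"]

filtered_qrels = filter_qrels(toy_qrels, toy_kept)
print(filtered_qrels)
```

On the real frames the same filter could be written as `qrels_df[qrels_df["doc_id"].isin(docs_df["doc_id"])]`, again assuming a `doc_id` column.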

# Vector Search with FAISS

This section follows the Hugging Face course chapter on semantic search with FAISS: https://huggingface.co/course/chapter5/6?fw=tf



In [None]:
!pip install datasets transformers[sentencepiece]
!pip install faiss-gpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
from datasets import Dataset

In [None]:
docs_dataset = Dataset.from_pandas(docs_df)
docs_dataset

Dataset({
    features: ['doc_id', 'text', 'marked_up_doc'],
    num_rows: 10000
})

In [None]:
from transformers import AutoTokenizer, TFAutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = TFAutoModel.from_pretrained(model_ckpt, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFMPNetModel: ['embeddings.position_ids']
- This IS expected if you are initializing TFMPNetModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFMPNetModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFMPNetModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFMPNetModel for predictions without further training.


In [None]:
def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
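
`last_hidden_state` has shape (batch, tokens, hidden), and `[:, 0]` keeps the vector at the first token position of every sequence, which is where the tokenizer places its start-of-sequence token (`<s>` for MPNet, the counterpart of BERT's `[CLS]`). The same selection on plain nested lists, with made-up numbers and a hypothetical helper name:

```python
def cls_pooling_lists(last_hidden_state):
    """Return the first-token vector of each sequence in a nested-list batch."""
    return [sequence[0] for sequence in last_hidden_state]


# Toy batch: 2 sequences, 3 token positions, hidden size 2.
toy_hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
]
pooled = cls_pooling_lists(toy_hidden)
print(pooled)  # one vector per sequence
```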

In [None]:
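def get_embeddings(text_list):
    # Tokenize a string or a list of strings. Inputs longer than the model's
    # maximum length are truncated; shorter ones are padded to the batch maximum.
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="tf"
    )
    # Convert the tokenizer output to a plain dict of tensors.
    encoded_input = {k: v for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    # Use the first-token vector as the embedding of each text.
    return cls_pooling(model_output)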
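
Because `truncation=True` cuts every input at the tokenizer's maximum length, only the beginning of a long news article contributes to its embedding. One common workaround, not used in this notebook, is to split each document into overlapping passages and embed those instead. A minimal sketch that splits on whitespace (word counts only approximate token counts; `split_into_passages`, `max_words`, and `overlap` are hypothetical names and values):

```python
def split_into_passages(text, max_words=200, overlap=50):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return passages


toy_text = " ".join(f"w{i}" for i in range(10))
toy_passages = split_into_passages(toy_text, max_words=4, overlap=1)
print(toy_passages)
```

Each passage would then need to keep its `doc_id` so that passage-level hits can be mapped back to documents.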

In [None]:
# Sanity check: embed the first document and inspect the output shape.
embedding = get_embeddings(docs_dataset[0]["text"])
embedding.shape

TensorShape([1, 768])

In [None]:
embeddings_dataset = docs_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).numpy()[0]}
)
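
Calling the model once per document, as the `map` above does, gives up the throughput of batched inference. A sketch of grouping texts into fixed-size batches before embedding; `iter_batches`, `embed_in_batches`, and `batch_size=32` are hypothetical, and `fake_embed` only stands in for `get_embeddings` so the snippet runs without the model:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=32):
    """Embed texts batch by batch and return one vector per text, in order."""
    vectors = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    # Stand-in for get_embeddings: one 2-dimensional vector per text.
    return [[float(len(text)), 1.0] for text in batch]


toy_texts = ["a", "bb", "ccc", "dddd", "eeeee"]
toy_vectors = embed_in_batches(toy_texts, fake_embed, batch_size=2)
print(len(toy_vectors))
```

With `datasets`, the analogous change would be to pass `batched=True` and a `batch_size` to `map` and to return the embeddings for the whole batch rather than only element `[0]`. Larger batches need more memory, since every text is padded to the longest one in its batch.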

  0%|          | 0/10000 [00:00<?, ?ex/s]

In [None]:
embeddings_dataset.add_faiss_index(column="embeddings")

  0%|          | 0/10 [00:00<?, ?it/s]

Dataset({
    features: ['doc_id', 'text', 'marked_up_doc', 'embeddings'],
    num_rows: 10000
})

In [None]:
question = "Famous Movie"
question_embedding = get_embeddings([question]).numpy()
question_embedding.shape

(1, 768)

In [None]:
scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=3
)

In [None]:
import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
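
How to read `scores` depends on the index metric. When `add_faiss_index` is called without a `metric_type`, as far as I can tell the index falls back to FAISS's default L2 distance; in that case smaller scores mean closer matches, and sorting in ascending order would list the best hit first. With an inner-product index the opposite holds. A toy comparison in plain Python (all vectors are made up):

```python
def squared_l2(a, b):
    """Squared Euclidean distance: smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dot(a, b):
    """Inner product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


toy_query = [1.0, 0.0]
toy_docs = {"near": [0.9, 0.1], "far": [-1.0, 0.5], "long": [3.0, 0.0]}

# Best match first under each metric.
by_l2 = sorted(toy_docs, key=lambda name: squared_l2(toy_query, toy_docs[name]))
by_dot = sorted(toy_docs, key=lambda name: dot(toy_query, toy_docs[name]), reverse=True)
print(by_l2)
print(by_dot)
```

The two metrics can disagree on the top hit, as they do for the `long` vector here. The checkpoint name ends in `dot-v1`, which suggests it was tuned for dot-product scoring, so an inner-product index may be a better match than L2.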

In [None]:
for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.text}")
    print(f"SCORE: {row.scores}")

    print("=" * 50)
    print()

COMMENT: 

Language:  English 
Article Type:BFN 

  [Text] Pyongyang, March 18 (KCNA) -- Comrade Kim Chong-il, 
supreme commander of the Korean People's Army, highly praised 
the feat of Hong Kyong-ae, a non-commissioned officer of the 
People's Army who died after saving her comrades while on her 
military duty, and recently took care that she was awarded the 
title of heroine of the Republic and her platoon was called 
"Hong Kyong-ae Platoon". 
  Last year, Comrade Kim Chong-il, upon hearing that Yu 
Kyong-nam, a soldier of the Korean People's Army, died after 
saving his comrades by covering a handgrenade on the point of 
explosion with his body, called him a "fine son of the country" 
and put him up as a hero of the Republic. 
  The self-sacrificing spirit of devoting oneself to the 
country and the people is highly displayed among KPA soldiers. 
In this course, many heroes and heroines have emerged. 
  In the 1990s when the Korean revolution has entered a new, 
higher stage of dev