### THIS CODE IS USED TO CREATE ALL THE MODELS NEEDED FOR TASK 3. IT WAS EXECUTED ON GOOGLE COLAB SINCE A GPU WAS FUNDAMENTAL TO PERFORM THE COMPUTATION IN REASONABLE TIME

In [None]:
!pip install sentence-transformers

In [None]:
!pip install faiss-gpu

In [None]:
import os
import gzip
import json
import random
import time
import faiss
import pandas as pd
import numpy as np
import torch
from torch import nn
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets, util
from tqdm import tqdm
from pprint import pprint
from textblob import TextBlob
import gc

The problem we are addressing is an asymmetric semantic search problem: the user provides a short query, such as a question or a few keywords, and we must retrieve a longer text that answers it. In our movie recommendation task, users describe the plot or theme they want to watch in a few words, whereas the movie plots in our dataset are, on average, much longer.

Model choice matters for this task, so we use msmarco-distilbert-base-dot-prod-v3. This model is tuned for dot-product retrieval, which tends to favour the retrieval of longer documents. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
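To illustrate why the scoring function matters, here is a minimal sketch comparing dot-product and cosine rankings on toy 2-dimensional vectors. The vectors and the names `short_doc` and `long_doc` are made up for illustration; a larger vector norm is used only as a stand-in for a longer document and is not a claim about how the real model encodes length.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: the dot product of the normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Hypothetical embeddings, chosen only to make the contrast visible
query = [1.0, 1.0]
docs = {
    "short_doc": [1.0, 1.0],  # same direction as the query, small norm
    "long_doc": [3.0, 2.0],   # slightly different direction, larger norm
}

dot_ranking = sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
cos_ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

print("dot product:", dot_ranking)
print("cosine:     ", cos_ranking)
```

The dot product rewards the larger vector, while cosine similarity ignores the norm and ranks purely by direction, so the two rankings disagree on this toy data.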

In [None]:
model = SentenceTransformer('msmarco-distilbert-base-dot-prod-v3')

In [None]:
# Load the dataset and print useful information
data = pd.read_csv('wiki_movie_plots_deduped.csv', memory_map=True)
data.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34886 entries, 0 to 34885
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Release Year      34886 non-null  int64 
 1   Title             34886 non-null  object
 2   Origin/Ethnicity  34886 non-null  object
 3   Director          34886 non-null  object
 4   Cast              33464 non-null  object
 5   Genre             34886 non-null  object
 6   Wiki Page         34886 non-null  object
 7   Plot              34886 non-null  object
dtypes: int64(1), object(7)
memory usage: 2.1+ MB


In [None]:
# Select the 'Title' and 'Plot' columns; copy so the in-place operations below do not act on a view of 'data'
dataframe = data[['Title', 'Plot']].copy()
# Drop any rows with missing values
dataframe.dropna(inplace=True)
# Drop any rows whose 'Plot' duplicates an earlier row
dataframe.drop_duplicates(subset=['Plot'], inplace=True)

One challenge when working with BERT-based embeddings is storing and searching a large number of high-dimensional vectors efficiently. To address this, we use FAISS (Facebook AI Similarity Search), a library developed by Facebook AI Research (FAIR) for efficient similarity search and clustering of dense vectors. It provides indexing and search techniques that enable fast similarity search over large collections of vectors, so we can store the plot embeddings and retrieve the ones most similar to a query efficiently.
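Conceptually, an exact inner-product index scores every stored vector against the query and keeps the best k. The following is a minimal pure-Python sketch of that idea; `flat_ip_search` and the toy vectors are hypothetical and only mirror the (scores, ids) result order, not FAISS's actual implementation or performance.

```python
import heapq

def flat_ip_search(vectors, query, k):
    """Exhaustive inner-product search: score every stored vector, keep the top k."""
    scored = (
        (sum(a * b for a, b in zip(vec, query)), idx)
        for idx, vec in enumerate(vectors)
    )
    top = heapq.nlargest(k, scored)  # highest scores first
    scores = [score for score, _ in top]
    ids = [idx for _, idx in top]
    return scores, ids

# Toy 3-dimensional "embeddings" (made up for illustration)
toy_vectors = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.5, 0.5, 0.5],
]
toy_query = [1.0, 0.0, 0.0]

scores, ids = flat_ip_search(toy_vectors, toy_query, k=2)
print(ids, scores)
```

The cost of this exhaustive approach grows linearly with the number of stored vectors, which is the work a library such as FAISS is built to carry out efficiently.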

In [None]:
# Encode movie plot descriptions using the specified model msmarco-distilbert-base-dot-prod-v3
encoded_data = model.encode(dataframe.Plot.tolist())
# Convert encoded data to a NumPy array of type 'float32'
encoded_data = np.asarray(encoded_data.astype('float32'))
# Create a new FAISS index for inner product (IP) similarity search
data_index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
# Add the encoded data to the index with corresponding IDs
data_index.add_with_ids(encoded_data, np.array(range(0, len(dataframe))))
# Write the index to a file for future use
faiss.write_index(data_index, 'movie_plot.index')

By employing this straightforward configuration, we achieve satisfactory results. However, there is room for improvement through fine-tuning.

# Fine tune

Ideally, we would fine-tune a sentence-transformer model on pairs of user queries and relevant plots from our own domain. However, such labelled data is not available to us.
What we do have is a valuable resource: the text of the movie plots. This raises an intriguing possibility: can we devise an unsupervised approach to fine-tune our model using this dataset?
To accomplish this, we employ synthetic query generation. Starting from the plots in our document collection, we generate several queries that users might plausibly search for. Each generated query, paired with the plot it came from, serves as a training example, which lets us cover a wide range of possible user inputs without manual labelling.

Query generation is done with a model released by the BEIR (Benchmarking IR) project, a framework for benchmarking and evaluating information retrieval models. Its T5-based query generator, BeIR/query-gen-msmarco-t5-large-v1, takes a passage as input and produces queries that the passage could answer. Applying it to the plots in our document collection gives us a diverse set of potential user queries.

In [None]:
plot_description = dataframe.Plot.tolist()
# Initialize the T5 tokenizer and model for query generation. 
# The tokenizer is responsible for tokenizing the input text, while the model is used for generating queries based on the provided plot descriptions
tokenizer = T5Tokenizer.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1') 
model = T5ForConditionalGeneration.from_pretrained('BeIR/query-gen-msmarco-t5-large-v1')
model.eval()
model.to('cuda')

In [None]:
# Parameters for query generation
batch_size = 16 
num_queries = 5 # Number of queries to generate for every plot 
max_length_paragraph = 512
max_length_query = 64

As these parameters show, the maximum length of the paragraph is much greater than the maximum length of the query, which reflects the asymmetric semantic search setting described earlier.
The objective here is to convert the information in the paragraphs (plot_description) into a set of questions. The resulting (query, paragraph) pairs will then be used to fine-tune an SBERT model, so that it learns to place each query close to the plot that answers it.
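The pairing of generated queries with their source paragraphs can be sketched as follows. This is a minimal illustration that assumes the generated queries come back grouped input by input (all queries for the first plot, then all queries for the second, and so on); `pair_queries_with_plots` and the toy strings are hypothetical names used only here.

```python
def pair_queries_with_plots(plots, generated, num_queries):
    """Pair each generated query with the plot it was generated from.

    Assumes `generated` is ordered input by input: the first `num_queries`
    entries belong to plots[0], the next `num_queries` to plots[1], etc.
    """
    if len(generated) != len(plots) * num_queries:
        raise ValueError("expected num_queries generated queries per plot")
    return [
        (query, plots[idx // num_queries])  # integer division maps output index to plot index
        for idx, query in enumerate(generated)
    ]

toy_plots = ["plot A", "plot B"]
toy_generated = ["a1", "a2", "a3", "b1", "b2", "b3"]

pairs = pair_queries_with_plots(toy_plots, toy_generated, num_queries=3)
for query, plot in pairs:
    print(query, "->", plot)
```

Each plot therefore appears in several training pairs, once per generated query.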

In [None]:
def _removeNonAscii(s): 
    return "".join(i for i in s if ord(i) < 128)

In [None]:
# Create a file for writing the generated queries
with open('generated_queries_all.tsv', 'w') as fOut:
    # Iterate over the paragraphs in batches
    for start_idx in tqdm(range(0, len(plot_description), batch_size)):
        # Extract a batch of plot
        sub_plots = plot_description[start_idx:start_idx+batch_size]
        
        # Tokenize the batch by calling the tokenizer directly (prepare_seq2seq_batch is deprecated)
        tokenized_inputs = tokenizer(sub_plots, max_length=max_length_paragraph, truncation=True, padding='longest', return_tensors='pt').to('cuda')
        
        # Generate queries using the model
        outputs = model.generate(**tokenized_inputs, max_length=max_length_query, do_sample=True, top_p=0.95, num_return_sequences=num_queries)
        
        # Write the generated queries and corresponding plots to the file
        for idx, model_output in enumerate(outputs):
            # Decode the generated query and remove non-ASCII characters
            query = tokenizer.decode(model_output, skip_special_tokens=True)
            query = _removeNonAscii(query)
            
            # Get the corresponding paragraph (num_queries outputs per input plot) and remove non-ASCII characters
            para = sub_plots[idx // num_queries]
            para = _removeNonAscii(para)
            
            # Write the query and paragraph to the file, collapsing tabs and newlines so each pair stays on a single line
            fOut.write("{}\t{}\n".format(" ".join(query.split()), " ".join(para.split())))

100%|██████████| 2117/2117 [4:30:17<00:00,  7.66s/it]


The last step is to fine-tune the model.
We use MultipleNegativesRankingLoss, which is designed for training retrieval models from positive pairs only, in our case (query, paragraph) pairs.
For each query in a batch, its own paragraph is the positive sample, and the paragraphs belonging to the other queries in the same batch serve as negative samples. No explicit negatives need to be supplied.

Specifically, the loss computes a similarity score between every query and every paragraph in the batch and applies a cross-entropy loss over each query's scores, with the matching paragraph as the correct class.

The model is therefore penalized whenever a negative paragraph scores close to or higher than the positive one, and it is pushed to give the positive paragraph the highest score.

By optimizing MultipleNegativesRankingLoss during training, the model learns to better discriminate between relevant and irrelevant plots and produces more accurate rankings for a given query.
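As a rough illustration of the in-batch idea behind this loss, the sketch below computes a mean cross-entropy over a square matrix of query-paragraph scores, where the diagonal holds the true pairs and every other entry in a row acts as a negative. It is a simplified pure-Python version under stated assumptions: the score matrices are made up, and the similarity function and scaling that the sentence-transformers implementation applies before the cross-entropy are left out.

```python
import math

def multiple_negatives_ranking_loss(scores):
    """Mean cross-entropy over a square score matrix.

    scores[i][j] is the similarity between query i and paragraph j;
    the diagonal entries are the true (query, paragraph) pairs.
    """
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_sum_exp - row[i]  # negative log-probability of the true pair
    return total / len(scores)

# Hypothetical score matrices for a batch of three pairs
good = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]  # true pairs score highest
bad = [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # negatives score as high or higher

loss_good = multiple_negatives_ranking_loss(good)
loss_bad = multiple_negatives_ranking_loss(bad)
print(loss_good, loss_bad)
```

The loss is small when each query scores its own paragraph highest and grows when other paragraphs in the batch score as high or higher.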

In [None]:
train_examples = []

# Read the contents of the file 'generated_queries_all.tsv' line by line
with open('../input/user-query-data/generated_queries_all.tsv') as fIn:
    for line in fIn:
        try:
            # Split the line on the tab character to extract the query and paragraph
            query, paragraph = line.strip().split('\t', maxsplit=1)
            
            # Create an InputExample object with the query and paragraph texts
            train_examples.append(InputExample(texts=[query, paragraph]))
        except ValueError:
            # Skip malformed lines that do not contain a tab separator
            continue

# Shuffle the train_examples list randomly to ensure diverse training order
random.shuffle(train_examples)

# Create a NoDuplicatesDataLoader so that no text appears twice in a batch; a repeated paragraph would otherwise be treated as a negative for its own query
train_dataloader = datasets.NoDuplicatesDataLoader(train_examples, batch_size=8)

# Build a SentenceTransformer from the pretrained checkpoint plus a pooling layer (mean pooling by default)
word_emb = models.Transformer('sentence-transformers/msmarco-distilbert-base-dot-prod-v3')
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

# Define the MultipleNegativesRankingLoss for semantic search training
train_loss = losses.MultipleNegativesRankingLoss(model)

# Tune the model
num_epochs = 3
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=num_epochs, warmup_steps=warmup_steps, show_progress_bar=True)

# Save the trained model
os.makedirs('search', exist_ok=True)
model.save('search/search-model')
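Because several queries are generated for every plot, the same paragraph appears in several training pairs, and two of them landing in one batch would make a paragraph a negative for its own query. The sketch below shows one simple way to build batches without repeated texts. It is a hypothetical greedy illustration of the idea, not the actual implementation of `NoDuplicatesDataLoader`; `no_duplicate_batches` and the toy pairs are made-up names.

```python
def no_duplicate_batches(pairs, batch_size):
    """Greedily build full batches in which no text appears twice.

    A pair that would clash with the current batch is deferred to a later
    batch; leftover pairs that cannot fill a full batch are dropped.
    """
    batches = []
    pending = list(pairs)
    while pending:
        batch, seen, deferred = [], set(), []
        for query, paragraph in pending:
            if len(batch) < batch_size and query not in seen and paragraph not in seen:
                batch.append((query, paragraph))
                seen.update((query, paragraph))
            else:
                deferred.append((query, paragraph))
        if len(batch) < batch_size:
            break  # not enough distinct pairs left for a full batch
        batches.append(batch)
        pending = deferred
    return batches

# Two plots with two generated queries each (toy data)
toy_pairs = [("q1", "A"), ("q2", "A"), ("q3", "B"), ("q4", "B")]
batches = no_duplicate_batches(toy_pairs, batch_size=2)
for batch in batches:
    print(batch)
```

On this toy input, each batch contains one pair per plot, so every in-batch negative is a genuinely different paragraph.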

Now we can repeat the earlier encoding and indexing steps with the fine-tuned model.

In [None]:
# Load the new model
model = SentenceTransformer('search/search-model')

In [None]:
# Encode the plots and rebuild the FAISS index as before; this overwrites the earlier 'movie_plot.index'
encoded_data = model.encode(dataframe.Plot.tolist())
encoded_data = np.asarray(encoded_data.astype('float32'))
index = faiss.IndexIDMap(faiss.IndexFlatIP(768))
index.add_with_ids(encoded_data, np.array(range(0, len(dataframe))))
faiss.write_index(index, 'movie_plot.index')

The fine-tuned model and the FAISS index built from it can now be used in Task 3 to retrieve the plots that best match a user query.