# Information retrieval using sentence transformers in PyTorch

In this notebook, we build a retriever-reranker ensemble:
1. Retriever stage (the retriever is a model that, for a given topic, outputs *a large number* of content items, many of which are not relevant)
- First, we fine-tune a pretrained sentence transformer model, **'multi-qa-mpnet-base-dot-v1'** from the **Hugging Face** model hub, in an unsupervised fashion on a set of *positive* (topic title, content item title) pairs (`uns_train.csv`), that is, pairs in which the topic and the content item are known to be related.
- Then, using the fine-tuned model, we map all topic and content item titles to 768-dimensional real-valued vectors and, for every topic, find its 50 nearest content title vectors with a k-nearest-neighbors search.
- Next, we split the list of topics into a training set (`train_topics.csv`) and a test set (`test_topics.csv`) and, for every topic, compose a list of its content item neighbors.
- Finally, for every topic in the `train_topics.csv` list, we label its neighboring content items with either 0 or 1, based on the known correlations with the topic. The result is a training dataset (`sup_train.csv`) for the reranker model; this is the output of the retriever.
2. Reranker stage
- We construct a custom binary classification (0 or 1) model based on **'all-mpnet-base-v2'** from the **Hugging Face** model hub and train it on the `sup_train.csv` dataset. This is the reranker model: for every (topic, content item) pair it predicts whether the topic title and the content item title are related (outputs 1) or not (outputs 0).
- Finally, we test the reranker model: for every topic in `test_topics.csv` we use it to drop irrelevant content items that were originally output by the retriever.
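
The two stages above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's implementation: the 2-dimensional vectors, the `toy_scores` lookup, and the 0.5 threshold are all made up, whereas the real stages use a fine-tuned sentence transformer and a trained classifier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(topic_vec, content_vecs, k):
    """Stage 1: return ids of the k content items closest to the topic."""
    ranked = sorted(content_vecs,
                    key=lambda cid: cosine(topic_vec, content_vecs[cid]),
                    reverse=True)
    return ranked[:k]

def rerank(candidates, score_fn, threshold):
    """Stage 2: keep only candidates whose relevance score exceeds the threshold."""
    return [cid for cid in candidates if score_fn(cid) > threshold]

# Hypothetical 2-dimensional embeddings (the notebook uses 768 dimensions).
content_vecs = {
    "c1": [1.0, 0.0],
    "c2": [0.9, 0.1],
    "c3": [0.0, 1.0],
    "c4": [0.7, 0.7],
}
topic_vec = [1.0, 0.05]

# Hypothetical reranker scores standing in for the classifier's output.
toy_scores = {"c1": 0.9, "c2": 0.2, "c3": 0.1, "c4": 0.8}

candidates = retrieve(topic_vec, content_vecs, k=3)
kept = rerank(candidates, toy_scores.get, threshold=0.5)
print(candidates, kept)
```

The retriever deliberately over-generates (high recall), and the reranker is responsible for restoring precision.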

Note: The notebook was tested on an `ml.m5.large (4 vCPU + 16 GiB)` instance with the `Python3 (PyTorch 1.6 Python 3.6 CPU Optimized)` kernel.

## Setup
Update the `sagemaker` package and restart the kernel.

In [2]:
!pip install -U sagemaker -q



In [3]:
import sagemaker
sagemaker.__version__

'2.117.0'

In [4]:
!pip install sentence_transformers -q



In [5]:
import boto3, os, sagemaker
import json

sess = sagemaker.Session()
bucket = sess.default_bucket() 
prefix = 'sentencetransformer/input'
role = sagemaker.get_execution_role()

## Load the data and create a training set `uns_dataset` for the encoder. Save it to `uns_train.csv`

In [6]:
%%time

import pandas as pd

DATA_PATH = "./Kaggle/"

topics_df = pd.read_csv(DATA_PATH + "topics.csv")
content_df = pd.read_csv(DATA_PATH + "content.csv")
correlations_df = pd.read_csv(DATA_PATH + "correlations.csv")
sample_sub_df = pd.read_csv(DATA_PATH + "sample_submission.csv")

CPU times: user 6.87 s, sys: 975 ms, total: 7.84 s
Wall time: 11.8 s


In [7]:
def build_uns_dataset():
    topics = topics_df[topics_df['title'].notna()]
    content = content_df[content_df['title'].notna()]
    
    topics = topics[topics['language'] == 'en']
    content = content[content['language'] == 'en']
    
    print(' ')
    print('-' * 50)
    print(f"topics.shape: {topics.shape}")
    print(f"content.shape: {content.shape}")
    
    topics = topics.rename(columns = {"id": "topic_id",
                                         "title": "topic_title",
                                         "description": "topic_description",
                                         "language": "topic_language"
                                        }
                             )
    
    content = content.rename(columns = {"id": "content_id",
                                           "title": "content_title",
                                           "description": "content_description",
                                           "text": "content_text",
                                           "language": "content_language"
                                          }
                               )
    correlations_df["content_id"] = correlations_df["content_ids"].str.split(" ")
    corr = correlations_df.explode("content_id").drop(columns = ["content_ids"])
    
    corr = corr.merge(topics, how = "left", on = "topic_id")
    corr = corr.merge(content, how = "left", on = "content_id")
    corr = corr[corr['topic_title'].notna()]
    corr = corr[corr['content_title'].notna()]
    
    corr["set"] = corr[["topic_title", "content_title"]].values.tolist()
    dataset = pd.DataFrame(corr["set"])
    
    return dataset

In [20]:
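# Build the (topic title, content title) pairs before saving them
uns_dataset = build_uns_dataset()
uns_dataset.to_csv('uns_train.csv')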

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'uns_train.csv')).upload_file('uns_train.csv')

uns_data_input_path = "s3://{}/{}/".format(bucket, prefix)
uns_data_input_path

's3://sagemaker-us-east-1-852055550328/sentencetransformer/input/'

## Choose a pretrained sentence transformer as an encoder (test on 20% of `uns_dataset`)

In [10]:
from tqdm import tqdm
from sklearn.model_selection import train_test_split

uns_train , uns_test = train_test_split(uns_dataset, test_size = 0.2)

def create_test_sentences(dataset):
    sentences_1 = []
    sentences_2 = []
    
    dataset.reset_index(drop = True, inplace = True)

    for i in tqdm(range(len(dataset))):
        row = dataset.iloc[i]
        pair = row["set"]
        sentences_1.append(str(pair[0]))
        sentences_2.append(str(pair[1]))         
    
    return sentences_1, sentences_2

sentences_1, sentences_2 = create_test_sentences(uns_test)

In [28]:
from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1') # untrained score 0.6256573
#model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2') # untrained score 0.5305708
#model = SentenceTransformer('sentence-transformers/multi-qa-distilbert-cos-v1') # untrained score 0.509438
#model = SentenceTransformer('sentence-transformers/paraphrase-distilroberta-base-v2') # untrained score 0.5043382
#model = SentenceTransformer('sentence-transformers/all-distilroberta-v1') # untrained score 0.49077263

#Compute embedding for both lists
embeddings_1 = model.encode(sentences_1, show_progress_bar = True, convert_to_tensor = True)
embeddings_2 = model.encode(sentences_2, show_progress_bar = True, convert_to_tensor = True)

#Compute cosine-similarities
cosine_scores = util.cos_sim(embeddings_1, embeddings_2)

# Keep only the diagonal: the similarity between sentences_1[i] and sentences_2[i] for every pair
scores = cosine_scores.diagonal().cpu().numpy()
print(round(scores.mean(), 8))

del embeddings_1, embeddings_2, cosine_scores

Batches: 100%|██████████| 798/798 [02:19<00:00,  5.73it/s]
Batches: 100%|██████████| 798/798 [02:44<00:00,  4.86it/s]


In [None]:
#Hence, choose 'multi-qa-mpnet-base-dot-v1' as the basic retriever model
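
Only the diagonal of the similarity matrix is used in the comparison above, so the paired similarities can also be computed row by row without building the full N x N matrix. A minimal NumPy sketch with made-up embeddings; `paired_cosine` is a hypothetical helper, not part of `sentence_transformers`:

```python
import numpy as np

def paired_cosine(a, b):
    """Cosine similarity between a[i] and b[i] for every row i."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Toy stand-ins for the topic-title and content-title embeddings.
emb_1 = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
emb_2 = np.array([[2.0, 0.0], [1.0, 0.0], [1.0, 1.0]])

scores = paired_cosine(emb_1, emb_2)
print(scores)          # one score per (topic title, content title) pair
print(scores.mean())   # the quantity compared across candidate models
```

Memory use then grows linearly with the number of pairs rather than quadratically.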

## Fine-tune *'multi-qa-mpnet-base-dot-v1'* from *'HuggingFace Library'* on the dataset from `uns_train.csv`.

Here, instead of the `PyTorch` estimator, we use the `HuggingFace` estimator from the SageMaker Python SDK.

In [55]:
model_name = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'

In [65]:
# hyperparameters, which are passed into the training job
hyperparameters = {'epochs': 2,
                   'batch_size': 32,
                   'model_name': model_name
                  }

In [41]:
!pygmentize ./uns_train.py

import csv
import os
import pandas as pd
import boto3
import json
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses, InputExample
from datasets import Dataset
import random
...
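
The listing of `uns_train.py` is cut off, so the training objective is not visible here. One objective commonly used with positive-pair-only data is an in-batch negatives loss (in `sentence_transformers` this is `losses.MultipleNegativesRankingLoss`); whether the script uses it is an assumption. A NumPy sketch of the idea, with made-up embeddings and a hypothetical `scale` value:

```python
import numpy as np

def in_batch_negatives_loss(topic_emb, content_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    Row i of topic_emb is paired with row i of content_emb; every other
    row in the batch serves as a negative for it.
    """
    t = topic_emb / np.linalg.norm(topic_emb, axis=1, keepdims=True)
    c = content_emb / np.linalg.norm(content_emb, axis=1, keepdims=True)
    logits = scale * (t @ c.T)                            # shape (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

topics = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = np.array([[0.9, 0.1], [0.1, 0.9]])   # positives close to their topics
shuffled = aligned[::-1]                        # positives swapped between topics

loss_aligned = in_batch_negatives_loss(topics, aligned)
loss_shuffled = in_batch_negatives_loss(topics, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is small when each topic is closest to its own content item and large when the pairing is wrong, which is why no explicit negative labels are needed.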

In [66]:
from sagemaker.huggingface import HuggingFace
from sagemaker.huggingface import HuggingFaceModel

huggingface_estimator = HuggingFace(
    entry_point = 'uns_train.py',
    source_dir = './',
    instance_type = 'ml.p3.2xlarge', # GPU supported by Hugging Face
    instance_count = 1,
    role = role,
    transformers_version = '4.6',
    pytorch_version = '1.7',
    py_version = 'py36',
    hyperparameters = hyperparameters
)

In [67]:
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': uns_data_input_path}, wait = True)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker.image_uris:Defaulting to the only supported framework/algorithm version: latest.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-pytorch-training-2023-04-06-02-59-07-964


2023-04-06 03:06:20 Starting - Starting the training job...
2023-04-06 03:06:47 Starting - Preparing the instances for trainingProfilerReport-1680750380: InProgress
.........
2023-04-06 03:08:06 Downloading - Downloading input data...
2023-04-06 03:08:47 Training - Downloading the training image............
2023-04-06 03:10:47 Training - Training image download completed. Training in progress....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-04-06 03:11:10,791 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-04-06 03:11:10,822 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-04-06 03:11:10,825 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-04-06 03:11:36,041 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:

In [68]:
#Fine-tuned "multi-qa-mpnet-base-dot-v1" model
#uns_model_location = 's3://sagemaker-us-east-1-852055550328/huggingface-pytorch-training-2023-04-05-21-40-24-963/output/model.tar.gz'
#uns_model_location = 's3://sagemaker-us-east-1-852055550328/huggingface-pytorch-training-2023-04-06-00-25-59-604/output/model.tar.gz'
#uns_model_location = 's3://sagemaker-us-east-1-852055550328/huggingface-pytorch-training-2023-04-06-02-15-06-088/output/model.tar.gz'
uns_model_location = 's3://sagemaker-us-east-1-852055550328/huggingface-pytorch-training-2023-04-06-02-59-07-964/output/model.tar.gz'

## Deploy the fine-tuned transformer

In [69]:
from sagemaker.huggingface import HuggingFace
from sagemaker.huggingface import HuggingFaceModel

# Create Hugging Face Model Class

huggingface_model = HuggingFaceModel(
    model_data = uns_model_location,  # path to your trained sagemaker model
    entry_point = 'uns_inference.py',
    role = role, # iam role with permissions to create an Endpoint
    transformers_version = "4.6", # transformers version used
    pytorch_version = "1.7", # pytorch version used
    py_version = "py36", # python version of the DLC
)

In [21]:
!pygmentize ./uns_inference.py

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

#def mean_pooling(model_output, attention_mask):
#    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
#    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
#    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min = 1e-9)

#CLS Pooling - Take output from first token
...

In [70]:
#Inference using endpoint deployment

encoder = huggingface_model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.m5.xlarge"
)

INFO:sagemaker:Creating model with name: huggingface-pytorch-inference-2023-04-06-03-38-00-247
INFO:sagemaker:Creating endpoint-config with name huggingface-pytorch-inference-2023-04-06-03-38-01-046
INFO:sagemaker:Creating endpoint with name huggingface-pytorch-inference-2023-04-06-03-38-01-046


-----!

## Check performance of the fine-tuned encoder

In [71]:
from tqdm import tqdm
from sklearn.model_selection import train_test_split

uns_train , uns_test = train_test_split(uns_dataset, test_size = 0.01)

In [78]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

uns_test.reset_index(drop = True, inplace = True)

scores = []

for i in tqdm(range(len(uns_test))):
    row = uns_test.iloc[i]
    pair = row["set"]
    topic_title = str(pair[0])
    content_title = str(pair[1])
    
    # The endpoint returns a dict whose "vectors" entry holds the embedding
    topic_embed = encoder.predict({"inputs" : topic_title})
    vec_1 = topic_embed["vectors"]
    
    content_embed = encoder.predict({"inputs" : content_title})
    vec_2 = content_embed["vectors"]
    
    # cosine_similarity returns a 1 x 1 matrix for a single pair
    scores.append(cosine_similarity([vec_1], [vec_2])[0][0])
    
scores = np.array(scores)
print(round(scores.mean(), 8))

100%|██████████| 1276/1276 [03:57<00:00,  5.37it/s]

0.8861534





In [93]:
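#Note that the mean cosine similarity on positive pairs improved: 0.6256573 for the pretrained model vs 0.8861534 after fine-tuning
#The two numbers come from different random test splits, and the second may be optimistic because the test pairs are drawn from the dataset that was uploaded for fine-tuning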

## Split topics into training and testing subsets. Save the datasets to `train_topics.csv` and `test_topics.csv`

Take 5% for testing

In [6]:
%%time

import pandas as pd

def read_data(DATA_PATH):
    topics = pd.read_csv(DATA_PATH + "topics.csv")
    content = pd.read_csv(DATA_PATH + "content.csv")
    sample_submission = pd.read_csv(DATA_PATH + "sample_submission.csv")
    
    # Merge topics with sample submission to only infer test topics
    #topics = topics.merge(sample_submission, how = 'inner', left_on = 'id', right_on = 'topic_id')
    
    topics = topics[topics['title'].notna()]
    content = content[content['title'].notna()]
    
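    # Copy the filtered frames so that the columns added below do not trigger SettingWithCopyWarning
    topics = topics[topics['language'] == 'en'].copy()
    content = content[content['language'] == 'en'].copy()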
    
    # Fillna titles
    #topics['title'].fillna("", inplace = True)
    #content['title'].fillna("", inplace = True)
    
    # Sort by title length to make inference faster
    topics['length'] = topics['title'].apply(lambda x: len(x))
    content['length'] = content['title'].apply(lambda x: len(x))
    topics.sort_values('length', inplace = True)
    content.sort_values('length', inplace = True)
    
    # Drop cols
    #topics.drop(['description', 'channel', 'category', 'level', 'language', 'parent', 'has_content', 'length', 'topic_id', 'content_ids'], axis = 1, inplace = True)
    #content.drop(['description', 'kind', 'language', 'text', 'copyright_holder', 'license', 'length'], axis = 1, inplace = True)
    
    # Drop cols
    topics.drop(['description', 'channel', 'category', 'level', 'language', 'parent', 'has_content', 'length'], axis = 1, inplace = True)
    content.drop(['description', 'kind', 'language', 'text', 'copyright_holder', 'license', 'length'], axis = 1, inplace = True)
    
    # Reset index
    topics.reset_index(drop = True, inplace = True)
    content.reset_index(drop = True, inplace = True)
    print(' ')
    print('-' * 50)
    print(f"topics.shape: {topics.shape}")
    print(f"content.shape: {content.shape}")
    return topics, content

CPU times: user 8 µs, sys: 3 µs, total: 11 µs
Wall time: 13.1 µs


In [10]:
DATA_PATH = "./Kaggle/"
topics, content = read_data(DATA_PATH)

 
--------------------------------------------------
topics.shape: (36160, 2)
content.shape: (65939, 2)


In [75]:
from sklearn.model_selection import train_test_split

train_topics , test_topics = train_test_split(topics, test_size = 0.05)

print(' ')
print('-' * 50)
print(train_topics.shape)
print(test_topics.shape)

 
--------------------------------------------------
(34352, 2)
(1808, 2)
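
`train_test_split` is called here without `random_state`, so each run yields a different split. If a reproducible split is wanted, one option is to pass a fixed `random_state`; another, sketched below in plain Python, assigns each topic to a subset by hashing its id. The ids and the `is_test_topic` helper are hypothetical.

```python
import hashlib

def is_test_topic(topic_id, test_fraction=0.05):
    """Deterministically assign a topic to the test set based on its id."""
    digest = hashlib.md5(topic_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000            # stable bucket in [0, 9999]
    return bucket < test_fraction * 10_000

topic_ids = [f"t_{i:05d}" for i in range(2000)]  # made-up ids
test_ids = [t for t in topic_ids if is_test_topic(t)]
train_ids = [t for t in topic_ids if not is_test_topic(t)]

print(len(train_ids), len(test_ids))
```

With this approach a topic keeps its assignment even if the topic list is reordered or extended, at the cost of the test fraction being only approximately 5%.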


In [76]:
#Save train and test topics lists

train_topics.to_csv('train_topics.csv')
test_topics.to_csv('test_topics.csv')

boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train_topics.csv')).upload_file('train_topics.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test_topics.csv')).upload_file('test_topics.csv')

## Construct embeddings for training topics, testing topics and content items

We map all topic and content item titles to 768-dimensional real-valued vectors

In [79]:
train_topics = pd.read_csv("train_topics.csv")
test_topics = pd.read_csv("test_topics.csv")

In [86]:
title_list = train_topics['title'].tolist()

In [92]:
# Use the retriever model to get embeddings for topic and content item titles
from tqdm import tqdm

def get_embeddings(df, model):
    # `model` is the deployed encoder endpoint (a SageMaker predictor)
    data_embeddings = []
    title_list = df['title'].tolist()
    
    for title in tqdm(title_list):
        title_embed = model.predict({"inputs" : title})
        data_embeddings.append(title_embed["vectors"])
    
    return data_embeddings

In [93]:
train_topics_embeddings = get_embeddings(train_topics, encoder)

100%|██████████| 34352/34352 [50:22<00:00, 11.37it/s]


In [94]:
train_te = pd.DataFrame(train_topics_embeddings)
train_te.to_csv('train_topics_embeddings.csv')

In [95]:
test_topics_embeddings = get_embeddings(test_topics, encoder)

100%|██████████| 1808/1808 [02:39<00:00, 11.30it/s]


In [96]:
test_te = pd.DataFrame(test_topics_embeddings)
test_te.to_csv('test_topics_embeddings.csv')

In [97]:
content_embeddings = get_embeddings(content, encoder)

100%|██████████| 65939/65939 [1:46:39<00:00, 10.30it/s]


In [None]:
ce = pd.DataFrame(content_embeddings)
ce.to_csv('content_embeddings.csv')
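
When embeddings are written with `DataFrame.to_csv` and its default `index = True`, the row index is stored as the first CSV column. Reading the file back without `index_col = 0` returns that index as an extra numeric column, which would then be treated as an additional embedding dimension. A small sketch with a made-up 2 x 3 embedding matrix:

```python
import io
import pandas as pd

embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]   # toy stand-in for 768-dim vectors

buffer = io.StringIO()
pd.DataFrame(embeddings).to_csv(buffer)            # the index is written by default

buffer.seek(0)
with_index_column = pd.read_csv(buffer)            # index comes back as 'Unnamed: 0'

buffer.seek(0)
clean = pd.read_csv(buffer, index_col=0)           # index restored, only embeddings remain

print(with_index_column.shape)   # (2, 4)
print(clean.shape)               # (2, 3)
```

Because row indices can be far larger than embedding components, a stray index column can dominate cosine distances, so it is worth checking the shape of a reloaded embedding matrix before running a neighbor search.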

In [None]:
# The retriever model is not needed any more, so we delete the endpoint
encoder.delete_endpoint()

## Split content item embeddings into classes of nearest neighbours

For every training topic, we find the 50 content title vectors nearest to the topic title vector, using a k-nearest-neighbors search with cosine distance.
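
With the `metric = 'cosine'` and `algorithm = 'brute'` settings used in this section, the search amounts to ranking all content vectors by cosine distance (1 minus cosine similarity) for each topic vector. A NumPy sketch of that computation on made-up vectors; `cosine_top_k` is a hypothetical helper, not the scikit-learn implementation:

```python
import numpy as np

def cosine_top_k(queries, items, k):
    """Return (distances, indices) of the k items nearest to each query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    x = items / np.linalg.norm(items, axis=1, keepdims=True)
    distances = 1.0 - q @ x.T                       # cosine distance matrix
    order = np.argsort(distances, axis=1)[:, :k]    # k smallest distances per row
    return np.take_along_axis(distances, order, axis=1), order

# Toy content and topic vectors (the notebook uses 768 dimensions).
content_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
topic_vecs = np.array([[2.0, 0.1], [0.1, 3.0]])

distances, indices = cosine_top_k(topic_vecs, content_vecs, k=2)
print(indices)
```

Each row of `indices` lists positions in the content matrix, which is why the content DataFrame must keep the same row order that was used when the embeddings were computed.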

In [11]:
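# index_col = 0 keeps the saved row index out of the embedding matrix
train_topics_embeddings = pd.read_csv("train_topics_embeddings.csv", index_col = 0)
content_embeddings = pd.read_csv("content_embeddings.csv", index_col = 0)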

In [12]:
train_topics_embeddings = train_topics_embeddings.values.tolist()
content_embeddings = content_embeddings.values.tolist()

In [13]:
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import cross_val_score
import numpy as np

neighbors_model = NearestNeighbors(n_neighbors = 50, 
                                   metric = 'cosine',
                                   algorithm = 'brute',
                                   n_jobs = -1
                                  )
neighbors_model.fit(content_embeddings)

NearestNeighbors(algorithm='brute', metric='cosine', n_jobs=-1, n_neighbors=50)

In [14]:
distances, indices = neighbors_model.kneighbors(train_topics_embeddings)

In [15]:
train_topics = pd.read_csv("train_topics.csv")

In [16]:
predictions = []
for k in range(len(indices)):
    pred = indices[k]
    p = ' '.join([content.loc[ind, 'id'] for ind in pred])
    predictions.append(p)
    
train_topics['predictions'] = predictions

In [17]:
# This is the output of the retriever for the train topics dataset
train_topics.to_csv('train_topics_with_retrieved_content.csv')

In [18]:
del train_topics_embeddings, content_embeddings, distances, indices

## Build a training dataset from `train_topics_with_retrieved_content.csv` and save it to `sup_train.csv`

In [19]:
import pandas as pd

train_topics = pd.read_csv('train_topics_with_retrieved_content.csv')
train_topics.drop(['Unnamed: 0'], axis = 1, inplace = True)

print(' ')
print('-' * 50)
print(train_topics.shape)

 
--------------------------------------------------
(34352, 4)


In [20]:
DATA_PATH = "./Kaggle/"
topics, content = read_data(DATA_PATH)

content.set_index('id', inplace = True)
train_topics.reset_index(drop = True, inplace = True)
correlations_df.set_index('topic_id', inplace = True)

 
--------------------------------------------------
topics.shape: (36160, 2)
content.shape: (65939, 2)


In [21]:
import gc
from tqdm import tqdm

def build_sup_train_set(topics, content, correlations):
    # Create lists for training
    topics_ids = []
    content_ids = []
    title1 = []
    title2 = []
    label = []
    
    # Iterate over each topic
    for k in tqdm(range(len(topics))):
        row = topics.iloc[k]
        topics_id = row['id']
        topics_title = row['title']
        predictions = row['predictions'].split(' ')
        true_content = []
        if topics_id in correlations.index:
            true_content = correlations.loc[topics_id, 'content_ids'].split(' ')
        
        for pred in predictions:
            content_title = content.loc[pred, 'title']
            topics_ids.append(topics_id)
            content_ids.append(pred)
            title1.append(topics_title)
            title2.append(content_title)
            if pred in true_content:
                label.append(1)
            else:
                label.append(0)
                
        for item in true_content:
            if item in content.index:
                content_title = content.loc[item, 'title']
            else:
                continue
            topics_ids.append(topics_id)
            content_ids.append(item)
            title1.append(topics_title)
            title2.append(content_title)
            label.append(1)

    # Build training dataset
    train = pd.DataFrame(
        {'topics_ids': topics_ids, 
         'content_ids': content_ids, 
         'title1': title1, 
         'title2': title2,
         'label' : label
        }
    )
    # Release memory
    del topics_ids, content_ids, title1, title2, label

    return train

In [22]:
sup_train_set = build_sup_train_set(train_topics, content, correlations_df)

100%|██████████| 34352/34352 [00:21<00:00, 1619.36it/s]


In [23]:
sup_train_set.to_csv('sup_train.csv')
boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'sup_train.csv')).upload_file('sup_train.csv')

## Train reranker on the dataset from `sup_train.csv`

Choose `all-mpnet-base-v2` as the base model for the reranker.

In [56]:
sup_training_input_path = 's3://sagemaker-us-east-1-852055550328/sentencetransformer/input/'

In [58]:
# Hyperparameters that are passed into the training job
sup_hyperparameters = {'epochs': 2,
                       'batch_size': 32,
                       'model_name': 'sentence-transformers/all-mpnet-base-v2'
                      }

In [59]:
from sagemaker.pytorch import PyTorch

reranker = PyTorch(
    entry_point = 'sup_train.py',
    source_dir = './',
    role = role,
    instance_count = 1,
    instance_type = "ml.p3.2xlarge",
    hyperparameters = sup_hyperparameters,
    framework_version = "1.6",
    py_version = "py36"
)

In [60]:
# Starting the train job of our reranker
reranker.fit({'train': sup_training_input_path}, wait = True)

2023-04-07 02:05:07 Starting - Starting the training job...ProfilerReport-1680833107: InProgress
...
2023-04-07 02:06:07 Starting - Preparing the instances for training......
2023-04-07 02:07:07 Downloading - Downloading input data......
2023-04-07 02:08:07 Training - Downloading the training image......
2023-04-07 02:09:08 Training - Training image download completed. Training in progress....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-04-07 02:09:23,389 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-04-07 02:09:23,420 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-04-07 02:09:23,422 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-04-07 02:09:56,241 sagemaker-training-toolkit INFO     Installing dependencies from requirements.txt:

## Deploy the reranker

In [94]:
#sup_model_location = 's3://sagemaker-us-east-1-852055550328/pytorch-training-2023-04-04-20-36-48-263/output/model.tar.gz'
#sup_model_location = 's3://sagemaker-us-east-1-852055550328/pytorch-training-2023-04-06-14-08-26-364/output/model.tar.gz'
sup_model_location = 's3://sagemaker-us-east-1-852055550328/pytorch-training-2023-04-07-01-56-44-331/output/model.tar.gz'

In [95]:
from sagemaker.pytorch import PyTorchModel

reranker_model = PyTorchModel(model_data = sup_model_location,
                              role = role,
                              entry_point = 'sup_inference.py',
                              source_dir = './',
                              framework_version = "1.6",
                              py_version = "py36"
                             )

In [96]:
from sagemaker.serializers import CSVSerializer

predictor = reranker_model.deploy(initial_instance_count = 1,
                                  instance_type = "ml.m5.xlarge",
                                  serializer = CSVSerializer()
                                 )

---------------!

In [105]:
endpoint_name = 'pytorch-inference-2023-04-08-02-29-21-553'

## Retrieve MANY content items for each topic from `test_topics.csv`

In [110]:
import pandas as pd

test_topics = pd.read_csv("test_topics.csv")

In [109]:
DATA_PATH = "./Kaggle/"
topics, content = read_data(DATA_PATH)
correlations_df = pd.read_csv(DATA_PATH + "correlations.csv")

 
--------------------------------------------------
topics.shape: (36160, 2)
content.shape: (65939, 2)


In [76]:
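# index_col = 0 keeps the saved row index out of the embedding matrix
test_topics_embeddings = pd.read_csv("test_topics_embeddings.csv", index_col = 0)
content_embeddings = pd.read_csv("content_embeddings.csv", index_col = 0)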

In [77]:
test_topics_embeddings = test_topics_embeddings.values.tolist()
content_embeddings = content_embeddings.values.tolist()

In [78]:
from sklearn.neighbors import NearestNeighbors
import numpy as np

neighbors_model = NearestNeighbors(n_neighbors = 1000, metric = 'cosine')
neighbors_model.fit(content_embeddings)

NearestNeighbors(metric='cosine', n_neighbors=1000)

In [79]:
distances, indices = neighbors_model.kneighbors(test_topics_embeddings)

In [80]:
predictions = []
for k in range(len(indices)):
    pred = indices[k]
    p = ' '.join([content.loc[ind, 'id'] for ind in pred])
    predictions.append(p)
    
test_topics['predictions'] = predictions

In [81]:
# This is the output of the retriever
test_topics.to_csv('test_topics_with_retrieved_content.csv')

## Test the reranker on a random sample of topics from `test_topics.csv`

In [111]:
import pandas as pd

test_topics = pd.read_csv('test_topics_with_retrieved_content.csv')
test_topics.shape

(1808, 5)

In [112]:
DATA_PATH = "./Kaggle/"
topics, content = read_data(DATA_PATH)
correlations_df = pd.read_csv(DATA_PATH + "correlations.csv")

test_sample = test_topics.sample(10)
#test_sample = test_topics

content.set_index('id', inplace = True)
#test_sample.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis = 1, inplace = True)
test_sample.reset_index(drop = True, inplace = True)

 
--------------------------------------------------
topics.shape: (36160, 2)
content.shape: (65939, 2)


In [113]:
import gc
from tqdm import tqdm

def build_sup_test_set(topic_row, content):
    # Create lists for the test pairs
    topics_ids = []
    content_ids = []
    title1 = []
    title2 = []

    topics_id = topic_row['id']
    topics_title = topic_row['title']
    predictions = topic_row['predictions'].split(' ')
        
    for pred in predictions:
        content_title = content.loc[pred, 'title']
        topics_ids.append(topics_id)
        content_ids.append(pred)
        title1.append(topics_title)
        title2.append(content_title)

    # Build test dataset
    test = pd.DataFrame(
        {'topics_ids': topics_ids, 
         'content_ids': content_ids, 
         'title1': title1, 
         'title2': title2
        }
    )
    # Release memory
    del topics_ids, content_ids, title1, title2

    return test

In [114]:
def preprocess_data(data):
    #data['title1'].fillna("Title does not exist", inplace = True)
    #data['title2'].fillna("Title does not exist", inplace = True)
    # Create feature column
    data['text'] = data['title1'] + '[SEP]' + data['title2']
    # Drop titles
    data.drop(['title1', 'title2'], axis = 1, inplace = True)
    # Sort so inference is faster
    data['length'] = data['text'].apply(lambda x: len(x))
    data.sort_values('length', inplace = True)
    data.drop(['length'], axis = 1, inplace = True)
    data.reset_index(drop = True, inplace = True)
    return data

In [115]:
def f2_score(y_true, y_pred):
    # Each entry is a space-separated string of content ids; split() maps "" to an empty set
    y_true = y_true.astype(str).apply(lambda x: set(x.split()))
    y_pred = y_pred.astype(str).apply(lambda x: set(x.split()))
    tp = np.array([len(t & p) for t, p in zip(y_true, y_pred)])
    fp = np.array([len(p - t) for t, p in zip(y_true, y_pred)])
    fn = np.array([len(t - p) for t, p in zip(y_true, y_pred)])
    # Guard against empty predictions (tp + fp == 0) and empty ground truth (tp + fn == 0)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    # F2 = (5 * precision * recall) / (4 * precision + recall), written in terms of counts
    f2 = tp / np.maximum(tp + 0.2 * fp + 0.8 * fn, 1e-12)
    return round(precision.mean(), 8), round(recall.mean(), 8), round(f2.mean(), 8)
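
As a worked example of the metric, consider one topic with true content items {a, b, c, d} and predicted items {a, b, x}. Then tp = 2, fp = 1, fn = 2, and the precision/recall form of F2 agrees with the count-based form. The ids are made up for illustration:

```python
true_ids = set("a b c d".split())
pred_ids = set("a b x".split())

tp = len(true_ids & pred_ids)    # 2 correctly predicted items
fp = len(pred_ids - true_ids)    # 1 predicted item that is not relevant
fn = len(true_ids - pred_ids)    # 2 relevant items that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F-beta with beta = 2 puts more weight on recall than on precision.
f2_from_pr = 5 * precision * recall / (4 * precision + recall)
f2_from_counts = tp / (tp + 0.2 * fp + 0.8 * fn)

print(round(precision, 4), round(recall, 4), round(f2_from_counts, 4))
```

Missing a relevant item (weight 0.8) costs four times as much as adding an irrelevant one (weight 0.2), which is why the retriever is allowed to over-generate.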

In [116]:
import io
import json
import boto3
import numpy as np

client = boto3.client('sagemaker-runtime')

results = []

for k in tqdm(range(len(test_sample))):
    row = test_sample.iloc[k]
    topic_preds = build_sup_test_set(row, content)
    topic_preds = preprocess_data(topic_preds)
    
    # By default SageMaker expects comma-separated values
    csv_file = io.StringIO()
    topic_preds.to_csv(csv_file, sep = ",", header = ['topics_ids', 'content_ids', 'text'], index = False)
    csv_payload = csv_file.getvalue()
    
    response = client.invoke_endpoint(EndpointName = endpoint_name,
                                      ContentType = "text/csv",
                                      Body = csv_payload
                                     )
        
    result = response["Body"].read().decode()
    result = json.loads(result)
    result = np.array(result)
    
    topic_preds['prediction'] = result
    results.append(topic_preds)

# Collect per-topic predictions in one DataFrame (DataFrame.append was removed in pandas 2.0)
test = pd.concat(results)

100%|██████████| 10/10 [02:51<00:00, 17.11s/it]


In [90]:
threshold = 0.06

test['predictions'] = np.where(test['prediction'] > threshold, 1, 0)
test1 = test[test['predictions'] == 1]
test1 = test1.groupby(['topics_ids'])['content_ids'].unique().reset_index()
test1['content_ids'] = test1['content_ids'].apply(lambda x: ' '.join(x))
test1.columns = ['topic_id', 'content_ids']
test0 = pd.Series(test['topics_ids'].unique())
test0 = test0[~test0.isin(test1['topic_id'])]
test0 = pd.DataFrame({'topic_id': test0.values, 'content_ids': ""})
test_preds = pd.concat([test1, test0], axis = 0, ignore_index = True)

merged = pd.merge(test_preds, correlations_df, how = 'inner', on = 'topic_id')

y_pred = merged['content_ids_x']
y_true = merged['content_ids_y']

precision, recall, f2 = f2_score(y_true, y_pred)

print("Precision for the sample test set:", precision)
print("Recall for the sample test set:", recall)
print("F2 score for the sample test set:", f2)

Precision for the sample test set: 0.00112973
Recall for the sample test set: 0.02875408
F2 score for the sample test set: 0.00408274
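
The cut-off `threshold = 0.06` is fixed by hand above. One way to choose it, sketched below under the assumption that a labelled validation set of (score, label) pairs is available, is to sweep candidate thresholds and keep the one with the highest F2. The scores, labels, and candidate values are made up, and `best_f2_threshold` is a hypothetical helper:

```python
def f2(tp, fp, fn):
    """F2 from counts; defined as 0 when there is nothing to count."""
    denom = tp + 0.2 * fp + 0.8 * fn
    return tp / denom if denom > 0 else 0.0

def best_f2_threshold(scores, labels, candidates):
    """Return (threshold, f2) maximising F2 over the candidate thresholds."""
    best = (None, -1.0)
    for th in candidates:
        preds = [s > th for s in scores]
        tp = sum(p and y == 1 for p, y in zip(preds, labels))
        fp = sum(p and y == 0 for p, y in zip(preds, labels))
        fn = sum((not p) and y == 1 for p, y in zip(preds, labels))
        score = f2(tp, fp, fn)
        if score > best[1]:
            best = (th, score)
    return best

scores = [0.02, 0.05, 0.10, 0.40, 0.70, 0.90]   # made-up reranker outputs
labels = [0, 0, 1, 0, 1, 1]                     # made-up ground truth
threshold, value = best_f2_threshold(scores, labels, [0.01, 0.08, 0.30, 0.50, 0.80])
print(threshold, round(value, 4))
```

Because F2 favours recall, the selected threshold tends to be low: it keeps every relevant item at the price of a few false positives.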


In [117]:
# Delete the reranker endpoint
predictor.delete_endpoint()

## References: 
- https://www.kaggle.com/competitions/learning-equality-curriculum-recommendations
- https://www.sbert.net/examples/applications/retrieve_rerank/README.html
- https://www.kaggle.com/code/hasanbasriakcay/learning-equality-eda-fe-modeling
- https://www.kaggle.com/code/ragnar123/lecr-xlm-roberta-base-baseline