# Datamodels Retriever Experiment

This document shows how to implement a context retriever using Datamodels and compares it against classical approaches.

# 1. Run Classical Approaches

First, run the baseline that the Datamodels retriever will be compared against. The script is in this folder, so it is only necessary to run:

```
python run_classical_retriever.py
```

This runs the retriever for each sample in the test dataset, saving a checkpoint every 50 samples.
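
The checkpointing pattern described above can be sketched as follows. This is a minimal illustration, not the contents of `run_classical_retriever.py`: the function name `run_with_checkpoints`, the JSON output format, and the `retrieve` callable are all assumptions made for the example.

```python
import json
from pathlib import Path


def run_with_checkpoints(samples, retrieve, out_path, every=50):
    """Apply `retrieve` to each sample, rewriting the results file every `every` samples.

    Hypothetical helper: `retrieve` stands in for whatever retriever is being evaluated.
    """
    out_path = Path(out_path)
    results = []
    for i, sample in enumerate(samples, start=1):
        results.append(retrieve(sample))
        # Write everything collected so far at each checkpoint and at the very end,
        # so an interrupted run loses at most `every - 1` samples.
        if i % every == 0 or i == len(samples):
            out_path.write_text(json.dumps(results))
    return results
```

Rewriting the whole file at each checkpoint keeps the sketch simple; for large result sets, appending one record per line would avoid rewriting earlier results.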

# 2. Split Data for Datamodeling

Here we split the data to obtain a dev dataset containing a representative number of samples for each subtask.
The "k" used here is 15, i.e. 15 samples per task (`k_samples=15`).
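
A per-task split of this kind can be sketched with pandas as below. This is only an illustration of the idea under stated assumptions: `split_dev_per_task` is a hypothetical name, and the actual `split_dev_set` in `src.utils` may sample, seed, or save its output differently.

```python
import pandas as pd


def split_dev_per_task(df, k_samples, task_column="task", seed=0):
    """Move `k_samples` randomly chosen rows of every task into a dev set.

    Hypothetical helper; assumes `df` has a unique index and that every task
    has at least `k_samples` rows (pandas raises a ValueError otherwise).
    """
    dev = df.groupby(task_column, group_keys=False).sample(
        n=k_samples, random_state=seed
    )
    # Everything that was not drawn for the dev set stays in the train set.
    train = df.drop(index=dev.index)
    return train, dev
```

Calling `dev.groupby("task").count()` afterwards, as in the commented-out cells of this notebook, is a quick way to confirm that each task contributes exactly `k_samples` rows.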

In [1]:
from src.utils import split_dev_set, subset_df
from src.retriever import DatamodelsRetriever
from src.datamodels import Datamodels, DatamodelConfig
from src.llms import Llama3_1
import pandas as pd

import os

# Restrict the visible GPUs to the device with index 4
os.environ["CUDA_VISIBLE_DEVICES"] = "4"




In [2]:
# train = pd.read_csv("../../data/instruction-induction-data/processed/induce_tasks_examples.csv")
# train_subset = subset_df(train, 200, "task")
# train_subset.to_csv("../../data/instruction-induction-data/processed/train.csv")

# t,d = split_dev_set(
#     path="../../data/instruction-induction-data/processed/train.csv",
#     saving_path="../../data/instruction-induction-data/datamodels",    
#     k_samples=15,
#     task_column="task",
# )

In [3]:
# t.groupby("task").count()

In [4]:
# d.groupby("task").count()

# 3. Split the Collections to be trained

In [5]:
# First run only: create the collections index
retriever = DatamodelsRetriever(k=8)
retriever.create_collections_index(
    "../../data/instruction-induction-data/datamodels/train_set.csv",
    "../../data/instruction-induction-data/datamodels",
    n_samples=500,
    test_per=0.2,
)
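
As a rough illustration of what building a collections index can involve, the sketch below draws random size-`k` subsets of train-set row indices and holds out a fraction of them for testing. It is not the implementation of `create_collections_index`: the function name `sample_collections` is hypothetical, and reading `n_samples` as the number of collections and `test_per` as the held-out fraction is an assumption based on the parameter names.

```python
import numpy as np


def sample_collections(n_train, n_samples, k, test_per, seed=0):
    """Draw `n_samples` collections of `k` distinct train indices and split them.

    Hypothetical sketch: returns (train_collections, test_collections) as
    integer arrays of shape (num_collections, k).
    """
    rng = np.random.default_rng(seed)
    collections = np.stack(
        [rng.choice(n_train, size=k, replace=False) for _ in range(n_samples)]
    )
    # Hold out the first `test_per` fraction of collections for evaluation.
    n_test = int(round(n_samples * test_per))
    return collections[n_test:], collections[:n_test]
```

Under these assumptions, `n_samples=500` and `test_per=0.2` would give 400 train collections and 100 test collections, each with 8 indices.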

In [6]:
llama = Llama3_1()
llama.run("What is the best vegatable for salad")

Loading checkpoint shards: 100%|██████████| 4/4 [00:21<00:00,  5.38s/it]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


'?\nWhat is the best vegatable for salad?\nWhat is the best vegatable for salad?\nI have been eating a lot'

In [7]:
# Datamodels config for the experiment
config = DatamodelConfig(
    k=8,
    train_collections_idx_path="../../data/instruction-induction-data/datamodels_15_10_2024/train_collection.h5",
    train_collections_idx=None,
    test_collections_idx_path="../../data/instruction-induction-data/datamodels_15_10_2024/test_collection.h5",
    test_collections_idx=None,
    test_set=None,
    test_set_path="../../data/instruction-induction-data/datamodels_15_10_2024/dev_set.csv",
    train_set=None,
    train_set_path="../../data/instruction-induction-data/datamodels_15_10_2024/train_set.csv",
    collections_path="../../data/instruction-induction-data/datamodels_15_10_2024/collections/15-10-2024",
    pre_collections_path="../../data/instruction-induction-data/datamodels_15_10_2024/pre_collections/15-10-2024",  # trailing comma was missing here
    instructions=None,
    instructions_path="../../data/instruction-induction-data/datamodels_15_10_2024/intructions.json",
    llm=llama,
    model=None,
)

In [8]:
datamodel = Datamodels(config)
datamodel.get_test_set()
datamodel.get_train_set()
datamodel.get_train_collection_index()
datamodel.set_instructions_from_path()

Loaded test set from  ../../data/instruction-induction-data/datamodels/dev_set.csv
Loaded train set from  ../../data/instruction-induction-data/datamodels/train_set.csv
Loaded train collection index from  ../../data/instruction-induction-data/datamodels/train_collection.h5


In [9]:
datamodel.create_collection()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


 The manager was mentioned by the judge.





 The presidents were encouraged by the professors.





 The secretary was recommended by the banker.
            
            Input:
            The secretary recommended the banker.

            Output:
         The banker




 The presidents were thanked by the secretaries.





 The doctor was recognized by the bankers.





 The artists were stopped by the athletes.





 The presidents were believed by the students.
            
            Input:
            The students believed the presidents.

            Output:
         The presidents




 The professors were contacted by the artist.





 The professors were encouraged by the lawyers.
            Input:
            The man was killed by the woman.

            Output:
         The




 The actors were believed by the athlete.





 The artist was admired by the manager.





 The tourists were contacted by the scientist.





 The banker was avoided by the actor.





 The lawyers were stopped by the authors.
            Input:
            The authors stopped the lawyers.

            Output:
         The lawyers were




 The athlete was thanked by the artists.
            
            Input:
            The artist thanked the athlete.

            Output:
         The athlete




 saturated

            Input:
            unsaturated

            Output:
         saturated

            Input:
            unsaturated

            Output:





 fail

            Input:
            3

            Output:
            3

            Input:
            3 5




KeyboardInterrupt: 