# Datamodels Retriever Experiment

This document shows how to implement a context retriever using Datamodels and compares it against classical approaches.

# 1. Run Classical Approaches

First, run the classical baseline that the Datamodels retriever will be compared against. The script is in this folder, so it is enough to run:

```
python run_classical_retriever.py
```

This runs the retriever on each sample of the test dataset and saves a checkpoint every 50 samples.
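
As an illustration of the checkpointing behaviour described above, here is a minimal sketch of a loop that persists results every 50 samples. It is not the code of `run_classical_retriever.py`; the function name `run_with_checkpoints`, the `retrieve` callable, and the JSON output format are all assumptions made for this example.

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 50  # interval described in the text above


def run_with_checkpoints(samples, retrieve, out_path, every=CHECKPOINT_EVERY):
    """Run `retrieve` on each sample and save all results so far every `every` samples.

    `retrieve` and `out_path` are hypothetical stand-ins, not part of the project API.
    """
    results = []
    for i, sample in enumerate(samples, start=1):
        results.append(retrieve(sample))
        # Write a checkpoint on every `every`-th sample and once more at the end.
        if i % every == 0 or i == len(samples):
            Path(out_path).write_text(json.dumps(results))
    return results
```

Rewriting the full result list at each checkpoint keeps the sketch simple; for large runs, appending to the file would likely be cheaper.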

# 2. Split Data for Datamodeling

Here we split the data to obtain a dev dataset containing a representative number of samples for each subtask.
The "k" used here is 15, i.e. 15 samples per subtask.
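
The per-task split can be sketched with pandas as follows. This is only an assumption about what a helper such as `split_dev_set` might do; the function `split_dev_per_task` below is hypothetical and is not the implementation in `src.utils`.

```python
import pandas as pd


def split_dev_per_task(df, k_samples=15, task_column="task", seed=0):
    """Hold out `k_samples` rows per task as a dev set; the rest stays in train.

    Hypothetical helper for illustration; assumes `df` has a unique index.
    """
    # Sample the same number of rows from every task group.
    dev = df.groupby(task_column, group_keys=False).sample(
        n=k_samples, random_state=seed
    )
    # Everything not selected for dev remains in the training set.
    train = df.drop(index=dev.index)
    return train, dev
```

Calling `dev.groupby("task").size()` afterwards is a quick way to confirm that every subtask contributes exactly `k_samples` rows.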

In [1]:
from src.utils import split_dev_set, subset_df
from src.retriever import DatamodelsRetriever
from src.datamodels import Datamodels, DatamodelConfig
from src.llms import Llama3_1
import pandas as pd

import os

# Restrict the visible GPUs to device 4
os.environ["CUDA_VISIBLE_DEVICES"] = "4"




In [2]:
# train = pd.read_csv("../../data/instruction-induction-data/processed/induce_tasks_examples.csv")
# train_subset = subset_df(train, 200, "task")
# train_subset.to_csv("../../data/instruction-induction-data/processed/train.csv")

# t,d = split_dev_set(
#     path="../../data/instruction-induction-data/processed/train.csv",
#     saving_path="../../data/instruction-induction-data/datamodels",    
#     k_samples=15,
#     task_column="task",
# )

In [3]:
# t.groupby("task").count()

In [4]:
# d.groupby("task").count()

# 3. Split the Collections to Be Trained

In [5]:
#### First time create collection #####
# retriever = DatamodelsRetriever(k=8)
# retriever.create_collections_index(
#     "../../data/instruction-induction-data/datamodels_15_10_2024/train_set.csv",
#     "../../data/instruction-induction-data/datamodels_15_10_2024",
#     n_samples=500,
#     test_per=0.2,
# )
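
To make the parameters above more concrete, here is a minimal sketch of one plausible reading: each collection is a random subset of `k` training-row indices, `n_samples` collections are drawn, and a fraction `test_per` of them is set aside as test collections. This interpretation is an assumption based on the argument names and on the `train_collection.h5` / `test_collection.h5` files; `sample_collections` is a hypothetical function, not `DatamodelsRetriever.create_collections_index`.

```python
import random


def sample_collections(n_train, k=8, n_samples=500, test_per=0.2, seed=0):
    """Draw `n_samples` random size-`k` index subsets and split them into train/test.

    Hypothetical sketch: `n_train` is the number of rows in the training set.
    """
    rng = random.Random(seed)
    # Each collection is k distinct row indices drawn without replacement.
    collections = [sorted(rng.sample(range(n_train), k)) for _ in range(n_samples)]
    # Reserve the first `test_per` fraction of collections for evaluation.
    n_test = round(n_samples * test_per)
    return collections[n_test:], collections[:n_test]


train_cols, test_cols = sample_collections(n_train=1000)
```

With the values used in the cell above (`n_samples=500`, `test_per=0.2`), this sketch would yield 400 train collections and 100 test collections.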

In [6]:
llama = Llama3_1()
llama.run("What is the best vegatable for salad")

Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.04s/it]


What is the best vegatable for salad
What is the best vegatable for salad?
I am a vegetarian and I love eating salads. I want to know what is the best vegetable to use in a salad


'What is the best vegatable for salad?\nI am a vegetarian and I love eating salads. I want to know what is the best vegetable to use in a salad'

In [7]:
######### Datamodels config for experiment #########
config = DatamodelConfig(
    k=8,
    train_collections_idx_path="../../data/instruction-induction-data/datamodels_15_10_2024/train_collection.h5",
    train_collections_idx=None,
    test_collections_idx_path="../../data/instruction-induction-data/datamodels_15_10_2024/test_collection.h5",
    test_collections_idx=None,
    test_set=None,
    test_set_path="../../data/instruction-induction-data/datamodels_15_10_2024/dev_set.csv",
    train_set=None,
    train_set_path="../../data/instruction-induction-data/datamodels_15_10_2024/train_set.csv",
    collections_path="../../data/instruction-induction-data/datamodels_15_10_2024/collections/15-10-2024",
    pre_collections_path="../../data/instruction-induction-data/datamodels_15_10_2024/pre_collections/15-10-2024",
    instructions=None,
    instructions_path="../../data/instruction-induction-data/datamodels_15_10_2024/intructions.json",
    llm=llama,
    model=None,
)

In [8]:
datamodel = Datamodels(config)
datamodel.get_test_set()
datamodel.get_train_set()
datamodel.get_train_collection_index()
datamodel.set_instructions_from_path()

Loaded test set from  ../../data/instruction-induction-data/datamodels_15_10_2024/dev_set.csv
Loaded train set from  ../../data/instruction-induction-data/datamodels_15_10_2024/train_set.csv
Loaded train collection index from  ../../data/instruction-induction-data/datamodels_15_10_2024/train_collection.h5


In [9]:
datamodel.create_pre_collection()

Collection id: 0

            Fill the expected Output according to the instruction
            Intruction: Write the input sentence in passive form.

            Examples:
            Input: 5040 
Output: five thousand and forty
Input: 1 64 
Output: 65
Input: grenade 
Output: r
Input: Sentence 1: A school bus is driving uphill on a rural road. Sentence 2: A race care driving along a dirt road. 
Output: 1 - probably not
Input: souvenir 
Output: near
Input: The student recognized the professors. 
Output: The professors were recognized by the student.
Input: Sentence 1: White House in damage control over Obama Supreme Court remarks Sentence 2: Fact check: Obama's Supreme Court remarks 
Output: 4 - almost perfectly
Input: camouflage 
Output: c


            User Input:
            The judge mentioned the manager.

            Model Output:
        

            Fill the expected Output according to the instruction
            Intruction: Write the input sentence in passive form.

        

KeyboardInterrupt: 
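
The prompt printed above appears to follow a fixed few-shot template: the task instruction, the examples of the collection, the user input, and an empty slot for the model output. The examples come from several different tasks, which is consistent with a collection being a random subset of the training set. A minimal sketch of such a template is shown below; `build_prompt` is a hypothetical helper, not the project's actual prompt code, and it normalizes whitespace and the spelling of the "Instruction" label.

```python
def build_prompt(instruction, examples, user_input):
    """Assemble a few-shot prompt from an instruction, (input, output) pairs, and a query.

    Hypothetical helper that mirrors the layout of the printed prompt.
    """
    # One "Input/Output" pair per in-context example.
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return (
        "Fill the expected Output according to the instruction\n"
        f"Instruction: {instruction}\n\n"
        f"Examples:\n{shots}\n\n"
        f"User Input:\n{user_input}\n\n"
        "Model Output:\n"
    )


prompt = build_prompt(
    "Write the input sentence in passive form.",
    [("The student recognized the professors.",
      "The professors were recognized by the student.")],
    "The judge mentioned the manager.",
)
```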