# Impact Context Reduction

This notebook selects the tasks and prepares the dataset used in the experiment.

In [36]:
import pandas as pd
import plotly.express as px
import random

random.seed(42)


## Load and Select 

The Instruction Induction data is loaded, and then the tasks with a unique response value are selected.

In [3]:
folder = "../../data/instruction-induction-data/raw"
tasks_df = pd.read_csv(f"{folder}/induce_tasks_examples.csv")

In [4]:
tasks_df

Unnamed: 0,task,input,output
0,active_to_passive,The tourist supported the authors.,The authors were supported by the tourist.
1,active_to_passive,The athlete contacted the tourists.,The tourists were contacted by the athlete.
2,active_to_passive,The judges believed the bankers.,The bankers were believed by the judges.
3,active_to_passive,The president encouraged the actor.,The actor was encouraged by the president.
4,active_to_passive,The lawyers believed the authors.,The authors were believed by the lawyers.
...,...,...,...
67614,word_in_context,Sentence 1: I know the feeling! Sentence 2: Ha...,same
67615,word_in_context,"Sentence 1: Confidence is always borrowed, nev...",not the same
67616,word_in_context,Sentence 1: Messages must go through diplomati...,not the same
67617,word_in_context,Sentence 1: The end of the year. Sentence 2: O...,not the same


In [6]:
tasks_df["task"].unique()

array(['active_to_passive', 'antonyms', 'diff', 'first_word_letter',
       'informal_to_formal', 'larger_animal', 'letters_list', 'negation',
       'num_to_verbal', 'orthography_starts_with', 'rhymes',
       'second_word_letter', 'sentence_similarity', 'sentiment',
       'singular_to_plural', 'sum', 'synonyms', 'taxonomy_animal',
       'translation_en-de', 'translation_en-es', 'translation_en-fr',
       'word_in_context'], dtype=object)

In [31]:
selected_tasks_list = [
    "active_to_passive",
    "antonyms",
    "diff",
    "first_word_letter",
    "num_to_verbal",
    "orthography_starts_with",
    "singular_to_plural",
    "sum",
    "synonyms",
    "taxonomy_animal",
]

## Select tasks
selected_tasks = tasks_df[tasks_df["task"].isin(selected_tasks_list)]

In [32]:
## Select tasks for test set
ex_tasks_df = pd.read_csv(f"{folder}/execute_tasks_examples.csv")
test = ex_tasks_df[ex_tasks_df["task"].isin(selected_tasks_list)]

## Plot distribution of samples

In [33]:
fig = px.histogram(selected_tasks, x="task", title="Induce Tasks Count")
fig.show()

In [35]:
fig = px.histogram(test, x="task", title="Execute Tasks Count")
fig.show()
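
The histograms above show the per-task counts visually. As a minimal sketch, the same counts can be read numerically with `value_counts`. The `toy_df` frame below is a hypothetical stand-in for `selected_tasks` or `test`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for `selected_tasks`; the real frame is loaded from the raw CSV files.
toy_df = pd.DataFrame(
    {
        "task": ["sum", "sum", "diff", "antonyms", "sum"],
        "input": ["1 2", "3 4", "5 1", "hot", "2 2"],
    }
)

# Number of rows per task, i.e. the quantity the histogram plots on its y-axis.
task_counts = toy_df["task"].value_counts()
print(task_counts)
```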

## Undersampling the pool of tasks

To avoid class imbalance when selecting the examples, we set an upper limit of 900 examples per class.
The examples are selected at random.

In [37]:
def get_random_sample_per_category(df: pd.DataFrame, n_samples_per_category: int = 900) -> pd.DataFrame:
    """
    Returns a sampled DataFrame with at most a specified number of rows per category.

    Args:
    df (pd.DataFrame): The input DataFrame to be sampled. It must contain a "task" column.
    n_samples_per_category (int): The maximum number of rows to keep per category.

    Returns:
    pd.DataFrame: The sampled DataFrame.

    Notes:
    - If a category has at least the specified number of rows, a random sample of exactly that many rows is returned (with a fixed random_state of 42).
    - If a category has fewer rows than the specified number, all of its rows are included and a warning message is printed.
    - Sampling may change the order of rows within a category.

    Example:
    >>> df = pd.DataFrame({'task': ['A', 'B', 'A', 'B', 'C', 'C'], 'value': [1, 2, 3, 4, 5, 6]})
    >>> sampled = get_random_sample_per_category(df, 2)
    >>> sorted(sampled['task'])
    ['A', 'A', 'B', 'B', 'C', 'C']
    """

    # Group the DataFrame by the "task" column
    grouped_df = df.groupby('task')

    # Create an empty DataFrame to store the results
    sampled_df = pd.DataFrame()

    # Iterate through each group and randomly sample n_samples_per_category rows
    for name, group in grouped_df:
        if len(group) >= n_samples_per_category:  # Check if there are enough rows in the group
            sampled_df = pd.concat([sampled_df, group.sample(n_samples_per_category, random_state=42)])
        else:
            print(f"Warning: Category '{name}' has less than {n_samples_per_category} rows. All rows included.")
            sampled_df = pd.concat([sampled_df, group])

    return sampled_df


In [38]:
# Undersample the examples
sampled_df = get_random_sample_per_category(selected_tasks, 900)
fig = px.histogram(sampled_df, x="task", title="Induce Tasks Count")
fig.show()
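
The per-task cap can also be written without an explicit loop. The sketch below is an alternative, not the function used in this notebook: `cap_rows_per_task` and `toy_df` are hypothetical names, and because it shuffles the whole frame first it will generally keep different rows than `get_random_sample_per_category`, although the per-task counts obey the same cap.

```python
import pandas as pd


def cap_rows_per_task(df: pd.DataFrame, cap: int, seed: int = 42) -> pd.DataFrame:
    """Keep at most `cap` randomly chosen rows for each value of the "task" column."""
    # Shuffle all rows once so that "the first `cap` rows of each group" is a random subset.
    shuffled = df.sample(frac=1, random_state=seed)
    # head(cap) keeps up to `cap` rows per task; smaller groups are kept whole.
    capped = shuffled.groupby("task").head(cap)
    # Restore the original row order.
    return capped.sort_index()


# Hypothetical toy data: task "A" exceeds the cap, task "B" does not.
toy_df = pd.DataFrame({"task": ["A"] * 5 + ["B"] * 2, "value": range(7)})
capped = cap_rows_per_task(toy_df, cap=3)
print(capped["task"].value_counts())
```

Unlike the loop version, this sketch prints no warning when a task has fewer rows than the cap.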

## Save the final DataFrames to .csv files

In [39]:
# save csvs
test.to_csv("test.csv")
sampled_df.to_csv("pool.csv")
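
By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again with `read_csv` (the same column name that appears in the `tasks_df` output earlier). The following sketch illustrates the difference on a hypothetical `toy_df`, using in-memory buffers instead of files. Whether to drop the index depends on whether the original row identifiers are needed downstream.

```python
import io

import pandas as pd

# Hypothetical stand-in for `test` or `sampled_df`, with a non-default index.
toy_df = pd.DataFrame({"task": ["sum", "diff"], "input": ["1 2", "5 1"]}, index=[10, 20])

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```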