<div align="center">
    <img alt="Institut Polytechnique de Paris Logo" width="auto" height="150px" src="https://www.ip-paris.fr/sites/default/files/presse/Charte%20Graphique/Logo%20IP%20Paris%206%20%C3%A9coles%20vertical%20png.png" />
</div>

<div style="text-align: center; font-family: Arial, sans-serif; margin: 20px 0;">
    <h1 style="font-size: 32px; margin-bottom: 10px;">Lab 3: TableGPT</h1>
    <p style="font-size: 16px; margin: 0;">Authors: 
        <strong>Tim Luka Horstmann</strong> & <strong>William Liaw</strong>
    </p>
    
</div>

---
**Tim Luka Horstmann**

**Email:** [tim.horstmann@ip-paris.fr](mailto:tim.horstmann@ip-paris.fr)  
**Website:** [horstmann.tech](https://horstmann.tech)


**William Liaw**

**Email:** [william.liaw@telecom-paris.fr](mailto:william.liaw@telecom-paris.fr)

In this lab, we'll discover the power of code generation models through TableGPT2. The aim is to see how the model can be used in data analysis.

The notebook is divided into four sections:
0. Installation: This section is dedicated to module installation, model loading and data loading.
1. Guided introduction: Together, we'll discover how to use and evaluate TableGPT2.
2. More questions: You'll need to add at least one new question type to our simple evaluation system.
3. More datasets: You'll need to implement a question with multiple datasets.


IMPORTANT:
- You must work in pairs. You must submit **ONLY ONE NOTEBOOK** for each pair.
- Do not share your work with other pairs.
- You should not use Copilot, ChatGPT or similar tools. At the very least, remove the prompt ...
- <font color='red'>All the things you need to do are indicated in red.</font>

<font color='red'>**FIRST QUESTION:** What are the specificities of the TableGPT2 model?</font> https://huggingface.co/tablegpt/TableGPT2-7B

<h2 style="
    font-family: 'Times New Roman', Times, serif; 
    color: #cfcfcf; 
    font-size: 22px; 
    font-weight: normal; 
    border-bottom: 1px solid #ddd; 
    padding-bottom: 5px; 
    margin-bottom: 15px;">
    Answer
</h2>

The TableGPT2 model was developed to extend the capabilities of traditional pre-trained language models to structured data, specifically tables, addressing their limitations in understanding and processing tabular information.

Unlike general-purpose language models, TableGPT2 was trained using instruction tuning, where training examples were structured as (instruction, table, completion) triples. Many of these examples were annotated by human labelers to enhance performance. The model is built on the Qwen2.5 architecture and underwent continual pre-training on a diverse set of tables, optimizing it for structured data processing. A key innovation is its specialized table encoding, which improves interpretation across different table components, including rows, columns, and entire tables. The model also benefits from supervised fine-tuning to further enhance its effectiveness.

TableGPT2 delivers strong performance across various table-related tasks, including entity matching, error detection, data imputation, table-based question answering, and more.



<div style="
    background-color: #2C3E50; 
    color: #ECF0F1; 
    font-size: 28px; 
    text-align: center; 
    padding: 20px; 
    border-radius: 15px; 
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.2); 
    width: 100%; 
    margin: auto; 
    font-family: Times New Roman, sans-serif;">
    0. Setup

</div>

In [1]:
# !pip install transformers datasets bitsandbytes accelerate

In [2]:
import random
import re

import numpy as np
import pandas as pd
import torch
from tqdm.notebook import tqdm
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, GenerationConfig)

In [None]:
llm_name = "tablegpt/TableGPT2-7B"

# We want to use 4bit quantization to save memory
quantization_config = BitsAndBytesConfig(
    load_in_8bit=False, load_in_4bit=True
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_name, padding_side="left", cache_dir="/Data/tlh45")
# Prevent some transformers specific issues.
tokenizer.use_default_system_prompt = False
tokenizer.pad_token_id = tokenizer.eos_token_id

# Load LLM.
llm = AutoModelForCausalLM.from_pretrained(
    llm_name,
    quantization_config=quantization_config,
    device_map={"": 0}, # load all the model layers on GPU 0
    torch_dtype=torch.bfloat16, # float precision
    cache_dir="/Data/tlh45"
)
# Set LLM on eval mode.
llm.eval()


Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2SdpaAttention(
          (q_proj): Linear4bit(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear4bit(in_features=3584, out_features=3584, bias=False)
          (rotary_emb): Qwen2RotaryEmbedding()
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear4bit(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-0

In [4]:
generation_config = GenerationConfig(
  max_new_tokens=512,
  do_sample=False,
  # do_sample=True,
  # temperature=.7,
  # top_p=.8,
  # top_k=20,
  eos_token_id=tokenizer.eos_token_id,
  pad_token_id=tokenizer.pad_token_id,
)

In [None]:
df = pd.read_csv("hf://datasets/phihung/titanic/train.csv")
df = df.drop("Cabin", axis=1).dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Name         712 non-null    object 
 4   Sex          712 non-null    object 
 5   Age          712 non-null    float64
 6   SibSp        712 non-null    int64  
 7   Parch        712 non-null    int64  
 8   Ticket       712 non-null    object 
 9   Fare         712 non-null    float64
 10  Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 66.8+ KB


In [6]:
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


<div style="
    background-color: #2C3E50; 
    color: #ECF0F1; 
    font-size: 28px; 
    text-align: center; 
    padding: 20px; 
    border-radius: 15px; 
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.2); 
    width: 100%; 
    margin: auto; 
    font-family: Times New Roman, sans-serif;">
    1.1 Guided Introduction: The Model.

</div>

Below is an example of a prompt that can be used with TableGPT2.

```
Given access to several pandas dataframes, write the Python code to answer the user's question.
The answer should be store in a variable named "output".

/*
"df.head(5).to_string(index=False)" as follows:
 PassengerId  Survived  Pclass                                                Name    Sex  Age  SibSp  Parch           Ticket    Fare Embarked
           1         0       3                             Braund, Mr. Owen Harris   male 22.0      1      0        A/5 21171  7.2500        S
           2         1       1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0      1      0         PC 17599 71.2833        C
           3         1       3                              Heikkinen, Miss. Laina female 26.0      0      0 STON/O2. 3101282  7.9250        S
           4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0      1      0           113803 53.1000        S
           5         0       3                            Allen, Mr. William Henry   male 35.0      0      0           373450  8.0500        S
*/

Question: How many child survive? (under 18)
```

The prompt is divided into 3 parts:
1. The global instruction, which is to write Python code that answers a question about a specific dataset.
2. The header of the given dataset: the first 5 rows of the Titanic dataset.
3. The question to answer: "How many child survive? (under 18)"
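The same three-part structure can be extended to more than one dataframe, which becomes relevant in section 3. The sketch below is only an illustration: `build_prompt`, `PROMPT_HEADER` and `TABLE_BLOCK` are hypothetical names, the previews are toy strings standing in for real `df.head(5).to_string(index=False)` output, and repeating one `/* ... */` block per dataframe is an assumption about what the model handles well, not a documented TableGPT2 format.

```python
PROMPT_HEADER = (
    "Given access to several pandas dataframes, write the Python code to "
    "answer the user's question.\n"
    'The answer should be stored in a variable named "output".\n'
)

# One preview block per dataframe, mirroring the single-table prompt above.
TABLE_BLOCK = (
    '/*\n"{var_name}.head(5).to_string(index=False)" as follows:\n{preview}\n*/\n'
)


def build_prompt(previews, question):
    """Assemble the instruction, one preview block per dataframe, and the question.

    `previews` maps a variable name to the text preview of that dataframe.
    """
    blocks = [
        TABLE_BLOCK.format(var_name=name, preview=text)
        for name, text in previews.items()
    ]
    return PROMPT_HEADER + "\n" + "\n".join(blocks) + "\n" + f"Question: {question}\n"


# Toy previews (hypothetical), standing in for real `to_string` output.
demo = build_prompt(
    {"titanic": " Sex  Age\nmale 22.0", "adult": "gender  age\n  Male   25"},
    "How many rows does each table have?",
)
print(demo)
```

With real data, each preview would come from calling `head(5).to_string(index=False)` on the corresponding dataframe.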


First, we will implement a function that generates an answer for this prompt.

<font color='red'>TODO: Fill in the `generate_answer` function following the comments inside.</font>


In [None]:
example_prompt_template = """Given access to several pandas dataframes, write the Python code to answer the user's question.
The answer should be store in a variable named "output". Don't put print statements in the code.

/*
"{var_name}.head(5).to_string(index=False)" as follows:
{df_info}
*/

Question: {user_question}
"""

def generate_answer(prompt, llm=llm, generation_config=generation_config):
    # Create turns with the given prompt
    turns = [{"role": "user", "content": prompt}]

    # Tokenize turns.
    input_ids = tokenizer.apply_chat_template(turns, return_tensors="pt").to("cuda")

    # Ensure we don't use gradient to save memory space and computation time.
    with torch.no_grad():
        outputs = llm.generate(input_ids, generation_config)

    # Keep only the newly generated tokens (the slice also drops the final token, assumed to be EOS) and decode them.
    answer_tokens = outputs[0, input_ids.shape[1] : -1]
    return tokenizer.decode(answer_tokens).strip()



prompt = example_prompt_template.format(
    var_name="df",
    df_info=df.head(5).to_string(index=False),
    user_question="How many child survive? (under 18)",
)

answer = generate_answer(prompt)

print(prompt)
print("\n*****\n")
print(answer)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Given access to several pandas dataframes, write the Python code to answer the user's question.
The answer should be store in a variable named "output". Don't put print statements in the code.

/*
"df.head(5).to_string(index=False)" as follows:
 PassengerId  Survived  Pclass                                                Name    Sex  Age  SibSp  Parch           Ticket    Fare Embarked
           1         0       3                             Braund, Mr. Owen Harris   male 22.0      1      0        A/5 21171  7.2500        S
           2         1       1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0      1      0         PC 17599 71.2833        C
           3         1       3                              Heikkinen, Miss. Laina female 26.0      0      0 STON/O2. 3101282  7.9250        S
           4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0      1      0           113803 53.1000        S
           5         0       3          
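One detail of `generate_answer` is worth a closer look: the slice `outputs[0, input_ids.shape[1] : -1]` always drops the last generated token, on the assumption that it is the end-of-sequence token. If generation stops because `max_new_tokens` is reached, the last token is real content. The sketch below shows a more defensive variant on plain Python lists; `new_token_ids` and the token ids are hypothetical and only illustrate the slicing logic.

```python
def new_token_ids(sequence, prompt_length, eos_token_id):
    """Return the ids generated after the prompt, dropping one trailing EOS if present."""
    generated = list(sequence[prompt_length:])
    if generated and generated[-1] == eos_token_id:
        generated = generated[:-1]
    return generated


EOS = 0  # hypothetical end-of-sequence id
prompt_ids = [11, 12, 13]

finished = prompt_ids + [21, 22, EOS]  # generation ended with EOS
cut_off = prompt_ids + [21, 22, 23]    # generation hit the token limit

print(new_token_ids(finished, len(prompt_ids), EOS))
print(new_token_ids(cut_off, len(prompt_ids), EOS))
```

In the notebook, the same check could be applied to `outputs[0].tolist()` before decoding.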

<div style="
    background-color: #2C3E50; 
    color: #ECF0F1; 
    font-size: 28px; 
    text-align: center; 
    padding: 20px; 
    border-radius: 15px; 
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.2); 
    width: 100%; 
    margin: auto; 
    font-family: Times New Roman, sans-serif;">
    1.2 Guided Introduction: The Answer.

</div>

As you can see, the model answers with some generated code.

```
Python code:
```python
# Filter the dataframe to include only passengers under the age of 18
children = df[df['Age'] < 18]

# Count the number of children who survived
child_survivors = children[children['Survived'] == 1]

# Save the answer in the variable output
output = len(child_survivors)
```

We now need to execute it, but there are some difficulties:
1. Sometimes the LLM answers with \`\`\`python ... \`\`\`, and sometimes it answers directly with the code. We need to handle both cases.
2. We need to recover the variable `output` from the execution.
3. We need to evaluate both single values and lists of values.
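The first difficulty can be sketched with a slightly more tolerant extractor than a `python`-only pattern. This is a minimal sketch, not the reference solution: `extract_code` is a hypothetical helper, and accepting a bare fence or a fence that is never closed (for example when generation is cut off) is an assumption about which answer shapes can occur.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable

# Opening fence with an optional language tag, then the code up to the
# closing fence, or up to the end of the answer if the fence is never closed.
FENCE_RE = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)(?:" + FENCE + r"|\Z)",
    re.DOTALL | re.IGNORECASE,
)


def extract_code(answer):
    """Return the code inside the first fenced block, or the whole answer if unfenced."""
    match = FENCE_RE.search(answer)
    if match:
        return match.group(1).strip()
    return answer.strip()


fenced = "Python code:\n" + FENCE + "python\noutput = 1\n" + FENCE
bare = FENCE + "\noutput = 2\n" + FENCE
plain = "output = 3"
unclosed = FENCE + "python\noutput = 4"

for sample in (fenced, bare, plain, unclosed):
    print(repr(extract_code(sample)))
```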


Next, we will implement a function that executes the generated code and compares its `output` with the expected answer.

<font color='red'>TODO: Fill in the `exec_answer` function following the comments inside.</font>


In [8]:
def exec_answer(answer, gold):

    # Extract the code from the answer. Be careful, the code is not always in ``` ```.
    code_fence_pattern = re.compile(r"```python\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)
    match = code_fence_pattern.search(answer)

    if match:
        code = match.group(1)
    else:
        code = answer.strip()

    # Execute the code, https://docs.python.org/3/library/functions.html#exec
    try:
        local_namespace = {}
        exec(code, globals(), local_namespace)

        if "output" not in local_namespace:
            raise ValueError("The code did not define the variable `output`.")

        output = local_namespace["output"]

        # If the code works: return True or False based on output == gold (be careful to handle iterables!)
        if isinstance(output, pd.Series) or isinstance(output, np.ndarray):
            output = output.tolist()
        if isinstance(gold, pd.Series) or isinstance(gold, np.ndarray):
            gold = gold.tolist()
        if isinstance(output, (list, tuple)) and isinstance(gold, (list, tuple)):
            return (list(output) == list(gold))
        else:
            return output == gold

    except Exception as e:
        # If the code does not work, return False.
        print(f"Error during execution: {e}")
        return False


print(exec_answer(answer, 61))

True
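Two points in `exec_answer` can be made more robust. First, when `exec` receives separate globals and locals dictionaries, the code runs as if it were inside a class body, so a function defined by the generated code cannot see other names that the code assigns at top level. Second, `==` is brittle for floating-point results. The sketch below uses a single namespace and a tolerant comparison; `run_generated_code`, `answers_match` and the tolerance are hypothetical choices, not part of the lab's reference solution.

```python
import math


def run_generated_code(code, variables):
    """Execute `code` in one fresh namespace and return its `output` variable."""
    namespace = dict(variables)  # copy so the caller's dictionary is not modified
    exec(code, namespace)
    if "output" not in namespace:
        raise ValueError("The code did not define the variable `output`.")
    return namespace["output"]


def answers_match(output, gold, rel_tol=1e-9):
    """Compare scalars or sequences, using a tolerance when floats are involved."""
    # pandas Series, NumPy arrays and NumPy scalars all expose `tolist`.
    if hasattr(output, "tolist"):
        output = output.tolist()
    if hasattr(gold, "tolist"):
        gold = gold.tolist()
    output_is_seq = isinstance(output, (list, tuple))
    gold_is_seq = isinstance(gold, (list, tuple))
    if output_is_seq != gold_is_seq:
        return False
    if output_is_seq:
        return len(output) == len(gold) and all(
            answers_match(o, g, rel_tol) for o, g in zip(output, gold)
        )
    if isinstance(output, float) or isinstance(gold, float):
        try:
            return math.isclose(output, gold, rel_tol=rel_tol)
        except TypeError:  # e.g. comparing a string with a float
            return False
    return output == gold


# The helper function refers to `limit`, which only works with a single namespace.
code = (
    "limit = 18\n"
    "def is_child(age):\n"
    "    return age < limit\n"
    "output = [a for a in ages if is_child(a)]"
)
result = run_generated_code(code, {"ages": [4, 30, 16]})
print(result, answers_match(result, [4, 16]))
```

In the notebook, `variables` would hold the dataframes the generated code expects, for example `{"df": df, "pd": pd}`.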


<div style="
    background-color: #2C3E50; 
    color: #ECF0F1; 
    font-size: 28px; 
    text-align: center; 
    padding: 20px; 
    border-radius: 15px; 
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.2); 
    width: 100%; 
    margin: auto; 
    font-family: Times New Roman, sans-serif;">
    1.3 Guided Introduction: The Question.

</div>

Now we want to automatically generate questions to evaluate the performance of our model. There are benchmarks on this subject, but here we want to practice coding by generating the questions ourselves.

We will generate some basic filter questions.

<font color='red'>TODO: Fill in the `generate_filter_question` function following the comments inside.</font>


In [9]:
def generate_random_question(generate_function, df, k=1, seed=42):
    random.seed(seed)
    return [generate_function(df) for _ in range(k)]


def generate_filter_question(df):
    # Create a question template that takes a target column, a filter column and a filter value
    question_template = "Which entries in `{target_column}` correspond to rows where `{filter_column}` = `{filter_value}`?"

    # Get a random target column and a random filter column (be careful, they must be different)
    columns = list(df.columns)

    if len(columns) < 2:
        raise ValueError(
            "DataFrame must have at least two columns to create a valid question."
        )

    # Ensure the target column and filter column are different
    target_column, filter_column = random.sample(columns, 2)

    # Get a random filter value from the filter column. Avoid NaN values.
    valid_values = df[filter_column].dropna().unique()

    if len(valid_values) == 0:
        raise ValueError(
            f"No valid values found in column {filter_column} for filtering."
        )

    filter_value = random.choice(valid_values)

    # Compute the correct answer for the given target column, filter column and filter value.
    filtered_df = df[df[filter_column] == filter_value]
    answer = filtered_df[target_column].tolist()

    # Return the formatted question and its answer in a dict {"question": question, "answer": answer}
    question = question_template.format(
        target_column=target_column,
        filter_column=filter_column,
        filter_value=filter_value,
    )

    return {"question": question, "answer": answer}


print(generate_random_question(generate_filter_question, df, k=5))

[{'question': 'Which entries in `Embarked` correspond to rows where `Survived` = `0`?', 'answer': ['S', 'S', 'S', 'S', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'S', 'C', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'C', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'S', 'S', 'C', 'S', 'Q', 'S', 'C', 'S', 'S', 'C', 'S', 'S', 'C', 'S', 'S', 'S', 'C', 'S', 'S', 'C', 'Q', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'Q', 'S', 'C', 'S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'Q', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'S', 'Q', 'Q', 'S', 'S', 'C', 'S', 'C', 'S', 'S', 'C', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'C', 'S', 'S', 'S', 'S', 'C', 'C', '

<div style="
    background-color: #2C3E50; 
    color: #ECF0F1; 
    font-size: 28px; 
    text-align: center; 
    padding: 20px; 
    border-radius: 15px; 
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.2); 
    width: 100%; 
    margin: auto; 
    font-family: Times New Roman, sans-serif;">
    1.4 Guided Introduction: The Evaluation.

</div>

The last step in this section is to evaluate our model on 20 random questions! We'll use simple accuracy.

You should have an accuracy between 0.9 and 1.

<font color='red'>TODO: Follow the instructions in the comments of the cell below.</font>

<font color='green'>BONUS: Investigate the errors and improve our prompt/parsing to solve them.</font>


In [None]:
# Generate 20 random questions
num_questions = 20
questions = generate_random_question(generate_filter_question, df, k=num_questions)

# Iterate over the questions to format the prompt, generate an answer and execute it.
correct = 0
for entry in tqdm(questions, desc="Evaluating model"):
    question = entry["question"]
    gold_answer = entry["answer"]  # The expected correct answer

    # Generate the answer from the model
    prompt = example_prompt_template.format(
        var_name="df",
        df_info=df.head(5).to_string(index=False),
        user_question=question,
    )

    generated_answer = generate_answer(prompt)

    # Execute the generated answer and check if it's correct
    if exec_answer(generated_answer, gold_answer):
        correct += 1

# Report the Accuracy
accuracy = correct / num_questions

print("Acc: ", accuracy)


['male']
71     5
86     1
138    0
156    0
208    0
220    0
266    4
282    0
329    0
333    2
504    0
574    0
746    1
764    0
791    0
841    0
853    0
Name: SibSp, dtype: int64
Acc:  1.0
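Since the same evaluation loop is reused for every question type and dataset, it can be factored into a helper. The sketch below is one possible shape, under the assumption that the prompt builder, the generator and the checker are passed in as callables; `evaluate` and the stubs that stand in for the model are hypothetical.

```python
def evaluate(questions, make_prompt, generate, check):
    """Return the fraction of questions whose generated answer passes `check`."""
    if not questions:
        raise ValueError("At least one question is required.")
    correct = 0
    for entry in questions:
        prompt = make_prompt(entry["question"])
        answer = generate(prompt)
        if check(answer, entry["answer"]):
            correct += 1
    return correct / len(questions)


# Stubs standing in for the model: canned answers keyed by prompt.
canned = {
    "Question: A": "output = 1",
    "Question: B": "output = 2",
    "Question: C": "output = 0",
}
toy_questions = [
    {"question": "A", "answer": 1},
    {"question": "B", "answer": 2},
    {"question": "C", "answer": 3},  # the canned answer for C is wrong on purpose
]


def toy_check(answer, gold):
    namespace = {}
    exec(answer, namespace)
    return namespace.get("output") == gold


accuracy = evaluate(
    toy_questions,
    make_prompt=lambda question: f"Question: {question}",
    generate=canned.__getitem__,
    check=toy_check,
)
print("Acc: ", accuracy)
```

In the notebook, the stubs would be replaced by a function wrapping `example_prompt_template.format(...)`, by `generate_answer` and by `exec_answer`.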


<div style="
    background-color: #2C3E50; 
    color: #ECF0F1; 
    font-size: 28px; 
    text-align: center; 
    padding: 20px; 
    border-radius: 15px; 
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.2); 
    width: 100%; 
    margin: auto; 
    font-family: Times New Roman, sans-serif;">
    2. More Questions.

</div>

Now it's your turn to imagine a new type of question (e.g. "How many ..."). Implement a function that generates this new type of question. Verify that our previous code works with your new questions, then evaluate the model on them.

<font color='red'>TODO: Generate **AT LEAST ONE** new type of question and report the accuracy on this new question type.</font>


In [11]:
def generate_count_question(df):
    # Question template
    question_template = (
        "How many rows have `{filter_column}` {operator} `{filter_value}`?"
    )

    columns = list(df.columns)
    if len(columns) == 0:
        raise ValueError(
            "DataFrame must have at least one column to create a question."
        )

    # Randomly pick a column to filter
    filter_column = random.choice(columns)
    # Identify if the column is numeric
    col_dtype = df[filter_column].dtype

    # If numeric, pick from {=, <, >}; otherwise only '='
    if np.issubdtype(col_dtype, np.number):
        possible_operators = ["=", "<", ">"]
    else:
        possible_operators = ["="]

    operator = random.choice(possible_operators)

    # Get valid (non-NaN) values
    valid_values = df[filter_column].dropna().unique()
    if len(valid_values) == 0:
        raise ValueError(
            f"No valid (non-NaN) values found in column '{filter_column}' for filtering."
        )

    # Pick a random value for the filter
    filter_value = random.choice(valid_values)

    # Convert filter_value to a numeric type if the column is numeric
    # (to avoid issues when comparing strings with < or >)
    if np.issubdtype(col_dtype, np.number):
        filter_value = float(filter_value)  # ensures numeric comparison is valid

    # Apply the chosen operator for filtering
    if operator == "=":
        filtered_df = df[df[filter_column] == filter_value]
    elif operator == "<":
        filtered_df = df[df[filter_column] < filter_value]
    else:
        filtered_df = df[df[filter_column] > filter_value]

    # Count the filtered rows
    answer = len(filtered_df)

    # Format the question
    question = question_template.format(
        filter_column=filter_column, operator=operator, filter_value=filter_value
    )

    return {"question": question, "answer": answer}


print(generate_random_question(generate_count_question, df, k=5))

[{'question': 'How many rows have `Embarked` = `S`?', 'answer': 554}, {'question': 'How many rows have `Sex` = `male`?', 'answer': 453}, {'question': 'How many rows have `Pclass` > `3.0`?', 'answer': 0}, {'question': 'How many rows have `Embarked` = `Q`?', 'answer': 28}, {'question': 'How many rows have `SibSp` = `1.0`?', 'answer': 183}]
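The generated questions above show a side effect of casting numeric filter values to `float`: integer columns are rendered as `3.0` or `1.0`. If that is undesirable, whole-number floats can be converted back before formatting. The helper below is a hypothetical sketch of that idea and does not change the computed answer.

```python
def format_filter_value(value):
    """Render whole-number floats without the trailing `.0`; leave other values unchanged."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


question_template = "How many rows have `{filter_column}` {operator} `{filter_value}`?"
question = question_template.format(
    filter_column="Pclass",
    operator=">",
    filter_value=format_filter_value(3.0),
)
print(question)
```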


In [None]:
# Generate 20 random questions
num_questions = 20
questions = generate_random_question(generate_count_question, df, k=num_questions)

# Iterate over the questions to format the prompt, generate an answer and execute it.
correct = 0
for entry in tqdm(questions, desc="Evaluating model"):
    question = entry["question"]
    gold_answer = entry["answer"]  # The expected correct answer

    # Generate the answer from the model
    prompt = example_prompt_template.format(
        var_name="df",
        df_info=df.head(5).to_string(index=False),
        user_question=question,
    )

    generated_answer = generate_answer(prompt)

    # Execute the generated answer and check if it's correct
    if exec_answer(generated_answer, gold_answer):
        correct += 1

# Report the Accuracy
accuracy = correct / num_questions

print("Acc: ", accuracy)


Acc:  1.0


<div style="
    background-color: #2C3E50; 
    color: #ECF0F1; 
    font-size: 28px; 
    text-align: center; 
    padding: 20px; 
    border-radius: 15px; 
    box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.2); 
    width: 100%; 
    margin: auto; 
    font-family: Times New Roman, sans-serif;">
    3. More datasets.

</div>

Below we load a new dataset: "adult_income_dataset".

<font color='red'>TODO: Evaluate our questions on this new dataset. Report the accuracy. Comment on any differences.</font>

<font color='green'>BONUS: Try to find a prompt that answers this question: What is the mean salary of Titanic survivors based on the adult dataset?</font>

In [None]:
adult = pd.read_csv("hf://datasets/meghana/adult_income_dataset/adult.csv")
adult.info()

titanic = df.copy()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [14]:
titanic.head(5)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


In [15]:
adult.head(5)

|   | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


In [16]:
# check values in the column income in adult
print(adult["income"].unique())

['<=50K' '>50K']


<font color='red'>TODO: Evaluate our questions on this new dataset. Report the accuracy. Comment on any differences.</font>

In [None]:
# Generate 20 random questions
num_questions = 20
questions = generate_random_question(generate_filter_question, adult, k=num_questions)

# Iterate over the questions to format the prompt, generate an answer and execute it.
correct = 0
for entry in tqdm(questions, desc="Evaluating model"):
    question = entry["question"]
    gold_answer = entry["answer"]  # The expected correct answer

    # Generate the answer from the model

    prompt = example_prompt_template.format(
        var_name="adult",
        df_info=adult.head(5).to_string(index=False),
        user_question=question,
    )

    generated_answer = generate_answer(prompt)

    # Execute the generated answer and check if it's correct
    if exec_answer(generated_answer, gold_answer):
        correct += 1

# Report the Accuracy
accuracy = correct / num_questions

print("Acc: ", accuracy)


['White', 'Black', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'Black', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'Black', 'Black', 'White', 'White', 'White', 'White', 'White', 'Asian-Pac-Islander', 'Black', 'White', 'White', 'White', 'Black', 'Black', 'White', 'White', 'White', 'White', 'Black', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'Black', 'Asian-Pac-Islander', 'Black', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'White', 'Black', 'White', 'White', 'White', 'Asian-Pac-Islander', 'White', 'Black', 'White', 'White', 'White', 'White', 'Asian-Pac-Islander', 'White', 'White', 'White', 'Asian-Pac-Islander', 'White', 'White'

In [None]:
# Generate 20 random questions
num_questions = 20
questions = generate_random_question(generate_count_question, adult, k=num_questions)

# Iterate over the questions to format the prompt, generate an answer and execute it.
correct = 0
for entry in tqdm(questions, desc="Evaluating model"):
    question = entry["question"]
    gold_answer = entry["answer"]  # The expected correct answer

    # Generate the answer from the model

    prompt = example_prompt_template.format(
        var_name="adult",
        df_info=adult.head(5).to_string(index=False),
        user_question=question,
    )

    generated_answer = generate_answer(prompt)

    # Execute the generated answer and check if it's correct
    if exec_answer(generated_answer, gold_answer):
        correct += 1

# Report the Accuracy
accuracy = correct / num_questions

print("Acc: ", accuracy)


Error during execution: The code did not define the variable `output`.
Error during execution: The code did not define the variable `output`.
Acc:  0.9
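The two error lines above show that a single accuracy number hides why an answer failed. A small tally of failure reasons makes such runs easier to investigate. The sketch below is hypothetical (`classify_run` and the toy snippets are illustrations, not the notebook's code) and separates crashes, a missing `output` variable and wrong values.

```python
from collections import Counter


def classify_run(code, gold):
    """Execute `code` and label the outcome instead of returning only True/False."""
    namespace = {}
    try:
        exec(code, namespace)
    except Exception as error:
        return f"exception: {type(error).__name__}"
    if "output" not in namespace:
        return "missing output"
    return "correct" if namespace["output"] == gold else "wrong value"


# Toy generated snippets (hypothetical) with their expected answers.
runs = [
    ("output = 2 + 2", 4),
    ("result = 4", 4),      # answer stored under the wrong name
    ("output = 1 / 0", 4),  # crashes at execution time
    ("output = 5", 4),
]
tally = Counter(classify_run(code, gold) for code, gold in runs)
print(tally)
```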


<font color='red'>Comments on the differences</font>

<h2 style="
    font-family: 'Times New Roman', Times, serif; 
    color: #cfcfcf; 
    font-size: 22px; 
    font-weight: normal; 
    border-bottom: 1px solid #ddd; 
    padding-bottom: 5px; 
    margin-bottom: 15px;">
    Answer
</h2>

We observed perfect accuracy (1.0) for filter questions on both the Titanic and Adult datasets. For count questions, the accuracy remained perfect (1.0) on the Titanic data but dropped slightly to 0.9 on the Adult data, i.e. 2 of the 20 questions failed. The execution log shows that both failures came from generated code that never defined the `output` variable, so they are failures to follow the output convention rather than wrong counts. Possible contributing factors are the Adult dataset’s greater variety of categorical values and the occasional presence of placeholders (such as “?”), which may make filtering and counting questions harder for the LLM to interpret. Overall, the results still demonstrate robust performance for both question types and datasets.

<font color='green'>BONUS: Try to find a prompt that answers this question: What is the mean salary of Titanic survivors based on the adult dataset?</font>

Note: We decided to use the columns age and gender/sex as the join keys for this task, since comparing salaries by these attributes makes sense given the constraints of the datasets.
Furthermore, we note that the salary (= income) is only given as one of two brackets, `'<=50K'` or `'>50K'` (see the output of `print(adult["income"].unique())` above).

In [None]:
# GET GOLD ANSWER

# Work on copies to preserve original data
titanic_copy = titanic.copy()
adult_copy = adult.copy()

titanic_copy.rename(columns=str.lower, inplace=True)
adult_copy.rename(columns=str.lower, inplace=True)

# 'gender' column in adult dataset to 'sex'
if "gender" in adult_copy.columns:
    adult_copy = adult_copy.rename(columns={"gender": "sex"})

# to lowercase for consistency
titanic_copy["sex"] = titanic_copy["sex"].str.lower()
adult_copy["sex"] = adult_copy["sex"].str.lower()

if "age" in adult_copy.columns and "age" in titanic_copy.columns:
    adult_copy["age"] = adult_copy["age"].astype(float)
    titanic_copy["age"] = titanic_copy["age"].astype(float)

    # ages to nearest integer
    adult_copy["age"] = adult_copy["age"].round().astype(int)
    titanic_copy["age"] = titanic_copy["age"].round().astype(int)

# we are only interested in survivors:
survivors = titanic_copy[titanic_copy["survived"] == 1]

# encode income in the adult dataset
income_mapping = {"<=50K": 0, ">50K": 1}
if "income" in adult_copy.columns:
    adult_copy = adult_copy.assign(
        income_encoded=adult_copy["income"].map(income_mapping)
    )

# merge operation based on sex and rounded age
merged_df = survivors.merge(adult_copy, on=["sex", "age"], how="inner")

# mean salary of matched survivors
gold_mean_salary = merged_df["income_encoded"].mean() if not merged_df.empty else None

print(
    "Estimated mean salary of Titanic survivors (based on adult dataset):",
    gold_mean_salary,
)

Estimated mean salary of Titanic survivors (based on adult dataset): 0.1632504674191538


Interpretation:

Because income is encoded as 0/1, the gold mean salary is the share of matched rows in the >50K class: only ~16.3% of the matched Titanic survivors earned more than 50K, while the remaining ~83.7% earned 50K or less (according to the adult dataset).

This result seems plausible, but it should be read with caution: the Titanic sank in 1912, so salaries from that time cannot be compared directly with the values in the adult dataset or with today's values.
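One caveat on the gold answer: merging on sex and age is a many-to-many join, so each survivor is paired with every adult row that shares the same sex and rounded age. The row-level mean used above therefore weights each survivor by the number of adult rows they match. A minimal sketch on toy data (all values and the `passenger` column are made up for illustration) contrasts this with a per-survivor average:

```python
import pandas as pd

# Two toy survivors and four toy adult rows; values are illustrative only.
survivors = pd.DataFrame(
    {"passenger": ["a", "b"], "sex": ["female", "male"], "age": [30, 40]}
)
adults = pd.DataFrame(
    {
        "sex": ["female", "female", "female", "male"],
        "age": [30, 30, 30, 40],
        "income_encoded": [1, 1, 0, 0],
    }
)

# Many-to-many join: survivor "a" matches three adult rows, "b" matches one.
merged = survivors.merge(adults, on=["sex", "age"], how="inner")

# Row-level mean (the approach used for the gold answer above).
row_level_mean = merged["income_encoded"].mean()

# Alternative: average within each survivor first, then across survivors.
per_survivor_mean = merged.groupby("passenger")["income_encoded"].mean().mean()

print(len(merged), row_level_mean, per_survivor_mean)
```

Neither number is wrong; they answer slightly different questions. We keep the row-level mean as the gold answer because it matches what the prompt asks the model to compute.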

In [20]:
# code to reset dataframes (if needed) -> should be run if the code is run multiple times
adult = pd.read_csv("hf://datasets/meghana/adult_income_dataset/adult.csv")
titanic = df.copy()

example_prompt_template_new = """Given access to two pandas dataframes named "adult" and "titanic", write the Python code to answer the user's question.
The answer should be stored in a variable named "output". Don't put print statements in the code.
Ensure the merge operation is performed correctly based on 'sex' and 'age'.
- In the "adult" dataframe, 'sex' is mapped from 'gender'.
- In the "titanic" dataframe, 'sex' is mapped from 'Sex'.
- 'age' should be rounded to the nearest integer in both dataframes.
Please note that column names in the "titanic" dataset are capitalized, while column names in the "adult" dataset are lowercase.
Income in the "adult" dataset is given as either '<=50K' or '>50K' and should be mapped to numerical values (0 for '<=50K' and 1 for '>50K') before merging.
Preprocess the "adult" dataset **before merging** by renaming 'gender' to 'sex' and rounding 'age' to the nearest integer.
Preprocess the "titanic" dataset by renaming 'Sex' to 'sex' and rounding 'age' to the nearest integer to ensure consistent column names for merging.
After merging, filter Titanic survivors and calculate their mean income.
/*
"{var_name_1}.head(5).to_string(index=False)" as follows:
{df_info_1}
*/

/*
"{var_name_2}.head(5).to_string(index=False)" as follows:
{df_info_2}
*/

Question: {user_question}
"""

prompt = example_prompt_template_new.format(
    var_name_1="adult",
    df_info_1=adult.head(5).to_string(index=False),
    var_name_2="titanic",
    df_info_2=titanic.head(5).to_string(index=False),
    user_question="What is the mean salary of titanic surviror based on adult dataset?",
)

generated_answer = generate_answer(prompt)

print(generated_answer)

# Execute the generated answer and check if it's correct
exec_answer(generated_answer, gold_mean_salary)

```python
import pandas as pd

# Preprocess the adult dataset
adult['sex'] = adult['gender'].str.lower()
adult['age'] = adult['age'].round()
adult['income'] = adult['income'].map({'<=50K': 0, '>50K': 1})

# Preprocess the titanic dataset
titanic['sex'] = titanic['Sex'].str.lower()
titanic['age'] = titanic['Age'].round()

# Merge the datasets on 'sex' and 'age'
merged_df = pd.merge(titanic, adult, on=['sex', 'age'], how='inner')

# Filter Titanic survivors and calculate their mean income
survivors = merged_df[merged_df['Survived'] == 1]
mean_income = survivors['income'].mean()

output = mean_income
```


True

The result of the above code is "True", indicating that the model successfully generated code that answers the bonus question ("What is the mean salary of Titanic survivors based on the adult dataset?").

The cell above also prints the code the LLM generated, which is reproduced here:

```python
import pandas as pd

# Preprocess the adult dataset
adult['sex'] = adult['gender'].str.lower()
adult['age'] = adult['age'].round()
adult['income'] = adult['income'].map({'<=50K': 0, '>50K': 1})

# Preprocess the titanic dataset
titanic['sex'] = titanic['Sex'].str.lower()
titanic['age'] = titanic['Age'].round()

# Merge the datasets on 'sex' and 'age'
merged_df = pd.merge(titanic, adult, on=['sex', 'age'], how='inner')

# Filter Titanic survivors and calculate their mean income
survivors = merged_df[merged_df['Survived'] == 1]
mean_income = survivors['income'].mean()

output = mean_income
```
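For a floating-point result like this one, the comparison with the gold answer should tolerate small rounding differences. The sketch below shows one way such a check could work. It is not the notebook's `exec_answer`; the helpers `extract_code` and `run_and_check` are hypothetical names introduced for illustration, and we assume the generated answer stores its result in a variable named `output`, as the prompt requests.

```python
import math
import re

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable


def extract_code(answer: str) -> str:
    """Return the body of the first fenced code block, or the text itself."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, answer, flags=re.DOTALL)
    return match.group(1) if match else answer


def run_and_check(answer: str, gold: float, env: dict) -> bool:
    """Execute generated code on a copy of env and compare `output` to gold."""
    namespace = dict(env)  # copy so the caller's variables are not modified
    try:
        exec(extract_code(answer), namespace)
        result = float(namespace["output"])
    except Exception:
        return False  # crash, missing `output`, or non-numeric result
    return math.isclose(result, gold, rel_tol=1e-9, abs_tol=1e-12)


# Usage on a toy answer (not actual model output).
demo_answer = f"{FENCE}python\noutput = sum(values) / len(values)\n{FENCE}"
ok = run_and_check(demo_answer, 2.0, {"values": [1, 2, 3]})
print(ok)
```

Note that `exec` runs arbitrary code, so generated answers should only be executed in an environment where that is acceptable, such as a throwaway notebook session.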