# Lab 4: TableGPT

In this lab, we'll discover the power of code generation models through TableGPT2. The aim is to see how the model can be used in data analysis.

The notebook is divided into four sections:

0. Installation: This section is dedicated to module installation, model loading and data loading.
1. Guided introduction: Together, we'll discover how to use and evaluate TableGPT2.
2. More questions: You'll need to add at least one new question type to our simple evaluation system.
3. More data sets: You'll need to implement a question with multiple datasets.

IMPORTANT:

- You must work in pairs. You must submit **ONLY ONE NOTEBOOK** for each pair.
- Do not share your work with other pairs.
- You should not use Copilot, ChatGPT or similar tools. At the very least, remove the prompt ...
- <font color='red'>All the things you need to do are indicated in red.</font>

<font color='red'>**FIRST QUESTION:** What are the specificities of the TableGPT2 model?</font> https://huggingface.co/tablegpt/TableGPT2-7B


TableGPT2-7B is a large-scale decoder designed for data-intensive tasks, particularly those involving tabular data, such as business intelligence (BI), automated analysis, and database applications.

It is based on the Qwen2.5 architecture and features specialized encoding to process structured data from tables. The model supports Chinese as its primary language and offers strong performance in coding, data interpretation, and BI-focused query answering.

With 7 billion parameters, it has been trained on multimodal BI data, including 86 billion tokens and nearly 600,000 tables. TableGPT2-7B is open-sourced as a standalone decoder, with plans to release a specialized encoder for tighter integration with DeepSpeed and vLLM in the future.

## 0. Setup


In [1]:
# !pip install transformers datasets bitsandbytes accelerate

In [2]:
from transformers import (
    BitsAndBytesConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    GenerationConfig,
)

import pandas as pd
import torch

In [3]:
llm_name = "tablegpt/TableGPT2-7B"

# We want to use 4bit quantization to save memory
quantization_config = BitsAndBytesConfig(load_in_8bit=False, load_in_4bit=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(llm_name, padding_side="left")
# Prevent some transformers specific issues.
tokenizer.use_default_system_prompt = False
tokenizer.pad_token_id = tokenizer.eos_token_id

# Load LLM.
llm = AutoModelForCausalLM.from_pretrained(
    llm_name,
    quantization_config=quantization_config,
    device_map="cuda",  # load all the model layers on GPU 0
    torch_dtype=torch.bfloat16,  # float precision
)
# Set LLM on eval mode.
llm.eval()

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Qwen2ForCausalLM(
  (model): Qwen2Model(
    (embed_tokens): Embedding(152064, 3584)
    (layers): ModuleList(
      (0-27): 28 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear4bit(in_features=3584, out_features=3584, bias=True)
          (k_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (v_proj): Linear4bit(in_features=3584, out_features=512, bias=True)
          (o_proj): Linear4bit(in_features=3584, out_features=3584, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (up_proj): Linear4bit(in_features=3584, out_features=18944, bias=False)
          (down_proj): Linear4bit(in_features=18944, out_features=3584, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((3584,), 

In [4]:
generation_config = GenerationConfig(
    max_new_tokens=512,
    do_sample=False,
    # do_sample=True,
    # temperature=.7,
    # top_p=.8,
    # top_k=20,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

In [5]:
df = pd.read_csv("hf://datasets/phihung/titanic/train.csv")
df = df.drop("Cabin", axis=1).dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 712 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  712 non-null    int64  
 1   Survived     712 non-null    int64  
 2   Pclass       712 non-null    int64  
 3   Name         712 non-null    object 
 4   Sex          712 non-null    object 
 5   Age          712 non-null    float64
 6   SibSp        712 non-null    int64  
 7   Parch        712 non-null    int64  
 8   Ticket       712 non-null    object 
 9   Fare         712 non-null    float64
 10  Embarked     712 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 66.8+ KB


## 1.1 Guided Introduction: The Model.

Below is an example of a prompt that can be used with TableGPT2.

```
Given access to several pandas dataframes, write the Python code to answer the user's question.
The answer should be store in a variable named "output".

/*
"df.head(5).to_string(index=False)" as follows:
 PassengerId  Survived  Pclass                                                Name    Sex  Age  SibSp  Parch           Ticket    Fare Embarked
           1         0       3                             Braund, Mr. Owen Harris   male 22.0      1      0        A/5 21171  7.2500        S
           2         1       1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0      1      0         PC 17599 71.2833        C
           3         1       3                              Heikkinen, Miss. Laina female 26.0      0      0 STON/O2. 3101282  7.9250        S
           4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0      1      0           113803 53.1000        S
           5         0       3                            Allen, Mr. William Henry   male 35.0      0      0           373450  8.0500        S
*/

Question: How many child survive? (under 18)
```

The prompt is divided into 3 parts:

1. The global instruction, which is to write Python code that answers a question about a specific dataset.
2. The header of the given dataset: the first 5 rows of the Titanic dataset.
3. The question to answer: "How many child survive? (under 18)"

First, we will implement a function that generates an answer for this prompt.

<font color='red'>TODO: Fill in the `generate_answer` function following the comments inside.</font>


In [6]:
example_prompt_template = """Given access to several pandas dataframes, write the Python code to answer the user's question.
The answer should be store in a variable named "output". Make SURE the answer isn't be a dataframe but a single value (int, float, str, etc.).

/*
"{var_name}.head(5).to_string(index=False)" as follows:
{df_info}
*/

Question: {user_question}
"""


def generate_answer(prompt, llm=llm, generation_config=generation_config):
    # Create turns with the given prompt.
    chat = [
        {"role": "user", "content": prompt},
    ]

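    # Apply the chat template with the tokenizer. Be careful to return pt tensors on the same device as `llm`.
    # `add_generation_prompt=True` appends the assistant turn header so the model starts its own reply.
    chat_encoded = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(llm.device)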

    # Generate with llm using the given generation config.
    llm_outputs_ids = llm.generate(
        input_ids=chat_encoded,
        generation_config=generation_config,
    )[0]

    # Decode and select the answer to return.
    answer = tokenizer.decode(
        llm_outputs_ids[chat_encoded.size(1) :], skip_special_tokens=True
    )
    return answer


prompt = example_prompt_template.format(
    var_name="df",
    df_info=df.head(5).to_string(index=False),
    user_question="How many child survive? (under 18)",
)

answer = generate_answer(prompt)

print(prompt)
print("\n*****\n")
print(answer)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Given access to several pandas dataframes, write the Python code to answer the user's question.
The answer should be store in a variable named "output". Make SURE the answer isn't be a dataframe but a single value (int, float, str, etc.).

/*
"df.head(5).to_string(index=False)" as follows:
 PassengerId  Survived  Pclass                                                Name    Sex  Age  SibSp  Parch           Ticket    Fare Embarked
           1         0       3                             Braund, Mr. Owen Harris   male 22.0      1      0        A/5 21171  7.2500        S
           2         1       1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0      1      0         PC 17599 71.2833        C
           3         1       3                              Heikkinen, Miss. Laina female 26.0      0      0 STON/O2. 3101282  7.9250        S
           4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0      1      0           113803 53.1000    

## 1.2 Guided Introduction: The Answer.

As you can see, the model answers with some generated code.

````
Python code:
```python
# Filter the dataframe to include only passengers under the age of 18
children = df[df['Age'] < 18]

# Count the number of children who survived
child_survivors = children[children['Survived'] == 1]

# Save the answer in the variable output
output = len(child_survivors)
```
````

So we need to execute it, but there are some difficulties:

1. Sometimes the LLM answers with \`\`\`python ... \`\`\`, and sometimes it answers directly with the code. We need to handle both cases.
2. We need to recover the variable `output` from the execution.
3. We need to evaluate both single values and lists of values.
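
For point 3, comparing the produced value with the expected one can be sketched as below. This is only an illustration under assumptions: `answers_match` and its `rel_tol` parameter are hypothetical names that are not part of the lab code, and the small tolerance for floats is a design choice, not a requirement of the assignment.

```python
import math


def answers_match(output, gold, rel_tol=1e-9):
    """Compare a produced answer with the expected one (scalars or lists)."""
    # Compare sequences element by element, requiring equal length.
    if isinstance(output, (list, tuple)) and isinstance(gold, (list, tuple)):
        return len(output) == len(gold) and all(
            answers_match(o, g, rel_tol) for o, g in zip(output, gold)
        )
    # Tolerate tiny floating-point differences between plain Python numbers.
    if isinstance(output, (int, float)) and isinstance(gold, (int, float)):
        return math.isclose(output, gold, rel_tol=rel_tol)
    # Everything else (strings, mixed types) falls back to plain equality.
    return output == gold


print(answers_match(0.1 + 0.2, 0.3))
print(answers_match([1, 2.0], [1, 2]))
print(answers_match("S", "C"))
```

A tolerance matters mostly for float answers such as sums or means, where two correct computations can differ in the last digits.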

Next, we will implement a function that executes the generated code and compares its result with the expected answer.

<font color='red'>TODO: Fill in the `exec_answer` function following the comments inside.</font>


In [7]:
import contextlib
import io
import re


def exec_answer(answer, gold, context):
    # Extract the code from the answer. Be careful: the code is not always wrapped in ``` ```.
    # The language tag is optional; if no complete fenced block is found, use the raw answer.
    match = re.search(r"```(?:python)?\n(.*?)\n```", answer, re.DOTALL)
    code = match.group(1) if match else answer

    # Execute the code, https://docs.python.org/3/library/functions.html#exec
    # If the code works: return True or False based on output == gold.
    # If the code fails: print the error and return False.
    try:
        # Suppress anything the generated code prints.
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, context)

        # Prefer the "output" variable requested in the prompt; otherwise fall
        # back to the most recently created variable in the context.
        output = context.get("output", next(reversed(context.values())))

        if isinstance(output, (pd.DataFrame, pd.Series)):
            raise TypeError("Output is a DataFrame or Series, please return a scalar.")

        return bool(output == gold)
    except Exception as e:
        print(e)
        return False


print(exec_answer(answer, 61, {"df": df}))

True
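
The extraction step of `exec_answer` can also be isolated in a small helper, which makes it easy to check both answer styles without running the model. The sketch below is an assumption-laden illustration: `extract_code` and `FENCE_RE` are hypothetical names, and accepting fences without a language tag is an extra case that the lab does not explicitly ask for.

```python
import re

# Matches a fenced block with an optional language tag, e.g. ```python or ```.
FENCE_RE = re.compile(r"```[a-zA-Z]*\n(.*?)```", re.DOTALL)


def extract_code(answer: str) -> str:
    """Return the code inside the first fenced block, or the raw answer if unfenced."""
    match = FENCE_RE.search(answer)
    if match:
        return match.group(1).strip()
    return answer.strip()


fenced = "Python code:\n```python\noutput = 1 + 1\n```"
bare = "output = 1 + 1"
print(extract_code(fenced))
print(extract_code(bare))
```

Both calls yield the same code string, so the execution step does not need to know which style the model used.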


## 1.3 Guided Introduction: The Question.

Now we want to automatically generate questions to evaluate the performance of our model. There are benchmarks on this subject, but here we want to practice code by generating the questions ourselves.

We will generate some basic filter questions.

<font color='red'>TODO: Fill in the `generate_filter_question` function following the comments inside.</font>


In [8]:
import random, operator
from tqdm import tqdm

categorical_columns = ["Sex", "Pclass", "Embarked", "Survived"]
numerical_columns = ["Age", "Fare", "SibSp", "Parch"]
numerical_ops = ["<", ">", "==", "!=", "<=", ">="]
operators = {
    "<": operator.lt,
    ">": operator.gt,
    "==": operator.eq,
    "!=": operator.ne,
    "<=": operator.le,
    ">=": operator.ge,
}


def generate_random_question(
    generate_function, df, categorical_columns, numerical_columns, k, seed=42
):
    random.seed(seed)
    questions = []
    for _ in tqdm(range(k)):
        question = generate_function(df, categorical_columns, numerical_columns)
        questions.append(question)
    return questions


def generate_filter_question(df, categorical_columns, numerical_columns):
    # Get a random filter column and a random value inside it. Avoid NaN values.
    all_columns = [*categorical_columns, *numerical_columns]
    filter_col = random.choice(all_columns)
    filter_val = random.choice(df[filter_col].dropna().unique())

    # Get a random target column (different from the filter column) and a random value inside it.
    target_col = random.choice([col for col in all_columns if col != filter_col])
    target_val = random.choice(df[target_col].dropna().unique())

    # Comparison operators, only used when the corresponding column is numerical.
    filter_op = random.choice(numerical_ops)
    target_op = random.choice(numerical_ops)

    # Create a question template that take a target column, a filter column and a filter value
    # WE ARE ALSO ADDING A TARGET VALUE, OTHERWISE THE ANSWER IS IMPOSSIBLE TO COMPUTE
    template = (
        "Generate a question based on a tabular dataset with this format:\n"
        "{df_info}\n"
        "The question should be about the number of rows where the value in the column '{target_col}' is {target_cond} "
        "and the value in the column '{filter_col}' is {filter_cond}."
    )

    # Determine the conditions for target_col and filter_col
    target_cond = (
        f"'{target_val}'"
        if target_col in categorical_columns
        else f"{target_op} '{target_val}'"
    )
    filter_cond = (
        f"'{filter_val}'"
        if filter_col in categorical_columns
        else f"{filter_op} '{filter_val}'"
    )

    content = template.format(
        df_info=df.head(5).to_string(index=False),
        target_col=target_col,
        target_cond=target_cond,
        filter_col=filter_col,
        filter_cond=filter_cond,
    )

    chat = [
        {"role": "user", "content": content},
    ]
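    # `add_generation_prompt=True` appends the assistant turn header so the model starts its own reply.
    chat_encoded = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(llm.device)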

    llm_outputs_ids = llm.generate(
        input_ids=chat_encoded,
        generation_config=generation_config,
    )[0]
    question = tokenizer.decode(
        llm_outputs_ids[chat_encoded.size(1) :], skip_special_tokens=True
    )

    # Compute the correct answer: categorical columns are compared with ==,
    # numerical columns with the randomly chosen operator.
    target_mask = (
        df[target_col] == target_val
        if target_col in categorical_columns
        else operators[target_op](df[target_col], target_val)
    )
    filter_mask = (
        df[filter_col] == filter_val
        if filter_col in categorical_columns
        else operators[filter_op](df[filter_col], filter_val)
    )
    correct_answer = int((target_mask & filter_mask).sum())

    # return formatted question and associated answer in a dict {"question":[question], "answer":[answer]}
    return {"question": question, "answer": correct_answer}


generate_random_question(
    generate_filter_question, df, categorical_columns, numerical_columns, 5
)

  0%|          | 0/5 [00:00<?, ?it/s]

100%|██████████| 5/5 [00:13<00:00,  2.69s/it]


[{'question': "What is the number of rows where the value in the column 'SibSp' is greater than 3 and the value in the column 'Pclass' is 3?",
  'answer': 23},
 {'question': "How many rows in the dataset have the value 'male' in the 'Sex' column and 'Q' in the 'Embarked' column?",
  'answer': 16},
 {'question': "How many rows in the dataset have a 'Pclass' value of '3' and a 'Sex' value of 'male'?",
  'answer': 253},
 {'question': "How many rows in the dataset have a 'SibSp' value less than or equal to 5 and a 'Sex' value of 'male'?",
  'answer': 453},
 {'question': 'How many passengers survived with no siblings or spouses aboard?',
  'answer': 0}]

To generate the questions we used a hybrid approach: we generate the question with an LLM, but we constrain the type of question generated by giving it an instruction built from a template.
Since we need to compute the true answer to evaluate the model, we need to know in advance which columns we are using (as filter and target) and which operations we are performing on them.

The main steps to generate a question are summarized below:

1. Define the columns and operations: Determine the columns and operations that will be used in the question. This includes identifying the filter column and the target column, as well as the operation to be performed on them.

2. Generate the question template: Create a question template that includes placeholders for the filter column, target column, and operation. This template will be used to generate the final question.

3. Generate the question prompt: Use the question template as a prompt to generate the question.

4. Generate the true answer: Calculate the true answer to the question based on the filter column, target column, and operation. This will be used to evaluate the model's performance.
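
The four steps above can be sketched without the model or the real dataset. Everything in this sketch is an assumption made for illustration: the toy `rows`, the `make_filter_question` helper and the fixed question wording are hypothetical, and the filled template is used directly as the question instead of being sent to the LLM for rephrasing.

```python
import operator
import random

OPERATORS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}

# Toy rows standing in for the real dataset (values are made up).
rows = [
    {"Sex": "male", "Age": 22.0},
    {"Sex": "female", "Age": 38.0},
    {"Sex": "female", "Age": 26.0},
    {"Sex": "male", "Age": 35.0},
]


def make_filter_question(rows, rng):
    # Step 1: choose the column values and the operation.
    sex = rng.choice(sorted({r["Sex"] for r in rows}))
    age = rng.choice(sorted({r["Age"] for r in rows}))
    op = rng.choice(sorted(OPERATORS))

    # Steps 2-3: fill the question template with the chosen values.
    question = f"How many rows have 'Sex' equal to '{sex}' and 'Age' {op} {age}?"

    # Step 4: compute the true answer from the same values and operation.
    answer = sum(
        1 for r in rows if r["Sex"] == sex and OPERATORS[op](r["Age"], age)
    )
    return {"question": question, "answer": answer}


q = make_filter_question(rows, random.Random(42))
print(q)
```

Because the answer is computed from the sampled values rather than from the question text, it stays correct only as long as the final wording of the question still expresses the same conditions.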

## 1.4 Guided Introduction: The Evaluation.

The last step in this section is to evaluate our model on 20 random questions! We'll use simple accuracy.

You should have an accuracy between 0.9 and 1.

<font color='red'>TODO: Follow the instructions in the comments of the cell below.</font>

<font color='green'>BONUS: Investigate the errors and improve our prompt/parsing to solve them.</font>


In [9]:
# Generate 20 random question
print("Generating random questions")
questions = generate_random_question(
    generate_filter_question, df, categorical_columns, numerical_columns, 20
)
sum_correct = 0
incorrect_answers = []

# Iterate over question to format prompt, generate answer and execute answer.
print("\nStarting the evaluation")
for q in tqdm(questions):
    prompt = example_prompt_template.format(
        var_name="df",
        df_info=df.head(5).to_string(index=False),
        user_question=q["question"],
    )
    answer = generate_answer(prompt)
    res = exec_answer(answer, q["answer"], {"df": df})
    if res == False:
        incorrect_answers.append((q, answer))

    sum_correct += res

# Report the Accuracy
print(f"Accuracy: {sum_correct}/{len(questions)}")

Generating random questions


100%|██████████| 20/20 [01:02<00:00,  3.13s/it]



Starting the evaluation


100%|██████████| 20/20 [01:44<00:00,  5.21s/it]

Accuracy: 19/20





Looking at the results, we can see that the model is able to generate code that can answer simple questions.

We obtained this high accuracy by fixing some of the issues we encountered during the guided introduction.
In particular, the model tended to:
- output a dataframe instead of a number (fixed by specifying in the answer prompt that the output should be a single value)
- print the result instead of assigning it to the "output" variable (fixed by taking the last assigned variable as the output)
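
The second fix relies on the fact that the dictionary passed to `exec` keeps insertion order. The sketch below illustrates that behavior and its main limitation; `run_generated_code` is a hypothetical helper written for this illustration, not the function used in the notebook.

```python
def run_generated_code(code, context):
    """Execute `code` and return `output`, or the last newly created variable."""
    exec(code, context)
    if "output" in context:
        return context["output"]
    # Dicts keep insertion order, so the last value belongs to the most
    # recently *created* name (re-assigning an existing name does not move it).
    return next(reversed(context.values()))


# The model followed the instruction and assigned "output".
print(run_generated_code("output = 2 + 3", {}))

# The model used another variable name: the fallback still finds the value.
print(run_generated_code("total = 2 + 3", {}))

# Limitation: "a" is re-assigned last, but "b" was created last.
print(run_generated_code("a = 1\nb = 2\na = 10", {}))
```

If the generated code creates no variable at all, the last entry of the context is whatever was there before (including the `__builtins__` entry that `exec` adds), so the comparison with the expected answer simply fails.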

## 2. More Questions.

Now it's your turn to imagine a type of question ("How many ..."). Implement a function to generate a new type of question. Verify that our previous code works with your new question type, then evaluate it.

<font color='red'>TODO: Generate **AT LEAST ONE** new type of question and report the accuracy on this new question type.</font>


In [10]:
import random
from tqdm import tqdm


def generate_fare_question(df, categorical_columns, numerical_columns):
    # The target column is always "Fare": get a random filter column that is different from it.
    # Get a random filter value inside the filter column. Avoid NaN values.
    filter_col = random.choice(
        [col for col in [*categorical_columns, *numerical_columns] if col != "Fare"]
    )
    filter_val = random.choice(df[filter_col].dropna().unique())
    filter_op = random.choice(numerical_ops)

    # Create a question template that take a target column, a filter column and a filter value
    if filter_col in categorical_columns:
        template = (
            "Generate a question based on a tabular dataset with this format:\n"
            "{df_info}\n"
            "The question should be about the total fare for the passengers where "
            "the value in the column '{filter_col}' is '{filter_val}'."
            "Here are some examples of questions:"
            "1) What is the total fare for the passengers who survived?"
            "2) How much did the passengers who embarked in S pay in total?"
            "3) What is the total fare for male passengers?"
        )
    else:
        template = (
            "Generate a question based on a tabular dataset with this format:\n"
            "{df_info}\n"
            "The question should be about the total fare for the passengers where "
            "the value in the column '{filter_col}' is {filter_op} '{filter_val}'."
            "Here are some examples of questions:"
            "1) What is the total fare for the passengers who are older than 30?"
            "2) How much did the passengers who paid more than 50 in total?"
            "3) What is the total fare for passengers who have more than 2 siblings?"
        )

    content = template.format(
        df_info=df.head(5).to_string(index=False),
        filter_col=filter_col,
        filter_val=filter_val,
        filter_op=filter_op,
    )

    chat = [
        {"role": "user", "content": content},
    ]
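    # `add_generation_prompt=True` appends the assistant turn header so the model starts its own reply.
    chat_encoded = tokenizer.apply_chat_template(
        chat, add_generation_prompt=True, return_tensors="pt"
    ).to(llm.device)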

    llm_outputs_ids = llm.generate(
        input_ids=chat_encoded,
        generation_config=generation_config,
    )[0]
    question = tokenizer.decode(
        llm_outputs_ids[chat_encoded.size(1) :], skip_special_tokens=True
    )

    # Compute the correct answer for the given target column, filter column and filter value.
    if filter_col in categorical_columns:
        correct_answer = df[(df[filter_col] == filter_val)]["Fare"].sum()
    else:
        correct_answer = df[operators[filter_op](df[filter_col], filter_val)][
            "Fare"
        ].sum()

    # return formatted question and associated answer in a dict {"question":[question], "answer":[answer]}
    return {"question": question, "answer": correct_answer}

In [11]:
# Generate 20 random question
print("Generating random questions")
questions = generate_random_question(
    generate_fare_question, df, categorical_columns, numerical_columns, 20
)
sum_correct = 0
incorrect_answers = []

# Iterate over question to format prompt, generate answer and execute answer.
print("\nStarting the evaluation")
for q in tqdm(questions):
    prompt = example_prompt_template.format(
        var_name="df",
        df_info=df.head(5).to_string(index=False),
        user_question=q["question"],
    )
    answer = generate_answer(prompt)
    res = exec_answer(answer, q["answer"], {"df": df})
    if res == False:
        incorrect_answers.append((prompt, q["answer"], answer))

    sum_correct += res

# Report the Accuracy
print(f"Accuracy: {sum_correct}/{len(questions)}")

Generating random questions


100%|██████████| 20/20 [00:47<00:00,  2.37s/it]



Starting the evaluation


100%|██████████| 20/20 [01:33<00:00,  4.68s/it]

Accuracy: 20/20





Here we encountered another problem: sometimes the generated questions were not correct.

We fixed this by using few-shot prompting to guide the model in generating the right format of question.
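
As a side note on formatting such few-shot prompts: adjacent Python string literals are concatenated without any separator, so examples written one literal per line end up on a single line unless `\n` is added explicitly. A minimal sketch of a builder that puts each example on its own line is shown below; `build_few_shot_prompt` is a hypothetical helper, not the code used above.

```python
def build_few_shot_prompt(instruction, examples):
    """Join an instruction and numbered example questions, one per line."""
    lines = [instruction, "Here are some examples of questions:"]
    lines += [f"{i}) {example}" for i, example in enumerate(examples, start=1)]
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "The question should be about the total fare for the passengers where "
    "the value in the column 'Sex' is 'male'.",
    [
        "What is the total fare for the passengers who survived?",
        "What is the total fare for male passengers?",
    ],
)
print(prompt)
```

Whether this layout actually improves the generated questions would have to be checked by rerunning the evaluation.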

## 3. More datasets.

Below we load a new dataset: "adult_income_dataset".

<font color='red'>TODO: Evaluate our questions on this new dataset. Report the accuracy. Comment on any differences.</font>

<font color='green'>BONUS: Try to find a prompt that answers this question: What is the mean salary of Titanic survivors based on the adult dataset?</font>


In [14]:
adult = pd.read_csv("hf://datasets/meghana/adult_income_dataset/adult.csv")
adult.info()

titanic = df

categorical_columns = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "gender",
    "native-country",
    "income",
]
numerical_columns = [
    "age",
    "fnlwgt",
    "educational-num",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [15]:
# Generate 20 random question
print("Generating random questions")
questions = generate_random_question(
    generate_filter_question, adult, categorical_columns, numerical_columns, 20
)
sum_correct = 0
incorrect_answers = []

# Iterate over question to format prompt, generate answer and execute answer.
print("\nStarting the evaluation")
for q in tqdm(questions):
    prompt = example_prompt_template.format(
        var_name="adult",
        df_info=adult.head(5).to_string(index=False),
        user_question=q["question"],
    )
    answer = generate_answer(prompt)
    res = exec_answer(answer, q["answer"], {"adult": adult})
    if res == False:
        incorrect_answers.append((q, answer))

    sum_correct += res

# Report the Accuracy
print(f"Accuracy: {sum_correct}/{len(questions)}")

Generating random questions


100%|██████████| 20/20 [01:56<00:00,  5.81s/it]



Starting the evaluation


100%|██████████| 20/20 [02:10<00:00,  6.51s/it]

Accuracy: 19/20





We obtain a high accuracy on the new dataset as well.

This suggests that our question generation system is robust and can be applied to different datasets.

Regarding the **bonus question**, we found that answering the question "What is the mean salary of titanic survivors based on adult dataset." is impossible.

This is because the adult dataset doesn't contain information about the salary of individuals. It just has a feature called "income" which indicates whether an individual earns more or less than 50k.

If the salary were available, we could use the following strategy:

- Join the two datasets on the age and sex columns (we don't have the name of the passengers in the adult dataset, or a common identifier)
- Filter the joined dataset to include only the survivors
- Calculate the mean salary of the survivors
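
The strategy above can be sketched on toy records in plain Python, so that it runs without either dataset. This is purely hypothetical: the `salary` field does not exist in the adult dataset, all values are made up, and `mean_salary_of_survivors` is an illustrative name.

```python
from statistics import mean

# Toy records; the `salary` field is hypothetical (the real adult dataset
# only has a binary `income` column).
titanic_rows = [
    {"Age": 22, "Sex": "male", "Survived": 0},
    {"Age": 38, "Sex": "female", "Survived": 1},
    {"Age": 26, "Sex": "female", "Survived": 1},
]
adult_rows = [
    {"age": 38, "gender": "Female", "salary": 40000},
    {"age": 38, "gender": "Female", "salary": 60000},
    {"age": 26, "gender": "Female", "salary": 30000},
    {"age": 22, "gender": "Male", "salary": 20000},
]


def mean_salary_of_survivors(titanic_rows, adult_rows):
    # Index adult salaries by the join key (age, lower-cased gender).
    salaries_by_key = {}
    for row in adult_rows:
        key = (row["age"], row["gender"].lower())
        salaries_by_key.setdefault(key, []).append(row["salary"])

    # Keep survivors only, then collect every matching adult salary.
    matched = []
    for row in titanic_rows:
        if row["Survived"] == 1:
            key = (row["Age"], row["Sex"].lower())
            matched.extend(salaries_by_key.get(key, []))
    return mean(matched) if matched else None


print(mean_salary_of_survivors(titanic_rows, adult_rows))
```

With pandas, the same join could be expressed with `DataFrame.merge` using `left_on` and `right_on`. Note that joining on age and sex matches each survivor with every adult sharing those values, so the result describes a demographic group rather than the actual passengers.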