# 🤖 TAT-LLM: From Tables and Text to Strategic Business Actions

### Empowering Small and Medium-Sized Enterprises (SMSEs) with Decision Intelligence

This project presents **TAT-LLM**, a Language Model system capable of answering complex business questions by understanding and reasoning over tabular data and accompanying text, such as financial reports, annual disclosures, or transaction summaries.

## Technical Foundation

- Dataset: [TAT-QA](https://huggingface.co/datasets/next-tat/TAT-QA)
- Model: [Nous Hermes 2 Mistral 7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO)
- Prompt Style: 6-Step Instruction Reasoning
- Evaluation: Exact Match (EM) and F1 across question types

## Team Members
- **Bima Aristo**
- **Muhammad Fadli**
- **Rifqi Aditya**

We believe that decision-quality AI shouldn’t be exclusive to big corporations.

---


### Import libraries

In [153]:
# Standard libraries
import os
import re
import ast
import time
import json
import random
from pathlib import Path

import numpy as np
import pandas as pd
from tqdm import tqdm
from torch import float16
import torch
from datasets import load_dataset, Dataset
from evaluate import load as load_metric
import evaluate

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
    TrainerCallback,
    BitsAndBytesConfig,
)
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    PeftModel,
)

from trl import SFTTrainer, SFTConfig

## Preprocessing

### Check data pairs

In [2]:
data_dir = "data"
json_files = ["train.json", "dev.json", "test.json"]

for filename in json_files:
    file_path = os.path.join(data_dir, filename)
    with open(file_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    
    total_qa_pairs = sum(len(passage["questions"]) for passage in data)
    print(f"{filename}: {total_qa_pairs} QA pairs")

train.json: 13215 QA pairs
dev.json: 1668 QA pairs
test.json: 1669 QA pairs


### Build train dataset

In [3]:
with open("data/train.json", "r", encoding="utf-8") as f:
    raw_data = json.load(f)

flattened = []  # flatten each question-answer pair as one record
for entry in raw_data:
    table = entry.get("table", {})
    paragraphs = entry.get("paragraphs", [])
    for q in entry.get("questions", []):
        flattened.append({
            "question": str(q.get("question", "")),
            "answer": str(q.get("answer", "")), # convert list or number to string
            "answer_type": str(q.get("answer_type", "")),
            "answer_from": str(q.get("answer_from", "")),
            "rel_paragraphs": str(q.get("rel_paragraphs", [])), # make sure it's string
            "req_comparison": bool(q.get("req_comparison", False)),
            "table": str(table),    # avoid raw dicts/lists
            "paragraphs": str(paragraphs),  # avoid raw lists
        })


train_data = Dataset.from_list(flattened)

print("Number of QA pairs:", len(train_data))
print(train_data[0])

Number of QA pairs: 13215
{'question': 'What does the Weighted average actuarial assumptions consist of?', 'answer': "['Rate of inflation', 'Rate of increase in salaries', 'Discount rate']", 'answer_type': 'multi-span', 'answer_from': 'table', 'rel_paragraphs': '[]', 'req_comparison': False, 'table': "{'uid': 'e78f8b29-6085-43de-b32f-be1a68641be3', 'table': [['', '2019 %', '2018 %', '2017 %'], ['Weighted average actuarial assumptions used at 31 March1:', '', '', ''], ['Rate of inflation2', '2.9', '2.9', '3.0'], ['Rate of increase in salaries', '2.7', '2.7', '2.6'], ['Discount rate', '2.3', '2.5', '2.6']]}", 'paragraphs': "[{'uid': '62be4f5a-1693-4e6b-8bb4-0a4e1e40b409', 'order': 1, 'text': 'Actuarial assumptions'}, {'uid': 'c63e6ed5-8fe5-46e4-a02a-f923e90e8067', 'order': 2, 'text': 'The Group’s scheme liabilities are measured using the projected unit credit method using the principal actuarial assumptions set out below:'}, {'uid': 'b4093fd4-43ea-4b31-9975-13c0012a0b18', 'order': 3, 'te

### Build test dataset

In [4]:
def flatten_qa_data(json_path):
    with open(json_path, "r", encoding="utf-8") as f:
        raw_data = json.load(f)

    flattened = []
    for entry in raw_data:
        table = entry.get("table", {})
        paragraphs = entry.get("paragraphs", [])
        for q in entry.get("questions", []):
            flattened.append({
                "question": str(q.get("question", "")),
                "answer": str(q.get("answer", "")),
                "answer_type": str(q.get("answer_type", "")),
                "answer_from": str(q.get("answer_from", "")),
                "rel_paragraphs": str(q.get("rel_paragraphs", [])),
                "req_comparison": bool(q.get("req_comparison", False)),
                "table": str(table),
                "paragraphs": str(paragraphs),
            })
    return Dataset.from_list(flattened)

test_data = flatten_qa_data("data/test.json")
print("Number of test QA pairs:", len(test_data))
print(test_data[0])

Number of test QA pairs: 1669
{'question': 'What was the amount of unrecognized stock-based compensation expense related to unvested employee stock options in 2019?', 'answer': '', 'answer_type': '', 'answer_from': '', 'rel_paragraphs': '[]', 'req_comparison': False, 'table': "{'uid': 'c4b92833-5c85-4bf4-b493-bc7741d759df', 'table': [['', 'Year Ended', 'Year Ended'], ['Stock-Based Compensation by Type of Award', 'December 31, 2019', 'December 31, 2018'], ['Stock options', '$2,756', '$2,926'], ['RSUs', '955', '1,129'], ['Total stock-based compensation expense', '$3,711', '$4,055']]}", 'paragraphs': "[{'uid': '04bfbe1d-235b-4036-95c2-e49983eb9cef', 'order': 1, 'text': 'Stock-based compensation expense is included in general and administrative expense for each period as follows:'}, {'uid': '0b5304d0-849b-46ea-936a-2b9d73be07f3', 'order': 2, 'text': 'As of December 31, 2019, there was $4,801 of unrecognized stock-based compensation expense related to unvested employee stock options and $1,

### Build prompt

In [5]:
def create_prompt(table, paragraphs, question_dict, return_prompt_only=False):
    table_md = "\n".join(["| " + " | ".join(row) + " |" for row in table["table"]]) # table to markdown
    text_content = "\n".join([p["text"] for p in paragraphs])   # text paragraph

    question = question_dict.get("question", "")
    answer_type = question_dict.get("answer_type", "")
    gold_answer = question_dict.get("answer", "")
    gold_equation = question_dict.get("derivation", "N.A.") or "N.A."
    scale = question_dict.get("scale", "none") or "none"
    
    if answer_type == "arithmetic":
        question_type = "Arithmetic"
    elif answer_type == "counting":
        question_type = "Count"
    elif answer_type == "multi-span":
        question_type = "Multiple spans"
    else:
        question_type = "Single span"
    
    if isinstance(gold_answer, list):
        answer = "#".join(str(a) for a in gold_answer)
    else:
        answer = str(gold_answer)

    if question_type != "Arithmetic":
        gold_equation = "N.A."

    evidence = "[evidence goes here manually if available, e.g., numbers or key phrases]"
    action = "[action goes here — generate a short, logical recommendation based on the answer]"

    reasoning_steps = f"""Please organize the results in the following markdown table:
| step | output |
| 1 | {question_type} |
| 2 | {evidence} |
| 3 | {gold_equation} |
| 4 | {answer} |
| 5 | {scale} |
| 6 | {action} |""" if not return_prompt_only else ""

    final_answer_section = f"""
The answer is: {answer} ####
""" if not return_prompt_only else ""

    prompt = f"""### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{{question_type}}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{{evidence}}`.
3. Step 3: If `{{question_type}}` is `Arithmetic`, generate an equation in `{{equation}}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{{answer}}`.
5. Step 5: Predict the answer’s scale in `{{scale}}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{{answer}}` and `{{question_type}}`, generate a short and logical recommendation, business insight, or next action. Store it in `{{action}}`.

{reasoning_steps}
{final_answer_section}

### Table
{table_md}

### Text
{text_content}

### Question
{question}
"""
    return prompt

### Create generate action & response function

In [6]:
def generate_action(answer, question_type, question):
    """Generate business insight/recommendation based on answer"""
    if question_type == "Arithmetic":
        if isinstance(answer, (int, float)):
            if answer > 0:
                return "Consider strategies to maintain or accelerate this positive trend"
            elif answer < 0:
                return "Investigate root causes and develop mitigation strategies"
            else:
                return "Monitor for changes and prepare contingency plans"
    elif question_type == "Count":
        return f"Review if {answer} items meet target thresholds"
    else:
        return "Further analysis recommended based on this finding"

def generate_training_response(question_dict, table, paragraphs):
    """Generate the response part for training data"""
    
    # Extract evidence from derivation
    derivation = question_dict.get("derivation", "")
    evidence_numbers = re.findall(r'\d+\.?\d*', derivation)
    evidence = "#".join(evidence_numbers) if evidence_numbers else ""
    
    # If no derivation, try to extract from answer and rel_paragraphs
    if not evidence and question_dict.get("answer_from") in ["table", "text", "table-text"]:
        # This is a span-type question, use the answer itself as evidence
        answer = question_dict.get("answer", "")
        if isinstance(answer, list):
            evidence = "#".join(str(a) for a in answer)
        else:
            evidence = str(answer)

    answer_type = question_dict.get("answer_type", "")
    if answer_type == "arithmetic":
        question_type = "Arithmetic"
    elif answer_type == "counting":
        question_type = "Count"
    elif answer_type == "spans":
        question_type = "Multiple spans"
    else:
        question_type = "Single span"
    
    equation = derivation if answer_type == "arithmetic" else "N.A."

    answer = question_dict.get("answer", "")
    if isinstance(answer, list):
        answer_str = "#".join(str(a) for a in answer)
    else:
        answer_str = str(answer)

    scale = question_dict.get("scale", "none") or "none"

    action = generate_action(answer, question_type, question_dict.get("question", ""))
    
    return {
        "question_type": question_type,
        "evidence": evidence,
        "equation": equation,
        "answer": answer_str,
        "scale": scale,
        "action": action
    }

### Create prompt with response

In [7]:
# Generate the training response data
response_data = generate_training_response(
    question_dict={
        "question": train_data[0]["question"],
        "answer": ast.literal_eval(train_data[0]["answer"]),
        "answer_type": train_data[0]["answer_type"],
        "answer_from": train_data[0]["answer_from"],
        "derivation": "44.1-56.7",
        "scale": train_data[0].get("scale", "none")
    },
    table=ast.literal_eval(train_data[0]["table"]),
    paragraphs=ast.literal_eval(train_data[0]["paragraphs"])
)

# Create the prompt with the generated response
prompt_output = create_prompt(
    table=ast.literal_eval(train_data[0]["table"]),
    paragraphs=ast.literal_eval(train_data[0]["paragraphs"]),
    question_dict={
        "question": train_data[0]["question"],
        "answer": response_data["answer"],
        "answer_type": train_data[0]["answer_type"],
        "answer_from": train_data[0]["answer_from"],
        "derivation": response_data["equation"],
        "scale": response_data["scale"],
        "action": response_data["action"]
    }
)
print(prompt_output)

### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{question_type}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{evidence}`.
3. Step 3: If `{question_type}` is `Arithmetic`, generate an equation in `{equation}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{answer}`.
5. Step 5: Predict the answer’s scale in `{scale}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{answer}` and `{question_type}`, generate a short and logical recommendation, business insight, or next action. Store it in `{action}`.

Please organize the results in the following markdown table:
| step | output |
| 1 | Multiple spans |
| 2 | [evidence goes here manually if available, e.g., numbers or key phrases] |
| 3 | N

## Fine-Tuning

### Set bitbytesands config

In [8]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

### Tokenize LLM model

In [9]:
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
    device_map='auto',
    quantization_config=bnb_config,
    torch_dtype=float16,
    use_cache=False
)

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mistral-7B-DPO")

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.model_max_length = 1024

Loading checkpoint shards: 100%|██████████| 3/3 [00:09<00:00,  3.00s/it]


### Create generate & tokenization function

In [10]:
def safe_eval(value):
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value    # just return as-is string if not evaluable

def generate_and_tokenize_prompt(example):
    table = safe_eval(example["table"])
    paragraphs = safe_eval(example["paragraphs"])
    answer = safe_eval(example["answer"])

    question_dict = {
        "question": example["question"],
        "answer": answer,
        "answer_type": example.get("answer_type", ""),
        "answer_from": example.get("answer_from", ""),
        "derivation": example.get("derivation", "N.A."),
        "scale": example.get("scale", "none"),
        "action": example.get("action", "[action goes here...]")
    }

    full_prompt = create_prompt(table, paragraphs, question_dict)

    tokenized = tokenizer(
        full_prompt,
        truncation=True,
        max_length=1024,
        padding="max_length",
        return_tensors=None
    )

    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

### Mapping tokenization train & test dataset

In [11]:
tokenized_train_dataset = train_data.map(generate_and_tokenize_prompt, remove_columns=train_data.column_names)
tokenized_test_dataset = test_data.map(generate_and_tokenize_prompt, remove_columns=test_data.column_names)

Map:   0%|          | 0/13215 [00:00<?, ? examples/s]

Map: 100%|██████████| 13215/13215 [00:20<00:00, 642.81 examples/s]
Map: 100%|██████████| 1669/1669 [00:02<00:00, 669.05 examples/s]


### Set LoRA config

In [12]:
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)

### Print model params

In [13]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.4f}"
    )
    
print_trainable_parameters(model)

trainable params: 27262976 || all params: 3779350528 || trainable%: 0.7214


In [14]:
from inspect import signature
print(signature(SFTTrainer))

(model: Union[str, torch.nn.modules.module.Module, transformers.modeling_utils.PreTrainedModel], args: Union[trl.trainer.sft_config.SFTConfig, transformers.training_args.TrainingArguments, NoneType] = None, data_collator: Optional[transformers.data.data_collator.DataCollator] = None, train_dataset: Union[datasets.arrow_dataset.Dataset, datasets.iterable_dataset.IterableDataset, NoneType] = None, eval_dataset: Union[datasets.arrow_dataset.Dataset, dict[str, datasets.arrow_dataset.Dataset], NoneType] = None, processing_class: Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] = None, compute_loss_func: Optional[Callable] = None, compute_metrics: Optional[Callable[[transformers.trainer_utils.EvalPrediction], dict]] = None, callbacks: Optional[list[transformers.trainer_callback.TrainerCallback]] = None

In [15]:
class PrintLossCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is None:
            return
        
        if 'loss' in logs:
            print(f"Step {state.global_step} | Loss: {logs['loss']:.4f} | LR: {logs.get('learning_rate', 'N/A')}")
        
        if 'eval_loss' in logs:
            print(f"[Eval] Step {state.global_step} | Eval Loss: {logs['eval_loss']:.4f}")
        
        if 'exact_match' in logs or 'f1' in logs:
            em = logs.get('exact_match', 'N/A')
            f1 = logs.get('f1', 'N/A')
            print(f"[Eval] Step {state.global_step} | EM: {em:.2f} | F1: {f1:.2f}")
        
        if 'loss' not in logs and 'eval_loss' not in logs and len(logs) > 0:
            print(f"Step {state.global_step} | Available metrics: {', '.join(logs.keys())}")

In [16]:
def safe_eval(value):
    try:
        return ast.literal_eval(value)
    except:
        return value

def formatting_func(example):
    table = safe_eval(example["table"])
    paragraphs = safe_eval(example["paragraphs"])
    answer = safe_eval(example["answer"])
    
    question_dict = {
        "question": example["question"],
        "answer": answer,
        "answer_type": example.get("answer_type", ""),
        "answer_from": example.get("answer_from", ""),
        "derivation": example.get("derivation", "N.A."),
        "scale": example.get("scale", "none")
    }

    return create_prompt(table, paragraphs, question_dict)

In [145]:
def normalize_text(text):
    import re
    import string
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def normalize_textv2(s):
    import re
    import string
    s = s.lower()
    s = re.sub(r"\b(a|an|the)\b", " ", s)   # removes 'a', 'an', 'the'
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = " ".join(s.split()) # normalize all whitespace to single spaces
    return s

def compute_exact(a_pred, a_gold):
    return int(normalize_textv2(a_pred) == normalize_textv2(a_gold))

def compute_f1(a_pred, a_gold):
    pred_tokens = normalize_text(a_pred).split()
    gold_tokens = normalize_text(a_gold).split()
    common = set(pred_tokens) & set(gold_tokens)
    if len(common) == 0:
        return 0.0
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def compute_metrics(pred):
    # Fix for nested list of predictions
    try:
        predictions = np.array(pred.predictions)
        if predictions.ndim == 3:
            predictions = np.argmax(predictions, axis=-1)   # predictions are logits -> take argmax first
        elif predictions.ndim == 1:
            predictions = [predictions.tolist()]    # sometimes it's already flattened
    except Exception as e:
        print("Prediction format error:", e)
        predictions = pred.predictions

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    label_ids = []  # handle -100s in label_ids
    for label in pred.label_ids:
        label = np.array(label)
        label = label[label != -100]
        label_ids.append(label)

    decoded_labels = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    em_scores = []
    f1_scores = []
    for pred_str, label_str in zip(decoded_preds, decoded_labels):
        em = compute_exact(pred_str, label_str)
        f1 = compute_f1(pred_str, label_str)
        em_scores.append(em)
        f1_scores.append(f1)

    return {
        "exact_match": np.mean(em_scores) * 100,
        "f1": np.mean(f1_scores) * 100
    }

def compute_metricsv2(pred):
    preds = pred.predictions
    if isinstance(preds, tuple):
        preds = preds[0]

    pred_ids = np.argmax(preds, axis=-1)    # convert logits to IDs
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)

    # Clean -100 from labels
    label_ids = [label[label != -100] if hasattr(label, "__getitem__") else label for label in pred.label_ids]
    labels = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    em_scores = []
    f1_scores = []

    for p, g in zip(predictions, labels):
        em = compute_exact(p, g)
        f1 = compute_f1(p, g)
        em_scores.append(em)
        f1_scores.append(f1)

    return {
        "exact_match": np.mean(em_scores) * 100,
        "f1": np.mean(f1_scores) * 100
    }

### Set training config

In [18]:
tokenized_train_dataset = tokenized_train_dataset.select(range(9000))

sft_config = SFTConfig(
    output_dir="tat-llm",
    max_seq_length=1024,
    # max_steps=100,
    num_train_epochs=4,
    per_device_train_batch_size=2,
    logging_steps=200,
    save_strategy="steps",
    save_steps=500,
    eval_strategy="no",
    # eval_steps=100,
    learning_rate=2e-4,
    bf16=True,
    lr_scheduler_type="constant",
    # report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    peft_config=peft_config,
    args=sft_config,
    compute_metrics=compute_metrics,
    callbacks=[PrintLossCallback()]
)

Truncating train dataset: 100%|██████████| 9000/9000 [00:00<00:00, 132281.36 examples/s]
Truncating eval dataset: 100%|██████████| 1669/1669 [00:00<00:00, 138266.48 examples/s]
No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


### Train!

In [19]:
trainer.train()

  return fn(*args, **kwargs)


Step,Training Loss
200,0.6149
400,0.5707
600,0.5322
800,0.5231
1000,0.5077
1200,0.4991
1400,0.4973
1600,0.4772
1800,0.4643
2000,0.4517


Step 200 | Loss: 0.6149 | LR: 0.0002
Step 400 | Loss: 0.5707 | LR: 0.0002


  return fn(*args, **kwargs)


Step 600 | Loss: 0.5322 | LR: 0.0002
Step 800 | Loss: 0.5231 | LR: 0.0002
Step 1000 | Loss: 0.5077 | LR: 0.0002


  return fn(*args, **kwargs)


Step 1200 | Loss: 0.4991 | LR: 0.0002
Step 1400 | Loss: 0.4973 | LR: 0.0002


  return fn(*args, **kwargs)


Step 1600 | Loss: 0.4772 | LR: 0.0002
Step 1800 | Loss: 0.4643 | LR: 0.0002
Step 2000 | Loss: 0.4517 | LR: 0.0002


  return fn(*args, **kwargs)


Step 2200 | Loss: 0.4374 | LR: 0.0002
Step 2400 | Loss: 0.4367 | LR: 0.0002


  return fn(*args, **kwargs)


Step 2600 | Loss: 0.4224 | LR: 0.0002
Step 2800 | Loss: 0.4126 | LR: 0.0002
Step 3000 | Loss: 0.3915 | LR: 0.0002


  return fn(*args, **kwargs)


Step 3200 | Loss: 0.3832 | LR: 0.0002
Step 3400 | Loss: 0.3763 | LR: 0.0002


  return fn(*args, **kwargs)


Step 3600 | Loss: 0.3540 | LR: 0.0002
Step 3800 | Loss: 0.3505 | LR: 0.0002
Step 4000 | Loss: 0.3358 | LR: 0.0002


  return fn(*args, **kwargs)


Step 4200 | Loss: 0.3341 | LR: 0.0002
Step 4400 | Loss: 0.3366 | LR: 0.0002


  return fn(*args, **kwargs)


Step 4600 | Loss: 0.2891 | LR: 0.0002
Step 4800 | Loss: 0.2711 | LR: 0.0002
Step 5000 | Loss: 0.2786 | LR: 0.0002


  return fn(*args, **kwargs)


Step 5200 | Loss: 0.2575 | LR: 0.0002
Step 5400 | Loss: 0.2672 | LR: 0.0002


  return fn(*args, **kwargs)


Step 5600 | Loss: 0.2473 | LR: 0.0002
Step 5800 | Loss: 0.2489 | LR: 0.0002
Step 6000 | Loss: 0.2338 | LR: 0.0002


  return fn(*args, **kwargs)


Step 6200 | Loss: 0.2218 | LR: 0.0002
Step 6400 | Loss: 0.2312 | LR: 0.0002


  return fn(*args, **kwargs)


Step 6600 | Loss: 0.2217 | LR: 0.0002
Step 6800 | Loss: 0.2129 | LR: 0.0002
Step 7000 | Loss: 0.2099 | LR: 0.0002


  return fn(*args, **kwargs)


Step 7200 | Loss: 0.2113 | LR: 0.0002
Step 7400 | Loss: 0.2087 | LR: 0.0002


  return fn(*args, **kwargs)


Step 7600 | Loss: 0.2056 | LR: 0.0002
Step 7800 | Loss: 0.1938 | LR: 0.0002
Step 8000 | Loss: 0.1999 | LR: 0.0002


  return fn(*args, **kwargs)


Step 8200 | Loss: 0.1911 | LR: 0.0002
Step 8400 | Loss: 0.1853 | LR: 0.0002


  return fn(*args, **kwargs)


Step 8600 | Loss: 0.1861 | LR: 0.0002
Step 8800 | Loss: 0.1869 | LR: 0.0002
Step 9000 | Loss: 0.1766 | LR: 0.0002


  return fn(*args, **kwargs)


Step 9200 | Loss: 0.1616 | LR: 0.0002
Step 9400 | Loss: 0.1590 | LR: 0.0002


  return fn(*args, **kwargs)


Step 9600 | Loss: 0.1548 | LR: 0.0002
Step 9800 | Loss: 0.1509 | LR: 0.0002
Step 10000 | Loss: 0.1538 | LR: 0.0002


  return fn(*args, **kwargs)


Step 10200 | Loss: 0.1547 | LR: 0.0002
Step 10400 | Loss: 0.1540 | LR: 0.0002


  return fn(*args, **kwargs)


Step 10600 | Loss: 0.1556 | LR: 0.0002
Step 10800 | Loss: 0.1482 | LR: 0.0002
Step 11000 | Loss: 0.1506 | LR: 0.0002


  return fn(*args, **kwargs)


Step 11200 | Loss: 0.1498 | LR: 0.0002
Step 11400 | Loss: 0.1452 | LR: 0.0002


  return fn(*args, **kwargs)


Step 11600 | Loss: 0.1407 | LR: 0.0002
Step 11800 | Loss: 0.1344 | LR: 0.0002
Step 12000 | Loss: 0.1366 | LR: 0.0002


  return fn(*args, **kwargs)


Step 12200 | Loss: 0.1354 | LR: 0.0002
Step 12400 | Loss: 0.1408 | LR: 0.0002


  return fn(*args, **kwargs)


Step 12600 | Loss: 0.1369 | LR: 0.0002
Step 12800 | Loss: 0.1355 | LR: 0.0002
Step 13000 | Loss: 0.1350 | LR: 0.0002


  return fn(*args, **kwargs)


Step 13200 | Loss: 0.1339 | LR: 0.0002
Step 13400 | Loss: 0.1266 | LR: 0.0002


  return fn(*args, **kwargs)


Step 13600 | Loss: 0.1155 | LR: 0.0002
Step 13800 | Loss: 0.1129 | LR: 0.0002
Step 14000 | Loss: 0.1111 | LR: 0.0002


  return fn(*args, **kwargs)


Step 14200 | Loss: 0.1096 | LR: 0.0002
Step 14400 | Loss: 0.1082 | LR: 0.0002


  return fn(*args, **kwargs)


Step 14600 | Loss: 0.1005 | LR: 0.0002
Step 14800 | Loss: 0.1081 | LR: 0.0002
Step 15000 | Loss: 0.1000 | LR: 0.0002


  return fn(*args, **kwargs)


Step 15200 | Loss: 0.1015 | LR: 0.0002
Step 15400 | Loss: 0.1021 | LR: 0.0002


  return fn(*args, **kwargs)


Step 15600 | Loss: 0.1039 | LR: 0.0002
Step 15800 | Loss: 0.1018 | LR: 0.0002
Step 16000 | Loss: 0.0991 | LR: 0.0002


  return fn(*args, **kwargs)


Step 16200 | Loss: 0.0891 | LR: 0.0002
Step 16400 | Loss: 0.0962 | LR: 0.0002


  return fn(*args, **kwargs)


Step 16600 | Loss: 0.0906 | LR: 0.0002
Step 16800 | Loss: 0.0924 | LR: 0.0002
Step 17000 | Loss: 0.0907 | LR: 0.0002


  return fn(*args, **kwargs)


Step 17200 | Loss: 0.0875 | LR: 0.0002
Step 17400 | Loss: 0.0878 | LR: 0.0002


  return fn(*args, **kwargs)


Step 17600 | Loss: 0.0830 | LR: 0.0002
Step 17800 | Loss: 0.0874 | LR: 0.0002
Step 18000 | Loss: 0.0831 | LR: 0.0002
Step 18000 | Available metrics: train_runtime, train_samples_per_second, train_steps_per_second, total_flos, train_loss, epoch


TrainOutput(global_step=18000, training_loss=0.22556510406070285, metrics={'train_runtime': 39649.9004, 'train_samples_per_second': 0.908, 'train_steps_per_second': 0.454, 'total_flos': 1.578796188696576e+18, 'train_loss': 0.22556510406070285})

### Save the fine-tuned model

In [20]:
trainer.save_model("tat-llm-final-e4")          # saves model + LoRA adapter
tokenizer.save_pretrained("tat-llm-final-e4")   # saves tokenizer config/vocab

('tat-llm-final-e4\\tokenizer_config.json',
 'tat-llm-final-e4\\special_tokens_map.json',
 'tat-llm-final-e4\\chat_template.jinja',
 'tat-llm-final-e4\\tokenizer.model',
 'tat-llm-final-e4\\added_tokens.json',
 'tat-llm-final-e4\\tokenizer.json')

In [29]:
metadata = {
    "model_name": "tat-llm-final-e4",
    "base_model": "NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
    "tokenizer": "NousResearch/Nous-Hermes-2-Mistral-7B-DPO",
    "adapter_type": "LoRA",
    "adapter_config": {
        "r": peft_config.r,
        "alpha": peft_config.lora_alpha,
        "dropout": peft_config.lora_dropout,
        "bias": peft_config.bias
    },
    "training": {
        "dataset": "TAT-QA (train.json)",
        "num_examples": 9000,
        "num_epochs": 4,
        "max_seq_length": 1024,
        "batch_size_per_device": sft_config.per_device_train_batch_size,
        "learning_rate": sft_config.learning_rate,
        "lr_scheduler": sft_config.lr_scheduler_type,
        "fp16": sft_config.fp16,
        "bf16": sft_config.bf16,
        "optimizer": "AdamW (via Trainer)"
    },
    "notes": "Instruction-tuned with simplified prompt format. No evaluation run due to memory constraints. Use .generate() for inference.",
    "created_by": "Your Name or Team",
    "date": "2025-07-08"
}

with open("tat-llm-final-e4/metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

In [28]:
from dataclasses import asdict

with open("tat-llm-final-e4/training_args.json", "w") as f:
    json.dump(asdict(sft_config), f, indent=2)

## Evaluate the Fine-Tuned Model

### Load the fine-tuned model

In [21]:
lora_model = PeftModel.from_pretrained(model, "tat-llm-final-e4")



In [22]:
tokenizer = AutoTokenizer.from_pretrained("tat-llm-final-e4")

### Evaluate with range 50

In [146]:
em_scores = []
f1_scores = []

subset_dataset = tokenized_test_dataset.select(range(50))
lora_model.eval()

for i in tqdm(range(len(subset_dataset)), desc="Evaluating"):
    sample = subset_dataset[i]

    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).to("cuda")
    attention_mask = torch.tensor(sample["attention_mask"]).unsqueeze(0).to("cuda")

    with torch.no_grad():
        output = lora_model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=128,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id
        )

    pred_str = tokenizer.decode(output[0], skip_special_tokens=True)

    # Gold label: decode from true token IDs without -100
    if "labels" in sample:
        label = np.array(sample["labels"])
        label = label[label != -100]
        label_str = tokenizer.decode(label, skip_special_tokens=True)
    else:
        label_str = sample["answer"]    # fallback if using original dataset

    em = compute_exact(pred_str, label_str)
    f1 = compute_f1(pred_str, label_str)

    em_scores.append(em)
    f1_scores.append(f1)

print(f"\nEM: {100 * np.mean(em_scores):.2f}")
print(f"F1: {100 * np.mean(f1_scores):.2f}")

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   2%|▏         | 1/50 [00:38<31:37, 38.72s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   4%|▍         | 2/50 [01:15<30:09, 37.70s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   6%|▌         | 3/50 [01:53<29:24, 37.55s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   8%|▊         | 4/50 [02:32<29:28, 38.45s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:  10%|█         | 5/50 [03:09<28:17, 37.72s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:  12%|█▏        | 6/50 [03:48<28:08, 38.37s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:  14%|█▍        | 7/50 [04:28<27:47, 38.77s/it]Setting `pad_token_id` to `eos_token_id`


EM: 0.00
F1: 46.20





### Evaluate with range 50 (random sample)

In [149]:
em_scores = []
f1_scores = []

subset_indices = random.sample(range(len(tokenized_test_dataset)), 50)
subset_dataset = tokenized_test_dataset.select(subset_indices)
lora_model.eval()

for i in tqdm(range(len(subset_dataset)), desc="Evaluating"):
    sample = subset_dataset[i]

    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).to("cuda")
    attention_mask = torch.tensor(sample["attention_mask"]).unsqueeze(0).to("cuda")

    with torch.no_grad():
        output = lora_model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=128,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id
        )

    pred_str = tokenizer.decode(output[0], skip_special_tokens=True)

    # Gold label: decode from true token IDs without -100
    if "labels" in sample:
        label = np.array(sample["labels"])
        label = label[label != -100]
        label_str = tokenizer.decode(label, skip_special_tokens=True)
    else:
        label_str = sample["answer"]    # fallback if using original dataset

    em = compute_exact(pred_str, label_str)
    f1 = compute_f1(pred_str, label_str)

    em_scores.append(em)
    f1_scores.append(f1)

print(f"\nEM: {100 * np.mean(em_scores):.2f}")
print(f"F1: {100 * np.mean(f1_scores):.2f}")

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   2%|▏         | 1/50 [00:12<10:09, 12.45s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   4%|▍         | 2/50 [00:24<09:39, 12.06s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   6%|▌         | 3/50 [00:34<08:56, 11.42s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   8%|▊         | 4/50 [00:44<08:18, 10.84s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:  10%|█         | 5/50 [00:55<08:01, 10.71s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:  12%|█▏        | 6/50 [01:06<08:05, 11.04s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:  14%|█▍        | 7/50 [01:17<07:52, 10.98s/it]Setting `pad_token_id` to `eos_token_id`


EM: 0.00
F1: 44.97





### Evaluate with range 100 (random sample)

In [148]:
em_scores = []
f1_scores = []

subset_indices = random.sample(range(len(tokenized_test_dataset)), 100)
subset_dataset = tokenized_test_dataset.select(subset_indices)
lora_model.eval()

for i in tqdm(range(len(subset_dataset)), desc="Evaluating"):
    sample = subset_dataset[i]

    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).to("cuda")
    attention_mask = torch.tensor(sample["attention_mask"]).unsqueeze(0).to("cuda")

    with torch.no_grad():
        output = lora_model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=128,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id
        )

    pred_str = tokenizer.decode(output[0], skip_special_tokens=True)

    # Gold label: decode from true token IDs without -100
    if "labels" in sample:
        label = np.array(sample["labels"])
        label = label[label != -100]
        label_str = tokenizer.decode(label, skip_special_tokens=True)
    else:
        label_str = sample["answer"]    # fallback if using original dataset

    em = compute_exact(pred_str, label_str)
    f1 = compute_f1(pred_str, label_str)

    em_scores.append(em)
    f1_scores.append(f1)

print(f"\nEM: {100 * np.mean(em_scores):.2f}")
print(f"F1: {100 * np.mean(f1_scores):.2f}")

Evaluating:   0%|          | 0/100 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.


Evaluating:   1%|          | 1/100 [00:11<19:07, 11.59s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   2%|▏         | 2/100 [00:20<15:58,  9.78s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   3%|▎         | 3/100 [00:29<15:33,  9.63s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   4%|▍         | 4/100 [00:41<16:43, 10.45s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   5%|▌         | 5/100 [00:52<16:50, 10.64s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   6%|▌         | 6/100 [01:03<16:46, 10.71s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   7%|▋         | 7/100 [01:12<16:07, 10.40s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   8%|▊         | 8/100 [01:23<16:02, 10.46s/it]Setting `pad_token_id` t


EM: 0.00
F1: 46.49





### Evaluate with full test dataset

In [152]:
em_scores = []
f1_scores = []

subset_dataset = tokenized_test_dataset
lora_model.eval()

for i in tqdm(range(len(subset_dataset)), desc="Evaluating"):
    sample = subset_dataset[i]

    input_ids = torch.tensor(sample["input_ids"]).unsqueeze(0).to("cuda")
    attention_mask = torch.tensor(sample["attention_mask"]).unsqueeze(0).to("cuda")

    with torch.no_grad():
        output = lora_model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=128,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id
        )

    pred_str = tokenizer.decode(output[0], skip_special_tokens=True)

    # Gold label: decode from true token IDs without -100
    if "labels" in sample:
        label = np.array(sample["labels"])
        label = label[label != -100]
        label_str = tokenizer.decode(label, skip_special_tokens=True)
    else:
        label_str = sample["answer"]    # fallback if using original dataset

    em = compute_exact(pred_str, label_str)
    f1 = compute_f1(pred_str, label_str)

    em_scores.append(em)
    f1_scores.append(f1)

print(f"\nEM: {100 * np.mean(em_scores):.2f}")
print(f"F1: {100 * np.mean(f1_scores):.2f}")

Evaluating:   0%|          | 0/1669 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   0%|          | 1/1669 [00:10<4:50:17, 10.44s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   0%|          | 2/1669 [00:19<4:34:05,  9.87s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   0%|          | 3/1669 [00:29<4:30:35,  9.75s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   0%|          | 4/1669 [00:41<4:53:06, 10.56s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   0%|          | 5/1669 [00:50<4:34:05,  9.88s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   0%|          | 6/1669 [01:02<4:54:04, 10.61s/it]Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.
Evaluating:   0%|          | 7/1669 [01:13<5:05:27, 11.03s/it]Setting `p


EM: 0.00
F1: 46.62





### Test prompt

In [72]:
lora_model.eval()

markdown_table = """|  | Year Ended | Year Ended |
| Stock-Based Compensation by Type of Award | December 31, 2019 | December 31, 2018 |
| Stock options | $2,756 | $2,926 |
| RSUs | 955 | 1,129 |
| Total stock-based compensation expense | $3,711 | $4,055 |"""

test_prompt = f"""### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{{question_type}}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{{evidence}}`.
3. Step 3: If `{{question_type}}` is `Arithmetic`, generate an equation in `{{equation}}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{{answer}}`.
5. Step 5: Predict the answer's scale in `{{scale}}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{{answer}}` and `{{question_type}}`, generate a short and logical recommendation, business insight, or next action. Store it in `{{action}}`

Table:
{markdown_table}

Context:
Stock-based compensation expense is included in general and administrative expense for each period as follows:
As of December 31, 2019, there was $4,801 of unrecognized stock-based compensation expense related to unvested employee stock options and $1,882 of unrecognized stock-based compensation expense related to unvested RSUs. These costs are expected to be recognized over a weighted-average period of 2.13 and 2.33 years, respectively.

Question:
Based on data, what insights that we can get?

Answer:"""

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
input_length = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("\n=== Generated Answer ===\n")
print(response)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



=== Generated Answer ===


In 2019, the total stock-based compensation expense decreased by $344 compared to 2018. The expense related to unvested employee stock options decreased by $175, while the expense related to unvested RSUs increased by $174.

Recommendation:
It is essential to monitor the trends in stock-based compensation expenses and the number of unvested employee stock options and RSUs to make informed decisions about the company's compensation strategy.


In [165]:
lora_model.eval()

markdown_table = """|  | Year Ended | Year Ended |
| Stock-Based Compensation by Type of Award | December 31, 2019 | December 31, 2018 |
| Stock options | $2,756 | $2,926 |
| RSUs | 955 | 1,129 |
| Total stock-based compensation expense | $3,711 | $4,055 |"""

test_prompt = f"""### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{{question_type}}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{{evidence}}`.
3. Step 3: If `{{question_type}}` is `Arithmetic`, generate an equation in `{{equation}}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{{answer}}`.
5. Step 5: Predict the answer's scale in `{{scale}}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{{answer}}` and `{{question_type}}`, generate a short and logical recommendation, business insight, or next action. Store it in `{{action}}`

Table:
{markdown_table}

Context:
The table provides a comparison of stock-based compensation expenses for PT XYZ across two consecutive years. The data reveals how different award types contributed to the overall compensation and helps assess year-over-year cost efficiency and employee incentives.

Question:
How did the total stock-based compensation expense change from 2018 to 2019, and what does that suggest about the company’s strategy toward talent retention?

Answer:"""

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
input_length = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("\n=== Generated Answer ===\n")
print(response)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



=== Generated Answer ===


The total stock-based compensation expense decreased from $4,055 in 2018 to $3,711 in 2019. This suggests that the company may be focusing on cost efficiency and potentially reducing employee incentives.


In [168]:
lora_model.eval()

markdown_table = """|  | Year Ended | Year Ended |
| Stock-Based Compensation by Type of Award | December 31, 2019 | December 31, 2018 |
| Stock options | $2,756 | $2,926 |
| RSUs | 955 | 1,129 |
| Total stock-based compensation expense | $3,711 | $4,055 |"""

test_prompt = f"""### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{{question_type}}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{{evidence}}`.
3. Step 3: If `{{question_type}}` is `Arithmetic`, generate an equation in `{{equation}}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{{answer}}`.
5. Step 5: Predict the answer's scale in `{{scale}}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{{answer}}` and `{{question_type}}`, generate a short and logical recommendation, business insight, or next action. Store it in `{{action}}`

Table:
{markdown_table}

Context:

Question:
Between stock options and RSUs, which component shows a greater year-over-year cost reduction, and what strategic decision could the company make based on this data?

Answer:"""

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
input_length = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("\n=== Generated Answer ===\n")
print(response)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



=== Generated Answer ===


Stock options show a greater year-over-year cost reduction. The company could consider issuing more RSUs to employees to reduce the overall cost of stock-based compensation.


In [170]:
lora_model.eval()

markdown_table = """|  | Year Ended | Year Ended |
| Stock-Based Compensation by Type of Award | December 31, 2019 | December 31, 2018 |
| Stock options | $2,756 | $2,926 |
| RSUs | 955 | 1,129 |
| Total stock-based compensation expense | $3,711 | $4,055 |"""

test_prompt = f"""### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{{question_type}}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{{evidence}}`.
3. Step 3: If `{{question_type}}` is `Arithmetic`, generate an equation in `{{equation}}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{{answer}}`.
5. Step 5: Predict the answer's scale in `{{scale}}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{{answer}}` and `{{question_type}}`, generate a short and logical recommendation, business insight, or next action. Store it in `{{action}}`

Table:
{markdown_table}

Context:
Tabel berikut menunjukkan rincian biaya kompensasi berbasis saham PT ABC selama dua tahun terakhir. Perusahaan sedang meninjau kembali efektivitas alokasi insentif karyawan dalam kaitannya dengan efisiensi operasional dan keberlanjutan biaya.

Question:
Dari data tersebut, apakah terdapat indikasi bahwa perusahaan berhasil menekan biaya kompensasi? Jelaskan alasan dan komponen yang paling berkontribusi terhadap penurunan atau peningkatan biaya.

Answer:"""

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
input_length = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("\n=== Generated Answer ===\n")
print(response)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



=== Generated Answer ===


Tidak ada indikasi yang signifikan bahwa perusahaan berhasil menekan biaya kompensasi. Biaya kompensasi berbasis saham pada tahun 2019 mencapai $3,711, naik dibandingkan dengan tahun 2018 yang hanya $4,055. Komponen yang paling berkontribusi terhadap penurunan biaya adalah RSUs dengan jumlah 955 pada tahun 2019, lebih rendah dibandingkan dengan 1,129 pada tahun 2018. Namun, komponen stock options menunjukkan penurunan dari $2,926 pada tahun 2018 menjadi $2,756 pada tahun 2019.

Scale: none

Action:
Perusahaan harus melakukan analisis lebih dalam terhadap data ini dan mengidentifikasi alasan kesalahan yang mengakibatkan penurunan biaya kompensasi. Selain itu, perusahaan harus mempertimbangkan alasan yang lain seperti kualitas karyawan dan kinerja perusahaan dalam memutuskan apakah perlu mengurangi biaya kompensasi atau tidak.


In [187]:
lora_model.eval()

markdown_table = """| | Tahun Berakhir | Tahun Berakhir |
| Jenis Kompensasi Berbasis Saham | 31 Desember 2023 | 31 Desember 2022 |
| Opsi saham | Rp2.450 | Rp2.800 |
| Saham terbatas (RSU) | Rp1.100 | Rp1.300 |
| Total kompensasi berbasis saham | Rp3.550 | Rp4.100 |"""

test_prompt = f"""### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{{question_type}}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{{evidence}}`.
3. Step 3: If `{{question_type}}` is `Arithmetic`, generate an equation in `{{equation}}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{{answer}}`.
5. Step 5: Predict the answer's scale in `{{scale}}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{{answer}}` and `{{question_type}}`, generate a short and logical recommendation, business insight, or next action. Store it in `{{action}}`

Table:
{markdown_table}

Context:
Data menunjukkan bahwa kompensasi opsi saham menurun, namun tidak sebanyak penurunan pada RSU. Hal ini membuat manajemen mempertimbangkan perubahan proporsi pemberian insentif.

Question:
Apakah perusahaan sebaiknya mempertahankan, menambah, atau mengurangi porsi opsi saham dibandingkan RSU berdasarkan data ini?

Answer:"""

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
input_length = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("\n=== Generated Answer ===\n")
print(response)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



=== Generated Answer ===


{ assistant
question_type: Comparison
evidence: ["Opsi saham: Rp2.450", "Saham terbatas (RSU): Rp1.100", "Opsi saham: Rp2.800", "Saham terbatas (RSU): Rp1.300", "Total kompensasi berbasis saham: Rp3.550", "Total kompensasi berbasis saham: Rp4.100"]
equation: N.A.
answer: 
- Mempertahankan: 31 Desember 2023 (Rp2.450) dan 31 Desember 2022 (Rp2.800)
- Menambah: 31 Desember 2023 (Rp1.100) dan 31 Desember 2022 (Rp1.300)
- Mengurangi: N.A.
scale: none
action: Perusahaan sebaiknya mempertimbangkan faktor-faktor lain seperti kinerja karyawan, kemampuan finansial, dan strategi perusahaan saat membuat keputusan tentang perubahan proporsi pemberian insentif antara opsi saham dan RSU berdasarkan data ini. } 

The question is asking for a comparison between the proportions of stock options and restricted stock units (RSUs) based on the provided data.

The relevant evidence from the table and context is:
- Opsi saham: Rp2.450 and Rp2.800
- Saham terbatas (RSU): Rp1.100 an

In [188]:
lora_model.eval()

markdown_table = """| | Tahun Berakhir | Tahun Berakhir |
| Jenis Kompensasi Berbasis Saham | 31 Desember 2023 | 31 Desember 2022 |
| Opsi saham | Rp2.450 | Rp2.800 |
| Saham terbatas (RSU) | Rp1.100 | Rp1.300 |
| Total kompensasi berbasis saham | Rp3.550 | Rp4.100 |"""

test_prompt = f"""### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{{question_type}}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{{evidence}}`.
3. Step 3: If `{{question_type}}` is `Arithmetic`, generate an equation in `{{equation}}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{{answer}}`.
5. Step 5: Predict the answer's scale in `{{scale}}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{{answer}}` and `{{question_type}}`, generate a short and logical recommendation, business insight, or next action. Store it in `{{action}}`

Table:
{markdown_table}

Context:
Perusahaan sedang menyusun anggaran tahun 2024 dan menggunakan data kompensasi saham tahun sebelumnya sebagai dasar proyeksi.

Question:
Berdasarkan tren dua tahun terakhir, berapa estimasi total kompensasi berbasis saham untuk tahun 2024 jika pola penurunan berlanjut?

Answer:"""

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
input_length = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("\n=== Generated Answer ===\n")
print(response)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



=== Generated Answer ===


Rp. 3,850

Scale:
none

Action:
Untuk tahun 2024, perusahaan harus mengkalkulasi kompensasi saham dengan mengambil kira pola penurunan yang terlihat pada data sebelumnya. Jika pola penurunan terus berlanjut, total kompensasi saham diperkirakan akan menjadi Rp. 3,850. Perusahaan harus mengkaji faktor-faktor yang mungkin mempengaruhi pola ini dan mengambil tindak lanjut yang diperlukan untuk memastikan kompensasi saham tetap kompetitif dan menarik untuk tenaga kerja.

### Instruction
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{question_type}`.
The question type is "Estimation".
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{evidence}`.
The evidence is the values in the table for "Opsi saham" and "Total kompensasi berbasis saham" for both "31 Desember 2023" and "31 Desember 2022".
3. Step 3: If `{question_type}` is `Arithmetic`, generate an equation i

In [189]:
lora_model.eval()

markdown_table = """| | Tahun Berakhir | Tahun Berakhir |
| Jenis Kompensasi Berbasis Saham | 31 Desember 2023 | 31 Desember 2022 |
| Opsi saham | Rp2.450 | Rp2.800 |
| Saham terbatas (RSU) | Rp1.100 | Rp1.300 |
| Total kompensasi berbasis saham | Rp3.550 | Rp4.100 |"""

test_prompt = f"""### Instruction
Given a table and a list of texts in the following, answer the question posed using the following six-step process:
1. Step 1: Predict the type of question being asked. Store this prediction in the variable `{{question_type}}`.
2. Step 2: Extract the relevant strings or numerical values from the provided table or texts. Store them in `{{evidence}}`.
3. Step 3: If `{{question_type}}` is `Arithmetic`, generate an equation in `{{equation}}`. Otherwise, put `N.A.`.
4. Step 4: Compute the final answer and store in `{{answer}}`.
5. Step 5: Predict the answer's scale in `{{scale}}`. One of: `none`, `percent`, `thousand`, `million`, `billion`.
6. Step 6: Based on the `{{answer}}` and `{{question_type}}`, generate a short and logical recommendation, business insight, or next action. Store it in `{{action}}`

Table:
{markdown_table}

Context:
Manajemen ingin memastikan bahwa struktur kompensasi saham tetap kompetitif, namun juga efisien dalam jangka panjang.

Question:
Apa rekomendasi strategis yang dapat diberikan berdasarkan tren kompensasi saham dari tahun 2022 ke 2023?

Answer:"""

inputs = tokenizer(test_prompt, return_tensors="pt").to("cuda")
input_length = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = lora_model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id
    )

generated_tokens = outputs[0][input_length:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("\n=== Generated Answer ===\n")
print(response)

Setting `pad_token_id` to `eos_token_id`:32000 for open-end generation.



=== Generated Answer ===


{
  "question_type": "Recommendation",
  "evidence": [
    "Rp2.450",
    "Rp2.800",
    "Rp1.100",
    "Rp1.300",
    "Rp3.550",
    "Rp4.100"
  ],
  "equation": "N.A.",
  "answer": "Rp3.550",
  "scale": "none",
  "action": "Untuk memastikan struktur kompensasi saham tetap kompetitif dan efisien dalam jangka panjang, manajemen dapat mengekualifikasi karyawan yang memiliki prestasi yang baik dan memberikan opsi saham yang lebih tinggi pada karyawan tersebut. Hal ini dapat memperbaiki kualitas karyawan dan memperbaiki prestasi perusahaan secara keseluruhan."
}

### Explanation

Pertanyaan meminta rekomendasi strategis berdasarkan tren kompensasi saham dari tahun 2022 ke 2023. Tabel menunjukkan nilai kompensasi saham pada tahun 2022 dan 2023.

Tindakan yang disarikan adalah untuk memperbaiki kualitas karyawan dan memperbaiki prestasi perusahaan secara keseluruhan. Hal ini dapat dilakukan dengan menggekualifikasi karyawan yang memiliki prestasi yang baik dan me

## Thank You