# Synthetic Dataset Generator: TableQA
----
Synthetic Image Dataset on Tables for QA Reasoning and Recognition Tasks

----
authors: Marc Haraoui, Aser Lompo

date: 19/03/2025

## Setup

In [1]:
!pip install -U groq

Collecting groq
  Downloading groq-0.31.1-py3-none-any.whl.metadata (16 kB)
Downloading groq-0.31.1-py3-none-any.whl (134 kB)
Installing collected packages: groq
  Attempting uninstall: groq
    Found existing installation: groq 0.24.0
    Uninstalling groq-0.24.0:
      Successfully uninstalled groq-0.24.0
Successfully installed groq-0.31.1

### Imports

In [2]:
from groq import Groq
from openai import OpenAI
import google.generativeai as genai
import requests
import pandas as pd
from datetime import datetime, timedelta
import json
from PIL import Image as PILImage
from collections import Counter
import os
from pdf2image import convert_from_path
from IPython.display import Image, display
import numpy as np
import pytesseract
from pytesseract import Output
from typing_extensions import final
import random
from pprint import pprint
import shutil
import time



In [2]:
from utils import generate_unique_filename, crop_image, save_latex_table_as_image, extract_json_blocks, decode_llm_latex_output, decode_control_sequences, extract_tables_from_text, update_start_time, read_Exception


### Config

In [3]:
# API keys for model hosts
groq_key   = "..."
google_key = "..."
openai_key = "..."
openrouter_key = "..."


# LATEX INSTRUCTIONS
TABLE_INSTRUCT_LATEX = r"""
You are an expert in generating synthetic datasets composed of LaTeX-formatted tables, optionally accompanied by illustrative diagrams. Your task is to produce structured content suitable for data-centric documents, ensuring each table (and diagram, if included) is clear, well-organized, and visually informative.
Your final output should start with ```json and end with ``` as plain text, not just formatting. Like this:

```json
{
  "table_1": "BEGIN_LATEX
<LaTeX code for table 1 (with/without diagram) here>
END_LATEX",

  "table_2": "BEGIN_LATEX
<LaTeX code for table 2 (with/without diagram) here>
END_LATEX",

  "table_3": "BEGIN_LATEX
<LaTeX code for table 3 (with/without diagram) here>
END_LATEX"
}
```

Requirements:
    The tables and diagrams will be used to generate reasoning questions. Therefore:

        - If topic inspirations are supplied, ensure every generated table aligns with those topics.

        - Each LaTeX output must primarily consist of a table. Include a diagram only if it meaningfully complements the table; avoid adding one unnecessarily. Do not generate diagrams alone. If a diagram would be empty or unnecessary, DON'T INCLUDE it.

        - Keep any diagram minimal—smaller than the table, chart-free, and purely illustrative—serving only to reinforce the table’s content without adding new information.

        - Each table and its diagram must contain realistic, domain-relevant content. They must be self-contained, include a clear descriptive title and not rely on external data to compile.

        - The type of information presented should be diverse, such as numerical or qualitative data. The variety and richness of visual elements are essential to the overall quality of the table and its diagram. Table quality should also come with a large number of rows and columns.

        - Table and diagram layouts should be creatively designed—taking inspiration from reference example (when provided) but incorporating meaningful variations such as colors, multi-row or multi-column cells, custom formatting adjustments, or any other visual enhancement that promotes structural diversity.

        - Table layouts should be at least as complex as the example provided, don't try to simplify (diagrams are not mandatory). Table complexity should also come with a large number of rows and columns.

        - Do NOT escape any characters in the LaTeX code. The LaTeX must be written as plain text, exactly as it would appear in a .tex file, with real line breaks and single backslashes (\), not JSON-escaped.

        - All LaTeX tables and diagrams must be constrained to fit entirely within the printable area of a standard A4 page when compiled to PDF, without overflowing horizontally or vertically. Use appropriate formatting techniques such as adjusting column widths, reducing font size, or enabling landscape mode if necessary but NEVER rotation.

        - Make sure each LaTeX table and diagram includes all required \usepackage declarations and is enclosed within a complete, compilable LaTeX document structure, including the appropriate preamble and \begin{document}...\end{document} block.

        - Make sure each LaTeX code block starts and ends with BEGIN_LATEX and END_LATEX, respectively.

        - Make sure to wrap your final answer with ```json at the beginning and ``` at the end.
"""

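The prompt above asks for raw LaTeX (real line breaks, single backslashes) inside JSON-looking string values, so the reply is generally not strict JSON and cannot be read with `json.loads` alone. The notebook delegates this step to `extract_tables_from_text` from `utils`, whose implementation is not shown here. The block below is only a minimal sketch of one way to recover the tables with a regular expression; `parse_latex_tables` and `TABLE_BLOCK` are hypothetical names, not part of the notebook.

```python
import re

# Hypothetical helper, NOT the notebook's extract_tables_from_text.
# Captures the key ("table_1", ...) and everything between the two markers.
TABLE_BLOCK = re.compile(
    r'"(table_\d+)"\s*:\s*"BEGIN_LATEX\s*(.*?)\s*END_LATEX"',
    re.DOTALL,  # let "." span the real line breaks inside the LaTeX code
)

def parse_latex_tables(llm_output: str) -> dict:
    """Map each table key to the raw LaTeX found between BEGIN_LATEX and END_LATEX."""
    return {key: body for key, body in TABLE_BLOCK.findall(llm_output)}

# Illustrative reply body in the shape requested by the prompt (made-up content).
sample = r'''{
  "table_1": "BEGIN_LATEX
\documentclass{article}
\begin{document}
\begin{tabular}{ll} A & B \\ 1 & 2 \end{tabular}
\end{document}
END_LATEX"
}'''

tables = parse_latex_tables(sample)
print(list(tables))
```

Because the pattern keys on the two markers rather than on JSON syntax, it does not depend on whether the surrounding code fence is present in the reply.
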
In [4]:
QA_INSTRUCT = r"""
You are an expert in generating question-answer pairs from LaTeX-formatted tables. Your task is to create a structured dataset consisting of visually challenging, reasoning-based questions and their corresponding answers derived from a given LaTeX-formatted table with an optional diagram.

Input:

You will be provided with a sample LaTeX table as context. Based on this table or diagram, your goal is to generate a JSON object with the following structure:

    questions: A Python list of 3 challenging questions that require reasoning and analysis based ONLY on the data presented in the table and the optional diagram. The questions must be answerable using ONLY the information in the table or diagram (no extra knowledge).
    answers: A Python list of 3 detailed answers to the 3 questions, including a clear chain of thought explaining the reasoning process.

Requirements:

    All questions must be relevant to the table's context and designed to test deeper understanding or inference.
    When possible, all questions should make full use of the visual or structural elements of the table or diagram (such as rows, columns, headers, colors, patterns, diagrams etc.) while maintaining clear relevance to the table’s content.
    Questions must be clear and answerable with an objective methodology; no subjective questions.
    All entries (both questions and answers) should be returned as lists of string values.
    The global result should be a single JSON object wrapped in a markdown code block using ```json at the beginning and ``` at the end, and containing both key-value pairs.
    This means your output should start with ```json and end with ``` as plain text, not just formatting.
"""

In [21]:
QA_EVAL_INSTRUCT = r"""
You are a reasoning question-answer expert. You will be given a LaTeX-formatted table with or without a diagram, a list of 3 topics, and a question-answer pair.

Your task is to evaluate the question-answer pair based solely on the data in the LaTeX code and these criteria:

    1) Does the LaTeX code contain a Table (not some charts alone or diagrams alone) ?

    2) Are the table, any optional diagrams, and the rest of the document on one single topic from the provided list of topics, and internally consistent (be careful of off-topic diagrams)?

    3) Is the question clear and related to the table or the diagram?

    4) Is the answer (including its reasoning) totally valid and does it actually respond to the question?

    5) Is the answer FULLY supported by and ONLY BY the table or diagram data (no extra knowledge)?

If the five criteria are true, mark the pair as correct.
If one of the criteria is not met, mark it as incorrect.

Think step by step and conclude with your decision and the index of the criterion not met (if none, index is 0) as follows:
JSON_mention
{{"decision": [0, index_of_the_criterion_not_met]}} for incorrect or {{"decision": [1, 0]}} for correct
"""

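The evaluation prompt asks each jury model to reason step by step and finish with a small `decision` object. The notebook reads that object with `extract_json_blocks` and `json.loads`. As a rough illustration of the expected shape, the sketch below pulls the final verdict out of a free-text reply with a regular expression; `parse_decision` is a hypothetical helper and the sample replies are made up.

```python
import re

# Matches {"decision": [verdict, index]} where verdict is 0 or 1.
DECISION = re.compile(r'\{\s*"decision"\s*:\s*\[\s*([01])\s*,\s*(\d+)\s*\]\s*\}')

def parse_decision(reply: str):
    """Return (verdict, failed_criterion) from the last decision object, or None."""
    matches = DECISION.findall(reply)
    if not matches:
        return None
    verdict, index = map(int, matches[-1])  # keep the concluding decision
    return verdict, index

print(parse_decision('Criterion 4 fails: the sum is wrong. {"decision": [0, 4]}'))
print(parse_decision('All five criteria hold. {{"decision": [1, 0]}}'))
```

The second call shows that a reply echoing the doubled braces from the prompt is still parsed, since the pattern matches the inner object.
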
In [30]:
generation_settings = {
    "qwen/qwen3-32b": {
        "temperature": 0.2, "top_p": 0.95
    },
    "deepseek-r1-distill-llama-70b": {
        "temperature": 0.2, "top_p": 0.95
    },
    "gemini-2.5-flash-preview-04-17": {
        "temperature": 0.3, "top_p": 1.0
    },
    "gpt-4.1-mini": {
        "temperature": 0.3, "top_p": 1.0
    },
    "deepseek/deepseek-prover-v2:free": {
        "temperature": 0.2, "top_p": 0.95
    },
    "deepseek/deepseek-prover-v2": {
        "temperature": 0.2, "top_p": 0.95
    },
    "gemini-2.0-flash": {
        "temperature": 0.2, "top_p": 0.95
    },
    "gemini-2.5-flash": {
        "temperature": 0.2, "top_p": 0.95
    },
    "gemini-2.5-pro": {
        "temperature": 0.1, "top_p": 0.95
    },
    "o1-mini": {
        "temperature": 0.3, "top_p": 1.0
    },
    "microsoft/phi-4-reasoning-plus:free": {
        "temperature": 0.3, "top_p": 0.95
    },
    "qwen/qwen3-30b-a3b:free": {
        "temperature": 0.3, "top_p": 0.95
    },
    "google/gemini-2.5-flash-preview-05-20:thinking": {
        "temperature": 0.3, "top_p": 1.0
    },
    "google/gemini-2.5-pro": {
        "temperature": 0.1, "top_p": 1.0
    },
    "tngtech/deepseek-r1t-chimera:free": {
        "temperature": 0.3, "top_p": 0.95
    },
    "anthropic/claude-sonnet-4": {
        "temperature": 0.2, "top_p": 0.95
    },
    "anthropic/claude-3.5-haiku:beta": {
        "temperature": 0.2, "top_p": 0.95
    },
    "gpt-4o": {
        "temperature": 0.2, "top_p": 0.95
    },
    "gpt-4.1": {
        "temperature": 0.1, "top_p": 0.9  # This model ignores top_p
    },
    "openai/gpt-5": {
        "temperature": 0.1, "top_p": 0.95
    },
    "openai/gpt-4.1": {
        "temperature": 0.1, "top_p": 0.95
    },
    "openai/gpt-oss-120b": {
        "temperature": 0.2, "top_p": 0.95
    },
    "x-ai/grok-3-beta": {
        "temperature": 0.2, "top_p": 0.95
    },
    "rekaai/reka-flash-3:free": {
        "temperature": 0.2, "top_p": 0.95
    },
    "deepseek/deepseek-r1-distill-qwen-32b:free": {
        "temperature": 0.3, "top_p": 0.95
    },
    "mistralai/mistral-large-2411": {
        "temperature": 0.0, "top_p": 0.95
    },
    "deepseek/deepseek-chat-v3.1": {
        "temperature": 0.1, "top_p": 0.95
    },
    "deepcogito/cogito-v2-preview-deepseek-671b": {
        "temperature": 0.1, "top_p": 0.95
    },
}

#### Init models spec (only run one time for all)

In [None]:
models_spec={
    "llama3-70b-8192": {
        "name": "llama3-70b-8192",
        "api_host": "groq",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "llama-3.3-70b-versatile":{
        "name":"llama-3.3-70b-versatile",
        "api_host": "groq",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "meta-llama/llama-4-maverick-17b-128e-instruct":{
        "name": "llama-4-maverick-17b-128e-instruct",
        "api_host": "groq",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "meta-llama/llama-4-scout-17b-16e-instruct":{
        "name": "llama-4-scout-17b-16e-instruct",
        "api_host": "groq",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "qwen-qwq-32b":{
        "name": "qwen-qwq-32b",
        "api_host": "groq",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gemini-1.5-flash":{
        "name": "gemini-1.5-flash",
        "api_host": "google",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gemma-3-27b-it":{
        "name": "gemma-3-27b-it",
        "api_host": "google",
        "json_format": False,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gemini-2.0-flash":{
        "name": "gemini-2.0-flash",
        "api_host": "google",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gemini-2.5-flash-preview-04-17":{
        "name": "gemini-2.5-flash-preview-04-17",
        "api_host": "google",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gemini-2.5-flash":{
        "name": "gemini-2.5-flash",
        "api_host": "google",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gemini-2.5-pro":{
        "name": "gemini-2.5-pro",
        "api_host": "google",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "deepseek-r1-distill-llama-70b":{
        "name": "deepseek-r1-distill-llama-70b",
        "api_host": "groq",
        "json_format": False,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gpt-3.5-turbo-0125":{
        "name": "gpt-3.5-turbo-0125",
        "api_host": "openai",
        "json_format": "json_object",
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gpt-4.1-mini":{
        "name": "gpt-4.1-mini",
        "api_host": "openai",
        "json_format": "json_object",
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gpt-4o":{
        "name": "gpt-4o",
        "api_host": "openai",
        "json_format": "json_object",
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "gpt-4.1":{
        "name": "gpt-4.1",
        "api_host": "openai",
        "json_format": "json_object",
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "openai/o1-preview":{
        "name": "o1-preview",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "openai/gpt-5": {
        "name": "gpt-5",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "openai/gpt-4.1":{
        "name": "gpt-4.1",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "openai/gpt-oss-120b":{
        "name": "gpt-oss-120b",
        "api_host": "groq",
        "json_format": False,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "microsoft/phi-4-reasoning-plus:free":{
        "name": "phi-4-reasoning-plus",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "google/gemini-2.5-flash-preview-05-20:thinking":{
        "name": "gemini-2.5-flash-preview-05-20",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "google/gemini-2.5-pro":{
        "name": "gemini-2.5-pro-preview",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "qwen/qwen3-30b-a3b:free":{
        "name": "qwen3-30b-a3b",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "qwen/qwq-32b:free":{
        "name": "qwen-qwq-32b",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "deepseek/deepseek-chat":{
        "name": "deepseek-chat",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "deepseek/deepseek-r1-distill-qwen-32b:free":{
        "name": "deepseek-r1-distill-qwen-32b",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    'tngtech/deepseek-r1t-chimera:free':{
        "name": 'deepseek-r1t-chimera',
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "deepseek/deepseek-prover-v2:free":{
        "name": "deepseek-prover-v2",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "deepseek/deepseek-prover-v2":{
        "name": "deepseek-prover-v2",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "mistralai/mistral-small-3.1-24b-instruct:free":{
        "name": "mistral-small-3.1-24b-instruct",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "anthropic/claude-sonnet-4":{
        "name": "claude-sonnet-4",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "anthropic/claude-3.5-haiku:beta":{
        "name": "claude-3.5-haiku",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "x-ai/grok-3-beta":{
        "name": "grok-3-beta",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "rekaai/reka-flash-3:free":{
        "name": "reka-flash-3",
        "api_host": "openrouter",
        "json_format": True,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "qwen/qwen3-32b":{
        "name": "qwen3-32b",
        "api_host": "groq",
        "json_format": False,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "mistralai/mistral-large-2411":{
        "name": "mistral-large-2411",
        "api_host": "openrouter",
        "json_format": False,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "deepseek/deepseek-chat-v3.1":{
        "name": "deepseek-chat-v3.1",
        "api_host": "openrouter",
        "json_format": False,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
    "deepcogito/cogito-v2-preview-deepseek-671b":{
        "name": "cogito-v2-671b",
        "api_host": "openrouter",
        "json_format": False,
        "available":True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    },
}


with open("drive/MyDrive/TableQA/models_spec.json", "w") as f:
    json.dump(models_spec, f, indent=2)
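
Every entry in `models_spec` repeats the same availability flag and zeroed counters; only `name`, `api_host`, and `json_format` vary. A small factory could build the same entries with less repetition. This is a sketch only: `make_spec` and `compact_spec` are hypothetical names, and the three models shown are copied from the dictionary above.

```python
def make_spec(name, api_host, json_format):
    """Build one models_spec entry with the model marked available and all counters at zero."""
    return {
        "name": name,
        "api_host": api_host,
        "json_format": json_format,
        "available": True,
        "valid_table_generated": 0,
        "table_generated": 0,
        "calls": 0,
        "valid_out_format": 0,
    }

# Same shape as the hand-written entries above, for three of the models.
compact_spec = {
    "gemini-2.5-pro": make_spec("gemini-2.5-pro", "google", True),
    "gpt-4.1": make_spec("gpt-4.1", "openai", "json_object"),
    "qwen/qwen3-32b": make_spec("qwen3-32b", "groq", False),
}
print(compact_spec["gpt-4.1"])
```

Each call returns a fresh dictionary, so updating the counters of one model does not affect another.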

In [None]:
with open("drive/MyDrive/TableQA/time.json", "w") as f:
    time_file = {'start_time': datetime.now().isoformat()}
    json.dump(time_file, f, indent=2)

#### Load models spec

In [38]:
with open("drive/MyDrive/TableQA/models_spec.json", "r") as f:
    models_spec = json.load(f)

In [39]:
qa_evaluators=["gemini-2.5-pro", "gpt-4.1", "mistralai/mistral-large-2411", "deepseek/deepseek-chat-v3.1", "deepcogito/cogito-v2-preview-deepseek-671b"]

qa_generators=["gemini-2.0-flash", "qwen/qwen3-32b", "openai/gpt-5", "openai/gpt-oss-120b", "qwen/qwen3-30b-a3b:free",
               "gemini-2.5-flash", "anthropic/claude-sonnet-4", "x-ai/grok-3-beta", 'gemini-2.5-pro']

"""
table_generators=["gemini-2.0-flash", "qwen/qwen3-30b-a3b:free", 'tngtech/deepseek-r1t-chimera:free', "anthropic/claude-3.5-haiku:beta", "rekaai/reka-flash-3:free", "deepseek/deepseek-r1-distill-qwen-32b:free",
                 ]
"""

table_generators=["gemini-2.5-flash", "gemini-2.5-pro", "anthropic/claude-sonnet-4", "gpt-4.1", "x-ai/grok-3-beta"]


In [40]:
api_keys = {'groq': groq_key, 'google': google_key, 'openai': openai_key, 'openrouter': openrouter_key}

In [41]:
config = {'TABLE_INSTRUCT_LATEX': TABLE_INSTRUCT_LATEX,
          'QA_INSTRUCT': QA_INSTRUCT,
          'QA_EVAL_INSTRUCT': QA_EVAL_INSTRUCT,
          'generation_settings': generation_settings,
          'models_spec': models_spec,
          'qa_evaluators': qa_evaluators,
          'qa_generators': qa_generators,
          'table_generators': table_generators,
          'models_spec_path': "drive/MyDrive/TableQA/models_spec.json",
          'time_path': "drive/MyDrive/TableQA/time.json",
          'latex_output_dir': "drive/MyDrive/TableQA/dataset/Latex",
          }

### Visual-TableQA class

In [17]:
# TableQA v2 class
class TableQA_v2:
  """ Creates a Synthetic Dataset on Tables for QA Reasoning and Recognition Tasks."""
  def __init__(self, api_keys, config):
      self.config = config

      # Setting up the LLM hosts
      self.groq_client = Groq(api_key=api_keys['groq'])
      genai.configure(api_key=api_keys['google'])
      self.openai_client = OpenAI(api_key=api_keys['openai'])
      self.openrouter_key = api_keys['openrouter']
      self.update_models_usage()

  def update_models_usage(self):
      if update_start_time(self.config['time_path']):
          # reset the counters
          for model in self.config['models_spec']:
              self.config['models_spec'][model]['available'] = True

          print("Models usage reinitialized")

      # save the models usage
      with open(self.config['models_spec_path'], "w") as f:
          json.dump(self.config['models_spec'], f, indent=2)


  def model_selector(self, task):
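      model_list = 'table_generators' if task=='table' else 'qa_generators'
      weights = [int(self.config['models_spec'][model]['available']) for model in self.config[model_list]]
      # random.choices raises ValueError when every weight is zero, so fail with a clearer message
      if not any(weights):
          raise RuntimeError(f"No available model left in {model_list}")
      model_id= random.choices(range(len(weights)), weights=weights, k=1)[0]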
      model_name = self.config[model_list][model_id]
      print("----------------------selected model ", model_name)
      return model_name

  def get_settings(self, model_name):
      if model_name in self.config['generation_settings']:
          temperature = self.config['generation_settings'][model_name]['temperature']
          top_p = self.config['generation_settings'][model_name]['top_p']
      else:
          print(model_name, " does not have generation settings")
          temperature, top_p = 0.2, 0.9
      return temperature, top_p

  def call_llm(self, model_name, prompt, max_tokens):
      json_format = self.config['models_spec'][model_name]['json_format']
      temperature, top_p = self.get_settings(model_name)

      api_host = self.config['models_spec'][model_name]['api_host']
      messages = prompt if api_host == 'google' else [{"role": "user", "content": prompt}]

      if api_host.lower() == "groq":
          completion = self.groq_client.chat.completions.create(model=model_name,messages=messages,temperature=temperature, top_p= top_p,
              max_completion_tokens=20000,stream=False,
              reasoning_format='hidden' if model_name in ["deepseek-r1-distill-llama-70b", "qwen/qwen3-32b", "openai/gpt-oss-120b"] else None,
              response_format={"type": "json_object"} if json_format else None,
              reasoning_effort='high' if model_name=="openai/gpt-oss-120b" else None,
          )
          response = completion.choices[0].message.content
          if completion.choices[0].finish_reason == "stop":
              return response if json_format else extract_json_blocks(response)
          else:
              print(completion.choices[0].finish_reason)
              raise Exception(f"API call ended before task. Reason: {completion.choices[0].finish_reason}")

      elif api_host.lower() == "google":
          model = genai.GenerativeModel(model_name)
          generation_config = {
                                "max_output_tokens": max_tokens,
                                "temperature": temperature,
                                "top_p": top_p,
                                "response_mime_type": "application/json" if json_format else None,
                                }
          response = model.generate_content(contents=messages, generation_config=generation_config)
          if (response.prompt_feedback is None or
                response.prompt_feedback.block_reason.name == "BLOCK_REASON_UNSPECIFIED"):
              return response.text if json_format else extract_json_blocks(response.text)
          else:
              print(response.prompt_feedback.block_reason.name)
              raise Exception(f"API call ended before task. Reason: {response.prompt_feedback.block_reason.name}")

      elif api_host.lower() == "openai":
          completion = self.openai_client.chat.completions.create(model=model_name,messages=messages,temperature=temperature,
                  top_p= top_p, max_completion_tokens=max_tokens,response_format= {"type": json_format})

          if completion.choices[0].finish_reason == "stop":
              return completion.choices[0].message.content
          else:
              print(completion.choices[0].finish_reason)
              raise Exception(f"API call ended before task. Reason: {completion.choices[0].finish_reason}")

      elif api_host.lower() == "openrouter":
          if model_name == "deepseek/deepseek-chat-v3.1":
                args = {"model": model_name, "messages": messages, "temperature": temperature, "top_p": top_p,
                        "reasoning": {"effort": "high", "exclude": False}}
          else:
              args = {"model": model_name, "messages": messages, "temperature": temperature, "top_p": top_p}

          if json_format:
              args["response_format"]= {"type": "json_object"}
              
          response = requests.post(url="https://openrouter.ai/api/v1/chat/completions",
                                   headers={"Authorization": f"Bearer {self.openrouter_key}", "Content-Type": "application/json"},
                                   data=json.dumps(args))
          response = response.json()
          if response['choices'][0]['finish_reason'] == "stop":
              return extract_json_blocks(response['choices'][0]['message']['content'])
          else:
              print(response['choices'][0]['finish_reason'])
              raise Exception(f"API call ended before task. Reason: {response['choices'][0]['finish_reason']}")

      else:
        raise Exception("Please choose api cloud host from 'google', 'groq', 'openai' and 'openrouter'.")


  def safe_call_llm(self, model_name, prompt, max_tokens=5000):
      #time.sleep(random.uniform(1, 3))
      try:
          return self.call_llm(model_name, prompt, max_tokens), None
      except Exception as e:
          # Handle API calls limits
          msg = read_Exception(e).lower()
          if any(keyword in msg for keyword in ["quota", "billing", "insufficient", "resource exhausted", "429", "402"]):
              self.config['models_spec'][model_name]['available'] = False
              print(model_name, "has reached its limit")
          elif 'ended' in msg:
              print(model_name, "needs more tokens to think")
          return None, msg


  def safe_json_loads(self, json_text):
      try:
          return json.loads(json_text), None
      except Exception as e:
          msg = str(e).lower()
          return None, msg

  def generate_synthetic_table(self, table_inspo, topic_inspo):

      table_instruct = self.config['TABLE_INSTRUCT_LATEX']
      # Prepare the query to generate the table
      query = f"{table_instruct}\n\nTable example:\n{table_inspo}" if table_inspo is not None else table_instruct
      query = f"{query}\n\nInspiration topics: {topic_inspo}" if topic_inspo is not None else query
      query += "\n\nGenerate tables"

      # Generate tables
      call_status = "failed"
      while call_status is not None:
          model = self.model_selector('table') # Select model
          table, call_status = self.safe_call_llm(model_name=model, prompt=query)
      print("Table model: ", model)

      table = extract_tables_from_text(table)
      return table, model

  def generate_reasoning_qa(self, table, instruction):
      # Prepare the query
      query = f"{instruction}\n\nTable:\n{table}\n\n"

      # Generate the questions and answers for reasoning capabilities
      call_status = "failed"
      while call_status is not None:
          model = self.model_selector('qa') # Select model
          qa_pairs, call_status = self.safe_call_llm(model_name=model, prompt=query)
      print("QA model: ", model, '\n\n')
      qa_pairs, json_load_status = self.safe_json_loads(qa_pairs)
      self.config['models_spec'][model]['calls'] += 1
      self.config['models_spec'][model]['valid_out_format'] += 1 if json_load_status is None else 0
      return qa_pairs, model, json_load_status

  def generate_dataset_rows(self, table, table_name):

      qa_instruction = self.config['QA_INSTRUCT']

      # Generate Reasoning Q&A
      json_load_status = "failed"
      while json_load_status is not None:
          qa_pairs, model, json_load_status = self.generate_reasoning_qa(table=table, instruction=qa_instruction)

      dataset_rows = []
      n_qa_pairs=min(len(qa_pairs['questions']), len(qa_pairs['answers']))
      if len(qa_pairs['questions']) != len(qa_pairs['answers']):
          print(f"Mismatch in QA pairs. {len(qa_pairs['questions'])} questions for {len(qa_pairs['answers'])} answers.")
      for k in range(n_qa_pairs):
          dataset_row = {
              'table_image': table_name,
              'question': qa_pairs['questions'][k],
              'answer': qa_pairs['answers'][k],
              'model_name': model,
          }
          dataset_rows.append(dataset_row)

      return dataset_rows

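  def evaluate_single_jury(self, model, qa_eval_prompt):
      # gpt-4.1 gets an explicit "JSON" mention in the prompt; for other juries the placeholder becomes a newline
      qa_eval_prompt = qa_eval_prompt.replace("JSON_mention", "JSON") if model == 'gpt-4.1' else qa_eval_prompt.replace("JSON_mention", "\n")
      for attempt in range(3):
          decision, call_status = self.safe_call_llm(model_name=model, prompt=qa_eval_prompt)
          if call_status is not None:
              continue  # Call failed, retry
          try:
              decision = json.loads(decision)
              verdict, reason = decision['decision'][0], decision['decision'][1]
          except (ValueError, KeyError, IndexError, TypeError) as e:
              # Malformed jury output counts as a failed attempt
              call_status = f"Malformed jury output: {e}"
              continue
          return {"verdict": verdict, "reason": reason, "success": True}
      # Exceeded retries, skip this jury
      return {"verdict": 0, "success": False, "error": call_status}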

  def evaluate_qa_pair(self, qa_eval_prompt):
      evaluations = {model: self.evaluate_single_jury(model, qa_eval_prompt) for model in self.config['qa_evaluators']}
        
      # Keep only successful juries
      successful = [evaluation for evaluation in evaluations.values() if evaluation.get("success")]
      verdicts = [evaluation["verdict"] for evaluation in successful]
      if not verdicts:
          majority = None
          print("no verdict")
      elif verdicts.count(1) > verdicts.count(0):
          majority = 1
      elif verdicts.count(1) < verdicts.count(0):
          majority = 0
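      else:
          # Tie: defer to gpt-4.1, but only if it is a jury and actually returned a verdict
          tie_breaker = evaluations.get("gpt-4.1", {})
          majority = tie_breaker["verdict"] if tie_breaker.get("success") else None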
            
      confidence = 0.0 if not verdicts else verdicts.count(majority) / len(verdicts)

      print(f"Verdict: {majority}, Confidence: {confidence}, Successful_juries: {len(successful)}")
      return {
          "evaluations": evaluations,
          "final_verdict": majority,
          "confidence": confidence,
          "successful_juries": len(successful),
          "total_juries": len(evaluations)
      }


  def generate_dataset(self, table_inspo=None, topic_inspo=None, show_img=False):
      # Generate synthetic table inspired by table_inspo
      try:
          synthetic_tables, table_model = self.generate_synthetic_table(table_inspo=table_inspo, topic_inspo=topic_inspo)
      except Exception as e:
          print(f"Table generation failed: {type(e).__name__}: {e}")
          return [], None

      dataset = []
      for (k,v) in synthetic_tables.items():
          self.config['models_spec'][table_model]['table_generated'] += 1
          try:
              # Save table as image
              table_name = generate_unique_filename(prefix=self.config['models_spec'][table_model]['name'])
              save_latex_table_as_image(table=v, table_name=table_name, output_dir=self.config['latex_output_dir'], show_img=show_img)
              self.config['models_spec'][table_model]['valid_table_generated'] += 1

              # Generate dataset rows
              dataset_rows = self.generate_dataset_rows(table=v, table_name=table_name)

              # Evaluate QA pairs using LLMs as juries
              qa_eval_prompt = self.config['QA_EVAL_INSTRUCT'] + f"Table: {v}\n" + f"Topics: {topic_inspo}\n"
              for row in dataset_rows:
                  evaluations = self.evaluate_qa_pair(f"{qa_eval_prompt}Question: {row['question']}\nAnswer: {row['answer']}\n")
                  row['decision'] = evaluations['final_verdict']
                  row['detailed_evaluations'] = evaluations

              dataset.extend(dataset_rows)
          except Exception as e:
              print(f"Skipping table {k}: {type(e).__name__}: {e}")

      self.update_models_usage()

      return dataset, synthetic_tables
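
The jury aggregation performed in `evaluate_qa_pair` can be looked at in isolation. The block below is a minimal sketch, not the class's own code: `aggregate_verdicts` and its `tie_breaker` parameter are hypothetical names, and it assumes each jury result is a dict with `verdict` (0 or 1) and `success` keys, as returned by `evaluate_single_jury`. The jury names in the demo are made up.

```python
from collections import Counter


def aggregate_verdicts(evaluations, tie_breaker=None):
    """Majority vote over successful juries (hypothetical helper).

    evaluations: dict mapping model name -> {"verdict": 0 or 1, "success": bool}
    tie_breaker: optional model name whose verdict settles a tie
    Returns (final_verdict, confidence); final_verdict is None when undecided.
    """
    # Ignore juries whose call failed
    successful = {m: e for m, e in evaluations.items() if e.get("success")}
    if not successful:
        return None, 0.0
    counts = Counter(e["verdict"] for e in successful.values())
    if counts[1] > counts[0]:
        majority = 1
    elif counts[1] < counts[0]:
        majority = 0
    elif tie_breaker in successful:
        # Tie: only a jury that actually answered may break it
        majority = successful[tie_breaker]["verdict"]
    else:
        return None, 0.0
    # Confidence is the share of successful juries agreeing with the outcome
    return majority, counts[majority] / len(successful)


juries = {
    "jury_a": {"verdict": 1, "success": True},
    "jury_b": {"verdict": 0, "success": True},
    "jury_c": {"verdict": 1, "success": True},
    "jury_d": {"verdict": 0, "success": False, "error": "timeout"},
}
print(aggregate_verdicts(juries))
```

Returning `None` for an unresolved tie keeps undecided rows distinguishable from rejected ones, which may matter when filtering the dataset later.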


## Generation

##### Table inspos

In [None]:
latex_table_inspo = []

folder_path = "drive/MyDrive/TableQA/dataset/Latex/inspo_3"
for file_name in os.listdir(folder_path):
    if file_name.endswith(".tex"):
        file_path = os.path.join(folder_path, file_name)
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                table = f.read()
                latex_table_inspo.append(table)
        except Exception as e:
            print(f"Error reading {file_path}: {e}")

##### Generation loop

In [None]:
from drive.MyDrive.TableQA.samples.synthetic_table_topics import topics

start_idx = 780
counter = 0
dataset = []

config['latex_output_dir']= "drive/MyDrive/TableQA/dataset/Latex/table_topic_4"

tqa = TableQA_v2(api_keys=api_keys, config=config)
# Build a timestamped JSONL output path; rows are appended to it after each iteration
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"data_{timestamp}.jsonl"
output_path = os.path.join(config['latex_output_dir'], filename)

while counter <100:
    print('\nIteration : ', counter)
    topic_inspo = topics[3*(counter+start_idx): 3*(counter+start_idx)+3]
    topic_inspo = ", ".join(topic_inspo)
    table_inspo = random.choice(latex_table_inspo)
    output, tables = tqa.generate_dataset(table_inspo=table_inspo, topic_inspo=topic_inspo, show_img=False)
    dataset.extend(output)
    if len(output)!=0:
        with open(output_path, "a", encoding="utf-8") as f:
            for entry in output:
                f.write(json.dumps(entry) + "\n")
    counter += 1

print('\n\n-------------valid_table_generated vs table_generated----------------\n')
pprint({k:(tqa.config['models_spec'][k]['valid_table_generated'], tqa.config['models_spec'][k]['table_generated']) for k in tqa.config['table_generators']})
print('\n\n-------------valid_output_format vs calls----------------\n')
pprint({k:(tqa.config['models_spec'][k]['valid_out_format'], tqa.config['models_spec'][k]['calls']) for k in tqa.config['qa_generators']})
print('\n\n-------------availability----------------\n')
pprint({k:tqa.config['models_spec'][k]['available'] for k in tqa.config['qa_generators']})


print(f"Saved {len(dataset)} datapoints to {output_path}")
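
Once the loop has finished, the JSONL file can be read back to keep only the rows the juries accepted. The following is a minimal sketch under stated assumptions: each line holds the row fields written by the loop (`table_image`, `question`, `answer`, `decision`), and a `decision` of 1 is taken to mean acceptance. `load_accepted_rows` is a hypothetical helper, and the demo rows are made up.

```python
import json
import os
import tempfile


def load_accepted_rows(jsonl_path):
    """Read a JSONL file and keep rows whose 'decision' is 1 (hypothetical helper)."""
    accepted = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # Skip blank lines
            row = json.loads(line)
            if row.get("decision") == 1:
                accepted.append(row)
    return accepted


# Demo on a throwaway file with made-up rows
demo_rows = [
    {"table_image": "t1", "question": "q1", "answer": "a1", "decision": 1},
    {"table_image": "t1", "question": "q2", "answer": "a2", "decision": 0},
    {"table_image": "t2", "question": "q3", "answer": "a3", "decision": None},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row) + "\n")
    accepted_rows = load_accepted_rows(demo_path)

print(len(accepted_rows), "of", len(demo_rows), "rows accepted")
```

Rows with a `decision` of `None` (no jury verdict) are dropped along with rejected ones here; a stricter or looser filter is a design choice left to the user.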
