# Gemini - Get Relevant Tables

In [2]:
%pip install google-generativeai cbsodata

Note: you may need to restart the kernel to use updated packages.

In [1]:
from google.generativeai import GenerativeModel
import google.generativeai as genai
import json
from typing import List, Dict, TypedDict
from ast import literal_eval
import cbsodata as cbs


In [3]:
class TableInfo(TypedDict):
    id: str
    description: str

In [4]:
def create_table_selector_prompt(tables: List[TableInfo], query: str) -> str:
    return f"""
    Given a user's query about data and a list of available tables, identify which tables would be most relevant for answering the query.
    
    Available tables:
    {json.dumps(tables, indent=2)}
    
    User query: {query}
    
    Select the table IDs that would be most relevant for answering this query. Consider:
    1. Direct matches between query topics and table descriptions
    2. Implicit relationships that might be needed to fully answer the query
    3. Select ALL tables that might contribute to a complete answer
    
    If no table is relevant, return an empty list.
    
    Provide your response as a single Python list of table ID strings, for example ["id1", "id2"]. Nothing else.
    """

In [5]:
def get_relevant_ids_with_gemini(
  model: GenerativeModel,
  tables: List[TableInfo],
  query: str
) -> List[str]:
  prompt = create_table_selector_prompt(tables, query)
  response = model.generate_content(prompt)
  return literal_eval(response.text) 
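
Calling `literal_eval` directly on `response.text` only works when the reply is exactly one list literal. A minimal sketch of a more tolerant parser follows, assuming the model may sometimes wrap the list in a code fence or add stray prose around it. `parse_id_list` is a hypothetical helper name, not part of the notebook or of any library.

```python
import re
from ast import literal_eval
from typing import List


def parse_id_list(text: str) -> List[str]:
    """Parse a model reply that should contain one Python-style list of strings."""
    # Take the span from the first "[" to the last "]", which skips any
    # code fence or surrounding prose (assumption: the reply holds one list).
    match = re.search(r"\[.*\]", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No list found in model reply: {text!r}")
    try:
        parsed = literal_eval(match.group(0))
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Could not parse model reply: {text!r}") from exc
    # Reject anything that is not a flat list of strings.
    if not isinstance(parsed, list) or not all(isinstance(item, str) for item in parsed):
        raise ValueError(f"Expected a list of strings, got: {parsed!r}")
    return parsed
```

With this helper, the last line of `get_relevant_ids_with_gemini` could become `return parse_id_list(response.text)`, so a malformed reply raises a `ValueError` that quotes the offending text.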
  

# CBS - Get Data From Gemini Response

## Get Table Descriptions and IDs

In [6]:
def extract_table_info(table: Dict) -> TableInfo:
    return {
        "id": 
            table["Identifier"],
        "description": 
            table["Title"].replace('\n', ' ').replace('\r', ' ') + " " +
            table["Summary"].replace('\n', ' ').replace('\r', ' ') + " " +
            table["ShortDescription"].replace('\n', ' ').replace('\r', ' ')
    }
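
Replacing each `\n` and `\r` with a space can leave runs of consecutive spaces in the description. The variant below is a sketch that collapses every whitespace run into a single space. It also treats a missing or null field as an empty string, which is an assumption about the input data rather than something the notebook shows. `extract_table_info_compact` is a hypothetical name.

```python
from typing import Dict, TypedDict


class TableInfo(TypedDict):
    id: str
    description: str


def extract_table_info_compact(table: Dict) -> TableInfo:
    """Variant of extract_table_info that collapses whitespace runs."""
    # Assumption: a field may be absent or None; both become "".
    parts = [table.get(key) or "" for key in ("Title", "Summary", "ShortDescription")]
    # str.split() without arguments splits on any whitespace run (spaces, \n, \r, \t).
    description = " ".join(" ".join(parts).split())
    return {"id": table["Identifier"], "description": description}
```

Shorter descriptions also reduce the size of the prompt built from the full table list.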

In [7]:
def get_english_table_info(json_filename: str) -> List[TableInfo]:
  with open(json_filename, 'r') as file:
    english_tables = json.load(file)
  
  return [
    extract_table_info(table) for table in english_tables
  ]

In [None]:
table_info = get_english_table_info('english_tables.json')

print(len(table_info))
table_info[0]

1010


{'id': '80783eng',
 'description': "Agriculture; crops, livestock and land use by general farm type, region Agricultural census; crops, livestock, land use and corresponding number of holdings by general farm type and region  This table contains data on land use, arable farming, horticulture, grassland, grazing livestock and housed animals, at regional level, by general farm type.  The figures in this table are derived from the agricultural census. Data collection for the agricultural census is part of a combined data collection for a.o. agricultural policy use and enforcement of the manure law.  Regional breakdown is based on the main location of the holding. Due to this the region where activities (crops, animals) are allocated may differ from the location where these activities actually occur.  The agricultural census is also used as the basis for the European Farm Structure Survey (FSS). Data from the agricultural census do not fully coincide with the FSS. In the FSS years (2000, 2

## Test Gemini on English datasets

In [107]:
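# initialize the model; read the API key from the environment instead of hard-coding it
import os

# GOOGLE_API_KEY is an assumed variable name; set it before starting the kernel
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(model_name='gemini-1.5-flash')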

In [None]:
relevant_ids = get_relevant_ids_with_gemini(
  model=model,
  tables=table_info,  # reuse the list loaded above instead of re-reading the file
  query="What are the CO2 emissions by sector?"
)

relevant_ids

['83300ENG', '85669ENG', '84917ENG', '84918ENG']
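
The IDs above end in upper-case `ENG`, while the sample catalog entry printed earlier is `80783eng`, so an exact string comparison against the catalog could miss valid tables. The sketch below filters the model's answer against the known IDs case-insensitively and drops anything the catalog does not contain. `keep_known_ids` is a hypothetical helper, and whether every catalog ID is lower-case is not verified here.

```python
from typing import Dict, List


def keep_known_ids(candidate_ids: List[str], tables: List[Dict[str, str]]) -> List[str]:
    """Return catalog IDs matching the candidates, compared case-insensitively."""
    # Map lower-cased ID -> ID exactly as it is spelled in the catalog.
    known = {table["id"].lower(): table["id"] for table in tables}
    kept: List[str] = []
    for candidate in candidate_ids:
        canonical = known.get(candidate.strip().lower())
        # Skip IDs the catalog does not contain, and skip duplicates.
        if canonical is not None and canonical not in kept:
            kept.append(canonical)
    return kept
```

Calling `keep_known_ids(relevant_ids, table_info)` before any CBS lookup would filter out IDs that the model made up.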

In [114]:
len(relevant_ids)

4

In [115]:
# print the title of each selected table (avoid shadowing the built-in id)
for table_id in relevant_ids[:10]:
  print(cbs.get_info(table_id)['Title'])

Emissions to air by the Dutch economy; national accounts 
Emissions of greenhouse gases according to IPCC guide-lines	
Renewable energy; consumption by energy source, technology and application
Avoided use of fossil energy and emission of CO2
