# Query Electoral Database

This notebook shows how to query the SQLite database created by the pipeline and retrieve data as DataFrames for analysis.

## Setup


In [1]:
import sys
from pathlib import Path

# Add src to path
sys.path.insert(0, str(Path.cwd() / 'src'))

from analytics.clean_votes import CleanVotesOrchestrator
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt

%matplotlib inline


## 1. Initialize Orchestrator

`CleanVotesOrchestrator` wraps the SQLite database and exposes methods for listing and loading elections.


In [2]:
# Initialize with an explicit database path.
# Path.cwd().parent assumes the notebook runs one level below the project root,
# so this points to the root-level data/processed/ directory, not analytics/data/processed/
db_path = Path.cwd().parent / "data" / "processed" / "electoral_data.db"
orchestrator = CleanVotesOrchestrator(db_path=str(db_path))

print(f"Database: {orchestrator.db_path}")
print(f"Database exists: {orchestrator.db_path.exists()}")


2026-01-22 14:49:51,924 - analytics.clean_votes.database - INFO - Metadata table initialized successfully
2026-01-22 14:49:51,925 - analytics.clean_votes.database - INFO - Database initialized at: /Users/hectorcorro/Documents/Labex/ine-shapefiles-downloader/data/processed/electoral_data.db
2026-01-22 14:49:51,925 - analytics.clean_votes.database - INFO - Database exists: True
2026-01-22 14:49:51,926 - analytics.clean_votes.orchestrator - INFO - Orchestrator initialized with database: /Users/hectorcorro/Documents/Labex/ine-shapefiles-downloader/data/processed/electoral_data.db


Database: /Users/hectorcorro/Documents/Labex/ine-shapefiles-downloader/data/processed/electoral_data.db
Database exists: True


## 2. List All Available Elections

See what's in the database.


In [3]:
# List all elections in database
elections = orchestrator.list_available_elections()

print(f"Total elections: {len(elections)}")
print(f"\nColumns: {elections.columns.tolist()}")
print(f"\nFirst 10 elections:")
elections.head(10)


Total elections: 224

Columns: ['id', 'election_name', 'election_date', 'entidad_id', 'entidad_name', 'table_name', 'has_geometry', 'row_count', 'created_at', 'updated_at', 'source_file', 'shapefile_path', 'metadata_json']

First 10 elections:


|  | id | election_name | election_date | entidad_id | entidad_name | table_name | has_geometry | row_count | created_at | updated_at | source_file | shapefile_path | metadata_json |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 129 | PRES_2024 | 2024-06-03 | 1 | AGUASCALIENTES | election_pres_2024_01 | 1 | 659 | 2025-12-06 05:24:39 | 2026-01-21 22:11:27 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 1 | 161 | SEN_2024 | 2024-06-03 | 1 | AGUASCALIENTES | election_sen_2024_01 | 1 | 659 | 2025-12-06 05:25:29 | 2026-01-21 22:12:20 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 2 | 193 | DIP_FED_2024 | 2024-06-03 | 1 | AGUASCALIENTES | election_dip_fed_2024_01 | 1 | 659 | 2025-12-06 05:26:26 | 2026-01-21 22:13:18 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 3 | 130 | PRES_2024 | 2024-06-03 | 2 | BAJA CALIFORNIA | election_pres_2024_02 | 1 | 2096 | 2025-12-06 05:24:39 | 2026-01-21 22:11:28 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 4 | 162 | SEN_2024 | 2024-06-03 | 2 | BAJA CALIFORNIA | election_sen_2024_02 | 1 | 2096 | 2025-12-06 05:25:30 | 2026-01-21 22:12:20 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 5 | 194 | DIP_FED_2024 | 2024-06-03 | 2 | BAJA CALIFORNIA | election_dip_fed_2024_02 | 1 | 2096 | 2025-12-06 05:26:26 | 2026-01-21 22:13:19 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 6 | 131 | PRES_2024 | 2024-06-03 | 3 | BAJA CALIFORNIA SUR | election_pres_2024_03 | 1 | 520 | 2025-12-06 05:24:39 | 2026-01-21 22:11:28 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 7 | 163 | SEN_2024 | 2024-06-03 | 3 | BAJA CALIFORNIA SUR | election_sen_2024_03 | 1 | 520 | 2025-12-06 05:25:30 | 2026-01-21 22:12:20 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 8 | 195 | DIP_FED_2024 | 2024-06-03 | 3 | BAJA CALIFORNIA SUR | election_dip_fed_2024_03 | 1 | 520 | 2025-12-06 05:26:26 | 2026-01-21 22:13:19 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |
| 9 | 132 | PRES_2024 | 2024-06-03 | 4 | CAMPECHE | election_pres_2024_04 | 1 | 543 | 2025-12-06 05:24:39 | 2026-01-21 22:11:28 | /Users/hectorcorro/Documents/Labex/ine-shapefi... |  |  |


In [4]:
# See unique elections (not by entidad)
unique_elections = elections['election_name'].unique()
print(f"Unique elections: {len(unique_elections)}")
for election in unique_elections:
    count = len(elections[elections['election_name'] == election])
    print(f"  {election}: {count} entidades")


Unique elections: 7
  PRES_2024: 32 entidades
  SEN_2024: 32 entidades
  DIP_FED_2024: 32 entidades
  DIP_FED_2021: 32 entidades
  DIP_FED_2018: 32 entidades
  PRES_2018: 32 entidades
  SEN_2018: 32 entidades
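
The same per-election count can be computed in one step with a `groupby`. The sketch below uses a small hand-built stand-in for the metadata DataFrame (`elections_demo` is a hypothetical name, and only two of the real columns are reproduced), so it runs without the database.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by list_available_elections();
# only the two columns needed for the count are included.
elections_demo = pd.DataFrame({
    "election_name": ["PRES_2024", "SEN_2024", "PRES_2024", "SEN_2024", "PRES_2024"],
    "entidad_id": [1, 1, 2, 2, 3],
})

# Count distinct entidades per election; sort=False keeps first-seen order.
coverage = elections_demo.groupby("election_name", sort=False)["entidad_id"].nunique()

for election, count in coverage.items():
    print(f"  {election}: {count} entidades")
```

Applied to the real `elections` DataFrame, a count below 32 for any election would point to a missing entidad.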


## 3. Load Specific Election Data

Load data for a specific election and state.


In [5]:
# Load Presidential 2024 data for Aguascalientes (entidad_id=1)
df_ags = orchestrator.load_election_data(
    election_name='PRES_2024',
    entidad_id=1,
    as_geodataframe=False  # Load as regular DataFrame
)

print(f"Shape: {df_ags.shape}")
print(f"Columns: {df_ags.columns.tolist()[:10]}...")
df_ags.head()


2026-01-22 14:23:49,946 - analytics.clean_votes.database - INFO - Loading data from table: election_pres_2024_01
2026-01-22 14:23:49,975 - analytics.clean_votes.database - INFO - Loaded 659 rows


Shape: (659, 60)
Columns: ['ID', 'ENTIDAD_gdf', 'DISTRITO_F', 'DISTRITO_L', 'MUNICIPIO', 'SECCION', 'TIPO', 'CONTROL', 'GEOMETRY1_', 'geometry']...


|  | ID | ENTIDAD_gdf | DISTRITO_F | DISTRITO_L | MUNICIPIO | SECCION | TIPO | CONTROL | GEOMETRY1_ | geometry | ... | PT-MORENA_PCT | NO_REGISTRADAS_PCT | NULOS_PCT | ENTIDAD | DISTRITO_FEDERAL_y | ID_DISTRITO_FEDERAL | ID_DISTRITO_FEDERAL_STR | ID_ENTIDAD_STR | SECCION_STR | crs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202.0 | 1 | 3 | 14 | 1 | 212 | 2 | 257.0 |  | POLYGON ((-102.28055860116653 21.8623827616314... | ... | 0.0 | 0.253807 | 2.030457 | AGUASCALIENTES | 0.0 | 3 | 3 | 1 | 212 | EPSG:4326 |
| 1 | 23.0 | 1 | 3 | 6 | 1 | 23 | 2 | 251.0 |  | POLYGON ((-102.30328016006439 21.8947084927160... | ... | 0.253807 | 0.253807 | 1.015228 | AGUASCALIENTES | 0.0 | 3 | 3 | 1 | 23 | EPSG:4326 |
| 2 | 264.0 | 1 | 3 | 17 | 1 | 274 | 2 | 473.0 |  | POLYGON ((-102.31238119212288 21.8795593677940... | ... | 0.196078 | 0.588235 | 2.352941 | AGUASCALIENTES | 0.0 | 3 | 3 | 1 | 274 | EPSG:4326 |
| 3 | 228.0 | 1 | 3 | 17 | 1 | 238 | 2 | 289.0 |  | POLYGON ((-102.30477271464954 21.8535690153630... | ... | 0.970874 | 0.0 | 3.640777 | AGUASCALIENTES | 0.0 | 3 | 3 | 1 | 238 | EPSG:4326 |
| 4 | 476.0 | 1 | 2 | 13 | 1 | 499 | 2 | 748.0 |  | POLYGON ((-102.25237580263585 21.9050264458675... | ... | 0.0 | 0.0 | 0.790514 | AGUASCALIENTES | 0.0 | 2 | 2 | 1 | 499 | EPSG:4326 |
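
With 60 columns per table, it can help to select the percentage columns by name pattern instead of listing them. The sketch below is only an illustration: `sample` is a hypothetical three-row frame with values copied from the Aguascalientes table, and it assumes every column ending in `_PCT` is a vote share. In the real table that pattern also matches `NULOS_PCT` and `NO_REGISTRADAS_PCT`, which you would probably want to exclude before ranking parties.

```python
import pandas as pd

# Hypothetical sample: three Aguascalientes sections with two party columns.
sample = pd.DataFrame({
    "SECCION": [212, 23, 274],
    "MORENA_PCT": [38.071066, 20.050761, 20.588235],
    "PAN_PCT": [35.532995, 63.451777, 50.588235],
    "TOTAL_VOTOS_SUM": [394.0, 394.0, 510.0],
})

# Select every column whose name ends in _PCT.
pct_cols = sample.filter(regex=r"_PCT$").columns.tolist()

# For each section, find the column with the highest share and strip the suffix.
sample["LEADER"] = (
    sample[pct_cols].idxmax(axis=1).str.replace("_PCT", "", regex=False)
)

print(sample[["SECCION", "LEADER"]].to_string(index=False))
```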


## 4. Load with Geometry (GeoDataFrame)

Load data with geometry for spatial analysis and mapping.


In [8]:
# Load with geometry (now works for all states after geometry fix!)
# Testing with Baja California Sur (entidad_id=3) which previously had no geometry
gdf_bcs = orchestrator.load_election_data(
    election_name='PRES_2024',
    entidad_id=3,  # Baja California Sur
    as_geodataframe=True  # Load as GeoDataFrame
)

print(f"Type: {type(gdf_bcs)}")
print(f"Has geometry: {'geometry' in gdf_bcs.columns}")
print(f"CRS: {gdf_bcs.crs if hasattr(gdf_bcs, 'crs') else 'N/A'}")
print(f"Rows with geometry: {gdf_bcs['geometry'].notna().sum()}/{len(gdf_bcs)}")

# Quick map
gdf_bcs.explore(column='MORENA_PCT', cmap='RdYlBu_r', legend=True, 
                tiles='CartoDB positron', tooltip=['SECCION', 'MORENA_PCT'])


2026-01-22 14:25:48,563 - analytics.clean_votes.database - INFO - Loading data from table: election_pres_2024_03
2026-01-22 14:25:48,622 - analytics.clean_votes.database - INFO - Loaded 520 rows
2026-01-22 14:25:48,641 - analytics.clean_votes.database - INFO - Converted to GeoDataFrame


Type: <class 'geopandas.geodataframe.GeoDataFrame'>
Has geometry: True
CRS: EPSG:4326
Rows with geometry: 520/520


## 5. Query Multiple States

Load and compare multiple states.


In [4]:
# Load multiple states
states_to_compare = {
    1: 'Aguascalientes',
    9: 'CDMX',
    15: 'Estado de México',
    19: 'Nuevo León'
}

results = []

for entidad_id, name in states_to_compare.items():
    try:
        df = orchestrator.load_election_data('PRES_2024', entidad_id)
        results.append({
            'Entidad': name,
            'Sections': len(df),
            'Total Votes': df['TOTAL_VOTOS_SUM'].sum(),
            'MORENA %': df['MORENA_PCT'].mean(),
            'PAN %': df['PAN_PCT'].mean(),
            'PRI %': df['PRI_PCT'].mean()
        })
    except ValueError as e:
        print(f"No data for {name}: {e}")

comparison = pd.DataFrame(results)
comparison

2026-01-22 14:50:14,187 - analytics.clean_votes.database - INFO - Loading data from table: election_pres_2024_01
2026-01-22 14:50:14,208 - analytics.clean_votes.database - INFO - Loaded 659 rows
2026-01-22 14:50:14,208 - analytics.clean_votes.database - INFO - Loading data from table: election_pres_2024_09
2026-01-22 14:50:14,236 - analytics.clean_votes.database - INFO - Loaded 2149 rows
2026-01-22 14:50:14,237 - analytics.clean_votes.database - INFO - Loading data from table: election_pres_2024_15
2026-01-22 14:50:14,263 - analytics.clean_votes.database - INFO - Loaded 2102 rows
2026-01-22 14:50:14,263 - analytics.clean_votes.database - INFO - Loading data from table: election_pres_2024_19
2026-01-22 14:50:14,288 - analytics.clean_votes.database - INFO - Loaded 2005 rows


|  | Entidad | Sections | Total Votes | MORENA % | PAN % | PRI % |
|---|---|---|---|---|---|---|
| 0 | Aguascalientes | 659 | 613312.0 | 34.245497 | 34.655198 | 5.66364 |
| 1 | CDMX | 2149 | 1923711.0 | 42.037844 | 22.280582 | 6.71471 |
| 2 | Estado de México | 2102 | 2815750.0 | 45.962933 | 11.117094 | 12.135411 |
| 3 | Nuevo León | 2005 | 1720175.0 | 27.979297 | 19.922785 | 11.919202 |
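
Note that `df['MORENA_PCT'].mean()` is an unweighted mean over sections, so a section with 250 votes counts as much as one with 1,200. If a state-level vote share is what you want, a vote-weighted average is one option. The sketch below assumes each `*_PCT` value is a percentage of that section's `TOTAL_VOTOS_SUM` (the rows in section 6 are consistent with this, but the pipeline code would confirm it). `weighted_pct` is a hypothetical helper, not part of the orchestrator.

```python
import pandas as pd

# Five section rows copied from the election_pres_2024_01 query output in section 6.
sections = pd.DataFrame({
    "SECCION": [212, 23, 274, 238, 499],
    "MORENA_PCT": [38.071066, 20.050761, 20.588235, 49.757282, 45.059289],
    "TOTAL_VOTOS_SUM": [394.0, 394.0, 510.0, 412.0, 253.0],
})


def weighted_pct(df, pct_col, weight_col="TOTAL_VOTOS_SUM"):
    """Vote-weighted average of a section-level percentage column."""
    weights = df[weight_col]
    return (df[pct_col] * weights).sum() / weights.sum()


unweighted = sections["MORENA_PCT"].mean()
weighted = weighted_pct(sections, "MORENA_PCT")
print(f"Unweighted mean: {unweighted:.2f}")
print(f"Vote-weighted mean: {weighted:.2f}")
```

On these five rows the two figures differ by more than a percentage point, which is why the choice is worth stating next to any comparison table.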


## 6. Direct SQL Queries

You can also query directly using SQL if needed.


In [5]:
import sqlite3

# Connect to database
db_path = orchestrator.db_path
conn = sqlite3.connect(db_path)

# Query election metadata
query = """
SELECT election_name, entidad_name, row_count, has_geometry, created_at
FROM election_metadata
WHERE election_name = 'PRES_2024'
ORDER BY entidad_id
"""

metadata_df = pd.read_sql_query(query, conn)
print(f"Found {len(metadata_df)} records for PRES_2024")
metadata_df.head(10)

Found 32 records for PRES_2024


|  | election_name | entidad_name | row_count | has_geometry | created_at |
|---|---|---|---|---|---|
| 0 | PRES_2024 | AGUASCALIENTES | 659 | 1 | 2025-12-06 05:24:39 |
| 1 | PRES_2024 | BAJA CALIFORNIA | 2096 | 1 | 2025-12-06 05:24:39 |
| 2 | PRES_2024 | BAJA CALIFORNIA SUR | 520 | 1 | 2025-12-06 05:24:39 |
| 3 | PRES_2024 | CAMPECHE | 543 | 1 | 2025-12-06 05:24:39 |
| 4 | PRES_2024 | COAHUILA | 1718 | 1 | 2025-12-06 05:24:39 |
| 5 | PRES_2024 | COLIMA | 385 | 1 | 2025-12-06 05:24:39 |
| 6 | PRES_2024 | CHIAPAS | 2008 | 1 | 2025-12-06 05:24:39 |
| 7 | PRES_2024 | CHIHUAHUA | 2042 | 1 | 2025-12-06 05:24:39 |
| 8 | PRES_2024 | CIUDAD DE MEXICO | 2149 | 1 | 2025-12-06 05:24:39 |
| 9 | PRES_2024 | DURANGO | 1358 | 1 | 2025-12-06 05:24:39 |
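
When the election name comes from a variable, it is safer to pass it as a bound parameter than to format it into the SQL string. The sketch below uses an in-memory SQLite database as a stand-in for `election_metadata` (with only four of its columns) so that it runs on its own. `rows_for` is a hypothetical helper.

```python
import sqlite3

# In-memory stand-in for election_metadata, with a few rows from the output above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE election_metadata ("
    "election_name TEXT, entidad_id INTEGER, entidad_name TEXT, row_count INTEGER)"
)
conn.executemany(
    "INSERT INTO election_metadata VALUES (?, ?, ?, ?)",
    [
        ("PRES_2024", 1, "AGUASCALIENTES", 659),
        ("PRES_2024", 2, "BAJA CALIFORNIA", 2096),
        ("SEN_2024", 1, "AGUASCALIENTES", 659),
    ],
)


def rows_for(connection, election_name):
    """Return (entidad_name, row_count) pairs for one election."""
    query = (
        "SELECT entidad_name, row_count FROM election_metadata "
        "WHERE election_name = ? ORDER BY entidad_id"
    )
    # The ? placeholder is filled by SQLite, not by string formatting.
    return connection.execute(query, (election_name,)).fetchall()


pres_rows = rows_for(conn, "PRES_2024")
print(pres_rows)
conn.close()
```

`pd.read_sql_query` accepts the same placeholders through its `params` argument, so the notebook query could be written with `WHERE election_name = ?` and `params=('PRES_2024',)`.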


In [6]:
# Query specific election table directly
query = """
SELECT ID_ENTIDAD, SECCION, ENTIDAD, MORENA_PCT, PAN_PCT, TOTAL_VOTOS_SUM
FROM election_pres_2024_01
LIMIT 10
"""

direct_df = pd.read_sql_query(query, conn)
direct_df


|  | ID_ENTIDAD | SECCION | ENTIDAD | MORENA_PCT | PAN_PCT | TOTAL_VOTOS_SUM |
|---|---|---|---|---|---|---|
| 0 | 1 | 212 | AGUASCALIENTES | 38.071066 | 35.532995 | 394.0 |
| 1 | 1 | 23 | AGUASCALIENTES | 20.050761 | 63.451777 | 394.0 |
| 2 | 1 | 274 | AGUASCALIENTES | 20.588235 | 50.588235 | 510.0 |
| 3 | 1 | 238 | AGUASCALIENTES | 49.757282 | 24.029126 | 412.0 |
| 4 | 1 | 499 | AGUASCALIENTES | 45.059289 | 21.343874 | 253.0 |
| 5 | 1 | 500 | AGUASCALIENTES | 38.284519 | 24.267782 | 478.0 |
| 6 | 1 | 239 | AGUASCALIENTES | 53.864734 | 19.323671 | 414.0 |
| 7 | 1 | 501 | AGUASCALIENTES | 41.833333 | 26.166667 | 600.0 |
| 8 | 1 | 443 | AGUASCALIENTES | 43.206751 | 28.185654 | 1185.0 |
| 9 | 1 | 148 | AGUASCALIENTES | 41.534392 | 27.645503 | 756.0 |
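
Table names cannot be passed as bound parameters, so a query against a per-entidad table needs the name built in Python. The table names listed in the metadata (`election_pres_2024_01`, `election_dip_fed_2024_02`, and so on) suggest a pattern of lowercase election name plus a two-digit entidad id. The helper below is a hypothetical sketch of that inferred pattern with basic validation. The `table_name` column in `election_metadata` remains the authoritative source.

```python
import re


def table_name_for(election_name: str, entidad_id: int) -> str:
    """Build a table name following the pattern inferred from the metadata.

    Hypothetical helper: validates its inputs because the result is
    interpolated into SQL rather than bound as a parameter.
    """
    # Accept names such as PRES_2024 or DIP_FED_2021.
    if not re.fullmatch(r"[A-Z_]+_\d{4}", election_name):
        raise ValueError(f"Unexpected election name: {election_name!r}")
    # The database holds 32 entidades per election.
    if not 1 <= entidad_id <= 32:
        raise ValueError(f"Unexpected entidad_id: {entidad_id!r}")
    return f"election_{election_name.lower()}_{entidad_id:02d}"


print(table_name_for("PRES_2024", 1))
print(table_name_for("DIP_FED_2024", 2))
```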


In [7]:
# Close connection
conn.close()
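
If a cell fails before `conn.close()` runs, the connection stays open. One way to avoid that is `contextlib.closing`, sketched below with a hypothetical `fetch_all` helper. Using `sqlite3.connect(...)` directly in a `with` statement manages the transaction only and does not close the connection.

```python
import sqlite3
from contextlib import closing


def fetch_all(db_path, query, params=()):
    """Run one query and always close the connection afterwards."""
    with closing(sqlite3.connect(db_path)) as connection:
        return connection.execute(query, params).fetchall()


# Demonstration against an in-memory database, so no file is needed.
result = fetch_all(":memory:", "SELECT ? + ?", (2, 3))
print(result)
```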


## 7. Combine Multiple Elections for Time Series Analysis


In [10]:
# Compare MORENA performance across elections
elections_to_compare = ['PRES_2024', 'PRES_2018', 'DIP_FED_2021']
entidad_id = 9  # CDMX

temporal_data = []

for election in elections_to_compare:
    try:
        df = orchestrator.load_election_data(election, entidad_id)
        temporal_data.append({
            'Election': election,
            'MORENA %': df['MORENA_PCT'].mean(),
            'Sections': len(df),
            'Total Votes': df['TOTAL_VOTOS_SUM'].sum()
        })
    except ValueError:
        print(f"No data for {election}")

if temporal_data:
    temporal_df = pd.DataFrame(temporal_data)
    print("\nMORENA Performance Over Time (CDMX):")
    print(temporal_df.to_string(index=False))  # Text version
    
    # Also display as a proper table
    display(temporal_df)  # Explicit display call renders the DataFrame as a table

2026-01-22 15:04:37,004 - analytics.clean_votes.database - INFO - Loading data from table: election_pres_2024_09
2026-01-22 15:04:37,081 - analytics.clean_votes.database - INFO - Loaded 2149 rows
2026-01-22 15:04:37,088 - analytics.clean_votes.database - INFO - Loading data from table: election_pres_2018_09
2026-01-22 15:04:37,133 - analytics.clean_votes.database - INFO - Loaded 5528 rows
2026-01-22 15:04:37,134 - analytics.clean_votes.database - INFO - Loading data from table: election_dip_fed_2021_09
2026-01-22 15:04:37,167 - analytics.clean_votes.database - INFO - Loaded 5527 rows



MORENA Performance Over Time (CDMX):
    Election  MORENA %  Sections  Total Votes
   PRES_2024 42.037844      2149    1923711.0
   PRES_2018 48.125246      5528    5327994.0
DIP_FED_2021 39.862355      5527    3892400.0


|  | Election | MORENA % | Sections | Total Votes |
|---|---|---|---|---|
| 0 | PRES_2024 | 42.037844 | 2149 | 1923711.0 |
| 1 | PRES_2018 | 48.125246 | 5528 | 5327994.0 |
| 2 | DIP_FED_2021 | 39.862355 | 5527 | 3892400.0 |
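
The rows above follow the order of `elections_to_compare`, which is not chronological. For a time series it may help to derive the year from the election name and sort on it. The sketch below assumes the name always ends in a four-digit year, as all seven election names in this database do. `temporal_demo` is a hypothetical copy of the table above.

```python
import pandas as pd

# Values copied from the CDMX comparison table.
temporal_demo = pd.DataFrame({
    "Election": ["PRES_2024", "PRES_2018", "DIP_FED_2021"],
    "MORENA %": [42.037844, 48.125246, 39.862355],
    "Sections": [2149, 5528, 5527],
})

# Extract the trailing four-digit year and sort chronologically.
temporal_demo["Year"] = (
    temporal_demo["Election"].str.extract(r"_(\d{4})$", expand=False).astype(int)
)
temporal_sorted = temporal_demo.sort_values("Year").reset_index(drop=True)

print(temporal_sorted[["Year", "Election", "MORENA %", "Sections"]].to_string(index=False))
```

The section counts differ considerably between rows (2,149 against roughly 5,500), so it may be worth checking coverage before reading these means as a trend.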


## 8. Export to Other Formats

Export data for use in other tools.


In [None]:
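# Load data
df_export = orchestrator.load_election_data('PRES_2024', entidad_id=1)

# Create the output directory if needed (the export calls below do not create it)
Path('data/insights').mkdir(parents=True, exist_ok=True)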

# Export to CSV
df_export.to_csv('data/insights/pres_2024_aguascalientes.csv', index=False)
print("✓ Exported to CSV")

# Export to Parquet (smaller files and faster reads; requires pyarrow or fastparquet)
df_export.to_parquet('data/insights/pres_2024_aguascalientes.parquet', index=False)
print("✓ Exported to Parquet")

# If you have geometry, export to GeoJSON
gdf_export = orchestrator.load_election_data('PRES_2024', entidad_id=1, as_geodataframe=True)
if 'geometry' in gdf_export.columns:
    gdf_export.to_file('data/insights/pres_2024_aguascalientes.geojson', driver='GeoJSON')
    print("✓ Exported to GeoJSON")
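
A quick way to check an export is to read the file back and compare it with the original frame. The sketch below does this for CSV in a temporary directory. `df_demo` is a hypothetical two-row frame, and the directory layout simply mirrors the `data/insights` path used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame standing in for df_export.
df_demo = pd.DataFrame({
    "SECCION": [212, 23],
    "MORENA_PCT": [38.071066, 20.050761],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data" / "insights"
    # to_csv does not create missing parent directories, so create them first.
    out_dir.mkdir(parents=True, exist_ok=True)

    csv_path = out_dir / "demo.csv"
    df_demo.to_csv(csv_path, index=False)

    # Read the file back while the temporary directory still exists.
    round_trip = pd.read_csv(csv_path)

print(f"Rows written: {len(df_demo)}, rows read back: {len(round_trip)}")
print(f"Columns match: {list(round_trip.columns) == list(df_demo.columns)}")
```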


## Summary

You can query the database using:

1. **`orchestrator.list_available_elections()`** - See what's available
2. **`orchestrator.load_election_data(election_name, entidad_id)`** - Load specific data
3. **Direct SQL queries** - For custom queries
4. **Export to CSV/Parquet/GeoJSON** - For use in other tools

Each (election, entidad) pair is stored in its own table (for example, `election_pres_2024_01`), and the `election_metadata` table indexes them all.
