# Data Loading

We are provided with a set of CSV files. This notebook loads them into a local DuckDB database, checks the schema and a few sample rows of each resulting table, and finally starts DuckDB's integrated UI for some preliminary data exploration.

In [None]:
import duckdb
import pandas as pd
from pathlib import Path

In [None]:
# Connect to DuckDB (creates the database file if it doesn't exist)
con = duckdb.connect("data/aida_challenge.duckdb")
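
One caveat with the connection above: as far as I know, DuckDB creates the database file itself but not a missing parent directory, so `data/` has to exist beforehand. A minimal sketch of a guard follows; `ensure_parent_dir` is a hypothetical helper name, not part of DuckDB.

```python
from pathlib import Path


def ensure_parent_dir(db_path: str) -> Path:
    """Create the parent directory of db_path if needed and return the path."""
    path = Path(db_path)
    # parents=True builds intermediate folders; exist_ok=True makes re-runs harmless.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

It could be used as `duckdb.connect(str(ensure_parent_dir("data/aida_challenge.duckdb")))`.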

In [None]:
# Define data sources
data_files = {
    "clienti": "data/raw/clienti.csv",
    "polizze": "data/raw/polizze.csv",
    "sinistri": "data/raw/sinistri.csv",
    "reclami": "data/raw/reclami.csv",
    "abitazioni": "data/raw/abitazioni.csv",
    "interazioni_clienti": "data/raw/interazioni_clienti.csv",
    "competitor_prodotti": "data/raw/competitor_prodotti.csv",
}

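# Load each CSV into DuckDB, letting read_csv_auto infer column names and types.
# IF NOT EXISTS makes this cell safe to re-run, but an existing table is not refreshed.
for table_name, file_path in data_files.items():
    con.execute(
        f"""
        CREATE TABLE IF NOT EXISTS {table_name} AS
        SELECT * FROM read_csv_auto('{file_path}')
    """
    )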
    print(f"✓ Loaded {table_name}")

# Quick verification
print("\nTables in database:")
print(con.execute("SHOW TABLES").df())
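
The loading loop above stops at the first CSV it cannot read. A pre-flight check can report every missing file at once before any table is created. This is only a sketch using the standard library; `find_missing_files` is a hypothetical helper, and it assumes the paths are relative to the notebook's working directory.

```python
from pathlib import Path


def find_missing_files(sources: dict[str, str]) -> dict[str, str]:
    """Return the entries of sources whose file path does not exist on disk."""
    return {
        table: path
        for table, path in sources.items()
        if not Path(path).is_file()
    }
```

Calling `find_missing_files(data_files)` before the loop and raising a `FileNotFoundError` when the result is non-empty would surface all path problems in a single message.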

In [None]:
# Check schema and sample data for each table
for table_name in data_files:
    print(f"\n{'='*60}")
    print(f"TABLE: {table_name}")
    print(f"{'='*60}")

    # Show data types and column info
    print("\nSchema:")
    schema = con.execute(f"DESCRIBE {table_name}").df()
    print(schema.to_string(index=False))

    # Show sample rows
    print("\nSample data (first 3 rows):")
    sample = con.execute(f"SELECT * FROM {table_name} LIMIT 3").df()
    print(sample.to_string(index=False))

    # Show row count
    count = con.execute(f"SELECT COUNT(*) as count FROM {table_name}").fetchone()[0]
    print(f"\nTotal rows: {count:,}")
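
Both loops in this notebook interpolate table names directly into SQL through f-strings. That is fine for the hard-coded names in `data_files`, but it would break on a name containing spaces or special characters. A small sketch of standard SQL identifier quoting is shown here; `quote_identifier` is a hypothetical helper, not a DuckDB function.

```python
def quote_identifier(name: str) -> str:
    """Wrap name in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


print(quote_identifier("interazioni_clienti"))  # "interazioni_clienti"
```

The queries could then be written as, for example, `f"DESCRIBE {quote_identifier(table_name)}"`.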

We can easily set up a UI to visualize and explore the data:

In [None]:
con.execute("CALL start_ui()")