# 01. ETL Phase: Ingestion & Schema Generation

**Project:** RIWI Sales Analytics  
**Objective:** Ingest the raw sales data (`RWVentas.csv`), analyze its structure to generate a SQL schema automatically, and load it into a PostgreSQL staging area.

### Key Deliverables
1.  **Automated SQL Script:** A Python function generates the `CREATE TABLE` statement based on the CSV structure.
2.  **Staging Table:** Raw data loaded into the `raw_sales` table in the database.

### Imports & Setup

In [12]:
import pandas as pd
import sys
import os
from sqlalchemy import text

# Add src to path for database connection
sys.path.append(os.path.abspath(os.path.join('..', 'src')))
from database.db_connection import get_db_engine

### Load Raw Data

In [13]:
# Path to file
csv_path = '../data/raw/RWVentas.csv'

# Load data
try:
    df_raw = pd.read_csv(csv_path)
    print("Data loaded successfully.")
except Exception as e:
    print(f"Error loading CSV: {e}")
    raise  # Stop here: every later cell depends on df_raw

# Inspect first rows
df_raw.head()

Data loaded successfully.


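|   | Ciudad | Fecha | Producto | Tipo_Producto | Cantidad | Precio_Unitario | Tipo_Venta | Tipo_Cliente | Descuento | Costo_Envio | Total |
|---|--------|-------|----------|---------------|----------|-----------------|------------|--------------|-----------|-------------|-------|
| 0 | Antofagasta | 2025-11-28 | Leche | Alimento_Percedero | 2.0 | 1587.0 | Online | Minorista | 0.2 | 0.0 | 2539.0 |
| 1 | Monterrey | 2025-11-29 | Leche | Hogar | 5.0 |  | Call_Center | Mayorista | 0.2 | 10000.0 | 20412.0 |
| 2 | Valparaíso | 2025-12-07 | Café | Hogar | 1.0 | 3882.0 | Tienda_Física | Minorista | 0.0 | 0.0 | 3882.0 |
| 3 | Sevilla | 2025-12-01 | Té | Snack | 5.0 | 2060.0 | Distribuidor | Corporativo | 0.15 | 0.0 | 8755.0 |
| 4 | Sevilla | 2025-11-18 | Chocolate | Snack | 1.0 | 3712.0 | Online | Minorista | 0.05 | 250000.0 | 8526.0 |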


## 2. Schema Analysis & SQL Script Generation

**Requirement:** The project requires a "Database Creation Script". 

Instead of writing SQL by hand, we automate this step. The following function maps pandas dtypes to SQL types (e.g., `object` -> `VARCHAR(255)`, `float` -> `DECIMAL(15, 2)`) and the resulting script is saved to `schema.sql`. Because the script is derived from the DataFrame, its column list tracks the source file; the SQL types, however, are only as precise as the dtypes pandas infers.
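
To make the mapping rule concrete, here is a minimal sketch that applies the same substring checks to a small, hand-built DataFrame. The helper name `sql_type_for` and the sample values are illustrative assumptions, not part of the project code.

```python
import pandas as pd

def sql_type_for(dtype) -> str:
    """Map a pandas dtype to a SQL type using the same substring rules."""
    name = str(dtype)
    if "int" in name:
        return "INTEGER"
    if "float" in name:
        return "DECIMAL(15, 2)"
    if "datetime" in name:
        return "DATE"
    return "VARCHAR(255)"  # Fallback for text and anything unrecognized

# Hypothetical sample with one column per branch of the mapping
sample = pd.DataFrame({
    "Cantidad": [2, 5],
    "Precio_Unitario": [1587.0, 3882.0],
    "Fecha": pd.to_datetime(["2025-11-28", "2025-12-07"]),
    "Ciudad": ["Antofagasta", "Valparaíso"],
})

mapping = {col: sql_type_for(dtype) for col, dtype in sample.dtypes.items()}
for col, sql_type in mapping.items():
    print(f"{col}: {sql_type}")
```

The typed branches only fire when pandas has already inferred a numeric or datetime dtype. A column that pandas reads as `object` falls through to `VARCHAR(255)`, which is what happens to every column of `RWVentas.csv` in this notebook.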

In [14]:
# Inspect data types
df_raw.info()

# Function to map Pandas types to SQL types (Basic mapping)
def generate_create_table_script(df, table_name):
    schema = []
    for column, dtype in df.dtypes.items():
        # Trim leading/trailing whitespace from the column name
        col_name = column.strip()
        
        if "int" in str(dtype):
            sql_type = "INTEGER"
        elif "float" in str(dtype):
            sql_type = "DECIMAL(15, 2)" # Money format
        elif "datetime" in str(dtype):
            sql_type = "DATE"
        else:
            sql_type = "VARCHAR(255)"
        
        schema.append(f"    {col_name} {sql_type}")
    
    # Construct the CREATE TABLE statement
    columns_sql = ",\n".join(schema)
    create_script = f"""
-- Auto-generated script based on RWVentas.csv
DROP TABLE IF EXISTS {table_name};

CREATE TABLE {table_name} (
{columns_sql}
);
"""
    return create_script

# Generate script for 'raw_sales'
sql_script = generate_create_table_script(df_raw, 'raw_sales')
print(sql_script)

# Save to file (Deliverable)
with open('../sql/schema.sql', 'w') as f:
    f.write(sql_script)
    print("Schema saved to sql/schema.sql")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Ciudad           995449 non-null  object
 1   Fecha            995380 non-null  object
 2   Producto         995395 non-null  object
 3   Tipo_Producto    995374 non-null  object
 4   Cantidad         995470 non-null  object
 5   Precio_Unitario  995545 non-null  object
 6   Tipo_Venta       995477 non-null  object
 7   Tipo_Cliente     995579 non-null  object
 8   Descuento        995497 non-null  object
 9   Costo_Envio      995531 non-null  object
 10  Total            995472 non-null  object
dtypes: object(11)
memory usage: 83.9+ MB

-- Auto-generated script based on RWVentas.csv
DROP TABLE IF EXISTS raw_sales;

CREATE TABLE raw_sales (
    Ciudad VARCHAR(255),
    Fecha VARCHAR(255),
    Producto VARCHAR(255),
    Tipo_Producto VARCHAR(255),
    Cantidad VARCHAR(255),
    Pre

## 3. Data Loading: Staging Area

We load the raw data into the **`raw_sales`** table. 

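**Strategy:** 
- We pass `if_exists='replace'` to `to_sql`, so pandas drops and recreates `raw_sales` on every run. As a result, the column types in the final table are the ones pandas infers (`TEXT` for `object` columns), not the `VARCHAR(255)` types from the generated DDL.
- We keep all columns as `TEXT` at this stage to prevent ingestion errors caused by dirty data (e.g., `???` in numeric columns).
- Cleaning and type casting are handled in the next stage (Notebook 02).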
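
The effect of dirty values on type inference can be reproduced on a tiny scale. The sketch below uses a hypothetical three-row extract (the rows and the `???` placement are assumptions for illustration) and previews one possible cast, `pd.to_numeric(..., errors="coerce")`; the actual cleaning logic lives in Notebook 02 and may differ.

```python
import io
import pandas as pd

# Hypothetical three-row extract; "???" stands in for dirty data
csv_text = """Producto,Cantidad,Total
Leche,2.0,2539.0
Café,???,3882.0
Té,5.0,8755.0
"""

df = pd.read_csv(io.StringIO(csv_text))

# One bad token is enough to block numeric inference for the whole column
print(df.dtypes)

# A possible cast for the cleaning stage: unparseable tokens become NaN
cantidad = pd.to_numeric(df["Cantidad"], errors="coerce")
print(cantidad.tolist())
```

Loading such a column into a numeric SQL type would fail on the bad token, which is why the staging table keeps everything as text and defers casting.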

In [15]:
engine = get_db_engine()
table_name = 'raw_sales'

if engine:
    try:
        with engine.connect() as connection:
            # Structure: Execute the DDL script
            print("Creating table structure...")
            connection.execute(text(sql_script))
            connection.commit()  # Commit the DDL before to_sql opens its own connection via the engine
            
            # Upload Data
            print(f"Uploading {len(df_raw)} rows to '{table_name}'...")
            df_raw.to_sql(
                name=table_name,
                con=engine,
                if_exists='replace',
                index=False,
                chunksize=10000
            )
            print("Success! Data ingested into PostgreSQL.")
            
    except Exception as e:
        print(f"Database error: {e}")

Successfully created engine for database: riwi_ventas_db
Creating table structure...
Uploading 1000000 rows to 'raw_sales'...
Success! Data ingested into PostgreSQL.
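
Because `to_sql(..., if_exists='replace')` drops and recreates the target table, the column types declared by the DDL step do not survive the load. The following sketch illustrates this with an in-memory SQLite database from the standard library, used here only as a stand-in for PostgreSQL so that it runs without a server. The two-column table, the sample rows, and the helper `load_and_inspect` are assumptions for illustration.

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"Ciudad": ["Sevilla", "Monterrey"], "Total": ["8755.0", "???"]})
ddl = "CREATE TABLE raw_sales (Ciudad VARCHAR(255), Total VARCHAR(255))"

def load_and_inspect(if_exists):
    """Create the table via DDL, load df, and return (column types, row count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    df.to_sql("raw_sales", con=conn, if_exists=if_exists, index=False)
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
    types = [row[2] for row in conn.execute("PRAGMA table_info(raw_sales)")]
    count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
    conn.close()
    return types, count

replace_types, replace_count = load_and_inspect("replace")
append_types, append_count = load_and_inspect("append")

print("replace:", replace_types, replace_count)
print("append: ", append_types, append_count)
```

With `'append'` the DDL-declared types are kept, while `'replace'` substitutes the types pandas infers. If the generated `schema.sql` is meant to define the staging table, `'append'` after the `DROP TABLE` / `CREATE TABLE` script is one option to consider. The row count check at the end is also a cheap way to confirm that a load is complete.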


## 4. Conclusion

- **Status:** Success.
- **Output:** 1,000,000 rows loaded into `raw_sales`.
- **Artifact:** `sql/schema.sql` created.
- **Next Step:** Data Cleaning & Enrichment.