# ETL: Silver ‚Üí Gold Layer

**Objetivo:** Transformar dados do Silver em Data Warehouse (Gold) com Star Schema.

**Processo:**
1. Extrair dados de `silver.uber_silver`
2. Popular 8 dimens√µes
3. Popular tabela fato com m√©tricas e FKs
4. Validar Data Warehouse

**Schema:** dwh.dim_data, dim_tempo, dim_cliente, dim_veiculo, dim_status, dim_localizacao, dim_pagamento, dim_motivo_cancelamento ‚Üí fato_corridas

In [16]:
import pandas as pd
import psycopg2
from psycopg2.extras import execute_values
from datetime import datetime, timedelta
import numpy as np
import hashlib

print("‚úÖ Bibliotecas importadas!")

‚úÖ Bibliotecas importadas!


In [17]:
# Configura√ß√£o PostgreSQL
DB_CONFIG = {
    'host': 'localhost',
    'port': 5432,
    'database': 'uberdb',
    'user': 'admin',
    'password': 'uber.10'
}

def get_connection():
    return psycopg2.connect(**DB_CONFIG)

# Teste conex√£o
try:
    conn = get_connection()
    print("‚úÖ Conex√£o estabelecida!")
    conn.close()
except Exception as e:
    print(f"‚ùå Erro: {e}")

‚úÖ Conex√£o estabelecida!


## 0. PREPARA√á√ÉO: Criar Schema e Tabelas do DWH

In [18]:
# Executar DDL do Gold para criar schema e tabelas
import os

ddl_path = os.path.join('..', 'Data Layer', 'gold', 'gold_ddl.sql')

print("üìÇ Lendo arquivo gold_ddl.sql...")
with open(ddl_path, 'r', encoding='utf-8') as f:
    ddl_script = f.read()

print("üöÄ Executando DDL no PostgreSQL...")
conn = get_connection()
cur = conn.cursor()

try:
    cur.execute(ddl_script)
    conn.commit()
    print("‚úÖ Schema 'dwh' e todas as tabelas criadas com sucesso!")
    
    # Verificar tabelas criadas
    cur.execute("""
        SELECT table_name 
        FROM information_schema.tables 
        WHERE table_schema = 'dwh'
        ORDER BY table_name;
    """)
    tabelas = cur.fetchall()
    print(f"\nüìä Tabelas criadas no schema 'dwh': {len(tabelas)}")
    for tabela in tabelas:
        print(f"   ‚Ä¢ {tabela[0]}")
        
except Exception as e:
    conn.rollback()
    print(f"‚ùå Erro ao executar DDL: {e}")
    raise
finally:
    cur.close()
    conn.close()

üìÇ Lendo arquivo gold_ddl.sql...
üöÄ Executando DDL no PostgreSQL...
‚úÖ Schema 'dwh' e todas as tabelas criadas com sucesso!

üìä Tabelas criadas no schema 'dwh': 9
   ‚Ä¢ dim_cliente
   ‚Ä¢ dim_data
   ‚Ä¢ dim_localizacao
   ‚Ä¢ dim_motivo_cancelamento
   ‚Ä¢ dim_pagamento
   ‚Ä¢ dim_status
   ‚Ä¢ dim_tempo
   ‚Ä¢ dim_veiculo
   ‚Ä¢ fato_corridas


## 1. EXTRA√á√ÉO: Carregar Silver

In [19]:
query_silver = """
SELECT booking_id, customer_id, vehicle_type, pickup_location, drop_location,
       booking_value, ride_distance, payment_method, booking_status,
       reason_for_cancelling_by_customer, driver_cancellation_reason, incomplete_rides_reason,
       date, time, avg_vtat, avg_ctat, driver_ratings, customer_rating
FROM silver.uber_silver
ORDER BY date, time;
"""

conn = get_connection()
df_silver = pd.read_sql(query_silver, conn)
conn.close()

print(f"üìä Registros carregados: {len(df_silver):,}")
df_silver.head()

  df_silver = pd.read_sql(query_silver, conn)


üìä Registros carregados: 97,765


Unnamed: 0,booking_id,customer_id,vehicle_type,pickup_location,drop_location,booking_value,ride_distance,payment_method,booking_status,reason_for_cancelling_by_customer,driver_cancellation_reason,incomplete_rides_reason,date,time,avg_vtat,avg_ctat,driver_ratings,customer_rating
0,CNR4352144,CID8362794,Bike,Udyog Vihar,Ambience Mall,99.0,37.98,Cash,Completed,Reason Unknown,Reason Unknown,Reason Unknown,2024-01-01,00:19:34,10.8,38.9,4.8,4.8
1,CNR9147645,CID8300238,Go Mini,Basai Dhankot,Madipur,114.0,39.29,Uber Wallet,Completed,Reason Unknown,Reason Unknown,Reason Unknown,2024-01-01,01:35:18,8.5,15.1,4.2,4.1
2,CNR8140858,CID9268400,Go Mini,Jhilmil,Welcome,735.0,39.39,UPI,Completed,Reason Unknown,Reason Unknown,Reason Unknown,2024-01-01,01:53:01,8.1,42.6,4.3,4.7
3,CNR6073090,CID7393428,Go Mini,Sarojini Nagar,Madipur,918.0,44.21,Cash,Completed,Reason Unknown,Reason Unknown,Reason Unknown,2024-01-01,03:59:29,2.9,33.8,3.6,4.9
4,CNR4082656,CID9685431,eBike,Panchsheel Park,Pragati Maidan,423.0,40.82,Cash,Completed,Reason Unknown,Reason Unknown,Reason Unknown,2024-01-01,04:00:07,8.6,24.3,4.4,3.9


## 2. TRANSFORMA√á√ÉO: Criar Dimens√µes

In [20]:
# 2.1 Dim_Data
df_silver['date'] = pd.to_datetime(df_silver['date'])
min_date = df_silver['date'].min()
max_date = df_silver['date'].max() + timedelta(days=365)
date_range = pd.date_range(start=min_date, end=max_date, freq='D')

dim_data = pd.DataFrame({
    'data_completa': date_range,
    'data_key': date_range.strftime('%Y%m%d').astype(int),
    'ano': date_range.year,
    'trimestre': date_range.quarter,
    'mes': date_range.month,
    'nome_mes': date_range.strftime('%B'),
    'dia': date_range.day,
    'dia_da_semana': date_range.dayofweek + 1,
    'nome_dia_semana': date_range.strftime('%A'),
    'fim_de_semana': date_range.dayofweek >= 5,
    'dia_util': date_range.dayofweek < 5
})

print(f"‚úÖ dim_data: {len(dim_data):,} registros ({dim_data['data_completa'].min()} a {dim_data['data_completa'].max()})")

‚úÖ dim_data: 730 registros (2024-01-01 00:00:00 a 2025-12-30 00:00:00)


In [21]:
# 2.2 Dim_Tempo
times = pd.date_range('00:00', '23:59', freq='1min').time

def classificar_periodo(hora):
    if 0 <= hora < 6: return 'Madrugada'
    elif 6 <= hora < 12: return 'Manh√£'
    elif 12 <= hora < 18: return 'Tarde'
    else: return 'Noite'

def classificar_turno(hora):
    if 8 <= hora < 18: return 'Comercial'
    elif 18 <= hora < 23: return 'Noturno'
    else: return 'Madrugada'

dim_tempo = pd.DataFrame({
    'tempo_key': [int(f"{t.hour:02d}{t.minute:02d}") for t in times],
    'hora': [t.hour for t in times],
    'minuto': [t.minute for t in times],
    'periodo': [classificar_periodo(t.hour) for t in times],
    'turno': [classificar_turno(t.hour) for t in times],
    'hora_pico': [(7 <= t.hour <= 9) or (17 <= t.hour <= 19) for t in times]
}).drop_duplicates(subset=['tempo_key'])

print(f"‚úÖ dim_tempo: {len(dim_tempo):,} registros")

‚úÖ dim_tempo: 1,440 registros


In [22]:
# 2.3 Dim_Cliente
dim_cliente = df_silver[['customer_id']].drop_duplicates().copy()
dim_cliente['data_cadastro'] = df_silver.groupby('customer_id')['date'].min().values

print(f"‚úÖ dim_cliente: {len(dim_cliente):,} registros")

‚úÖ dim_cliente: 97,268 registros


In [23]:
# 2.4 Dim_Veiculo
dim_veiculo = df_silver[['vehicle_type']].drop_duplicates().copy()

def categorizar_veiculo(vtype):
    if pd.isna(vtype): return 'Desconhecido'
    v_lower = str(vtype).lower()
    if 'premium' in v_lower or 'luxury' in v_lower: return 'Premium'
    elif 'bike' in v_lower or 'moto' in v_lower: return 'Bike'
    else: return 'Econ√¥mico'

dim_veiculo['categoria'] = dim_veiculo['vehicle_type'].apply(categorizar_veiculo)
dim_veiculo['capacidade'] = None

print(f"‚úÖ dim_veiculo: {len(dim_veiculo)} registros")

‚úÖ dim_veiculo: 7 registros



In [24]:
# 2.5 Dim_Status
dim_status = df_silver[['booking_status']].drop_duplicates().copy()

def categorizar_status(status):
    if pd.isna(status): return 'Desconhecido', False
    s_lower = str(status).lower()
    if 'complete' in s_lower: return 'Completado', False
    elif 'cancel' in s_lower: return 'Cancelado', False
    elif 'incomplete' in s_lower: return 'Incompleto', False
    else: return 'Ativo', True

dim_status[['status_categoria', 'status_ativo']] = dim_status['booking_status'].apply(
    lambda x: pd.Series(categorizar_status(x))
)

print(f"‚úÖ dim_status: {len(dim_status)} registros")

‚úÖ dim_status: 2 registros



In [25]:
# 2.6 Dim_Localizacao
pickup_loc = df_silver[['pickup_location']].rename(columns={'pickup_location': 'local_nome'})
drop_loc = df_silver[['drop_location']].rename(columns={'drop_location': 'local_nome'})
dim_localizacao = pd.concat([pickup_loc, drop_loc]).drop_duplicates()
dim_localizacao['regiao'] = None
dim_localizacao['zona'] = None

print(f"‚úÖ dim_localizacao: {len(dim_localizacao):,} registros")

‚úÖ dim_localizacao: 176 registros


In [26]:
# 2.7 Dim_Pagamento
dim_pagamento = df_silver[['payment_method']].drop_duplicates().copy()

def classificar_pagamento(method):
    if pd.isna(method): return 'Desconhecido'
    m_lower = str(method).lower()
    return 'Dinheiro' if 'cash' in m_lower or 'dinheiro' in m_lower else 'Digital'

dim_pagamento['tipo_pagamento'] = dim_pagamento['payment_method'].apply(classificar_pagamento)

print(f"‚úÖ dim_pagamento: {len(dim_pagamento)} registros")

‚úÖ dim_pagamento: 5 registros


In [27]:
# 2.8 Dim_Motivo_Cancelamento
dim_motivo = df_silver[[
    'reason_for_cancelling_by_customer', 'driver_cancellation_reason', 'incomplete_rides_reason'
]].drop_duplicates().copy()

def criar_hash_motivo(row):
    motivo_str = f"{row['reason_for_cancelling_by_customer']}|{row['driver_cancellation_reason']}|{row['incomplete_rides_reason']}"
    return hashlib.md5(motivo_str.encode()).hexdigest()

dim_motivo['motivo_hash'] = dim_motivo.apply(criar_hash_motivo, axis=1)

print(f"‚úÖ dim_motivo_cancelamento: {len(dim_motivo):,} registros")

‚úÖ dim_motivo_cancelamento: 4 registros


## 3. CARGA: Inserir Dimens√µes no DWH

In [28]:
# 3.1 Inserir dim_data
conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.dim_data CASCADE;")

data_values = [
    (int(row['data_key']), row['data_completa'].date(), int(row['ano']), int(row['trimestre']),
     int(row['mes']), row['nome_mes'], int(row['dia']), int(row['dia_da_semana']),
     row['nome_dia_semana'], bool(row['fim_de_semana']), bool(row['dia_util']))
    for _, row in dim_data.iterrows()
]

insert_query = """
INSERT INTO dwh.dim_data (data_key, data_completa, ano, trimestre, mes, nome_mes, 
                          dia, dia_da_semana, nome_dia_semana, fim_de_semana, dia_util)
VALUES %s
"""
execute_values(cur, insert_query, data_values, page_size=1000)
conn.commit()

cur.execute("SELECT COUNT(*) FROM dwh.dim_data;")
print(f"‚úÖ dim_data inserida: {cur.fetchone()[0]:,} registros")
cur.close()
conn.close()

‚úÖ dim_data inserida: 730 registros


In [29]:
# 3.2 Inserir dim_tempo
conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.dim_tempo CASCADE;")

tempo_values = [(int(row['tempo_key']), int(row['hora']), int(row['minuto']),
                 row['periodo'], row['turno'], bool(row['hora_pico']))
                for _, row in dim_tempo.iterrows()]

execute_values(cur, """
INSERT INTO dwh.dim_tempo (tempo_key, hora, minuto, periodo, turno, hora_pico)
VALUES %s
""", tempo_values, page_size=1000)
conn.commit()

cur.execute("SELECT COUNT(*) FROM dwh.dim_tempo;")
print(f"‚úÖ dim_tempo inserida: {cur.fetchone()[0]:,} registros")
cur.close()
conn.close()

‚úÖ dim_tempo inserida: 1,440 registros


In [30]:
# 3.3 Inserir dim_cliente
conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.dim_cliente CASCADE;")

for _, row in dim_cliente.iterrows():
    cur.execute("""
    INSERT INTO dwh.dim_cliente (customer_id, data_cadastro)
    VALUES (%s, %s);
    """, (row['customer_id'], row['data_cadastro'].date()))

conn.commit()
cur.execute("SELECT COUNT(*) FROM dwh.dim_cliente;")
print(f"‚úÖ dim_cliente inserida: {cur.fetchone()[0]:,} registros")
cur.close()
conn.close()

‚úÖ dim_cliente inserida: 97,268 registros


In [31]:
# 3.4 Inserir dim_veiculo
conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.dim_veiculo CASCADE;")

for _, row in dim_veiculo.iterrows():
    cur.execute("""
    INSERT INTO dwh.dim_veiculo (vehicle_type, categoria, capacidade)
    VALUES (%s, %s, %s);
    """, (row['vehicle_type'], row['categoria'], row['capacidade']))

conn.commit()
cur.execute("SELECT COUNT(*) FROM dwh.dim_veiculo;")
print(f"‚úÖ dim_veiculo inserida: {cur.fetchone()[0]} registros")
cur.close()
conn.close()

‚úÖ dim_veiculo inserida: 7 registros


In [32]:
# 3.5 Inserir dim_status
conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.dim_status CASCADE;")

for _, row in dim_status.iterrows():
    cur.execute("""
    INSERT INTO dwh.dim_status (booking_status, status_categoria, status_ativo)
    VALUES (%s, %s, %s);
    """, (row['booking_status'], row['status_categoria'], row['status_ativo']))

conn.commit()
cur.execute("SELECT COUNT(*) FROM dwh.dim_status;")
print(f"‚úÖ dim_status inserida: {cur.fetchone()[0]} registros")
cur.close()
conn.close()

‚úÖ dim_status inserida: 2 registros


In [33]:
# 3.6 Inserir dim_localizacao
conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.dim_localizacao CASCADE;")

for _, row in dim_localizacao.iterrows():
    cur.execute("""
    INSERT INTO dwh.dim_localizacao (local_nome, regiao, zona)
    VALUES (%s, %s, %s);
    """, (row['local_nome'], row['regiao'], row['zona']))

conn.commit()
cur.execute("SELECT COUNT(*) FROM dwh.dim_localizacao;")
print(f"‚úÖ dim_localizacao inserida: {cur.fetchone()[0]:,} registros")
cur.close()
conn.close()

‚úÖ dim_localizacao inserida: 176 registros


In [34]:
# 3.7 Inserir dim_pagamento
conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.dim_pagamento CASCADE;")

for _, row in dim_pagamento.iterrows():
    cur.execute("""
    INSERT INTO dwh.dim_pagamento (payment_method, tipo_pagamento)
    VALUES (%s, %s);
    """, (row['payment_method'], row['tipo_pagamento']))

conn.commit()
cur.execute("SELECT COUNT(*) FROM dwh.dim_pagamento;")
print(f"‚úÖ dim_pagamento inserida: {cur.fetchone()[0]} registros")
cur.close()
conn.close()

‚úÖ dim_pagamento inserida: 5 registros


In [35]:
# 3.8 Inserir dim_motivo_cancelamento
conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.dim_motivo_cancelamento CASCADE;")

for _, row in dim_motivo.iterrows():
    cur.execute("""
    INSERT INTO dwh.dim_motivo_cancelamento 
    (reason_cancel_customer, driver_cancellation_reason, incomplete_rides_reason, motivo_hash)
    VALUES (%s, %s, %s, %s);
    """, (row['reason_for_cancelling_by_customer'], row['driver_cancellation_reason'],
          row['incomplete_rides_reason'], row['motivo_hash']))

conn.commit()
cur.execute("SELECT COUNT(*) FROM dwh.dim_motivo_cancelamento;")
print(f"‚úÖ dim_motivo_cancelamento inserida: {cur.fetchone()[0]:,} registros")
cur.close()
conn.close()

‚úÖ dim_motivo_cancelamento inserida: 4 registros



## 4. PREPARAR FATO: Lookups e Transforma√ß√µes

In [36]:
# Carregar lookups das dimens√µes
conn = get_connection()
lookup_cliente = pd.read_sql("SELECT cliente_key, customer_id FROM dwh.dim_cliente", conn)
lookup_veiculo = pd.read_sql("SELECT veiculo_key, vehicle_type FROM dwh.dim_veiculo", conn)
lookup_status = pd.read_sql("SELECT status_key, booking_status FROM dwh.dim_status", conn)
lookup_pagamento = pd.read_sql("SELECT pagamento_key, payment_method FROM dwh.dim_pagamento", conn)
lookup_localizacao = pd.read_sql("SELECT local_key, local_nome FROM dwh.dim_localizacao", conn)
lookup_motivo = pd.read_sql("""
    SELECT motivo_key, reason_cancel_customer, driver_cancellation_reason, incomplete_rides_reason 
    FROM dwh.dim_motivo_cancelamento
""", conn)
conn.close()

# Renomear coluna do lookup para fazer merge correto
lookup_motivo = lookup_motivo.rename(columns={'reason_cancel_customer': 'reason_for_cancelling_by_customer'})

print(f"‚úÖ Lookups carregados (Cliente: {len(lookup_cliente):,}, Localiza√ß√µes: {len(lookup_localizacao):,})")

‚úÖ Lookups carregados (Cliente: 97,268, Localiza√ß√µes: 176)


  lookup_cliente = pd.read_sql("SELECT cliente_key, customer_id FROM dwh.dim_cliente", conn)
  lookup_veiculo = pd.read_sql("SELECT veiculo_key, vehicle_type FROM dwh.dim_veiculo", conn)
  lookup_status = pd.read_sql("SELECT status_key, booking_status FROM dwh.dim_status", conn)
  lookup_pagamento = pd.read_sql("SELECT pagamento_key, payment_method FROM dwh.dim_pagamento", conn)
  lookup_localizacao = pd.read_sql("SELECT local_key, local_nome FROM dwh.dim_localizacao", conn)
  lookup_motivo = pd.read_sql("""


In [37]:
# Preparar tabela fato
df_fato = df_silver.copy()

# Criar chaves
df_fato['data_key'] = pd.to_datetime(df_fato['date']).dt.strftime('%Y%m%d').astype(int)

def time_to_key(time_str):
    if pd.isna(time_str) or time_str == '': return None
    try:
        time_obj = pd.to_datetime(time_str, format='%H:%M:%S').time()
        return int(f"{time_obj.hour:02d}{time_obj.minute:02d}")
    except:
        return None

df_fato['tempo_key'] = df_fato['time'].apply(time_to_key)

# Merge com dimens√µes
df_fato = df_fato.merge(lookup_cliente, on='customer_id', how='left')
df_fato = df_fato.merge(lookup_veiculo, on='vehicle_type', how='left')
df_fato = df_fato.merge(lookup_status, on='booking_status', how='left')
df_fato = df_fato.merge(lookup_pagamento, on='payment_method', how='left')
df_fato = df_fato.merge(
    lookup_localizacao.rename(columns={'local_key': 'pickup_local_key', 'local_nome': 'pickup_location'}),
    on='pickup_location', how='left'
)
df_fato = df_fato.merge(
    lookup_localizacao.rename(columns={'local_key': 'drop_local_key', 'local_nome': 'drop_location'}),
    on='drop_location', how='left'
)
df_fato = df_fato.merge(
    lookup_motivo,
    on=['reason_for_cancelling_by_customer', 'driver_cancellation_reason', 'incomplete_rides_reason'],
    how='left'
)

# Calcular m√©tricas derivadas
df_fato['valor_por_km'] = df_fato.apply(
    lambda x: round(x['booking_value'] / x['ride_distance'], 2) 
    if pd.notna(x['ride_distance']) and x['ride_distance'] > 0 else None,
    axis=1
)

# Flags booleanas
df_fato['corrida_completa'] = df_fato['booking_status'].str.lower().str.contains('complete', na=False)
df_fato['corrida_cancelada'] = df_fato['booking_status'].str.lower().str.contains('cancel', na=False)
df_fato['corrida_incompleta'] = df_fato['booking_status'].str.lower().str.contains('incomplete', na=False)

print(f"‚úÖ Fato preparada: {len(df_fato):,} registros")
print(f"   Completas: {df_fato['corrida_completa'].sum():,}")
print(f"   Canceladas: {df_fato['corrida_cancelada'].sum():,}")

‚úÖ Fato preparada: 97,765 registros
   Completas: 97,765
   Canceladas: 0


## 5. CARGA: Inserir Tabela Fato

In [38]:
# Selecionar colunas para inser√ß√£o
fato_columns = [
    'booking_id', 'data_key', 'tempo_key', 'cliente_key', 'veiculo_key',
    'status_key', 'pagamento_key', 'pickup_local_key', 'drop_local_key', 'motivo_key',
    'booking_value', 'ride_distance', 'avg_vtat', 'avg_ctat', 
    'driver_ratings', 'customer_rating', 'valor_por_km',
    'corrida_completa', 'corrida_cancelada', 'corrida_incompleta'
]

df_fato_insert = df_fato[fato_columns].where(pd.notnull(df_fato[fato_columns]), None)
fato_values = [tuple(row) for row in df_fato_insert.values]

conn = get_connection()
cur = conn.cursor()
cur.execute("TRUNCATE TABLE dwh.fato_corridas;")

insert_query = """
INSERT INTO dwh.fato_corridas (
    corrida_key, data_key, tempo_key, cliente_key, veiculo_key,
    status_key, pagamento_key, pickup_local_key, drop_local_key, motivo_key,
    booking_value, ride_distance, avg_vtat, avg_ctat,
    driver_ratings, customer_rating, valor_por_km,
    corrida_completa, corrida_cancelada, corrida_incompleta
)
VALUES %s
ON CONFLICT (corrida_key) DO NOTHING;
"""

# Inserir em batches
batch_size = 1000
total_batches = (len(fato_values) + batch_size - 1) // batch_size
print(f"üöÄ Inserindo {len(fato_values):,} registros em {total_batches} batches...")

for i in range(0, len(fato_values), batch_size):
    batch = fato_values[i:i+batch_size]
    execute_values(cur, insert_query, batch, page_size=batch_size)
    if (i // batch_size + 1) % 10 == 0:
        print(f"   Batch {i // batch_size + 1}/{total_batches}")

conn.commit()
cur.execute("SELECT COUNT(*) FROM dwh.fato_corridas;")
print(f"\n‚úÖ fato_corridas inserida: {cur.fetchone()[0]:,} registros")
cur.close()
conn.close()

üöÄ Inserindo 97,765 registros em 98 batches...
   Batch 10/98
   Batch 10/98
   Batch 20/98
   Batch 20/98
   Batch 30/98
   Batch 30/98
   Batch 40/98
   Batch 40/98
   Batch 50/98
   Batch 50/98
   Batch 60/98
   Batch 60/98
   Batch 70/98
   Batch 70/98
   Batch 80/98
   Batch 80/98
   Batch 90/98
   Batch 90/98

‚úÖ fato_corridas inserida: 97,765 registros

‚úÖ fato_corridas inserida: 97,765 registros


## 6. VALIDA√á√ÉO do Data Warehouse

In [39]:
# Verificar integridade
conn = get_connection()
validation_queries = {
    'Total Corridas': "SELECT COUNT(*) FROM dwh.fato_corridas",
    'Corridas Completas': "SELECT COUNT(*) FROM dwh.fato_corridas WHERE corrida_completa = TRUE",
    'Corridas Canceladas': "SELECT COUNT(*) FROM dwh.fato_corridas WHERE corrida_cancelada = TRUE",
    'Total Clientes': "SELECT COUNT(*) FROM dwh.dim_cliente",
    'Total Localiza√ß√µes': "SELECT COUNT(*) FROM dwh.dim_localizacao",
    'Receita Total': "SELECT SUM(booking_value) FROM dwh.fato_corridas",
    'Dist√¢ncia Total (km)': "SELECT SUM(ride_distance) FROM dwh.fato_corridas",
    'M√©dia Rating Motorista': "SELECT AVG(driver_ratings) FROM dwh.fato_corridas WHERE driver_ratings IS NOT NULL",
    'M√©dia Rating Cliente': "SELECT AVG(customer_rating) FROM dwh.fato_corridas WHERE customer_rating IS NOT NULL"
}

print("="*60)
print("üìä VALIDA√á√ÉO DO DATA WAREHOUSE")
print("="*60)
for label, query in validation_queries.items():
    result = pd.read_sql(query, conn).iloc[0, 0]
    if isinstance(result, (int, np.integer)):
        print(f"{label:.<40} {result:>15,}")
    elif isinstance(result, (float, np.floating)):
        print(f"{label:.<40} {result:>15,.2f}")
print("="*60)
conn.close()

üìä VALIDA√á√ÉO DO DATA WAREHOUSE
Total Corridas..........................          97,765
Corridas Completas......................          97,765
Corridas Canceladas.....................               0
Total Clientes..........................          97,268
Total Localiza√ß√µes......................             176
Receita Total...........................   45,100,932.00
Dist√¢ncia Total (km)....................    2,408,269.23
M√©dia Rating Motorista..................            4.23
M√©dia Rating Cliente....................            4.40


  result = pd.read_sql(query, conn).iloc[0, 0]


In [40]:
# Query anal√≠tica: Top 10 rotas por receita
conn = get_connection()
df_top_rotas = pd.read_sql("""
SELECT pickup.local_nome AS origem, drop.local_nome AS destino,
       COUNT(*) AS total_corridas,
       SUM(f.booking_value) AS receita_total,
       AVG(f.booking_value) AS ticket_medio,
       AVG(f.ride_distance) AS distancia_media
FROM dwh.fato_corridas f
JOIN dwh.dim_localizacao pickup ON f.pickup_local_key = pickup.local_key
JOIN dwh.dim_localizacao drop ON f.drop_local_key = drop.local_key
WHERE f.corrida_completa = TRUE
GROUP BY pickup.local_nome, drop.local_nome
ORDER BY receita_total DESC
LIMIT 10;
""", conn)
conn.close()

print("\nüèÜ TOP 10 ROTAS MAIS RENT√ÅVEIS:\n")
df_top_rotas

  df_top_rotas = pd.read_sql("""



üèÜ TOP 10 ROTAS MAIS RENT√ÅVEIS:



Unnamed: 0,origem,destino,total_corridas,receita_total,ticket_medio,distancia_media
0,Kirti Nagar,Yamuna Bank,8,6921.0,865.125,27.80125
1,Paharganj,Sarojini Nagar,11,6800.0,618.181818,22.56
2,Ghitorni,Mandi House,10,6517.0,651.7,24.223
3,Vaishali,IIT Delhi,11,6450.0,586.363636,22.357273
4,Ardee City,Nirman Vihar,10,6433.0,643.3,20.874
5,Jahangirpuri,Ashram,8,6391.0,798.875,23.96125
6,Rithala,Udyog Vihar Phase 4,11,6325.0,575.0,18.859091
7,Rithala,Basai Dhankot,10,6278.0,627.8,20.076
8,Mehrauli,Netaji Subhash Place,9,6250.0,694.444444,17.041111
9,Rohini West,Sohna Road,13,6204.0,477.230769,29.408462


## 7. SUM√ÅRIO FINAL

In [41]:
print("\n" + "="*70)
print(" " * 15 + "üéØ ETL SILVER ‚Üí GOLD CONCLU√çDO! üéØ")
print("="*70)
print("\nüìä RESUMO DA CARGA:")
print("-"*70)

conn = get_connection()
cur = conn.cursor()
tabelas = [
    ('dwh.dim_data', 'Dimens√£o Data'),
    ('dwh.dim_tempo', 'Dimens√£o Tempo'),
    ('dwh.dim_cliente', 'Dimens√£o Cliente'),
    ('dwh.dim_veiculo', 'Dimens√£o Ve√≠culo'),
    ('dwh.dim_status', 'Dimens√£o Status'),
    ('dwh.dim_localizacao', 'Dimens√£o Localiza√ß√£o'),
    ('dwh.dim_pagamento', 'Dimens√£o Pagamento'),
    ('dwh.dim_motivo_cancelamento', 'Dimens√£o Motivo'),
    ('dwh.fato_corridas', 'üåü FATO CORRIDAS')
]

for tabela, descricao in tabelas:
    cur.execute(f"SELECT COUNT(*) FROM {tabela};")
    count = cur.fetchone()[0]
    print(f"{descricao:.<50} {count:>15,} registros")

cur.close()
conn.close()
print("\n" + "="*70)
print("‚úÖ Data Warehouse pronto para an√°lises!")
print("="*70)


               üéØ ETL SILVER ‚Üí GOLD CONCLU√çDO! üéØ

üìä RESUMO DA CARGA:
----------------------------------------------------------------------
Dimens√£o Data.....................................             730 registros
Dimens√£o Tempo....................................           1,440 registros
Dimens√£o Cliente..................................          97,268 registros
Dimens√£o Ve√≠culo..................................               7 registros
Dimens√£o Status...................................               2 registros
Dimens√£o Localiza√ß√£o..............................             176 registros
Dimens√£o Pagamento................................               5 registros
Dimens√£o Motivo...................................               4 registros
üåü FATO CORRIDAS...................................          97,765 registros

‚úÖ Data Warehouse pronto para an√°lises!
