# ETL Silver → Gold

ETL pipeline that populates the Gold layer's star schema from the cleaned data in the Silver layer.

**Goal:** Transform normalized data into a dimensional structure optimized for BI analysis.

## 1. Imports

Libraries required by the ETL.

In [13]:
import pandas as pd
import psycopg2
from psycopg2.extras import execute_batch
from datetime import datetime

## 2. Configuration

Connection parameters for the PostgreSQL database.

In [14]:
DB_CONFIG = {
    'host': 'localhost',
    'port': 5432,
    'database': 'airline_delays',
    'user': 'postgres',
    'password': 'postgres'
}
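
Hardcoded credentials are fine for a local sandbox, but the same dictionary can also be built from environment variables. The sketch below shows one way to do that. The helper name `load_db_config` is hypothetical, the variable names follow the `PGHOST`/`PGPORT`/`PGDATABASE`/`PGUSER`/`PGPASSWORD` convention used by libpq, and the defaults simply mirror the values above.

```python
import os


def load_db_config(env=None):
    """Build a psycopg2-style connection dict from environment variables.

    Hypothetical helper: the defaults mirror the hardcoded DB_CONFIG above.
    """
    env = os.environ if env is None else env
    return {
        'host': env.get('PGHOST', 'localhost'),
        'port': int(env.get('PGPORT', '5432')),
        'database': env.get('PGDATABASE', 'airline_delays'),
        'user': env.get('PGUSER', 'postgres'),
        'password': env.get('PGPASSWORD', 'postgres'),
    }


# An explicit mapping can be passed in, which keeps the helper easy to test.
print(load_db_config({'PGHOST': 'db.internal', 'PGPORT': '6543'})['port'])  # 6543
```

In the notebook, `DB_CONFIG = load_db_config()` would then replace the literal dictionary without changing the later cells.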

## 3. Connect and Create the DW Schema

Opens a connection to PostgreSQL and runs the DDL that creates the dimension and fact tables.

In [15]:
print("Conectando ao PostgreSQL...")
conn = psycopg2.connect(**DB_CONFIG)
cur = conn.cursor()

print("Executando DDL da camada DW...")
with open('../Data Layer/gold/ddl.sql', 'r', encoding='utf-8') as f:
    ddl_sql = f.read()
    cur.execute(ddl_sql)
    conn.commit()

print("Schema DW criado com sucesso")
cur.close()

Conectando ao PostgreSQL...
Executando DDL da camada DW...
Schema DW criado com sucesso


## 4. Load Data from Silver

Reads the data from the `silver.airline_delays` table.

In [16]:
print("Carregando dados do Silver...")

query = """
    SELECT 
        year,
        month,
        carrier,
        carrier_name,
        airport,
        airport_name,
        arr_flights,
        arr_del15,
        carrier_ct,
        weather_ct,
        nas_ct,
        security_ct,
        late_aircraft_ct,
        arr_cancelled,
        arr_diverted,
        arr_delay,
        carrier_delay,
        weather_delay,
        nas_delay,
        security_delay,
        late_aircraft_delay
    FROM silver.airline_delays
    ORDER BY year, month, carrier, airport
"""

df = pd.read_sql_query(query, conn)
print(f"Carregados {len(df):,} registros do Silver")
print("\nInfo do DataFrame:")
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()

Carregando dados do Silver...
Carregados 171,666 registros do Silver

Info do DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171666 entries, 0 to 171665
Data columns (total 21 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   year                 171666 non-null  int64  
 1   month                171666 non-null  int64  
 2   carrier              171666 non-null  object 
 3   carrier_name         171666 non-null  object 
 4   airport              171666 non-null  object 
 5   airport_name         171666 non-null  object 
 6   arr_flights          171426 non-null  float64
 7   arr_del15            171223 non-null  float64
 8   carrier_ct           171426 non-null  float64
 9   weather_ct           171426 non-null  float64
 10  nas_ct               171426 non-null  float64
 11  security_ct          171426 non-null  float64
 12  late_aircraft_ct     171426 non-null  float64
 13  arr_cancelled        171426 non-null  float64
 14  arr_diver

## 5. Prepare the Cursor

Creates a cursor for database operations.

In [17]:
cur = conn.cursor()
print("Cursor criado")

Cursor criado


## 6. Clear the DW Tables

Truncates the fact table and the three dimensions so the load can be rerun from scratch.

In [18]:
print("Limpando tabelas DW...")
cur.execute("TRUNCATE TABLE dw.fact_flight_delays, dw.dim_carrier, dw.dim_airport, dw.dim_time CASCADE;")
conn.commit()

Limpando tabelas DW...


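## 7. Time Dimension (dim_time)

Populates the time dimension with derived attributes: quarter (`trimestre`), semester (`semestre`), month name (`mes_nome`), and the `mes_ano` and `ano_trimestre` labels.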

In [19]:
print("Populando dim_time...")

time_data = df[['year', 'month']].drop_duplicates().sort_values(['year', 'month'])

data = []
meses_nomes = [
    'Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho',
    'Julho', 'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro'
]

for _, row in time_data.iterrows():
    year = int(row['year'])
    month = int(row['month'])
    
    trimestre = (month - 1) // 3 + 1
    semestre = 1 if month <= 6 else 2
    mes_nome = meses_nomes[month - 1]
    mes_ano = f"{year}-{month:02d}"
    ano_trimestre = f"{year}-Q{trimestre}"
    
    data.append((
        year,
        month,
        trimestre,
        semestre,
        mes_nome,
        mes_ano,
        ano_trimestre
    ))

execute_batch(cur, """
    INSERT INTO dw.dim_time 
    (year, month, trimestre, semestre, mes_nome, mes_ano, ano_trimestre)
    VALUES (%s, %s, %s, %s, %s, %s, %s)
    ON CONFLICT (year, month) DO NOTHING
""", data)
conn.commit()

print(f"dim_time populada com {len(data):,} registros")

Populando dim_time...
dim_time populada com 121 registros
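
The derived attributes depend only on the year and month, so the calculation can be isolated and checked without a database. The following is a minimal sketch of the same arithmetic used in the cell above; the function name `time_attributes` and the range check are additions for illustration.

```python
MESES_NOMES = [
    'Janeiro', 'Fevereiro', 'Março', 'Abril', 'Maio', 'Junho',
    'Julho', 'Agosto', 'Setembro', 'Outubro', 'Novembro', 'Dezembro',
]


def time_attributes(year, month):
    """Return the dim_time tuple for one (year, month) pair."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    trimestre = (month - 1) // 3 + 1   # months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
    semestre = 1 if month <= 6 else 2  # first or second half of the year
    return (
        year,
        month,
        trimestre,
        semestre,
        MESES_NOMES[month - 1],
        f"{year}-{month:02d}",
        f"{year}-Q{trimestre}",
    )


print(time_attributes(2020, 8))
# (2020, 8, 3, 2, 'Agosto', '2020-08', '2020-Q3')
```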


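## 8. Carrier Dimension (dim_carrier)

Populates the carrier dimension. Deduplication is on the (`carrier`, `carrier_name`) pair while the upsert is keyed on `carrier_code`, so a code that appears with more than one name contributes several tuples but ends up as a single row. That would explain why this cell reports 23 records while section 10 maps only 21 carriers.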

In [20]:
print("Populando dim_carrier...")

carriers = df[['carrier', 'carrier_name']].drop_duplicates()

data = []
for _, row in carriers.iterrows():
    carrier_code = str(row['carrier']) if pd.notna(row['carrier']) else 'UNKNOWN'
    carrier_name = str(row['carrier_name']) if pd.notna(row['carrier_name']) else 'Unknown Carrier'
    data.append((carrier_code, carrier_name))

execute_batch(cur, """
    INSERT INTO dw.dim_carrier (carrier_code, carrier_name)
    VALUES (%s, %s)
    ON CONFLICT (carrier_code) DO UPDATE SET
        carrier_name = EXCLUDED.carrier_name,
        data_atualizacao = NOW()
""", data)
conn.commit()

print(f"dim_carrier populada com {len(data):,} registros")

Populando dim_carrier...
dim_carrier populada com 23 registros
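
The gap between the number of tuples sent and the number of rows stored can be reproduced in plain Python. The sketch below uses made-up codes and names (not taken from the Silver table) and a dictionary as a stand-in for the table, with "last write wins" mimicking `ON CONFLICT (carrier_code) DO UPDATE`.

```python
# Hypothetical sample: the code 'XX' appears with two name spellings.
rows = [
    ('XX', 'Example Air'),
    ('XX', 'Example Air Inc.'),
    ('YY', 'Sample Airways'),
    ('YY', 'Sample Airways'),  # exact duplicate, removed by the pair-level dedup
]

# What the cell counts: distinct (code, name) pairs, in first-seen order.
pairs = list(dict.fromkeys(rows))

# What the table ends up holding: one row per code, the last name wins.
table = {}
for code, name in pairs:
    table[code] = name

print(len(pairs), len(table))  # 3 2
```

Counting `len(table)` rather than `len(pairs)`, or deduplicating on the code alone before inserting, would make the printed total match the dimension.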


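## 9. Airport Dimension (dim_airport)

Populates the airport dimension. As with carriers, deduplication is on the (`airport`, `airport_name`) pair while the upsert is keyed on `airport_code`, which would explain why this cell reports 419 records while section 10 maps only 395 airports.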

In [21]:
print("Populando dim_airport...")

airports = df[['airport', 'airport_name']].drop_duplicates()

data = []
for _, row in airports.iterrows():
    airport_code = str(row['airport']) if pd.notna(row['airport']) else 'UNKNOWN'
    airport_name = str(row['airport_name']) if pd.notna(row['airport_name']) else 'Unknown Airport'
    data.append((airport_code, airport_name))

execute_batch(cur, """
    INSERT INTO dw.dim_airport (airport_code, airport_name)
    VALUES (%s, %s)
    ON CONFLICT (airport_code) DO UPDATE SET
        airport_name = EXCLUDED.airport_name,
        data_atualizacao = NOW()
""", data)
conn.commit()

print(f"dim_airport populada com {len(data):,} registros")

Populando dim_airport...
dim_airport populada com 419 registros
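
To see which codes carry more than one name before loading, the pairs can be grouped by code. This is a sketch under the assumption that the pairs are available as plain tuples; the function name `codes_with_multiple_names` and the sample values are hypothetical.

```python
from collections import defaultdict


def codes_with_multiple_names(pairs):
    """Return {code: sorted names} for codes mapped to more than one name."""
    names_by_code = defaultdict(set)
    for code, name in pairs:
        names_by_code[code].add(name)
    return {
        code: sorted(names)
        for code, names in names_by_code.items()
        if len(names) > 1
    }


# Hypothetical sample data, not taken from the Silver table.
sample = [
    ('AAA', 'Example City, EX: Example Field'),
    ('AAA', 'Example City, EX: Example Regional'),
    ('BBB', 'Sample Town, ST: Sample Municipal'),
]
conflicts = codes_with_multiple_names(sample)
print(conflicts)
```

In the notebook, the input could come from `airports.itertuples(index=False, name=None)`, which yields one plain tuple per row.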


## 10. Fetch Surrogate Keys (SRK)

Builds dictionaries that map natural codes to the surrogate keys of the three dimensions.

In [22]:
print("Buscando surrogate keys...")

cur.execute("SELECT carrier_code, srk_carrier FROM dw.dim_carrier")
carrier_to_key = dict(cur.fetchall())
print(f"  {len(carrier_to_key)} carriers mapeados")

cur.execute("SELECT airport_code, srk_airport FROM dw.dim_airport")
airport_to_key = dict(cur.fetchall())
print(f"  {len(airport_to_key)} airports mapeados")

cur.execute("SELECT year, month, srk_time FROM dw.dim_time")
time_to_key = {(year, month): key for year, month, key in cur.fetchall()}
print(f"  {len(time_to_key)} períodos mapeados")

Buscando surrogate keys...
  21 carriers mapeados
  395 airports mapeados
  121 períodos mapeados
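
The lookup pattern itself needs nothing beyond built-in dictionaries. The sketch below uses hypothetical rows shaped like the results of the queries above to show how a two-column result converts directly, how the composite (year, month) key is built, and what a missing code returns.

```python
# Hypothetical rows, shaped like cur.fetchall() results from the queries above.
carrier_rows = [('XX', 1), ('YY', 2)]
time_rows = [(2020, 1, 10), (2020, 2, 11)]

# A two-column result converts directly into a dict.
carrier_to_key = dict(carrier_rows)

# The time dimension has a composite natural key, so a comprehension is needed.
time_to_key = {(year, month): key for year, month, key in time_rows}

print(carrier_to_key.get('YY'))    # 2
print(time_to_key.get((2020, 2)))  # 11
print(carrier_to_key.get('ZZ'))    # None: no matching dimension row
```

Using `.get` instead of indexing is what lets the fact load count unmatched rows rather than fail on the first missing key.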


## 11. Fact Table (fact_flight_delays)

Populates the fact table with **one row per** carrier-airport-time combination, keeping every delay cause as its own column.

In [23]:
print("Populando fact_flight_delays...")

data = []
skipped = 0

for _, row in df.iterrows():
    carrier_srk = carrier_to_key.get(str(row['carrier']))
    airport_srk = airport_to_key.get(str(row['airport']))
    time_srk = time_to_key.get((int(row['year']), int(row['month'])))
    
    if carrier_srk is None or airport_srk is None or time_srk is None:
        skipped += 1
        continue
    
    # Convert values to float, replacing NaN with 0
    arr_flights = float(row['arr_flights']) if pd.notna(row['arr_flights']) else 0
    arr_del15 = float(row['arr_del15']) if pd.notna(row['arr_del15']) else 0
    arr_cancelled = float(row['arr_cancelled']) if pd.notna(row['arr_cancelled']) else 0
    arr_diverted = float(row['arr_diverted']) if pd.notna(row['arr_diverted']) else 0
    arr_delay = float(row['arr_delay']) if pd.notna(row['arr_delay']) else 0
    
    carrier_ct = float(row['carrier_ct']) if pd.notna(row['carrier_ct']) else 0
    weather_ct = float(row['weather_ct']) if pd.notna(row['weather_ct']) else 0
    nas_ct = float(row['nas_ct']) if pd.notna(row['nas_ct']) else 0
    security_ct = float(row['security_ct']) if pd.notna(row['security_ct']) else 0
    late_aircraft_ct = float(row['late_aircraft_ct']) if pd.notna(row['late_aircraft_ct']) else 0
    
    carrier_delay = float(row['carrier_delay']) if pd.notna(row['carrier_delay']) else 0
    weather_delay = float(row['weather_delay']) if pd.notna(row['weather_delay']) else 0
    nas_delay = float(row['nas_delay']) if pd.notna(row['nas_delay']) else 0
    security_delay = float(row['security_delay']) if pd.notna(row['security_delay']) else 0
    late_aircraft_delay = float(row['late_aircraft_delay']) if pd.notna(row['late_aircraft_delay']) else 0
    
    # Compute derived metrics (percentages); with no flights every rate is 0, so on_time_rate is 100
    delay_rate = (arr_del15 / arr_flights * 100) if arr_flights > 0 else 0
    cancellation_rate = (arr_cancelled / arr_flights * 100) if arr_flights > 0 else 0
    diversion_rate = (arr_diverted / arr_flights * 100) if arr_flights > 0 else 0
    avg_delay_minutes = (arr_delay / arr_flights) if arr_flights > 0 else 0
    on_time_rate = 100 - delay_rate
    
    data.append((
        carrier_srk,
        airport_srk,
        time_srk,
        arr_flights,
        arr_del15,
        arr_cancelled,
        arr_diverted,
        arr_delay,
        carrier_ct,
        weather_ct,
        nas_ct,
        security_ct,
        late_aircraft_ct,
        carrier_delay,
        weather_delay,
        nas_delay,
        security_delay,
        late_aircraft_delay,
        round(delay_rate, 2),
        round(cancellation_rate, 2),
        round(diversion_rate, 2),
        round(avg_delay_minutes, 2),
        round(on_time_rate, 2)
    ))

print(f"  Processando {len(data):,} registros...")

execute_batch(cur, """
    INSERT INTO dw.fact_flight_delays 
    (srk_carrier, srk_airport, srk_time,
     arr_flights, arr_del15, arr_cancelled, arr_diverted, arr_delay,
     carrier_ct, weather_ct, nas_ct, security_ct, late_aircraft_ct,
     carrier_delay, weather_delay, nas_delay, security_delay, late_aircraft_delay,
     delay_rate, cancellation_rate, diversion_rate, avg_delay_minutes, on_time_rate)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
""", data, page_size=1000)
conn.commit()

print(f"fact_flight_delays populada com {len(data):,} registros")
if skipped > 0:
    print(f"AVISO: {skipped} registros ignorados (FKs inválidas)")

Populando fact_flight_delays...
  Processando 171,666 registros...
fact_flight_delays populada com 171,666 registros
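
The derived metrics can be factored into a small function and checked by hand. The sketch below follows the same rules as the cell above (NaN becomes 0, rates are 0 when there are no flights, values are rounded to two decimals); the helper names `to_number` and `compute_rates` are hypothetical.

```python
import math


def to_number(value):
    """Return value as float, mapping None and NaN to 0.0."""
    if value is None:
        return 0.0
    value = float(value)
    return 0.0 if math.isnan(value) else value


def compute_rates(arr_flights, arr_del15, arr_cancelled, arr_diverted, arr_delay):
    """Compute the five derived metrics for one fact row."""
    flights = to_number(arr_flights)
    if flights > 0:
        delay_rate = to_number(arr_del15) / flights * 100
        cancellation_rate = to_number(arr_cancelled) / flights * 100
        diversion_rate = to_number(arr_diverted) / flights * 100
        avg_delay_minutes = to_number(arr_delay) / flights
    else:
        # No flights: every rate falls back to 0, as in the notebook cell.
        delay_rate = cancellation_rate = diversion_rate = avg_delay_minutes = 0.0
    return {
        'delay_rate': round(delay_rate, 2),
        'cancellation_rate': round(cancellation_rate, 2),
        'diversion_rate': round(diversion_rate, 2),
        'avg_delay_minutes': round(avg_delay_minutes, 2),
        'on_time_rate': round(100 - delay_rate, 2),
    }


print(compute_rates(100, 20, 5, 1, 1500))
# {'delay_rate': 20.0, 'cancellation_rate': 5.0, 'diversion_rate': 1.0,
#  'avg_delay_minutes': 15.0, 'on_time_rate': 80.0}
```

One consequence of these rules is that a row with zero or missing `arr_flights` gets an `on_time_rate` of 100, which may be worth revisiting if such rows should be excluded from averages.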


## 12. Close the Connection

Closes the cursor and the database connection.

In [24]:
cur.close()
conn.close()
print("Conexão encerrada.")

Conexão encerrada.
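
Because the cursor and connection are only closed in this last cell, an exception in an earlier step leaves them open. When the ETL runs as a single script rather than cell by cell, `contextlib.closing` is one way to guarantee cleanup. The sketch below uses the standard library's `sqlite3` as a stand-in for psycopg2 so that it runs without a server; both follow the Python DB-API, but note that sqlite3 uses `?` placeholders where psycopg2 uses `%s`, and the `demo` table is made up.

```python
import sqlite3
from contextlib import closing

# sqlite3 stands in for psycopg2 here; connect/cursor/commit/close work alike.
with closing(sqlite3.connect(':memory:')) as conn:
    with closing(conn.cursor()) as cur:
        cur.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, label TEXT)")
        cur.execute("INSERT INTO demo (label) VALUES (?)", ("ok",))
        conn.commit()
        cur.execute("SELECT COUNT(*) FROM demo")
        row_count = cur.fetchone()[0]

# Both the cursor and the connection are closed here, even if a step had raised.
print(row_count)  # 1
```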
