# Data Processing and Cleaning

This notebook cleans and standardizes the data from the **Bronze** layer, preparing it for the **Silver** layer of the *medallion* architecture.

The goal is to turn raw data into data that is **clean, consistent, and ready for analysis**.


In [1]:
# Required libraries
import pandas as pd
import os
import warnings

warnings.filterwarnings("ignore", category=RuntimeWarning)

In [2]:
# Read the original CSV
df_bronze = pd.read_csv("/kaggle/input/uber-analise/ncr_ride_bookings.csv")

df_bronze.head()

Unnamed: 0,Date,Time,Booking ID,Booking Status,Customer ID,Vehicle Type,Pickup Location,Drop Location,Avg VTAT,Avg CTAT,...,Reason for cancelling by Customer,Cancelled Rides by Driver,Driver Cancellation Reason,Incomplete Rides,Incomplete Rides Reason,Booking Value,Ride Distance,Driver Ratings,Customer Rating,Payment Method
0,2024-03-23,12:29:38,"""CNR5884300""",No Driver Found,"""CID1982111""",eBike,Palam Vihar,Jhilmil,,,...,,,,,,,,,,
1,2024-11-29,18:01:39,"""CNR1326809""",Incomplete,"""CID4604802""",Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,...,,,,1.0,Vehicle Breakdown,237.0,5.73,,,UPI
2,2024-08-23,08:56:10,"""CNR8494506""",Completed,"""CID9202816""",Auto,Khandsa,Malviya Nagar,13.4,25.8,...,,,,,,627.0,13.58,4.9,4.9,Debit Card
3,2024-10-21,17:17:25,"""CNR8906825""",Completed,"""CID2610914""",Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,...,,,,,,416.0,34.02,4.6,5.0,UPI
4,2024-09-16,22:08:00,"""CNR1950162""",Completed,"""CID9933542""",Bike,Ghitorni Village,Khan Market,5.3,19.6,...,,,,,,737.0,48.21,4.1,4.3,UPI


## Column Names

Column names are standardized for consistency and compatibility with the database.  
This step adjusts the names to the convention defined for the data warehouse (snake_case style, with spaces and special characters removed).

In [3]:
# Create a copy of the Bronze layer
df_silver = df_bronze.copy()

# Rename columns to the unified (Silver) convention
df_silver.rename(columns={
    'Booking ID': 'Booking_ID',
    'Booking Status': 'Booking_Status',
    'Customer ID': 'Customer_ID',
    'Vehicle Type': 'Vehicle_Type',
    'Pickup Location': 'Pickup_Location',
    'Drop Location': 'Drop_Location',
    'Avg VTAT': 'Avg_VTAT',
    'Avg CTAT': 'Avg_CTAT',
    'Reason for cancelling by Customer': 'Reason_for_cancelling_by_Customer',
    'Driver Cancellation Reason': 'Driver_Cancellation_Reason',
    'Incomplete Rides Reason': 'Incomplete_Rides_Reason',
    'Booking Value': 'Booking_Value',
    'Ride Distance': 'Ride_Distance',
    'Driver Ratings': 'Driver_Ratings',
    'Customer Rating': 'Customer_Rating',
    'Payment Method': 'Payment_Method'
}, inplace=True)

df_silver.head()


Unnamed: 0,Date,Time,Booking_ID,Booking_Status,Customer_ID,Vehicle_Type,Pickup_Location,Drop_Location,Avg_VTAT,Avg_CTAT,...,Reason_for_cancelling_by_Customer,Cancelled Rides by Driver,Driver_Cancellation_Reason,Incomplete Rides,Incomplete_Rides_Reason,Booking_Value,Ride_Distance,Driver_Ratings,Customer_Rating,Payment_Method
0,2024-03-23,12:29:38,"""CNR5884300""",No Driver Found,"""CID1982111""",eBike,Palam Vihar,Jhilmil,,,...,,,,,,,,,,
1,2024-11-29,18:01:39,"""CNR1326809""",Incomplete,"""CID4604802""",Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,...,,,,1.0,Vehicle Breakdown,237.0,5.73,,,UPI
2,2024-08-23,08:56:10,"""CNR8494506""",Completed,"""CID9202816""",Auto,Khandsa,Malviya Nagar,13.4,25.8,...,,,,,,627.0,13.58,4.9,4.9,Debit Card
3,2024-10-21,17:17:25,"""CNR8906825""",Completed,"""CID2610914""",Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,...,,,,,,416.0,34.02,4.6,5.0,UPI
4,2024-09-16,22:08:00,"""CNR1950162""",Completed,"""CID9933542""",Bike,Ghitorni Village,Khan Market,5.3,19.6,...,,,,,,737.0,48.21,4.1,4.3,UPI
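
The explicit mapping above could also be derived from a rule. The sketch below is an illustration only: `standardize_column_name` is a hypothetical helper, not part of the notebook, and it assumes the convention is "replace every run of non-alphanumeric characters with one underscore, keep the original capitalization".

```python
import re

import pandas as pd


def standardize_column_name(name: str) -> str:
    """Hypothetical helper: collapse runs of non-alphanumeric characters into '_'."""
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())
    return cleaned.strip("_")


# Toy frame with a few of the original Bronze column names
example = pd.DataFrame(
    columns=["Booking ID", "Avg VTAT", "Reason for cancelling by Customer"]
)
example = example.rename(columns=standardize_column_name)
print(list(example.columns))
```

Unlike the explicit dictionary, a rule like this would also rename the flag columns (for example Incomplete Rides), so the names used in later cells would have to follow suit.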


## Removing Redundant Columns

The flag columns carry information that is already represented by the Booking_Status variable. Keeping them would be redundant and could introduce semantic duplication or inconsistencies in the analyses.

In [4]:
# Drop the redundant flag columns
df_silver.drop(columns=[
    'Cancelled Rides by Customer', 
    'Cancelled Rides by Driver', 
    'Incomplete Rides'
], inplace=True, errors='ignore')
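
Before dropping a flag, its redundancy can be checked directly. The sketch below uses toy data shaped like the preview rows (a flag of 1.0 on rows whose status is Incomplete) and assumes the flag should be set exactly on those rows; it is a verification idea, not a step the notebook performs.

```python
import pandas as pd

# Toy data mirroring the preview: the flag is 1.0 only on 'Incomplete' rows
toy = pd.DataFrame({
    "Booking_Status": ["Completed", "Incomplete", "No Driver Found", "Incomplete"],
    "Incomplete Rides": [None, 1.0, None, 1.0],
})

flag_set = toy["Incomplete Rides"].notna()
status_match = toy["Booking_Status"].eq("Incomplete")

# The flag is redundant if it is set on exactly the rows with the matching status
is_redundant = bool((flag_set == status_match).all())
print(is_redundant)
```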

## Text Treatment

Text cleanup is applied to the columns that need it.

In [5]:
# Strip the unnecessary double quotes from the ID columns
df_silver['Booking_ID'] = df_silver['Booking_ID'].astype(str).str.replace('"', '', regex=False)
df_silver['Customer_ID'] = df_silver['Customer_ID'].astype(str).str.replace('"', '', regex=False)
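
Depending on the pandas version, astype(str) can turn a missing ID into the literal text 'nan', which would then no longer count as missing. If the ID columns might contain nulls (an assumption, since the preview shows none), a variant that goes through the .str accessor alone leaves missing values untouched. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy IDs: two quoted values and one missing entry
ids = pd.Series(['"CNR5884300"', np.nan, '"CNR1326809"'])

# .str.replace works element-wise on strings and propagates missing values
cleaned = ids.str.replace('"', '', regex=False)
print(cleaned.isna().tolist())
```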

## Removing Rows with a Duplicated Unique Identifier

An error in the dataset allows some rows to share the same Booking_ID.

In [6]:
# Drop rows whose 'Booking_ID' is duplicated, keeping the first occurrence
initial_count = len(df_silver)
df_silver.drop_duplicates(subset=['Booking_ID'], keep='first', inplace=True)
final_count = len(df_silver)
print(f"Number of duplicated rows removed: {initial_count - final_count}")

Number of duplicated rows removed: 1233
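
It can help to look at the colliding rows before discarding them. The following sketch, on toy data with made-up IDs, lists every row that shares a Booking_ID and counts how many rows keep='first' would remove.

```python
import pandas as pd

# Toy data: 'CNR1' appears twice (IDs are invented for illustration)
toy = pd.DataFrame({
    "Booking_ID": ["CNR1", "CNR2", "CNR1", "CNR3"],
    "Booking_Status": ["Completed", "Incomplete", "Completed", "Completed"],
})

# keep=False marks every member of a duplicated group, not only the repeats
dup_mask = toy.duplicated(subset=["Booking_ID"], keep=False)
duplicate_rows = toy[dup_mask].sort_values("Booking_ID")

# keep='first' marks only the rows that drop_duplicates would remove
n_removed = int(toy.duplicated(subset=["Booking_ID"], keep="first").sum())
print(len(duplicate_rows), n_removed)
```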


## Converting the Date and Time Columns

In [7]:
# Convert Date (standard YYYY-MM-DD format)
df_silver['Date'] = pd.to_datetime(df_silver['Date'], errors='coerce').dt.date

# Convert Time with an explicit format
df_silver['Time'] = pd.to_datetime(df_silver['Time'], format='%H:%M:%S', errors='raise').dt.time
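
Because errors='coerce' silently turns unparseable dates into NaT, it may be worth counting how many values were lost in the conversion. A minimal sketch on toy strings (the invalid entry is invented for illustration):

```python
import pandas as pd

# Toy input: two valid dates and one invalid entry
raw = pd.Series(["2024-03-23", "2024-11-29", "not a date"])
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

# Values that were present in the input but became NaT after parsing
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print(n_failed)
```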

## Outlier Removal Using the IQR (Interquartile Range) Method

This step identifies and removes extreme values in the numeric columns of the Silver dataset to ensure statistical consistency:

Columns analyzed: Avg_VTAT, Avg_CTAT, Booking_Value, Ride_Distance

Method used: Interquartile Range (IQR)

Compute Q1 (25th percentile) and Q3 (75th percentile) for each column.

Compute IQR = Q3 - Q1.

Define the acceptable limits: [Q1 - 1.5*IQR, Q3 + 1.5*IQR].

Remove rows with values outside those limits. Rows with a missing value in any of these columns are removed as well, because comparisons against NaN evaluate to False.
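
As a worked illustration of the arithmetic, the sketch below applies the same limits to six toy values (not taken from the dataset), using the default linear interpolation of pandas quantiles.

```python
import pandas as pd

# Toy values: five close together and one extreme
values = pd.Series([10, 12, 14, 16, 18, 100], name="toy_column")

q1 = values.quantile(0.25)        # 12.5
q3 = values.quantile(0.75)        # 17.5
iqr = q3 - q1                     # 5.0
lower_bound = q1 - 1.5 * iqr      # 5.0
upper_bound = q3 + 1.5 * iqr      # 25.0

outliers = values[(values < lower_bound) | (values > upper_bound)]
print(lower_bound, upper_bound, outliers.tolist())
```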

In [8]:
# List of numeric columns to analyze
numeric_cols = ['Avg_VTAT', 'Avg_CTAT', 'Booking_Value', 'Ride_Distance']

def remove_outliers_iqr(df, cols):
    cleaned_df = df.copy()
    outlier_counts = {}

    for col in cols:
        if col not in df.columns:
            continue

        # Compute Q1, Q3, and the IQR on the rows that are still left
        Q1 = cleaned_df[col].quantile(0.25)
        Q3 = cleaned_df[col].quantile(0.75)
        IQR = Q3 - Q1

        # Define the lower and upper limits
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Filter the rows and count how many were removed
        # (rows with NaN in this column fail both comparisons and are dropped too)
        initial_len = len(cleaned_df)
        cleaned_df = cleaned_df[(cleaned_df[col] >= lower_bound) & (cleaned_df[col] <= upper_bound)]
        final_len = len(cleaned_df)
        outlier_counts[col] = initial_len - final_len

    return cleaned_df, outlier_counts


# Apply the function
df_silver_clean, outlier_summary = remove_outliers_iqr(df_silver, numeric_cols)

print("✅ Outliers removed using the IQR method:")
for col, count in outlier_summary.items():
    print(f"  • {col}: {count} rows removed")

print(f"\n📉 Rows remaining after cleaning: {len(df_silver_clean)} (of {len(df_silver)})")

# Replace the old DataFrame with the cleaned one
df_silver = df_silver_clean


✅ Outliers removed using the IQR method:
  • Avg_VTAT: 10401 rows removed
  • Avg_CTAT: 37191 rows removed
  • Booking_Value: 3410 rows removed
  • Ride_Distance: 0 rows removed

📉 Rows remaining after cleaning: 97765 (of 148767)
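
The counts above include rows whose value is missing, such as the No Driver Found booking in the first preview, which has no Avg_VTAT. If the intent were to remove only genuine extremes and leave missing values for the imputation step, the filter could keep NaN explicitly. The sketch below is one possible variant, not the notebook's method; `iqr_keep_mask` is a hypothetical helper and the data is a toy example.

```python
import numpy as np
import pandas as pd


def iqr_keep_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: True for values inside the IQR limits or missing."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)  # quantile() skips NaN
    iqr = q3 - q1
    in_range = series.between(q1 - k * iqr, q3 + k * iqr)  # NaN yields False here
    return in_range | series.isna()


toy = pd.DataFrame({"Avg_VTAT": [10, 12, 14, 16, 18, 100, np.nan]})
kept = toy[iqr_keep_mask(toy["Avg_VTAT"])]
print(len(toy), len(kept))
```

With this mask only the extreme value 100 is dropped, while the row with the missing value is kept.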


## Handling Missing Values (NaN)

This step handles null values to ensure the consistency and integrity of the Silver dataset:

**Cancellation reasons:** missing values are replaced with "Reason Unknown".

**Numeric columns:** values are converted to a numeric type, with unparseable entries coerced to NaN, and nulls are replaced with the column mean.

**Payment method:** missing values are filled with the mode (the most frequent value).

In [9]:
# Replace missing values with a default label in the reason columns
for col in ['Incomplete_Rides_Reason', 'Driver_Cancellation_Reason', 'Reason_for_cancelling_by_Customer']:
    if col in df_silver.columns:
        df_silver[col] = df_silver[col].fillna('Reason Unknown')

# Convert the numeric columns to a numeric dtype (invalid entries become NaN)
num_cols = ['Avg_VTAT', 'Avg_CTAT', 'Booking_Value', 'Ride_Distance', 'Driver_Ratings', 'Customer_Rating']
for col in num_cols:
    if col in df_silver.columns:
        df_silver[col] = pd.to_numeric(df_silver[col], errors='coerce')

# Impute the column mean in the numeric columns
for col in num_cols:
    if col in df_silver.columns:
        df_silver[col] = df_silver[col].fillna(df_silver[col].mean())

# Impute the mode in the payment method column (if it exists)
if 'Payment_Method' in df_silver.columns:
    mode_value = df_silver['Payment_Method'].mode(dropna=True)
    if not mode_value.empty:
        df_silver['Payment_Method'] = df_silver['Payment_Method'].fillna(mode_value[0])
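
After mean imputation, a filled-in rating is indistinguishable from an observed one. One optional way to preserve that information is to record an indicator column before filling. This is a sketch under that assumption, on toy data; the `_was_missing` suffix is a hypothetical naming choice, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Toy ratings with two missing entries
toy = pd.DataFrame({"Driver_Ratings": [4.0, np.nan, 5.0, np.nan]})

# Record which rows were missing before the mean overwrites them
toy["Driver_Ratings_was_missing"] = toy["Driver_Ratings"].isna()
toy["Driver_Ratings"] = toy["Driver_Ratings"].fillna(toy["Driver_Ratings"].mean())
print(toy["Driver_Ratings"].tolist())
```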

In [10]:
# Show a sample of the transformed data
df_silver.head(10)

Unnamed: 0,Date,Time,Booking_ID,Booking_Status,Customer_ID,Vehicle_Type,Pickup_Location,Drop_Location,Avg_VTAT,Avg_CTAT,Reason_for_cancelling_by_Customer,Driver_Cancellation_Reason,Incomplete_Rides_Reason,Booking_Value,Ride_Distance,Driver_Ratings,Customer_Rating,Payment_Method
1,2024-11-29,18:01:39,CNR1326809,Incomplete,CID4604802,Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,Reason Unknown,Reason Unknown,Vehicle Breakdown,237.0,5.73,4.230757,4.403841,UPI
2,2024-08-23,08:56:10,CNR8494506,Completed,CID9202816,Auto,Khandsa,Malviya Nagar,13.4,25.8,Reason Unknown,Reason Unknown,Reason Unknown,627.0,13.58,4.9,4.9,Debit Card
3,2024-10-21,17:17:25,CNR8906825,Completed,CID2610914,Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,Reason Unknown,Reason Unknown,Reason Unknown,416.0,34.02,4.6,5.0,UPI
4,2024-09-16,22:08:00,CNR1950162,Completed,CID9933542,Bike,Ghitorni Village,Khan Market,5.3,19.6,Reason Unknown,Reason Unknown,Reason Unknown,737.0,48.21,4.1,4.3,UPI
5,2024-02-06,09:44:56,CNR4096693,Completed,CID4670564,Auto,AIIMS,Narsinghpur,5.1,18.1,Reason Unknown,Reason Unknown,Reason Unknown,316.0,4.85,4.1,4.6,UPI
6,2024-06-17,15:45:58,CNR2002539,Completed,CID6800553,Go Mini,Vaishali,Punjabi Bagh,7.1,20.4,Reason Unknown,Reason Unknown,Reason Unknown,640.0,41.24,4.0,4.1,UPI
7,2024-03-19,17:37:37,CNR6568000,Completed,CID8610436,Auto,Mayur Vihar,Cyber Hub,12.1,16.5,Reason Unknown,Reason Unknown,Reason Unknown,136.0,6.56,4.4,4.2,UPI
9,2024-12-16,19:06:48,CNR7721892,Incomplete,CID5214275,Auto,Rohini,Adarsh Nagar,6.1,26.0,Reason Unknown,Reason Unknown,Other Issue,135.0,10.36,4.230757,4.403841,Cash
10,2024-06-14,16:24:12,CNR9070334,Completed,CID6680340,Auto,Udyog Bhawan,Dwarka Sector 21,7.7,18.9,Reason Unknown,Reason Unknown,Reason Unknown,181.0,19.84,4.2,4.9,Cash
13,2024-09-11,19:29:39,CNR2987763,Completed,CID2669710,Go Mini,Malviya Nagar,Ghitorni Village,12.2,28.2,Reason Unknown,Reason Unknown,Reason Unknown,394.0,21.44,4.1,4.7,UPI
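
Beyond eyeballing a sample, a quick check can confirm that no nulls remain in the treated columns. A minimal sketch, where `remaining_nulls` is a hypothetical helper and the frame is a toy stand-in for df_silver:

```python
import pandas as pd


def remaining_nulls(df: pd.DataFrame, cols: list) -> pd.Series:
    """Hypothetical helper: null count per listed column that exists in df."""
    present = [c for c in cols if c in df.columns]
    return df[present].isna().sum()


toy = pd.DataFrame({
    "Booking_Value": [237.0, 627.0],
    "Payment_Method": ["UPI", "Cash"],
})
nulls = remaining_nulls(toy, ["Booking_Value", "Payment_Method", "Not_A_Column"])
print(int(nulls.sum()))
```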


## Exporting the Cleaned Data

In [11]:
# Path where the Silver layer is saved
SILVER_PATH = os.path.join('Data Layer', 'silver', 'uber_silver.csv')
os.makedirs(os.path.dirname(SILVER_PATH), exist_ok=True)

df_silver.to_csv(SILVER_PATH, index=False)
print(f"Transformed data saved to '{SILVER_PATH}'.")

Transformed data saved to 'Data Layer/silver/uber_silver.csv'.
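
CSV does not store column types, so the Date values come back as plain strings unless they are parsed again when the file is read. The sketch below writes a toy frame to a temporary directory (so it does not touch the real Silver path) and reloads it to show a round-trip check.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_silver
toy = pd.DataFrame({
    "Date": ["2024-11-29", "2024-08-23"],
    "Booking_ID": ["CNR1326809", "CNR8494506"],
    "Booking_Value": [237.0, 627.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "silver", "uber_silver.csv")
    os.makedirs(os.path.dirname(path), exist_ok=True)
    toy.to_csv(path, index=False)
    # parse_dates restores the datetime type that CSV cannot store
    reloaded = pd.read_csv(path, parse_dates=["Date"])

same_shape = reloaded.shape == toy.shape
print(same_shape, reloaded["Date"].dtype)
```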
