Initial imports

In [1]:
import pandas as pd
import numpy as np

Reading the data

In [2]:
df = pd.read_csv('../DataLayer/raw/vehicle_price_prediction.csv')
df.head()

Unnamed: 0,make,model,year,mileage,engine_hp,transmission,fuel_type,drivetrain,body_type,exterior_color,interior_color,owner_count,accident_history,seller_type,condition,trim,vehicle_age,mileage_per_year,brand_popularity,price
0,Volkswagen,Jetta,2016,183903,173,Manual,Electric,RWD,Sedan,Blue,Brown,5,,Dealer,Excellent,EX,9,20433.666667,0.040054,7208.52
1,Lexus,RX,2010,236643,352,Manual,Gasoline,FWD,Sedan,Silver,Beige,5,Minor,Dealer,Good,LX,15,15776.2,0.039921,6911.81
2,Subaru,Crosstrek,2016,103199,188,Automatic,Diesel,AWD,Sedan,Silver,Beige,5,,Dealer,Excellent,Touring,9,11466.555556,0.04023,11915.63
3,Cadillac,Lyriq,2016,118889,338,Manual,Gasoline,AWD,SUV,Black,Gray,3,,Private,Good,Base,9,13209.888889,0.039847,25984.79
4,Toyota,Highlander,2018,204170,196,Manual,Diesel,FWD,Sedan,Red,Brown,5,Minor,Dealer,Excellent,Sport,7,29167.142857,0.039627,8151.3


A closer look at the dataset

In [3]:
print("Dataset information:")
df.info()  # info() prints its report and returns None, so it is not wrapped in print()
print(df.shape)


Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 20 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   make              1000000 non-null  object 
 1   model             1000000 non-null  object 
 2   year              1000000 non-null  int64  
 3   mileage           1000000 non-null  int64  
 4   engine_hp         1000000 non-null  int64  
 5   transmission      1000000 non-null  object 
 6   fuel_type         1000000 non-null  object 
 7   drivetrain        1000000 non-null  object 
 8   body_type         1000000 non-null  object 
 9   exterior_color    1000000 non-null  object 
 10  interior_color    1000000 non-null  object 
 11  owner_count       1000000 non-null  int64  
 12  accident_history  249867 non-null   object 
 13  seller_type       1000000 non-null  object 
 14  condition         1000000 non-null  object 
 15  trim              1000000 

Handling missing values

In [4]:
df.isnull().sum().sort_values(ascending=False)

accident_history    750133
make                     0
year                     0
model                    0
mileage                  0
engine_hp                0
fuel_type                0
transmission             0
body_type                0
exterior_color           0
interior_color           0
drivetrain               0
owner_count              0
seller_type              0
condition                0
trim                     0
vehicle_age              0
mileage_per_year         0
brand_popularity         0
price                    0
dtype: int64
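
The counts above are easier to judge as a share of the rows: 750,133 of 1,000,000 is about 75% of `accident_history`. A minimal sketch of a small report helper follows; `missing_report` and the toy frame are hypothetical names used only for illustration and are not part of the original notebook.

```python
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages for the columns that contain nulls."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one null, worst first
    return report[report["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for df (assumption: same column names as the dataset)
toy = pd.DataFrame({
    "make": ["Toyota", "Lexus", "Subaru", "Cadillac"],
    "accident_history": [None, "Minor", None, None],
})
report = missing_report(toy)
print(report)
```

On the real data, `missing_report(df)` would be expected to list only `accident_history`.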

Null counts after filling

In [5]:
df['accident_history'] = df['accident_history'].fillna('None')
df.isnull().sum().sort_values(ascending=False)

make                0
model               0
year                0
mileage             0
engine_hp           0
transmission        0
fuel_type           0
drivetrain          0
body_type           0
exterior_color      0
interior_color      0
owner_count         0
accident_history    0
seller_type         0
condition           0
trim                0
vehicle_age         0
mileage_per_year    0
brand_popularity    0
price               0
dtype: int64

Adjusting data types

In [6]:
categorical_cols = ['make', 'model', 'fuel_type', 'transmission', 'drivetrain', 
                    'body_type', 'exterior_color', 'interior_color', 
                    'accident_history',
                    'seller_type', 'condition', 'trim']

for col in categorical_cols:
    df[col] = df[col].astype('category')

print("\nData types after conversion:")
df.info()  # info() prints directly; wrapping it in print() would also print "None"


Data types after conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 20 columns):
 #   Column            Non-Null Count    Dtype   
---  ------            --------------    -----   
 0   make              1000000 non-null  category
 1   model             1000000 non-null  category
 2   year              1000000 non-null  int64   
 3   mileage           1000000 non-null  int64   
 4   engine_hp         1000000 non-null  int64   
 5   transmission      1000000 non-null  category
 6   fuel_type         1000000 non-null  category
 7   drivetrain        1000000 non-null  category
 8   body_type         1000000 non-null  category
 9   exterior_color    1000000 non-null  category
 10  interior_color    1000000 non-null  category
 11  owner_count       1000000 non-null  int64   
 12  accident_history  1000000 non-null  category
 13  seller_type       1000000 non-null  category
 14  condition         1000000 non-null  category
 15  t
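
The conversion to `category` mainly pays off in memory, since each distinct label is stored once and the rows hold small integer codes. A rough way to check this, sketched on a toy column rather than the real dataset (the `colors` series is an assumption for illustration):

```python
import pandas as pd

# Toy column: few distinct values repeated many times, like exterior_color
colors = pd.Series(["Blue", "Silver", "Black", "Red"] * 25_000)

# deep=True counts the Python string objects, not just the pointers to them
as_object = colors.astype("object").memory_usage(deep=True)
as_category = colors.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The same comparison could be run on the notebook's frame with `df.memory_usage(deep=True)` before and after the loop over `categorical_cols`.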

Validating the results

In [7]:
print("\nValidating transformations:")
print(f"Total records preserved: {len(df):,}")
print(f"Total columns: {len(df.columns)}")
print("\nDistribution of 'accident_history' after treatment:")
print(df['accident_history'].value_counts())

print("\nUnique values in the main categorical columns:")
for col in categorical_cols:
    print(f"   • {col}: {df[col].nunique()} unique values")

print("\n")
print("Nulls handled: 'accident_history' filled with 'None'")
print("Outliers preserved: kept for future analysis")
print("Types optimized: categorical columns converted to 'category'")
print(f"Integrity maintained: {len(df):,} records preserved")


Validating transformations:
Total records preserved: 1,000,000
Total columns: 20

Distribution of 'accident_history' after treatment:
accident_history
None     750133
Minor    199981
Major     49886
Name: count, dtype: int64

Unique values in the main categorical columns:
   • make: 25 unique values
   • model: 105 unique values
   • fuel_type: 3 unique values
   • transmission: 2 unique values
   • drivetrain: 3 unique values
   • body_type: 7 unique values
   • exterior_color: 6 unique values
   • interior_color: 4 unique values
   • accident_history: 3 unique values
   • seller_type: 2 unique values
   • condition: 3 unique values
   • trim: 6 unique values


Nulls handled: 'accident_history' filled with 'None'
Outliers preserved: kept for future analysis
Types optimized: categorical columns converted to 'category'
Integrity maintained: 1,000,000 records preserved
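
The summary lines above are printed unconditionally, so they would still appear if a transformation had silently failed. One possible complement is a set of explicit checks that raise on failure. This is a sketch only: `validate_clean` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def validate_clean(frame: pd.DataFrame, expected_rows: int, categorical_cols: list) -> None:
    """Raise ValueError if the cleaned frame breaks any expected invariant."""
    if len(frame) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {len(frame)}")
    if frame.isnull().any().any():
        raise ValueError("null values remain after cleaning")
    for col in categorical_cols:
        if not isinstance(frame[col].dtype, pd.CategoricalDtype):
            raise ValueError(f"column {col!r} is not of dtype 'category'")

# Toy frame standing in for the cleaned df
toy = pd.DataFrame({
    "accident_history": ["None", "Minor", "Major"],
    "price": [7208.52, 6911.81, 11915.63],
})
toy["accident_history"] = toy["accident_history"].astype("category")

validate_clean(toy, expected_rows=3, categorical_cols=["accident_history"])
print("all checks passed")
```

In the notebook the call would presumably be `validate_clean(df, 1_000_000, categorical_cols)`.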


Exporting to the silver layer

In [8]:
df.to_csv('../DataLayer/silver/vehicle_price_clean.csv', index=False)
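
Two things are worth keeping in mind when this file is read back. CSV does not store dtypes, so the `category` columns come back as plain strings unless `dtype` is passed. Also, to my knowledge recent pandas versions treat the literal string `None` as a missing-value marker by default in `read_csv`, which would turn the filled `accident_history` values back into NaN; this may vary by version, so it is worth verifying. A sketch of a read that avoids both issues, using an in-memory buffer instead of the real silver file:

```python
import io
import pandas as pd

# Toy frame standing in for the cleaned df
toy = pd.DataFrame({
    "accident_history": ["None", "Minor", "Major"],
    "price": [7208.52, 6911.81, 11915.63],
})

# Write to an in-memory buffer (stand-in for the silver CSV)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    keep_default_na=False,   # do not apply the built-in list of NA strings
    na_values=[""],          # treat only empty fields as missing
    dtype={"accident_history": "category"},  # restore the categorical dtype
)
print(restored["accident_history"].tolist())
print(restored.dtypes)
```

If a columnar format is an option for the silver layer, something like Parquet would preserve dtypes without these arguments, at the cost of an extra dependency.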
