# TP1: Machine Learning Project
## Used Car Price Prediction (Kaggle)

This notebook documents **the full workflow of training, evaluating, comparing, and submitting** Machine Learning models for a Kaggle used-car price prediction task.

## 1. Importing libraries

In [56]:
import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

## 2. Loading the data

In [57]:
train_df = pd.read_csv("train_small.csv")
test_df = pd.read_csv("test.csv")

train_df.head()

|   | id | brand | model | model_year | milage | fuel_type | engine | transmission | ext_col | int_col | accident | clean_title | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 111355 | Nissan | Murano SV | 2022 | 23677 | Gasoline | 3.5L V6 24V MPFI DOHC | Automatic CVT | Sunset Drift Chromaflair | Graphite | None reported | NaN | 37999 |
| 1 | 182258 | Ford | Thunderbird Deluxe | 2004 | 50000 | Gasoline | 280.0HP 3.9L 8 Cylinder Engine Gasoline Fuel | A/T | Gold | Beige | NaN | NaN | 30000 |
| 2 | 14147 | Buick | Enclave Avenir | 2019 | 109646 | Gasoline | 3.6L V6 24V GDI DOHC | 9-Speed Automatic | Dark Moss | Chestnut | None reported | Yes | 26772 |
| 3 | 79313 | BMW | 340 i | 2016 | 102000 | Gasoline | 320.0HP 3.0L Straight 6 Cylinder Engine Gasoli... | 8-Speed A/T | White | Black | NaN | NaN | 24999 |
| 4 | 101160 | Toyota | Highlander SE | 2020 | 75151 | Gasoline | 295.0HP 3.5L V6 Cylinder Engine Gasoline Fuel | 8-Speed A/T | Gray | Black | None reported | Yes | 47995 |
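
The preview leaves some `accident` and `clean_title` entries blank. As a minimal sketch, the missing values per column could be counted before modelling. The `sample` frame is a hypothetical stand-in built from three of the rows shown above, and it assumes the blank cells are missing values (NaN).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df: three rows copied from the preview,
# assuming the blank cells are missing values (NaN).
sample = pd.DataFrame({
    "brand": ["Nissan", "Ford", "Buick"],
    "accident": ["None reported", np.nan, "None reported"],
    "clean_title": [np.nan, np.nan, "Yes"],
    "price": [37999, 30000, 26772],
})

# Count missing values per column and express them as a share of all rows.
missing_counts = sample.isna().sum()
missing_share = sample.isna().mean()

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

On the real data the same two calls would be applied to `train_df` directly.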


## 3. Separating features and target

In [58]:
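# Features: every column except the target.
# Note: the row identifier `id` stays in X and is treated as a numeric feature downstream.
X = train_df.drop("price", axis=1)
y = train_df["price"]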

## 4. Identifying variable types

In [59]:
categorical_cols = X.select_dtypes(include="object").columns
numeric_cols = X.select_dtypes(exclude="object").columns

categorical_cols, numeric_cols

(Index(['brand', 'model', 'fuel_type', 'engine', 'transmission', 'ext_col',
        'int_col', 'accident', 'clean_title'],
       dtype='object'),
 Index(['id', 'model_year', 'milage'], dtype='object'))
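
The output above shows `id` among the numeric columns, so it is scaled and passed to every model as a feature. Whether a row identifier carries useful signal is doubtful. The sketch below shows one possible alternative, which is not what this notebook does: dropping `id` together with the target. The frame `toy_df` is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with the same kinds of columns as the training data.
# dtype="object" is set explicitly so the text column is selected the same way
# regardless of the installed pandas version.
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "brand": pd.Series(["Nissan", "Ford", "BMW"], dtype="object"),
    "model_year": [2022, 2004, 2016],
    "milage": [23677, 50000, 102000],
    "price": [37999, 30000, 24999],
})

# Assumption: the identifier is excluded from the features alongside the target.
X_toy = toy_df.drop(columns=["id", "price"])

cat_cols = X_toy.select_dtypes(include="object").columns.tolist()
num_cols = X_toy.select_dtypes(exclude="object").columns.tolist()
print(cat_cols, num_cols)
```

With this change the test set would need the same columns at prediction time, while `id` would still be read from `test_df` when building the submission.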

## 5. Preprocessing

In [60]:
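preprocessor = ColumnTransformer(
    transformers=[
        # One-hot encode text columns; categories unseen during fit are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Standardise numeric columns to zero mean and unit variance.
        ("num", StandardScaler(), numeric_cols)
    ]
)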
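
To illustrate what this preprocessor does with a category it never saw during fitting, here is a small sketch on hypothetical toy data (`fit_df`, `new_df`, and `toy_preprocessor` are illustrative names, not part of the notebook).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical and one numeric column.
fit_df = pd.DataFrame({"brand": ["Ford", "BMW", "Ford"], "milage": [10.0, 20.0, 30.0]})
new_df = pd.DataFrame({"brand": ["Tesla"], "milage": [20.0]})  # "Tesla" is not in fit_df

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
        ("num", StandardScaler(), ["milage"]),
    ]
)
toy_preprocessor.fit(fit_df)

encoded = toy_preprocessor.transform(new_df)
# The result may be a SciPy sparse matrix; convert it to a dense array for display.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)
print(encoded)
```

The unseen brand produces zeros in both one-hot columns, and a `milage` equal to the training mean scales to zero. With `handle_unknown="ignore"`, rare models or colours in the test set do not raise an error but also contribute no information.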

## 6. Baseline Model: Linear Regression

In [61]:
linear_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

linear_scores = cross_val_score(
    linear_pipeline,
    X,
    y,
    cv=5,
    scoring="neg_root_mean_squared_error"
)

print("Mean RMSE (Linear):", -linear_scores.mean())

Mean RMSE (Linear): 73411.42667269355
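
The score is negated in the cell above because scikit-learn treats every scorer as "higher is better" and therefore returns error metrics with a negative sign. The sketch below shows this convention on synthetic data (`X_demo` and `y_demo` are generated for illustration only) and also reports the spread across folds, which the mean alone hides.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to illustrate the scoring convention.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)

# scikit-learn maximises scores, so the RMSE comes back negated; flip the sign to read it.
rmse_per_fold = -neg_rmse
print("RMSE per fold:", np.round(rmse_per_fold, 2))
print("Mean RMSE:", rmse_per_fold.mean(), "| std:", rmse_per_fold.std())
```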


## 7. KNN with Grid Search

In [62]:
knn_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", KNeighborsRegressor())
])

knn_param_grid = {
    "model__n_neighbors": [3, 5, 7],
    "model__weights": ["uniform", "distance"]
}

knn_grid = GridSearchCV(
    knn_pipeline,
    knn_param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)

knn_grid.fit(X, y)

print("Best KNN parameters:", knn_grid.best_params_)
print("KNN RMSE:", -knn_grid.best_score_)

Best KNN parameters: {'model__n_neighbors': 7, 'model__weights': 'distance'}
KNN RMSE: 63925.70533632208
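
Only the best combination is printed above. The sketch below shows how the full grid could be inspected through `cv_results_`, using synthetic data and a bare `KNeighborsRegressor` instead of the notebook's pipeline (so the parameter names lack the `model__` prefix).

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data for illustration only.
X_demo, y_demo = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=42)

demo_grid = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, with the mean and spread of the fold scores.
cv_table = pd.DataFrame(demo_grid.cv_results_)[
    ["param_n_neighbors", "param_weights", "mean_test_score", "std_test_score", "rank_test_score"]
]
cv_table["mean_rmse"] = -cv_table["mean_test_score"]
print(cv_table.sort_values("rank_test_score"))
```

In the notebook, the best `n_neighbors` found (7) is the largest value in the grid, so a table like this could help judge whether extending the grid upward is worthwhile.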


## 8. Random Forest with Grid Search

In [63]:
rf_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(random_state=RANDOM_STATE))
])

rf_param_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 20],
    "model__min_samples_split": [2, 5]
}

rf_grid = GridSearchCV(
    rf_pipeline,
    rf_param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)

rf_grid.fit(X, y)

print("Best RF parameters:", rf_grid.best_params_)
print("RF RMSE:", -rf_grid.best_score_)

Best RF parameters: {'model__max_depth': None, 'model__min_samples_split': 5, 'model__n_estimators': 200}
RF RMSE: 64616.1433969219
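
Random Forest searches are the most expensive step here, so it can help to count the model fits before running one. The sketch below does this arithmetic for a copy of the grid above; `cv_folds` and the other variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the Random Forest grid used in the cell above.
rf_param_grid_copy = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 20],
    "model__min_samples_split": [2, 5],
}
cv_folds = 5

# Every combination is fitted once per fold, plus one final refit on all data
# (GridSearchCV uses refit=True by default).
n_candidates = len(ParameterGrid(rf_param_grid_copy))
n_fits = n_candidates * cv_folds + 1
print(n_candidates, "candidates,", n_fits, "model fits in total")
```

Adding one more value to any of the three lists raises the number of candidates from 8 to 12, which is why grids are usually widened one parameter at a time.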


## 9. Model Comparison

The model with the **lowest mean cross-validated RMSE** is selected as the final model.

In [64]:
results = pd.DataFrame({
    "Model": ["Linear Regression", "KNN", "Random Forest"],
    "RMSE": [
        -linear_scores.mean(),
        -knn_grid.best_score_,
        -rf_grid.best_score_
    ]
})

results.sort_values("RMSE")

|   | Model | RMSE |
|---|---|---|
| 1 | KNN | 63925.705336 |
| 2 | Random Forest | 64616.143397 |
| 0 | Linear Regression | 73411.426673 |
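
The selection rule can also be applied programmatically rather than by reading the table. The sketch below uses the three RMSE values reported above; `rmse_by_model` is an illustrative name.

```python
import pandas as pd

# Mean cross-validated RMSE values as reported in the comparison table.
rmse_by_model = pd.Series({
    "Linear Regression": 73411.426673,
    "KNN": 63925.705336,
    "Random Forest": 64616.143397,
})

# Pick the model with the lowest RMSE and measure the gap to the runner-up.
best_name = rmse_by_model.idxmin()
gap_to_runner_up = rmse_by_model.nsmallest(2).iloc[1] - rmse_by_model.min()
print(best_name, round(gap_to_runner_up, 2))
```

The gap between KNN and Random Forest is about 690, roughly 1% of the RMSE, so the ranking of these two models may not be stable across different fold splits.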


## 10. Kaggle Submission

In [65]:
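# GridSearchCV refits the best pipeline on the full training set (refit=True by default).
best_model = knn_grid.best_estimator_

# The pipeline applies the same preprocessing to the test set before predicting.
test_predictions = best_model.predict(test_df)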

submission = pd.DataFrame({
    "id": test_df["id"],
    "price": test_predictions
})

submission.to_csv("submission.csv", index=False)
submission.head()

|   | id | price |
|---|---|---|
| 0 | 188533 | 14263.877691 |
| 1 | 188534 | 63728.918863 |
| 2 | 188535 | 44143.603115 |
| 3 | 188536 | 42003.143399 |
| 4 | 188537 | 21722.74978 |
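
Before uploading, a few sanity checks on the submission frame can catch obvious problems. The function `check_submission` below is a hypothetical helper, and the specific checks (column names, row order, no missing or negative prices) are assumptions about what a valid file looks like, based on the `id` and `price` columns written above.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, test_ids: pd.Series) -> None:
    """Raise AssertionError if the submission frame looks malformed."""
    assert list(submission.columns) == ["id", "price"], "unexpected columns"
    assert len(submission) == len(test_ids), "row count differs from test set"
    assert submission["id"].is_unique, "duplicate ids"
    assert submission["id"].tolist() == test_ids.tolist(), "ids out of order or missing"
    assert submission["price"].notna().all(), "missing predictions"
    assert (submission["price"] >= 0).all(), "negative price predicted"

# Hypothetical example using the first three rows of the submission shown above.
demo_ids = pd.Series([188533, 188534, 188535])
demo_submission = pd.DataFrame({
    "id": demo_ids,
    "price": [14263.877691, 63728.918863, 44143.603115],
})
check_submission(demo_submission, demo_ids)
print("submission looks consistent")
```

In the notebook the call would be `check_submission(submission, test_df["id"])`, run just before `to_csv`.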
