# TP1 – Machine Learning Project
## Used Car Price Prediction (Kaggle)

This notebook documents the **entire process of training, evaluating, comparing, and submitting** Machine Learning models.

## 1. Importing libraries

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

## 2. Loading the data

In [None]:
train_df = pd.read_csv("train_small.csv")
test_df = pd.read_csv("test.csv")

train_df.head()

## 3. Separating features and target

In [None]:
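# Drop the target; also drop the row identifier if present, since "id" carries no
# predictive signal and would distort KNN distances (errors="ignore" keeps this safe
# when train_df has no "id" column).
X = train_df.drop(columns=["price", "id"], errors="ignore")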
y = train_df["price"]

## 4. Identifying variable types

In [None]:
categorical_cols = X.select_dtypes(include="object").columns
numeric_cols = X.select_dtypes(exclude="object").columns

categorical_cols, numeric_cols

## 5. Preprocessing

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", StandardScaler(), numeric_cols)
    ]
)
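
To make the effect of this preprocessing step concrete, here is a minimal sketch on toy data. The column names `fuel` and `mileage` and all values are hypothetical and are not taken from the Kaggle dataset. It illustrates two behaviours: `handle_unknown="ignore"` encodes a category never seen during fitting as all zeros, and `StandardScaler` centres numeric columns on the training mean.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; column names are illustrative only.
toy_train = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas", "diesel"],
    "mileage": [10_000.0, 20_000.0, 30_000.0, 40_000.0],
})
# "electric" never appears in the training data.
toy_test = pd.DataFrame({"fuel": ["electric"], "mileage": [25_000.0]})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["fuel"]),
        ("num", StandardScaler(), ["mileage"]),
    ]
)


def to_dense(matrix):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


train_dense = to_dense(toy_preprocessor.fit_transform(toy_train))
test_dense = to_dense(toy_preprocessor.transform(toy_test))

# Columns are ordered as: one-hot "fuel" categories first, then scaled "mileage".
print(train_dense.shape)
print(test_dense)
```

The unseen category does not raise an error at prediction time, which matters when the test set contains brands or models absent from the training sample.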

## 6. Baseline Model – Linear Regression

In [None]:
linear_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

linear_scores = cross_val_score(
    linear_pipeline,
    X,
    y,
    cv=5,
    scoring="neg_root_mean_squared_error"
)

print("Mean RMSE (Linear):", -linear_scores.mean())
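
The sign flip above exists because scikit-learn scorers follow a "higher is better" convention, so error metrics are reported negated. The following sketch shows this on synthetic data generated with `make_regression`; the data and the names `X_demo`, `y_demo`, and `rmse_per_fold` are assumptions for illustration and are unrelated to the car dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem, used only to illustrate the scoring convention.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

neg_scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Each fold's score is the RMSE multiplied by -1; negate to recover the RMSE.
rmse_per_fold = -neg_scores

print("Per-fold RMSE:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean(), "+/-", rmse_per_fold.std())
```

Reporting the per-fold spread alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold variation.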

## 7. KNN with Grid Search

In [None]:
knn_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", KNeighborsRegressor())
])

knn_param_grid = {
    "model__n_neighbors": [3, 5, 7],
    "model__weights": ["uniform", "distance"]
}

knn_grid = GridSearchCV(
    knn_pipeline,
    knn_param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)

knn_grid.fit(X, y)

print("Best KNN parameters:", knn_grid.best_params_)
print("KNN RMSE:", -knn_grid.best_score_)
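
Beyond the single best score, it can be useful to look at every combination the grid search tried. The sketch below runs the same kind of search on synthetic data (an assumption for illustration; `demo_grid` and `summary` are hypothetical names) and turns the `cv_results_` attribute into a table sorted by RMSE.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to show how to read cv_results_.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsRegressor()),
])

demo_grid = GridSearchCV(
    demo_pipeline,
    {
        "model__n_neighbors": [3, 5, 7],
        "model__weights": ["uniform", "distance"],
    },
    cv=5,
    scoring="neg_root_mean_squared_error",
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination; negate the mean score to get the RMSE.
cv_table = pd.DataFrame(demo_grid.cv_results_)
cv_table["rmse"] = -cv_table["mean_test_score"]
summary = (
    cv_table[["param_model__n_neighbors", "param_model__weights", "rmse", "std_test_score"]]
    .sort_values("rmse")
    .reset_index(drop=True)
)

print(summary)
```

The same pattern should apply to `knn_grid` and `rf_grid` in this notebook, since both are fitted `GridSearchCV` objects.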

## 8. Random Forest with Grid Search

In [None]:
rf_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(random_state=RANDOM_STATE))
])

rf_param_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [None, 20],
    "model__min_samples_split": [2, 5]
}

rf_grid = GridSearchCV(
    rf_pipeline,
    rf_param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1
)

rf_grid.fit(X, y)

print("Best RF parameters:", rf_grid.best_params_)
print("RF RMSE:", -rf_grid.best_score_)

## 9. Model Comparison

The model with the **lowest mean RMSE** is selected as the final model.

In [None]:
results = pd.DataFrame({
    "Model": ["Linear Regression", "KNN", "Random Forest"],
    "RMSE": [
        -linear_scores.mean(),
        -knn_grid.best_score_,
        -rf_grid.best_score_
    ]
})

results.sort_values("RMSE")

## 10. Kaggle Submission

In [None]:
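# Select the candidate with the lowest mean cross-validated RMSE (Section 9)
# instead of hard-coding one model.
candidates = {
    "Linear Regression": (-linear_scores.mean(), linear_pipeline),
    "KNN": (-knn_grid.best_score_, knn_grid.best_estimator_),
    "Random Forest": (-rf_grid.best_score_, rf_grid.best_estimator_),
}
best_name = min(candidates, key=lambda name: candidates[name][0])
# cross_val_score leaves linear_pipeline unfitted, so fit the winner on all training data.
best_model = candidates[best_name][1].fit(X, y)
print("Selected model:", best_name)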

test_predictions = best_model.predict(test_df)

submission = pd.DataFrame({
    "id": test_df["id"],
    "price": test_predictions
})

submission.to_csv("submission.csv", index=False)
submission.head()
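
Before uploading, a few sanity checks on the submission frame can catch common mistakes. The helper `check_submission` below is hypothetical (it is not part of pandas or Kaggle tooling) and assumes the `id`/`price` layout used in the cell above; the toy frames exist only to exercise it.

```python
import numpy as np
import pandas as pd


def check_submission(submission, test_ids):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(submission.columns) != ["id", "price"]:
        return ["columns must be exactly ['id', 'price']"]

    problems = []
    if len(submission) != len(test_ids):
        problems.append("row count differs from the test set")
    elif not (submission["id"].to_numpy() == test_ids.to_numpy()).all():
        problems.append("ids do not match the test set order")

    prices = submission["price"].to_numpy(dtype=float)
    if not np.isfinite(prices).all():
        problems.append("prices contain NaN or infinite values")
    elif (prices < 0).any():
        problems.append("prices contain negative values")
    return problems


# Toy data to exercise the helper.
toy_ids = pd.Series([101, 102, 103], name="id")
good = pd.DataFrame({"id": toy_ids, "price": [9500.0, 12000.0, 7300.0]})
bad = pd.DataFrame({"id": toy_ids, "price": [9500.0, np.nan, 7300.0]})

print(check_submission(good, toy_ids))
print(check_submission(bad, toy_ids))
```

In this notebook the call would presumably be `check_submission(submission, test_df["id"])`, run just before `to_csv`.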