<a href="https://colab.research.google.com/github/Davyeeh/Trabalho-final-de-Engenharia-de-Sistemas-Inteligente/blob/main/C%C3%B3pia_de_Treinamento_ESI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Imports and environment setup**

In [None]:
!git clone https://github.com/Davyeeh/Trabalho-final-de-Engenharia-de-Sistemas-Inteligente.git


Cloning into 'Trabalho-final-de-Engenharia-de-Sistemas-Inteligente'...
remote: Enumerating objects: 107, done.
remote: Counting objects: 100% (107/107), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 107 (delta 39), reused 75 (delta 17), pack-reused 0 (from 0)
Receiving objects: 100% (107/107), 7.34 MiB | 15.41 MiB/s, done.
Resolving deltas: 100% (39/39), done.


In [None]:
%cd Trabalho-final-de-Engenharia-de-Sistemas-Inteligente
!ls


/content/Trabalho-final-de-Engenharia-de-Sistemas-Inteligente
 app.py			   __pycache__
 artifacts		   pyproject.toml
 dados			   README.md
 dataset		   requirements.txt
 imoveis_tratados.csv	   src
 poetry.lock		  'Trabalho ESI.postman_collection.json'
 projeto_final_ESI.ipynb


In [None]:
!ls -R


.:
 app.py			   __pycache__
 artifacts		   pyproject.toml
 dados			   README.md
 dataset		   requirements.txt
 imoveis_tratados.csv	   src
 poetry.lock		  'Trabalho ESI.postman_collection.json'
 projeto_final_ESI.ipynb

./artifacts:
modelo_campeao.pkl

./dados:
historico_apartamentos.csv

./dataset:
dataset_original.csv  SaoPaulo_OnlyAppartments_2024-11-25.csv

./__pycache__:
app.cpython-312.pyc

./src:
dados  pipeline_dados.py  pipeline_modelos.py  __pycache__

./src/dados:
dados_tratados.csv

./src/__pycache__:
pipeline_dados.cpython-312.pyc


In [None]:
import os
import json
import joblib
import numpy as np
import pandas as pd

from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor


**Reading and preparing the dataset**

In [None]:
df = pd.read_csv("imoveis_tratados.csv")

# Safeguards: parse the dates and drop rows missing the date or the target
df["created_date"] = pd.to_datetime(df["created_date"], errors="coerce")
df = df.dropna(subset=["created_date", "Price"])

# Feature engineering: keep only the listing year
df["ano"] = df["created_date"].dt.year.astype(int)

# Drop the raw date columns and the street name (Rua), when present
drop_cols = ["created_date"]
if "Rua" in df.columns:
    drop_cols.append("Rua")
if "extract_date" in df.columns:
    drop_cols.append("extract_date")

df = df.drop(columns=drop_cols)

# Sort chronologically so the split below is temporal
df = df.sort_values("ano").reset_index(drop=True)

df.head()


,Price,Area,Bedrooms,Bathrooms,Parking_Spaces,Latitude,Longitude,Bairro,ano
0,360000.0,42,1,1,2,-23.636002,-46.737804,Vila Andrade,2018
1,215000.0,35,1,1,3,-23.555372,-46.487537,Cidade Líder,2018
2,580000.0,47,1,3,1,-23.562246,-46.649019,Bela Vista,2018
3,2926000.0,91,1,3,2,-23.589205,-46.683923,Itaim Bibi,2018
4,720000.0,150,1,3,3,-23.630588,-46.736447,Vila Andrade,2018
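
The effect of `errors="coerce"` followed by `dropna` can be checked on a small frame. The sketch below is only an illustration: the rows are made up, and it assumes the same two column names (`created_date`, `Price`) used in the cell above.

```python
import pandas as pd

# Toy frame; these rows are invented for illustration only.
toy = pd.DataFrame({
    "created_date": ["2018-03-01", "not a date", None, "2020-07-15", "2019-01-10"],
    "Price": [360000.0, 215000.0, 580000.0, None, 720000.0],
})

# Unparseable strings and missing values both become NaT instead of raising.
toy["created_date"] = pd.to_datetime(toy["created_date"], errors="coerce")

# Rows without a usable date or without a target are dropped.
clean = toy.dropna(subset=["created_date", "Price"]).copy()
clean["ano"] = clean["created_date"].dt.year.astype(int)

print(clean)
```

Only rows with both a valid date and a price survive, which is why the year column can be cast to `int` safely afterwards.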


**Temporal split**

In [None]:
TARGET = "Price"
TEST_SIZE = 0.2

n = len(df)
cut = int((1 - TEST_SIZE) * n)

train_df = df.iloc[:cut]
test_df  = df.iloc[cut:]

X_train = train_df.drop(columns=[TARGET])
y_train = train_df[TARGET]

X_test = test_df.drop(columns=[TARGET])
y_test = test_df[TARGET]

print(f"Train: {X_train.shape} | Test: {X_test.shape}")


Train: (12577, 8) | Test: (3145, 8)
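
Because the frame is sorted by `ano` only, rows from the same year have no defined relative order, and an 80% cut can land in the middle of a year, so the boundary year may appear in both splits. A minimal sketch of the same split as a reusable function, with a check that no training row is newer than any test row, could look like this (`temporal_split` is a hypothetical helper, not part of the repository, and the toy data is invented):

```python
import pandas as pd

def temporal_split(frame, time_col, test_size=0.2):
    """Hypothetical helper: oldest rows go to train, newest rows to test."""
    # A stable sort keeps the original order among rows with the same time value.
    ordered = frame.sort_values(time_col, kind="stable").reset_index(drop=True)
    cut = int((1 - test_size) * len(ordered))
    return ordered.iloc[:cut], ordered.iloc[cut:]

toy = pd.DataFrame({
    "ano": [2020, 2018, 2019, 2018, 2021, 2019, 2020, 2018, 2021, 2019],
    "Price": [float(p) for p in range(10)],
})
train_part, test_part = temporal_split(toy, "ano", test_size=0.2)

# The split is chronological if the newest training year does not exceed
# the oldest test year (equality means the boundary year is shared).
is_chronological = train_part["ano"].max() <= test_part["ano"].min()
print(len(train_part), len(test_part), is_chronological)
```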


**Preprocessing**

In [None]:
cat_cols = X_train.select_dtypes(include="object").columns.tolist()
num_cols = X_train.columns.difference(cat_cols).tolist()

numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols)
])


**Models**

In [None]:
models = {
    "Ridge": (
        Ridge(),
        {
            "model__alpha": np.logspace(-3, 3, 20)
        }
    ),
    "RandomForest": (
        RandomForestRegressor(random_state=42, n_jobs=-1),
        {
            "model__n_estimators": [200, 400, 600],
            "model__max_depth": [None, 10, 20],
            "model__min_samples_split": [2, 5],
        }
    ),
    "GradientBoosting": (
        GradientBoostingRegressor(random_state=42),
        {
            "model__n_estimators": [150, 250, 400],
            "model__learning_rate": [0.03, 0.05, 0.1],
            "model__max_depth": [2, 3, 4],
        }
    )
}
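
It helps to know how large each search space is compared with the `n_iter=12` budget used in the training cell. When every parameter is given as a list, `RandomizedSearchCV` samples combinations without replacement, so the budget covers only part of each grid. A quick count, with the sizes copied by hand from the `models` dictionary:

```python
from math import prod

# Number of candidate values per hyperparameter, copied from `models`.
grid_sizes = {
    "Ridge": [20],                  # 20 alpha values from np.logspace(-3, 3, 20)
    "RandomForest": [3, 3, 2],      # n_estimators, max_depth, min_samples_split
    "GradientBoosting": [3, 3, 3],  # n_estimators, learning_rate, max_depth
}
N_ITER = 12  # search budget passed to RandomizedSearchCV in the training cell

totals = {name: prod(sizes) for name, sizes in grid_sizes.items()}
for name, total in totals.items():
    share = min(N_ITER, total) / total
    print(f"{name}: {total} combinations, {share:.0%} sampled")
```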


**Training + validation**

In [None]:
tscv = TimeSeriesSplit(n_splits=5)

results = []

for name, (model, param_grid) in models.items():
    pipe = Pipeline([
        ("preprocess", preprocessor),
        ("model", model)
    ])

    search = RandomizedSearchCV(
        pipe,
        param_distributions=param_grid,
        n_iter=12,
        scoring="neg_mean_absolute_error",
        cv=tscv,
        n_jobs=-1,
        random_state=42
    )

    search.fit(X_train, y_train)

    y_pred = search.best_estimator_.predict(X_test)

    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    results.append({
        "Modelo": name,
        "MAE": mae,
        "RMSE": rmse,
        "R2": r2,
        "BestParams": search.best_params_
    })

results_df = pd.DataFrame(results).sort_values("MAE")
results_df




,Modelo,MAE,RMSE,R2,BestParams
1,RandomForest,652591.87286,1434898.0,0.707829,"{'model__n_estimators': 600, 'model__min_sampl..."
2,GradientBoosting,671362.82853,1431811.0,0.709085,"{'model__n_estimators': 400, 'model__max_depth..."
0,Ridge,889359.143324,2030163.0,0.415133,{'model__alpha': 233.57214690901213}
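
The cross-validation inside the search relies on `TimeSeriesSplit`, which always validates on rows that come after the training rows. A small sketch with 12 time-ordered rows (a toy size chosen for readability) shows the expanding-window folds:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(12).reshape(-1, 1)  # 12 rows, assumed already sorted by time
folds = list(TimeSeriesSplit(n_splits=5).split(X_toy))

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(
        f"fold {i}: train=0..{train_idx[-1]} ({len(train_idx)} rows), "
        f"val={val_idx.tolist()}"
    )
```

With five splits the first fold trains on roughly one sixth of the data, so early folds see far fewer examples than the final model does.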


**Choosing the champion**

In [None]:
best_model_name = results_df.iloc[0]["Modelo"]
best_model_name


'RandomForest'

**Train the champion and save the binary**

In [None]:
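# models[...] holds (estimator, search space); only the estimator is needed here
best_model, _ = models[best_model_name]

final_pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", best_model)
])

# Apply the hyperparameters found by the search, then refit on the training split
final_pipe.set_params(**results_df.iloc[0]["BestParams"])
final_pipe.fit(X_train, y_train)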

os.makedirs("artifacts", exist_ok=True)
joblib.dump(final_pipe, "artifacts/modelo.pkl")

print("Champion model saved to artifacts/modelo.pkl")


Champion model saved to artifacts/modelo.pkl
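
A saved pipeline is only useful if it reloads and predicts the same values. The sketch below checks that round trip with a small stand-in pipeline. The four rows reuse `Price`, `Area` and `Bairro` from the `df.head()` output, while the reduced feature set, the `Ridge` stand-in model and the temporary path are assumptions made so the example runs on its own.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in data: a subset of the columns shown in df.head().
X = pd.DataFrame({
    "Area": [42, 35, 47, 91],
    "Bairro": ["Vila Andrade", "Cidade Líder", "Bela Vista", "Itaim Bibi"],
})
y = [360000.0, 215000.0, 580000.0, 2926000.0]

pipe = Pipeline([
    ("preprocess", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["Bairro"])],
        remainder="passthrough",  # numeric columns pass through unchanged
    )),
    ("model", Ridge()),
])
pipe.fit(X, y)

# Dump to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modelo.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same = bool(np.allclose(pipe.predict(X), restored.predict(X)))
print("round trip OK:", same)
```

Pickled pipelines should be loaded only from trusted sources and with a scikit-learn version compatible with the one used for training.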


**Quick test (sanity check)**

In [None]:
final_pipe.predict(X_test.head(3))


array([1972487.38053712, 1217113.7481314 , 4215419.42545919])