# Final Pipeline - Random Forest Production Model

## Objective

Train the FINAL model for production:
- Use 100% of the data (no train/test split)
- Validate with cross-validation
- Export the complete set of artifacts for the API

## Final Configuration

Based on the previous notebooks:
- **Model**: Random Forest (100 estimators, default parameters)
- **Features**: 6 (property_type, county, postcode_region, old_new, duration, year)
- **Encodings**: Label + target encoding
- **Transform**: Log transform of the target
- **Performance**: R2 = 11.16% (overall), 27% (properties up to £1M)

In [2]:
import pandas as pd
import numpy as np
import joblib
import os
from datetime import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

print("OK - Libraries loaded")

OK - Libraries loaded


In [3]:
# Load the cleaned dataset and parse the transfer date
df = pd.read_csv('data/uk_property_cleaned.csv')
df['transfer_date'] = pd.to_datetime(df['transfer_date'])

# Keep only the modelling columns; rows without a postcode cannot yield a postcode region
df_model = df[['property_type', 'county', 'postcode', 'old_new', 'duration', 'year', 'price']].copy()
df_model = df_model.dropna(subset=['postcode'])

print(f"Dataset: {len(df_model):,} rows")

Dataset: 99,831 rows


In [4]:
# Postcode region: the first whitespace-separated token of the postcode
df_model['postcode_region'] = df_model['postcode'].str.split().str[0]

# Label encoding
label_encoders = {}
for col in ['property_type', 'old_new', 'duration']:
    le = LabelEncoder()
    df_model[col + '_enc'] = le.fit_transform(df_model[col].astype(str))
    label_encoders[col] = le

# Target encoding: mean raw price per category, computed on the full dataset
county_map = df_model.groupby('county')['price'].mean()
postcode_map = df_model.groupby('postcode_region')['price'].mean()

df_model['county_enc'] = df_model['county'].map(county_map)
df_model['postcode_region_enc'] = df_model['postcode_region'].map(postcode_map)

print("Feature engineering complete!")

Feature engineering complete!
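
The target encodings in the cell above are computed from every row, so each row's own price contributes to the value of its encoded feature. One common alternative (not what this notebook does) is out-of-fold encoding, where each row is encoded with category means learned on the other folds only. A minimal sketch, assuming a hypothetical helper `out_of_fold_target_encode` and a made-up toy frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(df, col, target, n_splits=5, seed=42):
    """Encode `col` with per-category target means learned on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_part = df.iloc[fit_idx]
        means = fit_part.groupby(col)[target].mean()
        # Fallback for categories that do not appear in the fitting folds
        fallback = fit_part[target].mean()
        values = df.iloc[enc_idx][col].map(means).fillna(fallback)
        encoded.iloc[enc_idx] = values.to_numpy()
    return encoded


# Tiny hypothetical frame, for illustration only
toy = pd.DataFrame({
    'county': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'price': [100.0, 120.0, 110.0, 300.0, 320.0, 310.0, 50.0, 60.0],
})
toy['county_oof'] = out_of_fold_target_encode(toy, 'county', 'price', n_splits=4)
print(toy)
```

The full-data maps (`county_map`, `postcode_map`) would still be the ones to export for the API, since at prediction time the price of the new record is unknown anyway.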


In [5]:
features = ['property_type_enc', 'county_enc', 'postcode_region_enc', 
            'old_new_enc', 'duration_enc', 'year']

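# Fill any remaining missing feature values with the column median
X = df_model[features].fillna(df_model[features].median())
y = df_model['price']
# Log-transform the target (see Final Configuration)
y_log = np.log(y)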

print(f"X: {X.shape}, y: {y.shape}")

X: (99831, 6), y: (99831,)
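
`np.log` is only defined for positive values, and a model trained on `y_log` predicts log prices, so predictions need `np.exp` to get back to pounds. A minimal sketch of a guard and the round trip, using made-up prices; whether the cleaned CSV can contain non-positive prices at all is an assumption, not something the notebook shows:

```python
import numpy as np
import pandas as pd

# Made-up prices, for illustration only; one zero-priced row included on purpose
prices = pd.Series([250_000.0, 0.0, 125_000.0, 1_200_000.0], name='price')

valid = prices > 0                   # log is only defined for positive values
prices_log = np.log(prices[valid])   # transformed target, as used for training
prices_back = np.exp(prices_log)     # inverse transform, as needed at prediction time

print(f"Dropped {int((~valid).sum())} non-positive row(s)")
print(np.allclose(prices_back, prices[valid]))
```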


In [6]:
print("Cross-validation (5-fold)...")
# R2 is scored on log(price); the target encodings were fitted on all rows before splitting
rf_cv = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
cv_scores = cross_val_score(rf_cv, X, y_log, cv=5, scoring='r2', n_jobs=-1)

print(f"Mean R2: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

Cross-validation (5-fold)...
Mean R2: 0.4390 (+/- 0.1031)
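
The cross-validated R2 above is measured on `y_log`, so it is not directly comparable to a score measured on raw prices. Whether the 11.16% figure from the earlier notebooks was computed on raw prices is an assumption here. A minimal sketch, with made-up numbers, of how the same predictions can score differently on the two scales:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up prices and log-scale predictions, for illustration only
y_true = np.array([100_000.0, 150_000.0, 200_000.0, 250_000.0, 5_000_000.0])
pred_log = np.log(np.array([110_000.0, 140_000.0, 210_000.0, 240_000.0, 1_000_000.0]))

# Score on the scale the model was trained on
r2_log_scale = r2_score(np.log(y_true), pred_log)
# Score after mapping predictions back to pounds
r2_price_scale = r2_score(y_true, np.exp(pred_log))

print(f"R2 on log(price): {r2_log_scale:.4f}")
print(f"R2 on price:      {r2_price_scale:.4f}")
```

On the raw price scale a single badly underestimated expensive property dominates the squared error, which is why the two scores can diverge.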


In [7]:
print("Training final model on 100% of the data...")
final_model = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
final_model.fit(X, y_log)

print(f"Model trained on {len(X):,} samples!")

Training final model on 100% of the data...
Model trained on 99,831 samples!


In [8]:
os.makedirs('models', exist_ok=True)

# 1. Model
joblib.dump(final_model, 'models/final_model.joblib')
print("1. Model saved")

# 2. Label encoders (property_type, old_new, duration)
joblib.dump(label_encoders, 'models/final_label_encoders.joblib')
print("2. Encoders saved")

# 3. Target encodings (category -> mean price)
target_encodings = {
    'county_map': county_map.to_dict(),
    'postcode_map': postcode_map.to_dict()
}
joblib.dump(target_encodings, 'models/final_target_encodings.joblib')
print("3. Target encodings saved")

# 4. Metadata
metadata = {
    'model_type': 'RandomForestRegressor',
    'n_estimators': 100,
    'features': features,
    'target_transform': 'log',
    'trained_date': datetime.now().isoformat(),
    'training_samples': len(X),
    'cv_r2_mean': float(cv_scores.mean()),
    'cv_r2_std': float(cv_scores.std()),
    'expected_r2': 0.1116  # overall R2 reported in the Final Configuration (11.16%)
}
joblib.dump(metadata, 'models/final_metadata.joblib')
print("4. Metadata saved")

print("\nFINAL MODEL EXPORTED SUCCESSFULLY!")

1. Model saved
2. Encoders saved
3. Target encodings saved
4. Metadata saved

FINAL MODEL EXPORTED SUCCESSFULLY!
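
On the API side, the exported artifacts have to be applied in the same order as in training: label encoding, target-encoding lookup, prediction, then `np.exp` to undo the log transform. The sketch below illustrates that flow with small hypothetical stand-ins for the artifacts; in a real service they would presumably be loaded with `joblib.load` from the `models/` files written above. The helper `build_features`, the category values, and `fallback_price` (a value for categories missing from the maps, which the notebook does not export) are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

FEATURES = ['property_type_enc', 'county_enc', 'postcode_region_enc',
            'old_new_enc', 'duration_enc', 'year']


def build_features(record, label_encoders, target_encodings, fallback_price):
    """Turn one raw record (a dict) into a single-row feature frame."""
    row = {}
    for col in ['property_type', 'old_new', 'duration']:
        # LabelEncoder.transform raises ValueError for labels unseen during fit
        row[col + '_enc'] = label_encoders[col].transform([str(record[col])])[0]
    region = record['postcode'].split()[0]
    row['county_enc'] = target_encodings['county_map'].get(record['county'], fallback_price)
    row['postcode_region_enc'] = target_encodings['postcode_map'].get(region, fallback_price)
    row['year'] = record['year']
    return pd.DataFrame([row])[FEATURES]


# Hypothetical stand-ins for the exported artifacts, for illustration only
label_encoders = {
    'property_type': LabelEncoder().fit(['D', 'F', 'S', 'T']),
    'old_new': LabelEncoder().fit(['N', 'Y']),
    'duration': LabelEncoder().fit(['F', 'L']),
}
target_encodings = {
    'county_map': {'GREATER LONDON': 500_000.0, 'KENT': 300_000.0},
    'postcode_map': {'SW1A': 900_000.0, 'CT1': 250_000.0},
}
fallback_price = 350_000.0  # assumed fallback for categories missing from the maps

# Toy model fitted on two made-up records so that predict() can run
records = [
    {'property_type': 'F', 'county': 'GREATER LONDON', 'postcode': 'SW1A 1AA',
     'old_new': 'N', 'duration': 'L', 'year': 2015},
    {'property_type': 'D', 'county': 'KENT', 'postcode': 'CT1 2AB',
     'old_new': 'Y', 'duration': 'F', 'year': 2018},
]
prices = np.array([850_000.0, 320_000.0])
X_toy = pd.concat(
    [build_features(r, label_encoders, target_encodings, fallback_price) for r in records],
    ignore_index=True,
)
toy_model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, np.log(prices))

# A new record whose county and postcode region are not in the maps
new_record = {'property_type': 'T', 'county': 'SURREY', 'postcode': 'GU1 3XY',
              'old_new': 'N', 'duration': 'F', 'year': 2020}
X_new = build_features(new_record, label_encoders, target_encodings, fallback_price)

# The model predicts log(price); np.exp maps it back to pounds
predicted_price = float(np.exp(toy_model.predict(X_new)[0]))
print(f"Predicted price: £{predicted_price:,.0f}")
```

If a fallback like this is used in production, exporting it alongside the other artifacts (for example inside `metadata`) would keep training and serving consistent.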
