# Hyperparameter Tuning - Random Forest

## Objective

Tune the Random Forest hyperparameters to improve model performance.

Current model: R2 = 11.16%
Target: R2 = 15-20%

## Strategy

Use RandomizedSearchCV to try random combinations of hyperparameters:
- n_estimators: number of trees
- max_depth: maximum tree depth
- min_samples_split: minimum number of samples required to split a node
- min_samples_leaf: minimum number of samples required at a leaf

RandomizedSearchCV is faster than GridSearchCV because it evaluates only a fixed number of sampled combinations instead of the full grid, and it usually finds good results.
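
To make the trade-off concrete, here is a minimal sketch on synthetic data (an assumption; it does not use the property dataset). The names `X_demo`, `y_demo`, `demo_grid`, `grid_demo` and `rand_demo` are illustrative only. Both searches share the same small grid, so the random search evaluates a subset of the candidates the grid search evaluates.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the UK property dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

# Small illustrative grid: 2 * 3 * 2 * 2 = 24 combinations
demo_grid = {
    "n_estimators": [10, 20],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 8],
    "min_samples_leaf": [1, 4],
}

# Exhaustive search: every combination is cross-validated
grid_demo = GridSearchCV(
    RandomForestRegressor(random_state=0), demo_grid, cv=3, scoring="r2"
)
grid_demo.fit(X_demo, y_demo)

# Random search: only n_iter sampled combinations are cross-validated
rand_demo = RandomizedSearchCV(
    RandomForestRegressor(random_state=0), demo_grid,
    n_iter=5, cv=3, scoring="r2", random_state=0,
)
rand_demo.fit(X_demo, y_demo)

print("Grid candidates:  ", len(grid_demo.cv_results_["params"]))   # 24
print("Random candidates:", len(rand_demo.cv_results_["params"]))   # 5
print(f"Grid best R2 (CV):   {grid_demo.best_score_:.4f}")
print(f"Random best R2 (CV): {rand_demo.best_score_:.4f}")
```

The random search costs roughly a fifth of the fits here, at the price of possibly missing the best combination in the grid.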

In [1]:
# Imports
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("Libraries loaded")

Libraries loaded


## Load the Prepared Data

Load the data that was already processed in the previous notebook.

In [2]:
# Load the cleaned dataset
df = pd.read_csv('data/uk_property_cleaned.csv')
df['transfer_date'] = pd.to_datetime(df['transfer_date'])

print(f"Dataset: {len(df):,} rows")

Dataset: 100,000 rows


## Data Preparation

Apply the same pipeline as in notebook 02:
1. Extract postcode_region
2. Label encoding (property_type, old_new, duration)
3. Train/test split
4. Target encoding (county, postcode_region)
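
Step 4 is the one most prone to leakage, so a toy illustration may help. The sketch below uses made-up values and hypothetical names (`train_demo`, `test_demo`, `county_means`): category means are learned from the training rows only, and a category that never appears in training falls back to the median of those means.

```python
import pandas as pd

# Toy data (assumption: values invented for illustration)
train_demo = pd.DataFrame({
    "county": ["A", "A", "B", "B", "B"],
    "price": [100.0, 300.0, 200.0, 400.0, 600.0],
})
test_demo = pd.DataFrame({"county": ["A", "B", "C"]})  # "C" is unseen in training

# Mean price per category, computed on the training rows only
county_means = train_demo.groupby("county")["price"].mean()

# Fallback for unseen categories: median of the training means
fallback = county_means.median()

test_demo["county_enc"] = test_demo["county"].map(county_means).fillna(fallback)
print(test_demo)
```

Because the mapping is built after the split, no test-set prices influence the encoded feature.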

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Build the modeling dataset
df_model = df[['property_type', 'county', 'postcode', 'old_new', 'duration', 'year', 'price']].copy()
df_model = df_model.dropna(subset=['postcode'])

# Extract the postcode region (the part before the space)
df_model['postcode_region'] = df_model['postcode'].str.split().str[0]

# Label encoding
for col in ['property_type', 'old_new', 'duration']:
    le = LabelEncoder()
    df_model[col + '_enc'] = le.fit_transform(df_model[col].astype(str))

# Build X and y
X_temp = df_model[['property_type_enc', 'county', 'postcode_region', 'old_new_enc', 'duration_enc', 'year']]
y = df_model['price']

# Train/test split
X_train_temp, X_test_temp, y_train, y_test = train_test_split(X_temp, y, test_size=0.2, random_state=42)

# Target encoding (means computed on the training set only)
train_data = X_train_temp.join(y_train)
county_map = train_data.groupby('county')['price'].mean()
postcode_map = train_data.groupby('postcode_region')['price'].mean()

X_train_temp = X_train_temp.copy()
X_test_temp = X_test_temp.copy()

X_train_temp['county_enc'] = X_train_temp['county'].map(county_map)
X_train_temp['postcode_region_enc'] = X_train_temp['postcode_region'].map(postcode_map)

# Categories unseen in training fall back to the median of the training means
X_test_temp['county_enc'] = X_test_temp['county'].map(county_map).fillna(county_map.median())
X_test_temp['postcode_region_enc'] = X_test_temp['postcode_region'].map(postcode_map).fillna(postcode_map.median())

# Final features; fill any remaining gaps with training-set medians
# so that no test-set statistics are used
features = ['property_type_enc', 'county_enc', 'postcode_region_enc', 'old_new_enc', 'duration_enc', 'year']
train_medians = X_train_temp[features].median()
X_train = X_train_temp[features].fillna(train_medians)
X_test = X_test_temp[features].fillna(train_medians)

# Log-transform the target
y_train_log = np.log(y_train)
y_test_log = np.log(y_test)

print(f"X_train: {X_train.shape}")
print(f"X_test: {X_test.shape}")
print("Data prepared!")

X_train: (79864, 6)
X_test: (19967, 6)
Data prepared!


## RandomizedSearchCV

Define the hyperparameter search space and run the optimization.

RandomizedSearchCV tries random combinations, which is faster than GridSearchCV.

In [8]:
# EXPANDED and REFINED search space
param_distributions_v2 = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [10, 15, 20, 25, None],
    'min_samples_split': [2, 4, 8, 10],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': ['sqrt', 'log2', 0.7, 0.8, 0.9],
    'max_samples': [0.8, 0.9, 1.0]  # NEW: fraction of training rows bootstrapped for each tree
}

print("Search space V2 (expanded):")
for param, values in param_distributions_v2.items():
    print(f"  {param}: {values}")

# Count the total number of combinations
total_v2 = 1
for values in param_distributions_v2.values():
    total_v2 *= len(values)
print(f"\nTotal combinations: {total_v2}")
print("RandomizedSearchCV will test 30 combinations")

Search space V2 (expanded):
  n_estimators: [50, 100, 150, 200]
  max_depth: [10, 15, 20, 25, None]
  min_samples_split: [2, 4, 8, 10]
  min_samples_leaf: [1, 2, 4, 8]
  max_features: ['sqrt', 'log2', 0.7, 0.8, 0.9]
  max_samples: [0.8, 0.9, 1.0]

Total combinations: 4800
RandomizedSearchCV will test 30 combinations
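
As a cross-check of the arithmetic above, the sketch below counts the grid with scikit-learn's `ParameterGrid` and draws 30 candidates with `ParameterSampler`. To my understanding this is the sampler RandomizedSearchCV uses internally, but treat that as an assumption rather than a guarantee. The variable names (`space_demo`, `n_total`, `sampled_demo`) are illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as in the cell above
space_demo = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [10, 15, 20, 25, None],
    "min_samples_split": [2, 4, 8, 10],
    "min_samples_leaf": [1, 2, 4, 8],
    "max_features": ["sqrt", "log2", 0.7, 0.8, 0.9],
    "max_samples": [0.8, 0.9, 1.0],
}

# Size of the full grid: 4 * 5 * 4 * 4 * 5 * 3
n_total = len(ParameterGrid(space_demo))

# Draw 30 candidates; when every entry is a list, sampling is without replacement
sampled_demo = list(ParameterSampler(space_demo, n_iter=30, random_state=42))

print(f"Full grid: {n_total} combinations")
print(f"Sampled:   {len(sampled_demo)} ({len(sampled_demo) / n_total:.3%} of the grid)")
print("First candidate:", sampled_demo[0])
```

With 30 of 4800 combinations, the search covers well under 1% of the grid, which is worth keeping in mind when interpreting the best score.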


### Run RandomizedSearchCV

This takes roughly 5-8 minutes, depending on the hardware.

In [9]:
# Create the base Random Forest
rf_base_v2 = RandomForestRegressor(random_state=42, n_jobs=-1)

# RandomizedSearchCV V2
random_search_v2 = RandomizedSearchCV(
    estimator=rf_base_v2,
    param_distributions=param_distributions_v2,
    n_iter=30,  # More combinations
    cv=3,
    scoring='r2',
    random_state=42,
    n_jobs=-1,
    verbose=2
)

print("Starting REFINED search...")
print("Testing 30 combinations (may take 5-8 min)\n")

# Run the search (the target is the log of the price)
random_search_v2.fit(X_train, y_train_log)

print("\n" + "="*60)
print("REFINED SEARCH COMPLETE!")
print("="*60)
print("\nBest hyperparameters found:")
for param, value in random_search_v2.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nBest score (R2 CV): {random_search_v2.best_score_:.4f}")

Starting REFINED search...
Testing 30 combinations (may take 5-8 min)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[CV] END max_depth=10, max_features=0.9, max_samples=0.9, min_samples_leaf=2, min_samples_split=10, n_estimators=50; total time=  13.8s
[CV] END max_depth=25, max_features=log2, max_samples=0.8, min_samples_leaf=2, min_samples_split=4, n_estimators=50; total time=  14.5s
[CV] END max_depth=25, max_features=log2, max_samples=0.8, min_samples_leaf=2, min_samples_split=4, n_estimators=50; total time=  15.6s
[CV] END max_depth=25, max_features=log2, max_samples=0.8, min_samples_leaf=2, min_samples_split=4, n_estimators=50; total time=  15.9s
[CV] END max_depth=25, max_features=0.9, max_samples=0.9, min_samples_leaf=8, min_samples_split=10, n_estimators=50; total time=  15.9s
[CV] END max_depth=10, max_features=0.9, max_samples=0.9, min_samples_leaf=2, min_samples_split=10, n_estimators=50; total time=  16.1s
[CV] END max_depth=25, max_features=0.9, max_samples=0.9, min_samples_leaf=8, min_samples_split=10, n_estimators=50; total time=  16.4s
[CV] END max_depth=25, max_features=0.7, max_sam
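
The verbose log lists individual fits; the aggregated per-candidate scores are stored in the search object's `cv_results_` attribute. The sketch below shows one way to rank them, using a tiny synthetic search (an assumption, chosen so the example runs in seconds) with hypothetical names `search_demo` and `results_demo`. The same table could be built from `random_search_v2.cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption: NOT the UK property dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

search_demo = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [10, 20], "max_depth": [3, 5, None], "min_samples_leaf": [1, 4]},
    n_iter=4, cv=3, scoring="r2", random_state=0,
)
search_demo.fit(X_demo, y_demo)

# One row per candidate, best first; std_test_score shows the spread across folds
results_demo = (
    pd.DataFrame(search_demo.cv_results_)
    .sort_values("rank_test_score")
    [["params", "mean_test_score", "std_test_score", "mean_fit_time"]]
)
print(results_demo.to_string(index=False))
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough sense of whether the top candidates are really distinguishable.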

## Evaluation on the Test Set

Take the best estimator found by the search (already refit on the full training set) and evaluate it on the original scale (£).

In [10]:
# Get the best V2 model
best_rf_v2 = random_search_v2.best_estimator_

# Predictions (convert from log scale back to £)
y_pred_log_v2 = best_rf_v2.predict(X_test)
y_pred_v2 = np.exp(y_pred_log_v2)

# Metrics
mae_v2 = mean_absolute_error(y_test, y_pred_v2)
rmse_v2 = np.sqrt(mean_squared_error(y_test, y_pred_v2))
r2_v2 = r2_score(y_test, y_pred_v2)

print("MODEL V2 (REFINED) - TEST SET METRICS")
print("=" * 60)
print(f"MAE:  £{mae_v2:,.0f}")
print(f"RMSE: £{rmse_v2:,.0f}")
print(f"R2:   {r2_v2:.4f} ({r2_v2*100:.2f}%)")

# NOTE: `r2` is the test-set R2 of model V1, computed in an earlier cell not shown here
print("\nFULL COMPARISON:")
print("-" * 60)
print("Base model (100 est):            R2 = 0.1116 (11.16%)")
print(f"Model V1 (300 est, no limit):    R2 = {r2:.4f} ({r2*100:.2f}%)")
print(f"Model V2 (200 est, depth=15):    R2 = {r2_v2:.4f} ({r2_v2*100:.2f}%)")

# Pick the best model
best_r2 = max(0.1116, r2, r2_v2)
if best_r2 == 0.1116:
    print("\nWINNER: Base model (11.16%)")
elif best_r2 == r2:
    print(f"\nWINNER: Model V1 ({r2*100:.2f}%)")
else:
    print(f"\nWINNER: Model V2 ({r2_v2*100:.2f}%)")

MODEL V2 (REFINED) - TEST SET METRICS
============================================================
MAE:  £82,108
RMSE: £566,975
R2:   0.1094 (10.94%)

FULL COMPARISON:
------------------------------------------------------------
Base model (100 est):            R2 = 0.1116 (11.16%)
Model V1 (300 est, no limit):    R2 = 0.0980 (9.80%)
Model V2 (200 est, depth=15):    R2 = 0.1094 (10.94%)

WINNER: Base model (11.16%)
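
One caveat when reading these numbers: the search was fit on the log of the price, so its cross-validated R2 is on the log scale, whereas the test metrics above are in pounds. The two are not directly comparable. The sketch below illustrates the gap with simulated values (an assumption; `log_true` and `log_pred` are hypothetical and unrelated to the real predictions).

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Simulated log prices and noisy log-scale predictions (assumption: illustrative only)
log_true = rng.normal(loc=12.0, scale=0.8, size=5000)
log_pred = log_true + rng.normal(scale=0.6, size=5000)

# R2 on the log scale (what a search fitted on a log target optimizes)
r2_log = r2_score(log_true, log_pred)

# R2 after converting back to the original scale (what the test-set cell reports)
r2_orig = r2_score(np.exp(log_true), np.exp(log_pred))

print(f"R2 on log scale:      {r2_log:.4f}")
print(f"R2 on original scale: {r2_orig:.4f}")
```

On the original scale, errors on expensive properties are amplified by the exponential, so a model that looks reasonable in log space can score much lower in pounds.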


## Conclusion

After testing multiple hyperparameter configurations:
- Model V1 (300 est, no limit): R2 = 9.80%
- Model V2 (200 est, depth=15): R2 = 10.94%
- **Base model (100 est, default): R2 = 11.16%** ← WINNER

**DECISION:** Use the base model for the final pipeline.

The base model is simpler, faster, and achieved the best test-set R2, although its margin over V2 is small (11.16% vs. 10.94%).
For properties priced up to £1M (98.6% of cases), the R2 is 27%.
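
The notebook does not show how the figure for properties up to £1M was computed. One plausible approach, sketched below under that assumption, is to filter the test rows by their true price and recompute R2 on the subset. The helper `r2_below_cap` and the toy values are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_below_cap(y_true, y_pred, cap=1_000_000):
    """Return (R2 on rows whose true price is <= cap, share of rows kept)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true <= cap
    return r2_score(y_true[mask], y_pred[mask]), mask.mean()

# Toy values (assumption: invented for illustration); the last row is an extreme outlier
y_true_demo = [150_000, 250_000, 400_000, 5_000_000]
y_pred_demo = [160_000, 240_000, 380_000, 600_000]

r2_all = r2_score(y_true_demo, y_pred_demo)
r2_sub, share = r2_below_cap(y_true_demo, y_pred_demo)

print(f"R2, all rows:     {r2_all:.4f}")
print(f"R2, up to £1M:    {r2_sub:.4f} ({share:.1%} of rows)")
```

A single badly predicted expensive property can dominate the overall R2, which is consistent with the gap between the overall and capped scores reported above.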