# Motor de Insights: Complete Pipeline
**Summary:** 1. synthetic data generation, 2. training, saving, and loading a RandomForest model.

**Objective:** Document the complete flow of generating synthetic data, training models, and saving them for future use.

**Authors:** Group No. 17 - BootCampo FTL Angola  
**Date:** 2025

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
import warnings
warnings.filterwarnings('ignore')

# Base directories (adjust as needed)
path_data = '../data/raw/dados_sinteticos.csv'
path_model = '../src/Motor_de_Insights_Streamlit/models/modelo_sintetico.pkl'

os.makedirs(os.path.dirname(path_data), exist_ok=True)
os.makedirs(os.path.dirname(path_model), exist_ok=True)

print('Paths configured:')
print(' - Dataset:', path_data)
print(' - Model:', path_model)

Paths configured:
 - Dataset: ../data/raw/dados_sinteticos.csv
 - Model: ../src/Motor_de_Insights_Streamlit/models/modelo_sintetico.pkl


## 1️⃣ Synthetic Data Generation

In [4]:
np.random.seed(42)
provinces = [
    "Bengo", "Benguela", "Bié", "Cabinda", "Cuando-Cubango", "Cuanza Norte",
    "Cuanza Sul", "Cunene", "Huambo", "Huíla", "Luanda", "Lunda Norte",
    "Lunda Sul", "Malanje", "Moxico", "Namibe", "Uíge", "Zaire",
    "Icolo Bengo", "Moxico Leste", "Moxico Sul"
]
dates = pd.date_range('2018-01-01', '2024-12-01', freq='MS')

rows = []
for prov in provinces:
    base_visitors = np.random.randint(2000, 50000)
    seasonality = np.sin(np.linspace(0, 2*np.pi, len(dates)))
    growth = np.linspace(0, 0.25, len(dates))
    noise = np.random.normal(0, 0.15, len(dates))
    for i, d in enumerate(dates):
        visitors = max(0, int(base_visitors * (1 + growth[i]) * (1 + 0.4*seasonality[i]) * (1 + noise[i])))
        occupancy = min(100, max(10, np.random.normal(50 + (growth[i]*20), 8)))
        revenue = visitors * np.random.uniform(20, 60)
        mobility = np.clip(50 + 10*seasonality[i] + np.random.normal(0,5), 10, 100)
        env_index = np.clip(100 - (visitors/1000) + np.random.normal(0,3), 0, 100)
        events = np.random.poisson(lam=2 if d.month in [7,8,9] else 0.7)
        rows.append([d, prov, visitors, occupancy, revenue, mobility, env_index, events])

df = pd.DataFrame(rows, columns=['date','province','visitors','occupancy_rate','revenue','mobility_index','env_index','events_count'])
df.to_csv(path_data, index=False)
print(f"✅ Synthetic dataset saved to: {path_data}")
df.head()

✅ Synthetic dataset saved to: ../data/raw/dados_sinteticos.csv


| | date | province | visitors | occupancy_rate | revenue | mobility_index | env_index | events_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | Bengo | 14827 | 60.974897 | 827993.828159 | 50.877766 | 83.066841 | 0 |
| 1 | 2018-02-01 | Bengo | 19268 | 47.438944 | 714540.51485 | 52.236889 | 81.515166 | 0 |
| 2 | 2018-03-01 | Bengo | 19776 | 28.879074 | 571216.162395 | 53.235833 | 79.195856 | 1 |
| 3 | 2018-04-01 | Bengo | 22539 | 43.762505 | 918500.872939 | 47.536275 | 81.672876 | 2 |
| 4 | 2018-05-01 | Bengo | 18401 | 34.890794 | 577673.997482 | 52.84941 | 81.906043 | 1 |
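In the generation cell, `seasonality` is a single sine cycle stretched over the whole 2018-2024 range, so the same calendar month gets a different seasonal factor each year. The sketch below contrasts that term with a hypothetical 12-month periodic variant. The names `month_index` and `annual_cycle` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2018-01-01', '2024-12-01', freq='MS')

# Term as written in the generation cell: one sine cycle over the whole range.
single_cycle = np.sin(np.linspace(0, 2 * np.pi, len(dates)))

# Hypothetical alternative: one sine cycle per calendar year.
month_index = dates.month.to_numpy() - 1  # 0 for January ... 11 for December
annual_cycle = np.sin(2 * np.pi * month_index / 12)

# Compare the January values of the first two years for each variant.
jan_2018, jan_2019 = 0, 12
print('single cycle:', single_cycle[jan_2018], single_cycle[jan_2019])
print('annual cycle:', annual_cycle[jan_2018], annual_cycle[jan_2019])
```

With the periodic variant, both January values coincide. With the single cycle they differ, which makes the term behave more like a slow multi-year wave than a yearly season.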


## 2️⃣ Preprocessing and Features

In [6]:
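df['date'] = pd.to_datetime(df['date'])
# Previous month's visitors within each province (NaN for each province's first month)
df['visitors_lag1'] = df.groupby('province')['visitors'].shift(1)
# Month-over-month relative change in visitors
df['monthly_growth'] = (df['visitors'] - df['visitors_lag1']) / df['visitors_lag1']
df['monthly_growth'] = df['monthly_growth'].fillna(0)
df['revenue_per_visitor'] = df['revenue'] / df['visitors']
# Synthetic area, drawn independently for every row (it varies by month within a province)
df['area_km2'] = np.random.randint(5000, 70000, len(df))
df['tourist_density'] = df['visitors'] / df['area_km2']
# Drop rows with missing values (each province's first month has no lag)
df = df.dropna().reset_index(drop=True)
df.head()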

| | date | province | visitors | occupancy_rate | revenue | mobility_index | env_index | events_count | visitors_lag1 | monthly_growth | revenue_per_visitor | area_km2 | tourist_density |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-02-01 | Bengo | 19268 | 47.438944 | 714540.51485 | 52.236889 | 81.515166 | 0 | 14827.0 | 0.299521 | 37.084312 | 33975 | 0.567123 |
| 1 | 2018-03-01 | Bengo | 19776 | 28.879074 | 571216.162395 | 53.235833 | 79.195856 | 1 | 19268.0 | 0.026365 | 28.884312 | 58242 | 0.339549 |
| 2 | 2018-04-01 | Bengo | 22539 | 43.762505 | 918500.872939 | 47.536275 | 81.672876 | 2 | 19776.0 | 0.139715 | 40.751625 | 51546 | 0.43726 |
| 3 | 2018-05-01 | Bengo | 18401 | 34.890794 | 577673.997482 | 52.84941 | 81.906043 | 1 | 22539.0 | -0.183593 | 31.39362 | 31551 | 0.583214 |
| 4 | 2018-06-01 | Bengo | 19099 | 46.840331 | 594854.752818 | 52.733513 | 81.805642 | 0 | 18401.0 | 0.037933 | 31.145859 | 45928 | 0.415847 |
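The output above shows a different `area_km2` for every Bengo row, because the area is drawn once per row. If a fixed area per province is wanted, one possible approach is to draw one value per province and map it onto the rows. This is a minimal sketch on a tiny stand-in frame. `df_demo`, `area_by_province`, and the visitor counts for Luanda are illustrative assumptions, not notebook data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Tiny stand-in for the notebook's df (values are illustrative only).
df_demo = pd.DataFrame({
    'province': ['Bengo', 'Bengo', 'Luanda', 'Luanda'],
    'visitors': [14827, 19268, 30000, 31000],
})

# One synthetic area per province, in the same range as the cell above.
area_by_province = {
    prov: int(rng.integers(5000, 70000))
    for prov in df_demo['province'].unique()
}
df_demo['area_km2'] = df_demo['province'].map(area_by_province)
df_demo['tourist_density'] = df_demo['visitors'] / df_demo['area_km2']
print(df_demo)
```

With a constant area, `tourist_density` changes only when `visitors` changes, which is closer to what the column name suggests.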


## 3️⃣ Model Training (RandomForest)

In [8]:
features = ['visitors_lag1','occupancy_rate','revenue_per_visitor','mobility_index','env_index','events_count','tourist_density']
X = df[features]
y = df['visitors']

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, shuffle=False)
model = RandomForestRegressor(n_estimators=200, max_depth=15, random_state=42)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mae = mean_absolute_error(y_test, pred)
# RMSE via np.sqrt: `squared=False` is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)

print(f"Training complete: MAE={mae:.2f}, RMSE={rmse:.2f}, R2={r2:.3f}")

Training complete: MAE=2260.92, RMSE=2860.54, R2=0.973
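Because `df` is ordered by province and then by date, `train_test_split(..., shuffle=False)` holds out the last 20% of rows, which belong to the last provinces in the list rather than to the most recent months. If the goal is to evaluate forecasts on unseen future months, a date-based split is one option. The helper `chronological_split` below is a hypothetical sketch and is not part of the notebook.

```python
import pandas as pd

def chronological_split(frame, date_col='date', test_fraction=0.2):
    """Hold out the most recent months (for every province) as the test set."""
    months = frame[date_col].drop_duplicates().sort_values()
    n_test = max(1, int(round(len(months) * test_fraction)))
    cutoff = months.iloc[-n_test]  # first month that goes to the test set
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test

# Illustrative frame: 2 provinces x 10 months, ordered by province like the notebook's df.
demo_dates = pd.date_range('2018-01-01', periods=10, freq='MS')
demo = pd.DataFrame({
    'date': list(demo_dates) * 2,
    'province': ['Bengo'] * 10 + ['Luanda'] * 10,
    'visitors': range(20),
})

train_demo, test_demo = chronological_split(demo)
print(len(train_demo), len(test_demo))
print(sorted(test_demo['province'].unique()))
```

Both provinces appear in the test set, and every test month is later than every training month.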


## 4️⃣ Saving the Model

In [10]:
joblib.dump(model, path_model)
print(f"Model saved to: {path_model}")

Model saved to: ../src/Motor_de_Insights_Streamlit/models/modelo_sintetico.pkl
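The cell above persists only the `RandomForestRegressor`. The `MinMaxScaler` fitted in the training cell is not saved, yet the model was trained on scaled inputs, so any consumer of the `.pkl` file would need the same scaler. One possible way to keep them together is a scikit-learn `Pipeline`. This is a minimal sketch on made-up data. `X_raw`, `y_demo`, and the temporary file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_raw = rng.uniform(0, 1000, size=(60, 3))           # unscaled, illustrative features
y_demo = 2 * X_raw[:, 0] + rng.normal(0, 1, size=60)

# Scaler and model travel together; the scaler is fitted on training rows only.
pipeline = make_pipeline(
    MinMaxScaler(),
    RandomForestRegressor(n_estimators=20, random_state=42),
)
pipeline.fit(X_raw[:48], y_demo[:48])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'pipeline_demo.pkl')
    joblib.dump(pipeline, demo_path)
    restored = joblib.load(demo_path)

# The restored pipeline accepts raw (unscaled) rows directly.
same = np.allclose(pipeline.predict(X_raw[48:]), restored.predict(X_raw[48:]))
print('Predictions match after reload:', same)
```

A single artifact of this kind lets the loading side call `predict` on raw feature values without re-creating the scaling step.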


## 5️⃣ Loading and Testing the Saved Model

In [12]:
modelo_carregado = joblib.load(path_model)
pred_test = modelo_carregado.predict(X_test)
print("Model loaded successfully. Sample predictions:", pred_test[:5])

Model loaded successfully. Sample predictions: [19114.73194454 22308.2445     19459.73979319 20834.1707619
 19619.91736655]


## 6️⃣ (Optional) Train and save one model per province

In [14]:
# for prov in provinces:
#     df_p = df[df['province']==prov].dropna()
#     if len(df_p) < 12:
#         continue
#     X_p = df_p[features]
#     y_p = df_p['visitors']
#     Xp_train, Xp_test, yp_train, yp_test = train_test_split(X_p, y_p, test_size=0.2, shuffle=False)
#     model_p = RandomForestRegressor(n_estimators=200, random_state=42)
#     model_p.fit(Xp_train, yp_train)
#     path_model_prov = f"../src/Motor_de_Insights_Streamlit/models/modelo_{prov}.pkl"
#     joblib.dump(model_p, path_model_prov)
#     print(f"Model saved for {prov}: {path_model_prov}")
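The commented loop above builds file names directly from province names, several of which contain spaces, hyphens, or accented characters (for example `Cuanza Norte` or `Bié`). If plain ASCII file names are preferred, a small normalising helper could be applied first. `province_slug` below is a hypothetical helper and is not part of the notebook.

```python
import re
import unicodedata

def province_slug(name):
    """Return a lowercase ASCII file-name fragment for a province name."""
    # Decompose accented characters, then drop the combining marks.
    ascii_name = (
        unicodedata.normalize('NFKD', name)
        .encode('ascii', 'ignore')
        .decode('ascii')
    )
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r'[^A-Za-z0-9]+', '_', ascii_name).strip('_').lower()

for prov in ['Bié', 'Cuanza Norte', 'Cuando-Cubango']:
    print(prov, '->', f"modelo_{province_slug(prov)}.pkl")
```

The loading side would need to apply the same helper to locate the file for a given province.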

## Note:
This notebook:
- Generates synthetic data and saves it to `../data/raw/`
- Trains a predictive model and saves it to `../src/Motor_de_Insights_Streamlit/models/`
- Loads the saved model and runs predictions.

Ready for integration with the Streamlit module (`app.py`).