# Simple LSTM (Forecasting the Sum of the Next 4 Weeks)

This notebook implements a simple LSTM model that forecasts the **sum of the next 4 weeks** (a monthly forecast) of respiratory morbidity rates in Brazilian municipalities.

- **Model:** Simple LSTM
- **Target:** Sum of the next 4 weeks (monthly forecast)
- **Input:** sequence of 12 weeks (shape: [batch, 12, 1])
- **Architecture:** LSTM(32, return_sequences=False) → Dense(1)
- **Loss:** MAE
- **All code is modular and imported from the `src/` modules.**
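
As a rough size check for the architecture listed above, the trainable parameter count can be worked out by hand. This is a sketch assuming a standard LSTM cell (four gates, each with input weights, recurrent weights, and a bias) and a univariate input; `build_lstm` in `src/models.py` is not shown in this notebook, so the actual layer configuration may differ.

```python
def lstm_param_count(units: int, input_dim: int) -> int:
    """Parameters of a standard LSTM layer: 4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (units * input_dim + units * units + units)


def dense_param_count(inputs: int, outputs: int) -> int:
    """Parameters of a fully connected layer: weights plus one bias per output."""
    return inputs * outputs + outputs


lstm_params = lstm_param_count(units=32, input_dim=1)   # 4 * (32 + 1024 + 32)
dense_params = dense_param_count(inputs=32, outputs=1)  # 32 weights + 1 bias
total_params = lstm_params + dense_params

print(f"LSTM: {lstm_params}, Dense: {dense_params}, total: {total_params}")
```

Under these assumptions the model has 4,385 trainable parameters, which is small relative to the roughly one thousand training windows a single city provides.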

In [1]:
import sys
import os

# Get the absolute path to the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
print(f"Project root: {project_root}")

# Add the project root to sys.path (not the src directory)
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"Added {project_root} to sys.path")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.preprocessing import load_city_data, prepare_data_for_model, filter_city, clean_timeseries
from src.models import build_lstm
from src.train import train_model, evaluate_model, generate_forecasts, save_predictions, save_metrics
from src.utils import plot_forecast, plot_forecast_error, plot_training_history

results_dir = os.path.join('results', 'morbcirc_run', 'lstm_simple')
os.makedirs(results_dir, exist_ok=True)

np.random.seed(42)

Project root: c:\Users\pedro\OneDrive - Unesp\Documentos\GitHub\cities-models\cities-models
Added c:\Users\pedro\OneDrive - Unesp\Documentos\GitHub\cities-models\cities-models to sys.path


## Repository Structure

- **data/**: One CSV per city, each with a date column, a target column, and optional features.
- **notebooks/**: One notebook per experiment. Visualization and exploration only.
- **src/**: Reusable modules:
    - `preprocessing.py`: Loading, normalization, splitting, window creation
    - `models.py`: Model definitions (baselines, MLP, LSTM, etc.)
    - `train.py`: Training and evaluation routines
    - `utils.py`: Helper functions (plotting, metrics, etc.)
- **results/**: Saved forecasts and metrics.
- **instructions.md**: Best-practices guide.

**All core logic lives in `src/`.**

## Data Loading and Exploration

Load the respiratory morbidity data for analysis. You can iterate over all cities or select a specific one.

In [2]:
# Example: load the data (adjust the path as needed)
data_path = '../data/df_base_morb_circ.csv'
df = load_city_data(data_path)

print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (6344064, 11)


Unnamed: 0,CD_MUN,target,week,PIB,DENS,URB,CO2,CH4,N2O,LAT,LON
0,1100015,0.199872,1,3469.14,3.541043,0.000611,550.985905,92.946598,6.657747,-12.883213,-62.39
1,1100015,1.304184,2,3469.14,3.541043,0.000611,550.985905,92.946598,6.657747,-12.883213,-62.39
2,1100015,2.495194,3,3469.14,3.541043,0.000611,550.985905,92.946598,6.657747,-12.883213,-62.39
3,1100015,3.538533,4,3469.14,3.541043,0.000611,550.985905,92.946598,6.657747,-12.883213,-62.39
4,1100015,11.927224,5,3469.14,3.541043,0.000611,550.985905,92.946598,6.657747,-12.883213,-62.39


In [3]:
print("Available columns:")
print(df.columns.tolist())
df.describe()

Available columns:
['CD_MUN', 'target', 'week', 'PIB', 'DENS', 'URB', 'CO2', 'CH4', 'N2O', 'LAT', 'LON']


Unnamed: 0,CD_MUN,target,week,PIB,DENS,URB,CO2,CH4,N2O,LAT,LON
count,6344064.0,6344064.0,6344064.0,5726592.0,5726592.0,5726592.0,5726592.0,5726592.0,5726592.0,5726592.0,5726592.0
mean,3241650.0,14.6605,576.5,3567.274,69.38194,0.00991441,27641.12,597.1645,9.700125,-16.33024,-46.32234
std,979687.5,14.13721,332.5537,3776.046,343.3605,0.04076758,95698.34,10303.29,68.01206,8.333191,6.413752
min,1100015.0,0.0,1.0,625.09,0.08923965,7.13181e-07,0.0,0.0,0.0,-33.65254,-73.484
25%,2511202.0,5.462076,288.75,1482.63,10.60286,0.000928566,3303.876,57.40669,3.770217,-22.74837,-50.908
50%,3144672.0,11.58872,576.5,2745.46,22.5004,0.00214708,7940.41,110.0078,6.881932,-17.80183,-46.519
75%,4116604.0,20.15466,864.25,4477.39,44.06899,0.005144121,18026.59,187.5715,10.23172,-8.379808,-41.599
max,5300108.0,2106.944,1152.0,122011.2,12763.56,0.9601277,2074703.0,481882.2,3683.565,4.685425,-34.87


In [4]:
# Select city for modeling (set to None to use all cities)
CD_MUN_SELECTED = 3550308  # São Paulo

df_city = filter_city(df, cd_mun=CD_MUN_SELECTED)
df_city = clean_timeseries(df_city, target_column='target')
print(f"Selected city shape: {df_city.shape}")

Selected city shape: (1152, 11)
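
The shapes printed so far allow a quick consistency check. The sketch below assumes every municipality contributes exactly 1,152 weekly rows (as São Paulo does, and as the `week` range of 1 to 1152 in `describe()` suggests); the notebook itself does not verify that assumption.

```python
total_rows = 6_344_064       # df.shape[0]
rows_per_city = 1_152        # df_city.shape[0] for São Paulo
covariate_rows = 5_726_592   # non-null count of PIB, DENS, URB, ... in describe()

n_cities, remainder = divmod(total_rows, rows_per_city)
n_cities_with_covariates, cov_remainder = divmod(covariate_rows, rows_per_city)
n_cities_missing_covariates = n_cities - n_cities_with_covariates

print(f"Municipalities: {n_cities} (remainder {remainder})")
print(f"With covariates: {n_cities_with_covariates} (remainder {cov_remainder})")
print(f"Missing covariates: {n_cities_missing_covariates}")
```

Both divisions come out exact, which is consistent with 5,507 municipalities of which 536 have no covariate values. That matters only if the covariates are later used as model inputs; this notebook uses the target series alone.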


## Preprocessing

Prepare the data for the simple LSTM model. The input is a sequence of 12 weeks.

In [5]:
# Preprocessing for the simple LSTM (sum of the next 4 weeks)
model_params = {
    'sequence_length': 12,
    'forecast_horizon': 4,  # Sum of the next 4 weeks (monthly forecast)
    'normalization': 'zscore',
    'val_size': None  # Use default (10% of train)
}

target_column = 'target'

data_dict = prepare_data_for_model(
    df=df_city,
    target_column=target_column,
    sequence_length=model_params['sequence_length'],
    forecast_horizon=model_params['forecast_horizon'],
    normalization=model_params['normalization'],
    val_size=model_params.get('val_size', None)
)

X_train = data_dict['X_train']
y_train = data_dict['y_train']
X_test = data_dict['X_test']
y_test = data_dict['y_test']
test_df = data_dict['test_df']
scaler = data_dict.get('scaler')
feature_columns = data_dict.get('feature_columns', None)

# Avoid a KeyError if X_val/y_val are missing from data_dict
X_val = data_dict.get('X_val', None)
y_val = data_dict.get('y_val', None)
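
`prepare_data_for_model` lives in `src/preprocessing.py` and is not shown in this notebook. As a minimal sketch of the windowing it is described as performing, assuming a univariate series and a target equal to the plain sum of the next `forecast_horizon` values, the hypothetical helper `make_sum_windows` below builds inputs of shape `[n, 12, 1]`. Normalization and the train/validation/test split are left out.

```python
import numpy as np


def make_sum_windows(series, sequence_length=12, forecast_horizon=4):
    """Hypothetical helper: sliding input windows and summed future targets."""
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - sequence_length - forecast_horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    # Each input row holds `sequence_length` consecutive observations.
    X = np.stack([series[i:i + sequence_length] for i in range(n_samples)])
    # Each target is the sum of the `forecast_horizon` observations that follow.
    y = np.array([
        series[i + sequence_length:i + sequence_length + forecast_horizon].sum()
        for i in range(n_samples)
    ])
    return X[..., np.newaxis], y  # X: [n, sequence_length, 1], y: [n]


demo_series = np.arange(20)  # toy series 0, 1, ..., 19
X_demo, y_demo = make_sum_windows(demo_series)
print(X_demo.shape, y_demo.shape)
print(y_demo[0])  # sum of positions 12..15 of the toy series
```

A series of length `T` yields `T - 12 - 4 + 1` samples, so the last 15 weeks of any split cannot start a complete window.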

## Model Definition and Training

The simple LSTM model uses one LSTM layer with 32 units followed by a dense output layer.
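
The training call in this section passes `patience=15`, which suggests early stopping on the validation loss; `train_model` itself is not shown, so that is an assumption. A minimal sketch of the patience rule (stop once the validation loss has failed to improve for `patience` consecutive epochs), using a hypothetical helper `early_stopping_epoch`:

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Validation losses of the first four epochs logged in this notebook: still improving.
logged = [2.6523, 2.1694, 2.0328, 1.9285]
stop_logged = early_stopping_epoch(logged, patience=15)

# Toy curve that stalls after epoch 2.
stop_toy = early_stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=3)
print(stop_logged, stop_toy)
```

With `epochs=100` and `patience=15`, training under this rule would end early only after 15 epochs in a row without a new best validation loss.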

In [6]:
input_shape = X_train.shape[1:]
model = build_lstm(input_shape=input_shape, units=32, loss='mae')

history = train_model(
    model=model,
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val,
    batch_size=32,
    epochs=100,
    patience=15,
    verbose=2
)

Epoch 1/100
31/31 ━━━━━━━━━━━━━━━━━━━━ 1s 9ms/step - loss: 2.6986 - val_loss: 2.6523 - learning_rate: 0.0010
Epoch 2/100
31/31 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 2.0161 - val_loss: 2.1694 - learning_rate: 0.0010
Epoch 3/100
31/31 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 1.6489 - val_loss: 2.0328 - learning_rate: 0.0010
Epoch 4/100
31/31 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 1.5108 - val_loss: 1.9285 - learning_rate: 0.0010

## Model Evaluation

Compute MAE, RMSE, and R² for the simple LSTM model.
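
For reference, with $y_i$ the observed 4-week sums in the test set, $\hat{y}_i$ the corresponding predictions, and $\bar{y}$ the mean of the observed test values, the three metrics follow their standard definitions:

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

$R^2$ is negative whenever the model's squared error exceeds that of always predicting $\bar{y}$.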

In [13]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# generate_forecasts is imported from src.train
# scaler, y_test, X_test, model come from previous cells
# target_column, feature_columns are defined in the preprocessing cell (from data_dict)

# Generate predictions (these are expected to be scaled)
y_pred_scaled = generate_forecasts(model, X_test)

# y_test is also scaled, from prepare_data_for_model.
# Ensure y_test and y_pred_scaled are 2D for the scaler: (n_samples, n_features=1)
y_test_for_scaler = np.asarray(y_test)
y_pred_for_scaler = np.asarray(y_pred_scaled)

if y_test_for_scaler.ndim == 1:
    y_test_for_scaler = y_test_for_scaler.reshape(-1, 1)
if y_pred_for_scaler.ndim == 1:
    y_pred_for_scaler = y_pred_for_scaler.reshape(-1, 1)

# Denormalize if scaler exists
if scaler is not None:
    # scaler.scale_ is the array of standard deviations. Its length is n_features_in_.
    # For older sklearn versions, scaler.n_features_in_ might not exist, use len(scaler.scale_)
    num_scaler_features = getattr(scaler, 'n_features_in_', len(scaler.scale_))

    if num_scaler_features > 1 and y_test_for_scaler.shape[1] == 1:
        # Scaler was fit on multiple features, but we have a single column (target) to denormalize.
        # We need to find the index of our target_column in the features the scaler was fit on.
        # 'target_column' (e.g., 'target') is defined in the preprocessing cell
        # 'feature_columns' (list of column names the scaler was fit on) comes from data_dict in that cell
        
        if feature_columns is None or target_column not in feature_columns:
            # This is an issue. Defaulting to index 0 is a guess and might be wrong.
            print(f"Warning: 'feature_columns' from prepare_data_for_model is '{feature_columns}' and/or does not contain target '{target_column}'. Assuming target is at index 0 for the scaler's features for denormalization.")
            target_idx_in_scaler = 0
            # Ensure the guessed index is valid if feature_columns was None but scaler has multiple features
            if not (0 <= target_idx_in_scaler < num_scaler_features):
                 raise ValueError(f"Fallback target_idx {target_idx_in_scaler} is out of bounds for scaler with {num_scaler_features} features. Cannot denormalize.")
        else:
            try:
                target_idx_in_scaler = feature_columns.index(target_column)
            except ValueError:
                raise ValueError(f"Target column '{target_column}' not found in scaler's feature_columns: {feature_columns}. Cannot determine correct parameters for denormalization.")

        # Ensure target_idx is within bounds for scaler.mean_ and scaler.scale_
        if not (0 <= target_idx_in_scaler < num_scaler_features):
            raise ValueError(f"Determined target_idx {target_idx_in_scaler} is out of bounds for scaler with {num_scaler_features} features (mean/scale arrays length).")
            
        target_mean = scaler.mean_[target_idx_in_scaler]
        target_std = scaler.scale_[target_idx_in_scaler]

        y_test_denorm = (y_test_for_scaler * target_std) + target_mean
        y_pred_denorm = (y_pred_for_scaler * target_std) + target_mean
    else:
        # Scaler was fit on a single feature, or y_test_for_scaler has the same number of features.
        # Direct inverse_transform should work.
        y_test_denorm = scaler.inverse_transform(y_test_for_scaler)
        y_pred_denorm = scaler.inverse_transform(y_pred_for_scaler)
else:
    y_test_denorm = y_test_for_scaler # Already in original scale if no scaler
    y_pred_denorm = y_pred_for_scaler # Already in original scale if no scaler

# Flatten to 1D for metrics and plotting
y_test_1d = y_test_denorm.flatten()
y_pred_1d = y_pred_denorm.flatten()

# Calculate metrics on the original (denormalized) scale
mae = mean_absolute_error(y_test_1d, y_pred_1d)
rmse = np.sqrt(mean_squared_error(y_test_1d, y_pred_1d))
r2 = r2_score(y_test_1d, y_pred_1d)

metrics = {
    'mae': mae,
    'rmse': rmse,
    'r2': r2
}

print("Evaluation metrics (original scale):")
print(f"MAE: {metrics['mae']:.4f}")
print(f"RMSE: {metrics['rmse']:.4f}")
print(f"R²: {metrics['r2']:.4f}")

# y_test_1d, y_pred_1d are now denormalized and 1D.
# These will be used by the plotting cell and saving predictions cell.
# test_df (for dates) and model_params (for horizon in titles) are also available.

2/2 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Evaluation metrics (original scale):
MAE: 27.6212
RMSE: 28.4157
R²: -17.3736
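
The two error metrics reported for this run carry a little more information than either does alone, because the variance of the absolute errors equals RMSE² minus MAE². The sketch below applies that identity to the printed values. Reading the result as a systematic offset is an interpretation, not something the notebook establishes, since the signs of the individual errors are not shown.

```python
import math

mae = 27.6212    # reported MAE
rmse = 28.4157   # reported RMSE

# mean(e^2) - mean(|e|)^2 is the variance of |e|, so it cannot be negative.
abs_error_variance = rmse ** 2 - mae ** 2
abs_error_std = math.sqrt(abs_error_variance)

print(f"Std of absolute errors: {abs_error_std:.2f}")
print(f"Ratio std/mean of absolute errors: {abs_error_std / mae:.2f}")
```

The absolute errors average about 27.6 with a spread of only about 6.7, so nearly every test prediction misses by a similar amount. Plotting the signed residuals, as the visualization cell intends, would show whether they all fall on the same side.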


## Visualizing the Results

Plot the actual versus predicted values and the forecast error. The target is the **sum of the next 4 weeks** (monthly forecast).

In [None]:
# y_test_1d and y_pred_1d are already denormalized by the evaluation cell.
# test_df and model_params come from the preprocessing cell.

# Align test_dates
if 'week' in test_df.columns:
    # y_test_1d is now the denormalized target series.
    # test_df is the slice of the original dataframe for the test set.
    # The dates should correspond to the time periods y_test_1d refers to.
    test_dates = test_df['week'].values[-len(y_test_1d):]
else:
    test_dates = np.arange(len(y_test_1d))

fig = plot_forecast(
    true_values=y_test_1d,
    predictions=y_pred_1d,
    dates=test_dates,
    title=f"Simple LSTM: Forecast vs. Actual (sum of the next {model_params['forecast_horizon']} weeks)",
    true_label=f"Actual ({model_params['forecast_horizon']}-week sum)",
    pred_label=f"LSTM forecast ({model_params['forecast_horizon']}-week sum)"
)
plt.savefig(os.path.join(results_dir, 'forecast_plot.png'))
plt.show()

fig = plot_forecast_error(
    true_values=y_test_1d,
    predictions=y_pred_1d,
    dates=test_dates,
    title=f"Simple LSTM: Forecast Error (sum of the next {model_params['forecast_horizon']} weeks)"
)
plt.savefig(os.path.join(results_dir, 'error_plot.png'))
plt.show()

plt.figure(figsize=(14,6))
plt.plot(test_df['week'], test_df['target'], label='Actual (weekly)', color='gray', alpha=0.5)
plt.scatter(test_df['week'].values[-len(y_pred_1d):], y_pred_1d, label='Predicted (4-week sum)', color='crimson', zorder=3)
plt.scatter(test_df['week'].values[-len(y_test_1d):], y_test_1d, label='Actual (4-week sum)', color='royalblue', marker='x', zorder=3)
plt.title('Actual Weekly Series vs. Predicted 4-Week Sum (Monthly)')
plt.xlabel('Week')
plt.ylabel('Respiratory Morbidity Rate')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

# Diagnostic: check the alignment of y_test and y_pred
N = 10
week_idx = test_dates  # Already aligned with y_test_1d and y_pred_1d
df_diag = pd.DataFrame({
    'week': week_idx[:N],
    f'y_test ({model_params["forecast_horizon"]}-week sum)': y_test_1d[:N],
    f'y_pred ({model_params["forecast_horizon"]}-week sum)': y_pred_1d[:N]
})
display(df_diag)

# --- Advanced Evaluation and Visualization ---

# 1. Forecast vs Actual (already present)
fig1 = plot_forecast(y_test_1d, y_pred_1d, dates=test_dates, title="Forecast vs Actual")
fig1.show()

# 2. Forecast Error (already present)
fig2 = plot_forecast_error(y_test_1d, y_pred_1d, dates=test_dates, title="Forecast Error")
fig2.show()

# 3. Distribution of Errors
errors = y_pred_1d - y_test_1d
plt.figure(figsize=(10,6))
sns.histplot(errors, kde=True, bins=30, color='crimson')
plt.title('Distribution of Forecast Errors')
plt.xlabel('Error (Predicted - Actual)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# 4. Scatter Plot: Actual vs Predicted
plt.figure(figsize=(8,8))
plt.scatter(y_test_1d, y_pred_1d, alpha=0.5)
plt.plot([y_test_1d.min(), y_test_1d.max()], [y_test_1d.min(), y_test_1d.max()], 'k--', lw=2)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted Scatter')
plt.grid(True)
plt.show()

# 5. Residuals Over Time
plt.figure(figsize=(12,6))
plt.plot(test_dates, errors, marker='o', linestyle='-', color='orange')
plt.axhline(0, color='black', linestyle='--')
plt.title('Residuals Over Time')
plt.xlabel('Week')
plt.ylabel('Residual (Predicted - Actual)')
plt.grid(True)
plt.show()

# 6. Rolling Mean of Errors
window = 8
rolling_error = pd.Series(errors).rolling(window=window).mean()
plt.figure(figsize=(12,6))
plt.plot(test_dates, rolling_error, color='purple')
plt.title(f'Rolling Mean of Errors (window={window})')
plt.xlabel('Week')
plt.ylabel('Rolling Mean Error')
plt.grid(True)
plt.show()

# 7. Cumulative Error
cumulative_error = np.cumsum(errors)
plt.figure(figsize=(12,6))
plt.plot(test_dates, cumulative_error, color='teal')
plt.title('Cumulative Forecast Error')
plt.xlabel('Week')
plt.ylabel('Cumulative Error')
plt.grid(True)
plt.show()

# 8. Feature Correlation Heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df_city.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

# 9. Target Distribution
plt.figure(figsize=(10,6))
sns.histplot(df_city['target'], kde=True, bins=30, color='royalblue')
plt.title('Target Distribution')
plt.xlabel('Target')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# 10. Missing Values Heatmap
plt.figure(figsize=(12,4))
sns.heatmap(df_city.isnull(), cbar=False, yticklabels=False)
plt.title('Missing Values Heatmap')
plt.show()

[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 18ms/step


ValueError: non-broadcastable output operand with shape (37,1) doesn't match the broadcast shape (37,10)
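
The `ValueError` above is the message NumPy raises when an in-place operation would have to broadcast a `(37, 1)` array to `(37, 10)`. That is consistent with a scaler fitted on 10 columns being asked to invert a single-column array somewhere in this cell, although the truncated traceback does not show where. Below is a minimal NumPy-only reproduction, with made-up `scale` and `mean` vectors standing in for the scaler's parameters, followed by the per-column workaround that the evaluation cell already applies:

```python
import numpy as np

rng = np.random.default_rng(0)
y_scaled = rng.normal(size=(37, 1))   # single target column, z-scored
scale = np.linspace(1.0, 10.0, 10)    # made-up per-column standard deviations
mean = np.linspace(5.0, 50.0, 10)     # made-up per-column means
target_idx = 0                        # assumed position of the target column

# In-place scaling by a 10-element vector cannot write into a (37, 1) array.
try:
    broken = y_scaled.copy()
    broken *= scale
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Workaround: use only the target column's statistics.
y_original = y_scaled * scale[target_idx] + mean[target_idx]
print(y_original.shape)
```

If the plotting helpers in `src/utils.py` or `generate_forecasts` call `scaler.inverse_transform` internally, passing them already-denormalized arrays (as this cell does) and disabling any internal denormalization would be one way to avoid the error. Whether they do so cannot be determined from this notebook.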

## Saving Results

Save the forecasts and metrics for later comparison.

In [None]:
# --- Save Results ---
# Use correct city_name for saving
city_name = str(CD_MUN_SELECTED) if CD_MUN_SELECTED is not None else 'all'
preds_file = save_predictions(
    y_true=y_test_1d,
    y_pred=y_pred_1d,
    dates=test_dates,
    city_name=city_name,
    model_name='lstm_simple',
    output_dir=results_dir
)
print(f"Predictions saved to: {preds_file}")

metrics_file = save_metrics(
    metrics=metrics,
    city_name=city_name,
    model_name='lstm_simple',
    output_dir=results_dir,
    params=model_params
)
print(f"Metrics saved to: {metrics_file}")

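## Conclusion

The simple LSTM serves as a neural baseline for forecasting the 4-week sum of respiratory morbidity from weekly data. In the run recorded here (São Paulo), it reached MAE 27.62, RMSE 28.42, and R² -17.37 on the test set, so it does not yet beat a constant prediction at the test-set mean. Compare its performance with the other models in the following notebooks.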