# Neural Networks - Project 4
# Recurrent Models
---------------------
### Luis Filipe Menezes
#### RA: 164924

## 1. Objectives:
This notebook is the third deliverable for the Neural Networks course, taken in the Graduate Program in Computer Science during my master's degree.

The project's objective is to:

- Implement an LSTM or GRU model for one of the tasks below:
1. Time series classification. The model receives a time window (any type of data) and classifies the content of the window.
2. Forecasting. Train a model to predict the value of a variable at time t+k. The model receives the time series data (the instants before t; check the window size) and predicts a future value, where k is the prediction distance. For example, we can feed a model with data from a given company (e.g. PETR3) and try to predict the stock price 5 days ahead (k = 5).
3. Recurrent autoencoder. The model maps the time series onto itself. The goal is to evaluate how the data is represented in the latent space.

For this project, task 2 was chosen: forecasting room occupancy.
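
As a rough illustration of the forecasting setup in task 2, the sketch below builds (window, target) pairs from a toy series: each input holds `window` consecutive values ending at time t, and the target is the value at t+k. The function name `make_windows` and the toy data are assumptions made for illustration only.

```python
import numpy as np

def make_windows(series, window, k):
    """Return (inputs, targets) where inputs[i] = series[i:i + window]
    and targets[i] is the value k steps after the window's last element."""
    series = np.asarray(series)
    n = len(series) - window - k + 1  # number of complete pairs
    if n <= 0:
        raise ValueError("series too short for this window and horizon")
    inputs = np.stack([series[i:i + window] for i in range(n)])
    first_target = window - 1 + k  # index of t+k for the first window
    targets = series[first_target:first_target + n]
    return inputs, targets

toy = np.arange(10)
inputs, targets = make_windows(toy, window=3, k=2)
print(inputs[0], targets[0])  # [0 1 2] 4
```

With `window=3` and `k=2`, the first window ends at index 2, so its target is the value at index 4.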

# 1. Data acquisition
We use the [Room Occupancy Estimation](https://archive.ics.uci.edu/dataset/864/room+occupancy+estimation) dataset, which records the exact number of occupants in a room using multiple non-intrusive environmental sensors such as temperature, light, sound, CO2 and PIR.



In [4]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
room_occupancy_estimation = fetch_ucirepo(id=864) 
  
# data (as pandas dataframes) 
X = room_occupancy_estimation.data.features 
y = room_occupancy_estimation.data.targets 
  
# metadata 
# print(room_occupancy_estimation.metadata) 
  
# variable information 
print(room_occupancy_estimation.variables) 


{'uci_id': 864, 'name': 'Room Occupancy Estimation', 'repository_url': 'https://archive.ics.uci.edu/dataset/864/room+occupancy+estimation', 'data_url': 'https://archive.ics.uci.edu/static/public/864/data.csv', 'abstract': 'Data set for estimating the precise number of occupants in a room using multiple non-intrusive environmental sensors like temperature, light, sound, CO2 and PIR.', 'area': 'Computer Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate', 'Time-Series'], 'num_instances': 10129, 'num_features': 18, 'feature_types': ['Real'], 'demographics': [], 'target_col': ['Room_Occupancy_Count'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2018, 'last_updated': 'Wed Aug 16 2023', 'dataset_doi': '10.24432/C5P605', 'creators': ['Adarsh Pal Singh', 'Sachin Chaudhari'], 'intro_paper': {'ID': 275, 'type': 'NATIVE', 'title': 'Machine Learning-Based Occupancy Estimation Using Multivariate Sensor Nodes', 'auth

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(24)
torch.manual_seed(24)

<torch._C.Generator at 0x71d142d195d0>

In [24]:
print("Dataset shape:", X.shape)
print("\nFeature columns:")
print(X.columns.tolist())
print("\nTarget shape:", y.shape)
print("\nTarget column:")
print(y.columns.tolist())

# Basic statistics
print("\nDataset info:")
print(f"Features: {X.shape[1]}")
print(f"Samples: {X.shape[0]}")
print(f"Missing values in X: {X.isnull().sum().sum()}")
print(f"Missing values in y: {y.isnull().sum().sum()}")

# Display first few rows
print("\nFirst 5 rows of features:")
display(X.head())
print("\nFirst 5 rows of target:")
display(y.head())

Dataset shape: (10129, 18)

Feature columns:
['Date', 'Time', 'S1_Temp', 'S2_Temp', 'S3_Temp', 'S4_Temp', 'S1_Light', 'S2_Light', 'S3_Light', 'S4_Light', 'S1_Sound', 'S2_Sound', 'S3_Sound', 'S4_Sound', 'S5_CO2', 'S5_CO2_Slope', 'S6_PIR', 'S7_PIR']

Target shape: (10129, 1)

Target column:
['Room_Occupancy_Count']

Dataset info:
Features: 18
Samples: 10129
Missing values in X: 0
Missing values in y: 0

First 5 rows of features:


Unnamed: 0,Date,Time,S1_Temp,S2_Temp,S3_Temp,S4_Temp,S1_Light,S2_Light,S3_Light,S4_Light,S1_Sound,S2_Sound,S3_Sound,S4_Sound,S5_CO2,S5_CO2_Slope,S6_PIR,S7_PIR
0,2017/12/22,10:49:41,24.94,24.75,24.56,25.38,121,34,53,40,0.08,0.19,0.06,0.06,390,0.769231,0,0
1,2017/12/22,10:50:12,24.94,24.75,24.56,25.44,121,33,53,40,0.93,0.05,0.06,0.06,390,0.646154,0,0
2,2017/12/22,10:50:42,25.0,24.75,24.5,25.44,121,34,53,40,0.43,0.11,0.08,0.06,390,0.519231,0,0
3,2017/12/22,10:51:13,25.0,24.75,24.56,25.44,121,34,53,40,0.41,0.1,0.1,0.09,390,0.388462,0,0
4,2017/12/22,10:51:44,25.0,24.75,24.56,25.44,121,34,54,40,0.18,0.06,0.06,0.06,390,0.253846,0,0



First 5 rows of target:


Unnamed: 0,Room_Occupancy_Count
0,1
1,1
2,1
3,1
4,1
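
The preview above shows `Date` and `Time` stored as separate string columns. One possible way to handle them, sketched here under the assumption that the formats are `%Y/%m/%d` and `%H:%M:%S` as in the rows shown, is to join them into a single timestamp before extracting temporal features. Parsed on its own, a date-only string carries no time of day and a time-only string carries no calendar date. The `timestamp` column name is hypothetical.

```python
import pandas as pd

# Two sample rows copied from the preview above
sample = pd.DataFrame({
    "Date": ["2017/12/22", "2017/12/22"],
    "Time": ["10:49:41", "10:50:12"],
})

# Join the two strings and parse them with an explicit format
sample["timestamp"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"],
    format="%Y/%m/%d %H:%M:%S",
)

# Temporal features now come from one consistent timestamp
sample["hour"] = sample["timestamp"].dt.hour
sample["dayofweek"] = sample["timestamp"].dt.dayofweek  # Monday = 0
print(sample[["timestamp", "hour", "dayofweek"]])
```

A combined timestamp also makes it possible to check the sampling interval directly, for example with `sample["timestamp"].diff()`.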


In [25]:
# Enhanced TimeSeriesPreprocessor that handles all data types
class TimeSeriesPreprocessor:
    """Enhanced preprocessor for time series data with sliding windows"""
    
    def __init__(self, sequence_length=24, prediction_horizon=1, test_size=0.2):
        self.sequence_length = sequence_length
        self.prediction_horizon = prediction_horizon
        self.test_size = test_size
        self.feature_scaler = StandardScaler()
        self.target_scaler = MinMaxScaler()
        
    def preprocess_features(self, X):
        """Preprocess features to handle different data types"""
        print("Preprocessing features...")
        
        X_processed = X.copy()
        
        # Check for non-numeric columns
        non_numeric_cols = X_processed.select_dtypes(include=['object']).columns
        if len(non_numeric_cols) > 0:
            print(f"Found non-numeric columns: {non_numeric_cols.tolist()}")
            
            for col in non_numeric_cols:
                print(f"Processing non-numeric column: {col}")
                
                # Try to convert to datetime first
                try:
                    datetime_series = pd.to_datetime(X_processed[col])
                    print(f"  Converting {col} to datetime features")
                    
                    # Extract datetime features
                    X_processed[f'{col}_year'] = datetime_series.dt.year
                    X_processed[f'{col}_month'] = datetime_series.dt.month
                    X_processed[f'{col}_day'] = datetime_series.dt.day
                    X_processed[f'{col}_hour'] = datetime_series.dt.hour
                    X_processed[f'{col}_minute'] = datetime_series.dt.minute
                    X_processed[f'{col}_dayofweek'] = datetime_series.dt.dayofweek
                    
                    # Cyclical encoding
                    X_processed[f'{col}_hour_sin'] = np.sin(2 * np.pi * datetime_series.dt.hour / 24)
                    X_processed[f'{col}_hour_cos'] = np.cos(2 * np.pi * datetime_series.dt.hour / 24)
                    X_processed[f'{col}_month_sin'] = np.sin(2 * np.pi * datetime_series.dt.month / 12)
                    X_processed[f'{col}_month_cos'] = np.cos(2 * np.pi * datetime_series.dt.month / 12)
                    X_processed[f'{col}_dayofweek_sin'] = np.sin(2 * np.pi * datetime_series.dt.dayofweek / 7)
                    X_processed[f'{col}_dayofweek_cos'] = np.cos(2 * np.pi * datetime_series.dt.dayofweek / 7)
                    
                    # Drop original column
                    X_processed = X_processed.drop(columns=[col])
                    
                except ValueError:
                    print(f"  Could not convert {col} to datetime, trying categorical encoding")
                    
                    # Try categorical encoding
                    unique_values = X_processed[col].nunique()
                    if unique_values < 50:  # Arbitrary threshold for categorical
                        # Label encoding for categorical variables
                        from sklearn.preprocessing import LabelEncoder
                        le = LabelEncoder()
                        X_processed[f'{col}_encoded'] = le.fit_transform(X_processed[col].astype(str))
                        X_processed = X_processed.drop(columns=[col])
                    else:
                        print(f"  Dropping {col} - too many unique values or unsupported type")
                        X_processed = X_processed.drop(columns=[col])
        
        # Ensure all remaining columns are numeric
        X_processed = X_processed.select_dtypes(include=[np.number])
        
        print(f"Final processed features: {X_processed.shape[1]} columns")
        print(f"Feature names: {X_processed.columns.tolist()}")
        
        return X_processed
        
    def create_sequences(self, features, targets):
        """Create sliding window sequences"""
        X_seq, y_seq = [], []
        
        for i in range(len(features) - self.sequence_length - self.prediction_horizon + 1):
            # Input sequence
            X_seq.append(features[i:i + self.sequence_length])
            # Target (predict t+prediction_horizon)
            y_seq.append(targets[i + self.sequence_length + self.prediction_horizon - 1])
            
        return np.array(X_seq), np.array(y_seq)
    
    def fit_transform(self, X, y):
        """Fit scalers and transform data"""
        print("Preprocessing time series data...")
        
        # Preprocess features to handle different data types
        X_processed = self.preprocess_features(X)
        
        # Handle different target formats
        if isinstance(y, pd.DataFrame):
            if y.shape[1] == 1:
                target_values = y.iloc[:, 0].values
            else:
                print(f"Warning: Multiple target columns found: {y.columns.tolist()}")
                print("Using first column as target")
                target_values = y.iloc[:, 0].values
        elif isinstance(y, pd.Series):
            target_values = y.values
        else:
            target_values = y
        
        print(f"Target variable shape: {target_values.shape}")
        print(f"Target variable range: {target_values.min():.2f} to {target_values.max():.2f}")
        
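        # Scale features and targets.
        # NOTE: both scalers are fit on the full series, before the temporal
        # split below, so test-set statistics influence the scaled training data.
        X_scaled = self.feature_scaler.fit_transform(X_processed)
        y_scaled = self.target_scaler.fit_transform(target_values.reshape(-1, 1)).flatten()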
        
        # Create sequences
        X_seq, y_seq = self.create_sequences(X_scaled, y_scaled)
        
        # Train-test split (temporal split)
        split_idx = int(len(X_seq) * (1 - self.test_size))
        
        X_train, X_test = X_seq[:split_idx], X_seq[split_idx:]
        y_train, y_test = y_seq[:split_idx], y_seq[split_idx:]
        
        print(f"Training sequences: {X_train.shape}")
        print(f"Test sequences: {X_test.shape}")
        print(f"Sequence length: {self.sequence_length}")
        print(f"Prediction horizon: {self.prediction_horizon}")
        
        return (X_train, X_test, y_train, y_test)
    
    def inverse_transform_target(self, y_scaled):
        """Convert scaled targets back to original scale"""
        return self.target_scaler.inverse_transform(y_scaled.reshape(-1, 1)).flatten()
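
The sine/cosine encoding used in `preprocess_features` maps a periodic quantity onto the unit circle, so values on either side of the wrap-around point stay close together. A small check of that property for the hour of day is sketched below; the helper name `encode_cyclical` is chosen only for illustration and is not part of the class above.

```python
import numpy as np

def encode_cyclical(values, period):
    """Map periodic values to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

hours = encode_cyclical([23, 0, 12], period=24)

# 23h and 0h are neighbours on the circle; 0h and 12h are diametrically
# opposite, which gives the largest possible distance (2.0).
d_wrap = np.linalg.norm(hours[0] - hours[1])
d_opposite = np.linalg.norm(hours[1] - hours[2])
print(d_wrap, d_opposite)
```

With the raw integer hour, 23 and 0 would be 23 units apart even though they are one hour apart in practice.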

In [26]:
# Enhanced preprocessing to handle datetime columns
def preprocess_datetime_features(X, y):
    """Preprocess datetime columns and extract temporal features"""
    
    print("=== DATETIME PREPROCESSING ===")
    
    # Make a copy to avoid modifying original data
    X_processed = X.copy()
    
    # Identify datetime columns
    datetime_columns = []
    for col in X_processed.columns:
        if X_processed[col].dtype == 'object':
            # Try to convert to datetime
            try:
                pd.to_datetime(X_processed[col].head())
                datetime_columns.append(col)
                print(f"Found datetime column: {col}")
            except (ValueError, TypeError):
                print(f"Column {col} is object type but not datetime")
    
    # Process each datetime column
    for col in datetime_columns:
        print(f"Processing datetime column: {col}")
        
        # Convert to datetime
        X_processed[col] = pd.to_datetime(X_processed[col])
        
        # Extract temporal features
        X_processed[f'{col}_year'] = X_processed[col].dt.year
        X_processed[f'{col}_month'] = X_processed[col].dt.month
        X_processed[f'{col}_day'] = X_processed[col].dt.day
        X_processed[f'{col}_hour'] = X_processed[col].dt.hour
        X_processed[f'{col}_minute'] = X_processed[col].dt.minute
        X_processed[f'{col}_dayofweek'] = X_processed[col].dt.dayofweek
        X_processed[f'{col}_dayofyear'] = X_processed[col].dt.dayofyear
        
        # Cyclical encoding for periodic features
        X_processed[f'{col}_hour_sin'] = np.sin(2 * np.pi * X_processed[f'{col}_hour'] / 24)
        X_processed[f'{col}_hour_cos'] = np.cos(2 * np.pi * X_processed[f'{col}_hour'] / 24)
        X_processed[f'{col}_month_sin'] = np.sin(2 * np.pi * X_processed[f'{col}_month'] / 12)
        X_processed[f'{col}_month_cos'] = np.cos(2 * np.pi * X_processed[f'{col}_month'] / 12)
        X_processed[f'{col}_dayofweek_sin'] = np.sin(2 * np.pi * X_processed[f'{col}_dayofweek'] / 7)
        X_processed[f'{col}_dayofweek_cos'] = np.cos(2 * np.pi * X_processed[f'{col}_dayofweek'] / 7)
        
        # Remove original datetime column
        X_processed = X_processed.drop(columns=[col])
    
    print(f"Original features: {len(X.columns)}")
    print(f"Processed features: {len(X_processed.columns)}")
    print(f"New feature columns: {X_processed.columns.tolist()}")
    
    return X_processed

# Apply datetime preprocessing
X_processed = preprocess_datetime_features(X, y)

# Debug the processed data
X_debug, y_debug = debug_and_preprocess_data(X_processed, y)

# Then preprocess with the fixed class
preprocessor = TimeSeriesPreprocessor(
    sequence_length=24,
    prediction_horizon=1,
    test_size=0.2
)

# Now preprocess with the debugged target
X_train, X_test, y_train, y_test = preprocessor.fit_transform(X_debug, y_debug)

=== DATETIME PREPROCESSING ===
Found datetime column: Date
Found datetime column: Time
Processing datetime column: Date
Processing datetime column: Time
Original features: 18
Processed features: 42
New feature columns: ['S1_Temp', 'S2_Temp', 'S3_Temp', 'S4_Temp', 'S1_Light', 'S2_Light', 'S3_Light', 'S4_Light', 'S1_Sound', 'S2_Sound', 'S3_Sound', 'S4_Sound', 'S5_CO2', 'S5_CO2_Slope', 'S6_PIR', 'S7_PIR', 'Date_year', 'Date_month', 'Date_day', 'Date_hour', 'Date_minute', 'Date_dayofweek', 'Date_dayofyear', 'Date_hour_sin', 'Date_hour_cos', 'Date_month_sin', 'Date_month_cos', 'Date_dayofweek_sin', 'Date_dayofweek_cos', 'Time_year', 'Time_month', 'Time_day', 'Time_hour', 'Time_minute', 'Time_dayofweek', 'Time_dayofyear', 'Time_hour_sin', 'Time_hour_cos', 'Time_month_sin', 'Time_month_cos', 'Time_dayofweek_sin', 'Time_dayofweek_cos']
=== DATA DEBUGGING ===
Features (X) shape: (10129, 42)
Features columns: ['S1_Temp', 'S2_Temp', 'S3_Temp', 'S4_Temp', 'S1_Light', 'S2_Light', 'S3_Light', 'S4_Li