# Problem

Select a time series dataset from Kaggle and include the link to the dataset in your notebook.

Develop a forecasting model for this time series using Sktime or Prophet.

Develop a forecasting model for this time series using an LSTM.

Report an error metric obtained when comparing the results, and justify the choice of this metric (citing a reference).

Submit the link to the GitHub repository containing the IPYNB file, with access granted to the professor.

# Setting up Colab

The Prophet library was chosen for this Colab notebook. Prophet is a library created by Facebook for time series forecasting. It is good at identifying trends and subtle effects tied to specific periods, such as holidays within a time series.

For comparison, an LSTM model was also implemented.

In [1]:
!pip install prophet -q
!pip install keras -q
!pip install tensorflow -q

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from prophet import Prophet
from google.colab import drive
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, Input

# Preparing the Data

## Loading the Dataset

Chosen dataset: Rain in Australia: https://www.kaggle.com/datasets/jsphyg/weather-dataset-rattle-package

The dataset contains about 10 years of daily weather observations from different Australian cities, with records from late 2008 to 2017.

The target for both models in this benchmark is _RainTomorrow_, the binary indicator of whether it will rain on the following day in each city.

In [4]:
data = pd.read_csv('/content/drive/MyDrive/poderada/weatherAUS.csv')

In [5]:
data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


## Data for Prophet

Creating a copy of the table so the original is not modified

In [6]:
data_prophet = data.copy()

Dropping nulls and resetting the index, to avoid problems when passing the data to the model

In [7]:
data_prophet = data_prophet.dropna().reset_index(drop=True)

Converting RainToday and RainTomorrow to binary integers (0 or 1) instead of the 'Yes'/'No' strings present in the table

In [8]:
data_prophet['RainToday'] = data_prophet['RainToday'].map({'Yes': 1, 'No': 0})
data_prophet['RainTomorrow'] = data_prophet['RainTomorrow'].map({'Yes': 1, 'No': 0})
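
As a minimal sketch of the mapping step, using a small made-up Series rather than the real dataset: `Series.map` with a dict converts each label, and any value missing from the dict (including nulls) becomes NaN, so dropping nulls first keeps the column fully populated.

```python
import pandas as pd

# Toy labels (hypothetical values, not taken from weatherAUS.csv)
labels = pd.Series(['Yes', 'No', 'No', None])

# Values absent from the dict (here None) become NaN
encoded = labels.map({'Yes': 1, 'No': 0})

print(encoded.tolist())
```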

Renaming the date and target columns to ds and y, the names required by Prophet

In [9]:
data_prophet['ds'] = pd.to_datetime(data_prophet['Date'])
data_prophet['y'] = data_prophet['RainTomorrow']

Removing the old Date column

In [10]:
data_prophet = data_prophet.drop(columns=['Date'])
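
The three cells above can be sketched on a toy frame (the values below are made up for illustration): Prophet reads its timestamps from a datetime column named `ds` and its target from a numeric column named `y`.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the Date and RainTomorrow columns
df = pd.DataFrame({'Date': ['2016-06-25', '2016-06-26'], 'RainTomorrow': [0, 1]})

# Prophet looks for a datetime column 'ds' and a numeric column 'y'
df['ds'] = pd.to_datetime(df['Date'])
df['y'] = df['RainTomorrow']
df = df.drop(columns=['Date'])

print(df.dtypes)
```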

One-hot encoding the cities (Location) and the wind gust direction (WindGustDir)

In [11]:
location_dummies = pd.get_dummies(data_prophet['Location'], prefix='Location')
data_prophet = pd.concat([data_prophet, location_dummies], axis=1)

In [12]:
wind_gust_dir_dummies = pd.get_dummies(data_prophet['WindGustDir'], prefix='WindGustDir', drop_first=True)
data_prophet = pd.concat([data_prophet, wind_gust_dir_dummies], axis=1)
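
The difference between the two encoding calls above is the `drop_first` flag. A small sketch on made-up directions shows its effect: with `drop_first=True` the first category is omitted and is implied whenever all remaining indicator columns are false.

```python
import pandas as pd

# Toy frame (hypothetical values) with one categorical column
df = pd.DataFrame({'WindGustDir': ['N', 'S', 'W', 'N']})

# One indicator column per category
full = pd.get_dummies(df['WindGustDir'], prefix='WindGustDir')

# drop_first=True omits the first category in sorted order (here 'N')
reduced = pd.get_dummies(df['WindGustDir'], prefix='WindGustDir', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```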

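Holding out the final year of data for testing. The cutoff is the latest date minus one year, so the test set spans the last 12 months of records rather than only the calendar year 2017.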

In [42]:
y_2017 = data_prophet['ds'].max() - pd.DateOffset(years=1)
train_data = data_prophet[data_prophet['ds'] < y_2017]
test_data = data_prophet[data_prophet['ds'] >= y_2017]
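
The split above can be sketched on a toy daily series (the date range is hypothetical). The `.copy()` calls are an addition in this sketch: they make each split an independent DataFrame, so later column assignments do not trigger pandas' copy-of-a-slice warning.

```python
import pandas as pd

# Toy daily series spanning two years (hypothetical dates)
df = pd.DataFrame({'ds': pd.date_range('2015-06-25', '2017-06-25', freq='D')})

# Cutoff = last date minus one year; rows on or after it form the test set
cutoff = df['ds'].max() - pd.DateOffset(years=1)
train = df[df['ds'] < cutoff].copy()
test = df[df['ds'] >= cutoff].copy()

print(cutoff.date(), len(train), len(test))
```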

Creating a list with all the features

In [14]:
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'RainToday'] + \
               list(location_dummies.columns) + list(wind_gust_dir_dummies.columns)

Scaling the training features. Only the training split is standardized in this cell; the test split is later passed to Prophet unscaled, which may explain part of the low accuracy reported below.

In [15]:
scaler = StandardScaler()
train_data = train_data.copy()  # independent copy, avoids SettingWithCopyWarning
train_data[numeric_cols] = scaler.fit_transform(train_data[numeric_cols])


In [16]:
data_prophet.head()

Unnamed: 0,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,WindGustDir_NNW,WindGustDir_NW,WindGustDir_S,WindGustDir_SE,WindGustDir_SSE,WindGustDir_SSW,WindGustDir_SW,WindGustDir_W,WindGustDir_WNW,WindGustDir_WSW
0,Cobar,17.9,35.2,0.0,12.0,12.3,SSW,48.0,ENE,SW,...,False,False,False,False,False,True,False,False,False,False
1,Cobar,18.4,28.9,0.0,14.8,13.0,S,37.0,SSE,SSE,...,False,False,True,False,False,False,False,False,False,False
2,Cobar,19.4,37.6,0.0,10.8,10.6,NNE,46.0,NNE,NNW,...,False,False,False,False,False,False,False,False,False,False
3,Cobar,21.9,38.4,0.0,11.4,12.2,WNW,31.0,WNW,WSW,...,False,False,False,False,False,False,False,False,True,False
4,Cobar,24.2,41.0,0.0,11.2,8.4,WNW,35.0,NW,WNW,...,False,False,False,False,False,False,False,False,True,False
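
A common pattern for the scaling step, sketched here on made-up arrays rather than the notebook's data, is to fit the scaler on the training split only and then reuse the same fitted scaler on the test split, so both are expressed on the same scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (hypothetical values)
X_train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
X_test = np.array([[20.0, 3.0], [40.0, 7.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std

print(X_train_scaled.mean(axis=0))
print(X_test_scaled[0])
```

In this toy example the first test row equals the training mean, so it maps to zeros after the transform.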


## Data for LSTM

Creating a copy of the table so the original is not modified

In [17]:
data_LSTM = data.copy()

Dropping nulls and resetting the index, to avoid problems when passing the data to the neural network

In [18]:
data_LSTM = data_LSTM.dropna().reset_index(drop=True)

Converting RainToday and RainTomorrow to binary integers (0 or 1) instead of the 'Yes'/'No' strings present in the table

In [19]:
data_LSTM['RainToday'] = data_LSTM['RainToday'].map({'Yes': 1, 'No': 0})
data_LSTM['RainTomorrow'] = data_LSTM['RainTomorrow'].map({'Yes': 1, 'No': 0})

One-hot encoding Location

In [20]:
data_LSTM = pd.get_dummies(data_LSTM, columns=['Location'], drop_first=True)

Removing columns that will not be used by the LSTM model

In [21]:
columns_to_drop = ['Date', 'RainTomorrow', 'WindGustDir', 'WindDir9am', 'WindDir3pm']
features = data_LSTM.drop(columns=columns_to_drop)
target = data_LSTM['RainTomorrow']

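Scaling the features and splitting them into training and test sets. Note that `train_test_split` shuffles rows by default, so this split is random rather than chronological.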

In [22]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features)

In [23]:
X = X_scaled
y = target.values

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [25]:
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))
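
The reshape in the last cell can be illustrated with a small NumPy array (toy values). Keras LSTM layers expect input shaped `(samples, timesteps, features)`; with `timesteps=1`, each row is treated as a sequence of length one, so the network sees no observations from previous days.

```python
import numpy as np

# Toy scaled feature matrix: 4 samples, 3 features (hypothetical values)
X = np.arange(12, dtype=float).reshape(4, 3)

# Insert a timesteps axis of length 1: (samples, features) -> (samples, 1, features)
X_seq = X.reshape((X.shape[0], 1, X.shape[1]))

print(X_seq.shape)
```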

# Prophet

Instantiating the Prophet model

In [26]:
model = Prophet()

Adding each item of the feature list as a regressor of the model

In [27]:
for col in numeric_cols:
    model.add_regressor(col)

Training the model with the features

In [28]:
model.fit(train_data[['ds', 'y'] + numeric_cols])

INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu2zlhcz_/hxf0yr62.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpu2zlhcz_/5ijvcr2a.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=42751', 'data', 'file=/tmp/tmpu2zlhcz_/hxf0yr62.json', 'init=/tmp/tmpu2zlhcz_/5ijvcr2a.json', 'output', 'file=/tmp/tmpu2zlhcz_/prophet_modelzs5w1_6l/prophet_model-20241007080741.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
08:07:41 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
08:08:10 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing


<prophet.forecaster.Prophet at 0x7dd695012980>

Creating the test DataFrame

In [29]:
future = test_data[['ds'] + numeric_cols]

Making predictions

In [30]:
forecast = model.predict(future)

In [31]:
test_data = test_data.copy()  # independent copy, avoids SettingWithCopyWarning
test_data['predicted'] = forecast['yhat'].values


In [32]:
comparison = test_data[['ds', 'RainTomorrow', 'predicted']].copy()
# Threshold the continuous forecast at 0.5 to obtain a binary label
comparison['predicted'] = (comparison['predicted'] > 0.5).astype(int)

accuracy = accuracy_score(comparison['RainTomorrow'], comparison['predicted'])
print(f"Accuracy: {accuracy:.4f}")

print(comparison.head())

Accuracy: 0.2556
             ds  RainTomorrow  predicted
7638 2016-06-25             0          1
7639 2016-06-26             0          1
7640 2016-06-27             1          1
7641 2016-06-28             0          1
7642 2016-06-29             0          1


# LSTM

Instantiating the model

In [33]:
model = Sequential()

Adding the required layers

In [34]:
model.add(Input(shape=(X_train.shape[1], X_train.shape[2])))
model.add(LSTM(50, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

Compiling the model

In [35]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
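
To make the `binary_crossentropy` loss named in `compile` concrete, here is a small NumPy sketch of the formula on made-up probabilities. The helper name and the `eps` clipping value are illustrative choices for this sketch, not a description of Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

# Toy labels and predicted probabilities (hypothetical values)
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.6, 0.4])

print(binary_crossentropy(y_true, y_prob))
```

Confident correct predictions (0.9 for a positive, 0.1 for a negative) contribute less to the loss than hesitant ones (0.6 and 0.4).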

Training

In [38]:
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/5
1411/1411 ━━━━━━━━━━━━━━━━━━━━ 9s 6ms/step - accuracy: 0.8801 - loss: 0.2721 - val_accuracy: 0.8642 - val_loss: 0.3138
Epoch 2/5
1411/1411 ━━━━━━━━━━━━━━━━━━━━ 11s 7ms/step - accuracy: 0.8817 - loss: 0.2738 - val_accuracy: 0.8657 - val_loss: 0.3147
Epoch 3/5
1411/1411 ━━━━━━━━━━━━━━━━━━━━ 15s 3ms/step - accuracy: 0.8830 - loss: 0.2726 - val_accuracy: 0.8668 - val_loss: 0.3155
Epoch 4/5
1411/1411 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.8810 - loss: 0.2735 - val_accuracy: 0.8665 - val_loss: 0.3143
Epoch 5/5
1411/1411 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.8797 - loss: 0.2751 - val_accuracy: 0.8658 - val_loss: 0.3161


<keras.src.callbacks.history.History at 0x7dd6948565f0>

In [39]:
loss, accuracy = model.evaluate(X_test, y_test)

353/353 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.8681 - loss: 0.3159


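# Final Remarks

For this comparison, the mean squared error (MSE) was used: the average of the squared differences between predicted and true values. With 0/1 labels and 0/1 predictions, each squared difference is 1 for a wrong prediction and 0 for a correct one, so the MSE equals the fraction of misclassified samples.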

https://www.britannica.com/science/homeostasis

Prophet results

In [40]:
mse = mean_squared_error(comparison['RainTomorrow'], comparison['predicted'])
print(f"Mean Squared Error: {mse:.4f}")

Mean Squared Error: 0.7444
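
The relationship between the MSE and the accuracy reported for Prophet can be checked on a small example. The sketch below uses made-up labels and scores, not the notebook's predictions: once scores are thresholded at 0.5, the MSE is exactly 1 minus the accuracy, which is consistent with the values above (0.7444 = 1 - 0.2556).

```python
import numpy as np

# Toy ground truth and raw model scores (hypothetical values)
y_true = np.array([0, 0, 1, 0, 1])
y_score = np.array([0.7, 0.2, 0.9, 0.6, 0.3])

# Same rule as the notebook: scores above 0.5 become class 1
y_pred = (y_score > 0.5).astype(int)

accuracy = np.mean(y_true == y_pred)
mse = np.mean((y_true - y_pred) ** 2)

print(accuracy, mse)
```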


LSTM results

In [41]:
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.87
