# Time Series Forecasting and Parameter Selection Using Cross-Validation

In class we worked through a forecasting exercise on a time series of electricity consumption, using the [OPSD](https://open-power-system-data.org/) dataset.

We also saw that the parameters of our Ridge model can be estimated automatically with the [RidgeCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html) class. In that case we need to set the cv parameter and, if we do not want to be limited to the RidgeCV defaults, the hyperparameter values to be evaluated.
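As a quick illustration of that idea, the sketch below fits RidgeCV on synthetic data with an explicit alpha grid and a time-aware splitter. The data, the grid (`alphas`) and the number of splits are assumptions chosen only for the example; they are not the values used in the exercise.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import TimeSeriesSplit

# Synthetic regression data (illustrative only, not the OPSD dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([1.5, -2.0, 0.0, 0.5, 3.0])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# Candidate regularization strengths: an assumed log-spaced grid
alphas = np.logspace(-3, 3, 13)

# cv controls how the data is split when each alpha is evaluated
cv = TimeSeriesSplit(n_splits=5)

model = RidgeCV(alphas=alphas, cv=cv, scoring='r2')
model.fit(X, y)

print('selected alpha:', model.alpha_)
print('mean CV r2 for that alpha:', model.best_score_)
```

After fitting, `alpha_` holds the value chosen from the grid and `best_score_` the cross-validated score it obtained.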

**QUESTION 01:** Train a model with RidgeCV to predict energy consumption in 2017, training on data up to 2016. The initial feature preparation is already done; what remains is to define the model (see the comments in the code). Hint: explore the algorithm's hyperparameters, as well as the cv and scoring parameters of RidgeCV.

**QUESTION 02:** Did you reach an r2 above 0.9? What was your best r2? Submit the solution that produced the best r2.

**QUESTION 03:** Which alpha gave the best result?

## All Imports

In [32]:
import requests
from io import StringIO

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score

from sklearn.preprocessing import StandardScaler


## Loading the Data

### Setting up Drive access

In [33]:
orig_url='https://drive.google.com/file/d/1fZAUBMt94Q4zEEpkgwNgcM4JbKt1gaGG/view?usp=sharing'
file_id = orig_url.split('/')[-2]
dwn_url='https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(dwn_url).text
csv_raw = StringIO(url)

#### Reading the dataset and adjusting the index

In [34]:
df = pd.read_csv(csv_raw, sep=",", index_col=0)
df.index = pd.to_datetime(df.index)

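## Building a new dataframe focused on consumption

The features are built from the previous 20 days: the lagged values Day1 to Day19 plus their day-over-day differences, so DiffDay19 reaches back to the 20th previous day.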

In [35]:
df_consumption = df[['Consumption']].copy() # new df with only the Consumption column

# first previous day
df_consumption.loc[:,'Day1'] = df_consumption.loc[:,'Consumption'].shift() # consumption on the previous day
df_consumption.loc[:,'DiffDay1'] = df_consumption.loc[:,'Day1'].diff() # previous day minus the day before it

# remaining previous days
for i in range(2,20):
  df_consumption.loc[:,'Day'+str(i)] = df_consumption.loc[:,'Day'+str(i-1)].shift() # consumption i days earlier
  df_consumption.loc[:,'DiffDay'+str(i)] = df_consumption.loc[:,'Day'+str(i)].diff() # value i days earlier minus value i+1 days earlier

df_consumption = df_consumption.dropna() # drop rows with missing values (the first 20 days)
df_consumption.head()


| Date | Consumption | Day1 | DiffDay1 | Day2 | DiffDay2 | Day3 | DiffDay3 | Day4 | DiffDay4 | Day5 | ... | Day15 | DiffDay15 | Day16 | DiffDay16 | Day17 | DiffDay17 | Day18 | DiffDay18 | Day19 | DiffDay19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2006-01-21 | 1348.188 | 1540.604 | -17.686 | 1558.29 | -14.694 | 1572.984 | -25.319 | 1598.303 | 34.565 | 1563.738 | ... | 1403.427 | -73.704 | 1477.131 | 19.914 | 1457.217 | 14.684 | 1442.533 | 62.012 | 1380.521 | 311.337 |
| 2006-01-22 | 1248.111 | 1348.188 | -192.416 | 1540.604 | -17.686 | 1558.29 | -14.694 | 1572.984 | -25.319 | 1598.303 | ... | 1300.287 | -103.14 | 1403.427 | -73.704 | 1477.131 | 19.914 | 1457.217 | 14.684 | 1442.533 | 62.012 |
| 2006-01-23 | 1569.691 | 1248.111 | -100.077 | 1348.188 | -192.416 | 1540.604 | -17.686 | 1558.29 | -14.694 | 1572.984 | ... | 1207.985 | -92.302 | 1300.287 | -103.14 | 1403.427 | -73.704 | 1477.131 | 19.914 | 1457.217 | 14.684 |
| 2006-01-24 | 1603.252 | 1569.691 | 321.58 | 1248.111 | -100.077 | 1348.188 | -192.416 | 1540.604 | -17.686 | 1558.29 | ... | 1529.323 | 321.338 | 1207.985 | -92.302 | 1300.287 | -103.14 | 1403.427 | -73.704 | 1477.131 | 19.914 |
| 2006-01-25 | 1613.312 | 1603.252 | 33.561 | 1569.691 | 321.58 | 1248.111 | -100.077 | 1348.188 | -192.416 | 1540.604 | ... | 1576.911 | 47.588 | 1529.323 | 321.338 | 1207.985 | -92.302 | 1300.287 | -103.14 | 1403.427 | -73.704 |
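To make the effect of `shift()` and `diff()` easier to follow, here is the same construction on a tiny made-up series with only two lags. The values and dates in `toy` are invented for illustration.

```python
import pandas as pd

# Tiny invented series, just to trace the lag features by hand
toy = pd.DataFrame(
    {'Consumption': [10.0, 12.0, 11.0, 15.0, 14.0]},
    index=pd.date_range('2006-01-01', periods=5, freq='D'),
)

toy['Day1'] = toy['Consumption'].shift()  # value one day earlier
toy['DiffDay1'] = toy['Day1'].diff()      # Day1 minus the value two days earlier
toy['Day2'] = toy['Day1'].shift()         # value two days earlier
toy['DiffDay2'] = toy['Day2'].diff()      # Day2 minus the value three days earlier

# Rows without a full history contain NaN and are dropped
toy_clean = toy.dropna()
print(toy_clean)
```

With two lags, the first three rows are dropped because DiffDay2 needs the value from three days earlier. By the same logic, the 19 lags used in the notebook drop the first 20 rows.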


## Splitting into training and test sets

In [36]:
X_train = df_consumption.loc[:'2016'].drop(['Consumption'], axis=1)
y_train = df_consumption.loc[:'2016', 'Consumption']
X_test = df_consumption.loc['2017'].drop(['Consumption'], axis=1)
y_test = df_consumption.loc['2017', 'Consumption']
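The split above relies on pandas partial-string indexing on a DatetimeIndex: `loc[:'2016']` keeps everything up to and including the last timestamp of 2016, and `loc['2017']` selects the whole of 2017. A minimal sketch on an invented daily series (`s` is hypothetical) shows that the two pieces do not overlap:

```python
import pandas as pd

# Invented daily series spanning the 2016/2017 boundary
idx = pd.date_range('2015-12-30', '2017-01-02', freq='D')
s = pd.Series(range(len(idx)), index=idx)

train = s.loc[:'2016']  # everything up to the end of 2016 (inclusive)
test = s.loc['2017']    # only the dates in 2017

print(train.index.max(), test.index.min())
```

Because the training set ends before the test set begins, no future information leaks into training through the split itself.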


## Building and Evaluating the Models

### Model with parameters estimated automatically via Cross-Validation (CV)

In [37]:
tscv = TimeSeriesSplit(n_splits=10)

steps = [
    ('scaler', StandardScaler()),
    ('polyfeatures', PolynomialFeatures(degree=3)),
    ('model', RidgeCV(cv=tscv, scoring='r2', alphas=(0.03, 0.05, 0.1, 0.3, 1.0, 5.0, 10.0)))
]


pipe=Pipeline(steps)

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print('r2_score: ', r2_score(y_test,y_pred))

r2_score:  0.8828290753239558
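For intuition about what `TimeSeriesSplit` does inside the cross-validation, the sketch below prints the folds it produces on a small dummy array. The array `X_demo` and `n_splits=3` are assumptions for the example. Each validation block comes strictly after its training block, which is what makes this splitter suitable for time-ordered data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Dummy time-ordered samples (12 rows, 1 feature)
X_demo = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=3)
folds = list(tscv_demo.split(X_demo))

for fold, (train_idx, val_idx) in enumerate(folds):
    # The training window grows; validation always follows it in time
    print(f'fold {fold}: train={train_idx.tolist()} val={val_idx.tolist()}')
```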


In [38]:
pipe.named_steps['model'].alpha_

np.float64(10.0)
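The selected alpha (10.0) is the largest value in the grid passed to RidgeCV, which suggests the best value may lie outside the range tested. The helper below, `alpha_on_grid_edge`, is a hypothetical convenience function (not part of scikit-learn) that flags this situation; the wider grid is only one possible choice, and whether it improves r2 has to be checked by rerunning the pipeline.

```python
import numpy as np

def alpha_on_grid_edge(best_alpha, alphas):
    """Return 'lower' or 'upper' if best_alpha sits on a grid boundary, else None."""
    alphas = np.sort(np.asarray(alphas, dtype=float))
    if np.isclose(best_alpha, alphas[0]):
        return 'lower'
    if np.isclose(best_alpha, alphas[-1]):
        return 'upper'
    return None

# Grid and selected alpha taken from the cells above
grid = (0.03, 0.05, 0.1, 0.3, 1.0, 5.0, 10.0)
edge = alpha_on_grid_edge(10.0, grid)
print('edge:', edge)

if edge == 'upper':
    # Assumed wider log-spaced grid to try next: 0.01 ... 10000
    wider_grid = np.logspace(-2, 4, 13)
    print('try alphas up to', wider_grid[-1])
```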

# For those who want to go deeper into the field

Some interesting materials on time series

[Time Series for scikit-learn - part 1](https://www.ethanrosenthal.com/2018/01/28/time-series-for-scikit-learn-people-part1/)

[Time Series for scikit-learn - part 2](https://www.ethanrosenthal.com/2018/03/22/time-series-for-scikit-learn-people-part2/)

[Time Series for scikit-learn - part 3](https://www.ethanrosenthal.com/2019/02/18/time-series-for-scikit-learn-people-part3/)

[Predicting Sales: Time-series Analysis Forecasting with Python](https://medium.com/analytics-vidhya/predicting-sales-time-series-analysis-forecasting-with-python-b81d3e8ff03f)

[The Complete Guide to Time Series Analysis and Forecasting](https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775)

[How to Develop Machine Learning Models for multivariate Time Series (Air Pollution Example)](https://machinelearningmastery.com/how-to-develop-machine-learning-models-for-multivariate-multi-step-air-pollution-time-series-forecasting/) - This example is quite nice, but unfortunately the dataset is unavailable