# INDEPENDENT PRACTICE: Introduction to Linear Regression - Solution.

## Introduction:

#### We will work with a bike rental dataset that was used in a Kaggle competition.


#### The data contain hourly rentals spanning two years. The training set covers the first 19 days of each month, and the test set runs from day 20 to the end of the month. **_We want to predict the total number of bikes rented during each hour covered by the test set, using only the information available in the training set._**
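The split rule described above (days 1 to 19 for training, day 20 onward for testing) can be sketched with a day-of-month mask. This is only an illustration on a hypothetical hourly index; the frame `demo` and its values are made up and are not the competition data.

```python
import pandas as pd

# Hypothetical hourly index covering January and February 2011 (not the real data)
idx = pd.date_range("2011-01-01", "2011-02-28 23:00", freq="h")
demo = pd.DataFrame({"total": range(len(idx))}, index=idx)

# Days 1-19 of each month go to training, day 20 onward to testing
is_train = demo.index.day <= 19
train_part = demo[is_train]
test_part = demo[~is_train]

print(len(train_part), len(test_part))  # 912 504
```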

**DATASET FIELDS**

**datetime** - hourly date + timestamp

**season** - 1 = spring, 2 = summer, 3 = fall, 4 = winter

**holiday** - whether the day is considered a holiday

**workingday** - whether the day is neither a weekend nor holiday

**weather** - 

1: Clear, Few clouds, Partly cloudy <br/>
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist <br/>
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds <br/>
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

**temp** - temperature in Celsius

**atemp** - "feels like" temperature in Celsius

**humidity** - relative humidity

**windspeed** - wind speed

**casual** - number of non-registered user rentals initiated

**registered** - number of registered user rentals initiated

**count** - number of total rentals
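The field descriptions suggest that `count` is the sum of `casual` and `registered`. A minimal check, using only the five rows displayed later in this notebook (copied by hand into a small frame named `sample`, which is an illustrative name), could look like this:

```python
import pandas as pd

# First five rows of the file, as displayed in the notebook output
sample = pd.DataFrame(
    {
        "casual": [3, 8, 5, 3, 0],
        "registered": [13, 32, 27, 10, 1],
        "count": [16, 40, 32, 13, 1],
    },
    index=pd.date_range("2011-01-01 00:00", periods=5, freq="h"),
)

# Use bracket access: `sample.count` would refer to the DataFrame method
consistent = (sample["casual"] + sample["registered"]).eq(sample["count"]).all()
print(consistent)  # True
```

If this relation holds for the whole file, `casual` and `registered` should probably not be used as predictors of the total, since together they determine it.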


#### Exercise 1: Import the `numpy` and `pandas` libraries, read the file `'bikeshare.csv'` using the `datetime` column as the index, and then rename the column `'count'` $\rightarrow$ `'total'`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

# Default figure size and font size for this notebook
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

In [2]:
dados = pd.read_csv('bikeshare.csv', index_col='datetime', parse_dates=True)
dados.head()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,count
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


In [3]:
dados.rename(columns={'count': 'total'}, inplace=True)
dados.head()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,total
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1


#### Exercise 2: As a feature engineering step, create the following columns from the hour (`hour`) of each record.

- hora (hour): as a single numeric attribute (from 0 to 23).
- hora (hour): as a categorical attribute (use 23 dummy variables).
- dia (day): as a single categorical attribute (dia = 1 from 7am to 8pm and dia = 0 otherwise).

In [4]:
# hora: as a single numeric attribute (from 0 to 23)
dados['hora'] = dados.index.hour
dados.head()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,total,hora
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,0
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,1
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,2
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,3
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,4


In [5]:
# hora: as a categorical attribute (use 23 dummy variables).
dummy_horas = pd.get_dummies(dados.index.hour, drop_first=True)
dummy_horas.index = dados.index
dummy_horas.head()

datetime,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
2011-01-01 00:00:00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2011-01-01 01:00:00,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2011-01-01 02:00:00,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2011-01-01 03:00:00,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2011-01-01 04:00:00,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
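The exercise asks for 23 dummies rather than 24 because, in a model with an intercept, the 24 indicators always sum to 1 and would be perfectly collinear with it; `drop_first=True` drops hour 0, which becomes the baseline. The sketch below illustrates this on a tiny hypothetical series. The `prefix` and `dtype=int` arguments are choices made for this example only: the notebook keeps integer column names, and recent pandas releases return booleans by default, so `dtype=int` is what reproduces the 0/1 output shown above.

```python
import pandas as pd

# Tiny hypothetical sample of hours (not the real data)
hours = pd.Series([0, 1, 2, 23, 0], name="hora")

# One column per distinct hour: every row sums to exactly 1
full = pd.get_dummies(hours, prefix="hora", dtype=int)

# Dropping the first level (hour 0) makes it the baseline: its rows are all zeros
reduced = pd.get_dummies(hours, prefix="hora", drop_first=True, dtype=int)

print(list(full.columns))     # ['hora_0', 'hora_1', 'hora_2', 'hora_23']
print(list(reduced.columns))  # ['hora_1', 'hora_2', 'hora_23']
```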


In [6]:
# dia: as a single categorical attribute (dia = 1 from 7am to 8pm and dia = 0 otherwise)
# 1 == yes, 0 == no

# Function that checks whether a given hour of the dataframe falls in the daytime
def dia_ou_nao(num):
    if 7 <= num <= 20:
        return 1
    else:
        return 0

In [11]:
# Apply dia_ou_nao to every row of our data
# and create a new column indicating whether the rentals happened during the day
dados['dia_ou_nao'] = dados['hora'].map(dia_ou_nao)
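The same indicator can also be built without a Python-level function. A minimal sketch, using a small hypothetical `hora` series in place of the notebook's column, relies on `Series.between`, whose bounds are inclusive by default and therefore match the `7 <= num <= 20` condition:

```python
import pandas as pd

# Hypothetical hours standing in for dados['hora']
hora = pd.Series([0, 6, 7, 12, 20, 21], name="hora")

# between() is inclusive on both ends by default, so 7 and 20 count as day
dia_ou_nao = hora.between(7, 20).astype(int)

print(dia_ou_nao.tolist())  # [0, 0, 1, 1, 1, 0]
```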

In [12]:
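# Note: re-running this cell appends the dummy columns again,
# which is why columns 1..23 appear twice in the output below
dados = pd.concat([dados, dummy_horas], axis=1)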
dados.head()

datetime,season,holiday,workingday,weather,temp,atemp,humidity,windspeed,casual,registered,total,hora,dia_ou_nao,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
2011-01-01 00:00:00,1,0,0,1,9.84,14.395,81,0.0,3,13,16,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2011-01-01 01:00:00,1,0,0,1,9.02,13.635,80,0.0,8,32,40,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2011-01-01 02:00:00,1,0,0,1,9.02,13.635,80,0.0,5,27,32,2,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2011-01-01 03:00:00,1,0,0,1,9.84,14.395,75,0.0,3,10,13,3,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2011-01-01 04:00:00,1,0,0,1,9.84,14.395,75,0.0,0,1,1,4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Exercise 3: Split the provided dataset into training and test subsets and compute the `RMSE` values for the following sets of attributes:

- ['temp', 'season', 'humidity']
- ['temp', 'season', 'humidity', 'dia']
- ['temp', 'season', 'humidity', 'hora']
- ['temp', 'season', 'humidity', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

#### Determine which of the models works best.
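As a reminder, the RMSE is the square root of the mean squared difference between observed and predicted values, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. A minimal hand-written version is sketched below; the helper name `rmse` and the predicted values are hypothetical, and only the observed values come from the first three rows of the data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Observed totals from the first three rows; the predictions are made up
value = rmse([16, 40, 32], [20, 36, 32])
print(value)
```

The result is expressed in the same units as the response (rentals per hour), which makes it easier to interpret than the mean squared error.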

In [9]:
def regressao_e_rsme(preditoras):
  """Split the data chronologically into train and test sets, fit a linear
  regression, and return the RMSE on the test set.

  :param preditoras: list of predictor (independent) column names in `dados`
  :return: root mean squared error of the predictions on the test set

  Uses the global dataframe `dados`, with `total` as the response variable.
  """
  x = dados[preditoras]
  y = dados['total']

  # Because the observations are ordered in time, we set `shuffle=False` so that
  # the model is trained on earlier rows and tested on later ones, which avoids
  # data leakage from the future.

  # Split the data into training and test sets
  X_treino, X_teste, y_treino, y_teste = train_test_split(x,
                                                        y,
                                                        shuffle=False
                                                       )
  linreg = LinearRegression()  # Create the LinearRegression object

  linreg.fit(X_treino, y_treino)  # Train the model

  y_pred = linreg.predict(X_teste)  # Model predictions

  rmse = np.sqrt(metrics.mean_squared_error(y_teste, y_pred))  # Root mean squared error of the model
  return rmse
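To make the effect of `shuffle=False` concrete, the sketch below performs the cut by hand on a hypothetical hourly frame. It assumes the default `test_size` of 0.25, in which case the test set should be the last quarter of the rows (rounded up); the frame `demo` and its values are illustrative only.

```python
import math
import pandas as pd

# Hypothetical frame with 100 consecutive hourly observations
idx = pd.date_range("2011-01-01", periods=100, freq="h")
demo = pd.DataFrame({"total": range(100)}, index=idx)

# Assumed default: 25% of the rows, rounded up, go to the test set
n_test = math.ceil(0.25 * len(demo))
train_part = demo.iloc[:-n_test]  # earliest rows
test_part = demo.iloc[-n_test:]   # latest rows

# Every training timestamp precedes every test timestamp
print(train_part.index.max() < test_part.index.min())  # True
```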

In [16]:
testes = (['temp', 'season', 'humidity'], 
          ['temp', 'season', 'humidity','dia_ou_nao'],
          ['temp', 'season', 'humidity','hora'],
          ['temp', 'season', 'humidity',
           1, 2, 3, 4, 
           5, 6, 7, 8, 
           9, 10, 
           11, 12, 13, 14, 
           15, 16, 17, 18, 
           19, 20, 21, 22, 
           23])


for variaveis_preditoras in testes:
  rmse = regressao_e_rsme(variaveis_preditoras)
  print(f'{variaveis_preditoras}')
  print(f'RMSE: {rmse}')
  print()

['temp', 'season', 'humidity']
RMSE: 208.60652132566102

['temp', 'season', 'humidity', 'dia_ou_nao']
RMSE: 177.27663881641254

['temp', 'season', 'humidity', 'hora']
RMSE: 197.75102530844302

['temp', 'season', 'humidity', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
RMSE: 153.9421799944251



The worst model was the one using only the independent variables temp, season, and humidity (RMSE ≈ 208.6). The error drops as we add information about the time of the rental. The day-or-not indicator `dia_ou_nao` gives a considerable improvement (RMSE ≈ 177.3), but the lowest error comes from the hour dummies, which encode the hour of each record as a categorical variable (RMSE ≈ 153.9). This is largely because `dia_ou_nao` and the dummies are explicitly categorical, so the model can fit a separate level for each category. When the hour enters the model only as an integer (RMSE ≈ 197.8), the regression can fit just a single slope across 0 to 23 and cannot treat each hour as its own category.
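The difference between the two encodings can be illustrated on synthetic data. The sketch below invents an hourly demand curve with two peaks (an assumption made purely for illustration, not the real data), then fits ordinary least squares with the hour as one numeric column and with 23 dummies. It is an in-sample comparison without noise or a train/test split, so it shows only the shape limitation, not realistic error levels.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Ten synthetic days of hourly observations
hours = np.tile(np.arange(24), 10)

# Made-up demand curve with peaks around 8h and 17h (not the real data)
h = np.arange(24)
profile = (50
           + 200 * np.exp(-0.5 * ((h - 8) / 1.5) ** 2)
           + 250 * np.exp(-0.5 * ((h - 17) / 2.0) ** 2))
y = profile[hours]

# Model A: intercept + hour as a single numeric column (one slope for 0..23)
X_num = np.column_stack([np.ones(len(hours)), hours.astype(float)])
coef_num, *_ = np.linalg.lstsq(X_num, y, rcond=None)
rmse_num = rmse(y, X_num @ coef_num)

# Model B: intercept + 23 dummies (hour 0 is the baseline)
dummies = (hours[:, None] == np.arange(1, 24)).astype(float)
X_dum = np.column_stack([np.ones(len(hours)), dummies])
coef_dum, *_ = np.linalg.lstsq(X_dum, y, rcond=None)
rmse_dum = rmse(y, X_dum @ coef_dum)

print(f"numeric hour: {rmse_num:.2f}")
print(f"hour dummies: {rmse_dum:.2f}")
```

Because the synthetic target depends only on the hour, the dummy model reproduces it almost exactly, while the single-slope model cannot follow the two peaks.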

#### Exercise 4: Create and compare models with quadratic variables, for example `temp_2` and `humidity_2`. Consider the following list of attributes:

- ['temp', 'season', 'humidity', 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 'temp_2', 'humidity_2']