
# Splitting a dataset into training and test sets

Import the file "jan_2019_periodo_dia.csv" into Colab.<br>

Characteristics of this use case:
*    Considers the days of the week from Monday to Friday
*    Features: 'isoweekday', 'period_of_day'
*    Training: 80%
*    Test: 20%


## Libraries

In [None]:
USE_MATPLOTLIB = False

In [None]:
import pandas as pd
import numpy as np

import sklearn
# For splitting the data into training and test sets
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score, accuracy_score


# For data visualization
if USE_MATPLOTLIB:
  import matplotlib.pyplot as plt
  import seaborn as sns
  import matplotlib.dates as mdates

import math

In [None]:
# Python version
!python --version

Python 3.7.12


In [None]:
# scikit-learn version
sklearn_version = sklearn.__version__
print(sklearn_version)

1.0.2


In [None]:
# pandas version
print(pd.__version__)

1.1.5


## Load Dataset

In [None]:
csv_file_name = "jan_2019_periodo_dia.csv"
ds_merged = pd.read_csv(csv_file_name, sep = ';')


## Exploratory analysis


In [None]:
ds_merged.head()

Unnamed: 0.1,Unnamed: 0,date,time,consumption (w),generation (w),temperature (˚C),humidity (%),radiation (Wm^2),isoweekday,period_of_day
0,0,2019-01-01,00:05:00,2985,0,8.7,76.0,0.0,2,1
1,1,2019-01-01,00:10:00,2258,0,8.6,76.0,0.0,2,1
2,2,2019-01-01,00:15:00,2266,0,8.7,76.0,0.0,2,1
3,3,2019-01-01,00:20:00,3016,0,8.7,76.0,0.0,2,1
4,4,2019-01-01,00:25:00,2265,0,8.6,76.0,0.0,2,1


In [None]:
ds_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8927 entries, 0 to 8926
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Unnamed: 0        8927 non-null   int64  
 1   date              8927 non-null   object 
 2   time              8927 non-null   object 
 3   consumption (w)   8927 non-null   int64  
 4   generation (w)    8927 non-null   int64  
 5   temperature (˚C)  8927 non-null   float64
 6   humidity (%)      8927 non-null   float64
 7   radiation (Wm^2)  8927 non-null   float64
 8   isoweekday        8927 non-null   int64  
 9   period_of_day     8927 non-null   int64  
dtypes: float64(3), int64(5), object(2)
memory usage: 697.5+ KB


### Conclusions

The "Unnamed: 0" attribute must be removed.<br>
The data type of the "date" attribute must be converted to datetime.<br>
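
Both points can also be handled while the file is being read. The sketch below is a minimal illustration, assuming the real CSV has the same layout as the hypothetical two-row `sample_csv` string (`;` separator and a leading unnamed column holding the saved index); `sample_csv` and `sample` are illustrative names, not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the assumed layout of the real file:
# ';' separator and a leading unnamed column holding the saved index.
sample_csv = (
    ";date;time;consumption (w);isoweekday;period_of_day\n"
    "0;2019-01-01;00:05:00;2985;2;1\n"
    "1;2019-01-01;00:10:00;2258;2;1\n"
)

# index_col=0 turns the unnamed first column back into the index,
# parse_dates converts "date" to a datetime type while reading.
sample = pd.read_csv(
    io.StringIO(sample_csv), sep=";", index_col=0, parse_dates=["date"]
)

print(sample.columns.tolist())
print(sample["date"].dtype)
```

With this approach no "Unnamed: 0" column is created, so the separate drop and conversion steps become optional.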



## Preprocessing

In [None]:
# Remove the "Unnamed: 0" attribute
ds_merged.drop(columns=['Unnamed: 0'], inplace=True)

In [None]:
# Convert the data type of the "date" attribute to datetime
ds_merged['date'] = pd.to_datetime(ds_merged['date'])

In [None]:
ds_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8927 entries, 0 to 8926
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              8927 non-null   datetime64[ns]
 1   time              8927 non-null   object        
 2   consumption (w)   8927 non-null   int64         
 3   generation (w)    8927 non-null   int64         
 4   temperature (˚C)  8927 non-null   float64       
 5   humidity (%)      8927 non-null   float64       
 6   radiation (Wm^2)  8927 non-null   float64       
 7   isoweekday        8927 non-null   int64         
 8   period_of_day     8927 non-null   int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(1)
memory usage: 627.8+ KB


In [None]:
def boxplot_consumption(ds_merged):
  # Vertical boxplot of the consumption column (needs USE_MATPLOTLIB = True)
  ax = sns.boxplot(data=ds_merged['consumption (w)'], orient="v", width=0.2)
  ax.figure.set_size_inches(12, 6)
  ax.set_title("consumption (w)", fontsize=20)
  ax.set_xlabel("date", fontsize=16)
  ax.set_ylabel("consumption (w)", fontsize=16)
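
When plotting is disabled (`USE_MATPLOTLIB = False`), the summary a boxplot conveys can be computed numerically instead. The sketch below assumes the common 1.5 × IQR rule for the outlier limits; `boxplot_summary` and the sample readings are hypothetical and not taken from the dataset.

```python
import pandas as pd


def boxplot_summary(values):
    """Return the quartiles and the 1.5 * IQR limits for a numeric Series."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Points outside the limits are the ones a boxplot would draw as outliers.
    outliers = values[(values < lower) | (values > upper)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_limit": lower,
        "upper_limit": upper,
        "n_outliers": len(outliers),
    }


# Hypothetical readings, not taken from the dataset.
sample = pd.Series([2200, 2300, 2400, 2500, 2600, 9000], name="consumption (w)")
summary = boxplot_summary(sample)
print(summary)
```

Calling `boxplot_summary(ds_merged['consumption (w)'])` would give the same kind of summary for the real column.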

In [None]:
# https://www.python-graph-gallery.com/basic-time-series-with-matplotlib

def plot_consumption(ds):
  fig, ax = plt.subplots(figsize=(20, 6))

  # One major tick per day, labelled as YYYY-MM-DD
  day_locator = mdates.DayLocator(interval=1)
  ax.xaxis.set_major_locator(day_locator)

  year_month_day_formatter = mdates.DateFormatter('%Y-%m-%d')
  ax.xaxis.set_major_formatter(year_month_day_formatter)

  ax.plot(ds['date'], ds['consumption (w)'], color='b')
  # Rotates and right aligns the x labels.
  # Also moves the bottom of the axes up to make room for them.
  fig.autofmt_xdate()

In [None]:
if USE_MATPLOTLIB:
  plot_consumption(ds_merged)

In [None]:
def barplot_consumption(ds):
  # Set plot style: grey grid in the background
  sns.set(style="darkgrid")

  # Set the figure size
  fig, ax = plt.subplots(figsize=(20, 6))

  # One major tick per day, labelled as YYYY-MM-DD
  day_locator = mdates.DayLocator(interval=1)
  ax.xaxis.set_major_locator(day_locator)

  year_month_day_formatter = mdates.DateFormatter('%Y-%m-%d')
  ax.xaxis.set_major_formatter(year_month_day_formatter)

  ax.bar(ds['date'], ds['consumption (w)'])

  ax.set_xlabel("date")
  ax.set_ylabel("consumption (w)")
  ax.set_title("consumption (w) x date")

  # Rotates and right aligns the x labels.
  # Also moves the bottom of the axes up to make room for them.
  fig.autofmt_xdate()

  plt.show()

In [None]:
if USE_MATPLOTLIB:
  barplot_consumption(ds_merged)

In [None]:
# Remove the consumption records for Saturdays and Sundays
MONDAY    = 1
TUESDAY   = 2
WEDNESDAY = 3
THURSDAY  = 4
FRIDAY    = 5
SATURDAY  = 6
SUNDAY    = 7
ds_merged.drop(ds_merged[ds_merged['isoweekday'] == SATURDAY].index, inplace=True)

In [None]:
ds_merged.drop(ds_merged[ds_merged['isoweekday'] == SUNDAY].index, inplace=True)

In [None]:
ds_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6623 entries, 0 to 8926
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   date              6623 non-null   datetime64[ns]
 1   time              6623 non-null   object        
 2   consumption (w)   6623 non-null   int64         
 3   generation (w)    6623 non-null   int64         
 4   temperature (˚C)  6623 non-null   float64       
 5   humidity (%)      6623 non-null   float64       
 6   radiation (Wm^2)  6623 non-null   float64       
 7   isoweekday        6623 non-null   int64         
 8   period_of_day     6623 non-null   int64         
dtypes: datetime64[ns](1), float64(3), int64(4), object(1)
memory usage: 517.4+ KB


In [None]:
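# Check that Saturday was removed (2019-01-05 was a Saturday).
# The date is quoted so that it is compared as a calendar date, not as an integer.
ds_merged.query('date == "2019-01-05"')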

Unnamed: 0,date,time,consumption (w),generation (w),temperature (˚C),humidity (%),radiation (Wm^2),isoweekday,period_of_day
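
The two `drop` calls can also be expressed as a single boolean mask. The sketch below is a minimal illustration on a hypothetical one-row-per-day frame (`sample`); it also derives the ISO weekday from the date to show how such a column relates to `date`, which is an assumption about how `isoweekday` was produced.

```python
import pandas as pd

FRIDAY = 5  # ISO weekday numbering: Monday = 1 ... Sunday = 7

# Hypothetical frame with one row per day, 2019-01-01 to 2019-01-07.
sample = pd.DataFrame({"date": pd.date_range("2019-01-01", periods=7, freq="D")})

# dayofweek is 0-based (Monday = 0), so add 1 to get the ISO weekday.
sample["isoweekday"] = sample["date"].dt.dayofweek + 1

# Keep Monday to Friday with one mask instead of two separate drops.
weekdays = sample[sample["isoweekday"] <= FRIDAY]

print(weekdays["date"].dt.strftime("%Y-%m-%d").tolist())
```

Unlike `drop(..., inplace=True)`, the mask returns a new frame and leaves the original untouched, which makes the step safe to re-run in a notebook.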


In [None]:
if USE_MATPLOTLIB:
  plot_consumption(ds_merged)

In [None]:
if USE_MATPLOTLIB:
  barplot_consumption(ds_merged)

In [None]:
# Isolate X (the features that contribute to the prediction) and y (the value to predict)
y = ds_merged['consumption (w)']     # What I want to predict
X = ds_merged[['isoweekday', 'period_of_day']]

In [None]:
X.head()

Unnamed: 0,isoweekday,period_of_day
0,2,1
1,2,1
2,2,1
3,2,1
4,2,1
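
X deliberately leaves out `date`. scikit-learn estimators generally expect numeric features, so one possible way to bring calendar information into X is to derive integer columns from the datetime column first. The sketch below shows that idea on hypothetical rows; `day_of_month` and `day_of_year` are illustrative feature names, not columns of the dataset.

```python
import pandas as pd

# Hypothetical rows in the notebook's layout.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-31"]),
    "isoweekday": [2, 3, 4],
    "period_of_day": [1, 2, 4],
})

# Start from the existing numeric features and add integers derived from the date.
features = sample[["isoweekday", "period_of_day"]].copy()
features["day_of_month"] = sample["date"].dt.day
features["day_of_year"] = sample["date"].dt.dayofyear

print(features.dtypes)
```

Whether such columns actually help the model would need to be checked against the evaluation metrics.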


## Training and test data

### Split

In [None]:
# With shuffling, the R² score drops considerably.
# Also, adding the "date" feature makes the RF raise an error.
# To review: if this is a regression, shouldn't only the date be considered?
#            How can a regression use the date together with other attributes on the X axis?
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=False)

# Note that this is a multiple linear regression scenario because there is
# more than one independent variable, i.e., more than one feature in X.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
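
The difference between the two calls is which rows end up in the test set. The sketch below is a minimal illustration on ten hypothetical, time-ordered samples (`X_demo`, `y_demo`): with `shuffle=False` the test set is the final 20% of the rows, while the default shuffled split draws the test rows at random.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples in time order; the values double as row positions.
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10)

# shuffle=False keeps the original order: the last 20% becomes the test set.
_, X_test_ordered, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# The default (shuffle=True) picks test rows at random; random_state fixes the draw.
_, X_test_shuffled, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_test_ordered.ravel())
print(X_test_shuffled.ravel())
```

For time-ordered data, the unshuffled split evaluates the model on a later period than the one it was trained on, which may explain why the two settings give different scores.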

In [None]:
X_train.head()

Unnamed: 0,isoweekday,period_of_day
6890,4,4
2121,2,2
3086,5,3
5846,1,2
4166,2,2


In [None]:
X_test.head()

Unnamed: 0,isoweekday,period_of_day
96,2,2
994,5,2
1976,1,4
865,5,1
8402,3,1


In [None]:
y_train.head()

6890    2421
2121    3688
3086    5696
5846    2173
4166    7194
Name: consumption (w), dtype: int64

In [None]:
y_test.head()

96      2267
994     3971
1976    3919
865     2241
8402    3686
Name: consumption (w), dtype: int64

In [None]:
print(X_train.shape)
print(y_train.shape)

(5298, 2)
(5298,)


In [None]:
print(X_test.shape)
print(y_test.shape)


(1325, 2)
(1325,)
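
The two shapes follow from the 6623 rows left after the weekend filter. A small arithmetic sketch, assuming the test count is rounded up and the training set takes the remaining rows when only `test_size` is given as a fraction:

```python
import math

n_samples = 6623   # rows left after removing Saturdays and Sundays
test_size = 0.2

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 5298 1325
```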


In [None]:
print(type(X_train))
print(type(y_train))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [None]:
print(type(X_test))
print(type(y_test))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


In [None]:
from google.colab import files

def export_train_test_data_as_csv(ds, file_name):
  # Write the data (index included) as a ';'-separated CSV and download it from Colab
  ds.to_csv(file_name, sep=';')
  files.download(file_name)

In [None]:
file_name = "X_train_segunda_a_sexta_dois_atributos.csv"
export_train_test_data_as_csv(X_train, file_name)

In [None]:
file_name = "y_train_segunda_a_sexta_dois_atributos.csv"
export_train_test_data_as_csv(y_train, file_name)

In [None]:
file_name = "X_test_segunda_a_sexta_dois_atributos.csv"
export_train_test_data_as_csv(X_test, file_name)

In [None]:
file_name = "y_test_segunda_a_sexta_dois_atributos.csv"
export_train_test_data_as_csv(y_test, file_name)
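
Because `to_csv` writes the index as the first column, the exported files keep the original row labels. The sketch below is a Colab-independent round trip on hypothetical stand-ins for `X_train` and `y_train` (`X_demo`, `y_demo`), written to a temporary directory, showing how the files might be read back with the same index.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for X_train and y_train with non-contiguous indices.
X_demo = pd.DataFrame(
    {"isoweekday": [4, 2], "period_of_day": [4, 2]}, index=[6890, 2121]
)
y_demo = pd.Series([2421, 3688], index=[6890, 2121], name="consumption (w)")

with tempfile.TemporaryDirectory() as tmp_dir:
    x_path = os.path.join(tmp_dir, "X_demo.csv")
    y_path = os.path.join(tmp_dir, "y_demo.csv")

    # to_csv writes the index as the first column by default.
    X_demo.to_csv(x_path, sep=";")
    y_demo.to_csv(y_path, sep=";")

    # index_col=0 restores that column as the index when reading back.
    X_back = pd.read_csv(x_path, sep=";", index_col=0)
    y_back = pd.read_csv(y_path, sep=";", index_col=0)["consumption (w)"]

print(X_back.equals(X_demo), y_back.equals(y_demo))
```

Reading with `index_col=0` avoids getting a new "Unnamed: 0" column in whichever notebook consumes these files.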
