# WiDS Maastricht Datathon 2022 Pre-training Day-1 Notebook #
This Jupyter notebook was prepared for the [WiDS Maastricht Datathon 2022](https://www.maastrichtuniversity.nl/events/wids-datathon-maastricht-2022) training session on 4-5 February 2022. The objective of this pre-training notebook is to provide background knowledge and guidance for participants starting the data challenge.

- The dataset contains approximately **100k** observations of building energy usage records, collected over **7 years** across a number of states within the United States.
- The dataset consists of **building characteristics, weather data** for the location of the building, and energy usage for the building and the given year. 
- The **goal** is to predict energy usage for each building given the characteristics of the building and the weather data for the location of the building.


This notebook is delivered by [Chang Sun](https://www.linkedin.com/in/chang-sun-maastricht/), [Nicolas Perez](http://www.linkedin.com/in/nicolas-perez-zambrano/), [Yenisel Plasencia Calaña](https://www.linkedin.com/in/yenisel-plasencia-cala%C3%B1a-phd-10144190/), [Carlos Utrilla Guerrero](https://nl.linkedin.com/in/carlos-utrilla-guerrero-97ba7b31), [Parveen Kumar](https://nl.linkedin.com/in/parveensenza).


## Why does climate change matter?
**Climate Change costs lives and money.**
- The average temperature in Europe has risen sharply over the past 40 years.
- Climate change causes extreme weather.
- People are dying because of extreme weather.
- Climate change leads to economic losses.

*Source: European Council https://www.consilium.europa.eu/en/infographics/climate-costs/*

## Limit climate change and its effects
Immediate, rapid and large-scale **reductions in greenhouse gas (GHG) emissions** and reaching net-zero CO2 emissions have the potential to limit climate change and its effects.

Mitigation of GHG emissions requires changes to electricity systems, transportation, **buildings**, industry, and land use.


**Building energy prediction**
- prevent power shortages in modern cities
- reduce social costs caused by unnecessary energy supply
- support stable and efficient power grid operation
- make impactful mitigation strategies 


# Import needed packages

In [1]:
!pip install shap





In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import math
import matplotlib.pyplot as plt
import seaborn as sns
import shap
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [3]:
# Use the Jupyter Widgets package - https://ipywidgets.readthedocs.io/en/stable/index.html 
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual, Layout
import ipywidgets as widgets
style = {'description_width': 'initial'}

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 300)

# Load train and test data files

In [1]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
## Load the data 
import os
for dirname, _, filenames in os.walk('inputs'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv("./inputs/row_tolerance.csv", sep=',')

inputs\clean_tolerance.csv
inputs\data-dictionary.csv
inputs\row_tolerance.csv
inputs\.ipynb_checkpoints\clean_tolerance-checkpoint.csv


## Glance at data
Always check the number of rows and columns, what the data looks like, and which variables/features/attributes/columns are in the dataset.

The .shape attribute gives you this information. The first number is the number of rows and the second is the number of columns (features).

In [2]:
# Check the number of rows and columns
print("Shape of the dataset (rows, columns):", df.shape) 

Shape of the dataset (rows, columns): (151, 2)


In [3]:
# Have a glance at data (shortcut tip: (un)comment: control + / )
df.head() 


Unnamed: 0,Timestamp;Age;Gender;Height (scale in cm,e.g. 183);Where are you come from? (Country);How many language you speak at home to your family?;How many different countries have you lived in?;Past intercultural experience: have you ever done one of the following international programs?;Indicate how strong you agree/disagree with the following sentence: [Q1. Interreligious dialog may help to mitigate conflicts and misunderstandings in society?];Indicate how strong you agree/disagree with the following sentence: [Q2. Is there the same freedom of religious practice for all religions in the society?];Indicate how strong you agree/disagree with the following sentence: [Q3. People should have the right to live how they wish];Indicate how strong you agree/disagree with the following sentence: [Q4. It is important that people have the freedom to live their life as they choose];Indicate how strong you agree/disagree with the following sentence: [Q5. It is okay for people to live as they wish as long as they do not harm other people];Indicate how strong you agree/disagree with the following sentence: [Q6. I respect other people’s beliefs and opinions];Indicate how strong you agree/disagree with the following sentence: [Q7. I respect other people’s opinions even when I do not agree];Indicate how strong you agree/disagree with the following sentence: [Q8. I like to spend time with people who are different from me]
0,1;34;0;186;Spain;1;5;Volunteer service;Strongly Agree;Strongly Agree;Strongly Agree;Strongly Agree;Strongly Agree;Agree;Strongly Agree;Strongly Agree,
1,11/22/2021 11:35:22;;;;;;;;;;;;;;;,
2,2;34;6;157;BR;2;3;Internship program;Strongly Agree;Strongly Agree;Agree;Agree;Disagree;Disagree;Disagree;Disagree,
3,3;27;8;191;RU;4;2;Studying school abroad;Strongly Disagree;Strongly Agree;Disagree;Strongly Agree;Strongly Disagree;Strongly Disagree;Strongly Agree;Disagree,
4,4;35;5;165;RU;3;5;Studying school abroad;Disagree;Strongly Agree;Strongly Agree;Disagree;Disagree;Agree;Agree;Agree,


In [8]:
# Check features in your dataset
df.columns

Index(['Timestamp', 'Age', 'Gender', 'Height (scale in cm, e.g. 183)',
       'Where are you come from? (Country)',
       'How many language you speak at home to your family?',
       'How many different countries have you lived in?',
       'Past intercultural experience: have you ever done one of the following international programs?',
       'Indicate how strong you agree/disagree with the following sentence: [Q1. Interreligious dialog may help to mitigate conflicts and misunderstandings in society?]',
       'Indicate how strong you agree/disagree with the following sentence: [Q2. Is there the same freedom of religious practice for all religions in the society?]',
       'Indicate how strong you agree/disagree with the following sentence: [Q3. People should have the right to live how they wish]',
       'Indicate how strong you agree/disagree with the following sentence: [Q4. It is important that people have the freedom to live their life as they choose]',
       'Indicate how str

> Question: What is the name of the target feature?

# Feature exploration 
## What do these features mean in your dataset?
(Please note that people call them features, variables, attributes, or columns. All are correct ;)

**We added an additional file: data-dictionary.csv**
This file was created from the data description on the WiDS Kaggle page https://www.kaggle.com/c/widsdatathon2022/data 

It is NOT necessary to have this file for your analysis. I added it here to help you have an easy way to explore the features in the dataset.

In [10]:
data_dict = pd.read_csv("./inputs/data-dictionary.csv", sep=';')
search_column_name = 'variable'

### Print the selected data rows ###
def f(x):
    return data_dict[data_dict[search_column_name].isin(list(x))]

### Multiple selection widgets ###
widget_variable=widgets.SelectMultiple(
    options=data_dict[search_column_name].unique(),
    layout=Layout(width='25%', height='150px'),
    description=search_column_name, 
    style = style
)
interact(f, x=widget_variable);

interactive(children=(SelectMultiple(description='variable', layout=Layout(height='150px', width='25%'), optio…

**Please note there are only 30 features/variables in the dictionary**

## Types of features
- Categorical or numerical?
- How do we deal with different types of features?
- Why do they matter? 
    - for example: gender, country code, level of education
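
A small sketch of why the type matters, using a made-up toy table (the column names and values are assumptions for illustration, not the datathon data). A country code stored as a number has no meaningful average, so it is better treated as categorical, while level of education is categorical with a natural order.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [34, 27, 35],                 # numerical: arithmetic makes sense
    "country_code": [31, 34, 55],        # looks numerical, but is really a label
    "education": ["BSc", "MSc", "PhD"],  # categorical with a natural order
})

# pandas infers int64 for country_code, which is valid but misleading
print(toy.dtypes)

# Tell pandas explicitly which columns are categories
toy["country_code"] = toy["country_code"].astype("category")
toy["education"] = pd.Categorical(
    toy["education"], categories=["BSc", "MSc", "PhD"], ordered=True
)

# Selecting numerical columns now returns only the truly numerical ones
numeric_cols = list(toy.select_dtypes("number").columns)
print(numeric_cols)
```

After the conversion, summary statistics and numerical plots skip the label-like columns, which is usually what you want.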

### Check the data type of each variable. 

In [12]:
# They are auto-recognized by Pandas when you read the csv file into dataframe.
# Question: Are they all correct?
df.dtypes

Timestamp                                                                                                                                                            object
Age                                                                                                                                                                 float64
Gender                                                                                                                                                              float64
Height (scale in cm, e.g. 183)                                                                                                                                      float64
Where are you come from? (Country)                                                                                                                                   object
How many language you speak at home to your family?                                                                                         

> Thinking: CHECK Features 
> - **Year_Factor**: int64?
> - **id**: int64?
> - **year_built**: float64?

In [22]:
# Map the Likert answers to integers (5 = Strongly Agree ... 1 = Strongly Disagree)
df_1 = df.replace(['Strongly Agree','Agree', 'Neutral', 'Disagree', 'Strongly Disagree'], [5, 4, 3, 2, 1])
df_1.head(5)

Unnamed: 0,Timestamp,Age,Gender,"Height (scale in cm, e.g. 183)",Where are you come from? (Country),How many language you speak at home to your family?,How many different countries have you lived in?,Past intercultural experience: have you ever done one of the following international programs?,Indicate how strong you agree/disagree with the following sentence: [Q1. Interreligious dialog may help to mitigate conflicts and misunderstandings in society?],Indicate how strong you agree/disagree with the following sentence: [Q2. Is there the same freedom of religious practice for all religions in the society?],Indicate how strong you agree/disagree with the following sentence: [Q3. People should have the right to live how they wish],Indicate how strong you agree/disagree with the following sentence: [Q4. It is important that people have the freedom to live their life as they choose],Indicate how strong you agree/disagree with the following sentence: [Q5. It is okay for people to live as they wish as long as they do not harm other people],Indicate how strong you agree/disagree with the following sentence: [Q6. I respect other people’s beliefs and opinions],Indicate how strong you agree/disagree with the following sentence: [Q7. I respect other people’s opinions even when I do not agree],Indicate how strong you agree/disagree with the following sentence: [Q8. I like to spend time with people who are different from me]
0,1,34.0,0.0,186.0,Spain,1.0,5.0,Volunteer service,5.0,5.0,5.0,5.0,5.0,4.0,5.0,5.0
1,11/22/2021 11:35:22,,,,,,,,,,,,,,,
2,2,34.0,6.0,157.0,BR,2.0,3.0,Internship program,5.0,5.0,4.0,4.0,2.0,2.0,2.0,2.0
3,3,27.0,8.0,191.0,RU,4.0,2.0,Studying school abroad,1.0,5.0,2.0,5.0,1.0,1.0,5.0,2.0
4,4,35.0,5.0,165.0,RU,3.0,5.0,Studying school abroad,2.0,5.0,5.0,2.0,2.0,4.0,4.0,4.0


In [21]:
import mitosheet
mitosheet.sheet(df_1.head(5), view_df=True)

MitoWidget(analysis_data_json='{"analysisName": "UUID-1034732d-f944-49fd-aa68-3502d874e82d", "code": {"imports…

### Group categorical and numerical features

In [19]:
numerical_features=df.select_dtypes('number').columns
print(numerical_features)

Index(['Age', 'Gender', 'Height (scale in cm, e.g. 183)',
       'How many language you speak at home to your family?',
       'How many different countries have you lived in?'],
      dtype='object')


In [20]:
# Convert the columns to the appropriate types (perhaps an exercise for students)

numerical = ['Age', 'Gender', 
             'Indicate how strong you agree/disagree with the following sentence: [Q1. Interreligious dialog may help to mitigate conflicts and misunderstandings in society?]',
             'Indicate how strong you agree/disagree with the following sentence: [Q2. Is there the same freedom of religious practice for all religions in the society?]',
             'Indicate how strong you agree/disagree with the following sentence: [Q3. People should have the right to live how they wish]',
             'Indicate how strong you agree/disagree with the following sentence: [Q4. It is important that people have the freedom to live their life as they choose]',
             'Indicate how strong you agree/disagree with the following sentence: [Q5. It is okay for people to live as they wish as long as they do not harm other people]',
             'Indicate how strong you agree/disagree with the following sentence: [Q6. I respect other people’s beliefs and opinions]',
             'Indicate how strong you agree/disagree with the following sentence: [Q7. I respect other people’s opinions even when I do not agree]',
             'Indicate how strong you agree/disagree with the following sentence: [Q8. I like to spend time with people who are different from me]'
             ]

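categorical = ['Where are you come from? (Country)', 'Gender'] 

# The Likert answers in df are still text ("Strongly Agree", ...), so map them to 1-5 first.
# Without this step, pd.to_numeric fails with:
#   ValueError: Unable to parse string "Strongly Agree" at position 0
likert_scale = {'Strongly Disagree': 1, 'Disagree': 2, 'Neutral': 3, 'Agree': 4, 'Strongly Agree': 5}
df[numerical] = df[numerical].replace(likert_scale).apply(pd.to_numeric)
df[categorical] = df[categorical].astype("category")

df.dtypes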

### Look at categorical variables

In [None]:
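# `train` is the WiDS training DataFrame; it must be loaded before running this cell
plot_cat_dataframe = train
# Categorical columns are the ones pandas read as text (object dtype)
categorical_features = list(train.select_dtypes('object').columns)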

### hist plot categorical features ###
def count_plot(var, plot_cat_dataframe):
    plt.figure(figsize = (10,8))
    ax = sns.countplot(y = var, data = plot_cat_dataframe)
    plt.title(var, size = 15)

def inter_cat_plot(x):
    return count_plot(x, plot_cat_dataframe)

### Multiple selection widgets ###
widget_cat_plot=widgets.Dropdown(
    options=categorical_features,
    value='State_Factor',
    description="Categorical Variable:", 
    style = style
)
interact(inter_cat_plot, x=widget_cat_plot);

### Look at numerical variables

In [15]:
plot_num_data =df[numerical_features]

# plot_num_data = train[train['State_Factor']=='State_1']

# data_state_1 = train[train['State_Factor']=='State_2']
# plot_num_data = data_state_1[data_state_1['building_class']=='Commercial']


### Trend line plot ###
def line_plot(var, plot_num_data):
    # Series.plot creates its own figure, so a separate plt.figure() call is not needed
    plot_num_data[var].plot(figsize=(20,4));

def inter_cat_plot(x):
    return line_plot(x, plot_num_data)

### Dropdown widget ###
numeric_options = list(plot_num_data.select_dtypes('number').columns)
widget_cat_plot=widgets.Dropdown(
    options=numeric_options,
    value=numeric_options[0],  # the default must be one of the options, otherwise a TraitError is raised
    description="Numerical Variable:", 
    style = style
)
interact(inter_cat_plot, x=widget_cat_plot);

### Distribution plot of numerical variables

In [None]:
### Distribution plot ###
def dist_plot(feature_list, train, test):
    for each_feature in feature_list:
        plt.figure(figsize = (20, 4))

        sns.kdeplot(train[each_feature].to_numpy(), color = '#5499C7') # blue
        sns.kdeplot(test[each_feature].to_numpy(), color = '#D35400') # red

        plt.title(each_feature, fontsize=15)
        plt.show()
    
#     del values_train , values_test
    
def inter_dist_plot(x):
    return dist_plot(x, train, test)

### Multiple selection widgets ###
widget_dist_plot=widgets.SelectMultiple(
    options=train.select_dtypes('number').columns,
    value=["floor_area"],
    layout=Layout(width='50%', height='200px'),
    description="Numerical Variable:", 
    style = style
)
interact(inter_dist_plot, x=widget_dist_plot);

### Correlation matrix
Correlation map to see how features are correlated with each other and with target

In [None]:
month_avg_temp = ['january_avg_temp','february_avg_temp','march_avg_temp','april_avg_temp',
'may_avg_temp', 'june_avg_temp', 'july_avg_temp', 'august_avg_temp',
'september_avg_temp', 'october_avg_temp', 'november_avg_temp','december_avg_temp']

# Calculate correlation matrix 
# Check parameters: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html 
corr = train[month_avg_temp].corr() # method='pearson', 'kendall' , 'spearman'

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(9,6))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-0.8,vmax=0.8, square=True, linewidths=.5)

**TO DO:** Try plotting the correlation matrix for all features, or for other features whose correlations you find interesting.
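
To get a feeling for the `method` parameter before trying it on the real features, here is a minimal sketch on synthetic data (the variables `x` and `y` are invented for illustration). Pearson measures linear association, while Spearman works on ranks, so it reaches 1 for any strictly increasing relationship.

```python
import numpy as np
import pandas as pd

# Synthetic data: y grows with x, but far from linearly
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=200)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson: strength of the *linear* relationship
pearson = toy.corr(method="pearson").loc["x", "y"]

# Spearman: Pearson correlation of the ranks, so any monotonic relationship scores high
spearman = toy.corr(method="spearman").loc["x", "y"]

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

If the two methods disagree strongly for a pair of features, that is a hint that the relationship is monotonic but not linear, or that outliers are influencing the Pearson value.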

In [None]:
# ### Note: Not all the labels are printed. 
# corr = train[numerical_features].corr() # TRY DIFFERENT METHOD: method='pearson', 'kendall' , 'spearman'

# ### Generate a mask for the upper triangle
# mask = np.triu(np.ones_like(corr, dtype=bool))

# ### Set up the matplotlib figure
# f, ax = plt.subplots(figsize=(9,6)) # TRY TO CHANGE THE SIZE

# ### Generate a custom diverging colormap
# cmap = sns.diverging_palette(230, 20, as_cmap=True)

# ### Draw the heatmap with the mask and correct aspect ratio
# sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-0.8,vmax=0.8, square=True, linewidths=.5) # TRY TO CHANGE vmin, vmax

## Check missing values in the dataset

In [None]:
# plt.figure(figsize = (25,11))
# sns.heatmap(train.isna().values, xticklabels=train.columns)
# plt.title("Missing values in training Data", size=20)

In [None]:
### check if there is any missing value in the dataset ###
def check_missing(df, col):
    missing = 0          # number of variables that have at least one missing value
    misVariables = []
    CheckNull = df.isnull().sum()
    for var in range(0, len(CheckNull)):
        n_missing = CheckNull.iloc[var]
        # 'Percentage' is stored as a fraction between 0 and 1
        misVariables.append([col[var], n_missing, round(n_missing/len(df),3)])
        if n_missing > 0:
            missing = missing + 1

    # Build the summary table in every case so the function always returns a DataFrame
    df_misVariables = pd.DataFrame.from_records(misVariables, columns=['Variable', 'Missing', 'Percentage'])
    if missing == 0:
        print('Dataset is complete with no blanks.')
    else:
        s = df_misVariables.sort_values(by=['Percentage'], ascending=False)
        display(s)
    return df_misVariables

In [None]:
ranked_df_missing_value = check_missing(train, train.columns) 

> **Question**: Will the test data have the same missing value situation? 

In [None]:
# ranked_df_missing_value = check_missing(test, test.columns)

## Handling missing values [Challenge1 Dive into details on Day 2] ##
- Why do missing values affect our results?
- How can we handle missing values?
- Should missing values in the training and test datasets be handled in the same way or differently?
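
As one possible answer to the last question, here is a minimal sketch of imputation with `SimpleImputer` on toy data (the values are invented; only the column names `floor_area` and `year_built` come from the dataset). The assumption illustrated here is that the fill values are learned from the training data only and then reused for the test data, so both are treated consistently.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with gaps, invented for illustration
train_toy = pd.DataFrame({
    "floor_area": [100.0, 200.0, np.nan, 400.0],
    "year_built": [1990.0, np.nan, 2005.0, 2010.0],
})
test_toy = pd.DataFrame({
    "floor_area": [np.nan, 150.0],
    "year_built": [2000.0, np.nan],
})

# Learn the per-column medians on the training data only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train_toy), columns=train_toy.columns)

# Reuse the training medians to fill the gaps in the test data
test_filled = pd.DataFrame(imputer.transform(test_toy), columns=test_toy.columns)

print(test_filled)
```

Dropping rows, as done in the next cell, is simpler but only possible for the training data, because every test row needs a prediction.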

In [None]:
### Take all variables which have less than 5% missing values ###
included_col = list(ranked_df_missing_value[ranked_df_missing_value['Percentage']<0.05]['Variable'])

train_partial = train[included_col]

included_col.remove('site_eui')
test_partial = test[included_col]

### Remove the rows with missing values ###
### Please note how many rows you have excluded ###
train_partial = train_partial.dropna().reset_index(drop=True)

print("Original training dataset (rows):", len(train))
print("After removing missing (rows):", len(train_partial))

# Remove target feature + id feature

> **Question:** from which step should we exclude the target features? 

In [None]:
target = train_partial["site_eui"]
train_partial = train_partial.drop(["site_eui","id"],axis =1)
test_partial = test_partial.drop(["id"],axis =1)

## Encoding [Challenge2 Dive into details on Day 2] ##
Why do we need to encode features? There are many ways to encode features. Figure out the differences between them, how to choose the best one for this dataset, and why!
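
To make the difference concrete, here is a minimal sketch comparing two common options on a toy table (the rows are invented; `building_class` and `State_Factor` are column names from the dataset). Integer codes are compact but imply an order between categories, while one-hot encoding avoids that at the cost of more columns.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "building_class": ["Commercial", "Residential", "Commercial"],
    "State_Factor": ["State_1", "State_2", "State_4"],
})

# Option 1: integer codes, one number per category (similar in spirit to LabelEncoder)
label_encoded = toy.apply(lambda col: col.astype("category").cat.codes)

# Option 2: one-hot encoding, one 0/1 column per category
one_hot = pd.get_dummies(toy, columns=["building_class", "State_Factor"])

print(label_encoded)
print(one_hot.columns.tolist())
```

Tree-based models such as XGBoost often cope well with integer codes, whereas linear models usually work better with one-hot columns. Treat this as a starting point for your own comparison rather than a rule.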

In [None]:
#encoding

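for col in categorical_features:
    # Fit one encoder per column on train and test together,
    # so both datasets use the same category-to-integer mapping
    label_encoder = LabelEncoder()
    label_encoder.fit(pd.concat([train_partial[col], test_partial[col]]))
    train_partial[col] = label_encoder.transform(train_partial[col])
    test_partial[col] = label_encoder.transform(test_partial[col])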

## Feature Scaling: standardization, normalization [Challenge1 Dive into details on Day 2] ##
- Why do we need to normalize variables?
- What methods can we use: 
    - https://scikit-learn.org/stable/modules/preprocessing.html#normalization
    - https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html
    - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
    - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
    - https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
- What are the differences between the normalization methods?
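
A minimal sketch of how three of the scalers linked above behave on a toy column with one extreme value (the numbers are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# One feature with a single extreme value (100)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: subtract the mean, divide by the standard deviation
standard = StandardScaler().fit_transform(values)

# MinMaxScaler: squeeze everything into [0, 1]; the extreme value defines the top
minmax = MinMaxScaler().fit_transform(values)

# RobustScaler: subtract the median, divide by the interquartile range
robust = RobustScaler().fit_transform(values)

for name, scaled in [("standard", standard), ("minmax", minmax), ("robust", robust)]:
    print(name, np.round(scaled.ravel(), 2))
```

With `MinMaxScaler`, the four ordinary values end up squeezed close to 0 because the extreme value sets the range. `RobustScaler` keeps them evenly spread, since the median and interquartile range are hardly affected by a single outlier.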

In [None]:
#scaling

scaler = StandardScaler()
train_partial = scaler.fit_transform(train_partial)
test_partial = scaler.transform(test_partial)

## Anomaly Detection [Challenge Dive into details on Day 2] ##
- **Outlier detection**: The **training data contains outliers** which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.
- **Novelty detection**: The training data is not polluted by outliers and we are interested in detecting whether a **new** observation is an outlier. In this context an outlier is also called a novelty.

Source: https://scikit-learn.org/stable/modules/outlier_detection.html
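
A minimal sketch of outlier detection with `IsolationForest` on synthetic 2D data (the data and the two planted outliers are invented for illustration; the `contamination` value matches the commented cell in this notebook):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 200 ordinary points around the origin plus two planted, far-away points
rng = np.random.default_rng(2022)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted = np.array([[8.0, 8.0], [-9.0, 7.0]])
X = np.vstack([inliers, planted])

# contamination is the share of samples we expect to be outliers
iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)  # 1 = inlier, -1 = outlier

# Keep only the rows that were not flagged
X_clean = X[labels != -1]
print("flagged as outliers:", int((labels == -1).sum()))
```

The same boolean mask has to be applied to the target, otherwise features and target no longer line up row by row.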

In [None]:
# #### If you want to inspect outliers or use some type of flag features if the sample is an outlier
# from sklearn.neighbors import LocalOutlierFactor
# from sklearn.ensemble import IsolationForest
# #iso = LocalOutlierFactor(n_neighbors=35, contamination=0.01)
# iso = IsolationForest(contamination=0.01)

# outliers = iso.fit_predict(train)
# ### select all rows that are not outliers

# ### train = train[outliers!=-1]
# ### target = target[outliers!=-1]

# Model training and testing [Optimization will be presented in DAY 2]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_partial, target, test_size = 0.2, random_state = 2022)

In [None]:
import xgboost
xgboost_model = xgboost.XGBRegressor(n_estimators=200, learning_rate=0.02, gamma=0, subsample=0.75,
                           colsample_bytree=0.4, max_depth=5)

xgboost_model.fit(X_train,y_train)
y_pred = xgboost_model.predict(X_test)
# regression evaluation metric: root-mean-square error (RMSE)
RMSE = math.sqrt(np.square(np.subtract(y_pred,y_test)).mean())

In [None]:
RMSE

## What does the previous number mean?
In this case, we're dealing with a regression problem, meaning we're trying to predict a number given the features.

There are several ways to evaluate how close our predictions were to reality. One of the most common is the root-mean-square error (RMSE), which is the metric we're using here. In general, a lower number is better, as it means the predictions were closer to the true values. RMSE is expressed in the same units as the target.

The following wikipedia article contains more details about how it works:

https://en.wikipedia.org/wiki/Root-mean-square_deviation
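
A small worked example may help, using four invented predictions. RMSE squares each error, averages the squares, and takes the square root, so large errors count more than in the mean absolute error (MAE), which is computed here for comparison.

```python
import math
import numpy as np

# Invented true values and predictions, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 40.0])

errors = y_pred - y_true                 # [2, -2, 3, 0]
rmse = math.sqrt(np.mean(errors ** 2))   # sqrt((4 + 4 + 9 + 0) / 4) = sqrt(4.25)
mae = np.mean(np.abs(errors))            # (2 + 2 + 3 + 0) / 4 = 1.75

print(f"RMSE={rmse:.3f}, MAE={mae:.3f}")
```

RMSE is never smaller than MAE on the same predictions. A large gap between the two suggests that a few predictions are far off.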

In [None]:
# Predict on the Kaggle test set and write the submission file
res = xgboost_model.predict(test_partial)
sub = pd.read_csv("/kaggle/input/widsdatathon2022/sample_solution.csv")
sub["site_eui"] = res
sub.to_csv("submission.csv", index = False)

In [None]:
sub