# AUHack 2023 - Grundfos Hands-on ML Workshop - Building Type Classification
The goal of this workshop is to introduce you to what real-world Data Science is and how it is done.

This workshop is based on an internal Data Hackathon, where the goal was to classify whether a building is residential or non-residential. To do this, we used iGRID heat meter data from a city in Denmark.

As this is a hands-on workshop, there are a number of exercises throughout the notebook. For each exercise I have provided a partial solution and a full solution. I strongly recommend that you use `jupyterlab`, because the solutions are then hidden by default.

The workshop has 5 parts:
* Loading Data
* Data Engineering
* Data Exploration
* Feature Engineering
* Modelling

In [None]:
# Let's import some of the common libraries
import pandas as pd
import numpy as np
import plotnine as p9

# ... and update a few of the default settings
pd.set_option('display.width', 120)
pd.set_option('display.max_columns', 40)

p9.options.set_option('dpi', 300)
p9.options.set_option('figure_size', (8, 4.6))

## Loading the data

We load the data from a Snowflake database using the `snowflake-connector-python` package and the cursor's `fetch_pandas_all()` method.

In [None]:
import snowflake.connector

snowflake_user = ''
snowflake_password = ''

conn = snowflake.connector.connect(
    account='da84422.west-europe.azure',
    user=snowflake_user,
    password=snowflake_password,
    database='GF_PROD_DB',
    schema='CURATED_HACKATHON',
    )

cur = conn.cursor()

try:
    cur.execute("select * from GF_PROD_DB.CURATED_HACKATHON.V_DATA;")
    heat_data=cur.fetch_pandas_all()
    print('data:')
    print(heat_data.head(3))
    cur.execute("select * from GF_PROD_DB.CURATED_HACKATHON.V_METADATA;")
    metadata=cur.fetch_pandas_all()
    print('\nMetadata:')
    print(metadata.head(3))
finally:
    cur.close()
    
conn.close()

#### EXERCISE 1
* Is there any missing data? If yes, in which columns and how many datapoints?

In [None]:
# Write your solution for Exercise 1 here



#### EXERCISE 1 SOLUTION (PARTIAL)

In [None]:
# Consider using the Pandas DataFrame's built-in function `info()`.
# Q: Which columns have missing data?
# Q2: What happens if you use `isna()` on that column?
# Q3: Try applying `sum()` after `isna()`.

#### EXERCISE 1 SOLUTION (FULL)

In [None]:
# Let's use the `info()` function to get an overview.
heat_data.info()
metadata.info()

# With this we notice that `LOCATION_ELEVATION` has a few missing values.
print(f'\nLOCATION_ELEVATION is missing for {metadata.LOCATION_ELEVATION.isna().sum()} out of {len(metadata)} buildings.')
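The `isna().sum()` pattern from the hints also works on a whole DataFrame at once. A minimal sketch on a made-up table (the `toy` name and its values are invented for illustration and are not the workshop data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `metadata`; the values are invented for illustration.
toy = pd.DataFrame({
    'METER_ID': [1, 2, 3, 4],
    'LOCATION_ELEVATION': [12.0, np.nan, 7.5, np.nan],
})

# isna() gives a boolean table; sum() counts the True values per column.
missing_per_column = toy.isna().sum()

# Show only the columns that actually have missing values.
print(missing_per_column[missing_per_column > 0])
```

Filtering on `> 0` keeps the output short when a table has many complete columns.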

#### EXERCISE 1 END

In [None]:
# Let's merge the heat meter data and metadata
# NOTE: In the real world, I would also split data here, but for simplicity we do that later.
data = metadata.merge(heat_data, how='left', on='METER_ID')
data.head(4)
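A left merge silently keeps meters that have no readings and silently duplicates rows if the key is not unique. The sketch below, on invented toy tables, shows how the `validate` and `indicator` arguments of `merge()` can surface both cases; the table names and values are assumptions, only the `METER_ID` key mirrors the workshop data.

```python
import pandas as pd

# Invented toy tables; only the METER_ID key mirrors the workshop data.
toy_metadata = pd.DataFrame({
    'METER_ID': [1, 2, 3],
    'BUILDING_TYPE': ['residential', 'residential', 'non-residential'],
})
toy_heat = pd.DataFrame({
    'METER_ID': [1, 1, 2],
    'ENERGY': [10.0, 11.5, 3.0],
})

merged = toy_metadata.merge(
    toy_heat,
    how='left',
    on='METER_ID',
    validate='one_to_many',  # raises if METER_ID is duplicated in the metadata
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Meters that have metadata but no heat readings at all.
meters_without_readings = merged.loc[merged['_merge'] == 'left_only', 'METER_ID'].tolist()
print(meters_without_readings)
```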

## Data Engineering
The goal of this section is to make the data understandable, usable, and trustworthy.

In [None]:
# Let's select a meter_id to look at
meter_id_to_plot = metadata.sample(n=1)['METER_ID'].iloc[0]
print(f'We are investigating METER_ID = {meter_id_to_plot}')

In [None]:
# Let's create a new dataframe only containing data from our selected METER_ID
from siuba import _, select, mutate, group_by, ungroup, filter, summarize

data_to_plot = (data >> filter(_.METER_ID == meter_id_to_plot)).reset_index()

In [None]:
# Let's create a scatter plot of the ENERGY column
from plotnine import ggplot, aes
from plotnine import ggtitle, facet_wrap
from plotnine import theme, element_text, xlim, scale_y_log10
from plotnine import geom_point, geom_histogram, geom_density, geom_bar, geom_col

(
    ggplot(data_to_plot, aes(x='TIMESTAMP', y='ENERGY')) + 
    geom_point() +
    theme(axis_text_x=element_text(rotation=90, hjust=0.5))
)

In [None]:
# Let's have a look at all four values in heat_data at once.
import patchworklib as pw

p1 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='ENERGY')) + geom_point())
p2 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='VOLUME')) + geom_point())
p3 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='FORWARD_TEMPERATURE_CUMULATIVE')) + geom_point())
p4 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='RETURN_TEMPERATURE_CUMULATIVE')) + geom_point())

p = (p1 | p2) / (p3 | p4)

p.savefig()  # This is an artifact from Patchworklib and does not save anything, but it's required to display the figure.

In [None]:
# They are all cumulative, so let's convert time, energy, volume, and the temperatures to delta values.
from siuba.dply.vector import lead, lag

data = (
    data >>
    group_by('METER_ID') >>
    mutate(
        TIME_DELTA = _.TIMESTAMP - lag(_.TIMESTAMP, n=1, default=np.nan),
        ENERGY_DELTA = _.ENERGY - lag(_.ENERGY, n=1, default=None),
        VOLUME_DELTA = _.VOLUME - lag(_.VOLUME, n=1, default=None),
        FORWARD_TEMPERATURE_DELTA = _.FORWARD_TEMPERATURE_CUMULATIVE - lag(_.FORWARD_TEMPERATURE_CUMULATIVE, n=1, default=None),
        RETURN_TEMPERATURE_DELTA = _.RETURN_TEMPERATURE_CUMULATIVE - lag(_.RETURN_TEMPERATURE_CUMULATIVE, n=1, default=None)
    ) >>
    ungroup()
)

data.head(4)
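The same per-meter deltas can be written in plain pandas, which may help if the siuba pipeline feels opaque. This is a sketch on invented readings, not the workshop's implementation. The explicit sort is an addition: the lag-based code above assumes the rows are already ordered by time within each meter.

```python
import pandas as pd

# Invented cumulative readings for two meters.
toy = pd.DataFrame({
    'METER_ID': [1, 1, 1, 2, 2],
    'TIMESTAMP': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04',
                                 '2023-01-01', '2023-01-02']),
    'ENERGY': [100.0, 103.0, 109.0, 50.0, 51.0],
})

# Sort so that "previous row" really means "previous reading of the same meter".
toy = toy.sort_values(['METER_ID', 'TIMESTAMP'])

# diff() within each group is the current value minus the lagged value.
grouped = toy.groupby('METER_ID')
toy['TIME_DELTA'] = grouped['TIMESTAMP'].diff()
toy['ENERGY_DELTA'] = grouped['ENERGY'].diff()
print(toy)
```

The first reading of every meter has no predecessor, so its deltas are missing (`NaT` and `NaN`).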

In [None]:
# Let's have a look at the four newly created columns
data_to_plot = (data >> filter(_.METER_ID == meter_id_to_plot)).reset_index()

p1 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='ENERGY_DELTA')) + geom_point())
p2 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='VOLUME_DELTA')) + geom_point())
p3 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='FORWARD_TEMPERATURE_DELTA')) + geom_point())
p4 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='RETURN_TEMPERATURE_DELTA')) + geom_point())

p = (p1 | p2) / (p3 | p4)

p.savefig()  # This is an artifact from Patchworklib and does not save anything, but it's required to display the figure.

In [None]:
# The measurements are *supposed* to be daily, but let's make a sanity check
(
    ggplot(data >> filter(_.TIME_DELTA.notna()), aes('TIME_DELTA')) +
    geom_histogram(bins=40, fill='#e66066', color='black') +
    scale_y_log10() +
    ggtitle('Distribution of time between measurements (for all buildings)')
)


In [None]:
# Let's make the daily estimates of each column.
data = (
    data >>
    mutate(
        ENERGY_DAILY = _.ENERGY_DELTA * (pd.Timedelta(hours=24) / _.TIME_DELTA),
        VOLUME_DAILY = _.VOLUME_DELTA * (pd.Timedelta(hours=24) / _.TIME_DELTA),
        FORWARD_TEMPERATURE_DAILY = _.FORWARD_TEMPERATURE_DELTA * (pd.Timedelta(hours=24) / _.TIME_DELTA),
        RETURN_TEMPERATURE_DAILY = _.RETURN_TEMPERATURE_DELTA * (pd.Timedelta(hours=24) / _.TIME_DELTA),
    )
)

data.head(4)
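To see what the normalisation does, consider the same energy delta measured over gaps of different length. The sketch below uses invented numbers; `gaps_per_day` is a helper name introduced here for illustration.

```python
import pandas as pd

# Invented deltas: the same 6 units of energy over a 1-day, 2-day and 12-hour gap.
toy = pd.DataFrame({
    'ENERGY_DELTA': [6.0, 6.0, 6.0],
    'TIME_DELTA': pd.to_timedelta(['1 days', '2 days', '12 hours']),
})

# Dividing two timedeltas yields a plain float: how many such gaps fit in one day.
gaps_per_day = pd.Timedelta(hours=24) / toy['TIME_DELTA']
toy['ENERGY_DAILY'] = toy['ENERGY_DELTA'] * gaps_per_day
print(toy)
```

A two-day gap halves the value and a twelve-hour gap doubles it, so every row becomes comparable as a per-day rate.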

In [None]:
# Let's compare the 'energy_delta' and the 'energy_daily'.
data_to_plot = (data >> filter(_.METER_ID == meter_id_to_plot)).reset_index()

p1 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='ENERGY_DELTA')) + geom_point() + ggtitle('This plot shows the non-normalised values'))
p2 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='ENERGY_DAILY')) + geom_point() + ggtitle('This plot shows the normalised values with respect to the time gap'))
p = (p1 | p2)
p.savefig()

In [None]:
# ... and similar comparison for 'forward_temperature'.
p1 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='FORWARD_TEMPERATURE_DELTA')) + geom_point() + ggtitle('This plot shows the non-normalised values'))
p2 = pw.load_ggplot(ggplot(data_to_plot, aes(x='TIMESTAMP', y='FORWARD_TEMPERATURE_DAILY')) + geom_point() + ggtitle('This plot shows the normalised values with respect to the time gap'))
p = (p1 | p2)
p.savefig()

In [None]:
# Let's convert forward and return temperatures to Celsius.
# NOTE: As `VOLUME_DAILY` might be 0, we replace np.inf and -np.inf with 0.
data = (
    data >>
    mutate(
        FORWARD_TEMPERATURE_CELCIUS_DAILY = _.FORWARD_TEMPERATURE_DAILY / _.VOLUME_DAILY,
        RETURN_TEMPERATURE_CELCIUS_DAILY = _.RETURN_TEMPERATURE_DAILY / _.VOLUME_DAILY,
    )
).replace([np.inf, -np.inf], 0)

data.head(4)
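Dividing by the volume only yields degrees if the cumulative temperature registers grow by volume times temperature. That is an assumption inferred from the code above, since the units are not stated here. The sketch below, on invented numbers, also shows what division by a zero volume produces.

```python
import numpy as np
import pandas as pd

# Invented daily values. Assumption: the temperature register grows by
# volume * temperature, so dividing by volume recovers degrees Celsius.
toy = pd.DataFrame({
    'VOLUME_DAILY': [2.0, 0.0, 0.0],
    'FORWARD_TEMPERATURE_DAILY': [140.0, 5.0, 0.0],
})

# The three rows compute 140/2, 5/0 and 0/0.
ratio = toy['FORWARD_TEMPERATURE_DAILY'] / toy['VOLUME_DAILY']
print(ratio.tolist())

# x/0 becomes +/-inf, but 0/0 becomes NaN, which replacing infinities leaves untouched.
cleaned = ratio.replace([np.inf, -np.inf], 0)
print(cleaned.tolist())
```

Days with no flow and no register change therefore remain `NaN` and need separate handling, for example when features are aggregated later.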

In [None]:
# The temperatures should be between 0 and 100 degrees Celsius. Let's make another sanity check.
(
    ggplot(data) +
    geom_histogram(aes('FORWARD_TEMPERATURE_CELCIUS_DAILY'), bins=40, fill='red', color='black', alpha=0.6) +
    geom_histogram(aes('RETURN_TEMPERATURE_CELCIUS_DAILY'), bins=40, fill='blue', color='black', alpha=0.6) +
    p9.scale_y_log10() +
    ggtitle('Distribution of forward and return temperatures, respectively.')
)

#### EXERCISE 2
* Create a new column, `TEMPERATURE_DIFFERENCE_CELCIUS_DAILY`, which shows the difference between the Forward and the Return temperature.
* Create a scatter plot showing `FORWARD_TEMPERATURE_CELCIUS_DAILY` in red and `RETURN_TEMPERATURE_CELCIUS_DAILY` in blue. **HINT**: Add two `geom_point()` layers to the same ggplot. See: https://plotnine.readthedocs.io/en/stable/generated/plotnine.geoms.geom_point.html
* Create a scatter plot showing the new column `TEMPERATURE_DIFFERENCE_CELCIUS_DAILY`.
* Combine the two plots using Patchworklib.

In [None]:
# Write your solution for Exercise 2 here



#### EXERCISE 2 SOLUTION (PARTIAL)

In [None]:
# Let's calculate the temperature difference.
# Q: What should we write inside the mutate() function?
data = (
    data >>
    mutate()
)

In [None]:
# Let's update data_to_plot with the newly created column.
data_to_plot = (data >> filter(_.METER_ID == meter_id_to_plot)).reset_index()

In [None]:
# Let's make the plot with forward and return temperature.
# Q: What goes into the two geom_points?
# HINT: Look at the plot above with two geom_histogram() functions
(
    ggplot(data_to_plot, aes(x='TIMESTAMP')) +
    geom_point() +
    geom_point() +
    theme(axis_text_x = p9.element_text(rotation=90, hjust=0.35)) +
    ggtitle('The forward (red) and return (blue) temperatures.')
)

In [None]:
# Let's combine the two plots with Patchworklib.
# Q: How do we combine p1 and p2?

p1 = pw.load_ggplot(
    ggplot(data_to_plot, aes(x='TIMESTAMP')) +
    geom_point(aes(y='FORWARD_TEMPERATURE_CELCIUS_DAILY'), color='red') +
    geom_point(aes(y='RETURN_TEMPERATURE_CELCIUS_DAILY'), color='blue') +
    theme(axis_text_x = p9.element_text(rotation=90, hjust=0.35)) +
    ggtitle('The forward (red) and return (blue) temperatures.')
)
p2 = pw.load_ggplot(
    ggplot(data_to_plot, aes(x='TIMESTAMP',y='TEMPERATURE_DIFFERENCE_CELCIUS_DAILY')) +
    geom_point() +
    theme(axis_text_x = p9.element_text(rotation=90, hjust=0.35)) +
    ggtitle('The difference between forward and return temperature.')
)

p = 

p.savefig()

#### EXERCISE 2 SOLUTION (FULL)

In [None]:
# Let's calculate the temperature difference
data = (
    data >>
    mutate(TEMPERATURE_DIFFERENCE_CELCIUS_DAILY = _.FORWARD_TEMPERATURE_CELCIUS_DAILY - _.RETURN_TEMPERATURE_CELCIUS_DAILY)
)

# Let's plot the three temperatures
data_to_plot = (data >> filter(_.METER_ID == meter_id_to_plot)).reset_index()

p1 = pw.load_ggplot(
    ggplot(data_to_plot, aes(x='TIMESTAMP')) +
    geom_point(aes(y='FORWARD_TEMPERATURE_CELCIUS_DAILY'), color='red') +
    geom_point(aes(y='RETURN_TEMPERATURE_CELCIUS_DAILY'), color='blue') +
    theme(axis_text_x = p9.element_text(rotation=90, hjust=0.35)) +
    ggtitle('The forward (red) and return (blue) temperatures.')
)
p2 = pw.load_ggplot(
    ggplot(data_to_plot, aes(x='TIMESTAMP',y='TEMPERATURE_DIFFERENCE_CELCIUS_DAILY')) +
    geom_point() +
    theme(axis_text_x = p9.element_text(rotation=90, hjust=0.35)) +
    ggtitle('The difference between forward and return temperature.')
)
p = (p1 | p2)
p.savefig()

## Data Exploration
In this section we take a close look at the data and metadata, and try to get an intuitive understanding of the data and of what might impact our target: the type of building.

In [None]:
# First, let's see the distribution of our target: `BUILDING_TYPE`
(
    ggplot(metadata, aes('BUILDING_TYPE', fill='BUILDING_TYPE')) +
    p9.geom_bar(color='black') +
    ggtitle('The counts of residential and non-residential meters included in the dataset')
)

In [None]:
# Let's consider the heat meter features and start with the temperature difference.
(
    ggplot(data, aes('TEMPERATURE_DIFFERENCE_CELCIUS_DAILY', fill='BUILDING_TYPE')) +
    geom_histogram(bins=30, color='black') +
    xlim(-5, 60) +
    ggtitle('The distribution of the temperature differences (for all meters)')
)

In [None]:
# ... given the uneven distribution of residential and non-residential buildings, the above is hard to decipher. Let's split the plot using facet_wrap().
(
    ggplot(data, aes('TEMPERATURE_DIFFERENCE_CELCIUS_DAILY', fill='BUILDING_TYPE')) +
    geom_histogram(bins=30, color='black') +
    xlim(-5, 60) +
    facet_wrap('~BUILDING_TYPE', scales='free_y') +
    ggtitle('The distribution of the temperature differences (for all meters)')
)

In [None]:
# ... and repeat this for ENERGY_DAILY.
(
    ggplot(data, aes('ENERGY_DAILY', fill='BUILDING_TYPE')) +
    geom_histogram(bins=100, color='black') +
    xlim(0, 1) +
    facet_wrap('~BUILDING_TYPE', scales='free_y') +
    ggtitle('The distribution of the daily energy consumption (for all meters)')
)

In [None]:
# Let's look at a couple of the metadata variables.
(
    ggplot(metadata, aes('BUILT_UPON_AREA', fill='BUILDING_TYPE')) +
    geom_histogram(bins=40, color='black') +
    xlim(0, 2000) +
    facet_wrap('~BUILDING_TYPE', scales='free_y') +
    ggtitle('The distribution of the built upon area (for all meters)')
)

#### EXERCISE 3
* Make a histogram plot of the `LOCATION_ELEVATION` split by `BUILDING_TYPE`
* What does this result tell us? (Can you use this to "formulate" a simple algorithm?)
* I think this signal/result is surprising. Can you explain why it (probably) won't generalize beyond this dataset to other cities?

In [None]:
# Write your solution for Exercise 3 here



#### EXERCISE 3 SOLUTION (PARTIAL)

In [None]:
# Let's look at the `LOCATION_ELEVATION`
# Q: Which aesthetic should we look at and how do we want to fill (color) the bars in the histogram? I.e. fill out the aes() function below.

(
    ggplot(metadata, aes()) +
    p9.geom_histogram(bins=55, color='black') +
    p9.xlim(0, 55) +
    facet_wrap('~BUILDING_TYPE', scales='free_y') +
    ggtitle('The distribution of the elevation (for all meters)')
)

In [None]:
# Q: To figure out what this plot tells us, can you answer: above which elevation are there no non-residential buildings?

In [None]:
# Q: Do you think it is always the case that non-residential buildings are close to sea level?

#### EXERCISE 3 SOLUTION (FULL)

In [None]:
# EXERCISE 3 SOLUTION

# Let's look at the `LOCATION_ELEVATION`
(
    ggplot(metadata, aes('LOCATION_ELEVATION', fill='BUILDING_TYPE')) +
    p9.geom_histogram(bins=55, color='black') +
    p9.xlim(0, 55) +
    facet_wrap('~BUILDING_TYPE', scales='free_y') +
    ggtitle('The distribution of the elevation (for all meters)')
)
# Answer: We can see that the non-residential buildings are closer to sea level in this city. This is unlikely to generalize to other cities.
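One way to turn the elevation observation into a "simple algorithm" is a single threshold rule. The sketch below is only an illustration: the elevations, the `'non-residential'` label string, and the way the threshold is chosen are all assumptions, not the workshop data or method.

```python
import pandas as pd

# Invented elevations (metres) and labels; not the workshop data.
toy = pd.DataFrame({
    'LOCATION_ELEVATION': [3.0, 4.0, 5.0, 8.0, 20.0, 35.0],
    'BUILDING_TYPE': ['non-residential', 'residential', 'non-residential',
                      'residential', 'residential', 'residential'],
})

# Hypothetical rule: everything above the highest non-residential building
# is predicted residential; everything else is predicted non-residential.
is_non_residential = toy['BUILDING_TYPE'] == 'non-residential'
threshold = toy.loc[is_non_residential, 'LOCATION_ELEVATION'].max()

toy['PREDICTION'] = 'non-residential'
toy.loc[toy['LOCATION_ELEVATION'] > threshold, 'PREDICTION'] = 'residential'

accuracy = (toy['PREDICTION'] == toy['BUILDING_TYPE']).mean()
print(f'threshold = {threshold}, accuracy on the toy data = {accuracy:.2f}')
```

A rule like this encodes the geography of one city, which is exactly why it should not be expected to transfer to another.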

## Feature engineering
The goal of this section is to extract features from the Time Series that can be used in our models.

At the end of this section we will have a pruned dataset ready to use for modeling.

In [None]:
# Let's calculate some features from the daily energy consumption using the `summarize()` function.
features = (
    data >>
    group_by('METER_ID') >>
    summarize(
        ENERGY_DAILY_MEAN = _.ENERGY_DAILY.mean(),
        ENERGY_DAILY_MEDIAN = _.ENERGY_DAILY.median(),
        ENERGY_DAILY_CV = _.ENERGY_DAILY.std() / _.ENERGY_DAILY.mean(),
        ENERGY_DAILY_AUTOCORR = _.ENERGY_DAILY.autocorr(),
    )
)

features.head(4)
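For readers more familiar with plain pandas, the four summaries can be sketched with `groupby()` on invented data. The `toy_features` name is introduced here for illustration. The coefficient of variation is the standard deviation divided by the mean, and `autocorr(lag=1)` correlates each day with the previous one.

```python
import pandas as pd

# Invented daily energy values for two meters.
toy = pd.DataFrame({
    'METER_ID': [1, 1, 1, 1, 2, 2, 2, 2],
    'ENERGY_DAILY': [1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 5.0, 9.0],
})

grouped = toy.groupby('METER_ID')['ENERGY_DAILY']
toy_features = pd.DataFrame({
    'ENERGY_DAILY_MEAN': grouped.mean(),
    'ENERGY_DAILY_MEDIAN': grouped.median(),
    # Coefficient of variation: spread relative to the mean (unitless).
    'ENERGY_DAILY_CV': grouped.std() / grouped.mean(),
    # Lag-1 autocorrelation: how similar each day is to the previous one.
    'ENERGY_DAILY_AUTOCORR': grouped.apply(lambda s: s.autocorr(lag=1)),
}).reset_index()
print(toy_features)
```

Note that the autocorrelation is undefined (`NaN`) when one side of the lagged comparison is constant, as happens for the second toy meter.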

In [None]:
# Let's add the BUILDING_TYPE and visualize.
features = metadata.merge(features, how='left', on='METER_ID')
features.head(4)

In [None]:
# Let's plot the four features we just created
# NOTE: We use geom_density here as there are (relatively) few datapoints/rows.
p1 = pw.load_ggplot(
    ggplot(features, aes('ENERGY_DAILY_MEAN', fill='BUILDING_TYPE')) +
    p9.geom_density(alpha=0.5) +
    p9.xlim(0, 1.5) +
    ggtitle('The density of the Mean of the daily energy consumption')
)

p2 = pw.load_ggplot(
    ggplot(features, aes('ENERGY_DAILY_MEDIAN', fill='BUILDING_TYPE')) +
    p9.geom_density(alpha=0.5) +
    p9.xlim(0, 1.5) +
    ggtitle('The density of the Median of the daily energy consumption')
)

p3 = pw.load_ggplot(
    ggplot(features, aes('ENERGY_DAILY_CV', fill='BUILDING_TYPE')) +
    p9.geom_density(alpha=0.5) +
    p9.xlim(0, 4) +
    ggtitle('The density of the Coefficient of Variation of the daily energy consumption')
)

p4 = pw.load_ggplot(
    ggplot(features, aes('ENERGY_DAILY_AUTOCORR', fill='BUILDING_TYPE')) +
    p9.geom_density(alpha=0.5) +
    p9.xlim(0, 1) +
    ggtitle('The density of the Autocorrelation of the daily energy consumption')
)

p = (p1 | p2) / (p3 | p4)
p.savefig()

#### EXERCISE 4
* Calculate `Mean` and `Coefficient of Variation` for daily Energy, Volume, Forward Temperature (Celsius), Return Temperature (Celsius), and Temperature Difference (Celsius).
**HINT**: overwrite the `features` DataFrame.
* Calculate another feature!
**HINT**: you can find inspiration for new features here: https://pandas.pydata.org/docs/reference/api/pandas.Series.describe.html

In [None]:
# Write your solution for Exercise 4 here



#### EXERCISE 4 SOLUTION (PARTIAL)

In [None]:
# This code calculates the `mean` and the `coefficient of variation` for energy.
# Q: Can you do the same for volume? HINT: Add more lines inside the summarize() function.

features = (
    data >>
    group_by('METER_ID') >>
    summarize(
        ENERGY_DAILY_MEAN = _.ENERGY_DAILY.mean(),
        ENERGY_DAILY_CV = _.ENERGY_DAILY.std() / _.ENERGY_DAILY.mean(),
    )
)


In [None]:
# Let's try to calculate the 25% and 75% quantiles for the energy.
# HINT: Try to apply this function: https://pandas.pydata.org/docs/reference/api/pandas.Series.quantile.html

#### EXERCISE 4 SOLUTION (FULL)

In [None]:
# EXERCISE 4 SOLUTION

features = (
    data >>
    group_by('METER_ID') >>
    summarize(
        ENERGY_DAILY_MEAN = _.ENERGY_DAILY.mean(),
        ENERGY_DAILY_MEDIAN = _.ENERGY_DAILY.median(),
        ENERGY_DAILY_CV = _.ENERGY_DAILY.std() / _.ENERGY_DAILY.mean(),
        ENERGY_DAILY_AUTOCORR = _.ENERGY_DAILY.autocorr(),
        VOLUME_DAILY_MEAN = _.VOLUME_DAILY.mean(),
        VOLUME_DAILY_CV = _.VOLUME_DAILY.std() / _.VOLUME_DAILY.mean(),
        FORWARD_TEMPERATURE_CELCIUS_DAILY_MEAN = _.FORWARD_TEMPERATURE_CELCIUS_DAILY.mean(),
        FORWARD_TEMPERATURE_CELCIUS_DAILY_CV = _.FORWARD_TEMPERATURE_CELCIUS_DAILY.std() / _.FORWARD_TEMPERATURE_CELCIUS_DAILY.mean(),
        RETURN_TEMPERATURE_CELCIUS_DAILY_MEAN = _.RETURN_TEMPERATURE_CELCIUS_DAILY.mean(),
        RETURN_TEMPERATURE_CELCIUS_DAILY_CV = _.RETURN_TEMPERATURE_CELCIUS_DAILY.std() / _.RETURN_TEMPERATURE_CELCIUS_DAILY.mean(),
        TEMPERATURE_DIFFERENCE_CELCIUS_DAILY_MEAN = _.TEMPERATURE_DIFFERENCE_CELCIUS_DAILY.mean(),
        TEMPERATURE_DIFFERENCE_CELCIUS_DAILY_CV = _.TEMPERATURE_DIFFERENCE_CELCIUS_DAILY.std() / _.TEMPERATURE_DIFFERENCE_CELCIUS_DAILY.mean(),
        ENERGY_DAILY_Q25 = _.ENERGY_DAILY.quantile(q=0.25),
        ENERGY_DAILY_Q75 = _.ENERGY_DAILY.quantile(q=0.75),
        TEMPERATURE_DIFFERENCE_CELCIUS_DAILY_MIN = _.TEMPERATURE_DIFFERENCE_CELCIUS_DAILY.min(),
        TEMPERATURE_DIFFERENCE_CELCIUS_DAILY_MAX = _.TEMPERATURE_DIFFERENCE_CELCIUS_DAILY.max(),
    )
)

features.head(4)

#### EXERCISE 4 END

In [None]:
# Let's create our dataset for modelling. First, let's remove unnecessary columns.
data_final = (
    metadata >>
    select(- _.contains('UNIT')) >>
    select(- _.TIMESTAMP_TIMEZONE) >>
    select(- _.METER_TYPE)
).merge(features, how='left', on='METER_ID')

data_final.info()

## Modeling
The goal of this section is to train and test a simple model to predict the BUILDING_TYPE!

In [None]:
# First we split data into train and test
from sklearn.model_selection import train_test_split

train, test = train_test_split(data_final, test_size = 0.25, stratify=data_final['BUILDING_TYPE'])
print(f'Number of train examples: {len(train)} out of {len(data_final)}.')
print(f'Number of test examples: {len(test)} out of {len(data_final)}.')

X_train = (
    train >> 
    select(- _.BUILDING_TYPE) 
).fillna(0)

y_train = (
    train >> 
    select(_.BUILDING_TYPE)
)

X_test = (
    test >> 
    select(- _.BUILDING_TYPE)
).fillna(0)

y_test = (
    test >> 
    select(_.BUILDING_TYPE)
)

In [None]:
# Let's train a decision tree
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score, balanced_accuracy_score

tree_accuracy = accuracy_score(y_test, y_pred)
tree_balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
print(f'Our tree model has an accuracy of {tree_accuracy} and a balanced accuracy of {tree_balanced_accuracy}')
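Because the classes are unevenly distributed, plain accuracy can look good for a model that has learned nothing. Balanced accuracy is the mean of the per-class recalls, which the sketch below computes by hand on invented labels. The label strings and the always-majority "model" are assumptions for illustration.

```python
import numpy as np

# Invented labels: 8 residential and 2 non-residential buildings.
y_true = np.array(['residential'] * 8 + ['non-residential'] * 2)
# A "model" that always predicts the majority class.
y_hat = np.array(['residential'] * 10)

accuracy = np.mean(y_true == y_hat)

# Balanced accuracy is the mean of the per-class recalls.
recalls = [np.mean(y_hat[y_true == label] == label) for label in np.unique(y_true)]
balanced_accuracy = np.mean(recalls)

print(f'accuracy = {accuracy:.2f}, balanced accuracy = {balanced_accuracy:.2f}')
```

The majority-class guesser reaches 80% accuracy but only 50% balanced accuracy, the same as random guessing between two classes.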

In [None]:
# Let's have a look at the feature importance of our decision tree
tree_feature_importance = clf.feature_importances_
print(tree_feature_importance)

importance_data = pd.DataFrame({'feature': X_train.columns, 'importance': tree_feature_importance})

(
    ggplot(importance_data, aes(x='feature', y='importance')) +
    geom_col(fill='#e66066', color='black') +
    theme(axis_text_x=element_text(rotation=90, hjust=0.5))
)

In [None]:
# Let's train an SVM model (Support Vector Machine)
from sklearn import svm

clf = svm.SVC()
clf = clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
svm_accuracy = accuracy_score(y_test, y_pred)
svm_balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
print(f'Our SVM model has an accuracy of {svm_accuracy} and a balanced accuracy of {svm_balanced_accuracy}')
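Unlike decision trees, SVMs are sensitive to the scale of the input features, so a feature measured in large units can dominate the others. A common remedy is to standardise the features inside a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the workshop features; the names `X_tr`, `X_te` and `scaled_svm` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the workshop features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1000  # put one feature on a much larger scale than the rest

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The scaler is fitted on the training fold only, then reused for prediction.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)
print(f'accuracy with scaling: {scaled_svm.score(X_te, y_te):.2f}')
```

Putting the scaler in the pipeline keeps test-set statistics out of the training step, which avoids a subtle form of data leakage.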

In [None]:
# Let's train an SGD model (Stochastic Gradient Descent)
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
clf = clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

In [None]:
sgd_accuracy = accuracy_score(y_test, y_pred)
sgd_balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
print(f'Our SGD model has an accuracy of {sgd_accuracy} and a balanced accuracy of {sgd_balanced_accuracy}')

#### EXERCISE 5
* Select another scikit-learn classification model and train it.
**HINT**: See: https://scikit-learn.org/stable/supervised_learning.html
* Implement the `precision` metric from scikit-learn (assuming residential is the positive/default label).
**HINT**: See: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

In [None]:
# Write your solution for Exercise 5 here



#### EXERCISE 5 SOLUTION (PRECISION-QUESTION, FULL)

In [None]:
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred, pos_label='residential')
precision
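To make the metric concrete, precision can also be computed by hand: of all buildings predicted as residential, which fraction really is residential? The sketch below uses invented labels and predictions, and the `'non-residential'` label string is an assumption.

```python
import numpy as np

# Invented labels and predictions; 'residential' is the positive class.
y_true = np.array(['residential', 'residential', 'residential',
                   'non-residential', 'non-residential'])
y_hat = np.array(['residential', 'residential', 'non-residential',
                  'residential', 'non-residential'])

predicted_positive = y_hat == 'residential'
true_positives = np.sum(predicted_positive & (y_true == 'residential'))
false_positives = np.sum(predicted_positive & (y_true != 'residential'))

# Precision: of everything predicted residential, how much really is residential?
precision_by_hand = true_positives / (true_positives + false_positives)
print(f'precision = {precision_by_hand:.3f}')
```

Here two of the three "residential" predictions are correct, so the precision is 2/3.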