# Getting Started with Machine Learning

* Instructor: [Saber Taghvaeeyan](https://www.linkedin.com/in/saber-taghvaeeyan-bb285739/)

## Agenda

1. Machine Learning Workflow
1. Jupyter environment
1. Download sample dataset
1. Data pre-processing
1. Training a model
1. Testing the model
1. What to try next

We will end with time for questions.

## 1. Machine Learning Workflow
The following figure shows the machine learning ecosystem at a high level.

![ml-eco](./doc/media/ml-eco.png)

In this tutorial, our focus will be on the development phase.

The development phase is outlined in the following figure.

![ml-dev](./doc/media/ml-dev.png)


## 2. Jupyter environment
Let's start with a quick review of the Jupyter environment and some simple tricks.

In [None]:
# Use this cell for some simple commands.
# Press ctrl+enter to execute a cell
# Use shift+enter to execute a cell and move on to the next cell
a = 1
b = ...

# Print the sum: a+b
print(...)
print(...)

## 3. Download sample dataset
We will download a sample dataset: the "Appliances Energy Prediction Dataset".

More information about this dataset is available here:
https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction

**Attribute Information:**

| Name | Description | Units |
| ---      |  ------  |---------:|
| date   | Timestamp   | year-month-day hour:minute:second   |
| Appliances   | Energy use | Wh |
| lights | Energy use of light fixtures in the house | Wh |
| T1 | Temperature in kitchen area | C |
| RH_1 | Humidity in kitchen area | % |
| T2 | Temperature in living room area | C |
| RH_2 | Humidity in living room area | % | 
| T3 | Temperature in laundry room area | C |
| RH_3 | Humidity in laundry room area | % |
| T4 | Temperature in office room | C |
| RH_4 | Humidity in office room | % |
| T5 | Temperature in bathroom | C |
| RH_5 | Humidity in bathroom | % |
| T6 | Temperature outside the building (north side) | C |
| RH_6 | Humidity outside the building (north side) | % |
| T7 | Temperature in ironing room | C |
| RH_7 | Humidity in ironing room | % |
| T8 | Temperature in teenager room 2 | C |
| RH_8 | Humidity in teenager room 2 | % |
| T9 | Temperature in parents room | C |
| RH_9 | Humidity in parents room | % |
| T_out | Temperature outside (from Chievres weather station) | C |
| Press_mm_hg | Pressure (from Chievres weather station) | mm Hg | 
| RH_out | Humidity outside (from Chievres weather station) | % |
| Windspeed | Wind speed (from Chievres weather station) | m/s |
| Visibility | Visibility (from Chievres weather station) | km |
| Tdewpoint | Dew point (from Chievres weather station) | °C |
| rv1 | Random variable 1 | nondimensional | 
| rv2 | Random variable 2 | nondimensional | 

### A. Download the data as a DataFrame 
We can use pandas to load the data into a DataFrame, either from a local CSV file or directly from a web address.

In [None]:
import pandas as pd

# Let's get a sample dataset as a pandas dataframe
df = pd.read_csv("energydata_complete.csv")

# Alternatively, we can directly download it from the web
# df = pd.read_csv("https://github.com/LuisM78/Appliances-energy-prediction-data/raw/master/energydata_complete.csv")

# Print a few rows of the data, complete the following line:
df...

### B. Get some information about the dataset
We can use some of the built-in *methods* of a DataFrame to gain more insight into our dataset.

In [None]:
# How many samples do we have in this data set? Complete the following line
print('Total number of samples: ', ...)
print('')

# Get dataset initial stats
print("Dataset stats: ")
print(...)
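
As a rough illustration of the DataFrame methods typically used for this kind of inspection, here is a minimal sketch on a tiny made-up frame. The name `toy_df` and all of its values are hypothetical and are not taken from the energy dataset; `head`, `shape`, and `describe` are standard pandas members.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the values are made up.
toy_df = pd.DataFrame(
    {
        "T2": [19.2, 19.4, 19.1, 20.0],
        "T9": [17.0, 17.1, 16.9, 17.5],
        "lights": [30, 30, 20, 40],
    }
)

# Show the first two rows.
print(toy_df.head(2))

# shape is a (rows, columns) tuple, so the first entry is the sample count.
n_rows, n_cols = toy_df.shape
print("Total number of samples:", n_rows)

# describe() reports count, mean, std, min, quartiles, and max per column.
stats = toy_df.describe()
print(stats)
```

The same calls should work unchanged on the `df` loaded above, since they do not depend on the column names.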


### C. Visualize the data
We can use the `matplotlib` module to plot and visualize our data.

In [None]:
# Let's visualize some of the data
import matplotlib.pyplot as plt

n_samples = 1000
feature_name = "T9"
target_name = "T2"

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(df[feature_name].values[:n_samples], 'b-')
ax2.plot(df[target_name].values[:n_samples], 'g-')
ax1.set_xlabel('Samples')
ax1.set_ylabel(feature_name, color='b')
ax2.set_ylabel(target_name, color='g')
plt.show()

## 4. Data pre-processing
Most of the time, the data needs to be *prepared* before it can be used to develop machine learning models. We go over some of the basic steps in this section.

### A. Create input and output
We should extract the inputs (i.e. *data*) and outputs (i.e. *target*) from the DataFrame. We will use the living room temperature as the target, and only a subset of the available features to speed up training.
We will also exclude some of the other temperatures, as they may be highly correlated with the living room temperature.

In [None]:
features_to_use = ["lights", # energy use of light fixtures in the house in Wh
                   "T4", # Temperature in office room, in Celsius
                   "T6", # Temperature outside the building (north side), in Celsius
                   "T7", # Temperature in ironing room, in Celsius
                   "T8", # Temperature in teenager room 2, in Celsius
                   "T9", # Temperature in parents room, in Celsius
                   "T_out", # Temperature outside (from Chievres weather station), in Celsius
                   "Press_mm_hg", # Pressure (from Chievres weather station), in mm Hg
                   "RH_out", # Humidity outside (from Chievres weather station), in %
                   "Windspeed", # Windspeed (from Chievres weather station), in m/s
                   "Visibility", # Visibility (from Chievres weather station), in km
                   "Tdewpoint" # Dew point (from Chievres weather station), in °C
                  ]
target_name = "T2"

data = df[...]
target = df[...]
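
To make the column-selection idea and the correlation concern more concrete, the following sketch builds a small synthetic frame in which one column is almost a copy of the target. Everything here (`toy_df`, `toy_features`, the synthetic `T1` column, the random values) is a hypothetical illustration and not the tutorial's actual data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-in for the real dataset; all values are synthetic.
t2 = 20.0 + rng.normal(size=n)
toy_df = pd.DataFrame(
    {
        "T1": t2 + 0.1 * rng.normal(size=n),  # nearly a copy of the target
        "lights": rng.integers(0, 50, size=n).astype(float),
        "T2": t2,
    }
)

toy_features = ["lights"]  # assumed choice: leave out the near-duplicate "T1"
toy_target = "T2"

toy_data = toy_df[toy_features]    # list of names -> DataFrame (2-D)
toy_labels = toy_df[toy_target]    # single name -> Series (1-D)

# Correlation of every column with the target helps spot near-duplicates.
corr_with_target = toy_df.corr()[toy_target]
print(corr_with_target)
```

Indexing with a list of names returns a DataFrame, while indexing with a single name returns a Series, which is the usual shape for inputs and targets respectively.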

### B. Split the data into train, test, validation
To train a model and evaluate its performance, we divide the data into training, validation, and test sets.

We will use the training and validation sets to design the architecture, train the model, and optimize the hyperparameters. We will then use the test set to report the final error.

In [None]:
# Import the scikit-learn data splitting function
from sklearn.model_selection import train_test_split

# Determine train test splits
test_ratio = ...

# Split the data into training and testing
x_trn, x_tst, y_trn, y_tst = train_test_split(..., ..., test_size=..., shuffle=..., random_state=0)

# Split the training data into training and validation
x_trn, x_vld, y_trn, y_vld = train_test_split(..., ..., test_size=..., shuffle=..., random_state=0)

# Print how many samples we have in each set, complete the following lines
print("Number of samples in the training set: ", x_trn...)
print("Number of samples in the validation set: ", x_vld...)
print("Number of samples in the test set: ", x_tst...)
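
The two-stage split described above can be sketched on synthetic arrays as follows. The ratios (`test_ratio = 0.2`, `vld_ratio = 0.25`) and the use of `shuffle=True` are assumptions made for illustration only; for time-ordered data, `shuffle=False` would keep the samples in chronological order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

test_ratio = 0.2   # assumed: 20% of all samples held out for testing
vld_ratio = 0.25   # assumed: 25% of the remaining 80%, i.e. 20% overall

# First split: separate the test set from everything else.
X_rest, X_tst, y_rest, y_tst = train_test_split(
    X, y, test_size=test_ratio, shuffle=True, random_state=0
)

# Second split: carve the validation set out of the remaining samples.
X_trn, X_vld, y_trn, y_vld = train_test_split(
    X_rest, y_rest, test_size=vld_ratio, shuffle=True, random_state=0
)

print("train / validation / test:", X_trn.shape[0], X_vld.shape[0], X_tst.shape[0])
```

Note that the second ratio is relative to what is left after the first split, not to the full dataset.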

### C. Normalize the Data

Most of the time, we should "prepare" our data and make it ready for model development. The preparation might include
- proper formatting
- handling missing data
- converting categorical to numerical values
- normalization

Here, we only need to normalize the data since the other items are not applicable or have been addressed. Can you explain why we need to normalize the data?

In [None]:
# Normalize the data, complete the following lines
mean = x_trn...
std = x_trn...
x_trn = (x_trn - ...)/...
x_vld = (x_vld - ...)/...
x_tst = (x_tst - ...)/...
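
One common way to normalize is standardization: subtract the per-feature mean and divide by the per-feature standard deviation, with both statistics computed on the training set only and then reused for the other sets. The sketch below shows this on synthetic data; the feature scales (one column near 750, one near 3) are made-up values chosen to mimic features with very different ranges.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (values are made up).
x_trn = np.column_stack([rng.normal(750.0, 5.0, 500), rng.normal(3.0, 1.0, 500)])
x_tst = np.column_stack([rng.normal(750.0, 5.0, 100), rng.normal(3.0, 1.0, 100)])

# Statistics come from the training set only, one value per feature column.
mean = x_trn.mean(axis=0)
std = x_trn.std(axis=0)

# The same training statistics are applied to every split.
x_trn_norm = (x_trn - mean) / std
x_tst_norm = (x_tst - mean) / std

print("train mean after scaling:", x_trn_norm.mean(axis=0))
print("train std after scaling: ", x_trn_norm.std(axis=0))
```

Reusing the training statistics for the validation and test sets avoids leaking information from those sets into the model, so their scaled means will be close to, but not exactly, zero.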

## 5. Training a model
Now that we have prepared the data and split it into train, validation, and test sets, we can train ML models.

Several different models are available in scikit-learn. We will start with a basic linear regression model, and then look at other regressors as well.

In [None]:
# Import required packages
import numpy as np
from sklearn.linear_model import ...

# Create an instance of the model, complete the following line
regr = ...

# Train the model, complete the following line
regr.fit(..., ...)

# Calculate the training error and print, complete the following lines
y_trn_prd = regr.predict(...)
trn_error = ...
print("Training Error: {:.3f} \n".format(...))

# Calculate the validation error, complete the following lines
y_vld_prd = regr.predict(...)
vld_error = ...
print("Validation Error: {:.3f} \n".format(...))
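
For reference, here is a minimal sketch of the fit/predict/error pattern on synthetic data. It assumes scikit-learn's `LinearRegression` and uses the mean absolute error, the same metric the test cell below computes; the data, the weights in `true_w`, and the 200/100 split are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic regression problem: a linear signal plus a little noise.
X = rng.normal(size=(300, 3))
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical ground-truth weights
y = X @ true_w + 20.0 + rng.normal(scale=0.1, size=300)

# Simple hold-out split for illustration.
X_trn, X_vld = X[:200], X[200:]
y_trn, y_vld = y[:200], y[200:]

model = LinearRegression()
model.fit(X_trn, y_trn)

# Mean absolute error on each set.
trn_mae = np.mean(np.abs(y_trn - model.predict(X_trn)))
vld_mae = np.mean(np.abs(y_vld - model.predict(X_vld)))

print("Training Error: {:.3f}".format(trn_mae))
print("Validation Error: {:.3f}".format(vld_mae))
```

A validation error much larger than the training error would usually suggest overfitting; here the two should be similar because the data really is linear.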

## 6. Testing the model
Once we have trained the model and finalized the parameters, we can see how it performs on our test set.

In [None]:
# Once we have decided on the parameters, we can print the test error
y_tst_prd = regr.predict(...)
tst_error = np.mean(np.abs(y_tst - y_tst_prd))
print("Test Error: {:.3f}".format(tst_error))

In [None]:
# Make predictions on the full dataset, normalized with the training statistics
target_prd = regr.predict((data-mean)/std)
samples_to_plot = 1000
plt.figure(figsize=(10, 4))
plt.plot(target[:samples_to_plot], label='Target')
plt.plot(target_prd[:samples_to_plot], "--", label='Target Prediction')
plt.legend()
plt.show()

In [None]:
# We can also take a look at the feature importance, which shows how much each feature contributes to the final
# prediction.
feature_imp_df = pd.DataFrame(data={"Name": ..., "Importance": ...})
display(...)
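
How importance is read off depends on the model. As one possible approach, assuming a linear model trained on normalized features, the absolute value of each coefficient in `coef_` can serve as a rough importance score; tree-based regressors in scikit-learn expose `feature_importances_` instead. The feature names and data below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features on a common scale; names are hypothetical.
names = ["feat_a", "feat_b", "feat_c"]
X = rng.normal(size=(300, 3))
# feat_a matters most, feat_b a little, feat_c not at all.
y = 3.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

model = LinearRegression().fit(X, y)

# Use coefficient magnitude as a rough importance score, largest first.
importance_df = (
    pd.DataFrame({"Name": names, "Importance": np.abs(model.coef_)})
    .sort_values("Importance", ascending=False)
    .reset_index(drop=True)
)
print(importance_df)
```

Coefficient magnitudes are only comparable when the features share a scale, which is one reason normalization matters for this kind of reading.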

## 7. What to try next
If you would like to pursue this topic further, here are some things to try:
- Explore the data and check for missing values
- Try the model without normalization and see if it affects the result
- Read about and try other types of regressors in scikit-learn (`SVR`, `GradientBoostingRegressor`, `ExtraTreesRegressor`)
- Read and try neural network based approaches (`ANN`, `RNN`, `CNN`)
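
As a starting point for comparing regressors, the sketch below fits a linear model and a `GradientBoostingRegressor` to the same synthetic, deliberately nonlinear data and compares their validation errors. The data, the helper `validation_mae`, and the default hyperparameters are assumptions for illustration; results on the energy dataset may look quite different.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic nonlinear relationship that a straight line cannot capture.
X = rng.uniform(-2.0, 2.0, size=(400, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.1, size=400)

X_trn, X_vld = X[:300], X[300:]
y_trn, y_vld = y[:300], y[300:]


def validation_mae(model):
    """Fit the model on the training split and return the validation MAE."""
    model.fit(X_trn, y_trn)
    return np.mean(np.abs(y_vld - model.predict(X_vld)))


lin_mae = validation_mae(LinearRegression())
gbr_mae = validation_mae(GradientBoostingRegressor(random_state=0))

print("LinearRegression validation MAE:          {:.3f}".format(lin_mae))
print("GradientBoostingRegressor validation MAE: {:.3f}".format(gbr_mae))
```

Because every scikit-learn regressor shares the same `fit`/`predict` interface, swapping in `SVR` or `ExtraTreesRegressor` only requires changing the object passed to the helper.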