<a href="https://colab.research.google.com/github/MohdRad/ML_Course/blob/main/sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

General notes about Google Colab:
1. To run IPython magic commands, use `%`, for example `%pwd`; any other Linux shell command can be run with `!`, for example `!ls`
2. To install Python packages using pip, use `!`, for example `!pip install keras`
3. You can see the files in your current directory by clicking the folder icon on the left

Python code usually starts by importing the necessary packages.

The most common packages used in ML are **pandas**, **sklearn**, and **tensorflow**.

`pandas`: used to import the data that holds the inputs and outputs.

`sklearn`: a popular package that has many ML algorithms such as Decision Tree, Random Forest, and many others. It also has functions for scaling the data and for splitting it into training and testing sets.

`tensorflow`: another popular package, used for neural network algorithms.

This tutorial covers a regression problem: predicting the median housing price in Boston using 13 inputs.

Only sklearn supervised algorithms will be used. The full list of sklearn supervised algorithms can be found here: https://scikit-learn.org/stable/supervised_learning.html  

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor # Decision Tree Regressor
from sklearn.ensemble import RandomForestRegressor # Random Forest
from sklearn.ensemble import GradientBoostingRegressor # Gradient Boosting
from sklearn.svm import SVR # Support Vector Machine
from sklearn.linear_model import Ridge, Lasso # Ridge and Lasso (modified algorithms of linear regression)
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.utils import shuffle

The second step is importing the data. The data is usually stored in a CSV file with labeled columns.
The pandas function `read_csv` imports the data as a DataFrame.

The inputs are the columns CRIM to LSTAT and the output is MEDV. The inputs are different factors that are believed to control the median value of housing. The values in the last column (MEDV) are in units of \$1000.

In [None]:
# Import the data and shuffle it (randomize it)
df = shuffle(pd.read_csv('housing.csv'))
# Display the data
df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
326,0.30347,0.0,7.38,0,0.493,6.312,28.9,5.4159,5,287,19.6,396.90,6.15,23.0
235,0.33045,0.0,6.20,0,0.507,6.086,61.5,3.6519,8,307,17.4,376.75,10.88,24.0
388,14.33370,0.0,18.10,0,0.700,4.880,100.0,1.5895,24,666,20.2,372.92,30.62,10.2
56,0.02055,85.0,0.74,0,0.410,6.383,35.7,9.1876,2,313,17.3,396.90,5.77,24.7
54,0.01360,75.0,4.00,0,0.410,5.888,47.6,7.3197,3,469,21.1,396.90,14.80,18.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85,0.05735,0.0,4.49,0,0.449,6.630,56.1,4.4377,3,247,18.5,392.30,6.53,26.6
63,0.12650,25.0,5.13,0,0.453,6.762,43.4,7.9809,8,284,19.7,395.58,9.50,25.0
71,0.15876,0.0,10.81,0,0.413,5.961,17.5,5.2873,4,305,19.2,376.94,9.88,21.7
387,22.59710,0.0,18.10,0,0.700,5.000,89.5,1.5184,24,666,20.2,396.90,31.99,7.4
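After loading, it is worth checking the size and column names of the DataFrame before going further. The sketch below is only an illustration: it builds a tiny DataFrame from three rows copied from the table above (via `io.StringIO`, which is an assumption standing in for `housing.csv`) and confirms that there are 13 input columns plus MEDV.

```python
import io
import pandas as pd

# Three rows copied from the table above, used as a stand-in for housing.csv
csv_text = """CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0.30347,0.0,7.38,0,0.493,6.312,28.9,5.4159,5,287,19.6,396.90,6.15,23.0
0.33045,0.0,6.20,0,0.507,6.086,61.5,3.6519,8,307,17.4,376.75,10.88,24.0
14.33370,0.0,18.10,0,0.700,4.880,100.0,1.5895,24,666,20.2,372.92,30.62,10.2
"""
sample = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = sample.shape             # (number of rows, number of columns)
input_names = list(sample.columns[:-1])   # every column except the last one (MEDV)

print(n_rows, n_cols)
print(input_names)
print(sample.dtypes)                      # data type of each column
```

With the real file, the same checks would be `df.shape`, `df.columns`, and `df.dtypes`.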


Now we need to preprocess our data and define the input X and the output y

In [None]:
# Drop the output MEDV to define the input matrix
# axis=1 drops a column; axis=0 drops a row
# np.array converts X from a DataFrame to a NumPy array,
# which makes the positional split into training and testing straightforward
X = np.array(df.drop(['MEDV'], axis=1))

# The output is MEDV
y = np.array(df['MEDV'])

# Since we have only one output, we need to make it a 2D array to avoid problems later in the scaling
# Reshape y to 2D; to check the shape, open the variables list by clicking the {x} icon on the left
y = y.reshape(-1,1)
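A small illustration of what `reshape(-1,1)` does, using three MEDV values from the table above. The variable names here (`y_demo`, `y_2d`, `y_1d`) are made up for this sketch. Scalers such as `MinMaxScaler` expect a 2D array of shape (samples, features), which is why the single output column is reshaped.

```python
import numpy as np

y_demo = np.array([23.0, 24.0, 10.2])   # three MEDV values from the table above
print(y_demo.shape)                     # 1D array with 3 elements

# -1 lets NumPy infer the number of rows; 1 is the number of columns
y_2d = y_demo.reshape(-1, 1)
print(y_2d.shape)                       # 3 rows, 1 column

# reshape(-1) flattens the array back to 1D
y_1d = y_2d.reshape(-1)
print(y_1d.shape)
```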


Now we need to split the data into **training** and **testing**. 80\% will be used for training and 20\% for testing. Personally, I like to do it manually; other people use the `train_test_split` function from sklearn.

The model will be trained using the training data (`X_train`, `y_train`). At this stage, the model sees both the inputs and the outputs and tries to learn how to relate them.

After the training, the model is tested using the testing data (`X_test`, `y_test`). The testing input (`X_test`) is fed to the model and the model gives predictions of the output (assigned to `y_pred`). Finally, we compare `y_pred` with `y_test` and calculate the error and other metrics.


In [None]:
L = int(0.8*len(X)) # len() is a built-in function that returns the number of items (here, rows); 0.8 because we want 80% for training
X_train = X[:L]
y_train = y[:L]
X_test = X[L:]
y_test = y[L:]
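For comparison, here is a minimal sketch of the same 80/20 split using `train_test_split`. The toy arrays `X_demo` and `y_demo` are assumptions made for illustration only. With `shuffle=False` the function behaves like the manual slicing above; by default it shuffles the rows before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for X and y (10 samples, 2 inputs)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10).reshape(-1, 1)

# test_size=0.2 keeps 20% of the rows for testing
# shuffle=False keeps the original order, like the manual slice
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Since the data in this tutorial was already shuffled when it was loaded, either approach gives a random split.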

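Scaling the data is not required for all ML algorithms (tree-based models, for example, do not need it), but it is good practice because some algorithms, such as neural networks, require it. The `MinMaxScaler()` imported earlier will be used. By default, this scaler maps each column to the range [0, 1] by subtracting the column minimum and then dividing by the difference between the maximum and the minimum.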
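To make the scaling concrete, the sketch below applies the min-max formula (x - min) / (max - min) by hand to a made-up single column and checks that `MinMaxScaler` gives the same result. The array `col` and its values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

col = np.array([[5.0], [10.0], [25.0]])   # one feature, three training samples

# Manual min-max scaling: (x - min) / (max - min)
manual = (col - col.min(axis=0)) / (col.max(axis=0) - col.min(axis=0))

# The same thing with sklearn
scaler = MinMaxScaler().fit(col)
scaled = scaler.transform(col)

# inverse_transform maps scaled values back to the original scale
restored = scaler.inverse_transform(scaled)

# A new value outside the fitted range is mapped outside [0, 1]
outside = scaler.transform(np.array([[30.0]]))

print(manual.ravel())
print(np.allclose(manual, scaled), np.allclose(restored, col))
print(outside)
```

The last line shows why test data can fall slightly outside [0, 1] when the scaler was fitted on the training data: the minimum and maximum are those of the training set.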

In [None]:
# We need separate scalers for X and y because the scaled y_pred must be
# transformed back to the original values before comparing it with y_test
# The model will give scaled predictions because all training data will be scaled
# X scaler
X_scaler = MinMaxScaler()
# Let the scaler learn the minimum and the maximum from the training data only
X_scaler.fit(X_train)
# Apply scaling to the training inputs
X_train_scaled = X_scaler.transform(X_train)
# Scale the test inputs with the same (training) minimum and maximum;
# do not refit the scaler on the test data
X_test_scaled = X_scaler.transform(X_test)

# y scaler, also fitted on the training data only
y_scaler = MinMaxScaler()
y_scaler.fit(y_train)
y_train_scaled = y_scaler.transform(y_train)
# No need to scale y_test; it will not and should not be used by the ML model



At this point, we have finished the standard data preprocessing for a regression problem.

Now we can define ML algorithms and start training and testing them. Let's start with the decision tree regressor.

Most ML algorithms have parameters that can be changed iteratively to improve the performance; they are called **hyperparameters**. The expected improvement is usually modest, typically around 5-10\%.

The number of hyperparameters that can be changed is usually large. You can start with the default values and, if the results are not satisfactory, start tuning them.

You can find all hyperparameters of the decision tree regressor at https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor
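The hyperparameters and their default values can also be listed from Python. The sketch below uses `get_params()`, which every sklearn estimator provides; the values passed to `shallow_tree` are arbitrary and only show how hyperparameters are set.

```python
from sklearn.tree import DecisionTreeRegressor

# Default hyperparameters of an untrained decision tree regressor
default_params = DecisionTreeRegressor().get_params()
for param_name, value in sorted(default_params.items()):
    print(param_name, '=', value)

# Hyperparameters are set through keyword arguments
# (max_depth=3 and min_samples_leaf=5 are arbitrary example values)
shallow_tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5)
print(shallow_tree.get_params()['max_depth'])
```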

In [None]:
# Define the regressor
dtr = DecisionTreeRegressor()
# train
dtr.fit(X_train_scaled,y_train_scaled)
# predict
y_predict = dtr.predict(X_test_scaled)

# y_predict is scaled, we need to rescale it back to the original scale
y_pred = y_scaler.inverse_transform(y_predict.reshape(-1,1))

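# Metrics
# Mean Absolute Error, computed on the original scale (use y_pred, not the scaled y_predict)
MAE = mean_absolute_error(y_test,y_pred)
# Coefficient of determination
R2 = r2_score(y_test, y_pred)

print('MAE =', MAE)
print('R2 =', R2)

MAE = 22.330174291938995
R2 = 0.6680363410222396


The results for the decision tree are **not good**. The MAE recorded above came from a run that compared `y_test` with the scaled `y_predict`, so it mixes two scales and overstates the error; with `y_pred`, the MAE is in the same units as MEDV (thousands of dollars). The R2 of about 0.67 is still low. This is probably because the training set is small (404 samples) and the problem is complex, so more data is needed.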

The best value for **MAE** or any error is **zero**.

The best value for **R2** is **1.0**. An R2 **near zero or negative** means the model is very bad. R2 measures how much of the variation in `y_test` is explained by `y_pred`: if R2 is 1, then `y_pred` = `y_test`, which is what we seek, while an R2 of 0 is no better than always predicting the mean of `y_test`.
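To show what the two metrics compute, here is a small worked example with made-up numbers (`y_true` and `y_hat` are not taken from the housing data). MAE is the mean of the absolute differences, and R2 is 1 minus the ratio of the squared prediction errors to the squared deviations from the mean.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

y_true = np.array([20.0, 25.0, 30.0, 35.0])   # made-up target values
y_hat = np.array([22.0, 24.0, 29.0, 37.0])    # made-up predictions

# MAE: mean of |y_true - y_hat|
mae_manual = np.mean(np.abs(y_true - y_hat))

# R2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)           # squared prediction errors
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean
r2_manual = 1 - ss_res / ss_tot

# Predicting the mean for every sample gives R2 = 0
r2_mean = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print('MAE:', mae_manual, mean_absolute_error(y_true, y_hat))
print('R2:', r2_manual, r2_score(y_true, y_hat))
print('R2 of mean predictor:', r2_mean)
```

Here the absolute errors are 2, 1, 1, and 2, so the MAE is 1.5, and R2 is 1 - 10/125 = 0.92.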

Usually, multiple ML algorithms are tried, so the commands in the last block must be repeated and new variables introduced for each algorithm. This is the hard way, called **"hard coding"**. It is highly undesirable because you may end up with 500 lines of code when the same output can be produced with 100 lines or fewer.

In Python, this can be handled with **functions** and **classes**. We leave the discussion of classes for another day and use functions here. We need a function that takes a regressor, trains it, tests it, and returns the metrics.

In [None]:
def reg (regressor,name):
  # train
  regressor.fit(X_train_scaled,y_train_scaled.reshape(-1))
  # predict
  y_predict = regressor.predict(X_test_scaled)
  # Rescale
  y_pred = y_scaler.inverse_transform(y_predict.reshape(-1,1))
  # Metrics
  # Mean Absolute Error
  MAE = mean_absolute_error(y_test,y_pred)
  # Coefficient of determination
  R2 = r2_score(y_test,y_pred)
  metrics = [MAE,R2]
  print (name)
  print ('MAE =', MAE)
  print ('R2 =', R2)
  print ('')
  return metrics

# Define Random Forest regressor
# n_estimators is the number of decision trees used
rfr = RandomForestRegressor(n_estimators=50)
# Define Gradient boosting regressor
gbr = GradientBoostingRegressor(n_estimators=50,
                                learning_rate=0.1)
# apply the function
forest = reg(rfr, 'Random Forest')
boosting = reg(gbr, 'Gradient Boosting')


Random Forest
MAE = 2.6100588235294127
R2 = 0.8364314762886254

Gradient Boosting
MAE = 2.4491157406395936
R2 = 0.8650380458102951



The results for Random Forest and Gradient Boosting are better than those of the decision tree. This is expected because these models combine multiple decision trees.

The hyperparameters `n_estimators` and `learning_rate` should be tuned for better performance. Again, note that tuning usually brings only a modest improvement, but it is still worth doing.
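One common way to tune these two hyperparameters is a grid search with cross-validation. The sketch below uses sklearn's `GridSearchCV`, which is not used elsewhere in this tutorial. The data comes from `make_regression` as a synthetic stand-in (an assumption), and the candidate values in `param_grid` are arbitrary; with the housing data you would pass `X_train_scaled` and `y_train_scaled.reshape(-1)` instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), 13 inputs like the housing problem
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)

# Candidate values to try (arbitrary examples)
param_grid = {'n_estimators': [50, 100],
              'learning_rate': [0.05, 0.1]}

# Every combination is evaluated with 3-fold cross-validation on the given data
# sklearn maximizes the score, so MAE is used with a negative sign
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=3,
                      scoring='neg_mean_absolute_error')
search.fit(X_demo, y_demo)

print(search.best_params_)
print('cross-validated MAE =', -search.best_score_)
```

Because the search only uses the training data, the test set stays untouched for the final evaluation.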

Support Vector Regressor (`SVR`), `Ridge`, and `Lasso` are already imported; implementing them is left to you as an exercise. Search for similar tutorials that use them if needed.
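As a starting point for the exercise, here is one possible sketch that follows the same train, predict, rescale, and score pattern for the three models. It runs on synthetic data from `make_regression` (an assumption, not the housing data), and the hyperparameter values are illustrative rather than tuned, so the printed metrics say nothing about the housing problem.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=0)
y_demo = y_demo.reshape(-1, 1)

# Manual 80/20 split
L = int(0.8 * len(X_demo))
X_tr, X_te = X_demo[:L], X_demo[L:]
y_tr, y_te = y_demo[:L], y_demo[L:]

# Scalers fitted on the training data only
X_sc = MinMaxScaler().fit(X_tr)
y_sc = MinMaxScaler().fit(y_tr)
X_tr_s, X_te_s = X_sc.transform(X_tr), X_sc.transform(X_te)
y_tr_s = y_sc.transform(y_tr).reshape(-1)

# Hyperparameter values are illustrative, not tuned
models = {'SVR': SVR(kernel='rbf', C=1.0),
          'Ridge': Ridge(alpha=1.0),
          'Lasso': Lasso(alpha=0.001)}

results = {}
for name, model in models.items():
    model.fit(X_tr_s, y_tr_s)                                   # train
    y_hat_s = model.predict(X_te_s)                             # scaled predictions
    y_hat = y_sc.inverse_transform(y_hat_s.reshape(-1, 1))      # back to original scale
    results[name] = [mean_absolute_error(y_te, y_hat), r2_score(y_te, y_hat)]
    print(name, 'MAE =', results[name][0], 'R2 =', results[name][1])
```

Storing the models in a dictionary and looping over them avoids repeating the same commands for each algorithm.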