# CS5228 Assignment 2b - Linear Models (50 Points)

Hello everyone, this assignment notebook covers Linear Models as part of the topic Classification & Regression. There are some code-completion tasks and question-answering tasks in this answer sheet. For code-completion tasks, please write your answer (i.e., your lines of code) between the lines marked "your code starts here" and "your code ends here". The space between these two lines does not reflect the required or expected number of lines of code :). For answers in plain text, you can refer to [this Markdown guide](https://medium.com/analytics-vidhya/the-ultimate-markdown-guide-for-jupyter-notebook-d5e5abf728fd) to customize the layout (although it shouldn't be needed).

**Important:**
* Remember to save this Jupyter notebook as A1a_YourNameInLumiNUS_YourNUSNETID.ipynb
* Submission deadline is October 17th, 11.59 pm (together with A1a)

In [1]:
student_id = 'A0236597M'
nusnet_id = 'e0744016'

Here is an overview of the tasks to be solved and the points associated with each task. The notebook can appear very long and verbose, but note that many parts provide additional explanations, documentation, or some discussion. The code and markdown cells you are supposed to complete are well marked, but you can use the overview below to double-check that you covered everything.

* **1 Linear Regression (10 Points)**
    * 1.1 Implementation w/ Normal Equation (4 Points)
    * 1.2 Questions about Linear Regression (6 Points)
        * 1.2a) (2 Points)
        * 1.2b) (2 Points)
        * 1.2c) (2 Points)
* **2 Logistic Regression (20 Points)**
    * 2.1 Implementing Logistic Regression (14 Points)
        * 2.1a) Calculating the Gradient (6 Points)
        * 2.1b) Implementing Gradient Descent (4 Points)
        * 2.1c) Hyperparameter Tuning "By Hand" (2 Points)
        * 2.1d) Predicting Labels (2 Points)
    * 2.2 From Logistic Regression to Linear Regression (6 Points)
* **3 Model Evaluation (20 Points)**
    * 3.1 Basic Model Evaluation (7 Points)
        * 3.1 a) (4 Points)
        * 3.1 b) (3 Points)
    * 3.2 Hyperparameter Tuning (7 Points)
        * 3.2 a) (4 Points)
        * 3.2 b) (3 Points)
    * 3.3 Handling Overfitting (6 Points)

### Setting up the Notebook

**Important:** In this notebook, most code-completion tasks require editing the file `src/linear.py`. The code cell below ensures that any change to the file (after saving) will cause a reload in this notebook. So there's no need to "manually" import the code after every change. Way more convenient.

In [2]:
# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

In [3]:
%matplotlib notebook

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression


from src.linear import MyLinearRegression, MyLogisticRegression

## 1 Linear Regression (10 Points)

Your task is to implement basic Linear Regression using the Normal Equation you have learned in the lecture. The Normal Equation provides an analytical approach to training a Linear Regressor that is straightforward to implement.

#### Creating Some Toy Data for Testing & Debugging

The following toy data is the CSI example used in the lecture, where the goal is to estimate a person's height based on the size of a shoe print. So 20 (shoe print size, height)-pairs have been collected.

In [5]:
data = np.array([
    (31.3, 180.3), (29.7, 175.3), (31.3, 184.8), (31.8, 177.8),
    (31.4, 182.3), (31.9, 185.4), (31.8, 180.3), (31.0, 175.5),
    (29.7, 177.8), (31.4, 185.4), (32.4, 190.5), (33.6, 195.0),
    (30.2, 175.3), (30.4, 180.3), (27.6, 172.7), (31.8, 182.9),
    (31.3, 189.2), (34.5, 193.7), (28.9, 170.3), (28.2, 173.8)
])

X = data[:,0].reshape(-1, 1)
y = data[:,1].reshape(-1, 1)

We can visualize the data using a simple scatter plot.

In [6]:
plt.figure()
axes = plt.axes()
axes.set_ylim([165, 200])
plt.tick_params(labelsize=14)
plt.scatter(X, y, s=50)
plt.xlabel('shoe print length (cm)', fontsize=16)
plt.ylabel('body height (cm)', fontsize=16)
plt.tight_layout()
plt.show()

[Figure: scatter plot of shoe print length (cm) against body height (cm) for the 20 samples]

The plot shows that there is a reasonably good linear relationship between the shoe print size and the height of a person. Hence one can justifiably use Linear Regression to find a good fit of the data.

### 1.1 Linear Regression w/ Normal Equation (4 Points)

We saw in the lecture that we can minimize the Mean Squared Error (MSE) loss $L$ to find the best values for $\theta$ analytically. We did so by finding the derivative of $L$ with respect to $\theta$, setting it to 0 and solving for $\theta$. We got the following formula called the **Normal Equation**:

$$\theta = X^{\dagger}y, \quad \text{with } X^{\dagger} = (X^TX)^{-1} X^T$$

where $X^{\dagger}$ is the *pseudo inverse* of matrix $X$.

**Implement the method `fit()` that evaluates this equation!** (Have a look at [`np.dot()`](https://numpy.org/doc/stable/reference/generated/numpy.dot.html) and [`numpy.linalg.inv`](https://numpy.org/doc/stable/reference/generated/numpy.linalg.inv.html) to make your life easier.)

You can test your implementation using the code below. The result should match the ones shown in the lecture slides as we're using the same example dataset here.


In [7]:
linreg = MyLinearRegression().fit(X, y)

theta_best = linreg.theta

print('The theta values that minimized the loss for your CSI example are: {}'.format(theta_best.squeeze()))

The theta values that minimized the loss for your CSI example are: [69.45528793  3.61092267]
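
As a rough illustration of the Normal Equation on the CSI data, the following minimal sketch assumes that a bias column of ones is prepended to the inputs. The helper name `fit_normal_equation` is hypothetical and not part of `src/linear.py`; it uses `np.linalg.solve` instead of an explicit inverse, which is a common alternative rather than the approach asked for above.

```python
import numpy as np

# CSI toy data: (shoe print length in cm, body height in cm)
data = np.array([
    (31.3, 180.3), (29.7, 175.3), (31.3, 184.8), (31.8, 177.8),
    (31.4, 182.3), (31.9, 185.4), (31.8, 180.3), (31.0, 175.5),
    (29.7, 177.8), (31.4, 185.4), (32.4, 190.5), (33.6, 195.0),
    (30.2, 175.3), (30.4, 180.3), (27.6, 172.7), (31.8, 182.9),
    (31.3, 189.2), (34.5, 193.7), (28.9, 170.3), (28.2, 173.8)
])

def fit_normal_equation(X, y):
    """Hypothetical helper: solve (X^T X) theta = X^T y for theta."""
    # Prepend the bias column x0 = 1 so that theta[0] is the intercept
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

X_csi = data[:, 0].reshape(-1, 1)
y_csi = data[:, 1].reshape(-1, 1)

theta = fit_normal_equation(X_csi, y_csi)
print(theta.squeeze())
```

Solving the linear system gives the same $\theta$ as multiplying by $(X^TX)^{-1}$, while avoiding forming the inverse explicitly.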


**Testing your implementation.** The following 2 code cells are only for testing your complete implementation of `MyLinearRegression`; there is nothing for you to do here. We use the [Hitters](https://www.kaggle.com/floser/hitters) dataset which aims to predict the salaries of baseball players based on their statistics. You can check the website for more details about the different features. In the following, we just consider a subset of all features to keep it simple.

In [9]:
df = pd.read_csv('data/a2-hitters.csv')
df = df.dropna()

X_hitters = df[['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Years', 'Assists', 'Errors']].to_numpy()
y_hitter = df[['Salary']].to_numpy().squeeze()

# Note: this prints the shape of the toy matrix X defined earlier, not of X_hitters
print("Shape of X_hitters: {}".format(X.shape))

Shape of X_hitters: (20, 1)


Now let's compare your implementation with the one provided by `sklearn`. Note that `sklearn.LinearRegression` treats the coefficient for the bias term $x_0$ separately. In your implementation this is simply the first coefficient in `theta`.

In [10]:
homemade_linreg = MyLinearRegression().fit(X_hitters, y_hitter)
sklearn_linreg = LinearRegression().fit(X_hitters, y_hitter)

print('Homemade results')
print("Coefficient for x0: {}; other coefficients: {}".format(homemade_linreg.theta[0], homemade_linreg.theta[1:]))
print()
print('Sklearn results')
print("Coefficient for x0: {}; other coefficients: {}".format(sklearn_linreg.intercept_, sklearn_linreg.coef_))

Homemade results
Coefficient for x0: -82.6239065789254; other coefficients: [-1.56133266  5.750119   -2.95806524  3.69329966  4.24168825 34.85748766
  0.05172045 -2.15063924]

Sklearn results
Coefficient for x0: -82.6239065789008; other coefficients: [-1.56133266  5.750119   -2.95806524  3.69329966  4.24168825 34.85748766
  0.05172045 -2.15063924]


Apart from maybe some precision issues, the results should be identical.

### 1.2 Questions about Linear Regression (6 Points)

**1.2 a) How would you need to modify/extend `MyLinearRegression` to support Polynomial Linear Regression? (2 Points)** There is no need for any implementation, just briefly describe any changes to your implementation that need to be made.

**Your Answer:**

The model stays linear in $\theta$, so the Normal Equation in `fit()` can remain unchanged; only the input needs to be extended. Before fitting (and again before predicting), `X` is transformed by adding polynomial terms as extra feature columns. For example, with two features `x1` and `x2` and a polynomial degree of 2, we add `x1**2`, `x2**2`, and `x1*x2`; with a degree of 3, we additionally add `x1**3`, `x2**3`, `x1**2*x2`, and `x1*x2**2`.
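
To make the idea concrete, here is a minimal sketch for a single input feature. The helpers `add_polynomial_features` and `fit_normal_equation` are hypothetical names, and the toy data is made up (noise-free) so that the recovered coefficients are easy to check.

```python
import numpy as np

def add_polynomial_features(x, degree):
    """Hypothetical helper: map one feature column x to [x, x^2, ..., x^degree]."""
    return np.hstack([x ** d for d in range(1, degree + 1)])

def fit_normal_equation(X, y):
    """Same Normal Equation as for plain Linear Regression (bias column prepended)."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# Made-up, noise-free quadratic toy data: y = 1 + 2x + 3x^2
x = np.linspace(-2, 2, 9).reshape(-1, 1)
y = 1 + 2 * x + 3 * x ** 2

# Only the input changes; the fitting step is untouched
X_poly = add_polynomial_features(x, degree=2)
theta_poly = fit_normal_equation(X_poly, y)
print(theta_poly.squeeze())
```

The same transformation has to be applied to new inputs before predicting, since $\theta$ now refers to the expanded feature columns.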

For the next question, assume that you perform 10-fold cross validation on a dataset using a **Decision Tree** as well as **Linear Regression**. While the average validation scores for both models are pretty similar, the 10 individual validation scores show a much greater variance for Linear Regression than for the Decision Tree.

**1.2 b) What insights into the data can you get from this observation? (2 Points)** Briefly explain your answer!

**Your Answer:**

There are likely outliers in the data. Decision Trees are fairly robust to outliers, while Linear Regression is very sensitive to them, so its validation score depends strongly on which fold the outliers fall into.

The dataset for our CSI example consists of 20 data samples, i.e., 20 (shoe print size, body height)-pairs. And we saw that the best values for $\theta$ are around $\theta_0 \approx 69.5$ and $\theta_1 \approx 3.6$. Now you get a different dataset $D$ with 20 $(x, y)$-pairs, with $x$ and $y$ being numerical values. You train a Linear Regression model over $D$ and get the same $\theta$ values as for the CSI dataset.

**1.2 c) Knowing what the CSI dataset looks like, what can you say about $D$? (2 Points)** Briefly discuss what this means for applying Linear Regression and interpreting the results.

**Your Answer:**

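Both $D$ and CSI yield the same parameters, so the best-fit line is the same for both datasets: $y = \theta_0 + \theta_1 x$, with a positive slope, i.e., $x$ and $y$ are positively related in $D$ as well. However, identical $\theta$ values alone do not imply that $D$ has the same distribution as CSI: differently shaped point clouds (e.g., with outliers or a non-linear pattern) can produce the same line. When applying Linear Regression, the data should therefore be inspected (e.g., with a scatter plot) before interpreting $\theta$.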
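
As an illustration for this question, the following minimal sketch uses made-up data (not the CSI data) to show that identical $\theta$ values can come from differently shaped datasets, which is worth keeping in mind when interpreting a fitted line. The reference values 69.5 and 3.6 are only the rounded CSI parameters; the helper `fit_line` is a hypothetical name.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = theta_0 + theta_1 * x via np.linalg.lstsq."""
    A = np.column_stack([np.ones_like(x), x])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return theta

x = np.linspace(27.0, 35.0, 20)
theta_ref = np.array([69.5, 3.6])
line = theta_ref[0] + theta_ref[1] * x

# Build a curved pattern and remove its best linear fit, so that what is
# left is orthogonal to both the bias column and x.
A = np.column_stack([np.ones_like(x), x])
curve = (x - x.mean()) ** 2
coef, *_ = np.linalg.lstsq(A, curve, rcond=None)
residual = curve - A @ coef

y_linear = line             # points exactly on the line
y_curved = line + residual  # clearly non-linear pattern, same fitted line

theta_linear = fit_line(x, y_linear)
theta_curved = fit_line(x, y_curved)
print(theta_linear, theta_curved)
```

Both fits return the same parameters even though one dataset is perfectly linear and the other is curved, so $\theta$ by itself says little about how well a line describes the data.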

---------------------------------------------------------------

## 2 Logistic Regression (20 Points)

Logistic Regression is a linear model for classification tasks. It's called Logistic *Regression* because the output is a real value. However, this value is interpreted as a probability whether the input sample belongs to Class 0 or 1 -- we are only considering Logistic Regression for binary classification here.

In contrast to Linear Regression, we can no longer solve Logistic Regression analytically. Hence we have to implement it using Gradient Descent. Here, instead of calculating $\nabla_\theta L$, setting it to 0, and solving for $\theta$, we start with initial parameter values for $\theta$, calculate the respective gradient, and update $\theta$ iteratively to reduce the loss $L$.

The gradient for loss $L$ w.r.t. $\theta$ is given as:

$$\nabla_\theta L = \frac{1}{n} X^T(h_\theta(X) - y)\ \text{, with }\ h_\theta(x_i) = \frac{1}{1+ e^{-\theta^{T}x_i}}$$

In the following, we again use the same dataset as in the lecture for testing your implementation.

In [11]:
data = [
    (31.3, 1), (29.7, 1), (31.3, 0), (31.8, 0),
    (31.4, 1), (31.9, 1), (31.8, 1), (31.0, 1),
    (29.7, 0), (31.4, 1), (32.4, 1), (33.6, 1),
    (30.2, 0), (30.4, 0), (27.6, 0), (31.8, 1),
    (31.3, 1), (34.5, 1), (28.9, 0), (28.2, 0)
]

# Convert input and outputs to numpy arrays; makes some calculations easier
X = np.array([ coord[0] for coord in data]).reshape(-1,1)
y = np.array([ coord[1] for coord in data]).reshape(-1,1)

print(X.shape)

(20, 1)


In [12]:
plt.figure()
plt.tick_params(labelsize=14)
plt.scatter(X, y, c='C0', s=100)
plt.xlabel('Shoe print size', fontsize=16)
plt.ylabel('P(male)', fontsize=16)
plt.tight_layout()
plt.show()

[Figure: scatter plot of shoe print size against P(male) (class labels 0/1) for the 20 samples]

### 2.1 Implementing Logistic Regression (14 Points)

#### 2.1 a) Calculating the Gradient (6 Points)

**Implement the methods `calc_h()` and `calc_gradient()` to calculate the gradient!** The only reason for splitting this into 2 methods is that it allows us to re-use some code.

In [13]:
logreg = MyLogisticRegression()

# Initialize theta (again, we use the example values from the lecture)
logreg.theta = np.array([60, 4]).reshape(-1, 1)

X_with_bias = logreg.add_bias(X)

h = logreg.calc_h(X_with_bias)

grad = logreg.calc_gradient(X_with_bias, y, h)

print('The gradient for theta = {} is: {}'.format(logreg.theta.squeeze(), grad.squeeze()))

The gradient for theta = [60  4] is: [ 0.4   11.905]


The gradient should be around [ 0.4, 11.9].
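
The expected value can be checked with a small stand-alone sketch. It assumes the gradient is averaged over the $n$ samples (a factor of $1/n$) and that a bias column of ones is prepended; the function names `sigmoid` and `gradient` are illustrative and do not mirror the method signatures in `src/linear.py`.

```python
import numpy as np

# CSI classification data: (shoe print size, label)
data = [
    (31.3, 1), (29.7, 1), (31.3, 0), (31.8, 0),
    (31.4, 1), (31.9, 1), (31.8, 1), (31.0, 1),
    (29.7, 0), (31.4, 1), (32.4, 1), (33.6, 1),
    (30.2, 0), (30.4, 0), (27.6, 0), (31.8, 1),
    (31.3, 1), (34.5, 1), (28.9, 0), (28.2, 0)
]
X_demo = np.array([d[0] for d in data]).reshape(-1, 1)
y_demo = np.array([d[1] for d in data]).reshape(-1, 1)

def sigmoid(z):
    """Logistic function h = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X_b, y, theta):
    """Average gradient of the cross-entropy loss: X^T (h - y) / n."""
    h = sigmoid(X_b @ theta)
    return X_b.T @ (h - y) / X_b.shape[0]

# Prepend the bias column and use the example theta from above
X_b = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])
theta = np.array([60.0, 4.0]).reshape(-1, 1)

grad = gradient(X_b, y_demo, theta)
print(grad.squeeze())
```

With $\theta = (60, 4)$ every prediction is practically 1, so only the 8 samples with label 0 contribute: the first component is $8/20 = 0.4$ and the second is the sum of their shoe print sizes divided by 20.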

#### 2.1 b) Implementing Gradient Descent (4 Points)

We now have everything in place to implement the training loop for Gradient Descent.

**Implement the method `fit()` to find the best $\theta$ using Gradient Descent!** Most of the code is given, so you can focus on the loop that performs Gradient Descent.

You can test your implementation using the code below. With the parameters given (`lr=0.00001`, `num_iter=1000`), you should be able to achieve a loss of around 0.66 for the CSI example.

In [65]:
logreg = MyLogisticRegression()

logreg.fit(X, y, lr=0.01, num_iter=2000000, verbose=True)

print(logreg.theta)

Loss: 0.693 	 0%
Loss: 0.471 	 10%
Loss: 0.438 	 20%
Loss: 0.426 	 30%
Loss: 0.421 	 40%
Loss: 0.418 	 50%
Loss: 0.417 	 60%
Loss: 0.416 	 70%
Loss: 0.416 	 80%
Loss: 0.415 	 90%
Loss: 0.415 	 100%
[[-42.71370035]
 [  1.39608796]]
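
For reference, the core of such a training loop can be sketched in a few lines. This is a stand-alone illustration with hypothetical function names, not the `fit()` method from `src/linear.py`; it assumes $\theta$ is initialized with zeros, and it uses a deliberately small number of iterations, so it will not reach the loss values shown above.

```python
import numpy as np

x = np.array([31.3, 29.7, 31.3, 31.8, 31.4, 31.9, 31.8, 31.0, 29.7, 31.4,
              32.4, 33.6, 30.2, 30.4, 27.6, 31.8, 31.3, 34.5, 28.9, 28.2])
labels = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
                   1, 1, 0, 0, 0, 1, 1, 1, 0, 0])
X_b = np.column_stack([np.ones_like(x), x])  # bias column + feature
y_col = labels.reshape(-1, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(X_b, y, theta):
    """Mean cross-entropy, written as log(1 + e^z) - y*z for numerical stability."""
    z = X_b @ theta
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def gradient_descent(X_b, y, lr, num_iter):
    """Plain batch Gradient Descent starting from theta = 0."""
    theta = np.zeros((X_b.shape[1], 1))
    for _ in range(num_iter):
        h = sigmoid(X_b @ theta)
        grad = X_b.T @ (h - y) / X_b.shape[0]
        theta = theta - lr * grad  # step against the gradient
    return theta

loss_start = cross_entropy(X_b, y_col, np.zeros((2, 1)))
theta_gd = gradient_descent(X_b, y_col, lr=0.001, num_iter=1000)
loss_end = cross_entropy(X_b, y_col, theta_gd)
print('loss before: {:.3f}, loss after: {:.3f}'.format(loss_start, loss_end))
```

The starting loss is $\ln 2 \approx 0.693$ because all predictions are 0.5 when $\theta = 0$, which matches the first line of the log above.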


#### 2.1 c) Hyperparameter Tuning "By Hand" (2 Points)

Seeing the loss going down and reaching a value below 0.66 (with the example parameter values above) is a good start. However, we already know that the optimal loss is around 0.415 (see Section 1.2). Of course, we could simply increase the value of `num_iter` more and more, knowing that at some point Gradient Descent will reach the minimum. But this would unnecessarily increase the computation time.

**Evaluate different values for the learning rate `lr` and the number of iterations `num_iter`!** In more detail:

* Find a setting for both parameters that will reduce the loss below 12.0
* Find such a setting as to keep `num_iter` as small as possible
* Discuss any interesting observations you have made while finding such a parameter setting (e.g., how the development of the loss behaves, limitations on the choice of parameter values, etc.) together with a brief explanation

(Hint: You don't have to make fine-grained changes to the parameters. For example, there's no point in increasing `num_iter` from 1000 to 1001 :). Start with changing the order of magnitude of the parameter values and maybe try some finer changes.)

**Your answer** (identified parameter values):

In [66]:
my_lr = 0.01
my_num_iter = 2000000

**Your answer** (interesting observations):

1. When the learning rate is greater than 0.1, the loss does not converge.
2. To keep `num_iter` as small as possible, we should use a larger learning rate, but it must stay below 0.1; otherwise the loss may fail to converge.
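
The effect of a too-large learning rate can be seen after a single update step. The sketch below is a stand-alone illustration (hypothetical helper names, $\theta$ initialized with zeros) that compares the loss before and after one step for a small and a large learning rate; the value 0.5 is just an example of a rate that is clearly too large for the unscaled shoe print sizes.

```python
import numpy as np

x = np.array([31.3, 29.7, 31.3, 31.8, 31.4, 31.9, 31.8, 31.0, 29.7, 31.4,
              32.4, 33.6, 30.2, 30.4, 27.6, 31.8, 31.3, 34.5, 28.9, 28.2])
y_lab = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1,
                  1, 1, 0, 0, 0, 1, 1, 1, 0, 0])
X_b = np.column_stack([np.ones_like(x), x])

def loss(theta):
    """Mean cross-entropy in the stable form log(1 + e^z) - y*z."""
    z = X_b @ theta
    return float(np.mean(np.logaddexp(0.0, z) - y_lab * z))

def one_step(theta, lr):
    """A single Gradient Descent update."""
    h = 1.0 / (1.0 + np.exp(-(X_b @ theta)))
    return theta - lr * X_b.T @ (h - y_lab) / len(y_lab)

theta0 = np.zeros(2)
loss_small = loss(one_step(theta0, lr=0.001))
loss_large = loss(one_step(theta0, lr=0.5))
print('start: {:.3f}, lr=0.001: {:.3f}, lr=0.5: {:.3f}'.format(
    loss(theta0), loss_small, loss_large))
```

Because the feature values are around 30, the gradient component for $\theta_1$ is large, and a big step pushes all predictions towards 1 at once. Standardizing the feature first would typically allow much larger learning rates.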

#### 2.1 d) Predicting Labels (2 Points)

With the code you implemented you can already perform predictions. The method `predict()` only wraps the required steps to provide a clean interface. We therefore just give it to you; there's nothing more for you to implement here. Below is just the example prediction from the CSI example.

In [67]:
# The predicted value depends on the values of theta.
# To ensure consistent results, let's use the best theta.
logreg.fit(X, y, lr=0.001, num_iter=100000, verbose=True)

X_suspect = np.array([[32.2]])

y_suspect_pred = logreg.predict(X_suspect)

print('The predicted class of the suspect is: {} (0=Female, 1=Male)'.format(y_suspect_pred.squeeze()))

Loss: 0.693 	 0%
Loss: 0.664 	 10%
Loss: 0.661 	 20%
Loss: 0.659 	 30%
Loss: 0.656 	 40%
Loss: 0.654 	 50%
Loss: 0.652 	 60%
Loss: 0.649 	 70%
Loss: 0.647 	 80%
Loss: 0.645 	 90%
Loss: 0.643 	 100%
The predicted class of the suspect is: 1.0 (0=Female, 1=Male)


------------------------------------------------

### 2.2 From Logistic Regression to Linear Regression (6 Points)

In Section 1, you have implemented Linear Regression using the Normal Equation. But we also saw in the lecture how to train a Linear Regression model using Gradient Descent. And we also saw how closely related Linear and Logistic Regression are.

If we want to implement our own class `MyLinearRegressionGD` using Gradient Descent, we can utilize the close relationship between Linear and Logistic Regression to simplify this task. In fact, the implementation of `MyLinearRegressionGD` would look rather similar to the implementation  of `MyLogisticRegression`, but even simpler.

**Describe which methods of `MyLogisticRegression` need to be changed (and briefly how!), to implement `MyLinearRegressionGD`!** You are free to do this task in two ways:

* **Option A:** Implement the methods using the code cell below (you don't have to test your code and it may contain little syntax errors as long as the steps/calculations are clear; you may add comments as well, of course) **OR**
* **Option B:** Describe how the methods need to be changed in the Markdown cell below (please provide sufficient detail to make the required changes and new calculations clear)


**Option A -- Your answer in Python code:**

In [None]:
#########################################################################################
### Your code starts here ###############################################################
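class MyLinearRegressionGD():
    def __init__(self):
        self.theta = None  # feature weights, shape (num_features, 1)
        self.bias = None   # intercept, kept separately instead of using add_bias()

    def calc_loss(self, y, y_pred):
        # Halved mean squared error; the factor 1/2 cancels in the gradient
        loss = np.square(y_pred - y)
        cost = np.sum(loss) / (2 * len(y))
        return cost

    def calc_h(self, X):
        # Linear hypothesis: no sigmoid, unlike MyLogisticRegression
        h = np.dot(X, self.theta) + self.bias
        return h

    def calc_gradient(self, X, y, h):
        # Gradients of the halved MSE w.r.t. theta and the bias;
        # y is expected as a column vector of shape (num_samples, 1)
        dtheta = np.dot(X.T, (h - y)) / X.shape[0]
        dbias = np.sum(h - y) / X.shape[0]
        return dtheta, dbias

    def fit(self, X, y, lr=0.001, num_iter=100, verbose=False):
        # Start from all-zero parameters
        self.theta = np.zeros(X.shape[1]).reshape(-1, 1)
        self.bias = 0

        for i in range(num_iter):

            h = self.calc_h(X)

            dtheta, dbias = self.calc_gradient(X, y, h)

            # Step against the gradient
            self.theta = self.theta - lr * dtheta
            self.bias = self.bias - lr * dbias

        return self

    def predict(self, X):
        # The prediction is the real-valued output itself (no thresholding)
        y_pred = np.dot(X, self.theta) + self.bias
        return y_pred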
    
### Your code ends here #################################################################
#########################################################################################

**Option B -- Your Answer in Plain Text:**

------------------------------------------------------

## 3 Model Selection (20 Points)

The topic "Classification & Regression" covered a whole series of different models. In this section, we look at the basic data mining task of finding the best model for a given dataset: which model performs best with which hyperparameters. To keep it simple and keep the implementation work to a minimum, we make full use of scikit-learn (see additional hints in the subtasks).

#### Prepare Dataset

We use the [Electrical Grid Stability Simulated Data Data Set](https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+). It contains 10,000 samples, each with 12 features to predict the stability of a power grid. We use this dataset mainly for convenience. It basically does not require any data preprocessing as all features are numerical and there are no "dirty" records. We only normalize the data via standardization (see the code cell below).

In [70]:
df = pd.read_csv('data/a2-grid-cleaned.csv')

df.head()

| Unnamed: 0 | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.95906 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | 0 |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.78176 | 1 |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.27721 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | 0 |
| 3 | 0.716415 | 7.6696 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | 0 |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.79711 | 0.45545 | 0.656947 | 0.820923 | 0 |


In [71]:
# Convert to NumPy arrays
df_X, df_y = df.iloc[:,0:-1], df.iloc[:,-1]
X, y = df_X.to_numpy(), df_y.to_numpy()

# Split dataset in training and test data (20% test data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Normalize data using the scikit-learn StandardScaler
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

print('Number of features: {}'.format(X_train.shape[1]))
print('Number of samples for training: {}'.format(X_train.shape[0]))
print('Number of samples for testing: {}'.format(X_test.shape[0]))

Number of features: 12
Number of samples for training: 8000
Number of samples for testing: 2000


### 3.1 Basic Model Evaluation (7 Points)

For all following tasks, we use the scikit-learn implementations of the models we covered in the module:

* [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

* [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

* [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

* [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

* [`AdaBoostClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with (`base_estimator=DecisionTreeClassifier()`)

* [`GradientBoostingClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)




**3.1 a) Train 6 classification models using the available scikit-learn implementations! (4 Points)** Use each model with its default parameters! We leave any tuning for 3.2. 

Hint: As the dataset is rather small, most models will train rather quickly. To get a better sense of the runtime of each implementation, you can train the same model multiple times in a loop to get an aggregated runtime, something like:

```
for _ in range(100):
    # Train Decision Tree
    DecisionTreeClassifier().fit(X_train, y_train)
```

This should make it a bit easier to compare the different models w.r.t. their runtime.

In [None]:
%%time

clf_knn, clf_logreg, clf_dtree, clf_rforest, clf_adaboost, clf_gbtree = None, None, None, None, None, None

#########################################################################################
### Your code starts here ############################################################### 
import time

# 1. Train KNN
start = time.time()
for i in range(100): 
    clf_knn = KNeighborsClassifier().fit(X_train,y_train)
end = time.time()
print('knn total time:{}s'.format(end-start))

# 2. Train LogisticRegression
start = time.time()
for i in range(100):
    clf_logreg = LogisticRegression().fit(X_train,y_train)
end = time.time()
print('logreg total time:{}s'.format(end-start))

# 3. Train Decision Tree
start = time.time()
for i in range(100):
    clf_dtree = DecisionTreeClassifier().fit(X_train,y_train)
end = time.time()
print('dtree total time:{}s'.format(end-start))

# 4. Train Random Forest
start = time.time()
for i in range(100):
    clf_rforest = RandomForestClassifier().fit(X_train,y_train)
end = time.time()
print('rforest total time:{}s'.format(end-start))
    
# 5. Train adaboost
start = time.time()
for i in range(100):
    clf_adaboost = AdaBoostClassifier().fit(X_train,y_train)
end = time.time()
print('adboost total time:{}s'.format(end-start))

# 6. Train Gradient Boosting
start = time.time()
for i in range(100):
    clf_gbtree = GradientBoostingClassifier().fit(X_train,y_train)
end = time.time()
print('gbtree total time:{}s'.format(end-start))

### Your code ends here #################################################################
######################################################################################### 

knn total time:0.5126051902770996s
logreg total time:0.6660552024841309s
dtree total time:9.38041090965271s
rforest total time:162.6855628490448s
adboost total time:64.03704309463501s
gbtree total time:278.42111015319824s
CPU times: user 8min 32s, sys: 2.98 s, total: 8min 35s
Wall time: 8min 35s
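
The six timing loops above all follow the same pattern, so they could be wrapped in a small helper. The following is only a sketch: `time_repeated` is a hypothetical name, and a dummy workload stands in for the model training so that the snippet runs on its own.

```python
import time

def time_repeated(fn, repeats=100):
    """Hypothetical helper: call fn() `repeats` times and return the total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return time.perf_counter() - start

# Dummy workload standing in for a call such as SomeClassifier().fit(X_train, y_train)
calls = []
elapsed = time_repeated(lambda: calls.append(sum(range(1000))), repeats=100)
print('dummy workload total time: {:.4f}s'.format(elapsed))
```

In the notebook, the lambda would wrap the actual training call, e.g. `lambda: KNeighborsClassifier().fit(X_train, y_train)`. `time.perf_counter()` is used here because it is intended for measuring short durations.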


The following code simply prints the f1 scores for each model -- nothing to implement for you here!

In [None]:
print('F1 scores for test data for all classifiers')
print('===========================================')
for clf in [clf_knn, clf_logreg, clf_dtree, clf_rforest, clf_adaboost, clf_gbtree]:
    try:
        # Predict labels for test samples
        y_pred = clf.predict(X_test)
        # Calculate the f1 score
        f1 = f1_score(y_test, y_pred)
        #f1 = mean_squared_error(y_test, y_pred, squared=False)
    except Exception as e:
        # Handle exception (e.g., a classifier is still None)
        f1 = '---'
    # Print classifier name and the f1 score
    print('{}:\t{:.3}'.format(type(clf).__name__, f1))
    

F1 scores for test data for all classifiers
KNeighborsClassifier:	0.769
LogisticRegression:	0.734
DecisionTreeClassifier:	0.776
RandomForestClassifier:	0.887
AdaBoostClassifier:	0.787
GradientBoostingClassifier:	0.894


**3.1 b) Briefly discuss the results and the observations you made during the training! (3 Points)** You can refer to the runtime, f1 scores, or any other interesting or surprising observation you have made.

1. GradientBoostingClassifier is the slowest of all six classifiers; RandomForestClassifier is the second slowest (roughly 60% of the training time of GradientBoostingClassifier).
2. Regarding the F1 score, GradientBoostingClassifier and RandomForestClassifier outperform the other classifiers by about 0.10 or more.
3. KNeighborsClassifier performs similarly to DecisionTreeClassifier and AdaBoostClassifier, but it takes the least time to train.

### 3.2 Hyperparameter Tuning (7 Points)

The results of the different models will vary quite a bit, but of course, we used only default parameters of each implementation which might or might not be good for our dataset and task. In practice, you would perform hyperparameter tuning for all or at least most models. However, we won't do this here since the tuning process is very similar for each model. So we do it only for one model: **AdaBoost** (which shows a comparatively poor performance with the default values).

#### 3.2 a) Perform hyperparameter tuning for AdaBoost (using Decision Trees)! (4 Points)

**Important hints:**

* Use [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)! It automatically performs k-fold cross-validation (by default: k=5, which is fine) for all specified combinations of hyperparameter values. With [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), finding the best model (i.e., the model with the best hyperparameter values) should require only very few lines of code!
* [`AdaBoostClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) with (`base_estimator=DecisionTreeClassifier()`) provides a whole range of hyperparameters. Pick a **maximum of 3 hyperparameters** to tune the model!

In [73]:
%%time

model = None

#########################################################################################
### Your code starts here ###############################################################
parameters = {'base_estimator__max_depth':[2, 4, 6, 8, 10], 'base_estimator__max_features':[5, 8, 10], 'algorithm': ('SAMME', 'SAMME.R')}
clf_adaboost = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())

model = GridSearchCV(clf_adaboost, parameters)
model.fit(X_train, y_train)
print(model.get_params())
# sorted(model.cv_results_.keys())           
### Your code ends here #################################################################
######################################################################################### 

# Store the parameters of the best model
best_params = model.best_params_

# Predict class labels of test data on the model with the best found parameters
y_pred = model.predict(X_test)

# Calculate the f1 score
best_f1 = f1_score(y_test, y_pred, average='macro')

print('Best AdaBoost (with Decision Tree) classifier: {} (f1 score: {:.3f})'.format(best_params, best_f1))

Best AdaBoost (with Decision Tree) classifier: {'algorithm': 'SAMME', 'base_estimator__max_depth': 4, 'base_estimator__max_features': 10} (f1 score: 0.925)
CPU times: user 3min 46s, sys: 1.53 s, total: 3min 47s
Wall time: 3min 49s


#### 3.2 b) Briefly discuss your process of finding the best hyperparameter values and the results (3 Points). 

Interesting points may include the choice of values for the grid search (and required changes), the improvements compared to the results for the default parameters in 3.1 a), the overall time required to find the best hyperparameter values, or any other interesting or surprising observations you have made during this task.

1. I used `model.get_params()` to look through all parameters and focused on numerical ones, such as `base_estimator__max_depth` and `base_estimator__max_features`.
2. At first, I tried widely spaced values for these two parameters to find the best region. Then I searched values near the best ones to refine the result.
3. With `base_estimator__max_depth` set to 4 and `base_estimator__max_features` set to 10, the F1 score increased considerably (from 0.787 to 0.925). Note that 0.925 is a macro-averaged F1 score, whereas the 0.787 from 3.1 is the binary F1 score, so the two values are not strictly comparable.
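
The double-underscore naming used for nested parameters can be illustrated with a small stand-alone sketch. It uses synthetic data from `make_classification` and a scaler-plus-tree pipeline instead of AdaBoost, purely as an assumption to keep the example short; the step name `tree` and the grid values are made up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (not the grid stability dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('tree', DecisionTreeClassifier(random_state=0)),
])

# '<step name>__<parameter>' addresses a parameter of a nested estimator
param_grid = {
    'tree__max_depth': [2, 4, 6],
    'tree__min_samples_leaf': [1, 5],
}

# scoring='f1' makes the search optimize the same metric used for evaluation
search = GridSearchCV(pipe, param_grid, scoring='f1', cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Without a `scoring` argument, `GridSearchCV` falls back to the estimator's own `score` method, which for classifiers is accuracy, so the selected parameters are then not necessarily the ones with the best F1 score.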

### 3.3 Handling Overfitting (6 Points)

Suppose your model is experiencing low training error but high test error, i.e. overfitting. For each of the 6 models listed in 3.1, select at least one of its hyperparameters (i.e., any input argument that could be passed to its scikit-learn implementation) and state how the hyperparameter could be modified to address the problem of overfitting. Add a brief(!) explanation to justify your choice!

1. KNeighborsClassifier: use more neighbors by increasing `n_neighbors`; a value around `log(number of samples)` can serve as a starting point. Averaging over more neighbors smooths the decision boundary.
2. LogisticRegression: penalize extreme parameter weights more strongly by decreasing `C` (the inverse regularization strength); `penalty` can be set to `'l2'` (the default), `'l1'`, or `'elasticnet'`. Regularization is a way of finding a good bias-variance tradeoff by tuning the complexity of the model. It is a useful method to handle collinearity, filter out noise from the data, and eventually prevent overfitting.
3. DecisionTreeClassifier: use a smaller value of `max_depth` or a larger value of `ccp_alpha` to prune the tree. Overfitting occurs when the tree grows so deep that it fits individual data points, so pruning helps to prevent it.
4. RandomForestClassifier: use a larger value of `min_samples_split`, which is the minimum number of samples required to split an internal node. The methods mentioned above for DecisionTreeClassifier are also useful here.
5. AdaBoostClassifier: use a smaller value of `n_estimators` to stop early and avoid a 'perfect fit' of the training data.
6. GradientBoostingClassifier: use larger values of `min_samples_split` and `min_samples_leaf`. Higher values prevent the model from memorizing single data points or very small groups of data points.
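
The effect of an L2 penalty mentioned in item 2 can be sketched with NumPy. For brevity, the example uses L2-regularized least squares (which has a closed form) rather than Logistic Regression, and the data is randomly generated; `ridge_fit` and `lam` are hypothetical names. Note that `lam` is a penalty strength, whereas `C` in scikit-learn is its inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=30)

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized least squares (no bias term, for brevity)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# A stronger penalty shrinks the weight vector towards zero
lams = (0.0, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge_fit(X_demo, y_demo, lam))) for lam in lams]
for lam, norm in zip(lams, norms):
    print('lam={:6.1f}  ||w||={:.3f}'.format(lam, norm))
```

Smaller weights mean a less flexible model, which is the mechanism by which stronger regularization trades a little bias for lower variance.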
