<img src='https://weclouddata.com/wp-content/uploads/2016/11/logo.png' width='30%'>
-------------

<h3 align='center'> Applied Machine Learning Course - In-class Lab Week 1 </h3>
<h1 align='center'> Linear Regression </h1>

<br>
<center align="left"> Developed by:</center>
<center align="left"> WeCloudData Academy </center>


### Packages Used:

- [NumPy](http://www.numpy.org/): a fundamental package for scientific computing with Python.
- [seaborn](https://seaborn.pydata.org/): statistical data visualization.
- [Bokeh](http://bokeh.pydata.org/en/latest/): a Python interactive visualization library that targets modern web browsers for presentation.
- [Matplotlib](https://matplotlib.org/): a Python 2D plotting library.
- [Sklearn](http://scikit-learn.org/stable/): Machine Learning tools in Python.
- [SciPy](https://www.scipy.org/): a Python-based ecosystem of open-source software for mathematics, science, and engineering.
- [JSAnimation](https://github.com/jakevdp/JSAnimation): an HTML/Javascript writer for Matplotlib animations.
-------

# Content

- [Optimization](#Optimzation)
  - Minimizing a quadratic equation using calculus
  - Minimizing a quadratic equation using Gradient Descent
- [Linear regression](#Linear-regression)
  - Data preparation
  - Train linear regressor analytically
  - Train linear regression using sklearn

In [1]:
# loading necessary libraries and setting up plotting libraries
import numpy as np
import seaborn as sns

import bokeh.plotting as bp
import matplotlib.pyplot as plt
import matplotlib.animation as animation

from sklearn.datasets import make_regression
from scipy import stats 
from bokeh.models import  WheelZoomTool, ResetTool, PanTool
from bokeh.layouts import gridplot
from JSAnimation import IPython_display

W = 590
H = 350
bp.output_notebook()

%matplotlib inline

# Optimization <a id="Optimzation"></a>

## $\Delta$ 1. Minimizing a quadratic equation using calculus

Let's suppose we have a simple quadratic function, $f(x) = x^2 - 6x + 5$, and we want to find the minimum of this function. We can solve this analytically using calculus, by finding the derivative and setting it to zero:

$$\begin{align}
f'(x) = 0\\
\end{align}$$


### Define the quadratic function

In [2]:
# Define the function and data
x = np.linspace(-15,21,100) # 100 evenly spaced numbers over [-15, 21].
y = x**2-6*x+5

### $\Omega$ Practice 1.1: Calculate the gradient $f'(x)$

Since the function $f(x)$ has only one variable $x$, the gradient is the same as the derivative $f'(x)$.

In [3]:
### Define derivative function

def f_derivative(x):
    # TODO: you need to implement this function so it returns f'(x)
    raise NotImplementedError("Practice 1.1: return f'(x)")

### $\Omega$ Practice 1.2: Verify your gradient

Find the optimal value of $x$ that minimizes the given equation $f(x)$ by solving the equation $f'(x) = 0$. You can verify if your calculated optimal value $x$ is correct by simply plotting the function $f(x)$, and see whether the function indeed reaches its minimum at your calculated value $x$.

The following code snippet plots the function.

In [None]:
# plot the function
plt.plot(x, y)
plt.xlim(-3, 10)
plt.ylim(-5, 10)
plt.plot(3, -4, 'go')
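
As an optional cross-check, the sketch below compares an analytic derivative with a central finite-difference approximation and then solves $f'(x) = 0$ directly. The helper names `f`, `f_prime`, and `numeric_derivative` are illustrative assumptions and are not part of the lab's starter code.

```python
import numpy as np

def f(x):
    """The quadratic from this section: f(x) = x^2 - 6x + 5."""
    return x**2 - 6*x + 5

def f_prime(x):
    """Analytic derivative of f (hypothetical helper name)."""
    return 2*x - 6

def numeric_derivative(func, x, h=1e-5):
    """Central finite-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2 * h)

# The analytic and numeric derivatives should agree at arbitrary points.
points = np.array([-2.0, 0.0, 3.0, 7.5])
max_gap = np.max(np.abs(f_prime(points) - numeric_derivative(f, points)))

# Solving 2x - 6 = 0 gives the stationary point; f there is the minimum value.
x_star = 6 / 2
print(x_star, f(x_star), max_gap)
```

The stationary point and its function value should match the green marker drawn at $(3, -4)$ in the plot above.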

## $\Delta$ 2. Minimizing a quadratic equation using Gradient Descent


When the function is as simple as the quadratic function $f(x) = x^2 - 6x + 5$, we can find the optimal $x$ by solving the equation $f'(x)=0$ analytically. We say such functions have closed-form solutions.

However, in the world of machine learning, most of the functions that we want to optimize are high dimensional (with a lot of variables) and very complicated. Therefore, we need a more generic method that can help us find an optimal (usually only locally optimal) solution.

Gradient Descent is one of those generic methods for finding a local minimum of a given function. It is an iterative **optimization algorithm** based on steepest descent: to find a local minimum, you start at a random point and repeatedly move in the direction of steepest descent, i.e. opposite to the gradient, the direction in which the function goes down (hence, **descent**).

In this example, let's suppose we start at $x = 15$. 

We then calculate the gradient $f'(x) = 2x - 6$ (this should be your answer to **Practice 1.1** above), which at this point is $2 \times 15 - 6 = 24$. We update our estimate of the optimal $x$ by

$$\begin{align}
x'=x-\alpha f'(x)\\
\end{align}$$

$\alpha$ is the **step size** (also called the learning rate) of our optimization process; it controls how aggressively we update the value of $x$. It has to be chosen carefully: a value that is too small results in a long computation time, while a value that is too large can overshoot the minimum or even fail to converge.
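
To make the effect of the step size concrete, here is a small sketch that runs the update rule for a fixed number of iterations with three different values of $\alpha$. The function name `run_gradient_descent` and the three step sizes are assumptions chosen for illustration only.

```python
def run_gradient_descent(start_x, step_size, n_steps):
    """Apply x' = x - step_size * f'(x) n_steps times for f(x) = x^2 - 6x + 5."""
    x = start_x
    for _ in range(n_steps):
        x = x - step_size * (2 * x - 6)
    return x

# Same starting point and iteration budget, three different step sizes.
too_small = run_gradient_descent(15, step_size=0.001, n_steps=100)
moderate = run_gradient_descent(15, step_size=0.1, n_steps=100)
too_large = run_gradient_descent(15, step_size=1.1, n_steps=100)

print(f"step 0.001 -> x = {too_small:.3f}  (still far from 3)")
print(f"step 0.1   -> x = {moderate:.3f}  (close to 3)")
print(f"step 1.1   -> x = {too_large:.3e}  (diverging)")
```

For this particular quadratic, each update multiplies the distance to $x = 3$ by $(1 - 2\alpha)$, so the iterates only approach the minimum when $0 < \alpha < 1$. Other functions have different safe ranges.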

In this example, we'll set the step size to 0.001, which means we'll subtract $24 \times 0.001 = 0.024$ from 15, giving $14.976$. This is now our new temporary local minimum. We continue this process until either the estimate barely changes after we subtract the gradient times the step size, or we have completed a pre-set number of iterations.

The algorithm stops when the new and the previous temporary minimum differ by no more than 0.001; if we need more precision, we can decrease this value. With these settings, gradient descent reports a local minimum at about $3.5$, which is not too far off from the true minimum at $x = 3$.
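
The loop described above can be sketched in a few lines. This is one possible implementation, assuming the step size 0.001 and precision 0.001 used in the initialization cell below; the names `x_old`, `x_new`, and `first_update` are illustrative.

```python
def f_prime(x):
    """Derivative of f(x) = x^2 - 6x + 5."""
    return 2 * x - 6

step_size = 0.001
precision = 0.001

x_old, x_new = 0.0, 15.0      # previous and current estimate of the minimum
first_update = None
n_iter = 0

# Keep updating until two consecutive estimates differ by at most `precision`.
while abs(x_new - x_old) > precision:
    x_old = x_new
    x_new = x_old - step_size * f_prime(x_old)
    if first_update is None:
        first_update = x_new  # remember the estimate after the very first step
    n_iter += 1

print(round(first_update, 3))
print(round(x_new, 2), n_iter)
```

Note that the stopping rule measures the length of the last step, not the distance to the true minimum. Here the loop ends once $0.001 \times |2x - 6| \le 0.001$, i.e. once $|2x - 6| \le 1$, which is why the final estimate sits near $3.5$ and not at $3$. A smaller `precision` moves it closer.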

In [None]:
### Initialization

old_min = 0
temp_min = 15
step_size = 0.001  # small step size
precision = 0.001

### $\Omega$ Practice 2.1. Implement updating $x'=x-\alpha f'(x)$

In [None]:
def update_x(gradient, step_size, current_x):
    # TODO: implement the logic of updating x using the given step_size and gradient: x' = x - \alpha * f'(x)
    raise NotImplementedError("Practice 2.1: return the updated x")

### $\Omega$ Practice 2.2. Implement the iterative gradient descent optimization process

In [None]:
### Gradient Updates

mins = [] # a list to keep track of minimum
cost = [] # a list to keep track of cost function

while abs(temp_min - old_min) > precision:
    old_min = temp_min 
    gradient = f_derivative(old_min)  # calculate gradient
    
    # make update to get the newly estimated best x so far
    temp_min = update_x(step_size=step_size, gradient=gradient, current_x=old_min) 
    
    cost.append((3-temp_min)**2)      # append squared distance to the true minimum (x = 3)
    mins.append(temp_min)             # append the current estimate of the minimum


print("Local minimum occurs at {}.".format(round(temp_min,2))) 

### Plot the gradient update steps

We can visualize the gradient descent by plotting all temporary local minima on the curve. As you can see, the improvement decreases over time; at the end, the local minimum barely improves.

In [None]:
def init():
    line.set_data([], [])
    return line,

def animate(i):
    x_n = mins[0::10][i]          # show every 10th estimate
    y_n = x_n**2-6*x_n+5
    line.set_data([x_n], [y_n])   # set_data expects sequences, not scalars
    return line,

fig = plt.figure(figsize=(10, 6))
ax = plt.axes(xlim=(-15, 21), ylim=(-50, 350))
ax.plot(x,y, linewidth=4 )
line, = ax.plot([], [], "D", markersize=12)
animation.FuncAnimation(fig, animate, init_func=init,
                        frames=len(mins[0::10]), interval=200)

### Plot the squared distance to local minimum

Another important check on gradient descent is that there should be a visible improvement over time. In this example, we plot the squared distance between each estimate produced by gradient descent and the true minimum against the iteration in which the estimate was calculated. As we can see, the distance gets smaller over time, but barely changes in later iterations. This measure of distance is often called the cost or loss, but its implementation differs depending on what function you're trying to minimize.

In [None]:
TOOLS = [WheelZoomTool(), ResetTool(), PanTool()]

x_iter, y_distance = (zip(*enumerate(cost)))
s1 = bp.figure(width=W, 
               height=H, 
               title='Squared distance to true local minimum', 
               tools=TOOLS,
               x_axis_label = 'Iteration',
               y_axis_label = 'Distance'
)
s1.line(x_iter, y_distance, color="navy", alpha=0.5, line_width=3)
s1.title.text_font_size = '16pt'
s1.yaxis.axis_label_text_font_size = "14pt"
s1.xaxis.axis_label_text_font_size = "14pt"


bp.show(s1)

# Linear regression

1. Download practice data from [Kaggle's house sale prediction dataset](https://www.kaggle.com/harlfoxem/housesalesprediction#kc_house_data.csv).
2. Unzip the zip file and save the CSV.

In this practice, we will use some of the features (columns) in this CSV file to predict the `price` column. This is a regression problem, because our target is a continuous value.

## $\Delta$ 3. Data preparation

In [4]:
# read data from csv using pandas dataframe

import pandas as pd

df = pd.read_csv('kc_house_data.csv')

### $\Omega$ Practice 3.1. Inspect the data frame

In [5]:
# TODO: print out a snippet of the data in the data frame


In [None]:
# TODO: sample 10000 rows from df


In [None]:
# TODO: print out all column names and their data types


###  Choosing a subset of columns as features

In [None]:
# take a subset of the columns for practice.
df_sub = df[['price', 'bedrooms', 'bathrooms', 'sqft_living', 'floors']]
print(df_sub.columns)

Since we will predict the `price` column, we need to separate it from the "feature" columns. Conventionally, the data we use (i.e., the matrix whose rows are the training samples and whose columns are the features) are represented by variable $X$ (capitalized as this is a matrix) and the targets are represented by variable $y$ (in lowercase as this is a vector).

In [None]:
y = df_sub['price']
X = df_sub.drop(columns=['price'])

print(X.shape)
print(y.shape)

After the split above, the four columns **"bedrooms", "bathrooms", "sqft_living", and "floors"** are the so-called "features", and **"price"** is our target.

###  Adding the dummy variable $x_0$

As introduced in class, to simplify calculation, we usually add a dummy variable $x_0=1$ in addition to all the existing columns in $X$.

In [None]:
X = np.column_stack([np.ones(X.shape[0]), X]) # x0=1
y = np.array(y)

In [None]:
print(X.shape)
print(y.shape)
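
To see why the dummy variable simplifies the calculation, the sketch below checks on hypothetical toy numbers that "intercept plus weighted features" and "one matrix product with a column of ones" give the same predictions. All values and names here (`features`, `weights`, `X_aug`) are made up for illustration.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
intercept = 10.0
weights = np.array([0.5, -1.0])

# Without the dummy column, the intercept is added separately.
pred_separate = intercept + features @ weights

# With x0 = 1 prepended, the intercept becomes the first entry of theta.
X_aug = np.column_stack([np.ones(features.shape[0]), features])
theta = np.concatenate([[intercept], weights])
pred_single = X_aug @ theta

print(X_aug.shape)
print(np.allclose(pred_separate, pred_single))
```

With the extra column, the whole model is a single product of the data matrix and one coefficient vector, which is the form the closed-form solution works with.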

### Train-test split of the dataset

To validate the quality of our model, we need to train the linear regressor on a set of data (usually called training data) and evaluate on a different set (usually called validation or test set). 

A common convention is to split the data set into 80% training and 20% validation/test.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# Check the splitting result by printing out the shapes of our training/testing data and targets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## $\Delta$ 4. Train linear regressor analytically (Advanced)

Although we have more than one feature (the columns) in our training data $X$, this is still a linear regression problem, simply with more variables. Therefore, as mentioned in class, there is still a **closed-form** solution for the optimal coefficients $\theta^*$, given by the formula below.

$$\begin{align}
H = X\theta \\
\theta^* = (X^TX)^{-1}X^Ty \\
\end{align}$$

$\theta^*$ is the set of coefficients, one for the dummy variable $x_0$ (the intercept) and one for each of our features **"bedrooms", "bathrooms", "sqft_living", and "floors"**, which yields the lowest mean squared error on our training data $X$.
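
As a rough illustration of the closed-form idea on data where the answer is known, the sketch below generates hypothetical synthetic data and recovers the coefficients. Note that it deliberately swaps the explicit inverse for `np.linalg.solve` (solving $(X^TX)\theta = X^Ty$ as a linear system) and cross-checks with `np.linalg.lstsq`. These are alternatives to the formula above, not the method the practice asks for, and all names (`X_demo`, `y_demo`, `true_theta`) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 200 samples, a column of ones plus 2 features.
n_samples = 200
X_demo = np.column_stack([np.ones(n_samples), rng.normal(size=(n_samples, 2))])
true_theta = np.array([4.0, 2.0, -3.0])
y_demo = X_demo @ true_theta + rng.normal(scale=0.1, size=n_samples)

# Normal equation written as a linear system: (X^T X) theta = X^T y.
theta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver, which avoids forming X^T X explicitly.
theta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(theta_solve)
print(np.allclose(theta_solve, theta_lstsq))
```

Both routes should land close to `true_theta`; solving the system is generally preferred over computing an explicit inverse for numerical stability.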

### $\Omega$ Practice 4.1. Calculate $\theta^*$ using the formula

In [None]:
X_train

In [None]:
# TODO: calculate the optimal thetas
from numpy.linalg import inv
theta = None  # TODO: replace None with the closed-form expression for theta

In [None]:
# print out your coefficients
print(f'coefficients: {theta}')

## Train linear regressor using Sklearn

Fortunately, for most established machine learning algorithms, such as Linear Regression, we do not need to implement the optimization ourselves. Sklearn, as a general Python Machine Learning library, offers implementations of a wide range of commonly used models. All you need to do is import the appropriate module from sklearn and call `.fit(X, y)` to fit (train/optimize) your chosen model on the training data $X$ and its associated labels $y$.

In [None]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

### $\Omega$ Practice 4.2. Inspect the coefficients optimized by sklearn

In Practice 4.1, you may have used the closed-form solution to find the optimal set of coefficients $\theta^*$. In this part, you will need to find their counterparts optimized by sklearn.

Useful resources: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html (hint: Look at the examples)
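
One thing to keep in mind when comparing: `LinearRegression` fits its own intercept by default (`fit_intercept=True`), so when the data also contains a column of ones, the intercept shows up in `intercept_` and not in `coef_`. The sketch below illustrates this on hypothetical noiseless data; the names `X_demo`, `y_demo`, and `model` are assumptions, and the coefficient for the ones column is expected to come out as approximately zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: a column of ones plus 2 features, noiseless target.
features = rng.normal(size=(100, 2))
X_demo = np.column_stack([np.ones(100), features])
y_demo = 4.0 + features @ np.array([2.0, -3.0])

model = LinearRegression()   # fit_intercept=True by default
model.fit(X_demo, y_demo)

print(model.intercept_)      # plays the role of the x0 coefficient
print(model.coef_)           # one entry per column of X_demo
```

When lining these up with a closed-form $\theta^*$, compare `intercept_` with the first entry of $\theta^*$ and the remaining entries of `coef_` with the rest.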

In [None]:
# TODO: find out the coefficients and intercept optimized by the sklearn LinearRegression object `regressor` above.
# and compare them with \theta^* you got from Practice 4.1


### Evaluate linear regressor

Since we are now done with our **phase 1: training time** (i.e., we have finished model optimization on our training data), we move on to our **phase 2: validation time**, where we need to evaluate how good our trained model is on a validation set. This validation set is the `X_test`, and `y_test` we got from the **train-test split of the dataset** step above.

In order to evaluate the model quality on `X_test`, we need to first use our trained model `regressor` to predict the outcome on `X_test`.

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
# you can inspect what `y_pred` is. It is a 1-d vector (basically a list) of float numbers.
# and visually compare each individual prediction with the ground truth in `y_test`.
for i, yi in enumerate(y_pred):
    print(f'y_pred: {yi}\ty_truth: {y_test[i]}')

Then we can choose a set of evaluation metrics appropriate to our current task, i.e., regression, to compare our predicted values `y_pred` against the ground-truth `y_test`.

Sklearn, as a general Python Machine Learning package, offers common evaluation metrics out of the box. In class, we introduced **r2**, **mean absolute error**, and **mean squared error**. All of them can be found in `sklearn.metrics` module. Useful resources: https://scikit-learn.org/stable/modules/model_evaluation.html

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def evaluate(y_pred, y_test):
    r2 = r2_score(y_pred=y_pred, y_true=y_test)
    mse = mean_squared_error(y_pred=y_pred, y_true=y_test)
    mae = mean_absolute_error(y_pred=y_pred, y_true=y_test)

    print(f'r2: {r2}')
    print(f'mse: {mse}')
    print(f'mae: {mae}')

In [None]:
evaluate(y_pred, y_test)
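
For intuition about what these three numbers measure, here is a minimal NumPy sketch of their textbook definitions, applied to a tiny hand-checkable example. The function name `regression_metrics` and the toy values are assumptions for illustration; in the lab itself the `sklearn.metrics` functions are the ones to use.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R^2, MSE, and MAE directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)                     # mean squared error
    mae = np.mean(np.abs(residuals))                  # mean absolute error
    ss_res = np.sum(residuals ** 2)                   # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                        # coefficient of determination
    return {"r2": r2, "mse": mse, "mae": mae}

# Toy example: residuals are 1, 0, -1, -1.
metrics = regression_metrics(y_true=[3, 5, 7, 9], y_pred=[2, 5, 8, 10])
print(metrics)
```

In this toy case the squared errors sum to 3 and the total sum of squares is 20, so MSE and MAE are both 0.75 and $R^2 = 1 - 3/20 = 0.85$.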

### Train linear regressor with standardized features

In general, it is a good idea to perform certain preprocessing steps on your training data, such as feature standardization and normalization, imputation, and outlier removal. This matters especially for models that are sensitive to outliers or to the magnitude of individual features; Linear Regression, for example, is sensitive to outliers, and linear models trained with gradient descent or regularization are also sensitive to feature scale. For other, more robust models, these steps may make little difference but rarely do harm. Therefore, it is generally recommended to perform these preprocessing steps unless proven otherwise.

sklearn's `preprocessing` module provides a variety of common preprocessing steps.

https://scikit-learn.org/stable/modules/preprocessing.html


In [None]:
# Use StandardScaler to standardize our training data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# fit this StandardScaler on our training data, i.e., estimate the mean and variance based on our training data
# (note: .fit() returns the scaler itself, so do not assign its result to X_train)
scaler.fit(X_train)

# use the fitted StandardScaler object "scaler" to transform X_train, i.e., to actually perform standardization on X_train
X_train = scaler.transform(X_train)

### the above two steps can be combined into one step (which is more commonly used than the two-step approach)
### X_train = scaler.fit_transform(X_train)

### $\Omega$ Practice 4.3. Fit LinearRegression model on this standardized training set now

In [None]:
# TODO: fit regressor on the standardized training data `X_train` now
# print out the new coefficients and intercept



### Evaluate the new model with standardized features

To evaluate our new model, similarly to what we have done in the **Evaluate linear regressor** step, we need to first produce predictions of `X_test` using our newly trained model.

**EXTREMELY IMPORTANT (READ CAREFULLY):** If you applied any preprocessing steps at training time, you need to perform the same steps at prediction time as well! Remember that in the **Building Blocks of Machine Learning Applications** diagram, in phase 2, we always apply exactly the same set of feature extraction steps to convert raw data into matrices before they are fed into our trained model. **However, while at training time you may have used `X_train = scaler.fit_transform(X_train)` to transform your training data, at prediction time you should NEVER use `.fit()` again!** Calling `.fit()` again on `X_test` would likely produce inconsistent preprocessing results and make your trained model invalid or unusable.

Instead, we simply use `X_test=scaler.transform(X_test)`


In [None]:
X_test = scaler.transform(X_test)
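
The following NumPy sketch mimics what "fit on train, transform both" means for standardization, using hypothetical toy arrays. It is a simplified illustration under the assumption that standardization uses the per-column mean and population standard deviation; it is not a substitute for `StandardScaler`.

```python
import numpy as np

# Hypothetical toy data: 3 training rows and 1 test row, 2 features.
train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# "fit": estimate per-column mean and standard deviation on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": apply the same training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Because the test row is expressed relative to the training statistics, its second feature shows up as well above the training range. Had the statistics been re-estimated on `test`, that information would be lost and the values would no longer match what the model saw during training.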

### $\Omega$ Practice 4.4. Evaluate the new model

In [None]:
# TODO: get the evaluation result on your new model by the same set of evaluation metrics: r2_score, mse, and mae
# Is the new performance better or worse than before?
