# Neoscholar Machine Learning Tutorials
### Session 04. Linear and Logistic Regression

### Contents
1. Linear Regression
2. Logistic Regression

### Aim
At the end of this session, you will be able to:
- Implement your first Machine Learning model for regression and classification
- Become more familiar with the scikit-learn library


### Outline
1. Linear Regression
    1.1 Basic Linear Regression
    1.2 Advanced Linear Regression
2. Logistic Regression

## 1. Linear Regression

We are going to explore both basic linear regression and more advanced linear regression with regularisation terms, i.e., LASSO, Ridge, and Elastic Net regression. The modelling process begins with importing the dataset and ends with model evaluation.

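This time we are going to practise linear regression with the Boston house price data bundled with scikit-learn's `datasets` module. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below need an older scikit-learn release.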

In [None]:
# Import libs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import sklearn.datasets as datasets
BOSTON_DATA = datasets.load_boston()
print(BOSTON_DATA)

Check what features this dataset contains.

In [None]:
# TODO: print feature names
print(BOSTON_DATA[None])

### Simple Exploratory Data Analysis

As we discussed before, EDA is one of the most important steps in implementing a machine learning model in practice. You not only have to understand the data you have, but also clean it accordingly. In this tutorial, we will visualise the data and then analyse the correlations between its variables.

First of all, let's define some useful functions.

In [None]:
# Function to combine the Boston features and target into a single DataFrame.
def add_target_to_data(dataset):
    df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
    print("Before adding target: ", df.shape)
    df['PRICE'] = dataset.target
    print("After adding target: {} \n {}\n".format(df.shape, df.head(2)))
    return df

In [None]:
# Function to visualise the relations between features and the target
def plotting_graph(df, features, n_row=2, n_col=5):
    fig, axes = plt.subplots(n_row, n_col, figsize=(16, 8))
    assert len(features) == n_row * n_col
    for i, feature in enumerate(features):
        row = i // n_col
        col = i % n_col
        sns.regplot(x=feature, y='PRICE', data=df, ax=axes[row][col])
    plt.show()

Apply the function `add_target_to_data()` to convert the dataset into a pandas `DataFrame`.

#### Visualisation

In [None]:
boston_df = add_target_to_data(BOSTON_DATA)
# TODO: print the first 5 samples of the dataset
print(None)

In [None]:
# Only ten features are demonstrated for simplicity
features = ['RM', 'ZN', 'INDUS', 'NOX', 'AGE', 'PTRATIO', 'LSTAT', 'RAD', 'CRIM', 'B']
plotting_graph(boston_df, features, n_row=2, n_col=5)

Correlation is a statistical measure of the association between two variables. It describes how one variable tends to change when the other changes.

#### Pearson vs Spearman correlation

Pearson and Spearman coefficients both measure correlation; they differ in the kind of relationship they capture.

Pearson correlation: Pearson correlation evaluates the linear relationship between two continuous variables.

Spearman correlation: Spearman correlation evaluates the monotonic relationship. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.
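
To make the difference concrete, here is a minimal sketch on a made-up dataset (the column names `x` and `y` and the exponential relationship are assumptions chosen for illustration, not part of the Boston data). Because `y` always increases with `x`, the ranks agree perfectly, while the relationship is far from a straight line.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: y grows monotonically, but not linearly, with x
toy_df = pd.DataFrame({'x': np.linspace(0, 10, 50)})
toy_df['y'] = np.exp(toy_df['x'])

# Pearson measures linear association; Spearman works on the ranks
pearson_r = toy_df.corr(method='pearson').loc['x', 'y']
spearman_r = toy_df.corr(method='spearman').loc['x', 'y']

print('Pearson: {:.3f}'.format(pearson_r))
print('Spearman: {:.3f}'.format(spearman_r))
```

The Spearman coefficient should come out as 1 because the ordering of `x` and `y` is identical, whereas the Pearson coefficient is noticeably lower.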

In [None]:
# Calculate the Pearson correlation matrix.
correlation_matrix = boston_df.corr(method='pearson').round(2)
correlation_matrix

Visualise the correlation matrix as a heat map.

In [None]:
sns.heatmap(correlation_matrix, cmap="YlGnBu")
plt.show()

### Basic Linear Regression

#### Dataset Split

We have already practised splitting a dataset into training and testing sets. Apply the `train_test_split()` function from `sklearn` to split your dataset with a ratio of 90:10 (training:testing).

In [None]:
from sklearn.model_selection import train_test_split

# TODO: split your dataset into training and testing sets.
# Set the random state to 17 so that everyone gets the same result
train_X, test_X, train_Y, test_Y = None

Further split your training set into training set and validation set with a ratio of 90:10.

In [None]:
# TODO: split your training set into training and validation sets.
# Set the random state to 17 so that everyone gets the same result
train_X, val_X, train_Y, val_Y = None
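
As a rough illustration of how two successive 90:10 splits partition a dataset, here is a minimal sketch on hypothetical toy arrays (`toy_X`, `toy_y` and all other `toy_*` names are made up for this example and are not the Boston data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 3 features each
toy_X = np.arange(300).reshape(100, 3)
toy_y = np.arange(100)

# First split: 90% kept for training/validation, 10% held out for testing
toy_rest_X, toy_test_X, toy_rest_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=17)

# Second split: 90% of the remainder for training, 10% for validation
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_rest_X, toy_rest_y, test_size=0.1, random_state=17)

print(toy_train_X.shape, toy_val_X.shape, toy_test_X.shape)
```

With 100 samples this leaves 81 for training, 9 for validation and 10 for testing, so the overall proportions are roughly 81:9:10 rather than 90:10.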

#### Train and evaluate the basic Linear Regression model

Let's train your model.

In [None]:
from sklearn.linear_model import LinearRegression

# Initialize your linear regression model
lr_model = LinearRegression()
# Train!
lr_model.fit(train_X, train_Y)

Then, we should evaluate it on the validation set.

In [None]:
from sklearn.metrics import mean_squared_error
# Make predictions with validation data!
preds = lr_model.predict(val_X)
# What is the MSE between the true values and your predictions?
lr_mse = mean_squared_error(val_Y, preds)
print('LR_MSE: {0:.4f}'.format(lr_mse))
# Sort the regression coefficients.
coeff = pd.Series(data=lr_model.coef_, index=train_X.columns).sort_values(ascending=False)
print(coeff)
# Todo: Which feature is the most important?
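
The mean squared error used above is the average of the squared differences between the true values and the predictions. A small sketch with made-up numbers (`toy_true` and `toy_pred` are hypothetical) shows that computing it by hand gives the same value as `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions
toy_true = np.array([3.0, 5.0, 2.0])
toy_pred = np.array([2.5, 5.0, 4.0])

# MSE by hand: mean of the squared residuals
mse_by_hand = np.mean((toy_true - toy_pred) ** 2)
# MSE from scikit-learn
mse_sklearn = mean_squared_error(toy_true, toy_pred)

print('By hand: {:.4f}, sklearn: {:.4f}'.format(mse_by_hand, mse_sklearn))
```

Because the errors are squared, the single large miss (2.0 vs 4.0) dominates the result.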

Plot the predicted price vs. the expected (true) price.

In [None]:
plt.scatter(val_Y, preds)
plt.plot([0, 50], [0, 50], '--k')
plt.xlabel('Expected price')
plt.ylabel('Predicted price')
plt.tight_layout()

#### Manually implement basic single-variable linear regression

Recall how the ordinary least squares (OLS) method estimates the linear regression parameters. Implement the `paramEstimates(x, y)` function so that it estimates the parameters alpha and beta as follows:
\begin{align}
\hat{\beta} & =  \frac{\sum_{i=1}^n x_i\left(y_i - \bar{y} \right)}{\sum_{i=1}^n x_i\left(x_i - \bar{x} \right)}\\
\hat{\alpha} & = \bar{y}-\hat{\beta}\bar{x}
\end{align}
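
The expression for $\hat{\beta}$ may look different from the more familiar covariance-over-variance form. As a short check, the two agree because deviations from the mean sum to zero, i.e. $\sum_{i=1}^n (y_i - \bar{y}) = 0$ and $\sum_{i=1}^n (x_i - \bar{x}) = 0$:
\begin{align}
\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) & = \sum_{i=1}^n x_i\left(y_i - \bar{y}\right) - \bar{x}\sum_{i=1}^n \left(y_i - \bar{y}\right) = \sum_{i=1}^n x_i\left(y_i - \bar{y}\right)\\
\sum_{i=1}^n (x_i - \bar{x})^2 & = \sum_{i=1}^n x_i\left(x_i - \bar{x}\right) - \bar{x}\sum_{i=1}^n \left(x_i - \bar{x}\right) = \sum_{i=1}^n x_i\left(x_i - \bar{x}\right)
\end{align}
Dividing the first line by the second gives the same $\hat{\beta}$ in either form.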

You also need to complete the `linearRegr_Predict(x_train, y_train, x_test)` function, or write your own, so that it returns the output variable y given the input x as follows:
\begin{align}
\hat{y} & = \hat{\alpha}+\hat{\beta}x
\end{align}

For simplicity, we use only the most important feature, `RM`, as the regressor.

In [None]:
# Firstly, let's implement it using sklearn
# TODO: initialize your linear regression model
singleVar_lr_model = None
# TODO: Train!
singleVar_lr_model.None
print("The intercept (alpha) is: {}".format(singleVar_lr_model.intercept_))
print("The slope (beta) is: {}".format(singleVar_lr_model.coef_[0]))
# TODO: make predictions with validation data!
singleVar_preds_sk = singleVar_lr_model.None
# TODO: what is mse between the answer and your prediction?
singleVar_lr_mse_sk = None
print('LR_MSE_sk: {0:.4f}'.format(singleVar_lr_mse_sk))

In [None]:
# TODO: code the function to estimate the parameters according to the above equations
def paramEstimates(x, y):
    beta = None
    alpha = None
    return alpha, beta

def linearRegr_Predict(x_train, y_train,x_test):
    # TODO: Estimate the parameter by calling the paramEstimates function
    alpha, beta = None
    print("The intercept (alpha) is: {}".format(alpha))
    print("The slope (beta) is: {}".format(beta))
    # TODO: Apply your estimated parameters to implement the linear regression model: y=a+bx
    pred =  None
    return pred

singleVar_lr_preds_manual = linearRegr_Predict(train_X.loc[:, 'RM'], train_Y, val_X.loc[:, 'RM'])

# Now evaluate your model and compare its performance with the scikit-learn version
singleVar_lr_mse_manual = mean_squared_error(val_Y, singleVar_lr_preds_manual)
print('LR_MSE_manual: {0:.4f}'.format(singleVar_lr_mse_manual))

#### PCA

Now we are going to see how the model's performance changes if we apply PCA to reduce the original dataset from 13 features to 8 dimensions.

To keep the test data unseen from beginning to end, fit your PCA model on the training set only, then use the fitted model to transform the validation and testing sets.
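
The fit-on-train, transform-on-test pattern can be sketched on synthetic data as follows. The arrays `toy_train` and `toy_test`, and the choice of reducing 5 features to 2, are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data: 5 features, already split into train and test
rng = np.random.default_rng(0)
toy_train = rng.normal(size=(80, 5))
toy_test = rng.normal(size=(20, 5))

toy_pca = PCA(n_components=2)
# Learn the principal components from the training data only
toy_train_reduced = toy_pca.fit_transform(toy_train)
# Reuse the learned components on the unseen data (no refitting)
toy_test_reduced = toy_pca.transform(toy_test)

print(toy_train_reduced.shape, toy_test_reduced.shape)
```

Calling `fit_transform` on the test set instead would learn new components from the test data, which leaks information from data that should stay unseen.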

In [None]:
from sklearn.decomposition import PCA

# Todo: apply PCA to reduce the data dimensionality from 13 to 8
pcaModel = None
train_pca_X = pcaModel.None
val_pca_X = pcaModel.transform(val_X)
test_pca_X = pcaModel.None

Train a new model using the PCA-transformed dataset.

In [None]:
lr_model_pca = LinearRegression()
lr_model_pca.fit(train_pca_X, train_Y)

Evaluate your model.

In [None]:
preds_pca = lr_model_pca.predict(val_pca_X)
lr_pca_mse = mean_squared_error(val_Y, preds_pca)
print('LR_PCA_MSE: {0:.4f}'.format(lr_pca_mse))

In [None]:
plt.scatter(val_Y, preds_pca)
plt.plot([0, 50], [0, 50], '--k')
plt.xlabel('Expected price')
plt.ylabel('Predicted price')
plt.tight_layout()

### Advanced Linear Regression -  Ridge, Lasso and ElasticNet

In [None]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet
models = {
    "Ridge" : Ridge(),
    "Lasso" : Lasso(),
    "ElasticNet" : ElasticNet(),
}

In [None]:
pred_record = {}
for name, model in models.items():
    curr_model = model
    curr_model.fit(train_X, train_Y)
    preds = curr_model.predict(val_X)
    mse = mean_squared_error(val_Y, preds)
    print('{} MSE: {}'.format(name, mse))
    # Record predictions for every model
    pred_record.update({name : preds})
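
The three models above are fitted with scikit-learn's default settings. To illustrate how the penalties behave differently, here is a sketch on synthetic data in which only two of ten features actually influence the target (the data, the `toy_*` names and the value `alpha=0.5` are arbitrary assumptions). The L1 penalty in Lasso can set coefficients exactly to zero, whereas the L2 penalty in Ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Hypothetical data: the target depends only on the first two features
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 10))
toy_y = 3.0 * toy_X[:, 0] - 2.0 * toy_X[:, 1] + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=0.5).fit(toy_X, toy_y)
lasso = Lasso(alpha=0.5).fit(toy_X, toy_y)

# Count coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print('Ridge zero coefficients:', n_zero_ridge)
print('Lasso zero coefficients:', n_zero_lasso)
```

ElasticNet combines both penalties, so its behaviour typically falls between the two depending on its `l1_ratio` parameter.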

Now, let's compare these models' performance visually.

In [None]:
model_names = models.keys()
fig = plt.figure(figsize=(6, 8))
i=1
for model in model_names:
    prediction = pred_record[model]
    plt.subplot(310+i)
    plt.scatter(val_Y, prediction)
    plt.plot([0, 50], [0, 50], '--k')
    plt.xlabel('Expected price')
    plt.ylabel('Predicted price')
    plt.tight_layout()
    plt.title(model)
    i+=1

## 2. Logistic Regression

Useful videos:
1. [Andrew Ng's explanation 1](https://www.youtube.com/watch?v=-la3q9d7AKQ)
2. [Andrew Ng's explanation 2](https://www.youtube.com/watch?v=t1IT5hZfS48)
3. [Andrew Ng's explanation 3](https://www.youtube.com/watch?v=F_VG4LNjZZw)
4. [Andrew Ng's explanation 4](https://www.youtube.com/watch?v=HIQlmHxI6-0)

Logistic regression is a well-motivated approach to discriminative classification which leads to a smooth, convex, optimisation problem.  

Logistic regression is also a building block of neural networks: it is sometimes described as a single artificial neuron. We will come back to what this means when we cover Deep Learning.

#### In which case do we use classification?

Let's first generate a toy dataset that is suitable for classification.

In [None]:
def generateDataset(seed=0):
    np.random.seed(seed)
    n_samples= 100

    X = np.random.normal(size=n_samples)
    y = (X > 0).astype(float)  # np.float was removed in NumPy 1.24

    X[X > 0] *= 5
    X += .7 * np.random.normal(size=n_samples)
    X = X[:, np.newaxis]
    return X, y

X_train, y_train = generateDataset(seed=0)
plt.scatter(X_train, y_train)

What if our data looks like the plot above? Would you still use your linear regression model?  
Probably not. When your data has classes and your task is to assign each sample to a class, you normally use a classification method, and logistic regression is a good starting point for learning classification.  
Please do watch Andrew Ng's videos on logistic regression to fully understand the mathematics.  

Also, note that the term 'logistic regression' contains the word 'regression'.  
This is because logistic regression is a generalised linear model: it uses the same linear formula as linear regression, but passes the result through the `sigmoid` function to model the probability of a categorical outcome.
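
As a minimal sketch of that idea, the snippet below applies a sigmoid to a linear score. The parameter values `toy_alpha` and `toy_beta` and the inputs `toy_x` are hypothetical and chosen by hand, not estimated from data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and slope, picked for illustration only
toy_alpha, toy_beta = -1.0, 2.0
toy_x = np.array([-2.0, 0.5, 3.0])

# Same linear formula as in linear regression...
toy_score = toy_alpha + toy_beta * toy_x
# ...but squashed into a probability of belonging to class 1
toy_prob = sigmoid(toy_score)
# Thresholding the probability at 0.5 gives the predicted class
toy_label = (toy_prob >= 0.5).astype(int)

print(toy_prob.round(3), toy_label)
```

A score of zero maps to a probability of exactly 0.5, so the decision boundary is the point where the linear part changes sign.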

In [None]:
from sklearn.linear_model import LogisticRegression
# TODO: initialize your logistic regression model
logistic_clf = None
# TODO: Train!
logistic_clf.None

Now, let's generate another group of data to evaluate (test) our model.

In [None]:
from sklearn.metrics import classification_report,accuracy_score

X_test, y_test = generateDataset(seed=45)
# Generate predictions on the test set
preds_logistic = logistic_clf.None
# Evaluate the model
print('Accuracy on test set: '+str(accuracy_score(y_test,preds_logistic)))
print(classification_report(y_test,preds_logistic))

The function `compare_logistic_linear` takes an already fitted logistic regression model, fits a simple ordinary least squares linear regression model to the same data, and plots both on one figure to show visually why you should consider classification rather than regression for this kind of data.

In [None]:
from scipy.special import expit

def compare_logistic_linear(model, X_data, y_data):
    """
    This function plots the given data (X_data and y_data), fits a
    LinearRegression model to it, and draws the fitted line together with
    the probability curve of the given, already fitted, `model`.
    """
    plt.clf()
    plt.scatter(X_data.ravel(), y_data, color='black', zorder=20)
    X_test = np.linspace(-5, 10, 300)

    # Predicted probability of class 1 from the fitted logistic model
    prob = expit(X_test * model.coef_ + model.intercept_).ravel()
    plt.plot(X_test, prob, color='red', linewidth=3)

    # Ordinary Least Squares Linear Regression
    ols = LinearRegression()
    ols.fit(X_data, y_data)
    plt.plot(X_test, ols.coef_ * X_test + ols.intercept_, linewidth=1)
    plt.axhline(.5, color='.5')

    plt.ylabel('y')
    plt.xlabel('X')
    plt.xticks(range(-5, 10))
    plt.yticks([0, 0.5, 1])
    plt.ylim(-.25, 1.25)
    plt.xlim(-4, 10)
    plt.legend(('Logistic Regression Model', 'Linear Regression Model'),
               loc="lower right", fontsize='small')
    plt.tight_layout()
    plt.show()

In [None]:
compare_logistic_linear(logistic_clf, X_test, y_test)