# Introduction to Machine Learning with scikit-learn

#### What is machine learning?

There are many ways to describe machine learning; you can find one description at https://www.ibm.com/cloud/learn/machine-learning


I like how Andreas Mueller describes it in this YouTube video, https://www.youtube.com/watch?v=4PXAztQtoTg:
> Predictive Modeling 101: Make predictions of outcome of repeated events.
> Machine learning is useful when the frequency of the repetitive event is high, or the historical observations or data is large, and an individual mistake is not too costly

> All models are wrong, but some are useful - George Box

scikit-learn is a library of machine learning algorithms for Python, built on top of NumPy, SciPy, and Cython: https://scikit-learn.org/

Now let's build a model that predicts house prices, using the KC house data set from https://www.kaggle.com/harlfoxem/housesalesprediction


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# np.set_printoptions(suppress=True)

## Get the data

In [None]:
df = pd.read_csv('/kaggle/input/housesalesprediction/kc_house_data.csv')
print(f"sample size = {df.shape[0]}\nnumber of columns = {df.shape[1]}")
df.head()

Check the data types and summary statistics

In [None]:
df.info()

In [None]:
df.describe()

## Exploratory data analysis

Check the house price distributions

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize=(12,5))
ax1.hist(df['price']/1000000, bins=50)
ax2.hist(df['price']/1000000, bins=50)
ax2.set_xlim(0,2)
ax2.set_xlabel('price in millions')
ax3.boxplot(df['price']/1000000);

The plots above show that house prices have a long right tail: a few houses are very expensive, but most are priced below 2 million dollars. The boxplot flags the houses priced above 2 million as outliers, so we could remove those samples.

In [None]:
# percentage of houses priced above 2 million
(df['price'] > 2000000).mean() * 100

Houses priced above 2 million account for less than 1% of the data, so we can simply remove them

In [None]:
df = df[df['price'] <= 2000000]
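
For reference, here is a minimal sketch of the rule a boxplot uses to flag outliers, run on synthetic prices rather than the KC data (the lognormal sample and all variable names are assumptions for illustration). Matplotlib's default whiskers reach at most 1.5 times the interquartile range beyond the quartiles, so the resulting fence is usually tighter than the round 2 million cut-off we chose by eye.

```python
import numpy as np

# Synthetic, right-skewed "prices" (hypothetical data, not the KC dataset)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=13.0, sigma=0.5, size=10_000)

# Quartiles and interquartile range
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Default boxplot whiskers extend at most 1.5 * IQR beyond the quartiles
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

# Points outside the fences are the ones drawn as individual outlier markers
is_outlier = (prices > upper_fence) | (prices < lower_fence)
print(f"upper fence = {upper_fence:,.0f}")
print(f"share flagged as outliers = {is_outlier.mean() * 100:.2f}%")
```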

Check the correlations

In [None]:
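# numeric_only=True skips non-numeric columns such as date (pandas 2.0+ raises an error without it)
df.corr(numeric_only=True).style.background_gradient(cmap='coolwarm')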

The styled table above may not render its color grid everywhere (on GitHub, for example). Without the colors it is hard to compare the correlations, so let's also plot them as a heatmap.

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
sns.heatmap(df.corr(numeric_only=True).round(4), annot=True, ax=ax);

Check the correlations with house price

In [None]:
pd.DataFrame(df.corr(numeric_only=True)['price'].sort_values(ascending=False)).style.background_gradient(cmap='coolwarm')

From the table above we see that sqft_living and sqft_above have a very high correlation of 0.87. Let's visualize it below:

In [None]:
sns.jointplot(x='sqft_living', y='sqft_above', data=df);

sqft_living has the highest correlation with price, so we keep sqft_living and drop the highly correlated sqft_above from our features when training the model. Now let's visualize the relationship between sqft_living and price

In [None]:
sns.jointplot(x='sqft_living', y='price', data=df);

We also see that yr_built has a low correlation with price, which is a little counterintuitive, so let's plot the relationship between yr_built and price

In [None]:
sns.jointplot(x='yr_built', y='price', data=df);

The plot shows that the year built has no meaningful relationship with price

Plot the relationship between location and price

In [None]:
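fig, ax = plt.subplots(1, 2, sharey=True, figsize=(12, 6))
ax[0].scatter(x='lat', y='price', data=df, alpha=0.2)
ax[1].scatter(x='long', y='price', data=df, alpha=0.2)
# label the axes so each panel is readable on its own
ax[0].set_xlabel('lat')
ax[0].set_ylabel('price')
ax[1].set_xlabel('long');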

The plot shows that most of the expensive houses are located above latitude 47.5. We can add a feature that flags whether a house is above latitude 47.5, converting True to 1 and False to 0 so that it can be fed into the model.

In [None]:
df['above_47.5_lat'] = (df['lat'] > 47.5).astype(int)
df['above_47.5_lat'].value_counts().plot.bar()

Now let's compare its correlation with price:

In [None]:
df[['price', 'above_47.5_lat']].corr()

We can see that the correlation increased from 0.362183 for "lat" to 0.4367 for our new feature "above_47.5_lat"

## Select features

Which feature(s) do we want to pick?

Let's first build a simple model that considers only one feature, sqft_living. Then we will train the model on multiple features and compare the performance of the different approaches.

1. Model with only one feature: sqft_living
2. Model with multiple features: we already saw that sqft_above is highly correlated with sqft_living, so we exclude it and pick the remaining features whose correlation with price is above 0.2:
sqft_living, grade, sqft_living15, bathrooms, lat, view, bedrooms, sqft_basement, floors
3. Replace the "lat" feature with the re-engineered feature "above_47.5_lat"

Uncomment one of the lines below (and comment out the others) to try each feature selection approach

In [None]:
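# approach 1: single feature
# features = ['sqft_living']
# approach 2: features with correlation above 0.2
# features = ['sqft_living', 'grade', 'sqft_living15', 'bathrooms', 'lat', 'view', 'bedrooms', 'sqft_basement', 'floors']
# approach 3: replace lat with the engineered above_47.5_lat
features = ['sqft_living', 'grade', 'sqft_living15', 'bathrooms', 'above_47.5_lat', 'view', 'bedrooms', 'sqft_basement', 'floors']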
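
The correlation-threshold selection described above can also be done in code instead of by reading the table. This is only a sketch on a small synthetic frame: the column values, the `demo` frame, and the `threshold` variable are assumptions for illustration, and the redundant column is still dropped by hand, as we did for sqft_above.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the housing data (hypothetical values)
rng = np.random.default_rng(0)
n = 500
sqft_living = rng.normal(2000, 500, n)
demo = pd.DataFrame({
    "sqft_living": sqft_living,
    # nearly redundant with sqft_living
    "sqft_above": sqft_living * 0.9 + rng.normal(0, 50, n),
    # unrelated to price
    "noise": rng.normal(0, 1, n),
})
demo["price"] = 200 * demo["sqft_living"] + rng.normal(0, 50_000, n)

# Correlation of every column with the target, excluding the target itself
corr_with_price = demo.corr()["price"].drop("price")

# Keep features above the threshold, then drop the one judged redundant
threshold = 0.2
selected = corr_with_price[corr_with_price > threshold].index.tolist()
selected = [name for name in selected if name != "sqft_above"]
print(selected)
```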

Get the training features as X

In [None]:
X = df[features]#.to_numpy()
print(X.shape)
X.iloc[0]

Get the target values as y. The lowercase y denotes a vector rather than a matrix

In [None]:
y = df['price']#.to_numpy()
# y = y[:, None]
y.shape

## Split training and test data

Here we choose an 80/20 split: we use 80% of the samples to train the model and set aside 20% to test its predictions. We also set random_state so that we get the same random train/test split on every run, which keeps later model evaluations comparable

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
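
To see what random_state buys us, here is a small sketch on toy arrays (the `X_demo` and `y_demo` names and values are made up for illustration). Two calls with the same seed should return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples (hypothetical data) so the split sizes are easy to check
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# 8 training rows and 2 test rows for an 80/20 split of 10 samples
print(a_train.shape, a_test.shape)
# Same seed, same rows in each split
print(np.array_equal(a_train, b_train), np.array_equal(a_test, b_test))
```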

## Define a linear regression model

Using the default hyperparameters: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
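
To make "minimize the residual sum of squares" concrete, here is a sketch that solves the same least-squares problem with NumPy and compares the result with scikit-learn. The data is synthetic, and names such as `true_w` and `demo_model` are assumptions for illustration, not part of our house price model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known weights (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_w = np.array([3.0, -2.0, 0.5])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=200)

# Ordinary least squares with NumPy: prepend a column of ones for the intercept
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])
solution, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)
intercept_np, coef_np = solution[0], solution[1:]

# The same fit with scikit-learn
demo_model = LinearRegression().fit(X_demo, y_demo)

# Both approaches should agree up to floating point error
print(np.allclose(coef_np, demo_model.coef_))
print(np.isclose(intercept_np, demo_model.intercept_))
```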

In [None]:
# Build a model with all default parameters
model = LinearRegression()

## Fit/train the model
The scikit-learn API makes training a model simple: just call the fit method.

In [None]:
model.fit(x_train, y_train)

## Predict price using the test set

In [None]:
y_pred = model.predict(x_test)
y_pred

## Evaluate the model

In [None]:
# Coefficient of determination (R squared): how well the model replicates the observed outcomes.
# 1 is the perfect score; the score can be negative because a model can be arbitrarily worse.
print('R squared = ', model.score(x_test, y_test))

In [None]:
mean_absolute_error(y_test, y_pred)

In [None]:
print('Ratio of mean absolute error to the mean true outcomes: ', mean_absolute_error(y_test, y_pred) / y_test.mean())
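
Both metrics are easy to compute by hand, which helps when interpreting them. The sketch below uses four made-up prices and predictions (the `y_true` and `y_hat` values are hypothetical) and compares the manual results with scikit-learn's `mean_absolute_error` and `r2_score`. For regressors, `model.score` returns this same R squared.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Toy true values and predictions (hypothetical numbers)
y_true = np.array([300_000.0, 450_000.0, 520_000.0, 610_000.0])
y_hat = np.array([320_000.0, 430_000.0, 500_000.0, 650_000.0])

# Mean absolute error: the average size of the residuals
mae_manual = np.mean(np.abs(y_true - y_hat))

# R squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```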

Check the error distribution

In [None]:
sns.displot((y_pred - y_test).values, kde=True)
plt.title('prediction error/residual distribution')
plt.xlabel('prediction error')

We can see that the prediction errors are roughly normally distributed and centered around 0

## Results:

With the single feature sqft_living, our model got an R squared score of 0.465 and a mean_absolute_error of 153402, which is about 30% of the mean price

When we increased to 9 features, the score rose to 0.657, which is a big improvement, and the mean_absolute_error fell to 115007, which is about 22% of the mean price

We can improve further by re-engineering the lat feature into the new feature above_47.5_lat: the score rises to 0.685 and the mean_absolute_error falls to 110659, which is about 21% of the mean price

## Look inside the model: model parameters and how the model predicts price

Examine the model coefficients/theta(1-n) and intercept/theta0

In [None]:
theta = model.coef_
print("Model coefficients/theta(1-n):\n", theta)

In [None]:
theta0 = model.intercept_
print('The model intercept/theta(0): ', theta0)

How does the model predict the price?

The price is calculated as: $$h_{\theta}(X)=\theta_{0}+\theta_{1}x_{1}+\theta_{2}x_{2}+...+\theta_{n}x_{n}$$ 

where $\theta_{0}$ is the intercept and $\theta_{1}...\theta_{n}$ are the coefficients. This can be computed very efficiently as the matrix product of the coefficients and the features X, plus the scalar intercept:

```coefficients @ X + intercept``` 

Now let's calculate the prediction for the first test sample from the coefficients and intercept; it should match the first value in y_pred

In [None]:
test_input = x_test.head(1).values[0]
print('first test input is:\n', test_input)

predicted_price = theta @ test_input + theta0
print('\nCalculated prediction is:\t\t\t', predicted_price)

print('The first predicted price from y_pred is:\t', y_pred[0])
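
The same formula extends from one sample to the whole test set. Here is a sketch on synthetic data (the `X_demo` array and `demo_model` are stand-ins, not our house price model): with samples stored as rows, the product is written `X @ coefficients`, and a single matrix-vector product gives one prediction per row.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical); in the notebook this would be x_test
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([1.0, 2.0, -1.0, 0.5]) + 10.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# One matrix-vector product yields a prediction for every row
manual_pred = X_demo @ demo_model.coef_ + demo_model.intercept_

# Should match the model's own predictions up to floating point error
print(np.allclose(manual_pred, demo_model.predict(X_demo)))
```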

## Plot the predicted price and price against the input features

Let's first build a dataframe with test features, target label, and prediction

In [None]:
df_test = x_test.copy()
df_test['price'] = y_test
df_test.reset_index(inplace=True)
df_test['pred'] = y_pred.round()
columns = features + ['price','pred']
df_test[columns]

Plot the target price/prediction against top (3) features

In [None]:
def plot_scatter(ax, by):
    ax.scatter(df_test[by], df_test['price'], alpha=0.3, label='real_price')
    ax.scatter(df_test[by], df_test['pred'], alpha=0.3, label='prediction')
    ax.set_title('house price by ' + by)
    ax.legend()

# plot the top n features (at most 3)
n = min(3, len(features))
# squeeze=False always returns a 2D array of axes, even when n == 1
fig, axes = plt.subplots(1, n, sharey=True, figsize=(6 * n, 6), squeeze=False)
for i in range(n):
    plot_scatter(axes[0, i], features[i])

## Want to better understand how the model gets trained and the mathematics behind it?

In this exercise we built a LinearRegression model to predict new outcomes from historical outcomes or events. For more information on linear regression models, please refer to https://scikit-learn.org/stable/modules/linear_model.html

To better understand the mathematics behind the model, I highly recommend the popular machine learning course from Coursera: https://www.coursera.org/learn/machine-learning