# Linear Regression in scikit-learn - Introduction

This dataset concerns housing values in suburbs of Boston. The original dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. Here it is downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/).

Your goal is to create and train a model that estimates the median value of owner-occupied homes (the `MEDV` column).

### Dataset description (columns)

     1. CRIM     per capita crime rate by town
     2. ZN       proportion of residential land zoned for lots over 
                 25,000 sq.ft.
     3. INDUS    proportion of non-retail business acres per town
     4. CHAS     Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
     5. NOX      nitric oxides concentration (parts per 10 million)
     6. RM       average number of rooms per dwelling
     7. AGE      proportion of owner-occupied units built prior to 1940
     8. DIS      weighted distances to five Boston employment centres
     9. RAD      index of accessibility to radial highways
    10. TAX      full-value property-tax rate per 10,000 USD
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in 1000's of dollars
    

In [None]:
import pandas as pd
import numpy as np

Load and display data.

In [None]:
# Uncomment this if you are using Google Colab
#!wget https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LinearRegressionSKLearn/housing.csv

In [None]:
df = pd.read_csv('housing.csv')
df.head()

### Task 1
Select X (columns `['CRIM', 'TAX', 'RM']`) and y (column `MEDV`)

In [None]:
X = df[['CRIM', 'TAX', 'RM']]
X.head()

In [None]:
y = df['MEDV']
y.head()
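
The difference between the two selections in Task 1 can be seen on a small frame. The sketch below uses a made-up three-row table that only borrows the dataset's column names (the values and the names `toy`, `X_toy`, `y_toy` are illustrative, not taken from `housing.csv`). Indexing with a list of names gives a 2-D `DataFrame`, while indexing with a single name gives a 1-D `Series`.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    'CRIM': [0.1, 0.2, 0.3],
    'TAX': [300, 250, 400],
    'RM': [6.0, 5.5, 7.1],
    'MEDV': [24.0, 21.6, 34.7],
})

X_toy = toy[['CRIM', 'TAX', 'RM']]  # list of names -> DataFrame (2-D)
y_toy = toy['MEDV']                 # single name   -> Series (1-D)

print(type(X_toy).__name__, X_toy.shape)
print(type(y_toy).__name__, y_toy.shape)
```

scikit-learn estimators expect a 2-D feature matrix and a 1-D target, which is why `X` is selected with double brackets and `y` with single brackets.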

### Task 2
Split the data into two subsets, with `random_state` set to 1:
- train subset: 70% of data
- test subset: 30% of data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.3, random_state = 1)

print (X_train.shape)
print (X_test.shape)
print (y_train.shape)
print (y_test.shape)
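
To see what `test_size` and `random_state` do, here is a minimal sketch on synthetic stand-in data (10 samples, 2 features). The arrays and the names `X_demo`, `y_demo`, `split_a`, `split_b` are assumptions made for illustration and have nothing to do with the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 2 features (not the housing data).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with identical arguments, including random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)
split_b = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)  # 7 rows for training, 3 rows for testing

# A fixed random_state makes the shuffle repeatable, so both calls
# select exactly the same test rows.
print(np.array_equal(split_a[1], split_b[1]))
```

Fixing `random_state` is what makes later comparisons between models fair: every model is evaluated on the same held-out rows.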

### Task 3
Create and train a linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
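
After fitting, the learned weights are available as `coef_` and `intercept_`. The sketch below is a sanity check on synthetic, noise-free data rather than on the housing data: the target is built from known weights (an assumption made purely for illustration), so the fitted model should recover them up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features and a noise-free target: y = 2*x0 - 1*x1 + 0.5*x2 + 4
X_syn = rng.normal(size=(200, 3))
true_weights = np.array([2.0, -1.0, 0.5])
y_syn = X_syn @ true_weights + 4.0

demo_model = LinearRegression()
demo_model.fit(X_syn, y_syn)

print('coefficients:', demo_model.coef_)    # one weight per feature column
print('intercept:', demo_model.intercept_)  # constant term
```

On real data such as the housing set, the coefficients will not match any "true" values, but they can still be read the same way: one weight per column of `X`, in column order.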

### Task 4
Compute the $R^2$ coefficient for the train and test datasets. Use `model.score()` to do it.

$$R^2=1-\frac{\Sigma{(y-\hat{y})^2}}{\Sigma{(y-\overline{y})^2}}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $\overline{y}$ - mean value of `y`

In [None]:
print ('R2 train score:', model.score(X_train, y_train))
print ('R2 test score:', model.score(X_test, y_test))
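
The $R^2$ formula above can also be evaluated by hand. The following is a minimal NumPy sketch on a few made-up numbers; the helper name `r2_by_hand` and the sample values are hypothetical and not part of the notebook.

```python
import numpy as np

def r2_by_hand(y_true, y_hat):
    """Evaluate R^2 = 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]  # mean is 6.0

r2_perfect = r2_by_hand(y_true, y_true)                # perfect predictions
r2_mean = r2_by_hand(y_true, [6.0, 6.0, 6.0, 6.0])     # always predict the mean
r2_close = r2_by_hand(y_true, [4.0, 5.0, 7.0, 8.0])    # small errors

print(r2_perfect, r2_mean, round(r2_close, 3))
```

Perfect predictions give 1, always predicting the mean gives 0, and a model that does worse than the mean gives a negative value, which can happen on a test set.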

### MAPE - Mean Absolute Percentage Error

$$MAPE = \frac{100\%}{n} \sum{ \left\lvert{\frac{y-\hat{y}}{y}}\right\rvert}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $n$ - number of samples

In [None]:
y_pred = model.predict(X_train)
mape_train = 100*np.mean(np.abs(y_train - y_pred) / y_train)
print ('Train MAPE:', mape_train)
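
A worked example with three made-up values shows the arithmetic behind the cell above, including the factor of 100 that turns the fraction into a percentage. The numbers and variable names are illustrative only.

```python
import numpy as np

y_real = np.array([20.0, 25.0, 50.0])  # made-up true values
y_est = np.array([22.0, 20.0, 50.0])   # made-up predictions

# Relative error per sample: |20-22|/20 = 0.1, |25-20|/25 = 0.2, |50-50|/50 = 0.0
rel_err = np.abs(y_real - y_est) / y_real

# Mean of (0.1, 0.2, 0.0) is 0.1, i.e. 10 %
mape_demo = 100 * np.mean(rel_err)
print('MAPE: {:.2f}%'.format(mape_demo))  # MAPE: 10.00%
```

Because each error is divided by the true value, MAPE is undefined when any true value is zero. That is not an issue here as long as all `MEDV` values are positive.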

### Task 5
Create a function `mape` that returns the $MAPE$ value given $X$, $y$, and the model used to produce the $\hat{y}$ estimates. Then use your function to compute $MAPE$ for the train and test datasets.

In [None]:
def mape(model, X, y):
    """Return the mean absolute percentage error (in %) of `model` on (X, y)."""
    y_pred = model.predict(X)
    return 100 * np.mean(np.abs(y - y_pred) / y)

In [None]:
print ('Train MAPE:', mape(model, X_train, y_train))
print ('Test MAPE:', mape(model, X_test, y_test))
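
As a cross-check, scikit-learn ships its own MAPE metric. The sketch below assumes scikit-learn 0.24 or newer, where `sklearn.metrics.mean_absolute_percentage_error` is available. That function returns a fraction rather than a percentage, so it is multiplied by 100 to match the notebook's convention. The sample values are made up.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up positive targets and predictions.
y_known = np.array([10.0, 40.0, 80.0, 100.0])
y_guess = np.array([12.0, 38.0, 80.0, 90.0])

# Manual formula, as used in this notebook (result in percent).
manual_pct = 100 * np.mean(np.abs(y_known - y_guess) / y_known)

# Built-in metric returns a fraction; scale to percent for comparison.
builtin_pct = 100 * mean_absolute_percentage_error(y_known, y_guess)

print('manual:  {:.4f}%'.format(manual_pct))
print('builtin: {:.4f}%'.format(builtin_pct))
```

For strictly positive targets the two values agree, so either form can be used to verify a hand-written `mape` function.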


## Random forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)

print ('Train MAPE: {:.2f}%'.format(mape(model, X_train, y_train)))
print ('Test MAPE: {:.2f}%'.format(mape(model, X_test, y_test)))

### Task 6
Experiment with the `min_samples_leaf` parameter to reduce overfitting.

In [None]:
model = RandomForestRegressor(min_samples_leaf=16)
model.fit(X_train, y_train)

print ('Train MAPE: {:.2f}%'.format(mape(model, X_train, y_train)))
print ('Test MAPE: {:.2f}%'.format(mape(model, X_test, y_test)))
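
One way to organize the experiment is to loop over several candidate values and compare train and test errors side by side. The sketch below runs on synthetic data so that it is self-contained. The data, the helper `mape_pct`, the candidate list, and the use of `random_state=1` on the forest (added only so the run is repeatable) are all assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: positive, noisy target that depends on the first feature only.
X_syn = rng.uniform(0, 10, size=(300, 3))
y_syn = 20 + 3 * X_syn[:, 0] + rng.normal(scale=2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=1)

def mape_pct(model, X, y):
    """Mean absolute percentage error in percent (hypothetical helper)."""
    return 100 * np.mean(np.abs(y - model.predict(X)) / y)

results = {}
for leaf in [1, 2, 4, 8, 16, 32]:
    forest = RandomForestRegressor(min_samples_leaf=leaf, random_state=1)
    forest.fit(X_tr, y_tr)
    results[leaf] = (mape_pct(forest, X_tr, y_tr), mape_pct(forest, X_te, y_te))
    print('leaf={:2d}  train={:.2f}%  test={:.2f}%'.format(leaf, *results[leaf]))
```

A large gap between train and test error at small `min_samples_leaf` suggests overfitting. As the value grows, the train error rises and the gap narrows, and past some point both errors get worse, which suggests underfitting.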

### Task 7
Select all 13 features as $X$ and split the dataset into two subsets (use the same split ratio and random state as before).

In [None]:
df.head()

In [None]:
X = df.drop('MEDV', axis=1)
X.head()

In [None]:
y = df['MEDV']
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1)

### Task 8
Train and test a linear regression model. Compare the results with the previous ones.

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
print ('Train MAPE: {:.2f}%'.format(mape(model, X_train, y_train)))
print ('Test MAPE: {:.2f}%'.format(mape(model, X_test, y_test)))

### Task 9
Train and test a random forest model (keep all parameters at their defaults). Does your model suffer from overfitting or underfitting?

In [None]:
model = RandomForestRegressor()
model.fit(X_train, y_train)

print ('Train MAPE: {:.2f}%'.format(mape(model, X_train, y_train)))
print ('Test MAPE: {:.2f}%'.format(mape(model, X_test, y_test)))

### Task 10
Try modifying the `min_samples_leaf` parameter to get the best model possible.

In [None]:
model = RandomForestRegressor(min_samples_leaf=12)
model.fit(X_train, y_train)

print ('Train MAPE: {:.2f}%'.format(mape(model, X_train, y_train)))
print ('Test MAPE: {:.2f}%'.format(mape(model, X_test, y_test)))