<a href="https://colab.research.google.com/github/MarcinBadora1/Machine_Learning/blob/main/Linear_Regression%20-%20Sklearn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression in scikit-learn - Introduction

This dataset concerns housing values in the suburbs of Boston. The original dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University; here it is downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/).

Your goal is to create and train a model that can estimate the median value of owner-occupied homes (`MEDV`).

### Dataset description (columns)

     1. CRIM     per capita crime rate by town
     2. ZN       proportion of residential land zoned for lots over 
                 25,000 sq.ft.
     3. INDUS    proportion of non-retail business acres per town
     4. CHAS     Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
     5. NOX      nitric oxides concentration (parts per 10 million)
     6. RM       average number of rooms per dwelling
     7. AGE      proportion of owner-occupied units built prior to 1940
     8. DIS      weighted distances to five Boston employment centres
     9. RAD      index of accessibility to radial highways
    10. TAX      full-value property-tax rate per 10,000 USD
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in 1000's of dollars
    

In [None]:
import pandas as pd
import numpy as np

Load and display data.

In [None]:
# Download the dataset (needed on Google Colab; skip if housing.csv is already in the working directory)
!wget https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LinearRegressionSKLearn/housing.csv

--2020-10-28 06:08:18--  https://raw.githubusercontent.com/PrzemekSekula/DeepLearningClasses1/master/LinearRegressionSKLearn/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38448 (38K) [text/plain]
Saving to: ‘housing.csv.2’


2020-10-28 06:08:18 (7.87 MB/s) - ‘housing.csv.2’ saved [38448/38448]



In [None]:
df = pd.read_csv('housing.csv')
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


### Task 1
Select X (columns `['CRIM', 'TAX', 'RM']`) and y (column `MEDV`)

In [None]:
x = df[['CRIM', 'TAX', 'RM']]
x.head()

Unnamed: 0,CRIM,TAX,RM
0,0.00632,296.0,6.575
1,0.02731,242.0,6.421
2,0.02729,242.0,7.185
3,0.03237,222.0,6.998
4,0.06905,222.0,7.147


In [None]:
y = df['MEDV']

### Task 2
Split the data into two subsets:
- train subset: 70% of data
- test subset: 30% of data
- set random_state to 1
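
The effect of `test_size` and `random_state` can be checked in isolation. The following is a minimal sketch that uses a placeholder array with 506 rows (the row count implied by the split sizes in this notebook) instead of the housing data; `X_demo`, `y_demo` and the `a_*` / `b_*` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with 506 rows and 3 columns (not the housing dataset).
X_demo = np.arange(506 * 3).reshape(506, 3)
y_demo = np.arange(506)

# Two splits with identical arguments.
a_train, a_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)
b_train, b_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# Row counts of the train and test subsets.
print(a_train.shape, a_test.shape)

# A fixed random_state selects the same rows on every run.
print(np.array_equal(a_test, b_test))
```

Fixing `random_state` keeps later results comparable, because every model is trained and evaluated on the same rows.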

In [None]:
from sklearn.model_selection import train_test_split
  

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
   x, y, test_size=0.3, random_state=1)
print("X_train",X_train.shape)
print("X_test",X_test.shape)
print("Y_train",y_train.shape)
print("Y_test",y_test.shape)

X_train (354, 3)
X_test (152, 3)
Y_train (354,)
Y_test (152,)


### Task 3
Create and train a linear regression model.

In [None]:
from sklearn.linear_model import LinearRegression

# Fit on the training subset only, so the test subset stays unseen.
reg = LinearRegression().fit(X_train, y_train)


### Task 4
Compute the $R^2$ coefficient for the train and test datasets using `model.score()`.

$$R^2=1-\frac{\sum{(y-\hat{y})^2}}{\sum{(y-\overline{y})^2}}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $\overline{y}$ - mean value of `y`
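
To see how the formula maps to code, the sketch below computes $R^2$ term by term with NumPy and compares it with `model.score()` and `sklearn.metrics.r2_score`. It runs on synthetic stand-in data, not the housing dataset; `X_demo`, `y_demo` and `demo_model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (hypothetical; not the housing dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares).
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# All three values should agree up to floating-point error.
print(r2_manual, demo_model.score(X_demo, y_demo), r2_score(y_demo, y_hat))
```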

In [None]:
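reg.score(X_train, y_train)  # not displayed: a notebook cell echoes only its last expression

reg.score(X_test, y_test)  # R^2 on the test set (the value shown below)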

0.6901893330926419

### MAPE - Mean Absolute Percentage Error

$$MAPE = \frac{100\%}{n} \sum{ \left\lvert{\frac{y-\hat{y}}{y}}\right\rvert}$$

Where:
- $y$ - real `y` values
- $\hat{y}$ - model predictions
- $n$ - number of samples

In [None]:
y_pred = reg.predict(X_train)
mape_train = np.mean(np.abs((y_train - y_pred) / y_train)) * 100
print("train mape", mape_train)

print('train mape  {:.3f}%'.format(mape_train))

train mape 21.552430568659016
train mape  21.552%


### Task 5
Create a function `mape` that returns the $MAPE$ value given $X$, $y$ and the model used to create the $\hat{y}$ estimates. Then use your function to compute $MAPE$ for the train and test datasets.

In [None]:
def mape(model, X, y):
  y_pred = model.predict(X)
  return np.mean(np.abs((y-y_pred)/y))*100

In [None]:
print('train mape  {:.3f}%'.format(mape(reg, X_train, y_train)))
print('test mape  {:.3f}%'.format(mape(reg, X_test, y_test)))


train mape  21.552%
test mape  20.784%
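
As a sanity check, a hand-written `mape` can be compared with `sklearn.metrics.mean_absolute_percentage_error`. This sketch assumes scikit-learn 0.24 or later, where that function is available. It returns a fraction rather than a percentage, so the result is scaled by 100. The data is a synthetic stand-in with strictly positive targets; `X_demo`, `y_demo` and `demo_model` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error


def mape(model, X, y):
    """Mean absolute percentage error of model predictions, in percent."""
    y_pred = model.predict(X)
    return np.mean(np.abs((y - y_pred) / y)) * 100


# Synthetic stand-in data with strictly positive targets (hypothetical).
rng = np.random.default_rng(1)
X_demo = rng.uniform(1.0, 10.0, size=(200, 2))
y_demo = (5.0 + 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=200))

demo_model = LinearRegression().fit(X_demo, y_demo)

ours = mape(demo_model, X_demo, y_demo)
# scikit-learn returns a fraction, so scale by 100 to compare.
theirs = 100 * mean_absolute_percentage_error(y_demo, demo_model.predict(X_demo))
print('{:.3f}% vs {:.3f}%'.format(ours, theirs))
```

Both versions divide by `y`, so MAPE is only meaningful when the targets are not close to zero, which holds for house values.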


## Random forest regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train,y_train)

print('train mape  {:.3f}%'.format(mape(model,X_train,y_train)))
# test
print('test mape  {:.3f}%'.format(mape(model,X_test,y_test)))

train mape  7.348%
test mape  16.649%


### Task 6
Experiment with the `min_samples_leaf` parameter to avoid overfitting.
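
One way to run this experiment is to sweep several values and compare train and test MAPE side by side. The sketch below does that on synthetic stand-in data, not the housing dataset; the candidate values, the `Xd_*` / `yd_*` names and the `results` dictionary are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def mape(model, X, y):
    """Mean absolute percentage error of model predictions, in percent."""
    y_pred = model.predict(X)
    return np.mean(np.abs((y - y_pred) / y)) * 100


# Synthetic stand-in data with strictly positive targets (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(500, 3))
y_demo = (20.0 + 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1])
          + rng.normal(scale=2.0, size=500))

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

# Larger leaves mean smoother trees: train error rises as the forest
# stops memorizing noise, and the train/test gap tends to shrink.
results = {}
for leaf in [1, 5, 20, 50]:
    forest = RandomForestRegressor(min_samples_leaf=leaf, random_state=0)
    forest.fit(Xd_train, yd_train)
    results[leaf] = (mape(forest, Xd_train, yd_train),
                     mape(forest, Xd_test, yd_test))
    print('min_samples_leaf={:>2}  train mape {:.3f}%  test mape {:.3f}%'
          .format(leaf, *results[leaf]))
```

A small gap between train and test MAPE together with a low test MAPE is the target; a small gap with high error on both sets points to underfitting instead.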

In [None]:
#model = RandomForestRegressor(min_samples_leaf=6)
model = RandomForestRegressor(min_samples_leaf=14)

model.fit(X_train,y_train)

print('train mape  {:.3f}%'.format(mape(model,X_train,y_train)))
print('test mape  {:.3f}%'.format(mape(model,X_test,y_test)))

train mape  16.901%
test mape  18.543%


# Part 2

### Task 7
Select all 13 features as $X$ and split dataset into two subsets (the same split ratio and random state).

In [None]:
df = pd.read_csv('housing.csv')

# All 13 feature columns; MEDV is the target.
x = df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
y = df['MEDV']

X_train, X_test, y_train, y_test = train_test_split(
   x, y, test_size=0.3, random_state=1)
print("X_train",X_train.shape)
print("X_test",X_test.shape)
print("Y_train",y_train.shape)
print("Y_test",y_test.shape)




X_train (354, 13)
X_test (152, 13)
Y_train (354,)
Y_test (152,)


### Task 8
Train and test a linear regression model on all 13 features. Compare the results with the previous ones.

In [None]:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X_train, y_train)

print(reg.score(X_train,y_train))
print(reg.score(X_test,y_test))

print('train mape  {:.3f}%'.format(mape(reg,X_train,y_train)))
print('test mape  {:.3f}%'.format(mape(reg,X_test,y_test)))



0.7103879080674731
0.7836295385076271
train mape  16.715%
test mape  16.208%


### Task 9
Train and test a Random Forest model (keep all parameters at their defaults). Does your model suffer from overfitting or underfitting?

In [None]:
model = RandomForestRegressor()

model.fit(X_train,y_train)

print('train mape  {:.3f}%'.format(mape(model,X_train,y_train)))
print('test mape  {:.3f}%'.format(mape(model,X_test,y_test)))

# Answer: overfitting. The train MAPE is far lower than the test MAPE.

train mape  4.260%
test mape  11.286%


### Task 10
Try modifying the `min_samples_leaf` parameter to get the best model possible.

In [None]:
model = RandomForestRegressor(min_samples_leaf=120)

model.fit(X_train,y_train)

print('train mape  {:.3f}%'.format(mape(model,X_train,y_train)))
print('test mape  {:.3f}%'.format(mape(model,X_test,y_test)))

train mape  35.976%
test mape  35.366%
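
With only 354 training rows, `min_samples_leaf=120` allows each tree very few leaves, which is consistent with the high error on both subsets above (underfitting). Rather than trying values by hand, the parameter can be chosen by cross-validation on the training data. The sketch below is one possible approach, not the method used in this notebook: it runs `GridSearchCV` on synthetic stand-in data, assumes scikit-learn 0.24 or later for the `'neg_mean_absolute_percentage_error'` scorer, and the grid values are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with strictly positive targets (hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(300, 4))
y_demo = (20.0 + 2.0 * X_demo[:, 0] + X_demo[:, 1] * X_demo[:, 2] / 5.0
          + rng.normal(scale=1.0, size=300))

# Candidate leaf sizes (illustrative values).
param_grid = {'min_samples_leaf': [1, 2, 5, 10, 20, 50]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_grid,
    scoring='neg_mean_absolute_percentage_error',  # negated fraction
    cv=5,
)
search.fit(X_demo, y_demo)

best_leaf = search.best_params_['min_samples_leaf']
# The scorer is negated and fractional, so flip the sign and scale to percent.
best_cv_mape = -100 * search.best_score_
print('best min_samples_leaf:', best_leaf)
print('cross-validated mape  {:.3f}%'.format(best_cv_mape))
```

Selecting the parameter by cross-validation keeps the test subset untouched, so the final test MAPE remains an honest estimate.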
