# Task 1-I : Linear Models

* We will learn linear regression

In [2]:
###################
## Run this cell ##
###################
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()

df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target
print("1 row is not about one house but one town ")
print(boston.DESCR)

1 row is not about one house but one town 
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property

# Q1. Split the df into training set & test set

1. x : all columns in df except 'MEDV'
2. y : the column 'MEDV' in df (df['MEDV'])
3. variable names :
    * x_train, y_train
    * x_test, y_test
4. train : test = 8 : 2
5. random_state : 2021

Question : Why do we need to prepare a test set?

**Your Answer :** Because we need a held-out test set to assess how well the trained model performs on data it did not see during training.

In [3]:
#################################################
## This cell will not be provided, after Task1 ##
#################################################

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df.drop(['MEDV'], axis=1), df['MEDV'],
                                                    test_size=0.2, random_state=2021 )

# Q2. Train linear regression model

1. declare your model as lr

In [4]:
####################
## Your code here ##
####################
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

# Q3. Make a prediction
1. y_pred_train : prediction on training set
2. y_pred_test : prediction on test set

In [5]:
####################
## Your code here ##
####################

y_pred_train = lr.predict(x_train)
y_pred_test = lr.predict(x_test)

# Q4. Evaluate the model on the training set & test set

* Use RMSE

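Question : If RMSE is 4, can we say the error of our model is $4? 

**Your Answer :** No. RMSE is the square root of the mean squared error, so it is expressed in the same units as the target. MEDV is recorded in thousands of dollars, so an RMSE of 4 corresponds to a typical error of about 4,000 dollars, not 4 dollars.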

In [6]:
####################
## Your code here ##
####################
from math import sqrt
from sklearn.metrics import mean_squared_error


def RMSE(y, y_pred):
    return sqrt(mean_squared_error(y, y_pred))


print('RMSE for train set : {}'.format(RMSE(y_train, y_pred_train)))
print('RMSE for test set : {}'.format(RMSE(y_test, y_pred_test)))

RMSE for train set : 4.674667786621305
RMSE for test set : 4.826999984002404
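
As a cross-check on what RMSE measures, the calculation can be sketched by hand on a few numbers. The values below are made up for illustration (they are not rows of the Boston data), and `rmse` and `mae` are hypothetical helper names. The sketch suggests two things: RMSE is in the units of the target, and it is at least as large as the mean absolute error because squaring weights large misses more heavily.

```python
from math import sqrt

# Made-up targets and predictions, in units of $1,000 like MEDV.
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [25.0, 20.6, 30.7, 33.4]


def rmse(y, y_hat):
    """Square root of the mean of squared errors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean of absolute errors, shown for comparison."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


print(f"RMSE : {rmse(y_true, y_pred):.3f} (x $1,000)")
print(f"MAE  : {mae(y_true, y_pred):.3f} (x $1,000)")
```

The single miss of 4 dominates the RMSE, while the MAE treats all misses proportionally.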


# Q5. Complete the equation of the linear regression model

\begin{align}
MEDV = \beta_0 &\ + \beta_1*CRIM + \beta_2*ZN + \beta_3*INDUS + \beta_4*CHAS \\
&+ \beta_5*NOX + \beta_6*RM + \beta_7*AGE + \beta_8*DIS + \beta_9*RAD \\
& + \beta_{10}*TAX + \beta_{11}*PTRATIO + \beta_{12}*B + \beta_{13}*LSTAT
\end{align}

* print $ \beta_0 $ ~ $ \beta_{13} $ with feature(column) name
* example
```
beta_0 for intercept : 21
beta_1 for CRIM : - 12
~~~
beta_13 for LSTAT : -5 
```



In [7]:
####################
## Your code here ##
####################

print("beta_{} for {} : {}".format(0, "intercept", lr.intercept_))
# Pair each coefficient with the feature name it was fitted on.
for i, (name, coef) in enumerate(zip(x_train.columns, lr.coef_), start=1):
    print("beta_{} for {} : {}".format(i, name, coef))

beta_0 for intercept : 35.074446443842035
beta_1 for CRIM : -0.11455671812732975
beta_2 for ZN : 0.05323427588138077
beta_3 for INDUS : 0.003283317180316192
beta_4 for CHAS : 3.508465036705998
beta_5 for NOX : -18.13566684114273
beta_6 for RM : 3.8252394730909423
beta_7 for AGE : 0.011058249800921273
beta_8 for DIS : -1.529967394769819
beta_9 for RAD : 0.33922130811188655
beta_10 for TAX : -0.011867833580858023
beta_11 for PTRATIO : -0.8842149500225382
beta_12 for B : 0.009528304243272022
beta_13 for LSTAT : -0.57816905178081
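
To see how the fitted equation turns one row of features into a prediction, the sum can be evaluated directly. This is a minimal sketch: the coefficients are copied from the output above and rounded to four decimals, `predict_medv` is a hypothetical helper, and the `town` values are made up for illustration rather than taken from the dataset.

```python
# Coefficients copied from the cell output above (rounded to 4 decimals).
intercept = 35.0744
coefs = {
    "CRIM": -0.1146, "ZN": 0.0532, "INDUS": 0.0033, "CHAS": 3.5085,
    "NOX": -18.1357, "RM": 3.8252, "AGE": 0.0111, "DIS": -1.5300,
    "RAD": 0.3392, "TAX": -0.0119, "PTRATIO": -0.8842, "B": 0.0095,
    "LSTAT": -0.5782,
}


def predict_medv(row):
    """Evaluate beta_0 + sum(beta_i * x_i) for one row of feature values."""
    return intercept + sum(coefs[name] * row[name] for name in coefs)


# Hypothetical town (made-up values, not a row of the dataset).
town = {
    "CRIM": 0.1, "ZN": 0.0, "INDUS": 8.0, "CHAS": 0.0, "NOX": 0.5,
    "RM": 6.0, "AGE": 60.0, "DIS": 4.0, "RAD": 4.0, "TAX": 300.0,
    "PTRATIO": 18.0, "B": 390.0, "LSTAT": 10.0,
}

print(f"Predicted MEDV : {predict_medv(town):.3f} (x $1,000)")
```

With the unrounded `lr.intercept_` and `lr.coef_`, the same sum should agree with `lr.predict` for that row up to floating-point error.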


# Q6. Analyze the effect of 'RM' (average number of rooms per dwelling)

**assumption1 : all other features are fixed.**<br>
**assumption2 : use the training set to analyze.**
1. How does the 'MEDV(house price)' change when 'RM' increases by 1 ?
2. What is the change in the 'MEDV' due to the standard deviation(std) of 'RM' ?
    * hint : beta_6 * std('RM')
    * you can regard std('RM') as the typical variability of 'RM' ( roughly )
3. What is the change in the 'MEDV' due to the maximum change of 'RM' ?
    * hint : maximum change of 'RM' = max('RM') - min('RM')


In [8]:
####################
## Your code here ##
####################
std_rm = x_train['RM'].std()
max_ch_rm = x_train['RM'].max() - x_train['RM'].min()


print(f"A1 : {lr.coef_[5]:.3f}")
print(f"A2 : {lr.coef_[5]*std_rm:.3f}")
print(f"A3 : {lr.coef_[5]*max_ch_rm:.3f}")

A1 : 3.825
A2 : 2.696
A3 : 19.754
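
The three answers all apply the same rule: with every other feature fixed, the change in the prediction is the coefficient times the change in the feature. The sketch below illustrates that rule with the rounded RM coefficient from the output above and a made-up list of RM values (not the training column). `medv_change` is a hypothetical helper name.

```python
from statistics import stdev

beta_rm = 3.825  # RM coefficient, rounded from the output above

# Made-up RM values for illustration only (not the training column).
rm_values = [5.0, 5.5, 6.0, 6.5, 7.0]


def medv_change(beta, delta):
    """Change in predicted MEDV when one feature moves by delta, others fixed."""
    return beta * delta


per_unit = medv_change(beta_rm, 1.0)
# statistics.stdev is the sample std (divides by n - 1), like pandas .std().
per_std = medv_change(beta_rm, stdev(rm_values))
per_range = medv_change(beta_rm, max(rm_values) - min(rm_values))

print(f"per +1 room : {per_unit:.3f}")
print(f"per 1 std   : {per_std:.3f}")
print(f"over range  : {per_range:.3f}")
```

The per-std figure is usually the most informative of the three, since it describes a change that actually occurs often in the data.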


# Q7. Analyze the effect of 'NOX' ( nitric oxides concentration (parts per 10 million) )

**assumption1 : all other features are fixed.**<br>
**assumption2 : use the training set to analyze.**
1. How does the 'MEDV(house price)' change when 'NOX' increases by 1 ?
2. Can 'NOX' change by 1 in the data?
3. What is the change in the 'MEDV' due to the standard deviation(std) of 'NOX' ?
4. What is the change in the 'MEDV' due to the maximum change of 'NOX' ?


In [9]:
####################
## Your code here ##
####################

std_nx = x_train['NOX'].std()
max_ch_nx = x_train['NOX'].max() - x_train['NOX'].min()


print(f"A1 : {lr.coef_[4]:.3f}")
print(f"A2 : No, maximum change(range) of NOX : {max_ch_nx}")
print(f"A3 : {lr.coef_[4]*std_nx:.3f}")
print(f"A4 : {lr.coef_[4]*max_ch_nx:.3f}")

A1 : -18.136
A2 : No, maximum change(range) of NOX : 0.486
A3 : -2.168
A4 : -8.814
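
Because NOX never moves by a full unit in this data, the raw coefficient is easier to read once it is rescaled to a realistic step. The sketch below uses the rounded coefficient and range from the output above. The step sizes 0.01 and 0.1 are arbitrary choices for illustration.

```python
beta_nox = -18.136   # NOX coefficient, rounded from the output above
nox_range = 0.486    # max - min of NOX in the training set, from the output above

for step in (0.01, 0.1, nox_range):
    change = beta_nox * step  # in units of $1,000
    print(f"NOX +{step:<5} -> MEDV {change:+.3f} (x $1,000), "
          f"about {change * 1000:+,.0f} dollars")
```

The last row reproduces A4: moving across the full observed range of NOX shifts the prediction by roughly -8.8 thousand dollars.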


# Q8. Analyze the intercept
**assumption1 : use the training set to analyze.**
1. What is the expected mean value of 'MEDV' in $ when all features(x) have no effect ?
    * you can regard intercept as default value of 'MEDV' ( roughly )
    * be careful : in $, not in $1,000
2. Can all features(x) be zero in the data?
    * Can all features(x) have no effect?


In [None]:
####################
## Your code here ##
####################

print('1. Expected mean value of MEDV in $ when all features(x) have no effect : ${}'.format(lr.intercept_ * 1000))
print('2. Can all features(x) be zero in the data? : {}'.format('No, only ZN and CHAS have a minimum of zero; every other feature is always positive.'))

print(x_test.min())

1. Expected mean value of MEDV in $ when all features(x) have no effect : $35074.44644384203
2. Can all features(x) be zero in the data? : No, only ZN and CHAS have a minimum of zero; every other feature is always positive.
CRIM         0.00632
ZN           0.00000
INDUS        1.38000
CHAS         0.00000
NOX          0.39800
RM           3.86300
AGE          6.20000
DIS          1.34490
RAD          1.00000
TAX        188.00000
PTRATIO     13.00000
B            3.65000
LSTAT        1.73000
dtype: float64
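
The intercept is the prediction at the point where every feature equals zero, so it is only meaningful if the data can get there. A minimal sketch of that check follows, using the per-feature minimums copied from the output above. The variable names are hypothetical.

```python
# Per-feature minimums copied from the x_test.min() output above.
feature_min = {
    "CRIM": 0.00632, "ZN": 0.0, "INDUS": 1.38, "CHAS": 0.0, "NOX": 0.398,
    "RM": 3.863, "AGE": 6.2, "DIS": 1.3449, "RAD": 1.0, "TAX": 188.0,
    "PTRATIO": 13.0, "B": 3.65, "LSTAT": 1.73,
}

can_be_zero = [name for name, low in feature_min.items() if low == 0]
never_zero = [name for name in feature_min if name not in can_be_zero]

print("Features that reach zero   :", can_be_zero)
print("Features always above zero :", never_zero)
```

Since most features never reach zero, the all-zero point lies outside the observed data and the intercept is an extrapolation, not a price any town in the sample could have.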
