Session #1 Homework

In this homework, we will use the California Housing Prices dataset from Kaggle.

The goal of this homework is to create a regression model for predicting housing prices (column 'median_house_value').

__EDA__

- Load the data.
- Look at the median_house_value variable. Does it have a long tail?
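
One quick way to check for a long tail is to compare the mean with the median and to look at the skewness before and after `np.log1p`. The sketch below uses synthetic log-normal values as a stand-in for `median_house_value`; the `tail_summary` helper and the generated data are illustrative assumptions, not part of the homework.

```python
import numpy as np
import pandas as pd

def tail_summary(values: pd.Series) -> dict:
    """Hypothetical helper: a few numbers that hint at a long right tail."""
    return {
        "mean": values.mean(),
        "median": values.median(),
        "skew": values.skew(),                    # > 0 suggests a right tail
        "skew_log1p": np.log1p(values).skew(),    # skew after the log transform
    }

# Synthetic, right-skewed stand-in for a price column (assumption, not real data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000))

summary = tail_summary(prices)
print(summary)
```

With the real data, the same call would be `tail_summary(data['median_house_value'])`; a mean well above the median and a clearly positive skew would both point to a long tail.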

__Features__

For the rest of the homework, you'll need to use only these columns:

- `'latitude'`,
- `'longitude'`,
- `'housing_median_age'`,
- `'total_rooms'`,
- `'total_bedrooms'`,
- `'population'`,
- `'households'`,
- `'median_income'`,
- `'median_house_value'`

Select only them!

## Question 1
Find a feature with missing values. How many missing values does it have?

In [1]:
import pandas as pd 
import numpy as np

In [2]:
# Load the data and keep only the columns listed above
data = pd.read_csv('housing.csv')
data = data[[
    'latitude',
    'longitude',
    'housing_median_age',
    'total_rooms',
    'total_bedrooms',
    'population',
    'households',
    'median_income',
    'median_house_value',
]]

In [3]:
data.isna().sum()

latitude                0
longitude               0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
dtype: int64

## Question 2
What's the median (50th percentile) of the variable 'population'?

In [4]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| latitude | 20640.0 | 35.631861 | 2.135952 | 32.54 | 33.93 | 34.26 | 37.71 | 41.95 |
| longitude | 20640.0 | -119.569704 | 2.003532 | -124.35 | -121.8 | -118.49 | -118.01 | -114.31 |
| housing_median_age | 20640.0 | 28.639486 | 12.585558 | 1.0 | 18.0 | 29.0 | 37.0 | 52.0 |
| total_rooms | 20640.0 | 2635.763081 | 2181.615252 | 2.0 | 1447.75 | 2127.0 | 3148.0 | 39320.0 |
| total_bedrooms | 20433.0 | 537.870553 | 421.38507 | 1.0 | 296.0 | 435.0 | 647.0 | 6445.0 |
| population | 20640.0 | 1425.476744 | 1132.462122 | 3.0 | 787.0 | 1166.0 | 1725.0 | 35682.0 |
| households | 20640.0 | 499.53968 | 382.329753 | 1.0 | 280.0 | 409.0 | 605.0 | 6082.0 |
| median_income | 20640.0 | 3.870671 | 1.899822 | 0.4999 | 2.5634 | 3.5348 | 4.74325 | 15.0001 |
| median_house_value | 20640.0 | 206855.816909 | 115395.615874 | 14999.0 | 119600.0 | 179700.0 | 264725.0 | 500001.0 |
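
The median can also be read directly instead of scanning the `50%` column of `describe()`. A minimal sketch on a toy frame (the five values are illustrative, not the full dataset):

```python
import pandas as pd

# Toy frame with a handful of illustrative `population` values.
toy = pd.DataFrame({"population": [2529.0, 2033.0, 909.0, 2156.0, 2142.0]})

median_direct = toy["population"].median()         # middle value of the sorted column
median_quantile = toy["population"].quantile(0.5)  # same thing expressed as a quantile
print(median_direct, median_quantile)
```

On the real frame the equivalent call would be `data['population'].median()`.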


## Split the data
- Shuffle the initial dataset, use seed 42.
- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Make sure that the target value ('median_house_value') is not in your dataframe.
- Apply the log transformation to the median_house_value variable using the np.log1p() function.
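
As a small illustration of the last step: `np.log1p(x)` computes `log(1 + x)`, and `np.expm1` is its inverse, so predictions made on the log scale can be mapped back to prices. The sample values below are only for demonstration.

```python
import numpy as np

# A few example house values (illustrative only).
prices = np.array([278700.0, 207600.0, 96700.0])

log_prices = np.log1p(prices)      # log(1 + price), compresses the long tail
restored = np.expm1(log_prices)    # exp(value) - 1, undoes log1p

print(log_prices)
print(restored)
```

Because of the `+ 1`, `np.log1p(0)` is `0` rather than negative infinity, which is one reason it is often preferred over a plain `np.log` for non-negative targets.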

In [5]:
n = len(data)
n_val = int(n*0.2)
n_test = int(n*0.2)
n_train = int(n*0.6)

idx = np.arange(n)
np.random.seed(2)  # the assignment asks for seed 42; the outputs shown below come from seed 2
np.random.shuffle(idx)

df_shuffled = data.iloc[idx]

In [6]:
# Note: this cell slices the unshuffled `data`; its results are overwritten by the next cell.
data_val = data.iloc[:n_val]
data_test = data.iloc[n_val:n_val+n_test]
data_train = data.iloc[n_val+n_test:]

In [7]:
data_train = df_shuffled.iloc[:n_train].copy()
data_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
data_test = df_shuffled.iloc[n_train+n_val:].copy()

data_train = data_train.reset_index(drop=True)
data_val = data_val.reset_index(drop=True)
data_test = data_test.reset_index(drop=True)

In [8]:
data_train.head()

| | latitude | longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 33.6 | -117.65 | 15.0 | 5736.0 | NaN | 2529.0 | 762.0 | 6.4114 | 278700.0 |
| 1 | 38.62 | -120.91 | 12.0 | 4545.0 | 748.0 | 2033.0 | 718.0 | 4.1843 | 207600.0 |
| 2 | 33.93 | -118.23 | 35.0 | 1149.0 | 277.0 | 909.0 | 214.0 | 1.7411 | 96700.0 |
| 3 | 37.59 | -122.37 | 39.0 | 4645.0 | 1196.0 | 2156.0 | 1113.0 | 3.4412 | 353800.0 |
| 4 | 33.7 | -117.98 | 16.0 | 5127.0 | 631.0 | 2142.0 | 596.0 | 7.8195 | 390500.0 |


In [9]:
len(data_train), len(data_val), len(data_test)

(12384, 4128, 4128)
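
Since the same shuffle-and-slice steps are repeated for several seeds later on, they could be wrapped in a function. The `shuffle_split` helper below is a hypothetical sketch, not code from the lessons; it assumes the training size is whatever remains after the validation and test slices (for 20640 rows this gives the same 12384/4128/4128 split).

```python
import numpy as np
import pandas as pd

def shuffle_split(df, seed, val_frac=0.2, test_frac=0.2):
    """Hypothetical helper: shuffle row positions with `seed`, then slice train/val/test."""
    n = len(df)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    n_train = n - n_val - n_test   # assumption: train takes the remainder

    idx = np.arange(n)
    np.random.seed(seed)
    np.random.shuffle(idx)
    shuffled = df.iloc[idx]

    train = shuffled.iloc[:n_train].reset_index(drop=True)
    val = shuffled.iloc[n_train:n_train + n_val].reset_index(drop=True)
    test = shuffled.iloc[n_train + n_val:].reset_index(drop=True)
    return train, val, test

# Toy frame standing in for the housing data.
toy = pd.DataFrame({"row_id": range(10)})
toy_train, toy_val, toy_test = shuffle_split(toy, seed=2)
print(len(toy_train), len(toy_val), len(toy_test))
```

Each row ends up in exactly one of the three parts, which is the property the manual slicing above relies on.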

In [10]:
y_train = np.log1p(data_train.median_house_value.values)
y_val = np.log1p(data_val.median_house_value.values)
y_test = np.log1p(data_test.median_house_value.values)


In [11]:
del data_train['median_house_value']
del data_val['median_house_value']
del data_test['median_house_value']

## Question 3
- We need to deal with missing values for the column from Q1.
- We have two options: fill it with 0 or with the mean of this variable.
- Try both options. For each, train a linear regression model without regularization using the code from the lessons.
- For computing the mean, use the training set only!
- Use the validation dataset to evaluate the models and compare the RMSE of each option.
- Round the RMSE scores to 2 decimal digits using round(score, 2)
- Which option gives better RMSE?

Options:

    With 0
    With mean
    With median
    Both are equally good
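
To make the "training only" rule concrete: the fill value is computed from the training split and then reused unchanged for validation, so no information from the validation rows leaks into preprocessing. A minimal sketch on toy frames (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy splits with missing values in the column of interest.
toy_train = pd.DataFrame({"total_bedrooms": [100.0, 200.0, np.nan, 300.0]})
toy_val = pd.DataFrame({"total_bedrooms": [np.nan, 1000.0]})

# Mean of the training column only; pandas skips NaN by default.
train_mean = toy_train["total_bedrooms"].mean()

# The same training-derived value fills both splits.
filled_train = toy_train.fillna(train_mean)
filled_val = toy_val.fillna(train_mean)

print(train_mean)
print(filled_val)
```

Note that the large validation value (1000.0) has no influence on the fill value.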

In [12]:
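# Caution: this takes the mean over the full dataset and truncates it to an integer;
# the assignment asks for the mean of the training split only,
# i.e. data_train['total_bedrooms'].mean().
mean_b = int(data['total_bedrooms'].mean(0))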
mean_b

537

Fill the missing values with the mean and with 0, then compare the validation RMSE of the two models.

In [13]:
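def train_linear_regression(X, y):
    """Fit linear regression via the normal equation; returns (bias, weights)."""
    # Prepend a column of ones so the first weight acts as the bias term
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # Solve (X^T X) w = X^T y by inverting the Gram matrix
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    
    return w[0], w[1:]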
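
As a sanity check, the normal-equation solution can be compared with NumPy's least-squares solver on synthetic data. This is only a sketch: the toy data and coefficients are assumptions, and the function is repeated so the snippet runs on its own.

```python
import numpy as np

def train_linear_regression(X, y):
    # Same normal-equation fit as in the cell above.
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    XTX = X.T.dot(X)
    w = np.linalg.inv(XTX).dot(X.T).dot(y)
    return w[0], w[1:]

# Synthetic regression problem with known coefficients (illustrative only).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 + X_toy @ np.array([0.5, -1.0, 3.0]) + rng.normal(scale=0.01, size=200)

w0, w = train_linear_regression(X_toy, y_toy)

# Reference solution from NumPy's least-squares routine on the same design matrix.
X_aug = np.column_stack([np.ones(len(X_toy)), X_toy])
w_lstsq, *_ = np.linalg.lstsq(X_aug, y_toy, rcond=None)

print(w0, w)
print(w_lstsq)
```

On well-conditioned data the two agree closely; `np.linalg.lstsq` tends to be the more robust choice when `XTX` is close to singular.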

In [17]:
check = ['latitude','longitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income']

def prepare_X(df, fillna_value):
    df_num = df[check]
    df_num = df_num.fillna(fillna_value)
    X = df_num.values
    return X

In [18]:
def rmse(y, y_pred):
    error = y_pred - y
    mse = (error ** 2).mean()
    return np.sqrt(mse)

In [19]:
X_mean_train = prepare_X(data_train, fillna_value=mean_b)
w_0_mean, w_mean = train_linear_regression(X_mean_train, y_train)

In [20]:
X_mean_val = prepare_X(data_val, fillna_value=mean_b)
y_mean_pred_val = w_0_mean + X_mean_val.dot(w_mean)

In [21]:
np.round(rmse(y_val, y_mean_pred_val),2)

0.33

In [22]:
X_null_train = prepare_X(data_train, fillna_value=0)
w_0_null, w_null = train_linear_regression(X_null_train, y_train)

In [23]:
X_null_val = prepare_X(data_val, fillna_value=0)
y_null_pred_val = w_0_null + X_null_val.dot(w_null)

In [24]:
np.round(rmse(y_val, y_null_pred_val),2)

0.33

## Question 4: Best regularization parameter r

In [25]:
def train_linear_regression_reg(X, y, r=0.0):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    reg = r * np.eye(XTX.shape[0])
    XTX = XTX + reg

    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)
    
    return w[0], w[1:]
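
For reference, the function above corresponds to the regularized normal equation, written here with $X$ denoting the feature matrix after the column of ones has been prepended and $I$ the identity matrix of matching size:

$$
w = \left(X^\top X + r I\right)^{-1} X^\top y
$$

Setting $r = 0$ recovers the unregularized solution $w = (X^\top X)^{-1} X^\top y$. Because $I$ covers the ones column too, this particular implementation also penalizes the bias term $w_0$, which is consistent with the printed `w_0` values shrinking toward zero as `r` grows.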

In [26]:
for r in [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]:
    w_0, w = train_linear_regression_reg(X_null_train, y_train, r=r)
    y_null_reg_val = w_0 + X_null_val.dot(w)
    rmse_val = np.round(rmse(y_val, y_null_reg_val),2)
    print(r, w_0, rmse_val)

0 -11.806729362245843 0.33
1e-06 -11.80671362948933 0.33
0.0001 -11.805156323240967 0.33
0.001 -11.791017806954207 0.33
0.01 -11.651472789645943 0.33
0.1 -10.41842651392549 0.33
1 -5.060875818575246 0.34
5 -1.5386307850897722 0.34
10 -0.8216708327329312 0.34
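
Several values of `r` tie on the rounded RMSE. One way to pick a winner, assuming ties are broken in favor of the smallest `r` (a rule not stated in this notebook), is sketched below using the `(r, rmse)` pairs printed above.

```python
# (r, rounded validation RMSE) pairs copied from the output above.
results = [
    (0, 0.33), (1e-06, 0.33), (0.0001, 0.33), (0.001, 0.33), (0.01, 0.33),
    (0.1, 0.33), (1, 0.34), (5, 0.34), (10, 0.34),
]

# Sort key: lowest RMSE first, then smallest r as the tie-breaker (assumption).
best_r, best_rmse = min(results, key=lambda pair: (pair[1], pair[0]))
print(best_r, best_rmse)
```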


## Question 5: STD of RMSE scores for different seeds


In [27]:
rmse_list = []

for r in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:

    idx = np.arange(n)
    np.random.seed(r)
    np.random.shuffle(idx)

    df_shuffled = data.iloc[idx]
    
    data_train = df_shuffled.iloc[:n_train].copy()
    data_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
    data_test = df_shuffled.iloc[n_train+n_val:].copy()

    data_train = data_train.reset_index(drop=True)
    data_val = data_val.reset_index(drop=True)
    data_test = data_test.reset_index(drop=True)
    
    y_train_orig = data_train.median_house_value.values
    y_val_orig = data_val.median_house_value.values
    y_test_orig = data_test.median_house_value.values

    y_train = np.log1p(y_train_orig)
    y_val = np.log1p(y_val_orig)
    y_test = np.log1p(y_test_orig)
    
    del data_train['median_house_value']
    del data_val['median_house_value']
    del data_test['median_house_value']
    
    X_null_train = prepare_X(data_train, fillna_value=0)
    w_0, w = train_linear_regression(X_null_train, y_train)
    
    X_null_val = prepare_X(data_val, fillna_value=0)
    y_null_reg_val = w_0 + X_null_val.dot(w)
    rmse_val = np.round(rmse(y_val, y_null_reg_val),2)
    
    rmse_list.append(rmse_val)
    
    print(r, w_0, rmse_val)

0 -11.900382140423538 0.34
1 -11.732757375530449 0.34
2 -11.806729362245843 0.33
3 -11.587900350126908 0.34
4 -11.389470590755955 0.34
5 -11.447114275064546 0.34
6 -11.370516353469474 0.35
7 -12.473448923061865 0.34
8 -11.800287430349286 0.35
9 -11.45904683391947 0.34


In [28]:
rmse_list

[0.34, 0.34, 0.33, 0.34, 0.34, 0.34, 0.35, 0.34, 0.35, 0.34]

In [29]:
np.round(np.std(rmse_list),3)

0.005
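
Note that `np.std` computes the population standard deviation by default (`ddof=0`). The sketch below, using the scores listed above, shows how the result would differ with the sample standard deviation (`ddof=1`), which is the default in pandas.

```python
import numpy as np

# RMSE scores copied from the output above.
rmse_list = [0.34, 0.34, 0.33, 0.34, 0.34, 0.34, 0.35, 0.34, 0.35, 0.34]

std_population = np.std(rmse_list)          # divides by n (NumPy default)
std_sample = np.std(rmse_list, ddof=1)      # divides by n - 1

print(np.round(std_population, 3), np.round(std_sample, 3))
```

With only ten scores the two conventions differ enough to change the third decimal, so it is worth stating which one is used.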

## Question 6: RMSE on test


In [32]:
r = 9

idx = np.arange(n)
np.random.seed(r)
np.random.shuffle(idx)

df_shuffled = data.iloc[idx]
    
data_train = df_shuffled.iloc[:n_train].copy()
data_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
data_test = df_shuffled.iloc[n_train+n_val:].copy()

frames = [data_train, data_val]
data_train_val = pd.concat(frames)

data_train_val = data_train_val.reset_index(drop=True)
data_test = data_test.reset_index(drop=True)

y_train_val_orig = data_train_val.median_house_value.values
y_test_orig = data_test.median_house_value.values

y_train_val = np.log1p(y_train_val_orig)
y_test = np.log1p(y_test_orig)

del data_train_val['median_house_value']
del data_test['median_house_value']

In [33]:
X_null_train_val = prepare_X(data_train_val, fillna_value=0)
w_0_train_val, w_train_val = train_linear_regression_reg(X_null_train_val, y_train_val, r=0.001)

X_null_test = prepare_X(data_test, fillna_value=0)
y_null_pred_test = w_0_train_val + X_null_test.dot(w_train_val)

np.round(rmse(y_test, y_null_pred_test),2)

0.35