# Homework 02
---
[mlbookcamp 02-regresion](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/02-regression/homework.md)

In [2]:
import os.path as osp
import numpy as np
import pandas as pd

In [3]:
DIRPATH="./data/"
FILENAME = "housing.csv"
DATAPATH = osp.join(DIRPATH, FILENAME)

### Downloading data

In [4]:
! ./downloading_data.sh -d $DIRPATH -f $FILENAME https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

Downloading data from https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv to ./data//housing.csv
--2022-10-04 17:10:24--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolvendo raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8001::154, 2606:50c0:8000::154, ...
Conectando-se a raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... conectado.
A requisição HTTP foi enviada, aguardando resposta... 200 OK
Tamanho: 1423529 (1,4M) [text/plain]
Salvando em: “./data//housing.csv”


2022-10-04 17:10:25 (5,51 MB/s) - “./data//housing.csv” salvo [1423529/1423529]



In [5]:
data = pd.read_csv(DATAPATH)

In [6]:
required_columns = [
    'latitude',
    'longitude',
    'housing_median_age',
    'total_rooms',
    'total_bedrooms',
    'population',
    'households',
    'median_income',
    'median_house_value'
]

In [7]:
data = data[required_columns]

## Question 1
Find a feature with missing values. How many missing values does it have?

- 207
- 307
- 408
- 508

In [8]:
amount_missing_values = data.isna().sum()
missing_values = amount_missing_values[amount_missing_values != 0]
missing_values

total_bedrooms    207
dtype: int64

## Question 2
What's the median (50% percentile) for variable 'population'?

- 1133
- 1122
- 1166
- 1188

In [9]:
data['population'].quantile(0.5)

1166.0

## Split the data
- Shuffle the initial dataset, use seed 42.
- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Make sure that the target value ('median_house_value') is not in your dataframe.
- Apply the log transformation to the median_house_value variable using the np.log1p() function.

In [10]:
data.head()

Unnamed: 0,latitude,longitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,37.88,-122.23,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,37.86,-122.22,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,37.85,-122.24,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,37.85,-122.25,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,37.85,-122.25,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0


In [11]:
data['median_house_value'] = np.log1p(data.median_house_value)

In [22]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(data.drop('median_house_value', axis=1), 
                                                    data['median_house_value'], 
                                                    test_size=0.2, 
                                                    shuffle=True, 
                                                    random_state=42)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, 
                                                Y_train, 
                                                test_size=0.25, 
                                                shuffle=False, 
                                                random_state=42)

In [23]:
np.asarray([X_train.shape[0], X_test.shape[0], X_val.shape[0]])/data.shape[0]

array([0.6, 0.2, 0.2])

## Question 3
- We need to deal with missing values for the column from Q1.
- We have two options: fill it with 0 or with the mean of this variable.
- Try both options. For each, train a linear regression model without regularization using the code from the lessons.
- For computing the mean, use the training only!
- Use the validation dataset to evaluate the models and compare the RMSE of each option.
- Round the RMSE scores to 2 decimal digits using round(score, 2)
- Which option gives better RMSE?

Options:

- With 0
- With mean
- Both are equally good


In [24]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

mean_transformer = ColumnTransformer(
    [('mean', SimpleImputer(), missing_values.index)],
    remainder='passthrough'
)
zero_transformer = ColumnTransformer(
    [('zero', SimpleImputer(strategy='constant', fill_value=0), missing_values.index)],
    remainder='passthrough'
)

In [25]:
from model import LinearRegressor

mean_impute_LR = LinearRegressor()
mean_impute_LR.fit(mean_transformer.fit_transform(X_train), Y_train)

zero_impute_LR = LinearRegressor()
zero_impute_LR.fit(zero_transformer.fit_transform(X_train), Y_train)

LinearRegressor()

In [26]:
yhat_mean = mean_impute_LR.predict(mean_transformer.transform(X_val))
yhat_zero = zero_impute_LR.predict(zero_transformer.transform(X_val))

In [27]:
from sklearn.metrics import mean_squared_error

round(mean_squared_error(yhat_mean, Y_val, squared=False), 2), round(mean_squared_error(yhat_zero, Y_val, squared=False), 2)

(0.35, 0.35)

## Question 4
- Now let's train a regularized linear regression.
- For this question, fill the NAs with 0.
- Try different values of r from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
- Use RMSE to evaluate the model on the validation dataset.
- Round the RMSE scores to 2 decimal digits.
- Which r gives the best RMSE?

If there are multiple options, select the smallest r.

Options:

- 0
- 0.000001
- 0.001
- 0.0001

In [28]:
regularization_factor = [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]

In [29]:
result_dict = {}
for rf in regularization_factor:
    lr = LinearRegressor(regularization_factor=rf)
    lr.fit(zero_transformer.transform(X_train), Y_train)
    result_dict[rf] = np.round(mean_squared_error(Y_val, lr.predict(zero_transformer.transform(X_val)), squared=False), 2)
result_dict

{0: 0.35,
 1e-06: 0.35,
 0.0001: 0.35,
 0.001: 0.35,
 0.01: 0.35,
 0.1: 0.35,
 1: 0.35,
 5: 0.36,
 10: 0.36}

In [30]:
min_value = min(result_dict.values())
min([r for r in result_dict.keys() if result_dict[r] == min_value])

0

## Question 5
- We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
- Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
- For each seed, do the train/validation/test split with 60%/20%/20% distribution.
- Fill the missing values with 0 and train a model without regularization.
- For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
- What's the standard deviation of all the scores? To compute the standard deviation, use np.std.
- Round the result to 3 decimal digits (round(std, 3))

Options:

- 0.16
- 0.00005
- 0.005
- 0.15555

In [31]:
zero_impute_LR = LinearRegressor()
seed_dict = {'seed': [], 'score': []}
for seed in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    X_train, X_test, Y_train, Y_test = train_test_split(data.drop('median_house_value', axis=1), 
                                                        data['median_house_value'], 
                                                        test_size=0.2, 
                                                        shuffle=True, 
                                                        random_state=seed)
    X_train, X_val, Y_train, Y_val = train_test_split(X_train, 
                                                    Y_train, 
                                                    test_size=0.25, 
                                                    shuffle=False, 
                                                    random_state=seed)
    zero_impute_LR.fit(zero_transformer.transform(X_train), Y_train)
    seed_dict['seed'].append(seed)
    seed_dict['score'].append(mean_squared_error(Y_val, lr.predict(zero_transformer.transform(X_val)), squared=False))
seed_dict


{'seed': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'score': [0.3476313639492783,
  0.3452271319288195,
  0.3566200408963487,
  0.3467842819478841,
  0.3451764720100373,
  0.3454713680532478,
  0.3538322317193187,
  0.3548220497824022,
  0.35828392911132356,
  0.356978425041303]}

In [32]:
seed_df = pd.DataFrame(seed_dict)
seed_df.score.std().round(3)

0.005

## Question 6
- Split the dataset like previously, use seed 9.
- Combine train and validation datasets.
- Fill the missing values with 0 and train a model with r=0.001.
- What's the RMSE on the test dataset?

Options:

- 0.35
- 0.135
- 0.450
- 0.245

In [33]:
combined_X = pd.concat([X_train, X_val])
combined_y = pd.concat([Y_train, Y_val])

lr = LinearRegressor(regularization_factor=0.001)
lr.fit(zero_transformer.transform(combined_X), combined_y)
round(mean_squared_error(Y_test, lr.predict(zero_transformer.transform(X_test)), squared=False), 2)

0.34