# Homework 02
---
[mlbookcamp 02-regression](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/homework.md)

In [35]:
import numpy as np
import pandas as pd

In [36]:
DATAPATH = "/dataset/AB_NYC_2019.csv"

### Downloading data

In [37]:
%%bash -s "$DATAPATH"
# Downloads data if not available.
if [[ -f "$1" ]]
    then
        echo 'Data already there.';
    else
        echo 'Downloading data';
        wget -O "$1" https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv
fi

Data already there.


In [38]:
data = pd.read_csv(DATAPATH)

In [39]:
required_columns = [
    'latitude', 
    'longitude', 
    'price', 
    'minimum_nights', 
    'number_of_reviews', 
    'reviews_per_month', 
    'calculated_host_listings_count', 
    'availability_365'
]

In [40]:
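# Copy the selection so that later column assignments do not act on a view of the original frame.
data = data[required_columns].copy()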

## Question 1
Find a feature with missing values. How many missing values does it have?

In [41]:
amount_missing_values = data.isna().sum()
missing_values = amount_missing_values[amount_missing_values != 0]
missing_values

reviews_per_month    10052
dtype: int64

## Question 2
What's the median (50th percentile) for variable 'minimum_nights'?

In [42]:
data['minimum_nights'].quantile(0.5)

3.0

## Split the data
- Shuffle the initial dataset, use seed 42.
- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Make sure that the target value ('price') is not in your dataframe.
- Apply the log transformation to the price variable using the np.log1p() function.

In [43]:
data.head()

| | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|
| 0 | 40.64749 | -73.97237 | 149 | 1 | 9 | 0.21 | 6 | 365 |
| 1 | 40.75362 | -73.98377 | 225 | 1 | 45 | 0.38 | 2 | 355 |
| 2 | 40.80902 | -73.9419 | 150 | 3 | 0 | NaN | 1 | 365 |
| 3 | 40.68514 | -73.95976 | 89 | 1 | 270 | 4.64 | 1 | 194 |
| 4 | 40.79851 | -73.94399 | 80 | 10 | 9 | 0.1 | 1 | 0 |


In [44]:
data['price'] = np.log1p(data.price)
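As a small illustration of the transform used above, the sketch below shows that `np.log1p` is defined for a price of 0 and that `np.expm1` undoes it, which is how predictions could be mapped back to the original price scale. The price values are made up for the example.

```python
import numpy as np

# Illustrative prices only; they are not model outputs.
prices = np.array([0.0, 80.0, 149.0, 225.0])

log_prices = np.log1p(prices)    # log(1 + price), finite even when price == 0
recovered = np.expm1(log_prices) # inverse transform: exp(x) - 1

print(log_prices)
print(recovered)
```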

In [45]:
from sklearn.model_selection import train_test_split

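# Hold out 40% of the rows, then halve that hold-out into test and validation (60%/20%/20% overall).
X_train, X_test, Y_train, Y_test = train_test_split(data.drop('price', axis=1), 
                                                    data['price'], 
                                                    test_size=0.4, 
                                                    shuffle=True, 
                                                    random_state=42)
# The hold-out is already shuffled, so it is cut in order; random_state has no effect when shuffle=False.
X_test, X_val, Y_test, Y_val = train_test_split(X_test, 
                                                Y_test, 
                                                test_size=0.5, 
                                                shuffle=False, 
                                                random_state=42)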
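For comparison, a 60%/20%/20% split can also be done by hand with a shuffled index. This is only a sketch: the helper name `split_60_20_20` and the toy frame are hypothetical, and the permutation it draws will not reproduce the rows that `train_test_split` selects for the same seed.

```python
import numpy as np
import pandas as pd

def split_60_20_20(df, seed):
    """Shuffle row positions and cut them into train/val/test parts (hypothetical helper)."""
    n = len(df)
    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test          # the remainder goes to train

    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)              # shuffled row positions

    train = df.iloc[idx[:n_train]]
    val = df.iloc[idx[n_train:n_train + n_val]]
    test = df.iloc[idx[n_train + n_val:]]
    return train, val, test

# Toy data standing in for the real dataset.
toy = pd.DataFrame({"x": range(10), "price": range(10)})
train, val, test = split_60_20_20(toy, seed=42)
print(len(train), len(val), len(test))
```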

## Question 3
- We need to deal with missing values for the column from Q1.
- We have two options: fill it with 0 or with the mean of this variable.
- Try both options. For each, train a linear regression model without regularization using the code from the lessons.
- For computing the mean, use the training set only!
- Use the validation dataset to evaluate the models and compare the RMSE of each option.
- Round the RMSE scores to 2 decimal digits using round(score, 2)
- Which option gives better RMSE?

In [46]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

mean_transformer = ColumnTransformer(
    [('mean', SimpleImputer(), missing_values.index)],
    remainder='passthrough'
)
zero_transformer = ColumnTransformer(
    [('zero', SimpleImputer(strategy='constant', fill_value=0), missing_values.index)],
    remainder='passthrough'
)

In [47]:
from model import LinearRegressor

mean_impute_LR = LinearRegressor()
mean_impute_LR.fit(mean_transformer.fit_transform(X_train), Y_train)

zero_impute_LR = LinearRegressor()
zero_impute_LR.fit(zero_transformer.fit_transform(X_train), Y_train)

LinearRegressor()
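The local `model` module is not included in this notebook. A minimal sketch of a class with the same interface is given below, assuming it solves the normal equation w = (XᵀX + rI)⁻¹Xᵀy with a bias column prepended. Only `regularization_factor`, `fit` and `predict` come from the usage above; the class name `SketchLinearRegressor` and the attributes `bias_` and `weights_` are hypothetical.

```python
import numpy as np

class SketchLinearRegressor:
    """Hypothetical stand-in for the notebook's LinearRegressor."""

    def __init__(self, regularization_factor=0.0):
        self.regularization_factor = regularization_factor

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Prepend a column of ones so the first weight acts as the bias term.
        X = np.column_stack([np.ones(len(X)), X])
        # In this sketch r is added to every diagonal entry, the bias entry included.
        XTX = X.T @ X + self.regularization_factor * np.eye(X.shape[1])
        w = np.linalg.solve(XTX, X.T @ y)
        self.bias_, self.weights_ = w[0], w[1:]
        return self

    def predict(self, X):
        return self.bias_ + np.asarray(X, dtype=float) @ self.weights_

# Toy check on exact linear data: y = 1 + 2x.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 1.0 + 2.0 * X_toy[:, 0]
model = SketchLinearRegressor().fit(X_toy, y_toy)
print(model.bias_, model.weights_)
```

With `regularization_factor=0` this reduces to ordinary least squares; larger values shrink the weights toward zero.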

In [48]:
yhat_mean = mean_impute_LR.predict(mean_transformer.transform(X_val))
yhat_zero = zero_impute_LR.predict(zero_transformer.transform(X_val))

In [49]:
from sklearn.metrics import mean_squared_error

def RMSE(y, yhat):
    return np.sqrt(mean_squared_error(y, yhat))


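# Pass the true values first to match the RMSE(y, yhat) signature (the result is the same either way).
round(RMSE(Y_val, yhat_mean), 2), round(RMSE(Y_val, yhat_zero), 2)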

(0.64, 0.64)
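The requirement to compute the mean on the training set only is what `fit` followed by `transform` provides: `SimpleImputer` stores the statistic during `fit` and reuses it for any later data. A small sketch with made-up arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column data; NaN marks a missing value.
train = np.array([[1.0], [3.0], [np.nan]])
val = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)                   # learns the mean of the training column only
filled_val = imputer.transform(val)  # fills validation gaps with the training mean

print(imputer.statistics_)
print(filled_val)
```

Calling `fit_transform` on the validation data instead would recompute the mean from validation rows, which leaks information from the evaluation set.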

## Question 4
- Now let's train a regularized linear regression.
- For this question, fill the NAs with 0.
- Try different values of r from this list: `[0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]`.
- Use RMSE to evaluate the model on the validation dataset.
- Round the RMSE scores to 2 decimal digits.
- Which r gives the best RMSE?

If there are multiple options, select the smallest r.

In [50]:
regularization_factor = [0, 0.000001, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10]

In [51]:
result_dict = {}
for rf in regularization_factor:
    lr = LinearRegressor(regularization_factor=rf)
    lr.fit(zero_transformer.transform(X_train), Y_train)
    result_dict[rf] = np.round(RMSE(Y_val, lr.predict(zero_transformer.transform(X_val))), 2)
result_dict

{0: 0.64,
 1e-06: 0.64,
 0.0001: 0.64,
 0.001: 0.64,
 0.01: 0.65,
 0.1: 0.67,
 1: 0.67,
 5: 0.67,
 10: 0.67}

In [52]:
min_value = min(result_dict.values())
min([r for r in result_dict.keys() if result_dict[r] == min_value])

0

## Question 5
- We used seed 42 for splitting the data. Let's find out how selecting the seed influences our score.
- Try different seed values: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`.
- For each seed, do the train/validation/test split with 60%/20%/20% distribution.
- Fill the missing values with 0 and train a model without regularization.
- For each seed, evaluate the model on the validation dataset and collect the RMSE scores.
- What's the standard deviation of all the scores? To compute the standard deviation, use np.std.
- Round the result to 3 decimal digits (round(std, 3))

In [53]:
zero_impute_LR = LinearRegressor()
seed_dict = {'seed': [], 'score': []}
for seed in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]:
    X_train, X_test, Y_train, Y_test = train_test_split(data.drop('price', axis=1), 
                                                        data['price'], 
                                                        test_size=0.4, 
                                                        shuffle=True, 
                                                        random_state=seed)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, 
                                                    Y_test, 
                                                    test_size=0.5, 
                                                    shuffle=False, 
                                                    random_state=seed)
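    zero_impute_LR.fit(zero_transformer.transform(X_train), Y_train)
    seed_dict['seed'].append(seed)
    # Score the model fitted on this split; `lr` is the last regularized model left over from Question 4.
    seed_dict['score'].append(RMSE(Y_val, zero_impute_LR.predict(zero_transformer.transform(X_val))))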
seed_dict


{'seed': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'score': [0.6874024180793985,
  0.6826733951309679,
  0.6787697800889332,
  0.6801772655176364,
  0.6944709763865387,
  0.6924747645433872,
  0.6883739406590289,
  0.6719325833264694,
  0.6913986582431862,
  0.6808741168720182]}

In [54]:
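seed_df = pd.DataFrame(seed_dict)
# np.std uses ddof=0 (population std), as the question asks; Series.std() defaults to ddof=1.
np.std(seed_df['score'].to_numpy()).round(3)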

0.007
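The question asks for `np.std`, which computes the population standard deviation (`ddof=0`), while the pandas `Series.std()` method defaults to the sample standard deviation (`ddof=1`). With only ten scores the two can differ noticeably before rounding. A short sketch with made-up scores:

```python
import numpy as np
import pandas as pd

# Made-up scores, not the values from the notebook.
scores = [0.60, 0.62, 0.64, 0.66]

population_std = np.std(scores)                    # ddof=0
sample_std = pd.Series(scores).std()               # ddof=1 by default
matched_std = pd.Series(scores).std(ddof=0)        # pandas made to agree with np.std

print(population_std, sample_std, matched_std)
```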

## Question 6
- Split the dataset like previously, use seed 9.
- Combine train and validation datasets.
- Fill the missing values with 0 and train a model with r=0.001.
- What's the RMSE on the test dataset?

In [56]:
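# X_train, X_val and X_test still hold the seed-9 split from the last iteration of the Question 5 loop.
combined_X = pd.concat([X_train, X_val])
combined_y = pd.concat([Y_train, Y_val])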

lr = LinearRegressor(regularization_factor=0.001)
lr.fit(zero_transformer.transform(combined_X), combined_y)
round(RMSE(Y_test, lr.predict(zero_transformer.transform(X_test))), 2)

0.65