## Homework

### Dataset

In this homework, we will use the California Housing Prices dataset. You can download it from
[Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

The goal of this homework is to create a regression model for predicting housing prices (column `'median_house_value'`).

In [None]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

--2022-09-26 19:11:41--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘housing.csv’


2022-09-26 19:11:42 (19.4 MB/s) - ‘housing.csv’ saved [1423529/1423529]



In [62]:
import numpy as np
import pandas as pd

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import mean_squared_error, mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.utils.validation import check_is_fitted

In [None]:
df = pd.read_csv('housing.csv')

### Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`,
* `'longitude'`,
* `'housing_median_age'`,
* `'total_rooms'`,
* `'total_bedrooms'`,
* `'population'`,
* `'households'`,
* `'median_income'`,
* `'median_house_value'`,
* `'ocean_proximity'`

In [None]:
features = ['latitude', 'longitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income',
            'ocean_proximity', 'rooms_per_household', 'bedrooms_per_room', 'population_per_household']
target = 'above_average'

### Data preparation

* Select only the features from above and fill in the missing values with 0.
* Create a new column `rooms_per_household` by dividing the column `total_rooms` by the column `households`.
* Create a new column `bedrooms_per_room` by dividing the column `total_bedrooms` by the column `total_rooms`.
* Create a new column `population_per_household` by dividing the column `population` by the column `households`.

In [None]:
df[[feature for feature in features if feature in list(df.columns)]].isnull().sum()

latitude                0
longitude               0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
dtype: int64

In [None]:
df.total_bedrooms = df.total_bedrooms.fillna(0)

In [None]:
df['rooms_per_household'] = df.total_rooms / df.households
df['bedrooms_per_room'] = df.total_bedrooms / df.total_rooms
df['population_per_household'] = df.population / df.households

### Question 1

What is the most frequent observation (mode) for the column `ocean_proximity`?

In [None]:
df.ocean_proximity.mode()

0    <1H OCEAN
dtype: object

### Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`median_house_value`) is not in your dataframe.

In [None]:
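def train_val_test_split(df, seed=None):
  # Hold out 20% for test, then 25% of the remaining 80% (20% overall) for validation
  df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
  df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=seed)
  return df_train, df_val, df_full_train, df_test

# Create the splits here so that `df_train` exists for Question 2
df_train, df_val, df_full_train, df_test = train_val_test_split(df, seed=42)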
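
The 60%/20%/20% proportions follow from the two `test_size` values: 20% is held out first, and 25% of the remaining 80% is another 20% of the original data. A minimal sketch on a toy dataframe checks the resulting sizes (the data and the `toy_*` names are hypothetical and not part of the homework):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 1000 rows (hypothetical data, only used to check the sizes)
toy = pd.DataFrame({"x": np.arange(1000)})

# First hold out 20% for test, leaving 80% for train + validation
toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
# 0.25 of the remaining 80% equals 20% of the original data
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.25, random_state=42)

split_sizes = (len(toy_train), len(toy_val), len(toy_test))
print(split_sizes)
```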

### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
    - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

In [66]:
df_train[list(set(features) - set(['ocean_proximity']))].corr()

| | longitude | total_rooms | total_bedrooms | latitude | rooms_per_household | housing_median_age | population | population_per_household | bedrooms_per_room | households | median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.0 | 0.036449 | 0.06384 | -0.925005 | -0.034814 | -0.099812 | 0.09167 | 0.011022 | 0.10232 | 0.049762 | -0.016426 |
| total_rooms | 0.036449 | 1.0 | 0.931546 | -0.025914 | 0.168926 | -0.363522 | 0.853219 | -0.029452 | -0.194185 | 0.921441 | 0.198951 |
| total_bedrooms | 0.06384 | 0.931546 | 1.0 | -0.05973 | 0.010381 | -0.324156 | 0.87734 | -0.034301 | 0.078094 | 0.979399 | -0.009833 |
| latitude | -0.925005 | -0.025914 | -0.05973 | 1.0 | 0.119118 | 0.002477 | -0.100272 | -0.002301 | -0.124507 | -0.063529 | -0.076805 |
| rooms_per_household | -0.034814 | 0.168926 | 0.010381 | 0.119118 | 1.0 | -0.181275 | -0.07621 | 0.001801 | -0.500589 | -0.085832 | 0.394154 |
| housing_median_age | -0.099812 | -0.363522 | -0.324156 | 0.002477 | -0.181275 | 1.0 | -0.292476 | 0.012167 | 0.129456 | -0.306119 | -0.119591 |
| population | 0.09167 | 0.853219 | 0.87734 | -0.100272 | -0.07621 | -0.292476 | 1.0 | 0.064998 | 0.031592 | 0.906841 | -0.000849 |
| population_per_household | 0.011022 | -0.029452 | -0.034301 | -0.002301 | 0.001801 | 0.012167 | 0.064998 | 1.0 | -0.002851 | -0.032522 | -0.000454 |
| bedrooms_per_room | 0.10232 | -0.194185 | 0.078094 | -0.124507 | -0.500589 | 0.129456 | 0.031592 | -0.002851 | 1.0 | 0.058004 | -0.616617 |
| households | 0.049762 | 0.921441 | 0.979399 | -0.063529 | -0.085832 | -0.306119 | 0.906841 | -0.032522 | 0.058004 | 1.0 | 0.011925 |


The two most strongly correlated features are `total_bedrooms` and `households`, with a correlation coefficient of about 0.98.
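
Rather than reading the matrix by eye, the strongest pair can be extracted programmatically. The sketch below is one possible approach, shown on synthetic data (the frame `toy` and its columns `a`, `b`, `c` are hypothetical): it takes absolute correlations, keeps only the upper triangle so that each pair appears once and the diagonal is excluded, and sorts the remaining values.

```python
import numpy as np
import pandas as pd

# Synthetic numeric data (hypothetical, not the housing dataset)
rng = np.random.default_rng(42)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                 # independent noise
})

corr = toy.corr().abs()
# Keep only the upper triangle (k=1 drops the diagonal of 1.0s and mirrored pairs)
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top_pair = pairs.index[0]
print(top_pair, round(pairs.iloc[0], 3))
```

Using absolute values matters here because a strong negative correlation (such as the one between `latitude` and `longitude` above) would otherwise be ranked last.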

### Make `median_house_value` binary

* We need to turn the `median_house_value` variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise.

In [None]:
df['above_average'] = df['median_house_value'] > df['median_house_value'].mean()
df = df.astype({'above_average': 'int16'})

### Split the data

* Shuffle the initial dataset, use seed `42`.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Make sure that the target value (`median_house_value`) is not in your dataframe.
* Apply the log transformation to the `median_house_value` variable using the `np.log1p()` function.


In [None]:
def train_val_test_split(df, val_split=0.2, test_split=0.2, seed=None):
  # Hold out the test set first
  df_full_train, df_test = train_test_split(df, test_size=test_split, random_state=seed)

  # val_split is a fraction of the whole dataset, so rescale it to the
  # remaining (1 - test_split) portion before the second split
  df_train, df_val = train_test_split(df_full_train, test_size=val_split/(1-test_split), random_state=seed)

  return (df_train, df_val, df_full_train, df_test)

In [None]:
df_train, df_val, df_full_train, df_test = train_val_test_split(df, seed=42)

### Question 3

* Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
* What is the value of mutual information?
* Round it to 2 decimal digits using `round(score, 2)`

In [28]:
round(mutual_info_score(df_train.above_average, df_train.ocean_proximity), 2)

0.1
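
To make the score more concrete, the sketch below computes mutual information directly from the joint distribution, $\sum p(y, c) \log \frac{p(y, c)}{p(y)\,p(c)}$, and compares it with `mutual_info_score`, which reports the value in nats (natural logarithm). The eight-row sample and the names `y`, `cat`, and `mi_manual` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

# Small hypothetical sample: a binary target and a categorical feature
y = pd.Series([1, 1, 0, 0, 1, 0, 0, 1])
cat = pd.Series(["NEAR BAY", "NEAR BAY", "INLAND", "INLAND",
                 "NEAR BAY", "INLAND", "NEAR BAY", "INLAND"])

# Joint distribution p(y, cat) and its marginals
joint = pd.crosstab(y, cat, normalize=True).to_numpy()
p_y = joint.sum(axis=1, keepdims=True)    # shape (2, 1)
p_cat = joint.sum(axis=0, keepdims=True)  # shape (1, 2)

# MI = sum over non-empty cells of p(y, cat) * log(p(y, cat) / (p(y) * p(cat)))
nonzero = joint > 0
independent = p_y @ p_cat  # what the joint would be if y and cat were independent
mi_manual = float((joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])).sum())

mi_sklearn = mutual_info_score(y, cat)
print(round(mi_manual, 4), round(mi_sklearn, 4))
```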

### Question 4

* Now let's train a logistic regression model.
* Remember that we have one categorical variable `ocean_proximity` in the data. Include it using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.


In [None]:
def prepare(dv, df, features, target):

  feat_dict = df[features].to_dict(orient='records')
  
  try:
    check_is_fitted(dv, attributes='feature_names_')
  except NotFittedError as e:
    dv.fit(feat_dict)
  
  X = dv.transform(feat_dict)
  y = df[target].values

  return (X, y)
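
As an illustration of what `DictVectorizer` does inside `prepare`, the sketch below encodes two hypothetical records. Numeric fields pass through unchanged, while each value of a string field becomes its own 0/1 column named `field=value`. The names `records`, `dv_demo`, and `X_demo` are made up for this example.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records: one numeric and one categorical field
records = [
    {"median_income": 8.3, "ocean_proximity": "NEAR BAY"},
    {"median_income": 3.1, "ocean_proximity": "INLAND"},
]

dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(records)

# Column names are sorted alphabetically by default
feature_names = list(dv_demo.feature_names_)
print(feature_names)
print(X_demo)
```

Because the vectorizer is fitted once on the training records and only applied to the validation records, a category that appears solely in validation data would be silently ignored rather than given a new column.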

In [61]:
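def train_logreg(df, features, target, seed=None):
  # Split the dataframe that was passed in instead of relying on the global df_train/df_val
  df_train, df_val, _, _ = train_val_test_split(df, seed=seed)

  dv = DictVectorizer(sparse=False)

  X_train, y_train = prepare(dv, df_train, features, target)
  X_val, y_val = prepare(dv, df_val, features, target)

  model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=seed)
  model.fit(X_train, y_train)

  # Probability of the positive class, thresholded at 0.5
  y_pred = model.predict_proba(X_val)[:, 1]
  above_average_pred = (y_pred >= 0.5)
  # Accuracy: share of validation rows where the prediction matches the target
  return (y_val == above_average_pred).mean()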

In [58]:
accuracy = train_logreg(df, features, target, seed=42)
round(accuracy, 2)

0.84

### Question 5

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
   * `total_rooms`
   * `total_bedrooms`
   * `population`
   * `households`

> **Note**: the difference doesn't have to be positive.


In [None]:
exclude_features = ['total_rooms', 'total_bedrooms', 'population', 'households']

In [59]:
accuracy_diffs = {}

for excluded in exclude_features:
  features_small = list(set(features).difference({excluded}))

  accuracy_small = train_logreg(df, features_small, target, seed=42)

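  # Sign convention: accuracy without the feature minus the original accuracy,
  # so a negative value means that removing the feature hurts the model
  accuracy_diff = accuracy_small - accuracy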
  accuracy_diffs[excluded] = accuracy_diff

In [60]:
accuracy_diffs

{'total_rooms': 0.0029069767441860517,
 'total_bedrooms': 0.0021802325581395943,
 'population': -0.009205426356589164,
 'households': -0.0021802325581394832}

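With the differences computed as the accuracy without the feature minus the original accuracy, `population` has the lowest (most negative) value, about -0.009: removing it reduces validation accuracy the most. In absolute terms, `total_bedrooms` and `households` have the smallest differences, about 0.002 each.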

### Question 6

* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column `'median_house_value'`. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model (`model = Ridge(alpha=a, solver="sag", random_state=42)`) on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`.
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.

In [65]:
df['median_house_value_log'] = np.log1p(df.median_house_value)

df_train, df_val, df_full_train, df_test = train_val_test_split(df, seed=42)

dv = DictVectorizer(sparse=False)
X_train, y_train_log = prepare(dv, df_train, features, target='median_house_value_log')
X_val, y_val_log = prepare(dv, df_val, features, target='median_house_value_log')

for alpha in [0, 0.01, 0.1, 1, 10]:
  
  model_lr = Ridge(alpha=alpha, solver="sag", random_state=42)
  model_lr.fit(X_train, y_train_log)

  y_pred_log = model_lr.predict(X_val)
    
  score = np.sqrt(mean_squared_error(y_val_log, y_pred_log))
    
  print(alpha, round(score, 3))

0 0.524
0.01 0.524
0.1 0.524
1 0.524
10 0.524
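
The RMSE values above are measured on the `log1p` scale, not in dollars. The sketch below uses three hypothetical prices and predictions to show the RMSE computed by hand next to the `mean_squared_error` version, and that `np.expm1` maps log-scale values back to the original units. All names and numbers in it are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical prices and predictions, in dollars
y_true = np.array([100_000.0, 250_000.0, 400_000.0])
y_hat = np.array([110_000.0, 240_000.0, 380_000.0])

# RMSE on the log1p scale, as used when the model is trained on log targets
y_true_log = np.log1p(y_true)
y_hat_log = np.log1p(y_hat)
rmse_manual = np.sqrt(np.mean((y_true_log - y_hat_log) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true_log, y_hat_log))

# np.expm1 inverts np.log1p, mapping log-scale values back to dollars
recovered = np.expm1(y_true_log)
print(rmse_manual, rmse_sklearn)
```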
