## Homework

### Dataset

In this homework, we will use the California Housing Prices dataset. You can download it from
[Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

The goal of this homework is to create a regression model for predicting housing prices (column `'median_house_value'`).

In [None]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

--2022-09-26 19:11:41--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘housing.csv’


2022-09-26 19:11:42 (19.4 MB/s) - ‘housing.csv’ saved [1423529/1423529]



In [None]:
import numpy as np
import pandas as pd

from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.utils.validation import check_is_fitted

In [None]:
df = pd.read_csv('housing.csv')

# Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`,
* `'longitude'`,
* `'housing_median_age'`,
* `'total_rooms'`,
* `'total_bedrooms'`,
* `'population'`,
* `'households'`,
* `'median_income'`,
* `'median_house_value'`,
* `'ocean_proximity'`

In [None]:
features = ['latitude', 'longitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income',
            'ocean_proximity', 'rooms_per_household', 'bedrooms_per_room', 'population_per_household']
target = 'above_average'

# Data preparation

* Select only the features from above and fill in the missing values with 0.
* Create a new column `rooms_per_household` by dividing the column `total_rooms` by the column `households`.
* Create a new column `bedrooms_per_room` by dividing the column `total_bedrooms` by the column `total_rooms`.
* Create a new column `population_per_household` by dividing the column `population` by the column `households`.

In [None]:
df[[feature for feature in features if feature in list(df.columns)]].isnull().sum()

latitude                0
longitude               0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
ocean_proximity         0
dtype: int64

In [None]:
df.total_bedrooms = df.total_bedrooms.fillna(0)

In [None]:
df['rooms_per_household'] = df.total_rooms / df.households
df['bedrooms_per_room'] = df.total_bedrooms / df.total_rooms
df['population_per_household'] = df.population / df.households

# Question 1

What is the most frequent observation (mode) for the column `ocean_proximity`?

In [None]:
df.ocean_proximity.mode()

0    <1H OCEAN
dtype: object

# Split the data

* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.
* Make sure that the target value (`median_house_value`) is not in your dataframe.

In [None]:
def train_val_test_split(df, seed=None):
  df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=seed)
  df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=seed)
  return df_train, df_val, df_full_train, df_test

# Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
    - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

In [None]:
df[list(set(features) - set(['ocean_proximity']))].corr()

| | longitude | total_rooms | total_bedrooms | latitude | rooms_per_household | housing_median_age | population | population_per_household | bedrooms_per_room | households | median_income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| longitude | 1.0 | 0.044568 | 0.068082 | -0.924664 | -0.02754 | -0.108197 | 0.099773 | 0.002476 | 0.084836 | 0.05531 | -0.015176 |
| total_rooms | 0.044568 | 1.0 | 0.920196 | -0.0361 | 0.133798 | -0.361262 | 0.857126 | -0.024581 | -0.174583 | 0.918484 | 0.19805 |
| total_bedrooms | 0.068082 | 0.920196 | 1.0 | -0.065318 | 0.002717 | -0.317063 | 0.866266 | -0.028019 | 0.122205 | 0.966507 | -0.007295 |
| latitude | -0.924664 | -0.0361 | -0.065318 | 1.0 | 0.106389 | 0.011173 | -0.108785 | 0.002366 | -0.104112 | -0.071035 | -0.079809 |
| rooms_per_household | -0.02754 | 0.133798 | 0.002717 | 0.106389 | 1.0 | -0.153277 | -0.072213 | -0.004852 | -0.387465 | -0.080598 | 0.326895 |
| housing_median_age | -0.108197 | -0.361262 | -0.317063 | 0.011173 | -0.153277 | 1.0 | -0.296244 | 0.013191 | 0.125396 | -0.302916 | -0.119034 |
| population | 0.099773 | 0.857126 | 0.866266 | -0.108785 | -0.072213 | -0.296244 | 1.0 | 0.069863 | 0.031397 | 0.907222 | 0.004834 |
| population_per_household | 0.002476 | -0.024581 | -0.028019 | 0.002366 | -0.004852 | 0.013191 | 0.069863 | 1.0 | 0.003047 | -0.027309 | 0.018766 |
| bedrooms_per_room | 0.084836 | -0.174583 | 0.122205 | -0.104112 | -0.387465 | 0.125396 | 0.031397 | 0.003047 | 1.0 | 0.059818 | -0.573836 |
| households | 0.05531 | 0.918484 | 0.966507 | -0.071035 | -0.080598 | -0.302916 | 0.907222 | -0.027309 | 0.059818 | 1.0 | 0.013033 |


Answer: `total_bedrooms` and `households` have the highest correlation, about 0.97.
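Rather than reading the matrix by eye, the strongest pair can be picked out programmatically. The following is a minimal sketch on toy data, not the housing dataset; the helper name `top_correlated_pair` and the toy columns are assumptions made for illustration. It ranks pairs by absolute correlation, so a strong negative correlation would also be reported.

```python
import numpy as np
import pandas as pd


def top_correlated_pair(frame: pd.DataFrame):
    """Return (feature_a, feature_b, corr) for the largest absolute off-diagonal correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle: each pair appears once, the diagonal is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    a, b = pairs.abs().idxmax()
    return a, b, pairs.loc[(a, b)]


# Toy data (hypothetical values).
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * x
    'z': [5.0, 3.0, 4.0, 1.0, 2.0],
})

a, b, value = top_correlated_pair(toy)
print(a, b, round(value, 3))
```

On the notebook's data, the same helper could be called on the numerical columns of the training frame.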

# Make `median_house_value` binary

* We need to turn the `median_house_value` variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise.

In [None]:
df['above_average'] = df['median_house_value'] > df['median_house_value'].mean()
df = df.astype({'above_average': 'int16'})

# Split the data

* Shuffle the initial dataset, use seed `42`.
* Split your data in train/val/test sets, with 60%/20%/20% distribution.
* Make sure that the target value (`median_house_value`) is not in your dataframe.
* Apply the log transformation to the `median_house_value` variable using the `np.log1p()` function.


In [None]:
def train_val_test_split(df, val_split=0.2, test_split=0.2, seed=None):

  # train_split = 1 - val_split - test_split

  # create splits
  df_full_train, df_test = train_test_split(df, test_size=test_split, random_state=seed)
  df_train, df_val = train_test_split(df_full_train, test_size=val_split/(1-test_split), random_state=seed)

  # return
  return (df_train, df_val, df_full_train, df_test)

In [None]:
df_train, df_val, df_full_train, df_test = train_val_test_split(df, seed=42)
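The second call to `train_test_split` operates on the 80% that remains after the test set is removed, which is why the function passes `val_split / (1 - test_split)` rather than `val_split`: 0.2 / 0.8 = 0.25, and 25% of the remaining 80% is 20% of the whole. A small sketch on a toy frame (the `row_id` column is a placeholder, not part of the housing data) shows the resulting sizes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 1000 rows (placeholder data).
toy = pd.DataFrame({'row_id': range(1000)})

test_split, val_split = 0.2, 0.2

# First split: hold out 20% of all rows for testing.
full_train, test = train_test_split(toy, test_size=test_split, random_state=42)

# Second split: the validation share must be expressed relative to the remaining 80%.
relative_val = val_split / (1 - test_split)
train, val = train_test_split(full_train, test_size=relative_val, random_state=42)

print(len(train), len(val), len(test))  # 600 200 200
```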

# Question 3

* Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
* What is the value of mutual information?
* Round it to 2 decimal digits using `round(score, 2)`

In [28]:
round(mutual_info_score(df_train.above_average, df_train.ocean_proximity), 2)

0.1
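To make the score less of a black box, the sketch below computes mutual information by hand from the joint and marginal frequencies and compares it with `mutual_info_score`. The labels and categories are made-up toy values, and `mutual_information` is a hypothetical helper name; the natural logarithm is used to match Scikit-Learn.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score


def mutual_information(a, b):
    """Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over cells with p(x, y) > 0."""
    joint = pd.crosstab(a, b, normalize=True).to_numpy()
    p_a = joint.sum(axis=1, keepdims=True)  # marginal of a, shape (rows, 1)
    p_b = joint.sum(axis=0, keepdims=True)  # marginal of b, shape (1, cols)
    independent = p_a @ p_b                 # what the joint would be under independence
    nonzero = joint > 0                     # empty cells contribute nothing
    return float((joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])).sum())


# Toy data (hypothetical values).
labels = pd.Series([1, 1, 1, 0, 0, 0, 1, 0])
category = pd.Series(['A', 'A', 'A', 'B', 'B', 'B', 'B', 'A'])

manual = mutual_information(labels, category)
reference = mutual_info_score(labels, category)
print(manual, reference)
```

A score of 0 means the two variables are independent; larger values mean that knowing the category tells us more about the label.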

# Question 4

* Now let's train a logistic regression model.
* Remember that we have one categorical variable `ocean_proximity` in the data. Include it using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.


In [None]:
def prepare(dv, df, features, target):

  feat_dict = df[features].to_dict(orient='records')
  
  # Fit the vectorizer only on first use (the training set); later calls reuse it.
  try:
    check_is_fitted(dv, attributes='feature_names_')
  except NotFittedError:
    dv.fit(feat_dict)
  
  X = dv.transform(feat_dict)
  y = df[target].values

  return (X, y)

In [None]:
dv = DictVectorizer(sparse=False)

X_train, y_train = prepare(dv, df_train, features, target)
X_val, y_val = prepare(dv, df_val, features, target)

In [None]:
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [None]:
y_pred = model.predict_proba(X_val)[:, 1]
above_average_pred = (y_pred >= 0.5)
# Accuracy
(y_val == above_average_pred).mean()

0.8355135658914729
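The manual accuracy computation can be cross-checked against `accuracy_score` from Scikit-Learn. The sketch below uses made-up labels and probabilities rather than the notebook's `y_val` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up ground truth and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0])
proba = np.array([0.9, 0.4, 0.3, 0.7, 0.6])

# Threshold at 0.5, as in the cell above.
pred = (proba >= 0.5).astype(int)

manual = (y_true == pred).mean()        # share of matching predictions
library = accuracy_score(y_true, pred)  # same quantity from Scikit-Learn
print(manual, library)
```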

# Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
   * `total_rooms`
   * `total_bedrooms` 
   * `population`
   * `households`

> **note**: the difference doesn't have to be positive


In [None]:
exclude_features = ['total_rooms', 'total_bedrooms', 'population', 'households']
excluded = exclude_features[1]

In [38]:
features_small = list(set(features).difference({excluded}))
features_small

['longitude',
 'total_rooms',
 'latitude',
 'rooms_per_household',
 'ocean_proximity',
 'housing_median_age',
 'population',
 'population_per_household',
 'bedrooms_per_room',
 'households',
 'median_income']

In [31]:
excluded

'total_bedrooms'

In [None]:
dict(zip(dv.get_feature_names_out(), model.coef_[0].round(3)))

{'bedrooms_per_room': 0.73,
 'households': 0.004,
 'housing_median_age': 0.036,
 'latitude': 0.109,
 'longitude': 0.084,
 'median_income': 1.219,
 'ocean_proximity=<1H OCEAN': 0.408,
 'ocean_proximity=INLAND': -1.716,
 'ocean_proximity=ISLAND': 0.076,
 'ocean_proximity=NEAR BAY': 0.226,
 'ocean_proximity=NEAR OCEAN': 0.76,
 'population': -0.002,
 'population_per_household': 0.01,
 'rooms_per_household': -0.016,
 'total_bedrooms': 0.002,
 'total_rooms': -0.0}
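The cells above exclude a single feature by hand. One possible way to run the whole elimination loop is sketched below. The helper names `validation_accuracy` and `elimination_differences` are hypothetical, and the demonstration uses synthetic data so that the block runs on its own; the model parameters are the ones given in Q4.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def validation_accuracy(df_train, df_val, feature_names, target_name):
    """Fit on df_train with the Q4 parameters and return the accuracy on df_val."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(df_train[feature_names].to_dict(orient='records'))
    X_val = dv.transform(df_val[feature_names].to_dict(orient='records'))
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, df_train[target_name].values)
    pred = (model.predict_proba(X_val)[:, 1] >= 0.5).astype(int)
    return float((pred == df_val[target_name].values).mean())


def elimination_differences(df_train, df_val, feature_names, target_name, candidates):
    """Return the baseline accuracy and, per candidate, baseline minus accuracy without it."""
    baseline = validation_accuracy(df_train, df_val, feature_names, target_name)
    diffs = {}
    for name in candidates:
        reduced = [f for f in feature_names if f != name]
        diffs[name] = baseline - validation_accuracy(df_train, df_val, reduced, target_name)
    return baseline, diffs


# Synthetic stand-in data (not the housing dataset): 'signal' determines the label,
# while 'noise' and 'group' are unrelated to it.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
    'group': rng.choice(['a', 'b'], size=n),
})
toy['label'] = (toy['signal'] > 0).astype(int)
toy_train, toy_val = toy.iloc[:300], toy.iloc[300:]

baseline, diffs = elimination_differences(
    toy_train, toy_val, ['signal', 'noise', 'group'], 'label', ['signal', 'noise']
)
print(baseline, diffs)
```

With the notebook's variables, the call would presumably look like `elimination_differences(df_train, df_val, features, target, exclude_features)`. The feature with the smallest difference is the one whose removal changes validation accuracy the least.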

In [None]:
numerical = [f for f in features if f != 'ocean_proximity']  # numerical feature names
df_full_train[numerical].corrwith(df_full_train[target])