# Homework 03
---
[mlbookcamp 03-classification](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/cohorts/2022/03-classification/homework.md)

In [102]:
import os.path as osp
import numpy as np
import pandas as pd

In [103]:
DIRPATH = "./data/"
FILENAME = "housing.csv"
DATAPATH = osp.join(DIRPATH, FILENAME)

### Downloading data

In [104]:
! ./downloading_data.sh -d $DIRPATH -f $FILENAME https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

./data//housing.csv already exists.


In [105]:
data = pd.read_csv(DATAPATH)

## Features
For the rest of the homework, use only the following columns of the housing dataset:

- 'latitude',
- 'longitude',
- 'housing_median_age',
- 'total_rooms',
- 'total_bedrooms',
- 'population',
- 'households',
- 'median_income',
- 'median_house_value',
- 'ocean_proximity'

### Data Preparation

- Select only the features from above and fill in the missing values with 0.
- Create a new column rooms_per_household by dividing total_rooms by households.
- Create a new column bedrooms_per_room by dividing total_bedrooms by total_rooms.
- Create a new column population_per_household by dividing population by households.


In [106]:
required_columns = [
    'latitude',
    'longitude',
    'housing_median_age',
    'total_rooms',
    'total_bedrooms',
    'population',
    'households',
    'median_income',
    'median_house_value',
    'ocean_proximity',
]

In [107]:
data = data[required_columns].fillna(0)
data['rooms_per_household'] = data['total_rooms'] / data['households']
data['bedrooms_per_room'] = data['total_bedrooms'] / data['total_rooms']
data['population_per_household'] = data['population'] / data['households']

## Question 1
What is the most frequent observation (mode) for the column ocean_proximity?

Options:

- NEAR BAY
- <1H OCEAN
- INLAND
- NEAR OCEAN

In [108]:
data['ocean_proximity'].mode()

0    <1H OCEAN
dtype: object

## Question 2
Create the correlation matrix for the numerical features of your train dataset.
In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
Which two features have the highest correlation in this dataset?

Options:

- total_bedrooms and households
- total_bedrooms and total_rooms
- population and households
- population_per_household and total_rooms


In [109]:
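# Correlate the numeric columns only: ocean_proximity holds strings, and recent
# pandas versions raise an error in corr() unless numeric_only=True is passed.
correlation_table = (
    data.corr(numeric_only=True)
    .stack()
    .reset_index()
    .rename({'level_0': 'feature_0', 'level_1': 'feature_1', 0: 'correlation'}, axis=1)
)
correlation_table.correlation = correlation_table.correlation.abs()
# Drop self-correlations and show the strongest remaining pair.
correlation_table.loc[correlation_table.feature_0 != correlation_table.feature_1, :].sort_values(by='correlation', ascending=False).head(1)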

| | feature_0 | feature_1 | correlation |
|---|---|---|---|
| 76 | households | total_bedrooms | 0.966507 |
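The stacked table above lists every pair twice, once as (a, b) and once as (b, a). As a minimal sketch of an alternative, the snippet below keeps only the strict upper triangle of the correlation matrix so that each pair appears once. The `toy` frame and its values are hypothetical and only serve to illustrate the idea; they are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Toy numeric frame (hypothetical values, not the housing data).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 * a, so strongly correlated with a
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr().abs()

# Keep only the strict upper triangle: each pair appears once and the
# diagonal (self-correlation of 1.0) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top_pair = pairs.index[0]
print(top_pair, round(pairs.iloc[0], 3))
```

The same masking could be applied to `data.corr(numeric_only=True)` to get the answer to this question without filtering out the diagonal by hand.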


## Make median_house_value binary
- We need to turn the median_house_value variable from numeric into binary.
- Let's create a variable above_average which is 1 if the median_house_value is above its mean value and 0 otherwise.


In [110]:
data['above_average'] = (data['median_house_value'] > data['median_house_value'].mean()).astype(int)

## Split the data
- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value ('median_house_value') is not in your dataframe.

In [111]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(data.drop(['above_average', 'median_house_value'], axis=1), 
                                                    data['above_average'], 
                                                    test_size=0.2, 
                                                    shuffle=True, 
                                                    random_state=42)
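# The first split already shuffled the rows, so shuffle=False keeps that order
# (random_state has no effect when shuffle=False).
# test_size=0.25 of the remaining 80% gives a validation set of 20% of the data.
X_train, X_val, Y_train, Y_val = train_test_split(X_train, 
                                                Y_train, 
                                                test_size=0.25, 
                                                shuffle=False, 
                                                random_state=42)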

In [112]:
np.asarray([X_train.shape[0], X_test.shape[0], X_val.shape[0]])/data.shape[0]

array([0.6, 0.2, 0.2])
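The second split uses test_size=0.25 rather than 0.2 because it is applied to the 80% that remains after the test set is removed. A short arithmetic sketch of that reasoning (the variable names are illustrative only):

```python
# Fractions used in the two-stage split above.
test_fraction = 0.2           # first split: held out for the test set
val_fraction_of_rest = 0.25   # second split: share of the remaining rows

remaining = 1.0 - test_fraction                  # 80% of the data is left
val_fraction = remaining * val_fraction_of_rest  # 25% of 80%
train_fraction = remaining - val_fraction

print(round(train_fraction, 2), round(val_fraction, 2), round(test_fraction, 2))
```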

## Question 3
- Calculate the mutual information score between above_average and ocean_proximity. Use the training set only.
- Round it to 2 decimals using round(score, 2)
- What is their mutual information score?

Options:

- 0.26
- 0
- 0.10
- 0.16

In [113]:
column_types = X_train.dtypes
cat_features = list(column_types[column_types == 'object'].index)
cat_features

['ocean_proximity']

In [114]:
from sklearn.metrics import mutual_info_score

info_score_dict = {feature: np.round(mutual_info_score(X_train[feature], Y_train), 2) 
    for feature in cat_features}
info_score_dict

{'ocean_proximity': 0.1}
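To make the score less of a black box, here is a minimal sketch that computes mutual information for two discrete variables directly from its definition, using the natural logarithm (which, to my understanding, is also what `mutual_info_score` uses). The helper `mutual_information` and the toy data are hypothetical and not part of the homework.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability
            if p_xy == 0:
                continue  # empty cells contribute nothing to the sum
            p_x = np.mean(x == xv)
            p_y = np.mean(y == yv)
            mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Toy data (hypothetical): the category fully determines the label ...
category = ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY"]
label_dependent = [0, 0, 1, 1]
# ... versus a label that is independent of the category.
label_independent = [0, 1, 0, 1]

mi_dependent = mutual_information(category, label_dependent)
mi_independent = mutual_information(category, label_independent)
print(mi_dependent, mi_independent)
```

A score near 0 means the category tells us little about the label; larger values mean knowing the category reduces uncertainty about the label.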

## Question 4
- Now let's train a logistic regression
- Remember that we have one categorical variable ocean_proximity in the data. Include it using one-hot encoding.
- Fit the model on the training dataset.
  - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  - model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

Options:

- 0.60
- 0.72
- 0.84
- 0.95

In [122]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer


preprocessing = ColumnTransformer([
    ('cat', OneHotEncoder(), cat_features)], 
    remainder='passthrough'
    )
X_train_processed = preprocessing.fit_transform(X_train)
X_test_processed = preprocessing.transform(X_test)
X_val_processed = preprocessing.transform(X_val)

In [123]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_processed, Y_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [124]:
from sklearn.metrics import accuracy_score

yhat_val = model.predict(X_val_processed)
acc = accuracy_score(Y_val, yhat_val)
np.round(acc, 2)

0.83
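One possible way to keep the encoder and the classifier together is a scikit-learn `Pipeline`, so that fitting and predicting always apply the same preprocessing. This is a sketch on a small hypothetical frame (`toy_X`, `toy_y`), not the notebook's actual workflow; `handle_unknown="ignore"` is an extra assumption added so unseen categories do not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame (hypothetical values) with one categorical and one numeric column.
toy_X = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR BAY",
                        "INLAND", "NEAR BAY", "INLAND", "NEAR BAY"],
    "median_income": [1.5, 6.0, 2.0, 7.5, 1.0, 8.0, 2.5, 6.5],
})
toy_y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

pipeline = Pipeline([
    # One-hot encode the categorical column, pass the numeric column through.
    ("prep", ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"])],
        remainder="passthrough",
    )),
    # Same model parameters as in the question.
    ("clf", LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)),
])

# Fit on the first six rows, score on the last two.
pipeline.fit(toy_X.iloc[:6], toy_y.iloc[:6])
toy_acc = accuracy_score(toy_y.iloc[6:], pipeline.predict(toy_X.iloc[6:]))
print(round(toy_acc, 2))
```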

## Question 5
- Let's find the least useful feature using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
- Which of the following features has the smallest difference?
    -  total_rooms
    -  total_bedrooms
    -  population
    -  households
> note: the difference doesn't have to be positive

In [125]:
feature_selection_dict = {}
for del_feature in ['total_rooms', 'total_bedrooms', 'population', 'households']:
    selected_cat_features = [feature for feature in cat_features if feature != del_feature]
    x_query = X_train.drop(columns=del_feature)
    val_query = X_val.drop(columns=del_feature)
    preprocessing = ColumnTransformer([
        ('cat', OneHotEncoder(), selected_cat_features)], 
        remainder='passthrough'
        )
    x_query = preprocessing.fit_transform(x_query)
    val_query = preprocessing.transform(val_query)
    model.fit(x_query, Y_train)
    feature_selection_dict[del_feature] = accuracy_score(Y_val, model.predict(val_query))
feature_selection_dict

{'total_rooms': 0.8289728682170543,
 'total_bedrooms': 0.8263081395348837,
 'population': 0.8391472868217055,
 'households': 0.8301841085271318}

In [126]:
feature_selection_df = pd.DataFrame(feature_selection_dict, index=['accuracy']).T
(acc - feature_selection_df.accuracy).abs().sort_values(ascending=True)

total_rooms       0.000969
total_bedrooms    0.001696
households        0.002180
population        0.011143
Name: accuracy, dtype: float64
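Since the note says the difference does not have to be positive, it can help to look at the signed differences before taking absolute values. The sketch below reuses the accuracies printed above. The baseline of 0.828 is an assumption, back-computed from the absolute differences in the output, because the notebook only displays the baseline rounded to 0.83.

```python
# Validation accuracies without each feature, copied from the output above.
accuracy_without = {
    "total_rooms": 0.8289728682170543,
    "total_bedrooms": 0.8263081395348837,
    "population": 0.8391472868217055,
    "households": 0.8301841085271318,
}
# Assumption: approximate baseline accuracy, inferred from the printed differences.
baseline_accuracy = 0.828

# Positive: removing the feature hurt accuracy. Negative: removing it helped.
signed_diff = {
    name: baseline_accuracy - acc_without
    for name, acc_without in accuracy_without.items()
}
least_useful = min(signed_diff, key=lambda name: abs(signed_diff[name]))

for name, diff in signed_diff.items():
    print(f"{name:15s} {diff:+.6f}")
print("smallest absolute difference:", least_useful)
```

A negative value means the model scored slightly higher without that feature.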

## Question 6
- For this question, we'll see how to use a linear regression model from Scikit-Learn
- We'll need to use the original column 'median_house_value'. Apply the logarithmic transformation to this column.
- Fit the Ridge regression model (model = Ridge(alpha=a, solver="sag", random_state=42)) on the training data.
- This model has a parameter alpha. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
- Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest alpha.

Options:

- 0
- 0.01
- 0.1
- 1
- 10

In [89]:
Y = np.log1p(data.median_house_value)

X_train, X_test, Y_train, Y_test = train_test_split(data.drop(['above_average', 'median_house_value'], axis=1), 
                                                    Y, 
                                                    test_size=0.2, 
                                                    shuffle=True, 
                                                    random_state=42)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, 
                                                Y_train, 
                                                test_size=0.25, 
                                                shuffle=False, 
                                                random_state=42)

In [93]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

alpha_values = [0, 0.01, 0.1, 1, 10]
score_dict = {'alpha': [], 'val_score': []}
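for alpha in alpha_values:
    model = Ridge(alpha=alpha, solver="sag", random_state=42)
    # X_train_processed / X_val_processed come from the Q4 cell; the split above
    # uses the same seed and sizes, so their rows line up with the new Y_train / Y_val.
    model.fit(X_train_processed, Y_train)
    yhat = model.predict(X_val_processed)
    score_dict['alpha'].append(alpha)
    # RMSE as the square root of MSE; the squared=False argument is deprecated
    # in recent scikit-learn releases.
    score_dict['val_score'].append(np.sqrt(mean_squared_error(Y_val, yhat)))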

In [96]:
score_df = pd.DataFrame(score_dict)
score_df['val_score'] = score_df['val_score'].round(3)
# Lowest RMSE first; ties are broken by the smallest alpha.
score_df.sort_values(by=['val_score', 'alpha'], ascending=[True, True])

| | alpha | val_score |
|---|---|---|
| 0 | 0.0 | 0.534 |
| 1 | 0.01 | 0.534 |
| 2 | 0.1 | 0.534 |
| 3 | 1.0 | 0.534 |
| 4 | 10.0 | 0.534 |
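All five alphas give the same RMSE after rounding to 3 decimals, so the tie-breaking rule decides the answer. A minimal sketch of both steps follows: RMSE computed by hand on the log1p scale, and selection of the smallest alpha among tied scores. The `rmse` helper and the toy prices are hypothetical; the scores dictionary copies the table above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy prices (hypothetical), compared on the log1p scale as in the cells above.
prices = np.array([100_000.0, 200_000.0, 300_000.0])
predicted_prices = np.array([110_000.0, 190_000.0, 300_000.0])
log_rmse = rmse(np.log1p(prices), np.log1p(predicted_prices))

# Rounded validation scores copied from the table above.
scores = {0: 0.534, 0.01: 0.534, 0.1: 0.534, 1: 0.534, 10: 0.534}
# Sort key: lowest RMSE first, then the smallest alpha among ties.
best_alpha = min(scores, key=lambda a: (scores[a], a))
print(best_alpha)
```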
