# Machine Learning for Classification
House price prediction

## Dataset

The dataset is California Housing Prices from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

Here's a wget-able [link](https://github.com/Ksyula/ML_Engineering/blob/master/02-regression/housing.csv):

```bash
wget https://raw.githubusercontent.com/Ksyula/ML_Engineering/master/02-regression/housing.csv
```

The column of interest is the housing price, `'median_house_value'`. It is a continuous value, and it is binarized later in this notebook to turn the task into classification.

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

np.__version__, pd.__version__

('1.21.5', '1.4.3')

## Data preparation

In [10]:
data = pd.read_csv('../02-regression/housing.csv')
data.shape

(20640, 10)

In [11]:
variables = ['latitude','longitude','housing_median_age','total_rooms','total_bedrooms','population','households','median_income','median_house_value','ocean_proximity']
data = data[variables].fillna(0)
data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| latitude | 37.88 | 37.86 | 37.85 | 37.85 | 37.85 |
| longitude | -122.23 | -122.22 | -122.24 | -122.25 | -122.25 |
| housing_median_age | 41.0 | 21.0 | 52.0 | 52.0 | 52.0 |
| total_rooms | 880.0 | 7099.0 | 1467.0 | 1274.0 | 1627.0 |
| total_bedrooms | 129.0 | 1106.0 | 190.0 | 235.0 | 280.0 |
| population | 322.0 | 2401.0 | 496.0 | 558.0 | 565.0 |
| households | 126.0 | 1138.0 | 177.0 | 219.0 | 259.0 |
| median_income | 8.3252 | 8.3014 | 7.2574 | 5.6431 | 3.8462 |
| median_house_value | 452600.0 | 358500.0 | 352100.0 | 341300.0 | 342200.0 |
| ocean_proximity | NEAR BAY | NEAR BAY | NEAR BAY | NEAR BAY | NEAR BAY |


In [12]:
data.dtypes

latitude              float64
longitude             float64
housing_median_age    float64
total_rooms           float64
total_bedrooms        float64
population            float64
households            float64
median_income         float64
median_house_value    float64
ocean_proximity        object
dtype: object

## Feature engineering

In [13]:
data['rooms_per_household'] = data['total_rooms'] / data['households']
data['bedrooms_per_room'] = data['total_bedrooms'] / data['total_rooms']
data['population_per_household'] = data['population'] / data['households']
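
To make the three ratios concrete, here is a minimal sketch that recomputes them for the first two rows of the `data.head()` output shown earlier. The frame name `sample` and the use of `assign` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# First two rows of the `data.head()` output shown earlier (illustrative subset).
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

# Same three ratios as the cell above, built without mutating `sample`.
ratios = sample.assign(
    rooms_per_household=sample["total_rooms"] / sample["households"],
    bedrooms_per_room=sample["total_bedrooms"] / sample["total_rooms"],
    population_per_household=sample["population"] / sample["households"],
)
print(ratios.round(3))
```

Each new column is a plain element-wise division, so a zero denominator would produce `inf` or `NaN` rather than raise an error.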

### Question 1

What is the most frequent observation (mode) for the column `ocean_proximity`?

Options:
* NEAR BAY
* **<1H OCEAN**
* INLAND
* NEAR OCEAN

In [14]:
data['ocean_proximity'].mode()

0    <1H OCEAN
Name: ocean_proximity, dtype: object

## Set up validation framework

In [18]:
# Split the data into train/val/test sets with a 60%/20%/20% distribution

# Hold out 20% of all rows for the test set
df_full_train, df_test = train_test_split(data, test_size = 0.2, random_state=42)
# 25% of the remaining 80% equals 20% of the original data
df_train, df_val = train_test_split(df_full_train, test_size = 0.25, random_state=42)

len(df_train), len(df_val), len(df_test)

(12384, 4128, 4128)
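
The second split uses `test_size = 0.25` because it is applied to the 80% left over after the first split, and 0.25 × 0.8 = 0.2. The sketch below checks this arithmetic on a stand-in array of row indices, assuming only the row count (20640) shown earlier. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the housing frame: one index per row (20640 rows, as shown above).
rows = np.arange(20640)

# First split: hold out 20% of all rows for the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% is 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

fractions = [round(len(part) / len(rows), 2) for part in (train, val, test)]
print(len(train), len(val), len(test))
print(fractions)
```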

In [22]:
df_train = df_train.reset_index(drop = True)
df_val = df_val.reset_index(drop = True)
df_test = df_test.reset_index(drop = True)

In [23]:
y_train = df_train.median_house_value.values
y_val = df_val.median_house_value.values
y_test = df_test.median_house_value.values

del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']

## EDA

In [29]:
data.isna().sum()

latitude                    0
longitude                   0
housing_median_age          0
total_rooms                 0
total_bedrooms              0
population                  0
households                  0
median_income               0
median_house_value          0
ocean_proximity             0
rooms_per_household         0
bedrooms_per_room           0
population_per_household    0
dtype: int64

### Question 2

* Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your train dataset.
    - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* Which two features have the highest correlation in this dataset?

Options:
* **`total_bedrooms` and `households`**
* `total_bedrooms` and `total_rooms`
* `population` and `households`
* `population_per_household` and `total_rooms`

In [54]:
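# Keep numeric columns only so the categorical `ocean_proximity` is excluded explicitly
cor_mat = df_train.select_dtypes(include='number').corr(method='pearson')
# Keep the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = cor_mat.where(np.triu(np.ones(cor_mat.shape), k=1).astype(bool))
cor_coefs = upper.stack().reset_index().rename(columns={0: "coef"})
cor_coefs.sort_values(by = "coef", ascending = False, key=abs).head(3)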


| | level_0 | level_1 | coef |
|---|---|---|---|
| 35 | total_bedrooms | households | 0.979399 |
| 27 | total_rooms | total_bedrooms | 0.931546 |
| 0 | latitude | longitude | -0.925005 |
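
The upper-triangle mask used in the cell above can be easier to follow on a small example. This is a minimal sketch on a hypothetical three-column frame (`toy`, with made-up values), where `b` is an exact multiple of `a`, so that pair should come out on top.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: b = 2 * a, while c is only weakly related to either.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr(method="pearson")

# True strictly above the diagonal: drops self-correlations and mirrored pairs.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one row per unique pair, indexed by (feature_1, feature_2).
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value so strong negative correlations are not overlooked.
top_pair = pairs.abs().idxmax()
print(top_pair)
```

Sorting by absolute value matters in the real data too: the `latitude`/`longitude` pair ranks third only because its coefficient is strongly negative.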


### Target variable
Make `median_house_value` binary

* We need to turn the `median_house_value` variable from numeric into binary.
* Let's create a variable `above_average` which is `1` if the `median_house_value` is above its mean value and `0` otherwise.

In [66]:
median_house_value_mean = round(data.median_house_value.mean(), 2)
median_house_value_mean

206855.82

In [67]:
y_train = [1 if y > median_house_value_mean else 0 for y in y_train]
y_val = [1 if y > median_house_value_mean else 0 for y in y_val]
y_test = [1 if y > median_house_value_mean else 0 for y in y_test]
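
The list comprehensions above can also be written as a vectorized NumPy comparison. The sketch below is one possible equivalent, using the mean reported above as the threshold and a few hypothetical prices. Note that it yields an integer array rather than a Python list.

```python
import numpy as np

# Threshold is the mean reported above; the prices are hypothetical examples.
threshold = 206855.82
prices = np.array([452600.0, 358500.0, 150000.0, 206855.82])

# Strictly greater than the mean maps to 1; everything else, including ties, maps to 0.
above_average = (prices > threshold).astype(int)
print(above_average)  # [1 1 0 0]
```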

### Question 3

* Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
* What is the value of mutual information?
* Round it to 2 decimal digits using `round(score, 2)`

Options:
- 0.263
- 0.00001
- 0.101
- 0.15555
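
One way to compute this score is `mutual_info_score` from `sklearn.metrics`. The sketch below runs it on hypothetical toy data, not on the housing set, so it does not answer the question. It only shows the call and the two extreme cases.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical toy data, not the housing set.
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "INLAND", "NEAR BAY"],
    "above_average": [0, 0, 1, 1, 0, 1],
    "unrelated": ["x", "y", "x", "y", "x", "y"],
})

# The category determines the label exactly, so the score equals the label's
# entropy: ln(2), about 0.69 (the function uses the natural logarithm).
informative = mutual_info_score(toy["ocean_proximity"], toy["above_average"])

# A column that is nearly independent of the label scores close to zero.
uninformative = mutual_info_score(toy["unrelated"], toy["above_average"])

print(round(informative, 2), round(uninformative, 2))
```

For this notebook, the analogous call would presumably be `mutual_info_score(df_train.ocean_proximity, y_train)`, rounded with `round(score, 2)`.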