# Context

Create a regression model for predicting housing prices (column 'median_house_value') using the California Housing Prices dataset from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

# Load Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
%matplotlib inline

# Load Data 

In [3]:
df = pd.read_csv('./data/03-California Housing Prices.csv')

In [4]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


# Data Preparation

- Select only the features from above and fill in the missing values with 0.
- Create a new column rooms_per_household by dividing the column total_rooms by the column households.
- Create a new column bedrooms_per_room by dividing the column total_bedrooms by the column total_rooms.
- Create a new column population_per_household by dividing the column population by the column households.
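
The preparation steps listed above can be bundled into one helper, which makes them easier to reapply later. This is only a sketch: the function name `add_ratio_features` and the toy frame are hypothetical (the first row reuses values from `df.head()`, the second row has its `total_bedrooms` blanked out to show the fill).

```python
import numpy as np
import pandas as pd


def add_ratio_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with missing values set to 0 and three ratio columns added."""
    out = frame.fillna(0)  # fillna returns a new frame, so the input is left untouched
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    return out


# Hypothetical two-row frame; the NaN stands in for a missing total_bedrooms value
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, np.nan],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})
prepared = add_ratio_features(toy)
print(prepared.round(3))
```

Because missing bedrooms are filled with 0 before the division, the affected rows get a bedrooms_per_room of 0 rather than NaN.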

In [5]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [6]:
df = df.fillna(0)

In [7]:
df['rooms_per_household'] = df['total_rooms'] / df['households']

In [8]:
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']

In [9]:
df['population_per_household'] = df['population'] / df['households']

# Q1

What is the most frequent observation (mode) for the column ocean_proximity?

In [10]:
df['ocean_proximity'].mode()

0    <1H OCEAN
Name: ocean_proximity, dtype: object

# Q2

Create the correlation matrix for the numerical features of your train dataset.
- In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.

What are the two features that have the biggest correlation in this dataset?

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   longitude                 20640 non-null  float64
 1   latitude                  20640 non-null  float64
 2   housing_median_age        20640 non-null  float64
 3   total_rooms               20640 non-null  float64
 4   total_bedrooms            20640 non-null  float64
 5   population                20640 non-null  float64
 6   households                20640 non-null  float64
 7   median_income             20640 non-null  float64
 8   median_house_value        20640 non-null  float64
 9   ocean_proximity           20640 non-null  object 
 10  rooms_per_household       20640 non-null  float64
 11  bedrooms_per_room         20640 non-null  float64
 12  population_per_household  20640 non-null  float64
dtypes: float64(12), object(1)
memory usage: 2.0+ MB


In [12]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity', 'rooms_per_household',
       'bedrooms_per_room', 'population_per_household'],
      dtype='object')

In [13]:
numeric = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 
           'median_income', 'median_house_value', 'rooms_per_household', 'bedrooms_per_room', 'population_per_household']

In [14]:
correlation = df[numeric].corr()

In [15]:
correlation

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,bedrooms_per_room,population_per_household
longitude,1.0,-0.924664,-0.108197,0.044568,0.068082,0.099773,0.05531,-0.015176,-0.045967,-0.02754,0.084836,0.002476
latitude,-0.924664,1.0,0.011173,-0.0361,-0.065318,-0.108785,-0.071035,-0.079809,-0.14416,0.106389,-0.104112,0.002366
housing_median_age,-0.108197,0.011173,1.0,-0.361262,-0.317063,-0.296244,-0.302916,-0.119034,0.105623,-0.153277,0.125396,0.013191
total_rooms,0.044568,-0.0361,-0.361262,1.0,0.920196,0.857126,0.918484,0.19805,0.134153,0.133798,-0.174583,-0.024581
total_bedrooms,0.068082,-0.065318,-0.317063,0.920196,1.0,0.866266,0.966507,-0.007295,0.049148,0.002717,0.122205,-0.028019
population,0.099773,-0.108785,-0.296244,0.857126,0.866266,1.0,0.907222,0.004834,-0.02465,-0.072213,0.031397,0.069863
households,0.05531,-0.071035,-0.302916,0.918484,0.966507,0.907222,1.0,0.013033,0.065843,-0.080598,0.059818,-0.027309
median_income,-0.015176,-0.079809,-0.119034,0.19805,-0.007295,0.004834,0.013033,1.0,0.688075,0.326895,-0.573836,0.018766
median_house_value,-0.045967,-0.14416,0.105623,0.134153,0.049148,-0.02465,0.065843,0.688075,1.0,0.151948,-0.238759,-0.023737
rooms_per_household,-0.02754,0.106389,-0.153277,0.133798,0.002717,-0.072213,-0.080598,0.326895,0.151948,1.0,-0.387465,-0.004852


In [16]:
correlation_unstack = correlation.unstack().sort_values(ascending=False, kind='quicksort')

In [17]:
correlation_unstack

longitude            longitude              1.000000
latitude             latitude               1.000000
bedrooms_per_room    bedrooms_per_room      1.000000
rooms_per_household  rooms_per_household    1.000000
median_house_value   median_house_value     1.000000
                                              ...   
bedrooms_per_room    rooms_per_household   -0.387465
                     median_income         -0.573836
median_income        bedrooms_per_room     -0.573836
longitude            latitude              -0.924664
latitude             longitude             -0.924664
Length: 144, dtype: float64

In [18]:
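# The first len(numeric) entries are the diagonal (each feature with itself),
# so the next position holds the largest correlation between two different features.
correlation_unstack.iloc[[len(numeric)]]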

total_bedrooms  households    0.966507
dtype: float64
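
The positional lookup above depends on the diagonal entries sorting ahead of everything else. A possible alternative is to mask the diagonal and the duplicate lower triangle before searching. The helper name `strongest_pair` is hypothetical, and the small matrix below reuses four features and their coefficients from the correlation table above.

```python
import numpy as np
import pandas as pd


def strongest_pair(corr: pd.DataFrame):
    """Return (feature_a, feature_b, r) for the largest |r| off the diagonal."""
    # Keep only the strict upper triangle so each pair appears exactly once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()  # drop the masked-out entries
    a, b = pairs.abs().idxmax()
    return a, b, pairs.loc[(a, b)]


names = ["longitude", "latitude", "total_bedrooms", "households"]
corr = pd.DataFrame(
    [
        [1.0, -0.924664, 0.068082, 0.05531],
        [-0.924664, 1.0, -0.065318, -0.071035],
        [0.068082, -0.065318, 1.0, 0.966507],
        [0.05531, -0.071035, 0.966507, 1.0],
    ],
    index=names,
    columns=names,
)
feature_a, feature_b, r = strongest_pair(corr)
print(feature_a, feature_b, r)
```

This sketch ranks pairs by absolute value, so a strong negative correlation could win. Here the result is the same either way, since 0.966507 is larger than 0.924664.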

# Make median_house_value binary

- We need to turn the median_house_value variable from numeric into binary.
- Let's create a variable above_average which is 1 if the median_house_value is above its mean value and 0 otherwise.
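
As a small illustration of the thresholding described above, here is a sketch on a toy series. The first two prices come from `df.head()`; the last two are hypothetical values added so that both classes appear.

```python
import pandas as pd

# Two prices from the dataset preview plus two hypothetical low prices
prices = pd.Series([452600.0, 358500.0, 82700.0, 94600.0], name="median_house_value")

threshold = prices.mean()                          # mean of the four values
above_average = (prices > threshold).astype(int)   # 1 if above the mean, else 0
print(threshold)
print(above_average.tolist())
```

A strict comparison (`>`) is used, so a value exactly equal to the mean is labelled 0, matching the "1 if above, 0 otherwise" wording.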

In [19]:
above_average = df['median_house_value'].mean()

In [20]:
df['median_house_value'] = (df['median_house_value'] > above_average).astype(int)

In [21]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,rooms_per_household,bedrooms_per_room,population_per_household
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,1,NEAR BAY,6.984127,0.146591,2.555556
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,1,NEAR BAY,6.238137,0.155797,2.109842
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,1,NEAR BAY,8.288136,0.129516,2.80226
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,1,NEAR BAY,5.817352,0.184458,2.547945
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,1,NEAR BAY,6.281853,0.172096,2.181467


# Split the data

- Split your data into train/val/test sets with a 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value (median_house_value) is not in your dataframe.
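
A 60%/20%/20% split takes two calls to train_test_split: first hold out 20% for test, then take 25% of the remaining 80% for validation (0.25 × 0.8 = 0.2). The sketch below only checks the resulting sizes, using an index array with the same number of rows that `df.info()` reports (20640). The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(20640)  # stand-in for the dataframe rows

# Step 1: 80% full train, 20% test
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)
# Step 2: 25% of the 80% becomes validation, i.e. 20% of the whole
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
print(len(train) / len(rows), len(val) / len(rows), len(test) / len(rows))
```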

In [22]:
from sklearn.model_selection import train_test_split

In [23]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [24]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [25]:
y_train = df_train['median_house_value']
y_val = df_val['median_house_value']
y_test = df_test['median_house_value']

In [26]:
del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']

In [27]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

# Q3

- Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
- What is the value of mutual information?
- Round it to 2 decimal digits using round(score, 2)
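
To get a feel for what the score measures, here is a sketch on a hypothetical six-row sample. Two categories determine the label completely and one tells us nothing, so the score lands between 0 and the entropy of the label. The toy frame and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical sample: INLAND is always 0, NEAR BAY is always 1, <1H OCEAN is a coin flip
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "<1H OCEAN", "<1H OCEAN"],
    "above_average": [0, 0, 1, 1, 0, 1],
})

# mutual_info_score works on label arrays and returns the value in nats
score = mutual_info_score(toy["ocean_proximity"], toy["above_average"])
print(round(score, 2))
```

For this sample the score works out to (2/3)·ln 2, roughly 0.46 nats: the label entropy is ln 2, and one third of the rows keep all of that uncertainty.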

In [28]:
from sklearn.metrics import mutual_info_score

In [29]:
# Use the training set only, as the question requires
round(mutual_info_score(df_train['ocean_proximity'], y_train), 2)

0.1

# Q4

- Now let's train a logistic regression
- Remember that we have one categorical variable ocean_proximity in the data. Include it using one-hot encoding.
- Fit the model on the training dataset.
 - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
 - model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
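
One way to keep the encoding and the model together is a scikit-learn pipeline, so the encoder fitted on the training data is reused unchanged on validation data. This is a sketch on hypothetical toy rows, not the notebook's own approach. `handle_unknown="ignore"` is an added assumption so that a category unseen during fitting (ISLAND here) is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and validation rows
train = pd.DataFrame({
    "median_income": [8.3, 7.2, 1.7, 2.4, 5.6, 1.9],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN", "INLAND"],
})
y_train = [1, 1, 0, 0, 1, 0]
val = pd.DataFrame({
    "median_income": [6.9, 2.0],
    "ocean_proximity": ["NEAR BAY", "ISLAND"],
})
y_val = [1, 0]

# One-hot encode the categorical column, pass numeric columns through unchanged
encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["ocean_proximity"])],
    remainder="passthrough",
)
clf = make_pipeline(
    encode,
    LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42),
)
clf.fit(train, y_train)

accuracy = accuracy_score(y_val, clf.predict(val))
print(round(accuracy, 2))
```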

In [30]:
categorical = ['ocean_proximity']

In [31]:
from sklearn.preprocessing import OneHotEncoder

In [32]:
one_hot_encoder = OneHotEncoder()

In [33]:
X_train_cat = pd.DataFrame(one_hot_encoder.fit_transform(df_train[categorical]).toarray(), columns=one_hot_encoder.get_feature_names_out())

In [34]:
X_train_cat.head()

Unnamed: 0,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0


In [35]:
X_train = df_train.join(X_train_cat)

In [36]:
del X_train['ocean_proximity']

In [37]:
X_train.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,rooms_per_household,bedrooms_per_room,population_per_household,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-119.67,34.43,39.0,1467.0,381.0,1404.0,374.0,2.3681,3.92246,0.259714,3.754011,1.0,0.0,0.0,0.0,0.0
1,-118.32,33.74,24.0,6097.0,794.0,2248.0,806.0,10.1357,7.564516,0.130228,2.789082,0.0,0.0,0.0,0.0,1.0
2,-121.62,39.13,41.0,1317.0,309.0,856.0,337.0,1.6719,3.908012,0.234624,2.540059,0.0,1.0,0.0,0.0,0.0
3,-118.63,34.24,9.0,4759.0,924.0,1884.0,915.0,4.8333,5.201093,0.194158,2.059016,1.0,0.0,0.0,0.0,0.0
4,-122.3,37.52,38.0,2769.0,387.0,994.0,395.0,5.5902,7.010127,0.139762,2.516456,0.0,0.0,0.0,0.0,1.0


In [38]:
from sklearn.linear_model import LogisticRegression

In [39]:
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)

In [40]:
model.fit(X_train, y_train)

In [41]:
X_val_cat = pd.DataFrame(one_hot_encoder.transform(df_val[categorical]).toarray(), columns=one_hot_encoder.get_feature_names_out())

In [42]:
X_val = df_val.join(X_val_cat)

In [43]:
del X_val['ocean_proximity']

In [44]:
X_val.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,rooms_per_household,bedrooms_per_room,population_per_household,ocean_proximity_<1H OCEAN,ocean_proximity_INLAND,ocean_proximity_ISLAND,ocean_proximity_NEAR BAY,ocean_proximity_NEAR OCEAN
0,-119.85,36.74,35.0,1191.0,190.0,537.0,182.0,3.5375,6.543956,0.15953,2.950549,0.0,1.0,0.0,0.0,0.0
1,-124.16,41.02,23.0,1672.0,385.0,1060.0,390.0,2.1726,4.287179,0.230263,2.717949,0.0,0.0,0.0,0.0,1.0
2,-117.92,33.67,14.0,6224.0,1679.0,3148.0,1589.0,4.2071,3.916929,0.269762,1.98112,1.0,0.0,0.0,0.0,0.0
3,-118.45,34.15,10.0,1091.0,260.0,517.0,266.0,4.1727,4.101504,0.238313,1.943609,1.0,0.0,0.0,0.0,0.0
4,-117.9,33.63,28.0,2370.0,352.0,832.0,347.0,7.1148,6.829971,0.148523,2.397695,1.0,0.0,0.0,0.0,0.0


In [45]:
from sklearn.metrics import accuracy_score

In [46]:
accuray = accuracy_score(model.predict(X_val), y_val).round(2)

In [47]:
accuray

0.84

# Q5

- Let's find the least useful feature using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
- Which of the following features has the smallest difference?
 - total_rooms
 - total_bedrooms
 - population
 - households
 
Note: the difference doesn't have to be positive.
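
The elimination procedure can be written as a reusable function. The sketch below runs on synthetic data with one informative column and two noise columns; the function name `elimination_differences` and all column names are hypothetical. It compares against the unrounded baseline accuracy and keeps the sign of each difference, which is one possible reading of the task.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def elimination_differences(X_train, y_train, X_val, y_val, candidates):
    """Map each candidate feature to (baseline accuracy - accuracy without it)."""

    def fit_and_score(columns):
        model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
        model.fit(X_train[columns], y_train)
        return accuracy_score(y_val, model.predict(X_val[columns]))

    all_columns = list(X_train.columns)
    baseline = fit_and_score(all_columns)
    return {
        name: baseline - fit_and_score([c for c in all_columns if c != name])
        for name in candidates
    }


# Synthetic data: the label depends only on the "signal" column
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["signal", "noise_a", "noise_b"])
y = (X["signal"] > 0).astype(int)

diffs = elimination_differences(
    X.iloc[:150], y.iloc[:150], X.iloc[150:], y.iloc[150:],
    ["signal", "noise_a", "noise_b"],
)
# The least useful feature is the one whose removal changes accuracy the least
least_useful = min(diffs, key=lambda name: abs(diffs[name]))
print(diffs, least_useful)
```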

In [48]:
for i in ['total_rooms', 'total_bedrooms', 'population', 'households']:
    X_train_new = X_train.copy()
    del X_train_new[i]
    X_val_new = X_val.copy()
    del X_val_new[i]
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train_new, y_train)
    diff = np.abs(accuray - accuracy_score(model.predict(X_val_new), y_val))
    print(f'{i}: {diff}')

total_rooms: 0.0030329457364340895
total_bedrooms: 0.0054554263565891326
population: 0.013691860465116279
households: 0.006908914728682158


In [49]:
np.array([0.003, 0.005, 0.013, 0.006]).min()

0.003

# Q6

- For this question, we'll see how to use a linear regression model from Scikit-Learn
- We'll need to use the original column 'median_house_value'. Apply the logarithmic transformation to this column.
- Fit the Ridge regression model (model = Ridge(alpha=a, solver="sag", random_state=42)) on the training data.
- This model has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10]
- Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest alpha.

In [50]:
df = pd.read_csv('./data/03-California Housing Prices.csv')

In [51]:
df = df.fillna(0)

In [52]:
df['rooms_per_household'] = df['total_rooms'] / df['households']

In [53]:
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']

In [54]:
df['population_per_household'] = df['population'] / df['households']

In [55]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [56]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [57]:
y_train = np.log10(df_train['median_house_value'])
y_val = np.log10(df_val['median_house_value'])
y_test = np.log10(df_test['median_house_value'])

In [58]:
del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']

In [61]:
from sklearn.linear_model import Ridge

In [62]:
from sklearn.metrics import mean_squared_error

In [66]:
for a in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=a, solver="sag", random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    print(f'a={a}    RMSE={mean_squared_error(y_val, y_pred) ** 0.5}')

a=0    RMSE=0.2275994757519327
a=0.01    RMSE=0.22759947575936526
a=0.1    RMSE=0.22759947582997314
a=1    RMSE=0.22759947653976415
a=10    RMSE=0.22759948362651974


In [69]:
np.array([0.2275994757519327, 0.22759947575936526, 0.22759947582997314, 0.22759947653976415, 0.22759948362651974]).min()

0.2275994757519327
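
The question asks for RMSE rounded to 3 decimal digits, with ties going to the smallest alpha. A sketch of that selection rule, using the RMSE values printed by the loop above (the dictionary name `rmse_by_alpha` is illustrative):

```python
# RMSE per alpha, copied from the loop output above
rmse_by_alpha = {
    0: 0.2275994757519327,
    0.01: 0.22759947575936526,
    0.1: 0.22759947582997314,
    1: 0.22759947653976415,
    10: 0.22759948362651974,
}

# Round first, then pick the smallest alpha among those sharing the best score
rounded = {alpha: round(rmse, 3) for alpha, rmse in rmse_by_alpha.items()}
best_rmse = min(rounded.values())
best_alpha = min(alpha for alpha, rmse in rounded.items() if rmse == best_rmse)
print(best_alpha, best_rmse)
```

After rounding, all five alphas give 0.228, so the tie-break selects alpha 0.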