# Homework MLZoomcamp week 3

## Dataset

In this homework, we will use the California Housing Prices data from Kaggle:

https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

We'll be working with the 'median_house_value' variable, and we'll transform it into a classification task.

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv('house_price.csv')

In [2]:
data.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


## Features

For the rest of the homework, you'll need to use only these columns:
- 'latitude'
- 'longitude'
- 'housing_median_age'
- 'total_rooms'
- 'total_bedrooms'
- 'population'
- 'households'
- 'median_income'
- 'median_house_value'
- 'ocean_proximity'

In [3]:
features = ['latitude', 'longitude', 'housing_median_age', 'total_rooms', 
            'total_bedrooms', 'population', 'households', 'median_income', 
            'median_house_value', 'ocean_proximity']

## Data Preparation

- Select only the features from above and fill in the missing values with 0.

In [4]:
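# .copy() makes df independent of data, so the new columns added below
# do not trigger pandas' SettingWithCopyWarning
df = data[features].copy()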

In [5]:
df.head()

|   | latitude | longitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37.88 | -122.23 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | 37.86 | -122.22 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | 37.85 | -122.24 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | 37.85 | -122.25 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | 37.85 | -122.25 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [6]:
df.isnull().sum()

latitude                0
longitude               0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [7]:
df = df.fillna(0)

In [8]:
df.isnull().sum()

latitude              0
longitude             0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64
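
The select-then-fill step above can be sketched on a small frame. This is only an illustration: the toy values and the names `raw`, `cols`, and `subset` are made up and are not part of the homework data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are invented).
raw = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "unused": ["a", "b", "c"],
})

cols = ["total_rooms", "total_bedrooms"]

# Select the wanted columns as an independent copy, then replace NaN with 0.
subset = raw[cols].copy()
subset = subset.fillna(0)

# Total number of missing values left after filling.
missing_after = int(subset.isnull().sum().sum())
print(missing_after)
```

Filling with 0 follows the homework instruction; other strategies (for example, filling with the column mean) would change the downstream results.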

- Create a new column rooms_per_household by dividing the column total_rooms by the column households.

In [9]:
df['rooms_per_household'] = df['total_rooms'] / df['households']

- Create a new column bedrooms_per_room by dividing the column total_bedrooms by the column total_rooms.

In [10]:
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']

- Create a new column population_per_household by dividing the column population by the column households.

In [11]:
df['population_per_household'] = df['population'] / df['households']
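
As a compact alternative, the three ratio columns can be created in one call with `DataFrame.assign`. The sketch below uses an invented two-row frame named `toy`, so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Invented values chosen so the ratios are easy to verify.
toy = pd.DataFrame({
    "total_rooms": [880.0, 1000.0],
    "total_bedrooms": [129.0, 250.0],
    "population": [322.0, 500.0],
    "households": [126.0, 200.0],
})

# assign returns a new frame with the three derived columns added.
toy = toy.assign(
    rooms_per_household=toy["total_rooms"] / toy["households"],
    bedrooms_per_room=toy["total_bedrooms"] / toy["total_rooms"],
    population_per_household=toy["population"] / toy["households"],
)

print(toy.loc[1, ["rooms_per_household", "bedrooms_per_room",
                  "population_per_household"]])
```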

## Question 1

What is the most frequent observation (mode) for the column ocean_proximity?

In [12]:
df['ocean_proximity'].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
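
The mode can also be read off directly instead of scanning the counts. A minimal sketch on an invented series `s`:

```python
import pandas as pd

# Invented categorical values for illustration.
s = pd.Series(["<1H OCEAN", "INLAND", "<1H OCEAN", "NEAR BAY"])

# mode() returns a Series of the most frequent value(s); take the first.
most_frequent = s.mode()[0]

# Equivalent route through the counts: the label with the largest count.
same_answer = s.value_counts().idxmax()

print(most_frequent, same_answer)
```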

## Question 2

- Create the correlation matrix for the numerical features of your train dataset.
 - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
- What are the two features that have the biggest correlation in this dataset?

In [13]:
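# restrict to numeric columns: ocean_proximity is a string column and
# recent pandas versions raise an error if it is passed to corr()
df.select_dtypes(include='number').corr().abs().unstack().sort_values(ascending=False).head(13)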

latitude                  latitude                    1.000000
longitude                 longitude                   1.000000
bedrooms_per_room         bedrooms_per_room           1.000000
rooms_per_household       rooms_per_household         1.000000
median_house_value        median_house_value          1.000000
median_income             median_income               1.000000
households                households                  1.000000
population                population                  1.000000
total_bedrooms            total_bedrooms              1.000000
total_rooms               total_rooms                 1.000000
housing_median_age        housing_median_age          1.000000
population_per_household  population_per_household    1.000000
total_bedrooms            households                  0.966507
dtype: float64
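
In the listing above, the first twelve rows are each feature correlated with itself, so the answer only appears in row 13. One way to skip the diagonal and the mirrored duplicates is to keep only the strict upper triangle of the matrix. This is a sketch on synthetic data; the columns `a`, `b`, `c` are invented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear with a
    "c": rng.normal(size=200),                      # independent of a and b
})

corr = toy.corr().abs()

# Boolean mask that is True strictly above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; drop them and sort the remaining pairs.
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
top_pair = pairs.index[0]
print(top_pair)
```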

### Make median_house_value binary


- We need to turn the median_house_value variable from numeric into binary.
- Let's create a variable above_average which is 1 if the median_house_value is above its mean value and 0 otherwise.

In [14]:
df['above_average'] = 0

In [15]:
df.loc[df['median_house_value'] > df['median_house_value'].mean(), 'above_average'] = 1
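
The same binarization can be written as a single comparison cast to integers. A minimal sketch on an invented series `prices`:

```python
import pandas as pd

# Invented prices; their mean is 400.
prices = pd.Series([100.0, 200.0, 300.0, 1000.0])

# The comparison yields booleans; astype(int) maps True/False to 1/0.
above = (prices > prices.mean()).astype(int)
print(above.tolist())
```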

### Split the data

- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value (median_house_value) is not in your dataframe.

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [18]:
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [19]:
len(df_train), len(df_val), len(df_test)

(12384, 4128, 4128)

In [20]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [21]:
y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

In [22]:
del df_train['median_house_value']
del df_train['above_average']
del df_val['median_house_value']
del df_val['above_average']
del df_test['median_house_value']
del df_test['above_average']
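
The second split uses test_size=0.25 because 25% of the remaining 80% equals 20% of the full data. The sketch below checks this on an invented 100-row frame; `toy` and its columns are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented frame with one feature and one binary target.
toy = pd.DataFrame({
    "x": list(range(100)),
    "target": [i % 2 for i in range(100)],
})

# First split: 80% / 20%. Second split: 75% / 25% of the 80%.
full_train, test = train_test_split(toy, test_size=0.2, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

train = train.reset_index(drop=True)

# pop removes the target column from the frame and returns it.
y_train = train.pop("target").values

print(len(train), len(val), len(test))
```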

## Question 3

- Calculate the mutual information with the binarized price for the categorical variable that we have. Use the training set only.
- What is the value of mutual information?
- Round it to 2 decimal digits using round(score, 2)

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 14 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   latitude                  20640 non-null  float64
 1   longitude                 20640 non-null  float64
 2   housing_median_age        20640 non-null  float64
 3   total_rooms               20640 non-null  float64
 4   total_bedrooms            20640 non-null  float64
 5   population                20640 non-null  float64
 6   households                20640 non-null  float64
 7   median_income             20640 non-null  float64
 8   median_house_value        20640 non-null  float64
 9   ocean_proximity           20640 non-null  object 
 10  rooms_per_household       20640 non-null  float64
 11  bedrooms_per_room         20640 non-null  float64
 12  population_per_household  20640 non-null  float64
 13  above_average             20640 non-null  int64  
dtypes: flo

In [24]:
from sklearn.metrics import mutual_info_score

In [25]:
round(mutual_info_score(df_full_train.above_average, df_full_train.ocean_proximity), 3)

0.102
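
Note that the cell above scores df_full_train and rounds to 3 digits, while the question asks for the training set only and 2 digits. Since above_average has already been removed from df_train, the training-set version would pair y_train with df_train.ocean_proximity. The sketch below shows that pattern on invented data; its score is not the homework answer.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

# Invented training split: the category fully determines the label.
train = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY"],
})
y = np.array([0, 0, 1, 1])

# Labels come from the separate target array, the feature from the frame.
score = round(mutual_info_score(y, train["ocean_proximity"]), 2)
print(score)
```

Here the category and the label carry identical information, so the score equals the entropy of a balanced binary variable, ln 2, which rounds to 0.69.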

## Question 4

- Now let's train a logistic regression.
- Remember that we have one categorical variable ocean_proximity in the data. Include it using one-hot encoding.
- Fit the model on the training dataset
 - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
 - model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits

In [26]:
from sklearn.feature_extraction import DictVectorizer

In [27]:
feat = list(df_train.columns)

In [28]:
train_dicts = df_train[feat].to_dict(orient='records')

In [29]:
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)

In [30]:
val_dicts = df_val[feat].to_dict(orient='records')
X_val = dv.transform(val_dicts)

In [31]:
from sklearn.linear_model import LogisticRegression

In [32]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [33]:
y_pred = model.predict_proba(X_val)[:,1]

In [34]:
price_decision = (y_pred >= 0.50)

In [35]:
model_accuracy = (y_val == price_decision).mean()
model_accuracy

0.8372093023255814
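
The manual accuracy above matches scikit-learn's `accuracy_score`, and the question asks for the result rounded to 2 digits. A small check on invented labels and probabilities:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented ground truth and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0])
proba = np.array([0.9, 0.4, 0.3, 0.5, 0.2])

# Threshold at 0.5, as in the cells above.
decision = (proba >= 0.5).astype(int)

manual = (y_true == decision).mean()
library = accuracy_score(y_true, decision)

print(round(manual, 2), round(library, 2))
```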

## Question 5

- Let's find the least useful feature using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

In [36]:
# Baseline model. Note: it is trained on the four candidate features only,
# not on the full feature list used in Q4.
useful_feature = ['total_rooms', 'total_bedrooms', 'population', 'households']
train_dicts_or = df_train[useful_feature].to_dict(orient='records')
dv_or = DictVectorizer(sparse=False)
# use the vectorizer created for this model rather than refitting dv from Q4
X_train_or = dv_or.fit_transform(train_dicts_or)
val_dicts_or = df_val[useful_feature].to_dict(orient='records')
X_val_or = dv_or.transform(val_dicts_or)
model_or = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model_or.fit(X_train_or, y_train)
y_pred_or = model_or.predict_proba(X_val_or)[:,1]
price_decision_or = (y_pred_or >= 0.50)
model_accuracy_or = (y_val == price_decision_or).mean()
model_accuracy_or

0.7095445736434108

In [37]:
useful_feature = ['total_rooms', 'total_bedrooms', 'population', 'households']
feature_dict = {}
for f in useful_feature:
    # all features from Q4 except the one being evaluated
    feat_c = feat.copy()
    feat_c.remove(f)
    train_dicts_q5 = df_train[feat_c].to_dict(orient='records')
    # fit a fresh vectorizer per feature set instead of reusing dv
    dv_q5 = DictVectorizer(sparse=False)
    X_train_q5 = dv_q5.fit_transform(train_dicts_q5)
    val_dicts_q5 = df_val[feat_c].to_dict(orient='records')
    X_val_q5 = dv_q5.transform(val_dicts_q5)
    model_q5 = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model_q5.fit(X_train_q5, y_train)
    y_pred_q5 = model_q5.predict_proba(X_val_q5)[:,1]
    price_decision_q5 = (y_pred_q5 >= 0.50)
    model_accuracy_q5 = (y_val == price_decision_q5).mean()
    # absolute difference between the baseline and the reduced model
    diff = abs(model_accuracy_or - model_accuracy_q5)
    feature_dict[f] = diff
    print('The feature', f,'has a difference of', diff)

The feature total_rooms has a difference of 0.12839147286821706
The feature total_bedrooms has a difference of 0.12669573643410859
The feature population has a difference of 0.11676356589147285
The feature households has a difference of 0.12306201550387597


In [38]:
min(feature_dict, key=feature_dict.get)

'population'
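
The question describes the baseline as a model trained on all features, whereas the baseline cell above uses only the four candidate features. The sketch below shows the all-features variant on synthetic data. The helper `fit_and_score` and the columns `signal` and `noise` are hypothetical, and the numbers it produces say nothing about the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def fit_and_score(train_df, y_tr, val_df, y_va, columns):
    """Train on the given columns and return validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(train_df[columns].to_dict(orient="records"))
    X_va = dv.transform(val_df[columns].to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=1.0,
                               max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    return (model.predict(X_va) == y_va).mean()

# Synthetic data: the label depends on "signal" only.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({"signal": rng.normal(size=n),
                    "noise": rng.normal(size=n)})
y = (toy["signal"] > 0).astype(int).values
train_df, val_df = toy.iloc[:300], toy.iloc[300:]
y_tr, y_va = y[:300], y[300:]

all_cols = ["signal", "noise"]

# Baseline: every feature included.
baseline = fit_and_score(train_df, y_tr, val_df, y_va, all_cols)

# Drop one feature at a time and record the change in accuracy.
diffs = {}
for col in all_cols:
    kept = [c for c in all_cols if c != col]
    diffs[col] = baseline - fit_and_score(train_df, y_tr, val_df, y_va, kept)

# The least useful feature is the one whose removal changes accuracy least.
least_useful = min(diffs, key=lambda c: abs(diffs[c]))
print(diffs, least_useful)
```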

## Question 6

- For this question, we'll see how to use a linear regression model from Scikit-Learn.
- We'll need to use the original column 'median_house_value'. Apply the logarithmic transformation to this column.
- Fit the Ridge regression model (model = Ridge(alpha=a, solver='sag', random_state=42)) on the training data.
- This model has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10]


In [39]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

In [40]:
del df['above_average']

In [41]:
df['median_house_value'] = np.log1p(df['median_house_value'])

In [42]:
df_full_train_lr, df_test_lr = train_test_split(df, test_size=0.2, random_state=42)
df_train_lr, df_val_lr = train_test_split(df_full_train_lr, test_size=0.25, random_state=42)
df_train_lr = df_train_lr.reset_index(drop=True)
df_val_lr = df_val_lr.reset_index(drop=True)
df_test_lr = df_test_lr.reset_index(drop=True)
y_train_lr = df_train_lr.median_house_value.values
y_val_lr = df_val_lr.median_house_value.values
y_test_lr = df_test_lr.median_house_value.values
del df_train_lr['median_house_value']
del df_val_lr['median_house_value']
del df_test_lr['median_house_value']

In [43]:
alpha_values = [0, 0.01, 0.1, 1, 10]
feat_lr = list(df_train_lr.columns)
train_dicts_lr = df_train_lr[feat_lr].to_dict(orient='records')
dv_lr = DictVectorizer(sparse=False)
X_train_lr = dv_lr.fit_transform(train_dicts_lr)
val_dicts_lr = df_val_lr[feat_lr].to_dict(orient='records')
X_val_lr = dv_lr.transform(val_dicts_lr)
rmse_dict = {}
for a in alpha_values:
    lin_reg = Ridge(alpha=a, solver='sag', random_state=42)
    lin_reg.fit(X_train_lr, y_train_lr)
    y_pred_lr = lin_reg.predict(X_val_lr)
    rmse_lr = np.sqrt(mean_squared_error(y_val_lr, y_pred_lr))
    rmse_dict[a] = rmse_lr
    print('For', a, 'the RMSE is', rmse_lr)

For 0 the RMSE is 0.524063570701514
For 0.01 the RMSE is 0.5240635707186291
For 0.1 the RMSE is 0.5240635708812071
For 1 the RMSE is 0.5240635725155536
For 10 the RMSE is 0.5240635888333284


In [44]:
min(rmse_dict, key=rmse_dict.get)

0
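
The RMSE values above differ only from the tenth decimal place onward, so the choice of alpha=0 rests on negligible differences. A sketch of a tie-aware selection follows, under the assumption that scores are compared after rounding to three decimals and that ties go to the smallest alpha. Neither rule is stated in the text above. The RMSE values are copied from the output of the previous cell.

```python
# RMSE per alpha, copied from the printed output above.
rmse_by_alpha = {
    0: 0.524063570701514,
    0.01: 0.5240635707186291,
    0.1: 0.5240635708812071,
    1: 0.5240635725155536,
    10: 0.5240635888333284,
}

# Assumption: compare scores after rounding to 3 decimals.
rounded = {a: round(v, 3) for a, v in rmse_by_alpha.items()}
best_rmse = min(rounded.values())

# Assumption: among tied alphas, prefer the smallest one.
best_alpha = min(a for a, v in rounded.items() if v == best_rmse)
print(best_rmse, best_alpha)
```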