# Logistic Regression Homework
## Dataset
In this homework, we will use the California Housing Prices dataset from <a href='https://www.kaggle.com/datasets/camnugent/california-housing-prices'>Kaggle</a>.

We'll keep working with the <code>'median_house_value'</code> variable, and we'll turn the problem into a classification task by binarizing it.

In [82]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

## Features

For the rest of the homework, you'll need to use only these columns:

* <code>'latitude'</code>,
* <code>'longitude'</code>,
* <code>'housing_median_age'</code>,
* <code>'total_rooms'</code>,
* <code>'total_bedrooms'</code>,
* <code>'population'</code>,
* <code>'households'</code>,
* <code>'median_income'</code>,
* <code>'median_house_value'</code>,
* <code>'ocean_proximity'</code>

## Data preparation

In [85]:
df = pd.read_csv('housing.csv')
df.head().T

Unnamed: 0,0,1,2,3,4
longitude,-122.23,-122.22,-122.24,-122.25,-122.25
latitude,37.88,37.86,37.85,37.85,37.85
housing_median_age,41.0,21.0,52.0,52.0,52.0
total_rooms,880.0,7099.0,1467.0,1274.0,1627.0
total_bedrooms,129.0,1106.0,190.0,235.0,280.0
population,322.0,2401.0,496.0,558.0,565.0
households,126.0,1138.0,177.0,219.0,259.0
median_income,8.3252,8.3014,7.2574,5.6431,3.8462
median_house_value,452600.0,358500.0,352100.0,341300.0,342200.0
ocean_proximity,NEAR BAY,NEAR BAY,NEAR BAY,NEAR BAY,NEAR BAY


* Select only the features from above and fill in the missing values with 0.

In [3]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [56]:
df = df.fillna(0)

In [5]:
df.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

* Create a new column <code>rooms_per_household</code> by dividing the column <code>total_rooms</code> by the column <code>households</code> of the dataframe.

In [86]:
df['rooms_per_household'] = df.total_rooms / df.households

* Create a new column <code>bedrooms_per_room</code> by dividing the column <code>total_bedrooms</code> by the column <code>total_rooms</code> of the dataframe.

In [87]:
df['bedrooms_per_room'] = df.total_bedrooms / df.total_rooms

* Create a new column <code>population_per_household</code> by dividing the column <code>population</code> by the column <code>households</code> of the dataframe.

In [88]:
df['population_per_household'] = df.population / df.households

In [9]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,rooms_per_household,bedrooms_per_room,population_per_household
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,6.984127,0.146591,2.555556
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,6.238137,0.155797,2.109842
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,8.288136,0.129516,2.80226
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,5.817352,0.184458,2.547945
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,6.281853,0.172096,2.181467
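The three ratio columns above follow the same pattern, so they can also be produced by one small helper. The sketch below is only an illustration: `add_ratio_features` and `ratio_specs` are hypothetical names, not part of the homework, and the toy frame reuses the first two rows shown in the output above.

```python
import pandas as pd

# First two rows of the housing table (values copied from the output above).
toy = pd.DataFrame({
    'total_rooms': [880.0, 7099.0],
    'total_bedrooms': [129.0, 1106.0],
    'population': [322.0, 2401.0],
    'households': [126.0, 1138.0],
})

# Hypothetical mapping: new column -> (numerator column, denominator column).
ratio_specs = {
    'rooms_per_household': ('total_rooms', 'households'),
    'bedrooms_per_room': ('total_bedrooms', 'total_rooms'),
    'population_per_household': ('population', 'households'),
}

def add_ratio_features(frame, specs):
    """Return a copy of `frame` with one ratio column per entry in `specs`."""
    out = frame.copy()
    for new_col, (numerator, denominator) in specs.items():
        out[new_col] = out[numerator] / out[denominator]
    return out

toy_ratios = add_ratio_features(toy, ratio_specs)
print(toy_ratios[list(ratio_specs)].round(6))
```

Working on a copy keeps the input frame unchanged, which makes the helper safe to call more than once.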


In [10]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity', 'rooms_per_household',
       'bedrooms_per_room', 'population_per_household'],
      dtype='object')

In [89]:
numerical = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'rooms_per_household', 'bedrooms_per_room', 'population_per_household']

In [90]:
categorical = ['ocean_proximity']

## Question 1
What is the most frequent observation (mode) for the column <code>ocean_proximity</code>?

Options:

* <code>NEAR BAY</code>
* <code><b><1H OCEAN</b></code>
* <code>INLAND</code>
* <code>NEAR OCEAN</code>

In [13]:
df.ocean_proximity.mode()

0    <1H OCEAN
dtype: object

In [14]:
df.ocean_proximity.value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

## Question 2
* Create the <a href='https://www.google.com/search?q=correlation+matrix'>correlation matrix</a> for the numerical features of your train dataset.
    * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* Which two features have the highest correlation in this dataset?

Options:

* <b><code>total_bedrooms</code> and <code>households</code></b>
* <code>total_bedrooms</code> and <code>total_rooms</code>
* <code>population</code> and <code>households</code>
* <code>population_per_household</code> and <code>total_rooms</code>

In [15]:
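# Restrict to numeric columns: ocean_proximity holds strings,
# and recent pandas versions raise an error when corr() meets them.
df.select_dtypes(include='number').corr()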

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,bedrooms_per_room,population_per_household
longitude,1.0,-0.924664,-0.108197,0.044568,0.068082,0.099773,0.05531,-0.015176,-0.045967,-0.02754,0.084836,0.002476
latitude,-0.924664,1.0,0.011173,-0.0361,-0.065318,-0.108785,-0.071035,-0.079809,-0.14416,0.106389,-0.104112,0.002366
housing_median_age,-0.108197,0.011173,1.0,-0.361262,-0.317063,-0.296244,-0.302916,-0.119034,0.105623,-0.153277,0.125396,0.013191
total_rooms,0.044568,-0.0361,-0.361262,1.0,0.920196,0.857126,0.918484,0.19805,0.134153,0.133798,-0.174583,-0.024581
total_bedrooms,0.068082,-0.065318,-0.317063,0.920196,1.0,0.866266,0.966507,-0.007295,0.049148,0.002717,0.122205,-0.028019
population,0.099773,-0.108785,-0.296244,0.857126,0.866266,1.0,0.907222,0.004834,-0.02465,-0.072213,0.031397,0.069863
households,0.05531,-0.071035,-0.302916,0.918484,0.966507,0.907222,1.0,0.013033,0.065843,-0.080598,0.059818,-0.027309
median_income,-0.015176,-0.079809,-0.119034,0.19805,-0.007295,0.004834,0.013033,1.0,0.688075,0.326895,-0.573836,0.018766
median_house_value,-0.045967,-0.14416,0.105623,0.134153,0.049148,-0.02465,0.065843,0.688075,1.0,0.151948,-0.238759,-0.023737
rooms_per_household,-0.02754,0.106389,-0.153277,0.133798,0.002717,-0.072213,-0.080598,0.326895,0.151948,1.0,-0.387465,-0.004852


In [16]:
df.select_dtypes(include='number').corr().style.background_gradient(cmap='coolwarm', axis=None)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,rooms_per_household,bedrooms_per_room,population_per_household
longitude,1.0,-0.924664,-0.108197,0.044568,0.068082,0.099773,0.05531,-0.015176,-0.045967,-0.02754,0.084836,0.002476
latitude,-0.924664,1.0,0.011173,-0.0361,-0.065318,-0.108785,-0.071035,-0.079809,-0.14416,0.106389,-0.104112,0.002366
housing_median_age,-0.108197,0.011173,1.0,-0.361262,-0.317063,-0.296244,-0.302916,-0.119034,0.105623,-0.153277,0.125396,0.013191
total_rooms,0.044568,-0.0361,-0.361262,1.0,0.920196,0.857126,0.918484,0.19805,0.134153,0.133798,-0.174583,-0.024581
total_bedrooms,0.068082,-0.065318,-0.317063,0.920196,1.0,0.866266,0.966507,-0.007295,0.049148,0.002717,0.122205,-0.028019
population,0.099773,-0.108785,-0.296244,0.857126,0.866266,1.0,0.907222,0.004834,-0.02465,-0.072213,0.031397,0.069863
households,0.05531,-0.071035,-0.302916,0.918484,0.966507,0.907222,1.0,0.013033,0.065843,-0.080598,0.059818,-0.027309
median_income,-0.015176,-0.079809,-0.119034,0.19805,-0.007295,0.004834,0.013033,1.0,0.688075,0.326895,-0.573836,0.018766
median_house_value,-0.045967,-0.14416,0.105623,0.134153,0.049148,-0.02465,0.065843,0.688075,1.0,0.151948,-0.238759,-0.023737
rooms_per_household,-0.02754,0.106389,-0.153277,0.133798,0.002717,-0.072213,-0.080598,0.326895,0.151948,1.0,-0.387465,-0.004852


In [17]:
df['total_bedrooms'].corr(df['households'])

0.9665072400420386

In [18]:
df['total_bedrooms'].corr(df['total_rooms'])

0.9201961721166259

In [19]:
df['population'].corr(df['households'])

0.9072222660959619

In [20]:
df['population_per_household'].corr(df['total_rooms'])

-0.024580658993988022
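Instead of checking each candidate pair by hand, the strongest pair can be read off the correlation matrix directly. The sketch below shows one possible way to do it on synthetic data. The columns `a`, `b`, `c` are made up for illustration; on the homework data the same steps would be applied to the numeric columns of `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: `b` is almost a copy of `a`, `c` is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so that every pair appears once
# and the diagonal (each feature with itself) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

best_pair = pairs.idxmax()
print(best_pair, round(pairs.max(), 3))
```

Taking the absolute value first means a strong negative correlation (such as the one between `longitude` and `latitude`) would also be considered; drop `.abs()` if only positive correlations matter.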

## Make median_house_value binary
* We need to turn the <code>median_house_value</code> variable from numeric into binary.
* Let's create a variable <code>above_average</code> which is <code>1</code> if the <code>median_house_value</code> is above its mean value and 0 otherwise.

In [91]:
df.loc[df['median_house_value'] > df['median_house_value'].mean(), 'above_average'] = 1
df['above_average'] = df['above_average'].fillna(0)
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,rooms_per_household,bedrooms_per_room,population_per_household,above_average
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,6.984127,0.146591,2.555556,1.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,6.238137,0.155797,2.109842,1.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,8.288136,0.129516,2.80226,1.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,5.817352,0.184458,2.547945,1.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,6.281853,0.172096,2.181467,1.0
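An equivalent way to build the binary target is to compare against the mean and cast the boolean result, which avoids the intermediate missing values. This is a minimal sketch on four prices taken from the tables in this notebook. Note that it yields integers (`1`/`0`) rather than the floats (`1.0`/`0.0`) produced by the cell above.

```python
import pandas as pd

# Four house values copied from the tables in this notebook.
prices = pd.Series([452600.0, 358500.0, 78100.0, 77100.0], name='median_house_value')

threshold = prices.mean()                      # mean of this toy sample only
above_average = (prices > threshold).astype(int)

print(threshold)
print(above_average.tolist())
```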


## Split the data
* Split your data into train/val/test sets, with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the <code>train_test_split</code> function) and set the seed to 42.
* Make sure that the target value (<code>median_house_value</code>) is not in your dataframe.

In [22]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.above_average.values
y_val = df_val.above_average.values
y_test = df_test.above_average.values

In [23]:
del df_train['above_average']
del df_val['above_average']
del df_test['above_average']
del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']

In [24]:
len(df_train), len(df_val), len(df_test)

(12384, 4128, 4128)

In [25]:
len(y_train), len(y_val), len(y_test)

(12384, 4128, 4128)
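The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data: 25% of 80% is 20% of the original. The sketch below checks this arithmetic on a stand-in array of 100 row indices (an assumption made to keep the example small).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(100)  # stand-in for 100 dataframe rows

# First split: hold out 20% of all rows as the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80 rows, i.e. 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```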

## Question 3
* Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
* What is the value of mutual information?
* Round it to 2 decimal digits using <code>round(score, 2)</code>

Options:

* 0.26
* 0
* <b>0.10</b>
* 0.16

In [26]:
round(mutual_info_score(y_train, df_train.ocean_proximity), 2)

0.1
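To see what this score measures, the mutual information can be recomputed from the joint distribution of the two variables: the sum of p(x, y) · log(p(x, y) / (p(x) · p(y))) over all non-empty cells, using the natural logarithm. The sketch below does this on a small made-up sample and compares the result with `mutual_info_score`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

# Made-up sample: binary target and a categorical feature.
labels = np.array([1, 1, 0, 0, 1, 0, 0, 1])
category = np.array(['NEAR BAY', 'NEAR BAY', 'INLAND', 'INLAND',
                     '<1H OCEAN', 'INLAND', '<1H OCEAN', 'NEAR BAY'])

# Joint probabilities p(label, category); normalize=True divides by the total count.
joint = pd.crosstab(labels, category, normalize=True).to_numpy()
p_label = joint.sum(axis=1, keepdims=True)   # marginal over categories, shape (2, 1)
p_cat = joint.sum(axis=0, keepdims=True)     # marginal over labels, shape (1, 3)
independent = p_label @ p_cat                # p(label) * p(category) for every cell

# Empty cells contribute nothing, so skip them to avoid log(0).
nonzero = joint > 0
manual_mi = float(np.sum(joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])))
sklearn_mi = mutual_info_score(labels, category)

print(round(manual_mi, 4), round(sklearn_mi, 4))
```

A score of 0 would mean the category carries no information about the target; larger values indicate a stronger dependency.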

## Question 4
* Now let's train a logistic regression model.
* Remember that we have one categorical variable <code>ocean_proximity</code> in the data. Include it using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - <code>model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)</code>
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

Options:

* 0.60
* 0.72
* <b>0.84</b>
* 0.95

In [27]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict) 

val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [28]:
train_dict[:3]

[{'ocean_proximity': '<1H OCEAN',
  'longitude': -119.67,
  'latitude': 34.43,
  'housing_median_age': 39.0,
  'total_rooms': 1467.0,
  'total_bedrooms': 381.0,
  'population': 1404.0,
  'households': 374.0,
  'median_income': 2.3681,
  'rooms_per_household': 3.9224598930481283,
  'bedrooms_per_room': 0.25971370143149286,
  'population_per_household': 3.7540106951871657},
 {'ocean_proximity': 'NEAR OCEAN',
  'longitude': -118.32,
  'latitude': 33.74,
  'housing_median_age': 24.0,
  'total_rooms': 6097.0,
  'total_bedrooms': 794.0,
  'population': 2248.0,
  'households': 806.0,
  'median_income': 10.1357,
  'rooms_per_household': 7.564516129032258,
  'bedrooms_per_room': 0.13022798097424962,
  'population_per_household': 2.7890818858560795},
 {'ocean_proximity': 'INLAND',
  'longitude': -121.62,
  'latitude': 39.13,
  'housing_median_age': 41.0,
  'total_rooms': 1317.0,
  'total_bedrooms': 309.0,
  'population': 856.0,
  'households': 337.0,
  'median_income': 1.6719,
  'rooms_per_house

In [29]:
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)

In [30]:
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [31]:
y_pred = model.predict_proba(X_val)[:, 1]

In [32]:
decision = (y_pred >= 0.5)
decision

array([False, False,  True, ...,  True,  True, False])

In [33]:
(y_val == decision).mean()

0.8359980620155039

## Question 5
* Let's find the least useful feature using the <i>feature elimination</i> technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
    - <b><code>total_rooms</code></b>
    - <code>total_bedrooms</code>
    - <code>population</code>
    - <code>households</code>

<quote>note: the difference doesn't have to be positive</quote>

In [34]:
dv = DictVectorizer(sparse=False)

dicts_full_train = df_full_train[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_rooms',
                       'total_bedrooms', 
                       'population',
                       'households',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_full_train = dv.fit_transform(dicts_full_train) 
y_full_train = df_full_train.above_average.values

In [35]:
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_full_train, y_full_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [36]:
# Build the test matrix from the same columns that were used for training
dicts_test = df_test[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_rooms',
                       'total_bedrooms', 
                       'population',
                       'households',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_test = dv.transform(dicts_test)

In [37]:
y_pred = model.predict_proba(X_test)[:, 1]
decision = (y_pred >= 0.5)
full_model_score = (decision == y_test).mean()
full_model_score

0.8343023255813954

In [38]:
# without total_rooms
dicts_full_train = df_full_train[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_bedrooms', 
                       'population',
                       'households',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_full_train = dv.fit_transform(dicts_full_train) 
y_full_train = df_full_train.above_average.values
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_full_train, y_full_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [39]:
dicts_test = df_test[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_bedrooms', 
                       'population',
                       'households',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_test = dv.transform(dicts_test)
y_pred = model.predict_proba(X_test)[:, 1]
decision = (y_pred >= 0.5)
model1_score = (decision == y_test).mean()
print('without total_rooms diff', full_model_score - model1_score)

without total_rooms diff 0.0007267441860465684


In [40]:
# without total_bedrooms
dicts_full_train = df_full_train[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_rooms',
                       'population',
                       'households',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_full_train = dv.fit_transform(dicts_full_train) 
y_full_train = df_full_train.above_average.values
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_full_train, y_full_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [41]:
dicts_test = df_test[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_rooms',
                       'population',
                       'households',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_test = dv.transform(dicts_test)
y_pred = model.predict_proba(X_test)[:, 1]
decision = (y_pred >= 0.5)
model2_score = (decision == y_test).mean()
print('without total_bedrooms diff', full_model_score - model2_score)

without total_bedrooms diff 0.001211240310077577


In [42]:
# without population
dicts_full_train = df_full_train[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_rooms',
                       'total_bedrooms', 
                       'households',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_full_train = dv.fit_transform(dicts_full_train) 
y_full_train = df_full_train.above_average.values
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_full_train, y_full_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [43]:
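# The test columns must match the training columns (everything except 'population').
# 'total_bedrooms' was missing from this list, so DictVectorizer filled it with zeros
# at test time; the difference recorded below predates this correction.
dicts_test = df_test[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_rooms',
                       'total_bedrooms', 
                       'households',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')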
X_test = dv.transform(dicts_test)
y_pred = model.predict_proba(X_test)[:, 1]
decision = (y_pred >= 0.5)
model3_score = (decision == y_test).mean()
print('without population diff', full_model_score - model3_score)

without population diff 0.08672480620155043


In [44]:
# without households
dicts_full_train = df_full_train[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_rooms',
                       'total_bedrooms', 
                       'population',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_full_train = dv.fit_transform(dicts_full_train) 
y_full_train = df_full_train.above_average.values
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_full_train, y_full_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')

In [45]:
dicts_test = df_test[['ocean_proximity', 
                       'longitude', 
                       'latitude', 
                       'housing_median_age', 
                       'total_rooms',
                       'total_bedrooms', 
                       'population',
                       'median_income',
                       'rooms_per_household',
                       'bedrooms_per_room',
                       'population_per_household'
                      ]].to_dict(orient='records')
X_test = dv.transform(dicts_test)
y_pred = model.predict_proba(X_test)[:, 1]
decision = (y_pred >= 0.5)
model4_score = (decision == y_test).mean()
print('without households diff', full_model_score - model4_score)

without households diff 0.006540697674418672
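Because every elimination step repeats the same code with one column removed, the procedure can be written as a loop. That also rules out mismatches between the training and evaluation column lists. The sketch below is one possible refactoring, not the code that produced the numbers above: `accuracy_with` is a hypothetical helper, and the data is synthetic (`signal` drives the target, `noise` and `region` do not).

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def accuracy_with(features, df_fit, y_fit, df_eval, y_eval):
    """Fit on df_fit[features] and return the accuracy on df_eval[features]."""
    dv = DictVectorizer(sparse=False)
    X_fit = dv.fit_transform(df_fit[features].to_dict(orient='records'))
    X_eval = dv.transform(df_eval[features].to_dict(orient='records'))
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_fit, y_fit)
    decision = model.predict_proba(X_eval)[:, 1] >= 0.5
    return (decision == y_eval).mean()

# Synthetic stand-in data (assumption): only `signal` is related to the target.
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    'signal': rng.normal(size=n),
    'noise': rng.normal(size=n),
    'region': rng.choice(['A', 'B'], size=n),
})
target = (toy['signal'] + 0.1 * rng.normal(size=n) > 0).astype(int).to_numpy()

toy_fit, toy_eval, y_fit, y_eval = train_test_split(
    toy, target, test_size=0.25, random_state=42)

all_features = ['signal', 'noise', 'region']
baseline = accuracy_with(all_features, toy_fit, y_fit, toy_eval, y_eval)

# Drop one feature at a time and record how much the accuracy changes.
diffs = {}
for feature in all_features:
    kept = [f for f in all_features if f != feature]
    diffs[feature] = baseline - accuracy_with(kept, toy_fit, y_fit, toy_eval, y_eval)

for feature, diff in diffs.items():
    print(f'without {feature}: diff = {diff:.4f}')
```

The same list of kept features is passed to both the fit and the evaluation step, so the two matrices always have matching columns.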


## Question 6
* For this question, we'll see how to use a linear regression model from Scikit-Learn.
* We'll need to use the original column <code>'median_house_value'</code>. Apply the logarithmic transformation to this column.
* Fit the Ridge regression model (<code>model = Ridge(alpha=a, solver="sag", random_state=42)</code>) on the training data.
* This model has a parameter <code>alpha</code>. Let's try the following values: <code>[0, 0.01, 0.1, 1, 10]</code>
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.
If there are multiple options, select the smallest <code>alpha</code>.

Options:

* <b>0</b>
* 0.01
* 0.1
* 1
* 10

In [60]:
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,rooms_per_household,bedrooms_per_room,population_per_household
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,6.984127,0.146591,2.555556
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,6.238137,0.155797,2.109842
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,8.288136,0.129516,2.802260
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,5.817352,0.184458,2.547945
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,6.281853,0.172096,2.181467
...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND,5.045455,0.224625,2.560606
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND,6.114035,0.215208,3.122807
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND,5.205543,0.215173,2.325635
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND,5.329513,0.219892,2.123209


In [92]:
df['median_house_value'] = np.log(df['median_house_value'])
df

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,rooms_per_household,bedrooms_per_room,population_per_household,above_average
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,13.022764,NEAR BAY,6.984127,0.146591,2.555556,1.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,12.789684,NEAR BAY,6.238137,0.155797,2.109842,1.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,12.771671,NEAR BAY,8.288136,0.129516,2.802260,1.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,12.740517,NEAR BAY,5.817352,0.184458,2.547945,1.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,12.743151,NEAR BAY,6.281853,0.172096,2.181467,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,11.265745,INLAND,5.045455,0.224625,2.560606,0.0
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,11.252859,INLAND,6.114035,0.215208,3.122807,0.0
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,11.432799,INLAND,5.205543,0.215173,2.325635,0.0
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,11.346871,INLAND,5.329513,0.219892,2.123209,0.0
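The logarithm compresses the long right tail of the prices, and the transformation is reversible with `np.exp`, so predictions made on the log scale can be mapped back to dollars. The sketch below illustrates this on three prices from the table above. `np.log1p` is shown only as a common alternative that also tolerates zeros; it is not what the cell above uses.

```python
import numpy as np

# Three house values copied from the tables above.
prices = np.array([452600.0, 358500.0, 78100.0])

log_prices = np.log(prices)          # transformation used in this notebook
recovered = np.exp(log_prices)       # inverse transformation

# Alternative (not used above): log(1 + x), inverted with np.expm1.
log1p_prices = np.log1p(prices)

print(log_prices.round(6))
print(recovered)
```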


In [93]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.median_house_value.values
y_val = df_val.median_house_value.values
y_test = df_test.median_house_value.values

In [94]:
del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']

In [95]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict) 

val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [96]:
model = Ridge(alpha=0, solver="sag", random_state=42)

In [97]:
model.fit(X_train, y_train)

Ridge(alpha=0, random_state=42, solver='sag')

In [69]:
def rmse(y, y_pred):
    se = (y - y_pred) ** 2
    mse = se.mean()
    return np.sqrt(mse)

In [98]:
y_pred = model.predict(X_val)

In [99]:
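# The `squared` argument was removed in newer Scikit-Learn releases;
# taking the square root of the MSE gives the same RMSE in any version.
np.sqrt(mean_squared_error(y_val, y_pred))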

0.524067160039661

In [109]:
round(rmse(y_val, y_pred), 3)

0.524

In [101]:
model = Ridge(alpha=0.01, solver="sag", random_state=42)

In [102]:
model.fit(X_train, y_train)

Ridge(alpha=0.01, random_state=42, solver='sag')

In [103]:
y_pred = model.predict(X_val)

In [110]:
round(rmse(y_val, y_pred),3)

0.524

In [111]:
model = Ridge(alpha=0.1, solver="sag", random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
round(rmse(y_val, y_pred), 3)

0.524

In [112]:
model = Ridge(alpha=1, solver="sag", random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
round(rmse(y_val, y_pred), 3)

0.524

In [113]:
model = Ridge(alpha=10, solver="sag", random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_val)
round(rmse(y_val, y_pred), 3)

0.524

In [108]:
np.sqrt(mean_squared_error(y_val, y_pred))

0.5240671781715663
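The five alpha experiments above can be condensed into one loop that also applies the tie-breaking rule from the question (pick the smallest `alpha` among those with the best rounded RMSE). The sketch below runs on synthetic regression data, so its scores are unrelated to the values recorded in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def rmse(y, y_pred):
    """Root mean squared error, same definition as in the cell above."""
    return np.sqrt(np.mean((y - y_pred) ** 2))

# Synthetic stand-in data (assumption): a linear signal plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=300)
X_fit, X_eval, y_fit, y_eval = train_test_split(X, y, test_size=0.25, random_state=42)

scores = {}
for a in [0, 0.01, 0.1, 1, 10]:
    model = Ridge(alpha=a, solver='sag', random_state=42)
    model.fit(X_fit, y_fit)
    scores[a] = round(rmse(y_eval, model.predict(X_eval)), 3)

# Best rounded RMSE first, then the smallest alpha among the ties.
best_rmse = min(scores.values())
best_alpha = min(a for a, s in scores.items() if s == best_rmse)

print(scores)
print('best alpha:', best_alpha)
```

Rounding before comparing matters: alphas whose scores differ only after the third decimal place are treated as ties, and the smallest one wins.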