## Dataset
In this homework, we will use the California Housing Prices data from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

Here's a wget-able [link](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv):

In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

--2022-09-25 16:38:53--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1423529 (1.4M) [text/plain]
Saving to: ‘housing.csv’


2022-09-25 16:38:55 (2.99 MB/s) - ‘housing.csv’ saved [1423529/1423529]



In [130]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.linear_model import LogisticRegression

In [3]:
df = pd.read_csv('housing.csv')
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


### Data preparation
- Select only the features from above and fill in the missing values with 0.
- Create a new column **rooms_per_household** by dividing **total_rooms** by **households**.
- Create a new column **bedrooms_per_room** by dividing **total_bedrooms** by **total_rooms**.
- Create a new column **population_per_household** by dividing **population** by **households**.

In [9]:
df = df.fillna(0)

In [10]:
df['rooms_per_household'] = df['total_rooms'] / df['households']
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']
df['population_per_household'] = df['population'] / df['households']

In [11]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,rooms_per_household,bedrooms_per_room,population_per_household
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,6.984127,0.146591,2.555556
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,6.238137,0.155797,2.109842
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,8.288136,0.129516,2.80226
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,5.817352,0.184458,2.547945
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,6.281853,0.172096,2.181467


## Question 1
What is the most frequent observation (mode) for the column ocean_proximity?

Options:

- NEAR BAY
- <1H OCEAN
- INLAND
- NEAR OCEAN


In [15]:
df['ocean_proximity'].mode()

0    <1H OCEAN
Name: ocean_proximity, dtype: object

## Question 2
- Create the correlation matrix for the numerical features of your train dataset.
    - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
- What are the two features that have the biggest correlation in this dataset?

Options:

- total_bedrooms and households
- total_bedrooms and total_rooms
- population and households
- population_per_household and total_rooms

Answer = total_bedrooms and households

In [37]:
# Get all numerical columns
numerical = df.select_dtypes(include=np.number).columns.tolist()
df[numerical].corrwith(df.households)

longitude                   0.055310
latitude                   -0.071035
housing_median_age         -0.302916
total_rooms                 0.918484
total_bedrooms              0.966507
population                  0.907222
households                  1.000000
median_income               0.013033
median_house_value          0.065843
rooms_per_household        -0.080598
bedrooms_per_room           0.059818
population_per_household   -0.027309
dtype: float64

In [38]:
df[numerical].corrwith(df.total_rooms)

longitude                   0.044568
latitude                   -0.036100
housing_median_age         -0.361262
total_rooms                 1.000000
total_bedrooms              0.920196
population                  0.857126
households                  0.918484
median_income               0.198050
median_house_value          0.134153
rooms_per_household         0.133798
bedrooms_per_room          -0.174583
population_per_household   -0.024581
dtype: float64
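
The two corrwith calls above inspect one column at a time. As a minimal sketch of scanning every pair at once, the snippet below builds a full correlation matrix on a small made-up frame (the column names a, b, c and their values are illustrative, not the housing data), masks the diagonal and the duplicated lower triangle, and picks the pair with the largest absolute correlation.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not the housing dataset)
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.1, 3.9, 6.2, 8.1, 9.8],   # roughly 2 * a
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# Keep only the upper triangle; k=1 also drops the diagonal of 1.0 values
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pair of columns with the largest absolute correlation
top_pair = pairs.abs().idxmax()
print(top_pair, pairs.loc[top_pair])
```

On the real data, the same steps could be applied to df[numerical].corr() in place of the toy frame.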

In [44]:
med_house_val_dicts = df['median_house_value'].to_dict()
med_house_val_dicts

{0: 452600.0,
 1: 358500.0,
 2: 352100.0,
 3: 341300.0,
 4: 342200.0,
 ...}

In [47]:
house_val_mean = df['median_house_value'].mean()

In [50]:
# True where the house value is above the mean, False otherwise
df['above_average'] = (df.median_house_value > house_val_mean)

In [52]:
# Convert the boolean flag to 1/0
df['above_average'] = df['above_average'].astype(int)

In [53]:
df['above_average']

0        1
1        1
2        1
3        1
4        1
        ..
20635    0
20636    0
20637    0
20638    0
20639    0
Name: above_average, Length: 20640, dtype: int64

### Split the data
- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value (median_house_value) is not in your dataframe.

In [56]:
# Note: taken before the split, so this Series holds all 20640 rows
y_train = df['median_house_value']

In [58]:
del df['median_house_value']

In [60]:
# Split into full-train (80%) and test (20%) sets
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [61]:
# Split the full-train set into train and validation sets.
# test_size is 0.25 because validation should be 20% of the whole dataset,
# and 20% of the whole is 20/80 = 0.25 of the full-train set.

df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [62]:
len(df_train), len(df_val), len(df_test)

(12384, 4128, 4128)

In [72]:
# Reset the shuffled indices so each split is numbered from 0
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

## Question 3
- Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
- What is the value of mutual information?
- Round it to 2 decimal digits using round(score, 2)

Options:

- 0.26
- 0
- 0.10
- 0.16


Answer = 0.10

In [83]:
# Remove any spaces from column names so they can be accessed as attributes
df_train.columns = df_train.columns.str.replace(' ', '')

In [89]:
score = mutual_info_score(df_train.above_average, df_train['ocean_proximity'])
score.round(2)

0.1
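
To make the score above less of a black box, here is a minimal sketch of what mutual information measures for two discrete variables, computed by hand on made-up labels. The function name mutual_info and the toy lists are illustrative only; Scikit-Learn's mutual_info_score reports the same quantity in nats (natural logarithm).

```python
from collections import Counter
from math import log

def mutual_info(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) combination
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Toy labels for illustration only
proximity = ['inland', 'inland', 'near_bay', 'near_bay']
above_avg = [0, 0, 1, 1]   # fully determined by proximity
unrelated = [0, 1, 0, 1]   # independent of proximity

print(round(mutual_info(proximity, above_avg), 2))   # 0.69, i.e. log(2)
print(round(mutual_info(proximity, unrelated), 2))   # 0.0
```

A score near 0 means the categorical variable says little about the target, and larger values mean knowing the category narrows down the target more.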

## Question 4
- Now let's train a logistic regression
- Remember that we have one categorical variable ocean_proximity in the data. Include it using one-hot encoding.
- Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

Options:

- 0.60
- 0.72
- 0.84
- 0.95

In [107]:
# Normalize the category labels before encoding: lowercase, spaces to underscores
df_train['ocean_proximity'] = df_train['ocean_proximity'].str.lower().str.replace(' ', '_')


In [114]:
df_train['ocean_proximity'] = df_train['ocean_proximity'].str.replace('<', 'less_than_')
df_train['ocean_proximity']

0        less_than_1h_ocean
1                near_ocean
2                    inland
3        less_than_1h_ocean
4                near_ocean
                ...        
12379    less_than_1h_ocean
12380                inland
12381    less_than_1h_ocean
12382    less_than_1h_ocean
12383    less_than_1h_ocean
Name: ocean_proximity, Length: 12384, dtype: object

In [127]:
train_dicts = df_train.to_dict(orient='records')

In [132]:
train_dicts[0]

{'longitude': -119.67,
 'latitude': 34.43,
 'housing_median_age': 39.0,
 'total_rooms': 1467.0,
 'total_bedrooms': 381.0,
 'population': 1404.0,
 'households': 374.0,
 'median_income': 2.3681,
 'ocean_proximity': 'less_than_1h_ocean',
 'rooms_per_household': 3.9224598930481283,
 'bedrooms_per_room': 0.25971370143149286,
 'population_per_household': 3.7540106951871657,
 'above_average': 1}

In [128]:
dv = DictVectorizer(sparse=False)
# fit and transform as one 
X_train = dv.fit_transform(train_dicts)

In [134]:
X_train.shape, y_train.shape

((12384, 17), (20640,))

In [138]:
model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)

model.fit(X_train, y_train)

ValueError: Found input variables with inconsistent numbers of samples: [12384, 20640]
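
The error arises because y_train holds all 20640 rows of median_house_value (it was taken before the split), while X_train holds only the 12384 training rows. The feature dictionaries also still contain above_average, which would leak the target into the model. Below is a minimal sketch of one way to line things up, shown on synthetic data so it runs on its own. The make_frame helper and its values are made up for illustration. The idea is to take the target from each split, drop it from the features, and only then vectorize and fit.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def make_frame(n=500, seed=42):
    """Synthetic stand-in for the housing frame (illustrative values only)."""
    rng = np.random.default_rng(seed)
    income = rng.uniform(1, 10, size=n)
    return pd.DataFrame({
        'median_income': income,
        'ocean_proximity': rng.choice(['inland', 'near_bay'], size=n),
        'above_average': (income > 5).astype(int),
    })

df = make_frame()
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

# Take the target from each split so its length matches that split's features
target = 'above_average'
y_train = df_train[target].values
y_val = df_val[target].values

# Drop the target before building feature dicts so it cannot leak into X
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.drop(columns=target).to_dict(orient='records'))
X_val = dv.transform(df_val.drop(columns=target).to_dict(orient='records'))

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

# Validation accuracy: share of correct predictions
accuracy = (model.predict(X_val) == y_val).mean()
print(X_train.shape, y_train.shape, round(accuracy, 2))
```

In the notebook itself, the equivalent change would be to build y_train and y_val from the above_average column of df_train and df_val, and to apply the same label cleanup to df_val before calling dv.transform.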