### **Dataset**

In this homework, we will use the California Housing Prices dataset from [Kaggle](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv).

The goal of this homework is to create a regression model for predicting housing prices (column 'median_house_value').

In [115]:
# Imports

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_text

In [102]:
url = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv"
df_initial = pd.read_csv(url)
df_initial.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [103]:
df_initial.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [104]:
df_initial.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [105]:
df_initial.fillna(0, inplace=True)
df_initial.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

### **Preparing the dataset**
For this homework, we only want to use a subset of the data. This is the same subset we used in homework #2.

First, keep only the records where ocean_proximity is either '<1H OCEAN' or 'INLAND'.

In [106]:
df_initial.ocean_proximity.unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [107]:
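# .copy() makes df an independent frame, so assigning to its columns later does not trigger SettingWithCopyWarning
df = df_initial[(df_initial.ocean_proximity == '<1H OCEAN') | (df_initial.ocean_proximity == 'INLAND')].copy()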
df.ocean_proximity.unique()

array(['<1H OCEAN', 'INLAND'], dtype=object)

Apply the log transform to median_house_value.

In [108]:
df.median_house_value.head()

701    431000.0
830    217000.0
859    247600.0
860    283500.0
861    216900.0
Name: median_house_value, dtype: float64

In [109]:
df.loc[:, 'median_house_value'] = np.log1p(df.median_house_value)
df.median_house_value.head()

701    12.973866
830    12.287657
859    12.419574
860    12.554971
861    12.287196
Name: median_house_value, dtype: float64

Do a train/validation/test split with a 60%/20%/20% distribution.

Use the train_test_split function and set the random_state parameter to 1.

In [111]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.median_house_value.values
y_val = df_val.median_house_value.values
y_test = df_test.median_house_value.values

del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']
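The second call uses test_size=0.25 because it splits the remaining 80% of the rows: 0.25 × 0.8 = 0.2 of the original data. A minimal check of that arithmetic, using a hypothetical array of 1000 row indices rather than the housing data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the dataframe: 1000 row indices.
rows = np.arange(1000)

# First split: hold out 20% of all rows as the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=1)
# Second split: 25% of the remaining 80% is 20% of the original rows.
train, val = train_test_split(full_train, test_size=0.25, random_state=1)

sizes = (len(train), len(val), len(test))
print(sizes)  # (600, 200, 200)
```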

Use DictVectorizer(sparse=True) to turn the dataframes into matrices.

In [113]:
features = ["longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households",
            "median_income", "ocean_proximity"]
train_dicts = df_train[features].to_dict(orient='records')
val_dicts = df_val[features].to_dict(orient='records')

In [114]:
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(train_dicts)
X_val = dv.transform(val_dicts)
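To see what the vectorizer does with a categorical column, here is a minimal sketch on two made-up records (the values and the `dv_demo` name are assumptions for illustration). Numeric fields pass through unchanged, while each distinct string value becomes its own 0/1 column named `feature=value`.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records with one numeric and one categorical field.
records = [
    {"median_income": 8.3, "ocean_proximity": "INLAND"},
    {"median_income": 3.8, "ocean_proximity": "<1H OCEAN"},
]

dv_demo = DictVectorizer(sparse=True)
X_demo = dv_demo.fit_transform(records)  # SciPy sparse matrix

feature_names = dv_demo.feature_names_   # sorted column names
dense = X_demo.toarray()                 # dense copy, only for inspection

print(feature_names)
print(dense)
```

This is why the tree output later refers to a column called `ocean_proximity=<1H OCEAN` rather than to `ocean_proximity` itself.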

### **Question 1**

Let's train a decision tree regressor to predict the median_house_value variable.

Train a model with max_depth=1.

Which feature is used for splitting the data?
- ocean_proximity
- total_rooms
- latitude
- population

In [116]:
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

In [117]:
print(export_text(dt, feature_names=dv.feature_names_))

|--- ocean_proximity=<1H OCEAN <= 0.50
|   |--- value: [11.61]
|--- ocean_proximity=<1H OCEAN >  0.50
|   |--- value: [12.30]
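Reading the output above: the split `<= 0.50` on a one-hot column separates rows where the column is 0 from rows where it is 1, and each leaf value is the mean of the (log-transformed) target in that leaf. The following sketch reproduces this behaviour on made-up toy data with a single binary feature; the data and the `stump` name are assumptions, not part of the homework.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up toy data: one binary feature and four log-scale targets.
X_toy = np.array([[0], [0], [1], [1]])
y_toy = np.array([11.0, 12.0, 12.0, 13.0])

stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, y_toy)

threshold = stump.tree_.threshold[0]  # midpoint between the feature values 0 and 1
pred = stump.predict([[0], [1]])      # each prediction is the mean target of its leaf
pred_dollars = np.expm1(pred)         # undo log1p to get back to the price scale

print(threshold, pred, pred_dollars)
```

Applied to the tree above, the same reading says that '<1H OCEAN' records fall into the leaf with the higher average log-price (12.30) and 'INLAND' records into the lower one (11.61).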



### Answer: ocean_proximity