## Homework 4 - Maciej Paczóski

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn import metrics
import warnings

warnings.filterwarnings("ignore")
import random
import eli5
from eli5.sklearn import PermutationImportance

In [2]:
df = pd.read_csv("housing.csv")
df.describe()

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0


### Data preprocessing

In [3]:
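# Drop rows with missing values; in the summary above only total_bedrooms
# has a lower count (20433 of 20640).
df = df.dropna()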

In [4]:
df["ocean_proximity"].unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [5]:
le = LabelEncoder()
df["ocean_proximity"] = le.fit_transform(df["ocean_proximity"])
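`LabelEncoder` assigns the categories integer codes in alphabetical order, so the models see an ordering that has no real meaning, and scikit-learn documents the class as intended for target labels rather than input features. The sketch below is a possible alternative, not what this notebook does: it one-hot encodes the column with `pd.get_dummies`. The toy frame and its `median_income` values are made up for illustration; only the column name and category labels come from the data.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are illustrative only).
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND", "INLAND"],
    "median_income": [8.3, 2.1, 3.0, 4.5],
})

# One indicator column per category instead of a single integer code.
encoded = pd.get_dummies(toy, columns=["ocean_proximity"], dtype=int)
print(encoded.columns.tolist())
```

With this encoding, permutation importance would be reported per indicator column rather than once for `ocean_proximity`, so the tables further down would not be directly comparable.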

In [6]:
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Model training 

In [7]:
modelGBR = GradientBoostingRegressor(random_state=0)
modelGBR.fit(X_train, y_train)
y_pred = modelGBR.predict(X_test)
print("GradientBoostingRegressor:  ", metrics.r2_score(y_test, y_pred))

GradientBoostingRegressor:   0.7705056681773399


In [8]:
modelRFR = RandomForestRegressor(random_state=0)
modelRFR.fit(X_train, y_train)
y_pred = modelRFR.predict(X_test)
print("RandomForestRegressor:  ", metrics.r2_score(y_test, y_pred))

RandomForestRegressor:   0.8142447370110542


In [9]:
modelMLPR = MLPRegressor(learning_rate_init=0.01, random_state=0)
modelMLPR.fit(X_train, y_train)
y_pred = modelMLPR.predict(X_test)
print("MLPRegressor:  ", metrics.r2_score(y_test, y_pred))

MLPRegressor:   0.6507014083261967


### Permutation importance

#### GradientBoostingRegressor

In [10]:
pm_gbr = PermutationImportance(modelGBR, random_state=0).fit(X_train, y_train)
eli5.show_weights(pm_gbr, feature_names=X_train.columns.tolist())

Weight,Feature
0.7184  ± 0.0052,median_income
0.4546  ± 0.0084,longitude
0.3930  ± 0.0065,latitude
0.1181  ± 0.0038,ocean_proximity
0.0853  ± 0.0038,population
0.0602  ± 0.0024,total_bedrooms
0.0342  ± 0.0032,housing_median_age
0.0130  ± 0.0008,households
0.0069  ± 0.0003,total_rooms


#### RandomForestRegressor

In [11]:
pm_rfr = PermutationImportance(modelRFR, random_state=0).fit(X_train, y_train)
eli5.show_weights(pm_rfr, feature_names=X_train.columns.tolist())

Weight,Feature
0.8113  ± 0.0130,median_income
0.6294  ± 0.0088,longitude
0.4539  ± 0.0058,latitude
0.2942  ± 0.0110,ocean_proximity
0.1180  ± 0.0049,housing_median_age
0.0724  ± 0.0017,population
0.0394  ± 0.0005,total_rooms
0.0378  ± 0.0016,total_bedrooms
0.0213  ± 0.0004,households


#### MLPRegressor

In [12]:
pm_mlpr = PermutationImportance(modelMLPR, random_state=1).fit(X_train, y_train)
eli5.show_weights(pm_mlpr, feature_names=X_train.columns.tolist())

Weight,Feature
2.0908  ± 0.0581,households
1.2646  ± 0.0120,population
1.1676  ± 0.0388,total_rooms
1.1415  ± 0.0151,median_income
0.5288  ± 0.0131,total_bedrooms
0.0726  ± 0.0014,housing_median_age
0.0284  ± 0.0017,latitude
0.0053  ± 0.0005,longitude
0.0034  ± 0.0007,ocean_proximity


The `GradientBoosting` and `RandomForest` models show very similar results, with `median_income` as the most important variable. Slightly less influential are the coordinates, `longitude` and `latitude`, followed by `ocean_proximity`. The other variables have low weights, so these models do not consider them important. The `MLP` neural network, however, gives nearly the opposite ranking, with `households`, `population` and `total_rooms` as the most important variables. Those features carry barely any weight in the first two models. Only `median_income` appears meaningful in all three models. We might expect that the `MLP` mostly differentiates between densely populated locations, such as the Los Angeles metropolitan area, and more rural areas, while the `GradientBoosting` and `RandomForest` models take more factors into account.
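To make the weights easier to read, here is a minimal pure-Python sketch of the idea behind permutation importance: shuffle one feature column, re-score the model, and record how much the score drops. It assumes the score is R², which is the default `score` of scikit-learn regressors. It is not eli5's actual implementation, and `permutation_importance_sketch`, `toy_predict` and the toy data are hypothetical names and values introduced only for this illustration.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance_sketch(predict, X, y, n_iter=5, seed=0):
    """Mean drop in R^2 after shuffling each feature column in turn."""
    rng = random.Random(seed)
    base = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_iter):
            # Shuffle column j only; all other columns stay untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = r2_score(y, [predict(row) for row in X_perm])
            drops.append(base - permuted)
        importances.append(sum(drops) / n_iter)
    return importances


def toy_predict(row):
    # Hypothetical model: depends on feature 0 and ignores feature 1.
    return 3.0 * row[0]


data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_predict(row) for row in X]

weights = permutation_importance_sketch(toy_predict, X, y)
print(weights)
```

Because R² can become negative once a column the model relies on is scrambled, the drop can exceed 1, which is consistent with the `MLP` weights above 1 in the table. A feature the model ignores gets a weight of zero.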