## Supervised Machine Learning

So far, we have used KNN as our model to predict California housing prices. However, there are other models worth exploring. Today, we will experiment with both simple Linear Regression and Decision Trees to see how well they explain our target variable. In machine learning, we typically choose a model based on the relationship between the features and the target variable, or simply by selecting the model with the highest score.

Yesterday, we applied some feature engineering techniques, and our model's performance improved. Now, let's see how Linear Regression and a Decision Tree perform when we apply the same techniques.

#### Loading and preparing the data

In [1]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

In [2]:
california = fetch_california_housing()
print(california["DESCR"])

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [3]:
df_cali = pd.DataFrame(california["data"], columns = california["feature_names"])
df_cali["median_house_value"] = california["target"]

df_cali.head()

|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


#### Normalization & Feature Selection

As we did in the Feature Engineering lesson, we are going to normalize our data and select a subset of columns as our features.

#### Train Test Split

In [4]:
#your code here
X_train, X_test, y_train, y_test = train_test_split(df_cali.drop(columns='median_house_value'), df_cali['median_house_value'], test_size=0.2, random_state=42)


Create an instance of the normalizer, fit it on the training data only, and transform both splits.

In [5]:
#your code here
scaler = MinMaxScaler()
scaler.fit(X_train)

X_train_norm, X_test_norm = scaler.transform(X_train), scaler.transform(X_test)
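
As a minimal sketch of what the scaler does, the example below uses a tiny made-up array (the `*_toy` names and values are hypothetical stand-ins, not the housing data). It checks that `MinMaxScaler` applies `(x - min) / (max - min)` per column, using the minimum and maximum learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in data (hypothetical values, not the housing dataset).
X_train_toy = np.array([[1.0, 100.0], [2.0, 300.0], [5.0, 200.0]])
X_test_toy = np.array([[3.0, 400.0]])

scaler_toy = MinMaxScaler()
scaler_toy.fit(X_train_toy)  # learns per-column min and max from the training rows only

# The same transform written by hand, using the training statistics.
col_min = X_train_toy.min(axis=0)
col_max = X_train_toy.max(axis=0)
manual_test = (X_test_toy - col_min) / (col_max - col_min)

scaled_train = scaler_toy.transform(X_train_toy)
scaled_test = scaler_toy.transform(X_test_toy)

print(scaled_train)  # every training column now spans [0, 1]
print(scaled_test)   # test values can fall outside [0, 1]
```

Because the statistics come from the training split alone, a test value larger than anything seen in training (400 in the second column here) maps above 1. That is expected and avoids leaking test information into the preprocessing step.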

## Linear Regression

Let's create two instances of the Linear Regression model: one for the original data and one for the normalized data.

In [6]:
#your code here
lr = LinearRegression()
lr_norm = LinearRegression()

Train Linear Regression on both the original and the normalized data

In [7]:
#your code here
lr.fit(X_train, y_train)
lr_norm.fit(X_train_norm, y_train)

Evaluate the models' performance. For regressors, `score` returns the coefficient of determination, R².

In [8]:
#your code here
print("Original:", lr.score(X_test, y_test))
print("Normalized:", lr_norm.score(X_test_norm, y_test))

Original: 0.5757877060324511
Normalized: 0.5757877060324512
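
The two scores above agree up to floating-point noise. That is what we would expect: ordinary least squares with an intercept gives the same predictions when each feature is rescaled by a linear transform such as min-max scaling. Only the coefficients change. The sketch below illustrates this on synthetic data (all names and values here are hypothetical stand-ins, not the housing data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic features on very different scales (assumed stand-in data).
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_toy = X_raw @ np.array([2.0, 0.1, -0.003]) + rng.normal(scale=0.5, size=200)

X_tr, X_te = X_raw[:150], X_raw[150:]
y_tr, y_te = y_toy[:150], y_toy[150:]

# Fit the scaler on the training rows only, as in the notebook.
mm = MinMaxScaler().fit(X_tr)

r2_raw = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
r2_scaled = LinearRegression().fit(mm.transform(X_tr), y_tr).score(mm.transform(X_te), y_te)

print("Original:  ", r2_raw)
print("Normalized:", r2_scaled)
```

Scaling still matters for distance-based models such as KNN and for regularized linear models, so this invariance should not be read as a general rule.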


In [None]:
import plotly.express as px

px.bar(y=lr.feature_names_in_, x=lr.coef_)

Linear Regression yields a worse score than our previous model, KNN.

In Linear Regression, we often assess feature importance by examining the coefficients in the model. These coefficients indicate the impact of each feature on the model's predictions.

- Determine the coefficients (β) in the linear regression equation corresponding to each feature.
- The magnitude of these coefficients reflects the relative importance of the features. **Greater absolute values suggest more substantial impacts.**

In [9]:
#your code here

We can conclude that **Median Income** has the highest impact in our model.

## Decision Tree

So far, between KNN and Linear Regression, KNN yields the better score. Let's see how a Decision Tree performs.

- Initialize a Decision Tree instance

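- Optionally set `max_depth` (for example, 10) to cap how many levels of splits the tree may have between the root and any leaf. The cell below keeps the default, `max_depth=None`, so the tree keeps splitting until its leaves are pure or too small to split.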
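
To see what a depth cap does in practice, the sketch below fits one capped and one uncapped tree on synthetic data (hypothetical stand-in values, not the housing data) and compares their size. `max_depth` bounds the length of the longest root-to-leaf path, not the total number of splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumed stand-in for the housing features).
rng = np.random.default_rng(2)
X_depth = rng.uniform(size=(2000, 4))
y_depth = np.sin(6 * X_depth[:, 0]) + X_depth[:, 1] + rng.normal(scale=0.1, size=2000)

tree_capped = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X_depth, y_depth)
tree_full = DecisionTreeRegressor(random_state=0).fit(X_depth, y_depth)

# A tree of depth d has at most 2**d leaves.
print("capped:  depth", tree_capped.get_depth(), "leaves", tree_capped.get_n_leaves())
print("uncapped: depth", tree_full.get_depth(), "leaves", tree_full.get_n_leaves())
```

An uncapped tree can grow close to one leaf per training row, which tends to overfit. Capping the depth is one common way to trade some training accuracy for better generalization.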

In [11]:
#your code here
dt = DecisionTreeRegressor()
dt_norm = DecisionTreeRegressor()

- Training the model

In [12]:
#your code here
dt.fit(X_train, y_train)
dt_norm.fit(X_train_norm, y_train)

- Evaluate the model

In [13]:
#your code here
print("Original:", dt.score(X_test, y_test))
print("Scaled:", dt_norm.score(X_test_norm, y_test))

Original: 0.617014267613832
Scaled: 0.6097898848011593
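
The small gap between the two scores above is probably not an effect of scaling. Tree splits depend only on the ordering of feature values, which min-max scaling preserves. A more likely cause is that the trees were created without `random_state`: scikit-learn permutes the features at each split, so ties between equally good splits can be broken differently from one fit to the next. The sketch below, on synthetic stand-in data, shows that fixing the seed makes repeated fits reproducible.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical values, not the housing dataset).
rng = np.random.default_rng(3)
X_seed = rng.uniform(0, 100, size=(500, 4))
y_seed = 0.05 * X_seed[:, 0] + np.sin(X_seed[:, 1] / 10) + rng.normal(scale=0.2, size=500)
X_fit, X_eval = X_seed[:400], X_seed[400:]
y_fit, y_eval = y_seed[:400], y_seed[400:]

# Two fits with the same seed on the same data give the same tree.
tree_a = DecisionTreeRegressor(random_state=42).fit(X_fit, y_fit)
tree_b = DecisionTreeRegressor(random_state=42).fit(X_fit, y_fit)
same_predictions = np.array_equal(tree_a.predict(X_eval), tree_b.predict(X_eval))
print("Reproducible:", same_predictions)

# With the seed fixed, compare raw and min-max scaled inputs.
mm_seed = MinMaxScaler().fit(X_fit)
tree_scaled = DecisionTreeRegressor(random_state=42).fit(mm_seed.transform(X_fit), y_fit)
print("Original:", tree_a.score(X_eval, y_eval))
print("Scaled:  ", tree_scaled.score(mm_seed.transform(X_eval), y_eval))
```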


As we did for Linear Regression, we can check which features are the most relevant.

In [14]:
#your code here     
px.bar(y=dt.feature_names_in_, x=dt.feature_importances_)

In [18]:
from sklearn.tree import export_text

tree_viz = export_text(dt, feature_names=dt.feature_names_in_)
print(tree_viz)


|--- MedInc <= 5.09
|   |--- MedInc <= 3.07
|   |   |--- AveRooms <= 4.31
|   |   |   |--- MedInc <= 2.21
|   |   |   |   |--- AveRooms <= 3.42
|   |   |   |   |   |--- AveBedrms <= 1.03
|   |   |   |   |   |   |--- Longitude <= -121.83
|   |   |   |   |   |   |   |--- Longitude <= -121.96
|   |   |   |   |   |   |   |   |--- AveOccup <= 4.20
|   |   |   |   |   |   |   |   |   |--- Population <= 1692.50
|   |   |   |   |   |   |   |   |   |   |--- AveBedrms <= 0.80
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- AveBedrms >  0.80
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 7
|   |   |   |   |   |   |   |   |   |--- Population >  1692.50
|   |   |   |   |   |   |   |   |   |   |--- Population <= 2290.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- Population >  2290.00
|   |   |   |   |   |   |   |   |   | 
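
One way to keep the text dump readable is the `max_depth` parameter of `export_text`, which limits only how much of the tree is printed and leaves the fitted model untouched. A minimal sketch on synthetic data follows (the feature names `feat_a` and `feat_b` are hypothetical).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data (assumed stand-in for the housing features).
rng = np.random.default_rng(5)
X_txt = rng.uniform(size=(300, 2))
y_txt = 4.0 * X_txt[:, 0] + X_txt[:, 1] + rng.normal(scale=0.1, size=300)

# An uncapped tree, as in the notebook.
tree_txt = DecisionTreeRegressor(random_state=0).fit(X_txt, y_txt)

# Print only the top levels; deeper subtrees are summarized as truncated branches.
short_report = export_text(tree_txt, feature_names=["feat_a", "feat_b"], max_depth=2)
print(short_report)
```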

The text output is a bit overwhelming, so let's use the graphviz library instead.

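**Note**: you will need to install graphviz with `pip install graphviz`. This installs only the Python bindings. The Graphviz executables (such as `dot`) must also be installed on your system and be on your PATH; otherwise rendering fails with an `ExecutableNotFound` error.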

- We will train a decision tree with max_depth=2 so that the diagram is small enough to read

In [15]:
!pip install graphviz



In [20]:
#your code here
from sklearn.tree import DecisionTreeRegressor, export_graphviz
import graphviz

tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X_train_norm, y_train)


# out_file=None makes export_graphviz return the DOT source as a string
dot_data = export_graphviz(tree, out_file=None, filled=True, rounded=True, feature_names=dt.feature_names_in_)

graphviz.Source(dot_data)

ExecutableNotFound: failed to execute WindowsPath('dot'), make sure the Graphviz executables are on your systems' PATH

<graphviz.sources.Source at 0x17fde8ee390>
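
If the Graphviz executables are not available, as the `ExecutableNotFound` error above suggests, one alternative is scikit-learn's `plot_tree`, which draws the tree with matplotlib and needs no external program. The sketch below uses synthetic data and hypothetical feature names (`feat_a`, `feat_b`). It is a swapped-in technique, not the graphviz approach used in the cell above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Synthetic data (assumed stand-in for the normalized housing features).
rng = np.random.default_rng(6)
X_plot = rng.uniform(size=(300, 2))
y_plot = 4.0 * X_plot[:, 0] + X_plot[:, 1] + rng.normal(scale=0.1, size=300)

# A shallow tree keeps the diagram readable.
small_tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_plot, y_plot)

# plot_tree draws onto the given axes and returns the node annotations.
fig, ax = plt.subplots(figsize=(10, 5))
annotations = plot_tree(
    small_tree,
    feature_names=["feat_a", "feat_b"],
    filled=True,
    rounded=True,
    ax=ax,
)
```

In a notebook the figure renders inline once the cell finishes. In a script, a call to `plt.show()` or `fig.savefig(...)` would be needed to see it.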