![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand ch2 of the Hands-On ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book, and feel free to ask about anything in the messenger group as you go along.**

 ## Good Luck : )

## First, run the following cells, which repeat the first part of the project, to continue your work

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [2]:
import os
import tarfile
import urllib.request  # urlretrieve lives in the urllib.request submodule
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)
   
fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model 

In [3]:
# CODE HERE
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [4]:
some_data = housing.iloc[:5]
some_data = full_pipeline.transform(some_data)
some_labels = housing_labels.iloc[:5]

In [5]:
# CODE HERE
predictions =  lin_reg.predict(some_data)
predictions.round(-2)

array([181700., 290600., 245000., 146500., 163200.])

In [6]:
some_labels

14196    103000.0
8267     382100.0
17445    172600.0
14265     93400.0
2271      96500.0
Name: median_house_value, dtype: float64
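To make the gap between these five predictions and their labels easier to read, the short sketch below computes the relative error of each prediction. The numbers are copied from the two outputs above; the variable names are only illustrative.

```python
# Predictions and labels copied from the outputs above.
predictions = [181700.0, 290600.0, 245000.0, 146500.0, 163200.0]
labels = [103000.0, 382100.0, 172600.0, 93400.0, 96500.0]

# Relative error: positive means the model overestimated the price.
relative_errors = [(p - y) / y for p, y in zip(predictions, labels)]

for p, y, e in zip(predictions, labels, relative_errors):
    print(f"predicted {p:>9.0f}  actual {y:>9.0f}  error {e:+.1%}")
```

Four of the five districts are overestimated by more than 40%, which already hints that the linear model is not fitting the training data well.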

# Measure this regression model’s RMSE on the whole training set 
* using Scikit-Learn’s mean_squared_error() function:

In [7]:
from sklearn.metrics import mean_squared_error

In [8]:
# CODE HERE
predicted = lin_reg.predict(housing_prepared)
lin_rmse = mean_squared_error(housing_labels, predicted, squared=False)
lin_rmse



67593.20745775253
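As a reminder of what this number means, here is a minimal sketch of the RMSE computed by hand on toy values (not the housing data). The helper name `rmse` is hypothetical. It may also help if the cell above fails: newer scikit-learn releases deprecate `squared=False` in favour of a separate `root_mean_squared_error` function, so which call works depends on the installed version.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy values, not taken from the housing data.
toy_labels = [100.0, 200.0, 300.0]
toy_predictions = [110.0, 190.0, 330.0]
toy_rmse = rmse(toy_labels, toy_predictions)
print(toy_rmse)
```

Because the errors are squared before averaging, the single error of 30 dominates the two errors of 10, which is why RMSE is sensitive to large mistakes.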

# Judge the RMSE result for this model 
Write down your answer 

Your answer goes here:

Clearly not a great score: most districts' median_house_value ranges between $120,000 and $265,000, so a
typical prediction error of $67,593 is really not very satisfying. This is an
example of a model underfitting the training data. When this happens it can
mean that the features do not provide enough information to make good
predictions, or that the model is not powerful enough.

# Let’s train a Decision Tree Regressor model 
## more powerful model

In [9]:
from sklearn.tree import DecisionTreeRegressor 

In [10]:
# CODE HERE
decison_tree_model = DecisionTreeRegressor()
decison_tree_model.fit(housing_prepared, housing_labels)

# Now evaluate the model on the training set 
* using Scikit-Learn’s mean_squared_error() function:

In [11]:
# CODE HERE
predicted = decison_tree_model.predict(housing_prepared)
mean_squared_error(housing_labels, predicted)

0.0

# Explain this result 
Write down your answer

Your answer goes here:

**A training error of exactly 0.0 does not mean the model is perfect; it is much more likely that the model has badly overfit the data.**
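A zero training error is what any model that memorizes its training data produces. The sketch below illustrates this with a hypothetical "memorizer" (a nearest-point lookup, not a decision tree) on synthetic noisy data: its error is exactly zero on the points it has seen and clearly non-zero on points it has not. All names and data here are assumptions made for illustration.

```python
import math
import random

random.seed(42)

def noisy_target(x):
    # Hypothetical ground truth: a line plus random noise.
    return 2.0 * x + random.gauss(0.0, 1.0)

train_x = [float(i) for i in range(0, 40, 2)]   # even points
valid_x = [float(i) for i in range(1, 40, 2)]   # odd points, never seen in training
train_y = [noisy_target(x) for x in train_x]
valid_y = [noisy_target(x) for x in valid_x]

memory = dict(zip(train_x, train_y))

def memorizer_predict(x):
    # Return the stored label of the closest training point.
    nearest = min(memory, key=lambda seen: abs(seen - x))
    return memory[nearest]

def rmse(xs, ys):
    errors = [(memorizer_predict(x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))

train_rmse = rmse(train_x, train_y)
valid_rmse = rmse(valid_x, valid_y)
print(train_rmse, valid_rmse)
```

This is why the training-set score alone cannot be trusted, and why the model has to be evaluated on data it was not trained on.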

# Evaluation Using Cross-Validation

1- Split the training set into 10 distinct subsets (folds), then train and evaluate the Decision Tree model 10 times, each time evaluating on a different fold and training on the other 9

In [12]:
from sklearn.model_selection import cross_val_score

In [13]:
# CODE HERE
tree_rmses = -cross_val_score(decison_tree_model, housing_prepared, housing_labels,
scoring="neg_root_mean_squared_error",
cv=10)
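The scores are negated because Scikit-Learn scorers follow a "greater is better" convention, so `neg_root_mean_squared_error` returns negative RMSE values. To show what `cv=10` does conceptually, here is a minimal sketch of K-fold splitting on plain index lists, assuming no shuffling. `k_fold_indices` is a hypothetical helper written for this illustration, not a Scikit-Learn function.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=23, n_folds=10))
for train, validation in folds:
    print(len(train), validation)
```

Every sample lands in exactly one validation fold, so each of the 10 scores is computed on data the model did not see during that round of training.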

2- Display the resulting scores and calculate their mean and standard deviation

In [14]:
# CODE HERE
pd.Series(tree_rmses).describe()

count       10.000000
mean     68621.992437
std       2378.172579
min      64627.268815
25%      67119.109519
50%      68432.035232
75%      69763.529471
max      73173.943133
dtype: float64

3- Repeat the same steps to compute the same scores for the Linear Regression model 

*notice the difference between the results of the two models*

In [15]:
# CODE HERE
lin_reg_rmses = -cross_val_score(lin_reg, housing_prepared, housing_labels,
scoring="neg_root_mean_squared_error",
cv=10)
pd.Series(lin_reg_rmses).describe()


count       10.000000
mean     67828.386774
std       2601.596761
min      65000.673826
25%      65472.168399
50%      67762.593108
75%      68849.373294
max      72739.875560
dtype: float64

## Let’s train one last model the RandomForestRegressor.

In [16]:
# CODE HERE
from sklearn.ensemble import RandomForestRegressor
forest_reg =  RandomForestRegressor(random_state=42)

# Repeat the same steps to compute the scores, and their mean and standard deviation, for the Random Forest model

In [17]:
# CODE HERE
forest_rmses = -cross_val_score(forest_reg, housing_prepared, housing_labels,
scoring="neg_root_mean_squared_error", cv=10)
pd.Series(forest_rmses).describe()

count       10.000000
mean     49684.860859
std       2034.377239
min      46989.971180
25%      47744.410115
50%      49771.948697
75%      51378.281587
max      52771.910639
dtype: float64
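The three cross-validation summaries can be put side by side with a few lines of code. The sketch below copies the mean and standard deviation from the outputs above and reports each model relative to Linear Regression; the dictionary layout is just one illustrative way to organise the numbers.

```python
# Mean and standard deviation of the 10 cross-validation RMSEs, copied from the outputs above.
cv_summary = {
    "DecisionTreeRegressor": (68621.992437, 2378.172579),
    "LinearRegression": (67828.386774, 2601.596761),
    "RandomForestRegressor": (49684.860859, 2034.377239),
}

best_name = min(cv_summary, key=lambda name: cv_summary[name][0])
linear_mean = cv_summary["LinearRegression"][0]

# Print the models from lowest to highest mean RMSE.
for name, (mean, std) in sorted(cv_summary.items(), key=lambda item: item[1][0]):
    change = (mean - linear_mean) / linear_mean
    print(f"{name:<22} mean={mean:>9.0f}  std={std:>6.0f}  vs linear: {change:+.1%}")
```

The Decision Tree, which scored 0.0 on the training set, is slightly worse than Linear Regression once it is cross-validated, while the Random Forest reduces the mean RMSE by roughly a quarter.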

# Save every model you experiment with 
*using the joblib library*

In [None]:
# CODE HERE
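One possible way to fill in this cell is sketched below with `joblib.dump` and `joblib.load`. To keep the example self-contained it saves a plain dictionary as a stand-in; in the notebook the object would be a fitted estimator such as `forest_reg` after calling `fit`. The file name and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import joblib

# Stand-in for a fitted estimator; in the notebook this would be e.g. forest_reg after fitting.
model_stand_in = {"name": "forest_reg", "params": {"n_estimators": 100, "random_state": 42}}

# A temporary directory is used here only so the sketch cleans up after itself.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "forest_reg.pkl")
    joblib.dump(model_stand_in, model_path)   # write the object to disk
    restored = joblib.load(model_path)        # read it back later

print(restored == model_stand_in)
```

Saving the cross-validation scores alongside each model makes it easier to compare experiments later without retraining.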


## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor 
*It may take a long time*

In [18]:
from sklearn.model_selection import GridSearchCV

In [19]:
# CODE HERE
param_grid = [
    {'n_estimators': [50, 100, 150], 'max_features': [4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [50, 100], 'max_features': [4, 6, 8]},
]

grid_search = GridSearchCV(forest_reg, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
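To see why the search can take a long time, the sketch below enumerates the combinations in the grid above by hand and counts how many models are trained during cross-validation. It only uses the standard library and mirrors the `param_grid` and `cv=3` from the cell above; it is an illustration of the counting, not of how GridSearchCV is implemented.

```python
from itertools import product

# Same grid as in the cell above.
param_grid = [
    {'n_estimators': [50, 100, 150], 'max_features': [4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [50, 100], 'max_features': [4, 6, 8]},
]
cv_folds = 3

combinations = []
for grid in param_grid:
    names = sorted(grid)
    # Cartesian product of the value lists within one dictionary.
    for values in product(*(grid[name] for name in names)):
        combinations.append(dict(zip(names, values)))

n_fits = len(combinations) * cv_folds
print(len(combinations), "combinations,", n_fits, "cross-validation fits")
```

With the default `refit=True`, GridSearchCV additionally retrains the best combination once on the whole training set.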


Display the hyperparameter combinations together with their evaluation scores

In [26]:
# CODE HERE
cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res = cv_res[["param_max_features","param_n_estimators" , "split0_test_score",
                 "split1_test_score", "split2_test_score", "mean_test_score"]]
cv_res.columns  = ["max_features", "n_estimators","split0", "split1", "split2", "mean_test_rmse"]
score_cols = ["split0", "split1", "split2", "mean_test_rmse"]
cv_res[score_cols] = -cv_res[score_cols].round().astype(np.int64)
cv_res.head()

| | max_features | n_estimators | split0 | split1 | split2 | mean_test_rmse |
|---|---|---|---|---|---|---|
| 12 | 6 | 100 | 48849 | 48505 | 49089 | 48814 |
| 10 | 4 | 100 | 48970 | 48442 | 49398 | 48937 |
| 11 | 6 | 50 | 49204 | 49038 | 49215 | 49152 |
| 9 | 4 | 50 | 49078 | 48921 | 49535 | 49178 |
| 14 | 8 | 100 | 49134 | 48874 | 49561 | 49190 |


# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [41]:
# CODE HERE
final_model = grid_search.best_estimator_
feature_importances = final_model.feature_importances_
feature_importances.round(2)

array([0.08, 0.07, 0.04, 0.02, 0.02, 0.02, 0.01, 0.33, 0.06, 0.11, 0.09,
       0.01, 0.15, 0.  , 0.  , 0.  ])

2-display these importance scores next to their corresponding attribute names:

In [57]:
cols1 = list(housing.columns)
cols1.remove('ocean_proximity')
# CombinedAttributesAdder() defaults to add_bedrooms_per_room=True, so it appends three columns;
# leaving one out makes zip() shift the remaining names and drop the last importance
cols2 = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
ocean_proximity_cats = full_pipeline.named_transformers_["cat"].categories_[0].tolist()

In [58]:
# CODE HERE
# cols1 = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
# cols2 = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
# ocean_proximity_cats = ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
sorted(zip(feature_importances,
           cols1+cols2+ocean_proximity_cats),
           reverse=True)

[(0.3252286130602134, 'median_income'),
 (0.14943895236718976, 'INLAND'),
 (0.1084642176918226, 'population_per_household'),
 (0.08958162287930965, 'bedrooms_per_room'),
 (0.07978416533845123, 'longitude'),
 (0.07100898304235313, 'latitude'),
 (0.05587641324823509, 'rooms_per_household'),
 (0.03975813873324921, 'housing_median_age'),
 (0.016366745095557664, 'population'),
 (0.016170057869631098, 'total_rooms'),
 (0.015472815906292396, 'total_bedrooms'),
 (0.014781108708672743, 'households'),
 (0.009707346802396094, '<1H OCEAN'),
 (0.00463035337852, 'NEAR OCEAN'),
 (0.0035176366202434705, 'NEAR BAY'),
 (0.00021282925786198826, 'ISLAND')]
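Note that `zip()` stops at the shortest input without raising an error, so a list of names that is one entry short silently shifts or drops importances. The `feature_importances_` array printed earlier has 16 entries, so the combined list of names needs 16 entries as well. The sketch below shows one defensive way to pair the two; `pair_importances` is a hypothetical helper and the values are toy numbers.

```python
def pair_importances(importances, names):
    """Pair each importance with its feature name, refusing mismatched lengths."""
    if len(importances) != len(names):
        raise ValueError(
            f"{len(importances)} importances but {len(names)} names")
    return sorted(zip(importances, names), reverse=True)

# Toy values, not taken from the model above.
toy_importances = [0.6, 0.3, 0.1]

print(pair_importances(toy_importances, ["a", "b", "c"]))

try:
    pair_importances(toy_importances, ["a", "b"])
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print("caught:", error)
```

On Python 3.10 or later, `zip(importances, names, strict=True)` performs the same length check directly.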

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [63]:
# CODE HERE
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

2-run your full_pipeline to transform the data

In [64]:
# CODE HERE
x_test_prepared = full_pipeline.transform(X_test)

3-evaluate the final model on the test set

In [66]:
# CODE HERE
final_predictions = final_model.predict(x_test_prepared)
final_rmse = mean_squared_error(y_test, final_predictions,
                                squared=False)
print(final_rmse)

48160.29567272085




array([[ 47997.  ,  47700.  ],
       [106705.  ,  45800.  ],
       [407127.32, 500001.  ],
       ...,
       [495679.95, 500001.  ],
       [ 70861.  ,  72300.  ],
       [173566.  , 151500.  ]])

In [68]:
# np.c_ stacks the predictions first and the labels second, so the column names follow that order
df = pd.DataFrame(np.c_[final_predictions, y_test], columns=['predicted_values', 'true_values'], index=X_test.index)
df['error'] = df['predicted_values'] - df['true_values']
df

| | predicted_values | true_values | error |
|---|---|---|---|
| 20046 | 47997.00 | 47700.0 | 297.00 |
| 3024 | 106705.00 | 45800.0 | 60905.00 |
| 15663 | 407127.32 | 500001.0 | -92873.68 |
| 20484 | 264892.01 | 218600.0 | 46292.01 |
| 9814 | 245948.00 | 278000.0 | -32052.00 |
| ... | ... | ... | ... |
| 15362 | 247185.00 | 263300.0 | -16115.00 |
| 16623 | 228781.00 | 266800.0 | -38019.00 |
| 18086 | 495679.95 | 500001.0 | -4321.05 |
| 2144 | 70861.00 | 72300.0 | -1439.00 |


# compute a 95% confidence interval for the generalization error 
*using scipy.stats.t.interval():*

In [70]:
from scipy import stats

In [73]:
# CODE HERE
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([45980.76598472, 50245.37111044])
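To see what goes into this interval, here is a minimal standard-library sketch of the same idea: the mean of the squared errors, plus or minus a critical value times the standard error, followed by a square root. As a stated simplification it uses the normal quantile instead of the Student's t quantile used by `stats.t.interval`; for a large test set the two are nearly identical. The helper name and the toy data are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev

def rmse_confidence_interval(squared_errors, confidence=0.95):
    """Approximate confidence interval for the RMSE from per-sample squared errors."""
    n = len(squared_errors)
    center = mean(squared_errors)
    standard_error = stdev(squared_errors) / math.sqrt(n)   # same quantity as scipy.stats.sem
    z = NormalDist().inv_cdf(0.5 + confidence / 2)           # about 1.96 for 95%
    low, high = center - z * standard_error, center + z * standard_error
    # Take the square root to go from mean squared error back to RMSE.
    return math.sqrt(max(low, 0.0)), math.sqrt(high)

# Toy squared errors, not taken from the test set.
toy_squared_errors = [4.0, 9.0, 16.0, 25.0, 9.0, 4.0, 16.0, 9.0]
low, high = rmse_confidence_interval(toy_squared_errors)
print(low, high)
```

The interval reported above, roughly 45,981 to 50,245, contains the test RMSE of 48,160, as expected, since the interval is centred on the mean squared error before the square root is taken.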

# Great Job!
# #shAI_Club