![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises for reviewing the second part of the end-to-end ML project.
**Make sure that you read and understand ch2 from the hands-on ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book and feel free to ask about it in the messenger group as you go along.**

## Good Luck : )

## First, run the following cell, which repeats the first part of the project, so you can continue your work

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [3]:
import os
import tarfile
import urllib.request  # urlretrieve lives in the urllib.request submodule
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

def load_housing_data(housing_path=HOUSING_PATH):
   csv_path = os.path.join(housing_path, "housing.csv")
   return pd.read_csv(csv_path)

fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)
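
To see what the `CombinedAttributesAdder` step in the cell above contributes, the ratio features can be reproduced on a tiny array. This is only a sketch: the two rows and the column positions (0 to 3) are made up for illustration and do not match the real column indices of the housing data.

```python
import numpy as np

# Hypothetical toy data: columns are total_rooms, total_bedrooms, population, households
X = np.array([[880.0, 129.0, 322.0, 126.0],
              [7099.0, 1106.0, 2401.0, 1138.0]])
rooms_ix, bedrooms_ix, population_ix, household_ix = 0, 1, 2, 3  # assumed positions

rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]

# np.c_ appends the three ratios as extra columns on the right
X_extended = np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
print(X_extended.shape)
```

The original four columns are kept, so the transformer only ever adds features.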

# 1- Select and Train a Model

# Let’s first train a LinearRegression model

In [4]:
# CODE HERE
from sklearn.linear_model import LinearRegression

linear_regression = LinearRegression()

linear_regression.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [5]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
preprocess_some_data = full_pipeline.transform(some_data)

In [6]:
# CODE HERE
predictions = linear_regression.predict(preprocess_some_data)
print("Predictions : ",predictions)
print("Labels (desired output) : ",list(some_labels))

Predictions :  [181746.54359616 290558.74973505 244957.50017771 146498.51061398
 163230.42393939]
Labels (desired output) :  [103000.0, 382100.0, 172600.0, 93400.0, 96500.0]


# Measure this regression model’s RMSE on the whole training set
* using Scikit-Learn’s mean_squared_error() function:

In [7]:
from sklearn.metrics import mean_squared_error

In [8]:
# CODE HERE
housing_predictions = linear_regression.predict(housing_prepared)
# mean_squared_error returns the MSE, so take the square root exactly once to get the RMSE
LR_MSE = mean_squared_error(housing_labels, housing_predictions)
LR_RMSE = np.sqrt(LR_MSE)
LR_RMSE

67593.2075

# Judge the RMSE result for this model
Write down your answer:
- The RMSE is high relative to the house prices being predicted, so the linear regression model underfits the training data and is not the best choice for this problem.
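
As a reminder of what the metric measures, RMSE is the square root of the average squared difference between predictions and labels. The sketch below applies that definition with plain NumPy to the five predictions and labels printed earlier in this notebook (rounded to two decimals). It only illustrates the formula and is not a replacement for `mean_squared_error()`.

```python
import numpy as np

# The five predictions and labels shown earlier, rounded to two decimals
predictions = np.array([181746.54, 290558.75, 244957.50, 146498.51, 163230.42])
labels = np.array([103000.0, 382100.0, 172600.0, 93400.0, 96500.0])

errors = predictions - labels
mse = np.mean(errors ** 2)   # mean squared error
rmse = np.sqrt(mse)          # back in the same unit as the labels
print(f"RMSE on these five instances: {rmse:.2f}")
```

Because the square root is taken once, the result is in the same unit as `median_house_value`, which makes it easy to compare against typical house prices.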

# Let’s train a Decision Tree Regressor model
## A more powerful model

In [9]:
from sklearn.tree import DecisionTreeRegressor

In [10]:
# CODE HERE
DTR = DecisionTreeRegressor()

DTR.fit(housing_prepared, housing_labels)

# Now evaluate the model on the training set
* using Scikit-Learn’s mean_squared_error() function:

In [11]:
# CODE HERE
housing_predictions = DTR.predict(housing_prepared)
DTR_MSE = mean_squared_error(housing_labels, housing_predictions)
DTR_RMSE = np.sqrt(DTR_MSE)  # RMSE is the square root of the MSE
DTR_RMSE

0.0

# Explain this result
Write down your answer:
- A training error of exactly 0 does not mean the model is perfect. It is much more likely that the Decision Tree has badly overfit the training data, which would hurt its performance on the test set.
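
The effect can be reproduced on synthetic data. This is a minimal sketch, assuming a noisy sine curve as a stand-in for the housing data: an unconstrained `DecisionTreeRegressor` memorizes the training points, so its training error collapses while its error on held-out data does not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy problem: a sine curve plus random noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# No depth limit: the tree can grow until every training point has its own leaf
tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print("Training RMSE:", train_rmse)
print("Validation RMSE:", val_rmse)
```

The gap between the two numbers is the reason the next section evaluates models with cross-validation instead of the training error.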

# Evaluation Using Cross-Validation

1- Split the training set into 10 distinct subsets, then train and evaluate the Decision Tree model

In [12]:
from sklearn.model_selection import cross_val_score

In [13]:
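# CODE HERE
# The scoring convention is "greater is better", so cross_val_score returns the negative MSE
scores = cross_val_score(DTR, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)

# Flip the sign before taking the square root to get RMSE values
DTR_RMSE_scores = np.sqrt(-scores)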

2- Display the resulting scores and calculate their mean and standard deviation

In [14]:
# CODE HERE
# Report the RMSE scores (rounded to one decimal), not the raw negative MSE values
print("Scores:", DTR_RMSE_scores.round(1))
print("Mean:", round(DTR_RMSE_scores.mean(), 1))
print("Standard deviation:", round(DTR_RMSE_scores.std(), 1))

Scores: [63772.9 69912.8 68432.4 70100.2 74244.  66192.8 67433.2 68255.5 65367.3
 71819.2]
Mean: 68553.0
Standard deviation: 2953.4


3- Repeat the same steps to compute the same scores for the Linear Regression model

*Notice the difference between the results of the two models*

In [15]:
# CODE HERE
scores = cross_val_score(linear_regression, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)

LR_RMSE_scores = np.sqrt(-scores)

# Report the RMSE scores (rounded to one decimal), not the raw negative MSE values
print("Scores:", LR_RMSE_scores.round(1))
print("Mean:", round(LR_RMSE_scores.mean(), 1))
print("Standard deviation:", round(LR_RMSE_scores.std(), 1))

Scores: [65000.7 70960.6 67122.6 66089.6 68402.5 65266.3 65218.8 68525.5 72739.9
 68957.3]
Mean: 67828.4
Standard deviation: 2468.1
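
What `cross_val_score` does with `cv=10` can be sketched by hand. The example below is an illustration under stated assumptions, not Scikit-Learn's implementation: the data is synthetic, the folds are a simple random partition, and an ordinary least-squares fit with `np.linalg.lstsq` stands in for the model.

```python
import numpy as np

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 10
folds = np.array_split(rng.permutation(len(X)), k)  # 10 disjoint index sets

rmse_scores = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit on the 9 training folds (least squares with an intercept column)
    A_train = np.c_[X[train_idx], np.ones(len(train_idx))]
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    # Evaluate on the single held-out fold
    A_val = np.c_[X[val_idx], np.ones(len(val_idx))]
    mse = np.mean((A_val @ coef - y[val_idx]) ** 2)
    rmse_scores.append(np.sqrt(mse))

rmse_scores = np.array(rmse_scores)
print("Mean:", rmse_scores.mean(), "Standard deviation:", rmse_scores.std())
```

Each instance is used for validation exactly once, which is why the ten scores give both an estimate of the error and a sense of how much that estimate varies.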


## Let’s train one last model the RandomForestRegressor.

In [16]:
# CODE HERE
from sklearn.ensemble import RandomForestRegressor

RFR = RandomForestRegressor(n_estimators=100, random_state=42)
RFR.fit(housing_prepared, housing_labels)

housing_predictions = RFR.predict(housing_prepared)
RFR_mse = mean_squared_error(housing_labels, housing_predictions)

RFR_rmse = np.sqrt(RFR_mse)
RFR_rmse

18527.322990316152

# Repeat the same steps to compute the cross-validation scores, with their mean and standard deviation, for the Random Forest model

In [17]:
# CODE HERE
RFR_score = cross_val_score(RFR,housing_prepared,housing_labels,scoring="neg_mean_squared_error",cv=10)
RFR_rmse_score = np.sqrt(-RFR_score)

print("Scores:", RFR_rmse_score)
print("Mean:", RFR_rmse_score.mean())
print("Standard deviation:", RFR_rmse_score.std())

Scores: [47341.96931397 51653.53070248 49360.29148883 51625.62777032
 52771.91063892 46989.97118038 47333.72603398 50636.24303693
 48951.73251683 50183.60590465]
Mean: 49684.86085873057
Standard deviation: 1929.9797084102233


# Save every model you experiment with
*using the joblib library*

In [18]:
# CODE HERE
import joblib
joblib.dump(linear_regression,'LinearRegressionModel.pkl')
joblib.dump(DTR,'DecisionTreeModel.pkl')
joblib.dump(RFR,'RandomForestModel.pkl')


['RandomForestModel.pkl']
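
A saved model is only useful if it can be loaded back later. The sketch below shows the round trip with `joblib.load()` on a tiny stand-in model; the toy data and the file name `toy_model.pkl` are assumptions made for this illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy model standing in for the models trained in this notebook
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")  # hypothetical file name
    joblib.dump(model, path)       # write the fitted model to disk
    reloaded = joblib.load(path)   # read it back as a new object

same = np.allclose(model.predict(X), reloaded.predict(X))
print("Reloaded model gives the same predictions:", same)
```

It is usually worth saving the cross-validation scores next to each model as well, so that models can be compared later without retraining.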

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## Evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor
*It may take a long time*

In [19]:
from sklearn.model_selection import GridSearchCV

In [20]:
# CODE HERE
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

RFR = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(RFR, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
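
To get a feel for why this can take a long time, the combinations in `param_grid` can be enumerated by hand. This is a plain-Python sketch of the counting only; it does not train anything and does not use Scikit-Learn's own grid expansion.

```python
from itertools import product

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
cv = 5  # number of folds used in the grid search

combinations = []
for grid in param_grid:
    names = list(grid)
    # Cartesian product of the value lists inside one dictionary
    for values in product(*(grid[name] for name in names)):
        combinations.append(dict(zip(names, values)))

print(len(combinations), "combinations,", len(combinations) * cv, "cross-validation fits")
```

The first dictionary contributes 3 × 4 = 12 combinations and the second 1 × 2 × 3 = 6, so the search evaluates 18 combinations with 5 folds each, that is 90 cross-validation fits.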

Display the evaluation score of the best hyperparameter combination

In [23]:
# CODE HERE
# best_score_ is a single negative MSE value, so report it as one RMSE
best_rmse = np.sqrt(-grid_search.best_score_)
print("Best RMSE:", best_rmse)

Best RMSE: 49672.50940389753
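
A fitted `GridSearchCV` also keeps the mean score of every combination in its `cv_results_` attribute, which helps to see how close the runner-up settings are. The sketch below runs a deliberately tiny search on synthetic data so that it finishes quickly; the data and the grid are assumptions for illustration, and the same loop could be applied to `grid_search` from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for housing_prepared / housing_labels
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

toy_grid = {'n_estimators': [3, 10], 'max_features': [2, 4]}  # deliberately tiny grid
toy_search = GridSearchCV(RandomForestRegressor(random_state=42), toy_grid, cv=3,
                          scoring='neg_mean_squared_error')
toy_search.fit(X, y)

# One row per hyperparameter combination: mean RMSE across folds, then the parameters
cvres = toy_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
print("Best:", toy_search.best_params_)
```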


# Analyze the Best Models and Their Errors
1- Indicate the relative importance of each attribute

In [27]:
# CODE HERE
feature_importances = grid_search.best_estimator_.feature_importances_
print(feature_importances)


[6.84493392e-02 6.49131340e-02 4.17428333e-02 1.45158216e-02
 1.37060650e-02 1.43001651e-02 1.29591331e-02 3.71833888e-01
 4.94502910e-02 1.09758357e-01 6.11769498e-02 7.39554036e-03
 1.65012599e-01 2.28668090e-04 1.83994495e-03 2.71727020e-03]


2- Display these importance scores next to their corresponding attribute names:

In [36]:
# CODE HERE
custom_attr = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
original_cat = full_pipeline.named_transformers_["cat"]

cat_one_hot = list(original_cat.categories_[0])
attributes = num_attribs + custom_attr + cat_one_hot

sorted(zip(feature_importances, attributes), reverse=True)

[(0.37183388814667373, 'median_income'),
 (0.16501259905352061, 'INLAND'),
 (0.1097583570805625, 'population_per_household'),
 (0.06844933917056337, 'longitude'),
 (0.06491313404989336, 'latitude'),
 (0.061176949819538154, 'bedrooms_per_room'),
 (0.04945029095965893, 'rooms_per_household'),
 (0.0417428332867604, 'housing_median_age'),
 (0.014515821649955954, 'total_rooms'),
 (0.014300165080528202, 'population'),
 (0.013706064997348836, 'total_bedrooms'),
 (0.012959133102106802, 'households'),
 (0.0073955403637295455, '<1H OCEAN'),
 (0.0027172702007613975, 'NEAR OCEAN'),
 (0.0018399449484004105, 'NEAR BAY'),
 (0.00022866808999787155, 'ISLAND')]

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1- Get the predictors and the labels from your test set

In [38]:
# CODE HERE
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()
final_model = grid_search.best_estimator_

2- Run your full_pipeline to transform the data

In [40]:
# CODE HERE
X_test_prepared = full_pipeline.transform(X_test)
predictions = final_model.predict(X_test_prepared)

3- Evaluate the final model on the test set

In [44]:
# CODE HERE
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
rmse

49198.020631676336

# Compute a 95% confidence interval for the generalization error
*using scipy.stats.t.interval():*

In [46]:
from scipy import stats

In [47]:
# CODE HERE
confidence = 0.95
squared_errors = (predictions - y_test) ** 2
# t-interval on the mean squared error, then a square root to express it as an RMSE range
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))

array([46948.10215126, 51349.4515311 ])
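
The interval can also be built by hand, which makes the steps visible. The sketch below swaps in a normal (z-score) approximation instead of the t-distribution used above; with thousands of test instances the two should be very close. The error values are synthetic and only stand in for `predictions - y_test`.

```python
from statistics import NormalDist

import numpy as np

# Hypothetical prediction errors standing in for (predictions - y_test)
rng = np.random.default_rng(42)
errors = rng.normal(scale=50000.0, size=4000)
squared_errors = errors ** 2

confidence = 0.95
z = NormalDist().inv_cdf((1 + confidence) / 2)   # two-sided critical value, about 1.96
mean = squared_errors.mean()
# Standard error of the mean, using the sample standard deviation
sem = squared_errors.std(ddof=1) / np.sqrt(len(squared_errors))

# Interval on the mean squared error, converted to RMSE with a square root
lower, upper = np.sqrt(mean - z * sem), np.sqrt(mean + z * sem)
print(lower, upper)
```

The interval is computed on the squared errors first because the mean is taken over them; the square root is applied only at the end.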

# Great Job!
# #shAI_Club