![](logo1.jpg)

# **shAI Training 2023 | Level 1**

## Task #8 (End-to-End ML Project {part_2})

## Welcome to the exercises reviewing the second part of the end-to-end ML project.
**Make sure you read and understand ch2 of the Hands-On ML book (page 72 to the end of the chapter) before starting this notebook.**

**If you get stuck on anything, reread that part of the book, and feel free to ask in the messenger group as you go along.**

 ## Good Luck : )

## First, run the following cells from the first part of the project to continue your work

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns
from sklearn.model_selection import train_test_split
from pandas.plotting import scatter_matrix
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

In [2]:
import os
import tarfile
import urllib.request  # the request submodule must be imported explicitly for urlretrieve
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
    
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
   
fetch_housing_data()
housing = load_housing_data()

rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
housing = train_set.drop("median_house_value", axis=1)
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
 ('imputer', SimpleImputer(strategy="median")),
 ('attribs_adder', CombinedAttributesAdder()),
 ('std_scaler', StandardScaler())])

full_pipeline = ColumnTransformer([
 ("num", num_pipeline, num_attribs),
 ("cat", OneHotEncoder(), cat_attribs)])

housing_prepared = full_pipeline.fit_transform(housing)

# 1- Select and Train a Model

# Let’s first train a LinearRegression model 

In [9]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# First try it out on a few instances from the training set:


In [10]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

In [11]:
some_data_prepared = full_pipeline.transform(some_data)

predictions = lin_reg.predict(some_data_prepared)

for prediction, label in zip(predictions, some_labels):
    print("Predicted price:", prediction, "\tActual price:", label)

Predicted price: 181746.54359616214 	Actual price: 103000.0
Predicted price: 290558.7497350538 	Actual price: 382100.0
Predicted price: 244957.5001777069 	Actual price: 172600.0
Predicted price: 146498.51061398484 	Actual price: 93400.0
Predicted price: 163230.4239393943 	Actual price: 96500.0


# Measure this regression model’s RMSE on the whole training set 
* using Scikit-Learn’s mean_squared_error() function:

In [12]:
from sklearn.metrics import mean_squared_error

In [13]:
housing_predictions = lin_reg.predict(housing_prepared)

# Compute RMSE
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print("Linear Regression RMSE on training set:", lin_rmse)

Linear Regression RMSE on training set: 67593.20745775253


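# Judge the RMSE result for this model 
Write down your answer 

An RMSE of approximately 67,593 means the model's typical prediction error on the training set is about 67,600. RMSE is the square root of the mean squared error, so it weights large errors more heavily than a plain average of absolute errors. The five sample predictions above are each off by more than 50,000, which is consistent with this figure. An error this large on the very data the model was trained on suggests the model is underfitting: either the features do not carry enough information or the model is not powerful enough.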
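
To make the metric concrete, the sketch below recomputes RMSE by hand on the five sample predictions printed earlier (values copied and rounded from that output) and compares it with the mean absolute error. The variable names are illustrative, and five instances are far too few to judge the model; the point is only to show how RMSE is formed.

```python
import numpy as np

# Predicted and actual prices copied (rounded) from the sample output above
predicted = np.array([181746.54, 290558.75, 244957.50, 146498.51, 163230.42])
actual = np.array([103000.0, 382100.0, 172600.0, 93400.0, 96500.0])

errors = predicted - actual

# RMSE: square the errors, average them, then take the square root
sample_rmse = np.sqrt(np.mean(errors ** 2))

# MAE: average of the absolute errors, shown for comparison
sample_mae = np.mean(np.abs(errors))

print("RMSE on the five samples:", sample_rmse)
print("MAE on the five samples:", sample_mae)
```

RMSE is never smaller than MAE, and the gap between them grows when a few errors are much larger than the rest.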

# Let’s train a Decision Tree Regressor model 
## more powerful model

In [14]:
from sklearn.tree import DecisionTreeRegressor 

In [15]:
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

# Now evaluate the model on the training set 
* using Scikit-Learn’s mean_squared_error() function:

In [16]:
housing_predictions = tree_reg.predict(housing_prepared)

tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print("Decision Tree Regressor RMSE on training set:", tree_rmse)

Decision Tree Regressor RMSE on training set: 0.0


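# Explain this result 
Write down your answer

An RMSE of 0.0 on the training set means the tree reproduces every training label exactly. It is far more likely that the model has badly overfit the training data (memorizing it) than that it is genuinely perfect, so this score says nothing about how well it will generalize. Evaluating the model on data it has not seen, for example with cross-validation, is needed to find out.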
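
A minimal sketch of this effect on synthetic data (the data, seed, and names are made up for illustration and are not the housing set): an unconstrained tree reaches zero error on the rows it was trained on, while its error on held-out rows is clearly larger.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a linear signal plus noise (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(0, 1, size=300)

X_fit, X_held, y_fit, y_held = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# No depth limit, so the tree can keep splitting until every leaf is pure
demo_tree = DecisionTreeRegressor(random_state=42)
demo_tree.fit(X_fit, y_fit)

fit_rmse = np.sqrt(mean_squared_error(y_fit, demo_tree.predict(X_fit)))
held_rmse = np.sqrt(mean_squared_error(y_held, demo_tree.predict(X_held)))

print("RMSE on the rows used for fitting:", fit_rmse)
print("RMSE on held-out rows:", held_rmse)
```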

# Evaluation Using Cross-Validation

1- split the training set into 10 distinct subsets (folds), then train and evaluate the Decision Tree model 10 times, each time holding out a different fold

In [17]:
from sklearn.model_selection import cross_val_score

In [18]:
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)

tree_rmse_scores = np.sqrt(-scores)

2- display the resulting scores and calculate their mean and standard deviation

In [19]:
print("Decision Tree Regressor RMSE scores:", tree_rmse_scores)

mean_rmse = tree_rmse_scores.mean()
std_rmse = tree_rmse_scores.std()

print("Mean RMSE:", mean_rmse)
print("Standard deviation of RMSE:", std_rmse)

Decision Tree Regressor RMSE scores: [65312.86044031 70581.69865676 67849.75809965 71460.33789358
 74035.29744574 65562.42978503 67964.10942543 69102.89388457
 66876.66473025 69735.84760006]
Mean RMSE: 68848.18979613911
Standard deviation of RMSE: 2579.6785558576307


3- repeat the same steps to compute the same scores for the Linear Regression model

*notice the difference between the results of the two models*

In [20]:
# Reuse lin_reg from above: cross_val_score fits clones of the estimator,
# so the fitted lin_reg stays available for saving later

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)

lin_rmse_scores = np.sqrt(-lin_scores)

print("Linear Regression RMSE scores:", lin_rmse_scores)

mean_rmse_lin = lin_rmse_scores.mean()
std_rmse_lin = lin_rmse_scores.std()

print("Mean RMSE for Linear Regression:", mean_rmse_lin)
print("Standard deviation of RMSE for Linear Regression:", std_rmse_lin)

Linear Regression RMSE scores: [65014.40855412 70960.56056304 67122.63935124 66089.63153865
 68402.54686442 65266.34735288 65218.78174481 68525.46981754
 72739.87555996 68957.34111906]
Mean RMSE for Linear Regression: 67829.76024657098
Standard deviation of RMSE for Linear Regression: 2466.5207354937706
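
What `cross_val_score` does with `cv=10` can be sketched by hand with `KFold`. The example uses synthetic data and illustrative names, and assumes the default behaviour for regressors, where an integer `cv` means unshuffled K-fold splitting. It also shows why the scores are negated: Scikit-Learn scoring functions follow a greater-is-better convention, so the MSE is reported with its sign flipped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for housing_prepared / housing_labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 1.0]) + rng.normal(0, 0.5, size=200)

manual_rmse_scores = []
for fit_idx, val_idx in KFold(n_splits=10).split(X_demo):
    # Train on nine folds, evaluate on the one fold left out
    fold_model = LinearRegression().fit(X_demo[fit_idx], y_demo[fit_idx])
    fold_mse = mean_squared_error(y_demo[val_idx], fold_model.predict(X_demo[val_idx]))
    manual_rmse_scores.append(np.sqrt(fold_mse))
manual_rmse_scores = np.array(manual_rmse_scores)

# Same evaluation through cross_val_score; scores come back as negative MSE
neg_mse_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                                 scoring="neg_mean_squared_error", cv=10)
library_rmse_scores = np.sqrt(-neg_mse_scores)

print("Manual fold RMSEs: ", manual_rmse_scores)
print("Library fold RMSEs:", library_rmse_scores)
```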


## Let’s train one last model the RandomForestRegressor.

In [22]:
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(random_state=42)

forest_reg.fit(housing_prepared, housing_labels)

# Repeat the same steps to compute the scores, with their mean and standard deviation, for the Random Forest model

In [23]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)

forest_rmse_scores = np.sqrt(-forest_scores)

print("Random Forest Regressor RMSE scores:", forest_rmse_scores)

mean_rmse_forest = forest_rmse_scores.mean()
std_rmse_forest = forest_rmse_scores.std()

print("Mean RMSE for Random Forest Regressor:", mean_rmse_forest)
print("Standard deviation of RMSE for Random Forest Regressor:", std_rmse_forest)

Random Forest Regressor RMSE scores: [47341.96931397 51653.53070248 49360.29148883 51625.62777032
 52771.91063892 46989.97118038 47333.72603398 50636.24303693
 48951.73251683 50183.60590465]
Mean RMSE for Random Forest Regressor: 49684.86085873057
Standard deviation of RMSE for Random Forest Regressor: 1929.9797084102233
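
Putting the three cross-validation results side by side makes the comparison easier. The sketch below only reuses the mean and standard deviation values printed above (rounded); the dictionary and variable names are illustrative.

```python
# Mean and standard deviation of the 10-fold RMSE scores, rounded from the outputs above
cv_summary = {
    "Linear Regression": (67829.76, 2466.52),
    "Decision Tree": (68848.19, 2579.68),
    "Random Forest": (49684.86, 1929.98),
}

for name, (mean_rmse, std_rmse) in cv_summary.items():
    print(f"{name:<18} mean RMSE = {mean_rmse:>9.2f}  std = {std_rmse:>8.2f}")

# Pick the model with the lowest mean RMSE
best_name = min(cv_summary, key=lambda name: cv_summary[name][0])

# Relative reduction in mean RMSE compared with the linear baseline
baseline_mean = cv_summary["Linear Regression"][0]
best_mean = cv_summary[best_name][0]
relative_reduction = 1 - best_mean / baseline_mean

print(f"Lowest mean RMSE: {best_name} ({relative_reduction:.1%} below Linear Regression)")
```

The Random Forest's cross-validated error is clearly lower than that of the other two models, and the Decision Tree, despite its 0.0 training RMSE, does slightly worse than Linear Regression under cross-validation.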


# Save every model you experiment with 
*using the joblib library*

In [24]:
from joblib import dump

dump(lin_reg, 'linear_regression_model.joblib')

dump(tree_reg, 'decision_tree_regressor_model.joblib')

dump(forest_reg, 'random_forest_regressor_model.joblib')

['random_forest_regressor_model.joblib']
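
A saved model can be read back with `joblib.load`. The sketch below is self-contained: it fits a small model on synthetic data, saves it to a temporary directory (the file name `demo_model.joblib` is a placeholder), reloads it, and checks that the reloaded model gives the same predictions.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Small synthetic problem, used only to have a fitted model to save
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # placeholder file name
    dump(demo_model, model_path)
    reloaded_model = load(model_path)

# The reloaded model should behave exactly like the original
same_predictions = np.allclose(demo_model.predict(X_demo),
                               reloaded_model.predict(X_demo))
print("Reloaded model matches original:", same_predictions)
```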

## Now you have a shortlist of promising models. You now need to fine-tune them!
# Fine-Tune Your Model

## 1- Grid Search
## evaluate all the possible combinations of hyperparameter values for the RandomForestRegressor 
*It may take a long time*

In [25]:
from sklearn.model_selection import GridSearchCV

In [26]:
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(housing_prepared, housing_labels)

best_params = grid_search.best_params_

best_model = grid_search.best_estimator_

print("Best hyperparameters:", best_params)

Best hyperparameters: {'max_features': 8, 'n_estimators': 30}


Display the evaluation scores for each hyperparameter combination:

In [27]:
cvres = grid_search.cv_results_

results = pd.DataFrame(cvres)

for mean_score, params in zip(results['mean_test_score'], results['params']):
    print(np.sqrt(-mean_score), params)

64878.27480854276 {'max_features': 2, 'n_estimators': 3}
55391.003575336406 {'max_features': 2, 'n_estimators': 10}
52721.66494842234 {'max_features': 2, 'n_estimators': 30}
58541.12715494087 {'max_features': 4, 'n_estimators': 3}
51623.59366665994 {'max_features': 4, 'n_estimators': 10}
49787.65951361993 {'max_features': 4, 'n_estimators': 30}
58620.88234614251 {'max_features': 6, 'n_estimators': 3}
51645.862673140065 {'max_features': 6, 'n_estimators': 10}
49917.66994061786 {'max_features': 6, 'n_estimators': 30}
58640.96129790229 {'max_features': 8, 'n_estimators': 3}
51650.365581628095 {'max_features': 8, 'n_estimators': 10}
49672.50940389753 {'max_features': 8, 'n_estimators': 30}
61580.24110015614 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
53889.80996032937 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
58667.89389226964 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52764.2630869393 {'bootstrap': False, 'max_features': 3, 'n_estimators': 
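
The size of this search can be checked before running it. The sketch below uses Scikit-Learn's `ParameterGrid` to enumerate the same `param_grid`; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# 3 x 4 combinations from the first dict plus 1 x 2 x 3 from the second
n_combinations = len(ParameterGrid(param_grid))

# Each combination is trained once per fold (cv=5 above)
n_folds = 5
n_fits = n_combinations * n_folds

print("Hyperparameter combinations:", n_combinations)
print("Model fits during the search:", n_fits)
```

With the default `refit=True`, `GridSearchCV` also retrains the best combination once more on the whole training set.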

# Analyze the Best Models and Their Errors
1-indicate the relative importance of each attribute

In [None]:
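feature_importances = best_model.feature_importances_

# housing_prepared is an array with no .columns, so rebuild the names in pipeline output order
extra_attribs = ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
attributes = num_attribs + extra_attribs + list(cat_encoder.categories_[0])

feature_importance_df = pd.DataFrame({'Feature': attributes,
                                      'Importance': feature_importances})

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)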

2-display these importance scores next to their corresponding attribute names:

In [None]:
print(feature_importance_df)

## Now is the time to evaluate the final model on the test set.
# Evaluate Your System on the Test Set

1-get the predictors and the labels from your test set

In [32]:
X_test = test_set.drop("median_house_value", axis=1)
y_test = test_set["median_house_value"].copy()

2-run your full_pipeline to transform the data

In [33]:
X_test_prepared = full_pipeline.transform(X_test)

3-evaluate the final model on the test set

In [34]:
y_pred = best_model.predict(X_test_prepared)

final_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Final RMSE on the test set:", final_rmse)

Final RMSE on the test set: 49198.020631676336


# compute a 95% confidence interval for the generalization error 
*using scipy.stats.t.interval():*

In [35]:
from scipy import stats

In [37]:
confidence = 0.95

# Work with squared errors: the interval is built around their mean (the MSE)
squared_errors = (y_pred - y_test) ** 2

# Degrees of freedom come from the number of test instances, not from the model
degrees_of_freedom = len(squared_errors) - 1

# Take the square root of the MSE bounds to express the interval as an RMSE
confidence_interval = np.sqrt(stats.t.interval(confidence, degrees_of_freedom,
                                               loc=squared_errors.mean(),
                                               scale=stats.sem(squared_errors)))

print("95% Confidence Interval for Generalization Error:", confidence_interval)
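
To see what `stats.t.interval` computes, the sketch below rebuilds a 95% interval by hand as the mean plus or minus the critical t value times the standard error. It is applied to squared errors so that the bounds can be converted to RMSE by taking square roots. The data are synthetic and all names are illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic labels and predictions standing in for y_test and y_pred
rng = np.random.default_rng(7)
y_true_demo = rng.uniform(100_000, 400_000, size=500)
y_pred_demo = y_true_demo + rng.normal(0, 50_000, size=500)

squared_errors_demo = (y_pred_demo - y_true_demo) ** 2
n = len(squared_errors_demo)

# Interval for the mean squared error: mean +/- t_crit * standard error
mse = squared_errors_demo.mean()
sem = squared_errors_demo.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% interval
manual_interval = np.sqrt([mse - t_crit * sem, mse + t_crit * sem])

# The same interval from scipy in one call
scipy_interval = np.sqrt(stats.t.interval(0.95, n - 1, loc=mse,
                                          scale=stats.sem(squared_errors_demo)))

print("Manual RMSE interval:", manual_interval)
print("scipy RMSE interval: ", scipy_interval)
```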


# Great Job!
# #shAI_Club