In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
camnugent_california_housing_prices_path = kagglehub.dataset_download('camnugent/california-housing-prices')

print('Data source import complete.')

# A Complete Guide to an End-to-End Machine Learning Project

* In this notebook, we are going to build an end-to-end Machine Learning project.
* It shows how to think while working on such a project.
* It also shows how to analyze the problem from a business point of view rather than focusing only on modeling.
* Let's say you are working as a data scientist at a real estate company.
* These are the main steps we will go through:

1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Test your model on Test Dataset.

**IF YOU GAIN SOME KNOWLEDGE FROM THE NOTEBOOK, PLEASE UPVOTE IT.**

# 1. Look at the Big Picture

![image.png](attachment:6bdd805f-8586-4351-bfaf-82d704afdf99.png)

## TASK: As a data scientist, your task is to build a model of housing prices in California using the California census data.

* This data has metrics such as the population, median income, median housing
price, and so on for each block group in California.
* Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data.
* We will just call them “districts” for short.
* Your model should learn from this data and be able to predict the `median housing price`
in any district, given all the other metrics.

## Frame the Problem

* First, ask your boss: what is the **business objective** of this model? How does the company expect to use and `benefit from this model`?
* Building the model is probably not the end goal in itself.

* Answering the above questions will help you decide what algorithms you will select, what performance measure you will use to
evaluate your model, and how much effort you should spend tweaking it.

* Your boss answers that your model’s output (a prediction of a district’s median housing
price) will be fed to another Machine Learning system along with many other signals.
*  This downstream system will determine whether it is worth investing in a given area or not. **Getting this right is critical, as it directly affects revenue.**

![image.png](attachment:d5d99fd1-0b5c-4003-b2a9-460ebd2d1bf8.png)

* This task is a multiple regression problem: the system uses several features to predict a single value.
* Next, we have to select a performance measure.
* A typical performance measure for regression problems is the **Root Mean Square Error (RMSE)**.
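
As a quick illustration of what RMSE measures, here is a minimal sketch using NumPy. The labels and predictions are made up for the example, and `rmse` is a hypothetical helper name rather than part of this notebook's pipeline.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# Made-up labels and predictions (in dollars), for illustration only
labels = [200_000, 150_000, 300_000]
predictions = [210_000, 140_000, 320_000]

rmse_value = rmse(labels, predictions)
print(f"RMSE: {rmse_value:,.2f}")
```

Because errors are squared before averaging, the single 20,000 error contributes more to the result than the two 10,000 errors combined, so RMSE gives a higher weight to large errors.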


# 2. Get the Data

In [None]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split

In [None]:
df=pd.read_csv('../input/california-housing-prices/housing.csv')

#### Let's look into the data

In [None]:
df

In [None]:
df.info()

### Inference:
1. There are 20,640 instances in the dataset.
2. The `total_bedrooms` attribute has only 20,433 non-null values, meaning that 207 districts are missing
this feature.
3. All attributes are numerical, except the `ocean_proximity`.

In [None]:
df["ocean_proximity"].value_counts()

In [None]:
df.describe()

* Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute.
* You can call the `hist()` method on the whole dataset, and it will plot a histogram for each `numerical attribute`.

In [None]:
df.hist(bins=50, figsize=(20,15))

### Inference made from these histograms

1. First, the `median income` attribute does not look like it is expressed in `US dollars (USD)`. After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. The numbers represent roughly tens of thousands of dollars (e.g., 3 actually means about $\$$30,000).
    * Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, **but you should try to understand how the data was computed**.
    
2. The `housing median age` and the `median house value` were also capped. The latter may be a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system’s output) to see if this is a problem or not. If they tell you that they need precise predictions even beyond $\$$500,000, then you have mainly two options:
    * Collect proper labels for the districts whose labels were capped.
    * Remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $\$$500,000).

3. These attributes have very different scales. We will discuss this later in this notebook
when we explore **feature scaling**.

4. Finally, many histograms are `tail heavy`: they extend much farther to the right of
the median than to the left. This may make it a bit harder for some Machine
Learning algorithms to detect patterns. We will try transforming these attributes
later on to have more bell-shaped distributions.
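
To make the idea of reshaping a tail-heavy attribute concrete, here is a minimal sketch on synthetic data. The log transform (`np.log1p`) is an assumption chosen for illustration, since the notebook does not say which transformation it has in mind, and the sample is generated rather than taken from the housing data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values standing in for an attribute such as population
skewed = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=10_000))

# log1p computes log(1 + x), which compresses the long right tail
log_transformed = np.log1p(skewed)

skew_before = skewed.skew()
skew_after = log_transformed.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

A skewness close to 0 indicates a roughly symmetric distribution, so comparing the two numbers is a quick check of whether the transformation helped.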

**NOTE: Before doing anything else with the data, let's create a test set and set it aside without looking at it, to avoid bias.**

## Create test set


In [None]:
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42,stratify=None)

### Points to consider while doing train_test_split
* So far we have considered purely random sampling methods. This is generally fine if
your dataset is large enough (especially relative to the number of attributes), but if it
is not, you run the risk of introducing a significant sampling bias.
* When a survey company decides to call 1,000 people to ask them a few questions, they don’t just pick
1,000 people randomly in a phone book. They try to ensure that these 1,000 people
are representative of the whole population.
* For example, the US population is composed
of 51.3% female and 48.7% male, so a well-conducted survey in the US would
try to maintain this ratio in the sample: 513 female and 487 male. **This is called stratified
sampling: the population is divided into homogeneous subgroups called strata,
and the right number of instances is sampled from each stratum to guarantee that the
test set is representative of the overall population.**

#### Business insights
* Suppose you chatted with experts who told you that the median income is a very
important attribute to predict median housing prices.
* You may want to ensure that
the test set is representative of the various categories of incomes in the whole dataset.
* Since the median income is a `continuous numerical attribute`, you first need to **create
an income category attribute**.
* Let’s look at the median income histogram more closely
(in the histograms plotted earlier): most median income values are clustered around 1.5 to 6 (i.e.,
$\$$15,000–$\$$60,000), but some median incomes go far beyond 6.
* It is important to have
a sufficient number of instances in your dataset for each stratum, or else the estimate
of the stratum’s importance may be biased. This means that you should not have too
many strata, and each stratum should be large enough.
* The following code uses the
pd.cut() function to create an income category attribute with 5 categories (labeled
from 1 to 5): category 1 ranges from 0 to 1.5 (i.e., less than $15,000), category 2 from 1.5 to 3, and so on:

In [None]:
# Bin values into discrete intervals.
df["income_cat"] = pd.cut(df["median_income"], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])  #cut the median income across bins and give labels to each bin

In [None]:
df["income_cat"].hist()

Now let's do `stratified sampling` based on the income category. For this,
you can use Scikit-Learn’s `StratifiedShuffleSplit` class:

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)  # Provides train/test indices to split data in train/test sets.
for train_index, test_index in split.split(df, df["income_cat"]):
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

In [None]:
strat_train_set["income_cat"].value_counts() / len(strat_train_set)

You can see that the proportion of each income category is nearly the same in the training set and the test set, and it also matches the proportions in the full `df` (you can check).
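
One way to run that check is sketched below. It uses a synthetic income column (an assumption, so the block runs on its own) and `train_test_split` with its `stratify` argument instead of `StratifiedShuffleSplit`; `income_cat_proportions` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-in for the median_income column
demo = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})
demo["income_cat"] = pd.cut(demo["median_income"],
                            bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                            labels=[1, 2, 3, 4, 5])

def income_cat_proportions(data):
    """Share of rows falling in each income category."""
    return data["income_cat"].value_counts(normalize=True).sort_index()

# Purely random split versus a split stratified on the income category
_, rand_test = train_test_split(demo, test_size=0.2, random_state=42)
_, strat_test = train_test_split(demo, test_size=0.2, random_state=42,
                                 stratify=demo["income_cat"])

compare = pd.DataFrame({
    "overall": income_cat_proportions(demo),
    "random": income_cat_proportions(rand_test),
    "stratified": income_cat_proportions(strat_test),
})
print(compare)
```

The `stratified` column should track the `overall` column very closely, while the `random` column is free to drift from it.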

* Now you should remove the `income_cat` attribute so the data is back to its original
state:

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)


* We spent quite a bit of time on test set generation for a good reason: this is an often
neglected but critical part of a Machine Learning project.
* Moreover, many of these
ideas will be useful later when we discuss cross-validation.
* Now it’s time to move on
to the next stage: exploring the data.

# 3. Discover and Visualize the Data to Gain Insights

In [None]:
# Let’s create a copy so you can play with it without harming the training set:
housing = strat_train_set.copy()

### Visualizing the Geographical Data

In [None]:
housing.plot(kind='scatter',x='longitude',y='latitude')

Setting the `alpha` option to 0.1 makes it much easier to visualize the places
where there is a high density of data points:


In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

**TIP: More generally, our brains are very good at spotting patterns on pictures, but you may need to play around with visualization parameters to make the patterns stand out.**

Now let’s look at the housing prices. The radius of each `circle` represents
the `district’s population (option s)`, and the color represents the `price (option c)`. We
will use a predefined color map (option `cmap`) called `jet`, which ranges from blue
(low prices) to red (high prices):

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, s=housing["population"]/100,
             label="population", figsize=(10,7),c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,)
plt.legend()

This image tells you that the housing prices are very much related to the location
(e.g., close to the ocean) and to the population density, as you probably knew already.

### Looking for correlations

In [None]:
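# numeric_only=True skips the text column ocean_proximity
# (needed with pandas >= 2.0, where corr() no longer drops non-numeric columns silently)
corr_matrix = housing.corr(numeric_only=True)
corr_matrix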

In [None]:
sns.heatmap(corr_matrix,annot=True)
# plt.matshow(corr_matrix)

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

* The **correlation coefficient** ranges from `–1 to 1`. When it is close to 1, it means that
there is a **strong positive correlation**; for example, the median house value tends to go
up when the median income goes up.
* When the coefficient is close to –1, it means
that there is a **strong negative correlation**; you can see a small negative correlation
between the latitude and the median house value (i.e., prices have a slight tendency to
go down when you go north).
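
To get a feel for these values, here is a small sketch with made-up arrays. It also shows that the coefficient only captures linear relationships: in the last case `y` is fully determined by `x`, yet the coefficient is about 0.

```python
import numpy as np

x = np.arange(1.0, 11.0)  # 1, 2, ..., 10

# y rises linearly with x: coefficient of 1
perfect_positive = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls linearly as x rises: coefficient of -1
perfect_negative = np.corrcoef(x, -3 * x)[0, 1]

# y is a symmetric parabola around the mean of x: no linear trend at all
centered = x - x.mean()
nonlinear = np.corrcoef(centered, centered ** 2)[0, 1]

print(perfect_positive, perfect_negative, nonlinear)
```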

In [None]:
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)

* This plot reveals a few things. First, the correlation is indeed very strong; you can
clearly see the upward trend and the points are not too dispersed.
* Second, the price
cap that we noticed earlier is clearly visible as a horizontal line at $\$$500,000. But this
plot reveals other less obvious straight lines: a horizontal line around $\$$450,000,
another around $\$$350,000, perhaps one around $\$$280,000, and a few more below that.
* You may want to try removing the corresponding districts to prevent your algorithms
from learning to reproduce these data quirks.

### Experimenting with Attribute Combinations

* One last thing you may want to do before actually preparing the data for Machine
Learning algorithms is to try out various attribute combinations.
* For example, the total number of rooms in a district is not very useful if you don’t know how many
households there are.
* What you really want is the number of rooms per household.
* Similarly, the total number of bedrooms by itself is not very useful: you probably
want to compare it to the number of rooms.
* The population per household also seems like an interesting attribute combination to look at. Let’s create these new
attributes:

In [None]:
housing

In [None]:
housing["rooms_per_household"]=housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"]=housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]

Let's look at the correlation matrix again:


In [None]:
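# numeric_only=True skips the text column ocean_proximity (needed with pandas >= 2.0)
corr_matrix=housing.corr(numeric_only=True)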

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

**INFERENCE:**
* The new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms.
* Apparently houses with a lower bedroom/room ratio tend to be more expensive.
* The number of rooms per household is also more informative than the total number of rooms in a district—obviously the larger the houses, the more expensive they are.

# 4. Prepare the Data for Machine Learning Algorithms

It’s time to prepare the data for your Machine Learning algorithms. Instead of just doing this manually, you should write functions to do that, for several good reasons:
* This will allow you to reproduce these transformations easily on any dataset (e.g.,
the next time you get a fresh dataset).
* You will gradually build a library of transformation functions that you can reuse
in future projects.
* You can use these functions in your live system to transform the new data before
feeding it to your algorithms.
* This will make it possible for you to easily try various transformations and see
which combination of transformations works best.

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)     # drop() returns a copy without the target column
housing_labels = strat_train_set["median_house_value"].copy()    # keep the target variable separately from the predictors

### Data Cleaning

* First, most ML algorithms cannot work with missing features, so let's take care of them.
* You noticed earlier that the `total_bedrooms`
attribute has some missing values, so let’s fix this. You have three options:
    * Get rid of the corresponding districts.
    * Get rid of the whole attribute.
    * Set the missing values to some value (zero, the mean, the median, etc.).
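
The three options can be sketched with plain pandas calls. The tiny DataFrame below is made up for illustration and only mimics the shape of the housing data.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing total_bedrooms value
demo = pd.DataFrame({
    "total_rooms": [880., 7099., 1467., 1274.],
    "total_bedrooms": [129., np.nan, 190., 235.],
})

# Option 1: get rid of the districts (rows) with a missing value
option1 = demo.dropna(subset=["total_bedrooms"])

# Option 2: get rid of the whole attribute (column)
option2 = demo.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with some value, here the median
median = demo["total_bedrooms"].median()
option3 = demo.assign(total_bedrooms=demo["total_bedrooms"].fillna(median))

print(len(option1), list(option2.columns), option3.loc[1, "total_bedrooms"])
```

If you go with option 3, compute the median on the training set only and keep it, so the same value can later fill missing values in the test set and in new data.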

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

In [None]:
housing_num = housing.drop("ocean_proximity", axis=1)  # the median can only be computed on numerical attributes

In [None]:
# Now you can fit the imputer instance to the training data using the fit() method:
imputer.fit(housing_num)

In [None]:
imputer.statistics_

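The imputer has computed the median of each attribute and stored the result in its `statistics_` instance variable.
Only the `total_bedrooms` attribute had missing
values, but we cannot be sure that there won’t be any missing values in new data after
the system goes live, so it is safer to apply the imputer to all the numerical attributes.
You can check that the stored statistics match the medians computed by pandas: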

In [None]:
housing_num.median().values

Now you can use this “trained” imputer to transform the training set by replacing
missing values with the learned medians:

In [None]:
X = imputer.transform(housing_num)         # X = plain NumPy array containing the transformed features
# Rebuild a DataFrame, keeping the original column names and row index
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

### Handling Text and Categorical Attributes


In [None]:
housing_cat = housing[["ocean_proximity"]]

Now, we convert text attribute `ocean_proximity` into numerical attribute.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder=OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

In [None]:
ordinal_encoder.categories_     # get the list of  categories

* One issue with ordinal encoding is that ML algorithms will assume that **two nearby
values are more similar than two distant values**.
* This may be fine in some cases (e.g.,
for ordered categories such as “bad”, “average”, “good”, “excellent”), but it is obviously
not the case for the `ocean_proximity` column (for example, categories 0 and 4 are
clearly more similar than categories 0 and 1).

So, let's use one-hot encoding instead, which creates one binary attribute per category.

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot

In [None]:
housing_cat_1hot.toarray()

### Custom Transformers

* Although Scikit-Learn provides many useful transformers, you will need to write
your own for tasks such as `custom cleanup operations or combining specific
attributes`.
* You will want your transformer to work seamlessly with Scikit-Learn functionalities
(such as pipelines).

To create a custom transformer, all you need is to create a `class and implement three methods`: **fit()
(returning self), transform(), and fit_transform().**

You can get the last one for free by simply adding **TransformerMixin** as a **base class**.
* Also, if you add `BaseEstimator` as a base class (and avoid `*args` and `**kwargs` in your constructor), you will get
two extra methods `(get_params() and set_params())` that will be useful for automatic hyperparameter tuning. For example, here is a small transformer class that adds the combined attributes we discussed earlier.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# Column positions of the attributes used below (same order as the DataFrame columns)
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):

    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self  # nothing else to do

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
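
To see what the two base classes provide, here is a minimal standalone sketch. `RatioAdder` is a hypothetical toy transformer written for this illustration (it is not the class above), and the input array is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(TransformerMixin, BaseEstimator):
    """Toy transformer: appends the ratio of two columns as a new column."""

    def __init__(self, numerator_ix=0, denominator_ix=1):
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]

adder = RatioAdder()

# get_params() comes from BaseEstimator and reads the constructor arguments
params = adder.get_params()

# set_params() also comes from BaseEstimator; here it swaps the two columns
adder.set_params(numerator_ix=1, denominator_ix=0)

# fit_transform() comes from TransformerMixin; it was never written by hand
X_demo = np.array([[6., 2.], [9., 3.]])
out = adder.fit_transform(X_demo)
print(params, out.shape)
```

These are the hooks that tools such as `GridSearchCV` rely on to clone an estimator and try different hyperparameter values.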

### Feature Scaling

* One of the most important transformations you need to apply to your data is feature
scaling. With few exceptions, Machine Learning algorithms don’t perform well when
the input numerical attributes have very different scales.
* This is the case for the housing
data: the total number of rooms ranges from about 6 to 39,320, while the **median
incomes only range from 0 to 15**. Note that scaling the target values is generally not
required.

There are two common ways to get all attributes to have the same scale: **min-max
scaling** and **standardization**.
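
Here is a minimal sketch of both approaches on a made-up column, computed by hand and with Scikit-Learn's `MinMaxScaler` and `StandardScaler`. The values are arbitrary and include one outlier to show how each method reacts.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([1., 2., 3., 4., 10.])  # made-up attribute with one outlier

# Min-max scaling: shift and rescale so values end up in the range 0 to 1
min_max = (values - values.min()) / (values.max() - values.min())

# Standardization: subtract the mean, divide by the standard deviation
standardized = (values - values.mean()) / values.std()

# Scikit-Learn's scalers expect a 2D array with one column per attribute
column = values.reshape(-1, 1)
min_max_sk = MinMaxScaler().fit_transform(column).ravel()
standardized_sk = StandardScaler().fit_transform(column).ravel()

print(min_max)
print(standardized)
```

Min-max scaling squeezes the four small values into the bottom third of the range because of the outlier, whereas standardization does not bound the values but is much less affected by outliers.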

#### Transformation Pipelines

* As you can see, there are many data transformation steps that need to be executed in
the right order.
* Fortunately, Scikit-Learn provides the Pipeline class to help with
such sequences of transformations. Here is a small pipeline for the numerical
attributes:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),      # Imputing Missing values
    ('attribs_adder', CombinedAttributesAdder()),       # Adding attribs
    ('std_scaler', StandardScaler()),                    # Feature Scaling with Standard Scaler
    ])
housing_num_tr = num_pipeline.fit_transform(housing_num)

* So far, we have handled the **categorical columns** and the **numerical columns** separately.
* It would be more convenient to have a single transformer able to handle all columns,
applying the appropriate transformations to each column.
* Scikit-Learn provides the `ColumnTransformer` class for this purpose.

Let’s use it to apply all the transformations to the housing data:

In [None]:
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)             # columns with numerical attributes
cat_attribs = ["ocean_proximity"]           # columns with categorical attributes
full_pipeline = ColumnTransformer([
      ("num", num_pipeline, num_attribs),
      ("cat", OneHotEncoder(), cat_attribs),
    ])
housing_prepared = full_pipeline.fit_transform(housing)

* Note that the `OneHotEncoder` returns a sparse matrix, while the `num_pipeline` returns
a dense matrix.
* When there is such a mix of **sparse and dense matrices**, the `ColumnTransformer` estimates the density of the final matrix (i.e., the ratio of non-zero cells), and it returns a sparse matrix if the density is lower than a given threshold (by
default, `sparse_threshold=0.3`).
* In this example, it returns a dense matrix. And
that’s it! We have a preprocessing pipeline that takes the full housing data and applies
the appropriate transformations to each column.
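
The density rule can be checked on a toy example. The sketch below uses a made-up three-column DataFrame, not the housing data, and the density figure in the comment assumes every cell of the dense block counts as non-zero.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: two numerical columns and one categorical column
demo = pd.DataFrame({
    "rooms": [3., 5., 4., 6.],
    "income": [2., 8., 3., 5.],
    "proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["rooms", "income"]),   # dense output, 2 columns
    ("cat", OneHotEncoder(), ["proximity"]),          # sparse output, 3 columns
])
out = ct.fit_transform(demo)

# Each row has 2 dense cells plus a single 1 among the 3 one-hot cells,
# so the estimated density is (2 + 1) / 5 = 0.6, above the 0.3 threshold
is_sparse = sparse.issparse(out)
print(type(out), out.shape, is_sparse)
```

Passing a larger `sparse_threshold` to `ColumnTransformer` would tip the same data toward a sparse result.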

In [None]:
housing_prepared.shape

# 5. Select and Train a Model

We have now framed the problem, got the data and explored it, sampled a
training set and a test set, and written transformation pipelines to clean up and
prepare the data for Machine Learning algorithms automatically.

We are now ready
to select and train a Machine Learning model.

### Training and Evaluating on the Training Set

Let's train a Linear Regression model:

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
lin_reg.fit(housing_prepared,housing_labels)


You now have a working Linear Regression model. Let’s try it out on a few
instances from the training set:

In [None]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]

some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))

print("Labels:", list(some_labels))

Let’s measure this regression model’s RMSE on the whole training
set using Scikit-Learn’s mean_squared_error function.

In [None]:
from sklearn.metrics import mean_squared_error
housing_predictions=lin_reg.predict(housing_prepared)                 # predicting on training data
lin_mse = mean_squared_error(housing_labels, housing_predictions)     # calculate mean squared error
lin_rmse = np.sqrt(lin_mse)                                           # calculate root of mse
print("RMSE:",lin_rmse)
print("R-Squared:",lin_reg.score(housing_prepared,housing_labels))   # Return the R-squared

In [None]:
housing_labels.describe()

* Okay, this is better than nothing, but clearly not a great score.
* The closer R-squared is to 1, the better the model fits.
* This is an example of a **model underfitting** the training data.

* The main ways to fix underfitting are to **select a more powerful model**, to feed the training algorithm with better features, or
to reduce the constraints on the model.

Let’s train a **DecisionTreeRegressor**. This is a powerful model, capable of finding
complex nonlinear relationships in the data.

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg=DecisionTreeRegressor()
tree_reg.fit(housing_prepared,housing_labels)

In [None]:
# Evaluate it on Training set
housing_predictions=tree_reg.predict(housing_prepared)
dt_mse = mean_squared_error(housing_labels, housing_predictions)     # calculate mean squared error
dt_rmse = np.sqrt(dt_mse)                                           # calculate root of mse
print("RMSE:",dt_rmse)
print("R-Squared:",tree_reg.score(housing_prepared,housing_labels))   # Return the R-squared

* **Wait, what!**? No error at all? Could this model really be absolutely perfect? Of course,
it is much more likely that the model has badly **overfit the data**.
* As we saw earlier, you don’t want to touch the **test set** until you are ready to launch a
model you are confident about, so you need to use part of the training set for training,
and part for **model validation**.

### Better Evaluation Using Cross-Validation

* We will use [K-fold Cross Validation](https://machinelearningmastery.com/k-fold-cross-validation/) to detect overfitting and evaluate the model more reliably.
* We will use Scikit-Learn’s K-fold cross-validation feature.
* The following code randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds.
* The result is an array containing the 10 evaluation scores

In [None]:
from sklearn.model_selection import cross_val_score
scores=cross_val_score(tree_reg,housing_prepared,housing_labels,scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

Scikit-Learn’s **cross-validation features** expect a utility function
(greater is better) rather than a cost function (lower is better), so
the **scoring function is actually the opposite of the MSE** (i.e., a negative
value), which is why the preceding code computes -scores
before calculating the square root.
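
The sign convention is easy to verify on a small example. This sketch uses synthetic regression data from `make_regression` and a plain `LinearRegression` with 5 folds, all of which are assumptions made so the block runs quickly on its own.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=100, n_features=3,
                                 noise=15.0, random_state=42)

# One score per fold; each is the negative of that fold's MSE
neg_mse_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                                 scoring="neg_mean_squared_error", cv=5)

# Flip the sign to recover the MSE, then take the square root for RMSE
rmse_scores = np.sqrt(-neg_mse_scores)

print(neg_mse_scores)
print(rmse_scores)
```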

In [None]:
print(tree_rmse_scores)
print("Mean:", tree_rmse_scores.mean())
print("Standard deviation:", tree_rmse_scores.std())


**The Decision Tree model is overfitting so badly that it performs worse than the Linear Regression model.**

* Now the Decision Tree doesn’t look as good as it did earlier. In fact, it seems to perform
worse than the Linear Regression model! Notice that cross-validation allows
you to get not only an estimate of the performance of your model, but also a measure
of how precise this estimate is (i.e., its standard deviation).
* The Decision Tree has a
score of approximately 71,407, generally ±2,439. You would not have this information
if you just used one validation set. But cross-validation comes at the cost of training
the model several times, so it is not always possible.

Let’s try one last model: the `RandomForestRegressor`. A Random Forest trains many Decision Trees on random subsets of the data and features, then averages their predictions, which helps reduce overfitting.

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

In [None]:
scores=cross_val_score(forest_reg,housing_prepared,housing_labels,scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-scores)

In [None]:
print(forest_rmse_scores)
print("Mean:", forest_rmse_scores.mean())
print("Standard deviation:", forest_rmse_scores.std())

Wow, this is much better: Random Forests look very promising

# 6. Fine-Tune Your Model

Let’s assume that you now have a shortlist of promising models. You now need to fine-tune them.
Let’s look at a few ways you can do that

### Grid Search

Grid search evaluates all possible combinations of the hyperparameter values you specify,
using cross-validation.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

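* This `param_grid` tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of **n_estimators and max_features hyperparameter** values specified in the first dict.
* Then it tries all 2 × 3 = 6 combinations of hyperparameter values in the
second dict, this time with the `bootstrap` hyperparameter set to `False`.
* In total, the grid search explores 12 + 6 = 18 combinations and trains each one 5 times (five-fold cross-validation), for 90 rounds of training.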
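
The combination count can be confirmed with Scikit-Learn's `ParameterGrid`, which expands a grid the same way `GridSearchCV` does. The grid is copied from the cell above, and `n_fits` assumes the `cv=5` setting used there.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Each dict is expanded separately, then the results are concatenated
n_combinations = len(ParameterGrid(param_grid))

# One training run per combination and per cross-validation fold
n_fits = n_combinations * 5

print(n_combinations, n_fits)
```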

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_estimator_

In [None]:
# the evaluation scores are also available
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

### Result of fine-tuning
* In this run, we obtain the best solution by setting the `max_features` hyperparameter
to 8 and the `n_estimators` hyperparameter to 30. Since no `random_state` is set on the forest, your exact numbers may differ slightly.
* The RMSE score for this combination is 49,682, which is slightly better than the score you got earlier using the
default hyperparameter values (which was 50,182).
* So, we have successfully fine-tuned our model.

### Randomized Search

* The grid search approach is fine when you are exploring relatively few combinations,
like in the previous example, but when the hyperparameter search space is large, it is
often preferable to use **RandomizedSearchCV** instead.
* This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations,
it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration.

This approach has two main benefits:


* If you let the randomized search run for, say, 1,000 iterations, this approach will
explore 1,000 different values for each hyperparameter (instead of just a few values
per hyperparameter with the grid search approach)
* You have more control over the computing budget you want to allocate to hyperparameter
search, simply by setting the number of iterations.
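
Here is a minimal sketch of a randomized search. It runs on synthetic data from `make_regression` so that it finishes quickly, and the `randint` ranges, `n_iter=5`, and `cv=3` are arbitrary choices for illustration, not tuned settings for the housing data.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=10.0, random_state=42)

# Distributions to sample from (randint excludes the upper bound)
param_distribs = {
    "n_estimators": randint(low=5, high=30),
    "max_features": randint(low=1, high=8),
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distribs,
    n_iter=5,          # the computing budget: 5 sampled combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X_demo, y_demo)

print(rnd_search.best_params_)
```

Raising `n_iter` is the only change needed to spend more of the computing budget on the search.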

### Ensemble Methods

* Another way to fine-tune your system is to try to combine the models that perform
best.
* The group (or “ensemble”) will often perform better than the best individual
model (just like `Random Forests perform better than the individual Decision Trees`
they rely on), especially if the individual models make very different types of errors.

### Analyze the Best Models and Their Errors

* You will often gain good insights on the problem by inspecting the best models.
* For example, the *RandomForestRegressor* can indicate the relative importance of each
attribute for making accurate predictions.

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

Let’s display these importance scores next to their corresponding attribute names:

In [None]:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]                   # use the categorical encoder of full pipeline
cat_one_hot_attribs = list(cat_encoder.categories_[0])                   # categorical one hot attribute
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

* With this information, you may want to try dropping some of the less useful features

* You should also look at the specific errors that your system makes, then try to understand
why it makes them and what could fix the problem (adding extra features or, on
the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).

# 7. Evaluate Your System on the Test Set

* After tweaking your models for a while, you eventually have a system that performs
sufficiently well.
* Now is the time to evaluate the final model on the test set. There is
nothing special about this process; just get the predictors and the labels from your
test set, run your full_pipeline to transform the data (call transform(), not
fit_transform(), you do not want to fit the test set!), and evaluate the final model
on the test set:

In [None]:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
print(final_rmse)

Finally, we get an RMSE of 48,209.81 on the test set in this run.

**I HOPE YOU GOT SOME LEARNING FROM THE NOTEBOOK.
PLEASE UPVOTE IT.**

### Reference
Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn and TensorFlow* (O'Reilly).