# Chapter 2

## Handle the missing values

Missing values cause many problems. These are some of the methods for handling them:

In [2]:
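# The three options below are alternatives: pick one, do not run all of them.

# Option 1: drop the rows that have missing values
# housing.dropna(subset=["total_bedrooms"], inplace=True)

# Option 2: drop the whole column that has missing values
# housing = housing.drop("total_bedrooms", axis=1)

# Option 3: replace the missing values with the median (this is called imputing)
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)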

In [None]:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
# You can change the strategy based on your data

"""
Missing values can also be replaced with the mean value
(strategy="mean"), or with the most frequent value
(strategy="most_frequent"), or with a constant value
(strategy="constant", fill_value=…). The last two strategies
support non-numerical data.
"""

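housing_num = housing.select_dtypes(include=[np.number])
# Keep only the numerical columns, because the median is defined only for numbers

imputer.fit(housing_num)
# Learns the median of each column (stored in imputer.statistics_)

X = imputer.transform(housing_num)
# Replaces the missing values in "housing_num" with the learned medians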
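
To see what the imputer actually learns, here is a minimal sketch on a made-up two-column DataFrame (`toy_num` and its values are assumptions for illustration, not the housing data). It also shows one way to turn the NumPy output back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the numerical housing columns
toy_num = pd.DataFrame({
    "total_bedrooms": [100.0, np.nan, 300.0, 500.0],
    "population": [800.0, 1200.0, np.nan, 1000.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy_num)

# One learned median per column: 300.0 and 1000.0 for this toy data
print(toy_imputer.statistics_)

# transform() returns a plain NumPy array, so the column names and index
# have to be put back by hand if a DataFrame is needed
toy_filled = pd.DataFrame(toy_imputer.transform(toy_num),
                          columns=toy_num.columns, index=toy_num.index)
print(toy_filled)
```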

Scikit-Learn's `sklearn.impute` module also provides two more powerful imputers:

* **KNNImputer:** replaces each missing value with the mean of the k-nearest
  neighbors’ values for that feature. The distance is based on all the available features.

* **IterativeImputer:** trains a regression model per feature to predict the
  missing values based on all the other available features. It then trains the model
  again on the updated data, and repeats the process several times, improving the
  models and the replacement values at each iteration.
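
A minimal sketch of both imputers on a tiny made-up array (`X_toy` is an assumption for illustration) where the second column is exactly twice the first. `IterativeImputer` is still experimental in Scikit-Learn, so it needs the explicit `enable_iterative_imputer` import before it can be used.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come first)
from sklearn.impute import IterativeImputer, KNNImputer

# Hypothetical data: second column = 2 * first column, with one value missing
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [4.0, 8.0],
    [5.0, 10.0],
])

# KNN: the two nearest rows (first column 2.0 and 4.0) have 4.0 and 8.0,
# so the missing value becomes their mean
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)
print(knn_filled[2])

# Iterative: a regression model predicts the second column from the first
iter_filled = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_toy)
print(iter_filled[2])
```

Both imputers work on numerical features only, so they would be applied to something like `housing_num` rather than the full DataFrame.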

## Encoding

In [None]:
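from sklearn.preprocessing import OrdinalEncoder

# housing_cat is assumed to be a DataFrame holding only the categorical column(s)
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
# Each category is replaced with an integer code (see ordinal_encoder.categories_)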
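
To make the encoders in this section concrete, here is a small sketch on a made-up categorical column (`toy_cat` and its category names are assumptions for illustration). It shows the integer codes from `OrdinalEncoder` next to the one-column-per-category output of `OneHotEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical column; the category names are made up
toy_cat = pd.DataFrame({"proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

toy_ordinal = OrdinalEncoder()
codes = toy_ordinal.fit_transform(toy_cat)
print(toy_ordinal.categories_)  # categories are sorted: INLAND, ISLAND, NEAR BAY
print(codes.ravel())            # codes for the four rows: 0, 2, 0, 1

toy_onehot = OneHotEncoder()
onehot = toy_onehot.fit_transform(toy_cat)  # SciPy sparse matrix by default
print(onehot.toarray())                     # one column per category, a single 1 per row
```

Ordinal codes imply an order between categories, which is usually only appropriate when the categories really are ordered. One-hot encoding avoids that at the cost of more columns.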

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

"""
By default the output is a SciPy sparse matrix. You can set
sparse_output=False (called sparse before Scikit-Learn 1.2) when creating the
OneHotEncoder, in which case the transform() method will return a
regular (dense) NumPy array directly.
"""

<div class="alert alert-block alert-warning">
        <b>Warning :
        As with all estimators, it is important to fit the scalers to the training data only: never
        use fit() or fit_transform() for anything other than the training set. Once you
        have a trained scaler, you can then use it to transform() any other set, including the
        validation set, the test set, and new data. Note that while the training set values will
        always be scaled to the specified range, if new data contains outliers, these may end up
        scaled outside the range. If you want to avoid this, just set the clip hyperparameter to True.
        </b>
</div>

## Rescaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)
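
The warning about fitting on the training set only, and about outliers in new data, can be sketched with made-up numbers (`X_train` and `X_new` are assumptions for illustration). The scaler is fitted once on the training values and then only `transform()` is called on the new values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0], [5.0], [10.0]])  # hypothetical training values
X_new = np.array([[2.5], [20.0]])           # 20.0 lies outside the training range

# fit() sees the training data only
plain = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
clipped = MinMaxScaler(feature_range=(-1, 1), clip=True).fit(X_train)

print(plain.transform(X_new).ravel())    # -0.5 and 3.0: the outlier leaves [-1, 1]
print(clipped.transform(X_new).ravel())  # -0.5 and 1.0: the outlier is clipped
```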

In [None]:
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

**What to do when a feature’s distribution has a heavy tail?**
* Before scaling the feature, first transform it to shrink the heavy tail and, if
  possible, to make the distribution roughly symmetrical.
  
* For positive features with a tail to the right, a common way to do this is to replace the feature with its square root (or raise the feature to a power between 0 and 1).

* If the feature has a really long and heavy tail, such as a
  power law distribution, then replacing the feature with its logarithm may help.
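
A rough sketch of the effect of these transformations, using a synthetic log-normal sample as a stand-in for a heavy-tailed feature (the sample and the name `heavy` are assumptions for illustration). Skewness close to 0 means the distribution is roughly symmetrical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Hypothetical heavy-tailed, strictly positive feature
heavy = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

skew_raw = heavy.skew()
skew_sqrt = np.sqrt(heavy).skew()  # square root shrinks the tail somewhat
skew_log = np.log(heavy).skew()    # logarithm shrinks it much more

print("skew raw: ", skew_raw)
print("skew sqrt:", skew_sqrt)
print("skew log: ", skew_log)
```

Note that both transformations need positive values. Features that contain zeros would need something like `np.log1p` instead of `np.log`.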

In [None]:
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])

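In [None]:
"""
A radial basis function (RBF) is any function that depends only on the
distance between the input value and a fixed point. The most commonly used
RBF is the Gaussian RBF, whose output decays exponentially as the input
value moves away from the fixed point.
"""
from sklearn.metrics.pairwise import rbf_kernel

# Similarity between each housing median age and the fixed point 35
age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)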
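
The `rbf_kernel` call above can be checked by hand on a few made-up ages (`ages` is an assumption for illustration). For a single feature, the Gaussian RBF is `exp(-gamma * (age - 35)**2)`. The sketch also compares a smaller `gamma`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

ages = np.array([[35.0], [30.0], [40.0], [52.0]])  # hypothetical ages

sharp = rbf_kernel(ages, [[35.0]], gamma=0.1)
wide = rbf_kernel(ages, [[35.0]], gamma=0.03)  # smaller gamma decays more slowly

# The same similarity computed directly from the formula
manual = np.exp(-0.1 * (ages - 35.0) ** 2)

print(sharp.ravel())
print(wide.ravel())
```

The similarity is exactly 1 at age 35 and drops toward 0 as the age moves away from it.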

"""
Plotted as a function of the housing median age, the age_simil_35 feature
peaks at 35 and falls off on both sides. A smaller gamma value gives a
wider peak.
"""

In [None]:
from sklearn.linear_model import LinearRegression

target_scaler = StandardScaler()
scaled_labels = target_scaler.fit_transform(housing_labels.to_frame())
# Scale the labels (to_frame() because the scaler expects a 2D input)

model = LinearRegression()
model.fit(housing[["median_income"]], scaled_labels)
# Train the model on the scaled labels

some_new_data = housing[["median_income"]].iloc[:5]
# Pretend this is new data

scaled_predictions = model.predict(some_new_data)
# These predictions are on the scaled label range

predictions = target_scaler.inverse_transform(scaled_predictions)
# Invert the scaling to get predictions in the original units
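
The scale, fit, predict, inverse-transform steps above can also be wrapped in Scikit-Learn's `TransformedTargetRegressor`. The sketch below uses synthetic data (`income`, `value` and their coefficients are assumptions for illustration) and checks that both routes give the same predictions.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for median_income and the house value labels
income = rng.uniform(1.0, 10.0, size=(200, 1))
value = 40_000.0 * income[:, 0] + 50_000.0 + rng.normal(0.0, 5_000.0, size=200)

# One object handles label scaling, training and the inverse transform
ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 transformer=StandardScaler())
ttr.fit(income, value)
preds = ttr.predict(income[:5])

# The manual route, for comparison
target_scaler = StandardScaler()
scaled = target_scaler.fit_transform(value.reshape(-1, 1))
manual_model = LinearRegression().fit(income, scaled)
manual_preds = target_scaler.inverse_transform(manual_model.predict(income[:5]))

print(preds)
print(manual_preds.ravel())
```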

## Pipelines

In [None]:
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

# Store the result under a new name so the housing_num DataFrame is not overwritten
housing_num_prepared = num_pipeline.fit_transform(housing_num)

# This is an example of a basic pipeline

In [None]:
from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"),
                             StandardScaler())

housing_num_prepared = num_pipeline.fit_transform(housing_num)

# make_pipeline() builds the same pipeline but names the steps automatically
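
A short sketch of how `make_pipeline()` names its steps and how those names are used to reach a hyperparameter with the `step__hyperparameter` syntax (the tiny array `X_toy` is an assumption for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Step names are the lowercased class names
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Hyperparameters of a step are addressed as <step name>__<hyperparameter>
pipe.set_params(simpleimputer__strategy="mean")

X_toy = np.array([[1.0], [np.nan], [3.0]])  # hypothetical data
X_out = pipe.fit_transform(X_toy)           # impute with the mean, then standardize
print(X_out.ravel())
```

The same double-underscore naming is what hyperparameter search keys such as `random_forest__max_features` rely on.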

## Cross-Validation

In [None]:
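from sklearn.model_selection import cross_val_score

# tree_reg is assumed to be an already defined regressor (not shown in these notes)
tree_rmses = -cross_val_score(tree_reg, housing, housing_labels,
                              scoring="neg_root_mean_squared_error", cv=10)
# The scores are negative RMSEs (higher is better), hence the minus sign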
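
Since `tree_reg` is not defined in these notes, here is a self-contained sketch of the same call on synthetic data, with a plain `DecisionTreeRegressor` as a stand-in (the data and the names `X_toy`, `y_toy`, `toy_tree` are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
# Hypothetical regression data: two features, linear target plus noise
X_toy = rng.uniform(0.0, 10.0, size=(300, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0.0, 1.0, size=300)

toy_tree = DecisionTreeRegressor(random_state=42)

# 10 folds: train on 9, evaluate on the remaining one, 10 times
toy_rmses = -cross_val_score(toy_tree, X_toy, y_toy,
                             scoring="neg_root_mean_squared_error", cv=10)

print(toy_rmses)
print("mean:", toy_rmses.mean(), "std:", toy_rmses.std())
```

Looking at the standard deviation as well as the mean gives an idea of how precise the RMSE estimate is.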

## Modeling

In [None]:
from sklearn.ensemble import RandomForestRegressor

# preprocessing is assumed to be an already defined preprocessing transformer
# (not shown in these notes)
forest_reg = make_pipeline(preprocessing,
                           RandomForestRegressor(random_state=42))
forest_rmses = -cross_val_score(forest_reg, housing, housing_labels,
                                scoring="neg_root_mean_squared_error", cv=10)

## Fine-Tune Your Model

**1- Grid Search**

In [None]:
from sklearn.model_selection import GridSearchCV

full_pipeline = Pipeline([
    ("preprocessing", preprocessing),
    ("random_forest", RandomForestRegressor(random_state=42)),
])

# Keys follow the step__substep__hyperparameter naming convention
param_grid = [
    {'preprocessing__geo__n_clusters': [5, 8, 10],
     'random_forest__max_features': [4, 6, 8]},
    {'preprocessing__geo__n_clusters': [10, 15],
     'random_forest__max_features': [6, 8, 10]},
]

grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')

grid_search.fit(housing, housing_labels)

grid_search.best_params_

In [None]:
import pandas as pd

cv_res = pd.DataFrame(grid_search.cv_results_)
cv_res.sort_values(by="mean_test_score", ascending=False, inplace=True)
cv_res["rmse"] = -cv_res["mean_test_score"]  # the score is the negative RMSE
cv_res.head()

**2- Randomized Search**

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'preprocessing__geo__n_clusters': randint(low=3, high=50),
    'random_forest__max_features': randint(low=2, high=20),
}

rnd_search = RandomizedSearchCV(
    full_pipeline, param_distributions=param_distribs, n_iter=10, cv=3,
    scoring='neg_root_mean_squared_error', random_state=42
)

rnd_search.fit(housing, housing_labels)
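
Because `full_pipeline` depends on objects not shown in these notes, here is a self-contained sketch of a randomized search on synthetic data with a bare `RandomForestRegressor` (the data, the small `n_estimators` and the parameter ranges are assumptions chosen to keep it fast).

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(42)
# Hypothetical regression data with four features
X_toy = rng.uniform(0.0, 10.0, size=(200, 4))
y_toy = 2.0 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(0.0, 0.5, size=200)

toy_search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    # randint samples integers from low (inclusive) to high (exclusive)
    param_distributions={"max_features": randint(low=1, high=5),
                         "max_depth": randint(low=2, high=10)},
    n_iter=5, cv=3,
    scoring="neg_root_mean_squared_error", random_state=42,
)
toy_search.fit(X_toy, y_toy)

print(toy_search.best_params_)
print("best RMSE:", -toy_search.best_score_)
```

Unlike a grid search, the number of combinations tried is fixed by `n_iter`, no matter how many hyperparameters are being explored.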

<div class="alert alert-block alert-info">
        <b>Scikit-Learn also has HalvingRandomSearchCV and
HalvingGridSearchCV hyperparameter search classes. Their goal is to
use the computational resources more efficiently, either to train faster or to
explore a larger hyperparameter space.</b>
</div>

## Analyzing the Best Models and Their Errors

In [None]:
final_model = rnd_search.best_estimator_ 
# includes preprocessing

feature_importances = final_model["random_forest"].feature_importances_
feature_importances.round(2)

# Sort the importance scores in descending order and display them next to their corresponding attribute names:
sorted(zip(feature_importances,
           final_model["preprocessing"].get_feature_names_out()),
       reverse=True)
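
The same sorting idiom on a self-contained example, using synthetic data where the true importance of each feature is known in advance (the feature names and coefficients are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
names = ["signal", "weak_signal", "noise"]  # hypothetical feature names
X_toy = rng.normal(size=(500, 3))
# The target depends strongly on the first feature, weakly on the second,
# and not at all on the third
y_toy = 5.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(0.0, 0.1, size=500)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_toy, y_toy)

# Pair each importance score with its feature name, highest first
ranked = sorted(zip(forest.feature_importances_, names), reverse=True)
for score, name in ranked:
    print(f"{name}: {score:.2f}")
```

With this information, dropping some of the least useful features becomes an option to try.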