 # Hyperparameter Tuning - Feature engineering 

![nohayjupytersingif](https://media.giphy.com/media/jeDM590qtCP9C/giphy.gif)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Feature-engineering" data-toc-modified-id="Feature-engineering-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Feature engineering</a></span></li><li><span><a href="#Small-exploration-of-the-data" data-toc-modified-id="Small-exploration-of-the-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Small exploration of the data</a></span><ul class="toc-item"><li><span><a href="#We-look-at-the-&quot;Cabin&quot;-feature" data-toc-modified-id="We-look-at-the-&quot;Cabin&quot;-feature-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>We look at the "Cabin" feature</a></span></li><li><span><a href="#We-analyze-the-names-of-the-passengers" data-toc-modified-id="We-analyze-the-names-of-the-passengers-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>We analyze the names of the passengers</a></span></li><li><span><a href="#In-order-not-to-have-many-categories-with-the-titles,-we-are-going-to-keep-those-that-have-more-than-40" data-toc-modified-id="In-order-not-to-have-many-categories-with-the-titles,-we-are-going-to-keep-those-that-have-more-than-40-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>In order not to have many categories with the titles, we are going to keep those that have more than 40</a></span></li></ul></li><li><span><a href="#Categorical-encoding" data-toc-modified-id="Categorical-encoding-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Categorical encoding</a></span><ul class="toc-item"><li><span><a href="#Label-Encoder" data-toc-modified-id="Label-Encoder-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Label Encoder</a></span></li><li><span><a href="#One-Hot-Encoder:-get-dummies" data-toc-modified-id="One-Hot-Encoder:-get-dummies-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>One Hot Encoder: get dummies</a></span></li><li><span><a href="#By-hand-with-a-dictionary-ðŸ’¡" data-toc-modified-id="By-hand-with-a-dictionary-ðŸ’¡-3.3"><span 
class="toc-item-num">3.3&nbsp;&nbsp;</span>By hand with a dictionary ðŸ’¡</a></span></li></ul></li><li><span><a href="#Feature-Scaling" data-toc-modified-id="Feature-Scaling-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Feature Scaling</a></span><ul class="toc-item"><li><span><a href="#Standardization" data-toc-modified-id="Standardization-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Standardization</a></span></li><li><span><a href="#Normalization" data-toc-modified-id="Normalization-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Normalization</a></span></li><li><span><a href="#I-quote-Andriy-Burkov:" data-toc-modified-id="I-quote-Andriy-Burkov:-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>I quote Andriy Burkov:</a></span></li></ul></li><li><span><a href="#Let's-review:-Train-Test-Split" data-toc-modified-id="Let's-review:-Train-Test-Split-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Let's review: Train-Test Split</a></span></li><li><span><a href="#Hyperparameter-Tuning" data-toc-modified-id="Hyperparameter-Tuning-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Hyperparameter Tuning</a></span><ul class="toc-item"><li><span><a href="#--Random-sampling" data-toc-modified-id="--Random-sampling-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>- Random sampling</a></span></li><li><span><a href="#--Grid-Sampling" data-toc-modified-id="--Grid-Sampling-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>- Grid Sampling</a></span></li><li><span><a href="#--Bayesian-sampling" data-toc-modified-id="--Bayesian-sampling-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>- Bayesian sampling</a></span></li><li><span><a href="#GridSearchCV-by-sklearn,-say-hello-to-your-new-friend!" 
data-toc-modified-id="GridSearchCV-by-sklearn,-say-hello-to-your-new-friend!-6.4"><span class="toc-item-num">6.4&nbsp;&nbsp;</span>GridSearchCV by sklearn, say hello to your new friend!</a></span></li><li><span><a href="#We-would-train-the-model-with-the-best-parameters" data-toc-modified-id="We-would-train-the-model-with-the-best-parameters-6.5"><span class="toc-item-num">6.5&nbsp;&nbsp;</span>We would train the model with the best parameters</a></span></li></ul></li><li><span><a href="#Save-/-Export-the-model" data-toc-modified-id="Save-/-Export-the-model-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Save / Export the model</a></span></li></ul></div>

## Feature engineering
Feature engineering is the process of using domain knowledge to extract features from raw data. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be thought of as applied machine learning itself.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Small exploration of the data

### We look at the "Cabin" feature

There are many missing values, but we should still use the Cabin variable because it can be an important predictor. As you can see in the picture below, first class had cabins on decks A, B, or C, decks D and E held a mix of classes, and third class was mostly on decks F or G. We can identify the deck by the first letter of the cabin code.

![laimagendelbarco](images/barco.png)
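
A minimal sketch of pulling the deck letter out of the cabin code. The rows are made up for illustration, the `Deck` column name is our own, and using "M" as a placeholder for missing cabins is an assumption, not a real deck.

```python
import pandas as pd

# Toy rows standing in for the "Cabin" column (values are illustrative).
df = pd.DataFrame({"Cabin": ["C85", None, "E46", "B57 B59", None]})

# The deck is the first character of the cabin code; passengers with no
# recorded cabin get the placeholder "M" (assumed convention for "missing").
df["Deck"] = df["Cabin"].str[0].fillna("M")

print(df["Deck"].tolist())
```

Keeping the missing cabins as their own category, instead of dropping those rows, lets the model use the fact that the cabin was not recorded.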

The name can give us important information about a passenger's socioeconomic status. Depending on that status, a passenger may have been able to buy a more or less expensive ticket, which in turn determines where on the ship their cabin was located. We can also tell from the name whether someone is married or holds a formal title, and extract that information to generate a new variable.

### We analyze the names of the passengers
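
One possible way to extract the title, assuming the names follow the "Surname, Title. Given names" pattern. The sample names and the regular expression are illustrative choices, not necessarily what the original notebook used.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format (assumed pattern).
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())
```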

### In order not to have many categories with the titles, we are going to keep those that have more than 40
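
A sketch of how rare titles could be collapsed, reading "more than 40" as more than 40 passengers sharing the title. The toy data uses a threshold of 2 so the effect is visible, and the "Other" label is a hypothetical name for the catch-all category.

```python
import pandas as pd

# Toy title column; on the full dataset the threshold in the text is 40.
titles = pd.Series(["Mr"] * 5 + ["Miss"] * 3 + ["Dr", "Rev"])
min_count = 2  # illustrative threshold for this small sample

# Count each title and keep only those above the threshold.
counts = titles.value_counts()
frequent = counts[counts > min_count].index

# Every title that is not frequent enough becomes "Other" (hypothetical label).
grouped = titles.where(titles.isin(frequent), "Other")

print(grouped.value_counts().to_dict())
```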

Only the Age column still has nulls. Let's fill them in, but first let's explore the data: are men, on average, the same age as women?

To be a little more precise, we are going to fill the missing ages with the median, computed per gender and per deck.
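
A minimal sketch of a group-wise median fill. The column names `Sex`, `Deck` and `Age` and the values are assumptions made for the example.

```python
import pandas as pd

# Toy data: one missing age in each (Sex, Deck) group.
df = pd.DataFrame({
    "Sex":  ["male", "male", "male", "female", "female", "female"],
    "Deck": ["C", "C", "C", "M", "M", "M"],
    "Age":  [40.0, 50.0, None, 20.0, 30.0, None],
})

# Median age of each (Sex, Deck) group, aligned row by row with df.
group_median = df.groupby(["Sex", "Deck"])["Age"].transform("median")

# Only the missing ages are replaced; known ages are left untouched.
df["Age"] = df["Age"].fillna(group_median)

print(df["Age"].tolist())
```

If a group has no known ages at all, its median is also missing, so those rows would still need a fallback such as the overall median.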

## Categorical encoding

With a single import line, new options open up for modeling or manipulating our dataset.

Previously, in the dummies post, we discussed using the get_dummies() function to transform each non-numeric value into a binary representation (expanding our dataset by the number of distinct values that exist in a column).

### Label Encoder
Pros and cons
- If we have categories that carry an inherent order, such as "good, bad, regular", the ideal is not to let LabelEncoder assign the codes automatically but to assign them manually, since the values we choose can influence the weight the algorithm gives to those variables.

In [None]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
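
To see what LabelEncoder does on its own, here is a small sketch on a made-up list of deck letters. Note that the codes follow the sorted order of the labels, not any order we care about.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative deck letters (made-up sample).
decks = ["C", "E", "M", "C", "A"]

encoder = LabelEncoder()
codes = encoder.fit_transform(decks)

# classes_ holds the labels in sorted order; each code is an index into it.
print(list(encoder.classes_))
print(list(codes))
```

`encoder.inverse_transform(codes)` maps the integers back to the original labels.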

```python
new_dict = {'M': 2, 
            'C': 7, 
            'E': 5, 
            'G': 3, 
            'D': 6, 
            'A': 9, 
            'B': 8, 
            'F': 4, 
            'T': 2}
```

### One Hot Encoder: get dummies

Now, as we have already mentioned, depending on the data we have, encoding the labels as integers can confuse our model by making it believe that a column has some kind of order or hierarchy when it clearly does not. To avoid this, we "OneHotEncode" that column.
One-hot encoding takes a column of categorical data, which may already have been encoded with labels, and splits it into multiple columns, one per category. The values are replaced by 1s and 0s, depending on which column corresponds to the row's category.

- Label encoder: 0, 1, 2, 3, 4, 5, 6
- One-hot encoder or get_dummies: 0, 1, 0, 0, 0 (one column for each category)
- Hierarchy: A > B > C > D
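
A small sketch of one-hot encoding with `pd.get_dummies`. The `Embarked` column and its values are only an illustration.

```python
import pandas as pd

# Illustrative categorical column with three distinct values.
df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new 0/1 column per category; the original column is dropped.
dummies = pd.get_dummies(df, columns=["Embarked"], dtype=int)

print(dummies)
```

Each row has exactly one 1, so no order between the categories is implied.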

### By hand with a dictionary 💡
We can give a numerical value to each category and decide its importance.
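
A minimal sketch using the "good, bad, regular" example mentioned earlier. The numeric values in the dictionary are our own choice, which is exactly the point of doing it by hand.

```python
import pandas as pd

# Illustrative ordinal column.
ratings = pd.Series(["good", "bad", "regular", "good"])

# We decide the order ourselves: bad < regular < good (assumed values).
rating_order = {"bad": 0, "regular": 1, "good": 2}

encoded = ratings.map(rating_order)

print(encoded.tolist())
```

Any category missing from the dictionary becomes NaN after `map`, so it is worth checking for nulls afterwards.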

## Feature Scaling

Some algorithms, especially those based on distance calculations, give more weight to features with large numeric ranges, treating them as artificially more important. For these algorithms, it is important to scale features whose natural scales differ, so that the algorithm uses them without artificial overweighting and features on different scales can be compared.

Algorithms that do not require normalization/scaling are rule-based. They are not affected by any monotonic transformation of the variables, and scaling is a monotonic transformation.
Examples of algorithms in this category are all tree-based algorithms:
- CART
- Random Forests
- Gradient Boosted Decision Trees

These algorithms use rules (series of inequalities) and do not require normalization.

We are going to explore two types of feature scaling: standardization and normalization.

For further reading:
- https://towardsdatascience.com/all-about-feature-scaling-bcc0ad75cb35

### Standardization
In standardization, we impose several statistical properties on the variable: the mean value is set to 0, and the standard deviation is set to 1. This is achieved by subtracting the mean from each feature value and dividing by the standard deviation. This is also sometimes called "z-score normalization."

So what does this mean in practice for standardized data? Both variables now have distributions centered on a mean of zero, with a standard deviation of 1. Since we are enforcing this standard deviation, standardization reduces the effect of outliers on the feature. In addition, it allows us to compare two features with different scales or units. The different scales of the features are reflected statistically in differences in both the mean and the standard deviation. Standardizing these two statistics across features removes the influence of these scale differences.

Standardization is especially important in situations where we use algorithms that assume features in our data are distributed along a 'bell curve' or Gaussian distribution, such as linear and logistic regression.

In [None]:
from sklearn.preprocessing import StandardScaler
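
A small sketch of standardization on a made-up array, comparing `StandardScaler` with the formula from the text, z = (x - mean) / std, applied column by column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Same computation by hand: subtract the column mean, divide by the column std.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std)
print(np.allclose(X_std, manual))
```

In a real workflow the scaler is fitted on the training set only and then applied to the test set with `scaler.transform`, so that no information from the test set leaks into training.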

### Normalization

In the other form of feature scaling, called normalization, the feature is rescaled to a range between 0 and 1, without any change to its original distribution within that range. Mathematically, this is achieved by subtracting the minimum feature value from each feature value, and dividing by the difference between the largest value and the smallest value.

Since we compute the normalized value using the maximum and minimum values of the feature, this technique is sometimes called "min-max normalization."
Normalization is most useful in cases where your data has few outliers but highly variable ranges, you don't know how your data is distributed, or you know that it is not distributed on a bell (Gaussian) curve. It is generally applied with algorithms that make no assumptions about the distributions of the features.

In [None]:
# standardization: normality is assumed
# normalization: when it is not assumed

In [None]:
# This one we would normalize

![image-2.png](attachment:image-2.png)

In [None]:
# this one we wouldn't have to

![image.png](attachment:image.png)

In [None]:
from sklearn.preprocessing import MinMaxScaler
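
A minimal sketch of min-max normalization on a made-up column, checked against the formula from the text, (x - min) / (max - min).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One illustrative feature.
X = np.array([[1.0], [3.0], [5.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_norm = scaler.fit_transform(X)

# Same computation by hand.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_norm.ravel())
```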

CATEGORICAL
    Encoding: 
        - OneHotEncoder / get_dummies
        - LabelEncoder 
        - Dictionaries 
        
NUMERICAL
    - Standardize: 
        mean = 0, std = 1
    - Normalize: 
        range 0-1

### I quote Andriy Burkov:
You may be wondering when normalization should be used and when standardization. There is no definitive answer to this question. Usually, if your dataset isn't too big and you have time, you can try both and see which one suits your task better.
If you don't have time to run multiple experiments, as a general rule:

- Unsupervised learning algorithms, in practice, benefit more often from standardization than from normalization.
- Standardization is also preferable for a feature if the values it takes are distributed close to a normal distribution (the so-called bell curve).
- Again, standardization is preferable for a feature if it can sometimes have extremely high or low values (outliers); this is because normalization will "squeeze" the normal values into a very small range.
- In all other cases, normalization is preferable.

Feature rescaling is usually beneficial for most learning algorithms. However, modern implementations of learning algorithms, which can be found in popular libraries, are robust to features found in different ranges.

QUANTITATIVE
    - Standardize
    - Normalize


## Let's review: Train-Test Split
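
As a reminder of the workflow, here is a sketch of a split followed by a baseline model. It runs on a synthetic dataset from `make_classification`, and the variable names (`X_demo`, `X_tr`, `y_hat`, ...) are hypothetical so they do not clash with the notebook's own `X_train`, `y_test` and `rfc`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared dataset (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 20% of the rows for testing, keeping the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Baseline model trained only on the training split.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

print(X_tr.shape, X_te.shape)
```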

In [None]:
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score

In [None]:
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print("-------------")
print(f"F1 score: {f1_score(y_test, y_pred)}")

In [None]:
import numpy as np

# Rank the features by the importance the fitted random forest assigns to them
feature_names = list(X_train.columns)
importances = rfc.feature_importances_
indices = np.argsort(importances)[::-1]  # feature indices, most important first

plt.figure(figsize=(10, 6))
plt.title("Feature Importances")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train.shape[1]), [feature_names[i] for i in indices], rotation=90)
plt.xlabel("Features")
plt.ylabel("Importance")
plt.tight_layout()
plt.show()

## Hyperparameter Tuning

What is hyperparameter tuning?
Hyperparameters are tunable parameters that allow you to control the process of training a model. For example, with neural networks, you can decide the number of hidden layers and the number of nodes in each layer. The performance of a model depends heavily on hyperparameters.
Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the hyperparameter configuration that produces the best performance. Typically, the process is manual and computationally expensive.

There are different techniques for sampling the hyperparameter search space:
    
### - Random sampling
Random sampling supports discrete and continuous hyperparameters, and it supports early termination of underperforming runs. Some users perform an initial search with random sampling and then narrow the search space to improve results.
In random sampling, hyperparameter values are randomly selected from the defined search space.
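
One way to try random sampling locally is scikit-learn's `RandomizedSearchCV`, swapped in here purely for illustration: it evaluates every sampled candidate in full, so it does not show the early termination mentioned above. The dataset is synthetic and the search space is an arbitrary small example.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the real training set (assumption).
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

# A distribution for one hyperparameter and a discrete list for the other.
param_distributions = {
    "n_estimators": randint(10, 40),
    "max_depth": [3, 5, None],
}

# n_iter is the number of random configurations to draw and evaluate.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```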

### - Grid Sampling
Grid sampling supports discrete hyperparameters. Use grid sampling if your budget allows you to search the search space exhaustively. It also supports early termination of underperforming runs.

### - Bayesian sampling
Bayesian sampling is based on the Bayesian optimization algorithm. It picks samples based on how the previous ones performed, so that the new samples improve the main metric.
For best results, it is recommended that the maximum number of runs be greater than or equal to 20 times the number of hyperparameters being optimized.
The number of concurrent runs affects the efficiency of the tuning process. Fewer concurrent runs can lead to better sampling convergence, since the lower degree of parallelism increases the number of runs that benefit from previously completed runs.

We are going to look at grid hyperparameter tuning with GridSearchCV, but I leave it to you to investigate Bayesian sampling with [HyperOpt](https://towardsdatascience.com/hyperopt-hyperparameter-tuning-based-on-bayesian-optimization-7fa32dffaf29).

### GridSearchCV by sklearn, say hello to your new friend!
And read the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

![](https://najeesmith.github.io/images/Classifiers/RF/header.png)

In [None]:
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print("-------------")
print(f"F1 score: {f1_score(y_test, y_pred)}")

In [None]:
# Second model

rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)


print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print("-------------")
print(f"F1 score: {f1_score(y_test, y_pred)}")

In [None]:
# Third model

rfc = RandomForestClassifier(n_estimators=200, max_depth=10)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)


print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print("-------------")
print(f"F1 score: {f1_score(y_test, y_pred)}")

In [None]:
# Fourth model

rfc = RandomForestClassifier(n_estimators=200, max_depth=10, min_samples_leaf=4)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)


print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print("-------------")
print(f"F1 score: {f1_score(y_test, y_pred)}")

In [None]:
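# 'auto' is no longer accepted for max_features in recent scikit-learn releases
# (for classifiers it was equivalent to 'sqrt'), so 'log2' is searched instead.
# Note: this grid has 3,960 combinations, each fitted once per CV fold.
parameters = {'bootstrap': [True, False],
     'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
     'max_features': ['sqrt', 'log2'],
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10],
     'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}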

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

In [None]:
gs = GridSearchCV(rfc, parameters)
gs.fit(X_train, y_train)

### We would train the model with the best parameters

If GridSearchCV() is set to refit=True, after identifying the best hyperparameters, the model is retrained with them and stored in .best_estimator_.
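
A small sketch of that behavior on synthetic data. The grid is deliberately tiny so it runs in seconds, and the names `gs_demo`, `small_grid` and `best_model` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real training set (assumption).
X_demo, y_demo = make_classification(n_samples=120, n_features=5, random_state=0)

# 2 x 2 = 4 combinations, each evaluated with 3-fold cross-validation.
small_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

# refit=True is the default: the best configuration is retrained on all the data.
gs_demo = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)
gs_demo.fit(X_demo, y_demo)

best_model = gs_demo.best_estimator_   # already fitted, ready to predict
preds = best_model.predict(X_demo[:5])

print(gs_demo.best_params_)
print(gs_demo.best_score_)
```

`best_score_` is the mean cross-validated score of the winning configuration, so it is still worth checking the metrics on the held-out test set.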

In [None]:
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score

In [None]:
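# Predict with the model that GridSearchCV refit on the best hyperparameters
y_pred = gs.best_estimator_.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred)}")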
print(f"Precision: {precision_score(y_test, y_pred)}")
print(f"Recall: {recall_score(y_test, y_pred)}")
print("-------------")
print(f"F1 score: {f1_score(y_test, y_pred)}")

https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/

## Save / Export the model
https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
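
A minimal sketch of saving and reloading a fitted model with `joblib`. The model is trained on synthetic data, and the file name `rfc_model.joblib` and the temporary directory are placeholders for wherever you want to keep the file.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small fitted model to export (synthetic data, assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "rfc_model.joblib")  # placeholder path
    joblib.dump(model, model_path)      # save the fitted model to disk
    restored = joblib.load(model_path)  # load it back into memory

# The restored model should predict exactly like the original one.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Only load model files from sources you trust, since loading executes pickled code, and reload them with the same scikit-learn version that saved them where possible.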

# RECAP