![](../src/logo.svg)

**© Jesús López**

Ask him any question on **[Twitter](https://twitter.com/jsulopz)** or **[LinkedIn](https://linkedin.com/in/jsulopz)**

<a href="https://colab.research.google.com/github/jsulopz/resolving-machine-learning/blob/main/04_Hyperparameter%20Tuning%20with%20Cross%20Validation/04_cross-validation_practice_solution.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## Load the Data

- We take a dataset from the _[UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/adult)_
- The aim is to predict whether a **person** (rows) `earned>50k` a year or not
- Based on their **socio-demographic features** (columns)

PS: You may see the column names & meanings [here ↗](https://archive.ics.uci.edu/ml/datasets/adult).

In [None]:
import pandas as pd
pd.set_option("display.max_columns", None)

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
df_salary = pd.read_csv(url, header=None, na_values=' ?')
df_salary.rename(columns={14: 'target'}, inplace=True)
df_salary.columns = [str(i) for i in df_salary.columns]
df_salary

## Preprocess the Data

In [None]:
df_salary.isna().sum().sum()

In [None]:
df_salary = df_salary.dropna()
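
The missing-value handling above can be checked on a tiny sample. This is a minimal sketch: the rows below are made up for illustration (they are not records from the dataset) and only mimic the "comma followed by a space" layout, which is why the missing marker is `' ?'` with a leading space. The names `sample_csv`, `df_sample`, and `df_clean` are hypothetical.

```python
import io
import pandas as pd

# Made-up rows in the same "comma + space" layout as adult.data (not real records).
sample_csv = (
    "39, State-gov, 40, <=50K\n"
    "54, ?, 60, >50K\n"
    "28, Private, 45, >50K\n"
)

# ' ?' (with the leading space) is the text that marks a missing value in this layout.
df_sample = pd.read_csv(io.StringIO(sample_csv), header=None, na_values=' ?')

n_missing = df_sample.isna().sum().sum()  # total number of missing cells
df_clean = df_sample.dropna()             # drop every row with at least one NaN

print(n_missing)
print(df_sample.shape, '->', df_clean.shape)
```

Only the row containing ` ?` is dropped; the other rows are kept unchanged.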

In [None]:
df_salary = pd.get_dummies(data=df_salary, drop_first=True)
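
To see where a column name such as `target_ >50K` comes from, here is a small sketch of `pd.get_dummies()` with `drop_first=True` on a toy table. The rows and the column names `age`, `workclass`, and `target` are assumptions made for the illustration; the notebook itself keeps numeric column names for the features.

```python
import pandas as pd

# Made-up rows; the text values keep the leading space found in adult.data.
df_toy = pd.DataFrame({
    'age': [39, 54, 28],
    'workclass': [' State-gov', ' Private', ' Private'],
    'target': [' <=50K', ' >50K', ' >50K'],
})

# Numeric columns pass through; each text column becomes one indicator column
# per category, and drop_first=True removes the first category of each.
df_dummies = pd.get_dummies(data=df_toy, drop_first=True)
dummy_columns = list(df_dummies.columns)

print(dummy_columns)
```

With two target categories, dropping the first one (` <=50K`) leaves a single indicator column for ` >50K`, which is the column used as `y` in the next section.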

## Feature Selection

In [None]:
df_salary

In [None]:
X = df_salary.drop(columns='target_ >50K')
y = df_salary['target_ >50K']

## `train_test_split()` the Data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
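
As a quick sanity check of what the split produces, the sketch below runs `train_test_split()` with the same `test_size` and `random_state` on synthetic stand-in data (random numbers, not the salary data). The names `X_demo`, `y_demo`, `X_tr`, and `X_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Roughly one third of the rows go to the test set; the rest are for training.
print(X_tr.shape, X_te.shape)
```

Every row ends up in exactly one of the two sets, so the test rows are never seen during `fit()`.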

## `DecisionTreeClassifier()` with Default Hyperparameters

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model_dt = DecisionTreeClassifier()

In [None]:
model_dt.fit(X=X_train, y=y_train)

### Accuracy

#### > In `train` data

In [None]:
model_dt = DecisionTreeClassifier(criterion='gini', max_depth=3, max_leaf_nodes=5)

In [None]:
model_dt.fit(X=X_train, y=y_train)

In [None]:
model_dt.predict(X=X_train)

In [None]:
model_dt.score(X=X_train, y=y_train)

#### > In `test` data

In [None]:
model_dt.score(X=X_test, y=y_test)

### Model Visualization

In [None]:
from sklearn.tree import plot_tree

In [None]:
# plot_tree(model_dt, feature_names=list(X.columns), filled=True);

## Interpretation

- [ ] Why is the difference in accuracy between `train` and `test` data so large?
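
One way to explore this question is to compare an unrestricted tree with a restricted one. The sketch below uses synthetic data from `make_classification()` as a stand-in for the salary data (an assumption, so the exact numbers will differ from the notebook's); `tree_full` and `tree_small` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the salary data; flip_y adds label noise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Unrestricted tree: keeps splitting until every training leaf is pure.
tree_full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Restricted tree: the same limits used earlier in the notebook.
tree_small = DecisionTreeClassifier(
    max_depth=3, max_leaf_nodes=5, random_state=0).fit(X_tr, y_tr)

for name, tree in [('full', tree_full), ('small', tree_small)]:
    gap = tree.score(X_tr, y_tr) - tree.score(X_te, y_te)
    print(name, 'train-test gap:', round(gap, 3))
```

The unrestricted tree memorizes the training rows, noise included, so its training accuracy says little about how it behaves on unseen rows.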

## `DecisionTreeClassifier()` with Custom Hyperparameters

In [None]:
model_dt.get_params()

### 1st Configuration

#### Accuracy

##### > In `train` data

In [None]:
model_dt.score(X=X_train, y=y_train)

##### > In `test` data

In [None]:
model_dt.score(X=X_test, y=y_test)

#### Model Visualization

### 2nd Configuration

### 3rd Configuration

### 4th Configuration

### 5th Configuration

## `GridSearchCV()` to find Best Hyperparameters

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
model_dt = DecisionTreeClassifier()

In [None]:
param_grid_dt = {
    'max_depth': [None, 2, 3],
    'min_samples_leaf': [1, 10, 20],
    'criterion': ['gini', 'entropy'],
}

In [None]:
cv_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid_dt, verbose=2)

In [None]:
cv_dt.fit(X=X_train, y=y_train)

In [None]:
cv_dt.best_estimator_
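
Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_`, `best_score_`, and `cv_results_`. The sketch below repeats the same grid on synthetic stand-in data (an assumption; the names `search` and `param_grid` are hypothetical) to show how these attributes might be inspected and how the refitted best model can be scored on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the salary data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

param_grid = {
    'max_depth': [None, 2, 3],
    'min_samples_leaf': [1, 10, 20],
    'criterion': ['gini', 'entropy'],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param_grid)
search.fit(X_tr, y_tr)

# One entry per hyperparameter combination: 3 * 3 * 2 of them.
n_candidates = len(search.cv_results_['params'])
print(n_candidates, 'combinations tried')
print(search.best_params_)                 # winning combination
print(round(search.best_score_, 3))        # its mean cross-validated accuracy
print(round(search.score(X_te, y_te), 3))  # accuracy of the refitted model on test data
```

`best_score_` is an average over validation folds of the training data, so the test score is still the fairer estimate for unseen data.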

## Other Models

### Support Vector Machines `SVC()`

https://jakevdp.github.io/PythonDataScienceHandbook/05.07-support-vector-machines.html

In [None]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/efR1C6CvhmE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
from sklearn.svm import SVC

In [None]:
model_sv = SVC()

In [None]:
model_sv.get_params()

In [None]:
param_grid_sv = {
    'C': [0.1, 1],
    'kernel': ['linear', 'rbf'],
}

In [None]:
cv_sv = GridSearchCV(estimator=model_sv, param_grid=param_grid_sv, verbose=2)

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(X=X_train)

In [None]:
#cv_sv.fit(X=X_train, y=y_train)
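
The cells above fit the scaler but never apply it before the grid search. One possible way to combine both steps, sketched here on synthetic stand-in data, is to put `MinMaxScaler()` and `SVC()` in a pipeline so the scaler is refitted on each training fold. This is an alternative arrangement, not the notebook's own code; `pipe_sv`, `param_grid_pipe`, and `cv_pipe` are hypothetical names, and the `svc__` prefix comes from the step name that `make_pipeline()` generates.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the salary data (kept small because SVC is slow on large data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Scaling happens inside the pipeline, so each CV fold is scaled using only its own training part.
pipe_sv = make_pipeline(MinMaxScaler(), SVC())

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid_pipe = {
    'svc__C': [0.1, 1],
    'svc__kernel': ['linear', 'rbf'],
}

cv_pipe = GridSearchCV(estimator=pipe_sv, param_grid=param_grid_pipe)
cv_pipe.fit(X_demo, y_demo)

print(cv_pipe.best_params_)
```

Keeping the scaler inside the pipeline avoids leaking information from the validation folds into the scaling step.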

### K Nearest Neighbors `KNeighborsClassifier()`

In [None]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/HVXime0nQeI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
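
The same `GridSearchCV()` pattern can be applied to `KNeighborsClassifier()`. The sketch below is only a starting point on synthetic stand-in data; the grid values and the names `model_kn`, `param_grid_kn`, and `cv_kn` are assumptions chosen for the illustration, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the salary data.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

model_kn = KNeighborsClassifier()

# Illustrative grid: how many neighbors vote, and whether closer neighbors weigh more.
param_grid_kn = {
    'n_neighbors': [3, 5, 11],
    'weights': ['uniform', 'distance'],
}

cv_kn = GridSearchCV(estimator=model_kn, param_grid=param_grid_kn)
cv_kn.fit(X_demo, y_demo)

print(cv_kn.best_params_)
```

Because K Nearest Neighbors relies on distances, scaling the features first (as with `SVC()`) is usually worth considering on the real data.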

# Best Model with Best Hyperparameters

# Achieved Goals

_Double click on **this cell** and place an `X` inside the square brackets (i.e., [X]) if you think you understand the goal:_

- [ ] Even a model can be improved
- [ ] The goal is to build models that achieve better accuracy on unseen data
    - Banks want to know whether a **future client** will be able to repay a loan
    - Not a past client
    - Unfortunately, we do not have data for future clients
    - So we work around this with `Data Splitting` into
        - Train
            - Validation folds (cross-validation)
        - Test
- [ ] Understand the applications of Machine Learning to business
    - To predict whether a customer will repay a loan
    - To predict whether an athlete will get injured
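
The data splitting listed in the goals above (train, validation folds within train, and test) can be sketched as follows. The data is a synthetic stand-in and the names `fold_scores` and `test_score` are hypothetical; `cross_val_score()` is used here as a simpler cousin of the `GridSearchCV()` calls earlier in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the salary data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 1: set the test data aside; it plays the role of "future clients".
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42)

# Step 2: inside the training data, rotate 5 validation folds.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
fold_scores = cross_val_score(model, X_tr, y_tr, cv=5)
print(fold_scores.round(3))

# Step 3: fit on all the training data and check once on the untouched test data.
model.fit(X_tr, y_tr)
test_score = model.score(X_te, y_te)
print(round(test_score, 3))
```

Hyperparameters are chosen by looking at the fold scores only, so the test score remains an honest estimate for data the model has not seen.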