# Day 3 - Practical ✍
## Advanced ML Workshop
---
## Titanic Dataset
Welcome to the third practical session of SPAI's Advanced Machine Learning Workshop. In this practical, you will practise model selection and hyperparameter tuning, as well as some useful feature selection techniques. Finally, you will learn how to save your model in a `.pkl` file so that you can work on it later. The goal for this dataset is to predict whether a passenger on the Titanic survived.

## ⚙ *Basic Initialization*
Please do not make any changes to this section.   
This section is required for the demo to work and not running it would result in errors 🚨.   
However, if you would like to understand the code, feel free to do so 😊

In [None]:
# Importing necessary libraries
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore') # Filter out warnings

In [None]:
titanic = pd.read_csv('https://raw.githubusercontent.com/SPAI-Team/Advanced-ML-Workshop-2021/main/Datasets/titanic_cleaned.csv', index_col=0)

In [None]:
X, y = titanic.drop(columns=['survived']), titanic['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
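
The split above is not seeded, so the scores you get in this notebook will differ slightly from run to run. As a side note only (please leave the cell above unchanged), here is a minimal sketch of a reproducible, stratified split. The synthetic `X_demo` and `y_demo` arrays and the seed value `42` are assumptions made for illustration; they are not the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the Titanic dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With stratification, the share of positive labels is the same in the training and test parts, which makes scores easier to compare.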

---

## Exercise 1️⃣
In this exercise, we will be attempting to select the best model for this dataset.

> Tip 💡: Try looping through all the models! Feel free to refer to the tutorials.

#### Task
1. Define a `models` dictionary that uses the name of each model as the key and an instance of that model as the value
    - For example, the `Logistic Regression` key should map to `LogisticRegression()` 
    - Feel free to refer to Demonstration 1️⃣

In [None]:
# Test the following models. Evaluate which models perform the best 💯!
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# Ensemble models are usually tested too. However, they would take far too long to train here!
# from sklearn.ensemble import ExtraTreesClassifier, AdaBoostClassifier

### Write your code here ###
# Declare a Dictionary of models with their name
models = {
    'Logistic Regression': _____,
    'Decision Tree': _____,
    'K Neighbours': _____,
    'SVC': _____
}
### End ###

#### Task
1. Import `cross_validate` from `sklearn.model_selection`
2. Inside the for loop, apply cross validation with the `cross_validate` function you just imported and save the result in the variable `results`. Pass the following to `cross_validate`:
    - the model, `models[model]`
    - `X_train` and `y_train`
    - `cv=5`, `scoring=scoring_metrics`, `return_train_score=True`
    - Set `n_jobs` to `-1` for faster training speeds
3. Store each result under its matching key in the `scores` dictionary
    - The pattern is as follows, `scores[`*key*`].append(np.mean(results[`*key with underscore*`]))`
    - Feel free to refer to Demonstration 1️⃣
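
To make the key pattern above concrete, here is a minimal sketch of the same loop on synthetic data. The two estimators (`GaussianNB`, `RidgeClassifier`), the `demo_` names, and the generated dataset are assumptions chosen for illustration so that the exercise models are left for you to fill in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data (hypothetical stand-in)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

demo_models = {
    'Gaussian NB': GaussianNB(),
    'Ridge Classifier': RidgeClassifier(),
}
demo_scores = {'model': [], 'train accuracy': [], 'test accuracy': [], 'fit_time': []}

for name, estimator in demo_models.items():
    # One entry per metric and per split: 'train_accuracy', 'test_f1', ...
    results = cross_validate(estimator, X_demo, y_demo, cv=5,
                             scoring=['accuracy', 'f1'], return_train_score=True)
    demo_scores['model'].append(name)
    demo_scores['train accuracy'].append(np.mean(results['train_accuracy']))
    demo_scores['test accuracy'].append(np.mean(results['test_accuracy']))
    demo_scores['fit_time'].append(np.mean(results['fit_time']))

print(sorted(results.keys()))
```

Each value in `results` is an array with one number per fold, which is why the loop averages it with `np.mean` before appending.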

In [None]:
scoring_metrics = ['accuracy', 'f1']

scores = {
    'model': [],
    'train accuracy': [],
    'test accuracy': [],
    'train f1': [],
    'test f1': [],
    'fit_time': [],
}


### Write your code here ###
# Import the cross_validate function from sklearn's model_selection module
from sklearn._____ import _____

# For loop through the models
for model in models.keys():
    # Initialize your results
    results = _____(_____, _____, _____, cv=_____, scoring=_____, _____)



    # Append the corresponding results into scores
    scores['model'].append(model)
    scores[_____].append(np.mean(results[_____])) # For train accuracy
    _____[_____].append(np.mean(_____[_____])) # For test accuracy
    _____.append(np.mean(_____)) # For train f1
    _____.append(_____) # For test f1
    _____ # For fit time

### End ###


# Allows us to view scores in a nice format
pd.DataFrame(scores)

|   | model | train accuracy | test accuracy | train f1 | test f1 | fit_time |
|---|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.790452 | 0.791579 | 0.620912 | 0.61796 | 0.017629 |
| 1 | Decision Tree | 0.95372 | 0.733228 | 0.91894 | 0.543349 | 0.004807 |
| 2 | K Neighbours | 0.831996 | 0.775532 | 0.684899 | 0.572893 | 0.003501 |
| 3 | SVC | 0.784987 | 0.78428 | 0.615312 | 0.613222 | 0.012253 |


---

## Exercise 2️⃣
In this exercise, you will learn how to use both `GridSearchCV` and `RandomizedSearchCV`.

In [None]:
dt = DecisionTreeClassifier()

### Grid Search Cross Validation
#### Task
1. Import `GridSearchCV` from `sklearn.model_selection`
2. Get the params from the Decision Tree Classifier
    - Do note that the `DecisionTreeClassifier()` is already defined with the `dt` variable
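
Every scikit-learn estimator exposes its configurable settings as a dictionary. The sketch below shows this on a different estimator, `KNeighborsClassifier`, so the exercise itself stays open. The variable names are chosen for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)

# get_params() returns a dict mapping parameter names to current values
knn_params = knn.get_params()
print(knn_params['n_neighbors'])  # 7
print(sorted(knn_params))

# set_params() changes values on an existing instance
knn.set_params(weights='distance')
```

The keys of this dictionary are exactly the names you can use as keys in a search grid.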

In [None]:
### Write your code here ###
# Import GridSearchCV from sklearn under model_selection
from sklearn._____ import _____

# Get the parameters you can configure
dt._____

### End ###

{'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'presort': 'deprecated',
 'random_state': None,
 'splitter': 'best'}

#### Task
1. Map each key to the values you want to experiment with. Use the following keys and values:
    - `min_samples_leaf` with values ranging from 2 to 6
    - `max_depth` with values ranging from 2 to 6
    - `min_samples_split` with values ranging from 2 to 5 (scikit-learn requires an integer value of at least 2 here)
    - `max_features` with values ranging from 5 to 8
2. Perform a grid search with the `params` dictionary and the decision tree model. 
    - Set `n_jobs` to `-1` for faster training speeds
    - After which, fit it with `X_train` and `y_train`
    - Save the grid search instance to the `result` variable
3. Print the best score and best parameters
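
Grid search tries every combination of the listed values, so the number of candidates grows quickly. Here is a minimal sketch on synthetic data with a smaller, made-up grid (`demo_params`). The data, the grid values, and the variable names are assumptions for illustration, and `n_jobs` is left at its default to keep the sketch simple.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical stand-in)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# range(a, b) stops at b - 1, so range(2, 5) yields 2, 3, 4
demo_params = {
    'max_depth': range(2, 5),
    'min_samples_leaf': range(1, 4),
}

# 3 values x 3 values = 9 candidate combinations
n_candidates = len(ParameterGrid(demo_params))
print(n_candidates)

search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_params, cv=5)
search.fit(X_demo, y_demo)

print(search.best_score_)   # mean cross-validated score of the best candidate
print(search.best_params_)  # dict of the winning parameter values
```

With `cv=5`, each of the 9 candidates is fitted 5 times, which is why larger grids benefit from `n_jobs=-1`.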

In [None]:
'''
Hint:
Range creates an iterable from the start point of a until b-1
range(a,b) = [a, a+1, a+2, ..., b-2, b-1]
'''
### Write your code here ###
# Parameters to test 
params = {
    'min_samples_leaf': range(_____,_____),
    'max_depth': _____,
    'min_samples_split': _____,
    'max_features': _____
}

# Runs Grid Search
result = _____.fit(_____, _____)

# Returns Scores and parameters respectively
print('Best Score is:', result._____)
print('Best Parameters:\n', result._____)

### End ###

Best Score is: 0.7900772241616418
Best Parameters:
 {'max_depth': 6, 'max_features': 5, 'min_samples_leaf': 6, 'min_samples_split': 5}


### Random Search Cross Validation

#### Task
1. Import `RandomizedSearchCV` from `sklearn.model_selection`

In [None]:
### Write your code here ###
# Import RandomizedSearchCV from sklearn under model_selection
from sklearn.model_selection import RandomizedSearchCV

### End ###

#### Task
1. Perform a randomized search with the `params` dictionary and the decision tree model. 
    - Set `n_jobs` to `-1` for faster training speeds
    - After which, fit it with `X_train` and `y_train`
    - Save the randomized search instance to the `result` variable
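
Unlike grid search, a randomized search only samples a fixed number of combinations, controlled by `n_iter`. The sketch below illustrates this on synthetic data. The grid values, `n_iter=8`, the seed, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical stand-in)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 5 x 5 = 25 possible combinations in total
demo_params = {
    'max_depth': range(2, 7),
    'min_samples_leaf': range(1, 6),
}

# Only n_iter of the 25 combinations are sampled and evaluated
random_search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0), demo_params,
    n_iter=8, cv=5, random_state=0,
)
random_search.fit(X_demo, y_demo)

n_tried = len(random_search.cv_results_['params'])
print(n_tried)
print(random_search.best_params_)
```

Because only a sample is tried, the best score can be lower than what a full grid search finds, in exchange for a shorter run time.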

In [None]:
### Write your code here ###
# Runs Randomized Search
result = _____(_____, _____, cv=_____, n_jobs=_____).fit(_____, _____)

### End ###

# Returns Scores and parameters respectively
print('Best Score is:', result._____)
print('Best Parameters:\n', result._____)

Best Score is: 0.7755209986247752
Best Parameters:
 {'min_samples_split': 5, 'min_samples_leaf': 6, 'max_features': 5, 'max_depth': 3}


---

## Exercise 3️⃣
In this exercise, you will learn how to carry out feature selection.

In [None]:
dt = DecisionTreeClassifier()

### Feature Importance

#### Task
1. Train a Decision Tree Classifier model with `X_train` and `y_train`
2. Display Feature Importance
    - Feel Free to Refer to Demonstration 3️⃣
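
As a rough illustration of what a fitted tree reports, the sketch below trains a tree on synthetic data and lists the importances. The data, column names, and variable names are made up for this example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 5 features, of which only 2 carry signal (hypothetical)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(5)])

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# One importance value per column; the values sum to 1
importances = pd.Series(
    tree.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances)
```

An importance of 0 means the tree never split on that feature, which matches the `parch` row you may see in the exercise output.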

In [None]:
### Write your code here ###
# Train Decision Tree Model
model = dt._____

# Display Feature Importance
pd.DataFrame({
    'columns': X_train.columns,
    'feature importance': model._____ # Access the feature importance attribute here
}).sort_values('feature importance', ascending=False)

### End ###

|   | columns | feature importance |
|---|---|---|
| 4 | fare | 0.371326 |
| 1 | age | 0.260593 |
| 5 | sex_male | 0.25558 |
| 0 | pclass | 0.042875 |
| 6 | embarked_Q | 0.032648 |
| 7 | embarked_S | 0.019819 |
| 2 | sibsp | 0.01716 |
| 3 | parch | 0.0 |


### Coefficient

#### Task
1. Train a Logistic Regression model with `X_train` and `y_train`
2. Display the coefficients
    - Hint: the coefficient attribute is a 2D array
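
The hint above matters because a DataFrame column needs a 1D array. Here is a minimal sketch on synthetic data showing the shape for a binary problem and one way to flatten it. The data, column names, and variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data with 4 features (hypothetical)
X_arr, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=['a', 'b', 'c', 'd'])

logreg = LogisticRegression().fit(X_demo, y_demo)

# For a binary target, coef_ has one row and one column per feature
print(logreg.coef_.shape)  # (1, 4)

# Take the first (only) row to get a 1D array that fits a DataFrame column
coef_table = pd.DataFrame({
    'columns': X_demo.columns,
    'coef': logreg.coef_[0],
}).sort_values('coef', ascending=False)
print(coef_table)
```

Keep in mind that coefficient sizes depend on the scale of each feature, so they are only directly comparable when the features are on similar scales.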

In [None]:
### Write your code here ###
# Train Logistic Regression Model
model = _____._____

# Display Coefficient
pd.DataFrame({
    'columns': X_train.columns,
    'coef': model._____ # Access the coef attribute here
}).sort_values('coef', ascending=False)

### End ###

|   | columns | coef |
|---|---|---|
| 4 | fare | 1.377899 |
| 3 | parch | 8.5e-05 |
| 1 | age | -0.157371 |
| 6 | embarked_Q | -0.166903 |
| 0 | pclass | -0.436873 |
| 2 | sibsp | -0.464874 |
| 7 | embarked_S | -0.506902 |
| 5 | sex_male | -2.253249 |


### RFECV

#### Task
1. Import `RFECV` from `sklearn.feature_selection`
2. Initialize RFECV, save it to `result`, and do the following:
    - Pass the Decision Tree Classifier model, which has been initialized with the variable `dt`
    - Set `scoring='accuracy'` and set `n_jobs=-1` for faster fit time
    - Apply the `.fit()` method, passing `X_train` and `y_train`
3. Get the ranking of each variable
    - Feel Free to Refer to Demonstration 4️⃣
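
RFECV repeatedly drops the weakest feature and uses cross validation to decide how many features to keep. The sketch below runs it on synthetic data. The data and variable names are assumptions for illustration, and `n_jobs` is left at its default.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 6 features, of which 3 carry signal (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)

selector = RFECV(DecisionTreeClassifier(random_state=0), cv=5, scoring='accuracy')
selector.fit(X_demo, y_demo)

print(selector.ranking_)     # 1 marks a selected feature; higher means dropped earlier
print(selector.support_)     # boolean mask of the selected features
print(selector.n_features_)  # number of features kept
```

Every feature with rank 1 is kept, so several features can share the top rank, as in the exercise output.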

In [None]:
### Write your code here ###
# Import RFECV from sklearn under feature_selection
from sklearn.feature_selection import RFECV

# Performs RFECV
result = _____

# Display Ranking of features
pd.DataFrame({
    'columns': X_train.columns,
    'ranking': result._____ # Access the ranking attribute here
}).sort_values('ranking')

### End ###

|   | columns | ranking |
|---|---|---|
| 0 | pclass | 1 |
| 1 | age | 1 |
| 2 | sibsp | 1 |
| 4 | fare | 1 |
| 5 | sex_male | 1 |
| 6 | embarked_Q | 1 |
| 7 | embarked_S | 1 |
| 3 | parch | 2 |


---

## Exercise 4️⃣
In this exercise, we will be saving and loading models.

#### Task
1. Import joblib
2. Train a Decision Tree Classifier model
3. Save the model using `joblib`
    - Use the `dump` method in the joblib library
    - Pass the trained model and a file name of your choosing
    - Make sure that the file name ends with `.pkl`
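
Here is a minimal sketch of a full save-and-load round trip on synthetic data. It writes to a temporary directory so nothing is left behind. The file name `demo_tree.pkl`, the data, and the variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical stand-in)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_tree.pkl')
    # dump() returns the list of file names it wrote
    saved_files = joblib.dump(tree, model_path)
    # load() rebuilds the fitted estimator from the file
    restored = joblib.load(model_path)

# The restored model should predict exactly like the original
same_predictions = np.array_equal(tree.predict(X_demo), restored.predict(X_demo))
print(saved_files, same_predictions)
```

Only load `.pkl` files from sources you trust, since unpickling a file can execute arbitrary code.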

In [None]:
### Write your code here ###
# Import Joblib library
import _____

# Train Decision Tree Classifier
model = _____

# Save the model
joblib._____

### End ###

['cool.pkl']

#### Task
1. Load the model from the `.pkl` file with the file name used in the previous cell.
    - Be sure to load the model into the variable `model`
2. Generate prediction from the model
    - run `.predict` method using `X_test`
3. Use your testing data to test the accuracy of the model

In [None]:
### Write your code here ###
# Load the model from the saved .pkl file
model = joblib._____

# Generate predictions
y_pred = _____

# Test accuracy of the model
accuracy_score(_____, _____)

### End ###

0.7383720930232558

# Conclusion
Congratulations! You have completed the Practical Notebook for Day 3 of the SPAI Advanced Machine Learning Workshop. If you have any doubts or require any clarification, feel free to approach us through our [Instagram](https://www.instagram.com/spai.sp/) or [Discord Server](https://discord.gg/zPYJMGfQFa) *(remember to verify yourself)*.

---
> ### Feel free to join the subsequent **SPAI Machine Learning Competition** to earn **valuable prizes💰 and experience🏆** if you have not done so!🥳🥳
[ML Comp Sign Up Link](https://docs.google.com/forms/d/e/1FAIpQLSchLDXDAY0LqM6fuRDyQwdRNbVT4FYrgDtqthEIfYpFvpWMAg/viewform) *(If it has not been closed)*