# SLU64 - Training for the hackathon, part 2
Welcome to the second part of the hackathon training. Hopefully, after going through part 1, you now know your dataset inside out and have built your first model.

In this SLU, you will deal with the class imbalance, try out other models, engineer new features, and finally optimize your model and its hyperparameters. We provide hints to guide you through the workflow. It's up to you to fill in all the code.

In [None]:
# import all you need in this cell - we already did some imports for you
# basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

#sklearn libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, RobustScaler
from sklearn.preprocessing import KBinsDiscretizer, Binarizer, LabelEncoder
from sklearn.impute import SimpleImputer
# metrics
from sklearn.metrics import confusion_matrix,roc_auc_score,roc_curve,classification_report,auc
# imbalanced datasets
from imblearn.over_sampling import SMOTE,RandomOverSampler
from imblearn.under_sampling import TomekLinks,RandomUnderSampler
# models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# feature selection
from sklearn.feature_selection import SelectKBest, chi2, f_classif, SelectFromModel
# model selection
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# pipelines
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin

#other libraries
import category_encoders as ce

### Import the dataset and do the train-test split
Here we take a step back. The first step in setting up a model is to divide the dataset into a train and a test dataset. Hide your test dataset in a virtual drawer and take it out only when it's time for testing. Prepare your preprocessing workflow with the train dataset.
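As a hint, here is a minimal sketch of a stratified split. It runs on synthetic data with about 10% positives; the column names `feature_a`, `feature_b` and `target` are made up for illustration and are not the columns of the hackathon dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hackathon data: roughly 10% positives.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.normal(size=1000),
    "target": (rng.random(1000) < 0.1).astype(int),
})

X = df.drop(columns="target")
y = df["target"]

# stratify=y keeps the class ratio (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Share of positives in each split; the two numbers should be very close.
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a random split of a strongly imbalanced dataset can leave the test set with very few positive examples, which makes the test score unreliable.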

In [None]:
# Import the dataset and do the train-test split.
# Take into account the imbalance of the dataset.

In [None]:
# Prepare your preprocessing pipeline using the train dataset.

### Feature engineering
The dataset contains numerical and categorical features. Basic feature engineering such as dealing with the categorical features is an obligatory part of the preprocessing. Engineering new features is an optional step that can be done as part of model optimization.

Can you encode the categorical features in a form that is more advantageous for the model? Can you engineer new features from the existing ones? Look especially at the `anonymous` column.

In [None]:
# Explore the categorical features and see if you can encode them

In [None]:
# Optionally, explore the anonymous column and see if you can extract more information from it.

### Preprocess the test data
Repeat the exact same preprocessing as for the train data. Apply the transformations that you have fitted before with the train data. Do not fit the transformers again!

The easiest way to do this is with a pipeline, but you can set that up later.
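The difference between fitting and applying a transformer can be sketched as follows. The tiny arrays are made up, and the choice of a median imputer followed by a standard scaler is only an assumption for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[np.nan, 20.0], [4.0, 40.0]])

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# fit_transform on train: learns the medians, means and standard deviations.
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))

# transform only on test: reuses the statistics learned from the train data.
X_test_prepared = scaler.transform(imputer.transform(X_test))

print(imputer.statistics_)  # medians learned from the train data
print(X_test_prepared)
```

Calling `fit` or `fit_transform` on the test data would leak information from the test set into the preprocessing and make the test score look better than it is.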

In [None]:
# Apply all the transformations that you did with the train dataset to the test dataset.

### Test more models
Here you can test all the models that you know and see which one performs best.
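A loop over a dictionary of models keeps the comparison compact. This is a sketch on synthetic data from `make_classification`, not on the hackathon dataset. The models use their default parameters, except for `max_iter` and `random_state`, which are set here only to avoid convergence warnings and to make the run reproducible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 90% negatives).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC-ROC is computed from the predicted probability of the positive class.
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    scores[name] = (train_auc, test_auc)
    print(f"{name}: train AUC = {train_auc:.3f}, test AUC = {test_auc:.3f}")
```

A large gap between the train and test scores, as an unpruned decision tree typically shows, is a sign of overfitting.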

In [None]:
# Test all the models you know.
# Use the default model parameters.

In [None]:
# Calculate the predictions for the train and test data and choose your best model. 

### Feature selection
Feature selection is part of model optimization. Use common sense first, then the automatic tools and always check if their output is sensible.
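As a hint, two common automatic tools are univariate selection and model-based feature importance. The sketch below applies both to synthetic data; the choices of `k=3` and of a random forest are arbitrary and only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# shuffle=False keeps the 3 informative features in the first 3 columns.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Univariate selection: keep the k features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = selector.get_support(indices=True)
print("SelectKBest kept columns:", selected)

# Model-based ranking: impurity-based importances of a random forest.
forest = RandomForestClassifier(random_state=0).fit(X, y)
ranking = forest.feature_importances_.argsort()[::-1]
print("Random forest ranking (most important first):", ranking)
```

The two tools do not have to agree with each other or with your intuition, which is why their output should be checked rather than trusted blindly.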

In [None]:
# Think about which features are likely to have or not have influence on the model outcome.

In [None]:
# Look at feature importance for each model.

In [None]:
# See if the automatic feature selection tools help you out. Critically evaluate their output.

### Model optimization
You can further refine your best-performing models with hyperparameter optimization. These methods can take some time to run. If you don't have much computing power, consider running your notebooks on Kaggle or Google Colab.
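A minimal sketch of cross-validation followed by a grid search is shown below. The data are synthetic, and the parameter grid for `LogisticRegression` is an arbitrary example, not a recommendation for the hackathon dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=500, n_features=10, weights=[0.9], random_state=0)

# Baseline: 5-fold cross-validated AUC-ROC of the default model.
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print(f"Baseline AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Grid search: every parameter combination is evaluated with the same 5 folds.
param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}
search = GridSearchCV(model, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

`RandomizedSearchCV` has the same interface and samples a fixed number of combinations instead of trying them all, which helps when the grid is large.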

In [None]:
# Use cross validation to test models or model parameters.

In [None]:
# Check the learning curves.

In [None]:
# Use the hyperparameter search schemes to find the best hyperparameters.

### Pipeline
Now is the time to turn your workflow into a pipeline! Usually, you'd use a pipeline from the start because it makes everything easier.
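One way to structure such a pipeline is sketched below, with a `ColumnTransformer` for the preprocessing and a classifier as the last step. The columns `age`, `city` and `target` are hypothetical, and logistic regression is only a placeholder for your best model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns standing in for the hackathon data.
train = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "city": ["Lisbon", "Porto", "Lisbon", "Faro", "Porto", "Lisbon"],
    "target": [0, 1, 0, 1, 1, 0],
})
test = pd.DataFrame({"age": [29, np.nan], "city": ["Porto", "Braga"]})

# Numeric columns are imputed and scaled, categorical columns are one-hot encoded.
preprocessing = ColumnTransformer([
    ("numeric", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["age"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipeline = make_pipeline(preprocessing, LogisticRegression())

# fit() fits every step on the train data only.
pipeline.fit(train[["age", "city"]], train["target"])
# predict_proba() re-applies the already fitted steps to the test data.
test_probabilities = pipeline.predict_proba(test)[:, 1]
print(test_probabilities)
```

Because the whole pipeline is a single estimator, it can also be passed directly to `cross_val_score` or `GridSearchCV`, so the preprocessing is refitted inside every fold.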

In [None]:
# Make your pipeline.
# You run your train data through it first, then your test data.

### Calculate the score with the test data
In this section, we will test the model predictions for the test data (the one with the unknown labels, not the data from your train-test split). In a real hackathon, you will generate these predictions, then submit them to the portal. The portal will compare your prediction with the real labels and give you a score.

Here, the true labels for the test data are in the `portal` directory, together with the code to calculate the score. Run the following cells to calculate your score (you need to uncomment some parts).
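To get a feeling for the AUC-ROC metric, here is a small sketch with scikit-learn's `roc_auc_score` on made-up values. It does not use the portal code, and the submission format that the portal expects is not shown here, so check the hackathon instructions for that.

```python
from sklearn.metrics import roc_auc_score

labels = [0, 0, 1, 1]
probabilities = [0.1, 0.4, 0.45, 0.8]  # predicted probability of class 1
hard_labels = [0, 0, 0, 1]             # the same predictions thresholded at 0.5

# Every positive is ranked above every negative, so the ranking is perfect.
auc_from_probabilities = roc_auc_score(labels, probabilities)
# Thresholding discards the ranking between 0.4 and 0.45 and lowers the score.
auc_from_hard_labels = roc_auc_score(labels, hard_labels)

print(auc_from_probabilities, auc_from_hard_labels)
```

AUC-ROC measures how well the predictions rank positives above negatives, so predicted probabilities usually give a better score than hard class labels.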

In [None]:
# Import the code to calculate the score.
from portal.score import load, validate, score

In [None]:
# Load the true labels and your prediction.
# Uncomment and run the code.
#y_true = load("portal/data")
#y_pred = load("submission.csv")

In [None]:
# This function just validates if the prediction has the correct format.
# Uncomment and run it.
#validate(y_true, y_pred)

In [None]:
# Calculate the AUC-ROC score for your test prediction.
# Uncomment and run the code.
#score(y_true, y_pred)

### Presentation
Add the model optimization part to your presentation. Remember to justify the decisions you made. You will need to use tables or visualizations to support your claims.

In [None]:
print("I have completed the hackathon training part 2, I'm a star!")