# SLU64 - Training for the hackathon, part 2
Welcome to part 2 of hackathon training. Hopefully, after going through part 1, you're well acquainted with your dataset and have successfully constructed your first model.

In this SLU, you'll deal with imbalance, try out other models, engineer new features, and finally optimize your model and its hyperparameters. We're providing hints to guide you through the workflow, but it's on you to fill in the code.

In [None]:
# import all you need in this cell - we already did some imports for you
# basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

#sklearn libraries
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer, RobustScaler
from sklearn.preprocessing import KBinsDiscretizer, Binarizer, LabelEncoder
from sklearn.impute import SimpleImputer
# metrics
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, classification_report, auc
# imbalanced datasets
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import TomekLinks, RandomUnderSampler
# models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
# feature selection
from sklearn.feature_selection import SelectKBest, chi2, f_classif, SelectFromModel
# model selection
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# pipelines
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin

#other libraries
import category_encoders as ce

### 1. Import the dataset and do the train-test split
Let's start by taking a step back and dividing our dataset into `train` and `test` datasets. Hide your test dataset in a _virtual drawer_ and only take it out when it's time for testing. Prepare your preprocessing workflow with the `train` dataset.

Use the `train.csv` for the split, but don't touch the `test.csv` file for now.
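
As a rough illustration of a split that respects class imbalance, the sketch below uses `train_test_split` with `stratify`. The DataFrame is a synthetic stand-in for `train.csv`: the column names (`feature_a`, `feature_b`, `target`) and the 10% positive rate are assumptions made for the example, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train.csv (hypothetical columns, 10% positives).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.integers(0, 5, size=1000),
    "target": np.r_[np.ones(100, dtype=int), np.zeros(900, dtype=int)],
})

X = df.drop(columns="target")
y = df["target"]

# stratify=y keeps the class ratio (almost) identical in both splits,
# which matters when the positive class is rare.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("positive rate in train:", y_train.mean())
print("positive rate in test: ", y_test.mean())
```

Without `stratify`, a random split of a heavily imbalanced dataset can leave one of the splits with noticeably fewer positive examples than the other.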

In [None]:
# Import the dataset and do the train-test split.
# Take into account the imbalance of the dataset.

In [None]:
# Prepare your preprocessing pipeline using the train dataset.

### 2. Feature engineering
Datasets contain numerical and categorical features. Basic feature engineering, such as dealing with the categorical features, is an obligatory part of preprocessing. Engineering new features is an optional step that can be done as part of model optimization.

Can you encode the categorical features in a form that is more advantageous for the model? Can you engineer new features from the existing ones?
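
One possible encoding, sketched below, is one-hot encoding with scikit-learn's `OneHotEncoder`. This is just one option among several (the `category_encoders` package imported above offers others). The `color` column and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data with a hypothetical categorical column.
train = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
test = pd.DataFrame({"color": ["blue", "purple"]})  # "purple" never appears in train

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])  # fit on the train data only

train_encoded = encoder.transform(train[["color"]]).toarray()
test_encoded = encoder.transform(test[["color"]]).toarray()

print(encoder.categories_[0])
print(test_encoded)
```

One-hot encoding works well for low-cardinality features. For columns with many distinct values, it can produce a very wide matrix, so other encoders may be a better fit.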

In [None]:
# Explore the categorical features and see if you can encode them.

In [None]:
# Optionally, explore the anonymous column and see if you can extract more information from it.

### 3. Preprocess the test data
Repeat the same preprocessing you did earlier with the train data. Apply the previous transformations, but don't fit the transformers again!

The easiest way to do this is with a pipeline. Of course, you can also try it initially without a pipeline and set one up later, but do try using a pipeline at some point. It'll make your life easier.
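
To make the "fit on train, only transform on test" rule concrete, here is a minimal sketch without a pipeline. The arrays are toy values chosen for the example. The imputer and scaler are arbitrary picks from the imports above, not a recommendation for your dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy numerical data with missing values (illustrative only).
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[np.nan, 20.0], [4.0, 40.0]])

imputer = SimpleImputer(strategy="median")
scaler = StandardScaler()

# Train data: learn the medians, means and standard deviations, then transform.
X_train_prepared = scaler.fit_transform(imputer.fit_transform(X_train))

# Test data: reuse the statistics learned from train. No fitting here.
X_test_prepared = scaler.transform(imputer.transform(X_test))

print("medians learned from train:", imputer.statistics_)
print(X_test_prepared)
```

Calling `fit` or `fit_transform` on the test data would leak information from the test set into the preprocessing and make the test score overly optimistic.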

In [None]:
# Apply the transformation that you used on the train dataset to the test dataset.

### 4. Test more models
Here you can test all the models that you know and see which one performs best.
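
A simple way to compare several models is to loop over them and record the same metric on the train and test data. The sketch below assumes ROC AUC as the metric and uses a synthetic imbalanced dataset from `make_classification` in place of your preprocessed data. The models use default parameters, except for `max_iter` and `random_state`, which are set only to avoid convergence warnings and to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the preprocessed dataset.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # ROC AUC needs scores or probabilities, not hard class labels.
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = (train_auc, test_auc)
    print(f"{name:22s} train AUC = {train_auc:.3f}   test AUC = {test_auc:.3f}")
```

A large gap between the train and test scores is a sign of overfitting, so compare both numbers rather than looking at the test score alone.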

In [None]:
# Test all the models you know.
# Use the default model parameters.

In [None]:
# Calculate the predictions for the train and test data and choose your best model. 

### 5. Feature selection
Feature selection is part of model optimization. Use common sense first and then automatic tools, but always check if their output makes sense.
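
The sketch below shows two automatic tools side by side: a univariate filter (`SelectKBest` with `f_classif`) and the feature importances of a random forest. The data is synthetic and the feature names `f0` to `f7` are placeholders. The choice of `k=3` is arbitrary and only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic train data: with shuffle=False the informative features come first.
X_train, y_train = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"f{i}" for i in range(X_train.shape[1])]

# Univariate filter: keep the k features with the highest ANOVA F-score.
selector = SelectKBest(score_func=f_classif, k=3).fit(X_train, y_train)
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("SelectKBest keeps:", selected)

# Model-based view: impurity-based importances from a random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
ranking = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

The two tools do not always agree, which is one reason to check their output against what you know about the data.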

In [None]:
# Think about which features are likely to influence the model outcome.

In [None]:
# Look at feature importance for each model.

In [None]:
# Are automatic feature selection tools helpful? Critically evaluate their output.

### 6. Model optimization
You can further refine your best performing models with hyperparameter optimization. Keep in mind that this can take time to run. If you don't have much computing power, consider running your notebooks on [Kaggle](https://www.kaggle.com/) or [Google Colab](https://colab.research.google.com/).
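
As a starting point, the sketch below runs cross-validation, a learning curve, and a small grid search on the same pipeline. The synthetic data, the logistic regression model, the `C` grid, and the ROC AUC scoring are all assumptions made for the example. Swap in your own train data, model, and search space.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, cross_val_score, learning_curve,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced train data (illustrative only).
X_train, y_train = make_classification(
    n_samples=600, n_features=8, weights=[0.85, 0.15], random_state=0
)

pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# 1. Cross-validated score of the pipeline as it is.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# 2. Learning curve: scores as a function of the train set size.
train_sizes, train_scores, val_scores = learning_curve(
    pipe, X_train, y_train, cv=cv, scoring="roc_auc",
    train_sizes=np.linspace(0.25, 1.0, 4), shuffle=True, random_state=0,
)
for size, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n = {size:4d}   train AUC = {tr:.3f}   validation AUC = {va:.3f}")

# 3. Grid search over one hyperparameter of the final pipeline step.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="roc_auc")
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print(f"best CV AUC: {search.best_score_:.3f}")
```

Because the scaler sits inside the pipeline, it is refit on the training portion of each fold, so no information leaks from the validation folds. For larger search spaces, `RandomizedSearchCV` samples a fixed number of parameter combinations and usually runs faster than an exhaustive grid.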

In [None]:
# Use cross validation to test models or model parameters.

In [None]:
# Check the learning curves.

In [None]:
# Use the hyperparameter search schemes to find the best hyperparameters.

### 7. Pipeline
If you didn't use a pipeline yet, now is your chance! Normally, you'd start your process with a pipeline, because pipelines make things easier, but you're practicing now for the hackathon. Using a pipeline later in your process might be a way to experiment with and understand your workflow.
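
A minimal sketch of such a pipeline is shown below. It combines preprocessing for numerical and categorical columns with a model, fits everything on the train data, and then only predicts on the test data. The columns `amount` and `channel`, the synthetic target, and the choice of logistic regression are hypothetical and serve only to keep the example self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic mixed-type data (hypothetical column names and target).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "amount": rng.normal(100, 20, size=n),
    "channel": rng.choice(["web", "app", "store"], size=n),
})
df.loc[df.index % 25 == 0, "amount"] = np.nan  # inject some missing values
df["target"] = (rng.random(n) < 0.15).astype(int)

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Separate preprocessing for numerical and categorical columns.
preprocess = ColumnTransformer([
    ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), ["amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
])

pipeline = make_pipeline(
    preprocess,
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# fit() fits every transformer and the model on the train data only.
pipeline.fit(X_train, y_train)

# predict_proba() reuses the fitted transformers on the test data.
test_proba = pipeline.predict_proba(X_test)[:, 1]
print(test_proba[:5])
```

The same fitted `pipeline` object can later be applied to any new data with the same columns, which is what makes it convenient for generating a submission.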

In [None]:
# Make your pipeline.
# Run your train data through it first and then your test data next.

### 8. Calculate the score with the test data
In this section, we'll test model predictions for the test data (the one with the unknown labels, not the data from your train-test split). In a real hackathon, you'll generate these predictions and then submit them to the portal. The portal will compare your prediction with the real labels and give you a score.

For this SLU, the true labels for the test data are in the `portal` directory, together with the code to calculate the score. Run the following cells to calculate your score. You'll need to uncomment some parts.

In [None]:
# Import the code to calculate the score.
from portal.score import load, validate, score

In [None]:
# Load the true labels and your prediction.
# Uncomment and run the code.
#y_true = load("portal/data")
#y_pred = load("submission.csv")

In [None]:
# This function just validates if the prediction has the correct format.
# Uncomment and run it.
#validate(y_true, y_pred)

In [None]:
# Calculate the AUC-ROC score for your test prediction.
# Uncomment and run the code.
#score(y_true, y_pred)

### 9. Presentation
Add the model optimization part to your presentation. Remember to justify the decisions you made. You'll need to use tables or visualizations to support your claims.

In [None]:
print("I have completed the hackathon training part 2. I'm a star!")