# SLU32 - Training for the hackathon, part 1
In the hackathon, you will receive a dataset and a metric, and you will have to set up a model and optimize it for that metric. By then you will have learned all the pieces you need to accomplish this task; nevertheless, it can still seem daunting. Solving a hackathon is a multistep task with multiple possible solutions, so it is a much bigger problem than any of the exercises in the exercise notebooks.

This SLU is meant to help you prepare for it. Here you can practice everything you learned in the first part of the S01 specialization (SLUs 01-10). Start by getting to know your dataset through exploratory data analysis (EDA), then set up a first model.

We are providing hints to guide you through the workflow. It's up to you to fill in all the code. This workflow is just a proposal to get you started; once you're more experienced, you will design your own workflow that suits your needs.

The second part of the training is SLU64 where you can practice feature engineering, model optimization, and dealing with imbalanced datasets.

In [None]:
# import all you need in this cell - we already did some imports for you
# basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

# sklearn libraries
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, classification_report, auc

### 1. Read the problem statement
We will use the problem from the first hackathon from the previous batch of LDSSA. You can read the problem statement [here](https://docs.google.com/document/d/1mONeQMAyYW2cJvinYHGAQ2W_04P1HHU68dcUj-iosow/edit#heading=h.640xl1uqp9un). The topic might be from an area completely unfamiliar to you, but this is normal in data science. Data scientists often work on very different problems and each time have to learn a bit about the background to deliver a good solution. Of course, in a hackathon, you'll have little time to explore the topic, but you can always ask the instructors.

### 2. Import the dataset
There are two datasets in the `data` folder: train and test. Train your model with the train dataset only. Use the test dataset only at the end to calculate the predictions, never for training.

In [None]:
# Import and preview the dataset (you can first look at the datafile directly).
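
A minimal sketch of loading and previewing a CSV file with Pandas. The file name, the column names, and the `id` index column are all hypothetical here, so the example reads from an in-memory string instead of the real files in the `data` folder.

```python
import io

import pandas as pd

# Toy stand-in for a file such as data/train.csv (hypothetical name and columns).
csv_text = """id,age,income,city,target
1,34,52000,Lisbon,0
2,45,,Porto,1
3,29,31000,Lisbon,0
4,52,88000,Faro,1
"""

# With a real file you would pass its path instead of the StringIO buffer.
train = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(train.shape)   # (number of rows, number of columns)
print(train.head())  # first rows of the dataset
```

Whether an index column such as `id` exists depends on the actual data file, so it is worth opening the file directly first.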

### 3. Exploratory data analysis (EDA) and data cleaning
Use the Pandas tools from SLUs 01-06 and plots to get to know your dataset.

In [None]:
# Look at the size of the dataset and the variable names.

In [None]:
# Is your data tidy - variables in columns and data points in rows?

In [None]:
# Look at the datatypes.
# Do they make sense or do you need to correct them?

In [None]:
# How many unique values does each column have?
# Identify the categorical and numerical variables.
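
The checks above (size, datatypes, unique values, variable kinds) could look roughly like this. The DataFrame and its columns are made up for illustration; the real dataset will need its own decisions about which columns to convert.

```python
import pandas as pd

# Toy data (hypothetical columns); "age" is stored as text on purpose.
df = pd.DataFrame({
    "age": ["34", "45", "29", "52"],
    "city": ["Lisbon", "Porto", "Lisbon", "Faro"],
    "income": [52000.0, 61000.0, 31000.0, 88000.0],
})

print(df.shape)    # number of rows and columns
print(df.dtypes)   # "age" shows up as a non-numeric column

# Correct a datatype that does not make sense.
df["age"] = pd.to_numeric(df["age"])

# Number of unique values per column.
n_unique = df.nunique()
print(n_unique)

# Split the columns by datatype as a first guess at numerical vs categorical.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numerical_cols, categorical_cols)
```

The datatype split is only a first guess: a numeric column with very few unique values may still be categorical in meaning.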

In [None]:
# Look at the ranges of the numerical values.
# Are there outliers?
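
One possible way to look at ranges and flag candidate outliers is `describe()` combined with the 1.5 x IQR rule. This rule is just one common convention, not something prescribed by the hackathon, and the values below are invented.

```python
import pandas as pd

# Toy numerical column (hypothetical values) with one suspicious entry.
s = pd.Series([21, 25, 27, 30, 31, 33, 35, 240], name="age")

print(s.describe())  # count, mean, std, min, quartiles, max

# Flag values further than 1.5 * IQR from the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```

A flagged value is only a candidate: whether it is an error or a legitimate extreme depends on what the variable means.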

In [None]:
# See if the dataset is balanced - how many data points do you have for each class?
# You don't have the tools to deal with imbalance yet, but it's good to know.
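
A short sketch of checking the class balance with `value_counts`. The target values are invented for illustration.

```python
import pandas as pd

# Toy target column (hypothetical): 8 negatives and 2 positives.
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 0], name="target")

counts = y.value_counts()                # absolute counts per class
shares = y.value_counts(normalize=True)  # fraction of the data per class
print(counts)
print(shares)
```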

In [None]:
# See where you have missing values.
# Think about how to deal with them - fill in (different strategies) or drop them.
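
The two basic strategies, filling in and dropping, can be sketched as follows. The columns and the choice of the median and of an `"unknown"` label are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns) with missing values in both columns.
df = pd.DataFrame({
    "age": [34, np.nan, 29, 52],
    "city": ["Lisbon", None, "Lisbon", "Faro"],
})

# Where are the missing values?
missing_per_column = df.isna().sum()
print(missing_per_column)

# Strategy 1: fill in. Median for a numerical column, a placeholder for a categorical one.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna("unknown")

# Strategy 2: drop every row that has at least one missing value.
dropped = df.dropna()
print(len(df), len(dropped))
```

If you fill in values, compute the fill value (here the median) on the train data and reuse the same value for the test data.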

In [None]:
# Do you have duplicated data?
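
A minimal sketch for counting and removing fully duplicated rows, using a made-up DataFrame.

```python
import pandas as pd

# Toy data (hypothetical): the second and third rows are identical.
df = pd.DataFrame({"a": [1, 2, 2, 3], "b": ["x", "y", "y", "z"]})

n_duplicates = df.duplicated().sum()  # rows that repeat an earlier row
deduplicated = df.drop_duplicates()   # keeps the first occurrence of each row
print(n_duplicates, len(deduplicated))
```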

In [None]:
# Are the variables independent?
# Check out the correlations between variables.
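
Pairwise correlations between numerical variables can be inspected with `DataFrame.corr()`. The data below is randomly generated so that two columns are nearly redundant; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: "x_scaled" is almost a multiple of "x", "noise" is unrelated.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_scaled": 2 * x + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))
```

A heatmap such as `sns.heatmap(corr, annot=True)` is one way to make a larger correlation matrix easier to read.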

### 4. Model - baseline
This is clearly a classification problem, so you can apply the classification model you already know to get the first result (baseline).

In [None]:
# Choose the variables that you want to feed into the model.
# By SLU10, you don't yet know the tools for dealing with categorical variables, 
# so you can ignore them or use some simple strategy.
# Do the variables need to be scaled?

In [None]:
# Train the model.
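
A possible baseline, assuming you use only numerical features, scale them with `StandardScaler`, and fit a `LogisticRegression` with default parameters. The data is synthetic (generated with `make_classification`), so the variable names `X_train` and `y_train` are placeholders for your own selection of columns.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected numerical features and the target.
X_train, y_train = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Fit the scaler on the train data only, then scale the features.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Train the baseline model.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
print(model.coef_)
```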

### 5. Calculate the predictions
Calculate the predictions for both the train and the test data.

In [None]:
# Calculate the predictions for the train data.

In [None]:
# Calculate the prediction for the test data.
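
A sketch of calculating predictions for both datasets, under the assumption that the metric needs the probability of the positive class rather than hard labels (check the problem statement for the format the portal expects). The data is synthetic, and the names `train_prediction` and `X_test` are illustrative; only `test_prediction` is a name used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: the first 300 rows play the role of train, the rest of test.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test = X[:300], X[300:]
y_train = y[:300]

# The scaler and the model see only the train data during fitting.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Column 1 of predict_proba is the probability of the positive class.
train_prediction = model.predict_proba(scaler.transform(X_train))[:, 1]
test_prediction = model.predict_proba(scaler.transform(X_test))[:, 1]
print(train_prediction[:5], test_prediction[:5])
```

The test data goes through the same fitted scaler as the train data; refitting the scaler on the test data would make the two sets inconsistent.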

In [None]:
# Save the prediction to a csv file.
# Uncomment the code and run it.
# It is expected that the prediction is in the variable test_prediction.
#submission = pd.Series(test_prediction,index=test.index, name='id')
#submission.to_csv("submission.csv")

### 6. Model evaluation with the training data
Use the prediction for the training data to evaluate your model. In a real hackathon, you will have access to the labels of the training data, but not of the test data, so you will have to use the train data prediction to evaluate your model during development. Keep in mind that a score calculated on the same data the model was trained on tends to be optimistic.

As per the problem setting, you should use the ROC AUC score to evaluate the model. Just for the sake of training, you can also look at other metrics and think about whether they make sense in this situation.

In [None]:
# First, calculate the ROC AUC score for your model.
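
A small sketch of `roc_auc_score` on hand-made labels and scores (not the hackathon data). It also shows that feeding thresholded labels instead of scores changes the result, because the ranking information is lost.

```python
from sklearn.metrics import roc_auc_score

# Toy true labels and predicted scores (hypothetical values).
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# ROC AUC from the scores: uses the full ranking of the predictions.
auc_from_scores = roc_auc_score(y_true, y_score)

# ROC AUC from hard labels (threshold 0.5): ties between predictions hide the ranking.
y_label = [int(s >= 0.5) for s in y_score]
auc_from_labels = roc_auc_score(y_true, y_label)

print(auc_from_scores, auc_from_labels)
```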

In [None]:
# Now check how the model does on the other metrics you learned about.
# Which of them are relevant for this situation?

In [None]:
# You can calculate the confusion matrix.

In [None]:
# Plot the ROC curve here and calculate the ROC AUC score.
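
One way to plot the ROC curve is to combine `roc_curve`, `auc`, and Matplotlib. The labels and scores are toy values, and the styling is only a suggestion.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve

# Toy true labels and predicted scores (hypothetical values).
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# False positive rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # area under the curve by the trapezoidal rule

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="random guess")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```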

### 7. Calculate the score with the test data
In this section, we will test the model predictions for the test data. In a real hackathon, you will generate these predictions, then submit them to the portal. The portal will compare your prediction with the real labels and give you a score.

Here, you have the test predictions in the `portal` directory, together with the code to calculate the score. Run the following cells to calculate your score (you need to uncomment some parts).

In [None]:
# Import the code to calculate the score.
from portal.score import load, validate, score

In [None]:
# Load the true labels and your prediction.
# Uncomment and run the code.
#y_true = load("portal/data")
#y_pred = load("submission.csv")

In [None]:
# This function just validates if the prediction has the correct format.
# Uncomment and run it.
#validate(y_true, y_pred)

In [None]:
# Calculate the ROC AUC score for your test prediction.
# Uncomment and run the code.
#score(y_true, y_pred)

### 8. Feature selection
This is something we looked at only briefly in the Exercise notebook of SLU09. You will learn about it in depth in SLU14. You can look at how important the features are for the model outcome and then retrain the model with just the most important ones.

In [None]:
# Get the coefficients of the model and see which ones have the most weight in the model 
# (the highest numbers in absolute terms).

In [None]:
# Retrain the model with the most important features.
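
A sketch of ranking features by the absolute value of their logistic regression coefficients and retraining on the top ones. It assumes the features were scaled, since coefficients are only comparable across features on a common scale. The data is synthetic, with `x0` and `x1` built to be informative, and keeping two features is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: only x0 and x1 influence the target, x2 and x3 are noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x0", "x1", "x2", "x3"])
y = (3 * X["x0"] - 2 * X["x1"] + rng.normal(size=500) > 0).astype(int)

# Scale the features and fit the full model.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
model = LogisticRegression().fit(X_scaled, y)

# Rank the features by the absolute value of their coefficient.
coefficients = pd.Series(model.coef_[0], index=X.columns)
ranked = coefficients.abs().sort_values(ascending=False)
print(ranked)

# Retrain with the two highest-ranked features only.
top_features = ranked.index[:2].tolist()
reduced_model = LogisticRegression().fit(X_scaled[top_features], y)
print(top_features, reduced_model.coef_)
```

Compare the score of the reduced model with the score of the full model before deciding which one to keep.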

In [None]:
# If you like, tweak other model parameters.
# Again, this is something that you will learn about in later SLUs.

### 9. Presentation
The final part of the hackathon is to present your solution to the other teams. The presentation should cover all the steps in the analysis and justify the decisions you made. You will need tables or visualizations to support your claims. As practice, think about how you would present the EDA part of the analysis.

In [None]:
print('I have completed the hackathon training!')