# SLU32 - Training for the hackathon, part 1
In the upcoming hackathon, you'll receive a dataset and a metric, and you'll have to set up and optimize a model for them. You'll have learned all the necessary pieces by the date of the hackathon, but the task will probably still seem daunting. Don't worry! The hackathon is a multistep process with multiple possible solutions. It's a much bigger challenge than any of the exercise notebooks you'll complete beforehand, but it's manageable with preparation.

This SLU is meant to prepare you. You can practice everything you learned in the first part of the S01 specialization, in SLUs 01-10. You should first get to know your dataset (exploratory data analysis, EDA) and then set up the first model.

We're providing hints to guide you through the workflow, but it's on you to fill in the code. Consider this workflow as just a proposal to get you started. Once you're more experienced, you'll need to design your own workflow that suits your needs. 

The second part of hackathon training is SLU64, where you'll practice feature engineering, model optimization, and dealing with imbalanced datasets.

In [None]:
# import all you need in this cell - we already did some imports for you
# basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

# sklearn libraries
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, classification_report, auc

### 1. Read the problem statement
We'll use the problem from the first hackathon from a previous batch of LDSSA for this exercise. You can read its problem statement in the `README_hckt01.md` file. The topic might be from an area completely unfamiliar to you, but this is normal in data science. Data scientists work on a wide variety of challenges, and each time they need to learn the background of those challenges in order to deliver good solutions. Of course, in a hackathon, you'll have little time to explore your topic, but you can always ask for help from instructors.

### 2. Import the dataset
There are two datasets in the `data` folder: `train` and `test`. You should train your model with the `train` dataset, and use the `test` dataset only at the end to calculate predictions. Never use the test data in training!

In [None]:
# Import and preview the dataset (you can first look at the datafile directly).
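
As a rough illustration of loading and previewing a CSV with Pandas, here is a minimal sketch. The column names and the in-memory text are hypothetical stand-ins, not the hackathon data; in the notebook you would pass the path of the file in the `data` folder to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = io.StringIO(
    "id,age,city,target\n"
    "1,34,Lisbon,0\n"
    "2,51,Porto,1\n"
    "3,,Lisbon,0\n"
)

# index_col="id" assumes the file has an identifier column named "id".
train = pd.read_csv(csv_text, index_col="id")
print(train.head())
```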

### 3. Exploratory data analysis (EDA) and data cleaning
Use Pandas tools from SLUs 01 - 06 and plots to get to know your dataset. 

In [None]:
# Look at the size of the dataset and the variable names.
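
A minimal sketch of checking size and variable names, using a small made-up DataFrame (the columns are hypothetical):

```python
import pandas as pd

# Toy data; replace with your own DataFrame.
df = pd.DataFrame({
    "age": [34, 51, 29],
    "city": ["Lisbon", "Porto", "Faro"],
    "target": [0, 1, 0],
})

n_rows, n_cols = df.shape          # (number of rows, number of columns)
print(n_rows, n_cols)
print(df.columns.tolist())         # variable names
```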

In [None]:
# Is your data tidy? Are the variables in columns and data points in rows?

In [None]:
# Look at the datatypes.
# Do they make sense or do you need to correct them?
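
One possible way to inspect and correct datatypes is sketched below. The columns `age` and `signup` are invented for illustration; which columns need conversion in your dataset is something you have to judge yourself.

```python
import pandas as pd

# Toy data where numbers and dates were stored as text.
df = pd.DataFrame({
    "age": ["34", "51", "unknown"],
    "signup": ["2021-01-05", "2021-02-10", "2021-03-15"],
})
print(df.dtypes)

# errors="coerce" turns values that cannot be parsed into NaN.
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["signup"] = pd.to_datetime(df["signup"])
print(df.dtypes)
```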

In [None]:
# How many unique values does each column have?
# Identify the categorical and numerical variables.
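
A sketch of counting unique values and splitting columns by dtype, on hypothetical toy data. Note that dtype alone is only a first guess: a numeric column with very few unique values (like `target` here) may really be categorical.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, 29, 34],
    "city": ["Lisbon", "Porto", "Lisbon", "Lisbon"],
    "target": [0, 1, 0, 0],
})

unique_counts = df.nunique()
print(unique_counts)

# Split columns by their stored dtype.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```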

In [None]:
# Look at the ranges of the numerical values.
# Are there outliers?
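
A minimal sketch of looking at ranges and flagging candidate outliers. The 1.5 x IQR rule used here is one common convention, chosen for illustration; it is not the only way to define an outlier, and the data is made up.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95], name="value")
print(s.describe())  # min, max, and quartiles give a quick view of the range

# Flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)
```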

In [None]:
# See if the dataset is balanced. How many datapoints for each class do you have?
# You don't have the tools to deal with this yet, but it's good to be aware of it this early.
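
A sketch of checking class balance with `value_counts`, assuming a binary target stored in a Series (the values are made up):

```python
import pandas as pd

y = pd.Series([0, 0, 0, 0, 1], name="target")

counts = y.value_counts()                 # absolute count per class
shares = y.value_counts(normalize=True)   # fraction per class
print(counts)
print(shares)
```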

In [None]:
# See where you have missing values.
# Think about how to deal with them: fill them in (there are different strategies) or drop them.
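
The sketch below shows one way to locate missing values and two simple ways to handle them. Filling with the median or the most frequent value is just one possible strategy, shown on hypothetical data; the right choice depends on your dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 29, 40],
    "city": ["Lisbon", None, "Porto", "Lisbon"],
})

missing_per_column = df.isna().sum()
print(missing_per_column)

# Strategy 1: fill numeric gaps with the median, categorical gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Strategy 2: drop every row that has at least one missing value.
dropped = df.dropna()
```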

In [None]:
# Do you have duplicated data?
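
A minimal sketch of detecting and removing fully duplicated rows, on toy data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

n_dupes = df.duplicated().sum()   # rows identical to an earlier row
deduped = df.drop_duplicates()
print(n_dupes, len(deduped))
```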

In [None]:
# Are the variables independent?
# Check out the correlations between variables.
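
As an illustration of a correlation check, the sketch below builds synthetic data in which one column is nearly a copy of another. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + rng.normal(scale=0.1, size=200),  # almost redundant with x
    "noise": rng.normal(size=200),                       # unrelated to x
})

corr = df.corr()  # pairwise Pearson correlation by default
print(corr.round(2))
```

Passing `corr` to `sns.heatmap` is one way to view the matrix as a plot.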

### 4. Model - baseline
This is clearly a classification problem, so you can apply the classification model you already know to get the first result (a baseline).

In [None]:
# Choose the variables that you want to feed into the model.
# You won't be familiar with the tools you'll use to deal with categorical variables by the time you work on SLU10,
# so you can ignore them or use a simple strategy.
# Do the variables need to be scaled?
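
If the chosen variables live on very different scales, standardizing them is one option. A sketch with `StandardScaler` on made-up columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales.
X = pd.DataFrame({"age": [20, 30, 40, 50], "income": [1000, 2000, 3000, 4000]})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
print(X_scaled)
```

To keep test data out of training, fit the scaler on the training data only and apply the same fitted scaler to the test data with `scaler.transform`.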

In [None]:
# Train the model.
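
A minimal sketch of fitting a logistic regression baseline. The data comes from `make_classification`, a synthetic stand-in for your own features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (not the hackathon dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression()
model.fit(X, y)
print(model.coef_)
```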

### 5. Calculate the predictions
Calculate the predictions for both `train` and `test` data.

In [None]:
# Calculate the predictions for the train data.
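
A sketch of the two kinds of prediction a fitted classifier offers, again on synthetic data. Whether you need hard labels or probabilities depends on the metric and on the submission format, so treat the choice below as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

train_labels = model.predict(X)              # hard class labels (0 or 1)
train_proba = model.predict_proba(X)[:, 1]   # probability of the positive class
```

Ranking metrics such as AUROC are usually computed from the probabilities rather than from the hard labels.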

In [None]:
# Calculate the predictions for the test data
# and save them in a variable named test_prediction.

In [None]:
# Save the predictions from the test_prediction variable to a csv file.
# Uncomment the code and run it.
#submission = pd.Series(test_prediction,index=test.index, name='id')
#submission.to_csv("submission.csv")

### 6. Model evaluation with the training data
Use the prediction from the training data to evaluate your model. In the hackathon, you'll have access to the training data's labels, but not the test data's labels, so you'll have to use the training data's predictions to evaluate your model during development.

Use the _auroc_ score to evaluate the model. For the sake of training, you can also look at other metrics and think about how well they fit this situation.

In [None]:
# Calculate the roc-auc-score for your model.
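
A small worked example of `roc_auc_score` on four hand-written scores (illustrative values, not model output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probability of class 1

auroc = roc_auc_score(y_true, y_score)
print(auroc)  # 0.75
```

Three of the four (positive, negative) pairs are ranked correctly, which is where the 0.75 comes from.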

In [None]:
# Now check how the model does on the other metrics you learned about.
# Which of them are relevant for this situation?

In [None]:
# You can calculate the confusion matrix.
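
A sketch of reading a confusion matrix for a binary problem, using made-up labels. In scikit-learn, rows are the true classes and columns are the predicted classes.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()  # order for binary labels 0/1
print(cm)
```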

In [None]:
# Plot the ROC curve here and calculate the AUROC score.
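
One way to draw the ROC curve with Matplotlib is sketched below, on the same kind of hand-written scores as an illustration. With a real model you would pass the training labels and the predicted probabilities instead.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (fpr, tpr) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)  # area under the curve via the trapezoidal rule

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="random guess")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.legend()
```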

### 7. Calculate the score with the test data
In this section, we'll test the model predictions for the test data. In the hackathon, you'll generate these predictions and then submit them to the portal, which will compare your prediction with the real labels and give you a score.

To complete this notebook, you'll use the code in the `portal` directory to calculate the score of your test predictions. Run the following cells to calculate your score. You'll need to uncomment some parts.

In [None]:
# Import the code to calculate the score.
from portal.score import load, validate, score

In [None]:
# Load the true labels and your prediction.
# Uncomment and run the code.
#y_true = load("portal/data")
#y_pred = load("submission.csv")

In [None]:
# This function just checks whether the prediction has the correct format.
# Uncomment and run it.
#validate(y_true, y_pred)

In [None]:
# Calculate the auc-roc score for your test prediction.
# Uncomment and run the code.
#score(y_true, y_pred)

### 8. Feature selection
Feature selection is something we looked at only briefly in SLU09's exercise notebook, but you'll learn about it in depth in SLU14. 

The goal of feature selection is to look at how important each feature is for a model's outcome and then retrain the model with only the most important features.

In [None]:
# Get the coefficients of the model and see which ones have the most weight in the model 
# (the highest numbers in absolute terms).
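
A sketch of ranking features by the absolute value of their logistic regression coefficients. The data and the feature names `f0`..`f4` are synthetic. The features are standardized first, an assumption made here so that coefficient sizes are comparable across features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(5)]  # hypothetical names

X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# coef_ has shape (1, n_features) for a binary problem.
importance = (
    pd.Series(model.coef_[0], index=feature_names)
    .abs()
    .sort_values(ascending=False)
)
print(importance)
```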

In [None]:
# Retrain the model with the most important features.
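
Continuing the same idea, the sketch below keeps the two features with the largest absolute coefficients and refits the model. Keeping exactly two features is an arbitrary choice for illustration, and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = [f"f{i}" for i in range(5)]  # hypothetical names
X_df = pd.DataFrame(StandardScaler().fit_transform(X), columns=feature_names)

# Fit on all features and rank them by absolute coefficient.
full_model = LogisticRegression().fit(X_df, y)
importance = pd.Series(full_model.coef_[0], index=feature_names).abs()
top_features = importance.sort_values(ascending=False).head(2).index.tolist()

# Retrain using only the selected columns.
model_top = LogisticRegression().fit(X_df[top_features], y)
print(top_features)
```

Comparing the score of `model_top` with the score of the full model shows whether the dropped features were contributing.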

In [None]:
# If you like, tweak other model parameters.
# Again, this is something that you will learn about in later SLUs.

### 9. Presentation
The final part of the hackathon is to present your solution to the other teams. The presentation should contain all the steps in the analysis, with your justification for the decisions you made. You'll need to use tables or visualizations to support your claims. Think about how you'd present the EDA part of the analysis.

In [None]:
print('I have completed the hackathon training!')