# Basic Workflow

In [None]:
# Always have your imports at the top
import random
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.base import TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

from hashlib import sha1 # just for grading purposes
import json # just for grading purposes

def _hash(obj, salt='none'):
    if type(obj) is not str:
        obj = json.dumps(obj)
    to_encode = obj + salt
    return sha1(to_encode.encode()).hexdigest()

from utils import get_dataset

X, y = get_dataset()  # preloaded dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

## Exercise 1: Workflow steps

What are the basic workflow steps?

The steps are obvious, since you can see them graded in plain text below. However, we think it is worth making you type each one and take a moment to think about it.

Please do actually type them rather than copy-pasting as fast as you can. Typing them out character by character helps you internalize them.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
### BEGIN TESTS
assert step_1 == 'Get the data'
assert step_2 == 'Data analysis and preparation'
assert step_2_a == 'Data analysis'
assert step_2_b == 'Dealing with data problems'
assert step_2_c == 'Feature engineering'
assert step_2_d == 'Feature selection'
assert step_3 == 'Train model'
assert step_4 == 'Evaluate results'
assert step_5 == 'Iterate'
### END TESTS

## Exercise 2: Specific workflow questions

Here are some more specific questions about individual workflow steps.

In [None]:
# Exercise 2.1. True or False, it's super easy to gather your dataset in a production environment
# real_world_dataset_gathering_easy = ...

# Exercise 2.2. True or False, it's super easy to gather your dataset in the context of the academy
# academy_dataset_gathering_easy = ...

# Exercise 2.3. True or False, you should try as hard as you can to get the best possible score
# on your test set by iterating until you can't get your test set score any higher
# by any means possible
# test_set_optimization_is_good = ...

# Exercise 2.4. True or False, you should choose one metric by which to evaluate your model and
# never consider using another one
# one_metric_should_rule_them_all = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
### BEGIN TESTS
assert _hash(real_world_dataset_gathering_easy, 'salt1') == '63b5b9a8f2d359e1fc175c3b01b907ef87590484'
assert _hash(academy_dataset_gathering_easy, 'salt2') == 'dd7dee495a153c95d28c7aa95289c0415242f5d8'
assert _hash(test_set_optimization_is_good, 'salt3') == 'f24a294afb4a09f7f9df9ee13eb18e7d341c439d'
assert _hash(one_metric_should_rule_them_all, 'salt4') == '2360691a582e4f0fbefa238ab6ced1cbfbfe8a50'
### END TESTS

## Scikit Pipelines

We've already loaded and split a dataset for the following exercises. The pieces are stored in the `X_train`, `X_test`, `y_train` and `y_test` variables.

In a perfect world, where all your data is clean and ready to go, you can build your pipeline with just Scikit-learn's built-in Transformers. In the real world, that's rarely the case, and you'll need to create custom Transformers to get the job done. Take a look at the dataset. What do you see?
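If you are not sure where to start, here is a minimal sketch of a first look at a DataFrame. It runs on a small made-up frame (the `toy` variable and its column names are hypothetical and have nothing to do with the exercise dataset), so adapt it to `X_train` yourself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000.0, 62000.0, np.nan, np.nan],
    "noise_1": ["x9$k", "qq#1", "zz!7", "p0&d"],
})

# How many values are missing in each column?
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Which columns are not numeric? These often deserve a closer look.
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)

# A quick peek at the first rows
print(toy.head())
```

Counting missing values and listing non-numeric columns is usually enough to spot the most obvious data problems before you design any pipeline step.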

In [None]:
# Do some data analysis here


While crunching your data, you probably found two issues:

1. Three columns whose names start with `evil` are filled with gibberish
2. Some columns have missing values

So, first things first, let's get rid of those columns with a Custom Transformer, so we can plug it into a Scikit-learn Pipeline afterwards.
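As a reminder of what a Custom Transformer looks like, here is a minimal sketch of one that solves a different problem: dropping columns that hold a single constant value. The class name `DropConstantColumns` and the toy data are hypothetical. The only parts that come from Scikit-learn are the `fit`/`transform` convention and the `BaseEstimator` and `TransformerMixin` base classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropConstantColumns(BaseEstimator, TransformerMixin):
    """Hypothetical example transformer: drops columns with one unique value."""

    def fit(self, X, y=None):
        # Learn which columns to drop from the data seen during fit
        self.columns_to_drop_ = [
            col for col in X.columns if X[col].nunique(dropna=False) <= 1
        ]
        # fit must return self so that calls can be chained
        return self

    def transform(self, X):
        # Return a new DataFrame and leave the input untouched
        return X.drop(columns=self.columns_to_drop_)


toy = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7]})
result = DropConstantColumns().fit_transform(toy)
print(list(result.columns))
```

`TransformerMixin` supplies `fit_transform` for free, which is why the class only defines `fit` and `transform`. A transformer that needs no information from the training data can simply return `self` from `fit`.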

## Exercise 3: Custom Transformer

In [None]:
# Create a pipeline step called RemoveEvilColumns that removes any
# column whose name starts with the string 'evil'

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
### BEGIN TESTS
assert _hash(sorted(RemoveEvilColumns().fit_transform(X).columns), 'salt5') == '71443dfc3077d773d4c74e958dadf91dc2cc148a'
assert _hash(list(map(lambda col: col.startswith('evil'), RemoveEvilColumns().fit_transform(X_train).columns)), 'salt6') == 'ce45cf3759d2210f2d1315f1673b18f34e3ac711'
### END TESTS

Now that we have our Custom Transformer in place, we can design our pipeline. For the sake of the exercise, you'll want to create a pipeline that:

1. Removes evil columns
2. Imputes missing values with the mean
3. Has a Random Forest Classifier as the last step

You may use `make_pipeline` to create your pipeline with as many steps as you want, as long as the first step is the Custom Transformer you developed previously, the second step is a `SimpleImputer`, and the last step is a `RandomForestClassifier`.
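To see how `make_pipeline` behaves, here is a small sketch with a different set of steps and made-up data (the `toy_X`, `toy_y` and `toy_pipeline` names are hypothetical). It is not the pipeline the exercise asks for; it only shows how steps are chained and how `make_pipeline` names them.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data with a couple of missing values
toy_X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0], [4.0, 7.0]])
toy_y = np.array([0, 0, 1, 1])

# Steps run in the order given; the last one is the estimator
toy_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    MinMaxScaler(),
    LogisticRegression(),
)

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)

toy_pipeline.fit(toy_X, toy_y)
toy_predictions = toy_pipeline.predict(toy_X)
```

Because the names are generated automatically, you only pass the step objects themselves. If you want to choose the names, the `Pipeline` class takes explicit `(name, step)` tuples instead.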

## Exercise 4: Scikit Pipelines

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
### BEGIN TESTS
assert _hash(pipeline.steps[0][0], 'salt7') == 'a7b78d9fa6c8307ff8cdf5d99be86cecb007a8d1'
assert _hash(pipeline.steps[1][0], 'salt8') == 'ca83eaea1a7e243fa5574cfa6f52831166ee0f32'
assert _hash(pipeline.steps[-1][0], 'salt9') == '0d66ba4309ad4939673169e74f87088dcadd510b'
### END TESTS

Does it work? Let's check it out on our dataset!

In [None]:
pipeline.fit(X_train, y_train)

In [None]:
y_pred = pipeline.predict(X_test)
accuracy_score(y_test, y_pred)

That's it for this Exercise Notebook! It doesn't get much cleaner than this, does it?

You can keep experimenting with pipelines, for example by adding a few more steps. See how changing your pipeline affects the predictions.
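One possible way to compare pipeline variants is sketched below. It assumes synthetic data from `make_classification` and uses `cross_val_score`, neither of which is part of this notebook, and the variant names are hypothetical. Treat it as a starting point rather than the recommended setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data, only for illustration
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Two hypothetical variants that differ by a single step
variants = {
    "baseline": make_pipeline(
        SimpleImputer(strategy="mean"),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ),
    "with_scaling": make_pipeline(
        SimpleImputer(strategy="mean"),
        MinMaxScaler(),
        RandomForestClassifier(n_estimators=50, random_state=0),
    ),
}

# Mean 5-fold cross-validated accuracy for each variant
scores = {
    name: cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="accuracy").mean()
    for name, pipe in variants.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Since each variant is a single pipeline object, swapping a step in or out is a one-line change, and the whole chain is refit inside every cross-validation fold.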

Can you see how Scikit-learn's Pipelines might save time? Can you imagine how useful that would be in stressful situations (like, *for example*, a hackathon)?