# SLU16 - Workflow: Learning notebook 

In this notebook we will be covering the following:

* Workflow
 * Step 1: Get the data
 * Step 2: Data analysis and preparation
   * 2.1 Data analysis
   * 2.2 Dealing with data problems
   * 2.3 Feature engineering
   * 2.4 Feature selection
 * Step 3: Model training
 * Step 4: Evaluate results
* Pipelines and Custom Objects
    * Pipelines 
        * Doing it "the hard way"
        * What is a pipeline
        * Setting up a pipeline
    * Custom Objects
        * Custom Transformers
        * Custom Estimators
* Workflow tips and tricks
  * 1.Establish a simple baseline FAST
  * 2.Incrementally increase complexity
  * 3.Use (and abuse) Scikit pipelines
* Some advice for working in hackathon teams


## Why learn workflow?

The goal for this SLU is **to establish the common steps and tools that you'll use to keep your data science workflow tight and efficient**.

Soon you'll get a new dataset, either in an SLU/BLU or in a hackathon, and you'll find yourself asking _"**where and how do I begin?**"_. But fear not, this SLU will be your best friend!

**Data Science is largely an engineering discipline**. Writing code is an engineering practice and most data science is done with code these days. Nailing down a **workflow** and **how to express it in code** is one of the key skills you'll need. It will make your life an order of magnitude easier, and more importantly, **it makes your data science more responsible**.

You don't want the following:

<img src="media/xkcd-machine-learning.jpg" width="300" />

Before we get started, I need to take a dig at Jupyter:

## Jupyter is a terrible development environment

**Because it isn't one!** A development environment is centered around *being able to organize your code in an effective way*. Jupyter is made primarily for *rapid prototyping and communication*, **not** software engineering. 

There are going to be *significant drawbacks* when it comes to organizing your code, and you will need to be extremely careful about following best practices, because Jupyter won't do any of it for you the way that a real IDE would.

## So why are we using Jupyter?

Because our primary task in this academy is not to teach you how to be software engineers. It's to help you learn how to prototype and communicate as data scientists.

## Do not use Jupyter in production

Don't use Jupyter notebooks in production. **Write code in real .py files that can be tested, properly tracked, diffed, imported into other code, linted in CI/CD, and viewed in any editor**, among a million other advantages.

# Workflow

Let's start by listing the steps that are needed in any data science project:

## Step 1: Get the data

In a real live environment, this step could *literally* take months. It depends on the organization, who guards the data, how well the data itself is known, what format it is in, as well as a million other factors. Throughout the academy we will be largely skipping this step, with the exception of the **Data Wrangling Specialization**. We will be handing you *nice and tidy* CSVs that you will be able to bring into your experiment with a single simple function call.

I would love to expand a bit more upon the substeps involved in this step but they vary so much in practice that the only thing I can say for certain is that *it will involve a lot of meetings* and will likely result in reading from a system that behaves a bit like the following:

<img src="media/xkcd-data-pipeline.png" width="600" />

## Step 2: Data analysis and preparation

This step has better-defined substeps than the previous one. In general you'll hit the following steps:

1. Data analysis
1. Dealing with data problems
1. Feature engineering
1. Feature selection

> Sounds familiar? 
### 2.1 Data analysis

You've already learned quite a bit about how to do Data Analysis. In [SLU01](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU01%20-%20Pandas%20101), [SLU02](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU02%20-%20Subsetting%20Data%20in%20Pandas), [SLU03](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU03%20-%20Visualization%20with%20Pandas%20%26%20Matplotlib), [SLU04](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU04%20-%20Basic%20Stats%20with%20Pandas), and [SLU05](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU05%20-%20Covariance%20and%20Correlation) you have a nice pile of tools that you can use to get a feel for the type of data that you are dealing with. **Use them until you feel comfortable enough that you could confidently describe the most important characteristics of the data set you are working with**.

<img src="media/xkcd-quality-data-analysis.png" width="600"/>

### 2.2 Dealing with data problems

As you've seen in [SLU06](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU06%20-%20Dealing%20with%20Data%20Problems), our data analysis will certainly uncover data problems. Some of these data problems you may be able to deal with by **manipulating the dataset directly**. Others you may need to **make part of a pipeline** (we'll dive into this in a bit).

An example of the first type of data problem is *changing numbers that are stored as strings in a CSV into actual numbers*. An example of something that you might want to *integrate into your pipeline* is filling in *NaNs*, so that you can experiment with imputation strategies.
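
As a minimal sketch of the first kind of fix, here is how numeric strings can be converted with `pd.to_numeric`. The toy frame and its column names (`fare`, `age`) are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data: numbers that arrived as strings, plus one bad entry
raw = pd.DataFrame({
    "fare": ["7.25", "71.28", "unknown"],
    "age": ["22", "38", "26"],
})

clean = raw.copy()
# errors="coerce" turns anything that cannot be parsed into NaN instead of raising
clean["fare"] = pd.to_numeric(clean["fare"], errors="coerce")
clean["age"] = pd.to_numeric(clean["age"])

print(clean.dtypes)
```

The unparseable entry becomes a NaN, which is exactly the kind of problem you would then hand over to an imputation step in a pipeline.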

In any case, the first time someone delivers you a dataset, the experience is likely to be very much like the following:

<img src="media/xkcd-dirty-data.png" width="200"/>

### 2.3 Feature engineering

Once you've got some clean data and have a benchmark model as a reference you may want to create some new features out of the existing features, as you've seen in [SLU12](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU12%20-%20Feature%20Engineering). A classic example of this would be to *create a debt to income ratio feature for credit risk* by simply dividing the debt of a person by some measure of their income.

You will likely iterate on this step several times.
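
The debt to income example could look roughly like this. It is only a sketch: the `applicants` frame and its `debt` and `income` columns are hypothetical.

```python
import pandas as pd

# Hypothetical applicants; the column names are illustrative only
applicants = pd.DataFrame({
    "debt": [5000.0, 12000.0, 0.0],
    "income": [25000.0, 30000.0, 40000.0],
})

# New feature: how large the debt is relative to the income
applicants["debt_to_income"] = applicants["debt"] / applicants["income"]

print(applicants)
```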

### 2.4 Feature selection

You can do *feature selection* in a few different stages:
* Right at the beginning when you can remove features that you **know for sure** should not be in there.
  * Examples of these are features that are all unique, all one value, are leakage, or are disallowed by law. 
* In a later stage after data processing
  * For example if you found out that a given feature is redundant, or doesn't have any predictive power.

You will likely iterate on this step several times.
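
A rough sketch of the "right at the beginning" checks, assuming a hypothetical toy frame: `nunique` flags columns that hold a single value and columns that are unique on every row. Treat the result as a prompt for inspection rather than an automatic rule, since a continuous numeric feature can also be unique on every row.

```python
import pandas as pd

# Hypothetical toy data: an ID column, a constant column, and a useful one
df = pd.DataFrame({
    "passenger_id": [1, 2, 3, 4],
    "ship": ["Titanic"] * 4,
    "age": [22.0, 38.0, 26.0, 38.0],
})

n_unique = df.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()          # all one value
all_unique_cols = n_unique[n_unique == len(df)].index.tolist()  # one value per row

selected = df.drop(columns=constant_cols + all_unique_cols)
print(selected.columns.tolist())
```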

## Step 3: Model training

After [SLU07](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU07%20-%20Regression%20with%20Linear%20Regression), [SLU09](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU09%20-%20Classification%20with%20Logistic%20Regression) and [SLU11](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU11%20-%20Tree-Based%20Models), you know the drill here. Based upon **the attributes of the problem at hand** (binary classification, multi-class classification, supervised, unsupervised, regression, etc.), **choose a few different types of models to experiment with**. 

> Note that you should start as simple as possible in order to keep your complexity under control.

I'll also take the opportunity to, as explained in [SLU14](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU14%20-%20Model%20complexity%20and%20Overfitting), **stress the importance of creating a training and test set**. And never mix the two. Ever.
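
When the data does not come pre-split, one common way to create the two sets is `train_test_split`. The sketch below uses synthetic data; the `test_size`, `random_state` and `stratify` choices are illustrative assumptions, not recommendations from this SLU.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X_all = np.arange(40).reshape(20, 2)
y_all = np.array([0, 1] * 10)

# Hold out 25% of the rows; stratify keeps the class balance in both parts
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42, stratify=y_all
)

print(X_fit.shape, X_holdout.shape)
```

Everything that learns from data (imputation values, scalers, models) should only ever see `X_fit` and `y_fit`.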

<img src="media/not-xkcd-model-training.png" width="300"/>

## Step 4: Evaluate results

You've *properly separated training and test data*, *fitted your model*, and *made some predictions on your test set*. Now, depending once again on the type of problem, **you need to select a metric or set of metrics to understand how your model is performing** (*hint: [SLU08](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU08%20-%20Metrics%20for%20Regression) and [SLU10](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU10%20-%20Metrics%20for%20Classification)*). This is also a great time to use learning curves!
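
As a small illustration of looking at more than one metric, here are three classification metrics computed on made-up labels. The values of `y_true` and `y_pred` are invented for this sketch.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels and predictions for a binary problem
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

accuracy = accuracy_score(y_true, y_pred)    # 3 of the 4 predictions are correct
precision = precision_score(y_true, y_pred)  # 2 of the 3 predicted positives are right
recall = recall_score(y_true, y_pred)        # both actual positives are found

print(accuracy, precision, recall)
```

The same predictions look quite different depending on the metric, which is why the choice should follow from the problem.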

Try not to suffer from too much **tunnel vision** here when trying to **optimize a single test set on a single metric**. That will be tough, *especially since the nature of the hackathons in the course is actually all about doing just this...* However, in the real world, when you put a model into production you won't have the luxury of knowing what your test set will look like. **Be properly skeptical and be aware of your model's characteristics**.

Remember...

> **Just because something has never happened doesn't mean it may never happen.** 

A model that is **overfitted on your training set** is blissfully unaware of this (as seen in [SLU14](https://github.com/LDSSA/batch5-students/tree/main/S01%20-%20Bootcamp%20and%20Binary%20Classification/SLU14%20-%20Model%20complexity%20and%20Overfitting)). Keep assumptions to a minimum and you'll fail more gracefully when previously unseen things happen.

<img src="media/xkcd-unseen-data.png" width="600"/>

# Pipelines and Custom Objects

## Pipelines

Data cleaning and preparation is easily the most time-consuming and boring task in data science. Machine learning algorithms can be really fussy: some want normalized or standardized features, some want encoded variables, and some want both. Not to mention the missing data...


Repeating the same cleaning operations on the training, validation and test sets is not only time-consuming, it also opens the door to errors and inconsistencies. Fortunately, **[Scikit-learn’s Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is a major productivity tool that makes data cleaning and handling easier, keeps code clean and readable, and collapses all preprocessing and modeling steps into a single line of code.**

I can summarize why you should adopt Pipelines in three main points:
* conciseness, 
* consistency, and 
* ease of use.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from IPython.display import Image
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.impute import SimpleImputer

Let's start off with a bit of motivation by looking at the *titanic dataset* where we will
* drop all categorical features
* fill the nulls on the rest with the median

### Doing it "the hard way"

Let's load all the data we need. We already had the data split into *train* and *test*:

In [2]:
train_df = pd.read_csv('data/titanic.csv')
X_train, y_train = train_df.drop('Survived', axis=1), train_df.Survived.copy()
X_test = pd.read_csv('data/titanic-test.csv')

Now let's *preprocess the data* and *train a simple random forest classifier*:

In [3]:
X_train_clean = X_train.select_dtypes(exclude='object').copy()
# note that you will want to impute with the median age from the training set
# and NOT the test set. This creates a few difficulties when trying to design
# around it
X_train_clean['Age'] = X_train_clean.Age.fillna(X_train_clean.Age.median())

clf = RandomForestClassifier(n_estimators=10)
clf.fit(X_train_clean, y_train)

RandomForestClassifier(n_estimators=10)

Then to test, we will need to do the same set of preprocessing:

In [4]:
X_test_clean = X_test.select_dtypes(exclude='object').copy()
X_test_clean['Age'] = X_test_clean.Age.fillna(X_train_clean.Age.median())
# It turns out that the test set has nulls in Fare, a column that had
# no nulls in the training set, so the preprocessing has to be a bit
# different: fill them with the median Fare from the training set.
X_test_clean['Fare'] = X_test_clean.Fare.fillna(X_train.Fare.median())

preds = clf.predict_proba(X_test_clean)[:, 1]
preds[:5] # printing first five predictions

array([0.3, 0.1, 0.1, 0.2, 0.6])

It's totally true that we could write a few functions to take care of this, but **scikit-learn already provides [pipelines](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) that address this exact problem in a much cleaner way**.

### What is a pipeline

It's pretty simple: 

> **it's a set of steps that has a model at the end of it.**

It implements the same API as the models (has `predict` and/or `predict_proba`) but it applies each of the steps before calling the model with the input!
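
To make "applies each of the steps before calling the model" concrete, here is a minimal sketch on synthetic data. The `StandardScaler` and `LogisticRegression` steps are assumptions chosen for illustration; they are not the steps used in the rest of this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
demo_pipeline.fit(X_demo, y_demo)

# What predict does under the hood: transform with each step, then call the model
scaler = demo_pipeline.named_steps["standardscaler"]
model = demo_pipeline.named_steps["logisticregression"]
manual_preds = model.predict(scaler.transform(X_demo))

same = np.array_equal(demo_pipeline.predict(X_demo), manual_preds)
print(same)
```

`make_pipeline` names each step after its lowercased class name, which is why the steps can be looked up in `named_steps` this way.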

### Setting up a pipeline

Let's see how to use a pipeline with the same dataset. We'll be doing preprocessing and model fitting as in the code we just looked at for the titanic dataset:

In [5]:
# load train and test data
train_df = pd.read_csv('data/titanic.csv')
X_train, y_train = train_df.drop('Survived', axis=1), train_df.Survived.copy()
X_test = pd.read_csv('data/titanic-test.csv')

Now let's make the pipeline

In [6]:
pipeline = make_pipeline(
    # it's cool how scikit already has an imputer ready to go!
    # (note: this one uses the mean, while the manual version used the median)
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=10)
)
pipeline

Pipeline(steps=[('simpleimputer', SimpleImputer()),
                ('randomforestclassifier',
                 RandomForestClassifier(n_estimators=10))])

The goal of using a pipeline is so that there's no need for us to *manually* preprocess the X_train and X_test! Fitting and predicting turns into just two lines of code! 

As you can see, the pipeline contains all the steps we've done before, with the exception of the *feature selection step* where we dropped the non-numeric features.  Let's try to run our pipeline.

In [7]:
try:
    # fit the pipeline with the train data!
    pipeline.fit(X_train, y_train)

    # make predictions with the test data!
    probas = pipeline.predict_proba(X_test)
except ValueError as e:
    print(e)

Cannot use mean strategy with non-numeric data:
could not convert string to float: 'Braund, Mr. Owen Harris'


As you can see, our pipeline, although very easy to read and concise, is currently not functioning, because the imputer that we've chosen requires numeric data.

We could follow the same approach as before, where we manually drop the non-numeric features, but that defeats the purpose of using a pipeline! 

Fortunately there is a *very easy* solution...

## Custom Objects

In some cases, as in the example above, we will want to create our own pipeline step. **This provides a lot of flexibility!**

Thankfully, sklearn allows you to create your own **custom steps** to integrate into your pipeline. 

They can be of the following type (adapted from [the official sklearn documentation](https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator)):

<img src="media/sklearn_objects.png" width="800"/>

**Predictors** are already familiar to you: they are the objects with a `predict` (and often a `predict_proba`) method. You can create not only your own custom predictors, but also custom **models**, which return a *score*, and custom **estimators**, which learn some characteristic of the data (for example its distribution) when you call `fit`. Throughout the academy, by far the most useful will be **custom transformers**. They allow you to modify the input data as one of the steps of your pipeline.

### Custom Transformers


Let's create our own pipeline step, called `RemoveObjectColumns`, to exclude any data of the type `object`.
We'll use sklearn [TransformerMixin](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) for this by creating a subclass of it to automatically fit to data, and then transform it.

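There are 3 methods to take care of here:
* `__init__`: the constructor, called when the step is created (before it is handed to the pipeline). Our transformer takes no parameters, so we can rely on the default one.
* `fit()`: called when we fit the pipeline.
* `transform()`: called when we fit the pipeline (to pass the transformed data on to the next step) and whenever we predict or transform with it.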

In [8]:
class RemoveObjectColumns(TransformerMixin):
    
    def transform(self, X, *_):
        return X.select_dtypes(exclude='object').copy()
    
    def fit(self, *_):
        return self

Let's add this custom transformer to our pipeline.

In [9]:
pipeline = make_pipeline(
    RemoveObjectColumns(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=10)
)
pipeline

Pipeline(steps=[('removeobjectcolumns',
                 <__main__.RemoveObjectColumns object at 0x7fb9a91717d0>),
                ('simpleimputer', SimpleImputer()),
                ('randomforestclassifier',
                 RandomForestClassifier(n_estimators=10))])

And fit our train data and predict on our test data.

In [10]:
pipeline.fit(X_train, y_train)
probas = pipeline.predict_proba(X_test)
preds = probas[:, 1]
preds[:5]

array([0. , 0.1, 0.2, 0.2, 0.3])

And that's it! Most transformers we design will inherit from the `BaseEstimator` and/or `TransformerMixin` classes, as they give us pre-existing methods for free. You can read more about them [here](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) and [here](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html).

If you need some extra help in understanding custom transformers and pipelines, [this](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156) offers some great examples!
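
To see what "methods for free" means in practice, here is a hedged sketch of a transformer that inherits from both classes. `DropColumnsByDtype` is a hypothetical name invented for this example, and remembering the kept columns in `fit` is a design choice of the sketch, not something the notebook's `RemoveObjectColumns` does.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropColumnsByDtype(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: drops every column of a given dtype."""

    def __init__(self, dtype="object"):
        # Only store the parameter; BaseEstimator reads it back for get_params
        self.dtype = dtype

    def fit(self, X, y=None):
        # Remember which columns to keep, based on the training data only
        self.columns_ = X.select_dtypes(exclude=self.dtype).columns.tolist()
        return self

    def transform(self, X):
        return X[self.columns_].copy()


# Toy frame with one explicit object column
toy = pd.DataFrame({
    "name": pd.Series(["a", "b"], dtype=object),
    "age": [22.0, 38.0],
    "fare": [7.25, 71.28],
})

step = DropColumnsByDtype()
out = step.fit_transform(toy)  # fit_transform is provided by TransformerMixin
params = step.get_params()     # get_params is provided by BaseEstimator

print(out.columns.tolist(), params)
```

Because the kept columns are learned in `fit`, the test set always ends up with the same columns as the training set, even if its dtypes happen to differ.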

### Custom Estimators

Although not used as often as the custom transformer, the **estimator** is an object that **fits a model based on some training data and is capable of inferring some properties on new data**. It can be, for instance, a classifier or a regressor. All estimators implement the `fit` method and also have a `set_params` method, which sets data-independent parameters. A custom estimator should inherit from [`sklearn.base.BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html).

Check out the section of the scikit-learn developer guide called [rolling your own estimator](https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator) for the official explanation of exactly how to do this, and the [project template repo](https://github.com/scikit-learn-contrib/project-template/), which includes several examples of [custom estimators](https://github.com/scikit-learn-contrib/project-template/blob/master/skltemplate/_template.py).
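
As a rough illustration only, a very small custom classifier might look like the sketch below. `ThresholdClassifier` is a hypothetical toy invented for this example: it assumes a binary problem in which the second class has the larger values of the chosen feature, and it skips the input validation that the official guide recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier: one feature, one learned cut-off."""

    def __init__(self, column=0):
        self.column = column  # data-independent parameter

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        # Learned from data: midpoint between the class means of the feature
        means = [X[y == label, self.column].mean() for label in self.classes_]
        self.threshold_ = float(np.mean(means))
        return self

    def predict(self, X):
        X = np.asarray(X)
        above = X[:, self.column] > self.threshold_
        return np.where(above, self.classes_[1], self.classes_[0])


X_toy = np.array([[1.0, 9.0], [2.0, 8.0], [8.0, 2.0], [9.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

clf_toy = ThresholdClassifier().fit(X_toy, y_toy)
toy_accuracy = clf_toy.score(X_toy, y_toy)  # score comes from ClassifierMixin
clf_toy.set_params(column=1)                # set_params comes from BaseEstimator

print(toy_accuracy, clf_toy.get_params())
```

Attributes learned in `fit` end with an underscore (`classes_`, `threshold_`), which is the scikit-learn convention for separating them from constructor parameters.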

# Workflow tips and tricks

## 1. Establish a simple baseline FAST

We've already mentioned this a few times but it deserves its own section.

> **Run as quickly as you can toward a simple baseline, no matter how simple it may be.** 

For the specific problem you're working on, **it's the data that is important** and **even the simplest model will give you an idea as to whether or not it has signal**.
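
One possible "as simple as it gets" baseline is scikit-learn's `DummyClassifier`, sketched here on made-up labels. Any real model should beat this number before it earns extra complexity.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up data: the features are ignored by this baseline
X_base = np.zeros((6, 1))
y_base = np.array([0, 0, 0, 0, 1, 1])

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_base, y_base)
baseline_accuracy = baseline.score(X_base, y_base)

print(baseline_accuracy)
```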

## 2. Incrementally increase complexity

Take your super simple baseline model and increase complexity **a little bit at a time**. Like any responsible scientist, *you don't want to be changing more than 1 variable at a time when running experiments*.

## 3. Use (and abuse) Scikit pipelines & Custom transformers

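Sooner rather than later, you will run into the problem of having to do **duplicate pre-processing** for a *training and a test set* or for *different folds in cross validation*. This can be a huge pain in the butt and can result in *duplicated code* or *overly complex functions that have a crazy amount of arguments*. Pipelines and custom transformers avoid this by packaging the preprocessing and the model into a single object that is fitted on the training data and then applied unchanged everywhere else.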
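
For the cross validation case, a sketch under assumed settings (synthetic data, median imputation, logistic regression, 5 folds) could look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with about 10% of the values knocked out, for illustration only
X_cv, y_cv = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X_cv[rng.random(X_cv.shape) < 0.1] = np.nan

cv_pipeline = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())

# The imputer is re-fit on the training part of every fold,
# so the held-out part never influences the imputed values
scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5)

print(scores.mean())
```

Doing the same by hand would mean repeating the imputation code once per fold.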

# Some advice for working in hackathon teams

### General advice

- Aim to make a submission as early as possible (baseline model)
- During the EDA, make sure to output some plots and save them - they will be helpful to build your presentation
- Try to keep a "pipeline" for your code, from the beginning to the end. Do not rely on successively editing the same DataFrame object in place, or you might end up unable to re-run your code.
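
One way to follow the last point is to keep the raw data untouched and put every preparation step inside a function that returns a new DataFrame. The `prepare` function and the `age` column below are hypothetical and only meant as a sketch.

```python
import pandas as pd


def prepare(raw):
    """Return a cleaned copy; the raw frame is left untouched."""
    df = raw.copy()
    df["age"] = df["age"].fillna(df["age"].median())
    df["is_child"] = df["age"] < 18
    return df


raw_df = pd.DataFrame({"age": [10.0, None, 40.0]})
prepared = prepare(raw_df)

# raw_df still has its missing value, so prepare(raw_df) can be re-run at any time
print(raw_df["age"].isna().sum(), prepared["age"].tolist())
```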


### Advice for working in teams

How should you split the work? Should everyone work on their own notebook? Should you keep a single notebook? What is the best strategy?

Our advice is to keep a "main" notebook that everyone has access to. Nominate a "guardian" of that notebook. Work locally on small problems, starting from the "main" notebook, and make sure that everyone on the team knows which problem you are attacking. Once you are happy with the solution, add it to the main notebook and make sure everyone knows it has been updated.

Also, set time limits for tasks. For instance: "now, everyone has 40 minutes to explore these variables, and we talk again afterwards to share our findings". Time goes by fast!

## Wrapping up

Keep this notebook open and reference it regularly, especially when you are doing your first few hackathons. The first few times you get a new dataset and it's 100% up to you to make all decisions about the steps to take, it will be VERY easy to skip important steps, which will lead you to have much less fun than you deserve!

For lots of additional advice on how to organize your code in your notebooks, check out the Examples notebook, which has lots of tips mostly focused on writing well-organized code.