Modeling components
-------------------------

In modeling there are 3 major components:
1. **Data**
   
   We need data to build models. Roughly 90% of the time spent building models goes into preparing and cleaning data.
   Data rarely comes from a single source; you often have to combine several sources of data:
   - relational databases
   - internet
   - human experts <br/><br/>
   
2. **Algorithm**

   After the data is prepared, you can start trying different predictive algorithms.<br/><br/>

3. **Validation**

   Validation is important because:
   - you want to know how your model will perform on unseen data
   - you want to be able to compare different models
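
Both goals can be illustrated with cross-validation. The snippet below is a minimal sketch, assuming scikit-learn is the modeling library; the synthetic dataset and the two candidate models are illustrative choices, not part of the original material. Every candidate is scored on the same folds, so the numbers are comparable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real, prepared dataset (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A fixed splitter gives every candidate exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy on held-out folds estimates performance on unseen data.
results = {
    name: cross_val_score(estimator, X, y, cv=cv).mean()
    for name, estimator in candidates.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```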

Abstraction
------------------

You want to create an abstraction that lets you test models quickly and makes the results reproducible.

The most important thing is to define a framework for your work and establish some ground rules.

Make your framework composable from the 3 components.

Don't create functions like:

```python
SVM_5_fold_cv_v1()
```

It is much better to create one abstract function or class that handles model creation:

```python
model = train_model(data, model_pipeline, cross_validation_method)
```
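
One possible shape for such a function is sketched below, assuming scikit-learn. The original only shows the call, so everything here is an assumption: `data` is taken to be an `(X, y)` tuple, and the function returns the cross-validation scores alongside the fitted model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def train_model(data, model_pipeline, cross_validation_method):
    """Score the pipeline with the given splitter, then fit it on all data.

    Assumption: `data` is an (X, y) tuple and the scores are returned
    together with the fitted pipeline.
    """
    X, y = data
    # cross_val_score fits clones, so the pipeline passed in stays unfitted here.
    scores = cross_val_score(model_pipeline, X, y, cv=cross_validation_method)
    model_pipeline.fit(X, y)
    return model_pipeline, scores


# Each of the 3 components is a separate, swappable argument.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)

model, scores = train_model((X, y), pipeline, cv)
print(scores.mean())
```

Swapping the SVM for another algorithm, or 5-fold for another validation scheme, then means changing an argument rather than writing a new function.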

Advantages of thinking in "pipelines"
---------------

### 1 Lego vs Spaghetti


<table style="border: 0px; color:white;"><tr><td><img src="img/lego.jpg"/></td> <td><img src="img/spaghetti.jpg"/></td></tr></table>

**Spaghetti processing**

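```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def prepare_data(X):
    X = X.drop("dummy", axis=1)  # X is a pandas DataFrame
    X_t = StandardScaler().fit_transform(X)
    X_t = PCA().fit_transform(X_t)
    return X_t
```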

**Pipeline processing**

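```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = make_pipeline(
    RemoveColumn("dummy"),
    StandardScaler(),
    PCA()
)
```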

The same number of lines of code, but which one do you prefer?

### 2 Reusable transformations

<img src="img/reusable bag.jpg"/>

```RemoveColumn``` is a transformer class: once defined, you can reuse it in any pipeline.
Once you start working this way, you'll soon realize that many of the steps you perform are repeatable.
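
The original does not show how `RemoveColumn` is defined. A minimal sketch, assuming scikit-learn's `BaseEstimator` / `TransformerMixin` base classes and a pandas DataFrame as input, could look like this:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RemoveColumn(BaseEstimator, TransformerMixin):
    """Drop one named column from a pandas DataFrame (illustrative sketch)."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; return self so the step can be chained.
        return self

    def transform(self, X):
        # drop returns a new DataFrame, so the input is left untouched.
        return X.drop(self.column, axis=1)


df = pd.DataFrame({"dummy": [0, 0, 0], "a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
cleaned = RemoveColumn("dummy").fit_transform(df)
print(list(cleaned.columns))  # ['a', 'b']
```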

### 3 Managing complexity

<img src="img/complexity.jpg"/>

A chain of transformers is itself a transformer. From simple blocks you can build more complex transformers.
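
A short sketch of this nesting, assuming scikit-learn; the random data and the choice of `LogisticRegression` as the final step are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small pipeline made of two simple blocks...
preprocessing = make_pipeline(StandardScaler(), PCA(n_components=2))

# ...is used as a single step of a larger pipeline.
model = make_pipeline(preprocessing, LogisticRegression())

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 4))
y = (X[:, 0] > 0).astype(int)

# The inner pipeline works on its own as a transformer.
reduced = preprocessing.fit_transform(X)
print(reduced.shape)  # (50, 2)

# The outer pipeline fits every nested step with one call.
model.fit(X, y)
```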

### 4 Easier to test

<img src="img/testing.jpg"/>

Because transformers are independent of the rest of your process, you can test them in isolation.
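
As a sketch of what such a test might look like: `CenterColumns` below is a hypothetical transformer invented for this example, and the tests are plain functions with `assert` statements that a runner such as pytest could collect.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class CenterColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: subtract the column means learned in fit."""

    def fit(self, X, y=None):
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.means_


def test_centered_training_data_has_zero_mean():
    X = np.array([[1.0, 10.0], [3.0, 30.0]])
    centered = CenterColumns().fit_transform(X)
    assert np.allclose(centered.mean(axis=0), 0.0)


def test_transform_reuses_training_means():
    transformer = CenterColumns().fit(np.array([[1.0, 10.0], [3.0, 30.0]]))
    # The means learned in fit are [2, 20]; new data is shifted by those.
    shifted = transformer.transform(np.array([[2.0, 20.0]]))
    assert np.allclose(shifted, [[0.0, 0.0]])


# No data loading, model, or validation code is needed to run the tests.
test_centered_training_data_has_zero_mean()
test_transform_reuses_training_means()
```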

### 5 One point of control

<img src="img/console.jpg"/>

You can define and control your whole process from a single point.
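
With scikit-learn, one way to get this single point of control is the `step__parameter` naming that `Pipeline` exposes through `get_params` and `set_params`. The step names and parameter values below are illustrative assumptions:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The whole process is declared in one place, with a name for each step.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA()),
    ("classify", LogisticRegression()),
])

# Any setting of any step is reachable as <step name>__<parameter>.
pipeline.set_params(reduce__n_components=2, classify__C=0.1)

params = pipeline.get_params()
print(params["reduce__n_components"], params["classify__C"])  # 2 0.1
```

The same naming scheme is what `GridSearchCV` uses in its parameter grid, so a hyperparameter search can cover preprocessing and model settings together.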