## Basic Pipeline Example

This notebook shows how to use the basic pipeline functions of the Cortex Python SDK.
The example modifies and enriches a dataset to make it suitable for training or modeling.
Data is modified in a sequential series of steps. Install `cortex-python`, adding `cortex-python[builders]` for builder functionality and `cortex-python[viz]` for visualizations.

**NOTE**: This example requires `cortex-python` and `pandas` to be installed in your environment, for example:
> `pip install cortex-python[builders] pandas`

In [None]:
# Import Cortex and other required libraries
import math
from cortex import Cortex

# Create a Builder instance
cortex = Cortex.local()
builder = cortex.builder()

In the next step, create a dataset and populate it from a comma-separated values (CSV) file. A pipeline operates on a dataset.

In [None]:
data_set = builder.dataset('example/forest_fires').title('Forest Fire Data')\
    .from_csv('./data/ff.sample.csv').build()
# Create a pandas DataFrame to view the last few lines of the dataset
data_frame = data_set.as_pandas()
data_frame.tail(20)

A dataset can have one or more named pipelines. Each pipeline is a chain of Python functions that transform the dataset.  In the next step, create a pipeline named "prep".

In [None]:
pipeline = data_set.pipeline('prep')  # create or retrieve the pipeline named 'prep'
pipeline.reset() # removes any previous steps or context for this pipeline
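
To make the idea of a pipeline as a chain of functions concrete, here is a minimal sketch in plain Python. `MiniPipeline` is a hypothetical stand-in written for illustration, not the Cortex SDK class. It only mirrors the `add_step`, `reset`, and `run` names used in this notebook, and it assumes that each step modifies the data in place.

```python
class MiniPipeline:
    """Hypothetical illustration only; not the Cortex SDK implementation."""

    def __init__(self, name):
        self.name = name
        self.steps = []

    def add_step(self, fn):
        # Steps run in the order they were added
        self.steps.append(fn)
        return self

    def reset(self):
        # Drop all previously registered steps
        self.steps = []
        return self

    def run(self, data):
        # Call each step with (pipeline, data); steps modify data in place
        for step in self.steps:
            step(self, data)
        return data


def double_x(pipeline, table):
    # Replace column 'x' with doubled values
    table['x'] = [v * 2 for v in table['x']]


def add_total(pipeline, table):
    # Add a derived entry computed from the already-doubled column
    table['total'] = sum(table['x'])


demo = MiniPipeline('prep')
demo.add_step(double_x).add_step(add_total)
result = demo.run({'x': [1, 2, 3]})
print(result)
```

Because `add_total` runs after `double_x`, it sees the doubled values. The same ordering applies to the steps added to the "prep" pipeline in the rest of this notebook.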

One pipeline step can be used to add a new column.

This [dataset](http://piano.dsi.uminho.pt/~pcortez/fires.pdf) uses components from the Fire Weather Index to make predictions. One element, the Build Up Index (BUI), is derived from two other columns (`DMC` and `DC`) and is omitted from the data. This step adds that missing element.

In [None]:
def add_bui(pipeline, df):
    # Derive the Build Up Index (BUI) from the DMC and DC columns
    df['BUI'] = (0.8 * df['DMC'] * df['DC']) / (df['DMC'] + 0.4 * df['DC'])

pipeline.add_step(add_bui)

In the preceding code, the pipeline step function takes a pipeline parameter and a DataFrame parameter. The [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) provides a rich set of functions for operating on table data.
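
As a quick check on the BUI formula used in `add_bui`, the following sketch evaluates it for a single pair of values in plain Python. The helper name `build_up_index`, the input values, and the zero-denominator guard are assumptions added for illustration; they are not part of the Cortex SDK or the dataset's documentation.

```python
import math


def build_up_index(dmc, dc):
    """Scalar version of the expression used in add_bui (illustrative helper)."""
    denominator = dmc + 0.4 * dc
    if denominator == 0:
        # Assumption: treat the undefined 0/0 case as missing data
        return math.nan
    return (0.8 * dmc * dc) / denominator


# Illustrative inputs, not read from the sample file
bui = build_up_index(26.2, 94.3)
print(round(bui, 2))
```

The pipeline step applies the same arithmetic to whole columns at once, so each row's `BUI` depends only on that row's `DMC` and `DC` values.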

A pipeline step may be used to modify a column.

The dataset's documentation says that the last column, __area__, is skewed towards zero and should be adjusted logarithmically "to improve regression results for right-skewed targets".

In [None]:
def fix_area(pipeline, df):
    # Apply log(1 + area) to reduce the right skew of the target column
    df['area'] = df['area'].map(math.log1p)

pipeline.add_step(fix_area)
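
The short sketch below shows why `log1p` suits a column with many zeros: `math.log1p(0)` is `0`, whereas `math.log(0)` raises an error. It also shows that `math.expm1` reverses the transform, which may be useful for converting values back to the original scale. The `areas` values are made up for illustration.

```python
import math

# Illustrative values only, including a zero
areas = [0.0, 0.5, 10.0, 1000.0]

# log1p(a) computes log(1 + a), so zero maps to zero
transformed = [math.log1p(a) for a in areas]

# expm1(t) computes exp(t) - 1, the inverse of log1p
restored = [math.expm1(t) for t in transformed]

for a, t, r in zip(areas, transformed, restored):
    print(a, round(t, 4), round(r, 4))
```

The large value is compressed far more than the small ones, which is the effect the logarithmic adjustment is meant to achieve for a right-skewed target.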

### Running the Pipeline
After all the steps are added, you can call `run` on the pipeline. This invokes each of the steps in order and returns a transformed DataFrame instance.

In [None]:
pipeline.run(data_frame)