# Clean Data with Pipelines

This notebook is a template for common data cleaning and preparation activities. Use a data cleaning pipeline to prepare data for training a model or as part of the input processing for a deployed model.

## Setup

This section defines variables used in the following sections to construct and run the pipeline. You can change the values of the variables to meet your needs. 

In [None]:
data_file = './data/clean.csv'

This template works with pandas datasets. Pandas provides a rich set of [I/O tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for reading and writing data. The template uses a `csv` file by default.
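
As a minimal sketch of the pandas I/O tools mentioned above, the following reads CSV text from an in-memory buffer instead of `data_file`, so it runs without any file on disk. The sample values are invented for illustration; only the column names come from this template.

```python
import io

import pandas as pd

# Hypothetical sample data; a real run would pass a path such as data_file.
sample_csv = io.StringIO(
    "ALPHA,EPSILON,ZETA,ETA\n"
    "1.5,a,0,10\n"
    ",b,3,20\n"
)
sample_df = pd.read_csv(sample_csv)

# The empty ALPHA field in the second row is parsed as NaN.
print(sample_df.shape)
print(sample_df["ALPHA"].isna().sum())
```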

In [None]:
name_space = 'example'
dataset_name = 'sample01'
full_ds_name = '/'.join([name_space, dataset_name])

A namespace is an organizational element for data saved in the Cortex infrastructure. Use it to keep various artifacts related to a particular project together. This template combines the namespace and the dataset name to create a qualified dataset name. 

In [None]:
pipeline_name = 'clean'
full_pipeline_name = '/'.join([name_space, pipeline_name])

Pipelines are persistent and are retrieved by name. 

### Column variables

The following cell contains lists of columns that can be used for common data cleaning activities. 

List the columns you want to keep from your particular dataset.

In [None]:
columns_of_interest = ['ALPHA','EPSILON','ZETA','ETA']

If some categorical columns need to be encoded, specify those columns in the following list. 

In [None]:
columns_to_encode = ['EPSILON']

In the following cell, specify the columns whose missing values must be filled before modeling:

In [None]:
columns_with_missing_data = ['ALPHA']

An additional step for removing outliers can be added here. Removing outliers requires that you write expressions that exclude values above an upper bound or below a lower bound for a column.

## Pipeline and Dataset 

Import pandas and any other libraries you need, and then create a dataset and pipeline:

In [None]:
from cortex import Cortex
import pandas as pd

The next cell creates a local Cortex client. To create a server-side client instead, replace `Cortex.local()` with `Cortex.client()`. Doing so causes the dataset to be persisted in Cortex.

In [None]:
cortex = Cortex.local()

Now create the dataset and the pipeline.

In [None]:
clean_dataset = cortex.dataset(full_ds_name) 
pipeline = clean_dataset.pipeline(full_pipeline_name, clear_cache=True)

If you run this pipeline multiple times, reset the kernel after each run to remove steps. See the [pipeline persistence notebook](https://docs.cortex.insights.ai/docs/cortex-python-sdk-guide/pipeline/#pipeline-persistence) for more information.

In [None]:
pipeline.reset()

## Pipeline steps

### Get a subset of columns

Use the `columns_of_interest` list you created to select the desired subset of the data.

In [None]:
def subset_columns(pipeline, df):
    return df[columns_of_interest]

pipeline.add_step(subset_columns)
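
If the dataset might lack some of the listed columns, a variant of this step could check for them first. The following is a sketch only: `subset_columns_checked` is a hypothetical name, the sample frame is invented, and the step is called directly with `None` in place of a pipeline because it never uses that argument.

```python
import pandas as pd

columns_of_interest = ["ALPHA", "EPSILON", "ZETA", "ETA"]  # same list as in Setup

def subset_columns_checked(pipeline, df):
    # Report every missing column at once instead of failing on the first one.
    missing = [col for col in columns_of_interest if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found in the dataset: {missing}")
    return df[columns_of_interest]

# Hypothetical sample frame with one extra column that should be dropped.
sample_df = pd.DataFrame(
    {"ALPHA": [1], "EPSILON": ["a"], "ZETA": [0], "ETA": [5], "EXTRA": [9]}
)
subset = subset_columns_checked(None, sample_df)
print(list(subset.columns))
```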

### Use get_dummies for encoding categorical data

Pandas supplies the [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function, which one-hot encodes categorical data.

In [None]:
def encode_cat_cols(pipeline, df):
    return pd.get_dummies(df,columns=columns_to_encode)

pipeline.add_step(encode_cat_cols)
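
To illustrate what the encoding step produces, here is a small sketch that runs `pd.get_dummies` on a made-up frame. The values are invented; only the column names come from this template.

```python
import pandas as pd

# Hypothetical sample frame for illustration.
sample_df = pd.DataFrame({"ALPHA": [1.0, 2.0, 3.0], "EPSILON": ["a", "b", "a"]})

encoded = pd.get_dummies(sample_df, columns=["EPSILON"])

# EPSILON is replaced by one indicator column per category value.
print(list(encoded.columns))  # ['ALPHA', 'EPSILON_a', 'EPSILON_b']
```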

### Handle missing values

In [None]:
def fill_missing(pipeline, df):
    return df.fillna(0.0)

pipeline.add_step(fill_missing)
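
The step above fills every missing value in the frame. If only the columns named in `columns_with_missing_data` should be filled, one possible variant passes a per-column mapping to `fillna`. This is a sketch under that assumption: `fill_listed_columns` is a hypothetical name and the sample frame is invented.

```python
import numpy as np
import pandas as pd

columns_with_missing_data = ["ALPHA"]  # same list as in Setup

def fill_listed_columns(pipeline, df):
    # Build a per-column fill map so that other columns keep their NaN values.
    fill_values = {col: 0.0 for col in columns_with_missing_data}
    return df.fillna(value=fill_values)

# Hypothetical sample frame; the step ignores its pipeline argument, so None is passed.
sample_df = pd.DataFrame({"ALPHA": [1.0, np.nan], "ZETA": [np.nan, 2.0]})
filled = fill_listed_columns(None, sample_df)
print(filled)
```

Note that a later step that casts a column with `astype(int)` raises an error if that column still contains NaN, so any such column would also need to be listed.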

### Drop outliers

You can write expressions to provide bounds for acceptable values for a particular column.

In [None]:
def drop_outliers(pipeline, df):
    # Modify or add expressions to identify outliers in your data.
    df.drop(df[df['ZETA'].astype(int) > 1.0].index, inplace=True)
    return df  # return the frame explicitly, as the other steps do

pipeline.add_step(drop_outliers)
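
As a sketch of a step that applies both a lower and an upper bound, the following keeps only rows whose `ZETA` value falls inside an inclusive range, using `Series.between`. The bounds, the function name `drop_out_of_bounds`, and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical bounds for illustration; choose values that suit your data.
ZETA_LOWER, ZETA_UPPER = 0.0, 1.0

def drop_out_of_bounds(pipeline, df):
    # Keep rows whose ZETA value lies inside the inclusive [lower, upper] range.
    in_bounds = df["ZETA"].between(ZETA_LOWER, ZETA_UPPER)
    return df[in_bounds]

# Hypothetical sample frame; the step ignores its pipeline argument, so None is passed.
sample_df = pd.DataFrame({"ZETA": [-2.0, 0.5, 1.0, 7.0]})
kept = drop_out_of_bounds(None, sample_df)
print(kept["ZETA"].tolist())  # [0.5, 1.0]
```

Because `between` evaluates to `False` for NaN, rows with a missing `ZETA` value are dropped by this variant as well.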

## Run the pipeline

In [None]:
cleaned_ds = pipeline.run(pd.read_csv(data_file))
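
To check step functions before adding them to a Cortex pipeline, they can be applied in order with plain pandas. The sketch below does not reproduce Cortex pipeline behavior such as persistence. It only assumes that each step takes `(pipeline, df)` and returns a DataFrame, as the steps in this template do. The helper `run_steps_locally` and the sample data are hypothetical, and the steps are trimmed copies that use two columns.

```python
import pandas as pd

def subset_columns(pipeline, df):
    return df[["ALPHA", "EPSILON"]]

def encode_cat_cols(pipeline, df):
    return pd.get_dummies(df, columns=["EPSILON"])

def fill_missing(pipeline, df):
    return df.fillna(0.0)

def run_steps_locally(steps, df):
    # Hypothetical helper: apply each step in order, passing None for the pipeline.
    for step in steps:
        df = step(None, df)
    return df

# Hypothetical sample frame with a missing value and an unused column.
sample_df = pd.DataFrame(
    {"ALPHA": [1.0, None], "EPSILON": ["a", "b"], "UNUSED": [0, 1]}
)
result = run_steps_locally([subset_columns, encode_cat_cols, fill_missing], sample_df)
print(list(result.columns))
```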

## Display the results

In [None]:
cleaned_ds