# Auto Feature Engineering Workflow Demo

## Content

* [1. Introduction](#1.-Introduction)
* [2. Auto Feature Engineering Workflow Demo](#2.-Auto-Feature-Engineering-Workflow-Demo)
* [3. AutoFE Deep Dive](#3.-AutoFE-Deep-dive)
    * [3.1 Feature Profiling](#3.1-Feature-Profiling)
    * [3.2 Pipeline Plot](#3.2-Pipeline-Plot)
    * [3.3 Feature Importance](#3.3-Feature-Importance)
* [4. Model Training](#4.-Model-training)
* [5. Performance](#)

## 1. Introduction

This AutoFE workflow demo shows how to use the Auto Feature Engineering toolkit (codename: RecDP) to automatically transform raw tabular data into ready-to-train data enriched with useful new features, while significantly improving developer productivity and end-to-end data preparation performance.
<center><img src="recdp_autofe_overview.jpg" width = "800" alt = 'recdp_autofe_overview'></center>

The AutoFE workflow uses RecDP to:

(1) Automatically profile the dataset and infer the data type of each input column  
(2) Determine the proper feature engineering primitives for each inferred data type  
(3) Generate data preparation pipelines with chained operators  
(4) Generate a DAG of operations  
(5) Execute the DAG on different engines  
(6) Analyze feature importance  
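
The first three steps can be pictured with a small, self-contained sketch. It is not RecDP's implementation: the `infer_type` and `plan_pipeline` functions, the primitive names in `PRIMITIVES`, and the sample columns are all hypothetical, chosen only to show how an inferred column type can drive the choice of chained operations.

``` python
from datetime import datetime

# Hypothetical mapping from inferred type to feature engineering primitives.
# These names are illustrative and are not RecDP's operator names.
PRIMITIVES = {
    "numeric": ["fill_missing", "scale"],
    "datetime": ["extract_year_month_day"],
    "categorical": ["fill_missing", "label_encode"],
}


def infer_type(values):
    """Infer a coarse type for one column from its non-empty string values."""
    present = [v for v in values if v not in ("", None)]
    if not present:
        return "categorical"

    def all_parse(parser):
        # True only if every present value is accepted by the parser.
        try:
            for v in present:
                parser(v)
            return True
        except (TypeError, ValueError):
            return False

    if all_parse(float):
        return "numeric"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "categorical"


def plan_pipeline(table, label):
    """Profile each feature column and chain the matching primitives."""
    steps = []
    for column, values in table.items():
        if column == label:
            continue  # the target label itself is not transformed
        for primitive in PRIMITIVES[infer_type(values)]:
            steps.append((len(steps), primitive, column))
    return steps


# Made-up sample data, keyed by column name.
raw = {
    "fare": ["7.25", "71.28", ""],
    "pickup": ["2023-01-05", "2023-02-11", "2023-03-02"],
    "city": ["NYC", "SF", "NYC"],
    "tip": ["0", "1", "0"],
}
for step in plan_pipeline(raw, label="tip"):
    print(step)
```

Each printed tuple is `(step id, primitive, column)`. A real engine would then turn such a plan into a DAG and execute it on pandas or Spark.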

## 2. Auto Feature Engineering Workflow Demo

### Step 1: Configuration file 
To launch the Auto Feature Engineering workflow, the only required step is to edit `workflow.yaml`.
The supported configurations are listed in the table below.

| Name            | Description   |
| --------------- | ------------- |
| dataset_path | Directory that contains the dataset |
| target_label | Target label of the dataset |
| engine_type | Engine used for auto feature engineering; `pandas` and `spark` are supported |

In [None]:
!cat workflow.yaml
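
Before launching the workflow, it can help to check the parsed configuration. The sketch below is a hypothetical helper and not part of RecDP: it assumes the YAML file has already been parsed into a plain dict whose keys match the table above, and the values shown are placeholders.

``` python
# Keys taken from the configuration table; the values used further down are
# hypothetical placeholders, not shipped defaults.
REQUIRED_KEYS = ("dataset_path", "target_label", "engine_type")
SUPPORTED_ENGINES = ("pandas", "spark")


def validate_config(config):
    """Return the config unchanged, or raise ValueError describing the problem."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"workflow.yaml is missing: {', '.join(missing)}")
    if config["engine_type"] not in SUPPORTED_ENGINES:
        raise ValueError(
            f"engine_type must be one of {SUPPORTED_ENGINES}, "
            f"got {config['engine_type']!r}"
        )
    return config


# In practice the dict would come from parsing workflow.yaml,
# for example with PyYAML's yaml.safe_load.
config = validate_config(
    {
        "dataset_path": "./data",   # hypothetical path
        "target_label": "label",    # hypothetical column name
        "engine_type": "pandas",
    }
)
dataset_path = config["dataset_path"]
target_label = config["target_label"]
engine_type = config["engine_type"]
```

Failing early on a missing key or an unsupported engine gives a clearer error than one raised later inside the pipeline.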

### Step 2: Kick off AutoFE workflow 

This toolkit provides a low-code API: three lines of code are enough to launch Auto Feature Engineering on any input tabular data.

The AutoFE API analyzes the dataset and its target label, creates the data pipeline automatically, and then uses the specified `engine_type` to transform the data.

The transformed data is displayed once the code completes.

In [None]:
from pyrecdp.autofe import AutoFE

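# load_data(), target_label and engine_type are assumed to be defined earlier,
# for example from the settings in workflow.yaml.
pipeline = AutoFE(dataset=load_data(), label=target_label)
pipeline.fit_transform(engine_type=engine_type)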

## 3. AutoFE Deep dive 

The sections below cover the advanced interfaces provided by the AutoFE pipeline. They let you view the EDA report of the original data, view and customize the auto-generated pipeline, and inspect feature importance.

* To view the EDA profile of the original data.
``` python
pipeline.profile(engine_type)
```

* To view and customize the generated data pipeline.
``` python
pipeline.plot()
```

* To view the feature importance result.
``` python
pipeline.feature_importance()
```

### 3.1 Feature Profiling
AutoFE provides a feature profiler that analyzes feature distributions and surfaces insights about each feature.

In [None]:
pipeline.profile(engine_type)

### 3.2 Pipeline Plot

The pipeline can be viewed or modified.

* View the pipeline
``` python
pipeline.plot()
```

* Add a new operation to the pipeline

``` python
# apply_gaussian is a placeholder for a user-supplied transform; it is not defined here.
def gaussian_calculation(df):
    df = apply_gaussian(df, columns=['col_1'])
    return df

pipeline.add_operation(gaussian_calculation)
```

* Remove an unwanted operation from the pipeline

``` python
pipeline.delete_operation(id = 6)
```
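
To make the add and delete calls more concrete, here is a toy model of a pipeline as an ordered chain of operations keyed by id. It is only a mental model under stated assumptions: `ToyPipeline`, `describe`, `run`, and the two sample operations are hypothetical, and RecDP's own pipeline may assign ids and order operations differently.

``` python
class ToyPipeline:
    """Toy stand-in for a pipeline: an ordered chain of operations keyed by id."""

    def __init__(self):
        self._operations = {}  # id -> function; dicts preserve insertion order
        self._next_id = 0

    def add_operation(self, func):
        """Append func to the end of the chain and return its id."""
        op_id = self._next_id
        self._operations[op_id] = func
        self._next_id += 1
        return op_id

    def delete_operation(self, id):
        """Remove one operation; the remaining ones keep their ids and order."""
        # The parameter is named `id` to mirror the keyword used in the call above.
        del self._operations[id]

    def describe(self):
        """List (id, name) pairs, similar in spirit to reading a pipeline plot."""
        return [(op_id, func.__name__) for op_id, func in self._operations.items()]

    def run(self, rows):
        """Apply every remaining operation in order to a list of row dicts."""
        for func in self._operations.values():
            rows = func(rows)
        return rows


def fill_missing(rows):
    # Replace missing col_1 values with 0.0.
    return [
        {**row, "col_1": row["col_1"] if row["col_1"] is not None else 0.0}
        for row in rows
    ]


def double_col_1(rows):
    # Multiply col_1 by two.
    return [{**row, "col_1": row["col_1"] * 2} for row in rows]


toy = ToyPipeline()
toy.add_operation(fill_missing)
unwanted = toy.add_operation(double_col_1)
print(toy.describe())              # inspect ids before deleting
toy.delete_operation(id=unwanted)  # drop the operation by its id
print(toy.run([{"col_1": 1.5}, {"col_1": None}]))
```

The point of the sketch is the workflow: inspect the operations and their ids first, then delete by id, so that the wrong step is not removed by accident.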

In [None]:
pipeline.plot()

### 3.3 Feature Importance

We provide feature estimators that analyze the transformed data and perform feature reduction in case AutoFE generates features that are not useful.

In [None]:
pipeline.feature_importance()
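
One simple way to think about feature reduction is thresholding on each feature's share of the total importance. The sketch below assumes importance scores are available as a plain dict. The `select_features` helper, the threshold, and the scores are hypothetical and do not describe how RecDP ranks or prunes features.

``` python
def select_features(importances, threshold=0.01):
    """Keep features whose share of total importance is at least `threshold`."""
    total = sum(importances.values())
    if total <= 0:
        return []
    # Rank from most to least important, then filter by relative share.
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score / total >= threshold]


# Hypothetical scores for illustration only; they are not real results.
scores = {
    "fare": 120.0,
    "pickup_hour": 60.0,
    "city_encoded": 19.0,
    "pickup_year": 1.0,
}
print(select_features(scores, threshold=0.05))
```

With these made-up scores, `pickup_year` contributes 0.5% of the total and is dropped, while the other three features are kept.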

## 4. Model training

Now that AutoFE has completed, retrieve the transformed data and fit it to your own model.

In [None]:
transformed_data = pipeline.get_transformed_data()