# RAPA Walkthrough
This tutorial is meant to cover multiple RAPA use cases, including starting from scratch or using a previous DataRobot project.

### Overview
1. Initialize the DataRobot API 
    - Save a pickled dictionary for DataRobot API Initialization
    - Use the pickled dictionary to initialize the DataRobot API
    - **(Optional)**: Skip this if the DataRobot API is already initialized
2. Submit data as a project to DataRobot
    - Create a submittable `pandas` dataframe
    - Submit the data using RAPA
    - **(Optional)**: If parsimonious feature reduction is required on an existing project, it is possible to load the project instead of creating a new one.
3. Perform parsimonious feature reduction

<a href="https://life-epigenetics-rapa.readthedocs-hosted.com/en/latest/">[Link to the documentation]</a>

In [1]:
import rapa
print(rapa.__version__)

0.0.4


## 1. Initialize the DataRobot API

In [2]:
# pickle stores the API key dictionary; os is used to check for and create folders
import pickle
import os

On the <a href="https://app.datarobot.com">DataRobot website</a>, find the developer tools and retrieve an API key. Once you have a key, run the next block of code with your API key replacing the value in the dictionary. Here is a <a href="https://docs.datarobot.com/en/docs/api/api-quickstart/api-qs.html">detailed article on DataRobot API keys</a>.

**Make sure to remove the code that creates the pickled dictionary, and the pickled dictionary itself, from anything public, such as a GitHub repository.**


In [11]:
os.listdir()

['00-data-preparation.ipynb',
 '02-tutorial.ipynb',
 'general_tutorial.ipynb',
 '01-demo-rapa.ipynb']

In [3]:
# save a pickled dictionary for datarobot api initialization
api_dict = {'tutorial':'APIKEYHERE'}
if 'data' in os.listdir('.'):
    print('data folder already exists, skipping folder creation...')
else:
    print('Creating data folder in the current directory.')
    os.mkdir('data')

if 'dr-tokens.pkl' in os.listdir('data'):
    print('dr-tokens.pkl already exists.')
else:
    with open('data/dr-tokens.pkl', 'wb') as handle:
        pickle.dump(api_dict, handle)

Creating data folder in the current directory.
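
To confirm that the key was stored as intended, the pickle can be read back. The sketch below is a self-contained round trip that writes to a temporary folder instead of the real `data/dr-tokens.pkl`. That `rapa.utils.initialize_dr_api` looks the key up by its dictionary name (here `'tutorial'`) is an assumption inferred from the call used in this tutorial, not from its source code.

```python
import os
import pickle
import tempfile

# Placeholder value, as in the cell above; never commit a real key.
api_dict = {'tutorial': 'APIKEYHERE'}

with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = os.path.join(tmp_dir, 'dr-tokens.pkl')

    # Write the dictionary the same way the tutorial cell does.
    with open(token_path, 'wb') as handle:
        pickle.dump(api_dict, handle)

    # Read it back to check that the file holds what we expect.
    with open(token_path, 'rb') as handle:
        loaded = pickle.load(handle)

# Print only the names of the stored keys, not the keys themselves.
print(sorted(loaded))
```

Printing only the dictionary names keeps the key itself out of the notebook output, which matters if the notebook is later shared.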


In [4]:
# Use the pickled dictionary to initialize the DataRobot API
rapa.utils.initialize_dr_api('tutorial')

Exception: Unable to authenticate to the server - are you sure the provided token of "APIKEYHERE" and endpoint of "https://app.datarobot.com/api/v2" are correct? Note that if you access the DataRobot webapp at `https://app.datarobot.com`, then the correct endpoint to specify would be `https://app.datarobot.com/api/v2`.

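The exception above is expected while the placeholder key `APIKEYHERE` is still in place. Most of this tutorial calls the DataRobot API, so the remaining cells will not run until the API is initialized with a valid key.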

## 2. Submit data as a project to DataRobot

This tutorial uses the Breast Cancer Wisconsin (Diagnostic) dataset as an easily accessible example set for `rapa`, as it is easily loaded with `sklearn`.

This breast cancer dataset has **30 features** extracted from digitized images of aspirated breast mass cells. A few of the features are the mean radius of the cells, the mean texture, and the mean perimeter. The **target** is whether the cells are from a malignant or benign tumor, with 1 indicating benign and 0 indicating malignant. There are 357 benign and 212 malignant samples, making 569 samples total.

In [None]:
from sklearn import datasets # data used in this tutorial
import pandas as pd # used for easy data management

In [None]:
# loads the dataset (as a dictionary)
breast_cancer_dataset = datasets.load_breast_cancer() 

In [None]:
# puts features and targets from the dataset into a dataframe
breast_cancer_df = pd.DataFrame(data=breast_cancer_dataset['data'], columns=breast_cancer_dataset['feature_names'])
breast_cancer_df['benign'] = breast_cancer_dataset['target']
print(breast_cancer_df.shape)
breast_cancer_df.head()

When using `rapa` to create a project on DataRobot, the number of features is reduced using one of the sklearn functions `sklearn.feature_selection.f_classif` or `sklearn.feature_selection.f_regression`, depending on which `rapa` class is instantiated. In this tutorial's case, the data is a binary classification problem, so we have to create an instance of the `rapa.RAPAClassif` class.
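
For intuition about the scoring used in the classification case, `f_classif` ranks each feature by a one-way ANOVA F-statistic. The sketch below computes that statistic in plain Python for one feature at a time. The function name `anova_f` and the toy values are made up for illustration, and how `rapa` uses the resulting scores internally is not shown here.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F-statistic for one feature, split into one list per class."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    n_groups = len(groups)
    n_samples = len(all_values)

    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

    # Variation of the samples around their own class mean.
    ss_within = 0.0
    for group in groups:
        group_mean = mean(group)
        ss_within += sum((value - group_mean) ** 2 for value in group)

    return (ss_between / (n_groups - 1)) / (ss_within / (n_samples - n_groups))

# Toy feature values, one inner list per class (made-up numbers).
informative = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # class means differ
noisy = [[1.0, 5.0, 3.0], [2.0, 4.0, 3.0]]        # class means are equal

f_informative = anova_f(informative)
f_noisy = anova_f(noisy)
print(f_informative > f_noisy)
```

A feature whose class means are far apart relative to the spread within each class gets a large F-statistic, so it ranks above a feature whose values look the same in both classes.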

As of now, `rapa` only supports classification and regression problems on DataRobot. Additionally, `rapa` has only been tested on tabular data. 

In [None]:
# Creates a rapa classification object
depression_classification = rapa.rapa.RAPAClassif()

In [None]:
# creates a datarobot submittable dataframe with cross validation folds stratified for the target (benign)
sub_df = depression_classification.create_submittable_dataframe(breast_cancer_df, target_name='benign')
print(sub_df.shape)
sub_df.head()
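
To illustrate what "stratified for the target" means, the sketch below assigns fold labels so that every fold keeps roughly the same benign-to-malignant ratio. It is not `rapa`'s implementation: the function name `stratified_folds`, the round-robin assignment, and the choice of 5 folds are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def stratified_folds(targets, n_folds=5):
    """Assign a fold index to each sample, cycling through folds within each class."""
    seen_per_class = defaultdict(int)
    folds = []
    for target in targets:
        # The k-th sample of a class goes to fold k mod n_folds.
        folds.append(seen_per_class[target] % n_folds)
        seen_per_class[target] += 1
    return folds

# Same class counts as the tutorial dataset: 357 benign (1), 212 malignant (0).
targets = [1] * 357 + [0] * 212
folds = stratified_folds(targets, n_folds=5)

# Count how many samples of each class landed in each fold.
per_fold = Counter(zip(folds, targets))
for fold in range(5):
    print(fold, per_fold[(fold, 1)], per_fold[(fold, 0)])
```

Because each class is spread across the folds separately, no fold ends up with an unusually high or low share of malignant samples.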

In [None]:
# submits a project to datarobot using our dataframe, target, and project name.
project = depression_classification.submit_datarobot_project(input_data_df=sub_df, target_name='benign', project_name='TUTORIAL_breast_cancer')
project

In [None]:
# if the project already exists, the `rapa.utils.find_project` function can be used to search for a project
project = rapa.utils.find_project("TUTORIAL_breast_cancer")
project

## 3. Perform parsimonious feature reduction

`rapa`'s main function is `perform_parsimony`. Requiring a *feature_range* and a *project*, this function recursively removes features by their relative feature impact scores across all models in a featurelist, creating a new featurelist and set of models with DataRobot each iteration.
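
A single reduction step can be pictured as follows. This is a simplified sketch, not `rapa`'s code: the function name `reduce_featurelist` is hypothetical, the impact values are made up, and combining the per-model impacts with the median is an assumption (the aggregation `rapa` actually uses is not stated in this tutorial).

```python
from statistics import median

def reduce_featurelist(impacts_per_model, n_keep):
    """Keep the n_keep features with the highest combined impact across models."""
    features = list(impacts_per_model[0])
    # Combine each feature's impact across all models (assumed: median).
    combined = {
        feature: median(model[feature] for model in impacts_per_model)
        for feature in features
    }
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:n_keep]

# Made-up relative impacts from three models trained on the same featurelist.
toy_impacts = [
    {'mean radius': 1.00, 'mean texture': 0.40, 'mean perimeter': 0.10},
    {'mean radius': 0.90, 'mean texture': 0.50, 'mean perimeter': 0.20},
    {'mean radius': 1.00, 'mean texture': 0.30, 'mean perimeter': 0.05},
]

kept = reduce_featurelist(toy_impacts, n_keep=2)
print(kept)
```

In the real workflow, DataRobot trains new models on the reduced featurelist and the impacts are computed again before the next reduction.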

* **feature_range:** a list of desired featurelist lengths as integers (Ex: [25, 20, 15, 10, 5, 4, 3, 2, 1]), or of desired featurelist sizes as fractions (Ex: [0.9, 0.7, 0.5, 0.3, 0.1]). This tells `rapa` how many features remain after each iteration of feature reduction.
* **project:** either a `datarobot` project, or a string of its id or name. `rapa.utils.find_project` can be used to find a project already existing in DataRobot, or `submit_datarobot_project` can be used to submit a new project.
* **featurelist_prefix:** provides `datarobot` with a prefix that will be used for all the featurelists created by the `perform_parsimony` function. If running `rapa` multiple times in one DataRobot project, make sure to change the **featurelist_prefix** each time to avoid confusion.
* **starting_featurelist_name:** the name of the featurelist you would like to start parsimonious reduction from. It defaults to 'Informative Features', but can be changed to any featurelist name that exists within the project.
* **lives:** the number of times a reduced featurelist is allowed to produce a worse model before `rapa` stops. By default, ‘lives’ are off and the entire ‘feature_range’ will be run; if supplied a number >= 0, that is the number of ‘lives’ available. (Ex: lives = 0, feature_range = [100, 90, 80, 50]. RAPA finds that after making all the models for the length 80 featurelist, the ‘best’ model was created with the length 90 featurelist, so it stops and doesn’t make a featurelist of length 50.) This is similar to the ‘lifes’ in DataRobot’s <a href="https://www.datarobot.com/blog/using-feature-importance-rank-ensembling-fire-for-advanced-feature-selection/">Feature Importance Rank Ensembling for advanced feature selection (FIRE)</a> package.
* **cv_average_mean_error_limit**: a limit on the cross-validation mean error, to help avoid overfitting. By default, the limit is off and every entry in ‘feature_range’ will be run. The limit applies only if supplied a number >= 0.0.
* **to_graph**: a list of keys choosing which graphs to produce. Current graphs are *feature_performance* and *models*. *feature_performance* graphs a stackplot of feature impacts across many featurelists, showing the change in impact over different featurelist lengths. *models* plots `seaborn` boxplots of some metric of accuracy for each featurelist length. These plots are created after each iteration.

Additional arguments and their effects can be found <a href="https://life-epigenetics-rapa.readthedocs-hosted.com/en/latest/">in the API documentation</a>, or within the functions.
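
The stopping behavior of **lives** can be sketched in plain Python. This follows the example given for the parameter (lives = 0, feature_range = [100, 90, 80, 50]) rather than `rapa`'s source: the function name `run_with_lives`, the made-up scores, and the assumption that a higher score is better are all illustrative.

```python
def run_with_lives(feature_range, score_for, lives=None):
    """Return the featurelist lengths that get modeled before the run stops."""
    best_score = None
    modeled = []
    for n_features in feature_range:
        score = score_for(n_features)  # stands in for training models on DataRobot
        modeled.append(n_features)
        if best_score is None or score > best_score:
            best_score = score
        elif lives is not None:
            # A worse (or equal) result costs one life; stop when none are left.
            lives -= 1
            if lives < 0:
                break
    return modeled

# Made-up scores per featurelist length (higher is assumed to be better).
toy_scores = {100: 0.90, 90: 0.93, 80: 0.91, 50: 0.95}
feature_range = [100, 90, 80, 50]

stopped_early = run_with_lives(feature_range, toy_scores.get, lives=0)
ran_everything = run_with_lives(feature_range, toy_scores.get, lives=None)
print(stopped_early)
print(ran_everything)
```

With `lives=0` the run ends after the length 80 featurelist, because its best model is worse than the best one from length 90. With lives off, every length in the list is modeled.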

In [None]:
ret = depression_classification.perform_parsimony(
    project=project,
    featurelist_prefix='TEST_0.0',
    starting_featurelist_name='Informative Features',
    feature_range=[25, 20, 15, 10, 5, 4, 3, 2, 1],
    lives=5,
    cv_average_mean_error_limit=0.8,
    to_graph=['feature_performance', 'models'],
)