## AutoPrognosis API Tutorial

A demonstration of AutoPrognosis (AP) functionality and operation.

This tutorial shows how to use [AutoPrognosis](https://arxiv.org/abs/1802.07207). It uses the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset, loaded through scikit-learn.

See [installation instructions](../../doc/install.md) to install the dependencies.

Load the dataset and show the first five samples:

In [1]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()  # Breast Cancer Wisconsin (Diagnostic) dataset

# Build a DataFrame with one column per feature, then append the label column
df = pd.DataFrame(data.data, columns=data.feature_names)
target = 'target'
df[target] = data.target

X_ = df[data.feature_names]  # feature matrix
Y_ = df[target]              # binary labels

df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |
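
All five rows shown have `target` equal to 0. In scikit-learn's copy of this dataset, 0 encodes malignant and 1 encodes benign, which is worth confirming before interpreting any score. The short sketch below inspects the label encoding and class balance; it does not depend on AutoPrognosis, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Map integer labels back to their names (index 0 -> 'malignant', 1 -> 'benign').
label_names = {i: str(name) for i, name in enumerate(data.target_names)}

# Count samples per class to see how imbalanced the problem is.
class_counts = pd.Series(data.target).map(label_names).value_counts()

n_samples, n_features = data.data.shape
print(label_names)
print(class_counts)
print(n_samples, n_features)
```

Because the classes are not evenly split, a precision-recall based metric such as the `aucprc` setting used later is a reasonable choice.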


## Import the AutoPrognosis library

In [2]:
import model

## Run the model with a few iterations

In [3]:
metric = 'aucprc'

AP_mdl = model.AutoPrognosis_Classifier(
    metric=metric, CV=5, num_iter=3, kernel_freq=100, ensemble=True,
    ensemble_size=3, Gibbs_iter=100, burn_in=50, num_components=3)

AP_mdl.fit(X_, Y_)

[ median, Normalization, Gradient Boosting ]
[ mean, Gaussian Transformer, MultinomialNaiveBayes ]
[ missForest, Uniform Transformer, LinearSVM ]
[ mean, Gradient Boosting ]
[ missForest, PCA, MultinomialNaiveBayes ]
[ matrix_completion, PCA, GaussianNaiveBayes ]
[ matrix_completion, PCA, XGBoost ]
[ missForest, Gaussian random projections, BernoullinNaiveBayes ]
[ median, GaussianNaiveBayes ]
[ missForest, Normalization, Random Forest ]
[ median, Bagging ]
[ mean, Gaussian Transformer, GaussianNaiveBayes ]



**The best model is:** [ missForest, PCA, Bagging ]

 |||| Now building the ensemble...

**Ensemble:** ['[ median, Bagging ]', '[ median, Normalization, GaussianNaiveBayes ]', '[ matrix_completion, Gaussian random projections, Random Forest ]']

**Ensemble weights:** [0.50139696 0.14415384 0.3544492 ]

**The ensemble helps!**

[{'aucprc': 0.9810585834962593, 'name': 'initial'},
 {'aucprc': 0.9881686666304725,
  'aucroc': 0.987389452997052,
  'component_idx': 0,
  'cv': 5,
  'hyperparameter_properties': [{'name': 'mean'},
   {'hyperparameters': {'model': "GradientBoostingClassifier(criterion='friedman_mse', init=None,\n                           learning_rate=0.2140593364154549, loss='deviance',\n                           max_depth=5, max_features=None, max_leaf_nodes=None,\n                           min_impurity_decrease=0.0, min_impurity_split=None,\n                           min_samples_leaf=1, min_samples_split=2,\n                           min_weight_fraction_leaf=0.0, n_estimators=137,\n                           n_iter_no_change=None, presort='auto',\n                           random_state=None, subsample=1.0, tol=0.0001,\n                           validation_fraction=0.1, verbose=0,\n                           warm_start=False)"},
    'name': 'Gradient Boosting'}],
  'iter': 0,
  'model': '<pipe
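
The `aucprc` entries in the output above are read here as the area under the precision-recall curve. This tutorial does not show how AutoPrognosis computes that value internally. One common estimator is average precision, sketched below in plain Python under the assumption that no two samples share the same score. The function name and the toy data are illustrative, not part of the library.

```python
def average_precision(y_true, y_score):
    """Step-wise estimate of the area under the precision-recall curve.

    Assumes binary labels (0 or 1) and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("y_true must contain at least one positive label")

    # Rank samples from the highest score to the lowest.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)

    true_positives = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            true_positives += 1
            # Precision at the cut-off where this positive is retrieved.
            total += true_positives / rank
    return total / n_pos


# Toy example: two negatives and two positives.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_ap = average_precision(toy_labels, toy_scores)
print(round(toy_ap, 4))  # -> 0.8333
```

A score of 1.0 means every positive sample is ranked above every negative one, so the values near 0.98 to 0.99 reported above indicate a nearly perfect ranking on this dataset.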

## Computing model predictions

##### The first element of the output is the prediction of a single model; the second element is the prediction of the ensemble

In [4]:
AP_mdl.predict(X_)

(array([[1.        , 0.        ],
        [0.99803922, 0.00196078],
        [0.99681373, 0.00318627],
        ...,
        [0.975     , 0.025     ],
        [0.99730392, 0.00269608],
        [0.22107843, 0.77892157]]), array([[1.        , 0.        ],
        [0.99803922, 0.00196078],
        [0.99681373, 0.00318627],
        ...,
        [0.975     , 0.025     ],
        [0.99730392, 0.00269608],
        [0.22107843, 0.77892157]]))
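
Each row of the printed arrays sums to 1, so the two columns look like class probabilities. Assuming column 0 corresponds to class 0 and column 1 to class 1, which this tutorial does not state explicitly, hard labels can be recovered with an argmax. The sketch below uses three rows copied from the output above instead of calling `AP_mdl`.

```python
import numpy as np

# Three rows copied from the printed output (first, second, and last sample).
proba = np.array([
    [1.0, 0.0],
    [0.99803922, 0.00196078],
    [0.22107843, 0.77892157],
])

# Pick the most probable column in each row.
# Assumption: the column index equals the class label.
labels = np.argmax(proba, axis=1)
print(labels)  # -> [0 0 1]
```

With a fitted model, the same call could be applied to either element of the returned tuple. Note that the call above predicts on the same data used for fitting, so these probabilities are in-sample.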

## Visualize the data

In [5]:
AP_mdl.visualize_data(X_)

## Visualize the model

In [6]:
AP_mdl.APReport()

***Ensemble Report***

**----------------------**

**Rank0:   [ median, Bagging ],   Ensemble weight: 0.5013969581075489**

**----------------------**

{'model_list': [<models.imputers.median object at 0x000001E438644A58>, <models.classifiers.Bagging object at 0x000001E4386480B8>], 'explained': '[ , *Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.* ]', 'image_name': None, 'classes': None, 'num_stages': 2, 'pipeline_stages': ['imputer', 'classifier'], 'name': '[ median, Bagging ]', 'analysis_mode': None, 'analysis_type': None}


**_____________________________________________**

[ , *Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.* ]



**Rank1:   [ median, Normalization, GaussianNaiveBayes ],   Ensemble weight: 0.14415383748272464**

**----------------------**

{'model_list': [<models.imputers.median object at 0x000001E4386444A8>, <models.preprocessors.FeatureNormalizer object at 0x000001E4386441D0>, <models.classifiers.GaussNaiveBayes object at 0x000001E43863DDD8>], 'explained': '[ , , *A Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes theorem with strong (naive) independence assumptions between the features.* ]', 'image_name': None, 'classes': None, 'num_stages': 3, 'pipeline_stages': ['imputer', 'preprocessor', 'classifier'], 'name': '[ median, Normalization, GaussianNaiveBayes ]', 'analysis_mode': None, 'analysis_type': None}


**_____________________________________________**

[ , , *A Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes theorem with strong (naive) independence assumptions between the features.* ]



**Rank2:   [ matrix_completion, Gaussian random projections, Random Forest ],   Ensemble weight: 0.3544492044097264**

**----------------------**

{'model_list': [<models.imputers.matrix_completion object at 0x000001E43863D908>, <models.preprocessors.GaussProjection object at 0x000001E438648CC0>, <models.classifiers.RandomForest object at 0x000001E438648F60>], 'explained': '[ , , *Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.* ]', 'image_name': None, 'classes': None, 'num_stages': 3, 'pipeline_stages': ['imputer', 'preprocessor', 'classifier'], 'name': '[ matrix_completion, Gaussian random projections, Random Forest ]', 'analysis_mode': None, 'analysis_type': None}


**_____________________________________________**

[ , , *Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.* ]
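
The three ensemble weights in this report sum to 1. A natural reading, not confirmed anywhere in this tutorial, is that the ensemble prediction is a weighted average of the members' class probabilities. The sketch below illustrates that idea: the weights are copied from the report, while the member predictions are invented for the example.

```python
import numpy as np

# Ensemble weights copied from the report (Rank0, Rank1, Rank2).
weights = np.array([0.5013969581075489, 0.14415383748272464, 0.3544492044097264])

# Hypothetical class probabilities from the three members for two samples.
# Shape: (n_members, n_samples, n_classes). These numbers are made up.
member_proba = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.8, 0.2], [0.4, 0.6]],
    [[0.7, 0.3], [0.1, 0.9]],
])

# Weighted average over the member axis gives one row per sample.
ensemble_proba = np.tensordot(weights, member_proba, axes=1)
print(ensemble_proba)
```

Because the weights sum to 1 and each member's rows sum to 1, each row of the combined output is again a valid probability distribution.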



**----------------------**

***Kernel Report***

**Component 0**

**Members: ['XGBoost', 'Gradient Boosting', 'Random Forest', 'Neural Network']**

  Mat52.       |  value  |  constraints  |  priors
  variance     |    1.0  |      +ve      |        
  lengthscale  |    1.0  |      +ve      |        


**Component 1**

**Members: ['Multinomial Naive Bayes', 'Bernoulli Naive Bayes', 'Bagging', 'Adaboost']**

  Mat52.       |               value  |  constraints  |  priors
  variance     |  0.9999990131977101  |      +ve      |        
  lengthscale  |  0.5799796857419405  |      +ve      |        


**Component 2**

**Members: ['Linear SVM', 'KNN', 'Decision Trees', 'Perceptron', 'Logistic Regression', 'Gauss Naive Bayes', 'QDA', 'LDA']**

  Mat52.       |               value  |  constraints  |  priors
  variance     |  1.0330784354351474  |      +ve      |        
  lengthscale  |  1.4854067242492952  |      +ve      |        
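
The `Mat52` tables above appear to be Matérn 5/2 kernel summaries, each with a `variance` and a `lengthscale`. Assuming the standard form of that kernel, the sketch below evaluates it for two points one unit apart, using the values reported for Component 0 and Component 1. The function name is illustrative, and how AutoPrognosis measures distance between pipeline configurations is not covered here.

```python
import math


def matern52(distance, variance=1.0, lengthscale=1.0):
    """Standard Matern 5/2 covariance for two points `distance` apart.

    k(d) = variance * (1 + s + s^2 / 3) * exp(-s), with s = sqrt(5) * |d| / lengthscale
    """
    s = math.sqrt(5.0) * abs(distance) / lengthscale
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


# Values copied from the kernel report.
k_component0 = matern52(1.0, variance=1.0, lengthscale=1.0)
k_component1 = matern52(1.0, variance=0.9999990131977101,
                        lengthscale=0.5799796857419405)

print(k_component0, k_component1)
```

At distance zero the kernel equals its variance. A shorter lengthscale, as in Component 1, makes the covariance fall off faster with distance, so the surrogate for that component treats configurations as less similar at the same distance.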


{'best_score_single_clf': 0.993040805691102,
 'ensemble_model_name': ['[ median, Bagging ]',
  '[ median, Normalization, GaussianNaiveBayes ]',
  '[ matrix_completion, Gaussian random projections, Random Forest ]'],
 'ensemble_model_weight': [0.5013969581075489,
  0.14415383748272464,
  0.3544492044097264],
 'ensemble_score': 0.9940387322357569,
 'hyperparameter_properties': [{'name': 'missForest'},
  {'name': 'PCA'},
  {'hyperparameters': {'model': 'BaggingClassifier(base_estimator=None, bootstrap=True, bootstrap_features=False,\n                  max_features=1.0, max_samples=0.1378951280696428,\n                  n_estimators=4080, n_jobs=None, oob_score=False,\n                  random_state=None, verbose=0, warm_start=False)'},
   'name': 'Bagging'}],
 'kernel_members': {0: ['XGBoost',
   'Gradient Boosting',
   'Random Forest',
   'Neural Network'],
  1: ['Multinomial Naive Bayes',
   'Bernoulli Naive Bayes',
   'Bagging',
   'Adaboost'],
  2: ['Linear SVM',
   'KNN',
   'Decisio