# PrplFrame Tutorial Notebook

In this notebook, I explain how to use the PrplFrame library to perform a parametric classification study. The notebook is organized into sections that you can edit to fit your own purposes.

## Setup

In this section we make sure that your environment is ready for the experiments we are about to perform. I'm going to assume you are using Google Colab for these experiments (you can also run this code on your local PC, but you'll need to make sure all dependencies are installed).

The first step is to clone the repository and change directory into the cloned repo.

In [1]:
!git clone https://github.com/PrplHrt/PrplFrame.git
%cd PrplFrame

Cloning into 'PrplFrame'...
remote: Enumerating objects: 262, done.
remote: Counting objects: 100% (262/262), done.
remote: Compressing objects: 100% (183/183), done.
remote: Total 262 (delta 113), reused 221 (delta 76), pack-reused 0
Receiving objects: 100% (262/262), 8.13 MiB | 25.53 MiB/s, done.
Resolving deltas: 100% (113/113), done.
/content/PrplFrame


The next step is to import all the needed libraries for the experiment.

In [2]:
# Import all needed libraries for the experiment
import pandas as pd
from evaluation import utils
from models import classification
from output import render
import os
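
If you're running locally instead of on Colab, a quick check like the one below can tell you which dependencies are missing before the imports fail. This is only a sketch: `missing_modules` is a hypothetical helper, not part of PrplFrame, and the module names passed to it (`pandas`, `sklearn`) are the two third-party libraries this notebook mentions, so the repository may need others as well.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# The list below is an assumption based on the libraries named in this notebook.
print(missing_modules(["pandas", "sklearn"]))
```

An empty list means both modules can be imported; anything printed needs to be installed first.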

## Loading in our data

After we've set up the environment, we can begin by first loading in our data for the experiments. I've designed the code so that most of it can be edited by working with a single dictionary below. Feel free to change the dictionary with the values appropriate to your use case. In this experiment, we're going to be using `data/WineQT.csv` as our dataset. If you'd like to use a different dataset, upload it to the `data` directory OR mount your Google Drive and replace the path with the path to the dataset.


Dataset Info details:
*   `name` : the name of the dataset, used for titles
*   `type` : either "Regression" or "Classification"; as of now, this value can't be changed
*   `target` :  the name of the target column in the dataset
*   `split` : the percentage/fraction of the data to be used as a test set
*   `path` : the relative or absolute path to the dataset
*   `source` : information about the source of the dataset to be used in output
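
Since the whole experiment is driven by this one dictionary, it can help to sanity-check it before loading anything. The snippet below is a minimal sketch of such a check. `check_dataset_info` is a hypothetical helper (it is not part of PrplFrame), and it assumes that `split` is given as a fraction strictly between 0 and 1.

```python
# Keys described in the list above.
REQUIRED_KEYS = {"name", "type", "target", "split", "path", "source"}


def check_dataset_info(info):
    """Return a list of problems found in a dataset_info dictionary (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - info.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    split = info.get("split")
    # Assumption: split is a fraction such as 0.2, not a percentage such as 20.
    if not isinstance(split, (int, float)) or not 0 < split < 1:
        problems.append("split must be a fraction strictly between 0 and 1")
    return problems


example_info = {
    "name": "Wine Quality Dataset",
    "type": "Classification",
    "target": "quality",
    "split": 0.2,
    "path": "data/WineQT.csv",
    "source": "https://www.kaggle.com/datasets/yasserh/wine-quality-dataset",
}
print(check_dataset_info(example_info))  # prints []
```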



In [3]:
# Dataset info
dataset_info = {'name': 'Wine Quality Dataset',
                'type': 'Classification',
                'target': 'quality',
                'split': 0.2,
                'path': "data/WineQT.csv",
                'source': """https://www.kaggle.com/datasets/yasserh/wine-quality-dataset"""}

Once we have the information for our dataset, we can load it and split it accordingly. The size of the dataset is also added to the dictionary automatically.

In [4]:
dataset = pd.read_csv(dataset_info['path'], index_col="Id")
dataset_info['size'] = len(dataset)

data = utils.load_Xy(dataset, dataset_info['target'], dataset_info['split'])
dataset.head()

Id,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
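
The internals of `utils.load_Xy` aren't shown in this notebook. As a rough mental model, the sketch below separates the target column from the features and holds out a fraction of the rows as a test set. `split_features_target` is a hypothetical stand-in written with plain pandas; the return order, the shuffling, and the `seed` parameter are all assumptions, and the real function may behave differently (for example, it may stratify by class).

```python
import pandas as pd


def split_features_target(df, target, test_fraction, seed=0):
    """Hypothetical stand-in for utils.load_Xy.

    Returns X_train, X_test, y_train, y_test (this order is an assumption).
    """
    # Randomly pick the test rows, then use everything else for training.
    test = df.sample(frac=test_fraction, random_state=seed)
    train = df.drop(test.index)
    return (train.drop(columns=target), test.drop(columns=target),
            train[target], test[target])


# Toy data, only to show the shapes that come out of the split.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4, 10.0, 11.2, 9.5, 10.5, 12.0],
    "quality": [5, 5, 5, 6, 5, 6, 7, 5, 6, 7],
})
X_train, X_test, y_train, y_test = split_features_target(toy, "quality", 0.2)
print(len(X_train), len(X_test))  # prints 8 2
```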


## Initial Classification

The next step of this tutorial is to train and test the initial classification models. We begin by defining which models we'll be using for the experiments below and then test each one to find out which performs best. For now, we use the F1 score to determine the best model.
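
To make the F1 score concrete for a multi-class target like `quality`, here is a small pure-Python sketch that computes F1 per class and then averages the classes equally (the "macro" average). How `utils.classification_train_and_test` averages F1 across classes isn't stated in this notebook, so treat the macro average as one possible choice, not necessarily the one the library uses. The labels below are toy values.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN) for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct gets a score of 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels for illustration only.
y_true = [5, 5, 6, 6, 7, 5]
y_pred = [5, 6, 6, 6, 5, 5]
print(f1_per_class(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Note how class 7 is never predicted, so it contributes an F1 of 0 and pulls the macro average down even though the other classes do reasonably well.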

The models below are the same ones featured in the sklearn library and accept the same parameters, with the exception of the `SVC()` object, which is a wrapper class made to make working with the underlying models easier.

In [5]:
# Define which models to use
models = [classification.LogisticRegression(),
          classification.SVC(),
          classification.DecisionTreeClassifier(),
          classification.GaussianProcessClassifier(),
          classification.RandomForestClassifier(),
          classification.MLPClassifier()
          ]

Now we run the experiments and determine the best performing model.

In [7]:
# Using a temporary variable to hold the best performing model
# we'll define the best performing model by the highest f1 score
top_model = None
best_f1 = None

scores=[]
for model in models:
    score = utils.classification_train_and_test(model, *data)
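    # Check for best f1 score; compare against None explicitly so that
    # a legitimate score of 0.0 is not mistaken for "no score yet"
    if (best_f1 is None) or (score['f1'] > best_f1):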
        top_model = model
        best_f1 = score['f1']
    scores.append(score)

print("Best performing model: ", type(top_model).__name__)
print("w/ score: ", best_f1)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Best performing model:  RandomForestClassifier
w/ score:  0.6637554585152838
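
The same selection can also be written with Python's built-in `max`, which avoids the temporary variables. This is just an alternative sketch: `best_by_metric` is a hypothetical helper, the model names and numbers are toy values, and the only assumption carried over from the loop above is that each score is a dictionary with an `'f1'` entry.

```python
def best_by_metric(names, scores, metric="f1"):
    """Return (name, value) for the entry with the highest metric.

    On ties, max() keeps the first entry, matching a strict '>' comparison.
    """
    best_index = max(range(len(scores)), key=lambda i: scores[i][metric])
    return names[best_index], scores[best_index][metric]


# Toy names and scores for illustration only.
toy_names = ["model_a", "model_b", "model_c"]
toy_scores = [{"f1": 0.0}, {"f1": 0.5}, {"f1": 0.5}]
print(best_by_metric(toy_names, toy_scores))  # prints ('model_b', 0.5)
```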




In a regression study, the results could now be printed out using the `render` functions. Currently, `render` doesn't support classification results, so we skip that step here.

## Parametric Studies

Now that we have the highest performing model, we can perform some parametric studies. There are two parametric functions in this library:

1.   `utils.parametric_study()` : this function takes in the dataset and iterates through the range of every column while holding the other columns at their average values. We can then output all the results as graphs and a CSV file. This function can be used to customize the ranges of certain columns but not the base values.
2.   `utils.custom_parametric()` : this function is used to perform user-defined studies with values set by you. We'll start with a dictionary of values and show you the results in the second subsection below.
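
To illustrate the idea behind the automatic study, the sketch below sweeps one column from its minimum to its maximum while every other column stays at its mean, and asks a model for a prediction at each point. It is not the library's implementation: `sweep_feature` and `ThresholdModel` are hypothetical, and the default of 100 evenly spaced points is only inferred from the printed `fixed acidity` array further down, whose step of about 0.114 is consistent with 100 points between 4.6 and 15.9.

```python
import numpy as np
import pandas as pd


class ThresholdModel:
    """Toy stand-in for a fitted classifier: predicts 1 when alcohol > 10, else 0."""

    def predict(self, X):
        return (X["alcohol"] > 10).astype(int).to_numpy()


def sweep_feature(model, features, column, n_points=100):
    """Vary one column over its observed range; hold the others at their means."""
    base = features.mean()
    grid = np.linspace(features[column].min(), features[column].max(), n_points)
    # One row per grid point, every column starting at its mean value.
    rows = pd.DataFrame(np.tile(base.to_numpy(), (n_points, 1)),
                        columns=features.columns)
    rows[column] = grid
    return grid, model.predict(rows)


# Toy feature table for illustration only.
features = pd.DataFrame({"alcohol": [9.4, 9.8, 11.2, 12.0],
                         "pH": [3.51, 3.20, 3.16, 3.30]})
grid, preds = sweep_feature(ThresholdModel(), features, "alcohol", n_points=5)
print(grid)
print(preds)
```

Sweeping `pH` with this toy model gives a constant prediction, because the model only looks at `alcohol`, which stays fixed at its mean during that sweep.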






### Automatic parametric study

In [10]:
# The results of this study will be stored in the folder defined below
save_dir = 'autoClass'
stats, results = utils.parametric_study(top_model,
                                        dataset,
                                        dataset_info['target'])

stats, results

(                           Mean        Max      Min
 fixed acidity          8.311111   15.90000  4.60000
 volatile acidity       0.531339    1.58000  0.12000
 citric acid            0.268364    1.00000  0.00000
 residual sugar         2.532152   15.50000  0.90000
 chlorides              0.086933    0.61100  0.01200
 free sulfur dioxide   15.615486   68.00000  1.00000
 total sulfur dioxide  45.914698  289.00000  6.00000
 density                0.996730    1.00369  0.99007
 pH                     3.311015    4.01000  2.74000
 sulphates              0.657708    2.00000  0.33000
 alcohol               10.442111   14.90000  8.40000,
 {'fixed acidity': (array([ 4.6       ,  4.71414141,  4.82828283,  4.94242424,  5.05656566,
           5.17070707,  5.28484848,  5.3989899 ,  5.51313131,  5.62727273,
           5.74141414,  5.85555556,  5.96969697,  6.08383838,  6.1979798 ,
           6.31212121,  6.42626263,  6.54040404,  6.65454545,  6.76868687,
           6.88282828,  6.9969697 ,  7.1111111

### Custom parametric study

Here we manually set the values we want to study in the `values` dictionary in the code block below. We also create a directory to save the results.

In [11]:
custom_save_dir = os.path.join('results', "custom_parametric")
os.makedirs(custom_save_dir, exist_ok=True)

values = {
}
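
The `values` dictionary is left empty here, and the exact format that `utils.custom_parametric()` expects isn't shown in this notebook. One plausible shape, used purely as an assumption in the sketch below, maps a column name to the list of values to try, with every unlisted column falling back to a base value such as its mean. `expand_values` is a hypothetical helper that turns such a dictionary into the rows to evaluate; the numbers are toy values.

```python
from itertools import product


def expand_values(base, values):
    """Build one row per combination of the listed values.

    base:   {column: default value}, e.g. the column means
    values: {column: [values to try]} (assumed format, not confirmed by the library)
    """
    columns = list(values)
    rows = []
    for combo in product(*(values[c] for c in columns)):
        row = dict(base)                  # start from the base values
        row.update(zip(columns, combo))   # override the studied columns
        rows.append(row)
    return rows


# Toy base values and study values for illustration only.
base = {"alcohol": 10.4, "pH": 3.3, "sulphates": 0.66}
rows = expand_values(base, {"alcohol": [9.0, 12.0], "pH": [3.0, 3.5]})
print(len(rows))  # prints 4
print(len(expand_values(base, {})))  # prints 1
```

Under this assumed format, an empty dictionary yields a single row made up entirely of base values.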

Once the values are set we can run the function.

In [12]:
results, study_vals = utils.custom_parametric(top_model, dataset, values, dataset_info['target'])

Finally, we can extract the values and save them into the chosen folder.

In [13]:
df = pd.DataFrame(study_vals, columns=dataset.drop(dataset_info['target'], axis=1).columns)
df[dataset_info['target']] = results
df.to_csv(os.path.join(custom_save_dir, 'custom_parametric_data.csv'))
df.head()

,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,8.311111,0.531339,0.268364,2.532152,0.086933,15.615486,45.914698,0.99673,3.311015,0.657708,10.442111,6
