# PrplFrame Tutorial Notebook

In this notebook, I explain how to use the PrplFrame library to run a parametric regression study. The notebook is split into sections, each of which can be edited to fit your purposes.

## Setup

In this section we make sure that your environment is ready for the experiments we are about to perform. I'm going to assume you are using Google Colab for these experiments. You can also run this code on your local PC, but you'll need to make sure all dependencies are installed.

The first step is to clone the repository and change directory into the cloned repo.

In [1]:
!git clone https://github.com/PrplHrt/PrplFrame.git
%cd PrplFrame

fatal: destination path 'PrplFrame' already exists and is not an empty directory.
/content/PrplFrame
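
The `fatal: destination path 'PrplFrame' already exists` message above is what `git clone` prints when the cell is re-run after the repository has already been cloned. A minimal sketch of a guard that skips the clone in that case is shown here; `clone_if_missing` is a hypothetical helper written for this example and is not part of PrplFrame.

```python
import os
import subprocess


def clone_if_missing(url, dest, run=subprocess.run):
    """Clone `url` into `dest` unless `dest` already exists.

    Returns True if a clone was started, False if it was skipped.
    `run` is injectable so the function can be exercised without network access.
    """
    if os.path.isdir(dest):
        return False
    run(["git", "clone", url, dest], check=True)
    return True


# Usage (hypothetical helper, run once per session):
# clone_if_missing("https://github.com/PrplHrt/PrplFrame.git", "PrplFrame")
```

In Colab you would still follow this with `%cd PrplFrame` as in the cell above.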


The next step is to import all the needed libraries for the experiment.

In [2]:
# Import all needed libraries for the experiment
import pandas as pd
from evaluation import utils
from models import regression
from output import render
import os

## Loading in our data

After we've set up the environment, we can begin by loading in our data for the experiments. I've designed the code so that most of it can be configured through the single dictionary below. Feel free to change the values in the dictionary to suit your use case. In this experiment, we're going to be using `data/Concrete_Data.xls` as our dataset. If you'd like to use a different dataset, upload it to the `data` directory OR mount your Google Drive and replace the path with the path to the dataset.


Dataset Info details:
*   `name` : the name of the dataset, used for titles
*   `type` : either "Regression" or "Classification"; for now this cannot be changed from "Regression"
*   `target` : the name of the target column in the dataset
*   `split` : the fraction of the data to be used as a test set (`0.2` below means 20%)
*   `path` : the relative or absolute path to the dataset
*   `source` : information about the source of the dataset to be used in output



In [3]:
# Dataset info
dataset_info = {'name': 'Concrete Compressive Strength 2',
                'type': 'Regression',
                'target': 'Concrete compressive strength(MPa, megapascals) ',
                'split': 0.2,
                'path': "data/Concrete_Data 2.xls",
                'source': """Prof. I-Cheng Yeh
  Department of Information Management
  Chung-Hua University,
  Hsin Chu, Taiwan 30067, R.O.C.
  e-mail:icyeh@chu.edu.tw
  TEL:886-3-5186511"""}
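
Because the rest of the notebook reads everything from this dictionary, a quick sanity check before loading the data can catch a missing key or a mistyped split. The sketch below is an illustration only: `check_dataset_info` is a hypothetical helper, and the rule that `split` must lie strictly between 0 and 1 is my assumption based on the `0.2` used above.

```python
REQUIRED_KEYS = {"name", "type", "target", "split", "path", "source"}


def check_dataset_info(info):
    """Return a list of human-readable problems found in a dataset_info-style dict."""
    problems = []
    missing = REQUIRED_KEYS - set(info)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))
    split = info.get("split")
    # Assumption: split is a fraction of the data, not a percentage.
    if not isinstance(split, (int, float)) or not 0 < split < 1:
        problems.append("split should be a fraction between 0 and 1")
    return problems


example = {"name": "Demo", "type": "Regression", "target": "y",
           "split": 0.2, "path": "data/demo.xls", "source": "n/a"}
print(check_dataset_info(example))        # an empty list means no problems found
print(check_dataset_info({"split": 20}))  # reports the missing keys and the bad split
```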

Once we have the information for our dataset, we can load it in and split it accordingly. We also add the size of the dataset to the dictionary automatically.

In [4]:
dataset = pd.read_excel(dataset_info['path'])
dataset_info['size'] = len(dataset)

data = utils.load_Xy(dataset, dataset_info['target'], dataset_info['split'])
dataset.head()

| | Water/Binder Raito | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) |
|---|---|---|---|---|---|---|
| 0 | 0.3 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 0.15 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 0.224631 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 0.24 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 0.238213 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |
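
Note that the `target` string in `dataset_info` ends with a trailing space, which suggests the spreadsheet header does too. Column names that differ only by whitespace are easy to mistype, so a tolerant lookup can help when adapting the notebook to another dataset. This is a minimal sketch; `resolve_column` is a hypothetical helper, not something provided by PrplFrame or pandas.

```python
def resolve_column(name, columns):
    """Return the entry of `columns` that equals `name`, ignoring outer whitespace."""
    if name in columns:
        return name
    matches = [c for c in columns if c.strip() == name.strip()]
    if len(matches) == 1:
        return matches[0]
    raise KeyError(f"no unique column matching {name!r}")


columns = ["Age (day)", "Concrete compressive strength(MPa, megapascals) "]
# The lookup succeeds even though the trailing space was left off.
print(repr(resolve_column("Concrete compressive strength(MPa, megapascals)", columns)))
```

With a real DataFrame you could pass `list(dataset.columns)` as the second argument.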


## Initial Regression

The next step of this tutorial is to train and test the initial regression models. We begin by defining which models we'll be using for the experiments below and then test each one to find out which performs best. For now, we use the R2 score to determine the best model.

The models below are the same ones featured in the sklearn library and accept the same parameters, with the exception of the `SVR()` and `PolynomialRegression()` objects, which are wrapper classes that make working with the underlying models easier.
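
For reference, the R2 score (coefficient of determination) is usually defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below implements that textbook definition in plain Python; `r2` is a name chosen for this example, and whether `utils` computes the score in exactly this way is an assumption.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
print(r2(y_true, y_true))                # perfect predictions give 1.0
print(r2(y_true, [6.0] * 4))             # always predicting the mean gives 0.0
print(r2(y_true, [9.0, 7.0, 5.0, 3.0]))  # worse than the mean gives a negative score
```

A score of 1 is a perfect fit, 0 is no better than predicting the mean, and negative values are possible for poor models.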

In [5]:
# Define which models to use
models = [regression.GaussianProcessRegressor(),
          regression.KNeighborsRegressor(),
          regression.Ridge(),
          regression.LinearRegression(),
          regression.MLPRegressor(),
          regression.PolynomialRegression(),
          regression.SVR(),
          regression.DecisionTreeRegressor(),
          regression.Lasso(),
          regression.RandomForestRegressor()
          ]

Now we run the experiments and determine the best performing model.

In [6]:
# Using a temporary variable to hold the best performing model
# we'll define the best performing model by the highest r2
top_model = None
best_r2 = None

scores=[]
for model in models:
    score = utils.regression_train_and_test(model, *data)
    # Check for best R2 score ("is None" so that a score of exactly 0 is not treated as unset)
    if best_r2 is None or score['r2'] > best_r2:
        top_model = model
        best_r2 = score['r2']
    scores.append(score)

print("Best performing model: ", type(top_model).__name__)
print("w/ score: ", best_r2)

Best performing model:  RandomForestRegressor
w/ score:  0.8191298421262774
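
Since every result is also collected in `scores`, the best entry can alternatively be picked after the loop with `max`. The sketch below uses made-up numbers purely for illustration; the `'r2'` and `'mse'` keys appear in this notebook, while the `'model'` key is a hypothetical addition so the example can name the winner.

```python
# Illustrative score dictionaries (values are invented for this example).
scores = [
    {"model": "Ridge", "r2": 0.0, "mse": 120.0},
    {"model": "RandomForestRegressor", "r2": 0.82, "mse": 50.0},
    {"model": "Lasso", "r2": -0.1, "mse": 130.0},
]

# max() with a key function handles zero and negative R2 values without special cases.
best = max(scores, key=lambda s: s["r2"])
print(best["model"], best["r2"])
```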


Finally, we can print out the results of the regression studies using the `render` functions.

In [7]:
# Sorting the results so that they print out in order of performance
scores.sort(key=lambda x: x['mse'])
render.render_results_html(dataset_info, scores)

Results page for Concrete Compressive Strength 2 saved in results/Concrete_Compressive_Strength_2_results.html...


'results/Concrete_Compressive_Strength_2_results.html'

## Parametric Studies

Now that we have the highest performing model, we can perform some parametric studies. There are two parametric functions in this library:

1.   `utils.regression_parametric_study()` : this function takes in the dataset and iterates through the range of every column while holding the other columns at their average values. We can then output all the results as graphs and a spreadsheet file. This function can be used to customize the ranges of certain columns but not the base values.
2.   `utils.custom_parametric()` : this function is used to perform user-defined studies with values set by you. We'll start from a dictionary of values and show you the results in the second subsection.
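
The idea behind the first function, sweeping one column at a time while the others stay at their mean, can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `one_at_a_time`, `toy_model`, and the evenly spaced grid are all hypothetical choices made for this example.

```python
from statistics import mean


def one_at_a_time(rows, predict, steps=5):
    """Sweep each feature across its observed range, holding the others at their mean."""
    names = list(rows[0])
    base = {n: mean(r[n] for r in rows) for n in names}
    results = {}
    for n in names:
        lo = min(r[n] for r in rows)
        hi = max(r[n] for r in rows)
        grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
        # Replace only feature n in the base point and record (value, prediction).
        results[n] = [(v, predict({**base, n: v})) for v in grid]
    return results


def toy_model(x):
    """Stand-in for a trained regressor."""
    return 2 * x["a"] + x["b"]


rows = [{"a": 0.0, "b": 10.0}, {"a": 1.0, "b": 30.0}]
sweep = one_at_a_time(rows, toy_model, steps=3)
print(sweep["a"])  # predictions as "a" varies with "b" fixed at its mean
print(sweep["b"])  # predictions as "b" varies with "a" fixed at its mean
```

Each entry of the result maps a column name to the curve you would plot for that column.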

In both of the following subsections, we're going to try to stay as close as possible to a study that uses the following values:


*   `Water/Binder Ratio (W/B)` : [0.1, 0.2, 0.3, 0.4, 0.5]
*   `Superplasticizer (component 5)(kg in a m^3 mixture) (SP)` : [6, 9, 12, 15, 18, 21, 24, 27, 30, 33]
*   `Coarse Aggregate (component 6)(kg in a m^3 mixture) (CA)` : 973
*   `Fine Aggregate (component 7)(kg in a m^3 mixture) (FA)` : 774.2
*   `Age (day)` : 28






### Automatic regression study

In [9]:
# The results of this regression will be stored in the folder defined below
save_dir = 'autoRegress'
stats, results = utils.regression_parametric_study(top_model, dataset, dataset_info['target'], c1=[0.1, 0.2, 0.3, 0.4, 0.5], c2=[6, 9, 12, 15, 18, 21, 24, 27, 30, 33])

render.plot_parametric_graphs(stats, results, dataset_info['target'], save_dir, make_excel = True)

Parametric plots saved in directory: autoRegress/parametric
Parametric plots data saved in directory:  autoRegress/parametric/parametric_data.xlsx




### Custom parametric study

Here we manually set all the values we want to use in the `values` dictionary in the code block below. We also create a directory to save all the results.

In [22]:
custom_save_dir = os.path.join('results', "custom_parametric")
os.makedirs(custom_save_dir, exist_ok=True)

values = {
  'Water/Binder Raito ' : [0.1, 0.2, 0.3, 0.4, 0.5],
  'Superplasticizer (component 5)(kg in a m^3 mixture)' : [6, 9, 12, 15, 18, 21, 24, 27, 30, 33],
  'Coarse Aggregate  (component 6)(kg in a m^3 mixture)' : 973,
  'Fine Aggregate (component 7)(kg in a m^3 mixture)' : 774.2,
  'Age (day)' : 28
}

Once the values are set we can run the function.

In [24]:
results, study_vals = utils.custom_parametric(top_model, dataset, values, dataset_info['target'])
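
The output at the end of this section (W/B held at 0.1 while SP steps through 6, 9, 12, ...) is consistent with a full grid over the listed values, with single numbers treated as fixed. The sketch below shows how such a grid could be built with `itertools.product`. It is an assumption about the behaviour, not the code of `utils.custom_parametric`, and both `expand_grid` and the shortened column names are hypothetical.

```python
from itertools import product

# Shortened, illustrative column names; the real study uses the full dataset headers.
values = {
    "W/B": [0.1, 0.2, 0.3, 0.4, 0.5],
    "SP": [6, 9, 12, 15, 18, 21, 24, 27, 30, 33],
    "CA": 973,
    "FA": 774.2,
    "Age": 28,
}


def expand_grid(values):
    """Treat scalars as one-element lists and return every combination as a dict."""
    axes = [v if isinstance(v, list) else [v] for v in values.values()]
    return [dict(zip(values, combo)) for combo in product(*axes)]


grid = expand_grid(values)
print(len(grid))  # 5 W/B values x 10 SP values = 50 combinations
print(grid[0])    # first point: W/B 0.1, SP 6, and the fixed CA, FA, Age
```

Because `product` varies the last list fastest, SP cycles through its values before W/B moves to the next one.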

Finally, we can extract the values and save them into the chosen folder.

In [25]:
df = pd.DataFrame(study_vals, columns=dataset.drop(dataset_info['target'], axis=1).columns)
df[dataset_info['target']] = results
df.to_csv(os.path.join(custom_save_dir, 'custom_parametric_data.csv'))
df.head()

| | Water/Binder Raito | Superplasticizer (component 5)(kg in a m^3 mixture) | Coarse Aggregate (component 6)(kg in a m^3 mixture) | Fine Aggregate (component 7)(kg in a m^3 mixture) | Age (day) | Concrete compressive strength(MPa, megapascals) |
|---|---|---|---|---|---|---|
| 0 | 0.1 | 6 | 973 | 774.2 | 28 | 64.268392 |
| 1 | 0.1 | 9 | 973 | 774.2 | 28 | 64.928812 |
| 2 | 0.1 | 12 | 973 | 774.2 | 28 | 66.388558 |
| 3 | 0.1 | 15 | 973 | 774.2 | 28 | 66.838234 |
| 4 | 0.1 | 18 | 973 | 774.2 | 28 | 70.067671 |
