# Example notebook
This notebook demonstrates how each module works and shows the expected outputs.
## Part 1 - Get everything ready
Importing modules

In [None]:
import sys
sys.path.append('..')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from formulation.modules import classification
from formulation.modules import cross_validate
from formulation.modules import importance
from formulation.modules import predict
from formulation.modules import predict_missing_value
from sklearn.model_selection import train_test_split

Reading data

In [None]:
data = pd.read_csv("../formulation/data/FDA_APPROVED.csv")
data

Defining which features are used as inputs and which one is the output

In [None]:
NEEDED = ['% Excreted Unchanged in Urine', 'CLogP', 'HBA', 'HBD', 'PSA', 'Formulation']
INPUTS = NEEDED[:-1]
OUTPUT = NEEDED[-1]

## Part 2 - Train the model

Eliminating data points with missing values (NaN) and splitting data into train and test sets

In [None]:
clean_data = predict_missing_value.data_dropna(data, NEEDED, NEEDED)
train, test = train_test_split(clean_data, test_size=0.1, random_state=1010)
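
The helper `predict_missing_value.data_dropna` is not shown in this notebook. As a rough sketch, assuming it behaves like pandas `DataFrame.dropna` restricted to the listed columns, the drop-and-split step might look like this on a toy frame (the values and the `toy_*` names are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for FDA_APPROVED.csv (values are illustrative only)
toy = pd.DataFrame({
    'CLogP': [1.2, np.nan, 3.4, 0.5, 2.2, 4.1, 1.9, 2.8, 0.3, 3.0],
    'HBA': [2, 3, np.nan, 5, 1, 4, 2, 6, 3, 2],
    'Formulation': ['tablets', 'capsules', 'solution', 'tablets', None,
                    'tablets', 'capsules', 'solution', 'tablets', 'capsules'],
})
columns = ['CLogP', 'HBA', 'Formulation']

# Assumption: data_dropna keeps the listed columns and drops rows with NaN in any of them
toy_clean = toy[columns].dropna(subset=columns)

# Hold out 10% of the remaining rows for testing, with a fixed seed for reproducibility
toy_train, toy_test = train_test_split(toy_clean, test_size=0.1, random_state=1010)
print(len(toy_clean), len(toy_train), len(toy_test))
```

Dropping rows before splitting keeps incomplete records out of both the train and the test set.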

The `classification.predict` function takes the training data and holds out 10% of it as a validation set.  
During training, the function prints the feature importances and the accuracy on the validation set.

In [None]:
model = classification.predict(train[INPUTS], train[OUTPUT])
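
The internals of `classification.predict` are not shown here. The following is a minimal sketch of the behaviour described above, assuming a scikit-learn random forest and a 10% validation split. The synthetic data, the hyperparameters, and names such as `sketch_model` are illustrative assumptions, not the module's actual code:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two numeric features and a label derived from the first
rng = np.random.default_rng(0)
X = pd.DataFrame({'CLogP': rng.normal(size=200), 'PSA': rng.normal(size=200)})
y = pd.Series(np.where(X['CLogP'] > 0, 'tablets', 'solution'))

# Assumption: 10% of the training data is held out as a validation set
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.1, random_state=0)

# Assumption: the classifier is a random forest; the hyperparameters are arbitrary
sketch_model = RandomForestClassifier(n_estimators=100, random_state=0)
sketch_model.fit(X_fit, y_fit)

# Report feature importance and validation accuracy, as the text describes
feature_importance = pd.Series(sketch_model.feature_importances_, index=X.columns)
val_accuracy = accuracy_score(y_val, sketch_model.predict(X_val))
print(feature_importance)
print('Validation accuracy: {:.2f}'.format(val_accuracy))
```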

Using the trained model to predict formulation by calling `predict.predict`

In [None]:
predict.predict(model, test[INPUTS])

## Part 3 - Fill missing values

Sometimes the data set has missing values. We can either drop those rows (shown above) or use regression methods to fill them in. For example, we can use other features to fill missing values in the solubility parameters.

In [None]:
NEEDED = ['MW Drug', 'MW Sol', 'CLogP', 'HBA', 'HBD', 'PSDA', 'ALOGPS 2.1 solubility', 'Measured LogD74']
INPUTS = NEEDED[:-1]
OUTPUT = NEEDED[-1]

filled_data = predict_missing_value.fill_missing_value(data, NEEDED, INPUTS, OUTPUT)


NEEDED = ['MW Drug', 'MW Sol', 'CLogP', 'HBA', 'HBD', 'PSDA', 'ALOGPS 2.1 solubility', 'Measured LogS (molar)']
INPUTS = NEEDED[:-1]
OUTPUT = NEEDED[-1]

filled_data = predict_missing_value.fill_missing_value(filled_data, NEEDED, INPUTS, OUTPUT)
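
The regression method behind `fill_missing_value` is not specified in this notebook. As an illustration of the general idea only, the sketch below fits an ordinary least-squares model on the rows where the output is known and predicts it where it is missing. The helper `fill_with_linear_fit` and the toy column `LogD` are hypothetical:

```python
import numpy as np
import pandas as pd

def fill_with_linear_fit(df, inputs, output):
    """Hypothetical helper: fill NaN in `output` from a least-squares fit on `inputs`."""
    filled = df.copy()
    has_inputs = filled[inputs].notna().all(axis=1)
    known = has_inputs & filled[output].notna()
    missing = has_inputs & filled[output].isna()

    # Design matrix with an intercept column, fitted on rows where the output is known
    A_known = np.column_stack(
        [np.ones(known.sum()), filled.loc[known, inputs].to_numpy(dtype=float)])
    target = filled.loc[known, output].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(A_known, target, rcond=None)

    # Predict only the rows where the output is missing; observed values stay untouched
    A_missing = np.column_stack(
        [np.ones(missing.sum()), filled.loc[missing, inputs].to_numpy(dtype=float)])
    filled.loc[missing, output] = A_missing @ coef
    return filled

# Toy frame in which the known rows follow LogD = 1 + 2*CLogP + HBA
toy = pd.DataFrame({
    'CLogP': [0.0, 1.0, 2.0, 3.0, 4.0],
    'HBA': [1.0, 0.0, 2.0, 1.0, 3.0],
    'LogD': [2.0, 3.0, 7.0, np.nan, 12.0],
})
toy_filled = fill_with_linear_fit(toy, ['CLogP', 'HBA'], 'LogD')
print(toy_filled)
```

Because the known rows in the toy frame are exactly linear, the filled value follows the same relation. Real data will not be this clean, so filled values carry the regression's error into any model trained on them.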

Use the filled data to train a new model

In [None]:
NEEDED = ['% Excreted Unchanged in Urine', 'CLogP', 'Measured LogD74', 'Measured LogS (molar)', 'PSA', 'Formulation']
INPUTS = NEEDED[:-1]
OUTPUT = NEEDED[-1]

clean_data = predict_missing_value.data_dropna(filled_data, NEEDED, NEEDED)

new_model = classification.predict(clean_data[INPUTS], clean_data[OUTPUT])

In [None]:
name_list = ['capsules', 'solution', 'tablets', 'overall accuracy']
original_accuracy = [0.67, 0.40, 0.65, 0.61]
after_accuracy = [0.50, 0.62, 0.66, 0.65]

Make a bar plot to contrast accuracy before and after filling missing data

In [None]:
width = 0.2

#plt.figure(figsize=(5, 3), dpi=600)

x = np.arange(len(original_accuracy))
plt.bar(x, original_accuracy, width=width, label='Without data filling')

x = x + width
plt.bar(x, after_accuracy, width=width, label='With data filling')

plt.legend()
plt.title("",size=12)
plt.ylabel('Accuracy', size=14)
plt.ylim(0, 1)
plt.xlabel('Formulation', size=14)
plt.xticks(ticks=x-width/2, labels=name_list)

## Part 4 - Choose best predictors

To evaluate the importance of each predictor, in both the data with missing rows dropped (`clean`) and `filled_data`:

In [None]:
need = ['% Excreted Unchanged in Urine', 'CLogP', 'HBA', 'HBD', 'PSDA','Formulation']

In [None]:
clean = predict_missing_value.data_dropna(data, need, need)

In [None]:
importance.importance(clean[need[:-1]], clean[need[-1]], 0.2, [100,300,500,700,1000])

In [None]:
filled_clean = filled_data[need].dropna()
importance.importance(filled_clean[need[:-1]], filled_clean[need[-1]], 0.2, [100,300,500,700,1000])

The results are shown in the dataframes above. Using either the data with missing rows dropped (`clean`) or `filled_data` gives similar results. `CLogP` and `PSDA` are the most significant predictors in our random forest model, followed by `HBA` and `% Excreted Unchanged in Urine`. `HBD` appears less important than the others, and changing `n_estimators` has little influence on the results.
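
For readers without access to `importance.importance`, here is a hedged sketch of how such a table could be produced with scikit-learn. It assumes the `0.2` argument is the held-out fraction and the list holds the `n_estimators` values to try; both are guesses about the module's signature, and the data is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends strongly on CLogP, weakly on PSA, not on HBD
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['CLogP', 'PSA', 'HBD'])
y = np.where(X['CLogP'] + 0.3 * X['PSA'] > 0, 'tablets', 'capsules')

# Assumption: 0.2 is the fraction of rows held out for scoring
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

rows = {}
for n in [100, 300, 500]:
    forest = RandomForestClassifier(n_estimators=n, random_state=1).fit(X_fit, y_fit)
    # One row per forest size: impurity-based importances plus held-out accuracy
    rows[n] = dict(zip(X.columns, forest.feature_importances_),
                   accuracy=forest.score(X_val, y_val))

importance_table = pd.DataFrame.from_dict(rows, orient='index')
importance_table.index.name = 'n_estimators'
print(importance_table.round(3))
```

Laying the importances out per forest size makes it easy to check whether the ranking is stable as the number of trees grows.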

Choosing different input features can change the accuracy.

In [None]:
classification.determine_new_accuracy(3, clean_data[INPUTS], clean_data[OUTPUT])

In [None]:
classification.determine_new_accuracy(3, clean_data[NEEDED[:-2]], clean_data[OUTPUT])

In [None]:
classification.determine_new_accuracy(3, clean_data[NEEDED[1:5]], clean_data[OUTPUT])

## Part 5 - Choose best hyperparameters

In [None]:
max_depth = range(1, 5)
n_trees = range(1, 200, 50)
results = cross_validate.cross_validate_grid_search(
            [max_depth, n_trees], clean_data[INPUTS], clean_data[OUTPUT].to_frame())

In [None]:
best_for_total = results[0]
best_for_solution = results[1]
best_for_capsules = results[2]
best_for_tablets = results[3]

print('Best max_depth: {:d}, best n_estimators: {:d}'.format(
                    best_for_total[0], best_for_total[1]))
print('Best parameter for solution category:', best_for_solution)
print('Best parameter for capsules category:', best_for_capsules)
print('Best parameter for tablets category:', best_for_tablets)
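
The implementation of `cross_validate_grid_search` is not shown. A comparable search over the same `max_depth` and `n_estimators` ranges can be sketched with scikit-learn's `GridSearchCV`, assuming a random forest classifier and 3-fold cross-validation (both assumptions). This sketch uses synthetic data and only selects the best parameters for overall accuracy, not per formulation category:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a label derived from one feature
rng = np.random.default_rng(2)
X = pd.DataFrame(rng.normal(size=(150, 3)), columns=['CLogP', 'PSA', 'HBA'])
y = np.where(X['CLogP'] > 0, 'tablets', 'solution')

# Same ranges as the cell above: depths 1-4 and tree counts 1, 51, 101, 151
param_grid = {
    'max_depth': list(range(1, 5)),
    'n_estimators': list(range(1, 200, 50)),
}

# Assumption: 3-fold cross-validation scored by accuracy
search = GridSearchCV(RandomForestClassifier(random_state=2), param_grid,
                      cv=3, scoring='accuracy')
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```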