# Part A: Precision Medicine Optimization Pipeline Build
---

## Purpose
This file builds the exported pipeline from TPOT that is called in the `Optimization_POC.ipynb` Part B notebook.

#### Methods
We load in the dataset created from the raw `breastCancer29.csv` file to then split into training, validation, and testing data.
In particular, we leave aside 100 random, balanced participants to confirm accuracy scores. We obtain an accuracy score through fitting a TPOT model on the 75% of the training data, while using the remaining 25% to construct an accuracy report of the model.
Afterwards, we confirm the model's accuracy is robust and export the pipeline for usage in `Optimization_POC.ipynb`.

TL;DR: We split into training (`pm_train.csv`) & testing data (`pm_test.csv`). Then we split our training data in order to hold out a validation set for accuracy reporting purposes. We then export the pipeline for later use.

In [1]:
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import random

# Load breast cancer data
bcdata = pd.read_csv('breastCancer29.csv')

# store in two dfs the cases and controls
all_cases = bcdata.loc[bcdata.phenotype == 0, :].reset_index(drop=True)
all_controls = bcdata.loc[bcdata.phenotype == 1, :].reset_index(drop=True)

# Select 50 random numbers in both case and controls
random.seed(42)
case_idx = random.sample(range(all_cases.shape[0]), 50)            
control_idx = random.sample(range(all_controls.shape[0]), 50)

fifty_cases = all_cases.loc[case_idx, :]            
fifty_controls = all_controls.loc[control_idx, :]   

# Resulting 100 balanced random rows (50 case/50 control) as precision medicine test dataset
pm_test = pd.concat([fifty_cases, fifty_controls]).reset_index(drop=True)
pm_train = bcdata.drop(pm_test.index).reset_index(drop=True)

# pre_med_data == pm_test
# test_validated_data == pm_train

In [2]:
pm_test.to_csv('pm_test.csv', index = False)
pm_train.to_csv('pm_train.csv', index = False) 

In [3]:
# Train & validate on the pm_train
Xdata = pm_train.loc[:, bcdata.columns != 'phenotype']
Ydata = pm_train['phenotype']
X_train, X_test, Y_train, Y_test = train_test_split(Xdata, Ydata, random_state=42,
                                                    train_size=0.75, test_size=0.25)

In [4]:
tpot = TPOTClassifier(generations=100, population_size=100, verbosity=2, max_time_mins=15, early_stop=5)
tpot.fit(X_train, Y_train)
print(tpot.score(X_test, Y_test))



HBox(children=(IntProgress(value=0, description='Optimization Progress', style=ProgressStyle(description_width…

Generation 1 - Current best internal CV score: 0.5546321945213911
Generation 2 - Current best internal CV score: 0.5573945829485996
Generation 3 - Current best internal CV score: 0.5573945829485996
Generation 4 - Current best internal CV score: 0.5573945829485996
Generation 5 - Current best internal CV score: 0.5573945829485996
Generation 6 - Current best internal CV score: 0.560727916281933

19.778496816666667 minutes have elapsed. TPOT will close down.
TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: LogisticRegression(PCA(input_matrix, iterated_power=1, svd_solver=randomized), C=0.001, dual=False, penalty=l2)
0.5440931780366056


In [6]:
tpot.export('bcdata_pipeline.py')
