# Breast Cancer Data Processing and Modeling

I'm using a data set of 116 patients who were either diagnosed with breast cancer or healthy. The data set attributes are patient age, body mass index, and levels of glocose, insulin, Homeostatic Model Assessment for Insulin Resistance (HOMA), leptin, adiponectin, resistin, and monocyte chemoattractant protein 1 (MCP.1). 

The data set can be found here: https://archive-beta.ics.uci.edu/dataset/451/breast+cancer+coimbra

The units of each attribute are: Age (years), BMI (kg/m2), Glucose (mg/dL), Insulin (µU/mL), HOMA (an index value. >1.9 indicates early insulin resistance, >2.9 indicates significant insulin resistance), Leptin (ng/mL), Adiponectin (µg/mL), Resistin (ng/mL), MCP-1(pg/dL). After preprocessing, the classification value of 0 indicates a healthy patient and a value of 1 indicates a patient with cancer.

First, I will be processing the data and preparing it for analysis. Then, in the modeling section of this notebook, I will be using Logistic Regression, Decision Tree, Random Forest, and Classification ensemble techniques to create predictive models. I'll compare the results of these models and discuss which one is the most effective in predicting the presence of cancer in patients. 

## Import Necessary Packages and Set Seed

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# set random seed to ensure that results are repeatable
np.random.seed(1)

## Load Data

In [2]:
patients = pd.read_csv("breast_cancer.csv")
patients.head(6)

Unnamed: 0,Age,BMI,Glucose,Insulin,HOMA,Leptin,Adiponectin,Resistin,MCP.1,Classification
0,48,23.5,70,2.707,0.467409,8.8071,9.7024,7.99585,417.114,1
1,83,20.690495,92,3.115,0.706897,8.8438,5.429285,4.06405,468.786,1
2,82,23.12467,91,4.498,1.009651,17.9393,22.43204,9.27715,554.697,1
3,68,21.367521,77,3.226,0.612725,9.8827,7.16956,12.766,928.22,1
4,86,21.111111,92,3.549,0.805386,6.6994,4.81924,10.57635,773.92,1
5,49,22.854458,92,3.226,0.732087,6.8317,13.67975,10.3176,530.41,1


## Initial Exploration

Determine how many rows and attributes are in the data, if any values are missing, what types of data are stored in the columns, and clean up the column names. 

In [3]:
# generate a basic summary of the data
patients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116 entries, 0 to 115
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             116 non-null    int64  
 1   BMI             116 non-null    float64
 2   Glucose         116 non-null    int64  
 3   Insulin         116 non-null    float64
 4   HOMA            116 non-null    float64
 5   Leptin          116 non-null    float64
 6   Adiponectin     116 non-null    float64
 7   Resistin        116 non-null    float64
 8   MCP.1           116 non-null    float64
 9   Classification  116 non-null    int64  
dtypes: float64(7), int64(3)
memory usage: 9.2 KB


In [4]:
# generate a statistical summary of the numeric value in the data
patients.describe()

Unnamed: 0,Age,BMI,Glucose,Insulin,HOMA,Leptin,Adiponectin,Resistin,MCP.1,Classification
count,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0,116.0
mean,57.301724,27.582111,97.793103,10.012086,2.694988,26.61508,10.180874,14.725966,534.647,1.551724
std,16.112766,5.020136,22.525162,10.067768,3.642043,19.183294,6.843341,12.390646,345.912663,0.499475
min,24.0,18.37,60.0,2.432,0.467409,4.311,1.65602,3.21,45.843,1.0
25%,45.0,22.973205,85.75,4.35925,0.917966,12.313675,5.474283,6.881763,269.97825,1.0
50%,56.0,27.662416,92.0,5.9245,1.380939,20.271,8.352692,10.82774,471.3225,2.0
75%,71.0,31.241442,102.0,11.18925,2.857787,37.3783,11.81597,17.755207,700.085,2.0
max,89.0,38.578759,201.0,58.46,25.050342,90.28,38.04,82.1,1698.44,2.0


In [5]:
# Check the missing values
patients.isna().sum()

Age               0
BMI               0
Glucose           0
Insulin           0
HOMA              0
Leptin            0
Adiponectin       0
Resistin          0
MCP.1             0
Classification    0
dtype: int64

In [6]:
# clean up colum names
patients.columns = [s.strip() for s in patients.columns] 
patients.columns

Index(['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin',
       'Resistin', 'MCP.1', 'Classification'],
      dtype='object')

### Findings after exploration

After exploring the data, we see that there are no missing values. There are 116 rows and 10 columns, all of which are numerical. The categroical variable, classification, has already been encoded as a binary interger value by the data distributors so we do not need to perform this conversion, but we should change the values from 1 and 2 to 0 and 1, for clarity and to indicate the presence or absence of cancer.

In [7]:
# subtract 1 from all values in the classification column 
patients['Classification'] = patients['Classification'] - 1

Now, 0 = healthy patient and 1 = cancer patient

## Divide data into training and testing sections

In [8]:
# split the data into validation and training set
train_df, test_df = train_test_split(patients, test_size=0.2)

# to reduce repetition in later code, create variables to represent the columns
# that are our predictors and target
target = 'Classification'
predictors = list(patients.columns)
predictors.remove(target)

In [9]:
# observe training data to ensure the split was performed correctly
# 80% of 116 is about 92 rows
train_df

Unnamed: 0,Age,BMI,Glucose,Insulin,HOMA,Leptin,Adiponectin,Resistin,MCP.1,Classification
93,49,32.461911,134,24.887,8.225983,42.3914,10.793940,5.76800,656.393,1
33,43,34.422174,89,23.194,5.091856,31.2128,8.300955,6.71026,960.246,0
67,64,22.222222,98,5.700,1.377880,12.1905,4.783985,13.91245,395.976,1
48,69,29.400000,89,10.704,2.349885,45.2720,8.286300,4.53000,215.769,0
46,75,25.700000,94,8.079,1.873251,65.9260,3.741220,4.49685,206.802,0
...,...,...,...,...,...,...,...,...,...,...
9,75,23.000000,83,4.952,1.013839,17.1270,11.578990,7.09130,318.302,0
72,51,18.370000,105,6.030,1.561770,9.6200,12.760000,3.21000,513.660,1
12,25,22.860000,82,4.090,0.827271,20.4500,23.670000,5.14000,313.730,0
107,46,33.180000,92,5.750,1.304867,18.6900,9.160000,8.89000,209.190,1


## Finish preparing the data by standardizing the numeric variables

We want to standardize the variables because the values are in different scales. A change of 5 micro enzyme units per milliliter of insulin is not as significant as adding 5 to the age of the patient or to their body mass index. Therefore, we want to standardize the numeric variables so they are all on a similar scale. 

In [10]:
# create a standard scaler and fit it to the training set of predictors
scaler = preprocessing.StandardScaler()
cols_to_stdize = ['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin','Resistin', 'MCP.1']                
               
# Transform the predictors of training and validation sets
train_df[cols_to_stdize] = scaler.fit_transform(train_df[cols_to_stdize])

test_df[cols_to_stdize] = scaler.transform(test_df[cols_to_stdize])


In [11]:
# Observe changes to df
train_df.head(6)

Unnamed: 0,Age,BMI,Glucose,Insulin,HOMA,Leptin,Adiponectin,Resistin,MCP.1,Classification
93,-0.456885,0.899415,1.791729,1.440665,1.606306,0.770292,0.074376,-0.687001,0.292345,1
33,-0.817686,1.279792,-0.368993,1.275835,0.703938,0.214271,-0.299662,-0.615263,1.163703,0
67,0.445119,-1.087533,0.063152,-0.427379,-0.36538,-0.731894,-0.827335,-0.066936,-0.454451,1
48,0.745787,0.305271,-0.368993,0.05981,-0.085523,0.913572,-0.301861,-0.781254,-0.97123,0
46,1.106589,-0.412691,-0.128913,-0.19576,-0.222754,1.940897,-0.983788,-0.783778,-0.996945,0
92,-0.276484,0.577174,-0.465025,1.959107,1.104614,0.11783,-0.604598,0.719785,0.602842,1


Save the training and testing data to seperate files to prevent information leakage. 

In [12]:
# Save training and test data to seperate files
train_df.to_csv('cancer_training.csv', index=False)
test_df.to_csv('cancer_testing.csv', index=False)

In [13]:
# Split dataframes into predictors and classifiers
X_train = train_df.drop("Classification", axis=1)
y_train = train_df["Classification"]
X_test = test_df.drop("Classification", axis=1)
y_test = test_df["Classification"]

In [14]:
# Observe results before continuing
X_train

Unnamed: 0,Age,BMI,Glucose,Insulin,HOMA,Leptin,Adiponectin,Resistin,MCP.1
93,-0.456885,0.899415,1.791729,1.440665,1.606306,0.770292,0.074376,-0.687001,0.292345
33,-0.817686,1.279792,-0.368993,1.275835,0.703938,0.214271,-0.299662,-0.615263,1.163703
67,0.445119,-1.087533,0.063152,-0.427379,-0.365380,-0.731894,-0.827335,-0.066936,-0.454451
48,0.745787,0.305271,-0.368993,0.059810,-0.085523,0.913572,-0.301861,-0.781254,-0.971230
46,1.106589,-0.412691,-0.128913,-0.195760,-0.222754,1.940897,-0.983788,-0.783778,-0.996945
...,...,...,...,...,...,...,...,...,...
9,1.106589,-0.936610,-0.657089,-0.500204,-0.470193,-0.486354,0.192162,-0.586253,-0.677197
72,-0.336617,-1.835032,0.399264,-0.395251,-0.312435,-0.859750,0.369357,-0.881750,-0.116969
12,-1.900091,-0.963776,-0.705105,-0.584129,-0.523910,-0.321068,2.006253,-0.734812,-0.690308
107,-0.637286,1.038756,-0.224945,-0.422511,-0.386402,-0.408610,-0.170774,-0.449312,-0.990097


In [15]:
y_train

93     1
33     0
67     1
48     0
46     0
      ..
9      0
72     1
12     0
107    1
37     0
Name: Classification, Length: 92, dtype: int64

## Model the data

To evaluate the performance of the model, we will prioritize recall. Higher recall values mean that there are fewer false negatives which, in this case, mean fewer cases where a patient is diagnosed as healthy when they actually have breast cancer. The cost of not identifying someone with cancer is higher than the cost of diagnosing someone with cancer when they are not actually sick, so we want to optimize recall. 

We will be modeling using a basic Logistic regression, Stochastic Gradient Descent CLassifier, and decision tree model and various ensemble packages including random forest, gradient boost, ADA boost, and XGBoost.

To optimize recall without sacrificing accuracy, we will optimize accuracy in the RandomSearchCV paramter searching and optimize recall in the GridSearchCV. This way, the model will identify a model which can predict the presence or absence of cancer well first, then narrow that search to find a model with good recall as well. The purpose of this mix is to reduce the possibility of picking a model with determines that all the patients have cancer rather than risk misdiagnosing a patient who does have cancer. 

In [16]:
# First, import some additional packages necessary for modeling

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier 
#GridsearchCV will allow program to search different parameters
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

In [17]:
# Save the results of the models in a table for comparison
results = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

## Model 1: Logistic Regression

In [18]:
log_reg_model = LogisticRegression()
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [19]:
# Save results to dataframe
y_pred = log_reg_model.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Basic Logistic Regression", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194


In [20]:
# Set score measure to recall to optimize it
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'penalty': ['l1', 'l2'],
    'max_iter': np.arange(5,500000),
    'solver': ['lbfgs', 'liblinear', 'saga']
}

log_reg = LogisticRegression()
rand_search = RandomizedSearchCV(estimator = log_reg, param_distributions=param_grid, cv=kfolds, n_iter=100,
                           scoring=score_measure, n_jobs=-1, error_score=0,
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

The best accuracy score is 0.7707602339181286
... with parameters: {'solver': 'saga', 'penalty': 'l2', 'max_iter': 225584}


80 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to 0.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
80 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/alexharde/miniconda3/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/alexharde/miniconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Users/alexharde/miniconda3/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l

Note: a warning was raised because the randomsearchcv tried to use a combination of an l1 penalty and lbfgs solver, which are not compatible. Ignore the warning and use the parameters it recommends in a gridsearchcv to find the optimal parameters for a Logistic regression model. 

In [21]:
model_preds = rand_search.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
results = pd.concat([results, pd.DataFrame({'model':"Random Search Logistic Regression", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194


In [22]:
# GridSearchCV to narrow optimal parameters
score_measure = "recall"
kfolds = 5

penalty = rand_search.best_params_['penalty']
max_iter = rand_search.best_params_['max_iter']
solver = rand_search.best_params_['solver']

param_grid = {
    'penalty': [penalty],
    'max_iter': np.arange(max_iter-5,max_iter+5),
    'solver': [solver]
}

log_reg = LogisticRegression()
grid_search = GridSearchCV(estimator = log_reg, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
The best recall score is 0.6933333333333332
... with parameters: {'max_iter': 225579, 'penalty': 'l2', 'solver': 'saga'}


In [23]:
model_preds = grid_search.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
results = pd.concat([results, pd.DataFrame({'model':"Grid Search Logistic Regression", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194


## Model 2: SGDClassifier

In [24]:
# Import the necessary modules
from sklearn.linear_model import SGDClassifier

In [25]:
# Model without using optimal parameters
sgd_model = SGDClassifier()
_ = sgd_model.fit(X_train, y_train)

In [26]:
model_preds = sgd_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
results = pd.concat([results, pd.DataFrame({'model':"Basic SGDClassifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677


In [27]:
score_measure = "accuracy"
kfolds = 4

param_grid = {
    'loss':['hinge', 'modified_huber', 'log_loss'],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'alpha': [0.0001, 0.001, 0.01, 0.1,.5,1],
    'max_iter': np.arange(1000,100000),
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'eta0': np.arange(0.001,5)
}

sgd_model = SGDClassifier()
rand_search = RandomizedSearchCV(
    estimator = sgd_model,                     
    param_distributions=param_grid,     
    cv=kfolds,                      
    n_iter=200,                     
    scoring=score_measure,          
    verbose=0,                      
    n_jobs=-1,                       
    random_state=1                  
)

rand_search.fit(X_train, y_train)

bestMLPClassifier = rand_search.best_estimator_

print(rand_search.best_params_)

{'penalty': 'elasticnet', 'max_iter': 20057, 'loss': 'modified_huber', 'learning_rate': 'invscaling', 'eta0': 2.001, 'alpha': 0.001}


In [28]:
model_preds = rand_search.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
results = pd.concat([results, pd.DataFrame({'model':"Random Search SGDClassifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706


In [29]:
score_measure = "recall"
kfolds = 4

loss = rand_search.best_params_['loss']
penalty = rand_search.best_params_['penalty']
alpha = rand_search.best_params_['alpha']
learning_rate = rand_search.best_params_['learning_rate']
eta0 = rand_search.best_params_['eta0']
max_iter = rand_search.best_params_['max_iter']

param_grid = {
    'loss': [loss],
    'penalty': [penalty],
    'eta0': np.arange(eta0-.05,eta0+.05),
    'alpha': [alpha, alpha+0.1, alpha+0.2, alpha+0.4, alpha+0.6, alpha+0.8],
    'learning_rate': [learning_rate],
    'max_iter': np.arange(max_iter-10,max_iter+10)
}

sgd_model = SGDClassifier()
grid_search = GridSearchCV(
    estimator = sgd_model,        
    param_grid=param_grid,  
    cv=kfolds,              
    scoring=score_measure,  
    verbose=0,              
    n_jobs=-1,              
)
grid_search.fit(X_train, y_train)

bestMLPClassifier = grid_search.best_estimator_

print(grid_search.best_params_)

{'alpha': 0.801, 'eta0': 1.9509999999999998, 'learning_rate': 'invscaling', 'loss': 'modified_huber', 'max_iter': 20056, 'penalty': 'elasticnet'}


In [30]:
model_preds = grid_search.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
results = pd.concat([results, pd.DataFrame({'model':"Grid Search SGDClassifier", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667


## Model 3: Decision Tree

In [31]:
dtree = DecisionTreeClassifier()
_ = dtree.fit(X_train, np.ravel(y_train))

In [32]:
# Save results to dataframe
y_pred = dtree.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Basic Decision Tree", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75


In [33]:
# Set score measure to recall to optimize it
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(2,200),  
    'min_samples_leaf': np.arange(1,100),
    'min_impurity_decrease': np.arange(0.0001, 0.001, 0.00005),
    'max_leaf_nodes': np.arange(10, 200), 
    'max_depth': np.arange(3,50), 
    'criterion': ['entropy', 'gini']
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=1000,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 1000 candidates, totalling 5000 fits
The best accuracy score is 0.695906432748538
... with parameters: {'min_samples_split': 37, 'min_samples_leaf': 26, 'min_impurity_decrease': 0.0007500000000000001, 'max_leaf_nodes': 94, 'max_depth': 7, 'criterion': 'entropy'}


In [34]:
# Save results to dataframe
y_pred = rand_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Random Search Decision Tree", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667


Conduct an exhaustive search across a smaller range of parameters around the parameters found in the initial random search.

In [35]:
score_measure = "recall"
kfolds = 5

min_samples_split = rand_search.best_params_['min_samples_split']
min_samples_leaf = rand_search.best_params_['min_samples_leaf']
min_impurity_decrease = rand_search.best_params_['min_impurity_decrease']
max_leaf_nodes = rand_search.best_params_['max_leaf_nodes']
max_depth = rand_search.best_params_['max_depth']
criterion = rand_search.best_params_['criterion']

param_grid = {
    'min_samples_split': np.arange(min_samples_split-2,min_samples_split+2),  
    'min_samples_leaf': np.arange(min_samples_leaf-1,min_samples_leaf+2),
    'min_impurity_decrease': np.arange(min_impurity_decrease-0.0001, min_impurity_decrease+0.0001, 0.00005),
    'max_leaf_nodes': np.arange(max_leaf_nodes-2,max_leaf_nodes+2), 
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'criterion': [criterion]
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 960 candidates, totalling 4800 fits
The best recall score is 0.6088888888888888
... with parameters: {'criterion': 'entropy', 'max_depth': 5, 'max_leaf_nodes': 92, 'min_impurity_decrease': 0.0006500000000000001, 'min_samples_leaf': 26, 'min_samples_split': 35}


In [36]:
# Save results to dataframe
y_pred = grid_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Grid Search Decision Tree", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667


## Model 4: Random Forest

In [37]:
rforest = RandomForestClassifier()
_ = rforest.fit(X_train, np.ravel(y_train))

In [38]:
# Save results to dataframe
y_pred = rforest.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Basic Random Forest", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


In [39]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'n_estimators': np.arange(2,200),
    'max_depth': np.arange(2,20),
    'min_samples_split': np.arange(10, 200),
    'min_samples_leaf': np.arange(1, 200),
    'criterion': ['entropy', 'gini'],
}

rforest = RandomForestClassifier()
rand_search = RandomizedSearchCV(estimator = rforest, param_distributions=param_grid, cv=kfolds, n_iter=200,
                           scoring=score_measure, verbose=1, n_jobs=-1,  
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 200 candidates, totalling 1000 fits
The best accuracy score is 0.6859649122807017
... with parameters: {'n_estimators': 169, 'min_samples_split': 26, 'min_samples_leaf': 20, 'max_depth': 12, 'criterion': 'gini'}


In [40]:
# Save results to dataframe
y_pred = rand_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Random Search Random Forest", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


Conduct an exhaustive search across a smaller range of parameters around the parameters found in the initial random search.

In [41]:
score_measure = "recall"
kfolds = 5

min_samples_split = rand_search.best_params_['min_samples_split']
min_samples_leaf = rand_search.best_params_['min_samples_leaf']
n_estimators = rand_search.best_params_['n_estimators']
max_depth = rand_search.best_params_['max_depth']
criterion = rand_search.best_params_['criterion']

param_grid = {
    'min_samples_split': np.arange(min_samples_split-2,min_samples_split+2),  
    'min_samples_leaf': np.arange(min_samples_leaf-2,min_samples_leaf+2),
    'n_estimators': np.arange(n_estimators-2,n_estimators+2), 
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'criterion': [criterion]
}

rforest = RandomForestClassifier()
grid_search = GridSearchCV(estimator = rforest, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 256 candidates, totalling 1280 fits
The best recall score is 0.7377777777777778
... with parameters: {'criterion': 'gini', 'max_depth': 11, 'min_samples_leaf': 20, 'min_samples_split': 27, 'n_estimators': 167}


In [42]:
# Save results to dataframe
y_pred = grid_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Grid Search Random Forest", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


## Model 5: ADA Boost

In [43]:
aboost = AdaBoostClassifier()
_ = aboost.fit(X_train, np.ravel(y_train))

In [44]:
# Save results to dataframe
y_pred = aboost.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Basic ADA Boost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


In [45]:
# reduced number of iterations to reduce computing time
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'n_estimators': np.arange(2,200),
    'learning_rate': np.arange(.001,2),
    'algorithm':['SAMME', 'SAMME.R']
}

aboost = AdaBoostClassifier()
rand_search = RandomizedSearchCV(estimator = aboost, param_distributions=param_grid, cv=kfolds, n_iter=200,
                           scoring=score_measure, verbose=1, n_jobs=-1,  
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 200 candidates, totalling 1000 fits
The best accuracy score is 0.8362573099415205
... with parameters: {'n_estimators': 27, 'learning_rate': 1.001, 'algorithm': 'SAMME.R'}


In [46]:
# Save results to dataframe
y_pred = rand_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Random Search ADA Boost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


Conduct an exhaustive search across a smaller range of parameters around the parameters found in the initial random search.

In [47]:
score_measure = "recall"
kfolds = 5

n_estimators = rand_search.best_params_['n_estimators']
learning_rate = rand_search.best_params_['learning_rate']
algorithm = rand_search.best_params_['algorithm']

param_grid = {
    'n_estimators': np.arange(n_estimators-2,n_estimators+2), 
    'learning_rate': np.arange(learning_rate-.005,learning_rate+.005),
    'algorithm': [algorithm]
}

aboost = AdaBoostClassifier()
grid_search = GridSearchCV(estimator = aboost, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 4 candidates, totalling 20 fits
The best recall score is 0.7377777777777778
... with parameters: {'algorithm': 'SAMME.R', 'learning_rate': 0.9959999999999999, 'n_estimators': 28}


In [48]:
# Save results to dataframe
y_pred = grid_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Grid Search ADA Boost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


## Model 6: Gradient Boost

In [49]:
gboost = GradientBoostingClassifier()
_ = gboost.fit(X_train, np.ravel(y_train))

In [50]:
# Save results to dataframe
y_pred = gboost.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Basic Gradient Boost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


In [51]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'n_estimators': np.arange(2,200),
    'learning_rate': np.arange(.001,2),
    'max_depth': np.arange(2,20),
    'min_samples_split': np.arange(10,200),
    'min_samples_leaf': np.arange(1,200),
    'max_features': ['sqrt', 'log2'],
    'criterion': ['friedman_mse', 'squared_error']
}

gboost = GradientBoostingClassifier()
rand_search = RandomizedSearchCV(estimator = gboost, param_distributions=param_grid, cv=kfolds, n_iter=200,
                           scoring=score_measure, verbose=1, n_jobs=-1,  
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 200 candidates, totalling 1000 fits
The best accuracy score is 0.7192982456140351
... with parameters: {'n_estimators': 11, 'min_samples_split': 24, 'min_samples_leaf': 12, 'max_features': 'log2', 'max_depth': 4, 'learning_rate': 1.001, 'criterion': 'squared_error'}


In [52]:
# Save results to dataframe
y_pred = rand_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Random Search Gradient Boost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


Conduct an exhaustive search across a smaller range of parameters around the parameters found in the initial random search.

In [53]:
score_measure = "recall"
kfolds = 5

n_estimators = rand_search.best_params_['n_estimators']
learning_rate = rand_search.best_params_['learning_rate']
max_depth = rand_search.best_params_['max_depth']
min_samples_split = rand_search.best_params_['min_samples_split']
min_samples_leaf = rand_search.best_params_['min_samples_leaf']
max_features = rand_search.best_params_['max_features']
criterion = rand_search.best_params_['criterion']

param_grid = {
    'learning_rate': np.arange(learning_rate,learning_rate+.005),
    'n_estimators': np.arange(n_estimators-2,n_estimators+2), 
    'min_samples_split': np.arange(min_samples_split-2,min_samples_split+2),  
    'min_samples_leaf': np.arange(min_samples_leaf-2,min_samples_leaf+2),
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'max_features': [max_features],
    'criterion': [criterion]
}

gboost = GradientBoostingClassifier()
grid_search = GridSearchCV(estimator = gboost, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

gboost_grid_search = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {gboost_grid_search.best_score_}")
print(f"... with parameters: {gboost_grid_search.best_params_}")

Fitting 5 folds for each of 256 candidates, totalling 1280 fits
The best recall score is 0.8488888888888889
... with parameters: {'criterion': 'squared_error', 'learning_rate': 1.001, 'max_depth': 3, 'max_features': 'log2', 'min_samples_leaf': 12, 'min_samples_split': 25, 'n_estimators': 11}


In [54]:
# Save results to dataframe
y_pred_gboost = gboost_grid_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Grid Search Gradient Boost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


## Model 7: XGBoost

In [55]:
xgboost = XGBClassifier()
_ = xgboost.fit(X_train, np.ravel(y_train))

In [56]:
# Save results to dataframe
y_pred = xgboost.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Basic XGBoost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


In [57]:
score_measure = "accuracy"
kfolds = 5

param_grid = {
    'n_estimators': np.arange(2,200),
    'learning_rate': np.arange(.001,2),
    'max_depth': np.arange(2,20),
    'subsample': np.arange(.1,1.0),
    'colsample_bytree': np.arange(.1,1.0)
}

xgboost = XGBClassifier()
rand_search = RandomizedSearchCV(estimator = xgboost, param_distributions=param_grid, cv=kfolds, n_iter=200,
                           scoring=score_measure, verbose=1, n_jobs=-1,  
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

Fitting 5 folds for each of 200 candidates, totalling 1000 fits
The best accuracy score is 0.7181286549707602
... with parameters: {'subsample': 0.1, 'n_estimators': 120, 'max_depth': 18, 'learning_rate': 1.001, 'colsample_bytree': 0.1}


In [58]:
# Save results to dataframe
y_pred = rand_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Random Search XG Boost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


Conduct an exhaustive search across a smaller range of parameters around the parameters found in the initial random search.

In [59]:
score_measure = "recall"
kfolds = 5

n_estimators = rand_search.best_params_['n_estimators']
learning_rate = rand_search.best_params_['learning_rate']
max_depth = rand_search.best_params_['max_depth']
subsample = rand_search.best_params_['subsample']
colsample_bytree = rand_search.best_params_['colsample_bytree']


param_grid = {
    'learning_rate': np.arange(learning_rate-.005,learning_rate+.005),
    'n_estimators': np.arange(n_estimators-2,n_estimators+2), 
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'subsample': np.arange(subsample-.05,subsample+.05),
    'colsample_bytree': np.arange(colsample_bytree-.05,colsample_bytree+.05)
}

xgboost = XGBClassifier()
grid_search = GridSearchCV(estimator = xgboost, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1, 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

Fitting 5 folds for each of 16 candidates, totalling 80 fits
The best recall score is 0.6
... with parameters: {'colsample_bytree': 0.05, 'learning_rate': 0.9959999999999999, 'max_depth': 16, 'n_estimators': 118, 'subsample': 0.05}


In [60]:
# Save results to dataframe
y_pred = grid_search.predict(X_test)
results = pd.concat([results, pd.DataFrame({'model':"Grid Search XG Boost", 
                                                    'Accuracy': [accuracy_score(y_test, y_pred)], 
                                                    'Precision': [precision_score(y_test, y_pred)], 
                                                    'Recall': [recall_score(y_test, y_pred)], 
                                                    'F1': [f1_score(y_test, y_pred)]
                                                     }, index=[0])])
results

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Basic Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Basic SGDClassifier,0.625,0.846154,0.611111,0.709677
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search SGDClassifier,0.583333,0.833333,0.555556,0.666667
0,Basic Decision Tree,0.666667,0.857143,0.666667,0.75
0,Random Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Grid Search Decision Tree,0.583333,0.833333,0.555556,0.666667
0,Basic Random Forest,0.666667,0.857143,0.666667,0.75


The grid search for random forest and xgboost may produce values of 0, this is because the generated model has no positive results, whether true or false. These models are not a good fit to the data. See below for a confusion matrix if the grid search XG Boost model has a warning for performance metrics of 0. 

In [61]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
c_matrix

array([[ 6,  0],
       [18,  0]])

# Conclusion

In [62]:
results_sorted = results.sort_values(by='Recall', ascending=False)
results_sorted

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,Grid Search Gradient Boost,0.791667,0.882353,0.833333,0.857143
0,Random Search Gradient Boost,0.791667,0.882353,0.833333,0.857143
0,Basic XGBoost,0.75,0.875,0.777778,0.823529
0,Basic Gradient Boost,0.708333,0.823529,0.777778,0.8
0,Random Search Random Forest,0.666667,0.8125,0.722222,0.764706
0,Random Search SGDClassifier,0.666667,0.8125,0.722222,0.764706
0,Grid Search ADA Boost,0.666667,0.8125,0.722222,0.764706
0,Random Search ADA Boost,0.666667,0.8125,0.722222,0.764706
0,Random Search Logistic Regression,0.708333,0.923077,0.666667,0.774194
0,Grid Search Random Forest,0.625,0.8,0.666667,0.727273


The goal of this model creation was to be able to preemptively determine if a patient has breast cancer based on their age, body mass index, hormone and enzyme levels. By identifying high-risk patients in this way, the hospital can run further testing and confirm a cancer diagnosis early before the illness metastasizes. Recall was a prioritized metric in this study because a false positive, testing a healthy patient for cancer when they didn't have it, is less damaging than missing a cancer diagnosis.  

After running the models, the gradient boost model with gridsearch parameter tuning had the greatest recall of them all with a score of .833. This model is immediately followed up by the same model with randomsearchcv parameter tuning. Because we optimized accuracy and then recall, we do not run into the situation where the model deliberately predicts that all the patients have cancer, even if they do not, to avoid missing predicting a false negative result. I would choose the grid search gradient boost model as the best model for this data set because it has an accuracy score of .792, a precision score of .882, a recall of .833, and an F1 score of .857. In the context of this situation, these scores reflect that the model is capable of picking the correct predictions (the patient has cancer or the patient does not have cancer) 79% of the time. When the model predicts that a patient has cancer, this decision is correct 88% of the time. If a pateint has cancer, the model picks the correct prediction 83% of the time, and the high f1 score (.86) indicates that the gradient boost classifier model is a good model for making predictions. 

The confusion matrix below shows us the predictions of the grid search gradient descent model. We can see that the model was correct in predicting healthy patients 4 times, incorrectly predicted a healthy pateint when the patient actually had cancer 4 times, predicted that a patient had cancer when they did not 2 times, and predicted the presence of breast cancer when the patient actually did have cancer 14 times. 

In [64]:
# Confusion matrix for grid search gradient boost
c_matrix = confusion_matrix(y_test, y_pred_gboost)
c_matrix

array([[ 4,  2],
       [ 4, 14]])

I believe the goal of using age, body mass index, hormone and enzyme levels to identify patients with breat cancer was successful. Doctors could use the Gradient Boost model with the grid search tuned parameters to identify possible breat cancer with reasonable accuracy, precision, recall, and F1. 