# White Blood Cell Image Classification
### By [Anthony Medina](https://www.linkedin.com/in/anthony-medina-math/)

# Modeling Notebook
1. Notebook Objectives
2. Imports
3. Data Split
4. Model 1 Neural Network
5. Model 2 Random Forest
6. Model 3 Gradient Boosting Machine
7. Model results analysis
8. Model Choice
9. Next Steps

### 1. Notebook Objectives

This notebook builds three candidate models, evaluates each one on a held-out test set, and selects the model with the highest macro-averaged recall.

### 2. Imports

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, recall_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
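# Legacy scikit-learn wrapper; it is deprecated, and newer TensorFlow releases may not include this module
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier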

2023-08-27 19:54:36.220009: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [2]:
df = pd.read_csv('../cleaned_data/cleaned_data.csv')
df.head()

|   | cell_key | image_array |
|---|----------|-------------|
| 0 | 0 | [0.01176471 0. 0. ... 0. ... |
| 1 | 0 | [0. 0. 0. ... 0.760784... |
| 2 | 0 | [0. 0. 0. ... 0. 0. 0.] |
| 3 | 0 | [0. 0. 0. ... 0.764705... |
| 4 | 0 | [0. 0. 0. ... 0. 0. 0.] |


In [3]:
df.dtypes

cell_key        int64
image_array    object
dtype: object

In [4]:
# Strip the surrounding brackets and parse the space-separated pixel values into a float array
df['image_array'] = df['image_array'].apply(lambda x: np.fromstring(x[1:-1], sep=' '))



In [5]:
df.dtypes

cell_key        int64
image_array    object
dtype: object
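
The `image_array` column still reports `object` after the conversion because each cell now holds a NumPy array rather than a scalar. The sketch below illustrates the parsing step on a made-up string. The sample value and the helper name `parse_pixel_string` are illustrative assumptions, not taken from the dataset, and it uses `str.split` as one possible alternative to `np.fromstring`.

```python
import numpy as np

def parse_pixel_string(text):
    """Convert a string such as '[0. 0.5 1.]' into a 1-D float array."""
    # Drop the surrounding brackets, split on whitespace, convert to float.
    return np.array(text.strip()[1:-1].split(), dtype=float)

sample = "[0.01176471 0. 0.7607843 1.]"  # hypothetical value, not real data
pixels = parse_pixel_string(sample)

# Each parsed cell is a float64 array; a pandas column of arrays has dtype object.
print(pixels.dtype, pixels.shape)
```

Checking `pixels.shape` on a few real rows is a cheap way to confirm that every image was parsed to the expected length before stacking the arrays into a feature matrix.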

### 3. Data Split

In [6]:
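# Stack the per-image arrays into a 2-D feature matrix and extract the labels
X = np.array(df['image_array'].tolist())
y = np.array(df['cell_key'])

# Collect each model's test recall for the comparison in section 7
recall_list = []

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)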
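
The split above does not pass `stratify`, so class proportions in the test set can drift from those in the full data. If the classes turn out to be imbalanced (an assumption; the class distribution is not shown in this notebook), a stratified split is one option. A minimal sketch on toy data, where `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 samples, 4 features, imbalanced labels.
rng = np.random.default_rng(42)
X_demo = rng.random((100, 4))
y_demo = np.array([0] * 70 + [1] * 20 + [2] * 10)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Class counts in the 20-sample test set mirror the 70/20/10 proportions.
print(np.bincount(y_te))
```

This matters most for macro-averaged recall, where a class with very few test samples contributes as much to the score as a large one.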

### 4. Model 1 Neural Network

In [7]:
# Define constants used by create_model
num_features = X_train.shape[1]
num_classes = np.max(y_train) + 1  # Assuming class labels start from 0

def create_model():
    """Build a small fully connected network with a softmax output over the classes."""
    model = Sequential([
        Dense(64, activation='relu', input_shape=(num_features,)),
        Dense(32, activation='relu'),
        Dense(num_classes, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Create a KerasClassifier based on the create_model function
model = KerasClassifier(build_fn=create_model, epochs=10, batch_size=32, verbose=0)

# Define the parameter grid for grid search
param_grid = {
    'epochs': [10, 20, 50],
    'batch_size': [32, 64],
}

# Define the scoring function (recall)
scoring = make_scorer(recall_score, average='macro')

# Initialize GridSearchCV
grid_search = GridSearchCV(model, param_grid, scoring=scoring, cv=3)

# Fit the grid search to your training data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best estimator from grid search
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate the best model on your test data
y_pred = best_model.predict(X_test)
recall = recall_score(y_test, y_pred, average='macro')

print("Best Parameters:", best_params)
print("Test Recall:", recall)
recall_list.append(recall)



Best Parameters: {'batch_size': 32, 'epochs': 50}
Test Recall: 0.25190056899622115
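
To read a macro-averaged recall, it helps to compare it with a trivial baseline. The sketch below computes macro recall by hand and shows that a model that always predicts one class scores exactly 1/K for K classes. The labels and the choice of K = 4 are hypothetical, and `macro_recall` is an illustrative helper rather than part of the notebook.

```python
import numpy as np

def macro_recall(y_true, y_pred):
    """Average of per-class recall, with every class weighted equally."""
    classes = np.unique(y_true)
    # Recall for class c: fraction of true-c samples that were predicted as c.
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(recalls))

# Hypothetical labels for K = 4 classes (the real class count is not shown here).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
always_zero = np.zeros_like(y_true)  # constant predictor

print(macro_recall(y_true, always_zero))  # 0.25, i.e. 1 / K
```

Comparing each model's test recall with 1/K for the actual number of cell types indicates how much the model has learned beyond a constant guess.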


### 5. Model 2 Random Forest

In [8]:
# Define the Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Define the scoring function (recall)
scoring = make_scorer(recall_score, average='macro')

# Initialize GridSearchCV
grid_search = GridSearchCV(rf_classifier, param_grid, scoring=scoring, cv=3)

# Fit the grid search to your training data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best estimator from grid search
best_params = grid_search.best_params_
best_rf = grid_search.best_estimator_

# Evaluate the best model on your test data
y_pred = best_rf.predict(X_test)
recall = recall_score(y_test, y_pred, average='macro')

print("Best Parameters:", best_params)
print("Test Recall:", recall)
recall_list.append(recall)

Best Parameters: {'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 2, 'n_estimators': 100}
Test Recall: 0.2614910008561814


### 6. Model 3 Gradient Boosting Machine

In [9]:
# Define the Gradient Boosting classifier
gbm_classifier = GradientBoostingClassifier(random_state=42)

# Define the parameter grid for grid search
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Define the scoring function (recall)
scoring = make_scorer(recall_score, average='macro')

# Initialize GridSearchCV
grid_search = GridSearchCV(gbm_classifier, param_grid, scoring=scoring, cv=3)

# Fit the grid search to your training data
grid_search.fit(X_train, y_train)

# Get the best parameters and the best estimator from grid search
best_params = grid_search.best_params_
best_gbm = grid_search.best_estimator_

# Evaluate the best model on your test data
y_pred = best_gbm.predict(X_test)
recall = recall_score(y_test, y_pred, average='macro')

print("Best Parameters:", best_params)
print("Test Recall:", recall)
recall_list.append(recall)

Best Parameters: {'learning_rate': 0.2, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Test Recall: 0.2552132268247596
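
The two tree-based grid searches differ a lot in cost. `GridSearchCV` fits one model per parameter combination per fold, so the number of fits is the product of the grid sizes times `cv`. A small sketch that counts the fits for the grids used above (the helper name `n_fits` is illustrative, and the count excludes the final refit on the full training set):

```python
from math import prod

rf_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
gbm_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

def n_fits(grid, cv):
    """Number of model fits: parameter combinations times CV folds."""
    return prod(len(values) for values in grid.values()) * cv

print(n_fits(rf_grid, cv=3))   # 243
print(n_fits(gbm_grid, cv=3))  # 729
```

If run time becomes a problem, passing `n_jobs=-1` to `GridSearchCV` or switching to `RandomizedSearchCV` are two common ways to reduce it.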


### 7. Model results analysis

In [10]:
print('All Model Scores', recall_list)

All Model Scores [0.25190056899622115, 0.2614910008561814, 0.2552132268247596]


### 8. Model Choice

In [13]:
max_index = recall_list.index(max(recall_list))
print(f'The model with the best score was model number {max_index + 1}.')

The model with the best score was model number 2.
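
Selecting by list position depends on remembering the order in which the models were appended. A minimal sketch that pairs each score with a name instead; the `model_names` list is an assumption that follows the section order of this notebook, and the scores are copied from the output in section 7:

```python
model_names = ["Neural Network", "Random Forest", "Gradient Boosting Machine"]
recall_scores = [0.25190056899622115, 0.2614910008561814, 0.2552132268247596]

# Pair each name with its score and keep the pair with the highest recall.
best_name, best_recall = max(zip(model_names, recall_scores), key=lambda pair: pair[1])

print(f"{best_name}: {best_recall:.4f}")  # Random Forest: 0.2615
```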


## The best model was the Random Forest, with a macro-averaged test recall of about 0.261.

### 9. Next Steps


Build a final model using the best hyperparameters found for the Random Forest: `max_depth=10`, `min_samples_leaf=4`, `min_samples_split=2`, and `n_estimators=100`.
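
A minimal sketch of that final step, assuming the best parameters printed in section 5. The synthetic `X_demo` and `y_demo` arrays are hypothetical stand-ins so the example runs on its own; in the notebook they would be replaced by `X_train` and `y_train`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Best parameters reported by the Random Forest grid search in section 5.
best_rf_params = {
    "max_depth": 10,
    "min_samples_leaf": 4,
    "min_samples_split": 2,
    "n_estimators": 100,
}

# Synthetic stand-in data (assumption): 200 samples, 16 features, 4 classes.
rng = np.random.default_rng(42)
X_demo = rng.random((200, 16))
y_demo = rng.integers(0, 4, size=200)

# Fit the final model with the chosen hyperparameters.
final_rf = RandomForestClassifier(random_state=42, **best_rf_params)
final_rf.fit(X_demo, y_demo)

print(final_rf.predict(X_demo[:5]).shape)
```

Since `GridSearchCV` refits the best estimator on the full training set by default, `best_rf` from section 5 is already such a model; a separate refit is mainly useful when training on additional data or saving the model from a clean script.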