# Model 1: Toxicity Classification using Random Forest

Code to train a simple Random Forest model to predict toxicity

- use k-fold crossvalidation with grid-search to find best performing model on the training data
    - k==5
    - scoring metric = recall
- validate final model against a "hold-out" dataset
- fit final model using the full dataset and best hyper-parameters
- save trained model as pickle file (in case it is selected as the final model)

_**note on scoring metric:**_ choosing 'recall' as performance metric due to the potentail harm that can be caused be mis-labelling a toxic drug as harmless. In a real research scenario an in-silico model such as this would only be used to screen out obviously toxic drugs from a large list of candidates at an early stage and there must be more robust experimentally reliable tests performed downstream before the drug gets anywhere near a human being. 

---

## Imports and Constants



In [1]:
import pandas as pd
import numpy as np
import pickle
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

#string constants - centralise to avoid later 'finger trouble'
data_folder = "../data/"                     #relative folder path to the notebooks folder
data  = f"{data_folder}final-data.csv"       # in the real world datasets would be registered in a system like W&B to protect them from modification
model_folder = "../models/"

class_column = "Toxic"

---

## Data Prep

- Split dataset into input variables (X) and response variables (y)
- Create a hold-out set
    - use stratification to ensure proportion of toxic vs non-toxic is retained in the hold-out set

In [2]:
#read csv
toxicity_data = pd.read_csv(data)

#seperate in to X, y
X = toxicity_data.drop(class_column, axis=1)
y = toxicity_data[class_column]

#create a hold-out dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2)


---

## Model training and hyper-parameter tuning


In [None]:

# Initialize the Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=666)

# Define the hyperparameters to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 15, 25],
    'min_samples_split': [5, 8, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt']
}

# Create a k-fold cross-validation splitter
k_fold = KFold(n_splits=5, shuffle=True, random_state=666)

# Perform hyperparameter search with k-fold cross-validation
grid_search = GridSearchCV(rf_classifier, param_grid, cv=k_fold, scoring='recall')
grid_search.fit(X_train, y_train)



In [4]:
# Display the best hyperparameters and the corresponding mean cross-validated score
print("Best hyperparameters:", grid_search.best_params_)

Best hyperparameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}


In [5]:
# test set validation

#define classifier
final_rf_classifier = RandomForestClassifier(random_state=666, 
                                            n_estimators        =grid_search.best_params_["n_estimators"],
                                            max_depth           =grid_search.best_params_["max_depth"],
                                            min_samples_split   =grid_search.best_params_["min_samples_split"],
                                            min_samples_leaf    =grid_search.best_params_["min_samples_leaf"],
                                            max_features        =grid_search.best_params_["max_features"])
#fit model with best hyper-parameters
final_rf_classifier.fit(X,y)

#make predictions
y_test_pred = final_rf_classifier.predict(X_test)

#compare with actuals
original_toxic = np.count_nonzero(y_test)
correctly_predicted_toxic = np.count_nonzero(y_test & y_test_pred)

recall = correctly_predicted_toxic/original_toxic
accuracy = np.count_nonzero(y_test==y_test_pred) / len(y_test_pred)

print("'Hold out' test set results")
print(f"\tRecall:\t\t{recall:.2f}")
print(f"\tAccuracy:\t{accuracy:.2f}")

# Full data set ------------------------
#make predictions
y_pred = final_rf_classifier.predict(X)

#compare with actuals
original_toxic = np.count_nonzero(y)
correctly_predicted_toxic = np.count_nonzero(y & y_pred)

recall = correctly_predicted_toxic/original_toxic
accuracy = np.count_nonzero(y==y_pred) / len(y_pred)

print("Full dataset results")
print(f"\tRecall:\t\t{recall:.2f}")
print(f"\tAccuracy:\t{accuracy:.2f}")

with open(f'{model_folder}model1-rf-trained.pkl', 'wb') as f:
        pickle.dump(final_rf_classifier, f)

'Hold out' test set results
	Recall:		0.91
	Accuracy:	0.97
Full dataset results
	Recall:		0.95
	Accuracy:	0.98
