<a id="top"></a>
# Random Forest Classification
## Contents

* <a href="#Dependencies">Dependencies</a>
* <a href="#Load">Loading the Data</a>
* <a href="#ModelPrep">Model Preparation</a>
    <!-- * <a href="#FeatSel">Feature Selection</a>
    * <a href="#Scale">Scaling</a>
    * <a href="#TTSplit">Train/Test Split</a>
    * <a href="#Tune">Hyperparameter Tuning</a>
    * <a href="#Train">Training</a>
    * <a href="#Eval">Evaluate Models</a>
    * <a href="#Best">Choose Best Model</a> -->
* <a href="#Exp^2">Explain Features & Export Model</a>
* <a href="#Other">Other</a>
* <a href="#Cite">Citations</a>

----
<a id="Dependencies"></a>
<a href="#top">Back to Top</a>
## Dependencies

In [1]:
# Loading the Data
import pandas as pd
# Model Preparation
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np
import os
# Export & Explain
import shap
import joblib



----
<a id="Load"></a>
<a href="#top">Back to Top</a>
## Loading the Data
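Make sure the data is clean and preprocessed before modeling: handle missing values, encode categorical variables where necessary, and address outliers. The commented-out cell below loads a previously cleaned CSV file; the Iris dataset stands in for it here.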

In [2]:
# # file path
# file_path = "" # previously cleaned

# # Load the dataset
# df = pd.read_csv(file_path)

# # Set the maximum number of columns to display to None
# pd.set_option('display.max_columns', None)
# df.head()

In [3]:
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
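
The cleaning checks mentioned at the start of this section can be sketched on the Iris data itself. This is a minimal illustration rather than part of the original workflow: the names `df_check`, `missing_per_column`, `class_counts`, and `outlier_mask` are hypothetical, and the 1.5 × IQR rule is only one of several common ways to flag outliers.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Wrap the Iris arrays in a DataFrame so pandas checks can be applied
iris_data = load_iris()
df_check = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
df_check["target"] = iris_data.target

# Count missing values per column
missing_per_column = df_check.isna().sum()

# Count samples per class to check the class balance
class_counts = df_check["target"].value_counts().sort_index()

# Flag potential outliers per feature with the 1.5 * IQR rule (an assumed choice)
features = df_check.drop(columns="target")
q1, q3 = features.quantile(0.25), features.quantile(0.75)
iqr = q3 - q1
outlier_mask = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)

print(missing_per_column, class_counts, outlier_mask.sum(), sep="\n\n")
```

On a real dataset, the same checks would run on the DataFrame returned by `pd.read_csv` before any model is fitted.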

----
<a id="ModelPrep"></a>
<a href="#top">Back to Top</a>
## Model Preparation

In [4]:
# Specify the number of folds for cross-validation
num_folds = 5
cv = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)

# Specify the feature selection method (SelectKBest with ANOVA F-statistic in this example)
# You can choose a different method based on your requirements
feature_selector = SelectKBest(score_func=f_classif, k='all')

# Specify the parameter grid for hyperparameter tuning.
# Deliberately left out of the grid:
#   - 'min_impurity_split': no longer a RandomForestClassifier parameter (use 'min_impurity_decrease')
#   - max_features='auto': removed from recent scikit-learn releases; for classifiers it meant 'sqrt'
#   - 'oob_score': requires bootstrap=True and does not change the fitted trees
#   - 'n_jobs': only controls parallelism, so it is set on GridSearchCV instead of being tuned
# The grid still has roughly 31,000 combinations; trim it before running on real data.
param_grid = {
    'n_estimators': [50, 100, 200],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'min_weight_fraction_leaf': [0.0, 0.1, 0.2],
    'max_features': ['sqrt', 'log2'],
    'max_leaf_nodes': [None, 5, 10, 20],
    'min_impurity_decrease': [0.0, 0.1, 0.2],
    'bootstrap': [True, False]
    # Add other hyperparameters to tune
}

# Build one grid search per scoring metric, using all available CPU cores
def make_search(scoring):
    return GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                        cv=cv, scoring=scoring, n_jobs=-1)

# Specify the classifiers
# (Iris has three classes, so ROC AUC needs a multiclass variant such as one-vs-rest)
classifiers = {
    'Accuracy': make_search('accuracy'),
    'Precision': make_search('precision_weighted'),
    'Recall': make_search('recall_weighted'),
    'F1 Score': make_search('f1_weighted'),
    'ROC AUC': make_search('roc_auc_ovr_weighted')
}

# Initialize a dictionary to store the final selected features for each metric
final_selected_features = {metric: [] for metric in classifiers}

# Perform feature selection and model training for each metric
for metric, classifier in classifiers.items():
    # Initialize a list to store the selected features across folds
    selected_features_across_folds = []

    for train_index, test_index in cv.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit the feature selector on the training data
        feature_selector.fit(X_train, y_train)

        # Transform the training and testing data to keep only the selected features
        X_train_selected = feature_selector.transform(X_train)
        X_test_selected = feature_selector.transform(X_test)

        # Perform grid search for hyperparameter tuning
        classifier.fit(X_train_selected, y_train)

        # Store the selected features for this fold
        selected_features_across_folds.append(np.where(feature_selector.get_support())[0])

    # Aggregate the selected features for this metric
    final_selected_features[metric] = np.unique(np.concatenate(selected_features_across_folds))
    print(f"Final selected features for {metric}: {final_selected_features[metric]}")
    print(f"Best hyperparameters for {metric}: {classifier.best_params_}")
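
The loop above calls `fit` on each grid search once per outer fold, so the printed `best_params_` reflect only the last fold. One alternative, sketched here as an assumption rather than as the notebook's method, is to put the feature selector and the forest in a single `Pipeline`, tune it with an inner `GridSearchCV`, and score it with an outer `cross_val_score`. The names `X_demo`, `small_grid`, `inner_cv`, and `outer_cv` are hypothetical, and the grid is kept tiny so the sketch runs quickly.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

X_demo, y_demo = load_iris(return_X_y=True)

# Feature selection and the classifier are fitted together on each training split,
# so the held-out split never influences which features are kept
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("forest", RandomForestClassifier(random_state=42)),
])

# A deliberately small grid (illustrative values only)
small_grid = {
    "select__k": [2, "all"],
    "forest__n_estimators": [25, 50],
    "forest__max_depth": [None, 3],
}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, small_grid, cv=inner_cv, scoring="accuracy")

# Outer loop: estimate how well the whole tuning procedure generalizes
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)

# Final fit on all data to obtain one set of hyperparameters
search.fit(X_demo, y_demo)
print("Nested CV accuracy per fold:", nested_scores)
print("Best parameters:", search.best_params_)
```

The outer scores describe the tuning procedure as a whole, while the final `search.fit` yields the single model that would be exported.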


----
<a id="Exp^2"></a>
<a href="#top">Back to Top</a>
## Explain Features & Export Model

In [None]:
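# Collect the best estimator found by each grid search
# (with refit=True, each one is refitted on the last training fold it was given)
best_models_across_metrics = {metric: search.best_estimator_ for metric, search in classifiers.items()}
chosen_metric = 'Accuracy'  # Change this based on user input
chosen_model = best_models_across_metrics[chosen_metric]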

# 1. Calculate SHAP values for the chosen model
explainer = shap.TreeExplainer(chosen_model)
shap_values = explainer.shap_values(X)

# 2. Export the chosen model
joblib.dump(chosen_model, 'chosen_model.joblib')

# Optionally, you can also save the SHAP values for later interpretation
np.save('shap_values.npy', shap_values)
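
After exporting, it can be worth confirming that the saved file reloads into a model that behaves identically. The sketch below covers only the export step, not the SHAP step, and uses a small stand-in forest and a temporary directory; `demo_model.joblib`, `model`, and `reloaded` are hypothetical names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_iris(return_X_y=True)

# Stand-in for the chosen model
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions exactly
predictions_match = np.array_equal(model.predict(X_demo), reloaded.predict(X_demo))
print("Predictions match after reload:", predictions_match)
```

A file written by `joblib.dump` is generally safest to load with the same scikit-learn version that produced it.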

----
<a id="Other"></a>
<a href="#top">Back to Top</a>
## Other

Additionally, you might want to consider the following:

Handling Imbalanced Classes:

If your classes are imbalanced, consider techniques like oversampling, undersampling, or using class weights to handle the imbalance.
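
As a sketch of the class-weight option, the example below builds a synthetic imbalanced dataset (an assumption, since Iris is balanced) and compares a plain forest with one using `class_weight="balanced"`. All variable names are hypothetical, and whether weighting improves minority-class recall depends on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic binary problem with roughly 95% / 5% class proportions
X_imb, y_imb = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                   weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, stratify=y_imb,
                                          test_size=0.3, random_state=42)

# Weights implied by "balanced": n_samples / (n_classes * class_count)
balanced_weights = compute_class_weight("balanced", classes=np.unique(y_tr), y=y_tr)

plain = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                  random_state=42).fit(X_tr, y_tr)

# Recall on the minority class (label 1) for both models
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print("Balanced class weights:", balanced_weights)
print("Minority recall, plain vs weighted:", recall_plain, recall_weighted)
```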

Ensemble Methods:

Random Forest is itself an ensemble method. You can also explore stacking it with other models, or compare it against boosting, to see whether performance improves.
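
A minimal stacking sketch, assuming scikit-learn's `StackingClassifier` with a random forest and a gradient-boosting model as base learners and logistic regression as the final estimator. These choices are illustrative, not a recommendation, and the names `stack`, `cv_demo`, `stack_scores`, and `forest_scores` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)

# Base learners produce out-of-fold predictions that the final estimator combines
stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("boost", GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Compare the stack against a single forest under the same folds
stack_scores = cross_val_score(stack, X_demo, y_demo, cv=cv_demo)
forest_scores = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                                X_demo, y_demo, cv=cv_demo)
print("Stacking accuracy:", stack_scores.mean())
print("Forest accuracy:  ", forest_scores.mean())
```

On a small, easy dataset such as Iris the two may score about the same, so the comparison is more informative on harder problems.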

Interpretability:

Random Forest models can be challenging to interpret. Tools such as SHAP values or partial dependence plots give insight into feature importance and model behavior.

----
<a id="Cite"></a>
<a href="#top">Back to Top</a>
## Citations
I used ChatGPT 3.5 as an assistant for debugging and code suggestions, as well as for information on model-building processes.