# Drug to Drug Interaction Multiclass Prediction

This notebook contains experiments using Random Forest Classifier for multiclass classification. The data used is sourced from an online database called [DrugBank](https://go.drugbank.com/).

## Experiment decription

### 1st Experiment
  **Baseline model**

 A first look at the model's metrics without any data preprocessing or model fine tuning.

### 2nd Experiment
  - Data Preprocessing
    - Upsampling minority classes
    - Downsampling majority class
  - Hyperparameter tuning (GridSearch)

### 3rd Experiment
  Same as experiment **2** , + balanced weights in classes as RFC parameters

### 4rth Experiment

  Same as experiment **2**, + experimenting with manual weights per class

## Data preparation

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import confusion_matrix,classification_report, f1_score, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/DDI/Development Files/FinalFeatures.csv')

df['Description'] = df['Description'].replace(6, 5)

# Keep column names
column_names = df.columns[1:106]

# Separate features from classes
X = df.iloc[:, 1:106].values
y = df.iloc[:, -1].values

### Baseline Experiment

In [None]:
# Create a random forest classifier with 100 trees
rfc = RandomForestClassifier(n_estimators=100, random_state=42)

# Define a 10-fold cross-validation object
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Get cross-validated predictions
y_pred = cross_val_predict(rfc, X, y, cv=cv)

# Get the unique class labels
labels = np.unique(y)

# Get the classification report for each class
report = classification_report(y, y_pred, labels=labels)

# Print the classification report
print(report)

In [None]:
 # Load the dataset
df = pd.read_csv('/content/drive/MyDrive/DDI/Development Files/FinalFeatures.csv')

# Keep column names
column_names = df.columns[1:106]
df['Description'] = df['Description'].replace(6, 5)

# Separate features from classes
X1 = df.iloc[:, 1:106].values
y1 = df.iloc[:, -1].values

X, X_test, y, y_test = train_test_split(X1, y1, test_size=0.2, stratify = y1 )

### 2nd Experiment

In [None]:
# Create a random forest classifier
rfc = RandomForestClassifier()

# Define parameters for grid search
param_grid = {
    'n_estimators': [75, 100, 125],
    'max_depth': [10, 20],
    'min_samples_leaf': [5, 10]
}

# Define upsampling ratios to try
sampling_values = [[2,10], [2,5], [3,10], [3,5], [1,10], [1,5]]

# Create a stratified 10-fold cross-validation object
outer_cv = StratifiedKFold(n_splits=10)
inner_cv = StratifiedKFold(n_splits=5)

# Initialize variables to track the best overall model
best_score = 0
best_model = None
best_cm_outer = None
best_report_outer = None
outer_scores_list = []

# Initialize variables to accumulate confusion matrices
overall_cm = np.zeros((len(np.unique(y)), len(np.unique(y))), dtype=int)
overall_supports = np.zeros(len(np.unique(y)), dtype=int)

# Initialize a dictionary to store cumulative recalls for each class
cumulative_recalls = {class_label: 0 for class_label in np.unique(y)}
cumulative_precisions = {class_label: 0 for class_label in np.unique(y)}
cumulative_f1_scores= {class_label: 0 for class_label in np.unique(y)}

# Initialize a dictionary to store the count of occurrences of each class
class_counts = {class_label: 0 for class_label in np.unique(y)}

# Iterate over the outer cross-validation splits
for train_index, test_index in outer_cv.split(X, y):
    X_train_outer, X_test_outer = X[train_index], X[test_index]
    y_train_outer, y_test_outer = y[train_index], y[test_index]

    # Initialize variables to track the best inner model
    best_inner_score = 0
    best_inner_model = None
    best_cm_inner = None
    best_report_inner = None

    feature_names = df.columns[1:106].astype(str)
    #print(feature_names)

    # Turn numpy arrays into dfs
    X_train_df = pd.DataFrame(X, columns=column_names)
    y_train_df = pd.DataFrame(y, columns=['Description'])

    # Combine X_train and y_train for later use
    train_data = pd.concat([X_train_df, y_train_df], axis=1)

    for i,j in sampling_values:
        #for upsampling_ratio in upsampling_ratios:
        majority_class_0 = train_data[train_data['Description'] == 0]
        minority_class_1 = train_data[train_data['Description'] == 1]
        minority_class_3 = train_data[train_data['Description'] == 3]
        class_2 = train_data[train_data['Description'] == 2]
        class_4 = train_data[train_data['Description'] == 4]
        class_5 = train_data[train_data['Description'] == 5]

        # Calculate the number of samples based on percentages
        majority_downsampled_samples = len(majority_class_0) // i  # Use // for integer division
        minority_upsampled_samples_1 = int(len(minority_class_1) * j)  # Convert to integer
        minority_upsampled_samples_3 = int(len(minority_class_3) * j)  # Convert to integer


        # Downsample the majority class 0
        majority_downsampled = resample(majority_class_0,
                                replace=False,
                                n_samples=majority_downsampled_samples)

        # Combine minority classes 1 and 3
        minority_combined = pd.concat([minority_class_1, minority_class_3])

        # Upsample minority classes 1 and 3 to have 1000 examples each using SMOTE
        smote = SMOTE(sampling_strategy={1: minority_upsampled_samples_1, 3: minority_upsampled_samples_3})
        minority_upsampled, y1 = smote.fit_resample(minority_combined.iloc[:, :105], minority_combined.iloc[:, -1])
        minority_upsampled = pd.DataFrame(minority_upsampled, columns=minority_combined.columns[:105])
        minority_upsampled['Description'] = y1

        # Combine classes
        df_downsampled = pd.concat([majority_downsampled, minority_upsampled, class_2, class_4, class_5], ignore_index=True)

        # Shuffle the dataset
        df_downsampled = df_downsampled.sample(frac=1, random_state=42)

        # Split the data back into X_train and y_train
        X_train_upsampled = df_downsampled.iloc[:, 0:105].values
        y_train_upsampled = df_downsampled.iloc[:, -1].values

        # Create a grid search object for hyperparameter tuning
        grid_search_inner = GridSearchCV(rfc, param_grid=param_grid, cv=inner_cv,
                                         scoring='f1_macro', refit=True)

        # Fit the grid search to the upsampled training data
        grid_search_inner.fit(X_train_upsampled, y_train_upsampled)

        # Get the best model from the inner grid search
        best_inner_model = grid_search_inner.best_estimator_

    #get metrics for outer CV
    y_pred_outer = best_inner_model.predict(X_test)
    outer_score = classification_report(y_test, y_pred_outer, labels=np.unique(y), output_dict=True)
    cm_outer = confusion_matrix(y_test, y_pred_outer, labels=np.unique(y))

    # Accumulate the confusion matrix
    overall_cm += cm_outer
    overall_supports += np.sum(cm_outer, axis=1)

    outer_scores_list.append(outer_score)

    # Iterate over the outer_scores_list
    for outer_score in outer_scores_list:
        # Iterate over the classes in the outer_score dictionary
        for class_label in np.unique(y):
            if class_label not in ['accuracy', 'macro avg', 'weighted avg']:
                avg_recall_score = outer_score.get(str(class_label), {'recall': 0})['recall']
                avg_precision_score = outer_score.get(str(class_label), {'precision': 0})['precision']
                avg_f1_score = outer_score.get(str(class_label), {'f1-score': 0})['f1-score']

                support = outer_score.get(str(class_label), {'support': 0})['support']

                # Update cumulative recalls and class counts
                cumulative_recalls[class_label] += avg_recall_score
                cumulative_precisions[class_label] += avg_precision_score
                cumulative_f1_scores[class_label] += avg_f1_score
                class_counts[class_label] += 1

# Print or use the overall confusion matrix
print("Overall Confusion Matrix:")
print(overall_cm)

# Calculate the average recall for each class
average_recalls = {class_label: cumulative_recall / class_counts[class_label] for class_label, cumulative_recall in cumulative_recalls.items()}
average_precisions = {class_label: cumulative_precision / class_counts[class_label] for class_label, cumulative_precision in cumulative_precisions.items()}
average_f1_scores = {class_label: cumulative_f1_score / class_counts[class_label] for class_label, cumulative_f1_score in cumulative_f1_scores.items()}

# Print the average metrics for each class
for class_label in np.unique(y):
    if class_label not in ['accuracy', 'macro avg', 'weighted avg']:
        print(f"Class '{class_label}':")
        print(f"Average F1 Score: {average_f1_scores[class_label]}")
        print(f"Average Recall: {average_recalls[class_label]}")
        print(f"Average Precision: {average_precisions[class_label]}")


### 3rd Experiment

In [None]:
# Create a random forest classifier
rfc = RandomForestClassifier(class_weight = "balanced")

# Define parameters for grid search
param_grid = {
    'n_estimators': [75, 100, 125],
    'max_depth': [10, 20],
    'min_samples_leaf': [5, 10]
}

# Define upsampling ratios to try
sampling_values = [[2,10], [2,5], [3,10], [3,5], [1,10], [1,5]]

# Create a stratified 10-fold cross-validation object
outer_cv = StratifiedKFold(n_splits=10)
inner_cv = StratifiedKFold(n_splits=5)

# Initialize variables to track the best overall model
best_score = 0
best_model = None
best_cm_outer = None
best_report_outer = None
outer_scores_list = []

# Initialize variables to accumulate confusion matrices
overall_cm = np.zeros((len(np.unique(y)), len(np.unique(y))), dtype=int)
overall_supports = np.zeros(len(np.unique(y)), dtype=int)

# Initialize a dictionary to store cumulative recalls for each class
cumulative_recalls = {class_label: 0 for class_label in np.unique(y)}
cumulative_precisions = {class_label: 0 for class_label in np.unique(y)}
cumulative_f1_scores= {class_label: 0 for class_label in np.unique(y)}

# Initialize a dictionary to store the count of occurrences of each class
class_counts = {class_label: 0 for class_label in np.unique(y)}

# Iterate over the outer cross-validation splits
for train_index, test_index in outer_cv.split(X, y):
    X_train_outer, X_test_outer = X[train_index], X[test_index]
    y_train_outer, y_test_outer = y[train_index], y[test_index]

    # Initialize variables to track the best inner model
    best_inner_score = 0
    best_inner_model = None
    best_cm_inner = None
    best_report_inner = None

    feature_names = df.columns[1:106].astype(str)

    # Turn numpy arrays into dfs
    X_train_df = pd.DataFrame(X, columns=column_names)
    y_train_df = pd.DataFrame(y, columns=['Description'])

    # Combine X_train and y_train for later use
    train_data = pd.concat([X_train_df, y_train_df], axis=1)

    for i,j in sampling_values:
        #for upsampling_ratio in upsampling_ratios:
        majority_class_0 = train_data[train_data['Description'] == 0]
        minority_class_1 = train_data[train_data['Description'] == 1]
        minority_class_3 = train_data[train_data['Description'] == 3]
        class_2 = train_data[train_data['Description'] == 2]
        class_4 = train_data[train_data['Description'] == 4]
        class_5 = train_data[train_data['Description'] == 5]

        # Calculate the number of samples based on percentages
        majority_downsampled_samples = len(majority_class_0) // i  # Use // for integer division
        minority_upsampled_samples_1 = int(len(minority_class_1) * j)  # Convert to integer
        minority_upsampled_samples_3 = int(len(minority_class_3) * j)  # Convert to integer


        # Downsample the majority class 0
        majority_downsampled = resample(majority_class_0,
                                replace=False,
                                n_samples=majority_downsampled_samples)

        # Combine minority classes 1 and 3
        minority_combined = pd.concat([minority_class_1, minority_class_3])

        # Upsample minority classes 1 and 3 to have 1000 examples each using SMOTE
        smote = SMOTE(sampling_strategy={1: minority_upsampled_samples_1, 3: minority_upsampled_samples_3})
        minority_upsampled, y1 = smote.fit_resample(minority_combined.iloc[:, :105], minority_combined.iloc[:, -1])
        minority_upsampled = pd.DataFrame(minority_upsampled, columns=minority_combined.columns[:105])
        minority_upsampled['Description'] = y1

        # Combine classes
        df_downsampled = pd.concat([majority_downsampled, minority_upsampled, class_2, class_4, class_5], ignore_index=True)

        # Shuffle the dataset
        df_downsampled = df_downsampled.sample(frac=1, random_state=42)

        # Split the data back into X_train and y_train
        X_train_upsampled = df_downsampled.iloc[:, 0:105].values
        y_train_upsampled = df_downsampled.iloc[:, -1].values

        # Create a grid search object for hyperparameter tuning
        grid_search_inner = GridSearchCV(rfc, param_grid=param_grid, cv=inner_cv,
                                         scoring='f1_macro', refit=True)

        # Fit the grid search to the upsampled training data
        grid_search_inner.fit(X_train_upsampled, y_train_upsampled)

        # Get the best model from the inner grid search
        best_inner_model = grid_search_inner.best_estimator_

    #get metrics for outer CV
    y_pred_outer = best_inner_model.predict(X_test)
    outer_score = classification_report(y_test, y_pred_outer, labels=np.unique(y), output_dict=True)
    cm_outer = confusion_matrix(y_test, y_pred_outer, labels=np.unique(y))

    # Accumulate the confusion matrix
    overall_cm += cm_outer
    overall_supports += np.sum(cm_outer, axis=1)

    outer_scores_list.append(outer_score)

    # Iterate over the outer_scores_list
    for outer_score in outer_scores_list:
        # Iterate over the classes in the outer_score dictionary
        for class_label in np.unique(y):
            if class_label not in ['accuracy', 'macro avg', 'weighted avg']:
                avg_recall_score = outer_score.get(str(class_label), {'recall': 0})['recall']
                avg_precision_score = outer_score.get(str(class_label), {'precision': 0})['precision']
                avg_f1_score = outer_score.get(str(class_label), {'f1-score': 0})['f1-score']
                support = outer_score.get(str(class_label), {'support': 0})['support']

                # Update cumulative recalls and class counts
                cumulative_recalls[class_label] += avg_recall_score
                cumulative_precisions[class_label] += avg_precision_score
                cumulative_f1_scores[class_label] += avg_f1_score
                class_counts[class_label] += 1

# Print or use the overall confusion matrix
print("Overall Confusion Matrix:")
print(overall_cm)

# Calculate the average recall for each class
average_recalls = {class_label: cumulative_recall / class_counts[class_label] for class_label, cumulative_recall in cumulative_recalls.items()}
average_precisions = {class_label: cumulative_precision / class_counts[class_label] for class_label, cumulative_precision in cumulative_precisions.items()}
average_f1_scores = {class_label: cumulative_f1_score / class_counts[class_label] for class_label, cumulative_f1_score in cumulative_f1_scores.items()}

# Print the average metrics for each class
for class_label in np.unique(y):
    if class_label not in ['accuracy', 'macro avg', 'weighted avg']:
        print(f"Class '{class_label}':")
        print(f"Average F1 Score: {average_f1_scores[class_label]}")
        print(f"Average Recall: {average_recalls[class_label]}")
        print(f"Average Precision: {average_precisions[class_label]}")


### 4th Experiment

In [None]:
dict_weights = {0:0.4, 1: 4, 2: 1.3, 3: 2.3, 4: 2, 5: 3}

# Create a random forest classifier
rfc = RandomForestClassifier(class_weight = dict_weights)

# Define parameters for grid search
param_grid = {
    'n_estimators': [75, 100, 125],
    'max_depth': [10, 20],
    'min_samples_leaf': [5, 10]
}

# Define upsampling ratios to try
sampling_values = [[2,10], [2,5], [3,10], [3,5], [1,10], [1,5]]


# Create a stratified 10-fold cross-validation object
outer_cv = StratifiedKFold(n_splits=10)
inner_cv = StratifiedKFold(n_splits=5)

# Initialize variables to track the best overall model
best_score = 0
best_model = None
best_cm_outer = None
best_report_outer = None
outer_scores_list = []

# Initialize variables to accumulate confusion matrices
overall_cm = np.zeros((len(np.unique(y)), len(np.unique(y))), dtype=int)
overall_supports = np.zeros(len(np.unique(y)), dtype=int)

# Initialize a dictionary to store cumulative recalls for each class
cumulative_recalls = {class_label: 0 for class_label in np.unique(y)}
cumulative_precisions = {class_label: 0 for class_label in np.unique(y)}
cumulative_f1_scores= {class_label: 0 for class_label in np.unique(y)}

# Initialize a dictionary to store the count of occurrences of each class
class_counts = {class_label: 0 for class_label in np.unique(y)}

# Iterate over the outer cross-validation splits
for train_index, test_index in outer_cv.split(X, y):
    X_train_outer, X_test_outer = X[train_index], X[test_index]
    y_train_outer, y_test_outer = y[train_index], y[test_index]

    # Initialize variables to track the best inner model
    best_inner_score = 0
    best_inner_model = None
    best_cm_inner = None
    best_report_inner = None

    feature_names = df.columns[1:106].astype(str)
    #print(feature_names)

    # Turn numpy arrays into dfs
    X_train_df = pd.DataFrame(X, columns=column_names)
    y_train_df = pd.DataFrame(y, columns=['Description'])

    # Combine X_train and y_train for later use
    train_data = pd.concat([X_train_df, y_train_df], axis=1)

    for i,j in sampling_values:
        #for upsampling_ratio in upsampling_ratios:
        majority_class_0 = train_data[train_data['Description'] == 0]
        minority_class_1 = train_data[train_data['Description'] == 1]
        minority_class_3 = train_data[train_data['Description'] == 3]
        class_2 = train_data[train_data['Description'] == 2]
        class_4 = train_data[train_data['Description'] == 4]
        class_5 = train_data[train_data['Description'] == 5]

        # Calculate the number of samples based on percentages
        majority_downsampled_samples = len(majority_class_0) // i  # Use // for integer division
        minority_upsampled_samples_1 = int(len(minority_class_1) * j)  # Convert to integer
        minority_upsampled_samples_3 = int(len(minority_class_3) * j)  # Convert to integer


        # Downsample the majority class 0
        majority_downsampled = resample(majority_class_0,
                                replace=False,
                                n_samples=majority_downsampled_samples)

        # Combine minority classes 1 and 3
        minority_combined = pd.concat([minority_class_1, minority_class_3])

        # Upsample minority classes 1 and 3 to have 1000 examples each using SMOTE
        smote = SMOTE(sampling_strategy={1: minority_upsampled_samples_1, 3: minority_upsampled_samples_3})
        minority_upsampled, y1 = smote.fit_resample(minority_combined.iloc[:, :105], minority_combined.iloc[:, -1])
        minority_upsampled = pd.DataFrame(minority_upsampled, columns=minority_combined.columns[:105])
        minority_upsampled['Description'] = y1

        # Combine classes
        df_downsampled = pd.concat([majority_downsampled, minority_upsampled, class_2, class_4, class_5], ignore_index=True)

        # Shuffle the dataset
        df_downsampled = df_downsampled.sample(frac=1, random_state=42)

        # Split the data back into X_train and y_train
        X_train_upsampled = df_downsampled.iloc[:, 0:105].values
        y_train_upsampled = df_downsampled.iloc[:, -1].values

        # Create a grid search object for hyperparameter tuning
        grid_search_inner = GridSearchCV(rfc, param_grid=param_grid, cv=inner_cv,
                                         scoring='f1_macro', refit=True)

        # Fit the grid search to the upsampled training data
        grid_search_inner.fit(X_train_upsampled, y_train_upsampled)

        # Get the best model from the inner grid search
        best_inner_model = grid_search_inner.best_estimator_

    #get metrics for outer CV
    y_pred_outer = best_inner_model.predict(X_test)
    outer_score = classification_report(y_test, y_pred_outer, labels=np.unique(y), output_dict=True)
    cm_outer = confusion_matrix(y_test, y_pred_outer, labels=np.unique(y))

    # Accumulate the confusion matrix
    overall_cm += cm_outer
    overall_supports += np.sum(cm_outer, axis=1)

    outer_scores_list.append(outer_score)


    # Iterate over the outer_scores_list
    for outer_score in outer_scores_list:
        # Iterate over the classes in the outer_score dictionary
        for class_label in np.unique(y):
            if class_label not in ['accuracy', 'macro avg', 'weighted avg']:
                avg_recall_score = outer_score.get(str(class_label), {'recall': 0})['recall']
                avg_precision_score = outer_score.get(str(class_label), {'precision': 0})['precision']
                avg_f1_score = outer_score.get(str(class_label), {'f1-score': 0})['f1-score']

                support = outer_score.get(str(class_label), {'support': 0})['support']

                # Update cumulative recalls and class counts
                cumulative_recalls[class_label] += avg_recall_score
                cumulative_precisions[class_label] += avg_precision_score
                cumulative_f1_scores[class_label] += avg_f1_score
                class_counts[class_label] += 1

# Print or use the overall confusion matrix
print("Overall Confusion Matrix:")
print(overall_cm)


# Calculate the average recall for each class
average_recalls = {class_label: cumulative_recall / class_counts[class_label] for class_label, cumulative_recall in cumulative_recalls.items()}
average_precisions = {class_label: cumulative_precision / class_counts[class_label] for class_label, cumulative_precision in cumulative_precisions.items()}
average_f1_scores = {class_label: cumulative_f1_score / class_counts[class_label] for class_label, cumulative_f1_score in cumulative_f1_scores.items()}

# Print the average metrics for each class
for class_label in np.unique(y):
    if class_label not in ['accuracy', 'macro avg', 'weighted avg']:
        print(f"Class '{class_label}':")
        print(f"Average F1 Score: {average_f1_scores[class_label]}")
        print(f"Average Recall: {average_recalls[class_label]}")
        print(f"Average Precision: {average_precisions[class_label]}")

# Predict new DDIs and Feature Importance

## Explainability & Predictions (Balanced Weights)

In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

 # Load the dataset
train_data = pd.read_csv('/content/drive/MyDrive/DDI/Development Files/FinalFeatures.csv')

# Keep column names
column_names = train_data.columns[1:106]
train_data['Description'] = train_data['Description'].replace(6, 5)

# Assuming 'train_data' is your original dataset
# Step 1: Split data into train (80%) and test (20%), ensuring the test set contains only Class 0
class_0_data = train_data[train_data['Description'] == 0]
class_non_0_data = train_data[train_data['Description'] != 0]

train_class_0, test_class_0 = train_test_split(class_0_data, test_size=0.2, random_state=42)

train_data_final = pd.concat([train_class_0, class_non_0_data], ignore_index=True)

# Step 2: Perform up/downsampling
sampling_values = [[2, 10]]
for i, j in sampling_values:
    majority_class_0 = train_data_final[train_data_final['Description'] == 0]
    minority_class_1 = train_data_final[train_data_final['Description'] == 1]
    minority_class_3 = train_data_final[train_data_final['Description'] == 3]
    class_2 = train_data_final[train_data_final['Description'] == 2]
    class_4 = train_data_final[train_data_final['Description'] == 4]
    class_5 = train_data_final[train_data_final['Description'] == 5]

    # Downsample Class 0
    majority_downsampled_samples = len(majority_class_0) // i
    majority_downsampled = resample(majority_class_0, replace=False, n_samples=majority_downsampled_samples)

    # Upsample Class 1 and 3 using SMOTE
    minority_combined = pd.concat([minority_class_1, minority_class_3])
    minority_upsampled_samples_1 = int(len(minority_class_1) * j)
    minority_upsampled_samples_3 = int(len(minority_class_3) * j)

    smote = SMOTE(sampling_strategy={1: minority_upsampled_samples_1, 3: minority_upsampled_samples_3})
    minority_upsampled, y1 = smote.fit_resample(minority_combined.iloc[:, 1:105], minority_combined.iloc[:, -1])

    minority_upsampled = pd.DataFrame(minority_upsampled, columns=minority_combined.columns[1:105])
    minority_upsampled['Description'] = y1

    # Combine all classes
    df_final = pd.concat([majority_downsampled, minority_upsampled, class_2, class_4, class_5], ignore_index=True)

    # Shuffle the dataset
    df_final = df_final.sample(frac=1, random_state=42)

# Step 3: Prepare training and test sets
X_train = df_final.iloc[:, 1:106].values
y_train = df_final.iloc[:, -1].values
X_test = test_class_0.iloc[:, 1:106].values
y_test = test_class_0.iloc[:, -1].values  # Should all be 0

# Step 4: Train the Random Forest Classifier
rfc = RandomForestClassifier(class_weight="balanced", n_estimators=75, max_depth=10, min_samples_leaf=5, random_state=42)
rfc.fit(X_train, y_train)

# Step 5: Make predictions on the test set
y_pred_proba = rfc.predict_proba(X_test)  # Get probabilities
y_pred = rfc.predict(X_test)  # Get class predictions

# Step 6: Identify misclassified examples (i.e., classified as anything other than 0)
classified_indices = np.where(y_pred != 0)[0]
classified_samples = test_class_0.iloc[classified_indices].copy()
classified_samples['Predicted_Class'] = y_pred[classified_indices]
classified_samples['Confidence'] = np.max(y_pred_proba[classified_indices], axis=1)  # Max confidence per row

# Step 7: Extract feature importances
feature_importance = pd.DataFrame({
    'Feature': train_data.columns[:105],
    'Importance': rfc.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Sort misclassified examples by confidence in descending order
classified_samples = classified_samples.sort_values(by='Confidence', ascending=False)

# Display misclassified examples with Drug Pair names
print("\nClassified Examples with Drug Pairs (Sorted by Confidence):")
print(classified_samples[['DB_PAIR', 'Predicted_Class', 'Confidence']])

# Get the best (highest confidence) prediction for each class
best_predictions_per_class = classified_samples.loc[
    classified_samples.groupby('Predicted_Class')['Confidence'].idxmax()
]

# Display the best misclassified example for each class
print("\nBest Prediction for Each Class:")
print(best_predictions_per_class[['DB_PAIR', 'Predicted_Class', 'Confidence']])

# Save it as a CSV file for further analysis
best_predictions_per_class[['DB_PAIR', 'Predicted_Class', 'Confidence']].to_csv("best_predictions_per_class.csv", index=False)

# Display top 10 important features
print("\nTop 10 Feature Importances:")
print(feature_importance.head(10))

  minority_upsampled['Description'] = y1



Classified Examples with Drug Pairs (Sorted by Confidence):
               DB_PAIR  Predicted_Class  Confidence
27922  DB00399_DB01181                2    0.699794
11775  DB00399_DB00762                2    0.663858
16374  DB00599_DB13145                5    0.660697
20551  DB00293_DB00317                2    0.655927
23276  DB00515_DB00646                5    0.651843
...                ...              ...         ...
24186  DB00762_DB01626                5    0.233621
12383  DB00762_DB00821                5    0.229726
15982  DB01143_DB04978                4    0.222120
15498  DB00563_DB00765                5    0.221127
8169   DB00563_DB01034                5    0.207922

[1425 rows x 3 columns]

Best Prediction for Each Class:
               DB_PAIR  Predicted_Class  Confidence
14857  DB00278_DB00530                1    0.338608
27922  DB00399_DB01181                2    0.699794
10980  DB00530_DB06688                3    0.373695
26525  DB01229_DB06803                4    0.5572

## Explainability & Predictions (Custom Weights)

In [21]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

 # Load the dataset
train_data = pd.read_csv('/content/drive/MyDrive/DDI/Development Files/FinalFeatures.csv')  # use your file path

# Keep column names
column_names = train_data.columns[1:106]
train_data['Description'] = train_data['Description'].replace(6, 5)


# Step 1: Split data into train (80%) and test (20%), ensuring the test set contains only Class 0
class_0_data = train_data[train_data['Description'] == 0]
class_non_0_data = train_data[train_data['Description'] != 0]

train_class_0, test_class_0 = train_test_split(class_0_data, test_size=0.2, random_state=42)

train_data_final = pd.concat([train_class_0, class_non_0_data], ignore_index=True)

# Step 2: Perform up/downsampling
sampling_values = [[2, 10]]
for i, j in sampling_values:
    majority_class_0 = train_data_final[train_data_final['Description'] == 0]
    minority_class_1 = train_data_final[train_data_final['Description'] == 1]
    minority_class_3 = train_data_final[train_data_final['Description'] == 3]
    class_2 = train_data_final[train_data_final['Description'] == 2]
    class_4 = train_data_final[train_data_final['Description'] == 4]
    class_5 = train_data_final[train_data_final['Description'] == 5]

    # Downsample Class 0
    majority_downsampled_samples = len(majority_class_0) // i
    majority_downsampled = resample(majority_class_0, replace=False, n_samples=majority_downsampled_samples)

    # Upsample Class 1 and 3 using SMOTE
    minority_combined = pd.concat([minority_class_1, minority_class_3])
    minority_upsampled_samples_1 = int(len(minority_class_1) * j)
    minority_upsampled_samples_3 = int(len(minority_class_3) * j)

    smote = SMOTE(sampling_strategy={1: minority_upsampled_samples_1, 3: minority_upsampled_samples_3})
    minority_upsampled, y1 = smote.fit_resample(minority_combined.iloc[:, 1:105], minority_combined.iloc[:, -1])

    minority_upsampled = pd.DataFrame(minority_upsampled, columns=minority_combined.columns[1:105])
    minority_upsampled['Description'] = y1

    # Combine all classes
    df_final = pd.concat([majority_downsampled, minority_upsampled, class_2, class_4, class_5], ignore_index=True)

    # Shuffle the dataset
    df_final = df_final.sample(frac=1, random_state=42)

# Step 3: Prepare training and test sets
X_train = df_final.iloc[:, 1:106].values
y_train = df_final.iloc[:, -1].values
X_test = test_class_0.iloc[:, 1:106].values
y_test = test_class_0.iloc[:, -1].values  # Should all be 0

# Step 4: Train the Random Forest Classifier
dict_weights = {0: 0.4, 1: 4, 2: 1.3, 3: 2.3, 4: 2, 5: 3}
rfc = RandomForestClassifier(class_weight=dict_weights, n_estimators=75, max_depth=10, min_samples_leaf=5, random_state=42)
rfc.fit(X_train, y_train)

# Step 5: Make predictions on the test set
y_pred_proba = rfc.predict_proba(X_test)  # Get probabilities
y_pred = rfc.predict(X_test)  # Get class predictions

# Step 6: Identify misclassified examples (i.e., classified as anything other than 0)
classified_indices = np.where(y_pred != 0)[0]
classified_samples = test_class_0.iloc[classified_indices].copy()
classified_samples['Predicted_Class'] = y_pred[classified_indices]
classified_samples['Confidence'] = np.max(y_pred_proba[classified_indices], axis=1)  # Max confidence per row

# Step 7: Extract feature importances
feature_importance = pd.DataFrame({
    'Feature': train_data.columns[:105],
    'Importance': rfc.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Sort misclassified examples by confidence in descending order
classified_samples = classified_samples.sort_values(by='Confidence', ascending=False)

# Display misclassified examples with Drug Pair names
print("\nClassified Examples with Drug Pairs (Sorted by Confidence):")
print(classified_samples[['DB_PAIR', 'Predicted_Class', 'Confidence']])

# Get the best (highest confidence) prediction for each class
best_predictions_per_class = classified_samples.loc[
    classified_samples.groupby('Predicted_Class')['Confidence'].idxmax()
]

# Display the best misclassified example for each class
print("\nBest Prediction for Each Class:")
print(best_predictions_per_class[['DB_PAIR', 'Predicted_Class', 'Confidence']])

# Save it as a CSV file for further analysis
best_predictions_per_class[['DB_PAIR', 'Predicted_Class', 'Confidence']].to_csv("best_predictions_per_class.csv", index=False)

# Display top 10 important features
print("\nTop 10 Feature Importances:")
print(feature_importance.head(10))

  minority_upsampled['Description'] = y1



Classified Examples with Drug Pairs (Sorted by Confidence):
               DB_PAIR  Predicted_Class  Confidence
16374  DB00599_DB13145                5    0.702814
26525  DB01229_DB06803                4    0.658919
12065  DB00426_DB13145                5    0.655771
12715  DB06767_DB09063                4    0.655422
23479  DB00557_DB13145                5    0.653597
...                ...              ...         ...
3790   DB06235_DB12465                4    0.263053
18018  DB00361_DB09352                4    0.258857
7269   DB08865_DB11989                5    0.251181
5189   DB00970_DB09037                4    0.249755
18767  DB08916_DB13025                5    0.245560

[2071 rows x 3 columns]

Best Prediction for Each Class:
               DB_PAIR  Predicted_Class  Confidence
14857  DB00278_DB00530                1    0.340146
15551  DB00099_DB00642                2    0.533017
10980  DB00530_DB06688                3    0.333969
26525  DB01229_DB06803                4    0.6589