# Summary of the article "Breast Cancer Classification using Random Forest Algorithm"

The article titled "Breast Cancer Classification using Random Forest Algorithm," published in the Journal of Physics: Conference Series, focuses on utilizing the Random Forest (RF) algorithm for diagnosing breast cancer, reducing variance and boosting accuracy.

They have provided the following diagram  of the workflow: 

<img src='workflow.png'>

The methodology includes data collection and analysis, feature standardization and decomposition, and training/testing data preparation.

The dataset used for this research is the Wisconsin diagnostic breast cancer (WBDC dataset, available from the UCI Machine Learning Repository). The dataset has 569 samples of nuclei and 32 features. They have divided the data in train/test split as follows: 70% / 30%. The training set is further divided into k subset and a k-fold cross-validation is performed (k = 10). This is to ensure robustness and avoid overfitting.

The classifier used is the Random Forest classifier from scikitlearn. They refer to other researches that show that the RF tree is more efficient in the low number of data samples. They claim that RF is not affected by noise. A key reason would be RF's ability to manage data minorities. They claim that it is possible to classify tumor as benign or malignant, except that the latter class accounts for just 10% of all input data.

They also use the KNIME node Tree Ensemble Learner to compare the results of the two methods. The Tree Ensemble Learner in KNIME is a component used for building ensemble models based on decision trees, such as Random Forests. This learner combines multiple decision trees to create a more robust and accurate model. Each tree in the ensemble is trained on a random subset of the data, and their predictions are aggregated (through voting for classification or averaging for regression) to produce the final output. This approach helps in reducing overfitting, improving model accuracy, and handling large datasets effectively.

The model evaluation is done using AUC (Area under the curve), Accuracy, F1 score and Sensitivity as measures. 
- AUC is the area under the ROC curve. This curve plots the True positives against the False positives. A higher AUC represents better model performance
- Accuracy is the sum of True positives and True negatives divided by the total number of cases examined. This measure can be misleading in cases of imbalanced classes (such as here).
- F1 score is the harmonic mean of Precision and Recall. This measure is useful when there are imbalanced classes.
- Sensitivity (a.k.a Recall) measures the proportion of actual positives that are correctly identified as such. Important in medical testing or other diagnostic tests where missing out on positives is particularly costly.

The results are comparing the performance of the Random Forest algorithm to the KNIME node Tree Ensemble Learner: 

<img src='model_evaluation.png'>

<img src='evaluation2.png'>


We can see that Random Forest excels in every metric compared to the KNIME node Tree Ensemble Learner. 

The main contribution of the paper is its demonstration of the effectiveness of the RF algorithm in classifying breast cancer, providing a more precise and dependable diagnostic tool than conventional techniques.
It draws attention to the potential benefits of machine learning in healthcare, especially in terms of enhancing patient safety and diagnostic precision.
The approach and findings provide a valuable reference for future investigations into machine learning-based medical diagnosis.

# Code reproduction

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.datasets import load_breast_cancer

In [3]:
from ucimlrepo import fetch_ucirepo 

In [4]:
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [5]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   mean radius              569 non-null    float64
 1   mean texture             569 non-null    float64
 2   mean perimeter           569 non-null    float64
 3   mean area                569 non-null    float64
 4   mean smoothness          569 non-null    float64
 5   mean compactness         569 non-null    float64
 6   mean concavity           569 non-null    float64
 7   mean concave points      569 non-null    float64
 8   mean symmetry            569 non-null    float64
 9   mean fractal dimension   569 non-null    float64
 10  radius error             569 non-null    float64
 11  texture error            569 non-null    float64
 12  perimeter error          569 non-null    float64
 13  area error               569 non-null    float64
 14  smoothness error         5

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0


In [6]:
# Splitting the data into training and testing sets (70% training, 30% testing)
X = df.drop('target', axis=1)
y = df['target']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:
# Random Forest Classifier with 10-fold cross-validation

kf = KFold(n_splits=10, shuffle=True, random_state=42)
rf_classifier = RandomForestClassifier(random_state=42)

In [9]:
# Variables to store evaluation metrics
accuracy_scores = []
auc_scores = []
f1_scores = []
sensitivity_scores = []

In [10]:
# K-Fold Cross-Validation
for train_index, val_index in kf.split(X_train):
    X_train_fold, X_val_fold = X_train.iloc[train_index], X_train.iloc[val_index]
    y_train_fold, y_val_fold = y_train.iloc[train_index], y_train.iloc[val_index]

In [11]:
# Training the model
rf_classifier.fit(X_train_fold, y_train_fold)

# Making predictions
y_val_pred = rf_classifier.predict(X_val_fold)

In [12]:
# Evaluation Metrics
accuracy = accuracy_score(y_val_fold, y_val_pred)
auc = roc_auc_score(y_val_fold, y_val_pred)
report = classification_report(y_val_fold, y_val_pred, output_dict=True)
f1 = report['weighted avg']['f1-score']
sensitivity = report['1']['recall']  # Sensitivity for the '1' class (malignant)

In [13]:
accuracy_scores.append(accuracy)
auc_scores.append(auc)
f1_scores.append(f1)
sensitivity_scores.append(sensitivity)

In [14]:
# Calculating average scores across all folds
avg_accuracy = np.mean(accuracy_scores)
avg_auc = np.mean(auc_scores)
avg_f1 = np.mean(f1_scores)
avg_sensitivity = np.mean(sensitivity_scores)

avg_accuracy, avg_auc, avg_f1, avg_sensitivity

(1.0, 1.0, 1.0, 1.0)

In [15]:
# Evaluating the model on the testing set
y_test_pred = rf_classifier.predict(X_test)

# Evaluation Metrics for testing set
test_accuracy = accuracy_score(y_test, y_test_pred)
test_auc = roc_auc_score(y_test, y_test_pred)
test_report = classification_report(y_test, y_test_pred, output_dict=True)
test_f1 = test_report['weighted avg']['f1-score']
test_sensitivity = test_report['1']['recall']

print(f"Accuracy: {test_accuracy}")
print(f"AUC: {test_auc}")
print(f"F1-score: {test_f1}")
print(f"Recall (for the first class 'malignant'): {test_sensitivity}")


Accuracy: 0.9707602339181286
AUC: 0.9669312169312169
F1-score: 0.9707106475867088
Recall (for the first class 'malignant'): 0.9814814814814815


# Conclusion

<p style='font-size: 16px'>The results of reproducing the code are similar to those in the paper. The model has a high recall, for the class 'malignant', which means that it successfully captures the cases where there is a diseased cell which is very important for imbalanced datasets such as this one.</p>