<table align="left">
  <td>
    <a href="https://is.gd/M5qGmU" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
  </td>
</table>

# Applied Machine Learning Techniques to Breast Cancer Recurrence Classification

## Abstract



Diego R. Páez Ardila - 2120653


**Updated version: 20/04/2023**


# Methodology

1. **General Pre-Processing (GPP)**: The dataset is studied to identify missing data, analyze the correlation between the attributes and the target class, check whether the target class is imbalanced, and determine whether the data need to be standardized.

    After the analysis of the data, the following actions were taken:
    
    1. **Remove "Sex" attribute**: The number of male records (4) is too small to be considered in the analysis. If only the male records are removed, the attribute is left with a single value (women), so it carries no information.

    2. **Standardize all continuous variables**: The continuous variables are standardized to have a mean of 0 and a standard deviation of 1.

    3. **Categorical attributes with int values**: The categorical attributes with int values stay as they are.

    4. **Correlation between attributes**: Buffa Hypoxia Score is highly correlated with Winter Hypoxia Score and Ragnum Hypoxia Score (0.84 and 0.72, respectively), so it is removed from the dataset.

    5. **Target Class**: The target class is unbalanced and has 4 classes related to the recurrence of breast cancer. Class 3 denotes the absence of recurrence, and classes 0, 1 and 2 denote the time at which the recurrence occurred. At this stage, no action is taken to balance the target class.

2. **Experiments**: Four experiments (E-1 to E-4) were defined to evaluate the performance of different machine learning models in breast cancer recurrence classification.

    1. **E-1**: Training of Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM), k-nearest neighbors (KNN) and XGBoost (XGB) models using the original dataset with the target class grouped into three classes (0 = early recurrence, 1 = mid/late recurrence and 2 = no recurrence). The original target class is unbalanced and has 4 classes related to the recurrence of breast cancer: class 3 denotes the absence of recurrence, and classes 0, 1 and 2 denote early, intermediate and late recurrence, respectively. The analysis of the target class shows that the recurrence classes 0, 1 and 2 can be grouped into two classes using k-means clustering: class 0 (early recurrence) and class 1 (mid/late recurrence). Class 3 (no recurrence) is then relabeled as class 2.

    2. **E-2 (E-1 + GPP)**: Training of Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM), k-nearest neighbors (KNN) and XGBoost (XGB) models using the dataset with GPP.

    3. **E-3 (E-2 + Feature Selection)**: A feature selection technique (mRMR) is applied to the dataset with GPP to select the most significant attributes for the target class. Then, the LR, NB, SVM, KNN and XGB models are trained.

    4. **E-4 (E-3 + Oversampling)**: SMOTE is applied to the dataset with GPP + Feature Selection to balance the target class. Then, the LR, NB, SVM, KNN and XGB models are trained.

3. **Test E-1 to E-4**: All the models trained in the experiments are tested using the test dataset.

4. **Model Evaluation**: The models are compared using Precision, Recall, F1-score, Cohen's Kappa Score, AUROC and Accuracy.

<a href="https://ibb.co/3YCjfWk"><img src="https://i.ibb.co/qNdzF57/DS-Cancer-v2.jpg" alt="DS-Cancer-v2" width=400 border="0"></a>
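
The evaluation metrics listed above can all be computed with scikit-learn. The following is a minimal sketch, not the project's `run_experiment` code: the synthetic three-class dataset, the Logistic Regression model and the choice of macro averaging are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 3-class data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.15, 0.25, 0.60],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# Macro averaging weights each class equally, which matters for imbalanced targets.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision_macro": precision,
    "recall_macro": recall,
    "f1_macro": f1,
    "cohen_kappa": cohen_kappa_score(y_test, y_pred),
    # One-vs-rest AUROC needs class probabilities, not hard labels.
    "auroc_ovr": roc_auc_score(y_test, y_proba, multi_class="ovr"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone can look good on an unbalanced target simply by favoring the majority class, which is one reason to read it together with Cohen's Kappa and the macro-averaged scores.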

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
import warnings
from collections import Counter

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from utilities.preprocessing import gpp_preprocess, mrmr_preprocess, kmeans_preprocess
from utilities.models import run_experiment
from utilities.tuning import load_params, tune_experiment, save_tuned_params

# Silence all warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
#mpl.rc('axes', labelsize=14)
#mpl.rc('xtick', labelsize=12)
#mpl.rc('ytick', labelsize=12)

# to make this notebook's output stable across runs
np.random.seed(42)



# 1 - General Pre-Processing (GPP)

The dataset is studied to identify missing data, analyze the correlation between the attributes and the target class, check whether the target class is imbalanced, and determine whether the data need to be standardized.

In [None]:
E_data = pd.read_excel('dadoscancer_4classes.xlsx')
E_data.head()

In [None]:
E_data.info()

In [None]:
E_data.nunique()

In [None]:
E_data.isnull().sum()

## Histogram Matrix

In [None]:
#E_data.hist(figsize=(14,14))
#plt.show()

## Graphs Analysis 

Out of 18 attributes of the dataset, 6 continuous variables and 12 categorical variables were identified.

<img src="images/dataset.png" alt="Metodolog-a-3"  width="600" border="0"> 

## Sex Attribute

The histogram of the dataset shows that the **Sex** attribute has records for both women and men. Although breast cancer can occur in men, the number of male records in the dataset is too small to be significant. We therefore count the male records and eliminate them. Once they are removed, the Sex variable has a single category, so it is discarded.

In [None]:
# Count the records per sex (0 = women, 1 = men)
counter = Counter(E_data["Sex"])
print(counter)
print('%s : %d' % ('Women', counter[0]))
print('%s : %d' % ('Men', counter[1]))

# Row positions of the women (0) and men (1) records
idx_0 = np.where(E_data.Sex == 0)
idx_Men = np.where(E_data.Sex == 1)

print(idx_Men[0])
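
The cell above only locates the male records. As a hedged illustration of the removal step described in the text, the sketch below filters a toy DataFrame and then drops the column. The toy values are invented, and whether `gpp_preprocess` removes the rows as well as the column is not shown in this notebook.

```python
import pandas as pd

# Toy frame; the 0 = women / 1 = men coding follows the counter output above.
toy = pd.DataFrame({
    "Sex": [0, 0, 1, 0, 1, 0],
    "Diagnosis Age": [54, 61, 70, 47, 66, 58],
    "Class": [3, 0, 3, 1, 2, 3],
})

# Keep only the records coded as women, then drop the now-constant column.
women_only = toy[toy["Sex"] == 0].drop(columns="Sex").reset_index(drop=True)

print(women_only.shape)
```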

## Correlation Matrix

The correlation between the attributes is analyzed to identify if there are attributes that are highly correlated. If there are highly correlated attributes, one of them is removed. The correlation matrix is shown below. 

In [None]:
#f, ax = plt.subplots(1, figsize=(15,15))
#sns.heatmap(E_data.corr(), annot=True, ax=ax,fmt="0.2f");
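
Besides reading the heatmap, the highly correlated pairs can be listed programmatically. This is a small sketch under assumptions: the helper name `highly_correlated_pairs`, the 0.7 threshold and the toy columns are illustrative and are not part of the project's utilities.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.7):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    pairs = []
    # Visit only the strict upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "score_a": base,
    "score_b": base + rng.normal(scale=0.3, size=200),  # built to track score_a
    "unrelated": rng.normal(size=200),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data, a call such as `highly_correlated_pairs(E_data)` would be expected to surface the hypoxia score pairs discussed in this notebook, provided the threshold is set below their correlations.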

## Target Class Analysis

It was identified that the distribution of data in the target class is unbalanced.

In [None]:
plt.figure(figsize=(5,5))
sns.countplot(x=E_data.Class);

In [None]:
# Count the records in each of the four original target classes
counter = Counter(E_data["Class"])
print('%s : %d' % ('Early Recurrence', counter[0]))
print('%s : %d' % ('Middle Recurrence', counter[1]))
print('%s : %d' % ('Late Recurrence', counter[2]))
print('%s : %d' % ('Non Recurrence', counter[3]))
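
To put a number on the imbalance, the class shares and the ratio between the largest and smallest class can be derived from the same kind of counter. The label counts below are made up for illustration and do not reproduce the dataset's real distribution.

```python
from collections import Counter

# Made-up labels: 60 non-recurrence (3), 10 early (0), 18 middle (1), 12 late (2).
labels = [3] * 60 + [0] * 10 + [1] * 18 + [2] * 12

counts = Counter(labels)
total = sum(counts.values())
for cls in sorted(counts):
    print(f"class {cls}: {counts[cls]} ({counts[cls] / total:.1%})")

# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```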

## GPP Conclusions and Actions

Considering the previous analysis, the following actions were taken:

1. **Remove "Sex" attribute**: The number of male records (4) is too small to be considered in the analysis. If only the male records are removed, the attribute is left with a single value (women), so it carries no information.

2. **Standardize all continuous variables**: The continuous variables are standardized to have a mean of 0 and a standard deviation of 1.

3. **Categorical attributes with int values**: The categorical attributes with int values stay as they are.

4. **Correlation between attributes**: Buffa Hypoxia Score is highly correlated with Winter Hypoxia Score and Ragnum Hypoxia Score (0.84 and 0.72, respectively), so it is removed from the dataset.

5. **Target Class**: The target class is unbalanced and has 4 classes related to the recurrence of breast cancer. Class 3 denotes the absence of recurrence, and classes 0, 1 and 2 denote the time at which the recurrence occurred. At this stage, no action is taken to balance the target class. Classes 1 and 2 are grouped into class 1 (mid/late recurrence), which leaves three classes.

## E-1 Dataset

In [None]:
# Merge class 1 and 2 into class 1, and replace 3 with 2
E_data['Class'] = E_data['Class'].replace([2, 3], [1,2])
E_data.value_counts('Class')
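
Note that the cell above overwrites `E_data['Class']` in place, so running it a second time would also move the relabeled non-recurrence class (2) into class 1. One possible way to make the step safe to re-run, sketched here on toy labels, is to map from the untouched original column into a new one. The `Class_grouped` column name is an assumption for illustration.

```python
import pandas as pd

# Original 4-class coding -> grouped 3-class coding used from E-1 onwards.
CLASS_MAP = {0: 0, 1: 1, 2: 1, 3: 2}

toy = pd.DataFrame({"Class": [0, 1, 2, 3, 3, 2, 0, 3]})

# Writing to a new column keeps the original labels available, so re-running
# the cell always yields the same result.
toy["Class_grouped"] = toy["Class"].map(CLASS_MAP)

print(toy["Class_grouped"].value_counts().sort_index())
```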

In [None]:
E1_data = E_data.copy()
E1_data.head()

In [None]:
figura = plt.figure(figsize=(5,3))
sns.countplot(x=E1_data.Class);

In [None]:
E1_data["Class"].value_counts()

In [None]:
counter = Counter(E1_data["Class"])
print('%s : %d' % ('Early Recurrence', counter[0]))
print('%s : %d' % ('Mid/late Recurrence', counter[1]))
print('%s : %d' % ('Non Recurrence', counter[2]))

## E-2 Dataset

In [None]:
drop_cols = ['Sex', 'Buffa Hypoxia Score']
standardize_cols = ["Diagnosis Age","Aneuploidy Score","Ragnum Hypoxia Score",
                    "Winter Hypoxia Score","Fraction Genome Altered", "Mutation Count"]
E2_data = gpp_preprocess(E1_data, drop_cols, standardize_cols)
E2_data.head()
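
The body of `gpp_preprocess` lives in `utilities/preprocessing.py` and is not shown in this notebook. As a rough sketch of what such a step could look like, assuming it simply drops the listed columns and z-scores the continuous ones, consider the hypothetical `gpp_preprocess_sketch` below (toy values, not the real data).

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def gpp_preprocess_sketch(df, drop_cols, standardize_cols):
    """Hypothetical stand-in for utilities.preprocessing.gpp_preprocess."""
    out = df.drop(columns=drop_cols).copy()
    # StandardScaler centers each column and divides by its population
    # standard deviation (ddof=0), giving mean 0 and standard deviation 1.
    out[standardize_cols] = StandardScaler().fit_transform(out[standardize_cols])
    return out

toy = pd.DataFrame({
    "Sex": [0, 0, 0, 1],
    "Diagnosis Age": [40.0, 50.0, 60.0, 70.0],
    "Mutation Count": [10.0, 20.0, 30.0, 100.0],
    "Class": [0, 1, 2, 2],
})
result = gpp_preprocess_sketch(toy, ["Sex"], ["Diagnosis Age", "Mutation Count"])
print(result.round(2))
```

A stricter setup would fit the scaler on the training split only and reuse it on the test split, since fitting on the full dataset lets test-set statistics influence the training features.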

## E-3 Dataset
This dataset is used in the experiment E-3. It is the dataset with GPP + Feature Selection with mRMR.

In [None]:
E3_data = mrmr_preprocess(E2_data, "Class")
E3_data.head()
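
mRMR (minimum redundancy, maximum relevance) greedily picks features that are informative about the target while being weakly related to the features already chosen. The sketch below illustrates that idea only. It is not the implementation behind `mrmr_preprocess`, and it makes simplifying assumptions: relevance is measured by mutual information with the target, redundancy is approximated by the mean absolute Pearson correlation, and the two are combined by a plain difference.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

def mrmr_sketch(X, y, k, random_state=0):
    """Greedy mRMR-style selection: high relevance, low redundancy."""
    relevance = pd.Series(
        mutual_info_classif(X, y, random_state=random_state), index=X.columns)
    abs_corr = X.corr().abs()
    selected = [relevance.idxmax()]  # start with the most relevant feature
    while len(selected) < k:
        candidates = [c for c in X.columns if c not in selected]
        # Redundancy: mean |correlation| with the features already chosen.
        redundancy = abs_corr.loc[candidates, selected].mean(axis=1)
        scores = relevance[candidates] - redundancy
        selected.append(scores.idxmax())
    return selected

# Synthetic data with informative and redundant columns (assumption).
X_arr, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                               n_redundant=2, n_classes=3, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
chosen = mrmr_sketch(X, y, k=4)
print(chosen)
```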

## E-4 Dataset

This dataset is used in experiment E-4. It is the dataset with GPP + Feature Selection with mRMR + Oversampling with SMOTE. Here we only copy the E-3 dataset; SMOTE is applied to balance the target class when the experiment is executed.

In [None]:
E4_data = E3_data.copy()
E4_data.head()
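
SMOTE balances a class by creating synthetic samples on the line segments between a minority sample and one of its nearest minority neighbors. The hand-written sketch below shows only that core interpolation step on random toy data. It is not the code used by `run_experiment`, and the function name `smote_sketch` is hypothetical. In practice, the `SMOTE` class from the imbalanced-learn package provides a complete implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)             # random minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(0)
X_minority = rng.normal(loc=2.0, size=(20, 3))  # toy minority-class samples
X_synthetic = smote_sketch(X_minority, n_new=30)
print(X_synthetic.shape)
```

Oversampling is normally applied to the training split only, so that the test set contains real records exclusively. Deferring SMOTE until the experiment is executed is consistent with that practice.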



---



# 2 - Run Experiments



## E-1 (Experiment 1)

**E-1**: Training of Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and k-nearest neighbors (KNN) models using the original dataset.

The performance metrics are computed on the original dataset to provide an initial reference point.

In [None]:
default_params = load_params("tune_params/default_params_4C.json")

In [None]:
#E1_Results = run_experiment(E1_data, "Class","E1", default_params)
#E1_Results

## E-2 (Experiment 2)

**E-2**: Training of Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM), k nearest neighbors (KNN) models using the dataset with GPP.

In [None]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning)

In [None]:
E2_Results = run_experiment(E2_data, "Class","E2", default_params, num_exp=10)
E2_Results
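
The internals of `run_experiment` are not shown here. Assuming that `num_exp` is the number of repeated train/test runs whose scores are aggregated, the idea can be sketched as follows. The helper `repeated_f1`, the synthetic data and the single Logistic Regression model are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def repeated_f1(X, y, num_exp=10, test_size=0.3):
    """Mean and std of macro F1 over num_exp stratified splits (hypothetical helper)."""
    scores = []
    for seed in range(num_exp):
        # A different seed per run gives a different stratified split each time.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(f1_score(y_te, model.predict(X_te), average="macro"))
    return float(np.mean(scores)), float(np.std(scores))

X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=3, weights=[0.2, 0.3, 0.5], random_state=42)
mean_f1, std_f1 = repeated_f1(X, y, num_exp=10)
print(f"macro F1: {mean_f1:.3f} +/- {std_f1:.3f}")
```

Reporting the spread alongside the mean helps judge whether differences between experiments exceed the split-to-split variation.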

## E-3 (Experiment 3)

**E-3**: A feature selection technique (mRMR) is applied to the dataset with GPP to select the most significant attributes for the target class. Then, the Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and k-nearest neighbors (KNN) models are trained.

In [None]:
E3_Results = run_experiment(E3_data, "Class","E3", default_params, num_exp=10)
E3_Results

## E-4 (Experiment 4)

**E-4**: SMOTE is applied to the dataset with GPP + Feature Selection to balance the target class. Then, the Logistic Regression (LR), Naive Bayes (NB), Support Vector Machine (SVM) and k-nearest neighbors (KNN) models are trained.

In [None]:
E4_Results = run_experiment(E4_data, "Class","E4", default_params, num_exp=10)
E4_Results

----

# 3 - Tuning with GridSearchCV

The models are tuned with GridSearchCV to find the best parameters for each model. The parameter grids are defined in the [tuning_parameters.yml](tune_params/tuning_parameters.yml) file.

The tuning results are saved as `json` files in the `tune_params/` folder.

In [None]:
tune_params = load_params("tune_params/tuning_parameters.json")
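
For readers unfamiliar with GridSearchCV, the sketch below tunes a single KNN model on synthetic data and serializes the winning parameters to JSON. It is not the project's `tune_experiment` function: the grid, the `f1_macro` scoring and the 5-fold stratified split are assumptions chosen for illustration.

```python
import json

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

# Illustrative grid only; the project's real grids live in tune_params/.
param_grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

# best_params_ is a plain dict of values taken from the grid,
# so it serializes directly to JSON.
best_json = json.dumps({"KNN": search.best_params_}, indent=2)
print(best_json)
```

Stratified folds keep the class proportions of an unbalanced target roughly constant across folds, which makes the per-fold scores more comparable.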

## Tuning E-1

In [None]:
#E1_tune_bests = tune_experiment(E1_data,"Class","E1",tune_params,default_params)
#file_path = "tune_params/best_4C/E1_best_params.json"
#save_tuned_params(E1_tune_bests, file_path)

## Tuning E-2

In [None]:
E2_tune_bests = tune_experiment(E2_data,"Class","E2",tune_params,default_params)
file_path = "tune_params/best_4C/E2_best_params.json"
save_tuned_params(E2_tune_bests, file_path)

## Tuning E-3

In [None]:
E3_tune_bests = tune_experiment(E3_data,"Class","E3",tune_params,default_params)
file_path = "tune_params/best_4C/E3_best_params.json"
save_tuned_params(E3_tune_bests, file_path)

## Tuning E-4

In [None]:
E4_tune_bests = tune_experiment(E4_data,"Class","E4",tune_params,default_params)
file_path = "tune_params/best_4C/E4_best_params.json"
save_tuned_params(E4_tune_bests, file_path)

---

# 4 - Run Experiments With Tuning Parameters

In [None]:
E1_best_params = load_params("tune_params/best_4C/E1_best_params.json")
E2_best_params = load_params("tune_params/best_4C/E2_best_params.json")
E3_best_params = load_params("tune_params/best_4C/E3_best_params.json")
E4_best_params = load_params("tune_params/best_4C/E4_best_params.json")

## E-1 (Experiment 1)

In [None]:
#E1_Results = run_experiment(E1_data, "Class","E1", E1_best_params)
#E1_Results

## E-2 (Experiment 2)

In [None]:
E2_Results = run_experiment(E2_data, "Class","E2", E2_best_params, num_exp=10)
E2_Results

## E-3 (Experiment 3)

In [None]:
E3_Results = run_experiment(E3_data, "Class","E3", E3_best_params, num_exp=10)
E3_Results

## E-4 (Experiment 4)

In [None]:
E4_Results = run_experiment(E4_data, "Class","E4", E4_best_params, num_exp=10)
E4_Results
