# Machine Learning Project

## Sawsan Daban - Alaa AlSharekh

<img src="HEADER-IMAGE.png" width="" align="" />

## Learning tasks

- Binary Classification - Recurrence Prediction:

Task: Predict whether a patient will experience a recurrence of breast cancer (binary outcome: recurrence or no recurrence).
Target Variable: Recurrence column.

- Multiclass Classification - Cancer Severity Prediction:

Task: Predict the severity level of breast cancer, expressed as the degree of malignancy, from the other available features.
Target Variable: degree-of-malignance (three classes in this dataset: 1, 2 and 3).

- Regression - Age Prediction:

Task: Predict the age of a patient based on other available features.
Target Variable: Age.

- Categorical Classification - Breast Type Prediction:

Task: Predict the type of breast involved (left or right).
Target Variable: Breast column.

- Categorical Classification - Menopause Prediction:

Task: Predict the menopausal status of a patient (the column holds the categories 'premeno', 'ge40' and 'lt40').
Target Variable: Menopause column.

- Clustering - Patient Segmentation:

Task: Cluster patients based on their features to identify subgroups with similar characteristics.

- Association Rule Mining - Patterns in Treatment:

Task: Identify patterns or associations between different features and the type of treatment received (Irradiation).

## Machine Learning Model

For predicting disease recurrence (a binary classification task), several machine learning models can be applied. The choice of model often depends on various factors like the size of the dataset, the complexity of relationships, interpretability, and computational efficiency. Here are some suitable ML models for predicting disease recurrence:

- Logistic Regression:

Suitable for binary classification tasks.
Interpretable and provides probabilities.
Works well when the classes can be separated by a roughly linear decision boundary.

- Decision Trees and Random Forests:

Effective for classification tasks.
Can handle nonlinear relationships and interactions between features.
Random Forests reduce overfitting and increase accuracy by combining multiple decision trees.

- Support Vector Machines (SVM):

Effective in high-dimensional spaces.
Works well with both linear and nonlinear data.
Finds the best separation boundary (hyperplane) between classes.

- Neural Networks:

Deep learning models suitable for complex, nonlinear relationships.
Requires more data and computational power but can capture intricate patterns.

- Naive Bayes:

Simple and efficient for binary classification.
Assumes independence between features (which might not hold true in all cases).

- K-Nearest Neighbors (KNN):

Non-parametric and instance-based method.
Predicts based on the majority class among its nearest neighbors.


When choosing a model, considerations include the dataset size, feature importance, interpretability, computational resources, and the trade-off between accuracy and model complexity. Additionally, techniques such as cross-validation, hyperparameter tuning, and feature selection can enhance model performance.
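The comparison described above can be sketched as follows. This is a minimal illustration rather than the project's actual experiment: it uses a synthetic dataset from `make_classification` as a stand-in for the real data (an assumption), and the names `X_demo`, `y_demo`, `candidates` and `mean_f1` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 286 rows, 9 features, roughly 70/30 classes
X_demo, y_demo = make_classification(
    n_samples=286, n_features=9, n_informative=4, weights=[0.7, 0.3], random_state=0
)

# Distance- and gradient-based models get a scaler in front of them
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "naive_bayes": GaussianNB(),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Mean F1 of the positive class over 5 cross-validation folds
mean_f1 = {
    name: cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="f1").mean()
    for name, estimator in candidates.items()
}
for name, score in sorted(mean_f1.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} mean F1 = {score:.3f}")
```

F1 is used here instead of accuracy because the classes are imbalanced; any other scorer name accepted by scikit-learn could be substituted.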

# Preparing a dataset for machine learning

## Data summary 

> Population:
The dataset appears to represent a sample of individuals, likely patients, from a medical context. These individuals are described based on various characteristics related to medical conditions.

> Observations:
Each row in the dataset represents an observation of an individual patient. The dataset has multiple observations, each corresponding to a different patient.

Quantitative features: 
1. Age: Numerical data representing the age of individuals. 
2. Tumor-size: Although it appears as a string ('15-19', '35-39', etc.), it represents numerical intervals. 
3. Nodes: Numerical data representing the number of nodes. It is a discrete numerical variable.

Categorical features: 
1. Menopause: Categorical data representing the menopausal status of individuals ('premeno', 'ge40', 'lt40'). 
2. Degree-of-malignance: Categorical data representing the degree of malignancy ('3', '1', '2'). 
3. Breast: Categorical data indicating the side of the breast ('right' or 'left'). 
4. Breast-quad: Categorical data representing the quadrant of the breast ('left_up', 'central', etc.). 
5. Irradiation: Categorical data indicating whether irradiation was done ('yes' or 'no'). 
6. Recurrence: Categorical data indicating the occurrence of events ('recurrence-events' or 'no-recurrence-events').
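Because Tumor-size is stored as interval strings, one possible way to use it as a quantitative feature is to map each interval to its midpoint. The sketch below assumes the values keep their embedded single quotes (as in the data preview) and uses a hypothetical helper `interval_midpoint`; it is not part of the original notebook.

```python
import pandas as pd


def interval_midpoint(label: str) -> float:
    """Convert a label such as "'15-19'" to the midpoint of its range (17.0)."""
    low, high = label.strip("'").split("-")
    return (int(low) + int(high)) / 2


# A few interval labels in the same format as the Tumor-size column
tumor_size = pd.Series(["'15-19'", "'35-39'", "'30-34'"], name="Tumor-size")
tumor_mid = tumor_size.map(interval_midpoint)
print(tumor_mid.tolist())  # [17.0, 37.0, 32.0]
```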

In [1]:
# Libraries 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
! pip install openpyxl
! pip install --upgrade pip

# To read the data
Data = pd.read_excel('Disease-Reccurence DCS 873.xlsx')


In [2]:
# Display the current data frame
Data

Unnamed: 0,Age,Menopause,Tumor-size,nodes,Unnamed: 4,degree-of-malignance,Breast,Breast-quad,Irradiation,reccurence
0,45,'premeno','15-19',1,'yes','3','right','left_up','no','recurrence-events'
1,57,'ge40','15-19',0,'no','1','right','central','no','no-recurrence-events'
2,56,'ge40','35-39',2,'no','2','left','left_low','no','recurrence-events'
3,42,'premeno','35-39',2,'yes','3','right','left_low','yes','no-recurrence-events'
4,44,'premeno','30-34',5,'yes','2','left','right_up','no','recurrence-events'
...,...,...,...,...,...,...,...,...,...,...
281,51,'ge40','30-34',7,'yes','2','left','left_low','no','no-recurrence-events'
282,57,'premeno','25-29',3,'yes','2','left','left_low','yes','no-recurrence-events'
283,38,'premeno','30-34',8,'yes','2','right','right_up','no','no-recurrence-events'
284,57,'premeno','15-19',2,'no','2','right','left_low','no','no-recurrence-events'


## Quality report

### Data Exploration and Understanding (EDA)


EDA plays a critical role in the machine learning workflow. It helps us dig into the data, uncover patterns, and understand how different things relate to each other. This exploration is crucial before we dive into building machine learning models, as it gives us the necessary insights for better decision-making.

In [3]:
Data.shape

(286, 10)

In [4]:
# Info about the dataset
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286 entries, 0 to 285
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Age                   286 non-null    int64 
 1   Menopause             286 non-null    object
 2   Tumor-size            286 non-null    object
 3   nodes                 286 non-null    int64 
 4   Unnamed: 4            278 non-null    object
 5   degree-of-malignance  286 non-null    object
 6   Breast                286 non-null    object
 7   Breast-quad           285 non-null    object
 8   Irradiation           286 non-null    object
 9   reccurence            286 non-null    object
dtypes: int64(2), object(8)
memory usage: 22.5+ KB


In [5]:
# Describe the numerical features Age and nodes
Data.describe()

Unnamed: 0,Age,nodes
count,286.0,286.0
mean,52.073427,2.818182
std,10.433323,3.421575
min,25.0,0.0
25%,45.0,1.0
50%,52.0,2.0
75%,59.0,3.0
max,77.0,25.0


In [6]:
# Data Profiling EDA

!pip install ydata-profiling
!pip install ipywidgets==8.1.1
!pip install --upgrade pip


In [7]:
from ydata_profiling import ProfileReport

# Generate the data profiling report 
report1 = ProfileReport(Data, title='Disease-Reccurence - report1 ')
report1.to_file("Disease-Reccurence - report1.html")


In [8]:
report1



The profiling report gives a concise overview of the dataset: it comprises 10 variables and 286 observations, of which 8 variables are categorical and 2 are numerical. The report also flags the presence of duplicate rows and an unnamed column.

> Please find below the full report:
https://msc-science-in-computing-2023.github.io/ML-Reports1

## Apply quality controls

### Data cleaning: handling duplication 

2 duplicate rows were identified.

In [9]:
# Checking duplication
duplicate_rows = Data[Data.duplicated()]
duplicate_rows

Unnamed: 0,Age,Menopause,Tumor-size,nodes,Unnamed: 4,degree-of-malignance,Breast,Breast-quad,Irradiation,reccurence
178,47,'premeno','25-29',1,'no','2','right','left_low','no','recurrence-events'
239,56,'ge40','40-44',7,'yes','3','left','left_low','yes','recurrence-events'


In [10]:
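# Remove duplicates; .copy() gives an independent DataFrame so that later
# in-place edits do not raise SettingWithCopyWarning
New_data = Data.drop_duplicates().copy()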

In [12]:
New_data.shape

(284, 10)

After removing the duplicates, the number of rows decreased from 286 to 284.

### Data cleaning: missing column name

Based on the problem statement, the unnamed column (Unnamed: 4) was identified as the Node-caps feature.

In [13]:
# Assign a name to the unnamed column  (Unnamed: 4)
New_data.rename(columns={'Unnamed: 4': 'Node-caps'}, inplace=True)



As an additional step, the column names are unified so that each one starts with a capital letter.

In [14]:
# Rename columns so that every name starts with a capital letter
New_data.rename(columns={
    'nodes': 'Nodes',
    'degree-of-malignance': 'Degree-of-malignance',
    'reccurence': 'Reccurence',
}, inplace=True)


### Data cleaning: unified missing values

Some techniques to handle Null values:

1. Remove rows with null values.
2. Fill null values with a specific value such as 'NA' or 'Unknown'.

Based on the report, there are 8 missing values in Node-caps out of 284 observations, which is about 2.8% of the rows (8 / 284).
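The two techniques listed above can be sketched on a small example. The frame `toy` and the names `missing_pct`, `dropped` and `filled` are hypothetical and only illustrate the pandas calls; they are not the project's data.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing Node-caps entries
toy = pd.DataFrame({
    "Nodes": [2, 3, 10, 1],
    "Node-caps": ["'yes'", np.nan, "'no'", np.nan],
})

# Share of missing values per column, in percent
missing_pct = toy.isna().mean() * 100

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: keep the rows and mark the gap with an explicit label
filled = toy.fillna({"Node-caps": "Unknown"})

print(missing_pct)
print(len(dropped), "rows remain after dropna()")
print(filled["Node-caps"].tolist())
```

Dropping rows is simple but discards the other features of those patients; filling with a label keeps every row at the cost of adding an extra category.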

In [15]:
# Node-caps

# Check for null values in 'Node-caps' column
missing_node_caps_values = New_data['Node-caps'].apply(lambda x: pd.isnull(x) or (isinstance(x, str) and (x.strip() == '' or x.strip().lower() == 'nan')))

# Display the rows where the column had missing values
New_data[missing_node_caps_values]

Unnamed: 0,Age,Menopause,Tumor-size,Nodes,Node-caps,Degree-of-malignance,Breast,Breast-quad,Irradiation,Reccurence
20,56,'lt40','20-24',2,,'1','left','left_low','no','recurrence-events'
31,68,'ge40','25-29',3,,'1','right','left_low','yes','no-recurrence-events'
50,73,'ge40','15-19',10,,'1','left','left_low','yes','recurrence-events'
54,48,'premeno','25-29',1,,'2','left','right_low','yes','no-recurrence-events'
71,61,'ge40','25-29',5,,'1','right','left_up','yes','no-recurrence-events'
92,51,'lt40','20-24',0,,'1','left','left_up','no','recurrence-events'
149,50,'ge40','30-34',9,,'3','left','left_up','yes','no-recurrence-events'
264,57,'ge40','30-34',11,,'3','left','left_low','yes','no-recurrence-events'


In [16]:
# Replace null values and empty strings with None
New_data.loc[missing_node_caps_values, 'Node-caps'] = None

# Display the rows where the column had missing values
New_data[missing_node_caps_values]



Unnamed: 0,Age,Menopause,Tumor-size,Nodes,Node-caps,Degree-of-malignance,Breast,Breast-quad,Irradiation,Reccurence
20,56,'lt40','20-24',2,,'1','left','left_low','no','recurrence-events'
31,68,'ge40','25-29',3,,'1','right','left_low','yes','no-recurrence-events'
50,73,'ge40','15-19',10,,'1','left','left_low','yes','recurrence-events'
54,48,'premeno','25-29',1,,'2','left','right_low','yes','no-recurrence-events'
71,61,'ge40','25-29',5,,'1','right','left_up','yes','no-recurrence-events'
92,51,'lt40','20-24',0,,'1','left','left_up','no','recurrence-events'
149,50,'ge40','30-34',9,,'3','left','left_up','yes','no-recurrence-events'
264,57,'ge40','30-34',11,,'3','left','left_low','yes','no-recurrence-events'


In [None]:
New_data['Node-caps'].unique()

array(["'yes'", "'no'", None], dtype=object)
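The unique values printed above show that the category labels carry embedded single quotes (for example "'yes'" rather than "yes"). If cleaner labels are preferred, one option is to strip the quotes, as in the sketch below; the toy frame and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame in the same format as the raw columns
toy = pd.DataFrame({
    "Node-caps": ["'yes'", "'no'", None],
    "Breast": ["'left'", "'right'", "'left'"],
})

# Strip the surrounding single quotes; missing entries stay missing
cleaned = toy.copy()
for col in ["Node-caps", "Breast"]:
    cleaned[col] = cleaned[col].str.strip("'")

print(cleaned)
```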

In [None]:
# Breast-quad

# Check for null values in 'Breast-quad' column
missing_breast_quad_values = New_data['Breast-quad'].apply(lambda x: pd.isnull(x) or (isinstance(x, str) and (x.strip() == '' or x.strip().lower() == 'nan')))

# Display the rows where the column had missing values
New_data[missing_breast_quad_values]

Unnamed: 0,Age,Menopause,Tumor-size,Nodes,Node-caps,Degree-of-malignance,Breast,Breast-quad,Irradiation,Reccurence
240,59,'ge40','30-34',1,'no','3','left',,'no','recurrence-events'


In [None]:
# Replace null values and empty strings with None
New_data.loc[missing_breast_quad_values, 'Breast-quad'] = None

# Display the rows where the column had missing values
New_data[missing_breast_quad_values]



Unnamed: 0,Age,Menopause,Tumor-size,Nodes,Node-caps,Degree-of-malignance,Breast,Breast-quad,Irradiation,Reccurence
240,59,'ge40','30-34',1,'no','3','left',,'no','recurrence-events'


In [None]:
New_data['Breast-quad'].unique()

array(["'left_up'", "'central'", "'left_low'", "'right_up'",
       "'right_low'", None], dtype=object)

### Data cleaning: checking outliers

In [None]:
# for Age

import plotly.express as px

fig = px.box(New_data, y="Age", points="all")
fig.show()
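A box plot flags points outside the whiskers; a common convention (Tukey's rule, assumed here) places the fences at 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the quartiles reported by `Data.describe()` earlier in this notebook. The helper `iqr_fences` is hypothetical, and the plotting library may compute quartiles slightly differently.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the Data.describe() output (computed before deduplication)
age_low, age_high = iqr_fences(q1=45.0, q3=59.0)
nodes_low, nodes_high = iqr_fences(q1=1.0, q3=3.0)

print("Age fences:", (age_low, age_high), "observed range: 25 to 77")
print("Nodes fences:", (nodes_low, nodes_high), "observed range: 0 to 25")
```

Under this rule the Age fences are 24 and 80, so the observed ages (25 to 77) contain no outliers, while the Nodes fences are -2 and 6, so node counts above 6 (up to the maximum of 25) would be flagged.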


Validation step

In [None]:
from ydata_profiling import ProfileReport

# Generate the data profiling report 
report2 = ProfileReport(New_data, title='Disease-Reccurence - report2 ')
report2.to_file("Disease-Reccurence - report2.html")




In [None]:
report2



> Please find below the full report:
https://msc-science-in-computing-2023.github.io/ML-Reports2

The report produced after these checks indicates that missing values are present in the dataset and still need to be handled.

### Data cleaning: Removing None values

In [None]:
# Current dataset with None = New_data 
# New dataset without None = New_data_without_none 

New_data_without_none = New_data.dropna().copy()  # independent copy, safe to modify in place


In [None]:
New_data_without_none.shape

(275, 10)

In [None]:
from ydata_profiling import ProfileReport

# Generate the data profiling report 
report3 = ProfileReport(New_data_without_none, title='Disease-Reccurence - report3 ')
report3.to_file("Disease-Reccurence - report3.html")




In [None]:
report3



> Please find below the full report:
https://msc-science-in-computing-2023.github.io/ML-Reports3

After removing the rows that contain None values, the dataset decreased from 284 to 275 rows.

### Data cleaning: encoding categorical variables

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# Encoding categorical variables
#encoder = LabelEncoder()
#categorical_cols = ['Menopause', 'Tumor-size', 'Node-caps', 'Degree-of-malignance', 'Breast', 'Breast-quad', 'Irradiation', 'Reccurence']
#for col in categorical_cols:
#    New_data[col] = encoder.fit_transform(New_data[col])

In [None]:
# Encoding New_data_without_none

encoder = LabelEncoder()
categorical_cols = ['Menopause', 'Tumor-size', 'Node-caps', 'Degree-of-malignance', 'Breast', 'Breast-quad', 'Irradiation', 'Reccurence']
for col in categorical_cols:
    New_data_without_none[col] = encoder.fit_transform(New_data_without_none[col])






In [None]:
New_data_without_none

Unnamed: 0,Age,Menopause,Tumor-size,Nodes,Node-caps,Degree-of-malignance,Breast,Breast-quad,Irradiation,Reccurence
0,45,2,2,1,1,2,1,2,0,1
1,57,0,2,0,0,0,1,0,0,0
2,56,0,6,2,0,1,0,1,0,1
3,42,2,6,2,1,2,1,1,1,0
4,44,2,5,5,1,1,0,4,0,1
...,...,...,...,...,...,...,...,...,...,...
281,51,0,5,7,1,1,0,1,0,0
282,57,2,4,3,1,1,0,1,1,0
283,38,2,5,8,1,1,1,4,0,0
284,57,2,2,2,0,1,1,1,0,0
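One caveat with `LabelEncoder` is that it assigns codes in sorted string order, which does not necessarily match the natural order of an ordinal feature such as Tumor-size. The sketch below contrasts it with `OrdinalEncoder` and an explicit category order. The interval labels, in particular '5-9', are an illustrative subset and an assumption about the data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Illustrative subset of interval labels (assumed, not taken from the dataset)
sizes = pd.DataFrame({"Tumor-size": ["'5-9'", "'15-19'", "'35-39'", "'15-19'"]})

# LabelEncoder sorts labels as strings, so '5-9' ends up after '35-39'
label_codes = LabelEncoder().fit_transform(sizes["Tumor-size"])

# OrdinalEncoder with an explicit category order keeps the size ranking
size_order = ["'5-9'", "'15-19'", "'35-39'"]
ordinal_codes = OrdinalEncoder(categories=[size_order]).fit_transform(sizes).ravel()

print(label_codes.tolist())    # [2, 0, 1, 0]
print(ordinal_codes.tolist())  # [0.0, 1.0, 2.0, 1.0]
```

For nominal features without a natural order (for example Breast-quad), one-hot encoding is a common alternative, since integer codes imply distances that distance-based models such as KNN will use.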


### Data cleaning: imbalanced data


Resampling: Address class imbalance using techniques like oversampling (creating more samples of the minority class) or undersampling (reducing samples of the majority class).
Use Appropriate Evaluation Metrics: Consider metrics like precision, recall, F1-score, or AUC-ROC for imbalanced datasets instead of accuracy.
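As a minimal sketch of the oversampling idea, the example below uses `sklearn.utils.resample` on a toy frame that has the same class counts as the cleaned dataset reported further down (196 versus 79). The frame and its `feature` column are placeholders, not the project's data.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame with the class counts of the cleaned dataset (196 vs 79)
toy = pd.DataFrame({"Reccurence": [0] * 196 + [1] * 79})
toy["feature"] = range(len(toy))

majority = toy[toy["Reccurence"] == 0]
minority = toy[toy["Reccurence"] == 1]

# Sample the minority class with replacement until it matches the majority
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["Reccurence"].value_counts())
```

In practice, resampling should be applied to the training split only; otherwise copies of the same row can land in both the training and the test set and inflate the scores.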

Several techniques can be employed to assess whether the target variable is imbalanced or not.

In [None]:
# Class distribution 

class_distribution = New_data['Reccurence'].value_counts()
class_distribution

'no-recurrence-events'    201
'recurrence-events'        83
Name: Reccurence, dtype: int64

In [None]:
# Class distribution 

class_distribution = New_data_without_none['Reccurence'].value_counts()
class_distribution

0    196
1     79
Name: Reccurence, dtype: int64

In [None]:
# Class ratio

class_ratios = New_data['Reccurence'].value_counts(normalize=True)
class_ratios


'no-recurrence-events'    0.707746
'recurrence-events'       0.292254
Name: Reccurence, dtype: float64

In [None]:
# Class ratio

class_ratios = New_data_without_none['Reccurence'].value_counts(normalize=True)
class_ratios


0    0.712727
1    0.287273
Name: Reccurence, dtype: float64
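With roughly 71% of the patients in the no-recurrence class, a classifier that always predicts the majority class already reaches about 71% accuracy, which is a useful reference point for the models that follow. The sketch below reproduces that baseline with `DummyClassifier`, using only the class counts reported above; `X_demo` is a placeholder feature matrix.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the class counts reported above for the cleaned data
y_demo = np.array([0] * 196 + [1] * 79)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```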

# Implementation and Evaluation

Experimenting with multiple models and evaluating their performance using metrics like accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC) would help determine the most suitable model for predicting disease recurrence in your specific dataset.

## Naïve Bayes Classifier GaussianNB

Description: Naïve Bayes is a probabilistic classifier based on Bayes' theorem. It assumes independence between features given the class.
Implementation: Train the Naïve Bayes model using the dataset.
Evaluation: Assess model performance using accuracy, precision, recall, and F1-score for both hold-out sampling and K-fold cross-validation.

In [None]:
from sklearn                 import preprocessing
from sklearn.metrics         import classification_report
from sklearn.metrics         import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes     import GaussianNB
from sklearn.metrics         import accuracy_score
from sklearn.preprocessing   import LabelEncoder, MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.impute          import SimpleImputer

For continuous and categorical data

Data Splitting:
Train-Test Split: Divide the dataset into training and testing sets to train the model on one subset and validate its performance on another.

In [None]:
# Define features and target variable
x = New_data_without_none.drop('Reccurence', axis=1)
y = New_data_without_none['Reccurence']

Feature Scaling/Normalization:
Scale Numerical Features: Standardize or normalize numerical features to a similar scale to avoid bias in models that rely on distance measures.

In [None]:
# Select the numeric columns (the label-encoded categorical columns are integers as well)
numeric_features = x.select_dtypes(include=['float64', 'int64'])

# Normalizing numeric features
scaler = MinMaxScaler()
x_normalized_numeric = scaler.fit_transform(numeric_features)

# Ensure x and y have the same number of rows
assert len(x_normalized_numeric) == len(y), "Number of samples in x and y must be the same."

In [None]:
# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x_normalized_numeric, y, test_size=0.2, random_state=42)
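A note on ordering: fitting the scaler on the full dataset before splitting lets the test rows influence the scaling parameters. A stricter variant, sketched below on synthetic data (an assumption, as are all `*_demo` names), splits first and fits the scaler on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded feature matrix and target
X_demo, y_demo = make_classification(n_samples=275, n_features=9, random_state=42)

# Split first so the test rows never influence the scaling parameters
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # learn min/max on training data only
X_te_scaled = scaler_demo.transform(X_te)      # reuse the training min/max
```

The `stratify` argument keeps the class ratio similar in both splits, which helps with an imbalanced target.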

Initialize the model

In [None]:
# Modeling and evaluation
model = GaussianNB()

# Train the model
model.fit(x_train, y_train)

Cross-Validation:
Use cross-validation techniques to assess model performance robustly and avoid overfitting.

Hold-out Sampling: 
Split the dataset into training and testing sets, train the models on the training set, and evaluate their performance on the unseen testing set.

K-fold Cross-Validation: 
Divide the dataset into k subsets (folds), train the models on k-1 folds, and validate on the remaining fold. This process is repeated k times, and the performance is averaged.
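The k-fold procedure described above can be sketched with `cross_validate`, which reports several metrics per fold. This example is an illustration on synthetic data (an assumption), with the scaler placed inside a pipeline so that it is refit on the training folds only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with a roughly 70/30 class split
X_demo, y_demo = make_classification(
    n_samples=275, n_features=9, weights=[0.7, 0.3], random_state=42
)

# The scaler is refit inside every fold, so validation rows stay unseen
pipeline = make_pipeline(MinMaxScaler(), GaussianNB())
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(pipeline, X_demo, y_demo, cv=folds, scoring=metrics)

for metric in metrics:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric:9s} mean = {scores.mean():.3f}  std = {scores.std():.3f}")
```

Reporting the standard deviation alongside the mean shows how much the score depends on the particular split, which matters for a dataset of this size.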

In [None]:
# Testing the model on hold-out set
y_pred_holdout = model.predict(x_test)
print('Hold-out Test Accuracy: {:.3f}'.format(model.score(x_test, y_test)))
print('Hold-out Training Accuracy: {:.3f}'.format(model.score(x_train, y_train)))


Hold-out Test Accuracy: 0.727
Hold-out Training Accuracy: 0.736


In [None]:
# Evaluate the model using k-fold cross-validation
cv_scores = cross_val_score(model, x_normalized_numeric, y, cv=5)  # You can adjust the number of folds (cv) as needed

# Results
print('K-fold Cross-Validation Scores:', cv_scores)
print('Average K-fold Cross-Validation Score: {:.3f}'.format(cv_scores.mean()))

K-fold Cross-Validation Scores: [0.70909091 0.76363636 0.69090909 0.8        0.69090909]
Average K-fold Cross-Validation Score: 0.731


The hold-out test accuracy measures performance on data that was not seen during training: the model correctly predicted the target variable for approximately 72.7% of the test samples. On the training set the model achieved an accuracy of approximately 73.6%. The two figures are close, which gives no clear sign of overfitting.

The k-fold cross-validation scores represent the accuracy of your model across different folds (splits) of the dataset. The scores for each fold are as follows: [0.709, 0.764, 0.691, 0.800, 0.691]. The average k-fold cross-validation score is calculated as 0.731, representing the overall model performance across the different folds.

Testing the model

In [None]:
# Create a prediction model
y_pred = model.predict(x_test)
print('Test Accuracy: {:.3f}'.format(model.score(x_test, y_test)))
print('Training Accuracy: {:.3f}'.format(model.score(x_train, y_train)))

Test Accuracy: 0.727
Training Accuracy: 0.736


In [None]:
# Compute confusion matrix

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n',cm)

Confusion Matrix :
 [[30  4]
 [11 10]]


30 non-recurrence cases are correctly predicted as non-recurrence - True Negative (TN) - (0-0)
4 non-recurrence cases are incorrectly predicted as recurrence - False Positive (FP) - (0-1)
11 recurrence cases are incorrectly predicted as non-recurrence - False Negative (FN) - (1-0)
10 recurrence cases are correctly predicted as recurrence - True Positive (TP) - (1-1)

In [None]:
cr = classification_report(y_test,y_pred)
print('Classification Report :\n', cr)

Classification Report :
               precision    recall  f1-score   support

           0       0.73      0.88      0.80        34
           1       0.71      0.48      0.57        21

    accuracy                           0.73        55
   macro avg       0.72      0.68      0.69        55
weighted avg       0.73      0.73      0.71        55



Precision: The proportion of true positive predictions among all positive predictions.
Recall: The proportion of true positive predictions among all actual positive instances.
F1-score: The harmonic mean of precision and recall.
Support: The number of actual occurrences of the class in the specified dataset.
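As a worked check of these definitions, the class-1 (recurrence) metrics can be recomputed directly from the Gaussian Naive Bayes confusion matrix printed above. The variable names are illustrative; the numbers come from that output.

```python
import numpy as np

# Confusion matrix reported above: rows are actual classes, columns are predictions
cm_nb = np.array([[30, 4],
                  [11, 10]])
tn, fp, fn, tp = cm_nb.ravel()

precision = tp / (tp + fp)                           # 10 / 14
recall = tp / (tp + fn)                              # 10 / 21
f1 = 2 * precision * recall / (precision + recall)   # 20 / 35
accuracy = (tp + tn) / cm_nb.sum()                   # 40 / 55

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The results (0.71, 0.48, 0.57 and 0.73) match the class-1 row and the accuracy line of the classification report.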


## K-Nearest Neighbors (KNN)

Description: KNN is an instance-based learning algorithm that classifies new instances based on the majority class of its k-nearest neighbors.
Implementation: Train the KNN model using different values of k and evaluate.
Evaluation: Measure accuracy, precision, recall, and F1-score using hold-out sampling and K-fold cross-validation.
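One way to try different values of k, sketched here on synthetic data rather than the project's dataset, is a grid search over `n_neighbors` with cross-validation. The grid values and the `*_demo` names are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in with a roughly 70/30 class split
X_demo, y_demo = make_classification(
    n_samples=275, n_features=9, weights=[0.7, 0.3], random_state=42
)

knn_pipeline = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])

# Odd values of k avoid ties in a two-class vote
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15]}
search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["knn__n_neighbors"])
print(f"Best mean F1: {search.best_score_:.3f}")
```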

### KNN Classification

In [None]:
from sklearn.neighbors        import KNeighborsClassifier
from sklearn.model_selection  import train_test_split
from sklearn.metrics          import accuracy_score
import numpy as np

Data Splitting:
Train-Test Split: Divide the dataset into training and testing sets to train the model on one subset and validate its performance on another.

In [None]:
# Define features and target variable
x = New_data_without_none.drop('Reccurence', axis=1)
y = New_data_without_none['Reccurence']

Feature Scaling/Normalization:
Scale Numerical Features: Standardize or normalize numerical features to a similar scale to avoid bias in models that rely on distance measures.

In [None]:
# Select the numeric columns (the label-encoded categorical columns are integers as well)
# categorical_features = x.select_dtypes(include=['object'])
numeric_features = x.select_dtypes(include=['float64', 'int64'])

# Normalizing numeric features
scaler = MinMaxScaler()
x_normalized_numeric = scaler.fit_transform(numeric_features)

In [None]:
# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x_normalized_numeric, y, test_size=0.2, random_state=42)

Initialize the model

In [None]:
# Modeling and evaluation
model = KNeighborsClassifier()

# Train the model
model.fit(x_train, y_train)

Cross-Validation:
Use cross-validation techniques to assess model performance robustly and avoid overfitting.

Hold-out Sampling: 
Split the dataset into training and testing sets, train the models on the training set, and evaluate their performance on the unseen testing set.

K-fold Cross-Validation: 
Divide the dataset into k subsets (folds), train the models on k-1 folds, and validate on the remaining fold. This process is repeated k times, and the performance is averaged.

In [None]:
# Evaluate the model on hold-out set

# Results
print('Hold-out Test Accuracy: {:.3f}'.format(model.score(x_test, y_test)))

Hold-out Test Accuracy: 0.673


In [None]:
# Evaluate the model using k-fold cross-validation
cv_scores = cross_val_score(model, x_normalized_numeric, y, cv=5)  # You can adjust the number of folds (cv) as needed

# Results
print('K-fold Cross-Validation Scores:', cv_scores)
print('Average K-fold Cross-Validation Score: {:.3f}'.format(cv_scores.mean()))

K-fold Cross-Validation Scores: [0.76363636 0.83636364 0.70909091 0.72727273 0.70909091]
Average K-fold Cross-Validation Score: 0.749


This metric indicates the accuracy of the KNN model on the test set that was not seen during training. In this case, the model correctly predicted the target variable for approximately 67.3% of the samples.

The k-fold cross-validation scores represent the accuracy of the model across different folds (splits) of the dataset. The scores for each fold are as follows: [0.764, 0.836, 0.709, 0.727, 0.709]. The average k-fold cross-validation score is calculated as 0.749, representing the overall model performance across the different folds.

Testing the model

In [None]:
# Testing the model

y_pred = model.predict(x_test)
print('Test Accuracy: {:.3f}'.format(model.score(x_test, y_test)))
print('Training Accuracy: {:.3f}'.format(model.score(x_train, y_train)))

Test Accuracy: 0.673
Training Accuracy: 0.795


> The training accuracy (0.795) is noticeably higher than the test accuracy (0.673), which might be an indication of overfitting.
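To see how the gap between training and test accuracy depends on k, the two scores can be compared for several values. The sketch below uses synthetic data and an arbitrary list of k values (both assumptions); with k = 1 every training point is its own nearest neighbor, so training accuracy is perfect and says little about generalization.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the encoded dataset
X_demo, y_demo = make_classification(n_samples=275, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Record (training accuracy, test accuracy) for each k
gaps = {}
for k in [1, 5, 15, 31]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc = knn.score(X_tr, y_tr)
    test_acc = knn.score(X_te, y_te)
    gaps[k] = (train_acc, test_acc)
    print(f"k={k:2d}  train={train_acc:.3f}  test={test_acc:.3f}  gap={train_acc - test_acc:+.3f}")
```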

In [None]:
# Compute confusion matrix

cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n',cm)

Confusion Matrix :
 [[33  1]
 [17  4]]


33 non-recurrence cases are correctly predicted as non-recurrence - True Negative (TN) - (0-0)
1 non-recurrence case is incorrectly predicted as recurrence - False Positive (FP) - (0-1)
17 recurrence cases are incorrectly predicted as non-recurrence - False Negative (FN) - (1-0)
4 recurrence cases are correctly predicted as recurrence - True Positive (TP) - (1-1)

In [None]:
cr = classification_report(y_test,y_pred)
print('Classification Report :\n', cr)

Classification Report :
               precision    recall  f1-score   support

           0       0.66      0.97      0.79        34
           1       0.80      0.19      0.31        21

    accuracy                           0.67        55
   macro avg       0.73      0.58      0.55        55
weighted avg       0.71      0.67      0.60        55



- Precision: the proportion of true positive predictions among all positive predictions.
- Recall: the proportion of true positive predictions among all actual positive instances.
- F1-score: the harmonic mean of precision and recall.
- Support: the number of actual occurrences of the class in the test set.

These metrics provide insight into the performance of the KNN model. For class 1 (recurrence), precision is 0.80 but recall is only 0.19, so the model misses most recurrence cases. As with the Gaussian Naive Bayes model, ways to balance precision and recall for this class, and further tuning of the model, are worth exploring.
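One possible way to trade precision for recall is to lower the probability threshold at which a sample is labelled as recurrence. The sketch below is only an illustration on synthetic data from `make_classification` (an assumption standing in for the real features); the thresholds 0.5 and 0.3 and all names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, imbalanced stand-in data (assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=275, n_features=9,
                                     weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

knn_demo = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
proba_pos = knn_demo.predict_proba(X_te)[:, 1]   # estimated probability of class 1

recalls = {}
for threshold in (0.5, 0.3):
    # Label a sample as class 1 when its probability reaches the threshold
    y_hat = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    print('threshold={:.1f}  precision={:.2f}  recall={:.2f}'.format(
        threshold, precision_score(y_te, y_hat, zero_division=0), recalls[threshold]))
```

Lowering the threshold can only add positive predictions, so recall never decreases, while precision usually drops. Which balance is acceptable depends on the cost of a missed recurrence.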

### KNN Imputation

An effective approach to data imputing is to use a model to predict the missing values.

In [None]:
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

Listing the features that have None values (Node-caps & Breast-quad)

In [None]:
# Distinct values of Node-caps, including None
distinct_node_caps = New_data['Node-caps'].unique()

# Display the distinct values
distinct_node_caps

array(["'yes'", "'no'", None], dtype=object)

In [None]:
# Distinct values of Breast-quad, including None
distinct_breast_quad = New_data['Breast-quad'].unique()

# Display the distinct values
distinct_breast_quad

array(["'left_up'", "'central'", "'left_low'", "'right_up'",
       "'right_low'", None], dtype=object)

Create a copy of the data set

In [None]:
# Create a copy of the data set
New_data_imputed = New_data.copy()

# Identify columns with missing values
columns_with_missing_values = New_data_imputed.columns[New_data_imputed.isnull().any()].tolist()

In [None]:
columns_with_missing_values 

['Node-caps', 'Breast-quad']

Define the imputer & fit to the data set

In [None]:
# Separate columns into numerical and categorical
numerical_columns = New_data_imputed.select_dtypes(include=np.number).columns
categorical_columns = list(set(columns_with_missing_values) - set(numerical_columns))

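# Encode categorical variables
# Note: LabelEncoder gives None its own integer code, so no NaN remains in these columns,
# and the same encoder object is refit for every column
label_encoder = LabelEncoder()
for col in categorical_columns:
    New_data_imputed[col] = label_encoder.fit_transform(New_data_imputed[col])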

Initialize the imputer

In [None]:
# Initialize the KNNImputer for numerical columns
numerical_imputer = KNNImputer(n_neighbors=3, weights='uniform', metric='nan_euclidean')
New_data_imputed[numerical_columns] = numerical_imputer.fit_transform(New_data_imputed[numerical_columns])

In [None]:
# Display the DataFrame after imputation
New_data_imputed

| Unnamed: 0 | Age | Menopause | Tumor-size | Nodes | Node-caps | Degree-of-malignance | Breast | Breast-quad | Irradiation | Reccurence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.0 | 'premeno' | '15-19' | 1.0 | 1 | '3' | 'right' | 2 | 'no' | 'recurrence-events' |
| 1 | 57.0 | 'ge40' | '15-19' | 0.0 | 0 | '1' | 'right' | 0 | 'no' | 'no-recurrence-events' |
| 2 | 56.0 | 'ge40' | '35-39' | 2.0 | 0 | '2' | 'left' | 1 | 'no' | 'recurrence-events' |
| 3 | 42.0 | 'premeno' | '35-39' | 2.0 | 1 | '3' | 'right' | 1 | 'yes' | 'no-recurrence-events' |
| 4 | 44.0 | 'premeno' | '30-34' | 5.0 | 1 | '2' | 'left' | 4 | 'no' | 'recurrence-events' |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 281 | 51.0 | 'ge40' | '30-34' | 7.0 | 1 | '2' | 'left' | 1 | 'no' | 'no-recurrence-events' |
| 282 | 57.0 | 'premeno' | '25-29' | 3.0 | 1 | '2' | 'left' | 1 | 'yes' | 'no-recurrence-events' |
| 283 | 38.0 | 'premeno' | '30-34' | 8.0 | 1 | '2' | 'right' | 4 | 'no' | 'no-recurrence-events' |
| 284 | 57.0 | 'premeno' | '15-19' | 2.0 | 0 | '2' | 'right' | 1 | 'no' | 'no-recurrence-events' |


In [None]:
# Decode numerical values back to categorical
# (only runs for float64 columns; the encoded columns here are integers, so nothing is decoded)
for col in categorical_columns:
    if New_data_imputed[col].dtype == 'float64':
        New_data_imputed[col] = label_encoder.inverse_transform(New_data_imputed[col].astype(int))

In [None]:
# Distinct values of Breast-quad after the imputation step
distinct_breast_quad = New_data_imputed['Breast-quad'].unique()

# Display the distinct values
distinct_breast_quad

array([2, 0, 1, 4, 3, 5])

In [None]:
# Distinct values of Node-caps after the imputation step
distinct_node_caps = New_data_imputed['Node-caps'].unique()

# Display the distinct values
distinct_node_caps

array([1, 0, 2])

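KNN imputation is meant to fill missing values and provide a more complete dataset for analysis. As written, however, the steps above do not fill the gaps in Node-caps and Breast-quad: LabelEncoder assigns None its own integer code (2 in Node-caps, 5 in Breast-quad), and KNNImputer is applied only to the numerical columns, which had no missing values. The distinct values above confirm this, since each column still has one more code than it has real categories. To impute these two columns, the missing entries would need to stay NaN after encoding and the encoded columns would need to be passed to the imputer. It is also essential to assess the impact of imputation on the overall analysis and adjust the modeling process accordingly.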
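For KNNImputer to act on a categorical column, its missing entries have to reach the imputer as NaN. The following is a minimal sketch of one way to do that, not the notebook's method: the `toy` frame and its values are hypothetical, `n_neighbors=1` is an assumption chosen so that the nearest row's category is copied rather than averaging category codes, and distances are computed on unscaled features (including the other encoded column), which is a crude choice for nominal data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame with the same kind of gaps as Node-caps and Breast-quad
toy = pd.DataFrame({
    'Age': [45, 57, 56, 42, 44, 51],
    'Nodes': [1, 0, 2, 2, 5, 7],
    'Node-caps': ['yes', 'no', None, 'yes', 'yes', 'no'],
    'Breast-quad': ['left_up', 'central', 'left_low', None, 'right_low', 'left_low'],
})
categorical_cols = ['Node-caps', 'Breast-quad']

# Encode categories as floats while keeping missing entries as NaN
encoded = toy.copy()
category_values = {}
for col in categorical_cols:
    cat = pd.Categorical(toy[col])        # None is treated as missing, not as a category
    codes = cat.codes.astype(float)       # missing entries get code -1
    codes[codes == -1] = np.nan
    encoded[col] = codes
    category_values[col] = np.asarray(cat.categories)

# n_neighbors=1 (assumption) copies the nearest row's code instead of averaging codes
imputer = KNNImputer(n_neighbors=1, metric='nan_euclidean')
imputed = pd.DataFrame(imputer.fit_transform(encoded),
                       columns=encoded.columns, index=encoded.index)

# Map the integer codes back to category labels
for col in categorical_cols:
    codes = imputed[col].round().astype(int).to_numpy()
    imputed[col] = category_values[col][codes]

print(imputed)
```

After this step each categorical column contains only its original labels, with no extra code standing in for the missing value.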

## Support Vector Machine (SVM)

Description: SVM constructs a hyperplane to separate classes with the maximum margin, often using kernel tricks for non-linear separation.
Implementation: Train SVM with different kernels (linear, polynomial, RBF) and tune hyperparameters.
Evaluation: Assess accuracy, precision, recall, and F1-score using hold-out sampling and K-fold cross-validation.
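The cells in this section train a single `svm.SVC()` with default settings. A kernel and hyperparameter comparison along the lines described above could look like the sketch below. It is only an illustration: the data are synthetic (`make_classification` stands in for the real features), and the grid values for `C` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=275, n_features=9,
                                     weights=[0.7, 0.3], random_state=42)

# Scaling lives inside the pipeline, so it is refit on each training fold
svm_pipeline = make_pipeline(MinMaxScaler(), SVC())

# make_pipeline names the SVC step 'svc', hence the 'svc__' prefix
param_grid = {
    'svc__kernel': ['linear', 'poly', 'rbf'],
    'svc__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(svm_pipeline, param_grid, scoring='accuracy',
                      cv=KFold(n_splits=5, shuffle=True, random_state=42))
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print('Best mean CV accuracy: {:.3f}'.format(search.best_score_))
```

Because accuracy can hide poor recall on the minority class, a different `scoring` value such as `'f1'` or `'recall'` may be more informative for recurrence prediction.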

Data Splitting:
Train-Test Split: Divide the dataset into training and testing sets to train the model on one subset and validate its performance on another.

In [None]:
from sklearn import preprocessing
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score, KFold

In [None]:
# Define features and target variable
X = New_data_without_none.drop('Reccurence', axis=1)
y = New_data_without_none['Reccurence']

Feature Scaling/Normalization:
Scale Numerical Features: Standardize or normalize numerical features to a similar scale to avoid bias in models that rely on distance measures.

In [None]:
# Scaling and normalizing features
mm_scaler = preprocessing.MinMaxScaler()
X_mm = mm_scaler.fit_transform(X)

In [None]:
# Split the data into training and testing sets
# (no random_state is set, so this split changes between runs)
X_train, X_test, y_train, y_test = train_test_split(X_mm, y, test_size=0.2)

Initialize the model

In [None]:
# Train the SVM classifier
clf = svm.SVC()
clf.fit(X_train, y_train)

Cross-Validation:
Use cross-validation techniques to assess model performance robustly and avoid overfitting.

In [None]:
# Define 5 shuffled folds with a fixed seed
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

Hold-out Sampling: Split the dataset into training and testing sets, train the models on the training set, and evaluate their performance on the unseen testing set.

K-fold Cross-Validation: Divide the dataset into k subsets (folds), train the models on k-1 folds, and validate on the remaining fold. This process is repeated k times, and the performance is averaged.

In [None]:
# Evaluate using hold-out sampling
accuracy_holdout_clf = clf.score(X_test, y_test)

In [None]:
# Perform k-fold cross-validation for Support Vector Machine
# (note: this uses the unscaled X, not X_mm)
accuracy_clf_cv = cross_val_score(clf, X, y, cv=kfold).mean()

In [None]:
# Results
print("Support Vector Machine Accuracy (Hold-out):", accuracy_holdout_clf)

Support Vector Machine Accuracy (Hold-out): 0.6909090909090909


In [None]:
# Results
print("Support Vector Machine Accuracy (K-Fold CV):", accuracy_clf_cv)

Support Vector Machine Accuracy (K-Fold CV): 0.7127272727272728


Testing the model

In [None]:
# Predict on the test set and compare test and training accuracy
y_pred = clf.predict(X_test)
print('Test Accuracy: {:.3f}'.format(clf.score(X_test, y_test)))
print('Training Accuracy: {:.3f}'.format(clf.score(X_train, y_train)))

Test Accuracy: 0.691
Training Accuracy: 0.823


In [None]:
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n',cm)

Confusion Matrix :
 [[37  1]
 [16  1]]


- 37 non-recurrence cases are correctly predicted as non-recurrence: True Negative (TN), actual 0, predicted 0
- 1 non-recurrence case is incorrectly predicted as recurrence: False Positive (FP), actual 0, predicted 1
- 16 recurrence cases are incorrectly predicted as non-recurrence: False Negative (FN), actual 1, predicted 0
- 1 recurrence case is correctly predicted as recurrence: True Positive (TP), actual 1, predicted 1

In [None]:
# Compute classification report
cr = classification_report(y_test,y_pred)
print('Classification Report :\n', cr)

Classification Report :
               precision    recall  f1-score   support

           0       0.70      0.97      0.81        38
           1       0.50      0.06      0.11        17

    accuracy                           0.69        55
   macro avg       0.60      0.52      0.46        55
weighted avg       0.64      0.69      0.59        55



## Decision Tree Classifier

Description: Decision trees partition data into subsets based on features to create a tree-like model for classification.
Implementation: Train decision trees and evaluate using different tree depths or pruning techniques.
Evaluation: Measure accuracy, precision, recall, and F1-score with hold-out sampling and K-fold cross-validation.
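The cells in this section fit one unconstrained tree. A comparison of tree depths, as mentioned above, could be sketched as follows. The data are synthetic (an assumption standing in for the real features) and the depth values are arbitrary; `max_depth=None` lets the tree grow until its leaves are pure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption, not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=275, n_features=9,
                                     weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

depth_results = {}
for depth in (2, 4, 6, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    depth_results[depth] = {
        'train': tree.score(X_tr, y_tr),   # accuracy on the data the tree was fit on
        'cv': cross_val_score(tree, X_demo, y_demo, cv=kfold_demo).mean(),
    }
    print('max_depth={}: train={:.3f}  cv={:.3f}'.format(
        depth, depth_results[depth]['train'], depth_results[depth]['cv']))
```

A large gap between training and cross-validation accuracy at greater depths is the usual sign of overfitting. Cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier` is another option to explore.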

Data Splitting:
Train-Test Split: Divide the dataset into training and testing sets to train the model on one subset and validate its performance on another.

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
# Define features and target variable
X = New_data_without_none.drop('Reccurence', axis=1)
y = New_data_without_none['Reccurence']

In [None]:
# Split the data into train and test sets using hold-out sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Initialize the model

In [None]:
# Initialize Decision Tree Classifier
decision_tree = DecisionTreeClassifier(random_state=42)

In [None]:
# Fit the model
decision_tree.fit(X_train, y_train)

Cross-Validation:
Use cross-validation techniques to assess model performance robustly and avoid overfitting.

In [None]:
# Define 5 shuffled folds with a fixed seed
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

Hold-out Sampling: Split the dataset into training and testing sets, train the models on the training set, and evaluate their performance on the unseen testing set.

K-fold Cross-Validation: Divide the dataset into k subsets (folds), train the models on k-1 folds, and validate on the remaining fold. This process is repeated k times, and the performance is averaged.

In [None]:
# Evaluate using hold-out sampling
accuracy_holdout_dt  = decision_tree.score(X_test, y_test)

In [None]:
# Perform k-fold cross-validation for Decision Tree
accuracy_dt_cv = cross_val_score(decision_tree, X, y, cv=kfold).mean()

In [None]:
# Results
print("Decision Tree Accuracy (Hold-out):", accuracy_holdout_dt )

Decision Tree Accuracy (Hold-out): 0.5636363636363636


In [None]:
# Results
print("Decision Tree Accuracy (K-Fold CV):", accuracy_dt_cv)

Decision Tree Accuracy (K-Fold CV): 0.5745454545454545


Testing the model

In [None]:
# Predict on the test set and compare test and training accuracy
y_pred = decision_tree.predict(X_test)
print('Test Accuracy: {:.3f}'.format(decision_tree.score(X_test, y_test)))
print('Training Accuracy: {:.3f}'.format(decision_tree.score(X_train, y_train)))

Test Accuracy: 0.564
Training Accuracy: 1.000


In [None]:
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n',cm)

Confusion Matrix :
 [[24 10]
 [14  7]]


- 24 non-recurrence cases are correctly predicted as non-recurrence: True Negative (TN), actual 0, predicted 0
- 10 non-recurrence cases are incorrectly predicted as recurrence: False Positive (FP), actual 0, predicted 1
- 14 recurrence cases are incorrectly predicted as non-recurrence: False Negative (FN), actual 1, predicted 0
- 7 recurrence cases are correctly predicted as recurrence: True Positive (TP), actual 1, predicted 1

In [None]:
# Compute classification report
cr = classification_report(y_test,y_pred)
print('Classification Report :\n', cr)

Classification Report :
               precision    recall  f1-score   support

           0       0.63      0.71      0.67        34
           1       0.41      0.33      0.37        21

    accuracy                           0.56        55
   macro avg       0.52      0.52      0.52        55
weighted avg       0.55      0.56      0.55        55



## Artificial Neural Network (ANN)

Description: ANN is a network of interconnected nodes inspired by the human brain, capable of learning complex patterns.
Implementation: Design and train a neural network with multiple layers, neurons, and activation functions.
Evaluation: Assess accuracy, precision, recall, and F1-score using hold-out sampling and K-fold cross-validation.

Data Splitting:
Train-Test Split: Divide the dataset into training and testing sets to train the model on one subset and validate its performance on another.

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing

In [None]:
# Define features and target variable
X = New_data_without_none.drop('Reccurence', axis=1)
y = New_data_without_none['Reccurence']

Feature Scaling/Normalization:
Scale Numerical Features: Standardize or normalize numerical features to a similar scale to avoid bias in models that rely on distance measures.

In [None]:
# Scaling and normalizing features
mm_scaler = preprocessing.MinMaxScaler()
X_mm = mm_scaler.fit_transform(X)

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_mm, y, test_size=0.2, random_state=42)

Initialize the model

In [None]:
# Initialize Neural Network Classifier (Multi-layer Perceptron)
mlp = MLPClassifier(random_state=42)

In [None]:
# Fit the model
mlp.fit(X_train, y_train)


Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.



### Alternative: one hidden layer of 10 units and more iterations

In [None]:
# Initialize an ANN using MLPClassifier from scikit-learn
ann = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)

In [None]:
# Fit the model
ann.fit(X_train, y_train)

Cross-Validation:
Use cross-validation techniques to assess model performance robustly and avoid overfitting.

In [None]:
# Define 5 shuffled folds with a fixed seed
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

Hold-out Sampling: Split the dataset into training and testing sets, train the models on the training set, and evaluate their performance on the unseen testing set.

K-fold Cross-Validation: Divide the dataset into k subsets (folds), train the models on k-1 folds, and validate on the remaining fold. This process is repeated k times, and the performance is averaged.

In [None]:
# Evaluate using hold-out sampling
accuracy_mlp_holdout = mlp.score(X_test, y_test)

In [None]:
# Perform k-fold cross-validation for Neural Network
# (note: this uses the unscaled X, not X_mm)
accuracy_mlp_cv = cross_val_score(mlp, X, y, cv=kfold).mean()

In [None]:
# Results
print("Neural Network Accuracy (Hold-out):", accuracy_mlp_holdout)

Neural Network Accuracy (Hold-out): 0.7272727272727273


In [None]:
# Results
print("Neural Network Accuracy (K-Fold CV):", accuracy_mlp_cv)

Neural Network Accuracy (K-Fold CV): 0.7345454545454546


> Results for the alternative network (ann) with one hidden layer of 10 units

In [None]:
# Evaluate using hold-out sampling
accuracy_holdout_ann = ann.score(X_test, y_test)

In [None]:
# Evaluate using k-fold cross-validation
# (note: this uses the unscaled X, not X_mm, and cv=5 instead of the kfold object)
accuracy_cross_val_ann = cross_val_score(ann, X, y, cv=5)

In [None]:
# Results
print("Accuracy with hold-out sampling (ANN):", accuracy_holdout_ann)

Accuracy with hold-out sampling (ANN): 0.6909090909090909


In [None]:
# Results
print("Accuracy with k-folds cross-validation (ANN):", accuracy_cross_val_ann.mean())

Accuracy with k-folds cross-validation (ANN): 0.7454545454545454


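> I noticed that without normalizing and scaling the data, the hold-out accuracy changes but the k-fold cross-validation score does not. This is expected with the code above: the hold-out split is taken from the scaled X_mm, whereas the cross_val_score calls receive the unscaled X, so scaling never reaches the cross-validation.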
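Since the `cross_val_score` calls in this section are given `X` rather than `X_mm`, one way to make scaling part of cross-validation is to wrap the scaler and the classifier in a pipeline. The sketch below illustrates the idea on synthetic data whose columns are deliberately put on very different scales; the data and the `_demo` names are assumptions, and the network settings mirror the `ann` model above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption); columns are rescaled to very different ranges
X_demo, y_demo = make_classification(n_samples=275, n_features=9, random_state=42)
X_demo = X_demo * np.logspace(0, 3, num=9)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)

# Cross-validation on raw features: no scaling is applied anywhere
cv_raw = cross_val_score(net, X_demo, y_demo, cv=kfold_demo).mean()

# Cross-validation on a pipeline: the scaler is refit on each training fold
scaled_net = make_pipeline(MinMaxScaler(), net)
cv_scaled = cross_val_score(scaled_net, X_demo, y_demo, cv=kfold_demo).mean()

print('CV accuracy without scaling: {:.3f}'.format(cv_raw))
print('CV accuracy with scaling in the pipeline: {:.3f}'.format(cv_scaled))
```

Fitting the scaler inside each fold also avoids using test-fold statistics during training, which happens when `fit_transform` is applied to the full dataset before splitting.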

Testing the model

In [None]:
# Predict on the test set and compare test and training accuracy
y_pred = mlp.predict(X_test)
print('Test Accuracy: {:.3f}'.format(mlp.score(X_test, y_test)))
print('Training Accuracy: {:.3f}'.format(mlp.score(X_train, y_train)))

Test Accuracy: 0.727
Training Accuracy: 0.809


In [None]:
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n',cm)

Confusion Matrix :
 [[34  0]
 [15  6]]


- 34 non-recurrence cases are correctly predicted as non-recurrence: True Negative (TN), actual 0, predicted 0
- 0 non-recurrence cases are incorrectly predicted as recurrence: False Positive (FP), actual 0, predicted 1
- 15 recurrence cases are incorrectly predicted as non-recurrence: False Negative (FN), actual 1, predicted 0
- 6 recurrence cases are correctly predicted as recurrence: True Positive (TP), actual 1, predicted 1

In [None]:
# Compute classification report
cr = classification_report(y_test,y_pred)
print('Classification Report :\n', cr)

Classification Report :
               precision    recall  f1-score   support

           0       0.69      1.00      0.82        34
           1       1.00      0.29      0.44        21

    accuracy                           0.73        55
   macro avg       0.85      0.64      0.63        55
weighted avg       0.81      0.73      0.68        55



> Results for the alternative network (ann) with one hidden layer of 10 units

In [None]:
# Predict on the test set and compare test and training accuracy
y_pred = ann.predict(X_test)
print('Test Accuracy: {:.3f}'.format(ann.score(X_test, y_test)))
print('Training Accuracy: {:.3f}'.format(ann.score(X_train, y_train)))

Test Accuracy: 0.691
Training Accuracy: 0.773


In [None]:
# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)
print('Confusion Matrix :\n',cm)

Confusion Matrix :
 [[33  1]
 [16  5]]


- 33 non-recurrence cases are correctly predicted as non-recurrence: True Negative (TN), actual 0, predicted 0
- 1 non-recurrence case is incorrectly predicted as recurrence: False Positive (FP), actual 0, predicted 1
- 16 recurrence cases are incorrectly predicted as non-recurrence: False Negative (FN), actual 1, predicted 0
- 5 recurrence cases are correctly predicted as recurrence: True Positive (TP), actual 1, predicted 1

In [None]:
# Compute classification report
cr = classification_report(y_test,y_pred)
print('Classification Report :\n', cr)

Classification Report :
               precision    recall  f1-score   support

           0       0.67      0.97      0.80        34
           1       0.83      0.24      0.37        21

    accuracy                           0.69        55
   macro avg       0.75      0.60      0.58        55
weighted avg       0.73      0.69      0.63        55



## Discussion

- Naive Bayes performed consistently with both hold-out sampling (0.727) and K-fold cross-validation (0.731), suggesting that it is a robust model for this dataset. Naive Bayes assumes that the features are independent of each other given the class. This assumption is often violated in real-world datasets, but it may hold approximately for the dataset used in this study.

- The KNN model has a higher average K-fold cross-validation score (0.749) than GNB (0.731), while GNB has a higher hold-out test accuracy (0.727) than KNN (0.673). These differences are small compared with the spread of KNN's fold scores (0.709 to 0.836), so neither model clearly generalizes better. KNN is sensitive to outliers and noise in the data, which may explain its weaker hold-out result.

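- SVM reached 0.691 with hold-out sampling and 0.713 with K-fold cross-validation, so the two estimates are close. However, its training accuracy (0.823) is well above its test accuracy, which suggests that SVM may be prone to overfitting, and its recall for the recurrence class is only 0.06.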

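- Decision Tree's performance was the lowest (0.564 hold-out, 0.575 K-fold cross-validation), suggesting that an unconstrained tree is not well-suited for this dataset. Its training accuracy of 1.000 shows that the tree memorizes the training data, so the low scores point to overfitting rather than to an inability to model complex relationships. Limiting the tree depth or pruning may help.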

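- ANN demonstrated promising results with both hold-out sampling (0.727 for the default network, 0.691 with one hidden layer of 10 units) and K-fold cross-validation (0.735 and 0.745), indicating that it is a capable model for this task. ANNs are powerful algorithms that can learn complex relationships between features, although the default network did not converge within 200 iterations.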

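> Overall, ANN appears to be the most promising model for this dataset by accuracy; SVM is competitive on accuracy but rarely detects recurrence. KNN, SVM, Decision Tree, and ANN all have low recall for the recurrence class (0.33 at most).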
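To make the comparison easier to check, the accuracies quoted in this report can be collected in one table. The sketch below only re-enters the reported numbers (rounded to three decimals); the DataFrame name `results` and the row labels are introduced here for illustration. The models were not all evaluated on the same split or the same fold definition, so the ranking is indicative only.

```python
import pandas as pd

# Accuracies as reported in this notebook, rounded to three decimals
results = pd.DataFrame(
    {
        'hold_out': [0.727, 0.673, 0.691, 0.564, 0.727, 0.691],
        'k_fold_cv': [0.731, 0.749, 0.713, 0.575, 0.735, 0.745],
    },
    index=['Naive Bayes', 'KNN', 'SVM', 'Decision Tree',
           'MLP (default)', 'ANN (10 hidden units)'],
)

# Difference between the cross-validation estimate and the single hold-out estimate
results['gap'] = (results['k_fold_cv'] - results['hold_out']).round(3)

print(results.sort_values('k_fold_cv', ascending=False))
```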

## Conclusion

> - Naive Bayes and the default ANN show the highest hold-out accuracy (0.727). SVM's hold-out (0.691) and cross-validation (0.713) results are close, but its training accuracy (0.823) is noticeably higher, suggesting potential overfitting.
> - Naive Bayes and ANN exhibit relatively consistent performance across the test set and cross-validation, indicating good generalization.
> - Decision Tree's low accuracy on both the test set and cross-validation, combined with a training accuracy of 1.000, indicates overfitting.
> - KNN shows moderate performance but might benefit from tuning hyperparameters or scaling features.
