<h1>Table of Contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#import_libraries">Import Libraries</a></li>
        <li><a href="#import_dataset">Import "Breast Cancer Wisconsin (Original)" Dataset</a></li>
        <li><a href="#information">Information about the Dataset</a></li>
        <li><a href="#pre-processing">Pre-processing</a></li>        
        <li><a href="#feature_selection">Feature Selection</a></li>
        <li><a href="#classification">Classification</a></li>        
    </ol>
</div>
<br>
<hr>

<div id="import_libraries"> 
    <h2>Import Libraries</h2>    
</div>

In [1]:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.cluster import DBSCAN
from sklearn.experimental import enable_iterative_imputer  
from sklearn.impute import IterativeImputer
import seaborn as sns  
import matplotlib.pyplot as plt  
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler 
from sklearn.linear_model import LogisticRegression  
from sklearn.feature_selection import RFE 
from sklearn.feature_selection import mutual_info_classif
from sklearn import metrics
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_curve, roc_auc_score 

import warnings
warnings.filterwarnings("ignore")

<div id="import_dataset"> 
    <h2>Import "Breast Cancer Wisconsin (Original)" Dataset</h2>         
</div>

**About the dataset:**  
<ul>  
    <li>  
        The "<strong>Breast Cancer Wisconsin (Original) Dataset</strong>" contains <strong>699 samples</strong> of breast cancer cases. Each sample includes the following <strong>10 features</strong>:  
        <ul>  
            <li><strong>ID number:</strong> An identifier for each sample.</li>  
            <li><strong>Clump Thickness:</strong> A measurement of the thickness of the clump of cells.</li>  
            <li><strong>Uniformity of Cell Size:</strong> Measures how uniform the size of the cells is.</li>  
            <li><strong>Uniformity of Cell Shape:</strong> Assesses how uniform the shape of the cells is.</li>  
            <li><strong>Marginal Adhesion:</strong> The adhesion of the cells at the margins.</li>  
            <li><strong>Single Epithelial Cell Size:</strong> The size of a single epithelial cell.</li>  
            <li><strong>Bare Nuclei:</strong> The presence of nuclei without surrounding cytoplasm.</li>  
            <li><strong>Bland Chromatin:</strong> The texture of the chromatin in the cell nucleus.</li>  
            <li><strong>Normal Nucleoli:</strong> The presence of normal nucleoli in the cells.</li>  
            <li><strong>Mitoses:</strong> The count of cells undergoing mitosis.</li>  
        </ul>  
        Additionally, the dataset includes a <strong>Class</strong> label indicating whether the tumor is <strong>benign (2)</strong> or <strong>malignant (4)</strong>.  
        <br><br>  
    </li>  
    <li>  
        This is a well-known dataset used for research in medical care and machine learning. Researchers and practitioners commonly use this dataset for:  
        <ol>  
            <li><strong>Classification Tasks:</strong> Predicting whether a breast mass is benign or malignant using machine learning algorithms.</li>  
            <li><strong>Feature Selection:</strong> Identifying the most relevant features for classification.</li>  
            <li><strong>Model Evaluation:</strong> Comparing the performance of different machine learning models on a standardized dataset.</li>  
            <li><strong>Educational Purposes:</strong> Teaching students and practitioners about data preprocessing, feature extraction, and model building in the context of machine learning.</li>  
        </ol>  
        <br>  
    </li>  
    <li>  
        Researchers have used this dataset to achieve various findings, including:  
        <ol>  
            <li><strong>Improved Classification Accuracy:</strong> Developing and refining machine learning models to enhance the accuracy of breast cancer diagnosis.</li>  
            <li><strong>Feature Importance:</strong> Identifying which features are most significant for distinguishing between benign and malignant masses.</li>  
            <li><strong>Model Comparisons:</strong> Comparing the performance of different algorithms (e.g., Decision Trees, Support Vector Machines, Neural Networks) to find the most effective approach for this task.</li>  
            <li><strong>Data Augmentation:</strong> Exploring techniques to augment the dataset and improve model performance.</li>  
        </ol>        
    </li>  
</ul>

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data" 
column_names = ["id number" , "Clump Thickness" , "Uniformity of Cell Size" , "Uniformity of Cell Shape" , "Marginal Adhesion" ,
                "Single Epithelial Cell Size" , "Bare Nuclei" , "Bland Chromatin" , "Normal Nucleoli" , "Mitoses" , "Class"]  
bcw_df = pd.read_csv(url, header=None, names=column_names)
display(bcw_df)

<div id="information"> 
    <h2>Information about the Dataset</h2>    
</div>

In [None]:
# Show summary statistics for the dataset
# This includes count, mean, standard deviation, minimum, 25%, 50%, 75%, and maximum values for numeric columns
print('\nThe dataset description:\n')

data_describe = bcw_df.describe()
display(data_describe)

In [None]:
# Display a concise summary of the dataset
# This summary includes the index dtype, column dtypes, non-null values, and memory usage 
print('\nMore information about the dataset:\n')

# info() prints the summary itself and returns None, so call it directly
bcw_df.info()

In [None]:
# Get the shape of the dataset, which returns the number of rows and columns
shape_of_the_dataset = bcw_df.shape
print("\nThe shape of the dataset -->", shape_of_the_dataset)

In [None]:
# Calculate the number of unique values in each column of the dataset
print('\nNumber of unique data in the dataset:\n')

unique_data = bcw_df.nunique()
print(unique_data)

<div id="pre-processing"> 
    <h2>Pre-processing</h2>    
</div>
<div>
    <ol>
        <li><a href="#duplicates">Duplicate Tuples</a></li>
        <li><a href="#outliers">Detecting Outliers (Noise)</a></li>
        <li><a href="#missing_values">Handling Missing Values</a></li>
        <li><a href="#standardization">Standardization</a></li>     
    </ol>
</div>
<br>
<hr>

<div id="duplicates"> 
    <h2>Duplicate Tuples</h2>    
</div>

In [3]:
# Drop the 'id number' column: it only identifies the sample and has no effect on the learning process
bcw_df = bcw_df.drop('id number', axis=1)

In [None]:
# Calculate the number of duplicate rows in the dataframe
Num_of_duplicate_rows = bcw_df.duplicated().sum()
print("\nThe number of duplicate rows -->", Num_of_duplicate_rows)

In [None]:
# Identify all duplicated rows in the dataframe  
# 'duplicated(keep=False)' marks every member of a duplicate group as True, including the first occurrence
df_all_duplicate = bcw_df[bcw_df.duplicated(keep=False)]
print("\nAll the rows and their duplicates:\n")
display(df_all_duplicate)

In [None]:
# Identify only the duplicated rows in the dataframe
# 'duplicated()' with its default keep='first' flags only the repeated rows and leaves each first occurrence unflagged
duplicate = bcw_df[bcw_df.duplicated()]
print("\nJust duplicate rows:\n")
display(duplicate)
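
The difference between the two `keep` settings used above can be sketched on a toy list of rows. This is a plain-Python illustration of the flagging logic only, not pandas' implementation; the helper name `duplicated` and the sample rows are made up for the example.

```python
def duplicated(rows, keep="first"):
    """Return one boolean per row, loosely mirroring pandas' duplicated()."""
    counts = {}
    for row in rows:
        counts[row] = counts.get(row, 0) + 1

    seen = set()
    flags = []
    for row in rows:
        if keep is False:
            # Every member of a repeated group is flagged
            flags.append(counts[row] > 1)
        else:
            # Only the second and later occurrences are flagged
            flags.append(row in seen)
        seen.add(row)
    return flags


rows = [(5, 1), (3, 2), (5, 1), (5, 1)]   # hypothetical rows; (5, 1) appears three times
print(duplicated(rows))                   # default: first occurrence stays unflagged
print(duplicated(rows, keep=False))       # keep=False: all three copies are flagged
```

Summing the default flags gives the number of rows that `drop_duplicates()` would remove, which is why the count cell above uses `duplicated().sum()`.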

In [None]:
# Drop all duplicate rows from the dataframe
# df_ADD --> df_after dropping duplicates
df_ADD = bcw_df.drop_duplicates()
print("\nThe dataset after dropping the duplicate tuples:\n")
display(df_ADD)

<div id="outliers"> 
    <h2>Detecting Outliers (Noise)</h2>    
</div>
<div>
    <ol>
        <li><a href="#iqr">Interquartile Range (IQR) method</a></li> 
        <li><a href="#db_scan">DBSCAN Clustering (Density-Based Spatial Clustering)</a></li> 
        <li><a href="#output">Output the results</a></li>        
    </ol>
</div>
<br>
<hr>

<div id="iqr"> 
    <h2>Interquartile Range (IQR) method</h2>    
</div>

In [9]:
# Select only numeric columns from the dataframe
# Ignore the 'Bare Nuclei' column from the original dataframe, because its type is object 
numeric_df = df_ADD.select_dtypes(include=['number'])

In [10]:
# Calculate Q1 (25th percentile) and Q3 (75th percentile)  
Q1 = numeric_df.quantile(0.25)  
Q3 = numeric_df.quantile(0.75)  
IQR = Q3 - Q1  

# Define the outlier detection bounds  
lower_bound = Q1 - 1.5 * IQR  
upper_bound = Q3 + 1.5 * IQR  

In [None]:
# Build a mask that is True for rows with no outlier in any column
outlier_mask = ~((numeric_df < lower_bound) |   
                 (numeric_df > upper_bound)).any(axis=1)  

# Keep only the rows that passed the IQR check (copy so later column assignments are safe)
df_iqr = numeric_df[outlier_mask].copy()
display(df_iqr)
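
As a small worked example of the IQR rule applied above, the sketch below computes the same bounds for a single toy column using only the standard library. The values are invented for illustration; `method="inclusive"` interpolates linearly between order statistics, which should correspond to the default behaviour of `DataFrame.quantile`.

```python
import statistics

values = [1, 2, 2, 3, 3, 3, 4, 4, 10]   # toy scores; 10 is the suspicious one

# Quartiles with linear interpolation between order statistics
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Same 1.5 * IQR fences as in the notebook
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower or v > upper]
print(lower, upper, outliers)
```

In the notebook the rule is applied column by column, and a row is dropped as soon as any one of its columns falls outside its fences.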

In [12]:
# Validate the IQR method
# Separate features and target variable  
x = df_iqr.drop('Class', axis=1)            # Features
y = df_iqr['Class']                         # Target variable

# Split the data into training and testing sets (80/20) 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)  

In [None]:
# Initialize the KNN classifier  
clf_iqr = KNeighborsClassifier(n_neighbors=1)  

# Perform cross-validation to check accuracy after IQR outlier removal  
accuracy_iqr = np.mean(cross_val_score(clf_iqr, x_train, y_train, scoring='accuracy', cv=10))  
print(f'\nCross-validated accuracy after IQR outlier removal: {accuracy_iqr:.4f}\n')

In [14]:
# Add the 'Bare Nuclei' column from the original dataframe 
df_iqr['Bare Nuclei'] = df_ADD['Bare Nuclei']

# Specify the desired column order
columns_order = ["Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape", "Marginal Adhesion",
                 "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]

In [None]:
# Reorder the dataframe columns
df_iqr = df_iqr[columns_order].reset_index(drop=True)

# Display updated dataframe
print("Updated dataframe:") 
display(df_iqr)

<div id="db_scan"> 
    <h2>DBSCAN Clustering (Density-Based Spatial Clustering)</h2>    
</div>

In [16]:
# Select only numeric columns from the dataframe, ignoring the 'Bare Nuclei' column (datatype: object)  
numeric_df = df_ADD.select_dtypes(include=['number'])

In [17]:
# Standardize the features (z-score normalization) 
scaler = StandardScaler() 
X_scaled = scaler.fit_transform(numeric_df.drop(columns=["Class"]))

# Set the parameters for DBSCAN: eps and min_samples
dbs = DBSCAN(eps=2, min_samples=5).fit(X_scaled)

In [20]:
# Cluster label of each row (outliers get the label -1)
labels = dbs.labels_

# Add cluster labels to the dataframe 
numeric_df['Cluster'] = labels

In [None]:
# Filter out the outliers (rows with label -1) 
df_dbs = numeric_df[numeric_df['Cluster'] != -1]

# Display the dataframe without outliers 
print("\nDataframe after removing outliers:\n") 
display(df_dbs)

In [None]:
# Display the number of detected outliers 
n_outliers = (labels == -1).sum() 
print(f'\nNumber of outliers detected: {n_outliers}\n')

In [None]:
# Display the number of clusters, ignoring noise if present 
n_clusters = len(set(labels)) - (1 if -1 in labels else 0) 
print(f'\nEstimated number of clusters: {n_clusters}\n')
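
To make the meaning of the `-1` label concrete, here is a simplified plain-Python sketch of the noise criterion. It assumes that a point is a core point when at least `min_samples` points (itself included) lie within `eps`, and that noise is any point that is neither a core point nor within `eps` of one. The function name `noise_mask` and the toy points are illustrative; this does not reproduce scikit-learn's implementation or its cluster assignment.

```python
import math


def noise_mask(points, eps, min_samples):
    """Flag the points that a DBSCAN-style rule would treat as noise."""
    # Neighbourhood of each point, counting the point itself
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbours]

    # Noise: not a core point and not within eps of any core point
    return [
        not is_core[i] and not any(is_core[j] for j in neighbours[i])
        for i in range(len(points))
    ]


# Five tightly packed points and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
mask = noise_mask(points, eps=1.5, min_samples=4)
print(mask)
```

Because the criterion depends on distances, the notebook standardizes the features before calling `DBSCAN`, so that no single column dominates the neighbourhood radius.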

In [24]:
# Validate the DBSCAN method
# Separate features and target variable  
x = df_dbs.drop(['Class', 'Cluster'], axis=1)   # Features (exclude the DBSCAN 'Cluster' label)
y = df_dbs['Class']                         # Target variable

# Split the data into training and testing sets (80/20) 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)  

In [None]:
# Initialize the KNN classifier  
clf_dbs = KNeighborsClassifier(n_neighbors=1)  

# Perform cross-validation to check accuracy after DBSCAN outlier removal  
accuracy_dbs = np.mean(cross_val_score(clf_dbs, x_train, y_train, scoring='accuracy', cv=10))  
print(f'\nCross-validated accuracy after DBSCAN outlier removal: {accuracy_dbs:.4f}\n')

In [26]:
# Remove the 'Cluster' column to clean up the dataframe
df_dbs = df_dbs.drop(columns=['Cluster'])

# Add the 'Bare Nuclei' column from the original dataframe 
df_dbs['Bare Nuclei'] = df_ADD['Bare Nuclei']

# Specify the desired column order for the final dataframe
columns_order = ["Clump Thickness", "Uniformity of Cell Size", "Uniformity of Cell Shape", "Marginal Adhesion",
                 "Single Epithelial Cell Size", "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]

In [None]:
# Reorder the dataframe columns
df_dbs = df_dbs[columns_order].reset_index(drop=True)

# Display updated dataframe
print("\nUpdated dataframe:\n") 
display(df_dbs)

<div id="output"> 
    <h2>Output the results</h2>    
</div>

In [None]:
# Output the results of different outlier detection methods   
print('\nIQR result:', accuracy_iqr)                   # Print accuracy score for the iqr method  
print('\nDBSCAN clustering result:', accuracy_dbs)     # Print accuracy score for the dbscan clustering method 

In [None]:
print("\nContinue working with DBSCAN Clustering after comparing different outlier detection methods:\n")
display(df_dbs.head())

In [None]:
# Get the shape of the dataset
print("\nDataset shape before dropping duplicate tuples -->", bcw_df.shape)
print("\nDataset shape after dropping duplicate tuples -->", df_ADD.shape)
print("\nDataset shape after deleting the outliers -->", df_dbs.shape)

<div id="missing_values"> 
    <h2>Handling Missing Values</h2>    
</div>
<div>
    <ol>
        <li><a href="#drop">Drop Imputation</a></li>
        <li><a href="#median">Median Imputation</a></li>        
        <li><a href="#iterative">Iterative Imputation</a></li>        
        <li><a href="#output">Output the results</a></li>    
    </ol>
</div>
<br>
<hr>

In [None]:
# Display a concise summary of the dataset after deleting the outliers
# This summary includes the index dtype, column dtypes, non-null values, and memory usage 
print('\nMore information about the dataset after deleting the outliers:\n')

# info() prints the summary itself and returns None, so call it directly
df_dbs.info()

In [None]:
# Check the distribution of the 'Bare Nuclei' column
df_dbs['Bare Nuclei'].value_counts()

In [None]:
# Check for missing values in the dataframe
isna = pd.DataFrame(df_dbs.isna().sum(axis=0))
print(isna)

In [None]:
print('\nThere are no NaN values in the dataset')
print('But according to the dataset description, missing values are encoded in other forms')

In [36]:
# Find missing values encoded in other forms
# Define unwanted values and treat them as null/missing  
unwanted_values = ['?', '!', '$', 'None', 'null', '']

# Replace unwanted values with NaN   
df_dbs.replace(unwanted_values, np.nan, inplace=True)

In [None]:
# Check for any NaN values now present in the dataframe  
missing_values_count = df_dbs.isna().sum() 

# Display the count of missing values for each column  
print("\nCount of missing values in each column:")  
print(missing_values_count[missing_values_count > 0])

In [None]:
# Display rows with missing values  
rows_with_missing = df_dbs[df_dbs.isna().any(axis=1)]  
print("\nRows with missing values:")  
display(rows_with_missing)

In [39]:
# Convert 'Bare Nuclei' column to float
df_dbs['Bare Nuclei'] = df_dbs['Bare Nuclei'].astype("float")

<div id="drop"> 
    <h2>Drop Imputation</h2>    
</div>

In [40]:
# Create a copy of the original dataset 
DfDrop = df_dbs.copy(deep=True)

# Mark missing 'Bare Nuclei' values with 0, a sentinel outside the valid 1-10 range
DfDrop['Bare Nuclei'] = DfDrop['Bare Nuclei'].fillna(0)

In [None]:
# Drop rows where 'Bare Nuclei' is 0 
DfDrop.drop(DfDrop.index[(DfDrop["Bare Nuclei"] == 0)],axis=0,inplace=True)

# Preview the data after drop imputation  
print('\nPreview the data after drop imputation: \n')
display(DfDrop.head())

<div id="median"> 
    <h2>Median Imputation</h2>    
</div>

In [42]:
# Create a copy of the original dataset   
DfDrop_med = df_dbs.copy(deep=True)

# Fill missing values in 'Bare Nuclei' with the median of the column
# Assign the result back instead of using inplace=True on a column selection
median_bare_nuclei = DfDrop_med['Bare Nuclei'].median()
DfDrop_med['Bare Nuclei'] = DfDrop_med['Bare Nuclei'].fillna(median_bare_nuclei)

In [None]:
# Preview the data after median imputation  
print('\nPreview the data after median imputation: \n')
display(DfDrop_med.head())
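
The median imputation step can be traced by hand on a toy column. The sketch below uses only the standard library and invented values; it mirrors the idea of `median()` followed by `fillna()`, where the median is computed from the observed values only.

```python
import math
import statistics

column = [1.0, math.nan, 10.0, 1.0, 3.0]   # toy 'Bare Nuclei' values, one missing

# The median is taken over the observed (non-missing) values only
observed = [v for v in column if not math.isnan(v)]
median = statistics.median(observed)

# Replace each missing entry with that median
imputed = [median if math.isnan(v) else v for v in column]
print(median, imputed)
```

Note that the imputed value need not be one of the original integer scores: with an even number of observed values, the median is the average of the two middle ones.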

<div id="iterative"> 
    <h2>Iterative Imputation</h2>    
</div>

In [44]:
# Create a copy of the original dataset  
DfIterative = df_dbs.copy(deep=True)  

# Keep only numeric columns for iterative imputation
df_ite_numeric = DfIterative.select_dtypes(include=[np.number])  

In [45]:
# Set up the iterative imputer  
imputer_ite = IterativeImputer(missing_values=np.nan, sample_posterior=True, min_value=0, 
                               random_state=0)  

# Perform the iterative imputation  
imputed_data_ite = imputer_ite.fit_transform(df_ite_numeric)  

In [None]:
# Convert back to dataframe  
DfIterative[df_ite_numeric.columns] = imputed_data_ite  

# Preview the data after iterative imputation  
print('\nPreview the data after iterative imputation:\n')
display(DfIterative.head())

<div id="output"> 
    <h2>Output the results</h2>    
</div>

Compare the different Imputation Methods using **Kernel Density Estimation (KDE) Plots**

In [None]:
# 'Bare Nuclei' column 
# Setup the plotting environment  
plt.figure(figsize=(14, 10))  

# KDE for 'Bare Nuclei' column  
sns.kdeplot(df_dbs['Bare Nuclei'], label='Baseline', fill=False, bw_adjust=0.5)
sns.kdeplot(DfDrop['Bare Nuclei'], label='Drop Imputation', fill=False, bw_adjust=0.5) 
sns.kdeplot(DfDrop_med['Bare Nuclei'], label='Median Imputation', fill=False, bw_adjust=0.5)   
sns.kdeplot(DfIterative['Bare Nuclei'], label='Iterative Imputation', fill=False, bw_adjust=0.5) 

# Aesthetic aspects of the plot  
plt.title('KDE Plot comparison of Bare Nuclei across Imputation Methods')  
plt.xlabel('Bare Nuclei')  
plt.ylabel('Density')  
plt.legend()  
plt.grid(True)  
plt.show() 

In [None]:
print("\nContinue working with drop imputation after comparing different imputation methods:\n")
display(DfDrop)

<div id="standardization"> 
    <h2>Standardization</h2>    
</div>
<div>
    <ol>
        <li><a href="#z-score">Z-Score Standardization (Standard Scaling)</a></li>
        <li><a href="#min-max">Min-Max Scaling (Normalization)</a></li>         
        <li><a href="#output">Output the results</a></li>     
    </ol>
</div>
<br>
<hr>


<div id="z-score"> 
    <h2>Z-Score Standardization (Standard Scaling)</h2>    
</div>

In [49]:
# Apply the Z-score standardization
Z_scaler = StandardScaler()  
Z_Scaled = Z_scaler.fit_transform(DfDrop)

# Create a new dataframe with the scaled data  
df_Z_Scaled = pd.DataFrame(Z_Scaled, columns = list(DfDrop.columns))
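
For reference, the z-score transform applied per column is z = (x - mean) / std. A minimal standard-library sketch on invented values follows; it assumes the population standard deviation (dividing by n), which is what `StandardScaler` uses.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # toy column

mean = statistics.fmean(values)
std = statistics.pstdev(values)      # population standard deviation (divides by n)

# Centre on the mean, then express each value in units of standard deviation
z_scores = [(v - mean) / std for v in values]
print(mean, std, z_scores)
```

After the transform the column has mean 0 and standard deviation 1, so distance-based models such as KNN weigh every feature on a comparable scale.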

In [None]:
# Use all columns except 'Class'
df_Z_Scaled_final = df_Z_Scaled.drop('Class', axis = 1)

# Add the 'Class' column back to the dataframe
df_Z_Scaled_final['Class'] = DfDrop['Class'].tolist()
display(df_Z_Scaled_final)

In [51]:
# Recode the 'Class' labels: benign (2) -> 0, malignant (4) -> 1
df_Z_Scaled_final['Class'] = df_Z_Scaled_final['Class'].replace({2: 0, 4: 1}) 

In [52]:
# Validate the Z-score standardization
# Separate features and target variable  
x_z = df_Z_Scaled_final.drop('Class', axis = 1)               # Features
y_z = df_Z_Scaled_final['Class']                              # Target variable

# Split the data into training and testing sets (80/20)  
x_train_z, x_test_z, y_train_z, y_test_z = train_test_split(x_z, y_z, test_size=0.2, random_state=0)  

In [None]:
# Initialize the KNN classifier  
clf_z = KNeighborsClassifier(n_neighbors=10)  

# Perform cross-validation to check accuracy after the Z-standard scaling  
accuracy_z = np.mean(cross_val_score(clf_z, x_train_z, y_train_z, scoring='accuracy', cv=10)) 
print(f'\nCross-validated accuracy after the Z-standard scaling: {accuracy_z:.4f}\n')

<div id="min-max"> 
    <h2>Min-Max Scaling (Normalization)</h2>    
</div>

In [None]:
# Apply Min Max scaler
MM_scaler = MinMaxScaler()
Min_Max_Scaled = MM_scaler.fit_transform(DfDrop)

# Create a new dataframe with the scaled data 
df_Min_Max_Scaled_final = pd.DataFrame(Min_Max_Scaled, columns = list(DfDrop.columns))
display(df_Min_Max_Scaled_final)
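
Min-max scaling maps each column linearly onto [0, 1] via (x - min) / (max - min). The sketch below is a plain-Python illustration with a hypothetical helper `min_max_scale`. It also shows why no separate recoding of `Class` is needed in this branch: since the scaler is fitted on every column of `DfDrop`, the class codes 2 and 4 become 0.0 and 1.0.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    # Assumes the column is not constant (hi > lo); otherwise this divides by zero
    return [(v - lo) / (hi - lo) for v in values]


feature_scaled = min_max_scale([1, 3, 5, 10])   # a feature scored on the 1-10 scale
class_scaled = min_max_scale([2, 4, 4, 2])      # class codes: benign (2), malignant (4)
print(feature_scaled)
print(class_scaled)
```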

In [55]:
# Validate the Min Max Scaler
# Separate features and target variable  
x_mm = df_Min_Max_Scaled_final.drop('Class', axis = 1)            # Features
y_mm = df_Min_Max_Scaled_final['Class']                           # Target variable

# Split the data into training and testing sets (80/20) 
x_train_mm, x_test_mm, y_train_mm, y_test_mm = train_test_split(x_mm, y_mm, test_size=0.2, random_state=0) 

In [None]:
# Initialize the KNN classifier  
clf_mm = KNeighborsClassifier(n_neighbors=10)  

# Perform cross-validation to check accuracy after the Min Max scaling  
accuracy_mm = np.mean(cross_val_score(clf_mm, x_train_mm, y_train_mm, scoring='accuracy', cv=10))
print(f'\nCross-validated accuracy after the Min Max scaling: {accuracy_mm:.4f}\n')

<div id="output"> 
    <h2>Output the results</h2>    
</div>

In [None]:
# Output the results of different standardization methods   
print('\nZ-standard scaling result:', accuracy_z)  # Print accuracy score for the z-score standardization method  
print('\nMin Max scaling result:', accuracy_mm)    # Print accuracy score for the min max scaler method 

In [None]:
print("\nContinue working with the dataset scaled by Z-standard scaling after comparing different scaling methods:\n")
df_Scaled = df_Z_Scaled_final
display(df_Scaled.head())

<div id="feature_selection"> 
    <h2>Feature Selection</h2>    
</div>
<div>
    <ol>
        <li><a href="#fm">Filter Method (Correlation Analysis)</a></li>
        <li><a href="#rfe">Recursive Feature Elimination (RFE)</a></li>        
        <li><a href="#mi">Mutual Information</a></li>       
        <li><a href="#output">Output the results</a></li> 		
    </ol>
</div>
<br>
<hr>

<div id="fm"> 
    <h2>Filter Method (Correlation Analysis)</h2>    
</div>

In [None]:
# Calculate correlation matrix
corr = df_Scaled.corr()
print('\nCorrelation between the features in the dataset:\n')

# Display the correlation matrix
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  
    display(corr)

In [None]:
# Visualize the correlation matrix using a heatmap plot
# Setup the plotting environment 
plt.figure(figsize=(15,15))
print('\nVisualizing the correlation of the dataset:\n')

# Heatmap plot for correlation
sns.heatmap(corr, cbar=True, square= True, fmt='.2f', annot=True, annot_kws={'size':10}, cmap='Blues')

# Aesthetic aspects of the plot
plt.title('Feature Correlation Heatmap', fontsize=18)  
plt.show()

In [None]:
# Get and print correlation of 'Class' with other features
print("\nThe correlation of 'Class' with other features:\n")
class_corr = df_Scaled.corr()['Class'].sort_values(ascending=False) 
print(class_corr) 

In [None]:
# Select features with correlation >= 0.6 with 'Class'
# Apply abs() to consider both positive and negative correlations
significant_features_fm = class_corr[class_corr.abs() >= 0.6].index.tolist()    

# Remove 'Class' from the list of significant features
significant_features_fm = [feature for feature in significant_features_fm if feature != 'Class']            
print("\nChoosing features that have |correlation| >= 0.6 with 'Class':\n", significant_features_fm) 
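
The filter step boils down to computing the Pearson correlation of each feature with the target and keeping those whose absolute value reaches the threshold. A minimal plain-Python sketch follows; the feature names and values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


target = [0, 0, 1, 1, 1]
features = {
    "rises_with_class": [1, 2, 3, 4, 5],
    "falls_with_class": [5, 4, 3, 2, 1],
    "weakly_related":   [3, 1, 4, 1, 5],
}

# abs() keeps strong negative correlations as well as strong positive ones
threshold = 0.6
selected = [name for name, col in features.items()
            if abs(pearson(col, target)) >= threshold]
print(selected)
```

The negatively correlated feature is retained too, which is the reason for taking the absolute value before comparing against the threshold.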

In [63]:
# Validate filter method
# Separate features and target variable
x_fm = df_Scaled[significant_features_fm].drop(columns=['Class'], errors='ignore')        # Features 
y_fm = df_Scaled['Class']                                                                 # Target variable

# Split the dataset into training and testing sets (80/20) with random state  
X_train, X_test, y_train, y_test = train_test_split(x_fm, y_fm, test_size=0.2, random_state=0, stratify=y_fm)

In [None]:
# Initialize the KNN classifier
clf_fm = KNeighborsClassifier(n_neighbors=1)  

# Perform cross-validation to check accuracy after filter method 
accuracy_fm = np.mean(cross_val_score(clf_fm, x_fm, y_fm, scoring='accuracy', cv=10))  
print(f"\nCross-validated accuracy after filter method: {accuracy_fm:.4f}") 

<div id="rfe"> 
    <h2>Recursive Feature Elimination (RFE)</h2>    
</div>

In [65]:
# Separate features and target variable  
X = df_Scaled.drop('Class', axis=1)              # Features  
y = df_Scaled['Class']                           # Target variable

# Split the dataset into training and testing sets (80/20) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

In [66]:
# Initialize the Logistic Regression model  
reg_model = LogisticRegression(max_iter=1000)                   # Added max_iter for convergence if needed  

In [None]:
# Initialize and fit RFE  
rfe = RFE(estimator=reg_model, n_features_to_select=6)          # Select 6 features  

# Fit the model to the training data
rfe.fit(X_train, y_train)

In [None]:
# Get the selected features  
significant_features_rfe = X.columns[rfe.support_]  
print("Selected features using RFE:")  
print(significant_features_rfe.tolist()) 

In [69]:
# Validate RFE
# Separate features and target variable
X_rfe = df_Scaled[significant_features_rfe]            # Features
y = df_Scaled['Class']                                 # Target variable

In [None]:
# Initialize the KNN classifier
clf_rfe = KNeighborsClassifier(n_neighbors = 1)

# Perform cross-validation to check accuracy after RFE
accuracy_rfe = np.mean(cross_val_score(clf_rfe, X_rfe, y, scoring='accuracy', cv=10))
print(f"\nCross-validated accuracy after RFE: {accuracy_rfe:.4f}")

<div id="mi"> 
	<h2>Mutual Information</h2>    
</div>

In [71]:
# Separate features and target variable  
X = df_Scaled.drop('Class', axis=1)              # Features  
y = df_Scaled['Class']                           # Target variable

# Split the dataset into training and testing sets (80/20) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

In [72]:
# Calculate mutual information  
mi = mutual_info_classif(X_train, y_train, discrete_features='auto', random_state=0)  

# Create a dataframe to view feature importances  
mi_df = pd.DataFrame({'Feature': X.columns, 'Mutual Information': mi}) 

In [None]:
# Sort the dataframe based on mutual information  
mi_df = mi_df.sort_values(by='Mutual Information', ascending=False)

# Display mutual information for each feature  
print("Mutual information for each feature:")  
print(mi_df)

In [None]:
# Visualize the mutual information using a bar plot 
# Setup the plotting environment  
plt.figure(figsize=(10, 6))  

# Horizontal bar plot for the mutual information
plt.barh(mi_df['Feature'], mi_df['Mutual Information'], color='skyblue')  

# Aesthetic aspects of the plot
plt.xlabel('Mutual Information')  
plt.title('Mutual Information')  
plt.show()

In [None]:
# Set a fixed threshold 
fixed_threshold = 0.2

# Using fixed threshold  
significant_features_mi = mi_df[mi_df['Mutual Information'] > fixed_threshold]
print("Selected features:")  
print(significant_features_mi) 
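
Mutual information measures how much knowing a feature reduces uncertainty about the class. The sketch below is a simplified plug-in estimate for discrete values, in natural-log units. It is not the estimator used by `mutual_info_classif`, which relies on nearest-neighbour methods for continuous features; the function name and toy data are illustrative.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # Sum over observed (x, y) pairs of p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in joint.items()
    )


labels = [0, 0, 1, 1]
mi_informative = mutual_information([0, 0, 1, 1], labels)   # feature determines the label
mi_independent = mutual_information([0, 1, 0, 1], labels)   # feature is independent of it
print(mi_informative, mi_independent)
```

A feature that fully determines a balanced binary label reaches log(2), about 0.693, while an independent feature scores 0, which gives some intuition for a fixed cut-off such as 0.2.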

In [76]:
# Validate MI
# Get feature names as a list  
selected_feature_names = significant_features_mi['Feature'].tolist()         

# Separate features and target variable
X_selected_MI = df_Scaled[selected_feature_names]              # Features
y = df_Scaled['Class']                                         # Target variable

In [None]:
# Initialize the KNN classifier
clf_mi = KNeighborsClassifier(n_neighbors = 1)

# Perform cross-validation to check accuracy after mutual information
accuracy_mi = np.mean(cross_val_score(clf_mi, X_selected_MI, y, scoring='accuracy', cv=10))
print(f"\nCross-validated accuracy after mutual information: {accuracy_mi:.4f}") 

<div id="output"> 
    <h2>Output the results</h2>    
</div>  

In [None]:
# Output the results of different feature selection methods   
print('\nFilter method result:', accuracy_fm)     # Print accuracy score for filter (correlation analysis) method  
print('\nRFE result:', accuracy_rfe)              # Print accuracy score for recursive feature elimination (RFE) method
print('\nMI result:', accuracy_mi)                # Print accuracy score for mutual information method

In [None]:
print('\nUsing the features selected by recursive feature elimination (RFE) because it achieved higher accuracy\n')

**Final dataset** after feature selection

In [None]:
# Final dataset after feature selection (recursive feature elimination (RFE))
# Extract the names of the selected features 
selected_features = significant_features_rfe.tolist()
print('\nselected features:\n', selected_features)

In [None]:
# Create the final dataframe with the selected features and add the target column
# Copy the selection so that adding 'Class' does not modify a view of df_Scaled
df_final = df_Scaled[selected_features].copy()
df_final['Class'] = df_Scaled['Class']

# Display the final dataframe
print("Final dataframe with selected features and target column:") 
display(df_final)

In [None]:
# Check the distribution of the 'Class' variable
class_counts = df_final['Class'].value_counts()  
print("Class distribution:\n", class_counts)

<div id="classification"> 
    <h2>Classification</h2>    
</div>
<div>
    <ol>
        <li><a href="#knn">K-Nearest Neighbors (KNN)</a></li>   
    </ol>
</div>
<br>
<hr>

In [80]:
# Separate features and target variable  
X = df_final.drop('Class', axis=1)              # Features  
y = df_final['Class']                           # Target variable

# Split the dataset into training and testing sets (80/20) 
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
print('\nThe shape of the X_train dataset -->', X_train.shape)
print('\nThe shape of the Y_train dataset -->', Y_train.shape)
print('\nThe shape of the X_test dataset -->', X_test.shape)
print('\nThe shape of the Y_test dataset -->', Y_test.shape)
print('\n')

<div id="knn">   
    <h2>K-Nearest Neighbors (KNN)</h2>    
</div>  
<div>  
    <ol>  
        <li>  
            <a href="#valid">Validating</a>  
            <ol>   
                <li><a href="#holdout">Holdout</a></li>   
                <li><a href="#rrs">Repeated Random Sampling</a></li>                
            </ol>  
        </li>  
        <li><a href="#test">Testing</a></li>  
        <li><a href="#roc">ROC plot and AUC score</a></li> 
        <li><a href="#output">Output the results</a></li> 
    </ol>  
</div>  
<br>  
<hr>

<div id="holdout"> 
    <h2>Holdout</h2>    
</div>

In [None]:
# Holdout 
# Split the dataset into training and validating sets (80/20)
x_train, x_val, y_train, y_val = train_test_split(X_train, Y_train, test_size=0.2)

# Train a KNN classifier (with K=5)
clf_knn_h = KNeighborsClassifier(n_neighbors = 5)

# Fit the model to the training data
clf_knn_h.fit(x_train, y_train)

In [None]:
# Predict the labels for the validating data
y_predict = clf_knn_h.predict(x_val)

# Evaluate model performance
print('\nHoldout result:')
accuracy_score_holdout = accuracy_score(y_val, y_predict)
print('\nAccuracy  -->', accuracy_score_holdout)
print('\n')
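
For intuition about what the classifier does at prediction time, the sketch below implements a bare-bones k-nearest-neighbours vote in plain Python, assuming Euclidean distance and a simple majority. The function name `knn_predict` and the toy points are hypothetical; `KNeighborsClassifier` offers more options (distance metrics, weighting, efficient neighbour search) that are not reproduced here.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Two well-separated groups: label 0 near the origin, label 1 near (5, 5)
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_points, train_labels, (0.5, 0.5), k=3)
near_far_group = knn_predict(train_points, train_labels, (5.5, 5.5), k=3)
print(near_origin, near_far_group)
```

With two classes, an odd `k` such as the 5 used in the notebook avoids tied votes.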

<div id="rrs"> 
    <h2>Repeated Random Sampling</h2>    
</div>

In [84]:
# Repeated random sampling

Accuracy = []             # Initialize a list to store accuracy results
num_repeats = 10          # Number of times to repeat random sampling

# Perform repeated random sampling
for i in range(num_repeats):

    # Split the dataset into training and validating sets (80/20)
    x_train, x_val, y_train, y_val = train_test_split(X_train, Y_train, test_size=0.2)

    # Train a KNN classifier (with K=5)
    clf_knn_rrs = KNeighborsClassifier(n_neighbors = 5)

    # Fit the model to the training data
    clf_knn_rrs.fit(x_train, y_train)

    # Predict the labels for the validating data and store the accuracy of this repeat
    y_val_predict = clf_knn_rrs.predict(x_val)
    Accuracy.append(accuracy_score(y_val, y_val_predict))

In [None]:
# Evaluate model performance
df_Accuracy = pd.DataFrame(Accuracy, columns=['Accuracy'])
print('\nAccuracy in 10 iterations for different train and validation sets:\n')
display(df_Accuracy)

accuracy_score_rrs = df_Accuracy.Accuracy.mean()
print('\nThe mean of different accuracies for validating the model -->', accuracy_score_rrs)
print('\n')

<div id="test"> 
    <h2>Testing</h2>    
</div>

In [86]:
# Testing
# Train a KNN classifier (with K=5)
clf_knn = KNeighborsClassifier(n_neighbors = 5)

# Fit the model to the training data
clf_knn.fit(X_train, Y_train)

# Predict the labels for the testing data
Y_predict = clf_knn.predict(X_test)

In [None]:
# Evaluate model performance
print('\nTesting the model:\n')

accuracy_score_knn_testing = accuracy_score(Y_test, Y_predict)
print('\nAccuracy  -->', accuracy_score_knn_testing)
print('\nRecall or Sensitivity or TPR --->', recall_score(Y_test, Y_predict))
print('\nPrecision -->', precision_score(Y_test, Y_predict))
print('\nF1_score -->', f1_score(Y_test, Y_predict))
print('\n')

In [None]:
# Generate and display the classification report
print('\nClassification report:\n', classification_report(Y_test, Y_predict))

In [None]:
# Generate the confusion matrix
# Predictions are passed first, so rows are predicted labels and columns are actual labels
cm = metrics.confusion_matrix(Y_predict, Y_test)

# Create a dataframe for the confusion matrix for better visualization
confusion_matrix_dataframe = pd.DataFrame(cm, columns = ['benign present', 'malignant present'], 
                                          index = ['test benign', 'test malignant'])
print("\nConfusion matrix:\n")
display(confusion_matrix_dataframe)
print('\n')
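
The four scores printed earlier all derive from the counts in the confusion matrix. The sketch below recomputes them by hand on invented labels, treating 1 (malignant) as the positive class; it is an illustration of the formulas rather than a replacement for the scikit-learn functions.

```python
y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # 1 = malignant, 0 = benign (toy labels)
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)   # malignant, predicted malignant
tn = sum(t == 0 and p == 0 for t, p in pairs)   # benign, predicted benign
fp = sum(t == 0 and p == 1 for t, p in pairs)   # benign, predicted malignant
fn = sum(t == 1 and p == 0 for t, p in pairs)   # malignant, predicted benign

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
print(tp, tn, fp, fn, accuracy, recall, precision, f1)
```

In a screening setting, recall is often the score to watch, since a false negative corresponds to a missed malignant case.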

<div id="roc"> 
    <h2>ROC plot and AUC score</h2>    
</div>

In [90]:
# ROC
def plot_roc_curve(y_test, y_pred):

    # Calculate the ROC curve and plot TPR against FPR
    fpr, tpr, thresholds = roc_curve(y_test, y_pred)
    plt.plot(fpr, tpr)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.grid()

In [None]:
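# ROC plot and AUC score
# Use the predicted probability of the malignant class (1) so the curve spans several thresholds
Y_score = clf_knn.predict_proba(X_test)[:, 1]
plot_roc_curve(Y_test, Y_score)

# Calculate AUC score
auc_score = roc_auc_score(Y_test, Y_score)   
print('\nAUC score:', auc_score)
print('\n')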
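
One way to read the AUC is as the probability that a randomly chosen positive (malignant) sample receives a higher score than a randomly chosen negative one, with ties counted as half. The plain-Python sketch below computes it that way on invented data and compares continuous scores with thresholded 0/1 labels; the helper name `auc` and the 0.5 threshold are assumptions for the example.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.9]                    # continuous scores
hard = [1 if s >= 0.5 else 0 for s in scores]    # thresholded 0/1 labels

auc_scores = auc(y_true, scores)
auc_hard = auc(y_true, hard)
print(auc_scores, auc_hard)
```

The scores rank every positive above every negative, yet thresholding ties one positive with both negatives and lowers the AUC. Hard labels reduce the ROC curve to a single operating point, whereas scores trace it across thresholds.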

<div id="output"> 
    <h2>Output the results</h2>    
</div>

In [None]:
# Output the results of different validation methods and the KNN testing  
print('\nHoldout result:', accuracy_score_holdout)                       # Print accuracy score for the holdout method  
print('\nRepeated random sampling result:', accuracy_score_rrs)          # Print accuracy score for repeated random sampling method 
print('\nKNN testing result:', accuracy_score_knn_testing)               # Print accuracy score for KNN testing  
print('\nAUC score:', auc_score)                                         # Print AUC score for the model