# About Dataset
This project is based on the Coimbra Breast Cancer dataset. The dataset consists of 116 observations:
64 patients with breast cancer and 52 healthy controls. The dataset is primarily intended for building predictive models,
but this project focuses on PCA: we examine whether it improves results when combined with the KNN algorithm, compare several neighbor counts, and look for insights into the data.

# The dataset consists of 10 variables (9 features and the class label):

* Age (years): the age of the individuals.
* BMI (kg/m2): body mass index, a measure of body weight relative to height.
* Glucose (mg/dL): blood glucose levels.
* Insulin (µU/mL): levels of insulin, a hormone linked to glucose regulation.
* HOMA: homeostatic model assessment, an index of insulin resistance and beta-cell function.
* Leptin (ng/mL): levels of leptin, a hormone involved in appetite and energy balance.
* Adiponectin (µg/mL): levels of adiponectin, a protein associated with metabolic regulation.
* Resistin (ng/mL): levels of resistin, a protein involved in insulin resistance.
* MCP-1 (pg/dL): Monocyte Chemoattractant Protein-1, a cytokine involved in inflammation.

# Labels:
1. Healthy controls
2. Patients with breast cancer

# In this project we will:
* Explore the data before running any models
* Attempt to reproduce the results of the article we read using the KNN algorithm
* Run Principal Component Analysis (PCA) on the data
* Analyze the principal components
* Run KNN on different numbers of PCA components
* Draw conclusions

# Data Analysis:
# Principal component analysis (PCA)
Principal component analysis (PCA) is a statistical method for dimensionality reduction. It is mainly used on multidimensional datasets with many variables, with the goal of producing a simpler representation of the data. After the transformation, the data is described by components ordered by how much of the original variance they explain, so the first few components usually account for the largest share.

# K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a supervised learning algorithm. It is 'trained' by storing data points together with their class labels. To classify a new point, it looks at the k closest training points and assigns the most common class among them.
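
As a rough illustration of the idea, here is a minimal KNN classifier written from scratch. The helper `knn_predict` and the toy points are hypothetical and only meant to show the majority vote; the notebook itself uses scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical 2-D toy data: class 0 near (1, 1), class 1 near (4, 4)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
pred_b = knn_predict(train_points, train_labels, (3.9, 4.1), k=3)
print(pred_a, pred_b)
```

Because the vote depends on raw distances, features measured on larger scales dominate unless the data is standardized first.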

By running KNN on the data transformed by PCA, we hope to get better results than those in the article we studied.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/coimbra-breastcancer/Coimbra_breast_cancer_dataset.csv


In [2]:
# import the libraries:

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, precision_score, confusion_matrix, classification_report ,accuracy_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from tabulate import tabulate



In [3]:
df = pd.read_csv('/kaggle/input/coimbra-breastcancer/Coimbra_breast_cancer_dataset.csv')
df.head()

Unnamed: 0,Age,BMI,Glucose,Insulin,HOMA,Leptin,Adiponectin,Resistin,MCP.1,Classification
0,48,23.5,70,2.707,0.467409,8.8071,9.7024,7.99585,417.114,1
1,83,20.690495,92,3.115,0.706897,8.8438,5.429285,4.06405,468.786,1
2,82,23.12467,91,4.498,1.009651,17.9393,22.43204,9.27715,554.697,1
3,68,21.367521,77,3.226,0.612725,9.8827,7.16956,12.766,928.22,1
4,86,21.111111,92,3.549,0.805386,6.6994,4.81924,10.57635,773.92,1


# Load the dataset

In [4]:
# view info:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116 entries, 0 to 115
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             116 non-null    int64  
 1   BMI             116 non-null    float64
 2   Glucose         116 non-null    int64  
 3   Insulin         116 non-null    float64
 4   HOMA            116 non-null    float64
 5   Leptin          116 non-null    float64
 6   Adiponectin     116 non-null    float64
 7   Resistin        116 non-null    float64
 8   MCP.1           116 non-null    float64
 9   Classification  116 non-null    int64  
dtypes: float64(7), int64(3)
memory usage: 9.2 KB


# Check for missing values and create a copy of the DataFrame

In [5]:
# Identify and count missing values in each column
print(df.isnull().sum())
# Create a working copy of the DataFrame
cdf = df.copy()

In [6]:
cdf.columns

Index(['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin',
       'Resistin', 'MCP.1', 'Classification'],
      dtype='object')

# Overview of the Dataset using the fast_eda() Function from the fasteda Module


In [7]:
!pip install fasteda
from fasteda import fast_eda
# quick overview of the dataset using fast_eda() function from fasteda module
fast_eda(df)

[0m[31mERROR: Could not find a version that satisfies the requirement fasteda (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for fasteda[0m[31m
[0m

ModuleNotFoundError: No module named 'fasteda'

In [None]:
# check for the correlation of the data

plt.figure(figsize=(10,8))
cor = cdf.corr()
sns.heatmap(cor, annot=True,
            cmap=plt.cm.RdYlGn, vmin=-1, vmax=1)
plt.show()

# Visualize the correlation of features with the target column
correlation_with_target = cdf.corr()['Classification'].sort_values(ascending=False)

# Remove correlation with itself
correlation_with_target = correlation_with_target.drop('Classification')

plt.figure(figsize=(8, 6))
correlation_with_target.plot(kind='bar')
plt.title('Correlation with Target (Classification)')
plt.xlabel('Features')
plt.ylabel('Correlation')
plt.show()

Glucose, HOMA, Insulin and Resistin have a high positive correlation with the 'Classification' column, and BMI has the highest negative correlation.

As we can see, there is a high correlation between the HOMA and Insulin features.
Solution:
we could omit one of the features,
but we will use PCA for feature extraction instead in the following sections.
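
One likely reason for the HOMA and Insulin correlation is that HOMA is derived from the other two measurements. The sketch below applies the commonly used HOMA-IR formula (glucose in mmol/L times insulin in µU/mL, divided by 22.5) to the five rows printed by `df.head()` above. The mg/dL to mmol/L conversion factor of 18.018 is an assumption chosen for illustration, and `homa_ir` is a hypothetical helper.

```python
# (Glucose mg/dL, Insulin µU/mL, HOMA) taken from the df.head() output above
rows = [
    (70, 2.707, 0.467409),
    (92, 3.115, 0.706897),
    (91, 4.498, 1.009651),
    (77, 3.226, 0.612725),
    (92, 3.549, 0.805386),
]

MGDL_PER_MMOLL = 18.018  # assumed glucose conversion factor (mg/dL per mmol/L)

def homa_ir(glucose_mgdl, insulin_uu_ml):
    """HOMA-IR = glucose [mmol/L] * insulin [µU/mL] / 22.5."""
    glucose_mmol = glucose_mgdl / MGDL_PER_MMOLL
    return glucose_mmol * insulin_uu_ml / 22.5

# Largest gap between the recomputed value and the HOMA column
max_error = max(abs(homa_ir(g, i) - h) for g, i, h in rows)
print(f"largest absolute difference: {max_error:.2e}")
```

If the same relationship holds for the rest of the dataset, HOMA adds little information beyond Glucose and Insulin, which is the kind of redundancy PCA is designed to absorb.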

In [None]:
sns.pairplot(df, hue = 'Classification', palette = 'rainbow', diag_kind = "hist" );

At this point, no pair of features appears to separate the two classes well.

In [None]:
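# Recode the target: 1 (healthy control) -> 0, 2 (breast cancer patient) -> 1
# A second replace({0: 1, 1: 0}) would swap the labels back and make controls the positive class
cdf['Classification'] = cdf['Classification'].replace({1: 0, 2: 1})

X = cdf.drop('Classification',axis=1)
y = cdf['Classification']

0 = negative (healthy control)

1 = positive (breast cancer patient)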

# part 1 - Attempting to reproduce the article we read
In this section we will try to reproduce the results of the article. The article used the same dataset and the KNN algorithm with 7 neighbors on the features Resistin, Glucose, Age and BMI, and reported the following results:

* Accuracy -> 87.50%
* Sensitivity -> 84.62%
* Specificity -> 91.00%
* AUC -> 87.76
* Confusion Matrix:

[[ 11  1]

 [ 2 10]]
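
The metrics above can be recomputed from the reported confusion matrix, but the result depends on how the matrix is laid out. The sketch below compares two readings. Reading B is an assumption about the article's layout, not something the article states; it is included because it comes closest to the reported sensitivity and specificity.

```python
# Confusion matrix as reported in the article
cm = [[11, 1],
      [2, 10]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Reading A (scikit-learn layout): rows = actual class, columns = predicted class,
# negative class first. calculate_specificity() in this notebook assumes this layout.
sens_rows = cm[1][1] / (cm[1][0] + cm[1][1])
spec_rows = cm[0][0] / (cm[0][0] + cm[0][1])

# Reading B (assumption): rows = predicted class, columns = actual class,
# positive class first.
sens_cols = cm[0][0] / (cm[0][0] + cm[1][0])
spec_cols = cm[1][1] / (cm[0][1] + cm[1][1])

print(f"accuracy:  {accuracy:.4f}")
print(f"reading A: sensitivity={sens_rows:.4f}, specificity={spec_rows:.4f}")
print(f"reading B: sensitivity={sens_cols:.4f}, specificity={spec_cols:.4f}")
```

Accuracy is 87.50% under both readings. Reading B gives 84.62% sensitivity and about 90.91% specificity, close to the reported values, so it is worth checking the matrix orientation before comparing our scikit-learn output with the article's numbers.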

In [None]:
X_artcl= X[['Age', 'BMI', 'Glucose','Resistin']]
X_train, X_test, y_train, y_test = train_test_split(X_artcl,y,test_size=0.2,random_state=42 ,shuffle=True)

In [None]:

# Define the number of neighbors
n_neighbors = 7

# Initialize the k-NN classifier with the specified number of neighbors and parallel processing
knn = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=-1)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Predict the classes for training and test data
y_train_predicted = knn.predict(X_train)
y_test_predicted = knn.predict(X_test)


In [None]:

def calculate_specificity(conf_matrix):
    true_negatives = conf_matrix[0, 0]
    false_positives = conf_matrix[0, 1]
    specificity = true_negatives / (true_negatives + false_positives)
    return specificity


In [None]:

# Calculate evaluation metrics for the training set
knn_train_accuracy_score = accuracy_score(y_train, y_train_predicted)
knn_train_precision_score = precision_score(y_train, y_train_predicted)
knn_train_recall_score = recall_score(y_train, y_train_predicted)
knn_train_f1_score = f1_score(y_train, y_train_predicted)
knn_train_conf_matrix = confusion_matrix(y_train, y_train_predicted)
knn_train_auc_score = roc_auc_score(y_train, y_train_predicted)
knn_train_specificity_score = calculate_specificity(knn_train_conf_matrix)


# Calculate evaluation metrics for the test set
knn_test_accuracy_score = accuracy_score(y_test, y_test_predicted)
knn_test_precision_score = precision_score(y_test, y_test_predicted)
knn_test_recall_score = recall_score(y_test, y_test_predicted)
knn_test_f1_score = f1_score(y_test, y_test_predicted)
knn_test_conf_matrix = confusion_matrix(y_test, y_test_predicted)
knn_test_auc_score = roc_auc_score(y_test, y_test_predicted)
knn_test_specificity_score = calculate_specificity(knn_test_conf_matrix)


# Print evaluation metrics for the training set
print("Training Set Evaluation Metrics:")
print("Accuracy Score:", knn_train_accuracy_score)
print("Precision Score:", knn_train_precision_score)
print("Recall Score:", knn_train_recall_score)
print("F1 Score:", knn_train_f1_score)
print("AUC Score:", knn_train_auc_score)
print("Specificity Score:", knn_train_specificity_score)
print("Confusion Matrix:")
print(knn_train_conf_matrix)
print("------------------------------------------------------")

# Print evaluation metrics for the test set
print("Test Set Evaluation Metrics:")
print("Accuracy Score:", knn_test_accuracy_score)
print("Precision Score:", knn_test_precision_score)
print("Recall Score:", knn_test_recall_score)
print("F1 Score:", knn_test_f1_score)
print("AUC Score:", knn_test_auc_score)
print("Specificity Score:", knn_test_specificity_score)
print("Confusion Matrix:")
print(knn_test_conf_matrix)


As you can see, our results are a little different from the results of the article.
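
One detail may contribute to the gap in the AUC value: the cells above pass hard 0/1 predictions to `roc_auc_score`. With hard labels the AUC reduces to the average of sensitivity and specificity, whereas an AUC computed from predicted probabilities uses the full ranking. The sketch below shows this with a small from-scratch AUC; `pairwise_auc` and the numbers are hypothetical.

```python
def pairwise_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9]
hard = [int(p >= 0.5) for p in proba]  # thresholded predictions

auc_proba = pairwise_auc(y_true, proba)
auc_hard = pairwise_auc(y_true, hard)

# Sensitivity and specificity of the thresholded predictions
tpr = sum(h == 1 for h, t in zip(hard, y_true) if t == 1) / y_true.count(1)
tnr = sum(h == 0 for h, t in zip(hard, y_true) if t == 0) / y_true.count(0)

print(auc_proba, auc_hard, (tpr + tnr) / 2)
```

In scikit-learn the probability-based variant would be obtained by passing `knn.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of `knn.predict(X_test)`.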

# part 2 - Attempt to improve the results by performing PCA

**PCA** is a dimensionality reduction technique that transforms a dataset with many variables into a smaller one while retaining most of the original information. It creates new variables, called principal components, which are linear combinations of the original variables and are ordered by the amount of variance they capture. Reducing the number of variables makes the data easier to visualize and can make machine learning algorithms faster. The principal components themselves may not be directly interpretable, but they still provide insight into the underlying structure of the data.

**The method works in five main stages:**
1. Standardization: each variable is rescaled to have a mean of 0 and a variance of 1.
2. Covariance matrix: computed to identify relationships (correlations) between the variables.
3. Eigenvectors and eigenvalues of the covariance matrix: computed to find the principal components.
4. Feature vector: the principal components to keep are selected.
5. Transformation: the data is projected onto the new axes defined by the selected principal components.
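
The five stages can be sketched directly in NumPy. This is an illustrative walk-through on synthetic data (the array `X` is a made-up stand-in for the feature matrix), not the scikit-learn implementation used in the cells below.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix: 116 rows, 4 partly correlated columns
base = rng.normal(size=(116, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=116),
    base[:, 1],
    rng.normal(size=116),
])

# 1. Standardization: zero mean and unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized features
cov = np.cov(X_std.T)

# 3. Eigen-decomposition (eigh handles symmetric matrices, ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Feature vector: keep the two leading eigenvectors
W = eigvecs[:, :2]

# 5. Transformation: project the data onto the selected components
X_proj = X_std @ W

explained_ratio = eigvals / eigvals.sum()
print("explained variance ratio:", np.round(explained_ratio, 3))
print("projected shape:", X_proj.shape)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues are read as "variance explained".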

In [None]:

# Define the features you want to analyze using PCA
features = ['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin',
            'Resistin', 'MCP.1']

# Extract the features you defined earlier
X_f = cdf[features]

# Normalize the features to have a mean of 0 and a standard deviation of 1
# This helps PCA work better with features on different scales
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_f.to_numpy())

# Perform PCA on the normalized features
# This will reduce the dimensionality of your data while preserving most of the information
pca = PCA()
X_pca = pca.fit_transform(X_norm)

# Convert the PCA transformed data to a DataFrame for easier manipulation
names = [f"PC{i+1}" for i in range(X_pca.shape[1])]  # Create column names for PCA components
X_pcadf = pd.DataFrame(X_pca, columns=names)

# Print the first few rows of the PCA DataFrame to see the transformed data
print("First few rows of the PCA DataFrame:")
print(X_pcadf.head())
print("------------------------------------------------------------------------")

# Print the shape of the PCA DataFrame to see the new dimensionality
print("Shape of the PCA DataFrame:")
print(X_pcadf.shape)



In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pca,y,test_size=0.2,random_state=42 ,shuffle=True)
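
Note that the scaler and the PCA above are fitted on all 116 rows before the split, so the test rows influence the transformation. A possible alternative, sketched here on synthetic data (`X_demo` and `y_demo` are made up, and 3 components with 5 neighbors are arbitrary choices), is to split first and wrap the steps in a scikit-learn `Pipeline` so that everything is fitted on the training rows only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in with the same shape as the real feature matrix
X_demo = rng.normal(size=(116, 9))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Split first, so the test rows are never seen by the scaler or the PCA
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_tr, y_tr)             # all three steps are fitted on training data only
test_accuracy = pipe.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

With only 116 observations the effect of this leakage may be small, but the pipeline makes the comparison with the article cleaner.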

**We will run KNN on the PCA results with different numbers of neighbors to see whether the results improve**

In [None]:

# Define the range of n_neighbors values to test
n_neighbors_values = [3, 5, 7, 9]

# Initialize lists to store evaluation metrics for each n_neighbors value
train_accuracy_scores = []
test_accuracy_scores = []
train_precision_scores = []
test_precision_scores = []
train_recall_scores = []
test_recall_scores = []
train_f1_scores = []
test_f1_scores = []
train_conf_matrices = []
test_conf_matrices = []

# Loop over different values of n_neighbors
for n_neighbors in n_neighbors_values:
    # Create and fit the k-NN classifier
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=-1)
    knn.fit(X_train, y_train)
    
    # Predictions on training and test sets
    y_train_predicted = knn.predict(X_train)
    y_test_predicted = knn.predict(X_test)
    
    # Compute evaluation metrics for the training set
    train_accuracy_scores.append(accuracy_score(y_train, y_train_predicted))
    train_precision_scores.append(precision_score(y_train, y_train_predicted))
    train_recall_scores.append(recall_score(y_train, y_train_predicted))
    train_f1_scores.append(f1_score(y_train, y_train_predicted))
    train_conf_matrices.append(confusion_matrix(y_train, y_train_predicted))
    
    # Compute evaluation metrics for the test set
    test_accuracy_scores.append(accuracy_score(y_test, y_test_predicted))
    test_precision_scores.append(precision_score(y_test, y_test_predicted))
    test_recall_scores.append(recall_score(y_test, y_test_predicted))
    test_f1_scores.append(f1_score(y_test, y_test_predicted))
    test_conf_matrices.append(confusion_matrix(y_test, y_test_predicted))

# Print evaluation metrics for each value of n_neighbors
for i, n_neighbors in enumerate(n_neighbors_values):
    print(f"n_neighbors = {n_neighbors}:")
    print("Training Set Evaluation Metrics:")
    print("Accuracy Score:", train_accuracy_scores[i])
    print("Precision Score:", train_precision_scores[i])
    print("Recall Score:", train_recall_scores[i])
    print("F1 Score:", train_f1_scores[i])
    print("Confusion Matrix:")
    print(train_conf_matrices[i])
    print("------------------------------------------------------")
    print("Test Set Evaluation Metrics:")
    print("Accuracy Score:", test_accuracy_scores[i])
    print("Precision Score:", test_precision_scores[i])
    print("Recall Score:", test_recall_scores[i])
    print("F1 Score:", test_f1_scores[i])
    print("Confusion Matrix:")
    print(test_conf_matrices[i])
    print("======================================================")


As you can see, using all the PCA components together does not improve the results. We will therefore investigate the PCA components and, at the end, rerun the test on fewer components.

# Analysis of components

In [None]:
# Print the singular values
singular_values=pca.singular_values_
print("Singular values of PCA:", singular_values)

In PCA, the singular values describe how the variance of the data is distributed across the components. Each singular value corresponds to one principal component (PC), and its square is proportional to the variance that component captures, so larger values indicate more important directions. By keeping only the k largest, we reduce the complexity of the data while preserving its key patterns. This analysis helps us choose the number of PCs and identify the dominant patterns in the data.
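
The link between singular values and variance can be checked numerically. The sketch below uses synthetic data and plain NumPy: for centered data with n rows, each squared singular value divided by n - 1 equals the matching eigenvalue of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data with columns of very different spread
X = rng.normal(size=(116, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)          # PCA always works on centered data
n = Xc.shape[0]

# Singular values of the centered data (returned in descending order)
s = np.linalg.svd(Xc, compute_uv=False)
var_from_svd = s**2 / (n - 1)

# Eigenvalues of the sample covariance matrix, sorted in descending order
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc.T)))[::-1]

print(np.round(var_from_svd, 4))
print(np.round(eigvals, 4))
```

The same relation is expected to hold between `pca.singular_values_` and `pca.explained_variance_` in scikit-learn.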

In [None]:
# Compute the covariance matrix from the normalized data
cov_matrix = np.cov(X_norm.T)
print("Covariance matrix: ", cov_matrix)

Calculating the covariance matrix from the standardized data allows us to understand the relationships between the variables in the dataset. The covariance matrix describes the linear relationship between each pair of variables, and thus may help us identify the most significant features of the data.

When we perform PCA, we look for the principal components that capture the maximum variance in the dataset. The eigenvectors of the covariance matrix are exactly the principal components we are looking for. Each eigenvector gives a direction (a "component") along which the data varies, and the matching eigenvalue gives the amount of variance along that direction.

In [None]:
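# Calculate eigenvalues and eigenvectors of the covariance matrix
# (eigh is intended for symmetric matrices and always returns real values;
# the pairs are sorted in descending order in the next cell)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)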
print("Eigenvectors:", eigenvectors)
print("Eigenvalues:", eigenvalues)

By calculating the eigenvectors and eigenvalues of the covariance matrix, we obtain the principal components of the dataset. The principal components are the directions of maximum variance in the data, and PCA uses them to transform the original data into a new, lower-dimensional space. The eigenvalues give the amount of variance explained by each principal component: the larger the eigenvalue, the more variance lies along that direction and the more relevant the component is to us.

In [None]:
# Sort the eigenvalues and eigenvectors in descending order
eig_pairs = [(eigenvalues[index], eigenvectors[:, index]) for index in range(len(eigenvalues))]
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Print sorted eigenvalue-eigenvector pairs
print("Sorted eigenvalue-eigenvector pairs:")
print(eig_pairs)

# Extract the sorted eigenvalues and eigenvectors
eigenvalues_sorted = [pair[0] for pair in eig_pairs]
eigenvectors_sorted = [pair[1] for pair in eig_pairs]

# Print sorted eigenvalues
print("Sorted eigenvalues:", eigenvalues_sorted)

Sorting the eigenvalues and eigenvectors in descending order of the eigenvalues allows us to identify the most important components of the data. When we represent the data in the space of the eigenvectors, the first components contain the most variance and are therefore the most important.

After sorting, the first component is the one that contains the most variance, the second component contains the second most, and so on. This lets us efficiently identify the most important directions in the dataset and use them for visualization and analysis.

In [None]:
# Plot explained variance ratio
plt.figure(figsize=(9,10))

# Individual explained variance
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, alpha=0.5, label='Individual explained variance')

# Cumulative explained variance
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='--', label='Cumulative explained variance')

# Highlighting 85% explained variance threshold
plt.axhline(y=0.85, color='r', linestyle='-', label='85% Explained Variance')

plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance Ratio by Principal Components')
plt.legend()
plt.grid(True)
plt.xlim(0, len(eigenvalues_sorted)+1)
plt.ylim(0, 1.01)
plt.show()



The code visualizes the variance explained by each principal component using a bar plot and a line plot. It reads the proportion of variance explained by each component from pca.explained_variance_ratio_ (each eigenvalue divided by the sum of all eigenvalues) and accumulates these proportions to obtain the cumulative explained variance. The bars show the individual variance explained by each principal component, while the dashed line shows the cumulative explained variance as more components are added. These plots help identify the components that capture the most variance and determine how many components are needed to retain a significant portion of the total variance, which is crucial for dimensionality reduction.
(In PCA, we typically aim to retain enough principal components to capture a high percentage of the total variance, e.g. 80-95%.)

Each point on the dashed line is the sum of the explained variance ratios of all components up to and including the current one. The gap between that point and 1 is the share of variance left to the remaining components.
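
Reading the number of components off the plot can also be done programmatically. The sketch below uses hypothetical explained variance ratios (not the values from this dataset) and finds the smallest number of components whose cumulative ratio reaches the 85% line.

```python
import numpy as np

# Hypothetical explained variance ratios, in descending order
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
cumulative = np.cumsum(ratios)

threshold = 0.85
# argmax returns the index of the first True, i.e. the first component
# at which the cumulative ratio reaches the threshold
n_keep = int(np.argmax(cumulative >= threshold)) + 1

print("cumulative:", np.round(cumulative, 2))
print("components needed:", n_keep)
```

With the fitted model from this notebook, `ratios` would be replaced by `pca.explained_variance_ratio_`. scikit-learn can also make this choice itself when `n_components` is given as a fraction, for example `PCA(n_components=0.85)`.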

In [None]:
# Get the explained variance for each principal component
ev = pca.explained_variance_
print(ev)

# Plot the explained variance of each principal component using seaborn lineplot
plt.figure(figsize=(8, 6))
sns.lineplot(x=np.array(X_pcadf.columns), y=ev)
plt.xlabel("Principal Components")
plt.ylabel("Explained variance (eigenvalue)")
plt.title("Explained Variance by Principal Component")
plt.ylim(0, 4)  # Adjust y-axis limit for better visualization
plt.xticks(rotation=45)  # Rotate x-axis labels for better visibility
plt.tight_layout()
plt.show()

Explained variance is a measure of how much of the total variance in the original dataset is explained by each principal component. The explained variance of a principal component is equal to the eigenvalue associated with that component.

In Sklearn PCA, the explained variance of each principal component can be accessed through the explained_variance_ attribute. For example, if pca is a Sklearn PCA object, pca.explained_variance_[i] gives the explained variance of the i-th principal component.

The total explained variance of a set of principal components is simply the sum of the explained variance of those components.

Principal components with higher variance (or explained variance ratio) are considered more important as they capture more information from the original dataset. Components with lower variance may contain less relevant information and may be discarded if the goal is to reduce dimensionality while preserving most of the variability in the data.
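
A small detail explains why the explained variances of 9 standardized features add up to slightly more than 9 (and why the cumulative plot below uses a y-axis limit of 9.2). `StandardScaler` divides by the population standard deviation, while the covariance matrix and `explained_variance_` divide by n - 1. The sketch below reproduces this on synthetic data of the same shape.

```python
import numpy as np

n_samples, n_features = 116, 9
rng = np.random.default_rng(2)
X = rng.normal(loc=50.0, scale=10.0, size=(n_samples, n_features))  # synthetic data

# Standardize the way StandardScaler does: population std (ddof=0)
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

# np.cov uses the sample convention (n - 1 in the denominator)
eigvals = np.linalg.eigvalsh(np.cov(X_std.T))
total = eigvals.sum()

expected = n_features * n_samples / (n_samples - 1)
print(f"sum of eigenvalues: {total:.4f}")
print(f"p * n / (n - 1):    {expected:.4f}")
```

The explained variance ratios are unaffected, because the same factor appears in every eigenvalue and cancels out.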

In [None]:
# Calculate the cumulative explained variance
evc = np.cumsum(pca.explained_variance_)
print(evc)

# Plot the cumulative explained variance for principal components using seaborn lineplot
plt.figure(figsize=(8, 6))
sns.lineplot(x=np.array(X_pcadf.columns), y=evc)
plt.xlabel("Principal Components")  # Label for x-axis
plt.ylabel("Cumulative Explained Variance")  # Label for y-axis
plt.title("Cumulative Explained Variance by Principal Component")  # Title of the plot
plt.ylim(0, 9.2)  # Set y-axis limit for better visualization
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.tight_layout()  # Adjust layout to prevent labels from being cut off
plt.show()





The plot shows the cumulative explained variance across the principal components.
It helps understand how much of the total variance in the dataset is explained cumulatively as more principal components are considered.
Each point on the line represents the cumulative explained variance when including a certain number of principal components.

In [None]:
# Create a DataFrame with the weight of each feature in each principal component
# Each row of pca.components_ is a unit-length eigenvector; scaling it by the square root
# of the component's explained variance gives the feature-component correlations
loadings = pd.DataFrame(pca.components_, index=names, columns=np.array(features))

# Display the loadings DataFrame
print("Loadings of Features on Principal Components:")
loadings

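The entries show how much each original feature contributes to each PC. They are the eigenvector weights; for standardized features, multiplying them by the square root of the component's explained variance gives (approximately) the correlations between the features and the PCs.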
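
The difference between eigenvector weights and correlations can be verified on synthetic data. In this sketch the columns of `eigvecs` are the components (scikit-learn stores them as rows of `components_`), and the features are standardized with the sample standard deviation so that the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: two correlated columns and one independent column
base = rng.normal(size=(116, 1))
X = np.hstack([
    base + 0.3 * rng.normal(size=(116, 1)),
    base + 0.3 * rng.normal(size=(116, 1)),
    rng.normal(size=(116, 1)),
])
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the covariance matrix, sorted in descending order
eigvals, eigvecs = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_std @ eigvecs                 # principal component scores
loadings = eigvecs * np.sqrt(eigvals)    # scale column k by sqrt(eigenvalue k)

# Correlations between features and scores, computed directly
p = X_std.shape[1]
direct = np.corrcoef(X_std.T, scores.T)[:p, p:]

print(np.round(loadings, 3))
print(np.round(direct, 3))
```

The raw eigenvectors all have unit length, so they show relative contributions within a component, while the scaled loadings are comparable across components.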

In [None]:
# Covariance matrix of the data as estimated by the fitted PCA model (feature space)
pca.get_covariance()

pca.get_covariance() returns the covariance matrix of the data in the (standardized) feature space, reconstructed from the fitted principal components. Its diagonal elements are the variances of the individual features and its off-diagonal elements are the covariances between pairs of features; with all components retained it matches the sample covariance matrix computed earlier. The principal components themselves are uncorrelated by construction, so their own covariance matrix is diagonal, with the eigenvalues (explained variances) on the diagonal.

In [None]:


# Define the features
features = ['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin', 'Resistin', 'MCP.1']

# Extract the features
X_f = cdf[features]
y = cdf['Classification']  # Assuming 'Classification' is the target variable

# Normalize the features
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_f.to_numpy())

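# Perform PCA on the standardized features, keeping the first 5 components
# (fitting on the unscaled X_f first would be discarded by fit_transform anyway)
pca = PCA(n_components=5)
X_pca = pca.fit_transform(X_norm)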


# Define the number of top principal components to consider
num_components = 5

# Define the range of neighbors in KNN
neighbors_range = range(3, 9)

# Convert the PCA transformed data to a DataFrame for easier manipulation
names = [f"PC{i+1}" for i in range(X_pca.shape[1])]  # Create column names for PCA components
X_pcadf = pd.DataFrame(X_pca, columns=names)

# Lists to store evaluation metrics
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
auc_scores = []
conf_matrices = []
specificity_scores = []

# Loop over each number of components from 1 to num_components
for n in range(1, num_components + 1):
    # Select the top n principal components
    X_pca_subset = X_pca[:, :n]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_pca_subset, y, test_size=0.2, random_state=42)

    # Lists to store evaluation metrics for different number of neighbors
    accuracy_scores_n = []
    precision_scores_n = []
    recall_scores_n = []
    f1_scores_n = []
    auc_scores_n = []
    conf_matrices_n = []
    specificity_scores_n = []

    # Loop over each number of neighbors
    for k in neighbors_range:
        # Train the KNN classifier
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)

        # Predict the labels for the test set
        y_pred = knn.predict(X_test)

        # Calculate the evaluation metrics
        accuracy = accuracy_score(y_test, y_pred)
        accuracy_scores_n.append(accuracy)

        precision = precision_score(y_test, y_pred)
        precision_scores_n.append(precision)

        recall = recall_score(y_test, y_pred)
        recall_scores_n.append(recall)

        f1 = f1_score(y_test, y_pred)
        f1_scores_n.append(f1)

        auc = roc_auc_score(y_test, y_pred)  # Calculate AUC
        auc_scores_n.append(auc)

        conf_matrix = confusion_matrix(y_test, y_pred)
        conf_matrices_n.append(conf_matrix)
        
        specificity = conf_matrix[0, 0] / (conf_matrix[0, 0] + conf_matrix[0, 1])  # Calculate Specificity
        specificity_scores_n.append(specificity)

    # Store the evaluation metrics for the best number of neighbors
    best_idx = np.argmax(accuracy_scores_n)
    accuracy_scores.append(accuracy_scores_n[best_idx])
    precision_scores.append(precision_scores_n[best_idx])
    recall_scores.append(recall_scores_n[best_idx])
    f1_scores.append(f1_scores_n[best_idx])
    auc_scores.append(auc_scores_n[best_idx])
    conf_matrices.append(conf_matrices_n[best_idx])
    specificity_scores.append(specificity_scores_n[best_idx])

    best_k = neighbors_range[best_idx]  # Get the best number of neighbors
    print(f"Accuracy with {n} principal components and {best_k} neighbors: {accuracy_scores_n[best_idx]:.4f}")

# Plotting the accuracy vs number of principal components
plt.figure(figsize=(8, 6))
plt.plot(range(1, num_components + 1), accuracy_scores, marker='o')
plt.title('Accuracy vs Number of Principal Components')
plt.xlabel('Number of Principal Components')
plt.ylabel('Accuracy')
plt.xticks(range(1, num_components + 1))
plt.grid(True)
plt.show()

# Print evaluation metrics for the test set
for n in range(num_components):
    print(f"Test Set Evaluation Metrics with {n + 1} Principal Components:")
    print("Accuracy Score:", accuracy_scores[n])
    print("Precision Score:", precision_scores[n])
    print("Recall Score:", recall_scores[n])
    print("F1 Score:", f1_scores[n])
    print("AUC Score:", auc_scores[n])
    print("Specificity:", specificity_scores[n])
    print("Confusion Matrix:")
    print(conf_matrices[n])
    print("--------------------------------")


This code performs a classification task using the k-Nearest Neighbors (KNN) algorithm with different numbers of principal components obtained through PCA. Here is an explanation of the results:

The plot shows how the accuracy of the KNN classifier changes as the number of principal components increases.
Generally, we expect the accuracy to increase as we include more principal components because they capture more variance in the data, providing more information for classification.
However, there may be a point where adding more components does not significantly improve accuracy, or even reduces it due to noise or overfitting.

* The effect of the number of components can be explained as follows: in machine learning, increasing the number of features beyond a certain point can degrade performance due to the curse of dimensionality. As the number of dimensions increases, the volume of the space grows exponentially and the data becomes sparse. This sparsity makes it difficult to estimate the classifier's parameters accurately and hurts distance-based algorithms such as KNN.

* The value of K can be explained as follows: there is no built-in method for finding the best value of K, so candidate values are tested. A common rule of thumb is that the maximum K is k = sqrt(N), where N is the number of samples in the training set, and that K should be odd. For our data this puts K at roughly 1-10 (the code above tests K from 3 to 8). In addition, small values of K are sensitive to noise, while larger values of K give smoother decision boundaries, meaning lower variance but increased bias.
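
The rule of thumb can be turned into a short calculation for this dataset. The sketch assumes the 80/20 split used throughout the notebook; the variable names are illustrative.

```python
import math

n_total = 116
n_test = math.ceil(0.2 * n_total)   # train_test_split rounds the test size up
n_train = n_total - n_test

# Largest K allowed by the sqrt(N) rule, then keep only odd values
k_max = math.isqrt(n_train)
candidate_ks = [k for k in range(1, k_max + 1) if k % 2 == 1]

print("training samples:", n_train)
print("candidate K values:", candidate_ks)
```

Odd values avoid tied votes in a two-class problem, which is why the even numbers are dropped.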

The results show how PCA-based dimensionality reduction affects classification performance.


In [None]:

# Extract the first two principal components
pc1 = X_pca[:, 0]
pc2 = X_pca[:, 1]

# Plotting the first two principal components
plt.figure(figsize=(10, 8))
scatter = plt.scatter(pc1, pc2, c=y, cmap='viridis')
plt.title('Scatter Plot of First Two Principal Components')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')

# Add legend
legend1 = plt.legend(*scatter.legend_elements(),
                    title="Classification",
                    loc="upper right")
plt.gca().add_artist(legend1)

plt.grid(True)
plt.show()



The graph shows a scatter plot of the first two principal components extracted with PCA from the input data. Each point represents one sample, and its color indicates the class given by the value of y.

# Matrix of the results

In [None]:

# Define the number of top principal components to consider
num_components = 5

# Define the range of n_neighbors values to test
n_neighbors_values = [3, 5, 7, 9]

# Create a matrix to store accuracy scores
accuracy_matrix = np.zeros((num_components, len(n_neighbors_values)))

# Loop over each number of components from 1 to num_components
for n in range(1, num_components + 1):
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_pca[:, :n], y, test_size=0.2, random_state=42)
    
    # Loop over different values of n_neighbors
    for i, n_neighbors in enumerate(n_neighbors_values):
        # Create and fit the k-NN classifier
        knn = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=-1)
        knn.fit(X_train, y_train)

        # Evaluate the model on the test set
        accuracy = knn.score(X_test, y_test)

        # Store accuracy in the matrix
        accuracy_matrix[n - 1, i] = accuracy

# Print the accuracy matrix with component and neighbor labels
print("Accuracy Matrix:")
headers = ["Components/Neighbors"] + [f"{n_neighbors} Neighbors" for n_neighbors in n_neighbors_values]
table_data = []
for n in range(1, num_components + 1):
    row = [f"{n} Components"]
    for i, n_neighbors in enumerate(n_neighbors_values):
        row.append(f"{accuracy_matrix[n - 1, i]:.4f}")
    table_data.append(row)

print(tabulate(table_data, headers=headers, tablefmt="pretty"))


This table shows the accuracy scores achieved by the k-NN classifier for each combination of the number of principal components and the number of neighbors.

Interpretation of the results:

* When only one component is used, the accuracy is relatively low for all values of k (number of neighbors). This is expected because a single component may not capture enough information to accurately classify the data.
* As the number of components increases (from 2 to 5), the accuracy generally improves, especially for lower values of k (e.g., 3 or 5 neighbors).
* With more components, the classifier has access to more information about the data structure, leading to better classification performance.
* However, using too many components can also introduce noise or irrelevant information, which can reduce performance. This is evident in the slight drop in accuracy when going from 4 to 5 components.
* The choice of the number of neighbors (k) also affects the classification accuracy. In some cases, increasing the number of neighbors leads to better performance, while in others, a smaller number of neighbors performs better.
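When reading small differences in this table, it helps to keep the size of the test set in mind. The sketch below assumes 24 test samples (20% of 116, rounded up) and uses a hypothetical count of 21 correct predictions; `wilson_interval` is an illustrative helper, not part of the notebook. It shows how much one sample moves the accuracy and how wide an approximate 95% interval around a single accuracy value is.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy estimate."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


n_test = 24                               # assumption: 20% of 116 samples, rounded up
step = 1 / n_test                         # accuracy change caused by one test sample
low, high = wilson_interval(21, n_test)   # hypothetical: 21 of 24 correct

print(f"one test sample changes accuracy by {step:.4f}")
print(f"21/24 correct: interval from {low:.2f} to {high:.2f}")
```

With so few test samples, a drop of one or two correct predictions between neighboring cells of the table is well inside this interval, so such differences should be read with caution.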

Summary:

The results demonstrate the importance of feature extraction (using PCA) and parameter tuning (such as choosing the appropriate number of neighbors) in achieving good classification performance.
It is essential to find a balance between the amount of information retained (number of components) and the complexity of the model (number of neighbors) to avoid overfitting or underfitting the data.
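One possible way to tune both parameters together is a cross-validated grid search over a scikit-learn `Pipeline`. This is a sketch under assumptions, not the method used in this notebook: it runs on synthetic data from `make_classification` (a stand-in with 116 samples and 9 features), and it refits the scaler and PCA inside each fold so that the held-out fold does not influence them.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 116 samples, 9 features.
X_demo, y_demo = make_classification(
    n_samples=116, n_features=9, n_informative=4, n_redundant=2, random_state=42
)

# Scaling, PCA and KNN are chained so each CV fold fits them on its training part only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("knn", KNeighborsClassifier()),
])

# Same grid as in the notebook: 1-5 components, 3/5/7/9 neighbors.
param_grid = {
    "pca__n_components": [1, 2, 3, 4, 5],
    "knn__n_neighbors": [3, 5, 7, 9],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

To try this on the real data, `X_demo` and `y_demo` would be replaced by the unscaled feature table and the target column.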

# Results of each component for each number of neighbors

In [None]:

# Define the features
features = ['Age', 'BMI', 'Glucose', 'Insulin', 'HOMA', 'Leptin', 'Adiponectin', 'Resistin', 'MCP.1']

# Extract the features
X_f = cdf[features]
y = cdf['Classification']  # Assuming 'Classification' is the target variable

# Normalize the features
scaler = StandardScaler()
X_norm = scaler.fit_transform(X_f)

# Perform PCA
pca = PCA()
X_pca = pca.fit_transform(X_norm)

# Define the number of top principal components to consider
num_components = 5

# Define the range of n_neighbors values to test
n_neighbors_values = [3, 5, 7, 9]

# Loop over each number of components from 1 to num_components
for n in range(1, num_components + 1):
    # Select the top n principal components
    X_pca_subset = X_pca[:, :n]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_pca_subset, y, test_size=0.2, random_state=42)
    
    # Loop over different values of n_neighbors
    for n_neighbors in n_neighbors_values:
        # Create and fit the k-NN classifier
        knn = KNeighborsClassifier(n_neighbors=n_neighbors, n_jobs=-1)
        knn.fit(X_train, y_train)

        # Predictions on training and test sets
        y_train_predicted = knn.predict(X_train)
        y_test_predicted = knn.predict(X_test)

        # Compute evaluation metrics for the training set
        train_accuracy_score = accuracy_score(y_train, y_train_predicted)
        train_precision_score = precision_score(y_train, y_train_predicted)
        train_recall_score = recall_score(y_train, y_train_predicted)
        train_f1_score = f1_score(y_train, y_train_predicted)
        train_conf_matrix = confusion_matrix(y_train, y_train_predicted)

        # Compute evaluation metrics for the test set
        test_accuracy_score = accuracy_score(y_test, y_test_predicted)
        test_precision_score = precision_score(y_test, y_test_predicted)
        test_recall_score = recall_score(y_test, y_test_predicted)
        test_f1_score = f1_score(y_test, y_test_predicted)
        test_conf_matrix = confusion_matrix(y_test, y_test_predicted)

        # Print evaluation metrics
        print(f"n_neighbors = {n_neighbors}, Principal Components = {n}:")
        print("Training Set Evaluation Metrics:")
        print("Accuracy Score:", train_accuracy_score)
        print("Precision Score:", train_precision_score)
        print("Recall Score:", train_recall_score)
        print("F1 Score:", train_f1_score)
        print("Confusion Matrix:")
        print(train_conf_matrix)
        print("------------------------------------------------------")
        print("Test Set Evaluation Metrics:")
        print("Accuracy Score:", test_accuracy_score)
        print("Precision Score:", test_precision_score)
        print("Recall Score:", test_recall_score)
        print("F1 Score:", test_f1_score)
        print("Confusion Matrix:")
        print(test_conf_matrix)
        print("======================================================")


# Results

After performing PCA and running the KNN algorithm on the transformed data, the best result was obtained with 2 principal components and 5 neighbors. By dropping the remaining components we get a simpler model that still yields good results.

**To conclude, the results we received are:**
Test Set Evaluation Metrics with 2 Principal Components:
* Accuracy Score: 0.875
* Precision Score: 0.909
* Recall Score: 0.833
* F1 Score: 0.870
* AUC Score: 0.875
* Specificity: 0.917
* Confusion Matrix:

[[11  1]

 [ 2 10]]
 
Surprisingly, these are also the results reported in the article. This means that we were able to reproduce its results, and even improve on them in one respect: we reach the same accuracy with a lower-dimensional representation (fewer components).
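The metrics listed above can be cross-checked directly against the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats the second class as the positive one, which is the reading under which the reported precision and recall are reproduced.

```python
# Confusion matrix as printed above. Assumed layout (scikit-learn's convention):
# rows are true classes, columns are predicted classes, second class is positive.
conf_matrix = [[11, 1],
               [2, 10]]
(tn, fp), (fn, tp) = conf_matrix

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)              # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
# For hard class predictions, ROC AUC equals the mean of recall and specificity.
auc_hard = (recall + specificity) / 2

for name, value in [
    ("Accuracy", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1", f1),
    ("AUC", auc_hard),
    ("Specificity", specificity),
]:
    print(f"{name}: {value:.3f}")
```

The computed values agree with the list above up to rounding in the last digit.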


# Conclusions

* The KNN algorithm with 7 neighbors and 4 features predicts breast cancer with an accuracy of 83%. Although this is slightly lower than the result in the original article (87.5%), it is still high and shows that the algorithm is effective.
* The number of principal components and the number of neighbors are important parameters when classifying with the KNN algorithm. The results show that accuracy rises and falls as these two parameters change.
* The results of the KNN algorithm after PCA show that dimensionality reduction with PCA can improve performance, especially when a small number of principal components is used.
* The small differences between the article's results and ours show that the model remains accurate even after PCA. This underlines the importance of choosing the parameters carefully and suggests that a few components retain the most important information in the data.
* The results highlight the potential of machine learning algorithms, and KNN in particular, for predicting breast cancer. Careful parameter selection, dimensionality reduction techniques such as PCA, and selection of relevant features are key factors in achieving high accuracy.

# Sources
This notebook draws on a number of Kaggle notebooks and on materials from various websites, most of which are listed here:

https://saturncloud.io/blog/what-is-sklearn-pca-explained-variance-and-explained-variance-ratio-difference/

https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html

https://builtin.com/data-science/step-step-explanation-principal-component-analysis

https://www.kaggle.com/code/avikumart/pca-principal-component-analysis-from-scratch

https://rstudio-pubs-static.s3.amazonaws.com/471974_9c6250108efc497789ee5840f24b0db4.html#data-analysis

https://rpubs.com/KAndruszek/471974

https://erdogant.github.io/pca/pages/html/Algorithm.html#normalizing-out-pcs

https://www.kaggle.com/code/abdelrasoul/knn-pca-ml-83

https://www.kaggle.com/code/tanshihjen/module-eda-with-fasteda

https://www.kaggle.com/code/saswattulo/coimbra-breastcancer-prediction-with-91-6-accurac

https://www.kaggle.com/code/benyaminghahremani/coimbra-breast-cancer-classification-feature-eng

https://medium.com/analytics-vidhya/dimensionality-reduction-principal-component-analysis-d1402b58feb1