# Breast Cancer Detection Using K Neighbours Classifier

# By: Toluwani Olukanni

# Problem Statement 
Breast cancer is one of the most common types of cancer among women worldwide, and early detection is critical for effective treatment and improved survival rates. However, current diagnostic methods such as mammography and biopsy can be invasive, time-consuming, and subject to interpretation by human experts. Therefore, there is a need for accurate and efficient tools to detect breast cancer at an early stage.

Using AI to predict breast cancer: AI algorithms can analyze large amounts of medical data and identify patterns that are not easily discernible by humans. Machine learning models can be trained on large datasets of breast cancer patient information to predict the likelihood of an individual having breast cancer based on various factors such as age, family history, and medical history. By analyzing mammography images, AI can also help detect early-stage breast cancer and distinguish between benign and malignant tumors.

# Objective
Our objective is to train a K-Nearest Neighbors (KNN) classifier to predict whether a tumor is benign (0) or malignant (1), and to train a Logistic Regression model on the same data for comparison.

First, import the necessary libraries.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

Load the dataset into a pandas DataFrame.

In [2]:
# Load the dataset
data = pd.read_csv('breast cancer.csv')

In [9]:
data

,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,1
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,1
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,1
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,1
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,1
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,1
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,1


In [3]:
data.head()

,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,symmetry_mean,fractal_dimension_mean,...,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst,diagnosis
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,1
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,1
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,1
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,1


In [6]:
# Count the missing values in each column
null_values = data.isnull().sum()
print(null_values)

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
diagnosis                  0
dtype: int64


Separate the features (X) from the true labels (y).

In [4]:
# Split the dataset into features (X) and labels (y)
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

Split the dataset into training samples (70%) and testing samples (30%)

In [5]:
# Split the dataset into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

After splitting the data, scale the features with StandardScaler. The scaler is fitted on the training set only and then applied to both the training and test sets.

In [7]:
# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
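
To make the fit/transform distinction concrete, here is a minimal standard-library sketch of the standardization step for a single feature. The helper names (`fit_standardizer`, `standardize`) and the toy values are hypothetical and are not taken from the dataset; the sketch assumes the population standard deviation, which is what StandardScaler uses.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_column):
    """Return (mean, std) computed from the training values only."""
    mean = fmean(train_column)
    std = pstdev(train_column)  # population standard deviation (ddof=0)
    if std == 0:
        std = 1.0  # guard against division by zero for a constant feature
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Hypothetical toy values for one feature (illustration only).
train_radius = [10.0, 12.0, 14.0, 16.0, 18.0]
test_radius = [11.0, 20.0]

# Fit on the training values, then reuse the same statistics for the test values.
mean, std = fit_standardizer(train_radius)
train_scaled = standardize(train_radius, mean, std)
test_scaled = standardize(test_radius, mean, std)

print(train_scaled)
print(test_scaled)
```

Reusing the training mean and standard deviation for the test values keeps information about the test set out of the preprocessing step.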

Next, train both the Logistic Regression model and the K-Nearest Neighbors classifier by calling the fit() method on the scaled training examples.

In [10]:
# Create and train the Logistic Regression and kNN models
log_reg = LogisticRegression(random_state=4)
knn = KNeighborsClassifier(n_neighbors=23)

log_reg.fit(X_train_scaled, y_train)
knn.fit(X_train_scaled, y_train)
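
As a rough illustration of what a k-nearest-neighbours classifier does at prediction time, the sketch below takes a majority vote among the k closest training points using Euclidean distance. This is a simplified stand-in, not scikit-learn's implementation; the function name `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Rank the training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most common label among the k neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-scaled two-feature points: 0 = benign, 1 = malignant.
train_points = [(-1.0, -1.2), (-0.8, -0.9), (-1.1, -0.7),
                (1.0, 1.1), (0.9, 1.3), (1.2, 0.8)]
train_labels = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_points, train_labels, (-0.9, -1.0), k=3)
pred_b = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
print(pred_a, pred_b)
```

Because the vote depends on distances, features on larger scales would dominate unscaled data, which is why the features are standardized before fitting. An odd k, such as the 23 used here, avoids tied votes in a two-class problem.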

Now generate predictions for the test samples (30%) so that each model's performance can be evaluated.

In [11]:
# Make predictions on the test set
y_pred_log_reg = log_reg.predict(X_test_scaled)
y_pred_knn = knn.predict(X_test_scaled)

# Algorithm Evaluation

It is now time to evaluate how well the models perform. For binary classification, these are the main metrics to track:
- Accuracy
- Precision
- Recall
- F1-Score
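
For reference, the standard definitions of these metrics are given below. They assume malignant (1) is the positive class, with $TP$, $TN$, $FP$ and $FN$ denoting the counts of true positives, true negatives, false positives and false negatives.

```latex
\begin{align*}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```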

In [12]:
# Calculate metrics
metrics = {
    'Model': ['Logistic Regression', 'k-Nearest Neighbors'],
    'Accuracy': [accuracy_score(y_test, y_pred_log_reg), accuracy_score(y_test, y_pred_knn)],
    'Precision': [precision_score(y_test, y_pred_log_reg), precision_score(y_test, y_pred_knn)],
    'Recall': [recall_score(y_test, y_pred_log_reg), recall_score(y_test, y_pred_knn)],
    'F1 Score': [f1_score(y_test, y_pred_log_reg), f1_score(y_test, y_pred_knn)]
}

# Confusion matrices
cm_log_reg = confusion_matrix(y_test, y_pred_log_reg)
cm_knn = confusion_matrix(y_test, y_pred_knn)

print("Confusion Matrix (Logistic Regression):\n", cm_log_reg)
print("\nConfusion Matrix (k-Nearest Neighbors):\n", cm_knn)

# Display the metrics as a DataFrame
results_df = pd.DataFrame(metrics)
print("\nResults:")
print(results_df)


Confusion Matrix (Logistic Regression):
 [[112   5]
 [  1  53]]

Confusion Matrix (k-Nearest Neighbors):
 [[117   0]
 [  3  51]]

Results:
                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.964912   0.913793  0.981481  0.946429
1  k-Nearest Neighbors  0.982456   1.000000  0.944444  0.971429
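
As a sanity check, the reported metrics can be recomputed by hand from the two confusion matrices above. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, so a binary matrix reads `[[TN, FP], [FN, TP]]`); the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Confusion matrices copied from the output above.
cm_log_reg = [[112, 5], [1, 53]]
cm_knn = [[117, 0], [3, 51]]

for name, cm in [("Logistic Regression", cm_log_reg), ("k-Nearest Neighbors", cm_knn)]:
    rounded = {key: round(value, 6) for key, value in binary_metrics(cm).items()}
    print(name, rounded)
```

The matrices also show the trade-off behind the summary numbers: on this split the k-NN model produces no false positives but misses 3 malignant cases, while Logistic Regression misses 1 malignant case at the cost of 5 false positives.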
