# Breast Cancer Dataset Analysis and KNN Classification Model Report

## 1. Introduction

This report presents an in-depth analysis of the breast cancer dataset and the implementation of a K-Nearest Neighbors (KNN) classification model. The objective is to analyze the dataset, perform preprocessing, build a predictive model, and evaluate its performance.

## 2. Libraries Used

The following libraries were used in the project:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## 3. Dataset Overview

The dataset used in this project is the breast cancer dataset. It contains multiple attributes related to tumor characteristics, and the goal is to predict whether a tumor is malignant (M) or benign (B).

### 3.1 Number of Columns

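The dataset has 32 columns: an `id`, the `diagnosis` label, and 30 numeric features (the mean, standard error, and worst value of ten tumor measurements). The columns include: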

- **id**
- **radius_mean**
- **texture_mean**
- **perimeter_mean**
- **area_mean**
- **smoothness_mean**
- **compactness_mean**
- **concavity_mean**
- **concave points_mean**
- **symmetry_mean**
- **fractal_dimension_mean**
- **diagnosis** (Target Variable: Malignant (M) or Benign (B))

### 3.2 Relationship Between Columns

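- Size-related features such as radius_mean and perimeter_mean tend to be larger for malignant tumors, and texture_mean is also associated with the diagnosis, although this report does not quantify the strength of these relationships.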

- The diagnosis column is the target variable that we aim to predict using KNN classification.
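One way to quantify such relationships is to correlate each feature with a 0/1 version of the diagnosis. The sketch below uses a tiny made-up table (the values and the name `toy` are illustrative assumptions, not rows from the actual dataset), so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy rows; values are invented for illustration only.
toy = pd.DataFrame({
    "radius_mean":  [10, 11, 12, 18, 19, 20],
    "texture_mean": [15, 22, 17, 21, 16, 20],
    "diagnosis":    ["B", "B", "B", "M", "M", "M"],
})

# Encode the target as 1 (malignant) / 0 (benign).
target = (toy["diagnosis"] == "M").astype(int)

# Pearson correlation of each numeric feature with the encoded target.
feature_corr = toy[["radius_mean", "texture_mean"]].corrwith(target)
print(feature_corr.sort_values(ascending=False))
```

Applied to the full dataset, the same pattern would rank all 30 features by their linear association with the diagnosis.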

## 4. Basic Analysis

The dataset was loaded and examined using the following functions:

In [2]:
data = pd.read_csv(r"C:\Users\devad\Downloads\breast-cancer.csv")
print(data.head())
print(data.tail())
print(data.info())
print(data.describe())

### 4.1 Checking for Null Values

In [3]:
print(data.isnull().sum())

id                         0
diagnosis                  0
radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64


The dataset has no missing values, simplifying preprocessing.

## 5. Data Preprocessing

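The `diagnosis` column is the only categorical variable that needs encoding. It was one-hot encoded with `drop_first=True`, which leaves a single indicator column, `diagnosis_M` (True for malignant, False for benign).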

In [4]:
data = pd.get_dummies(data, columns=['diagnosis'], prefix='diagnosis', drop_first=True)
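To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up four-row frame (the name `toy` and its values are assumptions for illustration). The categories are sorted alphabetically, so the `B` indicator is dropped and only `diagnosis_M` remains.

```python
import pandas as pd

# Hypothetical four-row frame; values are invented for illustration.
toy = pd.DataFrame({
    "diagnosis":   ["M", "B", "B", "M"],
    "radius_mean": [17.99, 11.42, 12.45, 20.57],
})

# Same call as in the report, applied to the toy frame.
encoded = pd.get_dummies(toy, columns=["diagnosis"], prefix="diagnosis", drop_first=True)

print(encoded.columns.tolist())
print(encoded["diagnosis_M"].astype(int).tolist())
```

The dtype of the indicator column depends on the pandas version (boolean in recent releases), which is why the cast to `int` is used before printing.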

## 6. Model Building

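The `diagnosis_M` column is the target variable (`y`). The remaining 31 columns are used as features (`x`); note that this includes the `id` column, which is an identifier rather than a tumor measurement.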

In [5]:
from sklearn.model_selection import train_test_split

x = data.drop(['diagnosis_M'], axis=1)
y = data['diagnosis_M']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

x_train: (455, 31)
x_test: (114, 31)
y_train: (455,)
y_test: (114,)


### 6.1 Applying KNN Classification

In [7]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=15, p=2, metric='euclidean')
model = knn.fit(x_train, y_train)
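KNN classifies by Euclidean distance, so a column with a very large numeric range (such as an identifier) can dominate the distance computation. The feature matrix above keeps `id` and is not scaled, which may contribute to the results reported below. The following is a sketch on synthetic data, not the author's method: the frame `demo`, its columns `feat_a` and `feat_b`, and all values are assumptions made for illustration. It compares a raw KNN that sees an identifier column with a pipeline that drops the identifier and standardizes the features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two informative features plus an arbitrary id.
rng = np.random.default_rng(0)
n = 400
labels = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "id": rng.integers(10_000, 1_000_000_000, size=n),
    "feat_a": rng.normal(loc=3.0 * labels, scale=1.0),
    "feat_b": rng.normal(loc=300.0 * labels, scale=100.0),
    "diagnosis_M": labels.astype(bool),
})

y_demo = demo["diagnosis_M"]
X_clean = demo.drop(columns=["id", "diagnosis_M"])   # identifier removed
X_raw = demo.drop(columns=["diagnosis_M"])           # identifier kept, as in the report

Xc_train, Xc_test, yd_train, yd_test = train_test_split(
    X_clean, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Standardize features, then apply KNN with the same k as the report.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15))
scaled_knn.fit(Xc_train, yd_train)
scaled_acc = scaled_knn.score(Xc_test, yd_test)

# Baseline on the same split: unscaled features with the id column included.
raw_knn = KNeighborsClassifier(n_neighbors=15)
raw_knn.fit(X_raw.loc[Xc_train.index], yd_train)
raw_acc = raw_knn.score(X_raw.loc[Xc_test.index], yd_test)

print(f"raw (with id, unscaled): {raw_acc:.3f}")
print(f"scaled (id dropped):     {scaled_acc:.3f}")
```

On the real data the equivalent change would be `data.drop(['id', 'diagnosis_M'], axis=1)` for the features plus a `StandardScaler` step; whether it improves the reported scores would need to be checked by rerunning the notebook.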

## 7. Model Evaluation

### 7.1 Predictions

In [8]:
y_pred = model.predict(x_test)

### 7.2 Confusion Matrix

In [9]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print("The KNN Confusion Matrix:")
print(cm)

The KNN Confusion Matrix:
[[63  4]
 [33 14]]
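The matrix above can be read by hand: rows are the true classes and columns the predicted classes, in the order False (benign), True (malignant). The short sketch below recomputes the headline metrics from those four counts using plain arithmetic; the variable names are illustrative.

```python
# Confusion matrix as printed above: rows = true class, columns = predicted class,
# ordered [False (benign), True (malignant)].
cm_values = [[63, 4],
             [33, 14]]

tn, fp = cm_values[0]   # benign cases: correctly kept benign, wrongly flagged malignant
fn, tp = cm_values[1]   # malignant cases: missed, correctly detected

total = tn + fp + fn + tp
accuracy_from_cm = (tp + tn) / total
recall_malignant = tp / (tp + fn)      # share of malignant tumors detected
precision_malignant = tp / (tp + fp)   # share of malignant predictions that were right
recall_benign = tn / (tn + fp)         # share of benign tumors recognized

print(f"accuracy:            {accuracy_from_cm:.4f}")
print(f"recall (malignant):  {recall_malignant:.4f}")
print(f"precision (malign.): {precision_malignant:.4f}")
print(f"recall (benign):     {recall_benign:.4f}")
```

These values agree with the accuracy score and classification report in the following subsections: 33 of the 47 malignant test cases were predicted as benign.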


### 7.3 Accuracy Score

In [10]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred) * 100
print("The KNN Accuracy Score:", accuracy)

The KNN Accuracy Score: 67.54385964912281


### 7.4 Classification Report

In [11]:
from sklearn.metrics import classification_report

report = classification_report(y_test, y_pred, zero_division=0)
print("The Classification Report:")
print(report)

The Classification Report:
              precision    recall  f1-score   support

       False       0.66      0.94      0.77        67
        True       0.78      0.30      0.43        47

    accuracy                           0.68       114
   macro avg       0.72      0.62      0.60       114
weighted avg       0.71      0.68      0.63       114



## 8. Conclusion

- The dataset contained **no missing values**, simplifying preprocessing.

- The `diagnosis` column was **converted to numeric values** using one-hot encoding.

- The **KNN classification model** was implemented with `k=15` and Euclidean distance.

- The **confusion matrix** shows that the model correctly classified 63 of 67 benign cases but only 14 of 47 malignant cases.

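- The **accuracy score** is 67.54%, only modestly above the 58.8% that would result from predicting benign for every test case (67 of 114).

- The **classification report** shows high recall for benign cases (0.94) but low recall for malignant cases (0.30), so most malignant tumors in the test set were missed.

- Because KNN is distance-based, excluding the `id` column and standardizing the features would be a natural first step; alternative models (such as Support Vector Machines or Random Forest) could also be explored for improved accuracy.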
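A comparison of alternative models could be set up with cross-validation, as in the hedged sketch below. It runs on synthetic data from `make_classification` (an assumption, standing in for the real feature matrix), and the candidate settings are illustrative defaults rather than tuned choices. Recall is used as the score because missed malignant cases were the main weakness observed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (assumption), not the breast cancer dataset.
X_synth, y_synth = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Distance- and margin-based models get a scaling step; the forest does not need one.
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {}
for name, estimator in candidates.items():
    # 5-fold cross-validated recall for the positive class.
    scores = cross_val_score(estimator, X_synth, y_synth, cv=5, scoring="recall")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean recall = {cv_scores[name]:.3f}")
```

Swapping `X_synth` and `y_synth` for the report's features and `diagnosis_M` target would give a like-for-like comparison on the actual data.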

This report provides a comprehensive analysis and evaluation of the breast cancer dataset using KNN classification.