<a href="https://colab.research.google.com/github/Hesam-h-j/Bioinformatics-Biostatostics/blob/main/Breast_Cancer_KNN_Classifier/Breast_Cancer_KNN_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data Loading and Initial Inspection

This section downloads the breast cancer dataset through KaggleHub, loads it into a pandas DataFrame, and displays the column names as a first overview of the available features. `kagglehub.dataset_download` fetches the dataset and returns the local directory path; `pd.read_csv` then reads `data.csv` from that directory.

In [None]:
import kagglehub
import pandas as pd

# Download the dataset
path = kagglehub.dataset_download("uciml/breast-cancer-wisconsin-data")
print("Dataset downloaded to:", path)

# Read the CSV
# The main file in this dataset is named 'data.csv'
df = pd.read_csv(f"{path}/data.csv")

df.columns

Using Colab cache for faster access to the 'breast-cancer-wisconsin-data' dataset.
Dataset downloaded to: /kaggle/input/breast-cancer-wisconsin-data


Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')
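
The last entry, `Unnamed: 32`, is not a measurement. pandas assigns a name of this form to a column whose header cell is empty, which typically happens when every CSV row ends with a trailing comma. The sketch below reproduces the effect on a tiny made-up CSV (the contents of `toy_csv` are hypothetical, not rows from the real file) and drops the empty column. Whether the real column is entirely empty is an assumption that can be checked with `df['Unnamed: 32'].isna().all()`.

```python
import io
import pandas as pd

# Hypothetical three-column CSV whose rows end with a trailing comma,
# mimicking the layout that would produce an 'Unnamed: N' column.
toy_csv = "id,diagnosis,radius_mean,\n1,M,10.0,\n2,B,12.0,\n"
toy_df = pd.read_csv(io.StringIO(toy_csv))
print(toy_df.columns.tolist())

# Drop every column that contains only missing values.
cleaned = toy_df.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```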

### Data Preparation: Separating Features (x) and Target (y)

Before training, we separate the dataset into features (the input variables, `x`) and the target (the value to predict, `y`). `x` takes the columns at positions 2 through 31 (`iloc[:, 2:32]`; the end of the slice is exclusive), which are the 30 measurement columns. This skips `id` and `diagnosis` at positions 0 and 1 and the trailing `Unnamed: 32` column at position 32. `y` is the `diagnosis` column, whose labels ('M' for malignant, 'B' for benign) the model will learn to predict.

In [None]:
x = df.iloc[:,2:32]
y = df.iloc[:,1]
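
Positional slicing depends on the column order staying fixed. As a hedged alternative, the same selection can be written by column name. The sketch uses a hypothetical miniature frame (`toy`, with made-up values) that only mimics the roles of the real columns.

```python
import pandas as pd

# Hypothetical miniature frame with the same column roles as the dataset.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [10.0, 12.0, 14.0],
    "texture_mean": [20.0, 22.0, 24.0],
    "Unnamed: 32": [float("nan")] * 3,
})

# Positional selection, as in the cell above (here the features sit in columns 2..3).
x_pos = toy.iloc[:, 2:4]

# Name-based selection: drop everything that is not a measurement.
x_named = toy.drop(columns=["id", "diagnosis", "Unnamed: 32"])
y_named = toy["diagnosis"]

print(x_named.columns.tolist())
print(x_pos.equals(x_named))
```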

### Data Splitting: Training and Testing Sets

To estimate how well the model generalizes to new, unseen data, we split the dataset into training and testing sets using `train_test_split` from `sklearn.model_selection`.
- `x_train`, `y_train`: Used to train the model.
- `x_test`, `y_test`: Used to evaluate the model's performance.

`test_size=0.2` allocates 20% of the data for testing, and `random_state=42` ensures the split is reproducible, meaning you'll get the same split every time you run the code.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=42, test_size=0.2)
print(x_train.shape)
print(x_test.shape)

(455, 30)
(114, 30)
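
The two shapes follow directly from the 80/20 split of the 569 rows. A small arithmetic check, assuming (as scikit-learn does to my understanding) that a fractional `test_size` is rounded up and the training set receives the remainder:

```python
import math

n_samples = 455 + 114          # total rows, from the shapes printed above
test_size = 0.2

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 455 114
```

If the class balance matters, `train_test_split` also accepts a `stratify` argument (for example `stratify=y`) that keeps the M/B ratio similar in both sets.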


### Inspecting the Training Features

This cell displays the first five rows of `x_train` as a quick check that the feature selection and the split worked as intended. The leftmost column is the original DataFrame row index; it is out of order because `train_test_split` shuffles the rows before splitting.

In [None]:
x_train.head()

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 68 | 9.029 | 17.33 | 58.79 | 250.5 | 0.1066 | 0.1413 | 0.313 | 0.04375 | 0.2111 | 0.08046 | ... | 10.31 | 22.65 | 65.5 | 324.7 | 0.1482 | 0.4365 | 1.252 | 0.175 | 0.4228 | 0.1175 |
| 181 | 21.09 | 26.57 | 142.7 | 1311.0 | 0.1141 | 0.2832 | 0.2487 | 0.1496 | 0.2395 | 0.07398 | ... | 26.68 | 33.48 | 176.5 | 2089.0 | 0.1491 | 0.7584 | 0.678 | 0.2903 | 0.4098 | 0.1284 |
| 63 | 9.173 | 13.86 | 59.2 | 260.9 | 0.07721 | 0.08751 | 0.05988 | 0.0218 | 0.2341 | 0.06963 | ... | 10.01 | 19.23 | 65.59 | 310.1 | 0.09836 | 0.1678 | 0.1397 | 0.05087 | 0.3282 | 0.0849 |
| 248 | 10.65 | 25.22 | 68.01 | 347.0 | 0.09657 | 0.07234 | 0.02379 | 0.01615 | 0.1897 | 0.06329 | ... | 12.25 | 35.19 | 77.98 | 455.7 | 0.1499 | 0.1398 | 0.1125 | 0.06136 | 0.3409 | 0.08147 |
| 60 | 10.17 | 14.88 | 64.55 | 311.9 | 0.1134 | 0.08061 | 0.01084 | 0.0129 | 0.2743 | 0.0696 | ... | 11.02 | 17.45 | 69.86 | 368.6 | 0.1275 | 0.09866 | 0.02168 | 0.02579 | 0.3557 | 0.0802 |


### Model Training: K-Nearest Neighbors (KNN) Classifier

Here, we initialize and fit a K-Nearest Neighbors (KNN) classifier.
- `KNeighborsClassifier(n_neighbors=17, metric='euclidean')`: Creates a KNN model that classifies a sample by majority vote among its 17 nearest training samples, using Euclidean distance to measure how close two samples are.
- `model.fit(x_train, y_train)`: Fits the model on the training features (`x_train`) and their corresponding diagnoses (`y_train`). KNN is a lazy learner: fitting mainly stores the training data (optionally in a search structure), and the comparison with neighbors happens at prediction time.
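
What the classifier does at prediction time can be sketched in a few lines of NumPy. This is an illustrative re-implementation under simplifying assumptions (brute-force search, arbitrary tie-breaking), not scikit-learn's actual code; `knn_predict`, `toy_x`, and `toy_y` are hypothetical names and made-up data.

```python
from collections import Counter
import numpy as np

def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training row.
    distances = np.sqrt(((train_x - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature toy data: two 'B' points near the origin, three 'M' points further out.
toy_x = np.array([[1.0, 1.0], [1.5, 2.0], [6.0, 6.0], [7.0, 5.5], [6.5, 7.0]])
toy_y = np.array(["B", "B", "M", "M", "M"])

print(knn_predict(toy_x, toy_y, np.array([1.2, 1.4]), k=3))  # B (two 'B' votes, one 'M')
print(knn_predict(toy_x, toy_y, np.array([6.4, 6.1]), k=3))  # M (three 'M' votes)
```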

In [None]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=17, metric='euclidean')
model.fit(x_train, y_train)
model
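
Euclidean distance sums squared differences over all features, so features with large numeric ranges (such as `area_mean`, in the hundreds) can outweigh features with small ranges (such as `smoothness_mean`, around 0.1). A common remedy is to standardize the features before KNN. The sketch below is one way to do that, not part of the original workflow. It assumes scikit-learn's bundled `load_breast_cancer` data can stand in for the Kaggle CSV (its labels are encoded as integers rather than 'M'/'B'), and it makes no claim about the resulting score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled dataset stands in for the Kaggle CSV.
features, labels = load_breast_cancer(return_X_y=True)
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, random_state=42, test_size=0.2
)

# Inside the pipeline the scaler is fitted on the training split only,
# then applied unchanged to the test split.
scaled_knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=17, metric="euclidean"),
)
scaled_knn.fit(f_train, l_train)
scaled_score = scaled_knn.score(f_test, l_test)
print(scaled_score)
```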

### Making Predictions on the Test Set

After training, this cell calls `model.predict()` on `x_test`. Each test row receives the diagnosis that wins the majority vote among its 17 nearest training samples. The resulting `y_pred` array holds one predicted label, 'M' or 'B', per test sample.

In [None]:
y_pred = model.predict(x_test)
y_pred

array(['B', 'M', 'M', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'B',
       'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'B', 'B',
       'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'M',
       'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M',
       'B', 'B', 'B', 'B', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'B',
       'B', 'M', 'M', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'B',
       'B', 'B', 'M', 'B', 'B', 'M', 'M', 'M', 'B', 'M', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'M', 'B', 'M', 'M',
       'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'B', 'M'], dtype=object)

### Displaying True Labels of the Test Set

This cell displays the actual `diagnosis` values (`y_test`) for the test set. Converting `y_test` to a NumPy array with `.values` drops the index, so the true labels print in the same layout as the model's predictions (`y_pred`) and are easier to compare.

In [None]:
y_test.values

array(['B', 'M', 'M', 'B', 'B', 'M', 'M', 'M', 'B', 'B', 'B', 'M', 'B',
       'M', 'B', 'M', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'B', 'B',
       'B', 'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'B', 'M',
       'B', 'B', 'M', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'M', 'M',
       'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'B', 'M', 'M', 'B', 'B',
       'B', 'M', 'M', 'B', 'B', 'M', 'M', 'B', 'M', 'B', 'B', 'B', 'M',
       'B', 'B', 'M', 'B', 'M', 'M', 'M', 'M', 'M', 'M', 'B', 'B', 'B',
       'B', 'B', 'B', 'B', 'B', 'M', 'M', 'B', 'M', 'M', 'B', 'M', 'M',
       'B', 'B', 'B', 'M', 'B', 'B', 'M', 'B', 'B', 'M'], dtype=object)
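
Comparing two long arrays by eye is error-prone. A minimal sketch of locating the disagreements programmatically, shown on short made-up arrays (`pred` and `true` are hypothetical stand-ins for `y_pred` and `y_test.values`):

```python
import numpy as np

# Hypothetical stand-ins for y_pred and y_test.values.
pred = np.array(["B", "M", "B", "B", "M", "B"])
true = np.array(["B", "M", "M", "B", "M", "M"])

# Positions where the prediction disagrees with the true label.
wrong = np.flatnonzero(pred != true)
print(wrong)  # [2 5]

# (true, predicted) pairs at those positions.
print(list(zip(true[wrong], pred[wrong])))
```

Applied to the two arrays printed above, the same comparison flags positions 58, 77, 82, and 86, each a true 'M' predicted as 'B'.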

### Model Evaluation: Calculating Accuracy Score

This cell computes the accuracy of the trained KNN model with `accuracy_score` from `sklearn.metrics`. Accuracy is the proportion of test samples whose predicted label (`y_pred`) matches the true label (`y_test`). Here that is 110 of 114 samples, about 0.965; comparing the two arrays above shows that all four errors are malignant samples predicted as benign. Accuracy weights both kinds of error equally, so for a diagnostic task it is worth looking at per-class metrics such as recall as well.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.9649122807017544
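
Accuracy alone does not show which class the errors fall in. A hedged sketch of the per-class view, using `confusion_matrix` and `recall_score` from `sklearn.metrics` on made-up labels (`true_labels` and `pred_labels` are hypothetical, not the test set above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels, not taken from the test set above.
true_labels = ["M", "M", "M", "B", "B", "B", "B", "B"]
pred_labels = ["M", "M", "B", "B", "B", "B", "B", "B"]

acc = accuracy_score(true_labels, pred_labels)
print(acc)  # 0.875 (7 of 8 correct)

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(true_labels, pred_labels, labels=["B", "M"])
print(cm)

# Recall for the malignant class: the share of true 'M' samples that were found.
m_recall = recall_score(true_labels, pred_labels, pos_label="M")
print(m_recall)
```

In the notebook, the same calls would take `y_test` and `y_pred` as arguments.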

### Extracting a Single Sample for Prediction Demonstration

To demonstrate how the trained model can make a prediction on an individual data point, this cell extracts a specific sample from the `x_test` dataset. `x_test.values[12]` gets the 13th sample (remember Python uses 0-based indexing) as a NumPy array. This array `a` represents all the feature measurements for that single instance.

In [None]:
a = x_test.values[12]
a

array([1.497e+01, 1.976e+01, 9.550e+01, 6.902e+02, 8.421e-02, 5.352e-02,
       1.947e-02, 1.939e-02, 1.515e-01, 5.266e-02, 1.840e-01, 1.065e+00,
       1.286e+00, 1.664e+01, 3.634e-03, 7.983e-03, 8.268e-03, 6.432e-03,
       1.924e-02, 1.520e-03, 1.598e+01, 2.582e+01, 1.023e+02, 7.821e+02,
       1.045e-01, 9.995e-02, 7.750e-02, 5.754e-02, 2.646e-01, 6.085e-02])

### Predicting on a Single New Sample

This final step uses the trained `model` to predict the diagnosis for the single sample `a` extracted previously.
1. **`a_df = pd.DataFrame([a], columns=x_train.columns)`**: The one-dimensional array `a` is wrapped in a list and converted into a single-row pandas DataFrame (`a_df`). `model.predict()` expects two-dimensional input with one row per sample, so a bare 1-D array would be rejected. Reusing the training column names also avoids the feature-name warning that recent scikit-learn versions emit when a model fitted on a DataFrame receives a plain NumPy array.
2. **`model.predict(a_df)`**: The model takes this single-row DataFrame and returns its predicted diagnosis ('B' for benign or 'M' for malignant) for that set of features.

In [None]:
# Convert the numpy array 'a' to a DataFrame with the same columns as x_train
a_df = pd.DataFrame([a], columns=x_train.columns)

# Now predict using the DataFrame
model.predict(a_df)

array(['B'], dtype=object)
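
The round trip through a NumPy array can be avoided by selecting the row with a list index, which keeps the result as a one-row DataFrame with its column names intact (in the notebook this would be `x_test.iloc[[12]]`). A minimal sketch on made-up data; `toy_x`, `toy_y`, and `toy_model` are hypothetical stand-ins for `x_train`, `y_train`, and `model`:

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy features and labels standing in for x_train / y_train.
toy_x = pd.DataFrame({"f1": [1.0, 1.2, 0.9, 6.0, 6.3, 5.8],
                      "f2": [1.1, 0.8, 1.0, 6.2, 5.9, 6.1]})
toy_y = pd.Series(["B", "B", "B", "M", "M", "M"])
toy_model = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(toy_x, toy_y)

# Double brackets keep the result two-dimensional: a one-row DataFrame
# that still carries the column names.
row_df = toy_x.iloc[[4]]
print(row_df.shape)  # (1, 2)

single_pred = toy_model.predict(row_df)
print(single_pred)
```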