## Description of the Task

The objective of this task is to apply a **K-Nearest Neighbors (KNN)** classifier to a medical dataset
and study how individual features influence the model’s performance.

Specifically, the task involves:

- Training a KNN model to classify breast tumors as **malignant** or **benign**
- Evaluating the model using multiple performance metrics
- Performing a **feature ablation study**, where one feature is removed at a time to observe
  its impact on classification accuracy and reliability

This task helps in understanding both **how KNN works** and **why feature selection is important**,
especially in sensitive domains like healthcare.

## Understanding K-Nearest Neighbors (KNN)

K-Nearest Neighbors is a **supervised, distance-based classification algorithm**.
KNN does not fit an explicit model during training. It stores the labeled training data
and classifies a new data point by comparing it with those stored examples.

When a new data point is given, the algorithm:

1. Calculates the distance between the new point and all training points  
2. Selects the **K nearest data points**  
3. Assigns the class that appears most frequently among those neighbors  

In this task, **K = 5**, meaning the algorithm looks at the **5 closest tumors** and predicts
the class based on **majority voting**.
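
The three steps above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the code used in this task: the function name `knn_predict` and the toy 2D points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: keep the k closest points.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, made-up 2D points (not taken from the real dataset).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (3.0, 3.2), (3.1, 2.9), (2.8, 3.0), (3.3, 3.1)]
labels = ["benign", "benign", "benign",
          "malignant", "malignant", "malignant", "malignant"]

prediction = knn_predict(points, labels, query=(2.9, 3.0), k=5)
print(prediction)
```

With two classes, an odd K such as 5 means the vote can never end in a tie.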


## Visual Intuition of KNN

Imagine a 2D graph where:

- Each dot represents a tumor  
- **Red dots** represent malignant tumors  
- **Blue dots** represent benign tumors  

When a new tumor appears on the graph, picture a circle around it that is just large enough
to contain its K nearest neighbors. KNN looks only at the points inside that circle.

- If most nearby points are red → **Predicted malignant**
- If most nearby points are blue → **Predicted benign**

This intuitive **“neighborhood voting”** is the core idea behind KNN.

## Importance of Feature Scaling in KNN

KNN relies on **distance calculations**, usually Euclidean distance.  
However, tumor features exist on very different numerical scales:

- Area-related features can have values in the **hundreds or thousands**
- Smoothness-related features are **small decimal values** (around 0.1)

If features are not scaled:

- Large-valued features dominate distance calculations
- Small-valued features are effectively ignored, which usually makes the model less accurate

To address this issue, **StandardScaler** is used to standardize each feature to zero mean
and unit variance, so that all features contribute on a comparable scale to the distance computation.
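
The effect can be checked on two real features. The sketch below assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`) and its column names `mean area` and `mean smoothness`, which may differ from the file used in this task. The scaler is fitted on all rows here purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
names = list(data.feature_names)
cols = [names.index("mean area"), names.index("mean smoothness")]
X = data.data[:, cols]

# Raw spread: the two columns differ by several orders of magnitude.
print("raw std:   ", X.std(axis=0))

# Illustration only: a real workflow fits the scaler on the training split.
X_scaled = StandardScaler().fit_transform(X)
print("scaled std:", X_scaled.std(axis=0))

def area_share(a, b):
    """Fraction of the squared Euclidean distance that comes from column 0 (area)."""
    squared_gap = (a - b) ** 2
    return squared_gap[0] / squared_gap.sum()

# Compare the first two tumors before and after scaling.
raw_share = area_share(X[0], X[1])
scaled_share = area_share(X_scaled[0], X_scaled[1])
print(f"area share of distance: raw={raw_share:.4f}, scaled={scaled_share:.4f}")
```

Before scaling, almost all of the distance between the two tumors comes from the area column. After scaling, smoothness has a real say in which neighbors count as close.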

## Dataset Used

The **Breast Cancer Wisconsin (Diagnostic)** dataset was used in this task.

- **Total samples:** 569  
- **Target variable:** `diagnosis`  
  - Malignant (M) → 1  
  - Benign (B) → 0  
- **Features:**  
  - 30 numerical measurements related to tumor size, shape, and texture  
- **Missing values:** None  

This dataset is widely used for medical classification and benchmarking
machine learning models.

## Approach Followed to Solve the Task

### 1. Data Preprocessing

The preprocessing steps were handled using a **machine learning pipeline** to ensure
consistency and robustness.

The steps included:

- Encoding the target variable (M → 1, B → 0)
- Separating features (X) and target (y)
- Splitting the dataset into training (80%) and testing (20%) sets
  using **stratified sampling**
- Handling missing values using **mean imputation** (a safeguard, since this dataset has none)
- Normalizing all features using **StandardScaler**

Using a pipeline ensures that the imputer and scaler are fitted on the training data only
and then applied unchanged to the test data, which prevents data leakage.
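
A minimal sketch of these preprocessing steps is shown below. It assumes scikit-learn's bundled copy of the dataset rather than a file with a `diagnosis` column, and the `random_state=42` seed and step names are arbitrary choices for the example. Note that the bundled copy codes malignant as 0, so the labels are flipped to match the M → 1, B → 0 encoding used here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
# The bundled copy codes malignant as 0, so flip it: malignant -> 1, benign -> 0.
y = (data.target == 0).astype(int)

# 80/20 split; stratify=y keeps the malignant share the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Column means and standard deviations are learned from the training rows only...
X_train_scaled = preprocess.fit_transform(X_train)
# ...and then reused, unchanged, on the test rows.
X_test_scaled = preprocess.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
print("malignant share:", y_train.mean().round(3), y_test.mean().round(3))
```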

### 2. Training the KNN Model

A **KNN classifier** was trained using a pipeline that combined preprocessing
and classification steps.

- **Number of neighbors (K):** 5  
- Preprocessing (imputation and scaling) and model training were performed together
  using a single pipeline  

This approach simplifies the workflow and ensures that the same transformations
are applied whenever the model is trained or evaluated.
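
One way such a combined pipeline might look is sketched below. As before, the bundled scikit-learn dataset, the `random_state=42` seed, and the step names are assumptions made for the example, not details taken from the original code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, (data.target == 0).astype(int)  # malignant -> 1, benign -> 0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

knn_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

# fit() learns the imputer and scaler statistics and stores the training points.
knn_pipeline.fit(X_train, y_train)
# predict() applies the fitted transforms to the test rows, then runs the vote.
y_pred = knn_pipeline.predict(X_test)
test_accuracy = knn_pipeline.score(X_test, y_test)
print("test accuracy:", round(test_accuracy, 3))
```

Because the scaler lives inside the pipeline, there is no way to forget it at prediction time.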

### 3. Model Evaluation

To evaluate the model reliably, the following metrics were used:

- **Accuracy** – fraction of all predictions that are correct
- **Precision** – fraction of tumors predicted malignant that are actually malignant
- **Recall** – fraction of actual malignant tumors that the model detects
- **F1-score** – harmonic mean of precision and recall

Using multiple metrics is essential in medical datasets: a model can reach high accuracy
while still missing malignant tumors, and only recall exposes that.
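
To make the four metrics concrete, the sketch below computes them from true-positive, false-positive, and false-negative counts. The helper name `classification_metrics` and the label vectors are made up for illustration and are not results from this task.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.8 and precision is 1.0, yet recall is only 0.5: half of the malignant cases are missed. In practice, `sklearn.metrics` provides `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` for the same calculations.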

## Feature Ablation Study

Feature ablation is a technique used to understand **feature importance** by:

- Removing one feature at a time  
- Retraining the model  
- Observing the change in performance  

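If removing a feature causes a significant drop in performance,
that feature is considered important. The reverse is less certain: correlated features
(such as radius, perimeter, and area) can compensate for one another, so a small drop
does not by itself prove that a feature is uninformative.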

### How Feature Ablation Was Performed

For each of the 30 features:

- One feature was removed from the dataset  
- The data was split again into training and testing sets  
- A preprocessing + KNN pipeline (imputation, scaling, and classification) was rebuilt  
- The model was retrained with **K = 5**  
- Accuracy, precision, recall, and F1-score were recorded  

Only **one feature was removed at a time**, ensuring a fair and controlled comparison.
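
The loop described above might be sketched as follows. This is one possible implementation under stated assumptions, not the original code: it uses scikit-learn's bundled dataset, a hypothetical helper named `evaluate`, and a fixed `random_state=42` so that every run is scored on the same test rows.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, (data.target == 0).astype(int)  # malignant -> 1, benign -> 0

def evaluate(features, target):
    """Split, rebuild the pipeline, train with K = 5, and score on the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.2, stratify=target, random_state=42
    )
    model = make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        KNeighborsClassifier(n_neighbors=5),
    )
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    return {
        "accuracy": accuracy_score(y_te, y_hat),
        "precision": precision_score(y_te, y_hat),
        "recall": recall_score(y_te, y_hat),
        "f1": f1_score(y_te, y_hat),
    }

baseline = evaluate(X, y)  # all 30 features, for reference

# Drop exactly one column per run and record the four metrics.
ablation = {
    str(name): evaluate(np.delete(X, idx, axis=1), y)
    for idx, name in enumerate(data.feature_names)
}

# Show the features whose removal lowers accuracy the most.
ranked = sorted(ablation, key=lambda n: ablation[n]["accuracy"])
for name in ranked[:5]:
    print(f"{name:25s} accuracy without it: {ablation[name]['accuracy']:.3f}")
```

Comparing each run against `baseline` rather than against the other runs makes the size of each drop easy to read.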

## Visualization and Interpretation

### Feature Ablation Visualization

A bar chart was plotted showing:

- **Accuracy values when each feature was removed**
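
A chart of this kind could be produced with a small helper like the one below. The function name `plot_ablation_accuracy` and its arguments are hypothetical. It expects a dictionary mapping each removed feature to the resulting accuracy, and the optional baseline line is an addition for readability rather than something described above.

```python
import matplotlib.pyplot as plt

def plot_ablation_accuracy(accuracy_by_feature, baseline_accuracy=None):
    """Bar chart of test accuracy after removing each feature."""
    names = list(accuracy_by_feature)
    values = [accuracy_by_feature[name] for name in names]

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(names, values)
    if baseline_accuracy is not None:
        # Reference line: accuracy with all features present.
        ax.axhline(baseline_accuracy, linestyle="--", color="gray", label="all features")
        ax.legend()
    ax.set_xlabel("Removed feature")
    ax.set_ylabel("Accuracy")
    ax.set_title("KNN accuracy when each feature is removed")
    ax.tick_params(axis="x", labelrotation=90)  # 30 feature names need vertical labels
    fig.tight_layout()
    return fig, ax
```

Calling `plt.show()` or `fig.savefig(...)` on the returned figure displays or saves the chart.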

### Interpretation of the Visualization

- Features whose removal caused the **largest drop in accuracy** are the most influential  
- Features with minimal impact are less important for classification  

In this study, the features whose removal affected the model most were those related to:

- Tumor **radius**
- **Area**
- **Perimeter**
- **Concavity**

These features carry the most weight in the KNN classifier's predictions on this dataset.

## Outcomes and Learnings

Through this task, I gained a clear understanding of how the K-Nearest Neighbors algorithm
works by making predictions based on similarity and distance between data points.

I also realized how crucial feature scaling is for distance-based models, as unscaled features
can easily mislead the algorithm.

The feature ablation study helped me understand which tumor characteristics truly matter
for accurate classification. By removing one feature at a time and observing the change
in performance, I could see how certain features play a much bigger role than others.

Overall, this task demonstrated how machine learning can be meaningfully applied in
medical diagnosis and highlighted the importance of careful preprocessing, evaluation,
and analysis when working with real-world healthcare data.
