# CS512 - Homework 1
**Due Date:** Monday, October 28th, 08:00

## Submission Guidelines:
- Please ensure that your notebook is complete with all outputs displayed. This is important, so verify that none of the outputs are missing before submission.
- Save your notebook file using the following format: **CS512_Homework1_FirstName_LastName.ipynb**. For example: *CS512_Homework1_Mert_Pekey.ipynb*

## Objective:
The goal of this homework is to help you gain hands-on experience with two fundamental classification techniques, **k-Nearest Neighbors (kNN)** and **Decision Trees**, while introducing performance evaluation metrics. You will train both classifiers on a dataset, evaluate their performance using a **Confusion Matrix**, plot the **ROC Curve**, and explore how **hyperparameters** impact the performance of these classifiers. Additionally, you will generate some **visualizations** to better understand the data and the models.

## Dataset:
For this homework, you will use the Breast Cancer Wisconsin (Diagnostic) Dataset, a well-known dataset bundled with scikit-learn that is commonly used for binary classification tasks. Its features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe the characteristics of the cell nuclei present in the image. The goal is to classify each mass as malignant (cancerous) or benign (non-cancerous).

### Instructions:

1. **Load the Dataset**:
    - Use sklearn.datasets.load_breast_cancer() to import the dataset.
    - Split the data into a training set (70%), validation set (15%), and test set (15%).
    - Print the data sizes.
2. **Exploratory Data Analysis (EDA) on Training Data:**
    - Class Distribution: Visualize the class distribution using a bar plot.
    - Feature Analysis: For the first 5 features:
        - Plot pairwise relationships using a scatter plot matrix or pairplot, coloring points by class.
        - Plot the correlation matrix for these features.
    - Summary Statistics: Display basic statistics (mean, standard deviation) for the first 5 features, grouped by class (malignant vs. benign).
3. **Preprocessing:**
    - Standardize the feature values (using StandardScaler or similar) before training the model.
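As a rough illustration of the summary-statistics and correlation items in step 2, the sketch below builds a pandas DataFrame from the training split only and groups the first 5 features by class. The names `train_df`, `diagnosis`, `summary`, and `corr` are illustrative choices, not part of the assignment, and using pandas here is an assumption; other approaches are equally valid.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
# Same 70% training split as in the starter cell; only the training part is used for EDA
X_train, _, y_train, _ = train_test_split(
    data.data, data.target, test_size=0.30, random_state=42, stratify=data.target
)

# DataFrame holding the first 5 features of the training split
first_five = list(data.feature_names[:5])
train_df = pd.DataFrame(X_train[:, :5], columns=first_five)

# target_names is ['malignant', 'benign'], so label 0 maps to malignant and 1 to benign
train_df["diagnosis"] = data.target_names[y_train]

# Class distribution, per-class mean/std, and feature correlations
class_counts = train_df["diagnosis"].value_counts()
summary = train_df.groupby("diagnosis")[first_five].agg(["mean", "std"])
corr = train_df[first_five].corr()

print(class_counts)
print(summary)
```

From here, `class_counts.plot(kind="bar")` would give one possible bar plot of the class distribution, and `corr` can be passed to a heatmap function of your choice.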


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset
data = load_breast_cancer()
X = data.data  # Features
y = data.target  # Labels

# Split the dataset (70% training, 15% validation, 15% testing)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)

# Print the data sizes
print(f"Training data size: {X_train.shape}")
print(f"Validation data size: {X_val.shape}")
print(f"Test data size: {X_test.shape}")
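One possible way to carry out the standardization in step 3 is sketched below. It assumes the same split as the cell above and fits the scaler on the training set only, so that no statistics from the validation or test sets leak into preprocessing. The `*_scaled` variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Same 70/15/15 stratified split as in the starter cell
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics for the held-out sets
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# After scaling, each training feature has mean ~0 and standard deviation ~1
print(X_train_scaled.mean(axis=0).round(3)[:5])
print(X_train_scaled.std(axis=0).round(3)[:5])
```

The validation and test features will not have exactly zero mean and unit variance, which is expected because they are transformed with the training statistics.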

4. **Train Classifiers:**
    - Train a k-Nearest Neighbors (kNN) classifier using Scikit-Learn.
    - Train a Decision Tree classifier using Scikit-Learn.
    - Use the default parameters for both classifiers initially.
    - Evaluate both classifiers on the validation set using:
        - Accuracy as a baseline metric.
        - Confusion Matrix and Classification Report (precision, recall, F1-score) to analyze performance.
    - **Hyperparameter Tuning:** Experiment by tuning the hyperparameters and compare the results on the **validation set**.
        - Tune n_neighbors from 1 to 20 for kNN.
        - Tune max_depth from 1 to 10 for the Decision Tree.
        - Plot accuracy vs. hyperparameters:
            - For kNN, plot accuracy vs. n_neighbors.
            - For Decision Tree, plot accuracy vs. max_depth.
    - **Feature Importance:** For the Decision Tree, visualize the feature importance on the validation set using a bar plot.
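The hyperparameter sweep described in step 4 could be sketched roughly as follows. This assumes standardized features and the split from the starter cell. The dictionaries `knn_scores` and `tree_scores` are illustrative names, and `random_state=42` on the tree is an assumption added only to make the run repeatable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

# Scale with statistics learned from the training set only
scaler = StandardScaler().fit(X_train)
X_train_s, X_val_s = scaler.transform(X_train), scaler.transform(X_val)

# Validation accuracy for each n_neighbors in 1..20
knn_scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    knn_scores[k] = model.score(X_val_s, y_val)

# Validation accuracy for each max_depth in 1..10
tree_scores = {}
for depth in range(1, 11):
    # random_state is an assumption, fixed here only for repeatability
    model = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_train_s, y_train)
    tree_scores[depth] = model.score(X_val_s, y_val)

# max() returns the first best key, i.e. the smallest value in case of ties
best_k = max(knn_scores, key=knn_scores.get)
best_depth = max(tree_scores, key=tree_scores.get)
print(f"Best n_neighbors: {best_k} (val accuracy {knn_scores[best_k]:.3f})")
print(f"Best max_depth: {best_depth} (val accuracy {tree_scores[best_depth]:.3f})")
```

The two dictionaries map each hyperparameter value to its validation accuracy, so their keys and values can be passed directly to a line plot for the accuracy vs. hyperparameter figures. The test set is deliberately not touched at this stage.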

5. **Evaluate on Test Data:**

    - After choosing the best models from the validation set, evaluate the performance on the test set.
    - Compute the following metrics for both models:
        - Accuracy
        - Precision
        - Recall
        - F1-score
        - Confusion Matrix
        - AUC Score
        - Plot the ROC curve for the positive class (malignant). Note that load_breast_cancer() encodes malignant as label 0 and benign as label 1.
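Because scikit-learn encodes malignant as label 0 in this dataset, treating malignant as the positive class takes a little care. The sketch below shows one way to compute the ROC curve and AUC for a default kNN model under that convention. The variable names are illustrative, and the default kNN stands in for whichever tuned model you select.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp
)

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Placeholder model: default kNN instead of a tuned one
knn = KNeighborsClassifier().fit(X_train_s, y_train)

# Look up the integer label for 'malignant' rather than hard-coding it
malignant_label = list(data.target_names).index("malignant")

# Column of predict_proba that holds P(malignant)
malignant_col = list(knn.classes_).index(malignant_label)
malignant_scores = knn.predict_proba(X_test_s)[:, malignant_col]

# Binary ground truth with malignant = 1, so malignant is the positive class
y_test_malignant = (y_test == malignant_label).astype(int)

fpr, tpr, thresholds = roc_curve(y_test_malignant, malignant_scores)
auc = roc_auc_score(y_test_malignant, malignant_scores)
print(f"AUC (malignant as positive class): {auc:.3f}")
```

The `fpr` and `tpr` arrays can be plotted against each other to draw the curve. The same convention applies to precision, recall, and F1-score: `precision_score`, `recall_score`, and `f1_score` accept a `pos_label` argument, which would be set to the malignant label when scoring the original `y_test`.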


6. **Analysis**:
    - Answer the following briefly:
        - Compare the performance of the two classifiers based on the metrics computed (both with default and tuned hyperparameters).
        - Discuss how changing hyperparameters (k for kNN and max_depth for Decision Tree) affected the performance.
        - Summarize which classifier performed better overall and why.
