# Exercise 2: Classification and Clustering

In this exercise, you will work with the **Digits** dataset to classify handwritten digits using various machine learning algorithms.

Additionally, you'll explore clustering using K-Means and experiment with hyperparameter tuning.

### Load the Digits Dataset

The **Digits Dataset** consists of 8x8 pixel images of handwritten digits (0–9). You will perform classification to predict the correct digit label.
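As a quick orientation, the sketch below checks how the 8x8 images relate to the flat feature vectors used for classification. It assumes scikit-learn's bundled copy of the dataset; the variable name `flattened` is illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# `images` holds one 8x8 array per sample; `data` holds the same pixels
# flattened row by row into 64 features per sample.
n_samples = digits.images.shape[0]
flattened = digits.images.reshape(n_samples, -1)

print("images shape:", digits.images.shape)
print("data shape:  ", digits.data.shape)
print("data is the flattened images:", np.array_equal(flattened, digits.data))
```

Each row of `digits.data` is therefore one image, and each column is one pixel intensity.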

In [None]:
# Load the Digits dataset
from sklearn.datasets import load_digits
import pandas as pd

digits = load_digits()

print("Digits dataset keys: ", digits.keys())
print("Data count: ", len(digits['data']))
print("Feature count: ", len(digits['data'][0]))

### Visualize the Digits Dataset

In [None]:
import matplotlib.pyplot as plt

# Display the last digit in the dataset (index -1)
plt.figure(1, figsize=(3, 3))
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()

### Split the Data

Separate the dataset into features and labels, and split the data into training and testing sets.

- Separate the features and target labels.

- Split the data into training and test sets using `train_test_split`.

In [None]:
from sklearn.model_selection import train_test_split

# Separate features (X) and labels (y)
X = digits['data']
y = digits['target']

# Split the data into training and test sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of training and test sets
print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

### Train a Decision Tree Classifier

In this step, you will train a Decision Tree classifier on the training set.

**Instructions:**

- Use `DecisionTreeClassifier` from `sklearn.tree`.

- Fit the model on the training data (`X_train` and `y_train`).
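One possible way to approach this step is sketched below. It is a minimal example, not the only valid solution: the variable name `tree_clf` and the choice of `random_state=42` are assumptions made here for reproducibility, and the data is reloaded so the snippet runs on its own.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the same split as in the notebook so the sketch is self-contained.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default hyperparameters: the tree grows until its leaves are pure.
# random_state is fixed (an assumption) so repeated runs give the same tree.
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

print("Tree depth:", tree_clf.get_depth())
print("Number of leaves:", tree_clf.get_n_leaves())
```

An unconstrained tree usually fits the training set almost perfectly, which is why the evaluation must be done on the held-out test set.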

In [None]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# TODO: Create and train a DecisionTreeClassifier

### Evaluate the Model

After training, you will evaluate the model's performance on the test set using the accuracy score, a confusion matrix, and a classification report.

**Instructions:**

- Predict the test labels using the trained Decision Tree model.

- Compute and print the accuracy score.

- Use `confusion_matrix` and `classification_report` for a more detailed evaluation.
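The evaluation steps above could look roughly like the following sketch. The names `tree_clf`, `y_pred`, `acc`, `cm`, and `report` are illustrative, and the model is retrained here only so the snippet runs independently.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
tree_clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Predict labels for the held-out test set.
y_pred = tree_clf.predict(X_test)

# Fraction of test samples that were classified correctly.
acc = accuracy_score(y_test, y_pred)
print("Accuracy:", acc)

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Per-class precision, recall, and F1-score.
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)
```

Off-diagonal entries of the confusion matrix show which digits are mistaken for which, something a single accuracy number hides.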

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# TODO: Predict on the test set

# TODO: Compute the accuracy
print("Accuracy:")

# TODO: Display the confusion matrix
print("Confusion Matrix:\n")

# TODO: Display a detailed classification report
print("Classification Report:\n")

### Visualize the Decision Tree

You can visualize the structure of the decision tree to understand how the model makes decisions based on the features.

**Instructions:**

- Use `plot_tree` from `sklearn.tree` to visualize the trained Decision Tree model.
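A full tree on this dataset is too large to read, so the sketch below draws only the top levels via the `max_depth` argument of `plot_tree`. The feature names (`pixel_<row>_<col>`) are generated here as an assumption for readability, and the figure size is an arbitrary choice.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
tree_clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Illustrative names: feature index = row * 8 + col in the flattened image.
feature_names = [f"pixel_{row}_{col}" for row in range(8) for col in range(8)]
class_names = [str(d) for d in range(10)]

fig, ax = plt.subplots(figsize=(14, 6))
annotations = plot_tree(
    tree_clf,
    max_depth=2,            # draw only the root and two levels below it
    feature_names=feature_names,
    class_names=class_names,
    filled=True,            # color nodes by majority class
    fontsize=8,
    ax=ax,
)
ax.set_title("Top levels of the decision tree")
```

In a notebook the figure is rendered at the end of the cell; in a plain script, add `plt.show()` after the last line.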

In [None]:
from sklearn.tree import plot_tree

# TODO: Visualize the decision tree

### Experiment with Hyperparameters

Decision Trees have several hyperparameters that can be tuned to improve performance, such as:

- `max_depth`: The maximum depth of the tree.

- `min_samples_split`: The minimum number of samples required to split an internal node.

- `min_samples_leaf`: The minimum number of samples required to be at a leaf node.

**Instructions for Hyperparameter Tuning:**

- Use `GridSearchCV` to find the optimal hyperparameters for the Decision Tree.

- Try tuning `max_depth`, `min_samples_split`, and `min_samples_leaf`.
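A small grid search over these three hyperparameters might be set up as in the sketch below. The candidate values in `param_grid` and the use of 5-fold cross-validation are assumptions chosen to keep the run short, not recommended settings.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate values are illustrative; widen or narrow them as needed.
param_grid = {
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# Every combination is scored with 5-fold cross-validation on the training set.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
print("Best Score:", grid.best_score_)
print("Test accuracy of best model:", grid.best_estimator_.score(X_test, y_test))
```

Note that `best_score_` is the mean cross-validated accuracy on the training data, so it is reported separately from the test-set accuracy.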

In [None]:
from sklearn.model_selection import GridSearchCV

# TODO: Define the hyperparameter grid, and use GridSearchCV to find the best parameters

# Display the best parameters and best score
print("Best Params:")
print("Best Score:")

### Exercise: Using KNN and SVM on the Digits Dataset

Now that you've trained and evaluated a Decision Tree classifier, your next task is to try different machine learning algorithms on the same dataset. Specifically, you will implement **K-Nearest Neighbors (KNN)** and **Support Vector Machine (SVM)** classifiers and compare their performance with each other and with the Decision Tree.

**Instructions:**

1. Train a K-Nearest Neighbors (KNN) Classifier.

    - Use `KNeighborsClassifier` from `sklearn.neighbors`.

    - Choose an appropriate number of neighbors (`n_neighbors`).

    - Fit the model on the training data (`X_train` and `y_train`).

    - Evaluate the model using accuracy, confusion matrix, and classification report, just like you did for the Decision Tree.

2. Train a Support Vector Machine (SVM) Classifier.
    
    - Use `SVC` from `sklearn.svm`.

    - Try using different kernel types such as `linear`, `poly`, and `rbf`.
    
    - Fit the model on the training data (`X_train` and `y_train`).
    
    - Evaluate the model using accuracy, confusion matrix, and classification report.  

3. Compare the performance of the three models (Decision Tree, KNN, and SVM) based on the evaluation metrics.

    - Compare the accuracy scores of all three classifiers: Decision Tree, KNN, and SVM.

    - Consider how the confusion matrices differ between the models. Are there particular digits that one model predicts better than the others?

    - Analyze which model performs best in terms of classification report metrics such as precision, recall, and F1-score.
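As a starting point for the comparison, the sketch below trains all three classifiers on the same split and collects their test accuracies. The settings (`n_neighbors=5`, `kernel="rbf"`) and the dictionary names `models` and `scores` are assumptions for illustration; the exercise still asks you to try other values and to inspect the confusion matrices and classification reports.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# One instance per algorithm; hyperparameters are illustrative defaults.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (rbf)": SVC(kernel="rbf"),
}

# Fit each model on the training data and score it on the same test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Using one split for all models keeps the comparison fair, since every classifier sees exactly the same training and test samples.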

In [None]:
# TODO: Your code for KNN and SVM.

## K-Means Clustering with the Digits Dataset

In this task, you will perform K-Means clustering on the Digits dataset and analyze how well the clusters correspond to the actual digits.

### Apply K-Means Clustering

Perform K-Means clustering on the dataset without using the labels (since this is unsupervised learning). You will need to cluster the data into 10 groups, corresponding to the digits 0 through 9.
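A minimal sketch of this step follows. The values `n_init=10` and `random_state=42` are assumptions made for reproducibility, and `cluster_labels` is an illustrative variable name.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

# Only the features are used; the true labels are ignored during clustering.
X, y = load_digits(return_X_y=True)

# n_init=10 reruns the algorithm from 10 random starts and keeps the best run.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(X)

print("Cluster Labels:", cluster_labels[:10])
print("Cluster centers shape:", kmeans.cluster_centers_.shape)
```

Each cluster center is a 64-dimensional vector, so it can be reshaped to 8x8 and displayed as an "average" image of its cluster.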

In [None]:
from sklearn.cluster import KMeans

# TODO: Create a KMeans model with 10 clusters

# Display the first few cluster labels
print("Cluster Labels:")


### Evaluate the Clustering Results

Although clustering is unsupervised, you can still compare the predicted clusters to the actual labels in the dataset to evaluate how well the algorithm performed.

**Instructions:**

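- Compare the clusters with the actual labels using accuracy or a confusion matrix. K-Means assigns arbitrary IDs to clusters, so cluster 3 need not correspond to digit 3; accuracy is only meaningful after each cluster is mapped to a digit (for example, its most common true label).

- Use permutation-invariant metrics such as the Adjusted Rand Index (`adjusted_rand_score`) or Normalized Mutual Information (`normalized_mutual_info_score`) to evaluate the clustering more appropriately.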
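The sketch below shows one way to carry out both kinds of comparison. The majority-vote mapping from cluster ID to digit is an assumption introduced here to make accuracy computable; it is a rough heuristic, and the names `mapped_labels`, `ari`, and `nmi` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import (
    accuracy_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)

X, y = load_digits(return_X_y=True)
cluster_labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X)

# Permutation-invariant metrics: unaffected by how cluster IDs are numbered.
ari = adjusted_rand_score(y, cluster_labels)
nmi = normalized_mutual_info_score(y, cluster_labels)
print("Adjusted Rand Index:", ari)
print("Normalized Mutual Information:", nmi)

# Rough accuracy: relabel each cluster with its most frequent true digit.
mapped_labels = np.zeros_like(cluster_labels)
for cluster_id in range(10):
    mask = cluster_labels == cluster_id
    if mask.any():
        mapped_labels[mask] = np.bincount(y[mask]).argmax()

acc = accuracy_score(y, mapped_labels)
print("Accuracy after majority-vote mapping:", acc)
```

Because two clusters can map to the same digit, the mapped accuracy tends to be optimistic, which is one reason ARI and NMI are the more appropriate measures.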

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, adjusted_rand_score, normalized_mutual_info_score

# TODO: Compare clusters to actual labels

# TODO: Use a better metric like Adjusted Rand Index

# Optionally, display a confusion matrix


### Visualize the Clusters

You can visualize the clusters by reducing the dimensionality of the dataset with techniques such as Principal Component Analysis (PCA) or t-SNE. This helps show how the data points are grouped into clusters.

**Instructions:**

- Use PCA to reduce the dimensionality of the dataset to 2D for easy visualization.

- Plot the clusters in 2D space, colored by their predicted cluster labels.
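These two steps could be sketched as follows. The colormap `tab10`, the marker size, and the names `X_2d` and `cluster_labels` are illustrative choices; the clustering is recomputed so the snippet runs on its own.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)
cluster_labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X)

# Project the 64-dimensional data onto its first two principal components.
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X)

# One point per image, colored by the cluster it was assigned to.
fig, ax = plt.subplots(figsize=(6, 5))
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=cluster_labels, cmap="tab10", s=10)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("K-Means clusters in PCA space")
fig.colorbar(scatter, ax=ax, label="Cluster")

print("Explained variance ratio:", pca.explained_variance_ratio_)
```

Clustering was done in the full 64-dimensional space, so clusters that overlap in this 2D projection are not necessarily overlapping in the original space. In a plain script, add `plt.show()` to display the figure.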

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# TODO: Reduce the dimensionality of the dataset using PCA

# TODO: Plot the data points with colors corresponding to their clusters