### Short Coding Project: K-Nearest Neighbors (k-NN) Classification

#### Project Overview

This project consists of several tasks where you will apply the concepts learned in the K-Nearest Neighbors lab to classify types of glass based on their chemical composition. You will work with the glass dataset, explore the data, implement the k-NN algorithm, and investigate its performance under different conditions.

- Replace each `# YOUR CODE HERE` comment with your own code.
- **Do not change** the variable names.

### Load the Dataset

Start by loading the glass dataset and examining its structure.

In [None]:
# Import necessary libraries
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

# Load the dataset
url = 'https://raw.githubusercontent.com/CyConProject/Lab/main/Datasets/glass.csv'
df = pd.read_csv(url)

# Display the first few rows of the dataset
df.head()

### Question 1: Data Exploration

Explore the dataset to understand its structure and content.

1. Display the summary statistics of the dataset.
2. Check for any missing values in the dataset.
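
As a rough illustration of the two steps above, here is a minimal sketch on a hypothetical toy DataFrame (the names `toy`, `toy_summary`, and `toy_missing` are made up for this example and are not part of the assignment). It assumes the standard pandas methods `describe()` and `isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the glass dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10, 20, 30, 40],
})

# Per-column count, mean, std, min, quartiles, and max
toy_summary = toy.describe()
print(toy_summary)

# Number of missing entries in each column
toy_missing = toy.isnull().sum()
print(toy_missing)
```

Note that `describe()` skips missing entries, so the `count` row can differ between columns when values are missing.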

In [None]:
# Display summary statistics
des = # YOUR CODE HERE
print(des)

# Check for missing values
missing_values = # YOUR CODE HERE
print(missing_values)

### Question 2: Visualize the Distribution of Classes

Create a bar plot showing the number of samples for each glass type in the dataset. Write a line of code that counts the samples for each unique glass type and sorts the result by glass type (the index) in ascending order, not by count.
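
To show the difference between sorting by count and sorting by label, here is a small sketch on a hypothetical Series (`toy_labels` and `toy_counts` are illustrative names only). It assumes pandas `value_counts()`, which orders by frequency by default, followed by `sort_index()`.

```python
import pandas as pd

# Hypothetical labels, not the glass types
toy_labels = pd.Series([3, 1, 2, 1, 3, 1], name="label")

# value_counts() orders by frequency; sort_index() reorders by label
toy_counts = toy_labels.value_counts().sort_index()
print(toy_counts)
```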

In [None]:
import matplotlib.pyplot as plt

# Count the number of samples for each glass type
class_counts = # YOUR CODE HERE

# Create a bar plot
plt.bar(class_counts.index.astype(str), class_counts.values, color='skyblue')

# Add labels and title
plt.xlabel('Glass Type')
plt.ylabel('Number of Samples')
plt.title('Distribution of Glass Types')
plt.show()

### Question 3: Feature Selection

Select the most relevant features for classification using correlation analysis. This is a critical step in improving machine learning models. By selecting the most important features, we can reduce the complexity of the model, improve training efficiency, and potentially increase the model's accuracy.

1. **Find the correlation matrix** of the dataset.
2. **Get the absolute values** of the correlations with the `'Type'` column.
3. **Sort the features** in descending order of their absolute correlation with the target variable `'Type'`.
4. **Select the top six features** most strongly correlated with `'Type'`.
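
The ranking idea in the steps above can be sketched on a hypothetical toy DataFrame (column names `f1`, `f2`, `f3`, and `target` are invented for illustration). The point of taking absolute values is that a strongly negative correlation counts as much as a strongly positive one.

```python
import pandas as pd

# Hypothetical toy data: f1 rises with target, f2 mostly falls with it
toy = pd.DataFrame({
    "f1": [1, 2, 3, 4, 5],
    "f2": [5, 3, 4, 1, 2],
    "f3": [2, 2, 3, 2, 3],
    "target": [1, 2, 3, 4, 5],
})

# Pearson correlation between every pair of columns
corr = toy.corr()

# Strength of each column's correlation with the target, ignoring sign
target_corr = corr["target"].abs()

# Drop the target itself, then rank strongest first
ranked = target_corr.drop("target").sort_values(ascending=False)

# Keep the two strongest features
top_two = ranked.index[:2].tolist()
print(top_two)
```

Here `f2` outranks `f3` even though its correlation with `target` is negative.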

In [None]:
# Step 1: Calculate the correlation matrix for all features
corr_matrix = # YOUR CODE HERE

# Step 2: Get the absolute values of the correlations with 'Type'
type_correlation = # YOUR CODE HERE

# Step 3: Sort the correlations in descending order (strongest first)
sorted_correlations = # YOUR CODE HERE

# Step 4: Select the top six features (excluding 'Type' itself)
top_features = # YOUR CODE HERE

# Step 5: Create a new DataFrame with the top six features
X_selected = df[top_features]

# Step 6: Display the selected features
X_selected.head()


### Question 4: Split the Data into Training and Testing Sets

Define the target variable `y` as the `'Type'` column, then split `X_selected` and `y` into training and testing sets. Use 70% of the data for training and 30% for testing. Set `random_state=42` for reproducibility.
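
For reference, a minimal sketch of a 70/30 split on hypothetical arrays (`X_toy`, `y_toy`, and the short result names are illustrative, not the assignment's variables). It assumes scikit-learn's `train_test_split` with its `test_size` and `random_state` parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Hold out 30% of the rows for testing; fixing the seed makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)
print(X_tr.shape, X_te.shape)
```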

In [None]:
from sklearn.model_selection import train_test_split

# Define the target variable y
y = # YOUR CODE HERE

# Split the dataset
X_train, X_test, y_train, y_test = # YOUR CODE HERE

### Question 5: Standardize the Features

Standardize the selected features using `StandardScaler` to ensure that all features contribute equally to the model, preventing features with larger scales from dominating.
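
The sketch below illustrates the usual pattern on hypothetical arrays (`train`, `test`, and `toy_scaler` are made-up names): the scaler learns the mean and standard deviation from the training data only, and the same parameters are then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the second feature is on a much larger scale
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()

# Learn mean and standard deviation from the training data, then scale it
train_scaled = toy_scaler.fit_transform(train)

# Reuse the training statistics for the test data (no refitting)
test_scaled = toy_scaler.transform(test)

print(train_scaled)
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, so both features carry comparable weight in a distance calculation.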

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and testing data
X_train_scaled = # YOUR CODE HERE
X_test_scaled = # YOUR CODE HERE

# Display the first few rows of X_train_scaled
print(X_train_scaled[:5])

### Question 6: Train and Evaluate the k-NN Classifier

Initialize a k-NN classifier with `n_neighbors=5` and train it on the scaled training data. Evaluate the model's performance by calculating the accuracy score and printing the classification report on the test data.
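
As a hedged illustration of the fit, predict, and evaluate workflow, here is a sketch on hypothetical, well-separated toy points (all variable names are invented for this example, and `n_neighbors=3` is chosen only because the toy set is tiny).

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two tight clusters, one per class
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_tr = np.array([0, 0, 0, 1, 1, 1])
X_te = np.array([[0.3, 0.3], [4.8, 5.1]])
y_te = np.array([0, 1])

# Each prediction is a majority vote among the 3 nearest training points
toy_knn = KNeighborsClassifier(n_neighbors=3)
toy_knn.fit(X_tr, y_tr)
toy_pred = toy_knn.predict(X_te)

# Fraction of test points classified correctly
toy_acc = accuracy_score(y_te, toy_pred)
print(f"Accuracy: {toy_acc:.2f}")

# Per-class precision, recall, and F1 score
print(classification_report(y_te, toy_pred))
```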

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the k-NN classifier with k=5
k = 5
knn_classifier = # YOUR CODE HERE

# Train the classifier
# YOUR CODE HERE

# Make predictions on the test data
y_pred = # YOUR CODE HERE

# Evaluate the performance
accuracy = # YOUR CODE HERE
classification_rep = # YOUR CODE HERE

# Display the results
print(f"Accuracy of the k-NN classifier (k={k}): {accuracy:.2f}")
print("Classification Report:\n", classification_rep)

### Question 7: Implement Weighted k-NN (Advanced)

In this question, you will modify the k-NN classifier to use distance-weighted voting, where closer neighbors have a greater influence on the classification than more distant ones.

**Steps:**

1. Initialize a weighted k-NN classifier with `n_neighbors=5`.

    **Hint**: Setting `weights='distance'` in `KNeighborsClassifier` weights each neighbor by the inverse of its distance to the query point.
    To learn more about k-NN classifier parameters such as `weights`, see the [scikit-learn documentation](https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

2. Train the classifier on the scaled training data.
3. Evaluate the model's performance by calculating the accuracy score and printing the classification report on the test data.
4. Compare the results with the uniform-weighted k-NN classifier from Question 6.
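
To see why the two voting schemes can disagree, here is a small pure-Python sketch of the voting step alone. The `vote` helper and the neighbor list are hypothetical, and the sketch assumes every distance is strictly positive; it is not scikit-learn's implementation.

```python
from collections import defaultdict

def vote(neighbors, weighted):
    """Pick a label from (distance, label) pairs of the k nearest points."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        # Uniform voting adds 1 per neighbor; weighted voting adds 1/distance
        scores[label] += 1.0 / distance if weighted else 1.0
    return max(scores, key=scores.get)

# Hypothetical neighbors: one very close "A", two farther "B" points
neighbors = [(1.0, "A"), (4.0, "B"), (5.0, "B")]

uniform_winner = vote(neighbors, weighted=False)
weighted_winner = vote(neighbors, weighted=True)
print(uniform_winner, weighted_winner)
```

With uniform voting, "B" wins two votes to one. With inverse-distance weights, "A" scores 1.0 against 0.25 + 0.2 = 0.45 for "B", so the single close neighbor decides the outcome.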


In [None]:
# Initialize the weighted k-NN classifier with k=5 and distance weights
knn_classifier_weighted = # YOUR CODE HERE

# Train the classifier
# YOUR CODE HERE

# Make predictions on the test data
y_pred_weighted = # YOUR CODE HERE

# Evaluate the performance
accuracy_weighted = # YOUR CODE HERE
classification_rep_weighted = # YOUR CODE HERE

# Display the results
print(f"Accuracy of the weighted k-NN classifier (k={k}): {accuracy_weighted:.2f}")
print("Classification Report:\n", classification_rep_weighted)

### Question 8: Evaluate k-NN with Different Distance Metrics (Advanced)

In this question, you will explore how changing the distance metric affects the k-NN classifier's performance.

**Steps:**

1. For each distance metric (`'euclidean'`, `'manhattan'`, `'chebyshev'`), initialize a k-NN classifier with `n_neighbors=5` and the specified metric.

    **Hint**: To learn more about the Chebyshev distance metric, see this [Wikipedia article](https://en.wikipedia.org/wiki/Chebyshev_distance).

2. Train the classifier on the scaled training data.
3. Evaluate the model's performance by calculating the accuracy score on the test data.

Look at the results to see which distance metric performed best on this dataset.
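
For intuition about how the three metrics differ, the following pure-Python sketch computes each one for a single hypothetical pair of points. The helper functions are written for illustration only and are not scikit-learn's implementations.

```python
import math

def euclidean(p, q):
    # Straight-line distance: square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of the absolute differences along each axis
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    # Largest absolute difference along any single axis
    return max(abs(a - b) for a, b in zip(p, q))

# Hypothetical points
p = (0.0, 0.0)
q = (3.0, 4.0)

d_euclidean = euclidean(p, q)
d_manhattan = manhattan(p, q)
d_chebyshev = chebyshev(p, q)
print(d_euclidean, d_manhattan, d_chebyshev)
```

For these points the distances are 5, 7, and 4 respectively. Because the metrics rank neighbors differently, the set of five nearest neighbors, and therefore the prediction, can change with the metric.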

In [None]:
# List of distance metrics to evaluate
distance_metrics = ['euclidean', 'manhattan', 'chebyshev']

# Dictionary to store accuracy scores
accuracy_metrics = {}

for metric in distance_metrics:
    # Initialize the k-NN classifier with the current metric
    knn_classifier_metric = # YOUR CODE HERE
    
    # Train the classifier
    # YOUR CODE HERE
    
    # Make predictions on the test data
    y_pred_metric = # YOUR CODE HERE
    
    # Evaluate the performance
    accuracy_metric = # YOUR CODE HERE
    
    # Store the accuracy
    accuracy_metrics[metric] = accuracy_metric
    
    # Display the results
    print(f"\nAccuracy with {metric} distance: {accuracy_metric:.2f}")

# Summarize which distance metric performed best (on a tie, the first metric listed wins)
best_metric = max(accuracy_metrics, key=accuracy_metrics.get)
print(f"\nThe distance metric with the highest accuracy is: {best_metric}")