In [27]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate 
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


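# Load the diamonds dataset and drop the index column that was saved with the CSV
data = pd.read_csv(r"C:\Users\KRP\Programming\School\AiCphBusiness\MachineLearning\Assignment2\diamonds.csv")
data = data.drop(["Unnamed: 0"], axis=1)
data.head()


# Combine the three dimensions (x, y, z) into a single volume feature
data['volume'] = data['x'] * data['y'] * data['z']
data = data.drop(['x', 'y', 'z'], axis=1)

# Encode color as an ordinal scale, from J (worst, 1) to D (best, 7)
new_color = {'J': 1, 'I': 2, 'H': 3, 'G': 4, 'F': 5, 'E': 6, 'D': 7}
data['color'] = data['color'].map(new_color)

# Encode clarity as an ordinal scale, from I1 (worst, 1) to IF (best, 8)
new_clarity = {'I1': 1, 'SI2': 2, 'SI1': 3, 'VS2': 4, 'VS1': 5, 'VVS2': 6, 'VVS1': 7, 'IF': 8}
data['clarity'] = data['clarity'].map(new_clarity)

# Use cut as the target; all remaining columns are features
y = data.pop("cut")
x = data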

In [16]:
# Split the data into training (90%) and test (10%) sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

# Fit the scaler on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

In [17]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)


In [20]:
y_pred = knn.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.6331108639228773


## Hyperparameter Adjustment

In developing the k-Nearest Neighbors (kNN) model, the main hyperparameter tuned was the **number of neighbors ($k$)**. The choice of $k$ determines how many training points vote on each prediction, and therefore how smooth the decision boundary is and how well the model generalizes to unseen data.

### Adjusting K

- **Overfitting vs. underfitting:** A small $k$ can cause the model to overfit, treating noise in the training data as meaningful patterns. The result is excellent performance on the training data but poor performance on new data. Conversely, a large $k$ can cause the model to underfit: each prediction is averaged over so many neighbors that local structure is lost, which also degrades performance on new data.
- **Balancing bias and variance:** Tuning $k$ is a search for a balance between bias (error from overly simplistic assumptions, which grows as $k$ increases) and variance (error from sensitivity to the particular training sample, which grows as $k$ decreases).
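
The two extremes of $k$ can be made concrete with a minimal, standard-library-only sketch. The `knn_predict` helper and the toy Gaussian data are hypothetical and only illustrate the idea; they are not the scikit-learn model or the diamonds data used in this notebook.

```python
from collections import Counter
import random


def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Rank training indices by squared Euclidean distance to the query point.
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two overlapping 2-D clusters, 60 points of "a" and 40 of "b".
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
points += [(rng.gauss(1.5, 1), rng.gauss(1.5, 1)) for _ in range(40)]
labels = ["a"] * 60 + ["b"] * 40

# k = 1: every training point is its own nearest neighbor, so the model
# reproduces the training labels exactly, noise included (overfitting).
train_hits = sum(knn_predict(points, labels, p, 1) == y for p, y in zip(points, labels))
train_acc_k1 = train_hits / len(points)

# k = n: every vote includes the whole training set, so the majority class
# wins for every query and all local structure is ignored (underfitting).
preds_k_all = {knn_predict(points, labels, p, len(points)) for p in points}

print(train_acc_k1, preds_k_all)
```

A perfect training score at $k = 1$ says nothing about performance on unseen data, which is why $k$ has to be chosen on held-out or cross-validated scores.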

## Measuring Quality Using F1 Score

The F1 score is the harmonic mean of precision and recall. It condenses both into a single number and is especially useful when classes are imbalanced. Because `cut` is a multiclass target, the code in this notebook computes precision, recall, and F1 per class and averages them with equal weight (`average='macro'`).
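
For reference, the harmonic-mean definition can be written out for a single class as follows. The symbols $TP$, $FP$, and $FN$ are notation introduced here for the counts of true positives, false positives, and false negatives:

$$
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP + FP + FN}
$$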

### Accuracy

Accuracy, the simplest performance measure, is the ratio of correct predictions to the total number of predictions. On imbalanced datasets it can be misleading, because a model can score well while rarely predicting minority classes correctly.
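
A small sketch of how accuracy can mislead, using hypothetical labels. The class names and the 95/5 split are chosen for illustration and are not the actual class counts in the dataset.

```python
# Hypothetical labels: 95 items of a majority class and 5 of a minority class.
y_true = ["Ideal"] * 95 + ["Fair"] * 5

# A degenerate "model" that always predicts the majority class.
y_pred = ["Ideal"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the minority class: the share of true "Fair" items that were found.
fair_total = sum(t == "Fair" for t in y_true)
fair_found = sum(t == "Fair" and p == "Fair" for t, p in zip(y_true, y_pred))
fair_recall = fair_found / fair_total

print(f"accuracy = {accuracy:.2f}, recall for 'Fair' = {fair_recall:.2f}")
```

The classifier never identifies a single minority item, yet its accuracy is high simply because the majority class dominates the count.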

### Precision

Precision is the ratio of true positives to all positive predictions (true positives plus false positives). High precision means few false positives, which matters in applications where false positives are costly.

### Recall (Sensitivity)

Recall, or sensitivity, is the ratio of true positives to all actual positive instances (true positives plus false negatives). High recall means few false negatives, which matters when missing a positive instance is costly (e.g., medical diagnoses).
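
In a multiclass problem, precision and recall are computed one class at a time, treating that class as "positive" and all others as "negative". The sketch below does this by hand; the `precision_recall` helper and the toy labels are hypothetical, and the zero-denominator guard only mirrors the spirit of `zero_division=0`.

```python
def precision_recall(y_true, y_pred, positive):
    """One-vs-rest precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Return 0.0 instead of dividing by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical three-class toy labels.
y_true = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "A", "C", "C", "A"]

for cls in ("A", "B", "C"):
    p, r = precision_recall(y_true, y_pred, cls)
    print(f"{cls}: precision={p:.2f} recall={r:.2f}")
```

Class C shows how the two metrics can diverge: every prediction of C is correct (precision 1.0), but one true C is missed, so recall is lower.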

### Why Use F1 Score

The F1 score is useful when precision and recall both matter and the class distribution is uneven. Because the harmonic mean is dominated by the smaller of the two values, a model cannot achieve a high F1 by excelling in one metric at the expense of the other.
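
The sketch below illustrates both points with hypothetical numbers: how the harmonic mean penalizes a lopsided model, and how macro averaging combines per-class scores. The precision and recall pairs are made up for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# A lopsided class: every positive prediction is right, but only 10% of positives are found.
lopsided = f1(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2

# Macro averaging: compute F1 per class, then take the unweighted mean,
# so a rare class counts as much as a common one.
per_class = {"A": f1(0.6, 0.75), "B": f1(2 / 3, 2 / 3), "C": f1(1.0, 2 / 3)}
macro_f1 = sum(per_class.values()) / len(per_class)

print(round(lopsided, 3), round(arithmetic_mean, 3), round(macro_f1, 3))
```

The arithmetic mean would rate the lopsided class at 0.55, while its F1 is below 0.2. Note also that macro F1 is the mean of per-class F1 scores, so it need not equal the harmonic mean of macro precision and macro recall.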

In [29]:
# Candidate values of k to evaluate
k_values = list(range(1, 31))
performance = []

# x_train and x_test were already standardized in the earlier cell,
# so this second scaling pass leaves them essentially unchanged
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

for k in k_values:
    # Fit on the training set and predict the held-out test set
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train_scaled, y_train)
    y_pred = knn.predict(x_test_scaled)
    # 5-fold cross-validated accuracy on the training set
    val_score = cross_val_score(knn, x_train_scaled, y_train, cv=5)
    performance.append({
        'k': k,
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='macro', zero_division=0),
        'recall': recall_score(y_test, y_pred, average='macro', zero_division=0),
        'cross_val_mean_accuracy': np.mean(val_score),
        'f1': f1_score(y_test, y_pred, average='macro', zero_division=0)
    })

# Collect the results into a table, one row per k
df = pd.DataFrame(performance)
df.head(100)
    

|   | k | accuracy | precision | recall | cross_val_mean_accuracy | f1 |
|---|---|----------|-----------|--------|-------------------------|----|
| 0 | 1 | 0.616055 | 0.600346 | 0.586147 | 0.610534 | 0.592518 |
| 1 | 2 | 0.643307 | 0.582159 | 0.632858 | 0.642483 | 0.586592 |
| 2 | 3 | 0.633111 | 0.587094 | 0.594922 | 0.638137 | 0.58706 |
| 3 | 4 | 0.669262 | 0.635798 | 0.636958 | 0.666399 | 0.627045 |
| 4 | 5 | 0.669077 | 0.646138 | 0.618791 | 0.663556 | 0.623072 |
| 5 | 6 | 0.67538 | 0.644236 | 0.623181 | 0.671899 | 0.622346 |
| 6 | 7 | 0.679459 | 0.657824 | 0.619507 | 0.673979 | 0.627971 |
| 7 | 8 | 0.683723 | 0.665883 | 0.630961 | 0.676678 | 0.636224 |
| 8 | 9 | 0.685577 | 0.67488 | 0.626762 | 0.677481 | 0.638736 |
| 9 | 10 | 0.686318 | 0.678051 | 0.634045 | 0.682425 | 0.64357 |
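
One way to read the results is to pick the $k$ with the highest macro F1 and check whether the cross-validated accuracy agrees. The sketch below copies the ten rows shown above into plain tuples; the loop covers $k$ from 1 to 30, but only these rows appear in the output, so the selection is limited to them.

```python
# (k, accuracy, cross_val_mean_accuracy, f1) copied from the rows shown in the output above.
rows = [
    (1, 0.616055, 0.610534, 0.592518),
    (2, 0.643307, 0.642483, 0.586592),
    (3, 0.633111, 0.638137, 0.58706),
    (4, 0.669262, 0.666399, 0.627045),
    (5, 0.669077, 0.663556, 0.623072),
    (6, 0.67538, 0.671899, 0.622346),
    (7, 0.679459, 0.673979, 0.627971),
    (8, 0.683723, 0.676678, 0.636224),
    (9, 0.685577, 0.677481, 0.638736),
    (10, 0.686318, 0.682425, 0.64357),
]

# Select the row with the highest macro F1 and, separately, the highest CV accuracy.
best_by_f1 = max(rows, key=lambda r: r[3])
best_by_cv = max(rows, key=lambda r: r[2])

print("best k by F1:", best_by_f1[0])
print("best k by cross-validated accuracy:", best_by_cv[0])
```

Both criteria point to the largest $k$ among the displayed rows, and the scores are still rising there, so the rows for larger $k$ would need to be inspected before settling on a value.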


