# To-Do Exercise:
**Problem - 1: Perform a classification task with KNN from scratch.**
1. Load the Dataset:

•	Read the dataset into a pandas DataFrame.

•	Display the first few rows and perform exploratory data analysis (EDA) to understand the dataset (e.g., check data types, missing values, summary statistics).


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = "/content/drive/MyDrive/diabetes.csv"
data = pd.read_csv(file_path)

# Display the first few rows
print("First five rows of the dataset:")
print(data.head())

# Display dataset information
print("\nDataset Information:")
print(data.info())

# Summary statistics
print("\nSummary Statistics:")
print(data.describe())

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())


First five rows of the dataset:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 

2. Handle Missing Data:

• Handle any missing values appropriately, either by dropping or imputing them based on the data.


In [None]:
# Check for missing values
if data.isnull().sum().any():
    # Example: Fill missing numerical values with column mean
    data.fillna(data.mean(), inplace=True)
    print("\nHandled missing values by imputing with mean.")
else:
    print("\nNo missing values found.")



No missing values found.
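The check above only detects explicit NaN entries. The rows printed by `data.head()` contain zeros for `SkinThickness` and `Insulin`, which may be placeholders for measurements that were not taken. The sketch below assumes that reading (it is an assumption, not something the dataset output confirms) and shows one way to convert such zeros to NaN and impute them with the column median. The `sample` frame and the `zero_as_missing` list are illustrative names; the values are the five rows shown earlier.

```python
import numpy as np
import pandas as pd

# The five rows printed by data.head() earlier in this notebook.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns means "not measured".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Work on a float copy so NaN can be stored in every column.
cleaned = sample.astype(float)
for col in zero_as_missing:
    cleaned[col] = cleaned[col].replace(0, np.nan)

# Count the entries that are now flagged as missing.
missing_counts = cleaned.isnull().sum()

# Impute each column with its own median (NaN values are ignored).
imputed = cleaned.fillna(cleaned.median())

print(missing_counts)
```

Whether median imputation is appropriate depends on how many values are affected in the full dataset; dropping the affected rows is the other option named in the task.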


3. Feature Engineering:

• Separate the feature matrix (X) and target variable (y).

• Perform a train-test split from scratch using a 70%-30% ratio.

In [None]:
# Separate features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Train-test split
def train_test_split_manual(X, y, test_ratio=0.3, random_state=42):
    np.random.seed(random_state)
    indices = np.random.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    test_indices = indices[:test_size]
    train_indices = indices[test_size:]
    return X.iloc[train_indices], X.iloc[test_indices], y.iloc[train_indices], y.iloc[test_indices]

X_train, X_test, y_train, y_test = train_test_split_manual(X, y)
print("\nDataset successfully split into training and test sets.")



Dataset successfully split into training and test sets.


4. Implement KNN:

• Build the KNN algorithm from scratch (no libraries like scikit-learn for KNN).

• Compute distances using Euclidean distance.

• Write functions for:

– Predicting the class for a single query.

– Predicting classes for all test samples.

• Evaluate the performance using accuracy.

In [None]:
# Euclidean distance function
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

# Predict the class for a single query
def knn_predict_single(query, X_train, y_train, k=3):
    distances = []
    for i in range(len(X_train)):
        dist = euclidean_distance(query, X_train.iloc[i])
        distances.append((dist, y_train.iloc[i]))
    distances.sort(key=lambda x: x[0])  # Sort by distance
    nearest_neighbors = [label for _, label in distances[:k]]
    prediction = max(set(nearest_neighbors), key=nearest_neighbors.count)  # Majority vote
    return prediction

# Predict classes for all test samples
def knn_predict(X_test, X_train, y_train, k=3):
    predictions = []
    for i in range(len(X_test)):
        pred = knn_predict_single(X_test.iloc[i], X_train, y_train, k)
        predictions.append(pred)
    return np.array(predictions)

# Compute accuracy
def compute_accuracy(y_true, y_pred):
    correct_predictions = np.sum(y_true.values == y_pred)
    accuracy = (correct_predictions / len(y_true)) * 100
    return accuracy

# Perform prediction and evaluation
k = 3  # Number of neighbors
y_pred = knn_predict(X_test, X_train, y_train, k)
accuracy = compute_accuracy(y_test, y_pred)

print(f"\nAccuracy of the KNN model (k={k}): {accuracy:.2f}%")



Accuracy of the KNN model (k=3): 67.39%
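The loop-based functions above call `.iloc` once per training row, which is slow for larger datasets. As a possible alternative, the sketch below computes all pairwise distances at once with NumPy broadcasting. The function name `knn_predict_vectorized` and the toy arrays are illustrative and not part of the exercise. Ties in the vote go to the smallest label here, which may differ from the tie behavior of the version above.

```python
import numpy as np

def knn_predict_vectorized(X_test, X_train, y_train, k=3):
    """Predict labels for all rows of X_test with brute-force KNN."""
    X_test = np.asarray(X_test, dtype=float)
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    # The square root is skipped because it does not change the ordering.
    diff = X_test[:, None, :] - X_train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)

    # Indices of the k closest training points for every test row.
    nearest = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    neighbor_labels = y_train[nearest]

    # Majority vote per row; np.unique sorts labels, so ties go to the smallest.
    predictions = []
    for row in neighbor_labels:
        labels, counts = np.unique(row, return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy data: two well-separated clusters.
X_train_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train_toy = np.array([0, 0, 0, 1, 1, 1])
X_test_toy = np.array([[0.2, 0.1], [5.5, 5.5]])

toy_predictions = knn_predict_vectorized(X_test_toy, X_train_toy, y_train_toy, k=3)
print(toy_predictions)
```

The intermediate `diff` array has shape (n_test, n_train, n_features), so this approach trades memory for speed and is best suited to datasets of modest size such as this one.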


# Problem - 2 - Experimentation:

1. Repeat the Classification Task:

• Scale the Feature matrix X.
• Use the scaled data for training and testing the kNN Classifier.
• Record the results.

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the feature matrix (X) using StandardScaler
scaler = StandardScaler()

# Fit the scaler on training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert scaled arrays back to DataFrame for compatibility with KNN functions
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

print("Feature scaling complete. Training and testing data have been standardized.")


Feature scaling complete. Training and testing data have been standardized.
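Since Problem 1 builds everything from scratch, the same standardization can also be sketched without scikit-learn. The helper names `standardize_fit` and `standardize_apply` are hypothetical. The statistics are computed on the training data only and then reused for the test data, mirroring the fit/transform split above, and the population standard deviation (`ddof=0`) is used, which is what `StandardScaler` uses.

```python
import numpy as np

def standardize_fit(X_train):
    """Return the per-feature mean and standard deviation of the training data."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)   # population standard deviation (ddof=0)
    std[std == 0] = 1.0         # avoid division by zero for constant features
    return mean, std

def standardize_apply(X, mean, std):
    """Apply z-score scaling with statistics learned from the training data."""
    return (np.asarray(X, dtype=float) - mean) / std

# Toy example with two features on very different scales.
train_toy = np.array([[1.0, 100.0], [3.0, 300.0]])
test_toy = np.array([[2.0, 400.0]])

mean_toy, std_toy = standardize_fit(train_toy)
train_toy_scaled = standardize_apply(train_toy, mean_toy, std_toy)
test_toy_scaled = standardize_apply(test_toy, mean_toy, std_toy)

print(train_toy_scaled)
print(test_toy_scaled)
```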


2. Comparative Analysis: Compare the Results

• Compare the accuracy and performance of the kNN model on the original dataset from Problem 1 versus the scaled dataset.

• Discuss:
– How scaling impacted the KNN performance.
– The reason for any observed changes in accuracy.

In [None]:
# Perform prediction and evaluation with scaled data
y_pred_scaled = knn_predict(X_test_scaled, X_train_scaled, y_train, k=3)
accuracy_scaled = compute_accuracy(y_test, y_pred_scaled)

print(f"Accuracy of the KNN model on scaled data (k=3): {accuracy_scaled:.2f}%")


Accuracy of the KNN model on scaled data (k=3): 70.87%


In [None]:
print(f"Accuracy on original data: {accuracy:.2f}%")
print(f"Accuracy on scaled data: {accuracy_scaled:.2f}%")


Accuracy on original data: 67.39%
Accuracy on scaled data: 70.87%
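One likely reason for the difference is that unscaled Euclidean distance is dominated by features with large numeric ranges. The sketch below illustrates this with two columns (`Insulin` and `DiabetesPedigreeFunction`) taken from the five rows printed by `data.head()`. The standardization uses the statistics of these five rows only, so the numbers are illustrative rather than representative of the full dataset, and `squared_contributions` is a hypothetical helper.

```python
import numpy as np

# Insulin and DiabetesPedigreeFunction for the five rows shown by data.head().
rows = np.array([
    [0.0, 0.627],
    [0.0, 0.351],
    [0.0, 0.672],
    [94.0, 0.167],
    [168.0, 2.288],
])

def squared_contributions(a, b):
    """Share of each feature in the squared Euclidean distance between a and b."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Raw data: compare the last two rows.
raw_share = squared_contributions(rows[3], rows[4])

# Standardize with the mean and standard deviation of these five rows only.
scaled = (rows - rows.mean(axis=0)) / rows.std(axis=0)
scaled_share = squared_contributions(scaled[3], scaled[4])

print("Share of Insulin in raw distance:   ", round(float(raw_share[0]), 4))
print("Share of Insulin in scaled distance:", round(float(scaled_share[0]), 4))
```

On the raw values, almost all of the distance comes from `Insulin`, so the second feature barely influences which neighbors are selected. After standardization both features can contribute.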


# Problem - 3 - Experimentation with k:

1. Vary the number of neighbors - k:

• Run the KNN model on both the original and scaled datasets for k = 1, 2, 3, ..., 15.

• For each k, record:
– Accuracy.

– Time taken to make predictions.

In [None]:
import time

# Range of k values to test
k_values = range(1, 16)

# Initialize dictionaries to store results
results_original = {"k": [], "accuracy": [], "time": []}
results_scaled = {"k": [], "accuracy": [], "time": []}

for k in k_values:
    # Time and test on original dataset
    # time.perf_counter() is a monotonic, high-resolution clock meant for measuring intervals
    start_time = time.perf_counter()
    y_pred_original = knn_predict(X_test, X_train, y_train, k)
    elapsed_time_original = time.perf_counter() - start_time
    accuracy_original = compute_accuracy(y_test, y_pred_original)

    # Record results for original dataset
    results_original["k"].append(k)
    results_original["accuracy"].append(accuracy_original)
    results_original["time"].append(elapsed_time_original)

    # Time and test on scaled dataset
    start_time = time.perf_counter()
    y_pred_scaled = knn_predict(X_test_scaled, X_train_scaled, y_train, k)
    elapsed_time_scaled = time.perf_counter() - start_time
    accuracy_scaled = compute_accuracy(y_test, y_pred_scaled)

    # Record results for scaled dataset
    results_scaled["k"].append(k)
    results_scaled["accuracy"].append(accuracy_scaled)
    results_scaled["time"].append(elapsed_time_scaled)

print("Experimentation with varying k completed.")


Experimentation with varying k completed.
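The loop above recomputes and re-sorts every distance for each value of k, although the neighbor ordering does not depend on k. A possible shortcut, sketched below under the assumption of binary 0/1 labels (as in the `Outcome` column), is to sort the neighbors once and then evaluate every k on the sorted labels. The function names and toy arrays are illustrative; ties in the vote are resolved in favor of class 0.

```python
import numpy as np

def sorted_neighbor_labels(X_test, X_train, y_train):
    """Return training labels ordered from nearest to farthest for each test row."""
    X_test = np.asarray(X_test, dtype=float)
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    diff = X_test[:, None, :] - X_train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    order = np.argsort(sq_dist, axis=1, kind="stable")
    return y_train[order]

def accuracy_for_each_k(neighbor_labels, y_test, k_values):
    """Evaluate accuracy (%) for several k values from one sorted label matrix."""
    y_test = np.asarray(y_test)
    accuracies = {}
    for k in k_values:
        # Fraction of the k nearest neighbors that belong to class 1.
        vote = neighbor_labels[:, :k].mean(axis=1)
        y_pred = (vote > 0.5).astype(int)   # a tie (exactly 0.5) predicts class 0
        accuracies[k] = float(np.mean(y_pred == y_test) * 100)
    return accuracies

# Toy data: two well-separated clusters of three points each.
X_train_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train_toy = np.array([0, 0, 0, 1, 1, 1])
X_test_toy = np.array([[0.2, 0.1], [5.5, 5.5]])
y_test_toy = np.array([0, 1])

labels_sorted = sorted_neighbor_labels(X_test_toy, X_train_toy, y_train_toy)
toy_accuracies = accuracy_for_each_k(labels_sorted, y_test_toy, range(1, 7))
print(toy_accuracies)
```

With this structure the expensive distance computation happens once, so the time per k becomes negligible. If the goal of the exercise is to time each k independently, the original loop remains the appropriate measurement.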


2. Visualize the Results:

• Plot the following graphs:

– k vs. Accuracy for original and scaled datasets.

– k vs. Time Taken for original and scaled datasets.

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(results_original["k"], results_original["accuracy"], label="Original Data", marker='o')
plt.plot(results_scaled["k"], results_scaled["accuracy"], label="Scaled Data", marker='o')
plt.title("k vs. Accuracy")
plt.xlabel("Number of Neighbors (k)")
plt.ylabel("Accuracy (%)")
plt.legend()
plt.grid()
plt.show()


In [None]:
plt.figure(figsize=(10, 6))
plt.plot(results_original["k"], results_original["time"], label="Original Data", marker='o')
plt.plot(results_scaled["k"], results_scaled["time"], label="Scaled Data", marker='o')
plt.title("k vs. Time Taken")
plt.xlabel("Number of Neighbors (k)")
plt.ylabel("Time Taken (seconds)")
plt.legend()
plt.grid()
plt.show()


3. Analyze and Discuss:

• Discuss how the choice of k affects the accuracy and computational cost.

• Identify the optimal k based on your analysis.


3. Analyze and Discuss:

• How the choice of k affects accuracy:

– Low k: For k = 1, the model is very sensitive to noise, which leads to overfitting, so accuracy may fluctuate.

– Higher k: As k increases, predictions are based on more neighbors, which smooths the decision boundary. Accuracy may increase up to an optimal k, after which it may drop due to underfitting.

• How the choice of k affects computational cost:

– In the from-scratch implementation above, each prediction computes the distance to every training sample and sorts the full list before taking the first k entries. The cost per query therefore depends mainly on the size of the training set, not on k.

– A larger k only adds a slightly longer majority vote, so the measured times should stay roughly flat across k; small differences are most likely timing noise.

• Optimal k:

– Identify the k value with the highest accuracy while keeping the computational cost acceptable.

– The optimal k often lies within a small range of values (e.g., 3 to 7), depending on the dataset.

• Example analysis: If the accuracy graph shows a peak at k = 5 with acceptable time cost, k = 5 can be chosen as the optimal value. Scaling may consistently provide higher accuracy because standardized features contribute on comparable scales to the Euclidean distance.
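Reading the optimal k off a plot can also be done programmatically from a results dictionary shaped like `results_original` or `results_scaled`. The helper below is a hypothetical sketch: it returns the first k in the list whose accuracy lies within `tolerance` percentage points of the best one. The accuracy values in `example_results` are made up for illustration and are not results from this notebook.

```python
def best_k(results, tolerance=0.0):
    """Return the first k whose accuracy is within `tolerance` points of the maximum."""
    best_accuracy = max(results["accuracy"])
    for k, accuracy in zip(results["k"], results["accuracy"]):
        if accuracy >= best_accuracy - tolerance:
            return k
    return None  # only reached when the results are empty

# Made-up accuracies, for illustration only.
example_results = {
    "k": [1, 3, 5, 7, 9],
    "accuracy": [66.0, 70.6, 71.0, 70.8, 71.0],
}

strict_choice = best_k(example_results)                  # exact maximum
relaxed_choice = best_k(example_results, tolerance=0.5)  # near-maximum is accepted

print("Strict choice:", strict_choice)
print("Relaxed choice:", relaxed_choice)
```

A non-zero tolerance is one way to avoid over-interpreting differences of a fraction of a percentage point, which on a 230-sample test set correspond to a single prediction.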





# Problem - 4 - Additional Questions (Optional but Highly Recommended):
• Discuss the challenges of using KNN for large datasets and high-dimensional data.

• Suggest strategies to improve the efficiency of KNN (e.g., approximate nearest neighbors, dimensionality reduction).

In [4]:
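import numpy as np
import faiss  # provided by the faiss-cpu or faiss-gpu package

# Generate a random dataset: 10,000 vectors with 128 dimensions.
# Named `vectors` so the DataFrame `data` loaded earlier is not overwritten.
vectors = np.random.random((10000, 128)).astype('float32')

# Create a FAISS flat index. IndexFlatL2 performs an exact brute-force search;
# approximate search needs another index type (e.g., IndexIVFFlat or IndexHNSWFlat).
index = faiss.IndexFlatL2(128)
index.add(vectors)

# Query with five random vectors and retrieve the 5 nearest neighbors of each
queries = np.random.random((5, 128)).astype('float32')
distances, indices = index.search(queries, k=5)  # distances are squared L2 distances

print("Nearest neighbors:", indices)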
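The other strategy named in the question, dimensionality reduction, can be sketched with a small PCA built on NumPy's SVD. The functions `pca_fit` and `pca_transform` are hypothetical helpers, and the data is synthetic: 200 points in 10 dimensions that vary mostly along 2 hidden directions. The reduced matrix could then be passed to any KNN routine in place of the original features.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the feature means and the top principal directions of X."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(X, mean, components):
    """Project X onto the principal directions learned by pca_fit."""
    return (np.asarray(X, dtype=float) - mean) @ components.T

# Synthetic data: 10 observed features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X_high = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

pca_mean, pca_components = pca_fit(X_high, n_components=2)
X_reduced = pca_transform(X_high, pca_mean, pca_components)

# Fraction of the total variance kept by the two retained components.
retained = np.sum(X_reduced ** 2) / np.sum((X_high - pca_mean) ** 2)

print(X_high.shape, "->", X_reduced.shape)
print("Variance retained:", round(float(retained), 4))
```

Fewer dimensions mean cheaper distance computations, and distances in the reduced space are less affected by the noise dimensions. As with scaling, the projection should be fitted on the training data only and then applied to the test data.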
