# Problem - 1: Perform a classification task with KNN from scratch.

**1. Load the Dataset:**

• Read the dataset into a pandas DataFrame.

• Display the first few rows and perform exploratory data analysis (EDA) to understand the dataset (e.g., check data types, missing values, summary statistics).

In [3]:
import pandas as pd
import numpy as np
# Load the diabetes dataset (read_csv already returns a DataFrame)
data = pd.read_csv('/content/drive/MyDrive/Concepts of AI Colab/W3 Datasets/diabetes.csv')
# Display the first few rows of the dataset
print("First few rows of the dataset:\n", data.head())

First few rows of the dataset:
    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  


**2. Handle Missing Data:**

• Handle any missing values appropriately, either by dropping or imputing them based on the data.

In [4]:
# Percentage of missing values per column
missing_info = data.isnull().sum() / len(data) * 100
# Handle missing values
for column in data.columns:
  if missing_info[column] > 10: # More than 10% missing: impute with the column mean
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    data[column] = data[column].fillna(data[column].mean())
  else: # 10% or less missing: drop the affected rows
    data.dropna(subset=[column], inplace=True)

**3. Feature Engineering:**

• Separate the feature matrix (X) and target variable (y).

• Perform a train-test split from scratch using a 70%-30% ratio.

In [5]:
# Separate features (X) and target (y)
X = data.drop(columns=['Outcome']).values # Convert features to NumPy array
y = data['Outcome'].values # Convert target to NumPy array
# Define a function for train-test split from scratch
def train_test_split_scratch(X, y, test_size=0.3, random_seed=42):
  """
  Splits dataset into train and test sets.
  Arguments:
  X : np.ndarray
    Feature matrix.
  y : np.ndarray
    Target array.
  test_size : float
    Proportion of the dataset to include in the test split (0 < test_size < 1).
  random_seed : int
    Seed for reproducibility.
  Returns:
  X_train, X_test, y_train, y_test : np.ndarray
    Training and testing splits of features and target.
  """

  np.random.seed(random_seed)
  indices = np.arange(X.shape[0])
  np.random.shuffle(indices) # Shuffle the indices
  test_split_size = int(len(X) * test_size)
  test_indices = indices[:test_split_size]
  train_indices = indices[test_split_size:]
  X_train, X_test = X[train_indices], X[test_indices]
  y_train, y_test = y[train_indices], y[test_indices]
  return X_train, X_test, y_train, y_test
# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split_scratch(X, y, test_size=0.3)
# Output shapes to verify
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (538, 8)
Shape of X_test: (230, 8)
Shape of y_train: (538,)
Shape of y_test: (230,)
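
The shapes above can be double-checked with a small sanity test on the indices alone. The sketch below is not the notebook's function: `split_indices` is a hypothetical helper, and it uses `np.random.default_rng` instead of `np.random.seed`, so it will generally produce a different shuffle than the cell above. It only illustrates that a 70%-30% split of 768 rows gives 538 and 230 rows with no overlap.

```python
import numpy as np

def split_indices(n_samples, test_size=0.3, random_seed=42):
    """Return (train_idx, test_idx) using a local random generator (hypothetical helper)."""
    rng = np.random.default_rng(random_seed)  # leaves the global NumPy seed untouched
    indices = rng.permutation(n_samples)
    n_test = int(n_samples * test_size)
    return indices[n_test:], indices[:n_test]

n_samples = 768  # 538 + 230, the row counts printed above
train_idx, test_idx = split_indices(n_samples)

# No row may appear in both sets, and together they must cover every row once.
no_overlap = len(np.intersect1d(train_idx, test_idx)) == 0
full_cover = np.array_equal(
    np.sort(np.concatenate([train_idx, test_idx])), np.arange(n_samples)
)
print(len(train_idx), len(test_idx), no_overlap, full_cover)
```

A local generator keeps the split reproducible without changing the random state that other cells may rely on.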


**4. Implement KNN:**

• Build the KNN algorithm from scratch (no libraries like scikit-learn for KNN).

• Compute distances using Euclidean distance.

• Write functions for:

– Predicting the class for a single query.

– Predicting classes for all test samples.

• Evaluate the performance using accuracy.

In [6]:
def euclidean_distance(point1, point2):
  """
  Calculate the Euclidean distance between two points in n-dimensional space.
  Arguments:
  point1 : np.ndarray
    The first point as a numpy array.
  point2 : np.ndarray
    The second point as a numpy array.
  Returns:
  float
    The Euclidean distance between the two points.
  Raises:
  ValueError: If the input points do not have the same dimensionality.
  """
  # Check if the points are of the same dimension
  if point1.shape != point2.shape:
    raise ValueError("Points must have the same dimensions to calculate Euclidean distance.")
  # Calculate the Euclidean distance
  distance = np.sqrt(np.sum((point1 - point2) ** 2))
  return distance

In [7]:
# Test case for the function
try:
  # Define two points
  point1 = np.array([3, 4])
  point2 = np.array([0, 0])
  # Calculate the distance
  result = euclidean_distance(point1, point2)
  # Check if the result matches the expected value (e.g., sqrt(3^2 + 4^2) = 5)
  expected_result = 5.0
  assert np.isclose(result, expected_result), f"Expected {expected_result}, but got {result}"
  print("Test passed successfully!")
except ValueError as ve:
  print(f"ValueError: {ve}")
except AssertionError as ae:
  print(f"AssertionError: {ae}")
except Exception as e:
  print(f"An unexpected error occurred: {e}")

Test passed successfully!
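
Calling `euclidean_distance` once per training row works, but the same distances can be computed for all rows at once. The following is a minimal sketch under that assumption; `euclidean_distances_to_all` is a hypothetical name and is not used elsewhere in this notebook.

```python
import numpy as np

def euclidean_distances_to_all(query, X):
    """Distances from one query point to every row of X, computed in one call."""
    query = np.asarray(query, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or query.shape != (X.shape[1],):
        raise ValueError("query must have one value per column of X.")
    # Broadcasting subtracts the query from each row; norm reduces along columns.
    return np.linalg.norm(X - query, axis=1)

points = np.array([[0, 0], [3, 4], [6, 8]])
dists = euclidean_distances_to_all(np.array([3, 4]), points)
print(dists)  # distances from (3, 4) to (0, 0), (3, 4) and (6, 8)
```

The first entry reproduces the 3-4-5 triangle used in the test case above.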


In [8]:
# Function for KNN prediction for a single query
def knn_predict_single(query, X_train, y_train, k=3):
  """
  Predict the class label for a single query using the K-nearest neighbors algorithm.
  Arguments:
  query : np.ndarray
    The query point for which the prediction is to be made.
  X_train : np.ndarray
    The training feature matrix.
  y_train : np.ndarray
    The training labels.
  k : int, optional
    The number of nearest neighbors to consider (default is 3).
  Returns:
  int
    The predicted class label for the query.
  """
  distances = [euclidean_distance(query, x) for x in X_train]
  sorted_indices = np.argsort(distances)
  nearest_indices = sorted_indices[:k]
  nearest_labels = y_train[nearest_indices]
  prediction = np.bincount(nearest_labels).argmax()
  return prediction
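
The vote in `knn_predict_single` relies on `np.bincount(...).argmax()`. The sketch below isolates that step to show two properties worth knowing: the labels must be non-negative integers, and a tie is resolved in favor of the smallest label. `majority_vote` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def majority_vote(labels):
    """Most frequent label; ties go to the smallest label value."""
    labels = np.asarray(labels)
    # bincount only accepts non-negative integers, so cast before counting.
    counts = np.bincount(labels.astype(np.int64))
    return int(counts.argmax())

clear_winner = majority_vote([0, 1, 1])  # two votes for class 1
tie = majority_vote([0, 1])              # one vote each; argmax returns index 0
print(clear_winner, tie)
```

With two classes, an odd `k` such as the default `k=3` avoids ties altogether.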

In [9]:
# Function to test KNN for all test samples
def knn_predict(X_test, X_train, y_train, k=3):
  """
  Predict the class labels for all test samples using the K-nearest neighbors algorithm.
  Arguments:
  X_test : np.ndarray
    The test feature matrix.
  X_train : np.ndarray
    The training feature matrix.
  y_train : np.ndarray
    The training labels.
  k : int, optional
    The number of nearest neighbors to consider (default is 3).
  Returns:
  np.ndarray
    An array of predicted class labels for the test samples.
  """
  predictions = [knn_predict_single(x, X_train, y_train, k) for x in X_test]
  return np.array(predictions)

In [10]:
# Test case for KNN on the diabetes dataset
# Assume X_train, X_test, y_train, and y_test have been prepared using the code above
try:
  # Define the test set for the test case
  X_test_sample = X_test[:5] # Taking a small subset for testing
  y_test_sample = y_test[:5] # Corresponding labels for the subset
  # Make predictions
  predictions = knn_predict(X_test_sample, X_train, y_train, k=3)
  # Print test results
  print("Predictions:", predictions)
  print("Actual labels:", y_test_sample)
  # Check if predictions match expected format
  assert predictions.shape == y_test_sample.shape, "The shape of predictions does not match the shape of the actual labels."
  print("Test case passed successfully!")
except AssertionError as ae:
  print(f"AssertionError: {ae}")
except Exception as e:
  print(f"An unexpected error occurred: {e}")

Predictions: [0 1 0 1 1]
Actual labels: [0 0 0 0 0]
Test case passed successfully!
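
The test case above only checks the shape of the predictions, not how many are correct. As a rough illustration of the accuracy metric listed in step 4, the sketch below scores the five predictions printed above. `accuracy_score_scratch` is a hypothetical helper name, and the two arrays are copied by hand from that output.

```python
import numpy as np

def accuracy_score_scratch(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape.")
    if y_true.size == 0:
        raise ValueError("Cannot compute accuracy on empty arrays.")
    return float(np.mean(y_true == y_pred))

predictions = np.array([0, 1, 0, 1, 1])    # copied from the output above
actual_labels = np.array([0, 0, 0, 0, 0])  # copied from the output above
sample_accuracy = accuracy_score_scratch(actual_labels, predictions)
print(sample_accuracy)  # 2 of the 5 predictions match
```

Five samples are far too few to judge the model; the number is only meant to show how the metric is computed.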


# To-Do Exercise:

For the provided dataset:

• diabetes.csv

Complete the following Problems.

**Problem - 1: Perform a classification task with KNN from scratch.**

**1. Load the Dataset:**

• Read the dataset into a pandas DataFrame.

• Display the first few rows and perform exploratory data analysis (EDA) to understand the dataset (e.g., check data types, missing values, summary statistics).

In [12]:
diabetes_df = pd.read_csv('/content/drive/MyDrive/Concepts of AI Colab/W3 Datasets/diabetes.csv')

print("First few rows of the dataframe:")
print(diabetes_df.head())

print("\nThe data types of the dataframe:")
print(diabetes_df.dtypes)

print("\nThe summary statistics of the dataframe:")
print(diabetes_df.describe())

print("\nThe missing values of the dataframe:")
print(diabetes_df.isnull().sum())

First few rows of the dataframe:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

The data types of the dataframe:
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunc

**2. Handle Missing Data:**

• Handle any missing values appropriately, either by dropping or imputing them based on the data.

In [13]:
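# Percentage of missing values per column
missing_info = diabetes_df.isnull().sum() / len(diabetes_df) * 100
# Handle missing values
for column in diabetes_df.columns:
  if missing_info[column] > 10: # More than 10% missing: impute with the column mean
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    diabetes_df[column] = diabetes_df[column].fillna(diabetes_df[column].mean())
  else: # 10% or less missing: drop the affected rows
    diabetes_df.dropna(subset=[column], inplace=True)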

print("\nMissing values after processing:\n", diabetes_df.isnull().sum())



Missing values after processing:
 Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
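
`isnull()` reports no missing values, yet the rows printed earlier contain zeros in `SkinThickness` and `Insulin`. A common assumption for this dataset is that such zeros are placeholders for measurements that were never taken. The sketch below shows one way to check for them and impute them. The list of columns, the choice of the median, and the small `sample` frame (the first five rows retyped by hand, without the last three columns) are all assumptions made for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd

# The five rows printed by head() above, retyped so the sketch runs on its own.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns is a placeholder, not a real measurement.
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# isnull() finds nothing, because zeros are ordinary numbers to pandas.
nulls_before = int(sample.isnull().sum().sum())

# Count the zeros per column, then mark them as NaN so they can be imputed.
zero_counts = (sample[zero_as_missing] == 0).sum()
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)

# Impute with each column's median, computed from the non-missing values.
medians = cleaned[zero_as_missing].median()
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(medians)

print(zero_counts)
print(cleaned)
```

Whether zeros should be treated this way depends on the column: a zero in `Pregnancies` is a valid value, which is why it is left out of the list.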


**3. Feature Engineering:**

• Separate the feature matrix (X) and target variable (y).

In [14]:
X = diabetes_df.drop(columns=['Outcome']).values
y = diabetes_df['Outcome'].values

• Perform a train-test split from scratch using a 70%-30% ratio.

In [15]:
def train_test_split_scratch(X, y, test_size=0.3, random_seed=42):
  np.random.seed(random_seed)
  indices = np.arange(X.shape[0])
  np.random.shuffle(indices) # Shuffle the indices
  test_split_size = int(len(X) * test_size)
  test_indices = indices[:test_split_size]
  train_indices = indices[test_split_size:]
  X_train, X_test = X[train_indices], X[test_indices]
  y_train, y_test = y[train_indices], y[test_indices]

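  # np.bincount in knn_predict_single needs integer labels, so y is cast to int64.
  # NOTE: casting X to int64 truncates fractional features such as BMI and
  # DiabetesPedigreeFunction; the outputs recorded below were produced with this cast.
  X_train = X_train.astype(np.int64)
  X_test = X_test.astype(np.int64)
  y_train = y_train.astype(np.int64)
  y_test = y_test.astype(np.int64)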
  return X_train, X_test, y_train, y_test

# Perform the train-test split
X_train, X_test, y_train, y_test = train_test_split_scratch(X, y, test_size=0.3)
# Output shapes to verify
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# Additional checks to confirm the data types after the split
print("\nData Types after Split:")
print(f"X_train data type: {X_train.dtype}")  # int64 because of the cast inside train_test_split_scratch
print(f"X_test data type: {X_test.dtype}")    # int64 because of the cast inside train_test_split_scratch
print(f"y_train data type: {y_train.dtype}")  # Should be int64 (Outcome)
print(f"y_test data type: {y_test.dtype}")    # Should be int64 (Outcome)

Shape of X_train: (538, 8)
Shape of X_test: (230, 8)
Shape of y_train: (538,)
Shape of y_test: (230,)

Data Types after Split:
X_train data type: int64
X_test data type: int64
y_train data type: int64
y_test data type: int64


**4. Implement KNN:**

• Build the KNN algorithm from scratch (no libraries like scikit-learn for KNN).

• Compute distances using Euclidean distance.

• Write functions for:

– Predicting the class for a single query.

– Predicting classes for all test samples.


In [16]:
def euclidean_distance(point1, point2):
  distance = np.sqrt(np.sum((point1 - point2) ** 2))
  return distance


def knn_predict_single(query, X_train, y_train, k=3):
  distances = [euclidean_distance(query, x) for x in X_train]
  sorted_indices = np.argsort(distances)
  nearest_indices = sorted_indices[:k]
  nearest_labels = y_train[nearest_indices]
  prediction = np.bincount(nearest_labels).argmax()
  return prediction

def knn_predict(X_test, X_train, y_train, k=3):
  predictions = [knn_predict_single(x, X_train, y_train, k) for x in X_test]
  return np.array(predictions)

try:
  # Define the test set for the test case
  X_test_sample = X_test[:5] # Taking a small subset for testing
  y_test_sample = y_test[:5] # Corresponding labels for the subset
  # Make predictions
  predictions = knn_predict(X_test_sample, X_train, y_train, k=3)
  # Print test results
  print("Predictions:", predictions)
  print("Actual labels:", y_test_sample)
  # Check if predictions match expected format
  assert predictions.shape == y_test_sample.shape, "The shape of predictions does not match the shape of the actual labels."
  print("Test case passed successfully!")
except AssertionError as ae:
  print(f"AssertionError: {ae}")
except Exception as e:
  print(f"An unexpected error occurred: {e}")

Predictions: [0 1 0 1 1]
Actual labels: [0 0 0 0 0]
Test case passed successfully!


• Evaluate the performance using accuracy.

In [17]:
def compute_accuracy(y_true, y_pred):
    correct_predictions = np.sum(y_true == y_pred)
    total_samples = len(y_true)
    accuracy = correct_predictions / total_samples
    return accuracy

try:
  y_pred = knn_predict(X_test, X_train, y_train, k=3)
  accuracy = compute_accuracy(y_test, y_pred)
  # Print inside the try block so a failed prediction cannot trigger a NameError
  print("Accuracy:", accuracy)
except Exception as e:
  print(f"An unexpected error occurred: {e}")

Accuracy: 0.6695652173913044
