Lauren Steel - 20218337

October 14, 2024 

## CMPE 251 Assignment 2 Part 2: Coding 


The objective of this assignment is to assess your understanding and implementation skills in two fundamental classification algorithms: K-Nearest Neighbors (KNN) and Logistic Regression. You will be required to perform data preprocessing, implement KNN from scratch, and use Logistic Regression with the scikit-learn library on the same dataset.


We will be using the Breast Cancer Wisconsin (Diagnostic) Dataset, which is available in the sklearn.datasets library. This dataset consists of 30 features and a binary target variable indicating whether a given cancer diagnosis is malignant (0) or benign (1).
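
The label encoding is easy to get backwards, so it is worth checking directly. A minimal sketch, assuming scikit-learn is installed; it only reads attributes of the loaded dataset (`target_names`, `target`), and `label_names` and `class_counts` are names introduced here for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# target_names[i] is the class name that belongs to integer label i
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
# number of samples carrying each integer label
class_counts = np.bincount(data.target)

print(data.data.shape)   # (569, 30)
print(label_names)       # {0: 'malignant', 1: 'benign'}
print(class_counts)      # [212 357]
```

Because label 1 is the default positive class in `precision_score`, `recall_score` and `f1_score`, the precision and recall reported later in this notebook describe the benign class.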

In [1]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from collections import Counter 

from sklearn.datasets import load_breast_cancer

## Question 1: KNN Implementation from Scratch (5 points)

1. Data Preprocessing:

a) Load the Breast Cancer dataset from sklearn.datasets.

b) Normalize the feature data so that all features have a mean of 0 and a standard deviation of 1.

c) Split the dataset into training (70%) and testing (30%) sets.
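
Step (b) is the usual z-score transform. As a reference, with $\mu_j$ and $\sigma_j$ denoting the mean and standard deviation of feature $j$ over the $n$ samples used to fit the scaler (this notation is introduced here and is not part of the assignment text):

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}
$$

`StandardScaler` uses this population form of the standard deviation (dividing by $n$ rather than $n - 1$).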

In [2]:
# (part a) load the breast cancer dataset 
data = load_breast_cancer()
x = data.data
y = data.target

# (part b) apply standardization 
scaler = StandardScaler()
scaled = scaler.fit_transform(x)

# (part c) split the dataset into training and test sets 
x_train, x_test, y_train, y_test = train_test_split(scaled, y, test_size=0.3, random_state=42)
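
The cell above fits the scaler on all 569 samples before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below illustrates that idea with plain NumPy on a small made-up matrix; `standardize_fit` and `standardize_apply` are hypothetical helper names, not part of the assignment or of scikit-learn.

```python
import numpy as np

def standardize_fit(x_train):
    """Return per-feature mean and population std computed on training rows only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)            # ddof=0, the same convention as StandardScaler
    std = np.where(std == 0, 1.0, std)   # avoid dividing by zero for constant features
    return mean, std

def standardize_apply(x, mean, std):
    """Scale any rows with statistics learned from the training rows."""
    return (x - mean) / std

# toy data (made up for illustration): 4 training rows, 2 test rows, 2 features
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_test = np.array([[2.5, 25.0], [5.0, 50.0]])

mean, std = standardize_fit(toy_train)
train_scaled = standardize_apply(toy_train, mean, std)
test_scaled = standardize_apply(toy_test, mean, std)

print(train_scaled.mean(axis=0))  # approximately 0 for every feature
print(train_scaled.std(axis=0))   # approximately 1 for every feature
print(test_scaled[0])             # a row equal to the training mean maps to 0
```

With scikit-learn the same pattern is to split first, call `scaler.fit` on the training rows, then call `scaler.transform` on both splits. Doing so may shift the reported accuracies slightly.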


2. KNN Implementation:

a) Implement the KNN algorithm from scratch without using sklearn or any other library functions for the classifier. The implementation should include:

-   Calculating the Euclidean distance between data points.
-   Finding the nearest neighbors for a given test point.
-   Predicting the class based on majority voting among the nearest neighbors.


In [3]:
# (part a) implement the KNN algorithm from scratch
# Euclidean distance between two feature vectors
def euclidean_distance(x1, x2):
    euc_dist = np.sqrt(np.sum((x1 - x2) ** 2))
    return euc_dist

# return the labels of the k training points nearest to Xtest
def nearest_neighbor(x_train, y_train, Xtest, k):
    distances = []
    for i, Xtrain in enumerate(x_train):
        dist = euclidean_distance(Xtest, Xtrain)
        distances.append((dist, y_train[i]))

    distances.sort(key=lambda x: x[0])
    return [label for _, label in distances[:k]]

# function for predicting the majority class
def prediction(neighbors):  
    class_count = Counter(neighbors)
    majority = class_count.most_common(1)[0][0]
    return majority

# K nearest neighbors algorithm 
def knn(x_train, y_train, x_test, k=3):
    y_prediction = []
    for Xtest in x_test:
        neighbors = nearest_neighbor(x_train, y_train, Xtest, k)
        predict = prediction(neighbors)
        y_prediction.append(predict)
    return y_prediction 
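
The loop-based version above computes one distance at a time. As a cross-check, the same three steps (distances, k nearest, majority vote) can be sketched with NumPy broadcasting. `knn_vectorized` is a hypothetical name, and the sketch assumes non-negative integer class labels. On a tied vote it returns the smallest label, which may differ from `Counter.most_common` in the loop version.

```python
import numpy as np

def knn_vectorized(x_train, y_train, x_test, k=3):
    """Predict a label for each test row by majority vote among its k nearest training rows."""
    x_train = np.asarray(x_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    y_train = np.asarray(y_train)

    # pairwise differences, shape (n_test, n_train, n_features)
    diff = x_test[:, None, :] - x_train[None, :, :]
    # Euclidean distances, shape (n_test, n_train)
    dists = np.sqrt((diff ** 2).sum(axis=2))

    # indices of the k closest training rows for every test row
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    neighbor_labels = y_train[nearest]

    # majority vote per test row (ties go to the smallest label)
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])

# toy data (made up): two well-separated clusters
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
toy_queries = np.array([[0.05, 0.05], [5.0, 5.1]])

toy_pred = knn_vectorized(toy_x, toy_y, toy_queries, k=3)
print(toy_pred)  # [0 1]
```

The intermediate difference array holds n_test × n_train × n_features values, which is small for this dataset but grows quickly on larger ones.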


b) Evaluate your KNN model on the testing set using the following metrics:
- Accuracy

In [4]:
# (part b) accuracy = correct predictions / total predictions
def acc(true_val, y_prediction):
    # convert to arrays so the comparison is element-wise even for plain lists
    correct = np.sum(np.asarray(true_val) == np.asarray(y_prediction))
    accuracy = correct / len(true_val)
    return accuracy
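
Accuracy alone does not show which kind of error a classifier makes. A from-scratch sketch of precision, recall and F1 for binary labels is given below. `binary_metrics` is a hypothetical helper; it treats label 1 as the positive class (the default that `precision_score`, `recall_score` and `f1_score` also use) and, as a choice made for this sketch, returns 0.0 when a denominator is zero.

```python
import numpy as np

def binary_metrics(true_val, y_prediction, positive=1):
    """Return (precision, recall, f1) for binary labels."""
    true_val = np.asarray(true_val)
    y_prediction = np.asarray(y_prediction)

    # confusion-matrix counts for the positive class
    tp = int(np.sum((y_prediction == positive) & (true_val == positive)))
    fp = int(np.sum((y_prediction == positive) & (true_val != positive)))
    fn = int(np.sum((y_prediction != positive) & (true_val == positive)))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# toy labels (made up): 3 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1]

precision, recall, f1 = binary_metrics(toy_true, toy_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```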

3. Parameter Tuning:

a) Experiment with different values of k (number of neighbors) and find the optimal k value that gives the best performance on the test set.

In [5]:
# (part a) experiment with different values of k
# evaluate each candidate k and report its accuracy on the test set
def k_testing(x_train, y_train, x_test, y_test, k_vals):
    optimal_k = None 
    highest_acc = 0
    for k in k_vals:
        y_prediction = knn(x_train, y_train, x_test, k)
        acc_dec = acc(y_test, y_prediction)
        acc_perc = acc_dec * 100

        # print all of the k values with their associated accuracy percentage 
        print(f"Accuracy using k={k}: {acc_perc:.2f}%")
        if acc_perc > highest_acc:
            highest_acc = acc_perc
            optimal_k = k

    # print the optimal k value and its accuracy
    print(f"The highest accuracy of the model is {highest_acc:.2f}% with the optimal k value being {optimal_k}")
    return optimal_k, highest_acc

# candidate k values: the odd numbers between 1 and 15
k_vals = [1, 3, 5, 7, 9, 11, 13, 15]
optimal_k, highest_acc = k_testing(x_train, y_train, x_test, y_test, k_vals)

# establish y_prediction_knn for future metric calculations 
y_prediction_knn = knn(x_train, y_train, x_test, k=optimal_k)

Accuracy using k=1: 95.32%
Accuracy using k=3: 95.91%
Accuracy using k=5: 95.91%
Accuracy using k=7: 95.91%
Accuracy using k=9: 97.08%
Accuracy using k=11: 95.91%
Accuracy using k=13: 95.91%
Accuracy using k=15: 95.32%
The highest accuracy of the model is 97.08% with the optimal k value being 9
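
Picking k by test-set accuracy, as the task asks, means the reported best score is also the score used for selection. Here the gap between k=9 and most other values is about two test samples out of 171. One common alternative, shown only as a sketch, is to choose k by cross-validation on the training data and keep the test set for the final number. `choose_k_by_cv` and `_knn_predict` are hypothetical helpers, and the data is synthetic.

```python
import numpy as np

def _knn_predict(x_train, y_train, x_test, k):
    """Minimal KNN (Euclidean distance, majority vote) for non-negative integer labels."""
    dists = np.sqrt(((x_test[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2))
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.array([np.bincount(row).argmax() for row in y_train[nearest]])

def choose_k_by_cv(x, y, k_vals, n_folds=5, seed=0):
    """Return (best_k, mean accuracy per k) using n_folds-fold cross-validation."""
    rng = np.random.default_rng(seed)
    # shuffle the row indices once and cut them into n_folds validation folds
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    mean_acc = {}
    for k in k_vals:
        fold_acc = []
        for i, val_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            pred = _knn_predict(x[train_idx], y[train_idx], x[val_idx], k)
            fold_acc.append(np.mean(pred == y[val_idx]))
        mean_acc[k] = float(np.mean(fold_acc))
    # ties are resolved in favour of the first k in k_vals
    best_k = max(k_vals, key=lambda k: mean_acc[k])
    return best_k, mean_acc

# synthetic two-class data (made up): two Gaussian blobs in 2-D
rng = np.random.default_rng(42)
toy_x = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(6.0, 1.0, (60, 2))])
toy_y = np.array([0] * 60 + [1] * 60)

best_k, mean_acc = choose_k_by_cv(toy_x, toy_y, k_vals=[1, 3, 5, 7])
print(best_k, mean_acc)
```

In this notebook the call would take `x_train` and `y_train` in place of the toy arrays, and the chosen k would then be scored once on `x_test`.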


## Question 2: Logistic Regression using sklearn (5 points)

1. Data Preprocessing:

a) Reuse the same normalized and split data from Part 1.

2. Model Implementation:

a) Implement Logistic Regression using the sklearn library.

b) Fit the model on the training data and make predictions on the test data.

In [6]:
# (part a) initialize a logistic regression model (penalty=None disables regularization)
logisticModel = LogisticRegression(penalty=None)

# (part b) fit the model
logisticModel.fit(x_train, y_train)

# make predictions on the test data 
y_prediction_logr = logisticModel.predict(x_test)
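
For reference, a fitted binary logistic regression predicts by passing a linear score through the sigmoid and assigning class 1 when the resulting probability exceeds 0.5. The sketch below reproduces that rule in NumPy with made-up weights. In scikit-learn the fitted values are stored in `coef_` and `intercept_`; `toy_w`, `toy_b` and `predict_logistic` are hypothetical names.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_logistic(x, w, b, threshold=0.5):
    """Return (class-1 probabilities, hard 0/1 labels) for the rows of x."""
    proba = sigmoid(x @ w + b)
    return proba, (proba > threshold).astype(int)

# made-up weights and inputs, for illustration only
toy_w = np.array([2.0, -1.0])
toy_b = 0.5
toy_x = np.array([[1.0, 0.0],    # score  2.5
                  [0.0, 3.0],    # score -2.5
                  [1.0, 1.0]])   # score  1.5

proba, labels = predict_logistic(toy_x, toy_w, toy_b)
print(proba)
print(labels)  # [1 0 1]
```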

3. Model Evaluation:

a) Evaluate the model using the same metrics as in Part 1 (accuracy, precision, recall, F1-score).

In [7]:
# (part a) evaluate the model with various metrics
# calculate the accuracy, precision, recall and f1 score for log reg using sklearn metrics library 
acc_logr = accuracy_score(y_test, y_prediction_logr)
precision_logr = precision_score(y_test, y_prediction_logr)
recall_logr = recall_score(y_test, y_prediction_logr)
f1_score_logr = f1_score(y_test, y_prediction_logr)

# print the metric outputs 
print(f"Log reg accuracy: {acc_logr * 100:.2f}%")
print(f"Log reg precision: {precision_logr * 100:.2f}%")
print(f"Log reg recall: {recall_logr * 100:.2f}%")
print(f"Log reg f1 score: {f1_score_logr * 100:.2f}%")

Log reg accuracy: 95.32%
Log reg precision: 99.02%
Log reg recall: 93.52%
Log reg f1 score: 96.19%


b) Compare the performance of Logistic Regression to the KNN model.

In [8]:
# (part b) compare the performance of log reg and knn
# calculate the precision, recall and f1 score for knn using sklearn metrics library 
precision_knn = precision_score(y_test, y_prediction_knn)
recall_knn = recall_score(y_test, y_prediction_knn)
f1_score_knn = f1_score(y_test, y_prediction_knn)

# print the metric outputs 
print(f"Knn accuracy: {highest_acc:.2f}%")
print(f"Knn precision: {precision_knn * 100:.2f}%")
print(f"Knn recall: {recall_knn * 100:.2f}%")
print(f"Knn f1 score: {f1_score_knn * 100:.2f}%")

Knn accuracy: 97.08%
Knn precision: 97.25%
Knn recall: 98.15%
Knn f1 score: 97.70%
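
The printed percentages are consistent with the confusion-matrix counts below. These counts were inferred from the outputs above, assuming 171 test samples with label 1 as the positive class; they are not read from the notebook itself. The sketch recomputes the four metrics from those counts so the two models can be compared by error type. `metrics_from_counts` is a hypothetical helper.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 as percentages rounded to 2 decimals."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return tuple(round(100 * m, 2) for m in (accuracy, precision, recall, f1))

# counts inferred from the printed metrics (an assumption, see the lead-in)
inferred_counts = {
    "logistic regression": dict(tp=101, fp=1, fn=7, tn=62),
    "knn (k=9)": dict(tp=106, fp=3, fn=2, tn=60),
}

results = {name: metrics_from_counts(**c) for name, c in inferred_counts.items()}
for name, values in results.items():
    print(name, values)
```

Under these counts KNN makes fewer errors overall (5 versus 8), while logistic regression makes fewer false positives (1 versus 3). In scikit-learn's encoding of this dataset label 1 is benign, so a false positive here is a malignant case predicted as benign.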
