***KNN***


k-NN Explanation:

This machine learning algorithm works on the idea that similar things lie near each other. I picture it as a room full of people from various professions: if you are trying to identify one person's profession and see that they are mostly surrounded by doctors, it is reasonable to assume that they are a doctor too.

What sets this algorithm apart from many others is that it makes no assumptions about the underlying distribution of the data; instead, it compares a new data point with the stored points around it and decides from that. This means that "training" is simply storing the dataset.

Then, when a new point is presented, the algorithm measures the distance between that point and every point in the dataset, picks the k closest ones (k, the number of "neighbors", is chosen beforehand), and makes a prediction based on them. For classification it takes the most frequent label among the neighbors; for regression it averages the neighbors' target values.
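To make the two prediction rules concrete, here is a minimal sketch in plain Python. The function name `knn_predict`, its `task` parameter, and the toy data are made up for illustration and are not part of the notebook's own implementation further down.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_targets, query, k, task="classification"):
    """Predict the target of `query` from its k nearest training points."""
    # Pair each training point's distance to the query with its target,
    # sort by distance, and keep the k closest pairs.
    neighbors = sorted(
        (dist(point, query), target)
        for point, target in zip(train_points, train_targets)
    )[:k]
    nearest_targets = [target for _, target in neighbors]

    if task == "classification":
        # Most frequent label among the k neighbors.
        return Counter(nearest_targets).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values.
    return sum(nearest_targets) / len(nearest_targets)


# Toy data, invented for this example.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["doctor", "doctor", "doctor", "engineer", "engineer"]
values = [10.0, 12.0, 11.0, 50.0, 52.0]

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, values, (1.1, 1.0), k=3, task="regression"))
```

Note that nothing is fitted beforehand: all of the work happens at prediction time, which is why "training" amounts to keeping the data around.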


***Pseudocode***

```
FUNCTION Train(train, test, Y_train):
    FOR every test_instance IN test:
        FOR every train_instance IN train:
            COMPUTE Euclidean distance between test_instance and train_instance
            STORE distance and corresponding Y_train label
        END FOR
    END FOR
    RETURN list of distances for each test_instance

FUNCTION Test(distances, Y_test, k):
    FOR every sublist IN distances:
        SORT sublist based on distance
        SELECT top k distances
        DETERMINE majority class among the k-nearest neighbors
        STORE the majority class as prediction
    END FOR
    COMPARE predictions with Y_test to determine accuracy
    RETURN sorted distances, labels, predictions, and accuracy

MAIN:
    data = LOAD your dataset
    PREPROCESS data (e.g., encoding, normalization)

    train, test = SPLIT data into training and testing sets

    X_train, X_test = INPUT FEATURES from train, test
    Y_train, Y_test = TARGET VALUES from train, test

    distances = Train(X_train, X_test, Y_train)
    sortedL, labels, predictions, accuracy = Test(distances, Y_test, k)

    PRINT accuracy and predictions
```




***Loss and optimization function***

In k-Nearest Neighbors:

1. **Loss Function:** Because k-NN does not learn through an iterative process, it has no traditional loss function to minimize. Its performance is judged simply by how accurately the chosen neighbors predict the class or value of a new data point.

2. **Optimization Function:** As previously stated, k-NN does not use iterative optimization as many other models do. Instead, the "optimization" in k-NN consists of tuning hyperparameters: primarily selecting the number of neighbors (k) and the distance metric that give the best performance on held-out validation data.
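One way to picture this kind of tuning is to score several candidate values of k on a validation split and keep the best one. The sketch below assumes a simple hold-out split and Euclidean distance; the helper names (`predict`, `validation_accuracy`, `choose_k`) and the toy points are hypothetical and only illustrate the idea.

```python
from collections import Counter
from math import dist


def predict(train_X, train_y, query, k):
    """Majority label among the k training points closest to `query`."""
    nearest = sorted(zip((dist(x, query) for x in train_X), train_y))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def validation_accuracy(train_X, train_y, val_X, val_y, k):
    """Fraction of validation points whose predicted label is correct."""
    hits = sum(predict(train_X, train_y, x, k) == y for x, y in zip(val_X, val_y))
    return hits / len(val_y)


def choose_k(train_X, train_y, val_X, val_y, candidates):
    """Return the best-scoring candidate k together with all scores."""
    scores = {
        k: validation_accuracy(train_X, train_y, val_X, val_y, k)
        for k in candidates
    }
    best_k = max(scores, key=scores.get)  # the first maximum wins on ties
    return best_k, scores


# Toy data, invented for this example: two clusters plus one noisy
# class-1 point sitting inside the class-0 cluster.
train_X = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = [0, 0, 0, 1, 1, 1, 1, 1]
val_X = [(0.4, 0.4), (5.5, 5.5)]
val_y = [0, 1]

best_k, scores = choose_k(train_X, train_y, val_X, val_y, candidates=[1, 3, 5])
print(best_k, scores)
```

With k = 1 the noisy point alone decides the first validation example, while larger values of k outvote it. The same loop could also iterate over distance metrics if the distance function were passed in as a parameter.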

***KNN code***

In [None]:
# Importing libraries to use the dataset.

import pandas as pd
from math import sqrt
from sklearn.model_selection import train_test_split

In [None]:
data = pd.read_csv("Credit Score Classification Dataset.csv")
data

Unnamed: 0,Age,Gender,Income,Education,Marital Status,Number of Children,Home Ownership,Credit Score
0,25,Female,50000,Bachelor's Degree,Single,0,Rented,High
1,30,Male,100000,Master's Degree,Married,2,Owned,High
2,35,Female,75000,Doctorate,Married,1,Owned,High
3,40,Male,125000,High School Diploma,Single,0,Owned,High
4,45,Female,100000,Bachelor's Degree,Married,3,Owned,High
...,...,...,...,...,...,...,...,...
159,29,Female,27500,High School Diploma,Single,0,Rented,Low
160,34,Male,47500,Associate's Degree,Single,0,Rented,Average
161,39,Female,62500,Bachelor's Degree,Married,2,Owned,High
162,44,Male,87500,Master's Degree,Single,0,Owned,High


In [None]:
# Replace education levels and credit score with numerical values, home ownership status to binary values (0 for rented, 1 for owned).

data['Education'] = data['Education'].replace({"High School Diploma": 1, "Associate's Degree": 2, "Bachelor's Degree": 3, "Master's Degree": 4, 'Doctorate': 5})
data['Home Ownership'] = data['Home Ownership'].replace({'Rented': 0, 'Owned': 1})
data['Credit Score'] = data['Credit Score'].replace({'Low': 0, 'Average': 1, 'High':2})

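# Convert the gender and marital status columns to dummy variables, drop the original columns and add them to the dataframe.
# dtype=float keeps the dummies numeric (recent pandas versions return booleans by default, which cannot be subtracted during normalization).
dummies = pd.get_dummies(data['Gender'], prefix='Gender', dtype=float)
data = data.drop(columns=['Gender'])

dummies2 = pd.get_dummies(data['Marital Status'], prefix='Marital Status', dtype=float)
data = data.drop(columns=['Marital Status'])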

data = pd.concat([data.iloc[:, :2], dummies, data.iloc[:, 2:]], axis=1)
data = pd.concat([data.iloc[:, :4], dummies2, data.iloc[:, 4:]], axis=1)

# Normalize all columns in the dataframe to range between 0 and 1.
# Note: this also rescales the 'Credit Score' target, so its labels become 0.0, 0.5 and 1.0.
for column in data.columns:
    minimum = data[column].min()
    maximum = data[column].max()

    data[column] = (data[column] - minimum) / (maximum - minimum)

data

Unnamed: 0,Age,Income,Gender_Female,Gender_Male,Marital Status_Married,Marital Status_Single,Education,Number of Children,Home Ownership,Credit Score
0,0.000000,0.181818,1.0,0.0,0.0,1.0,0.50,0.000000,0.0,1.0
1,0.178571,0.545455,0.0,1.0,1.0,0.0,0.75,0.666667,1.0,1.0
2,0.357143,0.363636,1.0,0.0,1.0,0.0,1.00,0.333333,1.0,1.0
3,0.535714,0.727273,0.0,1.0,0.0,1.0,0.00,0.000000,1.0,1.0
4,0.714286,0.545455,1.0,0.0,1.0,0.0,0.50,1.000000,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...
159,0.142857,0.018182,1.0,0.0,0.0,1.0,0.00,0.000000,0.0,0.0
160,0.321429,0.163636,0.0,1.0,0.0,1.0,0.25,0.000000,0.0,0.5
161,0.500000,0.272727,1.0,0.0,1.0,0.0,0.50,0.666667,1.0,1.0
162,0.678571,0.454545,0.0,1.0,0.0,1.0,0.75,0.000000,1.0,1.0
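As a quick check of the min-max step, the sketch below rescales a handful of ages by hand. The first five ages come from the table shown earlier; the maximum age of 53 does not appear in the displayed rows and is inferred from the normalized values (30 maps to 0.178571 only if the range is 28), so treat it as an assumption. The helper `min_max_scale` is hypothetical, and its guard for constant columns is an addition that the notebook's loop does not have.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Ages from the first rows of the dataset, plus the inferred maximum (53).
ages = [25, 30, 35, 40, 45, 53]
scaled = min_max_scale(ages)
print([round(s, 6) for s in scaled])
# → [0.0, 0.178571, 0.357143, 0.535714, 0.714286, 1.0]
```

The first five results match the `Age` column of the normalized table. Scaling matters for k-NN because an unscaled feature such as `Income` would otherwise dominate the Euclidean distance.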


In [None]:
# Split the data into 80% training and 20% testing.
train, test = train_test_split(data, test_size=0.2)

X_train = train.drop('Credit Score', axis=1)  # Drop the 'Credit Score' column to get the input features for the training set.
Y_train = train["Credit Score"] # Get the 'Credit Score' column to get the target labels for the training set.

X_train, Y_train # Display the input features and target labels for the training set.

(          Age    Income  Gender_Female  Gender_Male  Marital Status_Married  \
 59   0.178571  0.636364            0.0          1.0                     1.0   
 22   0.821429  0.327273            1.0          0.0                     1.0   
 115  0.821429  0.454545            1.0          0.0                     1.0   
 4    0.714286  0.545455            1.0          0.0                     1.0   
 135  0.000000  0.218182            1.0          0.0                     0.0   
 ..        ...       ...            ...          ...                     ...   
 100  0.214286  0.327273            0.0          1.0                     0.0   
 108  0.607143  0.618182            0.0          1.0                     0.0   
 79   0.107143  0.054545            1.0          0.0                     0.0   
 5    0.892857  0.909091            0.0          1.0                     1.0   
 81   0.464286  0.309091            1.0          0.0                     1.0   
 
      Marital Status_Single  Education

In [None]:
X_test = test.drop('Credit Score', axis=1) # Drop the 'Credit Score' column to get the input features for the test set.
Y_test = test["Credit Score"] # Get the 'Credit Score' column to get the target labels for the test set.


X_test, Y_test  # Display the input features and target labels for the test set.

(          Age    Income  Gender_Female  Gender_Male  Marital Status_Married  \
 113  0.464286  0.309091            1.0          0.0                     1.0   
 86   0.321429  0.163636            0.0          1.0                     0.0   
 53   0.142857  0.018182            1.0          0.0                     0.0   
 133  0.142857  0.312727            1.0          0.0                     1.0   
 163  0.857143  0.381818            1.0          0.0                     1.0   
 146  0.928571  0.836364            0.0          1.0                     1.0   
 55   0.500000  0.272727            1.0          0.0                     1.0   
 42   0.250000  0.236364            0.0          1.0                     0.0   
 40   0.928571  0.836364            0.0          1.0                     1.0   
 123  0.178571  0.636364            0.0          1.0                     1.0   
 8    0.392857  0.400000            1.0          0.0                     1.0   
 26   0.500000  0.254545            1.0 

In [None]:
def Train(train, test, Y_train):
    # Initialize a list to hold the distances between test and train instances.
    Distances = []

    # Iterate over each instance in the test set.
    for i in range(len(test)):
        # Initialize a list to hold distances for the current test instance.
        distances_0 = []

        # Iterate over each instance in the train set.
        for j in range(len(train)):
            # Calculate the Euclidean distance between test[i] and train[j].

            diff = test.iloc[i] - train.iloc[j]
            squared_diff = diff**2
            distance = sqrt(sum(squared_diff))

            # Append the calculated distance and corresponding label from Y_train to the distances_0 list.
            distances_0.append((distance, Y_train.iloc[j]))

        # Append the distances for the current test instance to the main Distances list.
        Distances.append(distances_0)

    # Return the list of distances.
    return Distances
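The nested loop above computes one distance at a time, which gets slow as the dataset grows. A possible alternative, sketched here under the assumption that NumPy is available, computes every test-to-train distance at once with broadcasting. The function `pairwise_euclidean` is a hypothetical helper, not part of the notebook, and it returns a plain distance matrix rather than the list of `(distance, label)` tuples that `Train` produces.

```python
import numpy as np


def pairwise_euclidean(test_X, train_X):
    """Return an (n_test, n_train) matrix of Euclidean distances."""
    test_X = np.asarray(test_X, dtype=float)
    train_X = np.asarray(train_X, dtype=float)
    # Broadcasting: (n_test, 1, d) - (1, n_train, d) -> (n_test, n_train, d)
    diff = test_X[:, None, :] - train_X[None, :, :]
    # Sum the squared differences over the feature axis, then take the root.
    return np.sqrt((diff ** 2).sum(axis=2))


# Toy points, invented for this example.
train_points = [[0.0, 0.0], [3.0, 4.0]]
test_points = [[0.0, 0.0], [3.0, 0.0]]

distances = pairwise_euclidean(test_points, train_points)
print(distances)
```

Row i of the result holds the distances from test point i to every training point. With the DataFrames used in this notebook, the inputs would be `X_test.to_numpy()` and `X_train.to_numpy()`.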



In [None]:
def Test(distances, Y_test, k):
    Sorted = []          # List to hold sorted distances.
    labels = []          # List to hold labels corresponding to the top k distances.
    classification = []  # List to hold predicted classifications.

    # Iterate over the distance sublists.
    for sublist in distances:
        # Sorting and taking top k distances
        sorted_sublist = sorted(sublist)[:k]
        Sorted.append(sorted_sublist)

        # Extracting labels of k-nearest neighbors
        labels_current = [item[1] for item in sorted_sublist]
        labels.append(labels_current)

        # Classifying based on majority label (0.0 = Low, 0.5 = Average, 1.0 = High after normalization); ties go to the lower class.
        class_1 = labels_current.count(0.0)
        class_2 = labels_current.count(0.5)
        class_3 = labels_current.count(1.0)

        if class_1 >= class_2 and class_1 >= class_3:
            prediction = 0.0
        elif class_2 >= class_1 and class_2 >= class_3:
            prediction = 0.5
        elif class_3 >= class_1 and class_3 >= class_2:
            prediction = 1.0

        # Append the current prediction to the classification list.
        classification.append(prediction)

    # Calculate how many predictions match the true labels.
    correct = 0
    for i in range(len(classification)):
        if classification[i] == Y_test.iloc[i]:
            correct += 1

    # Compute the accuracy.
    accuracy = (correct/len(Y_test))

    return Sorted, labels, classification, accuracy


In [None]:
# Compute the Euclidean distances between each test instance and all training instances. Store these distances in the 'distance' variable.
distance = Train(X_train, X_test, Y_train)

In [None]:
distance #Display the distances

[[(1.5004133067517609, 1.0),
  (0.6992801370206871, 1.0),
  (0.7140162157210876, 1.0),
  (0.4790395387712297, 1.0),
  (1.9152728609115093, 0.5),
  (0.7075047405958177, 1.0),
  (1.9233535434969722, 0.0),
  (1.7068312990269479, 1.0),
  (0.3280275000915724, 1.0),
  (0.050968855722572516, 1.0),
  (2.396357766831556, 0.5),
  (1.508762425970339, 1.0),
  (1.7661598108614904, 1.0),
  (0.05096885572257248, 1.0),
  (0.4979872251377211, 1.0),
  (0.6128343247235879, 1.0),
  (1.508762425970339, 1.0),
  (1.9703778495884008, 0.0),
  (1.9233535434969722, 0.0),
  (0.7105353465005892, 1.0),
  (1.928054660360108, 0.5),
  (0.30721149119207564, 1.0),
  (2.3575733430775236, 0.5),
  (1.9233535434969722, 0.0),
  (0.6852157497288484, 1.0),
  (1.727526400233382, 1.0),
  (1.9265967244406879, 0.5),
  (2.1386958261016855, 1.0),
  (0.6947976426070656, 1.0),
  (1.9739261599470852, 0.0),
  (1.7032314594533036, 1.0),
  (0.6319553484637803, 1.0),
  (0.6171346415149316, 1.0),
  (1.7378163507500557, 1.0),
  (0.4654240296

In [None]:
# Using the distances computed in the Train function, determine the 5-nearest neighbors for each test instance and classify each test instance.
# Calculate the accuracy of the predictions against the true labels in Y_test.

sortedL, labels, predictions, accuracy = Test(distance, Y_test, 5)

In [None]:
predictions #Display predictions

[1.0,
 0.5,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.5,
 1.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.5,
 0.5,
 0.5,
 1.0,
 1.0,
 0.5,
 0.0,
 1.0,
 1.0,
 1.0,
 1.0,
 0.0,
 0.5,
 0.0,
 0.5,
 0.5,
 1.0,
 0.5,
 0.5,
 0.0]

In [None]:
accuracy #Display accuracy

0.9393939393939394