**KNN algorithm**

KNN stands for k-nearest neighbors. Given a new point, the algorithm finds the k "points" in the training data that are closest to it. In this context, a point is an observation (a row) in our data.
To visualize it, think of every feature as a dimension in space. With 3 features, the data lives in a 3D space, and the Euclidean distance tells us which data points are closest to the new point. The formula applies to any number of dimensions, so it does not matter if the dataset has more than 3 features. The algorithm works the same way; it is just harder to visualize since we live in a 3D space.
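
As a small illustration of the distance described above, the following sketch computes the Euclidean distance between two points with four features each. The function name `euclidean_distance` and the sample points are made up for this example; the standard library function `math.dist` is used only as a cross-check.

```python
from math import dist, sqrt

def euclidean_distance(a, b):
    """Square root of the sum of squared per-feature differences."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of features")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two illustrative points in a 4-dimensional feature space.
p = [0.0, 0.0, 0.0, 0.0]
q = [1.0, 2.0, 2.0, 4.0]

print(euclidean_distance(p, q))  # 5.0, since 1 + 4 + 4 + 16 = 25
print(dist(p, q))                # same value from the standard library
```

Nothing in the function depends on the number of features, which is why the same code works for 3, 4 or 20 dimensions.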
After measuring the distances, we select the k nearest points. A common starting value is k = 5, but the best value depends on your data's characteristics. For a two-class problem, an odd k guarantees that the vote cannot end in a tie, so we predict the label of the new point as the most frequent label among its k neighbors.
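
The selection and voting step can be sketched on a toy dataset. This is a minimal illustration, not the implementation used later in this notebook: the helper `knn_predict` and the 2D points are hypothetical, and it relies only on the Python standard library.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, new_point, k):
    # Pair every training point's distance to new_point with its label,
    # then sort so the closest neighbors come first.
    neighbors = sorted(
        (dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    )
    k_labels = [label for _, label in neighbors[:k]]
    # The most frequent label among the k nearest neighbors wins.
    return Counter(k_labels).most_common(1)[0][0]

# Toy data: three points near the origin (label 0), two near (1, 1) (label 1).
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 0.8)]
train_labels = [0, 0, 0, 1, 1]

print(knn_predict(train_points, train_labels, (0.15, 0.15), k=3))  # 0
print(knn_predict(train_points, train_labels, (0.95, 0.9), k=3))   # 1
```

In the second call the three nearest neighbors carry the labels 1, 1 and 0, so the odd k settles the vote in favor of 1.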

**Pseudocode**

```
1. Read dataset.
2. Dataset preprocessing:
    - Drop unnecessary columns if needed.
    - Replace text with numeric values.
    - Generate dummies if needed.
    - Normalize or scale features.
3. Dataset split (80% for training and 20% for test).
4. Get the features (X) and labels (Y).

5. Training function:
    Initialize an empty list for distances_measure.
    For test_row in test.rows:
      Initialize an empty list for distances.
      For train_row in train.rows:
        distance = sqrt(sum over features of (test.iloc[test_row, feature] - train.iloc[train_row, feature])**2)
        Append (distance, y_train[train_row]) to distances.
      Append distances to distances_measure.
    Return distances_measure.

6. Classify function:
    Initialize predictions as an empty list.
    For each distance_list in distances_measure:
      Sort distance_list.
      Initialize counters: One = 0 and Zero = 0.
      For j in range(k):
        Extract label from the sorted distance_list.
        If label is 1.0, increment One.
        Else, increment Zero.
      Append majority label (1 if One > Zero else 0) to predictions.
    Return predictions.

7. Accuracy measure:
    Initialize correct = 0.
    For each predicted label and true label in the dataset:
      If predicted_label is equal to true_label, increment correct.
    Calculate accuracy: accuracy = correct/len(predictions).

8. Show outcomes.
```



**KNN implementation**

In [None]:
#Libraries to read and modify the dataset and to get the square root
import pandas as pd
from math import sqrt

In [None]:
#Read the dataset using pandas
dt = pd.read_csv("train.csv")
dv = pd.read_csv("test.csv")

In [None]:
#Preprocessing of the dataset
df = dt.drop(columns = ["id", "Gender", "Unnamed: 0"]) #Drop features that do not influence the output

df['Customer Type'] = df['Customer Type'].replace({'Loyal Customer': 1, 'disloyal Customer': 0}) #Replace categorical values with numeric values
df['satisfaction'] = df['satisfaction'].replace({'satisfied': 1, 'neutral or dissatisfied': 0})
df['Class'] = df['Class'].replace({'Business': 1, 'Eco Plus': 0.5,'Eco': 0})

dummies = pd.get_dummies(df['Type of Travel'], prefix='Type') #Encode this feature as dummy columns, because its categories have no hierarchical relation
df = df.drop(columns=['Type of Travel'])
df = pd.concat([
    df.iloc[:, :3],
    dummies,
    df.iloc[:, 3:]
], axis=1)

for column in df.columns:
    min_val = df[column].min()
    max_val = df[column].max()

    df[column] = (df[column] - min_val) / (max_val - min_val) #Min-max normalization: every column ends up in the range 0 to 1

In [None]:
df #Training dataset

In [None]:
#The same preprocessing for the test dataset
ds = dv.drop(columns = ["id", "Gender", "Unnamed: 0"])
ds['Customer Type'] = ds['Customer Type'].replace({'Loyal Customer': 1, 'disloyal Customer': 0})
ds['satisfaction'] = ds['satisfaction'].replace({'satisfied': 1, 'neutral or dissatisfied': 0})
ds['Class'] = ds['Class'].replace({'Business': 1, 'Eco Plus': 0.5,'Eco': 0})

dummies = pd.get_dummies(ds['Type of Travel'], prefix='Type') #Build the dummies from ds, the frame being preprocessed
ds = ds.drop(columns=['Type of Travel'])
ds = pd.concat([
    ds.iloc[:, :3],
    dummies,
    ds.iloc[:, 3:]
], axis=1)

for column in ds.columns: #Iterate over the test frame's own columns
    min_val = ds[column].min()
    max_val = ds[column].max()

    ds[column] = (ds[column] - min_val) / (max_val - min_val)
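
The loops above scale the test set with its own minimum and maximum, so the same raw value can map to different scaled values in the two sets. A common alternative is to compute the statistics on the training set only and reuse them for the test set. The sketch below shows that idea on toy data; the helper names `fit_min_max` and `apply_min_max` and the toy frames are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def fit_min_max(frame):
    """Return the per-column minimum and range of a DataFrame."""
    col_min = frame.min()
    col_range = frame.max() - col_min
    # Guard against constant columns, which would otherwise divide by zero.
    col_range = col_range.replace(0, 1)
    return col_min, col_range

def apply_min_max(frame, col_min, col_range):
    """Scale a DataFrame with previously computed statistics."""
    return (frame - col_min) / col_range

# Toy frames standing in for the real training and test sets.
toy_train = pd.DataFrame({"Age": [20, 40, 60], "Flight Distance": [100, 300, 500]})
toy_test = pd.DataFrame({"Age": [30, 50], "Flight Distance": [200, 400]})

col_min, col_range = fit_min_max(toy_train)
scaled_train = apply_min_max(toy_train, col_min, col_range)
scaled_test = apply_min_max(toy_test, col_min, col_range)
print(scaled_test)
```

With this approach, test values outside the training range can fall slightly below 0 or above 1, which is usually acceptable for a distance-based method.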

In [None]:
ds #test dataset

Unnamed: 0,Customer Type,Age,Class,Type_Business travel,Type_Personal Travel,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,1.0,0.576923,0.0,1.0,0.0,0.026050,1.0,0.8,0.6,0.75,...,1.0,1.0,1.0,1.00,0.25,1.0,1.0,0.044326,0.039462,1.0
1,1.0,0.371795,1.0,1.0,0.0,0.571890,0.2,0.2,0.6,0.00,...,0.8,0.8,0.8,0.75,0.50,0.8,1.0,0.000000,0.000000,1.0
2,0.0,0.166667,0.0,1.0,0.0,0.032512,0.4,0.0,0.4,0.75,...,0.4,0.8,0.2,0.50,0.25,0.4,0.4,0.000000,0.000000,0.0
3,1.0,0.474359,1.0,1.0,0.0,0.675687,0.0,0.0,0.0,0.25,...,0.2,0.2,0.2,0.00,0.50,0.2,0.8,0.000000,0.005381,1.0
4,1.0,0.538462,0.0,1.0,0.0,0.232431,0.4,0.6,0.8,0.50,...,0.4,0.4,0.4,0.25,0.75,0.4,0.8,0.000000,0.017937,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25971,0.0,0.346154,1.0,1.0,0.0,0.099960,0.6,0.6,0.6,0.00,...,0.8,0.6,0.4,0.75,0.75,1.0,0.8,0.000000,0.000000,0.0
25972,1.0,0.205128,1.0,1.0,0.0,0.124192,0.8,0.8,0.8,0.75,...,0.8,0.8,1.0,1.00,1.00,1.0,0.8,0.000000,0.000000,1.0
25973,1.0,0.128205,0.0,0.0,1.0,0.160945,0.4,1.0,0.2,1.00,...,0.4,0.8,0.6,0.75,1.00,0.8,0.4,0.000000,0.000000,0.0
25974,1.0,0.089744,1.0,1.0,0.0,0.221325,0.6,0.6,0.6,0.50,...,0.8,0.6,0.4,1.00,0.75,1.0,0.8,0.000000,0.000000,1.0


In [None]:
X_train = df.drop('satisfaction', axis=1) #Drop the labels(Y) column to get the features(X)
Y_train = df["satisfaction"] #Get just the labels(Y)

X_train, Y_train #Show the datasets

(        Customer Type       Age  Class  Type_Business travel  \
 0                 1.0  0.076923    0.5                   0.0   
 1                 0.0  0.230769    1.0                   1.0   
 2                 1.0  0.243590    1.0                   1.0   
 3                 1.0  0.230769    1.0                   1.0   
 4                 1.0  0.692308    1.0                   1.0   
 ...               ...       ...    ...                   ...   
 103899            0.0  0.205128    0.0                   1.0   
 103900            1.0  0.538462    1.0                   1.0   
 103901            0.0  0.294872    1.0                   1.0   
 103902            0.0  0.192308    0.0                   1.0   
 103903            1.0  0.256410    1.0                   1.0   
 
         Type_Personal Travel  Flight Distance  Inflight wifi service  \
 0                        1.0         0.086632                    0.6   
 1                        0.0         0.041195                    0.6   

In [None]:
X_test = ds.drop('satisfaction', axis=1) #Drop the labels(Y) column to get the features(X)
Y_test = ds["satisfaction"] #Get just the labels(Y)


X_test, Y_test

(       Customer Type       Age  Class  Type_Business travel  \
 0                1.0  0.576923    0.0                   1.0   
 1                1.0  0.371795    1.0                   1.0   
 2                0.0  0.166667    0.0                   1.0   
 3                1.0  0.474359    1.0                   1.0   
 4                1.0  0.538462    0.0                   1.0   
 ...              ...       ...    ...                   ...   
 25971            0.0  0.346154    1.0                   1.0   
 25972            1.0  0.205128    1.0                   1.0   
 25973            1.0  0.128205    0.0                   0.0   
 25974            1.0  0.089744    1.0                   1.0   
 25975            1.0  0.448718    0.0                   0.0   
 
        Type_Personal Travel  Flight Distance  Inflight wifi service  \
 0                       0.0         0.026050                    1.0   
 1                       0.0         0.571890                    0.2   
 2            

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# 1. Read the CSV dataset
file_path = 'winequality-red.csv'  # Replace this with the actual path to your file
df = pd.read_csv(file_path)

# Convert the values of the 'quality' column into binary labels
def convert_quality(value):
    if 1 <= value <= 5:
        return 0
    elif 6 <= value <= 10:
        return 1

df['quality'] = df['quality'].apply(convert_quality)

# 2. Split the dataset into training and test sets
X = df.drop('quality', axis=1)  # Features (drop the 'quality' column to keep only the features)
Y = df['quality']  # Labels

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

for column in X_train.columns:
    min_val = X_train[column].min()
    max_val = X_train[column].max()

    X_train[column] = (X_train[column] - min_val) / (max_val - min_val)

for column in X_test.columns:
    min_val = X_test[column].min()
    max_val = X_test[column].max()

    X_test[column] = (X_test[column] - min_val) / (max_val - min_val)

In [None]:
#Training function: computes the Euclidean distance, i.e. the square root of the summation of the squared differences
#sqrt(sum((x1 - x2)**2))
def training(train, test, Y_train):
  distances = [] #Define an empty list to collect the final distances

  for i in range(len(test)):
    distances_i = []
    for j in range(len(train)):
      squares = 0 #Variable where the summation will be accumulated

      for k in range(len(train.columns)): #Three nested for loops compare every feature of every training point with every test point
        #Euclidean distance
        points = (test.iloc[i,k] - train.iloc[j,k])**2
        squares = squares + points
      distance = sqrt(squares)
      distance = round(distance, 3)

      #Save the distances in lists inside a list
      distances_i.append((distance, Y_train.iloc[j])) #Append a tuple of the distance and the label

    distances.append(distances_i) #Save the sublist in the distances list

  return distances
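
The triple loop above is easy to follow but slow on large datasets, because every cell is read individually through `.iloc`. One possible alternative, shown here only as a sketch, computes all distances at once with NumPy broadcasting. The function name `pairwise_distances` and the toy points are assumptions for this example; unlike `training`, it returns a plain distance matrix without rounding and without attaching labels.

```python
import numpy as np

def pairwise_distances(test, train):
    """Euclidean distance from every test row to every train row."""
    test = np.asarray(test, dtype=float)
    train = np.asarray(train, dtype=float)
    # Broadcasting: (n_test, 1, n_features) - (1, n_train, n_features)
    # gives one difference vector per (test row, train row) pair.
    diff = test[:, None, :] - train[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy points; row i, column j holds the distance between test i and train j.
toy_train = [[0.0, 0.0], [3.0, 4.0]]
toy_test = [[0.0, 0.0], [6.0, 8.0]]

print(pairwise_distances(toy_test, toy_train))
# [[ 0.  5.]
#  [10.  5.]]
```

The intermediate array has n_test × n_train × n_features entries, so for a dataset as large as the one used here the test rows would need to be processed in batches.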


In [None]:
def classify(distances, k):
  sorted_list = [] #Define empty lists for the sorted sublists, the sorted labels and the predictions of the algorithm
  labels = []
  prediction = []
  for i in range(len(distances)):
    sublist = distances[i] #Get the i-th sublist of the outer list and sort it by distance
    sublist.sort()
    labels_0 = [] #Define an empty list to get the sorted labels
    One = 0 #Define variables to count the classes
    Zero = 0
    for j in range(k):  # Loop through the k-nearest neighbors
      element = sublist[j]
      labels_0.append(element[1]) #Get the label from the tuple stored in sublist[j]

      if (element[1] == 1.0): #Count the labels: add 1 to One or Zero when the label is 1 or 0 respectively
        One = One + 1
      elif (element[1] == 0.0):
        Zero = Zero + 1

    if One > Zero:  #Compare which label is the most common and make the prediction
      prediction.append(1)
    else:  #A tie (possible with an even k) also falls back to 0, so every test row gets exactly one prediction
      prediction.append(0) #Save the prediction in the list

    labels.append(labels_0) #Save the sublist in the outer list
    sorted_list.append(sublist)

  return sorted_list, labels, prediction

In [None]:
def accuracy(prediction, Y_test):
  correct = 0 #Define a variable to store the count of correct predictions
  for i in range(len(Y_test)):
    if prediction[i] == Y_test.iloc[i]:  #Compare by position (.iloc), because the index of Y_test is shuffled after the split
      correct += 1

  accuracy = (correct/len(Y_test))*100 #Divide the correct predictions by the number of labels and express it as a percentage

  return accuracy

In [None]:
#Placeholder for a smaller dataset (the original one is really large, with 100,000 observations); as written, the full sets are used unchanged
small_train = X_train
small_Y = Y_train

small_test = X_test
small_y_test = Y_test

In [None]:
result = training(small_train, small_test, small_Y) #Train using the new smaller datasets

In [None]:
result #Show the list of sublists of tuples

In [None]:
sortedlist, labels, predictions = classify(result, 3) #Classify with the 3 nearest neighbors

In [None]:
sortedlist

In [None]:
labels

In [None]:
predictions

In [None]:
def accuracy(prediction, Y_test):
    # Convert prediction to a DataFrame and reset the index of both
    prediction_df = pd.DataFrame(prediction).reset_index(drop=True)
    Y_test = Y_test.reset_index(drop=True)

    correct = 0
    for i in range(len(Y_test)):
        if prediction_df.iloc[i, 0] == Y_test.iloc[i]:
            correct += 1

    accuracy = (correct/len(Y_test))*100

    return accuracy


In [None]:
Y_test

In [None]:
accurate = accuracy(predictions, small_y_test) #Calculate the accuracy

In [None]:
accurate

70.9375

In the run shown above, the KNN algorithm predicted 70.94% of the test labels correctly.

**Loss and Optimization functions**

The kNN algorithm differs from many machine learning models: instead of fitting parameters, it memorizes the training data. Predictions are made by simply examining the k nearest data points from the training set. Since there is no model training phase with adjustable parameters, there is no need for loss or optimization functions. The core of kNN lies in choosing an effective distance metric and the number of neighbors, 'k'. Neither is learned through optimization; both are typically selected with techniques like cross-validation.
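
To make the idea of selecting k by cross-validation concrete, here is a minimal sketch that uses leave-one-out cross-validation on a toy dataset. Everything in it (the helper names `predict` and `leave_one_out_accuracy`, the points, the candidate values of k) is an assumption for illustration; it uses only the standard library and is not the procedure followed in this notebook.

```python
from collections import Counter
from math import dist

def predict(train_points, train_labels, new_point, k):
    # Sort the training points by distance and vote among the k closest.
    neighbors = sorted(
        (dist(p, new_point), label) for p, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(points, labels, k):
    # Hold out each point in turn and predict it from all the others.
    correct = 0
    for i, (point, label) in enumerate(zip(points, labels)):
        rest_points = points[:i] + points[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if predict(rest_points, rest_labels, point, k) == label:
            correct += 1
    return correct / len(points)

# Two clean clusters plus one noisy point labeled 1 inside the label-0 cluster.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]

scores = {k: leave_one_out_accuracy(points, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)  # first k with the highest accuracy
print(scores)
print(best_k)  # 3 for this toy data
```

With k = 1 the noisy point misleads all four of its neighbors, while k = 3 and k = 5 outvote it, which is the kind of trade-off cross-validation is meant to expose.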





In [None]:
data = pd.read_csv("Credit Score Classification Dataset.csv")

In [None]:
# Replace education levels and credit score with numerical values, home ownership status to binary values (0 for rented, 1 for owned).

data['Education'] = data['Education'].replace({"High School Diploma": 1, "Associate's Degree": 2, "Bachelor's Degree": 3, "Master's Degree": 4, 'Doctorate': 5})
data['Home Ownership'] = data['Home Ownership'].replace({'Rented': 0, 'Owned': 1})
data['Credit Score'] = data['Credit Score'].replace({'Low': 0, 'Average': 1, 'High':2})

# Convert the gender and marital status columns to dummy variables, drop the original columns and add them to the dataframe
dummies = pd.get_dummies(data['Gender'], prefix='Gender')
data = data.drop(columns=['Gender'])

dummies2 = pd.get_dummies(data['Marital Status'], prefix='Marital Status')
data = data.drop(columns=['Marital Status'])

data = pd.concat([data.iloc[:, :2], dummies, data.iloc[:, 2:]], axis=1)
data = pd.concat([data.iloc[:, :4], dummies2, data.iloc[:, 4:]], axis=1)

# Normalize all columns in the dataframe to range between 0 and 1
for column in data.columns:
    minimum = data[column].min()
    maximum = data[column].max()

    data[column] = (data[column] - minimum) / (maximum - minimum)

data

Unnamed: 0,Age,Income,Gender_Female,Gender_Male,Marital Status_Married,Marital Status_Single,Education,Number of Children,Home Ownership,Credit Score
0,0.000000,0.181818,1.0,0.0,0.0,1.0,0.50,0.000000,0.0,1.0
1,0.178571,0.545455,0.0,1.0,1.0,0.0,0.75,0.666667,1.0,1.0
2,0.357143,0.363636,1.0,0.0,1.0,0.0,1.00,0.333333,1.0,1.0
3,0.535714,0.727273,0.0,1.0,0.0,1.0,0.00,0.000000,1.0,1.0
4,0.714286,0.545455,1.0,0.0,1.0,0.0,0.50,1.000000,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...
159,0.142857,0.018182,1.0,0.0,0.0,1.0,0.00,0.000000,0.0,0.0
160,0.321429,0.163636,0.0,1.0,0.0,1.0,0.25,0.000000,0.0,0.5
161,0.500000,0.272727,1.0,0.0,1.0,0.0,0.50,0.666667,1.0,1.0
162,0.678571,0.454545,0.0,1.0,0.0,1.0,0.75,0.000000,1.0,1.0


In [None]:
# Split the data into 80% training and 20% testing.
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)

X_train = train.drop('Credit Score', axis=1)  # Drop the 'Credit Score' column to get the input features for the training set.
Y_train = train["Credit Score"] # Get the 'Credit Score' column to get the target labels for the training set.

X_train, Y_train # Display the input features and target labels for the training set.

In [None]:
X_test = test.drop('Credit Score', axis=1) # Drop the 'Credit Score' column to get the input features for the test set.
Y_test = test["Credit Score"] # Get the 'Credit Score' column to get the target labels for the test set.


X_test, Y_test  # Display the input features and target labels for the test set.

In [None]:
def classify(distances, k):
  sorted_list = [] #Define empty lists for the sorted sublists, the sorted labels and the predictions of the algorithm
  labels = []
  prediction = []
  for i in range(len(distances)):
    sublist = distances[i] #Get the i-th sublist of the outer list and sort it by distance
    sublist.sort()
    labels_0 = [] #Define an empty list to get the sorted labels
    One = 0 #Define variables to count the three classes (1, 0 and 0.5 after normalization)
    Zero = 0
    Half = 0
    for j in range(k):  # Loop through the k-nearest neighbors
      element = sublist[j]
      labels_0.append(element[1]) #Get the label from the tuple stored in sublist[j]

      if (element[1] == 1.0): #Count the labels: add 1 to One, Zero or Half when the label is 1, 0 or 0.5 respectively
        One = One + 1
      elif (element[1] == 0.0):
        Zero = Zero + 1
      elif (element[1] == 0.5):
        Half = Half + 1

    if One >= Zero and One >= Half: #Pick the most common label; ties are resolved in the order 1, 0.5, 0
      prediction.append(1)
    elif Half >= One and Half >= Zero:
      prediction.append(0.5)
    elif Zero >= One and Zero >= Half:
      prediction.append(0)

    labels.append(labels_0) #Save the sublist in the outer list
    sorted_list.append(sublist)

  return sorted_list, labels, prediction
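
The function above hard-codes one counter per class. As a hedged alternative, the vote can be written for any set of labels with `collections.Counter`. The helper `majority_label` below is hypothetical and works on a single list of `(distance, label)` tuples like the sublists built by `training`. Note that it breaks ties differently from `classify`: instead of preferring label 1, it picks the tied class whose nearest member is closest.

```python
from collections import Counter

def majority_label(distance_label_pairs, k):
    """Vote among the k closest (distance, label) pairs.

    Ties between classes go to the class whose nearest member is closest.
    """
    nearest = sorted(distance_label_pairs, key=lambda pair: pair[0])[:k]
    counts = Counter(label for _, label in nearest)
    top = max(counts.values())
    # Walk the neighbors nearest-first and return the first top-voted label.
    for _, label in nearest:
        if counts[label] == top:
            return label

# Toy neighbor list with the three normalized classes 0.0, 0.5 and 1.0.
pairs = [(0.30, 1.0), (0.10, 0.5), (0.20, 0.0), (0.15, 0.5), (0.40, 1.0)]

print(majority_label(pairs, k=3))  # 0.5 (two of the three nearest)
print(majority_label(pairs, k=5))  # 0.5 (tied with 1.0, but its member is nearest)
```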

In [None]:
train = training(X_train, X_test, Y_train)

In [None]:
sl, neighborlabels, predictedlabels = classify(train,5) #The second value holds the labels of the 5 nearest neighbors of each test point

In [None]:
predictedlabels

[1,
 1,
 1,
 1,
 1,
 1,
 0.5,
 1,
 1,
 1,
 0.5,
 1,
 1,
 0.5,
 0.5,
 0.5,
 0.5,
 1,
 1,
 0.5,
 1,
 0.5,
 1,
 1,
 1,
 1,
 0,
 0.5,
 1,
 1,
 0.5,
 1,
 1]

In [None]:
def accuracy(prediction, Y_test):
  correct = 0 #Define a variable to store the count of correct predictions
  for i in range(len(Y_test)):
    if prediction[i] == Y_test.iloc[i]:  #If the labels match, add 1 to the counter
      correct += 1

  accuracy = (correct/len(Y_test))*100 #Divide the correct predictions by the numbers of labels

  return accuracy

In [None]:
acc = accuracy(predictedlabels, Y_test)

In [None]:
acc

100.0