# Exercise 2: Classification system with KNN - To Loan or Not To Loan

## Imports

Import some useful libraries

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.model_selection import train_test_split

## a. Getting started

### Data loading

The original dataset comes from Kaggle's [Loan Prediction](https://www.kaggle.com/ninzaami/loan-predication) problem. The provided dataset has already been preprocessed: some columns and invalid rows were removed. Pandas is used to read the CSV file.

In [7]:
data = pd.read_csv("loandata.csv")

Display the head of the data.

In [8]:
data.head()

|   | Gender | Married | Education | TotalIncome | LoanAmount | CreditHistory | LoanStatus |
|---|---|---|---|---|---|---|---|
| 0 | Male | Yes | Graduate | 6091.0 | 128.0 | 1.0 | N |
| 1 | Male | Yes | Graduate | 3000.0 | 66.0 | 1.0 | Y |
| 2 | Male | Yes | Not Graduate | 4941.0 | 120.0 | 1.0 | Y |
| 3 | Male | No | Graduate | 6000.0 | 141.0 | 1.0 | Y |
| 4 | Male | Yes | Graduate | 9613.0 | 267.0 | 1.0 | Y |


Dataset columns:
* **Gender:** Applicant gender (Male/Female)
* **Married:** Is the applicant married? (Yes/No)
* **Education:** Applicant education (Graduate/Not Graduate)
* **TotalIncome:** Applicant total income (sum of the `ApplicantIncome` and `CoapplicantIncome` columns in the original dataset)
* **LoanAmount:** Loan amount in thousands
* **CreditHistory:** Credit history meets guidelines
* **LoanStatus** (Target)**:** Loan approved (Y/N)

### Data preprocessing

Define a list of categorical columns to encode.

In [9]:
categorical_columns = ["Gender", "Married", "Education", "LoanStatus"]

Encode categorical columns using the [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) of scikit-learn.

In [10]:
data[categorical_columns] = OrdinalEncoder().fit_transform(data[categorical_columns])
data.head()

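|   | Gender | Married | Education | TotalIncome | LoanAmount | CreditHistory | LoanStatus |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 0.0 | 6091.0 | 128.0 | 1.0 | 0.0 |
| 1 | 1.0 | 1.0 | 0.0 | 3000.0 | 66.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 4941.0 | 120.0 | 1.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 6000.0 | 141.0 | 1.0 | 1.0 |
| 4 | 1.0 | 1.0 | 0.0 | 9613.0 | 267.0 | 1.0 | 1.0 |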
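
To make the mapping in the encoded table easier to read, here is a minimal NumPy-only sketch of what an ordinal encoding amounts to. It assumes the encoder's default behavior of sorting the distinct values of a column and numbering them in that order; the helper `ordinal_encode` and the toy values are illustrative and not part of the notebook.

```python
import numpy as np

def ordinal_encode(values):
    """Map each distinct value to its rank in sorted order (hypothetical helper)."""
    categories, codes = np.unique(values, return_inverse=True)
    return categories, codes.astype(float)

gender = np.array(["Male", "Female", "Male"])
categories, codes = ordinal_encode(gender)
print(categories)  # sorted categories: Female first, then Male
print(codes)       # Female -> 0.0, Male -> 1.0
```

Under that assumption, sorting also explains the other columns: `No` < `Yes`, `Graduate` < `Not Graduate` and `N` < `Y`, so the first value of each pair is encoded as 0.0 and the second as 1.0.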


Split into `X` and `y`.

In [11]:
X = data.drop(columns="LoanStatus")  # input features: every column except LoanStatus
print(X.head())
y = data.LoanStatus  # target to predict: whether the loan request is approved
#print(y.head())

# X: 6 input features (Gender, Married, Education, TotalIncome, LoanAmount, CreditHistory)
# y: target output
# (X, y) still holds both the training and the test samples (to be separated)

   Gender  Married  Education  TotalIncome  LoanAmount  CreditHistory
0     1.0      1.0        0.0       6091.0       128.0            1.0
1     1.0      1.0        0.0       3000.0        66.0            1.0
2     1.0      1.0        1.0       4941.0       120.0            1.0
3     1.0      0.0        0.0       6000.0       141.0            1.0
4     1.0      1.0        0.0       9613.0       267.0            1.0


Normalize data using the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) of scikit-learn.

In [12]:
X[X.columns] = StandardScaler().fit_transform(X[X.columns])  # standardize each feature to zero mean and unit variance
print(X.head())
#plt.hist(X,bins=30)
#plt.show()

     Gender   Married  Education  TotalIncome  LoanAmount  CreditHistory
0  0.467198  0.737162  -0.503253    -0.143254   -0.208089       0.413197
1  0.467198  0.737162  -0.503253    -0.661554   -0.979001       0.413197
2  0.467198  0.737162   1.987072    -0.336086   -0.307562       0.413197
3  0.467198 -1.356553  -0.503253    -0.158512   -0.046446       0.413197
4  0.467198  0.737162  -0.503253     0.447317    1.520245       0.413197
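
As a rough illustration of the standardization step, the sketch below applies $z = (x - \mu) / \sigma$ to a single column with plain NumPy. It assumes the population standard deviation (`ddof=0`), which is my understanding of the scaler's default; the helper `standardize` is hypothetical.

```python
import numpy as np

def standardize(column):
    """Z-score a 1-D array using the population standard deviation (ddof=0)."""
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

# First five TotalIncome values from the table above
incomes = np.array([6091.0, 3000.0, 4941.0, 6000.0, 9613.0])
z = standardize(incomes)
print(z.mean(), z.std())  # approximately 0 and 1
```

The values of `z` differ from the notebook output because the notebook computes the mean and standard deviation over all rows, not only these five.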


Convert `y` to type `int`.

In [13]:
y = y.astype(int)

Split dataset into train and test sets.

In [14]:
# Split into a training set and a test set
# stratify=y ==> keeps the same LoanStatus proportions in the training set and the test set
# test_size=0.2 ==> 20% of the samples go to the test set, 80% to the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
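
To show what stratification means in practice, here is a hand-rolled sketch of a stratified split on toy labels. It is not scikit-learn's implementation; the helper `stratified_split_indices` and the 70/30 class balance are assumptions made for the example.

```python
import numpy as np

def stratified_split_indices(y, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with similar class proportions in both parts."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)   # positions of this class
        rng.shuffle(idx)
        n_test = int(round(test_size * len(idx)))
        test_idx.extend(idx[:n_test])      # same fraction taken from every class
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

y_toy = np.array([1] * 70 + [0] * 30)      # toy labels: 70% approved, 30% refused
train_idx, test_idx = stratified_split_indices(y_toy)
print(y_toy[train_idx].mean(), y_toy[test_idx].mean())  # both about 0.7
```

Without stratification, a small test set could by chance contain a noticeably different share of approved loans than the training set, which would make accuracy figures harder to compare.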

## b. Dummy classifier

Build a dummy classifier that takes decisions randomly.

In [15]:
class DummyClassifier():
    
    def __init__(self):
        """
        Initialize the class.
        """
        pass
    
    def fit(self, X, y):  # A dummy classifier is not trained; it predicts at random
        """
        Fit the dummy classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        # Nothing is learned; the data is stored only so that predict can reuse the label dtype
        self.Xtr = X
        self.ytr = y
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        nb_line = X.shape[0]
        # Draw one label per query uniformly at random from {0, 1}
        y_pred = np.random.randint(2, size=nb_line).astype(self.ytr.dtype)

        return y_pred

Implement a function to evaluate the performance of a classification by computing the accuracy ($N_{correct}/N$).

In [32]:
def accuracy_score(y_true, y_pred):
    # Fraction of predictions that match the true labels: N_correct / N
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    N = y_true.size
    N_correct = np.sum(y_true == y_pred)
    return N_correct / N
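
A small worked example of the accuracy formula, on made-up labels, may help check the function by hand:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0])

n_correct = int(np.sum(y_true == y_pred))  # positions 0 and 2 match
accuracy = n_correct / y_true.size
print(accuracy)  # 0.5
```

Two of the four predictions are correct, so $N_{correct}/N = 2/4 = 0.5$.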

Compute the performance of the dummy classifier using the provided test set.

In [34]:
dummy = DummyClassifier()
dummy.fit(X_train, y_train)
print(accuracy_score(y_test, dummy.predict(X_test)))

0.5520833333333334
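
The score above comes from a single random draw, so it changes from run to run. A classifier that guesses uniformly between two classes is correct with probability 0.5 whatever the class balance, and the sketch below checks this by simulation on toy labels (not the notebook's test set); the sample size and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=100)   # toy labels, arbitrary size

accuracies = []
for _ in range(2000):
    guesses = rng.integers(0, 2, size=y_true.size)  # uniform random guesses
    accuracies.append(np.mean(guesses == y_true))
accuracies = np.array(accuracies)

print(accuracies.mean())  # close to 0.5
print(accuracies.std())   # spread of a single run around that average
```

A useful classifier should therefore beat 0.5 by a margin clearly larger than that spread.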


## c. K-Nearest Neighbors classifier

Build a K-Nearest Neighbors classifier using a Euclidean distance computation and a simple majority voting criterion.

In [50]:
class KNNClassifier():
    
    def __init__(self, n_neighbors=3):
        """
        Initialize the class.
        
        Parameters
        ----------
        n_neighbors : int, default=3
            Number of neighbors to use by default.
        """
        self.k = n_neighbors
        pass
    
    def fit(self, X, y):  # KNN has no training step; fit only stores the training data
        """
        Fit the k-nearest neighbors classifier.
        
        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_samples, n_features)
            Training data.
        y : Numpy array or Pandas DataFrame of shape (n_samples,)
            Target values.
        """
        self.Xtr = np.array(X)
        self.ytr = np.array(y)
        
        pass
    
    @staticmethod
    def _euclidian_distance(a, b):
        """
        Utility function to compute the Euclidean distance.
        
        Parameters
        ----------
        a : Numpy array or Pandas DataFrame
            First operand.
        b : Numpy array or Pandas DataFrame
            Second operand.
        """
        # Row-wise distance: b is broadcast against every row of a
        return np.sqrt(np.sum((a - b) ** 2, axis=1))

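    def majority_vote(self, label_k_voisins):
        """
        Return the majority label among the k nearest neighbors.

        Returns 1 (accepted) or 0 (refused). A tie returns -1, which matches
        neither class, so tied queries are always counted as errors.
        """
        nb_accepted = np.sum(label_k_voisins == 1)
        nb_refused = len(label_k_voisins) - nb_accepted

        if nb_accepted > nb_refused:
            return 1    # predicted label: accepted
        elif nb_accepted < nb_refused:
            return 0    # predicted label: refused
        else:
            return -1   # tie: no decision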
    
    def predict(self, X):
        """
        Predict the class labels for the provided data.

        Parameters
        ----------
        X : Numpy array or Pandas DataFrame of shape (n_queries, n_features)
            Test samples.

        Returns
        -------
        y : Numpy array or Pandas DataFrame of shape (n_queries,)
            Class labels for each data sample.
        """
        X = np.array(X)
        nb_line = X.shape[0]
        y_pred = np.zeros(nb_line, dtype=self.ytr.dtype)  # init y_pred

        for i in range(nb_line):

            ## Find the k nearest neighbors of X[i]
            # Broadcasting: distance between every training row of Xtr and X[i]
            distances = self._euclidian_distance(self.Xtr, X[i])
            # Indices of the k training samples closest to X[i], nearest first
            k_indice_voisins = np.argsort(distances)[:self.k]

            ## Predict by majority vote
            y_pred[i] = self.majority_vote(self.ytr[k_indice_voisins])

        return y_pred
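
The loop in `predict` computes one row of distances per query. As an alternative sketch, all query-to-training distances can be computed at once with broadcasting; the function `pairwise_euclidean` and the toy points are illustrative and not part of the class above.

```python
import numpy as np

def pairwise_euclidean(X_query, X_train):
    """Distances of shape (n_queries, n_train) computed with broadcasting."""
    diff = X_query[:, None, :] - X_train[None, :, :]   # (n_queries, n_train, n_features)
    return np.sqrt(np.sum(diff ** 2, axis=2))

X_train_toy = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
X_query_toy = np.array([[0.0, 0.0], [3.0, 3.0]])

distances = pairwise_euclidean(X_query_toy, X_train_toy)
nearest = np.argsort(distances, axis=1)[:, :2]   # 2 nearest training indices per query
print(distances)
print(nearest)   # [[0 2], [1 2]]
```

This trades memory for speed: the intermediate array holds `n_queries * n_train * n_features` values, which is fine for a dataset of this size but may not be for a much larger one.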

Compute the performance of the system as a function of $k = 1...7$.

In [51]:
for i in range(1,8):
    KNN = KNNClassifier(i)
    KNN.fit(X_train, y_train)
    print("K = ", i, ", Accuracy = ", accuracy_score(y_test, KNN.predict(X_test)))


K =  1 , Accuracy =  0.6979166666666666
K =  2 , Accuracy =  0.5520833333333334
K =  3 , Accuracy =  0.7916666666666666
K =  4 , Accuracy =  0.6979166666666666
K =  5 , Accuracy =  0.8125
K =  6 , Accuracy =  0.7604166666666666
K =  7 , Accuracy =  0.8020833333333334


**k=5** gives the best accuracy on this test split (0.8125). Even values of k give worse results than odd values because the algorithm does not handle the case where there is no majority: on a tie, `majority_vote` returns -1, which never matches a true label.

1) Run the KNN algorithm using only the features `TotalIncome` and `CreditHistory`.

In [54]:
KNN = KNNClassifier(3)
KNN.fit(X_train[["TotalIncome","CreditHistory"]], y_train)
print("K = ", 3, ", Accuracy = ", accuracy_score(y_test, KNN.predict(X_test[["TotalIncome","CreditHistory"]])))

K =  3 , Accuracy =  0.78125


2) Re-run the KNN algorithm using the features `TotalIncome`, `CreditHistory` and `Married`.

In [55]:
KNN = KNNClassifier(3)
KNN.fit(X_train[["TotalIncome","CreditHistory","Married"]], y_train)
print("K = ", 3, ", Accuracy = ", accuracy_score(y_test, KNN.predict(X_test[["TotalIncome","CreditHistory", "Married"]])))

K =  3 , Accuracy =  0.8645833333333334


3) Re-run the KNN algorithm using all features.

In [56]:
KNN = KNNClassifier(3)
KNN.fit(X_train, y_train)
print("K = ", 3, ", Accuracy = ", accuracy_score(y_test, KNN.predict(X_test)))

K =  3 , Accuracy =  0.7916666666666666


We observe the "curse of dimensionality": in general, adding features increases accuracy, as seen when going from 2 features (78%) to 3 features (86%). At some point, however, adding more and more features lowers the accuracy. This is what happens when running KNN with all features, where accuracy drops to 79% (compared with 86% with 3 features). After standardization every feature weighs equally in the Euclidean distance, so weakly informative features can mask the informative ones.
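
A tiny hand-made example can illustrate how an extra, uninformative feature may change which neighbor is closest. The points and the helper `nearest_index` are invented for this sketch and do not come from the loan dataset.

```python
import numpy as np

def nearest_index(X_train, query):
    """Index of the training row closest to `query` (Euclidean distance)."""
    return int(np.argmin(np.sqrt(np.sum((X_train - query) ** 2, axis=1))))

# One informative feature: the query sits next to training sample 0
train_1d = np.array([[0.0], [1.0]])
query_1d = np.array([0.2])
print(nearest_index(train_1d, query_1d))  # 0

# Same points plus a second feature unrelated to the first
train_2d = np.array([[0.0, 3.0], [1.0, 0.0]])
query_2d = np.array([0.2, 0.0])
print(nearest_index(train_2d, query_2d))  # 1
```

With one feature the distances are 0.2 and 0.8; with the second feature they become about 3.01 and 0.8, so the nearest neighbor flips even though the informative coordinate did not change.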

4) How does your system make decisions when both classes receive an equal number of votes, for k = 2, 4, 6?

Currently, the algorithm does not make a decision, because ties between classes are not broken: `majority_vote` returns -1, so every tied sample is counted as misclassified.
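
One possible way to resolve ties, sketched below, is to drop the farthest neighbor and vote again until one class wins. This rule is only one option among several (for example, weighting votes by distance) and is not what the notebook implements; the function name is hypothetical, and it assumes integer labels in {0, 1} ordered from nearest to farthest, as `np.argsort` provides.

```python
import numpy as np

def majority_vote_with_tiebreak(labels_by_distance):
    """
    Vote among neighbor labels ordered from nearest to farthest.
    On a tie, discard the farthest neighbor and vote again.
    """
    labels = np.asarray(labels_by_distance)
    if labels.size == 0:
        raise ValueError("at least one neighbor label is required")
    while True:
        counts = np.bincount(labels, minlength=2)  # counts[0] refused, counts[1] accepted
        if counts[1] != counts[0]:
            return int(counts[1] > counts[0])
        labels = labels[:-1]   # tie: drop the farthest neighbor

print(majority_vote_with_tiebreak([1, 0]))        # 1 (falls back to the nearest neighbor)
print(majority_vote_with_tiebreak([0, 1, 0, 1]))  # 0
print(majority_vote_with_tiebreak([1, 1, 0]))     # 1 (no tie)
```

With two classes a tie can only occur for an even number of neighbors, so removing one label always produces a decision on the next pass.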