## Introduction

This project is a supervised classification problem. The goal is to predict whether a client will have problems with the credit they have taken out with a third party; the outcome is modeled as a binary variable `Y` (0 or 1). Each row of the dataset describes one customer through categorical, continuous, and date attributes. The dataset is highly imbalanced, as we will see.

We tried several classification algorithms, along with methods to limit overfitting. The data processing steps are designed to prepare the features for these algorithms.

In [1]:
import pandas as pd
path = '../data/'
data = pd.read_csv(path+"CreditTraining.csv")
data.head()

Unnamed: 0,Id_Customer,Y,Customer_Type,BirthDate,Customer_Open_Date,P_Client,Educational_Level,Marital_Status,Number_Of_Dependant,Years_At_Residence,Years_At_Business,Prod_Sub_Category,Prod_Decision_Date,Source,Type_Of_Residence,Nb_Of_Products,Prod_Closed_Date,Prod_Category,Net_Annual_Income
0,7440,0,Non Existing Client,07/08/1977,13/02/2012,NP_Client,University,Married,3.0,1,1.0,C,14/02/2012,Sales,Owned,1,0,B,36.0
1,573,0,Existing Client,13/06/1974,04/02/2009,P_Client,University,Married,0.0,12,2.0,C,30/06/2011,Sales,Parents,1,0,G,18.0
2,9194,0,Non Existing Client,07/11/1973,03/04/2012,NP_Client,University,Married,2.0,10,1.0,C,04/04/2012,Sales,Owned,1,0,B,36.0
3,3016,1,Existing Client,08/07/1982,25/08/2011,NP_Client,University,Married,3.0,3,1.0,C,07/09/2011,Sales,New rent,1,31/12/2012,L,36.0
4,6524,0,Non Existing Client,18/08/1953,10/01/2012,NP_Client,University,Married,2.0,1,1.0,C,11/01/2012,Sales,Owned,1,0,D,36.0


In [2]:
data.describe()

Unnamed: 0,Id_Customer,Y,Number_Of_Dependant,Years_At_Residence,Years_At_Business,Nb_Of_Products,Net_Annual_Income
count,5380.0,5380.0,5380.0,5380.0,5380.0,5380.0,5380.0
mean,4784.535688,0.073048,1.058178,12.626022,4.264684,1.089033,2635.775279
std,2781.436262,0.26024,1.338908,9.972164,7.225051,0.297587,16887.691237
min,1.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,2368.5,0.0,0.0,4.0,1.0,1.0,21.0
50%,4762.5,0.0,0.0,10.0,1.0,1.0,36.0
75%,7180.25,0.0,2.0,18.0,4.0,1.0,50.0
max,9605.0,1.0,20.0,70.0,98.0,3.0,387906.0


As the summary above shows, the label we want to predict is highly imbalanced: the mean of `Y` is about 0.073, so only around 7% of the 5,380 customers belong to class 1. Some approaches may help, such as oversampling the minority class (1) or undersampling the majority class (0). The aim is to avoid overfitting to the majority class and to better estimate each model's parameters.
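
As a rough illustration of the oversampling idea, the sketch below duplicates minority-class rows at random until both classes have the same size. It runs on a small synthetic frame rather than the credit data, and the helper name `oversample_minority` is hypothetical, not part of the notebook. In practice the resampling would normally be applied to the training split only.

```python
import pandas as pd

def oversample_minority(df, label="Y", random_state=0):
    """Randomly duplicate rows of the smaller classes until all classes have equal size."""
    counts = df[label].value_counts()
    target = counts.max()
    parts = []
    for cls, n in counts.items():
        rows = df[df[label] == cls]
        if n < target:
            # Sample with replacement to fill the gap up to the majority size
            extra = rows.sample(n=target - n, replace=True, random_state=random_state)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    # Shuffle so the duplicated rows are not grouped together
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data with an imbalance similar to the credit label (assumption: 7 positives in 100)
toy = pd.DataFrame({"x": range(100), "Y": [1] * 7 + [0] * 93})
balanced = oversample_minority(toy)
print(balanced["Y"].value_counts())
```

Undersampling follows the same pattern in reverse: sample the majority class down to the minority size, at the cost of discarding data.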

We will break the data down as far as we can to extract as much information as possible. We may also add features that could help our models, such as the client's age when the credit contract was decided and the gap in years between the customer open date and the product decision date.
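
The two derived features can be sketched directly with pandas, as a minimal example on two rows copied from the `data.head()` preview. It assumes the dates are in day/month/year format, and the column name `Age_At_Decision` is chosen here for illustration only.

```python
import pandas as pd

# Two rows copied from the data.head() preview; dates are day/month/year
sample = pd.DataFrame({
    "BirthDate": ["07/08/1977", "13/06/1974"],
    "Customer_Open_Date": ["13/02/2012", "04/02/2009"],
    "Prod_Decision_Date": ["14/02/2012", "30/06/2011"],
})

date_cols = ["BirthDate", "Customer_Open_Date", "Prod_Decision_Date"]
for col in date_cols:
    sample[col] = pd.to_datetime(sample[col], format="%d/%m/%Y")

# Differences of calendar years, not exact elapsed time
sample["Age_At_Decision"] = sample["Prod_Decision_Date"].dt.year - sample["BirthDate"].dt.year
sample["Gap"] = sample["Prod_Decision_Date"].dt.year - sample["Customer_Open_Date"].dt.year
print(sample[["Age_At_Decision", "Gap"]])
```

Subtracting calendar years keeps the features as small integers; an exact age would need the full date difference instead.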

We also ran several clustering algorithms and added each cluster label as an extra feature, and we one-hot encoded the categorical attributes.
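
To make the idea concrete, here is a small sketch of one-hot encoding followed by clustering, with the cluster id appended as a feature. The toy values are made up, and the `dtype=int` and `n_init=10` arguments are choices made for this example rather than settings taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy frame (made-up values) with one categorical and two numerical attributes
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Single", "Married", "Single", "Married", "Single"],
    "Years_At_Residence": [1, 12, 10, 3, 1, 20],
    "Net_Annual_Income": [36.0, 18.0, 36.0, 36.0, 50.0, 21.0],
})

# One-hot encode the categorical column; dtype=int keeps the indicators numeric
encoded = pd.get_dummies(toy, columns=["Marital_Status"], dtype=int)

# Standardize so that no single feature dominates the distance, then cluster
scaled = StandardScaler().fit_transform(encoded)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# The cluster id becomes one more column of the feature matrix
encoded["KMeans 2"] = kmeans.labels_
print(encoded)
```

The label is excluded before clustering in the notebook as well, so the cluster features do not see `Y` directly.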

### Data Preparation

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
import featuretools as ft

def hot_encoding(): # categorical attributes only
    data = pd.read_csv("CreditTraining.csv")
    cat = ["Customer_Type", "P_Client", "Educational_Level", "Marital_Status",
           "Prod_Sub_Category", "Source", "Type_Of_Residence", "Prod_Category"]

    # Replace each categorical column with its indicator columns (named <column>_<value>)
    for col in cat:
        one_hot = pd.get_dummies(data[[col]])
        data = data.drop(col, axis=1)
        data = data.join(one_hot)

    data.to_csv("CreditTraining_Hot_Encoding.csv", encoding='utf-8', index=False)

def DeepFeatureSynthesis(): # numerical and date attributes only
    data = pd.read_csv("CreditTraining_Hot_Encoding.csv")

    customers_df = data[["Id_Customer", "BirthDate", "Customer_Open_Date",
                         "Number_Of_Dependant", "Years_At_Residence",
                         "Net_Annual_Income", "Years_At_Business",
                         "Prod_Decision_Date", "Nb_Of_Products"]]

    # Note: `entities` and `target_entity` are the argument names used by featuretools
    # releases before 1.0; later releases call them `dataframes` and `target_dataframe_name`.
    entity = {"customers": (customers_df, "Id_Customer")}
    feature_matrix_customers, _ = ft.dfs(entities=entity, target_entity="customers")

    # index=False drops the Id_Customer index, so merge() has to rely on row order
    feature_matrix_customers.to_csv("dfs.csv", encoding='utf-8', index=False)

def merge(): # Merge the DFS matrix (numerical and date attributes) with the one-hot encoded matrix
    dfs = pd.read_csv("dfs.csv")
    data = pd.read_csv("CreditTraining_Hot_Encoding.csv")
    numerical = ["Id_Customer", "BirthDate", "Customer_Open_Date",
                 "Number_Of_Dependant", "Years_At_Residence",
                 "Net_Annual_Income", "Years_At_Business",
                 "Prod_Decision_Date", "Nb_Of_Products"]

    # Drop the raw columns that were passed to DFS, plus Prod_Closed_Date
    data = data.drop(columns=numerical + ["Prod_Closed_Date"])
    # join() aligns on the positional index, so both files must be in the same row order
    data = data.join(dfs)
    data.to_csv("df.csv", encoding='utf-8', index=False)

def feature_creation(cluster = True):
    data = pd.read_csv("df.csv")
    data["Age(Prod_Decision_Date)"] = data["YEAR(Prod_Decision_Date)"] - data["YEAR(BirthDate)"]
    data["Gap"] = data["YEAR(Prod_Decision_Date)"] - data["YEAR(Customer_Open_Date)"]
    
    data.to_csv("df.csv", encoding = 'utf-8',  index = False)
              
    if cluster:
        data_cluster = pd.read_csv("df.csv")
        label = data_cluster["Y"]
        data_cluster = data_cluster.drop(['Y'], axis = 1)
        
        scaled_features = StandardScaler().fit_transform(data_cluster.values)
        scaled_features_df = pd.DataFrame(scaled_features, index = data_cluster.index, columns = data_cluster.columns)
        
        #K Means
        #n_clusters = 2
        kmeans = KMeans(n_clusters=2, random_state=0).fit(scaled_features_df)
        group_kmeans = kmeans.labels_
        data_cluster["KMeans 2"] = group_kmeans
        print(group_kmeans)
        #n_clusters = 3
        kmeans = KMeans(n_clusters=3, random_state=0).fit(scaled_features_df)
        group_kmeans = kmeans.labels_
        data_cluster["KMeans 3"] = group_kmeans
        print(group_kmeans)
        
        #Agglomerative Clustering
        #Cosine (note: scikit-learn 1.2 renamed `affinity` to `metric`; use `metric=` on newer versions)
        agg_clustering = AgglomerativeClustering(n_clusters=2, affinity='cosine', linkage='average').fit(scaled_features_df)
        group_agg = agg_clustering.labels_
        data_cluster["Agglomerative Clustering Cosine"] = group_agg
        print(group_agg)
        #Euclidean
        agg_clustering = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward').fit(scaled_features_df)
        group_agg = agg_clustering.labels_
        data_cluster["Agglomerative Clustering Euclidean"] = group_agg
        print(group_agg)
        
        #DBSCAN
        #eps = 3
        dbscan = DBSCAN(eps = 3, min_samples = 2).fit(scaled_features_df)
        # Shift labels by 1 so that noise points (labelled -1 by DBSCAN) become group 0
        group_dbscan = dbscan.labels_ + 1
        data_cluster["DBSCAN eps=3"] = group_dbscan
        print(group_dbscan)
        
        data_cluster = data_cluster.join(label)
        data_cluster.to_csv("df_final.csv", encoding = 'utf-8',  index = False)        

hot_encoding()
DeepFeatureSynthesis()
merge()
feature_creation(cluster = True)
