# 💲 Credit Card Approval 💲

<br>

In this notebook we predict whether a person will pay their credit card bill, and we use that prediction to decide whether a credit card should be approved for that person.

<br><br>

#### This Notebook is divided into the following sections:

<br>

- Importing 
- Cleaning
- Exploratory Data Visualization
- Training
- Hyperparameter Tuning
- Formatting answer

## 1. Importing

<br>

### 1.1 Importing Tools

In [114]:
# Importing Basic Tools
import numpy              as np
import pandas             as pd
import seaborn            as sns
import matplotlib.pyplot  as plt

In [21]:
# Importing Preprocessing Tools
from sklearn              import preprocessing

In [222]:
# Importing Models
from sklearn.svm           import SVC
from sklearn.ensemble      import RandomForestClassifier
from sklearn.linear_model  import LogisticRegression, LinearRegression
from sklearn.neighbors    import KNeighborsClassifier
from sklearn.decomposition import PCA

### 1.2 Importing Data

In [244]:
# Importing Data

test = pd.read_csv('conjunto_de_teste.csv', index_col = 'id_solicitante')
train = pd.read_csv('conjunto_de_treinamento.csv.zip', index_col = 'id_solicitante')

# Creating a placeholder target column (all zeros) so that test has the same columns as train
test['inadimplente'] =  np.zeros(len(test))

## 2. Cleaning

<br>

### 2.1 Checking NaN

In [242]:
# Checking if there is any NaN value

# Percentage of NaN values in each column
train.isna().sum() * 100 / len(train.index)

produto_solicitado                   0.000
dia_vencimento                       0.000
forma_envio_solicitacao              0.000
tipo_endereco                        0.000
sexo                                 0.000
idade                                0.000
estado_civil                         0.000
qtde_dependentes                     0.000
grau_instrucao                       0.000
nacionalidade                        0.000
estado_onde_nasceu                   0.000
estado_onde_reside                   0.000
possui_telefone_residencial          0.000
codigo_area_telefone_residencial     0.000
tipo_residencia                      2.680
meses_na_residencia                  7.250
possui_telefone_celular              0.000
possui_email                         0.000
renda_mensal_regular                 0.000
renda_extra                          0.000
possui_cartao_visa                   0.000
possui_cartao_mastercard             0.000
possui_cartao_diners                 0.000
possui_cart

In [246]:
test.isna().sum() * 100 / len(test.index)

produto_solicitado                   0.00
dia_vencimento                       0.00
forma_envio_solicitacao              0.00
tipo_endereco                        0.00
sexo                                 0.00
idade                                0.00
estado_civil                         0.00
qtde_dependentes                     0.00
grau_instrucao                       0.00
nacionalidade                        0.00
estado_onde_nasceu                   0.00
estado_onde_reside                   0.00
possui_telefone_residencial          0.00
codigo_area_telefone_residencial     0.00
tipo_residencia                      2.50
meses_na_residencia                  7.24
possui_telefone_celular              0.00
possui_email                         0.00
renda_mensal_regular                 0.00
renda_extra                          0.00
possui_cartao_visa                   0.00
possui_cartao_mastercard             0.00
possui_cartao_diners                 0.00
possui_cartao_amex                

As we can see, the columns `profissao_companheiro` and `grau_instrucao_companheiro` have more than 50% NaN values, so they will be deleted.
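
The 50% rule can also be applied programmatically instead of reading the printed table. The snippet below is a minimal sketch on a hypothetical toy DataFrame (`toy` and its values are made up for illustration, and only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    'idade': [30, 41, 25, 52],
    'tipo_residencia': [1.0, np.nan, 2.0, 1.0],
    'profissao_companheiro': [np.nan, np.nan, np.nan, 9.0],
})

# Percentage of NaN values per column (same quantity as in the cells above)
nan_pct = toy.isna().mean() * 100

# Names of the columns with more than 50% NaN values
mostly_missing = nan_pct[nan_pct > 50].index.tolist()
```

On the real data, the same two lines applied to `train` would give the candidate columns to add to `deleted_columns`.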

In [247]:
# Creating a list of Dataframes, so all changes are done on both
dfs = [test, train]

# List of columns to be deleted
deleted_columns = ['possui_telefone_celular', 
                   'grau_instrucao', 
                   'qtde_contas_bancarias_especiais',
                   'profissao_companheiro',
                   'grau_instrucao_companheiro']

# List of the columns that are numerical
numerical =        ['idade',  
                    'qtde_dependentes',
                    'meses_na_residencia', 
                    'renda_mensal_regular', 
                    'renda_extra', 
                    'qtde_contas_bancarias', 
                    'valor_patrimonio_pessoal',
                    'meses_no_trabalho']

# List of the columns that are categorical
categorical_1  =   ['possui_email', 
                    'possui_cartao_visa', 
                    'possui_cartao_mastercard', 
                    'possui_cartao_diners', 
                    'possui_cartao_amex', 
                    'possui_outros_cartoes', 
                    'possui_carro',   
                    'possui_telefone_trabalho',
                    'vinculo_formal_com_empresa']

categorical_2 =   [ 'profissao', 
                    'ocupacao',
                    'possui_telefone_residencial',
                    'produto_solicitado', 
                    'dia_vencimento', 
                    'forma_envio_solicitacao', 
                    'tipo_endereco', 
                    'sexo', 
                    'estado_civil', 
                    'nacionalidade',
                    'tipo_residencia'] 
                    
                    
# Dropping these for now
categories_multi = ['estado_onde_nasceu', 
                    'estado_onde_reside',
                    'estado_onde_trabalha', 
                    'local_onde_reside', 
                    'local_onde_trabalha',
                    'codigo_area_telefone_residencial',
                    'codigo_area_telefone_trabalho']

In [248]:
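def Cleaning_Pipeline(dfs, deleted_columns, numerical, categories_multi):
    '''
    Cleans every DataFrame in dfs in place:
    drops the unused columns and fills the NaN values.
    '''

    for df in dfs:

        # These columns were advised to be deleted
        df.drop(columns = deleted_columns + categories_multi, inplace = True)

        # A blank 'sexo' entry becomes its own category, 'N'
        df['sexo'] = df['sexo'].replace(' ', 'N')

        for col in df.columns:

            if col in numerical:

                # Numerical columns: fill NaN with the column median
                # (assigning back avoids chained in-place fillna, which newer pandas may ignore)
                df[col] = df[col].fillna(df[col].median())

            elif col != 'inadimplente':

                # Remaining (categorical) columns: fill NaN with the most frequent value
                df[col] = df[col].fillna(df[col].mode().iloc[0])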

In [249]:
# Testing Pipeline
Cleaning_Pipeline(dfs,deleted_columns, numerical, categories_multi)
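
To make the two filling rules concrete, here is a minimal sketch on a hypothetical toy DataFrame (the values are invented, and only the column names are taken from the dataset). The numerical column is filled with its median and the categorical column with its most frequent value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    'meses_na_residencia': [1.0, np.nan, 3.0, 10.0],  # numerical
    'tipo_residencia': [1.0, 1.0, np.nan, 2.0],       # categorical code
})

# Numerical column: the median of [1, 3, 10] is 3
toy['meses_na_residencia'] = toy['meses_na_residencia'].fillna(
    toy['meses_na_residencia'].median())

# Categorical column: the most frequent value of [1, 1, 2] is 1
toy['tipo_residencia'] = toy['tipo_residencia'].fillna(
    toy['tipo_residencia'].mode().iloc[0])
```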

In [250]:
# Encoding Columns

train = pd.get_dummies(data = train, columns = list(categorical_1 + categorical_2), drop_first= True)


test = pd.get_dummies(data = test, columns = list(categorical_1 + categorical_2),  drop_first= True)
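
Because `train` and `test` are encoded separately, a category that appears in only one of them produces a column that the other one lacks. The snippet below is a small illustration with hypothetical toy data (`train_toy` and `test_toy` are made up):

```python
import pandas as pd

# Hypothetical toy data: the category 'N' only appears in the training set
train_toy = pd.DataFrame({'sexo': ['F', 'M', 'N']})
test_toy = pd.DataFrame({'sexo': ['M', 'F']})

train_enc = pd.get_dummies(data=train_toy, columns=['sexo'], drop_first=True)
test_enc = pd.get_dummies(data=test_toy, columns=['sexo'], drop_first=True)

# Columns created for train but missing from test
only_in_train = set(train_enc.columns) - set(test_enc.columns)
```

Here `only_in_train` contains `sexo_N`, which is the kind of mismatch that has to be repaired before a model trained on one DataFrame can be applied to the other.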

In [251]:
# Normalizing Numerical Data

train[numerical] = train[numerical].apply(lambda x:(x-x.min()) / (x.max()-x.min()))
test[numerical] = test[numerical].apply(lambda x:(x-x.min()) / (x.max()-x.min()))
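
Note that the cell above scales `test` with its own minimum and maximum, so the same raw value can map to different scaled values in the two DataFrames. A possible alternative, shown here only as a sketch on hypothetical toy data, is to reuse the training minimum and maximum for both:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
train_toy = pd.DataFrame({'idade': [20, 40, 60], 'renda_extra': [0.0, 100.0, 50.0]})
test_toy = pd.DataFrame({'idade': [30, 70], 'renda_extra': [25.0, 100.0]})

# Statistics are taken from the training data only
col_min = train_toy.min()
col_max = train_toy.max()

train_scaled = (train_toy - col_min) / (col_max - col_min)
test_scaled = (test_toy - col_min) / (col_max - col_min)
```

With this choice, test values outside the training range fall outside [0, 1] (an `idade` of 70 becomes 1.25 here), which is expected and usually harmless.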

In [252]:
# Checking if train and test columns are the same
train_list = list(train.columns)
test_list = list(test.columns)

not_in_train = [item for item in test_list if item not in train_list]
not_in_test  = [item for item in train_list if item not in test_list]

In [253]:
# Adding to each DataFrame the columns that exist only in the other one, filled with zeros
for item in not_in_train:
    train[item] = np.zeros(len(train))
    
for item in not_in_test:
    test[item] = np.zeros(len(test))

In [254]:
# Sorting the columns' name so they are in the same order
train = train.reindex(sorted(train.columns), axis=1)
test = test.reindex(sorted(test.columns), axis=1)
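
The three cells above (finding the differing columns, adding them as zeros, and sorting) can likely be condensed with `DataFrame.align`. This is a sketch on hypothetical toy DataFrames, not a drop-in replacement that was run on the real data:

```python
import pandas as pd

# Hypothetical toy data standing in for the encoded train and test DataFrames
train_toy = pd.DataFrame({'idade': [0.1, 0.5], 'sexo_M': [1, 0], 'estado_civil_2': [0, 1]})
test_toy = pd.DataFrame({'idade': [0.3], 'sexo_M': [1], 'sexo_N': [1]})

# join='outer' keeps the union of the columns; missing ones are filled with 0
train_aligned, test_aligned = train_toy.align(
    test_toy, join='outer', axis=1, fill_value=0)
```

After the call both DataFrames have the same columns in the same order, so they can be passed to the same model.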

## 3. EDA

## 4. Training

In [255]:
# Splitting the data
X_train = train.drop(labels = 'inadimplente', axis = 1)
y_train = train.inadimplente

X_test = test.drop(labels = 'inadimplente', axis = 1)
y_test = test.inadimplente

In [261]:
models = {'Logistic Regression': LogisticRegression(max_iter = 1000),
          #'Linear Regression': LinearRegression(n_jobs = -1),
          #'Random Forest': RandomForestClassifier(),
          'SVC_Linear': SVC(max_iter= -1,
                            kernel = 'linear',
                            C = 1.0),
          'SVC_Poly': SVC(max_iter= -1,
                              kernel = 'poly'),
          'KNeighbors': KNeighborsClassifier()
          }

In [None]:
# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    '''
    Fits and evaluates given machine learning models.
    models: a dict of different Scikit-Learn machine learning models.
    X_train: training data (no labels)
    X_test: test data (no labels)
    y_train: training labels
    y_test: test labels
    '''
    
    # Make a dict to keep model scores
    model_scores = {}
    
    # Loop through models
    for name, model in models.items():
        
        # Fit the model to the data
        model.fit(X_train,y_train)
        
        # Evaluate the model and store its score in model_scores
        model_scores[name] = model.score(X_test, y_test)
        
    return model_scores

model_scores = fit_and_score(models = models,
                            X_train = X_train,
                            X_test = X_test,
                            y_train = y_train,
                            y_test = y_test)
'''
{'Logistic Regression': 0.4026,
 'Linear Regression': 0.0,
 'Random Forest': 0.4416,
 'SVC': 0.4212}
 '''


model_scores
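
Keep in mind that `test['inadimplente']` was filled with zeros in section 1.2, so `model.score(X_test, y_test)` compares the predictions with that placeholder column rather than with real labels. One way to get a score against real labels, sketched below on synthetic data, is to hold out part of the labelled training set. The names `X_toy`, `y_toy`, `X_fit`, `X_val` and the 25% split are assumptions made for this example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_train / y_train (not the real dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 25% of the labelled rows, keeping the class balance
X_fit, X_val, y_fit, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_fit, y_fit)

# Accuracy measured against labels the model has not seen
val_accuracy = model.score(X_val, y_val)
```

The same split could be passed to `fit_and_score` in place of `X_test` and `y_test`.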

As the results were not good, we will apply Principal Component Analysis (PCA) to reduce the number of features and check whether the scores improve.

In [265]:
def fit_score_PCA(models,X_train,y_train,X_test,y_test):

    # Make a dict to keep model scores
    # (created once, outside the loop, so the scores for every value of i are kept)
    model_scores = {}

    # i is the fraction of variance that PCA must retain
    for i in [0.91]:

        pca = PCA(n_components = i)
        X_train_PCA = pca.fit_transform(X_train)
        X_test_PCA = pca.transform(X_test)

        # Loop through models
        for name, model in models.items():

            # Fit the model to the data
            model.fit(X_train_PCA,y_train)

            # Evaluate the model and store its score in model_scores
            model_scores[name + '_' + str(i)] = model.score(X_test_PCA, y_test)

    return model_scores
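
When `n_components` is a float between 0 and 1, scikit-learn's `PCA` keeps the smallest number of components whose cumulative explained variance exceeds that fraction. The sketch below illustrates this on synthetic data. `X_toy` is a made-up stand-in for `X_train`, and the list of fractions is only an example of the sweep mentioned in the TODO:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_train: 10 features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

kept = {}
for fraction in [0.90, 0.95, 0.99]:
    # svd_solver='full' is set explicitly because it supports a float n_components
    pca = PCA(n_components=fraction, svd_solver='full')
    X_reduced = pca.fit_transform(X_toy)

    # Number of components kept and the variance they actually explain
    kept[fraction] = (pca.n_components_, pca.explained_variance_ratio_.sum())
```

Inspecting `kept` shows how many components each fraction requires, which helps when choosing the values to try in `fit_score_PCA`.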

In [266]:
model_scores = fit_score_PCA(models,X_train,y_train,X_test,y_test)

'''
{'Logistic Regression_0.95': 0.515,
 'SVC_Linear_0.95': 0.5278,
 'SVC_Poly_0.95': 0.5246,
 'SVC_Sigmoid_0.95': 0.4896,
 'KNeighbors_0.95': 0.5198}
 '''

# TODO: N_components between 0.9 ~ 0.99

model_scores

{'Logistic Regression_0.91': 0.5174,
 'SVC_Linear_0.91': 0.5392,
 'SVC_Poly_0.91': 0.5362,
 'SVC_Sigmoid_0.91': 0.5072,
 'KNeighbors_0.91': 0.5128}

In [264]:
model_scores = fit_score_PCA(models,X_train,y_train,X_test,y_test)

'''
{'Logistic Regression_0.95': 0.515,
 'Random Forest_0.95': 0.4996,
 'SVC_0.95': 0.5278,
 'KNeighbors_0.95': 0.5198}
'''
model_scores

{'Logistic Regression_0.93': 0.5148,
 'SVC_Linear_0.93': 0.529,
 'SVC_Poly_0.93': 0.538,
 'SVC_Sigmoid_0.93': 0.4904,
 'KNeighbors_0.93': 0.5188}