# Credit Score Classification: Basic Example Notebook

This is an example notebook for the FML Kaggle Challenge. It walks through the basic steps to preprocess the data, train a model, and generate predictions.
### Caution: Some of the steps here are for illustration only; you will still need to change and add steps to perform well in the challenge.

In [212]:
import numpy as np
import pandas as pd

## Reading the Dataset

In [213]:
train_data = pd.read_csv('train_set.csv', index_col=0)
test_data = pd.read_csv('test_set.csv', index_col=0)

In [214]:
test_data

ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,...,Num_Credit_Inquiries,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance
68757,25026,4,Aradhana Aravindanw,23.0,8170608.0,Lawyer,88752.260,7109.021667,5.0,6.0,...,8.0,Good,1180.26,36.713885,221.0,No,72.524443,85.233292,High_spent_Medium_value_payments,290.805548
113371,33960,2,Kanekoe,43.0,819245756.0,Mechanic,87745.920,7166.160000,0.0,1.0,...,1.0,Good,207.00,36.121435,245.0,NM,0.000000,115.395977,Low_spent_Small_value_payments,639.825556
154933,34269,8,Lucianam,45.0,767367303.0,Lawyer,8974.555,783.879583,10.0,8.0,...,7.0,Bad,1660.14,33.883240,202.0,Yes,30.443262,0.000000,Low_spent_Small_value_payments,280.503360
77449,24946,8,Lefteris Papadimasd,32.0,794097808.0,Manager,17091.960,1182.330000,10.0,8.0,...,11.0,Bad,4047.31,40.077930,101.0,Yes,54.857946,47.483510,Low_spent_Small_value_payments,282.354884
60732,9532,7,Tom Halss,34.0,399035425.0,Developer,49128.900,4231.075000,8.0,3.0,...,7.0,Standard,2574.10,29.415814,172.0,Yes,169.529221,27.492075,High_spent_Large_value_payments,399.169994
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129058,40969,5,Rick Rothackert,27.0,534994357.0,Architect,39089.090,3089.424167,2.0,4.0,...,3.0,Good,474.96,38.711520,228.0,No,53.515554,56.359550,Low_spent_Small_value_payments,232.453130
108661,16367,8,Jenniferq,53.0,634716210.0,Lawyer,54426.180,4546.515000,4.0,6.0,...,6.0,Standard,865.04,29.607863,366.0,No,137.767461,65.206009,High_spent_Medium_value_payments,445.876920
120097,39091,8,Philipp Halstrickm,27.0,587143010.0,Mechanic,58508.760,4722.730000,6.0,5.0,...,7.0,Standard,1048.25,26.961976,179.0,Yes,164.199022,91.230386,High_spent_Medium_value_payments,383.913135
109412,18783,3,Jeffsp,26.0,602146009.0,Lawyer,20012.870,1830.739167,7.0,8.0,...,11.0,Bad,3500.17,33.280902,9.0,Yes,59.458388,51.512374,High_spent_Medium_value_payments,319.545308


## Selecting Features and Encoding

In this example, we use only the numerical columns. Some of the categorical columns may also be informative, but they would need to be encoded as numbers before the model can use them.

In [215]:

categorical_columns = [
    'Credit_Mix',
    'Month',
    'Occupation',
    'Payment_Behaviour',
    'Payment_of_Min_Amount',
    'Type_of_Loan',
    'Customer_ID',
    'Name',
    'SSN',
]

numerical_columns = [
    'Age',
    'Num_Bank_Accounts',
    'Num_Credit_Card',
    'Interest_Rate',
    'Num_of_Loan',
    'Delay_from_due_date',
    'Num_of_Delayed_Payment',
    'Num_Credit_Inquiries',
    'Credit_History_Age',
    'Credit_Utilization_Ratio',
    'Annual_Income',
    'Monthly_Inhand_Salary',
    'Changed_Credit_Limit',
    'Outstanding_Debt',
    'Total_EMI_per_month',
    'Amount_invested_monthly',
    'Monthly_Balance',
]

# Target column to predict
label_to_predict = 'Credit_Score'

In [216]:
X_train = train_data[numerical_columns]
X_test  = test_data[numerical_columns]
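
As a follow-up to the remark about categorical columns, here is a minimal sketch of one common way to include them: one-hot encoding with scikit-learn's `OneHotEncoder` inside a `ColumnTransformer`. The toy DataFrames, their values, and the names `toy_train`, `toy_test`, `toy_numerical`, and `toy_categorical` are assumptions made for illustration; they are not part of the challenge data. Other encodings may suit some columns better.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-ins for train_data / test_data (values are made up for illustration).
toy_train = pd.DataFrame({
    'Age': [23.0, 43.0, 45.0, 32.0],
    'Annual_Income': [88752.26, 87745.92, 8974.555, 17091.96],
    'Credit_Mix': ['Good', 'Good', 'Bad', 'Standard'],
    'Payment_of_Min_Amount': ['No', 'NM', 'Yes', 'Yes'],
})
toy_test = pd.DataFrame({
    'Age': [34.0, 27.0],
    'Annual_Income': [49128.9, 39089.09],
    'Credit_Mix': ['Standard', 'Good'],
    'Payment_of_Min_Amount': ['Yes', 'No'],
})

toy_numerical = ['Age', 'Annual_Income']
toy_categorical = ['Credit_Mix', 'Payment_of_Min_Amount']

# One-hot encode the categorical columns; pass the numerical columns through unchanged.
# handle_unknown='ignore' encodes categories not seen during fit as all zeros.
preprocessor = ColumnTransformer(
    transformers=[('cat', OneHotEncoder(handle_unknown='ignore'), toy_categorical)],
    remainder='passthrough',
)

# Fit on the training rows only, then reuse the fitted encoder on the test rows.
X_train_encoded = preprocessor.fit_transform(toy_train[toy_categorical + toy_numerical])
X_test_encoded = preprocessor.transform(toy_test[toy_categorical + toy_numerical])

print(X_train_encoded.shape, X_test_encoded.shape)
```

Fitting the encoder on the training rows and only transforming the test rows keeps both matrices aligned on the same columns. Identifier-like columns such as `Customer_ID`, `Name`, and `SSN` would produce one column per distinct value under this scheme, so they are probably poor candidates for one-hot encoding.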

## Begin Training

We start by training a simple `DecisionTreeClassifier` as a baseline. To outperform it, you will need to think about which model to use and which hyperparameters to set.

In [217]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

In [218]:
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data[label_to_predict]).astype(int)

In [219]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [220]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
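
One possible way to approach the hyperparameter question raised above is a cross-validated grid search. The sketch below uses `GridSearchCV` on synthetic data from `make_classification`, so it runs without the challenge files. The parameter grid and the names `X_demo`, `y_demo`, and `search` are illustrative assumptions, not tuned or recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for X_train / y_train (not the challenge data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Hypothetical grid; the ranges are illustrative, not tuned for the challenge.
param_grid = {'max_depth': [3, 5, 10, None], 'min_samples_leaf': [1, 5, 20]}

# 5-fold cross-validation scores every combination in the grid by accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern works with other estimators (for example a random forest) by swapping the estimator and the grid.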

## Evaluation

In [221]:
from sklearn.metrics import accuracy_score

In [222]:
y_pred = model.predict(X_val)
# accuracy_score expects the arguments in the order (y_true, y_pred)
accuracy_score(y_val, y_pred)

0.721
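
A single accuracy number can hide which classes the model confuses. As a hedged sketch, the snippet below shows how `confusion_matrix` and `classification_report` give a per-class view. The label arrays `y_true_demo` and `y_pred_demo` are made up for illustration and stand in for `y_val` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up encoded labels standing in for y_val and y_pred (3 classes: 0, 1, 2).
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2, 1])

# Overall fraction of correct predictions.
print(accuracy_score(y_true_demo, y_pred_demo))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(y_true_demo, y_pred_demo))
```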

## Generating the Submission

The function below predicts labels for the test set and writes them to a submission file. Upload this file to Kaggle to update the leaderboard.

In [223]:
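def generate_submission():
    """Predict on X_test with the trained model and write the submission CSV."""
    list_of_predictions = model.predict(X_test)
    # Map the encoded integer predictions back to the original label strings
    preds = label_encoder.inverse_transform(list_of_predictions)
    # Keep the test-set index so each prediction stays attached to its ID
    df = pd.DataFrame({'Credit_Score': preds}, index=X_test.index)
    df.to_csv('sandbox_submission.csv')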

In [224]:
generate_submission()
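
Before uploading, it can help to read the file back and check its shape. The sketch below does this on a small made-up submission written to a temporary directory. The IDs, the label strings, and the names `demo_submission` and `reloaded` are assumptions for illustration, and the exact format the leaderboard expects should be checked against the challenge page.

```python
import os
import tempfile

import pandas as pd

# Made-up IDs and labels standing in for X_test.index and the decoded predictions.
demo_submission = pd.DataFrame(
    {'Credit_Score': ['Good', 'Standard', 'Poor']},
    index=pd.Index([68757, 113371, 154933], name='ID'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_submission.csv')
    demo_submission.to_csv(path)

    # Read the file back the same way the notebook reads the input CSVs.
    reloaded = pd.read_csv(path, index_col=0)

# Basic sanity checks: one prediction column, unique IDs, no missing labels.
assert list(reloaded.columns) == ['Credit_Score']
assert reloaded.index.is_unique
assert not reloaded['Credit_Score'].isna().any()
print(reloaded)
```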