# Credit Score Classification: Basic Example Notebook

This is an example notebook for the FML Kaggle Challenge. It walks through basic steps to preprocess the data, train a model, and generate predictions.
### Caution: Some of the steps here are only illustrative. You will still need to change and add steps to perform well in the challenge.

In [5]:
import numpy as np
import pandas as pd

## Reading the Dataset

In [6]:
train_data = pd.read_csv('data/train_set.csv', index_col=0)  # use the ID column as the index
test_data = pd.read_csv('data/test_set.csv', index_col=0)

In [21]:
train_data

ID,Customer_ID,Month,Name,Age,SSN,Occupation,Annual_Income,Monthly_Inhand_Salary,Num_Bank_Accounts,Num_Credit_Card,...,Credit_Mix,Outstanding_Debt,Credit_Utilization_Ratio,Credit_History_Age,Payment_of_Min_Amount,Total_EMI_per_month,Amount_invested_monthly,Payment_Behaviour,Monthly_Balance,Credit_Score
84094,23943,5,Leonoray,18.0,84355102.0,Journalist,70260.200,5564.016667,9.0,7.0,...,Bad,2228.79,25.308445,217.0,Yes,246.378645,99.939577,Low_spent_Small_value_payments,78.800397,Poor
46702,29066,5,Osamua,21.0,981149909.0,Teacher,18001.590,1258.132500,7.0,9.0,...,Bad,2225.58,32.088726,240.0,NM,24.986447,45.523998,Low_spent_Small_value_payments,343.581411,Poor
147514,8183,5,Benf,47.0,324295086.0,Developer,9824.310,707.692500,7.0,4.0,...,Standard,1233.96,25.500503,227.0,No,0.000000,22.006889,Low_spent_Medium_value_payments,322.170689,Good
16675,27938,2,Matt Falloonm,41.0,564682345.0,Entrepreneur,87481.620,7022.135000,0.0,4.0,...,Good,214.43,38.505066,282.0,No,55.653369,56.550721,High_spent_Medium_value_payments,726.849284,Standard
84080,38740,3,Seetharamank,53.0,228116416.0,Manager,129204.920,10508.076667,5.0,6.0,...,Standard,1075.37,38.359175,289.0,Yes,277.610885,165.019535,High_spent_Large_value_payments,742.018547,Standard
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100440,34564,7,Daniel Basesv,31.0,497367729.0,Mechanic,76726.640,6098.886667,6.0,6.0,...,Standard,103.86,22.988139,353.0,No,145.229101,56.871503,High_spent_Medium_value_payments,490.055738,Standard
97738,11931,5,Palmery,15.0,68740168.0,Engineer,29735.220,2208.935000,10.0,8.0,...,Bad,1750.14,34.177601,234.0,Yes,64.477076,52.468341,Low_spent_Small_value_payments,265.301240,Poor
32228,14790,3,Johnsonj,37.0,959945192.0,Engineer,16833.355,1395.779583,7.0,3.0,...,Standard,65.59,31.230439,127.0,Yes,19.785558,14.823454,Low_spent_Small_value_payments,295.742037,Standard
47676,10962,7,Erin Smithu,32.0,193636325.0,Journalist,34947.630,3022.302500,3.0,7.0,...,Standard,936.29,23.539433,201.0,No,77.862907,108.987267,Low_spent_Small_value_payments,251.420052,Standard


## Selecting Features and Encoding

In this example, we only use the numerical columns, but some of the categorical columns might also help.
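
One possible way to bring categorical columns in is one-hot encoding. The sketch below uses a made-up toy frame rather than the challenge data; the names `toy_train` and `toy_test` and their values are assumptions for illustration only. It encodes the training frame with `pd.get_dummies`, then aligns the test frame to the same columns so that a category missing from the test set does not change the feature layout.

```python
import pandas as pd

# Toy frames standing in for train_data / test_data (values are made up).
toy_train = pd.DataFrame({
    "Age": [18.0, 21.0, 47.0, 41.0],
    "Credit_Mix": ["Bad", "Bad", "Standard", "Good"],
    "Payment_of_Min_Amount": ["Yes", "NM", "No", "No"],
})
toy_test = pd.DataFrame({
    "Age": [30.0, 52.0],
    "Credit_Mix": ["Good", "Standard"],
    "Payment_of_Min_Amount": ["No", "Yes"],
})

columns_to_encode = ["Credit_Mix", "Payment_of_Min_Amount"]

# One 0/1 indicator column per category seen in the training frame.
train_enc = pd.get_dummies(toy_train, columns=columns_to_encode, dtype=int)

# Encode the test frame, then force it onto the training column layout.
# Categories absent from the test frame become all-zero columns.
test_enc = pd.get_dummies(toy_test, columns=columns_to_encode, dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc)
```

Identifier-like columns such as `Customer_ID`, `Name`, and `SSN` have many distinct values, so they are probably a poor fit for one-hot encoding and may need a different treatment or none at all.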

In [8]:

categorical_columns = ['Credit_Mix',
                      'Month',
                      'Occupation',
                      'Payment_Behaviour',
                      'Payment_of_Min_Amount',
                      'Type_of_Loan',
                      'Customer_ID',
                      'Name',
                      'SSN']


numerical_columns = ['Age',
                     'Num_Bank_Accounts',
                     'Num_Credit_Card',
                     'Interest_Rate',
                     'Num_of_Loan',
                     'Delay_from_due_date',
                     'Num_of_Delayed_Payment',
                     'Num_Credit_Inquiries',
                     'Credit_History_Age',
                     'Credit_Utilization_Ratio',
                     'Annual_Income',
                     'Monthly_Inhand_Salary',
                     'Changed_Credit_Limit',
                     'Outstanding_Debt',
                     'Total_EMI_per_month',
                     'Amount_invested_monthly',
                     'Monthly_Balance']

# Target label to predict
label_to_predict = 'Credit_Score'

In [9]:
X_train = train_data[numerical_columns]
X_test  = test_data[numerical_columns]

## Begin Training

We start by training a simple decision tree model. To outperform it, you will need to think about which model to use and which hyperparameters to set.
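
As a starting point for hyperparameter tuning, a cross-validated grid search is one common option. The sketch below runs on synthetic data from `make_classification`, not on the challenge data, and the grid values are arbitrary assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=5,
    n_classes=3,
    random_state=42,
)

# Hypothetical grid; the values worth trying depend on the real data.
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 5, 20],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern should work with other scikit-learn classifiers by swapping the estimator and the grid, which makes it easy to compare several model families on equal footing.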

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

In [11]:
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_data[label_to_predict]).astype(int)

In [12]:
y_train

array([1, 1, 0, ..., 2, 2, 1])

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

In [14]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

## Evaluation

In [15]:
from sklearn.metrics import accuracy_score

In [16]:
y_pred = model.predict(X_val)
accuracy_score(y_val, y_pred)  # scikit-learn expects (y_true, y_pred)

0.7215
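
Accuracy alone can hide which classes the model confuses. The sketch below shows how a confusion matrix and a per-class report could be produced; the label arrays are made up for illustration, and the class names assume the encoder's alphabetical ordering (`Good` = 0, `Poor` = 1, `Standard` = 2).

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels standing in for y_val and y_pred.
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision, recall, and F1 for each class.
print(classification_report(
    y_true_demo,
    y_pred_demo,
    target_names=["Good", "Poor", "Standard"],
))

print(accuracy_score(y_true_demo, y_pred_demo))
```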

In [23]:
y_pred

array([1, 2, 1, ..., 2, 2, 1])

In [25]:
X_test.shape

(30000, 17)

## Generating the Submission

Below is a function that predicts on the test set and writes a submission file. Upload this file to Kaggle to update the leaderboard.

In [17]:
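def generate_submission():
    """Predict on X_test and write the submission CSV."""
    # Predict encoded labels, then map them back to the original class names
    list_of_predictions = model.predict(X_test)
    preds = label_encoder.inverse_transform(list_of_predictions)
    # Keep the test-set index so each prediction stays tied to its row
    df = pd.DataFrame({'Credit_Score': preds}, index=X_test.index)
    df.to_csv('sandbox_submission_demo.csv')
    # Show the predicted class distribution as a quick sanity check
    print(df['Credit_Score'].value_counts())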

In [22]:
generate_submission()

Credit_Score
Standard    16049
Poor         8607
Good         5344
Name: count, dtype: int64
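
Before uploading, it may be worth checking the file's shape and contents. The sketch below reads a made-up CSV from memory instead of the real submission file; the `ID` index name, the `expected_ids` values, and the set of allowed labels are assumptions based on the outputs shown above.

```python
import io

import pandas as pd

# Stand-in for the file written by generate_submission(); rows are made up.
csv_text = "ID,Credit_Score\n101,Good\n102,Poor\n103,Standard\n"
submission = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Hypothetical test-set index; in the notebook this would be X_test.index.
expected_ids = pd.Index([101, 102, 103], name="ID")
allowed_labels = {"Good", "Poor", "Standard"}

checks = {
    "same_ids": submission.index.equals(expected_ids),
    "unique_ids": submission.index.is_unique,
    "valid_labels": set(submission["Credit_Score"]) <= allowed_labels,
    "no_missing": not submission["Credit_Score"].isna().any(),
}
print(checks)
```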
