## Machine Learning For Credit Scoring

Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.

Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.


Dataset source: https://www.kaggle.com/competitions/GiveMeSomeCredit/data

![Image](https://img-blog.csdnimg.cn/ef14a1b053744335b57914140f0453a9.png)

In [1]:
import pandas as pd
pd.set_option("display.max_columns",500)
import zipfile


In [5]:
# Read the CSV file from inside the zip archive
with zipfile.ZipFile("./KaggleCredit2.csv.zip",'r') as z:
    f = z.open("KaggleCredit2.csv")
    data = pd.read_csv(f,index_col=0)

print(data.shape)

data.head()


(112915, 11)


| Unnamed: 0 | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0.766127 | 45.0 | 2.0 | 0.802982 | 9120.0 | 13.0 | 0.0 | 6.0 | 0.0 | 2.0 |
| 1 | 0 | 0.957151 | 40.0 | 0.0 | 0.121876 | 2600.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0 | 0.65818 | 38.0 | 1.0 | 0.085113 | 3042.0 | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0 | 0.23381 | 30.0 | 0.0 | 0.03605 | 3300.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0 | 0.907239 | 49.0 | 1.0 | 0.024926 | 63588.0 | 7.0 | 0.0 | 1.0 | 0.0 | 0.0 |


Check for null values; any rows that contain them need to be removed.


In [10]:
# data.info()
data.isnull().sum(axis=0)

SeriousDlqin2yrs                           0
RevolvingUtilizationOfUnsecuredLines       0
age                                     4267
NumberOfTime30-59DaysPastDueNotWorse       0
DebtRatio                                  0
MonthlyIncome                              0
NumberOfOpenCreditLinesAndLoans            0
NumberOfTimes90DaysLate                    0
NumberRealEstateLoansOrLines               0
NumberOfTime60-89DaysPastDueNotWorse       0
NumberOfDependents                      4267
dtype: int64

In [11]:
# Drop the rows that contain missing values
data.dropna(inplace=True)
data.shape

(108648, 11)
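
The shape drops from 112915 to 108648 rows, a loss of 4267 rows. That equals the null count of both `age` and `NumberOfDependents`, so the missing values in these two columns appear to sit in the same rows. As a small illustration of the same check-then-drop pattern, here is a sketch on a made-up DataFrame (the values and the names `toy`, `null_counts`, `rows_with_null`, and `cleaned` are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (made up): "age" and "NumberOfDependents" are missing in the same rows.
toy = pd.DataFrame({
    "age": [45.0, np.nan, 38.0, 30.0, np.nan],
    "MonthlyIncome": [9120.0, 2600.0, 3042.0, 3300.0, 63588.0],
    "NumberOfDependents": [2.0, np.nan, 0.0, 0.0, np.nan],
})

# Missing values per column.
null_counts = toy.isnull().sum(axis=0)

# Number of rows that have at least one missing value.
rows_with_null = int(toy.isnull().any(axis=1).sum())

# dropna() without inplace=True returns a new frame and leaves `toy` unchanged.
cleaned = toy.dropna()

print(null_counts)
print("rows with a null:", rows_with_null)
print("rows dropped:", len(toy) - len(cleaned))
```

When the number of dropped rows matches the per-column null counts, as it does here, the missing values overlap and `dropna` removes no more rows than the worst single column would.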

Define X and y

In [13]:
print(data.columns.values)  # inspect the column names

['SeriousDlqin2yrs' 'RevolvingUtilizationOfUnsecuredLines' 'age'
 'NumberOfTime30-59DaysPastDueNotWorse' 'DebtRatio' 'MonthlyIncome'
 'NumberOfOpenCreditLinesAndLoans' 'NumberOfTimes90DaysLate'
 'NumberRealEstateLoansOrLines' 'NumberOfTime60-89DaysPastDueNotWorse'
 'NumberOfDependents']


In [14]:

y = data['SeriousDlqin2yrs']
X = data.drop('SeriousDlqin2yrs',axis=1)

### Exercise 1

Split the data into a training set and a test set.

In [16]:
from sklearn import model_selection
x_train,x_test,y_train,y_test= model_selection.train_test_split(X,y,test_size=0.2)

x_train.shape

(86918, 10)
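
The split above is random on every run and does not control the class ratio. Because defaults are rare in this kind of data, one option is to pass `stratify` and `random_state` to `train_test_split`. The following is a minimal sketch on synthetic labels (the 7% positive rate and the names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data (made up): 1000 samples, 70 of them positive.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:70] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    stratify=y_demo,   # keep the class ratio the same in both splits
    random_state=42,   # fixed seed so the split is reproducible
)

print(X_tr.shape, X_te.shape)
print("positive rate, train:", y_tr.mean(), "test:", y_te.mean())
```

With stratification, both splits keep the 7% positive rate, so metrics computed on the test set are not distorted by an unlucky draw of the rare class.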


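### Exercise 2

Classify the data with sklearn classification algorithms such as logistic regression, decision trees, SVM, or KNN. Consult the sklearn API documentation to understand what each model parameter means, and try adjusting different parameters.


A summary of how to use the logistic regression library: https://blog.csdn.net/sun_shengyun/article/details/53811483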
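
One way to approach this exercise is to loop over a few candidate models and compare them with cross-validation. The sketch below does this on synthetic data from `make_classification`, which only stands in for the credit data. The parameter values (`max_depth=5`, `n_neighbors=15`) and the names `candidates` and `results` are arbitrary assumptions, not tuned choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives), not the real dataset.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Candidate models with illustrative, untuned parameters.
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=15),
}

results = {}
for name, model in candidates.items():
    # 5-fold cross-validated ROC AUC, which is less sensitive to class imbalance than accuracy.
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="roc_auc")
    results[name] = scores.mean()
    print(f"{name}: mean ROC AUC = {results[name]:.3f}")
```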

In [18]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(multi_class='ovr',solver='sag',class_weight='balanced')

lr.fit(x_train,y_train)
score = lr.score(x_train,y_train)
print(score)  # the best possible score is 1

0.9326491635794657
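
The sklearn documentation notes that the `sag` and `saga` solvers only converge quickly when the features are on roughly the same scale. The columns here range from ratios below 1 to monthly incomes in the thousands, so standardizing the features first may be worth trying. A minimal sketch on synthetic data (`make_classification` stands in for the credit data; the names `X_syn`, `y_syn`, `scaled_lr`, and `pred` are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 7% positives), not the real dataset.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.93, 0.07], random_state=0
)
X_syn[:, 0] *= 10000.0  # mimic a large-valued column such as MonthlyIncome

scaled_lr = make_pipeline(
    StandardScaler(),  # rescale each feature to zero mean and unit variance
    LogisticRegression(solver="sag", class_weight="balanced", max_iter=1000),
)
scaled_lr.fit(X_syn, y_syn)

pred = scaled_lr.predict(X_syn)
print("training accuracy:", scaled_lr.score(X_syn, y_syn))
print("predicted positives:", int(pred.sum()))
```

Putting the scaler inside the pipeline means the same scaling learned on the training data is applied automatically at prediction time.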




### Exercise 3

Predict on the test set and compute the accuracy.

https://blog.csdn.net/qq_16095417/article/details/79590455


In [20]:
from sklearn.metrics import accuracy_score
train_score = accuracy_score(y_train,lr.predict(x_train))
# test_score = lr.score(x_test,y_test)
test_score = accuracy_score(y_test,lr.predict(x_test))
print('Training accuracy:',train_score)
print('Test accuracy:',test_score)

Training accuracy: 0.9326491635794657
Test accuracy: 0.9321675103543489
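
An accuracy near 0.93 is hard to interpret without a baseline. As a rough sketch with made-up labels (93 negatives and 7 positives, an assumed ratio rather than one measured from this dataset), sklearn's `DummyClassifier` shows that always predicting the majority class already reaches that kind of accuracy:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 93 negatives, 7 positives.
y_demo = np.array([0] * 93 + [1] * 7)
X_demo = np.zeros((100, 1))  # the dummy model ignores the features

# Baseline that always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

print("baseline accuracy:", baseline_acc)
```

A real model is only useful to the extent that it beats this baseline, which is why accuracy alone says little on imbalanced data.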



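### Exercise 4

Read the official sklearn documentation to learn about the evaluation metrics for classification problems, and use them to evaluate this example.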


In [29]:


# Recall
from sklearn.metrics import recall_score
train_recall = recall_score(y_train,lr.predict(x_train),average='macro')
test_recall = recall_score(y_test,lr.predict(x_test),average='macro')
print('Training recall:',train_recall)
print('Test recall:',test_recall)

Training recall: 0.5
Test recall: 0.4999506367854675
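
A macro-averaged recall of 0.5 is exactly what a binary classifier gets when it labels every sample as class 0: recall is 1 for class 0 and 0 for class 1, and their mean is 0.5. The sketch below reproduces this on made-up labels (the 93/7 split and the names `y_true`, `y_all_zero`, `cm`, `recall_pos`, and `recall_macro` are assumptions). It suggests, but does not prove, that the model above behaves much like such a predictor:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 93 negatives, 7 positives.
y_true = np.array([0] * 93 + [1] * 7)
y_all_zero = np.zeros_like(y_true)  # a "model" that never predicts class 1

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_all_zero)

# Recall of the positive class: true positives / all actual positives.
recall_pos = recall_score(y_true, y_all_zero)

# Macro recall: unweighted mean of the per-class recalls.
recall_macro = recall_score(y_true, y_all_zero, average="macro")

print(cm)
print("recall (class 1):", recall_pos)
print("recall (macro):", recall_macro)
```

For a credit model, the recall of class 1 (the borrowers who actually default) is usually the number to watch, and the confusion matrix makes it visible at a glance.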


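### Exercise 5

Banks usually have stricter requirements, because the consequences of fraud tend to be severe, so we typically adjust the model's decision criterion.
In logistic regression, for example, the usual probability decision boundary is 0.5, but we can set the threshold lower to make the model more "sensitive". Try setting the threshold to 0.3 and look at the evaluation metrics again (mainly accuracy and recall).

Tip:
Many sklearn classification models provide `predict_proba`, which returns the estimated probabilities. Compare them with the chosen threshold to decide the final result (the predicted class).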

In [30]:
import numpy as np
y_pro = lr.predict_proba(x_test)
# Column 1 of predict_proba is P(class 1); predict 1 when that probability is at least 0.3.
y_prd2 = np.where(y_pro[:, 1] >= 0.3, 1, 0)
test_score = accuracy_score(y_test,y_prd2)
print(test_score)
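
To see how the threshold trades accuracy against recall, it can help to sweep a few values rather than test only one. The sketch below does this on synthetic data (`make_classification` stands in for the credit data; the thresholds and the names `clf`, `proba_pos`, and `recalls` are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), not the real dataset.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Columns of predict_proba follow clf.classes_ (here [0, 1]), so column 1 is P(class 1).
proba_pos = clf.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3, 0.1):
    # Predict class 1 whenever its probability reaches the threshold.
    y_hat = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)
    print(
        f"threshold={threshold}: "
        f"accuracy={accuracy_score(y_te, y_hat):.3f}, "
        f"recall={recalls[threshold]:.3f}"
    )
```

Lowering the threshold can only add predicted positives, so recall for class 1 never decreases. Accuracy may fall as more negatives are flagged.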
