# Credit Scoring
When deciding whether to issue a loan, lenders take into account a so-called credit score: a rating of the client's ability to repay. The score comes from a machine-learning model built on many features, such as age, salary, credit history, ownership of real estate or a car, criminal record, and others. After these features are processed, the model returns a positive or negative decision.

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.4.0.tar.gz (310.8 MB)
     310.8/310.8 MB 3.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317130 sha256=9269a94d12dcccafdafe1fe16c82d6fbf0b06a454677bcce6bb58c9b7c3d9fb6
  Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.0


In [2]:

from pyspark.sql import SparkSession


spark = SparkSession.builder\
        .master("local[*]")\
        .appName("Colab_pyspark")\
        .config('spark.ui.port', '4050')\
        .config('spark.executor.memory', '3g')\
        .getOrCreate()
        # .config('spark.sql.execution.arrow.enabled', 'true')\
        # .config('spark."Broadcastsizetable"', '-1')\
        # .config('preferSortHashJoin', 'true')\

In [69]:
# Import libraries
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd   
import matplotlib.pyplot as plt
import seaborn as sns
import itertools

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Data:
[download](https://drive.google.com/file/d/1MuAyZiIm3b_r-AgQSj78tsRPqZpvv_2s/view?usp=sharing)

**application_record.csv**

| Feature name | Explanation | Remarks |
|---|---|---|
| ID | Client number | |
| CODE_GENDER | Gender | |
| FLAG_OWN_CAR | Is there a car | |
| FLAG_OWN_REALTY | Is there a property | |
| CNT_CHILDREN | Number of children | |
| AMT_INCOME_TOTAL | Annual income | |
| NAME_INCOME_TYPE | Income category | |
| NAME_EDUCATION_TYPE | Education level | |
| NAME_FAMILY_STATUS | Marital status | |
| NAME_HOUSING_TYPE | Way of living | |
| DAYS_BIRTH | Birthday | Count backwards from current day (0), -1 means yesterday |
| DAYS_EMPLOYED | Start date of employment | Count backwards from current day (0). If positive, it means the person currently unemployed. |
| FLAG_MOBIL | Is there a mobile phone | |
| FLAG_WORK_PHONE | Is there a work phone | |
| FLAG_PHONE | Is there a phone | |
| FLAG_EMAIL | Is there an email | |
| OCCUPATION_TYPE | Occupation | |
| CNT_FAM_MEMBERS | Family size | |

**credit_record.csv**

| Feature name | Explanation | Remarks |
|---|---|---|
| ID | Client number | |
| MONTHS_BALANCE | Record month | The month of the extracted data is the starting point, backwards, 0 is the current month, -1 is the previous month, and so on |
| STATUS | Status | See the codes below |

*   0: 1-29 days past due
*   1: 30-59 days past due
*   2: 60-89 days overdue
*   3: 90-119 days overdue
*   4: 120-149 days overdue
*   5: Overdue or bad debts, write-offs for more than 150 days
*   C: paid off that month
*   X: No loan for the month
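The STATUS codes above can be read as lower bounds on days past due. The sketch below is only an illustration (the dictionary and helper names are hypothetical, not part of the notebook). It shows why a 60-day threshold selects exactly the codes 2 through 5.

```python
# Lower bound of days past due for each STATUS code, taken from the list above.
# "C" (paid off that month) and "X" (no loan that month) carry no delinquency.
MIN_DAYS_PAST_DUE = {"0": 1, "1": 30, "2": 60, "3": 90, "4": 120, "5": 150}


def min_days_past_due(status):
    """Return the minimum days past due implied by a STATUS code (0 for C and X)."""
    return MIN_DAYS_PAST_DUE.get(status, 0)


# Codes whose lower bound reaches a 60-day threshold (the threshold is an assumption
# that matches the rule used later in this notebook).
bad_codes = sorted(code for code, days in MIN_DAYS_PAST_DUE.items() if days >= 60)
print(bad_codes)  # ['2', '3', '4', '5']
```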


## Reading the data

In [75]:
data = spark.read.csv('/content/application_record.csv', header=True, inferSchema=True)
record = spark.read.csv('/content/credit_record.csv', header=True, inferSchema=True)

In [76]:
data.describe()

DataFrame[summary: string, ID: string, CODE_GENDER: string, FLAG_OWN_CAR: string, FLAG_OWN_REALTY: string, CNT_CHILDREN: string, AMT_INCOME_TOTAL: string, NAME_INCOME_TYPE: string, NAME_EDUCATION_TYPE: string, NAME_FAMILY_STATUS: string, NAME_HOUSING_TYPE: string, DAYS_BIRTH: string, DAYS_EMPLOYED: string, FLAG_MOBIL: string, FLAG_WORK_PHONE: string, FLAG_PHONE: string, FLAG_EMAIL: string, OCCUPATION_TYPE: string, CNT_FAM_MEMBERS: string]

In [77]:
for row in data.schema:
  print(row)

StructField('ID', IntegerType(), True)
StructField('CODE_GENDER', StringType(), True)
StructField('FLAG_OWN_CAR', StringType(), True)
StructField('FLAG_OWN_REALTY', StringType(), True)
StructField('CNT_CHILDREN', IntegerType(), True)
StructField('AMT_INCOME_TOTAL', DoubleType(), True)
StructField('NAME_INCOME_TYPE', StringType(), True)
StructField('NAME_EDUCATION_TYPE', StringType(), True)
StructField('NAME_FAMILY_STATUS', StringType(), True)
StructField('NAME_HOUSING_TYPE', StringType(), True)
StructField('DAYS_BIRTH', IntegerType(), True)
StructField('DAYS_EMPLOYED', IntegerType(), True)
StructField('FLAG_MOBIL', IntegerType(), True)
StructField('FLAG_WORK_PHONE', IntegerType(), True)
StructField('FLAG_PHONE', IntegerType(), True)
StructField('FLAG_EMAIL', IntegerType(), True)
StructField('OCCUPATION_TYPE', StringType(), True)
StructField('CNT_FAM_MEMBERS', DoubleType(), True)


In [78]:
record.show(5)

+-------+--------------+------+
|     ID|MONTHS_BALANCE|STATUS|
+-------+--------------+------+
|5001711|             0|     X|
|5001711|            -1|     0|
|5001711|            -2|     0|
|5001711|            -3|     0|
|5001712|             0|     C|
+-------+--------------+------+
only showing top 5 rows



In [79]:
for row in record.schema:
  print(row)

StructField('ID', IntegerType(), True)
StructField('MONTHS_BALANCE', IntegerType(), True)
StructField('STATUS', StringType(), True)


In [80]:
from pyspark.sql.functions import col, when

In [81]:
begin_month = record.groupby(["ID"]).min('MONTHS_BALANCE').withColumn('begin_month', col('min(MONTHS_BALANCE)')* -1).drop('min(MONTHS_BALANCE)')
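As a rough illustration of what the `begin_month` aggregation computes, here is a plain-Python sketch on made-up rows. The function name and the toy values are assumptions; only the column meaning comes from the data description.

```python
# Toy (ID, MONTHS_BALANCE) rows in the layout of credit_record.csv; values are made up.
rows = [(1, 0), (1, -1), (1, -3), (2, 0), (2, -12)]


def months_since_first_record(rows):
    """Return {ID: -min(MONTHS_BALANCE)}: how many months back each history starts."""
    earliest = {}
    for client_id, months_balance in rows:
        earliest[client_id] = min(months_balance, earliest.get(client_id, months_balance))
    return {client_id: -value for client_id, value in earliest.items()}


begin_month = months_since_first_record(rows)
print(begin_month)  # {1: 3, 2: 12}
```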

In [123]:
new_data = data.join(begin_month, ['ID'], 'left')

In [89]:
record.withColumn('dep_value', when(record['STATUS'] == '2', '1')\
                                .when(record['STATUS'] == '3', '1')\
                                .when(record['STATUS'] == '4', '1')\
                                .when(record['STATUS'] == '5', '1')\
                                .otherwise('0')).show(5)

+-------+--------------+------+---------+
|     ID|MONTHS_BALANCE|STATUS|dep_value|
+-------+--------------+------+---------+
|5001711|             0|     X|        0|
|5001711|            -1|     0|        0|
|5001711|            -2|     0|        0|
|5001711|            -3|     0|        0|
|5001712|             0|     C|        0|
+-------+--------------+------+---------+
only showing top 5 rows



In [101]:
cpunt = record.withColumn('dep_value', when(record['STATUS'] == '2', '1')\
                                .when(record['STATUS'] == '3', '1')\
                                .when(record['STATUS'] == '4', '1')\
                                .when(record['STATUS'] == '5', '1')\
                                .otherwise('0'))

In [102]:
cpunt = cpunt.withColumn('dep_value', cpunt['dep_value'].cast('int')).groupby('ID').sum('dep_value')

In [107]:
cpunt = cpunt.withColumn('target', when(cpunt['sum(dep_value)'] > 0, 1).otherwise(0)).drop('sum(dep_value)')

In [111]:
cpunt.show(5)

+-------+------+
|     ID|target|
+-------+------+
|5001812|     0|
|5001849|     0|
|5001921|     0|
|5003338|     0|
|5003386|     0|
+-------+------+
only showing top 5 rows
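The target construction above (flag each month, sum the flags per ID, mark the client if the sum is positive) can be sketched without Spark. This is a minimal illustration on made-up rows; `build_target` is a hypothetical helper, not a function from the notebook.

```python
# Toy (ID, STATUS) rows; the values are made up for illustration.
rows = [(1, "0"), (1, "C"), (2, "1"), (2, "3"), (3, "X")]


def build_target(rows, bad_codes=frozenset({"2", "3", "4", "5"})):
    """Return {ID: 1 if the client has at least one month with a bad code, else 0}."""
    target = {}
    for client_id, status in rows:
        flag = int(status in bad_codes)
        target[client_id] = max(target.get(client_id, 0), flag)
    return target


target = build_target(rows)
print(target)  # {1: 0, 2: 1, 3: 0}
```

Taking the maximum of the monthly flags gives the same result as summing them and checking whether the sum is greater than zero.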



In [124]:
new_data = new_data.join(cpunt, ['ID'], 'inner')

In [125]:
new_data.show(5)

+-------+-----------+------------+---------------+------------+----------------+--------------------+--------------------+--------------------+-----------------+----------+-------------+----------+---------------+----------+----------+---------------+---------------+-----------+------+
|     ID|CODE_GENDER|FLAG_OWN_CAR|FLAG_OWN_REALTY|CNT_CHILDREN|AMT_INCOME_TOTAL|    NAME_INCOME_TYPE| NAME_EDUCATION_TYPE|  NAME_FAMILY_STATUS|NAME_HOUSING_TYPE|DAYS_BIRTH|DAYS_EMPLOYED|FLAG_MOBIL|FLAG_WORK_PHONE|FLAG_PHONE|FLAG_EMAIL|OCCUPATION_TYPE|CNT_FAM_MEMBERS|begin_month|target|
+-------+-----------+------------+---------------+------------+----------------+--------------------+--------------------+--------------------+-----------------+----------+-------------+----------+---------------+----------+----------+---------------+---------------+-----------+------+
|5008804|          M|           Y|              Y|           0|        427500.0|             Working|    Higher education|      Civil marri

In [70]:
# Below, anyone who was ever more than 60 days past due gets target = 1.
data = pd.read_csv("application_record.csv", encoding='utf-8')
record = pd.read_csv("credit_record.csv", encoding='utf-8')

# Add the loan duration (months since the first record) to the application data
begin_month = pd.DataFrame(record.groupby("ID")["MONTHS_BALANCE"].min() * -1)
begin_month = begin_month.rename(columns={'MONTHS_BALANCE': 'begin_month'})
new_data = pd.merge(data, begin_month, how="left", on="ID")

# STATUS 2-5 means more than 60 days past due: mark that month 'Yes'.
# A client with at least one such month over the loan period is also marked 'Yes'.
record['dep_value'] = None
record.loc[record['STATUS'].isin(['2', '3', '4', '5']), 'dep_value'] = 'Yes'
cpunt = record.groupby('ID').count()  # count() skips None, so this counts overdue months
cpunt['dep_value'] = np.where(cpunt['dep_value'] > 0, 'Yes', 'No')

# Join everything together and replace Yes/No with 1/0
cpunt = cpunt[['dep_value']]
new_data = pd.merge(new_data, cpunt, how='inner', on='ID')
new_data['target'] = new_data['dep_value']
new_data.loc[new_data['target'] == 'Yes', 'target'] = 1
new_data.loc[new_data['target'] == 'No', 'target'] = 0

In [None]:
# As a result, we have added the target to the application data
new_data.head()

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,begin_month,dep_value,target
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,15.0,No,0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,14.0,No,0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0,29.0,No,0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0,4.0,No,0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0,26.0,No,0


In [None]:
# To keep things simple, keep only a subset of the features
features = ['AMT_INCOME_TOTAL', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN']
target = ['target',]
dataset = new_data[features + target].copy()  # copy to avoid SettingWithCopyWarning on the next line
dataset[target[0]] = pd.to_numeric(dataset[target[0]])

In [126]:
# To keep things simple, keep only a subset of the features
features = ['AMT_INCOME_TOTAL', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN']
target = ['target',]
dataset = new_data[features + target]

In [128]:
dataset.withColumn('target', dataset['target'].cast('int'))

DataFrame[AMT_INCOME_TOTAL: double, CODE_GENDER: string, FLAG_OWN_CAR: string, FLAG_OWN_REALTY: string, CNT_CHILDREN: int, target: int]

We now have a sample that lists each client's parameters and whether or not they went past due.

In [None]:
# Split the sample into train and test: the model is fit on train and validated on test.
X_train, X_test, y_train, y_test = train_test_split(dataset[features], pd.to_numeric(dataset[target[0]]), test_size=0.3, random_state=42)
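To make the 70/30 split concrete, here is a minimal plain-Python sketch of a seeded shuffle split. It is not scikit-learn's exact algorithm (rounding and shuffling details differ), and `split_indices` is a hypothetical helper; it only illustrates that a fixed seed gives a reproducible, non-overlapping partition.

```python
import random


def split_indices(n_rows, test_size=0.3, seed=42):
    """Shuffle row indices with a fixed seed and cut off the first test_size share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 7 3
```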

In [132]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline

In [138]:
text_columns = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']

stri = StringIndexer(inputCols=text_columns, outputCols=[column+'_stri' for column in text_columns])
ohe = OneHotEncoder(inputCols=stri.getOutputCols(), outputCols=[column+'_ohe' for column in text_columns])
pipe = Pipeline(stages=[
    stri,
    ohe
])
total_new_data = pipe.fit(new_data).transform(new_data)
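The two pipeline stages can be illustrated in plain Python. This sketch assumes the PySpark defaults: `StringIndexer` orders labels by descending frequency (ties alphabetically), and `OneHotEncoder` drops the last category (`dropLast=True`). The helper names are hypothetical, and plain lists stand in for Spark's sparse vectors.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, n_categories, drop_last=True):
    """Encode an index as a 0/1 list; with drop_last the last category is all zeros."""
    size = n_categories - 1 if drop_last else n_categories
    return [1.0 if position == index else 0.0 for position in range(size)]


genders = ["F", "M", "F", "F", "M"]
mapping = fit_string_index(genders)
encoded = [one_hot(mapping[g], len(mapping)) for g in genders]
print(mapping)  # {'F': 0, 'M': 1}
print(encoded)  # [[1.0], [0.0], [1.0], [1.0], [0.0]]
```

With two categories and the last one dropped, each binary column is encoded as a single 0/1 value, which avoids a redundant column in a linear model.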

In [139]:
total_new_data.show(5)

+-------+-----------+------------+---------------+------------+----------------+--------------------+--------------------+--------------------+-----------------+----------+-------------+----------+---------------+----------+----------+---------------+---------------+-----------+------+----------------+-----------------+--------------------+---------------+----------------+-------------------+
|     ID|CODE_GENDER|FLAG_OWN_CAR|FLAG_OWN_REALTY|CNT_CHILDREN|AMT_INCOME_TOTAL|    NAME_INCOME_TYPE| NAME_EDUCATION_TYPE|  NAME_FAMILY_STATUS|NAME_HOUSING_TYPE|DAYS_BIRTH|DAYS_EMPLOYED|FLAG_MOBIL|FLAG_WORK_PHONE|FLAG_PHONE|FLAG_EMAIL|OCCUPATION_TYPE|CNT_FAM_MEMBERS|begin_month|target|CODE_GENDER_stri|FLAG_OWN_CAR_stri|FLAG_OWN_REALTY_stri|CODE_GENDER_ohe|FLAG_OWN_CAR_ohe|FLAG_OWN_REALTY_ohe|
+-------+-----------+------------+---------------+------------+----------------+--------------------+--------------------+--------------------+-----------------+----------+-------------+----------+-----------

In [None]:
# Convert categorical features to numeric ones
# Re-import the scikit-learn encoder: the PySpark OneHotEncoder imported above shadows the name
from sklearn.preprocessing import OneHotEncoder

cat_cols = ['CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY']
ohe = OneHotEncoder()
ohe.fit(X_train[cat_cols])
X_train_ohe = ohe.transform(X_train[cat_cols])
X_test_ohe = ohe.transform(X_test[cat_cols])

# Prefix each category with its column name so the column labels stay unique
# (FLAG_OWN_CAR and FLAG_OWN_REALTY both have the categories 'N' and 'Y')
ohe_columns = [f'{col}_{cat}' for col, cats in zip(cat_cols, ohe.categories_) for cat in cats]
X_train_ohe = pd.DataFrame(X_train_ohe.toarray(), columns=ohe_columns)
X_test_ohe = pd.DataFrame(X_test_ohe.toarray(), columns=ohe_columns)

In [None]:
# Scale the numeric features
mms = MinMaxScaler()
mms.fit(X_train[['AMT_INCOME_TOTAL', 'CNT_CHILDREN']])
X_train_scaled = mms.transform(X_train[['AMT_INCOME_TOTAL', 'CNT_CHILDREN']])
X_test_scaled = mms.transform(X_test[['AMT_INCOME_TOTAL', 'CNT_CHILDREN']])

X_train_scaled = pd.DataFrame(X_train_scaled, columns=['AMT_INCOME_TOTAL', 'CNT_CHILDREN'])
X_test_scaled = pd.DataFrame(X_test_scaled, columns=['AMT_INCOME_TOTAL', 'CNT_CHILDREN'])
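As a small worked example of min-max scaling, the sketch below fits the minimum and maximum on training values only and reuses them for the test values, as the cell above does with `MinMaxScaler`. The helper names and the income figures are made up for illustration.

```python
def fit_min_max(values):
    """Learn the scaling range from the training values only."""
    return min(values), max(values)


def transform_min_max(values, low, high):
    """Apply (x - low) / (high - low); a constant feature maps to 0.0."""
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


train_income = [100000.0, 200000.0, 300000.0]
test_income = [150000.0, 400000.0]

low, high = fit_min_max(train_income)
train_scaled = transform_min_max(train_income, low, high)
test_scaled = transform_min_max(test_income, low, high)
print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.5]
```

Test values outside the training range fall outside [0, 1], which is expected when the scaler is fit on the training data alone.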

In [None]:
X_train = pd.concat([X_train_scaled, X_train_ohe,], axis=1)
X_test = pd.concat([X_test_scaled, X_test_ohe, ], axis=1)

# Model

In [None]:
# Build a very simple model whose linear coefficients show how the features relate to the target
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
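To show how a fitted logistic regression turns features into a decision, here is a minimal sketch of the prediction step. The coefficients and intercept are hypothetical values chosen for illustration, not the ones learned by the model above.

```python
import math


def predict_proba(features, coefficients, intercept):
    """Return the probability of class 1: sigmoid of the linear combination."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two scaled features (assumption, not fitted values)
coefficients = [0.8, -0.5]
intercept = -3.0

p = predict_proba([0.2, 1.0], coefficients, intercept)
label = int(p >= 0.5)  # 0.5 is the usual default threshold for a class label
print(label)
```

The sign of each coefficient indicates whether a larger feature value pushes the predicted probability up or down, which is what makes the linear model easy to interpret.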

In [None]:
train_score, test_score = accuracy_score(y_train, model.predict(X_train)), accuracy_score(y_test, model.predict(X_test))
print(f'Model accuracy on train {train_score}, on test {test_score}')

Model accuracy on train 0.9828755045260394, on test 0.983635033827025
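When positive cases are rare, accuracy is worth comparing with a majority-class baseline, because a model that always predicts 0 can already score very high. The sketch below uses made-up labels (the notebook does not report the class balance) and a hand-written `accuracy` helper to illustrate the comparison.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: 2 overdue clients out of 100 (assumption for illustration only)
y_true = [0] * 98 + [1] * 2
always_zero = [0] * len(y_true)

baseline = accuracy(y_true, always_zero)
print(baseline)  # 0.98
```

If the model's accuracy is close to such a baseline, metrics that focus on the rare class (for example recall or ROC AUC) may say more about its quality.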
