# Part 1. EDA and Preprocessing
We use data on the customers of a German bank: https://www.kaggle.com/uciml/german-credit

We drop the first, unnamed column right away: it is a unique identifier that simply numbers the rows in order.

Meaning of the non-obvious columns:
- Saving accounts: a savings deposit.

First step: analyze the raw dataset. We examine the descriptive statistics and charts in german_credit_data_Descriptives.pdf
(the PDF file was produced with the Jamovi application).

The distributions are far from normal.

## Missing values

Missing values occur in two columns: Saving accounts (18.3 % missing) and
Checking account (39.4 %).
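
The shares above can be checked with a short pandas computation. This is a minimal sketch on a small made-up frame named `toy` (its values are illustrative and are not rows from the dataset). On the real data, the same expression would be applied to the frame loaded from german_credit_data.csv.

```python
import pandas as pd

# Toy frame with illustrative values; not rows from the real dataset.
toy = pd.DataFrame({
    'Saving accounts': ['little', None, 'rich', 'little'],
    'Checking account': [None, None, 'moderate', 'little'],
    'Age': [30, 41, 25, 52],
})

# isna() marks missing cells; the column mean of that boolean mask is the
# fraction of missing values, so multiplying by 100 gives percentages.
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))
```

Columns with a share of 0 have no missing values and need no imputation.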

We fill them with a constant, using the mode of each column:
- Saving accounts: little;
- Checking account: little.
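
The mode does not have to be typed in by hand. Here is a hedged sketch of how it could be derived and applied with pandas, again on an illustrative frame `toy` rather than on the real data:

```python
import pandas as pd

# Toy frame with illustrative values; not rows from the real dataset.
toy = pd.DataFrame({
    'Saving accounts': ['little', None, 'rich', 'little'],
    'Checking account': ['moderate', None, 'little', 'little'],
})

# mode() skips missing values and returns a Series (ties are possible),
# so we take its first entry as the fill value for each column.
modes = {col: toy[col].mode()[0] for col in toy.columns}

# fillna() with a dict fills each column with its own constant.
filled = toy.fillna(modes)
print(modes)
print(filled)
```

Passing a dictionary to `fillna` is the same call pattern the notebook uses, only with the constants computed instead of hard-coded.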

## Outliers

We use box plots (box-and-whisker diagrams) to check for outliers.

Outliers are present in these columns:
- Age (toward older ages, minor);
- Credit amount (toward large values);
- Duration (toward large values).
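
The box plot reading can be cross-checked numerically. The sketch below assumes the common whisker rule of 1.5 × IQR (the exact rule used by the plotting tool is an assumption here). The helper name `iqr_outliers` and the sample amounts are hypothetical, chosen only for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside the whiskers [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Illustrative credit amounts; not rows from the real dataset.
toy_amounts = pd.Series([1000, 1200, 1500, 1800, 2000, 2500, 15000])

outliers = iqr_outliers(toy_amounts)
print(outliers)
```

In this toy series only the value 15000 lies above the upper whisker, which mirrors the "toward large values" pattern described for Credit amount and Duration.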

## Feature Engineering

We convert all features to numeric ones.

Continuous numeric features: Age, Credit amount, Duration.

Binary features (we apply LabelEncoding): Sex.

Ordered features:
- Job (0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled): already encoded;
- Housing (0 - free (no home of their own), 1 - rent, 2 - own): we encode the degree of home ownership;
- Saving accounts (0 - little, 1 - moderate, 2 - quite rich, 3 - rich);
- Checking account (the description on the site contains an error) (0 - little, 1 - moderate, 3 - rich). Code 2 is unused because this column is encoded on the same four-level scale as Saving accounts.
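
The gap in the Checking account codes follows from reusing one category list for both account columns. As a small sketch with illustrative values (the frame `toy` is hypothetical), passing the four-level order to `OrdinalEncoder` assigns codes by position in that list, whether or not every level occurs in the column:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Shared four-level scale; a code is the position of a level in this list.
levels = ['little', 'moderate', 'quite rich', 'rich']
enc = OrdinalEncoder(categories=[levels])

# Illustrative values; 'quite rich' is deliberately absent from the column.
toy = pd.DataFrame({'Checking account': ['little', 'moderate', 'rich']})

# fit_transform returns an (n, 1) array; ravel() flattens it to one column.
codes = enc.fit_transform(toy[['Checking account']]).ravel()
print(codes)
```

Here rich is encoded as 3 and the code 2 never appears. Using a three-level list for Checking account instead would give contiguous codes 0 to 2; the notebook keeps the shared scale.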

Categorical features (we apply OneHotEncoding):
- Purpose (possible values: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others).

In [10]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

df = pd.read_csv('./german_credit_data.csv')

# Drop the first unnamed column, which only holds a sequential row index
df = df.iloc[:, 1:]

# Fill missing values with the mode of each column ('little' in both cases)
df = df.fillna({'Saving accounts': 'little', 'Checking account': 'little'})

# Encode the binary feature (LabelEncoder sorts labels: female -> 0, male -> 1)
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])

# Encode the ordered features with an explicit category order
enc = OrdinalEncoder(categories=[['free', 'rent', 'own']])
df['Housing'] = enc.fit_transform(df[['Housing']]).ravel()

# The same four-level scale is reused for both account columns
enc = OrdinalEncoder(categories=[['little', 'moderate', 'quite rich', 'rich']])
df['Saving accounts'] = enc.fit_transform(df[['Saving accounts']]).ravel()
df['Checking account'] = enc.fit_transform(df[['Checking account']]).ravel()

# One-hot encode the nominal feature

features_for_encoding = ['Purpose']
enc = OneHotEncoder(dtype=np.int_)
transformer = ColumnTransformer(transformers=[('onehot', enc, features_for_encoding)],
                                verbose_feature_names_out=False,  # no transformer-name prefix on column names
                                remainder='passthrough')  # keep the remaining features unchanged
transformed = transformer.fit_transform(df)
df = pd.DataFrame(transformed, columns=transformer.get_feature_names_out())

df.head()


| | Purpose_business | Purpose_car | Purpose_domestic appliances | Purpose_education | Purpose_furniture/equipment | Purpose_radio/TV | Purpose_repairs | Purpose_vacation/others | Age | Sex | Job | Housing | Saving accounts | Checking account | Credit amount | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 67.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1169.0 | 6.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 22.0 | 0.0 | 2.0 | 2.0 | 0.0 | 1.0 | 5951.0 | 48.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 2096.0 | 12.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 45.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 7882.0 | 42.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 53.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 4870.0 | 24.0 |


## Data normalization

Normalization is needed because the columns differ greatly in numeric scale. Without it, a model may rely mostly on the feature with the widest range of values and effectively ignore others, such as the binary features.

We therefore apply MinMaxScaler to the columns:
- Age
- Job
- Housing
- Saving accounts
- Checking account
- Credit amount
- Duration
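
For reference, min-max scaling maps each value x to (x - min) / (max - min), so every scaled column lies in [0, 1]. The sketch below applies that formula by hand to three Age values from the encoded table. The bounds 19 and 75 are an assumption inferred from the scaled Age values in this notebook's output, not read from the data, and `min_max_scale` is a hypothetical helper name.

```python
def min_max_scale(x: float, x_min: float, x_max: float) -> float:
    """Map x from the range [x_min, x_max] onto [0, 1]."""
    return (x - x_min) / (x_max - x_min)

# Assumption: Age spans 19..75 (inferred from the scaled output, not measured).
AGE_MIN, AGE_MAX = 19, 75

# Ages of the first three rows of the encoded table.
for age in (67, 22, 49):
    print(age, round(min_max_scale(age, AGE_MIN, AGE_MAX), 6))
```

Under that assumption the results (0.857143, 0.053571, 0.535714) agree with the scaled Age column, which is a quick sanity check that the scaler was fitted on the intended column.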

In [14]:
from sklearn.preprocessing import MinMaxScaler

# Min-max scaling of the non-binary features to the range [0, 1]

columns_for_scaling = ['Age', 'Job', 'Housing', 'Saving accounts', 'Checking account', 'Credit amount', 'Duration']
mm = MinMaxScaler()
df[columns_for_scaling] = mm.fit_transform(df[columns_for_scaling])

# Save the prepared DataFrame to a new CSV file
df.to_csv('./german_credit_data_Preprocessed.csv', index=False)

df.head()

| | Purpose_business | Purpose_car | Purpose_domestic appliances | Purpose_education | Purpose_furniture/equipment | Purpose_radio/TV | Purpose_repairs | Purpose_vacation/others | Age | Sex | Job | Housing | Saving accounts | Checking account | Credit amount | Duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.857143 | 1.0 | 0.666667 | 1.0 | 0.0 | 0.0 | 0.050567 | 0.029412 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.053571 | 0.0 | 0.666667 | 1.0 | 0.0 | 0.333333 | 0.31369 | 0.647059 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.535714 | 1.0 | 0.333333 | 1.0 | 0.0 | 0.0 | 0.101574 | 0.117647 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.464286 | 1.0 | 0.666667 | 0.0 | 0.0 | 0.0 | 0.419941 | 0.558824 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.607143 | 1.0 | 0.666667 | 0.0 | 0.0 | 0.0 | 0.254209 | 0.294118 |


**The data is prepared for modeling** and saved to the file german_credit_data_Preprocessed.csv.