# German Credit Risk Assessment Report
The data for this credit risk assessment comes from: https://www.kaggle.com/datasets/uciml/german-credit

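## Background:
The original dataset was collected by Prof. Hofmann and contains 1000 entries with 20 attributes. Each entry represents a person who takes out credit from a bank. Based on the set of attributes, each person is classified as a good or bad credit risk.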

## Contents:
The dataset includes the following columns:
> 1. Age (numeric)
> 2. Sex (text: male, female)
> 3. Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
> 4. Housing (text: own, rent, free)
> 5. Saving accounts (text - little, moderate, quite rich, rich)
> 6. Checking account (text: little, moderate, rich)
> 7. Credit amount (numeric, in DM)
> 8. Duration (numeric, in months)
> 9. Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)
> 10. Risk (target: good or bad risk; this column is not among those in the CSV loaded below)
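
As a small illustration of the column descriptions above, the numeric `Job` codes can be mapped to readable labels before plotting. This is only a sketch: the name `JOB_LABELS` and the three toy rows are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Labels follow the column description; the dict name is illustrative only.
JOB_LABELS = {
    0: "unskilled and non-resident",
    1: "unskilled and resident",
    2: "skilled",
    3: "highly skilled",
}

# Toy rows for demonstration only, not real records from the dataset.
toy = pd.DataFrame({"Age": [30, 45, 52], "Job": [2, 0, 3]})

# Map each numeric code to its label; unknown codes would become NaN.
toy["Job label"] = toy["Job"].map(JOB_LABELS)
print(toy)
```

The same `map` call could be applied to the real `Job` column once the data is loaded.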

## Import required libraries:

In [4]:
# Data handling and plotting libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Interactive visualization libraries
import plotly.offline as py
py.init_notebook_mode(connected=True)  # use plotly offline in the notebook (connected=True loads plotly.js from a CDN)
import plotly.graph_objs as go
import plotly.tools as tls
from collections import Counter
import plotly.figure_factory as ff

# Model evaluation and selection
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV  # data splitting, cross-validation, hyperparameter search
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, classification_report, fbeta_score  # model evaluation metrics

# Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Warning handling
import warnings

from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter("ignore", category=ConvergenceWarning)

# Show all columns and display floats with three decimals
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## EDA:

### 1. Data preparation:

In [9]:
# Path to the downloaded CSV file (here: the local kagglehub cache)
path = r"C:\Users\Adroke\.cache\kagglehub\datasets\uciml\german-credit\versions\1\german_credit_data.csv"

# Load the data and preview the first rows
df = pd.read_csv(path, index_col=0)
df.head()

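| | Age | Sex | Job | Housing | Saving accounts | Checking account | Credit amount | Duration | Purpose |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 67 | male | 2 | own | NaN | little | 1169 | 6 | radio/TV |
| 1 | 22 | female | 2 | own | little | moderate | 5951 | 48 | radio/TV |
| 2 | 49 | male | 1 | own | little | NaN | 2096 | 12 | education |
| 3 | 45 | male | 2 | free | little | little | 7882 | 42 | furniture/equipment |
| 4 | 53 | male | 2 | free | little | little | 4870 | 24 | car |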
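
The hard-coded Windows path used above only resolves on one machine. A hedged sketch of building the path more portably follows. `find_csv` is a hypothetical helper, and the commented usage assumes that the dataset directory is the one returned by `kagglehub.dataset_download("uciml/german-credit")`, which requires the `kagglehub` package and network access and has not been verified here.

```python
from pathlib import Path


def find_csv(dataset_dir, filename="german_credit_data.csv"):
    """Return the path to `filename` inside `dataset_dir`, or raise if it is missing."""
    candidate = Path(dataset_dir) / filename
    if not candidate.is_file():
        raise FileNotFoundError(f"{candidate} not found")
    return candidate


# Assumed usage (not executed here; `pd` is pandas as imported earlier):
# import kagglehub
# dataset_dir = kagglehub.dataset_download("uciml/german-credit")
# df = pd.read_csv(find_csv(dataset_dir), index_col=0)
```

Failing early with `FileNotFoundError` makes a wrong cache location visible before `pd.read_csv` is called.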


In [10]:
# Count missing values per column
df.isnull().sum()

Age                   0
Sex                   0
Job                   0
Housing               0
Saving accounts     183
Checking account    394
Credit amount         0
Duration              0
Purpose               0
dtype: int64
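
The counts above show gaps only in the two account columns. If the file has the 1000 rows mentioned in the background, that corresponds to 18.3% and 39.4% of the rows. The sketch below shows how such shares can be computed and, as one possible treatment, how the gaps could be filled with an explicit "unknown" category. The fill strategy is an assumption rather than something this notebook does, and the small frame is made-up sample data.

```python
import numpy as np
import pandas as pd

# Made-up sample rows that mimic the structure of the two account columns.
sample = pd.DataFrame({
    "Saving accounts": [np.nan, "little", "little", "rich"],
    "Checking account": ["little", "moderate", np.nan, np.nan],
    "Credit amount": [1169, 5951, 2096, 7882],
})

# Share of missing values per column (mean of the boolean mask).
missing_share = sample.isnull().mean()
print(missing_share)

# One possible treatment: keep the rows and mark gaps as their own category.
account_cols = ["Saving accounts", "Checking account"]
filled = sample.copy()
filled[account_cols] = filled[account_cols].fillna("unknown")
print(filled.isnull().sum())
```

Treating a missing account entry as its own category keeps every row available for modeling; whether that is appropriate depends on why the values are missing.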

In [None]:
# View