<a href="https://colab.research.google.com/github/ChristianHFS/Data/blob/main/Credit_Scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CRISP-DM
**CR**oss **I**ndustry **S**tandard **P**rocess for **D**ata **M**ining (CRISP-DM)
![Image in a markdown cell](https://raw.githubusercontent.com/reynoldms/csv_data/main/crispdm.jpg)


# Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

import warnings
warnings.filterwarnings("ignore")

# 1. Business Understanding
What Is Credit Scoring?
Credit scoring is a statistical analysis performed by lenders and financial institutions to determine the **creditworthiness** of a person or a small, owner-operated business. The resulting score is used to decide whether a loan applicant is a **good** or **bad** credit risk.
<br>  
![Image in a markdown cell](https://www.homeispossiblenv.org/sites/default/files/Credit%20score%20main.png)


The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. Hofmann. In this dataset, each entry represents a person who takes credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes. The link to the original dataset can be found below.

Source : https://www.kaggle.com/datasets/uciml/german-credit

# 2. Data Understanding

## 2.1. Extract Data

In [None]:
# Load the German credit data; the first CSV column holds the row index
df = pd.read_csv('https://raw.githubusercontent.com/reynoldms/csv_data/main/german_credit_data.csv', index_col=0)

In [None]:
df.head()

Below are the details of independent variables in the dataset.
1. Age (numeric)
2. Sex (text)
  - male
  - female
3. Job (numeric)
  - 0 - unskilled and non-resident
  - 1 - unskilled and resident
  - 2 - skilled 
  - 3 - highly skilled
4. Housing (text)
  - own
  - rent
  - free
5. Saving accounts (text) 
  - little
  - moderate
  - quite rich
  - rich
6. Checking account (text) 
  - little
  - moderate
  - quite rich
  - rich
7. Credit amount (numeric, in Deutsche Mark)
8. Duration (numeric, in months)
9. Purpose (text)
  - car
  - furniture/equipment
  - radio/TV, etc
  
The dependent variable (target) is **Risk**.

## 2.2. Descriptive Statistics

In [None]:
df.describe()

## 2.3. Univariate Analysis

In [None]:
# Count of good vs. bad credit risks; recent seaborn versions expect keyword arguments
sns.countplot(x='Risk', data=df)

## 2.4. Bivariate Analysis
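
A bivariate view compares one predictor against the target. The sketch below uses a small made-up sample (the rows are illustrative and not taken from the dataset) and `pd.crosstab` with `normalize='index'` to show the share of good and bad risks within each housing category.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset
sample = pd.DataFrame({
    "Housing": ["own", "own", "own", "own", "rent", "rent", "free", "free"],
    "Risk":    ["good", "good", "good", "bad", "good", "bad", "bad", "bad"],
})

# Share of each Risk class within every Housing category (each row sums to 1)
risk_by_housing = pd.crosstab(sample["Housing"], sample["Risk"], normalize="index")
print(risk_by_housing)
```

On the notebook's `df`, `sns.countplot(x='Housing', hue='Risk', data=df)` should draw the same comparison as a grouped bar chart.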

There are several insights we can get from the plot above:
1.  
2. 

Anything else?

Insights:
1. 

# 3. Data Preparation

## 3.1. Handling Missing Values

In [None]:
df.info()

In [None]:
# ## by dropping missing value
# df.dropna(inplace=True)
# df.info()

In [None]:
## by imputing missing value
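
One possible way to impute, sketched on a small made-up frame: treat a missing account entry as its own category and fill missing numeric values with the column median. Which columns actually contain gaps should be read from the `df.info()` output; the column choice here and the `'unknown'` label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names follow the dataset, values are made up
sample = pd.DataFrame({
    "Age": [35, 22, np.nan, 49],
    "Saving accounts": ["little", np.nan, "rich", np.nan],
    "Checking account": [np.nan, "moderate", "little", np.nan],
})

imputed = sample.copy()

# Text columns: keep "missing" as its own category (the label is an assumption)
text_cols = ["Saving accounts", "Checking account"]
imputed[text_cols] = imputed[text_cols].fillna("unknown")

# Numeric column: fill gaps with the column median
imputed["Age"] = imputed["Age"].fillna(imputed["Age"].median())

print(imputed)
```

Keeping a separate category avoids guessing an account level for applicants who may simply have no such account, at the cost of one extra dummy column after encoding.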

## 3.2. Predictors and Target

In [None]:
X = df.drop(['Risk'], axis=1)
y = df['Risk']

## 3.3. Feature Encoding

In [None]:
## One-hot encode the text columns with pandas
X = pd.get_dummies(X)

In [None]:
X.head()

## 3.4. Data Splitting

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42, stratify=y)
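
The `stratify=y` argument keeps the class proportions the same in both partitions. A minimal sketch on made-up labels (70 good and 30 bad, chosen only for illustration) shows the effect:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 70 "good" and 30 "bad"
y_demo = pd.Series(["good"] * 70 + ["bad"] * 30)
X_demo = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both partitions keep the 70/30 ratio of the full data
print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```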

# 4. Baseline Model
## 4.1. Training Model

In [None]:
model_svc = SVC()
model_svc.fit(X_train, y_train)
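
An RBF-kernel SVC works on distances between rows, so a large-valued column such as the credit amount can dominate the 0/1 dummy columns. A possible variant, sketched here on synthetic data rather than the credit dataset, standardizes the features inside a pipeline before fitting:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the credit data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_demo[:, 0] *= 1000  # mimic one large-valued column such as a credit amount

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then applied before the SVC
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_tr, y_tr)
print("Test accuracy:", scaled_svc.score(X_te, y_te))
```

Whether scaling helps on the credit data is something to verify by comparing scores, not something this sketch demonstrates.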

## 4.2. Evaluation

In [None]:
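# Compare accuracy on the training data with accuracy on the held-out test data
print("Train accuracy:", model_svc.score(X_train, y_train))
print("Test accuracy:", model_svc.score(X_test, y_test))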
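
Accuracy alone can hide poor performance on the minority class when the target is imbalanced. As a sketch on synthetic data (not the credit dataset), a confusion matrix and per-class report give a fuller picture:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in (roughly 70/30), not the real credit data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = SVC().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)
print(classification_report(y_te, y_pred))
```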

# 5. Tuning Model
## 5.1. Training Model

In [None]:
model_svc_tuned = SVC(C = 0.001,  gamma = 5, kernel = 'rbf')
model_svc_tuned.fit(X_train, y_train)

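## 5.2. Evaluation

In [None]:
# Compare accuracy on the training data with accuracy on the held-out test data
print("Train accuracy:", model_svc_tuned.score(X_train, y_train))
print("Test accuracy:", model_svc_tuned.score(X_test, y_test))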
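
The values `C = 0.001` and `gamma = 5` above are set by hand. One common alternative is a cross-validated grid search; the sketch below runs on synthetic data, and the grid values are hypothetical, not tuned for the credit dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the credit data (not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Hypothetical grid; the ranges are illustrative only
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1], "kernel": ["rbf"]}

# 5-fold cross-validation on the training data picks the best combination
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_te, y_te))
```

Because the search only sees the training data, the test score remains an honest estimate of how the selected model generalizes.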