# **I. Introduction**

<p>=================================================<br>

**Data Scientist Technical Test in PT. Pemeringkat Efek Indonesia (PT. PEFINDO)**

**Name**: Maulana Yusuf Taufiqurrahman<br>
**From**: Hacktiv8 Full Time Data Science (FTDS) - HCK Batch 026 - Alumni

<p>=================================================<br>

# **II. Import Library Packages**

The first step in any Python project is importing the required library packages. This project uses **Pandas**, **NumPy**, **Scikit-Learn**, **Seaborn**, and others. If more packages are needed later, add them in this **`II. Import Library Packages`** section.

In [1]:
# Data Analysis Packages
import pandas as pd
import numpy as np

# Visualization Tools
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning Packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, roc_curve, roc_auc_score

# SHAP Package
import shap

# **III. Data Loading**

### **A. Load the Dataset**

Loading the dataset first confirms that the file can be read and gives a first look at what the data contains.

In [2]:
# Read the csv file
df_org = pd.read_csv('./data/credit_scoring.csv')

# Create a copy
df_copy = df_org.copy()

# Combine the first 10 and last 10 rows into one dataframe
combined_df_copy = pd.concat([df_copy.iloc[:10], df_copy.iloc[-10:]], ignore_index=True)

# Show the dataset
combined_df_copy

| | application_id | age | monthly_income | loan_amount | previous_defaults | credit_score | default | leak_col_good | leak_col_subtle |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 501000 | 41 | 13995609 | 5982664 | 0 | 624 | 0 | 0 | -0.04 |
| 1 | 501001 | 58 | 13683833 | 3711198 | 0 | 809 | 0 | 0 | 0.001 |
| 2 | 501002 | 33 | 9417391 | 7172332 | 0 | 647 | 0 | 0 | 0.077 |
| 3 | 501003 | 45 | 6861811 | 8661056 | 0 | 450 | 0 | 0 | 0.038 |
| 4 | 501004 | 22 | 5640742 | 4520669 | 1 | 816 | 0 | 0 | 0.02 |
| 5 | 501005 | 22 | 7783669 | 13057356 | 1 | 642 | 0 | 0 | 0.004 |
| 6 | 501006 | 22 | 15252800 | 4009613 | 1 | 515 | 0 | 0 | -0.033 |
| 7 | 501007 | 35 | 17437764 | 3871786 | 0 | 807 | 0 | 0 | 0.061 |
| 8 | 501008 | 35 | 12499029 | 13703265 | 0 | 636 | 0 | 0 | 0.083 |
| 9 | 501009 | 38 | 11650601 | 12664024 | 0 | 422 | 0 | 0 | 0.033 |


The dataset has been loaded successfully. The preview is limited to **20 rows** (the first 10 and the last 10) to confirm that the data is displayed correctly. The data is described in more detail in the following sections.

### **B. Summary of the Dataset**

The dataset summary lists the column names, data types, and non-null counts, which helps decide, for example, which unnecessary columns to drop. It is a standard first step in analyzing data.

In [3]:
# Show the information of the dataset
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   application_id     6000 non-null   int64  
 1   age                6000 non-null   int64  
 2   monthly_income     6000 non-null   int64  
 3   loan_amount        6000 non-null   int64  
 4   previous_defaults  6000 non-null   int64  
 5   credit_score       6000 non-null   int64  
 6   default            6000 non-null   int64  
 7   leak_col_good      6000 non-null   int64  
 8   leak_col_subtle    6000 non-null   float64
dtypes: float64(1), int64(8)
memory usage: 422.0 KB


**`Summary of the dataset`**:
- **9** columns.
- **6000** rows with **NO** missing values.
- **2** data types: **float** (1 column) and **int** (8 columns).

From this information, we can identify which column is the **Target** (**`default`**) and which columns will go into the train-test **Split**.
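As a minimal sketch of that separation, the target can be split from the remaining columns as shown below. The toy frame reuses a few of the dataset's column names with illustrative values, and the names `toy_df`, `X_toy`, and `y_toy` are hypothetical; this is not the notebook's actual split.

```python
import pandas as pd

# Toy frame reusing some of the dataset's column names (values are illustrative)
toy_df = pd.DataFrame({
    "age": [41, 58, 33, 45],
    "monthly_income": [13995609, 13683833, 9417391, 6861811],
    "credit_score": [624, 809, 647, 450],
    "default": [0, 0, 1, 0],
})

# `default` is the target; every other column is a candidate feature
target_col = "default"
X_toy = toy_df.drop(columns=[target_col])
y_toy = toy_df[target_col]

print(X_toy.columns.tolist())
print(y_toy.value_counts().to_dict())
```

Keeping the target name in one variable makes it harder to leave the target among the features by accident.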

### **C. Check the Missing Values**

In [6]:
# Check the missing values
df_copy.isnull().sum()

application_id       0
age                  0
monthly_income       0
loan_amount          0
previous_defaults    0
credit_score         0
default              0
leak_col_good        0
leak_col_subtle      0
dtype: int64

### **D. Check the Duplicates**

In [7]:
# Check the duplicates
df_copy.duplicated().sum()

0

### **E. Check the Cardinality**

In [8]:
# Check the cardinality
df_copy.nunique()

application_id       6000
age                    39
monthly_income       6000
loan_amount          6000
previous_defaults       4
credit_score          550
default                 2
leak_col_good           2
leak_col_subtle       239
dtype: int64

### **F. Check the Skewness**

In [9]:
# Take numerical columns only
numerical_cols = df_copy.select_dtypes(include = [np.number]).columns

# Count the skewness
skewness = df_copy[numerical_cols].skew()

# Define a function that labels a skewness value
def interpret_skew(val):
    if val > 0.5:
        return 'Right Skewed'
    elif val < -0.5:
        return 'Left Skewed'
    else:
        return 'Approximately Normal'

# Create a dataframe
skewness_df = pd.DataFrame({
    'Skewness': skewness,
    'Interpretation': skewness.apply(interpret_skew)
})

# Show the output
print(skewness_df)

                   Skewness        Interpretation
application_id     0.000000  Approximately Normal
age               -0.050049  Approximately Normal
monthly_income     0.018512  Approximately Normal
loan_amount        0.000769  Approximately Normal
previous_defaults  1.874568          Right Skewed
credit_score      -0.017558  Approximately Normal
default            8.901894          Right Skewed
leak_col_good      8.901894          Right Skewed
leak_col_subtle    7.981180          Right Skewed


Most of the numerical columns are **approximately symmetric** (skewness between -0.5 and 0.5, labelled "Approximately Normal" in the output). The remaining **4** columns are right-skewed.

However, **`leak_col_good`** and **`leak_col_subtle`** will be dropped and **`default`** is the **target**, so only **one** skewed feature remains: **`previous_defaults`**, which has just 4 distinct values and will be treated as a categorical feature.
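The identical skewness of **`default`** and **`leak_col_good`** (8.901894) hints that the two columns may carry the same information. The sketch below shows one way such leakage could be checked before dropping. It runs on synthetic toy data, so the values, the 0.9 threshold, and the names `toy` and `suspects` are assumptions for illustration, not results from the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic target: 20 defaults out of 200 rows (illustrative only)
target = np.array([1] * 20 + [0] * 180)

toy = pd.DataFrame({
    "credit_score": rng.integers(300, 850, size=200),
    "default": target,
    "leak_col_good": target,                                    # exact copy of the target
    "leak_col_subtle": target + rng.normal(0, 0.05, size=200),  # target plus small noise
})

# Absolute Pearson correlation of every column with the target
corr_with_target = (
    toy.corr()["default"].drop("default").abs().sort_values(ascending=False)
)

# Flag columns that track the target almost perfectly (threshold is an assumption)
suspects = corr_with_target[corr_with_target > 0.9].index.tolist()

print(corr_with_target)
print("Suspected leakage:", suspects)
```

A feature that correlates almost perfectly with the target is more likely derived from it than truly predictive, which is a common reason to remove such columns before modelling.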

### **G. Drop the Unrelated Columns**
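A minimal sketch of this step, assuming the columns to remove are the two leakage columns named in the skewness section. Dropping **`application_id`** as well is an assumption made here because it is a unique identifier (6000 distinct values). The toy rows reuse values from the preview, and `toy_df` and `toy_clean` are hypothetical names.

```python
import pandas as pd

# Toy rows reusing the dataset's column names and the first preview rows
toy_df = pd.DataFrame({
    "application_id": [501000, 501001, 501002],
    "age": [41, 58, 33],
    "monthly_income": [13995609, 13683833, 9417391],
    "loan_amount": [5982664, 3711198, 7172332],
    "previous_defaults": [0, 0, 0],
    "credit_score": [624, 809, 647],
    "default": [0, 0, 0],
    "leak_col_good": [0, 0, 0],
    "leak_col_subtle": [-0.04, 0.001, 0.077],
})

# Columns flagged as leakage in the skewness section
leak_cols = ["leak_col_good", "leak_col_subtle"]

# Assumption: the identifier carries no predictive signal, so it is dropped too
id_cols = ["application_id"]

# drop() returns a new dataframe; toy_df itself is left unchanged
toy_clean = toy_df.drop(columns=leak_cols + id_cols)

print(toy_clean.columns.tolist())
```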

# **IV. Exploratory Data Analysis (EDA)**