# Credit Card Default Prediction Analysis


This analysis aims to predict credit card default using a dataset of credit card holders. The dataset contains various features such as demographic details, payment history, and bill statements. The primary goal is to build a robust model that can predict whether a customer will default on their credit card payment.

## Overview of the Dataset

The dataset contains information about credit card holders and includes the following columns:

- LIMIT_BAL: Amount of given credit
- SEX: Gender (1 = male, 2 = female)
- EDUCATION: Education level
- MARRIAGE: Marital status
- AGE: Age in years
- PAY_0, PAY_2 to PAY_6: Repayment status in each of six months (the dataset has no PAY_1 column)
- BILL_AMT1 to BILL_AMT6: Bill statement amounts in months 1 to 6
- PAY_AMT1 to PAY_AMT6: Previous payments in months 1 to 6
- default payment next month: Default payment (1 = default, 0 = no default)

### Load Dataset and delete ID column

In [1]:
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_excel('credit_data.xlsx')

# Read the first 10,000 rows and delete the 'ID' column
df = df.head(10000).drop(columns=['ID'])

### After loading the dataset into the notebook, we can further check basic information of the dataset such as data type.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   LIMIT_BAL                   10000 non-null  int64  
 1   SEX                         9617 non-null   float64
 2   EDUCATION                   9617 non-null   float64
 3   MARRIAGE                    9617 non-null   float64
 4   AGE                         9617 non-null   float64
 5   PAY_0                       9617 non-null   float64
 6   PAY_2                       9617 non-null   float64
 7   PAY_3                       9617 non-null   float64
 8   PAY_4                       9642 non-null   float64
 9   PAY_5                       9642 non-null   float64
 10  PAY_6                       9642 non-null   float64
 11  BILL_AMT1                   9642 non-null   float64
 12  BILL_AMT2                   9642 non-null   float64
 13  BILL_AMT3                   9642

#### Definitions of the three kinds of variables
- Numeric Variables: Numeric variables can take on a range of numerical values, either discrete or continuous. They are often used in mathematical models and statistical analyses.
- Ordinal Variables: Ordinal variables represent categories with a specific order or ranking. The order of these values is important, but the difference between each one is not necessarily known.
- Nominal Variables: Nominal variables represent categorical data without a specific order or ranking. They do not have a numerical value and cannot be ordered or ranked.

Based on the provided output of the data info, here's the breakdown of features into numeric, ordinal, and nominal variables:

| Variable Kind | Number of Features | Feature Names |
| --- | --- | --- |
| Numeric | 14 | LIMIT_BAL, AGE, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6 |
| Ordinal | 7 | PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6, EDUCATION |
| Nominal | 2 | SEX, MARRIAGE |

##### Explanation:

- Numeric Variables (14 features): These features take numerical values on a meaningful scale. They include LIMIT_BAL, AGE, BILL_AMT1 to BILL_AMT6, and PAY_AMT1 to PAY_AMT6. For example, 'LIMIT_BAL' is the cardholder's credit limit, expressed as an amount.

- Ordinal Variables (7 features): These features represent categorical data with a specific order. They include PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6 and EDUCATION. The PAY_* columns record the repayment status of the account holder over several months, with higher values indicating more severe payment delays. For EDUCATION, the codes stand for education levels, but the data alone does not say which code is which level: 1.0 might mean high school, 2.0 university, and 3.0 graduate school, or the order might be reversed. If this file follows the UCI "default of credit card clients" codebook, the coding is 1 = graduate school, 2 = university, 3 = high school, 4 = others, so the mapping should be confirmed against the data dictionary.

- Nominal Variables (2 features): These features represent categorical data without any specific order. They include SEX (gender) and MARRIAGE (marital status). Although they are stored as numeric codes, the codes are only labels and cannot be meaningfully ordered or ranked.
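The grouping above can be written down as a small lookup structure. This is a minimal sketch: the dictionary name `feature_kinds` is illustrative and not part of the original notebook, and the ordered categorical at the end uses made-up repayment codes purely to show how pandas can respect a ranking.

```python
import pandas as pd

# Feature groups copied from the table above; `feature_kinds` is an illustrative name
feature_kinds = {
    "numeric": ["LIMIT_BAL", "AGE"]
               + [f"BILL_AMT{i}" for i in range(1, 7)]
               + [f"PAY_AMT{i}" for i in range(1, 7)],
    "ordinal": ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6", "EDUCATION"],
    "nominal": ["SEX", "MARRIAGE"],
}

# Count the features of each kind
kind_counts = {kind: len(cols) for kind, cols in feature_kinds.items()}
print(kind_counts)  # {'numeric': 14, 'ordinal': 7, 'nominal': 2}

# Toy ordinal data (made-up codes): an ordered categorical keeps the ranking explicit
pay_status = pd.Series(pd.Categorical([0, 2, 1], categories=[0, 1, 2], ordered=True))
print(pay_status.max())  # 2
```

Keeping the groups in one place makes it easier to apply a different treatment (scaling, encoding, imputation) to each kind later on.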

### Clean Dataset and Impute Missing Values

In [3]:
# Count missing values for each variable
missing_values_count = df.isnull().sum()

# Print out the number of missing values for each variable
print("Number of missing values for each variable:")
print(missing_values_count)

Number of missing values for each variable:
LIMIT_BAL                       0
SEX                           383
EDUCATION                     383
MARRIAGE                      383
AGE                           383
PAY_0                         383
PAY_2                         383
PAY_3                         383
PAY_4                         358
PAY_5                         358
PAY_6                         358
BILL_AMT1                     358
BILL_AMT2                     358
BILL_AMT3                     358
BILL_AMT4                     360
BILL_AMT5                     360
BILL_AMT6                     360
PAY_AMT1                      360
PAY_AMT2                      360
PAY_AMT3                        2
PAY_AMT4                        2
PAY_AMT5                      300
PAY_AMT6                      300
default payment next month      0
dtype: int64



#### Complete Data:
- The variables LIMIT_BAL and default payment next month have no missing values, indicating that the dataset is complete for these variables.

#### Missing Values:
- PAY_AMT3 and PAY_AMT4 have only 2 missing values.
- PAY_AMT5 and PAY_AMT6 have 300 missing values.
- PAY_4, PAY_5, PAY_6, BILL_AMT1, BILL_AMT2, and BILL_AMT3 have 358 missing values.
- BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, and PAY_AMT2 have 360 missing values.
- SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, and PAY_3 have 383 missing values.


Handling missing data is essential to the validity of any analysis. Two common strategies are imputation (replacing missing values with statistical estimates) and deletion (excluding observations with missing values from the analysis). The appropriate method depends on the nature of the missing data and the specific analysis being conducted.
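The difference between the two strategies can be seen on a tiny example. This sketch uses a made-up four-row DataFrame (`toy`), not the credit data, and mean imputation stands in for "statistical estimates" in general.

```python
import numpy as np
import pandas as pd

# Toy data (not from the credit dataset): one missing value per column
toy = pd.DataFrame({
    "AGE": [25.0, 40.0, np.nan, 35.0],
    "BILL_AMT1": [1000.0, np.nan, 3000.0, 2000.0],
})

# Deletion: drop every row that has at least one missing value
deleted = toy.dropna()

# Imputation: replace each missing value with the mean of its column
imputed = toy.fillna(toy.mean())

print(len(toy), len(deleted), len(imputed))  # 4 2 4
print(imputed.isnull().sum().sum())          # 0
```

Deletion halves this toy dataset, while imputation keeps every row at the cost of inserting estimated values.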

In [4]:
# Define and Impute missing values for numeric variables with mean
numeric_columns = ['LIMIT_BAL', 'AGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 
                   'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 
                   'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']

df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

# Define and Impute missing values for categorical variables with mode
categorical_columns = ['SEX', 'EDUCATION', 'MARRIAGE']
df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])

# Check for missing values after new imputation
missing_values_after_imputation = df.isnull().sum()

### Data Imputation Explanation:

Data imputation is the process of filling in missing values in a dataset with estimated or substituted values. It is a crucial step in data preprocessing, as missing data can adversely affect the performance of machine learning algorithms.

#### Imputation Decisions:
- Numeric Variables: For the columns treated as numeric (LIMIT_BAL, AGE, PAY_0 and PAY_2 to PAY_6, BILL_AMT1 to BILL_AMT6, PAY_AMT1 to PAY_AMT6), missing values are imputed with the mean of the respective column. Mean imputation is simple and leaves each column's mean unchanged, but it reduces the column's variance, and it is only unbiased if the values are missing completely at random. Note that the PAY_* columns were classified as ordinal above; treating them as numeric here means an imputed value can fall between two repayment-status codes.
- Categorical Variables: For categorical variables (SEX, EDUCATION, MARRIAGE), missing values are imputed with the mode (most frequent value) of the respective column. The mode always yields a valid category, which makes it a natural choice for categorical data, although it increases the share of the most common category.

#### Imputation Methods Used:
- Mean Imputation: For numerical variables, each missing value is replaced with the mean of its column. Mean imputation is quick and simple. Its disadvantage is that, even when the data are missing completely at random, it changes the distribution of the original data: the mean is preserved, but the variance shrinks because every imputed value sits exactly at the mean.
- Mode Imputation: For categorical variables, each missing value is replaced with the most common category. Like mean imputation, it can distort the original distribution, because it inflates the share of the most frequent category, and it can introduce bias if the data are not missing completely at random.
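The effect of mean imputation on a column's distribution can be checked directly. This sketch uses synthetic, normally distributed values (not the credit data) and removes 20% of them completely at random; all names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic column: 1,000 values drawn from a normal distribution
values = pd.Series(rng.normal(loc=50.0, scale=10.0, size=1000))

# Remove 200 values completely at random
with_gaps = values.copy()
missing_positions = rng.choice(len(values), size=200, replace=False)
with_gaps.iloc[missing_positions] = np.nan

# Mean imputation: fill the gaps with the mean of the observed values
filled = with_gaps.fillna(with_gaps.mean())

print("mean before/after:", with_gaps.mean(), filled.mean())
print("std  before/after:", with_gaps.std(), filled.std())
```

The mean is unchanged, while the standard deviation drops, because the 200 filled values add no spread around the mean.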

#### Key Decisions Made:
- Selecting the Imputation Method: The imputation method was chosen by data type: the mean for columns treated as numeric and the mode for categorical columns.
- Assumptions Regarding Missing Data: The main assumption behind these imputations is that the data are missing completely at random (MCAR). If this assumption is false, the imputation can introduce bias.
- Not Eliminating Any Observations: To preserve the size of the dataset, observations with missing values were imputed rather than deleted, since deletion would also discard the observed values in those rows.

With every missing value imputed, the dataset is complete and can be passed to the modelling steps that follow, subject to the caveats above.

### Creating Dummy Variables

In [5]:
# Print value_counts() of the 'SEX' column
print("Value counts of 'SEX' column:")
print(df['SEX'].value_counts())

# Add dummy variable 'SEX_FEMALE' to df using get_dummies()
df = pd.get_dummies(df, columns=['SEX'], prefix='SEX', drop_first=True)

# Rename the dummy column created for SEX == 2.0
df.rename(columns={'SEX_2.0': 'SEX_FEMALE'}, inplace=True)

# Find the position of the 'LIMIT_BAL' column
limit_bal_index = df.columns.get_loc('LIMIT_BAL')

# Move the 'SEX_FEMALE' column so it sits right after 'LIMIT_BAL'
df.insert(limit_bal_index + 1, 'SEX_FEMALE', df.pop('SEX_FEMALE'))

Value counts of 'SEX' column:
2.0    6206
1.0    3794
Name: SEX, dtype: int64


### Value Counts of 'SEX' Column:

The value_counts() function displays the frequency of each unique value in the 'SEX' column. After imputation, the value 2.0 (female, per the dataset description) occurs 6206 times and the value 1.0 (male) occurs 3794 times.

#### Dummy Variable 'SEX_FEMALE':
A dummy variable named 'SEX_FEMALE' is created using the get_dummies() function. This new variable indicates whether the individual is female or not.
- If 'SEX_FEMALE' is 1 or True, the individual is female.
- If 'SEX_FEMALE' is 0 or False, the individual is male.

#### Deletion of 'SEX' Column:
When get_dummies() from the Pandas library creates the dummy variable, it also removes the original 'SEX' column from the dataset, so the same information is not stored twice.


##### Additional Steps:
- The drop_first=True parameter in get_dummies() drops the first level of each categorical variable to prevent multicollinearity in modeling.
- The dummy variable 'SEX_FEMALE' is inserted immediately after the 'LIMIT_BAL' column for organizational purposes, using the insert() function.

Overall, these steps transform the categorical 'SEX' variable into a binary dummy variable 'SEX_FEMALE', facilitating its use in subsequent analyses or modeling tasks.
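The encoding step can be reproduced on a three-row example. This is a minimal sketch with made-up rows; `dtype=int` is an addition here (not used in the notebook) so that the dummy column holds 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy rows (not from the credit dataset)
toy = pd.DataFrame({"LIMIT_BAL": [20000, 50000, 90000], "SEX": [2.0, 1.0, 2.0]})

# drop_first=True removes the dummy for the first level (1.0), leaving a single column
encoded = pd.get_dummies(toy, columns=["SEX"], prefix="SEX", drop_first=True, dtype=int)
encoded = encoded.rename(columns={"SEX_2.0": "SEX_FEMALE"})

print(encoded.columns.tolist())         # ['LIMIT_BAL', 'SEX_FEMALE']
print(encoded["SEX_FEMALE"].tolist())   # [1, 0, 1]
```

One column is enough for a two-level variable: a second column for males would always equal 1 minus 'SEX_FEMALE'.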

In [6]:
# Print value_counts() of the 'MARRIAGE' column
print("Value counts of 'MARRIAGE' column:")
print(df['MARRIAGE'].value_counts())

Value counts of 'MARRIAGE' column:
2.0    5511
1.0    4361
3.0     110
0.0      18
Name: MARRIAGE, dtype: int64


### Observations on 'MARRIAGE' Variable:
The 'MARRIAGE' variable indicates marital status. Key observations from the value_counts() output are as follows:

#### a) Categories 1 (Married) and 2 (Single):
- Category 1 (Married) has 4361 occurrences and Category 2 (Single) has 5511 occurrences.
- These categories are the most common marital statuses in the dataset.

#### b) Category 3 (Others):
- Category 3 has 110 occurrences.
- This category likely represents other marital statuses.

#### c) Category 0 (Missing/Undefined):
- Category 0 has 18 occurrences.
- This category is rare and likely indicates missing or undefined values.

Checking the distribution and meaning of a categorical variable such as 'MARRIAGE' before encoding it matters for accurate modelling: rare or undocumented categories (here 3 and 0) have to be handled deliberately rather than passed to the model unexamined.

### Mapping and Encoding Features

In [7]:
# Map the numeric codes in 'MARRIAGE' to category labels
# (1 = married, 2 = single, as described in the observations above)
df['MARRIAGE'] = df['MARRIAGE'].replace({
    1.0: 'MARRIAGE_MARRIED',
    2.0: 'MARRIAGE_SINGLE',
    3.0: 'MARRIAGE_OTHER',
    0.0: 'MARRIAGE_OTHER'   # assume 0 represents 'MARRIAGE_OTHER'
})

# One-hot encode 'MARRIAGE'; get_dummies() prefixes each label with 'MARRIAGE_'
df = pd.get_dummies(df, columns=['MARRIAGE'])

# Find the position of the 'EDUCATION' column
education_index = df.columns.get_loc('EDUCATION')

# Insert the 'MARRIAGE' columns after the 'EDUCATION' column under shorter names
df.insert(education_index + 1, 'MARRIAGE_MARRIED', df['MARRIAGE_MARRIAGE_MARRIED'])
df.insert(education_index + 2, 'MARRIAGE_OTHER', df['MARRIAGE_MARRIAGE_OTHER'])
df.insert(education_index + 3, 'MARRIAGE_SINGLE', df['MARRIAGE_MARRIAGE_SINGLE'])

# Drop the double-prefixed columns created by get_dummies()
df.drop(columns=['MARRIAGE_MARRIAGE_MARRIED', 'MARRIAGE_MARRIAGE_OTHER', 'MARRIAGE_MARRIAGE_SINGLE'], inplace=True)

### Allocation of 'MARRIAGE' Values Across Dummy Variables:
Several decisions were made when distributing the values of the 'MARRIAGE' variable across the three newly constructed features ('MARRIAGE_MARRIED', 'MARRIAGE_SINGLE', and 'MARRIAGE_OTHER').

#### 1.	Mapping:

The original numerical values were mapped to corresponding categories as follows:
-  1.0 was mapped to 'MARRIAGE_MARRIED', representing married individuals.
-  2.0 was mapped to 'MARRIAGE_SINGLE', representing single individuals.
-  Both 3.0 and 0.0 were mapped to 'MARRIAGE_OTHER', representing other marital statuses or potentially missing or undefined values.

#### 2.	Encoding Method:
- For each category, a binary feature was created using one-hot encoding. Each observation gets a value of 1 in the feature for its own category and 0 in the other two, which lets models use marital status without assuming any order among the categories.

#### 3.	Handling Missing/Undefined Values:
- The value 0 is assumed to belong to 'MARRIAGE_OTHER', but further analysis is needed to verify whether it actually represents missing or undefined values. Confirming this would ensure the data is interpreted correctly.

#### 4.	Interpretability:
- Descriptive labels such as 'MARRIAGE_MARRIED', 'MARRIAGE_SINGLE', and 'MARRIAGE_OTHER' improve readability and make model predictions and data-driven insights easier to evaluate.

With these choices, each individual's marital status is represented by exactly one of the three dummy variables, which provides a consistent basis for the analysis and modelling that follow.
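The mapping and one-hot encoding can be illustrated on a few made-up rows. This sketch assumes the coding from the value-count observations above (1 = married, 2 = single) and groups 3 and 0 as "other"; the short labels and `dtype=int` are choices made for this example only.

```python
import pandas as pd

# Toy rows (not from the credit dataset)
toy = pd.DataFrame({"MARRIAGE": [1.0, 2.0, 3.0, 0.0, 2.0]})

# Assumed coding: 1 = married, 2 = single; 3 and 0 are grouped as "other"
labels = {1.0: "MARRIED", 2.0: "SINGLE", 3.0: "OTHER", 0.0: "OTHER"}
toy["MARRIAGE"] = toy["MARRIAGE"].map(labels)

# One-hot encode: get_dummies() adds the 'MARRIAGE_' prefix to each label
dummies = pd.get_dummies(toy, columns=["MARRIAGE"], dtype=int)

print(dummies.columns.tolist())
# ['MARRIAGE_MARRIED', 'MARRIAGE_OTHER', 'MARRIAGE_SINGLE']
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```

Every row sums to 1 across the three columns, which is the defining property of one-hot encoding. Mapping to short labels first also avoids the doubled 'MARRIAGE_MARRIAGE_' prefix.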

### In the column 'EDUCATION', convert the values {0, 5, 6} to the value 4

In [8]:
# Replace values {0, 5, 6} in the 'EDUCATION' column with 4
df['EDUCATION'] = df['EDUCATION'].replace({0: 4, 5: 4, 6: 4})

### Model Training and Evaluation

**Splitting the Data**

We use the first 7,500 observations for modelling and split them into training and testing sets, with 75% used for training and 25% for testing:

In [9]:
# Rename the column 'default payment next month' to 'payment_default'
df = df.rename(columns={'default payment next month': 'payment_default'})

# Extract the target variable 'payment_default' for the first 7,500 observations
y = df['payment_default'].values[:7500]

# Extract the features (all remaining variables except 'payment_default') for the first 7,500 observations
X = df.drop(columns=['payment_default']).values[:7500]

### Splitting and Standardizing the Data

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
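What `stratify=y` and fitting the scaler on the training set only accomplish can be checked on synthetic data. This is a sketch with made-up arrays (`X_toy`, `y_toy`), not the credit data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic features and an imbalanced target with 25% positives
X_toy = rng.normal(loc=100.0, scale=20.0, size=(400, 3))
y_toy = np.array([1] * 100 + [0] * 300)

# stratify keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=3, stratify=y_toy
)

# Fit the scaler on the training set only, then reuse its statistics on the test set
scaler_toy = StandardScaler()
X_tr_scaled = scaler_toy.fit_transform(X_tr)
X_te_scaled = scaler_toy.transform(X_te)

print("positive share (train, test):", y_tr.mean(), y_te.mean())
print("train mean per feature:", X_tr_scaled.mean(axis=0).round(6))
print("train std per feature: ", X_tr_scaled.std(axis=0).round(6))
```

The scaled training features have mean 0 and standard deviation 1 by construction. The test features are only approximately standardized, because they reuse the training statistics.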

### Support Vector Classifier (SVC)

In [11]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train a Support Vector Classifier with RBF kernel
svc_classifier = SVC(kernel='rbf', random_state=24)
svc_classifier.fit(X_train_scaled, y_train)

# Predictions on training and test datasets
y_train_pred = svc_classifier.predict(X_train_scaled)
y_test_pred = svc_classifier.predict(X_test_scaled)

# Compute and print training and test dataset accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Training dataset accuracy:", train_accuracy)
print("Test dataset accuracy:", test_accuracy)

Training dataset accuracy: 0.8234666666666667
Test dataset accuracy: 0.8138666666666666


### Principal Component Analysis (PCA)

In [12]:
from sklearn.decomposition import PCA

# Initialize PCA with 2 components
pca = PCA(n_components=2)

# Fit PCA on standardized features and transform them to obtain the principal components
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train a Support Vector Classifier with RBF kernel on the principal components
svc_classifier_pca = SVC(kernel='rbf', random_state=24)
svc_classifier_pca.fit(X_train_pca, y_train)

# Predictions on training and test datasets
y_train_pred_pca = svc_classifier_pca.predict(X_train_pca)
y_test_pred_pca = svc_classifier_pca.predict(X_test_pca)

# Compute and print training and test dataset accuracies
train_accuracy_pca = accuracy_score(y_train, y_train_pred_pca)
test_accuracy_pca = accuracy_score(y_test, y_test_pred_pca)

print("Training dataset accuracy (with PCA):", train_accuracy_pca)
print("Test dataset accuracy (with PCA):", test_accuracy_pca)

Training dataset accuracy (with PCA): 0.8017777777777778
Test dataset accuracy (with PCA): 0.8037333333333333


Comparing the two classifiers trained on the credit card default data, the following observations can be made:

#### 1.	Support Vector Classifier on Standardized Data:
- Achieved a training accuracy of 0.8235 and a test accuracy of 0.8139. The gap of about one percentage point between the two suggests that the model generalizes well and is not noticeably overfitting.

#### 2.	Support Vector Classifier on Principal Components:
- Trained on the first two principal components of the standardized features, it achieved a lower training accuracy of 0.8018 and a test accuracy of 0.8037. Reducing the features to two dimensions discards some of the information in the original features, and it also sacrifices interpretability, because each component is a linear combination of all of them.

#### 3.	Comparison:
- The SVC trained on all standardized features outperforms the SVC trained on two principal components on both the training and test datasets.

- The difference is small (about one percentage point on the test set), which suggests that the two components capture most, but not all, of the information relevant to predicting default.

- The higher accuracy of the full-feature SVC indicates that the complete feature set carries somewhat more predictive information than the two-component projection.

#### 4.	Conclusion:
These results suggest that, for this dataset, the Support Vector Classifier trained on all of the standardized features is the more accurate choice for predicting credit card defaults, although its margin over the PCA-based classifier is small. Both classifiers perform reasonably well, so the choice between them depends on factors such as the importance of interpretability, computational efficiency, and the specific requirements of the application.
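How much information two components keep can be quantified with PCA's explained variance ratio. This sketch uses synthetic data (ten columns driven by three hidden factors plus noise) as a stand-in for the scaled features, so the printed number says nothing about the credit data itself.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 10 correlated columns built from 3 hidden factors plus noise
factors = rng.normal(size=(500, 3))
X_syn = factors @ rng.normal(size=(3, 10)) + 0.5 * rng.normal(size=(500, 10))
X_syn_scaled = StandardScaler().fit_transform(X_syn)

# Fit a two-component PCA and measure the share of total variance it retains
pca_syn = PCA(n_components=2).fit(X_syn_scaled)
retained = pca_syn.explained_variance_ratio_.sum()

print("Variance ratio per component:", pca_syn.explained_variance_ratio_)
print(f"Variance retained by 2 components: {retained:.3f}")
```

Running the same check on the fitted `pca` object from the notebook (`pca.explained_variance_ratio_`) would show how much variance the two components keep there. Retained variance is not the same as retained predictive power, but a low value is a warning sign.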
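The comparison can also be organised with scikit-learn pipelines, which bundle scaling, the optional PCA step, and the classifier into one object per model. This is a sketch on synthetic data from `make_classification`, not the credit data, so the accuracies it prints are not comparable with the results above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem (stand-in for the credit data)
X_syn, y_syn = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=3
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=3, stratify=y_syn
)

# Two pipelines that differ only in the PCA step
models = {
    "all features": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "2 components": make_pipeline(StandardScaler(), PCA(n_components=2), SVC(kernel="rbf")),
}

# Fit each model and record (training accuracy, test accuracy)
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(name, scores[name])
```

Because each pipeline fits its scaler and PCA on the training data only, the two models are evaluated under identical preprocessing rules, and changing `n_components` becomes a one-line experiment.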