## Predicting Default on Payments of Credit Card Clients

# 1. Data Loading

***

### Main Insights:

<div class="alert alert-block alert-info">
   The dataset contains neither missing values nor duplicated rows.
</div>


<div class="alert alert-block alert-info">
    In the end, the dataset was split into train (80%) and test (20%) sets, which correspond to 24000 and 6000 samples, respectively.
</div>


***

Within the scope of the ML Analytics Recruitment Challenge, the goal of this notebook is to load the required data and split it into train and test sets.

The dataset, provided by Luís Simões, is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) and on [Kaggle](https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset?datasetId=306&sortBy=voteCount). As stated on the dataset page, it contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

In [1]:
# !pip install kaggle

In [2]:
import os
import pandas as pd
import zipfile

from sklearn.model_selection import train_test_split

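# Kaggle credentials: fill these in before running the cells below.
# They are set before importing the Kaggle client, which reads them from the environment.
os.environ["KAGGLE_USERNAME"] = ""  # "<write-here-your-username>"
os.environ["KAGGLE_KEY"] = ""  # "<write-here-your-kaggle-api-key>"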

from kaggle.api.kaggle_api_extended import KaggleApi

seed = 17

In [3]:
# Instantiate Kaggle API client:
api = KaggleApi()
api.authenticate()

In [4]:
# Download dataset:
api.dataset_download_files("uciml/default-of-credit-card-clients-dataset", path="../data/")

In [5]:
# Extract contents from zip previously downloaded:
with zipfile.ZipFile("../data/default-of-credit-card-clients-dataset.zip", "r") as z:
    z.extractall("../data/")

df = pd.read_csv("../data/UCI_Credit_Card.csv")
df.sample(5)

Unnamed: 0,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,...,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
13178,13179,210000.0,1,1,2,32,-1,2,2,-1,...,6299.0,0.0,197.0,0.0,0.0,6299.0,0.0,197.0,0.0,0
5165,5166,50000.0,1,3,1,58,-2,-2,-2,-2,...,4186.0,3545.0,2659.0,5318.0,4151.0,4186.0,3545.0,2860.0,2702.0,0
6808,6809,160000.0,2,2,2,30,0,0,0,0,...,107418.0,90471.0,86363.0,4507.0,4000.0,3500.0,3300.0,3500.0,3000.0,0
14218,14219,140000.0,2,2,1,30,-1,-1,-1,-1,...,3365.0,1567.0,1707.0,2290.0,2292.0,3365.0,1567.0,1707.0,1707.0,0
2039,2040,200000.0,2,3,1,53,0,0,0,0,...,10075.0,5107.0,7789.0,52306.0,1000.0,212.0,102.0,7786.0,4000.0,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 25 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   ID                          30000 non-null  int64  
 1   LIMIT_BAL                   30000 non-null  float64
 2   SEX                         30000 non-null  int64  
 3   EDUCATION                   30000 non-null  int64  
 4   MARRIAGE                    30000 non-null  int64  
 5   AGE                         30000 non-null  int64  
 6   PAY_0                       30000 non-null  int64  
 7   PAY_2                       30000 non-null  int64  
 8   PAY_3                       30000 non-null  int64  
 9   PAY_4                       30000 non-null  int64  
 10  PAY_5                       30000 non-null  int64  
 11  PAY_6                       30000 non-null  int64  
 12  BILL_AMT1                   30000 non-null  float64
 13  BILL_AMT2                   300

In [7]:
# duplicated() flags every row that repeats an earlier one across all columns.
print(f"The dataset contains {'some' if df.duplicated().any() else 'no'} duplicates.")

The dataset contains no duplicates.
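
The "no missing values" insight is read off the non-null counts printed by `df.info()`. A more explicit check could look like the minimal sketch below. The helper `summarize_quality` and the toy frame `toy` are hypothetical names introduced for illustration, and the toy values are made up rather than taken from the dataset.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows in a DataFrame."""
    return {
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicated_rows": int(frame.duplicated().sum()),
    }


# Toy stand-in for `df` (hypothetical values, not taken from the dataset):
# one missing LIMIT_BAL and one row that repeats the previous one.
toy = pd.DataFrame(
    {
        "ID": [1, 2, 3, 3],
        "LIMIT_BAL": [20000.0, None, 50000.0, 50000.0],
    }
)

report = summarize_quality(toy)
print(report)
```

Calling `summarize_quality(df)` on the loaded dataset should report zero for both counts if the insights above hold.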


In [8]:
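# Split the data and export new train and test sets.
# Features are all columns except the last one; the target is the last column
# (default.payment.next.month).
X = df.iloc[:, :-1]
y = df.iloc[:, -1]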

X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    shuffle=True,
    random_state=seed
)

df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

df_train.to_csv("../data/train_data.csv")
df_test.to_csv("../data/test_data.csv")
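
As a sanity check on the split sizes, the sketch below repeats the same `train_test_split` call on a synthetic frame with 30000 rows and then reloads an exported set. The frame `toy` and its `target` column are assumptions made for illustration. An in-memory buffer stands in for the CSV files so that the example does not touch `../data/`.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

seed = 17  # same seed as in the notebook

# Synthetic stand-in with as many rows as the dataset (values are made up).
rng = np.random.default_rng(seed)
toy = pd.DataFrame(
    {
        "ID": np.arange(1, 30001),
        "target": rng.integers(0, 2, size=30000),
    }
)

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, shuffle=True, random_state=seed
)
print(len(toy_train), len(toy_test))  # 24000 6000

# The two sets should not share any row.
shared_rows = toy_train.index.intersection(toy_test.index)
print(len(shared_rows))  # 0

# to_csv() writes the row index as an unnamed first column by default,
# so index_col=0 restores it when the file is read back.
buffer = io.StringIO()
toy_test.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)
print(list(reloaded.columns))
```

Reading `train_data.csv` or `test_data.csv` without `index_col=0` would instead surface the original row index as an extra `Unnamed: 0` column.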