# TABLE PREVIOUS APPLICATIONS

This table is composed of 3 tables:

- <code>train_applprev_1_0</code> and <code>train_applprev_1_1</code>, which are internal data frames of Home Credit.
- <code>train_applprev_2</code>, which is another internal table with a different structure.


We will analyze these points:

- the columns of all dataframes
- how to merge them
- their NA meanings and how to fill them
- some plots
- how to create some behavioural KPI.

# 1. SETTINGS

In [1]:
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sys 
import os 

sys.path.append("../../")
from src.utils import get_feature_definitions, extract_columns_tipe, aggregate_num_features_by_historic

In [2]:
dataPath = "../../data/"

In [3]:
df_feature_definition = pl.read_csv(dataPath + "feature_definitions.csv")        

In [4]:
df_target = pl.read_parquet(dataPath  + "parquet_files/train/train_base.parquet")

In [5]:
train_applprev_1_0 = pl.read_parquet(dataPath + "parquet_files/train/train_applprev_1_0.parquet")
train_applprev_1_1 = pl.read_parquet(dataPath + "parquet_files/train/train_applprev_1_1.parquet")
train_applprev_2 = pl.read_parquet(dataPath + "parquet_files/train/train_applprev_2.parquet")


# 2. STRUCTURE OF THE DATAFRAMES

Let's first look at the shape of each dataframe.

In [6]:
train_applprev_1_0.shape

(3887684, 41)

In [7]:
train_applprev_1_1.shape

(2638295, 41)

In [8]:
train_applprev_2.shape

(14075487, 6)

It is quite clear that the first two dataframes (41 columns each) have a different structure from the third one (6 columns).


# 3. INTERNAL DATA SOURCE ANALYSIS

In [9]:
columns_0_0 = list(train_applprev_1_0.columns)
columns_0_1 = list(train_applprev_1_1.columns)

columns_0_0.sort()
columns_0_1.sort()

columns_0_0 == columns_0_1

True

In [10]:
print("Number of case id in first dataframe: ", train_applprev_1_0["case_id"].n_unique())
print("The case id are unique in the first dataframe: ", train_applprev_1_0["case_id"].n_unique() == train_applprev_1_0.shape[0])

Number of case id in first dataframe:  782997
The case id are unique in the first dataframe:  False


This makes sense, since each case id can have several rows, one per historical event indexed by <code>num_group1</code>.

In [11]:
print("Number of case id in second dataframe: ", train_applprev_1_1["case_id"].n_unique())
print("The case id are unique in the second dataframe: ", train_applprev_1_1["case_id"].n_unique() == train_applprev_1_1.shape[0])

Number of case id in second dataframe:  438525
The case id are unique in the second dataframe:  False


In [12]:
print("Number of case id in third dataframe: ", train_applprev_2["case_id"].n_unique())
print("The case id are unique in the third dataframe: ", train_applprev_2["case_id"].n_unique() == train_applprev_2.shape[0])

Number of case id in third dataframe:  1221522
The case id are unique in the third dataframe:  False


In [13]:
len(set(train_applprev_1_0["case_id"].unique()).intersection(set(train_applprev_1_1["case_id"].unique())))

0

The two first dataframes share no case id, so we can infer that they can be concatenated, while the third one is different.

In [14]:
len(set(list(train_applprev_1_0["case_id"].unique()) + list(train_applprev_1_1["case_id"].unique())))

1221522

In [15]:
len(set(list(train_applprev_1_0["case_id"].unique()) + list(train_applprev_1_1["case_id"].unique())).intersection(set(train_applprev_2["case_id"].unique())))

1221522

From this first analysis we can say that the first two dataframes are two parts of a single table, while the third one is a separate table that covers the same 1221522 case ids and contains other information.

In [16]:
train_applprev_1 = pl.concat(
    [
        train_applprev_1_0, 
        train_applprev_1_1,
    ],
    how="vertical_relaxed",
)

# 4. ANALYSIS ON THE SECOND TABLE

Let's start with the table that has the fewest columns.

**Analyze the column descriptions**

In [17]:
with pl.Config() as cfg:
    cfg.set_fmt_str_lengths(150)
    cfg.set_tbl_rows(-1)

    display(get_feature_definitions(train_applprev_2.columns, df_feature_definition)) 

Variable,Description
str,str
"""case_id""",
"""cacccardblochreas_147M""","""Card blocking reason."""
"""conts_type_509L""","""Person contact type in previous application."""
"""credacc_cards_status_52L""","""Card status of the previous credit account."""
"""num_group1""",
"""num_group2""",


In [18]:
for col in train_applprev_2.columns:
    print(col, ": ", train_applprev_2[col].n_unique())

case_id :  1221522
cacccardblochreas_147M :  10
conts_type_509L :  10
credacc_cards_status_52L :  7
num_group1 :  20
num_group2 :  12


In [19]:
df_target["case_id"].n_unique()

1526659

In [20]:
train_applprev_2["case_id"].n_unique()

1221522

In [21]:
train_applprev_2["conts_type_509L"].unique()

conts_type_509L
str
"""SECONDARY_MOBI…"
"""HOME_PHONE"""
"""EMPLOYMENT_PHO…"
"""PRIMARY_MOBILE…"
"""PHONE"""
"""ALTERNATIVE_PH…"
"""PRIMARY_EMAIL"""
""
"""WHATSAPP"""
"""SKYPE"""


In [22]:
train_prev_app_w_target = train_applprev_2.join(df_target, on="case_id", how="left")

In [23]:
cols = ["cacccardblochreas_147M", "conts_type_509L", "credacc_cards_status_52L"]
for col in cols:
    percentage = train_prev_app_w_target.group_by(col).agg(
        pl.col("target").mean() * 100
        )
    print(col, ": percentage of target == 1 is ")
    print(percentage)


cacccardblochreas_147M : percentage of target == 1 is 
shape: (10, 2)
┌────────────────────────┬──────────┐
│ cacccardblochreas_147M ┆ target   │
│ ---                    ┆ ---      │
│ str                    ┆ f64      │
╞════════════════════════╪══════════╡
│ P19_60_110             ┆ 4.845815 │
│ P41_107_150            ┆ 1.086957 │
│ P127_74_114            ┆ 0.0      │
│ P33_145_161            ┆ 1.571472 │
│ a55475b1               ┆ 3.425389 │
│ null                   ┆ 5.023387 │
│ P201_63_60             ┆ 8.513514 │
│ P23_105_103            ┆ 3.418803 │
│ P17_56_144             ┆ 6.349206 │
│ P133_119_56            ┆ 5.0      │
└────────────────────────┴──────────┘
conts_type_509L : percentage of target == 1 is 
shape: (10, 2)
┌───────────────────┬──────────┐
│ conts_type_509L   ┆ target   │
│ ---               ┆ ---      │
│ str               ┆ f64      │
╞═══════════════════╪══════════╡
│ WHATSAPP          ┆ 12.0     │
│ ALTERNATIVE_PHONE ┆ 4.258187 │
│ PHONE             ┆ 3.9803

# 5. ANALYSIS OF THE FIRST TABLE

In [24]:
train_applprev_1_pd = train_applprev_1.to_pandas()

In [25]:
features_num, features_date, features_cat = extract_columns_tipe(train_applprev_1_pd)

In [26]:
features_num

['actualdpd_943P',
 'annuity_853A',
 'byoccupationinc_3656910L',
 'childnum_21L',
 'credacc_actualbalance_314A',
 'credacc_credlmt_575A',
 'credacc_maxhisbal_375A',
 'credacc_minhisbal_90A',
 'credacc_transactions_402L',
 'credamount_590A',
 'currdebt_94A',
 'downpmt_134A',
 'mainoccupationinc_437A',
 'maxdpdtolerance_577P',
 'outstandingdebt_522A',
 'pmtnum_8L',
 'revolvingaccount_394A',
 'tenor_203L']

In [27]:
features_date

['approvaldate_319D',
 'creationdate_885D',
 'dateactivated_425D',
 'dtlastpmt_581D',
 'dtlastpmtallstes_3545839D',
 'employedfrom_700D',
 'firstnonzeroinstldate_307D']

In [28]:
with pl.Config() as cfg:
    cfg.set_fmt_str_lengths(150)
    cfg.set_tbl_rows(-1)

    display(get_feature_definitions(features_date, df_feature_definition)) 

Variable,Description
str,str
"""approvaldate_319D""","""Approval Date of Previous Application"""
"""creationdate_885D""","""Date when previous application was created."""
"""dateactivated_425D""","""Contract activation date of the applicant's previous application."""
"""dtlastpmt_581D""","""Date of last payment made by the applicant."""
"""dtlastpmtallstes_3545839D""","""Date of the applicant's last payment."""
"""employedfrom_700D""","""Employment start date from the previous application."""
"""firstnonzeroinstldate_307D""","""Date of first instalment in the previous application."""


In [30]:
for col_d in features_date:
    train_applprev_1 = train_applprev_1.with_columns(pl.col(col_d).str.strptime(pl.Date))
    print(col_d, ": ", train_applprev_1[col_d].max())

approvaldate_319D :  2020-10-19
creationdate_885D :  2020-10-19
dateactivated_425D :  2020-10-19
dtlastpmt_581D :  2020-10-19
dtlastpmtallstes_3545839D :  2020-10-19
employedfrom_700D :  2020-07-15
firstnonzeroinstldate_307D :  2020-11-19


In [31]:
with pl.Config() as cfg:
    cfg.set_fmt_str_lengths(150)
    cfg.set_tbl_rows(-1)

    display(get_feature_definitions(features_num, df_feature_definition)) 

Variable,Description
str,str
"""actualdpd_943P""","""Days Past Due (DPD) of previous contract (actual)."""
"""annuity_853A""","""Monthly annuity for previous applications."""
"""byoccupationinc_3656910L""","""Applicant's income from previous applications."""
"""childnum_21L""","""Number of children in the previous application."""
"""credacc_actualbalance_314A""","""Actual balance on credit account."""
"""credacc_credlmt_575A""","""Credit card credit limit provided for previous applications."""
"""credacc_maxhisbal_375A""","""Maximal historical balance of previous credit account"""
"""credacc_minhisbal_90A""","""Minimum historical balance of previous credit accounts."""
"""credacc_transactions_402L""","""Number of transactions made with the previous credit account of the applicant."""
"""credamount_590A""","""Loan amount or card limit of previous applications."""


In [None]:
aggregate_num_features_by_historic(train_applprev_1.select(features_num + ["case_id", "num_group1"]), features_num, "num_group1")

In [None]:
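def aggregate_num_features_by_historic(df, col_list, col_sort):
    # Build, for every numerical column, its per-case summary statistics.
    operation_to_apply = []
    for col in col_list:
        operation_to_apply.append(pl.col(col).mean().alias(f"{col}_mean"))
        operation_to_apply.append(pl.col(col).std().alias(f"{col}_std"))
        operation_to_apply.append(pl.col(col).last().alias(f"{col}_last"))
        # "_first" must use first(), not last(), otherwise it duplicates "_last".
        operation_to_apply.append(pl.col(col).first().alias(f"{col}_first"))
    # Rows are sorted by col_sort in descending order before grouping, so "first" is the
    # row with the highest col_sort value and "last" the row with the lowest one.
    df_grouped = df.sort(by=col_sort, descending=True).group_by("case_id").agg(operation_to_apply)
    return df_grouped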
