# STATIC TABLE ANALYSIS

The static table is composed of three tables:

- <code>train_static_0_0</code> and <code>train_static_0_1</code>, which are internal data frames of Home Credit.
- <code>train_static_cb_0</code>, which is an external dataset.

We are mainly interested in the internal datasets: the model will have to make stable predictions in the future, and an external table we cannot rely on is unlikely to be a good predictor for that goal.

We will analyze these points:

- the columns of all dataframes
- how to merge them
- their NA meanings and how to fill them
- some plots

!!! From https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability/discussion/476463 we can see that the person's age must be taken from the "birth_259D" column of train_person_1.
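
As a minimal sketch of that note, the age at decision time could be derived as below. It assumes that "birth_259D" has already been joined onto the base table by case_id; the rows are toy data and the column name `age_years` is made up for illustration.

```python
import pandas as pd

# Toy rows; only the column names birth_259D and date_decision come from the competition data.
df = pd.DataFrame(
    {
        "case_id": [1, 2],
        "birth_259D": ["1990-06-15", "1985-01-01"],
        "date_decision": ["2020-06-14", "2020-01-01"],
    }
)

birth = pd.to_datetime(df["birth_259D"])
decision = pd.to_datetime(df["date_decision"])

# Completed years: subtract one if the birthday has not yet occurred in the decision year.
had_birthday = (decision.dt.month > birth.dt.month) | (
    (decision.dt.month == birth.dt.month) & (decision.dt.day >= birth.dt.day)
)
df["age_years"] = decision.dt.year - birth.dt.year - (~had_birthday).astype(int)
```

Using completed years rather than a day count divided by 365.25 avoids off-by-one results around birthdays.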

# 1. SETTINGS

In [1]:
import polars as pl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import sys

sys.path.append("../../")
from src.utils import get_feature_definitions, compute_date_distance_from_col, extract_columns_tipe, aggregate_num_features_by_historic

In [2]:
dataPath = "../../data/"

We will import the target dataframe together with the feature definitions, which we will use later to annotate the plots.

In [3]:
df_target = pl.read_parquet(dataPath + 'parquet_files/train/train_base.parquet')

In [4]:
df_feature_definition = pl.read_csv(dataPath + 'feature_definitions.csv')

In [5]:
train_static_0_0 = pl.read_parquet(dataPath + "parquet_files/train/train_static_0_0.parquet")
train_static_0_1 = pl.read_parquet(dataPath + "parquet_files/train/train_static_0_1.parquet")
train_static_cb_1 = pl.read_parquet(dataPath + "parquet_files/train/train_static_cb_0.parquet")

# 2. STRUCTURE OF THE DATAFRAMES

Let's first look at the shape of the dataframes.

In [6]:
train_static_0_0.shape

(1003757, 168)

In [7]:
train_static_0_1.shape

(522902, 168)

In [8]:
train_static_cb_1.shape

(1500476, 53)

It's quite clear that the external dataframe has a different structure.

As stated above, in this first analysis we will only look at the internal datasets.

# 3. INTERNAL DATA SOURCE ANALYSIS

Let's first see if the internal datasources have the same columns.

In [9]:
columns_0_0 = list(train_static_0_0.columns)
columns_0_1 = list(train_static_0_1.columns)

columns_0_0.sort()
columns_0_1.sort()

columns_0_0 == columns_0_1

True

The two dataframes have the same columns but different rows.

Let's go into more detail.

In [10]:
print("Number of case id in first dataframe: ", train_static_0_0["case_id"].n_unique())
print("The case id are unique in the first dataframe: ", train_static_0_0["case_id"].n_unique() == train_static_0_0.shape[0])

Number of case id in first dataframe:  1003757
The case id are unique in the first dataframe:  True


In [11]:
print("Number of case id in second dataframe: ", train_static_0_1["case_id"].n_unique())
print("The case id are unique in the second dataframe: ", train_static_0_1["case_id"].n_unique() == train_static_0_1.shape[0])

Number of case id in second dataframe:  522902
The case id are unique in the second dataframe:  True


So each case id is unique within its dataframe.

Let's see whether the two dataframes have any case ids in common.

In [12]:
set(train_static_0_1["case_id"].unique()).intersection(set(train_static_0_0["case_id"].unique()))

set()

**We can conclude that the two dataframes are perfectly disjoint, each with its own case ids, and share the same columns. We can concatenate them.**

In [13]:
train_static_internal = pl.concat(
    [
        train_static_0_0, 
        train_static_0_1,
    ],
    how="vertical_relaxed",
)
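
The three checks made before stacking (same columns, unique keys, disjoint keys) can be bundled into one helper. The sketch below uses pandas on toy data rather than the polars frames of this notebook, and `check_vertically_stackable` is a hypothetical name, not a function from `src.utils`.

```python
import pandas as pd


def check_vertically_stackable(df_a, df_b, key="case_id"):
    """Return the outcome of the three checks as a dict (hypothetical helper)."""
    return {
        # Same set of column names, regardless of order
        "same_columns": sorted(df_a.columns) == sorted(df_b.columns),
        # The key identifies one row in each frame
        "key_unique": bool(df_a[key].is_unique and df_b[key].is_unique),
        # No key appears in both frames
        "keys_disjoint": set(df_a[key]).isdisjoint(df_b[key]),
    }


# Toy partitions with the same columns in a different order
part_0 = pd.DataFrame({"case_id": [1, 2], "x": [0.1, 0.2]})
part_1 = pd.DataFrame({"x": [0.3], "case_id": [3]})

checks = check_vertically_stackable(part_0, part_1)
if all(checks.values()):
    stacked = pd.concat([part_0, part_1], ignore_index=True)
```

Returning the individual results instead of a single boolean makes it easier to see which assumption fails when a new partition is added.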

# 4. DEEPER ANALYSIS ON THE COMPLETE INTERNAL TABLE

Let's move to a deeper analysis on the entire dataframe.

Create the pandas representation in order to plot it.

In [14]:
train_static_internal_pd = train_static_internal.to_pandas()

## 4.1 COLUMN TYPE EXTRACTION

Let's extract the columns, split by type.

In [15]:
features_num, features_date, features_cat = extract_columns_tipe(train_static_internal_pd)
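
The implementation of `extract_columns_tipe` is not shown here. As a rough illustration of what such a split might look like, the sketch below treats columns ending in "D" as dates (the same suffix rule used later in this notebook), numeric dtypes as numerical, and everything else as categorical. The function name `split_columns_by_type` and the toy column names are assumptions made for this example.

```python
import pandas as pd


def split_columns_by_type(df, id_cols=("case_id",)):
    """Hypothetical stand-in for a helper like extract_columns_tipe."""
    features_num, features_date, features_cat = [], [], []
    for col in df.columns:
        if col in id_cols:
            continue  # identifiers are not features
        if col.endswith("D"):
            features_date.append(col)  # date columns carry a "D" suffix
        elif pd.api.types.is_numeric_dtype(df[col]):
            features_num.append(col)
        else:
            features_cat.append(col)
    return features_num, features_date, features_cat


# Toy frame with made-up column names
toy = pd.DataFrame(
    {
        "case_id": [1, 2],
        "amount_1A": [100.0, None],
        "firstdate_2D": ["2019-01-01", None],
        "type_3M": ["a", "b"],
    }
)
num_cols, date_cols, cat_cols = split_columns_by_type(toy)
```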

In [None]:
train_static_internal_pd

In [None]:
with pl.Config() as cfg:
    cfg.set_fmt_str_lengths(150)
    cfg.set_tbl_rows(-1)

    display(get_feature_definitions(features_date, df_feature_definition)) 

In [None]:
for col in features_date:
    train_static_internal = train_static_internal.with_columns(pl.col(col).str.to_date())

In [None]:
df_target = df_target.with_columns(pl.col("date_decision").str.to_date())

In [None]:
with pl.Config() as cfg:
    cfg.set_fmt_str_lengths(150)
    cfg.set_tbl_rows(-1)

    display(get_feature_definitions(features_num, df_feature_definition)) 

## 4.1 NULL ANALYSIS

In [None]:
# Fraction of nulls per column, sorted from the sparsest column down
df_nulls = (
    (train_static_internal.null_count() / train_static_internal.shape[0])
    .transpose(include_header=True).sort(by="column_0", descending=True).to_pandas()
)
df_nulls["perc_of_nulls"] = df_nulls.iloc[:, 1]
df_nulls = df_nulls.drop("column_0", axis=1)
df_nulls

As we can see, for the columns with a very high share of nulls the missing value itself seems to carry information. We have to understand how to deal with it.

In [None]:
df_nulls["perc_of_nulls"].hist(bins=30)

In [None]:
df_nulls.loc[(df_nulls["perc_of_nulls"] < 0.8) & (df_nulls["perc_of_nulls"]>0.6)]

In this case as well, the absence of a value seems to be informative.
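
One way to check whether missingness is informative is to compare the default rate of rows where a feature is null with the rate of rows where it is present. The sketch below is a minimal pandas example on toy data; `default_rate_by_missingness` and `some_feature` are hypothetical names.

```python
import pandas as pd


def default_rate_by_missingness(df, feature, target="target"):
    """Mean target and row count for null vs. present values of `feature`."""
    status = (
        df[feature].isna().map({True: "null", False: "present"}).rename("value_status")
    )
    return df.groupby(status)[target].agg(["mean", "count"])


# Toy data: the feature is missing for three of the six cases
toy = pd.DataFrame(
    {
        "some_feature": [1.0, None, None, 2.0, 3.0, None],
        "target": [0, 1, 1, 0, 0, 0],
    }
)
rates = default_rate_by_missingness(toy, "some_feature")
```

A clear gap between the two means, on enough rows, would support keeping the nulls as a signal rather than imputing them away.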

## 4.2 ANALYSIS OF CATEGORICAL VS NUMERICAL 

We first want to separate the numerical variables from the categorical ones.

In [None]:
features_date

In [None]:
for col in features_cat:
    print(f"  {col}")

From the date columns we can see that we will need to compute the difference between each date and a reference date.

In [None]:
# aesthetics
default_color_1 = 'darkblue'
default_color_2 = 'darkgreen'
default_color_3 = 'darkred'

In [None]:
# df_target is a polars DataFrame; pandas merge needs a pandas DataFrame
train_static_internal_pd = train_static_internal_pd.merge(df_target.to_pandas(), on='case_id')

Let's see how many unique values the numerical and categorical variables have.

In [None]:
for col in features_num:
    print(col, ": ", len(train_static_internal_pd[col].unique()))

In [None]:
date_columns = []
long_features_cat = []
short_features_cat = []
# Iterate over a copy: removing items from the list being iterated skips elements
for col in list(features_cat):
    if 'date' in col or col.endswith("D"):
        date_columns.append(col)
        features_cat.remove(col)
    # nunique(dropna=False) counts NaN as a value, like len(unique())
    elif train_static_internal_pd[col].nunique(dropna=False) > 10:
        long_features_cat.append(col)
    else:
        short_features_cat.append(col)
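
A general caveat for loops that remove items from the list they iterate over: each removal shifts the remaining elements, so the iterator skips the item that follows. The sketch below shows the effect with made-up column names and the usual workaround of iterating over a copy.

```python
cols = ["datefirst_1D", "datelast_2D", "type_3M"]  # made-up names

# Risky pattern: mutating the list being iterated skips "datelast_2D".
buggy = list(cols)
seen_buggy = []
for col in buggy:
    seen_buggy.append(col)
    if col.endswith("D"):
        buggy.remove(col)

# Safer pattern: iterate over a copy, mutate the original.
safe = list(cols)
seen_safe = []
for col in list(safe):
    seen_safe.append(col)
    if col.endswith("D"):
        safe.remove(col)
```

In the risky loop the second date column is never visited, so it stays in the list.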
        

In [None]:
for col in long_features_cat:
    print(col, ": ", len(train_static_internal_pd[col].unique()))

In [None]:
for col in short_features_cat:
    print(col, ": ", len(train_static_internal_pd[col].unique()))

In [None]:
train_static_internal_pd["isdebitcard_729L"].unique()

In [None]:
train_static_internal_pd["isbidproductrequest_292L"].unique()

In [None]:
train_static_internal_pd["paytype_783L"].unique()

In [None]:
train_static_internal_pd["typesuite_864L"].unique()

In [None]:
train_static_internal_pd["bankacctype_710L"].unique()

In [None]:
train_static_internal_pd["isbidproduct_1095L"].unique()

We can see that cardinality varies widely across the variables.

In [None]:
def plot_continuous(df, feature, txt):
    '''Plot overlaid histograms of the specified feature for paid (target == 0) and defaulted (target == 1) cases.'''
    df_paid = df.loc[df["target"] == 0]
    df_default = df.loc[df["target"] == 1]

    fig, ax1 = plt.subplots()

    # sns.histplot (unlike sns.boxplot) accepts bins, kde and edgecolor
    for df_group, label in zip([df_paid, df_default], ['paid', 'default']):
        sns.histplot(data=df_group,
                     x=feature,
                     bins=30,
                     alpha=0.66,
                     edgecolor='firebrick',
                     label=label,
                     kde=False,
                     ax=ax1)
    ax1.legend()
    fig.text(.5, .005, txt, ha='center')
    plt.tight_layout();

In [None]:
def plot_categorical(df, feature, txt):
    '''For a categorical feature, plot a seaborn.countplot of the number of cases in each category, split by target.'''
    fig, ax1 = plt.subplots()

    sns.countplot(x=feature,
                  hue='target',
                  data=df,
                  ax=ax1)
    ax1.set_ylabel('Count')
    ax1.legend(labels=['paid', 'default'])
    ax1.tick_params(axis='x', rotation=90)
    
    fig.text(.5, .005, txt, ha='center')
    plt.tight_layout();
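
Raw counts can hide differences in risk between categories, because defaults are rare. A complementary view is the default rate per category, sketched below with pandas on toy data; `default_rate_by_category` and `some_category` are hypothetical names.

```python
import pandas as pd


def default_rate_by_category(df, feature, target="target"):
    """Default rate and number of cases per category, nulls kept as their own group."""
    return (
        df.groupby(feature, dropna=False)[target]
        .agg(default_rate="mean", n_cases="count")
        .sort_values("n_cases", ascending=False)
    )


# Toy data: category "A" has one default out of three cases
toy = pd.DataFrame(
    {
        "some_category": ["A", "A", "B", None, "A"],
        "target": [0, 1, 0, 1, 0],
    }
)
rates = default_rate_by_category(toy, "some_category")
```

Keeping `n_cases` next to the rate helps to avoid over-reading rates computed on a handful of rows.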


In [None]:
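# Assumption: dict_feature maps each feature name to its description ("Variable"/"Description" columns of the CSV)
dict_feature = dict(zip(df_feature_definition["Variable"], df_feature_definition["Description"]))
for i in short_features_cat:
    print(i)
    plot_categorical(train_static_internal_pd, i, f'({dict_feature[i]})')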

# 5. FEATURE CREATION

In [None]:
df_target_with_static = df_target.join(other=train_static_internal, left_on="case_id", right_on="case_id", how="left")  
df_target_with_static, diff_col = compute_date_distance_from_col(df_target_with_static, features_date, "date_decision")
df_target_with_static= df_target_with_static.drop(features_date)
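
The implementation of `compute_date_distance_from_col` is not shown here. A minimal sketch of the idea, written with pandas on toy data, could look as follows. The function name `date_distance_in_days`, the naming of the new columns and the sign convention (days before the reference date) are assumptions made for this example.

```python
import pandas as pd


def date_distance_in_days(df, date_cols, reference_col):
    """Add, for each date column, the number of days between it and the reference date."""
    out = df.copy()
    reference = pd.to_datetime(out[reference_col])
    new_cols = []
    for col in date_cols:
        new_col = f"{col}_days_before_{reference_col}"
        # Missing dates stay missing (NaN) in the distance column
        out[new_col] = (reference - pd.to_datetime(out[col])).dt.days
        new_cols.append(new_col)
    return out, new_cols


# Toy data with a made-up date feature
toy = pd.DataFrame(
    {
        "date_decision": ["2020-01-31", "2020-03-01"],
        "some_date_1D": ["2020-01-01", None],
    }
)
toy_with_distance, diff_cols = date_distance_in_days(toy, ["some_date_1D"], "date_decision")
```

Relative distances are generally more stable over time than raw calendar dates, which matters when the model has to stay reliable on future data.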

## CONCLUSIONS

At first glance, the internal dataset appears to be of good quality.

We only have to keep in mind that:

- The NAs seem informative, so we should not drop them.
- We don't know anything about the outlier values at the moment. They may be informative, so we will not drop them.
- The dates can be used to compute time differences.
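
For the first point, one common way to keep the information carried by the NAs is to add an explicit indicator column next to each feature. This is only a sketch of that option, not a choice made in this notebook; `add_missing_indicators` and the `_isnull` suffix are hypothetical.

```python
import pandas as pd


def add_missing_indicators(df, cols, suffix="_isnull"):
    """Return a copy of df with one 0/1 indicator column per feature in `cols`."""
    out = df.copy()
    for col in cols:
        out[col + suffix] = out[col].isna().astype("int8")
    return out


# Toy data with a made-up feature
toy = pd.DataFrame({"some_feature": [1.0, None, 3.0]})
toy_flagged = add_missing_indicators(toy, ["some_feature"])
```

The original column is left untouched, so models that handle nulls natively can still use it as is.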