<h1 style="color: navy; text-align: center;">Credit Risk Model Exploratory Data Analysis</h1>
<p style="text-align: justify; font-size: 16px;">
This notebook explores the dataset to surface insights for evaluating the default risk of potential clients. Better risk estimates let consumer finance providers approve more loan applications, which improves financial inclusion for people previously excluded because of insufficient credit history.
</p>


In [1]:
import sys
import os
import polars as pl
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

sns.set_style("white")
sns.set_palette("colorblind") 
sns.set_context("talk")

script_dir = os.path.dirname(os.path.abspath('PreprocessingSteps.ipynb'))
parent_directory = os.path.dirname(script_dir)
module_directory = os.path.join(parent_directory, 'module') 
utils_directory = os.path.join(parent_directory, 'utils') 

if (parent_directory not in sys.path):
    sys.path.append(parent_directory)
    
if (module_directory not in sys.path):
    sys.path.append(module_directory)
    
if (utils_directory not in sys.path):
    sys.path.append(utils_directory)    

# created files
from utils import config
from module.preprocess.bpe import BpeArgs, Encoder
from module.preprocess.load_and_batch import DataBatcher

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Oreos\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


<h1 style="color: navy; font-family: Verdana, Geneva, sans-serif;">Exploring Information for Depth 0</h1>

<p style="font-size: 16px; font-family: 'Lucida Grande', 'Lucida Sans Unicode', Arial, sans-serif; color: #333;">
  At depth zero, we have <strong style="color: darkred;">static features</strong> tied to a specific credit case. All the features here can be directly used as predictors.
</p>


In [2]:
preprocess = DataBatcher()
preprocess.load_and_process(config.DATA_LOCATION, training=True, train_test_split=0.8)

------> Size of train dataset: (1526659, 224) <------
------> Size of categorical columns: 185 <------
------> Size of numeric columns: 35 <------


<div style="text-align: center;">
    <img src="images/confused.gif" alt="Confused">
    <p style="text-align: center; font-style: italic; font-weight: bold;">Just the 224 Columns Then.........</p>
</div>

## Training Data information

In [5]:
print(f"---------------> Shape of training data: {preprocess.train.data.shape} <---------------")
print(f"---------------> Shape of test data: {preprocess.test.data.shape} <---------------")
display(preprocess.train.data.head(5))

---------------> Shape of training data: (1221327, 222) <---------------
---------------> Shape of test data: (305332, 222) <---------------


case_id,actualdpdtolerance_344P,amtinstpaidbefduel24m_4187115A,annuity_780A,annuitynextmonth_57A,applicationcnt_361L,applications30d_658L,applicationscnt_1086L,applicationscnt_464L,applicationscnt_629L,applicationscnt_867L,avgdbddpdlast24m_3658932P,avgdbddpdlast3m_4187120P,avgdbdtollast24m_4525197P,avgdpdtolclosure24_3658938P,avginstallast24m_3658937A,avglnamtstart24m_4525187A,avgmaxdpdlast9m_3716943P,avgoutstandbalancel6m_4187114A,avgpmtlast12m_4525200A,bankacctype_710L,cardtype_51L,clientscnt12m_3712952L,clientscnt3m_3712950L,clientscnt6m_3712949L,clientscnt_100L,clientscnt_1022L,clientscnt_1071L,clientscnt_1130L,clientscnt_136L,clientscnt_157L,clientscnt_257L,clientscnt_304L,clientscnt_360L,clientscnt_493L,clientscnt_533L,clientscnt_887L,…,for3years_504L,for3years_584L,formonth_118L,formonth_206L,formonth_535L,forquarter_1017L,forquarter_462L,forquarter_634L,fortoday_1092L,forweek_1077L,forweek_528L,forweek_601L,foryear_618L,foryear_818L,foryear_850L,fourthquarter_440L,maritalst_385M,maritalst_893M,numberofqueries_373L,pmtaverage_3A,pmtaverage_4527227A,pmtaverage_4955615A,pmtcount_4527229L,pmtcount_4955617L,pmtcount_693L,pmtscount_423L,pmtssum_45A,requesttype_4525192L,responsedate_1012D,responsedate_4527233D,responsedate_4917613D,riskassesment_302T,riskassesment_940T,secondquarter_766L,thirdquarter_1082L,date_decision,target
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,str,str,f64,f64,f64,str,date,date,date,str,f64,f64,f64,date,i64
106054,0.0,,4051.0,5785.0,0.0,0.0,0.0,0.0,0.0,5.0,-8.0,,,0.0,5202.8003,,0.0,,,"""CA""",,0.0,0.0,0.0,0.0,1.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,1.0,"""3439d993""","""a55475b1""",1.0,10956.967,,,,,6.0,,,,2019-02-04,,,,,2.0,0.0,2019-01-21,0
1875612,0.0,0.0,6863.8003,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,,,0.0,,,,,,,"""INSTANT""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,5.0,"""3439d993""","""a55475b1""",4.0,,,,,,,,,,,,2020-07-17,,,1.0,4.0,2020-07-03,0
1523012,0.0,75811.68,3333.4001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-17.0,,-17.0,0.0,6317.6,,0.0,,8372.8,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,1.0,"""3439d993""","""a55475b1""",3.0,,,,,,,6.0,50112.6,"""DEDUCTION_6""",2019-09-20,2019-09-20,,,,0.0,1.0,2019-09-06,0
628698,,,2979.4001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,"""CA""",,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,4.0,"""a55475b1""","""a55475b1""",7.0,,,,,,,,,,,,,,,3.0,5.0,2019-02-09,0
861635,0.0,5417.6,13503.0,1791.8,0.0,0.0,0.0,0.0,0.0,4.0,-2.0,-3.0,-3.0,0.0,1805.8,17051.6,0.0,18616.6,1805.8,,,0.0,0.0,0.0,0.0,1.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,1.0,…,,,,,,,,,,,,,,,,11.0,"""3439d993""","""a55475b1""",16.0,,,,,,,,,"""DEDUCTION_6""",,2019-11-29,,,,3.0,6.0,2019-11-15,0


# Exploring Data Conversion

## Overview
In this project, we explore the conversion of tabular data into text, focusing on representing numerical data as continuous variables. Notably, we dropped the *Week number* and *Month* information, relying instead on the 'date_decision' column for temporal insights. This strategy aims for a **1-to-1 mapping** between text and encoded numerical values, minimizing data scaling.

## Converting Tabular Information to Text Tokens

### Rationale
The conversion process leverages neural network capabilities to process and interpret tabular data as text. This approach lets the model use features such as null values more meaningfully through embeddings. During data augmentation, masked values are identified, allowing the model to effectively 'ignore' them during training. By employing specialized embedding tables and Byte Pair Encoding (BPE), we hope to enhance the model's ability to generalize across unseen inputs.

### Process and Benefits

#### Tokenization and Embedding
- **Contextual Encoding**: Categorical values are either encoded or marked as empty, and numeric data presence is flagged, providing context to the data.
- **Adaptability to New Data**: The method ensures adaptability, allowing the model to perform well with new data, as long as it shares similarities with the training set.
- **Efficient Representation**: By avoiding one-hot encoding, we use embeddings to create a compact and meaningful data representation, mitigating issues related to dimensionality and data sparsity.
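
To make the contextual-encoding idea concrete, here is a minimal sketch of turning one table row into text. The `row_to_text` helper, the `<EMPTY>` marker, and the `column value` layout are assumptions made for illustration; the text format actually produced by `DataBatcher` is not shown in this notebook and may differ.

```python
EMPTY = "<EMPTY>"  # hypothetical marker for a missing value


def row_to_text(row, skip=("case_id", "target")):
    """Serialise one row as 'column value' pairs, marking missing values."""
    parts = []
    for column, value in row.items():
        if column in skip:
            continue
        # None and float NaN are both treated as missing.
        missing = value is None or (isinstance(value, float) and value != value)
        parts.append(f"{column} {EMPTY if missing else value}")
    return " | ".join(parts)


# A few fields in the style of the training table shown above.
example_row = {
    "case_id": 106054,
    "annuity_780A": 4051.0,
    "bankacctype_710L": "CA",
    "cardtype_51L": None,
    "date_decision": "2019-01-21",
    "target": 0,
}
text = row_to_text(example_row)
print(text)
```

Keeping the column name next to each value is what gives the model context, and the explicit empty marker lets it learn an embedding for "missing" instead of relying on an imputed number.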

### Visual Representation
The diagram below illustrates the data conversion process: tabular data is transformed into text tokens, which are then embedded and processed.

<div style="text-align: center;">
    <img src="images/Flow_w.png" style="width: 50%;" alt="Algorithm Process">
    <p style="text-align: center; font-style: italic; font-weight: bold;">Steps Flowchart</p>
</div>


### Word Embedding

In our program, we utilize **Byte Pair Encoding (BPE)** to convert text into tokens. BPE is a method initially devised for data compression that has been effectively adapted for tokenization in natural language processing (NLP).

#### Overview of Byte Pair Encoding (BPE)

BPE operates by iteratively merging the most frequent pair of bytes or characters in a sequence. This technique was first used for compressing data but has since gained prominence in NLP for its efficiency in tokenization.
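
The merge loop can be sketched in a few lines. This is a generic, character-level illustration of BPE, not the `Encoder` class used in this notebook; the function names and the lexicographic tie-breaking rule are assumptions chosen to keep the example deterministic.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most frequent adjacent pair, breaking ties lexicographically."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return max(counts, key=lambda pair: (counts[pair], pair))


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the concatenated symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


words = ["formonth", "forweek", "foryear", "forquarter"]
merges, sequences = train_bpe(words, num_merges=2)
print(merges)        # [('o', 'r'), ('f', 'or')]
print(sequences[0])  # ['for', 'm', 'o', 'n', 't', 'h']
```

After two merges the shared prefix `for` has become a single symbol, which is how BPE shortens sequences that repeat the same fragments.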

<div style="text-align: center;">
    <img src="images/tok_types.png" style="width: 50%;" alt="Byte Pair Encoding visualization">
    <p style="text-align: center; font-style: italic; font-weight: bold;">Ways to tokenize characters. We are using the subword approach</p>
</div>

#### Key Features of BPE

- **Efficiency**: BPE addresses the vocabulary explosion problem by merging frequently occurring pairs, effectively reducing the vocabulary size without losing significant information.

- **Subword Tokenization**: It breaks down words into smaller units (subwords or characters), aiding in the handling of out-of-vocabulary words and the morphological variations of words.

#### Chosen Method

To effectively manage the vocabulary size and ensure the model can generalize to unseen scenarios, we employ a hybrid approach that combines subword-level and word-level tokenization strategies.

### Encoding Numeric Data in Text as Continuous Variables

Our goal is to encode numbers within text (e.g., amounts, general numbers, dates) as continuous variables to improve the model’s understanding of their context.

#### Methodology
- **Text Parsing**: Identify and categorize numbers in text (e.g., $8.99 as `Amount`, 13 as `Num`, 2020-10-19 as `Date`).
- **Embedding Selection**: Use distinct embedding tables for each category to reflect their unique attributes:
  - `Amount`: Embed digits and scale indicators (e.g., K, M).
  - `Num`: Include digits, decimal points, and negative signs.
  - `Date`: Embed digits and date separators (e.g., -).
- **Encoding Process**:
  - Tokenize numbers into components (e.g., 8.99 into ['8', '.', '9', '9']).
  - Retrieve embeddings for each token, forming a matrix.
  - Apply self-attention over this matrix to generate a cohesive vector representation.
  
  <div style="text-align: center;">
    <img src="images/SA_White.png" style="width: 50%;" alt="Attention Mechanism">
    <p style="text-align: center; font-style: italic; font-weight: bold;">Self-Attention Mechanism</p>
</div>
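
The methodology above can be sketched with NumPy as follows. Everything here is an illustrative assumption: the regex rules in `classify`, the per-category character vocabularies, the embedding size, mean pooling as the final step, and the random matrices that stand in for learned embedding tables and attention weights.

```python
import re

import numpy as np

# Hypothetical per-category character vocabularies.
VOCABS = {
    "Amount": list("0123456789.KM"),
    "Num": list("0123456789.-"),
    "Date": list("0123456789-"),
}


def classify(text):
    """Assign a numeric string to a category using simple regex rules."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "Date"
    if re.fullmatch(r"\$\d+(?:\.\d+)?[KM]?", text):
        return "Amount"
    if re.fullmatch(r"-?\d+(?:\.\d+)?", text):
        return "Num"
    raise ValueError(f"not a recognised number: {text!r}")


def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for stability."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def encode_number(text, dim=8, seed=0):
    """Tokenize into characters, embed them, and pool with self-attention."""
    category = classify(text)
    chars = list(text.lstrip("$"))  # the currency sign only selects the category
    vocab = VOCABS[category]
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(len(vocab), dim))  # stand-in for a learned table
    w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
    x = table[[vocab.index(c) for c in chars]]  # (n_chars, dim)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(dim))   # (n_chars, n_chars)
    return category, chars, (weights @ v).mean(axis=0)


category, chars, vector = encode_number("8.99")
print(category, chars, vector.shape)
```

Whatever the number of characters, the result is one fixed-size vector per number, so it can sit alongside ordinary token embeddings in the input sequence.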

### Training the BPE

#### Pseudocode for Encoding Table Information

For a batch size of 10, we do the following. Note that we combine byte pair encoding with rule-based splitting (hence the use of regex).

1. **Chunk Data Columns by 10**: Begin by chunking your data columns into segments of 10 for more manageable processing.

    ```python
    chunks = chunk_data(data_columns, chunk_size=10)
    ```

2. **Analyze Each Chunk**: For each chunk, sample entries to find common regex patterns, and use these patterns to facilitate further analysis.

    ```python
    for chunk in chunks:
        sample = get_sample(chunk, sample_size=100)  # Sample 100 entries
        regex_patterns = find_common_regex_patterns(sample)  # Identify patterns
        split_texts = split_text_using_patterns(sample, regex_patterns)  # Split text
    ```

3. **Combine and Tokenize Text**: Combine all the split text from each chunk along with the found bi-grams. Then tokenize the combined text.

    ```python
    combined_text = combine_all_split_text(chunks)
    tokens = tokenize(combined_text, combined_bi_grams)
    ```

4. **Apply Byte Pair Encoding (BPE)**: Use BPE on the tokens to refine the vocabulary size to the desired maximum.

    ```python
    final_vocab = byte_pair_encoding(tokens, max_vocab_size=desired_max_vocab_size)
    ```
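
As a runnable companion to steps 1 and 2, the sketch below gives one possible implementation of the chunking and rule-based splitting. The `chunk_data` body, the `split_text` helper, and the splitting pattern are assumptions for illustration; the patterns the real pipeline discovers from sampled entries are not shown here.

```python
import re


def chunk_data(columns, chunk_size=10):
    """Split a list of columns into consecutive chunks of at most `chunk_size`."""
    return [columns[i:i + chunk_size] for i in range(0, len(columns), chunk_size)]


# Hypothetical rule: runs of letters, runs of digits, or a single other symbol.
SPLIT_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def split_text(text):
    """Pre-split text with the regex rule before any BPE merges are applied."""
    return SPLIT_PATTERN.findall(text)


columns = [f"col_{i}" for i in range(25)]
chunks = chunk_data(columns, chunk_size=10)
print([len(chunk) for chunk in chunks])  # [10, 10, 5]
print(split_text("DEDUCTION_6"))         # ['DEDUCTION', '_', '6']
```

Splitting on character classes first keeps BPE from merging across boundaries such as letters and digits, so merges stay within coherent fragments.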

#### Python example with comments
```python
# Configure the encoder before training.
model_init = BpeArgs(
    pattern=None,
    target_context=28,  # set this when designing for a particular context window
    adhoc_tokens=["PAD"],  # any additional key tokens or tags
    adhoc_words=preprocess.train.data.columns,  # table columns are treated as keywords, each with its own token number
    store_loc=config.BASE_LOCATION,  # storage location
)

ops = Encoder(model_init)

# Train on batches until the batcher has no more data to return.
while True:
    data_list = preprocess.get_meta_data(
        batch_size=10000, data_type="train", ignore_list=["case_id", "target"], verbose=False
    )
    if len(data_list) == 0:
        break
    ops.compress_files(data_list, save_or_update_every=100)

```

#### Simulation 

In [3]:
# For instantiating new Encoder Instance
model_init = BpeArgs(
        pattern=None,
        target_context=28, # if designing for a particular context window apply here
        adhoc_tokens=["PAD"], # any additional key tokens or tags
        adhoc_words=preprocess.train.data.columns, # The table columns are taken as keywords and have their token number
        store_loc="" # storage location
    )
    
    
new_encoder = Encoder(model_init) # New version with basic information

trained_encoder = Encoder(model_init) # can be None
trained_encoder.load_state(config.BASE_LOCATION)

print("Base Information")
print(f"----> Size of New encoder: {new_encoder.vocab_size}")
print(f"----> Size of Trained encoder: {trained_encoder.vocab_size}")

Base Information
----> Size of New encoder: 352
----> Size of Trained encoder: 1731


In [4]:
data_list = preprocess.get_meta_data(batch_size=1, data_type="test", ignore_list=["case_id", "target"], verbose= False)
print(f"Length of initial text: {len(data_list[0].text)}")
print(f"Length with new encoder: {len(new_encoder.encode_text(data_list[0].text, verbose=True))}")
print(f"Length with trained encoder: {len(trained_encoder.encode_text(data_list[0].text, verbose=True))}")

Length of initial text: 9405
+-----------------------------------------------+
|               Analysis Results                |
+-----------------------------------------------+
| Length of original list              :   4968 |
| Number of unique tokens in original  :    269 |
| Length of compressed list            :   4373 |
| Number of unique tokens in compressed:    275 |
| Final compression ratio              : 0.120  |
| Length of vocabulary                 :    358 |
| Length of vocabulary                 :    358 |
| Total documents seen so far          :      0 |
| Total words seen so far              :      0 |
+-----------------------------------------------+
Length with new encoder: 4373
+-----------------------------------------------+
|               Analysis Results                |
+-----------------------------------------------+
| Length of original list              :   4968 |
| Number of unique tokens in original  :    269 |
| Length of compressed list            : 

As can be seen, we are able to compress the text by almost 70% with the trained version.
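
For reference, the "Final compression ratio" printed for the new encoder is consistent with the fraction of tokens removed, `1 - compressed / original`. That formula is inferred from the reported numbers, not taken from the `Encoder` source, so treat it as an assumption.

```python
def compression_ratio(original_len, compressed_len):
    """Fraction of tokens removed by compression (assumed definition)."""
    return 1.0 - compressed_len / original_len


# Figures from the new-encoder report above: 4968 tokens compressed to 4373.
ratio = compression_ratio(4968, 4373)
print(f"{ratio:.3f}")  # 0.120
```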

## How the vocabulary is created

1. **Base Vocabulary from ASCII Characters**
- **Rationale**: The ASCII character set includes standard letters, digits, and symbols, providing a comprehensive foundation.
- **Implementation**: Start with all ASCII characters (values 32 to 126) to ensure basic textual elements are covered.

2. **Handling Numbers, Dates, and Amounts**
- **Rationale**: These formatted entities are crucial for maintaining contextual meaning.
- **Implementation**: Extract and store numbers, dates, and amounts in a `MetaData` class during tokenization, preserving their whole units.

3. **Training on All Possible Strings**
- **Rationale**: A robust vocabulary should encompass all unique strings in the training data.
- **Implementation**: Analyze and iteratively merge frequent pairs in the training set to build a comprehensive BPE vocabulary.

4. **Specialized Tokens for Column Names**
- **Rationale**: Column names play a unique role and are important for understanding tabular data.
- **Implementation**: Add column names as unique tokens to directly recognize and preserve their significance.
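
Steps 1 and 4 can be illustrated with a small sketch that seeds a vocabulary before any merges are learned. The `build_base_vocab` helper and the ordering of ids (special tokens, then ASCII, then column names) are assumptions for illustration and do not reproduce the internals of `Encoder`.

```python
def build_base_vocab(column_names, special_tokens=("PAD",)):
    """Seed a token-to-id map with special tokens, printable ASCII, and column names."""
    vocab = {}
    for token in special_tokens:
        vocab.setdefault(token, len(vocab))
    for code in range(32, 127):  # printable ASCII, values 32 to 126
        vocab.setdefault(chr(code), len(vocab))
    for name in column_names:  # each column name becomes a single token
        vocab.setdefault(name, len(vocab))
    return vocab


vocab = build_base_vocab(["annuity_780A", "date_decision"])
print(len(vocab))             # 98 = 1 special token + 95 ASCII characters + 2 columns
print(vocab["annuity_780A"])  # 96
```

Learned BPE merges would then be appended after these seed entries, so the ids of the base characters and column names stay stable as training continues.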

### Future Considerations for BPE Vocabulary Creation
- **Frequency Analysis**: Perform token frequency analysis to refine the vocabulary, ensuring it remains representative.
- **Token Granularity**: Balance token granularity to capture nuances without overcomplicating the model.
- **Context Preservation**: Maintain the data's context and meaning, especially for specialized entities.