<h1 style="color: navy; text-align: center;">Credit Risk Model Exploratory Data Analysis</h1>
<p style="text-align: justify; font-size: 16px;">
This notebook explores the dataset to surface insights that are crucial for evaluating the default risk of potential clients. Better risk estimates let consumer finance providers approve more loan applications, which improves financial inclusion for individuals previously excluded because of insufficient credit history.
</p>


In [1]:
import sys
import os
import polars as pl
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

sns.set_style("white")
sns.set_palette("colorblind") 
sns.set_context("talk")

script_dir = os.path.dirname(os.path.abspath('PreprocessingSteps.ipynb'))
parent_directory = os.path.dirname(script_dir)
module_directory = os.path.join(parent_directory, 'module') 
utils_directory = os.path.join(parent_directory, 'utils') 

if (parent_directory not in sys.path):
    sys.path.append(parent_directory)
    
if (module_directory not in sys.path):
    sys.path.append(module_directory)
    
if (utils_directory not in sys.path):
    sys.path.append(utils_directory)    

# created files
from utils.helpers import *
from utils import config
from module.preprocess.bpe import BytePairEncodeAlgo

# Getting column definitions
col_defs = {row['Variable']: row['Description'] for row in pl.read_csv(config.BASE_LOCATION + "feature_definitions.csv").to_dicts()}

<h1 style="color: navy; font-family: Verdana, Geneva, sans-serif;">Exploring Information for Depth 0</h1>

<p style="font-size: 16px; font-family: 'Lucida Grande', 'Lucida Sans Unicode', Arial, sans-serif; color: #333;">
  At depth zero, we have <strong style="color: darkred;">static features</strong> tied to a specific credit case. All the features here can be directly used as predictors.
</p>


In [5]:
# Merging information
final_df = pl.concat(
    [
        change_column_type(pl.read_csv(config.BASE_LOCATION + "csv_files/train/train_static_0_0.csv")),
        change_column_type(pl.read_csv(config.BASE_LOCATION + "csv_files/train/train_static_0_1.csv")),
    ],
    how="vertical_relaxed",
).join(
    change_column_type(pl.read_csv(config.BASE_LOCATION + "csv_files/train/train_static_cb_0.csv")),
    on="case_id",
    how="left",
)
print(f"------> Size of dataset without targets {final_df.shape} <------") 

# extract descriptive information for each column
descriptive_col = {col : col_defs[col] for col in final_df.columns[1:]}

final_df = final_df.join(change_column_type(pl.read_csv(config.BASE_LOCATION + "csv_files/train/train_base.csv")), on="case_id", how="inner")
print(f"------> Size of dataset with targets {final_df.shape} <------") 

# adding target columns to descriptive_col
descriptive_col['date_decision']= "This refers to the date when a decision was made regarding the approval of the loan."
descriptive_col['MONTH']= "Month the decision was made"
descriptive_col['WEEK_NUM']= "This is the week number used for aggregation. In the test sample, WEEK_NUM continues sequentially from the last training value of WEEK_NUM"
descriptive_col['target']= "This is the target value, determined after a certain period based on whether or not the client defaulted on the specific credit case (loan)"


------> Size of dataset without targets (1526659, 220) <------
------> Size of dataset with targets (1526659, 224) <------


<div style="text-align: center;">
    <img src="images/confused.gif" alt="Confused">
    <p style="text-align: center; font-style: italic; font-weight: bold;">Just the 224 Columns Then.........</p>
</div>

# Preprocessing Data:

## Contextual information:

The decision date and month are dropped because they are implied by `WEEK_NUM`, which represents the week in which the decision was made, counted from a reference start date. The earliest decision date is kept as `base_reference`.

In [6]:
base_reference = final_df['date_decision'].min()
final_df = final_df.drop(['date_decision','MONTH']) 
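
The relationship between a decision date and its week index can be sketched as follows. This is only an illustration: it assumes the week index counts whole 7-day periods since the reference date, which the dataset description may define differently, and `week_num` and the dates used are hypothetical.

```python
from datetime import date


def week_num(decision_date: date, start_date: date) -> int:
    """Whole 7-day periods elapsed between start_date and decision_date (assumed definition)."""
    return (decision_date - start_date).days // 7


start = date(2019, 1, 1)  # hypothetical reference date, not taken from the dataset
print(week_num(date(2019, 1, 1), start))  # 0
print(week_num(date(2019, 1, 8), start))  # 1
```

Under this assumption the date can be recovered up to the week from `WEEK_NUM` and the reference date, which is why keeping both columns would be redundant.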

## Tokenization and Embedding

We aim to give context to the table by turning each row into text. For categorical columns we either encode the value or flag that it is empty. For numeric columns we only tag whether a value is present or missing.
Tokenizing the rows as a step toward embedding them matters for several reasons:

- **Adaptability to New Data**: It allows the system to adapt to new and unseen data effectively. As long as the categorical values can be embedded, and there is similarity with the training set's information, the model is expected to perform well.

- **Avoidance of One-Hot Encoding**: To prevent the dimensional issues and sparsity associated with one-hot encoding, embedding provides a more compact and meaningful representation.
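
The row-to-text rule described above can be sketched roughly as follows. This is not the project's converter: `row_to_text` and the `categorical` argument are hypothetical names, and the phrasing ("is empty", "has value", "value is ...") is inferred from the sample row text used later in this notebook.

```python
from typing import Any, Dict, Set


def row_to_text(row: Dict[str, Any], categorical: Set[str]) -> str:
    """Describe each column as empty, present, or (for categoricals) by its value."""
    parts = []
    for name, value in row.items():
        if value is None:
            parts.append(f"{name} is empty")          # missing value, any column type
        elif isinstance(value, bool):
            parts.append(f"{name} is {value}")        # boolean flags keep their value
        elif name in categorical:
            parts.append(f"{name} value is {value}")  # categorical: keep the category
        else:
            parts.append(f"{name} has value")         # numeric: only flag presence
    return ", ".join(parts)


row = {
    "annuity_780A": 1917.6,
    "cardtype_51L": None,
    "credtype_322L": "CAL",
    "isbidproduct_1095L": False,
}
text = row_to_text(row, categorical={"cardtype_51L", "credtype_322L"})
print(text)
```

Numeric magnitudes are deliberately discarded in this sketch, mirroring the "present or missing" tagging described above.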

#### Sample Row Information and Text representation

In [5]:
final_df.head(1)

case_id,actualdpdtolerance_344P,amtinstpaidbefduel24m_4187115A,annuity_780A,annuitynextmonth_57A,applicationcnt_361L,applications30d_658L,applicationscnt_1086L,applicationscnt_464L,applicationscnt_629L,applicationscnt_867L,avgdbddpdlast24m_3658932P,avgdbddpdlast3m_4187120P,avgdbdtollast24m_4525197P,avgdpdtolclosure24_3658938P,avginstallast24m_3658937A,avglnamtstart24m_4525187A,avgmaxdpdlast9m_3716943P,avgoutstandbalancel6m_4187114A,avgpmtlast12m_4525200A,bankacctype_710L,cardtype_51L,clientscnt12m_3712952L,clientscnt3m_3712950L,clientscnt6m_3712949L,clientscnt_100L,clientscnt_1022L,clientscnt_1071L,clientscnt_1130L,clientscnt_136L,clientscnt_157L,clientscnt_257L,clientscnt_304L,clientscnt_360L,clientscnt_493L,clientscnt_533L,clientscnt_887L,…,for3years_504L,for3years_584L,formonth_118L,formonth_206L,formonth_535L,forquarter_1017L,forquarter_462L,forquarter_634L,fortoday_1092L,forweek_1077L,forweek_528L,forweek_601L,foryear_618L,foryear_818L,foryear_850L,fourthquarter_440L,maritalst_385M,maritalst_893M,numberofqueries_373L,pmtaverage_3A,pmtaverage_4527227A,pmtaverage_4955615A,pmtcount_4527229L,pmtcount_4955617L,pmtcount_693L,pmtscount_423L,pmtssum_45A,requesttype_4525192L,responsedate_1012D,responsedate_4527233D,responsedate_4917613D,riskassesment_302T,riskassesment_940T,secondquarter_766L,thirdquarter_1082L,WEEK_NUM,target
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,str,str,f64,f64,f64,str,date,date,date,str,f64,f64,f64,i64,i64
0,,,1917.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0,0


<div style="text-align: center;">
    <img src="images/row_sample.png" alt="row_image">
    <p style="text-align: center; font-style: italic; font-weight: bold;">Example of Row Information In Text. The text pattern follows what was described in the <strong>Tokenization and Embedding</strong> section above </p>
</div>

### Word Embedding
For our program we use Byte Pair Encoding to convert from text to tokens.

Byte Pair Encoding (BPE) is a data compression technique that iteratively merges the most frequent pair of bytes or characters in a sequence. Initially used for compressing data, it has been adapted for use in natural language processing (NLP) for tokenization.
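
The merge loop can be illustrated with a minimal byte-level sketch. It is a generic BPE illustration, not the `BytePairEncodeAlgo` implementation used in this project; `train_bpe`, `merge`, and `decode` are hypothetical helpers, and new token ids are assumed to start at 256, right after the raw byte values.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]


def most_frequent_pair(ids: List[int]) -> Optional[Pair]:
    """Return the most common adjacent pair, or None if there are no pairs."""
    counts = Counter(zip(ids, ids[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge(ids: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of pair with new_id."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, num_merges: int) -> Tuple[List[int], Dict[Pair, int]]:
    """Start from UTF-8 bytes and apply up to num_merges merges."""
    ids = list(text.encode("utf-8"))
    merges: Dict[Pair, int] = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # ids 0..255 are reserved for raw bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids: List[int], merges: Dict[Pair, int]) -> str:
    """Expand merged ids back to bytes, replaying merges in the order they were learned."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


sample = "has value has value is empty"
ids, merges = train_bpe(sample, num_merges=5)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == sample)
```

Because every merge replaces two ids with one, the token sequence shrinks with each step while remaining fully reversible.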

#### Key Features of BPE

- **Efficiency**: BPE is efficient in handling the vocabulary explosion problem by merging frequent pairs, thus reducing the vocabulary size without losing significant information.
- **Subword Tokenization**: It tokenizes words into smaller units (subwords or characters), which helps in handling out-of-vocabulary words and morphological variations of words.

#### Comparison with Other Methods

- **Against Fixed Vocabulary Tokenization**: Unlike fixed vocabulary methods, BPE dynamically adjusts its vocabulary based on the corpus, reducing issues with unknown tokens.
- **Versus Unigram Language Models**: BPE merges based on frequency, while unigram language models use likelihood to determine subwords, leading to potentially more linguistically meaningful tokens in the unigram approach.
- **With WordPiece**: BPE and WordPiece are similar, but WordPiece optimizes for language model likelihood rather than just frequency, which can lead to slightly different tokenizations.

### Pseudocode for Encoding Table Information
1. **Chunk Data Columns by 10**: Begin by chunking your data columns into segments of 10 for more manageable processing.

    ```python
    chunks = chunk_data(data_columns, chunk_size=10)
    ```

2. **Analyze Each Chunk**: For each chunk, sample entries to find common regex patterns, and use these patterns to facilitate further analysis.

    ```python
    for chunk in chunks:
        sample = get_sample(chunk, sample_size=100)  # Sample 100 entries
        regex_patterns = find_common_regex_patterns(sample)  # Identify patterns
        split_texts = split_text_using_patterns(sample, regex_patterns)  # Split text
    ```

3. **Combine and Tokenize Text**: Combine all the split text from each chunk along with the found bi-grams. Then tokenize the combined text.

    ```python
    combined_text = combine_all_split_text(chunks)
    tokens = tokenize(combined_text, combined_bi_grams)
    ```

4. **Apply Byte Pair Encoding (BPE)**: Use BPE on the tokens to refine the vocabulary size to the desired maximum.

    ```python
    final_vocab = byte_pair_encoding(tokens, max_vocab_size=desired_max_vocab_size)
    ```

5. **Embedding the Final Vocabulary**: Pass the final vocabulary through an embedding layer to receive the encoded values.

    ```python
    encoded_values = pass_through_embedding(final_vocab)
    ```
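
As a runnable illustration of step 1 only, column chunking could look like the sketch below. `chunk_data` here is a hypothetical stand-in for the helper named in the pseudocode, and the column names are made up.

```python
from typing import List, Sequence


def chunk_data(columns: Sequence[str], chunk_size: int = 10) -> List[List[str]]:
    """Split a sequence of column names into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(columns[i:i + chunk_size]) for i in range(0, len(columns), chunk_size)]


columns = [f"col_{i}" for i in range(23)]  # made-up column names
chunks = chunk_data(columns)
print([len(chunk) for chunk in chunks])  # [10, 10, 3]
```

The last chunk is simply shorter when the number of columns is not a multiple of the chunk size.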


### Pattern Matching with Regular Expressions

Here is the regex pattern used. It has been tested on the tabular strings and on general text. A slight disadvantage is that it can bloat the token list because of how the '***space***' character is handled, but the BPE process compensates for this:

```python
import string
import regex as re  # third-party `regex` module: needed for \p{L}, \p{N} and re.VERSION1

escaped_punctuation = ''.join('\\' + char if char in '.^$*+?{}[]\\|()' else char for char in string.punctuation)

# Pattern gotten with the help of ChatGPT
pattern = re.compile(
    r"""
    (?i)                                          # Case-insensitive matching
    \d{4}-\d{2}-\d{2}|                            # Dates in YYYY-MM-DD format
    \b\w+_\d+[A-Z]?\b|                            # Enhanced alphanumeric with specific formats
    \b(?:has|value|is|empty|true|false|           # Specific keywords
    active\scontract|closed\scontract)\b|
    \b[A-Z0-9_]+(?<!\s)\b|                        # Uppercase identifiers including numbers and underscores, not preceded by whitespace
    \b\p{L}+(?:'\p{L}+)*\b(?:[{}]+)?|             # Match words with optional trailing punctuation
    \b\p{N}+\b(?:[{}]+)?|                         # Match whole numbers with optional trailing punctuation
    \$?\d+(?:,\d{3})*(?:\.\d+)?(?:[KMBT])?\b|     # Monetary values, including optional $
    [^\s\p{L}\p{N}]+|                             # Match any character not a space, letter, or number
    \s+(?!\S)|\s+|                                # Spaces
    """ + rf"[{escaped_punctuation}]+|",
    re.VERSION1 | re.VERBOSE | re.IGNORECASE
)
```


### Pattern example on plain text
Below is an example of text split with the derived pattern:
```python
new_txt = "On 2024-03-13, the project 'GeoData_Analysis_2024' officially kicked off. The team had previously discussed several key points, emphasizing the importance of accuracy and efficiency. \
As of today, there have been 152 issues logged, with 47 marked as 'resolved' and the remaining awaiting review. Interestingly, the budget allocated for this phase is 3,450,000.2456, \
which is under it's $3.5M cap suggested in the initial proposal."

output:

Length of split text: 429
['On', ' ', '2024-03-13', ',', ' ', 'the', ' ', 'project', ' ', "'", 'GeoData_Analysis_2024', "'", ' ', 'officially', ' ', 'kicked', ' ', 'off', '.', ' ', 'The', ' ', 'team', ' ', 'had', ' ', 'previously', ' ', 'discussed', ' ', 'several', ' ', 'key', ' ', 'points', ',', ' ', 'emphasizing', ' ', 'the', ' ', 'importance', ' ', 'of', ' ', 'accuracy', ' ', 'and', ' ', 'efficiency', '.', ' ', 'As', ' ', 'of', ' ', 'today', ',', ' ', 'there', ' ', 'have', ' ', 'been', ' ', '152', ' ', 'issues', ' ', 'logged', ',', ' ', 'with', ' ', '47', ' ', 'marked', ' ', 'as', ' ', "'", 'resolved', "'", ' ', 'and', ' ', 'the', ' ', 'remaining', ' ', 'awaiting', ' ', 'review', '.', ' ', 'Interestingly', ',', ' ', 'the', ' ', 'budget', ' ', 'allocated', ' ', 'for', ' ', 'this', ' ', 'phase', ' ', 'is', ' ', '3', ',', '450', ',', '000', '.', '2456', ',', ' ', 'which', ' ', 'is', ' ', 'under', ' ', 'it', "'", 's', ' ', '$3.5M', ' ', 'cap', ' ', 'suggested', ' ', 'in', ' ', 'the', ' ', 'initial', ' ', 'proposal', '.', '']
```
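
Why the order of alternatives matters can be seen with a much smaller pattern. The sketch below is a simplified stand-in built on the standard-library `re` module, not the pattern used in this project; `TOKEN_PATTERN`, `NAIVE_PATTERN`, and `split_text` are hypothetical names.

```python
import re

# Simplified stand-in: the date alternative comes first so dates stay whole.
TOKEN_PATTERN = re.compile(
    r"""
    \d{4}-\d{2}-\d{2}   # dates in YYYY-MM-DD format
    | \w+               # words, numbers, identifiers such as annuity_780A
    | \s+               # runs of whitespace
    | [^\w\s]+          # runs of punctuation
    """,
    re.VERBOSE,
)

# Same alternatives without the date rule, for comparison.
NAIVE_PATTERN = re.compile(r"\w+|\s+|[^\w\s]+")


def split_text(text: str, pattern: re.Pattern = TOKEN_PATTERN) -> list:
    """Split text into tokens; every character ends up in exactly one token."""
    return pattern.findall(text)


text = "annuity_780A is empty, 2024-03-13"
tokens = split_text(text)
print(tokens)
print(split_text(text, NAIVE_PATTERN))
```

With the date alternative listed first, `2024-03-13` survives as one token; without it, the same date is broken apart at the hyphens.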

### Running a full example of the encoding-to-decoding process

In [2]:
# Internally, the test function uses the new_txt shown above. It walks through the process from encoding to decoding and gives an idea of the compression ratio BPE can achieve.
import regex as re
operation = BytePairEncodeAlgo(pattern=re.compile(""), IQr_Mult=1.5, IQR_Iter=5, compression_ratio=0.5)
print(operation.test())


Orginal_text: 
On 2024-03-13, the project 'GeoData_Analysis_2024' officially kicked off.The team had previously discussed several key points, emphasizing the importance of accuracy and efficiency. As of today, there have been 152 issues logged, with 47 marked as 'resolved' and the remaining awaiting review. Interestingly, the budget allocated for this phase is 3,450,000.2456, which is under it's $3.5M cap suggested in the initial proposal.

Running script to encode

Encoding the Text
Pre Merge: 
[79, 110, 32, 50, 48, 50, 52, 45, 48, 51, 45, 49, 51, 44, 32, 116, 104, 101, 32, 112, 114, 111, 106, 101, 99, 116, 32, 39, 71, 101, 111, 68, 97, 116, 97, 95, 65, 110, 97, 108, 121, 115, 105, 115, 95, 50, 48, 50, 52, 39, 32, 111, 102, 102, 105, 99, 105, 97, 108, 108, 121, 32, 107, 105, 99, 107, 101, 100, 32, 111, 102, 102, 46, 84, 104, 101, 32, 116, 101, 97, 109, 32, 104, 97, 100, 32, 112, 114, 101, 118, 105, 111, 117, 115, 108, 121, 32, 100, 105, 115, 99, 117, 115, 115, 101, 100, 32, 115, 101,

## Creating Vocabulary for Project

In [2]:
# %pip install tqdm
# from tqdm import tqdm
# from itertools import islice

# ## Encoding current table information
# main_vocab = BytePairEncodeAlgo(pattern=config.BPE_CONFIG['Pattern'], IQr_Mult=config.BPE_CONFIG['IQr_Mult'], IQR_Iter=config.BPE_CONFIG['IQR_Iter'], compression_ratio=config.BPE_CONFIG['Compression_Ratio'])
# text_corpus = config.MAIN_TABLE_STORAGE + 'row_converted.txt'

# # First, determine the total number of lines
# total_lines = sum(1 for _ in open(text_corpus, 'r'))
# print(f"Total number of lines to process: {total_lines}")

# # Define the batch size
# idx = 0
# batch_size = 20
# with open(text_corpus, 'r') as file:
#     progress_bar = tqdm(total=total_lines, desc="Processing")
    
#     while True:
#         # Read a batch of lines
#         lines = list(islice(file, batch_size))
#         if not lines:
#             break  # Exit loop if no more lines
            
#         curr_size = len(lines)    
#         text = '\n'.join(lines)    
        
#         # Process each line in the batch
#         main_vocab.process(text)
        
#         # Update progress bar by the number of lines processed in this batch
#         progress_bar.update(curr_size)
#         idx += curr_size
#         # Optional: print the vocabulary size whenever progress lands exactly on a multiple of 10%
#         current_percentage = ((idx + 1) / total_lines) * 100
#         if current_percentage  % 10 == 0:
#             print(f"The Length of vocabulary is {len(main_vocab.get_vocabulary())}")

# # Close the progress bar
# progress_bar.close()
        
# print(f"Encoding complete after processing {total_lines} lines. With {config.BPE_CONFIG['Compression_Ratio']}% compression ratio for each line. The Length of vocabulary is {main_vocab.get_vocab_length()}")
# main_vocab.save_state(config.BASE_LOCATION)        

In [3]:
# print(f"Encoding complete after processing {total_lines} lines. With {config.BPE_CONFIG['Compression_Ratio']}% compression ratio for each line. The Length of vocabulary is {main_vocab.get_vocab_length()}")        
# main_vocab.save_state(config.BASE_LOCATION)     
# main_vocab.get_vocab_length()

main_vocab = BytePairEncodeAlgo()
main_vocab.load_state(config.BASE_LOCATION)
print(main_vocab.get_vocab_length())

23550


In [5]:
print(main_vocab.test())


Orginal_text: 
On 2024-03-13, the project 'GeoData_Analysis_2024' officially kicked off.The team had previously discussed several key points, emphasizing the importance of accuracy and efficiency. As of today, there have been 152 issues logged, with 47 marked as 'resolved' and the remaining awaiting review. Interestingly, the budget allocated for this phase is 3,450,000.2456, which is under it's $3.5M cap suggested in the initial proposal.

Running script to encode

Encoding the Text
Pre Merge: 
[79, 110, 32, 50, 48, 50, 52, 45, 48, 51, 45, 49, 51, 44, 32, 116, 104, 101, 32, 112, 114, 111, 106, 101, 99, 116, 32, 39, 71, 101, 111, 68, 97, 116, 97, 95, 65, 110, 97, 108, 121, 115, 105, 115, 95, 50, 48, 50, 52, 39, 32, 111, 102, 102, 105, 99, 105, 97, 108, 108, 121, 32, 107, 105, 99, 107, 101, 100, 32, 111, 102, 102, 46, 84, 104, 101, 32, 116, 101, 97, 109, 32, 104, 97, 100, 32, 112, 114, 101, 118, 105, 111, 117, 115, 108, 121, 32, 100, 105, 115, 99, 117, 115, 115, 101, 100, 32, 115, 101,

In [7]:
new_text = "actualdpdtolerance_344P is empty, amtinstpaidbefduel24m_4187115A is empty, annuity_780A has value, annuitynextmonth_57A has value, \
applicationcnt_361L has value, applications30d_658L has value, applicationscnt_1086L has value, applicationscnt_464L has value, applicationscnt_629L has value, \
applicationscnt_867L has value, avgdbddpdlast24m_3658932P is empty, avgdbddpdlast3m_4187120P is empty, avgdbdtollast24m_4525197P is empty, avgdpdtolclosure24_3658938P is \
empty, avginstallast24m_3658937A is empty, avglnamtstart24m_4525187A is empty, avgmaxdpdlast9m_3716943P is empty, avgoutstandbalancel6m_4187114A is empty, avgpmtlast12m_4525200A is empty, \
bankacctype_710L is empty, cardtype_51L is empty, clientscnt12m_3712952L has value, clientscnt3m_3712950L has value, clientscnt6m_3712949L has value, clientscnt_100L has value, clientscnt_1022L has value, \
clientscnt_1071L has value, clientscnt_1130L has value, clientscnt_136L is empty, clientscnt_157L has value, clientscnt_257L has value, clientscnt_304L has value, clientscnt_360L has value, clientscnt_493L has value, \
clientscnt_533L has value, clientscnt_887L has value, clientscnt_946L has value, cntincpaycont9m_3716944L is empty, cntpmts24_3658933L is empty, commnoinclast6m_3546845L has value, credamount_770A has value, \
credtype_322L value is CAL, currdebt_22A has value, currdebtcredtyperange_828A has value, datefirstoffer_1144D is empty, datelastinstal40dpd_247D is empty, datelastunpaid_3546854D is empty, \
daysoverduetolerancedd_3976961L is empty, deferredmnthsnum_166L has value, disbursedcredamount_1113A has value, disbursementtype_67L value is GBA, downpmt_116A has value, dtlastpmtallstes_4499206D is empty, \
eir_270L has value, equalitydataagreement_891L is empty, equalityempfrom_62L is empty, firstclxcampaign_1125D is empty, firstdatedue_489D is empty, homephncnt_628L has value, inittransactionamount_650A is empty, \
inittransactioncode_186L value is CASH, interestrate_311L has value, interestrategrace_34L is empty, isbidproduct_1095L is False, isbidproductrequest_292L is empty, isdebitcard_729L is empty, \
lastactivateddate_801D is empty, lastapplicationdate_877D is empty, lastapprcommoditycat_1041M value is a55475b1, lastapprcommoditytypec_5251766M value is a55475b1, lastapprcredamount_781A is empty, \
lastapprdate_640D is empty, lastcancelreason_561M value is a55475b1, lastdelinqdate_224D is empty, lastdependentsnum_448L is empty, lastotherinc_902A is empty, lastotherlnsexpense_631A is empty, \
lastrejectcommoditycat_161M value is a55475b1, lastrejectcommodtypec_5251769M value is a55475b1, lastrejectcredamount_222A is empty, lastrejectdate_50D is empty, lastrejectreason_759M value is a55475b1, \
lastrejectreasonclient_4145040M value is a55475b1, lastrepayingdate_696D is empty, lastst_736L is empty, maininc_215A is empty, mastercontrelectronic_519L has value, mastercontrexist_109L has value, \
maxannuity_159A has value, maxannuity_4075009A is empty, maxdbddpdlast1m_3658939P is empty, maxdbddpdtollast12m_3658940P is empty, maxdbddpdtollast6m_4187119P is empty, maxdebt4_972A has value, \
maxdpdfrom6mto36m_3546853P has value, maxdpdinstldate_3546855D is empty, maxdpdinstlnum_3546846P is empty, maxdpdlast12m_727P has value, maxdpdlast24m_143P has value, maxdpdlast3m_392P has value, \
maxdpdlast6m_474P has value, maxdpdlast9m_1059P has value, maxdpdtolerance_374P has value, maxinstallast24m_3658928A is empty, maxlnamtstart6m_4525199A is empty, maxoutstandbalancel12m_4187113A is empty, \
maxpmtlast3m_4525190A is empty, mindbddpdlast24m_3658935P is empty, mindbdtollast24m_4525191P is empty, mobilephncnt_593L has value, monthsannuity_845L is empty, numactivecreds_622L has value, \
numactivecredschannel_414L has value, numactiverelcontr_750L has value, numcontrs3months_479L has value, numincomingpmts_3546848L is empty, numinstlallpaidearly3d_817L is empty, numinstls_657L has value, \
numinstlsallpaid_934L is empty, numinstlswithdpd10_728L is empty, numinstlswithdpd5_4187116L is empty, numinstlswithoutdpd_562L is empty, numinstmatpaidtearly2d_4499204L is empty, numinstpaid_4499208L is empty, \
numinstpaidearly3d_3546850L is empty, numinstpaidearly3dest_4493216L is empty, numinstpaidearly5d_1087L is empty, numinstpaidearly5dest_4493211L is empty, numinstpaidearly5dobd_4499205L is empty, \
numinstpaidearly_338L is empty, numinstpaidearlyest_4493214L is empty, numinstpaidlastcontr_4325080L is empty, numinstpaidlate1d_3546852L is empty, numinstregularpaid_973L is empty, \
numinstregularpaidest_4493210L is empty, numinsttopaygr_769L is empty, numinsttopaygrest_4493213L is empty, numinstunpaidmax_3546851L is empty, numinstunpaidmaxest_4493212L is empty, \
numnotactivated_1143L has value, numpmtchanneldd_318L has value, numrejects9m_859L has value, opencred_647L is empty, paytype1st_925L value is OTHER, paytype_783L value is OTHER, \
payvacationpostpone_4187118D is empty, pctinstlsallpaidearl3d_427L is empty, pctinstlsallpaidlat10d_839L is empty, pctinstlsallpaidlate1d_3546856L is empty, pctinstlsallpaidlate4d_3546849L is empty, \
pctinstlsallpaidlate6d_3546844L is empty, pmtnum_254L has value, posfpd10lastmonth_333P has value, posfpd30lastmonth_3976960P has value, posfstqpd30lastmonth_3976962P is empty, \
previouscontdistrict_112M value is a55475b1, price_1097A is empty, sellerplacecnt_915L has value, sellerplacescnt_216L has value, sumoutstandtotal_3546847A is empty, sumoutstandtotalest_4493215A is empty, \
totaldebt_9A has value, totalsettled_863A has value, totinstallast1m_4525188A is empty, twobodfilling_608L value is BO, typesuite_864L is empty, validfrom_1069D is empty, assignmentdate_238D is empty, \
assignmentdate_4527235D is empty, assignmentdate_4955616D is empty, birthdate_574D is empty, contractssum_5085716L is empty, dateofbirth_337D is empty, dateofbirth_342D is empty, days120_123L is empty, \
days180_256L is empty, days30_165L is empty, days360_512L is empty, days90_310L is empty, description_5085714M is empty, education_1103M is empty, education_88M is empty, firstquarter_103L is empty, \
for3years_128L is empty, for3years_504L is empty, for3years_584L is empty, formonth_118L is empty, formonth_206L is empty, formonth_535L is empty, forquarter_1017L is empty, forquarter_462L is empty, \
forquarter_634L is empty, fortoday_1092L is empty, forweek_1077L is empty, forweek_528L is empty, forweek_601L is empty, foryear_618L is empty, foryear_818L is empty, foryear_850L is empty, \
fourthquarter_440L is empty, maritalst_385M is empty, maritalst_893M is empty, numberofqueries_373L is empty, pmtaverage_3A is empty, pmtaverage_4527227A is empty, pmtaverage_4955615A is empty, \
pmtcount_4527229L is empty, pmtcount_4955617L is empty, pmtcount_693L is empty, pmtscount_423L is empty, pmtssum_45A is empty, requesttype_4525192L is empty, responsedate_1012D is empty, \
responsedate_4527233D is empty, responsedate_4917613D is empty, riskassesment_302T is empty, riskassesment_940T is empty, secondquarter_766L is empty, thirdquarter_1082L is empty, WEEK_NUM has value"

new_text = new_text.replace(",", "")

enc_text = main_vocab.encode_text(new_text, True)


Pre Merge: 
[97, 99, 116, 117, 97, 108, 100, 112, 100, 116, 111, 108, 101, 114, 97, 110, 99, 101, 95, 51, 52, 52, 80, 32, 105, 115, 32, 101, 109, 112, 116, 121, 32, 97, 109, 116, 105, 110, 115, 116, 112, 97, 105, 100, 98, 101, 102, 100, 117, 101, 108, 50, 52, 109, 95, 52, 49, 56, 55, 49, 49, 53, 65, 32, 105, 115, 32, 101, 109, 112, 116, 121, 32, 97, 110, 110, 117, 105, 116, 121, 95, 55, 56, 48, 65, 32, 104, 97, 115, 32, 118, 97, 108, 117, 101, 32, 97, 110, 110, 117, 105, 116, 121, 110, 101, 120, 116, 109, 111, 110, 116, 104, 95, 53, 55, 65, 32, 104, 97, 115, 32, 118, 97, 108, 117, 101, 32, 97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 99, 110, 116, 95, 51, 54, 49, 76, 32, 104, 97, 115, 32, 118, 97, 108, 117, 101, 32, 97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 115, 51, 48, 100, 95, 54, 53, 56, 76, 32, 104, 97, 115, 32, 118, 97, 108, 117, 101, 32, 97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 115, 99, 110, 116, 95, 49, 48, 56, 54, 76, 32, 104, 97, 115, 32, 118, 97, 1

In [6]:
txt = "Goodnight! Hello, I'm back again. I basically have only two interests nowadays: languages and furries. What? Oh, sorry, I thought you knew I was a furry. Haha, oops. Anyway, yeah, I'm a furry, but since I'm a young furry, I can't really do as much as I would like to do in the fandom. When I'm older, I would like to have a fursuit, go to furry conventions, all that stuff. But for now I can only dream of that. Sorry you had to deal with me talking about furries, but I'm honestly very desperate for this to be the longest text ever."

enc_text = main_vocab.encode_text(txt, True)

Pre Merge: 
[71, 111, 111, 100, 110, 105, 103, 104, 116, 33, 32, 72, 101, 108, 108, 111, 44, 32, 73, 39, 109, 32, 98, 97, 99, 107, 32, 97, 103, 97, 105, 110, 46, 32, 73, 32, 98, 97, 115, 105, 99, 97, 108, 108, 121, 32, 104, 97, 118, 101, 32, 111, 110, 108, 121, 32, 116, 119, 111, 32, 105, 110, 116, 101, 114, 101, 115, 116, 115, 32, 110, 111, 119, 97, 100, 97, 121, 115, 58, 32, 108, 97, 110, 103, 117, 97, 103, 101, 115, 32, 97, 110, 100, 32, 102, 117, 114, 114, 105, 101, 115, 46, 32, 87, 104, 97, 116, 63, 32, 79, 104, 44, 32, 115, 111, 114, 114, 121, 44, 32, 73, 32, 116, 104, 111, 117, 103, 104, 116, 32, 121, 111, 117, 32, 107, 110, 101, 119, 32, 73, 32, 119, 97, 115, 32, 97, 32, 102, 117, 114, 114, 121, 46, 32, 72, 97, 104, 97, 44, 32, 111, 111, 112, 115, 46, 32, 65, 110, 121, 119, 97, 121, 44, 32, 121, 101, 97, 104, 44, 32, 73, 39, 109, 32, 97, 32, 102, 117, 114, 114, 121, 44, 32, 98, 117, 116, 32, 115, 105, 110, 99, 101, 32, 73, 39, 109, 32, 97, 32, 121, 111, 117, 110, 103, 32, 102, 