# Creating Vocabulary for Project

In [1]:
import sys
import os

# Resolve the notebook's directory, then locate the project root and its packages
script_dir = os.path.dirname(os.path.abspath('vocab.ipynb'))
parent_directory = os.path.dirname(script_dir)
module_directory = os.path.join(parent_directory, 'module')
utils_directory = os.path.join(parent_directory, 'utils')

# Make the project root, `module`, and `utils` importable
for directory in (parent_directory, module_directory, utils_directory):
    if directory not in sys.path:
        sys.path.append(directory)

In [1]:
from module.preprocess.bpe import BpeArgs, Encoder
from module.preprocess.load_and_batch import BatchMeta
from module.preprocess.load_and_batch import TableInfoManagers
from tqdm import tqdm
from utils import config

# Loading the dataset

In [3]:
data = TableInfoManagers.load_data_downloaded(config.DATA_LOCATION, cat= "train")
data.head()

------> Size of train dataset: (1526659, 224) <------


case_id,actualdpdtolerance_344P,amtinstpaidbefduel24m_4187115A,annuity_780A,annuitynextmonth_57A,applicationcnt_361L,applications30d_658L,applicationscnt_1086L,applicationscnt_464L,applicationscnt_629L,applicationscnt_867L,avgdbddpdlast24m_3658932P,avgdbddpdlast3m_4187120P,avgdbdtollast24m_4525197P,avgdpdtolclosure24_3658938P,avginstallast24m_3658937A,avglnamtstart24m_4525187A,avgmaxdpdlast9m_3716943P,avgoutstandbalancel6m_4187114A,avgpmtlast12m_4525200A,bankacctype_710L,cardtype_51L,clientscnt12m_3712952L,clientscnt3m_3712950L,clientscnt6m_3712949L,clientscnt_100L,clientscnt_1022L,clientscnt_1071L,clientscnt_1130L,clientscnt_136L,clientscnt_157L,clientscnt_257L,clientscnt_304L,clientscnt_360L,clientscnt_493L,clientscnt_533L,clientscnt_887L,…,formonth_118L,formonth_206L,formonth_535L,forquarter_1017L,forquarter_462L,forquarter_634L,fortoday_1092L,forweek_1077L,forweek_528L,forweek_601L,foryear_618L,foryear_818L,foryear_850L,fourthquarter_440L,maritalst_385M,maritalst_893M,numberofqueries_373L,pmtaverage_3A,pmtaverage_4527227A,pmtaverage_4955615A,pmtcount_4527229L,pmtcount_4955617L,pmtcount_693L,pmtscount_423L,pmtssum_45A,requesttype_4525192L,responsedate_1012D,responsedate_4527233D,responsedate_4917613D,riskassesment_302T,riskassesment_940T,secondquarter_766L,thirdquarter_1082L,date_decision,MONTH,WEEK_NUM,target
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,f64,f64,f64,str,f64,f64,f64,f64,f64,f64,f64,…,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,f64,f64,f64,f64,str,str,f64,f64,f64,str,date,date,date,str,f64,f64,f64,date,i64,i64,i64
0,,,1917.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2019-01-03,201901,0,0
1,,,3134.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"""0.0""",3.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2019-01-03,201901,0,0
2,,,4937.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2019-01-04,201901,0,0
3,,,4643.6,0.0,0.0,1.0,0.0,2.0,0.0,1.0,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2019-01-03,201901,0,0
4,,,3390.2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,…,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2019-01-04,201901,0,1


# Building Vocabulary

## Overview
To construct the vocabulary, a rule-based approach is employed to segment the text into tokens. Different categories of words are processed according to their characteristics.

## Categories
- **Common Words**: Utilizing the NLTK library, common words are identified and stored in a `common` vocabulary.
- **Special Words**: This category includes identifiers such as `<tags>` for numeric types and column headers, which are crucial for semantic processing.
- **Base Words**: Includes all tokens that can be encoded using ASCII characters, forming the foundation of the vocabulary.
- **Paired Words**: Refers to combinations of words merged during the Byte Pair Encoding (BPE) process, crucial for efficient encoding of common word pairings.
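
As a rough illustration of how these categories could be combined into a single token-to-id table, here is a minimal sketch. The tag names, the short word list (standing in for an NLTK-derived list), and the `assemble_vocab` helper are all assumptions made for the example; they are not taken from the project's `Encoder`.

```python
# Stand-in values: the project's real tag names and NLTK word list are not shown here.
SPECIAL_TOKENS = ["CLS", "PAD", "<num>", "<col>"]  # hypothetical special words / tags
COMMON_WORDS = ["the", "of", "and", "to"]          # placeholder for an NLTK-derived list
BASE_TOKENS = [chr(i) for i in range(128)]         # every ASCII character


def assemble_vocab(special, base, common):
    """Assign consecutive ids: special tokens first, then base, then common words."""
    vocab = {}
    for token in [*special, *base, *common]:
        if token not in vocab:  # keep the first id if a token repeats
            vocab[token] = len(vocab)
    return vocab


vocab = assemble_vocab(SPECIAL_TOKENS, BASE_TOKENS, COMMON_WORDS)
print(len(vocab), vocab["CLS"], vocab["a"])
```

Paired words would be appended to such a table later, one entry per merge performed during BPE training.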

# Training Process

## Methodology
The training employs a combination of Byte Pair Encoding (BPE) for subwords and common words. Special words or tags are exempt from pairing during BPE to preserve their uniqueness and facilitate easy identification.
- **Unique Handling of Special Words**: During the BPE algorithm, special words are not merged into pairs, ensuring that these elements remain distinct for subsequent embedding analysis.
- **Use of IQR**: An Interquartile Range (IQR) method is used to determine which pairs to merge. Only pairs that are clear outliers in the text snippet are merged, based on the premise that merging less frequent, outlier pairs prevents the loss of important information.
- **Threshold Adjustment**: The current IQR threshold is set at 1.5, but this parameter is adjustable in future experiments to optimize the merging strategy.
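
The points above can be sketched as a single merge step. This is a minimal illustration, not the project's `Encoder`: it assumes that "outlier" means a pair count above the upper fence `Q3 + 1.5 * IQR`, and the tag name `<num>` and all function names are hypothetical.

```python
from collections import Counter
import statistics

SPECIAL = {"<num>", "<col>"}  # hypothetical tags that must never be merged


def count_pairs(tokens, special=SPECIAL):
    """Count adjacent token pairs, skipping any pair that touches a special token."""
    counts = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left in special or right in special:
            continue
        counts[(left, right)] += 1
    return counts


def outlier_pairs(counts, k=1.5):
    """Return pairs whose count exceeds Q3 + k * IQR (assumed outlier rule)."""
    values = sorted(counts.values())
    if len(values) < 2:  # quantiles need at least two data points
        return []
    q1, _, q3 = statistics.quantiles(values, n=4)
    threshold = q3 + k * (q3 - q1)
    return [pair for pair, count in counts.items() if count > threshold]


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = list("abcabdabeabfabg") + ["<num>", "a", "b"]
selected = outlier_pairs(count_pairs(tokens))
merged = tokens
for pair in selected:
    merged = merge_pair(merged, pair)
print(selected, len(tokens), len(merged))
```

In this toy input only the pair `("a", "b")` stands out from the other pair counts, so it is the only one merged, and the `<num>` tag stays a separate token. Raising or lowering `k` corresponds to adjusting the threshold mentioned above.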

## Code Snippet

```python
from module.preprocess.load_and_batch import TableHandler

# Build the vocabulary: convert every row to text and feed it to the encoder.
# Note: this reads the notebook-level `data` frame loaded earlier.
def build(enc: Encoder, verbose: bool = False, batch_size: int = 128):
    col_types = data.dtypes
    col_names = data.columns
    ignore_list = ["case_id", "target"]  # columns passed to row_to_text as ignored
    agg_info = BatchMeta()

    with tqdm(total=data.height, desc="Processing batch") as pbar:
        for i in range(0, data.height, batch_size):
            # slice(offset, length): clamp the length for the final partial batch
            batch = data.slice(i, min(batch_size, data.height - i))

            for row in batch.rows():
                TableHandler.row_to_text(row, agg_info, col_types, col_names, ignore_list, output=[])
                enc.process(agg_info.texts[-1])
                if verbose:
                    print(agg_info.texts[-1])
                pbar.update(1)

            # Process the batch here if needed
            agg_info = BatchMeta()  # Reset for the next batch
        # No explicit pbar.close() is needed: the `with` block closes the bar.

# Running: create BPE, Encode 
bpe_args = BpeArgs(
    target_context = 256,
    max_vocab_size = 10000, # controls the total vocab size
    store_loc = config.BASE_LOCATION, # Where to store or load information post
    adhoc_tokens = ['CLS', 'PAD'], # List of special tokens to be included
    adhoc_words = data.columns # Special words that aren't key tags but should be included
)

text_encoder = Encoder(bpe_args)
build(text_encoder, verbose=False)  
text_encoder.save_state(config.BASE_LOCATION)
```


## Remarks
- The use of IQR in BPE training is a novel approach that adds an analytical dimension to the merging process, potentially enhancing the model's ability to recognize and utilize important lexical features from the text. This strategy is postulated to identify the most performant features from all columns after training, although further experimentation and validation are needed.
- The total vocabulary length is **861** tokens, and the encoding averages a compression rate of roughly **60%**.

# Example of Encoded Text

In [6]:
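from module.preprocess.load_and_batch import TableHandler
import random

# Sample 5 consecutive rows from the loaded dataset, starting at a random offset
# (the upper bound leaves room for a full 5-row slice)
random_index = random.randint(0, data.height - 5)
trial = data.slice(random_index, 5)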
col_types = data.dtypes
col_names = data.columns
ignore_list = ["case_id", 'target', 'WEEK_NUM']
agg_info = BatchMeta()

encoder = Encoder(None)
encoder.load_state(config.BASE_LOCATION)

for row in trial.rows():
    TableHandler.row_to_text(row, agg_info, col_types, col_names, ignore_list, output=[])
    encoder.encode_text(agg_info.texts[-1], verbose=True)


Empty instantiation. Ensure to load from pickle file location
+-----------------------------------------------+
|               Analysis Results                |
+-----------------------------------------------+
| Length of original list              :   1992 |
| Number of unique tokens in original  :    262 |
| Length of compressed list            :    717 |
| Number of unique tokens in compressed:    254 |
| Final compression ratio              : 0.640  |
| Length of vocabulary                 :    861 |
+-----------------------------------------------+
+-----------------------------------------------+
|               Analysis Results                |
+-----------------------------------------------+
| Length of original list              :   1953 |
| Number of unique tokens in original  :    261 |
| Length of compressed list            :    712 |
| Number of unique tokens in compressed:    254 |
| Final compression ratio              : 0.635  |
| Length of vocabulary                

# Conclusion
As can be seen, the compressed length tends to fall in the **700-800** range. To accommodate new tokens that have not been encountered before, we set the context length for the algorithm to 1024, which will be used during the training process. This allows flexibility at runtime for longer text.
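
The reported figures can be checked with a few lines of arithmetic. The sketch below assumes the compression ratio is computed as `1 - compressed / original`; that formula reproduces the values printed above, but the encoder's own calculation is not shown in this notebook. The function names are illustrative only.

```python
CONTEXT_LEN = 1024  # context length chosen in the conclusion above


def compression_ratio(original_len, compressed_len):
    """Fraction of tokens removed by encoding (assumed formula)."""
    return 1 - compressed_len / original_len


def headroom(compressed_len, context_len=CONTEXT_LEN):
    """Number of token slots left in the context after encoding."""
    return context_len - compressed_len


# (original length, compressed length) pairs taken from the analysis output
samples = [(1992, 717), (1953, 712)]
for original, compressed in samples:
    ratio = compression_ratio(original, compressed)
    print(f"ratio={ratio:.3f} headroom={headroom(compressed)}")
```

Both samples leave about 300 slots free under a 1024-token context, which is the margin reserved for longer rows.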