In [1]:
!bash /home/azureuser/cloudfiles/code/blobfuse/blobfuse_raadsinformatie.sh

In [2]:
import sys
sys.path.append("..")

# Select where to run notebook: "azure" or "local"
my_run = "azure"

import my_secrets as sc
import settings as st

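if my_run == "azure":
    import config_azure as cf
elif my_run == "local":
    import config as cf
else:
    # Fail early instead of hitting a NameError on `cf` further down.
    raise ValueError(f'my_run must be "azure" or "local", got {my_run!r}')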


import os
if my_run == "azure":
    # Create the Hugging Face cache directory if it does not exist yet.
    os.makedirs(cf.HUGGING_CACHE, exist_ok=True)
    os.environ["TRANSFORMERS_CACHE"] = cf.HUGGING_CACHE

import pandas as pd

## Notebook Overview

Goal: explain and show the contents of the different kinds of files used in this project. Over the course of the project, many extra columns were added to the data, prediction and overview files, so this notebook explains every column.

Different kinds of files:
1. txtfiles_notcleaned.pkl
2. txtfiles.pkl
3. overview.pkl
4. predictions.pkl (the columns differ between the baselines and the LLMs)

### Common columns
The following columns are included in (almost) all files:
- label = true class of the doc.
- path = path to the OCR file.
- id = document id.
- text = text extracted from the OCR file.
- num_pages = number of pages in the PDF file.

### 1. txtfiles_notcleaned.pkl

This file contains all docs that were scraped. 

Columns:
- tokens = text split into tokens using the NLTK library. These tokens are not used in the experiments, but were needed for the preliminary analysis of the data.
- token_count = number of tokens in the text, split using the NLTK library. Equals the length of the tokens column.
- clean_tokens = the tokens column with stopwords and punctuation removed. These tokens are not used in the experiments, but were needed for the preliminary analysis of the data.
- clean_tokens_count = number of tokens left after cleaning. Equals the length of the clean_tokens column.
- pdf_path = path to the PDF file. Each doc is saved in multiple versions, including the OCR version and the PDF version, which is why path and pdf_path are separate columns.
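
As a rough illustration of how the four token columns relate to each other, the sketch below derives them from a short text. It is not the project's preprocessing code: a simple regular expression stands in for the NLTK tokenizer so that the example runs without downloading NLTK data, and the stopword set `DUTCH_STOPWORDS` and the helpers `tokenize` and `is_punctuation` are made up for this example.

```python
import re
import string

# Hypothetical, tiny stopword set; the project relies on NLTK instead.
DUTCH_STOPWORDS = {"de", "het", "een", "van", "en", "voor"}


def tokenize(text):
    """Split text into word and punctuation tokens (stand-in for NLTK)."""
    return re.findall(r"\w+|[^\w\s]", text)


def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return all(ch in string.punctuation for ch in token)


text = "Gemeente Amsterdam\n% Gemeenteraad R\nMotie van de raad."

# tokens / token_count
tokens = tokenize(text)
token_count = len(tokens)

# clean_tokens / clean_tokens_count: drop stopwords and punctuation
clean_tokens = [
    tok for tok in tokens
    if tok.lower() not in DUTCH_STOPWORDS and not is_punctuation(tok)
]
clean_tokens_count = len(clean_tokens)

print(token_count, clean_tokens_count)  # 10 6
print(clean_tokens)
```

With a real tokenizer the counts will differ, but the relation between the columns stays the same: each count column is the length of the corresponding token list.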

In [5]:
txtfiles_notcleaned = pd.read_pickle(f"{cf.output_path}/txtfiles_notcleaned.pkl")
print(f"The txtfiles_notcleaned file contains {len(txtfiles_notcleaned)} docs.")
print(f"Columns: {list(txtfiles_notcleaned.columns)}")
display(txtfiles_notcleaned.head(2))

The txtfiles_notcleaned file contains 33117 docs.
Columns: ['label', 'path', 'id', 'text', 'tokens', 'token_count', 'clean_tokens', 'clean_tokens_count', 'pdf_path', 'num_pages']


Unnamed: 0,label,path,id,text,tokens,token_count,clean_tokens,clean_tokens_count,pdf_path,num_pages
0,Motie,/home/azureuser/cloudfiles/code/blobfuse/raads...,0,Gemeente Amsterdam\n% Gemeenteraad R\n% Gemeen...,"[Gemeente, Amsterdam, %, Gemeenteraad, R, %, G...",395,"[Gemeente, Amsterdam, Gemeenteraad, Gemeentebl...",205,/home/azureuser/cloudfiles/code/blobfuse/raads...,2.0
1,Motie,/home/azureuser/cloudfiles/code/blobfuse/raads...,1,Gemeente Amsterdam\n\n% Gemeenteraad R\n\n% Ge...,"[Gemeente, Amsterdam, %, Gemeenteraad, R, %, G...",390,"[Gemeente, Amsterdam, Gemeenteraad, Gemeentebl...",197,/home/azureuser/cloudfiles/code/blobfuse/raads...,2.0


### 2. txtfiles.pkl

This file contains all documents that remain after removing faulty or unclean documents and duplicates. It is the file used during the research.

What is removed (compared to txtfiles_notcleaned):
- three classes that were deemed too unclean.
- faulty documents that only contained gibberish.
- duplicates.

Columns:
- 4split = indicates which subset the doc belongs to. In the 4split, the data is split into train, dev, test and val.
- 2split = indicates which subset the doc belongs to. In the 2split, the data is split into train and test.
- balanced_split = indicates which subset the doc belongs to. In the balanced_split, the data is split into train, test and val. This split is used for the final experiments.
- MistralTokens = text split into tokens using the Mistral tokenizer. Needed to shorten the docs for the experiments. Note: since Mistral and GEITje use the same tokenizer, an extra column GeitjeTokens is not needed.
- count_MistralTokens = number of tokens in MistralTokens, i.e. into how many tokens the Mistral tokenizer split the text.
- LlamaTokens = text split into tokens using the Llama tokenizer. Needed to shorten the docs for the experiments.
- count_LlamaTokens = number of tokens in LlamaTokens, i.e. into how many tokens the Llama tokenizer split the text.
- md5_hash = MD5 hash of the doc's content, which identifies that content. Used to remove duplicates.
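
The duplicate removal via md5_hash can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the hash is computed over the UTF-8 encoded text (the notebook does not say whether the text or the raw file is hashed), and the helper `md5_of_text` and the example docs are made up.

```python
import hashlib


def md5_of_text(text):
    """Return the hex MD5 digest of a document's text (assumed UTF-8)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Made-up example documents; id 2 has the same content as id 0.
docs = [
    {"id": 0, "text": "Gemeente Amsterdam\nMotie A"},
    {"id": 1, "text": "Gemeente Amsterdam\nMotie B"},
    {"id": 2, "text": "Gemeente Amsterdam\nMotie A"},
]

seen = set()
unique_docs = []
for doc in docs:
    doc["md5_hash"] = md5_of_text(doc["text"])
    if doc["md5_hash"] not in seen:  # keep only the first occurrence
        seen.add(doc["md5_hash"])
        unique_docs.append(doc)

print([doc["id"] for doc in unique_docs])  # [0, 1]
```

On a DataFrame, the same effect could be obtained with `drop_duplicates(subset="md5_hash")` once the hash column has been computed.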






In [6]:
txtfiles = pd.read_pickle(f"{cf.output_path}/txtfiles.pkl")
print(f"The txtfiles file contains {len(txtfiles)} docs.")
print(f"Columns: {list(txtfiles.columns)}")
display(txtfiles.head(2))

The txtfiles file contains 20818 docs.
Columns: ['label', 'path', 'id', 'text', 'num_pages', '4split', '2split', 'MistralTokens', 'count_MistralTokens', 'LlamaTokens', 'count_LlamaTokens', 'md5_hash', 'balanced_split']


Unnamed: 0,label,path,id,text,num_pages,4split,2split,MistralTokens,count_MistralTokens,LlamaTokens,count_LlamaTokens,md5_hash,balanced_split
0,Motie,/home/azureuser/cloudfiles/code/blobfuse/raads...,1874,x Gemeente Amsterdam R\nGemeenteraad\n% Gemeen...,1.0,train,train,"[▁x, ▁Geme, ente, ▁Amsterdam, ▁R, <0x0A>, G, e...",350,"[▁x, ▁Geme, ente, ▁Amsterdam, ▁R, <0x0A>, G, e...",346,2f09ba2c967bba0eecf71f846f258a78,discard
1,Motie,/home/azureuser/cloudfiles/code/blobfuse/raads...,230,X Gemeente Amsterdam R\nGemeenteraad\n% Gemeen...,2.0,train,train,"[▁X, ▁Geme, ente, ▁Amsterdam, ▁R, <0x0A>, G, e...",1130,"[▁X, ▁Geme, ente, ▁Amsterdam, ▁R, <0x0A>, G, e...",1082,d14b33c32ba1e1bcff16320891bdf158,discard


### 3. overview.pkl
Each overview file contains exactly one row of metadata per experiment. The overview files of the baselines and of the LLMs have the same structure, except that the baseline overview loaded at the end of this notebook lacks the weighted_avg_* columns.

Columns:
- model = model used to get predictions.
- run_id = id of the experiment, unique for each experiment.
- date = date and time when experiment finished running.
- train_set = the subset of the split_col that was used as training set. Depending on the split and goal of the experiment this can be either train or dev.
- test_set = the subset of the split_col that was used as test set. Depending on the split and goal of the experiment this can be either test or val.
- train_set_support = number of docs in the training set.
- test_set_support = number of docs in the test set.
- split_col = column that indicates the split, shows which split was used. Can be 2split, 4split or balanced_split.
- text_col = column that contained the text of the document. Indicates how doc was shortened.
- runtime = total time it took to run the experiment in seconds.
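- accuracy, macro_avg_precision, macro_avg_recall, macro_avg_f1, weighted_avg_precision, weighted_avg_recall, weighted_avg_f1, classification_report = evaluation scores. Macro averages weigh every class equally; weighted averages weigh each class by its number of docs in the test set.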
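
The text_col values seen in this project, such as `TruncationLlamaTokensFront200Back0`, appear to encode the tokenizer and the number of tokens kept from the front and the back of a doc. Under that assumed reading, the shortening could look like the sketch below. The function `truncate_front_back` is hypothetical and the project's own implementation may differ.

```python
def truncate_front_back(tokens, front, back):
    """Keep the first `front` and the last `back` tokens of a token list."""
    if front < 0 or back < 0:
        raise ValueError("front and back must be non-negative")
    if len(tokens) <= front + back:
        return list(tokens)  # short docs are kept whole
    return tokens[:front] + tokens[len(tokens) - back:]


tokens = [f"t{i}" for i in range(10)]

# Analogous to a "Front3Back2" setting
print(truncate_front_back(tokens, front=3, back=2))  # ['t0', 't1', 't2', 't8', 't9']

# Analogous to a "Front4Back0" setting
print(truncate_front_back(tokens, front=4, back=0))  # ['t0', 't1', 't2', 't3']
```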


In [27]:
llm_overview = pd.read_pickle(f"{cf.output_path}/predictionsFinal/finetuning/1epochs/overview.pkl")
print(f"The llm_overview file contains {len(llm_overview)} experiments.")
print(f"Columns: {list(llm_overview.columns)}")
display(llm_overview.head(2))

The llm_overview file contains 3 experiments.
Columns: ['model', 'run_id', 'date', 'train_set', 'test_set', 'train_set_support', 'test_set_support', 'split_col', 'text_col', 'runtime', 'accuracy', 'macro_avg_precision', 'macro_avg_recall', 'macro_avg_f1', 'weighted_avg_precision', 'weighted_avg_recall', 'weighted_avg_f1', 'classification_report']


Unnamed: 0,model,run_id,date,train_set,test_set,train_set_support,test_set_support,split_col,text_col,runtime,accuracy,macro_avg_precision,macro_avg_recall,macro_avg_f1,weighted_avg_precision,weighted_avg_recall,weighted_avg_f1,classification_report
0,AmsterdamDocClassificationMistral200T1Epochs,FT_AmsterdamDocClassificationMistral200T1Epoch...,2024-06-03 21:28:18.488289+02:00,train,test,9900,1100,balanced_split,TruncationLlamaTokensFront200Back0,24081.369143,0.895455,0.921469,0.895455,0.888446,0.921469,0.895455,0.888446,precision recall f1-s...
0,AmsterdamDocClassificationGEITje200T1Epochs,FT_AmsterdamDocClassificationGEITje200T1Epochs...,2024-06-04 10:25:38.531475+02:00,train,test,9900,1100,balanced_split,TruncationLlamaTokensFront200Back0,25634.273609,0.890909,0.925842,0.890909,0.881006,0.925842,0.890909,0.881006,precision recall f1-s...


In [None]:

baseline_overview = pd.read_pickle(f"{cf.output_path}/trial/overview.pkl")
baseline_predictions = pd.read_pickle(f"{cf.output_path}/trial/LinearSVCpredictions.pkl")
llm_overview = pd.read_pickle(f"{cf.output_path}/predictionsFinal/finetuning/1epochs/overview.pkl")
llm_predictions =  pd.read_pickle(f"{cf.output_path}/predictionsFinal/finetuning/1epochs/MistralFirst200Last0Predictions.pkl")

In [20]:
display(baseline_predictions.head(2))

Unnamed: 0,path,id,num_pages,MistralTokens,count_MistralTokens,LlamaTokens,count_LlamaTokens,old_label,md5_hash,balanced_split,trunc_txt,trunc_col,label,prediction,model,date,run_id
0,/home/azureuser/cloudfiles/code/blobfuse/raads...,10154,1.0,"[▁x, ▁Geme, ente, ▁Amsterdam, ▁J, ▁C, <0x0A>, ...",447,"[▁x, ▁Geme, ente, ▁Amsterdam, ▁J, ▁C, <0x0A>, ...",422,Actualiteit,2307717591d864eea5918ef27cea118b,test,x Gemeente Amsterdam J C\n% Actualiteit voor d...,TruncationLlamaTokensFront100Back100,Actualiteit,Actualiteit,LinearSVC,2024-05-15 16:01:29.742235+02:00,LinearSVC_fulltext
1,/home/azureuser/cloudfiles/code/blobfuse/raads...,10470,2.0,"[▁Geme, ente, ▁Amsterdam, <0x0A>, %, ▁Geme, en...",1179,"[▁Geme, ente, ▁Amsterdam, <0x0A>, %, ▁Geme, en...",1167,Actualiteit,a72c3857912f5e79874002c849d92da7,test,Gemeente Amsterdam\n% Gemeenteraad R\n% Raadsa...,TruncationLlamaTokensFront100Back100,Actualiteit,Actualiteit,LinearSVC,2024-05-15 16:01:29.742235+02:00,LinearSVC_fulltext


### 4. predictions.pkl

In [18]:
print(sorted(list(baseline_overview.columns)))
print(sorted(list(llm_overview.columns)))


print(sorted(list(baseline_predictions.columns),  key=str.lower))
print(sorted(list(llm_predictions.columns),  key=str.lower))


['accuracy', 'classification_report', 'date', 'macro_avg_f1', 'macro_avg_precision', 'macro_avg_recall', 'model', 'run_id', 'runtime', 'split_col', 'test_set', 'test_set_support', 'text_col', 'train_set', 'train_set_support']
['accuracy', 'classification_report', 'date', 'macro_avg_f1', 'macro_avg_precision', 'macro_avg_recall', 'model', 'run_id', 'runtime', 'split_col', 'test_set', 'test_set_support', 'text_col', 'train_set', 'train_set_support', 'weighted_avg_f1', 'weighted_avg_precision', 'weighted_avg_recall']
['balanced_split', 'count_LlamaTokens', 'count_MistralTokens', 'date', 'id', 'label', 'LlamaTokens', 'md5_hash', 'MistralTokens', 'model', 'num_pages', 'old_label', 'path', 'prediction', 'run_id', 'trunc_col', 'trunc_txt']
['date', 'id', 'label', 'path', 'prediction', 'prompt', 'prompt_function', 'response', 'run_id', 'runtime', 'shots', 'test_set', 'text_column', 'train_set']
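
Instead of comparing the sorted lists by eye, the differences can be computed with set operations. The sketch below uses the overview column lists printed above; the helper `compare_columns` is made up for this example.

```python
def compare_columns(left, right):
    """Return (columns only in left, columns only in right), sorted."""
    left, right = set(left), set(right)
    return (
        sorted(left - right, key=str.lower),
        sorted(right - left, key=str.lower),
    )


# Column lists copied from the printed output above.
baseline_overview_cols = [
    "accuracy", "classification_report", "date", "macro_avg_f1",
    "macro_avg_precision", "macro_avg_recall", "model", "run_id",
    "runtime", "split_col", "test_set", "test_set_support", "text_col",
    "train_set", "train_set_support",
]
llm_overview_cols = baseline_overview_cols + [
    "weighted_avg_f1", "weighted_avg_precision", "weighted_avg_recall",
]

only_baseline, only_llm = compare_columns(baseline_overview_cols, llm_overview_cols)
print(only_baseline)  # []
print(only_llm)  # ['weighted_avg_f1', 'weighted_avg_precision', 'weighted_avg_recall']
```

The same helper can be applied to `baseline_predictions.columns` and `llm_predictions.columns` to list the columns that are specific to each predictions file.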
