In [1]:
!bash /home/azureuser/cloudfiles/code/blobfuse/blobfuse_raadsinformatie.sh

In [4]:
import sys
sys.path.append("..")

# Choose where to run the notebook before continuing.
# "local" runs on the demo data; "azure" runs on the complete dataset (access required).
my_run = "local"

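if my_run == "azure":
    import config_azure as cf
    running_demo = False
elif my_run == "local":
    import config as cf
    running_demo = True
else:
    # Fail early instead of hitting a NameError on `cf` later.
    raise ValueError(f'my_run must be "azure" or "local", got {my_run!r}')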


import os
if my_run == "azure":
    # Create the cache directory (and any missing parents) if it does not exist yet.
    os.makedirs(cf.HUGGING_CACHE, exist_ok=True)
    os.environ["TRANSFORMERS_CACHE"] = cf.HUGGING_CACHE

import pandas as pd


ModuleNotFoundError: No module named 'config'

## Notebook Overview

Goal: explain the contents of the different kinds of files included in this project. During the project, many extra columns were added to the data, predictions and overview files, so this notebook explains all of them.

Different kinds of files:
1. txtfiles_notcleaned.pkl
2. txtfiles.pkl
3. predictions.pkl
4. overview.pkl (experiments)
5. overview_model.pkl (fine-tuning of models)


### Common columns
The following columns are included in (almost) all files:
- label = true class of doc
- path = path to OCR file
- id = document id
- text = extracted text from OCR file
- num_pages = number of pages in the PDF file.

### 1. txtfiles_notcleaned.pkl

This file contains all docs that were scraped. 

Columns:
- tokens = text split into tokens using the NLTK library. These tokens are not used in the research itself, but were needed for the preliminary analysis of the data.
- token_count = number of tokens in the text, split using the NLTK library. Equals the length of the tokens column.
- clean_tokens = the tokens column with stopwords and punctuation removed. These tokens are not used in the research itself, but were needed for the preliminary analysis of the data.
- clean_tokens_count = number of tokens left after removing stopwords and punctuation. Equals the length of the clean_tokens column.
- pdf_path = path to the PDF file. Each doc is saved in multiple versions, including the OCR version and the PDF version, hence the distinction between path and pdf_path.
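
The relationship between the four token columns can be sketched as follows. This is a minimal illustration, not the project's code: a regular expression stands in for the NLTK tokenizer, and the `STOPWORDS` set and the `describe_tokens` helper are hypothetical.

```python
import re
import string

# Hypothetical, tiny stopword set; the project relies on NLTK instead.
STOPWORDS = {"de", "het", "een", "van", "en"}


def describe_tokens(text):
    """Return the four token columns for one document (illustrative only)."""
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Drop stopwords (case-insensitive) and tokens that are pure punctuation.
    clean_tokens = [
        tok for tok in tokens
        if tok.lower() not in STOPWORDS
        and not all(ch in string.punctuation for ch in tok)
    ]
    return {
        "tokens": tokens,
        "token_count": len(tokens),
        "clean_tokens": clean_tokens,
        "clean_tokens_count": len(clean_tokens),
    }


row = describe_tokens("Gemeente Amsterdam % Gemeenteraad van de stad.")
print(row["token_count"], row["clean_tokens_count"])
```

The two count columns are simply the lengths of the corresponding token lists, which is the invariant described above.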

In [3]:
txtfiles_notcleaned = pd.read_pickle(f"{cf.output_path}/txtfiles_notcleaned.pkl")
print(f"The txtfiles_notcleaned file contains {len(txtfiles_notcleaned)} docs.")
print(f"Columns: {list(txtfiles_notcleaned.columns)}")
display(txtfiles_notcleaned.head(2))

The txtfiles_notcleaned file contains 33117 docs.
Columns: ['label', 'path', 'id', 'text', 'tokens', 'token_count', 'clean_tokens', 'clean_tokens_count', 'pdf_path', 'num_pages']


Unnamed: 0,label,path,id,text,tokens,token_count,clean_tokens,clean_tokens_count,pdf_path,num_pages
0,Motie,/home/azureuser/cloudfiles/code/blobfuse/raads...,0,Gemeente Amsterdam\n% Gemeenteraad R\n% Gemeen...,"[Gemeente, Amsterdam, %, Gemeenteraad, R, %, G...",395,"[Gemeente, Amsterdam, Gemeenteraad, Gemeentebl...",205,/home/azureuser/cloudfiles/code/blobfuse/raads...,2.0
1,Motie,/home/azureuser/cloudfiles/code/blobfuse/raads...,1,Gemeente Amsterdam\n\n% Gemeenteraad R\n\n% Ge...,"[Gemeente, Amsterdam, %, Gemeenteraad, R, %, G...",390,"[Gemeente, Amsterdam, Gemeenteraad, Gemeentebl...",197,/home/azureuser/cloudfiles/code/blobfuse/raads...,2.0


### 2. txtfiles.pkl

This file contains all documents, without faulty/unclean documents and duplicates. This file is used during the research. 

What is removed (compared to txtfiles_notcleaned):
- three classes that were deemed too unclean.
- faulty documents that only contained gibberish.
- duplicates.

Columns:
- 4split = indicates which subset the doc belongs to. In the case of 4split, the data is split into train, dev, test and val.
- 2split = indicates which subset the doc belongs to. In the case of 2split, the data is split into train and test.
- balanced_split = indicates which subset the doc belongs to. In the case of balanced_split, the data is split into train, test and val. It is used for the final experiments.
- MistralTokens = text split into tokens using the Mistral tokenizer. Needed to shorten the docs for the experiments. Note: since Mistral and GEITje have the same tokenizer, an extra column GeitjeTokens is not needed.
- count_MistralTokens = number of tokens in MistralTokens, i.e. into how many tokens the Mistral tokenizer split the text.
- LlamaTokens = text split into tokens using the Llama tokenizer. Needed to shorten the docs for the experiments.
- count_LlamaTokens = number of tokens in LlamaTokens, i.e. into how many tokens the Llama tokenizer split the text.
- md5_hash = hash of the content of the doc, which acts as a content-based id. It is used to remove duplicates.
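
How an MD5 hash can be used to remove duplicates is sketched below. This is an illustration under assumptions, not the project's implementation: it assumes the hash is computed over the UTF-8 encoded text, and the helpers `md5_of_text` and `drop_duplicates_by_hash` are hypothetical names.

```python
import hashlib


def md5_of_text(text):
    """Hex MD5 digest of the document text (assumes UTF-8 encoding)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def drop_duplicates_by_hash(docs):
    """Keep the first document for every distinct md5_hash."""
    seen = set()
    unique_docs = []
    for doc in docs:
        digest = md5_of_text(doc["text"])
        if digest not in seen:
            seen.add(digest)
            # Store the hash next to the other fields, like the md5_hash column.
            unique_docs.append({**doc, "md5_hash": digest})
    return unique_docs


docs = [
    {"id": 0, "text": "Gemeente Amsterdam"},
    {"id": 1, "text": "Gemeente Amsterdam"},  # same content as id 0
    {"id": 2, "text": "Gemeenteraad"},
]
unique_docs = drop_duplicates_by_hash(docs)
print([doc["id"] for doc in unique_docs])  # ids 0 and 2 remain
```

Two docs with identical text always get the same digest, so comparing hashes is a cheap way to compare content without comparing the full texts.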






In [4]:
txtfiles = pd.read_pickle(f"{cf.output_path}/txtfiles.pkl")
print(f"The txtfiles file contains {len(txtfiles)} docs.")
print(f"Columns: {list(txtfiles.columns)}")
display(txtfiles.head(2))

The txtfiles file contains 20818 docs.
Columns: ['label', 'path', 'id', 'text', 'num_pages', '4split', '2split', 'MistralTokens', 'count_MistralTokens', 'LlamaTokens', 'count_LlamaTokens', 'md5_hash', 'balanced_split']


Unnamed: 0,label,path,id,text,num_pages,4split,2split,MistralTokens,count_MistralTokens,LlamaTokens,count_LlamaTokens,md5_hash,balanced_split
0,Motie,/home/azureuser/cloudfiles/code/blobfuse/raads...,1874,x Gemeente Amsterdam R\nGemeenteraad\n% Gemeen...,1.0,train,train,"[▁x, ▁Geme, ente, ▁Amsterdam, ▁R, <0x0A>, G, e...",350,"[▁x, ▁Geme, ente, ▁Amsterdam, ▁R, <0x0A>, G, e...",346,2f09ba2c967bba0eecf71f846f258a78,discard
1,Motie,/home/azureuser/cloudfiles/code/blobfuse/raads...,230,X Gemeente Amsterdam R\nGemeenteraad\n% Gemeen...,2.0,train,train,"[▁X, ▁Geme, ente, ▁Amsterdam, ▁R, <0x0A>, G, e...",1130,"[▁X, ▁Geme, ente, ▁Amsterdam, ▁R, <0x0A>, G, e...",1082,d14b33c32ba1e1bcff16320891bdf158,discard


### 3. predictions.pkl
The predictions files store all individual predictions. Each experiment has its own predictions file. There is a slight difference between the predictions files of the baselines and those of the LLMs.

Columns that are the same:
- path, id, label
- prediction = predicted class of the doc
- date = date and time of when prediction was made.
- run_id = unique id for each experiment.

Columns unique to the baseline predictions file:
- model = model used to make the predictions.
- trunc_txt = the truncated text of the doc, which is the input. In case the full doc was used, the value is N.A.

Columns unique to LLM predictions file:
- text_column = column of the input dataframe that contained the truncated text. This column was used as input.
- prompt_function = function used to format the prompt. All formats can be found in scripts/prompt_template.py
- response = the response of the LLM. Prediction not yet extracted.
- runtime = the time it took for the model to respond.
- prompt = the doc formatted in the prompt. The input given to the LLM.
- shots = number of example docs included in the prompt.
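
The `text_column` values in the sample output, such as `TruncationLlamaTokensFront200Back0`, suggest that a doc is shortened by keeping a number of tokens from the front and from the back. The sketch below illustrates that idea, assuming this reading of the name is correct; the `truncate_tokens` helper is hypothetical and not taken from the project's scripts.

```python
def truncate_tokens(tokens, front, back):
    """Keep the first `front` and the last `back` tokens of a document."""
    if front + back >= len(tokens):
        # The document is already short enough; nothing to cut.
        return list(tokens)
    # Slicing with -0 would return the whole list, so handle back == 0 explicitly.
    tail = tokens[len(tokens) - back:] if back > 0 else []
    return list(tokens[:front]) + list(tail)


tokens = [f"tok{i}" for i in range(10)]
print(truncate_tokens(tokens, front=3, back=0))  # first three tokens
print(truncate_tokens(tokens, front=2, back=2))  # two tokens from each end
```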


Furthermore, because the predictions of the fine-tuned Mistral model needed to be repaired, the Mistral prediction files for the 2-epoch and 3-epoch runs contain extra columns:
- Original_Prediction = prediction originally extracted, without repair.
- matches_complete_regex = True or False, indicates whether the response included a JSON-like object of the form '{category}'.
- matches_adjusted_regex = True or False, indicates whether the response matched 'category}'. If a response is False for matches_complete_regex and True for matches_adjusted_regex, the response only missed the opening bracket.
- prediction = prediction after repair.
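
The repair logic can be sketched as follows. The two patterns are assumptions based on the responses shown in the sample output (for example `{'categorie': Raadsnotulen}` and `'categorie': Raadsnotulen}`); the project's actual regular expressions may differ, and `repair_prediction` is a hypothetical helper.

```python
import re

# Assumed patterns; the project's actual expressions are not shown here.
COMPLETE_REGEX = re.compile(r"\{\s*'categorie':\s*([^{}]+?)\s*\}")
ADJUSTED_REGEX = re.compile(r"'categorie':\s*([^{}]+?)\s*\}")


def repair_prediction(response):
    """Extract the class with the strict pattern, then with the lenient one."""
    complete = COMPLETE_REGEX.search(response)
    adjusted = ADJUSTED_REGEX.search(response)
    original = complete.group(1).lower() if complete else "NoPredictionFormat"
    repaired = adjusted.group(1).lower() if adjusted else "NoPredictionFormat"
    return {
        "Original_Prediction": original,
        "matches_complete_regex": complete is not None,
        "matches_adjusted_regex": adjusted is not None,
        "prediction": repaired,
    }


# A response that only misses the opening bracket is recovered by the repair.
print(repair_prediction("'categorie': Raadsnotulen}"))
```

Because the lenient pattern is a suffix of the strict one, every response that matches the strict pattern also matches the lenient one, so the repair never loses a prediction that was already extracted.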


In [5]:
baseline_predictions = pd.read_pickle(f"{cf.output_path}/trial/LinearSVCpredictions.pkl")
print(f"The baseline_predictions file contains {len(baseline_predictions)} predictions.")
print(f"Columns: {list(baseline_predictions.columns)}")
display(baseline_predictions.head(2))


llm_predictions = pd.read_pickle(f"{cf.output_path}/predictionsFinal/finetuning/1epochs/MistralFirst200Last0Predictions.pkl")
print(f"The llm_predictions file contains {len(llm_predictions)} predictions.")
print(f"Columns: {list(llm_predictions.columns)}")
display(llm_predictions.head(2))

mistral_predictions = pd.read_pickle(f"{cf.output_path}/predictionsFinal/finetuning/2epochs/MistralFirst200Last0Predictions.pkl")
print(f"The mistral_predictions file contains {len(mistral_predictions)} predictions.")
print(f"Columns: {list(mistral_predictions.columns)}")
display(mistral_predictions.head(2))

The baseline_predictions file contains 8 predictions.
Columns: ['path', 'id', 'label', 'prediction', 'model', 'date', 'run_id', 'trunc_txt']


Unnamed: 0,path,id,label,prediction,model,date,run_id,trunc_txt
2,/home/azureuser/cloudfiles/code/blobfuse/raads...,26304,Raadsnotulen,Raadsadres,LinearSVC,2024-06-10 15:28:09.365516+02:00,LinearSVC_fulltext,
13,/home/azureuser/cloudfiles/code/blobfuse/raads...,32939,Factsheet,Voordracht,LinearSVC,2024-06-10 15:28:09.365516+02:00,LinearSVC_fulltext,


The llm_predictions file contains 1100 predictions.
Columns: ['id', 'path', 'text_column', 'prompt_function', 'response', 'prediction', 'label', 'runtime', 'date', 'prompt', 'run_id', 'shots', 'new_prediction']


Unnamed: 0,id,path,text_column,prompt_function,response,prediction,label,runtime,date,prompt,run_id,shots,new_prediction
0,26304,/home/azureuser/cloudfiles/code/blobfuse/raads...,TruncationLlamaTokensFront200Back0,zeroshot_prompt_mistral_llama,{'categorie': Raadsnotulen},raadsnotulen,raadsnotulen,23.454581,2024-06-03 14:46:15.522548+02:00,Classificeer het document in één van de catego...,FT_AmsterdamDocClassificationMistral200T1Epoch...,0,raadsnotulen
1,32939,/home/azureuser/cloudfiles/code/blobfuse/raads...,TruncationLlamaTokensFront200Back0,zeroshot_prompt_mistral_llama,{'categorie': Onderzoeksrapport},onderzoeksrapport,factsheet,24.833001,2024-06-03 14:46:40.356687+02:00,Classificeer het document in één van de catego...,FT_AmsterdamDocClassificationMistral200T1Epoch...,0,onderzoeksrapport


The mistral_predictions file contains 200 predictions.
Columns: ['id', 'path', 'text_column', 'prompt_function', 'response', 'Original_Prediction', 'label', 'runtime', 'date', 'prompt', 'run_id', 'shots', 'matches_complete_regex', 'matches_adjusted_regex', 'prediction', 'new_prediction']


Unnamed: 0,id,path,text_column,prompt_function,response,Original_Prediction,label,runtime,date,prompt,run_id,shots,matches_complete_regex,matches_adjusted_regex,prediction,new_prediction
0,26304,/home/azureuser/cloudfiles/code/blobfuse/raads...,TruncationLlamaTokensFront200Back0,zeroshot_prompt_mistral_llama,'categorie': Raadsnotulen},NoPredictionFormat,raadsnotulen,22.356095,2024-05-31 17:25:30.266286+02:00,Classificeer het document in één van de catego...,FT_AmsterdamDocClassificationMistral200T2Epoch...,0,0,1,raadsnotulen,NoPredictionFormat
1,32939,/home/azureuser/cloudfiles/code/blobfuse/raads...,TruncationLlamaTokensFront200Back0,zeroshot_prompt_mistral_llama,'categorie': Onderzoeksrapport},NoPredictionFormat,factsheet,23.359804,2024-05-31 17:25:53.744179+02:00,Classificeer het document in één van de catego...,FT_AmsterdamDocClassificationMistral200T2Epoch...,0,0,1,onderzoeksrapport,NoPredictionFormat


### 4. overview.pkl
The overview files for the experiments contain exactly one row per experiment, with metadata about that experiment. There is no difference between the overview files of the baselines and the overview files of the LLMs.

Columns:
- model = model used to get predictions.
- run_id = id of the experiment, unique for each experiment.
- date = date and time when experiment finished running.
- train_set = the subset of the split_col that was used as training set. Depending on the split and goal of the experiment this can be either train or dev.
- test_set = the subset of the split_col that was used as test set. Depending on the split and goal of the experiment this can be either test or val.
- train_set_support = number of docs in the training set.
- test_set_support = number of docs in the test set.
- split_col = column that indicates the split, shows which split was used. Can be 2split, 4split or balanced_split.
- text_col = column that contained the text of the document. Indicates how doc was shortened.
- runtime = total time it took to run the experiment in seconds.
- 'accuracy', 'macro_avg_precision', 'macro_avg_recall', 'macro_avg_f1', 'weighted_avg_precision', 'weighted_avg_recall', 'weighted_avg_f1', 'classification_report' = evaluation scores
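
The difference between the macro and weighted averages can be sketched as follows. The sketch assumes the usual definitions, as used for example in scikit-learn's classification report: the macro average is the unweighted mean of the per-class scores, and the weighted average weights each class by its support. The class names and numbers are made up for illustration.

```python
def macro_average(scores):
    """Unweighted mean of the per-class scores."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, support):
    """Mean of the per-class scores, weighted by the number of docs per class."""
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total


# Made-up per-class F1 scores and supports.
f1 = {"motie": 0.9, "factsheet": 0.6}
balanced = {"motie": 100, "factsheet": 100}
skewed = {"motie": 300, "factsheet": 100}

print(macro_average(f1))
print(weighted_average(f1, balanced))  # same as the macro average
print(weighted_average(f1, skewed))    # pulled towards the larger class
```

With equal support per class the two averages coincide, which would be consistent with identical macro and weighted scores whenever the test set is class-balanced.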


In [6]:
llm_overview = pd.read_pickle(f"{cf.output_path}/predictionsFinal/finetuning/1epochs/overview.pkl")
print(f"The llm_overview file contains {len(llm_overview)} experiments.")
print(f"Columns: {list(llm_overview.columns)}")
display(llm_overview.head(2))

The llm_overview file contains 3 experiments.
Columns: ['model', 'run_id', 'date', 'train_set', 'test_set', 'train_set_support', 'test_set_support', 'split_col', 'text_col', 'runtime', 'accuracy', 'macro_avg_precision', 'macro_avg_recall', 'macro_avg_f1', 'weighted_avg_precision', 'weighted_avg_recall', 'weighted_avg_f1', 'classification_report']


Unnamed: 0,model,run_id,date,train_set,test_set,train_set_support,test_set_support,split_col,text_col,runtime,accuracy,macro_avg_precision,macro_avg_recall,macro_avg_f1,weighted_avg_precision,weighted_avg_recall,weighted_avg_f1,classification_report
0,AmsterdamDocClassificationMistral200T1Epochs,FT_AmsterdamDocClassificationMistral200T1Epoch...,2024-06-03 21:28:18.488289+02:00,train,test,9900,1100,balanced_split,TruncationLlamaTokensFront200Back0,24081.369143,0.895455,0.921469,0.895455,0.888446,0.921469,0.895455,0.888446,precision recall f1-s...
0,AmsterdamDocClassificationGEITje200T1Epochs,FT_AmsterdamDocClassificationGEITje200T1Epochs...,2024-06-04 10:25:38.531475+02:00,train,test,9900,1100,balanced_split,TruncationLlamaTokensFront200Back0,25634.273609,0.890909,0.925842,0.890909,0.881006,0.925842,0.890909,0.881006,precision recall f1-s...
