# Timeline Extraction Pipeline
Our task is to extract timelines from government decision letters using ChatGPT.
This notebook evaluates every step of the pipeline on the test set.

### Date Extraction & Correction
This step extracts and corrects the dates. During preprocessing, the PDF files were already converted to txt files. The first step is to split each txt file into sentences, detect the dates, and discard every sentence that does not contain a date.
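The sentence filtering described above can be sketched as follows. This is only an illustration: the notebook selects dates with spaCy, whereas this sketch swaps in a plain regular expression so that it runs without a language model. The pattern, the helper names (`split_sentences`, `sentences_with_dates`) and the sample text are hypothetical, and the sketch assumes Dutch-language letters with dates written as day, month name, year.

```python
import re

# Hypothetical pattern: day, Dutch month name, four-digit year (e.g. "3 maart 2022").
MONTHS = ("januari|februari|maart|april|mei|juni|juli|augustus|"
          "september|oktober|november|december")
DATE_RE = re.compile(rf"\b\d{{1,2}}\s+(?:{MONTHS})\s+\d{{4}}\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_with_dates(text):
    """Return (sentence, dates) pairs, dropping sentences without a date."""
    pairs = []
    for sentence in split_sentences(text):
        dates = DATE_RE.findall(sentence)
        if dates:
            pairs.append((sentence, dates))
    return pairs


# Made-up sample text, not taken from the data set.
sample = ("Op 3 maart 2022 heeft u een verzoek ingediend. "
          "Wij hebben dit verzoek beoordeeld. "
          "De beslistermijn is op 1 april 2022 verdaagd.")
result = sentences_with_dates(sample)
for sentence, dates in result:
    print(dates, "->", sentence)
```

The middle sentence contains no date and is dropped. A statistical tagger such as spaCy also picks up incomplete or spurious dates, which is why the notebook follows extraction with a correction step.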

In [4]:
import pickle
import pandas as pd
# the alias avoids shadowing Python's built-in compile()
from code.scripts.date_correction import compile as compile_dates, accuracy_dates
from code.scripts.remove_mistakes import remove_false_dates

# load the uncorrected dates that spaCy selected from the test set
with open('code/data/results/date_extraction/uncorrected_dates_test.pkl', 'rb') as fp:
    test_uncorrected = pickle.load(fp)

# correct incomplete dates and filter out nonsense dates selected by spaCy
test_corrected = compile_dates(test_uncorrected)

# load ground truth test set
gt_test = pd.read_csv("code/data/GT/GTtest/all_dates.csv")
gt_test_date_event = gt_test.loc[gt_test['label'] != 'DATE+']

# calculate accuracy on test set & return dataset with mistakes removed
accuracy_test_dates, mistakes, test_dates = accuracy_dates(gt_test_date_event, test_corrected)

print(f"Accuracy of the date correction algorithm: {accuracy_test_dates}")
print(f"Total dates not correctly extracted: {mistakes} out of {len(gt_test_date_event)}")

# remove dates with an event that have not been extracted correctly
clean_test_dates = remove_false_dates(test_dates, gt_test)



Accuracy of the date correction algorithm: 0.9768339768339769
Total dates not correctly extracted: 6 out of 259
Original length of dataframe:547
New length after removing mistakes: 541
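The printed figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that `accuracy_dates` reports the fraction of ground-truth dates that were extracted correctly; that definition is an assumption, although it reproduces the printed value.

```python
# Numbers copied from the output above.
total_gt_dates = 259   # ground-truth dates that have an event
mistakes = 6           # dates not correctly extracted
rows_before = 547      # dataframe length before removing mistakes
rows_after = 541       # dataframe length after removing mistakes

# Assumed definition: correctly extracted dates divided by all ground-truth dates.
accuracy = (total_gt_dates - mistakes) / total_gt_dates
print(f"accuracy = {accuracy:.4f}")

# The dataframe shrinks by exactly the number of mistakes.
print(rows_before - rows_after == mistakes)
```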


### Decision Date Extraction
Next, we classify which dates are "decision made" dates. We then remove every date that the ground truth marks as a decision date, so that only the dates that still need to be classified remain.

In [3]:
from code.scripts.decision import decision_class, evaluate_decision

test_decision = decision_class(clean_test_dates)

# return evaluation & df with prediction and truth 
accuracy, recall, precision, f1, values, test_copy = evaluate_decision(test_decision, gt_test)
print(f"Evaluation metrics values for classifying the decision date on testset: \n Accuracy: {accuracy} \n Recall: {recall} \n Precision: {precision} \n F1-score: {f1}")

# remove all dates that the ground truth marks as decision dates; keep the dates that still need to be classified
print(f"Original length of dataframe:{len(test_decision)}")
test_dates_no_decision = test_copy.loc[test_copy['truth'] == False].drop(columns=['truth', "decisiondate"])
print(f"New length of dataframe after removing mistakes: {len(test_dates_no_decision)}")
test_dates_no_decision.to_csv("code/data/results/chatgpt_extraction/input.csv", index=False)

# select all decision dates that were predicted correctly
decision_dates = test_copy.loc[(test_copy['truth'] == True) & (test_copy['decisiondate'] == test_copy['truth'])]

# select decision mistakes
decision_mistakes = test_copy.loc[(test_copy['decisiondate'] != test_copy['truth'])]

Evaluation metrics values for classifying the decision date on testset: 
 Accuracy: 0.9963031423290203 
 Recall: 1.0 
 Precision: 0.96 
 F1-score: 0.9795918367346939
Original length of dataframe:541
New length of dataframe after removing mistakes: 493
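The four metrics above are consistent with a single confusion matrix, which can be reconstructed from the printed numbers. The counts in this sketch are inferred, not printed by the notebook: 541 - 493 = 48 rows are true decision dates, a recall of 1.0 means all 48 were found, and a precision of 0.96 then implies 50 predicted decision dates, hence 2 false positives. The helper `binary_metrics` is hypothetical and assumes `evaluate_decision` uses the standard definitions.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Inferred counts (assumption): 48 true decision dates, 2 false positives,
# no false negatives, and 541 - 48 - 2 = 491 true negatives.
metrics = binary_metrics(tp=48, fp=2, fn=0, tn=491)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these counts, the 2 rows where prediction and truth disagree are both false positives: regular dates that were classified as decision dates.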


### ChatGPT 1: Event Phrase Extraction & Date Filtering
We ran the ChatGPT experiment in code/run_chatGPT.ipynb. The results are loaded and evaluated here.

In [5]:
from code.scripts.chatgpt_extraction import evaluate

# load predictions and event descriptions
extraction_predictions = pd.read_csv("code/data/results/chatgpt_extraction/predictions.csv")

# load ground truth
gt = pd.read_csv("code/data/GT/GTtest/date_event_combinations.csv")

# print out evaluation metrics
evaluate(extraction_predictions, gt)



              precision    recall  f1-score   support

       False       0.63      0.51      0.56       288
        True       0.45      0.58      0.51       205

    accuracy                           0.54       493
   macro avg       0.54      0.54      0.53       493
weighted avg       0.55      0.54      0.54       493

{'fp': 142, 'tp': 118, 'fn': 87, 'tn': 146}
Total dates with an event of which ChatGPT extracted an event phrase: 144
Average jaccard similarity: 50.184% 
 Fraction of dates that overlap >= 50%: 53.472% 
 Fraction of dates that overlap >= 75%: 27.083% 
 Fraction of dates that overlap = 100%: 8.333% 
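To make the two kinds of numbers above easier to read, the sketch below first recomputes the report rows from the printed confusion counts and then shows one way a Jaccard similarity between two event phrases can be computed. The `jaccard` helper is hypothetical: it assumes a token-set Jaccard over lowercased, whitespace-separated words, which may differ from what `evaluate` actually does, and the example phrases are made up.

```python
# Confusion counts copied from the output above.
counts = {'fp': 142, 'tp': 118, 'fn': 87, 'tn': 146}
tp, fp, fn, tn = counts['tp'], counts['fp'], counts['fn'], counts['tn']

# Class "True": the date has an event.
precision_true = tp / (tp + fp)
recall_true = tp / (tp + fn)
# Class "False": the date has no event.
precision_false = tn / (tn + fn)
recall_false = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)


def jaccard(phrase_a, phrase_b):
    """Token-set Jaccard similarity (assumed tokenisation: lowercase, whitespace)."""
    set_a = set(phrase_a.lower().split())
    set_b = set(phrase_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty phrases are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up phrases: 2 shared tokens out of 4 distinct tokens.
score = jaccard("verzoek ontvangen per e-mail", "verzoek ontvangen")
print(round(precision_true, 2), round(recall_true, 2), round(accuracy, 2))
print(score)
```

The recomputed values round to the figures in the report. With this definition, an overlap of 100% requires the extracted phrase and the ground-truth phrase to contain exactly the same set of words.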


### ChatGPT 2: Event Classification
We ran the ChatGPT experiment in code/run_chatGPT.ipynb. The results are loaded and evaluated here.

In [7]:
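# the alias keeps this function distinct from the `evaluate` imported for the extraction step
from code.scripts.chatgpt_classification import evaluate as evaluate_classification

# load predictions
predictions_classification = pd.read_csv("code/data/results/chatgpt_classification/predictions.csv")

# evaluate against the ground truth `gt` loaded in the previous step
df_predictions_classification = evaluate_classification(predictions_classification, gt)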

                             precision    recall  f1-score   support

     beslistermijn verdaagd       0.82      0.86      0.84        21
                    contact       0.88      0.70      0.78        40
   inwerking treden van Woo       1.00      0.88      0.93        16
ontvangst verzoek bevestigd       0.95      0.80      0.86        44
                     overig       0.60      0.52      0.56        23
              verzoek datum       0.77      1.00      0.87        46
          verzoek ontvangen       0.50      0.67      0.57        15

                   accuracy                           0.80       205
                  macro avg       0.79      0.77      0.77       205
               weighted avg       0.81      0.80      0.80       205
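The macro and weighted averages in the report differ only in how the per-class scores are combined: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below recomputes both from the rounded per-class values printed above. Because the inputs are rounded to two decimals, the results can differ from the report in the last digit. The helper names are hypothetical.

```python
# Per-class (precision, recall, f1, support), copied from the report above.
report = {
    "beslistermijn verdaagd":      (0.82, 0.86, 0.84, 21),
    "contact":                     (0.88, 0.70, 0.78, 40),
    "inwerking treden van Woo":    (1.00, 0.88, 0.93, 16),
    "ontvangst verzoek bevestigd": (0.95, 0.80, 0.86, 44),
    "overig":                      (0.60, 0.52, 0.56, 23),
    "verzoek datum":               (0.77, 1.00, 0.87, 46),
    "verzoek ontvangen":           (0.50, 0.67, 0.57, 15),
}
total_support = sum(row[3] for row in report.values())


def macro_avg(column):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[column] for row in report.values()) / len(report)


def weighted_avg(column):
    """Mean over classes, weighted by the number of true instances per class."""
    return sum(row[column] * row[3] for row in report.values()) / total_support


for name, column in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name}: macro={macro_avg(column):.3f} weighted={weighted_avg(column):.3f}")
```

The weighted averages are higher than the macro averages here because the two weakest classes, `overig` and `verzoek ontvangen`, are also among the smallest.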

