# Correcting the Dev and Test sets

While training the ModernBERT models, an error was discovered in the DEV and TEST sets. It was caused by a download from the original weak-labelling platform not being correctly filtered to remove labels predicted by the platform. As a result, the DEV and TEST sets contained a mixture of hand-labelled data and poorly predicted data. This had no impact on the model or results in the original paper, but it would affect subsequent work and reproducibility.

This notebook remedies the situation by remaking the DEV and TEST sets from the original data and removing the predicted labels. This is done by ensuring that only data marked as "annotation" is included, whilst "prediction" is excluded.

In addition, the CSV columns are renamed to be consistent with the rest of the training and inference pipeline.

This change greatly reduces the apparent number of entities in the DEV and TEST sets. However, the removed entities are at best duplicates of the annotations and at worst incorrectly parsed entities that are duplicated several times. The updated DEV and TEST sets are therefore much smaller than the original publicly available sets.
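
The size of the reduction can be checked by counting rows per `type` value before any filtering is applied. The following is a minimal sketch rather than part of the original notebook. The helper name `count_rows_by_type` is hypothetical, and the toy frame is made-up data standing in for a raw export; the only assumption carried over is that the `type` column holds the values "annotation" and "prediction".

```python
import pandas as pd


def count_rows_by_type(raw_df):
    """Return the number of rows for each value of the 'type' column."""
    return raw_df["type"].value_counts().to_dict()


# Toy stand-in for a raw export (made-up rows, not real data)
toy_raw = pd.DataFrame({
    "type": ["annotation", "prediction", "prediction", "annotation", "prediction"],
    "label": ["city", "city", "city", "postcode", "postcode"],
})

counts = count_rows_by_type(toy_raw)
kept = counts.get("annotation", 0)   # rows that survive the filter
removed = len(toy_raw) - kept        # everything else is dropped
print(f"kept {kept} hand-labelled rows, removed {removed} other rows")
```

Running the same helper on the frame returned by `pd.read_csv` for each raw file would show how much of each set consisted of platform predictions.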

In [1]:
import pandas as pd
def clean_raw_data(path):
    """
    Keep only the rows marked as "annotation", i.e. the hand-labelled rows,
    then select the required columns and rename them as appropriate.
    """
    gt_data_raw = pd.read_csv(path)

    # Drop the platform predictions and keep only the columns used downstream
    gt_data = gt_data_raw.loc[
        gt_data_raw['type'] == 'annotation',
        ['input:datapoint_id', 'start', 'end', 'label', 'text', 'input:text'],
    ]

    # Rename the platform-specific columns to match the rest of the pipeline
    gt_data = gt_data.rename(columns={
        'input:text': 'property_address',
        'input:datapoint_id': 'datapoint_id',
    })

    return gt_data

In [None]:
gt_test_data = clean_raw_data(path = "data/ground_truth_test_set_labels.csv")
gt_test_data.to_csv("data/enhanced_ocod_data_and_gt/ground_truth_test_set_labels.csv", index = False)
gt_test_data

Unnamed: 0,datapoint_id,start,end,label,text,property_address
0,62555,0,2,street_number,90,"90 craven park, london (nw10 8qe)"
1,62555,2,14,street_name,craven park,"90 craven park, london (nw10 8qe)"
2,62555,16,22,city,london,"90 craven park, london (nw10 8qe)"
3,62555,24,32,postcode,nw10 8qe,"90 craven park, london (nw10 8qe)"
4,283,0,4,unit_type,land,"land on the west side of wisbech road, march"
...,...,...,...,...,...,...
4681,69298,9,22,building_name,chelsea house,"flat 13, chelsea house, 26 lowndes street, lon..."
4682,69298,24,26,street_number,26,"flat 13, chelsea house, 26 lowndes street, lon..."
4683,69298,27,41,street_name,lowndes street,"flat 13, chelsea house, 26 lowndes street, lon..."
4684,69298,43,49,city,london,"flat 13, chelsea house, 26 lowndes street, lon..."


In [113]:
gt_dev_data = clean_raw_data(path = "data/ground_truth_dev_set_labels.csv")
gt_dev_data.to_csv("data/enhanced_ocod_data_and_gt/ground_truth_dev_set_labels.csv", index = False)
gt_dev_data

Unnamed: 0,datapoint_id,start,end,label,text,property_address
0,53574,0,3,street_number,207,"207 sloane street, london (sw1x 9qx)"
1,53574,4,17,street_name,sloane street,"207 sloane street, london (sw1x 9qx)"
2,53574,19,25,city,london,"207 sloane street, london (sw1x 9qx)"
3,53574,27,35,postcode,sw1x 9qx,"207 sloane street, london (sw1x 9qx)"
4,17501,0,9,unit_type,apartment,"apartment 533, block 11 spectrum, blackfriars ..."
...,...,...,...,...,...,...
9557,76011,5,6,unit_id,a,"flat a, 49 norcott road, stoke newington, (n16..."
9558,76011,8,10,street_number,49,"flat a, 49 norcott road, stoke newington, (n16..."
9559,76011,11,23,street_name,norcott road,"flat a, 49 norcott road, stoke newington, (n16..."
9560,76011,43,50,postcode,n16 7ej,"flat a, 49 norcott road, stoke newington, (n16..."
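
One optional sanity check on the cleaned sets is to confirm that each labelled span matches the text it claims to cover. The sketch below is an illustration, not part of the original notebook. It assumes that `start` and `end` are character offsets into `property_address` with an exclusive `end`, which the rows displayed above appear to follow, and it ignores surrounding whitespace. The helper name `find_span_mismatches` and the toy rows are hypothetical.

```python
import pandas as pd


def find_span_mismatches(gt_df):
    """Return rows where property_address[start:end] differs from text.

    Assumes start/end are character offsets with an exclusive end;
    leading and trailing whitespace is ignored on both sides.
    """
    sliced = [
        str(addr)[int(start):int(end)].strip()
        for addr, start, end in zip(
            gt_df["property_address"], gt_df["start"], gt_df["end"]
        )
    ]
    expected = gt_df["text"].astype(str).str.strip()
    mismatch = pd.Series(sliced, index=gt_df.index) != expected
    return gt_df[mismatch]


# Toy rows (made-up data): the last row has a deliberately wrong span
toy_gt = pd.DataFrame({
    "datapoint_id": [1, 1, 1],
    "start": [0, 2, 0],
    "end": [1, 14, 4],
    "label": ["street_number", "street_name", "city"],
    "text": ["1", "example road", "town"],
    "property_address": ["1 example road, town"] * 3,
})

bad_rows = find_span_mismatches(toy_gt)
print(len(bad_rows), "mismatched span(s)")
```

Applied to `gt_dev_data` or `gt_test_data`, an empty result would suggest the offsets and text agree under these assumptions; any returned rows would need inspecting by hand.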


## The parsed GT data

This data is fine, as it is the result of hand-labelled data being parsed and then manually labelled with the property type. No action is required here.

In [17]:
parsed_test_set = pd.read_csv('data/enhanced_ocod_data_and_gt/full_data_set_no_overlaps.csv')

parsed_test_set

FileNotFoundError: [Errno 2] No such file or directory: 'data/enhanced_ocod_data_and_gt/full_data_set_no_overlaps.csv'