# Benchmarking Dictionary-based NER (Tagger) 

* First, run Tagger on the annotated benchmark abstracts to obtain the list of tagged lifestyle factors (LSFs), then compare them with the manually annotated LSFs

## Tagger run
1. Prepare the abstracts in the format that Tagger requires
    * Six-column tab-separated format: the first column is the PMID and the last column is the text; the other columns can be left empty since they are not relevant or required here

In [2]:
import os

# Input directory containing text files
input_directory = '../../data/NER-Benchmarking/annotations/raw/'

# Output file name
output_file = '../../data/NER-Benchmarking/annotations/abstracts.tsv'

# List all .txt files in the input directory
input_files = [file for file in os.listdir(input_directory) if file.endswith('.txt')]

# Open the output file for writing
with open(output_file, 'w') as out_file:
    for input_file in input_files:
        with open(os.path.join(input_directory, input_file), 'r') as in_file:
            content = in_file.read().replace('\n', ' ').strip()  # Read, replace newlines with spaces, and strip any leading/trailing whitespace

        # Extract the base name of the input file without the .txt extension
        input_filename = os.path.splitext(os.path.basename(input_file))[0]

        # Write the formatted line to the output file
        line = f"PMID:{input_filename}\t \t \t \t \t{content}\n"  # Four empty columns
        out_file.write(line)

print("Output file created:", output_file)


Output file created: ../../data/NER-Benchmarking/annotations/abstracts.tsv
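A quick sanity check on the six-column layout can catch abstracts whose text contains stray tabs, which would shift the columns. The sketch below is only an illustration: the helper names `make_tagger_line` and `check_tagger_line` and the sample text are hypothetical, not part of the original pipeline.

```python
def make_tagger_line(pmid, text):
    """Build one six-column line: PMID, four blank columns, then the text."""
    # Collapse tabs/newlines inside the abstract so they cannot shift the columns
    clean_text = " ".join(text.split())
    return f"PMID:{pmid}\t \t \t \t \t{clean_text}\n"


def check_tagger_line(line, expected_columns=6):
    """Return (first column, last column) or raise if the column count is wrong."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != expected_columns:
        raise ValueError(f"expected {expected_columns} columns, found {len(fields)}")
    return fields[0], fields[-1]


# Hypothetical abstract text containing a tab and a newline
line = make_tagger_line("21719896", "Smoking\tand diet\nwere assessed.")
pmid, text = check_tagger_line(line)
print(pmid, "|", text)
```

Running `check_tagger_line` over every line of `abstracts.tsv` before starting Tagger would flag malformed rows early.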


2. Prepare required input files for tagger
    * Note: You can use the saved results, or reproduce using the following steps
    
    * Steps 
        * Use the Tagger scripts to prepare the required input files
            * tagger scripts ('../NER-Benchmarking/scripts/tagger_scripts/')
            * Clone LSFC (or use the offline version in '../../LSFC/LSFC.obo')
            * Generate entities,groups,names using LSFC
                * ../NER-Benchmarking/scripts/tagger_scripts/obo2reflect.pl  ../../LSFC/   ../../data/NER-Benchmarking/tagger/tagger_input/
            *  Add adjective and plural endings (orthoexpand)
                * ../NER-Benchmarking/scripts/tagger_scripts/orthoexpand.pl  ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_entities.tsv  ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_names.tsv >  ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_entities_names_expanded_mid.tsv

            *  Change nonASCII into ASCII
                * cat ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_entities_names_expanded_mid.tsv | python2  ../NER-Benchmarking/scripts/tagger_scripts/utf8expand.py > ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_names_expanded.tsv


            * orthounexpand
                *  ../NER-Benchmarking/scripts/tagger_scripts/orthounexpand.pl ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_entities.tsv  ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_names.tsv  ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_names_expanded.tsv >  ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_names_unexpanded.tsv


            * disambiguate
                *  ../NER-Benchmarking/scripts/tagger_scripts/disambiguate.pl ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_names_unexpanded.tsv >  ../../data/NER-Benchmarking/tagger/tagger_input/LSFC_names_disambiguated.tsv

            * Create manually (or copy) LSFC_types.tsv ('../../data/NER-Benchmarking/tagger/tagger_input/LSFC_types.tsv')
                * This file contains the type IDs: in this case only -20, for lifestyle
            
            * Create the block list file:
                * Create a shortlist for manual checking
                    * block_list_candidates=input_data[input_data.match_count >2000]
                * Evaluate the candidates and create the final block list by concatenating them with the general list of blocked words
                    * The blocklist file has two tab-separated columns:
                    * the word (a string that may contain spaces), followed by either "t" or "f", depending on whether it is a stopword ("t") or whitelisted ("f")
                    * (../../data/NER-Benchmarking/tagger/tagger_input/all_global.tsv)
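The block list steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_blocklist`, the example names, and the counts are hypothetical, and the manual review is represented simply as a set of candidates that a curator confirmed as stopwords.

```python
def build_blocklist(match_counts, confirmed_stopwords, general_blocklist, threshold=2000):
    """Return (candidates for manual review, final two-column block list lines)."""
    # Shortlist: names matched more often than the threshold
    candidates = sorted(name for name, count in match_counts.items() if count > threshold)

    # Start from the general list (name -> "t" or "f") and add confirmed candidates
    entries = dict(general_blocklist)
    for name in candidates:
        if name in confirmed_stopwords:
            entries[name] = "t"

    # Two tab-separated columns: the word, then "t" (stopword) or "f" (whitelisted)
    lines = [f"{name}\t{flag}" for name, flag in sorted(entries.items())]
    return candidates, lines


# Hypothetical inputs for illustration only
match_counts = {"work": 5400, "diet": 3100, "yoga": 120}
candidates, lines = build_blocklist(
    match_counts,
    confirmed_stopwords={"work"},
    general_blocklist={"study": "t"},
)
print(candidates)
print("\n".join(lines))
```

Candidates that the curator does not confirm (here "diet") are left out of the block list, so Tagger keeps matching them.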

3. Run Tagger
    * Output file : '../../data/NER-Benchmarking/tagger/tagger_output/all_matches_benchmark.tsv'
    * We run Tagger on a small set of annotated documents (200 abstracts), unlike the usual large-scale run over all PubMed/PMC articles, which takes much longer
        /tagger/tagcorpus \
        --documents=../../data/NER-Benchmarking/annotations/abstracts.tsv \
        --threads=1 \
        --autodetect \
        --types=tagger_input/LSFC_types.tsv \
        --entities=tagger_input/LSFC_entities.tsv \
        --names=tagger_input/LSFC_names_disambiguated.tsv \
        --stopwords=tagger_input/all_global.tsv \
        --local-stopwords=tagger_input/all_local.tsv \
        --groups=tagger_input/LSFC_groups.tsv \
        --out-matches=tagger_output/all_matches_benchmark.tsv 
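When the command is launched from a notebook, building the argument list in Python avoids line-continuation mistakes. The sketch below only reuses the options listed above; the function name `build_tagcorpus_command` and its parameters are hypothetical, and the paths are the ones used in this notebook.

```python
import shlex


def build_tagcorpus_command(documents, input_dir, output_file,
                            threads=1, binary="/tagger/tagcorpus"):
    """Assemble the tagcorpus call as an argument list (one option per item)."""
    return [
        binary,
        f"--documents={documents}",
        f"--threads={threads}",
        "--autodetect",
        f"--types={input_dir}/LSFC_types.tsv",
        f"--entities={input_dir}/LSFC_entities.tsv",
        f"--names={input_dir}/LSFC_names_disambiguated.tsv",
        f"--stopwords={input_dir}/all_global.tsv",
        f"--local-stopwords={input_dir}/all_local.tsv",
        f"--groups={input_dir}/LSFC_groups.tsv",
        f"--out-matches={output_file}",
    ]


command = build_tagcorpus_command(
    documents="../../data/NER-Benchmarking/annotations/abstracts.tsv",
    input_dir="tagger_input",
    output_file="tagger_output/all_matches_benchmark.tsv",
)
# Print the shell-quoted command for inspection
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would then start the run.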




4. Post-process the Tagger output
    

* Tagger does not assign a category to the tagged entities, so here we replace the serial number with the LSF category name

In [7]:
# Load LSFC 

import os
import sys
import warnings

import pandas as pd
warnings.filterwarnings("ignore")
module_path = os.path.abspath(os.path.join('../../'))
if module_path not in sys.path:
    sys.path.append(module_path+"//utils")

import retrieve_LSFC

import importlib
# reload if library gets updated
importlib.reload(retrieve_LSFC)

### Load LSFC 

LSFC_file='../../LSFC/LSFC.obo'
id_to_name,name_to_id,id_to_synonyms,id_2_childs,id_2_parents=retrieve_LSFC.read_LSFC(LSFC_file)
# Get names of the 9 main LSF categories
LSF_exisiting_names, LSF_Labels,LFIDs,categories=retrieve_LSFC.generate_lfid_categories_labels(LSFC_file)

# to make it consistent with annotation attributes
categories=['Beauty_and_Cleaning','Nutrition','Drugs','Environmental_exposures','Non_physical_leisure_time_activities','Physical_activity','Sleep','Socioeconomic_factors','Mental_health_practices']
# Map each LFID to the name of its category
lfid_to_category={}
for i,lfid in enumerate(LFIDs):
    lfid_to_category[lfid]=categories[LSF_Labels[i]]


#a dictionary to store the serial number of the detected entities and the corresponding lsf category
serial_to_cargeory={}
tagger_entities=pd.read_csv('../../data/NER-Benchmarking/tagger/tagger_input/LSFC_entities.tsv',sep='\t')
tagger_entities.columns=['serial','entity_type','lfid']
for i,row in tagger_entities.iterrows():
    if row['lfid']=='LFID:0000000':
        continue
    serial_to_cargeory[row['serial']]=lfid_to_category[row['lfid']]


# load tagger matches  
tagger_matches_benchmark=pd.read_csv('../../data/NER-Benchmarking/tagger/tagger_output/all_matches_benchmark.tsv',sep='\t',header=None)

for index, row in tagger_matches_benchmark.iterrows():
    
    serial = row[7]
    if serial==1000000001:
        continue
    #replace serial number with the corresponding lsf category
    tagger_matches_benchmark.at[index,7] = serial_to_cargeory[serial]

tagger_matches_benchmark.to_csv('../../data/NER-Benchmarking/tagger/tagger_output/all_matches_benchmark_with_branches.tsv',sep='\t',header=None,index=False)
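The relabelling step can be illustrated on in-memory rows. This is a simplified sketch, not the notebook's code: the function name `relabel_matches`, the example serial `1000000005`, and the sample row are hypothetical. Unlike a direct dictionary lookup, it collects unknown serials instead of raising a `KeyError`.

```python
def relabel_matches(rows, serial_to_category, skipped_serial=1000000001):
    """Replace the serial (8th column) with its category name where one is known."""
    relabeled, unknown = [], set()
    for row in rows:
        row = list(row)
        serial = row[7]
        if serial != skipped_serial:
            if serial in serial_to_category:
                row[7] = serial_to_category[serial]
            else:
                # Keep the row unchanged but remember the unmapped serial
                unknown.add(serial)
        relabeled.append(row)
    return relabeled, unknown


# Hypothetical match rows with eight columns; the last one is the serial
rows = [
    ["21719896", 1, 1, 1086, 1098, "heavy smokers", -20, 1000000005],
    ["21719896", 1, 1, 10, 13, "work", -20, 1000000001],
    ["21719896", 1, 1, 40, 43, "tea", -20, 1000000999],
]
relabeled, unknown = relabel_matches(rows, {1000000005: "Drugs"})
print(relabeled[0][7], unknown)
```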



* Remove duplicate lines (some lines were redundant); the deduplicated file is saved below

In [9]:
import pandas as pd

raw_input_file_path = '../../data/NER-Benchmarking/tagger/tagger_output/all_matches_benchmark_with_branches.tsv'

deduplicated_input_file_path = '../../data/NER-Benchmarking/tagger/tagger_output/all_matches_benchmark_with_branches_deduplicated.tsv'

# Read the input file, remove duplicates, and write to the output file

df=pd.read_csv(raw_input_file_path,sep='\t',header=None)
df.drop_duplicates(inplace=True)
df.to_csv(deduplicated_input_file_path,index=False,header=None,sep='\t')





5. Generate BRAT files from the Tagger results
    * This makes it possible to compare the *.ann files produced by Tagger with those created by the annotator and to compute the performance


In [4]:
# Copy raw *.txt files of the abstracts
import os
import shutil

def copy_txt_files(source_dir, target_dir):
    # Create the target directory if it doesn't exist
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)

    for filename in os.listdir(source_dir):
        source_path = os.path.join(source_dir, filename)
        if os.path.isfile(source_path) and filename.endswith('.txt'):
            target_path = os.path.join(target_dir, filename)
            shutil.copy2(source_path, target_path)  # Using copy2 to preserve metadata

source_directory = "../../data/NER-Benchmarking/annotations/raw/"
target_directory = "../../data/NER-Benchmarking/tagger/tagger_output/brat_files/"

copy_txt_files(source_directory, target_directory)


In [None]:
%%bash

#Produces *.ann files for the abstracts based on the tagger matches 
awk -F'\t' '{printf ("%s\t%s\t%s\t%s\n", $4, $5, $6, $7) >> "../../data/NER-Benchmarking/tagger/tagger_output/brat_files/"$1".tsv"; close("../../data/NER-Benchmarking/tagger/tagger_output/brat_files/"$1".tsv")}' ../../data/NER-Benchmarking/tagger/tagger_output/all_matches_benchmark_with_branches_deduplicated.tsv

cd ../../data/NER-Benchmarking/tagger/tagger_output/brat_files/

# Add incremental indexing to files produced
for f in *.tsv
do
    awk -F "\t" '{$1=++i FS $1; printf("T%s\n",$0)}' OFS="\t" $f > "$(basename "$f" .tsv).tsv2";
    rm ${f}
done

# Make file in ann format
for f in *.tsv2
do
     awk -F "\t" '{if($5~/-20/){printf("%s\tLifestyle_factor %s %s\t%s\n", $1, $2, ++$3, $4)}else{printf("%s\tDisease %s %s\t%s\n", $1, $2, ++$3, $4)}}' OFS="\t" $f > "$(basename "$f" .tsv2).ann";
     rm ${f}
done
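For readers less familiar with awk, the conversion performed by the bash cell above can be sketched in Python. This is an illustrative equivalent under stated assumptions, not the code that was run: the function name `matches_to_ann` and the sample offsets are hypothetical.

```python
def matches_to_ann(matches):
    """Convert (start, end, text, entity_type) tuples of one document to .ann lines."""
    lines = []
    for index, (start, end, text, entity_type) in enumerate(matches, start=1):
        # Same test as the awk script: type -20 is a lifestyle factor, anything else a disease
        label = "Lifestyle_factor" if "-20" in str(entity_type) else "Disease"
        # As in the awk script, the end offset is incremented by one
        lines.append(f"T{index}\t{label} {start} {int(end) + 1}\t{text}")
    return lines


# Hypothetical matches for a single abstract
ann_lines = matches_to_ann([
    (1086, 1098, "heavy smokers", -20),
    (120, 125, "asthma", -26),
])
print("\n".join(ann_lines))
```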


6. Create empty *.ann files for the abstracts in which Tagger found no matches

In [5]:
import os

def create_missing_ann_files(source_dir):
    for filename in os.listdir(source_dir):
        if filename.endswith('.txt'):
            txt_file_path = os.path.join(source_dir, filename)
            ann_file_path = os.path.splitext(txt_file_path)[0] + '.ann'
            
            if not os.path.exists(ann_file_path):
                with open(ann_file_path, 'w') as ann_file:
                    pass  # Creates an empty .ann file

source_directory = "../../data/NER-Benchmarking/tagger/tagger_output/brat_files/"

create_missing_ann_files(source_directory)


# Post processing manual annotations

### 1. Check consistency of mentions annotation
* Use the brat annotation consistency check to create a report of missing annotations and correct those mistakes
    * clone BRAT annotation tool repo
    * https://github.com/nlplab/brat.git
    * use search.py in the brat server to find inconsistencies in the annotation
    * Manually correct the annotation

In [86]:
# consistency check 
! python /brat/server/src/search.py  -cm  ../../data/NER-Benchmarking/annotations/raw/*.ann >  ../../data/NER-Benchmarking/annotations/raw/missing_mentions.tsv


### 2. Replace the generic Lifestyle_factor with the corresponding attribute

* By default, annotations are categorized by attributes, which are inserted in the *.ann files as follows:
    * T1	Lifestyle_factor 56 79	agricultural wastewater
    * A1	Environmental_exposures T1
    * T2	Lifestyle_factor 899 922	agricultural wastewater
    * A2	Environmental_exposures T2
    * T3	Lifestyle_factor 204 210	Taiwan
    * A3	Geographical_Feature T3
    * T4	Lifestyle_factor 232 245	water quality
    * A4	Environmental_exposures T4
    * T5	Lifestyle_factor 704 717	water quality
    * A5	Environmental_exposures T5
* We want to replace Lifestyle_factor with the exact lifestyle branch, which makes it easy to compute categorized metrics for every branch

In [1]:
import os

import pandas as pd

def process_annotation_files(input_directory, output_directory, skip_lsf_out_of_context=False):
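    # Make sure the output directory exists before writing to it
    os.makedirs(output_directory, exist_ok=True)

    # List all input files in the directory
    input_files = [f for f in os.listdir(input_directory) if f.endswith('.ann')]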
    
    # Process each .ann input file
    for input_file in input_files:
        # Initialize an empty DataFrame to keep the updated annotations
        updated_annotations = pd.DataFrame(columns=['id', 'lsf_type', 'entity'])

        input_file_path = os.path.join(input_directory, input_file)
        with open(input_file_path, 'r') as file:
            lines = file.readlines()

        for line in lines:
            line = line.strip()

            if line.startswith('T'):
                new_row = line.split('\t')
                new_row_df = pd.DataFrame([new_row], columns=updated_annotations.columns)
                # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
                updated_annotations = pd.concat([updated_annotations, new_row_df], ignore_index=True)
            elif line.startswith('A'):
                if skip_lsf_out_of_context:
                    new_type = line.split()[1]
                    if new_type == 'LSF_out_of_context':
                        continue
                new_type = line.split()[1]
                target_id = line.split()[-1]
                old_type = updated_annotations.loc[updated_annotations['id'] == target_id, 'lsf_type'].tolist()[0]
                old_type = old_type.split()
                if old_type[0] != 'LSF_out_of_context':
                    old_type[0] = new_type
                updated_type = ' '.join(old_type)
                updated_annotations.loc[updated_annotations['id'] == target_id, 'lsf_type'] = updated_type

        # Write modified content to a new output .ann file
        output_file_path = os.path.join(output_directory, input_file)  # Specify the output directory
        updated_annotations.to_csv(output_file_path, sep='\t', index=None, header=False)


### 3. Create different variations of annotations regarding OOC mentions
* Some annotated mentions carry the attribute 'LSF_out_of_context' (OOC), meaning that the mention is an LSF in general but not in this context
* We create two different copies of the annotations with respect to 'LSF_out_of_context'


* 1. Create a modified version of the annotations in which the LSF_out_of_context attributes are removed, leaving each entity with its original attribute type
    * As a result, OOC entities appear as normal LSFs whose type is determined by their category attribute


In [None]:

input_directory = '../../data/NER-Benchmarking/annotations/raw/'
output_directory = '../../data/NER-Benchmarking/annotations/categorized_no_OOC/'
skip_lsf_out_of_context = True  # Set to True to skip lines with 'LSF_out_of_context' as the new type
process_annotation_files(input_directory, output_directory, skip_lsf_out_of_context)


* 2. Create a modified version of the annotations in which the LSF_out_of_context attributes are assigned as entity types
    * As a result, OOC entities appear with LSF_out_of_context as their type


In [None]:

input_directory = '../../data/NER-Benchmarking/annotations/raw/'
output_directory = '../../data/NER-Benchmarking/annotations/categorized/'
skip_lsf_out_of_context = False  # Set to True to skip lines with 'LSF_out_of_context' as the new type
process_annotation_files(input_directory, output_directory, skip_lsf_out_of_context)
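The difference between the two variants can be seen on a toy example. The sketch below is a dependency-free re-implementation of the same rule for illustration only; the function name `categorize_ann_lines` is hypothetical, and the `A2 LSF_out_of_context T1` line is an invented attribute added to one of the sample annotations shown earlier.

```python
def categorize_ann_lines(lines, skip_lsf_out_of_context=False):
    """Replace each entity's generic type with the type given by its attribute lines."""
    entities = {}  # entity id -> ([type, start, end], text), in file order
    for line in lines:
        line = line.strip()
        if line.startswith("T"):
            entity_id, type_and_span, text = line.split("\t")
            entities[entity_id] = (type_and_span.split(), text)
        elif line.startswith("A"):
            parts = line.split()
            new_type, target_id = parts[1], parts[-1]
            if skip_lsf_out_of_context and new_type == "LSF_out_of_context":
                continue
            type_and_span = entities[target_id][0]
            # Once an entity is marked out of context, later attributes do not override it
            if type_and_span[0] != "LSF_out_of_context":
                type_and_span[0] = new_type
    return [f"{eid}\t{' '.join(span)}\t{text}" for eid, (span, text) in entities.items()]


sample = [
    "T1\tLifestyle_factor 56 79\tagricultural wastewater",
    "A1\tEnvironmental_exposures T1",
    "A2\tLSF_out_of_context T1",  # hypothetical OOC attribute for illustration
]
no_ooc = categorize_ann_lines(sample, skip_lsf_out_of_context=True)
with_ooc = categorize_ann_lines(sample, skip_lsf_out_of_context=False)
print(no_ooc[0])
print(with_ooc[0])
```

With the OOC attribute skipped, the mention keeps its branch (`Environmental_exposures`); otherwise it ends up typed as `LSF_out_of_context`.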


# Run benchmark

* The 2>&1 at the end of each command ensures that both stdout and stderr are redirected to the output file.
* All runs exclude Geographical_Feature and Occupations, because these were annotated for future use cases and the NER system is not expected to detect geographical features such as country names, or occupations. 'Out-of-scope' is also excluded from all analyses, since it covers cases such as 'work' in 'In this work ...'

    *  -f  Geographical_Feature,Occupations


# We run benchmark in two different versions:
1. Single LSF type
2. Categorized LSF types


# 1. Single LSF type

### Scenario 1 (published):

* We use -i, which ignores the types, because the goal here is only to check whether LSFs are detected, not whether the different LSF branches are distinguished correctly


* Performance:

    * precision 96.01% (938/977) recall 49.39% (927/1877) F 65.22%




* Note: the 'heavy smokers' issue is solved (file name: 21719896)
* T9	Drugs 1086 1099	heavy smokers from '../data/200_abstracts/taggers_matches/brat_files/' 
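The Scenario 1 figures can be reproduced from the counts in the report. Note that the report gives different numerators for precision (938 of 977 Tagger matches) and recall (927 of 1877 annotated mentions), so this small sketch takes both; the function name `precision_recall_f` is hypothetical, and the F-score is assumed to be the usual harmonic mean of precision and recall.

```python
def precision_recall_f(matched_predictions, total_predictions, matched_gold, total_gold):
    """Compute precision, recall, and their harmonic mean from match counts."""
    precision = matched_predictions / total_predictions
    recall = matched_gold / total_gold
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


p, r, f = precision_recall_f(938, 977, 927, 1877)
print(f"precision {p:.2%} recall {r:.2%} F {f:.2%}")
# prints: precision 96.01% recall 49.39% F 65.22%
```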


In [10]:
%%bash

# Define the file paths as Bash variables
echo 'annotation_directory="../../data/NER-Benchmarking/annotations/categorized/"' > variable.sh
echo 'tagger_directory="../../data/NER-Benchmarking/tagger/tagger_output/brat_files/"' >> variable.sh
echo 'output_file_single_lsf_type_OOC_as_LSF="../../data/NER-Benchmarking/benchmark_results/output_single_lsf_type_OOC_as_LSF.txt"' >> variable.sh

echo 'output_file_single_lsf_type_OOC_as_LSF_FP_Lines="../../data/NER-Benchmarking/benchmark_results/output_file_single_lsf_type_OOC_as_LSF_FP_Lines.txt"' >> variable.sh
echo 'output_file_single_lsf_type_OOC_as_LSF_FN_Lines="../../data/NER-Benchmarking/benchmark_results/output_file_single_lsf_type_OOC_as_LSF_FN_Lines.txt"' >> variable.sh
echo 'output_file_single_lsf_type_OOC_as_LSF_performance_Lines="../../data/NER-Benchmarking/benchmark_results/output_file_single_lsf_type_OOC_as_LSF_performance_Lines.txt"' >> variable.sh

source variable.sh

# Run your Python script with the variables
python2 ../../utils/IAA.py -o -d -v -i -f Out-of-scope,Geographical_Feature,Occupations "$annotation_directory" "$tagger_directory" --allowmissing > "$output_file_single_lsf_type_OOC_as_LSF" 2>&1

grep 'FP:' "$output_file_single_lsf_type_OOC_as_LSF" > "$output_file_single_lsf_type_OOC_as_LSF_FP_Lines"
grep 'FN:' "$output_file_single_lsf_type_OOC_as_LSF" > "$output_file_single_lsf_type_OOC_as_LSF_FN_Lines"
grep  -vE 'TP:|TN:|FP:|FN:|MATCH:' "$output_file_single_lsf_type_OOC_as_LSF" > "$output_file_single_lsf_type_OOC_as_LSF_performance_Lines"
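The three `grep` calls above split the report into false positives, false negatives, and the remaining summary lines. A Python equivalent might look like the sketch below; the function name `split_report` and the per-mention sample lines are hypothetical, since the exact layout of the `IAA.py` output is not shown here.

```python
import re

# Same alternation as the grep -vE pattern above
DETAIL_MARKER = re.compile(r"TP:|TN:|FP:|FN:|MATCH:")


def split_report(lines):
    """Split report lines into (FP lines, FN lines, summary lines)."""
    fp_lines = [line for line in lines if "FP:" in line]
    fn_lines = [line for line in lines if "FN:" in line]
    summary_lines = [line for line in lines if not DETAIL_MARKER.search(line)]
    return fp_lines, fn_lines, summary_lines


# Hypothetical report excerpt; only the last line is taken from the results above
report = [
    "TP: example true positive",
    "FP: example false positive",
    "FN: example false negative",
    "precision 96.01% (938/977) recall 49.39% (927/1877) F 65.22%",
]
fp_lines, fn_lines, summary_lines = split_report(report)
print(len(fp_lines), len(fn_lines), summary_lines)
```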



### Scenario 2 (published):
* This is used for error analysis

* This is the actual performance of the NER system, since OOCs (LSF_out_of_context) are excluded (by -f)
* Without this exclusion, OOC mentions inflate the performance: when types are ignored (-i), an OOC mention that matches a Tagger result counts as a correct match

* We use -i, which ignores the types, because the goal here is only to check whether LSFs are detected, not whether the different LSF branches are distinguished correctly




In [49]:
%%bash

# Define the file paths as Bash variables
echo 'output_file_single_lsf_type_OOCs_filtered="../../data/NER-Benchmarking/benchmark_results/output_single_lsf_type_OOCs_filtered.txt"' >> variable.sh
echo 'output_file_single_lsf_type_OOCs_filtered_FP_Lines="../../data/NER-Benchmarking/benchmark_results/output_file_single_lsf_type_OOCs_filtered_FP_Lines.txt"' >> variable.sh
echo 'output_file_single_lsf_type_OOCs_filtered_FN_Lines="../../data/NER-Benchmarking/benchmark_results/output_file_single_lsf_type_OOCs_filtered_FN_Lines.txt"' >> variable.sh

source variable.sh

# Run your Python script with the variables
python2 ../../utils/IAA.py -o -d -v -i -f Out-of-scope,LSF_out_of_context,Geographical_Feature,Occupations "$annotation_directory" "$tagger_directory" --allowmissing > "$output_file_single_lsf_type_OOCs_filtered" 2>&1

grep 'FP:' "$output_file_single_lsf_type_OOCs_filtered" > "$output_file_single_lsf_type_OOCs_filtered_FP_Lines"
grep 'FN:' "$output_file_single_lsf_type_OOCs_filtered" > "$output_file_single_lsf_type_OOCs_filtered_FN_Lines"



# 2. Different LSF branches (plot):
* Run with distinguished LSF branches
* If we remove the -i option and run the benchmark, it also shows the performance for the different LSF branches
    * The result is now slightly worse: if the annotator selected the wrong branch, or more generally if Tagger and the annotator agree that a term is an LSF but disagree on the branch, this produces an FP or FN, which reduces the performance
* The goal is to compare the categorized performance under two different options regarding OOC
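The effect of a branch disagreement can be shown with a toy comparison. This sketch uses exact span matching on sets, which is a simplification and not the actual matching logic of `IAA.py`; the function name `count_matches` and all spans are hypothetical.

```python
def count_matches(gold, predicted, ignore_type=False):
    """Return (TP, FP, FN) for annotations given as (start, end, type) tuples."""
    def key(annotation):
        start, end, entity_type = annotation
        return (start, end) if ignore_type else (start, end, entity_type)

    gold_keys = {key(a) for a in gold}
    predicted_keys = {key(a) for a in predicted}
    true_positives = len(gold_keys & predicted_keys)
    false_positives = len(predicted_keys - gold_keys)
    false_negatives = len(gold_keys - predicted_keys)
    return true_positives, false_positives, false_negatives


# Both sides agree on the spans but disagree on the branch of the second mention
gold = [(10, 20, "Nutrition"), (30, 40, "Sleep")]
predicted = [(10, 20, "Nutrition"), (30, 40, "Physical_activity")]

untyped = count_matches(gold, predicted, ignore_type=True)
typed = count_matches(gold, predicted, ignore_type=False)
print("types ignored:", untyped)
print("types compared:", typed)
```

In this simplified setting, one branch disagreement costs both a false positive and a false negative once types are compared.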


1. Use a version of the annotations in which OOCs appear as a separate type ('LSF_out_of_context'). Whether or not we filter them out with the -f parameter, they are effectively ignored, since the categorized report lists the performance for that type separately from the other LSF types and we leave it out of the plot

In [42]:
%%bash

# Define the file paths as Bash variables
echo 'output_file_categorized_lsf_type_OOCs_filtered="../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_filtered.txt"' >> variable.sh
echo 'output_file_categorized_lsf_type_OOCs_filtered_FP_Lines="../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_filtered_FP_Lines.txt"' >> variable.sh
echo 'output_file_categorized_lsf_type_OOCs_filtered_FN_Lines="../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_filtered_FN_Lines.txt"' >> variable.sh


source variable.sh

# Run your Python script with the variables
python2 ../../utils/IAA.py -o -d -v -f Out-of-scope,LSF_out_of_context,Geographical_Feature,Occupations "$annotation_directory" "$tagger_directory" --allowmissing > "$output_file_categorized_lsf_type_OOCs_filtered" 2>&1

grep 'FP:' "$output_file_categorized_lsf_type_OOCs_filtered" > "$output_file_categorized_lsf_type_OOCs_filtered_FP_Lines"
grep 'FN:' "$output_file_categorized_lsf_type_OOCs_filtered" > "$output_file_categorized_lsf_type_OOCs_filtered_FN_Lines"



2. Use a version of the annotations in which OOCs appear as normal LSF types and are reported under the corresponding LSF branch (this makes it possible to compare how treating the OOCs as LSFs <case 2.> versus removing them <case 1.> affects the performance of the different branches)

In [43]:
%%bash

# A version of the annotations in which OOCs appear as the corresponding LSF type
echo 'annotation_no_OOC_directory="../../data/NER-Benchmarking/annotations/categorized_no_OOC/"' >> variable.sh


# Define the file paths as Bash variables
echo 'output_file_categorized_lsf_type_OOCs_as_LSF="../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_as_LSF.txt"' >> variable.sh
echo 'output_file_categorized_lsf_type_OOCs_as_LSF_FP_Lines="../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_as_LSF_FP_Lines.txt"' >> variable.sh
echo 'output_file_categorized_lsf_type_OOCs_as_LSF_FN_Lines="../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_as_LSF_FN_Lines.txt"' >> variable.sh




source variable.sh

# Run your Python script with the variables
python2 ../../utils/IAA.py -o -d -v -f Out-of-scope,Geographical_Feature,Occupations "$annotation_no_OOC_directory" "$tagger_directory" --allowmissing > "$output_file_categorized_lsf_type_OOCs_as_LSF" 2>&1

grep 'FP:' "$output_file_categorized_lsf_type_OOCs_as_LSF" > "$output_file_categorized_lsf_type_OOCs_as_LSF_FP_Lines"
grep 'FN:' "$output_file_categorized_lsf_type_OOCs_as_LSF" > "$output_file_categorized_lsf_type_OOCs_as_LSF_FN_Lines"


# Plot

* extract metrics from text files

In [12]:
import pandas as pd
import re

def extract_data_and_save(file_path):
    # Initialize an empty DataFrame
    pr_rec_per_lsf_branch = pd.DataFrame(columns=['Lifestyle-factor branch', 'Precision', 'Recall', 'F'])

    # Read the text from the file
    with open(file_path, 'r') as file:
        text = file.read()

    # Split the text into lines
    lines = text.strip().split('\n')

    # Iterate over each line and extract data
    for line in lines:
        match = re.match(r'TYPE:\s+(.*?)\s+precision\s+([\d.]+%)\s+\((\d+)/(\d+)\)\s+recall\s+([\d.]+%)\s+\((\d+)/(\d+)\)\s+F\s+([\d.]+)', line)
        if match:
            revision_step, precision, _, _, recall, _, _, f_score = match.groups()
            if revision_step not in ['Beauty_and_Cleaning', 'Drugs', 'Environmental_exposures',
                                     'Mental_health_practices', 'Non_physical_leisure_time_activities',
                                     'Nutrition', 'Physical_activity', 'Sleep', 'Socioeconomic_factors']:
                continue

            # Remove the percentage symbols and convert to floats
            precision = float(precision.rstrip('%'))
            recall = float(recall.rstrip('%'))
            f_score = float(f_score)
            # Append the row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
            pr_rec_per_lsf_branch = pd.concat([pr_rec_per_lsf_branch, pd.DataFrame([{
                'Lifestyle-factor branch': revision_step.strip(),
                'Precision': precision,
                'Recall': recall,
                'F': f_score
            }])], ignore_index=True)

    # Sort the DataFrame by the 'F' column in ascending order
    pr_rec_per_lsf_branch = pr_rec_per_lsf_branch.sort_values(by='F')
    
    # Return the resulting DataFrame
    return pr_rec_per_lsf_branch

# full_output_categorized_exclude_OOC:
file_path = "../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_filtered.txt"
resulting_df = extract_data_and_save(file_path)
# Save the DataFrame to a TSV file
tsv_file_path ="../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_filtered_prec_rec.tsv"
resulting_df.to_csv(tsv_file_path, sep='\t', index=None)



# full_output_categorized:
file_path = "../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_as_LSF.txt"
resulting_df = extract_data_and_save(file_path)
# Save the DataFrame to a TSV file
tsv_file_path = "../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_as_LSF_prec_rec.tsv"
resulting_df.to_csv(tsv_file_path, sep='\t', index=None)
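To see what the regular expression in `extract_data_and_save` captures, it can be tried on a single line. The sample line below is hypothetical and only mirrors the layout that the pattern expects; the function name `parse_type_line` is also an illustration, not part of the notebook.

```python
import re

# Same structure as the pattern used above, with the percent signs left outside the groups
TYPE_LINE = re.compile(
    r'TYPE:\s+(.*?)\s+precision\s+([\d.]+)%\s+\((\d+)/(\d+)\)'
    r'\s+recall\s+([\d.]+)%\s+\((\d+)/(\d+)\)\s+F\s+([\d.]+)'
)


def parse_type_line(line):
    """Return the per-branch metrics of one report line, or None if it does not match."""
    match = TYPE_LINE.match(line)
    if match is None:
        return None
    branch, precision, _, _, recall, _, _, f_score = match.groups()
    return {
        'Lifestyle-factor branch': branch,
        'Precision': float(precision),
        'Recall': float(recall),
        'F': float(f_score),
    }


# Hypothetical report line for illustration
sample_line = "TYPE: Nutrition precision 90.00% (9/10) recall 45.00% (9/20) F 60.00%"
parsed = parse_type_line(sample_line)
print(parsed)
```

Lines that do not start with `TYPE:` return `None`, which is how the per-mention lines of the report are skipped.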




* Modify the labels so that they are suitable for plotting

In [13]:
import pandas as pd

def replace_phrases_in_dataframe(input_file_path, phrase_replacements):
    # Read the input TSV file into a DataFrame
    df = pd.read_csv(input_file_path, sep='\t')

    # Replace the phrases in the DataFrame
    df['Lifestyle-factor branch'] = df['Lifestyle-factor branch'].replace(phrase_replacements)

    # Write the modified DataFrame to the output TSV file
    df.to_csv(input_file_path, sep='\t', index=False)


phrase_replacements = {
    'Environmental_exposures': 'Environmental exposures',
    'Physical_activity': 'Physical activities',
    'Socioeconomic_factors': 'Socioeconomic factors',
    'Drugs': 'Substance use',
    'Mental_health_practices': 'Mental health practices',
    'Non_physical_leisure_time_activities': 'Non physical leisure time activities',
    'Beauty_and_Cleaning': 'Beauty and cleaning'
}


# replace names in both files
input_file_path = "../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_filtered_prec_rec.tsv"
replace_phrases_in_dataframe(input_file_path, phrase_replacements)

input_file_path = "../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_as_LSF_prec_rec.tsv"
replace_phrases_in_dataframe(input_file_path, phrase_replacements)




In [14]:
# merge two files

import pandas as pd

# Load the first TSV file into a DataFrame
file1 = "../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_filtered_prec_rec.tsv"
df1 = pd.read_csv(file1, sep='\t')

# Load the second TSV file into a DataFrame
file2 = "../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_as_LSF_prec_rec.tsv"
df2 = pd.read_csv(file2, sep='\t')

# Add the "_ooc_as_LSF" suffix to the metric columns of the second DataFrame
df2.columns = ['Lifestyle-factor branch']+[col + '_ooc_as_LSF' for col in df2.columns if col!='Lifestyle-factor branch']

# Merge the two DataFrames based on the "Lifestyle-factor branch" column
merged_df = df1.merge(df2, on='Lifestyle-factor branch')

# Path of the merged TSV file (written after sorting below)
merged_file = "../../data/NER-Benchmarking/benchmark_results/plot_input_dict_based.tsv"


#Define the specific order you want for 'Lifestyle-factor branch'
desired_order = ['nutrition', 'socioeconomic factors', 'environmental exposures', 'substance use','physical activities', 'beauty and cleaning',  'non physical leisure time activities',  'sleep' , 'mental health practices']
# Create a new column with the order of 'Lifestyle-factor branch' (ignoring case)
merged_df['Order'] = merged_df['Lifestyle-factor branch'].str.lower().map({value.lower(): index for index, value in enumerate(desired_order)})

# Sort the DataFrame based on the new 'Order' column
merged_df_sorted = merged_df.sort_values(by='Order').drop('Order', axis=1)


merged_df_sorted.to_csv(merged_file, sep='\t', index=False)

1. Plot using only the results that exclude OOC cases

In [48]:

!python3 ./scripts/plot_prec_rec_progression.py --input_file=../../data/NER-Benchmarking/benchmark_results/output_categorized_lsf_type_OOCs_filtered_prec_rec.tsv --task="Precision-Recall Plot for Lifestyle-factors NER" --output_file=../../plots/NER_Benchmark_Tagger.png


Figure(1400x1200)


2. Plot comparing the results with and without OOC cases

In [49]:

!python3  ./scripts/plot_prec_rec_ooc.py  --input_file=../../data/NER-Benchmarking/benchmark_results/plot_input_dict_based.tsv --task="" --output_file=../../plots/NER_Benchmark_Tagger_OOC.png


Figure(1200x1200)
