<a name="top"></a>
<br/>
# Script for parsing

Topic: Improvement of the performance of NER (named-entity recognition) on [PGxCorpus](https://www.biorxiv.org/content/10.1101/534388v3) with [BioBERT](https://arxiv.org/abs/1901.08746).

Authors:
- [Sylvain Combettes](https://www.linkedin.com/in/sylvain-combettes/), Ecole des Mines de Nancy
- [Mohammed El Yaagoubi](https://www.linkedin.com/in/melyaagoubi/), TELECOM Nancy
- [Guillaume Gatti](https://www.linkedin.com/in/guillaume-gatti-b2a482149/), TELECOM Nancy
- [Zouhair Janati-Idrissi](https://www.linkedin.com/in/zouhair-janati-idrissi/), TELECOM Nancy
- [Ali Labbaize](https://www.linkedin.com/in/alilabbaize/), TELECOM Nancy

Supervisors:
- [Adrien Coulet](https://members.loria.fr/ACoulet/), Loria
- [Joël Legrand](http://joel-legrand.fr/hugo/), Loria

---
In this notebook, we write a Python script that converts data from the BRAT standoff format to the IOB (inside-outside-beginning) format.
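
To make the two formats concrete, here is a minimal sketch on a hand-written example. The sentence and the annotation line are hypothetical (they are not copied from PGxCorpus); only the general layout of a BRAT entity line (identifier, then type with character offsets, then text, separated by tabs) is assumed.

```python
# Hypothetical sentence and BRAT entity line, written by hand for illustration.
sentence = "amphetamine alters RGS2 expression"
brat_line = "T1\tChemical 0 11\tamphetamine"

# A BRAT entity line has three tab-separated fields.
ident, type_and_span, text = brat_line.split("\t")
label, start, end = type_and_span.split(" ")
start, end = int(start), int(end)

# The character offsets point back into the sentence.
assert sentence[start:end] == text

# IOB assigns one tag per token: B- opens an entity, I- continues it, O is outside.
tokens = sentence.split(" ")
tags = ["B-" + label if token == text else "O" for token in tokens]
for token, tag in zip(tokens, tags):
    print(token, tag)
```

This toy version only handles a single-token entity; multi-token entities additionally need `I-` tags for every token after the first.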

---
### Imports

In [1]:
import os

---
# Getting the list of the names of the `txt` and `ann` files

In [2]:
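def list_files(directory):
    # Walk the directory tree and collect every file name,
    # each wrapped in a one-element list (e.g. ['10070957_8.ann'])
    tab = []
    for path, dirs, files in os.walk(directory):
        for file in files:
            tab.append([file])
    return tab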

In [3]:
all_files = list_files("PGxCorpus")
print(len(all_files))
all_files[:5]

1892


[['23306941_1.txt'],
 ['16944116_13.txt'],
 ['25522765_1.ann'],
 ['21596890_6.ann'],
 ['18852012_19.ann']]

We sort the names so that each `ann` file is immediately followed by the `txt` file with the same base name:

In [4]:
all_files.sort()
all_files[0:5]

[['10070957_8.ann'],
 ['10070957_8.txt'],
 ['10087449_9.ann'],
 ['10087449_9.txt'],
 ['10098858_5.ann']]
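
The rest of the notebook relies on this alternating `ann`/`txt` order. As a sanity check, the sketch below verifies that consecutive entries really share a base name. The helper `check_pairs` is hypothetical (it is not part of the original script), and the sample list reuses the first names printed above.

```python
import os

def check_pairs(files):
    """Return the (ann, txt) pairs that do not share a base name."""
    mismatches = []
    for i in range(0, len(files) - 1, 2):
        ann, txt = files[i][0], files[i + 1][0]
        ann_stem, ann_ext = os.path.splitext(ann)
        txt_stem, txt_ext = os.path.splitext(txt)
        # A valid pair is "<stem>.ann" followed by "<stem>.txt"
        if (ann_stem, ann_ext, txt_ext) != (txt_stem, ".ann", ".txt"):
            mismatches.append((ann, txt))
    return mismatches

sample = [['10070957_8.ann'], ['10070957_8.txt'],
          ['10087449_9.ann'], ['10087449_9.txt']]
print(check_pairs(sample))
```

An empty result means every even index holds an `ann` file and the following odd index holds its `txt` counterpart.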

---
# Parsing the `txt` files

In [5]:
def parse_txt(filename):
    # filename is a one-element list such as ['10087449_9.txt']
    path_txt = "PGxCorpus/" + str(filename[0])
    with open(path_txt, "r") as my_file:
        my_file_bis = my_file.read()
    # Split on single spaces to get the tokens
    parsed_txt = my_file_bis.split(" ")
    return parsed_txt

In [6]:
phrase = parse_txt(all_files[3]) # make sure that the index corresponds to a txt file
phrase

['PACAP',
 '-induced',
 'expression',
 'of',
 'the',
 'c-fos',
 'gene',
 'was',
 'significantly',
 'reduced',
 'by',
 'pretreatment',
 'with',
 'a',
 'PACAP',
 'receptor',
 'antagonist',
 ',',
 'PACAP',
 '-',
 '(',
 '6-38',
 ')',
 '-',
 'NH2',
 '.']
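
Splitting on a single space works for the sentence above. The sketch below shows how `split(" ")` behaves on inputs that are less clean; the strings are hypothetical, and whether any corpus file actually ends with a newline or contains double spaces is an assumption we have not checked.

```python
# Hypothetical raw contents: one with a trailing newline, one with a double space.
raw = "PACAP -induced expression .\n"
by_space = raw.split(" ")       # splits on single spaces only
by_whitespace = raw.split()     # splits on any run of whitespace

print(by_space)        # ['PACAP', '-induced', 'expression', '.\n']
print(by_whitespace)   # ['PACAP', '-induced', 'expression', '.']

double = "a  b"
print(double.split(" "))   # ['a', '', 'b']
print(double.split())      # ['a', 'b']
```

With `split(" ")`, a trailing newline stays attached to the last token and a double space produces an empty token, so either would end up in the IOB output.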

---
# Parsing the `ann` files

In [7]:
def parse_ann(filename):
    
    path_ann = "PGxCorpus/" + str(filename[0])
    with open(path_ann, "r") as my_file_ann:
        my_file_bis_ann = my_file_ann.read()
    
    # Keep only entity lines (starting with 'T') without ';', i.e. without a discontinuous span
    parsed = my_file_bis_ann.split("\n")
    new_parsed =  []
    for i in range(0,len(parsed)):
        if len(parsed[i]) != 0 :
            if parsed[i][0] == 'T' and ";" not in parsed[i]:
                new_parsed.append(parsed[i])
    # Drop the identifier column, leaving "type start end" followed by the entity text
    newer_parsed= []
    for j in range(0,len(new_parsed)): 
        chaine1 = new_parsed[j].split('\t')
        del chaine1[0]
        newer_parsed = newer_parsed + chaine1
        
    # Reduce every even entry "type start end" to the type alone
    for k in range(0, len(newer_parsed), 2):
        chaine2 = newer_parsed[k].split(" ")
        newer_parsed[k] =  chaine2[0]
     
    return newer_parsed

In [8]:
file_ann = parse_ann(all_files[4]) # make sure that the index corresponds to an ann file
file_ann

['Chemical',
 'amphetamine ',
 'Phenotype',
 'RGS2 expression ',
 'Gene_or_protein',
 'RGS2 ',
 'Chemical',
 'amphetamine ',
 'Pharmacodynamic_phenotype',
 'effects of repeated amphetamine']

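We drop every annotation whose text contains a space and also includes the text of another annotation. In the example above, this removes `RGS2 expression` (it includes `RGS2`) and both copies of `amphetamine`. The remaining pairs are reordered as text, then label: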

In [9]:
def rmv_occ(file_ann):
    # file_ann alternates labels (even indices) and entity texts (odd indices)
    final_ann=[]
    for l in range(1,len(file_ann),2):
        if " " in file_ann[l]:
            # Drop this entity if the text of another entity occurs inside it
            boolean=False
            for m in range(1,len(file_ann),2):
                if file_ann[m] in file_ann[l] and m!=l:
                    boolean=True
            if not boolean:
                final_ann.append(file_ann[l].strip())
                final_ann.append(file_ann[l-1].strip())
        else:
            final_ann.append(file_ann[l].strip())
            final_ann.append(file_ann[l-1].strip())
    return final_ann

In [10]:
final_ann = rmv_occ(file_ann)
final_ann

['RGS2',
 'Gene_or_protein',
 'effects of repeated amphetamine',
 'Pharmacodynamic_phenotype']
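
To see why only two of the five annotations remain, the sketch below restates the filter in a more compact form and runs it on the list printed earlier. The function name `filter_annotations` is hypothetical; it is meant as an equivalent reading of the filter, not as a replacement for it.

```python
def filter_annotations(flat):
    """flat alternates label, text, label, text, ... as returned by parse_ann."""
    labels = flat[0::2]
    texts = flat[1::2]
    kept = []
    for i, text in enumerate(texts):
        # Does the text of any *other* annotation occur inside this one?
        contains_other = any(other in text
                             for j, other in enumerate(texts) if j != i)
        if " " in text and contains_other:
            continue
        kept += [text.strip(), labels[i].strip()]
    return kept

sample = ['Chemical', 'amphetamine ',
          'Phenotype', 'RGS2 expression ',
          'Gene_or_protein', 'RGS2 ',
          'Chemical', 'amphetamine ',
          'Pharmacodynamic_phenotype', 'effects of repeated amphetamine']
print(filter_annotations(sample))
```

The trailing space matters here: `'amphetamine '` counts as containing a space, so its two identical copies eliminate each other, while `'effects of repeated amphetamine'` is kept because `'amphetamine '` (with the space) does not occur inside it.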

In [17]:
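def create_file(file, final_ann, phrase):
    # file: open, writable file object; final_ann: flat list [text, label, ...]; phrase: list of tokens
    n=0
    while n<len(phrase):
        o=0
        boolean1=True  # True while no annotation has matched at position n
        while o<len(final_ann) and boolean1:
            if " " in final_ann[o]:
                # Multi-token annotation: compare it with the tokens starting at position n
                x=final_ann[o].split(" ")
                boolean2=n+len(x)<=len(phrase)  # no match if it would run past the end of the sentence
                q=0
                while boolean2 and q<len(x):
                    
                    if x[q]==phrase[n+q]:
                        q+=1
                    else:
                        boolean2=False
                if boolean2:
                    file.write(phrase[n]+" B-"+final_ann[o+1]+"\n")
                    for z in range(1,len(x)):
                        file.write(phrase[n+z]+" I-"+final_ann[o+1]+"\n")
                    n=n+z
                    boolean1=False
            else:
                # Single-token annotation
                if phrase[n]==final_ann[o]:
                    file.write(phrase[n]+" B-"+final_ann[o+1]+"\n")
                    boolean1=False
            o=o+2            
        if boolean1:
            file.write(phrase[n]+" O \n")
        n+=1
    # A blank line separates sentences
    file.write("\n")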
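
The same tagging idea can be sketched with list slices, which makes the matching step easier to read and cannot index past the end of the sentence. The function `tag_tokens` is hypothetical and returns `(token, tag)` pairs instead of writing to a file; the sentence is made up, while the annotations are the ones obtained above.

```python
def tag_tokens(tokens, annotations):
    """annotations is a flat list [text, label, text, label, ...]."""
    tagged = []
    n = 0
    while n < len(tokens):
        for i in range(0, len(annotations), 2):
            words = annotations[i].split(" ")
            # Compare the annotation with the tokens starting at position n
            if tokens[n:n + len(words)] == words:
                label = annotations[i + 1]
                tagged.append((tokens[n], "B-" + label))
                tagged += [(tok, "I-" + label)
                           for tok in tokens[n + 1:n + len(words)]]
                n += len(words)
                break
        else:
            # No annotation starts at position n
            tagged.append((tokens[n], "O"))
            n += 1
    return tagged

# Hypothetical sentence; annotations taken from the rmv_occ output above.
tokens = "the effects of repeated amphetamine on RGS2 expression".split(" ")
annotations = ['RGS2', 'Gene_or_protein',
               'effects of repeated amphetamine', 'Pharmacodynamic_phenotype']
for token, tag in tag_tokens(tokens, annotations):
    print(token, tag)
```

As in `create_file`, the first annotation that matches at a position wins, and matching is purely textual: character offsets from the `ann` file are not used.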

In [20]:
import sys
create_file(sys.stdout, final_ann, phrase) # the first argument must be a writable file object, not the list of file names

In [21]:
def transf_total(tab_files):
    # tab_files must be sorted so that each ann file is immediately followed by its txt file
    n =  len(tab_files)
    with open("data_iob.tsv","w") as file:
        file.write("-DOCSTART- O \n\n")
        
        for s in range(0,n-1,2) :
            phrase = parse_txt(tab_files[s+1])
            file_ann = parse_ann(tab_files[s])
            final_ann = rmv_occ(file_ann)
            create_file(file,final_ann,phrase)

In [22]:
transf_total(all_files)
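
To inspect the result, the generated text can be read back into sentences. The sketch below does this on a short in-memory string that follows the layout written above (a `-DOCSTART-` line, one `token tag` pair per line, a blank line between sentences); the helper `read_iob` and the sample content are hypothetical.

```python
def read_iob(text):
    """Split IOB text into sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is the last space-separated field on the line
        token, tag = line.rsplit(" ", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

sample = "-DOCSTART- O \n\nRGS2 B-Gene_or_protein\nexpression O \n\n"
for sentence in read_iob(sample):
    print(sentence)
```

On the real output one would pass the contents of `data_iob.tsv` instead of `sample`. A line without any space would raise a `ValueError` here, which can help spot malformed tokens.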

---
Back to [top](#top).