Named entity extraction is a process in natural language processing (NLP) that identifies and categorizes specific elements within text, such as names of people, organizations, locations, dates, quantities, and other relevant entities. By analyzing unstructured text, named entity extraction helps transform raw data into structured information, allowing systems to better understand and organize text content.

In this notebook, we will address the task of recognizing entities in French biomedical texts from [The QUAERO French Medical Corpus](https://quaerofrenchmed.limsi.fr/), a dataset designed for Named Entity Recognition (NER) in the biomedical domain. This corpus consists of manually annotated MEDLINE titles and EMEA documents, with entity annotations based on concepts from the [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/).  
The annotations provide a standardized set of biomedical entities, making the dataset an essential resource for extracting and classifying medical terms in French text.

This corpus contains annotations for ten types of clinical entities: Anatomy (ANAT), Chemicals and Drugs (CHEM), Devices (DEVI), Disorders (DISO), Geographic Areas (GEOG), Living Beings (LIVB), Objects (OBJC), Phenomena (PHEN), Physiology (PHYS), and Procedures (PROC).

For this notebook, we will only use the MEDLINE texts. [MEDLINE](https://www.nlm.nih.gov/bsd/medline.html) is the premier bibliographic database of the U.S. National Library of Medicine® (NLM); it contains more than 25 million references to journal articles in the life sciences, with a concentration on biomedicine.

Let's show a sample annotation for a MEDLINE text:

**Sample MEDLINE title 1**

    *Chirurgie de la communication interauriculaire du type " sinus venosus " .*

**Sample MEDLINE title 1 annotations**

T1           PROC 0 9             Chirurgie


T2           DISO 16 46          communication interauriculaire

This means that the text between characters 0 and 9 is assigned the label **PROC** (= procedure). The token that corresponds to this text is “**Chirurgie**”.
The second annotation covers the text between characters 16 and 46 (the tokens “**communication interauriculaire**”) and is assigned the label **DISO** (= disorder).
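The offsets behave like Python slice bounds (start inclusive, end exclusive), which can be checked directly on the sample title. This is only an illustrative sketch; the variable names are not part of the corpus tooling.

```python
# Sample title and its two annotations, copied from the example above.
title = 'Chirurgie de la communication interauriculaire du type " sinus venosus " .'
annotations = [("PROC", 0, 9), ("DISO", 16, 46)]

# Slicing with (start, end) recovers the annotated text of each entity.
spans = [(label, title[start:end]) for label, start, end in annotations]
print(spans)
```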


Therefore, we want to train a classifier that can extract those text segments and assign them the correct label. We will use Conditional Random Fields (CRFs), a class of statistical models for structured prediction that belongs to the sequence modeling family. Whereas a discrete classifier predicts a label for a single sample without considering "neighboring" samples, a CRF can take context into account. CRFs encode known relationships between observations, construct consistent interpretations, and are often used for labeling or parsing sequential data.
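To make "taking context into account" concrete, here is a hedged sketch of how each token could be described to a CRF: one feature dictionary per token that also records its neighbors. The function name and the feature set are assumptions chosen for illustration, not the features used later in this project.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i] that also looks at its neighbours."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Context features: previous and next token, or sentence-boundary flags.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


tokens = "Chirurgie de la communication interauriculaire".split()
sentence_features = [token_features(tokens, i) for i in range(len(tokens))]
print(sentence_features[0])
```

A sequence of such dictionaries, paired with one label per token, is the usual shape of training data for CRF toolkits.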

The corpus contains three subdirectories: train, test and dev. For this notebook, we will use only the first one. It contains 1670 files, including 4 files with configuration and statistics. The remaining files are of two types: .txt files, which contain the text of the sentences, and annotation files (.ann), which contain information about the text segments, their types, etc., as explained above.
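Since every document comes as a .txt/.ann pair, a first sanity check is to list the identifiers that have both files. The sketch below uses a hypothetical helper, `list_document_ids`, and runs on a throwaway directory so that it works without the corpus.

```python
from pathlib import Path
import tempfile


def list_document_ids(directory):
    """Return the sorted ids of documents that have both a .txt and an .ann file."""
    directory = Path(directory)
    txt_ids = {p.stem for p in directory.glob("*.txt")}
    ann_ids = {p.stem for p in directory.glob("*.ann")}
    return sorted(txt_ids & ann_ids)


# Demo on a temporary directory; "orphan.txt" has no matching .ann file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("14448.txt", "14448.ann", "orphan.txt"):
        (Path(tmp) / name).write_text("", encoding="utf-8")
    ids = list_document_ids(tmp)
print(ids)
```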

This notebook covers the preprocessing of the input data.


In [None]:
# path to the data train set 
path_train = "C:\\Users\\lamia\\Desktop\\Extraction-NER-Recherche\\Analyses\\train"


In [None]:
import NER

In [None]:
# To read a file and obtain its content
data = NER.read_file(path_train,"14448",".txt")
data

["L' OMS planifie pour l' Europe l' application du processus des soins infirmiers . Compte rendu de la session du groupe technique d' experts en soins infirmiers et obstétricaux du Bureau régional de l' Europe de l' OMS , Nottingham , 14 - 17 décembre 1976\n"]
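The `NER` module is not listed in this notebook. Judging only from the call and its output (a list of lines), a reader could reproduce the behavior with something like the sketch below. The name `read_lines` and the UTF-8 encoding are assumptions.

```python
import os
import tempfile


def read_lines(directory, name, extension):
    """Read <directory>/<name><extension> and return its lines as a list."""
    path = os.path.join(directory, name + extension)
    # Assumption: the corpus files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        return handle.readlines()


# Demo on a temporary file so the sketch runs without the corpus.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo.ann"), "w", encoding="utf-8") as handle:
        handle.write("T1\tGEOG 24 30\tEurope\n")
    lines = read_lines(tmp, "demo", ".ann")
print(lines)
```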

In [None]:
import pprint

data = NER.read_file(path_train,"14448",".ann")
pprint.pprint(data)

['T1\tGEOG 24 30\tEurope\n',
 '#1\tAnnotatorNotes T1\tC0015176\n',
 'T2\tPHEN 49 58\tprocessus\n',
 '#2\tAnnotatorNotes T2\tC1522240\n',
 'T3\tPROC 63 79\tsoins infirmiers\n',
 '#3\tAnnotatorNotes T3\tC0028682\n',
 'T4\tLIVB 69 79\tinfirmiers\n',
 '#4\tAnnotatorNotes T4\tC0028676\n',
 'T5\tPROC 143 159\tsoins infirmiers\n',
 '#5\tAnnotatorNotes T5\tC0028682\n',
 'T6\tLIVB 149 159\tinfirmiers\n',
 '#6\tAnnotatorNotes T6\tC0028676\n',
 'T7\tGEOG 201 207\tEurope\n',
 '#7\tAnnotatorNotes T7\tC0015176\n']


Observe that the first file read was "14448.txt", while the second one was "14448.ann".

In [None]:
d = NER.ann_text2dict(data)
pprint.pprint(d)

{'T1': {'label': ['GEOG', '24', '30'], 'text': 'Europe'},
 'T2': {'label': ['PHEN', '49', '58'], 'text': 'processus'},
 'T3': {'label': ['PROC', '63', '79'], 'text': 'soins infirmiers'},
 'T4': {'label': ['LIVB', '69', '79'], 'text': 'infirmiers'},
 'T5': {'label': ['PROC', '143', '159'], 'text': 'soins infirmiers'},
 'T6': {'label': ['LIVB', '149', '159'], 'text': 'infirmiers'},
 'T7': {'label': ['GEOG', '201', '207'], 'text': 'Europe'}}
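The conversion above can be approximated as follows. This is a sketch inferred from the input lines and the printed dictionary, not the actual `NER.ann_text2dict` code; `ann_lines_to_dict` is a hypothetical name.

```python
def ann_lines_to_dict(lines):
    """Turn tab-separated annotation lines into {id: {'label': [...], 'text': ...}}."""
    result = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Keep entity annotations (ids starting with 'T'); skip notes ('#1', ...).
        if len(fields) < 3 or not fields[0].startswith("T"):
            continue
        ann_id, label_and_span, text = fields[0], fields[1], fields[2]
        # 'GEOG 24 30' becomes ['GEOG', '24', '30'] (offsets stay strings).
        result[ann_id] = {"label": label_and_span.split(), "text": text}
    return result


lines = [
    "T1\tGEOG 24 30\tEurope\n",
    "#1\tAnnotatorNotes T1\tC0015176\n",
    "T4\tLIVB 69 79\tinfirmiers\n",
]
parsed = ann_lines_to_dict(lines)
print(parsed)
```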


In [None]:
NER.collect_files(path_train,'train')

In [None]:
lnew = NER.ann_files2dict('train_ann',path_train,'train')

print("# of ann files",len(lnew))

# of ann files 833


One situation that we have not mentioned yet is that annotations can overlap, so that several labels are assigned to the same token. In this case, we keep only one of them and discard the others. For example, consider the following text:

> Prévalence des marqueurs des *virus des hépatites A* , B , C à La Réunion ( Hôpital sud et prison de Saint Pierre ).



with the following annotations:

T1           CHEM 15 24       marqueurs

T2           LIVB 29 34           virus

T3           DISO 39 50          hépatites A

T4           DISO 39 48;53 54             hépatites B

T5           DISO 39 48;57 58             hépatites C

T6           GEOG 61 71        La Réunion

T7           LIVB 29 48;57 58              virus des hépatites C

T8           LIVB 29 48;53 54              virus des hépatites B

T9           LIVB 29 50           virus des hépatites A

You can see that:
* annotation T2 identifies the word 'virus' (characters 29-34) as a Living Being (LIVB),
* annotation T9 identifies the segment 'virus des hépatites A' (characters 29-50) as a Living Being (LIVB),
* annotation T8 identifies the segment 'virus des hépatites B' (characters 29-48 and 53-54) as a Living Being (LIVB), and
* annotation T7 identifies the segment 'virus des hépatites C' (characters 29-48 and 57-58) as a Living Being (LIVB)

In such cases, we discard annotation T2, which is contained in the others, and keep T7, T8 and T9.

Let's try it with the annotation dictionary `d` obtained previously.

In [14]:
pprint.pprint(d)

{'T1': {'label': ['GEOG', '24', '30'], 'text': 'Europe'},
 'T2': {'label': ['PHEN', '49', '58'], 'text': 'processus'},
 'T3': {'label': ['PROC', '63', '79'], 'text': 'soins infirmiers'},
 'T4': {'label': ['LIVB', '69', '79'], 'text': 'infirmiers'},
 'T5': {'label': ['PROC', '143', '159'], 'text': 'soins infirmiers'},
 'T6': {'label': ['LIVB', '149', '159'], 'text': 'infirmiers'},
 'T7': {'label': ['GEOG', '201', '207'], 'text': 'Europe'}}


In [None]:
d1 = NER.remove_contained(d)
pprint.pprint(d1)

{'T1': {'label': ['GEOG', '24', '30'], 'text': 'Europe'},
 'T2': {'label': ['PHEN', '49', '58'], 'text': 'processus'},
 'T3': {'label': ['PROC', '63', '79'], 'text': 'soins infirmiers'},
 'T5': {'label': ['PROC', '143', '159'], 'text': 'soins infirmiers'},
 'T7': {'label': ['GEOG', '201', '207'], 'text': 'Europe'}}


We can see that segments T4 and T6 were removed because T4 was contained in T3 and T6 was contained in T5.
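One way to implement this filtering is sketched below. It is an assumption about the logic, not the actual `NER.remove_contained` code: an annotation is dropped when its span lies inside a strictly longer span. For spans with several fragments, the sketch compares only the first and last offsets.

```python
def drop_contained(annotations):
    """Drop annotations whose span lies inside a strictly longer annotation."""

    def span(entry):
        # First and last offsets; this also covers labels like ['DISO', '39', '48;53', '54'].
        return int(entry["label"][1]), int(entry["label"][-1])

    kept = {}
    for ann_id, entry in annotations.items():
        start, end = span(entry)
        contained = False
        for other_id, other in annotations.items():
            if other_id == ann_id:
                continue
            o_start, o_end = span(other)
            # Identical spans are both kept because neither is strictly longer.
            if o_start <= start and end <= o_end and (end - start) < (o_end - o_start):
                contained = True
                break
        if not contained:
            kept[ann_id] = entry
    return kept


d = {
    "T1": {"label": ["GEOG", "24", "30"], "text": "Europe"},
    "T2": {"label": ["PHEN", "49", "58"], "text": "processus"},
    "T3": {"label": ["PROC", "63", "79"], "text": "soins infirmiers"},
    "T4": {"label": ["LIVB", "69", "79"], "text": "infirmiers"},
    "T5": {"label": ["PROC", "143", "159"], "text": "soins infirmiers"},
    "T6": {"label": ["LIVB", "149", "159"], "text": "infirmiers"},
    "T7": {"label": ["GEOG", "201", "207"], "text": "Europe"},
}
kept = drop_contained(d)
print(list(kept))
```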

In [None]:
NER.count_non_continuous('train')

set train
Number of non continuous segments 13 % 0.43420173680694724
Total number of segments 2994


The output shows that non-contiguous segments are rare (13 out of 2,994, about 0.43%), which is why we decided to ignore them in this version.
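In the annotation files, a non-contiguous segment is recognizable by the `;` that separates its fragments (for example `39 48;53 54`). A count like the one above could be obtained along these lines; `count_discontinuous` is a hypothetical helper, not the `NER` function.

```python
def count_discontinuous(annotations):
    """Return (discontinuous, total, percent) for a dict of parsed annotations."""
    total = len(annotations)
    discontinuous = sum(
        1
        for entry in annotations.values()
        # A ';' inside the offsets marks a span made of several fragments.
        if any(";" in part for part in entry["label"][1:])
    )
    percent = 100 * discontinuous / total if total else 0.0
    return discontinuous, total, percent


sample = {
    "T3": {"label": ["DISO", "39", "50"], "text": "hépatites A"},
    "T4": {"label": ["DISO", "39", "48;53", "54"], "text": "hépatites B"},
}
stats = count_discontinuous(sample)
print(stats)
```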

The information that we need to train the classifier is contained in two independent structures: the .txt files and the annotation dictionaries. Let's now combine and simplify them.

The function "simple_dic" will be used to simplify the dictionary structure.

In [None]:
sdic = NER.simple_dic(d1)
pprint.pprint(sdic)

[{'label': 'GEOG', 'range': ['24', '30'], 'text': 'Europe'},
 {'label': 'PHEN', 'range': ['49', '58'], 'text': 'processus'},
 {'label': 'PROC', 'range': ['63', '79'], 'text': 'soins infirmiers'},
 {'label': 'PROC', 'range': ['143', '159'], 'text': 'soins infirmiers'},
 {'label': 'GEOG', 'range': ['201', '207'], 'text': 'Europe'}]
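The transformation amounts to dropping the annotation ids and splitting each `label` list into a label and a range. A minimal sketch, assuming the input has the shape printed earlier (`simplify` is a hypothetical name):

```python
def simplify(annotations):
    """Convert {id: {'label': [tag, start, end], 'text': ...}} into a flat list."""
    return [
        {
            "label": entry["label"][0],   # entity type, e.g. 'GEOG'
            "range": entry["label"][1:],  # offsets, kept as strings
            "text": entry["text"],
        }
        for entry in annotations.values()
    ]


simplified = simplify({"T1": {"label": ["GEOG", "24", "30"], "text": "Europe"}})
print(simplified)
```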


In [None]:
set = 'train'
NER.mix_txt_ann('train_txt',path_train,set)
lista = NER.load_pickle(set+'_txt_ann')
pprint.pprint(lista[0])

{'ann_dic': [{'label': 'PROC', 'range': ['0', '10'], 'text': 'Traitement'},
             {'label': 'DISO',
              'range': ['15', '36'],
              'text': 'métastases hépatiques'},
             {'label': 'DISO',
              'range': ['41', '60'],
              'text': 'cancers colorectaux'}],
 'txt': ['Traitement des métastases hépatiques des cancers colorectaux : '
         "jusqu' où aller ?\n"]}
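The intermediate results are stored as pickle files between steps. How the `NER` helpers build their file names is not shown here, so the sketch below takes explicit paths; the helper names and the `.pkl` file name are illustrative assumptions.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_pickle(path):
    """Load a pickled object from path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


record = {
    "ann_dic": [{"label": "PROC", "range": ["0", "10"], "text": "Traitement"}],
    "txt": ["Traitement des métastases hépatiques\n"],
}
# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_txt_ann.pkl")
    save_pickle([record], path)
    restored = load_pickle(path)
print(restored == [record])
```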


We can simplify this structure further by converting the list of dictionaries in the annotation part into a list of tuples. We also need to tag every segment of the text: some segments are already tagged, but others are not. We tag the latter as 'NONE', indicating that none of the entity labels applies.


In [None]:
set = 'train'
NER.complete_segments(set)
new1 = NER.load_pickle(set + '_txt_ann2')
pprint.pprint(new1[0])

{'ann_dic': [(0, 10, 'PROC', 'Traitement'),
             (11, 14, 'NONE', 'des'),
             (15, 36, 'DISO', 'métastases hépatiques'),
             (37, 40, 'NONE', 'des'),
             (41, 60, 'DISO', 'cancers colorectaux'),
             (61, 80, 'NONE', ": jusqu' où aller ?")],
 'txt': "Traitement des métastases hépatiques des cancers colorectaux : jusqu' "
        'où aller ?\n'}
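The gap filling can be sketched as a single pass over the annotations sorted by start offset. The text between two annotated segments becomes a 'NONE' segment, with surrounding whitespace trimmed and the offsets adjusted accordingly. This is an assumption about the logic that reproduces the output above, not the `NER.complete_segments` code.

```python
def fill_gaps(text, annotations):
    """Return (start, end, label, text) tuples covering annotated and unannotated parts."""
    spans = sorted(
        (int(a["range"][0]), int(a["range"][1]), a["label"], a["text"])
        for a in annotations
    )
    segments = []
    cursor = 0
    # A sentinel at the end of the text flushes the trailing gap.
    for start, end, label, seg_text in spans + [(len(text), len(text), None, "")]:
        gap = text[cursor:start]
        stripped = gap.strip()
        if stripped:
            gap_start = cursor + len(gap) - len(gap.lstrip())
            segments.append((gap_start, gap_start + len(stripped), "NONE", stripped))
        if label is not None:
            segments.append((start, end, label, seg_text))
        cursor = max(cursor, end)
    return segments


text = "Traitement des métastases hépatiques des cancers colorectaux : jusqu' où aller ?\n"
annotations = [
    {"label": "PROC", "range": ["0", "10"], "text": "Traitement"},
    {"label": "DISO", "range": ["15", "36"], "text": "métastases hépatiques"},
    {"label": "DISO", "range": ["41", "60"], "text": "cancers colorectaux"},
]
segments = fill_gaps(text, annotations)
for segment in segments:
    print(segment)
```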


With the help of the function "ldic2ltok_lab", we tokenize each text segment and assign to each token the tag of the segment it belongs to.

In [None]:
ls_tok_lab = NER.ldic2ltok_lab(new1)
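The output of this step is not printed in the notebook. As a rough idea of what it involves, the sketch below splits each segment on whitespace and pairs every token with its segment label. The whitespace tokenizer, the output shape (a list of `(token, label)` pairs) and the function name are all assumptions.

```python
def segments_to_token_labels(segments):
    """Split each (start, end, label, text) segment into (token, label) pairs."""
    pairs = []
    for _start, _end, label, seg_text in segments:
        # Every token inherits the label of the segment it comes from.
        pairs.extend((token, label) for token in seg_text.split())
    return pairs


segments = [
    (0, 10, "PROC", "Traitement"),
    (11, 14, "NONE", "des"),
    (15, 36, "DISO", "métastases hépatiques"),
]
token_labels = segments_to_token_labels(segments)
print(token_labels)
```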

In [None]:
NER.save_pickle(ls_tok_lab,set + '_txt_ann3')

In [None]:

# Example usage
file_path = "C:/Users/lamia/Desktop/Extraction-NER-Recherche/P0001/PDF/analyse.pdf"
if NER.is_pdf(file_path):
    print("The file is a native PDF.")
else:
    print("The file is not a native PDF.")


The file is a native PDF.
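What `NER.is_pdf` checks internally is not shown here. As a loosely related sketch, a cheap first test for any PDF is the file signature: PDF files start with the bytes `%PDF-`. This check only confirms that the file is a PDF at all; it says nothing about whether the pages contain a text layer or scanned images. The helper name is hypothetical.

```python
import os
import tempfile


def has_pdf_signature(path):
    """Return True if the file starts with the '%PDF-' signature."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"


# Demo on two temporary files: one with a PDF header, one plain text.
with tempfile.TemporaryDirectory() as tmp:
    pdf_like = os.path.join(tmp, "a.pdf")
    text_file = os.path.join(tmp, "b.txt")
    with open(pdf_like, "wb") as handle:
        handle.write(b"%PDF-1.7\n")
    with open(text_file, "wb") as handle:
        handle.write(b"plain text\n")
    results = (has_pdf_signature(pdf_like), has_pdf_signature(text_file))
print(results)
```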


In [None]:

# Example usage
pdf_path = "C:/Users/lamia/Desktop/Extraction-NER-Recherche/P0001/PDF/FichePatient.pdf"
if NER.is_scanned_pdf(pdf_path):
    print("The PDF seems to contain scanned images.")
else:
    print("The PDF does not seem to contain scanned images.")