# Span classification on the GeoEDdA dataset with CamemBERT

This notebook shows how to use our fine-tuned CamemBERT model for span classification on the GeoEDdA dataset.

The model is available on HuggingFace: 
https://huggingface.co/GEODE/camembert-base-edda-span-classification

It has been fine-tuned for token classification on the GeoEDdA dataset, which is also available on HuggingFace:
https://huggingface.co/datasets/GEODE/GeoEDdA

For more details about the dataset and tagset used, see: https://github.com/GEODE-project/ner-spancat-edda


## Import Python libraries

In [71]:
from transformers import pipeline
import torch
from datasets import load_dataset
import pandas as pd
from tqdm import tqdm

In [2]:
device = torch.device("mps" if torch.backends.mps.is_available() else ("cuda" if torch.cuda.is_available() else "cpu"))

## Load the model

* Load the [GEODE/camembert-base-edda-span-classification](https://huggingface.co/GEODE/camembert-base-edda-span-classification) fine-tuned model for token classification:

In [3]:
pipe = pipeline("token-classification", model="GEODE/camembert-base-edda-span-classification", aggregation_strategy="simple", device=device)

## Load the dataset

The EDdA dataset is provided by ARTFL and is not fully available online.
To test the pipeline, we use the GeoEDdA dataset, a subset of the EDdA dataset that is available on HuggingFace: https://huggingface.co/datasets/GEODE/GeoEDdA

* Load the [GEODE/GeoEDdA](https://huggingface.co/datasets/GEODE/GeoEDdA) dataset:

In [22]:
dataset = load_dataset("GEODE/GeoEDdA")

# Merge all dataset splits into a single dataframe
dfs = []
for key in dataset.keys():
    dfs.append(pd.DataFrame({'text': dataset[key]['text'], 'meta': dataset[key]['meta']}))
df = pd.concat(dfs, ignore_index=True)

# Expand the metadata fields into their own columns
df['head'] = df['meta'].apply(lambda x: x['head'])
df['volume'] = df['meta'].apply(lambda x: x['volume'])
df['article'] = df['meta'].apply(lambda x: x['article'])
df['paragraph'] = df['meta'].apply(lambda x: x['paragraph'])

df = df.drop(columns=['meta'])

df.head()

,text,head,volume,article,paragraph
0,"ILLESCAS, (Géog.) petite ville d'Espagne, dans...",ILLESCAS,8,2637,1
1,"MULHAUSEN, (Géog.) ville impériale d'Allemagne...",MULHAUSEN,10,3648,1
2,"* ADDA, riviere de Suisse & d'Italie, qui a sa...",ADDA,1,763,1
3,"SINTRA ou CINTRA, (Géog. mod.) montagne de Por...",SINTRA ou CINTRA,15,1108,1
4,"* ACHSTEDE, ou AKSTEDE, s. petite Ville d'Alle...","ACHSTEDE, ou AKSTEDE",1,603,1


## Test the NER model

In [23]:
pipe(df.iloc[0,0])

[{'entity_group': 'Head',
  'score': 0.9918164,
  'word': 'ILLESCAS',
  'start': 0,
  'end': 8},
 {'entity_group': 'Domain_mark',
  'score': 0.82838464,
  'word': 'Géog.',
  'start': 11,
  'end': 16},
 {'entity_group': 'NC_Spatial',
  'score': 0.9903073,
  'word': 'petite ville',
  'start': 18,
  'end': 30},
 {'entity_group': 'NP_Spatial',
  'score': 0.9912714,
  'word': 'Espagne',
  'start': 33,
  'end': 40},
 {'entity_group': 'Relation',
  'score': 0.987658,
  'word': 'dans',
  'start': 42,
  'end': 46},
 {'entity_group': 'NP_Spatial',
  'score': 0.9903284,
  'word': 'la nouvelle Castille',
  'start': 47,
  'end': 67},
 {'entity_group': 'Relation',
  'score': 0.98800606,
  'word': 'à six lieues au sud de',
  'start': 69,
  'end': 91},
 {'entity_group': 'NP_Spatial',
  'score': 0.99158317,
  'word': 'Madrid',
  'start': 92,
  'end': 98}]
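
To make the `start` and `end` fields concrete, here is a minimal sketch that does not call the model. It assumes that the offsets are character positions in the input string with an exclusive `end`, and it uses only the text prefix shown in the dataframe preview together with the first five spans copied from the output above. The helper `span_text` is a hypothetical name introduced for illustration.

```python
# Prefix of the first article, as displayed in the dataframe preview.
text = "ILLESCAS, (Géog.) petite ville d'Espagne, dans"

# First five spans copied from the pipeline output (scores omitted).
spans = [
    {"entity_group": "Head", "word": "ILLESCAS", "start": 0, "end": 8},
    {"entity_group": "Domain_mark", "word": "Géog.", "start": 11, "end": 16},
    {"entity_group": "NC_Spatial", "word": "petite ville", "start": 18, "end": 30},
    {"entity_group": "NP_Spatial", "word": "Espagne", "start": 33, "end": 40},
    {"entity_group": "Relation", "word": "dans", "start": 42, "end": 46},
]


def span_text(text, span):
    """Return the substring of `text` covered by a span (end is exclusive)."""
    return text[span["start"]:span["end"]]


# Print each label next to the text recovered from its offsets.
for span in spans:
    print(span["entity_group"], repr(span_text(text, span)))
```

Slicing with the offsets is useful when the exact original characters are needed, since `word` is rebuilt from tokens and may not always match the source text character for character.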

In [None]:
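def extract_entities(content, tag=None):
    # Return all spans, or only those labelled `tag` when a tag is given
    spans = pipe(content)
    return [d for d in spans if d['entity_group'] == tag] if tag else spans


def extract_entities_content(content, tag):
    # Join the text of all spans labelled `tag` into a comma-separated string
    return ', '.join([e['word'] for e in extract_entities(content, tag)])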

In [50]:
display(extract_entities(df.iloc[0,0], 'NP_Spatial'))

[{'entity_group': 'NP_Spatial',
  'score': 0.9912714,
  'word': 'Espagne',
  'start': 33,
  'end': 40},
 {'entity_group': 'NP_Spatial',
  'score': 0.9903284,
  'word': 'la nouvelle Castille',
  'start': 47,
  'end': 67},
 {'entity_group': 'NP_Spatial',
  'score': 0.99158317,
  'word': 'Madrid',
  'start': 92,
  'end': 98}]

In [49]:
display(extract_entities_content(df.iloc[0,0], 'NP_Spatial'))

'Espagne, la nouvelle Castille, Madrid'
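
Filtering by tag can be combined with a confidence threshold. The sketch below is one possible way to do it on the list of dictionaries returned by the pipeline; `group_words_by_label` and the `0.9` threshold are illustrative assumptions, not part of the original notebook. The sample spans are copied from the output shown earlier.

```python
from collections import defaultdict


def group_words_by_label(spans, min_score=0.0):
    """Group span texts by label, keeping spans with score >= min_score."""
    grouped = defaultdict(list)
    for span in spans:
        if span["score"] >= min_score:
            grouped[span["entity_group"]].append(span["word"])
    return dict(grouped)


# Four spans copied from the pipeline output for the first article.
sample_spans = [
    {"entity_group": "Head", "score": 0.9918164, "word": "ILLESCAS"},
    {"entity_group": "Domain_mark", "score": 0.82838464, "word": "Géog."},
    {"entity_group": "NP_Spatial", "score": 0.9912714, "word": "Espagne"},
    {"entity_group": "NP_Spatial", "score": 0.99158317, "word": "Madrid"},
]

# The arbitrary 0.9 threshold drops the lower-scored Domain_mark span.
grouped = group_words_by_label(sample_spans, min_score=0.9)
print(grouped)
```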

## Add latlong to the dataframe

In [51]:
df['Latlong'] = df['text'].apply(lambda x: extract_entities_content(x, 'Latlong'))
df.head()


,text,head,volume,article,paragraph,Latlong
0,"ILLESCAS, (Géog.) petite ville d'Espagne, dans...",ILLESCAS,8,2637,1,
1,"MULHAUSEN, (Géog.) ville impériale d'Allemagne...",MULHAUSEN,10,3648,1,Long. 28. 14. lat. 51. 13
2,"* ADDA, riviere de Suisse & d'Italie, qui a sa...",ADDA,1,763,1,
3,"SINTRA ou CINTRA, (Géog. mod.) montagne de Por...",SINTRA ou CINTRA,15,1108,1,
4,"* ACHSTEDE, ou AKSTEDE, s. petite Ville d'Alle...","ACHSTEDE, ou AKSTEDE",1,603,1,


In [60]:
# Count the rows whose Latlong column is not empty
df[df['Latlong'] != ''].shape


(673, 6)
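
The `Latlong` values are plain strings such as `Long. 28. 14. lat. 51. 13`. As a rough sketch, they could be turned into numbers as follows. This assumes that each coordinate is written as degrees followed by minutes and that longitude comes first; other formats in the dataset would not match. The sketch does not handle hemispheres or the reference meridian used by the source. `LATLONG_RE` and `parse_latlong` are hypothetical names.

```python
import re

# Hypothetical pattern for strings shaped like "Long. 28. 14. lat. 51. 13".
LATLONG_RE = re.compile(
    r"long\.\s*(\d+)\.\s*(\d+)\.?\s*lat\.\s*(\d+)\.\s*(\d+)",
    re.IGNORECASE,
)


def parse_latlong(value):
    """Return (longitude, latitude) in decimal degrees, or None if no match."""
    match = LATLONG_RE.search(value)
    if match is None:
        return None
    lon_deg, lon_min, lat_deg, lat_min = map(int, match.groups())
    # Assumption: the second number of each pair is minutes of arc.
    return (lon_deg + lon_min / 60, lat_deg + lat_min / 60)


coords = parse_latlong("Long. 28. 14. lat. 51. 13")
print(coords)
```

Rows with an empty `Latlong` string return `None`, so the function can be applied to the whole column without filtering first.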

## Get the list of all NP_Spatial entities (to see which ones do not have entries)

In [88]:
def extract_entities_asdic(content, dic=None, tags=None):
    # Start from a fresh dictionary unless one is passed in: a mutable
    # default argument would be shared across calls
    if dic is None:
        dic = {}
    for d in pipe(content):
        # Keep every label when no tag list is given
        if not tags or d['entity_group'] in tags:
            dic.setdefault(d['entity_group'], set()).add(d['word'])
    return dic

In [89]:
extract_entities_asdic(df.iloc[0,0], tags=['NP_Spatial', 'NC_Spatial'])

{'NC_Spatial': {'petite ville'},
 'NP_Spatial': {'Espagne', 'Madrid', 'la nouvelle Castille'}}
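
When a function accumulates results into a dictionary or set that can also be omitted by the caller, it helps to remember that Python evaluates default arguments once, at definition time. The following standalone sketch (with hypothetical helpers `add_shared` and `add_fresh`, unrelated to the pipeline) illustrates the difference between a mutable default and a `None` default.

```python
def add_shared(word, seen=set()):
    # The default set is created once and reused by every call that omits `seen`.
    seen.add(word)
    return seen


def add_fresh(word, seen=None):
    # A new set is created on each call that omits `seen`.
    if seen is None:
        seen = set()
    seen.add(word)
    return seen


add_shared("Espagne")
shared_result = add_shared("Madrid")  # still contains "Espagne"

add_fresh("Espagne")
fresh_result = add_fresh("Madrid")  # contains only "Madrid"

print(shared_result, fresh_result)
```

Passing the accumulator explicitly on every call, as in a loop over articles, avoids the issue in either style.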

In [94]:
d = {}
for i in tqdm(range(0, 100)):
    d = extract_entities_asdic(df.iloc[i,0], d, tags=['NP_Spatial'])

100%|██████████| 100/100 [00:03<00:00, 29.82it/s]


In [97]:
d['NP_Spatial']

{"'Afrique",
 "'Allemagne",
 "'Eysenach",
 'Abacoa',
 'Aberdéen',
 'Acambou',
 'Adama',
 'Afrique',
 'Alcantara',
 'Alger',
 'Allemagne',
 'Angleterre',
 'Antilles',
 'Argenton',
 'Arles',
 'Armançon',
 'Asie',
 'Asphastite',
 'Augia Rheni',
 'Aujon',
 'Autriche',
 'Avellino',
 'Bahama',
 'Bala',
 'Barbarie',
 'Barcelone',
 'Baviere',
 'Bec-Sangil',
 'Belgrade',
 'Benfordbridge',
 'Bennavenna',
 'Bennones',
 'Besançon',
 'Bimini',
 'Biopio',
 'Boristhène',
 'Bourg',
 'Bourgogne',
 'Bragance',
 'Brandebourg',
 'Braulio',
 'Brem',
 'Bretagne',
 'Brixen',
 'Brunswic',
 'Buchorn',
 'Bursa',
 'Byce',
 'Bénevent',
 'Calabre',
 'Camlin',
 'Candie',
 "Capo d'Istria",
 'Carcassonne',
 'Carcinite',
 'Cardonne',
 'Casal',
 'Cassel',
 'Cerigo',
 'Champagne',
 'Charleroi',
 'Chester',
 'Chili',
 'Châlons',
 'Châtillon',
 'Cianus',
 'Cilicie',
 'Cium',
 'Clermont-sur-la-Dore',
 'Cleycester',
 'Comorin',
 'Constantinople',
 'Conventry',
 'Cordonéro',
 'Cracovie',
 'Crémone',
 'Cummerow',
 'Dago',
 'D
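
Several strings in the set above start with an apostrophe (for example `'Afrique` next to `Afrique`), presumably a residue of elided forms such as `d'`. Before comparing the extracted names with a list of entries, they could be normalized. The sketch below is one possible approach; `normalize_place_name` and the sample values are illustrative assumptions, and a name that legitimately begins with an apostrophe would be altered by it.

```python
def normalize_place_name(name):
    """Strip surrounding whitespace and leading apostrophes (straight or curly)."""
    return name.strip().lstrip("'\u2019").strip()


# Small sample taken from the extracted set, plus a hypothetical entry list.
extracted = {"'Afrique", "Afrique", "'Allemagne", "Allemagne", "Alger"}
sample_entries = {"Afrique", "Allemagne"}

# Duplicates collapse once the leading apostrophes are removed.
normalized = {normalize_place_name(name) for name in extracted}
missing = normalized - sample_entries
print(sorted(normalized), missing)
```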

In [None]:
entries = ['Afrique', 'Allemagne']

# NP_Spatial entities that are not in the list of entries
d['NP_Spatial'].difference(entries)

{"'Afrique",
 "'Allemagne",
 "'Eysenach",
 'Abacoa',
 'Aberdéen',
 'Acambou',
 'Adama',
 'Alcantara',
 'Alger',
 'Angleterre',
 'Antilles',
 'Argenton',
 'Arles',
 'Armançon',
 'Asie',
 'Asphastite',
 'Augia Rheni',
 'Aujon',
 'Autriche',
 'Avellino',
 'Bahama',
 'Bala',
 'Barbarie',
 'Barcelone',
 'Baviere',
 'Bec-Sangil',
 'Belgrade',
 'Benfordbridge',
 'Bennavenna',
 'Bennones',
 'Besançon',
 'Bimini',
 'Biopio',
 'Boristhène',
 'Bourg',
 'Bourgogne',
 'Bragance',
 'Brandebourg',
 'Braulio',
 'Brem',
 'Bretagne',
 'Brixen',
 'Brunswic',
 'Buchorn',
 'Bursa',
 'Byce',
 'Bénevent',
 'Calabre',
 'Camlin',
 'Candie',
 "Capo d'Istria",
 'Carcassonne',
 'Carcinite',
 'Cardonne',
 'Casal',
 'Cassel',
 'Cerigo',
 'Champagne',
 'Charleroi',
 'Chester',
 'Chili',
 'Châlons',
 'Châtillon',
 'Cianus',
 'Cilicie',
 'Cium',
 'Clermont-sur-la-Dore',
 'Cleycester',
 'Comorin',
 'Constantinople',
 'Conventry',
 'Cordonéro',
 'Cracovie',
 'Crémone',
 'Cummerow',
 'Dago',
 'Damas',
 'Danemark',
 'Darb