<a href="https://colab.research.google.com/github/sinairusinek/dariah-campus/blob/main/DARIAH_SOC_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Standoff Converter Tutorial

## Connect to Google Drive

In [2]:
from google.colab import drive
from google.colab import files
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [3]:
uploaded = files.upload()
filename = next(iter(uploaded))

Saving EisensteinSamuelBenShimshon.xml to EisensteinSamuelBenShimshon.xml


## Install spaCy models and StandoffConverter

In [4]:
!python -m spacy download de_core_news_lg
!python -m spacy download en_core_web_trf
!pip install -q standoffconverter
!pip install -q spacy-transformers

Collecting de-core-news-lg==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_lg-3.7.0/de_core_news_lg-3.7.0-py3-none-any.whl (567.8 MB)
Installing collected packages: de-core-news-lg
Successfully installed de-core-news-lg-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-trf==3.7.3
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.7.3/en_core_web_trf-3.7.3-py3-none-any.whl (457.4 MB)

In [5]:
ner_dict = {'en': 'en_core_web_trf',
            'de': 'de_core_news_lg'}

In [6]:
lang = None
# Keep asking until a supported language code (or `q`) is entered
while lang not in ner_dict:
    lang = input('Choose language (en/de) or `q` to quit:')
    if lang == 'q':
        break
    elif lang not in ner_dict:
        print('Input should be either en or de (or q to quit)')

if lang not in ner_dict:
    print('No language was chosen')
else:
    print(lang, 'language selected')

Choose language (en/de) or `q` to quit:en
en language selected
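
The interactive loop above is awkward to reuse when the notebook is run non-interactively. As a minimal sketch, the same validation can be wrapped in a function that takes the language code as an argument. `pick_model` and `NER_MODELS` are hypothetical names introduced here for illustration, not part of the notebook.

```python
# Hypothetical helper: validates a language code against the model table
# without calling input(). The table mirrors `ner_dict` from the cell above.
NER_MODELS = {"en": "en_core_web_trf", "de": "de_core_news_lg"}


def pick_model(lang, models=NER_MODELS):
    """Return the spaCy model name for `lang`, or None if it is not supported."""
    if lang is None:
        return None
    return models.get(lang.strip().lower())


print(pick_model("en"))  # en_core_web_trf
print(pick_model("fr"))  # None
```

Returning `None` for an unsupported code leaves it to the caller to decide whether to stop or to ask again.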


# Tutorial Start

## Import libraries and set constants

In [7]:
from lxml import etree
from standoffconverter import Standoff, View

import spacy
import spacy_transformers

import pandas as pd



In [8]:
tags_dict = {"LOC": {"tag": "placeName", "attr":{"type": "loc"}},
             "GPE": {"tag": "placeName", "attr":{"type": "gpe"}},
             "PERSON": {"tag": "persName", "attr":{}},
             "ORG": {"tag": "orgName", "attr":{}},
             "DATE": {"tag": "date", "attr":{}},
             "WORK": {"tag": "name", "attr":{"type": "work"}},
             "MISC": {"tag": "name", "attr":{"type": "misc"}},
             "FAC": {"tag": "orgName", "attr":{"type": "theater"}}}
             #"NORP": {"tag": "name", "attr":{"type": "nationality"}},
             #"ORDINAL": {"tag": "num", "attr":{"type": "ordinal"}},
             #"CARDINAL": {"tag": "num", "attr":{"type": "cardinal"}},
             #"MONEY": {"tag": "num", "attr":{"type": "money"}},
             #"PERCENT": {"tag": "num", "attr":{"type": "percent"}},
             #"LANGUAGE": {"tag": "language", "attr":{}},#not working in TEI! reconsider
             #"EVENT": {"tag": "event", "attr":{"type": "?"}}}
             #"PRODUCT": {"tag": "?", "attr":{"type": "?"}}}
             #"LAW": {"tag": "?", "attr":{"type": "?"}}}
             #"TIME": {"tag": "?", "attr":{"type": "?"}}}
             #"MONEY": {"tag": "?", "attr":{"type": "?"}}}
             #"QUANTITY": {"tag": "?", "attr":{"type": "?"}}}
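
The mapping is keyed by the label strings the model emits, so a label with a slightly different spelling is treated as unmapped. In the run shown further down, the English model reports `WORK_OF_ART`, which does not match the `WORK` key. One possible way to handle this, sketched here under the assumption that `WORK_OF_ART` is meant to share the `WORK` entry, is a small alias table in front of the lookup. `LABEL_ALIASES` and `tei_mapping` are hypothetical names, and the dictionary is a trimmed copy used only to keep the example self-contained.

```python
# A trimmed copy of the mapping above, for illustration only.
tags_dict = {
    "GPE": {"tag": "placeName", "attr": {"type": "gpe"}},
    "PERSON": {"tag": "persName", "attr": {}},
    "WORK": {"tag": "name", "attr": {"type": "work"}},
}

# Assumption: model labels that should reuse an existing entry.
LABEL_ALIASES = {"WORK_OF_ART": "WORK"}


def tei_mapping(label, mapping=tags_dict, aliases=LABEL_ALIASES):
    """Return the TEI tag/attribute entry for a model label, or None if unmapped."""
    key = aliases.get(label, label)
    return mapping.get(key)


print(tei_mapping("WORK_OF_ART"))  # the entry stored under "WORK"
print(tei_mapping("LANGUAGE"))     # None
```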

## Set input and output file paths

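In [None]:

# Optional: uncomment one line to process a file that is already in /content
# instead of the file uploaded above.
#filename = "Kalman-Yuvelier.xml"
#filename = "IL-MTFN-001-G-F-0353-18.xml"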

In [9]:
# Both paths are derived from the name of the provided file
xml_path = '/content/' + filename
so_ner_result = '/content/soc_ner_results_' + filename
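
String concatenation works here, but it is easy to lose a separator between the prefix and the file name. A minimal sketch with `pathlib` keeps the result next to the input file. `result_path` and the `soc_ner_results_` prefix (with its trailing underscore) are assumptions made for this example, and `PurePosixPath` is used because Colab paths are POSIX paths.

```python
from pathlib import PurePosixPath


def result_path(xml_path, prefix="soc_ner_results_"):
    """Place the result next to the input file, with `prefix` added to its name."""
    source = PurePosixPath(xml_path)
    return source.with_name(prefix + source.name)


print(result_path("/content/EisensteinSamuelBenShimshon.xml"))
# /content/soc_ner_results_EisensteinSamuelBenShimshon.xml
```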

## Load XML-TEI and parse it with Standoff

In [10]:
# parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(xml_path)
namespaces = {'tei': 'http://www.tei-c.org/ns/1.0'}
so = Standoff(tree, namespaces)

In [11]:
so.plain[:1000]

'\n          ITINERARY OF RABBI SAMUEL BEN SAMSON IN 1210\nR. Samuel ben Samson made a pilgrimage to Palestine in 1210 in company of the distinguished R. Jonathan ha Cohen for whom Samuel ibn Tibbon translated (from the Arabic into Hebrew) Maimonides’ Guide of the Perplexed and Judah al Harizi translated Maimonides’ Commentary on the Mishna. He is described in the traveller’s account of the pilgrimage as “ Resh Gola ”, the head of the captivity. The text is that of the Parma MS. translated by Carmoly in his Itineraires, pages 127 to 136.\nThe Traveller at the end of his narrative says that he carried a letter from the King of Jerusalem, i.e. John de Brienne and it is suggested that this letter recommended the emigration of Jews to Palestine and resulted in the famous pilgrimage of three hundred French and English Rabbis in the following year.\nThe Itinerary begins as follows :—\nThese words deserve to be written in order that we might know the places of the graves of our fore-fathers b
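
To get a feel for where this string comes from, the sketch below joins all text nodes under the TEI `<text>` element using only the standard library. It assumes that `so.plain` is roughly this concatenation, which matches the leading newline and indentation in the output above. It is an illustration on a made-up sample, not the library's implementation.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny stand-in for a TEI file; the real document is much longer.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/>'
    "<text>\n  <front><ab>ITINERARY OF RABBI SAMUEL BEN SAMSON IN 1210</ab></front></text></TEI>"
)

root = ET.fromstring(sample)
text_el = root.find(".//tei:text", TEI_NS)

# Concatenate every text node below <text>, in document order.
plain_text = "".join(text_el.itertext())
print(repr(plain_text))
```

The whitespace between `<text>` and `<front>` ends up in the plain string, which is why the preprocessing step below is useful before running NER.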

## Preprocessing

### Clear tabs and newlines

In [12]:
view = View(so).shrink_whitespace()
plain = view.get_plain()
plain[:1000]

create view: 100%|██████████| 223/223 [00:00<00:00, 469.65it/s]
shrink whitespace: 100%|██████████| 13303/13303 [00:01<00:00, 8267.37it/s]


' ITINERARY OF RABBI SAMUEL BEN SAMSON IN 1210\nR. Samuel ben Samson made a pilgrimage to Palestine in 1210 in company of the distinguished R. Jonathan ha Cohen for whom Samuel ibn Tibbon translated (from the Arabic into Hebrew) Maimonides’ Guide of the Perplexed and Judah al Harizi translated Maimonides’ Commentary on the Mishna. He is described in the traveller’s account of the pilgrimage as “ Resh Gola ”, the head of the captivity. The text is that of the Parma MS. translated by Carmoly in his Itineraires, pages 127 to 136.\nThe Traveller at the end of his narrative says that he carried a letter from the King of Jerusalem, i.e. John de Brienne and it is suggested that this letter recommended the emigration of Jews to Palestine and resulted in the famous pilgrimage of three hundred French and English Rabbis in the following year.\nThe Itinerary begins as follows :—\nThese words deserve to be written in order that we might know the places of the graves of our fore-fathers by whose mer
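
Because the view's text is shorter than `so.plain`, character offsets computed on the view no longer line up with the original, so they have to be mapped back before annotating. The sketch below shows the general idea in plain Python: collapse whitespace while recording, for each output character, its index in the input. It is not standoffconverter's implementation, and its rule (every whitespace run becomes one space) is an assumption. The output above shows that the library keeps newlines between paragraphs, so its actual rule differs.

```python
import re


def shrink_with_offsets(text):
    """Collapse each whitespace run to one space and record, for every
    character of the result, its index in the original string."""
    out_chars = []
    offsets = []
    for match in re.finditer(r"\s+|\S", text):
        if match.group().isspace():
            out_chars.append(" ")
        else:
            out_chars.append(match.group())
        offsets.append(match.start())
    return "".join(out_chars), offsets


original = "\n     ITINERARY  OF\tRABBI"
shrunk, offsets = shrink_with_offsets(original)

start = shrunk.index("RABBI")
print(repr(shrunk))                 # ' ITINERARY OF RABBI'
print(start, "->", offsets[start])  # 14 -> 20
```

Position 14 in the shrunk text corresponds to position 20 in the original, the same kind of translation the notebook later asks the view for.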

## NER

### Choose language model

In [13]:
nlp = spacy.load(ner_dict[lang])

### Process text for NER

In [14]:
doc = nlp(plain)

In [15]:
doc_results = {'entity_name': [entity.text for entity in doc.ents],
               'entity_label': [entity.label_ for entity in doc.ents]}
ner_df = pd.DataFrame(doc_results).set_index('entity_name')

ner_df

| entity_name | entity_label |
|---|---|
| RABBI SAMUEL BEN SAMSON | PERSON |
| 1210 | DATE |
| Samuel ben Samson | PERSON |
| Palestine | GPE |
| 1210 | DATE |
| ... | ... |
| Asher | PERSON |
| Jerusalem | GPE |
| Galilee | GPE |
| the year 970 | DATE |

## Annotation

### NER Inline annotation
#### Issue with `add_inline()`
Adding inline tags can fail with `ValueError: no unique context found`.\
The cause of this error is not yet understood.

The current workaround is to surround the `add_inline()` call with a `try/except` block, so that a failing entity is reported and the loop continues with the next one.

In [16]:
for i, ent in enumerate(doc.ents):
    # Translate character offsets in the view into positions in the Standoff table
    start_ind = view.get_table_pos(ent.start_char)
    end_ind = view.get_table_pos(ent.end_char)
    label = ent.label_

    print(f'{i} {start_ind=}\t{end_ind=}\t{label=}')

    # Skip entity types that have no TEI mapping
    if label not in tags_dict:
        print(label, '- not in dictionary -> IGNORED')
        continue

    try:
        so.add_inline(
            begin=start_ind,
            end=end_ind,
            tag=tags_dict[label]['tag'],
            depth=None,
            attrib=tags_dict[label]['attr']
        )
    except Exception as e:
        # Workaround: report the failure and move on to the next entity
        print(e)

0 start_ind=24	end_ind=47	label='PERSON'
1 start_ind=51	end_ind=55	label='DATE'
2 start_ind=59	end_ind=76	label='PERSON'
3 start_ind=98	end_ind=107	label='GPE'
4 start_ind=111	end_ind=115	label='DATE'
5 start_ind=148	end_ind=168	label='PERSON'
6 start_ind=178	end_ind=195	label='PERSON'
7 start_ind=217	end_ind=223	label='LANGUAGE'
LANGUAGE - not in dictionary -> IGNORED
8 start_ind=229	end_ind=235	label='LANGUAGE'
LANGUAGE - not in dictionary -> IGNORED
9 start_ind=237	end_ind=271	label='WORK_OF_ART'
WORK_OF_ART - not in dictionary -> IGNORED
10 start_ind=276	end_ind=291	label='PERSON'
11 start_ind=303	end_ind=313	label='PERSON'
12 start_ind=315	end_ind=339	label='WORK_OF_ART'
WORK_OF_ART - not in dictionary -> IGNORED
13 start_ind=407	end_ind=416	label='PERSON'
14 start_ind=495	end_ind=502	label='PERSON'
15 start_ind=510	end_ind=521	label='WORK_OF_ART'
WORK_OF_ART - not in dictionary -> IGNORED
16 start_ind=529	end_ind=532	label='CARDINAL'
CARDINAL - not in dictionary -> IGNORED
17 sta
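
Printing each exception makes failures easy to miss in a long log. As a sketch of an alternative, the loop can collect the failed spans so they can be inspected or retried afterwards. `annotate_all` is a hypothetical helper, and `fake_add_inline` is a stand-in for `so.add_inline` that fails on an arbitrarily chosen span purely so the example runs without the library.

```python
def annotate_all(spans, add_inline):
    """Call `add_inline(begin, end, label)` for each span and return the failures."""
    failures = []
    for begin, end, label in spans:
        try:
            add_inline(begin, end, label)
        except ValueError as err:
            failures.append(
                {"begin": begin, "end": end, "label": label, "error": str(err)}
            )
    return failures


# Stand-in for `so.add_inline`; the failing span is arbitrary.
def fake_add_inline(begin, end, label):
    if begin == 59:
        raise ValueError("no unique context found")


spans = [(24, 47, "PERSON"), (51, 55, "DATE"), (59, 76, "PERSON")]
failed = annotate_all(spans, fake_add_inline)
print(failed)
```

A list of failures also gives a concrete starting point for investigating which spans trigger the error.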

#### Text element output

In [17]:
etree.tostring(so.text_el).decode("utf-8")

'<text xmlns="http://www.tei-c.org/ns/1.0">\n          <front><ab xml:id="intro">ITINERARY OF <persName>RABBI SAMUEL BEN SAMSON</persName> IN <date>1210</date>\nR. <persName>Samuel ben Samson</persName> made a pilgrimage to <placeName type="gpe">Palestine</placeName> in <date>1210</date> in company of the distinguished <persName>R. Jonathan ha Cohen</persName> for whom <persName>Samuel ibn Tibbon</persName> translated (from the Arabic into Hebrew) Maimonides&#8217; Guide of the Perplexed and <persName>Judah al Harizi</persName> translated <persName>Maimonides</persName>&#8217; Commentary on the Mishna. He is described in the traveller&#8217;s account of the pilgrimage as &#8220; <persName>Resh Gola</persName> &#8221;, the head of the captivity. The text is that of the Parma MS. translated by <persName>Carmoly</persName> in his Itineraires, pages 127 to 136.\nThe Traveller at the end of his narrative says that he carried a letter from the King of <placeName type="gpe">Jerusalem</placeNa

## Export

In [18]:
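# Write the annotated tree as UTF-8 with an XML declaration
etree.ElementTree(so.tree.getroot()).write(so_ner_result, encoding='utf-8', xml_declaration=True)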
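
After exporting, a quick sanity check is to parse the written file again and count the inserted elements. The sketch below does this with the standard library on a small hand-written sample that imitates the start of the annotated output. `count_elements` is a hypothetical helper, and in the notebook the path to check would be `so_ner_result`.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_elements(path, tag):
    """Parse the file at `path` and count TEI elements named `tag`."""
    root = ET.parse(path).getroot()
    return len(root.findall(f".//{{{TEI_NS}}}{tag}"))


# Hand-written sample imitating the beginning of the annotated output.
sample = (
    f'<TEI xmlns="{TEI_NS}"><text><front><ab xml:id="intro">ITINERARY OF '
    "<persName>RABBI SAMUEL BEN SAMSON</persName> IN <date>1210</date>"
    "</ab></front></text></TEI>"
)

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "sample.xml"
    out_file.write_text(sample, encoding="utf-8")
    pers_count = count_elements(out_file, "persName")
    date_count = count_elements(out_file, "date")

print(pers_count, date_count)  # 1 1
```

If the file is not well-formed, `ET.parse` raises a `ParseError`, so the check fails loudly rather than silently.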