<a href="https://colab.research.google.com/github/PeerChristensen/NLP-Demos/blob/main/da_transfomers_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An overview of Danish transformer models

## Named entity recognition

We load `saattrupdan/nbailab-base-ner-scandi`, the best-performing model for Danish NER at the time of writing. It is available [on the Hugging Face Hub](https://huggingface.co/saattrupdan/nbailab-base-ner-scandi).

In [1]:
!pip install transformers
from transformers import pipeline

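# Scandinavian NER model from the Hugging Face Hub
model = 'saattrupdan/nbailab-base-ner-scandi'
# aggregation_strategy='first' groups sub-word tokens into whole entities,
# labelling each word by the tag of its first token
ner = pipeline("ner", model=model, aggregation_strategy='first')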



In [16]:
text = "Margrethe Laursen, bosiddende på adressen Vibevej 25 i København, blev indlagt på Bispebjerg Hospital efter en ulykke i forbindelse med hendes arbejde ved Movia. Hun blev behandlet af Overlæge Jens Severinsen."

In [17]:
ner(text)

[{'end': 17,
  'entity_group': 'PER',
  'score': 0.99971926,
  'start': 0,
  'word': 'Margrethe Laursen'},
 {'end': 52,
  'entity_group': 'LOC',
  'score': 0.9973518,
  'start': 42,
  'word': 'Vibevej 25'},
 {'end': 64,
  'entity_group': 'LOC',
  'score': 0.99921095,
  'start': 55,
  'word': 'København'},
 {'end': 101,
  'entity_group': 'LOC',
  'score': 0.9718465,
  'start': 82,
  'word': 'Bispebjerg Hospital'},
 {'end': 160,
  'entity_group': 'ORG',
  'score': 0.9937564,
  'start': 155,
  'word': 'Movia'},
 {'end': 208,
  'entity_group': 'PER',
  'score': 0.94952404,
  'start': 193,
  'word': 'Jens Severinsen'}]
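
Each entry in this list is a plain dictionary, so the result is easy to post-process. As a minimal sketch, the hypothetical helper `group_entities` below (not part of `transformers`) groups entity strings by label and drops low-confidence hits. The `min_score` threshold is an assumption chosen for illustration, and `sample_entities` is an abbreviated copy of the output above.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group entity strings by label, keeping entities scoring at least min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Abbreviated copy of the pipeline output shown above
sample_entities = [
    {"entity_group": "PER", "score": 0.99971926, "word": "Margrethe Laursen", "start": 0, "end": 17},
    {"entity_group": "LOC", "score": 0.9718465, "word": "Bispebjerg Hospital", "start": 82, "end": 101},
    {"entity_group": "ORG", "score": 0.9937564, "word": "Movia", "start": 155, "end": 160},
    {"entity_group": "PER", "score": 0.94952404, "word": "Jens Severinsen", "start": 193, "end": 208},
]

# With a threshold of 0.95, "Jens Severinsen" (score 0.9495) is filtered out
print(group_entities(sample_entities, min_score=0.95))
```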

Because the pipeline returns character offsets (`start` and `end`) for every entity, we can write a function that anonymizes text by removing the characters at those positions.

In [13]:
def find_and_remove_named_entities(text: str) -> str:
    """Remove named entities found by the NER model (saattrupdan/nbailab-base-ner-scandi).

    Entities are removed by character position ranges within the string.
    The model and pipeline are defined outside this function.
    """
    try:
        named_ents = ner(text)
    except Exception:
        # If the pipeline fails, the text is returned unchanged (i.e. NOT anonymized)
        return text
    # Collect every character index that falls inside an entity span
    indices_to_remove = {idx for ent in named_ents for idx in range(ent["start"], ent["end"])}
    return ''.join(char for idx, char in enumerate(text) if idx not in indices_to_remove)

In [18]:
find_and_remove_named_entities(text)

', bosiddende på adressen  i , blev indlagt på  efter en ulykke i forbindelse med hendes arbejde ved . Hun blev behandlet af Overlæge .'
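
Removing entities outright leaves gaps that can make the text hard to read. One possible variation, sketched here under assumptions, is to substitute a placeholder such as `<PER>` for each span instead. The helper name `replace_entities` is hypothetical, and the short sample text and its offsets are hand-written for illustration rather than produced by the model.

```python
def replace_entities(text, entities):
    """Replace each entity span with a placeholder built from its label, e.g. <PER>."""
    # Work from right to left so that earlier offsets stay valid after each replacement
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{ent['entity_group']}>"
        text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written sample in the same format as the pipeline output
sample_text = "Margrethe Laursen arbejder ved Movia."
sample_entities = [
    {"entity_group": "PER", "start": 0, "end": 17},
    {"entity_group": "ORG", "start": 31, "end": 36},
]

print(replace_entities(sample_text, sample_entities))
```

In the notebook, the same function could be called as `replace_entities(text, ner(text))`.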

## Translation


### A quick example

In [7]:
!pip install sentencepiece

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-da", truncation=True, max_length=500)
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-da")





In [10]:
translation = pipeline("translation_en_to_da", model=model, tokenizer=tokenizer)

text = "I want to live, I want to give. I've been a miner for a heart of gold"

translated_text = translation(text)[0]['translation_text']
print(translated_text)

Jeg ønsker at leve, jeg vil give. Jeg har været en minearbejder for et hjerte af guld


### A not so quick example

In this example, we'll translate *The Da Vinci Code*, stored as an .epub file, into Danish.

In [11]:
!pip install epub-conversion
!pip install xml_cleaner

from epub_conversion.utils import open_book, convert_epub_to_lines
import re, time
from tqdm.notebook import tqdm

import nltk
import numpy as np
import pandas as pd  # used for the DataFrame created further down

nltk.download('punkt')

from nltk.tokenize import sent_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


#### Preprocessing text

In [None]:
def clean_text(text):
  """Strip HTML/XML tags from a line of text."""
  cleanr = re.compile('<.*?>')  # non-greedy match of anything between < and >
  cleantext = re.sub(cleanr, '', text)
  return cleantext

In [None]:
book = open_book("/Users/peerchristensen/Downloads/DaVinciCode.epub")

lines = convert_epub_to_lines(book)

cleaned_text = [clean_text(line) for line in lines]

cleaned_text = [text.strip() for text in cleaned_text]

cleaned_text = list(filter(None, cleaned_text))
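
To make the three cleaning steps concrete without needing the .epub file, here is a small sketch that applies the same logic (strip tags, trim whitespace, drop empty lines) to a few made-up lines. The function name `preprocess` and the sample lines are assumptions for illustration only.

```python
import re

# Same pattern as in clean_text: a non-greedy match of anything between < and >
TAG_RE = re.compile(r"<.*?>")


def preprocess(lines):
    """Strip tags and surrounding whitespace, then drop lines that end up empty."""
    stripped = (TAG_RE.sub("", line).strip() for line in lines)
    return [line for line in stripped if line]


# Made-up input resembling lines extracted from an e-book
sample_lines = [
    "<h1>Chapter 1</h1>",
    "   <br/>   ",
    "<p>The <i>first</i> paragraph.</p>",
    "",
]

print(preprocess(sample_lines))
```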

We can store the original and translated text side by side in a DataFrame, which makes it easier to evaluate the quality of the translations.

In [None]:
df = pd.DataFrame({'text': cleaned_text})

#### Translate


In [None]:
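def translate(text):
    """Translate English text to Danish, sentence by sentence."""
    if text is None or text == "":
        return "Error"

    # Split into sentences and tokenize them as one padded batch of PyTorch tensors
    # (calling the tokenizer directly replaces the deprecated prepare_seq2seq_batch)
    batch = tokenizer(sent_tokenize(text), return_tensors="pt", padding=True, truncation=True)

    # Run the model
    translated = model.generate(**batch)
    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

    return " ".join(tgt_text)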
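
Note that `translate` sends all sentences of a paragraph to the model as a single batch, so a very long paragraph may use a lot of memory. One way to bound this, sketched here as an assumption rather than as part of the original notebook, is to split the sentence list into fixed-size chunks first. The helper `batched` and the batch size are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in for the list returned by sent_tokenize(text)
sentences = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four.", "Sentence five."]

for chunk in batched(sentences, batch_size=2):
    print(chunk)
```

Inside `translate`, one could loop over `batched(sent_tokenize(text), 8)`, call `model.generate` once per chunk, and join the decoded results at the end.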

In [None]:
# The DataFrame's source column is named 'text'
df['translated'] = df["text"].map(translate)

df.to_csv('translated_auto.csv')