<a href="https://colab.research.google.com/github/PeerChristensen/NLP-Demos/blob/main/da_transfomers_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Some examples using (Danish) transformer models

## Named entity recognition

We load what is, at the time of writing, the best available model for Danish NER. It can be found [here](https://huggingface.co/saattrupdan/nbailab-base-ner-scandi).

In [4]:
!pip install transformers
from transformers import pipeline
import pandas as pd

model = 'saattrupdan/nbailab-base-ner-scandi'
ner = pipeline("ner", model=model, aggregation_strategy='first')



In [5]:
text = "Margrethe Laursen, bosiddende på adressen Vibevej 25 i København, blev indlagt på Bispebjerg Hospital efter en ulykke i forbindelse med hendes arbejde ved Movia. Hun blev behandlet af Overlæge Jens Severinsen."

In [6]:
pd.DataFrame(ner(text))

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.999719 | Margrethe Laursen | 0 | 17 |
| 1 | LOC | 0.997352 | Vibevej 25 | 42 | 52 |
| 2 | LOC | 0.999211 | København | 55 | 64 |
| 3 | LOC | 0.971847 | Bispebjerg Hospital | 82 | 101 |
| 4 | ORG | 0.993756 | Movia | 155 | 160 |
| 5 | PER | 0.949524 | Jens Severinsen | 193 | 208 |


Given the standard output, we can make a function that anonymizes text by removing named entities based on character positions.

In [7]:
def find_and_remove_named_entities(text: str) -> str:
    """Use current best NER model (saattrupdan/nbailab-base-ner-scandi) to identify named entities.
    Entities are removed by position ranges within strings.
    The model and pipeline are defined outside this function.
    """
    try:
        named_ents = ner(text)
        ranges_to_remove = [range(i["start"], i["end"]) for i in named_ents]
        new_text = ''.join([char for idx, char in enumerate(text) if not any(idx in rng for rng in ranges_to_remove)])
        return new_text
    except Exception:
        # Fall back to the unmodified text if the NER pipeline fails
        return text

In [8]:
find_and_remove_named_entities(text)

', bosiddende på adressen  i , blev indlagt på  efter en ulykke i forbindelse med hendes arbejde ved . Hun blev behandlet af Overlæge .'
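
Deleting entities outright leaves gaps such as `i ,` in the output. A possible variation, sketched below, replaces each span with its label instead. It assumes entities arrive as dicts with `entity_group`, `start` and `end` keys, as in the table above. The function name `mask_entities` and the short sample sentence are illustrative and not part of the original notebook, and the offsets are written by hand rather than produced by the model.

```python
def mask_entities(text: str, entities: list) -> str:
    """Replace each entity span with a placeholder such as <PER> or <LOC>."""
    masked = text
    # Work from the last entity to the first so earlier offsets stay valid
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"<{ent['entity_group']}>"
        masked = masked[:ent["start"]] + placeholder + masked[ent["end"]:]
    return masked


# Hand-written sample in the same shape as the pipeline output (hypothetical data)
sample = "Margrethe Laursen bor i København."
sample_entities = [
    {"entity_group": "PER", "start": 0, "end": 17},
    {"entity_group": "LOC", "start": 24, "end": 33},
]

print(mask_entities(sample, sample_entities))
```

With the real pipeline, `ner(text)` could be passed as the second argument, since the output shown earlier carries the same keys.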

## Translation


### A quick example

In [9]:
# If you get an error saying sentencepiece is not installed, try restarting the runtime
!pip install sentencepiece

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-da",truncation=True, max_length=500)
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-da")





In [15]:
translation = pipeline("translation_en_to_da", model=model, tokenizer=tokenizer)

text = "I want to live, I want to give. I've been a miner for a heart of gold"

translated_text = translation(text)[0]['translation_text']
print(translated_text)

Jeg ønsker at leve, jeg vil give. Jeg har været en minearbejder for et hjerte af guld


### A not so quick example

In this example, we'll see how to translate The Da Vinci Code from an .epub file into Danish.

In [16]:
!pip install epub-conversion
!pip install xml_cleaner

from epub_conversion.utils import open_book, convert_epub_to_lines
import re, time
from tqdm.notebook import tqdm

import nltk
import numpy as np

nltk.download('punkt')

from nltk.tokenize import sent_tokenize

#### Preprocessing text

In [19]:
def clean_text(text):
  cleanr = re.compile('<.*?>')
  cleantext = re.sub(cleanr, '', text)
  return cleantext

In [None]:
book = open_book("/Users/peerchristensen/Downloads/DaVinciCode.epub")

lines = convert_epub_to_lines(book)

cleaned_text = [clean_text(line) for line in lines]

cleaned_text = [text.strip() for text in cleaned_text]

cleaned_text = list(filter(None, cleaned_text))
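
The regex in `clean_text` strips tags but leaves HTML entities such as `&amp;` untouched. If the book's markup contains them, one option is to decode them with `html.unescape` from the standard library. The sketch below combines the three cleaning steps above in one helper. The name `strip_markup` and the sample lines are made up for illustration.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")


def strip_markup(line: str) -> str:
    """Remove tags, decode HTML entities, and trim surrounding whitespace."""
    without_tags = TAG_PATTERN.sub("", line)
    return html.unescape(without_tags).strip()


# Hypothetical lines standing in for the output of convert_epub_to_lines
raw_lines = ["<p>Tom &amp; Jerry</p>", "   ", "<h1>Chapter 1</h1>"]

cleaned = [strip_markup(line) for line in raw_lines]
cleaned = [line for line in cleaned if line]  # drop lines that are empty after cleaning
print(cleaned)  # ['Tom & Jerry', 'Chapter 1']
```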

We can use a dataframe to store the original and translated text side by side, which makes it easier to evaluate the quality of the translations.

In [None]:
df = pd.DataFrame({'text': cleaned_text})

#### Translate


In [None]:
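def translate(text):
    """Translate English text to Danish, one sentence at a time."""
    if text is None or text == "":
        return "Error"

    # Split into sentences, then tokenize them as one padded batch of PyTorch tensors
    batch = tokenizer(sent_tokenize(text), return_tensors="pt", padding=True, truncation=True)

    # Run the model and decode the generated token ids back into text
    translated = model.generate(**batch)
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

    return " ".join(tgt_text)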

In [None]:
df['translated'] = df["text"].map(translate)

df.to_csv('translated_auto.csv')
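
A long paragraph can contain many sentences, and sending them all through the model in a single call may use more memory than is available. One way around this, sketched here as an assumption rather than as part of the original notebook, is to translate a fixed number of sentences at a time. The helper names `batched` and `translate_in_batches` are hypothetical, and `fake_translate` is a stand-in for the real model call so that the example runs without downloading anything.

```python
from typing import Callable, Iterator, List


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(
    sentences: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Apply translate_batch to each slice and collect the results in order."""
    translated: List[str] = []
    for batch in batched(sentences, batch_size):
        translated.extend(translate_batch(batch))
    return translated


def fake_translate(batch: List[str]) -> List[str]:
    # Stand-in for tokenizing, calling model.generate and decoding
    return [sentence.upper() for sentence in batch]


sentences = ["one", "two", "three", "four", "five"]
print(translate_in_batches(sentences, fake_translate, batch_size=2))
```

In practice, `fake_translate` would be replaced by a function that tokenizes the batch, calls `model.generate` and decodes the result.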

## Fine-tuning for domain adaptation

In some cases, it makes sense to fine-tune a pretrained model to better align with special domains of language use.

In this case, we compare the outcomes of a pretrained model for translating text from English to French with a specialized model fine-tuned using a dataset with translations of technical texts.

In [20]:
text = "Software developers and data scientist use computers to write emails and code."

### General

In [21]:
model = "Helsinki-NLP/opus-mt-en-fr"
translator = pipeline("translation", model=model)

translation = translator(text)
translation[0]['translation_text']

'Les développeurs de logiciels et les data scientist utilisent des ordinateurs pour écrire des courriels et des codes.'

### Domain-specific

In [22]:
model = "huggingface-course/marian-finetuned-kde4-en-to-fr"
translator = pipeline("translation", model=model)

translation = translator(text)
translation[0]['translation_text']

'Les développeurs de logiciels et les informaticiens utilisent les ordinateurs pour écrire des courriers électroniques et du code.'
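
To see exactly where the two models disagree, the outputs can be compared word by word. The sketch below uses `difflib.SequenceMatcher` from the standard library on the two translations shown above. The helper name `token_changes` is made up for illustration, and splitting on whitespace is a simplifying assumption.

```python
from difflib import SequenceMatcher

general = ("Les développeurs de logiciels et les data scientist utilisent des "
           "ordinateurs pour écrire des courriels et des codes.")
domain = ("Les développeurs de logiciels et les informaticiens utilisent les "
          "ordinateurs pour écrire des courriers électroniques et du code.")


def token_changes(a: str, b: str) -> list:
    """Return (old, new) pairs for every stretch of words that differs."""
    a_tokens, b_tokens = a.split(), b.split()
    matcher = SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a_tokens[i1:i2]), " ".join(b_tokens[j1:j2])))
    return changes


for old, new in token_changes(general, domain):
    print(f"{old!r} -> {new!r}")
```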

## Sequence classification/Sentiment analysis

In [23]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("DaNLP/da-bert-tone-sentiment-polarity")

model = AutoModelForSequenceClassification.from_pretrained("DaNLP/da-bert-tone-sentiment-polarity")

In [24]:
texts = ["Dette er intet mindre end et fantastisk produkt!", "Jeg er ret skuffet over den dårlige service."]

clf = pipeline(task="text-classification", model=model, tokenizer=tokenizer)

In [26]:
import pandas as pd
df = pd.DataFrame(clf(texts))
df["text"] = texts
# Move the text column to the front
df = df[['text'] + [col for col in df.columns if col != 'text']]
df

| | text | label | score |
|---|---|---|---|
| 0 | Dette er intet mindre end et fantastisk produkt! | positive | 0.998569 |
| 1 | Jeg er ret skuffet over den dårlige service. | negative | 0.996194 |


## Zero-shot classification

"*The zero-shot pipeline in the Transformers library treats text classification as natural language inference (NLI). This approach was pioneered by Yin et al. in 2019. In NLI, a model takes two sentences as input — a premise and a hypothesis — and decides whether the hypothesis follows from the premise (entailment), contradicts it (contradiction), or neither (neutral). For example, the premise David killed Goliath entails the hypothesis Goliath is dead, is contradicted by Goliath is alive and doesn’t allow us to draw any conclusions about Goliath is a giant. This NLI template can be reused for text classification by taking the text we’d like to label as the premise, and rephrasing every candidate class as a hypothesis.*" 

https://nlp.town/blog/zero-shot-classification/

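If more than one label can be true at once, we can pass `multi_label=True` when calling the classifier (older versions of Transformers call this argument `multi_class`).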
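
To make the NLI framing concrete, here is a rough sketch of how per-hypothesis logits could be turned into label scores. It follows the general idea described in the quote and is not the library's actual code. The logits are invented for illustration, and the template `"This example is {}."` is assumed to match the pipeline's default hypothesis template. In the single-label case the entailment logits compete across labels. In the multi-label case each label is scored on its own, as entailment versus contradiction.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


candidate_labels = ["travel", "sports", "cooking"]

# Each candidate label is rephrased as a hypothesis about the text (the premise)
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Hypothetical (contradiction, entailment) logits, one pair per hypothesis
nli_logits = [(2.0, -1.0), (-1.5, 3.0), (1.0, 0.0)]

# Single-label: softmax over the entailment logits, so the scores sum to 1
single_label_scores = softmax([entail for _, entail in nli_logits])

# Multi-label: entailment vs. contradiction per label, scored independently
multi_label_scores = [softmax([contra, entail])[1] for contra, entail in nli_logits]

for label, single, multi in zip(candidate_labels, single_label_scores, multi_label_scores):
    print(f"{label:8s} single={single:.3f} multi={multi:.3f}")
```

Because the multi-label scores are computed independently, they do not need to sum to 1, and several labels can score highly for the same text.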

In [27]:
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


In [47]:
texts = ["The expansion of the playoff field to include the top 14 teams ranks among the best decisions the league has made in recent years. An additional postseason bid for each conference has translated into an increased level of competition while also creating greater intrigue around those middle-of-the-pack teams."]
candidate_labels = ['travel', 'sports', 'cooking', 'politics','science','religion']
output = clf(texts, candidate_labels)[0]

In [49]:
pd.DataFrame({"label":output["labels"], "score":output["scores"]})

| | label | score |
|---|---|---|
| 0 | sports | 0.735874 |
| 1 | cooking | 0.109218 |
| 2 | travel | 0.069894 |
| 3 | science | 0.041872 |
| 4 | politics | 0.03317 |
| 5 | religion | 0.009971 |
