# NLP with Flair

This notebook is a step-by-step walkthrough of two NLP tasks: named entity recognition (NER) and part-of-speech (POS) tagging.

After familiarizing yourself with how it all works and plays together, you should be able to re-use the notebook and adapt it to your needs.

The term 'notebook' here refers to this .ipynb file, a file format that combines text cells and code cells. Code cells can be run directly from the notebook.

As we will be running the notebook in Google Colab, no installations are required. Google Colab provides you with a virtual machine that has Python and an array of useful packages pre-installed.
Keep in mind that Google Colab is a cloud service, so if your research involves sensitive data, using it might not be compliant with your data privacy regulations and restrictions.

## NER with Flair

The goal of named entity recognition (NER) is to identify persons, locations, organisations, and sometimes other concepts too.

This notebook uses the NER framework [Flair](https://github.com/flairNLP/flair) because it is relatively easy to use and because Ismail Prada-Ziegler from DH Unibe has trained an NER model for the Bernese towerbooks, which will serve as an example here. spaCy is another widely used NER framework that is also intuitive and powerful.

In [None]:
# install flair

!pip install flair

In [None]:
# First, we need to load the utilities we need from the flair module.
from flair.splitter import SegtokSentenceSplitter
from flair.models import SequenceTagger

To access the example text, we first need to upload the .txt file to
the session storage. You can do this in the Files tab in the sidebar.

In [None]:
# Once done, you can read the file into a python object like so.
with open('B_IX_452.txt') as f:
  text = f.read()
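
If the file has not been uploaded yet, the cell above fails with a rather terse error. A minimal sketch of a more defensive variant follows. It assumes the file is UTF-8 encoded, and the helper name `read_text_file` is hypothetical (it is not part of Flair or Colab). The demo writes a small temporary file so that it runs without the workshop data.

```python
import tempfile
from pathlib import Path


def read_text_file(path, encoding="utf-8"):
    """Return the file's content, or raise a descriptive error if it is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(
            f"{file_path} not found. Upload it to the session storage first."
        )
    # Assumption: the file is UTF-8 encoded; pass another encoding if it is not.
    return file_path.read_text(encoding=encoding)


# Demo with a temporary file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("Ein kurzer Beispieltext.", encoding="utf-8")
    print(read_text_file(demo_path))
```

In the notebook, the call would then be `text = read_text_file('B_IX_452.txt')`.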

We also need to load a tagger (model). Flair can load models directly from [Hugging Face](https://huggingface.co/) by their model identifier.

Models tend to be large, so this step can take some time the first time it is executed. It can also fill up the RAM of your virtual machine rather quickly, so if you use large models, try to have only one loaded at a time and restart the runtime if you intend to switch models.

In this example, we use the aforementioned model trained on the Bernese towerbooks.

In [None]:
tagger = SequenceTagger.load("dh-unibe/turmbuecher-ner-v1")

Before we feed the text to the tagger, we need to turn it into Sentence
objects, which also tokenizes it.

It makes sense to split long documents into smaller units before prediction,
as most systems perform worse on long texts. This is why we use the built-in
sentence splitter.

In [None]:
splitter = SegtokSentenceSplitter()
sentences = splitter.split(text)

# Here we actually identify the named entities. This can take a while.
tagger.predict(sentences)

# print the entities
for sentence in sentences:
  for entity in sentence.get_spans('ner'):
    print(entity)

As expected, this model works very well for the text we fed it. This is mainly because it was fine-tuned on the corpus the text was taken from, so we can expect it to perform well on similar texts.
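
To go beyond reading the printed spans, the entities can be tallied by label. The sketch below works on plain `(text, label)` pairs so that it runs without a model. The sample pairs and their labels are invented for illustration (the real label set depends on the model), and the helper name `count_entities` is hypothetical. With Flair, such pairs could presumably be collected from the spans above via `entity.text` and `entity.get_label('ner').value`.

```python
from collections import Counter


def count_entities(pairs):
    """Count occurrences per label and per (text, label) pair."""
    pairs = list(pairs)
    by_label = Counter(label for _, label in pairs)
    by_entity = Counter(pairs)
    return by_label, by_entity


# Invented sample data; replace it with pairs collected from the tagged sentences.
sample_pairs = [
    ("Bern", "LOC"),
    ("Hans Müller", "PER"),
    ("Bern", "LOC"),
    ("Thun", "LOC"),
]

by_label, by_entity = count_entities(sample_pairs)
print(by_label.most_common())    # labels, most frequent first
print(by_entity.most_common(1))  # the single most frequent entity
```

Counting by `(text, label)` rather than by text alone keeps apart identical strings that were tagged differently.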

For your own project in this workshop, we recommend looking for a suitable model on [Hugging Face](https://huggingface.co/) rather than reusing the model above.

## POS-tagging with Flair

Part-of-speech tagging works the same way in Flair. The only differences are that we will likely need a different model and that we load it via `Classifier` instead of `SequenceTagger`. We will use the same text as an example again.

In [None]:
# Install and load requirements and text in case you start here
# (uncomment if necessary)
# !pip install flair

from flair.nn import Classifier
from flair.splitter import SegtokSentenceSplitter

with open('B_IX_452.txt') as f:
  text = f.read()

# Now we load a different model: in this case a built-in model for German
# POS tagging, just to see how it works for pre-modern texts.
tagger = Classifier.load("de-pos")

# Finally, we split the text into sentences again, then predict and retrieve
# the annotations.
splitter = SegtokSentenceSplitter()
sentences = splitter.split(text)

# Here we actually identify the parts-of-speech. This can take a while.
tagger.predict(sentences)

# print the parts-of-speech
for sentence in sentences:
  print(sentence)
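
Printing whole sentences is fine for a first look, but for analysis it is often handier to group tokens by their tag, for example to list all nouns. The sketch below does this for plain `(token, tag)` pairs so that it runs without a model. The sample tokens and their STTS-style tags are invented for illustration, and the helper name `group_tokens_by_tag` is hypothetical. With Flair, the pairs could presumably be built from `token.text` and `token.get_label('pos').value` for each token in a sentence, assuming the model stores its predictions under the label type `'pos'`.

```python
from collections import defaultdict


def group_tokens_by_tag(tagged_tokens):
    """Map each tag to the list of tokens carrying it, in order of appearance."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)


# Invented sample data; the actual tag set depends on the model.
sample_tokens = [
    ("Der", "ART"),
    ("Vogt", "NN"),
    ("schreibt", "VVFIN"),
    ("den", "ART"),
    ("Brief", "NN"),
]

groups = group_tokens_by_tag(sample_tokens)
print(groups["NN"])  # all tokens tagged as nouns in the sample
```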