
In [None]:
#!pip install flair

### HunFlair Tutorial 1: Tagging

This is part 1 of the tutorial, in which we show how to use our pre-trained HunFlair models to tag your text.

#### Tagging with Pre-trained HunFlair Models

Let's use the pre-trained HunFlair model for biomedical named entity recognition (NER). This model was trained on 24 biomedical NER data sets and can recognize 5 different entity types: cell lines, chemicals, diseases, genes/proteins, and species.

In [2]:
from flair.models import MultiTagger

tagger = MultiTagger.load("hunflair")

2022-11-12 05:52:25,866 loading file /root/.flair/models/hunflair-celline-v1.0.pt
2022-11-12 05:52:39,307 SequenceTagger predicts: Dictionary with 8 tags: <unk>, O, S-CellLine, B-CellLine, I-CellLine, E-CellLine, <START>, <STOP>
2022-11-12 05:52:39,813 loading file /root/.flair/models/hunflair-chemical-full-v1.0.pt
2022-11-12 05:52:59,190 SequenceTagger predicts: Dictionary with 8 tags: <unk>, O, S-Chemical, B-Chemical, I-Chemical, E-Chemical, <START>, <STOP>
2022-11-12 05:52:59,480 loading file /root/.flair/models/hunflair-disease-full-v1.0.pt
2022-11-12 05:53:11,553 SequenceTagger predicts: Dictionary with 8 tags: <unk>, O, B-Disease, E-Disease, I-Disease, S-Disease, <START>, <STOP>
2022-11-12 05:53:12,187 loading file /root/.flair/models/hunflair-gene-full-v1.0.pt
2022-11-12 05:53:24,718 SequenceTagger predicts: Dictionary with 8 tags: <unk>, O, S-Gene, B-Gene, I-Gene, E-Gene, <START>, <STOP>
2022-11-12 05:53:25,086 loading file /root/.flair/models/hunflair-species-full-v1.1.pt
2022

All you need to do is call the predict() method of the tagger on a sentence. This adds the predicted tags to the tokens in the sentence. Let's use a sentence with four named entities:

In [3]:
from flair.data import Sentence

sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 MOuse Model of Fragile X Syndrome")

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
print(sentence.to_tagged_string())


# Optional: tokenize biomedical text with SciSpaCy (a tokenizer, not the sentence splitter)
# from flair.tokenization import SciSpacyTokenizer
# sentence = Sentence("Your biomed text", use_tokenizer=SciSpacyTokenizer())



Sentence: "Behavioral abnormalities in the Fmr1 KO2 MOuse Model of Fragile X Syndrome" → ["Behavioral abnormalities"/Disease, "Fmr1"/Gene, "MOuse"/Species, "Fragile X Syndrome"/Disease]


The output lists each recognized entity together with its type. Internally, the models tag every word to indicate whether it is the beginning (B), inside (I), or end (E) of an entity. For example, "Fragile" is the first word of the disease "Fragile X Syndrome". Entities consisting of just one word are marked with a special single tag (S). For example, "MOuse" refers to a species entity.
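
To make the B/I/E/S scheme concrete, here is a minimal sketch of how such per-word tags can be grouped into entity spans. It is not Flair's own decoding code; the helper name `bioes_to_spans` and the hand-written tag list are assumptions made for illustration.

```python
from typing import List, Tuple


def bioes_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str, str]]:
    """Group BIOES-tagged tokens into (start, end, text, entity_type) spans.

    `end` is exclusive, so a span covers tokens[start:end].
    """
    spans = []
    start = None  # index of the first token of the entity currently open
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "S":
            # single-word entity
            spans.append((i, i + 1, tokens[i], entity_type))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            spans.append((start, i + 1, " ".join(tokens[start:i + 1]), entity_type))
            start = None
        elif prefix == "O":
            start = None
        # "I" tags simply continue the open entity
    return spans


# Hand-written disease tags for the example sentence (illustrative, not model output)
tokens = "Behavioral abnormalities in the Fmr1 KO2 MOuse Model of Fragile X Syndrome".split()
tags = ["B-Disease", "E-Disease", "O", "O", "O", "O", "O", "O", "O",
        "B-Disease", "I-Disease", "E-Disease"]

disease_spans = bioes_to_spans(tokens, tags)
for span in disease_spans:
    print(span)
```

The token indices produced this way use the same convention as a slice: the end index is exclusive.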

#### Getting Annotated Spans

Named entities often consist of multiple words that together cover a span of the input text, such as "Behavioral abnormalities" or "Fragile X Syndrome" in our example sentence. You can get such spans directly from a tagged sentence like this:

In [4]:
for disease in sentence.get_spans("hunflair-disease"):
    print(disease)

Span[0:2]: "Behavioral abnormalities" → Disease (0.6723)
Span[9:12]: "Fragile X Syndrome" → Disease (0.99)


In [5]:
for gene in sentence.get_spans("hunflair-gene"):
    print(gene)

Span[4:5]: "Fmr1" → Gene (0.8459)


This indicates that "Behavioral abnormalities" and "Fragile X Syndrome" are both diseases. Each such Span has a text, its position in the sentence, and a Label with a value and a score (the confidence in the prediction). You can also export the annotations of a layer as a dictionary by calling the to_dict() method. In the output below, it contains the sentence text plus the value and confidence of each label:

In [6]:
print(sentence.to_dict("hunflair-disease"))

{'text': 'Behavioral abnormalities in the Fmr1 KO2 MOuse Model of Fragile X Syndrome', 'hunflair-disease': [{'value': 'Disease', 'confidence': 0.6722518503665924}, {'value': 'Disease', 'confidence': 0.9900489449501038}]}
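
The confidence values can be used to keep only the predictions you trust. The following is a minimal sketch that assumes a dictionary shaped like the output above; the helper `filter_by_confidence` and the threshold of 0.8 are illustrative choices, not part of Flair.

```python
from typing import Any, Dict, List


def filter_by_confidence(annotations: Dict[str, Any], layer: str, threshold: float) -> List[Dict[str, Any]]:
    """Return the labels of `layer` whose confidence is at least `threshold`."""
    return [label for label in annotations.get(layer, []) if label["confidence"] >= threshold]


# Dictionary copied from the to_dict() output above
annotations = {
    "text": "Behavioral abnormalities in the Fmr1 KO2 MOuse Model of Fragile X Syndrome",
    "hunflair-disease": [
        {"value": "Disease", "confidence": 0.6722518503665924},
        {"value": "Disease", "confidence": 0.9900489449501038},
    ],
}

# 0.8 is an arbitrary cut-off chosen for this example
confident = filter_by_confidence(annotations, "hunflair-disease", threshold=0.8)
print(confident)
```

A suitable threshold depends on the application: a higher value trades recall for precision.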


You can retrieve the annotated entities of the other entity types in an analogous way, using `hunflair-cellline` for cell lines, `hunflair-chemical` for chemicals, `hunflair-gene` for genes and proteins, and `hunflair-species` for species. To get all entities in one go, you can run:

In [7]:
for annotation_layer in sentence.annotation_layers.keys():
    for entity in sentence.get_spans(annotation_layer):
        print(entity)

Span[0:2]: "Behavioral abnormalities" → Disease (0.6723)
Span[9:12]: "Fragile X Syndrome" → Disease (0.99)
Span[4:5]: "Fmr1" → Gene (0.8459)
Span[6:7]: "MOuse" → Species (0.997)
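
If you prefer the entities grouped by type rather than printed one by one, the same loop can fill a dictionary. This is a sketch only: `group_entities_by_layer` is a hypothetical helper, and `StubSpan` and `StubSentence` are stand-ins that mimic the attributes used above (`annotation_layers`, `get_spans()`, `text`) so that the example runs without loading the models. With Flair installed, you would pass the tagged sentence instead.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities_by_layer(sentence) -> Dict[str, List[str]]:
    """Map each annotation layer to the texts of its spans."""
    grouped = defaultdict(list)
    for layer in sentence.annotation_layers.keys():
        for span in sentence.get_spans(layer):
            grouped[layer].append(span.text)
    return dict(grouped)


class StubSpan:
    """Stand-in for a Flair span; only carries the entity text."""

    def __init__(self, text: str):
        self.text = text


class StubSentence:
    """Stand-in for a tagged Flair sentence (for demonstration only)."""

    def __init__(self, layers: Dict[str, List[StubSpan]]):
        self.annotation_layers = layers

    def get_spans(self, layer: str) -> List[StubSpan]:
        return self.annotation_layers[layer]


# Entities taken from the printed output above
demo = StubSentence({
    "hunflair-disease": [StubSpan("Behavioral abnormalities"), StubSpan("Fragile X Syndrome")],
    "hunflair-gene": [StubSpan("Fmr1")],
    "hunflair-species": [StubSpan("MOuse")],
})

grouped = group_entities_by_layer(demo)
print(grouped)
```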


----

#### Using a Biomedical Tokenizer
Tokenization, i.e. separating a text into tokens / words, is an important issue in natural language processing in general and biomedical text mining in particular. So far, we used a tokenizer for general domain text. This can be unfavourable if applied to biomedical texts.

HunFlair integrates SciSpaCy, a library specially designed to work with scientific text. To use the library, we first have to install it and download one of its models:

In [8]:
#!pip install scispacy==0.2.5
#!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz

To use the tokenizer, we just have to pass it as a parameter when instantiating a sentence:



In [9]:
from flair.tokenization import SciSpacyTokenizer

sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
                    use_tokenizer=SciSpacyTokenizer())
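
To see why the choice of tokenizer matters, it helps to compare two deliberately naive strategies on a fragment of biomedical text. Neither is what Flair or SciSpaCy actually does; the regular expression and variable names are assumptions made for illustration.

```python
import re

text = "Fragile X syndrome (FXS) is caused by a mutation in the X-linked FMR1 gene."

# Strategy 1: split on whitespace only; punctuation stays glued to the words
whitespace_tokens = text.split()

# Strategy 2: additionally turn every punctuation character into its own token
punct_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(punct_tokens)
```

With whitespace splitting, "(FXS)" remains a single token, so the entity "FXS" does not line up with any token boundary. Splitting on every punctuation character isolates "FXS", but it also breaks "X-linked" into three tokens. Where to split is a design decision, and a tokenizer built for scientific text makes it with the domain in mind.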

#### Working with longer Texts
When performing biomedical text mining, we often work with complete scientific abstracts or full texts, for example:

In [10]:
abstract = "Fragile X syndrome (FXS) is a developmental disorder caused by a mutation in the X-linked FMR1 gene, " \
           "coding for the FMRP protein which is largely involved in synaptic function. FXS patients present several " \
           "behavioral abnormalities, including hyperactivity, anxiety, sensory hyper-responsiveness, and cognitive " \
           "deficits. Autistic symptoms, e.g., altered social interaction and communication, are also often observed: " \
           "FXS is indeed the most common monogenic cause of autism."

To work with complete abstracts or full texts, we first have to split them into separate sentences. Again, we can use the SciSpaCy integration:

In [11]:
from flair.tokenization import SciSpacySentenceSplitter

# initialize the sentence splitter
splitter = SciSpacySentenceSplitter()

# split text into a list of Sentence objects
sentences = splitter.split(abstract)

# you can apply the HunFlair tagger directly to this list
tagger.predict(sentences)

We can access the annotations of the individual sentences by iterating over the list:



In [12]:
for sentence in sentences:
    print(sentence.to_tagged_string())

Sentence: "Fragile X syndrome ( FXS ) is a developmental disorder caused by a mutation in the X - linked FMR1 gene , coding for the FMRP protein which is largely involved in synaptic function ." → ["Fragile X syndrome"/Disease, "FXS"/Disease, "developmental disorder"/Disease, "FMR1"/Gene, "FMRP"/Gene]
Sentence: "FXS patients present several behavioral abnormalities , including hyperactivity , anxiety , sensory hyper - responsiveness , and cognitive deficits ." → ["FXS"/Disease, "behavioral abnormalities"/Disease, "hyperactivity"/Disease, "anxiety"/Disease, "cognitive deficits"/Disease]
Sentence: "Autistic symptoms , e.g. , altered social interaction and communication , are also often observed : FXS is indeed the most common monogenic cause of autism ." → ["Autistic symptoms"/Disease, "FXS"/Disease, "autism"/Disease]
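
Once every sentence is tagged, the entities can be aggregated over the whole abstract. The sketch below counts mentions with plain Python; the (text, type) pairs are copied by hand from the tagged output above, and the variable names are illustrative.

```python
from collections import Counter

# (text, type) pairs copied from the tagged output above, one list per sentence
tagged_sentences = [
    [("Fragile X syndrome", "Disease"), ("FXS", "Disease"), ("developmental disorder", "Disease"),
     ("FMR1", "Gene"), ("FMRP", "Gene")],
    [("FXS", "Disease"), ("behavioral abnormalities", "Disease"), ("hyperactivity", "Disease"),
     ("anxiety", "Disease"), ("cognitive deficits", "Disease")],
    [("Autistic symptoms", "Disease"), ("FXS", "Disease"), ("autism", "Disease")],
]

# How often each entity is mentioned across all sentences
mention_counts = Counter(
    entity for sentence_entities in tagged_sentences for entity in sentence_entities
)

# How many mentions each entity type has
type_counts = Counter(
    entity_type for sentence_entities in tagged_sentences for _, entity_type in sentence_entities
)

print(mention_counts.most_common(1))
print(type_counts)
```

Counting exact strings treats "Fragile X syndrome" and "FXS" as different entities; merging such variants would require an additional normalization step.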


Reference: https://github.com/flairNLP/flair/blob/master/resources/docs/HUNFLAIR_TUTORIAL_1_TAGGING.md