
In [None]:
pip install flair

## **Create a Sentence**

In [3]:
# a Sentence object holds a sentence that we may want to embed or tag
from flair.data import Sentence

# make a Sentence object from a plain string; the tokenizer splits off the
# final period, which is why the output below reports 5 tokens
sentence = Sentence('the grass is green.')

# print the object to see what is in there
print(sentence)

# access a token by its token id (ids start at 1)
print(sentence.get_token(2))
# access the same token by its index (indices start at 0)
print(sentence[1])

for token in sentence:
  print(token)

Sentence: "the grass is green ."   [− Tokens: 5]
Token: 2 grass
Token: 2 grass
Token: 1 the
Token: 2 grass
Token: 3 is
Token: 4 green
Token: 5 .
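
The output above shows `get_token(2)` and `sentence[1]` returning the same token, because token ids start at 1 while Python indices start at 0. The following is only a library-free sketch of that relationship: `ToySentence` is a hypothetical class written for illustration, not Flair's implementation.

```python
class ToySentence:
    """Hypothetical stand-in that mimics the id/index behavior seen above."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def get_token(self, token_id):
        # token ids are 1-based, so shift by one to reach the list position
        return self.tokens[token_id - 1]

    def __getitem__(self, index):
        # list-style access is 0-based, as usual in Python
        return self.tokens[index]


toy = ToySentence(['the', 'grass', 'is', 'green', '.'])
print(toy.get_token(2), toy[1])
```

Both calls address the second token, which is the off-by-one to keep in mind when mixing the two access styles.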


## **Adding Tags to Tokens and Labels to Sentences**

In [4]:
from flair.data import Sentence
from flair.data import Label

In [5]:
sentence = Sentence('I love the blue sky', use_tokenizer=True)

print(sentence)

# add a tag to a word in the sentence
sentence[3].add_tag('ner', 'color')
print('Adding tag to a word')
# print the sentence with all tags of this type
print(sentence.to_tagged_string())

tag: Label = sentence[3].get_tag('ner')
print(f'"{sentence[3]}" is tagged as "{tag.value}" with confidence score "{tag.score}"')

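sentence = Sentence('Germany is the current world cup winner')

# add a label to a sentence
# note: the first argument is the label type; a descriptive string such as
# 'topic' is the usual choice, but the built-in `str` was passed here, which
# is why <class 'str'> appears as the key in the recorded output below
print('\nadd a label to a sentence')
sentence.add_label(str, value='football')
# a sentence can also belong to multiple classes
# note: passing a list stores the whole list as ONE label value (see the
# output below); to assign several classes, call add_label once per class
sentence.add_label(str, value=['football', 'world cup'])

print('\n', sentence, '\n')
for label in sentence.labels:
  print(label)

Sentence: "I love the blue sky"   [− Tokens: 5]
Adding tag to a word
I love the blue <color> sky
"Token: 4 blue" is tagged as "color" with confidence score "1.0"

add a label to a sentence

 Sentence: "Germany is the current world cup winner"   [− Tokens: 7  − Sentence-Labels: {<class 'str'>: [football (1.0), ['football', 'world cup'] (1.0)]}] 

football (1.0)
['football', 'world cup'] (1.0)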


## **Named Entity Recognition (NER)**
NER is a subtask of information extraction (IE) that locates named entities in text and classifies them into predefined categories such as persons (PER) and locations (LOC).

In [30]:
from flair.models import SequenceTagger
from flair.data import Sentence

tagger = SequenceTagger.load('ner')
sentence = Sentence('George Washington went to Washington.')
# predict NER tags
tagger.predict(sentence)
# print sentence with predicted tags
print('\n', sentence.to_tagged_string())
for entity in sentence.get_spans('ner'):
  print('\n', entity)
print('\n', sentence.to_dict(tag_type='ner'))

2020-11-01 19:10:32,311 loading file /root/.flair/models/en-ner-conll03-v0.4.pt

 George <B-PER> Washington <E-PER> went to Washington <S-LOC> .

 Span [1,2]: "George Washington"   [− Labels: PER (0.9968)]

 Span [5]: "Washington"   [− Labels: LOC (0.9994)]

 {'text': 'George Washington went to Washington.', 'labels': [], 'entities': [{'text': 'George Washington', 'start_pos': 0, 'end_pos': 17, 'labels': [PER (0.9968)]}, {'text': 'Washington', 'start_pos': 26, 'end_pos': 36, 'labels': [LOC (0.9994)]}]}
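
The tagged string above uses BIOES-style prefixes: `B-` begins a multi-token entity, `E-` ends it, and `S-` marks a single-token entity. The sketch below shows one way such tags could be grouped into spans with character offsets. `bioes_to_spans` is a hypothetical helper written for illustration, not part of Flair, and it assumes each token appears verbatim in the text in order.

```python
def bioes_to_spans(text, tokens, tags):
    """Group BIOES-tagged tokens into entity spans with character offsets."""
    spans = []
    cursor = 0    # position in `text` just after the previous token
    start = None  # character offset where the currently open entity began
    for token, tag in zip(tokens, tags):
        begin = text.index(token, cursor)
        cursor = begin + len(token)
        if tag == 'O':
            start = None
            continue
        prefix, label = tag.split('-', 1)
        if prefix in ('B', 'S'):
            start = begin
        if prefix in ('E', 'S') and start is not None:
            spans.append({
                'text': text[start:cursor],
                'label': label,
                'start_pos': start,
                'end_pos': cursor,
            })
            start = None
    return spans


text = 'George Washington went to Washington.'
tokens = ['George', 'Washington', 'went', 'to', 'Washington', '.']
tags = ['B-PER', 'E-PER', 'O', 'O', 'S-LOC', 'O']

spans = bioes_to_spans(text, tokens, tags)
for span in spans:
    print(span)
```

For this input the offsets agree with the `start_pos` and `end_pos` values in the dictionary printed above.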


In [32]:
# make a sentence
sentence = Sentence('I love Berlin.')

# load the NER tagger
tagger = SequenceTagger.load('ner')

# run NER over sentence
tagger.predict(sentence)

print('\n', sentence)
print('\nThe following NER tags are found')

# iterate over entities and print
for entity in sentence.get_spans('ner'):
  print('\n', entity)


2020-11-01 19:11:30,171 loading file /root/.flair/models/en-ner-conll03-v0.4.pt

 Sentence: "I love Berlin ."   [− Tokens: 4  − Token-Labels: "I love Berlin <S-LOC> ."]

The following NER tags are found

 Span [3]: "Berlin"   [− Labels: LOC (0.9992)]


## **Text Classification and prediction**

In [33]:
from flair.models import TextClassifier
from flair.data import Sentence

classifier = TextClassifier.load('en-sentiment')
sentence1 = Sentence('This film hurts. It is so bad that I am confused')
sentence2 = Sentence('Flair is pretty neat!')
# predict sentiment labels for both sentences
classifier.predict([sentence1, sentence2])

# print sentence with predicted labels
print('\nsentence 1 is: {} and sentence 2 is:  {}'.format(sentence1.labels, sentence2.labels))

2020-11-01 19:11:48,010 loading file /root/.flair/models/sentiment-en-mix-distillbert_3.1.pt

sentence 1 is: [NEGATIVE (1.0)] and sentence 2 is:  [POSITIVE (0.9997)]
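
Each predicted label comes with a confidence between 0 and 1. A common way for a single-label classifier to produce such a value is a softmax over per-class scores. The sketch below assumes that setup and uses made-up scores; `softmax_label` and the logits are hypothetical and say nothing about the internals of the `en-sentiment` model.

```python
import math


def softmax_label(logits):
    """Return the most probable label and its softmax probability."""
    # subtract the maximum for numerical stability before exponentiating
    peak = max(logits.values())
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    probs = {name: value / total for name, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]


# hypothetical raw scores for one sentence
label, score = softmax_label({'NEGATIVE': -2.0, 'POSITIVE': 3.0})
print(f'{label} ({score:.4f})')
```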


## **Word Embedding**

In [34]:
# GloVe is an unsupervised learning algorithm for obtaining vector representations for words. 
# Training is performed on aggregated global word-word co-occurrence statistics from a corpus,
# and the resulting representations showcase interesting linear substructures of the word vector space. 

from flair.embeddings import WordEmbeddings
from flair.data import Sentence

# init embedding
glove_embedding = WordEmbeddings('glove')

# create sentence
sentence = Sentence('The grass is green .')

# embed a sentence using glove
glove_embedding.embed(sentence)

# now check out the embedded tokens
for token in sentence:
  print('\n', token)
  print(token.embedding)


 Token: 1 The
tensor([-0.0382, -0.2449,  0.7281, -0.3996,  0.0832,  0.0440, -0.3914,  0.3344,
        -0.5755,  0.0875,  0.2879, -0.0673,  0.3091, -0.2638, -0.1323, -0.2076,
         0.3340, -0.3385, -0.3174, -0.4834,  0.1464, -0.3730,  0.3458,  0.0520,
         0.4495, -0.4697,  0.0263, -0.5415, -0.1552, -0.1411, -0.0397,  0.2828,
         0.1439,  0.2346, -0.3102,  0.0862,  0.2040,  0.5262,  0.1716, -0.0824,
        -0.7179, -0.4153,  0.2033, -0.1276,  0.4137,  0.5519,  0.5791, -0.3348,
        -0.3656, -0.5486, -0.0629,  0.2658,  0.3020,  0.9977, -0.8048, -3.0243,
         0.0125, -0.3694,  2.2167,  0.7220, -0.2498,  0.9214,  0.0345,  0.4674,
         1.1079, -0.1936, -0.0746,  0.2335, -0.0521, -0.2204,  0.0572, -0.1581,
        -0.3080, -0.4162,  0.3797,  0.1501, -0.5321, -0.2055, -1.2526,  0.0716,
         0.7056,  0.4974, -0.4206,  0.2615, -1.5380, -0.3022, -0.0734, -0.2831,
         0.3710, -0.2522,  0.0162, -0.0171, -0.3898,  0.8742, -0.7257, -0.5106,
        -0.5203, -0.1459,
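
Word vectors like the one printed above are usually compared by cosine similarity, which measures the angle between two vectors regardless of their length. The sketch below is plain Python; the three-dimensional vectors are invented for illustration and are not real GloVe values.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# hypothetical toy embeddings
grass = [0.9, 0.1, 0.0]
green = [0.8, 0.2, 0.1]
sky = [0.0, 0.2, 0.9]

print(cosine_similarity(grass, green))
print(cosine_similarity(grass, sky))
```

With real embeddings, the same function could be applied to `token.embedding` values after converting them to lists.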

## **Document Embedding**

In [24]:
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings
from flair.data import Sentence

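# init the GloVe word embedding
glove_embedding = WordEmbeddings('glove')

# the document embedding runs an RNN over the word embeddings; the RNN weights
# are randomly initialized, so the vector is not meaningful until it is trained
document_embeddings = DocumentRNNEmbeddings([glove_embedding])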

# create an example sentence
sentence = Sentence('The grass is green and the sky is blue.')

# embed the sentence with our document embedding
document_embeddings.embed(sentence)

# now check out the embedded sentence.
print(sentence.get_embedding())

tensor([ 0.0440, -0.3109, -0.0417,  0.0755,  0.0792, -0.0844, -0.0824, -0.1984,
        -0.0606,  0.1573,  0.0708, -0.0084,  0.2063,  0.1378, -0.2152, -0.0593,
         0.1882, -0.2069, -0.0904,  0.1662,  0.1048,  0.0437,  0.3212, -0.2128,
         0.1258, -0.1650, -0.0848,  0.1415, -0.1952, -0.1981,  0.0882, -0.2492,
        -0.2923,  0.1543, -0.1051,  0.4094,  0.0007,  0.4009, -0.4420, -0.3354,
        -0.0406,  0.0440,  0.1490,  0.0344,  0.0982,  0.0837, -0.0094, -0.1453,
        -0.3861, -0.2928,  0.1424, -0.4047,  0.1599,  0.3470, -0.2107,  0.6693,
         0.0171,  0.2365, -0.0158, -0.0753,  0.0235,  0.1951,  0.2331, -0.1688,
        -0.3832, -0.4519,  0.1698,  0.1330, -0.2802,  0.0750, -0.0335, -0.0565,
         0.1637, -0.0454,  0.1202, -0.2097,  0.2553, -0.0944,  0.1900, -0.1798,
        -0.0650, -0.3736,  0.1023, -0.3984,  0.2873, -0.1591, -0.2093, -0.1043,
         0.3362, -0.4446,  0.0269, -0.3394,  0.1413,  0.0803,  0.2299,  0.2608,
        -0.3816,  0.2213,  0.2121, -0.12
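
The cell above builds a document vector with an RNN. A simpler alternative, shown here only as a sketch, is mean pooling: average the word vectors dimension by dimension. Flair offers pooling through `DocumentPoolEmbeddings`; the code below does not use it and instead works on invented two-dimensional vectors, with `mean_pool` as a hypothetical helper.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one document vector."""
    count = len(word_vectors)
    # zip(*...) walks the vectors dimension by dimension
    return [sum(dimension) / count for dimension in zip(*word_vectors)]


# hypothetical toy word vectors for a two-word document
word_vectors = [
    [1.0, 2.0],
    [3.0, 4.0],
]

document_vector = mean_pool(word_vectors)
print(document_vector)
```

Mean pooling needs no training, but it ignores word order, which is the main thing an RNN-based document embedding adds.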

## **Loading training data**

In [29]:
# a Corpus represents a dataset that is used to train a model
# it consists of a list of training sentences, a list of development sentences and a list of test sentences
import flair.datasets

corpus = flair.datasets.UD_ENGLISH()

# print number of sentences in the train split
print('\nnumber of sentences in training split is ', len(corpus.train),'\n')

# print number of sentences in the test split
print('number of sentences in testing split is ', len(corpus.test),'\n')

# print number of sentences in the development split
print('number of sentences in development split is ',len(corpus.dev),'\n')

# print the second sentence in the training split
print('the second sentence in the training split is ', corpus.train[1], '\n')

# print the first sentence in the test split
print('the first sentence in the test split with POS tagging is ', corpus.test[0].to_tagged_string('pos'), '\n')

2020-11-01 19:08:25,647 Reading data from /root/.flair/datasets/ud_english
2020-11-01 19:08:25,649 Train: /root/.flair/datasets/ud_english/en_ewt-ud-train.conllu
2020-11-01 19:08:25,651 Dev: /root/.flair/datasets/ud_english/en_ewt-ud-dev.conllu
2020-11-01 19:08:25,655 Test: /root/.flair/datasets/ud_english/en_ewt-ud-test.conllu

number of sentences in training split is  12543 

number of sentences in testing split is  2077 

number of sentences in development split is  2002 

the second sentence in the training split is  Sentence: "[ This killing of a respected cleric will be causing us trouble for years to come . ]"   [− Tokens: 18  − Token-Labels: "[ <[/PUNCT/-LRB-/punct> This <this/DET/DT/det/Sing/Dem> killing <killing/NOUN/NN/nsubj/Sing> of <of/ADP/IN/case> a <a/DET/DT/det/Ind/Art> respected <respected/ADJ/JJ/amod/Pos> cleric <cleric/NOUN/NN/nmod/Sing> will <will/AUX/MD/aux/Fin> be <be/AUX/VB/aux/Inf> causing <cause/VERB/VBG/root/Ger> us <we/PRON/PRP/iobj/Acc/Plur/1/Prs> trouble <tro