# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete every instruction marked with 👋 ⚒, either in the code cell that follows the marker or, for analysis questions, in the text cell that follows it.

We will use the newspaper library to facilitate the scraping of the news article from a webpage.

In [53]:
!pip install newspaper3k

!pip install lxml[html_clean]



In [54]:
import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/2024/10/25/style/banana-artwork-maurizio-cattelan-comedian-auction/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Oscar Holland'] 

Title:  Maurizio Cattelan’s viral banana artwork ‘Comedian’ could now be worth $1.5 million 

Text of article: 
 CNN —

When a banana duct-taped to a wall sold for $120,000 in 2019, social media uproar and an age-old debate about the meaning of art ensued.

But artist Maurizio Cattelan’s viral creation, titled “Comedian,” may yet prove a sound investment: On Friday, auction house Sotheby’s announced that one of the artwork’s three “editions” is going back on sale — this time with an estimate of $1 million to $1.5 million.

For their money, the winning bidder will receive a roll of duct tape and one banana, as well as a certificate of authenticity and official instructions for installing the work. Sotheby’s confirmed to CNN that neither the tape nor, thankfully, the banana are the originals.

“‘Comedian’ is a conceptual artwork, and the actual physical materials are replaced with every installation,” an auction spokesperson said via email.

Cattelan and Fre

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [55]:
# Calculate and print the number of unique words in the text
import spacy
nlp = spacy.load("en_core_web_sm")

en_original_text = nlp(article.text)

tokens = []

for token in en_original_text:
  tokens.append(token.text)

unique_tokens = set(tokens)

print(f'Number of tokens in the original text: {len(tokens)}')
print(f'Number of unique tokens in the original text: {len(unique_tokens)}')

print(tokens)
print(unique_tokens)

Number of tokens in the original text: 852
Number of unique tokens in the original text: 370
['CNN', '—', '\n\n', 'When', 'a', 'banana', 'duct', '-', 'taped', 'to', 'a', 'wall', 'sold', 'for', '$', '120,000', 'in', '2019', ',', 'social', 'media', 'uproar', 'and', 'an', 'age', '-', 'old', 'debate', 'about', 'the', 'meaning', 'of', 'art', 'ensued', '.', '\n\n', 'But', 'artist', 'Maurizio', 'Cattelan', '’s', 'viral', 'creation', ',', 'titled', '“', 'Comedian', ',', '”', 'may', 'yet', 'prove', 'a', 'sound', 'investment', ':', 'On', 'Friday', ',', 'auction', 'house', 'Sotheby', '’s', 'announced', 'that', 'one', 'of', 'the', 'artwork', '’s', 'three', '“', 'editions', '”', 'is', 'going', 'back', 'on', 'sale', '—', 'this', 'time', 'with', 'an', 'estimate', 'of', '$', '1', 'million', 'to', '$', '1.5', 'million', '.', '\n\n', 'For', 'their', 'money', ',', 'the', 'winning', 'bidder', 'will', 'receive', 'a', 'roll', 'of', 'duct', 'tape', 'and', 'one', 'banana', ',', 'as', 'well', 'as', 

## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `string.isalpha()`).
3. Lemmatize all words in the text.
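
The effect of the first two steps can be illustrated without a language model. The following is a minimal sketch on a hand-written token list (an assumption for illustration, not the tokens spaCy produces for the article). Lemmatization is left out because it needs a loaded spaCy pipeline.

```python
# Toy tokens, hand-written for illustration (not real spaCy output).
tokens = ["When", "a", "banana", "sold", "for", "$", "120,000", ",",
          "the", "Banana", "debate", "ensued", ".", "A", "banana", "!"]

# Step 1: lowercase every token so "Banana" and "banana" count as one word.
lowered = [t.lower() for t in tokens]

# Step 2: keep purely alphabetic tokens (drops punctuation and numbers).
alpha_only = [t for t in lowered if t.isalpha()]

print(len(set(tokens)))      # unique tokens before preprocessing
print(len(set(alpha_only)))  # unique tokens after steps 1 and 2
```

Both steps can only shrink the vocabulary: lowercasing merges case variants, and the `isalpha()` filter removes whole token types.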

In [56]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again
preprocessed_text = []


for word in nlp(article.text.lower()):
  # Keep alphabetic tokens only (drops punctuation and numbers), then lemmatize
  if word.text.isalpha():
    preprocessed_text.append(word.lemma_)

print(f'Number of words in preprocessed text: {len(preprocessed_text)}')
print(f'Number of unique words in the preprocessed text: {len(set(preprocessed_text))}')

print(preprocessed_text)

Number of words in preprocessed text: 682
Number of unique words in the preprocessed text: 310
['cnn', 'when', 'a', 'banana', 'duct', 'tape', 'to', 'a', 'wall', 'sell', 'for', 'in', 'social', 'medium', 'uproar', 'and', 'an', 'age', 'old', 'debate', 'about', 'the', 'meaning', 'of', 'art', 'ensue', 'but', 'artist', 'maurizio', 'cattelan', 'viral', 'creation', 'title', 'comedian', 'may', 'yet', 'prove', 'a', 'sound', 'investment', 'on', 'friday', 'auction', 'house', 'sotheby', 'announce', 'that', 'one', 'of', 'the', 'artwork', 'three', 'edition', 'be', 'go', 'back', 'on', 'sale', 'this', 'time', 'with', 'an', 'estimate', 'of', 'million', 'to', 'million', 'for', 'their', 'money', 'the', 'win', 'bidder', 'will', 'receive', 'a', 'roll', 'of', 'duct', 'tape', 'and', 'one', 'banana', 'as', 'well', 'as', 'a', 'certificate', 'of', 'authenticity', 'and', 'official', 'instruction', 'for', 'instal', 'the', 'work', 'sotheby', 'confirm', 'to', 'cnn', 'that', 'neither', 'the', 'tape', 'nor', 'thankfu

## **NER**

In the tutorial we used only one of the models available in spaCy. In this exercise, you will compare the performance of models of different sizes and implementations. The available models are described in the [spaCy documentation](https://spacy.io/models/en). The models first need to be installed. We will use three of them: `en_core_web_sm`, `en_core_web_lg`, and `en_core_web_trf`.

In [57]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 40.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


👋 ⚒  Use each of the three models (`en_core_web_sm`, `en_core_web_lg`, `en_core_web_trf`) to perform named entity recognition on the original, non-preprocessed article, one model after another. You can use separate code cells for the different models or write everything into one cell, as you prefer. For each model's output, automatically calculate the number of named entities of each entity type that the model identifies.
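
Because the counting step is identical for all three models, it can be factored into one helper. The sketch below is one possible way to do this, not a required solution. `count_entity_labels` is a hypothetical helper name, and the entities are stand-in objects built with `SimpleNamespace` so that the example runs without a downloaded model. With spaCy you would pass `doc.ents` instead.

```python
from collections import Counter
from types import SimpleNamespace


def count_entity_labels(ents):
    """Return a Counter mapping each entity label to its frequency."""
    return Counter(ent.label_ for ent in ents)


# Stand-in entities (assumption: they mimic the .text and .label_
# attributes of spaCy entity spans).
sample_ents = [
    SimpleNamespace(text="CNN", label_="ORG"),
    SimpleNamespace(text="120,000", label_="MONEY"),
    SimpleNamespace(text="2019", label_="DATE"),
    SimpleNamespace(text="Friday", label_="DATE"),
    SimpleNamespace(text="Sotheby’s", label_="ORG"),
]

label_counts = count_entity_labels(sample_ents)
for label, freq in label_counts.most_common():
    print(f"{label}: {freq}")
```

A `Counter` replaces the manual "initialise to 0, then increment" pattern and also offers `most_common()` for sorted output.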

In [58]:
import spacy
nlp_sm = spacy.load("en_core_web_sm")

# Your code here
i = 1
entities_list_sm = []
entities_freq_dict_sm = {}

doc_sm = nlp_sm(article.text)

for ent in doc_sm.ents:
    print(f'{i}) {ent.text}, = {ent.label_}')
    i += 1
    entities_list_sm.append(ent.label_)

for ent_label in entities_list_sm:
  if ent_label not in entities_freq_dict_sm:
    entities_freq_dict_sm[ent_label] = 0
  entities_freq_dict_sm[ent_label] += 1

print(f'The list of named entity labels (one per entity): {entities_list_sm}')
print(f'The frequencies of unique named entity labels: {entities_freq_dict_sm}')

1) CNN, = ORG
2) 120,000, = MONEY
3) 2019, = DATE
4) Maurizio Cattelan’s, = PERSON
5) Comedian, = NORP
6) Friday, = DATE
7) Sotheby’s, = ORG
8) one, = CARDINAL
9) three, = CARDINAL
10) $1 million to $1.5 million, = MONEY
11) one, = CARDINAL
12) Sotheby’s, = ORG
13) CNN, = ORG
14) Comedian, = NORP
15) Cattelan, = NORP
16) French, = NORP
17) Perrotin, = PERSON
18) five years ago, = DATE
19) Comedian, = NORP
20) six, = CARDINAL
21) the Art Basel, = FAC
22) Miami Beach, = GPE
23) Miami, = GPE
24) 02:42 - Source, = PERSON
25) CNN, = ORG
26) David Datuna, = PERSON
27) hundreds, = CARDINAL
28) Miami, = GPE
29) three, = CARDINAL
30) Two, = CARDINAL
31) 120,000, = MONEY
32) third, = ORDINAL
33) The Guggenheim museum, = ORG
34) New York, = GPE
35) Sotheby’s, = ORG
36) November, = DATE
37) one, = CARDINAL
38) Miami, = GPE
39) Cattelan, = ORG
40) Comedian, = NORP
41) the Art Newspaper, = ORG
42) 2021, = DATE
43) Italian, = NORP
44) CNN, = ORG
45) November, = DATE
46) Comedian, = NORP
47) Sotheby's

In [59]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 400.7/400.7 MB 1.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_lg')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [60]:
import spacy
nlp_lg = spacy.load("en_core_web_lg")

# Your code here
i = 1
entities_list_lg = []
entities_freq_dict_lg = {}

doc_lg = nlp_lg(article.text)

for ent in doc_lg.ents:
  print(f'{i}) {ent.text}, = {ent.label_}')
  i += 1
  entities_list_lg.append(ent.label_)

for ent_label in entities_list_lg:
  if ent_label not in entities_freq_dict_lg:
    entities_freq_dict_lg[ent_label] = 0
  entities_freq_dict_lg[ent_label] += 1

print(f'The list of named entity labels (one per entity): {entities_list_lg}')
print(f'The frequencies of unique named entity labels: {entities_freq_dict_lg}')

1) CNN, = ORG
2) 120,000, = MONEY
3) 2019, = DATE
4) Maurizio Cattelan, = PERSON
5) Comedian,, = WORK_OF_ART
6) Friday, = DATE
7) Sotheby’s, = ORG
8) one, = CARDINAL
9) three, = CARDINAL
10) $1 million to $1.5 million, = MONEY
11) one, = CARDINAL
12) Sotheby’s, = ORG
13) CNN, = ORG
14) Cattelan, = NORP
15) French, = NORP
16) Perrotin, = ORG
17) five years ago, = DATE
18) Comedian, = WORK_OF_ART
19) six, = CARDINAL
20) the Art Basel, = ORG
21) Miami Beach, = GPE
22) Miami, = GPE
23) Marcel Duchamp, = PERSON
24) 02:42 -, = PRODUCT
25) CNN, = ORG
26) David Datuna, = PERSON
27) hundreds, = CARDINAL
28) Miami, = GPE
29) three, = CARDINAL
30) Two, = CARDINAL
31) 120,000, = MONEY
32) third, = ORDINAL
33) Guggenheim, = ORG
34) New York, = GPE
35) Sotheby’s, = ORG
36) November, = DATE
37) one, = CARDINAL
38) Miami, = GPE
39) Cattelan, = ORG
40) Comedian, = WORK_OF_ART
41) the Art Newspaper, = ORG
42) 2021, = DATE
43) Italian, = NORP
44) CNN, = ORG
45) November, = DATE
46) Comedian, = WORK_OF_AR

In [61]:
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 457.4/457.4 MB 4.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [62]:
import spacy
nlp_trf = spacy.load("en_core_web_trf")

# Your code here

doc_trf = nlp_trf(article.text)

i = 1
entities_list_trf = []
entities_freq_dict_trf = {}

for ent in doc_trf.ents:
    print(f'{i}) {ent.text}, = {ent.label_}')
    i += 1
    entities_list_trf.append(ent.label_)

for ent_label in entities_list_trf:
  if ent_label not in entities_freq_dict_trf:
    entities_freq_dict_trf[ent_label] = 0
  entities_freq_dict_trf[ent_label] += 1

print(f'The list of named entity labels (one per entity): {entities_list_trf}')
print(f'The frequencies of unique named entity labels: {entities_freq_dict_trf}')

1) CNN, = ORG
2) 120,000, = MONEY
3) 2019, = DATE
4) Maurizio Cattelan, = PERSON
5) Comedian, = WORK_OF_ART
6) Friday, = DATE
7) Sotheby’s, = ORG
8) one, = CARDINAL
9) three, = CARDINAL
10) $1 million to $1.5 million, = MONEY
11) one, = CARDINAL
12) Sotheby’s, = ORG
13) CNN, = ORG
14) Comedian, = WORK_OF_ART
15) Cattelan, = ORG
16) French, = NORP
17) Perrotin, = ORG
18) five years ago, = DATE
19) Comedian, = WORK_OF_ART
20) six, = CARDINAL
21) the Art Basel Miami Beach, = EVENT
22) Miami, = GPE
23) Marcel Duchamp’s, = PERSON
24) 02:42, = TIME
25) CNN, = ORG
26) Events, = ORG
27) David Datuna, = PERSON
28) hundreds, = CARDINAL
29) Miami, = GPE
30) three, = CARDINAL
31) Two, = CARDINAL
32) 120,000, = MONEY
33) third, = ORDINAL
34) The Guggenheim, = ORG
35) New York, = GPE
36) Sotheby’s, = ORG
37) November, = DATE
38) one, = CARDINAL
39) Miami, = GPE
40) Cattelan, = PERSON
41) Comedian, = WORK_OF_ART
42) the Art Newspaper, = ORG
43) 2021, = DATE
44) Italian, = NORP
45) CNN, = ORG
46) Nove

You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [63]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc_sm, style="ent", jupyter=True)

In [64]:
from spacy import displacy
displacy.render(doc_lg, style="ent", jupyter=True)

In [65]:
from spacy import displacy
displacy.render(doc_trf, style="ent", jupyter=True)

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

The three models performed differently, starting with the total number of named entities they found: the sm-model found 75, the lg-model 73, and the trf-model 76.

Since all three models share the same NER label set, my comparison focuses on the number of misclassified entities, which ranges from 2 (trf-model) to 5 (sm-model) and 6 (lg-model). Based on these counts, I conclude that the trf-model performed best and the other two performed worse.

Here are all the wrongly recognised named entities I found:
1.  sm-model:
*   Comedian as NORP, not as WORK_OF_ART
*   02:42 - Source as PERSON, not as TIME
*   Cattelan as NORP or ORG, not as PERSON
*   French (NORP) and Perrotin (PERSON) tagged separately, not French art gallery Perrotin together as ORG
*   the Art Basel (FAC) and Miami Beach (GPE) tagged separately, not together as EVENT

2.  lg-model:
*   Galperin as ORG, not as PERSON
*   the **2023** incident as CARDINAL, not as DATE
*   02:42 - as PRODUCT, not as TIME
*   Cattelan as NORP, not as PERSON
*   French (NORP) and Perrotin (ORG) tagged separately, not French art gallery Perrotin together as ORG
*   the Art Basel (ORG) and Miami Beach (GPE) tagged separately, not together as EVENT

3.  trf-model:
*   Cattelan (second use) as ORG, not as PERSON
*   Events as ORG, although it is an ordinary word and not a named entity
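
A comparison like the one above can be partly automated by treating each model's output as a set of (text, label) pairs. The sketch below uses a handful of pairs copied from the printed sm and trf outputs, not the full entity lists. Note that sets collapse repeated mentions, so this approach shows where models disagree but does not reproduce the per-mention totals.

```python
# A few (text, label) pairs copied from the printed outputs above.
ents_sm = {
    ("Comedian", "NORP"),
    ("Perrotin", "PERSON"),
    ("the Art Basel", "FAC"),
    ("Miami Beach", "GPE"),
    ("David Datuna", "PERSON"),
}
ents_trf = {
    ("Comedian", "WORK_OF_ART"),
    ("Perrotin", "ORG"),
    ("the Art Basel Miami Beach", "EVENT"),
    ("David Datuna", "PERSON"),
}

# Pairs on which both models agree exactly (same text, same label).
agreed = ents_sm & ents_trf

# Same entity text, but a different label in each model.
labels_sm = dict(ents_sm)
labels_trf = dict(ents_trf)
relabelled = {
    text: (labels_sm[text], labels_trf[text])
    for text in labels_sm.keys() & labels_trf.keys()
    if labels_sm[text] != labels_trf[text]
}

# Entity texts that only one model produced (different span boundaries).
only_sm_spans = labels_sm.keys() - labels_trf.keys()
only_trf_spans = labels_trf.keys() - labels_sm.keys()

print("Agreed:", sorted(agreed))
print("Relabelled:", sorted(relabelled.items()))
print("Only in sm:", sorted(only_sm_spans))
print("Only in trf:", sorted(only_trf_spans))
```

The disagreements still have to be judged by hand, since neither model's output is a gold standard.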



👋 ⚒ Compare the analysis of the best performing spaCy model for NER on the article after it was preprocessed to the performance on the non-preprocessed article.

In [66]:
# Your code
import spacy
nlp_trf = spacy.load("en_core_web_trf")

# Your code here

# Note: str() turns the list into its Python repr (brackets, quotes, commas),
# not plain text; ' '.join(preprocessed_text) would give space-separated words.
prep_doc_trf = nlp_trf(str(preprocessed_text))

i = 1
prep_entities_list_trf = []
prep_entities_freq_dict_trf = {}

# Print each entity once and collect its label
for ent in prep_doc_trf.ents:
    print(f'{i}) {ent.text}, = {ent.label_}')
    i += 1
    prep_entities_list_trf.append(ent.label_)

for ent_label in prep_entities_list_trf:
  if ent_label not in prep_entities_freq_dict_trf:
    prep_entities_freq_dict_trf[ent_label] = 0
  prep_entities_freq_dict_trf[ent_label] += 1

print(f'The list of named entity labels (one per entity): {prep_entities_list_trf}')
print(f'The frequencies of unique named entity labels: {prep_entities_freq_dict_trf}')

1) friday, = DATE
2) five, = CARDINAL
3) third, = ORDINAL
4) york, = ORG
5) november, = DATE
6) first, = ORDINAL
7) seoul, = ORG
8) korea, = ORG
9) seoul, = ORG
10) monday, = DATE
11) london, = GPE
12) milan, = GPE
13) kong, = GPE
14) dubai, = GPE
15) taipei, = GPE
16) tokyo, = GPE
The list of named entity labels (one per entity): ['DATE', 'CARDINAL', 'ORDINAL', 'ORG', 'DATE', 'ORDINAL', 'ORG', 'ORG', 'ORG', 'DATE', 'GPE', 'GPE', 'GPE', 'GPE', 'GPE', 'GPE']
The frequencies of unique named entity labels: {'DATE': 3, 'CARDINAL': 1, 'ORDINAL': 2, 'ORG': 4, 'GPE': 6}
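
The cell above passes `str(preprocessed_text)` to the model. The following sketch, using a hand-picked four-word list as a stand-in for `preprocessed_text`, shows what that string looks like compared with a space-joined version. Whether the choice changes the entities found on the full article was not measured here.

```python
# Hand-picked stand-in for the preprocessed word list.
words = ["friday", "auction", "house", "sotheby"]

as_repr = str(words)       # Python's list representation
as_text = " ".join(words)  # plain space-separated text

print(as_repr)  # ['friday', 'auction', 'house', 'sotheby']
print(as_text)  # friday auction house sotheby
```

With `str()`, the model receives brackets, quotes, and commas as part of its input, so the text looks even less like natural language than the lowercased, lemmatized words alone.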


## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models to find the supported languages, listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [67]:
# Remember that you first need to load the model by replacing
#"en_core_web_sm" with the name of your model
!python -m spacy download ru_core_news_sm

Collecting ru-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.8.0/ru_core_news_sm-3.8.0-py3-none-any.whl (15.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.3/15.3 MB 60.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('ru_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


👋 ⚒ Perform NER on the selected article.

In [68]:
import newspaper
from newspaper import Article

url = 'https://www.bbc.com/russian/articles/c0k8307z58xo'

ru_article = Article(url)
ru_article.download()
ru_article.parse()

print("Authors: ", ru_article.authors, "\n")
print("Title: ", ru_article.title, "\n")
print("Text of article: \n", ru_article.text)

Authors:  ['Https', 'Www.Facebook.Com Bbcnews'] 

Title:  В Британии теперь есть своя Му Денг. В зоопарке Эдинбурга родилась карликовая бегемотиха 

Text of article: 
 В Британии теперь есть своя Му Денг. В зоопарке Эдинбурга родилась карликовая бегемотиха

Автор фото, Edinburgh Zoo Подпись к фото, Смотрители зоопарка говорят, что Хаггис уже «начала проявлять характер»

5 часов назад

В Эдинбургском зоопарке прошли уникальные роды: самка невероятно редкого карликового бегемота, находящегося под угрозой исчезновения, произвела на свет девочку, получившую прозвище Хаггис.

Роды состоялись 30 октября, и смотрители зоопарка отмечают, что Хаггис уже начинает «проявлять характер».

В дикой природе карликовые бегемоты обитают лишь в трех странах Западной Африки, и, согласно подсчетам экспертов, их совокупная популяция не превышает 2500 особей.

В сентябре карликовая бегемотиха из Таиланда по кличке Му Денг стала вирусным феноменом в соцсетях, объектом целой серии мемов из-за своих крохотных р

In [69]:
# Your code here

import spacy
nlp_ru = spacy.load("ru_core_news_sm")

ru_doc = nlp_ru(ru_article.text)

i = 1
ru_entities_list = []
ru_entities_freq_dict = {}

for ent in ru_doc.ents:
    print(f'{i}) {ent.text}, = {ent.label_}')
    i += 1
    ru_entities_list.append(ent.label_)

for ent_label in ru_entities_list:
  if ent_label not in ru_entities_freq_dict:
    ru_entities_freq_dict[ent_label] = 0
  ru_entities_freq_dict[ent_label] += 1

print(f'The list of unique named entity labels: {ru_entities_list}')
print(f'The frequencies of unique named entity labels: {ru_entities_freq_dict}')

1) Британии, = LOC
2) Му Денг, = PER
3) Эдинбурга, = LOC
4) Edinburgh Zoo, = ORG
5) Хаггис, = PER
6) Эдинбургском зоопарке, = ORG
7) Хаггис, = PER
8) Хаггис, = PER
9) Западной Африки, = LOC
10) Таиланда, = LOC
11) Му Денг, = PER
12) Хаггис, = PER
13) Эдинбургском зоопарке, = LOC
14) Джонни Эпплъярд, = PER
15) Хаггис, = LOC
16) Edinburgh Zoo, = ORG
17) Международный союз охраны природы и природных ресурсов (IUCN), = ORG
18) Либерии, = LOC
19) Му Денг, = PER
20) Таиланда, = LOC
21) Эпплъярд, = PER
22) Эдинбурга, = LOC
23) Хаггис, = PER
24) Отто, = LOC
25) Глория, = PER
26) Амару, = PER
27) Лондонскому, = PER
28) Му Денг, = PER
29) Открытом зоопарке, = ORG
30) Кхао Кхео, = PER
The list of unique named entity labels: ['LOC', 'PER', 'LOC', 'ORG', 'PER', 'ORG', 'PER', 'PER', 'LOC', 'LOC', 'PER', 'PER', 'LOC', 'PER', 'LOC', 'ORG', 'ORG', 'LOC', 'PER', 'LOC', 'PER', 'LOC', 'PER', 'LOC', 'PER', 'PER', 'PER', 'PER', 'ORG', 'PER']
The frequencies of unique named entity labels: {'LOC': 10, 'PER': 

In [70]:
from spacy import displacy
displacy.render(ru_doc, style="ent", jupyter=True)

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

*Your NE performance analysis here*

The article I chose is about the birth of an endangered pygmy hippo calf, named Haggis, at Edinburgh Zoo.

A similar article in English can be read here:
https://www.bbc.com/news/articles/c937lvpl3vno

NER on the Russian article extracted 30 named entities in total, of which only 4 were not recognised correctly:

*   [...] Открытом зоопарке as ORG and Кхао Кхео as PER separately, not together as ORG (en. "Khao Kheow Open Zoo")
*   Отто as LOC, not as PER (en. "Otto", the name of one of the hippo's parents)
*   Хаггис as LOC, not as PER (en. "Haggis", the newborn hippo's name)
*   [...] Эдинбургском зоопарке as LOC, not as ORG, although it could also be read as a location (en. "Edinburgh Zoo")

In particular, I found it interesting that the Russian model also recognised both occurrences of the English name Edinburgh Zoo correctly as ORG.
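
The counts reported in the two analyses can be turned into a rough share of correctly labelled entities. The sketch below uses only the numbers stated in this notebook (30 entities with 4 errors for the Russian model, 76 entities with 2 errors for the English trf model). `share_correct` is a hypothetical helper name. The figure covers only entities the model extracted, so entities a model missed entirely are not reflected, and the two articles differ, so the comparison is indicative at best.

```python
def share_correct(total_entities, wrong_entities):
    """Fraction of extracted entities that were judged correct."""
    return (total_entities - wrong_entities) / total_entities


ru_share = share_correct(30, 4)  # Russian sm model, counts from the analysis
en_share = share_correct(76, 2)  # English trf model, counts from the analysis

print(f"Russian (ru_core_news_sm): {ru_share:.1%}")
print(f"English (en_core_web_trf): {en_share:.1%}")
```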