<a href="https://colab.research.google.com/github/Saputoa21/ADS_2024_Saputoa/blob/master/exercises/HomeExercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exericse 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

We will use the newspaper library to facilitate the scraping of the news article from a webpage.

In [2]:
!pip install newspaper3k

!pip install lxml[html_clean]



In [3]:
import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/2024/10/25/style/banana-artwork-maurizio-cattelan-comedian-auction/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Oscar Holland'] 

Title:  Maurizio Cattelan’s viral banana artwork ‘Comedian’ could now be worth $1.5 million 

Text of article: 
 CNN —

When a banana duct-taped to a wall sold for $120,000 in 2019, social media uproar and an age-old debate about the meaning of art ensued.

But artist Maurizio Cattelan’s viral creation, titled “Comedian,” may yet prove a sound investment: On Friday, auction house Sotheby’s announced that one of the artwork’s three “editions” is going back on sale — this time with an estimate of $1 million to $1.5 million.

For their money, the winning bidder will receive a roll of duct tape and one banana, as well as a certificate of authenticity and official instructions for installing the work. Sotheby’s confirmed to CNN that neither the tape nor, thankfully, the banana are the originals.

“‘Comedian’ is a conceptual artwork, and the actual physical materials are replaced with every installation,” an auction spokesperson said via email.

Cattelan and Fre

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [4]:
# Calculate and print the number of unique words in the text
import spacy
nlp = spacy.load("en_core_web_sm")

en_original_text = nlp(article.text)

unique_words = set()

for token in en_original_text:
    unique_words.add(token.text)

print(f'Number of all unique tokens: {len(unique_words)}')
print(unique_words)



Number of all unique tokens: 370
{'notion', 'But', 'Korea', 'Miami', 'installing', 'eaten', 'turn', 'before', 'public', 'exhibit', 'money', 'Video', 'Comedian', 'ate', 'from', 'ensued', 'Milan', 'eating', 'when', 'after', 'fair', 'by', 'act', 'of', '’', 'since', 'Duchamp', '02:42', 'concerns', 'Beach', 'While', 'works', 'old', 'commentary', 'University', 'physical', 'higher', 'museum', 'November', 'satirical', 'winning', 'gallery', 'press', 'as', 'right', 'pieces', 'reflection', 'using', 'adding', ',', 'which', 'safety', 'first', 'ago', 'undisclosed', 'materials', 'famous', 'titled', 'uproar', 'yet', 'joke', 'roll', 'then', 'thankfully', 'collection', 'ahead', 'going', 'Dubai', 'When', 'actual', 'true', 'a', 'made', 'For', '.', 'meaning', 'creation', 'conceptual', 'identity', 'original', 'The', 'up', 'front', 'rooted', 'one', 'auction', 'student', 'back', 'five', 'prove', 'merits', 'recently', 'Friday', '”', 'store', 'expensive', 'world', 'there', 'some', '-', 'putting', 'Galperin', 'd

## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `string.isalpha()).
3. Lemmatize all words in the text.

In [5]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again
prepocessed_text = set()

for word in en_original_text:
  if str(word).isalpha():
    prepocessed_text.add(word.lemma_.lower())

print(f'Number of unique words: {len(prepocessed_text)}')
print(prepocessed_text)

Number of unique words: 311
{'notion', 'david', 'add', 'americas', 'eaten', 'turn', 'before', 'public', 'embark', 'exhibit', 'money', 'angeles', 'from', 'win', 'when', 'after', 'fair', 'by', 'act', 'beach', 'of', 'since', 'speak', 'two', 'taipei', 'old', 'milan', 'commentary', 'physical', 'museum', 'satirical', 'give', 'merit', 'gallery', 'press', 'as', 'right', 'reflection', 'year', 'which', 'seoul', 'safety', 'first', 'ago', 'undisclosed', 'new', 'form', 'attendee', 'famous', 'uproar', 'yet', 'joke', 'roll', 'then', 'thankfully', 'collector', 'galperin', 'collection', 'ahead', 'if', 'sell', 'actual', 'true', 'a', 'marcel', 'meaning', 'creation', 'conceptual', 'identity', 'original', 'up', 'ensue', 'front', 'one', 'auction', 'student', 'back', 'monday', 'five', 'hundred', 'prove', 'piece', 'recently', 'store', 'instruction', 'expensive', 'world', 'there', 'some', 'hungry', 'art', 'very', 'comment', 'wit', 'every', 'duchamp', 'around', 'they', 'title', 'three', 'november', 'sotheby', '

## **NER**

In the tutorial we have only used one of the different models available in spaCy. In this exercise, you will compare the performance of the different models of different sizes and implementations. A description of the type of available models is in the [spaCy documentation](https://spacy.io/models/en). First, the models to be used need to be installed. We will use the following three models.

In [6]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m60.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: en-core-web-sm
  Attempting uninstall: en-core-web-sm
    Found existing installation: en-core-web-sm 3.7.1
    Uninstalling en-core-web-sm-3.7.1:
      Successfully uninstalled en-core-web-sm-3.7.1
Successfully installed en-core-web-sm-3.8.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


👋 ⚒  Use each of the three models that were downloaded above and perform named entitiy recognition with each of them on the original not preprocessed article, one after another. You can use different code cells for the different models or write everything into one cell, as you prefer. For each of the model outputs, automatically calculate the number of NERs for each NER type that the model identifies.

In [7]:
import spacy
nlp_sm = spacy.load("en_core_web_sm")

# Your code here
i = 1
j = 0
entities_list = {}

doc_sm = nlp_sm(article.text)

for ent in doc_sm.ents:
    print(f'{i}) {ent.text}, = {ent.label_}')
    i += 1
    if ent not in entities_list:
      entities_list[ent.label_] = j
    j += 1

print(entities_list)

1) CNN, = ORG
2) 120,000, = MONEY
3) 2019, = DATE
4) Maurizio Cattelan’s, = PERSON
5) Comedian, = NORP
6) Friday, = DATE
7) Sotheby’s, = ORG
8) one, = CARDINAL
9) three, = CARDINAL
10) $1 million to $1.5 million, = MONEY
11) one, = CARDINAL
12) Sotheby’s, = ORG
13) CNN, = ORG
14) Comedian, = NORP
15) Cattelan, = NORP
16) French, = NORP
17) Perrotin, = PERSON
18) five years ago, = DATE
19) Comedian, = NORP
20) six, = CARDINAL
21) the Art Basel, = FAC
22) Miami Beach, = GPE
23) Miami, = GPE
24) 02:42 - Source, = PERSON
25) CNN, = ORG
26) David Datuna, = PERSON
27) hundreds, = CARDINAL
28) Miami, = GPE
29) three, = CARDINAL
30) Two, = CARDINAL
31) 120,000, = MONEY
32) third, = ORDINAL
33) The Guggenheim museum, = ORG
34) New York, = GPE
35) Sotheby’s, = ORG
36) November, = DATE
37) one, = CARDINAL
38) Miami, = GPE
39) Cattelan, = ORG
40) Comedian, = NORP
41) the Art Newspaper, = ORG
42) 2021, = DATE
43) Italian, = NORP
44) CNN, = ORG
45) November, = DATE
46) Comedian, = NORP
47) Sotheby's

In [8]:
!python -m spacy download en_core_web_lg

Collecting en-core-web-lg==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0-py3-none-any.whl (400.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m400.7/400.7 MB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: en-core-web-lg
  Attempting uninstall: en-core-web-lg
    Found existing installation: en-core-web-lg 3.7.1
    Uninstalling en-core-web-lg-3.7.1:
      Successfully uninstalled en-core-web-lg-3.7.1
Successfully installed en-core-web-lg-3.8.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [9]:
import spacy
nlp_lg = spacy.load("en_core_web_lg")

# Your code here
i = 1
j = 0
entities_list_lg = {}

doc_lg = nlp_lg(article.text)

for ent in doc_lg.ents:
    print(f'{i}) {ent.text}, = {ent.label_}')
    i += 1
    if ent not in entities_list_lg:
      entities_list_lg[ent.label_] = j
    j += 1

print(entities_list_lg)

1) CNN, = ORG
2) 120,000, = MONEY
3) 2019, = DATE
4) Maurizio Cattelan, = PERSON
5) Comedian,, = WORK_OF_ART
6) Friday, = DATE
7) Sotheby’s, = ORG
8) one, = CARDINAL
9) three, = CARDINAL
10) $1 million to $1.5 million, = MONEY
11) one, = CARDINAL
12) Sotheby’s, = ORG
13) CNN, = ORG
14) Cattelan, = NORP
15) French, = NORP
16) Perrotin, = ORG
17) five years ago, = DATE
18) Comedian, = WORK_OF_ART
19) six, = CARDINAL
20) the Art Basel, = ORG
21) Miami Beach, = GPE
22) Miami, = GPE
23) Marcel Duchamp, = PERSON
24) 02:42 -, = PRODUCT
25) CNN, = ORG
26) David Datuna, = PERSON
27) hundreds, = CARDINAL
28) Miami, = GPE
29) three, = CARDINAL
30) Two, = CARDINAL
31) 120,000, = MONEY
32) third, = ORDINAL
33) Guggenheim, = ORG
34) New York, = GPE
35) Sotheby’s, = ORG
36) November, = DATE
37) one, = CARDINAL
38) Miami, = GPE
39) Cattelan, = ORG
40) Comedian, = WORK_OF_ART
41) the Art Newspaper, = ORG
42) 2021, = DATE
43) Italian, = NORP
44) CNN, = ORG
45) November, = DATE
46) Comedian, = WORK_OF_AR

In [10]:
!python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m457.4/457.4 MB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: en-core-web-trf
  Attempting uninstall: en-core-web-trf
    Found existing installation: en-core-web-trf 3.7.3
    Uninstalling en-core-web-trf-3.7.3:
      Successfully uninstalled en-core-web-trf-3.7.3
Successfully installed en-core-web-trf-3.8.0
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_trf')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [12]:
import spacy
nlp = spacy.load("en_core_web_trf")

# Your code here

doc_trf = nlp(article.text)

i = 1
j = 0
entities_list_trf = {}
for ent in doc_trf.ents:
    print(f'{i}) {ent.text}, = {ent.label_}')
    i += 1
    if ent not in entities_list_trf:
      entities_list_trf[ent.label_] = j
    j += 1

print(entities_list_trf)

1) CNN, = ORG
2) 120,000, = MONEY
3) 2019, = DATE
4) Maurizio Cattelan, = PERSON
5) Comedian, = WORK_OF_ART
6) Friday, = DATE
7) Sotheby’s, = ORG
8) one, = CARDINAL
9) three, = CARDINAL
10) $1 million to $1.5 million, = MONEY
11) one, = CARDINAL
12) Sotheby’s, = ORG
13) CNN, = ORG
14) Comedian, = WORK_OF_ART
15) Cattelan, = ORG
16) French, = NORP
17) Perrotin, = ORG
18) five years ago, = DATE
19) Comedian, = WORK_OF_ART
20) six, = CARDINAL
21) the Art Basel Miami Beach, = EVENT
22) Miami, = GPE
23) Marcel Duchamp’s, = PERSON
24) 02:42, = TIME
25) CNN, = ORG
26) Events, = ORG
27) David Datuna, = PERSON
28) hundreds, = CARDINAL
29) Miami, = GPE
30) three, = CARDINAL
31) Two, = CARDINAL
32) 120,000, = MONEY
33) third, = ORDINAL
34) The Guggenheim, = ORG
35) New York, = GPE
36) Sotheby’s, = ORG
37) November, = DATE
38) one, = CARDINAL
39) Miami, = GPE
40) Cattelan, = PERSON
41) Comedian, = WORK_OF_ART
42) the Art Newspaper, = ORG
43) 2021, = DATE
44) Italian, = NORP
45) CNN, = ORG
46) Nove

You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [13]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc_sm, style="ent", jupyter=True)

In [14]:
from spacy import displacy
displacy.render(doc_lg, style="ent", jupyter=True)

In [15]:
from spacy import displacy
displacy.render(doc_trf, style="ent", jupyter=True)

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

As far as i can see from the output, the first two models have gained the same results in numbers of found named entities and theire labels.

👋 ⚒ Compare the analysis of the best performing spaCy model for NER on the article after it was preprocessed to the performance on the non-preprocessed article.

In [None]:
# Your code here

## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go the [spaCy page](https://spacy.io/models) detailing the available models to identify supported languages on the left listed under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [None]:
# Remember that you first need to load the model by replacing
#"en_core_web_sm" with the name of your model
!python -m spacy download ru_core_news_sm

Collecting ru-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.7.0/ru_core_news_sm-3.7.0-py3-none-any.whl (15.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.3/15.3 MB[0m [31m60.1 MB/s[0m eta [36m0:00:00[0m
Collecting pymorphy3>=1.0.0 (from ru-core-news-sm==3.7.0)
  Downloading pymorphy3-2.0.2-py3-none-any.whl.metadata (1.8 kB)
Collecting dawg-python>=0.7.1 (from pymorphy3>=1.0.0->ru-core-news-sm==3.7.0)
  Downloading DAWG_Python-0.7.2-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting pymorphy3-dicts-ru (from pymorphy3>=1.0.0->ru-core-news-sm==3.7.0)
  Downloading pymorphy3_dicts_ru-2.4.417150.4580142-py2.py3-none-any.whl.metadata (2.0 kB)
Downloading pymorphy3-2.0.2-py3-none-any.whl (53 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m53.8/53.8 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading DAWG_Python-0.7.2-py2.py3-none-any.whl (11 kB)
Downloading pymorphy3_d

👋 ⚒ Perform NER on the selected article.

In [None]:
import newspaper
from newspaper import Article

url = 'https://www.dw.com/ru/svetlana-aleksievic-nado-gotovit-seba-dla-budusego/a-70580673'

ru_article = Article(url)
ru_article.download()
ru_article.parse()

print("Authors: ", ru_article.authors, "\n")
print("Title: ", ru_article.title, "\n")
print("Text of article: \n", ru_article.text)

import spacy

nlp_sm = spacy.load("ru_core_web_sm")

ru_original_text = nlp_sm(ru_article.text)

ru_tokens = []

for token in ru_original_text:
    if token not in tokens:
      tokens.append(token.text)
    continue

ru_unique_tokens = set(ru_tokens)

print('\n')

i = 1
j = 0
entities_list = {}
for ent in ru_unique_tokens.ents:
    print(f'{i}) {ent.text}, = {ent.label_}')
    i += 1
    if ent not in entities_list:
      entities_list[ent.label_] = j
    j += 1

print(entities_list)


Authors:  ['Яна Шварц', 'Вера Волович'] 

Title:  Светлана Алексиевич: "Надо готовить себя для будущего" – DW – 23.10.2024 

Text of article: 
 В Берлине на книжном фестивале Pradmova выступила Светлана Алексиевич. DW поговорила с ней о литературе и том, что ожидает Беларусь в будущем.

В Берлине прошел белорусский фестиваль интелектуальной книги Pradmova. В дискуссии "Песни любви и ненависти", которая состоялась 19 октября, приняла участие и лауреат Нобелевской премии по литературе Светлана Алексиевич. После мероприятия писательница рассказала DW, что не так сделали белорусские элиты в 2020 году и жалеет ли она, что белорусы не применили силу в тот период.

"Память о том августе будет сохранять наше достоинство очень долго"

DW: Во время дискуссии вы рассуждали о роли белорусских элит во время протестов 2020 года, сказав, что задумываетесь сегодня о том, что их романтизм, возможно, был преступным. Что именно элиты сделали не так?

Светлана Алексиевич: Я же тоже элита. Мы были в большо

OSError: [E050] Can't find model 'ru_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

*Your NE performance analysis here*