<a href="https://colab.research.google.com/github/EmiliaFidler/Intro_to_Comp_Ling_WS24/blob/main/homeexercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete all instructions starting with 👋 ⚒ in the code cell after the sign or provide your analysis in the text cell after the sign.

We will use the newspaper library to facilitate the scraping of the news article from a webpage.

In [None]:
!pip install newspaper3k
!pip install lxml_html_clean



In [None]:
import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/2024/10/25/style/banana-artwork-maurizio-cattelan-comedian-auction/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Oscar Holland'] 

Title:  Maurizio Cattelan’s viral banana artwork has sold again — this time for $6.24 million 

Text of article: 
 Editor’s Note: This article was updated with the final sale price and other details following the auction’s conclusion.

CNN —

When a banana duct-taped to a wall sold for $120,000 in 2019, social media uproar and an age-old debate about the meaning of art ensued.

But artist Maurizio Cattelan’s viral creation, titled “Comedian,” has proven a sound investment for one collector: One of the artwork’s three “editions” ﻿smashed estimates to sell for $6.24 million at a Sotheby’s auction in New York on Wednesday.

The auction house had estimated the work to go for between $1 million to $1.5 million; bidding began at $800,000.

During the sale, auctioneer Oliver Barker described the work as “iconic” and “disruptive,” while joking that selling a banana at auction were “words I never thought I’d say.”

Shortly after the sale, Sotheby’s revealed that Ju

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [None]:
# Calculate and print the number of unique words in the text
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data to be available
nltk.download('punkt_tab')

text = article.text

tokens = word_tokenize(text)

unique_words = set(tokens)

print(f"Number of unique words: {len(unique_words)}")

Number of unique words: 404


## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `str.isalpha()`).
3. Lemmatize all words in the text.

In [None]:
# Preprocess the text and then calculate the number of unique words again.
# Note: this cell lowercases, drops non-alphabetic tokens and removes
# stopwords; lemmatization (step 3) is not applied here.

from nltk.corpus import stopwords

nltk.download('punkt_tab')
nltk.download('stopwords')

tokens = [word.lower() for word in tokens if word.isalpha()]

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]

unique_words = set(filtered_tokens)

print(f"Number of unique words: {len(unique_words)}")

Number of unique words: 296


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
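
The three steps listed in the task (lowercasing, dropping punctuation and numbers, lemmatizing) could be chained roughly as sketched below. The `preprocess` helper and the tiny `TOY_LEMMAS` lookup are hypothetical stand-ins so that the example runs without any downloads; in the notebook one would presumably pass `nltk.stem.WordNetLemmatizer().lemmatize` instead, which needs `nltk.download('wordnet')` first.

```python
from typing import Callable, Iterable, List

# Hypothetical toy lookup standing in for a real lemmatizer; it only knows
# a handful of word forms and exists purely so the sketch runs offline.
TOY_LEMMAS = {"bananas": "banana", "editions": "edition", "artworks": "artwork"}


def toy_lemmatize(word: str) -> str:
    """Return the lemma from the toy table, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)


def preprocess(tokens: Iterable[str], lemmatize: Callable[[str], str]) -> List[str]:
    """Lowercase, keep purely alphabetic tokens, then lemmatize each one."""
    lowered = (tok.lower() for tok in tokens)
    alphabetic = (tok for tok in lowered if tok.isalpha())
    return [lemmatize(tok) for tok in alphabetic]


# Hand-written sample tokens (not taken from the article).
sample = ["Bananas", "and", "banana", ",", "3", "editions", "Edition", "."]
cleaned = preprocess(sample, toy_lemmatize)

print(cleaned)
print(f"Unique before: {len(set(sample))}, after: {len(set(cleaned))}")
```

Passing the lemmatizer in as a function keeps the pipeline independent of any particular library, so the same helper could be tried with different lemmatizers.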


## **NER**

In the tutorial we have only used one of the different models available in spaCy. In this exercise, you will compare the performance of the different models of different sizes and implementations. A description of the type of available models is in the [spaCy documentation](https://spacy.io/models/en). First, the models to be used need to be installed. We will use the following three models.

In [None]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 54.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.3 MB/s eta 0:00:00
✔ Download and installation succe

👋 ⚒  Use each of the three models that were downloaded above and perform named entity recognition with each of them on the original, non-preprocessed article, one after another. You can use different code cells for the different models or write everything into one cell, as you prefer. For each of the model outputs, automatically calculate the number of named entities for each entity type that the model identifies.

In [None]:
import spacy
# Your code here
from collections import Counter

def perform_ner(model, text):
  nlp = spacy.load(model)
  doc = nlp(text)

  ent_count = Counter([ent.label_ for ent in doc.ents])

  print(f"Named Entities for {model}: ")
  print(ent_count)
  return

perform_ner("en_core_web_sm", text)
perform_ner("en_core_web_lg", text)
perform_ner("en_core_web_trf", text)

Named Entities for en_core_web_sm: 
Counter({'ORG': 17, 'GPE': 16, 'NORP': 10, 'DATE': 8, 'CARDINAL': 8, 'MONEY': 6, 'PERSON': 6, 'FAC': 2, 'ORDINAL': 2, 'LOC': 1})
Named Entities for en_core_web_lg: 
Counter({'ORG': 18, 'GPE': 16, 'PERSON': 9, 'DATE': 8, 'CARDINAL': 8, 'MONEY': 6, 'WORK_OF_ART': 6, 'NORP': 3, 'ORDINAL': 2, 'FAC': 1, 'LOC': 1})
Named Entities for en_core_web_trf: 
Counter({'ORG': 16, 'GPE': 16, 'PERSON': 10, 'DATE': 8, 'MONEY': 7, 'WORK_OF_ART': 7, 'CARDINAL': 7, 'NORP': 3, 'ORDINAL': 2, 'EVENT': 1, 'LOC': 1})
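
The three counters printed above are easier to compare side by side. The following is a minimal sketch that copies the counts from the outputs into a dictionary (the short keys `sm`, `lg` and `trf` are just labels chosen for this example) and lists the entity types on which the models disagree.

```python
from collections import Counter

# Entity counts copied from the three outputs printed above.
counts = {
    "sm": Counter({'ORG': 17, 'GPE': 16, 'NORP': 10, 'DATE': 8, 'CARDINAL': 8,
                   'MONEY': 6, 'PERSON': 6, 'FAC': 2, 'ORDINAL': 2, 'LOC': 1}),
    "lg": Counter({'ORG': 18, 'GPE': 16, 'PERSON': 9, 'DATE': 8, 'CARDINAL': 8,
                   'MONEY': 6, 'WORK_OF_ART': 6, 'NORP': 3, 'ORDINAL': 2,
                   'FAC': 1, 'LOC': 1}),
    "trf": Counter({'ORG': 16, 'GPE': 16, 'PERSON': 10, 'DATE': 8, 'MONEY': 7,
                    'WORK_OF_ART': 7, 'CARDINAL': 7, 'NORP': 3, 'ORDINAL': 2,
                    'EVENT': 1, 'LOC': 1}),
}

# Every label that at least one model produced, in alphabetical order.
labels = sorted(set().union(*counts.values()))

# Print a small table: one row per label, one column per model.
print(f"{'label':<12}" + "".join(f"{name:>5}" for name in counts))
for label in labels:
    # A Counter returns 0 for a label that a model never predicted.
    row = "".join(f"{counts[name][label]:>5}" for name in counts)
    print(f"{label:<12}{row}")

# Labels whose counts are not identical across the three models.
disagreements = [
    label for label in labels
    if len({model_counts[label] for model_counts in counts.values()}) > 1
]
print("Counts differ for:", disagreements)
```

Equal counts do not guarantee that the models tagged the same spans, so the table is only a starting point for the qualitative comparison.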


You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [None]:
# You can also visualize the detected named entities
from spacy import displacy
# displacy.render(doc, style="ent", jupyter=True)

def perform_ner_and_display(model, text):
  nlp = spacy.load(model)
  doc = nlp(text)

  displacy.render(doc, style="ent", jupyter=True)

  ent_count = Counter([ent.label_ for ent in doc.ents])

  print(f"Named Entities for {model}: ")
  print(ent_count)
  return

perform_ner_and_display("en_core_web_sm", text)
perform_ner_and_display("en_core_web_lg", text)
perform_ner_and_display("en_core_web_trf", text)

Named Entities for en_core_web_sm: 
Counter({'ORG': 17, 'GPE': 16, 'NORP': 10, 'DATE': 8, 'CARDINAL': 8, 'MONEY': 6, 'PERSON': 6, 'FAC': 2, 'ORDINAL': 2, 'LOC': 1})


Named Entities for en_core_web_lg: 
Counter({'ORG': 18, 'GPE': 16, 'PERSON': 9, 'DATE': 8, 'CARDINAL': 8, 'MONEY': 6, 'WORK_OF_ART': 6, 'NORP': 3, 'ORDINAL': 2, 'FAC': 1, 'LOC': 1})




Named Entities for en_core_web_trf: 
Counter({'ORG': 16, 'GPE': 16, 'PERSON': 10, 'DATE': 8, 'MONEY': 7, 'WORK_OF_ART': 7, 'CARDINAL': 7, 'NORP': 3, 'ORDINAL': 2, 'EVENT': 1, 'LOC': 1})


👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

*Your NE performance analysis here*

using: en_core_web_sm
*   small size (12 MB)
*   identifies many different entity types, but not all categories, e.g. it doesn't recognize WORK_OF_ART entities ("Comedian")
*   identifies too many NORP entities (10, versus 3 for each of the other two models), which points to misclassifications
*   fast and light on resources



using: en_core_web_lg
*   large size (382 MB), includes word vectors (300 dimensions)
*   better at accurately recognizing entities, identifies WORK_OF_ART entities
*   better at PERSON entity recognition (9 versus 6 for the small model)
*   general-purpose, accurate enough, doesn't need a GPU



using: en_core_web_trf
*   transformer-based (436 MB)
*   highest accuracy, correctly identifies entities like WORK_OF_ART and EVENT
*   even more precise at recognizing PERSON entities (10)
*   most accurate results, but runs best on a GPU






👋 ⚒ Compare the analysis of the best performing spaCy model for NER on the article after it was preprocessed to the performance on the non-preprocessed article.

In [None]:
# Your code here

def perform_ner_and_display(model, text):
  nlp = spacy.load(model)
  doc = nlp(text)

  displacy.render(doc, style="ent", jupyter=True)

  ent_count = Counter([ent.label_ for ent in doc.ents])

  print(f"Named Entities for {model}: ")
  print(ent_count)
  return

preprocessed_text = " ".join(filtered_tokens)
perform_ner_and_display("en_core_web_trf", preprocessed_text)



Named Entities for en_core_web_trf: 
Counter({'ORG': 18, 'GPE': 13, 'PERSON': 12, 'CARDINAL': 10, 'DATE': 5, 'NORP': 3, 'ORDINAL': 2, 'LOC': 1})


Analysis (NER of the preprocessed text):
*   the model picked up more ORG entities (18 versus 16), e.g. "art world"
*   fewer GPE entities (13 versus 16), possibly because context words like "the", "in", etc. were removed and all tokens were lowercased
*   more PERSON and CARDINAL entities were recognized, perhaps because of the simplified text
*   all MONEY entities were lost, because currency symbols ($) and numbers were removed by the `isalpha()` filter
*   all WORK_OF_ART entities were lost, possibly because the quotation marks were removed
*   some DATE entities were lost, probably because numeric tokens were removed (e.g. "in 2019" loses both the stopword "in" and the year)
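
The loss of MONEY and DATE entities described above can be traced on a single phrase. This is a minimal sketch with a hand-written token list that only approximates how a tokenizer might split the text, and a two-word stopword set as a stand-in for NLTK's English list; both are assumptions made for illustration.

```python
# Hand-written tokens approximating a tokenizer's output for the phrase
# "sold for $6.24 million in 2019." (an assumption, not real NLTK output).
tokens = ["sold", "for", "$", "6.24", "million", "in", "2019", "."]

# Tiny stand-in stopword set; the notebook uses NLTK's English stopwords.
stop_words = {"for", "in"}

fate = {}   # what happened to each token
kept = []   # tokens that survive preprocessing

for tok in tokens:
    if not tok.isalpha():
        # "$", "6.24", "2019" and "." all fail str.isalpha()
        fate[tok] = "dropped: not alphabetic"
    elif tok.lower() in stop_words:
        fate[tok] = "dropped: stopword"
    else:
        fate[tok] = "kept"
        kept.append(tok.lower())

for tok, reason in fate.items():
    print(f"{tok!r:>10}  {reason}")

print("Text passed to the NER model:", " ".join(kept))
```

Only "sold million" remains, so the model no longer sees the currency symbol, the amount, or the year that made the MONEY and DATE spans recognizable.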




## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models and identify the supported languages, which are listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [None]:
# Remember that you first need to load the model by replacing
#"en_core_web_sm" with the name of your model
!python -m spacy download pl_core_news_sm

Collecting pl-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pl_core_news_sm-3.7.0/pl_core_news_sm-3.7.0-py3-none-any.whl (20.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 20.2/20.2 MB 39.1 MB/s eta 0:00:00
Installing collected packages: pl-core-news-sm
Successfully installed pl-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('pl_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


👋 ⚒ Perform NER on the selected article.

In [None]:
url = 'https://tvn24.pl/swiat/usa-floryda-ponad-700-ksiazek-usunietych-z-bibliotek-na-liscie-stephen-king-margaret-atwood-i-inni-st8177583'
pl_article = Article(url)
pl_article.download()
pl_article.parse()
pl_text = pl_article.text

print("Title: ", pl_article.title, "\n")
print("Text of article: \n", pl_article.text)

Title:  USA, Floryda. Ponad 700 książek usuniętych z bibliotek. Na liście Stephen King, Margaret Atwood i inni 

Text of article: 
 Departament edukacji amerykańskiego stanu Floryda opublikował listę ponad 700 książek, które zostały usunięte ze szkolnych bibliotek lub których wypożyczenia zakazano - podała w środę agencja AP. To efekt wprowadzonego w 2023 roku prawa, które pozwala rodzicom na ingerowanie w szkolny księgozbiór.

Na liście pozycji zakazanych znalazły się między innymi klasyki amerykańskiej literatury, takie jak "Przygody Tomka Sawyera" Marka Twaina, "Komu bije dzwon" Ernesta Hemingwaya czy "Pieśń Salomonowa" noblistki Toni Morrison.

Pojawiła się także pacyfistyczna powieść "Rzeźnia numer pięć" Kurta Vonneguta oraz należące do kanonu literatury dystopijnej "Rok 1984" George’a Orwella i "Opowieść podręcznej" Margaret Atwood. Jest klasyka, m.in. "Trudno o dobrego człowieka" Flannery O'Connor i graficzne wersje "Gry o tron". Są też tytuły dotyczące Holokaustu: Williama Styr

In [None]:
pl_nlp = spacy.load('pl_core_news_sm')
pl_doc = pl_nlp(pl_text)
pl_ent_count = Counter([ent.label_ for ent in pl_doc.ents])
displacy.render(pl_doc, style='ent', jupyter=True)

for ent in pl_doc.ents:
  print(f"{ent.text}: {ent.label_}")

print()
print(pl_ent_count)

amerykańskiego: placeName
Floryda: orgName
AP: orgName
2023 roku: date
amerykańskiej: placeName
Tomka Sawyera: persName
Marka Twaina: persName
Ernesta Hemingwaya: persName
Toni Morrison: persName
Kurta Vonneguta: persName
1984: date
Orwella: persName
Margaret Atwood: persName
Holokaustu: persName
Williama Styrona: persName
Zofii: persName
"Dziennik": orgName
Anny Frank: persName
Stephena Kinga: persName
" Sally Rooney: orgName
Florydzie: placeName
EPA

Książki: orgName
2022 roku: date
Florydy Ron DeSantis: orgName
Moms for Liberty: orgName
1069: date
Ron DeSantis: persName
lipcu 2023 roku: date
Florydy: placeName
Florydzie: placeName
sierpnia: date
USA: placeName
Penguin: placeName
Random House: persName
Publishers: placeName
Florydzie: geogName
ZOBACZ: placeName
TEŻ: orgName
Dawida Michała Anioła: persName
Florydy: placeName
Sydney Booker: persName
Florydy: placeName
AP: orgName
Florydzie: placeName
Florydy: placeName
Sydney Booker: persName
brytyjski: placeName
"Guardian": orgName
AP

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?

*Your NE performance analysis here*

*   the English models are more detailed when classifying entities; the Polish model has no labels such as MONEY, WORK_OF_ART, NORP or CARDINAL
*   the Polish model picked up a lot of names (26), which is good for identifying notable figures; however, although it correctly identifies fragments like "D." and "Vance" as persName, it fails to combine them into one name, and there were also some misclassifications (e.g. "Sally Rooney" classified as orgName, "Holokaustu" classified as persName)
*   it recognized organizations decently (e.g. "AP", "Guardian"), but also made some misclassifications (e.g. "TEŻ", "Floryda", "Sally Rooney", "LESSER", "Książki", etc.)
*   with placeName the model did decently but made some misclassifications (e.g. "Penguin", "Publishers")
*   it extracted most date references correctly, although "1984" from the book title "Rok 1984" was also labeled as a date
*   it differentiates poorly between placeName and geogName: it identified only one geogName ("Florydzie", which is categorized three times as placeName but once as geogName)
*   a big problem was that the scraped text also included picture descriptions etc., which may have led to some confusion (e.g. "ZOBACZ" and "TEŻ" labeled as entities)
*   Polish is morphologically more complex than English (e.g. declensions, conjugations), which makes NER harder
*   it does alright for general-purpose NER tasks but still lacks a lot of the English models' capabilities; it is a small model and less robust
*   more training and/or fine-tuning is needed; the model lacks the specific context for better results
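
Inconsistent labels such as the "Florydzie" case noted above can be found automatically rather than by reading the output. The sketch below uses a few (text, label) pairs copied from the Polish output; in the notebook the full list could presumably be built with `[(ent.text, ent.label_) for ent in pl_doc.ents]`.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the Polish NER output above.
entities = [
    ("Florydzie", "placeName"),
    ("Florydzie", "placeName"),
    ("Florydzie", "geogName"),
    ("Florydzie", "placeName"),
    ("Florydy", "placeName"),
    ("AP", "orgName"),
    ("Sydney Booker", "persName"),
]

# Collect the set of labels assigned to each surface form.
labels_by_text = defaultdict(set)
for text, label in entities:
    labels_by_text[text].add(label)

# Keep only the surface forms that received more than one label.
inconsistent = {
    text: sorted(labels)
    for text, labels in labels_by_text.items()
    if len(labels) > 1
}
print(inconsistent)
```

This only flags identical surface forms; inflected variants such as "Florydy" and "Florydzie" are treated as different strings, so comparing lemmas instead would be a possible refinement.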
