<a href="https://colab.research.google.com/github/Saputoa21/ADS_2024_Saputoa/blob/master/exercises/HomeExercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete every instruction marked with 👋 ⚒, either by writing code in the code cell that follows the sign or by providing your analysis in the text cell that follows it.

We will use the newspaper library to facilitate the scraping of the news article from a webpage.

In [5]:
!pip install newspaper3k



In [11]:
!pip install lxml[html_clean]
import newspaper
from newspaper import Article

url = 'https://edition.cnn.com/2024/10/25/style/banana-artwork-maurizio-cattelan-comedian-auction/index.html'
article = Article(url)
article.download()
article.parse()

# Display the authors of the article
print("Authors: ", article.authors, "\n")

# Display the title and the entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Collecting lxml-html-clean (from lxml[html_clean])
  Downloading lxml_html_clean-0.3.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.3.1-py3-none-any.whl (13 kB)
Installing collected packages: lxml-html-clean
Successfully installed lxml-html-clean-0.3.1
Authors:  ['Oscar Holland'] 

Title:  Maurizio Cattelan’s viral banana artwork ‘Comedian’ could now be worth $1.5 million 

Text of article: 
 CNN —

When a banana duct-taped to a wall sold for $120,000 in 2019, social media uproar and an age-old debate about the meaning of art ensued.

But artist Maurizio Cattelan’s viral creation, titled “Comedian,” may yet prove a sound investment: On Friday, auction house Sotheby’s announced that one of the artwork’s three “editions” is going back on sale — this time with an estimate of $1 million to $1.5 million.

For their money, the winning bidder will receive a roll of duct tape and one banana, as well as a certificate of authenticity and official instructions for installing t

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [26]:
# Calculate and print the number of unique words in the text
import spacy

nlp = spacy.load("en_core_web_sm")

text = article.text

# Tokenize the article with spaCy
analysing_text = nlp(text)

# Collect the text of every token (punctuation and whitespace tokens included)
tokens = [token.text for token in analysing_text]

# A set keeps each distinct token text only once
unique_tokens = set(tokens)

print(f'Number of all tokens: {len(tokens)}')
print(f'Number of all unique tokens: {len(unique_tokens)}')

Number of all tokens: 852
Number of all unique tokens: 370


## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation marks and numbers (Hint: `string.isalpha()`).
3. Lemmatize all words in the text.

In [27]:
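# Lowercase the text and keep only purely alphabetic words (steps 1 and 2), then
# count the unique words again. Lemmatization (step 3) is not applied here, and
# words with attached punctuation (e.g. "wall,") are dropped by isalpha().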
text = article.text

words = text.lower().split()

unique_words = set()

for word in words:
    if word.isalpha():
        unique_words.add(word)

print(f'Number of unique words: {len(unique_words)}')

Number of unique words: 279
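
The cell above lowercases the text and filters with `isalpha()`, but it does not lemmatize. A minimal sketch of all three steps follows, assuming tokens that expose spaCy's `is_alpha` and `lemma_` attributes. The helper name `unique_lemmas` and the stand-in tokens are illustrative assumptions (not real model output), chosen so that the sketch runs without a downloaded model.

```python
from types import SimpleNamespace


def unique_lemmas(doc):
    """Return the set of lowercased lemmas of all alphabetic tokens in `doc`.

    `doc` is any iterable of tokens exposing `is_alpha` and `lemma_`,
    for example a spaCy Doc produced by nlp(article.text).
    """
    return {token.lemma_.lower() for token in doc if token.is_alpha}


# Hypothetical stand-in tokens (not real spaCy output), used only so the
# sketch runs without a model. With spaCy you would instead call:
#   unique_lemmas(spacy.load("en_core_web_sm")(article.text))
demo_doc = [
    SimpleNamespace(text="Bananas", lemma_="banana", is_alpha=True),
    SimpleNamespace(text="banana", lemma_="banana", is_alpha=True),
    SimpleNamespace(text="sold", lemma_="sell", is_alpha=True),
    SimpleNamespace(text="$", lemma_="$", is_alpha=False),
    SimpleNamespace(text="120,000", lemma_="120,000", is_alpha=False),
]

lemmas = unique_lemmas(demo_doc)
print(f"Number of unique lemmas: {len(lemmas)}")
```

Working on spaCy tokens rather than on `text.split()` also keeps words that are directly followed by punctuation, because the tokenizer separates the punctuation into its own token.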


## **NER**

In the tutorial, we used only one of the models available in spaCy. In this exercise, you will compare the NER performance of models of different sizes and implementations. The available models are described in the [spaCy documentation](https://spacy.io/models/en). First, the models need to be installed. We will use the following three models.

In [28]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 31.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 1.2 MB/s eta 0:00:00
Installing collected packages: en-core-web

👋 ⚒ Use each of the three models downloaded above to perform named entity recognition on the original, non-preprocessed article, one model after another. You can use separate code cells for the different models or write everything into one cell, as you prefer. For each model's output, automatically calculate the number of named entities of each entity type that the model identifies.

In [None]:
import spacy
nlp_sm = spacy.load("en_core_web_sm")
# Your code here
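
One possible way to count entities per type is `collections.Counter` over the `label_` attribute of each entry in `doc.ents`. The sketch below is a hedged illustration: the helper name `count_entity_labels` is an assumption, and the stand-in document with its labels is hypothetical rather than real model output, so that the code runs without a downloaded model.

```python
from collections import Counter
from types import SimpleNamespace


def count_entity_labels(doc):
    """Count how many entities of each label a processed document holds."""
    return Counter(ent.label_ for ent in doc.ents)


# Hypothetical stand-in for a processed spaCy Doc; these labels are
# illustrative and were NOT produced by any of the three models.
demo_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Maurizio Cattelan", label_="PERSON"),
    SimpleNamespace(text="Sotheby's", label_="ORG"),
    SimpleNamespace(text="$1.5 million", label_="MONEY"),
    SimpleNamespace(text="Oscar Holland", label_="PERSON"),
])

label_counts = count_entity_labels(demo_doc)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")

# With the downloaded models, the same helper could be applied in a loop:
#   for name in ("en_core_web_sm", "en_core_web_lg", "en_core_web_trf"):
#       print(name, count_entity_labels(spacy.load(name)(article.text)))
```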

You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [None]:
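# You can also visualize the detected named entities.
# `doc` must be a processed document, e.g. the output of nlp_sm on the article.
from spacy import displacy
doc = nlp_sm(article.text)
displacy.render(doc, style="ent", jupyter=True)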

👋 ⚒ Add your analysis of the differences between the three models in the next text field.

*Your NE performance analysis here*

👋 ⚒ Run the best-performing spaCy model for NER on the preprocessed article and compare its results with its results on the non-preprocessed article.

In [None]:
# Your code here
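
To compare two NER runs, it can help to reduce each run to a set of `(text, label)` pairs and look at what the runs share and what is exclusive to each. The following is a minimal sketch under stated assumptions: the helper names `normalize` and `compare_entities` and both entity lists are hypothetical, and entity text is lowercased before comparison because the preprocessed article is lowercased.

```python
def normalize(entities):
    """Lowercase entity text so raw and preprocessed runs are comparable."""
    return {(text.lower(), label) for text, label in entities}


def compare_entities(raw_entities, preprocessed_entities):
    """Split two entity collections into shared and run-exclusive sets."""
    raw = normalize(raw_entities)
    pre = normalize(preprocessed_entities)
    return {
        "shared": raw & pre,
        "only_raw": raw - pre,
        "only_preprocessed": pre - raw,
    }


# Hypothetical entity lists (not real model output). With spaCy, each list
# could be built as: [(ent.text, ent.label_) for ent in doc.ents]
raw_entities = [
    ("Maurizio Cattelan", "PERSON"),
    ("Sotheby's", "ORG"),
    ("Friday", "DATE"),
]
preprocessed_entities = [
    ("friday", "DATE"),
    ("maurizio cattelan", "ORG"),
]

result = compare_entities(raw_entities, preprocessed_entities)
for name, entities in result.items():
    print(f"{name}: {sorted(entities)}")
```

Entities that appear only in the raw run are candidates for information lost through lowercasing or punctuation removal, while label changes show up as the same text in both exclusive sets.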

## **Multilingual NER**
In this exercise, you will compare the NER performance of spaCy in English with its performance in another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) detailing the available models and identify the supported languages, which are listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [None]:
# Remember that you first need to download the model: replace
# "en_core_web_sm" below with the name of your model
!python -m spacy download en_core_web_sm
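
As a hedged illustration of choosing a pipeline by language, the sketch below maps a few language codes to small spaCy pipelines. The mapping is an illustrative subset and the helper name `model_for` is an assumption; check the spaCy models page for the pipeline you actually want. The commented lines assume that newspaper's `Article` accepts a `language` argument and use a placeholder `url`.

```python
# Illustrative subset of spaCy's trained pipelines, keyed by language code
MODEL_BY_LANGUAGE = {
    "en": "en_core_web_sm",
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
}


def model_for(language_code):
    """Return the pipeline name for a language code listed in the mapping."""
    try:
        return MODEL_BY_LANGUAGE[language_code]
    except KeyError:
        raise ValueError(f"No pipeline listed for {language_code!r}") from None


model_name = model_for("de")
print(model_name)

# Possible usage in the notebook (requires network access and the download):
#   !python -m spacy download de_core_news_sm
#   article = Article(url, language="de")   # `url` is a placeholder
#   article.download(); article.parse()
#   doc = spacy.load(model_name)(article.text)
```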

👋 ⚒ Perform NER on the selected article.

👋 ⚒ How well did NER work in the language of your choice compared to the overall performance of NER with spaCy in English?

*Your NE performance analysis here*