<a href="https://colab.research.google.com/github/polinauni/IntroToCL/blob/main/exercises/HomeExercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete every instruction marked with 👋 ⚒, either by writing code in the code cell that follows the sign or by providing your analysis in the text cell that follows it.

We will use the newspaper library to scrape the news article from a webpage.

In [2]:
!pip install newspaper3k
!pip install lxml_html_clean



In [3]:
import newspaper
from newspaper import Article


url = 'https://edition.cnn.com/2024/10/29/style/new-chopin-waltz-discovered-scli-intl/index.html'
article = Article(url)
article.download()
article.parse()

#This line displays the authors of the article
print("Authors: ", article.authors, "\n")

#This line displays the title and entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Jack Guy'] 

Title:  Lost Chopin music unearthed nearly 200 years after composer’s death 

Text of article: 
 CNN —

A curator at a museum in New York City has discovered a previously unknown waltz written by Frédéric Chopin, the first time that a new piece of work by the Polish composer has been found in nearly 100 years.

The waltz, written on a small manuscript measuring about 4 inches by 5 inches, was first discovered by curator Robinson McClellan in 2019, who then sought outside expert help, according to a statement from the Morgan Library & Museum on Monday.

“He found it peculiar that he could not think of any waltzes by Chopin that matched the measures on the page,” reads the statement.

“Chopin famously wrote in ‘small forms,’ but this work, lasting about one minute, is shorter than any other waltz by him,” adds the statement.

“It is nevertheless a complete piece, showing the kind of ‘tightness’ that we expect from a finished work by the composer.”

McClellan aske

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [4]:
# Calculate and print the number of unique words in the text

article_text = article.text
unique_words = set(article_text.split())
num_unique_words = len(unique_words)
print("Number of unique words:", num_unique_words)

Number of unique words: 369


## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation marks and numbers (Hint: `str.isalpha()`).
3. Lemmatize all words in the text.

In [5]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again

import spacy
import string

nlp = spacy.load("en_core_web_sm")

preprocessed = [word.lower() for word in article_text.split() if word.isalpha()]
preprocessed_article = " ".join(preprocessed)

doc = nlp(preprocessed_article)

lemmas = [token.lemma_ for token in doc]
preprocessed_article = " ".join(lemmas)

print("Preprocessed article: \n", preprocessed_article)

unique_words = set(preprocessed_article.split())
num_unique_words = len(unique_words)
print("Number of unique words:", num_unique_words)


Preprocessed article: 
 cnn a curator at a museum in new york city have discover a previously unknown waltz write by frédéric the first time that a new piece of work by the polish composer have be find in nearly the write on a small manuscript measure about inch by be first discover by curator robinson mcclellan in who then seek outside expert accord to a statement from the morgan library museum on find it peculiar that he could not think of any waltz by chopin that match the measure on the read the famously write in but this last about one be short than any other waltz by add the be nevertheless a complete show the kind of that we expect from a finished work by the mcclellan ask chopin expert jeffrey associate dean for art and letter at the university of to help authenticate the research point to the strong likelihood that the piece be by accord to the this research include analysis by paper conservator who find that the paper and ink match those that chopin normally this date the man
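
One thing to note about the cell above: filtering whitespace-split tokens with `str.isalpha()` discards any word that has punctuation attached (for example `Chopin,`), which is why several names are missing from the preprocessed text. The following is a minimal sketch of an alternative, assuming a simple regular-expression tokenizer that is not part of the tutorial material. It strips punctuation and digits but keeps the words themselves. The sample sentence is made up for illustration.

```python
import re

sample = "A waltz by Chopin, found in 2019."

# Approach used above: a token with attached punctuation fails isalpha()
# and is discarded entirely.
dropped = [w.lower() for w in sample.split() if w.isalpha()]

# Alternative sketch: extract runs of letters, so punctuation and digits
# are removed while the words themselves are kept.
kept = re.findall(r"[^\W\d_]+", sample.lower())

print(dropped)
print(kept)
```

With this pattern, possessives and contractions are split into separate pieces, so a lemmatizer applied afterwards may behave differently than on the original text.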

## **NER**

In the tutorial, we used only one of the models available in spaCy. In this exercise, you will compare the performance of models of different sizes and implementations. The available model types are described in the [spaCy documentation](https://spacy.io/models/en). First, the models need to be installed. We will use the following three models.

In [6]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
✔ Download and installation succe

👋 ⚒ Use each of the three models downloaded above to perform named entity recognition on the original (non-preprocessed) article, one after another. You can use a separate code cell for each model or write everything in one cell, as you prefer. For each model output, automatically calculate the number of named entities for each entity type that the model identifies.

In [18]:
import spacy
from spacy import displacy


def ner_counts(model, text):
    """Print the number of entities found for each entity label."""
    doc = model(text)
    counts = {}  # named differently from the function to avoid confusion
    for ent in doc.ents:
        counts[ent.label_] = counts.get(ent.label_, 0) + 1
    for label, count in counts.items():
        print(f"{label}: {count}")



nlp_sm = spacy.load("en_core_web_sm")
ner_counts(nlp_sm, article_text)
doc = nlp_sm(article_text)
displacy.render(doc, style="ent", jupyter=True)

ORG: 19
GPE: 7
ORDINAL: 3
NORP: 6
DATE: 11
QUANTITY: 2
PERSON: 11
TIME: 1
CARDINAL: 1
LOC: 1


In [17]:
nlp_lg = spacy.load("en_core_web_lg")
ner_counts(nlp_lg, article_text)
doc = nlp_lg(article_text)
displacy.render(doc, style="ent", jupyter=True)

ORG: 7
GPE: 7
PERSON: 23
ORDINAL: 3
NORP: 6
DATE: 10
QUANTITY: 1
TIME: 1
CARDINAL: 1
LOC: 1


In [13]:
nlp_trf = spacy.load("en_core_web_trf")
ner_counts(nlp_trf, article_text)
doc = nlp_trf(article_text)
displacy.render(doc, style="ent", jupyter=True)



ORG: 8
GPE: 7
PERSON: 23
ORDINAL: 2
NORP: 6
DATE: 12
QUANTITY: 1
TIME: 1
LOC: 1
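
The tallying loop in `ner_counts` can also be written with `collections.Counter`, returning the counts instead of printing them so that they can be reused later. The sketch below is one possible variant, not the tutorial's method. The `FakeEnt` tuples are hypothetical stand-ins for spaCy entity spans so that the example runs without a downloaded model. With a real pipeline, `doc.ents` would be passed instead.

```python
from collections import Counter, namedtuple


def count_entity_labels(ents):
    """Return a Counter mapping each entity label to its frequency."""
    return Counter(ent.label_ for ent in ents)


# Stand-in for spaCy entity spans (hypothetical, for illustration only).
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
ents = [
    FakeEnt("Chopin", "PERSON"),
    FakeEnt("New York City", "GPE"),
    FakeEnt("Robinson McClellan", "PERSON"),
]

counts = count_entity_labels(ents)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```

A `Counter` returns 0 for labels that never occurred, which is convenient when one model produces a label that another does not.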


You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [None]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

Output for the sm model:
ORG: 19
GPE: 7
ORDINAL: 3
NORP: 6
DATE: 11
QUANTITY: 2
PERSON: 11
TIME: 1
CARDINAL: 1
LOC: 1

Output for the lg model:
ORG: 7
GPE: 7
PERSON: 23
ORDINAL: 3
NORP: 6
DATE: 10
QUANTITY: 1
TIME: 1
CARDINAL: 1
LOC: 1

Output for the trf model:
ORG: 8
GPE: 7
PERSON: 23
ORDINAL: 2
NORP: 6
DATE: 12
QUANTITY: 1
TIME: 1
LOC: 1

Analysis of the models according to the following criteria:

- Detection of organizational entities: the sm model performed worst in this category. It labeled "Chopin" as an ORG entity, which explains why it reports by far the highest ORG count (19, compared with 7 for lg and 8 for trf).

- Detection of people: the trf and lg models each recognized 23 PERSON entities, whereas the sm model recognized only 11, because it assigned the remaining ones to the wrong category.

- Other categories (ordinal, NORP, date, quantity, time, and location): the counts in these categories are almost identical across the models. Nevertheless, there are minor differences, such as two additional date entities recognized by the trf model: “to the 1830s” and “the age of just 39.” I believe these can reasonably be considered date entities in the context of this article.

Based on the analysis above, I conclude that the trf model performed best on this article.
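
To make the comparison easier to read, the counts reported above can be placed side by side. This is a small sketch using only the numbers from the three outputs. The table layout and variable names are illustrative choices.

```python
# Counts copied from the three model outputs reported above.
results = {
    "sm": {"ORG": 19, "GPE": 7, "ORDINAL": 3, "NORP": 6, "DATE": 11,
           "QUANTITY": 2, "PERSON": 11, "TIME": 1, "CARDINAL": 1, "LOC": 1},
    "lg": {"ORG": 7, "GPE": 7, "PERSON": 23, "ORDINAL": 3, "NORP": 6,
           "DATE": 10, "QUANTITY": 1, "TIME": 1, "CARDINAL": 1, "LOC": 1},
    "trf": {"ORG": 8, "GPE": 7, "PERSON": 23, "ORDINAL": 2, "NORP": 6,
            "DATE": 12, "QUANTITY": 1, "TIME": 1, "LOC": 1},
}

# Union of all labels, so a label missing from one model shows up as 0.
labels = sorted({label for counts in results.values() for label in counts})

print(f"{'label':<10}" + "".join(f"{name:>5}" for name in results))
rows = {}
for label in labels:
    rows[label] = [results[name].get(label, 0) for name in results]
    print(f"{label:<10}" + "".join(f"{n:>5}" for n in rows[label]))
```

Laid out this way, the ORG and PERSON rows show the largest differences between the sm model and the other two.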


👋 ⚒ Compare the NER performance of the best-performing spaCy model on the preprocessed article with its performance on the non-preprocessed article.

In [22]:
# Your code here

ner_counts(nlp_sm, preprocessed_article)
ner_counts(nlp_lg, preprocessed_article)
ner_counts(nlp_trf, preprocessed_article)


ORG: 4
GPE: 1
ORDINAL: 3
NORP: 5
PERSON: 3
CARDINAL: 1
ORG: 6
GPE: 1
ORDINAL: 3
NORP: 3
PERSON: 7
CARDINAL: 1
ORG: 5
GPE: 1
PERSON: 16
ORDINAL: 3
NORP: 3
QUANTITY: 1
CARDINAL: 2


## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy page](https://spacy.io/models) that details the available models. The supported languages are listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [23]:
# Remember that you first need to load the model by replacing
#"en_core_web_sm" with the name of your model

!python -m spacy download ru_core_news_sm

Collecting ru-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/ru_core_news_sm-3.7.0/ru_core_news_sm-3.7.0-py3-none-any.whl (15.3 MB)
Collecting pymorphy3>=1.0.0 (from ru-core-news-sm==3.7.0)
  Downloading pymorphy3-2.0.2-py3-none-any.whl.metadata (1.8 kB)
Collecting dawg-python>=0.7.1 (from pymorphy3>=1.0.0->ru-core-news-sm==3.7.0)
  Downloading DAWG_Python-0.7.2-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting pymorphy3-dicts-ru (from pymorphy3>=1.0.0->ru-core-news-sm==3.7.0)
  Downloading pymorphy3_dicts_ru-2.4.417150.4580142-py2.py3-none-any.whl.metadata (2.0 kB)
Downloading pymorphy3-2.0.2-py3-none-any.whl (53 kB)
Downloading DAWG_Python-0.7.2-py2.py3-none-any.whl (11 kB)
Downloading pymorphy3_d

In [24]:
url = 'https://www.bbc.com/russian/articles/cvg4pl3nl64o'
article_ru = Article(url)
article_ru.download()
article_ru.parse()

print("Authors: ", article_ru.authors, "\n")

print("Title: ", article_ru.title, "\n")
print("Text of article: \n", article_ru.text)

Authors:  ['Https', 'Www.Facebook.Com Bbcnews'] 

Title:  Наводнения в Валенсии: в район бедствия направлены 10 тыс. военных и полицейских, испанцы критикуют власти за медленную реакцию 

Text of article: 
 Наводнения в Валенсии: в район бедствия направлены 10 тыс. военных и полицейских, испанцы критикуют власти за медленную реакцию

Автор фото, Reuters

2 ноября 2024

Испанские власти объявили, что на ликвидацию последствий разрушительного наводнения в Валенсии направлены еще 10 тыс. военных и полицейских. Число погибших в результате стихийного бедствия уже превысило 200 человек, и многие местные жители обвиняют власти в том, что они своевременно не предупредили их об угрозе и крайне медленно отреагировали на происходящее.

Премьер-министр Испании Педро Санчес заявил, что в пострадавшие регионы отправятся 5 тыс. военнослужащих и столько же сотрудников полиции. Таким образом общее число представителей силовых структур, принимающих участие в ликвидации последствий наводнения, составит 1

👋 ⚒ Perform NER on the selected article.

In [25]:
# Keep the Article object and its text under separate names
article_ru_text = article_ru.text
nlp_ru = spacy.load("ru_core_news_sm")
doc = nlp_ru(article_ru_text)

# Use a new name so the ner_counts() function defined earlier is not overwritten
ner_counts_ru = {}
for ent in doc.ents:
    ner_counts_ru[ent.label_] = ner_counts_ru.get(ent.label_, 0) + 1

for label, count in ner_counts_ru.items():
    print(f"{label}: {count}")

LOC: 24
ORG: 12
PER: 10


In [26]:
doc = nlp_ru(article_ru_text)
displacy.render(doc, style="ent", jupyter=True)

👋 ⚒ How well did the NER in the language of your choice work as compared to the overall performance of NER with spaCy in English?


I chose the "ru_core_news_sm" model for the article in Russian.

The following entities were recognized:
LOC: 24
ORG: 12
PER: 10

Compared with the best-performing English model (trf), many entities were not labeled, including dates (e.g., "2 ноября 2024" (Nov 2, 2024)), days of the week (e.g., "пятницу" (Friday), "субботу" (Saturday)), numbers (e.g., "5 тыс." (5k), "17,5 тыс." (17.5k)), and GPEs (e.g., "северо-восточных и южных районах" (northeastern and southern regions)). Note that the output contains only three labels (LOC, ORG, and PER), which suggests that the Russian model uses a smaller label set than the English models, so dates and numbers may fall outside what it can label at all.

Overall, the Russian model's results are noticeably weaker than those of the English models.
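
The label inventories of the two models differ: the Russian output above contains only LOC, ORG, and PER, while the English models also report DATE, CARDINAL, and others. The sketch below aligns the two inventories before comparing. The mapping `EN_TO_COARSE` is an assumption made for illustration, not an official spaCy mapping. Because the two articles are different, the counts themselves are not directly comparable. The sketch only shows which categories both models can produce.

```python
# Assumed coarse mapping from the English labels to the PER/LOC/ORG labels
# printed by the Russian model (illustrative, not an official mapping).
EN_TO_COARSE = {"PERSON": "PER", "GPE": "LOC", "LOC": "LOC", "ORG": "ORG"}

# Counts copied from the notebook outputs above.
en_trf = {"ORG": 8, "GPE": 7, "PERSON": 23, "ORDINAL": 2, "NORP": 6,
          "DATE": 12, "QUANTITY": 1, "TIME": 1, "LOC": 1}
ru_sm = {"LOC": 24, "ORG": 12, "PER": 10}

en_coarse = {}
for label, count in en_trf.items():
    coarse = EN_TO_COARSE.get(label)
    if coarse is not None:  # labels such as DATE have no counterpart
        en_coarse[coarse] = en_coarse.get(coarse, 0) + count

for label in sorted(ru_sm):
    print(f"{label}: en_trf={en_coarse.get(label, 0)}, ru_sm={ru_sm[label]}")
```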
