<a href="https://colab.research.google.com/github/polinauni/IntroToCL/blob/main/exercises/HomeExercise1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Home Exercise 1: Preprocessing and NER
In this first home exercise, you will use the knowledge from Tutorial 1 and Tutorial 2 to perform some preprocessing and NLP steps on a news article of your choice. An example article in English is provided in this notebook.

In this notebook, please complete every instruction marked with 👋 ⚒: write your code in the code cell that follows the marker, or provide your analysis in the text cell that follows it.

We will use the newspaper library to scrape the news article from a web page.

In [3]:
!pip install newspaper3k
!pip install lxml_html_clean

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.3.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.3.1-py3-none-any.whl (13 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.3.1


In [4]:
import newspaper
from newspaper import Article


url = 'https://edition.cnn.com/2024/10/29/style/new-chopin-waltz-discovered-scli-intl/index.html'
article = Article(url)
article.download()
article.parse()

# This line displays the authors of the article
print("Authors: ", article.authors, "\n")

# These lines display the title and the entire text of the article
print("Title: ", article.title, "\n")
print("Text of article: \n", article.text)

Authors:  ['Jack Guy'] 

Title:  Lost Chopin music unearthed nearly 200 years after composer’s death 

Text of article: 
 CNN —

A curator at a museum in New York City has discovered a previously unknown waltz written by Frédéric Chopin, the first time that a new piece of work by the Polish composer has been found in nearly 100 years.

The waltz, written on a small manuscript measuring about 4 inches by 5 inches, was first discovered by curator Robinson McClellan in 2019, who then sought outside expert help, according to a statement from the Morgan Library & Museum on Monday.

“He found it peculiar that he could not think of any waltzes by Chopin that matched the measures on the page,” reads the statement.

“Chopin famously wrote in ‘small forms,’ but this work, lasting about one minute, is shorter than any other waltz by him,” adds the statement.

“It is nevertheless a complete piece, showing the kind of ‘tightness’ that we expect from a finished work by the composer.”

McClellan aske

👋 ⚒ Use the above article or a news article of your choice and print the number of unique words in the text.

In [5]:
# Calculate and print the number of unique words in the text

article_text = article.text
unique_words = set(article_text.split())
num_unique_words = len(unique_words)
print("Number of unique words:", num_unique_words)

Number of unique words: 369


## **Preprocessing**

👋 ⚒ Now perform the following preprocessing steps and see how the number of unique words changes:

1. Lowercase all words in the text.
2. Remove punctuation markers and numbers (Hint: `str.isalpha()`).
3. Lemmatize all words in the text.

In [12]:
# Preprocess the text with all three steps and then calculate the number of
# unique words in the text again

import spacy
import string

nlp = spacy.load("en_core_web_sm")

preprocessed = [word.lower() for word in article_text.split() if word.isalpha()]
preprocessed_article = " ".join(preprocessed)

doc = nlp(preprocessed_article)

lemmas = [token.lemma_ for token in doc]
preprocessed_article = " ".join(lemmas)

print("Preprocessed article: \n", preprocessed_article)

unique_words = set(preprocessed_article.split())
num_unique_words = len(unique_words)
print("Number of unique words:", num_unique_words)


Preprocessed article: 
 cnn a curator at a museum in new york city have discover a previously unknown waltz write by frédéric the first time that a new piece of work by the polish composer have be find in nearly the write on a small manuscript measure about inch by be first discover by curator robinson mcclellan in who then seek outside expert accord to a statement from the morgan library museum on find it peculiar that he could not think of any waltz by chopin that match the measure on the read the famously write in but this last about one be short than any other waltz by add the be nevertheless a complete show the kind of that we expect from a finished work by the mcclellan ask chopin expert jeffrey associate dean for art and letter at the university of to help authenticate the research point to the strong likelihood that the piece be by accord to the this research include analysis by paper conservator who find that the paper and ink match those that chopin normally this date the man
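In the preprocessed output above, "frédéric" is followed directly by "the first time": the word "Chopin," was dropped entirely, because `str.isalpha()` returns `False` for any whitespace-separated token that still has punctuation attached. The following is a minimal sketch of one possible alternative, assuming that extracting runs of letters with a regular expression is acceptable for this exercise. The helper name `alpha_tokens` and the sample sentence are illustrative and not part of the original notebook.

```python
import re


def alpha_tokens(text):
    """Lowercase the text and return every maximal run of letters.

    The pattern [^\\W\\d_]+ matches word characters that are neither
    digits nor underscores, so accented letters such as "é" are kept.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


# Illustrative sentence (not taken verbatim from the article)
sample = "A waltz by Frédéric Chopin, found in 2019."

# Approach used in the cell above: split on whitespace, then filter
whitespace_filtered = [w.lower() for w in sample.split() if w.isalpha()]

# Alternative: strip punctuation and digits instead of dropping whole tokens
regex_tokens = alpha_tokens(sample)

print(whitespace_filtered)
print(regex_tokens)
```

The resulting tokens can be joined and passed to the spaCy pipeline for lemmatization as before. Note that this sketch also splits words at apostrophes, so "composer’s" becomes "composer" and "s"; whether that is desirable depends on how you want to count unique words.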

## **NER**

In the tutorial, we used only one of the models available in spaCy. In this exercise, you will compare the performance of models of different sizes and implementations. The available models are described in the [spaCy documentation](https://spacy.io/models/en). First, the models need to be installed. We will use the following three models.

In [13]:
!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_lg
!python -m spacy download en_core_web_trf

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 78.4 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting en-core-web-lg==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 841.2 kB/s eta 0:00:00
Installing collected packages: en-core-w

👋 ⚒ Use each of the three models downloaded above to perform named entity recognition on the original, non-preprocessed article, one model after another. You can use separate code cells for the different models or write everything in one cell, as you prefer. For each model's output, automatically count the named entities of each entity type that the model identifies.

In [14]:
import spacy
from collections import Counter


def print_ner_counts(model_name, text):
    """Load a spaCy model, run NER on the text, and print the count per entity label."""
    nlp = spacy.load(model_name)
    doc = nlp(text)
    # Counter keeps labels in the order in which they first appear
    ner_counts = Counter(ent.label_ for ent in doc.ents)
    for label, count in ner_counts.items():
        print(f"{label}: {count}")
    return doc


# Run the three models one after another: small, large, transformer
doc = print_ner_counts("en_core_web_sm", article_text)
doc = print_ner_counts("en_core_web_lg", article_text)
doc = print_ner_counts("en_core_web_trf", article_text)


ORG: 19
GPE: 7
ORDINAL: 3
NORP: 6
DATE: 11
QUANTITY: 2
PERSON: 11
TIME: 1
CARDINAL: 1
LOC: 1
ORG: 7
GPE: 7
PERSON: 23
ORDINAL: 3
NORP: 6
DATE: 10
QUANTITY: 1
TIME: 1
CARDINAL: 1
LOC: 1


ValueError: [E002] Can't find factory for 'curated_transformer' for language English (en). This usually happens when spaCy calls `nlp.create_pipe` with a custom component name that's not registered on the current language class. If you're using a custom component, make sure you've added the decorator `@Language.component` (for function components) or `@Language.factory` (for class components).

Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, entity_ruler, tagger, morphologizer, ner, beam_ner, senter, sentencizer, spancat, spancat_singlelabel, span_finder, future_entity_ruler, span_ruler, textcat, textcat_multilabel, en.lemmatizer
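The error above means that the `curated_transformer` component was not registered when `en_core_web_trf` was loaded. A likely cause, stated here as an assumption rather than a confirmed diagnosis, is that the `spacy-curated-transformers` package is not importable in the current session, either because it is not installed or because the runtime was not restarted after the download (see the "Restart to reload dependencies" warning in the download output). The sketch below only checks whether the module can be found; the helper name `missing_packages` is illustrative.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Assumption: the transformer pipeline needs this module to register
# the 'curated_transformer' factory.
required = ["spacy_curated_transformers"]

missing = missing_packages(required)
if missing:
    print("Not importable:", missing)
    print("Try: pip install spacy-curated-transformers, then restart the runtime.")
else:
    print("All required modules were found; try restarting the runtime and reloading the model.")
```

If the module is reported as found and the error persists after a restart, the installed spaCy and model versions may not match; checking them with `python -m spacy validate` is a reasonable next step.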

You can use the following function to visualize the named entities in the text in order to facilitate the analysis.

In [None]:
# You can also visualize the detected named entities
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)
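To support the comparison, the per-label counts printed above can be placed side by side. The sketch below copies the numbers from the output, assuming that the first block of counts belongs to `en_core_web_sm` and the second to `en_core_web_lg` (the order in which the models were run); the transformer model produced no counts because of the error. The helper name `count_differences` is illustrative.

```python
# Per-label counts copied from the output above
# (assumed order: first en_core_web_sm, then en_core_web_lg).
sm_counts = {"ORG": 19, "GPE": 7, "ORDINAL": 3, "NORP": 6, "DATE": 11,
             "QUANTITY": 2, "PERSON": 11, "TIME": 1, "CARDINAL": 1, "LOC": 1}
lg_counts = {"ORG": 7, "GPE": 7, "PERSON": 23, "ORDINAL": 3, "NORP": 6,
             "DATE": 10, "QUANTITY": 1, "TIME": 1, "CARDINAL": 1, "LOC": 1}


def count_differences(a, b):
    """Return {label: (count_a, count_b)} for every label whose counts differ."""
    labels = sorted(set(a) | set(b))
    return {label: (a.get(label, 0), b.get(label, 0))
            for label in labels
            if a.get(label, 0) != b.get(label, 0)}


diffs = count_differences(sm_counts, lg_counts)
for label, (sm, lg) in diffs.items():
    print(f"{label}: sm={sm}, lg={lg}")
```

Counts alone do not show which model is correct. The large shift between `ORG` and `PERSON` may indicate that the two models label the same names differently, but this should be verified against the displaCy visualization before drawing conclusions.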

👋 ⚒ Add your text of the analysis of differences between the three different models right here in the next text field.

*Your NE performance analysis here*

👋 ⚒ Run the best-performing spaCy NER model on the preprocessed article and compare the result with its performance on the non-preprocessed article.

In [None]:
# Your code here
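One possible way to structure this comparison is to collect the entities found in each version of the article as sets and look at what appears in only one of them. The sketch below is written against any object that exposes spaCy-style `.ents` with `.text` and `.label_`, so it can be tried without loading a model; the stand-in documents and the helper names `entity_set`, `compare_entities`, and `fake_doc` are hypothetical and only illustrate the idea.

```python
from types import SimpleNamespace


def entity_set(doc):
    """Collect (lowercased text, label) pairs from an object with spaCy-style .ents."""
    return {(ent.text.lower(), ent.label_) for ent in doc.ents}


def compare_entities(raw_doc, preprocessed_doc):
    """Split the entities into those found in only one document and those in both."""
    raw = entity_set(raw_doc)
    pre = entity_set(preprocessed_doc)
    return {"only_raw": raw - pre,
            "only_preprocessed": pre - raw,
            "shared": raw & pre}


def fake_doc(pairs):
    """Build a stand-in for a spaCy Doc from (text, label) pairs (hypothetical data)."""
    return SimpleNamespace(
        ents=[SimpleNamespace(text=text, label_=label) for text, label in pairs])


# Hypothetical example: the preprocessed version loses one entity
raw_doc = fake_doc([("Chopin", "PERSON"), ("New York City", "GPE")])
pre_doc = fake_doc([("new york city", "GPE")])

result = compare_entities(raw_doc, pre_doc)
print(result)
```

With real data, `raw_doc` would be the model's output for `article_text` and `pre_doc` its output for `preprocessed_article`. Because lemmatization changes the surface form of some words, matching on lowercased text is only approximate.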

## **Multilingual NER**
In this exercise, the NER performance of spaCy in English is compared to another language of your choice.

👋 ⚒ Go to the [spaCy models page](https://spacy.io/models) and identify the supported languages, which are listed on the left under the heading "Trained Pipelines". Select a language and model of your choice. Find an article in this language and parse it using the newspaper package.

In [None]:
# Remember that you first need to download the model: replace
# "en_core_web_sm" with the name of your model
!python -m spacy download en_core_web_sm
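As a small aid for choosing a model, the sketch below maps a few language codes to the names of small spaCy pipelines and builds the corresponding download command. The lookup table `SMALL_MODELS` and the helper `download_command` are illustrative; the table covers only three languages, and the model names should be checked against the spaCy models page before use.

```python
# Illustrative lookup table: language code -> small spaCy pipeline name.
# Check the names against the spaCy models page for your chosen language.
SMALL_MODELS = {
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "es": "es_core_news_sm",
}


def download_command(lang_code):
    """Build the shell command that downloads the small pipeline for a language."""
    try:
        model = SMALL_MODELS[lang_code]
    except KeyError:
        raise ValueError(f"No model listed for language code {lang_code!r}") from None
    return f"python -m spacy download {model}"


print(download_command("de"))
```

In a notebook cell, the printed command is run with a leading `!`, in the same way as the English download above.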

👋 ⚒ Perform NER on the selected article.

👋 ⚒ How well did NER work in the language of your choice compared to the overall performance of spaCy's NER in English?

*Your NE performance analysis here*