# Environment Setup



In [None]:
# Install necessary libraries
!pip install -U spacy transformers datasets pyLDAvis scikit-learn pandas matplotlib
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# Foundations & Traditional Methods

Remember that `dir(obj)` lists the attributes and methods of an object, and that `help()` shows the documentation for an object or data type, e.g. `help(int)`.

## Exercise 1

Goal: Convert a text into lowercase and print it out.

Input: `"This text about Natural Language Processing (NLP) should be in lowercase!"`

Expected output: `"this text about natural language processing (nlp) should be in lowercase!"`

In [None]:
text = "This text about Natural Language Processing (NLP) should be in lowercase!"



this text about natural language processing (nlp) should be in lowercase!
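
One possible solution, sketched with the built-in `str.lower()` method (the variable name `lowered` is only illustrative):

```python
text = "This text about Natural Language Processing (NLP) should be in lowercase!"

# str.lower() returns a new string; the original string is left unchanged
lowered = text.lower()
print(lowered)
```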


## Exercise 2

Goal: Replace all exact (case-sensitive) instances of "Natural Language Processing" with "NLP".

Input: `"Natural Language Processing is a very interesting subfield of computer science. Natural language processing is more important than ever thanks to language models."`

Expected output: `"NLP is a very interesting subfield of computer science. Natural language processing is more important than ever thanks to language models."`

In [None]:
text = "Natural Language Processing is a very interesting subfield of computer science. Natural language processing is more important than ever thanks to language models."



NLP is a very interesting subfield of computer science. Natural language processing is more important than ever thanks to language models.
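
A minimal sketch of one way to solve this, relying on the fact that `str.replace()` matches case-sensitively, so the second sentence ("Natural language processing") is left untouched. The variable name `replaced` is illustrative.

```python
text = (
    "Natural Language Processing is a very interesting subfield of computer science. "
    "Natural language processing is more important than ever thanks to language models."
)

# Only the exact, capitalised phrase is replaced
replaced = text.replace("Natural Language Processing", "NLP")
print(replaced)
```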


## Exercise 3

Goal: Convert all strings inside the list into uppercase.

Input: `["deep learning", "nlp", "python"]`

Expected output: `["DEEP LEARNING", "NLP", "PYTHON"]`

Bonus: Do the same for a pandas Series.


In [None]:
import pandas as pd

l = ["deep learning", "nlp", "python"]
s = pd.Series(l)




['DEEP LEARNING', 'NLP', 'PYTHON']
0    DEEP LEARNING
1              NLP
2           PYTHON
dtype: str
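
A possible solution sketch: a list comprehension handles the plain list, while the pandas `.str` accessor applies the same string method to every element of a Series. The names `upper_list` and `upper_series` are illustrative.

```python
import pandas as pd

l = ["deep learning", "nlp", "python"]
s = pd.Series(l)

# Plain Python: build a new list with each string uppercased
upper_list = [item.upper() for item in l]

# pandas: the .str accessor vectorises string methods over the Series
upper_series = s.str.upper()

print(upper_list)
print(upper_series)
```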


# Regular Expressions

Goal: Extract structured information (like IDs or dates) and mask sensitive data using Regular Expressions.

Context: Regex is widely used for data masking and PII redaction before data is sent to public APIs.

Example Pattern: `r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'` targets the typical structure of an email address.

Common Regex Tokens:

* `\d`: Digit

* `\w`: Word character (letter, digit, or underscore)

* `+`: One or more of the preceding element

* `\s`: Whitespace

* `[]`: Set of characters
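
To illustrate the masking use case, here is a minimal sketch that applies the email pattern above with `re.sub`. The function name `mask_emails`, the `[EMAIL]` placeholder, and the sample sentence are assumptions made for this example.

```python
import re

# The email pattern from the text above, compiled once for reuse
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def mask_emails(text, placeholder="[EMAIL]"):
    """Replace every substring that looks like an email address (hypothetical helper)."""
    return EMAIL_PATTERN.sub(placeholder, text)

sample = "Contact jane.doe@example.org or admin@mail.example.com for access."
masked = mask_emails(sample)
print(masked)
```

Note that this pattern is a pragmatic approximation: it catches common address shapes but does not cover every valid email format.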

## Exercise 4

Goal: Create a function that takes a paragraph and replaces every four-digit year mention (e.g., 1995, 2023) with the placeholder `[YEAR]`.

Input: `"The protocol was updated in 2015 and again in 2021."`

Expected Output: `"The protocol was updated in [YEAR] and again in [YEAR]."`

In [None]:
import re

def replace_years(paragraph):
    """
    Replaces all four-digit year mentions in a paragraph with '[YEAR]'.
    """
    pass

input_text = "The protocol was updated in 2015 and again in 2021."


Input: The protocol was updated in 2015 and again in 2021.
Output: The protocol was updated in [YEAR] and again in [YEAR].
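
One way to fill in `replace_years`, as a sketch: the pattern below assumes that a "year" is a standalone four-digit number starting with 19 or 20. That restriction is an assumption of this example, not part of the exercise; loosen it to `\b\d{4}\b` if any four-digit number should count.

```python
import re

def replace_years(paragraph):
    """
    Replaces all four-digit year mentions in a paragraph with '[YEAR]'.
    """
    # \b word boundaries prevent matches inside longer digit runs;
    # (?:19|20) limits matches to 1900-2099 (an assumption of this sketch)
    return re.sub(r"\b(?:19|20)\d{2}\b", "[YEAR]", paragraph)

input_text = "The protocol was updated in 2015 and again in 2021."
output_text = replace_years(input_text)

print(f"Input: {input_text}")
print(f"Output: {output_text}")
```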


## Exercise 5

Goal: Extract the paper number, the submission date, and the author name from the string "Paper No. 5382B23XXX7 was submitted on 2026-01-03 by Max Meier."

Input: `"Paper No. 5382B23XXX7 was submitted on 2026-01-03 by Max Meier."`

Expected output: `{'paper_number': '5382B23XXX7', 'submission_date': '2026-01-03', 'name': 'Max Meier'}`

In [None]:
s = "Paper No. 5382B23XXX7 was submitted on 2026-01-03 by Max Meier."

paper_number = None
submission_date = None
name = None

print({'paper_number': paper_number,
       "submission_date": submission_date,
       "name": name})

{'paper_number': '5382B23XXX7', 'submission_date': '2026-01-03', 'name': 'Max Meier'}
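
A possible solution sketch using named capture groups, so that `groupdict()` returns the dictionary directly. The pattern assumes the sentence always follows the template shown in the input and that the name contains no period.

```python
import re

s = "Paper No. 5382B23XXX7 was submitted on 2026-01-03 by Max Meier."

# Named groups map directly onto the keys of the expected dictionary
pattern = re.compile(
    r"Paper No\. (?P<paper_number>\w+) was submitted on "
    r"(?P<submission_date>\d{4}-\d{2}-\d{2}) by (?P<name>[^.]+)\."
)

match = pattern.search(s)
result = match.groupdict() if match else {}
print(result)
```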


## Exercise 6

Goal: Create a function that extracts the information from Exercise 5 from multiple such texts stored in a list and saves it to a dictionary, with the paper number as the key and the submission date and name as its values.

Input: [
    "Paper No. 5382B23XXX7 was submitted on 2026-01-03 by Max Meier.",
    "Paper No. A9921Z92 was submitted on 2025-11-12 by Sarah Jenkins.",
    "Paper No. 10023948572 was submitted on 2026-02-28 by Hiroshi Tanaka.",
    "Paper No. CONF20249X was submitted on 2024-08-15 by Elena Rodriguez.",
    "Paper No. B88231PF was submitted on 2025-05-20 by David Connor.",
    "Paper No. 7732VQ66 was submitted on 2026-01-10 by Li Wei."
]

Expected Output: {'5382B23XXX7': {'submission_date': '2026-01-03', 'name': 'Max Meier'}, 'A9921Z92': {'submission_date': '2025-11-12', 'name': 'Sarah Jenkins'}, '10023948572': {'submission_date': '2026-02-28', 'name': 'Hiroshi Tanaka'}, 'CONF20249X': {'submission_date': '2024-08-15', 'name': 'Elena Rodriguez'}, 'B88231PF': {'submission_date': '2025-05-20', 'name': 'David Connor'}, '7732VQ66': {'submission_date': '2026-01-10', 'name': 'Li Wei'}}

In [None]:
submitted_papers = [
    "Paper No. 5382B23XXX7 was submitted on 2026-01-03 by Max Meier.",
    "Paper No. A9921Z92 was submitted on 2025-11-12 by Sarah Jenkins.",
    "Paper No. 10023948572 was submitted on 2026-02-28 by Hiroshi Tanaka.",
    "Paper No. CONF20249X was submitted on 2024-08-15 by Elena Rodriguez.",
    "Paper No. B88231PF was submitted on 2025-05-20 by David Connor.",
    "Paper No. 7732VQ66 was submitted on 2026-01-10 by Li Wei."
]

def extract_paper_info(submitted_papers):
    pass

paper_info = extract_paper_info(submitted_papers)
print(paper_info)

{'5382B23XXX7': {'submission_date': '2026-01-03', 'name': 'Max Meier'}, 'A9921Z92': {'submission_date': '2025-11-12', 'name': 'Sarah Jenkins'}, '10023948572': {'submission_date': '2026-02-28', 'name': 'Hiroshi Tanaka'}, 'CONF20249X': {'submission_date': '2024-08-15', 'name': 'Elena Rodriguez'}, 'B88231PF': {'submission_date': '2025-05-20', 'name': 'David Connor'}, '7732VQ66': {'submission_date': '2026-01-10', 'name': 'Li Wei'}}
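
A minimal sketch of `extract_paper_info`, reusing a named-group pattern. Silently skipping entries that do not match the template is a design choice of this example; raising an error would be equally reasonable.

```python
import re

# Assumes every entry follows the "Paper No. ... was submitted on ... by ..." template
PAPER_PATTERN = re.compile(
    r"Paper No\. (?P<paper_number>\w+) was submitted on "
    r"(?P<submission_date>\d{4}-\d{2}-\d{2}) by (?P<name>[^.]+)\."
)

def extract_paper_info(submitted_papers):
    """Map each paper number to its submission date and author name."""
    paper_info = {}
    for entry in submitted_papers:
        match = PAPER_PATTERN.search(entry)
        if match is None:
            continue  # skip entries that do not follow the template
        paper_info[match["paper_number"]] = {
            "submission_date": match["submission_date"],
            "name": match["name"],
        }
    return paper_info

submitted_papers = [
    "Paper No. 5382B23XXX7 was submitted on 2026-01-03 by Max Meier.",
    "Paper No. A9921Z92 was submitted on 2025-11-12 by Sarah Jenkins.",
    "Paper No. 10023948572 was submitted on 2026-02-28 by Hiroshi Tanaka.",
    "Paper No. CONF20249X was submitted on 2024-08-15 by Elena Rodriguez.",
    "Paper No. B88231PF was submitted on 2025-05-20 by David Connor.",
    "Paper No. 7732VQ66 was submitted on 2026-01-10 by Li Wei.",
]

paper_info = extract_paper_info(submitted_papers)
print(paper_info)
```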


# The NLP Pipeline Approach

Goal: Compare the modular approach of NLTK with the automated, production-ready pipeline of spaCy.

Why use spaCy? It is designed for speed on large-scale analysis, and a single call to the pipeline runs tokenization, POS tagging, and NER in one step.

Why use NLTK? It is highly modular, making it a great tool for researchers who need to choose specific algorithms for niche tasks.

## Exercise 7

Goal: Initialize a spaCy model, create a Doc object, and understand how it stores text as a sequence of Token objects.

Input: `"NLP allows researchers to discover hidden patterns in data."`

In [None]:
import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Create a Doc object by passing the string to the nlp object
text = "NLP allows researchers to discover hidden patterns in data."


The Doc object contains 10 tokens.

Token Index: 0 | Text: NLP
Token Index: 1 | Text: allows
Token Index: 2 | Text: researchers
Token Index: 3 | Text: to
Token Index: 4 | Text: discover
Token Index: 5 | Text: hidden
Token Index: 6 | Text: patterns
Token Index: 7 | Text: in
Token Index: 8 | Text: data
Token Index: 9 | Text: .


## Exercise 8

Goal: Identify and extract Named Entities (NER) from a text using spaCy.

Input: `"Apple is looking at buying a startup in the United Kingdom for $1 billion in 2024."`

In [None]:
text = "Apple is looking at buying a startup in the United Kingdom for $1 billion in 2024."


Entity: Apple                | Label: ORG        | Description: Companies, agencies, institutions, etc.
Entity: the United Kingdom   | Label: GPE        | Description: Countries, cities, states
Entity: $1 billion           | Label: MONEY      | Description: Monetary values, including unit
Entity: 2024                 | Label: DATE       | Description: Absolute or relative dates or periods


## Exercise 9

Goal: Implement a cleaning function that removes stop words and punctuation while lemmatizing the remaining words.

In [None]:
def clean_text(text):
    pass

sample_text = "The researchers are analyzing massive datasets on climate change."

print(f"Original: {sample_text}")
print(f"Cleaned:  {clean_text(sample_text)}")

Original: The researchers are analyzing massive datasets on climate change.
Cleaned:  researcher analyze massive dataset climate change


# Vectorization

## Exercise 10

Goal: Use CountVectorizer to create a Bag of Words (BoW) representation of a small corpus.

Hint: Use the get_feature_names_out() method of CountVectorizer.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

corpus = [
    "The dog bit the man.",
    "The man bit the dog.",
    "The cat sat on the mat."
]


   bit  cat  dog  man  mat  on  sat  the
0    1    0    1    1    0   0    0    2
1    1    0    1    1    0   0    0    2
2    0    1    0    0    1   1    1    2


## Exercise 11

Goal: Calculate TF-IDF scores for a collection of documents to identify unique keywords. Use the previously defined corpus.

Hint: Use the get_feature_names_out() method of TfidfVectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer



       bit       cat      dog      man       mat        on       sat       the
0  0.42984  0.000000  0.42984  0.42984  0.000000  0.000000  0.000000  0.667618
1  0.42984  0.000000  0.42984  0.42984  0.000000  0.000000  0.000000  0.667618
2  0.00000  0.430518  0.00000  0.00000  0.430518  0.430518  0.430518  0.508542


## Exercise 12

Goal: Calculate the Cosine Similarity between two document vectors. Use the vectors of Exercise 10 or 11.

Bonus: Try out a different similarity measure, e.g. Manhattan Distance.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances



Cosine Similarity between Doc 0 and Doc 1: 1.0000
Manhattan Distance between Doc 0 and Doc 1: 0.0000
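
A possible solution sketch based on the Bag-of-Words vectors from Exercise 10 (rebuilt here so the snippet is self-contained). The comparison of Doc 0 with Doc 2 is an addition of this example, included to show a pair that is not identical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances

corpus = [
    "The dog bit the man.",
    "The man bit the dog.",
    "The cat sat on the mat.",
]

# Dense count vectors, one row per document
vectors = CountVectorizer().fit_transform(corpus).toarray()

# Both functions expect 2D inputs, hence the vectors[[i]] row selection
cos_01 = cosine_similarity(vectors[[0]], vectors[[1]])[0, 0]
man_01 = pairwise_distances(vectors[[0]], vectors[[1]], metric="manhattan")[0, 0]
cos_02 = cosine_similarity(vectors[[0]], vectors[[2]])[0, 0]
man_02 = pairwise_distances(vectors[[0]], vectors[[2]], metric="manhattan")[0, 0]

print(f"Cosine Similarity between Doc 0 and Doc 1: {cos_01:.4f}")
print(f"Manhattan Distance between Doc 0 and Doc 1: {man_01:.4f}")
print(f"Cosine Similarity between Doc 0 and Doc 2: {cos_02:.4f}")
print(f"Manhattan Distance between Doc 0 and Doc 2: {man_02:.4f}")
```

Doc 0 and Doc 1 contain the same words in a different order, so their count vectors coincide: the similarity is maximal and the distance is zero even though the sentences mean different things.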


# Text Classification and Topic Modelling

## Exercise 13

Goal: Perform Sentiment Analysis on a list of reviews using the VADER tool.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

reviews = [
    "The results of this study are absolutely excellent!",
    "The methodology was poor and the data was very noisy.",
    "The paper provides an average overview of the topic."
]



Review: The results of this study are absolutely excellent!
Score: {'neg': 0.0, 'neu': 0.62, 'pos': 0.38, 'compound': 0.6468}

Review: The methodology was poor and the data was very noisy.
Score: {'neg': 0.389, 'neu': 0.611, 'pos': 0.0, 'compound': -0.624}

Review: The paper provides an average overview of the topic.
Score: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Exercise 14

Goal: Apply Latent Dirichlet Allocation (LDA) to identify 2 main topics in a small dataset. Reuse the Bag-of-Words matrix produced by the CountVectorizer in Exercise 10.

In [None]:
from sklearn.decomposition import LatentDirichletAllocation

# Using the BoW matrix from Exercise 10

Topic 0: ['on', 'sat', 'the']
Topic 1: ['bit', 'man', 'the']


## Exercise 15

Goal: Use a Hugging Face Pipeline to perform Zero-Shot Classification (categorizing text into labels it wasn't specifically trained on).

In [None]:
from transformers import pipeline

sequence = "The new variant of the virus is spreading rapidly across the continent."
candidate_labels = ["politics", "science", "health", "economy"]


No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


Device set to use cpu


Text: The new variant of the virus is spreading rapidly across the continent.
Top Label: health (Score: 0.4538)


## Exercise 16

Goal: Summarize a long paragraph of text using a pre-trained Transformer model.

In [None]:
long_text = """
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence
concerned with the interactions between computers and human language, in particular how to program computers
to process and analyze large amounts of natural language data. The goal is a computer capable of
'understanding' the contents of documents, including the contextual nuances of the language within them.
The technology can then accurately extract information and insights contained in the documents as well
as categorize and organize the documents themselves.
"""


No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Summary:  The goal is a computer capable of understanding the contents of documents, including the contextual nuances of the language within them . The technology can then extract information and insights contained in the documents as well as categorize and organize the documents themselves .
