University of Zagreb\
Faculty of Electrical Engineering and Computing

## Text Analysis and Retrieval 2023/2024
https://www.fer.unizg.hr/en/course/taar

------------------------------

# LAB 1: Basics of NLP

*Version: 1.3*

© 2024 Josip Jukić, Jan Šnajder

Submission deadline: **March 24, 2024, 23:59 CET** 

------------------------------

### Instructions

Welcome, visitor! This lab assignment is structured into three segments. Your primary objective is to complete the missing code sections, marked by the "YOUR CODE HERE" placeholder, and then evaluate the cells.

For each part of the assignment, a series of tests is available for you to run. These tests guide you by showing the expected output format. After you submit your assignment, further tests will be run. Variations in library versions might cause slight differences in your results, but this is no cause for concern: your submitted work will be evaluated in a controlled environment.


### Submission rules
By submitting the exercise, you confirm the following points:
1. You did not receive help from another person when solving the exercise;
2. You attributed parts of the code that were taken from the Internet by referencing them in comments;
3. You did not use parts of the code from the Internet that are specific to the laboratory exercise;
4. You have not used AI assistants for coding such as GitHub Copilot (including generative AI tools such as ChatGPT).

**Violation of any of the above rules is considered a misdemeanor and results in academic sanctions.**

## Tasks

### 1. Preprocessing

In [1]:
import spacy
import numpy as np
import pandas as pd

In this assignment, we'll be making extensive use of the [spaCy](https://spacy.io/) library. It's crucial that you familiarize yourself with its key features. For a foundational understanding, please explore the basics [here](https://spacy.io/usage/spacy-101). Ensure you're comfortable with the concepts we've discussed in lectures, such as tokenization, lemmatization, part-of-speech (POS) tagging, and named entity recognition (NER).

Additionally, our work will incorporate the [NumPy](https://numpy.org/) and [pandas](https://pandas.pydata.org/) libraries. Should these be new to you, we recommend engaging with [this tutorial](https://www.hackerearth.com/practice/machine-learning/data-manipulation-visualisation-r-python/tutorial-data-manipulation-numpy-pandas-python/tutorial/) to get up to speed.

In [2]:
# Load spacy model
nlp = spacy.load("en_core_web_sm")

#### (a)
Process the example below with spaCy. Tokenize the document and gather the tokens in a list. Finally, print the tokens.

In [3]:
ex1_a1 = (
    "A wizard is never late, Frodo Baggins. "
    "Nor is he early; he arrives precisely when he means to."
)

In [4]:
# YOUR CODE HERE
doc = nlp(ex1_a1)
for token in doc:
    print(token.text)


A
wizard
is
never
late
,
Frodo
Baggins
.
Nor
is
he
early
;
he
arrives
precisely
when
he
means
to
.


#### (b)
Implement `sentencizer` using [spaCy](https://spacy.io/usage/linguistic-features).

In [5]:
def sentencizer(text):
    """
    Receives a string as an input,
    splits the document into sentences and gathers them in a list.
    """
    # YOUR CODE HERE
    doc = nlp(text)
    return [sentence.text for sentence in doc.sents]


In [6]:
assert sentencizer("Sentence no. 1. Sentence no. 2.") == [
    "Sentence no. 1.",
    "Sentence no. 2.",
]

#### (c)

Implement `lemmatizer` using [spaCy](https://spacy.io/usage/linguistic-features).

In [7]:
def lemmatizer(text):
    """
    Receives a string as an input and lemmatizes it.
    The lemmas are returned in a list.
    """
    # YOUR CODE HERE
    doc = nlp(text)
    return [token.lemma_ for token in doc]

In [8]:
assert lemmatizer(ex1_a1) == [
    "a",
    "wizard",
    "be",
    "never",
    "late",
    ",",
    "Frodo",
    "Baggins",
    ".",
    "nor",
    "be",
    "he",
    "early",
    ";",
    "he",
    "arrive",
    "precisely",
    "when",
    "he",
    "mean",
    "to",
    ".",
]

#### (d)

Implement the `ngrams` function. You might find the [`tee`](https://www.geeksforgeeks.org/python-itertools-tee/) function from the `itertools` module useful, but you're not obliged to use it. `ngrams` should return a generator. Please refer to [this page](https://wiki.python.org/moin/Generators) if you aren't familiar with Python generators.
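
As a rough illustration of the two building blocks mentioned above, the sketch below pairs each element with its successor using `tee` inside a generator. The helper name `pairs` is hypothetical and covers only the n = 2 case; it is not meant as the expected solution.

```python
from itertools import tee


def pairs(sequence):
    """Lazily yield consecutive pairs from `sequence` (hypothetical helper)."""
    first, second = tee(sequence, 2)  # two independent iterators over the same data
    next(second, None)                # advance the second iterator by one element
    for pair in zip(first, second):
        yield pair


gen = pairs(["a", "b", "c"])
print(next(gen))  # the first pair is computed only when requested
print(list(gen))  # the remaining pairs
```

Because `pairs` is a generator, nothing is computed until a consumer such as `next` or `list` asks for a value.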

In [9]:
from itertools import tee


def ngrams(sequence, n, **kwargs):
    """
    Receives a list of tokens and generates n-grams.
    """
    # YOUR CODE HERE
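    # Create n independent iterators over the same sequence
    iterators = tee(sequence, n)
    # Advance the i-th iterator by i positions; the default in next() avoids
    # a StopIteration (RuntimeError inside a generator) on short sequences
    for i, iterator in enumerate(iterators):
        for _ in range(i):
            next(iterator, None)
    # zip stops at the shortest iterator, yielding each window of n tokens
    yield from zip(*iterators)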

In [10]:
assert list(ngrams(lemmatizer(ex1_a1), 2)) == [
    ("a", "wizard"),
    ("wizard", "be"),
    ("be", "never"),
    ("never", "late"),
    ("late", ","),
    (",", "Frodo"),
    ("Frodo", "Baggins"),
    ("Baggins", "."),
    (".", "nor"),
    ("nor", "be"),
    ("be", "he"),
    ("he", "early"),
    ("early", ";"),
    (";", "he"),
    ("he", "arrive"),
    ("arrive", "precisely"),
    ("precisely", "when"),
    ("when", "he"),
    ("he", "mean"),
    ("mean", "to"),
    ("to", "."),
]


### 2. News classification

#### (a)
Load the prepared BBC news data into a `pandas` DataFrame named `df_bbc`. Explore the dataset structure.
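
A few common ways to explore a DataFrame are sketched below. The toy frame `df_toy` is a hypothetical stand-in for the BBC data; only the column names `news` and `type` mirror the ones used in this lab.

```python
import pandas as pd

# Toy stand-in for the BBC data (hypothetical contents)
df_toy = pd.DataFrame(
    {
        "news": ["Title A\n \n Body A", "Title B\n \n Body B", "Title C\n \n Body C"],
        "type": ["sport", "tech", "sport"],
    }
)

print(df_toy.shape)                   # (number of rows, number of columns)
print(df_toy.head(2))                 # first rows
print(df_toy["type"].value_counts())  # class distribution
```

Checking the class distribution early is useful because it affects how the classification scores should be read later on.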

In [11]:
import pandas as pd

# YOUR CODE HERE
df_bbc = pd.read_csv("bbc.csv")

#print(df_bbc.iloc[1].news)
#print(df_bbc.iloc[1].type)

for index, row in df_bbc.iterrows():
    print('Row data:')
    print(row)




Row data:
news    New 'yob' targets to be unveiled\n \n Fifty ne...
type                                             politics
Name: 0, dtype: object
Row data:
news    Newcastle line up Babayaro\n \n Newcastle mana...
type                                                sport
Name: 1, dtype: object
Row data:
news    Europe backs digital TV lifestyle\n \n How peo...
type                                                 tech
Name: 2, dtype: object
Row data:
news    Fears raised over ballet future\n \n Fewer chi...
type                                        entertainment
Name: 3, dtype: object
Row data:
news    Barkley fit for match in Ireland\n \n England ...
type                                                sport
Name: 4, dtype: object
Row data:
news    France Telecom gets Orange boost\n \n Strong g...
type                                             business
Name: 5, dtype: object
Row data:
news    MCI shareholder sues to stop bid\n \n A shareh...
type                                  

#### (b)
To make the classification task a bit more challenging, we want to remove the news title from the text.\
Additionally, we will collapse every run of whitespace into a single space. Implement title removal and whitespace normalization in `clean_text`.\
E.g., "This \n is  \t an &nbsp;&nbsp;&nbsp;&nbsp; example. " -> "This is an example."
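
The two steps can be tried out separately on a toy string. This is only a sketch that assumes the title is everything before the first newline; the string `raw` is made up for illustration.

```python
raw = "Some title\n First   line.\n\tSecond line. "

# Step 1: everything before the first newline is treated as the title.
title, _, body = raw.partition("\n")

# Step 2: str.split() with no arguments splits on any run of whitespace
# and drops leading/trailing whitespace, so joining with " " normalizes it.
normalized = " ".join(body.split())

print(repr(title))
print(repr(normalized))
```

A regular expression such as `\s+` is an alternative way to match whitespace runs; the `split`/`join` idiom is shown here only because it needs no imports.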

In [12]:
import re

def clean_text(text):
    """
    Removes the news title and collapses whitespace runs into single spaces.
    Returns the preprocessed text.
    """
    # YOUR CODE HERE
    # The title is everything up to the first newline
    body = text.split("\n", 1)[1]
    # \s+ already covers newlines and tabs; strip() drops leading/trailing spaces
    return re.sub(r"\s+", " ", body).strip()

print(clean_text("Breaking news\nClever Hans \t learns  to integrate."))

Clever Hans learns to integrate.


In [13]:
assert (
    clean_text("Breaking news\nClever Hans \t learns  to integrate.")
    == "Clever Hans learns to integrate."
)


In [14]:
df_bbc["text"] = df_bbc.news.apply(clean_text)

#### (c)
1. Implement an abstract pipeline in `preprocess_pipe`. The method receives a sequence of texts and a pipe function, which is used to preprocess documents in combination with the spaCy model `nlp` that we loaded at the beginning. We recommend you use [`nlp.pipe`](https://spacy.io/usage/processing-pipelines).
2. Implement `lemmatize_pipe` that collects lemmas and returns a list of n-grams ranging from `ngram_min` to `ngram_max`. Additionally, **truncate** the documents to `max_len` tokens and **remove the stop words**. Refer to the tests below to see how this method should behave.
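
The order of operations can be sketched without spaCy on plain token lists. The block below is a toy version under stated assumptions: the stop-word set and the function names `toy_pipe` and `pipe_fn_toy` are hypothetical, and stop words are removed before truncation, which is one possible reading of the task.

```python
from functools import partial

# Hypothetical stop-word set, for illustration only;
# in the lab this information comes from the spaCy model.
STOP_WORDS = {"the", "of", "a"}


def toy_pipe(tokens, max_len, ngram_min, ngram_max):
    """Filter stop words, truncate, then build n-grams for each n in the range."""
    kept = [t for t in tokens if t not in STOP_WORDS][:max_len]
    result = []
    for n in range(ngram_min, ngram_max + 1):
        result.extend(tuple(kept[i : i + n]) for i in range(len(kept) - n + 1))
    return result


# partial fixes the keyword arguments, leaving a one-argument function
pipe_fn_toy = partial(toy_pipe, max_len=3, ngram_min=1, ngram_max=2)
out = [pipe_fn_toy(doc) for doc in [["the", "edge", "of", "a", "knife", "cut", "deep"]]]
print(out)
```

Binding the parameters with `partial` keeps the generic pipeline unaware of any n-gram settings: it only needs a callable that takes a single document.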

In [15]:
def lemmatize_pipe(doc, max_len, ngram_min, ngram_max):
    """
    Removes stop words, truncates the document to `max_len` tokens,
    and returns lemma n-grams in range [`ngram_min`, `ngram_max`].
    """
    # YOUR CODE HERE
    # Drop stop words first, then keep at most `max_len` lemmas
    lemmas = [token.lemma_ for token in doc if not token.is_stop][:max_len]
    ngrams_final = []
    for n in range(ngram_min, ngram_max + 1):
        ngrams_final.extend(ngrams(lemmas, n))
    return ngrams_final


def preprocess_pipe(texts, pipe_fn):
    """
    Runs `texts` through the spaCy pipeline and applies `pipe_fn` to each document.
    """
    # YOUR CODE HERE
    return [pipe_fn(doc) for doc in nlp.pipe(texts)]


In [16]:
from functools import partial


pipe_fn = partial(lemmatize_pipe, max_len=100, ngram_min=1, ngram_max=2)

ex2_c1 = ["Text no. 1", "Text no. 2"]
sol2_c1 = [
    [("text",), (".",), ("1",), ("text", "."), (".", "1")],
    [("text",), (".",), ("2",), ("text", "."), (".", "2")],
]

assert preprocess_pipe(ex2_c1, pipe_fn) == sol2_c1


ex2_c2 = ["The Quest stands upon the edge of a knife."]

sol2_c2 = [
    [
        ("Quest",),
        ("stand",),
        ("edge",),
        ("knife",),
        (".",),
        ("Quest", "stand"),
        ("stand", "edge"),
        ("edge", "knife"),
        ("knife", "."),
    ]
]
assert preprocess_pipe(ex2_c2, pipe_fn) == sol2_c2



In [17]:
from functools import partial
from sklearn.model_selection import train_test_split

pipe_fn = partial(lemmatize_pipe, max_len=100, ngram_min=1, ngram_max=2)

df_bbc["lemmas"] = preprocess_pipe(df_bbc.text, pipe_fn)
df_bbc_train, df_bbc_test = train_test_split(
    df_bbc[["lemmas", "type"]], test_size=0.2, random_state=42
)

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Load vectorizers
count_vectorizer = CountVectorizer(
    tokenizer=lambda doc: doc, lowercase=False, min_df=3, token_pattern=None
)
tfidf_vectorizer = TfidfVectorizer(
    tokenizer=lambda doc: doc, lowercase=False, min_df=3, token_pattern=None
)

#### (d)
Implement `train_lr`. Run `test_performance` with both the count vectorizer and the TF-IDF vectorizer, and compare the results.
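
The difference between the two weightings can be sketched by hand. The block below assumes scikit-learn's default `TfidfVectorizer` settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the toy documents and the helper `tfidf` are hypothetical.

```python
import math
from collections import Counter

docs = [["goal", "match", "goal"], ["match", "market"]]  # hypothetical tokenized docs
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)  # raw counts, i.e. what a count vectorizer would store
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


print(Counter(docs[0]))  # count features
print(tfidf(docs[0]))    # TF-IDF features
```

Terms that occur in every document, such as `match` here, receive the smallest IDF, so TF-IDF shifts weight toward terms that are more specific to a document.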

In [19]:
from sklearn.linear_model import LogisticRegression as LR


def train_lr(df_train, vectorizer, lr_kwargs={"max_iter": 1000, "solver": "lbfgs"}):
    """
    Receives the train set `df_train` as pd.DataFrame and extracts lemma n-grams
    with their corresponding labels (news type).
    The text is vectorized and used to train a logistic regression with
    training arguments passed as `lr_kwargs`.
    Returns the fitted model.
    """
    # YOUR CODE HERE
    X_train = df_train["lemmas"]
    y_train = df_train["type"]
    X_train = vectorizer.fit_transform(X_train)
    lr_model = LR(**lr_kwargs)
    lr_model.fit(X_train, y_train)
    
    return lr_model
    

In [20]:
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score


def test_performance(model, df_test, vectorizer):
    X_test, y_test = df_test.lemmas, df_test.type
    X_vec = vectorizer.transform(X_test)
    y_pred = model.predict(X_vec)
    print(classification_report(y_pred=y_pred, y_true=y_test))
    return f1_score(y_pred=y_pred, y_true=y_test, average="macro")



In [21]:
## Count vectorizer scenario
lr = train_lr(df_bbc_train, count_vectorizer)
f1 = test_performance(lr, df_bbc_test, count_vectorizer)
print(f"f1 = {f1:.3f}")

               precision    recall  f1-score   support

     business       1.00      0.91      0.95        11
entertainment       0.83      0.83      0.83         6
     politics       1.00      0.88      0.93         8
        sport       0.92      1.00      0.96        12
         tech       0.75      1.00      0.86         3

     accuracy                           0.93        40
    macro avg       0.90      0.92      0.91        40
 weighted avg       0.93      0.93      0.93        40

f1 = 0.907


In [22]:
## TF-IDF vectorizer scenario
lr = train_lr(df_bbc_train, tfidf_vectorizer)
f1 = test_performance(lr, df_bbc_test, tfidf_vectorizer)
print(f"f1 = {f1:.3f}")

               precision    recall  f1-score   support

     business       0.79      1.00      0.88        11
entertainment       1.00      0.67      0.80         6
     politics       1.00      0.88      0.93         8
        sport       0.92      1.00      0.96        12
         tech       1.00      0.67      0.80         3

     accuracy                           0.90        40
    macro avg       0.94      0.84      0.87        40
 weighted avg       0.92      0.90      0.90        40

f1 = 0.875
