<a href="https://colab.research.google.com/github/SISTERZHANGLN/MLOps-Practical-1/blob/main/TextAnalysis_Lab2_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text and Media Analytics
### Seminar 2 Lab

This is the Google Colab notebook accompanying the second Seminar of the Applied Data Science *Text and Media Analytics* course at Utrecht University, the 2025/2026 edition.

We will cover three basic approaches to automatic text analysis: word-count-based methods, static embeddings and contextual embeddings. Along the way, we will try out various potentially useful steps in the NLP pipeline.

## Some elements of the classic NLP pipeline

We worked a little bit with **NLTK** last week, but only used it for tokenization. In fact, NLTK offers many types of analysis along the NLP pipeline and is a (relatively) convenient single entry point for different types of tools and models. Let's see some things it can do.

In [1]:
# first, let's kill warnings, so that they don't distract us
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# let's import nltk and some of its components
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

# If you need quick and easy access to a corpus to try something out, access via NLTK Corpus module is a good option https://www.nltk.org/api/nltk.corpus.html
from nltk.corpus import reuters

nltk.download('reuters')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package reuters to /root/nltk_data...


True

Once you download a corpus, you can either just use the raw text, or load it pre-tokenized as a list of words, or a list of sentences, each of which is, in turn, a list of words, etc.

In [None]:
print('Raw corpus:', reuters.raw()[:35])
print('Corpus as a list of words:', reuters.words()[:5])
print('Corpus as a list of sentences (here is the first one):', reuters.sents()[0])

Raw corpus: ASIAN EXPORTERS FEAR DAMAGE FROM U.
Corpus as a list of words: ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM']
Corpus as a list of sentences (here is the first one): ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']


Another thing NLTK offers is [Text](https://tedboy.github.io/nlps/generated/generated/nltk.Text.html) -- a wrapper around a sequence of words for initial text exploration.

In [None]:
from nltk.text import Text

text = Text(reuters.words())

# finding words similar to a given word (by the contexts it's used in):
text.similar('finance')

the agriculture trade oil foreign prime pct energy increase sell japan
cut reduce commerce in a exports industry make of


**SpaCy** is another tool that we discussed previously. We only looked at its tokenization model, but in fact it provides analysis along the whole NLP pipeline.

So, let's look at what it can do. You might find it useful to check out [the info about SpaCy English model that we will use](https://spacy.io/models/en) -- but also check out models for other languages that they have, including Dutch.

The pipeline includes [morphology](https://spacy.io/api/morphologizer) and a [syntactic parser](https://spacy.io/api/dependencyparser). Check out [the list of parts of speech and syntactic dependencies](https://spacy.io/models/en#en_core_web_trf-labels). Morphology follows [the format from Universal Dependencies](https://universaldependencies.org/format.html#morphological-annotation).

Basically, SpaCy keeps the analyzed text as a collection of tokens with a bunch of properties for these tokens. For instance, we can go through the tokens of a sentence and print the part of speech of each of them, next to the actual token, like this:


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")


sentence = 'This is a relatively simple example, I hope it can parse it.'

processed_sent = nlp(sentence)

for token in processed_sent:
  print(token.text, token.pos_)

This PRON
is AUX
a DET
relatively ADV
simple ADJ
example NOUN
, PUNCT
I PRON
hope VERB
it PRON
can AUX
parse VERB
it PRON
. PUNCT


Another thing we can do is to print all the morphological information per token:

In [None]:
for token in processed_sent[:5]:
  print(token.text, token.morph)

This Number=Sing|PronType=Dem
is Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
a Definite=Ind|PronType=Art
relatively 
simple Degree=Pos


We can also get the values of a particular morphological feature we are interested in. This gives, for each token, a list of values of this morphological feature; if the token doesn't have any value for this feature, the resulting list will be empty. Let's get the value of the feature `Number` for the first 5 tokens:

In [None]:
for token in processed_sent[:5]:
  print(token.text, token.morph.get('Number'))

This ['Sing']
is ['Sing']
a []
relatively []
simple []


Another thing you can do is to check what kind of syntactic role a token has -- in what kind of dependency relation it stands to some other token in the sentence. We can print, for each token, its syntactic head and the type of dependency between them:

In [None]:
for token in processed_sent[:5]:
  print(token.text, token.head.text, token.dep_)

This is nsubj
is hope ccomp
a example det
relatively simple advmod
simple example amod


It might be useful to visualize the dependency structure sometimes -- you can use DisplaCy for this purpose, like in this example below:

In [None]:
from spacy import displacy

displacy.render(processed_sent, style="dep", jupyter=True)

Displacy can visualize not only syntactic structures, but also named entities:

In [None]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million. '
         u'By contrast, Sony sold only 7 thousand Walkman music players.')

displacy.render(doc, style='ent', jupyter=True)

There are many ways to use the annotation SpaCy and other tools provide -- the use of structured linguistic information about the text is limited only by your imagination and by the task at hand. Here, for instance, we transform the sentence so that the result is the lemmas of all the verbs, with all the stopwords kicked out:

In [None]:
sentence = 'This is a relatively simple example, I hope it can parse it.'
processed_sent = nlp(sentence)

result = [word.lemma_ for word in processed_sent if word.pos_ == 'VERB' and not word.is_stop]
print(result)

['hope', 'parse']


# Seminar 2 Exercises: Part I

In this exercise, we will do topic analysis with Latent Dirichlet Allocation, following lecture material. Optionally, you can compare the result with K-Means clustering on top of TF-IDF vectors.

First of all, let's download the data that we will work with for our exercises. This is just some collection of news articles.


In [None]:
! wget https://raw.githubusercontent.com/bylinina/TMA_seminars/refs/heads/main/news_sample.csv

--2025-11-17 11:46:29--  https://raw.githubusercontent.com/bylinina/TMA_seminars/refs/heads/main/news_sample.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16044521 (15M) [text/plain]
Saving to: ‘news_sample.csv’


2025-11-17 11:46:29 (389 MB/s) - ‘news_sample.csv’ saved [16044521/16044521]



In [None]:
# Some imports and installations. These are your tools for the exercises -- check out the documentation for these packages for guidelines on how they are used.

!pip install --upgrade kneed gensim

from nltk.corpus import stopwords
import nltk

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import spacy
import string

from kneed import KneeLocator
from sklearn.cluster import KMeans
import sklearn

from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel, LdaModel
from gensim import corpora, models

import matplotlib.pyplot as plt

nltk.download('stopwords')

stop_words = stopwords.words('english')
nlp = spacy.load("en_core_web_sm")


Collecting kneed
  Downloading kneed-0.8.5-py3-none-any.whl.metadata (5.5 kB)
Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading kneed-0.8.5-py3-none-any.whl (10 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: kneed, gensim
Successfully installed gensim-4.4.0 kneed-0.8.5


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


IA.1) Load the data as a df and explore it a bit. What's the shape of the df? What are the names of the columns? If something needs to be deleted for the df to look cleaner, do it.


In [None]:
# code

IA.2) Run the news texts (``body``) through the SpaCy pipeline and put the result in a new column. You may need to convert the column into the right data type (str) before you can proceed.

In [None]:
# code

IA.3) Use Spacy to select words you want to keep/throw out! Does it make sense to keep all the words in? Maybe something could be kicked out based on parts of speech? Try different combinations and see how the results change later on. Also, clean the resulting texts: remove punctuation, digits, stopwords, lemmatize all words, put each of the texts back together into a string where each included lemma is separated by a space. Put the resulting clean texts into a new column (let's call it `clean_text`).

In [None]:
# code

IA.4) Use LDA topic modeling on the news data set. You can use the ``clean_text`` column from the dataframe -- but make sure to turn the strings into lists of words! Try different numbers of topics, calculate coherence scores and see if they can assist in determining the number of topics. Feel free to try different methods for calculating coherence.

In [None]:
# code
# check out documentation and example code from here https://radimrehurek.com/gensim/models/ldamodel.html

IA.5) Create a line plot with the coherence scores (using ``CoherenceModel`` from ``gensim.models``) with number of clusters/topics on the x-axis and the scores on the y-axis.

In [None]:
# code

IA.6) Print out coherence values per number of topics.

In [None]:
# code

IA.7) Choose a K and do the LDA! Explore the result -- print out word mixes for each of the resulting topics using ``print_topics()``

In [None]:
# code

**OPTIONAL**: Compare with TF-IDF based clustering

IB.1) Vectorize ``clean_text`` and calculate TF-IDF!

In [None]:
# code

IB.2) Perform KMeans clustering for a range of numbers of clusters and create a line plot with the number of clusters on the x axis and the sum of squared distances on the y axis. Find the optimal number of clusters, perform clustering with this number of clusters and add a column to the dataframe that indicates which cluster an article belongs to.

In [None]:
# code

IB.3) Check the top words (e.g., 10 or 15) per cluster. Try labelling the clusters as topics or media frames.

In [None]:
# code

IB.4) Check the cluster sizes

In [None]:
# code

IB.5) Reflect on the TF-IDF approach. What are the benefits, what are the limitations? How can you ensure validity and reliability of the results?

In [None]:
# smth

IB.6) Compare the results of TF-IDF clustering and LDA topic modeling. What are the differences? How can you explain them? You could, for instance, use the same number of clusters for TF-IDF as the number of topics for LDA and then compare the most important words for each. How different are the results?

In [None]:
# text

# Seminar 2 Exercises: Part II

In this part, we will look at embeddings, using text classification as our running example. We will look at classification based on

1) Tf-Idf vectors;

2) Static dense embeddings;

3) Contextual embeddings.


First, as usual, let's get our data. We are going to use a [dataset of 16,086 article titles](https://github.com/bhargaviparanjape/clickbait) that are either labelled as `clickbait` (1) or `not clickbait` (0). The dataset is introduced in this paper:

> Abhijnan Chakraborty, Bhargavi Paranjape, Sourya Kakarla, and Niloy Ganguly. "[Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media](https://ieeexplore.ieee.org/document/7752207)". In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), San Francisco, US, August 2016.

We can load the dataset from the following url:


In [None]:
import pandas as pd

DATASET_URL = 'https://gist.githubusercontent.com/amitness/0a2ddbcb61c34eab04bad5a17fd8c86b/raw/66ad13dfac4bd1201e09726677dd8ba8048bb8af/clickbait.csv'
df = pd.read_csv(DATASET_URL)
df.head(5)

Unnamed: 0,title,label
0,"15 Highly Important Questions About Adulthood,...",1
1,250 Nuns Just Cycled All The Way From Kathmand...,1
2,"Australian comedians ""could have been shot"" du...",0
3,Lycos launches screensaver to increase spammer...,0
4,Fußball-Bundesliga 2008–09: Goalkeeper Butt si...,0


## Tf-Idf + logistic regression (the baseline)

Let's start with putting together a simple baseline with logistic regression based on Tf-Idf features. First, we split our dataset into training and test sets -- to be able to see model performance on data unseen during training.

In [None]:
from sklearn.model_selection import train_test_split

X = df.title.values
y = df.label.values # the labels we want to predict
labels = ['not clickbait', 'clickbait']

X_train_str, X_test_str, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

We reserved 20% of the data set for the test set. Generally, having more training data will improve your model's performance, so don’t reserve too much - especially when your data set is small. On the other hand, too little test data makes the estimate of your model's performance less reliable.

Now, let's vectorize the texts -- for now, with Tf-Idf:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()

X_train = ## your code here
X_test = ## your code here

Now that we have our features ready, let’s train the Logistic Regression classifier. In logistic regression, we learn for each element ``xᵢ`` in our input vector ``x`` (weighted frequencies of all the words in the corpus) a corresponding weight ``wᵢ``, in combination with a bias term ``b``. We transform the linear combination of ``w``, ``x``, and ``b`` (``wx + b``) via the sigmoid function to a probability between 0 and 1. Conventionally, if the probability of a title being clickbait is higher than 0.5, we classify it as a clickbait title, otherwise we classify it as non-clickbait.

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)
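
To make the decision rule described above concrete, here is a minimal NumPy sketch of what happens for a single title. The feature values, weights and bias are made up for illustration; they are not taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy values: three Tf-Idf features for one title
x = np.array([0.0, 0.4, 0.7])    # feature vector
w = np.array([-1.2, 0.5, 2.0])   # one weight per feature (made up)
b = -0.3                         # bias term (made up)

z = w @ x + b                    # linear combination wx + b
p_clickbait = sigmoid(z)         # probability of the positive class
label = int(p_clickbait > 0.5)   # threshold at 0.5

print(f"z = {z:.2f}, P(clickbait) = {p_clickbait:.3f}, predicted label = {label}")
```

For a fitted scikit-learn model, the learned weights and bias are stored in `lr.coef_` and `lr.intercept_`, and `lr.predict_proba()` returns the probabilities before thresholding.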

We’ve trained our model. How do we measure its performance? We let it predict the labels of texts from the test set using `predict()`, and compare its results with the true labels:

In [None]:
from sklearn.metrics import classification_report

y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred,
                          target_names=labels, digits=4))

               precision    recall  f1-score   support

not clickbait     0.9567    0.9736    0.9651      3178
    clickbait     0.9735    0.9565    0.9649      3220

     accuracy                         0.9650      6398
    macro avg     0.9651    0.9650    0.9650      6398
 weighted avg     0.9651    0.9650    0.9650      6398



Not bad! Accuracy of 96.5%! Apart from accuracy, we see three evaluation metrics listed here: `precision`, `recall`, and `f1-score`. The precision of clickbait, for example, is the proportion of titles the model classified as clickbait that are correctly classified. If a model classified 10 posts as clickbait, and 8 of them were actually clickbait, the precision would be 0.80. Recall, on the other hand, indicates the proportion of the titles that are actually clickbait that are also found by the model. If there were 16 posts in the dataset that are labeled as clickbait, and the model found 8 of them correctly, the recall would be 0.5. Finally, `f1-score` is the harmonic mean of precision and recall.
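
The worked numbers from the paragraph above can be checked with scikit-learn directly. This is a small sketch on made-up labels: 16 titles that are truly clickbait plus 4 that are not (the 4 is an arbitrary choice), with the model flagging 10 titles, 8 of them correctly.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# 16 titles are truly clickbait (1), 4 are not (0)
y_true = [1] * 16 + [0] * 4
# The model flags 10 titles as clickbait: 8 true ones and 2 false alarms
y_pred = [1] * 8 + [0] * 8 + [1] * 2 + [0] * 2

precision = precision_score(y_true, y_pred)  # 8 correct out of 10 flagged
recall = recall_score(y_true, y_pred)        # 8 found out of 16 actual
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, round(f1, 4))
```

Because the harmonic mean is pulled towards the lower of the two values, the F1 score here ends up closer to the recall of 0.5 than to the precision of 0.8.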

How do we interpret these numbers? Generally, the higher the better, and it depends on your goal how high you want your metrics to be (if you don’t mind important emails being classified as spam, you’ll be fine with a precision of 0.6). But there is an absolute minimum: you want your model to be better than random, better than just flipping a coin. In the case of binary classification with balanced classes, the random baseline is 50%. Let us see what a random prediction between 0 and 1 actually produces:

In [None]:
import random

random_preds = [random.randint(0,1) for i in range(len(y_test))]

print(classification_report(y_test, random_preds,
                          target_names=labels, digits=4))

               precision    recall  f1-score   support

not clickbait     0.4991    0.4978    0.4984      3178
    clickbait     0.5056    0.5068    0.5062      3220

     accuracy                         0.5023      6398
    macro avg     0.5023    0.5023    0.5023      6398
 weighted avg     0.5023    0.5023    0.5023      6398



## Static embeddings + logistic regression

During the lecture, we talked about representing words -- or bigger units of text, like documents -- as dense vectors of some pre-set length. Those vectors are trained on large corpora, and then they can be reused for different tasks. These word embeddings are usually called 'static' in the sense that there's one vector associated with every word, and that's it -- regardless of the context the word is used in, its vector is fixed once it's trained! There is no way to, for instance, alter the embedding of an ambiguous word depending on which meaning it has in a particular sentence.
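
As a toy illustration of what 'static' means, consider a lookup table from words to vectors. The words and the 3-dimensional vectors below are invented for this sketch; real embedding tables have far more entries and dimensions.

```python
import numpy as np

# Hypothetical 3-dimensional embedding table (values are made up)
embeddings = {
    "river": np.array([0.9, 0.1, 0.0]),
    "bank":  np.array([0.4, 0.5, 0.1]),
    "money": np.array([0.0, 0.8, 0.2]),
    "the":   np.array([0.1, 0.1, 0.1]),
}

def embed(tokens):
    """Look up each token's vector; the surrounding words play no role."""
    return [embeddings[t] for t in tokens]

sent_a = ["the", "river", "bank"]
sent_b = ["the", "money", "bank"]

# 'bank' gets exactly the same vector in both sentences
bank_a = embed(sent_a)[2]
bank_b = embed(sent_b)[2]
same = np.array_equal(bank_a, bank_b)
print(same)
```

The two senses of *bank* end up sharing one vector, which is exactly the limitation that contextual embeddings address.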

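Static word embeddings come in a variety of implementations and versions (the Word2Vec family of embeddings; Glove, Fasttext) and are useful for a variety of tasks -- let's try them on clickbait title classification! `Spacy` offers static word embeddings (of length 300) as part of its pipelines, and we will use its vectors here. Note that the 300-dimensional static vectors ship with the medium and large English models (`en_core_web_md`, `en_core_web_lg`); the small model `en_core_web_sm` loaded below has no static vector table, so its `.vector` attribute falls back to the pipeline's internal token representations instead.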



In [None]:
! python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")
nlp.select_pipes(disable=["ner", "parser"])

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


['ner', 'parser']

You can access an embedding of every token of a document processed by Spacy (if you want to try more things with static embeddings, check out [this Colab](https://colab.research.google.com/drive/1w1N4LIWo-rKeqstjLv_9V31-KHQLEaWH?usp=sharing)):

In [None]:
nlp('dog')[0].vector

array([-0.27226925, -0.93836105,  0.5357777 ,  0.44520932,  0.38257632,
       -0.9749172 ,  0.8469627 ,  0.6745571 , -0.25771025, -0.27247453,
        1.1028785 , -0.35970753, -0.3852108 ,  0.04651581, -0.14505364,
        1.5175972 , -0.73540014, -0.84961474,  0.40694195, -0.35687307,
       -0.66881895,  1.2842281 , -0.38473105, -0.14557725,  0.58634967,
        0.8983518 ,  1.1753087 , -0.708951  , -0.79900455,  0.7978039 ,
       -0.39777148,  1.0300553 ,  0.54685897,  0.27132213, -0.6710391 ,
       -1.252937  ,  0.31320006,  1.1285927 ,  0.02641791, -0.06576954,
       -0.2265883 , -0.44849777, -0.08686753,  0.34801948,  0.16308916,
        0.31855017, -0.60689956, -0.8128593 , -0.13207734, -0.3962313 ,
        0.41582835,  0.0253248 , -0.15275669, -0.83847266,  0.69083273,
        0.506855  ,  1.5052202 , -0.63450634,  0.23774838, -0.1408025 ,
       -1.3257025 ,  0.63608605, -0.47730634, -0.24139911, -0.01089418,
        0.6481187 ,  0.614839  , -0.88111126, -1.2229419 , -0.16

In fact, Spacy also provides vectors for larger text units (sentences, documents) -- by averaging embeddings of their individual words. Check this out -- we can take a sequence embedding that Spacy suggests, and also manually average across word-by-word embeddings, and then show that the result is the same:

In [None]:
import numpy as np

emb_spacy = nlp("I don't have a dog").vector

emb_average = [x.vector for x in nlp("I don't have a dog")]
emb_average = sum(emb_average) / len(emb_average)

(emb_spacy == emb_average).all()

np.True_

We can use these document embeddings as features for whatever model we are interested in training. For instance, quite like with Tf-Idf above, we can train a Logistic Regression model for clickbait title detection. Let's reuse our train-test split from before, but now vectorize the texts with Spacy text embeddings (takes a bit of time):

In [None]:
X_train = [nlp(x).vector for x in X_train_str]
X_test = [nlp(x).vector for x in X_test_str]

Now we can just fit a logistic regression model as before and check out the results:

In [None]:
lr = LogisticRegression(solver='lbfgs')
lr.fit(X_train, y_train)

y_pred = lr.predict(X_test)

print(classification_report(y_test, y_pred,
                          target_names=labels, digits=4))

               precision    recall  f1-score   support

not clickbait     0.9713    0.9594    0.9653      3178
    clickbait     0.9604    0.9720    0.9662      3220

     accuracy                         0.9658      6398
    macro avg     0.9659    0.9657    0.9658      6398
 weighted avg     0.9658    0.9658    0.9658      6398



Maybe a little bit better than before, but not dramatically so. I guess that's not a surprise -- the representation the model works with is still BoW, although a slightly more sophisticated one: we just average all words' vectors, and their order, for instance, is not taken into account. Time to move on to contextualized embeddings!

## Contextual embeddings

As you know, the order of words can be very important. *The man killed the lion* and *The lion killed the man* are two completely different stories but they have exactly the same BoW representation, so we might want something beyond that -- something where embeddings of individual tokens interact with each other depending on their position in the text. A big class of models that does exactly that is the Transformer models we discussed during the lecture this week: they use a so-called self-attention mechanism to get an idea of how relevant every other word in the sentence is for the current word (if you would like to know more about how Transformer models work, also check out the [original BERT paper](https://arxiv.org/abs/1810.04805)).

Today, we will look at one example of a model that does that -- an **encoder** model of the BERT family (recall the discussion about the distinction between encoder models, decoder models and encoder-decoder models from the lecture). Encoder models are typically used for text classification tasks. We will use DistilBERT -- a pretty small but well-performing model from the BERT family.

The way these models are typically used involves fine-tuning: a pre-trained model (usually trained for either next-token prediction or masked-token prediction on a huge amount of text) gets trained a little bit more for a downstream task. The hope is that what the model acquired during pre-training will help it with the final task -- and that this works better than training a model for the downstream task from scratch. This is known as 'transfer learning' and it works pretty well!

We will fine-tune a pre-trained DistilBERT model using Huggingface's [Trainer](https://huggingface.co/docs/transformers/en/main_classes/trainer) class that supports training with PyTorch in a pretty compact way. Because fine-tuning these bigger models requires a lot of computational power, it is really not advised to train it on your own computer, unless you have a GPU. When you use Google Colab, make sure you use a GPU (Runtime > Change runtime type > GPU).

First, let's prepare the data. In order to save training time, let's use just 5k examples for training and 500 examples for evaluation. To prepare the data for Trainer, let's use the Dataset class:

In [None]:
from datasets import Dataset

train_data = Dataset.from_pandas(df.head(5000))
eval_data = Dataset.from_pandas(df.tail(500))

In order to run the model on the text data, we need to tokenize it -- with the tokenizer that comes with the model! Let's do it both for train and eval data.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["title"], padding="max_length", truncation=True)


train_dataset = train_data.map(tokenize_function, batched=True).shuffle(seed=42)
eval_dataset = eval_data.map(tokenize_function, batched=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Now, let's load the actual model with its pre-trained weights! Once you do it using the `AutoModelForSequenceClassification` class, you will see a warning that some weights are newly initialized and need to be trained. That's because this class adds a linear layer on top of the pre-trained model that is responsible for the actual classification task, with the number of output classes that corresponds to the number of labels for your classification that you specify when you load the model (`num_labels` argument). The weights in this layer are random when you load the model, so some training is needed!

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=2)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
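
To make the role of the newly initialized layer more concrete, here is a simplified NumPy sketch of a classification head. It assumes a single linear layer on top of a 768-dimensional sequence representation; the warning above shows that the actual DistilBERT head also contains a `pre_classifier` layer, which is left out here. The encoder output is replaced by random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_size, num_labels = 768, 2

# Stand-in for the encoder's representation of a batch of 4 titles (random here)
sequence_repr = rng.normal(size=(4, hidden_size))

# Newly initialized head: random weights, not trained yet
W = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b = np.zeros(num_labels)

logits = sequence_repr @ W + b          # shape (4, 2): one score per class
predicted = np.argmax(logits, axis=-1)  # index of the highest-scoring class

print(logits.shape, predicted)
```

With random weights the predicted labels are arbitrary, which is why the head needs training before the model can classify anything.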


We are almost ready to train. We need to do some annoying things first to disable Trainer's attempts to connect us to Weights&Biases and send stuff there. Don't think about it too much.

In [None]:
import os

# disable Weights & Biases logging; this must be set before the Trainer is created
os.environ["WANDB_DISABLED"] = "true"

Now, time to specify training arguments and initialize Trainer with these arguments:

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer",
                                  eval_strategy="no",
                                  num_train_epochs=1,
                                  save_strategy="no",
                                  report_to="none")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset)

Ok, let's train!!

In [None]:
trainer.train()

Step,Training Loss
500,0.0792


TrainOutput(global_step=625, training_loss=0.06944524803161621, metrics={'train_runtime': 229.842, 'train_samples_per_second': 21.754, 'train_steps_per_second': 2.719, 'total_flos': 662336993280000.0, 'train_loss': 0.06944524803161621, 'epoch': 1.0})

Now we can use the result of this training to predict labels for new data -- we set aside evaluation data for this purpose. Let's see what the model does with it!

In [None]:
predictions = trainer.predict(eval_dataset)
y_pred = np.argmax(predictions.predictions, axis=-1)
y_test = predictions.label_ids

print(classification_report(y_test, y_pred,
                          target_names=labels, digits=4))

               precision    recall  f1-score   support

not clickbait     1.0000    0.9919    0.9959       246
    clickbait     0.9922    1.0000    0.9961       254

     accuracy                         0.9960       500
    macro avg     0.9961    0.9959    0.9960       500
 weighted avg     0.9960    0.9960    0.9960       500



Awesome. Let's clean up.

In [None]:
import torch
import gc

del model, trainer

if torch.cuda.is_available():
    torch.cuda.empty_cache()

gc.collect()

544

# Seminar 2 Exercises: Part III

### IIIA: Classification with different vectorizations

Earlier, we looked at Tf-Idf vs. static embeddings as the basis of classification with Logistic Regression and barely saw improvement when we switched from Tf-Idf as features to static embeddings offered by SpaCy. Is this a stable result? Do the same thing we did above but using different types of classifiers. Pick at least one alternative classifier model! Suggested classifiers are

```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
```

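Check on the Scikit Learn pages which hyperparameters these classifiers have. If you do not specify hyperparameters, the classifier runs with its default values. Note that `MultinomialNB` expects non-negative features, so it works with Tf-Idf vectors but not directly with embeddings, which contain negative values. You can also try to combine classifiers into ensembles, with a voting classifier on top.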


In [None]:
## code goes here

### IIIB: Fine-tuning vs. contextual embeddings as features

When we talked about the benefit of contextual embeddings, we saw an example of fine-tuning a DistilBERT model with an added layer specifically for the binary classification task. During this process, the whole model was trained (all weights were adjusted, including the newly initialized ones). Could we just use embeddings produced by the encoder model as features of our favourite classification model? We certainly could. Would it work well? Let's see.

What should we use as an embedding of the whole text, given that DistilBERT produces embeddings for each token? There are different approaches -- we could average embeddings of all tokens and use that as a sequence embedding. Alternatively -- and more frequently -- the embedding of the special token `[CLS]` is used for this purpose. BERT tokenizers wrap sequences into special tokens, with the `[CLS]` token at the beginning of the sequence. `[CLS]` stands for 'classification' so it's not a coincidence!

In [None]:
tokenizer.decode(tokenizer.encode('This is a sentence'))

'[CLS] This is a sentence [SEP]'
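
The two pooling strategies mentioned above can be sketched on a toy array. The shapes and values below are hypothetical stand-ins for an encoder output (2 sequences, 4 token positions, 3 dimensions), not real DistilBERT embeddings.

```python
import numpy as np

# Hypothetical encoder output: 2 sequences, 4 token positions, 3 dimensions
hidden = np.arange(24, dtype=float).reshape(2, 4, 3)

# Attention mask: 1 marks real tokens, 0 marks padding
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]])

# Option 1: take the vector at position 0, where [CLS] sits
cls_emb = hidden[:, 0, :]

# Option 2: average over real tokens only, ignoring padded positions
mask3 = mask[:, :, None]
mean_emb = (hidden * mask3).sum(axis=1) / mask3.sum(axis=1)

print(cls_emb.shape, mean_emb.shape)
```

Masking matters for the averaging option: with heavy padding, an unmasked mean would be dominated by the vectors at padded positions.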

Now, let's again load the DistilBERT model (now using a different class, without an added classification layer!), and use the pre-tokenized 5k examples to run them through the model and get the embeddings that we need out of it. It will take a bit of time, hold tight!

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert-base-cased")

In [None]:
from tqdm import tqdm
import torch

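# use the GPU if one is available, otherwise fall back to the CPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model.to(device)
input_ids_train = torch.tensor(train_dataset['input_ids']).to(device)
attention_masks_train = torch.tensor(train_dataset['attention_mask']).to(device)

# run the titles through the model one by one and collect per-token embeddings
encoded = []
for row in tqdm(range(len(train_dataset))):
  with torch.no_grad():
    output = model(input_ids=input_ids_train[row:row+1, :],
                   attention_mask=attention_masks_train[row:row+1, :])
  # last_hidden_state has shape (1, seq_len, hidden_size); drop the batch dimension
  embeds = output.last_hidden_state.squeeze(dim=0).to('cpu').numpy()
  encoded.append(embeds)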

100%|██████████| 5000/5000 [01:22<00:00, 60.55it/s]


Now, keep only the embedding of the very first token for each of the 5000 titles. Keep in mind that all examples were padded to the maximum sequence length that the model can take: 512. The size of the vector that corresponds to each of the tokens is 768.

In [None]:
X_cls = # your code here
y_cls = train_dataset['label']

In [None]:
del encoded

if torch.cuda.is_available():
    torch.cuda.empty_cache()

gc.collect()

239

Train a Logistic Regression on top of these features

In [None]:
# your code here

Write a function that would take the pre-tokenized `eval_dataset` and return predictions - first running the dataset through the model, then extracting the embeddings of the `[CLS]` tokens, then running that through the logistic regression model. Evaluate the quality of your set-up.

In [None]:
# your code here

Reflect on the results!

\[YOUR THOUGHTS HERE\]