# NLP 2026
# Lab 3: Classification and BERT

Have you ever read a movie review and wondered:

```“Is this review actually positive or negative?” 🤔```

In this lab, you will build your own sentiment analysis tool using Natural Language Processing (NLP)! Your goal is to automatically classify movie reviews into one of two categories:

✅ Positive

❌ Negative

We will approach this as a binary classification task, and you will experiment with increasingly powerful methods, from classic machine learning to modern transformer-based neural networks 🚀

### 🎯 Learning Goals ###

By completing this lab, you should be able to:

- Formulate sentiment analysis as a binary classification problem

- Design and evaluate hand-crafted text features

- Implement a Bag-of-Words representation

- Apply and evaluate Logistic Regression and alternative classifiers

- Understand how BERT tokenization and embeddings work

- Extract sentence representations using the ```CLS``` token and mean token pooling

- Compare classical ML and transformer-based methods

- Critically analyze evaluation metrics beyond accuracy 📊



### Score breakdown

| Exercise            | Points |
|---------------------|--------|
| [Exercise 1](#e1)   | 1      |
| [Exercise 2](#e2)   | 5      |
| [Exercise 3](#e3)   | 5      |
| [Exercise 4](#e4)   | 5      |
| [Exercise 5](#e5)   | 5      |
| [Exercise 6](#e6)   | 5      |
| [Exercise 7](#e7)   | 3      |
| [Exercise 8](#e8)   | 6      |
| [Exercise 9](#e9)   | 5      |
| [Exercise 10](#e10) | 5      |
| [Exercise 12](#e12) | 5      |
| [Exercise 13](#e13) | 10     |
| Total               | 60     |

This total will be scaled down to a maximum of 0.6, which will be your final lab score.

### 📌 **Instructions for Delivery** (📅 **Deadline: 23/Feb 18:00**, 🎭 *wildcards possible*)

✅ **Submission Requirements**
+ 📄 Submit a **notebook** 📓 containing the code, appropriate comments, and figures for all questions. Include a mix of code (with explanations where the implementation is not self-evident), figures that support your answers or claims, and enough text to explain your reasoning and answers.
+ ⚡ Make sure that **all cells are executed properly** ⚙️ and that **all figures/results/plots** 📊 you include in the report are also visible in your **executed notebook**.
+ You can work on Google Colab (or other environments), but you need to make sure that your delivered notebook is executed properly.

✅ **Collaboration & Integrity**
+ 🗣️ While you may **discuss** the lab with others, you must **write your solutions with your group only**. If you **discuss specific tasks** with others, please **include their names** below.
+ 📜 **Honor Code applies** to this lab. For more details, check **Syllabus §7.2** ⚖️.
+ 📢 **Mandatory Disclosure**:
   - Any **websites** 🌐 (e.g., **Stack Overflow** 💡) or **other resources** used must be **listed and disclosed**.
   - Any **GenAI tools** 🤖 (e.g., **ChatGPT**) used must be **explicitly mentioned**.
   - 🚨 **Failure to disclose these resources is a violation of academic integrity**. See **Syllabus §7.3** for details.

## 0. Setup

We first install [scikit-learn](https://scikit-learn.org/stable/), whose classification models we will use.

In [None]:
# pip install -U scikit-learn

We will need [PyTorch](https://pytorch.org/) installed. It is a very popular deep learning library that offers modularized versions of many of the sequence models we discussed in class. It's an important tool that you may want to practice further if you want to dive deeper into NLP, since much of the current academic and industrial research uses it.

Some resources for further reading:

* [Documentation](https://pytorch.org/docs/stable/index.html) (We will need this soon)

* [Installation Instructions](https://pytorch.org/get-started/locally/)

* [Quickstart Tutorial](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)

The cell below should install the library:

In [None]:
# pip install torch torchvision

The last piece we need is the Hugging Face Transformers library ([documentation](https://huggingface.co/docs/transformers/en/index)). Transformers are one of the most influential architectures for handling sequences (not only in language). As we discussed in lectures, they excel at taking context into account (the bread and butter of NLP) through mechanisms such as self-attention, which lets them weigh the importance of different words in a sentence. If you want to know more, revisit the course material (slides and textbook).
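
To make the idea of self-attention concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The toy embeddings and projection matrices are random and purely illustrative (they are not taken from BERT or any pre-trained model), and real transformers add multiple heads, masking, and learned parameters on top of this.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings.
    Returns the contextualized vectors and the attention weights.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8           # illustrative sizes (assumptions)
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

context, attn = self_attention(X, W_q, W_k, W_v)
print(attn.round(2))  # row i: how much token i attends to every token
```

Each output vector in `context` is a weighted average of the value vectors, which is how a token's representation comes to depend on the other words in the sentence.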

We already used Hugging Face Datasets in previous labs, and Transformers integrates nicely with it. Apart from being easy to use, Hugging Face also provides pre-trained models of many kinds; the full list is available [here](https://huggingface.co/models). The following line should be enough to install the Transformers library:

In [None]:
# pip install transformers

Here, we import the libraries:

In [1]:
import re
from collections import Counter

import datasets
import numpy as np
import torch
import tqdm
import transformers
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from torch.utils.data.dataloader import DataLoader

## 1. Loading the Dataset

We will work with the IMDB dataset [https://huggingface.co/datasets/stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb). It contains the reviews and a label that indicates whether the review is positive or not (the neutral reviews have been filtered out). You can read the paper [here](https://aclanthology.org/P11-1015/).

In [2]:
dataset = datasets.load_dataset('stanfordnlp/imdb', split=['train', 'test'])
print(dataset)

[Dataset({
    features: ['text', 'label'],
    num_rows: 25000
}), Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})]


Notice that the dataset has been loaded as a list of two datasets: the `train` and `test` splits that we asked for.
We will tune parameters on a validation subset, so let's hold out 5,000 examples from the `train` split and create a `DatasetDict` object:

In [3]:
train_valid_split = dataset[0].train_test_split(5000)
dataset = datasets.DatasetDict({
    'train': train_valid_split['train'].shuffle(),
    'validation': train_valid_split['test'].shuffle(),
    'test': dataset[1].shuffle(),
})
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})
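
The hold-out idea behind `train_test_split` can be sketched in plain Python on toy data. The `holdout_split` helper, the toy reviews, and the fixed `seed` are assumptions made for illustration; the cell above does not fix a seed, so its split changes between runs.

```python
import random

def holdout_split(examples, n_valid, seed=0):
    """Shuffle a copy of `examples` and hold out `n_valid` items for validation."""
    rng = random.Random(seed)          # local RNG so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[n_valid:], shuffled[:n_valid]

toy_reviews = [(f"review {i}", i % 2) for i in range(10)]   # (text, label) pairs
train, valid = holdout_split(toy_reviews, n_valid=3, seed=42)

print(len(train), len(valid))   # 7 3
```

If you want the same reproducibility on the real dataset, both `train_test_split` and `shuffle` in Hugging Face Datasets accept a `seed` argument.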


We can print several examples from the `train` dataset:

In [4]:
for i in range(5):
    print('i', i)
    print(dataset['train'][i]['text'])
    print(dataset['train'][i]['label'])
    print()

i 0
Demer Daves,is a wonderful director when it comes to westerns and "broken arrow" remains in everybody's mind.As far as melodrama is concerned,he should leave that to knowing people like Vincente Minelli,George Cukor or the fabulous Douglas Sirk. The screenplay is so predictable that you will not be surprised once while you are watching such a tepid weepie.Natalie Wood 's character was inspired by Fannie Hurst's "imitation of life" (see Stahl and Sirk),but who could believe she's a black man's daughter anyway?Susan Kohner was more credible in "imitation of life")and Sinatra and Curtis are given so stereotyped parts that they cannot do anything with them:the poor officer,and the wealthy good-looking -and mean- sergeant.Guess whom will Natalie fall in love with?France is shown as a land of tolerance ,where interracial unions are warmly welcome.At the time(circa 1944) it was dubious,it still is for narrow-minded people you can find here there and everywhere.
0

i 1
Okay, let me start o

Let's extract the labels from the dataset. We will use them to train and evaluate our classifiers.

In [5]:
y_train = dataset['train']['label']
print(y_train)
y_valid = dataset['validation']['label']
print(y_valid)

Column([0, 0, 1, 0, 1, ...])
Column([0, 0, 1, 0, 0, ...])
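
Accuracy is easier to interpret when the classes are roughly balanced, so it can be worth counting the labels before training. The sketch below uses a small hypothetical label list; on the real data you would pass an iterable of labels such as `y_train` instead.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: fraction of examples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

toy_labels = [0, 0, 1, 0, 1, 1, 1, 0]   # hypothetical labels, not IMDB data
print(class_distribution(toy_labels))    # {0: 0.5, 1: 0.5}
```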


<a name='e1'></a>
#### Exercise 1: Cleaning the text

(1p) In this exercise you should clean the text in the dataset. This is the same step we saw in the previous labs.

If you think this step is not necessary in this use case, you can skip this step, but make sure to justify your decision.

In [6]:
def clean(text):
    """
    Cleans the text
    Args:
        text: a string that will be cleaned

    Returns: the cleaned text

    """

    # Empty text
    if text == '':
        return text

    ### YOUR CODE HERE
    
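    # Remove HTML tags; substitute a space so that words separated only by a tag
    # (e.g. "end.<br /><br />Next") are not glued together
    text = re.sub(r'<.*?>', ' ', text)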

    # Lowercase
    text = text.lower()

    # Expand contractions
    text = re.sub(r"n't\b", " not", text)

    # Normalize repeated characters
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    # Negation handling
    text = re.sub(r'\bnot\s+(\w+)', r'not_\1', text)

    # Remove punctuation but keep underscore for negation handling
    # text = re.sub(r'[^\w\s_]', '', text)

    # Collapse spaces
    text = re.sub(r'\s+', ' ', text)

    # Minimal stopwords
    stop_words = set(['a', 'an', 'the'])
    words = [w for w in text.split() if w not in stop_words]
    text = ' '.join(words)
    
    text = text.strip()


    ### YOUR CODE ENDS HERE

    return text


def clean_example(example):
    """
    Applies the clean() function to the example from the Dataset
    Args:
        example: an example from the Dataset

    Returns: update example with cleaned 'text' column

    """
    example['text'] = clean(example['text'])
    return example


dataset = dataset.map(clean_example, desc="Cleaning")
print(dataset)

Cleaning:   0%|          | 0/20000 [00:00<?, ? examples/s]

Cleaning:   0%|          | 0/5000 [00:00<?, ? examples/s]

Cleaning:   0%|          | 0/25000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
})


In [7]:
def show_cleaning_example(text):
    print(text)
    print(clean(text))

example = "This movie is NOT good!!! I don't like it at all. <br /> It was soooo boring."

show_cleaning_example(example)

This movie is NOT good!!! I don't like it at all. <br /> It was soooo boring.
this movie is not_good!! i do not_like it at all. it was soo boring.
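
The order of the rules matters: contractions are expanded before negation marking, which is why `don't like` ends up as `do not_like`. The sketch below replays three of the substitutions from `clean()` one at a time on a short made-up sentence, so the effect of each step is visible.

```python
import re

text = "i didn't like it... sooooo slow"

step1 = re.sub(r"n't\b", " not", text)              # expand "n't" contractions
step2 = re.sub(r"(.)\1{2,}", r"\1\1", step1)        # cap character runs at two
step3 = re.sub(r"\bnot\s+(\w+)", r"not_\1", step2)  # glue "not" to the next word

for label, value in [("contractions", step1), ("repeats", step2), ("negation", step3)]:
    print(f"{label:>12}: {value}")
```

The final string is `i did not_like it.. soo slow`. If the negation rule ran first, `didn't` would not yet contain a standalone `not` and no `not_` token would be produced.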


## 2. Hand-crafted Features

<a name='e2'></a>
#### Exercise 2: Hand-crafted features

(5p) Write your own hand-crafted feature extraction function. Include at least these types of features:
- length of the text,
- number of different punctuation characters,
- number of positive and negative words.

In [8]:
### YOUR CODE HERE
# you can define the positive and negative words here

positive_words = set(['good','amazing','great','excellent','wonderful','thrilling','exciting','nice', 'best'])
negative_words = set(['bad', 'horrible', 'terrible', 'boring', 'worst', 'awful'])
punctuation_set = {".", "!", "?", ",", ";"}


def calculate_features(text):
    features = []
    ### YOUR CODE HERE
    unique_punctuation_count = 0
    positive_word_count = 0
    negative_word_count = 0

    words = text.split()

    word_count = len(words)

    for word in words: 
        if word in positive_words:
            positive_word_count += 1
        if word in negative_words:
            negative_word_count += 1
    
    found_punctuation = set()

    for char in text:
        if char in punctuation_set:
            found_punctuation.add(char)

    unique_punctuation_count = len(found_punctuation)
        
    features = [
        word_count,
        unique_punctuation_count,
        positive_word_count,
        negative_word_count
    ]

    ### YOUR CODE ENDS HERE
    return np.array(features, dtype=float)


### YOUR CODE ENDS HERE
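
One thing worth checking with whitespace tokenization: because `clean()` keeps punctuation, a token such as `boring.` is not equal to `boring` and is missed by the lexicon lookup. The sketch below contrasts `str.split()` with a simple regex tokenizer; the tiny `lexicon` set is illustrative only.

```python
import re

lexicon = {"good", "boring"}                     # illustrative mini-lexicon
text = "this movie is not_good!! it was soo boring."

split_tokens = text.split()
regex_tokens = re.findall(r"\w+", text)          # \w also matches "_", so not_good survives

split_hits = [t for t in split_tokens if t in lexicon]
regex_hits = [t for t in regex_tokens if t in lexicon]

print(split_hits)   # []
print(regex_hits)   # ['boring']
```

Note that `not_good` stays a single token under the regex, so negated words are still distinguishable from their plain forms.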

#### Iteratively improved feature extraction functions for Exercise 3

In [9]:

# positive_words, negative_words and punctuation_set are reused from the Exercise 2 cell above.


def calculate_features_V1(text):
    features = []
    unique_punctuation_count = 0
    positive_word_count = 0
    negative_word_count = 0


    words = text.split()

    word_count = len(words)  # may be 0 for empty text; the ratios below guard against that

    for word in words: 
        if word in positive_words:
            positive_word_count += 1
        if word in negative_words:
            negative_word_count += 1
    
    found_punctuation = set()

    for char in text:
        if char in punctuation_set:
            found_punctuation.add(char)

    unique_punctuation_count = len(found_punctuation)
    
    # New: The sentimental difference, capturing the difference in the counts of positive and negative words
    sentiment_difference = positive_word_count - negative_word_count

    # New: Ratio of positive and negative words: 
    denominator = max(word_count, 1)
    positive_ratio = positive_word_count / denominator
    negative_ratio = negative_word_count / denominator
    
    features = [
        word_count,
        unique_punctuation_count,
        positive_word_count,
        negative_word_count,
        sentiment_difference,
        positive_ratio,
        negative_ratio
    ]

    return np.array(features, dtype=float)


In [10]:
# New: Added many new positive and negative words
positive_words_V2 = set([
    "good","great","amazing","excellent","wonderful","best","love","loved","lovely",
    "awesome","fantastic","brilliant","superb","outstanding","perfect","enjoy","enjoyed",
    "fun","funny","hilarious","charming","beautiful","moving","touching","heartwarming",
    "impressive","solid","strong","well","wellmade","masterpiece","recommend","recommended",
    "favorite","favourite","satisfying","pleasant","clever","smart","engaging","thrilling",
    "exciting","delightful","captivating","incredible","remarkable","refreshing","fine"
])

negative_words_V2 = set([
    "bad","terrible","horrible","awful","worst","boring","hate","hated","dull","stupid",
    "ridiculous","waste","wasted","disappointing","disappointed","disaster","poor",
    "weak","forgettable","mess","confusing","annoying","irritating","painful","predictable",
    "cliche","cliched","lame","pathetic","mediocre","unfunny","nonsense","pointless",
    "garbage","trash","crap","ugly","flawed","slow","tedious","overrated","unwatchable",
    "fails","failed","failure","nasty","gross","insulting"
])

punctuation_set = {".", "!", "?", ",", ";"}


def calculate_features_V2(text):
    features = []
    unique_punctuation_count = 0
    positive_word_count = 0
    negative_word_count = 0
    
    negated_token_count = 0      # New: count tokens like not_good, not_funny
    negated_pos_count = 0        # New: count not_ + positive (e.g., not_good)
    negated_neg_count = 0        # New: count not_ + negative (e.g., not_awful)

    words = text.split()

    word_count = len(words)

    for word in words: 
        if word.startswith("not_"):
            negated_token_count += 1
            base = word[4:]
            if base in positive_words_V2:
                negated_pos_count += 1
            if base in negative_words_V2:
                negated_neg_count += 1
            continue

        if word in positive_words_V2:
            positive_word_count += 1
        if word in negative_words_V2:
            negative_word_count += 1
    
    found_punctuation = set()

    for char in text:
        if char in punctuation_set:
            found_punctuation.add(char)

    unique_punctuation_count = len(found_punctuation)
    
    sentiment_difference = positive_word_count - negative_word_count

    denominator = max(word_count, 1)
    positive_ratio = positive_word_count / denominator
    negative_ratio = negative_word_count / denominator
    # New: Ratio of negated tokens: 
    negated_ratio = negated_token_count / denominator

    features = [
        word_count,
        unique_punctuation_count,
        positive_word_count,
        negative_word_count,
        sentiment_difference,
        positive_ratio,
        negative_ratio,
        negated_token_count,     # New
        negated_pos_count,       # New
        negated_neg_count,       # New
        negated_ratio            # New
    ]

    return np.array(features, dtype=float)


In [11]:
# positive_words_V2, negative_words_V2 and punctuation_set are reused from the previous cell.


def calculate_features_V3(text):
    features = []
    unique_punctuation_count = 0
    positive_word_count = 0
    negative_word_count = 0
    
    negated_token_count = 0      
    negated_pos_count = 0        
    negated_neg_count = 0    

    # New: Count exclamation and Question marks
    exclamation_counter = text.count("!")
    question_counter = text.count("?")
    
    # New: Check if there are multiple question or exclamation marks
    has_multi_exclam = 0
    has_multi_question = 0
    if "!!" in text:
        has_multi_exclam = 1
    if "??" in text:
        has_multi_question = 1

    words = text.split()

    word_count = len(words)

    for word in words: 
        if word.startswith("not_"):
            negated_token_count += 1
            base = word[4:]
            if base in positive_words_V2:
                negated_pos_count += 1
            if base in negative_words_V2:
                negated_neg_count += 1
            continue

        if word in positive_words_V2:
            positive_word_count += 1
        if word in negative_words_V2:
            negative_word_count += 1
    
    found_punctuation = set()

    for char in text:
        if char in punctuation_set:
            found_punctuation.add(char)

    unique_punctuation_count = len(found_punctuation)
    
    sentiment_difference = positive_word_count - negative_word_count

    denominator = max(word_count, 1)
    positive_ratio = positive_word_count / denominator
    negative_ratio = negative_word_count / denominator
    # Ratio of negated tokens (introduced in V2)
    negated_ratio = negated_token_count / denominator

    features = [
        word_count,
        unique_punctuation_count,
        positive_word_count,
        negative_word_count,
        sentiment_difference,
        positive_ratio,
        negative_ratio,
        negated_token_count,     
        negated_pos_count,       
        negated_neg_count,       
        negated_ratio,
        exclamation_counter, #New
        question_counter, #New
        has_multi_question, #New
        has_multi_exclam #New           
    ]

    return np.array(features, dtype=float)


In [12]:
############################ V4 ################################# Note: the code style differs from V1-V3 because this version was written by a different group member
### YOUR CODE HERE

from collections import Counter
# NEW : Use of string library to import all types of punctuation symbols (full coverage)
import string
punctuation = set(string.punctuation)

def calculate_features_v4(text):
    features = []

    # ensure text is a string
    if not isinstance(text, str):
        text = str(text)

    # Tokenize using regex (keeps underscores from negation: not_word)
    words = re.findall(r"\b[\w_]+\b", text.lower())
    word_count = len(words)
    features.append(word_count)

    # distinct punctuation characters in raw text
    punct_counts = set(c for c in text if c in punctuation)
    features.append(len(punct_counts))

    # count positive / negative words (compare base word, strip leading 'not_' if present)
    def base(w):
        return w[4:] if w.startswith("not_") else w

    num_positive = sum(1 for w in words if base(w) in positive_words_V2) # using the previously defined words corpora
    num_negative = sum(1 for w in words if base(w) in negative_words_V2)
    features.append(num_positive)
    features.append(num_negative)

    # sentiment difference, per-word ratios, and add-one smoothed log-odds
    sentimental_diff = (num_positive - num_negative)
    denominator = max(word_count, 1)
    pos_ratio = num_positive / denominator
    neg_ratio = num_negative / denominator
    log_odds = np.log((num_positive+1)/(num_negative+1))

    # negation counts (tokens starting with 'not_')
    num_negations = sum(1 for w in words if w.startswith("not_"))
    num_positive_negations = sum(1 for w in words if w.startswith("not_") and w[4:] in positive_words_V2)
    num_negative_negations = sum(1 for w in words if w.startswith("not_") and w[4:] in negative_words_V2)
    negation_ratio = num_negations / denominator

    # exaggeration symbols from raw text
    count_exclamations = text.count("!")
    count_questions = text.count("?")
    has_multiple_exclamations = int(re.search(r'!{2,}', text) is not None)
    has_multiple_questions = int(re.search(r'\?{2,}', text) is not None)

    features.extend([
        sentimental_diff, 
        pos_ratio, 
        neg_ratio,
        num_negations,
        num_positive_negations,
        num_negative_negations,
        negation_ratio,
        count_exclamations, 
        count_questions, 
        has_multiple_questions, 
        has_multiple_exclamations, 
        log_odds # NEW - log odds for measuring quantified sentimental difference
    ])

    return np.array(features, dtype=float)

### YOUR CODE ENDS HERE
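
The `log_odds` feature in V4 is an add-one smoothed log ratio of positive to negative word counts. Writing $n_+$ and $n_-$ for `num_positive` and `num_negative` (the symbols are introduced here for illustration only):

$$
\text{log\_odds} = \log\frac{n_+ + 1}{n_- + 1}
$$

For example, a review with 3 positive words and 1 negative word gives $\log(4/2) = \log 2 \approx 0.693$, while a review with no lexicon hits gives $\log(1/1) = 0$. The added 1 in the numerator and denominator avoids both division by zero and $\log 0$, and the sign of the feature indicates which polarity dominates.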

In [14]:
############################ V5 ################################# Note: the code style differs from V1-V3 because this version was written by a different group member
### YOUR CODE HERE

from collections import Counter
import string
punctuation = set(string.punctuation)
# NEW: use the NLTK opinion lexicon for much larger coverage of positive and negative words
# (requires NLTK to be installed first: pip install nltk)
import nltk
from nltk.corpus import opinion_lexicon
nltk.download('opinion_lexicon')
pos_words = set(opinion_lexicon.positive())
neg_words = set(opinion_lexicon.negative())
# -------------------------------

def calculate_features_v5(text):
    features = []

    # ensure text is a string
    if not isinstance(text, str):
        text = str(text)

    # Tokenize using regex (keeps underscores from negation: not_word)
    words = re.findall(r"\b[\w_]+\b", text.lower())
    word_count = len(words)
    features.append(word_count)

    # distinct punctuation characters in raw text
    punct_counts = set(c for c in text if c in punctuation)
    features.append(len(punct_counts))

    # count positive / negative words (compare base word, strip leading 'not_' if present)
    def base(w):
        return w[4:] if w.startswith("not_") else w

    num_positive = sum(1 for w in words if base(w) in pos_words)
    num_negative = sum(1 for w in words if base(w) in neg_words)
    features.append(num_positive)
    features.append(num_negative)

    # sentiment difference, per-word ratios, and add-one smoothed log-odds
    sentimental_diff = (num_positive - num_negative)
    denominator = max(word_count, 1)
    pos_ratio = num_positive / denominator
    neg_ratio = num_negative / denominator
    log_odds = np.log((num_positive+1)/(num_negative+1))

    # negation counts (tokens starting with 'not_')
    num_negations = sum(1 for w in words if w.startswith("not_"))
    num_positive_negations = sum(1 for w in words if w.startswith("not_") and w[4:] in pos_words)
    num_negative_negations = sum(1 for w in words if w.startswith("not_") and w[4:] in neg_words)
    negation_ratio = num_negations / denominator

    # exaggeration symbols from raw text
    count_exclamations = text.count("!")
    count_questions = text.count("?")
    has_multiple_exclamations = int(re.search(r'!{2,}', text) is not None)
    has_multiple_questions = int(re.search(r'\?{2,}', text) is not None)

    features.extend([
        sentimental_diff, 
        pos_ratio, 
        neg_ratio,
        num_negations,
        num_positive_negations,
        num_negative_negations,
        negation_ratio,
        count_exclamations, 
        count_questions, 
        has_multiple_questions, 
        has_multiple_exclamations, 
        log_odds 
    ])

    return np.array(features, dtype=float)

### YOUR CODE ENDS HERE

ModuleNotFoundError: No module named 'nltk'

The function below will apply your feature extraction implementation to a specified dataset.

In [15]:
def calculate_features_dataset(dataset, features_fn):
    all_features = []
    for e in tqdm.tqdm(dataset, desc='Extracting features'):
        text = e['text']
        features = features_fn(text)
        all_features.append(features)
    all_features = np.array(all_features, dtype=float)
    return all_features

We can now compute the features for the `train` and `validation` splits. Later you will need to do the same for the `test` split.

In [16]:
X_train = calculate_features_dataset(dataset['train'], calculate_features)
X_valid = calculate_features_dataset(dataset['validation'], calculate_features)

Extracting features: 100%|██████████| 20000/20000 [00:00<00:00, 21916.65it/s]
Extracting features: 100%|██████████| 5000/5000 [00:00<00:00, 19402.71it/s]


### 2.1 Classification

In this section, we will create and train a logistic regression classifier. We will train it on the `train` split and evaluate it on the `validation` split. Later, you will do a final comparison between methods on the `test` split, but it is important not to use the test data while tuning the methods.
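
As a reminder of what the classifier computes (standard logistic regression; the symbols $\mathbf{w}$, $b$ and $\mathbf{x}$ are introduced here for illustration, with $\mathbf{x}$ the feature vector returned by `calculate_features`):

$$
P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}
$$

The model predicts the positive class when this probability exceeds 0.5, that is, when $\mathbf{w}^\top \mathbf{x} + b > 0$. Training chooses $\mathbf{w}$ and $b$ to minimize the log loss on the training set, with an L2 penalty by default in scikit-learn.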

In [17]:
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
classifier.fit(X_train, y_train)


Let's check the performance on the training data:

In [18]:
print('Features results: train')
pred_train = classifier.predict(X_train)
print(accuracy_score(y_train, pred_train))

Features results: train
0.6804


... and the validation dataset:

In [19]:
print('Features results: validation')
pred_valid = classifier.predict(X_valid)
print(accuracy_score(y_valid, pred_valid))

Features results: validation
0.6908


<a name='e3'></a>
#### Exercise 3: Improving the features

(5p) Iteratively improve your hand-crafted features. Think about what information in the review might be useful for predicting the rating a person gave to the movie. You can also look into the expected format (or range) of features for the classifier.

Document the steps you tried (even if unsuccessful) and how they influenced the metrics. Try at least 3 modifications from your original implementation.

In [20]:
### YOUR CODE HERE
'''
Pipeline for V1
'''
X_train = calculate_features_dataset(dataset['train'], calculate_features_V1)
X_valid = calculate_features_dataset(dataset['validation'], calculate_features_V1)
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
classifier.fit(X_train, y_train)
print('Features results: train')
pred_train = classifier.predict(X_train)
print(accuracy_score(y_train, pred_train))
print('Features results: validation')
pred_valid = classifier.predict(X_valid)
print(accuracy_score(y_valid, pred_valid))
### YOUR CODE ENDS HERE

Extracting features: 100%|██████████| 20000/20000 [00:00<00:00, 22076.23it/s]
Extracting features: 100%|██████████| 5000/5000 [00:00<00:00, 22134.56it/s]


Features results: train
0.6803
Features results: validation
0.6906


In [21]:
### YOUR CODE HERE
'''
Pipeline for V2
'''
X_train = calculate_features_dataset(dataset['train'], calculate_features_V2)
X_valid = calculate_features_dataset(dataset['validation'], calculate_features_V2)
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
classifier.fit(X_train, y_train)
print('Features results: train')
pred_train = classifier.predict(X_train)
print(accuracy_score(y_train, pred_train))
print('Features results: validation')
pred_valid = classifier.predict(X_valid)
print(accuracy_score(y_valid, pred_valid))
### YOUR CODE ENDS HERE

Extracting features: 100%|██████████| 20000/20000 [00:01<00:00, 17566.05it/s]
Extracting features: 100%|██████████| 5000/5000 [00:00<00:00, 18164.96it/s]


Features results: train
0.7669
Features results: validation
0.785


In [22]:
### YOUR CODE HERE
'''
Pipeline for V3
'''
X_train = calculate_features_dataset(dataset['train'], calculate_features_V3)
X_valid = calculate_features_dataset(dataset['validation'], calculate_features_V3)
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
classifier.fit(X_train, y_train)
print('Features results: train')
pred_train = classifier.predict(X_train)
print(accuracy_score(y_train, pred_train))
print('Features results: validation')
pred_valid = classifier.predict(X_valid)
print(accuracy_score(y_valid, pred_valid))
### YOUR CODE ENDS HERE

Extracting features: 100%|██████████| 20000/20000 [00:01<00:00, 16772.35it/s]
Extracting features: 100%|██████████| 5000/5000 [00:00<00:00, 17221.84it/s]
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Features results: train
0.77195
Features results: validation
0.7912


In [23]:

### YOUR CODE HERE
'''
Pipeline for V4
'''
X_train = calculate_features_dataset(dataset['train'], calculate_features_v4)
X_valid = calculate_features_dataset(dataset['validation'], calculate_features_v4)
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
classifier.fit(X_train, y_train)
print('Features results: train')
pred_train = classifier.predict(X_train)
print(accuracy_score(y_train, pred_train))
print('Features results: validation')
pred_valid = classifier.predict(X_valid)
print(accuracy_score(y_valid, pred_valid))
### YOUR CODE ENDS HERE

Extracting features: 100%|██████████| 20000/20000 [00:02<00:00, 6683.67it/s]
Extracting features: 100%|██████████| 5000/5000 [00:00<00:00, 7097.04it/s]
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Features results: train
0.79635
Features results: validation
0.8144


In [None]:
### YOUR CODE HERE
'''
Pipeline for V5
'''
X_train = calculate_features_dataset(dataset['train'], calculate_features_v5)
X_valid = calculate_features_dataset(dataset['validation'], calculate_features_v5)
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
classifier.fit(X_train, y_train)
print('Features results: train')
pred_train = classifier.predict(X_train)
print(accuracy_score(y_train, pred_train))
print('Features results: validation')
pred_valid = classifier.predict(X_valid)
print(accuracy_score(y_valid, pred_valid))
### YOUR CODE ENDS HERE

#### First run (used standard version): 
Train set: 0.70365. 

Validation set: 0.702.

In this run we used the very first version of the extraction function. The results are better than random guessing but can definitely be improved.

#### Second run (used V1): 
Train set: 0.7037

Validation set: 0.7018

This approach was rather unsuccessful. We added a sentiment difference and a normalized ratio of positive and negative words, but these changes did not significantly improve performance. This is likely because Logistic Regression can already learn linear combinations of the positive and negative word counts.
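
One way to see why the difference feature is redundant is the following sketch. Here $p$ and $n$ denote the positive and negative word counts, and $w_1, w_2, w_3, b$ are illustrative weights, not values taken from our model:

$$
z = w_1 p + w_2 n + w_3 (p - n) + b = (w_1 + w_3)\,p + (w_2 - w_3)\,n + b
$$

Any decision function that uses the difference can therefore be reproduced from the two counts alone. The normalized ratio is not linear in $p$ and $n$, so this argument does not strictly cover it, but in our run it did not help either.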

#### Third run (used V2): 
Train set: 0.7942

Validation set: 0.7892

Here we see a significant improvement in accuracy on both the train and validation sets. The main changes in V2 were adding more positive and negative words and accounting for negated tokens. Our cleaning functions merge negations such as "not good" into a single token, "not_good". In V2 we count these negated tokens, compute their ratio, and add both to the feature vector. The larger word lists led to more reviews triggering positive and negative matches, which drastically improved coverage. Furthermore, the negation-specific features introduced a completely new signal to the classifier, which likely also helped performance.
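
The negation handling described above can be sketched roughly as follows. This is a minimal illustration, not our actual cleaning code: the names `NEGATIONS`, `merge_negations` and `negation_features` are hypothetical, and the set of negation words is an assumption.

```python
NEGATIONS = {"not", "no", "never"}  # assumed trigger words, for illustration only


def merge_negations(tokens):
    """Join a negation word with the token after it: ['not', 'good'] -> ['not_good']."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(f"{tokens[i]}_{tokens[i + 1]}")
            i += 2  # skip the token that was merged into the negation
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def negation_features(tokens):
    """Return (number of negated tokens, their share of all tokens)."""
    negated = sum(
        1 for t in tokens if "_" in t and t.split("_", 1)[0] in NEGATIONS
    )
    ratio = negated / len(tokens) if tokens else 0.0
    return negated, ratio


tokens = merge_negations("the plot was not good but the cast was great".split())
print(tokens)
print(negation_features(tokens))
```

Both values can then be appended to the feature vector next to the positive and negative word counts.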

#### Fourth run (used V3): 
Train set: 0.7943

Validation set: 0.789

This approach was rather unsuccessful. We added counts of exclamation and question marks and captured runs of multiple exclamation and question marks. These changes had no noticeable positive or negative impact, which indicates that most of the sentiment is already captured by words.



<a name='e4'></a>
#### Exercise 4: Improving the evaluation

(5p) In the previous cells, we only looked at the accuracy of predictions. Investigate which other metrics might be better for our case. You can check the documentation of scikit-learn for evaluation metrics ([https://scikit-learn.org/stable/api/sklearn.metrics.html#classification-metrics](https://scikit-learn.org/stable/api/sklearn.metrics.html#classification-metrics)). Give reasons why the metrics you try can be more informative than raw accuracy score.

Decide which evaluation metric(s) is most suitable for our use-case and give reasons why. Test your features-based classifier and all further classifiers on that metric (apart from the accuracy score).

In [24]:
### YOUR CODE HERE
# We decided to add Precision, Recall, the F1 score and we compute a confusion matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix



X_train = calculate_features_dataset(dataset['train'], calculate_features_V3)
X_valid = calculate_features_dataset(dataset['validation'], calculate_features_V3)
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
classifier.fit(X_train, y_train)
print('Features results: train')
pred_train = classifier.predict(X_train)
print(accuracy_score(y_train, pred_train))
print('Features results: validation')
pred_valid = classifier.predict(X_valid)

print("Accuracy Score:")
print(accuracy_score(y_valid, pred_valid))

print("F1 Score:")
print(f1_score(y_valid, pred_valid))

print("Precision Score:")
print(precision_score(y_valid, pred_valid))

print("Recall Score:")
print(recall_score(y_valid, pred_valid))

print("Confusion Matrix:")
print(confusion_matrix(y_valid, pred_valid))

### YOUR CODE ENDS HERE

Extracting features: 100%|██████████| 20000/20000 [00:01<00:00, 16473.05it/s]
Extracting features: 100%|██████████| 5000/5000 [00:00<00:00, 17402.99it/s]
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Features results: train
0.77195
Features results: validation
Accuracy Score:
0.7912
F1 Score:
0.7979876160990712
Precision Score:
0.7682563338301043
Recall Score:
0.8301127214170693
Confusion Matrix:
[[1894  622]
 [ 422 2062]]


--- YOUR ANSWERS HERE
#### Answer Exercise 4 

We added the following metrics: **F1 Score, Precision, Recall, Confusion Matrix**

The reasons for adding each metric are as follows. Accuracy alone gives a general overview of model performance but cannot distinguish between different types of classification errors. To take false positives and false negatives into account, we added precision and recall, which separately evaluate how reliable positive predictions are (precision) and how well the model detects actual positive reviews (recall). We also added the F1 score because it combines precision and recall into a single number, which makes it more informative than accuracy alone. The confusion matrix provides an overview of the distribution of True Positives, True Negatives, False Positives and False Negatives.

Obtained Results: 

**Accuracy**: 0.7776

**F1 Score**: 0.7892342683851402

**Precision**: 0.7620790629575402

**Recall**: 0.8183962264150944

**Confusion Matrix**: TN = 1806, FP = 650, FN = 462, TP = 2082

One interesting observation is that **Recall > Precision**. The model detects most positive reviews as positive, as indicated by the high recall of about 82%. However, this comes at the cost of predicting some negative reviews as positive. The confusion matrix confirms that false positives occur more often than false negatives (650 > 462) although the dataset is roughly balanced. This indicates a mild bias towards predicting positive.

Another important observation is that **Recall > F1 score > Precision**. This behaviour is expected, as the F1 score is the harmonic mean of Precision and Recall and therefore always lies between the two. The F1 score balances the trade-off between detecting positive reviews and avoiding incorrect positive predictions. The relatively high F1 score indicates that the classifier achieves a good balance between identifying most positive reviews and limiting false positive predictions.
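
As a sanity check, the reported metrics can be recomputed by hand from the four confusion-matrix counts listed above. The snippet below is a small sketch that uses only those counts and the standard definitions of the metrics:

```python
# Counts taken from the confusion matrix reported in this answer.
tn, fp, fn, tp = 1806, 650, 462, 2082

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # how reliable positive predictions are
recall = tp / (tp + fn)     # how many actual positives are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# accuracy=0.7776 precision=0.7621 recall=0.8184 f1=0.7892
```

The values agree with the scikit-learn results listed above, and the F1 score falls between precision and recall as expected.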



## 3. Bag-of-Words Classifier

Similar to the previous lab, we will use the classic bag-of-words representation as one of our embeddings. While it is simple and does not preserve the positions of words, it gives our classifier a lot of useful information.
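
To illustrate what "does not preserve the positions of words" means, here is a tiny sketch with a made-up four-word vocabulary. The helper `toy_bow` is hypothetical and only for illustration; it is not the `bag_of_words()` function required in Exercise 5.

```python
from collections import Counter


def toy_bow(text, vocabulary):
    """Count how often each vocabulary word occurs in a whitespace-tokenized text."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


vocabulary = ["good", "not", "bad", "movie"]  # toy vocabulary, illustration only

a = toy_bow("not good , bad movie", vocabulary)
b = toy_bow("not bad , good movie", vocabulary)
print(a, b, a == b)  # both sentences map to the same vector
```

The two sentences express opposite opinions but share the same word counts, so a BOW classifier cannot tell them apart.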

<a name='e5'></a>
#### Exercise 5: Implementing BOW

(5p) Implement the BOW. In this exercise, we do not give you a rigid structure, so you can conjure your own. Your code should produce two things: the `token_to_id` dictionary, and a `bag_of_words()` function that accepts a list of tokens and the `token_to_id` dictionary and returns the BOW representation as a numpy array.

In [25]:
#### YOUR CODE HERE

MAX_VOCAB_SIZE = 1_000

# The goal is to implement the `bag_of_words(tokens, token_to_id)` function similar to the previous lab.
# You might want to follow the steps:
# - tokenize the `text` column in the dataset,
# - extract the vocabulary from the tokens,
# - limit the vocabulary to `MAX_VOCAB_SIZE`,
# - calculate the `token_to_id` dictionary
# - implement the `bag_of_words(tokens, token_to_id)` function.
def tokenise(text):
    tokens = text.split()
    return tokens

dataset = dataset.map(lambda ex: {"tokens": tokenise(ex["text"])})
    
def build_vocab_counter():
    vocab_counter = Counter()
    for tokens in dataset["train"]["tokens"]:
        vocab_counter.update(tokens)
    return vocab_counter

def build_token_to_id(vocab: Counter, vocab_size: int):
    token_to_id = {}
    most_common_tokens = vocab.most_common(vocab_size)

    for index, (token, count) in enumerate(most_common_tokens):
        token_to_id[token] = index
    return token_to_id

def bag_of_words(tokens, token_to_id):
    """
    Creates a bag-of-words representation of the sentence
    Args:
        tokens: a list of tokens
        token_to_id: a dictionary mapping each word to an index in the vocabulary

    Returns: a numpy array of size vocab_size with the counts of each word in the vocabulary
    """
    vocab_size = len(token_to_id)

    bow_vector = np.zeros(vocab_size, dtype=int)
    
    for token in tokens: 
        vocab_idx = token_to_id.get(token)
        if vocab_idx is not None: 
            bow_vector[vocab_idx] += 1

    return bow_vector

vocab = build_vocab_counter()
token_to_id = build_token_to_id(vocab, MAX_VOCAB_SIZE)



#### YOUR CODE ENDS HERE

Map:   0%|          | 0/20000 [00:00<?, ? examples/s]

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Here, we will use your implemented function to calculate the bag-of-words for each example in the `train` and `validation` subsets.

In [26]:
train_bows = []
for example in tqdm.tqdm(dataset['train'], desc='Calculating test BOWs'):
    train_bows.append(bag_of_words(example['tokens'], token_to_id))
train_bows = np.array(train_bows, dtype=float)

valid_bows = []
for example in tqdm.tqdm(dataset['validation'], desc='Calculating validation BOWs'):
    valid_bows.append(bag_of_words(example['tokens'], token_to_id))
valid_bows = np.array(valid_bows, dtype=float)

Calculating test BOWs: 100%|██████████| 20000/20000 [00:02<00:00, 8450.15it/s]
Calculating validation BOWs: 100%|██████████| 5000/5000 [00:00<00:00, 8421.45it/s]


Finally, we can train the classifier on the BOW representations and the labels in the `train` split.

In [27]:
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
print('Training classifier...')
classifier.fit(train_bows, y_train)

Training classifier...




Let's evaluate the classifier:

In [34]:
print('BOW results: train')
pred_train = classifier.predict(train_bows)
print(accuracy_score(y_train, pred_train))

print('BOW results: validation')
pred_valid = classifier.predict(valid_bows)
print(accuracy_score(y_valid, pred_valid))

BOW results: train
0.8632
BOW results: validation
0.8398


<a name='e6'></a>
#### Exercise 6: Tuning the model

(5p) Try different values for the vocab size. Experiment with adding the hand-crafted features. Test the model on the evaluation metric of your choice (remember to use the validation split).

In [35]:
#### YOUR CODE HERE

from sklearn.preprocessing import StandardScaler


def make_bow_matrix(split_name: str, token_to_id):
    bows = []
    for example in tqdm.tqdm(dataset[split_name], desc='Calculating BOWs'):
        bows.append(bag_of_words(example['tokens'], token_to_id))
    return np.array(bows, dtype=float)

def make_handcrafted_matrix(split_name):
    features = []
    for example in tqdm.tqdm(dataset[split_name], desc='Calculating Hand crafted features'):
        features.append(calculate_features_V3(example["text"]))
    return np.vstack(features).astype(float)

def run_experiment(vocab_size: int, use_handcrafted: bool):
    vocab = build_vocab_counter()
    token_to_id = build_token_to_id(vocab, vocab_size)

    X_train = make_bow_matrix("train", token_to_id)
    X_valid = make_bow_matrix("validation", token_to_id)

    if use_handcrafted:
        train_feats = make_handcrafted_matrix("train")
        valid_feats = make_handcrafted_matrix("validation")

        scaler = StandardScaler()
        train_feats_scaled = scaler.fit_transform(train_feats)
        valid_feats_scaled = scaler.transform(valid_feats)

        X_train = np.hstack([X_train, train_feats_scaled])
        X_valid = np.hstack([X_valid, valid_feats_scaled])

    classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
    classifier.fit(X_train, y_train)
    
    print('BOW results: train')
    pred_train = classifier.predict(X_train)

    print('BOW results: validation')
    pred_valid = classifier.predict(X_valid)

    results = {
        "vocab_size": vocab_size,
        "use_handcrafted": use_handcrafted,
        "train_acc": accuracy_score(y_train, pred_train),
        "valid_acc": accuracy_score(y_valid, pred_valid),
        "valid_f1": f1_score(y_valid, pred_valid),
        "valid_precision": precision_score(y_valid, pred_valid),
        "valid_recall": recall_score(y_valid, pred_valid),
        "confusion_matrix": confusion_matrix(y_valid, pred_valid)
    }
    
    return results

VOCAB_SIZES = [500, 1000, 1500, 2000, 5000, 10000]
all_results = []

for vs in VOCAB_SIZES:
    # BOW only
    r1 = run_experiment(vs, use_handcrafted=False)
    all_results.append(r1)

    # BOW + handcrafted
    r2 = run_experiment(vs, use_handcrafted=True)
    all_results.append(r2)

for r in all_results:
    tag = "BOW+HC" if r["use_handcrafted"] else "BOW"
    print(f"\n=== {tag} | vocab={r['vocab_size']} ===")
    print("Train acc:", r["train_acc"])
    print("Valid acc:", r["valid_acc"])
    print("Valid F1:", r["valid_f1"])
    print("Valid Precision:", r["valid_precision"])
    print("Valid Recall:", r["valid_recall"])
    print("Confusion Matrix:\n", r["confusion_matrix"])
#### YOUR CODE ENDS HERE

Calculating BOWs:  71%|███████   | 14210/20000 [00:01<00:00, 8460.57it/s]


KeyboardInterrupt: 

### Results Exercise 6 

We evaluated the performance of the BOW classifier paired with handcrafted features and on its own for different vocab sizes.

**The used vocab sizes were: 500, 1000, 1500, 2000, 5000, 10000**

| Vocab Size | BOW Acc    | BOW+HC Acc | Δ Acc       | BOW F1     | BOW+HC F1 | Δ F1        |
| ---------- | ---------- | ---------- | ----------- | ---------- | --------- | ----------- |
| 500        | 0.8066     | 0.8266     | **+0.0200** | 0.8054     | 0.8261    | **+0.0208** |
| 1000       | 0.8308     | 0.8378     | +0.0070     | 0.8302     | 0.8367    | +0.0065     |
| 1500       | 0.8408     | 0.8432     | +0.0024     | 0.8406     | 0.8425    | +0.0019     |
| 2000       | 0.8428     | 0.8500     | +0.0072     | 0.8423     | 0.8494    | +0.0071     |
| 5000       | 0.8532     | 0.8532     | +0.0000     | 0.8516     | 0.8517    | +0.0001     |
| 10000      | **0.8692** | 0.8684     | -0.0008     | **0.8680** | 0.8675    | -0.0005     |

Δ indicates the improvement in the metric from adding handcrafted features.

We can see that increasing the vocab size increased both accuracy and F1 score. The best accuracy and F1 score were obtained using only the BOW model without handcrafted features at a vocab size of 10000. For small vocabulary sizes, adding the handcrafted features definitely improved the performance of the model; for large vocabulary sizes, however, they provided little to no improvement.

- Validation accuracy increased from 0.8066 (500 vocab) to 0.8692 (10,000 vocab).
- F1-score improved from 0.8054 to 0.8680.
- Handcrafted features improved the performance in accuracy and F1 score by about 0.02 for vocab size = 500

The largest single gain in BOW validation accuracy occurred when increasing the vocabulary from 500 to 1000 tokens (+0.0242). Going from 5000 to 10,000 tokens still added +0.0160.

However, training accuracy increased much more sharply:

- 0.8297 (500 vocab)
- 0.9962 (10,000 vocab)

**This indicates increasing model capacity and potential overfitting at very large vocabulary sizes.**
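
The size of the gap can be made explicit with a short calculation. This sketch assumes the training accuracies above belong to the same BOW-only runs as the validation accuracies in the table:

```python
# Accuracies copied from the results above (assumed to be the BOW-only runs).
results = {
    500: {"train": 0.8297, "valid": 0.8066},
    10_000: {"train": 0.9962, "valid": 0.8692},
}

gaps = {}
for vocab_size, acc in results.items():
    gaps[vocab_size] = acc["train"] - acc["valid"]
    print(f"vocab={vocab_size:>6}: train-valid gap = {gaps[vocab_size]:.4f}")
```

Under that assumption the gap grows from about 0.02 to about 0.13, even though validation accuracy is still improving.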




## 4. BERT Model

For this part of the lab, we will be using a pre-trained BERT model from Huggingface, namely [BERT Cased](https://huggingface.co/google-bert/bert-base-cased). You can read the original paper that introduced this model [here](https://aclanthology.org/N19-1423.pdf). It is one of the most cited papers ever (currently more than 100,000 citations).

We will specify the model name, which can be found on the model's card on Huggingface (revisit the first link). Make sure to check what other information Huggingface offers (e.g. how to use the model, its limitations, how to run inference, etc.).

In [30]:
model_name = 'google-bert/bert-base-cased'

### 4.1 Tokenizer

The models on huggingface come with their own tokenizers. They are loaded separately from the models. We can use [AutoTokenizer](https://huggingface.co/docs/transformers/v4.40.2/en/model_doc/auto#transformers.AutoTokenizer)'s `from_pretrained()` method to load it.

Inspect the output: The loaded object is of `BertTokenizer` class. Check the documentation [here](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertTokenizer).

In [36]:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
print(tokenizer)

BertTokenizer(name_or_path='google-bert/bert-base-cased', vocab_size=28996, model_max_length=512, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)


Next, let's see how we can use it to tokenize some text.

In [37]:
print(dataset['test'][0]['text'])
tokenized = tokenizer(dataset['test'][0]['text'], padding=True, return_tensors='pt')
print("---")
print(type(tokenized))
print("---")
print(tokenized)

debbie boone had monster hit with her recording of pop song "you light up my life;" didi conn film of same name, however, was horrifically embarrassing flop. conn plays stereotypically goofy-homely-vulnerable girl who is in love with michael zaslow, who plays stereotypical yuppie-wannabe guy. they are engaged, but every one knows that zaslow is not_going to marry any one that is not_blonde and built, so only didi is surprised when he dumps her. needless to say, didi is quite embarrassed.fortunately, she has been doing little songwriting in her spare time, and she's come up with tune she thinks is pretty nifty. she calls it--can you guess?--"you light up my life." she hops in car and drives off to big city to sell her song and make new life. now, i recall sitting in theatre and watching her hop in car to drive off to big city, and thinking "well thank heavens, we've finally got all exposition out of way. now maybe something interesting will happen." and something interesting did happen.

Examine the outputs: The tokenizer returned three things:
- `input_ids` - a PyTorch tensor ([https://pytorch.org/docs/stable/tensors.html](https://pytorch.org/docs/stable/tensors.html)) holding the vocabulary indices of our tokens. PyTorch tensors are similar to NumPy arrays: they hold data in a multidimensional array. The difference is that PyTorch tensors can be placed and operated on on the GPU, which can greatly speed up execution.
- `token_type_ids` - this tensor holds the index of the segment (sentence) each token belongs to. It stems from the next sentence prediction objective of the original paper, where two sentences were given and the model had to predict whether they are connected. Because we passed only a single text, it contains only zeros. We will not be concerned with it in this lab.
- `attention_mask` - the mask that the model uses to determine whether the tokens in `input_ids` are real tokens or *padding*. Padding is a technique used to ensure that all input sequences have the same length. BERT (like many other NLP models) processes data in batches and requires each sequence in a batch to have the same length, so sequences that are shorter than the longest sequence in the batch are padded with a special padding token. In this case, because we passed only a single text, the mask contains only ones. Later you will see examples where this is not the case.
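
To make the role of the attention mask concrete, here is a minimal pure-Python sketch of what padding a batch looks like. It is not the tokenizer's actual implementation: the helper `pad_batch` and the token ids are made up for illustration, and the padding id is assumed to be 0.

```python
PAD_ID = 0  # assumed padding id for this sketch


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad lists of token ids to the longest length and build matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens keep their ids, the rest is filled with the padding id.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two sequences of different length.
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

The second sequence is two tokens shorter than the first, so it receives two padding ids and two zeros in its mask.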

Let's see how exactly the sentence was tokenized and how we can retrieve the original text. Notice that some words have been split into multiple tokens (remember when we discussed sub-word tokenization in class?). Also pay attention to the added special tokens, namely `CLS` and `SEP`:

The `[CLS]` token is a special classification token added at the beginning of every input sequence. It stands for "classification" (daah!) and is used by BERT to aggregate information from the entire sequence. The final hidden state corresponding to this token (after passing through the transformer layers) is used as the aggregate sequence representation for classification tasks. We will use this later in the lab!

The `[SEP]` token is used to separate different segments or sentences within the input sequence. It stands for "separator" (daaah again!).
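
The sub-word splitting can be illustrated with a simplified sketch of greedy longest-match-first matching, the strategy WordPiece tokenizers use to split a word into pieces. The function names and the tiny `toy_vocab` are hypothetical. The real tokenizer also normalizes text, splits punctuation and uses a much larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Split on whitespace, apply WordPiece and add the special tokens."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical miniature vocabulary, just large enough for the example.
toy_vocab = {"did", "##i", "con", "##n", "plays", "stereo", "##ty", "##pical"}
print(toy_tokenize("didi conn plays stereotypical", toy_vocab))
```

With this toy vocabulary, `stereotypical` is split into `stereo`, `##ty` and `##pical`, which mirrors the splits you will see in the real tokenizer output.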

In [38]:
print(tokenized['input_ids'].shape)
print("---")
print(tokenizer.convert_ids_to_tokens(tokenized['input_ids'][0]))
print("---")
print(len(tokenizer.convert_ids_to_tokens(tokenized['input_ids'][0])))
print("---")
print(tokenizer.decode(tokenized['input_ids'][0]))
print("---")
print(tokenizer.decode(tokenized['input_ids'][0], skip_special_tokens=True))

torch.Size([1, 389])
---
['[CLS]', 'de', '##bb', '##ie', 'b', '##oon', '##e', 'had', 'monster', 'hit', 'with', 'her', 'recording', 'of', 'pop', 'song', '"', 'you', 'light', 'up', 'my', 'life', ';', '"', 'did', '##i', 'con', '##n', 'film', 'of', 'same', 'name', ',', 'however', ',', 'was', 'horrific', '##ally', 'embarrassing', 'fl', '##op', '.', 'con', '##n', 'plays', 'stereo', '##ty', '##pical', '##ly', 'go', '##of', '##y', '-', 'home', '##ly', '-', 'vulnerable', 'girl', 'who', 'is', 'in', 'love', 'with', 'mi', '##cha', '##el', 'z', '##as', '##low', ',', 'who', 'plays', 'stereo', '##ty', '##pical', 'y', '##up', '##pie', '-', 'wanna', '##be', 'guy', '.', 'they', 'are', 'engaged', ',', 'but', 'every', 'one', 'knows', 'that', 'z', '##as', '##low', 'is', 'not', '_', 'going', 'to', 'marry', 'any', 'one', 'that', 'is', 'not', '_', 'blonde', 'and', 'built', ',', 'so', 'only', 'did', '##i', 'is', 'surprised', 'when', 'he', 'dump', '##s', 'her', '.', 'needles', '##s', 'to', 'say', ',', 'did', '#

The tokenizer can also process a list of sentences. This creates a batched output in which the first dimension of each tensor corresponds to the batch size (the number of sentences we passed to the tokenizer). Examine the following cell and make sure it makes sense to you.

In [39]:
print(dataset['test'][0:3]['text'])
tokenized = tokenizer(dataset['test'][0:3]['text'], padding=True, return_tensors='pt')
print(tokenized)
print(tokenized['input_ids'].shape)
print(tokenizer.convert_ids_to_tokens(tokenized['input_ids'][0]))
print(len(tokenizer.convert_ids_to_tokens(tokenized['input_ids'][0])))
print(tokenizer.decode(tokenized['input_ids'][0]))
print(tokenizer.decode(tokenized['input_ids'][0], skip_special_tokens=True))

['debbie boone had monster hit with her recording of pop song "you light up my life;" didi conn film of same name, however, was horrifically embarrassing flop. conn plays stereotypically goofy-homely-vulnerable girl who is in love with michael zaslow, who plays stereotypical yuppie-wannabe guy. they are engaged, but every one knows that zaslow is not_going to marry any one that is not_blonde and built, so only didi is surprised when he dumps her. needless to say, didi is quite embarrassed.fortunately, she has been doing little songwriting in her spare time, and she\'s come up with tune she thinks is pretty nifty. she calls it--can you guess?--"you light up my life." she hops in car and drives off to big city to sell her song and make new life. now, i recall sitting in theatre and watching her hop in car to drive off to big city, and thinking "well thank heavens, we\'ve finally got all exposition out of way. now maybe something interesting will happen." and something interesting did hap

<a name='e7'></a>
#### Exercise 7: Questions about the tokenizer

Answer the following questions:
- (1p) What is the size of the vocabulary?
- (2p) What are the special tokens apart from `[CLS]` and `[SEP]`? What are their functions?

--- YOUR ANSWERS HERE

#### Vocab size:

The vocab size is **28996** tokens.

#### Tokens:

[PAD] = Padding token, used to make all sequences in a batch the same length. It has its own embedding but carries no semantic meaning, and the attention mask tells the model to ignore it.

[UNK] = Unknown token, stands in for input that cannot be represented with the tokens in the vocabulary.

[MASK] = Masking token, replaces selected tokens during masked language modeling (MLM) pretraining so that the model learns to predict the original token.

### 4.2 Loading the Model

In this section, we will load and examine the model. We will start with selecting the device we will place the model on. This will be a GPU (if one is available) or a CPU.

Google Colab offers free GPU access, subject to availability and to quotas that vary with your usage and the overall demand on Colab's resources. If you are working locally and don't have a GPU, the CPU will be selected. Running on the CPU might be okay for the first parts of the assignment, but you will need a GPU when we process the whole dataset.

The following cell will select the device for us.

In [None]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print(f'Device: {device}')

Now, let's load the model from Hugging Face and move it to the device selected in the previous cell using the `to()` method. This can be slow, because the model is heavy due to its large number of parameters.

In [None]:
model = transformers.AutoModel.from_pretrained(model_name)
print('loaded on device:', model.device)
model.to(device)
print('moved to device', model.device)
print(model)

When loading the model, you might have seen a warning about some unexpected weights. It means that the checkpoint on Hugging Face contains some additional weights that were downloaded but are not used by our model. In essence, the same weights (identified by our `model_name`) can be loaded into different but related models. In our case, those would be `BertForMaskedLM` or `BertForNextSentencePrediction` instead of our `BertModel`, which `AutoModel` selects automatically. Below is a way to load the weights into a different model.

In [None]:
# transformers.BertForMaskedLM.from_pretrained(model_name)

Next, let's use the BERT model for inference. We will tokenize the first example of our test set and pass it to the model. We set `output_hidden_states` to `True` in order to have access to the hidden states of the model. Those are the latent representations after the embedding layer and the transformer layers.

In [None]:
tokenized = tokenizer(dataset['test'][0]['text'], padding=True, return_tensors='pt').to(device)
print(tokenized)
model_output = model(**tokenized, output_hidden_states=True)

Examine the next cell and make sure everything makes sense to you. Consult the [documentation](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertModel.forward) in case of doubt.

In [None]:
print(list(model_output.keys()))
print()
print('pooler_output:')
print(type(model_output['pooler_output']))
print(model_output['pooler_output'].shape)
print()
print('hidden_states:')
print(type(model_output['hidden_states']))
print(len(model_output['hidden_states']))
print(type(model_output['hidden_states'][0]))
print(model_output['hidden_states'][0].shape)
print()
print('last_hidden_state:')
print(type(model_output['last_hidden_state']))
print(model_output['last_hidden_state'].shape)

<a name='e8'></a>
#### Exercise 8: Questions about the Model

Examine the output of the previous cells. Answer the following questions:
- (1p) What is the number of transformer layers in this model?
- (1p) What is the dimension of the embeddings?
- (1p) What is the hidden size of the FFN in the transformer layer?
- (1p) What is the total number of parameters of the model (hint: check the `num_parameters()` method of the model)?
- (1p) How can you find the vocabulary size from the model?
- (1p) What is the length of the `hidden_states` in the output? Why?

--- YOUR ANSWERS HERE

## 5. BERT Sentence Embeddings

With the model loaded and ready, we can work on obtaining the sentence embeddings. During the last lab, you averaged the token embeddings. This time we will start with something else. Remember the `[CLS]` token? Its hidden representation is often used for classification as a representation of the whole sentence. We will do exactly that.

But first, we have to tokenize the dataset using the BERT tokenizer.

<a name='e9'></a>
#### Exercise 9: BERT tokenizing examples

(5p) Fill in the following function to tokenize the examples (passed as a parameter) using the tokenizer (also a parameter). The function receives a batch of examples, which the tokenizer can handle directly, as you saw in the previous section.

In [None]:
def tokenize_text_bert(examples, tokenizer):
    """
    Tokenizes the `text` column from the batch of examples and returns the whole output of the tokenizer.
    Args:
        examples: a batch of examples
        tokenizer: the BERT tokenizer

    Returns: the tokenized `text` column (returns the whole output of the tokenizer)

    """
    ### YOUR CODE HERE
    tokenized_sentence = None



    ### YOUR CODE ENDS HERE
    return tokenized_sentence


In [None]:
dataset_tokenized_bert = dataset.map(tokenize_text_bert,
                                     fn_kwargs={'tokenizer': tokenizer},
                                     batched=True,
                                     remove_columns=dataset['train'].column_names,)
print(dataset_tokenized_bert)

<a name='e10'></a>
#### Exercise 10: BERT sentence embeddings by the CLS token

(5p) Implement the following function, which calculates the sentence embeddings from the model output (passed to the function as a parameter). It should return the embedding of the `[CLS]` token from the last layer.

In [None]:
def calculate_cls_embeddings(input_batch, model_output):
    """
    Calculates the sentence embeddings of a batch of sentences as the last-layer representation of the CLS token.
    Args:
        input_batch: tokenized batch of sentences (as returned by the tokenizer), contains `input_ids`, `token_type_ids`, and `attention_mask` tensors
        model_output: the output of the model given the `input_batch`, contains `last_hidden_state`, `pooler_output`, `hidden_states` tensors

    Returns: tensor of the hidden states of the CLS token (from the last layer) for each example in the batch

    """

    ### YOUR CODE HERE
    sentence_embeddings = None



    ### YOUR CODE ENDS HERE

    return sentence_embeddings

In [None]:
text = "The weather is nice today."
tokenized = tokenizer(text, padding=True, return_tensors='pt').to(device)
print(tokenized)
model_output = model(**tokenized, output_hidden_states=True)
print(model_output['last_hidden_state'].shape)
sentence_embedding = calculate_cls_embeddings(tokenized, model_output)
print(sentence_embedding.shape)

In [None]:
def embed_dataset(dataset, model, sentence_embedding_fn, batch_size=8):
    """Embeds every example of a tokenized dataset. Relies on the global `tokenizer` and `device`."""
    # The collator pads each batch to the length of its longest sequence.
    data_collator = transformers.DataCollatorWithPadding(tokenizer)
    data_loader = DataLoader(dataset, batch_size=batch_size, collate_fn=data_collator)
    sentence_embeddings = []
    with torch.no_grad():  # inference only, so no gradients are needed
        for batch in tqdm.tqdm(data_loader):
            batch = batch.to(device)
            model_output = model(**batch, output_hidden_states=True)
            batch_sentence_embeddings = sentence_embedding_fn(batch, model_output)
            # Move the results to the CPU so they do not accumulate in GPU memory.
            sentence_embeddings.append(batch_sentence_embeddings.detach().cpu())

    sentence_embeddings = torch.concat(sentence_embeddings, dim=0)
    return sentence_embeddings

In [None]:
bert_cls_train = embed_dataset(dataset_tokenized_bert['train'], model, calculate_cls_embeddings)
print(bert_cls_train.shape)

bert_cls_valid = embed_dataset(dataset_tokenized_bert['validation'], model, calculate_cls_embeddings)
print(bert_cls_valid.shape)

In [None]:
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
print('Training classifier...')
classifier.fit(bert_cls_train, y_train)

In [None]:
print('BERT train')
pred_train = classifier.predict(bert_cls_train)
print(accuracy_score(y_train, pred_train))

print('BERT valid')
pred_valid = classifier.predict(bert_cls_valid)
print(accuracy_score(y_valid, pred_valid))

You can test the model on the evaluation metric of your choice:

In [None]:
#### YOUR CODE HERE


#### YOUR CODE ENDS HERE
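
As a reminder of what metrics beyond accuracy measure, here is a small pure-Python sketch that computes accuracy, precision, recall and F1 for a binary task. The helper `binary_metrics` and the toy labels are invented for illustration. In the lab you would more likely rely on ready-made functions such as `precision_recall_fscore_support` or `classification_report` from `sklearn.metrics`.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: three positive and five negative examples.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0, 1]
metrics_toy = binary_metrics(y_true_toy, y_pred_toy)
print(metrics_toy)
```

In this toy case the accuracy is 0.625 even though only one of the three positive examples is found (recall of 1/3), which is the kind of gap that accuracy alone hides.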

<a name='e11'></a>
#### Exercise 11: BERT Sentence embeddings by averaging tokens

(5p) Implement embedding sentences by averaging the hidden representations of tokens. Make sure to ignore the special and padding tokens. The padding tokens are indicated by the attention mask. You can find the other special tokens using the tokenizer's attributes such as `tokenizer.sep_token_id`. The function accepts the `layer` parameter. Typically, you would use the hidden representations of the last layer, but for some tasks it might be beneficial to use earlier layers or an average of the representations of multiple layers.
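
One way to write down the averaging described above is as a masked mean. The notation is introduced here only for illustration: $\mathbf{h}_i^{(\ell)}$ is the hidden state of token $i$ at layer $\ell$, $T$ is the padded sequence length, and $m_i$ is 1 for regular tokens and 0 for padding and special tokens.

$$
\mathbf{e} = \frac{\sum_{i=1}^{T} m_i \, \mathbf{h}_i^{(\ell)}}{\sum_{i=1}^{T} m_i}
$$

Note that the denominator counts only the tokens that are kept, so dividing by the padded length $T$ instead would shrink the embeddings of short sentences.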

In [None]:
### YOUR CODE HERE

def calculate_sentence_embeddings(input_batch, model_output, layer=-1):
    """
    Calculates the sentence embeddings of a batch of sentences as a mean of token representations.
    The representations are taken from the layer of the index provided as a `layer` parameter.
    Args:
        input_batch: tokenized batch of sentences (as returned by the tokenizer), contains `input_ids`, `token_type_ids`, and `attention_mask` tensors
        model_output: the output of the model given the `input_batch`, contains `last_hidden_state`, `pooler_output`, `hidden_states` tensors
        layer: specifies the layer of the hidden states that are used to calculate sentence embedding

    Returns: tensor of the averaged hidden states (from the specified layer) for each example in the batch

    """
    attention_mask = input_batch['attention_mask']
    hidden_states = model_output['hidden_states'][layer]

    ### YOUR CODE HERE
    sentence_embeddings = None




    ### YOUR CODE ENDS HERE

    return sentence_embeddings

### YOUR CODE ENDS HERE

We can test it here:


In [None]:
text = "The weather is nice today."
tokenized = tokenizer(text, padding=True, return_tensors='pt').to(device)
print(tokenized)
model_output = model(**tokenized, output_hidden_states=True)
print(model_output['last_hidden_state'].shape)
sentence_embedding = calculate_sentence_embeddings(tokenized, model_output)
print(sentence_embedding.shape)

We will embed the sentences and evaluate the model on the `validation` subset.


In [None]:
bert_sentence_train = embed_dataset(dataset_tokenized_bert['train'], model, calculate_sentence_embeddings)
print(bert_sentence_train.shape)

bert_sentence_valid = embed_dataset(dataset_tokenized_bert['validation'], model, calculate_sentence_embeddings)
print(bert_sentence_valid.shape)

In [None]:
classifier = LogisticRegression(solver='lbfgs', max_iter=1000)
print('Training classifier...')
classifier.fit(bert_sentence_train, y_train)

In [None]:
print('BERT train')
pred_train = classifier.predict(bert_sentence_train)
print(accuracy_score(y_train, pred_train))

print('BERT valid')
pred_valid = classifier.predict(bert_sentence_valid)
print(accuracy_score(y_valid, pred_valid))

Test the model on the evaluation metric of your choice:


In [None]:
#### YOUR CODE HERE


#### YOUR CODE ENDS HERE

## 6. Testing all methods

In this last section, you will bring together everything you have done so far in this lab. First, you will find the best classifier. Next, you will evaluate all the models you have created so far.

<a name='e12'></a>
#### Exercise 12: Find the best classifier for the models

(5p) Basically, do what the title of the exercise says. Evaluate on the `validation` subset. Try at least two other classifiers (apart from the logistic regression). Comment on the results.

In [None]:
#### YOUR CODE HERE


#### YOUR CODE ENDS HERE

--- YOUR ANSWERS HERE

<a name='e13'></a>
#### Exercise 13: Evaluating methods on the test set

(10p) Test the models you implemented on the test subset:
- Hand-crafted features,
- BOW,
- BERT model based on the CLS token.

You have the models trained already, so only do evaluation.

Evaluate the performance using the metric(s) of your choice. Make sure to discuss the results. Which model performed best? Is this what you expected?

In [None]:
#### YOUR CODE HERE


#### YOUR CODE ENDS HERE

--- YOUR ANSWERS HERE