# Assignment 1

**Credits**: Federico Ruggeri, Eleonora Mancini, Paolo Torroni

**Keywords**: POS tagging, Sequence labelling, RNNs


# Contact

For any doubt, question, issue or help, you can always contact us at the following email addresses:

Teaching Assistants:

* Federico Ruggeri -> federico.ruggeri6@unibo.it
* Eleonora Mancini -> e.mancini@unibo.it

Professor:

* Paolo Torroni -> p.torroni@unibo.it

# Introduction

Your task is to build a part-of-speech (POS) tagger.

<center>
    <img src="images/pos_tagging.png" alt="POS tagging" />
</center>

# [Task 1 - 0.5 points] Corpus

You are going to work with the [Penn TreeBank corpus](https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip).

**Ignore** the numeric value in the third column, use **only** the words/symbols and their POS label.

### Example

```
Pierre	NNP	2
Vinken	NNP	8
,	,	2
61	CD	5
years	NNS	6
old	JJ	2
,	,	2
will	MD	0
join	VB	8
the	DT	11
board	NN	9
as	IN	9
a	DT	15
nonexecutive	JJ	15
director	NN	12
Nov.	NNP	9
29	CD	16
.	.	8
```
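As a minimal sketch of how this format can be read, the snippet below parses the text of one document into sentences of `(word, tag)` pairs and drops the third column. The function name `parse_dp` is an assumption made for illustration, and so is the rule that a blank line closes a sentence (it mirrors the convention used later in this notebook).

```python
from typing import List, Tuple


def parse_dp(content: str) -> List[List[Tuple[str, str]]]:
    """Parse tab-separated `word<TAB>tag<TAB>head` lines into sentences."""
    sentences, current = [], []
    for line in content.splitlines():
        if not line.strip():
            # blank line: close the current sentence, if any
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag, _head = line.split("\t")  # the numeric third column is ignored
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences


sample = "Pierre\tNNP\t2\nVinken\tNNP\t8\n,\t,\t2\n\nwill\tMD\t0\njoin\tVB\t8\n"
print(parse_dp(sample))
```

Keeping words and tags paired from the start makes it harder to misalign them in later steps.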

### Splits

The corpus contains 199 documents.

   * **Train**: Documents 1-100
   * **Validation**: Documents 101-150
   * **Test**: Documents 151-199

### Instructions

* **Download** the corpus.
* **Encode** the corpus into a pandas.DataFrame object.
* **Split** it in training, validation, and test sets.

#### Preliminaries

#### Out of Vocabulary (OOV) words in training set
Some words in the training set have no embedding in the GloVe (50) model. We collect all of them in `oov_terms`.

In [2]:
# file management
import sys
import shutil
import urllib.request
import zipfile
from pathlib import Path

# dataframe management
import pandas as pd

# data manipulation
import numpy as np

# for readability
from typing import Iterable
from tqdm import tqdm

#### Download

In [3]:
class DownloadProgressBar(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)
        
def download_url(download_path: Path, url: str):
    with DownloadProgressBar(unit='B', unit_scale=True,
                             miniters=1, desc=url.split('/')[-1]) as t:
        urllib.request.urlretrieve(url, filename=download_path, reporthook=t.update_to)

        
def download_dataset(download_path: Path, url: str):
    print("Downloading dataset...")
    download_url(url=url, download_path=download_path)
    print("Download complete!")

def extract_dataset(download_path: Path, extract_path: Path):
    print("Extracting dataset... (it may take a while...)")
    
    with zipfile.ZipFile(download_path) as loaded_tar:
        loaded_tar.extractall(path=extract_path, pwd=None)
    print("Extraction completed!")

In [4]:
url = "https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/dependency_treebank.zip"
dataset_name = "dependency_treebank"

print(f"Current work directory: {Path.cwd()}")
dataset_folder = Path.cwd().joinpath("Datasets")

if not dataset_folder.exists():
    dataset_folder.mkdir(parents=True)

dataset_tar_path = dataset_folder.joinpath("dependency_treebank.zip")
dataset_path = dataset_folder.joinpath(dataset_name)

if not dataset_tar_path.exists():
    download_dataset(dataset_tar_path, url)

if not dataset_path.exists():
    extract_dataset(dataset_tar_path, dataset_folder)

Current work directory: c:\Users\Utente\Desktop\UNIVERSITA'\AI\2 Anno\Natural Language Processing\_ Esame\Assignment 1\NLP_POS-tagging


#### Encode and Split

The code below builds a dataframe from the downloaded files.
For every file, we extract its number with `find_number()` and use it to decide whether the file belongs to the train, validation or test split. We then read the file line by line to get each word and its POS tag, and to detect where a phrase (sentence) ends. From this information we build a list of rows with the following columns:
1. num_file: the number of the file
2. phrase_id: the id of the phrase within the file, in the form `<num_file>_<phrase number>`
3. text: the token (word or symbol) to be tagged
4. pos: the POS tag assigned to the token
5. split: the split the token belongs to

In [5]:
import re

def find_number(string):
    """
    Returns all the digit sequences found in a string, as a list of strings.
    """
    return re.findall(r'\d+', string)


In [6]:
dataframe_rows = []

folder = dataset_folder.joinpath(dataset_name)
# sort the files so that the row order of the dataframe is deterministic
for file_path in sorted(folder.glob('*.dp')):
    num_file = int(find_number(file_path.name)[0])
    phrase_idx = 1  # phrase counter, restarts at 1 for every file

    # documents 1-100 -> train, 101-150 -> validation, the rest -> test
    if num_file < 101:
        split = "train"
    elif num_file < 151:
        split = "validation"
    else:
        split = "test"

    with file_path.open(mode='r', encoding='utf-8') as text_file:
        for row in text_file:
            if row.strip() == '':
                # a blank line marks the end of a phrase
                phrase_idx += 1
            else:
                # the third column (numeric value) is ignored
                text, pos, _ = row.split('\t')

                dataframe_rows.append({
                    "num_file": num_file,
                    "phrase_id": f"{num_file}_{phrase_idx}",
                    "text": text,
                    "pos": pos,
                    "split": split
                })

In [7]:
df = pd.DataFrame(dataframe_rows)
df.head(20)

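|   | num_file | phrase_id | text | pos | split |
|---|----------|-----------|------|-----|-------|
| 0 | 1 | 1_1 | Pierre | NNP | train |
| 1 | 1 | 1_1 | Vinken | NNP | train |
| 2 | 1 | 1_1 | , | , | train |
| 3 | 1 | 1_1 | 61 | CD | train |
| 4 | 1 | 1_1 | years | NNS | train |
| 5 | 1 | 1_1 | old | JJ | train |
| 6 | 1 | 1_1 | , | , | train |
| 7 | 1 | 1_1 | will | MD | train |
| 8 | 1 | 1_1 | join | VB | train |
| 9 | 1 | 1_1 | the | DT | train |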


# [Task 2 - 0.5 points] Text encoding

To train a neural POS tagger, you first need to encode text into numerical format.

### Instructions

* Embed words using **GloVe embeddings**.
* You are **free** to pick any embedding dimension.
* [Optional] You are free to experiment with text pre-processing: **make sure you do not delete any token!**
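A small sketch of what token-preserving pre-processing could look like is given below. The digit normalization step and the function names `normalize_token` and `preprocess_sentence` are assumptions made for illustration and are not part of the assignment. The point is only that each token is transformed, never dropped.

```python
import re
from typing import List


def normalize_token(token: str) -> str:
    """Lower-case a token and replace every digit with '0'."""
    return re.sub(r"\d", "0", token.lower())


def preprocess_sentence(tokens: List[str]) -> List[str]:
    processed = [normalize_token(tok) for tok in tokens]
    # one output token per input token, so tokens stay aligned with their POS labels
    assert len(processed) == len(tokens)
    return processed


print(preprocess_sentence(["Pierre", "Vinken", ",", "61", "years", "old"]))
```

Because the output has the same length as the input, the POS labels can be reused unchanged.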

In [8]:
# typing
from typing import List, Callable, Dict

#### Text pre-processing
In the code below we pre-process the `df` dataframe to reduce the number of distinct words. Our pre-processing consists only of lower-casing each token.<br>
**NB: should we add something else to the pre-processing?**

In [9]:
import re
from functools import reduce
import nltk
from nltk.corpus import stopwords

In [10]:
def lower(text: str) -> str:
    """
    Transforms given text to lower case.
    """
    return text.lower()

In [11]:
PREPROCESSING_PIPELINE = [
                          lower
                          ]

def text_prepare(text: str,
                 filter_methods: List[Callable[[str], str]] = None) -> str:
    """
    Applies a list of pre-processing functions in sequence (reduce).
    Note that the order is important here!
    """
    filter_methods = filter_methods if filter_methods is not None else PREPROCESSING_PIPELINE
    return reduce(lambda txt, f: f(txt), filter_methods, text)

In [12]:
print('Pre-processing text...')

print()
print(f'[Debug] Before:\n{df.text.values[0]}')
print()

# Replace each token with its pre-processed version
df['text'] = df['text'].apply(text_prepare)

print(f'[Debug] After:\n{df.text.values[0]}')
print()

print("Pre-processing completed!")

Pre-processing text...

[Debug] Before:
Pierre

[Debug] After:
pierre

Pre-processing completed!


#### Vocabulary creation for training set
We define a vocabulary for the training set by assigning a random index to each word. The `building_vocabulary` function returns a list containing:
- word vocabulary: vocabulary index to word
- inverse word vocabulary: word to vocabulary index
- word listing: set of unique terms that build up the vocabulary
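The three structures listed above could be produced along the following lines. This is only a sketch: the name `build_vocabulary` and its signature are hypothetical, and it assigns indices in sorted order (rather than randomly) so that runs are reproducible. Index 0 is left free for padding, which is also an assumption.

```python
from typing import Dict, Iterable, List, Tuple


def build_vocabulary(tokens: Iterable[str]) -> Tuple[Dict[int, str], Dict[str, int], List[str]]:
    """Return (index -> word, word -> index, sorted list of unique words)."""
    word_listing = sorted(set(tokens))
    # index 0 is left free for padding, so real words start at 1
    word_to_idx = {word: idx for idx, word in enumerate(word_listing, start=1)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return idx_to_word, word_to_idx, word_listing


idx_to_word, word_to_idx, word_listing = build_vocabulary(["the", "board", "the", "director"])
print(word_to_idx)
```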


In [13]:
df_train = df[df['split']=='train']
df_val = df[df['split']=='validation']
df_test = df[df['split']=='test']

#### GloVe embeddings (50)
Download the 50-dimensional GloVe embeddings: a pre-trained model that maps each word in its vocabulary to a vector of dimension 50. Most of the words in the corpus are already covered by it.

In [14]:
# !pip install gensim

In [15]:
import gensim
import gensim.downloader as gloader

def load_embedding_model(model_type: str,
                         embedding_dimension: int = 50) -> gensim.models.keyedvectors.KeyedVectors:
    """
    Loads a pre-trained word embedding model via gensim library.

    :param model_type: name of the word embedding model to load.
    :param embedding_dimension: size of the embedding space to consider

    :return
        - pre-trained word embedding model (gensim KeyedVectors object)
    """
    download_path = ""

    if model_type.strip().lower() == 'glove':
        download_path = "glove-wiki-gigaword-{}".format(embedding_dimension)
    else:
        raise AttributeError("Unsupported embedding model type! Available one: glove")
        
    try:
        emb_model = gloader.load(download_path)
    except ValueError as e:
        print("Invalid embedding model name! Check the embedding dimension:")
        print("Glove: 50, 100, 200, 300")
        raise e

    return emb_model

In [16]:
embedding_dim = 50

embedding_model = load_embedding_model(model_type="glove",
                                       embedding_dimension=embedding_dim)

In [17]:
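vocab = {} # word to idx
glove_size = len(embedding_model.index_to_key)

# row 0 is reserved for padding, so the GloVe words start at index 1
embedding_matrix_glove = np.zeros((glove_size + 1, embedding_dim))

for i, word in enumerate(embedding_model.index_to_key):
    vocab[word] = i + 1
    embedding_matrix_glove[i + 1] = embedding_model.vectors[i]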

print(f'Embedding matrix shape: {embedding_matrix_glove.shape}')

Embedding matrix shape: (400001, 50)


In [18]:
def check_OOV_terms(embedding_model: gensim.models.keyedvectors.KeyedVectors,
                    word_listing: List[str]):
    """
    Checks differences between pre-trained embedding model vocabulary
    and dataset specific vocabulary in order to highlight out-of-vocabulary terms.

    :param embedding_model: pre-trained word embedding model (gensim wrapper)
    :param word_listing: dataset specific vocabulary (list)

    :return
        - list of OOV terms
    """
    embedding_vocabulary = set(embedding_model.key_to_index.keys())
    oov = set(word_listing).difference(embedding_vocabulary)
    return list(oov)

In [19]:
word_listing = set(df_train['text'])
oov_terms = check_OOV_terms(embedding_model, word_listing)
oov_percentage = float(len(oov_terms)) * 100 / len(word_listing)
print(f"Total OOV terms: {len(oov_terms)} ({oov_percentage:.2f}%)")

Total OOV terms: 359 (4.85%)


Here we add the training-set OOV words to the vocabulary and to the embedding matrix.<br>
*NB: it may be better to define the OOV vector as the mean of all the other word vectors; link for the report: (https://stackoverflow.com/questions/49239941/what-is-unk-in-the-pretrained-glove-vector-files-e-g-glove-6b-50d-txt)* <br>
Our embedding matrix has the following rows:
- row 0 is all zeros and is the embedding vector for padding
- rows 1 to 400000 are the embedding vectors of the GloVe words
- rows 400001 to 400359 are the (random) embedding vectors of the 359 words that are OOV in the training set
- row 400360 is the embedding vector shared by all words that are OOV with respect to the final vocabulary
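The index layout described above can be illustrated on a toy vocabulary. The sketch below is not the notebook's code: the helper `build_index_layout` is hypothetical, and it assumes the two word lists do not overlap. Each index is meant to equal the row of the embedding matrix that stores the word's vector.

```python
from typing import Dict, List


def build_index_layout(pretrained_words: List[str], train_oov_words: List[str]) -> Dict[str, int]:
    """Assign indices: 0 = [PAD], then pre-trained words, then training OOV words, then [OOV]."""
    layout = {"[PAD]": 0}
    for word in pretrained_words:
        layout[word] = len(layout)
    for word in train_oov_words:
        layout[word] = len(layout)
    # the last index is shared by every word unseen at training time
    layout["[OOV]"] = len(layout)
    return layout


layout_demo = build_index_layout(["the", "board"], ["vinken"])
print(layout_demo)
```

Using `len(layout)` as the next index keeps the indices contiguous, so no row of the embedding matrix is skipped or assigned twice.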

In [20]:
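# OOV training words get the indices right after the GloVe block,
# so that each index matches the row appended to the embedding matrix
first_oov_idx = embedding_matrix_glove.shape[0]
for offset, word in enumerate(oov_terms):
    vocab[word] = first_oov_idx + offset

# one random vector per OOV training word
oov_vectors = np.random.uniform(-0.25, 0.25, (len(oov_terms), embedding_dim))
embedding_matrix_glove = np.append(embedding_matrix_glove, oov_vectors, axis=0)

# generic [OOV] token: mean of all the rows built so far
vocab['[OOV]'] = embedding_matrix_glove.shape[0]
average_oov = np.mean(embedding_matrix_glove, axis=0)
embedding_matrix = np.append(embedding_matrix_glove, average_oov.reshape(1, embedding_dim), axis=0)

vocab['[PAD]'] = 0

print(len(vocab))
print(f"New embedding matrix size: {embedding_matrix.shape}")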

400361
New embedding matrix size: (400361, 50)


#### Embedding for training set
We create the embedding matrix for the whole training set:
- using the GloVe embeddings for already known words
- assigning a random vector to each OOV word.

**NB: instead of a random vector, we could define an OOV embedding as the mean of the embeddings of its neighbouring words (tutorial 2)** <br>
**NB: we could also add all the GloVe embeddings to the embedding matrix, even for words that are not in the train set**

# [Task 3 - 1.0 points] Model definition

You are now tasked to define your neural POS tagger.

### Instructions

* **Baseline**: implement a Bidirectional LSTM with a Dense layer on top.
* You are **free** to experiment with hyper-parameters to define the baseline model.

* **Model 1**: add an additional LSTM layer to the Baseline model.
* **Model 2**: add an additional Dense layer to the Baseline model.

* **Do not mix Model 1 and Model 2**. Each model has its own instructions.

**Note**: if a document contains many tokens, you are **free** to split them into chunks or sentences to define your mini-batches.
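One possible way to split a long document into chunks is sketched below. The helper `split_into_chunks` and the chunk size are assumptions chosen for illustration. Splitting at sentence boundaries would be an equally valid choice.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_chunks(items: Sequence[T], chunk_size: int) -> List[List[T]]:
    """Split a sequence into consecutive chunks of at most `chunk_size` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(items[start:start + chunk_size]) for start in range(0, len(items), chunk_size)]


tokens = ["Pierre", "Vinken", ",", "61", "years", "old", ","]
tags = ["NNP", "NNP", ",", "CD", "NNS", "JJ", ","]
# tokens and tags are chunked with the same size, so they stay aligned
token_chunks = split_into_chunks(tokens, 3)
tag_chunks = split_into_chunks(tags, 3)
print(token_chunks)
```

No token is dropped: concatenating the chunks gives back the original sequence.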

In [21]:
list_of_pos = list(set(df_train['pos']))

def pos_to_int(string):
    length = len(list_of_pos)
    for i in range(length):
        if list_of_pos[i]==string:
            return [1 if j == i else 0 for j in range(length)]

def int_to_pos(pred):
    idx_hot_encoding, _ = max(enumerate(pred), key=lambda x: x[1])
    return list_of_pos[idx_hot_encoding]

In [22]:
print(pos_to_int('JJ'))

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


#### Baseline 
**NB: reference slides 08 pag 38** <br>
https://analyticsindiamag.com/complete-guide-to-bidirectional-lstm-with-python-codes/

In [23]:
# ! pip install tensorflow
import tensorflow as tf
import tensorflow.keras.layers as layers

In [24]:
batch_size = 128
GloVe_dim = 50 # GloVe embedding
units_bi = 100

n_unique_words = len(vocab) # vocabulary size, i.e. the input dimension of the Embedding layer
outputs_dim = len(set(df_train['pos']))

In [30]:
baseline = tf.keras.Sequential(name='baseline')

baseline.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), mask_zero=True, trainable=False))
baseline.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
baseline.add(layers.Dense(outputs_dim, activation='softmax'))

baseline.summary()

Model: "baseline"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, None, 50)          20018050  
                                                                 
 bidirectional_5 (Bidirecti  (None, None, 200)         120800    
 onal)                                                           
                                                                 
 dense_4 (Dense)             (None, None, 45)          9045      
                                                                 
Total params: 20147895 (76.86 MB)
Trainable params: 129845 (507.21 KB)
Non-trainable params: 20018050 (76.36 MB)
_________________________________________________________________


#### Model 1

In [32]:
model_1 = tf.keras.Sequential(name='Model_1')

model_1.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), mask_zero=True, trainable=False))
model_1.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_1.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_1.add(layers.Dense(outputs_dim, activation='softmax'))


model_1.summary()

Model: "Model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, None, 50)          20018050  
                                                                 
 bidirectional_7 (Bidirecti  (None, None, 200)         120800    
 onal)                                                           
                                                                 
 bidirectional_8 (Bidirecti  (None, None, 200)         240800    
 onal)                                                           
                                                                 
 dense_7 (Dense)             (None, None, 45)          9045      
                                                                 
Total params: 20388695 (77.78 MB)
Trainable params: 370645 (1.41 MB)
Non-trainable params: 20018050 (76.36 MB)
_________________________________________________________________


#### Model 2

In [28]:
units_dense = 100

In [31]:
model_2 = tf.keras.Sequential(name='Model_2')

model_2.add(layers.Embedding(n_unique_words, GloVe_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), mask_zero=True, trainable=False))
model_2.add(layers.Bidirectional(layers.LSTM(units_bi, activation='relu', return_sequences=True)))
model_2.add(layers.Dense(units_dense, activation='relu'))
model_2.add(layers.Dense(outputs_dim, activation='softmax'))


model_2.summary()

Model: "Model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, None, 50)          20018050  
                                                                 
 bidirectional_6 (Bidirecti  (None, None, 200)         120800    
 onal)                                                           
                                                                 
 dense_5 (Dense)             (None, None, 100)         20100     
                                                                 
 dense_6 (Dense)             (None, None, 45)          4545      
                                                                 
Total params: 20163495 (76.92 MB)
Trainable params: 145445 (568.14 KB)
Non-trainable params: 20018050 (76.36 MB)
_________________________________________________________________


#### Try models

An example of a BiLSTM implementation with padding: https://medium.com/analytics-vidhya/author-multi-class-text-classification-using-bidirectional-lstm-keras-c9a533a1cc4a

In [33]:
def assign_idx(word):
    # words missing from the vocabulary map to the generic [OOV] index
    return vocab.get(word, vocab['[OOV]'])

In [34]:
df_train_reduction = df_train.loc[df_train['num_file'] <= 20]
df_val_reduction = df_val.loc[df_val['num_file'] <= 110]
df_test_reduction = df_test.loc[df_test['num_file'] <= 160]

In [35]:
def build_xy(frame):
    # group the tokens by phrase once, so that inputs and labels share the same phrase order
    groups = frame.groupby('phrase_id', sort=False)
    x = [[assign_idx(word) for word in group['text']] for _, group in groups]
    y = [[pos_to_int(pos) for pos in group['pos']] for _, group in groups]
    return x, y

x_train, y_train = build_xy(df_train_reduction)
x_val, y_val = build_xy(df_val_reduction)
x_test, y_test = build_xy(df_test_reduction)

In [36]:
# To feed the data to the model, all the sequences of word indices / one-hot POS labels must be padded to the same length

pad = max(len(phrase) for phrase in x_train + x_val + x_test)
print(pad)

# word indices are padded with 0, the index of [PAD]
x_train_pad = np.array([phrase + [0] * (pad - len(phrase)) for phrase in x_train])
x_val_pad = np.array([phrase + [0] * (pad - len(phrase)) for phrase in x_val])
x_test_pad = np.array([phrase + [0] * (pad - len(phrase)) for phrase in x_test])

# labels are padded with all-zero vectors, one per padded position
n_tags = len(list_of_pos)
y_train_pad = np.array([phrase + [[0] * n_tags] * (pad - len(phrase)) for phrase in y_train])
y_val_pad = np.array([phrase + [[0] * n_tags] * (pad - len(phrase)) for phrase in y_val])
y_test_pad = np.array([phrase + [[0] * n_tags] * (pad - len(phrase)) for phrase in y_test])

81


# [Task 4 - 1.0 points] Metrics

Before training the models, you are tasked to define the evaluation metrics for comparison.

### Instructions

* Evaluate your models using macro F1-score, computed over **all** tokens.
* **Concatenate** all tokens in a data split to compute the F1-score. (**Hint**: accumulate FP, TP, FN, TN iteratively) 
* **Do not consider punctuation and symbol classes** $\rightarrow$ [What is punctuation?](https://en.wikipedia.org/wiki/English_punctuation)

**Note**: What about OOV tokens?
   * All the tokens in the **training** set that are not in GloVe are **not** considered as OOV
   * For the remaining tokens (i.e., OOV in the validation and test sets), you have to assign them a **static** embedding.
   * You are **free** to define the static embedding using any strategy (e.g., random, neighbourhood, etc...)
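The instructions above (macro F1 over all tokens, counts accumulated iteratively, punctuation ignored) could be implemented along these lines. This is a sketch under stated assumptions, not the reference metric: the class name `MacroF1` is hypothetical, the set of ignored tags must be chosen by you, tokens are skipped when their *gold* tag is ignored, and a wrong prediction of an ignored tag counts as a false negative for the gold class only.

```python
from collections import Counter
from typing import Iterable, Set


class MacroF1:
    """Accumulates per-class TP/FP/FN over batches and reports macro F1."""

    def __init__(self, ignored_tags: Set[str]):
        self.ignored_tags = ignored_tags
        self.tp, self.fp, self.fn = Counter(), Counter(), Counter()

    def update(self, gold: Iterable[str], predicted: Iterable[str]) -> None:
        for g, p in zip(gold, predicted):
            if g in self.ignored_tags:
                continue  # tokens whose gold tag is punctuation/symbol are skipped
            if g == p:
                self.tp[g] += 1
            else:
                self.fn[g] += 1
                if p not in self.ignored_tags:
                    self.fp[p] += 1

    def result(self) -> float:
        classes = set(self.tp) | set(self.fp) | set(self.fn)
        scores = []
        for c in classes:
            denom = 2 * self.tp[c] + self.fp[c] + self.fn[c]
            scores.append(2 * self.tp[c] / denom if denom else 0.0)
        # macro average: every class has the same weight, whatever its frequency
        return sum(scores) / len(scores) if scores else 0.0


metric = MacroF1(ignored_tags={",", "."})
metric.update(["NNP", "NNP", ",", "CD"], ["NNP", "NN", ",", "CD"])
metric.update(["NN", "."], ["NN", "."])
print(round(metric.result(), 4))
```

Since only counts are stored, the metric can be updated one batch at a time and still gives the score of the whole concatenated split.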

In [37]:
from tensorflow.keras import backend as K

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    recall = true_positives / (possible_positives + K.epsilon())
    return recall

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision

def f1_m(y_true, y_pred):
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))

# [Task 5 - 1.0 points] Training and Evaluation

You are now tasked to train and evaluate the Baseline, Model 1, and Model 2.

### Instructions

* Train **all** models on the train set.
* Evaluate **all** models on the validation set.
* Compute metrics on the validation set.
* Pick **at least** three seeds for robust estimation.
* Pick the **best** performing model according to the observed validation set performance.
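A possible way to organise the multi-seed comparison is sketched below. Everything in it is illustrative: `evaluate_over_seeds` and `make_dummy_run` are hypothetical helpers, the seed values are arbitrary, and the scores returned by the dummy runs are placeholder numbers, not results. In a real run, the callable would seed Python, NumPy and the neural library, train the model, and return its validation macro F1.

```python
import random
import statistics
from typing import Callable, Dict, Sequence


def evaluate_over_seeds(train_and_eval: Callable[[int], float], seeds: Sequence[int]) -> Dict[str, float]:
    """Run one training per seed and summarise the validation scores."""
    scores = [train_and_eval(seed) for seed in seeds]
    return {"mean": statistics.mean(scores), "std": statistics.pstdev(scores)}


def make_dummy_run(base_score: float) -> Callable[[int], float]:
    """Stand-in for 'seed everything, train, return validation macro F1'."""
    def run(seed: int) -> float:
        return base_score + 0.01 * random.Random(seed).random()
    return run


# placeholder scores, used only to make the sketch runnable
runs = {"baseline": make_dummy_run(0.70), "model_1": make_dummy_run(0.75), "model_2": make_dummy_run(0.72)}
seeds = [42, 1337, 2023]
summary = {name: evaluate_over_seeds(run, seeds) for name, run in runs.items()}
best_model = max(summary, key=lambda name: summary[name]["mean"])
print(best_model, summary[best_model])
```

Reporting the standard deviation next to the mean shows whether the gap between two models is larger than the variation across seeds.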

**NB: todo fine tuning on embedding layer https://stackoverflow.com/questions/40345607/how-does-fine-tuning-word-embeddings-work**

In [38]:
model = baseline

model.compile(loss='categorical_crossentropy', metrics=['accuracy', f1_m], optimizer='Adadelta')
history = model.fit(x_train_pad, y_train_pad, batch_size=8, epochs=2, validation_data=(x_val_pad, y_val_pad), verbose=2)
loss, accuracy, f1_score = model.evaluate(x_test_pad, y_test_pad)

Epoch 1/2
30/30 - 9s - loss: 3.7153 - accuracy: 0.0350 - f1_m: 0.0000e+00 - val_loss: 3.7222 - val_accuracy: 0.0370 - val_f1_m: 0.0000e+00 - 9s/epoch - 293ms/step
Epoch 2/2
30/30 - 2s - loss: 3.7130 - accuracy: 0.0369 - f1_m: 0.0000e+00 - val_loss: 3.7200 - val_accuracy: 0.0386 - val_f1_m: 0.0000e+00 - 2s/epoch - 57ms/step


In [39]:
pred = model.predict(x_val_pad)
print(int_to_pos(y_val_pad[0][2]))
print(int_to_pos(pred[0][2]))

IN
RBS


In [40]:
# import matplotlib.pyplot as plt

# def graph_plots(history, string):
#   plt.plot(history.history[string])
#   plt.plot(history.history['val_'+string])
#   plt.xlabel("Epochs")
#   plt.ylabel(string)
#   plt.legend([string, 'val_'+string])
#   plt.show()
  
# graph_plots(history, "accuracy")
# graph_plots(history, "loss")
# graph_plots(history, "f1_m")

# [Task 6 - 1.0 points] Error Analysis

You are tasked to evaluate your best performing model.

### Instructions

* Compare the errors made on the validation and test sets.
* Aggregate model errors into categories (if possible) 
* Comment on the errors and propose possible solutions to address them.

# [Task 7 - 1.0 points] Report

Wrap up your experiment in a short report (up to 2 pages).

### Instructions

* Use the NLP course report template.
* Summarize each task in the report following the provided template.

### Recommendations

The report is not a copy-paste of graphs, tables, and command outputs.

* Summarize classification performance in Table format.
* **Do not** report command outputs or screenshots.
* Report learning curves in Figure format.
* The error analysis section should summarize your findings.

# Submission

* **Submit** your report in PDF format.
* **Submit** your python notebook.
* Make sure your notebook is **well organized**, with no temporary code, commented sections, tests, etc...
* You can upload **model weights** in a cloud repository and report the link in the report.

# FAQ

Please check these frequently asked questions before contacting us.

### Trainable Embeddings

You are **free** to define a trainable or non-trainable Embedding layer to load the GloVe embeddings.

### Model architecture

You **should not** change the architecture of a model (i.e., its layers).

However, you are **free** to play with their hyper-parameters.

### Neural Libraries

You are **free** to use any library of your choice to implement the networks (e.g., Keras, Tensorflow, PyTorch, JAX, etc...)

### Keras TimeDistributed Dense layer

If you are using Keras, we recommend wrapping the final Dense layer with `TimeDistributed`.

### Error Analysis

Some topics for discussion include:
   * Model performance on most/least frequent classes.
   * Precision/Recall curves.
   * Confusion matrices.
   * Specific misclassified samples.
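As a starting point for this kind of analysis, the sketch below counts the most frequent (gold tag, predicted tag) confusions. The helper `top_confusions` and the toy tag sequences are made up for illustration.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(gold: Sequence[str], predicted: Sequence[str], k: int = 3) -> List[Tuple[Tuple[str, str], int]]:
    """Return the k most frequent (gold tag, predicted tag) pairs among the errors."""
    errors = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return errors.most_common(k)


gold_tags = ["NN", "NN", "JJ", "VBD", "NN", "VBN"]
pred_tags = ["NN", "JJ", "NN", "VBN", "JJ", "VBN"]
print(top_confusions(gold_tags, pred_tags))
```

The same counter, taken over all pairs rather than only the errors, gives the entries of a confusion matrix.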

### Punctuation

**Do not** remove punctuation from documents since it may be helpful to the model.

You should **ignore** it during metrics computation.

If you are curious, you can run additional experiments to verify the impact of removing punctuation.

# The End