# Actuarial Applications of Natural Language Processing Using Transformers
### A Case Study for Processing Text Features in an Actuarial Context
### Part I – Introduction and Case Studies on Car Accident Descriptions

By Andreas Troxler, June 2022

An abundant amount of information is available to insurance companies in the form of text.
However, language data is unstructured, sometimes multilingual,
and single words or phrases taken out of context can be highly ambiguous.
With the help of transformer models, text data can be converted into structured data and then
used as input to predictive models.

In Part I of this tutorial, you will discover the use of transformer models for text classification.
Throughout this tutorial, the [HuggingFace](https://huggingface.co/docs/transformers/index)
Transformers library will be used.

This notebook serves as a companion to the tutorial
["Actuarial Applications of Natural Language Processing Using Transformers"](https://github.com/JSchelldorfer/ActuarialDataScience/tree/master/12%20-%20NLP%20Using%20Transformers).
The tutorial explains the underlying concepts, and this notebook illustrates the implementation.
The tutorial, the dataset and the notebooks are available on [GitHub](https://github.com/JSchelldorfer/ActuarialDataScience/tree/master/12%20-%20NLP%20Using%20Transformers).

After completing this tutorial, you will know:
* How to use a transformer model to convert multi-lingual text features into embeddings - simply put, into a vector of real numbers.
* How to use this structured data to perform a text classification task.
* How to improve model performance by fine-tuning the NLP model with your own data.
* How to perform error analysis and interpret model predictions.
* How to deal with long input sequences.

Let’s get started.



## Notebook Overview

This notebook is divided into seven parts; they are:

1. [Introduction](#intro)

   1.1 [Prerequisites](#prerequisites)

   1.2 [Exploring the data](#dataexploration)<br><br>

2. [A brief introduction to the HuggingFace ecosystem](#huggingface)

   2.1 [Loading the data into a Dataset](#dataset)

   2.2 [Tokenization – splitting the raw text](#tokenize)

   2.3 [The transformer model](#transformer)<br><br>

3. [Using transformers to extract features for classification or regression tasks](#feature_extraction)

   3.1 [Extracting the encoded text ...](#extract_encoding)

   3.2 [... and using it in a classification model](#classification)

   3.3 [Case study: use accident descriptions to predict the number of vehicles involved](#case_study_nvehicles)

   3.4 [Cross-lingual transfer](#cross_lingual_transfer)

   3.5 [Multi-lingual training](#multi_lingual_training)<br><br>

4. [Fine-tuning – improving the model](#finetuning)

   4.1. [Domain-specific finetuning](#domain_finetuning)

   4.2. [Task-specific finetuning](#task_finetuning)<br><br>

5. [Understand prediction errors and interpret predictions](#understand)

   5.1. [Case study: use accident descriptions to identify bodily injury](#case_study_injuries)

   5.2. [Investigate false positives and false negatives](#investigate)

   5.3. [Use Captum and `transformers-interpret` to interpret predictions](#interpret)<br><br>

6. [Using extractive question answering to process longer texts](#qna)<br><br>

7. [Conclusion](#conclusion)


<a id='intro'></a>
<a name='intro'></a>
## 1.&nbsp;Introduction

<a id='prerequisites'></a>
<a name='prerequisites'></a>
### 1.1. Prerequisites

#### Computing Power
This notebook is computationally intensive. We recommend using a platform with GPU support.

We have run this notebook on Google Colab and on an Amazon EC2 p2.xlarge instance (an older generation of GPU-based instances).

Please note that the results may not be reproducible across platforms and versions.

#### Local files
Make sure the following files are available in the directory of the notebook:
* `tutorial_utils.py` - a collection of utility functions used throughout this notebook, explained in Section [3.2](#classification)
* `NHTSA_NMVCCS_extract.parquet.gzip` - the data

This notebook will create the following subdirectories:
* `datasets` - pre-processed datasets
* `models` - trained Transformer models
* `results` - figures and Excel files
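
If you prefer to create these subdirectories up front, a minimal sketch using only the standard library could look as follows. The helper name `ensure_subdirs` is ours (it is not part of `tutorial_utils.py`), and a temporary directory stands in for the notebook directory.

```python
import tempfile
from pathlib import Path

# Subdirectory names as listed above.
SUBDIRS = ("datasets", "models", "results")


def ensure_subdirs(base_dir):
    """Create the output subdirectories below base_dir if they do not exist yet."""
    created = []
    for name in SUBDIRS:
        path = Path(base_dir) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demonstration in a temporary directory (pass "." to use the notebook directory).
with tempfile.TemporaryDirectory() as tmp:
    paths = ensure_subdirs(tmp)
    created_names = sorted(p.name for p in paths if p.is_dir())
print(created_names)
```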

#### Getting started with Python and Jupyter Notebook

For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook.

In this section, Jupyter Notebook and Python settings are initialized.
For code in Python, the [PEP8 standard](https://www.python.org/dev/peps/pep-0008/)
("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.


In [1]:
# Notebook settings

# clear the namespace variables
from IPython import get_ipython
get_ipython().run_line_magic("reset", "-sf")

# formatting: cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

#### Importing Required Libraries

The following libraries are required:

In [2]:
!pip install transformers[torch]


In [3]:
!pip install datasets


In [4]:
!pip install transformers_interpret



In [5]:
!pip install plotly



In [6]:
!pip install kaleido



In [7]:
from datasets import Dataset, DatasetDict, load_from_disk
from transformers import AutoTokenizer, AutoModel, Trainer, TrainingArguments, trainer_utils, AutoModelForMaskedLM,\
    DataCollatorForLanguageModeling, AutoModelForSequenceClassification, pipeline
from transformers_interpret import SequenceClassificationExplainer
import torch
import pandas as pd
import numpy as np
from scipy.special import softmax
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
import plotly.express as px
from wordcloud import WordCloud

from tutorial_utils import extract_sequence_encoding, get_xy, dummy_classifier, logistic_regression_classifier, evaluate_classifier

In addition, we require `openpyxl` to enable export from Pandas to Excel.
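
Before running the remainder of the notebook, it can be helpful to check which packages are still missing. The following is a small sketch based on the standard library's `importlib`; the helper `missing_packages` is our own, and the list contains the import names used above (note that the import name can differ from the name used with `pip`, e.g. `sklearn` for scikit-learn).

```python
from importlib.util import find_spec

# Top-level import names of the packages used in this notebook.
REQUIRED = ["transformers", "datasets", "transformers_interpret", "torch", "pandas",
            "numpy", "scipy", "sklearn", "plotly", "kaleido", "wordcloud", "openpyxl"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("missing packages:", missing if missing else "none")
```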

<a id='dataexploration'></a>
<a name='dataexploration'></a>
### 1.2. Exploring the Data

The data used throughout this tutorial is derived from data of a vehicle crash causation study performed
in the United States from 2005 to 2007.
The dataset has almost 7'000 records, each relating to one accident.
For each case, a verbal description of the accident is available in English,
which summarizes road and weather conditions,
vehicles, drivers and passengers involved, preconditions, injury severities, etc.
The same information is also encoded in tabular form,
so that we can apply supervised learning techniques to train the NLP models and
compare the information extracted from the verbal descriptions with the encoded data.

The original data consists of multiple tables. For this tutorial, we have aggregated it into a single dataset
and added German translations of the English accident descriptions.
The translations were generated using the
[DeepL Python API](https://pypi.org/project/deepl/).

To explore the data, let's load it into a Pandas DataFrame and examine its shape, columns and data types:

In [8]:
df = pd.read_parquet("NHTSA_NMVCCS_extract.parquet.gzip")
print(f"shape of DataFrame: {df.shape}")
print(*list(zip(df.columns, df.dtypes)), sep="\n")

shape of DataFrame: (6949, 16)
('level_0', dtype('int64'))
('index', dtype('int64'))
('SCASEID', dtype('int64'))
('SUMMARY_EN', dtype('O'))
('SUMMARY_GE', dtype('O'))
('INJSEVA', dtype('int64'))
('NUMTOTV', dtype('int64'))
('WEATHER1', dtype('int64'))
('WEATHER2', dtype('int64'))
('WEATHER3', dtype('int64'))
('WEATHER4', dtype('int64'))
('WEATHER5', dtype('int64'))
('WEATHER6', dtype('int64'))
('WEATHER7', dtype('int64'))
('WEATHER8', dtype('int64'))
('INJSEVB', dtype('int64'))


The column `SCASEID` is a unique case identifier.

The columns `SUMMARY_EN` and `SUMMARY_GE` are strings representing the verbal descriptions of the accident
in English and German, respectively.

`NUMTOTV` is the number of vehicles involved in the case. Let's have a look at the distribution of this feature:

In [9]:
fig = px.bar(df["NUMTOTV"].value_counts().sort_index(), width=640)
fig.update_layout(title="number of cases by number of vehicles", xaxis_title="number of vehicles",
                  yaxis_title="number of cases")
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "num_vehicles"}})

Most cases involve two vehicles, and only very few accidents involve more than three vehicles.

Each of the columns `WEATHER1` to `WEATHER8` indicates the presence of a specific weather condition
(1: weather condition present, 9999: presence of weather condition unknown, 0 otherwise):

| column | meaning | count |
|---|---|---|
| `WEATHER1` | cloudy | 1112 |
| `WEATHER2` | snow | 114 |
| `WEATHER3` | fog, smog, smoke | 28 |
| `WEATHER4` | rain | 624 |
| `WEATHER5` | sleet, hail (freezing drizzle or rain) | 25 |
| `WEATHER6` | blowing snow | 38 |
| `WEATHER7` | severe crosswinds | 20 |
| `WEATHER8` | other | 25 |

These weather conditions are not mutually exclusive, i.e., more than one condition can be present in a single case.
The frequency distribution looks as follows:

In [10]:
fig=px.bar(x=range(1,9), y=[(df["WEATHER"+str(i)]==1).sum() for i in range(1,9)], width=640)
fig.update_layout(title="number of cases by weather condition", xaxis_title="weather condition",
                  yaxis_title="number of cases")
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "weather"}})

The most frequently recorded weather conditions are "cloudy" (`WEATHER1`) and "rain" (`WEATHER4`).
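
Because unknown values are coded as 9999, the flags should be compared with 1 rather than summed. The sketch below illustrates this on a few invented records (plain Python, so that it runs without the data file) and also counts the cases with more than one condition present; the helper `present_conditions` is our own.

```python
# Toy records with the coding described above: 1 = present, 0 = absent, 9999 = unknown.
# The values are invented for illustration and are not taken from the dataset.
WEATHER_COLUMNS = [f"WEATHER{i}" for i in range(1, 9)]
toy_cases = [
    {"WEATHER1": 1, "WEATHER4": 1},        # cloudy and rain
    {"WEATHER2": 1, "WEATHER6": 1},        # snow and blowing snow
    {"WEATHER1": 9999, "WEATHER4": 9999},  # weather unknown
    {},                                    # no condition recorded
]


def present_conditions(case):
    """Return the weather columns flagged as present (value 1) for one case."""
    return [col for col in WEATHER_COLUMNS if case.get(col, 0) == 1]


# Number of cases per condition, counting only the value 1.
counts = {col: sum(case.get(col, 0) == 1 for case in toy_cases) for col in WEATHER_COLUMNS}
# Number of cases with more than one condition present.
multi = sum(len(present_conditions(case)) > 1 for case in toy_cases)
# Summing the raw column instead would be distorted by the code 9999.
naive_sum = sum(case.get("WEATHER1", 0) for case in toy_cases)

print(counts["WEATHER1"], multi, naive_sum)
```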

`INJSEVA` indicates the most serious sustained injury in the accident.
For instance, if one person was not injured, and another person suffered a non-incapacitating injury,
injury class 2 was assigned to the case.

Information on injury severity has been taken from police accident reports, which are not available in the data.
Unfortunately, this information does not necessarily align with the case description:
There are many cases for which the case description indicates the presence of an injury,
but `INJSEVA` does not, and vice versa.

For this reason, we manually created an additional column `INJSEVB` based on the case description,
to indicate the presence of a (possible) bodily injury.
The table below shows the number of cases by the two variables.

| `INJSEVA` | meaning | `INJSEVB`=0 | `INJSEVB`=1 | count |
|---|---|---|---|---|
|  0 | O - No injury | 1'458 | 96| 1'554 |
|  1 | C - Possible injury | 1'112 | 1'298 | 2'410 |
|  2 | B - Non-incapacitating injury | 729 | 945 | 1'674 |
|  3 | A - Incapacitating injury | 304 | 373 | 677 |
|  4 | K - Killed | 5 | 114 | 119 |
|  5 | U - Injury, severity unknown | 44 | 122 | 166 |
|  6 | Died prior to crash  | 0 | 0| 0 |
|  9 | Unknown if injured  | 51 | 16 | 67 |
| 10 | No person in crash  | 1 | 0| 1 |
| 11 | No PAR (police accident report) obtained | 231 | 50 | 281 |
|**Total**| | **3'935** | **3'014**| **6'949**|
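
A cross-tabulation of this kind can be sketched in a few lines. The example below uses `collections.Counter` on invented value pairs so that it runs without the data file; with the DataFrame at hand, one would typically call `pd.crosstab(df["INJSEVA"], df["INJSEVB"], margins=True)` instead.

```python
from collections import Counter

# Invented (INJSEVA, INJSEVB) pairs for illustration; not taken from the dataset.
toy_pairs = [(0, 0), (0, 0), (0, 1), (1, 0), (1, 1), (1, 1), (2, 1), (4, 1)]

cell_counts = Counter(toy_pairs)                # cases per (INJSEVA, INJSEVB) cell
row_totals = Counter(a for a, _ in toy_pairs)   # cases per INJSEVA value
col_totals = Counter(b for _, b in toy_pairs)   # cases per INJSEVB value

# One line per INJSEVA value: count with INJSEVB=0, count with INJSEVB=1, row total.
for a in sorted(row_totals):
    print(a, cell_counts[(a, 0)], cell_counts[(a, 1)], row_totals[a])
print("total", col_totals[0], col_totals[1], len(toy_pairs))
```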



Now we turn to the verbal accident descriptions.
First, we examine the length of the English texts, `SUMMARY_EN`.
To this end, we split the texts into words, with blank spaces as separator,
and show a box plot of the text length by number of vehicles involved in the accident:

In [11]:
# statistics of summary length
df["words per case summary"] = df["SUMMARY_EN"].str.split().apply(len)
print(f"Overall number of words by case summary: min {df['words per case summary'].min()}, "
      f"average {df['words per case summary'].mean():.0f}, max {df['words per case summary'].max()}")
fig = px.box(df, x="NUMTOTV", y="words per case summary", width=640)
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "text_length"}})

Overall number of words by case summary: min 60, average 419, max 1248


Not surprisingly, the length of the descriptions correlates with the number of vehicles involved.

The average length is above 400 words.
As we will see later, this poses a challenge for the NLP models used in this notebook,
because they are limited to inputs of at most 512 so-called "tokens" (vocabulary items).
Since a single word may be tokenized into more than one token, some accident descriptions will be truncated.
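
To get a rough feeling for how many words fit into 512 tokens, we can use the sample sentence that is tokenized in Section 2.2 of this notebook. The sketch below copies the sentence and the token list printed there and derives a tokens-per-word ratio. This is a back-of-the-envelope estimate based on a single sentence, so the resulting word budget is only indicative.

```python
# Sample sentence and its tokens as printed by the tokenizer in Section 2.2.
sentence = ("V1, a 2000 Pontiac Montana minivan, made a left turn from a private driveway "
            "onto a northbound 5-lane two-way, dry asphalt roadway on a downhill grade.")
tokens = ['[CLS]', 'V', '##1', ',', 'a', '2000', 'Pont', '##iac', 'Montana', 'mini', '##van',
          ',', 'made', 'a', 'left', 'turn', 'from', 'a', 'private', 'drive', '##way', 'onto',
          'a', 'north', '##bound', '5', '-', 'lane', 'two', '-', 'way', ',', 'dry', 'asp',
          '##halt', 'road', '##way', 'on', 'a', 'down', '##hill', 'grade', '.', '[SEP]']

MAX_LENGTH = 512   # maximum sequence length of the model
N_SPECIAL = 2      # [CLS] and [SEP] are added to every sequence

n_words = len(sentence.split())              # whitespace-separated words
n_content_tokens = len(tokens) - N_SPECIAL   # tokens excluding [CLS] and [SEP]
tokens_per_word = n_content_tokens / n_words

# Rough word budget before truncation sets in, if this ratio held for whole summaries.
word_budget = int((MAX_LENGTH - N_SPECIAL) / tokens_per_word)
print(n_words, n_content_tokens, round(tokens_per_word, 2), word_budget)
```

Under this assumption, summaries longer than roughly 300 words would be truncated, which is well below the average length of 419 words.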

Let's examine one of the English texts and its German translation:

In [12]:
display(HTML(df.loc[0, "SUMMARY_EN"]))

In [13]:
display(HTML(df.loc[0, "SUMMARY_GE"]))

To get an impression of the most frequent words, we generate a simple word cloud from all English case descriptions.
By default, the word cloud excludes so-called stop words (such as articles, prepositions, pronouns, conjunctions, etc.),
which are the most common words and do not add much information to the text.

In [14]:
text = df["SUMMARY_EN"].str.cat(sep=" ")

# Create and generate a word cloud image:
word_cloud = WordCloud(max_words=100, background_color="white").generate(text)

# Display the generated image:
fig = px.imshow(word_cloud, width=640)
fig.update_layout(xaxis_showticklabels=False, yaxis_showticklabels=False)
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "word_cloud"}})
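
To see the effect of stop-word removal in isolation, the following sketch counts word frequencies with the standard library. The stop-word set is a tiny hand-picked one and the sample text is invented; the `wordcloud` package ships its own, longer list.

```python
import re
from collections import Counter

# A tiny, hand-picked stop-word set for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "was", "to", "from"}

# Invented sample text, not taken from the dataset.
sample = "The driver of V1 was turning left and the driver of V2 was braking on the wet road."

words = re.findall(r"[a-z0-9]+", sample.lower())   # lower-case, split on non-alphanumerics
frequencies = Counter(w for w in words if w not in STOP_WORDS)
print(frequencies.most_common(3))
```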

<a id='huggingface'></a>
<a name='huggingface'></a>
## 2.&nbsp;A Brief Introduction to the HuggingFace Ecosystem

This tutorial uses NLP models provided by [*HuggingFace*](https://huggingface.co/).

HuggingFace is a community that builds, trains and deploys state-of-the-art models for natural language processing,
audio, computer vision etc. HuggingFace's model hub provides thousands of pre-trained models for these applications.
The [Transformers](https://huggingface.co/docs/transformers/index) library offers functionality to
quickly download and use those pre-trained models on a given input, fine-tune them on your own datasets
and then share them with the community.
The library is backed by the three most popular deep learning libraries: JAX, PyTorch and TensorFlow.

In this notebook, the following elements of the HuggingFace ecosystem will be used:

* datasets – a library to load and process inputs and outputs of the NLP model
* tokenizers – translating the raw input text into tokens, which are the vocabulary items of a given NLP model
* models – loading and saving models
* trainer – training of models, making predictions

In the next sections we will briefly explore the first three components in turn.
The trainer functionality will be used in [Section 4](#finetuning) of this notebook.

<a id='dataset'></a>
<a name='dataset'></a>
### 2.1. Loading the Data into a Dataset

[*Datasets*](https://huggingface.co/docs/datasets/) is a library for easily accessing and sharing datasets
and evaluation metrics for NLP, computer vision, and audio tasks.

A dataset can be loaded in a single line of code, in our case directly from the pandas DataFrame.
At the same time, we split the dataset into a training (80%) and a test dataset (20%).
We fix the random seed for the sake of reproducibility.

In [15]:
dataset = Dataset.from_pandas(df).train_test_split(test_size=0.2, seed=0)
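
The idea behind a seeded split can be illustrated in plain Python. This is a sketch of the concept, not the implementation used by the `datasets` library: the helper `train_test_indices` is our own, and rounding the test share up is an assumption chosen because it reproduces the split sizes (5559 and 1390 rows) reported in this notebook.

```python
import math
import random


def train_test_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # same seed, same permutation
    n_test = math.ceil(test_size * n_rows)   # test share, rounded up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_indices(6949)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same split, which is what makes the results repeatable on a given platform.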

Since the texts are relatively long, some parts of this notebook require considerable computing resources. Uncomment the following line to reduce the size of the dataset.

In [16]:
# dataset = DatasetDict({"train": dataset["train"].select(range(1000)), "test": dataset["test"].select(range(250))})
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary'],
        num_rows: 5559
    })
    test: Dataset({
        features: ['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary'],
        num_rows: 1390
    })
})


The resulting `DatasetDict` behaves like a Python dictionary.
Therefore, you can access the `Dataset` corresponding to each split by its key:

In [17]:
ds_train = dataset["train"]
print(ds_train)

Dataset({
    features: ['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary'],
    num_rows: 5559
})


The `Dataset` object behaves like a normal Python container.
You can query its length, get rows or columns, etc. For instance, its length is:

In [18]:
len(ds_train)

5559

To query a single row, you can use its index, like in a list: `ds_train[0]`.
This returns a dictionary representing the row.
Its elements can be accessed by the column names as keys,
e.g. `ds_train[0]["SCASEID"]`.
Multiple rows can be accessed by index slices, e.g. `dataset["train"][:2]`,
or by a list of indices, e.g. `dataset["train"][[0, 2]]`.

You can list the column names and get their detailed types (called features):

In [19]:
ds_train.features

{'level_0': Value(dtype='int64', id=None),
 'index': Value(dtype='int64', id=None),
 'SCASEID': Value(dtype='int64', id=None),
 'SUMMARY_EN': Value(dtype='string', id=None),
 'SUMMARY_GE': Value(dtype='string', id=None),
 'INJSEVA': Value(dtype='int64', id=None),
 'NUMTOTV': Value(dtype='int64', id=None),
 'WEATHER1': Value(dtype='int64', id=None),
 'WEATHER2': Value(dtype='int64', id=None),
 'WEATHER3': Value(dtype='int64', id=None),
 'WEATHER4': Value(dtype='int64', id=None),
 'WEATHER5': Value(dtype='int64', id=None),
 'WEATHER6': Value(dtype='int64', id=None),
 'WEATHER7': Value(dtype='int64', id=None),
 'WEATHER8': Value(dtype='int64', id=None),
 'INJSEVB': Value(dtype='int64', id=None),
 'words per case summary': Value(dtype='int64', id=None)}

Later in this tutorial we will get to know methods to process datasets,
such as filtering the rows based on conditions, and processing the data in each row.





<a id='tokenize'></a>
<a name='tokenize'></a>
### 2.2. Tokenization: Split Raw Text into Vocabulary Items

Next, we convert the summary texts into tokens,
i.e., the text strings are split into elements of the vocabulary of the NLP model.

As such, the tokenizer and the NLP model need to be aligned.
Changing the tokenizer after training the model would produce unpredictable results.

Let's start with the model
[`distilbert-base-multilingual-cased`](https://huggingface.co/distilbert-base-multilingual-cased).
As the name implies, this model is cased: it distinguishes between "english" and "English".

The model is trained on the concatenation of Wikipedia in 104 different languages listed
[here](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages).
The model has 6 layers, 768 dimensions and 12 heads, for a total of 134 million parameters.
This model is a distilled version of the
[BERT base multilingual model](https://huggingface.co/bert-base-multilingual-cased)
which has 177 million parameters.
On average, the distilled model is twice as fast as the original model.

**If you want to use another model throughout this notebook, please feel free to simply change the following line!**

In [20]:
model_name = "distilbert-base-multilingual-cased"

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
print(f"Tokenizer vocab_size: {tokenizer.vocab_size}")
print(f"Tokenizer model_max_length (maximum context size): {tokenizer.model_max_length}")


Tokenizer vocab_size: 119547
Tokenizer model_max_length (maximum context size): 512


As we can see, the tokenizer has a vocabulary of size 119'547.
The maximum sequence length of the model is 512 tokens.

To see the tokenizer in action, we tokenize the first sentence of an accident description:

In [21]:
text = "V1, a 2000 Pontiac Montana minivan, made a left turn from a private driveway onto a northbound 5-lane two-way, dry asphalt roadway on a downhill grade."
result = tokenizer(text)

Calling the tokenizer returns a `BatchEncoding` object,
which behaves just like a standard Python dictionary that holds input items used by the NLP model.
`input_ids` is the list of token IDs, one per token.
`attention_mask` is a list containing 1 for all elements that correspond to tokens of the input text,
and 0 for padding tokens that are appended to attain a specified sequence length.
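
The relation between padding and the attention mask can be sketched in plain Python. The helper `pad_batch` below is our own illustration, not the tokenizer's implementation; it assumes that `[PAD]` has ID 0 (consistent with `padding_idx=0` in the model printout in Section 2.3) and uses token IDs from the notebook's example sentence (101 = `[CLS]`, 102 = `[SEP]`).

```python
PAD_ID = 0  # assumed ID of the [PAD] token


def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token ID lists at the end to the length of the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding positions.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two short sequences of different length.
batch = pad_batch([[101, 159, 10759, 102], [101, 169, 102]])
print(batch)
```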

To illustrate the meaning of the input IDs, we convert them back to token strings:

In [22]:
print(result)
print(tokenizer.convert_ids_to_tokens(result["input_ids"]))

{'input_ids': [101, 159, 10759, 117, 169, 10180, 23986, 46917, 24408, 25103, 12955, 117, 11019, 169, 12153, 18923, 10188, 169, 14591, 23806, 14132, 31095, 169, 12756, 47755, 126, 118, 23636, 10551, 118, 13170, 117, 36796, 28438, 27015, 15485, 14132, 10135, 169, 12935, 32049, 21958, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'V', '##1', ',', 'a', '2000', 'Pont', '##iac', 'Montana', 'mini', '##van', ',', 'made', 'a', 'left', 'turn', 'from', 'a', 'private', 'drive', '##way', 'onto', 'a', 'north', '##bound', '5', '-', 'lane', 'two', '-', 'way', ',', 'dry', 'asp', '##halt', 'road', '##way', 'on', 'a', 'down', '##hill', 'grade', '.', '[SEP]']


We observe that words like "V1", "Pontiac", "minivan", "driveway" etc. are split into multiple tokens each.
This is typical for WordPiece tokenization adopted by BERT, an approach designed to reduce vocabulary size.
This tokenizer marks sub-words by the prefix `##`.
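
The splitting of a single word into sub-words can be illustrated with a simplified sketch of greedy longest-match-first matching. The function `wordpiece` and the tiny vocabulary below are hypothetical and only serve as an illustration; the real tokenizer additionally handles punctuation, normalization and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]                      # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Tiny hypothetical vocabulary for illustration.
toy_vocab = {"drive", "##way", "mini", "##van", "road", "down", "##hill", "grade"}
splits = [wordpiece(w, toy_vocab) for w in ["driveway", "minivan", "grade", "asphalt"]]
print(splits)
```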

It is interesting to note that `2000` is a separate element of the vocabulary.

The first and last tokens of the tokenized sequence are `CLS` and `SEP`, respectively.
* `CLS` stands for "classification".
The output of the BERT encoder corresponding to this input token is sometimes interpreted to represent the meaning of
the entire sequence (we will check this in [Section 3.2](#classification) of this notebook).
* `SEP` stands for "separation".
In next-sentence prediction tasks, it is used to separate the first from the second sequence.

Here is a list of other special tokens used by the BERT tokenizer:
* The `UNK` token is used to represent tokens that are not available in the vocabulary.
* The `PAD` token is used to pad the length of the tokenized sequence to a fixed length.
A fixed length is required when multiple sequences of different length are tokenized and fed into a BERT model
at the same time.
* The `MASK` token is used for pre-training the BERT model by masked language modeling.
For this task, the model is used to predict the masked token.

In [23]:
print(f"Tokenizer special_tokens_map: {tokenizer.special_tokens_map}")

Tokenizer special_tokens_map: {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}


It is instructive to look at the tokenization of the German translation of the same text:

In [24]:
text = "V1, ein Minivan der Marke Pontiac Montana aus dem Jahr 2000, bog von einer privaten Einfahrt nach links auf eine zweispurige, trockene Asphaltstraße mit 5 Fahrspuren in nördlicher Richtung und einem Gefälle ab."
result = tokenizer(text)
print(result)
print(tokenizer.convert_ids_to_tokens(result["input_ids"]))

{'input_ids': [101, 159, 10759, 117, 10290, 32930, 12955, 10118, 73879, 23986, 46917, 24408, 10441, 10268, 11218, 10180, 117, 66298, 10166, 10599, 73655, 12210, 25131, 10496, 23608, 10329, 10359, 11615, 54609, 13091, 10525, 117, 42169, 21181, 10112, 10882, 37590, 72847, 43968, 10221, 126, 44271, 16757, 54609, 30064, 10106, 28253, 10165, 20139, 10130, 10745, 144, 16822, 38064, 11357, 119, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'V', '##1', ',', 'ein', 'Mini', '##van', 'der', 'Marke', 'Pont', '##iac', 'Montana', 'aus', 'dem', 'Jahr', '2000', ',', 'bog', 'von', 'einer', 'privaten', 'Ein', '##fahrt', 'nach', 'links', 'auf', 'eine', 'zwei', '##sp', '##uri', '##ge', ',', 'tro', '##cken', '##e', 'As', '##pha', '##lts', '##traße', 'mit', '5', 'Fa', '##hr', '##sp', '##uren', 'in', 'nördlich', '##er', 'Richtung', 'und', 'einem', 'G

Tokenizers of multi-lingual models use the same vocabulary for all languages.
Obviously, the tokenizer simply splits the input string into pieces and does not perform any translation:
the English indefinite article "a" (169) is a different token than the equivalent German "ein" (10290).

We observe that the tokenizer is case-sensitive:
It differentiates between the tokens `mini` (25103) and `Mini` (32930).

So far, we have tokenized single sentences only.
Next, we want to tokenize the entire dataset.
This is easily achieved by applying the `map` function to the dataset.

All we need to provide to the `map` function is a function that takes a record or a batch of records from the dataset,
applies an operation to it, and returns a `dict` which defines the columns to be added or updated.

In our case, we supply a function that calls the `tokenizer` as shown before.
As we have seen, calling the tokenizer returns a dict with the keys `input_ids` and `attention_mask`.
Therefore, the `map` function will add columns with these names to the original dataset.
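
The mechanics described above can be mimicked in plain Python. The sketch below is not the `datasets` implementation: `toy_map` and `whitespace_tokenize` are made-up stand-ins that show how a function receiving a batch (a dictionary of lists) returns a dictionary whose keys become new columns.

```python
def toy_map(columns, function, batch_size=2, **fn_kwargs):
    """Apply `function` to batches of rows and merge the returned columns into the data."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch, **fn_kwargs).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}  # returned columns are added or replace existing ones


def whitespace_tokenize(batch, column):
    """Stand-in for the real tokenizer: split on blanks and flag every token with 1."""
    tokens = [text.split() for text in batch[column]]
    return {"tokens": tokens, "attention_mask": [[1] * len(t) for t in tokens]}


data = {"SCASEID": [1, 2, 3], "SUMMARY_EN": ["V1 turned left", "V2 braked", "V1 hit V2"]}
mapped = toy_map(data, whitespace_tokenize, column="SUMMARY_EN")
print(list(mapped))
```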

Since we plan to feed the tokenized sequences into a transformer model,
we need to truncate their length to the maximum length accepted by the transformer.
Moreover, the shorter sequences need to be padded at the end, so that all tokenized sequences have the same length.
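
To make the effect of truncation concrete, here is a minimal sketch. It assumes that truncation cuts from the right and that the special tokens `[CLS]` (ID 101) and `[SEP]` (ID 102) are preserved, which is our understanding of the default behaviour of BERT-style tokenizers; the helper name is our own.

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] as seen in the tokenizer output


def truncate_with_special_tokens(content_ids, max_length):
    """Keep the first tokens so that [CLS] + content + [SEP] fits into max_length."""
    kept = content_ids[:max_length - 2]  # reserve two positions for [CLS] and [SEP]
    return [CLS_ID] + kept + [SEP_ID]


content = list(range(1000, 1010))        # ten made-up content token IDs
short = truncate_with_special_tokens(content, max_length=6)
print(short)
```

With `max_length=512`, at most 510 tokens of the original text survive; everything after that is dropped.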

Overall, only a few lines of code are required to complete the tokenization:

In [25]:
# define a function to tokenize a batch
def tokenize(batch, column):
    return tokenizer(batch[column], truncation=True, padding=True)

# encode the full dataset
dataset_en = dataset.map(tokenize, batched=True, fn_kwargs={"column": "SUMMARY_EN"})
print(dataset_en["train"].column_names)


['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary', 'input_ids', 'attention_mask']


The additional argument `column` is passed to `tokenize` via the dictionary `fn_kwargs`.
As we can see from the progress bars, the map function gets called twice - once for each split.
As expected, new columns `input_ids` and `attention_mask` have been added to the dataset.

We repeat the same procedure for the German texts.

In [26]:
dataset_ge = dataset.map(tokenize, batched=True, fn_kwargs={"column": "SUMMARY_GE"})


Later on, we will also use a dataset which has 80% English texts and 20% German texts:

In [27]:
def map_mixed(x, idx):
    return {"SUMMARY_MX" : x["SUMMARY_GE"] if idx % 5 == 0 else x["SUMMARY_EN"]}
dataset = dataset.map(map_mixed, batched=False, with_indices=True)
dataset_mx = dataset.map(tokenize, batched=True, fn_kwargs={"column": "SUMMARY_MX"})


Now we have created three datasets - with the tokenized English, German and mixed language texts, respectively.
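
As a quick sanity check of the 80/20 mix, we can count how many row indices are multiples of 5 in each split. The sketch assumes that the index passed by `with_indices=True` runs from 0 within each split; the helper `german_share` is our own.

```python
def german_share(n_rows, every=5):
    """Number and share of rows whose index is a multiple of `every` (German text)."""
    n_german = len(range(0, n_rows, every))
    return n_german, n_german / n_rows


# Split sizes as reported by the notebook.
for split, n_rows in {"train": 5559, "test": 1390}.items():
    n_german, share = german_share(n_rows)
    print(split, n_german, f"{share:.1%}")
```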

We could have stored the results in a single dataset (with different column names),
but keeping languages separate will make it easier to convince ourselves in the following examples
that the languages have not been mixed up!

<a id='transformer'></a>
<a name='transformer'></a>
### 2.3. Transformer model

After completing the tokenization of the raw texts, we are ready to apply the transformer model,
in our case the multilingual DistilBERT model.

First, we load the model.
To speed up the following calculations, we opt for GPU support if available.


In [28]:
# load model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42)  # for reproducibility, set random seed before instantiating the model
model = AutoModel.from_pretrained(model_name).to(device)

Downloading model.safetensors:   0%|          | 0.00/542M [00:00<?, ?B/s]

The warning message can be ignored for our application.

Let's examine the model structure:

In [29]:
model

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(119547, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

As we can see, the first block of the model deals with embeddings, with the word embedding as the first layer.
This is followed by the transformer which consists of 6 transformer blocks.

Let's first explore the word embedding.

The goal of the word embedding layer is to assign each element of the vocabulary a vector of length $E$.

The multilingual DistilBERT model has a vocabulary of size $V=119'547$ and a word embedding size of $E=768$.
We can confirm this by looking at the dimension of the word embedding weight tensor:

In [30]:
model.embeddings.word_embeddings

Embedding(119547, 768, padding_idx=0)
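
To make the role of $V$ and $E$ concrete: the word embedding is a lookup table with one row of length $E$ per vocabulary entry, so it holds $V \cdot E$ weights. The following back-of-the-envelope calculation uses only the numbers printed in the model structure above; the variable names are chosen for illustration.

```python
V = 119_547   # vocabulary size, from Embedding(119547, 768, ...)
E = 768       # embedding size
T_MAX = 512   # maximum sequence length, from position_embeddings

word_embedding_params = V * E
position_embedding_params = T_MAX * E

print(f"word embedding weights:     {word_embedding_params:,}")
print(f"position embedding weights: {position_embedding_params:,}")
```

The word embedding table alone thus accounts for roughly 92 million weights, which is a consequence of the large multilingual vocabulary.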

To see the outputs of the transformer encoder, let's apply the transformer to the first record of the dataset,
more precisely to its columns `input_ids` and  `attention_mask`, the outputs of the tokenizer:

In [31]:
example = dataset_en["train"][:1]

input_ids = torch.tensor(example["input_ids"]).to(device)
attention_mask = torch.tensor(example["attention_mask"]).to(device)
with torch.no_grad():
    output = model(input_ids, attention_mask)
print(output)

BaseModelOutput(last_hidden_state=tensor([[[ 0.1148, -0.0254,  0.1447,  ...,  0.1937,  0.0804, -0.2158],
         [ 0.1216, -0.5199,  0.6924,  ...,  0.2711, -0.2492, -0.0172],
         [-0.4065, -0.0786,  0.3362,  ..., -0.2183,  0.0278,  0.1635],
         ...,
         [-0.1276, -0.4791, -0.1539,  ...,  0.0442, -0.2272,  0.1089],
         [-0.1577, -0.4097, -0.2176,  ...,  0.0154, -0.2008, -0.1374],
         [-0.1855, -0.4261, -0.1884,  ..., -0.0515, -0.0600, -0.3426]]],
       device='cuda:0'), hidden_states=None, attentions=None)


This produces a `BaseModelOutput` object which has a named property `last_hidden_state`,
a tensor that represents the hidden state of the final transformer block, i.e. the encoded text sequence!

The dimension of the last hidden state is:

In [32]:
print("dimensions of last hidden state: ", output.last_hidden_state.size())

dimensions of last hidden state:  torch.Size([1, 512, 768])


i.e., \[number of samples (1),  sequence length $T$ (maximum 512 tokens), embedding size $E$ (768)\].

In what follows, we will use the information contained in this tensor to make predictions.


<a id='feature_extraction'></a>
<a name='feature_extraction'></a>
## 3.&nbsp;Using Transformers to Extract Features for Classification or Regression Tasks

In this section you will learn how transformers can be used to extract features from text data for a classification
or regression problem.

The idea is simple: The tokenized raw text data is encoded by the transformer model,
and the features are extracted from the last hidden state.


<a id='extract_encoding'></a>
<a name='extract_encoding'></a>
### 3.1. Extracting the Encoded Text

We have seen above that the DistilBERT model encodes *each token* of each input sample into a vector
of length $E=768$.
As such, the shape of the transformer output depends on the length of the input sequences.
To make predictions, we would prefer a single vector per input sample, independent of the sequence length.

Different approaches are available to achieve this goal:
* Use the tensor corresponding to the `CLS` token, which is the first token of the input sequence in BERT models.
* *Mean pooling*: Taking the average of the tensors over all elements of the sequence.
    Here, the tensors corresponding to a `PAD` token should be excluded because they don't carry any information.

We will implement both techniques and compare results.

The module `tutorial_utils`, which we imported initially, provides a short function `extract_sequence_encoding`.
It applies the NLP model to a batch of encoded input samples, extracts the last hidden state,
and returns two vectors of length 768 for each input sample, corresponding to the two methods explained above.
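
The two pooling strategies can be illustrated on a toy example. The sketch below is not the implementation from `tutorial_utils`: it uses plain Python lists for a single sequence instead of batched tensors, and the function names `cls_pooling` and `mean_pooling` are hypothetical.

```python
def cls_pooling(hidden_states):
    """Return the hidden state of the first token (the CLS position)."""
    return list(hidden_states[0])


def mean_pooling(hidden_states, attention_mask):
    """Average the hidden states over all positions where attention_mask == 1.

    Padding positions (mask == 0) are excluded from the average.
    """
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy sequence: T = 4 positions, E = 2 dimensions; the last position is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
attention_mask = [1, 1, 1, 0]

print(cls_pooling(hidden_states))                   # [1.0, 2.0]
print(mean_pooling(hidden_states, attention_mask))  # [3.0, 4.0]
```

Note how the padding position, despite its large values, has no influence on the mean-pooled vector.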

Let's apply this function to the first sample of the training data:

In [33]:
example = dataset_en["train"][:1]
result = extract_sequence_encoding(example, model)
print(result.keys())

dict_keys(['level_0', 'index', 'SCASEID', 'SUMMARY_EN', 'SUMMARY_GE', 'INJSEVA', 'NUMTOTV', 'WEATHER1', 'WEATHER2', 'WEATHER3', 'WEATHER4', 'WEATHER5', 'WEATHER6', 'WEATHER7', 'WEATHER8', 'INJSEVB', 'words per case summary', 'input_ids', 'attention_mask', 'cls_hidden_state', 'mean_hidden_state'])


As desired, two additional columns `cls_hidden_state` and `mean_hidden_state` were appended.

Therefore, the function can be supplied to the familiar `map` function
to add corresponding columns to the original dataset.
The following lines do this for the full datasets.

On an AWS EC2 p2.xlarge instance, the run time is more than 10 minutes.
We save the resulting datasets to disk.

In [34]:
dataset_en = dataset_en.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_ge = dataset_ge.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_mx = dataset_mx.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_en.save_to_disk("./datasets/dataset_en")
dataset_ge.save_to_disk("./datasets/dataset_ge")
dataset_mx.save_to_disk("./datasets/dataset_mx")

<a id='classification'></a>
<a name='classification'></a>
### 3.2. ... and Using It in a Classification Model

We will now use the encoded texts as features to predict labels taken from certain tabular information available in the dataset.

To this end, we use the following convenience functions implemented in `tutorial_utils.py`:

* `x_train, y_train, x_test, y_test = get_xy(dataset, features, label)`<br>
    Get numpy arrays of features (x) and labels (y) for the train and test splits of `dataset`, where the encoded sentences are stored in the column `features` and the labels in the column `label`.<br><br>
    
* `clf = logistic_regression_classifier(x, y, c=1)`<br>
    Fit and return a multinomial [Logistic Regression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to features `x` and labels `y`. The L2 penalty is controlled by the hyper-parameter `c`.<br><br>
    
* `clf = dummy_classifier(x, y)`<br>
    Fit and return a [Dummy classifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) to features `x` and labels `y`. This classifier always predicts the most frequent class, and `predict_proba` always returns the empirical class distribution of `y`.<br><br>
    
* `score_accuracy, score_log, score_brier, confusion_matrix, fig = evaluate_classifier(y_true, y_pred, p_pred, target_names, display_title_string, file_name)`<br>
    Calculate and display performance metrics of a classifier. The return value `fig` is a Plotly figure representing the confusion matrix plot. The following inputs are expected:<br>
    * the true labels `y_true` (array-like);
    * either the predicted labels `y_pred` (array-like), in which case the log loss and Brier score are not evaluated;
    * or the predicted probabilities `p_pred` (array-like);
    * the class names `target_names`;
    * a display title string;
    * a file name for exporting the figure, or `None`.
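
For reference, the three scores can be written down in a few lines. The sketch below is not the code from `tutorial_utils.py`: it assumes that the multi-class Brier score sums the squared errors over all classes (so it ranges from 0 to 2), which is one common convention, and the function names are hypothetical.

```python
import math


def accuracy(y_true, p_pred):
    """Share of samples whose most probable class equals the true label."""
    hits = sum(1 for y, p in zip(y_true, p_pred) if p.index(max(p)) == y)
    return hits / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for y, p in zip(y_true, p_pred)) / len(y_true)


def brier_score(y_true, p_pred):
    """Mean over samples of the squared error, summed over classes."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(y_true)


# Two toy samples with three classes each.
y_true = [0, 2]
p_pred = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]

print(accuracy(y_true, p_pred), log_loss(y_true, p_pred), brier_score(y_true, p_pred))
```

Accuracy only looks at the most probable class, whereas log loss and Brier score also reward well-calibrated probabilities; this is why predicted probabilities are needed to evaluate them.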

Now the toolbox is ready!

Next, we apply it to a simple classification task.

<a id='case_study_nvehicles'></a>
<a name='case_study_nvehicles'></a>
### 3.3. Case Study: Use Accident Descriptions to Predict the Number of Vehicles Involved

In this case study, we will predict the number of vehicles involved in an accident from the verbal accident description.

Since the data set contains the column `NUMTOTV`, we can adopt a supervised learning approach.

We might consider framing the problem as a regression task, e.g. using Poisson regression. However, looking at the frequency distribution of `NUMTOTV`, it appears unlikely that the Poisson distribution is a good reflection of reality. First, there are no accidents with zero vehicles involved: it takes at least one. So we might consider using a zero-truncated Poisson model. However, the empirical frequency distribution has very little mass at high vehicle counts, so this would not be a plausible model either.

Therefore, we frame the prediction task as multinomial classification. Given that only a small fraction of cases involves four or more vehicles,
and to avoid a heavily imbalanced classification problem, we map these cases to an aggregated class "3+".

To achieve this, we map the column `NUMTOTV` to a new column `labels`, with levels 0 (1 vehicle), 1 (2 vehicles) and 2 (3 or more vehicles).
We choose the column name `labels` because this is expected by the sequence classification model which we fit in Section [4.2](#task_finetuning).

In [35]:
dataset_en = load_from_disk("./datasets/dataset_en")
dataset_ge = load_from_disk("./datasets/dataset_ge")
dataset_mx = load_from_disk("./datasets/dataset_mx")

# map number of vehicles to a new column "labels"
labels = ["1", "2", "3+"]
d = {i: min(i-1, 2) for i in range(1,10)}
dataset_en = dataset_en.map(lambda x: {"labels": d[x["NUMTOTV"]]})
dataset_ge = dataset_ge.map(lambda x: {"labels": d[x["NUMTOTV"]]})
dataset_mx = dataset_mx.map(lambda x: {"labels": d[x["NUMTOTV"]]})
print(dataset_en["train"]["NUMTOTV"][:40])
print(dataset_en["train"]["labels"][:40])

[2, 1, 2, 2, 2, 2, 2, 1, 2, 3, 2, 3, 2, 1, 3, 4, 1, 3, 1, 2, 1, 2, 2, 4, 2, 2, 2, 4, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2]
[1, 0, 1, 1, 1, 1, 1, 0, 1, 2, 1, 2, 1, 0, 2, 2, 0, 2, 0, 1, 0, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]


As explained in Section [3.1](#extract_encoding), we will explore two different ways to use encoded texts:
1. Use the hidden state corresponding to the `CLS` token, which is the first token of the input sequence in BERT models.
2. *Mean pooling*: Taking the average of the tensors over all elements of the sequence.

Let's start with the first approach by using the feature `cls_hidden_state` produced in Section [3.1](#extract_encoding).

Using the toolbox developed before, we fit a dummy classifier and a logistic regression classifier to the features and
labels of the English dataset.

In [36]:
# extract the transformer encoding corresponding to the CLS token
x_train_en, y_train_en, x_test_en, y_test_en = get_xy(dataset_en, "cls_hidden_state", "labels")

# fit dummy classifier
clf_dummy = dummy_classifier(x_train_en, y_train_en)
_ = evaluate_classifier(y_test_en, None, clf_dummy.predict_proba(x_test_en), labels, "Dummy classifier", "cm_nv_dummy")

Dummy classifier
accuracy score = 57.2%,  log loss = 0.961,  Brier loss = 0.574
classification report
               precision    recall  f1-score   support

           1       0.00      0.00      0.00       389
           2       0.57      1.00      0.73       795
          3+       0.00      0.00      0.00       206

    accuracy                           0.57      1390
   macro avg       0.19      0.33      0.24      1390
weighted avg       0.33      0.57      0.42      1390
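
The dummy scores can be reproduced approximately by hand. The sketch below assumes, for simplicity, that the predicted probabilities equal the class shares of the *test* set (389, 795 and 206 cases, taken from the support column above), whereas the actual dummy classifier uses the class shares of the training set. It also assumes that the Brier score is summed over classes.

```python
import math

support = {"1": 389, "2": 795, "3+": 206}   # test-set class counts from the report above
n = sum(support.values())
shares = {label: count / n for label, count in support.items()}

# Always predicting the most frequent class gives an accuracy equal to its share.
accuracy = max(shares.values())

# Predicting the class shares as probabilities for every sample:
log_loss = -sum(s * math.log(s) for s in shares.values())
brier = 1.0 - sum(s ** 2 for s in shares.values())

print(f"accuracy = {accuracy:.1%}, log loss = {log_loss:.3f}, Brier loss = {brier:.3f}")
```

Under these assumptions the values come out close to the reported ones; the small remaining differences in log loss and Brier loss are plausibly due to the slightly different class shares in the training set.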



In [37]:
# fit a classifier to the encoded English texts
clf_en = logistic_regression_classifier(x_train_en, y_train_en, c=10)
_ = evaluate_classifier(y_test_en, None, clf_en.predict_proba(x_test_en), labels, "Logistic regression (a)", "cm_nv_lr_a")

Logistic regression (a)
accuracy score = 90.9%,  log loss = 0.275,  Brier loss = 0.146
classification report
               precision    recall  f1-score   support

           1       0.94      0.93      0.93       389
           2       0.89      0.96      0.92       795
          3+       0.92      0.68      0.78       206

    accuracy                           0.91      1390
   macro avg       0.92      0.85      0.88      1390
weighted avg       0.91      0.91      0.91      1390



We obtain an accuracy score of 91%, compared to 57% with the dummy classifier.
This is already a very good result!

Remember, we have just used the DistilBERT transformer off the shelf, with no tuning whatsoever,
to extract a vector of length 768 representing the information contained in the accident descriptions.
During this entire text encoding, the transformer model was unaware that its output was going to be used to predict the number of vehicles.

How about the second approach, which uses the feature `mean_hidden_state` that was extracted
by mean pooling over the entire encoded sequence?

Let's see:

In [38]:
x_train_en, y_train_en, x_test_en, y_test_en = get_xy(dataset_en, "mean_hidden_state", "labels")
clf_en = logistic_regression_classifier(x_train_en, y_train_en, c=10)
_ = evaluate_classifier(y_test_en, None, clf_en.predict_proba(x_test_en), labels, "Logistic regression (b), train EN, test EN", "cm_nv_EN_EN")

Logistic regression (b), train EN, test EN
accuracy score = 96.0%,  log loss = 0.127,  Brier loss = 0.063
classification report
               precision    recall  f1-score   support

           1       0.96      0.97      0.97       389
           2       0.95      0.98      0.97       795
          3+       0.99      0.86      0.92       206

    accuracy                           0.96      1390
   macro avg       0.97      0.94      0.95      1390
weighted avg       0.96      0.96      0.96      1390



Again, we have used DistilBERT without any fine-tuning.

For the present task, by any of the considered scores, mean pooling performs much better than using the encoding of the `CLS` token.
For this reason, we use mean pooling in what follows.

What would you guess - will the classifier model exhibit a similar performance when trained on the encoded German dataset?

Let's check:

In [39]:
x_train_ge, y_train_ge, x_test_ge, y_test_ge = get_xy(dataset_ge, "mean_hidden_state", "labels")
clf_ge = logistic_regression_classifier(x_train_ge, y_train_ge, c=10)
_, _, _, _, _ = evaluate_classifier(y_test_ge, None, clf_ge.predict_proba(x_test_ge), labels, "train GE, test GE", "cm_nv_GE_GE")

train GE, test GE
accuracy score = 96.0%,  log loss = 0.120,  Brier loss = 0.062
classification report
               precision    recall  f1-score   support

           1       0.97      0.98      0.97       389
           2       0.95      0.98      0.97       795
          3+       0.96      0.86      0.91       206

    accuracy                           0.96      1390
   macro avg       0.96      0.94      0.95      1390
weighted avg       0.96      0.96      0.96      1390



Yes indeed, the performance on the English and German datasets is comparable.
This is what we would have expected - after all we are using a multilingual transformer model.


<a id='cross_lingual_transfer'></a>
<a name='cross_lingual_transfer'></a>
### 3.4. Cross-Lingual Transfer

In practice, it might happen that training data is available (predominantly) in one language,
but we would like to apply the model to test data in another language.
Translating the test data to the language of the training data would be an option,
but let's see how the multilingual transformer model performs.

In our small experiment, we simply switch the languages of the test sets.
This might be hard for the models, since in the entire training process each model has seen only encoded input
from text samples in one language!

First, use the German test set for the model trained on English input:

In [40]:
_ = evaluate_classifier(y_test_ge, None, clf_en.predict_proba(x_test_ge), labels, "train EN, test GE", "cm_nv_EN_GE")

train EN, test GE
accuracy score = 66.0%,  log loss = 1.083,  Brier loss = 0.527
classification report
               precision    recall  f1-score   support

           1       1.00      0.16      0.27       389
           2       0.67      0.86      0.75       795
          3+       0.57      0.85      0.68       206

    accuracy                           0.66      1390
   macro avg       0.75      0.62      0.57      1390
weighted avg       0.75      0.66      0.61      1390



From these rather poor results, we conclude that this approach to cross-language transferability does not work.

Vice versa, use the English test set for the model based on German input:

In [41]:
_ = evaluate_classifier(y_test_en, None, clf_ge.predict_proba(x_test_en), labels, "train GE, test EN", "cm_nv_GE_EN")

train GE, test EN
accuracy score = 24.3%,  log loss = 8.052,  Brier loss = 1.360
classification report
               precision    recall  f1-score   support

           1       0.00      0.00      0.00       389
           2       0.40      0.17      0.24       795
          3+       0.19      0.99      0.32       206

    accuracy                           0.24      1390
   macro avg       0.20      0.39      0.19      1390
weighted avg       0.26      0.24      0.18      1390



Again, performance is unsatisfactory.

To improve results, we need to change the approach.


<a id='multi_lingual_training'></a>
<a name='multi_lingual_training'></a>
### 3.5. Multi-Lingual Training

In a multilingual situation, a possible approach is to train the classifier with a training set consisting
of encoded samples from both languages.
This can always be achieved by translating a fraction of the text data and then using it to train the model.

This is exactly what we are going to do next.
In order to simulate a situation where one language is underrepresented, we create a mixed-language dataset
with about 80% English and 20% German samples, our dataset `dataset_mx` produced in [Section 2.2](#tokenize).

Since we are already using a multilingual transformer model, no further changes are required.

In [42]:
x_train_mx, y_train_mx, x_test_mx, y_test_mx = get_xy(dataset_mx, "mean_hidden_state", "labels")
clf_mx = logistic_regression_classifier(x_train_mx, y_train_mx, c=10)
_ = evaluate_classifier(y_test_en, None, clf_mx.predict_proba(x_test_en), labels, "train EN/GE, test EN", "cm_nv_MX_EN")
_ = evaluate_classifier(y_test_ge, None, clf_mx.predict_proba(x_test_ge), labels, "train EN/GE, test GE", "cm_nv_MX_GE")

train EN/GE, test EN
accuracy score = 95.7%,  log loss = 0.136,  Brier loss = 0.068
classification report
               precision    recall  f1-score   support

           1       0.96      0.98      0.97       389
           2       0.95      0.97      0.96       795
          3+       0.97      0.85      0.90       206

    accuracy                           0.96      1390
   macro avg       0.96      0.93      0.95      1390
weighted avg       0.96      0.96      0.96      1390



train EN/GE, test GE
accuracy score = 95.2%,  log loss = 0.160,  Brier loss = 0.080
classification report
               precision    recall  f1-score   support

           1       0.96      0.97      0.97       389
           2       0.95      0.97      0.96       795
          3+       0.94      0.85      0.90       206

    accuracy                           0.95      1390
   macro avg       0.95      0.93      0.94      1390
weighted avg       0.95      0.95      0.95      1390



This is a very good outcome. The scores are close to those achieved in the single-language setting!

To conclude, a multi-lingual situation can be handled by a multi-lingual transformer model. For the best performance, the classifier should be trained on the encoded sequences from all languages.

<a id='finetuning'></a>
<a name='finetuning'></a>
## 4.&nbsp;Fine-Tuning – Improving the Model

In the previous case study, we have used the DistilBERT model without any adaptation to the text data at hand,
simply by using the sequence encoding produced by the model.
As such, the language representation, which the model has learned from a large corpus of multilingual data, is transferred
to the text data at hand.
This approach is called transfer learning.
The advantage of transfer learning is that a powerful (but relatively complex) model can be trained on a large corpus
of data, using large-scale computing power, and then be applied to situations where availability of data or computing
power would not allow for such complex models.

For the task at hand, the results are already very good.
However, in certain situations it might be required to further improve model performance.

In the following sections you will learn how to fine-tune a transformer model.
We will explore two approaches to fine-tuning:

* *Domain-specific fine-tuning* involves updating the parameters of the transformer model using text data which is
    relevant to the domain where the model will be applied.
    However, the model is not necessarily tuned for a specific downstream task of interest.
* *Task-specific fine-tuning* uses domain-specific text data and tunes the parameters of the transformer model
    while training it for a given downstream task of interest.

The advantage of the first approach is that it can be performed in an unsupervised fashion,
i.e., it does not require labeled data.

On the other hand, task-specific fine-tuning is expected to produce better performance on the particular task
which the model was tuned for, so it might be the method of choice if there is a single down-stream task
and sufficient labeled data.

Let's explore these two fine-tuning approaches in turn.

<a id='domain_finetuning'></a>
<a name='domain_finetuning'></a>
### 4.1. Domain-specific fine-tuning

Domain-specific fine-tuning can be achieved by applying the model to a "masked language modeling" task.
This involves taking a sentence, randomly masking a certain percentage of the words in the input,
and then running the entire masked sentence through the model which has to predict the masked words.
This self-supervised approach generates inputs and labels automatically from the texts and does not require
any human labelling.
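
The masking step can be sketched in plain Python. This is an illustration of the idea rather than the library code: the function name `mask_tokens`, the token ids and the argument names are hypothetical, and the 80/10/10 split between mask token, random token and unchanged token follows the original BERT recipe, which is assumed here. The label `-100` marks positions that do not contribute to the loss, a convention used by PyTorch-based models.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(), mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels) for one tokenized sequence.

    labels holds the original id at selected positions and -100 elsewhere.
    """
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for pos, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mlm_probability:
            continue                      # position not selected for prediction
        labels[pos] = tok                 # the model has to recover the original token
        draw = rng.random()
        if draw < 0.8:
            inputs[pos] = mask_id         # 80%: replace by the mask token
        elif draw < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: replace by a random token
        # remaining 10%: keep the original token
    return inputs, labels


# Hypothetical ids: 101 and 102 stand for the special tokens, 103 for the mask token.
token_ids = [101, 2000, 2001, 2002, 2003, 2004, 2005, 102]
inputs, labels = mask_tokens(token_ids, mask_id=103, vocab_size=119_547,
                             special_ids=(101, 102), mlm_probability=0.5)
print(inputs)
print(labels)
```

Since inputs and labels are both derived from the text itself, any collection of unlabeled accident descriptions can serve as training data.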

This is very easy to implement using the Transformers library.
You will see three new elements of the Transformers library in action:

* The `AutoModelForMaskedLM` class loads the DistilBERT model with a model head suitable for the masked language
    modeling task.
* The `DataCollatorForLanguageModeling` class forms training batches from the dataset and handles the masking.
* The `Trainer` class provides the interface to train the model.

Depending on the hardware available, training might take a rather long time.
Therefore, if available, we use GPU support.
On an AWS EC2 p2.xlarge instance, the run time is about 55 minutes.
We store the trained model for later use.

If you do not have enough time to perform this step right now, you can skip this section and return later. The remainder of this notebook does not depend on it.

In [43]:
# load model and tokenizer and define the DataCollator
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42)  # for reproducibility, set random seed before instantiating the model
model_mlm = AutoModelForMaskedLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
dataset_mx = load_from_disk("./datasets/dataset_mx")

# define training arguments
training_args = TrainingArguments(
    output_dir="models/" + model_name + "_mlm_epochs",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=4,
    save_strategy=trainer_utils.IntervalStrategy.NO,
)
trainer = Trainer(
    model=model_mlm,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset_mx["train"]
)
trainer.train()
trainer.save_model("models/" + model_name + "_mlm")

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 1.4351 |
| 1000 | 1.1164 |
| 1500 | 1.0015 |
| 2000 | 0.9418 |
| 2500 | 0.8912 |
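
The step counts in the training log can be related to the training arguments. The calculation below assumes a single device, no gradient accumulation and a logging interval of 500 steps (assumed to be the default, since the cell does not set it). Under these assumptions it reproduces the five logged steps.

```python
import math

n_train = 5559          # training records, as reported in the notebook output
batch_size = 4          # per_device_train_batch_size
epochs = 2              # num_train_epochs
logging_steps = 500     # assumed logging interval

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch, total_steps, logged_steps)
```

With 2780 optimization steps in total, the last logged value at step 2500 is taken shortly before the end of the second epoch.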


Now, `model_mlm` holds the DistilBERT model, fine-tuned to the mixed-language accident descriptions
using masked-language-modeling.

Next, we apply this model to all input sequences and extract the last hidden state.
The procedure is the same as in Section [3.1](#extract_encoding).
To avoid confusion, we create new datasets and store them on disk for later use,
so that this step does not need to be repeated each time this notebook is re-run.

In [44]:
dataset_en = load_from_disk("./datasets/dataset_en")
dataset_ge = load_from_disk("./datasets/dataset_ge")
dataset_mx = load_from_disk("./datasets/dataset_mx")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained("models/" + model_name + "_mlm").to(device)
dataset_en_pretrained = dataset_en.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_ge_pretrained = dataset_ge.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_mx_pretrained = dataset_mx.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
dataset_en_pretrained.save_to_disk("./datasets/dataset_en_pretrained")
dataset_ge_pretrained.save_to_disk("./datasets/dataset_ge_pretrained")
dataset_mx_pretrained.save_to_disk("./datasets/dataset_mx_pretrained")

Now let's see to what extent domain-specific fine-tuning is able to improve the performance of the classification model.

To this end, we perform the same steps as in Sections [3.3](#case_study_nvehicles)-[3.5](#multi_lingual_training):

In [45]:
dataset_en_pretrained = load_from_disk("./datasets/dataset_en_pretrained")
dataset_ge_pretrained = load_from_disk("./datasets/dataset_ge_pretrained")
dataset_mx_pretrained = load_from_disk("./datasets/dataset_mx_pretrained")

# map number of vehicles to a new column "labels"
labels = ["1", "2", "3+"]
d = {i: min(i-1, 2) for i in range(1,10)}
dataset_en = dataset_en_pretrained.map(lambda x: {"labels": d[x["NUMTOTV"]]})
dataset_ge = dataset_ge_pretrained.map(lambda x: {"labels": d[x["NUMTOTV"]]})
dataset_mx = dataset_mx_pretrained.map(lambda x: {"labels": d[x["NUMTOTV"]]})

# extract features and labels and create multi-lingual dataset
x_train_en, y_train_en, x_test_en, y_test_en = get_xy(dataset_en, "mean_hidden_state", "labels")
x_train_ge, y_train_ge, x_test_ge, y_test_ge = get_xy(dataset_ge, "mean_hidden_state", "labels")
x_train_mx, y_train_mx, x_test_mx, y_test_mx = get_xy(dataset_mx, "mean_hidden_state", "labels")

In [46]:
# fit logistic regression classifiers to each of the three datasets and (cross-) evaluate them
clf_en = logistic_regression_classifier(x_train_en, y_train_en, c=10)
_ = evaluate_classifier(y_test_en, None, clf_en.predict_proba(x_test_en), labels, "train EN, test EN", "cm_nv_pr_EN_EN")
_ = evaluate_classifier(y_test_ge, None, clf_en.predict_proba(x_test_ge), labels, "train EN, test GE", "cm_nv_pr_EN_GE")

train EN, test EN
accuracy score = 97.1%,  log loss = 0.094,  Brier loss = 0.047
classification report
               precision    recall  f1-score   support

           1       0.97      0.99      0.98       389
           2       0.97      0.98      0.98       795
          3+       0.98      0.88      0.93       206

    accuracy                           0.97      1390
   macro avg       0.97      0.95      0.96      1390
weighted avg       0.97      0.97      0.97      1390



train EN, test GE
accuracy score = 88.5%,  log loss = 0.308,  Brier loss = 0.174
classification report
               precision    recall  f1-score   support

           1       0.96      0.78      0.86       389
           2       0.85      0.98      0.91       795
          3+       0.96      0.73      0.83       206

    accuracy                           0.88      1390
   macro avg       0.92      0.83      0.87      1390
weighted avg       0.89      0.88      0.88      1390



In [47]:
clf_ge = logistic_regression_classifier(x_train_ge, y_train_ge, c=10)
_ = evaluate_classifier(y_test_ge, None, clf_ge.predict_proba(x_test_ge), labels, "train GE, test GE", "cm_nv_pr_GE_GE")
_ = evaluate_classifier(y_test_en, None, clf_ge.predict_proba(x_test_en), labels, "train GE, test EN", "cm_nv_pr_GE_EN")

train GE, test GE
accuracy score = 96.8%,  log loss = 0.115,  Brier loss = 0.055
classification report
               precision    recall  f1-score   support

           1       0.98      0.98      0.98       389
           2       0.96      0.98      0.97       795
          3+       0.97      0.89      0.93       206

    accuracy                           0.97      1390
   macro avg       0.97      0.95      0.96      1390
weighted avg       0.97      0.97      0.97      1390



train GE, test EN
accuracy score = 65.1%,  log loss = 3.024,  Brier loss = 0.665
classification report
               precision    recall  f1-score   support

           1       1.00      0.00      0.01       389
           2       0.62      1.00      0.77       795
          3+       0.99      0.53      0.69       206

    accuracy                           0.65      1390
   macro avg       0.87      0.51      0.49      1390
weighted avg       0.78      0.65      0.54      1390



In [48]:
clf_mx = logistic_regression_classifier(x_train_mx, y_train_mx, c=10)
_ = evaluate_classifier(y_test_en, None, clf_mx.predict_proba(x_test_en), labels, "train EN/GE, test EN", "cm_nv_pr_MX_EN")
_ = evaluate_classifier(y_test_ge, None, clf_mx.predict_proba(x_test_ge), labels, "train EN/GE, test GE", "cm_nv_pr_MX_GE")

train EN/GE, test EN
accuracy score = 96.9%,  log loss = 0.098,  Brier loss = 0.049
classification report
               precision    recall  f1-score   support

           1       0.97      0.99      0.98       389
           2       0.97      0.98      0.97       795
          3+       0.98      0.87      0.92       206

    accuracy                           0.97      1390
   macro avg       0.97      0.95      0.96      1390
weighted avg       0.97      0.97      0.97      1390



train EN/GE, test GE
accuracy score = 95.8%,  log loss = 0.146,  Brier loss = 0.070
classification report
               precision    recall  f1-score   support

           1       0.97      0.97      0.97       389
           2       0.96      0.97      0.96       795
          3+       0.94      0.89      0.92       206

    accuracy                           0.96      1390
   macro avg       0.96      0.94      0.95      1390
weighted avg       0.96      0.96      0.96      1390



Compared with the results above, domain-specific fine-tuning on the English training set has improved the scores, but the cross-language transfer cases have not yet reached a satisfactory level.

<a id='task_finetuning'></a>
<a name='task_finetuning'></a>
### 4.2. Task-specific fine-tuning

An alternative to domain-specific fine-tuning is task-specific fine-tuning.

The idea is to train a transformer model directly on the task at hand, in our case a sequence classification task.
The process is very similar to the masked language modeling used for domain-specific fine-tuning, except that
we load a sequence classification model using the class `AutoModelForSequenceClassification`.
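
A sequence classification model returns one raw score (logit) per class for each input text. The evaluation cells in this section apply `softmax` to these logits to obtain class probabilities, and `compute_metrics` takes the `argmax` to obtain the predicted label. The following dependency-free sketch illustrates both steps; the logit values are made up for illustration and do not come from the model.

```python
import math

def softmax(logits):
    """Convert a list of raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one text and the three classes "1", "2", "3+"
class_names = ["1", "2", "3+"]
logits = [0.2, 3.1, -1.0]

probs = softmax(logits)
# argmax: the predicted class is the one with the highest probability
predicted = class_names[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```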

The following code tunes a sequence classification model that uses the English accident descriptions to predict
the number of vehicles involved.
On an AWS EC2 p2.xlarge instance, the run time is about 20 minutes.
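
The training cell that follows monitors a weighted F1 score (`average="weighted"`). As a reminder of what this means, the sketch below recomputes the macro and weighted averages from the per-class values of the "train EN, test GE" classification report of the logistic regression classifier shown earlier. Because the inputs are the rounded values from that report, the results are approximations of what `f1_score` would return on the raw predictions.

```python
# Per-class F1 scores and supports, copied from the "train EN, test GE"
# classification report of the logistic regression classifier.
f1_per_class = {"1": 0.86, "2": 0.91, "3+": 0.83}
support = {"1": 389, "2": 795, "3+": 206}

total = sum(support.values())

# Macro average: every class counts equally
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1_per_class[c] * support[c] for c in support) / total

print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

Both values agree with the `macro avg` and `weighted avg` rows of that report (0.87 and 0.88).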

In [49]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42)  # for reproducibility, set random seed before instantiating the model
model_cls = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(labels)).to(device)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

# train the model
batch_size = 8
logging_steps = len(dataset_en["train"]) // batch_size
training_args = TrainingArguments(
    output_dir="models/" + model_name + "nv_epochs",
    num_train_epochs=2,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    metric_for_best_model="f1",
    logging_steps=logging_steps,
    save_strategy=trainer_utils.IntervalStrategy.NO,
)
trainer = Trainer(model=model_cls, args=training_args,
                  compute_metrics=compute_metrics, train_dataset=dataset_en["train"],
                  eval_dataset=dataset_en["test"])
trainer.train();
trainer.save_model("models/" + model_name + "_nv")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
694,0.1541
1388,0.0528


In [50]:
# evaluate model performance using predictions on the English test set
predictions_en = trainer.predict(dataset_en["test"])
_ = evaluate_classifier(predictions_en.label_ids, None, softmax(predictions_en.predictions, axis=1), labels, "train EN, test EN", "cm_nv_tsk_EN_EN")

train EN, test EN
accuracy score = 99.6%,  log loss = 0.025,  Brier loss = 0.007
classification report
               precision    recall  f1-score   support

           1       0.99      1.00      0.99       389
           2       1.00      1.00      1.00       795
          3+       1.00      0.99      1.00       206

    accuracy                           1.00      1390
   macro avg       1.00      1.00      1.00      1390
weighted avg       1.00      1.00      1.00      1390



In [51]:
# evaluate model performance using predictions on the German test set (cross-lingual test)
predictions_ge = trainer.predict(dataset_ge["test"])
_ = evaluate_classifier(predictions_ge.label_ids, None, softmax(predictions_ge.predictions, axis=1), labels, "train EN, test GE", "cm_nv_task_EN_GE")

train EN, test GE
accuracy score = 99.6%,  log loss = 0.025,  Brier loss = 0.007
classification report
               precision    recall  f1-score   support

           1       0.99      1.00      0.99       389
           2       1.00      1.00      1.00       795
          3+       1.00      0.99      1.00       206

    accuracy                           1.00      1390
   macro avg       1.00      1.00      1.00      1390
weighted avg       1.00      1.00      1.00      1390



The scores on the English test set have improved markedly: the accuracy reaches 99.6%, compared to 97.1% for the logistic regression classifier in the previous section.

What is even more impressive is the performance on cross-lingual transfer:
Despite the fact that the model has been trained on English texts only,
its performance scores on the German test set are very good.

This is an excellent result!

<a id='understand'></a>
<a name='understand'></a>
## 5.&nbsp;Understand Prediction Errors and Interpret Predictions

In this section you will learn how to analyze prediction errors and how to interpret predictions.

We will study a more challenging example.


<a id='case_study_injuries'></a>
<a name='case_study_injuries'></a>
### 5.1 Case Study: Use Accident Descriptions to Identify Bodily Injury

As seen in the previous section,
predicting the number of vehicles from the available accident descriptions is a
relatively easy task for the transformer model, even in a multi-lingual situation.

Therefore, we will turn to a somewhat more difficult task: identifying cases which lead to bodily injuries. We use the column `INJSEVB` as label.

The process is identical to the previous case study:
* Start from the original dataset, enrich it with hidden states produced by the original transformer model
    (before domain-specific fine-tuning).
    Given the experience from the previous task, we use the mean pooling output.
* For comparison, we also load the encodings produced by the transformer model after domain-specific fine-tuning.
* Define the labels.
* Fit a dummy classifier, which always predicts the most frequent class.
* Fit a logistic regression classifier, and evaluate its performance.
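
The mean pooling mentioned in the first step can be sketched as follows. This is a minimal, dependency-free illustration on toy numbers. The function name `mean_pool` is hypothetical, and ignoring padded positions via the attention mask is an assumption: how the `mean_hidden_state` column was actually computed is not shown in this section.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors of one sequence, ignoring padded positions.

    hidden_states: list of token vectors (one list of floats per token)
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding
    """
    dim = len(hidden_states[0])
    n_real = sum(attention_mask)
    pooled = [0.0] * dim
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            for j in range(dim):
                pooled[j] += vec[j] / n_real
    return pooled

# Toy example: three tokens with 2-dimensional hidden states; the last is padding
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = mean_pool(hidden, mask)
print(embedding)
```

The result is a single fixed-length vector per text, which is what the logistic regression classifier receives as input.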

In case you have skipped Section [4.1 Domain-specific finetuning](#domain_finetuning), the dataset `./datasets/dataset_en_pretrained` will not be available.
In this case, keep the lines referring to `dataset_pr` commented out in the blocks below.

In [52]:
dataset_en = load_from_disk("./datasets/dataset_en")
dataset_ge = load_from_disk("./datasets/dataset_ge")
dataset_mx = load_from_disk("./datasets/dataset_mx")
#dataset_pr = load_from_disk("./datasets/dataset_en_pretrained")

# map injuries
labels = ["0", "1"]
dataset_en = dataset_en.rename_column("INJSEVB", "labels")
dataset_ge = dataset_ge.rename_column("INJSEVB", "labels")
dataset_mx = dataset_mx.rename_column("INJSEVB", "labels")
#dataset_pr = dataset_pr.rename_column("INJSEVB", "labels")

x_train_en, y_train_en, x_test_en, y_test_en = get_xy(dataset_en, "mean_hidden_state", "labels")
x_train_ge, y_train_ge, x_test_ge, y_test_ge = get_xy(dataset_ge, "mean_hidden_state", "labels")
x_train_mx, y_train_mx, x_test_mx, y_test_mx = get_xy(dataset_mx, "mean_hidden_state", "labels")
#x_train_pr, y_train_pr, x_test_pr, y_test_pr = get_xy(dataset_pr, "mean_hidden_state", "labels")

In [53]:
# fit dummy classifier
clf_dummy = dummy_classifier(x_train_en, y_train_en)
_ = evaluate_classifier(y_test_en, None, clf_dummy.predict_proba(x_test_en), labels, "Dummy classifier", "cm_inj_dummy")

Dummy classifier
accuracy score = 58.7%,  log loss = 0.679,  Brier loss = 0.486
classification report
               precision    recall  f1-score   support

           0       0.59      1.00      0.74       816
           1       0.00      0.00      0.00       574

    accuracy                           0.59      1390
   macro avg       0.29      0.50      0.37      1390
weighted avg       0.34      0.59      0.43      1390



In [54]:
# fit logistic regression classifier to the encoded English texts (by the original DistilBERT model)
clf_en = logistic_regression_classifier(x_train_en, y_train_en, c=10)
_ = evaluate_classifier(y_test_en, None, clf_en.predict_proba(x_test_en), labels, "Logistic regression, DistilBERT", "cm_inj_lr")

Logistic regression, DistilBERT
accuracy score = 80.1%,  log loss = 0.400,  Brier loss = 0.259
classification report
               precision    recall  f1-score   support

           0       0.83      0.83      0.83       816
           1       0.76      0.75      0.76       574

    accuracy                           0.80      1390
   macro avg       0.79      0.79      0.79      1390
weighted avg       0.80      0.80      0.80      1390



In case you have skipped Section [4.1 Domain-specific finetuning](#domain_finetuning), please also skip the following cell.

In [55]:
# fit logistic regression classifier to the encoded English texts (by the fine-tuned DistilBERT model)
#clf_pr = logistic_regression_classifier(x_train_pr, y_train_pr, c=10)
#_ = evaluate_classifier(y_test_pr, None, clf_pr.predict_proba(x_test_pr), labels, "Logistic regression - 2 epochs pre-training", "cm_inj_pr")

We observe the following:
* The accuracy score of the dummy classifier is 59%.
* Using the logistic regression classifier on the outputs of the DistilBERT model with two epochs of domain-specific fine-tuning improves the scores compared to using the outputs of the plain DistilBERT model.
* The performance on the class `0` is better than on the class `1` because of a large number of false positives.
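
The accuracy of the dummy classifier can be verified directly from the class supports in the classification report: a classifier that always predicts the most frequent class is correct exactly for the samples of that class. The sketch assumes that the majority class is the same in the training and test sets, which the recall of 1.00 on class `0` confirms.

```python
# Class supports of the English test set, taken from the classification report
support = {"0": 816, "1": 574}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Always predicting the majority class is correct for all samples of that class
baseline_accuracy = support[majority_class] / total

print(f"majority class = {majority_class}, accuracy = {baseline_accuracy:.1%}")
```

Any useful classifier must beat this baseline of 58.7%.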

Next, we perform task-specific fine-tuning.
On an AWS EC2 p2.xlarge instance, the run time is about 20 minutes.

In [56]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42)  # for reproducibility, set random seed before instantiating the model
model_cls_inj = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(labels)).to(device)
batch_size = 8
logging_steps = len(dataset_en["train"]) // batch_size
training_args = TrainingArguments(
    output_dir="models/" + model_name + "inj_epochs",
    num_train_epochs= 2,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    metric_for_best_model="f1",
    disable_tqdm=False,
    logging_steps=logging_steps,
    save_strategy=trainer_utils.IntervalStrategy.NO,
)
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
trainer = Trainer(model=model_cls_inj, args=training_args,
                  compute_metrics=compute_metrics, train_dataset=dataset_en["train"], eval_dataset=dataset_en["test"])
trainer.train();
trainer.save_model("models/" + model_name + "_inj")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
694,0.5486
1388,0.3539


In [57]:
# Execute the following line to load the trained model from disk.
# trainer = Trainer(AutoModelForSequenceClassification.from_pretrained("models/" + model_name + "_inj", num_labels=len(labels)).to(torch.device("cuda" if torch.cuda.is_available() else "cpu")))

In [58]:
# evaluate model performance using predictions on the English test set
predictions_en = trainer.predict(dataset_en["test"])
_ = evaluate_classifier(predictions_en.label_ids, None, softmax(predictions_en.predictions, axis=1), labels,
                        "DistilBERT classifier - 2 epochs task-specific", "cm_inj_tsk")

DistilBERT classifier - 2 epochs task-specific
accuracy score = 89.1%,  log loss = 0.297,  Brier loss = 0.174
classification report
               precision    recall  f1-score   support

           0       0.91      0.90      0.91       816
           1       0.86      0.87      0.87       574

    accuracy                           0.89      1390
   macro avg       0.89      0.89      0.89      1390
weighted avg       0.89      0.89      0.89      1390



We observe the following:
* Task-specific fine-tuning has further improved all scores.
* There is still a relatively large number of false positives.

<a id='investigate'></a>
<a name='investigate'></a>
### 5.2. Investigate False Positives and False Negatives

To investigate the prediction errors, we export the predictions into an Excel file with the following columns:

| column | meaning |
|---|---|
| `SCASEID` | unique identification number of the case |
| `SUMMARY_EN` | description of the accident, in English |
| `SUMMARY_TRUNCATED` | description of the accident, in English, truncated to a length of 512 tokens |
| `INJSEVA` |  most serious injury sustained in the case, as per Police Accident Report |
| `labels` | indicator of bodily injury `INJSEVB` (true label) |
| `pred` | predicted label |
| `0` | probability of negative label |
| `1` | probability of positive label |

In [59]:
# export prediction results for error analysis
dataset_en.set_format(type="pandas")
df_res = pd.concat([dataset_en["test"].to_pandas(),
                    pd.DataFrame(data=softmax(predictions_en.predictions, axis=1), columns=["0", "1"]),
                    pd.DataFrame(data=np.argmax(predictions_en.predictions, -1).reshape((-1,1)), columns=['pred'])
                ], axis=1)
df_res = df_res[["SCASEID", "SUMMARY_EN", "INJSEVA", "labels", "pred", "0", "1"]]
dataset_en.set_format()
for i in range(df_res.shape[0]):
    df_res.loc[i, "SUMMARY_TRUNCATED"] = tokenizer.convert_tokens_to_string(tokenizer.tokenize(df_res.loc[i, "SUMMARY_EN"], truncation=True))
df_res.to_excel("./results/error_analysis_inj.xlsx")

The first step of the error analysis is to inspect the samples producing false negative and false positive predictions.
Reading every single text would be very tedious. It is therefore worthwhile to focus on those examples where the probability assigned to the false prediction was high,
i.e., cases where the model was confident but wrong.
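
One way to find these cases is to keep only the misclassified samples and sort them by the probability assigned to the (wrong) predicted class. The sketch below uses a handful of made-up records with the same fields as the exported table; the `SCASEID` values and probabilities are hypothetical, and `confident_errors` is a helper name introduced here for illustration.

```python
# Hypothetical prediction records with the same fields as the exported table:
# true label, predicted label, and the probabilities of classes 0 and 1.
records = [
    {"SCASEID": "A", "labels": 1, "pred": 0, "0": 0.97, "1": 0.03},
    {"SCASEID": "B", "labels": 0, "pred": 0, "0": 0.88, "1": 0.12},
    {"SCASEID": "C", "labels": 0, "pred": 1, "0": 0.40, "1": 0.60},
    {"SCASEID": "D", "labels": 1, "pred": 0, "0": 0.55, "1": 0.45},
]

def confident_errors(rows):
    """Return misclassified rows, most confident wrong prediction first."""
    wrong = [r for r in rows if r["pred"] != r["labels"]]
    # The confidence of a prediction is the probability of the predicted class
    return sorted(wrong, key=lambda r: r[str(r["pred"])], reverse=True)

ranked = confident_errors(records)
false_negatives = [r["SCASEID"] for r in ranked if r["labels"] == 1]
false_positives = [r["SCASEID"] for r in ranked if r["labels"] == 0]
print(false_negatives, false_positives)
```

The same ranking can be obtained in Excel or pandas by filtering on `labels != pred` and sorting on the probability columns.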

Looking at the false negatives, we observe that there are many cases where the model assigns a high probability to the negative class.
We suspect that truncation is responsible for many of the false negatives – the relevant part of the text was discarded.

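To address this issue, we split the text into chunks of at most 512 tokens,
run the prediction on each chunk and apply the logical OR function to the results.
The tokenizer call below uses the default `stride=0`, so consecutive chunks do not overlap; passing a positive `stride` would make them overlap.
We implement this functionality in a simple function that returns an additional column `preds`,
containing a list of predicted labels, with one element for each chunk.
The final label `pred` is then the maximum over `preds`, which for binary labels is the logical OR.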
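
The idea can be sketched without any library as follows. The helpers `split_into_chunks` and `combine_with_or` are hypothetical names introduced for illustration; the sketch ignores special tokens and is not the tokenizer's own implementation, it only shows how a long sequence is cut into chunks with an optional overlap and how binary chunk predictions are combined.

```python
def split_into_chunks(token_ids, max_len, stride):
    """Split a token sequence into chunks of at most `max_len` tokens,
    where consecutive chunks share `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += max_len - stride
    return chunks

def combine_with_or(chunk_predictions):
    """Logical OR of binary chunk predictions: 1 if any chunk predicts 1."""
    return max(chunk_predictions)

tokens = list(range(10))  # stand-in for token ids
chunks = split_into_chunks(tokens, max_len=4, stride=1)

# Hypothetical per-chunk predictions; only the last chunk signals an injury
label = combine_with_or([0, 0, 1])
print(chunks, label)
```

With the OR rule, a single chunk that mentions an injury is enough to flag the whole report, which is why this approach targets false negatives caused by truncation.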

In [60]:
def predict_with_overflow(x, model, feature):
    t = tokenizer(x[feature], truncation=True, padding=True, return_overflowing_tokens=True)
    input_ids = torch.tensor(t["input_ids"]).to(model.device)
    attention_mask = torch.tensor(t["attention_mask"]).to(model.device)
    with torch.no_grad():
        preds = np.argmax(model(input_ids, attention_mask).logits.cpu(), -1)
    return {"preds": preds}

In [61]:
# Execute the following lines to load the trained model and the tokenizer from disk.
# model_cls_inj = AutoModelForSequenceClassification.from_pretrained("models/" + model_name + "_inj", num_labels=len(labels)).to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
# tokenizer = AutoTokenizer.from_pretrained(model_name)

In [62]:
dataset_en_overflow = dataset_en["test"].map(predict_with_overflow, batched=False, fn_kwargs={"model": model_cls_inj, "feature": "SUMMARY_EN"})
dataset_en_overflow = dataset_en_overflow.map(lambda x: {"pred": max(x["preds"])})


In [63]:
_ = evaluate_classifier(predictions_en.label_ids, dataset_en_overflow["pred"], None, labels,
                        "DistilBERT classifier - split inputs", "cm_inj_split")

DistilBERT classifier - split inputs
accuracy score = 91.5%,  log loss = nan,  Brier loss = nan
classification report
               precision    recall  f1-score   support

           0       0.95      0.90      0.93       816
           1       0.87      0.94      0.90       574

    accuracy                           0.92      1390
   macro avg       0.91      0.92      0.91      1390
weighted avg       0.92      0.92      0.92      1390



In [64]:
dataset_en_overflow.set_format(type="pandas")
df_res = dataset_en_overflow.to_pandas()
df_res = df_res[["SCASEID", "SUMMARY_EN", "INJSEVA", "labels", "pred"]]
dataset_en_overflow.set_format()  # reset the output format of the dataset that was switched to pandas above
for i in range(df_res.shape[0]):
    df_res.loc[i, "SUMMARY_TRUNCATED"] = tokenizer.convert_tokens_to_string(tokenizer.tokenize(df_res.loc[i, "SUMMARY_EN"], truncation=True))
df_res.to_excel("./results/error_analysis_inj_overflow.xlsx")

The number of false negatives has decreased markedly, as expected: the recall on class `1` has increased from 0.87 to 0.94, and the accuracy score from 89.1% to 91.5%.
Since we have not implemented any logic to combine the predicted probabilities of the different chunks, the log loss and Brier loss cannot be evaluated in this case.
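
If probabilities were needed, one possible rule (an assumption for illustration, not something evaluated in this tutorial) would be to take the maximum positive-class probability over the chunks. Thresholding this maximum at 0.5 reproduces the OR rule applied to the per-chunk labels. The function name `combine_chunk_probabilities` and the numbers are hypothetical, and the resulting probabilities are not guaranteed to be well calibrated.

```python
def combine_chunk_probabilities(chunk_probs):
    """Combine per-chunk probabilities of the positive class into one
    document-level pair [P(0), P(1)] by taking the maximum over chunks."""
    p1 = max(chunk_probs)
    return [1.0 - p1, p1]

# Hypothetical positive-class probabilities for three chunks of one report
chunk_probs = [0.10, 0.25, 0.80]
doc_probs = combine_chunk_probabilities(chunk_probs)
doc_label = int(doc_probs[1] > 0.5)

# The thresholded maximum agrees with the OR of the per-chunk labels
or_label = max(int(p > 0.5) for p in chunk_probs)
print(doc_probs, doc_label, or_label)
```

With document-level probabilities of this kind, log loss and Brier loss could be computed in the same way as for the other classifiers.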

<a id='interpret'></a>
<a name='interpret'></a>
### 5.3. Use Captum and `transformers-interpret`  to Interpret Predictions


Transformer models are quite complex, and therefore, interpreting model output can be difficult.

Our main interest is in knowing which parts of the input text cause the classifier to arrive at a particular prediction.
One way to answer this question is the so-called integrated gradients method.
It is available through the library [transformers_interpret](https://github.com/cdpierse/transformers-interpret),
which provides a convenient interface to the library [Captum](https://captum.ai/),
an open source, extensible library for model interpretability built on PyTorch.
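
To give an intuition for integrated gradients, the sketch below applies the method to a toy logistic regression model instead of a transformer. The attribution of input $i$ is $(x_i - x'_i)$ times the average gradient along the straight path from a baseline $x'$ to the input $x$, approximated here by a midpoint sum with `n_steps` points. All parameters and inputs are made up; Captum may use a different quadrature rule, and for text models the method is applied to token embeddings rather than to raw features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy differentiable model: logistic regression on a feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def gradient(x, w, b):
    """Analytic gradient of the toy model with respect to its inputs."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, n_steps=20):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    avg_grad = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, b)):
            avg_grad[i] += g / n_steps
    return [(xi - bi) * g for xi, bi, g in zip(x, baseline, avg_grad)]

w, b = [2.0, -1.0, 0.5], 0.1  # hypothetical model parameters
x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, w, b)

# Completeness: attributions sum (approximately) to F(x) - F(baseline)
gap = sum(attributions) - (model(x, w, b) - model(baseline, w, b))
print(attributions, gap)
```

Positive attributions push the prediction towards the attributed class and negative ones away from it; in the visualizations, this corresponds to the colouring of the individual tokens.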

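With just a few lines of code, we can run this on individual examples and obtain a graphical output as shown below.
Of course, the output is also available in numerical form.
We run this on CPU because on the AWS p2.xlarge instance, the GPU ran out of memory.
Note that `visualize` is called without its `true_class` argument, so the column "True Label" in the outputs below repeats the predicted class; the actual outcome is stated in the comment at the top of each cell.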

In [65]:
device = torch.device("cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = model_cls_inj.to(device)
cls_explainer = SequenceClassificationExplainer(model, tokenizer)

In [66]:
# true positive
s = tokenizer.decode(dataset_en["test"][144]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_144.html");

True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
1.0,LABEL_1 (0.97),LABEL_1,2.32,"[CLS] This three - vehicle crash occurred in the morning of a weekend on a multi - lane highway near an entrance ra ##mp . The highway runs east and west and divided by a high - tension cable guard ##rail . The bit ##umi ##nou ##s road ##way is dry , level and curve ##d to the left at the location of this crash . The posted speed limit 89 km ##ph ( 65 mph ) and there were no ad ##verse weather conditions . V ##1 , a 2006 Je ##ep Liberty with two occupa ##nts , was west ##bound in lane three inte ##nding to go straight . V ##2 , a 1992 Mitsubishi Dia ##mante with one occupa ##nt , was west ##bound in lane four inte ##nding to go straight . V ##3 , a 1996 Nissan pick ##up with one occupa ##nt , was west ##bound in lane one ( ac ##cel ##eration ra ##mp ) inte ##nding to merge left . An unknown vehicle traveling behind V ##3 switched lane ##s and cut in front of V ##1 . V ##1 attempted to avoid this unknown vehicle by changing lane ##s and striking V ##2 ( event # 1 ) . Subsequently , V ##1 and V ##2 sp ##un across all travel lane ##s and departed the right side of the road . V ##1 was struck in the right side by V ##3 as it sp ##un across the ac ##cel ##eration lane and came to final rest on the right roads ##ide . After V ##2 entered the right roads ##ide it sp ##un into an em ##bank ##ment and rolle ##d ( est . 6 - quarter turns ) and came to final rest on its roof . V ##3 drove off the right side of the road after striking V ##1 . The driver of V ##1 is a 45 - year - old female that refused to be interviewed . She was not injured in the crash and her Je ##ep was driven from the scene . The Critical Pre ##cra ##sh Event for V ##1 was code ##d this vehicle traveling over the lane line on the left side of the travel lane . The Critical Reason for the Critical Event was code ##d in ##corre ##ct eva ##sive action . 
Other factors code ##d to this driver include chose ina ##pp ##rop ##riate eva ##sive action and poor direction ##al control ( failure to control vehicle with skill ord ##inar ##ily expected ) . The driver of V ##2 is a 40 - year - old female that was not interviewed because of a language barrier ( Korean . ) She was transported to the hospital and her vehicle was to ##wed due to damage . The Critical Pre ##cra ##sh Event was code ##d other vehicle en ##cro ##aching from adjacent lane - over right lane line . The Critical Reason for the Critical Event was not code ##d to this vehicle . The driver [SEP]"


In [67]:
# true positive
s = tokenizer.decode(dataset_en["test"][18]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_18.html");

True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
1.0,LABEL_1 (0.94),LABEL_1,4.12,"[CLS] This crash occurred in the south ##bound lane of a two - lane und ##ivi ##ded road ##way . This was a level asp ##halt road that curve ##d slightly to the left , with a posted speed limit of 64 km ##ph ( 40 mph ) . It was early in the evening on a week ##day , conditions were clear , and the road ##way was dry . There were no traffic flow restrictions . V ##1 was a 2002 Chrysler Se ##bring 2 - door convert ##ible . The vehicle was traveling south ##bound and its driver was beginning to nego ##tia ##te a left curve . V ##1 departed the road ##way to the right and struck a telephone pole located on the roads ##ide . V ##1 rota ##ted clock ##wise after the impact and then trip ##ped over its wheels . V ##1 rolle ##d two quarter - turns and came to final rest on its roof . V ##1 was driven by a 69 - year old female who suffered moderate injuries . The driver has since been put into a nur ##sing home and does not reca ##ll any information from the accident . The accident report and medical records indicated that the driver of V ##1 had a blood alcohol content of 0 . 177 . The Critical Pre - crash Event for V ##1 was this vehicle traveling off the edge of the road on the right side . The Critical Reason for the Critical Pre - crash Event was poor direction ##al control , a driver - related factor . Associated factors code ##d to the driver of V ##1 include alcohol use , the medical condition of diabetes and the use of pre ##scription med ##ication to control the diabetes . Medical reports also indicated that the driver of V ##1 had a history of alcohol ##ism . 
[SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [SEP]"


In [68]:
# false negative: "leaving an injured passenger" overlooked
s = tokenizer.decode(dataset_en["test"][331]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_331.html");

True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
0.0,LABEL_0 (0.75),LABEL_0,6.24,"[CLS] This two vehicle crash occurred late in the evening on a two - lane up ##hill bit ##umi ##nou ##s road ##way , with no traffic controls and a speed limit of 56 km ##ph ( 30 mph ) . Vehicle one ( V ##1 ) was a 2007 Ford e ##cono ##line van driven by a thirty four ( 34 ) year - old male who takes no med ##ication or has any vision restrictions . V ##1 was traveling south in lane one going straight . Vehicle two ( V ##2 ) was a 1994 Honda Civic sedan driven by an unknown aged driver with one passenger . V ##2 was traveling south in lane one . According to a witness V ##2 was traveling at a high rate of speed and attempting to pass V ##1 on the right when the front of V ##2 struck the rear of V ##1 . The driver of V ##2 fled the scene on foot , leaving an injured passenger . Both vehicle ' s came to final rest facing south . V ##2 was to ##wed from the scene . The passenger of V ##2 did not know the driver and refused to speak about the crash due to his illegal status in this country . The critical pre - crash event for V ##1 was code ##d : other motor vehicle in lane , traveling in same direction with higher speed . The critical reason for the critical event was not code ##d to this vehicle . The driver of V ##1 was traveling from one job site to another when V ##1 was rear - ended by V ##2 . He was going straight traveling at the posted speed limit in this residential area and observed V ##2 approach ##ing from the rear in his side mirror . The critical pre - crash event for V ##2 was code ##d : other motor vehicle in lane , traveling in same direction with lower st ##eady speed . The critical reason for the critical event was code ##d to the driver of V ##2 as a driver related factor : poor direction ##al control ( e . g . , failing to control vehicle with skill ord ##inar ##ily expected ) . An associated factor for V ##2 was excessive speed and mis ##jud ##gment of gap . 
V ##2 ' s left front tire was the wrong size and all tire ##s had low tre ##ad depth . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [SEP]"


In [69]:
# false positive:
s = tokenizer.decode(dataset_en["test"][78]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_78.html");

True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
1.0,LABEL_1 (0.97),LABEL_1,1.93,"[CLS] The crash occurred on a north / south four - lane highway with shoulder ##s . It curve ##d to the east ( right ) as it traveled north ##ward with a radius of curva ##ture of 274 meters and a positive 4 % grade . Initially there was a grass median div ##iding the north and south lane ##s but as the highway traveled north the median ended with only a double yellow line separat ##ing the directions of travel . A two - lane side street inter ##sect ##ed on the west side of the highway and traveled southeast . Con ##ditions were dark and dry on a week ##day evening . Vehicle # 1 was a 1987 Mercury Marquis traveling north ##bound on the highway . The driver , apparently confused , attempted to turn left on the side street 29 meters prior to the intersection . The vehicle went down a steep 62 % em ##bank ##ment , striking the ground at the bottom of the em ##bank ##ment with its front . It came to rest facing south with its rear wheels just on the edge of the pave ##d south shoulder and was to ##wed due to damage . Vehicle # 1 was driven by a 54 - year old female that was un ##belt ##ed and not transported to a medical facility . Two adult passengers and an 8 - month child in a safety seat were also not injured . The driver stated she went out the wrong exit from a gas station on the east side of the highway a few hundred meters south of the crash . She intended to turn left on the side street to circle back around and enter a shopping center that was located across the highway from the gas station . App ##aren ##tly she thought that the street sign identify ##ing the side streets name was on the north side of the intersection as opposed to south and initiated the left turn 29 meters before the inter ##sect ##ing pave ##ment began . She said that once she started to turn and realized the error she attempted to brak ##e but the front wheels had left the pave ##ment and the em ##bank ##ment was so steep she could not recover . 
In ##vesti ##gating tro ##oper ##s agree with researcher that poor vision could have contributed to the scenario and required her to follow up with a vision rete ##sting at a state driver ' s license center . The Critical Pre ##cra ##sh Event for Vehicle # 1 was this vehicle traveling off the edge of the road on the left side . The Critical Reason for the Critical Event was code ##d other recognition error , attempted left turn too early . Associated factors included con ##versi ##ng with passenger and poor direction ##al control ( failure to control vehicle with skill ord ##inar ##ily expected ) . A vehicle view ob ##stru ##ction - related to other was included due [SEP]"


In [70]:
# false positive:
s = tokenizer.decode(dataset_en["test"][915]["input_ids"][1:511])
word_attributions = cls_explainer(s, n_steps=20)
cls_explainer.visualize("./results/viz_915.html");

True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
1.0,LABEL_1 (0.95),LABEL_1,4.93,"[CLS] This crash occurred on a straight level bit ##umi ##nou ##s two lane road ##way that was divided by a painted median . The posted speed limit of 72 km ##ph ( 45 mph ) which reduce ##s to 56 km ##ph ( 35 mph ) 100 meters after the crash site . There is a sign indicating the road ##way narrow ##s . The weather was cloud ##y and the road ##way was partially wet . Traffic flow was normal for that time of day . This crash occurred on a week ##day afternoon . Vehicle 1 , a 2002 Nissan Alt ##ima , was traveling behind Vehicle 2 , a 1991 Chevrolet Lu ##mina , when it drove into the safety zone into the on ##coming traffic lane in order to illegal ##ly pass Vehicle 2 . V ##1 returned to its original lane and impact ##ed with V ##2 ' s front left , with its right rear quarter panel . This sp ##un V ##1 in a clock ##wise position 180 degrees , with V ##1 coming to final rest after impact ##ing an em ##bank ##ment on the right side of the road ##way , with its rear left . Vehicle 1 was to ##wed due to damage . V ##1 came to final rest off the road ##way facing in a northeast ##erly direction . V ##2 came to final rest on the road ##way facing in a south ##erly direction . V ##1 was to ##wed due to damage . V ##2 was to ##wed due to its driver going to the hospital with her baby . Vehicle # 1 , the Nissan Alt ##ima , was driven by a belt ##ed 38 - year - old male who refused to be interviewed . He stated he did not want to be both ##ered "" with this sh - t "" . The Critical Pre ##cra ##sh Event code ##d to Vehicle 1 was : Other - this vehicle traveling entering the road ##way from the left side of the road ##way . The Critical Reason for the Critical Pre ##cra ##sh Event was code ##d as : driver related factor , aggressive driving behavior . Vehicle # 2 , the Chevrolet , was driven by a belt ##ed 21 year - old female who was not injured . There was a belt ##ed 18 year - old male in the front right seat who was not injured . 
There was a 6 - month - old female child in a car seat in the second row . The child was taken to the hospital for a check out , accompanied by both other people in the vehicle . This driver stated to her relative that she had seen the driver of V ##1 making "" wild ge ##stu ##res "" and tail ##gating her . She stated she saw V ##1 coming around her on the left but could only brak ##e before impact . The Critical Pre ##cra ##sh Event code [SEP]"


<a id='qna'></a>
<a name='qna'></a>
## 6.&nbsp;Using Extractive Question Answering to Process Longer Texts

In this section we use extractive question answering to extract parts of the accident description which indicate the presence of bodily injury. The aim is to reduce the length of the input texts by extracting only the relevant parts.

The easiest implementation of extractive question answering is provided by the `pipeline` abstraction.

We use [`deutsche-telekom/bert-multi-english-german-squad2`](https://huggingface.co/deutsche-telekom/bert-multi-english-german-squad2),
a multilingual English-German question answering model built on `bert-base-multilingual-cased`. Specifying `device=0` runs the pipeline on the first GPU.

In [71]:
model_name_qa ="deutsche-telekom/bert-multi-english-german-squad2"
pl = pipeline("question-answering", model=model_name_qa, tokenizer=model_name_qa, device=0)
questions = ["Was someone injured?", "Was someone transported?"]


We visit each accident report in turn (the context), and ask the model the two questions “Was someone injured?”
and “Was someone transported?”.
Since the accident reports might provide information on multiple persons,
we allow a maximum of four candidate answers for each of the questions,
which we concatenate into a single (much shorter) new text.

To achieve this, we write a short function which applies a question answering pipeline to a single dataset example `x`
and stores the concatenated answers in a new field `INJ`.
The argument `questions` is a list of questions.

In [72]:
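def get_answers(x, qa_pipeline, questions):
    """Store the concatenated candidate answers to all questions in x["INJ"]."""
    x["INJ"] = ""
    for question in questions:
        # Request up to four candidate answers; an empty answer is allowed.
        res = qa_pipeline(context=x["SUMMARY_EN"], question=question, top_k=4, handle_impossible_answer=True)
        # Normalize a single result (dict) to a list of results.
        if isinstance(res, dict):
            res = [res]
        if len(res[0]) > 0:
            # Append the answers to the extract collected so far, separated by ". ".
            x["INJ"] = '. '.join([x["INJ"]] + [item["answer"] for item in res])
    return x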
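
To see what the concatenated extract looks like without downloading a model, the following minimal sketch runs the same function against a stand-in pipeline. The function `fake_pipeline`, its canned answers and the example sentence are hypothetical and serve illustration only; the sketch assumes, as the `isinstance` check above suggests, that a single answer may come back as a dict rather than a list.

```python
def get_answers(x, qa_pipeline, questions):
    # Same logic as the notebook function, repeated to keep the sketch self-contained.
    x["INJ"] = ""
    for question in questions:
        res = qa_pipeline(context=x["SUMMARY_EN"], question=question, top_k=4, handle_impossible_answer=True)
        if isinstance(res, dict):
            res = [res]
        if len(res[0]) > 0:
            x["INJ"] = ". ".join([x["INJ"]] + [item["answer"] for item in res])
    return x


def fake_pipeline(context, question, top_k, handle_impossible_answer):
    # Hypothetical stand-in for the real pipeline: returns canned answers
    # in the same dict layout (score, start, end, answer).
    canned = {
        "Was someone injured?": ["The driver was injured", "minor injuries"],
        "Was someone transported?": ["transported to the hospital"],
    }
    results = []
    for answer in canned[question]:
        start = context.find(answer)
        results.append({"score": 1.0, "start": start, "end": start + len(answer), "answer": answer})
    # Assumption: a single answer is returned as a dict instead of a list.
    return results[0] if len(results) == 1 else results


row = {"SUMMARY_EN": "The driver was injured and transported to the hospital with minor injuries."}
out = get_answers(dict(row), fake_pipeline, ["Was someone injured?", "Was someone transported?"])
print(repr(out["INJ"]))
```

Because the join starts from the initially empty `INJ` string, the extract begins with a leading `". "` separator.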

We apply the question answering function to the entire test set.

On an AWS EC2 p2.xlarge instance, the run time is about 6 minutes. To try the concept on only the first 250 samples, use `ds_test = dataset["test"].select(range(250)).map(...)`.

In [73]:
ds_test = dataset["test"].map(get_answers, batched=False, fn_kwargs={"qa_pipeline": pl, "questions": questions})

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset



Next, we tokenize the extracted texts, define the labels, and store the dataset for later use:

In [74]:
ds_test = ds_test.map(tokenize, batched=True, fn_kwargs={"column": "INJ"})
ds_test = ds_test.rename_column("INJSEVB", "labels")
ds_test.save_to_disk("./datasets/ds_test")


We load the transformer model that was trained on the classification task...

In [75]:
#ds_test = load_from_disk("./datasets/ds_test")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained("models/" + model_name + "_inj").to(device)
trainer = Trainer(model)

...apply it to the tokenized text extracts and evaluate the predictions.

In [76]:
predictions = trainer.predict(ds_test)
_ = evaluate_classifier(predictions.label_ids, None, softmax(predictions.predictions, axis=1), ["0", "1"], "Extractive QA", "cm_inj_qa")

Extractive QA
accuracy score = 82.9%,  log loss = 0.456,  Brier loss = 0.265
classification report
               precision    recall  f1-score   support

           0       0.81      0.94      0.87       816
           1       0.88      0.68      0.77       574

    accuracy                           0.83      1390
   macro avg       0.84      0.81      0.82      1390
weighted avg       0.84      0.83      0.82      1390
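
The three scores reported above can be computed directly from the class probabilities. The sketch below is not the notebook's `evaluate_classifier`, which is defined earlier and may differ in detail; in particular, it assumes that the Brier loss is the squared error summed over both classes. The labels and probabilities are made-up toy values.

```python
import math


def classification_metrics(labels, probs):
    """Return (accuracy, log loss, Brier loss) for integer labels and per-class probabilities."""
    n = len(labels)
    correct, log_loss, brier = 0, 0.0, 0.0
    for y, p in zip(labels, probs):
        predicted = max(range(len(p)), key=lambda k: p[k])  # class with the highest probability
        correct += int(predicted == y)
        log_loss -= math.log(p[y])  # negative log-probability of the true class
        # Assumption: squared error summed over all classes (multi-class Brier score).
        brier += sum((p[k] - (1.0 if k == y else 0.0)) ** 2 for k in range(len(p)))
    return correct / n, log_loss / n, brier / n


# Toy data: four examples with probabilities as they would come out of a softmax.
labels = [0, 1, 1, 0]
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
acc, ll, brier = classification_metrics(labels, probs)
print(f"accuracy = {acc:.1%},  log loss = {ll:.3f},  Brier loss = {brier:.3f}")
```

The third toy example is a confident false negative: it lowers the accuracy and contributes most to both the log loss and the Brier loss.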



The performance is comparable to that of the logistic regression classifier on mean-pooled encodings of the original texts.
On the other hand, there is a larger number of false negatives than obtained by task-specific training
and evaluation on the full-length sequences.
This indicates that in some cases the extractive question answering has missed or suppressed relevant parts.
For instance, if the original text reads “The driver was injured.”,
the extract “The driver” is a correct answer to the question “Was someone injured?”;
however, the extract is too short for the classifier to detect the presence of an injury.
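
One conceivable mitigation, which is not evaluated in this notebook, is to widen each extracted span to the sentence that contains it, using the character offsets `start` and `end` that the question answering pipeline returns alongside each answer. The helper `expand_to_sentence` below is a hypothetical sketch with a deliberately simple sentence-boundary rule, which would need refinement for abbreviations and similar cases.

```python
import re


def expand_to_sentence(context, start, end):
    """Widen the character span [start, end) to the enclosing sentence(s)."""
    if start == end:
        # Empty span, e.g. when no answer was found: nothing to expand.
        return ""
    # Simplified rule: a sentence ends after '.', '!' or '?' followed by whitespace.
    boundaries = [0] + [m.end() for m in re.finditer(r"[.!?]\s+", context)] + [len(context)]
    left = max(b for b in boundaries if b <= start)
    right = min(b for b in boundaries if b >= end)
    return context[left:right].strip()


context = "Vehicle 1 struck Vehicle 2. The driver was injured. Both vehicles were towed."
answer = "The driver"  # the short extract from the example above
start = context.find(answer)
end = start + len(answer)
expanded = expand_to_sentence(context, start, end)
print(expanded)
```

The widened extract retains the word that signals the injury, at the cost of longer input texts.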

<a id='conclusion'></a>
<a name='conclusion'></a>
## 7.&nbsp;Conclusion

Congratulations!

In this notebook, you have learned how to apply transformer-based models to classification tasks that often arise in actuarial applications.

You have seen how to address challenges that often arise in practical applications:

a.	The text corpus may be highly domain-specific, i.e., it may use specialized terminology.
    – In [Section 4.1](#domain_finetuning) we have applied domain-specific fine-tuning to improve model performance
    in a specific domain.

b.	Multiple languages might be present in parallel.
    – In [Section 3.5](#multi_lingual_training) we have used a multi-lingual transformer model to encode multi-lingual texts
    and used this output for a classification task. Performance was good even when one language was underrepresented.

c.	Text sequences might be short and ambiguous.
    Or they might be so long that it is hard to identify the parts relevant to the task.
    – In this tutorial we have demonstrated two approaches to deal with long texts:
    
   * In [Section 5.2](#investigate) we have split long input texts into slightly overlapping chunks and applied
   the classifier to each chunk separately.
    
   * In [Section 6](#qna) we have used extractive question answering to extract parts of the original texts which are relevant
   to the task.

d.	The amount of training data may be relatively small.
    In particular, gathering large amounts of labelled data (i.e., text sequences augmented with a target label)
    might be expensive.
    – Throughout this notebook, we have used transformer models which have been trained on a large corpus of text data.
    We have applied these models to the specific task with little or no task-specific training,
    thus transferring the language understanding skills to the task at hand.

e.	It is important to understand why a model arrives at a particular prediction.
    – In [Section 5.3](#interpret) we have shown how to visualize which parts of the input text
    cause the classifier to arrive at a particular prediction.

The Part II notebook deals with another dataset, which contains only short text descriptions.
It demonstrates possible approaches when few or no labels are available.