
# Models

In practice, the code provided lets you analyse any model built using the BERT architecture configuration. Feel free to look around on HuggingFace for models other than the ones itemised below. The models below have been tested to confirm they work with the code. If you try to load a model that is incorrectly configured for the analyser, it will throw an error when you run any analysis.
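One way to check a model before running an analysis is to look at the `model_type` field in its `config.json` on HuggingFace. The sketch below is only an illustration and not the analyser's own check: the `SUPPORTED_MODEL_TYPES` set and the `is_supported` helpers are assumptions, inferred from the fact that both BERT and DistilBERT checkpoints appear in the lists in this document.

```python
import json

# Assumption: these are the architectures the analyser accepts, inferred from
# the BERT and DistilBERT checkpoints listed in this document.
SUPPORTED_MODEL_TYPES = {"bert", "distilbert"}


def is_supported(config: dict) -> bool:
    """Return True if a HuggingFace config dict names a supported architecture."""
    return config.get("model_type") in SUPPORTED_MODEL_TYPES


def is_supported_file(config_path: str) -> bool:
    """Same check, reading a downloaded config.json from disk."""
    with open(config_path, encoding="utf-8") as handle:
        return is_supported(json.load(handle))


# Hand-written config fragments, for illustration only.
print(is_supported({"model_type": "bert"}))     # True
print(is_supported({"model_type": "roberta"}))  # False
```

If the `transformers` library is installed, `AutoConfig.from_pretrained("google-bert/bert-base-cased").to_dict()` returns a dict of this shape.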

## Mono-Lingual Models

These are models intended to be used for a single language. However, because they are trained on data scraped from the internet, they have likely seen a substantial amount of data from other languages too.

- BERT Base (English): https://huggingface.co/google-bert/bert-base-cased
- BERT Spanish: https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased
- BERT Finnish: https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1
- BERT German: https://huggingface.co/dbmdz/bert-base-german-cased


## Multi-Lingual Models

- DistilBERT Multilingual: https://huggingface.co/distilbert/distilbert-base-multilingual-cased


## Bi/Tri-Lingual Models

Prior work has extracted subsets of multi-lingual models that handle only a couple of languages, rather than the 104 covered by the Wikipedia training data. Take a look at the paper that introduces this technique [here](https://aclanthology.org/2020.sustainlp-1.16.pdf). The resulting models may not be perfect analogues for bilinguals, as they are distillations of a larger model. Here we list the distilled versions, as they're smaller than full-sized BERT and so easier to run locally.

- DistilBERT English/German: https://huggingface.co/Geotrend/distilbert-base-en-de-cased
- DistilBERT English/French: https://huggingface.co/Geotrend/distilbert-base-en-fr-cased
- DistilBERT English/French/German: https://huggingface.co/Geotrend/distilbert-base-en-fr-de-cased
- DistilBERT English/Spanish: https://huggingface.co/Geotrend/distilbert-base-en-es-cased
- DistilBERT English/French/Spanish: https://huggingface.co/Geotrend/distilbert-base-en-fr-es-cased


## Historical Language Models

These models are from the Bavarian State Library and were trained on European newspaper data, largely from 1850-1950 but ranging from 1600-1999. More details about the dataset can be found on the model pages or [here](http://www.europeana-newspapers.eu/). The library has released both monolingual models and multi-lingual models in several sizes.

### Mono-Lingual


- Historical BERT French: https://huggingface.co/dbmdz/bert-base-french-europeana-cased
- Historical BERT German: https://huggingface.co/dbmdz/bert-base-german-europeana-cased
- Historical BERT Swedish: https://huggingface.co/dbmdz/bert-base-swedish-europeana-cased
- Historical BERT Finnish: https://huggingface.co/dbmdz/bert-base-finnish-europeana-cased

### Multi-Lingual

Below are five different sizes of a multi-lingual historical BERT model, from smallest to largest.

- Tiny: https://huggingface.co/dbmdz/bert-tiny-historic-multilingual-cased
- Mini: https://huggingface.co/dbmdz/bert-mini-historic-multilingual-cased
- Small: https://huggingface.co/dbmdz/bert-small-historic-multilingual-cased
- Medium: https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased
- Base: https://huggingface.co/dbmdz/bert-base-historic-multilingual-cased
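Because the five repo names differ only in the size keyword, it is easy to loop over them when comparing model sizes. The helper below is a small sketch: the name `historic_multilingual_repo` is made up for illustration, and the pattern is simply read off the URLs above.

```python
# Size keywords as they appear in the repo names listed above.
HISTORIC_SIZES = ("tiny", "mini", "small", "medium", "base")


def historic_multilingual_repo(size: str) -> str:
    """Build the HuggingFace repo ID for one size of the historic multilingual model."""
    size = size.lower()
    if size not in HISTORIC_SIZES:
        raise ValueError(f"Unknown size {size!r}; expected one of {HISTORIC_SIZES}")
    return f"dbmdz/bert-{size}-historic-multilingual-cased"


for size in HISTORIC_SIZES:
    print(historic_multilingual_repo(size))
```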


# Datasets

We've prepared a number of datasets for you to use in analysis. Some other datasets on HuggingFace will work out of the box with the existing code; if one doesn't, a ```NotImplemented``` error will be thrown. To use a dataset that isn't supported, you just need to write a new ```get_example``` function in ```codebase/h/analysis.py``` (line 118) that formats the downloaded dataset so that it can be fed into the model. Alternatively, call me (Henry) over, and I should be able to add support for any datasets you're interested in.
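To give a feel for what such a formatter might look like, here is a rough sketch. Everything in it is hypothetical: the real ```get_example``` signature in the codebase may differ, and the dataset names (`my_org/my_reviews`, `my_org/my_pairs`), column names, and returned keys are invented for illustration. Check the existing function before copying this shape.

```python
from typing import Any, Dict


def get_example(dataset_name: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical formatter: map one raw dataset row to model-ready text plus a label.

    The signature, dataset names, and column names are assumptions for
    illustration; the real function in the codebase may look different.
    """
    if dataset_name == "my_org/my_reviews":  # hypothetical single-sentence dataset
        return {"text": row["review_body"], "label": row["stars"]}
    if dataset_name == "my_org/my_pairs":  # hypothetical sentence-pair dataset
        return {"text": (row["premise"], row["hypothesis"]), "label": row["label"]}
    # Anything else is unsupported until a branch is added for it.
    raise NotImplementedError(f"No formatter for dataset {dataset_name!r}")


print(get_example("my_org/my_reviews", {"review_body": "Great film", "stars": 5}))
```

Adding support for a new dataset then amounts to adding one more branch that picks out the right columns.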

## EuroParl

The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It contains high-quality professional translations in 21 European languages. However, it is domain-specific to parliamentary proceedings. This dataset has been a major part of training machine translation systems since 2006.

Languages included in the full dataset: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

### Monolingual Data

The same 100k sentences are sampled in each of 7 languages:

- English: https://huggingface.co/datasets/hcoxec/english_100k
- French: https://huggingface.co/datasets/hcoxec/french_100k
- German: https://huggingface.co/datasets/hcoxec/german_100k
- Finnish: https://huggingface.co/datasets/hcoxec/finnish_100k
- Romanian: https://huggingface.co/datasets/hcoxec/romanian_100k
- Danish: https://huggingface.co/datasets/hcoxec/danish_100k
- Spanish: https://huggingface.co/datasets/hcoxec/spanish_100k
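The monolingual datasets above all follow the same naming pattern, so they can be loaded in a loop. This is a minimal sketch that assumes the HuggingFace `datasets` library is installed; the helper names `monolingual_repo` and `load_monolingual` are invented here and are not part of the analysis code.

```python
# Languages for which a 100k-sentence dataset is listed above.
MONOLINGUAL_LANGUAGES = (
    "english", "french", "german", "finnish", "romanian", "danish", "spanish",
)


def monolingual_repo(language: str) -> str:
    """Build the HuggingFace dataset ID for one of the 100k monolingual samples."""
    language = language.lower()
    if language not in MONOLINGUAL_LANGUAGES:
        raise ValueError(f"No 100k dataset listed for {language!r}")
    return f"hcoxec/{language}_100k"


def load_monolingual(language: str):
    """Download and return the dataset (needs `datasets` and network access)."""
    from datasets import load_dataset  # imported lazily so the helper above works offline

    return load_dataset(monolingual_repo(language))


print(monolingual_repo("Finnish"))  # hcoxec/finnish_100k
```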

### Multi-lingual Data

The same 50k sentences are sampled from a pair of languages and labelled with their language ID. This lets you see the degree to which a model encodes the same sentences differently in different languages.

- Spanish/Danish: https://huggingface.co/datasets/hcoxec/danish_spanish_mix
- Spanish/Romanian: https://huggingface.co/datasets/hcoxec/romanian_spanish_mix
- Spanish/Finnish: https://huggingface.co/datasets/hcoxec/finnish_spanish_mix
- Finnish/Danish: https://huggingface.co/datasets/hcoxec/finnish_danish_mix
- Finnish/Romanian: https://huggingface.co/datasets/hcoxec/finnish_romanian_mix
- Spanish/French: https://huggingface.co/datasets/hcoxec/french_spanish_mix
- Spanish/German: https://huggingface.co/datasets/hcoxec/german_spanish_mix
- German/Danish: https://huggingface.co/datasets/hcoxec/german_danish_mix
- German/Romanian: https://huggingface.co/datasets/hcoxec/german_romanian_mix
- German/Finnish: https://huggingface.co/datasets/hcoxec/german_finnish_mix
- French/Danish: https://huggingface.co/datasets/hcoxec/french_danish_mix
- French/Romanian: https://huggingface.co/datasets/hcoxec/french_romanian_mix
- French/Finnish: https://huggingface.co/datasets/hcoxec/french_finnish_mix
- French/German: https://huggingface.co/datasets/hcoxec/french_german_mix
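The order of the two languages in the repo names above is not consistent (for example `danish_spanish_mix` but `french_danish_mix`), so building a name by hand is error-prone. The lookup below is a small sketch that works for either order; `mix_repo` is a made-up helper name, and the table is copied from the list above.

```python
# Repo name stems exactly as they appear in the list above.
MIX_NAMES = (
    "danish_spanish", "romanian_spanish", "finnish_spanish", "finnish_danish",
    "finnish_romanian", "french_spanish", "german_spanish", "german_danish",
    "german_romanian", "german_finnish", "french_danish", "french_romanian",
    "french_finnish", "french_german",
)

# Map an unordered language pair to its dataset ID.
MIX_BY_PAIR = {frozenset(name.split("_")): f"hcoxec/{name}_mix" for name in MIX_NAMES}


def mix_repo(language_a: str, language_b: str) -> str:
    """Return the dataset ID for a language pair, whichever order it is given in."""
    pair = frozenset((language_a.lower(), language_b.lower()))
    if pair not in MIX_BY_PAIR:
        raise ValueError(f"No mixed dataset listed for {language_a!r} and {language_b!r}")
    return MIX_BY_PAIR[pair]


print(mix_repo("Spanish", "Danish"))  # hcoxec/danish_spanish_mix
print(mix_repo("German", "French"))   # hcoxec/french_german_mix
```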


## GLUE Benchmark

GLUE, the General Language Understanding Evaluation benchmark (https://gluebenchmark.com/), is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

It contains 12 subtasks, all of which are supported by the analysis code. View the full dataset here: https://huggingface.co/datasets/nyu-mll/glue

They can be loaded using the repo name ```nyu-mll/glue``` followed by the subtask name:

- ax A manually-curated evaluation dataset for fine-grained analysis of system performance on a broad range of linguistic phenomena. This dataset evaluates sentence understanding through Natural Language Inference (NLI) problems. Use a model trained on MultiNLI to produce predictions for this dataset.

- cola The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.

- mnli The Multi-Genre Natural Language Inference Corpus is a crowdsourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral). The premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports. The authors of the benchmark use the standard test set, for which they obtained private labels from the RTE authors, and evaluate on both the matched (in-domain) and mismatched (cross-domain) sections. They also use and recommend the SNLI corpus as 550k examples of auxiliary training data.

- mnli_matched The matched validation and test splits from MNLI. See the "mnli" BuilderConfig for additional information.

- mnli_mismatched The mismatched validation and test splits from MNLI. See the "mnli" BuilderConfig for additional information.

- mrpc The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent.

- qnli The Stanford Question Answering Dataset is a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator). The authors of the benchmark convert the task into sentence pair classification by forming a pair between each question and each sentence in the corresponding context, and filtering out pairs with low lexical overlap between the question and the context sentence. The task is to determine whether the context sentence contains the answer to the question. This modified version of the original task removes the requirement that the model select the exact answer, but also removes the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue.

- qqp The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.

- rte The Recognizing Textual Entailment (RTE) datasets come from a series of annual textual entailment challenges. The authors of the benchmark combined the data from RTE1 (Dagan et al., 2006), RTE2 (Bar Haim et al., 2006), RTE3 (Giampiccolo et al., 2007), and RTE5 (Bentivogli et al., 2009). Examples are constructed based on news and Wikipedia text. The authors of the benchmark convert all datasets to a two-class split, where for three-class datasets they collapse neutral and contradiction into not entailment, for consistency.

- sst2 The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. It uses the two-way (positive/negative) class split, with only sentence-level labels.

- stsb The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 0 to 5.

- wnli The Winograd Schema Challenge (Levesque et al., 2011) is a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. The examples are manually constructed to foil simple statistical methods: each one is contingent on contextual information provided by a single word or phrase in the sentence. To convert the problem into sentence pair classification, the authors of the benchmark construct sentence pairs by replacing the ambiguous pronoun with each possible referent. The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. They use a small evaluation set consisting of new examples derived from fiction books that was shared privately by the authors of the original corpus. While the included training set is balanced between two classes, the test set is imbalanced between them (65% not entailment). Also, due to a data quirk, the development set is adversarial: hypotheses are sometimes shared between training and development examples, so if a model memorizes the training examples, it will predict the wrong label on the corresponding development set example. As with QNLI, each example is evaluated separately, so there is not a systematic correspondence between a model's score on this task and its score on the unconverted original task. The authors of the benchmark call the converted dataset WNLI (Winograd NLI).
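With the HuggingFace `datasets` library, the subtask names listed above are passed as the second argument to `load_dataset`. The sketch below assumes `datasets` is installed; the `load_glue` wrapper and its name check are illustrative only and are not part of the analysis code.

```python
# The 12 subtask names described above.
GLUE_SUBTASKS = (
    "ax", "cola", "mnli", "mnli_matched", "mnli_mismatched", "mrpc",
    "qnli", "qqp", "rte", "sst2", "stsb", "wnli",
)


def load_glue(subtask: str):
    """Load one GLUE subtask by name (needs `datasets` and network access)."""
    if subtask not in GLUE_SUBTASKS:
        raise ValueError(f"Unknown GLUE subtask {subtask!r}; expected one of {GLUE_SUBTASKS}")
    from datasets import load_dataset  # imported lazily so the name check works offline

    # Repo name first, then the subtask (configuration) name.
    return load_dataset("nyu-mll/glue", subtask)
```

For example, `load_glue("cola")` would download the Corpus of Linguistic Acceptability, while a misspelt name fails early with a `ValueError` before anything is downloaded.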

