# Bsc25 - Active training

## Summary

The purpose of this work is to experiment with active learning.

To do this, we:
1. Develop a model (CRF) for a given dataset (spoken French).
2. Test that model with passive (conventional) training.
3. Test that model with active training.

The dataset is the OFROM+ corpus containing ~2 million tokens (*words*) of spoken French. Tokens are grouped in IPUs (*inter-pausal units*) separated by pauses of at least 0.5 s. IPUs are grouped in *files* corresponding to the TextGrid files containing the transcriptions.

The model is a simple CRF (*Conditional Random Field*) model that uses IPUs as *sequences*, with the *token* itself as the sole feature.
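To make the "token as sole feature" idea concrete, here is a minimal sketch of how IPUs might be turned into CRF input. It assumes the list-of-feature-dicts layout accepted by Python CRF libraries such as sklearn-crfsuite. The function name `ipu_to_features`, the feature key `"token"` and the tags are hypothetical and not taken from `ofrom_train`.

```python
def ipu_to_features(ipu):
    """Turn one IPU (a list of token strings) into CRF features.

    Each token is described by a single feature: the token itself.
    """
    return [{"token": tok} for tok in ipu]

# Two hypothetical IPUs with made-up tags, for illustration only.
ipus = [["je", "sais", "pas"], ["ouais", "voilà"]]
tags = [["PRO", "VER", "ADV"], ["ITJ", "ITJ"]]

X = [ipu_to_features(ipu) for ipu in ipus]  # one feature sequence per IPU
y = tags                                    # one tag sequence per IPU
```

With no context features (neighbouring tokens, suffixes, etc.), the model can only rely on the token identity and on tag-to-tag transitions within the IPU.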

Testing means taking an initial subset (usually 1-10k tokens), training the model, then adding to that subset (another 1-10k tokens) and re-training, iterating until a set limit is reached or the original dataset is exhausted. At each step we retrieve an *accuracy score*, as well as a set of tokens of interest. Both the initial subset and the additional data are selected by *file*, adding files until the token count is reached: this is because researchers would use files as their minimal unit for correction and sharing.

Passive training selects the additional data at random. 
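The file-by-file selection used in passive training could be sketched as follows. This is an illustration under assumptions, not the code in `ofrom_train`: the function name `pick_files_at_random`, its parameters and the file sizes are all hypothetical.

```python
import random

def pick_files_at_random(file_sizes, already_used, lim, rng):
    """Add whole files, in random order, until `lim` tokens are reached.

    file_sizes   : dict mapping file name -> number of tokens
    already_used : set of file names already in the training subset
    lim          : number of tokens to add at this step
    """
    candidates = [f for f in file_sizes if f not in already_used]
    rng.shuffle(candidates)
    picked, total = [], 0
    for name in candidates:
        if total >= lim:
            break
        picked.append(name)
        total += file_sizes[name]
    return picked, total

# Hypothetical file sizes (in tokens).
sizes = {"f1": 1200, "f2": 800, "f3": 1500, "f4": 950, "f5": 2100}
rng = random.Random(0)
step1, n1 = pick_files_at_random(sizes, set(), lim=2000, rng=rng)
step2, n2 = pick_files_at_random(sizes, set(step1), lim=2000, rng=rng)
```

Because whole files are added, a step usually overshoots the requested token count slightly. The sketch stops as soon as the count is reached.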

Active training follows an automated strategy based on a file's value and cost as well as the set of tokens from the previous step. The full formula is discussed in point (3).

Results are provided as charts showing the evolution of the accuracy score. Running several iterations of testing allows for a confidence interval: the y-axis label gives the number of iterations in parentheses. While the set of tokens normally varies at each step, another mode of active training keeps the same set throughout: this lets us produce a graph showing the evolution of those tokens' confidence scores over the process.

In [10]:
import ofrom_train as tr

## 1. Data and Model

The OFROM+ corpus is a set of TextGrid files containing transcriptions of spoken French. Those transcriptions were annotated into PoS (*Part-of-Speech*, grammatical labels such as verb, noun, adjective, etc.) with the automatic annotation tool DisMo.

As a result, the OFROM+ team also has a dictionary listing, for each token, all of its possible tags. This dictionary was originally used for correction purposes and ignores false negatives (missing tags for a token). It still lets us find non-problematic tokens, that is, tokens with a single possible tag, as well as tokens with grammatical tags (as opposed to purely lexical tokens: nouns, verbs, numerals...). Both will be used in active training.
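As a rough illustration of how such a dictionary separates tokens, the sketch below builds a token-to-tags mapping and derives the two groups mentioned above. The tag names, the `LEXICAL_TAGS` set and the sample pairs are assumptions made for the example. They do not reproduce the DisMo tagset or the actual OFROM+ dictionary.

```python
from collections import defaultdict

# Hypothetical (token, tag) pairs and tag names, for illustration only.
annotated = [
    ("le", "DET"), ("le", "PRO"), ("chat", "NOM"),
    ("que", "CON"), ("que", "PRO"), ("et", "CON"), ("mange", "VER"),
]
LEXICAL_TAGS = {"NOM", "VER", "NUM"}  # assumed set of purely lexical tags

# Dictionary: token -> set of all tags observed for that token.
pos = defaultdict(set)
for token, tag in annotated:
    pos[token].add(tag)

# Non-problematic tokens have exactly one possible tag.
unambiguous = {tok for tok, tags in pos.items() if len(tags) == 1}
# Grammatical tokens have at least one tag outside the lexical set.
grammatical = {tok for tok, tags in pos.items() if tags - LEXICAL_TAGS}
```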

We initially transformed the TextGrid files into a DataFrame with a *token* per row and (meta)data in columns: file, speaker, timestamps, PoS, lemma, confidence score, etc. This is the 'ofrom_alt.joblib' file. We further parse that list of *tokens* to group them into *sequences* and those sequences by *file*; and for each file we collect the number of occurrences for each token. This results in a list of files (a file being here a list of sequences) as well as a DataFrame with file information. (We also recreate the OFROM+ dictionary with tags per token, but the resulting 'pos' dictionary isn't used during testing.)
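The grouping step can be pictured with the sketch below. It uses plain Python tuples instead of a DataFrame to stay dependency-free. The row layout (file, IPU index, token, PoS tag) is an assumption about what the token table contains, not the actual schema of 'ofrom_alt.joblib'.

```python
from collections import Counter, defaultdict

# Hypothetical rows of the token table: (file, ipu_index, token, pos_tag).
rows = [
    ("a.TextGrid", 0, "bonjour", "ITJ"),
    ("a.TextGrid", 0, "madame", "NOM"),
    ("a.TextGrid", 1, "ouais", "ITJ"),
    ("b.TextGrid", 0, "ouais", "ITJ"),
    ("b.TextGrid", 0, "ouais", "ITJ"),
]

files = defaultdict(lambda: defaultdict(list))  # file -> ipu -> [(token, tag)]
counts = defaultdict(Counter)                   # file -> token -> occurrences
for file, ipu, token, tag in rows:
    files[file][ipu].append((token, tag))
    counts[file][token] += 1

# Flatten to the "list of files, each a list of sequences" layout.
file_names = sorted(files)
sequences_by_file = [
    [files[name][ipu] for ipu in sorted(files[name])] for name in file_names
]
```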

At this step we already calculate the file's weight and cost, which we will discuss in point (3). 

The result is stored in 'ofrom_gen.joblib'.

In [16]:
# tr.regen("code/ofrom_alt.joblib", "code/ofrom_gen.joblib")    # generate 'ofrom_gen.joblib'
gen = tr._load_gen()                                            # loads the data, 
                                                                ## assumes 'code/ofrom_gen.joblib' as path

A typical training dataset would be 100k tokens. Corrections would usually be done in batches of 10k tokens (~5-10 files). Runs of 1k tokens are purely for technical testing.

When training the model during our testing, we actually cross-train (k-fold cross-validation): we split the data into batches (5 by default), keep one batch for testing (accuracy score and confidence scores) and train the model on the rest. We train as many models as there are batches and average their accuracy scores. This ensures that all tokens receive a confidence score. The process is hard-coded to train all models for a given subset in parallel, as threads.
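The cross-training loop might look like the following sketch. To stay runnable without a CRF library, a most-frequent-tag baseline stands in for the CRF. That baseline, the function names and the sample data are assumptions for illustration only. The sketch also assumes every batch is non-empty.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

def train_baseline(train_seqs):
    """Stand-in for CRF training: remember each token's most frequent tag."""
    seen = defaultdict(Counter)
    for seq in train_seqs:
        for token, tag in seq:
            seen[token][tag] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in seen.items()}

def accuracy(model, test_seqs):
    """Share of test tokens whose predicted tag matches the reference tag."""
    pairs = [pair for seq in test_seqs for pair in seq]
    hits = sum(model.get(tok) == tag for tok, tag in pairs)
    return hits / len(pairs)

def cross_train(seqs, nb_batches=5):
    """Hold out each batch in turn, train on the rest, average the accuracies."""
    batches = [seqs[i::nb_batches] for i in range(nb_batches)]

    def run(i):
        train = [s for j, b in enumerate(batches) if j != i for s in b]
        return accuracy(train_baseline(train), batches[i])

    # One thread per held-out batch, mirroring the parallel training.
    with ThreadPoolExecutor(max_workers=nb_batches) as pool:
        scores = list(pool.map(run, range(nb_batches)))
    return sum(scores) / len(scores), scores

# Hypothetical data: ten identical two-token sequences.
seqs = [[("je", "PRO"), ("sais", "VER")] for _ in range(10)]
mean_acc, fold_accs = cross_train(seqs, nb_batches=5)
```

Since each sequence lands in exactly one held-out batch, every token is scored once by a model that never saw it during training.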

## 2. Passive training

With passive training, we can immediately train our CRF model on the entire dataset.

In [None]:
crf_model = gen.crf_passive()    # train a model on the whole dataset

But for comparison purposes, we iterate on subsets selected at random. 

In [None]:
gen.reset()
tr.passive(gen, lim=10000, loop=10, nb_batches=5)    # one iteration with 10 steps of 10k tokens

In [None]:
tr.save_passive(it=10, f="json/passive_10k_10.json") # add 10 iterations of 'passive' to the json

We can then observe the learning curve (accuracy score as a function of the number of tokens) on a graph.

In [None]:
tr.plot_acc(f="json/passive_10k_10.json", lim=10000, 
            title="Passive training")                # plots the iterations

The 'plot_acc()' function directly saves the image using the title as name. Here are a couple of plots.
![passive_10k_10](img/passive_10k_10.png "passive_10k_10")
![passive_1k_100](img/passive_1k_100.png "passive_1k_100")
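One way a confidence band around the accuracy curve could be computed from several iterations is sketched below. It uses a normal approximation (mean ± 1.96 standard errors), which is an assumption: the method actually used by 'plot_acc()' is not described here. The scores are made-up values.

```python
import math
import statistics

def mean_and_ci(scores, z=1.96):
    """Mean and normal-approximation confidence half-width of the scores."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width

# Hypothetical accuracy scores: one inner list per step, one value per iteration.
acc_per_step = [
    [0.90, 0.92, 0.91],
    [0.93, 0.93, 0.93],
]
bands = [mean_and_ci(step) for step in acc_per_step]
```

Each entry of `bands` gives the point to plot and the half-width of the band at that step. More iterations shrink the band.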

## 3. Active training

Active training requires a strategy to select the next files for the subset. Our current formula is:
> ( (tok_coeff\*token_weight) + (file_coeff\*file_weight) ) / file_cost

Where the coefficients (X_coeff) are manually set and fixed throughout the process. By default:
- tok_coeff = 1.
- file_coeff = 0.

The *token_weight* formula is:
> sum( token_occurrences_in_file \* (1-token_confidence_score) ) / nb_tokens

That is, the average uncertainty (1 - confidence score) of the tokens of interest, weighted by their number of occurrences in the file: the more frequent and the more uncertain those tokens are in a file, the higher its weight.
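A worked example of the *token_weight* formula is sketched below. It assumes that the sum runs over the tokens of interest from the previous step and that *nb_tokens* is the size of that set. Both are readings of the formula rather than confirmed details, and the names and values are hypothetical.

```python
def token_weight(file_counts, tokens_of_interest):
    """sum(occurrences_in_file * (1 - confidence)) / nb_tokens.

    file_counts        : dict token -> occurrences in the file
    tokens_of_interest : list of (token, confidence_score) tuples
    nb_tokens is taken to be len(tokens_of_interest) (an assumption).
    """
    total = sum(
        file_counts.get(tok, 0) * (1.0 - conf)
        for tok, conf in tokens_of_interest
    )
    return total / len(tokens_of_interest)

# Hypothetical values.
toks = [("que", 0.60), ("là", 0.80)]
file_a = {"que": 10, "là": 5, "chat": 3}
file_b = {"que": 1, "chat": 40}
w_a = token_weight(file_a, toks)   # (10*0.4 + 5*0.2) / 2
w_b = token_weight(file_b, toks)   # (1*0.4 + 0*0.2) / 2
```

File A, which contains many occurrences of the uncertain tokens, gets a much higher weight than file B, even though file B is larger.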

As for how the set of tokens is selected at each step, the formula is:
> log10( token_occurrences_in_subset ) \* (1-token_confidence_score)

Picking the lowest confidence scores alone usually ends up selecting tokens with very few occurrences; we therefore take the number of occurrences into account. This also tends to eliminate tokens that may have exhausted their occurrences in the dataset. The logarithmic scale keeps the most frequent tokens from dominating the selection.
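The selection of tokens of interest can be illustrated with the sketch below, which ranks tokens by the score above and keeps the top `nb_toks`. The function name `select_tokens`, the counts and the confidence scores are hypothetical.

```python
import math

def select_tokens(subset_counts, confidence, nb_toks=10):
    """Rank tokens by log10(occurrences_in_subset) * (1 - confidence)."""
    scored = [
        (math.log10(n) * (1.0 - confidence[tok]), tok)
        for tok, n in subset_counts.items()
    ]
    scored.sort(reverse=True)
    return [(tok, confidence[tok]) for _, tok in scored[:nb_toks]]

# Hypothetical counts and confidence scores.
counts = {"que": 1000, "là": 100, "zut": 1, "le": 5000}
conf = {"que": 0.70, "là": 0.60, "zut": 0.10, "le": 0.99}
top = select_tokens(counts, conf, nb_toks=2)
```

In this example "zut" has the lowest confidence but a single occurrence, so its score is 0. "le" is the most frequent token but is almost always tagged with confidence, so it is not selected either.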

The *file_weight*, while eliminated by default (due to file_coeff == 0.), is calculated by summing the file's tokens (occurrences) with at least one grammatical tag and dividing that sum by the total number of tokens (occurrences). That percentage represents a file's potential value, as one way among others to seek variety and relevance for training & research.

The *file_cost* is the number of *problematic* tokens (occurrences) in a file, that is, the number of occurrences with more than one possible tag, which therefore need to be checked manually. This isn't exactly an ENUA (Expected Number of User Actions) as we don't give each occurrence a probability of being corrected, but it is still a good estimate of the amount of work expected in a manual correction.

Those two values are fixed throughout the process.
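Both per-file values can be computed once from the token counts and the tag dictionary, as in the sketch below. The function signatures, tag names and counts are assumptions made for the example.

```python
def file_weight(file_counts, grammatical):
    """Share of a file's occurrences whose token has a grammatical tag."""
    total = sum(file_counts.values())
    hits = sum(n for tok, n in file_counts.items() if tok in grammatical)
    return hits / total

def file_cost(file_counts, pos):
    """Number of occurrences whose token has more than one possible tag."""
    return sum(n for tok, n in file_counts.items() if len(pos.get(tok, ())) > 1)

# Hypothetical tag dictionary and file counts.
pos = {"le": {"DET", "PRO"}, "que": {"CON", "PRO"}, "chat": {"NOM"}, "et": {"CON"}}
grammatical = {"le", "que", "et"}
file_counts = {"le": 4, "que": 2, "chat": 3, "et": 1}
weight = file_weight(file_counts, grammatical)  # (4 + 2 + 1) / 10
cost = file_cost(file_counts, pos)              # 4 + 2
```

Neither value depends on the model's confidence scores, which is why they can be precomputed and stay fixed while the *token_weight* changes at every step.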

In [None]:
gen.reset()
tr.active_variable(gen, lim=10000, loop=10, 
                   nb_batches=5, nb_toks=10)

In [None]:
tr.save_active_v(it=10, f="json/active_10k_10.json")

In [None]:
tr.plot_acc(f="json/active_10k_10.json", lim=10000, 
            title="Active training")

Again, 'plot_acc()' directly saves the graphs. 
![active_1k_10](img/active_1k_10.png "active_1k_10")

It is possible that users already know which tokens of interest they want to track. This *fixed active training* is active training where the set of tokens does not vary at each step: only the confidence scores (and number of occurrences) are updated.

We can use the 'active_fixed()' function for that purpose. It can take an additional 'g_toks' parameter with a list of tuples (token, confidence_score). If that parameter is omitted, it will select an initial subset at random, then fix its set of tokens from that subset.

In [None]:
gen.reset()
tr.active_fixed(gen, lim=10000, loop=10, nb_batches=5, nb_toks=10)

No function has been written yet to save iterations to a JSON file. The resulting graph is saved manually.
![active_fixed](img/active_fixed.png "active_fixed")

## Conclusion

First, regarding accuracy scores:
- We can expect a 1k training set to provide an accuracy score of ~0.87.
- At 10k tokens, it should be ~0.92.
- At 100k tokens, it should be ~0.94.

Our CRF model could not be made simpler (although we did not discuss here how some tokens/symbols were removed, such as truncations, shorter pauses or inaudible speech...). We therefore consider 0.94 to be the floor: a PoS annotator should not do worse. (The DisMo annotation tool has an accuracy score of ~0.98.)

Second, regarding learning curves:
- Passive training follows a nice, logarithmic curve. It plateaus around 0.94-0.95.
- Active training doesn't look more efficient.

This goes against our expectation, as the purpose of active learning is to get a better accuracy score with less data. We have yet to formulate hypotheses to explain those observations.