# Bsc25 - CRF

## Setting up a CRF pipeline

### Training

We turn our dataset 'ofrom_alt.joblib' into *sequences*.

The dataset is derived from the OFROM+ database of spoken French. The joblib object is a Pandas DataFrame with one row per *token* (~word).

A *sequence* is an IPU (Intra-Pausal Unit), that is, the set of *tokens* between two (silent) pauses. The rest of this paragraph is a discussion for linguists. Which pauses count as boundaries depends on their duration: the old DisMo model used a 0.5s (second) threshold, whereas our IPU threshold is set at 0.3s. Linguistically, 0.3s is the duration at which pauses start being perceived, while ~0.6-0.8s is when they start being considered proper boundaries. We chose the lower threshold, following Fribourg's pragma-syntax, to obtain sequences as close as possible to *clauses*.

When building a sequence, some *tokens* are discarded. Those are:
- shorter pauses (<0.3s)
- reserved symbols (anonymized or unintelligible parts, third-party speaker, etc.)
- truncations (*tokens* interrupted before completion)

While the choice to exclude those cases is not trivial, we bet that they would add more noise than useful information.
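
The segmentation described above can be sketched in plain Python. This is only an illustration, not the code of `ofrom_pos`: the record fields (`kind`, `duration`, `token`, `pos`) and the function `build_sequences` are hypothetical names, and the real DataFrame columns may differ.

```python
PAUSE_THRESHOLD = 0.3  # seconds; the IPU threshold stated above

def build_sequences(tokens, threshold=PAUSE_THRESHOLD):
    """Group token records into IPUs, dropping the discarded token types.

    Each record is a dict with a hypothetical 'kind' field whose value is
    'word', 'pause', 'symbol' or 'truncation' (assumed names).
    """
    sequences, current = [], []
    for tok in tokens:
        kind = tok["kind"]
        if kind == "pause" and tok["duration"] >= threshold:
            # A long enough pause closes the current IPU.
            if current:
                sequences.append(current)
            current = []
        elif kind == "word":
            current.append(tok)
        # Short pauses, reserved symbols and truncations are skipped.
    if current:
        sequences.append(current)
    return sequences

# Toy input, made up for the example.
demo = [
    {"kind": "word", "token": "et", "pos": "CON:coo"},
    {"kind": "pause", "duration": 0.1},        # too short: ignored
    {"kind": "word", "token": "pour", "pos": "PRP"},
    {"kind": "truncation", "token": "cuis"},   # discarded
    {"kind": "pause", "duration": 0.5},        # boundary
    {"kind": "word", "token": "moi", "pos": "PRO:per:ton"},
]
ipus = build_sequences(demo)
print([[t["token"] for t in seq] for seq in ipus])
```

Note that a discarded token does not split the sequence here: the tokens on either side of it stay in the same IPU.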

In [30]:
import ofrom_crf # requires scikit-learn & sklearn_crfsuite
import ofrom_pos # requires joblib, zipfile, networkx, ...

In [32]:
X, y = ofrom_pos.load_allsequs(lim=100000) # only take the first 100,000 lines for demonstration
X_tr, X_te, y_tr, y_te = ofrom_crf.train_test_split(X, y, train_size=0.8) # actually from scikit-learn

We then train a CRF (Conditional Random Fields) model on those sequences using the dedicated *sklearn_crfsuite* library (more precisely its CRF class).

- 'X' is a list of *sequences*, with each sequence being a list of dictionaries, each dictionary containing the features of the corresponding *token*. In our case, 'load_allsequs()' gave only the *token* string itself as a feature, meaning our model has only 1 feature.
- 'y' is a list of *sequences*, with each sequence being a list of strings representing the 'pos' (PoS standing for Part-of-Speech, the *token*'s morpho-syntactic / grammatical category).
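
To make the shapes of 'X' and 'y' concrete, here is a minimal sketch that builds them by hand for one toy sequence. The helper `to_features` is hypothetical; only the `'token'` key is taken from the code used later in this notebook.

```python
def to_features(tokens):
    """Turn a list of token strings into one feature dict per token."""
    return [{"token": t} for t in tokens]

# One toy sequence with its tags.
tokens = ["et", "pour", "moi"]
tags = ["CON:coo", "PRP", "PRO:per:ton"]

X_demo = [to_features(tokens)]  # list of sequences of feature dicts
y_demo = [tags]                 # list of sequences of PoS strings

# Each sequence in X must have the same length as its counterpart in y.
assert all(len(xs) == len(ys) for xs, ys in zip(X_demo, y_demo))
print(X_demo[0][0], y_demo[0][0])
```

Adding a feature later would simply mean adding a key to each dictionary.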

The hyperparameters 'c1' and 'c2' (which we do not yet understand) are taken from preliminary work done in fall 2024 and should be revised when possible.

In [34]:
crf = ofrom_crf.train(X_tr, y_tr, c1=0.22, c2=0.03, max_iterations=100)
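
As a starting point for that revision: in *sklearn_crfsuite*, 'c1' and 'c2' are documented as the coefficients for L1 and L2 regularization. Assuming 'ofrom_crf.train()' passes them unchanged to the CRF class with the L-BFGS algorithm (an assumption, since the wrapper is not shown here), the training objective would have roughly this form, where $w$ are the feature weights and $(x^{(i)}, y^{(i)})$ the $N$ training sequences:

```latex
\min_{w} \;
  -\sum_{i=1}^{N} \log p_{w}\!\left(y^{(i)} \mid x^{(i)}\right)
  \;+\; c_1 \lVert w \rVert_1
  \;+\; c_2 \lVert w \rVert_2^2
```

Under this reading, a larger 'c1' pushes more weights to exactly zero (fewer active features), while a larger 'c2' shrinks all weights without zeroing them.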

### Predicting

We can use that model on a single sequence and retrieve the confidence score for each *token*.

In [108]:
nx, ny = X_te[8], y_te[8]                # pick a sequence
l_res = ofrom_crf.predict_one(crf, nx)   # predict it
for a, res in enumerate(l_res):          # print in a somewhat clean way...
    pos, conf = res['pos'], res['confidence']
    print(f"{pos:<20} {conf:.02f}\t{nx[a]['token']}\t {ny[a]}")

CON:coo              1.00	et	 CON:coo
PRP                  1.00	pour	 PRP
PRO:per:ton          1.00	moi	 PRO:per:ton
VER:inf              1.00	faire	 VER:inf
DET:def              1.00	la	 DET:def
NOM:com              0.97	cuisine	 NOM:com
ADJ                  0.44	cuisiner	 VER:inf
PRP                  1.00	à	 PRP
DET:ind              0.93	des	 DET:ind
PRP                  0.99	à	 PRP
DET:ind              0.98	des	 DET:ind
NOM:com              1.00	personnes	 NOM:com
CON:sub              0.63	que	 CON:sub
PRO:per:sjt          1.00	je	 PRO:per:sjt
VER:pres             0.87	connais	 VER:pres
CON:coo              1.00	ou	 CON:coo
ADV:neg              1.00	pas	 ADV:neg
CON:sub              0.87	que	 CON:sub
PRO:per:sjt          1.00	je	 PRO:per:sjt
PRP                  0.97	dans	 PRP
PRP                  1.00	dans	 PRP
DET:def              0.99	les	 DET:def
NUM:crd:det          0.54	deux	 NUM:crd:nom
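
One plausible way to obtain such a per-*token* confidence is to use marginal probabilities, which *sklearn_crfsuite* exposes through `CRF.predict_marginals_single()` as one `{label: probability}` dictionary per *token*. The sketch below assumes that format; whether 'predict_one()' actually works this way is an assumption, and the function name and the numbers are made up for the example.

```python
def label_with_confidence(marginals):
    """Pick the most probable tag per token and report its probability.

    `marginals` is a list with one {label: probability} dict per token.
    """
    results = []
    for dist in marginals:
        pos = max(dist, key=dist.get)  # label with the highest marginal
        results.append({"pos": pos, "confidence": dist[pos]})
    return results

# Made-up marginals for two tokens.
marginals = [
    {"NOM:com": 0.7, "ADJ": 0.2, "VER:inf": 0.1},
    {"ADJ": 0.5, "VER:inf": 0.4, "NOM:com": 0.1},
]
l_demo = label_with_confidence(marginals)
for res in l_demo:
    print(f"{res['pos']:<20} {res['confidence']:.2f}")
```

Taking the most probable label per *token* is not guaranteed to give the same tags as decoding the best whole sequence, so a real implementation may report the marginal of the decoded tag instead.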


We can also iterate over sequences. *load_allsequs* is actually built on top of that generator.

In [42]:
for nx, ny in ofrom_pos.iter_sequ(s=200000, lim=20, ch_prep=True):
    l_res = ofrom_crf.predict_one(crf, nx)
    print([(nx[a]['token'], res['pos']) for a, res in enumerate(l_res)])

[('qui', 'PRO:rel'), ('veut', 'VER:pres'), ('aller', 'VER:inf'), ('en', 'PRP'), ('Roumanie', 'NOM:com'), ('elle', 'PRO:per:sjt'), ('toujours', 'ADV'), ('non', 'ADV:neg'), ('Bucarest', 'VER:ppas')]
[('parce', 'CON:sub'), ("qu'", 'CON:sub'), ('y', 'PRO:per:obji'), ('avait', 'VER:impf'), ('pas', 'ADV:neg'), ('ah', 'ITJ'), ('non', 'ADV:neg'), ('parce', 'CON:sub'), ("qu'", 'CON:sub')]


And naturally we can predict over an entire set of sequences, which will be used for testing.

### Testing

We have so far limited our testing to a cross-validation score using our test subset ('X_te', 'y_te').

In [53]:
res = ofrom_crf.np.mean(ofrom_crf.cross_test(crf, X_te, y_te, cv=10)) # defaults cv=5, n_jobs=-1
print(res)

0.9201842124550519
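
The metric behind 'cross_test()' is not shown here. As a point of comparison, a simple *token*-level accuracy over a set of sequences can be computed as follows; the function name is hypothetical and this is not necessarily the score reported above.

```python
def token_accuracy(y_true, y_pred):
    """Share of tokens whose predicted tag equals the reference tag."""
    # Flatten both lists of sequences into (reference, predicted) pairs.
    pairs = [
        (t, p)
        for seq_true, seq_pred in zip(y_true, y_pred)
        for t, p in zip(seq_true, seq_pred)
    ]
    if not pairs:
        raise ValueError("no tokens to score")
    return sum(t == p for t, p in pairs) / len(pairs)

# Toy reference and prediction, made up for the example.
y_ref = [["CON:coo", "PRP"], ["VER:inf", "NOM:com"]]
y_hat = [["CON:coo", "PRP"], ["ADJ", "NOM:com"]]
acc_demo = token_accuracy(y_ref, y_hat)
print(acc_demo)
```

Because it counts *tokens* rather than sequences, long sequences weigh more than short ones in this score.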


## Remarks

This was intended as a demonstration of the pipeline using a CRF model. A previous run with only 10,000 lines resulted in a cross-validation score of 0.86.

1. The only way to make the model simpler would be to simplify the tagset by only retaining the main category (the first three letters, such as *PRO* or *ADV*).
2. As such, a score of 0.92 can be considered a floor. In fact, when testing CRF models back in fall 2024, we obtained a score of 0.95 (with a simplified tagset).
3. Steps to improve the model would be to add features and revise its hyperparameters.
4. Steps to "improve" the pipeline may include breaking it into two layers, one CRF for the main 'pos' category and another for the sub-categories. Using a dictionary to handle *tokens* with a single possible tag does not seem relevant anymore.
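
The tagset simplification from point 1 can be sketched as below. The helper names are hypothetical. The sketch splits on the first colon rather than slicing three letters; for the tags shown in this notebook both give the same result.

```python
def main_category(tag):
    """Keep only the main PoS category, e.g. 'PRO:per:sjt' -> 'PRO'."""
    return tag.split(":", 1)[0]

def simplify(y):
    """Apply main_category to every tag of every sequence."""
    return [[main_category(tag) for tag in seq] for seq in y]

y_full = [["PRO:per:sjt", "VER:pres", "ADV:neg", "PRP"]]
y_simple = simplify(y_full)
print(y_simple)
```

The same mapping could provide the labels for the first layer of the two-layer pipeline mentioned in point 4.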

Since a score of 0.92 is sufficient, the actual priority now is to use that pipeline for active training.

But first, we would like to take time to simulate and familiarize ourselves with the Markov Decision Process, as well as learn about the theory surrounding the CRF model.