# Sangkak AI Challenge: POS tasks

--------------------------------------------------------------------------

- **Author**: Elvis MBONING (NTeALan Research and Development Team)
- **Session**: September 2023

--------------------------------------------------------------------------

In this notebook, we train several models.

We want to test the following hypotheses for the CRFsuite model:

- impact of feature normalization on model classification
- impact of feature regularization on model classification
- impact of the choice of training algorithm on model classification
- impact of position-based data augmentation (imbalanced classes) on model classification
- impact of feature-based data augmentation (imbalanced classes) on model classification
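
Two of the hypotheses above concern imbalanced classes. A minimal sketch of how the tag distribution could be inspected before deciding on augmentation is given here; the helper `tag_distribution` and the toy token list are illustrative assumptions, not part of the Sangkak code.

```python
from collections import Counter

# Toy (word, tag) pairs standing in for the real training data (illustrative sample).
tagged_tokens = [
    ("Mwɔ̌ʼ", "NOUN"), ("pfʉ́tə́", "VERB"), ("nə́", "ADP"),
    ("mwâsi", "NOUN"), ("máp", "DET"), ("Guŋ", "NOUN"),
]

def tag_distribution(tokens):
    """Return {tag: (count, share)} ordered from most to least frequent tag."""
    counts = Counter(tag for _, tag in tokens)
    total = sum(counts.values())
    return {tag: (n, n / total) for tag, n in counts.most_common()}

distribution = tag_distribution(tagged_tokens)
for tag, (count, share) in distribution.items():
    print(f"{tag:<5} {count:>3} {share:.1%}")
```

Tags with a very small share are the ones an augmentation step would most likely target.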



# Experiments

In this experiment, we build a machine learning model based on Conditional Random Fields (CRF).
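
For reference, a linear-chain CRF models the probability of a tag sequence $y$ given a word sequence $x$ in the standard textbook form below. The symbols $f_k$ (feature functions) and $\lambda_k$ (weights) are notation introduced here, not names from the notebook. The last line is a sketch of the regularized training objective, assuming that the CRFsuite parameters `c1` and `c2` act as L1 and L2 coefficients on the weights.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)

\min_{\lambda} \; -\sum_{i} \log p\left(y^{(i)} \mid x^{(i)}\right)
  + c_1 \lVert \lambda \rVert_1 + c_2 \lVert \lambda \rVert_2^2
```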

## 1- Data processing and analysis

### 1.1. Loading data from Masakhane folder


In [None]:
# Install Python package dependencies
!pip3 install pandas python_crfsuite summarytools sklearn_crfsuite
!pip3 install iteration_utilities matplotlib

In [None]:
# Download Masakhane dataset from Github
!git clone https://github.com/masakhane-io/masakhane-pos.git

In [1]:
from pathlib import Path
import pandas as pd
import joblib
from datetime import datetime
import json

from sangkak_estimators import SangkakPosProjetReader, SangkakPosFeaturisation

In [2]:
# Paths to the train, dev and test data
language = 'bbj'
bbj_pos_path   = Path(f'../data_source/masakhane-pos/data/{language}')
train_data_path = bbj_pos_path / 'train.txt'
dev_data_path = bbj_pos_path / 'dev.txt'
test_data_path = bbj_pos_path / 'test.txt'

# read data from source with sklearn estimator
reader_estimator = SangkakPosProjetReader()
list_train_data, pd_train_data = reader_estimator.fit(train_data_path).transform_analysis(augment=True)
list_dev_data, pd_dev_data = reader_estimator.fit(dev_data_path).transform_analysis(augment=True)
list_test_data, pd_test_data = reader_estimator.fit(test_data_path).transform_analysis(augment=True)

pd_train_data

-> Read input sentences
-> Augment input sentences with pos to pos algorithm
-> Read input sentences
-> Augment input sentences with pos to pos algorithm
-> Read input sentences
-> Augment input sentences with pos to pos algorithm


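| | sentence_id | word | tags |
|---|---|---|---|
| 0 | 1 | Mwɔ̌ʼ | NOUN |
| 1 | 1 | pfʉ́tə́ | VERB |
| 2 | 1 | nə́ | ADP |
| 3 | 1 | mwâsi | NOUN |
| 4 | 1 | máp | DET |
| ... | ... | ... | ... |
| 210453 | 11131 | Nəmo | NOUN |
| 210454 | 11131 | Ntamtə | NOUN |
| 210455 | 11131 | Guŋ | NOUN |
| 210456 | 11131 | áá | DET |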


In [3]:
feature_estimator = SangkakPosFeaturisation()
feature_estimator.fit([])

Xtrain = feature_estimator.transform(list_train_data)
Xdev  = feature_estimator.transform(list_dev_data)
Xtest = feature_estimator.transform(list_test_data)

ytrain = feature_estimator.transform(list_train_data, label=True)
ydev   = feature_estimator.transform(list_dev_data, label=True)
ytest  = feature_estimator.transform(list_test_data, label=True)

Xtrain[0]

-> Featurisation of input sentences
-> Featurisation of input sentences
-> Featurisation of input sentences
-> Featurisation of input sentences
-> Featurisation of input sentences
-> Featurisation of input sentences


[{'word': 'Mwɔ̌ʼ',
  'bias': 1.0,
  'word.tones': '̌',
  'word.normalized': 'Mwɔ̌ʼ',
  'word.position': 0,
  'word.has_hyphen': 0,
  'word.lower()': 'mwɔ̌ʼ',
  'word.start_with_capital': -1,
  'word.have_tone': 1,
  'word.prefix': 'Mw',
  'word.root': '̌ʼ',
  'word.ispunctuation': 0,
  'word.letters': -1,
  'word.isdigit()': 0,
  'word.EOS': 0,
  'word.BOS': 1,
  '-1:word': '',
  '-1:word.position': -1,
  '-1:word.letters': -1,
  '-1:word.normalized': '',
  '-1:word.start_with_capital': -1,
  '-1:len(word-1)': -1,
  '-1:word.lower()': '',
  '-1:word.isdigit()': -1,
  '-1:word.ispunctuation': 0,
  '-1:word.BOS': 0,
  '-1:word.EOS': 0,
  '-1:word.prefix': '',
  '-1:word.root': '',
  '+1:word.prefix': 'pf',
  '+1:word.root': '́tə́',
  '+1:word': 'pfʉ́tə́',
  '+1:word.position': 1,
  '+1:word.letters': 'p f ʉ ́ t ə ́',
  '+1:word.normalized': 'pfʉ́tə́',
  '+1:word.start_with_capital': 0,
  '+1:len(word+1)': 7,
  '+1:word.lower()': 'pfʉ́tə́',
  '+1:word.isdigit()': 0,
  '+1:word.ispunctuati
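
The implementation of `SangkakPosFeaturisation` is not shown in this notebook. The sketch below only illustrates how a few of the features listed above (lowercased form, tone marks, sentence boundaries, neighbouring words) could be derived. The function names and the use of Unicode combining marks as a proxy for tones are assumptions, not the project's actual code.

```python
import unicodedata

def token_features(sentence, i):
    """Build a small feature dict for the i-th word of a sentence (illustrative only)."""
    word = sentence[i]
    # After NFD decomposition, combining marks (category Mn) hold the tone diacritics.
    tones = "".join(
        c for c in unicodedata.normalize("NFD", word)
        if unicodedata.category(c) == "Mn"
    )
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word.tones": tones,
        "word.have_tone": int(bool(tones)),
        "word.isdigit()": int(word.isdigit()),
        "word.BOS": int(i == 0),
        "word.EOS": int(i == len(sentence) - 1),
    }
    # Context features from the previous and next words, when they exist.
    if i > 0:
        features["-1:word.lower()"] = sentence[i - 1].lower()
    if i < len(sentence) - 1:
        features["+1:word.lower()"] = sentence[i + 1].lower()
    return features

def sentence_features(sentence):
    """Featurise every word of one sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]

example = sentence_features(["Mwɔ̌ʼ", "pfʉ́tə́", "nə́"])
```

Each sentence becomes a list of dicts, which is the input shape `sklearn_crfsuite.CRF` expects for one sequence.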

In [None]:
len([str(x) for y in Xtrain for x in y]), len([str(x) for y in ytrain for x in y])

In [None]:
all_data_train = pd.concat([pd.DataFrame([json.dumps(x) for y in Xtrain for x in y], columns=["features"]), 
                            pd.DataFrame([x for y in ytrain for x in y], columns=["labels"])], axis=1, ignore_index=True)
all_data_dev = pd.concat([pd.DataFrame([json.dumps(x) for y in Xdev for x in y], columns=["features"]), 
                            pd.DataFrame([x for y in ydev for x in y], columns=["labels"])], axis=1, ignore_index=True)
all_data_test = pd.concat([pd.DataFrame([json.dumps(x) for y in Xtest for x in y], columns=["features"]), 
                            pd.DataFrame([x for y in ytest for x in y], columns=["labels"])], axis=1, ignore_index=True)

all_data_parse = pd.concat([all_data_train, all_data_dev, all_data_test], 
                        axis=0, ignore_index=True)

all_data_parse.columns = ['features','labels']
all_data_parse

### 1.3. Analyzing data 

In [None]:
# remove unused / poorly performing variables
# NOTE: at this point all_data_parse only has the 'features' and 'labels' columns,
# so every deletion below is expected to fail and be reported.
remove_unused_features = ['+1:word.isdigit()', '+1:word.ispunctuation', '-1:word.EOS',
        '+1:word.BOS', 'word.has_hyphen', '+1:word.EOS', '-1:word.BOS',
        '-1:word.isdigit()', '-1:word.ispunctuation', '+1:word.normalized',
        '-1:word.tag', '+1:word.tag', 
        '-1:word.start_with_capital','+1:word.start_with_capital']
for x in remove_unused_features:
    try:
        del all_data_parse[x]
    except KeyError:
        print("-- failed to remove: %s" % x)

# - Preprocess data into training and validation sets

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.utils.multiclass import type_of_target

Xtrain, Xtest, ytrain, ytest = train_test_split(
    all_data_parse.drop('labels', axis=1).copy(),
    all_data_parse['labels'].copy(),
    test_size=0.2, random_state=None, shuffle=False
)

Xtrain, Xdev, ytrain, ydev = train_test_split(
    Xtrain, ytrain, test_size=0.25, 
    random_state=None, shuffle=False
)

num_class = len(list(set(all_data_parse['labels'])))
print("Number of classes: %s" %num_class)

print("Type of target of ytrain data set: %s" %type_of_target(ytrain))
print("Type of target of ytest data set: %s\n" %type_of_target(ytest))

len_data = len(all_data_parse.index)

def f_len(data):
    l = len(data)
    percent = l*100/len_data
    return {'l':l, 'p':int(percent)}

print("- len of Xtrain data set: {l} ({p}%)".format(**f_len(Xtrain)))
print("- len of Xtest data set: {l} ({p}%)".format(**f_len(Xtest)))
print("- len of Xdev data set: {l} ({p}%)".format(**f_len(Xdev)))
print("- len of ytrain data set: {l} ({p}%)".format(**f_len(ytrain)))
print("- len of ytest data set: {l} ({p}%)".format(**f_len(ytest)))
print("- len of ydev data set: {l} ({p}%)".format(**f_len(ydev)))
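
The two chained `train_test_split` calls above first hold out 20% of the rows for testing, then 25% of the remaining 80% for development, which gives roughly a 60/20/20 split. The arithmetic can be sketched as follows; `chained_split_sizes` is a hypothetical helper, and it assumes the held-out portion is rounded up with the rest going to the training side.

```python
import math

def chained_split_sizes(n_rows, test_size=0.2, dev_size=0.25):
    """Return (train, dev, test) sizes for two successive hold-out splits."""
    # First split: hold out the test portion of the full data set.
    n_test = math.ceil(n_rows * test_size)
    n_remaining = n_rows - n_test
    # Second split: hold out the dev portion of what is left.
    n_dev = math.ceil(n_remaining * dev_size)
    n_train = n_remaining - n_dev
    return n_train, n_dev, n_test

sizes = chained_split_sizes(1000)
print(sizes)
```

Because both splits use `shuffle=False`, the test rows are simply the last rows of `all_data_parse` and the dev rows are the ones just before them.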

In [None]:
[json.loads(x[0]) for x in [x for x in Xtrain.values]][0]

In [None]:
# Decode the JSON feature strings back into dicts.
# Each token becomes its own one-item sequence, so the CRF sees no neighbouring tags.
X_train = [[json.loads(row[0])] for row in Xtrain.values]
X_test  = [[json.loads(row[0])] for row in Xtest.values]
X_dev   = [[json.loads(row[0])] for row in Xdev.values]
y_dev   = [[x] for x in ydev]
y_train = [[x] for x in ytrain]
y_test  = [[x] for x in ytest]

In [None]:
X_train[0]

# 3. Modelling with CRF algorithm

In [4]:
#import pycrfsuite
import sklearn_crfsuite
import math, string, re
import scipy
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score
from sklearn_crfsuite import metrics
from collections import Counter


### 3.1. Initialisation of pycrfsuite with training data

In [5]:
project = f"sangkak-{language}"
build_date = str(datetime.now()).replace(' ','_')
model_name = Path(f"models/multi/crf_{project}_{build_date}.model")
model_file = str(model_name)
file_crf = Path(f"models/multi/crf_{project}_{build_date}.object")

params = {
    "algorithm": 'lbfgs',
    "c1": 0.0920512484757745,
    "c2": 0.0328771171605105, 
    "max_iterations":100,
    "verbose": True,
    "num_memories":10000,
    "epsilon": 1e-3,
    "linesearch": "MoreThuente",
    "max_linesearch":100000,
    "delta":1e-4,
    #n_job=-1,
    #"c": 2,
    #"pa_type": 2,
    "all_possible_states":True,
    "all_possible_transitions":True, 
    "model_filename": model_file
}

crf = sklearn_crfsuite.CRF(**params)

crf.fit(Xtrain, ytrain, Xdev, ydev)    

final = {"crf": crf, "params": params}
joblib.dump(final, file_crf) 


loading training data to CRFsuite: 100%|██████████| 11131/11131 [00:04<00:00, 2344.61it/s]





loading dev data to CRFsuite: 100%|██████████| 1908/1908 [00:00<00:00, 2249.66it/s]



Holdout group: 2

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 1
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 596087
Seconds required: 14.663

L-BFGS optimization
c1: 0.092051
c2: 0.032877
num_memories: 10000
max_iterations: 100
epsilon: 0.001000
stop: 10
delta: 0.000100
linesearch: MoreThuente
linesearch.max_iterations: 100000

Iter 1   time=4.22  loss=543231.08 active=590214 precision=0.013  recall=0.062  F1=0.021  Acc(item/seq)=0.203 0.000  feature_norm=0.12
Iter 2   time=0.78  loss=532849.78 active=589501 precision=0.056  recall=0.067  F1=0.028  Acc(item/seq)=0.173 0.000  feature_norm=0.13
Iter 3   time=0.73  loss=529426.66 active=592085 precision=0.033  recall=0.069  F1=0.036  Acc(item/seq)=0.209 0.000  feature_norm=0.13
Iter 4   time=0.72  loss=524137.82 active=591553 precision=0.098  recall=0.063  F1=0.023  Acc(item/seq)=0.205 0.000  feature_norm=0.17
Iter 5   time=0.72  loss=517133

['models/multi/crf_sangkak-bbj_2023-09-22_21:45:56.610246.object']

In [None]:
def evaluate_crf_model(crf, Xtest, ytest):
    # get model classes
    labels = list(crf.classes_)
    #labels.remove('O')

    sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
    #print(sorted_labels)

    # obtaining metrics such as accuracy, etc. on the test set
    ypred = crf.predict(Xtest)
    print('- F1 score on the test set = {}'.format(
            metrics.flat_f1_score(ytest, ypred, average='weighted', 
                        labels=labels, zero_division=0)))

    print('- Accuracy on the test set = {}\n'.format(
        metrics.flat_accuracy_score(ytest, ypred)))

    print('Test set classification report: \n\n{}'.format(
                metrics.flat_classification_report(ytest, 
                ypred, labels=sorted_labels, digits=3, zero_division=0)))
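
The `flat_*` metrics used above score individual tokens rather than whole sentences: the tag sequences are concatenated before the metric is computed. A minimal sketch of that idea for accuracy, using a hypothetical helper `flat_accuracy` and toy tag sequences, could look like this.

```python
from itertools import chain

def flat_accuracy(y_true, y_pred):
    """Token-level accuracy over a list of tag sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    if len(true_flat) != len(pred_flat):
        raise ValueError("y_true and y_pred must contain the same number of tokens")
    correct = sum(t == p for t, p in zip(true_flat, pred_flat))
    return correct / len(true_flat)

# Toy gold and predicted tags for two sentences (one wrong token out of five).
y_true = [["NOUN", "VERB", "ADP"], ["NOUN", "DET"]]
y_pred = [["NOUN", "VERB", "NOUN"], ["NOUN", "DET"]]
accuracy = flat_accuracy(y_true, y_pred)
print(accuracy)
```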



In [None]:
evaluate_crf_model(crf, X_test, y_test)

In [None]:
import pycrfsuite

# reopen the model file written by sklearn_crfsuite during training
tagger = pycrfsuite.Tagger()
tagger.open(model_file)

In [None]:
# with no augmentation
y_pred = [tagger.tag(xseq) for xseq in X_test]

print(metrics.flat_classification_report(y_test, y_pred, digits=3))

# Try to find the best dataset for training

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

skf = StratifiedKFold(n_splits=4)

fold_no = 1

params_skf = {
    "algorithm": 'lbfgs',
    "c1": 0.0920512484757745,
    "c2": 0.0328771171605105, 
    "max_iterations":100,
    "verbose":False,
    "num_memories":1000,
    "epsilon": 1e-4,
    "linesearch": "MoreThuente",
    "max_linesearch":100000,
    "delta":1e-3,
    "all_possible_states":True,
    "all_possible_transitions":True
}

crf_skf = sklearn_crfsuite.CRF(**params_skf)

train_split = all_data_parse.drop('labels', axis=1).copy()
test_split = all_data_parse['labels'].copy()

for train_index, test_index in skf.split(train_split, test_split):
    print('Working on Fold ', str(fold_no),': ')
    train = all_data_parse.loc[train_index,:]
    test  = all_data_parse.loc[test_index,:]

    # decode the JSON feature strings into dicts; one token per sequence
    Xx_train = [[json.loads(x)] for x in train["features"].values]
    yy_train = [[x] for x in train['labels'].values]

    Xx_test = [[json.loads(x)] for x in test["features"].values]
    yy_test = [[x] for x in test['labels'].values]

    print(f'\tTraining: {len(Xx_train)} / {len(yy_train)}')
    crf_skf.fit(Xx_train, yy_train, X_dev, y_dev)

    print(f'\tTest: {len(Xx_test)} / {len(yy_test)}')
    predictions = crf_skf.predict(Xx_test)

    print('\t', '>> f1_score: ', 
                metrics.flat_f1_score(yy_test, predictions, average="macro"), '\n')
    fold_no += 1


# 4- Grid search

To optimise the parameters of the CRF model, we use a randomized search with cross-validation to find the `c1` and `c2` values that best fit our data.
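
The search in this section draws `c1` and `c2` from exponential distributions with scales 0.1 and 0.05. A rough standard-library sketch of that sampling step is shown here; `sample_candidates` is a hypothetical helper, and `random.expovariate(1 / scale)` is used as a stand-in for `scipy.stats.expon(scale=scale)`.

```python
import random

def sample_candidates(n_iter, c1_scale=0.1, c2_scale=0.05, seed=0):
    """Draw n_iter (c1, c2) pairs from exponential distributions."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return [
        {
            # expovariate takes the rate, which is the inverse of the scale (mean)
            "c1": rng.expovariate(1 / c1_scale),
            "c2": rng.expovariate(1 / c2_scale),
        }
        for _ in range(n_iter)
    ]

candidates = sample_candidates(50)
print(candidates[:3])
```

Most draws stay close to zero, with occasional larger values, so the search concentrates on weak regularization while still probing stronger settings.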

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import scorers
from itertools import chain
from sklearn.preprocessing import MultiLabelBinarizer

In [None]:
# get initial model parameters and delete c1 and c2 parameters
del params['c1']
del params['c2']

crf_grill = sklearn_crfsuite.CRF(**params)

labels = list(crf.classes_)
# POS tag sets normally have no 'O' label, so only drop it when present
if 'O' in labels:
    labels.remove('O')

params_space = {
    'c1': scipy.stats.expon(scale=0.1),
    'c2': scipy.stats.expon(scale=0.05)
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf_grill, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=5,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(Xtrain, ytrain)

In [None]:
# crf = rs.best_estimator_
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

In [None]:
import matplotlib.pyplot as plt

#print(rs.cv_results_)
_x = [s['c1'] for s in rs.cv_results_['params']]
_y = [s['c2'] for s in rs.cv_results_['params']]
# colour each point by its mean cross-validated score
_c = [s for s in rs.cv_results_['mean_test_score']]

fig = plt.figure()
fig.set_size_inches(12, 12)
ax = plt.gca()
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('C1')
ax.set_ylabel('C2')
ax.set_title("Randomized Hyperparameter Search CV Results (min={:0.3}, max={:0.3})".format(
    min(_c), max(_c)
))

ax.scatter(_x, _y, c=_c, s=60, alpha=0.9, edgecolors=[0,0,0])

print("Dark blue => {:0.4}, dark red => {:0.4}".format(min(_c), max(_c)))

In [None]:
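crf_grid = rs.best_estimator_
y_pred = crf_grid.predict(Xtest)
# sorted_labels was only defined inside evaluate_crf_model, so rebuild it here
sorted_labels = sorted(crf_grid.classes_)
print(metrics.flat_classification_report(
    ytest, y_pred, labels=sorted_labels, digits=3
))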

In [None]:

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])
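
The ranking above relies on `Counter.most_common()`, which orders entries by weight from highest to lowest, so the tail of the list holds the most negative transitions. A small sketch with made-up weights (not taken from the trained model) shows the behaviour.

```python
from collections import Counter

# Hypothetical transition weights keyed by (label_from, label_to).
toy_transitions = {
    ("DET", "NOUN"): 2.1,
    ("NOUN", "VERB"): 1.4,
    ("ADP", "NOUN"): 0.9,
    ("VERB", "VERB"): -0.7,
    ("DET", "DET"): -1.8,
}

ranked = Counter(toy_transitions).most_common()  # highest weight first
top_likely = ranked[:2]
top_unlikely = ranked[-2:]  # the tail holds the most negative weights

for (label_from, label_to), weight in top_likely + top_unlikely:
    print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))
```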

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])