<a href="https://colab.research.google.com/github/Levis0045/SCIA-CRF_LF/blob/0.1/training/experimentations_crf_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sangkak AI Challenge: NER tasks

--------------------------------------------------------------------------

- **Author**: Elvis MBONING (NTeALan Research and Development Team)
- **Session**: February 2023

--------------------------------------------------------------------------

In this notebook, we try to implement new methods that could potentially improve the NER task for low-resource African languages.

We propose a rule-based approach called **Position to position entity augmentation** to normalize and augment scarce training data for a CRF model. Our work is based on the paper by [Xiang Dai and Heike Adel (2020)](https://aclanthology.org/2020.coling-main.343.pdf).
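The position-to-position procedure itself is not spelled out at this point, so the sketch below illustrates something else: label-wise token replacement, one of the simple augmentations analysed in the cited paper. Each token is replaced, with some probability, by another token that carries the same label in the training data. The function name, the probability `p`, and the toy sentences (tokens borrowed from the training table shown later in this notebook) are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def labelwise_token_replacement(sentences, p=0.3, seed=0):
    """Return augmented copies of `sentences` (lists of (token, label) pairs)."""
    rng = random.Random(seed)

    # Pool of tokens observed for each label in the original data
    pool = defaultdict(list)
    for sent in sentences:
        for token, label in sent:
            pool[label].append(token)

    augmented = []
    for sent in sentences:
        new_sent = []
        for token, label in sent:
            if rng.random() < p:
                # Draw a token with the same label, so the tag sequence is unchanged
                token = rng.choice(pool[label])
            new_sent.append((token, label))
        augmented.append(new_sent)
    return augmented


# Hypothetical toy data, not the real training set
toy = [
    [("Mwɔ̌ʼ", "NOUN"), ("pfʉ́tə́", "VERB"), ("nə́", "ADP"), ("mwâsi", "NOUN")],
    [("fǎʼ", "VERB"), ("nə́", "ADP"), ("é", "PRON")],
]

for original, new in zip(toy, labelwise_token_replacement(toy, p=0.5)):
    print(original, "->", new)
```

Because replacements are drawn per label, the label sequence of every augmented sentence is identical to the original one, which is what makes this kind of augmentation usable for sequence labelling.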

# Experiments: feature engineering

In this experiment, we build features for different algorithms and tools.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [56]:
import joblib
from pathlib import Path
import pandas as pd

In [197]:
# Reading folder path

date = "2023-08-21"
folder = "./preprocessing"

bbj_pos_prep_path = Path(f'{folder}/sangkak_input_df_data_bbj_{date}.joblib')

bbj_pos_data = joblib.load(bbj_pos_prep_path) 

pd_train_data = bbj_pos_data["train_data"]
pd_dev_data = bbj_pos_data["dev_data"]
pd_test_data = bbj_pos_data["test_data"]
extracted_test_data = bbj_pos_data["list_test_data"]
extracted_train_data = bbj_pos_data["list_train_data"]
extracted_dev_data = bbj_pos_data["list_dev_data"]


In [None]:
pd_train_data, pd_dev_data , pd_test_data

## 1- Feature engineering for sklearn algorithms

We use several kinds of features to build our model. As Ghomala is an African language, it is important to take some of its linguistic properties into account.

Bantu and semi-Bantu languages use tone markers as morphosyntactic properties to distinguish words or meanings.
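A minimal sketch of how such markers can be isolated with the standard library alone: Unicode canonical decomposition (NFD) splits a precomposed letter into its base character and its combining diacritics, which is where tone is usually written. The helper name `split_base_and_marks` is an assumption for illustration and is not part of the notebook's feature code.

```python
import unicodedata


def split_base_and_marks(word):
    """Separate base characters from combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    # unicodedata.combining() returns a non-zero class for combining marks
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = [unicodedata.name(ch) for ch in decomposed if unicodedata.combining(ch)]
    return base, marks


base, marks = split_base_and_marks("nt\u00e2mg\u01d2")  # the word "ntâmgǒ"
print(base)   # ntamgo
print(marks)  # ['COMBINING CIRCUMFLEX ACCENT', 'COMBINING CARON']
```

Unlike an ASCII-only filter, this keeps non-ASCII base letters such as `ə` or `ɔ` and removes only the diacritics.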

In [58]:
# Load the required libraries

import unidecode
import re
from datetime import datetime
import string
import math
import unicodedata

### 1.1 Features based on african linguistics specificities

In [66]:
# importing features module 
# from features import number_tone_word


In [141]:
# Constructing word features based on tones and IPA characters

all_words = list(set(pd_train_data["word"].values))
all_tags  = list(set(pd_train_data["tags"].values))

words_caracters = set([y.lower() for x in all_words for y in x])
all_caracters   = string.punctuation+string.ascii_letters+string.digits+''
tone_caracters  = list(set([x for x in words_caracters if x not in all_caracters]))
cpm_search      = re.compile(str(tone_caracters))

def remove_accents(input_str):
    """Remove accents from the input string in order to get an ASCII string.
    Non-ASCII base letters (e.g. ə) are dropped as well.

    Args:
        input_str (str): input string

    Returns:
        str: output ascii string
    """
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    #print([x for x in nfkd_form if x not in string.ascii_letters])
    only_ascii = nfkd_form.encode('ASCII', 'ignore')
    return only_ascii.decode('utf8')

def extract_tone(input_str):
    """Extract tone from input string

    Args:
        input_str (str): input string

    Returns:
        str or None: distinct tone marks found in the input string, or None if there are none
    """
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    #print([x for x in nfkd_form if x not in string.ascii_letters])
    tones = [x for x in nfkd_form if x not in bantou_letters]
    str_tones =  "".join(list(set([x.strip() for x in tones 
                     if x in " ̄ ̀ ̌ ̂ '"])))
    # print(str_tones)
    return str_tones if len(str_tones) != 0 else None
    
# Set of functions that normalizes and get features from datasets
bantou_tones = [f"{x} " for x in " ́̄̀̌̂" if x != " "]
string_tones = "".join(bantou_tones)
tones_search = re.compile(string_tones)

bantou_letters = string.ascii_letters+"ǝɔᵾɓɨşœɑʉɛɗŋøẅëïə"

# ------------------------------------------------------------------

non_tone = remove_accents("fə̀fə̀")

print(
    len("fə̀fə̀"), 
    len(non_tone), 
    "---"+extract_tone("fə̀fə̀")
)

print([x for x in "ntâmgǒ"])
print(tone_caracters, string_tones)
print(tone_caracters)


6 2 ---̀
['n', 't', 'â', 'm', 'g', 'ǒ']
['û', 'ꞌ', 'ŋ', 'ə', '̂', 'ô', '̌', 'ʼ', 'ê', 'ï', 'è', 'ɛ', 'ǔ', 'ǝ', 'ǎ', 'ú', 'ó', '̀', 'ì', 'í', '̈', 'ʉ', 'à', 'ǐ', 'î', 'ɔ', 'ǒ', 'ù', 'á', '̩', 'é', 'â', 'ě', '́'] ́ ̄ ̀ ̌ ̂ 
['û', 'ꞌ', 'ŋ', 'ə', '̂', 'ô', '̌', 'ʼ', 'ê', 'ï', 'è', 'ɛ', 'ǔ', 'ǝ', 'ǎ', 'ú', 'ó', '̀', 'ì', 'í', '̈', 'ʉ', 'à', 'ǐ', 'î', 'ɔ', 'ǒ', 'ù', 'á', '̩', 'é', 'â', 'ě', '́']


In [73]:
int(True)

1

### 2.2. Features based on word and its contexts

In [143]:
# adding the following tags to the current word significantly improves the model
# adding tone information

bantou_tones = [f"{x} " for x in " ́̄̀̌̂" if x != " "]
string_tones = "".join(bantou_tones)
tones_search = re.compile(string_tones)

def word_decomposition(input_str):
    """Decompose the input string into its NFKD characters, separated by spaces

    Args:
        input_str (str): input string

    Returns:
        str: space-separated characters of the decomposed string
    """
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    word_decomp = " ".join([x for x in nfkd_form ])
    return word_decomp

def number_tone_word(input_str):
    """Get number of tone found in the input string

    Args:
        input_str (str): input string

    Returns:
        int: number of tone found in the input string
    """
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    tone_str = [x for x in nfkd_form if x not in bantou_letters]

    # Digits are compared as strings: a character never equals the ints 0-9
    return len([x.strip() for x in tone_str 
                     if x not in ['.', 'Ŋ'] + list(string.digits)])
    
def word2features(sent, i):
    word = sent[i][0]
    len_tone = number_tone_word(word)
    tones = extract_tone(word)
    features = {
        'word': word,
        #'bias': 1.0,
        'word.tones': tones if tones else "",
        'word.normalized': unicodedata.normalize('NFKD', word),
        'word.position': i,
        'word.has_hyphen': int('-' in word),
        'word.lower()': word.lower(),
        'word.start_with_capital': int(word[0].isupper()) if i > 0 else -1,
        'word.have_tone': 1 if len_tone>0 else 0,
        'word.prefix': word[:2] if len(word)>2 else "",
        'word.root': word[3:] if len(word)>2 else "",
        'word.ispunctuation': int(word in string.punctuation),
        'word.isdigit()': int(word.isdigit()),
        'word.EOS': 1 if word in ['.','?','!'] else 0,
        'word.BOS': 1 if i == 0 else 0,
        '-1:word': sent[i-1][0] if i > 0 else "",
        '-1:word.position': i-1 if i > 0 else -1,
        '-1:word.tag': sent[i-1][1] if i > 0 else "",
        #'-1:word.letters': word_decomposition(sent[i-1][0]) if i > 0 else -1,
        '-1:word.normalized': unicodedata.normalize('NFKD', sent[i-1][0]) if i > 0 else "",
        '-1:word.start_with_capital': int(sent[i-1][0][0].isupper()) if i > 0 else -1,
        '-1:len(word-1)': len(sent[i-1][0]) if i > 0 else -1,
        '-1:word.lower()': sent[i-1][0].lower() if i > 0 else "",
        '-1:word.isdigit()': int(sent[i-1][0].isdigit()) if i > 0 else -1,
        '-1:word.ispunctuation': int((sent[i-1][0] in string.punctuation)) if i > 0 else 0,
        '-1:word.BOS': 1 if (i-1) == 0 else 0,
        '-1:word.EOS': 1 if i > 0 and sent[i-1][0] in ['.','?','!'] else 0,
        '+1:word': sent[i+1][0] if i < len(sent)-1 else "",
        '+1:word.tag': sent[i+1][1] if i < len(sent)-1 else "",
        '+1:word.position': i+1,
        #'+1:word.letters': word_decomposition(sent[i+1][0]) if i < len(sent)-1 else -1,
        '+1:word.normalized': unicodedata.normalize('NFKD', sent[i+1][0]) if i < len(sent)-1 else "",
        '+1:word.start_with_capital': int(sent[i+1][0][0].isupper()) if i < len(sent)-1 else -1,
        '+1:len(word+1)': len(sent[i+1][0]) if i < len(sent)-1 else -1,
        '+1:word.lower()': sent[i+1][0].lower() if i < len(sent)-1 else "",
        '+1:word.isdigit()': int(sent[i+1][0].isdigit()) if i < len(sent)-1 else -1,
        '+1:word.ispunctuation': int((sent[i+1][0] in string.punctuation)) if i < len(sent)-1 else -1,
        '+1:word.BOS': 1 if i < 0 else 0,
        '+1:word.EOS': 1 if i < len(sent)-1 and sent[i+1][0] in ['.','?','!'] else 0
    }

    # if tagword not in ['B-ORG','B-LOC']: features.update({'-1:word.tag()': tagword1})
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [word[1] for word in sent]

def sent2tokens(sent):
    return [word[0] for word in sent]


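The helpers above expect each sentence as a list of `(token, tag)` pairs and look one position to the left and to the right of the current token. A stripped-down sketch of that windowing, with hypothetical feature names (`is_first`, `is_last`) and a toy sentence, shows how the sentence boundaries are handled. It is not a replacement for `word2features`.

```python
def window_features(sent, i):
    """Features for token i built from itself and its immediate neighbours."""
    feats = {
        "word": sent[i][0],
        "is_first": int(i == 0),
        "is_last": int(i == len(sent) - 1),
    }
    # The left neighbour only exists after the first position
    if i > 0:
        feats["-1:word"], feats["-1:tag"] = sent[i - 1]
    # The right neighbour only exists before the last position
    if i < len(sent) - 1:
        feats["+1:word"], feats["+1:tag"] = sent[i + 1]
    return feats


# Hypothetical toy sentence of (token, tag) pairs
toy_sent = [("kə", "AUX"), ("fǎʼ", "VERB"), ("nə́", "ADP")]
for i in range(len(toy_sent)):
    print(window_features(toy_sent, i))
```

Leaving the neighbour keys out at the boundaries is one option. `word2features` instead fills them with placeholder values (`""` or `-1`) so that every token has the same set of keys, which matters for the tabular format built later.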
### 2.3. Building all features and applying them to all datasets

These new estimators let us integrate our whole processing chain and bring it together in a single sklearn pipeline block.

In [198]:
from sklearn.base import (
    BaseEstimator,
    OneToOneFeatureMixin,
    TransformerMixin
)

class SangkakPosFeaturisation(OneToOneFeatureMixin, TransformerMixin, BaseEstimator):

    def __init__(self, norm="l2", *, copy=True):
        self.norm = norm
        self.copy = copy

    def fit(self, X, y=None):  
        return self

    def transform(self, X, copy=None, label=False):
        # check_is_fitted(self)
        
        copy = copy if copy is not None else self.copy
        # X = self._validate_data(X, accept_sparse="csr", reset=False)
        train_sents = [[word for word in sentence] for sentence in X]
       
        if label:
            X = [sent2labels(s) for s in train_sents]
        else:
            X = [sent2features(s) for s in train_sents]
        return X

    def _more_tags(self):
        return {"stateless": True}


In [202]:
# Build features from dataset 

train_sents = [[word for word in sentence] for sentence in extracted_train_data]
dev_sents = [[word for word in sentence] for sentence in extracted_dev_data]
test_sents = [[word for word in sentence] for sentence in extracted_test_data]

"""
print(len(extracted_train_data), len(extracted_test_data))

Xtrain = [sent2features(s) for s in train_sents]
ytrain = [sent2labels(s) for s in train_sents]

Xdev = [sent2features(s) for s in dev_sents]
ydev = [sent2labels(s) for s in dev_sents]

Xtest = [sent2features(s) for s in test_sents]
ytest = [sent2labels(s) for s in test_sents]

print(f"Train X length: {len(Xtrain)} | {len(ytrain)}")
print(f"Dev X length: {len(Xdev)} | {len(ydev)}")
print(f"Test X length: {len(Xtest)} | {len(ytest)}")

Xtrain[2]

"""

# The transformer defined above is named SangkakPosFeaturisation
posfeat = SangkakPosFeaturisation()
posfeat.fit([])

Xtrain = posfeat.transform(extracted_train_data)
Xdev = posfeat.transform(extracted_dev_data)
Xtest = posfeat.transform(extracted_test_data)

ytrain = posfeat.transform(extracted_train_data, label=True)
ydev = posfeat.transform(extracted_dev_data, label=True)
ytest = posfeat.transform(extracted_test_data, label=True)

In [200]:
Xtrain[2], ytrain[2]


([{'word': 'Mə́kuʼ',
   'word.tones': '',
   'word.normalized': 'Mə́kuʼ',
   'word.position': 0,
   'word.has_hyphen': 0,
   'word.lower()': 'mə́kuʼ',
   'word.start_with_capital': -1,
   'word.have_tone': 1,
   'word.prefix': 'Mə',
   'word.root': 'kuʼ',
   'word.ispunctuation': 0,
   'word.isdigit()': 0,
   'word.EOS': 0,
   'word.BOS': 1,
   '-1:word': '',
   '-1:word.position': -1,
   '-1:word.tag': '',
   '-1:word.normalized': '',
   '-1:word.start_with_capital': -1,
   '-1:len(word-1)': -1,
   '-1:word.lower()': '',
   '-1:word.isdigit()': -1,
   '-1:word.ispunctuation': 0,
   '-1:word.BOS': 0,
   '-1:word.EOS': 0,
   '+1:word': 'dʉmtʉm',
   '+1:word.tag': 'ADJ',
   '+1:word.position': 1,
   '+1:word.normalized': 'dʉmtʉm',
   '+1:word.start_with_capital': 0,
   '+1:len(word+1)': 6,
   '+1:word.lower()': 'dʉmtʉm',
   '+1:word.isdigit()': 0,
   '+1:word.ispunctuation': 0,
   '+1:word.BOS': 0,
   '+1:word.EOS': 0},
  {'word': 'dʉmtʉm',
   'word.tones': '',
   'word.normalized': 'dʉm

In [203]:
build_date = str(datetime.now()).split(' ')[0]

joblib.dump({
    "Xtrain": Xtrain, "ytrain": ytrain, 
    "Xdev": Xdev, "ydev": ydev,
    "Xtest": Xtest, "ytest": ytest
}, f'preprocessing/sangkak_featurised_sklearn_train_dev_test_data_{build_date}.joblib') 

['preprocessing/sangkak_featurised_sklearn_train_dev_test_data_2023-08-21.joblib']

## 3. Build data format for sagemaker app

In [None]:
!pip3 install sagemaker -U

The SageMaker format gives every observed class the same number of features, unlike a feature-creation strategy that is not uniform across classes.
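As a rough illustration of what a uniform feature count means in practice, the sketch below pads per-token feature dictionaries to a common set of columns, so that every row of the resulting table has the same width. The function name `to_fixed_width_rows`, the `fill` default, and the toy tokens are assumptions. The notebook's own features already share one key set, so no padding is needed there.

```python
def to_fixed_width_rows(token_features, labels, fill=""):
    """Pad feature dicts to the union of their keys and prepend the label."""
    # Union of feature names, kept in first-seen order
    columns = []
    for feats in token_features:
        for name in feats:
            if name not in columns:
                columns.append(name)

    rows = [
        [label] + [feats.get(name, fill) for name in columns]
        for feats, label in zip(token_features, labels)
    ]
    return ["labels"] + columns, rows


# Hypothetical tokens with non-uniform feature sets
features = [
    {"word": "kə", "word.BOS": 1},
    {"word": "fǎʼ", "-1:word": "kə"},
]
header, rows = to_fixed_width_rows(features, ["AUX", "VERB"])
print(header)
for row in rows:
    print(row)
```

Each row now has one value per column, with the label first, which is the layout the tabular classification format below relies on.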

In [147]:
def build_sagemaker_classification_format(Xt, yt, label="train"):
    print(f"[{label}] Building sagemaker data for classification")
    # First column is the label, followed by the feature names of one token
    columns = ['labels'] + list(Xt[0][0].keys())
    # Collect all rows first: appending with df.loc[i] inside a loop is very slow
    rows = [
        [tag] + list(feats.values())
        for sent_feats, sent_tags in zip(Xt, yt)
        for feats, tag in zip(sent_feats, sent_tags)
    ]
    return pd.DataFrame(rows, columns=columns)


In [148]:
df_sg_train = build_sagemaker_classification_format(
    Xtrain, ytrain
)
df_sg_dev = build_sagemaker_classification_format(
    Xdev, ydev, "dev"
)
df_sg_test = build_sagemaker_classification_format(
    Xtest, ytest, "test"
)

[train] Building sagemaker data for classification
[dev] Building sagemaker data for classification
[test] Building sagemaker data for classification


In [149]:
df_sg_train

Unnamed: 0,labels,word,word.tones,word.normalized,word.position,word.has_hyphen,word.lower(),word.start_with_capital,word.have_tone,word.prefix,...,+1:word.tag,+1:word.position,+1:word.normalized,+1:word.start_with_capital,+1:len(word+1),+1:word.lower(),+1:word.isdigit(),+1:word.ispunctuation,+1:word.BOS,+1:word.EOS
0,NOUN,Mwɔ̌ʼ,̌,Mwɔ̌ʼ,0,0,mwɔ̌ʼ,-1,1,Mw,...,VERB,1,pfʉ́tə́,0,7,pfʉ́tə́,0,0,0,0
1,VERB,pfʉ́tə́,,pfʉ́tə́,1,0,pfʉ́tə́,0,1,pf,...,ADP,2,nə́,0,3,nə́,0,0,0,0
2,ADP,nə́,,nə́,2,0,nə́,0,1,nə,...,NOUN,3,mwâsi,0,6,mwâsi,0,0,0,0
3,NOUN,mwâsi,̂,mwâsi,3,0,mwâsi,0,1,mw,...,DET,4,máp,0,4,máp,0,0,0,0
4,DET,máp,,máp,4,0,máp,0,1,ma,...,DET,5,yə́,0,3,yə́,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12342,AUX,kə,,kə,6,0,kə,0,0,,...,VERB,7,fǎʼ,0,4,fǎʼ,0,0,0,0
12343,VERB,fǎʼ,̌,fǎʼ,7,0,fǎʼ,0,1,fa,...,ADP,8,nə́,0,3,nə́,0,0,0,0
12344,ADP,nə́,,nə́,8,0,nə́,0,1,nə,...,PRON,9,é,0,2,é,0,0,0,0
12345,PRON,é,,é,9,0,é,0,1,,...,PUNCT,10,.,0,1,.,0,1,0,1


In [150]:
categorical_features = [
   'word',
   'word.tones',
   'word.normalized',
   'word.lower()',
   'word.prefix',
   'word.root',
   '-1:word',
   '-1:word.tag',
   '-1:word.normalized',
   '-1:word.lower()',
   '+1:word',
   '+1:word.lower()',
   '+1:word.tag',
   '+1:word.normalized'
]

numeric_features = [
   'word.position',
   '-1:word.position',
   '-1:word.start_with_capital',
   '-1:len(word-1)',
   '+1:word.position',
   'word.start_with_capital',
   'word.has_hyphen',
   '+1:word.start_with_capital',
   '+1:len(word+1)',
   '+1:word.isdigit()',
   '+1:word.ispunctuation',
   '+1:word.BOS',
   '+1:word.EOS',
   '-1:word.isdigit()',
   '-1:word.ispunctuation',
   '-1:word.BOS',
   '-1:word.EOS',
   'word.have_tone',
   'word.ispunctuation',
   'word.isdigit()',
   'word.EOS',
   'word.BOS'
]

df_sg_train[categorical_features] = df_sg_train[categorical_features].astype("category")
assert len(list(df_sg_train.select_dtypes(include="category").columns)) == len(categorical_features)

df_sg_train[numeric_features] = df_sg_train[numeric_features].astype("int32")
assert len(list(df_sg_train.select_dtypes(include="number").columns)) == len(numeric_features)

In [None]:

df_sg_train.to_csv(
    f'preprocessing/sangkak_featurised_sagemaker_train_data_{build_date}.csv',
    encoding='utf-8'
)
df_sg_dev.to_csv(
    f'preprocessing/sangkak_featurised_sagemaker_dev_data_{build_date}.csv',
    encoding='utf-8'
)

joblib.dump({
    "train": df_sg_train, "dev": df_sg_dev
}, f'preprocessing/sangkak_featurised_sagemaker_train_dev_data_{build_date}.joblib') 

['preprocessing/sangkak_featurised_sagemaker_train_dev_data_2023-08-19.joblib']

### 4- One hot encoding for all features of data


In [155]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.compose import make_column_selector, make_column_transformer

one_hot_encoder = make_column_transformer(
    (
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
)

ordinal_encoder = make_column_transformer(
    (
        StandardScaler(),
        make_column_selector(dtype_include="number"),
    ),
    remainder="passthrough",
    # Use short feature names to make it easier to specify the categorical
    # variables in the HistGradientBoostingRegressor in the next step
    # of the pipeline.
    verbose_feature_names_out=False,
)

In [152]:
from sklearn.decomposition import IncrementalPCA
import numpy as np

n_batches = 100

inc_pca = IncrementalPCA(n_components=100)

train_pca = one_hot_encoded_train[one_hot_encoded_train.columns[1:]]
for X_batch in np.array_split(train_pca, n_batches):
    inc_pca.partial_fit(X_batch)
        
X_train_reduced = inc_pca.transform(train_pca)
X_train_reduced

KeyboardInterrupt: 

In [131]:
df_sg_train.dtypes

labels                          object
word                          category
word.tones                    category
word.normalized               category
word.position                    int32
word.has_hyphen                  int32
word.lower()                  category
word.start_with_capital          int32
word.have_tone                   int32
word.prefix                   category
word.root                     category
word.ispunctuation               int32
word.isdigit()                   int32
word.EOS                         int32
word.BOS                         int32
-1:word                       category
-1:word.position                 int32
-1:word.tag                   category
-1:word.normalized            category
-1:word.start_with_capital       int32
-1:len(word-1)                   int32
-1:word.lower()               category
-1:word.isdigit()                int32
-1:word.ispunctuation            int32
-1:word.BOS                      int32
-1:word.EOS              

In [153]:
from sklearn.decomposition import IncrementalPCA
from sklearn.decomposition import PCA

import numpy as np
from sklearn.pipeline import make_pipeline

from sklearn.tree import DecisionTreeClassifier


#n_batches = 100

#inc_pca = IncrementalPCA(n_components=100)

train_pca = df_sg_train[categorical_features+numeric_features]
#for X_batch in np.array_split(train_pca, n_batches):
#    inc_pca.partial_fit(X_batch)
        
pcr = make_pipeline(one_hot_encoder, StandardScaler(), IncrementalPCA(n_components=8))
pcr.fit(train_pca, df_sg_train['labels'])


TypeError: 'Bunch' object is not callable

In [182]:
import matplotlib.pyplot as plt

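rng = np.random.RandomState(0)
n_samples = len(train_pca)

# NOTE: inc_pca was fit on the one-hot encoded matrix, so train_pca must be that
# encoded matrix (one column per PCA input dimension) for these dot products to work.
y = train_pca.dot(inc_pca.components_[1]) + rng.normal(size=n_samples) / 2

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# components_[0] is the first principal component
axes[0].scatter(train_pca.dot(inc_pca.components_[0]), y, alpha=0.3)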
axes[0].set(xlabel="Projected data onto first PCA component", ylabel="y")

axes[1].scatter(train_pca.dot(inc_pca.components_[1]), y, alpha=0.3)
axes[1].set(xlabel="Projected data onto second PCA component", ylabel="y")
plt.tight_layout()
plt.show()

ValueError: Dot product shape mismatch, (12347, 36) vs (25924,)

# 3. Modelling with CRF algorithm

In [None]:
# import dependencies

import pycrfsuite
#import sklearn_crfsuite
import math, string, re
import scipy.stats  # the submodule must be imported explicitly for scipy.stats.expon
import joblib
from sklearn.metrics import make_scorer
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV

### 3.1. Initialisation of pycrfsuite with training data

In [None]:
# trainer initialisation of pycrfsuite
trainer = pycrfsuite.Trainer(verbose=True)

for xseq, yseq in zip(Xtrain, ytrain):
    trainer.append(xseq, yseq)

In [None]:
project = "sangkak-02-2023-aug"
build_date = str(datetime.now()).replace(' ','_')
model_name = Path(f"models/with_aug/crf_pycrfsuite_{project}_{build_date}.model")
model_file = str(model_name)
file_crf = Path(f"models/with_aug/crf_pycrfsuite_{build_date}.object")

print(trainer.params())

params = {
    #"algorithm": 'lbfgs',
    "c1": 0.0920512484757745,
    "c2": 0.0328771171605105, 
    "max_iterations":100,
    #"verbose":True,
    "num_memories":10000,
    "epsilon": 1e-3,
    "linesearch": "MoreThuente",
    "max_linesearch":100000,
    "delta":1e-4,
    #"n_job":-1,
    #"c": 2,
    #"pa_type": 2,
    "feature.possible_transitions":True,
    "feature.possible_states":True, 
    #"model_filename": model_file
}

trainer.set_params(params)

### 3.2. Training and saving pycrfsuite model with training data

In [None]:
trainer.train(model_file)

joblib.dump({"crf": trainer, "params": params}, file_crf) 

In [None]:
trainer.logparser.last_iteration

# 4. Grid search: find best parameters for our models

In [None]:
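import sklearn_crfsuite
from sklearn_crfsuite import metrics


params = {
    "algorithm": 'lbfgs',
    "max_iterations":100,
    "verbose": False,
    #"job":-1,
    "all_possible_states":True,
    "all_possible_transitions":True, 
    "model_filename":model_file
}

# RandomizedSearchCV needs a scikit-learn compatible estimator, which
# pycrfsuite.Trainer is not; sklearn_crfsuite.CRF accepts the params above
crf_grill = sklearn_crfsuite.CRF(**params)

# Collect the label set from the training data, ignoring 'O' if it is present
labels = sorted({tag for sent in ytrain for tag in sent})
labels = [tag for tag in labels if tag != 'O']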

params_space = {
    'c1': scipy.stats.expon(scale=0.1),
    'c2': scipy.stats.expon(scale=0.05)
}

# use the same metric for evaluation
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf_grill, 
                        params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=5,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(Xtrain, ytrain)

In [None]:
# crf = rs.best_estimator_
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

In [None]:
#print(rs.cv_results_)
_x = [s['c1'] for s in rs.cv_results_['params']]
_y = [s['c2'] for s in rs.cv_results_['params']]
# Colour the points by mean CV score, not by scoring time
_c = [s for s in rs.cv_results_['mean_test_score']]

fig = plt.figure()
fig.set_size_inches(12, 12)
ax = plt.gca()
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('C1')
ax.set_ylabel('C2')
ax.set_title("Randomized Hyperparameter Search CV Results (min={:0.3}, max={:0.3})".format(
    min(_c), max(_c)
))

ax.scatter(_x, _y, c=_c, s=60, alpha=0.9, edgecolors=[0,0,0])

print("Dark blue => {:0.4}, dark red => {:0.4}".format(min(_c), max(_c)))

In [None]:
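crf = rs.best_estimator_
y_pred = crf.predict(Xtest)
# sorted_labels was never defined; report the evaluated labels in sorted order
sorted_labels = sorted(labels)
print(metrics.flat_classification_report(
    ytest, y_pred, labels=sorted_labels, digits=3
))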

In [None]:
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))

print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

In [None]:
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))

print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])