<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#All-imports-necessary" data-toc-modified-id="All-imports-necessary-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>All imports necessary</a></span></li><li><span><a href="#Auxiliary-methods" data-toc-modified-id="Auxiliary-methods-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Auxiliary methods</a></span></li><li><span><a href="#Read-the-data" data-toc-modified-id="Read-the-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Read the data</a></span></li><li><span><a href="#Naive-tag-frequency-memorization-(NTFM)" data-toc-modified-id="Naive-tag-frequency-memorization-(NTFM)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Naive tag frequency memorization (NTFM)</a></span></li><li><span><a href="#Hidden-Markov-model-(HMM)" data-toc-modified-id="Hidden-Markov-model-(HMM)-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Hidden Markov model (HMM)</a></span></li><li><span><a href="#Conditional-random-field-(CRF)" data-toc-modified-id="Conditional-random-field-(CRF)-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conditional random field (CRF)</a></span></li><li><span><a href="#Bidirectional-Long-Short-Term-Memory-neural-network-(Bi-LSTM)" data-toc-modified-id="Bidirectional-Long-Short-Term-Memory-neural-network-(Bi-LSTM)-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Bidirectional Long-Short Term Memory neural network (Bi-LSTM)</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# All imports necessary

In [1]:
import os
import sys
import warnings
import numpy as np
import pandas as pd

In [2]:
sys.path.append('..')
warnings.filterwarnings("ignore")

In [3]:
from source.code.utils.utils import filter_by_subcorpus
from source.code.utils.utils import get_tagged_texts_as_pd

In [4]:
from source.code.models.memorytagger import MemoryTagger
from source.code.models.bilstmtagger import BiLSTMTagger
from source.code.models.crftagger import CRFTagger
from source.code.models.hmmtagger import HMMTagger

Using TensorFlow backend.


In [5]:
from source.code.utils.preprocessing import iob3bio
from source.code.utils.preprocessing import filtrations
from source.code.utils.preprocessing import additional_features

In [6]:
from source.code.transformers.sentenceextractor import SentenceExtractor
from source.code.transformers.crftransformer import CRFTransformer

In [7]:
from seqeval.metrics import classification_report as seqeval_classification_report

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
from sklearn.pipeline import Pipeline

In [10]:
import seaborn as sns

In [11]:
import matplotlib.pyplot as plt

In [12]:
%matplotlib inline

# Auxiliary methods

In [13]:
def hmm_fit_step(X_train, X_test, y_train, from_, to_):
    # Inclusive column range [from_, to_] in the global `features` list.
    start, stop = features.index(from_), features.index(to_) + 1

    X_train_f = [sentence[:, start:stop] for sentence in X_train]
    X_test_f = [sentence[:, start:stop] for sentence in X_test]

    # Fit the HMM on the selected feature columns only.
    hmm_tagger = HMMTagger(features=features[start:stop])
    hmm_tagger.fit(X_train_f, y_train)

    return hmm_tagger.predict(X_test_f)
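
The `from_` and `to_` arguments select a contiguous, inclusive block of feature columns by name. A minimal sketch of that slicing on a toy sentence matrix, assuming each sentence is a 2-D NumPy array whose columns follow the order of the feature list (the shortened feature list, the helper `select_columns` and the toy values are made up for illustration):

```python
import numpy as np

# Hypothetical, shortened feature list; the notebook defines the full one later.
toy_features = ['token', 'lemma', 'pos_tag']

# One toy sentence: rows are tokens, columns follow toy_features.
sentence = np.array([
    ['Dogs', 'dog', 'NNS'],
    ['bark', 'bark', 'VBP'],
], dtype=object)


def select_columns(sentence, feature_names, from_, to_):
    """Return the inclusive column range [from_, to_] of one sentence."""
    start = feature_names.index(from_)
    stop = feature_names.index(to_) + 1  # +1 makes the upper bound inclusive
    return sentence[:, start:stop]


lemma_only = select_columns(sentence, toy_features, 'lemma', 'lemma')
lemma_and_pos = select_columns(sentence, toy_features, 'lemma', 'pos_tag')
print(lemma_only.shape, lemma_and_pos.shape)
```

Passing the same name twice, as in `('lemma', 'lemma')`, keeps a single column but still returns a 2-D array, so the downstream code sees the same shape convention either way.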

In [14]:
def crf_fit_step(X_train, X_test, y_train, from_, to_):
    # Inclusive column range [from_, to_] in the global `features` list.
    start, stop = features.index(from_), features.index(to_) + 1

    X_train_f = [sentence[:, start:stop] for sentence in X_train]
    X_test_f = [sentence[:, start:stop] for sentence in X_test]

    # The transformer prepares the selected columns for the CRF tagger.
    pipeline = Pipeline([
        ('transform', CRFTransformer(features=features[start:stop])),
        ('fit', CRFTagger())
    ])

    pipeline.fit(X_train_f, y_train)

    return pipeline.predict(X_test_f)

In [15]:
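def fit_and_validation_step(X_train, X_test, y_train, y_test, from_, to_, fit_step):
    # `fit_step` is hmm_fit_step or crf_fit_step: it trains on the train split
    # and returns predictions for the test split.
    y_pred = fit_step(X_train, X_test, y_train, from_, to_)

    # Entity-level precision, recall and F1 per entity type.
    return seqeval_classification_report(y_true=y_test, y_pred=y_pred)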
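
`seqeval` scores whole entities rather than single tokens: a predicted entity counts as correct only if both its type and its exact span match the gold annotation. A minimal pure-Python sketch of that idea, assuming BIO tags of the form `B-geo` / `I-geo` / `O` (the helper names `extract_entities` and `entity_prf` are hypothetical, and this is not seqeval's own implementation):

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO-tagged sentence."""
    entities = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes the last span
        prefix, _, label = tag.partition('-')
        # A span continues only on an I- tag of the same type.
        continues = prefix == 'I' and label == current
        if current is not None and not continues:
            entities.add((current, start, i - 1))
            current = None
        if prefix == 'B' or (prefix == 'I' and current is None):
            start, current = i, label
    return entities


def entity_prf(y_true, y_pred):
    """Micro-averaged entity-level precision, recall and F1 over sentences."""
    n_correct = n_pred = n_true = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_entities = extract_entities(true_tags)
        pred_entities = extract_entities(pred_tags)
        n_correct += len(true_entities & pred_entities)
        n_pred += len(pred_entities)
        n_true += len(true_entities)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


y_true_toy = [['B-geo', 'I-geo', 'O', 'B-per']]
y_pred_toy = [['B-geo', 'O', 'O', 'B-per']]
precision, recall, f1 = entity_prf(y_true_toy, y_pred_toy)
print(precision, recall, f1)
```

In the toy example the `geo` entity is predicted with the wrong span, so it earns no credit even though its first token is tagged correctly; only the `per` entity counts as a match.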

# Read the data

In [16]:
target_subcorpus_folders = filter_by_subcorpus('../data/datasets/gmb-2.2.0/', 'subcorpus: Voice of America')

In [17]:
data = get_tagged_texts_as_pd(target_subcorpus_folders, '../data/datasets/gmb-2.2.0/')

In [18]:
data = filtrations(data, with_dots=True)

In [19]:
data.ner_tag = iob3bio(data.ner_tag.values)
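
The body of `iob3bio` is not shown here. One common meaning of an "IOB to BIO" conversion is rewriting IOB1 tags, where `B-` appears only between adjacent entities of the same type, into IOB2 (BIO) tags, where every entity starts with `B-`. A sketch under that assumption, for a single sentence (the function name `iob1_to_bio` is hypothetical, and the project's function may do something different):

```python
def iob1_to_bio(tags):
    """Convert one sentence of IOB1 tags to IOB2 (BIO) tags."""
    converted = []
    previous_label = None  # entity type of the previous token, None for 'O'
    for tag in tags:
        if tag == 'O':
            converted.append('O')
            previous_label = None
            continue
        prefix, _, label = tag.partition('-')
        # An I- tag opens a new entity when the previous token was outside
        # any entity or belonged to an entity of another type.
        if prefix == 'I' and label != previous_label:
            prefix = 'B'
        converted.append(prefix + '-' + label)
        previous_label = label
    return converted


example = ['I-geo', 'I-geo', 'O', 'I-per', 'B-per', 'I-org']
converted = iob1_to_bio(example)
print(converted)
```

With BIO tags every entity has an explicit start marker, which is the form that entity-level scorers such as `seqeval` expect.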

In [20]:
data = additional_features(df=data)
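
The column names used later in this notebook (`is_title`, `contains_digits`, `word_len`, `suffix`, `prefix`, plus their `prev_` and `next_` variants) suggest simple surface features of each token and its neighbours. A pure-Python sketch of how such features could be derived, assuming an affix length of 3 and `None` for missing neighbours at sentence boundaries (both are assumptions; the helpers `token_features` and `sentence_features` are hypothetical and not the project's `additional_features`):

```python
def token_features(token, affix_len=3):
    """Surface features for a single token (affix_len is an assumed value)."""
    return {
        'is_title': token.istitle(),
        'contains_digits': any(ch.isdigit() for ch in token),
        'word_len': len(token),
        'suffix': token[-affix_len:],
        'prefix': token[:affix_len],
    }


def sentence_features(tokens, affix_len=3):
    """Per-token features plus the same features of the previous/next token."""
    base = [token_features(token, affix_len) for token in tokens]
    rows = []
    for i, feats in enumerate(base):
        row = dict(feats)
        prev_feats = base[i - 1] if i > 0 else {}
        next_feats = base[i + 1] if i + 1 < len(base) else {}
        for name in feats:
            # Missing neighbours at sentence boundaries are encoded as None.
            row['prev_' + name] = prev_feats.get(name)
            row['next_' + name] = next_feats.get(name)
        rows.append(row)
    return rows


rows = sentence_features(['Obama', 'visited', 'Paris', 'in', '2009'])
print(rows[1])
```

The `pos_tag` context columns are left out of the sketch because they need the corpus annotations rather than the token string alone.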

In [21]:
# Feature columns; their order matters because the fit steps slice by position:
features = [
    'token',
    'lemma',
    'pos_tag',
    'is_title',
    'contains_digits',
    'word_len',
    'suffix',
    'prefix',
    'prev_pos_tag',
    'prev_is_title',
    'prev_contains_digits',
    'prev_word_len',
    'prev_suffix',
    'prev_prefix',
    'next_pos_tag',
    'next_is_title',
    'next_contains_digits',
    'next_word_len',
    'next_suffix',
    'next_prefix'
]

In [22]:
X, y = SentenceExtractor(features=features, target='ner_tag').fit_transform(data)

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [24]:
X_train = [sentence for sentence in X_train if len(sentence) > 0]

In [25]:
y_train = [sentence.tolist() for sentence in y_train if len(sentence) > 0]

In [26]:
X_test = [sentence for sentence in X_test if len(sentence) > 0]

In [27]:
y_test = [sentence.tolist() for sentence in y_test if len(sentence) > 0]
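
The four cells above filter `X` and `y` independently, which stays aligned only as long as a sentence and its tag sequence are empty together. A sketch of filtering each pair in one pass, which keeps the two lists aligned by construction (the helper name `drop_empty_pairs` is hypothetical):

```python
def drop_empty_pairs(X, y):
    """Drop sentences with no tokens together with their tag sequences."""
    kept = [(sent, list(tags)) for sent, tags in zip(X, y) if len(sent) > 0]
    X_kept = [sent for sent, _ in kept]
    y_kept = [tags for _, tags in kept]
    return X_kept, y_kept


X_toy = [['a', 'b'], [], ['c']]
y_toy = [['O', 'O'], [], ['B-geo']]
X_clean, y_clean = drop_empty_pairs(X_toy, y_toy)
print(len(X_clean), len(y_clean))
```

With the NumPy arrays used in this notebook, `tags.tolist()` would take the place of `list(tags)` so that the tags end up as plain Python strings.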

# Naive tag frequency memorization (NTFM)

In [33]:
tagger = MemoryTagger()

In [34]:
tagger.fit(X_train, y_train)

MemoryTagger()

In [35]:
y_pred_memory_tagger = tagger.predict(X_test)

In [36]:
print(seqeval_classification_report(y_pred=y_pred_memory_tagger, y_true=y_test))

             precision    recall  f1-score   support

        org       0.00      0.00      0.00      7630
        per       0.00      0.00      0.00     11082
        geo       0.21      0.81      0.33     13941
        tim       0.00      0.00      0.00      8677
        gpe       0.00      0.00      0.00      6329
        nat       0.00      0.00      0.00        78
        art       0.00      0.00      0.00       146
        eve       0.00      0.00      0.00       117

avg / total       0.06      0.23      0.10     48000
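
The idea behind a tag-frequency memorization baseline is to remember, for every token seen in training, the tag it carried most often, and to fall back to a default tag for unseen tokens. A minimal sketch of that idea, assuming each sentence is a plain list of token strings (the class `ToyMemoryTagger` is hypothetical and is not the project's `MemoryTagger`, whose source is not shown here):

```python
from collections import Counter, defaultdict


class ToyMemoryTagger:
    """Predict, for each token, the tag it had most often in training."""

    def __init__(self, default_tag='O'):
        self.default_tag = default_tag
        self.most_frequent_tag = {}

    def fit(self, X, y):
        counts = defaultdict(Counter)
        for tokens, tags in zip(X, y):
            for token, tag in zip(tokens, tags):
                counts[token][tag] += 1
        # Keep only the single most frequent tag per token.
        self.most_frequent_tag = {
            token: tag_counts.most_common(1)[0][0]
            for token, tag_counts in counts.items()
        }
        return self

    def predict(self, X):
        return [
            [self.most_frequent_tag.get(token, self.default_tag) for token in tokens]
            for tokens in X
        ]


X_toy = [['Paris', 'is', 'nice'], ['Paris', 'Hilton', 'sings'], ['in', 'Paris']]
y_toy = [['B-geo', 'O', 'O'], ['B-per', 'I-per', 'O'], ['O', 'B-geo']]
toy_tagger = ToyMemoryTagger().fit(X_toy, y_toy)
toy_pred = toy_tagger.predict([['Paris', 'rocks']])
print(toy_pred)
```

Such a baseline ignores context entirely, so an ambiguous token like `Paris` always receives the same tag regardless of the sentence it appears in.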



# Hidden Markov model (HMM)

[This article](https://pdfs.semanticscholar.org/9528/4b31f27b9b8901fdc18554603610ebbc2752.pdf) gives a full description of which HMM parameters have to be estimated.

The Viterbi algorithm implementation was taken from [this article](https://www.digitalvidya.com/blog/inroduction-to-hidden-markov-models-using-python/).
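
For a first-order HMM tagger, the parameters to estimate are the initial tag probabilities, the tag-to-tag transition probabilities and the per-tag emission probabilities; Viterbi decoding then finds the most probable tag sequence. A compact pure-Python sketch with add-one smoothing, working in log space and using a single observation per token (the function names, the smoothing choice and the toy data are assumptions for illustration, not the code of the project's `HMMTagger`):

```python
import math
from collections import Counter, defaultdict


def fit_hmm(X, y):
    """Count-based initial, transition and emission statistics."""
    initial = Counter()
    transition = defaultdict(Counter)
    emission = defaultdict(Counter)
    for observations, tags in zip(X, y):
        initial[tags[0]] += 1
        for prev_tag, tag in zip(tags, tags[1:]):
            transition[prev_tag][tag] += 1
        for obs, tag in zip(observations, tags):
            emission[tag][obs] += 1
    states = sorted(emission)
    vocabulary = {obs for counts in emission.values() for obs in counts}
    return states, vocabulary, initial, transition, emission


def log_prob(counts, key, n_outcomes):
    """Add-one smoothed log probability of `key` under a Counter."""
    return math.log((counts[key] + 1) / (sum(counts.values()) + n_outcomes))


def viterbi(observations, states, vocabulary, initial, transition, emission):
    """Most probable state sequence for one observation sequence."""
    n_states = len(states)
    n_obs = len(vocabulary) + 1  # one extra slot for unseen observations
    # best[t][s]: log probability of the best path ending in state s at step t.
    best = [{
        s: log_prob(initial, s, n_states) + log_prob(emission[s], observations[0], n_obs)
        for s in states
    }]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(
                states,
                key=lambda p: best[-1][p] + log_prob(transition[p], s, n_states),
            )
            scores[s] = (best[-1][prev]
                         + log_prob(transition[prev], s, n_states)
                         + log_prob(emission[s], obs, n_obs))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


X_toy = [['paris', 'is', 'nice'], ['john', 'is', 'nice'], ['john', 'visits', 'paris']]
y_toy = [['B-geo', 'O', 'O'], ['B-per', 'O', 'O'], ['B-per', 'O', 'B-geo']]
hmm_params = fit_hmm(X_toy, y_toy)
decoded = viterbi(['john', 'is', 'nice'], *hmm_params)
print(decoded)
```

Working with log probabilities avoids numerical underflow on long sentences, and the smoothing keeps unseen observations or transitions from forcing a path probability to zero.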

In [37]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'lemma', 'lemma', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.40      0.39      0.39      7630
        per       0.65      0.71      0.68     11082
        geo       0.56      0.62      0.59     13941
        tim       0.91      0.74      0.82      8677
        gpe       0.54      0.67      0.60      6329
        nat       0.38      0.13      0.19        78
        art       0.42      0.12      0.18       146
        eve       0.00      0.00      0.00       117

avg / total       0.61      0.63      0.62     48000



In [38]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'is_title', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.81      0.04      0.08      7630
        per       0.00      0.00      0.00     11082
        geo       0.22      0.80      0.34     13941
        tim       0.01      0.00      0.00      8677
        gpe       0.60      0.83      0.70      6329
        nat       0.00      0.00      0.00        78
        art       0.00      0.00      0.00       146
        eve       0.00      0.00      0.00       117

avg / total       0.27      0.35      0.20     48000



In [39]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'contains_digits', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.81      0.04      0.08      7630
        per       0.35      0.11      0.17     11082
        geo       0.21      0.70      0.32     13941
        tim       0.01      0.00      0.00      8677
        gpe       0.61      0.83      0.70      6329
        nat       0.00      0.00      0.00        78
        art       0.00      0.00      0.00       146
        eve       0.00      0.00      0.00       117

avg / total       0.35      0.35      0.24     48000



In [40]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'word_len', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.24      0.06      0.10      7630
        per       0.40      0.25      0.31     11082
        geo       0.24      0.74      0.37     13941
        tim       0.10      0.02      0.03      8677
        gpe       0.61      0.82      0.70      6329
        nat       0.00      0.00      0.00        78
        art       0.00      0.00      0.00       146
        eve       0.00      0.00      0.00       117

avg / total       0.30      0.40      0.29     48000



In [41]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'suffix', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.35      0.47      0.40      7630
        per       0.52      0.64      0.58     11082
        geo       0.55      0.76      0.64     13941
        tim       0.72      0.77      0.75      8677
        gpe       0.90      0.91      0.91      6329
        nat       0.18      0.28      0.22        78
        art       0.07      0.03      0.05       146
        eve       0.19      0.22      0.20       117

avg / total       0.59      0.70      0.64     48000



In [42]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prefix', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.38      0.52      0.44      7630
        per       0.58      0.69      0.63     11082
        geo       0.60      0.75      0.67     13941
        tim       0.64      0.79      0.71      8677
        gpe       0.91      0.94      0.92      6329
        nat       0.17      0.72      0.28        78
        art       0.11      0.26      0.16       146
        eve       0.10      0.38      0.16       117

avg / total       0.60      0.73      0.66     48000



In [43]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_pos_tag', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.41      0.53      0.46      7630
        per       0.59      0.66      0.63     11082
        geo       0.67      0.79      0.73     13941
        tim       0.67      0.79      0.72      8677
        gpe       0.88      0.94      0.91      6329
        nat       0.20      0.72      0.31        78
        art       0.11      0.27      0.16       146
        eve       0.10      0.39      0.16       117

avg / total       0.63      0.74      0.68     48000



In [44]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_is_title', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.41      0.53      0.46      7630
        per       0.58      0.61      0.59     11082
        geo       0.68      0.80      0.74     13941
        tim       0.67      0.79      0.72      8677
        gpe       0.90      0.94      0.92      6329
        nat       0.19      0.72      0.30        78
        art       0.11      0.28      0.16       146
        eve       0.10      0.39      0.16       117

avg / total       0.64      0.73      0.68     48000
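
The `avg / total` row above appears to be the support-weighted mean of the per-class rows rather than a plain mean. The sketch below is not part of the original notebook; it recomputes that row from the rounded values printed in the report, under that assumption. The names `rows` and `support_weighted_average` are hypothetical and introduced only for illustration.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
rows = {
    "org": (0.41, 0.53, 0.46, 7630),
    "per": (0.58, 0.61, 0.59, 11082),
    "geo": (0.68, 0.80, 0.74, 13941),
    "tim": (0.67, 0.79, 0.72, 8677),
    "gpe": (0.90, 0.94, 0.92, 6329),
    "nat": (0.19, 0.72, 0.30, 78),
    "art": (0.11, 0.28, 0.16, 146),
    "eve": (0.10, 0.39, 0.16, 117),
}


def support_weighted_average(rows, column):
    """Average one metric column, weighting each class by its support."""
    total = sum(r[3] for r in rows.values())
    return sum(r[column] * r[3] for r in rows.values()) / total


# Assumption: 'avg / total' is support-weighted, so frequent classes dominate.
total_support = sum(r[3] for r in rows.values())
averages = tuple(round(support_weighted_average(rows, c), 2) for c in range(3))
print(averages, total_support)  # (0.64, 0.73, 0.68) 48000
```

Because the weights are the supports, the three rare classes (`nat`, `art`, `eve`, 341 entities out of 48000) barely move the averaged scores despite their low precision.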



In [45]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_contains_digits', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.40      0.53      0.46      7630
        per       0.62      0.66      0.64     11082
        geo       0.69      0.79      0.74     13941
        tim       0.67      0.79      0.73      8677
        gpe       0.87      0.94      0.90      6329
        nat       0.18      0.69      0.29        78
        art       0.10      0.27      0.15       146
        eve       0.09      0.38      0.15       117

avg / total       0.64      0.74      0.69     48000



In [46]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_word_len', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.39      0.55      0.45      7630
        per       0.62      0.65      0.64     11082
        geo       0.68      0.76      0.72     13941
        tim       0.67      0.79      0.72      8677
        gpe       0.85      0.94      0.89      6329
        nat       0.18      0.67      0.28        78
        art       0.09      0.25      0.13       146
        eve       0.09      0.38      0.15       117

avg / total       0.64      0.73      0.68     48000



In [47]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_suffix', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.38      0.55      0.45      7630
        per       0.68      0.72      0.70     11082
        geo       0.69      0.74      0.71     13941
        tim       0.65      0.78      0.71      8677
        gpe       0.83      0.93      0.88      6329
        nat       0.17      0.60      0.27        78
        art       0.10      0.27      0.14       146
        eve       0.09      0.33      0.14       117

avg / total       0.65      0.73      0.69     48000



In [48]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_prefix', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.35      0.54      0.42      7630
        per       0.67      0.72      0.70     11082
        geo       0.69      0.73      0.71     13941
        tim       0.63      0.75      0.69      8677
        gpe       0.81      0.90      0.85      6329
        nat       0.12      0.53      0.20        78
        art       0.06      0.21      0.10       146
        eve       0.08      0.31      0.13       117

avg / total       0.63      0.72      0.67     48000



In [49]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_pos_tag', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.36      0.54      0.44      7630
        per       0.66      0.72      0.69     11082
        geo       0.69      0.73      0.71     13941
        tim       0.64      0.75      0.69      8677
        gpe       0.83      0.90      0.87      6329
        nat       0.14      0.51      0.22        78
        art       0.06      0.21      0.10       146
        eve       0.08      0.29      0.13       117

avg / total       0.64      0.72      0.68     48000



In [50]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_is_title', hmm_fit_step))





HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='PREDICTIONS CALCULATION: ', max=18724), HTML(value='')))


             precision    recall  f1-score   support

        org       0.37      0.55      0.44      7630
        per       0.65      0.72      0.68     11082
        geo       0.69      0.73      0.71     13941
        tim       0.64      0.75      0.69      8677
        gpe       0.83      0.90      0.86      6329
        nat       0.14      0.50      0.22        78
        art       0.06      0.21      0.10       146
        eve       0.08      0.31      0.13       117

avg / total       0.64      0.72      0.67     48000
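
The `avg / total` row of the report above appears to be the support-weighted mean of the per-class scores, which is how older scikit-learn versions of `classification_report` compute it. The sketch below recomputes that row from the per-class numbers printed above. It is only an illustration: the helper name `weighted_average` is hypothetical and is not part of this notebook's code, and because the per-class inputs are already rounded to two decimals, the results match the printed row only up to rounding.

```python
# Per-class scores copied from the report above:
# label -> (precision, recall, f1-score, support)
report = {
    "org": (0.37, 0.55, 0.44, 7630),
    "per": (0.65, 0.72, 0.68, 11082),
    "geo": (0.69, 0.73, 0.71, 13941),
    "tim": (0.64, 0.75, 0.69, 8677),
    "gpe": (0.83, 0.90, 0.86, 6329),
    "nat": (0.14, 0.50, 0.22, 78),
    "art": (0.06, 0.21, 0.10, 146),
    "eve": (0.08, 0.31, 0.13, 117),
}


def weighted_average(scores, supports):
    """Hypothetical helper: mean of `scores` weighted by `supports`."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total


supports = [row[3] for row in report.values()]
total_support = sum(supports)

# Recompute each column of the "avg / total" row.
avg_precision = weighted_average([row[0] for row in report.values()], supports)
avg_recall = weighted_average([row[1] for row in report.values()], supports)
avg_f1 = weighted_average([row[2] for row in report.values()], supports)

print("support:", total_support)
print("precision: %.2f  recall: %.2f  f1: %.2f" % (avg_precision, avg_recall, avg_f1))
```

Because the average is weighted by support, the rare classes (`nat`, `art`, `eve`) barely affect the summary row even though their per-class scores are very low. It is therefore worth reading the per-class rows as well as the summary when comparing feature sets.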



In [51]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_contains_digits', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.37      0.55      0.44      7630
        per       0.65      0.72      0.68     11082
        geo       0.69      0.74      0.71     13941
        tim       0.65      0.75      0.70      8677
        gpe       0.85      0.90      0.87      6329
        nat       0.14      0.49      0.21        78
        art       0.06      0.20      0.09       146
        eve       0.08      0.31      0.13       117

avg / total       0.64      0.72      0.68     48000



In [52]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_word_len', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.37      0.55      0.44      7630
        per       0.62      0.71      0.66     11082
        geo       0.70      0.73      0.71     13941
        tim       0.65      0.74      0.69      8677
        gpe       0.86      0.89      0.87      6329
        nat       0.13      0.50      0.20        78
        art       0.06      0.20      0.10       146
        eve       0.09      0.32      0.14       117

avg / total       0.63      0.72      0.67     48000



In [53]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_suffix', hmm_fit_step))

             precision    recall  f1-score   support

        org       0.40      0.56      0.47      7630
        per       0.54      0.70      0.61     11082
        geo       0.72      0.71      0.72     13941
        tim       0.67      0.73      0.70      8677
        gpe       0.87      0.88      0.88      6329
        nat       0.15      0.55      0.23        78
        art       0.06      0.17      0.09       146
        eve       0.11      0.34      0.16       117

avg / total       0.63      0.71      0.67     48000



In [54]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_prefix', hmm_fit_step))





HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='FLATTEN SENTENCES: ', max=38011), HTML(value='')))




HBox(children=(IntProgress(value=0, description='STATE TO IDX: ', max=485526), HTML(value='')))




HBox(children=(IntProgress(value=0, description='PREDICTIONS CALCULATION: ', max=18724), HTML(value='')))


             precision    recall  f1-score   support

        org       0.40      0.55      0.46      7630
        per       0.45      0.67      0.54     11082
        geo       0.72      0.67      0.69     13941
        tim       0.69      0.70      0.70      8677
        gpe       0.88      0.84      0.86      6329
        nat       0.13      0.51      0.21        78
        art       0.06      0.16      0.09       146
        eve       0.09      0.25      0.13       117

avg / total       0.62      0.68      0.64     48000



# Conditional random field (CRF)

In [55]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'lemma', 'lemma', crf_fit_step))


             precision    recall  f1-score   support

        org       0.67      0.50      0.57      7630
        per       0.75      0.70      0.72     11082
        geo       0.69      0.72      0.70     13941
        tim       0.90      0.79      0.84      8677
        gpe       0.67      0.59      0.63      6329
        nat       0.36      0.23      0.28        78
        art       0.35      0.05      0.09       146
        eve       0.36      0.17      0.23       117

avg / total       0.73      0.67      0.70     48000



In [56]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'is_title', crf_fit_step))


             precision    recall  f1-score   support

        org       0.32      0.29      0.31      7630
        per       0.38      0.12      0.18     11082
        geo       0.48      0.79      0.59     13941
        tim       0.69      0.06      0.10      8677
        gpe       0.76      0.82      0.79      6329
        nat       0.00      0.00      0.00        78
        art       1.00      0.01      0.01       146
        eve       0.00      0.00      0.00       117

avg / total       0.51      0.42      0.39     48000



In [57]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'contains_digits', crf_fit_step))


             precision    recall  f1-score   support

        org       0.36      0.27      0.31      7630
        per       0.46      0.27      0.34     11082
        geo       0.47      0.79      0.59     13941
        tim       0.68      0.09      0.16      8677
        gpe       0.77      0.82      0.79      6329
        nat       0.00      0.00      0.00        78
        art       1.00      0.01      0.01       146
        eve       0.00      0.00      0.00       117

avg / total       0.53      0.46      0.43     48000



In [58]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'word_len', crf_fit_step))


             precision    recall  f1-score   support

        org       0.36      0.30      0.33      7630
        per       0.32      0.17      0.22     11082
        geo       0.48      0.77      0.59     13941
        tim       0.65      0.09      0.15      8677
        gpe       0.75      0.81      0.78      6329
        nat       0.00      0.00      0.00        78
        art       0.00      0.00      0.00       146
        eve       0.00      0.00      0.00       117

avg / total       0.49      0.43      0.40     48000



In [59]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'suffix', crf_fit_step))


             precision    recall  f1-score   support

        org       0.68      0.60      0.64      7630
        per       0.69      0.65      0.67     11082
        geo       0.79      0.85      0.82     13941
        tim       0.88      0.75      0.81      8677
        gpe       0.95      0.92      0.93      6329
        nat       0.87      0.17      0.28        78
        art       0.28      0.03      0.06       146
        eve       0.54      0.21      0.31       117

avg / total       0.78      0.75      0.76     48000



In [60]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prefix', crf_fit_step))


             precision    recall  f1-score   support

        org       0.73      0.64      0.68      7630
        per       0.73      0.72      0.72     11082
        geo       0.82      0.87      0.84     13941
        tim       0.90      0.79      0.84      8677
        gpe       0.98      0.95      0.96      6329
        nat       0.51      0.23      0.32        78
        art       0.33      0.10      0.16       146
        eve       0.58      0.24      0.34       117

avg / total       0.82      0.79      0.80     48000



In [61]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_pos_tag', crf_fit_step))


             precision    recall  f1-score   support

        org       0.73      0.64      0.68      7630
        per       0.73      0.71      0.72     11082
        geo       0.82      0.87      0.85     13941
        tim       0.91      0.81      0.85      8677
        gpe       0.98      0.95      0.96      6329
        nat       0.71      0.38      0.50        78
        art       0.46      0.11      0.18       146
        eve       0.54      0.24      0.33       117

avg / total       0.82      0.79      0.80     48000



In [62]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_is_title', crf_fit_step))


             precision    recall  f1-score   support

        org       0.73      0.64      0.68      7630
        per       0.73      0.71      0.72     11082
        geo       0.82      0.87      0.85     13941
        tim       0.91      0.81      0.86      8677
        gpe       0.98      0.95      0.96      6329
        nat       0.74      0.37      0.50        78
        art       0.45      0.10      0.17       146
        eve       0.54      0.26      0.35       117

avg / total       0.82      0.79      0.80     48000



In [63]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_contains_digits', crf_fit_step))


             precision    recall  f1-score   support

        org       0.75      0.64      0.69      7630
        per       0.78      0.77      0.78     11082
        geo       0.82      0.88      0.85     13941
        tim       0.91      0.81      0.86      8677
        gpe       0.97      0.95      0.96      6329
        nat       0.83      0.32      0.46        78
        art       0.43      0.10      0.17       146
        eve       0.49      0.23      0.31       117

avg / total       0.84      0.81      0.82     48000



In [64]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_word_len', crf_fit_step))


             precision    recall  f1-score   support

        org       0.74      0.64      0.68      7630
        per       0.79      0.78      0.78     11082
        geo       0.82      0.88      0.85     13941
        tim       0.91      0.80      0.85      8677
        gpe       0.97      0.94      0.96      6329
        nat       0.62      0.36      0.46        78
        art       0.44      0.12      0.19       146
        eve       0.52      0.23      0.32       117

avg / total       0.84      0.81      0.82     48000



In [65]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_suffix', crf_fit_step))


             precision    recall  f1-score   support

        org       0.73      0.66      0.69      7630
        per       0.81      0.82      0.81     11082
        geo       0.83      0.87      0.85     13941
        tim       0.91      0.82      0.86      8677
        gpe       0.97      0.95      0.96      6329
        nat       0.62      0.32      0.42        78
        art       0.50      0.15      0.23       146
        eve       0.52      0.27      0.36       117

avg / total       0.84      0.82      0.83     48000



In [66]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'prev_prefix', crf_fit_step))


             precision    recall  f1-score   support

        org       0.74      0.65      0.69      7630
        per       0.82      0.82      0.82     11082
        geo       0.83      0.88      0.85     13941
        tim       0.92      0.82      0.86      8677
        gpe       0.97      0.94      0.96      6329
        nat       0.69      0.31      0.42        78
        art       0.47      0.14      0.21       146
        eve       0.54      0.26      0.35       117

avg / total       0.84      0.82      0.83     48000



In [67]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_pos_tag', crf_fit_step))


             precision    recall  f1-score   support

        org       0.75      0.66      0.70      7630
        per       0.82      0.83      0.83     11082
        geo       0.83      0.88      0.85     13941
        tim       0.91      0.82      0.87      8677
        gpe       0.97      0.95      0.96      6329
        nat       0.76      0.33      0.46        78
        art       0.45      0.13      0.20       146
        eve       0.51      0.26      0.34       117

avg / total       0.85      0.83      0.83     48000



In [68]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_is_title', crf_fit_step))


             precision    recall  f1-score   support

        org       0.75      0.65      0.70      7630
        per       0.82      0.83      0.82     11082
        geo       0.83      0.88      0.85     13941
        tim       0.92      0.82      0.87      8677
        gpe       0.97      0.95      0.96      6329
        nat       0.74      0.33      0.46        78
        art       0.54      0.14      0.22       146
        eve       0.54      0.25      0.34       117

avg / total       0.85      0.83      0.83     48000



In [69]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_contains_digits', crf_fit_step))


             precision    recall  f1-score   support

        org       0.74      0.66      0.70      7630
        per       0.82      0.82      0.82     11082
        geo       0.83      0.88      0.85     13941
        tim       0.91      0.82      0.86      8677
        gpe       0.97      0.95      0.96      6329
        nat       0.71      0.35      0.47        78
        art       0.51      0.13      0.21       146
        eve       0.53      0.26      0.34       117

avg / total       0.84      0.83      0.83     48000



In [70]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_word_len', crf_fit_step))


             precision    recall  f1-score   support

        org       0.74      0.66      0.70      7630
        per       0.82      0.82      0.82     11082
        geo       0.83      0.88      0.85     13941
        tim       0.91      0.83      0.87      8677
        gpe       0.97      0.95      0.96      6329
        nat       0.64      0.35      0.45        78
        art       0.48      0.14      0.22       146
        eve       0.51      0.26      0.34       117

avg / total       0.84      0.83      0.83     48000



In [71]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_suffix', crf_fit_step))


             precision    recall  f1-score   support

        org       0.74      0.66      0.70      7630
        per       0.82      0.82      0.82     11082
        geo       0.83      0.88      0.85     13941
        tim       0.91      0.85      0.88      8677
        gpe       0.97      0.95      0.96      6329
        nat       0.55      0.28      0.37        78
        art       0.47      0.11      0.18       146
        eve       0.53      0.26      0.34       117

avg / total       0.85      0.83      0.84     48000



In [72]:
print(fit_and_validation_step(X_train, X_test, y_train, y_test, 'pos_tag', 'next_prefix', crf_fit_step))


             precision    recall  f1-score   support

        org       0.73      0.67      0.70      7630
        per       0.82      0.82      0.82     11082
        geo       0.83      0.87      0.85     13941
        tim       0.91      0.86      0.89      8677
        gpe       0.97      0.95      0.96      6329
        nat       0.55      0.28      0.37        78
        art       0.56      0.12      0.20       146
        eve       0.53      0.24      0.33       117

avg / total       0.84      0.83      0.84     48000
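
The feature names used in the cells above (`is_title`, `contains_digits`, `word_len`, `suffix`, `prefix`, and their `prev_`/`next_` variants) suggest simple per-token features with a one-token context window. The sketch below shows one way such features might be computed. It is not the notebook's own feature code: the function names `token_features` and `sentence_features` and the affix length of 3 are assumptions made for illustration.

```python
def token_features(word, affix_len=3):
    """Surface features of a single token (affix_len is an assumed value)."""
    return {
        'is_title': word.istitle(),
        'contains_digits': any(ch.isdigit() for ch in word),
        'word_len': len(word),
        'suffix': word[-affix_len:],
        'prefix': word[:affix_len],
    }


def sentence_features(words):
    """Add prev_/next_ copies of each neighbour's features to every token."""
    base = [token_features(w) for w in words]
    rows = []
    for i, feats in enumerate(base):
        row = dict(feats)
        if i > 0:  # the first token has no left neighbour
            row.update({'prev_' + k: v for k, v in base[i - 1].items()})
        if i < len(base) - 1:  # the last token has no right neighbour
            row.update({'next_' + k: v for k, v in base[i + 1].items()})
        rows.append(row)
    return rows


rows = sentence_features(['Paris', 'in', '2019'])
print(rows[1])
```

A list of such dictionaries per sentence is the input shape that CRF toolkits commonly accept, which is why the features are kept as plain key/value pairs here.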



# Bidirectional Long-Short Term Memory neural network (Bi-LSTM)

In [73]:
X_train_1_f = [sentence[:, features.index('lemma')] for sentence in X_train]

In [74]:
X_test_1_f = [sentence[:, features.index('lemma')] for sentence in X_test]

In [75]:
estimator = BiLSTMTagger(
    checkpoint_dir='../data/datasets/keras_model/',
    epochs=50,
    batch_size=200
)

In [76]:
estimator.fit(X_train_1_f, y_train)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 75)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 75, 20)            457960    
_________________________________________________________________
bidirectional_1 (Bidirection (None, 75, 100)           28400     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 75, 50)            5050      
_________________________________________________________________
crf_1 (CRF)                  (None, 75, 17)            1190      
Total params: 492,600
Trainable params: 492,600
Non-trainable params: 0
_________________________________________________________________


Epoch 00001: val_loss improved from inf to 0.13769, saving model to ../data/datasets/keras_model/model.h5
Epoch 00002: val_loss improved from 0.13769 to 0.09077, saving model to ../data/datasets/keras_model/model.h5
Epoch 00003: val_loss improved from 0.09077 to 0.06765, saving model to ../data/datasets/keras_model/model.h5
Epoch 00004: val_loss improved from 0.06765 to 0.05257, saving model to ../data/datasets/keras_model/model.h5
Epoch 00005: val_loss improved from 0.05257 to 0.04557, saving model to ../data/datasets/keras_model/model.h5
Epoch 00006: val_loss improved from 0.04557 to 0.04175, saving model to ../data/datasets/keras_model/model.h5
Epoch 00007: val_loss improved from 0.04175 to 0.03828, saving model to ../data/datasets/keras_model/model.h5
Epoch 00008: val_loss improved from 0.03828 to 0.03730, saving model to ../data/datasets/keras_model/model.h5
Epoch 00009: val_loss improved from 0.03730 to 0.03461, saving model to ../data/datasets/keras_model/model.h5
Epoch 00010: val_loss improved from 0.03461 to 0.03456, saving model to ../data/datasets/keras_model/model.h5
Epoch 00011: val_loss improved from 0.03456 to 0.03450, saving model to ../data/datasets/keras_model/model.h5
Epoch 00012: val_loss improved from 0.03450 to 0.03237, saving model to ../data/datasets/keras_model/model.h5
Epoch 00013: val_loss improved from 0.03237 to 0.03194, saving model to ../data/datasets/keras_model/model.h5
Epoch 00014: val_loss improved from 0.03194 to 0.03111, saving model to ../data/datasets/keras_model/model.h5
Epoch 00015: val_loss did not improve from 0.03111
Epoch 00016: val_loss improved from 0.03111 to 0.03082, saving model to ../data/datasets/keras_model/model.h5
Epoch 00017: val_loss did not improve from 0.03082
Epoch 00018: val_loss did not improve from 0.03082
Epoch 00019: val_loss did not improve from 0.03082
Epoch 00019: early stopping

BiLSTMTagger(batch_size=200, checkpoint_dir='../data/datasets/keras_model/',
       epochs=50, max_len=75, validation_split=0.1)
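
The parameter counts in the layer summary above can be checked by hand. The sketch below reproduces them under a few assumptions that are inferred from the printed shapes rather than stated in the notebook: a vocabulary of 457960 / 20 = 22898 entries, 50 LSTM units per direction (the bidirectional output is 100 wide), and a CRF layer that holds an emission kernel, a transition matrix, a bias vector and two boundary vectors. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dense vector of size `dim` per vocabulary entry
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units


def crf_params(input_dim, n_tags):
    # Assumed layout: emission kernel + tag-to-tag transitions
    # + bias + left and right boundary vectors
    return input_dim * n_tags + n_tags * n_tags + 3 * n_tags


vocab_size = 457960 // 20                # inferred from the summary, not given
embedding = embedding_params(vocab_size, 20)
bilstm = 2 * lstm_params(20, 50)         # forward and backward LSTM
dense = dense_params(100, 50)            # applied at every time step
crf = crf_params(50, 17)
total = embedding + bilstm + dense + crf
print(embedding, bilstm, dense, crf, total)
```

Under these assumptions the four layer counts come out as 457960, 28400, 5050 and 1190, which sum to the 492,600 trainable parameters reported in the summary. Almost all of the capacity sits in the embedding table.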

In [77]:
y_pred = estimator.predict(X_test_1_f)



In [78]:
# Trim each padded prediction back to the true length of its sentence.
y_pred = [y_p[:len(y_t)] for y_p, y_t in zip(y_pred, y_test)]
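
The trimming step is needed because the tagger works on fixed-length sequences (`max_len=75` in the estimator above), so each prediction is as long as the padded input rather than the original sentence. A minimal sketch of the round trip follows. The helper names and the use of `'O'` as the padding tag are assumptions, not the `BiLSTMTagger` internals.

```python
MAX_LEN = 75
PAD_TAG = 'O'  # assumed padding tag, for illustration only


def pad_sequence(seq, max_len=MAX_LEN, pad=PAD_TAG):
    """Truncate or right-pad a sequence to exactly max_len items."""
    clipped = list(seq[:max_len])
    return clipped + [pad] * (max_len - len(clipped))


def trim_to_reference(y_pred, y_true):
    """Cut every prediction down to the length of its reference sequence."""
    return [pred[:len(true)] for pred, true in zip(y_pred, y_true)]


y_true = [['B-geo', 'O'], ['B-per', 'I-per', 'O']]
y_padded = [pad_sequence(tags) for tags in y_true]
y_trimmed = trim_to_reference(y_padded, y_true)
print([len(tags) for tags in y_padded], [len(tags) for tags in y_trimmed])
```

Note that a sentence longer than the maximum length is truncated before prediction, so its trimmed prediction is still shorter than the reference and its lengths no longer match.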

In [79]:
print(seqeval_classification_report(y_pred=y_pred, y_true=y_test))

             precision    recall  f1-score   support

        org       0.74      0.54      0.62      7630
        per       0.76      0.82      0.79     11082
        geo       0.82      0.75      0.79     13941
        tim       0.89      0.87      0.88      8677
        gpe       0.77      0.92      0.84      6329
        nat       0.56      0.12      0.19        78
        art       0.33      0.01      0.01       146
        eve       0.85      0.09      0.17       117

avg / total       0.80      0.77      0.78     48000
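
The reports in this notebook score whole entities rather than individual tokens. As a rough illustration of that idea, and not seqeval's actual implementation, the sketch below extracts entity spans from tag sequences and computes one overall precision, recall and F1. It assumes BIO-style tags such as `B-geo`, `I-geo` and `O`. The function names `extract_entities` and `entity_prf` and the toy sequences are made up for the example.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in one tagged sentence."""
    entities = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel closes the last span
        prefix, _, label = tag.partition('-')
        # An open span ends on 'O', on a new 'B-', or when the type changes
        if start is not None and (prefix != 'I' or label != etype):
            entities.add((etype, start, i - 1))
            start, etype = None, None
        # Open a span on 'B-' (or on a stray 'I-', treated leniently)
        if prefix in ('B', 'I') and start is None:
            start, etype = i, label
    return entities


def entity_prf(y_true, y_pred):
    """Precision, recall and F1 over exactly matching entity spans."""
    true_spans, pred_spans = set(), set()
    for idx, (true, pred) in enumerate(zip(y_true, y_pred)):
        true_spans |= {(idx,) + span for span in extract_entities(true)}
        pred_spans |= {(idx,) + span for span in extract_entities(pred)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [['B-geo', 'I-geo', 'O', 'B-per']]
guess = [['B-geo', 'I-geo', 'O', 'O']]
print(entity_prf(gold, guess))
```

In the toy example the prediction finds one of the two gold entities and nothing else, so precision is 1.0 and recall is 0.5. A span counts as correct only if its type and both boundaries match, which is why entity-level scores are stricter than token accuracy.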



# Conclusion

In this work, several approaches to named entity recognition (NER) were tested:
- Naive tag frequency memorization (NTFM);
- Hidden Markov model (HMM);
- Conditional random field (CRF);
- Bidirectional Long Short-Term Memory neural network (Bi-LSTM) with a CRF layer.

All evaluations were performed on the same data split and scored with the same library, [seqeval](https://github.com/chakki-works/seqeval).

The CRF with three features showed the best quality (avg precision: 0.85, avg recall: 0.81, avg f1-score: 0.83).

Among the HMM models, the best one used five features (avg precision: 0.62, avg recall: 0.76, avg f1-score: 0.68).

The Bi-LSTM with a CRF layer, trained on lemmas only, reached avg precision: 0.80, avg recall: 0.77, avg f1-score: 0.78.

Despite its simplicity, the naive approach performed relatively well (avg precision: 0.55, avg recall: 0.59, avg f1-score: 0.57), which makes it a suitable baseline.
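
For reference, the idea behind the naive baseline fits in a few lines: remember the most frequent tag seen for each word and fall back to a default tag for unseen words. The sketch below is a hypothetical re-implementation for illustration. The function names and the `'O'` default are assumptions, and it is not the `MemoryTagger` class used in this notebook.

```python
from collections import Counter, defaultdict


def fit_memory_tagger(sentences, tag_sequences):
    """Map every training word to the tag it most often carries."""
    counts = defaultdict(Counter)
    for words, tags in zip(sentences, tag_sequences):
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def predict_memory_tagger(memory, sentences, default='O'):
    """Look each word up in the memory; unseen words get the default tag."""
    return [[memory.get(word, default) for word in words] for words in sentences]


train_words = [['London', 'is', 'big'], ['visit', 'London']]
train_tags = [['B-geo', 'O', 'O'], ['O', 'B-geo']]
memory = fit_memory_tagger(train_words, train_tags)
print(predict_memory_tagger(memory, [['London', 'rocks']]))
```

Because the lookup ignores context entirely, it cannot disambiguate words that carry different tags in different sentences, which is the gap the sequence models above are meant to close.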