In [1]:
import io
import os
from copy import deepcopy

import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imblearn_pipeline
from sklearn.base import TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline as sklearn_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

from constants import LOGISTIC_REGRESSION_MODEL, PASSIVE_AGGRESSIVE_MODEL, DECISION_TREE_MODEL, SVM_MODEL, \
    RANDOM_FOREST_MODEL, NAIVE_BAYES_MODEL

number_of_found_word_vecs = 0
number_of_not_found_word_vecs = 0


### Intro 

We are going to introduce to you a machine learning classifier with performance of 95% on hebrew corpus of 66K train instances. 

We've done profound feature engineering, feature & model selection, and we'll present to you the best we've found. 

First - let's load our data

In [2]:
def get_data():
    data_path = 'resources' + os.sep + 'dataset_biluo.csv'
    df = pd.read_csv(data_path)
    y = df['BILUO']
    if str(y.iloc[len(y)-1]) == 'nan':
        y.iloc[len(y)-1] = 'O'
    df.drop(columns=['BILUO', 'Bio'], inplace=True)
    X = df
    return X, y

X,y = get_data()

X.head(5)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Unnamed: 0,Gender,Lemma,Number,Person,Pos,Prefix,Status,Suffix,Tense,Token,TokenOrder,Word
0,unspecified,DOCSTART,unspecified,unspecified,foreign,no_pref,unspecified,no_suffix,unspecified,DOCSTART,99,DOCSTART
1,masculine,CARD1,singular,unspecified,numeral,no_pref,absolute,no_suffix,unspecified,אחד,102,אחד
2,unspecified,כול,unspecified,unspecified,quantifier,מ,construct,no_suffix,unspecified,מכל,103,כל
3,masculine,שני,singular,unspecified,noun,no_pref,absolute,no_suffix,unspecified,שני,104,שני
4,masculine,ישראלי,plural,unspecified,noun,no_pref,absolute,no_suffix,unspecified,ישראלים,105,ישראלים


We see we have quiet good information in the csv to start with,  
but we add some context features including word embeddings, gazzet features, and relate to stop_words.

Our features:  

First, few basic features we have in the CSV. 

| Pos | Person | Prefix | Suffix |
|-----| ------ | ------ | ------ |

Second, we want some 'context' features:

| Prev_Pos | Next_Pos |
|----------| -------- |

What about prev & next word/token/lemma?  
Because we have a lot of unique words and we don't want to many 'is_word_X' features, we use the power of trained word embeddings in hebrew:
https://fasttext.cc/docs/en/crawl-vectors.html

We tried taking the vectors of all of the possibilities: Lemma, Token, Word.  
And - the Token vector gives the best result.  
It makes sense because the token captures more significance. For example "לחיפה" has more information then "חיפה". 

| PrevTokenVector | TokenVector | NextTokenVector |
| --------------- | ----------- | --------------- |

Third, we have some 'Gazzet' features. Known set or Locations, Persons, etc, that we made a feature for each type. 

| In_LOC_Gazzet | In_PERS_Gazzet | In_PERCENT_Gazzet | In_MONEY_Gazzet | In_ORG_Gazzet |
| ------------- | -------------- | ----------------- | --------------- | ------------- |

Lastly, we know that stop words tend to of tag 'O', so we'll add 'is_stop_word' feature

| is_stop_word |
|--------------|

So we have 15 features, which part of then will be 'dummy features':

> dummies_cols = ['Person', 'Pos', 'Prev_Pos', 'Next_Pos', 'Suffix', 'Prefix']

In additions, the word vectors will be features as well. 

Our trained model have Vword for each word he has in it's vocabulary. 
This vector is of length - 300. 

For a sequence of prev_word, curr_word, next_word, we'll make a vector of size 900 which is the concatenation of the 3 vectors:  

$$ V_(prev-word) * V_(curr-word) * V_(next-word) $$

And what about words that doesn't have vector representation?  
Well, luckily for us, we will see that 97% of the words to have vector representation we we give the rest of the 3% zero vectors. It's negligible.

And this is our feature exctractor code:

In [3]:
class FeatureExtractor(TransformerMixin):
    def __init__(self):
        self.gazzet_sets = self.load_gazzets()
        self.stop_words = self.load_stop_words()
        print("Loading Word Embeddings... Please wait...")
        model_path = 'resources' + os.sep + 'cc.he.300.vec'
        self.trained_model = self.load_word_embeddings(model_path)
        self.VECTOR_SIZE = 300
        print("FeatureExtractor initialized!")

    def load_word_embeddings(self, fname):
        fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
        n, d = map(int, fin.readline().split())
        data = {}
        for line in fin:
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = [float(x) for x in tokens[1:]]
        return data

    def load_stop_words(self):
        f = open("resources" + os.sep + "all_heb_stop_words.txt", 'r', encoding='utf-8')
        stop_words = [w.strip("\n") for w in f.readlines()]
        f.close()
        return stop_words

    def load_gazzets(self):
        f = open("resources" + os.sep + "naama_gazzets" + os.sep + "Dictionary.txt", 'r', encoding='utf-8')
        all_gazzets = f.readlines()
        f.close()
        gazzet_sets = {'LOC': set(), 'PERS': set(), 'ORG': set(), 'MONEY': set(), 'PERCENT': set()}
        for item in all_gazzets:
            for possibility in gazzet_sets.keys():
                if possibility in item:
                    item = item.replace(possibility, "").strip("\n").strip(" ")
                    if item != "":
                        gazzet_sets[possibility].add(item)
        print("Done loading gazzets")
        return gazzet_sets

    def transform(self, data):
        X = []
        all_features = {'Gender', 'Lemma', 'Number', 'Person', 'Pos', 'Status', 'Tense', 'Token', 'TokenOrder', 'Word', 'Prev_Prev_Pos', 'Prev_Pos', 'Next_Pos', 'Next_Next_Pos', 'Prev_Prev_Number', 'Prev_Number', 'Next_Number', 'Next_Next_Number', 'Prev_Prev_Gender', 'Prev_Gender', 'Next_Gender', 'Next_Next_Gender', 'Prev_Word', 'Next_Word', 'Prev_Token', 'Next_Token', 'In_LOC_Gazzet', 'In_PERS_Gazzet', 'In_ORG_Gazzet', 'In_MONEY_Gazzet',  'In_PERCENT_Gazzet', 'Suffix'}

        wanted_features = {'Prev_Pos', 'Next_Pos', 'Suffix', 'Prefix', 'TokenVector', 'NextTokenVector',
                           'PrevTokenVector', 'Person', 'Pos', 'In_MONEY_Gazzet', 'In_ORG_Gazzet', 'In_LOC_Gazzet',
                           'In_PERS_Gazzet', 'In_PERCENT_Gazzet', 'is_stop_word'}

        for i in range(0, len(data)):
            prev_prev_row_data, prev_row_data, curr_row_data, next_row_data, next_next_row_data = \
                self.get_prev_curr_next_row_data(data, i)

            if 'TokenVector' in wanted_features:
                    self.add_word_vectors(curr_row_data)

            if 'NextTokenVector' in wanted_features and 'PrevTokenVector' in wanted_features:
                    self.add_word_vectors(prev_row_data)
                    self.add_word_vectors(next_row_data)

            self.add_context_features(curr_row_data, next_next_row_data, next_row_data, prev_prev_row_data,
                                      prev_row_data, wanted_features)
            self.add_gazzet_features(curr_row_data)

            curr_row_data['is_stop_word'] = curr_row_data['Token'] in self.stop_words

            for feat in all_features.difference(wanted_features):
                del curr_row_data[feat]

            if 'TokenVector' in wanted_features:
                if 'NextTokenVector' in wanted_features and 'PrevTokenVector' in wanted_features:
                    self.convert_vectors_to_features(prev_row_data, curr_row_data, next_row_data, include_contex=True)
                else:
                    self.convert_vectors_to_features(prev_row_data, curr_row_data, next_row_data, include_contex=False)

            X.append(curr_row_data)

        print("wanted_features")
        print(wanted_features)

        print(f"number_of_found_word_vecs: {number_of_found_word_vecs}")
        print(f"number_of_not_found_word_vecs: {number_of_not_found_word_vecs}")
        print(f"percent of words without vects: "
              f"{number_of_not_found_word_vecs / (number_of_found_word_vecs + number_of_not_found_word_vecs)}")

        df = pd.DataFrame(X)
        return df

    def add_word_vectors(self, curr_row_data):
        global number_of_found_word_vecs, number_of_not_found_word_vecs
        if curr_row_data['Token'] in self.trained_model:
            curr_row_data['TokenVector'] = self.trained_model[curr_row_data['Token']]
            number_of_found_word_vecs += 1
        else:
            curr_row_data['TokenVector'] = [float(0)] * 300
            number_of_not_found_word_vecs += 1

    def convert_vectors_to_features(self, prev_row_data, curr_row_data, next_row_data, include_contex):
            for i in range(self.VECTOR_SIZE):  # vector size
                curr_row_data['wordvec_' + str(i)] = curr_row_data['TokenVector'][i]
                if include_contex:
                    curr_row_data['next_wordvec_' + str(i)] = next_row_data['TokenVector'][i]
                    curr_row_data['prev_wordvec_' + str(i)] = prev_row_data['TokenVector'][i]

            del curr_row_data['TokenVector']
            if include_contex:
                del prev_row_data['TokenVector']
                del next_row_data['TokenVector']

    def add_gazzet_features(self, curr_row_data):
        for gazzet_key, gazzet_set in self.gazzet_sets.items():
            if curr_row_data['Word'] in gazzet_set or curr_row_data['Token'] in gazzet_set:
                curr_row_data['In_' + gazzet_key + '_Gazzet'] = True
                # print(curr_row_data['Word'], curr_row_data['Token'], " in gazzet: ", gazzet_key)
            else:
                curr_row_data['In_' + gazzet_key + '_Gazzet'] = False

    def add_context_features(self, curr_row_data, next_next_row_data, next_row_data, prev_prev_row_data, prev_row_data, wanted_features):
        curr_row_data['Prev_Prev_Pos'] = prev_prev_row_data['Pos']
        curr_row_data['Prev_Pos'] = prev_row_data['Pos']
        curr_row_data['Next_Pos'] = next_row_data['Pos']
        curr_row_data['Next_Next_Pos'] = next_next_row_data['Pos']

        curr_row_data['Prev_Prev_Number'] = prev_prev_row_data['Number']
        curr_row_data['Prev_Number'] = prev_row_data['Number']
        curr_row_data['Next_Number'] = next_row_data['Number']
        curr_row_data['Next_Next_Number'] = next_next_row_data['Number']

        curr_row_data['Prev_Prev_Gender'] = prev_prev_row_data['Gender']
        curr_row_data['Prev_Gender'] = prev_row_data['Gender']
        curr_row_data['Next_Gender'] = next_row_data['Gender']
        curr_row_data['Next_Next_Gender'] = next_next_row_data['Gender']

        curr_row_data['Prev_Word'] = prev_row_data['Word']
        curr_row_data['Next_Word'] = next_row_data['Word']

        curr_row_data['Prev_Token'] = prev_row_data['Token']
        curr_row_data['Next_Token'] = next_row_data['Token']

    def get_prev_curr_next_row_data(self, data, i):
        if i % 1000 == 0:
            print(i)
        if i <= 1:
            prev_prev_row = data.iloc[i]
            prev_row = data.iloc[i]
        else:
            prev_prev_row = data.iloc[i - 2]
            prev_row = data.iloc[i - 1]
        curr_row = data.iloc[i]
        if i >= len(data) - 2:
            next_row = data.iloc[i]
            next_next_row = data.iloc[i]
        else:
            next_row = data.iloc[i + 1]
            next_next_row = data.iloc[i + 2]
        prev_prev_row_data = dict(prev_prev_row)
        prev_row_data = dict(prev_row)
        curr_row_data = dict(curr_row)
        next_row_data = dict(next_row)
        next_next_row_data = dict(next_next_row)
        return prev_prev_row_data, prev_row_data, curr_row_data, next_row_data, next_next_row_data

    def fit(self, X, y=None):
        return self


And this is the 'dummy maker' code:

In [4]:
def make_dataset_with_dummies(X_transformed, dummies_cols):
    print(f"shape before dummies: {X_transformed.shape}")
    X_dummies = pd.get_dummies(X_transformed[dummies_cols])
    X_transformed = X_transformed.drop(columns=dummies_cols)
    X_final = pd.concat([X_transformed, X_dummies], axis=1)
    print(f"X_dummies.shape: {X_dummies.shape}, X_transformed.shape: {X_transformed.shape}, X_final.shape: {X_final.shape}")
    return X_final

Let's activate the feature exctractor on our dataset

In [5]:
dummies_cols = ['Person', 'Pos', 'Prev_Pos', 'Next_Pos', 'Suffix', 'Prefix']
feature_extractor = FeatureExtractor()
X_transformed = feature_extractor.transform(X)
X_final = make_dataset_with_dummies(X_transformed, dummies_cols)

Done loading gazzets
Loading Word Embeddings... Please wait...
FeatureExtractor initialized!
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
41000
42000
43000
44000
45000
46000
47000
48000
49000
50000
51000
52000
53000
54000
55000
56000
57000
58000
59000
60000
61000
62000
wanted_features
{'TokenVector', 'Next_Pos', 'Prefix', 'In_PERS_Gazzet', 'Prev_Pos', 'NextTokenVector', 'In_LOC_Gazzet', 'In_MONEY_Gazzet', 'In_PERCENT_Gazzet', 'PrevTokenVector', 'In_ORG_Gazzet', 'Pos', 'Suffix', 'is_stop_word', 'Person'}
number_of_found_word_vecs: 182467
number_of_not_found_word_vecs: 6266
percent of words without vects: 0.03320034122278563
shape before dummies: (62911, 912)
X_dummies.shape: (62911, 175), X_transformed.shape: (62911, 906), X_final.shape: (62911, 1081)


Now, let's split to train & test.  
The train part will be used also as development because we'll make grid search with cross validation.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=0.20, shuffle=True)

Now, we want that the frequencies of each tag will be quiet close in the general set, train, and test.  
This is a function which returns a 'compare_df',  
which can be used in order to check that the frequencies are similar.  
If the frequencies aren't good enough,  
we can execute again the last cell and we'll get another frequencies (shuffle=True 👌)

In [7]:
def get_freqs(y_data):
    y_without_o = y_data[y_data.values != ['O']]
    y_freqs = y_without_o.value_counts().apply(lambda x: x / y_without_o.value_counts().sum())
    return y_freqs

def check_frequencies_of_labels_in_data(y, y_train, y_test):
    y_freqs = get_freqs(y)
    y_train_freqs = get_freqs(y_train)
    y_test_freqs = get_freqs(y_test)
    y_train_freqs = add_missing_columns(y_freqs, y_train_freqs)
    y_test_freqs = add_missing_columns(y_freqs, y_test_freqs)
    print("We got frequencies of labels in y, y_train, y_test :-) ")
    y_freqs.sum(), y_train_freqs.sum(), y_test_freqs.sum()
    compare_df = pd.DataFrame(columns=y_freqs.keys())
    compare_df.keys = ['y', 'y_train', 'y_test']
    compare_df.loc['y'] = y_freqs
    compare_df.loc['y_train'] = y_train_freqs
    compare_df.loc['y_test'] = y_test_freqs
    return compare_df

def add_missing_columns(all_y_cols, y_data):
    diff_train = set(all_y_cols.keys()).difference(set(y_data.keys()))
    if len(diff_train) > 0:
        for col in diff_train:
            y_data[col] = 0
    return y_data

compare_df = check_frequencies_of_labels_in_data(y, y_train, y_test)
compare_df

We got frequencies of labels in y, y_train, y_test :-) 


Unnamed: 0,U-LOC,U-MISC,U-PERS,L-PERS,B-PERS,L-ORG,B-ORG,U-ORG,L-LOC,B-LOC,...,I-DATE,I-MISC,L-PERCENT,B-PERCENT,U-MONEY,L-TIME,B-TIME,I-TIME,U-TIME,I-PERCENT
y,0.099056,0.087824,0.079221,0.079101,0.079101,0.07719,0.07719,0.054367,0.041104,0.041104,...,0.01171,0.011351,0.005616,0.005616,0.00227,0.002151,0.002151,0.001434,0.001195,0.000358
y_train,0.101166,0.087866,0.079348,0.080693,0.07636,0.079647,0.07636,0.054842,0.041393,0.042289,...,0.012253,0.011506,0.005678,0.005081,0.002092,0.00269,0.002241,0.001494,0.001046,0.000299
y_test,0.090638,0.087657,0.078712,0.072749,0.090042,0.067382,0.080501,0.052475,0.039952,0.036374,...,0.009541,0.010733,0.005367,0.007752,0.002982,0.0,0.001789,0.001193,0.001789,0.000596


Let's now observe our new columns:

In [8]:
X_train.columns

Index(['In_LOC_Gazzet', 'In_MONEY_Gazzet', 'In_ORG_Gazzet',
       'In_PERCENT_Gazzet', 'In_PERS_Gazzet', 'is_stop_word', 'next_wordvec_0',
       'next_wordvec_1', 'next_wordvec_10', 'next_wordvec_100',
       ...
       'Prefix_שיש', 'Prefix_שכ', 'Prefix_שכה', 'Prefix_של', 'Prefix_שלה',
       'Prefix_שמ', 'Prefix_שמאם', 'Prefix_שמן', 'Prefix_שמש', 'Prefix_שעל'],
      dtype='object', length=1081)

We can see, as promised, only negligible amount of words don't have vector representation 👍:

number_of_found_word_vecs: 182467  
number_of_not_found_word_vecs: 6266  
percent of words without vects: 0.03320034122278563  

And let's see the Shapes of our train & test matrixes: 

In [9]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((50328, 1081), (50328,), (12583, 1081), (12583,))

Now we need to choose a model.  
We tried a lot of sklearn options:
1. LogisticRegression
2. RandomForest
3. DecisionTree
4. SVM
5. NaiveBayes - both MultinomialNB and GaussianNB
6. PassiveAggressive
7. CRF
8. Multi-Layer perceptron

Also, because our data is imbalance, we tried to handle the imbalacing problem in serveral ways, including using class_weight = 'balanced', and SMOTE.     
**SMOTE: Synthetic Minority Over-sampling**  
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html

But finally, our best results were received with SVM model 👏

Now we have the code of our ClfModel, with few basic methods:  
1. init
2. train 
3. predict 
4. evaluate  

** The full ClfModel code with all of the other models is in NER.py. 

** Here we present the code of the chosen model

In [10]:
class ClfModel:
    def __init__(self, model_type):
        self.model_type = model_type
        self.clf = self.init_normal_model(model_type)
        self.classes_without_O = ['U-PERCENT', 'L-PERS', 'U-PERS', 'L-ORG', 'L-LOC', 'I-ORG', 'I-LOC', 'B-ORG', 'L-DATE', 'I-MONEY', 'B-MISC', 'L-MISC', 'L-MONEY', 'B-LOC', 'B-PERS', 'I-PERS', 'U-DATE', 'B-DATE', 'U-LOC', 'B-MONEY', 'U-MISC', 'I-MISC', 'I-DATE', 'L-PERCENT', 'I-TIME', 'U-ORG', 'L-TIME', 'B-PERCENT', 'B-TIME', 'U-TIME', 'I-PERCENT', 'U-MONEY' ]

    def init_normal_model(self, model_type):
        classifier = SVC(kernel='linear', C=1.2)
        pipe = sklearn_pipeline([('classifier', classifier)])
        return pipe

    def train(self, X_train, y_train):
        self.clf.fit(X_train, y_train)
        
    def train_by_grid_search(self, X_train, y_train):
        parameters = self.prepare_svm_grid_params()
        grid_search = GridSearchCV(self.clf, parameters, cv=5, n_jobs=-1, verbose=1)
        grid_search.fit(X_train, y_train)
        print("Best score: %0.3f" % grid_search.best_score_)
        print("Best parameters set:")
        best_parameters = grid_search.best_estimator_.get_params()
        print(best_parameters)
        self.clf = grid_search.best_estimator_

    def prepare_svm_grid_params(self):
        Cs = [0.001, 0.01, 0.1, 1, 10]
        gammas = [0.001, 0.01, 0.1, 1, 2]
        parameters = {
            'classifier__C': Cs,
            'classifier__gamma': gammas
        }
        return parameters

    def predict(self, X_test):
        y_pred = self.clf.predict(X_test)
        return y_pred

    def evaluate(self, y_true, y_pred):
        y_true = pd.Series(y_true)
        y_pred = pd.Series(y_pred)
        cross_tab = pd.crosstab(y_true, y_pred, rownames=['Real Label'], colnames=['Prediction'], margins=True)
        report = classification_report(y_true, y_pred, labels=self.classes_without_O, target_names=self.classes_without_O)
        report_with_O = classification_report(y_true, y_pred)
        return cross_tab, report, report_with_O

In [11]:
model_type = SVM_MODEL

The grid search takes a lot of time, we've initialized the SVM model with the best parameters found in the grid search performed outside of the notebook.

Let's create our model and train! 

In [12]:
clf_model = ClfModel(model_type=model_type)
clf_model.train(X_train, y_train)
# clf_model.train_by_grid_search(X_train, y_train)

Predict & Evaluate. Notice that we want to see a report with & without 'O',  
because we want to see the results of the other tags in a clear way.  

In [13]:
y_pred = clf_model.predict(X_test=X_test)
cross_tab, report, report_with_O = clf_model.evaluate(y_true=y_test, y_pred=y_pred)

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [14]:
print(report)

              precision    recall  f1-score   support

   U-PERCENT       0.79      0.88      0.83        34
      L-PERS       0.90      0.84      0.87       122
      U-PERS       0.83      0.80      0.81       132
       L-ORG       0.81      0.61      0.70       113
       L-LOC       0.74      0.75      0.74        67
       I-ORG       0.71      0.52      0.60        62
       I-LOC       0.57      0.53      0.55        32
       B-ORG       0.85      0.73      0.79       135
      L-DATE       0.78      0.71      0.75        35
     I-MONEY       1.00      0.92      0.96        39
      B-MISC       0.45      0.25      0.32        20
      L-MISC       0.55      0.30      0.39        20
     L-MONEY       1.00      0.95      0.98        43
       B-LOC       0.62      0.66      0.64        61
      B-PERS       0.90      0.91      0.90       151
      I-PERS       0.76      0.69      0.72        32
      U-DATE       0.73      0.77      0.75        43
      B-DATE       0.79    

In [15]:
print(report_with_O)

              precision    recall  f1-score   support

      B-DATE       0.79      0.78      0.78        40
       B-LOC       0.62      0.66      0.64        61
      B-MISC       0.45      0.25      0.32        20
     B-MONEY       0.89      0.82      0.85        39
       B-ORG       0.85      0.73      0.79       135
   B-PERCENT       0.92      0.92      0.92        13
      B-PERS       0.90      0.91      0.90       151
      B-TIME       1.00      1.00      1.00         3
      I-DATE       0.86      0.75      0.80        16
       I-LOC       0.57      0.53      0.55        32
      I-MISC       0.75      0.17      0.27        18
     I-MONEY       1.00      0.92      0.96        39
       I-ORG       0.71      0.52      0.60        62
   I-PERCENT       0.00      0.00      0.00         1
      I-PERS       0.76      0.69      0.72        32
      I-TIME       0.00      0.00      0.00         2
      L-DATE       0.78      0.71      0.75        35
       L-LOC       0.74    

In [16]:
cross_tab

Prediction,B-DATE,B-LOC,B-MISC,B-MONEY,B-ORG,B-PERCENT,B-PERS,B-TIME,I-DATE,I-LOC,...,L-PERCENT,L-PERS,O,U-DATE,U-LOC,U-MISC,U-ORG,U-PERCENT,U-PERS,All
Real Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
B-DATE,0,0,0,0,0,0,0,0,0,0,...,0,0,5,0,0,0,0,0,0,5
B-LOC,0,0,0,0,0,0,0,0,0,0,...,0,0,4,0,0,0,1,0,0,5
B-MISC,0,0,0,0,0,0,0,0,0,0,...,0,0,4,0,0,0,0,0,0,4
B-MONEY,0,0,0,0,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,0,2
B-ORG,0,0,0,1,0,0,2,0,0,0,...,0,0,25,0,0,0,0,0,0,29
B-PERCENT,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
B-PERS,1,0,0,0,0,0,0,0,0,0,...,0,0,24,0,0,0,0,0,1,27
I-DATE,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
I-LOC,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,0,3
I-MISC,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,1,4


Now, we understand that the predicted tag might be helpfull in order to predict the current tag.

For example, if the previous tag is B-X, it's likely that the current one will be I-X. 

So here we'll try again with exploiting previous tags.

In [17]:
def retrain_with_exploit_previous_tags(X_test, X_train, y_test, y_train, clf_type):
    X_train_with_tag, y_train = make_train_data_with_tags(X_train, clf_type, y_train)
    clf_model_with_tags = ClfModel(model_type=clf_type)
    clf_model_with_tags.train(X_train_with_tag, y_train)
    X_test_with_tag = init_prev_tag_dummy_variables_for_test_data_like_the_train(X_test, X_train_with_tag)
    new_y_pred = loop_of_predict_with_previous_tag(X_test_with_tag, clf_model_with_tags)
    cross_tab, report, report_with_O = clf_model_with_tags.evaluate(y_true=y_test, y_pred=new_y_pred)
    return report, report_with_O, cross_tab

def loop_of_predict_with_previous_tag(X_test_with_tag, clf_model_with_tags):
    X_test_with_tag.loc[X_test_with_tag.index[0], 'prev_tag_O'] = 1
    new_y_pred = []
    for i in range(0, len(X_test_with_tag)):
        curr_df_to_predict = pd.DataFrame(X_test_with_tag.iloc[i]).T
        pred = clf_model_with_tags.predict(X_test=curr_df_to_predict)[0]
        if i + 1 < len(X_test_with_tag):
            X_test_with_tag.loc[X_test_with_tag.index[i + 1], 'prev_tag_' + pred] = 1
        new_y_pred.append(pred)
    return new_y_pred

def init_prev_tag_dummy_variables_for_test_data_like_the_train(X_test, X_train_with_tag):
    X_test_with_tag = deepcopy(X_test)
    all_prev_tag_train_dummies_cols = [col for col in X_train_with_tag.columns if col.startswith("prev_tag")]
    for col in all_prev_tag_train_dummies_cols:
        X_test_with_tag[col] = 0
    return X_test_with_tag

def make_train_data_with_tags(X_train, clf_type, y_train):
    X_train_with_tag = deepcopy(X_train)
    X_train_with_tag['prev_tag'] = ['O'] + list(y_train)[:-1]
    prev_tag_dummies = pd.get_dummies(X_train_with_tag[['prev_tag']])
    X_train_with_tag = pd.concat([X_train, prev_tag_dummies], axis=1, sort=False)
    return X_train_with_tag, y_train

In [18]:
report, report_with_O, cross_tab = retrain_with_exploit_previous_tags(X_test, X_train, y_test, y_train, model_type)

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [19]:
print(report)

              precision    recall  f1-score   support

   U-PERCENT       0.76      0.85      0.81        34
      L-PERS       0.90      0.85      0.88       122
      U-PERS       0.84      0.80      0.82       132
       L-ORG       0.84      0.63      0.72       113
       L-LOC       0.77      0.73      0.75        67
       I-ORG       0.67      0.53      0.59        62
       I-LOC       0.64      0.56      0.60        32
       B-ORG       0.82      0.70      0.76       135
      L-DATE       0.77      0.69      0.73        35
     I-MONEY       1.00      0.92      0.96        39
      B-MISC       0.45      0.25      0.32        20
      L-MISC       0.55      0.30      0.39        20
     L-MONEY       1.00      0.95      0.98        43
       B-LOC       0.64      0.69      0.66        61
      B-PERS       0.91      0.92      0.91       151
      I-PERS       0.79      0.72      0.75        32
      U-DATE       0.75      0.77      0.76        43
      B-DATE       0.75    

In [20]:
print(report_with_O)

              precision    recall  f1-score   support

      B-DATE       0.75      0.75      0.75        40
       B-LOC       0.64      0.69      0.66        61
      B-MISC       0.45      0.25      0.32        20
     B-MONEY       0.89      0.85      0.87        39
       B-ORG       0.82      0.70      0.76       135
   B-PERCENT       0.92      0.92      0.92        13
      B-PERS       0.91      0.92      0.91       151
      B-TIME       1.00      1.00      1.00         3
      I-DATE       0.86      0.75      0.80        16
       I-LOC       0.64      0.56      0.60        32
      I-MISC       0.50      0.17      0.25        18
     I-MONEY       1.00      0.92      0.96        39
       I-ORG       0.67      0.53      0.59        62
   I-PERCENT       0.00      0.00      0.00         1
      I-PERS       0.79      0.72      0.75        32
      I-TIME       0.00      0.00      0.00         2
      L-DATE       0.77      0.69      0.73        35
       L-LOC       0.77    

In [21]:
cross_tab

Prediction,B-DATE,B-LOC,B-MISC,B-MONEY,B-ORG,B-PERCENT,B-PERS,B-TIME,I-DATE,I-LOC,...,L-PERCENT,L-PERS,O,U-DATE,U-LOC,U-MISC,U-ORG,U-PERCENT,U-PERS,All
Real Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
B-DATE,0,0,0,0,0,0,0,0,0,0,...,0,0,5,0,0,0,0,0,0,5
B-LOC,0,0,0,0,0,0,0,0,0,0,...,0,0,4,0,0,0,1,0,0,5
B-MISC,0,0,0,0,0,0,0,0,0,0,...,0,0,4,0,0,0,0,0,0,4
B-MONEY,0,0,0,0,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,0,2
B-ORG,0,0,0,1,1,0,2,0,0,0,...,0,0,24,0,0,0,0,0,0,29
B-PERCENT,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
B-PERS,1,0,0,0,0,0,0,0,0,0,...,0,0,24,0,0,0,0,0,1,27
I-DATE,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
I-LOC,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,0,3
I-MISC,0,0,0,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,0,1,4


As we can see the results are sort of the same.  
We produced a lot of features, so probarely this one feature isn't strong enough to change.

### Conclusion

We presented you machine learning classifier with performance of 95% on hebrew corpus of 66K train instances.  
This was done with a lot of experimenting which let to the use of:

1. Good feature extraction of POS and Morphological attributes to begin with
2. BILUO taggging instead of BIO 
3. Using context features 
4. Using word embeddings 
5. Using Gazzet features
6. Relating to stop words
7. Choosing best ML model for our experiment


Hope you had fun ✋ 