# Named entity recognition
This notebook trains a named entity recognition model based on conditional random fields (CRF) on a private dataset from the Hôpital Nord Franche-Comté (HNFC).

## Preparation of the environment

**Download needed packages.**

In [None]:
!pip install sklearn-crfsuite
!pip install eli5
!pip install datasets
!pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading python_crfsuite-0.9.8-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (965 kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.8 sklearn-crfsuite-0.3.6
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
Collecting jinja2>=3.0.0
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
Building wheels for collected packages: eli5
  Building wheel for eli5 (setup.py) ... done
  Created wheel fo

**Import packages.**

In [None]:
import pandas as pd
import numpy as np
#Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set(font_scale=1)
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
#Modeling
from sklearn.model_selection import cross_val_predict, cross_val_score, RandomizedSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn_crfsuite import CRF, scorers, metrics
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn.metrics import classification_report, make_scorer
import scipy.stats
import csv

## Preparation of the data
We use a dataset from the HNFC that contains the conclusion sections of breast cancer reports.

**Retrieve tokens and labels from the data.**

In [None]:
# Read the data: documents are separated by "\n\t\n",
# and each line holds one "token<TAB>label" pair.

token_docs = []  # one list of tokens per document
label_docs = []  # one list of labels per document

data = []  # one row per token, used to build the dataframe
cpt = 0    # document counter, stored in the "Text" column

# Labels to keep; every other label is mapped to "O".
# NOTE: the labels visible in this dataset (B-ID, B-Stade, B-Type, ...) are not in
# this list, so adapt it to the label set you want to train on.
kept_labels = ["B-anatomie", "B-dose", "B-examen", "B-mode", "B-moment", "B-substance", "B-traitement", "B-valeur", "I-anatomie", "I-dose", "I-examen", "I-mode", "I-moment", "I-substance", "I-traitement", "I-valeur"]

with open('cancer_records.tsv', 'r', encoding="utf-8") as tsvfile:
  texts = tsvfile.read().split("\n\t\n")
  print(texts)
  print(len(texts))
  for text in texts:
    tokens = []
    labels = []
    if len(text.split("\n")) > 1:
      for element in text.split("\n"):
        fields = element.split("\t")
        if len(fields) > 1:
          # Same condition as before, with the operator precedence made explicit
          if (len(fields[0]) > 0 and fields[1] == "O") or len(fields[1]) > 1:
            label = fields[1] if fields[1] in kept_labels else "O"
            data.append({
              "Text": cpt,
              "Token": fields[0],
              "Label": label
            })
            tokens.append(fields[0])
            labels.append(label)
      cpt += 1
      # Store the per-document lists (the previous version appended each list to itself)
      token_docs.append(tokens)
      label_docs.append(labels)

print(len(token_docs))
print(len(label_docs))
print(len(data))

['21H15436\tB-ID\n.pdf\tO\nCONCLUSION\tO\n:\tO\nGanglions\tO\naxillaires\tO\ngauches\tO\n(exérèse\tO\nselon\tO\nla\tO\ntechnique\tO\nsentinel)\tO\n:\tO\nAspect\tO\nnormal,\tO\npN0(sn)(i-).\tO\nUICC\tO\n2017\tO\nAu\tO\ntotal\tO\nstade\tB-Stade\nTNM\tI-Stade\nselon\tI-Stade\nUICC\tI-Stade\n8\tI-Stade\nème\tI-Stade\nédition\tI-Stade\n2017\tI-Stade\n:\tI-Stade\npT1cN0(sn)(i-)\tI-Stade\nR0.\tI-Stade\nPHGSA7A0,\tO\nPHGSA7B2,\tO\nOHSG0000\tO\nle\tO\n15/12/2021\tB-Date', '21H11596\tB-ID\n.pdf\tO\nCONCLUSION\tO\n:\tO\nSEIN\tO\nGAUCHE\tO\n-\tO\nPAMECTOMIE\tO\n:\tO\nCarcinome\tB-Type\ninfiltrant\tI-Type\nNST\tI-Type\nde\tO\n40\tB-Taille\nmm\tI-Taille\nsans\tB-Metastase_ganglionnaire\nmétastase\tI-Metastase_ganglionnaire\nganglionnaire\tI-Metastase_ganglionnaire\n0\tI-Metastase_ganglionnaire\n/\tI-Metastase_ganglionnaire\n5\tI-Metastase_ganglionnaire\nSTADE\tB-Stade\np\tI-Stade\nT\tI-Stade\nN\tI-Stade\nM\tI-Stade\n(2016)\tI-Stade\n:\tI-Stade\npT2\tI-Stade\nN0\tI-Stade\nGrade\tB-Grade\nSBR\tI-Grade
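The file layout read above (documents separated by `"\n\t\n"`, one `token<TAB>label` pair per line) can be illustrated on a small in-memory string. This is only a minimal sketch: `parse_records` and the sample string are hypothetical names and values introduced for illustration, not part of the original notebook.

```python
def parse_records(raw):
    """Split raw TSV text into documents, each a list of (token, label) pairs."""
    documents = []
    for block in raw.split("\n\t\n"):
        pairs = []
        for line in block.split("\n"):
            fields = line.split("\t")
            # Keep only well-formed lines with a non-empty token and label
            if len(fields) == 2 and fields[0] and fields[1]:
                pairs.append((fields[0], fields[1]))
        if pairs:
            documents.append(pairs)
    return documents


# Hypothetical two-document sample in the same layout as the dataset
sample = "21H15436\tB-ID\n.pdf\tO\n\t\nCarcinome\tB-Type\ninfiltrant\tI-Type"
docs = parse_records(sample)
print(len(docs))
print(docs[1])
```

Keeping the parsing in a small function like this makes it possible to check the splitting logic without access to the private file.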

**Add data into a dataframe.**

In [None]:
df = pd.DataFrame(data, columns = ['Token', 'Label', 'Text'])

**Check the created dataframe.**

In [None]:
df

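| | Token | Label | Text |
|---|---|---|---|
| 0 | 21H15436 | O | 0 |
| 1 | .pdf | O | 0 |
| 2 | CONCLUSION | O | 0 |
| 3 | : | O | 0 |
| 4 | Ganglions | O | 0 |
| ... | ... | ... | ... |
| 3736 | OHGSA7A0, | O | 34 |
| 3737 | OIGSA7A0, | O | 34 |
| 3738 | OISGAMA0 | O | 34 |
| 3739 | le | O | 34 |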


**Get information from the dataframe.**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3741 entries, 0 to 3740
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Token   3741 non-null   object
 1   Label   3741 non-null   object
 2   Text    3741 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 87.8+ KB


**Create a class to retrieve sentences from the dataset.**

In [None]:
# A class that groups the dataframe rows by document (the "Text" column)
class getText(object):
    
  def __init__(self, data):
    self.n_text = 1.0
    self.data = data
    self.empty = False
    # Turn the rows of one document into a list of (token, label, text id) tuples
    agg_func = lambda s: [(w, l, t) for w, l, t in zip(s["Token"].values.tolist(),
                                                        s["Label"].values.tolist(),
                                                        s["Text"].values.tolist())]
    self.grouped = self.data.groupby("Text").apply(agg_func)
    # One list of tuples per document
    self.texts = [s for s in self.grouped]

**Get sentences from the dataframe.**

In [None]:
getter = getText(df)
sentences = getter.texts
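The grouping done by `getText` can also be expressed without pandas. The sketch below works on a list of row dictionaries shaped like the entries of `data`; `group_by_text` and the sample rows are hypothetical and only illustrate the structure that `sentences` is expected to have.

```python
def group_by_text(rows):
    """Group token rows by their "Text" id, keeping the token order."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["Text"], []).append(
            (row["Token"], row["Label"], row["Text"])
        )
    # Return the documents ordered by id, as a groupby on "Text" would
    return [grouped[key] for key in sorted(grouped)]


# Hypothetical rows in the same shape as the entries of `data`
rows = [
    {"Text": 0, "Token": "Grade", "Label": "B-Grade"},
    {"Text": 0, "Token": "SBR", "Label": "I-Grade"},
    {"Text": 1, "Token": "le", "Label": "O"},
]
grouped_docs = group_by_text(rows)
print(grouped_docs)
```

Each element of the result is one document: a list of `(token, label, text id)` tuples.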

**Get features of a word.**

In [None]:
# Simple feature map to feed arrays into the classifier. 
def feature_map(word):
    return np.array([word.istitle(), word.islower(), word.isupper(), len(word),
                     word.isdigit(),  word.isalpha()])
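To see what these six surface features capture, the sketch below computes them as a plain list for a few tokens. `surface_features` is a hypothetical stand-in for `feature_map` that avoids the NumPy dependency. The example tokens are taken from the data shown earlier.

```python
def surface_features(word):
    """Same six checks as feature_map, returned as a plain list."""
    return [word.istitle(), word.islower(), word.isupper(), len(word),
            word.isdigit(), word.isalpha()]


for token in ["Carcinome", "SEIN", "UICC", "40"]:
    print(token, surface_features(token))
```

Tokens with the same shape, such as `SEIN` and `UICC`, receive identical vectors. A classifier that only sees these features has no information about the word itself or its context, which is one motivation for the richer CRF features used later.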

**Get the list of word features and respective tags.**

In [None]:
# Build one feature vector per token and collect the matching labels
words = [feature_map(w) for w in df["Token"].values.tolist()]
tags = df["Label"].values.tolist()

## Training a named entity recognition model

### Named entity recognition with Random forest

**Train Random Forest classifier.**

In [None]:
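# Random Forest classifier, evaluated with 5-fold cross-validation.
# The former slice [:49444] exceeded the 3741 rows of the dataframe and had no effect.
pred = cross_val_predict(RandomForestClassifier(n_estimators=20), X=words, y=tags, cv=5)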

**Test the trained model.**

In [None]:
# Let's check the performance
from sklearn.metrics import classification_report
report = classification_report(y_pred=pred, y_true=tags)
print(report)

              precision    recall  f1-score   support

           O       1.00      1.00      1.00      3741

    accuracy                           1.00      3741
   macro avg       1.00      1.00      1.00      3741
weighted avg       1.00      1.00      1.00      3741

Only the `O` class appears in this report: the label filter used when reading the data maps every label outside its list to `O`, and none of the labels present in this dataset are in that list. The perfect scores therefore only show that the classifier predicts the single remaining class.



## Named entity recognition with CRF

### Get the features for the training

**Get the features of each word.**

In [None]:
# Feature set
def word2features(sent, i):
    word = sent[i][0]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'word.isalpha()': word.isalpha(),
        'word.isalnum()': word.isalnum(),
        'word.isidentifier()': word.isidentifier()        
    }
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.isalpha()': word1.isalpha(),
            '-1:word.isalnum()': word1.isalnum(),
            '-1:word.isidentifier()': word1.isidentifier()
        })
    else:
        features['BOS'] = True
    if i > 1:
        word2 = sent[i-2][0]
        features.update({
            '-2:word.lower()': word2.lower(),
            '-2:word.istitle()': word2.istitle(),
            '-2:word.isupper()': word2.isupper(),
            '-2:word.isdigit()': word2.isdigit(),
            '-2:word.isalpha()': word2.isalpha(),
            '-2:word.isalnum()': word2.isalnum(),
            '-2:word.isidentifier()': word2.isidentifier()
        })

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.isalpha()': word1.isalpha(),
            '+1:word.isalnum()': word1.isalnum(),
            '+1:word.isidentifier()': word1.isidentifier()
        })
    else:
        features['EOS'] = True
    if i < len(sent)-2:
        word2 = sent[i+2][0]
        features.update({
            '+2:word.lower()': word2.lower(),
            '+2:word.istitle()': word2.istitle(),
            '+2:word.isupper()': word2.isupper(),
            '+2:word.isdigit()': word2.isdigit(),
            '+2:word.isalpha()': word2.isalpha(),
            '+2:word.isalnum()': word2.isalnum(),
            '+2:word.isidentifier()': word2.isidentifier()
        })

    return features
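The windowing idea in `word2features` is easier to see on a reduced version. The sketch below keeps only the lowercased word for the current token and its direct neighbours, plus the `BOS`/`EOS` markers; `window_features` is a hypothetical simplification for illustration, not a replacement for the feature set above.

```python
def window_features(tokens, i):
    """Reduced feature dict: current word, one neighbour on each side."""
    features = {"bias": 1.0, "word.lower()": tokens[i].lower()}
    if i > 0:
        features["-1:word.lower()"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # first token of the document
    if i < len(tokens) - 1:
        features["+1:word.lower()"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # last token of the document
    return features


toy = ["Grade", "SBR"]
for index in range(len(toy)):
    print(window_features(toy, index))
```

Each token is described by a dictionary whose keys encode the relative position (`-1:`, `+1:`), which is how the CRF receives context such as "the next word is `sbr`".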

**Get the features of each sentence.**

In [None]:
def text2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def text2labels(sent):
    return [label for token, label, text in sent]

**Get features and tags.**

In [None]:
X = [text2features(s) for s in sentences]
y = [text2labels(s) for s in sentences]

### Prepare datasets

**Split data to training and test sets.**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Prepare the model for training

**Define CRF model.**

In [None]:
# Creating the CRF model
crf = CRF(algorithm='lbfgs',
          c1=0.1,
          c2=0.1,
          # max_iterations=2000,
          all_possible_transitions=False,
          verbose=True)
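
As far as we understand the CRFsuite L-BFGS trainer, `c1` and `c2` are the coefficients of the L1 and L2 regularization terms added to the negative log-likelihood. With notation introduced here for illustration ($w$ for the feature weights, $(x^{(n)}, y^{(n)})$ for the $N$ training sequences), the objective can be sketched as:

$$
\min_{w} \; -\sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$

With `c1=0.1` and `c2=0.1`, both penalties are active: the L1 term tends to drive some weights to exactly zero, and the L2 term keeps the remaining weights small.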

**Train CRF model.**

In [None]:
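# Train CRF model
# Some combinations of sklearn-crfsuite and scikit-learn versions appear to raise
# an AttributeError around fit(); it is ignored here as a workaround.
try:
    crf.fit(X_train, y_train)
except AttributeError:
    pass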

loading training data to CRFsuite: 100%|██████████| 2481/2481 [00:02<00:00, 1189.18it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 56795
Seconds required: 0.248

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 2147483647
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.17  loss=42812.05 active=56181 feature_norm=1.00
Iter 2   time=0.09  loss=38838.60 active=52945 feature_norm=0.87
Iter 3   time=0.09  loss=37404.94 active=54000 feature_norm=0.84
Iter 4   time=0.09  loss=34316.21 active=56093 feature_norm=0.92
Iter 5   time=0.09  loss=33114.29 active=56118 feature_norm=1.01
Iter 6   time=0.10  loss=29981.34 active=55232 feature_norm=1.90
Iter 7   time=0.09  loss=26734.92 active=56446 feature_norm=1.84
Iter 8   time=0.09  loss=25715.40 active=56552 feature_norm=1.98
Iter 9   time=0.10  loss=24720.46 active=56561 feature_norm=2.22
Iter 10

### Test CRF model

**Get predictions from the test set.**

In [None]:
# Predict one tag sequence per test document
predictions = crf.predict(X_test)
# Flatten the gold and predicted tags into two token-level lists
y_true = []
y_pred = []
for tag, prediction in zip(y_test, predictions):
  for t, p in zip(tag, prediction):
    y_true.append(t)
    y_pred.append(p)

**Test report.**

In [None]:
from seqeval.metrics import classification_report
from seqeval.scheme import IOB2

print(classification_report([y_true], [y_pred], mode='strict', scheme=IOB2))

                         precision    recall  f1-score   support

                   Date       1.00      1.00      1.00         7
                  Grade       1.00      1.00      1.00         7
                     ID       1.00      1.00      1.00         7
                   KI67       1.00      1.00      1.00         7
Metastase_ganglionnaire       0.71      0.71      0.71         7
                     RE       1.00      1.00      1.00         7
                     RP       0.86      0.86      0.86         7
                  Stade       1.00      1.00      1.00         7
                 Taille       1.00      1.00      1.00         7
                   Type       1.00      1.00      1.00         7

              micro avg       0.96      0.96      0.96        70
              macro avg       0.96      0.96      0.96        70
           weighted avg       0.96      0.96      0.96        70
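
The scores above are entity-level: in strict IOB2 mode an entity counts as correct only when its type and both boundaries match the gold annotation. The sketch below illustrates that idea on a toy sequence. It is not seqeval's implementation, and `extract_entities` and `strict_scores` are hypothetical helper names.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) spans in an IOB2 tag sequence."""
    entities = set()
    start, current = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                entities.add((current, start, i))
                start, current = None, None
            if tag.startswith("B-"):
                start, current = i, tag[2:]
    return entities


def strict_scores(gold_tags, predicted_tags):
    """Entity-level precision and recall with exact span matching."""
    gold = extract_entities(gold_tags)
    predicted = extract_entities(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = ["B-Type", "I-Type", "O", "B-Taille", "I-Taille"]
predicted = ["B-Type", "I-Type", "I-Type", "B-Taille", "I-Taille"]
print(strict_scores(gold, predicted))
```

In this toy case the predicted `Type` entity is one token too long, so it counts as an error even though two of its three tokens are tagged correctly. This is why entity-level scores can be lower than token-level accuracy.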



**Check transitions.**

In [None]:
# Check transitions
from collections import Counter

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(20))
print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

Top likely transitions:
O      -> O       5.168187
I-Grade -> I-Grade 4.763148
I-Stade -> I-Stade 4.762784
I-Type -> I-Type  4.313067
I-RP   -> I-RP    4.201904
I-Metastase_ganglionnaire -> I-Metastase_ganglionnaire 3.882501
B-Metastase_ganglionnaire -> I-Metastase_ganglionnaire 3.789060
I-RE   -> I-RE    3.633984
O      -> B-Taille 3.090742
B-Taille -> I-Taille 3.017558
B-KI67 -> I-KI67  2.700924
B-Stade -> I-Stade 2.653564
B-Type -> I-Type  2.612479
B-RP   -> I-RP    2.581076
I-KI67 -> I-KI67  2.534142
B-RE   -> I-RE    2.508328
B-Grade -> I-Grade 2.076604
O      -> B-Metastase_ganglionnaire 2.000038
O      -> B-Type  1.689337
I-Taille -> B-Metastase_ganglionnaire 1.688943

Top unlikely transitions:
I-Type -> O       1.564308
O      -> B-Date  1.551338
B-ID   -> O       1.434504
O      -> B-KI67  1.270114
I-Metastase_ganglionnaire -> B-Stade 1.135722
O      -> B-RE    1.097295
O      -> B-Stade 0.971338
I-RP   -> O       0.779443
I-Stade -> B-Grade 0.774252
I-Taille -> O       0.7004
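
Because the model was created with `all_possible_transitions=False`, we expect it to hold weights only for transitions observed in the training data, which would explain why even the "unlikely" list contains positive weights. A complementary check is to look for transitions that are invalid under the IOB2 scheme, such as `O -> I-Taille`. The sketch below works on a dictionary shaped like `crf.transition_features_`. The first two weights are copied from the output above, and the third entry is a hypothetical invalid transition added for illustration.

```python
def is_valid_iob2_transition(label_from, label_to):
    """An I-X tag may only follow B-X or I-X; everything else is allowed."""
    if label_to.startswith("I-"):
        entity = label_to[2:]
        return label_from in ("B-" + entity, "I-" + entity)
    return True


transitions = {
    ("O", "B-Taille"): 3.090742,
    ("B-Taille", "I-Taille"): 3.017558,
    ("O", "I-Taille"): 0.5,  # hypothetical entry, not from the trained model
}
invalid = [pair for pair in transitions if not is_valid_iob2_transition(*pair)]
print(invalid)
```

An empty list on the real `crf.transition_features_` would indicate that the model never learned a weight for an ill-formed tag sequence.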

**Check states.**

In [None]:
# Check states

def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))
print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])

Top positive:
3.101115 B-ID     BOS
3.040087 I-RE     +1:word.lower():intensité
2.609347 B-Stade  word.lower():stade
2.580543 I-Metastase_ganglionnaire -1:word.lower():ganglionnaire
2.500942 I-Type   +1:word.lower():de
2.485795 I-RP     +1:word.lower():intensité
2.477788 B-Date   -1:word.lower():le
2.464245 I-RP     -1:word.lower():rp
2.427882 B-Taille -1:word.lower():de
2.373948 B-KI67   word[-2:]:67
2.212998 B-Date   EOS
2.184965 I-KI67   +1:word.lower():;
2.159724 I-RE     -1:word.lower():re
2.074515 I-Taille +1:word.lower():avec
2.001163 B-Type   word.lower():carcinome
1.919304 B-Grade  +1:word.lower():sbr
1.880452 B-ID     +1:word.lower():.pdf
1.804744 O        word.lower():-
1.804744 O        word[-3:]:-
1.804744 O        word[-2:]:-
1.778684 I-Type   word.lower():infiltrant
1.754843 I-KI67   -1:word.lower():ki67
1.727259 O        -1:word.lower():i
1.631425 B-KI67   word[-3:]:I67
1.611053 I-Taille +1:word.lower():sans
1.590012 B-Stade  word[-3:]:ade
1.541731 I-Type   -1:word.lowe

### Save model

**Save trained model.**

In [None]:
import pickle

# Save the model to disk; the context manager closes the file after writing
filename = 'crf_model_Cancer.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(crf, model_file)

In [None]:
tags = []
elements = []
for word in sentences[29]:
  elements.append(word[0])
  tags.append(word[1])
print(len(elements), len(tags))

148 148


### Extra tests

In [None]:
# Load the model from disk; the context manager closes the file after reading
with open(filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)

In [None]:
text = """20H09606.pdf
CONCLUSION : SEIN GAUCHE - MASTECTOMIE : Carcinome infiltrant de type lobulaire
pléomorphe, de 40 mm. Métastase ganglionnaire 0 / 3.
Stade pTNM (2016) : pT2 N0, Grade SBR (selon Elston et Ellis) : II (3,3,1).
- Etude immunohistochimique réalisée sur pièce :
RE : 60% intensité 1 ; RP : 0% ; KI67 10 %
HER 2 : score 0 : la tumeur examinée ne présente pas de surexpresssion de la protéine HER 2.
- Limites chirurgicales : marge minimale de tissu sain : 2 mm, sur la berge profonde.
- Commentaires : carcinome lobulaire in situ intriqué à la tumeur et étendu, associé à des foyers de
carcinome canalaire in situ. Ce dernier arrive à 1 mm de la berge profonde.
OHGSA7A0
le 29/09/2020"""

In [None]:
lines = text.split("\n")
data = []
for line in lines:
  for token in line.split(" "):
    data.append(token)

In [None]:
len(data)

127
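
Splitting each line on a single space works for this text, but it produces empty tokens whenever two spaces follow each other. A minimal alternative, assuming whitespace tokenization is what the training data used, is `str.split()` without arguments. `whitespace_tokenize` is a hypothetical helper name.

```python
def whitespace_tokenize(text):
    """Split on any run of whitespace, including newlines."""
    return text.split()


line = "RE :  60% intensité 1"  # note the double space after the colon
print(line.split(" "))
print(whitespace_tokenize(line))
print(whitespace_tokenize("OHGSA7A0\nle 29/09/2020"))
```

Punctuation stays attached to the preceding word (for example `mm.`), which matches tokens such as `normal,` seen in the annotated data.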

In [None]:
list_sentences = []

In [None]:
list_sentences.append(data)

In [None]:
# Feature set for inference (a subset of the features used in word2features during training)
def token2features(tokens, i):
    word = tokens[i]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        word1 = tokens[i-1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True

    if i < len(tokens)-1:
        word1 = tokens[i+1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True

    return features

In [None]:
def test2features(sent):
  return [token2features(sent, i) for i in range(len(sent))]

# Build the feature dictionaries for the new text
i = [test2features(s) for s in list_sentences]

In [None]:
j = crf.predict(i)
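
The model returns one BIO tag per token. To obtain entity strings, consecutive `B-`/`I-` tags of the same type can be merged. The sketch below shows one way to do this; `merge_entities` is a hypothetical helper, and the tokens and tags are hard-coded illustrative values rather than live model output.

```python
def merge_entities(tokens, tags):
    """Merge BIO-tagged tokens into (entity type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
            continue
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag.startswith("B-"):
            current_type, current_tokens = tag[2:], [token]
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


example_tokens = ["de", "40", "mm.", "Métastase", "ganglionnaire", "0", "/", "3."]
example_tags = ["O", "B-Taille", "I-Taille",
                "B-Metastase_ganglionnaire", "I-Metastase_ganglionnaire",
                "I-Metastase_ganglionnaire", "I-Metastase_ganglionnaire",
                "I-Metastase_ganglionnaire"]
print(merge_entities(example_tokens, example_tags))
```

Applied to `data` and `j[0]`, the same function would give one `(type, text)` pair per predicted entity instead of one label per token.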

In [None]:
for token, prediction in zip(data, j[0]):
  print("token =", token, "label =", prediction)

token = 20H09606.pdf label = O
token = CONCLUSION label = O
token = : label = O
token = SEIN label = O
token = GAUCHE label = O
token = - label = O
token = MASTECTOMIE label = O
token = : label = O
token = Carcinome label = B-Type
token = infiltrant label = I-Type
token = de label = I-Type
token = type label = I-Type
token = lobulaire label = I-Type
token = pléomorphe, label = I-Type
token = de label = O
token = 40 label = B-Taille
token = mm. label = I-Taille
token = Métastase label = B-Metastase_ganglionnaire
token = ganglionnaire label = I-Metastase_ganglionnaire
token = 0 label = I-Metastase_ganglionnaire
token = / label = I-Metastase_ganglionnaire
token = 3. label = I-Metastase_ganglionnaire
token = Stade label = B-Stade
token = pTNM label = I-Stade
token = (2016) label = I-Stade
token = : label = I-Stade
token = pT2 label = I-Stade
token = N0, label = I-Stade
token = Grade label = B-Grade
token = SBR label = I-Grade
token = (selon label = I-Grade
token = Elston label = I-Grade
to