# Named Entity Recognition using CRF model and BiLSTM-CRF
Named Entity Recognition (NER) is a common problem in Natural Language Processing (NLP). An entity is a span of text that is of interest, such as a person or a place. NER extracts such entities from a large corpus and classifies them into predefined categories such as location, organization, and person name. The dataset uses the following labels:

* geo = Geographical Entity

* org = Organization

* per = Person

* gpe = Geopolitical Entity

* tim = Time indicator

* art = Artifact

* eve = Event

* nat = Natural Phenomenon

1. Total word count = 1354149
2. Target data column: Tag

# Step-by-Step Implementation
### Step 1: Import Libraries
Objective: Import the necessary libraries for data handling, model building, and evaluation.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn_crfsuite import CRF, metrics
from sklearn_crfsuite.metrics import flat_classification_report
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from transformers import BertTokenizer, TFBertForTokenClassification
from seqeval.metrics import classification_report




### Step 2: Load and Explore the Data
Objective: Load the dataset and get an idea of its structure and content.

In [3]:
# Load the dataset
data = pd.read_csv('ner_dataset.csv', encoding='latin1')

# Display the head
data.head(10)

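| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | NaN | of | IN | O |
| 2 | NaN | demonstrators | NNS | O |
| 3 | NaN | have | VBP | O |
| 4 | NaN | marched | VBN | O |
| 5 | NaN | through | IN | O |
| 6 | NaN | London | NNP | B-geo |
| 7 | NaN | to | TO | O |
| 8 | NaN | protest | VB | O |
| 9 | NaN | the | DT | O |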


In [4]:
data.describe()

| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| count | 47959 | 1048565 | 1048575 | 1048575 |
| unique | 47959 | 35177 | 42 | 17 |
| top | Sentence: 1 | the | NN | O |
| freq | 1 | 52573 | 145807 | 887908 |


#### Observations :
* The dataset contains 47959 sentences.
* There are 35177 unique words in the dataset.
* There are 17 distinct labels (tags).

In [5]:
#Displaying the unique Tags
data['Tag'].unique()

array(['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
       'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
       'I-eve', 'I-nat'], dtype=object)
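
These tags follow the BIO convention: `B-` marks the first token of an entity, `I-` marks a continuation of the same entity, and `O` marks a token outside any entity. As a minimal sketch of how consecutive tags combine into entity spans, the hypothetical helper `bio_to_spans` below (not part of the notebook) groups a hand-made example:

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, ent_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and ent_type != current_type):
            # A new entity starts here (a stray I- tag is treated as a start)
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = ent_type, [token]
        elif prefix == "I":
            # Continuation of the entity that is currently open
            current_tokens.append(token)
        else:
            # "O" closes any open entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-made example, not taken from the dataset
tokens = ["Thousands", "marched", "through", "Hyde", "Park", "in", "London"]
tags = ["O", "O", "O", "B-geo", "I-geo", "O", "B-geo"]
print(bio_to_spans(tokens, tags))
# [('geo', 'Hyde Park'), ('geo', 'London')]
```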

In [6]:
# Check for any missing values
data.isnull().sum()

Sentence #    1000616
Word               10
POS                 0
Tag                 0
dtype: int64

The dataset has many missing values in the 'Sentence #' column and a few in the 'Word' column. 'Sentence #' is only filled in on the first row of each sentence, so we forward-fill (pandas 'ffill'), which propagates the last valid value down to the missing rows that follow it.

In [7]:
data = data.ffill()
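
To see what forward filling does to the 'Sentence #' column, here is a small sketch on a hand-made frame (the toy rows are illustrative and only mimic the layout of the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "Families", "of"],
})

# Propagate the last seen sentence id down to the rows that lack one
toy["Sentence #"] = toy["Sentence #"].ffill()
print(toy["Sentence #"].tolist())
# ['Sentence: 1', 'Sentence: 1', 'Sentence: 1', 'Sentence: 2', 'Sentence: 2']
```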

### Step 3: Data Preprocessing
Objective: Prepare the data for model training by extracting features and organizing it into the required format.

##### Extract Features
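The word2features function builds a feature dictionary for each word: its lowercase form, its last two and three characters, capitalization and digit flags, its part-of-speech (POS) tag, and the same kind of information for the previous and next word. Tokens at the start or end of a sentence get a BOS or EOS flag instead.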

In [8]:
# Define a function to extract features
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features
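
`word2features` captures the shape of a word only indirectly, through the `isupper`, `istitle`, and `isdigit` flags. If an explicit shape feature is wanted, one possible sketch is the hypothetical helper `word_shape` below. It is not part of the original notebook; it could be added to the feature dictionary as, for example, `'word.shape': word_shape(word)`.

```python
def word_shape(word):
    """Map each character to X (upper), x (lower), d (digit), or itself."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            # Punctuation and other symbols are kept unchanged
            shape.append(ch)
    return "".join(shape)


for w in ["London", "IAEA", "2004", "U.S."]:
    print(w, "->", word_shape(w))
# London -> Xxxxxx
# IAEA -> XXXX
# 2004 -> dddd
# U.S. -> X.X.
```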

##### Convert Data Format
The following functions help convert the data into a format suitable for model training.

In [9]:
# Convert the data to the required format
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]


##### Group the Dataset into Sentences
We group the dataset into sentences for easier processing.

In [10]:
# Group the dataset into sentences
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                   s["POS"].values.tolist(),
                                                   s["Tag"].values.tolist())]
grouped = data.groupby("Sentence #").apply(agg_func)
sentences = [s for s in grouped]

In [11]:
sentences[1]

[('Iranian', 'JJ', 'B-gpe'),
 ('officials', 'NNS', 'O'),
 ('say', 'VBP', 'O'),
 ('they', 'PRP', 'O'),
 ('expect', 'VBP', 'O'),
 ('to', 'TO', 'O'),
 ('get', 'VB', 'O'),
 ('access', 'NN', 'O'),
 ('to', 'TO', 'O'),
 ('sealed', 'JJ', 'O'),
 ('sensitive', 'JJ', 'O'),
 ('parts', 'NNS', 'O'),
 ('of', 'IN', 'O'),
 ('the', 'DT', 'O'),
 ('plant', 'NN', 'O'),
 ('Wednesday', 'NNP', 'B-tim'),
 (',', ',', 'O'),
 ('after', 'IN', 'O'),
 ('an', 'DT', 'O'),
 ('IAEA', 'NNP', 'B-org'),
 ('surveillance', 'NN', 'O'),
 ('system', 'NN', 'O'),
 ('begins', 'VBZ', 'O'),
 ('functioning', 'VBG', 'O'),
 ('.', '.', 'O')]
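
One detail to keep in mind: `groupby` sorts its keys by default, and the keys here are strings, so the position of a sentence in `sentences` need not match its sentence number. A small sketch on hand-made rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", "Sentence: 2", "Sentence: 10"],
    "Word": ["first", "second", "tenth"],
})

# Default behaviour: keys are sorted as strings, so "10" comes before "2"
sorted_keys = [key for key, _ in toy.groupby("Sentence #")]
print(sorted_keys)
# ['Sentence: 1', 'Sentence: 10', 'Sentence: 2']

# sort=False keeps the order in which each sentence first appears
original_keys = [key for key, _ in toy.groupby("Sentence #", sort=False)]
print(original_keys)
# ['Sentence: 1', 'Sentence: 2', 'Sentence: 10']
```

The order does not affect training, but it matters when looking up a specific sentence by index.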

##### Split the Data into Training and Test Sets
We split the data into training and test sets to evaluate the model performance.

In [12]:
# Split the data into training and test sets
train_sentences, test_sentences = train_test_split(sentences, test_size=0.2, random_state=42)

# Extract features and labels
X_train = [sent2features(s) for s in train_sentences]
y_train = [sent2labels(s) for s in train_sentences]
X_test = [sent2features(s) for s in test_sentences]
y_test = [sent2labels(s) for s in test_sentences]


### Step 4: Modeling
#### 1- CRF Model
##### Train the CRF Model
Objective: Train a Conditional Random Field (CRF) model on the training data.

In [None]:
# Initialize the CRF model
crf = CRF(algorithm='lbfgs', 
          c1=0.1, 
          c2=0.1, 
          max_iterations=100, 
          all_possible_transitions=False)

# Train the model
crf.fit(X_train, y_train)
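
Once trained, a linear-chain CRF predicts the tag sequence with the highest total score, combining per-token scores with tag-to-tag transition scores. This search is typically done with the Viterbi algorithm. The toy sketch below uses hand-made, hypothetical scores (not weights learned by `crf`) to show how transition scores can change the result compared with picking the best tag for each token independently:

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: list of {tag: score} dicts, one per token
    transitions: {(prev_tag, tag): score}; missing pairs score 0.0
    """
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for t in range(1, len(emissions)):
        column = {}
        for tag in tags:
            prev_tag, score = max(
                ((p, best[t - 1][p][0] + transitions.get((p, tag), 0.0)) for p in tags),
                key=lambda item: item[1],
            )
            column[tag] = (score + emissions[t][tag], prev_tag)
        best.append(column)

    # Backtrack from the best final tag
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


tags = ["O", "B-geo", "I-geo"]
# Hypothetical per-token scores for the tokens "in", "Hyde", "Park"
emissions = [
    {"O": 2.0, "B-geo": 0.0, "I-geo": 0.0},
    {"O": 0.5, "B-geo": 1.5, "I-geo": 1.0},
    {"O": 0.8, "B-geo": 1.0, "I-geo": 0.9},
]
# Penalize I-geo directly after O, reward I-geo after B-geo
transitions = {("O", "I-geo"): -5.0, ("B-geo", "I-geo"): 2.0}

print(viterbi(emissions, transitions, tags))  # ['O', 'B-geo', 'I-geo']
print(viterbi(emissions, {}, tags))           # ['O', 'B-geo', 'B-geo']
```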

##### Evaluate the CRF Model
Objective: Evaluate the CRF model's performance on the test data.

In [None]:
# Predict the labels on the test set
y_pred = crf.predict(X_test)

# Generate a classification report
report = flat_classification_report(y_test, y_pred)
print(report)
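
`flat_classification_report` flattens the per-sentence label lists into one long list and then scores tokens individually. A minimal pure-Python sketch of that idea for a single tag, on made-up labels (the helper `token_prf` is hypothetical):

```python
from itertools import chain


def token_prf(y_true, y_pred, label):
    """Token-level precision, recall, and F1 for one label."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    pairs = list(zip(true_flat, pred_flat))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for two short sentences
y_true_toy = [["O", "B-geo", "I-geo"], ["B-geo", "O"]]
y_pred_toy = [["O", "B-geo", "O"], ["B-geo", "B-geo"]]
precision, recall, f1 = token_prf(y_true_toy, y_pred_toy, "B-geo")
print(round(precision, 3), round(recall, 3), round(f1, 3))
# 0.667 1.0 0.8
```

Because the `O` tag dominates the data, averages that include it can look optimistic, so the per-entity rows of the report are usually the more informative ones.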

#### 2- BiLSTM-CRF Model
##### Train a BiLSTM-CRF Model
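Objective: Train a bidirectional LSTM (BiLSTM) tagger that uses word embeddings and LSTM layers to capture context on both sides of each token. Note that the network defined below ends in a per-token softmax layer rather than a CRF layer, so it is a BiLSTM baseline; a full BiLSTM-CRF would replace that output layer with a CRF layer.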
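
A Keras network of this kind expects fixed-length integer sequences rather than the feature dictionaries built for the CRF, so words and tags first have to be mapped to indices and padded to a common length. A minimal sketch on toy data follows; the helper `encode_sentence`, the reserved indices 0 (padding) and 1 (unknown word), and padding the tags with `O` are illustrative choices, not something the original notebook specifies.

```python
toy_sentences = [
    [("London", "NNP", "B-geo"), ("calls", "VBZ", "O")],
    [("IAEA", "NNP", "B-org"), ("officials", "NNS", "O"), ("arrived", "VBD", "O")],
]

words = sorted({w for s in toy_sentences for w, _, _ in s})
tags = sorted({t for s in toy_sentences for _, _, t in s})
word2idx = {w: i + 2 for i, w in enumerate(words)}  # 0 = padding, 1 = unknown
tag2idx = {t: i for i, t in enumerate(tags)}
max_len = max(len(s) for s in toy_sentences)


def encode_sentence(sentence):
    """Return padded word ids and tag ids for one sentence."""
    word_ids = [word2idx.get(w, 1) for w, _, _ in sentence]
    tag_ids = [tag2idx[t] for _, _, t in sentence]
    pad = max_len - len(sentence)
    return word_ids + [0] * pad, tag_ids + [tag2idx["O"]] * pad


for s in toy_sentences:
    print(encode_sentence(s))
# ([3, 5, 0], [0, 2, 2])
# ([2, 6, 4], [1, 2, 2])
```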

In [None]:
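# Encode words and tags as integer ids padded to the longest sentence (0 = padding, 1 = unknown word)
max_len = max(len(s) for s in sentences)
word2idx = {w: i + 2 for i, w in enumerate(sorted(data["Word"].unique()))}
tag2idx = {t: i for i, t in enumerate(sorted(data["Tag"].unique()))}
n_words, n_tags = len(word2idx) + 2, len(tag2idx)
def encode(sents):
    X = np.zeros((len(sents), max_len), dtype="int32")
    y = np.full((len(sents), max_len), tag2idx["O"], dtype="int32")
    for r, sent in enumerate(sents):
        for c, (word, _, tag) in enumerate(sent):
            X[r, c], y[r, c] = word2idx.get(word, 1), tag2idx[tag]
    return X, y
X_train, y_train = encode(train_sentences)  # replaces the CRF feature lists
X_test, y_test = encode(test_sentences)

# Define the BiLSTM model (per-token softmax output)
inputs = Input(shape=(max_len,))
x = Embedding(input_dim=n_words, output_dim=50)(inputs)
x = Dropout(0.1)(x)
x = Bidirectional(LSTM(units=100, return_sequences=True, recurrent_dropout=0.1))(x)
out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)
model = Model(inputs, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Train the model and print its summary
history = model.fit(X_train, y_train, batch_size=32, epochs=5, validation_split=0.1, verbose=1)
model.summary()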

##### Visualize the Model's Performance

In [None]:
history.history.keys()

In [None]:
import matplotlib.pyplot as plt

# The model is compiled with metrics=["accuracy"], so these are the history keys
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']
plt.figure(figsize = (8, 8))
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

In [None]:
plt.figure(figsize = (8, 8))
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

##### Evaluate the BiLSTM-CRF Model

In [None]:
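# Predict the labels on the test set
y_pred = model.predict(X_test)
pred_ids = np.argmax(y_pred, axis=-1)

# Map tag indices back to tag strings (assumes tags were encoded in sorted order)
idx2tag = dict(enumerate(sorted(data["Tag"].unique())))
true_labels = [[idx2tag[i] for i in row] for row in np.asarray(y_test).reshape(len(y_test), max_len)]
pred_labels = [[idx2tag[i] for i in row] for row in pred_ids]

# Generate an entity-level classification report with seqeval
report = classification_report(true_labels, pred_labels)
print(report)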

### Conclusion
This project demonstrates how to implement Named Entity Recognition in Python using a CRF model and a BiLSTM-based model. By following these steps, you can extract entities from text and classify them into predefined categories. The technique is widely used in applications such as information retrieval and question answering.

Feel free to experiment with different feature sets, model parameters, and datasets to further improve the performance of your NER model.