For this demo, we will use the [MIT Restaurant Corpus](https://groups.csail.mit.edu/sls/downloads/restaurant/) -- a dataset of transcriptions of spoken utterances about restaurants.

The dataset labels eight entity types with BIO tags, where a `B-` tag begins an entity and an `I-` tag continues it:

* 'B-Rating'
* 'I-Rating'
* 'B-Amenity'
* 'I-Amenity'
* 'B-Location'
* 'I-Location'
* 'B-Restaurant_Name'
* 'I-Restaurant_Name'
* 'B-Price'
* 'I-Price'
* 'B-Hours'
* 'I-Hours'
* 'B-Dish'
* 'I-Dish'
* 'B-Cuisine'
* 'I-Cuisine'

Let us load the dataset and see what we are working with.

In [12]:
with open('sent_train', 'r') as train_sent_file:
  train_sentences = train_sent_file.readlines()

with open('label_train', 'r') as train_labels_file:
  train_labels = train_labels_file.readlines()

with open('sent_test', 'r') as test_sent_file:
  test_sentences = test_sent_file.readlines()

with open('label_test', 'r') as test_labels_file:
  test_labels = test_labels_file.readlines()


In [13]:
# Print the 6th sentence in the test set i.e. index value 5.
print(test_sentences[5])

# Print the labels of this sentence
print(test_labels[5])

any good ice cream parlors around 

O B-Rating B-Cuisine I-Cuisine I-Cuisine B-Location 
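To make the BIO labels above easier to read, the tagged tokens can be grouped into entity spans. The following is a minimal sketch, not part of the original notebook; `extract_entities` is a hypothetical helper name, and it assumes the sentence and label strings are whitespace-separated and aligned one label per token.

```python
def extract_entities(sentence, labels):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    tokens = sentence.split()
    tags = labels.split()
    entities = []
    current_type, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((current_type, ' '.join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith('B-'):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # An I- tag of the same type continues the current entity.
            current_tokens.append(token)
        else:
            # 'O' (or an I- tag with no matching B-) closes the current entity.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities


example = extract_entities(
    'any good ice cream parlors around',
    'O B-Rating B-Cuisine I-Cuisine I-Cuisine B-Location',
)
print(example)
```

For the test sentence printed above, this groups "ice cream parlors" into a single Cuisine span, with "good" as a Rating and "around" as a Location.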



# Defining Features for Custom NER

In [14]:
# Installing required modules
!pip install pycrf
!pip install sklearn-crfsuite



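We define the following features for each word when building the CRF model:

- f1 = the input word in lowercase;
- f2 = the last 3 characters of the word;
- f3 = the last 2 characters of the word;
- f4 = 1 if the word is in uppercase; otherwise, 0;
- f5 = 1 if the word is a number; otherwise, 0;
- f6 = 1 if the word starts with a capital letter; otherwise, 0.

The function below also adds the lowercase, uppercase, digit, and capitalization features of the previous word, plus `BEG` and `END` markers for the first and last word of a sentence.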

In [53]:
# Define a function to get the above defined features of a word

def getFeaturesForOneWord(sentence, pos):
    word = sentence[pos]

    features = [
        'word.lower=' + word.lower(), # serves as word id
        'word[-3:]=' + word[-3:], # last 3 characters
        'word[-2:]=' + word[-2:], # last 2 characters
        'word.isupper=%s' % word.isupper(), # is upper case
        'word.isdigit=%s' % word.isdigit(), # is digit
        'word.startsWithCapital=%s' % word[:1].isupper(), # starts with capital (slice is safe for empty tokens)
    ]

    if pos > 0:
        prev_word = sentence[pos - 1]
        features.extend([
            'prev_word.lower=' + prev_word.lower(), # previous word
            'prev_word.isupper=%s' % prev_word.isupper(), # is upper case
            'prev_word.isdigit=%s' % prev_word.isdigit(), # is digit
            'prev_word.startsWithCapital=%s' % prev_word[:1].isupper(), # starts with capital (slice is safe for empty tokens)
        ])
    else:
        features.append('BEG') # beginning of the sentence
        
    if pos == len(sentence) - 1:
        features.append('END') # feature to track end of sentence
        
    return features

In [54]:
print(train_sentences[0])
getFeaturesForOneWord("2 places that serves soft serve ice cream".split(" "), 1)

2 start restaurants with inside dining 



['word.lower=places',
 'word[-3:]=ces',
 'word[-2:]=es',
 'word.isupper=False',
 'word.isdigit=False',
 'word.startsWithCapital=False',
 'prev_word.lower=2',
 'prev_word.isupper=False',
 'prev_word.isdigit=True',
 'prev_word.startsWithCapital=False']

## Computing Features

In [63]:
# Define a function to get features for a sentence
# using the already defined 'getFeaturesForOneWord' function

def getFeaturesForOneSentence(sentence):
    sentence = sentence.split()
    return [getFeaturesForOneWord(sentence, index) for index in range(len(sentence))]

In [64]:
# Define a function to get the labels for a sentence
def getLabelsForOneSentence(labels):
    return labels.split()

In [65]:
example_sentence = train_sentences[5]
print(example_sentence)
getFeaturesForOneSentence(example_sentence)

a place that serves soft serve ice cream 



[['word.lower=a',
  'word[-3:]=a',
  'word[-2:]=a',
  'word.isupper=False',
  'word.isdigit=False',
  'word.startsWithCapital=False',
  'BEG'],
 ['word.lower=place',
  'word[-3:]=ace',
  'word[-2:]=ce',
  'word.isupper=False',
  'word.isdigit=False',
  'word.startsWithCapital=False',
  'prev_word.lower=a',
  'prev_word.isupper=False',
  'prev_word.isdigit=False',
  'prev_word.startsWithCapital=False'],
 ['word.lower=that',
  'word[-3:]=hat',
  'word[-2:]=at',
  'word.isupper=False',
  'word.isdigit=False',
  'word.startsWithCapital=False',
  'prev_word.lower=place',
  'prev_word.isupper=False',
  'prev_word.isdigit=False',
  'prev_word.startsWithCapital=False'],
 ['word.lower=serves',
  'word[-3:]=ves',
  'word[-2:]=es',
  'word.isupper=False',
  'word.isdigit=False',
  'word.startsWithCapital=False',
  'prev_word.lower=that',
  'prev_word.isupper=False',
  'prev_word.isdigit=False',
  'prev_word.startsWithCapital=False'],
 ['word.lower=soft',
  'word[-3:]=oft',
  'word[-2:]=ft',
  'wo

In [66]:
# Get the features and labels for the training set and test set

X_train = [getFeaturesForOneSentence(sentence) for sentence in train_sentences]
Y_train = [getLabelsForOneSentence(labels) for labels in train_labels]

X_test = [getFeaturesForOneSentence(sentence) for sentence in test_sentences]
Y_test = [getLabelsForOneSentence(labels) for labels in test_labels]

In [67]:
len(X_train), len(Y_train), len(X_test), len(Y_test)

(7660, 7660, 1521, 1521)
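The counts above only confirm that there are as many label sequences as sentences. A CRF also expects one label per token within each sentence, so a per-sentence check can be useful. The sketch below is an illustration under that assumption; `find_misaligned` is a hypothetical helper and the toy inputs are made up for demonstration.

```python
def find_misaligned(X, Y):
    """Return indices of sentences whose token count differs from their label count."""
    return [i for i, (x, y) in enumerate(zip(X, Y)) if len(x) != len(y)]


# Toy data (illustrative only): the second sentence has 1 token but 2 labels.
toy_X = [[['word.lower=ice'], ['word.lower=cream']], [['word.lower=pizza']]]
toy_Y = [['B-Dish', 'I-Dish'], ['O', 'B-Dish']]

misaligned = find_misaligned(toy_X, toy_Y)
print(misaligned)
```

In the notebook, the same check could be run as `find_misaligned(X_train, Y_train)` and `find_misaligned(X_test, Y_test)`; an empty list means every sentence is aligned.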

In [68]:
X_train[0], Y_train[0]

([['word.lower=2',
   'word[-3:]=2',
   'word[-2:]=2',
   'word.isupper=False',
   'word.isdigit=True',
   'word.startsWithCapital=False',
   'BEG'],
  ['word.lower=start',
   'word[-3:]=art',
   'word[-2:]=rt',
   'word.isupper=False',
   'word.isdigit=False',
   'word.startsWithCapital=False',
   'prev_word.lower=2',
   'prev_word.isupper=False',
   'prev_word.isdigit=True',
   'prev_word.startsWithCapital=False'],
  ['word.lower=restaurants',
   'word[-3:]=nts',
   'word[-2:]=ts',
   'word.isupper=False',
   'word.isdigit=False',
   'word.startsWithCapital=False',
   'prev_word.lower=start',
   'prev_word.isupper=False',
   'prev_word.isdigit=False',
   'prev_word.startsWithCapital=False'],
  ['word.lower=with',
   'word[-3:]=ith',
   'word[-2:]=th',
   'word.isupper=False',
   'word.isdigit=False',
   'word.startsWithCapital=False',
   'prev_word.lower=restaurants',
   'prev_word.isupper=False',
   'prev_word.isdigit=False',
   'prev_word.startsWithCapital=False'],
  ['word.lower=i

## CRF Model Training

In [69]:
import sklearn_crfsuite
from sklearn_crfsuite import metrics

In [70]:
crf = sklearn_crfsuite.CRF(max_iterations=100)
crf.fit(X_train, Y_train)

## Model Testing and Evaluation

In [71]:
Y_pred = crf.predict(X_test)
metrics.flat_f1_score(Y_test, Y_pred, average='weighted')

0.8744887733818438
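To clarify what this score measures: the "flat" metric treats all tokens of all sentences as one long list, and `average='weighted'` averages the per-label F1 scores weighted by how often each label occurs in the true labels. The following pure-Python sketch reproduces that idea on toy data; `weighted_flat_f1` is a hypothetical name, the toy labels are made up, and the function assumes non-empty, aligned inputs.

```python
from collections import Counter


def weighted_flat_f1(y_true, y_pred):
    """Support-weighted average of per-label F1 over flattened label sequences."""
    true_flat = [t for seq in y_true for t in seq]
    pred_flat = [p for seq in y_pred for p in seq]

    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[t] += 1  # true label t was missed

    support = Counter(true_flat)  # number of true tokens per label
    total = 0.0
    for label, n in support.items():
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        total += n * f1
    return total / len(true_flat)


# Toy data (illustrative only): one I-Dish token is mispredicted as O.
toy_true = [['O', 'B-Dish', 'I-Dish'], ['O', 'O']]
toy_pred = [['O', 'B-Dish', 'O'], ['O', 'O']]
score = weighted_flat_f1(toy_true, toy_pred)
print(round(score, 4))
```

Because no label subset is passed in the notebook cell above, the frequent `O` tag is counted as well and likely contributes a large share of the weighted score.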

In [None]:
idx = 10  # avoid shadowing the built-in id()
print("Sentence: ", test_sentences[idx])
print("Orig Labels: ", Y_test[idx])
print("Predicted Labels: ", Y_pred[idx])

Sentence:  any places around here that has a nice view 

Orig Labels:  ['O', 'O', 'B-Location', 'I-Location', 'O', 'O', 'O', 'B-Amenity', 'I-Amenity']
Predicted Labels:  ['O', 'O', 'B-Location', 'I-Location', 'O', 'O', 'O', 'B-Amenity', 'I-Amenity']


## Transitions Learned by CRF

In [79]:
from util import print_top_likely_transitions
from util import print_top_unlikely_transitions

In [80]:
print_top_likely_transitions(crf.transition_features_)

B-Restaurant_Name -> I-Restaurant_Name 6.803175
B-Location -> I-Location 6.730945
B-Amenity -> I-Amenity 6.621640
I-Location -> I-Location 6.436021
I-Amenity -> I-Amenity 6.254962
B-Dish -> I-Dish  5.904813
B-Hours -> I-Hours 5.892986
I-Restaurant_Name -> I-Restaurant_Name 5.845391
B-Cuisine -> I-Cuisine 5.538447
I-Hours -> I-Hours 5.437972
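The `util` module imported above is not shown in this notebook. As a rough sketch of how such helpers might work, assuming `crf.transition_features_` is a dictionary that maps `(from_label, to_label)` pairs to learned weights, the transitions can be sorted by weight. The function name `top_transitions` and the toy weights below are illustrative assumptions, not the original implementation.

```python
def top_transitions(transition_features, n=10, likely=True):
    """Return the n highest-weight (likely=True) or lowest-weight transitions."""
    ranked = sorted(transition_features.items(),
                    key=lambda item: item[1],
                    reverse=likely)
    return ranked[:n]


def print_transitions(transitions):
    for (label_from, label_to), weight in transitions:
        print("%s -> %s %0.6f" % (label_from, label_to, weight))


# Toy weights (made up for illustration, not learned values).
toy_weights = {
    ('B-Dish', 'I-Dish'): 5.9,
    ('O', 'O'): 1.5,
    ('O', 'I-Dish'): -3.2,
    ('B-Price', 'I-Hours'): -4.1,
}

likely = top_transitions(toy_weights, n=2, likely=True)
unlikely = top_transitions(toy_weights, n=2, likely=False)
print_transitions(likely)
print_transitions(unlikely)
```

High positive weights mark tag sequences the model favors, such as a `B-` tag followed by the matching `I-` tag, while strongly negative weights mark sequences it penalizes.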


