# Problem 1

Reading the data in CoNLL format (20pts)
Note that the NCBI Disease Corpus (See section DATA above) is already split into train, development,
and test datasets. You will use the train and test datasets in this homework.
As noted above, you should use files in the "ncbi-disease/conll" subfolder. In this file format, a blank line
indicates the start of a new sequence.

• Write a function that reads a .tsv file in the CoNLL format and returns two lists of lists as
output:

    o A list of sequences of tokens, where a single token may be a word or punctuation.
    o A list of sequences of tags, representing token-level annotation. You should see these 3
    tags in your data (“B-Disease”, “I-Disease”, “O”)

• Apply your function to train.tsv and test.tsv. To show you have read in the data correctly, show
the following in your notebook output:

    o The number of sequences in train and test. (You should see 5432 sequences in train and
    940 sequences in test.)
    o The tokens and tags of the first sequence in the training dataset. 

In [1]:
pip install sklearn-crfsuite

Note: you may need to restart the kernel to use updated packages.


In [2]:
import csv
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_classification_report
from sklearn.metrics import precision_score, recall_score

In [3]:
def read_conll_file(file_path):
    sentences = []
    labels = []

    with open(file_path, 'r', encoding='utf-8') as file:
        sentence = []
        sentence_labels = []

        for line in file:
            line = line.strip()

            if not line:
                if sentence:
                    sentences.append(sentence)
                    labels.append(sentence_labels)
                sentence = []
                sentence_labels = []
            else:
                token, label = line.split("\t")
                sentence.append(token)
                sentence_labels.append(label)

        if sentence:
            sentences.append(sentence)
            labels.append(sentence_labels)

    return sentences, labels

def print_sequence_info(name, sentences, labels):
    print(f"Number of sequences in {name}: {len(sentences)}")
    if len(sentences) > 0:
        print(f"Tokens of the first sequence in the {name} dataset:")
        print(sentences[0])
        print(f"Tags of the first sequence in the {name} dataset:")
        print(labels[0])

train_file_path = "/Users/pradaapss/Desktop/Semester 3/CS 585 NLP/Assignment 4/ncbi disease/train.tsv"
test_file_path = "/Users/pradaapss/Desktop/Semester 3/CS 585 NLP/Assignment 4/ncbi disease/test.tsv"

train_sentences, train_labels = read_conll_file(train_file_path)
test_sentences, test_labels = read_conll_file(test_file_path)

print_sequence_info("train", train_sentences, train_labels)
print_sequence_info("test", test_sentences, test_labels)


Number of sequences in train: 5432
Tokens of the first sequence in the train dataset:
['Identification', 'of', 'APC2', ',', 'a', 'homologue', 'of', 'the', 'adenomatous', 'polyposis', 'coli', 'tumour', 'suppressor', '.']
Tags of the first sequence in the train dataset:
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-Disease', 'I-Disease', 'I-Disease', 'I-Disease', 'O', 'O']
Number of sequences in test: 940
Tokens of the first sequence in the test dataset:
['Clustering', 'of', 'missense', 'mutations', 'in', 'the', 'ataxia', '-', 'telangiectasia', 'gene', 'in', 'a', 'sporadic', 'T', '-', 'cell', 'leukaemia', '.']
Tags of the first sequence in the test dataset:
['O', 'O', 'O', 'O', 'O', 'O', 'B-Disease', 'I-Disease', 'I-Disease', 'O', 'O', 'O', 'B-Disease', 'I-Disease', 'I-Disease', 'I-Disease', 'I-Disease', 'O']
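
As a quick check of the blank-line convention, the sketch below parses a small in-memory sample rather than the real files. The function name `parse_conll_lines` and the sample text are assumptions made for illustration; the logic mirrors `read_conll_file` above.

```python
import io

def parse_conll_lines(lines):
    """Group tab-separated token/tag lines into sequences split on blank lines."""
    sentences, labels = [], []
    tokens, tags = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sequence, if there is one.
            if tokens:
                sentences.append(tokens)
                labels.append(tags)
            tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:  # last sequence when the input has no trailing blank line
        sentences.append(tokens)
        labels.append(tags)
    return sentences, labels

# Hypothetical two-sequence sample in the same layout as train.tsv.
sample = "breast\tB-Disease\ncancer\tI-Disease\n\nNo\tO\nmutation\tO\n.\tO\n"
sample_sentences, sample_labels = parse_conll_lines(io.StringIO(sample))
print(len(sample_sentences))
print(sample_sentences[0], sample_labels[0])
```

Every token list should have the same length as its tag list; a check such as `all(len(s) == len(l) for s, l in zip(train_sentences, train_labels))` is a cheap way to confirm that on the real data.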


# Problem 2

In this problem you will examine the data that you read into memory in the previous problem. Using the
training dataset for analysis, show the following in your notebook output:

• The count of each of the 3 tags in the training data: “B-Disease”, “I-Disease”, and “O”. Note that
the most frequent tag is "O", since most words are not part of a disease mention.

• The 20 most common words/tokens that appear with the tags “B-Disease” or “I-Disease”. That
is, show words that often appear in disease mentions. (You may show frequent “B-Disease” and
“I-Disease” words separately, or you may combine them into a single list.)

• OPTIONAL: Any other data exploration you would like to perform. For example, you may want to
print and read a small sample of token sequences, to become familiar with the data.
Review the list of words that commonly appear in disease mentions. Do you see any patterns? (You do
not need to answer in writing, but it may be helpful in Problem 3 where you design a feature.)

In [4]:
from collections import Counter

def count_tag_frequency(labels):
    tag_counts = Counter(tag for label_list in labels for tag in label_list)
    return tag_counts

def count_common_words_with_tags(tokens, labels, target_tags, num_common=20):
    word_counts = Counter(tokens[i] for i, label in enumerate(labels) if label in target_tags)
    common_words = word_counts.most_common(num_common)
    return common_words

tag_counts = count_tag_frequency(train_labels)
print("Tag counts in training data:")
print(tag_counts)

flat_train_tokens = [token for sentence in train_sentences for token in sentence]
flat_train_labels = [label for labels in train_labels for label in labels]

target_tags = ["B-Disease", "I-Disease"]
common_words = count_common_words_with_tags(flat_train_tokens, flat_train_labels, target_tags)

print("20 most common words/tokens with 'B-Disease' or 'I-Disease' tags:")
print(common_words)


Tag counts in training data:
Counter({'O': 124819, 'I-Disease': 6122, 'B-Disease': 5145})
20 most common words/tokens with 'B-Disease' or 'I-Disease' tags:
[('-', 636), ('deficiency', 322), ('syndrome', 281), ('cancer', 269), ('disease', 256), ('of', 178), ('dystrophy', 176), ('breast', 151), ('ovarian', 132), ('X', 122), ('and', 120), ('DM', 120), ('ALD', 114), ('DMD', 110), ('APC', 100), ('disorder', 94), ('muscular', 94), ('G6PD', 92), ('linked', 81), ('the', 78)]
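
The assignment also allows showing frequent "B-Disease" and "I-Disease" words separately. A minimal sketch of that variant follows, run on a hypothetical toy input; the function name `common_words_by_tag` is an assumption, not part of the notebook above.

```python
from collections import Counter, defaultdict

def common_words_by_tag(sentences, labels, target_tags, num_common=20):
    """Return {tag: [(word, count), ...]} for each tag in target_tags."""
    counts = defaultdict(Counter)
    for tokens, tags in zip(sentences, labels):
        for token, tag in zip(tokens, tags):
            if tag in target_tags:
                counts[tag][token] += 1
    return {tag: counts[tag].most_common(num_common) for tag in target_tags}

# Hypothetical toy data, not taken from the corpus.
toy_sentences = [["breast", "cancer", "risk"], ["ovarian", "cancer"], ["breast", "cancer"]]
toy_labels = [["B-Disease", "I-Disease", "O"], ["B-Disease", "I-Disease"], ["B-Disease", "I-Disease"]]

by_tag = common_words_by_tag(toy_sentences, toy_labels, ["B-Disease", "I-Disease"])
print(by_tag["B-Disease"])
print(by_tag["I-Disease"])
```

On the real data the call would be `common_words_by_tag(train_sentences, train_labels, target_tags)`, which works on the nested lists directly and needs no flattening step.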


# Problem 3

In this problem, you will build the features that you will use in your CRF model. You may find it helpful to
refer to this demo notebook, to understand how to work with the python-crfsuite library.

• Write a function that takes two inputs:

    o A sequence of tokens
    o An integer position, pointing to one token in that sequence.
    and returns a list of features, represented as a list of strings. At minimum, include these
    features:
    o The current word/token in lower case
    o The suffix (last 3 characters) of the current word
    o The previous word/token (position i-1) or “BOS” if at the beginning of the sequence
    o The next word/token (position i+1), or “EOS” if at the end of the sequence
    o At least one other feature of your choice
    
• Apply your function to your train and test token sequences (from output of Problem 1).

• To show that you have completed this step, show the output of your function for the first 3 words
in the first sequence of the training set.

In [5]:
def extract_features(tokens, position):
    features = []
    
    current_token = tokens[position]
    
    previous_token = tokens[position - 1] if position > 0 else "BOS"
    next_token = tokens[position + 1] if position < len(tokens) - 1 else "EOS"

    suffix = current_token[-3:]
    
    # Add features to the list
    features.append(f'w0.lower={current_token.lower()}')  
    features.append(f'w0.suffix3={suffix}')  
    features.append(f'w-1={previous_token}')  
    features.append(f'w+1={next_token}')  
    features.append(f'len={len(current_token)}')  
    
    return features

sequence = train_sentences[0]

for i in range(3):
    features = extract_features(sequence, i)
    print(features)


['w0.lower=identification', 'w0.suffix3=ion', 'w-1=BOS', 'w+1=of', 'len=14']
['w0.lower=of', 'w0.suffix3=of', 'w-1=Identification', 'w+1=APC2', 'len=2']
['w0.lower=apc2', 'w0.suffix3=PC2', 'w-1=of', 'w+1=,', 'len=4']
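
The frequent disease tokens found in Problem 2 include abbreviations such as "DMD" and "G6PD", which suggests that orthographic cues may help. The sketch below shows a few optional extra features of that kind. The function name `orthographic_features` and the feature names are hypothetical, and these features are not part of the model trained in this notebook.

```python
def orthographic_features(token):
    """Return string features describing the surface form of a token."""
    features = []
    if token.isupper():
        features.append("w0.isupper")      # e.g. "DMD", "G6PD"
    if token[:1].isupper():
        features.append("w0.initcap")      # first character is upper case
    if any(ch.isdigit() for ch in token):
        features.append("w0.hasdigit")
    if "-" in token:
        features.append("w0.hashyphen")
    return features

for tok in ["G6PD", "Identification", "of", "-"]:
    print(tok, orthographic_features(tok))
```

These strings could be appended to the list returned by `extract_features`, since python-crfsuite accepts any list of string attributes per token.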


# Problem 4

In this problem, you will train a CRF model and evaluate it using metrics computed over individual tags.

• Using the python-crfsuite library, train a CRF sequential tagging model on the feature sequences
that you built in the previous step. Use your training data as input.

• Apply your model to your test dataset to generate predicted tag sequences.

• For each of the 3 labels ("B-Disease", "I-Disease", and “O") show precision, recall, f1-score. [You
may use the scikit-learn function classification_report to complete this step. You may also want
to “flatten” both the true and predicted tags into a single list of tags to apply this function.]

In [6]:
import pycrfsuite
from sklearn.metrics import classification_report

X_train = [[extract_features(sent, i) for i in range(len(sent))] for sent in train_sentences]

y_train = train_labels

trainer = pycrfsuite.Trainer(verbose=True)
for x, y in zip(X_train, y_train):
    trainer.append(x, y)
trainer.set_params({
    'c1': 1.0,
    'c2': 1e-3,
    'max_iterations': 50,
    'feature.possible_transitions': True  
})
trainer.train('disease_model.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('disease_model.crfsuite')


# Create feature sequences for the test dataset
X_test = [[extract_features(sent, i) for i in range(len(sent)) ] for sent in test_sentences]

y_pred = [tagger.tag(x) for x in X_test]

y_test_flat = [tag for labels in test_labels for tag in labels]
y_pred_flat = [tag for tags in y_pred for tag in tags]

report = classification_report(y_test_flat, y_pred_flat, labels=["B-Disease", "I-Disease", "O"])
print(report)


Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 32892
Seconds required: 0.091

L-BFGS optimization
c1: 1.000000
c2: 0.001000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 59267.137232
Feature norm: 1.000000
Error norm: 56241.322611
Active features: 19837
Line search trials: 1
Line search step: 0.000009
Seconds required for this iteration: 0.037

***** Iteration #2 *****
Loss: 39509.015314
Feature norm: 2.026043
Error norm: 9310.085362
Active features: 14762
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.019

***** Iteration #3 *****
Loss: 38423.293001
Feature norm: 1.943124
Error norm: 7655.084781
Active features: 13853
Line search trials: 1
Line search step: 1.000000
Seconds required for th
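
Per-tag precision, recall, and F1 can also be computed directly from the flattened tag lists, which makes explicit what `classification_report` summarizes. This is a sketch on hypothetical toy tags, not the model's actual predictions; `per_tag_scores` is an illustrative name, not a library function.

```python
def per_tag_scores(y_true, y_pred, tags):
    """Compute (precision, recall, f1) for each tag from flat tag lists."""
    scores = {}
    for tag in tags:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == tag and p == tag)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != tag and p == tag)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == tag and p != tag)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[tag] = (precision, recall, f1)
    return scores

# Hypothetical flattened tags for illustration only.
toy_true = ["O", "B-Disease", "I-Disease", "O", "B-Disease", "O"]
toy_pred = ["O", "B-Disease", "O", "O", "B-Disease", "B-Disease"]

toy_scores = per_tag_scores(toy_true, toy_pred, ["B-Disease", "I-Disease", "O"])
for tag, (p, r, f) in toy_scores.items():
    print(f"{tag}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With the notebook's variables, the equivalent call would be `per_tag_scores(y_test_flat, y_pred_flat, ["B-Disease", "I-Disease", "O"])`.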

# Problem 5

In this problem you will examine parameter weights assigned by your model. You can do this by calling
“tagger.info().transitions” and “tagger.info().state_features” on your trained model object.

• In your notebook, show parameter weights given to transitions between the 3 tag types
("B-Disease", "I-Disease", and "O").

• Refer back to the feature you designed in Problem 3 (the feature "of your choice"). Show the
parameter weights assigned to this feature. You may truncate this list if it is very long. [This may
happen if you included a word from the sequence in the feature name, so your feature was
expanded to become a larger set of features that grows with your vocabulary]

• *IF* your feature was dropped during model training (that is, there is nothing to show in the
previous step) then return to Problem 3 and design a new feature that is used in your model.

In [7]:
# Uses the `tagger` object opened in the Problem 4 cell.

# Show parameter weights for transitions
transitions = tagger.info().transitions
print("Transition weights:")
for tag1, tag2 in transitions:
    weight = transitions[(tag1, tag2)]
    print(f"Transition from {tag1} to {tag2}: {weight}")

# Show parameter weights for state features
state_features = tagger.info().state_features
print("\nState feature weights (truncated):")
for feature, weight in list(state_features.items())[:20]:  # You can adjust the truncation as needed
    print(f"{feature}: {weight}")

Transition weights:
Transition from O to O: 1.786906
Transition from O to B-Disease: -0.303747
Transition from O to I-Disease: -8.685557
Transition from B-Disease to O: -1.761744
Transition from B-Disease to B-Disease: -5.66115
Transition from B-Disease to I-Disease: 1.625385
Transition from I-Disease to O: -1.663512
Transition from I-Disease to B-Disease: -4.065957
Transition from I-Disease to I-Disease: 1.864707

State feature weights (truncated):
('w0.suffix3=ion', 'O'): 0.254854
('w0.suffix3=ion', 'B-Disease'): -1.481054
('w0.suffix3=ion', 'I-Disease'): 0.327464
('w-1=BOS', 'O'): 4.200259
('w-1=BOS', 'B-Disease'): 3.169447
('w+1=of', 'O'): 0.888714
('w+1=of', 'B-Disease'): 0.023472
('w+1=of', 'I-Disease'): -1.535709
('len=14', 'O'): -0.041638
('len=14', 'B-Disease'): 0.234965
('len=14', 'I-Disease'): -0.00441
('w0.lower=of', 'O'): 1.028714
('w0.lower=of', 'I-Disease'): 1.357102
('w0.suffix3=of', 'O'): 1.012942
('w0.suffix3=of', 'I-Disease'): 1.349851
('len=2', 'O'): 1.005881
('len=
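
To isolate the weights for one designed feature (here the token-length feature `len=`), the state features can be filtered by attribute prefix. The sketch uses a small dictionary with the same `(attribute, label) -> weight` shape as the output above, with values copied from that output; `weights_for_prefix` is a hypothetical helper.

```python
def weights_for_prefix(state_features, prefix):
    """Return ((attribute, label), weight) pairs whose attribute starts with
    prefix, sorted by absolute weight, largest first."""
    selected = [(key, w) for key, w in state_features.items() if key[0].startswith(prefix)]
    return sorted(selected, key=lambda item: abs(item[1]), reverse=True)

# A few entries copied from the truncated output above.
example_state_features = {
    ("w0.suffix3=ion", "O"): 0.254854,
    ("len=14", "O"): -0.041638,
    ("len=14", "B-Disease"): 0.234965,
    ("len=14", "I-Disease"): -0.00441,
    ("len=2", "O"): 1.005881,
}

len_weights = weights_for_prefix(example_state_features, "len=")
for (attr, label), weight in len_weights:
    print(f"{attr} -> {label}: {weight}")
```

On the trained model, the call would be `weights_for_prefix(tagger.info().state_features, "len=")`, optionally sliced to the first 20 entries.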

# Problem 6

Tag-level accuracy is easy to compute, but it is not very easy to understand. In particular, one disease
reference may cover both "B-Disease" and "I-Disease" tokens. To give another view of model
performance, compute document-level precision and recall on your experiment output. To do this:

• Write a function that aggregates token-level tags to a document-level label. For example,
convert a tag sequence like ["O", "B-Disease", "I-Disease", "O", "O"] to a single label y=1. Your
function should assign y=1 to a sequence with one or more disease mentions (at least one
"B-Disease" tag) and y=0 to a sequence with no disease mentions.

• Apply your function to both the true and predicted tag sequences from your test set. Use
the output to compute document-level precision and recall of your model. Show your results in
your notebook.

In [8]:
def doc_labels(token_tags):
    for tag in token_tags:
        if tag in ["B-Disease", "I-Disease"]:
            return 1
    return 0

y_test_docs = [doc_labels(ls) for ls in test_labels]
y_pred_docs = [doc_labels(ls) for ls in y_pred]

print("Document precision:", precision_score(y_test_docs, y_pred_docs))  
print("Document recall:", recall_score(y_test_docs, y_pred_docs))

Document precision: 0.9730848861283644
Document recall: 0.8719851576994434
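
Document-level precision and recall reduce to counts of true positives, false positives, and false negatives over the binary labels. The following sketch computes them without scikit-learn on hypothetical toy sequences, and uses the stricter "at least one B-Disease tag" rule from the problem statement. The names `has_disease_mention` and `binary_precision_recall` are illustrative assumptions.

```python
def has_disease_mention(tags):
    """Return 1 if the sequence has at least one "B-Disease" tag, else 0."""
    return 1 if "B-Disease" in tags else 0

def binary_precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical true and predicted tag sequences for four documents.
toy_true_tags = [["O", "B-Disease", "I-Disease"], ["O", "O"], ["B-Disease", "O"], ["O"]]
toy_pred_tags = [["O", "B-Disease", "O"], ["B-Disease", "O"], ["O", "O"], ["O"]]

toy_true_docs = [has_disease_mention(tags) for tags in toy_true_tags]
toy_pred_docs = [has_disease_mention(tags) for tags in toy_pred_tags]
doc_precision, doc_recall = binary_precision_recall(toy_true_docs, toy_pred_docs)
print("Document precision:", doc_precision)
print("Document recall:", doc_recall)
```

The two labeling rules differ only when a sequence contains "I-Disease" without any "B-Disease", which can happen in predicted tags even if it does not occur in the gold annotation.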


# Problem 7

State Transitions (10 pts – Answer in Blackboard)
The python-crfsuite library allows you to set a Boolean hyper-parameter called
“feature.possible_transitions”. If this parameter is True, then the model may output tag-to-tag
transitions that were never seen in training data. [You do not need to apply this parameter in your code
to answer this question]

• What is an example of one tag-to-tag transition that never occurred in the training data?

• For this particular experiment, do you think it makes sense to set this parameter to True or
False? That is, should you allow transitions that never occurred in the training data? Explain your
answer briefly.
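
One way to ground the first question is to count tag bigrams in the training labels and list the pairs that never occur. The sketch below does this on hypothetical toy sequences, so its output says nothing about the real training set; the function names `transition_counts` and `unseen_transitions` are assumptions.

```python
from collections import Counter
from itertools import product

def transition_counts(label_sequences):
    """Count (previous_tag, next_tag) pairs within each sequence."""
    counts = Counter()
    for tags in label_sequences:
        counts.update(zip(tags, tags[1:]))
    return counts

def unseen_transitions(label_sequences, tagset):
    """List every ordered tag pair from tagset that never occurs."""
    seen = transition_counts(label_sequences)
    return [pair for pair in product(tagset, repeat=2) if pair not in seen]

# Hypothetical toy label sequences, not the real training labels.
toy_label_sequences = [
    ["O", "B-Disease", "I-Disease", "O"],
    ["B-Disease", "O", "O"],
]
tagset = ["B-Disease", "I-Disease", "O"]
toy_unseen = unseen_transitions(toy_label_sequences, tagset)
print(toy_unseen)
```

Calling `unseen_transitions(train_labels, tagset)` on the labels read in Problem 1 would report which transitions are actually absent from the training data.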