# Project 3: Information Extraction

### Analyze the output of an off-the-shelf NER model on scientific literature

### 1

In [2]:
import pandas as pd

cord = pd.read_csv('data/cord19_sample/sample.csv')

display(cord.head())

Unnamed: 0,cord_uid,source_x,title,abstract,publish_time
0,b2mg6lmc,Elsevier,Transmissible Viral Vaccines,Genetic engineering now enables the design of ...,2018-01-31
1,acfwtc7t,PMC,CCR5 knockout suppresses experimental autoimmu...,Multiple sclerosis (MS) is an inflammatory dis...,2016-03-15
2,47336zqs,Elsevier,Dipeptidyl Peptidase 4 Distribution in the Hum...,"Dipeptidyl peptidase 4 (DPP4, CD26), a type II...",2016-01-31
3,c02v47jz,PMC,Predicting Infectious Disease Using Deep Learn...,Infectious disease occurs when a person is inf...,2018-07-27
4,rb8wg8k5,PMC,Membrane Fusion and Cell Entry of XMRV Are pH-...,Xenotropic murine leukemia virus-related virus...,2012-03-27


In [3]:
# get the number of papers
len(cord)

8309

In [4]:
# distribution of papers by source
cord['source_x'].value_counts()

PMC         5097
Elsevier    2696
medrxiv      208
WHO          157
biorxiv      138
CZI           13
Name: source_x, dtype: int64

In [5]:
# earliest and latest papers published time
# convert 'publish_time' to datetime type
cord['publish_time'] = pd.to_datetime(cord['publish_time'], errors='coerce')
earliest = cord['publish_time'].min()
latest = cord['publish_time'].max()

print('Earliest publication date:', earliest)
print('Latest publication date:', latest)

Earliest publication date: 1963-05-01 00:00:00
Latest publication date: 2020-12-31 00:00:00


### 2

In [6]:
import spacy
cord['document'] = cord.title + " " + cord.abstract
# load a spaCy model trained on web text
nlp_web = spacy.load("en_core_web_sm")

ner_results = []

for text in cord['document']:
    doc = nlp_web(text)
    ner_results.append([(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents])

cord['ner_results'] = ner_results

In [7]:
# try printing result of document1
print('Entities from the document1:')
for ent_text, ent_start, ent_end, ent_label in cord['ner_results'][0]:
    print(ent_label, ent_text)

Entities from the document1:
ORG Viral Vaccines Genetic


In [8]:
# try printing result of document 2
print('Entities from the document2:')
for ent_text, ent_start, ent_end, ent_label in cord['ner_results'][1]:
    print(ent_label, ent_text)

Entities from the document2:
PERSON myelin
CARDINAL 5
ORG CNS
ORG CCR5
ORG MS
GPE CCR5
PERSON myelin oligodendrocyte
CARDINAL 35
ORG EAE
DATE 28 days
ORG EAE
ORG CD4(+
ORG IL-1β
ORG TNF
ORG IFN
ORG MCP-1
PERSON Myelin
ORG MBP
GPE CNPase
PRODUCT O4
ORG CCR5
ORG MS


### 3

In [10]:
avg_entities_per_document = cord['ner_results'].apply(len).mean()

# calculate the average number of entities of each type per document
all_entities = [entity for entities in cord['ner_results'] for entity in entities]
entities_df = pd.DataFrame(all_entities, columns=['text', 'start', 'end', 'label'])
avg_entities_per_type = entities_df.groupby('label').size() / len(cord)

print('Average entities per document trained on web text:', avg_entities_per_document)
print('Average entities per type per document:')
print(avg_entities_per_type)

Average entities per document trained on web text: 15.045733541942472
Average entities per type per document:
label
CARDINAL       3.428812
DATE           1.364424
EVENT          0.014201
FAC            0.058130
GPE            1.337947
LANGUAGE       0.007702
LAW            0.044289
LOC            0.216272
MONEY          0.026718
NORP           0.496329
ORDINAL        0.233361
ORG            5.235407
PERCENT        0.929594
PERSON         1.095679
PRODUCT        0.395836
QUANTITY       0.076062
TIME           0.037790
WORK_OF_ART    0.047178
dtype: float64


### 4

In [11]:
most_common_spans = entities_df.groupby('label')['text'].agg(lambda x: x.value_counts().idxmax())

print('Most common span for each entity type:')
print(most_common_spans)

Most common span for each entity type:
label
CARDINAL                  two
DATE                     2019
EVENT                   GDVII
FAC             Mycobacterium
GPE                     China
LANGUAGE              English
LAW                       HeV
LOC               Middle East
MONEY          18F-FDG PET/CT
NORP                  Chinese
ORDINAL                 first
ORG                       RNA
PERCENT                   95%
PERSON            Escherichia
PRODUCT                    S1
QUANTITY             3500 ppm
TIME                 24 hours
WORK_OF_ART               SeV
Name: text, dtype: object


### 5

In [12]:
import spacy
# load a spacy model trained on scientific text
nlp_sci = spacy.load("en_core_sci_sm")

ner_results = []

for text in cord['document']:
    doc = nlp_sci(text)
    ner_results.append([(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents])

cord['ner_results'] = ner_results


In [13]:
# try printing result of document 2
print('Entities from the document2:')
for ent_text, ent_start, ent_end, ent_label in cord['ner_results'][1]:
    print(ent_label, ent_text)

Entities from the document2:
ENTITY CCR5
ENTITY knockout
ENTITY suppresses
ENTITY experimental autoimmune encephalomyelitis
ENTITY C57BL/6 mice
ENTITY Multiple sclerosis
ENTITY MS
ENTITY inflammatory disease
ENTITY myelin
ENTITY spinal cord
ENTITY damaged
ENTITY C-C chemokine receptor type 5
ENTITY CCR5
ENTITY immune cell
ENTITY cytokine release
ENTITY central nervous system
ENTITY CNS
ENTITY investigated
ENTITY CCR5
ENTITY MS
ENTITY progression
ENTITY murine
ENTITY model
ENTITY experimental autoimmune encephalomyelitis
ENTITY EAE
ENTITY CCR5
ENTITY deficient
ENTITY CCR5(−/−
ENTITY CCR5(−/−
ENTITY CCR5(+/+)
ENTITY immunized
ENTITY myelin oligodendrocyte glycoprotein
ENTITY MOG(35
ENTITY pertussis toxin
ENTITY EAE
ENTITY paralysis
ENTITY scored
ENTITY days
ENTITY clinical scoring
ENTITY EAE
ENTITY neuropathology
ENTITY CCR5(−/−)
ENTITY CCR5(+/+)
ENTITY Immune cells
ENTITY CD3(+
ENTITY CD4(+)
ENTITY CD8(+)
ENTITY B cell
ENTITY NK cell
ENTITY macrophages
ENTITY infiltration
ENTITY astrocy

In [14]:
avg_entities_per_document = cord['ner_results'].apply(len).mean()

# calculate the average number of entities of each type per document
all_entities = [entity for entities in cord['ner_results'] for entity in entities]
entities_df = pd.DataFrame(all_entities, columns=['text', 'start', 'end', 'label'])
avg_entities_per_type = entities_df.groupby('label').size() / len(cord)

print('Average entities per document trained on science text:', avg_entities_per_document)
print('Average entities per type per document:')
print(avg_entities_per_type)

Average entities per document trained on science text: 69.8979419906126
Average entities per type per document:
label
ENTITY    69.897942
dtype: float64


In [15]:
most_common_spans = entities_df.groupby('label')['text'].agg(lambda x: x.value_counts().idxmax())

print('Most common span for each entity type:')
print(most_common_spans)

Most common span for each entity type:
label
ENTITY    patients
Name: text, dtype: object


With the spaCy model trained on web text, entities are assigned to 18 general-purpose categories (ORG, PERSON, GPE, and so on). With the model trained on scientific text, every span carries the single generic label ENTITY, so the output suggests that `en_core_sci_sm` marks entity mentions without assigning them fine-grained types. It also finds far more spans per document (about 69.9 versus 15.0). This is likely because entities in scientific text are specialized and diverse: they do not fit the general categories of the web model, which either misses them or forces them into ill-fitting labels (for example, `PERSON myelin` and `ORG CCR5` in document 2).

In [None]:
# filter entities containing the substring 'covid' (case-insensitive)
covid_entities = entities_df[entities_df['text'].str.lower().str.contains('covid')]

# output unique spans
unique_covid_spans = covid_entities['text'].unique()

print('Unique entity spans containing "covid":')
print(unique_covid_spans)

The spans do not correspond to a single entity; they represent various aspects, contexts, and references related to COVID-19.

This search probably does not identify every paper in the dataset that is about COVID-19. Although the extracted spans cover a wide range of references to COVID-19, NER is imperfect: a paper may refer to the disease with terminology that does not contain "covid", the model may miss a mention entirely, and some matched spans may be false positives.
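To illustrate the terminology problem, the sketch below compares a plain substring match with a pattern that also covers a few common synonyms. It is only illustrative: the alias list and the `mentions_covid` helper are assumptions rather than part of the analysis above, and the example titles are made up.

```python
import re

# Hypothetical alias pattern; the synonym list is an assumption and is not exhaustive.
COVID_PATTERN = re.compile(
    r"covid|sars[-\s]?cov[-\s]?2|2019[-\s]?ncov|novel coronavirus",
    flags=re.IGNORECASE,
)

def mentions_covid(text: str) -> bool:
    """Return True if the text contains any of the assumed COVID-19 aliases."""
    return bool(COVID_PATTERN.search(text))

# Made-up example titles, for illustration only
examples = [
    "Clinical features of COVID-19 patients",
    "Structure of the SARS-CoV-2 spike protein",
    "Early transmission dynamics of 2019-nCoV",
    "Membrane fusion and cell entry of XMRV",
]

substring_hits = [t for t in examples if "covid" in t.lower()]
pattern_hits = [t for t in examples if mentions_covid(t)]

print(len(substring_hits), "substring matches")
print(len(pattern_hits), "pattern matches")
```

In the notebook, the same helper could be applied to the full text with something like `cord['document'].apply(mentions_covid)`, which searches documents directly instead of relying on the extracted entity spans.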

## Train a custom model for named entity recognition

In [19]:
from nltk.corpus.reader import ConllCorpusReader

train = ConllCorpusReader('data/conll2003/', 'eng.train', ['words', 'pos', 'ignore', 'chunk'])
testa = ConllCorpusReader('data/conll2003/', 'eng.testa', ['words', 'pos', 'ignore', 'chunk'])
testb = ConllCorpusReader('data/conll2003/', 'eng.testb', ['words', 'pos', 'ignore', 'chunk'])

train_sents = list(train.iob_sents())
testa_sents = list(testa.iob_sents())
testb_sents = list(testb.iob_sents())

len(train_sents), len(testa_sents), len(testb_sents)

(14987, 3466, 3684)

There are 14987 sentences in the train split, 3466 in the testa split, and 3684 in the testb split.

In [20]:
# train
train_each_length = [len(i) for i in train_sents]
mean_train = sum(train_each_length) / len(train_sents)
min_train = min(train_each_length)
max_train = max(train_each_length)
mean_train, min_train, max_train

(13.586508307199573, 0, 113)

In [21]:
# testa
testa_each_length = [len(i) for i in testa_sents]
mean_testa = sum(testa_each_length) / len(testa_sents)
min_testa = min(testa_each_length)
max_testa = max(testa_each_length)
mean_testa, min_testa, max_testa

(14.818811309867282, 0, 109)

In [22]:
# testb
testb_each_length = [len(i) for i in testb_sents]
mean_testb = sum(testb_each_length) / len(testb_sents)
min_testb = min(testb_each_length)
max_testb = max(testb_each_length)
mean_testb, min_testb, max_testb

(12.604505971769816, 0, 124)
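The minimum length of 0 in all three splits means that each split contains at least one empty sentence. The three cells above also repeat the same computation, so a small helper could cover both points. The sketch below is a possible refactoring, not the code that produced the numbers above; `length_stats` and its `skip_empty` flag are hypothetical names, and the toy sentences stand in for `train_sents`.

```python
from statistics import mean

def length_stats(sents, skip_empty=False):
    """Return (mean, min, max) of sentence lengths measured in tokens."""
    lengths = [len(s) for s in sents]
    if skip_empty:
        # drop zero-length sentences before computing the statistics
        lengths = [n for n in lengths if n > 0]
    return mean(lengths), min(lengths), max(lengths)

# Toy sentences standing in for train_sents; each token is a (word, pos, label) tuple
toy_sents = [
    [("EU", "NNP", "I-ORG"), ("rejects", "VBZ", "O"), ("German", "JJ", "I-MISC")],
    [],
    [("Peter", "NNP", "I-PER")],
]

print(length_stats(toy_sents))
print(length_stats(toy_sents, skip_empty=True))
```

Calling `length_stats(train_sents)`, `length_stats(testa_sents)`, and `length_stats(testb_sents)` would then replace the three near-identical cells.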

### 8

In [23]:
# flatten the list of sentences to get a list of (word, pos, label) tuples
train_flat = [item for sublist in train_sents for item in sublist]
testa_flat = [item for sublist in testa_sents for item in sublist]
testb_flat = [item for sublist in testb_sents for item in sublist]

# create df for each split
df_train = pd.DataFrame(train_flat, columns=['word', 'pos', 'label'])
df_testa = pd.DataFrame(testa_flat, columns=['word', 'pos', 'label'])
df_testb = pd.DataFrame(testb_flat, columns=['word', 'pos', 'label'])

print("Distribution of NER labels in the train:")
print(df_train['label'].value_counts())
print()
print("Distribution of NER labels in the testA:")
print(df_testa['label'].value_counts())
print()
print("Distribution of NER labels in the testB:")
print(df_testb['label'].value_counts())

Distribution of NER labels in the train:
O         169578
I-PER      11128
I-ORG      10001
I-LOC       8286
I-MISC      4556
B-MISC        37
B-ORG         24
B-LOC         11
Name: label, dtype: int64

Distribution of NER labels in the testA:
O         42759
I-PER      3149
I-LOC      2094
I-ORG      2092
I-MISC     1264
B-MISC        4
Name: label, dtype: int64

Distribution of NER labels in the testB:
O         38323
I-PER      2773
I-ORG      2491
I-LOC      1919
I-MISC      909
B-MISC        9
B-LOC         6
B-ORG         5
Name: label, dtype: int64


The label set and counts show a typical distribution for NER. "O" marks tokens that lie outside any named entity, while "I-PER", "I-ORG", "I-LOC", and "I-MISC" mark tokens inside person, organization, location, and miscellaneous entities, respectively. "O" is by far the most frequent label, which is expected because non-entity tokens typically outnumber entity tokens in natural language text. The "B-MISC", "B-ORG", and "B-LOC" labels are rare, which is consistent with the IOB1 tagging scheme: a "B-" tag is used only when an entity immediately follows another entity of the same type, so the first token of most entities is tagged "I-". No "B-PER" label occurs in any split.
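The role of the rare "B-" tags is easier to see when tags are grouped into entity spans. The following is a minimal sketch that assumes the data follows the IOB1 convention (a "B-" tag appears only where an entity directly follows another entity of the same type); `iob1_to_spans` is a hypothetical helper and the tag sequences are made up.

```python
def iob1_to_spans(tags):
    """Convert a list of IOB1 tags to (type, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        # A span starts at a B- tag, or at an I- tag that does not continue the current span
        starts_new = prefix == "B" or (prefix == "I" and etype != current)
        if current is not None and (tag == "O" or starts_new):
            spans.append((current, start, i))
            start, current = None, None
        if starts_new:
            start, current = i, etype
    if current is not None:
        # close a span that runs to the end of the sentence
        spans.append((current, start, len(tags)))
    return spans

# Made-up tag sequences: B-LOC separates two adjacent location entities
with_b = ["I-ORG", "O", "I-PER", "I-PER", "O", "I-LOC", "B-LOC", "O"]
without_b = ["I-ORG", "O", "I-PER", "I-PER", "O", "I-LOC", "I-LOC", "O"]

print(iob1_to_spans(with_b))
print(iob1_to_spans(without_b))
```

With the "B-LOC" tag the last two tokens form two separate location entities; without it they are read as a single two-token location.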

### 9

In [24]:
# loading word2vec vectors
from gensim.models import KeyedVectors
w2v_vectors = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary=True)

In [25]:
import string
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
from sklearn.model_selection import train_test_split

# function to convert Word2Vec vectors to features
def word_to_features(word: str) -> dict:
    if word in w2v_vectors:
        vector = w2v_vectors[word]
    else:
        # out-of-vocabulary words fall back to an all-zero vector (300-dimensional here)
        vector = [0.0] * w2v_vectors.vector_size
    return {f'w2v_{i}': float(value) for i, value in enumerate(vector)}

# function to convert a CONLL instance to a sequence of features associated with each word
def sent_to_features(sent: list) -> list:
    return [word_to_features(word) for word, pos, label in sent]

def sent_to_labels(sent: list) -> list:
    return [label for word, pos, label in sent]

# hold out 20% of the training sentences; separate names avoid overwriting the `train` corpus reader
train_split, test_split = train_test_split(train_sents, test_size=0.2, random_state=498)

# convert train and test splits to features and labels
X_train = [sent_to_features(sent) for sent in train_split]
y_train = [sent_to_labels(sent) for sent in train_split]

X_test = [sent_to_features(sent) for sent in test_split]
y_test = [sent_to_labels(sent) for sent in test_split]

# train CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    max_iterations=100,
    verbose=True
)

crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|████████████████████████████████████████| 11989/11989 [01:48<00:00, 110.72it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 1190
Seconds required: 3.945

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=2.94  loss=197000.80 active=1190  feature_norm=5.00
Iter 2   time=1.02  loss=142541.34 active=1190  feature_norm=3.93
Iter 3   time=1.82  loss=105312.52 active=1190  feature_norm=3.30
Iter 4   time=0.88  loss=98207.90 active=1190  feature_norm=3.76
Iter 5   time=0.93  loss=83527.91 active=1190  feature_norm=4.63
Iter 6   time=0.88  loss=62752.43 active=1190  feature_norm=7.50
Iter 7   time=0.89  loss=50199.92 active=1190  feature_norm=10.23
Iter 8   time=0.88  loss=40446.73 active=1190  feature_norm=14.30
Iter 9   time=0.88  loss=36636.79 active=1190  feature_norm=16.78
Iter 10  

AttributeError: 'CRF' object has no attribute 'keep_tempfiles'


### 10

In [26]:
from sklearn import metrics

X_testa = [sent_to_features(sent) for sent in testa_sents]
y_testa = [sent_to_labels(sent) for sent in testa_sents]

# remove O label when compute metrics (since most tokens are O)
labels = list(crf.classes_)
labels.remove('O')

flatten = lambda l: [item for sublist in l for item in sublist]
y_pred = crf.predict(X_testa)

print(metrics.classification_report(flatten(y_testa), flatten(y_pred), labels=labels))



              precision    recall  f1-score   support

       I-ORG       0.70      0.65      0.67      2092
       I-PER       0.91      0.85      0.88      3149
       I-LOC       0.82      0.81      0.81      2094
      I-MISC       0.81      0.63      0.71      1264
       B-ORG       0.00      0.00      0.00         0
      B-MISC       0.00      0.00      0.00         4
       B-LOC       0.00      0.00      0.00         0

   micro avg       0.82      0.76      0.79      8603
   macro avg       0.46      0.42      0.44      8603
weighted avg       0.82      0.76      0.79      8603





### 11

The micro- and macro-averaged results differ notably (F1 of 0.79 versus 0.44). Micro-averaging pools every token-level decision, so it is dominated by the frequent classes (I-PER, I-ORG, I-LOC, I-MISC), on which the model performs reasonably well.
Macro-averaging gives each class equal weight, so the three B- classes, each with a score of 0.00 and almost no support, pull the average down sharply. The gap between the two metrics therefore reflects class imbalance rather than uniformly poor performance.
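The macro and weighted averages can be checked by hand from the per-class rows of the report. The sketch below copies the rounded F1 scores and supports from the testA table, so the recomputed values only agree with the report up to rounding; the variable names are illustrative.

```python
# Per-class (F1, support) copied from the classification report on testA
report = {
    "I-ORG": (0.67, 2092),
    "I-PER": (0.88, 3149),
    "I-LOC": (0.81, 2094),
    "I-MISC": (0.71, 1264),
    "B-ORG": (0.00, 0),
    "B-MISC": (0.00, 4),
    "B-LOC": (0.00, 0),
}

# Macro average: unweighted mean over all seven classes
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support
total_support = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total_support

# Macro average restricted to the four I- classes
inside = [f1 for label, (f1, _) in report.items() if label.startswith("I-")]
macro_f1_inside = sum(inside) / len(inside)

print(f"macro F1 (all classes): {macro_f1:.2f}")
print(f"weighted F1:            {weighted_f1:.2f}")
print(f"macro F1 (I- classes):  {macro_f1_inside:.2f}")
```

The three zero rows account for the whole gap: over the four I- classes alone, the macro F1 is close to the weighted average.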

### 12

From the result, we can see that the classes B-ORG, B-MISC, and B-LOC have a precision, recall, and F1-score of 0.00, meaning that the model never predicted them correctly. The main reason is the lack of instances: testA contains no B-ORG or B-LOC tokens and only 4 B-MISC tokens, and the full train split contains only 24, 11, and 37 tokens of these labels, respectively. This could be addressed by adjusting the feature representation, providing more examples of these cases in the training and test data, or modifying the model architecture to better capture the structure of these entities. In addition, I-MISC has noticeably lower recall (0.63) than precision (0.81). This may be because I-MISC has fewer instances than the other "I-" classes. Augmenting the training data with more diverse examples of I-MISC or experimenting with different features could help.
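Because B-ORG and B-LOC have no gold instances in testA, their recall is undefined and is reported as 0.00. One option, sketched below under that assumption, is to report only labels that actually occur in the gold data. `labels_with_support` is a hypothetical helper, and the toy label sequences stand in for `y_testa`.

```python
from collections import Counter

def labels_with_support(y_true, min_support=1, ignore=("O",)):
    """Return the sorted labels that occur at least min_support times in the gold tags."""
    counts = Counter(tag for sent in y_true for tag in sent)
    return sorted(
        label for label, n in counts.items()
        if n >= min_support and label not in ignore
    )

# Toy gold labels standing in for y_testa
toy_y_true = [
    ["I-PER", "O", "I-LOC"],
    ["I-ORG", "I-ORG", "O", "B-MISC"],
]

print(labels_with_support(toy_y_true))
print(labels_with_support(toy_y_true, min_support=2))
```

The returned list could be passed as the `labels=` argument of `classification_report` in place of the list built from `crf.classes_`, so that classes without gold instances no longer enter the macro average.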