# Structural analysis of discontinuous concepts

This notebook analyzes the constituent structures of discontinuous concepts in depth. The results form the basis for a large portion of the corresponding section in the final thesis.

## Coordinated concepts

Coordinations are the only type marked in GENIA, whereas CRAFT also contains other types. This part therefore relies primarily on GENIA. Set the variable `corpus_name` to either `genia` or `craft` to choose the corpus to run on.

In [15]:
corpus_name = 'genia'

First, let's load the corpus, then extract concepts and their constituent structures.

In [16]:
import os
os.chdir('..')  # get to the root directory of the project

from datautils import dataio, annotations as anno

# load the corpora
corpus = dataio.load_corpus(corpus_name)

# not all docs in genia get Constituent annotations; if so, leave them out
if corpus_name.lower() == 'genia':
    corpus = [doc for doc in corpus if doc.get_annotations(anno.Constituent)]

Loading GENIA corpus ... NOTE: 13 files cannot get Constituent annotations!


100%|██████████| 1599/1599 [00:10<00:00, 149.36it/s]


In [17]:
concepts = [c for doc in corpus for c in doc.get_annotations(anno.Concept)]
disc_concepts = [c for doc in corpus for c in doc.get_annotations(anno.DiscontinuousConcept)]

In [18]:
concept_consts = [(concept, doc.get_annotations_at(concept.span, anno.Constituent)[0])
                  for doc in corpus
                  for concept in doc.get_annotations(anno.DiscontinuousConcept)]

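# keep only the concepts whose constituent contains a conjunction (mapped POS 'c');
# this is meant for craft, but the filter runs on whichever corpus is loaded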
temp_list = []
for concept, const in concept_consts:
    present_pos = {t.mapped_pos() for t in const.get_tokens()}
    if 'c' in present_pos:
        temp_list.append((concept, const))
concept_consts = temp_list

Each constituent can produce its treebank structure. Let's generate these structures and collect them in a dataframe.

In [19]:
def skipped_structure(const, allowed_tokens):
    """Return a treebank structure of a Constituent, with a # symbol in front of
    all tokens that are not in allowed_tokens."""
    return '(' + const.label + ' ' + ' '.join(
        skipped_structure(c, allowed_tokens) if isinstance(c, anno.Constituent)
        else '#' + c.mapped_pos() if c not in allowed_tokens
        else c.mapped_pos()
        for c in const.constituents
    ) + ')'
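
To make the `#` marking concrete, here is a minimal, self-contained sketch of the same recursion on a toy tree. The `Token` and `Constituent` classes below are hypothetical stand-ins for the project's annotation classes (they only mimic `label`, `constituents`, and `mapped_pos()`), and the tree for "Egr-1 and c-fos mRNA" is built by hand under the assumption that the concept covers "Egr-1 mRNA".

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for a corpus token; only the mapped POS matters."""
    text: str
    pos: str

    def mapped_pos(self) -> str:
        return self.pos


@dataclass
class Constituent:
    """Hypothetical stand-in for anno.Constituent: a label plus ordered children."""
    label: str
    constituents: List[Union["Constituent", Token]] = field(default_factory=list)


def skipped_structure(const, allowed_tokens):
    """Bracketed structure with '#' in front of tokens outside allowed_tokens."""
    parts = []
    for child in const.constituents:
        if isinstance(child, Constituent):
            parts.append(skipped_structure(child, allowed_tokens))
        elif child in allowed_tokens:
            parts.append(child.mapped_pos())
        else:
            parts.append('#' + child.mapped_pos())
    return '(' + const.label + ' ' + ' '.join(parts) + ')'


# hand-built toy tree for "Egr-1 and c-fos mRNA"
egr = Token('Egr-1', 'n')
conj = Token('and', 'c')
cfos = Token('c-fos', 'n')
mrna = Token('mRNA', 'n')
tree = Constituent('NP', [
    Constituent('NP', [Constituent('NP', [egr]), conj, Constituent('NP', [cfos])]),
    mrna,
])

# the concept "Egr-1 mRNA" skips the conjunction and the second conjunct
result = skipped_structure(tree, {egr, mrna})
print(result)  # (NP (NP (NP n) #c (NP #n)) n)
```

Tokens that belong to the concept keep their plain POS tag, while everything the concept skips over is prefixed with `#`, so two concepts with the same tree but different skipped tokens end up with different skip structures.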

In [20]:
import pandas as pd
from collections import defaultdict

full_structures = defaultdict(list)
skip_structures = defaultdict(list)

for concept, const in concept_consts:
    struct = const.structure()  
    full_structures[struct].append(const)
    tokens = concept.get_concept_tokens()
    skip_struct = skipped_structure(const, tokens)
    skip_structures[skip_struct].append(const)

In [21]:
n_dcs = len(concept_consts)

dfs = {}

for name, structure_dict in [('full', full_structures),
                             ('skip', skip_structures)]:
    
    data_dict = {'struct': [], 'count': [], '%': [], 'example': []}
    
    for struct, sample in structure_dict.items():
        data_dict['struct'].append(struct)
        data_dict['count'].append(len(sample))
        data_dict['%'].append(round(len(sample) / n_dcs * 100, 2))
        data_dict['example'].append(sample[0].get_covered_text())
        
    dfs[name] = pd.DataFrame(data_dict)
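
As a sanity check on the tabulation, the same count and percentage logic can be sketched without pandas. The structures and example texts below are made up purely for illustration; only the grouping, the `round(count / total * 100, 2)` arithmetic, and the descending sort mirror the cells in this notebook.

```python
from collections import defaultdict

# made-up (structure, covered text) pairs, for illustration only
toy_pairs = [
    ('(NP n c n n)', 'A and B cells'),
    ('(NP n c n n)', 'C and D genes'),
    ('(NP (ADJP a c a) n)', 'big and small cells'),
    ('(NP n c n n)', 'E and F sites'),
]

# group the covered texts by structure
groups = defaultdict(list)
for struct, text in toy_pairs:
    groups[struct].append(text)

# one row per structure: count, share of all concepts, first example
total = len(toy_pairs)
rows = [
    {
        'struct': struct,
        'count': len(texts),
        '%': round(len(texts) / total * 100, 2),
        'example': texts[0],
    }
    for struct, texts in groups.items()
]
rows.sort(key=lambda row: row['count'], reverse=True)

for row in rows:
    print(row)
```

The percentages are relative to the number of concepts, not the number of distinct structures, so they sum to 100 across all rows (up to rounding).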

In [22]:
dfs['full'].sort_values('count', ascending=False).head(10)

|  | struct | count | % | example |
|---|---|---|---|---|
| 11 | `(NP (NP n (NP Ø)) c (NP n (NP Ø)) (NP n))` | 70 | 4.53 | T and NK cells |
| 9 | `(NP (NP (ADJP a) (NP Ø)) c (NP (ADJP a) (NP Ø)...` | 61 | 3.95 | Positive and negative regulation |
| 0 | `(NP t (NP n (NP Ø)) c (NP n (NP Ø)) (NP n))` | 36 | 2.33 | the FY\*A and FY\*B alleles |
| 5 | `(NP (NP (NP n) c (NP n)) n)` | 32 | 2.07 | Egr-1 and c-fos mRNA |
| 145 | `(NP (ADJP (ADJP a) c (ADJP a)) n)` | 20 | 1.29 | monocytic and erythroid differentiation |
| 13 | `(NP (NP n (NP Ø)) c (NP n (NP Ø)) (NP n n))` | 20 | 1.29 | HLA-DMA and HLA-DMB gene expression |
| 32 | `(NP (NP n (ADJP Ø) (NP Ø)) c (NP f a (NP Ø)) (...` | 13 | 0.84 | TNF-alpha- or PMA-induced stimulation |
| 42 | `(NP t (NP (NP n) c (NP n)) n)` | 12 | 0.78 | the AP-1 and Elf-1 sites |
| 90 | `(NP (ADJP (ADJP a) c (ADJP a)) n n)` | 12 | 0.78 | endocrine and exocrine precursor cells |
| 140 | `(NP n c n n)` | 11 | 0.71 | HL-60 and NB4 cells |


In [25]:
dfs['skip'].sort_values('count', ascending=False).head(50)

|  | struct | count | % | example |
|---|---|---|---|---|
| 15 | `(NP (NP n (NP #Ø)) #c (NP #n (NP #Ø)) (NP n))` | 70 | 4.53 | T and NK cells |
| 10 | `(NP (NP (ADJP a) (NP #Ø)) #c (NP (ADJP #a) (NP...` | 61 | 3.95 | Positive and negative regulation |
| 0 | `(NP #t (NP n (NP #Ø)) #c (NP #n (NP #Ø)) (NP n))` | 36 | 2.33 | the FY\*A and FY\*B alleles |
| 6 | `(NP (NP (NP n) #c (NP #n)) n)` | 32 | 2.07 | Egr-1 and c-fos mRNA |
| 186 | `(NP (ADJP (ADJP a) #c (ADJP #a)) n)` | 20 | 1.29 | monocytic and erythroid differentiation |
| 17 | `(NP (NP n (NP #Ø)) #c (NP #n (NP #Ø)) (NP n n))` | 20 | 1.29 | HLA-DMA and HLA-DMB gene expression |
| 41 | `(NP (NP n (ADJP #Ø) (NP #Ø)) #c (NP #f a (NP #...` | 13 | 0.84 | TNF-alpha- or PMA-induced stimulation |
| 118 | `(NP (ADJP (ADJP a) #c (ADJP #a)) n n)` | 12 | 0.78 | endocrine and exocrine precursor cells |
| 55 | `(NP #t (NP (NP n) #c (NP #n)) n)` | 12 | 0.78 | the AP-1 and Elf-1 sites |
| 179 | `(NP n #c #n n)` | 11 | 0.71 | HL-60 and NB4 cells |


In [14]:
skip_structures['(NML n (NML #n #c n))']

[Constituent('histone H3 and H4'(4171, 4188))\NML,
 Constituent('odor discrimination and learning'(13544, 13576))\NML,
 Constituent('Odor Discrimination and Learning'(31323, 31355))\NML,
 Constituent('odor discrimination and learning'(31540, 31572))\NML,
 Constituent('odor discrimination and learning'(36894, 36926))\NML,
 Constituent('RNA splicing and transport'(36008, 36034))\NML]
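
The lookup above returns the same phrase several times, once with different capitalization. A minimal sketch for collapsing such a list by surface form is given below; it works on the covered texts copied from the output above as plain strings, whereas with the real objects one would presumably collect `const.get_covered_text()` for each constituent first.

```python
from collections import Counter

# covered texts copied from the output above, as plain strings
covered_texts = [
    'histone H3 and H4',
    'odor discrimination and learning',
    'Odor Discrimination and Learning',
    'odor discrimination and learning',
    'odor discrimination and learning',
    'RNA splicing and transport',
]

# count occurrences per phrase, ignoring capitalization
phrase_counts = Counter(text.lower() for text in covered_texts)

for phrase, count in phrase_counts.most_common():
    print(count, phrase)
```

Here six occurrences reduce to three distinct phrases, which is worth keeping in mind when reading the per-structure counts: a frequent structure may be driven by a single repeated term.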