# Structural analysis of discontinuous concepts

This notebook analyses the structure of discontinuous concepts in depth. It forms the basis for a large part of the corresponding section in the final thesis.

## Coordinated concepts

Coordinations are the only type marked in GENIA, while CRAFT also contains other types. This part therefore relies primarily on GENIA. Set the variable `corpus_name` to either `genia` or `craft` to choose which corpus to run on.

In [1]:
import utils  # a python module in the same dir as the notebooks

In [2]:
corpus_name = 'genia'

First, let's load the corpus, then extract concepts and their constituent structures.

In [3]:
import os
os.chdir(utils.ROOT)  # get to the root directory of the project

from datautils import dataio, annotations as anno

# load the corpora
corpus = dataio.load_corpus(corpus_name)

# not all docs in genia get Constituent annotations; if so, leave them out
if corpus_name.lower() == 'genia':
    corpus = [doc for doc in corpus if doc.get_annotations(anno.Constituent)]

Loading GENIA corpus ... NOTE: 13 files cannot get Constituent annotations!


100%|██████████| 1599/1599 [00:15<00:00, 105.69it/s]


In [4]:
concept_consts = [(concept, doc.get_annotations_at(concept.span, anno.Constituent)[0])
                  for doc in corpus
                  for concept in doc.get_annotations(anno.DiscontinuousConcept)
                  if len(concept.get_spanned_tokens()) < 10]

# keep only the ones with a conjunction in them
# (this matters mainly for craft, which also contains other types)
temp_list = []
for concept, const in concept_consts:
    present_pos = {t.mapped_pos() for t in const.get_tokens()}
    if 'c' in present_pos:
        temp_list.append((concept, const))
concept_consts = temp_list

Each constituent can spit out a treebank structure. Let's do that and put it all in dataframes.

In [5]:
import re

def skipped_structure(const, allowed_tokens):
    """Returns a treebank structure of a Constituent with a # symbol in front of
    all tokens that are in allowed_tokens"""
    # note: craft constituents often have labels like NP-SBJ; keep only the first part
    return '(' + const.label.split('-')[0] + ' ' + ' '.join(
        skipped_structure(c, allowed_tokens) if isinstance(c, anno.Constituent)
        else c.mapped_pos() if c not in allowed_tokens
        else '#' + c.mapped_pos()
        for c in const.constituents
    ) + ')'

def unlabeled_structure(const, allowed_tokens):
    """Returns a simplified treebank structure of a Constituent with no constituent labels
    and with a # symbol in front of all tokens that are in allowed_tokens"""
    return_string = '(' + ' '.join(
        unlabeled_structure(c, allowed_tokens) if isinstance(c, anno.Constituent)
        else c.mapped_pos() if c not in allowed_tokens
        else '#' + c.mapped_pos()
        for c in const.constituents
        if not contains_empty_token(c)
    ) + ')'
    # remove all cases of single token constituents, e.g. (n)
    return re.sub(r'\(+(#?.)\)+', r'\1', return_string)  # raw string avoids invalid-escape warnings

def contains_empty_token(const):
    if not isinstance(const, anno.Constituent):
        return False
    for c in const.constituents:
        if isinstance(c, anno.Token) and c.mapped_pos() == 'Ø':
            return True
    return False
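
The last step of `unlabeled_structure` is easier to follow on plain strings. The snippet below is a minimal sketch that applies the same regular expression outside the function; the helper name `collapse_single` is made up for illustration, and it assumes that mapped POS tags are single characters, as they are in the output tables of this notebook.

```python
import re

# Same expression as in unlabeled_structure: an optional '#' plus one POS
# character, surrounded by one or more opening and closing parentheses.
SINGLE_TOKEN = re.compile(r'\(+(#?.)\)+')

def collapse_single(structure):
    """Strip the parentheses around a lone (optionally #-marked) POS tag.

    Hypothetical helper, not part of the notebook's utils.
    """
    return SINGLE_TOKEN.sub(r'\1', structure)

print(collapse_single('(#n)'))         # a single-token constituent loses its brackets
print(collapse_single('(#n c n #n)'))  # a multi-token constituent is left untouched
```

Because `unlabeled_structure` applies the substitution at every level of the recursion, child constituents are already collapsed by the time their parent is assembled.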

In [6]:
import pandas as pd
from collections import defaultdict

full_structures = defaultdict(list)
skip_structures = defaultdict(list)
collapsed_full_structures = defaultdict(list)
collapsed_skip_structures = defaultdict(list)

for concept, const in concept_consts:
    struct = const.structure()  
    full_structures[struct].append(const)
    
    skip_struct = skipped_structure(const, concept.get_tokens())
    skip_structures[skip_struct].append(const)
    
    collapsed_full = unlabeled_structure(const, {})
    collapsed_full_structures[collapsed_full].append(const)
    
    collapsed_skip = unlabeled_structure(const, concept.get_tokens())
    collapsed_skip_structures[collapsed_skip].append(const)

In [7]:
n_dcs = len(concept_consts)

dfs = {}

for name, structure_dict in [('full', full_structures),
                             ('skip', skip_structures),
                             ('col_full', collapsed_full_structures),
                             ('col_skip', collapsed_skip_structures)]:
    
    data_dict = {'struct': [], 'pos-seq': [], 'count': [], '%': [], 'example': []}
    
    for struct, sample in structure_dict.items():
        data_dict['struct'].append(struct)
        data_dict['pos-seq'].append(''.join(c for c in struct if c.islower() or c in ','))
        data_dict['count'].append(len(sample))
        data_dict['%'].append(round(len(sample) / n_dcs * 100, 2))
        data_dict['example'].append(sample[0].get_covered_text())
        
    dfs[name] = pd.DataFrame(data_dict)
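
The `pos-seq` column is derived purely from the characters of the structure string. As a small sketch of that step (the function name `pos_sequence_of` is an assumption for illustration), keeping only lowercase characters and commas drops the uppercase constituent labels, the brackets, the `#` markers and the empty token `Ø`:

```python
def pos_sequence_of(struct):
    """Keep only lowercase POS characters and commas from a structure string.

    Hypothetical helper mirroring the generator expression used for 'pos-seq'.
    """
    return ''.join(ch for ch in struct if ch.islower() or ch == ',')

# Structure strings taken from the output tables in this notebook.
print(pos_sequence_of('(NP (NP n (NP Ø)) c (NP n (NP Ø)) (NP n))'))
print(pos_sequence_of('(t #n c n #n)'))
```

This is why differently bracketed structures such as `(#n c n #n)` and `((#n c n) #n)` end up with the same `pos-seq` value.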

In [8]:
dfs['col_skip'].sort_values('count', ascending=False).head(50)

Unnamed: 0,struct,pos-seq,count,%,example
93,(#n c n #n),ncnn,74,5.21,alpha and beta subunits
50,(#a c a #n),acan,58,4.08,lymphoid and nonlymphoid cells
17,(t #n c n #n),tncnn,37,2.61,the TSG101 and FHIT genes
83,((#n c n) #n),ncnn,29,2.04,H9 and CEM cells
62,((#a c a) #n),acan,20,1.41,cell-cell and cell-matrix interactions
138,(#n c n (#n #n)),ncnnn,17,1.2,T and B cell co-cultures
187,((#n c (f #a)) #n),ncfan,12,0.85,glucocorticoid- and Fas-mediated cytotoxicity
66,(#a c a (#n #n)),acann,11,0.77,Tax-expressing and HTLV-I-transformed T cells
137,((#a c a) #n #n),acann,11,0.77,normal and dex-resistant CEM cells
4,(t (#n c n) #n),tncnn,11,0.77,the GM-CSF and IL-2 genes


In [9]:
# for latex table
import re
csv_string = dfs['full'].sort_values('count', ascending=False).head(15).to_csv(sep='&')
csv_string = re.sub(r'\n\d+&', r'  \\\\ \\addlinespace\n', csv_string)  # line breaks and row numbers
csv_string = re.sub('^&', r'', csv_string)  # first &
csv_string = re.sub('%', r'\\%', csv_string)  # \% for latex
csv_string = re.sub('Ø', r'\\O', csv_string)  # \O for latex
csv_string = re.sub(r'#([A-Za-z\-]+)', r'\\textbf{\1}', csv_string)  # bold typeface for #-marked tokens
csv_string = re.sub('&', r'\t&\t', csv_string)  # more space on delimiters

In [10]:
print(csv_string)

struct	&	pos-seq	&	count	&	\%	&	example  \\ \addlinespace
(NP (NP n (NP \O)) c (NP n (NP \O)) (NP n))	&	ncnn	&	63	&	4.44	&	alpha and beta subunits  \\ \addlinespace
(NP (NP (ADJP a) (NP \O)) c (NP (ADJP a) (NP \O)) (NP n))	&	acan	&	55	&	3.87	&	lymphoid and nonlymphoid cells  \\ \addlinespace
(NP t (NP n (NP \O)) c (NP n (NP \O)) (NP n))	&	tncnn	&	30	&	2.11	&	the TSG101 and FHIT genes  \\ \addlinespace
(NP (NP (NP n) c (NP n)) n)	&	ncnn	&	28	&	1.97	&	H9 and CEM cells  \\ \addlinespace
(NP (ADJP (ADJP a) c (ADJP a)) n)	&	acan	&	19	&	1.34	&	cell-cell and cell-matrix interactions  \\ \addlinespace
(NP (NP n (NP \O)) c (NP n (NP \O)) (NP n n))	&	ncnnn	&	17	&	1.2	&	T and B cell co-cultures  \\ \addlinespace
(NP (ADJP (ADJP a) c (ADJP a)) n n)	&	acann	&	11	&	0.77	&	normal and dex-resistant CEM cells  \\ \addlinespace
(NP t (NP (NP n) c (NP n)) n)	&	tncnn	&	11	&	0.77	&	the GM-CSF and IL-2 genes  \\ \addlinespace
(NP n c n n)	&	ncnn	&	10	&	0.7	&	B and T cells  \\ \addlinespace
(NP (NP n (ADJP \

In [11]:
collapsed_skip_structures['((#a c a) #n)']

[Constituent('cell-cell and cell-matrix interactions'(553, 591))\NP,
 Constituent('immune and inflammatory responses'(124, 157))\NP,
 Constituent('steroid-sensitive and steroid-resistant asthma'(69, 115))\NP,
 Constituent('humoral or cell-mediated immunity'(476, 509))\NP,
 Constituent('monocytic and erythroid differentiation'(54, 93))\NP,
 Constituent('erythroid and megakaryocytic phenotypes'(768, 807))\NP,
 Constituent('control and infected cells'(535, 561))\NP,
 Constituent('basal and Tax-mediated transcription'(851, 887))\NP,
 Constituent('cellular and viral genes'(205, 229))\NP,
 Constituent('immune and inflammatory responses'(303, 336))\NP,
 Constituent('severe and symptomatic VKC'(309, 335))\NP,
 Constituent('bronchial and alveolar cells'(786, 814))\NP,
 Constituent('oxidant and antioxidant conditions'(1924, 1958))\NP,
 Constituent('haematopoietic and solid tumours'(387, 419))\NP,
 Constituent('immune and inflammatory responses'(353, 386))\NP,
 Constituent('neonatal and adult cel

We can capture more examples under fewer labels if we only use POS-tag sequences.

In [71]:
CROSS_COUNTING = ['concept', 'super', 'skip']  # order: concept, super, skip

MOD = '[an]'
SUPER_SEQ_COLLAPSERS = {
    # re.compile('n*(n,?)+cnn+'),
    # re.compile('a*(a,?)+caa*n+')
    #re.compile('[an]c[an]n'),  # base case: shared head
    #re.compile('[an]ncn'),  # base case: shared modifier
    #re.compile('([an],)+[an],?c[an]n'),  # exp case: enumeration of modifiers + shared head
    #re.compile('[an]+(n,)+n,?cn'),  # exp case: enumeration of heads with shared modifier
    #re.compile('[an]c[an][an]+n'),  # exp case: multi-word head
    #re.compile('[an]+[an]ncn'),  # exp case: shared pre-modifiers
    re.compile(f'({MOD}n?,?)+c{MOD}*n'),  # all
}

CONCEPT_SEQ_COLLAPSERS = {
    re.compile('(a|n)+n')
}
SAMPLE_CATEGORIES = {'concept', 'super', 'skip', 'cross-count'}

def make_sequence(tokens, collapser=None):
    sequence = [t.mapped_pos() if not t.pos == '#' else '#'
                for t in tokens]
    sequence_str = ''.join(sequence)
    if not collapser:
        return sequence_str

    for pos_pattern in collapser:
        if pos_pattern.fullmatch(sequence_str):
            return pos_pattern.pattern
    return sequence_str  # sequence did not match a pattern
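
To see what the collapsing step does, the sketch below runs the active catch-all pattern over a few POS sequences as plain strings. It is an illustration only: `collapse` is a hypothetical stand-in for the pattern loop in `make_sequence`, and the sequences are hand-picked rather than read from the corpus.

```python
import re

MOD = '[an]'
# Same expression as the active entry in SUPER_SEQ_COLLAPSERS.
ALL_PATTERN = re.compile(f'({MOD}n?,?)+c{MOD}*n')

def collapse(sequence, patterns):
    """Return the pattern string of the first full match, else the sequence itself."""
    for pattern in patterns:
        if pattern.fullmatch(sequence):
            return pattern.pattern
    return sequence  # no pattern matched, keep the raw sequence as its own type

for seq in ['ncnn', 'acan', 'n,n,cnn', 'ncfan', 'ndcd']:
    print(seq, '->', collapse(seq, [ALL_PATTERN]))
```

Sequences that contain a tag outside `a`, `n`, `c` and the comma, such as `ncfan` or `ndcd`, do not match and are counted as separate types.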

In [72]:
disc_concepts = [dc for doc in corpus
                 for dc in doc.get_annotations(anno.DiscontinuousConcept)
                 if 'c' in {t.mapped_pos() for t in dc.get_spanned_tokens()}
                 and len(dc.get_tokens()) < 10]

samples_after_category = {}
for ct in SAMPLE_CATEGORIES:
    samples_after_category[ct] = defaultdict(list)

for dc in disc_concepts:
    concept_tokens = dc.get_tokens()
    all_tokens = dc.get_spanned_tokens()
    skip_tokens = [t if t in concept_tokens  # actual token
                   else anno.Token(dc.document, (-1, -1), '#')  # skipped token
                   for t in all_tokens]

    cross_count_type = []
    cross_count_example = []

    concept_sequence = make_sequence(concept_tokens, CONCEPT_SEQ_COLLAPSERS)
    concept_text = dc.get_covered_text()
    samples_after_category['concept'][concept_sequence].append(dc)
    if 'concept' in CROSS_COUNTING:
        cross_count_type.append(concept_sequence)
        cross_count_example.append(dc)

    super_sequence = make_sequence(all_tokens, SUPER_SEQ_COLLAPSERS)
    super_text = dc.get_spanned_text()
    samples_after_category['super'][super_sequence].append(dc)
    if 'super' in CROSS_COUNTING:
        cross_count_type.append(super_sequence)
        cross_count_example.append(dc)

    skip_sequence = make_sequence(skip_tokens)
    samples_after_category['skip'][skip_sequence].append(dc)
    if 'skip' in CROSS_COUNTING:
        cross_count_type.append(skip_sequence)
        cross_count_example.append(dc)

    samples_after_category['cross-count'][tuple(cross_count_type)].append(
        tuple(cross_count_example)
    )
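
The three sequence types collected above differ only in which tokens they look at. The following sketch shows them side by side for the phrase "alpha and beta subunits", using plain `(text, POS)` tuples as a hypothetical stand-in for the `anno.Token` objects and leaving out the collapsers:

```python
# Stand-in data: (text, mapped POS) pairs instead of anno.Token objects.
spanned = [('alpha', 'n'), ('and', 'c'), ('beta', 'n'), ('subunits', 'n')]
concept = [('alpha', 'n'), ('subunits', 'n')]  # tokens of the discontinuous concept

def sequences(spanned_tokens, concept_tokens):
    """Return the (concept, super, skip) POS sequences for one concept."""
    concept_seq = ''.join(pos for _, pos in concept_tokens)
    super_seq = ''.join(pos for _, pos in spanned_tokens)
    # Tokens inside the span that are not part of the concept become '#'.
    skip_seq = ''.join(tok[1] if tok in concept_tokens else '#'
                       for tok in spanned_tokens)
    return concept_seq, super_seq, skip_seq

print(sequences(spanned, concept))
```

The tuple of all three is what serves as the key in the `cross-count` category when every entry of `CROSS_COUNTING` is enabled.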

In [73]:
data_dict = {'type': [], 'count': [], '%': [], 'example': []}
for type_, items in samples_after_category['super'].items():
    data_dict['type'].append(type_)
    data_dict['count'].append(len(items))
    data_dict['%'].append(round(len(items) / len(disc_concepts) * 100, 2))
    data_dict['example'].append(items[0].get_spanned_text())
    
df = pd.DataFrame(data_dict)

In [75]:
df.sort_values('count', ascending=False).head(50)

Unnamed: 0,type,count,%,example
5,"([an]n?,?)+c[an]*n",990,66.8,GM-CSF and IL-2 genes
0,ncfan,45,3.04,AP-1- and NF-kappaB-binding sites
30,ncfann,16,1.08,Sp1- and Ets-related transcription factors
16,ndcd,15,1.01,Jak 1 and 3
3,ncfn,11,0.74,c- and v-rel
25,ncfnn,8,0.54,B- and T-cell development
9,vncnn,6,0.4,truncated TSG101 and FHIT transcripts
85,"and,d,d,d,cdn",5,0.34,"molecular mass 40, 42, 70, 120, and 130 kDa"
78,fncnn,5,0.34,HLA-DR or -DP transcription
57,fncd,5,0.34,cyclin-dependent-kinase-4 and 6


In [25]:
# note: this key only exists when the '[an]+[an]ncn' collapser is active in SUPER_SEQ_COLLAPSERS
for item in samples_after_category['super']['[an]+[an]ncn']:
    print(*(t.get_covered_text() for t in item.get_spanned_tokens()))

IL-4 receptor subunits alpha and gamma
viral gene expression and replication
viral transcription initiation and elongation
HTLV-I basal transcription and expression
B cell proliferation and differentiation
PKC isoenzymes alpha and beta
T cell receptor alpha and -beta
human glial cells and lymphocytes
human B cell proliferation and differentiation
human GM-CSF promoter and enhancer
T lymphocyte differentiation and activation
human chromosomes 11p15 and 11p13
T lymphocyte activation and mitogenesis
T cell activation and differentiation
T cell activation and growth


In [76]:
# for exp case: multi-word head: for multiple tokens before/after the conjunction,
# do the tokens go with the head or the second modifier?
df = dfs['col_full']
subset = df[df['pos-seq'].str.contains('^[an]+[an]ncn$')]
subset.head(25)

Unnamed: 0,struct,pos-seq,count,%,example
101,(a n n c n),anncn,2,0.14,viral gene expression and replication
218,(n ((a n) c n)),nancn,1,0.07,HTLV-I basal transcription and expression
262,(n n n c n),nnncn,2,0.14,B cell proliferation and differentiation
275,(n n (n c n)),nnncn,3,0.21,PKC isoenzymes alpha and beta
321,(n n n n c n),nnnncn,1,0.07,T cell receptor alpha and -beta
345,(a (a n) c n),aancn,1,0.07,human glial cells and lymphocytes
389,(a n n (n c n)),annncn,1,0.07,human B cell proliferation and differentiation
528,(a n (n c n)),anncn,1,0.07,human chromosomes 11p15 and 11p13


What about continuous concepts (CCs)?

In [86]:
all_concepts = [c for doc in corpus for c in doc.get_annotations(anno.Concept)
                if not isinstance(c, anno.DiscontinuousConcept) and len(c) >= 2]


In [87]:
from collections import Counter
dist_of_types = Counter(c.pos_sequence() for c in all_concepts)

In [89]:
dist_of_types.most_common(10)

[('nn', 15031),
 ('an', 7388),
 ('nnn', 4569),
 ('ann', 2870),
 ('aan', 1206),
 ('nnnn', 1006),
 ('fan', 865),
 ('annn', 831),
 ('nan', 617),
 ('nd', 384)]

In [97]:
# re.match anchors only at the start, so this counts every type beginning with [an]n
simple_count = sum(v for t, v in dist_of_types.items() if re.match('[an]n', t))

In [98]:
simple_count / sum(dist_of_types.values())

0.7885333579233713
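
Which types enter `simple_count` depends on how the expression is anchored. As a hedged illustration, the sketch below contrasts `re.match` with `re.fullmatch` on the ten most common types listed above; since it uses only those ten entries and not the full distribution, its proportions are not comparable to the figure computed in the notebook.

```python
import re
from collections import Counter

# The ten most common types from the output above (not the full distribution).
top_types = Counter({'nn': 15031, 'an': 7388, 'nnn': 4569, 'ann': 2870,
                     'aan': 1206, 'nnnn': 1006, 'fan': 865, 'annn': 831,
                     'nan': 617, 'nd': 384})

# re.match only requires the pattern to match at the start of the string.
prefix_hits = [t for t in top_types if re.match('[an]n', t)]
# re.fullmatch requires the whole string to match.
exact_hits = [t for t in top_types if re.fullmatch('[an]n', t)]

print(prefix_hits)
print(exact_hits)
```

With `re.match`, longer types such as `nnn` or `annn` are included because they start with `[an]n`, whereas `re.fullmatch` would restrict the count to two-token types.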