<img align="right" src="images/tf-small.png" width="90"/>
<img align="right" src="images/etcbc.png" width="100"/>

# Introduction (code is a work in progress)

The code in this notebook explores the coreference features produced by the TF conversion program in the [tf_conversion repository](). The following features are derived from annotations made in *brat*:

* `mention`: any referring expression;
* `coref`: a coreference relation between two or more mentions;
* `mentionNote`: annotator notes on a mention.

The code is still experimental.

In [1]:
__author__ = 'erwich/roorda'

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
import os
from collections import *
from shutil import rmtree
from glob import glob
from pprint import pprint
from operator import itemgetter, attrgetter

from tf.app import use
from tf.fabric import Fabric
from utils import *

In [7]:
A = use(
    'bhsa', version='2017',
    mod=(
        'text-fabric-data/Gyusang/participant-analysis/coreference/tf:clone,'
        'text-fabric-data/Gyusang/bh-reference-system/tf:clone'
    ), 
    hoist=globals())

	connecting to online GitHub repo annotation/app-bhsa ... failed
The offline TF-app may not be the latest
Using TF-app in C:\Users\gyuus/text-fabric-data/annotation/app-bhsa/code:
	rv1.0=#d3cf8f0c2ab5d690a0fda14ea31c33da5c5c8483 (latest? release)
	connecting to online GitHub repo etcbc/bhsa ... failed
The offline data may not be the latest
Using data in C:\Users\gyuus/text-fabric-data/etcbc/bhsa/tf/2017:
	rv1.6=#bac4a9f5a2bbdede96ba6caea45e762fe88f88c5 (latest? release)
	connecting to online GitHub repo etcbc/phono ... failed
The offline data may not be the latest
Using data in C:\Users\gyuus/text-fabric-data/etcbc/phono/tf/2017:
	r1.2 (latest? release)
	connecting to online GitHub repo etcbc/parallels ... failed
The offline data may not be the latest
Using data in C:\Users\gyuus/text-fabric-data/etcbc/parallels/tf/2017:
	r1.2 (latest? release)


The requested data is not available offline
The requested data is not available offline
There were problems with loading data.
The Text-Fabric API has not been loaded!
The app "bhsa" will not work!


## Choose a corpus or text 

The selection has to be a chapter or corpus that has been annotated for coreference. The following texts are annotated:

1. Psalms 
2. Genesis 1
3. Numbers 8-10
4. Isaiah 42:1-25

In [4]:
# Set any Hebrew Bible Book
MY_BOOK = {'Psalms'} 

# Set any range in chapters of specified HB book
MY_CHAPTERS = set(range(1,151))
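
As an illustration, one of the other annotated texts can be selected in the same way. The sketch below sets up Numbers 8-10 and applies the same membership test that the later cells use (`book_name in MY_BOOK and chapter in MY_CHAPTERS`). The names `example_book`, `example_chapters` and `in_selection` are hypothetical and exist only for this example; they are not part of the notebook or of Text-Fabric.

```python
# Hypothetical selection for Numbers 8-10, one of the annotated texts listed above.
example_book = {'Numbers'}
# range() excludes its upper bound, so 11 is needed to include chapter 10.
example_chapters = set(range(8, 11))


def in_selection(book_name, chapter, books, chapters):
    """Return True when a (book, chapter) pair falls inside the chosen corpus."""
    return book_name in books and chapter in chapters


print(in_selection('Numbers', 10, example_book, example_chapters))
print(in_selection('Numbers', 11, example_book, example_chapters))
print(in_selection('Psalms', 8, example_book, example_chapters))
```

Using different variable names here avoids overwriting `MY_BOOK` and `MY_CHAPTERS` if the block is pasted into the notebook.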

In [5]:
name_types = {
    'pers': 'pers',
    'mens': 'mens',
    'gens': 'gens',
    'topo': 'topo',
    'ppde': 'ppde',
    'pers,gens,topo': 'pers,gens,topo',
    'pers,gens': 'pers,gens',
    None: '-',
    'NA': '-',
    ' ': '-',
}

In [6]:
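def parseParticipant(node, fName):
    '''Parse the value of the coref, mention, or mentionNote feature of a node.

    The raw value is a '|'-separated list of parts, each of the form
    ct,seqNum,wordSize,isSuffix,wordPart.
    Returns {ct: {seqNum: (wordSize, isSuffix, wordPart)}},
    or None when the node has no value for the feature.
    '''

    valueStr = Fs(fName).v(node)
    if not valueStr:
        return None
    parts = valueStr.split('|')
    valueData = {}
    for part in parts:
        # maxsplit=4 keeps any further commas inside wordPart intact
        (ct, seqNum, wordSize, isSuffix, wordPart) = part.split(',', maxsplit=4)
        # an empty wordSize defaults to 1; the isSuffix field is 's' for suffixes
        valueData.setdefault(ct, {})[seqNum] = (int(wordSize or 1), isSuffix == 's', wordPart)
    return valueData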

# Prepare all coreference and singleton information

In [7]:
coref_tuples = []
singleton_tuples = []

for book in F.otype.s('book'):
    for chn in L.d(book, 'chapter'):
        chapter = F.chapter.v(chn)
        book_name = T.bookName(book)
        
        if book_name in MY_BOOK and chapter in MY_CHAPTERS:
           
            for word in L.d(chn, 'word'):

                nametype = F.nametype.v(L.u(word, 'lex')[0])

                # load all needed features 
                coref = parseParticipant(word, 'coref')

                if not coref and nametype:
                    lex_text = T.text(word, fmt='lex-trans-plain')
                    error(f'NO COREF for NAMETYPE {nametype}: {lex_text} ({word})')

                # extract coref feature 
                if coref:
                    #print(coref)
                    vt = F.vt.v(word)
                    person = F.ps.v(word)
                    st = F.st.v(word)
                    lex_text = T.text(word, fmt='lex-trans-plain')                       
                    bo, ch, vs = T.sectionFromNode(word)
                    text_pos = str(ch)+'.'+str(vs)
                    pdpos = F.pdp.v(word)
                    phrase_type = F.typ.v(L.u(word, 'phrase')[0])
                    det = F.det.v(L.u(word, 'phrase')[0])

                    for (ct, ctData) in coref.items():
                        dest = singleton_tuples if ct == 'T' else coref_tuples
                        for (seqNum, (wordSize, isSuffix, wordPart)) in ctData.items():
                            #print(seqNum, (wordSize, isSuffix, wordPart))

                            cls = f'{ct}{seqNum}'
                            text = wordPart if isSuffix else lex_text
                            # fall back to the raw value for name types missing from name_types
                            ne = '-' if isSuffix else name_types.get(nametype, nametype or '-')
                            pgn = converse_pgn_suffix(wordPart) if isSuffix else converse_pgn(F, word)
                            pdpos = 'suffix' if isSuffix else F.pdp.v(word)
                            pt = '-' if isSuffix else phrase_type
                            dt = 'det' if isSuffix else det
                            dest.append((
                                    cls,
                                    text,
                                    ne,
                                    pgn,
                                    text_pos,
                                    pdpos,
                                    st,
                                    pt,
                                    vt,
                                    dt,
                            ))                           
#pprint(singleton_tuples)
#pprint(coref_tuples)

 1m 02s NO COREF for NAMETYPE pers: JHWH/  (310675)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (310794)
 1m 02s NO COREF for NAMETYPE pers: >BCLWM/  (310860)
 1m 02s NO COREF for NAMETYPE pers: KWC=/  (311281)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (311454)
 1m 02s NO COREF for NAMETYPE topo: YJWN==/  (311702)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (312121)
 1m 02s NO COREF for NAMETYPE pers,gens,topo: JFR>L/  (312292)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (312649)
 1m 02s NO COREF for NAMETYPE pers: C>WL=/  (312674)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (312889)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (312970)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (313255)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (313260)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (313265)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (313270)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (313275)
 1m 02s NO COREF for NAMETYPE pers: JHWH/  (313281)
 1m 02s NO COREF for NAMETYPE pers: J<QB/  (3133

 1m 05s NO COREF for NAMETYPE pers: QRX==/  (325467)
 1m 05s NO COREF for NAMETYPE topo: YJWN==/  (325477)
 1m 05s NO COREF for NAMETYPE pers: J<QB/  (325481)
 1m 05s NO COREF for NAMETYPE pers: QRX==/  (325537)
 1m 05s NO COREF for NAMETYPE pers: JHWH/  (325728)
 1m 05s NO COREF for NAMETYPE pers,gens,topo: JFR>L/  (325910)
 1m 05s NO COREF for NAMETYPE pers: >DNJ/  (326398)
 1m 05s NO COREF for NAMETYPE pers: CDJ/  (326416)
 1m 05s NO COREF for NAMETYPE pers: JHWH/  (326686)
 1m 05s NO COREF for NAMETYPE pers: J<QB/  (326812)
 1m 05s NO COREF for NAMETYPE pers: JHWH/  (327019)
 1m 05s NO COREF for NAMETYPE topo: MSH===/  (327044)
 1m 05s NO COREF for NAMETYPE pers: JHWH/  (327216)
 1m 05s NO COREF for NAMETYPE pers: JHWH/  (327274)
 1m 05s NO COREF for NAMETYPE pers,gens,topo: JHWDH/  (327312)
 1m 05s NO COREF for NAMETYPE pers: JHWH/  (327330)
 1m 05s NO COREF for NAMETYPE pers,gens,topo: JFR>L/  (327389)
 1m 06s NO COREF for NAMETYPE pers: JHWH/  (327446)
 1m 06s NO COREF for NAMET

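# View of the coref feature after parsing

After parsing, the `coref` value of a word is a dictionary keyed by class type (`T` for singletons, `C` for coreference classes) and then by sequence number. Each entry is a tuple `(wordSize, isSuffix, wordPart)`.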

In [None]:
{'T': {'44': (1, False, '<T')}, 
 'C': {'2': (1, True, 'W')}
}
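
To make the parsing step concrete without loading Text-Fabric, here is a minimal standalone sketch of the same splitting logic as `parseParticipant`. The function name `parse_coref_value` is hypothetical, and the raw string is an assumption: it is reconstructed from the parsed example above and the field order in the code, not copied from the data.

```python
def parse_coref_value(value_str):
    """Parse a raw feature string into {classType: {seqNum: (size, isSuffix, part)}}."""
    if not value_str:
        return None
    value_data = {}
    for part in value_str.split('|'):
        # Field order follows parseParticipant: ct, seqNum, wordSize, isSuffix, wordPart.
        ct, seq_num, word_size, is_suffix, word_part = part.split(',', maxsplit=4)
        value_data.setdefault(ct, {})[seq_num] = (
            int(word_size or 1),  # an empty size field counts as 1
            is_suffix == 's',     # 's' marks a pronominal suffix
            word_part,
        )
    return value_data


# Assumed raw value, reconstructed from the parsed example above.
raw = 'T,44,1,,<T|C,2,1,s,W'
parsed = parse_coref_value(raw)
print(parsed)
```

Under that assumption the result matches the dictionary shown above: one singleton `T44` on the word itself and one class member `C2` on its suffix.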

# Make coref sets

Here the coref sets are built per chapter, and the number of unique coreference relations (classes) is counted.


In [8]:
info('Counting ... \n')

my_mention_counter = 0
my_mention_note_counter = 0
my_appo_counter = 0
my_entity_counter = 0

my_coref_sets = []
my_classes_count = [] 
sets_dict = {}  # a defaultdict without a factory behaves like a plain dict

for book in F.otype.s('book'):
    for chn in L.d(book, 'chapter'):
        chapter = F.chapter.v(chn)
        book_name = T.bookName(book)
        #bo, ch, vs = T.sectionFromNode(chn)
        
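        corefSet = set()

        if book_name in MY_BOOK and chapter in MY_CHAPTERS:
            # Register the set only for selected chapters, so that same-numbered
            # chapters of other books cannot overwrite it.
            sets_dict[chapter] = corefSet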

            for phr_atom in L.d(chn, 'phrase_atom'):
                relas = F.rela.v(phr_atom)
                if relas == 'Appo':
                    my_appo_counter +=1

            for word in L.d(chn, 'word'):
                mentions = F.mention.v(word)
                mention_notes = F.mentionNote.v(word)
                nametypes = F.nametype.v(L.u(word, 'lex')[0])

                if nametypes:
                    my_entity_counter += 1

                if mention_notes: 
                    my_mention_note_counter +=1

                if mentions:
                    my_mention_counter +=1

                # load all needed features 
                coref = parseParticipant(word, 'coref')

                # extract coref feature 
                if coref:
                    for (ct, ctData) in coref.items():
                        if ct == 'T':
                            continue
                        for (seqNum, (wordSize, isSuffix, wordPart)) in ctData.items():
                            cls = f'{ct}{seqNum}'
                            corefSet.add(cls)

        my_coref_sets.append(corefSet)

#print(my_coref_sets)

for items in my_coref_sets:

    lens = len(items)
    my_classes_count.append(lens)

print('There are {} participant classes in your corpus: \n\n {}\n{}'.format(sum(my_classes_count),MY_BOOK,MY_CHAPTERS))

#print(sets_dict)

 1m 12s Counting ... 

There are 2001 participant classes in your corpus: 

 {'Psalms'}
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150}
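
The counting logic of the cell above can be sketched on toy data, independent of Text-Fabric. The function name `count_classes` and the toy dictionaries are made up for illustration; only the rule itself (collect `ct + seqNum` labels per chapter, skip `T`, sum the set sizes) is taken from the notebook code.

```python
def count_classes(parsed_by_chapter):
    """Collect distinct non-singleton class labels per chapter and sum their counts."""
    per_chapter = {}
    for chapter, parsed_values in parsed_by_chapter.items():
        labels = set()
        for coref in parsed_values:
            for ct, ct_data in coref.items():
                if ct == 'T':  # singletons are not coreference classes
                    continue
                for seq_num in ct_data:
                    labels.add(f'{ct}{seq_num}')
        per_chapter[chapter] = labels
    total = sum(len(labels) for labels in per_chapter.values())
    return per_chapter, total


# Made-up toy data, not taken from the corpus.
toy = {
    1: [{'C': {'1': (1, False, 'X')}},
        {'C': {'1': (1, True, 'W')}, 'T': {'3': (1, False, 'Y')}}],
    2: [{'C': {'1': (1, False, 'Z')}},
        {'C': {'2': (1, False, 'Q')}}],
}
per_chapter, total = count_classes(toy)
print(per_chapter)
print(total)
```

Because the sets are kept per chapter and their sizes are summed, a label such as `C1` that occurs in two chapters is counted twice, just as in the cell above.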


# Count all coreference features for all coreference-annotated texts


In [9]:
info('Counting all annotated mentions, classes and mention notes...\n')

total_mentions = 0
total_mention_notes = 0
total_appositions = 0
total_entities = 0
coref_counter_set = []
total_classes = []

for book in F.otype.s('book'):
    for chn in L.d(book, 'chapter'):
        chapter = F.chapter.v(chn)
        book_name = T.bookName(book)
        corefSets = set()
        
        for phr_atom in L.d(chn, 'phrase_atom'):
            relas = F.rela.v(phr_atom)
            if relas == 'Appo':
                total_appositions +=1

        for word in L.d(chn, 'word'):
            mentions = F.mention.v(word)
            mention_notes = F.mentionNote.v(word)
            nametypes = F.nametype.v(L.u(word, 'lex')[0])
            
            if nametypes:
                total_entities += 1

            if mention_notes: 
                total_mention_notes +=1

            if mentions:
                total_mentions +=1
            
            # load all needed features 
            corefs = parseParticipant(word, 'coref')
            
            # extract coref feature 
            if corefs:
                for (ct, ctData) in corefs.items():
                    if ct == 'T':
                        continue
                    for (seqNum, (wordSize, isSuffix, wordPart)) in ctData.items():
                        cls = f'{ct}{seqNum}'
                        corefSets.add(cls)
                             
        coref_counter_set.append(corefSets)

for cl in coref_counter_set:
    lengths = len(cl)
    total_classes.append(lengths)
info('Counting is done!')

 1m 18s Counting all annotated mentions, classes and mention notes...

 1m 26s Counting is done!


# Print the counts

In [10]:
percent_mentions = round((my_mention_counter/total_mentions)*100,1)
percent_participant_classes = round((sum(my_classes_count)/sum(total_classes))*100,1)
percent_appositions = round((my_appo_counter/total_appositions)*100,1)
percent_entities = round((my_entity_counter/total_entities)*100,1)
percent_mention_notes = round((my_mention_note_counter/total_mention_notes)*100,1)

print('Your corpus and all coreference-annotated corpora consist of: \n\n \
      {:<5}/{} Mentions\n \
      {:<5}/{} Participant classes\n \
      {:<5}/{} Appositions\n \
      {:<5}/{} Named Entities\n \
      {:<5}/{} Annotator notes on those mentions\n\n'.format(
    my_mention_counter, 
    total_mentions, 
    sum(my_classes_count), 
    sum(total_classes), 
    my_appo_counter, 
    total_appositions, 
    my_entity_counter, 
    total_entities, 
    my_mention_note_counter, 
    total_mention_notes
    ))

print('You are studying: \n\n \
      {}% of all Mentions\n \
      {}% of all Participant classes\n \
      {}% of all Appositions\n \
      {}% of all Named Entities\n \
      {}% of all Annotator notes'.format(
    percent_mentions, 
    percent_participant_classes, 
    percent_appositions, 
    percent_entities,
    percent_mention_notes
    ))

Your corpus and all coreference-annotated corpora consist of: 

       15195/16432 Mentions
       2001 /2209 Participant classes
       131  /5887 Appositions
       1330 /35569 Named Entities
       712  /840 Annotator notes on those mentions


You are studying: 

       92.5% of all Mentions
       90.6% of all Participant classes
       2.2% of all Appositions
       3.7% of all Named Entities
       84.8% of all Annotator notes
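
The percentages follow directly from the count pairs printed above. As a small check, the sketch below recomputes them with a helper that also guards against an empty total; the helper name `percent` is hypothetical, and the numbers are copied from the output above.

```python
def percent(part, whole, digits=1):
    """Return part/whole as a rounded percentage, or None when whole is 0."""
    if not whole:
        return None
    return round(part / whole * 100, digits)


# (selected corpus, all annotated corpora), copied from the output above.
counts = {
    'Mentions': (15195, 16432),
    'Participant classes': (2001, 2209),
    'Appositions': (131, 5887),
    'Named Entities': (1330, 35569),
    'Annotator notes': (712, 840),
}

for label, (mine, total) in counts.items():
    print(f'{mine:<5}/{total:<6} {percent(mine, total)}% of all {label}')
```

The guard matters when the selected corpus or the annotation set is empty, in which case the plain division in the cell above would raise a `ZeroDivisionError`.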


# Print singletons 

A singleton is a mention that forms its own class; it does not share a class with any other mention.

--TO DO: most of the information in the table below is correct, except for the phrase types (`phr_type`). A mention can span several words within a phrase, but it does not have to. The `phr_type` printed now is the type of the phrase that contains the mention, not the type of the mention itself. To illustrate: T1 is a named entity, not a prepositional phrase.

In [23]:
# print singletons 
# !!! check exceptions here.!!!

def print_singletons():
    
    print('{:<10}{:>10}{:>10}{:>12}{:>12}{:>12}{:>10}{:>13}{:>9}{:>10}\n{}'.format(
    'singleton', 'lexeme', 'NE', 'pgn', 'text', 'pdpos', 'st', 'phr_type', 'vt', 'det', '-'*110))
    
    for s in sorted(singleton_tuples, key=itemgetter(0)):
        print('{:<10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10}'.format(*s))

print_singletons()

singleton     lexeme        NE         pgn        text       pdpos        st     phr_type       vt       det
--------------------------------------------------------------------------------------------------------------
T1            DWD==/        pers      NAmsg      138.1       nmpr          a         PP         NA        det
T10            HJKL/           -      NAmsg      138.2       subs          c         PP         NA        det
T13              CM/           -      NAmsg      138.2       subs          a         PP         NA        det
T17             >MT/           -      NAfsg      138.2       subs          a         PP         NA        det
T20              KL/           -      NAmsg      138.2       subs          c         PP         NA        det
T22            >MRH/           -      NAfsg      138.2       subs          a         NP         NA        det
T24             JWM/           -      NAmsg      138.3       subs          c         PP         NA        und
T30       
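
One thing to be aware of: `sorted(..., key=itemgetter(0))` compares the class labels as strings, so `T10` sorts before `T2`. If numeric order is wanted, a sort key along the following lines could be used. This is a sketch that assumes every label is a letter prefix followed by digits, as in the table above; the name `class_sort_key` is hypothetical.

```python
import re


def class_sort_key(label):
    """Split a label such as 'T10' into ('T', 10) so the number sorts numerically."""
    match = re.fullmatch(r'([A-Za-z]+)(\d+)', label)
    if match is None:
        # Labels that do not fit the assumed pattern sort by their text alone.
        return (label, 0)
    return (match.group(1), int(match.group(2)))


labels = ['T10', 'T2', 'T1', 'C12', 'C3']
print(sorted(labels))                      # plain string order
print(sorted(labels, key=class_sort_key))  # prefix first, then number
```

Applied to the tuples of this notebook, this would read `sorted(singleton_tuples, key=lambda t: class_sort_key(t[0]))`.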

# Print all participant classes

--TO DO: as with the singletons, most of the information in the table below is correct, except for the phrase types (`phr_type`). A mention can span several words within a phrase, but it does not have to. The `phr_type` printed now is the type of the phrase that contains the mention, not the type of the mention itself. The fix is to isolate the head of the phrase.

In [24]:
#print participant classes
# !!! check exceptions here.!!!

def print_participant_classes():
    
    print('{:<10}{:>10}{:>10}{:>12}{:>12}{:>12}{:>10}{:>13}{:>9}{:>10}\n{}'.format(
    'class', 'lexeme', 'NE', 'pgn', 'text', 'pdpos', 'st', 'phr_type', 'vt', 'det', '-'*110))
    
    for c in sorted(coref_tuples, key=itemgetter(0)):
        print('{:<10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10} {:>10}'.format(*c))
        #print((c[3]),type(c[3]))
print_participant_classes()


class         lexeme        NE         pgn        text       pdpos        st     phr_type       vt       det
--------------------------------------------------------------------------------------------------------------
C1              JDH[           -      p1usg      138.1       verb         NA         VP       impf         NA
C1                  J          -      p1usg      138.1     suffix          a          -         NA        det
C1              ZMR[           -      p1usg      138.1       verb         NA         VP       impf         NA
C1              XWH[           -      p1usg      138.2       verb         NA         VP       impf         NA
C1              JDH[           -      p1usg      138.2       verb         NA         VP       impf         NA
C1              QR>[           -      p1usg      138.3       verb         NA         VP       perf         NA
C1                 NJ          -      p1usg      138.3     suffix         NA          -       wayq        det
C1        
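
The flat table above lists one row per mention. To work with whole classes, the rows can be grouped by their first field. The sketch below shows one possible way; the function name `group_by_class` is hypothetical, the `C1` rows are abbreviated from the table above, and the `C2` row is made up for the example.

```python
from collections import defaultdict


def group_by_class(tuples):
    """Map each class label to the list of its mention rows, in input order."""
    chains = defaultdict(list)
    for row in tuples:
        chains[row[0]].append(row)  # row[0] is the class label, e.g. 'C1'
    return dict(chains)


# Abbreviated rows: (class, lexeme, NE, pgn, text position).
rows = [
    ('C1', 'JDH[', '-', 'p1usg', '138.1'),
    ('C1', 'J', '-', 'p1usg', '138.1'),
    ('C2', 'X', '-', 'p3msg', '138.2'),  # made-up row, for illustration only
    ('C1', 'ZMR[', '-', 'p1usg', '138.1'),
]

chains = group_by_class(rows)
for cls, members in chains.items():
    print(cls, len(members), [m[1] for m in members])
```

A dictionary keeps the rows of a class together even when they are not adjacent in the input, so no prior sorting is needed.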

### 1. Print all coreference chains
* with all words
* with all POS