# Classifying Time Phrases

This notebook seeks to establish a taxonomy of time phrases in Biblical Hebrew that is as comprehensive as possible. The `Construction` object is the starting point for the analysis. We already have a set of `Construction` objects (henceforth simply "cx") that have been preprocessed based on their subphrase grammar. These subphrases allow us to select subsets of the data and to label the time phrases.

In [1]:
import collections
import pickle
import copy
import random
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
from helpers import barplot_counts, convert2pandas
from tf_tools.load import load_tf
from tf_tools.tokenizers import tokenize_surface
from cx_analysis.cx import Construction
from cx_analysis.build import CXbuilder
from cx_analysis.search import SearchCX
from positions import Positions
from locations import cxs as cx_data

TF, api, A = load_tf()
F, E, T, L = api.F, api.E, api.T, api.L

with open(cx_data, 'rb') as infile:
    cx_load = pickle.load(infile)
    phrase2cxs = cx_load['phrase2cxs']
    
se = SearchCX(A)
A.displaySetup(condenseType='phrase', withNodes=True)

This is Text-Fabric 7.8.12
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

119 features found and 6 ignored
  0.00s loading features ...
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
  6.55s All features loaded/computed - for details use loadLog()


# Dataset

The current cx dataset excludes time phrases that have gaps inside. These will be analyzed at a later stage due to their complexity. Let's get a sense for how many there are and what is included in the analysis set. The `timephrase` object is a custom object built from the ETCBC phrase object. It makes several corrections as well as fusions of the time phrases. The `timephrase` object is what the Construction classes are built upon.

In [2]:
all_times = A.search('timephrase', shallow=True)

  0.00s 3881 results


In [3]:
analyzed_times = set(phrase2cxs.keys())

In [4]:
unanalyzed_times = collections.Counter()

for time in all_times - analyzed_times:
    surface = tokenize_surface(time, api)
    unanalyzed_times[surface] += 1
    
print(sum(unanalyzed_times.values()), 'times not analyzed...')
print()
print("summary:")

unanalyzed_times = convert2pandas(unanalyzed_times)

unanalyzed_times

17 times not analyzed...

summary:


Unnamed: 0,Total
ל.מן.ה.יום.ו.עד.ה.יום.ה.זה,1
ב.שׁלושׁה.עשׂר.יום.בו.ב.ה.יום,1
ב.יום.ו.עד.ה.יום.ה.זה,1
כ.ימות.שׁנות,1
מ.ימי.ה.שׁפטים.ו.כל.ימי.מלכי.ישׂראל.ו.מלכי.יהודה,1
תמיד.מ.רשׁית.ה.שׁנה.ו.עד.אחרית.שׁנה,1
ל.מן.ה.יום.עד.ה.יום.ה.זה,1
מן.ה.יום.ו.עד.ה.יום.ה.זה,1
ל.מ.יום.ו.עד.ה.יום.ה.זה,1
ב.ה.שׁנה.ה.ראשׁונה.ב.ה.חדשׁ.ה.ראשׁון,1


# Basic Exploration

The most basic way to cluster time phrases is by surface form. What are the most common types?

In [5]:
analyzed_time_forms = collections.Counter()

for time in analyzed_times:
    surface = tokenize_surface(time, api)
    analyzed_time_forms[surface] += 1
    
analyzed_time_forms = convert2pandas(analyzed_time_forms)

In [6]:
print(f'{analyzed_time_forms.shape[0]} unique surface forms found')

1150 unique surface forms found


In [7]:
top = 20
print(f'showing top {top} surface forms')
analyzed_time_forms.head(top)

showing top 20 surface forms


Unnamed: 0,Total
עתה,342
ב.ה.יום.ה.הוא,203
ה.יום,191
ל.עולם,85
ב.ה.בקר,78
עד.ה.יום.ה.זה,71
ב.יום,69
אז,66
שׁבעת.ימים,63
עד.עולם,53


This top list accounts for a substantial proportion (roughly 41%) of all known time adverbials in the dataset:

In [8]:
print(f'ratio of times accounted for in top {top}:')
analyzed_time_forms.head(top).sum().iloc[0] / len(all_times)  # .iloc makes the positional lookup explicit

ratio of times accounted for in top 20:


0.413295542385983
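
As a sanity check on the arithmetic, the same kind of ratio can be recomputed by hand. The sketch below uses only the ten rows visible in the table above (not the full top 20), so it comes out lower than the figure just printed. The variable names are illustrative and not part of the notebook's helpers.

```python
import collections

# counts copied from the ten rows displayed above (not the full top 20)
visible = collections.Counter({
    'עתה': 342, 'ב.ה.יום.ה.הוא': 203, 'ה.יום': 191, 'ל.עולם': 85,
    'ב.ה.בקר': 78, 'עד.ה.יום.ה.זה': 71, 'ב.יום': 69, 'אז': 66,
    'שׁבעת.ימים': 63, 'עד.עולם': 53,
})
n_all_times = 3881  # number of timephrase results reported by A.search

# sum the counts of the ten most frequent forms and divide by the total
covered = sum(count for form, count in visible.most_common(10))
ratio = covered / n_all_times
print(f'{covered} of {n_all_times} time phrases ({ratio:.1%})')
```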

# Formal Taxonomy, Dividing the Times

**A time adverbial is defined as any construction that modifies event time.** The construction may be a word, phrase, or even clause. This project is focused on word and phrase level time adverbials. The time adverbials can be divided into two main forms: single-phrase and multi-phrase.

**Single-phrase time adverbials contain a single _profiled_ time word.** The "profiled" word is the head of the phrase, following Croft's model of headship as "the primary information bearing unit" (2001: 257ff). In a time adverbial, the head is typically a specialized term that indicates time, though not always (e.g. event nouns). Besides the head, single-phrase adverbials can contain other words that modify the head. They come in prepositional and non-prepositional varieties. Note that under Croft's semantic definition of headship, the object of the preposition, not the preposition itself, is the head of the phrase.

**Multi-phrase time adverbials contain two or more profiled time elements that are coordinated.** This coordination can be literal, e.g. with ו, or take various appositional forms, e.g. when multiple prepositions are "stacked" to locate a time within a specific position. Multi-phrase time adverbials appear with any combination of prepositional and non-prepositional forms.

The basic taxonomy looks like so:

```
single-phrase
|     |
|     prepositional
|     |
|     non-prepositional
|
multi-phrase
      |
      prep/non-prep combinations
```
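
For a quick way to query this tree, it can be held in a plain dict of parent to children. This is only a sketch: the root name `time` and the helper `path_to` are assumptions made for illustration, and the node names simply mirror the drawing above.

```python
# parent -> children; node names follow the tree drawn above
taxonomy = {
    'time': ['single-phrase', 'multi-phrase'],
    'single-phrase': ['prepositional', 'non-prepositional'],
    'multi-phrase': ['prep/non-prep combinations'],
}

def path_to(node, tree, root='time'):
    """Return the list of nodes from root down to node, or None."""
    if root == node:
        return [root]
    for child in tree.get(root, []):
        sub = path_to(node, tree, child)
        if sub:
            return [root] + sub
    return None

print(path_to('prepositional', taxonomy))
```

A path of this kind is one way to read a list of class labels as a position in the taxonomy.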

In [9]:
# TO-DO: Generate taxonomy from tags directly

# # build up taxonomy as a directed graph
# taxonomy = nx.DiGraph((
#     ('time', 'single'),
#     ('time', 'multi'),
#     ('single', 'øprep'),
#     ('single', 'prep'),
#     ('øprep', 'bare'),
#     ('prep', 'bare'),
# ))

## A Deductive and Inductive Classification Process

For classifying the current set of time adverbials, we will utilize a process of elimination. That deductive process is aided by the inductive analysis of time adverbial surface form data. In other words, the categories outlined above and to be outlined further below have been identified by looking at the quantities of the surface form counts to see which categories seem to exert influence. The goal is to be guided by the data, but at the same time derive categories which are useful for collocation research.

### Matching (`CXbuilder`) and Searching (`SearchCX`)

The `CXbuilder` class provides methods for testing any number of conditions on a provided element. It can then modify any matched CX, or compile it into a new `Construction` object.

The tools provided by `SearchCX` can then scan the time adverbials for matches based on the `CXbuilder`'s rules.
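
To make the idea of condition testing concrete, here is a minimal sketch. It assumes that a case matches when all of its conditions are truthy, and it uses the same `conds` dict layout as the builder methods later in this notebook. The function `match_cases` is hypothetical and is not the library's own `test` method.

```python
def match_cases(*cases):
    """Return every case whose conditions all hold.

    Assumption: a case is a dict with a 'conds' mapping of
    human-readable description -> truth value.
    """
    return [case for case in cases if all(case['conds'].values())]

phrase_name = 'prep_ph'  # hypothetical input
matches = match_cases(
    {'class': ['prep'], 'conds': {'name == prep_ph': phrase_name == 'prep_ph'}},
    {'class': ['øprep'], 'conds': {'name != prep_ph': phrase_name != 'prep_ph'}},
)
print([m['class'] for m in matches])
```

Keeping each condition under a descriptive key makes it easy to inspect afterwards which test failed for a given CX.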

### Surface form counting

Surface forms are counted by first being stripped of accentuation, then tokenized along their lexical boundaries, and finally joined on periods. The most frequent forms feed the inductive side of the process.
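
As a rough illustration of that pipeline, the sketch below strips accents from units that are already split along lexical boundaries and joins them on periods. It is not the project's `tokenize_surface`, which operates on Text-Fabric nodes. The name `surface_form` and the choice of the Unicode range U+0591 to U+05AF for accents are assumptions.

```python
import re

# Assumption: "accentuation" means the Hebrew cantillation marks,
# which occupy the Unicode range U+0591..U+05AF.
ACCENTS = re.compile('[\u0591-\u05AF]')

def surface_form(lexical_units):
    """Strip accents from each lexical unit and join the units on periods.

    Hypothetical stand-in for tokenize_surface.
    """
    return '.'.join(ACCENTS.sub('', unit) for unit in lexical_units)

# the units are assumed to be already split along lexical boundaries
print(surface_form(['עד', 'ה', 'יום', 'ה', 'זה']))
```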

### Keeping Track

We maintain a set of constructions which are and are not accounted for as we build and match the conditions.
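
The bookkeeping reduces to set arithmetic. Here is a minimal sketch with hypothetical string identifiers standing in for `Construction` objects:

```python
# hypothetical CX identifiers; real entries would be Construction objects
all_cxs = {'cx1', 'cx2', 'cx3', 'cx4'}
class2cxs = {
    'single': {'cx1', 'cx2', 'cx3', 'cx4'},  # base class, covers everything
    'bare': {'cx1'},
    'definite': {'cx2', 'cx3'},
}
structural = {'single'}  # tags that do not count as a classification

# a CX is "found" once it carries at least one non-structural tag
found = set()
for label, cxs in class2cxs.items():
    if label not in structural:
        found |= cxs

remaining = all_cxs - found
print(f'{len(found)} classified, {len(remaining)} remaining')
```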

# Classification

We put together a custom `CXbuilder` for labeling the CXs. For single-phrase constructions, we simply add an attribute to each CX object: `classification`. The attribute is a list of class labels that correspond to a position in the taxonomy tree.

**For single-phrase adverbials, the `CXbuilder` will simply add a classification tag, while a separate builder will instead combine components of multi-phrase constructions into a single analyzed form.**
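
The tagging step can be pictured as follows. This is a sketch only: `FakeCX` and `add_labels` are hypothetical stand-ins, and the real `Construction` class carries much more state.

```python
import collections

class FakeCX:
    """Hypothetical stand-in for a Construction object."""
    def __init__(self, name):
        self.name = name

def add_labels(cx, labels):
    """Append class labels to cx.classification, creating it if needed."""
    cx.__dict__.setdefault('classification', []).extend(labels)
    return cx

cx = FakeCX('prep_ph')
add_labels(cx, ['single'])
add_labels(cx, ['prep'])
add_labels(cx, ['definite'])

# index the tagged CX under each of its labels
class2cx = collections.defaultdict(set)
for label in cx.classification:
    class2cx[label].add(cx)

print(cx.classification)
```

Because the labels accumulate in the order the rules fire, the order of the rules matters.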

### Copy and Track Covered Times

In [10]:
# build up taxonomy and keep track of todo-cxs
class Tracker:
    """A class for tracking tagged Constructions"""
    
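    def __init__(self, classdict, cxset, 
                 exclude=None,
                 base='single'
                ):
        """Initialize Tracker.
        
        Args:
            classdict: dict of class string to set of
                classified CX objects
            cxset: a set of all CXs that are analyzed
                (NB: currently unused; the set stored under
                `base` in classdict serves as the full set)
            exclude: a set of class tags to ignore in 
                calculations of remaining classes
            base: the class tag whose members make up
                the full set of CXs to be classified
        """
        self.classdict = classdict
        self.cxset = classdict[base]
        # avoid a shared mutable default argument
        self.exclude = set(exclude or ()) | {base}
        self.setselect = SetSelection(classdict) # select overlapping sets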
        
    def tally_classes(self):
        """Return a Counter on classes"""
        count = collections.Counter()
        for cl, cxset in self.classdict.items():
            count[cl] += len(cxset)
        return convert2pandas(count)
        
    def get_found(self):
        """Get classified CXs
        
        !!NB!! Currently we wrap tagged cxs in a tuple so that
        the found cxs can be compared with the original dataset
        which is wrapped cxs. This works fine when dealing with 
        single phrasal cxs. But a better mapping solution will be 
        needed for multi-phrasal classification. Should probably use
        the timephrase node number then. But may want to also build
        a sanity check to make sure the whole tuple gets covered.
        """
        return set(
            cx for classname, cxs in self.classdict.items()
                for cx in cxs if classname not in self.exclude
        )
        
    def get_remaining(self):
        """Get CXs not yet classified."""
        found = self.get_found()
        return self.cxset - found
        
    def remaining_data(self):
        """Make a count dict of all remaining forms"""
        remaining = self.get_remaining()
        count = collections.Counter()
        form2cxs = collections.defaultdict(set)
        for cx in remaining:
            slots = cx.slots
            surface = tokenize_surface(slots, api) 
            count[surface] += 1
            form2cxs[surface].add(cx)
        return (count, form2cxs)
        
    def remaining_forms(self):
        """Retrieve a sorted count of remaining CX surface forms"""
        count,x = self.remaining_data()
        return convert2pandas(count)
    
    def see_remaining(self, forms, end=10, shuffle=False, **tf_kwargs):
        """Display remaining cxs that are fed in"""
        x,form2cxs = self.remaining_data()
        cxs = list(
            cx for form in forms
                for cx in form2cxs[form]
        )
        if shuffle:
            random.shuffle(cxs)
        for cx in cxs[:end]:
            se.showcx(cx, **tf_kwargs)
        return cxs
    
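    def percent(self, n1, total):
        """Calculate n1 as a whole-number percentage of total"""
        if not total:
            return 0.0  # nothing to count yet; avoid ZeroDivisionError
        # scale before rounding to avoid float noise such as 57.99999999999999
        return round(100 * n1 / total, 0)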
        
    def prog(self, head=10):
        """Report progress dynamically."""
        
        # report progress
        to_do = len(self.get_remaining())
        done = len(self.get_found())
        done_progress = self.percent(done, done+to_do)
        todo_progress = self.percent(to_do, done+to_do)
        print(f'{done_progress}% ({done}) classified')
        print(f'{todo_progress}% ({to_do}) unclassified')
        
        # report class counts
        print()
        print('Class counts:')
        class_counts = self.tally_classes()
        display(class_counts)
        
        # report forms of unclassified CXs
        print()
        remain_forms = self.remaining_forms()
        print(f'Top {head} unclassified surface forms') 
        display(remain_forms.head(head))

In [11]:
class SetSelection:
    """Get sets of CXs based on intersecting sets"""
    def __init__(self, setdict):
        """Initialize.
        
        Args:
            setdict: a dict of string to set mappings
        """
        self.setdict = setdict

    def __getitem__(self, sets):
        """Retrieve overlapping sets.
        
        Args:
            sets: an iterable of strings which are
                the names of the sets to be searched.
        Returns:
            The overlapping set.
        """
        # intersect all named sets at once; a running result must not
        # be re-seeded when an intermediate intersection is empty
        selected = [self.setdict[st] for st in sets]
        if not selected:
            return set()
        return set.intersection(*selected)
    
    def get_union(self, sets):
        """Return a union of the sets"""
        result = set(
            cx for stname, st in self.setdict.items()
                if stname in sets
                for cx in st
        )
        return result       

def show_classes(classes, classtags, exclude=tuple(), 
                 counts=True, view=False,
                 shuffle=False, end=100, head=50, 
                 **tfkwargs,
                 ):
    """Iterate through overlapping sets and count/display their results"""
    cxs = classes[classtags] - classes.get_union(exclude)
    cl_counts = collections.Counter()
    surface2cx = collections.defaultdict(set)
    
    # tokenize cx and count/store it for review
    for cx in cxs:
        surface = tokenize_surface(cx.slots, api)
        cl_counts[surface] += 1
        surface2cx[surface].add(cx)
        
    # display counts 
    if counts:
        cl_counts = convert2pandas(cl_counts)
        print(cl_counts.sum().sum(), 'results')
        display(cl_counts.head(head))
        
    # display cxs in class tags
    if view is True:
        cxs = list(cxs)
        if shuffle: 
            random.shuffle(cxs)
        for cx in cxs[:end]:
            se.showcx(cx, **tfkwargs)
        return cxs
            
    # display cxs in an iterable of surface forms
    elif view:
        view_list = [
            cx for surf in view
                for cx in surface2cx[surf]
        ]
        if shuffle:
            random.shuffle(view_list)
        for cx in view_list[:end]:
            se.showcx(cx, **tfkwargs)
        return view_list
    
    else:
        return list(cxs)
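
The selection logic used here (every tag in one list, no tag in another) can be tried out on toy data. The sketch below is independent of the classes above: `select` and the `tagged` mapping are hypothetical.

```python
# hypothetical tag -> members mapping; members stand in for CX objects
tagged = {
    'single': {'a', 'b', 'c', 'd'},
    'cardinal': {'b', 'c'},
    'quantified': {'c'},
}

def select(tagged, include, exclude=()):
    """Members carrying every tag in include and no tag in exclude.

    include must name at least one tag.
    """
    wanted = set.intersection(*(tagged[tag] for tag in include))
    unwanted = set().union(*(tagged[tag] for tag in exclude))
    return wanted - unwanted

print(select(tagged, ('single', 'cardinal'), exclude=('quantified',)))
```

Note the trailing comma in `('quantified',)`: without it, Python reads the parentheses as grouping and the value is a plain string rather than a one-element tuple.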

### CXBuilders

In [12]:
# copy cxs for modification by builder
cx_dataset = set(
    tuple(copy.deepcopy(cxs))  # `cxs` avoids shadowing the imported cx_data path
        for ph, cxs in phrase2cxs.items()
)

class SinglePhrase(CXbuilder):
    """Modify cx classifications for single phrase CXs"""
    
    def __init__(self, cxset, tf):
        CXbuilder.__init__(self) # initialize with standard CXbuilder methods
        
        self.cxset = cxset
        self.api = tf
        self.F, self.L = tf.api.F, tf.api.L
        
        # cx queries
        # NB: order matters!
        self.cxs = (
            self.prep,
            self.bare,
            self.definite,
            self.def_appo,
            self.genitive,
            self.quantified,
            self.adjective,
        )
        self.prereq = self.single
        self.kind = 'time_class'
        
        self.class2cx = collections.defaultdict(set)
        
    def test_result(self, test, *cases):
        """Add class attributes to CX results"""
        if test:
            result = test[-1]
            cx = result['element']
            classi = result['class']
            cx.__dict__.setdefault('classification', []).extend(classi)
            cx.match = result
            cx.conds = result['conds']
            cx.cases = (result,) + cx.cases
            return cx
        else:
            return Construction(cases=cases, **cases[0])
    
    def findall(self, element):
        """Find all results with prerequisite
        
        NB this version of findall only returns
        a single result: the construction object
        itself, since it is modified in-place.
        This version expects cx tuples with
        only one cx.
        """
        results = []
        if self.prereq(element):
            for funct in self.cxs:
                cx = funct(element)
                if cx:
                    results.append(cx)
        if results:
            return results[0] # NB, only 1st matters as all are same obj
        else:
            return None
    
    def label_cxs(self):
        """Run all queries against dataset"""
        for cxtuple in self.cxset:
            cx = self.findall(cxtuple)
            if cx:
                for tag in cx.classification:
                    self.class2cx[tag].add(cx)
    
    def geta(self, item, attrib, default=None):
        """Safely retrieve attribute from object
        
        Some objects in a CX graph are TF integer
        nodes, while most are CX objects. In order
        to safely call attributes on a given position,
        we need to handle attribute errors when called
        on an integer.
        """
        try:
            return item.__dict__[attrib]
        # ints have no __dict__; CX objects may lack the attribute
        except (AttributeError, KeyError):
            return default
    
    def get_headword(self, cx):
        """Get a word that serves as head"""
        head = list(cx.getsuccroles('head'))[-1]
        return head
    
    def get_head_modi(self, head, cx, name, default=Construction()):
        """Retrieve a modifier on a particular head"""
        for c in cx.graph:
            if (self.geta(c,'name') == name) and (head in c):
                return c
        # unsuccessful search
        return default
    
    def single(self, cxtuple):
        """Tag CXs as singles"""
        cx1 = cxtuple[0]
        relas = set(
            self.geta(c,'name') for c in cx1
        )
        bhsa_phrase = L.u(cx1.slots[0], 'phrase')[0]
        attr_cl = E.mother.t(bhsa_phrase)
        
        return self.test(
            {
                'element': cxtuple[0],
                'class': ['single'],
                'kind': self.kind,
                'conds': {
                    'len(cxtuple) == 1':
                        len(cxtuple) == 1,
                    'no apposition in cx':
                        not relas & {'appo'},
                    'no attributive clause on phrase':
                        not attr_cl
                }
            }
        )
    
    def prep(self, cxtuple):
        """Tag prepositional cxs"""
        cx = cxtuple[0]
        return self.test(
            {
                'element': cx,
                'class': ['prep'],
                'kind': self.kind,
                'conds': {
                    'cx.name == prep_ph':
                        cx.name == 'prep_ph',
                }
            },
            {
                'element': cx,
                'class': ['øprep'],
                'conds': {
                    'cx.name != prep_ph':
                        cx.name != 'prep_ph',
                }
            }
        )

    def bare(self, cxtuple):
        """Tag bare, non-modified cxs"""
        F = self.F
        cx = cxtuple[0]
        head_path = list(cx.getsuccroles('head'))
        head = head_path[-1]
        etcbc_phrase = self.L.u(int(head),'phrase')[0]
        
        # two types of units allowed in the path:
        # word cxs or prep_ph
        # trace path to head and collect relations along the way
        cx_name = cx.name if cx.kind != 'word_cx' else cx.kind
        head_phs = {cx_name}
        for c in head_path:
            if self.geta(c,'kind') == 'subphrase':
                head_phs.add(c.name)
            else:
                head_phs.add('word_cx')
        
        prereqs = {
            'head_phs is subset of {word_cx, prep_ph, advb}':
                head_phs.issubset({'word_cx', 'prep_ph', 'advb'}),
            'F.st.v(head) != c':
                F.st.v(int(head)) != 'c',
            'not daughters(etcbc_phrase)':
                not E.mother.t(etcbc_phrase),
        }
        
        return self.test(
            {
                'element': cx,
                'class': ['bare'],
                'kind': self.kind,
                'conds': dict({
                    'F.prs.v(head) in {n/a, absent}':
                        F.prs.v(int(head)) in {'n/a', 'absent'},
                }, **prereqs)
            },
            {
                'element': cx,
                'class': ['suffix'],
                'kind': self.kind,
                'conds': dict({
                    'F.prs.v(head) not in {n/a, absent}':
                        F.prs.v(int(head)) not in {'n/a', 'absent'},
                }, **prereqs)
            },
        )
    
    def definite(self, cxtuple):
        """A definite phrase"""
        cx = cxtuple[0]
        head = self.get_headword(cx)
        def_ph = self.get_head_modi(head, cx, 'defi_ph')
        
        return self.test(
            {
                'element': cx,
                'class': ['definite'],
                'kind': self.kind,
                'conds': {
                    'cx contains defi phrase with head':
                        bool(def_ph)
                }
            }
        
        )
    
    def def_appo(self, cxtuple):
        """Definite apposition"""
        
        F = self.F
        geta = self.geta
        cx = cxtuple[0]
        head = self.get_headword(cx)
        
        # get attribute cx if it contains head word
        att_ph = self.get_head_modi(head, cx, 'attrib_ph')
        
        return self.test(
            {
                'element': cx,
                'class': ['def_apposition'],
                'kind': self.kind,
                'conds': {
                    'cx contains attrib ph with head':
                        bool(att_ph)
                }
            },
            {
                'element': cx,
                'class': ['def_apposition', 'demonstrative'],
                'kind': self.kind,
                'conds': {
                    'cx contains attrib ph with head':
                        bool(att_ph),
                    'apposition contains demonstrative':
                        {'prde', 'prps'} & set(
                            F.pdp.v(w) for w in att_ph.getrole('attrib', Construction()).slots
                        )
                }
            },
            {
                'element': cx,
                'class': ['def_apposition', 'ordinal'],
                'kind': self.kind,
                'conds': {
                    'cx contains attrib ph with head':
                        bool(att_ph),
                    'apposition contains ordinal':
                        'ordn' in set(
                            geta(c,'name') for c in att_ph.graph
                        ),
                }
            },
        )
    
    def genitive(self, cxtuple):
        """Genitive relation on head"""
        cx = cxtuple[0]
        head = self.get_headword(cx)
        geni_ph = self.get_head_modi(head, cx, 'geni_ph')
        geni_items = set(
            self.geta(c, 'name') for c in geni_ph
        )
        return self.test(
            {
                'element': cx,
                'class': ['genitive'],
                'kind': self.kind,
                'conds': {
                    'cx contains geni phrase on head':
                        bool(geni_ph)
                }
            },
            {
                'element': cx,
                'class': ['geni_cardinal'],
                'kind': self.kind,
                'conds': {
                    'cx contains geni phrase on head':
                        bool(geni_ph),
                    
                    'a cardinal is genitive to this word':
                        'card' in geni_items,
                }
            }
        )
    
    def quantified(self, cxtuple):
        """Find quantified time phrases"""
        cx = cxtuple[0]
        head = self.get_headword(cx)
        quant_ph = self.get_head_modi(head, cx, 'numb_ph')
        geta = self.geta
        return self.test(
            {
                'element': cx,
                'class': ['quantified', 'cardinal'],
                'kind': self.kind,
                'conds': {
                    'cx contains numbered phrase on head':
                        bool(quant_ph),
                    
                    'does not contain qualitative quant':
                        'qquant' not in set(
                            geta(c,'name') for c in quant_ph
                        )
                }
            },
            {
                'element': cx,
                'class': ['quantified', 'qualitative'],
                'kind': self.kind,
                'conds': {
                    'cx contains numbered phrase on head':
                        bool(quant_ph),
                    
                    'contains qualitative quant':
                        'qquant' in set(
                            geta(c,'name') for c in quant_ph
                        )
                }
            },
            {
                'element': cx,
                'class': ['cardinal'],
                'kind': self.kind,
                'conds': {
                    'cx.name == card_chain':
                        cx.name == 'card_chain'
                }
            }
        )
    
    def adjective(self, cxtuple):
        """Adjectival modifications via non-definite apposition"""
        cx = cxtuple[0]
        head = self.get_headword(cx)
        adjv_ph = self.get_head_modi(head, cx, 'adjv_ph')
        
        return self.test(
            {
                'element': cx,
                'class': ['adjective'],
                'kind': self.kind,
                'conds': {
                    'cx contains adjectival phrase on head':
                        bool(adjv_ph),
                }
            },
            {
                'element': cx,
                'class': ['demonstrative'],
                'kind': self.kind,
                'conds': {
                    'cx is a demonstrative phrase':
                        cx.name == 'demon_ph',
                }
            }
        )
    
# tag patterns in CXs
sp = SinglePhrase(cx_dataset, A)
sp.label_cxs()
print('done')

done


### Track Progress

In [13]:
track = Tracker(
    sp.class2cx,
    cx_dataset,
    exclude={
        'single', 'multi',
        'prep', 'øprep',
    },
)

track.prog(head=25)

100.0% (3281) classified
0.0% (0) unclassified

Class counts:


Unnamed: 0,Total
single,3281
prep,1895
øprep,1386
definite,1273
bare,1047
def_apposition,633
quantified,564
demonstrative,485
genitive,415
cardinal,356



Top 25 unclassified surface forms


Unnamed: 0,Total


### See remaining forms

In [20]:
# test_ph = phrase2cxs[1450440]
# test_ph

In [14]:
# test = sp.bare(test_ph)
# test.conds

In [31]:
# remaining = track.see_remaining(['ב.דור.אחר'], condenseType='sentence', end=5, shuffle=True)

### See Results

In [18]:
classes = SetSelection(sp.class2cx)

show_cl = show_classes(
    classes,
    ('single', 'cardinal'),
    exclude=('quantified',),  # trailing comma makes this a tuple, not a string
    end=5,
    counts=True,
    #view=['עתה.זה'],
    shuffle=True,
    condenseType='sentence'
)

1 results


Unnamed: 0,Total
שׁלשׁים.ו.שׁלושׁ,1


<hr>

### Scratch Code

## Multi-Phrasals

### To-Do

I have so far written the `SinglePhrase` builder with only single-phrase examples in mind. However, this misses the important fact that many multi-phrasal CXs likewise have single-phrasal component parts. I could rewrite the single-phrase builder to also receive cxs from multi-phrasal items. Yet this approach is complicated by the fact that not all phrases within a multi-phrasal time construction are time-oriented, which would skew the class statistics. For example, if the phrase "למלך" occurs as part of a calendrical CX, it would be counted as a "prepositional" and "definite" time cx, even though it is not itself a time CX, only a part of one. In other cases, the phrase may indeed be able to function as an independent time CX, as in יום ביום. But even this example raises the question of whether the CX can truly be decomposed into those smaller parts.

It is worth considering whether it is better to:

1. utilize the same rules in the single-phrasal builder to tag constituent phrases in multi-phrasal time CXs
2. re-write many rules separately, at the risk of duplicating logic already handled in single-phrases.

Option 1 has the strength of enforcing consistency across all categories, whereas option 2 has the ability to cater solutions specific to multi-phrasal constructions. 

**I lean toward option 2.** There are likely many phrases, specific to certain constructions, that do not contain heads that are lexicalized for time. These need to be defined individually. It might mean that certain lower level patterns are duplicated. But that would also mean that they are available for future restrictions and modifications specific to multi-phrasal constructions. 

A third option would be to pass the `SinglePhrase` builder as an argument to a MultiPhrase builder and reuse only the relevant patterns. This would combine the consistency of option 1 with the flexibility of option 2.
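
A very rough sketch of that third option is given below. Both classes are hypothetical and much simpler than the real builders: `SingleRules` stands in for the single-phrase rules, and `MultiRules` borrows only the rules it is told to reuse.

```python
class SingleRules:
    """Hypothetical stand-in for the SinglePhrase builder's rules."""

    def prep(self, phrase):
        return 'prep' if phrase.get('has_prep') else 'øprep'

    def definite(self, phrase):
        return 'definite' if phrase.get('has_article') else None


class MultiRules:
    """Hypothetical multi-phrase builder that borrows selected rules."""

    def __init__(self, single, reuse=('prep',)):
        # keep only the single-phrase rules named in `reuse`
        self.shared = [getattr(single, name) for name in reuse]

    def label_parts(self, phrases):
        """Apply the borrowed rules to every component phrase."""
        labels = []
        for phrase in phrases:
            tags = [rule(phrase) for rule in self.shared]
            labels.append([tag for tag in tags if tag])
        return labels


multi = MultiRules(SingleRules(), reuse=('prep',))
parts = [{'has_prep': False}, {'has_prep': True, 'has_article': True}]
print(multi.label_parts(parts))
```

Because `definite` is not in `reuse`, the second component is not counted as a definite time CX, which is the kind of selective reuse that would keep the class statistics from being skewed.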