# Classifying time phrases

This notebook seeks to establish a taxonomy of time phrases in Biblical Hebrew that is as comprehensive as possible. The `Construction` object is the starting point for the analysis. We already have a set of `Construction` objects (henceforth simply "cx") that have been preprocessed based on their subphrase grammar. These subphrases allow us to select subsets of the data and to label the time phrases.

In [2]:
import collections
import pickle
import copy
import networkx as nx
import pandas as pd
import matplotlib.pyplot as plt
from helpers import barplot_counts, convert2pandas
from tf_tools.load import load_tf
from tf_tools.tokenizers import tokenize_surface
from cx_analysis.cx import Construction
from cx_analysis.build import CXbuilder
from cx_analysis.search import SearchCX
from positions import Positions
from locations import cxs as cx_data

TF, api, A = load_tf()
F, T, L = api.F, api.T, api.L

with open(cx_data, 'rb') as infile:
    cx_load = pickle.load(infile)
    phrase2cxs = cx_load['phrase2cxs']
    
se = SearchCX(A)
A.displaySetup(condenseType='phrase')

This is Text-Fabric 7.8.12
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

119 features found and 6 ignored
  0.00s loading features ...
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used
  5.49s All features loaded/computed - for details use loadLog()


# Dataset

The current cx dataset excludes time phrases that contain gaps; because of their complexity, these will be analyzed at a later stage. Let's get a sense of how many are excluded and what is included in the analysis set. The `timephrase` object is a custom object built from the ETCBC phrase object, with several corrections and fusions applied to the time phrases. The `Construction` objects are built on top of the `timephrase` object.

In [3]:
all_times = A.search('timephrase', shallow=True)

  0.00s 3881 results


In [4]:
analyzed_times = set(phrase2cxs.keys())

In [5]:
unanalyzed_times = collections.Counter()

for time in all_times - analyzed_times:
    surface = tokenize_surface(time, api)
    unanalyzed_times[surface] += 1
    
print(sum(unanalyzed_times.values()), 'times not analyzed...')
print()
print("summary:")

unanalyzed_times = convert2pandas(unanalyzed_times)

unanalyzed_times

17 times not analyzed...

summary:


Surface form,Total
ל.מן.ה.יום.ו.עד.ה.יום.ה.זה,1
ב.שׁלושׁה.עשׂר.יום.בו.ב.ה.יום,1
ב.יום.ו.עד.ה.יום.ה.זה,1
כ.ימות.שׁנות,1
מ.ימי.ה.שׁפטים.ו.כל.ימי.מלכי.ישׂראל.ו.מלכי.יהודה,1
תמיד.מ.רשׁית.ה.שׁנה.ו.עד.אחרית.שׁנה,1
ל.מן.ה.יום.עד.ה.יום.ה.זה,1
מן.ה.יום.ו.עד.ה.יום.ה.זה,1
ל.מ.יום.ו.עד.ה.יום.ה.זה,1
ב.ה.שׁנה.ה.ראשׁונה.ב.ה.חדשׁ.ה.ראשׁון,1


# Basic Exploration

The most basic clustering for time phrases is their surface forms. What are the most common types?

In [6]:
analyzed_time_forms = collections.Counter()

for time in analyzed_times:
    surface = tokenize_surface(time, api)
    analyzed_time_forms[surface] += 1
    
analyzed_time_forms = convert2pandas(analyzed_time_forms)

In [7]:
print(f'{analyzed_time_forms.shape[0]} unique surface forms found')

1150 unique surface forms found


In [8]:
top = 20
print(f'showing top {top} surface forms')
analyzed_time_forms.head(top)

showing top 20 surface forms


Surface form,Total
עתה,342
ב.ה.יום.ה.הוא,203
ה.יום,191
ל.עולם,85
ב.ה.בקר,78
עד.ה.יום.ה.זה,71
ב.יום,69
אז,66
שׁבעת.ימים,63
עד.עולם,53


This top-20 list accounts for about 41% of all time adverbials in the dataset:

In [9]:
print(f'ratio of times accounted for in top {top}:')
analyzed_time_forms.head(top).sum().iloc[0] / len(all_times)  # .iloc for explicit positional access

ratio of times accounted for in top 20:


0.413295542385983

# Formal Taxonomy, Dividing the Times

**A time adverbial is defined as any construction that modifies event time.** The construction may be a word, phrase, or even clause. This project is focused on word and phrase level time adverbials. The time adverbials can be divided into two main forms: single-phrase and multi-phrase.

**Single-phrase time adverbials contain a single _profiled_ time word.** The "profiled" word is the head of the phrase, following Croft's model of headship as "the primary information bearing unit" (2001: 257ff). In a time adverbial, the head is typically, though not always, a specialized term that indicates time (event nouns are one exception). Besides the head, single-phrase adverbials can contain other words that modify the head. There are prepositional and non-prepositional varieties of single-phrase adverbials. Note that in semantic headship as defined by Croft, it is the object of the preposition, not the preposition itself, that is considered the head of a phrase.

**Multi-phrase time adverbials contain two or more profiled time elements which are coordinated together.** This coordination can take the form of literal coordination, e.g. with ו, or of various kinds of appositional functions, e.g. when multiple prepositions are "stacked" to coordinate a time within a specific position. Multi-phrase time adverbials appear with any combination of prepositional and non-prepositional forms.

The basic taxonomy looks like so:

```
single-phrase
|     |
|     prepositional
|     |
|     non-prepositional
|
multi-phrase
      |
      prep/non-prep combinations
```
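
As a rough illustration, the tree above can be written as a nested mapping and flattened into dotted class labels. This is only a sketch: the `TAXONOMY` dict and the `leaf_paths` helper are hypothetical names introduced here, not part of the project's code.

```python
# Hypothetical nested-dict rendering of the taxonomy tree (illustration only)
TAXONOMY = {
    'single-phrase': {
        'prepositional': {},
        'non-prepositional': {},
    },
    'multi-phrase': {},
}

def leaf_paths(tree, prefix=()):
    """Yield a dotted path for every leaf of a nested-dict tree."""
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            # descend into subclasses
            yield from leaf_paths(children, path)
        else:
            yield '.'.join(path)

for path in leaf_paths(TAXONOMY):
    print(path)
```

Each printed path corresponds to one terminal position in the tree, which is the kind of label a classified construction could carry.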

## A Deductive and Inductive Classification Process

To classify the current set of time adverbials, we use a process of elimination. This deductive process is aided by inductive analysis of the surface form data. In other words, the categories outlined above, and those outlined further below, were identified by inspecting the surface form counts to see which categories appear most influential. The goal is to be guided by the data while still deriving categories that are useful for collocation research.

### Matching (`CXbuilder`) and Searching (`SearchCX`)

The `CXbuilder` class provides methods for testing any number of conditions on a provided element. It can then modify any matched CX, or compile it into a new `Construction` object.

The tools provided by `SearchCX` can then scan the time adverbials for matches based on the `CXbuilder`'s rules.
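
The internals of the builder class are not shown in this notebook. As a minimal sketch of the general idea, assume each case is a dict holding a class label and a mapping of named conditions to booleans (the shape used by the builder defined later in this notebook); a matcher could then return the first case whose conditions all hold. `first_match` is a hypothetical helper written for illustration, not the library's API.

```python
def first_match(*cases):
    """Return the first case whose conditions are all true, else None."""
    for case in cases:
        if all(case['conds'].values()):
            return case
    return None

# toy element: only its construction name matters here
name = 'prep_ph'

match = first_match(
    {
        'class': 'prep',
        'conds': {'name == prep_ph': name == 'prep_ph'},
    },
    {
        'class': 'øprep',
        'conds': {'name != prep_ph': name != 'prep_ph'},
    },
)

print(match['class'])
```

Naming each condition with a readable string key, as in the sketch, makes it easy to report which test failed when a case does not match.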

### Surface form counting

Surface forms are counted by first being stripped of accentuation, then tokenized along their lexical boundaries, and finally joined on periods. We utilize prominent counts in the inductive side of the process.
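
The three steps above can be sketched as follows. This is an assumption-laden illustration rather than the actual `tokenize_surface` implementation: it takes tokens that are already split along lexical boundaries, and it treats "accentuation" as the Hebrew accent marks in the Unicode range U+0591 to U+05AF. The `surface_form` helper and the toy `phrases` list are hypothetical.

```python
import re
from collections import Counter

# Hebrew accent (cantillation) marks occupy U+0591 through U+05AF
ACCENTS = re.compile('[\u0591-\u05AF]')

def surface_form(tokens):
    """Strip accents from each lexical token and join the tokens on periods."""
    return '.'.join(ACCENTS.sub('', token) for token in tokens)

# toy input: each phrase is already split along lexical boundaries
phrases = [
    ['\u05D1', '\u05D9\u05D5\u05A3\u05DD'],  # preposition + "day", carrying accent U+05A3
    ['\u05D1', '\u05D9\u05D5\u05DD'],        # the same words without the accent
    ['\u05D0\u05D6'],                        # a single-word adverbial
]

counts = Counter(surface_form(p) for p in phrases)
print(counts.most_common())
```

Because the accent is removed before counting, the first two phrases collapse into one surface form with a count of two.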

### Keeping Track

As we build and match the conditions, we keep track of which constructions have been accounted for and which have not.
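
One simple way to do this kind of bookkeeping is with plain sets. The sketch below is illustrative only: the string identifiers stand in for real construction objects, and `mark_covered` is a hypothetical helper.

```python
# placeholder identifiers standing in for construction objects
all_cxs = {'cx_a', 'cx_b', 'cx_c', 'cx_d'}
covered = set()

def mark_covered(matched):
    """Record newly matched constructions and return those still remaining."""
    covered.update(matched)
    remaining = all_cxs - covered
    print(f'{len(covered)}/{len(all_cxs)} accounted for, {len(remaining)} remaining')
    return remaining

remaining = mark_covered({'cx_a', 'cx_c'})
```

Set difference gives the remaining items directly, so the to-do list never has to be maintained by hand.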

# Classification

We put together a custom `CXbuilder` for labeling the CXs. For single-phrase constructions, we simply add an attribute to each CX object: `classification`. The attribute is a list of class labels that corresponds to a position in the taxonomy tree.

**For single-phrase adverbials, the `CXbuilder` simply adds a classification tag, while a separate builder instead combines the components of multi-phrase constructions into a single analyzed form.**

### Copy and Track Covered Times

In [14]:
# build up taxonomy and keep track of todo-cxs

# taxonomy stored as a directed graph
taxonomy = nx.DiGraph((
    ('time', 'single'),
    ('time', 'multi'),
    ('single', 'øprep'),
    ('single', 'prep'),
))

# count CX classes
class2cx = collections.defaultdict(set)

def classtag(classlist):
    return '.'.join(classlist)

def classcount(cxset=class2cx):
    """Count CXs per class tag using their latest classifications"""
    counts = collections.Counter()
    cxset = set(cx for cl,cxs in cxset.items() for cx in cxs)
    for cx in cxset:
        # default must be a list: joining the bare string 'NA' would yield 'N.A'
        classi = cx.__dict__.get('classification', ['NA'])
        counts[classtag(classi)] += 1
    return convert2pandas(counts)
    
def get_remaining(count=None, exclude=()):
    """Return class counts with the excluded labels dropped
    
    Args:
        count: a table of class counts; computed with
            classcount() if not provided.
        exclude: labels to exclude from the counts.
    """
    count = count if count is not None else classcount()
    include = [i for i in count.index if i not in exclude]
    return count.loc[include]

def percent(n, total):
    """Return n/total as a ratio rounded to two decimals"""
    return round(n/total, 2)

def prog(ident=0, exclude=('single', 'multi')):
    """Report progress"""
    # NB: cx_dataset is a global that is defined in a later cell
    total_times = len(cx_dataset)
    count = classcount()
    found = get_remaining(count, exclude).sum().sum()
    ratio = percent(found, total_times)
    ident = '\t' * ident
    print(f'{ident}{ratio} ({found}) now accounted for')

### CXBuilders

In [15]:
# copy cxs for modification by builder
# NB: loop variable named `cxs` to avoid shadowing the imported `cx_data` path
cx_dataset = set(
    tuple(copy.deepcopy(cxs))
        for ph, cxs in phrase2cxs.items()
)

class SinglePhrase(CXbuilder):
    """Modify cx classifications for single phrase CXs"""
    
    def __init__(self, cxset):
        CXbuilder.__init__(self) # initialize with standard CXbuilder methods
        
        self.cxset = cxset
        
        # cx queries
        # NB: order matters!
        self.cxs = (
            self.prep,
        )
        self.prereq = self.single
        
    def test_result(self, test, *cases):
        """Add class attributes to CX results"""
        if test:
            result = test[-1]
            cx = result['element']
            classi = result['class']
            # append the label, creating the list on first use
            cx.__dict__.setdefault('classification', []).append(classi)
            return cx
    
    def findall(self, element):
        """Find all results with prerequisite"""
        if self.prereq(element):
            for funct in self.cxs:
                funct(element) # called only for its side effect of tagging the CX
            return [element]
        else:
            return []
    
    def label_cxs(self, class2cx):
        """Run all queries against dataset"""
        for cxtuple in self.cxset:
            for result in self.findall(cxtuple):
                tag = classtag(result[0].classification)
                class2cx[tag].add(result[0])
    
    def single(self, cxtuple):
        """Tag CXs as singles"""
        return self.test(
            {
                'element': cxtuple[0],
                'class': 'single',
                'conds': {
                    'len(cxtuple) == 1':
                        len(cxtuple) == 1,
                }
            }
        )
    
    def prep(self, cxtuple):
        """Tag prepositional cxs"""
        cx = cxtuple[0]
        return self.test(
            {
                'element': cx,
                'class': 'prep',
                'conds': {
                    'cx.name == prep_ph':
                        cx.name == 'prep_ph',
                }
            },
            {
                'element': cx,
                'class': 'øprep',
                'conds': {
                    'cx.name != prep_ph':
                        cx.name != 'prep_ph',
                }
            }
        )

    
sp = SinglePhrase(cx_dataset)

In [16]:
sp.label_cxs(class2cx)

In [28]:
classcount()

Class,Total
single.prep,1965
single.øprep,1414


In [27]:
for cx in list(class2cx['single.prep'])[:10]:
    print()
    print(cx.classification)
    print()
    se.showcx(cx)


['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {'__cx__': 'cont', 'head': 217685},
    'prep': {'__cx__': 'prep', 'head': 217684}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'prep_ph',
                'head': {   '__cx__': 'numb_ph',
                            'head': {'__cx__': 'cont', 'head': 284641},
                            'numb': {'__cx__': 'card', 'head': 284640}},
                'prep': {'__cx__': 'prep', 'head': 284639}},
    'prep': {'__cx__': 'prep', 'head': 284638}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'prep_ph',
                'head': {   '__cx__': 'attrib_ph',
                            'attrib': {   '__cx__': 'defi_ph',
                                          'art': {   '__cx__': 'art',
                                                     'head': 35750},
                                          'head': {   '__cx__': 'prde',
                                                      'head': 35751}},
                            'head': {   '__cx__': 'defi_ph',
                                        'art': {'__cx__': 'art', 'head': 35748},
                                        'head': {   '__cx__': 'cont',
                                                    'head': 35749}}},
                'prep': {'__cx__': 'prep', 'head': 35747}},
    'prep': {'__cx__': 'prep', 'head': 35746}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'attrib_ph',
                'attrib': {   '__cx__': 'defi_ph',
                              'art': {'__cx__': 'art', 'head': 367545},
                              'head': {'__cx__': 'ordn', 'head': 367546}},
                'head': {   '__cx__': 'defi_ph',
                            'art': {'__cx__': 'art', 'head': 367543},
                            'head': {'__cx__': 'cont', 'head': 367544}}},
    'prep': {'__cx__': 'prep', 'head': 367542}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'defi_ph',
                'art': {'__cx__': 'art', 'head': 342059},
                'head': {'__cx__': 'cont', 'head': 342060}},
    'prep': {'__cx__': 'prep', 'head': 342058}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'defi_ph',
                'art': {'__cx__': 'art', 'head': 349378},
                'head': {'__cx__': 'cont', 'head': 349379}},
    'prep': {'__cx__': 'prep', 'head': 349377}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'geni_ph',
                'geni': {'__cx__': 'card', 'head': 211126},
                'head': {'__cx__': 'cont', 'head': 211125}},
    'prep': {'__cx__': 'prep', 'head': 211124}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {'__cx__': 'cont', 'head': 415694},
    'prep': {'__cx__': 'prep', 'head': 415693}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'attrib_ph',
                'attrib': {   '__cx__': 'defi_ph',
                              'art': {'__cx__': 'art', 'head': 61763},
                              'head': {'__cx__': 'ordn', 'head': 61764}},
                'head': {   '__cx__': 'defi_ph',
                            'art': {'__cx__': 'art', 'head': 61761},
                            'head': {'__cx__': 'cont', 'head': 61762}}},
    'prep': {'__cx__': 'prep', 'head': 61760}}




['single', 'prep']



{   '__cx__': 'prep_ph',
    'head': {   '__cx__': 'defi_ph',
                'art': {'__cx__': 'art', 'head': 148866},
                'head': {'__cx__': 'cont', 'head': 148867}},
    'prep': {'__cx__': 'prep', 'head': 148865}}

