# Review notebook: harvest senses with provenance

This notebook reviews two functions:

- `get_provenance_by_semantic_class`
- `extend_from_saved_lemma_query`

Both are defined in `utils.dataset_download`.

These functions assume:

- a pickled dataframe with information harvested from the OED word endpoint for a given lemma id

What these functions should do:

- for a given lemma id (e.g. `machine_nn01`, saved in the pickled data)
- get all senses
- for each of the senses, get synonyms
- for each of the senses + synonyms, get all branches (siblings and descendants)
- keep track of the relation between the initial lemma and each harvested sense (this is saved in the `provenance` and `provenance_type` columns)
- for more documentation, please refer to the code and this notebook

In [3]:
!git branch

  1-dataframe[m
  19-machine-tagger[m
  3-group-senses[m
* [32m4-semantic-provenance[m
  dev[m
  master[m
  oed-experiments[m


In [4]:
%load_ext autoreload

In [5]:
%autoreload 2

In [6]:
from utils.dataset_download import *
import pickle
import json
from pathlib import Path, PosixPath
import pandas as pd

# Load credentials, set paths and arguments

In [7]:
# import API credentials
with open('oed_experiments/oed_credentials.json') as f:
    auth = json.load(f)

In [8]:
# define lemma
lemma_id = "machine_nn01"

In [9]:
dp = "../data"

In [10]:
save_path = Path(dp)
save_path.mkdir(exist_ok=True)

In [11]:
start,end = 1750,1950
lemma_id = 'machine_nn01'

## Run function

In [None]:
extended_df = extend_from_saved_lemma_query(auth,lemma_id,start,end)

In [None]:
extended_df.head(3)

In [None]:
extended_df.shape

# Inspect functions

- `get_provenance_by_semantic_class`

In [None]:
def get_provenance_by_semantic_class(row: pd.Series) -> list:
    """
    Decide on the relation between the sense and the target query.
    Here we use the lowest semantic class id to decide on the relation:

    if the last semantic class id (sc_ids[-1]) == provenance id: the sense is a sibling of the provenance id
    elif the provenance semantic class id is in the list of semantic class ids
    (but is not the last one): the sense is a descendant of the provenance id

    Argument:
        row (pd.Series): row of the dataframe obtained from the branchsenses endpoint

    Returns:
        nested list in the format [lowest semantic class id, relation, provenance semantic class id];
            in other words, for a given sense (which can have multiple semantic class ids),
            the lowest semantic class id stands in the relation "sibling" or "descendant" to the
            provenance semantic class id
    """
    
    provenance = []
    
    # one sense can belong to multiple semantic class ids
    for sc_ids in row.semantic_class_ids:
        relation = ''
        
        # scenario 1
        # if the last id equals provenance, the relation is sibling
        if sc_ids[-1] == row.provenance_pivot:
            relation = 'sibling'
        
        # scenario 2
        # otherwise, if the provenance id appears earlier in the path, the relation is descendant
        elif (row.provenance_pivot in sc_ids):
            relation = 'descendant'
        
        # exclude other relations
        if relation:
            provenance.append([sc_ids[-1], relation, row.provenance_pivot])
    
    # double check, each sense SHOULD have a provenance
    # if not this will print a warning message
    if not provenance:
        print(f'Warning: No descendants or siblings found for {row.id}')
 
    return provenance
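
To make the two scenarios concrete, here is a minimal sketch that applies the same rule to a toy row. The ids, the `classify` helper and the `SimpleNamespace` row are invented for illustration; a real row would come from the branchsenses endpoint.

```python
from types import SimpleNamespace

def classify(sc_ids, pivot):
    """Relation of one semantic class path to the pivot, or '' if unrelated."""
    if sc_ids[-1] == pivot:
        return 'sibling'
    if pivot in sc_ids:
        return 'descendant'
    return ''

# hypothetical row: one sense with three semantic class paths (all ids invented)
toy_row = SimpleNamespace(
    id='toy_nn01-1',
    semantic_class_ids=[[1, 20, 300], [1, 20, 300, 4000], [1, 50, 600]],
    provenance_pivot=300,
)

toy_provenance = [
    [sc_ids[-1], classify(sc_ids, toy_row.provenance_pivot), toy_row.provenance_pivot]
    for sc_ids in toy_row.semantic_class_ids
    if classify(sc_ids, toy_row.provenance_pivot)  # drop unrelated paths
]
print(toy_provenance)
# [[300, 'sibling', 300], [4000, 'descendant', 300]]
```

The third path never passes through class 300, so it contributes nothing. Since `get_provenance_by_semantic_class` only reads attributes from its argument, calling it on `toy_row` should give the same list.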


- `extend_from_saved_lemma_query`

Below we split the function into separate cells to make it easier to scrutinize the individual steps.

In [16]:
senses_df = pd.read_pickle(f"./data/senses_{lemma_id}.pickle")
senses_df.head()

Unnamed: 0,id,lemma,notes,oed_url,word_id,first_use,definition,transitivity,oed_reference,part_of_speech,...,meta.updated,meta.sense_group,meta.position_in_entry,daterange.end,daterange.start,daterange.obsolete,daterange.rangestring,categories.topic,categories.usage,categories.region
0,machinery_nn01-38481087,machinery,[],https://www.oed.com/view/Entry/111856#eid38481087,machinery_nn01,William Winstanley,Theatre. Stage appliances and apparatus. Cf. m...,,"machinery, n., sense 1a",NN,...,2000,machinery_nn01-g01,1,,1687,False,1687—,"[[Arts, Performing Arts, Theatre]]",[[historical]],[]
0,machinery_nn01-38481135,machinery,[],https://www.oed.com/view/Entry/111856#eid38481135,machinery_nn01,Richard Steele,Contrivances employed for effect in a literary...,,"machinery, n., sense 1b",NN,...,2000,machinery_nn01-g01,2,,1713,False,1713—,"[[Arts, Literature]]",[],[]
0,machinery_nn01-38481203,machinery,[],https://www.oed.com/view/Entry/111856#eid38481203,machinery_nn01,Nathan Bailey,"Machines, or the constituent parts of a machin...",,"machinery, n., sense 2a",NN,...,2000,machinery_nn01-g02,3,,1731,False,1731—,"[[Technology, Engineering, Mechanics]]",[],[]
0,machinery_nn01-38481287,machinery,[],https://www.oed.com/view/Entry/111856#eid38481287,machinery_nn01,Gentleman's Magazine,As a count noun: a system of machinery (litera...,,"machinery, n., sense 2b",NN,...,2000,machinery_nn01-g02,4,,1736,False,1736—,[],[],[[India]]
0,machinery_nn01-38481377,machinery,[],https://www.oed.com/view/Entry/111856#eid38481377,machinery_nn01,William Battie,"(a) The workings, organization, or functional ...",,"machinery, n., sense 2c",NN,...,2000,machinery_nn01-g02,5,,1758,False,1758—,"[[Technology, Engineering, Mechanics]]",[],[]


In [15]:
# helper function to get last element in a nested list
get_last_id = lambda nested_list :[l[-1] for l in nested_list]
    
# load seed query dataframe or download from api
lemma_path = f"./data/senses_{lemma_id}.pickle"
if Path(lemma_path).is_file():
    print(f'Loading senses for {lemma_id} from pickle.')
    query_df = pd.read_pickle(lemma_path)
else:
    print(f'Downloading senses for {lemma_id} from OED API.')
    sense_json = query_oed(auth,'word',lemma_id,flags='include_senses=true&include_quotations=true')
    # convert the json to a dataframe
    # (assigned to query_df so that the code below works in both branches)
    query_df = convert_json_to_dataframe(sense_json)
    # save the dataframe as pickle
    query_df.to_pickle(lemma_path)
    
# use the sense endpoint to ensure all information 
# can be properly concatenated in one dataframe
    
# retrieve all sense ids
query_sense_ids = query_df.id.unique()

Dowloading senses for machinery_nn01 from OED API.
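
The load-or-download step above follows a common caching pattern. Below is a minimal sketch of it using only the standard library. `load_or_build` and `fake_download` are hypothetical names, and the plain `pickle` calls stand in for `pd.read_pickle` / `to_pickle`. Both branches return the object, so the caller always ends up with the same variable.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_build(path, build):
    """Unpickle `path` if it exists, else call `build()` and cache the result.

    Returns (object, source) where source is 'pickle' or 'build'.
    """
    path = Path(path)
    if path.is_file():
        with path.open('rb') as f:
            return pickle.load(f), 'pickle'
    obj = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as f:
        pickle.dump(obj, f)
    return obj, 'build'

# stand-in for the API call; the records are invented
def fake_download():
    return [{'id': 'toy_nn01-1'}, {'id': 'toy_nn01-2'}]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'data' / 'senses_toy_nn01.pickle'
    first, first_source = load_or_build(cache_file, fake_download)
    second, second_source = load_or_build(cache_file, fake_download)

print(first_source, second_source)
# build pickle
```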


In [None]:
# get all senses by sense id
print(f"Get all sense for the lemma {lemma_id}")
seeds = [(s,query_oed(auth,'sense',s,
                flags=f"current_in='{start}-{end}'&limit=1000", # "current_in" is probably not needed here, see the API documentation
                verbose=False)) # set verbose to True to see the url request
                    for s in tqdm(query_sense_ids)]

In [None]:
# convert to dataframe
seeds_df = pd.DataFrame([seed['data'] for s_id,seed in seeds])

# seeds_df contains all the senses of the word machine_nn01
# we distinguish between provenance and provenance_type
# provenance refers to a specific word, sense or semantic class id
# provenance_type distinguishes between different types of extension
# define provenance, these senses are "seed"
seeds_df['provenance'] = [[[i,'seed',lemma_id]] for i in seeds_df.id] # for the seed senses we use the id of the lemma (lemma_id)
                                       # we use a list here, the reason is explained later, see provenance of synonyms
seeds_df['provenance_type'] = 'seed' # categorize these senses as seed

In [None]:
# get all synonyms for the seed senses
# reminder synonyms uses same function as the /senses/ endpoint, flags should work here
print(f"Get all synonyms of the senses listed in {lemma_id}")
synonyms = [(s,query_oed(auth,'sense',s,
                level='synonyms',
                flags=f"current_in='{start}-{end}'&limit=1000"))
                        for s in tqdm(query_sense_ids)]

In [None]:
# transform list of synonyms to a dataframe
synonyms_df = pd.DataFrame([s for s_id,syn in synonyms for s in syn['data']])
    
# for synonyms the provenance_type is set to "synonym"
synonyms_df['provenance_type'] = 'synonym'
# for synonyms we refer to the sense id via which this synonym was retrieved
synonyms_df['provenance'] = [[[s['id'],'synonym',s_id]] for s_id,syn in synonyms for s in syn['data']]
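
The provenance column relies on the two list comprehensions walking the API responses in exactly the same order, so row `k` of the dataframe lines up with entry `k` of the provenance list. A minimal sketch with invented responses (the ids and the `toy_synonyms` structure are assumptions that mimic the `(sense id, response)` pairs above):

```python
# hypothetical API responses: (seed sense id, response) pairs, all ids invented
toy_synonyms = [
    ('seed-1', {'data': [{'id': 'syn-a'}, {'id': 'syn-b'}]}),
    ('seed-2', {'data': []}),                  # a seed without synonyms adds no rows
    ('seed-3', {'data': [{'id': 'syn-a'}]}),   # same synonym reached via another seed
]

# same iteration order in both comprehensions keeps rows and provenance aligned
rows = [s for s_id, syn in toy_synonyms for s in syn['data']]
toy_provenance = [[[s['id'], 'synonym', s_id]]
                  for s_id, syn in toy_synonyms for s in syn['data']]

for row, prov in zip(rows, toy_provenance):
    print(row['id'], prov)
# syn-a [['syn-a', 'synonym', 'seed-1']]
# syn-b [['syn-b', 'synonym', 'seed-1']]
# syn-a [['syn-a', 'synonym', 'seed-3']]
```

Note that `syn-a` appears twice, once per seed sense that led to it, which is one way duplicate senses can enter the dataframe.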

In [None]:
# seed + synonyms constitute the nucleus of our query
# these are saved in the core_df
# shape should be 485 (synonym senses) + 26 (seed senses)
core_df = pd.concat([seeds_df,synonyms_df],sort=True)
    
# branch out from there
# we save the lowest (last) semantic class id of each path in the semantic_class_last_id column
core_df['semantic_class_last_id'] = core_df['semantic_class_ids'].apply(get_last_id)

In [None]:
# retrieve all the _lowest_ (or last) semantic class ids for the core senses so far
semantic_class_ids = set([s for l in core_df.semantic_class_last_id.to_list() for s in l])
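
As a small illustration of these two steps, the sketch below runs `get_last_id` on invented semantic class paths and then flattens the result into a set. The ids are made up; only the shape (a list of paths per sense) follows the data above.

```python
# same helper as defined earlier in this notebook
get_last_id = lambda nested_list: [l[-1] for l in nested_list]

# hypothetical semantic_class_ids values for three senses (ids invented)
toy_paths_per_sense = [
    [[1, 20, 300], [1, 50, 600]],  # sense belonging to two semantic classes
    [[1, 20, 300]],                # sense sharing class 300 with the first one
    [],                            # sense without any semantic class
]

last_ids_per_sense = [get_last_id(paths) for paths in toy_paths_per_sense]
unique_last_ids = {s for last_ids in last_ids_per_sense for s in last_ids}

print(last_ids_per_sense)       # [[300, 600], [300], []]
print(sorted(unique_last_ids))  # [300, 600]
```

Collecting the ids in a set means each semantic class is queried only once in the next step, even when several core senses share it.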

In [None]:
# now, we use the branchsenses endpoint
# for each lowest semantic class id
# we get all "descendants": according to the API documentation, this
# returns an array of senses that belong to the semantic class
# specified by ID, plus senses that belong to its child and descendant classes.
print("Get all branches for seed senses and synonyms")
branches = [(idx,query_oed(auth,'semanticclass', idx, 
                        level='branchsenses', # 
                        flags=f"current_in='{start}-{end}'&limit=1000"))
                            for idx in tqdm(semantic_class_ids)]

In [None]:
# convert API response to dataframe
branches_df = pd.DataFrame([s for idx,branch in branches for s in branch['data']])
    
# ISSUE: again we have duplicate
# senses here, as some appear multiple times
# in the same semantic class (or as descendants)
    
# provenance_type is "branch"; the semantic class id
# that was used for retrieving the sense is the provenance
branches_df['provenance_type'] = 'branch'
    
# we create a provenance_pivot column, which shows
# the semantic class id via which the sense was retrieved
branches_df['provenance_pivot'] = [idx for idx, branch in branches for s in branch['data']]

In [None]:
# now there are two scenarios to specify for the provenance
# both scenarios can apply to one sense
# if the last semantic class id (sc_ids[-1]) == provenance id: the sense is a sibling of the provenance id
# elif the provenance semantic class id is in the list of semantic class ids
# (but is not the last one): the sense is a descendant of the provenance id
    
branches_df['provenance'] = branches_df.apply(get_provenance_by_semantic_class,axis=1)
    
# drop the provenance_pivot column
branches_df.drop('provenance_pivot',axis=1,inplace=True)
    
# concatenate core and branch senses
# ISSUE: have a closer look at the warning message
extended_df = pd.concat([core_df,branches_df],sort=True)

# to check if rows match
#extended_df.shape[0] == core_df.shape[0] + branches_df.shape[0]
# save dataframe as pickle
extended_df.to_pickle(f"./data/extended_{lemma_id}.pickle") 
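
One possible way to handle the duplicate senses flagged in the ISSUE comments is to merge rows that share a sense id and pool their provenance triples. This is only a sketch of that idea, not something the function currently does. Plain dicts stand in for dataframe rows (for example from `extended_df.to_dict('records')`), and all ids are invented.

```python
# hypothetical rows of the extended dataframe, reduced to three columns
toy_rows = [
    {'id': 's1', 'provenance_type': 'seed',   'provenance': [['s1', 'seed', 'toy_nn01']]},
    {'id': 's2', 'provenance_type': 'branch', 'provenance': [[300, 'sibling', 300]]},
    {'id': 's1', 'provenance_type': 'branch', 'provenance': [[4000, 'descendant', 300]]},
    {'id': 's2', 'provenance_type': 'branch', 'provenance': [[300, 'sibling', 300]]},  # exact duplicate
]

merged = {}
for row in toy_rows:
    # one entry per sense id, created on first sight
    entry = merged.setdefault(
        row['id'], {'id': row['id'], 'provenance_type': [], 'provenance': []}
    )
    if row['provenance_type'] not in entry['provenance_type']:
        entry['provenance_type'].append(row['provenance_type'])
    for triple in row['provenance']:
        if triple not in entry['provenance']:  # skip repeated triples
            entry['provenance'].append(triple)

merged_rows = list(merged.values())
for r in merged_rows:
    print(r['id'], r['provenance_type'], r['provenance'])
# s1 ['seed', 'branch'] [['s1', 'seed', 'toy_nn01'], [4000, 'descendant', 300]]
# s2 ['branch'] [[300, 'sibling', 300]]
```

Storing provenance as a list of triples is what makes this merge possible: a sense reached via several routes keeps all of them.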