# Initial exploration of UDS dataset

## Caroline Gish | cng18@pitt.edu | for 2022-02-24

---

**Source:** https://decomp.readthedocs.io/en/latest/index.html

Initial exploration will involve working through the sections listed on the main Decomp Read the Docs page.

## Installation 

https://decomp.readthedocs.io/en/latest/install.html

The first installation option listed timed out on my machine.

Ended up installing directly from GitHub using `pip` (over `https`, since GitHub no longer serves `git://` URLs):
- `pip install git+https://github.com/decompositional-semantics-initiative/decomp.git`
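
A quick way to confirm the install worked before loading the corpus is to check whether the package can be found at all. This is only a sketch: `is_importable` is a hypothetical helper name, not part of Decomp, and it only checks that the module is on the path, not that the install is complete.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# "decomp" is the import name used in the Quick Start cells.
print("decomp importable:", is_importable("decomp"))
```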

## Quick Start

https://decomp.readthedocs.io/en/latest/tutorial/quick-start.html

In [1]:
# importing dataset

from decomp import UDSCorpus

In [2]:
# creating uds object (UDSCorpus object)

uds = UDSCorpus()

The dataset is built from the [Universal Dependencies English Web Treebank](https://github.com/UniversalDependencies/UD_English-EWT) and the [UDS annotations](http://decomp.io/data/).

### UDSSentenceGraph objects

- accessed through standard dictionary getters or iteration

In [6]:
# UDS graph corresponding to 12th 
# sentence in en-ud-train.conllu

uds["ewt-train-12"]

<decomp.semantics.uds.graph.UDSSentenceGraph at 0x7fbedb51e700>
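
The identifier `"ewt-train-12"` appears to encode the corpus, the split, and the sentence number. A minimal sketch for pulling those apart, assuming every graph identifier follows the same three-part pattern (`parse_graph_id` is a hypothetical helper, not a Decomp function):

```python
from typing import Tuple


def parse_graph_id(graph_id: str) -> Tuple[str, str, int]:
    """Split an identifier like 'ewt-train-12' into (corpus, split, number).

    Assumes exactly three hyphen-separated parts; raises ValueError otherwise.
    """
    corpus, split, number = graph_id.split("-")
    return corpus, split, int(number)


print(parse_graph_id("ewt-train-12"))
# ('ewt', 'train', 12)
```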

### UDSDocument objects & UDSDocumentGraph

- graph associated with each document object

In [5]:
# document

uds.documents["reviews-112579"]

<decomp.semantics.uds.document.UDSDocument at 0x7fbeb849fd30>

In [7]:
# associated graph

uds.documents["reviews-112579"].document_graph

<decomp.semantics.uds.graph.UDSDocumentGraph at 0x7fbec00ad6d0>

#### `UDSCorpus` objects behave like dictionaries:

In [41]:
# sentence-level graph identifiers

#for graphid in uds:
    #print(graphid)

In [24]:
len(uds.graphids)

16622

In [21]:
# document identifiers
# (correspond directly to English 
# Web Treebank file IDs)

#for documentid in uds.documents:
    #print(documentid)

In [19]:
len(uds.documents)

1174

In [42]:
# sentence-level graph identifiers 
# with corresponding sentence

#for graphid, graph in uds.items():
    #print(graphid)
    #print(graph.sentence)

In [43]:
# document identifiers with 
# each document’s entire text

#for documentid, document in uds.documents.items():
    #print(documentid)
    #print(document.text)
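
The loops above are commented out, presumably because they print one entry per graph or document. A sketch of a lighter alternative is to preview only the first few items. It runs on a toy dictionary here; since `uds.items()` is used above, `preview(uds)` should work the same way, but that is an assumption I have not verified. `preview` and `toy_corpus` are hypothetical names.

```python
from itertools import islice


def preview(mapping, n=3):
    """Return the first n (key, value) pairs of any dict-like object."""
    return list(islice(mapping.items(), n))


# Toy stand-in for the corpus: identifiers follow the pattern seen above,
# but the values are placeholder strings, not real graph objects.
toy_corpus = {f"ewt-train-{i}": f"<graph {i}>" for i in range(1, 6)}

for graphid, graph in preview(toy_corpus, n=2):
    print(graphid, graph)

print(len(toy_corpus))
```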

- `graphids` attribute of UDSCorpus: list of sentence-level graph identifiers

In [44]:
# a list of the sentence-level 
# graph identifiers in the corpus
#uds.graphids

- `graphs` attribute of UDSCorpus: mapping from identifiers to the corresponding graphs

In [45]:
# a dictionary mapping the sentence-level
# graph identifiers to the corresponding graph
#uds.graphs

- `document_ids` attribute of UDSCorpus: list of document identifiers

#### Sentence-level graphs:

In [34]:
# dictionary mapping identifiers for 
# syntax nodes to their attributes

uds["ewt-train-12"].syntax_nodes

{'ewt-train-12-syntax-1': {'domain': 'syntax',
  'type': 'token',
  'position': 1,
  'form': 'The',
  'lemma': 'the',
  'upos': 'DET',
  'xpos': 'DT',
  'Definite': 'Def',
  'PronType': 'Art'},
 'ewt-train-12-syntax-2': {'domain': 'syntax',
  'type': 'token',
  'position': 2,
  'form': 'police',
  'lemma': 'police',
  'upos': 'NOUN',
  'xpos': 'NN',
  'Number': 'Sing'},
 'ewt-train-12-syntax-3': {'domain': 'syntax',
  'type': 'token',
  'position': 3,
  'form': 'commander',
  'lemma': 'commander',
  'upos': 'NOUN',
  'xpos': 'NN',
  'Number': 'Sing'},
 'ewt-train-12-syntax-4': {'domain': 'syntax',
  'type': 'token',
  'position': 4,
  'form': 'of',
  'lemma': 'of',
  'upos': 'ADP',
  'xpos': 'IN'},
 'ewt-train-12-syntax-5': {'domain': 'syntax',
  'type': 'token',
  'position': 5,
  'form': 'Ninevah',
  'lemma': 'Ninevah',
  'upos': 'PROPN',
  'xpos': 'NNP',
  'Number': 'Sing'},
 'ewt-train-12-syntax-6': {'domain': 'syntax',
  'type': 'token',
  'position': 6,
  'form': 'Province',
  'l
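
Since each syntax node carries a `position` and a `form`, the token sequence can be rebuilt from this dictionary. A minimal sketch using only the first five nodes from the output above, with attributes trimmed and the entries deliberately out of order (`tokens_in_order` is a hypothetical helper):

```python
# Excerpt of the syntax_nodes output above, trimmed to a few attributes.
syntax_nodes = {
    "ewt-train-12-syntax-3": {"position": 3, "form": "commander", "upos": "NOUN"},
    "ewt-train-12-syntax-1": {"position": 1, "form": "The", "upos": "DET"},
    "ewt-train-12-syntax-5": {"position": 5, "form": "Ninevah", "upos": "PROPN"},
    "ewt-train-12-syntax-2": {"position": 2, "form": "police", "upos": "NOUN"},
    "ewt-train-12-syntax-4": {"position": 4, "form": "of", "upos": "ADP"},
}


def tokens_in_order(nodes):
    """Return the token forms sorted by their 'position' attribute."""
    ordered = sorted(nodes.values(), key=lambda attrs: attrs["position"])
    return [attrs["form"] for attrs in ordered]


print(" ".join(tokens_in_order(syntax_nodes)))
# The police commander of Ninevah
```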

In [35]:
# dictionary mapping identifiers for 
# semantics nodes to their attributes

uds["ewt-train-12"].semantics_nodes

{'ewt-train-12-semantics-pred-7': {'domain': 'semantics',
  'frompredpatt': True,
  'type': 'predicate',
  'factuality': {'factual': {'confidence': 1.0, 'value': 1.0583}},
  'event_structure': {'dynamic': {'value': -1.0745528936386108,
    'confidence': 0.9999988079071045},
   'natural_parts': {'value': -1.0745666027069092,
    'confidence': 0.9999988079071045},
   'telic': {'value': -1.074510931968689, 'confidence': 0.9999988079071045},
   'situation_duration_lbound-centuries': {'value': -0.4917, 'confidence': 1},
   'situation_duration_ubound-centuries': {'value': -0.4917, 'confidence': 1},
   'situation_duration_lbound-days': {'value': -1.1589, 'confidence': 1},
   'situation_duration_ubound-days': {'value': -1.1589, 'confidence': 1},
   'situation_duration_lbound-decades': {'value': -0.9165, 'confidence': 1},
   'situation_duration_ubound-decades': {'value': -0.9165, 'confidence': 1},
   'situation_duration_lbound-forever': {'value': -0.458, 'confidence': 1},
   'situation_duration
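
The semantics node attributes are nested: annotation subspace, then property, then a `value`/`confidence` pair. A sketch that flattens this into rows, using a few entries copied from the output above (`flatten_annotations` is a hypothetical helper, and treating every dict-valued attribute as an annotation subspace is an assumption based on this one node):

```python
# Excerpt of the attributes of 'ewt-train-12-semantics-pred-7' shown above.
node_attrs = {
    "domain": "semantics",
    "frompredpatt": True,
    "type": "predicate",
    "factuality": {"factual": {"confidence": 1.0, "value": 1.0583}},
    "event_structure": {
        "dynamic": {"value": -1.0745528936386108, "confidence": 0.9999988079071045},
        "telic": {"value": -1.074510931968689, "confidence": 0.9999988079071045},
        "situation_duration_lbound-days": {"value": -1.1589, "confidence": 1},
    },
}


def flatten_annotations(attrs):
    """Return (subspace, property, value, confidence) rows for one node."""
    rows = []
    for subspace, props in attrs.items():
        if not isinstance(props, dict):
            continue  # skip scalar attributes such as 'domain' and 'type'
        for prop, annotation in props.items():
            rows.append((subspace, prop, annotation["value"], annotation["confidence"]))
    return rows


for row in flatten_annotations(node_attrs):
    print(row)
```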

In [36]:
# dictionary mapping identifiers for 
# semantics edges (tuples of node 
# identifiers) to their attributes

uds["ewt-train-12"].semantics_edges()

{('ewt-train-12-semantics-pred-7',
  'ewt-train-12-semantics-arg-3'): {'domain': 'semantics', 'type': 'dependency', 'frompredpatt': True, 'protoroles': {'instigation': {'confidence': 1.0,
    'value': 1.3557},
   'change_of_possession': {'confidence': 0.7724, 'value': -0.0},
   'existed_before': {'confidence': 1.0, 'value': 1.3527},
   'was_for_benefit': {'confidence': 0.1976, 'value': -0.0504},
   'change_of_state_continuous': {'confidence': 1.0, 'value': -0.0},
   'change_of_state': {'confidence': 0.2067, 'value': -0.0548},
   'volition': {'confidence': 1.0, 'value': 1.3545},
   'change_of_location': {'confidence': 0.272, 'value': -0.0922},
   'partitive': {'confidence': 0.1148, 'value': -0.0018},
   'existed_during': {'confidence': 1.0, 'value': 1.3557},
   'existed_after': {'confidence': 1.0, 'value': 1.3527},
   'awareness': {'confidence': 1.0, 'value': 1.3526},
   'sentient': {'confidence': 1.0, 'value': 1.354},
   'was_used': {'confidence': 0.4373, 'value': -0.0207}}},
 ('ewt-tr
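
Each edge stores its `protoroles` annotations as `value`/`confidence` pairs, so they can be filtered. A sketch using the values from the first edge above. Reading a positive `value` as "the property holds" is an assumption made for illustration, and `likely_properties` is a hypothetical helper.

```python
# protoroles for the edge (pred-7, arg-3), copied from the output above.
protoroles = {
    "instigation": {"confidence": 1.0, "value": 1.3557},
    "change_of_possession": {"confidence": 0.7724, "value": -0.0},
    "existed_before": {"confidence": 1.0, "value": 1.3527},
    "was_for_benefit": {"confidence": 0.1976, "value": -0.0504},
    "change_of_state_continuous": {"confidence": 1.0, "value": -0.0},
    "change_of_state": {"confidence": 0.2067, "value": -0.0548},
    "volition": {"confidence": 1.0, "value": 1.3545},
    "change_of_location": {"confidence": 0.272, "value": -0.0922},
    "partitive": {"confidence": 0.1148, "value": -0.0018},
    "existed_during": {"confidence": 1.0, "value": 1.3557},
    "existed_after": {"confidence": 1.0, "value": 1.3527},
    "awareness": {"confidence": 1.0, "value": 1.3526},
    "sentient": {"confidence": 1.0, "value": 1.354},
    "was_used": {"confidence": 0.4373, "value": -0.0207},
}


def likely_properties(annotations, min_confidence=1.0):
    """Names of properties with positive value and confidence >= min_confidence."""
    return sorted(
        name
        for name, ann in annotations.items()
        if ann["confidence"] >= min_confidence and ann["value"] > 0
    )


print(likely_properties(protoroles))
```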

In [37]:
# dictionary mapping identifiers for 
# syntax edges (tuples of node 
# identifiers) to their attributes

uds["ewt-train-12"].syntax_edges()

{('ewt-train-12-syntax-3', 'ewt-train-12-syntax-1'): {'deprel': 'det',
  'domain': 'syntax',
  'type': 'dependency'},
 ('ewt-train-12-syntax-3', 'ewt-train-12-syntax-2'): {'deprel': 'compound',
  'domain': 'syntax',
  'type': 'dependency'},
 ('ewt-train-12-syntax-3', 'ewt-train-12-syntax-6'): {'deprel': 'nmod',
  'domain': 'syntax',
  'type': 'dependency'},
 ('ewt-train-12-syntax-6', 'ewt-train-12-syntax-4'): {'deprel': 'case',
  'domain': 'syntax',
  'type': 'dependency'},
 ('ewt-train-12-syntax-6', 'ewt-train-12-syntax-5'): {'deprel': 'compound',
  'domain': 'syntax',
  'type': 'dependency'},
 ('ewt-train-12-syntax-7', 'ewt-train-12-syntax-3'): {'deprel': 'nsubj',
  'domain': 'syntax',
  'type': 'dependency'},
 ('ewt-train-12-syntax-7', 'ewt-train-12-syntax-11'): {'deprel': 'ccomp',
  'domain': 'syntax',
  'type': 'dependency'},
 ('ewt-train-12-syntax-7', 'ewt-train-12-syntax-29'): {'deprel': 'punct',
  'domain': 'syntax',
  'type': 'dependency'},
 ('ewt-train-12-syntax-11', 'ewt-tra
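
The edge keys look like `(head, dependent)` pairs: for example, node 3 ("commander") points to node 1 ("The") with `deprel` `det`. Under that reading, which is inferred from the output rather than confirmed in the docs, the edges can be regrouped by head. A sketch on the edges shown above (`dependents_by_head` is a hypothetical helper):

```python
from collections import defaultdict

# Edges copied from the syntax_edges() output above ('domain'/'type' omitted).
syntax_edges = {
    ("ewt-train-12-syntax-3", "ewt-train-12-syntax-1"): {"deprel": "det"},
    ("ewt-train-12-syntax-3", "ewt-train-12-syntax-2"): {"deprel": "compound"},
    ("ewt-train-12-syntax-3", "ewt-train-12-syntax-6"): {"deprel": "nmod"},
    ("ewt-train-12-syntax-6", "ewt-train-12-syntax-4"): {"deprel": "case"},
    ("ewt-train-12-syntax-6", "ewt-train-12-syntax-5"): {"deprel": "compound"},
    ("ewt-train-12-syntax-7", "ewt-train-12-syntax-3"): {"deprel": "nsubj"},
    ("ewt-train-12-syntax-7", "ewt-train-12-syntax-11"): {"deprel": "ccomp"},
    ("ewt-train-12-syntax-7", "ewt-train-12-syntax-29"): {"deprel": "punct"},
}


def dependents_by_head(edges):
    """Map each head node to a list of (deprel, dependent) pairs."""
    children = defaultdict(list)
    for (head, dependent), attrs in edges.items():
        children[head].append((attrs["deprel"], dependent))
    return dict(children)


for deprel, dependent in dependents_by_head(syntax_edges)["ewt-train-12-syntax-7"]:
    print(deprel, dependent)
```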

#### Accessing relationships between semantics and syntax nodes:

In [38]:
uds["ewt-train-12"].head('ewt-train-12-semantics-pred-7', ['form', 'lemma'])

(7, ['announced', 'announce'])

In [39]:
uds["ewt-train-12"].span('ewt-train-12-semantics-pred-7', ['form', 'lemma'])

{7: ['announced', 'announce']}
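
Judging from the two outputs above, `head` returns a `(position, attributes)` tuple and `span` returns a dictionary from position to attributes, with the attributes in the order requested (`['form', 'lemma']`). A sketch of unpacking both, using the literal results rather than calling Decomp (`span_forms` is a hypothetical helper):

```python
# Literal results copied from the two cells above.
head_result = (7, ["announced", "announce"])
span_result = {7: ["announced", "announce"]}

# head: a single (position, [form, lemma]) pair.
position, (form, lemma) = head_result
print(position, form, lemma)


def span_forms(span, form_index=0):
    """Return the forms of a span in position order.

    form_index is the index of 'form' in the requested attribute list.
    """
    return [attrs[form_index] for _, attrs in sorted(span.items())]


print(span_forms(span_result))
```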

#### `UDSDocument` attributes:

- `.genre`
- `.text`
- `.timestamp`
- `.sentence_ids`
- `.sentence_graphs`

In [40]:
uds.documents["reviews-112579"].document_graph.nodes

NodeView(('ewt-train-11719-document-pred-root', 'ewt-train-11719-document-arg-0', 'ewt-train-11719-document-arg-author', 'ewt-train-11719-document-arg-addressee', 'ewt-train-11720-document-pred-4', 'ewt-train-11720-document-arg-1', 'ewt-train-11720-document-arg-5', 'ewt-train-11720-document-arg-9', 'ewt-train-11720-document-pred-root', 'ewt-train-11720-document-arg-0', 'ewt-train-11720-document-arg-author', 'ewt-train-11720-document-arg-addressee', 'ewt-train-11721-document-pred-2', 'ewt-train-11721-document-pred-11', 'ewt-train-11721-document-pred-17', 'ewt-train-11721-document-pred-18', 'ewt-train-11721-document-pred-22', 'ewt-train-11721-document-pred-35', 'ewt-train-11721-document-arg-1', 'ewt-train-11721-document-arg-4', 'ewt-train-11721-document-arg-10', 'ewt-train-11721-document-arg-13', 'ewt-train-11721-document-arg-18', 'ewt-train-11721-document-arg-21', 'ewt-train-11721-document-arg-25', 'ewt-train-11721-document-arg-28', 'ewt-train-11721-document-arg-39', 'ewt-train-11721-do
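
The document-graph node identifiers start with the identifier of the sentence graph they come from, followed by `-document-`. A sketch that groups nodes by sentence, using only the first twelve identifiers from the output above and assuming that naming pattern holds for every node (`sentence_of` is a hypothetical helper):

```python
from collections import Counter

# First twelve node identifiers from the NodeView output above.
node_ids = [
    "ewt-train-11719-document-pred-root",
    "ewt-train-11719-document-arg-0",
    "ewt-train-11719-document-arg-author",
    "ewt-train-11719-document-arg-addressee",
    "ewt-train-11720-document-pred-4",
    "ewt-train-11720-document-arg-1",
    "ewt-train-11720-document-arg-5",
    "ewt-train-11720-document-arg-9",
    "ewt-train-11720-document-pred-root",
    "ewt-train-11720-document-arg-0",
    "ewt-train-11720-document-arg-author",
    "ewt-train-11720-document-arg-addressee",
]


def sentence_of(node_id):
    """Return the sentence-graph identifier embedded in a document node id."""
    return node_id.split("-document-")[0]


nodes_per_sentence = Counter(sentence_of(node_id) for node_id in node_ids)
print(nodes_per_sentence)
```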