# Initial exploration of UDS dataset

## Caroline Gish | cng18@pitt.edu | for 2022-02-24

---

**Source:** https://decomp.readthedocs.io/en/latest/index.html

Initial exploration will involve working through the sections listed on the main Decomp Read the Docs page.

- reference paper: https://aclanthology.org/2020.lrec-1.699.pdf
- Decomp website: http://decomp.io/
- documentation: https://decomp.readthedocs.io/en/latest/index.html
- datasets and toolkit: https://github.com/decompositional-semantics-initiative/decomp

## Installation 

https://decomp.readthedocs.io/en/latest/install.html

First installation option listed: I tried to get it to work for a while since it is the recommended option, but the operation kept timing out on my machine and I ultimately could not get it working.

I ended up installing locally using `pip`:
- `pip install git+git://github.com/decompositional-semantics-initiative/decomp.git`
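
After installing, it can help to confirm that the package is visible to the interpreter the notebook is running in. This is a minimal sketch using only the standard library; `is_installed` is a helper name introduced here for illustration and is not part of Decomp.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    return find_spec(package_name) is not None


# Reports whether the interpreter can find the decomp package
print("decomp installed:", is_installed("decomp"))
```

Because `find_spec` only locates the package, this check does not run any of the package's import-time code.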

## Quick Start

https://decomp.readthedocs.io/en/latest/tutorial/quick-start.html

In [1]:
# importing dataset

from decomp import UDSCorpus

AttributeError: can't set attribute
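
These notes do not record what triggered the error above, only its final line. One way to capture the full traceback for later debugging is sketched below; `try_import` is a hypothetical helper written for this notebook, and the sketch makes no assumption about the underlying cause.

```python
import importlib
import traceback


def try_import(module_name):
    """Import a module by name; return (module, None) or (None, traceback text)."""
    try:
        return importlib.import_module(module_name), None
    except Exception:  # import-time failures are not always ImportError
        return None, traceback.format_exc()


module, error = try_import("decomp")
if error is not None:
    # Show the complete traceback instead of only its last line
    print(error)
```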

In [None]:
# creating uds object (UDSCorpus object)

uds = UDSCorpus()

The dataset is built from the [Universal Dependencies English Web Treebank](https://github.com/UniversalDependencies/UD_English-EWT) and the [UDS annotations](http://decomp.io/data/). 

### UDSSentenceGraph objects

- accessed through standard dictionary getters or iteration

In [None]:
# UDS graph corresponding to 12th 
# sentence in en-ud-train.conllu

uds["ewt-train-12"]

### UDSDocument objects & UDSDocumentGraph

- graph associated with each document object

In [None]:
# document

uds.documents["reviews-112579"]

In [None]:
# associated graph

uds.documents["reviews-112579"].document_graph

#### `UDSCorpus` objects behave like dictionaries:

In [None]:
# sentence-level graph identifiers

#for graphid in uds:
    #print(graphid)

In [None]:
len(uds.graphids)

In [None]:
# document identifiers
# (correspond directly to English 
# Web Treebank file IDs)

#for documentid in uds.documents:
    #print(documentid)

In [None]:
len(uds.documents)

In [None]:
# sentence-level graph identifiers 
# with corresponding sentence

#for graphid, graph in uds.items():
    #print(graphid)
    #print(graph.sentence)

In [None]:
# document identifiers with 
# each document’s entire text

#for documentid, document in uds.documents.items():
    #print(documentid)
    #print(document.text)
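
To make "behaves like a dictionary" concrete without loading the full corpus, here is a toy stand-in built on `collections.abc.Mapping`. It is not Decomp's implementation: `ToyCorpus` and its identifiers and sentences are invented for illustration. It only shows which operations (getter, iteration, `len`, `.items()`) the cells above rely on.

```python
from collections.abc import Mapping


class ToyCorpus(Mapping):
    """Toy stand-in (not Decomp code): read-only mapping from graph id to sentence."""

    def __init__(self, sentences):
        self._sentences = dict(sentences)

    def __getitem__(self, graphid):
        return self._sentences[graphid]

    def __iter__(self):
        return iter(self._sentences)

    def __len__(self):
        return len(self._sentences)


# Placeholder identifiers and sentences, invented for illustration
toy = ToyCorpus({"toy-train-1": "First sentence.", "toy-train-2": "Second sentence."})

print(len(toy))            # number of identifiers
print(toy["toy-train-1"])  # dictionary-style getter
for graphid, sentence in toy.items():
    print(graphid, sentence)
```

Defining only `__getitem__`, `__iter__`, and `__len__` is enough here, since `Mapping` supplies `.items()`, `.keys()`, and `in` on top of them.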

- `graphids` attribute of UDSCorpus: list of sentence-level graph identifiers

In [None]:
# a list of the sentence-level 
# graph identifiers in the corpus
#uds.graphids

- `graphs` attribute of UDSCorpus: mapping from identifiers to the corresponding graphs

In [None]:
# a dictionary mapping the sentence-level
# graph identifiers to the corresponding graph
#uds.graphs

- `document_ids` attribute of UDSCorpus: list of document identifiers

#### sentence-level graphs:

In [None]:
# dictionary mapping identifiers for 
# syntax nodes to their attributes

uds["ewt-train-12"].syntax_nodes

In [None]:
# dictionary mapping identifiers for 
# semantics nodes to their attributes

uds["ewt-train-12"].semantics_nodes

In [None]:
# dictionary mapping identifiers for 
# semantics edges (tuples of node 
# identifiers) to their attributes

uds["ewt-train-12"].semantics_edges()

In [None]:
# dictionary mapping identifiers for 
# syntax edges (tuples of node 
# identifiers) to their attributes

uds["ewt-train-12"].syntax_edges()
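
Since the edge dictionaries are keyed by tuples of node identifiers, ordinary dictionary comprehensions are enough to pick out the edges that touch one node. The sketch below uses a toy dictionary rather than real UDS output; the node names, the `label` attribute, and the `edges_touching` helper are all placeholders introduced here.

```python
# Toy edge dictionary: keys are (source, target) tuples, values are attribute dicts.
# Node names and attributes are placeholders, not real UDS annotations.
edges = {
    ("pred-1", "arg-1"): {"label": "a"},
    ("pred-1", "arg-2"): {"label": "b"},
    ("pred-2", "arg-3"): {"label": "c"},
}


def edges_touching(edge_dict, node_id):
    """Keep only the edges whose key tuple contains node_id."""
    return {pair: attrs for pair, attrs in edge_dict.items() if node_id in pair}


pred1_edges = edges_touching(edges, "pred-1")
for (source, target), attrs in pred1_edges.items():
    print(source, "->", target, attrs)
```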

#### Accessing relationships between semantics and syntax nodes:

In [None]:
uds["ewt-train-12"].head('ewt-train-12-semantics-pred-7', ['form', 'lemma'])

In [None]:
uds["ewt-train-12"].span('ewt-train-12-semantics-pred-7', ['form', 'lemma'])
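
The node identifier passed above appears to pack several pieces of information into one string. As a rough sketch, the helper below splits an identifier of exactly the shape shown (`ewt-train-12-semantics-pred-7`) into its parts. The field names (`graphid`, `domain`, `nodetype`, `position`) and the `split_node_id` helper are my own labels, and the pattern is inferred only from this one example, so other identifier shapes may not match.

```python
import re

# Pattern inferred from the single identifier used in these notes (an assumption)
NODE_ID = re.compile(
    r"^(?P<graphid>[a-z]+-[a-z]+-\d+)"
    r"-(?P<domain>[a-z]+)"
    r"-(?P<nodetype>[a-z]+)"
    r"-(?P<position>\d+)$"
)


def split_node_id(node_id):
    """Split a node identifier into graph id, domain, node type, and position."""
    match = NODE_ID.match(node_id)
    if match is None:
        raise ValueError(f"unrecognized node identifier: {node_id!r}")
    parts = match.groupdict()
    parts["position"] = int(parts["position"])
    return parts


print(split_node_id("ewt-train-12-semantics-pred-7"))
```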

Attributes available on each `UDSDocument` object:

- `.genre`
- `.text`
- `.timestamp`
- `.sentence_ids`
- `.sentence_graphs`

In [None]:
uds.documents["reviews-112579"].document_graph.nodes