## EnRel-G: Classes used in constructing a graph

```python
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None, tags = None):
        """Constructs an InputExample.

        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For
                single-sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second
                sequence. Must be specified only for sequence-pair tasks.
            label: (Optional) string. The label of the example. This should be
                specified for train and dev examples, but not for test examples.
            tags: (Optional) Extra annotations for the example, stored as given.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label
        self.tags = tags

class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_id,
                 valid_ids=None, label_mask=None, pos=None, dep=None,
                 head=None, adj_a=None, adj_f=None):
        # Note: pos, dep and head are accepted but not stored on the instance.
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_id = label_id
        self.valid_ids = valid_ids
        self.label_mask = label_mask
        self.adj_a = adj_a
        self.adj_f = adj_f
```

## EnRel-G: Steps to creating a graph

### data_utils.py

```python readfile(filename)```

Reads a .conll file and extracts each sentence together with its labels.
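As a rough illustration of this parsing step, the sketch below groups CoNLL-style lines into sentences. It assumes the common layout of one token per line (token in the first column, label in the last) with a blank line between sentences. That layout, the sample labels, and the name `read_conll_lines` are assumptions for illustration; the actual `readfile` may differ, and if the file stores a whole sentence per line the grouping step is unnecessary.

```python
def read_conll_lines(lines):
    """Group CoNLL-style lines into (tokens, labels) pairs, one per sentence."""
    data = []
    tokens, labels = [], []
    for line in lines:
        line = line.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if tokens:
                data.append((tokens, labels))
                tokens, labels = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])    # assumed: token is the first column
        labels.append(columns[-1])   # assumed: label is the last column
    if tokens:  # flush the last sentence if there is no trailing blank line
        data.append((tokens, labels))
    return data


# Made-up sample input; any iterable of lines works, including an open file.
sample = ["BERT B-Method", "uses O", "attention B-Method", "", "It O", "works O"]
sentences = read_conll_lines(sample)
```

To read from disk, the same helper can be called on a file handle, for example `with open(filename) as f: read_conll_lines(f)`.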

```python convert_examples_to_features(examples, label_list, max_seq_length, tokenizer, gat_type)```

Converts InputExample objects into numerical features for training:
- Tokenizes each example and labels each token with its part-of-speech tag, dependency relation, and syntactic head index.
- Inserts the [CLS] and [SEP] (start and end) tokens.
- Constructs segment IDs for each token.
- Converts tokens and labels into numerical IDs.

Returns a list of InputFeatures.
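The steps above can be sketched for a single, already tokenized sentence as follows. The toy vocabulary, the label map, the padding value of 0, and the helper name `example_to_feature_dict` are assumptions made for illustration; the real function uses the supplied tokenizer and also attaches the syntactic annotations and graphs.

```python
def example_to_feature_dict(tokens, labels, label_map, vocab, max_seq_length):
    """Turn one tokenized sentence into padded ID lists (single-sequence case)."""
    # Leave room for the [CLS] and [SEP] tokens.
    tokens = tokens[: max_seq_length - 2]
    labels = labels[: max_seq_length - 2]

    ntokens = ["[CLS]"] + tokens + ["[SEP]"]
    nlabels = ["[CLS]"] + labels + ["[SEP]"]
    segment_ids = [0] * len(ntokens)      # a single sequence uses one segment

    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in ntokens]
    label_ids = [label_map[l] for l in nlabels]
    input_mask = [1] * len(input_ids)     # 1 marks real tokens, 0 marks padding

    pad = max_seq_length - len(input_ids)
    return {
        "input_ids": input_ids + [0] * pad,
        "input_mask": input_mask + [0] * pad,
        "segment_ids": segment_ids + [0] * pad,
        "label_id": label_ids + [0] * pad,
    }


# Toy vocabulary and label map; ID 0 is reserved for padding in both.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "graph": 4, "networks": 5}
label_map = {label: i for i, label in enumerate(["O", "B-Method", "[CLS]", "[SEP]"], start=1)}

features = example_to_feature_dict(
    ["graph", "networks", "rock"], ["B-Method", "O", "O"], label_map, vocab, max_seq_length=8
)
```

Here the out-of-vocabulary token "rock" falls back to the `[UNK]` ID, and every list is padded to the same length so that examples can be batched.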

```python construct_graphs(input_ids,tokenizer, pos_ids, dep_ids, head, max_len, type)```

Builds a graph based on the specified type ('AF': returns both adjacency matrices, 'A': returns only adj_a, 'F': returns only adj_f)
- Converts token IDs to tokens
- Converts features into NumPy arrays
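The type switch can be pictured with the minimal sketch below. The helper name `select_graphs` and the choice to return `None` for the matrix that is not requested are assumptions; the actual return shape of `construct_graphs` is not specified here.

```python
def select_graphs(adj_a, adj_f, graph_type):
    """Return the adjacency matrices requested by graph_type ('AF', 'A' or 'F')."""
    if graph_type == "AF":
        return adj_a, adj_f
    if graph_type == "A":
        return adj_a, None      # assumed: the unused slot is left empty
    if graph_type == "F":
        return None, adj_f
    raise ValueError(f"Unknown graph type: {graph_type!r}")


# Placeholder 2x2 matrices standing in for the real graphs.
adj_a = [[1, 0], [0, 1]]
adj_f = [[0, 1], [1, 0]]
only_a = select_graphs(adj_a, adj_f, "A")
both = select_graphs(adj_a, adj_f, "AF")
```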

### build_graph.py

```python normalize_adj(mx)```: normalizes the adjacency matrix

```python normalize_features(mx)```: normalizes the feature matrix
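The exact normalization scheme is not stated here. As a hedged sketch, the code below uses two schemes that are common for graph neural networks: symmetric normalization D^(-1/2) A D^(-1/2) for the adjacency matrix and row normalization for the features. Both the schemes and the `_sketch` function names are assumptions, not a description of the repository's code.

```python
import numpy as np


def normalize_adj_sketch(adj):
    """Symmetric normalization: D^(-1/2) A D^(-1/2)."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1)
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.power(degree, -0.5)
    d_inv_sqrt[np.isinf(d_inv_sqrt)] = 0.0   # isolated nodes contribute nothing
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def normalize_features_sketch(features):
    """Row normalization: divide each row by its sum."""
    features = np.asarray(features, dtype=float)
    row_sum = features.sum(axis=1)
    with np.errstate(divide="ignore"):
        r_inv = np.power(row_sum, -1.0)
    r_inv[np.isinf(r_inv)] = 0.0             # all-zero rows stay all-zero
    return features * r_inv[:, None]


norm_adj = normalize_adj_sketch([[1, 1, 0], [1, 1, 0], [0, 0, 0]])
norm_feat = normalize_features_sketch([[1, 3], [0, 0]])
```

Rows that sum to zero, such as padding positions, are left as zeros instead of producing infinities.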

```python acronyms(tokens)```: identifies acronym tokens based on their surroundings in the sentence
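One possible heuristic for this kind of detection is sketched below: an all-uppercase token in parentheses is treated as an acronym when the initials of the words just before the opening parenthesis spell it out. The heuristic, the returned mapping, and the name `find_acronyms` are assumptions for illustration; the rules actually used by `acronyms` may differ.

```python
def find_acronyms(tokens):
    """Map each acronym index to the indices of its definition tokens.

    Only handles the pattern 'long form ( LF )' with one word per letter.
    """
    matches = {}
    for i, tok in enumerate(tokens):
        in_parens = (
            0 < i < len(tokens) - 1
            and tokens[i - 1] == "("
            and tokens[i + 1] == ")"
        )
        if not (in_parens and tok.isupper() and len(tok) >= 2):
            continue
        start = i - 1 - len(tok)          # first token of the candidate definition
        if start < 0:
            continue
        candidate = tokens[start : i - 1]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == tok:
            matches[i] = list(range(start, i - 1))
    return matches


tokens = ["We", "study", "named", "entity", "recognition", "(", "NER", ")", "here"]
acronym_map = find_acronyms(tokens)
```

In this example the token at index 6 ("NER") is linked to indices 2 to 4 ("named entity recognition"), which is the kind of pairing the lexicon rule can turn into edges.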

```python get_edges(sentence, pos_id, dep_id, head, longest_token_sequence_in_batch)```: collects the edges for one sentence using the two rules below.
NOTE: sentence corresponds to a line of the .conll file.
- Lexicon rule (Coreference Graph): edges between
    * Identical tokens
    * A token and another token whose lemma (root form) matches it
    * Acronyms and their definitions
- Dependency rule (Dependency Graph): edges between
    * Two words that are the subject and object of the same verb
    * A word and its syntactic head
    * Additional relations such as compounds
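A minimal sketch of two of these rules (identical tokens and word-to-head links) is shown below. It assumes 0-based head indices with the root pointing at itself, and it omits the lemma, acronym, subject/object, and compound rules. The name `get_edges_sketch` and its simplified signature are illustrative only.

```python
def get_edges_sketch(tokens, heads):
    """Collect undirected edges as sorted (i, j) index pairs with i < j."""
    edges = set()

    # Lexicon rule: connect repeated occurrences of exactly the same token.
    seen = {}
    for i, tok in enumerate(tokens):
        for j in seen.get(tok, []):
            edges.add((j, i))
        seen.setdefault(tok, []).append(i)

    # Dependency rule: connect each word to its syntactic head.
    for i, h in enumerate(heads):
        if h != i:  # assumed convention: the root points at itself
            edges.add((min(i, h), max(i, h)))

    return sorted(edges)


tokens = ["model", "beats", "model"]
heads = [1, 1, 1]          # "beats" is the root; both nouns attach to it
edges = get_edges_sketch(tokens, heads)
```

Storing each edge once with the smaller index first avoids duplicates when the same pair is produced by more than one rule.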

```python edge2adj(edges,longest_token_sequence_in_batch)```: Converts the list of edges into a symmetric adjacency matrix
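The conversion can be sketched as follows, assuming edges are given as pairs of token indices and the matrix is padded to a fixed size. The name `edges_to_adjacency`, the float32 dtype, and the absence of self-loops are assumptions.

```python
import numpy as np


def edges_to_adjacency(edges, size):
    """Build a size x size symmetric 0/1 adjacency matrix from index pairs."""
    adj = np.zeros((size, size), dtype=np.float32)
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0   # mirror the edge so the matrix stays symmetric
    return adj


# Three real tokens padded to length 4; index 3 is a padding position.
adj = edges_to_adjacency([(0, 1), (1, 2)], size=4)
```

Rows and columns for padding positions stay all zeros, so padded tokens are not connected to anything.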

```python buildgraph(sentences,max_len, pos_ids, dep_ids, head)```: gets the edges, constructs the adjacency matrix, and converts it into a torch tensor


### Would it be possible to provide the graph or the type of nodes and edges we want to extract?

Yes, the graph is constructed entirely from hand-written rules. We can specify which types of edges to add in ```python get_edges()```. By default each node is a token; this can also be changed manually.
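One way to make the edge types selectable is to keep each rule in its own function and pick rules by name, as in the sketch below. The two rules, the `EDGE_RULES` registry, and `collect_edges` are hypothetical and are not part of the existing code; they only illustrate how a new edge type could be plugged in.

```python
def adjacent_rule(tokens):
    """Hypothetical custom rule: link each token to its right-hand neighbour."""
    return {(i, i + 1) for i in range(len(tokens) - 1)}


def capitalized_rule(tokens):
    """Hypothetical custom rule: link all capitalized tokens to one another."""
    caps = [i for i, t in enumerate(tokens) if t[:1].isupper()]
    return {(a, b) for k, a in enumerate(caps) for b in caps[k + 1:]}


# Registry of available edge types; new rules are added here.
EDGE_RULES = {"adjacent": adjacent_rule, "capitalized": capitalized_rule}


def collect_edges(tokens, edge_types):
    """Union the edges produced by the selected rules."""
    edges = set()
    for name in edge_types:
        edges |= EDGE_RULES[name](tokens)
    return sorted(edges)


tokens = ["BERT", "encodes", "Text"]
chosen = collect_edges(tokens, ["capitalized"])
both = collect_edges(tokens, ["adjacent", "capitalized"])
```

Changing what counts as a node would follow the same idea: the rules would index phrases or entities instead of single tokens.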