In [1]:
from structural import Structural  # project-local module; only the Structural class is used below

Get token-level structural features for component identification:
- Token position:
    - Token is present in the introduction or conclusion
    - Token is the first or last token in its sentence
    - Relative and absolute token position in the document, paragraph, and sentence
- Token punctuation:
    - Token precedes or follows any punctuation, full stop, comma, or semicolon **Boolean**
    - Token is any punctuation or a full stop **Boolean**
- Position of covering sentence:
    - Absolute and relative position of the token’s covering sentence in the document and paragraph
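
As a rough illustration of the punctuation features, the sketch below computes `isPunc`, `followsPunc`, and `precedesPunc` for the tokens of one sentence. It is not the code in `structural`: the function name `punctuation_features`, the use of `string.punctuation` as the punctuation inventory, and the example sentence are all assumptions made for this example.

```python
import string


def punctuation_features(tokens):
    """Return punctuation flags for each token of a single sentence.

    Hypothetical helper: a token counts as punctuation when every
    character in it is in string.punctuation (an assumption).
    """
    def is_punc(tok):
        return len(tok) > 0 and all(ch in string.punctuation for ch in tok)

    features = []
    for i, tok in enumerate(tokens):
        features.append({
            "token": tok,
            "isPunc": is_punc(tok),
            # True when the previous token in the sentence is punctuation
            "followsPunc": i > 0 and is_punc(tokens[i - 1]),
            # True when the next token in the sentence is punctuation
            "precedesPunc": i + 1 < len(tokens) and is_punc(tokens[i + 1]),
        })
    return features


# Made-up example sentence, already tokenized
tokens = ["It", "is", "true", ",", "however", "."]
feats = punctuation_features(tokens)
for f in feats:
    print(f)
```

Neighbours are only looked up inside the sentence here, so the first token never follows punctuation and the last never precedes it; whether the real extractor looks across sentence boundaries is not something this sketch settles.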

In [2]:
# Local paths to the corpus and the preprocessed token/sentence files; adjust for your machine
essayDir = "/Users/amycweng/Downloads/CS333_Project/ArgumentAnnotatedEssays-2.0/brat-project-final"
annDir = "/Users/amycweng/Downloads/CS333_Project/CS333AES/stab/preprocessing/src/main/resources/token_level"
sentDir = "/Users/amycweng/Downloads/CS333_Project/CS333AES/stab/preprocessing/src/main/resources/sentence_sentiment"
filename = 'essay001'
essay_ann_file = f"{essayDir}/{filename}.ann"
essay_txt_file = f"{essayDir}/{filename}.txt"
token_file = f"{annDir}/{filename}.txt"
sentence_file = f"{sentDir}/{filename}.txt"

essay = Structural()
essay.read_data(essay_ann_file, essay_txt_file, token_file, sentence_file)
essay.annotate_tokens()
print(essay.annotations[0][1])
# outputdir = "/Users/amycweng/Downloads/CS333_Project/CS333AES/stab/token_annotations" 
# essay.write_data(f"{outputdir}/{filename}.csv")

{'token': 'It', 'sentence': 0, 'index': 1, 'lemma': 'it', 'pos': 'PRP', 'sentiment': 'Neutral', 'start': 55, 'paragraph': 0, 'docPosition': 'Introduction', 'sentPosition': 'First', 'IOB': 'O', 'isPunc': False, 'followsPunc': False, 'precedesPunc': False}


TOKEN STATISTICS FOR EACH COMPONENT (all **Integer** or **Float**)

For component classification:  
- (1) Number of tokens in component
- (2) Number of tokens in covering sentence 
- (3) Number of tokens in covering paragraph 
- (4) Number of tokens preceding component in sentence 
- (5) Number of tokens succeeding component in sentence

For component stance recognition:   
- (2), (4), (5)
- (6) Ratio of component tokens to sentence tokens, i.e. (1) / (2)

For relation identification: 
- Number of tokens in source. See (1)
- Number of tokens in target. See (1)
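
The counts above can be sketched from a component's token span inside its covering sentence. This is only an illustration under assumptions: `component_token_stats`, its parameters, and the `ratio` key are hypothetical names, and the span offsets (10 to 20) are chosen for the example rather than read from the corpus.

```python
def component_token_stats(sentence_len, paragraph_len, comp_start, comp_end):
    """Token statistics for one component (hypothetical helper).

    comp_start is inclusive and comp_end is exclusive; both are token
    offsets within the covering sentence.
    """
    within = comp_end - comp_start
    stats = {
        "sentence": sentence_len,               # (2) tokens in covering sentence
        "paragraph": paragraph_len,             # (3) tokens in covering paragraph
        "within": within,                       # (1) tokens in component
        "preceding": comp_start,                # (4) tokens before component in sentence
        "following": sentence_len - comp_end,   # (5) tokens after component in sentence
    }
    # (6) ratio of component tokens to sentence tokens; guard against empty sentences
    stats["ratio"] = within / sentence_len if sentence_len else 0.0
    return stats


# Assumed span: a 10-token component starting at offset 10 of a 21-token sentence
stats = component_token_stats(sentence_len=21, paragraph_len=95, comp_start=10, comp_end=20)
print(stats)
```

Storing the ratio alongside the raw counts is a design choice of this sketch; it can equally be derived on demand from `within` and `sentence`.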


In [3]:
essay.token_stats()
print("T1: ", essay.token_info["T1"]) 
print("\tRatio of component to sentence tokens: ", 
      essay.token_info["T1"]["within"]
        / essay.token_info["T1"]["sentence"])

T1:  {'sentence': 21, 'paragraph': 95, 'within': 10, 'preceding': 10, 'following': 1}
	Ratio of component to sentence tokens:  0.47619047619047616


COMPONENT STATISTICS 

For component classification:  
- (1) If first or last in paragraph **Boolean**
- (2) Present in intro or conclusion **Boolean**
- (3) Relative position in paragraph **Integer**
- (4) Number of preceding and following components in paragraph **Integer**

For relation identification (source and target are both in the same paragraph): 
- (5) Number of components between source and target **Integer**
- (6) Number of components in covering paragraph **Integer**
- (7) If source and target are present in the same sentence **Boolean**
- (8) If target is present before source **Boolean**
- (9) If source and target are the first or last component in paragraph **Boolean**
- (10) If source and target are present in the introduction or conclusion **Boolean**

For stance recognition: 
- (6), (4), (3)
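
A minimal sketch of how some of these features could be derived from the ordered list of components in one paragraph. It is not the `Structural` implementation: the function names, the `n_paragraphs=5` value, and the reading of (8) as "target appears earlier in the paragraph than source" are assumptions, and the class may use a different direction convention.

```python
def component_position_features(order, paragraph_index, n_paragraphs):
    """Per-component features for one paragraph (hypothetical helper).

    order lists the component names in the order they appear in the paragraph.
    """
    n = len(order)
    in_intro_or_conc = paragraph_index in (0, n_paragraphs - 1)
    return {
        name: {
            "first/last": i == 0 or i == n - 1,   # (1)
            "intro/conc": in_intro_or_conc,       # (2)
            "num_paragraph": n,                   # (6)
            "num_preceding": i,                   # (4)
            "num_following": n - i - 1,           # (4)
        }
        for i, name in enumerate(order)
    }


def pair_features(source, target, order, sentence_of):
    """Features for a source/target pair in the same paragraph (hypothetical helper)."""
    i, j = order.index(source), order.index(target)
    return {
        "num_between": abs(i - j) - 1,                               # (5)
        "num_paragraph": len(order),                                 # (6)
        "sameSentence": sentence_of[source] == sentence_of[target],  # (7)
        "targetBeforeSource": j < i,                                 # (8), assumed convention
    }


# Component names in paragraph order, with assumed sentence indices for the example
order = ["T3", "T4", "T5", "T6"]
sentence_of = {"T3": 4, "T4": 5, "T5": 6, "T6": 7}

position = component_position_features(order, paragraph_index=1, n_paragraphs=5)
pair = pair_features("T3", "T5", order, sentence_of)
print(position["T3"])
print(pair)
```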


In [4]:
essay.component_stats()
for name, info in essay.component_info.items(): 
    print(f"{name}: {info}")

T1: {'sentIdx': 3, 'first/last': True, 'intro/conc': True, 'num_paragraph': 1, 'num_preceding': 0, 'num_following': 0}
T3: {'sentIdx': 4, 'first/last': True, 'intro/conc': False, 'num_paragraph': 4, 'num_preceding': 0, 'num_following': 3}
T4: {'sentIdx': 5, 'first/last': False, 'intro/conc': False, 'num_paragraph': 4, 'num_preceding': 1, 'num_following': 2}
T5: {'sentIdx': 6, 'first/last': False, 'intro/conc': False, 'num_paragraph': 4, 'num_preceding': 2, 'num_following': 1}
T6: {'sentIdx': 7, 'first/last': True, 'intro/conc': False, 'num_paragraph': 4, 'num_preceding': 3, 'num_following': 0}
T8: {'sentIdx': 8, 'first/last': True, 'intro/conc': False, 'num_paragraph': 5, 'num_preceding': 0, 'num_following': 4}
T7: {'sentIdx': 9, 'first/last': False, 'intro/conc': False, 'num_paragraph': 5, 'num_preceding': 1, 'num_following': 3}
T9: {'sentIdx': 10, 'first/last': False, 'intro/conc': False, 'num_paragraph': 5, 'num_preceding': 2, 'num_following': 2}
T10: {'sentIdx': 12, 'first/last': F

In [13]:
essay.pairs()
print("Second Paragraph Pairwise Token Stats: \n")
# Index 1 selects the second paragraph; each entry maps source -> target -> stats
for source, targets in essay.pairwise_tokens[1].items():
    for target, info in targets.items():
        # info[0] and info[1] are the token counts of the source and target components
        print(f"Source {source} has {info[0]} tokens and Target {target} has {info[1]} tokens")

print("\nSecond Paragraph Pairwise Component Stats: \n")
for source, targets in essay.pairwise_components[1].items():
    for target, info in targets.items():
        print(f"Source {source} and Target {target}: {info}")
    print("\n")

Second Paragraph Pairwise Token Stats: 

Source T3 has 19 tokens and Target T4 has 27 tokens
Source T3 has 19 tokens and Target T5 has 41 tokens
Source T3 has 19 tokens and Target T6 has 21 tokens
Source T4 has 27 tokens and Target T3 has 19 tokens
Source T4 has 27 tokens and Target T5 has 41 tokens
Source T4 has 27 tokens and Target T6 has 21 tokens
Source T5 has 41 tokens and Target T3 has 19 tokens
Source T5 has 41 tokens and Target T4 has 27 tokens
Source T5 has 41 tokens and Target T6 has 21 tokens
Source T6 has 21 tokens and Target T3 has 19 tokens
Source T6 has 21 tokens and Target T4 has 27 tokens
Source T6 has 21 tokens and Target T5 has 41 tokens

Second Paragraph Pairwise Component Stats: 

Source T3 and Target T4: {'first/last': False, 'num_between': 0, 'num_paragraph': 4, 'intro/conc': False, 'targetBeforeSource': True, 'sameSentence': False}
Source T3 and Target T5: {'first/last': False, 'num_between': 1, 'num_paragraph': 4, 'intro/conc': False, 'targetBeforeSource': True