In [1]:
import pandas as pd
from collections import defaultdict

In [2]:
df = pd.read_csv("./Data/source-content/parc_features/parc_train_features.tsv", sep="\t", index_col=0, header=0)

In [3]:
df.columns

Index(['POS', 'dependency_head', 'dependency_label', 'doc_token_number',
       'lemma', 'ne_info', 'sentence_number', 'sentence_token_number', 'token',
       'cue_label', 'attribution', 'token_-1', 'token_-2', 'token_-3',
       'token_-4', 'token_-5', 'token_+1', 'token_+2', 'token_+3', 'token_+4',
       'token_+5', 'lemma_-1', 'lemma_-2', 'lemma_-3', 'lemma_-4', 'lemma_-5',
       'lemma_+1', 'lemma_+2', 'lemma_+3', 'lemma_+4', 'lemma_+5', 'POS_-1',
       'POS_-2', 'POS_-3', 'POS_-4', 'POS_-5', 'POS_+1', 'POS_+2', 'POS_+3',
       'POS_+4', 'POS_+5', 'bigram_prev_token', 'bigram_prev_lemma',
       'bigram_prev_POS', 'bigram_following_token', 'bigram_following_lemma',
       'bigram_following_POS', 'shape', 'ne_short', 'relevant_ne', 'ne_+-5',
       'candidate_cue', 'reporting_verb', 'quotation', 'near_sent_boundary',
       'near_doc_boundary', 'dist_beg_sent', 'dist_end_sent', 'sent_len',
       'pn_in_sent', 'ne_in_sent', 'qm_in_sent', 'any_in_sent', 'quotation_pn',
       'q

### Part 0: Analysis of Sources: Are they NEs?

In [4]:
pos_dict = defaultdict(int)  # POS counts for source tokens that are NOT relevant NEs
count = 0                    # all source tokens
for pos, attribution, relevant_ne in zip(df["POS"], df["attribution"], df["relevant_ne"]):
    for att in attribution.split(" "):
        att_split = att.split("-")
        # Skip empty labels ("_", "0", "") and keep only SOURCE labels.
        if att_split[0] not in {"_", "0", ""} and att_split[1] == "SOURCE":
            count += 1
            if relevant_ne == 0:
                pos_dict[pos] += 1


In [8]:
ne_count = 0  # source tokens tagged as any kind of NE
non_ne = 0    # source tokens with no NE tag ("O")
for attribution, ne_info in zip(df["attribution"], df["ne_info"]):
    for att in attribution.split(" "):
        att_split = att.split("-")
        # Skip empty labels ("_", "0", "") and keep only SOURCE labels.
        if att_split[0] not in {"_", "0", ""} and att_split[1] == "SOURCE":
            if ne_info != "O":
                ne_count += 1
            else:
                non_ne += 1

#### Analysis:

Note that this is all at the token level, so when I say "number of sources" I mean "number of tokens that appear in source spans".

pos_dict is a dict of the POS of sources that are NOT relevant NEs.

count is the total number of sources.

ne_count is the number of sources that are NEs (both relevant and other).

non_ne is the number of sources that are NOT NEs of any type.

In [21]:
print(f"Total sources that are not relevant NEs: {sum(pos_dict.values())}")
print(f"Total sources: {count}")
print(f"Total sources that are any type of NE: {ne_count}")
print(f"Total sources that are NOT any type of NE: {non_ne}")
print(f"Total sources that are pronouns: {pos_dict['PRP']+pos_dict['PRP$']}")

Total sources that are not relevant NEs: 37777
Total sources: 60513
Total sources that are any type of NE: 23845
Total sources that are NOT any type of NE: 36668
Total sources that are pronouns: 3570


In [23]:
pos_dict

defaultdict(int,
            {'NNS': 3425,
             'DT': 5403,
             'NN': 8608,
             'PRP': 3406,
             'IN': 3491,
             'POS': 550,
             'NNP': 3010,
             ',': 4077,
             'JJ': 2224,
             'JJR': 40,
             'RB': 216,
             'VBN': 298,
             'WDT': 225,
             'CC': 625,
             'CD': 536,
             'VBG': 213,
             'VBZ': 222,
             'WP': 368,
             'VBD': 171,
             'VBP': 52,
             'PRP$': 164,
             'JJS': 86,
             'NNPS': 58,
             '``': 21,
             "''": 27,
             'TO': 47,
             'VB': 70,
             '$': 35,
             'RBR': 6,
             'RBS': 21,
             '.': 10,
             'WRB': 11,
             'PDT': 13,
             'MD': 16,
             'WP$': 17,
             ':': 1,
             'HYPH': 2,
             'RP': 10,
             'UH': 2})

In [24]:
total_pos_dict = defaultdict(int)
for pos in df["POS"]:
    total_pos_dict[pos]+=1

In [27]:
total_pos_dict["NNP"]

102182

#### Conclusion:
It is NOT enough by a long shot to only consider NEs and pronouns as candidate sources: 34207 of the 60513 tokens in source spans (about 57%) are neither relevant NEs nor pronouns.
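
As a quick check of the arithmetic behind that share, here is a minimal sketch that uses only the counts printed above. The variable names are mine and do not appear in the cells above.

```python
# Counts copied from the printed output above.
total_sources = 60513
not_relevant_ne = 37777  # source tokens that are not relevant NEs
pronouns = 3570          # PRP + PRP$ among those tokens

# Source tokens that are neither relevant NEs nor pronouns.
uncovered = not_relevant_ne - pronouns
uncovered_share = uncovered / total_sources

print(uncovered, f"{uncovered_share:.1%}")
```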

### Part 1: Candidate Mention Detection
This happens at the token/phrase level; ideally we have IOB spans that can be collapsed into symbols in later steps.
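
One possible way to get IOB spans out of per-token candidate flags is sketched below. The function name `to_iob` and the toy inputs are hypothetical, and this simple scheme merges adjacent candidates into one span, which may or may not be what we want.

```python
def to_iob(flags, sentence_ids):
    """Turn per-token 0/1 candidate flags into IOB tags.

    A flagged token gets "B" if the previous token is unflagged or sits in
    a different sentence; otherwise it continues the open span with "I".
    """
    tags = []
    prev_flag, prev_sent = 0, None
    for flag, sent in zip(flags, sentence_ids):
        if not flag:
            tags.append("O")
        elif prev_flag and sent == prev_sent:
            tags.append("I")
        else:
            tags.append("B")
        prev_flag, prev_sent = flag, sent
    return tags


# Toy example: the last token starts a new span because the sentence changes.
flags = [0, 1, 1, 0, 1, 1]
sents = [0, 0, 0, 0, 0, 1]
iob_tags = to_iob(flags, sents)
print(iob_tags)
```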

In [None]:
# Step 1: Content in sentence label

In [28]:
def content_in_sentence(df):
    '''
    Takes a df with "attribution", "filename", and "sentence_number" columns and returns a list (column) of binary
    "sentence contains a content" labels
    '''
    sent_with_content = set()
    for filename, sentence_number, attribution in zip(df["filename"], df["sentence_number"], df["attribution"]):
        for att in attribution.split(" "):
            att_split = att.split("-")
            if att_split[0] not in {"_", "0", ""} and att_split[1] == "CONTENT":
                sent_with_content.add((filename, sentence_number))
    # 1 if the token's sentence contains any content token, else 0.
    return [
        int((filename, sentence_number) in sent_with_content)
        for filename, sentence_number in zip(df["filename"], df["sentence_number"])
    ]

    

In [29]:
content_in_sent_labels = content_in_sentence(df)

In [30]:
df["content_in_sent"] = content_in_sent_labels

In [None]:
# Step 2: count sources in content sentences

In [33]:
content_in_sent_count = 0  # source tokens in sentences that contain a content
for attribution, content_in_sent_label in zip(df["attribution"], df["content_in_sent"]):
    for att in attribution.split(" "):
        att_split = att.split("-")
        # Skip empty labels ("_", "0", "") and keep only SOURCE labels.
        if att_split[0] not in {"_", "0", ""} and att_split[1] == "SOURCE":
            if content_in_sent_label == 1:
                content_in_sent_count += 1


In [35]:
print(f"Number of sources that occur in sentences that contain contents: {content_in_sent_count}")

Number of sources that occur in sentences that contain contents: 60274


##### Conclusion here: Basically all (60274/60513, about 99.6%) source tokens appear in a sentence that contains a content
Note that this doesn't necessarily mean that each source appears in the same sentence as its own content, only that its sentence contains some content.

It does mean that restricting candidates to tokens/spans in sentences that contain a content loses almost no sources, so it is a good filter for candidate sources.
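
To check the stricter question (does a source share a sentence with its own content?), something like the following sketch might work. It runs on a toy frame rather than `df`, and it assumes each attribution label has the form `<id>-<ROLE>`, with the first field identifying the attribution; that reading of the label format is my assumption.

```python
from collections import defaultdict

import pandas as pd

# Toy stand-in for df; the "<id>-<ROLE>" label format is an assumption.
toy = pd.DataFrame({
    "filename": ["a", "a", "a", "a", "a"],
    "sentence_number": [0, 0, 1, 1, 2],
    "attribution": ["1-SOURCE", "1-CONTENT", "2-SOURCE", "_", "2-CONTENT"],
})

source_sents = defaultdict(set)   # (filename, id) -> sentences with its source tokens
content_sents = defaultdict(set)  # (filename, id) -> sentences with its content tokens
for filename, sent, attribution in zip(toy["filename"], toy["sentence_number"], toy["attribution"]):
    for att in attribution.split(" "):
        parts = att.split("-")
        if parts[0] in {"_", "0", ""} or len(parts) < 2:
            continue
        key = (filename, parts[0])
        if parts[1] == "SOURCE":
            source_sents[key].add(int(sent))
        elif parts[1] == "CONTENT":
            content_sents[key].add(int(sent))

# True if some sentence holds both source and content tokens of the attribution.
same_sentence = {
    key: bool(sents & content_sents.get(key, set()))
    for key, sents in source_sents.items()
}
print(same_sentence)
```

In the toy data, attribution 1 has its source and content in sentence 0, while attribution 2 has them in different sentences.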

In [39]:
# Per POS tag: [count as a source token, total count], only in sentences that contain a content.
pos_candidate_dict = {"NN": [0, 0], "NNS": [0, 0], "NNP": [0, 0], "PRP": [0, 0]}
for pos, attribution, label in zip(df["POS"], df["attribution"], df["content_in_sent"]):
    if pos in pos_candidate_dict and label == 1:
        for att in attribution.split(" "):
            att_split = att.split("-")
            if att_split[0] not in {"_", "0", ""} and att_split[1] == "SOURCE":
                pos_candidate_dict[pos][0] += 1
        pos_candidate_dict[pos][1] += 1

This dictionary counts occurrences of each POS tag in sentences that contain a content.
The first number is the count of that POS AS A SOURCE in such sentences, and the second is the total count of that POS in such sentences.

In [40]:
pos_candidate_dict

{'NN': [8810, 61333],
 'NNS': [3492, 26404],
 'NNP': [22681, 42763],
 'PRP': [3402, 11974]}

##### Conclusion here: Probably viable to use these POS in sentences with content as candidate mentions; we'll have a ~27% positive example ratio (38385 of 142474 candidate tokens), and we should get nearly all mentions.
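
For reference, the positive ratio can be recomputed from the dictionary output above, as in this small sketch. It also computes the share of all source tokens that these candidates cover, using the 60513 total printed earlier; the variable names are mine.

```python
# Counts copied from the pos_candidate_dict output above: [as source, total].
pos_candidate_dict = {
    "NN": [8810, 61333],
    "NNS": [3492, 26404],
    "NNP": [22681, 42763],
    "PRP": [3402, 11974],
}

positives = sum(as_source for as_source, _ in pos_candidate_dict.values())
candidates = sum(total for _, total in pos_candidate_dict.values())
positive_ratio = positives / candidates

# Share of all source tokens (60513, printed earlier) that are candidates.
token_coverage = positives / 60513

print(positives, candidates, f"{positive_ratio:.1%}", f"{token_coverage:.1%}")
```

The token-level coverage is well below 100%, which is expected: determiners, prepositions, commas and so on inside source spans are not candidates. Whether nearly all mentions are reached is a span-level question.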

The next valuable step would be to determine the number of sources covered at least partially (as in, not necessarily whole span) by such POS's. The numbers to this point are entirely token-based (not overall source span based).
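
A rough sketch of that span-level check, on a toy frame rather than `df`: it treats all SOURCE tokens that share a filename and an attribution id as one source span, and again assumes labels of the form `<id>-<ROLE>`. Both are my assumptions about the data, not something verified here.

```python
from collections import defaultdict

import pandas as pd

CANDIDATE_POS = {"NN", "NNS", "NNP", "PRP"}

# Toy stand-in for df; the "<id>-<ROLE>" label format is an assumption.
toy = pd.DataFrame({
    "filename": ["a"] * 5,
    "POS": ["DT", "NN", "VBD", "IN", "DT"],
    "attribution": ["1-SOURCE", "1-SOURCE", "_", "2-SOURCE", "2-SOURCE"],
    "content_in_sent": [1, 1, 1, 1, 1],
})

# (filename, id) -> True once any token of that source span is a candidate.
span_covered = defaultdict(bool)
for filename, pos, attribution, label in zip(
    toy["filename"], toy["POS"], toy["attribution"], toy["content_in_sent"]
):
    is_candidate = bool(pos in CANDIDATE_POS and label == 1)
    for att in attribution.split(" "):
        parts = att.split("-")
        if parts[0] in {"_", "0", ""} or len(parts) < 2 or parts[1] != "SOURCE":
            continue
        key = (filename, parts[0])
        span_covered[key] = span_covered[key] or is_candidate

covered = sum(span_covered.values())
print(f"{covered} of {len(span_covered)} source spans covered at least partially")
```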

### Part 2: Collapsing/Anonymizing Contents and Sources
The goal here is to take content and source spans and convert them into symbols.

The complication is that we need to keep working with DFs, and it'll be hard to collapse portions of a column or two and maintain the other columns.

I need to decide what info needs to come out of this. That comes down to two things: i) the rest of this classifier (what is needed for the feature engineering?), and ii) evaluation: we can evaluate based on the labels we concoct, but what about at the level of the source-contents we're provided? Ideally we'll use the eval script from Roser, but that makes the conversion to and from our classifier DF really challenging.

Ideally, there's a pandas functionality that lets me preserve index nums as I collapse the DF so they can be put back.
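
One option that might do this is a `groupby` with named aggregation that keeps the original token numbers as a list, so they can be expanded back later with `explode`. This is only a sketch on a toy frame; the `span_id` column is hypothetical (a shared id for tokens that collapse together, a unique id otherwise).

```python
import pandas as pd

# Toy stand-in for df; "span_id" is a hypothetical grouping column.
toy = pd.DataFrame({
    "filename": ["a", "a", "a", "a"],
    "doc_token_number": [0, 1, 2, 3],
    "token": ["The", "company", "said", "profits"],
    "span_id": [0, 0, 1, 2],
})

collapsed = (
    toy.groupby(["filename", "span_id"], sort=False)
    .agg(
        token=("token", " ".join),
        first_token=("doc_token_number", "min"),
        last_token=("doc_token_number", "max"),
        doc_token_numbers=("doc_token_number", list),
    )
    .reset_index()
)

# Expanding the stored token numbers gives one row per original token again.
restored = collapsed.explode("doc_token_numbers")
print(collapsed)
```

The other feature columns would each need their own aggregation rule (first value, majority value, and so on), which is the part still to be decided.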

There are, of course, the "filename" and "doc_token_number" columns to work with; in converting back to eval mode these can make all the difference.

The other question is how do I deal with multi-part contents/spans? Where do I put them, and how on Earth do I convert them back? I probably need to keep information on the original location of these spans before I collapse them entirely. Maybe a "CONTENT-(span1)\_(span2)\_..." kind of label
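
A sketch of what such a label could look like, and how it could be decoded again. The concrete encoding (each part written as `start:end` in `doc_token_number` terms) and both function names are hypothetical choices for illustration.

```python
def encode_span_label(role, spans):
    """Build a label such as "CONTENT-3:5_9:12" from (start, end) token ranges."""
    return role + "-" + "_".join(f"{start}:{end}" for start, end in spans)


def decode_span_label(label):
    """Invert encode_span_label, returning (role, [(start, end), ...])."""
    role, _, encoded = label.partition("-")
    spans = [tuple(int(n) for n in part.split(":")) for part in encoded.split("_")]
    return role, spans


# A two-part content covering tokens 3..5 and 9..12 of a document.
label = encode_span_label("CONTENT", [(3, 5), (9, 12)])
print(label)
print(decode_span_label(label))
```

Combined with "filename", a label like this keeps enough information to map a collapsed symbol back to its original token positions.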