In [1]:
import spacy 
import pandas as pd

Below we load the spaCy model for coreference resolution. It requires a specific version of numpy, so this step must be done separately.

In [2]:
nlp = spacy.load("en_coreference_web_trf")



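The function first iterates through each coreference cluster and builds a dictionary keyed by token position. For every mention after the first one in a cluster, the first token of the mention maps to the text of the cluster's first mention, and any remaining tokens of the mention map to an empty string. The sentence is then rebuilt token by token, substituting each reference with the entity it refers to.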

In [3]:

def resolve_references(sentence):
    doc = nlp(sentence)
    token_mention_mapper = {}
    output_string = ""
    clusters = [
        val for key, val in doc.spans.items() if key.startswith("coref_cluster")
    ]

    # Iterate through every found cluster
    for cluster in clusters:
        first_mention = cluster[0]
        # Iterate through every other span in the cluster
        for mention_span in list(cluster)[1:]:
            # Set first_mention as value for the first token in mention_span in the token_mention_mapper
            token_mention_mapper[mention_span[0].idx] = first_mention.text + mention_span[0].whitespace_
            
            for token in mention_span[1:]:
                # Set empty string for all the other tokens in mention_span
                token_mention_mapper[token.idx] = ""

    # Iterate through every token in the Doc
    for token in doc:
        # Check if token exists in token_mention_mapper
        if token.idx in token_mention_mapper:
            output_string += token_mention_mapper[token.idx]
        # Else add original token text
        else:
            output_string += token.text + token.whitespace_

    return output_string
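
To make the intermediate dictionary easier to picture, here is a minimal sketch of the same two-pass idea on plain Python data, with no spaCy model involved. The `(text, trailing_whitespace)` token pairs, the `(start, end)` span tuples, and the name `resolve_with_mapping` are all hypothetical stand-ins for illustration. The sketch keys the dictionary by token index rather than character offset, and otherwise mirrors the function above, including its choice to reuse the whitespace of the mention's first token.

```python
def resolve_with_mapping(tokens, clusters):
    """Return (mapping, rebuilt_text) for toy tokens and clusters.

    tokens   -- list of (text, trailing_whitespace) pairs (hypothetical stand-in for spaCy tokens)
    clusters -- list of clusters, each a list of (start, end) token spans, end exclusive
    """
    mapping = {}  # token index -> text emitted in place of that token

    for cluster in clusters:
        first_start, first_end = cluster[0]
        # Text of the first mention without its trailing whitespace
        first_text = "".join(
            text + ws for text, ws in tokens[first_start:first_end]
        ).rstrip()
        for start, end in cluster[1:]:
            # First token of a later mention is replaced by the first mention
            mapping[start] = first_text + tokens[start][1]
            # Remaining tokens of that mention are dropped
            for i in range(start + 1, end):
                mapping[i] = ""

    rebuilt = "".join(
        mapping.get(i, text + ws) for i, (text, ws) in enumerate(tokens)
    )
    return mapping, rebuilt


# "Anna told Ben that she trusted him."
toy_tokens = [
    ("Anna", " "), ("told", " "), ("Ben", " "), ("that", " "),
    ("she", " "), ("trusted", " "), ("him", ""), (".", ""),
]
# Two hand-written clusters: {Anna, she} and {Ben, him}
toy_clusters = [[(0, 1), (4, 5)], [(2, 3), (6, 7)]]

toy_mapping, toy_text = resolve_with_mapping(toy_tokens, toy_clusters)
print(toy_mapping)
print(toy_text)
```

Here only the tokens for "she" and "him" receive entries in the dictionary, so every other token is copied through unchanged during the rebuild.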

Below is an example of coreference resolution in action, using a portion of an article about Donald Trump.

In [4]:
sent = '''Donald Trump’s hush-money trial is set to begin on March 25, a New York judge ordered Thursday, making it the first of the four 
          criminal cases the former president faces to be heard by a jury.Justice Juan Merchan, who is presiding over the case, also denied Trump’s bid 
          to dismiss the 34 felony counts pending against him. Trump sat at the defense table flanked by his lawyers.Sitting for a weekslong trial in Manhattan 
          could complicate Trump’s presidential bid and add to his legal burdens, which include three other prosecutions and a civil-fraud case that could 
          put him on the hook to pay hundreds of millions of dollars in penalties. A ruling on the civil matter, in which the New York attorney general 
          alleged Trump inflated his wealth for financial gain, could come as soon as Friday.'''
coref_sent = resolve_references(sent)
print(coref_sent)

Donald Trump’s hush-money trial is set to begin on March 25, a New York judge ordered Thursday, making ’s hush-money trial the first of the four 
          criminal cases Donald Trump’s faces to be heard by a jury.a New York judge also denied Donald Trump’sbid 
          to dismiss the 34 felony counts pending against Donald Trump’s. Donald Trump’s sat at the defense table flanked by Donald Trump’s lawyers.Sitting for ’s hush-money trial could complicate Donald Trump’spresidential bid and add to Donald Trump’s legal burdens, which include three other prosecutions and a civil-fraud case that could 
          put Donald Trump’s on the hook to pay hundreds of millions of dollars in penalties. A ruling on the civil matter, in which the New York attorney general 
          alleged Donald Trump’s inflated Donald Trump’s wealth for financial gain, could come as soon as Friday.
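
The output above contains a few run-together words such as `Trump’sbid`. One plausible reading of the code is that this happens when a replaced mention spans several tokens: the replacement reuses the whitespace of the mention's first token, while the space that actually followed the mention belongs to its last token. The sketch below reproduces that effect on toy data. The tokenization of "Trump’s" into two tokens and the helper `replace_mention` are assumptions made for illustration, not taken from the model's actual output.

```python
# Hypothetical tokenization of "denied Trump’s bid" as (text, trailing_whitespace) pairs
tokens = [("denied", " "), ("Trump", ""), ("’s", " "), ("bid", "")]
mention = (1, 3)  # token span covering "Trump’s", end exclusive
representative = "Donald Trump’s"


def replace_mention(tokens, mention, replacement, whitespace_from):
    """Rebuild the text with one mention replaced.

    whitespace_from -- "first" or "last": which token of the replaced
                       mention supplies the trailing whitespace
    """
    start, end = mention
    ws_index = start if whitespace_from == "first" else end - 1
    pieces = []
    for i, (text, ws) in enumerate(tokens):
        if i == start:
            pieces.append(replacement + tokens[ws_index][1])
        elif start < i < end:
            continue  # remaining tokens of the mention are dropped
        else:
            pieces.append(text + ws)
    return "".join(pieces)


first_ws = replace_mention(tokens, mention, representative, "first")
last_ws = replace_mention(tokens, mention, representative, "last")
print(first_ws)
print(last_ws)
```

Taking the whitespace from the last token keeps the space in this toy case. It does not address the other artifacts visible in the output, such as the truncated mention `’s hush-money trial` or the possessive form being inserted where a plain name is expected.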


In [10]:
# Apply coreference resolution to the entire corpus
# Read in each DataFrame and keep only rows whose summary is a string (no empty entries)
center = pd.read_csv("model_data/NER_center.csv")
left = pd.read_csv("model_data/NER_left.csv")
right = pd.read_csv("model_data/NER_right.csv")

center = center[center['summary'].apply(type) == str]
left = left[left['summary'].apply(type) == str]
right = right[right['summary'].apply(type) == str]

In [13]:
# Also load the FOX and CNN articles so coreference resolution can be applied to them
CNN_articles = pd.read_csv('application_data/cnn.csv')
FOX_articles = pd.read_csv('application_data/foxnews.csv')

CNN_articles = CNN_articles[CNN_articles['summary'].apply(type) == str]
FOX_articles = FOX_articles[FOX_articles['summary'].apply(type) == str]

In [14]:
dfs = [center, left, right, CNN_articles, FOX_articles]

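for df in dfs:
    for index in df.index:
        print(index)
        # Write through df.at: assigning to the row yielded by iterrows()
        # only changes a temporary copy and leaves the DataFrame untouched
        article = df.at[index, 'summary']
        resolved_article = resolve_references(article)
        df.at[index, 'summary'] = resolved_article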

0
1
2


KeyboardInterrupt: 
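
When updating a DataFrame row by row, it helps to know that the `row` yielded by `iterrows()` is generally a copy, so assigning to it does not write back to the DataFrame. The small sketch below shows the difference on a hypothetical two-row frame; the column values are made up, and the `id` column is there only to give the frame mixed column types, as a CSV of articles typically has.

```python
import pandas as pd

# Hypothetical frame standing in for the article data
df = pd.DataFrame({"id": [1, 2], "summary": ["he left", "she stayed"]})

# Assigning to the row from iterrows() changes only a temporary copy
for index, row in df.iterrows():
    row["summary"] = row["summary"].upper()
after_row_assignment = df["summary"].tolist()

# Assigning through df.at writes into the DataFrame itself
for index in df.index:
    df.at[index, "summary"] = df.at[index, "summary"].upper()
after_at_assignment = df["summary"].tolist()

print(after_row_assignment)
print(after_at_assignment)
```

For a plain column transformation, `df["summary"] = df["summary"].apply(some_function)` is a common alternative to an explicit loop, at the cost of losing the per-row progress printout.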

In [None]:
center.to_csv("model_data/NER_center.csv")
left.to_csv("model_data/NER_left.csv")
right.to_csv("model_data/NER_right.csv")

CNN_articles.to_csv('application_data/cnn_coref.csv')
FOX_articles.to_csv('application_data/fox_coref.csv')
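
One thing to watch when a CSV is read, modified, and written back to the same path: `to_csv` writes the index as an extra unnamed column by default, and `read_csv` then loads it as a regular column named `Unnamed: 0`. The sketch below shows the round trip on a hypothetical one-column frame, using in-memory buffers instead of files. Whether `index=False` is wanted here depends on how the saved files are consumed downstream, which the notebook does not show.

```python
import io

import pandas as pd

# Hypothetical frame standing in for one of the article tables
df = pd.DataFrame({"summary": ["first article", "second article"]})

# Default behaviour: the index is written as an unnamed first column
buffer_default = io.StringIO()
df.to_csv(buffer_default)
buffer_default.seek(0)
reloaded_default = pd.read_csv(buffer_default)

# With index=False the round trip returns the original columns
buffer_no_index = io.StringIO()
df.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
reloaded_no_index = pd.read_csv(buffer_no_index)

print(list(reloaded_default.columns))
print(list(reloaded_no_index.columns))
```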

