# Creating a ConvoKit Corpus object
Following the ConvoKit example notebook at https://github.com/CornellNLP/ConvoKit/blob/master/examples/converting_movie_corpus.ipynb

In [None]:
!pip install torch torchvision
!pip install convokit
!pip install datasets

In [2]:
from convokit import Corpus, Speaker, Utterance
import pandas as pd
import tqdm
import ast

In [3]:
media_sum_path = "data/MediaSum/news_dialogue.json"
media_sum_json = pd.read_json(media_sum_path)

In [4]:
media_sum_json

Unnamed: 0,id,program,date,url,title,summary,utt,speaker
0,NPR-1,News & Notes,2007-11-28,https://www.npr.org/templates/story/story.php?...,Black Actors Give Bible Star Appeal,"More than 400 black actors, artists and minist...","[Now, moving on, Forest Whitaker as Moses, Tis...","[FARAI CHIDEYA, host, FARAI CHIDEYA, host, Mr...."
1,NPR-2,Weekend Edition Sunday,2016-10-23,https://www.npr.org/2016/10/23/499042298/young...,"Young, First-Time Voters Share Views On Electi...",NPR's Rachel Martin speaks with young voters w...,[You have heard it again and again - this is a...,"[RACHEL MARTIN, HOST, ASHANTI MARTINEZ, LAUREN..."
2,NPR-3,News & Notes,2007-11-30,https://www.npr.org/templates/story/story.php?...,Snapshots: On Solid Ground,"In this week's snapshot, actor and playwright ...","[I came close to running out of luck, when I a...","[Mr. JEFF OBAFEMI CARR (Actor, Playwright), CH..."
3,NPR-4,News & Notes,2007-11-30,https://www.npr.org/templates/story/story.php?...,"Washington, D.C. Facing HIV/AIDS Epidemic",A new study says one in 50 people in the natio...,"[This is NEWS & NOTES. I'm Farai Chideya., In ...","[FARAI CHIDEYA, host, FARAI CHIDEYA, host, Dr...."
4,NPR-5,News & Notes,2007-11-30,https://www.npr.org/templates/story/story.php?...,Coping When AIDS Hits Your Family: Part II,When a family member is diagnosed with HIV/AID...,"[I'm Farai Chideya and this is NEWS & NOTES., ...","[FARAI CHIDEYA, host, FARAI CHIDEYA, host, FAR..."
...,...,...,...,...,...,...,...,...
463591,CNN-414237,CNN NEWSROOM,2020-10-25,http://transcripts.cnn.com/TRANSCRIPTS/2010/25...,,"U.S. Officials: Russia, Iran Have Stolen Voter...",[Welcome back to our viewers in the United Sta...,"[BRUNHUBER, NATASHA CHEN, CNN CORRESPONDENT, W..."
463592,CNN-414238,CNN NEWSROOM,2020-10-25,http://transcripts.cnn.com/TRANSCRIPTS/2010/25...,,Nigerian Police Force Mobilize To Quell Worst ...,"[In Nigeria, chaotic scenes of looting and des...","[BRUNHUBER, BRUNHUBER (voice-over), BRUNHUBER ..."
463593,CNN-414239,CNN NEWSROOM,2020-10-25,http://transcripts.cnn.com/TRANSCRIPTS/2010/25...,,COVID-19 Triggers Rise In Asian American Unemp...,[Officials in the U.S. are worried about wides...,"[BRUNHUBER, AMARA WALKER, CNN ANCHOR (voice-ov..."
463594,CNN-414240,STATE OF THE UNION,2020-10-25,http://transcripts.cnn.com/TRANSCRIPTS/2010/25...,,COVID-19 Outbreak Hits Vice President Pence's ...,[Dark winter? U.S. COVID cases hit a new daily...,"[JAKE TAPPER, CNN HOST (voice-over), DONALD TR..."


## 1. Creating speaker objects

**Note**: Identifiers in the speaker list are not always unique per person (e.g., 'STEVE PROFFITT', 'PROFFITT' and 'S. PROFFITT' can refer to the same speaker); see the example below. I currently **do not** address this: each distinct identifier string counts as a different speaker. Conversely, an identifier that appears in several conversations counts as the same speaker in all of them. This is probably wrong for generic labels such as 'UNIDENTIFIED MALE' or 'UNIDENTIFIED FEMALE' in the example below, but I do not address it for now.

In [5]:
media_sum_json["speaker"][300000]

['CUOMO',
 'ED LAVANDERA, CNN CORRESPONDENT',
 'LAVANDERA (voice-over)',
 'ERIC HOLDER, U.S. ATTORNEY GENERAL',
 'LAVANDERA',
 'UNIDENTIFIED FEMALE',
 'UNIDENTIFIED MALE',
 'UNIDENTIFIED MALE',
 'LAVANDERA',
 'HOLDER',
 'LAVANDERA',
 'LAVANDERA',
 'PEREIRA',
 'PASTOR ROBERT WHITE, PEACE OF MIND CHURCH OF HAPPINESS',
 'PEREIRA',
 'MO IVORY, ATTORNEY/RADIO PERSONALITY',
 'PEREIRA',
 'IVORY',
 'PEREIRA',
 'IVORY',
 'PEREIRA',
 'WHITE',
 'PEREIRA',
 'WHITE',
 'PEREIRA',
 'WHITE',
 'PEREIRA',
 'WHITE',
 'PEREIRA',
 'IVORY',
 'WHITE',
 'IVORY',
 'PEREIRA',
 'IVORY',
 'PEREIRA',
 'WHITE',
 'PEREIRA',
 'WHITE',
 'PEREIRA',
 'CUOMO',
 'BERMAN']

I therefore work with the simplifying, and strictly speaking incorrect, **assumption that each string in the speaker list identifies exactly one speaker and is the only string used for that speaker across the whole dataset**.
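Although this notebook does not merge name variants, the following is a minimal sketch of how one might collapse them. The helper `surname_key` is hypothetical (it is not part of ConvoKit or MediaSum). It assumes that role descriptions follow a comma or sit in parentheses and that the last remaining token is the surname. It would wrongly merge different people who share a surname, so it is an illustration rather than a recommended fix.

```python
import re


def surname_key(raw_name: str) -> str:
    """Hypothetical normalizer: reduce a speaker string to a surname key."""
    # drop parenthetical remarks such as "(voice-over)" or "(Actor, Playwright)"
    name = re.sub(r"\s*\([^)]*\)", "", raw_name)
    # drop role descriptions after the first comma, e.g. ", CNN CORRESPONDENT"
    name = name.split(",", 1)[0].strip()
    # keep generic labels intact; they do not identify a single person
    if not name or name.upper().startswith("UNIDENTIFIED"):
        return name.upper() or raw_name
    # assumption: the last remaining token is the surname
    return name.split()[-1].upper()


examples = [
    "ED LAVANDERA, CNN CORRESPONDENT",
    "LAVANDERA (voice-over)",
    "LAVANDERA",
    "UNIDENTIFIED MALE",
]
for raw in examples:
    print(f"{raw!r} -> {surname_key(raw)!r}")
```

With such a key, the three LAVANDERA variants from the example above would map to one entry, while 'UNIDENTIFIED MALE' stays a generic label.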

In [6]:
# get all speakers from the speaker column
speakers = media_sum_json['speaker']
unique_speakers = sorted(set(name for sublist in speakers for name in sublist))

For each unique name, I create a Speaker object whose only information is the name itself, used both as the identifier and as metadata.

In [7]:
print(len(unique_speakers))

718483


In [8]:
corpus_speakers = {speaker_name: Speaker(id = speaker_name, meta ={'name': speaker_name}) for speaker_name in unique_speakers}

In [9]:
corpus_speakers['LAVANDERA']

Speaker({'obj_type': 'speaker', 'vectors': [], 'owner': None, 'id': 'LAVANDERA', 'temp_backend': {}, 'meta': {'name': 'LAVANDERA'}})

In [10]:
corpus_speakers['ED LAVANDERA, CNN CORRESPONDENT']

Speaker({'obj_type': 'speaker', 'vectors': [], 'owner': None, 'id': 'ED LAVANDERA, CNN CORRESPONDENT', 'temp_backend': {}, 'meta': {'name': 'ED LAVANDERA, CNN CORRESPONDENT'}})

## 2. Creating utterance objects

In [11]:
type(media_sum_json['utt'][0])

list

In [12]:
utterance_corpus = {}
conversation_meta = {}

# iterate over each row in the dataframe
for index, row in tqdm.tqdm(media_sum_json.iterrows(), total=media_sum_json.shape[0]):
    # get the conversation id
    conversation_id = row['id']
    program = row['program']
    date = row['date']
    summary = row['summary']
    url = row['url']
    title = row['title']

    conversation_meta[conversation_id] = {
        'program': program,
        'date': date,
        'summary': summary,
        'url': url,
        'title': title,
        'broadcaster': conversation_id.split('-')[0],  # should be either NPR or CNN
    }

    # get utterance information
    utterance_list = row['utt']
    speaker_list = row['speaker']

    for i, utt in enumerate(utterance_list):
        # build a unique utterance ID following https://aclanthology.org/2024.emnlp-main.52.pdf,
        #   i.e., an ID of the form 'CNN-67148-13', where 'CNN-67148' is the conversation ID used in MediaSum and 13 is the index of the utterance in the original utterance list
        utterance_id = f"{conversation_id}-{i}"
        utt_speaker = corpus_speakers[speaker_list[i]]
        utt_text = utt
        reply_to = None if i == 0 else f"{conversation_id}-{i-1}"  # reply_to is None for the first utterance in the conversation
        # timestamp is not provided

        utterance_corpus[utterance_id] = Utterance(
            id=utterance_id,
            speaker=utt_speaker,
            conversation_id=conversation_id,
            reply_to=reply_to,
            text=utt_text,
        )

print(f"Total number of utterances: {len(utterance_corpus)}")

100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 463596/463596 [01:06<00:00, 6972.92it/s]

Total number of utterances: 13919244
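The ID scheme used in the loop above can be isolated into two small helpers. This is a sketch only: the names `build_utterance_ids` and `parse_utterance_id` are hypothetical, and the length check is an extra safeguard that the loop above does not perform.

```python
def build_utterance_ids(conversation_id, utterances, speakers):
    """Return (utterance_id, reply_to) tuples for one conversation."""
    # safeguard (an addition, not in the loop above): both lists must align
    if len(utterances) != len(speakers):
        raise ValueError(f"{conversation_id}: utterance/speaker length mismatch")
    ids = []
    for i in range(len(utterances)):
        utterance_id = f"{conversation_id}-{i}"
        # the first utterance replies to nothing; every other one to its predecessor
        reply_to = None if i == 0 else f"{conversation_id}-{i - 1}"
        ids.append((utterance_id, reply_to))
    return ids


def parse_utterance_id(utterance_id):
    """Split 'CNN-67148-13' into ('CNN', 'CNN-67148', 13)."""
    conversation_id, index = utterance_id.rsplit("-", 1)
    broadcaster = conversation_id.split("-")[0]
    return broadcaster, conversation_id, int(index)


print(build_utterance_ids("NPR-4", ["first", "second", "third"], ["A", "A", "B"]))
print(parse_utterance_id("CNN-67148-13"))
```

Using `rsplit("-", 1)` keeps the conversation ID intact even though it contains a hyphen itself.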





Note: Due to the format of the original dataset, **the same speaker can have several turns in a row** in an interview. For example:

In [13]:
# example utterance
print(utterance_corpus['NPR-4-0'])
print(utterance_corpus['NPR-4-1'])

Utterance(id: 'NPR-4-0', conversation_id: NPR-4, reply-to: None, speaker: Speaker(id: 'FARAI CHIDEYA, host', vectors: [], meta: {'name': 'FARAI CHIDEYA, host'}), timestamp: None, text: "This is NEWS & NOTES. I'm Farai Chideya.", vectors: [], meta: {})
Utterance(id: 'NPR-4-1', conversation_id: NPR-4, reply-to: NPR-4-0, speaker: Speaker(id: 'FARAI CHIDEYA, host', vectors: [], meta: {'name': 'FARAI CHIDEYA, host'}), timestamp: None, text: "In the nation's capital, a killer is on the loose. It's been operating in America for decades now. We're talking about AIDS. Tomorrow is World AIDS Day. Today, we'll discuss staggering new information on how prevalent AIDS is in Washington D.C., particularly among African-Americans. Overall, the rate of AIDS cases in Washington D.C. is about 10 times higher than in the United States. Dr. Shannon Hader is the director of the D.C. HIV/AIDS Administration. Welcome.", vectors: [], meta: {})


We keep this original turn structure.
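If merged turns were preferred instead, consecutive utterances by the same speaker could be joined before building the corpus. The sketch below is not applied in this notebook; `merge_consecutive_turns` is a hypothetical helper, and merging would change the utterance indices, so IDs such as 'NPR-4-1' would no longer match the original MediaSum positions.

```python
from itertools import groupby


def merge_consecutive_turns(speakers, utterances):
    """Join runs of consecutive utterances that share the same speaker string."""
    merged = []
    # groupby only groups adjacent items, which is exactly what we want here
    for speaker, run in groupby(zip(speakers, utterances), key=lambda pair: pair[0]):
        merged.append((speaker, " ".join(text for _, text in run)))
    return merged


speakers = ["FARAI CHIDEYA, host", "FARAI CHIDEYA, host", "GUEST", "FARAI CHIDEYA, host"]
utterances = ["This is NEWS & NOTES.", "Welcome.", "Thank you.", "Let's begin."]
for speaker, text in merge_consecutive_turns(speakers, utterances):
    print(f"{speaker}: {text}")
```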

In [14]:
# example utterance
utterance_corpus['CNN-67148-13']

Utterance({'obj_type': 'utterance', 'vectors': [], 'speaker_': Speaker({'obj_type': 'speaker', 'vectors': [], 'owner': None, 'id': 'CLARK', 'temp_backend': {}, 'meta': {'name': 'CLARK'}}), 'owner': None, 'id': 'CNN-67148-13', 'temp_backend': {'speaker_id': 'CLARK', 'conversation_id': 'CNN-67148', 'reply_to': 'CNN-67148-12', 'timestamp': None, 'text': "Well, I don't think -- as far as I know, we're not paying anything to Saudi Arabia, for example, right now. In fact, they're still buying weapons. They are having economic difficulties, but they do have oil. But the other countries in the region are in one way or another in financial trouble, and have been for a long time. They've been sustained on a diet of expectations of economic growth, funded by taking short and long term loans that come from commercial banks, sometimes guaranteed by governments. And then they have to repay these loans. And repaying these loans consumes their foreign exchange earnings from their exports and from remi

In [15]:
# example utterance
utterance_corpus["NPR-1-0"]

Utterance({'obj_type': 'utterance', 'vectors': [], 'speaker_': Speaker({'obj_type': 'speaker', 'vectors': [], 'owner': None, 'id': 'FARAI CHIDEYA, host', 'temp_backend': {}, 'meta': {'name': 'FARAI CHIDEYA, host'}}), 'owner': None, 'id': 'NPR-1-0', 'temp_backend': {'speaker_id': 'FARAI CHIDEYA, host', 'conversation_id': 'NPR-1', 'reply_to': None, 'timestamp': None, 'text': 'Now, moving on, Forest Whitaker as Moses, Tisha Campbell Martin as Mary Magdalene - well, that\'s all in "The Bible Experience." A New Testament edition was released in 2006. This edition is billed as "The Complete Bible." It doesn\'t have one person reading the gospels. It features nearly 400 African-American artists, actors and ministers, plus sound effects.'}, 'meta': {}})

## 3. Creating corpus from list of utterances

In [16]:
utterance_list = utterance_corpus.values()

In [17]:
media_sum_corpus = Corpus(utterances=utterance_list)

No configuration file found at /Users/Wegma003/.convokit/config.yml; writing with contents: 
# Default Backend Parameters
db_host: localhost:27017
data_directory: ~/.convokit/saved-corpora
model_directory: ~/.convokit/saved-models
default_backend: mem


In [18]:
print("number of conversations in the dataset = {}".format(len(media_sum_corpus.get_conversation_ids())))

number of conversations in the dataset = 463596


In [19]:
convo_ids = media_sum_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:5]):
    print("sample conversation {}:".format(i))
    print(media_sum_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['NPR-1-0', 'NPR-1-1', 'NPR-1-2', 'NPR-1-3', 'NPR-1-4', 'NPR-1-5', 'NPR-1-6', 'NPR-1-7', 'NPR-1-8', 'NPR-1-9', 'NPR-1-10', 'NPR-1-11', 'NPR-1-12', 'NPR-1-13', 'NPR-1-14', 'NPR-1-15', 'NPR-1-16', 'NPR-1-17', 'NPR-1-18', 'NPR-1-19', 'NPR-1-20', 'NPR-1-21', 'NPR-1-22', 'NPR-1-23', 'NPR-1-24', 'NPR-1-25', 'NPR-1-26', 'NPR-1-27', 'NPR-1-28', 'NPR-1-29', 'NPR-1-30', 'NPR-1-31', 'NPR-1-32', 'NPR-1-33', 'NPR-1-34', 'NPR-1-35', 'NPR-1-36', 'NPR-1-37', 'NPR-1-38', 'NPR-1-39', 'NPR-1-40', 'NPR-1-41', 'NPR-1-42', 'NPR-1-43', 'NPR-1-44', 'NPR-1-45', 'NPR-1-46', 'NPR-1-47']
sample conversation 1:
['NPR-2-0', 'NPR-2-1', 'NPR-2-2', 'NPR-2-3', 'NPR-2-4', 'NPR-2-5', 'NPR-2-6', 'NPR-2-7', 'NPR-2-8', 'NPR-2-9', 'NPR-2-10', 'NPR-2-11', 'NPR-2-12', 'NPR-2-13', 'NPR-2-14', 'NPR-2-15', 'NPR-2-16', 'NPR-2-17', 'NPR-2-18', 'NPR-2-19', 'NPR-2-20', 'NPR-2-21', 'NPR-2-22', 'NPR-2-23', 'NPR-2-24', 'NPR-2-25', 'NPR-2-26', 'NPR-2-27', 'NPR-2-28', 'NPR-2-29', 'NPR-2-30', 'NPR-2-31', 'NPR-2-32', 

## 4. Updating Conversation and Corpus level metadata

In [20]:
for convo in media_sum_corpus.iter_conversations():
    # get the conversation id
    convo_id = convo.get_id()

    # update meta with additional conversation information
    convo.meta.update(conversation_meta[convo_id])

In [21]:
media_sum_corpus.get_conversation("CNN-67148").meta

ConvoKitMeta({'program': 'CNN SATURDAY NIGHT', 'date': '2003-2-22', 'summary': 'How Much Will War With Iraq Cost?', 'url': 'http://transcripts.cnn.com/TRANSCRIPTS/0302/22/stn.02.html', 'title': nan, 'broadcaster': 'CNN'})

In [22]:
media_sum_corpus.get_conversation("NPR-1").meta

ConvoKitMeta({'program': 'News & Notes', 'date': '2007-11-28', 'summary': 'More than 400 black actors, artists and ministers are bringing the Gospel to life in the audio book, The Bible Experience:The Complete Bible. Farai Chideya talks with producer Kyle Bowser and actress Wendy Raquel Robinson, who lends her voice to the project.', 'url': 'https://www.npr.org/templates/story/story.php?storyId=16697288', 'title': 'Black Actors Give Bible Star Appeal', 'broadcaster': 'NPR'})

In [23]:
media_sum_corpus.get_speaker("ED LAVANDERA, CNN CORRESPONDENT")

Speaker({'obj_type': 'speaker', 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x6053886d0>, 'id': 'ED LAVANDERA, CNN CORRESPONDENT', 'meta': ConvoKitMeta({'name': 'ED LAVANDERA, CNN CORRESPONDENT'})})

In [24]:
# add name
media_sum_corpus.meta['name'] = 'MediaSum Corpus'

## 5. Adding Paraphrase annotations

Annotations are stored as lists with one entry per whitespace-separated token, i.e., they align with the result of utt.text.split().
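To make the alignment concrete, here is a minimal sketch that pairs tokens with their highlight scores. The function `highlighted_tokens` and the 0.5 threshold are assumptions for illustration; the text and scores are those shown for utterance CNN-80522-8 in this notebook.

```python
def highlighted_tokens(text, scores, threshold=0.5):
    """Return the tokens whose highlight score reaches the threshold."""
    tokens = text.split()
    # one score per whitespace-separated token is expected
    if len(tokens) != len(scores):
        raise ValueError("number of tokens and scores differ")
    return [token for token, score in zip(tokens, scores) if score >= threshold]


text = "Yes, you've been doing this for a while now."
scores = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
print(highlighted_tokens(text, scores))
```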

In [25]:
# load annotations from huggingface dataset
from datasets import load_dataset
dataset = load_dataset("AnnaWegmann/Paraphrases-in-Interviews")

In [26]:
# load into one dataframe
split_names = list(dataset.keys())
dataframes = [dataset[split].to_pandas() for split in split_names]
df = pd.concat(dataframes, ignore_index=True)  # if you just need one split: dataset['train'].to_pandas()

In [27]:
utterance = media_sum_corpus.get_utterance("CNN-177596-7")

In [28]:
utterance.text

'This is not good.'

In [29]:
unique_annotators = set(df['Annotator'])
len(unique_annotators)

112

In [30]:
# get all unique pairs or QIDs that were annotated for paraphrases
unique_qids = set(df['QID'].unique())

In [31]:
pairs = []
paraphrase_labels = []
# go over the unique QIDs
for q_id in tqdm.tqdm(unique_qids):
    group = df[df['QID'] == q_id]
    # Compute total votes and paraphrase votes
    total_votes = len(group)
    paraphrase_votes = group['Is Paraphrase'].astype(int).sum()

    meta_info = {
        'paraphrase_number_votes': int(total_votes),  # total number of annotators who rated this pair (paraphrase or not)
        'paraphrase_votes': int(paraphrase_votes),  # number of annotators who voted "paraphrase"
        'paraphrase_ratio': float(paraphrase_votes / total_votes if total_votes > 0 else 0)
    }

    # paraphrase label
    paraphrase_labels.append(meta_info['paraphrase_ratio'])
    
    # Process Guest Highlights
    guest_highlights_list = group['Guest Highlights'].apply(ast.literal_eval).tolist()
    guest_highlights_sums = [sum(x)/int(total_votes) for x in zip(*guest_highlights_list)]

    # Process Host Highlights
    host_highlights_list = group['Host Highlights'].apply(ast.literal_eval).tolist()
    host_highlights_sums = [sum(x)/int(total_votes) for x in zip(*host_highlights_list)]

    cur_utt = media_sum_corpus.get_utterance(q_id)
    utt_number = int(q_id.split("-")[2])
    # guest_speaker = cur_utt.speaker.id
    cur_id = 0  # token offset into the concatenated guest highlights
    cur_pair = [[], []]
    while cur_id < len(guest_highlights_sums):
        cur_utt_text_len = len(cur_utt.text.split())
        # this utterance covers tokens [cur_id, cur_id + cur_utt_text_len)
        end_id = cur_id + cur_utt_text_len
        cur_utt.add_meta('paraphrase_guest_highlights', guest_highlights_sums[cur_id:end_id])
        # cur_utt.add_meta('paraphrase_guest_words', cur_utt.text.split())
        cur_utt.add_meta('paraphrase_is_host', False)
        for key in meta_info.keys():
            cur_utt.add_meta(key, meta_info[key])
        for index, row in group.iterrows():
            cur_utt.add_meta("paraphrase_" + row['Annotator'], ast.literal_eval(row['Guest Highlights'])[cur_id:end_id])
        cur_pair[0].append((f"{cur_utt.conversation_id}-{utt_number}"))
        utt_number+=1
        cur_utt = media_sum_corpus.get_utterance(f"{cur_utt.conversation_id}-{utt_number}")
        cur_id = end_id
    cur_id = 0  # token offset into the concatenated host highlights
    while cur_id < len(host_highlights_sums):
        cur_utt_text_len = len(cur_utt.text.split())
        # this utterance covers tokens [cur_id, cur_id + cur_utt_text_len)
        end_id = cur_id + cur_utt_text_len
        cur_utt.add_meta("paraphrase_host_highlights", host_highlights_sums[cur_id:end_id])
        # cur_utt.add_meta('paraphrase_host_words', cur_utt.text.split())
        cur_utt.add_meta('paraphrase_is_host', True)
        for key in meta_info.keys():
            cur_utt.add_meta(key, meta_info[key])
        for index, row in group.iterrows():
            cur_utt.add_meta("paraphrase_" + row['Annotator'], ast.literal_eval(row['Host Highlights'])[cur_id:end_id])
        cur_pair[1].append((f"{cur_utt.conversation_id}-{utt_number}"))
        utt_number+=1
        cur_utt = media_sum_corpus.get_utterance(f"{cur_utt.conversation_id}-{utt_number}")
        cur_id = end_id
    pairs.append(cur_pair)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 600/600 [00:01<00:00, 330.18it/s]
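The two core steps of the cell above, averaging the annotators' binary vectors per token and cutting the result into per-utterance segments, can be sketched in isolation. The helper names `aggregate_votes` and `split_by_utterance` are hypothetical and the input data is a toy example, not taken from the dataset.

```python
def aggregate_votes(annotator_vectors):
    """Average binary highlight vectors token by token across annotators."""
    n_annotators = len(annotator_vectors)
    # zip(*...) walks the vectors column-wise, i.e. one token position at a time
    return [sum(column) / n_annotators for column in zip(*annotator_vectors)]


def split_by_utterance(scores, utterance_texts):
    """Cut a concatenated score list into one segment per utterance."""
    segments = []
    offset = 0  # running token offset into the concatenated scores
    for text in utterance_texts:
        length = len(text.split())
        segments.append(scores[offset:offset + length])
        offset += length
    return segments


# toy example: two annotators, a pair side spanning two utterances of two tokens each
votes = [[1, 0, 0, 1], [1, 1, 0, 0]]
scores = aggregate_votes(votes)
print(scores)
print(split_by_utterance(scores, ["first turn", "second turn"]))
```

Keeping a running offset matters once one side of a pair spans several utterances, as in the pair with 'NPR-26301-10' and 'NPR-26301-11'.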


In [32]:
utterance = media_sum_corpus.get_utterance("CNN-177596-7")
print(utterance)

Utterance(id: 'CNN-177596-7', conversation_id: CNN-177596, reply-to: CNN-177596-6, speaker: Speaker(id: 'JOHNS', vectors: [], meta: ConvoKitMeta({'name': 'JOHNS'})), timestamp: None, text: 'This is not good.', vectors: [], meta: ConvoKitMeta({'paraphrase_guest_highlights': [0.5, 0.45, 0.45, 0.45], 'paraphrase_is_host': False, 'paraphrase_number_votes': 20, 'paraphrase_votes': 10, 'paraphrase_ratio': 0.5, 'paraphrase_PROLIFIC_1': [0, 0, 0, 0], 'paraphrase_PROLIFIC_2': [1, 1, 1, 1], 'paraphrase_PROLIFIC_3': [0, 0, 0, 0], 'paraphrase_PROLIFIC_4': [0, 0, 0, 0], 'paraphrase_PROLIFIC_5': [0, 0, 0, 0], 'paraphrase_PROLIFIC_6': [1, 1, 1, 1], 'paraphrase_PROLIFIC_7': [1, 0, 0, 0], 'paraphrase_PROLIFIC_8': [1, 1, 1, 1], 'paraphrase_PROLIFIC_9': [0, 0, 0, 0], 'paraphrase_PROLIFIC_10': [0, 0, 0, 0], 'paraphrase_PROLIFIC_11': [1, 1, 1, 1], 'paraphrase_PROLIFIC_12': [0, 0, 0, 0], 'paraphrase_PROLIFIC_13': [1, 1, 1, 1], 'paraphrase_PROLIFIC_14': [0, 0, 0, 0], 'paraphrase_PROLIFIC_15': [0, 0, 0, 0], '

In [33]:
utterance = media_sum_corpus.get_utterance("CNN-177596-8")
utterance.meta, utterance.text

(ConvoKitMeta({'paraphrase_host_highlights': [0.45, 0.4, 0.45, 0.45, 0.45, 0.45, 0.45, 0.35, 0.35, 0.35, 0.05], 'paraphrase_is_host': True, 'paraphrase_number_votes': 20, 'paraphrase_votes': 10, 'paraphrase_ratio': 0.5, 'paraphrase_PROLIFIC_1': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'paraphrase_PROLIFIC_2': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], 'paraphrase_PROLIFIC_3': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'paraphrase_PROLIFIC_4': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'paraphrase_PROLIFIC_5': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'paraphrase_PROLIFIC_6': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0], 'paraphrase_PROLIFIC_7': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'paraphrase_PROLIFIC_8': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], 'paraphrase_PROLIFIC_9': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'paraphrase_PROLIFIC_10': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'paraphrase_PROLIFIC_11': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 'paraphrase_PROLIFIC_12': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'paraphrase_PROLIFIC_13': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [34]:
utterance = media_sum_corpus.get_utterance("CNN-80522-7")
utterance.meta, utterance.text

(ConvoKitMeta({'paraphrase_guest_highlights': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.3333333333333333, 0.3333333333333333, 0.0], 'paraphrase_is_host': False, 'paraphrase_number_votes': 3, 'paraphrase_votes': 3, 'paraphrase_ratio': 1.0, 'paraphrase_PROLIFIC_36': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0], 'paraphrase_PROLIFIC_40': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [35]:
utterance = media_sum_corpus.get_utterance("CNN-80522-8")
utterance.meta, utterance.text

(ConvoKitMeta({'paraphrase_host_highlights': [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'paraphrase_is_host': True, 'paraphrase_number_votes': 3, 'paraphrase_votes': 3, 'paraphrase_ratio': 1.0, 'paraphrase_PROLIFIC_36': [0, 1, 1, 1, 1, 1, 1, 1, 1], 'paraphrase_PROLIFIC_40': [0, 1, 1, 1, 1, 1, 1, 1, 1], 'paraphrase_PROLIFIC_53': [0, 1, 1, 1, 1, 1, 1, 1, 1]}),
 "Yes, you've been doing this for a while now.")

In [36]:
# add the pair IDs that were annotated for paraphrases to the media_sum_corpus metadata
media_sum_corpus.add_meta("paraphrase_pairs", pairs)
media_sum_corpus.add_meta("paraphrase_labels", paraphrase_labels)

In [37]:
media_sum_corpus.meta

ConvoKitMeta({'name': 'MediaSum Corpus', 'paraphrase_pairs': [[['NPR-15505-5'], ['NPR-15505-6']], [['NPR-29413-12'], ['NPR-29413-13']], [['NPR-32322-5'], ['NPR-32322-6']], [['NPR-36238-12'], ['NPR-36238-13']], [['CNN-79698-3'], ['CNN-79698-4']], [['NPR-4258-5'], ['NPR-4258-6']], [['CNN-74539-5'], ['CNN-74539-6']], [['CNN-187362-5'], ['CNN-187362-6']], [['CNN-235767-5'], ['CNN-235767-6']], [['NPR-24552-10'], ['NPR-24552-11']], [['CNN-236636-5'], ['CNN-236636-6']], [['CNN-26647-3'], ['CNN-26647-4']], [['CNN-395861-10'], ['CNN-395861-11']], [['CNN-300212-13'], ['CNN-300212-14']], [['CNN-72381-11'], ['CNN-72381-12']], [['CNN-319588-9'], ['CNN-319588-10']], [['CNN-363088-3'], ['CNN-363088-4']], [['NPR-26301-10', 'NPR-26301-11'], ['NPR-26301-12']], [['CNN-198991-3'], ['CNN-198991-4']], [['CNN-390120-3'], ['CNN-390120-4']], [['CNN-64125-11'], ['CNN-64125-12']], [['NPR-58-11'], ['NPR-58-12']], [['CNN-319869-5'], ['CNN-319869-6']], [['CNN-33404-7'], ['CNN-33404-8']], [['NPR-44959-24', 'NPR-4495

In [38]:
media_sum_corpus.meta['paraphrase_pairs'][0]

[['NPR-15505-5'], ['NPR-15505-6']]
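Each entry of paraphrase_pairs is a two-element list: the guest utterance IDs first, then the host utterance IDs, with the matching vote ratio at the same index in paraphrase_labels. A minimal sketch of filtering by that ratio follows; `select_pairs` is a hypothetical helper, and the IDs and ratios in the example are made up for illustration.

```python
def select_pairs(pairs, labels, threshold=0.5):
    """Keep (guest_ids, host_ids, ratio) for pairs whose ratio reaches the threshold."""
    selected = []
    for (guest_ids, host_ids), ratio in zip(pairs, labels):
        if ratio >= threshold:
            selected.append((guest_ids, host_ids, ratio))
    return selected


# toy data in the same shape as corpus.meta['paraphrase_pairs'] / ['paraphrase_labels']
toy_pairs = [
    [["X-1-5"], ["X-1-6"]],
    [["X-2-10", "X-2-11"], ["X-2-12"]],
    [["X-3-3"], ["X-3-4"]],
]
toy_labels = [0.2, 0.6, 0.5]
for guest_ids, host_ids, ratio in select_pairs(toy_pairs, toy_labels):
    print(guest_ids, host_ids, ratio)
```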

## Helper functions for pretty-printing annotations

In [39]:
from itertools import chain
def get_paraphrase_pair_info(corpus, pair_id):
    """Get text, paraphrase ratio, and highlighting for a paraphrase pair."""
    pairs = corpus.meta['paraphrase_pairs']
    labels = corpus.meta['paraphrase_labels']
    
    pair = pairs[pair_id]
    group1_text = " ".join([corpus.get_utterance(uid).text for uid in pair[0]])
    group2_text = " ".join([corpus.get_utterance(uid).text for uid in pair[1]])
    
    # Get highlighting from all utterances in each group
    group1_highlights = list(chain.from_iterable(corpus.get_utterance(uid).meta['paraphrase_guest_highlights'] for uid in pair[0]))
    group2_highlights = list(chain.from_iterable(corpus.get_utterance(uid).meta['paraphrase_host_highlights'] for uid in pair[1]))
        
    return {
        'pair_id': pairs[pair_id],
        'text1': group1_text,
        'text2': group2_text,
        'paraphrase_ratio': corpus.meta["paraphrase_labels"][pair_id],
        'is_paraphrase': corpus.meta["paraphrase_labels"][pair_id] >= 0.5,
        'guest_highlights': group1_highlights,
        'host_highlights': group2_highlights,
    }
def print_highlighted_pair(pair_info):
    """Print paraphrase pair with token-level highlighting -- upper casing if >= 0.5 and emphasis if >= 0.4"""
    
    def highlight_text(text, highlights):
        tokens = text.split()
        return " ".join(
            token.upper() if score >= 0.5 
            else f"\033[1m{token}\033[0m" if score >= 0.4 
            else token
            for token, score in zip(tokens, highlights)
        )
    
    print(f"=== Pair {pair_info['pair_id']} ===")
    print(f"Paraphrase ratio: {pair_info['paraphrase_ratio']:.3f} ({'PARAPHRASE' if pair_info['is_paraphrase'] else 'NOT PARAPHRASE'})")
    print(f"\nGuest:\n{highlight_text(pair_info['text1'], pair_info['guest_highlights'])}")
    print(f"\nHost:\n{highlight_text(pair_info['text2'], pair_info['host_highlights'])}\n")

In [40]:
nbr = 9
print_highlighted_pair(get_paraphrase_pair_info(media_sum_corpus, nbr))

=== Pair [['NPR-24552-10'], ['NPR-24552-11']] ===
Paraphrase ratio: 0.600 (PARAPHRASE)

Guest:
Oh, yeah. Roger Goodell has been paid $79 million in the last two years to pretend to be a person running a philanthropy that serves the public interest. Obviously, he's not. The only reason is that IRS regulations attached to nonprofit status require this. And he's not the only one in NFL Headquarters with a million dollars-plus salary, he's just the most prominent one. The reason the league is **giving** **up** **its** **tax** **exemption** is so that they can stop **disclosing** **the** **amounts** **of** **money** **that** **Goodell** **and** **the** **other** **top** **officials** in the league make.

Host:
Are there any areas apart from **disclosure** **of** **salaries** **of** **their** **top** **executives** - any areas of the NFL's activities or any of its alleged defaults that would be ch

## 6. Saving the created corpus

In [41]:
media_sum_corpus.dump("mediasum-corpus")