In [1]:
from convokit import Corpus, Utterance, Speaker

## Merging two corpora

Let's look at the characteristics of our input corpora before the merge. Apart from the summary statistics, notice that Speaker 'foxtrot' appears in both corpora with conflicting Speaker metadata: the key 'yellow' maps to 'food' in corpus1 and to 'mood' in corpus2.

The conversation_id field of each Utterance indicates which Conversation it belongs to. In this case, while there are 2 conversations in each corpus, 1 conversation (with conversation_id "2") appears in both corpora, so there are only 3 conversations in total.
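The counting argument above can be checked with plain sets. This is only an illustrative sketch that uses the conversation ids from the two corpora defined below; it does not call ConvoKit, and the variable names are made up for the example.

```python
# Conversation ids as they appear in the two example corpora
corpus1_conversation_ids = {"0", "2"}
corpus2_conversation_ids = {"3", "2"}

# Conversations present in both corpora are counted only once after a merge
shared = corpus1_conversation_ids & corpus2_conversation_ids
merged = corpus1_conversation_ids | corpus2_conversation_ids

print(sorted(shared))  # ['2']
print(len(merged))     # 3
```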

### Corpus 1

In [2]:
corpus1 = Corpus(utterances = [
            Utterance(id="0", conversation_id="0", text="hello world", speaker=Speaker(id="alice")),
            Utterance(id="1", conversation_id="0", reply_to="0", text="my name is bob", speaker=Speaker(id="bob")),
            Utterance(id="2", conversation_id="2", text="this is a sentence", speaker=Speaker(id="foxtrot", meta={"yellow": "food"})),
        ])

In [3]:
corpus1.print_summary_stats()

Number of Speakers: 3
Number of Utterances: 3
Number of Conversations: 2


### Corpus 2

In [4]:
corpus2 = Corpus(utterances = [
            Utterance(id="3", conversation_id="3", text="i like pie", speaker=Speaker(id="charlie", meta={"what": "a mood", "hey": "food"})),
            Utterance(id='4', conversation_id='3', reply_to='3', text="sentence galore", speaker=Speaker(id="echo")),
            Utterance(id='2', conversation_id='2', text="this is a sentence", speaker=Speaker(id="foxtrot", meta={"yellow": "mood", "hello": "world"})),
        ])

In [5]:
corpus2.print_summary_stats()

Number of Speakers: 3
Number of Utterances: 3
Number of Conversations: 2


Let's attempt a merge:

In [6]:
corpus3 = corpus1.merge(corpus2)



In [7]:
corpus3.print_summary_stats()

Number of Speakers: 5
Number of Utterances: 5
Number of Conversations: 3


### Merging Speaker metadata

Because Speaker 'foxtrot' had conflicting metadata, the Speaker metadata from the later corpus (corpus2) takes precedence for the conflicting key 'yellow'. We verify this below. Note too that the other key-value pair from corpus2 ('hello': 'world') has been added to the metadata as well.

In [8]:
corpus3.get_speaker('foxtrot').meta

{'yellow': 'mood', 'hello': 'world'}
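The result above behaves like a dictionary update in which the second corpus wins on clashing keys. The following is a minimal sketch of that precedence rule in plain Python. `merge_meta` is a hypothetical helper written for this illustration, not a ConvoKit function, and ConvoKit's internal implementation may differ.

```python
def merge_meta(first: dict, second: dict) -> dict:
    """Combine two metadata dicts; on a key clash the second value wins."""
    merged = dict(first)   # copy so the inputs are left untouched
    merged.update(second)  # values from the second dict overwrite clashes
    return merged

foxtrot_corpus1 = {"yellow": "food"}
foxtrot_corpus2 = {"yellow": "mood", "hello": "world"}

foxtrot_merged = merge_meta(foxtrot_corpus1, foxtrot_corpus2)
print(foxtrot_merged)  # {'yellow': 'mood', 'hello': 'world'}
```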

Speakers were not initialized with their lists of corresponding utterances / conversations. Corpus has a method for updating these Speaker lists.

In [9]:
print(list(corpus3.iter_speakers()))
speaker_echo = corpus3.get_speaker('echo')
print()
speaker_echo.print_speaker_stats()

[Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f16d0>, 'id': 'alice'}), Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f16d0>, 'id': 'bob'}), Speaker({'obj_type': 'speaker', 'meta': {'yellow': 'mood', 'hello': 'world'}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f16d0>, 'id': 'foxtrot'}), Speaker({'obj_type': 'speaker', 'meta': {'what': 'a mood', 'hey': 'food'}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f16d0>, 'id': 'charlie'}), Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f16d0>, 'id': 'echo'})]

Number of Utterances: 1
Number of Conversations: 1


### Merging Utterance and Corpus metadata 

We quickly demonstrate the Utterance and Corpus metadata merging functionality. This is all handled in the merge() function as well; we just make its effects explicit here. In addition, we construct the corpora with conflicting data/metadata so that the warning behavior is visible.

(Note that if Utterances have the same id but different data, the Utterance from the other Corpus is ignored and a warning is printed, though the Speaker metadata is still kept.)

### Corpus 4

In [10]:
corpus4 = Corpus(utterances = [
            Utterance(id='0', conversation_id='0', text="hello world", speaker=Speaker(id="alice"), meta={'in': 'wonderland'}),
            Utterance(id='1', conversation_id='0', reply_to='0', text="my name is bob", speaker=Speaker(id="bob"), meta={'fu': 'bu'})
        ])
corpus4.add_meta('AB', 1)
corpus4.add_meta('CD', 2)


In [11]:
corpus4.print_summary_stats()

Number of Speakers: 2
Number of Utterances: 2
Number of Conversations: 1


### Corpus 5

In [12]:
corpus5 = Corpus(utterances = [
            Utterance(id='0', conversation_id='0', text="hello world", speaker=Speaker(id="alice"), meta={'in': 'the hat'}),
            Utterance(id='1', conversation_id='0', reply_to='0', text="my name is bobbb", speaker=Speaker(id="bob"), meta={'barrel': 'roll'})
        ])
corpus5.add_meta('AB', 3)
corpus5.add_meta('EF', 3)

In [13]:
corpus5.print_summary_stats()

Number of Speakers: 2
Number of Utterances: 2
Number of Conversations: 1


In [14]:
corpus6 = corpus4.merge(corpus5)

Utterance(id: '1', conversation_id: 0, reply-to: 0, speaker: Speaker(id: bob, vectors: [], meta: {}), timestamp: None, text: 'my name is bob', vectors: [], meta: {'fu': 'bu'})
Utterance(id: '1', conversation_id: 0, reply-to: 0, speaker: Speaker(id: bob, vectors: [], meta: {}), timestamp: None, text: 'my name is bobbb', vectors: [], meta: {'barrel': 'roll'})
Ignoring second corpus's utterance.


In [15]:
corpus6.print_summary_stats()

Number of Speakers: 2
Number of Utterances: 2
Number of Conversations: 1


In [16]:
corpus6.meta

{'AB': 3, 'CD': 2, 'EF': 3}

In [17]:
corpus6.get_utterance('1')

Utterance({'obj_type': 'utterance', 'meta': {'fu': 'bu'}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f1b50>, 'id': 'bob'}), 'conversation_id': '0', 'reply_to': '0', 'timestamp': None, 'text': 'my name is bob', 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f1b50>, 'id': '1'})

In [18]:
corpus6.get_utterance('0')

Utterance({'obj_type': 'utterance', 'meta': {'in': 'the hat'}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f1b50>, 'id': 'alice'}), 'conversation_id': '0', 'reply_to': None, 'timestamp': None, 'text': 'hello world', 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f1b50>, 'id': '0'})
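The two lookups above are consistent with the following rule: when Utterances share an id but differ in their data, the first corpus's Utterance is kept unchanged; when they share an id and their data matches, their metadata is combined with the second corpus taking precedence. The sketch below models that rule on plain dicts. It is inferred from the outputs shown here rather than taken from ConvoKit's source, `merge_utterances` is a hypothetical name, and "data" is simplified to the text field only.

```python
def merge_utterances(first: dict, second: dict) -> dict:
    """Merge two {utterance_id: {"text": str, "meta": dict}} mappings."""
    # Copy the first mapping so the inputs are not modified
    merged = {uid: {"text": u["text"], "meta": dict(u["meta"])}
              for uid, u in first.items()}
    for uid, utt in second.items():
        if uid not in merged:
            # New utterance: take it as-is
            merged[uid] = {"text": utt["text"], "meta": dict(utt["meta"])}
        elif merged[uid]["text"] != utt["text"]:
            # Same id, different data: keep the first and report the clash
            print(f"Utterance {uid!r} differs; ignoring second corpus's utterance.")
        else:
            # Same id, same data: second corpus's metadata wins on clashes
            merged[uid]["meta"].update(utt["meta"])
    return merged

corpus4_utts = {
    "0": {"text": "hello world", "meta": {"in": "wonderland"}},
    "1": {"text": "my name is bob", "meta": {"fu": "bu"}},
}
corpus5_utts = {
    "0": {"text": "hello world", "meta": {"in": "the hat"}},
    "1": {"text": "my name is bobbb", "meta": {"barrel": "roll"}},
}

merged_utts = merge_utterances(corpus4_utts, corpus5_utts)
print(merged_utts["0"]["meta"])  # {'in': 'the hat'}
print(merged_utts["1"])          # {'text': 'my name is bob', 'meta': {'fu': 'bu'}}
```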

For the most part, however, as long as the data is well behaved (e.g. Speaker/Utterance/Conversation/Corpus objects do not have different values for the same metadata key, and Utterances with the same id have the same data), one should expect to see no warnings when using merge().
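One way to check the "well behaved" condition for metadata before merging is to look for keys that occur on both sides with different values. The helper below is a small sketch on plain dicts. `conflicting_keys` is a hypothetical name and is not part of ConvoKit.

```python
def conflicting_keys(first: dict, second: dict) -> list:
    """Return the keys present in both dicts whose values differ."""
    return sorted(k for k in first.keys() & second.keys()
                  if first[k] != second[k])

# Corpus-level metadata of corpus4 and corpus5 clash on 'AB'
print(conflicting_keys({"AB": 1, "CD": 2}, {"AB": 3, "EF": 3}))  # ['AB']

# Disjoint keys never conflict
print(conflicting_keys({"yellow": "food"}, {"hello": "world"}))  # []
```

An empty list means the two metadata dicts can be combined without either side overriding the other.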

In [19]:
list(list(corpus6.iter_conversations())[0].iter_utterances())

[Utterance({'obj_type': 'utterance', 'meta': {'in': 'the hat'}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f1b50>, 'id': 'alice'}), 'conversation_id': '0', 'reply_to': None, 'timestamp': None, 'text': 'hello world', 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f1b50>, 'id': '0'}),
 Utterance({'obj_type': 'utterance', 'meta': {'fu': 'bu'}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f1b50>, 'id': 'bob'}), 'conversation_id': '0', 'reply_to': '0', 'timestamp': None, 'text': 'my name is bob', 'owner': <convokit.model.corpus.Corpus object at 0x7fbb118f1b50>, 'id': '1'})]

In [None]:
corpus6.dump('temp-corpus', './')