## Converting the Cornell Movie-Dialogs Corpus into ConvoKit format 

This notebook demonstrates how a custom dataset can be converted into a ConvoKit Corpus.

In [66]:
from tqdm import tqdm
from convokit import Corpus, User, Utterance

### The Cornell Movie-Dialogs Corpus

The original version of the Cornell Movie-Dialogs Corpus can be downloaded from:  https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. It contains the following files:

* __movie_characters_metadata.txt__ contains information about each movie character
* __movie_lines.txt__ contains the actual text of each utterance
* __movie_conversations.txt__ contains the structure of the conversations
* __movie_titles_metadata.txt__ contains information about each movie title

### Constructing the Corpus from a list of Utterances 

A Corpus can be constructed from a list of utterances with:

    corpus = Corpus(utterances=custom_utterance_list)

Our goal is to convert the original dataset into this "custom_utterance_list", and let ConvoKit do the rest of the conversion for us.

#### Creating users

Each character in a movie is considered a user, and there are 9,035 characters in total in this dataset. We will read off metadata for each user from __movie_characters_metadata.txt__. 

In general, we would directly use the name of the user as the username. However, since only the first name of each movie character is given, these names may not uniquely identify a character. We will instead use the user_id provided in the original dataset as the username, and save the actual character name in the user metadata.

For each user, the metadata includes the following information:

* name of the character
* idx and name of the movie this character is from
* gender (available for 3,774 characters)
* position in the movie credits (available for 3,321 characters)
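
To make the field layout concrete, here is a minimal sketch that parses one line of __movie_characters_metadata.txt__ into the metadata fields listed above. The helper names `parse_character_line` and `has_value` are hypothetical, and treating `?` as the marker for a missing gender or credit position is an assumption about the raw file, so check it against your copy of the data.

```python
SEPARATOR = "+++$+++"
FIELDS = ["character_name", "movie_idx", "movie_name", "gender", "credit_pos"]


def parse_character_line(line):
    """Split one raw line into (user_id, metadata dict)."""
    parts = [part.strip() for part in line.split(SEPARATOR)]
    if len(parts) != len(FIELDS) + 1:
        raise ValueError(
            "expected {} fields, got {}".format(len(FIELDS) + 1, len(parts))
        )
    user_id, values = parts[0], parts[1:]
    return user_id, dict(zip(FIELDS, values))


def has_value(meta, key, missing="?"):
    """Assumption: the raw file marks unknown values with '?'."""
    return meta[key] != missing


# Sample line laid out like the raw file (values taken from the u0 record)
sample = "u0 +++$+++ BIANCA +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ f +++$+++ 4\n"
user_id, meta = parse_character_line(sample)
```

Checking the field count up front makes a malformed line fail loudly instead of silently shifting values into the wrong keys.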

In [20]:
# replace the directory with where your downloaded cornell movie dialogs corpus is saved
data_dir = "../../data_collection/cornell_movie_dialogs_corpus/"

In [21]:
with open(data_dir + "movie_characters_metadata.txt", "r", encoding='utf-8', errors='ignore') as f:
    user_data = f.readlines()

In [22]:
user_meta = {}
for user in user_data:
    user_info = [info.strip() for info in user.split("+++$+++")]
    user_meta[user_info[0]] = {"character_name": user_info[1],
                               "movie_idx": user_info[2],
                               "movie_name": user_info[3],
                               "gender": user_info[4],
                               "credit_pos": user_info[5]}

We will now create a User object for each unique character in the dataset, which will be used to create Utterance objects later.

In [23]:
corpus_users = {k: User(name = k, meta = v) for k,v in user_meta.items()}

Sanity checking user-level data:

In [24]:
print("number of users in the data = {0}".format(len(corpus_users)))

number of users in the data = 9035


In [25]:
corpus_users['u0'].meta

{'character_name': 'BIANCA',
 'movie_idx': 'm0',
 'movie_name': '10 things i hate about you',
 'gender': 'f',
 'credit_pos': '4'}

#### Creating utterance objects
Utterances can be found in __movie_lines.txt__. There are 304,713 utterances in total. 

An utterance object normally expects at least:
- id: the unique id of the utterance. 
- user: the user giving the utterance.
- root: the id of the root utterance of the conversation.
- reply_to: id of the utterance this was a reply to.
- timestamp: timestamp of the utterance. 
- text: text of the utterance.

Additional information associated with an utterance (in this case, the movie it comes from) may be saved as utterance-level metadata.
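
As a rough illustration of how one raw line maps onto these fields, the sketch below uses a plain-Python stand-in. `UtteranceRecord` and `parse_movie_line` are hypothetical names (this is not ConvoKit's `Utterance` class), and the character name in the sample line is included only to show the layout.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UtteranceRecord:
    """Hypothetical stand-in mirroring the fields listed above."""
    id: str
    user: str
    text: str
    root: Optional[str] = None       # filled in once conversations are known
    reply_to: Optional[str] = None   # filled in once conversations are known
    timestamp: Optional[int] = None  # not available in this dataset
    meta: dict = field(default_factory=dict)


def parse_movie_line(line, separator="+++$+++"):
    """Parse one line: line id, user id, movie id, character name, text."""
    line_id, user_id, movie_id, _character_name, text = [
        part.strip() for part in line.split(separator)
    ]
    return UtteranceRecord(
        id=line_id, user=user_id, text=text, meta={"movie_id": movie_id}
    )


# Illustrative sample line in the layout of movie_lines.txt
sample = "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n"
record = parse_movie_line(sample)
```

The root and reply_to fields start out empty because they can only be filled in once the conversation structure is known.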

In [26]:
with open(data_dir + "movie_lines.txt", "r", encoding='utf-8', errors='ignore') as f:
    utterance_data = f.readlines()

In [134]:
utterance_corpus = {}

for utterance in tqdm(utterance_data):

    utterance_info = [info.strip() for info in utterance.split("+++$+++")]

    # fields: line id, user id, movie id, character name, text
    # ignoring character name since the User object already has this information
    idx, user, movie_id, text = utterance_info[0], utterance_info[1], utterance_info[2], utterance_info[4]

    meta = {'movie_id': movie_id}

    # root & reply_to will be updated later, timestamp is not applicable
    utterance_corpus[idx] = Utterance(idx, corpus_users[user], None, None, None, text, meta=meta)

100%|██████████| 304713/304713 [00:07<00:00, 41637.13it/s]


In [135]:
len(utterance_corpus)

304713

Sanity checking the status of the utterance objects: they should now contain an id, the user who said them, the actual text, and the movie id as metadata:

In [88]:
utterance_corpus['L1044'] 

Utterance({'id': 'L1044', 'user': User([('name', 'u2')]), 'root': None, 'reply_to': None, 'timestamp': None, 'text': 'They do to!', 'meta': {'movie_id': 'm0'}})

#### Adding root and reply_to information to utterances
__movie_conversations.txt__ provides the structure of conversations that organizes the above utterances. This will allow us to add the missing root and reply_to information to individual utterances. 

In [89]:
with open(data_dir + "movie_conversations.txt", "r", encoding='utf-8', errors='ignore') as f:
    convo_data = f.readlines()

In [90]:
import ast

In [136]:
for convo_line in tqdm(convo_data):

    user1, user2, m, convo = [info.strip() for info in convo_line.split("+++$+++")]

    convo_seq = ast.literal_eval(convo)
    
    # update utterance
    root = convo_seq[0]
    
    # convo_seq is a list of utterances ids, arranged in conversational order
    for i, line in enumerate(convo_seq):
        
        # sanity checking: user giving the utterance is indeed in the pair of characters provided
        if utterance_corpus[line].user.name not in [user1, user2]:
            print("user mismatch in utterance {0}".format(line))
        
        utterance_corpus[line].root = root
        
        if i == 0:
            utterance_corpus[line].reply_to = None
        else:
            utterance_corpus[line].reply_to = convo_seq[i-1]

100%|██████████| 83097/83097 [00:02<00:00, 28463.86it/s]


Sanity checking the status of the utterances: after updating root and reply_to information, they should now contain all mandatory fields:

In [92]:
utterance_corpus['L666499']

Utterance({'id': 'L666499', 'user': User([('name', 'u9028')]), 'root': 'L666497', 'reply_to': 'L666498', 'timestamp': None, 'text': 'How quickly can you move your artillery forward?', 'meta': {'movie_id': 'm616'}})
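
Inspecting a single utterance only goes so far. The sketch below shows one possible way to check the links across all utterances: every utterance should have a root, and following its reply_to chain should end at that root. The function name `check_links` and the sample ids are hypothetical, and the input is a plain dictionary rather than ConvoKit objects.

```python
def check_links(links):
    """Return a list of (utt_id, problem) for {utt_id: (root, reply_to)}."""
    problems = []
    for utt_id, (root, _reply_to) in links.items():
        if root is None:
            problems.append((utt_id, "missing root"))
            continue
        current, seen, status = utt_id, set(), None
        while True:
            if current not in links:
                status = "reply_to points to unknown id"
                break
            if current in seen:
                status = "cycle in reply_to chain"
                break
            seen.add(current)
            parent = links[current][1]
            if parent is None:
                # reached the start of the chain; it should be the root
                if current != root:
                    status = "chain does not end at root"
                break
            current = parent
        if status is not None:
            problems.append((utt_id, status))
    return problems


# Hypothetical ids: one three-utterance conversation plus one unlinked utterance
links = {
    "L1": ("L1", None),
    "L2": ("L1", "L1"),
    "L3": ("L1", "L2"),
    "L9": (None, None),
}
problems = check_links(links)
```

With the objects built in this notebook, the input could be assembled as `{k: (u.root, u.reply_to) for k, u in utterance_corpus.items()}`.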

#### Creating corpus from list of utterances
We are now ready to create the movie-corpus. Note that we can specify a version number for a corpus, making it easier for us to keep track of which corpus we are working with.  

In [137]:
utterance_list = list(utterance_corpus.values())

In [138]:
movie_corpus = Corpus(utterances=utterance_list, version=1)

ConvoKit will automatically create conversations for us based on the utterance information we provide.

In [57]:
print("number of conversations in the dataset = {}".format(len(movie_corpus.get_conversation_ids())))

number of conversations in the dataset = 83097


In [67]:
convo_ids = movie_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:5]):
    print("sample conversation {}:".format(i))
    print(movie_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['L1045', 'L1044']
sample conversation 1:
['L985', 'L984']
sample conversation 2:
['L925', 'L924']
sample conversation 3:
['L872', 'L871', 'L870']
sample conversation 4:
['L869', 'L868', 'L867', 'L866']
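
Conceptually, utterances that share a root belong to the same conversation. The sketch below illustrates that grouping in plain Python. It is not ConvoKit's actual implementation, and the helper name `group_by_root` and the sample ids are made up for illustration.

```python
from collections import defaultdict


def group_by_root(utterances):
    """Group (utt_id, root) pairs into {root: [utt_id, ...]}, keeping input order."""
    conversations = defaultdict(list)
    for utt_id, root in utterances:
        conversations[root].append(utt_id)
    return dict(conversations)


# Hypothetical ids: two conversations rooted at L1 and L3
pairs = [("L1", "L1"), ("L2", "L1"), ("L3", "L3"), ("L4", "L3"), ("L5", "L3")]
conversations = group_by_root(pairs)
```

Under this view, the number of conversations equals the number of distinct roots, which is why the count above matches the number of lines in __movie_conversations.txt__.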


#### Adding parses for utterances
We can also "annotate" the utterances, e.g., getting dependency parses for them, and save the parsed versions as utterance-level metadata. Here is an example of how this can be done: 

In [71]:
from convokit import Parser

In [48]:
annotator = Parser()

In [49]:
movie_corpus = annotator.fit_transform(movie_corpus)

#### Updating Corpus-level metadata
This dataset also includes additional information about the 617 movies from which these conversations are drawn, such as genres, release year, and the URL from which the raw script was retrieved. These may be saved as Corpus-level metadata.

Adding URL information:

In [72]:
with open(data_dir + "raw_script_urls.txt", "r", encoding='utf-8', errors='ignore') as f:
    urls = f.readlines()

In [73]:
movie_meta = {}
for movie in urls:
    movie_id, title, url = [info.strip() for info in movie.split("+++$+++")]
    movie_meta[movie_id] = {'title': title, "url": url}

In [74]:
len(movie_meta)

617

Adding more movie metadata from __movie_titles_metadata.txt__:

In [79]:
with open(data_dir + "movie_titles_metadata.txt", "r", encoding='utf-8', errors='ignore') as f:
    movie_extra = f.readlines()

In [80]:
for movie in movie_extra:
    movie_id, title, year, rating, votes, genre  = [info.strip() for info in movie.split("+++$+++")]
    movie_meta[movie_id]['release_year'] = year
    movie_meta[movie_id]['rating'] = rating
    movie_meta[movie_id]['votes'] = votes
    movie_meta[movie_id]['genre'] = genre

Sanity checking for a random movie in the dataset:

In [81]:
movie_meta['m23']

{'title': 'the avengers',
 'url': 'http://www.dailyscript.com/scripts/Avengers.html',
 'release_year': '1998',
 'rating': '3.40',
 'votes': '21519',
 'genre': "['action', 'adventure', 'thriller']"}
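
Note that every value is stored as a string, including the genre list. If typed values are more convenient downstream, they could be converted along the lines of the sketch below. The helper name `coerce_movie_meta` is hypothetical, and leaving release_year as a string is a deliberate, cautious assumption in case some entries are not plain integers.

```python
import ast


def coerce_movie_meta(raw):
    """Return a copy of one movie's metadata with typed rating, votes and genre."""
    typed = dict(raw)
    typed["rating"] = float(raw["rating"])
    typed["votes"] = int(raw["votes"])
    # the genre field is a string holding a Python list literal
    typed["genre"] = ast.literal_eval(raw["genre"])
    return typed


# Values copied from the m23 record shown above
raw = {
    "title": "the avengers",
    "url": "http://www.dailyscript.com/scripts/Avengers.html",
    "release_year": "1998",
    "rating": "3.40",
    "votes": "21519",
    "genre": "['action', 'adventure', 'thriller']",
}
typed = coerce_movie_meta(raw)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for parsing the genre string.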

In [82]:
movie_corpus.meta['movie_metadata'] = movie_meta

Optionally, we can also add the original name of the dataset:

In [83]:
movie_corpus.meta['name'] = "Cornell Movie-Dialogs Corpus"

#### Saving created datasets
As the final step of the conversion, we save the dataset so that it can be loaded later for reuse. You may want to specify a name. By default, saved datasets are stored in __.convokit/saved-corpora__ in your home directory, but you can also specify where you want the saved corpora to be.

In [145]:
# movie_corpus.dump("movie-corpus", base_path = <specify where you prefer to save it to>)
# the following would save the Corpus to the default location
movie_corpus.dump("movie-corpus")

After saving, the information available in the dataset can be checked directly, without loading it:

In [86]:
from convokit import meta_index

In [88]:
meta_index(filename = "movie-corpus")

{'utterances-index': {'movie_id': "<class 'str'>", 'parsed': 'bin'},
 'users-index': {'character_name': "<class 'str'>",
  'movie_idx': "<class 'str'>",
  'movie_name': "<class 'str'>",
  'gender': "<class 'str'>",
  'credit_pos': "<class 'str'>"},
 'conversations-index': {},
 'overall-index': {'movie_metadata': "<class 'dict'>",
  'name': "<class 'str'>"},
 'version': 1}

### Other ways of conversion

The above method is only one way to convert the dataset. Alternatively, one may strictly follow the specifications of the expected data format described [here](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/doc/source/data_format.rst) and write out the component files directly.