## Converting the Cornell Movie-Dialogs Corpus into ConvoKit format 

This notebook demonstrates how a custom dataset can be converted into a ConvoKit Corpus.

The original version of the Cornell Movie-Dialogs Corpus can be downloaded from https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html. Among others, it contains the following files:

* __movie_characters_metadata.txt__ contains information about each movie character
* __movie_lines.txt__ contains the actual text of each utterance
* __movie_conversations.txt__ contains the structure of the conversations
* __movie_titles_metadata.txt__ contains information about each movie title

In [2]:
from tqdm import tqdm
from convokit import Corpus, User, Utterance


### Constructing the Corpus from a list of Utterances 

A Corpus can be constructed from a list of utterances with:

    corpus = Corpus(utterances=custom_utterance_list)

Our goal is to convert the original dataset into this "custom_utterance_list" and let ConvoKit do the rest of the conversion for us.

#### 1. Creating users

Each character in a movie is considered a user, and there are 9,035 characters in total in this dataset. We will read off metadata for each user from __movie_characters_metadata.txt__. 

In [3]:
# replace with the directory where your downloaded Cornell Movie-Dialogs Corpus is saved
data_dir = "cornell-movie-dialogs-corpus/"

In [4]:
with open(data_dir + "movie_characters_metadata.txt", "r", encoding='utf-8', errors='ignore') as f:
    user_data = f.readlines()

In general, we would use the user's actual name directly as the User name. However, in this dataset only the first name of each movie character is given, so names may not uniquely identify a character. We will instead use the user_id provided in the original dataset as the username, and save the actual character name in the user metadata.

For this dataset, we include the following information for each user:  
* name of the character
* idx and name of the movie this character is from
* gender (available for 3,774 characters)
* position on movie credits (available for 3,321 characters)

In [5]:
user_meta = {}
for user in user_data:
    user_info = [info.strip() for info in user.split("+++$+++")]
    user_meta[user_info[0]] = {"character_name": user_info[1],
                               "movie_idx": user_info[2],
                               "movie_name": user_info[3],
                               "gender": user_info[4],
                               "credit_pos": user_info[5]}

In general, a User object can be instantiated with `User(name = <user_name>, meta = <user_metadata>)`. The following example shows how we create a User object for each unique character in the dataset; these will be used to create Utterance objects later. 

In [6]:
corpus_users = {k: User(name = k, meta = v) for k,v in user_meta.items()}
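The field-splitting step used to build `user_meta` above recurs for every file in this dataset, so it could be factored into a small helper. The following is a minimal sketch, not part of ConvoKit: `parse_record` and the field names in `CHARACTER_FIELDS` are hypothetical names chosen for illustration, and the sample line is reconstructed from the metadata shown for user `u0` in this notebook.

```python
# Field separator used throughout the raw corpus files.
SEPARATOR = "+++$+++"

# Hypothetical field names for movie_characters_metadata.txt (illustration only).
CHARACTER_FIELDS = ["user_id", "character_name", "movie_idx",
                    "movie_name", "gender", "credit_pos"]


def parse_record(line, field_names):
    """Split one raw line into stripped fields and pair them with field names.

    Raises ValueError when the number of fields does not match, so that
    malformed lines are noticed instead of being silently mis-parsed.
    """
    fields = [field.strip() for field in line.split(SEPARATOR)]
    if len(fields) != len(field_names):
        raise ValueError("expected {} fields, got {}: {!r}".format(
            len(field_names), len(fields), line))
    return dict(zip(field_names, fields))


# Sample line reconstructed from the metadata shown for user u0.
sample_line = "u0 +++$+++ BIANCA +++$+++ m0 +++$+++ 10 things i hate about you +++$+++ f +++$+++ 4\n"
record = parse_record(sample_line, CHARACTER_FIELDS)
print(record["user_id"], record["character_name"], record["gender"])
```

Raising on a field-count mismatch is one possible design choice; printing and skipping the line would be an equally reasonable alternative for exploratory work.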

Sanity checking user-level data:

In [7]:
print("number of users in the data = {}".format(len(corpus_users)))

number of users in the data = 9035


In [8]:
corpus_users['u0'].meta

{'character_name': 'BIANCA',
 'movie_idx': 'm0',
 'movie_name': '10 things i hate about you',
 'gender': 'f',
 'credit_pos': '4'}
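A further sanity check could compare the number of characters with known gender or credit position against the counts quoted earlier. The sketch below assumes that `"?"` marks a missing value (it appears in that role in the metadata of another character printed later in this notebook). `count_known` is a hypothetical helper, and the three sample records are copied from outputs shown in this notebook rather than read from the file.

```python
# Assumption: "?" marks a missing value in the raw metadata.
MISSING = "?"

# Three records copied from outputs shown in this notebook.
sample_user_meta = {
    "u0": {"character_name": "BIANCA", "gender": "f", "credit_pos": "4"},
    "u2": {"character_name": "CAMERON", "gender": "m", "credit_pos": "3"},
    "u9028": {"character_name": "COGHILL", "gender": "?", "credit_pos": "?"},
}


def count_known(meta_by_user, field):
    """Count users whose value for `field` is not the missing-value marker."""
    return sum(1 for meta in meta_by_user.values() if meta[field] != MISSING)


print("gender known for {} of {} users".format(
    count_known(sample_user_meta, "gender"), len(sample_user_meta)))
```

Applied to the full `user_meta` dictionary, the same function would give counts that can be checked against the figures listed for gender and credit position.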

#### 2. Creating utterance objects
Utterances can be found in __movie_lines.txt__. There are 304,713 utterances in total. 

In [9]:
with open(data_dir + "movie_lines.txt", "r", encoding='utf-8', errors='ignore') as f:
    utterance_data = f.readlines()

To instantiate an utterance object, we generally need the following information (all ids should be of type string):
- id: representing the unique id of the utterance. 
- user: a ConvoKit User object representing the user giving the utterance.
- root: the id of the root utterance of the conversation.
- reply_to: id of the utterance this was a reply to.
- timestamp: timestamp of the utterance. 
- text: text of the utterance.

Additional information associated with the utterance may be saved as utterance-level metadata. Here, as an example of such metadata, we save the id of the movie the utterance comes from (movie_id). 

An utterance possessing all the above information may be instantiated with `Utterance(id=..., user =..., root =..., reply_to=..., timestamp=..., text =..., meta =...)`

We now create such Utterance objects for the utterances in our dataset. Note that normally we would provide `root` and `reply_to` at the time of instantiation, but we defer this step because that information needs to be retrieved from a different file. 

In [10]:
utterance_corpus = {}

for utterance in tqdm(utterance_data):

    utterance_info = [info.strip() for info in utterance.split("+++$+++")]

    # report and skip malformed lines that lack the expected five fields
    if len(utterance_info) < 5:
        print(utterance_info)
        continue

    # ignoring character name (index 3) since the User object already has this information
    idx, user, movie_id, text = utterance_info[0], utterance_info[1], utterance_info[2], utterance_info[4]

    meta = {'movie_id': movie_id}

    # root & reply_to will be updated later, timestamp is not applicable
    utterance_corpus[idx] = Utterance(id=idx, user=corpus_users[user], text=text, meta=meta)

100%|██████████| 304713/304713 [00:03<00:00, 90094.76it/s] 


In [11]:
len(utterance_corpus)

304713

Sanity checking the utterance objects: each should now contain an id, the user who said it, the actual text, and the movie id as metadata: 

In [12]:
utterance_corpus['L1044'] 

Utterance({'obj_type': 'utterance', '_owner': None, 'meta': {'movie_id': 'm0'}, '_id': 'L1044', 'user': User({'obj_type': 'user', '_owner': None, 'meta': {'character_name': 'CAMERON', 'movie_idx': 'm0', 'movie_name': '10 things i hate about you', 'gender': 'm', 'credit_pos': '3'}, '_id': 'u2', '_name': 'u2'}), 'root': None, 'reply_to': None, 'timestamp': None, 'text': 'They do to!'})

#### Updating root and reply_to information to utterances
__movie_conversations.txt__ provides the structure of conversations that organizes the above utterances. This will allow us to add the missing root and reply_to information to individual utterances. 

In [13]:
with open(data_dir + "movie_conversations.txt", "r", encoding='utf-8', errors='ignore') as f:
    convo_data = f.readlines()

In [14]:
import ast

In [15]:
for convo_line in tqdm(convo_data):

    user1, user2, m, convo = [info.strip() for info in convo_line.split("+++$+++")]

    convo_seq = ast.literal_eval(convo)
    
    # update utterance
    root = convo_seq[0]
    
    # convo_seq is a list of utterances ids, arranged in conversational order
    for i, line in enumerate(convo_seq):
        
        # sanity checking: user giving the utterance is indeed in the pair of characters provided
        if utterance_corpus[line].user.name not in [user1, user2]:
            print("user mismatch in line {0}".format(i))
        
        utterance_corpus[line].root = root
        
        if i == 0:
            utterance_corpus[line].reply_to = None
        else:
            utterance_corpus[line].reply_to = convo_seq[i-1]

100%|██████████| 83097/83097 [00:02<00:00, 33615.30it/s]
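The linking logic in the loop above can be isolated and checked on a single conversation. The following sketch uses plain dictionaries instead of ConvoKit objects. `link_conversation` is a hypothetical helper, and the three-utterance sequence is illustrative, chosen to be consistent with the utterance `L666499` inspected in this notebook.

```python
import ast


def link_conversation(convo_field):
    """Return {utterance_id: (root, reply_to)} for one conversation.

    `convo_field` is the last field of a movie_conversations.txt line:
    a string holding a Python-style list of utterance ids in
    conversational order.
    """
    # literal_eval safely parses the list literal without executing code
    convo_seq = ast.literal_eval(convo_field)
    root = convo_seq[0]
    links = {}
    for i, utt_id in enumerate(convo_seq):
        # the first utterance replies to nothing; every other one
        # replies to its immediate predecessor
        reply_to = None if i == 0 else convo_seq[i - 1]
        links[utt_id] = (root, reply_to)
    return links


# Illustrative sequence (an assumption, not read from the file)
links = link_conversation("['L666497', 'L666498', 'L666499']")
print(links["L666499"])  # prints ('L666497', 'L666498')
```

Note that this treats every conversation as a linear chain, which matches how the loop above fills in `reply_to`.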


Sanity checking on the status of utterances. After updating root and reply_to information, they should now contain all mandatory fields:

In [16]:
utterance_corpus['L666499']

Utterance({'obj_type': 'utterance', '_owner': None, 'meta': {'movie_id': 'm616'}, '_id': 'L666499', 'user': User({'obj_type': 'user', '_owner': None, 'meta': {'character_name': 'COGHILL', 'movie_idx': 'm616', 'movie_name': 'zulu dawn', 'gender': '?', 'credit_pos': '?'}, '_id': 'u9028', '_name': 'u9028'}), 'root': 'L666497', 'reply_to': 'L666498', 'timestamp': None, 'text': 'How quickly can you move your artillery forward?'})

#### 3. Creating corpus from list of utterances
We are now ready to create the movie-corpus. Note that we can specify a version number for a corpus, making it easier for us to keep track of which corpus we are working with.  

In [17]:
utterance_list = list(utterance_corpus.values())

In [18]:
# in actual use, create the appropriate version number
movie_corpus = Corpus(utterances=utterance_list, version=1)

ConvoKit will automatically help us create conversations based on the information about the utterances we provide. 

In [19]:
print("number of conversations in the dataset = {}".format(len(movie_corpus.get_conversation_ids())))

number of conversations in the dataset = 83097


In [20]:
convo_ids = movie_corpus.get_conversation_ids()
for i, convo_idx in enumerate(convo_ids[0:5]):
    print("sample conversation {}:".format(i))
    print(movie_corpus.get_conversation(convo_idx).get_utterance_ids())

sample conversation 0:
['L1045', 'L1044']
sample conversation 1:
['L985', 'L984']
sample conversation 2:
['L925', 'L924']
sample conversation 3:
['L872', 'L871', 'L870']
sample conversation 4:
['L869', 'L868', 'L867', 'L866']
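Conceptually, conversations can be recovered from utterances alone because every utterance carries the id of its root. The sketch below illustrates that idea with plain Python; it is not ConvoKit's implementation. `group_by_root` is a hypothetical helper, and the (id, root) pairs are illustrative, assuming the lowest id in each sample conversation is its root.

```python
from collections import defaultdict

# (utterance_id, root) pairs; illustrative values, assuming the lowest id
# in each conversation is the root
sample_utterances = [
    ("L870", "L870"),
    ("L871", "L870"),
    ("L872", "L870"),
    ("L924", "L924"),
    ("L925", "L924"),
]


def group_by_root(pairs):
    """Group utterance ids into conversations keyed by their root id."""
    conversations = defaultdict(list)
    for utt_id, root in pairs:
        conversations[root].append(utt_id)
    return dict(conversations)


conversations = group_by_root(sample_utterances)
print("number of conversations = {}".format(len(conversations)))
```

Under this view, the number of conversations equals the number of distinct root ids, which is why the count reported above matches the number of lines read from the conversations file.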


#### 4. Updating Corpus level metadata:
The original dataset also includes additional information about the 617 movies from which these conversations are drawn, such as genres, release year, and the URL from which the raw script was retrieved. This information may be saved as Corpus-level metadata. 

* Adding URL information: 

In [21]:
with open(data_dir + "raw_script_urls.txt", "r", encoding='utf-8', errors='ignore') as f:
    urls = f.readlines()

In [22]:
movie_meta = {}
for movie in urls:
    movie_id, title, url = [info.strip() for info in movie.split("+++$+++")]
    movie_meta[movie_id] = {'title': title, "url": url}

In [23]:
len(movie_meta)

617

* Adding more movie metadata from movie_titles_metadata.txt: 

In [24]:
with open(data_dir + "movie_titles_metadata.txt", "r", encoding='utf-8', errors='ignore') as f:
    movie_extra = f.readlines()

In [25]:
for movie in movie_extra:
    movie_id, title, year, rating, votes, genre  = [info.strip() for info in movie.split("+++$+++")]
    movie_meta[movie_id]['release_year'] = year
    movie_meta[movie_id]['rating'] = rating
    movie_meta[movie_id]['votes'] = votes
    movie_meta[movie_id]['genre'] = genre

Sanity checking for a random movie in the dataset:

In [26]:
movie_meta['m23']

{'title': 'the avengers',
 'url': 'http://www.dailyscript.com/scripts/Avengers.html',
 'release_year': '1998',
 'rating': '3.40',
 'votes': '21519',
 'genre': "['action', 'adventure', 'thriller']"}
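As the output above shows, every value is kept as a string, including the genre list. If typed values are preferred, they could be converted before being stored. The following is a minimal sketch of one way to do that; `convert_movie_meta` is a hypothetical helper, and the sample record is copied from the output above. The release year is deliberately left as a string, since it has not been verified here that every value in the file is a plain integer.

```python
import ast


def convert_movie_meta(raw):
    """Return a copy of one movie's metadata with typed values."""
    converted = dict(raw)
    # the genre field holds a Python-style list literal stored as a string
    converted["genre"] = ast.literal_eval(raw["genre"])
    converted["rating"] = float(raw["rating"])
    converted["votes"] = int(raw["votes"])
    # release_year stays a string (assumption: it may not always be numeric)
    return converted


# Sample record copied from the output above
raw_meta = {
    "title": "the avengers",
    "url": "http://www.dailyscript.com/scripts/Avengers.html",
    "release_year": "1998",
    "rating": "3.40",
    "votes": "21519",
    "genre": "['action', 'adventure', 'thriller']",
}

typed_meta = convert_movie_meta(raw_meta)
print(typed_meta["genre"], typed_meta["rating"], typed_meta["votes"])
```

Whether to convert is a design choice: keeping strings preserves the raw data exactly, while typed values are more convenient for filtering, for example by rating.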

In [27]:
movie_corpus.meta['movie_metadata'] = movie_meta

Optionally, we can also save the original name of the dataset:

In [28]:
movie_corpus.meta['name'] = "Cornell Movie-Dialogs Corpus"

#### 5. Processing utterance texts 

We can also "annotate" the utterances, e.g., by computing dependency parses for them and saving the resulting parses. Here is an example of how this can be done. More examples related to text processing can be found at https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/examples/text-processing/text_preprocessing_demo.ipynb:

In [29]:
from convokit.text_processing import TextParser

In [30]:
parser = TextParser(verbosity=10000)

In [31]:
movie_corpus = parser.transform(movie_corpus)

10000/304713 utterances processed
20000/304713 utterances processed
30000/304713 utterances processed
40000/304713 utterances processed
50000/304713 utterances processed
60000/304713 utterances processed
70000/304713 utterances processed
80000/304713 utterances processed
90000/304713 utterances processed
100000/304713 utterances processed
110000/304713 utterances processed
120000/304713 utterances processed
130000/304713 utterances processed
140000/304713 utterances processed
150000/304713 utterances processed
160000/304713 utterances processed
170000/304713 utterances processed
180000/304713 utterances processed
190000/304713 utterances processed
200000/304713 utterances processed
210000/304713 utterances processed
220000/304713 utterances processed
230000/304713 utterances processed
240000/304713 utterances processed
250000/304713 utterances processed
260000/304713 utterances processed
270000/304713 utterances processed
280000/304713 utterances processed
290000/304713 utterances processed

The parses are saved under 'parsed' in the utterance metadata:

In [32]:
movie_corpus.get_utterance('L666499').get_info('parsed')

[{'rt': 4,
  'toks': [{'tok': 'How', 'tag': 'WRB', 'dep': 'advmod', 'up': 1, 'dn': []},
   {'tok': 'quickly', 'tag': 'RB', 'dep': 'advmod', 'up': 4, 'dn': [0]},
   {'tok': 'can', 'tag': 'MD', 'dep': 'aux', 'up': 4, 'dn': []},
   {'tok': 'you', 'tag': 'PRP', 'dep': 'nsubj', 'up': 4, 'dn': []},
   {'tok': 'move', 'tag': 'VB', 'dep': 'ROOT', 'dn': [1, 2, 3, 6, 7, 8]},
   {'tok': 'your', 'tag': 'PRP$', 'dep': 'poss', 'up': 6, 'dn': []},
   {'tok': 'artillery', 'tag': 'NN', 'dep': 'dobj', 'up': 4, 'dn': [5]},
   {'tok': 'forward', 'tag': 'RB', 'dep': 'advmod', 'up': 4, 'dn': []},
   {'tok': '?', 'tag': '.', 'dep': 'punct', 'up': 4, 'dn': []}]}]
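Reading the output above, each sentence appears to be a dictionary in which `rt` is the index of the root token, and each token records the index of its parent in `up` and the indices of its children in `dn`. Under that reading (an interpretation of the output, not a statement of ConvoKit's specification), the root and its direct dependents can be extracted as in the sketch below. `root_and_children` is a hypothetical helper, and the data is copied from the output above.

```python
# Parse copied from the output above
parsed = [{'rt': 4,
  'toks': [{'tok': 'How', 'tag': 'WRB', 'dep': 'advmod', 'up': 1, 'dn': []},
   {'tok': 'quickly', 'tag': 'RB', 'dep': 'advmod', 'up': 4, 'dn': [0]},
   {'tok': 'can', 'tag': 'MD', 'dep': 'aux', 'up': 4, 'dn': []},
   {'tok': 'you', 'tag': 'PRP', 'dep': 'nsubj', 'up': 4, 'dn': []},
   {'tok': 'move', 'tag': 'VB', 'dep': 'ROOT', 'dn': [1, 2, 3, 6, 7, 8]},
   {'tok': 'your', 'tag': 'PRP$', 'dep': 'poss', 'up': 6, 'dn': []},
   {'tok': 'artillery', 'tag': 'NN', 'dep': 'dobj', 'up': 4, 'dn': [5]},
   {'tok': 'forward', 'tag': 'RB', 'dep': 'advmod', 'up': 4, 'dn': []},
   {'tok': '?', 'tag': '.', 'dep': 'punct', 'up': 4, 'dn': []}]}]


def root_and_children(sentence):
    """Return the root token text and the texts of its direct dependents."""
    toks = sentence['toks']
    root = toks[sentence['rt']]
    children = [toks[i]['tok'] for i in root['dn']]
    return root['tok'], children


for sentence in parsed:
    root_tok, children = root_and_children(sentence)
    print(root_tok, children)
```

For this utterance the root is the verb "move", and "How" is not among its direct dependents because it attaches to "quickly" instead.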

#### Saving created datasets
To complete the dataset conversion, we save the corpus so that it can be loaded later for reuse. You may want to specify a name. By default, saved corpora are placed in __.convokit/saved-corpora__ in your home directory, but you can also specify a different location. 

In [33]:
# movie_corpus.dump("movie-corpus", base_path = <specify where you prefer to save it to>)
# the following would save the Corpus to the default location, i.e., ~/.convokit/saved-corpora
movie_corpus.dump("movie-corpus")

After saving, the metadata available in the dataset can be checked directly, without loading the corpus:

In [34]:
from convokit import meta_index
import os.path

In [35]:
meta_index(filename = os.path.join(os.path.expanduser("~"), ".convokit/saved-corpora/movie-corpus"))

{'utterances-index': {'movie_id': "<class 'str'>", 'parsed': "<class 'list'>"},
 'users-index': {'character_name': "<class 'str'>",
  'movie_idx': "<class 'str'>",
  'movie_name': "<class 'str'>",
  'gender': "<class 'str'>",
  'credit_pos': "<class 'str'>"},
 'conversations-index': {},
 'overall-index': {},
 'version': 1}

### Other ways of conversion

The above method is only one way to convert the dataset. Alternatively, one may strictly follow the specification of the expected data format described [here](https://github.com/CornellNLP/Cornell-Conversational-Analysis-Toolkit/blob/master/doc/source/data_format.rst) and write out the component files directly. 