In [1]:
import json

data_path = "./data/tom_in_amc.json"
with open(data_path) as f:
    dataset = json.load(f)

In [2]:
print(f"Dataset Split:           \t{dataset.keys()}")
print(f"Example Movies (Train):  \t{list(dataset['train'].keys())[:3]}")

Dataset Split:           	dict_keys(['train', 'dev', 'test'])
Example Movies (Train):  	['constantine', 'nightbreed', 'the martian']


Take the first movie in the training set as an example. Each movie entry contains:
- `movie_data["chars"]`: character names and their utterance frequencies throughout the movie
- `movie_data["training_scenes"]`: training scenes in the movie
- `movie_data["testing_scenes"]`: testing scenes in the movie

In [3]:
movie_data = dataset["train"]["constantine"]
print(f"Movie Data Components:\t{movie_data.keys()}")
print(f"Characters:           \t{movie_data['chars']}")
print(f"# Training Scenes:    \t{len(movie_data['training_scenes'])}")
print(f"# Testing Scenes:     \t{len(movie_data['testing_scenes'])}")

Movie Data Components:	dict_keys(['chars', 'training_scenes', 'testing_scenes'])
Characters:           	{'john': 40, 'chaz': 7, 'angela': 20}
# Training Scenes:    	49
# Testing Scenes:     	31
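
The same per-movie counts can be gathered for every movie in a split. The following is a minimal sketch, assuming each movie entry has the `training_scenes` and `testing_scenes` keys shown above; `summarize_split` and `toy_split` are hypothetical names introduced here for illustration, and the toy values are made up so the snippet runs without the dataset file.

```python
def summarize_split(split):
    """Map each movie name to (number of training scenes, number of testing scenes)."""
    return {
        name: (len(movie["training_scenes"]), len(movie["testing_scenes"]))
        for name, movie in split.items()
    }


# Tiny stand-in with the same shape as dataset["train"]; the contents are made up.
toy_split = {
    "movie_a": {
        "chars": {"x": 3},
        "training_scenes": [{}, {}],
        "testing_scenes": [{}],
    },
}

print(summarize_split(toy_split))
```

With the real data loaded, `summarize_split(dataset["train"])` should give the same kind of mapping for the whole training split.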


Take the first training scene as an example. Each scene entry contains:
- `scene_data["scene"]`: scene content as a list of lines
- `scene_data["chars"]`: characters in the scene
- `scene_data["char_map"]`: mapping from character name to anonymous ID

In [4]:
scene_data = movie_data["training_scenes"][0]
print(f"Scene Data Components:         \t{scene_data.keys()}")
print(f"The First Line of the Scene:   \t{scene_data['scene'][0]}")
print(f"Characters in the Scene:       \t{scene_data['chars']}")
print(f"Character Name to Anonymous ID:\t{scene_data['char_map']}")

Scene Data Components:         	dict_keys(['scene', 'chars', 'char_map'])
The First Line of the Scene:   	{'type': 'scene', 'title': 'INT. APARTMENT 7B', 'text': 'One scan of the situation is all it takes. The bed -- the child -- the panicked priest -- who rushes to P0.'}
Characters in the Scene:       	{'john': 1, 'chaz': 1}
Character Name to Anonymous ID:	{'john': 'P0', 'chaz': 'P1'}
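
Because `char_map` goes from name to anonymous ID, inverting it gives a lookup from ID back to name, which can help when inspecting a scene by eye. The following is a minimal sketch, not part of the dataset's tooling; `deanonymize` is a hypothetical helper, and it assumes IDs appear in the text as whole tokens such as `P0`.

```python
import re


def deanonymize(text, char_map):
    """Replace anonymous IDs (e.g. 'P0') in text with the mapped character names."""
    # Invert name -> ID into ID -> name.
    id_to_name = {anon_id: name for name, anon_id in char_map.items()}
    if not id_to_name:
        return text
    # Word boundaries keep 'P1' from matching inside a longer token like 'P10'.
    alternatives = "|".join(re.escape(anon_id) for anon_id in id_to_name)
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda match: id_to_name[match.group(1)], text)


example_map = {"john": "P0", "chaz": "P1"}
print(deanonymize("P0 shoots him a disgusted look.", example_map))
```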


The lines of a scene can be converted to plain-text scene content with the following function:

In [5]:
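def process_lines(lines):
    """Convert a scene's list of line dicts into plain text."""
    scene_content = []
    for line in lines:
        line_type = line["type"]  # avoid shadowing the built-in `type`
        title = line["title"]
        content = line["text"]

        assert line_type in ("scene", "dialog"), f"Unknown type: {line_type}"
        if line_type == "scene":
            # Scene descriptions may carry a heading; a "NULL" title is skipped.
            if title != "NULL":
                scene_content.append(title)
            scene_content.append(content)
        elif line_type == "dialog":
            # For dialog lines, the title holds the speaker's name.
            scene_content.append(f"{title}: {content}")

    return "\n".join(scene_content)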

scene_content = process_lines(scene_data["scene"])
print(scene_content)

INT. APARTMENT 7B
One scan of the situation is all it takes. The bed -- the child -- the panicked priest -- who rushes to P0.
HENNESSEY: (whispering) Thank God you're here... 
P0 shoots him a disgusted look. Hennessey gives him a wide berth. P0 walks past the panic-stricken MOTHER without a glance, sets his cigarette on the nightstand, the glowing tip drooped over the edge. He puts a gloved hand to the child's face and it burns on contact. His demeanor instantly changes as he leans right next to the ear of the little girl and whispers --
JOHN: This is Constantine. John Constantine, asshole. 
The girl JOLTS, bandages on her arms cut into her skin. Eyes snap open -- glare right through him.
JOHN: How ya doing? 
JEANIE: Vamos juntos a matarla. 
P0 whips out a key chain crammed with medallions.
JOHN: Let's see who we got here... 
He holds them up so they cast shadows across Jeanie's face. He flips through each of these sculptured SAINTS until the child suddenly reacts to one -- tries to lo