This notebook fits a topic model to the Sherlock text descriptions and then transforms the recall transcripts with the model.

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import hypertools as hyp
from scipy.interpolate import interp1d

from sherlock_helpers.constants import (
    DATA_DIR, 
    RAW_DIR, 
    RECALL_WSIZE, 
    SEMANTIC_PARAMS,
    VECTORIZER_PARAMS,
    VIDEO_WSIZE
)
from sherlock_helpers.functions import (
    format_text, 
    get_video_timepoints, 
    multicol_display,
    parse_windows,
    show_source
)

%matplotlib inline

Helper functions and variables used across multiple notebooks can be found in `/mnt/code/sherlock_helpers/sherlock_helpers`, or on GitHub, [here](https://github.com/ContextLab/sherlock-topic-model-paper/tree/master/code/sherlock_helpers).

You can also view source code directly from the notebook with:

    from sherlock_helpers.functions import show_source
    show_source(foo)

## Inspect some things defined in `sherlock_helpers` 

In [2]:
show_source(format_text)

In [3]:
show_source(parse_windows)

In [4]:
show_source(get_video_timepoints)

In [5]:
_vec_params = dict(model=VECTORIZER_PARAMS['model'], **VECTORIZER_PARAMS['params'])
_sem_params = dict(model=SEMANTIC_PARAMS['model'], **SEMANTIC_PARAMS['params'])
multicol_display(VIDEO_WSIZE, RECALL_WSIZE, _vec_params, _sem_params, 
                 caption='Modeling parameters',
                 col_headers=('Video annotation window length', 
                              'Recall transcript window length', 
                              'Vectorizer parameters', 
                              'Topic model parameters'), 
                 ncols=4)

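Video annotation window length,Recall transcript window length,Vectorizer parameters,Topic model parameters
50,10,model: CountVectorizer; stop_words: english,model: LatentDirichletAllocation; n_components: 100; learning_method: batch; random_state: 0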
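The two window lengths set how many consecutive video annotations or recall sentences are joined into a single document before vectorizing. The sketch below shows one simple way to build such overlapping windows. `sliding_windows` is a hypothetical stand-in written for illustration; it is not the source of `parse_windows`, whose handling of the first and last windows may differ.

```python
def sliding_windows(items, wsize):
    """Join each run of `wsize` consecutive items into one document.

    Hypothetical illustration only; not the implementation of parse_windows.
    Returns the joined windows and their (start, end) index bounds.
    """
    if len(items) <= wsize:
        # Too few items for a full window: return everything as one document.
        return [' '.join(items)], [(0, len(items))]
    bounds = [(start, start + wsize) for start in range(len(items) - wsize + 1)]
    windows = [' '.join(items[start:end]) for start, end in bounds]
    return windows, bounds


sentences = ['a', 'b', 'c', 'd']
windows, bounds = sliding_windows(sentences, 2)
```

With a step of one item, neighboring windows share `wsize - 1` items, so the resulting topic vectors change gradually from one window to the next.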


## Load and format data

In [6]:
video_text = pd.read_excel(RAW_DIR.joinpath('Sherlock_Segments_1000_NN_2017.xlsx'))
# forward-fill missing scene segment labels
video_text['Scene Segments'] = video_text['Scene Segments'].ffill()

# drop 1s shot & 6s of black screen after end of 1st scan
video_text.drop(index=[480, 481], inplace=True)
video_text.reset_index(drop=True, inplace=True)

# timestamps for 2nd scan restart from 0; add duration of 1st scan to values
video_text.loc[480:, 'Start Time (s) ':'End Time (s) '] += video_text.loc[479, 'End Time (s) ']
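The timestamp correction in the cell above can be illustrated on a toy table. This is a minimal sketch with made-up values and simplified column names (`start`, `end`, both assumptions), in which the second scan's timestamps restart from 0 and are shifted by the end time of the last row of the first scan.

```python
import pandas as pd

# Toy annotations: rows 0-1 belong to scan 1, rows 2-3 to scan 2 (times restart at 0).
toy = pd.DataFrame({
    'start': [0.0, 5.0, 0.0, 4.0],
    'end': [5.0, 12.0, 4.0, 9.0],
})
scan2_first_row = 2  # hypothetical index of the first scan-2 row

# Shift every scan-2 timestamp by the end time of the last scan-1 row.
offset = toy.loc[scan2_first_row - 1, 'end']
toy.loc[scan2_first_row:, 'start':'end'] += offset
```

After the shift, the timestamps run continuously across both scans.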

## Inspect some of the raw data we're working with

In [7]:
video_text.loc[7:, 'Scene Details - A Level ':'Words on Screen '].head()

Unnamed: 0,Scene Details - A Level,Space-In/Outdoor,Name - All,Name - Focus,Name - Speaking,Location,Camera Angle,Music Presence,Words on Screen
7,Gunfire by a soldier along a wall made of stac...,Outdoor,Soldiers,Soldiers,,Afghanistan,Medium,No,
8,A bomb or land mine goes off in the middle of ...,Outdoor,Soldiers,Soldiers,,Afghanistan,Long,No,
9,A Soldier kicks open a door. Soldiers shooting...,Outdoor,Soldiers,Soldiers,,Afghanistan,Medium,No,
10,Close up view of John tossing in bed while sle...,Indoor,John,John,,John's Room,Close Up,No,
11,More gunfire. Two soldiers seen hand signallin...,Outdoor,Soldiers,Soldiers,,Afghanistan,Medium,No,


In [8]:
_vid_samples = {i: video_text.loc[i, 'Scene Details - A Level ':'Words on Screen '].to_frame(name='')
                for i in range(9, 13)}
multicol_display(*_vid_samples.values(), 
                 caption="<i>A Study in Pink</i> sample annotations",
                 col_headers=(f"Annotation {i}" for i in _vid_samples.keys()),
                 ncols=4)

In [9]:
_rec_samples = {f'P{p}': f"{RAW_DIR.joinpath(f'NN{p} transcript.txt').read_text()[:400]}..." 
                for p in (11, 13, 15, 17)}
multicol_display(*_rec_samples.values(), 
                 caption="Sample recall transcripts",
                 col_headers=_rec_samples.keys(), 
                 ncols=4,
                 cell_css={'text-align': 'left'})

P11,P13,P15,P17
"So the show starts with Watson dreaming, or like reliving his time in the military. Starts with the battlefield and he's on it. And there are shots being fired and its pretty green. He's in an army uniform. And he wakes up in a room on a bed. And I think he gets up and walks around. And like brushing his teeth or something, or like checks his computer. And then, I think gets like the intro scene, ...","So before the episode began, there's the cartoon for a movie theater specifically to get people to get snacks in the lobby. So there's a jingle, everyone go to the lobby and get ourselves a treat. And said the sparkling drinks are neat, then there's chocolate candy bars and popcorn. And there's a picture of the popcorn and a picture of the--well first it started off with life-size versions of the ...","Okay um so.. the story starts out with.. scenes of people killing themselves. The suicide shots, it appears that they're all killing themselves in a similar fashion, by taking pills. They seem kinda like they're compelled to do it, almost like they're fighting themselves. There's a scene with a boy doing it after he's split up with his friend in the street. Yeah so then there's a scene with a dete...","So it began with Watson being sort of like in a battlefield and we get a bunch of shots of people getting shot at and then he, after a few seconds he wakes up all in a sweat, it seems like at night. And then it fades, and then we see him just sitting in his apartment. It kind of even looked like a hotel room I don't know. He's just sitting there. And then we see a cane in the shot, and he's kind o..."


## Fit topic model to manually-annotated movie

In [10]:
# create a list of text samples from the scene descriptions / details to train the topic model
video = video_text.loc[:,'Scene Details - A Level ':'Words on Screen '].apply(format_text, axis=1).tolist()
video_windows, window_bounds = parse_windows(video, VIDEO_WSIZE)

# create video model with hypertools
video_model = hyp.tools.format_data(video_windows, 
                                    vectorizer=VECTORIZER_PARAMS, 
                                    semantic=SEMANTIC_PARAMS, 
                                    corpus=video_windows)[0]

# descriptions are by scene, not by TR, so stretch the model to be in TRs
video_model_TRs = np.empty((1976, 100))
xvals = get_video_timepoints(window_bounds, video_text)
xvals_TR = xvals * 1976 / 2963
TR_times = np.arange(1, 1977)
interp_func = interp1d(xvals_TR, video_model, axis=0, fill_value='extrapolate')
video_model_TRs = interp_func(TR_times)
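The resampling step above can be sketched on a toy example. The values below are made up, and the sketch assumes that each window has a single timepoint in seconds (here called `window_times`) that is rescaled to TR units before interpolating, which is presumably what the `1976 / 2963` factor does in the cell above.

```python
import numpy as np
from scipy.interpolate import interp1d

# Toy "model": 3 windows x 2 topics, with window timepoints at 2, 6 and 10 seconds.
window_times = np.array([2.0, 6.0, 10.0])
toy_model = np.array([[0.0, 1.0],
                      [0.5, 0.5],
                      [1.0, 0.0]])

n_trs, duration = 8, 12.0                      # hypothetical: 8 TRs spanning 12 s
window_trs = window_times * n_trs / duration   # seconds -> TR units
tr_times = np.arange(1, n_trs + 1)

# Linear interpolation along the window axis; extrapolate past the first/last window.
interp_func = interp1d(window_trs, toy_model, axis=0, fill_value='extrapolate')
toy_model_trs = interp_func(tr_times)
```

Because the extrapolation is linear, topic weights for TRs outside the range of window timepoints can fall slightly outside the range of the fitted values, including below zero.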

## Transform recalls

In [11]:
# loop over subjects
recall_w = []
for sub in range(1, 18):
    # load subject data, replacing the cp1252 right single quotation mark
    # (byte 0x92) with an ASCII apostrophe
    transcript_path = RAW_DIR.joinpath(f'NN{sub} transcript.txt')
    with transcript_path.open(encoding='cp1252') as f:
        recall = f.read().replace(b'\x92'.decode('cp1252'), "'").strip()

    # create overlapping windows of n sentences
    recall_fmt = format_text(recall).split('.')
    if not recall_fmt[-1]:
        recall_fmt = recall_fmt[:-1]
    sub_recall_w = parse_windows(recall_fmt, RECALL_WSIZE)[0]
    recall_w.append(sub_recall_w)
    
    # save example participant's recall windows 
    if sub == 17:
        np.save(DATA_DIR.joinpath('recall_text.npy'), sub_recall_w)
    
# create recall models
recall_models = hyp.tools.format_data(recall_w, 
                                      vectorizer=VECTORIZER_PARAMS, 
                                      semantic=SEMANTIC_PARAMS, 
                                      corpus=video_windows)
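Passing `corpus=video_windows` means the recalls are transformed with a model fit to the video text rather than to the recalls themselves. The sketch below illustrates that idea directly with scikit-learn, assuming the `CountVectorizer` and `LatentDirichletAllocation` names in the parameter table refer to the scikit-learn classes. The documents and the choice of 3 topics are made up for illustration, and this is not the code path hypertools runs internally.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the video annotation windows (training corpus) and recall windows.
video_docs = [
    'soldiers gunfire battlefield afghanistan',
    'john wakes bed room night',
    'sherlock lab microscope phone',
    'john sherlock taxi london street',
]
recall_docs = ['watson dreams battlefield gunfire', 'sherlock looks microscope lab']

# Fit the vocabulary and the topic model on the video text only.
vectorizer = CountVectorizer(stop_words='english')
video_counts = vectorizer.fit_transform(video_docs)
lda = LatentDirichletAllocation(n_components=3, learning_method='batch', random_state=0)
video_topics = lda.fit_transform(video_counts)

# Project the recalls into the same topic space; words absent from the
# video vocabulary (e.g. 'watson') are ignored by the fitted vectorizer.
recall_topics = lda.transform(vectorizer.transform(recall_docs))
```

Fitting once on the video text keeps the video and recall trajectories in a shared topic space, so their topic vectors can be compared directly.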

## Save video model, recall models, and text corpus

In [12]:
n_topics = SEMANTIC_PARAMS['params'].get('n_components')
np.save(DATA_DIR.joinpath(f'models_t{n_topics}_v{VIDEO_WSIZE}_r{RECALL_WSIZE}'), 
        [video_model_TRs, recall_models])
np.save(DATA_DIR.joinpath('video_text.npy'), video_windows)
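The saved models file holds two objects of different shapes: one video array and a list of per-participant recall arrays. The sketch below shows a round trip for such a pair using small stand-in arrays and a temporary file. It assumes the recall models differ in length across participants, and it packs the pair into an explicit object array, since recent NumPy versions may refuse to build an array from a ragged list implicitly.

```python
import tempfile
from pathlib import Path

import numpy as np

video_like = np.zeros((5, 3))                      # stand-in for video_model_TRs
recall_like = [np.ones((2, 3)), np.ones((4, 3))]   # stand-in for recall_models (ragged)

# Pack the pair into a 1-D object array so NumPy does not try to stack them.
packed = np.empty(2, dtype=object)
packed[0] = video_like
packed[1] = recall_like

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'models_demo.npy'   # hypothetical file name
    np.save(path, packed)
    # Object arrays are stored via pickle, so loading needs allow_pickle=True.
    loaded_video, loaded_recalls = np.load(path, allow_pickle=True)
```

Note that `np.save` appends `.npy` to a filename that lacks the extension, so the models file written above ends in `.npy` on disk.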