# Phase 2

By Asher Lipman, Cianna Chairez, Allie Pultorak, and Carrie Kim  


## Data Collection and Cleaning

Our raw text files were scattered across a series of folders organized by genre (for example "Crime", "Action", and "Comedy"). To work only with the movies that met our "classical mystery" classification, we grouped all the movies we had identified into a separate "Using" folder. At this point we had our movie scripts and our movie metadata files, which are stored in the drama_movies, crime_movies, and thriller_movies Python files. These metadata files are an intermediate stage; they are described in more detail in the dataset creation section. We then took the steps below to turn those text files into a workable dataset.

First, we load the necessary libraries and metadata dictionaries, and create our spaCy nlp object for later use.

In [1]:
#will install spacy and necessary dataset if you do not have it already on your computer

##!conda install -c conda-forge spacy spacy-lookups-data -y
##!python -m spacy download en_core_web_lg

In [1]:
#import necessary libraries
import os
import string
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import re
import spacy

from collections import defaultdict

#import the three modules of assembled movie metadata
from drama_movies import drama_movies
from crime_film_noir import crime_movies
from thriller_movies import thriller_movies

In [2]:
nlp=spacy.load("en_core_web_lg")

Then we created a set of convenience functions. The first checks whether a line in a movie script marks a new character.

In [3]:
#Checks to see if a given line follows the same format as a new character tag
def check_if_character(line, punct_set):
    #drop anything from the first parenthesis, brace, bracket, or slash onward, e.g. "(CONT'D)"
    #(the slash case previously also cut off the character just before the slash)
    for marker in "({[/":
        if marker in line:
            line=line[:line.index(marker)].strip()
    #a character tag is all caps, does not end in punctuation, and has at most four words
    if line.isupper() and line[-1] not in punct_set and line.count(' ')<4:
        return True, line
    else:
        return False, ""
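
To make the heuristic concrete, here is a minimal, self-contained sketch of the same idea applied to a few made-up script lines. `looks_like_character_cue` is a hypothetical, simplified stand-in for `check_if_character` (it only handles parentheses), and the sample lines are invented for illustration.

```python
import string

def looks_like_character_cue(line, punct_set=string.punctuation):
    """Hypothetical simplified stand-in for check_if_character (illustration only)."""
    # drop anything after a parenthetical such as "(CONT'D)" or "(V.O.)"
    if "(" in line:
        line = line[:line.index("(")].strip()
    # a cue is all caps, does not end in punctuation, and has at most four words
    is_cue = line.isupper() and line[-1] not in punct_set and line.count(" ") < 4
    return (True, line) if is_cue else (False, "")

sample_lines = ["WELLES (V.O.)", "I never saw her again.", "CUT TO:", "INT. OFFICE - NIGHT"]
results = [looks_like_character_cue(line) for line in sample_lines]
for line, result in zip(sample_lines, results):
    print(repr(line), "->", result)
```

The last sample line is accepted even though it is a scene heading, not a character. Short all-caps lines like this are the reason a list of excluded terms and some manual cleanup are still needed later on.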


This convenience function checks whether the character currently speaking in the script is one of the characters we previously identified as important, either in their own right or as the villain.

In [4]:
#Checks to see if this character is either one of the important characters in the story or a villain
def check_if_acceptable_character(this_character, acceptable_characters):
    if this_character in acceptable_characters:
        return True
    else:
        cleaner=str.maketrans('','',string.punctuation)
        this_character=this_character.translate(cleaner)
        for character in acceptable_characters:
            if character in this_character or this_character in character:
                return True
    return False
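
The matching rule is deliberately loose: after an exact-match check, it accepts a substring match in either direction. The sketch below illustrates that rule on invented names; `matches_important_character` is a hypothetical stand-in written for this example, not the function used in the pipeline.

```python
import string

def matches_important_character(name, important_names):
    """Hypothetical stand-in for check_if_acceptable_character (illustration only)."""
    if name in important_names:
        return True
    # strip punctuation, then accept a substring match in either direction
    cleaned = name.translate(str.maketrans("", "", string.punctuation))
    return any(known in cleaned or cleaned in known for known in important_names)

important = ["HARRY WILKENS", "LAURA"]  # invented names
partial = matches_important_character("HARRY", important)             # partial name
possessive = matches_important_character("LAURA'S VOICE", important)  # possessive variant
unrelated = matches_important_character("WAITER", important)          # unrelated character
one_letter = matches_important_character("A", important)              # stray one-letter tag
print(partial, possessive, unrelated, one_letter)
```

The one-letter tag is accepted because a single letter is a substring of almost any name. This looseness catches name variants cheaply, at the cost of occasionally letting stray tags through.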

        

This is the beefiest convenience function: the one that actually reads the movie scripts. Rather than building a pandas dataframe directly, we build a standard list, which makes it easier to iterate through the characters and remove entries. The exact specification is given alongside the function, but essentially it creates a list covering the important characters across all movies, with one dictionary per character holding their name, the movie title, the year the movie came out, whether they are a villain, their dialogue, and the number of words they speak. Characters with 50 or fewer words of dialogue are dropped.

In [5]:
def read_scripts(movie_metadata, unacceptable_starters, punct_remover):
    """Takes the movie metadata, reads each script, and returns a list of dictionaries, one for each character that has been labeled as important. Characters with 50 or fewer words of dialogue are dropped. The list and dictionaries follow the format below

    [{"character": "the name of the character", "movie_title": "the title of the movie they're in", "year": int(the year the movie came out), "is_villain": bool(whether the character is a villain), "raw_dialogue": "All the words they say in the movie", "num_words": int(the number of words they say in the movie)}]
    """

    returner=[]

    for movie_filename in movie_metadata:
        path=os.path.join("data", "scripts","Using", movie_filename)
        current_movie=movie_metadata[movie_filename]

        with open(path, "r", encoding="utf-8") as f:
            movie_script=f.readlines()
            start_of_content=False
            this_movie_dicts=[]
            current_character_dict={}
            movie_title=movie_metadata[movie_filename]['title']
            current_character_name="" 
            year=int(current_movie['year'])
            acceptable_characters=current_movie["characters"]+current_movie['villain']      
            
            for line in movie_script:
                line=line.strip()

                #checks for the title and the first character, otherwise skips the header stuff
                if not start_of_content:
                    line=line.strip('"')

                    is_character, new_character_name = check_if_character(line, string.punctuation)

                    if is_character and new_character_name not in unacceptable_starters and check_if_acceptable_character(new_character_name, acceptable_characters):
                        start_of_content=True
                        current_character_name=new_character_name
                        current_character_dict={"character": current_character_name, "movie_title": movie_title, "year":year, "is_villain": False, "raw_dialogue":""}
                        if current_character_name in current_movie['villain']:
                            current_character_dict['is_villain']=True
                        this_movie_dicts.append(current_character_dict)
                
                #if in the middle of the script and you have a current character, check if this line is a new character otherwise add the dialogue to this character's list
                else:
                    is_new_character, new_character_name = check_if_character(line, string.punctuation)
                    if is_new_character:
                        current_character_name=new_character_name
                        if current_character_name not in unacceptable_starters and check_if_acceptable_character(current_character_name, acceptable_characters):
                            if current_character_name not in [entry['character'] for entry in this_movie_dicts]:
                                current_character_dict={"character": current_character_name, "movie_title": movie_title, "year":year, "is_villain": False, "raw_dialogue":""}
                                if current_character_name in current_movie['villain']:
                                    current_character_dict['is_villain']=True
                                this_movie_dicts.append(current_character_dict)
                            else:
                                current_character_dict=[entry for entry in this_movie_dicts if entry['character']==current_character_name][0]
                            
                    else:
                        if current_character_name in unacceptable_starters or not check_if_acceptable_character(current_character_name, acceptable_characters):
                            pass
                        else:
                            if len(line)>0 and line[0]!='(' and line not in unacceptable_starters:
                                current_character_dict['raw_dialogue']=current_character_dict['raw_dialogue'].strip()+' ' + line.translate(punct_remover).strip()

            returner+=this_movie_dicts

        
    for entry in returner:
        entry['num_words']=entry['raw_dialogue'].count(' ')+1
    returner=[entry for entry in returner if entry['num_words']>50]    

    return returner                            
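
The core of `read_scripts` is a small state machine: remember who is currently speaking, and append each following line to that speaker until the next character tag. The sketch below shows only that idea on an invented toy script. `collect_dialogue` is a hypothetical, heavily simplified helper (exact name matching, no header handling, no villain flag), not a replacement for the function above.

```python
def collect_dialogue(script_lines, important_names):
    """Hypothetical, heavily simplified version of the loop in read_scripts."""
    dialogue = {}      # character name -> accumulated dialogue
    current = None     # name of the character currently speaking, if important
    for line in script_lines:
        line = line.strip()
        if not line:
            continue
        if line.isupper():  # treat an all-caps line as a character tag
            current = line if line in important_names else None
            if current is not None:
                dialogue.setdefault(current, "")
        elif current is not None and not line.startswith("("):
            # skip parentheticals such as "(quietly)", keep everything else
            dialogue[current] = (dialogue[current] + " " + line).strip()
    return dialogue

toy_script = ["WELLES", "Where is she?", "", "CLERK", "No idea.",
              "WELLES", "(quietly)", "Thanks anyway."]
toy_dialogue = collect_dialogue(toy_script, important_names=["WELLES"])
# same word-count rule as above: number of spaces plus one
word_counts = {name: text.count(" ") + 1 for name, text in toy_dialogue.items()}
print(toy_dialogue)
print(word_counts)
```

Note that the clerk's line is dropped rather than attributed to the previous speaker, because the current speaker is reset whenever a new tag appears, important or not.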
    

While the above code works for most script formats, formatting problems in the scripts themselves occasionally result in errors. In addition, the same character is often referred to by different names within a script. For example, in a single movie, dialogue spoken by the character Harry Wilkens might be labelled as Harry, Mr. Wilkens, Harry's Voice, or some other variation. Finally, while the original researchers claimed to have removed setting descriptions and camera directions from the scripts, they did not do so entirely. These often appear in the same format as new-character tags, causing many errors. The two convenience functions below serve to eliminate these errors. The first deals with name variations: it appends all the dialogue of one entry to another entry and then deletes the extra entry from the list. The second is simpler and is used to flat-out remove entries that were not characters but were added in error.

In [6]:
#A convenience function that adds all lines from character_name_to_remove to character_name_to_keep and then removes character_name_to_remove from the list. 
def combine_characters(movie_character_dict, movie_title, character_name_to_keep, character_name_to_remove):
    movie_subset=[entry for entry in movie_character_dict if entry['movie_title']==movie_title]
    names_in_movie=[entry['character'] for entry in movie_subset]
    #do nothing unless both names exist in this movie (the title must match exactly)
    if character_name_to_remove not in names_in_movie or character_name_to_keep not in names_in_movie:
        return
    #look both entries up within this movie only, so a same-named character in another movie is never touched
    to_remove=[entry for entry in movie_subset if entry['character']==character_name_to_remove][0]
    keeper=[entry for entry in movie_subset if entry['character']==character_name_to_keep][0]
    #join with a space so the last and first words do not run together
    keeper['raw_dialogue']+=' '+to_remove['raw_dialogue']
    keeper['num_words']+=to_remove['num_words']
    movie_character_dict.remove(to_remove)
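
As an illustration of the merge step, the sketch below combines two name variants within one movie on invented records. `merge_entries` is a hypothetical helper written for this example; only the dictionary keys mirror the real entries. It returns a boolean so the no-op case is visible.

```python
def merge_entries(entries, movie_title, keep_name, remove_name):
    """Hypothetical sketch of merging two name variants within one movie."""
    in_movie = [e for e in entries if e["movie_title"] == movie_title]
    keep = next((e for e in in_movie if e["character"] == keep_name), None)
    drop = next((e for e in in_movie if e["character"] == remove_name), None)
    if keep is None or drop is None:
        return False  # nothing to merge, the list is left untouched
    keep["raw_dialogue"] = (keep["raw_dialogue"] + " " + drop["raw_dialogue"]).strip()
    keep["num_words"] += drop["num_words"]
    entries.remove(drop)
    return True

# invented entries; only the keys mirror the real records
entries = [
    {"character": "HARRY", "movie_title": "Movie A", "raw_dialogue": "Hello there", "num_words": 2},
    {"character": "HARRY'S VOICE", "movie_title": "Movie A", "raw_dialogue": "Goodbye", "num_words": 1},
    {"character": "HARRY'S VOICE", "movie_title": "Movie B", "raw_dialogue": "Unrelated line", "num_words": 2},
]
merged = merge_entries(entries, "Movie A", "HARRY", "HARRY'S VOICE")
missing = merge_entries(entries, "Movie a", "HARRY", "HARRY'S VOICE")  # title differs in case
print(merged, missing, entries[0])
```

The second call does nothing because the title differs in case. Titles and names have to match the stored values exactly, so a mistyped title fails silently rather than raising an error.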

In [7]:
#removes the character with this character name from the list
def remove_character(movie_character_dicts, character):
    to_remove=[entry for entry in movie_character_dicts if entry['character']==character]
    if len(to_remove)!=0:
        movie_character_dicts.remove(to_remove[0])
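
Because `remove_character` matches on the name alone and removes only the first match, a name that occurs in several movies needs one call per entry. The sketch below shows that behaviour, together with a hypothetical optional `movie_title` filter that is not part of the original function. The records are invented.

```python
def remove_entry(entries, character, movie_title=None):
    """Hypothetical variant of remove_character with an optional movie filter."""
    for entry in entries:
        same_name = entry["character"] == character
        same_movie = movie_title is None or entry["movie_title"] == movie_title
        if same_name and same_movie:
            entries.remove(entry)
            return True   # only the first match is removed
    return False

entries = [
    {"character": "I", "movie_title": "Movie A"},
    {"character": "I", "movie_title": "Movie B"},
    {"character": "LAURA", "movie_title": "Movie B"},
]
first = remove_entry(entries, "I")                          # removes the Movie A entry
second = remove_entry(entries, "I", movie_title="Movie B")  # removes the Movie B entry
third = remove_entry(entries, "I")                          # nothing left to remove
print(first, second, third, entries)
```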

Given a pandas dataframe that includes each character's raw dialogue, the function below adds a column containing a list of all the tokens in that dialogue that aren't punctuation, whitespace, or commonly used filler ("stop") words.

In [8]:
#given a dataframe and a spacy nlp object, adds a column to the dataframe with an array of tokens present in each dialogue entry that aren't spaces, punctuation, or common stop words
def tokenize(data_frame,nlp):
    data_frame['token_list']=[[token for token in nlp(doc) if not (token.is_punct or token.is_space or token.is_stop)] for doc in data_frame['raw_dialogue']]

The function below makes use of our second dataset, the NRC Word-Emotion Association Lexicon (EmoLex). Again, the specifics and sourcing of this lexicon are addressed more directly in our data collection section. In practice, the lexicon associates each of its words with ten sentiments: anger, anticipation, disgust, fear, joy, negative, positive, sadness, surprise, and trust. These associations are binary: a word has a score of 1 for a sentiment it is associated with and 0 otherwise. The function below runs through the tokens in a character's dialogue and, for each of the ten sentiments, calculates the mean score over the tokens that appear in the lexicon. Tokens that are not in the lexicon are ignored.

In [9]:
#given a list of tokens, calculates a mean emotional scoring of that list for each of the ten emotions present in the NRC Word-Emotion Association Emolex
def add_emotion_scores(token_list, lexicon=None):
    string_list=[token.text.lower().strip() for token in token_list]
    #one list of scores per emotion (named scores rather than re, to avoid shadowing the re module)
    scores=[[] for _ in range(10)]
    if lexicon is None:
        lexicon=read_emolex()
    for token in string_list:
        #only tokens found in the lexicon contribute to the means
        if token in lexicon:
            score_list=list(lexicon[token].values())
            for score_index in range(len(score_list)):
                scores[score_index].append(score_list[score_index])
    #an emotion with no scores at all gets a mean of 0
    scores=[score if len(score)>0 else [0] for score in scores]
    to_return = [np.mean(emotion) for emotion in scores]
    return to_return
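
A small worked example may help show what the resulting means represent. The sketch below uses an invented two-emotion lexicon in the same word-to-scores shape. `mean_emotions` and `toy_lexicon` are hypothetical names, and the values are made up rather than taken from EmoLex.

```python
from statistics import mean

# invented two-emotion lexicon in the same word -> {emotion: 0 or 1} shape
toy_lexicon = {
    "murder": {"anger": 1, "joy": 0},
    "gift":   {"anger": 0, "joy": 1},
    "fight":  {"anger": 1, "joy": 0},
}

def mean_emotions(words, lexicon, emotions=("anger", "joy")):
    """Hypothetical sketch: average each emotion over the words found in the lexicon."""
    matched = [lexicon[word] for word in words if word in lexicon]
    if not matched:
        return [0.0 for _ in emotions]   # no lexicon words at all -> all zeros
    return [mean(entry[emotion] for entry in matched) for emotion in emotions]

# "table" is not in the lexicon, so it does not dilute the averages
scores = mean_emotions(["murder", "fight", "fight", "gift", "table"], toy_lexicon)
empty = mean_emotions(["table", "chair"], toy_lexicon)
print(scores, empty)
```

Since each score is 0 or 1, each mean can be read as the share of a character's lexicon-matched words that carry that emotion.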
    

    

The function below simply reads the EmoLex into a Python dictionary. It was provided by Professor Wilkens for a previous class (INFO 3350).

In [10]:
# Source: Professor Wilkens, INFO 3350
#convenience function to read the emotion lexicon into the variable emolex
def read_emolex(filepath=None):
    '''
    Takes a file path to the emolex lexicon file.
    Returns a dictionary of emolex sentiment values.
    '''
    if filepath==None: # Try to find the emolex file
        filepath = os.path.join('data','emolex.txt')
        if os.path.isfile(filepath):
            pass
        elif os.path.isfile('emolex.txt'):
            filepath = 'emolex.txt'
        else:
            raise FileNotFoundError('No EmoLex file found')
    emolex = defaultdict(dict) # Like Counter(), defaultdict eases dictionary creation
    with open(filepath, 'r') as f:
    # emolex file format is: word emotion value
        for line in f:
            word, emotion, value = line.strip().split()
            emolex[word][emotion] = int(value)
    return emolex

# Get EmoLex data. Make sure you set the right file path above.
emolex = read_emolex()
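
For readers without the lexicon file at hand, the sketch below applies the same parsing loop to a few in-memory lines in the "word emotion value" layout. The sample lines and the `parsed` variable are illustrative only and are not meant to reproduce actual lexicon entries.

```python
import io
from collections import defaultdict

# three sample lines in the "word emotion value" layout (illustrative values)
sample_file = io.StringIO("abandon\tfear\t1\nabandon\tjoy\t0\ngift\tjoy\t1\n")

parsed = defaultdict(dict)
for line in sample_file:
    word, emotion, value = line.strip().split()
    parsed[word][emotion] = int(value)

print(dict(parsed))
# membership tests do not create new keys, unlike indexing a defaultdict
print("gift" in parsed, "table" in parsed)
```

Checking membership with `in` before indexing, as the scoring function does, matters here: indexing a `defaultdict` with an unknown word would silently add an empty entry for it.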


Another important factor we wanted to take into account is that, while the words characters use might differ dramatically, the actual content of what they are trying to say might align very well. Word2Vec embeddings are essentially 300-dimensional representations of a given word, with words of similar meaning located near each other in this high-dimensional space. Below, we look up the embedding of each word a character says (that is, where it lies in this 300-dimensional space). We then average these word embeddings (taking the mean of each dimension) to create a single 300-dimensional point that represents the character's dialogue as a whole.

In [20]:
#given a list of tokens, calculate the mean embedding for all tokens in the list that have a vector
def add_embeddings(token_list, vector_length):
    token_list=[token for token in token_list if token.has_vector]
    doc_matrix=np.zeros([len(token_list), vector_length])
    for i in range(len(doc_matrix)):
        doc_matrix[i]=token_list[i].vector
    return np.average(doc_matrix, axis=0)
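
The averaging step can be illustrated without spaCy. The sketch below uses invented 3-dimensional vectors in place of the real 300-dimensional ones and a hypothetical pure-Python `mean_vector` helper that averages dimension by dimension, the same operation as averaging along axis 0 above.

```python
def mean_vector(vectors):
    """Hypothetical pure-Python equivalent of averaging row vectors dimension by dimension."""
    if not vectors:
        return []   # no vectors to average
    return [sum(dimension) / len(vectors) for dimension in zip(*vectors)]

# invented 3-dimensional "embeddings" standing in for the real 300-dimensional ones
word_vectors = {
    "killer":   [1.0, 0.0, 2.0],
    "murderer": [1.0, 0.5, 2.0],
    "picnic":   [-2.0, 4.0, 0.0],
}
dialogue_a = mean_vector([word_vectors["killer"], word_vectors["picnic"]])
dialogue_b = mean_vector([word_vectors["murderer"], word_vectors["picnic"]])
print(dialogue_a, dialogue_b)
```

The two toy dialogues use different words but land on nearly the same point, which is the property we are relying on when comparing characters.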

Here we start actually assembling our dataset. We start by loading our three separate metadata dictionaries into a single variable called amalgamated_dicts. We then declare a list of terms that we treat as unacceptable character names (often camera directions or non-character narration), define the punctuation to strip from dialogue, and store our list of character entries in movie_character_dicts.

In [11]:

amalgamated_dicts={}
amalgamated_dicts.update(drama_movies)
amalgamated_dicts.update(crime_movies)
amalgamated_dicts.update(thriller_movies)

unacceptable_starters=["VOICE (cont'd)", "VOICE (CONT'D)", "VOICE OVER (CONT'D)", "VOICE OVER (cont'd)", "DISSOLVE", "CUT", "CUT TO", 'FADE', 'FADE OUT', 'FADE IN', 'PAN', 'CONTINUED', "CONT'D", '', ' ', "VOICE", "VOICE OVER", 'CUT TO', 'DISSOLVE TO', 'THE END', 'FADE TO BLACK', "DISSOLVE TO:", "CUT TO:", "FADE TO:"]

punct_remover=str.maketrans('','', '"#$%&()*+-/:;<=>?@[\\]^_`{|}~')

movie_character_dicts=read_scripts(amalgamated_dicts, unacceptable_starters, punct_remover)
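
It is worth seeing what `punct_remover` keeps, since it does not strip all punctuation. The sketch below applies the same translation table to an invented sentence and lists the marks from `string.punctuation` that survive. The `sample`, `cleaned`, and `kept` names are introduced only for this example.

```python
import string

# same table as above: delete these characters, map nothing else
removed_chars = '"#$%&()*+-/:;<=>?@[\\]^_`{|}~'
punct_remover = str.maketrans('', '', removed_chars)

sample = 'Don\'t move: "it\'s" (probably) a trap!'   # invented line
cleaned = sample.translate(punct_remover)

# punctuation marks that the table leaves in place
kept = [mark for mark in string.punctuation if mark not in set(removed_chars)]
print(cleaned)
print(kept)
```

Apostrophes, commas, periods, and exclamation marks are left in place, so contractions stay intact and sentence boundaries remain visible to the tokenizer.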

Below is a long list of miscellaneous fixes that use the combine_characters and remove_character functions defined above.

In [19]:
#random fixes

combine_characters(movie_character_dicts, "8MM", "AMY", "AMY'S VOICE")
combine_characters(movie_character_dicts, "8mm", "DINO VELVET", "DINO")
combine_characters(movie_character_dicts, "8mm", "DINO VELVET", "DINO VELVET VOICE")
combine_characters(movie_character_dicts, "8mm", "WELLES", "WELLES VOICE")
combine_characters(movie_character_dicts, "8mm", "WELLES", "WELLES' VOICE")
combine_characters(movie_character_dicts, "MANHATTAN MURDER MYSTERIES", "HELEN", "HELEN'S VOICE")
combine_characters(movie_character_dicts, 'The Black Dahlia', "CAPTAIN VASQUEZ", 'VASQUEZ')
combine_characters(movie_character_dicts, 'The Black Dahlia', "JOHNNY VOGEL", 'JOHNNY')
combine_characters(movie_character_dicts, 'The Black Dahlia', "JOHNNY VOGEL", 'VOGEL')
combine_characters(movie_character_dicts, "The Black Dahlia", "LEE BLANCHARD", "LEE")
combine_characters(movie_character_dicts, "The Black Dahlia", "Liz", "Elizabeth")
combine_characters(movie_character_dicts, 'The Black Dahlia', "ELLIS LOEW", 'LOEW')
combine_characters(movie_character_dicts, 'The Black Dahlia', "RUSS MILLARD", 'MILLARD')
combine_characters(movie_character_dicts, 'BASIC INSTINCT', "CAPTAIN TALCOTT", 'TALCOTT')
combine_characters(movie_character_dicts, 'BASIC INSTINCT', "CAPTAIN TALCOTT", 'CAPT. TALCOTT')
combine_characters(movie_character_dicts, 'Basic', "DUNBAR", 'DUN BAR')
combine_characters(movie_character_dicts, 'Basic', "MUELLER", 'MUE:LLER')
combine_characters(movie_character_dicts, 'Basic', "OSBORNE", 'OSB0RNE')
combine_characters(movie_character_dicts, 'The Girl With ', "GREGOR", 'GREGER')
combine_characters(movie_character_dicts, 'THE GIRL WITH THE DRAGON TATTOO', "GREGOR", 'GREGER')
combine_characters(movie_character_dicts, 'THE GIRL WITH THE DRAGON TATTOO', "BLOMKVIST", 'BLOMVIST')
combine_characters(movie_character_dicts, 'THE GIRL WITH THE DRAGON TATTOO', "HARRIET", 'HARRIE')
combine_characters(movie_character_dicts, 'THE GIRL WITH THE DRAGON TATTOO', "WENNERSTROM", 'WENNERSTROM ON TV')
combine_characters(movie_character_dicts, 'THE GIRL WITH THE DRAGON TATTOO', "VANGER", 'YOUNGER VANGER')
combine_characters(movie_character_dicts, 'Insomnia', 'WALTER', "WALTER'S VOICE")
combine_characters(movie_character_dicts, "Blood Simple", "MARTY", "MARTY'S VOICE")
combine_characters(movie_character_dicts, "Bonnie and Clyde", "BONNIE", "BONNIE'S VOICE")
combine_characters(movie_character_dicts, "Twin Peaks", "BOBBY", "BOB'S VOICE")
combine_characters(movie_character_dicts, "Klute", "TRASK", "TRASK'S VOICE")
combine_characters(movie_character_dicts, "Klute", "CABLE", "CABLE'S VOICE")
combine_characters(movie_character_dicts, "Brick", "LAURA", "LAURA'S VOICE")
combine_characters(movie_character_dicts, "Brick", "BRENDAN", "BRENDAN'S VOICE")
combine_characters(movie_character_dicts, "Charade", "REGGIE", "REGGIE'S VOICE")
combine_characters(movie_character_dicts, "Copycat", "DARYLL LEE", "DARYLL")
combine_characters(movie_character_dicts, "Sherlock Holmes", "SHERLOCK", "HOLMES")
combine_characters(movie_character_dicts, "Crank", "Verona", "Erona")
combine_characters(movie_character_dicts, "Blood Simple", "DOC MILES", "OC MILES")
combine_characters(movie_character_dicts, "Devil in a Blue Dress", "ALBRIGHT", "ALBRIGHT'S VOICE")
combine_characters(movie_character_dicts, "Blood Simple", "DAPHNE", "DAPHNE'S VOICE")
combine_characters(movie_character_dicts, "Anonymous", "ELIZABETH", "YOUNG ELIZABETH")
combine_characters(movie_character_dicts, "Anonymous", "OXFORD", "YOUNG OXFORD")
combine_characters(movie_character_dicts, "Anonymous", "ROBERT CECIL", "BOY ROBERT CECIL")
combine_characters(movie_character_dicts, 'Arctic Blue', "JESSICA", "ESSICA")

remove_character(movie_character_dicts, "BONNY & CLYDE")
remove_character(movie_character_dicts,"I")
remove_character(movie_character_dicts, "GRAMAM'S FEET")
remove_character(movie_character_dicts, "HOLMES POV")
remove_character(movie_character_dicts, "MRS. MULWRAY")
remove_character(movie_character_dicts,"C")
remove_character(movie_character_dicts,"S")
remove_character(movie_character_dicts,"T")
remove_character(movie_character_dicts,"H")
remove_character(movie_character_dicts,"A")
remove_character(movie_character_dicts,"I")
remove_character(movie_character_dicts,"VE")
remove_character(movie_character_dicts,"C")
remove_character(movie_character_dicts,"DARKMAN")
remove_character(movie_character_dicts,"161 PEYTON 161")
remove_character(movie_character_dicts,"THE DARKMAN")
remove_character(movie_character_dicts,"421 DARKMAN 421")


Finally, we create a dataframe from our list of dictionaries, stored in the local_dataframe variable. We then add columns for word tokens, mean emotion scores, and dialogue embeddings as described above.

In [14]:
local_dataframe=pd.DataFrame.from_dict(movie_character_dicts)

In [15]:
%%time
tokenize(local_dataframe, nlp)

Wall time: 1min 27s


In [21]:
%%time
all_emotions = [add_emotion_scores(entry) for entry in local_dataframe['token_list']]
#column order follows the order in which the emotions are listed for each word in the EmoLex file (assumed alphabetical)
emotion_columns = ['mean_anger', 'mean_anticipation', 'mean_disgust', 'mean_fear', 'mean_joy', 'mean_negative', 'mean_positive', 'mean_sadness', 'mean_surprise', 'mean_trust']
#transpose so that each row holds one emotion's scores across all characters
for column, emotion_scores in zip(emotion_columns, np.array(all_emotions).T):
    local_dataframe[column] = emotion_scores

Wall time: 53.6 s
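
The step above turns a list with one row of scores per character into one column per emotion, which is a transpose. The sketch below shows the same reshaping on invented numbers using plain Python. The three emotion names are a shortened, illustrative subset.

```python
# invented scores: one row per character, one value per emotion
per_character = [
    [0.10, 0.20, 0.30],   # character 0: anger, fear, joy
    [0.40, 0.50, 0.60],   # character 1: anger, fear, joy
]
emotion_names = ["mean_anger", "mean_fear", "mean_joy"]

# zip(*rows) transposes the nested list: one tuple per emotion
per_emotion = [list(column) for column in zip(*per_character)]
columns = dict(zip(emotion_names, per_emotion))
print(columns)
```

The column names are matched to scores purely by position, so the list of names has to follow the same emotion order as the scores themselves.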


In [22]:
%%time
vector_length=nlp.vocab.vectors_length
all_embeddings=[add_embeddings(entry, vector_length) for entry in local_dataframe['token_list']]
for i in range(vector_length):
    local_dataframe["embedding_"+str(i)]=[row[i] for row in all_embeddings]

Wall time: 1.67 s


Finally, local_dataframe is written to the mystery_movie_data.csv file in the notebook's working directory, without the index column.

In [None]:
local_dataframe.to_csv("mystery_movie_data.csv", index=False)