# Refining My Disney Data
This code will accomplish several things, including:
* A clear outline of what the Disney script corpus includes
* Separating out movies I do not wish to analyze
* Identifying song lyrics vs dialogue
* Identifying speaker gender
* Identifying role: Protagonist or Antagonist
* Adding Moana

In [1]:
#import necessary modules
import numpy as np
import pandas as pd

In [2]:
#import csv as a dataframe
disney = pd.read_csv(r'C:\Users\cassi\Desktop\Disney_Corpus.csv')

In [3]:
disney.head()

Unnamed: 0,Disney_Period,Text,Speaker_Status,Movie,Speaker,Year,UTTERANCE_NUMBER
0,EARLY,slave in the magic mirror come from the farthe...,NON-P,Snow White,queen,1937,1
1,EARLY,"what wouldst thou know, my queen ?",NON-P,Snow White,mirror,1937,2
2,EARLY,"magic mirror on the wall, who is the fairest o...",NON-P,Snow White,queen,1937,3
3,EARLY,"famed is thy beauty, majesty. but hold, a love...",NON-P,Snow White,mirror,1937,4
4,EARLY,alas for her ! reveal her name.,NON-P,Snow White,queen,1937,5


In [4]:
disney.tail()

Unnamed: 0,Disney_Period,Text,Speaker_Status,Movie,Speaker,Year,UTTERANCE_NUMBER
7743,LATE,we are never closing them again.,PRINCESS,Frozen,elsa,2013,984
7744,LATE,form on anna's boots.,PRINCESS,Frozen,elsa,2013,985
7745,LATE,"what? oh, elsa, they're beautiful, but you kno...",PRINCESS,Frozen,anna,2013,986
7746,LATE,look out. reindeer coming through!,NON-P,Frozen,kristoff,2013,987
7747,LATE,that's it. glide and pivot and glide and pivot.,NON-P,Frozen,olaf,2013,988


In [5]:
disney.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7748 entries, 0 to 7747
Data columns (total 7 columns):
Disney_Period       7748 non-null object
Text                7748 non-null object
Speaker_Status      7748 non-null object
Movie               7748 non-null object
Speaker             7748 non-null object
Year                7748 non-null int64
UTTERANCE_NUMBER    7748 non-null int64
dtypes: int64(2), object(5)
memory usage: 272.4+ KB


There are no null objects here! This is good, but I need to double check that there aren't any typos in the categories.

1) What Periods are included?

In [7]:
disney.Disney_Period.value_counts()

MID      4155
LATE     2268
EARLY    1325
Name: Disney_Period, dtype: int64

Okay, three categories: MID, LATE, EARLY. No typos.

2) What status can a speaker have?

In [8]:
disney.Speaker_Status.value_counts()

NON-P       4925
PRINCESS    1793
PRINCE      1030
Name: Speaker_Status, dtype: int64

Looks like characters can be a princess, prince, or non-princess. This does not account for gender or for role in the story. Gender and Role columns will need to be added.

3) What speakers are there?

In [9]:
disney.Speaker.value_counts()

anna                                    335
simba                                   241
aladdin                                 238
merida                                  181
tiana                                   176
kristoff                                159
belle                                   150
pocahontas                              142
john smith                              125
prince naveen                           125
jasmine                                 117
mulan                                   116
mushu                                   114
scar                                    114
genie                                   113
cinderella                              112
timon                                   112
elsa                                    111
jafar                                   111
flora                                   106
queen elinor                            103
olaf                                     93
beast                           

In [10]:
disney.Speaker.describe()

count     7748
unique     424
top       anna
freq       335
Name: Speaker, dtype: object

424 unique speakers. Yikes. These might be easier to go through movie by movie, where character names will be easier to parse through. Also, look! A typo! "Sweep again and by then" is a lyric Rapunzel sings, not a speaker!
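
A rough sketch of what going movie by movie could look like, using a tiny made-up frame instead of the real corpus. The sample rows, the `suspects` name, and the four-word cutoff are all assumptions for illustration; a long "speaker" is only a hint that a lyric or line may have slipped into the Speaker column, not proof.

```python
import pandas as pd

# Toy stand-in for the corpus (hypothetical rows, same column names)
sample = pd.DataFrame({
    "Movie": ["Tangled", "Tangled", "Tangled", "Frozen"],
    "Speaker": ["rapunzel", "flynn", "sweep again and by then", "anna"],
})

# Number of distinct speakers in each film
unique_per_movie = sample.groupby("Movie")["Speaker"].nunique()
print(unique_per_movie)

# Speaker labels with many words are worth a manual look
# (the cutoff of 4 words is an arbitrary assumption)
word_counts = sample.Speaker.str.split().str.len()
suspects = sample[word_counts >= 4]
print(suspects)
```

On the real data, the same two steps would run on `disney` instead of `sample`.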

4) What Years are included?

In [11]:
disney.Year.value_counts()

2013    988
1994    952
1992    842
1991    772
2009    676
1995    638
1998    554
1950    497
1959    462
2012    411
1989    397
1937    366
2010    193
Name: Year, dtype: int64

In [12]:
disney.Movie.value_counts()

Frozen                        988
The Lion King                 952
Aladdin                       842
Beauty and the Beast          772
The Princess and the Frog     676
Pocahontas                    638
Mulan                         554
Cinderella                    497
Sleeping Beauty               462
Brave                         411
The Little Mermaid            397
Snow White                    366
Tangled                       193
Name: Movie, dtype: int64

These columns look good, and all their counts line up! Sweet! (These counts should line up with the utterance numbers for each film)
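
The "counts should line up with the utterance numbers" idea can be checked directly rather than by eye. This is a minimal sketch on a toy frame (the rows are made up; only the column names come from the corpus): for each film, the number of rows should equal the highest utterance number.

```python
import pandas as pd

# Toy stand-in for the corpus (hypothetical rows)
sample = pd.DataFrame({
    "Movie": ["Tangled"] * 3 + ["Brave"] * 2,
    "UTTERANCE_NUMBER": [1, 2, 3, 1, 2],
})

# Rows per film next to the largest utterance number in that film
per_movie = sample.groupby("Movie")["UTTERANCE_NUMBER"].agg(["count", "max"])
per_movie["lines_up"] = per_movie["count"] == per_movie["max"]
print(per_movie)
```

A film with `lines_up` equal to False would have skipped or repeated utterance numbers.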

In [14]:
disney.UTTERANCE_NUMBER.value_counts()

4      13
186    13
150    13
154    13
158    13
162    13
166    13
170    13
174    13
178    13
182    13
190    13
191    13
68     13
64     13
60     13
56     13
52     13
48     13
44     13
40     13
36     13
146    13
142    13
138    13
134    13
58     13
62     13
66     13
144    13
       ..
974     1
978     1
982     1
955     1
954     1
959     1
963     1
967     1
971     1
975     1
979     1
958     1
983     1
960     1
988     1
964     1
968     1
972     1
976     1
980     1
984     1
953     1
981     1
957     1
961     1
965     1
969     1
973     1
977     1
985     1
Name: UTTERANCE_NUMBER, Length: 988, dtype: int64

In [25]:
disney_tangled = disney.loc[disney.Movie == 'Tangled']

In [26]:
disney_tangled.head()

Unnamed: 0,Disney_Period,Text,Speaker_Status,Movie,Speaker,Year,UTTERANCE_NUMBER


In [22]:
for m in disney.Movie.value_counts():
    print(m)

988
952
842
772
676
638
554
497
462
411
397
366
193


In [27]:
disney.columns

Index(['Disney_Period', 'Text', 'Speaker_Status', 'Movie', 'Speaker', 'Year',
       'UTTERANCE_NUMBER'],
      dtype='object')

In [34]:
for m in disney.Movie[:10]:
    print('hi'+m+'hi')

hiSnow White hi
hiSnow White hi
hiSnow White hi
hiSnow White hi
hiSnow White hi
hiSnow White hi
hiSnow White hi
hiSnow White hi
hiSnow White hi
hiSnow White hi


HAHA! So there's a ghost space at the end of each movie name that's messing things up. (This was actually in all entries, I just didn't want to flash all of it.) Okay, let's fix that.

In [43]:
disney.Movie = disney.Movie.str.strip()

In [44]:
for m in disney.Movie[:10]:
    print('hi'+m+'hi')

hiSnow Whitehi
hiSnow Whitehi
hiSnow Whitehi
hiSnow Whitehi
hiSnow Whitehi
hiSnow Whitehi
hiSnow Whitehi
hiSnow Whitehi
hiSnow Whitehi
hiSnow Whitehi


Yay! Fixed! (I hope)
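
To turn "I hope" into a check, and to catch the same ghost space in other text columns, something like the following could work. It is only a sketch on a toy frame: the sample rows are invented, and whether the other columns of the real corpus carry stray spaces is an assumption I have not verified.

```python
import pandas as pd

# Toy stand-in for the corpus (hypothetical rows with stray spaces)
sample = pd.DataFrame({
    "Movie": ["Snow White ", "Snow White ", "Frozen "],
    "Speaker": ["queen", " mirror", "elsa "],
    "Year": [1937, 1937, 2013],
})

# Text columns to clean; listed by hand so numeric columns are left alone
text_cols = ["Movie", "Speaker"]
for col in text_cols:
    sample[col] = sample[col].str.strip()

# Count values that still differ from their stripped form (0 means clean)
leftover = {col: int((sample[col] != sample[col].str.strip()).sum())
            for col in text_cols}
print(leftover)
```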

In [45]:
disney_frozen = disney[disney.Movie == 'Frozen']

In [46]:
disney_frozen.head()

Unnamed: 0,Disney_Period,Text,Speaker_Status,Movie,Speaker,Year,UTTERANCE_NUMBER
6760,LATE,born of cold and winter air and mountain rain ...,NON-P,Frozen,ice harvesters,2013,1
6761,LATE,", and his reindeer calf, sven, share a carrot ...",NON-P,Frozen,ice harvesters,2013,2
6762,LATE,ice harvesters hup! ho! watch your step! let i...,NON-P,Frozen,ice harvesters,2013,3
6763,LATE,"ice harvesters stronger than one, stronger tha...",NON-P,Frozen,ice harvesters,2013,4
6764,LATE,ice harvesters born of cold and winter air and...,NON-P,Frozen,ice harvesters,2013,5


In [47]:
disney_frozen.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 988 entries, 6760 to 7747
Data columns (total 7 columns):
Disney_Period       988 non-null object
Text                988 non-null object
Speaker_Status      988 non-null object
Movie               988 non-null object
Speaker             988 non-null object
Year                988 non-null int64
UTTERANCE_NUMBER    988 non-null int64
dtypes: int64(2), object(5)
memory usage: 42.5+ KB


In [56]:
#Checking that utterance numbers are in order
init = 1
if disney_frozen.UTTERANCE_NUMBER.iloc[0] == init:
    print(True)
ordered = []
#loop over positions 1..n-1 so that iloc always gets a valid positional index
for i in range(1, len(disney_frozen)):
    if disney_frozen.UTTERANCE_NUMBER.iloc[i] == (disney_frozen.UTTERANCE_NUMBER.iloc[i-1] + 1):
        ordered.append(True)
    else:
        ordered.append(False)


True

In [51]:
disney_frozen.UTTERANCE_NUMBER.iloc[0]

1
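
The ordering check can also be written without any positional indexing, which avoids out-of-bounds mistakes entirely. A minimal sketch: `is_consecutive` is a hypothetical helper name, and the toy Series just mimics the Frozen index labels.

```python
import pandas as pd

def is_consecutive(numbers: pd.Series, start: int = 1) -> bool:
    """True if the values begin at `start` and each one is the previous plus 1."""
    if numbers.empty:
        return True
    starts_right = numbers.iloc[0] == start
    # diff() gives the step between neighbours; the first entry is NaN, so skip it
    steps_right = (numbers.diff().iloc[1:] == 1).all()
    return bool(starts_right and steps_right)

# Toy stand-in for disney_frozen.UTTERANCE_NUMBER (index labels start at 6760)
frozen_like = pd.Series([1, 2, 3, 4, 5], index=[6760, 6761, 6762, 6763, 6764])
print(is_consecutive(frozen_like))
print(is_consecutive(pd.Series([1, 2, 4])))
```

On the real data this would be called as `is_consecutive(disney_frozen.UTTERANCE_NUMBER)`.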

## Song Lyrics
How are song lyrics treated in this data set? Is there an easy way to separate them out? The opening of Frozen is a song, so let's see what we have there

In [57]:
disney_frozen.Text.iloc[:10]

6760    born of cold and winter air and mountain rain ...
6761    , and his reindeer calf, sven, share a carrot ...
6762    ice harvesters hup! ho! watch your step! let i...
6763    ice harvesters stronger than one, stronger tha...
6764    ice harvesters born of cold and winter air and...
6765    ice harvesters this icy force both foul and fa...
6766               ext. the kingdom of arendelle — night 
6767    sleeps in her bed. her little sister anna: (5)...
6768    elsa. psst. elsa! psst. elsa doesn't stir. ann...
6769                          wake up. wake up. wake up. 
Name: Text, dtype: object

In [58]:
disney_frozen.Text.iloc[0]

'born of cold and winter air and mountain rain combining, this icy force both foul and fair has a frozen heart worth mining. the men drag giant ice blocks through channels of water. ice harvesters cut through the heart, cold and clear. strike for love and strike for fear. see the beauty sharp and sheer. split the ice apart! and break the frozen heart. hup! ho! watch your step! let it go! '

In [59]:
disney_frozen.Text.iloc[1]

', and his reindeer calf, sven, share a carrot as they try to keep up with the men.'

In [60]:
for line in disney_frozen.Text[:10]:
    print(line)

born of cold and winter air and mountain rain combining, this icy force both foul and fair has a frozen heart worth mining. the men drag giant ice blocks through channels of water. ice harvesters cut through the heart, cold and clear. strike for love and strike for fear. see the beauty sharp and sheer. split the ice apart! and break the frozen heart. hup! ho! watch your step! let it go! 
, and his reindeer calf, sven, share a carrot as they try to keep up with the men.
ice harvesters hup! ho! watch your step! let it go! kristoff struggles to get a block of ice out of the water. he fails, ends up soaked. sven licks his wet cheek. ice harvesters beautiful! powerful! dangerous! cold! ice has a magic can't be controlled. 
ice harvesters stronger than one, stronger than ten stronger than a hundred men! 
ice harvesters born of cold and winter air and mountain rain combining 
ice harvesters this icy force both foul and fair has a frozen heart worth mining. cut through the heart, cold and clea

WOW! A closer look at this data very quickly reveals that the utterances aren't really utterances--they're parts of a script, including scene headers!

In [63]:
not_line = [i for i in disney_frozen.Text if 'ext.' in i]

In [64]:
len(not_line)

4

In [66]:
print(not_line)

['ext. the kingdom of arendelle — night ', 'faster, sven! ext. the valley of the living rock — night ', 'ext. mountain forest clearing — day', "olaf oh, look at that. i've been impaled. he laughs it off. ext. steep mountain face — day"]
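
A plain `'ext.' in i` test would also match words such as "next." or "text.", so a stricter pattern may be safer. The sketch below uses made-up lines modeled on the ones above; including `int.` for interior scenes is an assumption, since only `ext.` headers have shown up so far.

```python
import pandas as pd

# Toy lines modeled on the Frozen rows (hypothetical text)
sample = pd.DataFrame({"Text": [
    "ext. the kingdom of arendelle ",
    "faster, sven! ext. the valley of the living rock ",
    "wake up. wake up. wake up. ",
    "next. please step forward.",
]})

# \b keeps "next." from matching; "int." is an assumed second marker
has_header = sample.Text.str.contains(r"\b(?:ext|int)\.", regex=True)

# Rows that begin with a scene header contain no dialogue before it
starts_with_header = sample.Text.str.match(r"\s*(?:ext|int)\.")

print(sample[has_header])
print(sample[starts_with_header])
```

Rows where `has_header` is True but `starts_with_header` is False mix dialogue with a header, so they would need splitting rather than dropping.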


Is this true for all the movies?

In [67]:
disney_sb = disney[disney.Movie == 'Sleeping Beauty']
disney_sb.head()

Unnamed: 0,Disney_Period,Text,Speaker_Status,Movie,Speaker,Year,UTTERANCE_NUMBER
863,EARLY,"in a far away land, long ago, lived a king and...",NON-P,Sleeping Beauty,narrator,1959,1
864,EARLY,"joyfully now to our princess we come, bringing...",NON-P,Sleeping Beauty,choir,1959,2
865,EARLY,thus on this great and joyous day did all the ...,NON-P,Sleeping Beauty,narrator,1959,3
866,EARLY,"their royal highnesses, king hubert and prince...",NON-P,Sleeping Beauty,announcer,1959,4
867,EARLY,fondly had these monarchs dreamed one day thei...,NON-P,Sleeping Beauty,narrator,1959,5


In [68]:
disney_sb.tail()

Unnamed: 0,Disney_Period,Text,Speaker_Status,Movie,Speaker,Year,UTTERANCE_NUMBER
1320,EARLY,"i know you, i walked with you once upon a dream",NON-P,Sleeping Beauty,choir,1959,458
1321,EARLY,blue!,NON-P,Sleeping Beauty,merryweather,1959,459
1322,EARLY,"i know you, the gleam in your eyes is so famil...",NON-P,Sleeping Beauty,choir,1959,460
1323,EARLY,and i know it's true that visions are seldom a...,NON-P,Sleeping Beauty,choir,1959,461
1324,EARLY,you'll love me at once the way you did once up...,NON-P,Sleeping Beauty,choir,1959,462


In [73]:
disney_sb.Text.iloc[175:225]

1038                               but i wanted it blue. 
1039           now, dear, we decided pink was her color. 
1040                                        you decided! 
1041             two eggs, fold in gently fold? oh well. 
1042                                    i can't breathe! 
1043                                     it looks awful. 
1044                   that's because it's on you, dear. 
1045                            now yeast, one tsp. tsp? 
1046                                       one teaspoon! 
1047                            one teaspoon, of course. 
1048                oh gracious how the child has grown. 
1049    oh, it seems only yesterday we brought her here. 
1050                                   just a tiny baby. 
1051                                   why merryweather! 
1052                        whatever's the matter, dear? 
1053    after the day she'll be a princess, and we won...
1054                                           oh flora! 
1055          

In [74]:
disney_sb.Text[1059]

"good gracious, we're acting like a lot of ninnies! come on, she'll be back before we get started. "

At first glance, Sleeping Beauty doesn't appear to have screenplay annotations like Frozen. So the accuracy of this data varies a LOT from film to film. However, it looks like song lyrics are included! These will have to be marked up.

# Moana
I have to create my own csv file for Moana. I don't need to include Movie, Year, or Disney_Period labels for this one because I can easily add that to a dataframe! But, I will initially include gender, role, and if something is a song or not!

In [109]:
moana = pd.read_csv(r'C:\Users\cassi\Desktop\moana.csv')

In [110]:
moana.head()

Unnamed: 0,Text,Speaker,Song
0,"in the beginning, there was only ocean until t...",tala,D
1,"Whoa, whoa, whoa! mother, that's enough.",tui,D
2,papa!,young moana,D
3,No one goes outside the reef. We are safe here...,tui,D
4,Monsters! Monsters! Monsters!,children,D


In [111]:
moana.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402 entries, 0 to 401
Data columns (total 3 columns):
Text       402 non-null object
Speaker    382 non-null object
Song       382 non-null object
dtypes: object(3)
memory usage: 4.8+ KB


In [112]:
moana[moana.Text == 'scene break'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19 entries, 12 to 393
Data columns (total 3 columns):
Text       19 non-null object
Speaker    0 non-null object
Song       0 non-null object
dtypes: object(3)
memory usage: 380.0+ bytes


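I've marked 19 scene breaks in my data, which explains 19 of the missing song/speaker entries. But 402 - 382 = 20 entries are missing, so one more row without a speaker or song still needs to be tracked down.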
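
One way to confirm that missing Speaker/Song values sit only on scene-break rows is to compare the two masks and list any rows that break the pattern. This is a sketch on invented rows; the `stray` name and the sample text are assumptions.

```python
import pandas as pd

# Toy stand-in for moana.csv (hypothetical rows; one stray row on purpose)
sample = pd.DataFrame({
    "Text": ["papa!", "scene break", "Monsters!", "you're welcome"],
    "Speaker": ["young moana", None, "children", None],
    "Song": ["D", None, "D", None],
})

is_break = sample.Text == "scene break"
missing = sample.Speaker.isna() | sample.Song.isna()

# Rows with a missing label that are NOT scene breaks need a manual fix
stray = sample[missing & ~is_break]
print(stray)

# Scene breaks that unexpectedly carry a label would show up here
labeled_breaks = sample[is_break & ~missing]
print(len(labeled_breaks))
```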

## Marking Gender, Speaker_Status, and Role
### Gender

In [113]:
moana.Speaker.describe()

count       382
unique       20
top       moana
freq        167
Name: Speaker, dtype: object

In [114]:
moana.Speaker.value_counts()

moana                167
maui                 108
tui                   36
tala                  24
tamatoa                9
sina                   8
young moana            5
male villager 3        5
chorus                 4
children               3
female villager 1      2
male villager 2        2
male villager 1        2
male villager 7        1
male villager 5        1
male villager 6        1
male villager 8        1
old male villager      1
female villager        1
male villager 4        1
Name: Speaker, dtype: int64

In [115]:
female = ['moana', 'tala', 'sina', 'young moana', 'female villager 1', 'female villager']
male = ['maui', 'tui', 'tamatoa', 'male villager 1', 'male villager 2', 'male villager 3', 'male villager 4',
       'male villager 5', 'male villager 6', 'male villager 7', 'male villager 8', 'old male villager']
neutral = ['children', 'chorus']
len(female)+len(male)+len(neutral)

20
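
Adding up the list lengths only confirms the count, not the spelling: a misspelled name still counts as one entry. Comparing sets catches that. A minimal sketch with shortened lists, where one name is deliberately misspelled for illustration:

```python
import pandas as pd

# Toy stand-in for moana.Speaker (None mimics a scene-break row)
speakers = pd.Series(["moana", "maui", "tamatoa", "children", None])

female = ["moana"]
male = ["maui", "tomatoa"]  # deliberately misspelled to show the check
neutral = ["children"]

labeled = set(female) | set(male) | set(neutral)
actual = set(speakers.dropna().unique())

unlabeled = sorted(actual - labeled)  # speakers that would fall through to 'n'
unknown = sorted(labeled - actual)    # list entries that match no speaker

print(unlabeled)
print(unknown)
```

Both lists being empty means every speaker is labeled exactly as written in the data.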

In [116]:
#for x in moana.Speaker.values:
#    if x in female: moana["Gender"] = 'f'
#    elif x in male: moana["Gender"] = 'm'
#    else: moana["Gender"] = 'n'
def whichgen(name):
    if name in female: return 'f'
    elif name in male: return 'm'
    else: return 'n'
moana["Gender"] = moana.Speaker.map(whichgen)

In [117]:
moana.head()

Unnamed: 0,Text,Speaker,Song,Gender
0,"in the beginning, there was only ocean until t...",tala,D,f
1,"Whoa, whoa, whoa! mother, that's enough.",tui,D,m
2,papa!,young moana,D,f
3,No one goes outside the reef. We are safe here...,tui,D,m
4,Monsters! Monsters! Monsters!,children,D,n


### Speaker_Status
In this film, the only princess is Moana (including young Moana), and everyone else is a NON-P.

In [118]:
def whichstat(name):
    if name == 'moana' or name == 'young moana': return 'PRINCESS'
    else: return 'NON-P'

moana['Speaker_Status'] = moana.Speaker.map(whichstat)

In [119]:
moana.head()

Unnamed: 0,Text,Speaker,Song,Gender,Speaker_Status
0,"in the beginning, there was only ocean until t...",tala,D,f,NON-P
1,"Whoa, whoa, whoa! mother, that's enough.",tui,D,m,NON-P
2,papa!,young moana,D,f,PRINCESS
3,No one goes outside the reef. We are safe here...,tui,D,m,NON-P
4,Monsters! Monsters! Monsters!,children,D,n,NON-P


### Role
Is the speaker a protagonist, antagonist, helper, or neutral? I'm listing Tui, Moana's father, as an antagonist along with Tamatoa, since he prevents her from going on her journey at the start of the film. Moana and Maui are going to be listed as protagonists, while Tala and Sina can be listed as helpers who aid Moana and Maui on their journey. I may refine this later on (Maui is arguably a helper, and towards the beginning he acts more like an antagonist). That will be an easy edit, as long as I have editable code down.

In [125]:
pro = ['young moana','moana', 'maui']
ant = ['tui', 'tamatoa']
helper = ['sina', 'tala']


In [126]:
def whichrole(name):
    if name in pro: return 'PRO'
    if name in ant: return 'ANT'
    if name in helper: return 'HELPER'
    else: return "N" #for neutral

In [127]:
moana['Role'] = moana.Speaker.map(whichrole)

In [128]:
moana.head()

Unnamed: 0,Text,Speaker,Song,Gender,Speaker_Status,Role
0,"in the beginning, there was only ocean until t...",tala,D,f,NON-P,HELPER
1,"Whoa, whoa, whoa! mother, that's enough.",tui,D,m,NON-P,ANT
2,papa!,young moana,D,f,PRINCESS,PRO
3,No one goes outside the reef. We are safe here...,tui,D,m,NON-P,ANT
4,Monsters! Monsters! Monsters!,children,D,n,NON-P,N
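
If the roles do get refined later, keeping them in a single dictionary might make the edit a one-line change. This is only an alternative sketch, not a replacement for `whichrole`; the `role_of` name is made up, and the labels are copied from the lists above.

```python
import pandas as pd

# One place to edit roles; anything not listed falls back to "N" (neutral)
role_of = {
    "moana": "PRO", "young moana": "PRO", "maui": "PRO",
    "tui": "ANT", "tamatoa": "ANT",
    "sina": "HELPER", "tala": "HELPER",
}

# Toy stand-in for moana.Speaker (None mimics a scene-break row)
speakers = pd.Series(["tala", "tui", "young moana", "children", None])

# map() leaves unlisted or missing speakers as NaN, which fillna turns into "N"
roles = speakers.map(role_of).fillna("N")
print(roles.tolist())
```

Moving Maui to the helpers would then mean changing `"maui": "PRO"` to `"maui": "HELPER"`.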


## Adding pre-existing columns
Cool! Now, let's annotate this Moana data with data from the original corpus!
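
A rough sketch of how that annotation could go, on one-row toy frames. Several things here are assumptions: labeling Moana as `LATE` (it came out in 2016, after Frozen), numbering every row including scene breaks, and the names `moana_sample`, `corpus_sample`, and `combined`.

```python
import pandas as pd

# Toy stand-ins for the two frames (hypothetical rows)
moana_sample = pd.DataFrame({
    "Text": ["papa!"],
    "Speaker": ["young moana"],
    "Speaker_Status": ["PRINCESS"],
})
corpus_sample = pd.DataFrame({
    "Disney_Period": ["LATE"], "Text": ["wake up."],
    "Speaker_Status": ["PRINCESS"], "Movie": ["Frozen"],
    "Speaker": ["anna"], "Year": [2013], "UTTERANCE_NUMBER": [1],
})

# Constant columns the original corpus has but the Moana file lacks
moana_sample = moana_sample.assign(Movie="Moana", Year=2016, Disney_Period="LATE")

# Number the rows 1..n in file order (assumes scene breaks are counted too)
moana_sample["UTTERANCE_NUMBER"] = range(1, len(moana_sample) + 1)

# Stack the frames; columns present in only one frame are filled with NaN
combined = pd.concat([corpus_sample, moana_sample], ignore_index=True, sort=False)
print(combined)
```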