# Dreamworks Films DataFrames
In the following code, I will attempt to streamline the method I used for Shrek and apply it to the other Dreamworks films. After creating individual dataframes for each movie, I'll combine them into one dataframe.

The movies included are:
* Antz
* The Croods
* How to Train Your Dragon
* How to Train Your Dragon 2
* Kung Fu Panda
* Megamind
* Rise of the Guardians
* Shrek the Third

In [1]:
import glob
import numpy as np
import pandas as pd
import re

First, we need to check the formatting in each file to see whether it matches the format of the Shrek script or whether we need to modify the code. Let's start off with Antz.

In [2]:
#importing the file (the with-block closes it automatically, even if read() fails)
with open(r'C:\Users\cassi\Desktop\Data_Science\Animated-Movie-Gendered-Dialogue\private\imsdb_raw_nov_2015\Animation\antz.txt') as antz:
    antz_script = antz.read()
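
Since the goal is to repeat this for eight films, here is a minimal sketch of how the loading step might be wrapped up with `glob`. The helper names `read_script` and `read_all_scripts` are my own placeholders, and I'm assuming every script is a `.txt` file sitting in one folder.

```python
import glob
import os


def read_script(path):
    """Return the full text of one script file."""
    # The context manager closes the file even if read() raises.
    with open(path) as f:
        return f.read()


def read_all_scripts(folder):
    """Map each .txt file name (minus the extension) to its text."""
    scripts = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        name = os.path.splitext(os.path.basename(path))[0]
        scripts[name] = read_script(path)
    return scripts
```

With something like this, `read_all_scripts(folder)["antz"]` would play the role of `antz_script`, assuming the file is named `antz.txt`.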

In [3]:
%pprint

Pretty printing has been turned OFF


In [52]:
print(antz_script[:850])

"Antz", unknown draft

A N T Z

CHARACTERS                                            VOICES

     "Z"...............................................WOODY ALLEN

     "WASP #1".........................................DAN AYKROYD

     "WASP #2".........................................JANE CURTIN

     "GEN. FORMICA"...................................DANNY GLOVER

     "MANDIBLE".......................................GENE HACKMAN

     "AZTECA".......................................JENNIFER LOPEZ

     "DRUNK SCOUT"....................................JOHN MAHONEY

     "WEAVER"...................................SYLVESTER STALLONE

     "PRINCESS BALA"..................................SHARON STONE

     "QUEEN"..........................................MERYL STREEP

     "CARPENTER"................................CHRISTOPHER WALKEN

        


In [9]:
print(antz_script[850:2500])

                 Z (O.S.)

                      (over a dark screen)

               All my life, I've lived and worked in

               the big city...

     We see:

     EXT. AN ANT MOUND - DAY

     The camera swoops towards the entrance, then dives inside,

     past a couple of tough-looking soldier ants who stand at the

     gates of the ant colony like insect bouncers...into an access

     tunnel that snakes this way and that, past a row of ants

     plodding along...

     ...and into the MAIN CHAMBER of the colony, a huge, teeming

     vista that seems to stretch away forever, filled with ants

     rushing here and there on their business.  We see -- a

     "traffic cop" directing foot traffic, waving his arms like

     crazy so both sides move at once -- a column of 

soldier ants

     marching along in formation -- a chain of ants letting down

     a 

matchbox

 elevator filled with workers.

                         Z (V.O.)

               ...which is kind of

Interesting...Antz has a character list (for the main characters at least). Also, one of the characters is simply named "Z", which poses a problem. In the Shrek script, we could rely on character names being longer than two letters. Let's try what we used with Shrek. But first, we'll need to cut off the beginning of the script.

In [17]:
antz_script[840:1000]

"\n\n                         Z (O.S.)\n\n                      (over a dark screen)\n\n               All my life, I've lived and worked in\n\n               the big ci"

In [21]:
antz_script2 = antz_script[840:]

In [23]:
antz_script2[:100]

'\n\n                         Z (O.S.)\n\n                      (over a dark screen)\n\n               All '

Okay, now let's count the white space and see how it's used.

In [57]:
white_space = re.findall(" {3,}", antz_script2)
len(white_space)

3376

In [58]:
print(white_space[:10])

['                         ', '                      ', '               ', '               ', '     ', '     ', '     ', '     ', '     ', '     ']


In [59]:
len_w_s = [len(x) for x in white_space]
print(len_w_s[:50])
print(set(len_w_s))

[25, 22, 15, 15, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 25, 15, 15, 15, 5, 5, 5, 25, 15, 15, 15, 15, 15, 15, 15, 22, 15, 15, 15, 15, 15, 15, 15, 15, 15, 25, 22, 15, 15, 5, 5, 5]
{3, 5, 46, 15, 22, 23, 25}
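
The set above shows which run lengths exist, but not how often each one occurs. A small sketch with `collections.Counter` gives the frequencies too. The function name `whitespace_run_lengths` and the sample string are placeholders of mine, built to mimic the first few lines of the Antz script.

```python
import re
from collections import Counter


def whitespace_run_lengths(text, min_len=3):
    """Count how often each run of min_len or more spaces occurs."""
    runs = re.findall(r" {%d,}" % min_len, text)
    return Counter(len(run) for run in runs)


# A tiny stand-in for the start of the script.
sample = (
    "\n\n" + " " * 25 + "Z (O.S.)\n\n"
    + " " * 22 + "(over a dark screen)\n\n"
    + " " * 15 + "All my life, I've lived and worked in\n\n"
    + " " * 15 + "the big city...\n\n"
    + " " * 5 + "We see:\n"
)

counts = whitespace_run_lengths(sample)
print(sorted(counts.items()))  # [(5, 1), (15, 2), (22, 1), (25, 1)]
```

Rare lengths (like 23 or 46 here) are the ones worth inspecting by hand, since they are likely one-off formatting quirks or transitions.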


In [122]:
antz_script2[:5000]
#25 before a character name
#22 after a character name and before a parenthetical
#15 after a parenthetical and before a line
#15 between lines
#5 after a line is done and before scene headers
#5 between scene headers
#25 before a character's name
#15 between a character's name and their line
#5 after the end of a character's line and before a scene header

'\n\n                         Z (O.S.)\n\n                      (over a dark screen)\n\n               All my life, I\'ve lived and worked in\n\n               the big city...\n\n     We see:\n\n     EXT. AN ANT MOUND - DAY\n\n     The camera swoops towards the entrance, then dives inside,\n\n     past a couple of tough-looking soldier ants who stand at the\n\n     gates of the ant colony like insect bouncers...into an access\n\n     tunnel that snakes this way and that, past a row of ants\n\n     plodding along...\n\n     ...and into the MAIN CHAMBER of the colony, a huge, teeming\n\n     vista that seems to stretch away forever, filled with ants\n\n     rushing here and there on their business.  We see -- a\n\n     "traffic cop" directing foot traffic, waving his arms like\n\n     crazy so both sides move at once -- a column of \n\nsoldier ants\n\n     marching along in formation -- a chain of ants letting down\n\n     a \n\nmatchbox\n\n elevator filled with workers.\n\n             
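
The indentation notes in the comments above can be turned into a lookup table. This is only a sketch under the assumption that the indent widths really are consistent throughout the script. The names `ANTZ_INDENT_ROLES` and `classify_line` and the role labels are mine.

```python
# Indent width -> role, taken from the notes above (Antz only).
ANTZ_INDENT_ROLES = {
    25: "speaker",
    22: "parenthetical",
    15: "dialogue",
    5: "scene_text",  # scene headers and action description
}


def classify_line(line, roles=ANTZ_INDENT_ROLES):
    """Label one script line by the number of leading spaces."""
    indent = len(line) - len(line.lstrip(" "))
    return roles.get(indent, "other")


sample_lines = [
    " " * 25 + "Z (O.S.)",
    " " * 22 + "(over a dark screen)",
    " " * 15 + "All my life, I've lived and worked in",
    " " * 5 + "EXT. AN ANT MOUND - DAY",
    "soldier ants",
]
labels = [classify_line(line) for line in sample_lines]
print(labels)
```

Lines that fall through to "other" (such as the unindented "soldier ants" fragment) would still need a manual look.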

In [130]:
test_line = "\n\n                         Z\n\n"
len(test_line)
len(test_line[2:-3])

25

In [153]:
test = re.findall(r"(\b[A-Z]['A-Z ]+[0-9]?)+", antz_script2)

In [154]:
len(test)

1346

In [136]:
test[:20]

['Z ', "I'", 'EXT', 'AN ANT MOUND ', 'DAY\n', 'MAIN CHAMBER ', 'Z ', "I'", 'INT', "MOTIVATIONAL COUNSELLOR'S OFFICE ", 'DAY\n', 'I ', "I'", 'I ', 'I ', "I'", "I'", 'I ', 'I ', 'I ']

In [146]:
#What if we include that extra space ahead of a character's name?
test2 = re.findall(r" {25}(\b['A-Z ]+[0-9]?)+", antz_script2)

In [147]:
len(test2)

659

In [149]:
test2[:20]

['Z ', 'Z ', 'Z', 'MOTIVATIONAL COUNSELLOR', 'Z', 'MOTIVATIONAL COUNSELLOR', 'MOTIVATIONAL COUNSELLOR', 'MOTIVATIONAL COUNSELLOR', 'Z', 'MOTIVATIONAL COUNSELLOR', 'MOTIVATIONAL COUNSELLOR', 'Z', 'MOTIVATIONAL COUNSELLOR', 'MOTIVATIONAL COUNSELLOR', 'Z', 'MOTIVATIONAL COUNSELLOR', 'MOTIVATIONAL COUNSELLOR', 'AZTECA', 'Z', 'AZTECA']

In [148]:
set(test2) 
#wrong hits: FADE TO, CUT TO

{'WORKER ANTS', 'WORKER ', 'WORKERS', 'BEETLE', 'QUEEN', 'GUARD ANT', 'FADE TO', 'Z', 'MOTIVATIONAL COUNSELLOR', 'BALA ', 'BALA', 'TRACKER ANT', 'COMMANDO ', 'ANT SOLDIERS', 'FORMICA ', 'MAJOR MANDIBLE', 'COMMANDO ANT', 'FOREMAN', 'PRINCESS', 'CUT TO', 'SOLDIER ANTS', 'BARKER', 'MANDIBLE', 'WEAVER', 'FEMALE WASP ', 'LOUD VOICE', 'WORKER', 'SOLDIERS', 'COLONEL', 'OFFICER', 'AZTECA', 'WASP', 'BARBATUS', 'WORKER ANT ', 'MALE WASP', 'SOLDIER ANT', 'EXCITED ANTS', 'Z ', 'FEMALE WASP', 'HANDMAIDEN ', 'COMMANDO ANT ', 'CRICKET', 'ALL', 'LADYBUG', 'BUTTERFLY', 'APHIDS', 'CARPENTER', 'THE WASPS', 'BARTENDER', 'MALE WASP ', 'TOUGH VOICE ', 'FORMICA', 'FLY', 'ANT OFFICER', 'DRUNK SCOUT', 'SOLDIER ', 'SOLDIER', 'BARBATUS ', 'GENERAL FORMICA'}

In [50]:
test3 = re.findall(r" {46}([\s\S]{,100})", antz_script2)
len(test3)

15

In [51]:
test3

["CUT TO:\n\n     INT. DORMITORY - THE NEXT DAY\n\n     Z is talking to Weaver, who's getting ready to go ", 'CUT TO:\n\n     INT. EARLY MEGA-TUNNEL - DAY\n\n     Weaver is "passing" as a worker, working alongside ', 'CUT TO:\n\n     INT. TOWN CENTER - DAY\n\n     A huge crowd is forming, eager to welcome the army back. ', 'CUT TO:\n\n     INT. THRONE ROOM - DAY\n\n     Z follows Formica and Carpenter into the throne room.  At', 'CUT TO:\n\n     EXT. WEED CLUMP - DAY\n\n     Z and Bala dust themselves off.\n\n                         ', 'CUT TO:\n\n     EXT. INSECTOPIA - NIGHT\n\n     The insects are having a \n\ncookout\n\n, their faces illumi', 'CUT TO:\n\n     INT. INSECTOPIA - NIGHT\n\n     Z, the termite, the beetle and the fly are happily carry', 'CUT TO:\n\n     EXT. SKY ABOVE COLONY - MORNING\n\n     The wasp and Z fly high above the colony...\n\n   ', 'CUT TO:\n\n     INT. ANT MOUND - DAY\n\n     Elsewhere in the colony, a column of soldiers marches by, a', 'CUT TO:\n\n     EX

In [45]:
#all the things w/ 46 spaces in front of them are CUT TO:'s and FADE TO:'s
test3[0][:15]

'CUT TO:\n\n     I'
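
Why did CUT TO and FADE TO show up among the 25-space hits when they actually sit behind 46 spaces? A pattern that starts with exactly 25 spaces can begin matching partway through a longer run. Requiring the two newlines directly in front of the 25 spaces rules that out. Here is a small demonstration on a made-up sample, using a simplified version of the name pattern (the simplification is mine):

```python
import re

# Made-up miniature of the Antz layout.
sample = (
    "\n\n" + " " * 25 + "Z\n\n"
    + " " * 15 + "I feel...isolated.\n\n"
    + " " * 46 + "CUT TO:\n\n"
    + " " * 5 + "INT. DORMITORY - THE NEXT DAY\n"
)

# Without an anchor, the last 25 of the 46 spaces satisfy " {25}".
loose = re.findall(r" {25}\b([A-Z][A-Z' ]*)", sample)

# With "\n\n" in front, the run must be exactly 25 spaces long,
# because \b cannot match between two spaces.
strict = re.findall(r"\n\n {25}\b([A-Z][A-Z' ]*)", sample)

print(loose)   # ['Z', 'CUT TO']
print(strict)  # ['Z']
```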

In [150]:
antz_scr = re.sub(r"\n\n {25}(\b['A-Z ]+[0-9]?)+", r'_NEWLINE_\1_', antz_script2)

In [151]:
antz_scr[:5000]

'_NEWLINE_Z _(O.S.)\n\n                      (over a dark screen)\n\n               All my life, I\'ve lived and worked in\n\n               the big city...\n\n     We see:\n\n     EXT. AN ANT MOUND - DAY\n\n     The camera swoops towards the entrance, then dives inside,\n\n     past a couple of tough-looking soldier ants who stand at the\n\n     gates of the ant colony like insect bouncers...into an access\n\n     tunnel that snakes this way and that, past a row of ants\n\n     plodding along...\n\n     ...and into the MAIN CHAMBER of the colony, a huge, teeming\n\n     vista that seems to stretch away forever, filled with ants\n\n     rushing here and there on their business.  We see -- a\n\n     "traffic cop" directing foot traffic, waving his arms like\n\n     crazy so both sides move at once -- a column of \n\nsoldier ants\n\n     marching along in formation -- a chain of ants letting down\n\n     a \n\nmatchbox\n\n elevator filled with workers._NEWLINE_Z _(V.O.)\n\n              

In [152]:
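#reanalyze the spacing:
white_space2 = re.findall(" {3,}", antz_scr)
#2732 checks out: 3376 runs before, minus the 644 runs of 25 spaces that the substitution consumed
#(659 hits earlier, less the 15 CUT TO/FADE TO hits that sit behind 46 spaces and so don't follow \n\n directly)
print(len(white_space2))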

2732


In [None]:
#making sure I only found lines


In [155]:
len_w_s = [len(x) for x in white_space2]
print(len_w_s[:50])
print(set(len_w_s))

[22, 15, 15, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 15, 15, 15, 5, 5, 5, 15, 15, 15, 15, 15, 15, 15, 22, 15, 15, 15, 15, 15, 15, 15, 15, 15, 22, 15, 15, 5, 5, 5, 15, 22, 23, 15]
{3, 5, 46, 15, 22, 23}


In [156]:
antz_lines = antz_scr.split('_NEWLINE_')

In [157]:
len(antz_lines) #659-15 (once the blank entry at the beginning is removed)

645

In [158]:
antz_lines = antz_lines[1:]
antz_lines[0:2]

['Z _(O.S.)\n\n                      (over a dark screen)\n\n               All my life, I\'ve lived and worked in\n\n               the big city...\n\n     We see:\n\n     EXT. AN ANT MOUND - DAY\n\n     The camera swoops towards the entrance, then dives inside,\n\n     past a couple of tough-looking soldier ants who stand at the\n\n     gates of the ant colony like insect bouncers...into an access\n\n     tunnel that snakes this way and that, past a row of ants\n\n     plodding along...\n\n     ...and into the MAIN CHAMBER of the colony, a huge, teeming\n\n     vista that seems to stretch away forever, filled with ants\n\n     rushing here and there on their business.  We see -- a\n\n     "traffic cop" directing foot traffic, waving his arms like\n\n     crazy so both sides move at once -- a column of \n\nsoldier ants\n\n     marching along in formation -- a chain of ants letting down\n\n     a \n\nmatchbox\n\n elevator filled with workers.', "Z _(V.O.)\n\n               ...which is 

In [102]:
antz_lines[0:3]

['Z _(O.S.)\n\n                      (over a dark screen)\n\n               All my life, I\'ve lived and worked in\n\n               the big city...\n\n     We see:\n\n     EXT. AN ANT MOUND - DAY\n\n     The camera swoops towards the entrance, then dives inside,\n\n     past a couple of tough-looking soldier ants who stand at the\n\n     gates of the ant colony like insect bouncers...into an access\n\n     tunnel that snakes this way and that, past a row of ants\n\n     plodding along...\n\n     ...and into the MAIN CHAMBER of the colony, a huge, teeming\n\n     vista that seems to stretch away forever, filled with ants\n\n     rushing here and there on their business.  We see -- a\n\n     "traffic cop" directing foot traffic, waving his arms like\n\n     crazy so both sides move at once -- a column of \n\nsoldier ants\n\n     marching along in formation -- a chain of ants letting down\n\n     a \n\nmatchbox\n\n elevator filled with workers.', "Z _(V.O.)\n\n               ...which is 

In [159]:
headers = re.findall(r"\n\n {5}\S+", antz_lines[0])

In [160]:
headers

['\n\n     We', '\n\n     EXT.', '\n\n     The', '\n\n     past', '\n\n     gates', '\n\n     tunnel', '\n\n     plodding', '\n\n     ...and', '\n\n     vista', '\n\n     rushing', '\n\n     "traffic', '\n\n     crazy', '\n\n     marching', '\n\n     a']

In [105]:
##antz_lines[0] = re.sub(r"\n\n {5}\S+",r"_HEADER_", antz_lines[0])

In [106]:
##antz_lines[0]

'Z _(O.S.)\n\n                      (over a dark screen)\n\n               All my life, I\'ve lived and worked in\n\n               the big city..._HEADER_ see:_HEADER_ AN ANT MOUND - DAY_HEADER_ camera swoops towards the entrance, then dives inside,_HEADER_ a couple of tough-looking soldier ants who stand at the_HEADER_ of the ant colony like insect bouncers...into an access_HEADER_ that snakes this way and that, past a row of ants_HEADER_ along..._HEADER_ into the MAIN CHAMBER of the colony, a huge, teeming_HEADER_ that seems to stretch away forever, filled with ants_HEADER_ here and there on their business.  We see -- a_HEADER_ cop" directing foot traffic, waving his arms like_HEADER_ so both sides move at once -- a column of \n\nsoldier ants_HEADER_ along in formation -- a chain of ants letting down_HEADER_ \n\nmatchbox\n\n elevator filled with workers.'

In [161]:
antz_lines_2 = []
for line in antz_lines:
    line = re.sub(r"\n\n {5}\S+", r"_HEADER_", line)
    real_line = line.split("_HEADER_")[0]
    antz_lines_2.append(real_line)
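
The substitute-then-split step in the loop above could probably be collapsed into a single `re.split` with `maxsplit=1`. This is a sketch, not a tested replacement for the whole script. The function name `keep_dialogue` and its `header_indent` parameter are placeholders.

```python
import re


def keep_dialogue(chunk, header_indent=5):
    """Drop everything from the first scene-text line onward.

    A scene-text line is a blank line followed by exactly
    header_indent spaces and then a non-space character.
    """
    pattern = r"\n\n {%d}\S" % header_indent
    return re.split(pattern, chunk, maxsplit=1)[0]


chunk = (
    "Z _(O.S.)\n\n"
    + " " * 15 + "the big city...\n\n"
    + " " * 5 + "We see:\n\n"
    + " " * 5 + "EXT. AN ANT MOUND - DAY"
)
dialogue = keep_dialogue(chunk)
```

Because the pattern demands a non-space character right after the five spaces, the 15-space dialogue lines are left alone. Passing `header_indent=10` should give the equivalent step for The Croods, going by the indent widths noted later in this notebook.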

In [162]:
len(antz_lines_2) #should still be 644

644

In [163]:
antz_lines_2[:50]

["Z _(O.S.)\n\n                      (over a dark screen)\n\n               All my life, I've lived and worked in\n\n               the big city...", "Z _(V.O.)\n\n               ...which is kind of a problem, since\n\n               I've always felt uncomfortably in\n\n               crowds.", "Z_\n\n               I feel...isolated.  Different.  I've\n\n               got abandonment issues.  My father\n\n               flew away when I was just a larva.\n\n               My mother didn't have much time for\n\n               me...when you have five million\n\n               siblings, it's difficult to get\n\n               attention.\n\n                      (pause)\n\n               I feel physically inadequate -- I've\n\n               never been able to lift more than ten\n\n               times my own weight.  Sometimes I\n\n               think I'm just not cut out to be a\n\n               worker.  But I don't have any other\n\n               options.  I was assigned to trade\n

In [168]:
#now, let's remove all the new line characters and parentheticals
antz_lines_final = []
for line in antz_lines_2:
    line = re.sub("\n", ' ', line)
    line = re.sub(" {3,}", ' ', line)
    line = re.sub(r"\([^\)]*\)", '', line)
    antz_lines_final.append(line)
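
A variation on the cleanup above, as a sketch: removing parentheticals first and then normalizing all whitespace in one go avoids the stray double spaces left behind where a parenthetical used to be. Note that this also collapses the double spaces inside the dialogue itself, which the loop above keeps. The function name `clean_line` is a placeholder.

```python
import re


def clean_line(raw):
    """Strip parentheticals, then squeeze all whitespace to single spaces."""
    no_parens = re.sub(r"\([^)]*\)", "", raw)
    # str.split() with no argument splits on any whitespace run,
    # including newlines, and discards leading/trailing whitespace.
    return " ".join(no_parens.split())


raw = (
    "Z _(O.S.)\n\n"
    + " " * 22 + "(over a dark screen)\n\n"
    + " " * 15 + "All my life, I've lived and worked in\n\n"
    + " " * 15 + "the big city..."
)
cleaned = clean_line(raw)
print(cleaned)  # Z _ All my life, I've lived and worked in the big city...
```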

In [190]:
antz_lines_final[:50] #Eww, Z's lines are getting swallowed up. How?
#Fixed!!!

["Z _  All my life, I've lived and worked in the big city...", "Z _ ...which is kind of a problem, since I've always felt uncomfortably in crowds.", "Z_ I feel...isolated.  Different.  I've got abandonment issues.  My father flew away when I was just a larva. My mother didn't have much time for me...when you have five million siblings, it's difficult to get attention.  I feel physically inadequate -- I've never been able to lift more than ten times my own weight.  Sometimes I think I'm just not cut out to be a worker.  But I don't have any other options.  I was assigned to trade school when I was just a grub  .  The whole system just...makes me feel...  insignificant  .", 'MOTIVATIONAL COUNSELLOR_  Terrific!  You should feel insignificant!', 'Z_ ...I should?', 'MOTIVATIONAL COUNSELLOR_  YES!!!  You know, people ask me, "Doctor, why are you always happy?" And I tell them it\'s mind over matter.  I don\'t mind that I don\'t  matter  !  Do you get it?  Do you get it?', "MOTIVATIONAL COUNS

In [191]:
#now, split the lines
line_list = []
for x in antz_lines_final:
    my_list = re.split("_ ", x)
    my_tu = tuple(my_list)
    line_list.append(my_tu) 
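
One thing to watch with `re.split("_ ", x)`: if a line ever contains a second "_ ", the tuple will have three items and the two-column DataFrame will fail. A hedged alternative is `str.partition`, which only splits on the first marker. The function name `split_speaker` is mine.

```python
def split_speaker(line, marker="_"):
    """Split 'SPEAKER_ text' into (speaker, text) at the first marker."""
    speaker, _, text = line.partition(marker)
    return speaker.strip(), text.strip()


examples = [
    "Z _  All my life, I've lived and worked in the big city...",
    "Z_ ...I should?",
    "MOTIVATIONAL COUNSELLOR_  Terrific!  You should feel insignificant!",
]
pairs = [split_speaker(line) for line in examples]
```

Stripping the speaker also merges variants like `'Z '` and `'Z'`, which showed up as separate entries in the set of names earlier.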

In [193]:
len(line_list)

644

In [194]:
antz_DF = pd.DataFrame(line_list, columns=["Speaker", "Text"])

In [195]:
antz_DF.head()

| | Speaker | Text |
|---|---|---|
| 0 | Z | All my life, I've lived and worked in the big... |
| 1 | Z | ...which is kind of a problem, since I've alwa... |
| 2 | Z | I feel...isolated. Different. I've got aband... |
| 3 | MOTIVATIONAL COUNSELLOR | Terrific! You should feel insignificant! |
| 4 | Z | ...I should? |


In [196]:
#Let's export this as a pickle file
import pickle as pkl

In [197]:
antz_DF.to_pickle(r'C:\Users\cassi\Desktop\Data_Science\Animated-Movie-Gendered-Dialogue\private\antz_lines.pkl')

## Onto the next movie: The Croods

In [282]:
#getting the script (the with-block closes the file automatically)
with open(r'C:\Users\cassi\Desktop\Data_Science\Animated-Movie-Gendered-Dialogue\private\imsdb_raw_nov_2015\Animation\croodsthe.txt') as croods:
    croods_script = croods.read()

In [283]:
print(croods_script[:1000])

                           THE CROODS

                           Written by

                 Kirk DeMicco & Chris Sanders

                                                       12.12.2012

          SEQ. 75 - PROLOGUE

                         FADE IN:

          A cave painting of the Dreamworks logo. Push past the

          moon to the sun. Bright. Beautiful. The sun DISSOLVES

                         TO:

          Cave paintings of a family of cavemen -- we will come to

          know them as The Croods.

                          EEP (V.O.)

           With every sun comes a new day. A new

           beginning. A hope that things will be

           better today than they were yesterday.

          The Croods scurry out of their cave like mice looking for

          food. Scared. Fast.

          The Croods are chased by beasts across the desert. They

          escape from creatures up trees. They hide behind rocks.

                          EEP (V.O.)

           But not

In [284]:
croods_script[:5000]

"                           THE CROODS\n\n                           Written by\n\n                 Kirk DeMicco & Chris Sanders\n\n                                                       12.12.2012\n\n          SEQ. 75 - PROLOGUE\n\n                         FADE IN:\n\n          A cave painting of the Dreamworks logo. Push past the\n\n          moon to the sun. Bright. Beautiful. The sun DISSOLVES\n\n                         TO:\n\n          Cave paintings of a family of cavemen -- we will come to\n\n          know them as The Croods.\n\n                          EEP (V.O.)\n\n           With every sun comes a new day. A new\n\n           beginning. A hope that things will be\n\n           better today than they were yesterday.\n\n          The Croods scurry out of their cave like mice looking for\n\n          food. Scared. Fast.\n\n          The Croods are chased by beasts across the desert. They\n\n          escape from creatures up trees. They hide behind rocks.\n\n                 

In [206]:
seq = re.findall('SEQ', croods_script)

In [207]:
len(seq)

29

In [208]:
sequences = croods_script.split('SEQ.')

In [210]:
sequences[2]

" 100 - MEET THE CROODS\n\n          A HUGE CAVEMAN, GRUG, RUSHES forward, screaming,\n\n          growling, and throwing handfuls of dirt in a THREAT\n\n          DISPLAY. He BLINKS in the bright morning light. He\n\n          LIFTS and HURLS a large boulder.\n\n                          GRUG\n\n           Raaaaar grooooOOOOoooowwwll ERF ERF\n\n           Glaaaaaabbbbllllllllthhhh!\n\n          As suddenly as he began, Grug stops. Panting, he waits\n\n          for the echoes of his outburst to fade.\n\n          Grug TURNS to the cave entrance, he prepares to BELLOW a\n\n          signal, but before any sound escapes his lips--\n\n          A CAVE GIRL, EEP, bursts from the cave.\n\n          3.\n\n                          GRUG (CONT'D)\n\n           You're supposed to wait for my signal\n\n           Eep. Eep?\n\n          Eep scares a pack of nearby Liyotes away. They pounce on\n\n          Grug briefly before scampering off. Eep spreads out on\n\n          an overhanging rock. Sh

This script seems to mark sequences, i.e. scene breaks. Now that I think about it, Antz marked scene breaks too, and I'm beginning to wonder whether I should have used them as scene starters instead of just discarding them. Maybe there's a good way to convert any FADE TO and CUT TO chunks into a single marker that would indicate the beginning of a scene. And, if I set that marker off with "_ ", I could use it to split the data into tuples of 3, and then create a Start_of_Scene column.
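
To make the idea concrete, here is one way it might look. This is a sketch of a variation, not what the notebook does: instead of a marker character inside the text, it assumes the transitions have already come through as pseudo-speakers (as CUT TO does in the Croods output), and it flags the next real line as a scene start. The names `TRANSITIONS` and `flag_scene_starts` and the sample dialogue are made up for illustration.

```python
# Pseudo-speakers that are really scene transitions (assumed list).
TRANSITIONS = {"CUT TO", "FADE TO", "DISSOLVE TO"}


def flag_scene_starts(entries):
    """Turn (speaker, text) pairs into (speaker, text, start_of_scene).

    Transition entries are dropped; the first real line after one
    (and the very first line) gets start_of_scene=True.
    """
    rows = []
    new_scene = True
    for speaker, text in entries:
        if speaker in TRANSITIONS:
            new_scene = True
            continue
        rows.append((speaker, text, new_scene))
        new_scene = False
    return rows


# Made-up sample, in script order.
entries = [
    ("Z", "All my life..."),
    ("CUT TO", ":"),
    ("WEAVER", "Hey, Z."),
    ("Z", "Hi."),
]
rows = flag_scene_starts(entries)
```

The resulting 3-tuples could go straight into a DataFrame with a third `Start_of_Scene` column.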

For now, let's just worry about getting the lines.

In [285]:
#getting rid of jargon before the script starts
croods_script[:519]

'                           THE CROODS\n\n                           Written by\n\n                 Kirk DeMicco & Chris Sanders\n\n                                                       12.12.2012\n\n          SEQ. 75 - PROLOGUE\n\n                         FADE IN:\n\n          A cave painting of the Dreamworks logo. Push past the\n\n          moon to the sun. Bright. Beautiful. The sun DISSOLVES\n\n                         TO:\n\n          Cave paintings of a family of cavemen -- we will come to\n\n          know them as The Croods.'

In [286]:
croods_script = croods_script[519:]

In [287]:
croods_script[0]

'\n'

In [236]:
len_w_s.count(25) #not many of these...

21

In [253]:
ws_25 = re.findall(r"\n {25}\S+ \S+", croods_script)

In [254]:
ws_25

['\n                         ON GRAN', '\n                         ON EEP', '\n                         ON SANDY', '\n                         X-DISS TO:', '\n                         THUNK MOANS', '\n                         X-DISS TO:', '\n                         X-DISS TO:', '\n                         ON GUY', '\n                         THE LOG', '\n                         ON THUNK', '\n                         TREE TOP', '\n                         ON GRUG', '\n                         ON UGGA', '\n                         ON THUNK', '\n                         ON GRUG', '\n                         ON GRUG', '\n                         ON GRUG', '\n                         ON EEP']

It looks like the 25 spaces come before stage commands.

In [231]:
#the only large chunks of white space are
#10 spaces: after a character's line
#11 spaces: between character name and line or between lines
#25 spaces: before stage commands
#26 spaces: before a character name
croods_script[:1000]

"\n\n                          EEP (V.O.)\n\n           With every sun comes a new day. A new\n\n           beginning. A hope that things will be\n\n           better today than they were yesterday.\n\n          The Croods scurry out of their cave like mice looking for\n\n          food. Scared. Fast.\n\n          The Croods are chased by beasts across the desert. They\n\n          escape from creatures up trees. They hide behind rocks.\n\n                          EEP (V.O.)\n\n           But not for me. My name's Eep. This, is\n\n           my family, the Croods. If you weren't\n\n           clued in already by the animal skins and\n\n           sloping foreheads - we're cavemen. Most\n\n           days we spend in our cave, in the dark.\n\n           Night after night, day after day. Yep.\n\n           Home sweet home. When we did go out, we\n\n           struggled to find food in a harsh and\n\n           hostile world. And I struggled to\n\n           survive my family.\n\n       

In [274]:
#Let's just remove parentheticals from the start
croods_script = re.sub(r"\([^\)]*\)", '', croods_script)

In [295]:
#analyzing white space again
w_s = re.findall(" {3,}", croods_script)
len(w_s)

3396

In [296]:
len_w_s = [len(x) for x in w_s]
print(len_w_s[:100])
print(set(len_w_s))
print(len_w_s.index(25))
print(len_w_s[215])

[26, 11, 11, 11, 10, 10, 10, 10, 26, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 10, 10, 26, 11, 11, 11, 11, 11, 11, 10, 26, 11, 11, 10, 10, 10, 26, 11, 11, 11, 11, 11, 10, 10, 10, 10, 26, 11, 11, 10, 10, 10, 26, 11, 11, 11, 11, 10, 10, 26, 10, 10, 10, 10, 10, 26, 11, 11, 10, 10, 10, 10, 10, 10, 26, 11, 11, 10, 10, 10, 26, 11, 26, 11, 26, 11, 26, 11, 11, 10, 10, 26, 11, 11, 11, 10, 10, 10, 26, 11, 26]
{10, 25, 26, 11}
215
25


In [275]:
croods_script[:5000]

"\n\n                          EEP \n\n           With every sun comes a new day. A new\n\n           beginning. A hope that things will be\n\n           better today than they were yesterday.\n\n          The Croods scurry out of their cave like mice looking for\n\n          food. Scared. Fast.\n\n          The Croods are chased by beasts across the desert. They\n\n          escape from creatures up trees. They hide behind rocks.\n\n                          EEP \n\n           But not for me. My name's Eep. This, is\n\n           my family, the Croods. If you weren't\n\n           clued in already by the animal skins and\n\n           sloping foreheads - we're cavemen. Most\n\n           days we spend in our cave, in the dark.\n\n           Night after night, day after day. Yep.\n\n           Home sweet home. When we did go out, we\n\n           struggled to find food in a harsh and\n\n           hostile world. And I struggled to\n\n           survive my family.\n\n          The Crood

In [356]:
croods_scr = re.sub(r"\n\n {26}(\b['A-Z ]+)", r'_NEWLINE_\1_', croods_script)

In [357]:
croods_scr[:500]

"_NEWLINE_EEP _(V.O.)\n\n           With every sun comes a new day. A new\n\n           beginning. A hope that things will be\n\n           better today than they were yesterday.\n\n          The Croods scurry out of their cave like mice looking for\n\n          food. Scared. Fast.\n\n          The Croods are chased by beasts across the desert. They\n\n          escape from creatures up trees. They hide behind rocks._NEWLINE_EEP _(V.O.)\n\n           But not for me. My name's Eep. This, is\n\n           my family,"

In [351]:
testing = re.findall(r"\n\n {26}\b['A-Z ]+", croods_script)

In [352]:
len(testing)

804

In [353]:
len(set(testing))

30

In [354]:
set(testing)

{'\n\n                          UGGA', '\n\n                          EEP', '\n\n                          AWAY', '\n\n                          GUY', '\n\n                          GRUG', '\n\n                          GRAN ', '\n\n                          ERF ERF', '\n\n                          SIGH ', '\n\n                          MACAWNIVORE', '\n\n                          SPOTTED', '\n\n                          EEP ', '\n\n                          CUT TO', '\n\n                          THUNK ', '\n\n                          SANDY', '\n\n                          GRUG ', '\n\n                          UGA', '\n\n                          THUNK', '\n\n                          GASP ', '\n\n                          YOU TO', '\n\n                          CREATURE', '\n\n                          CHUNKY', '\n\n                          DISSOLVE TO', '\n\n                          GUY ', '\n\n                          GRAN', '\n\n                          TA', '\n\n           

In [345]:
testing[:10]

['EE', 'EE', 'EE', 'EE', 'EE', 'EE', 'EE', 'CU', 'GR', 'GR']

This also includes various stage commands, like CUT TO, AWAY, SIGH, GASP, GASPS, YOU TO, DISSOLVE TO, and whatever TA is (as far as I can tell from researching the movie, TA isn't a character, or anything really).

In [310]:
testing_2 = re.findall(r"\n\n {26}(\b['A-Z ]+)+", croods_script)

In [311]:
testing_2[:10]

['EEP ', 'EEP ', 'EEP ', 'EEP ', 'EEP ', 'EEP ', 'EEP ', 'CUT TO', 'GRUG', 'GRUG ']

In [312]:
len(testing_2)

804

In [313]:
set(testing_2)

{'UGA', 'GROAN ', 'SANDY', 'GRUG', 'GUY ', 'CREATURE', 'CROODS', 'UGGA ', 'EEP ', 'UGGA', 'GUY', 'BELT', 'YOU TO', 'CUT TO', 'GRAN', 'AWAY', 'THUNK', 'DISSOLVE TO', 'SIGH ', 'TA', 'EEP', 'MACAWNIVORE', 'GASP ', 'GRUG ', 'CHUNKY', 'ERF ERF', 'GASPS ', 'GRAN ', 'THUNK ', 'SPOTTED'}

In [314]:
len(set(testing_2))

30

Split the script into lines.

In [358]:
croods_lines = croods_scr.split('_NEWLINE_')

In [359]:
len(croods_lines)

805

In [360]:
croods_lines[0]

''

In [361]:
croods_lines = croods_lines[1:]

In [363]:
croods_lines[:20]

['EEP _(V.O.)\n\n           With every sun comes a new day. A new\n\n           beginning. A hope that things will be\n\n           better today than they were yesterday.\n\n          The Croods scurry out of their cave like mice looking for\n\n          food. Scared. Fast.\n\n          The Croods are chased by beasts across the desert. They\n\n          escape from creatures up trees. They hide behind rocks.', "EEP _(V.O.)\n\n           But not for me. My name's Eep. This, is\n\n           my family, the Croods. If you weren't\n\n           clued in already by the animal skins and\n\n           sloping foreheads - we're cavemen. Most\n\n           days we spend in our cave, in the dark.\n\n           Night after night, day after day. Yep.\n\n           Home sweet home. When we did go out, we\n\n           struggled to find food in a harsh and\n\n           hostile world. And I struggled to\n\n           survive my family.\n\n          The Croods are chased by beasts across the desert.

In [369]:
re.findall(r"\n\n {10}\S", croods_lines[18]) #not lines

['\n\n          4']

In [371]:
croods_lines2 = []
for line in croods_lines:
    line = re.sub(r"\n\n {10}\S", '_HEADER_', line)
    real_line = line.split('_HEADER_')[0]
    croods_lines2.append(real_line)


In [373]:
croods_lines2[:10]

['EEP _(V.O.)\n\n           With every sun comes a new day. A new\n\n           beginning. A hope that things will be\n\n           better today than they were yesterday.', "EEP _(V.O.)\n\n           But not for me. My name's Eep. This, is\n\n           my family, the Croods. If you weren't\n\n           clued in already by the animal skins and\n\n           sloping foreheads - we're cavemen. Most\n\n           days we spend in our cave, in the dark.\n\n           Night after night, day after day. Yep.\n\n           Home sweet home. When we did go out, we\n\n           struggled to find food in a harsh and\n\n           hostile world. And I struggled to\n\n           survive my family.", "EEP _(V.O.)\n\n           We were the last ones around. There used\n\n           to be neighbors. The Gorts, smashed by a\n\n           mammoth. The Horks, swallowed by a sand\n\n           snake. The Erfs, mosquito bite. Throgs,\n\n           common cold. And, the Croods. That's\n\n           us. The

In [374]:
not_lines = ["CUT TO", "AWAY", "SIGH", "GASP", "GASPS", "YOU TO", "DISSOLVE TO", "TA"]
not_croods_lines = []
for x in not_lines:
    not_real_lines = [line for line in croods_lines2 if x in line]
    not_croods_lines.extend(not_real_lines)


In [375]:
len(not_croods_lines)

42

In [376]:
not_croods_lines

['CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'GRUG_\n\n           Auughhauhhhgggh!\n\n           SMASH CUT TO:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'GUY_\n\n           AAAAAAAAAAAAAAAAAAAAUGHHHHH.\n\n           SMASH CUT TO:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'CUT TO_:', 'AWAY_)\n\n           You too, Mom.', 'SIGH _', 'GUY_\n\n                          (GASPING)\n\n           Air.', 'GASPS _', 'GASP _', 'GASPS _', 'YOU TO_--\n\n           GRUG (O.S.)\n\n           GET OUT OF THE WAY!!!', 'DISSOLVE TO_:', 'DISSOLVE TO_:', 'DISSOLVE TO_:', 'DISSOLVE TO_:', 'DISSOLVE TO_:', 'DISSOLVE TO_:', 'DISSOLVE TO_:', 'TA_-DAAAAA']

In [377]:
#so, some of these ARE parts of lines. Will have to fix
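
One reason real lines got caught: `x in line` is a substring test over the whole chunk, so "CUT TO" also matches a GRUG line that ends in "SMASH CUT TO:", and "GASP" matches GUY's "(GASPING)". A possible fix, sketched below, is to compare only the speaker field (everything before the first underscore) against the list. The names `NOT_SPEAKERS` and `split_real_and_suspect` are placeholders.

```python
NOT_SPEAKERS = {"CUT TO", "AWAY", "SIGH", "GASP", "GASPS",
                "YOU TO", "DISSOLVE TO", "TA"}


def split_real_and_suspect(chunks, not_speakers=NOT_SPEAKERS):
    """Separate chunks whose speaker field is a known non-speaker."""
    real, suspect = [], []
    for chunk in chunks:
        # Only the text before the first "_" is the speaker field.
        speaker = chunk.split("_", 1)[0].strip()
        if speaker in not_speakers:
            suspect.append(chunk)
        else:
            real.append(chunk)
    return real, suspect


# A few chunks copied from the output above.
chunks = [
    "CUT TO_:",
    "GRUG_\n\n           Auughhauhhhgggh!\n\n           SMASH CUT TO:",
    "GASPS _",
    "GUY_\n\n                          (GASPING)\n\n           Air.",
    "TA_-DAAAAA",
]
real, suspect = split_real_and_suspect(chunks)
```

This keeps the GRUG and GUY lines. Suspects such as `'TA_-DAAAAA'` still look like pieces of somebody's dialogue, so they would need to be re-attached to the previous line rather than thrown away.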