In [11]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 
import os

About <br>
This notebook handles data collection and cleaning: <br>
* It reads in movie scripts
* It sorts each script into its kinds of lines (stage setup, character names, spoken lines)
<br><br>
Goals: 
- Emotion Composition 
--> Need to know line position
- Character Relationship 
--> Need to know character appearance
<br>

In [13]:
# 1) read in the script (character names and their lines)
path = os.path.join('..', 'scripts', 'sleeping_beauty.txt')
with open(path, 'r') as file:
    script = file.readlines()
start, end = 75, 1156  # first and last index of the script lines to parse


line_count = 0 # number of spoken lines found so far
characters = set() # all unique characters
lines = [] # [character, text] pairs; the list index is the spoken-line number
line_nums = [] # index in the raw script of each character line


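# helper functions
def remove_setup(string):
    """Return `string` with every '[...]' stage-setup segment removed."""
    inside = False # True while we are between '[' and ']'
    new_str = ""
    for s in string:
        if s=='[':
            inside = True
        elif s==']':
            inside = False
        elif not inside:
            new_str += s
    return new_str.strip()

def return_setup(string):
    """Return only the text found inside '[...]' stage-setup segments."""
    inside = False # True while we are between '[' and ']'
    new_str = ""
    for s in string:
        if s=='[':
            inside = True
        elif s==']':
            inside = False
        elif inside:
            new_str += s
    return new_str.strip()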

def remove_linebreak(string):
    return string.replace('\n', ' ')

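# iterate through the script lines
for i in range(start, end+1):
    holder = script[i] # line that is being read
    if ':' in holder:
        # character line such as "Narrator:\n"; drop the trailing ":\n"
        # (this assumes the spoken text itself never contains ':')
        character = holder[:-2]
        characters.add(character) # record unique characters
        lines.append([character, ""])
        line_count += 1
        line_nums.append(i)
    else:
        # spoken text: skip blank lines and anything before the first character
        if holder.strip() and line_count != 0:
            lines[line_count-1][1] += holder

lines = np.array(lines)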

# 2) put the lines into a DataFrame and store it
d = {
        'chars':lines[:,0],
        'lines': lines[:,1], 
    }
scripts = pd.DataFrame(d)

scripts['line_num'] = pd.Series(line_nums)
scripts['mod_line'] = scripts['lines'].apply(lambda s: remove_linebreak(remove_setup(s)))
scripts['scene_setup'] = scripts['lines'].apply(lambda s: remove_linebreak(return_setup(s)))

out_path = os.path.join('..', 'cleaned_scripts', 'sleeping_beauty.csv')
scripts.to_csv(out_path)


In [10]:
scripts

Unnamed: 0,chars,lines,line_num,mod_line,scene_setup
0,Narrator,"In a far away land, long ago, lived a king and...",77,"In a far away land, long ago, lived a king and...",a crowd is on its way to the castle
1,Choir,"Joyfully now to our princess we come,\nBringin...",80,"Joyfully now to our princess we come, Bringing...",inside the castle
2,Narrator,Thus on this great and joyous day did all the ...,101,Thus on this great and joyous day did all the ...,
3,Announcer,"Their royal highnesses, King Hubert and prince...",103,"Their royal highnesses, King Hubert and prince...",
4,Narrator,Fondly had these monarchs dreamed one day thei...,105,Fondly had these monarchs dreamed one day thei...,
...,...,...,...,...,...
440,Choir,"I know you,\nI walked with you\nOnce upon a dr...",1135,"I know you, I walked with you Once upon a dream",
441,Merryweather,Blue! [the dress changes to blue]\n,1139,Blue!,the dress changes to blue
442,Choir,"I know you,\nThe gleam in your eyes\nIs so fam...",1141,"I know you, The gleam in your eyes Is so famil...",The castle disappears around Aurora and Philli...
443,Choir,And I know it's true\nThat visions are seldom ...,1146,And I know it's true That visions are seldom a...,Aurora and Phillip kiss each other. The storyb...
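
The two goals stated at the top need a line position and a record of character appearances, and both can be derived from the `line_num` and `chars` columns shown above. The following is only a rough sketch of one way to do that, not the notebook's actual method: `relative_positions` is a hypothetical helper, scaling the position to the 0..1 range is an assumption, and the sample values are copied from the first five rows of the table.

```python
from collections import Counter

def relative_positions(line_nums, start, end):
    """Scale raw script line numbers to a 0..1 position within [start, end]."""
    span = end - start
    return [(n - start) / span for n in line_nums]

# start/end as used in the cleaning cell; rows 0-4 of the table above
start, end = 75, 1156
sample_line_nums = [77, 80, 101, 103, 105]
sample_chars = ["Narrator", "Choir", "Narrator", "Announcer", "Narrator"]

# where in the movie each spoken line falls (0 = first parsed line, 1 = last)
positions = relative_positions(sample_line_nums, start, end)

# how many times each character speaks in the sample
appearances = Counter(sample_chars)

print(positions)
print(appearances.most_common())
```

With the full DataFrame, the same idea would apply to `scripts['line_num']` and `scripts['chars']` directly.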


## Some issues with cleaning the scripts

1) Different scripts follow different formats and text conventions: the_little_mermaid.txt uses a different character format <br>
from beauty_and_the_beast.txt. The cleaning code above can handle the_rescuers_down_under.txt,
the_little_mermaid.txt, and sleeping_beauty.txt (the testing file).

2) Specially formatted scripts: <br> 
* Special#1: lion_king.txt [the most unusual one] <br>
1) There are special dash dividers
2) '[]' contains the scene name under the divider
3) '{}' contains the scene setup

* Special#2: a_goofy_movie.txt <br> 
1) There are special dash dividers
2) Setup is in '()'

* Special#3: beauty_and_the_beast.txt, aladdin.txt, the_hunchback_of_notre_dame.txt <br>
1) The speaker is on the same line as the spoken line
2) Setup is in '()'

* Special#4: mulan.txt [ignore this one for now] <br> 
1) Scene description has no indicator [BIG PROBLEM]
2) Contains '***' separators
3) The speaker is on a separate line
Solution: use the HTML file and web scrape it instead
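
Several of the special formats above differ mainly in which delimiters wrap the setup text and in whether the speaker shares a line with the dialogue. As a minimal sketch of how the bracket helpers could be generalized, assuming setup segments are never nested: `split_setup` and `parse_same_line` are hypothetical names that do not exist in the notebook, and the "BELLE" line is an invented sample, not taken from any script.

```python
def split_setup(text, open_ch="[", close_ch="]"):
    """Split text into (spoken, setup) for one pair of setup delimiters."""
    spoken, setup = [], []
    inside = False  # True while between open_ch and close_ch
    for ch in text:
        if ch == open_ch:
            inside = True
        elif ch == close_ch:
            inside = False
        elif inside:
            setup.append(ch)
        else:
            spoken.append(ch)
    return "".join(spoken).strip(), "".join(setup).strip()

def parse_same_line(raw, open_ch="(", close_ch=")"):
    """Parse 'SPEAKER: dialogue (setup)' when speaker and dialogue share a line.

    Returns (speaker, spoken, setup), or None if the line has no ':'.
    """
    speaker, sep, rest = raw.partition(":")
    if not sep:
        return None
    spoken, setup = split_setup(rest, open_ch, close_ch)
    return speaker.strip(), spoken, setup

# default '[]' delimiters, as in sleeping_beauty.txt
print(split_setup("Blue! [the dress changes to blue]\n"))
# → ('Blue!', 'the dress changes to blue')

# '()' delimiters with the speaker on the same line (invented sample line)
print(parse_same_line("BELLE: Good morning! (waves)"))
```

Passing the delimiters as parameters keeps one code path for the '[]', '()', and '{}' conventions; the dash dividers and the mulan.txt layout would still need separate handling.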

Now that we have the scripts, let's do a practice run on one of the lines with some common NLP and sentiment analysis

In [None]:
in_path = os.path.join('..', 'cleaned_scripts', 'sleeping_beauty.csv')

sentence = pd.read_csv(in_path).iloc[0]['mod_line']