# Turning Scripts into a Data Set
This notebook walks through turning our folder of movie scripts, `moviescripts/`, into a single dataframe that we can use for machine learning. There are five steps:

1. Import the text files and turn them into strings

2. Turn the script string into a dictionary

3. Turn the dictionary into a dataframe

4. Combine all of the scripts' dataframes into one dataframe

5. Export the final dataframe as a CSV

## Step 1: Reading the Script

We've created a simple function, `read_script()`, which takes a script's file name and returns its text as a string. We'll demonstrate it on one of the scripts, `megamind.txt`.


In [11]:
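# Import only the helpers this notebook uses instead of a wildcard import.
from scriptfuncts import read_script, script_to_dict, movie_dict_to_df, movies_to_df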

megamind = read_script('megamind.txt')
megamind[1:200]

'MEGAMIND\n\n\n\nWritten by\n\nAlan Schoolcraft & Brent Simons\n\n\n\n\nCREDITS SEQUENCE\n\nNEWSPAPER HEADLINE MONTAGE:\n\nHEADLINES flash before us, displaying their accompanying\nphotographs.\n\n"UBERMAN - METRO CITY'
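The body of `read_script()` lives in `scriptfuncts` and is not shown in this notebook. As a rough illustration only, a reader of this kind might look like the sketch below. The name `read_script_sketch`, the `folder` parameter, and the UTF-8 assumption are all hypothetical and not taken from the real function.

```python
from pathlib import Path


def read_script_sketch(file_name, folder="moviescripts"):
    """Return the full text of one script file as a single string.

    Hypothetical stand-in for read_script(): the folder default and the
    encoding handling are assumptions, not the notebook's actual code.
    """
    path = Path(folder) / file_name
    # errors="replace" swaps undecodable bytes for a placeholder character,
    # so one bad byte does not abort the whole read.
    return path.read_text(encoding="utf-8", errors="replace")
```

Returning one string (rather than a list of lines) keeps the blank lines between speeches intact, which the next step relies on.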

## Step 2: Turning the Script into a Dictionary
The function for step 2 is a little more involved. It uses regular expressions to separate each speaker's name from their lines and collects the results in a dictionary.

* **Input**: a string containing a full movie script with this format:

```
    SPEAKER
    Hello, my name is speaker

    SPEAKER2
    Hi, my name is speaker2

    SPEAKER
    Hi speaker2
```
* **Output**: a dictionary that maps each speaker to a list of their lines, in order:

```
    {'SPEAKER': ['Hello, my name is speaker', 'Hi speaker2'],
     'SPEAKER2': ['Hi, my name is speaker2']}
```


In [12]:
script_dict = script_to_dict(megamind)
script_dict['MASTER MIND'][1:5]

['Alright, then - clean slate. Do we have the girl?',
 'Reporters are a curious lot, and easily manipulated.',
 "Alright, let's not keep the lady waiting.",
 '(O.S.) Miss Ritchi, we meet again.']
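The real `script_to_dict` is defined in `scriptfuncts` and its regular expression is not shown here. The sketch below is one possible way to do the same kind of parsing, under two assumptions: a speaker cue is a line made up only of capitals, digits, spaces, periods, apostrophes, and hyphens, and each speech ends at the next blank line. The names `CUE_PATTERN` and `script_to_dict_sketch` are hypothetical.

```python
import re
from collections import defaultdict

# A speaker cue on its own line, followed by one or more non-blank lines
# of dialogue. Both rules are assumptions about the script layout.
CUE_PATTERN = re.compile(
    r"^[ \t]*(?P<speaker>[A-Z][A-Z0-9 .'\-]*)[ \t]*\n"
    r"(?P<speech>(?:[ \t]*\S[^\n]*(?:\n|\Z))+)",
    re.MULTILINE,
)


def script_to_dict_sketch(script):
    """Map each speaker to a list of their speeches, in script order."""
    lines_by_speaker = defaultdict(list)
    for match in CUE_PATTERN.finditer(script):
        speaker = match.group("speaker").strip()
        # Join a speech that wraps over several lines into one string.
        speech = " ".join(match.group("speech").split())
        lines_by_speaker[speaker].append(speech)
    return dict(lines_by_speaker)


example = (
    "    SPEAKER\n    Hello, my name is speaker\n\n"
    "    SPEAKER2\n    Hi, my name is speaker2\n\n"
    "    SPEAKER\n    Hi speaker2\n"
)
print(script_to_dict_sketch(example))
```

A pattern this simple will also pick up any all-caps line that is directly followed by text, and it will miss cues that carry a parenthetical on the same line, so it should be read as a starting point rather than a robust parser.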

## Step 3: Turning the Dictionary into a Dataframe
In order to use the machine learning methods we want, we will have to format the dataframe in a specific way. At the very least, the dataframe should look like this:

movie | character_name | line
--- | --- | ---
'Megamind' | 'Master Mind' | 'Alright, then - ...'
'Megamind' | 'Master Mind' | 'Reporters are a curi...'

We also include a `line_num` column that records each line's position among that character's lines. For machine learning purposes, we keep only characters with more than 100 lines in the final dataframe. The function `movie_dict_to_df` handles all of these requirements.


In [13]:
movie_df = movie_dict_to_df(script_dict, 'megamind')
movie_df.head()

Unnamed: 0 | movie | character_name | line_num | line
--- | --- | --- | --- | ---
788 | Megamind | Master Mind | 0 | the real einstein once said, "god does not pla...
789 | Megamind | Master Mind | 1 | alright, then - clean slate. do we have the girl?
790 | Megamind | Master Mind | 2 | reporters are a curious lot, and easily manipu...
791 | Megamind | Master Mind | 3 | alright, let's not keep the lady waiting.
792 | Megamind | Master Mind | 4 | miss ritchi, we meet again.
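`movie_dict_to_df` is defined in `scriptfuncts`, and its body is not shown here. The sketch below is one way the behavior described above could be written with pandas: one row per line, a `line_num` counter per character, and a minimum line count. The name `movie_dict_to_df_sketch`, the `min_lines` parameter, and the title-casing and lower-casing choices are assumptions inferred from the output, not the actual implementation.

```python
import pandas as pd

COLUMNS = ["movie", "character_name", "line_num", "line"]


def movie_dict_to_df_sketch(script_dict, movie, min_lines=100):
    """Flatten {speaker: [lines]} into one row per line.

    Hypothetical stand-in for movie_dict_to_df(); only characters with
    more than min_lines lines are kept.
    """
    rows = []
    for speaker, lines in script_dict.items():
        if len(lines) <= min_lines:
            continue  # too few lines to be useful for modelling
        for line_num, line in enumerate(lines):
            rows.append(
                {
                    "movie": movie.title(),
                    "character_name": speaker.title(),
                    "line_num": line_num,
                    "line": line.lower(),
                }
            )
    # Passing columns explicitly keeps the layout even when rows is empty.
    return pd.DataFrame(rows, columns=COLUMNS)
```

Unlike the output shown above, this sketch does not strip parentheticals such as `(O.S.)` from the start of a line; that cleanup would need an extra step.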


## Step 4: Combining All Scripts Into One DataFrame
Now that we have our functions working properly for one file, we can apply them to all of our movie script files. To do so, we will need to read in all of the files in the `moviescripts` folder, apply each of the above functions to each file, and then combine them into one dataframe. The `movies_to_df` function does just that.

In [14]:
movies_df = movies_to_df('moviescripts/')
movies_df.head()

Unnamed: 0 | movie | character_name | line_num | line
--- | --- | --- | --- | ---
404 | American Psycho | Bateman | 0 | we're sitting in pastels, this nouvelle north...
405 | American Psycho | Bateman | 1 | you'll notice that my friends and i all look...
406 | American Psycho | Bateman | 2 | or can it be worn with a suit?
407 | American Psycho | Bateman | 3 | with discreet pinstripes you should wear a sub...
408 | American Psycho | Bateman | 4 | van patten looks puffy. has he stopped working...
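The body of `movies_to_df` is not shown in this notebook. The general pattern it describes (loop over the files in a folder, build one dataframe per file, then stack them) can be sketched as follows. The name `movies_to_df_sketch` and the `file_to_df` callback are hypothetical, as is the assumption that every script ends in `.txt`.

```python
from pathlib import Path

import pandas as pd


def movies_to_df_sketch(folder, file_to_df):
    """Apply file_to_df to every .txt file in folder and stack the results.

    file_to_df is a hypothetical callable that takes a Path and returns
    the dataframe for that one script (steps 1 to 3 chained together).
    """
    paths = sorted(Path(folder).glob("*.txt"))
    frames = [file_to_df(path) for path in paths]
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame.
        return pd.DataFrame(columns=["movie", "character_name", "line_num", "line"])
    # ignore_index=True renumbers the rows 0..n-1 across all movies.
    return pd.concat(frames, ignore_index=True)
```

Collecting the per-file frames in a list and calling `pd.concat` once is generally cheaper than growing a dataframe inside the loop.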


We can use `movies_df` to see which characters in these scripts had more than 100 lines.

In [15]:
movies_df['character_name'].unique()

array(['Bateman', 'Nick Fury', 'Tony', 'Lloyd', 'Harry', 'Marlin', 'Nemo',
       'Dory', 'Double White', 'Harold', 'Kumar', 'Indy', 'Int', 'Will',
       'Ext', 'Angle On', 'Close On', 'Sam', 'Frodo', 'Gandalf',
       'Master Mind', 'Roxanne', 'Bella', 'Edward'], dtype=object)

Some of these names are not characters at all: 'Int' and 'Ext' come from scene headings, and 'Angle On' and 'Close On' are camera directions. They were captured as speakers by mistake, so we will remove them before creating our .csv file.

In [16]:
exclude = ['Ext', 'Angle On', 'Close On', 'Int']
movies_df2 = movies_df[~movies_df['character_name'].isin(exclude)].copy(deep=True)
movies_df2['character_name'].unique()

array(['Bateman', 'Nick Fury', 'Tony', 'Lloyd', 'Harry', 'Marlin', 'Nemo',
       'Dory', 'Double White', 'Harold', 'Kumar', 'Indy', 'Will', 'Sam',
       'Frodo', 'Gandalf', 'Master Mind', 'Roxanne', 'Bella', 'Edward'],
      dtype=object)

Finally, we tighten the threshold and keep only characters with more than 250 lines.

In [22]:
counts = movies_df2['character_name'].value_counts()
filtered_df = movies_df2[movies_df2['character_name'].isin(counts[counts > 250].index)]
filtered_df['character_name'].unique()

array(['Bateman', 'Lloyd', 'Harry', 'Marlin', 'Dory', 'Harold', 'Kumar',
       'Master Mind', 'Bella'], dtype=object)
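The count-based filter in the cell above combines `value_counts()` with a boolean mask. To make the mechanics easier to follow, the sketch below wraps the same idea in a small helper and runs it on a toy dataframe. The helper name `keep_frequent_characters`, the `min_lines` parameter, and the toy data are made up for illustration.

```python
import pandas as pd


def keep_frequent_characters(df, min_lines):
    """Keep rows whose character has more than min_lines rows in df."""
    # value_counts() returns a Series indexed by character name.
    counts = df["character_name"].value_counts()
    # Names whose count clears the threshold.
    frequent = counts[counts > min_lines].index
    # isin() builds the boolean mask used to select rows.
    return df[df["character_name"].isin(frequent)]


toy = pd.DataFrame(
    {
        "character_name": ["A", "A", "A", "B", "C", "C"],
        "line": ["a1", "a2", "a3", "b1", "c1", "c2"],
    }
)
kept = keep_frequent_characters(toy, min_lines=1)
print(kept["character_name"].unique().tolist())  # ['A', 'C']
```

Character `B` has only one line, so it is dropped, while `A` and `C` stay with all of their rows.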

## Step 5: Exporting as CSV
Now that we have our data set, we can export it to our `data/` folder as a CSV file.

In [23]:
filtered_df.to_csv('data/moviedata.csv', index=False)
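The `index=False` argument matters when the file is read back later. A column named `Unnamed: 0`, like the one in the outputs above, is usually a sign that a dataframe was written to CSV with its index and then reloaded. The sketch below shows the difference on a one-row toy frame, using an in-memory buffer instead of a real file; the toy data and variable names are illustrative only.

```python
import io

import pandas as pd

toy = pd.DataFrame({"movie": ["Megamind"], "line": ["hi"]}, index=[788])

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'movie', 'line']
print(cols_without_index)  # ['movie', 'line']
```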