# Creating a function to split scripts
I want a function that splits a script into a dictionary, with each speaker's name as the key and all of that speaker's lines joined into a single string as the value.

* **Input**: a string containing a full movie script with this format:

```
    SPEAKER
    Hello, my name is speaker

    SPEAKER2
    Hi, my name is speaker2

    SPEAKER
    Hi speaker2
```
* **Output**: a dictionary that looks like this:

```
    {'SPEAKER': 'Hello, my name is speaker. Hi speaker2', 
     'SPEAKER2': 'Hi, my name is speaker2.'}
```

All of the functions can be found in the `splitfuncts.py` file; each one is walked through here.

## Importing the scripts
First, I'll make a function to easily import a script from the moviescripts folder as a string. We'll use the Megamind script as a sample.
The `read_script` function is simple: it just reads the text file in as a string.

In [1]:
import re
import os
import pandas as pd
from splitfuncts import *

megamind = read_script('megamind.txt')
megamind[1:200]  # preview; note the slice starts at index 1, not 0


'MEGAMIND\n\n\n\nWritten by\n\nAlan Schoolcraft & Brent Simons\n\n\n\n\nCREDITS SEQUENCE\n\nNEWSPAPER HEADLINE MONTAGE:\n\nHEADLINES flash before us, displaying their accompanying\nphotographs.\n\n"UBERMAN - METRO CITY'
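
For reference, here is a minimal sketch of what a reader like `read_script` could look like. The real version lives in `splitfuncts.py` and may differ; the name `read_script_sketch`, the `folder` parameter, and the UTF-8 encoding are assumptions made for illustration.

```python
from pathlib import Path


def read_script_sketch(filename, folder="moviescripts"):
    """Return the full text of one script file as a single string.

    Hypothetical stand-in for read_script; the folder name and the
    encoding are assumptions, not taken from splitfuncts.py.
    """
    path = Path(folder) / filename
    # errors="replace" swaps undecodable bytes for a placeholder character
    # instead of raising UnicodeDecodeError on messy script files.
    return path.read_text(encoding="utf-8", errors="replace")
```

Calling `read_script_sketch('megamind.txt')` would then return the whole file as one string, ready for slicing as in the cell above.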

## Script-to-dictionary function
Now that we've covered importing, we need to convert the script into a dictionary as outlined above. The `script_to_dict` function contains the regex that splits the script by speaking character; it outputs a dictionary with each character as the key and all of their lines as one string.

In [2]:
megamind_dict = script_to_dict(megamind)
megamind_dict['MASTER MIND'][1:200] # looks right; the slice starts at index 1, so the first character is dropped

'he real Einstein once said, "God does not play dice with the world." He was right, because the world is MY dice. Is that understood? Alright, then - clean slate. Do we have the girl? Reporters are a '
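
To make the idea concrete, here is a rough sketch of one way to split a script by speaker. It is not the regex from `splitfuncts.py`. It assumes the idealized format from the top of this page: a cue line holding only an upper-case name, the speech on the lines right after it, and a blank line ending the speech. The names `CUE` and `script_to_dict_sketch` are made up for this example.

```python
import re
from collections import defaultdict

# Assumed cue format: a line containing only an upper-case name
# (letters, digits, spaces, periods, apostrophes, hyphens).
CUE = re.compile(r"^[ \t]*([A-Z][A-Z0-9 .'\-]*)[ \t]*$")


def script_to_dict_sketch(script):
    """Map each speaker to all of their lines joined by single spaces."""
    lines_by_speaker = defaultdict(list)
    speaker = None
    for raw in script.splitlines():
        text = raw.strip()
        if not text:
            speaker = None  # a blank line ends the current speech
            continue
        match = CUE.match(raw)
        if match:
            speaker = match.group(1).strip()  # start of a new speech
        elif speaker is not None:
            lines_by_speaker[speaker].append(text)
    return {name: " ".join(parts) for name, parts in lines_by_speaker.items()}


sample = """
    SPEAKER
    Hello, my name is speaker

    SPEAKER2
    Hi, my name is speaker2

    SPEAKER
    Hi speaker2
"""
print(script_to_dict_sketch(sample))
```

A known weakness of this approach is that any all-caps line directly followed by text (a scene heading, a shouted one-word line) is treated as a speaker, so real scripts will need extra cleanup.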

## Generating the rest of the movies
Now that we've seen it works on Megamind, let's put the functions all together and try it for everyone else!

**Note**: there's probably a faster way of doing this (can you iterate through a folder?), so maybe ask about that. Perhaps make a list of the file names and iterate through that...

In [3]:
#addams_family = read_script('addams_family.txt')
#movie_list = ['american-psycho.txt', 'avengers.txt', 'dumb_and_dumber.txt', 'finding_nemo.txt', 
#              'harold_kumar_white_castle.txt', 'indiana_jones_raiders.txt', 'it.txt', 'lord_of_rings_return.txt',
#              'twilight.txt']
#american_psycho_df = file_to_df('american-psycho.txt')
avengers_df = file_to_df('avengers.txt')
dumb_and_dumber_df = file_to_df('dumb_and_dumber.txt')
finding_nemo_df = file_to_df('finding_nemo.txt')
harold_kumar_df = file_to_df('harold_kumar_white_castle.txt')
#harry_potter = read_script('harry_potter_chamber.txt')
indiana_jones_df = file_to_df('indiana_jones_raiders.txt')
it_df = file_to_df('it.txt')
lord_rings_df = file_to_df('lord_of_rings_return.txt')
#titanic = read_script('titanic.txt')
twilight_df = file_to_df('twilight.txt')
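
On the question of iterating through a folder: yes, `pathlib` can list the files, so the repeated calls above could be replaced with a loop. This is only a sketch; `build_movie_frames` is a hypothetical helper, and it assumes the loader (for example `file_to_df`) accepts a bare file name, as in the cell above.

```python
from pathlib import Path


def build_movie_frames(folder, loader, skip=()):
    """Apply `loader` to every .txt file in `folder`.

    Returns a dict keyed by file name without the extension, e.g. 'avengers'.
    File names listed in `skip` are left out.
    """
    frames = {}
    for path in sorted(Path(folder).glob("*.txt")):
        if path.name in skip:
            continue
        frames[path.stem] = loader(path.name)
    return frames


# Possible usage (not run here; assumes the scripts live in 'moviescripts'):
# frames = build_movie_frames(
#     "moviescripts",
#     file_to_df,
#     skip={"addams_family.txt", "harry_potter_chamber.txt", "titanic.txt"},
# )
# avengers_df = frames["avengers"]
```

Keeping the results in a dictionary instead of one variable per movie also makes the later merge step a one-liner over `frames.values()`.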




Upon further inspection, Addams Family, Harry Potter, and Titanic are not formatted the way the function expects, so it will not work on them. TODO: find replacements or fix the formatting somehow!

Now I will merge all of the dataframes, taking the top 7 speakers from each movie with `head(7)`. We will remove the weird ones manually.
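
One thing to keep in mind: `head(7)` only returns the top speakers if each dataframe is already sorted by `lines_len`, largest first. If that is not guaranteed, `DataFrame.nlargest` picks the top rows directly. A small sketch, where `top_speakers` is a hypothetical helper and the toy data is made up:

```python
import pandas as pd


def top_speakers(df, n=7, by="lines_len"):
    """Return the n rows with the largest `by` value, largest first."""
    return df.nlargest(n, by)


# Toy, unsorted data purely for illustration.
demo = pd.DataFrame({
    "character_name": ["A", "B", "C"],
    "lines_len": [5, 30, 20],
})
print(top_speakers(demo, n=2))
```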

In [13]:
movie_dfs_list = [
    avengers_df.head(7), dumb_and_dumber_df.head(7), finding_nemo_df.head(7),
    harold_kumar_df.head(7), indiana_jones_df.head(7), it_df.head(7),
    lord_rings_df.head(7), twilight_df.head(7),
]
movies_df = pd.concat(movie_dfs_list)

# Boolean mask: True where a row's lines_len equals the maximum for its movie
idx = movies_df.groupby('movie_name')['lines_len'].transform('max') == movies_df['lines_len']
movies_df[idx]  # the speaker with the most text in each movie

| Unnamed: 0 | movie_name | character_name | lines | lines_len |
|---|---|---|---|---|
| 0 | avengers | TONY | You're good on this end. The rest is up to you... | 11498 |
| 0 | dumb_and_dumber | LLOYD | Excuse me, can you tell me how to get to the m... | 22037 |
| 0 | finding_nemo | MARLIN | Wow. Wow. Wow. So, Coral, when you said you wa... | 17775 |
| 0 | harold_kumar_white_castle | KUMAR | Mononucleosis or mono is an infection caused b... | 37536 |
| 0 | indiana_jones_raiders | I | n the undergrowth, there is slithering movemen... | 40450 |
| 0 | it | T | hey wave at each other. Richie, bug-eyed glass... | 16290 |
| 0 | lord_of_rings_return | ANGLE ON | : SMEAGOL and his cousin, DEAGOL, sit in a SMA... | 26217 |
| 0 | twilight | B | ased on the novel by ut dying in the place of ... | 11456 |
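
Entries such as `I`, `T`, `B`, and `ANGLE ON` are clearly not real speakers, so they are the "weird ones" to remove. A minimal sketch of the manual removal, assuming the `character_name` column shown above; `drop_characters` is a hypothetical helper and the demo rows are toy values:

```python
import pandas as pd


def drop_characters(df, bad_names, column="character_name"):
    """Return a copy of df without rows whose name is in bad_names."""
    keep = ~df[column].isin(bad_names)
    return df[keep].reset_index(drop=True)


# Toy rows for illustration; the lengths for RICHIE are made up.
demo_movies = pd.DataFrame({
    "movie_name": ["it", "it", "twilight"],
    "character_name": ["T", "RICHIE", "B"],
    "lines_len": [16290, 900, 11456],
})
cleaned = drop_characters(demo_movies, {"I", "T", "B", "ANGLE ON"})
print(cleaned)
```

Because the list of bad names is written by hand, it has to be checked per movie: a single letter could be a legitimate character name in some scripts.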
