# Creating a function to split scripts
I want a function that splits each script into a dictionary, with the speaker's name as the key and all of that speaker's lines joined into one string as the value.

* **Input**: a string containing a full movie script with this format:

```
    SPEAKER
    Hello, my name is speaker

    SPEAKER2
    Hi, my name is speaker2

    SPEAKER
    Hi speaker2
```
* **Output**: a dictionary that looks like this:

```
    {'SPEAKER': 'Hello, my name is speaker. Hi speaker2', 
     'SPEAKER2': 'Hi, my name is speaker2.'}
```

All of the functions can be found in the `splitfuncts.py` file; each one is walked through here.

## Importing the scripts
First, I'll make a function to easily import a script from the `moviescripts` folder as a string. We'll use the Megamind script as a sample.
The `read_script` function is simple: it just reads the text file in as a string.
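
The real `read_script` lives in `splitfuncts.py` and isn't reproduced here. As a rough idea of what such a helper might look like, here is a minimal sketch. The name `read_script_sketch`, the `folder` parameter, and the UTF-8 encoding are all assumptions for illustration, not the actual implementation.

```python
from pathlib import Path


def read_script_sketch(filename, folder="moviescripts"):
    """Return the full text of one script file as a single string.

    Hypothetical stand-in for read_script; the folder default is assumed.
    """
    path = Path(folder) / filename
    # errors="replace" keeps reading even if a script has stray non-UTF-8 bytes
    return path.read_text(encoding="utf-8", errors="replace")
```

Under those assumptions, calling `read_script_sketch('megamind.txt')` would return the whole script as one string, newlines included.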

In [1]:
import re
import os
import pandas as pd
from splitfuncts import *

megamind = read_script('megamind.txt')
megamind[1:200]


'MEGAMIND\n\n\n\nWritten by\n\nAlan Schoolcraft & Brent Simons\n\n\n\n\nCREDITS SEQUENCE\n\nNEWSPAPER HEADLINE MONTAGE:\n\nHEADLINES flash before us, displaying their accompanying\nphotographs.\n\n"UBERMAN - METRO CITY'

## Script-to-dictionary function
Now that importing is covered, we need to convert each script into a dictionary as outlined above. The `script_to_dict` function contains the regex that splits the script by speaking character, and it returns a dictionary with each character as a key and all of their lines as one string.
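
The actual regex isn't shown in this walkthrough, so the following is only a sketch of one way the split could work, not the code in `splitfuncts.py`. It assumes a speaker cue is an all-caps line that follows a blank line, and that a blank line ends the speech. The names `SPEAKER_CUE` and `script_to_dict_sketch` are made up for this example.

```python
import re
from collections import defaultdict

# Assumption: a speaker cue is a line holding only capitals, digits, spaces,
# and a little punctuation, e.g. "SPEAKER2" or "NICK FURY".
SPEAKER_CUE = re.compile(r"^[A-Z][A-Z0-9 .'\-]*$")


def script_to_dict_sketch(script):
    """Group dialogue by speaker: {name: that speaker's lines joined by spaces}."""
    lines_by_speaker = defaultdict(list)
    current = None
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            current = None  # a blank line ends the current speech
        elif current is None and SPEAKER_CUE.match(line):
            current = line  # an all-caps line after a blank starts a speech
        elif current is not None:
            lines_by_speaker[current].append(line)
        # anything else (text with no active speaker) is skipped
    return {name: " ".join(parts) for name, parts in lines_by_speaker.items()}
```

Note that this sketch joins lines with a plain space and adds no punctuation. It also shares the weakness that shows up later in the real output: any all-caps line after a blank, such as a stage direction, is treated as a speaker.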

In [2]:
megamind_dict = script_to_dict(megamind)
megamind_dict['MASTER MIND'][:200] # perfect

'The real Einstein once said, "God does not play dice with the world." He was right, because the world is MY dice. Is that understood? Alright, then - clean slate. Do we have the girl? Reporters are a '

## Generating the rest of the movies
Now that we've seen it works on Megamind, let's put the functions all together and try it for everyone else!

To do this, I've created a function `movies_to_df()`, which takes the path of the scripts folder and returns the dataframe we want, holding the top 7 speakers in each movie!
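
The body of `movies_to_df()` isn't shown here, but the "top 7 speakers" step could be sketched as below. This is an illustration under assumptions: the helper name `top_speaker_rows` is hypothetical, and it assumes `lines_len` is the character count of a speaker's joined lines.

```python
def top_speaker_rows(movie_name, script_dict, n=7):
    """Return one row (as a dict) for each of the n speakers with the most text."""
    # Assumption: speakers are ranked by the character length of their lines.
    ranked = sorted(script_dict.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [
        {
            "movie_name": movie_name,
            "character_name": name,
            "lines": lines,
            "lines_len": len(lines),
        }
        for name, lines in ranked[:n]
    ]
```

Collecting these rows for every script in the folder and passing the combined list to `pd.DataFrame` would give a table with the same columns as the one shown in this section.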

In [3]:
movies_df = movies_to_df('moviescripts/')
movies_df.head()

| Unnamed: 0 | movie_name | character_name | lines | lines_len |
|---|---|---|---|---|
| 0 | avengers | TONY | You're good on this end. The rest is up to you... | 11492 |
| 1 | avengers | NICK FURY | How bad is it? NASA didn't authorize Selvig to... | 9327 |
| 2 | avengers | LOKI | You have heart. Loki points the head of his sp... | 6742 |
| 3 | avengers | T | his doesn't have to get any messier. he Tesser... | 5549 |
| 4 | avengers | BANNER | Calm down. What's wrong? Is he like them? The ... | 5276 |


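Now, just out of curiosity, let's look at the top speaker of each movie, meaning the character with the largest `lines_len`. Notice that some of them are weird: a few are stage directions that were captured as speakers (like `ANGLE ON`), and others are single letters split off the start of a sentence (like `T` or `I`). We'll remove those manually, or find a regex to catch them if we have time.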

In [4]:
# True for rows whose lines_len equals the maximum within their movie
idx = movies_df.groupby('movie_name')['lines_len'].transform('max') == movies_df['lines_len']
movies_df[idx] # the speaker with the most text in each movie

| Unnamed: 0 | movie_name | character_name | lines | lines_len |
|---|---|---|---|---|
| 0 | avengers | TONY | You're good on this end. The rest is up to you... | 11492 |
| 0 | dumb_and_dumber | LLOYD | Excuse me, can you tell me how to get to the m... | 22037 |
| 0 | lord_of_rings_return | ANGLE ON | : SMEAGOL and his cousin, DEAGOL, sit in a SMA... | 26217 |
| 0 | american-psycho | BATEMAN | (V.O.) We're sitting in Pastels, this nouvelle... | 33267 |
| 0 | indiana_jones_raiders | I | n the undergrowth, there is slithering movemen... | 40450 |
| 0 | finding_nemo | MARLIN | Wow. Wow. Wow. So, Coral, when you said you wa... | 17775 |
| 0 | twilight | B | ased on the novel by ut dying in the place of ... | 11444 |
| 0 | megamind | MASTER MIND | The real Einstein once said, "God does not pla... | 16424 |
| 0 | harold_kumar_white_castle | KUMAR | Mononucleosis or mono is an infection caused b... | 37536 |
| 0 | it | T | hey wave at each other. Richie, bug-eyed glass... | 16290 |
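
The odd names in this output (`ANGLE ON`, `I`, `B`, `T`) suggest a simple filter could catch many of them. Below is a minimal sketch of such a check. The function name `looks_like_speaker` and the list of stage directions in `STAGE_DIRECTIONS` are assumptions for illustration and would need tuning against the real scripts.

```python
import re

# Hypothetical list of screenplay directions; extend it as new ones turn up.
STAGE_DIRECTIONS = re.compile(
    r"^(ANGLE ON|CLOSE ON|CUT TO|FADE (IN|OUT)|INT\.|EXT\.)"
)


def looks_like_speaker(name):
    """Heuristic: reject one-letter names and known stage directions."""
    name = name.strip()
    if len(name) < 2:
        return False  # e.g. 'T', 'I', 'B' split off the start of a sentence
    return STAGE_DIRECTIONS.match(name) is None
```

One way to apply it would be `movies_df[movies_df['character_name'].map(looks_like_speaker)]`, which keeps only the rows whose name passes the check. Since it's a heuristic, it's worth eyeballing what gets dropped before trusting it.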


In [5]:
movies_df = movies_df.drop(columns=['lines_len'])
# index=False avoids writing the row index as an extra unnamed column
movies_df.to_csv('moviedata.csv', index=False)