# Creating a function to split scripts
I want a function that splits a script into a dictionary, with each speaker's name as the key and all of that speaker's lines joined into a single string as the value.

* **Input**: a string containing a full movie script with this format:

```
    SPEAKER
    Hello, my name is speaker

    SPEAKER2
    Hi, my name is speaker2

    SPEAKER
    Hi speaker2
```
* **Output**: a dictionary that looks like this:

```
    {'SPEAKER': 'Hello, my name is speaker Hi speaker2',
     'SPEAKER2': 'Hi, my name is speaker2'}
```

All of the functions can be found in the `splitfuncts.py` file; each one is walked through here.

## Importing the scripts
First, I'll make a function that reads a script from the `moviescripts` folder into a string. We'll use the Megamind script as a sample.
The `read_script` function is simple: it reads the text file and returns its contents as one string.
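
For reference, here is a minimal sketch of what such a reader could look like. It is not the code from `splitfuncts.py`: the name `read_script_sketch`, the `folder` parameter, and the choice of UTF-8 with replacement of undecodable bytes are all assumptions made for illustration.

```python
from pathlib import Path


def read_script_sketch(filename, folder="moviescripts"):
    """Return the contents of folder/filename as a single string.

    Hypothetical stand-in for read_script; the default folder and the
    encoding handling are assumptions, not taken from splitfuncts.py.
    """
    path = Path(folder) / filename
    # errors="replace" keeps the read from failing on stray bytes in scraped scripts
    return path.read_text(encoding="utf-8", errors="replace")
```

Calling `read_script_sketch('megamind.txt')` would then return the whole script as one string, ready for slicing or regex work.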

In [1]:
import re
import os
import pandas as pd
from splitfuncts import read_script, script_to_dict, movies_to_df

megamind = read_script('megamind.txt')
megamind[1:200]


'MEGAMIND\n\n\n\nWritten by\n\nAlan Schoolcraft & Brent Simons\n\n\n\n\nCREDITS SEQUENCE\n\nNEWSPAPER HEADLINE MONTAGE:\n\nHEADLINES flash before us, displaying their accompanying\nphotographs.\n\n"UBERMAN - METRO CITY'

## Script-to-dictionary function
Now that importing is covered, we need to convert a script into a dictionary as outlined above. The `script_to_dict` function contains the regex that splits the script by speaking character. It returns a dictionary with each character as the key and all of their lines as one string.
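
To make the idea concrete, here is a rough sketch of a regex-based split on the toy format from the top of this page. It is not the regex used in `splitfuncts.py`. The names `SPEAKER_RE` and `script_to_dict_sketch` are hypothetical, and it assumes a speaker cue is a line made only of capital letters, digits, spaces, and a little punctuation, and that a speaker's lines are joined with a single space.

```python
import re
from collections import defaultdict

# Assumed cue format: a line containing only capitals, digits, spaces, . ' or -
SPEAKER_RE = re.compile(r"^[ \t]*([A-Z][A-Z0-9 .'\-]*?)[ \t]*$", re.MULTILINE)


def script_to_dict_sketch(script):
    """Map each speaker to all of their dialogue, joined into one string."""
    # With one capture group, re.split returns
    # [preamble, speaker1, text1, speaker2, text2, ...]
    parts = SPEAKER_RE.split(script)
    chunks = defaultdict(list)
    for speaker, text in zip(parts[1::2], parts[2::2]):
        dialogue = " ".join(text.split())  # collapse newlines and indentation
        if dialogue:
            chunks[speaker].append(dialogue)
    return {speaker: " ".join(lines) for speaker, lines in chunks.items()}


sample = """
    SPEAKER
    Hello, my name is speaker

    SPEAKER2
    Hi, my name is speaker2

    SPEAKER
    Hi speaker2
"""

print(script_to_dict_sketch(sample))
# {'SPEAKER': 'Hello, my name is speaker Hi speaker2', 'SPEAKER2': 'Hi, my name is speaker2'}
```

A pattern this loose also treats scene headings and camera directions written in capitals as speakers, so some cleanup of the resulting keys is still needed on real scripts.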

In [2]:
megamind_dict = script_to_dict(megamind)
megamind_dict['MASTER MIND'][:200]  # first 200 characters; looks right

'The real Einstein once said, "God does not play dice with the world." He was right, because the world is MY dice. Is that understood? Alright, then - clean slate. Do we have the girl? Reporters are a '

## Generating the rest of the movies
Now that we've seen it work on Megamind, let's put the functions together and run them on the rest of the scripts!

To do this, I've created a function `movies_to_df()`, which takes the path to the movie folder and returns a DataFrame with the top 5 speakers in each movie, along with their lowercased lines and the length of those lines.
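
The core of that step can be sketched as follows. This is not the implementation in `splitfuncts.py`: the helper name `top_speakers_df` is hypothetical, it starts from already-built speaker dictionaries rather than a folder path, and it assumes `lines_len` is the character count of the lowercased lines.

```python
import pandas as pd


def top_speakers_df(movie_dicts, top_n=5):
    """Build one DataFrame holding the top_n speakers of every movie.

    movie_dicts maps a movie name to its {character: lines} dictionary.
    Hypothetical helper; lines_len as a character count is an assumption.
    """
    frames = []
    for movie_name, speaker_lines in movie_dicts.items():
        movie_df = pd.DataFrame({
            "movie_name": movie_name,
            "character_name": list(speaker_lines.keys()),
            "lines": list(speaker_lines.values()),
        })
        movie_df["lines"] = movie_df["lines"].str.lower()
        movie_df["lines_len"] = movie_df["lines"].str.len()
        # keep only the rows with the largest lines_len for this movie
        frames.append(movie_df.nlargest(top_n, "lines_len"))
    return pd.concat(frames)
```

Because each movie's rows keep their own index, the combined frame has repeated index values, much like the output below.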

In [3]:
movies_df = movies_to_df('moviescripts/')
movies_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movie_df['lines'] = movie_df['lines'].str.lower()

|   | movie_name | character_name | lines | lines_len |
|---|---|---|---|---|
| 0 | avengers | TONY | you're good on this end. the rest is up to you... | 11492 |
| 1 | avengers | NICK FURY | how bad is it? nasa didn't authorize selvig to... | 9327 |
| 2 | avengers | LOKI | you have heart. loki points the head of his sp... | 6742 |
| 4 | avengers | BANNER | calm down. what's wrong? is he like them? the ... | 5276 |
| 5 | avengers | THOR | where is the tesseract? do i look to be in a g... | 5084 |
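
The warning printed above the table comes from the line `movie_df['lines'] = movie_df['lines'].str.lower()`. pandas emits it when a column is assigned on a DataFrame that was itself produced by slicing another one, because it cannot tell whether the change is meant to reach the original. A common way to silence it is to take an explicit copy before assigning. The sketch below assumes `movie_df` was created by filtering a larger frame; the frame `all_speakers` and its contents are made up for illustration.

```python
import pandas as pd

# Made-up data standing in for one movie's speakers
all_speakers = pd.DataFrame({
    "character_name": ["TONY", "ANGLE ON", "LOKI"],
    "lines": ["You're good on this end.", "Smeagol and his cousin.", "You have heart."],
})

# Boolean filtering returns a new object that pandas may still track as a
# slice of all_speakers; .copy() makes movie_df fully independent.
movie_df = all_speakers[all_speakers["character_name"] != "ANGLE ON"].copy()

# Assigning on the copy is unambiguous and leaves all_speakers untouched
movie_df["lines"] = movie_df["lines"].str.lower()

print(movie_df["lines"].tolist())
```

The result is the same either way; the copy only makes the intent explicit.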


Now, just out of curiosity, let's look at the top speaker of each movie. 

In [4]:
# True for each row whose lines_len equals the maximum within its movie
idx = movies_df.groupby('movie_name')['lines_len'].transform('max') == movies_df['lines_len']
movies_df[idx]  # the top speaker of each movie

|   | movie_name | character_name | lines | lines_len |
|---|---|---|---|---|
| 0 | avengers | TONY | you're good on this end. the rest is up to you... | 11492 |
| 0 | dumb_and_dumber | LLOYD | excuse me, can you tell me how to get to the m... | 22037 |
| 0 | lord_of_rings_return | ANGLE ON | : smeagol and his cousin, deagol, sit in a sma... | 26217 |
| 0 | american-psycho | BATEMAN | (v.o.) we're sitting in pastels, this nouvelle... | 33267 |
| 3 | indiana_jones_raiders | INDY | no. we don't need them. we'll leave them. once... | 9810 |
| 0 | finding_nemo | MARLIN | wow. wow. wow. so, coral, when you said you wa... | 17775 |
| 1 | twilight | BELLA | (v.o.) i'd never given much thought to how i w... | 11439 |
| 0 | megamind | MASTER MIND | the real einstein once said, "god does not pla... | 16424 |
| 0 | harold_kumar_white_castle | KUMAR | mononucleosis or mono is an infection caused b... | 37536 |
| 4 | it | WILL | don't be such a wuss. i'd come if i weren't dy... | 7895 |


In [5]:
# 'ANGLE ON' and 'CLOSE ON' are camera directions, not characters, so drop those rows
camera_directions = ['ANGLE ON', 'CLOSE ON']
movies_df = movies_df[~movies_df['character_name'].isin(camera_directions)].drop(columns=['lines_len'])

movies_df.to_csv('moviedata.csv')