# Converting with the Art of Literary Text Analysis

Our objective here is to process a plain text file so that it is more suitable for analysis. In particular, we will take two _Godfather_ screenplays and remove the stage directions. Here are the steps:

* fetch the two screenplays
* extract the screenplay text from the files
* remove the stage directions

Since we're doing this for two files, we will introduce the concept of reusable functions. We've already used functions in Python; here we define our own for the first time and use them. The basic syntax is simple:

    def function_name(arguments):
        # processing
        # return a value (usually)
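
As a concrete instance of this syntax, here is a minimal sketch of a function that counts the words in a string. The name `count_words` and the sample sentence are illustrative assumptions, not part of the screenplay code that follows.

```python
# hypothetical example: count the whitespace-separated words in a string
def count_words(text):
    words = text.split()  # split on any run of whitespace
    return len(words)     # return the number of words

count_words("America has made my fortune.")  # returns 5
```

Once defined, the function can be called as often as needed with different arguments, which is what makes it reusable.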
    
We can start by defining our function to fetch a URL, building on the materials we saw with [Scraping](Scraping.ipynb).

In [64]:
import urllib.request

# this function simply fetches the contents of a URL
def fetch(url):
    response = urllib.request.urlopen(url) # open for reading
    return response.read() # read and return

In [65]:
godfatherUrl = "https://www.imsdb.com/scripts/Godfather.html" # URL to use
godfatherSource = fetch(godfatherUrl) # fetch URL
godfatherSource[0:80] # preview

b'<html>\r\n<head><title>Godfather Script at IMSDb.</title>\r\n<meta name="description'
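
The `b` prefix in the preview shows that `fetch` returns bytes rather than a string. If a string is needed, the bytes can be decoded, as in the sketch below. The helper name `to_text` is hypothetical, and the UTF-8 encoding is an assumption: the page's actual encoding should be checked before relying on it.

```python
# hypothetical helper: turn fetched bytes into a string
# assumption: the source is UTF-8; undecodable bytes are replaced, not raised
def to_text(source, encoding="utf-8"):
    return source.decode(encoding, errors="replace")

sample = b'<html>\r\n<head><title>Godfather Script at IMSDb.</title>'  # bytes like those previewed above
preview = to_text(sample)[0:6]  # '<html>'
```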

In [66]:
from bs4 import BeautifulSoup

# this function extracts the text from the Godfather screenplays
def extract(source):
    soup = BeautifulSoup(source) # parse the source document
    return soup.find("pre").find("pre").text.strip() # return the plain text (no tags)
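
To make the idea behind the two chained `find("pre")` calls easier to see, here is a rough sketch of a similar extraction that uses only Python's standard-library `html.parser` instead of BeautifulSoup. It is a swapped-in technique, not the method used in this notebook: the names `PreTextParser` and `extract_stdlib` and the sample markup are hypothetical, it expects a string rather than bytes, and it collects the text of every `<pre>` nested inside another `<pre>` rather than only the first one.

```python
from html.parser import HTMLParser

# hypothetical parser: collect text found inside a <pre> nested within another <pre>
class PreTextParser(HTMLParser):
    def __init__(self, min_depth=2):
        super().__init__()
        self.min_depth = min_depth  # how many <pre> levels deep the text must be
        self.depth = 0              # current <pre> nesting level
        self.parts = []             # collected pieces of text

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth >= self.min_depth:
            self.parts.append(data)

def extract_stdlib(source):
    parser = PreTextParser()
    parser.feed(source)   # source must be a str, not bytes
    parser.close()
    return "".join(parser.parts).strip()

# made-up markup imitating a screenplay wrapped in two <pre> elements
sample = "<html><body><pre>site navigation<pre>\tTHE GODFATHER\n\n\tScreenplay</pre></pre></body></html>"
extract_stdlib(sample)
```

The text in the outer `<pre>` is skipped because it is only one level deep, which mirrors why the notebook's function looks for a `<pre>` inside a `<pre>`.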

In [67]:
godfatherText = extract(godfatherSource) # extract text from source
godfatherText[0:80] # preview

'THE GODFATHER\n\t_____________\n\n\tScreenplay\n\n\tby\n\n\tMARIO PUZO\n\n\tand\n\n\tFRANCIS FORD'

In [68]:
import re

directions = r'^\t?[^\t]' # regular expression matching lines that start with at most one tab (stage directions)

# this function cleans the text by collapsing runs of blank lines and dropping lines that start with at most one tab
def clean(text):
    lines = re.sub(r'\n\n+', "\n\n", text).split("\n") # collapse repeated new lines, then split into a list of lines
    return [line for line in lines if not re.match(directions, line)] # keep only the lines that don't match
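
To see which lines the `directions` pattern catches, it can help to try it on a few examples. The sample lines below are made up for illustration and are only assumed to resemble the screenplay's layout.

```python
import re

directions = r'^\t?[^\t]'  # same pattern as above

# made-up sample lines (assumption: they resemble the screenplay's layout)
samples = [
    "INT DAY: DON'S OFFICE",            # no leading tab
    "\tThe camera pulls back.",         # one leading tab
    "\t\tAmerica has made my fortune.", # two leading tabs
    "\t\t\t\tBONASERA",                 # four leading tabs
    "",                                 # empty line
]

# lines that the pattern matches, and lines that it does not
matched = [line for line in samples if re.match(directions, line)]
kept = [line for line in samples if not re.match(directions, line)]
```

Here `matched` holds the first two lines and `kept` holds the last three. Empty lines are kept because the pattern needs at least one non-tab character to match.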

In [72]:
godfather = clean(godfatherText) # clean text
godfather[0:20] # preview

['',
 '',
 '',
 '',
 '',
 '',
 '\t\t\t\t\t1 Gulf and Western Plaza',
 '',
 '',
 '',
 '\t\t\t\t  THE GODFATHER',
 '',
 '',
 '\t\t\t\tBONASERA',
 '\t\tAmerica has made my fortune.',
 '',
 '',
 '\t\t\t\tBONASERA',
 '\t\tI raised my daughter in the American',
 '\t\tfashion; I gave her freedom, but']

In [74]:
godfather2url = "https://www.imsdb.com/scripts/Godfather-Part-II.html"
godfather2 = clean(extract(fetch(godfather2url))) # call nested functions
godfather2[0:40] # preview

['',
 '\t\t\t\t Part Two',
 '',
 '\t\t\t\tScreenplay by',
 '',
 '\t\t\t\tMario Puzo',
 '',
 '\t\t\t\t    and',
 '',
 '\t\t\t Francis Ford Coppola',
 '',
 '',
 '',
 '',
 '',
 "\t\t     Mario Puzo's THE GODFATHER",
 '',
 '',
 '',
 '',
 '',
 '\t\t\t\t\t\t\t\tDISSOLVE TO:',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '\t\t\t\tWOMAN',
 '\t\t\t(Sicilian)',
 "\t\tThey've killed young Paolo!  They've",
 '\t\tkilled the boy Paolo!',
 '',
 '',
 '',
 '',
 '',
 '']
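
The nested call used for the second screenplay is evaluated from the inside out: the innermost function runs first and its result is passed to the next one. The sketch below shows this with stand-in functions. Their names and simplified behaviour are hypothetical and exist only so that the example runs without a network connection.

```python
# stand-in functions (hypothetical, greatly simplified versions of fetch, extract and clean)
def fetch_stub(url):
    return b"<pre>\tA stage direction.\n\t\tA line of dialogue.</pre>"

def extract_stub(source):
    return source.decode("utf-8").replace("<pre>", "").replace("</pre>", "")

def clean_stub(text):
    return [line for line in text.split("\n") if line.startswith("\t\t")]

# step by step, with a variable for each intermediate result
source = fetch_stub("https://example.com/script.html")  # placeholder URL, never actually fetched
text = extract_stub(source)
stepwise = clean_stub(text)

# the same thing as one nested call
nested = clean_stub(extract_stub(fetch_stub("https://example.com/script.html")))
```

Both forms give the same list. The nested form is more compact, while the step-by-step form makes it easier to preview each intermediate value.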

And there we are: we now have code to process our _Godfather_ screenplays. It's not perfect, but it's a great start!
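
One imperfection visible in the previews is the long runs of empty strings left behind once the stage directions are removed. A possible refinement, sketched here under the assumption that a single empty string between speeches is enough, is to collapse those runs. The function name `collapse_blanks` and the sample list are hypothetical.

```python
# hypothetical refinement: collapse runs of blank entries into a single empty string
def collapse_blanks(lines):
    result = []
    for line in lines:
        is_blank = line.strip() == ""
        # skip a blank entry at the very start or directly after another blank
        if is_blank and (not result or result[-1] == ""):
            continue
        result.append("" if is_blank else line)
    return result

# made-up sample in the same shape as the cleaned lists above
sample = ['', '', '', '\t\t\t\tBONASERA', '\t\tAmerica has made my fortune.', '', '', '\t\t\t\tBONASERA']
collapsed = collapse_blanks(sample)
```

In this sketch the leading blanks are dropped entirely and each later run is reduced to one empty string.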

---
[CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) From [The Art of Literary Text Analysis](ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com). <br>Created January 31, 2019 (Jupyter 5).