# Parser - Film scripts

This parser extracts the *spoken words* and the speaking *character's name* from movie scripts provided as HTML files.
The goal is a CSV file with columns for *character's name*, *spoken phrase*, and, optionally, *number of words*.


## Start of implementation

Rough algorithm plan:
- read file
- find start of script (only has to work for scripts from imsdb.com)
- skip scene descriptions
- find pattern for spoken words
- save character's name
- save phrase
- (optional) save number of words
- find next pattern
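
The plan above can be sketched as a small generator over the lines of a script. This is only an illustration under simplifying assumptions: the name `iter_dialogue`, the pattern `NAME_LINE`, and the sample lines are hypothetical, and the real imsdb.com markup may differ (for example in indentation or in how stage directions are marked).

```python
import re

# Assumption: a name line is '<b>', some indentation, then an upper-case name.
NAME_LINE = re.compile(r"^<b>\s+([A-Z][A-Z' .-]*?)\s*$")

def iter_dialogue(lines):
    """Yield (name, phrase, number_of_words) for each speech block."""
    it = iter(lines)
    for line in it:
        match = NAME_LINE.match(line)
        if not match:
            continue  # scene description or other markup: skip it
        name = match.group(1)
        parts = []
        # Collect phrase lines until a blank line or the end of the input.
        for nxt in it:
            text = nxt.replace('</b>', '').strip()
            if not text:
                break
            parts.append(text)
        phrase = ' '.join(parts)
        yield name, phrase, len(phrase.split())

# Hypothetical, shortened sample input (not taken from the real script file).
sample = [
    '<b>INT. HALLWAY\n',
    '</b>\n',
    'A robot walks by.\n',
    '\n',
    '<b>          THREEPIO\n',
    '</b>     Did you hear that?\n',
    '     We are doomed.\n',
    '\n',
]
rows = list(iter_dialogue(sample))
```

Because the inner loop reads from the same iterator as the outer loop, reaching the end of the input simply ends the parse instead of raising `StopIteration`.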

For this first test, only the first Star Wars movie, 'Star Wars: A New Hope', is analyzed.

## Implementation Setup

**This is used for all parser iterations**

**Current iteration: v02**

In [1]:
# Imports and Setup

import csv
import re

# Parameters for 'Star Wars: A New Hope'

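# Number of spaces between '<b>' and the CHARACTER NAME
wsChar = 37
# Number of spaces between '<b>' and additional speaking instructions
wsAdd = 30
# Number of spaces before the text of each phrase line
wsPhrase = 25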

In [2]:
# Create a CSV
header = ['name', 'phrase', 'numberOfWords']

data = open('./output/swiv.csv', 'w', encoding='UTF-8', newline='')

writer = csv.writer(data)

writer.writerow(header)

27
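
As far as this section shows, the file handle `data` is never closed, so buffered rows may not reach the disk until the kernel exits. A minimal sketch of one alternative, assuming a hypothetical helper `write_rows` that receives an already opened text file; the demo writes to an in-memory buffer instead of `./output/swiv.csv`:

```python
import csv
import io

def write_rows(target, rows):
    """Write the header plus (name, phrase, numberOfWords) rows to an open text file."""
    writer = csv.writer(target)
    writer.writerow(['name', 'phrase', 'numberOfWords'])
    writer.writerows(rows)

# In-memory demo. For a real file, a 'with' block closes the handle automatically:
# with open('./output/swiv.csv', 'w', encoding='UTF-8', newline='') as data:
#     write_rows(data, rows)
buffer = io.StringIO(newline='')
write_rows(buffer, [('LUKE', 'Hello there, friend', 3)])
csv_text = buffer.getvalue()
```

Phrases that contain commas are quoted by `csv.writer`, so they stay in a single column.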

## Parser v01

In [7]:
# Open file
with open('./filmScripts/Star-Wars-A-New-Hope.html', 'r') as script:
    for line in script:
        # Name line pattern: '<b>' + wsChar spaces + CHARACTER NAME
        if line.startswith('<b>' + wsChar * ' ') and not line[wsChar + 3].isspace():
            
            # Save name
            name = line[wsChar + 3 : -1]
            
            # Save phrase
            phrase = ''
            # The next line is either the beginning of the phrase or additional instructions on how to speak.
            line = next(script)
            # Check for additional instructions
            if line.startswith('<b>' + wsAdd * ' '):
                # if true: skip line
                line = next(script)
            # else: save to phrase
            else:
                phrase = phrase + line[wsPhrase + 4:-1]
            line = next(script)
            # All upcoming lines contain the phrase until an empty line is found
            while not line == '\n': 
                phrase = phrase + line[wsPhrase:-1]
                line = next(script)
            
            # Count words
            numberOfWords = len(phrase.split())
            
            # Write CSV row
            writer.writerow([name, phrase, numberOfWords])

## Parser improvements

So far the parser works and can combine names with phrases and word counts.

Points to improve:
- Empty lines appear in the CSV file (fixed)
- Better pattern recognition using RegEx
- Doesn't recognize stage directions in phrases, like '(whispers)'
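
One possible approach to the last point is to strip parenthesised fragments from a phrase before counting words. This is a sketch only: the helper name `strip_stage_directions` is hypothetical, and it assumes that everything in parentheses is a stage direction, which would also remove any parenthesised text that is actually spoken.

```python
import re

# Assumption: stage directions are short, non-nested parenthesised fragments.
STAGE_DIRECTION = re.compile(r'\([^()]*\)')

def strip_stage_directions(phrase):
    """Remove '(...)' fragments and collapse the leftover whitespace."""
    without = STAGE_DIRECTION.sub(' ', phrase)
    return ' '.join(without.split())

cleaned = strip_stage_directions('(whispers) I have a bad feeling (pause) about this.')
number_of_words = len(cleaned.split())
```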

### Regular Expressions

In [None]:
# Name line
# '^<b> {37}([a-zA-Z]{1,})'gm

# Phrase line
# '^ {25}'gm
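
The `gm` suffix in the notes above is JavaScript-style flag notation. In Python, the multiline behaviour comes from `re.MULTILINE`, and finding all matches is done with `findall` or `finditer`. A sketch of how the two patterns might look when applied to a whole text at once; the names `NAME_RE` and `PHRASE_RE` and the sample text are hypothetical, and the name pattern is widened (as an assumption) to allow spaces, apostrophes, and hyphens, since `[a-zA-Z]{1,}` stops at the first space:

```python
import re

# Name line: '<b>', exactly 37 spaces, then the name (captured in group 1).
NAME_RE = re.compile(r"^<b> {37}([A-Za-z][A-Za-z' -]*?)[ \t]*$", re.MULTILINE)
# Phrase line: exactly 25 spaces, then the text (captured in group 1).
PHRASE_RE = re.compile(r'^ {25}(\S.*)$', re.MULTILINE)

# Hypothetical sample text built with the indentation values from the setup cell.
text = (
    '<b>' + ' ' * 37 + 'RED LEADER\n'
    + ' ' * 25 + 'Look at the size of that thing!\n'
    + '<b>' + ' ' * 30 + '(shouting)\n'
)
names = NAME_RE.findall(text)
phrases = PHRASE_RE.findall(text)
```

When the script is read line by line, as in the parser cells, `re.match` on each line is enough and no flags are needed.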

## Parser v02

In [3]:
# Open file
with open('./filmScripts/Star-Wars-A-New-Hope.html', 'r') as script:
    for line in script:
        # Name line pattern: '<b>' + 37 spaces + CHARACTER NAME (now matched with a regular expression)
        if re.match(r'<b> {37}[a-zA-Z]', line):
            
            # Save name
            name = line[wsChar + 3 : -1]
            
            # Save phrase
            phrase = ''
            # The next line is either the beginning of the phrase or additional instructions on how to speak.
            line = next(script)
            # Check for additional instructions
            if line.startswith('<b>' + wsAdd * ' '):
                # if true: skip line
                line = next(script)
            # else: save to phrase
            else:
                phrase = phrase + line[wsPhrase + 4:-1]
            line = next(script)
            # All upcoming lines contain the phrase until an empty line is found
            while not line == '\n': 
                phrase = phrase + line[wsPhrase:-1]
                line = next(script)
            
            # Count words
            numberOfWords = len(phrase.split())
            
            # Write CSV row
            writer.writerow([name, phrase, numberOfWords])
