1. Web scraping (the script's HTML sits inside a pre tag)
2. Delete b tags (and all other tags), i.e. everything between angle brackets <>
3. Split the text into text blocks at blank lines (2x \n in a row) and store the text blocks in an array
4. Categorize the text blocks, identify their typical features, and define processing rules for each type of text block
5. Convert to CSV, implementing the processing rules as if conditions (each type of text block is processed separately)
6. Further automated data cleaning steps: replace question marks, delete certain text fragments such as (V.O.), etc.
7. Manual data cleaning: replace the remaining question marks, find stage directions that are attached directly to speaker text, and move them into a separate stage-direction row
8. Merge the 3 CSV files into one (with an additional column for the film number)
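
Steps 2 and 3 can be sketched on a small in-memory sample. This is a minimal illustration, not the notebook's exact code: the sample string and the helper name `split_into_blocks` are assumptions, and the separator pattern accepts both `\n\n` and `\r\n\r\n`.

```python
import re

def split_into_blocks(raw_html):
    """Strip simple tags and split the remaining text at blank lines."""
    # remove everything that looks like <tag> or </tag>
    text = re.sub(r'</?[a-zA-Z]+>', '', raw_html)
    # a blank line is two consecutive line breaks (\n\n or \r\n\r\n)
    return re.split(r'(?:\r?\n){2}', text)

# hypothetical sample, shaped like the scraped <pre> content
sample = "<pre><b>GALADRIEL (V.O.)</b>\n  But the power...\n\nIMAGES: THE RING\n</pre>"
blocks = split_into_blocks(sample)
print(blocks)
```

Each element of `blocks` is one text block that the later steps can classify and process.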

Examples of text blocks:

**1.) Text block of type "speaker text"**

>**GALADRIEL (V.O.) (CONT'D)**

          But the power of the Ring could not be
          undone.
          

Typical feature: starts centered (x spaces of indentation from the margin; the exact number of spaces varies from script to script!)

Processing: the text in the first line (the speaker's name) goes into the left column; the text from the second line onward goes into the right column
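
The processing rule for a speaker block can be sketched as follows. The function name `parse_speaker_block` is hypothetical, and the sketch assumes a non-empty block whose first line holds the speaker name.

```python
import re

def parse_speaker_block(block):
    """Return (speaker, text) for a block whose first line is the speaker name."""
    lines = block.strip().splitlines()
    speaker = lines[0].strip()
    # join the remaining lines and collapse runs of whitespace into one space
    text = re.sub(r'\s+', ' ', ' '.join(lines[1:])).strip()
    return speaker, text

block = (
    "                    GALADRIEL (V.O.) (CONT'D)\n"
    "          But the power of the Ring could not be\n"
    "          undone.\n"
)
speaker, text = parse_speaker_block(block)
print(speaker, "|", text)
```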

**2.) Text block of type "stage direction"** (scripts 1 and 3)

>IMAGES: THE HUGE, DARK FIGURE OF SARURON, bearing the ONE
RING on his finger, looms over the field of battle...

Typical feature: starts flush left (! except for text blocks that contain only **CONTINUED:**)

Processing: "Regie" is written into the left column and the text goes into the right column (words before the colon are deleted later)

**3.) Text block of type "stage direction"** (script 2)

>[The hall is shown to be filled with 
                         light again, as everyone marvels at 
                         the rejuvenation of the king.]

Typical feature: enclosed in []

Processing: same as 2. (left column: "Regie", right column: text without brackets)

**Advantage: any text for which no processing rule is defined is not included in the CSV.** Only the parts that are actually needed are extracted. For example, page numbers are ignored because they are right-aligned.
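
The three block types can be told apart with a few checks. The sketch below combines the rules for all three scripts in one hypothetical function, `classify_block`; the labels, the page-number pattern, and the order of the checks are assumptions for illustration, whereas the notebook handles scripts 1/3 and script 2 in separate functions.

```python
import re

def classify_block(block):
    """Classify a text block as 'speaker', 'direction', or None (ignored)."""
    stripped = block.strip()
    if not stripped or stripped.startswith('CONTINUED'):
        return None                      # empty blocks and CONTINUED markers
    if re.fullmatch(r'\d+\.?', stripped):
        return None                      # a bare page number
    if stripped.startswith('['):
        return 'direction'               # script 2: text in square brackets
    if re.match(r'[ \t]+\S', block):
        return 'speaker'                 # indented block: speaker name + text
    return 'direction'                   # scripts 1 and 3: flush-left text

print(classify_block("          GALADRIEL\n          But the power"))
print(classify_block("IMAGES: THE HUGE, DARK FIGURE"))
```

Blocks that return `None` are simply never written, which mirrors the advantage described above.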

## Goal: CSV file in the following format

| Nr. | Regie/Sprecher | Text | Film-Nr. |
| --- | --- | --- | --- |
| 1 | Gollum | My precious! | 1 |
| 2 | Regie | Gollum looks at the ring. | 1 |
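
A minimal sketch of writing rows in this format with Python's `csv` module, using `|` as the delimiter. The rows are the two examples from the table; writing into an in-memory buffer instead of a file is a choice made here to keep the example self-contained.

```python
import csv
import io

rows = [
    (1, "Gollum", "My precious!", 1),
    (2, "Regie", "Gollum looks at the ring.", 1),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|', lineterminator='\n')
writer.writerow(["Nr.", "Sprecher/Regie", "Text", "Filmnr."])
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Unlike hand-built f-strings, `csv.writer` quotes any field that contains the delimiter, so a `|` inside a line of dialogue would not break the column layout.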

## 1.) Stage directions from scripts 1 and 3

**Data source:** web scraping from imsdb.com

In [106]:
from bs4 import BeautifulSoup
import requests
import os
import re
import pandas as pd
import numpy as np
from textblob import TextBlob
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [107]:
lotr_1 = requests.get('https://imsdb.com/scripts/Lord-of-the-Rings-Fellowship-of-the-Ring,-The.html')
lotr_3 = requests.get('https://imsdb.com/scripts/Lord-of-the-Rings-Return-of-the-King.html')

In [108]:
# List of stage directions that occur inside speaker text
# ToDo: add more
regieInSpeakertext= ["Bilbo stands gazing out of the kitchen window."]


In [109]:
def get_regieanweisungen(script, filmNr):
            # webscrape text
            soup = BeautifulSoup(script.text, 'lxml')
            all_character_tags = soup.find('pre')

            # convert scraped bs4 element tags to string
            text = str(all_character_tags)

            # delete all tags
            text = re.sub(r'</?[a-zA-Z]+>', '', text)       

            # whitespace in names
            text = re.sub(r'A RW EN','ARWEN', text)
            text = re.sub(r'DEN ETH OR','DENETHOR', text)
            text = re.sub(r'SME AGO L', 'SMEAGOL', text)

            # split into text blocs (separator: blank line = two line breaks = \r\n\r\n)
            blocs = re.split(r'\r\n\r\n', text)

            # set the file name
            filename = f'lotr_skript_{filmNr}_regie.csv'

            # delete the file if it already exists
            if os.path.exists(f'data/{filename}'):
                os.remove(f'data/{filename}')

            # TO CSV

            with open(f'data/{filename}', 'a') as f:
                # column headers
                f.write(f'Nr.|Sprecher/Regie|Text|Filmnr.\n')
                print(f'Nr.|Sprecher/Regie|Text|Filmnr.\n')

                # iterate over the blocks
                for num, bloc in enumerate(blocs):
                    # replace question-mark placeholders (U+FFFD) and asterisks with spaces
                    # (in scripts 1 and 3, all of these placeholders can be replaced by a space)
                    bloc = bloc.replace("�", " ")
                    bloc = bloc.replace("*", " ")

                    # match speaker text (block starts with whitespace followed by an uppercase letter)
                    if(re.match(r'^\s+[A-Z]', bloc)):    
                        sprechertext = bloc
                        # remove whitespace
                        sprechertext = re.sub(r'\r\n', ' ', sprechertext) # line breaks inside the text
                        sprechertext = re.sub(r'\s+', ' ', sprechertext) # collapse runs of whitespace into one space

                        # if a block contains a sentence from the list regieInSpeakertext, print that sentence (only the sentence!)
                        for count, regietext in enumerate(regieInSpeakertext):
                            if (regietext in sprechertext):
                                print(f'{num}|Regie aus Sprechertext|{regietext}|{filmNr}\n')
                                #print(num, sprechertext)

                    # match stage directions (every block that is not speaker text)
                    else:
                        # remove whitespace
                        regietext = bloc.strip()
                        regietext = re.sub(r'\r\n', ' ', regietext) # line breaks inside the text
                        regietext = re.sub(r'\s+', ' ', regietext) # collapse runs of whitespace into one space

                        # drop certain blocks entirely (CONTINUED markers, INT./EXT. location lines, CUT TO)
                        if ('CONTINUED' in regietext):
                            regietext = ''
                        elif('EXT.' in regietext):
                            regietext = ''
                        elif('EX.' in regietext):
                            regietext = ''
                        elif('INT.' in regietext):
                            regietext = ''
                        elif('CUT TO' in regietext):
                            regietext = ''


                        # delete camera directions and similar markers
                        regietext = re.sub(r'We SLOWLY FADE TO BLACK ...', '', regietext)
                        regietext = re.sub(r'FADE TO BLACK.?', '', regietext)
                        regietext = re.sub(r'FADE UP:', '', regietext)

                        regietext = re.sub(r'ANGLE ON::', '', regietext)
                        regietext = re.sub(r'ANGLES ON:', '', regietext)
                        regietext = re.sub(r'ANGLE ON:', '', regietext)

                        regietext = re.sub(r'IMAGES:', '', regietext)
                        regietext = re.sub(r'IMAGE:', '', regietext)

                        regietext = re.sub(r'CLOSE ON:', '', regietext)
                        regietext = re.sub(r'Close on:', '', regietext)

                        regietext = re.sub(r'\sSUPER:\s.+TRACKS\sTO:', '', regietext)

                        regietext = re.sub(r'TEASING SHOTS:', '', regietext)

                        regietext = re.sub(r'WIDE ON:', '', regietext)
                        regietext = re.sub(r'Wide on:', '', regietext)

                        regietext = re.sub(r'MONTAGE:', '', regietext)

                        regietext = re.sub(r'SLOW MOTION:?', '', regietext)
                        regietext = re.sub(r'NORMAL SPEED', '', regietext)

                        regietext = re.sub(r'ON THE SOUNDTRACK:', '', regietext)
                        regietext = re.sub(r'ON SOUNDTRACK:', '', regietext)
                        regietext = re.sub(r'SOUNDTRACK:', '', regietext)           

                        regietext = re.sub(r'TILT DOWN:', '', regietext)

                        regietext = re.sub(r'Low angle:', '', regietext)
                        regietext = re.sub(r'QUICK ANGLES:', '', regietext)
                        regietext = re.sub(r'LOW ANGLE:', '', regietext)
                        regietext = re.sub(r'HIGH WIDE ANGLE:', '', regietext)
                        regietext = re.sub(r'HIGH ANGLE:', '', regietext)


                        regietext = re.sub(r'HIGH WIDE:', '', regietext)
                        regietext = re.sub(r'WIDER:', '', regietext)
                        regietext = re.sub(r'Wider:', '', regietext)
                        regietext = re.sub(r'WIDE SHOT:', '', regietext)
                        regietext = re.sub(r'WIDE PROFILE:?', '', regietext)
                        regietext = re.sub(r'WIDE:', '', regietext) 

                        regietext = re.sub(r'QUICK INSERT:', '', regietext)
                        regietext = re.sub(r'FLASH INSERT:', '', regietext)
                        regietext = re.sub(r'INSERTS:', '', regietext)
                        regietext = re.sub(r'INSERT:', '', regietext)

                        regietext = re.sub(r'Aerial on:', '', regietext)
                        regietext = re.sub(r'AERIAL SHOT:', '', regietext)
                        regietext = re.sub(r'AERIAL:', '', regietext)

                        regietext = re.sub(r'SAM POV:', '', regietext)
                        regietext = re.sub(r'LOW ANGLE POV:', '', regietext)
                        regietext = re.sub(r'ARAGORN POV', '', regietext)
                        regietext = re.sub(r'ARWEN\'S POV:', '', regietext)
                        regietext = re.sub(r'Frodo\'s half- conscious POV:', '', regietext)
                        regietext = re.sub(r'Frodo\'s POV:', '', regietext)
                        regietext = re.sub(r'SURREAL SLOW MOTION POV:', '', regietext)
                        regietext = re.sub(r'SPEEDING POV:', '', regietext)
                        regietext = re.sub(r'Rushing POV:', '', regietext)
                        regietext = re.sub(r'LOW ANGLE POV:', '', regietext)
                        regietext = re.sub(r'POV:', '', regietext)

                        regietext = re.sub(r'CAMERA CIRCLES SUMMIT:', '', regietext)
                        regietext = re.sub(r'CAMERA CRANES to REVEAL:', '', regietext)

                        regietext = re.sub(r'PAN ONTO:', '', regietext)
                        regietext = re.sub(r'PAN OFF:', '', regietext)

                        regietext = re.sub(r'CRANE DOWN:', '', regietext)

                        regietext = re.sub(r'Slow motion:', '', regietext)

                        regietext = re.sub(r'QUICK CUTS:', '', regietext)

                        regietext = re.sub(r'BLACK SCREEN:', '', regietext)
                        regietext = re.sub(r'BLACK SCREEN . . .', '', regietext)
                        regietext = re.sub(r'BLACK SCREEN', '', regietext)

                        regietext = re.sub(r'SETTLE ON:', '', regietext)
                        regietext = re.sub(r'PUSH IN:', '', regietext)
                        regietext = re.sub(r'PULL BACK:', '', regietext)
                        regietext = re.sub(r'QUICK BEAT:', '', regietext)
                        regietext = re.sub(r'TRACKING BACK:', '', regietext)
                        regietext = re.sub(r'TRACKING:', '', regietext)
                        regietext = re.sub(r'REVEAL ON:', '', regietext)
                        regietext = re.sub(r'SOARING UP:', '', regietext)
                        regietext = re.sub(r'MATCHING MOVE:', '', regietext)
                        regietext = re.sub(r'Screenplay by: Fran Walsh, Philippa Boyens, Peter Jackson', '', regietext)

                        regietext = re.sub(r'\(CON TINUED \)', '', regietext)
                        regietext = re.sub(r'\(C ON T IN U ED \)', '', regietext)
                        regietext = re.sub(r'\(C O N TI N U ED \)', '', regietext)


                        # delete page numbers
                        regietext = re.sub(r'\d+\.', '', regietext)

                        # delete text that is superimposed on the screen
                        if ('SUPER:' in regietext):
                            regietext = ''

                        # delete other text that is not a stage direction
                        regietext = re.sub(r'MRS. SACKVILLE BAGGINS \(O.S.\)', '', regietext)
                        regietext = re.sub(r'SAURON \(V.O.\)', '', regietext)
                        regietext = re.sub(r'\(IN BLACK SPEECH\)', '', regietext)

                        # sentences with too much whitespace
                        regietext = re.sub(r'A s t h e G R E A T B O U LD E R S l a n d a m o ng t h em \^ t he O R C s t a rt t o P A N IC ', 'As the great boulders land among them the Orcs start to panic', regietext)

                        # delete odd character sequences
                        regietext = re.sub(r'::\.\. \. \. \.', '.', regietext) # GOLLUM quickly turns and BOLTS::.. . . .
                        regietext = re.sub(r'[;:]', '', regietext)               # H e looks up as GOLLUM lunges at him. ;:
                        regietext = re.sub(r'\/', '', regietext)
                        regietext = re.sub(r"\. \. ' '", '', regietext)

                        # replace characters
                        # &amp becomes and (the trailing ; was already removed above)
                        regietext = re.sub(r'&amp', 'and', regietext)
                        # ! becomes .
                        regietext = re.sub(r'!', '.', regietext)

                        # remove whitespace
                        regietext = regietext.strip()
                        regietext = re.sub(r'\s+', ' ', regietext) # collapse runs of whitespace into one space

                        # ellipses
                        # at the start: delete
                        # at the end: replace with a single period
                        # in the middle: replace with a space (not always a good fit, but there is no better option)
                        # start
                        regietext = re.sub(r'^\s?\.\.\.*\s', '', regietext) # no spaces between the dots
                        regietext = re.sub(r'^\s?\. \.( \.)*\s?', '', regietext) # spaces between the dots
                        # end
                        regietext = re.sub(r'\s?\.\.(\.)*$', '.', regietext)
                        regietext = re.sub(r'\s?\. \.( \.)*$', '.', regietext)
                        # middle (whatever is still left at this point must be in the middle)
                        regietext = re.sub(r'\s?\. \.( \.)*\s?', ' ', regietext)
                        regietext = re.sub(r'\s?\.\.(\.)*\s?', ' ', regietext)

                        # write to csv and print (only if text column not empty)
                        if(regietext != ""):
                            f.write(f'{num}|Regie|{regietext}|{filmNr}\n')
                            print(f'{num}|Regie|{regietext}|{filmNr}\n')

get_regieanweisungen(lotr_1, 1)
get_regieanweisungen(lotr_3, 3)

Nr.|Sprecher/Regie|Text|Filmnr.

3|Regie|BLACK CONTINUES ELVISH SINGING A WOMAN'S VOICE IS whispering, tinged with SADNESS and REGRET|1

7|Regie|FLICKERING FIRELIGHT. The NOLDORIN FORGE in EREGION. MOLTEN GOLD POURS from the lip of an IRON LADLE.|1

9|Regie|THREE RINGS, each set with a single GEM, are received by the HIGH ELVES-GALADRIEL, GIL-GALAD and CIRDAN.|1

11|Regie|SEVEN RINGS held aloft in triumph by the DWARF LORDS.|1

13|Regie|NINE RINGS clutched tightly by the KINGS OF MEN as if holding-close a precious secret.|1

20|Regie|An ancient PARCHMENT MAP of MIDDLE EARTH moving slowly across the MAP as if drawn by an unseen force the CAMERA closes in on a PLACE NAME MORDOR.|1

22|Regie|SAURON forging the ONE RING in the CHAMBERS of SAMMATH NAUR.|1

26|Regie|THE ONE RING falls through SPACE and into flames.|1

28|Regie|A GREAT SHADOW falls across the MAP closing in around the realm of GONDOR.|1

29|Regie|SCREAMING VILLAGERS, MEN, WOMEN, AND CHILDREN, RUN|1

30|Regie|from their homes,

Nr.|Sprecher/Regie|Text|Filmnr.

8|Regie|SMEAGOL and his cousin, DEAGOL, sit in a SMALL CORACLE, their FISHING LINES draped over the side SUNSHINE glinting off the surface of the water.|3

9|Regie|An idyllic image.|3

10|Regie|SUDDENLY DEAGOL's FISHING ROD BENDS under the weight of a LARGE FISH.|3

13|Regie|DEAGOL pulls on his ROD, but is HAULED OVERBOARD and disappears underwater with a SPLASH.|3

14|Regie|SMEAGOL leaning over the BOAT CONCERNED.|3

17|Regie|DEAGOL is towed to the RIVER BED by a LARGE FISH he suddenly lets go of the line eyes fixed on a SHINING GOLD RING, lying in 'the SILT.|3

21|Regie|DEAGOL climbs out of the WATER, onto the RIVER BANK.|3

22|Regie|the RING revealed in DEAGOL'S PALM.|3

23|Regie|SMEAGOL peers over his shoulder the GOLD reflects in SMEAGOL'S EYES.|3

24|Regie|The HUM of the RING growing LOUDER.|3

26|Regie|DEAGOL turns to look at him, a smirk on his face.|3

28|Regie|SMEAGOL moves towards DEAGOL.|3

30|Regie|SMEAGOL jumps on DEAGOL STRANGLING HIM. SM


1997|Regie|FRODO and SAM staggering across the TORTURED LANDSCAPE they are no longer WEARING the ORC ARMOUR.|3

1998|Regie|FRODO is walking half-bowed, often stumbling as if his eyes not longer see the way before his feet.|3

1999|Regie|His right HAND is pressed against his CHEST supporting a HEAVY WEIGHT. His left HAND often rises, as if to ward off some invisible blow. SAM watches him, CONCERN etched across his FACE.|3

2000|Regie|FRODO as a malevolent VOICE in his head calls to him "Baggins - Baggins".|3

2001|Regie|SAM looking behind him in time to see.|3

2002|Regie|A RAY of RED LIGHT stabs through the GLOOM and begins to sweep over the BARREN LANDSCAPE.|3

2004|Regie|SAM throws himself to the ground FRODO turns to the light, unable to stop himself.|3

2005|Regie|FRODO crumpling to the ground as the RED LIGHT hits him like a SEARCHLIGHT.|3

2007|Regie|All is QUIET No sign of the ENEMY.|3

2009|Regie|GANDALF watchful alert. He nods at ARAGORN.|3

2010|Regie|ARAGORN, GANDALF, LEGOL
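
The long chain of `re.sub` calls in `get_regieanweisungen` could also be driven by a list of markers. The following is a minimal sketch of that alternative, not the notebook's implementation: the names `CAMERA_MARKERS` and `strip_camera_markers` are hypothetical, and the list holds only a handful of the markers from the cell above.

```python
import re

# a small, illustrative subset of the markers removed in the notebook
CAMERA_MARKERS = ["ANGLE ON:", "IMAGES:", "IMAGE:", "CLOSE ON:",
                  "HIGH WIDE ANGLE:", "HIGH ANGLE:", "WIDE:", "POV:"]

# longest first, so that among markers sharing a prefix the longer one wins
_pattern = re.compile("|".join(
    re.escape(m) for m in sorted(CAMERA_MARKERS, key=len, reverse=True)))

def strip_camera_markers(text):
    """Remove all known markers and tidy up the remaining whitespace."""
    text = _pattern.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = strip_camera_markers("IMAGES: THE HUGE, DARK FIGURE looms over the field")
print(cleaned)
```

In the sequential version the order of the calls matters (for example, `HIGH WIDE:` has to be removed before `WIDE:`). A single alternation avoids most of that bookkeeping, because the regex engine takes the leftmost match.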

## 2.) Stage directions from script 2

In [110]:
lotr_2 = requests.get('https://imsdb.com/scripts/Lord-of-the-Rings-The-Two-Towers.html')

In [111]:
def get_all_lines():
    # webscrape text
    soup = BeautifulSoup(lotr_2.text, 'lxml')
    all_character_tags = soup.find('pre')

    # convert scraped bs4 element tags to string
    text = str(all_character_tags)

    # delete all tags
    text = re.sub(r'</?[a-zA-Z]+>', '', text)

    # split into text blocs 
    # Problem: several separators occur (\r\n\r\n and \r\n \r\n)
    # Note: because of the capturing group, re.split also returns the separators,
    # so the text blocks sit at every second index of the result
    blocs = re.split(r'(\r\n\s?\r\n)', text)


    # WRITE TO CSV

    # set the file name
    filename = f'lotr_skript2_dialog_regie_first_output.csv'

    # delete the file if it already exists
    if os.path.exists(f'data/{filename}'):
        os.remove(f'data/{filename}')


    with open(f'data/{filename}', 'a') as f:
        # column headers
        f.write(f'Nr.|Sprecher/Regie|Text|Filmnr.\n')
        print(f'Nr.|Sprecher/Regie|Text|Filmnr.\n')

        # iterate over the blocks
        for num, bloc in enumerate(blocs):
            # replace question-mark placeholders (U+FFFD) with % (otherwise a Unicode error is raised at run time)
            bloc = bloc.replace("�", "%")   

            # match stage directions, i.e. all blocks that start directly with [
            if(re.match(r'^\s*\[', bloc)):
                # remove whitespace
                regieText = bloc.strip()
                regieText = re.sub(r'\r\n', '', regieText) # line breaks inside the text
                regieText = re.sub(r'\s+', ' ', regieText) # collapse runs of whitespace into one space

                # write to CSV
                # e.g.: 1|Regie|Frodo jumps.|                 
                f.write(f'{num}|Regie|{regieText}|2\n')
                print(f'{num}|Regie|{regieText}|2\n')

            # match speaker text, i.e. all blocks that do NOT start directly with [
            else:
                # remove superfluous whitespace before the name
                # positive lookahead: matches any kind of whitespace before the first letter
                sprecherText = re.sub(r'^\s*(?=([a-zA-Z]))', "", bloc)

                # matches everything before the first line break, i.e. the speaker name
                if (re.match(r'^.*(?=(\r\n))', sprecherText)):
                    # extract the name from the text and strip its whitespace
                    # group(0) is needed to access the text in the match object returned by re.match
                    name = re.match(r'^.*(?=(\r\n))', sprecherText).group(0).strip()

                    # delete everything before the first line break (including the name)
                    sprecherText = re.sub(r'^.*\r\n', '', sprecherText)
                    sprecherText = sprecherText.strip()
                    sprecherText = re.sub(r'\r\n', '', sprecherText) # line breaks inside the text
                    sprecherText = re.sub(r'\s+', ' ', sprecherText) # collapse runs of whitespace into one space

                    # write to CSV
                    # only write if name is not empty (otherwise there will be empty rows in between)
                    if(name != ""):
                        f.write(f'{num}|{name}|{sprecherText}|2\n')
                        print(f'{num}|{name}|{sprecherText}|2\n')
                else:
                    f.write(f'{num}|Sprechertext|{sprecherText}|2\n')
                    print(f'{num}|Sprechertext|{sprecherText}|2\n')
                
get_all_lines()

Nr.|Sprecher/Regie|Text|Filmnr.

0|Sprechertext||2

2|Sprechertext|THE LORD OF THE RINGS: THE TWO TOWERS|2

4|Screenplay by|Peter Jackson, Fran Walsh and Philippa Boyens.|2

6|Based on "The Lord of The Rings" trilogy by|J.R.R Tolkien.|2

8|Sprechertext|Transcription credits|2

10|Accela, Aina, Bad burn, Bridget Chubb,|Brionn Equus (Lochrann), Drusilia, Elf Lady, %owyn Unquendor, Feanari, Finafyr, Flourish, Galadriel, Heri, Julamb, JustinsIce(Mdjasrie), Kazren, Krystal, Lady%owynKenobi, Lady Evenstar, Legolas%Bow, Lithorose, Melody, Mormegil, Nilmandra, Padfoot, Penwiper, Pilgrim Grey, Primula Baggins, Randy Savage, Samwise the Brave, Sirius Black, Tethra, The Lidless Eye, Turnar, Xyla, Yaksha|2

12|Elvish dialogue from The Elvish Linguistic|Fellowship.|2

14|Sprechertext||2

16|Regie|[TITLE: THE LORD OF THE RINGS]|2

18|Sprechertext||2

20|Regie|[Camera pans over the Misty Mountains as voices drift in from the background.]|2

22|GANDALF|You cannot pass!|2

24|FRODO|Gandalf!|2

26|GANDA
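
Only even row numbers appear in this output because the pattern passed to `re.split` contains a capturing group, so the separators are returned alongside the text blocks. A small sketch of this behaviour on a made-up string:

```python
import re

# hypothetical two-block sample with Windows line breaks
text = "GANDALF\r\nYou cannot pass!\r\n\r\nFRODO\r\nGandalf!"

with_group = re.split(r'(\r\n\s?\r\n)', text)     # separators are kept
without_group = re.split(r'\r\n\s?\r\n', text)    # separators are dropped

print(with_group)
print(without_group)
```

Switching to the pattern without the group would renumber the rows, so any later filter on the `Nr.` column would have to be adjusted accordingly.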

### Data cleaning for script 2

### 1. Load the CSV

In [112]:
lotr_2 = pd.read_csv('data/lotr_skript2_dialog_regie_first_output.csv', sep='|')
lotr_2

Unnamed: 0,Nr.,Sprecher/Regie,Text,Filmnr.
0,0,Sprechertext,,2
1,2,Sprechertext,THE LORD OF THE RINGS: THE TWO TOWERS,2
2,4,Screenplay by,"Peter Jackson, Fran Walsh and Philippa Boyens.",2
3,6,"Based on ""The Lord of The Rings"" trilogy by",J.R.R Tolkien.,2
4,8,Sprechertext,Transcription credits,2
...,...,...,...,...
1213,2442,SM%AGOL,"Come on, hobbits. Long ways to go yet. Sm%agol...",2
1214,2444,Regie,"[He turns to walk on, with Frodo and Sam follo...",2
1215,2446,GOLLUM,Follow me.,2
1216,2448,Regie,[ Camera pans up over the forest and Ephel D%a...,2


### 2. Delete the first rows with irrelevant information and the rows labeled "Sprechertext"

In [113]:
lotr_2 = lotr_2[lotr_2["Nr."] > 20]
lotr_2 = lotr_2[lotr_2["Sprecher/Regie"] != "Sprechertext"]
lotr_2

Unnamed: 0,Nr.,Sprecher/Regie,Text,Filmnr.
11,22,GANDALF,You cannot pass!,2
12,24,FRODO,Gandalf!,2
13,26,GANDALF,"I am a servant of the Secret Fire, wielder of ...",2
14,28,Regie,[Camera pans closer to the mountain side.],2
15,30,GANDALF,Argh! Go back to the shadow. The Dark Fire wil...,2
...,...,...,...,...
1212,2440,GOLLUM,Shh% [He pops out from hiding in front of the ...,2
1213,2442,SM%AGOL,"Come on, hobbits. Long ways to go yet. Sm%agol...",2
1214,2444,Regie,"[He turns to walk on, with Frodo and Sam follo...",2
1215,2446,GOLLUM,Follow me.,2


### 3. Clean up the faulty encoding (replace % characters)
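
Many of the replacements in this step follow one pattern: a `%` between a word and a contraction ending stands for an apostrophe. As a minimal sketch of an alternative (not what the notebook does), a single regular expression can cover those cases. The function name `fix_apostrophes` and the list of endings are assumptions.

```python
import re

# % between a letter and a typical contraction ending (s, t, re, ve, ll, m, d)
_APOSTROPHE = re.compile(r"(?<=[A-Za-z])%(?=(?:s|t|re|ve|ll|m|d)\b)")

def fix_apostrophes(text):
    """Replace % with an apostrophe where it sits inside a contraction."""
    return _APOSTROPHE.sub("'", text)

fixed = fix_apostrophes("I%m sure they%ll say it%s Helm%s Deep, but we can%t go.")
print(fixed)
```

Names in which `%` stands for an accented letter, such as `Sm%agol`, are left alone and still need explicit replacements. Those should run first, because a name like `Khazad-d%m` ends in a letter that the pattern would mistake for a contraction ending.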

In [114]:
for index, row in lotr_2.iterrows():
    
    # Clear character names in character column
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("SM%AGOL", "SMEAGOL", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("GR%MA", "GRIMA", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("H%MA", "HAMA", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("NAZG%L", "NAZGUL", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("%OWYN", "EOWYN", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("%OMER", "EOMER", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("TH%ODEN", "THEODEN", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("TH%ODRED", "THEODRED", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("%OTHAIN", "EOTHAIN", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("UGL%K", "UGLUK", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("MA%HUR", "MAUHUR", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("GRISHN%KH", "GRISHNAKH", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub(r"\(V\.O\.\)", "", lotr_2.at[index, "Sprecher/Regie"])
    lotr_2.at[index, "Sprecher/Regie"] = re.sub("With a grimace, he kills the Uruk-hai", "Regie", lotr_2.at[index, "Sprecher/Regie"])
    
    # Clear place names and person names
    
    lotr_2.at[index, "Text"] = re.sub("Barad-d%r", "Barad-dur", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Palant%r", "Palantir", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Khazad-d%m", "Khazad-dum", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub(r"Helm%s\s*Deep", "Helm's Deep", lotr_2.at[index, "Text"])  # covers line breaks
    lotr_2.at[index, "Text"] = re.sub("L%rien", "Lorien", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("M%ria", "Moria", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Lothl%rien", "Lothlorien", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("D%rin%s Tower", "Durin's Tower", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Ud%n", "Udun", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Sm%agol", "Smeagol", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Sm%a%gol", "Smeagol", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Gr%ma", "Grima", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("H%ma", "Hama", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Ham%", "Hama", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Nazg%l", "Nazgul", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("%owyn", "Eowyn", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("%omer", "Eomer", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("%omund", "Eomund", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Th%oden", "Theoden", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Th%odred", "Theodred", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("%othain", "Eothain", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Ugl%k", "Ugluk", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Ma%hur", "Mauhur", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Grishn%kh", "Grishnakh", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Gl%in", "Gloin", lotr_2.at[index, "Text"])


    # replace apostrophes
    
    ## I'm, you're etc.

    lotr_2.at[index, "Text"] = re.sub("I%m", "I'm", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("You%re", "You're", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("you%re", "you're", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("He%s", "He's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("he%s", "he's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("She%s", "She's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("she%s", "she's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("It%s", "It's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("it%s", "it's", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("We%re", "We're", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("we%re", "we're", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("They%re", "They're", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("they%re", "they're", lotr_2.at[index, "Text"])

    ## have/has, haven't/hasn't

    lotr_2.at[index, "Text"] = re.sub("i%ve", "i've", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("I%ve", "I've", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("you%ve", "you've", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("You%ve", "You've", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("we%ve", "we've", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("We%ve", "We've", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("they%ve", "they've", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("They%ve", "They've", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("haven%t", "haven't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Haven%t", "Haven't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("hasn%t", "hasn't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Hasn%t", "Hasn't", lotr_2.at[index, "Text"])

    ## is, can, do

    lotr_2.at[index, "Text"] = re.sub("mustn%t", "mustn't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Mustn%t", "Mustn't", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("let%s", "let's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Let%s", "Let's", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("can%t", "can't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Can%t", "Can't", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("that%s", "that's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("That%s", "That's", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("there%s", "there's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("There%s", "There's", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("what%s", "what's", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("What%s", "What's", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("isn%t", "isn't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Isn%t", "Isn't", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("aren%t", "aren't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Aren%t", "Aren't", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("don%t", "don't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Don%t", "Don't", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("doesn%t", "doesn't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Doesn%t", "Doesn't", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("didn%t", "didn't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Didn%t", "Didn't", lotr_2.at[index, "Text"])

    ## was/will, wasn't/won't

    lotr_2.at[index, "Text"] = re.sub("I%ll", "I'll", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("you%ll", "you'll", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("You%ll", "You'll", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("he%ll", "he'll", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("He%ll", "He'll", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("she%ll", "she'll", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("She%ll", "She'll", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("it%ll", "it'll", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("It%ll", "It'll", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("we%ll", "we'll", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("We%ll", "We'll", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("they%ll", "they'll", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("They%ll", "They'll", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("wasn%t", "wasn't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Wasn%t", "Wasn't", lotr_2.at[index, "Text"])

    lotr_2.at[index, "Text"] = re.sub("won%t", "won't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Won%t", "Won't", lotr_2.at[index, "Text"])

    ## Conditional (wouldn't, couldn't, should)

    lotr_2.at[index, "Text"] = re.sub("couldn%t", "couldn't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("shouldn%t", "shouldn't", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("I%d", "I'd", lotr_2.at[index, "Text"])


    # Single cases

    lotr_2.at[index, "Text"] = re.sub("YOU% SHALL NOT... PASS!!!", "YOU SHALL NOT... PASS!!!", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("lovely % Lembas bread", "lovely Lembas bread", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("S... S...Smeagol%", "S... S...Smeagol.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Grima% Grima%", "Grima. Grima.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Ooh% They look tasty!", "Ooh! They look tasty!", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("small % only children", "small, only children", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Bur%rum", "Burarum", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("then% a tunnel", "then...a tunnel", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Oh, he% he must have died", "Oh, he ... he must have died", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Why should I% welcome you, Gandalf%", "Why should I welcome you, Gandalf", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Eowyn% Eowyn", "Eowyn Eowyn", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Simbelmyn%", "Simbelmyne", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Because% because", "Because ... because", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Mur%derer%!", "Murderer!", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("%em", "them", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("of course ridiculous%", "of course ridiculous.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("my heart%", "my heart.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Sleep%", "Sleep.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Arwen%", "Arwen ...", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Brego% ", "Brego. ", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("And%ril", "Anduril", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Myyy% PRECIOUSSS", "Myyy PRECIOUSSS", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("but% I'm sorry", "but ... I'm sorry", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Where was Gon%%", "Where was Gon--", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("lord %", "lord.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("To whatever end%", "To whatever end.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("agreed % you", "agreed ... you", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("War, yes%", "War, yes.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("ooold%", "ooold", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("anything%", "anything", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("looong%", "looong", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("NAZG[%']L", "NAZGUL", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("they % Oh!!", "they ... Oh!!", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Hold on, Mr. Frodo%", "Hold on, Mr. Frodo.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("% Samwise the Brave", ": Samwise the Brave", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("they're dead%", "they're dead.", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Shh%", "Shh.", lotr_2.at[index, "Text"])
    
    
    # Other question marks are either apostrophes or in Elvish/Entish/Old English and will be deleted later anyway
    
    lotr_2.at[index, "Text"] = re.sub("%", "'", lotr_2.at[index, "Text"])
    
    # Remove camera directions etc.
    
    lotr_2.at[index, "Text"] = re.sub("FLASHBACK", "", lotr_2.at[index, "Text"])
    lotr_2.at[index, "Text"] = re.sub("Camera pans closer to the mountain side.", "", lotr_2.at[index, "Text"])
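
The long run of `re.sub` calls above could also be driven by a lookup table. The following is only a minimal sketch of that idea, not the notebook's actual method: `CONTRACTIONS` is a hypothetical excerpt of the replacement pairs and `fix_placeholders` is a made-up helper name. Capitalised variants are derived automatically, and any remaining `%` is mapped to an apostrophe at the end, as in the loop above.

```python
import re

# Hypothetical excerpt of the replacement table; the full table would hold
# every pair (including the single cases) from the loop above.
CONTRACTIONS = {
    "can%t": "can't",
    "don%t": "don't",
    "doesn%t": "doesn't",
    "let%s": "let's",
}


def fix_placeholders(text, table=CONTRACTIONS):
    """Replace %-placeholders from the table, then map leftover % to apostrophes."""
    for broken, fixed in table.items():
        # handle the lower-case and the capitalised spelling of each entry
        for old, new in ((broken, fixed), (broken.capitalize(), fixed.capitalize())):
            # re.escape keeps characters such as "." or "?" literal
            text = re.sub(re.escape(old), new, text)
    # every remaining % is treated as an apostrophe
    return text.replace("%", "'")


samples = ["We can%t stop. Don%t tell the Elf.", "Gimli%s axe"]
cleaned = [fix_placeholders(s) for s in samples]
print(cleaned)
```

With the DataFrame from the loop above, the helper could be applied in one step via `lotr_2["Text"].map(fix_placeholders)`. Entries that need a specific replacement must stay in the table, because the final catch-all turns every leftover `%` into an apostrophe.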


### Filter out stage directions and move them to a separate file

In [115]:
# filter stage directions
stage_directions = lotr_2[(lotr_2["Text"].str.contains(r"\[")) | (lotr_2["Sprecher/Regie"] == "Regie")].copy()

# remove text outside brackets

for index, row in stage_directions.iterrows():
    # remove round brackets in advance (the data contain nested brackets like [...(..)..], which makes matching very hard)
    stage_directions.at[index, "Text"] = re.sub(r"[()]", "", stage_directions.at[index, "Text"])
    # remove content outside square brackets, regex idea from 41686d6564's answer at https://stackoverflow.com/questions/64740437/remove-text-outside-of-bracket
    stage_directions.at[index, "Text"] = re.sub(r"[^\[\]]+(?=[\[(]|$)", "", stage_directions.at[index, "Text"])
    # remove square brackets
    stage_directions.at[index, "Text"] = re.sub(r"[\[\]]", "", stage_directions.at[index, "Text"])
    # strip to remove whitespace in beginning and end
    stage_directions.at[index, "Text"] = stage_directions.at[index, "Text"].strip()

# set the file name
filename = 'lotr_skript_2_regie.csv'

# delete the file if it already exists:
if os.path.exists(f'data/{filename}'):
    os.remove(f'data/{filename}')

# TO CSV

with open(f'data/{filename}', 'a') as f:
    # column headers
    f.write(f'Nr.|Sprecher/Regie|Text|Filmnr.\n')
    print(f'Nr.|Sprecher/Regie|Text|Filmnr.\n')
    for index, row in stage_directions.iterrows():
        if(row["Text"] != ''):
            f.write(f'{row["Nr."]}|Regie|{row["Text"]}|{row["Filmnr."]}\n')
            print(f'{row["Nr."]}|Regie|{row["Text"]}|{row["Filmnr."]}\n')

            
# We don't really need the index reset any more...
        
# reset index and Nr.-column
#count = 1
#for index, row in stage_directions.iterrows():
 #   stage_directions.at[index, "Nr."] = count
  #  count += 1
#stage_directions.reset_index(drop=True, inplace=True)



Nr.|Sprecher/Regie|Text|Filmnr.

30|Regie|Camera zooms in through the mountain and focuses on Gandalf and the Balrog on the bridge of Khazad-dum. The Balrog strikes down on Gandalf with its flaming sword. Gandalf parries the blow with Glamdring, shattering the Balrog's sword.|2

32|Regie|Gandalf strikes his staff onto the bridge. As the Balrog steps forward, the bridge collapses from under it and the demon plunges backward into the chasm. Gandalf, exhausted, leans on his staff and watches the Balrog fall then turns to follow the others. At the last minute, the flaming whip lashes up from the depths of the abyss and winds around Gandalf's ankle, dragging him over the edge. He clings onto the bridge but is straining to keep his grip.|2

36|Regie|Frodo rushes forward but Boromir restrains him.|2

44|Regie|Gandalf loses his grip and falls into the chasm|2

48|Regie|Gandalf loses his grip and falls into the chasm|2

50|Regie|Calls after Gandalf as he falls into the abyss|2

52|Regie|Gandalf
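
The bracket handling in the cell above can also be expressed as a direct extraction of the `[...]` spans instead of deleting everything around them. This is a minimal sketch under that assumption; `extract_stage_directions` is a hypothetical helper name and the sample string is made up for illustration.

```python
import re


def extract_stage_directions(text):
    """Return the contents of all [...] spans in text, joined by single spaces."""
    # drop round brackets first, mirroring the loop above (nested "(...)" inside "[...]")
    text = re.sub(r"[()]", "", text)
    # capture everything between a "[" and the next "]"
    parts = re.findall(r"\[([^\[\]]*)\]", text)
    # strip each span and skip spans that are empty after stripping
    return " ".join(part.strip() for part in parts if part.strip())


sample = "No! [Frodo rushes forward (but Boromir restrains him).] Gandalf! [He falls.]"
print(extract_stage_directions(sample))
```

Unlike the substitution approach, this variant puts a space between two directions that come from the same row, and it ignores a `[` that is never closed.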

### Stage directions as one continuous text

In [116]:
# load the tables with the stage directions
regie_1 = pd.read_csv('data/lotr_skript_1_regie.csv', sep='|')
regie_2 = pd.read_csv('data/lotr_skript_2_regie.csv', sep='|')
regie_3 = pd.read_csv('data/lotr_skript_3_regie.csv', sep='|')

In [117]:
# function to extract the stage directions from the table as one continuous text
def get_regie_text(regie_df, filmNr):

    # number of rows
    length = len(regie_df)

    # row positions (named row_positions so the built-in range is not shadowed)
    row_positions = np.arange(0, length, 1)

    # empty text variable
    text = ""

    # build the text as one string
    for index in row_positions:
        rowText = regie_df["Text"].values[index]
        text = text + f'{rowText} '

    # save to file
    filename = f'lotr_skript_{filmNr}_regie.txt'
    # delete the file if it already exists:
    if os.path.exists(f'data/{filename}'):
        os.remove(f'data/{filename}')

    with open(f'data/{filename}', 'a') as f:
        f.write(text)
        # text in lower case
        # f.write(text.lower())
        print(f'File {filename} saved.')

In [118]:
# stage directions for script 1 as text
get_regie_text(regie_1, 1)

# stage directions for script 2 as text
get_regie_text(regie_2, 2)

# stage directions for script 3 as text
get_regie_text(regie_3, 3)

File lotr_skript_1_regie.txt saved.
File lotr_skript_2_regie.txt saved.
File lotr_skript_3_regie.txt saved.
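
For comparison, the same concatenation can be sketched with the standard library only. The snippet below assumes the pipe-separated layout written earlier (`Nr.|Sprecher/Regie|Text|Filmnr.`); `join_regie_text` is a hypothetical helper, and `raw` is a small stand-in for one of the `*_regie.csv` files.

```python
import csv
import io

# Stand-in for a file such as data/lotr_skript_2_regie.csv (two sample rows)
raw = (
    "Nr.|Sprecher/Regie|Text|Filmnr.\n"
    "36|Regie|Frodo rushes forward but Boromir restrains him.|2\n"
    "44|Regie|Gandalf loses his grip and falls into the chasm|2\n"
)


def join_regie_text(handle):
    """Read pipe-separated rows and join the non-empty Text cells with spaces."""
    reader = csv.DictReader(handle, delimiter="|")
    texts = ((row["Text"] or "").strip() for row in reader)
    return " ".join(t for t in texts if t)


text = join_regie_text(io.StringIO(raw))
print(text)
```

Joining with `" ".join(...)` avoids the trailing space that repeated string concatenation leaves behind and skips empty cells instead of writing them out.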


## 2.) Dialogue texts for scripts 1, 2, 3

**Data source:** edited (reduced) table by James Tauber

**What was changed?** (manually in Excel)
* i-tags deleted
* square brackets deleted
* "VO" deleted from the Speaker column
* passages in Elvish
    * if the translation follows in round brackets: Elvish deleted, but the translation from the brackets kept
    * if the translation is in the Translation column: Elvish text in the Subtitle column replaced by the translation from the Translation column
* column "Spoken text if different from subtitles" was adopted, i.e. the text in the Subtitle column was replaced by the text from this column
* other directions in round brackets deleted
* columns "START" and "END" with the time codes were deleted
* rows with "TITLE" in the speaker column
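
Some of these manual steps could in principle be scripted. The sketch below is only an illustration under loud assumptions: the keys `SPEAKER`, `SPOKEN TEXT` and `TRANSLATION` are hypothetical stand-ins for the real column headers (only `SUBTITLE TEXT` appears elsewhere in this notebook), the two sample rows are made up, and the rules for round brackets and "TITLE" rows are left out because they need a human decision.

```python
import re


def clean_row(row):
    """Return (speaker, text) for one subtitle row given as a dict."""
    # drop a "VO" marker from the speaker name
    speaker = re.sub(r"\bVO\b", "", row.get("SPEAKER", "")).strip()
    # prefer the spoken text, then the translation, then the subtitle itself
    text = row.get("SPOKEN TEXT") or row.get("TRANSLATION") or row.get("SUBTITLE TEXT", "")
    # remove <i>/</i> tags and square brackets
    text = re.sub(r"</?i>", "", text)
    text = re.sub(r"[\[\]]", "", text)
    # collapse runs of whitespace
    return speaker, re.sub(r"\s+", " ", text).strip()


# made-up sample rows, not taken from the actual table
rows = [
    {"SPEAKER": "GALADRIEL VO", "SUBTITLE TEXT": "<i>The world is changed.</i>"},
    {"SPEAKER": "ARWEN", "SUBTITLE TEXT": "Noro lim, Asfaloth!", "TRANSLATION": "Ride fast, Asfaloth!"},
]
cleaned_rows = [clean_row(r) for r in rows]
print(cleaned_rows)
```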

### Old code for script 2

In [14]:
#lotr_1 = requests.get('https://imsdb.com/scripts/Lord-of-the-Rings-Fellowship-of-the-Ring,-The.html')
lotr_2 = requests.get('https://imsdb.com/scripts/Lord-of-the-Rings-The-Two-Towers.html')
#lotr_3 = requests.get('https://imsdb.com/scripts/Lord-of-the-Rings-Return-of-the-King.html')

scripts_2 = [lotr_2]

In [94]:
def get_scripts():
    # loop over each script
    for movie, script in enumerate(scripts_2, 1):
        # webscrape text
        soup = BeautifulSoup(script.text, 'lxml')
        all_character_tags = soup.find('pre')
        
        # convert scraped bs4 element tags to string
        text= str(all_character_tags)
        
        # delete all tags
        text= re.sub(r'</?[a-zA-Z]+>', '', text)
        
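        # split into text blocks
        # problem: several separators - \r\n\r\n and \r\n \r\n
        # note: the capturing group makes re.split return the separators as list items too
        blocs = re.split(r'(\r\n\s?\r\n)', text)
        
        # delete empty blocs (the original filter joined its two tests with "or", which is always true; a whitespace test is enough)
        # nice to have, but not necessarily needed
        # blocs = [i for i in blocs if i.strip() != '']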

        # print(blocs)
        
        # PRINT ALL BLOCS
        #for num, bloc in enumerate(blocs):
         #   print(num, bloc)
        
        # PRINT ONLY STAGE DIRECTIONS
        #counter = 0
        #for num, bloc in enumerate(blocs):
            # matches stage directions, i.e. all blocks that start directly with [
            #if(re.match(r'^\s*\[', bloc)):
               # counter += 1
               # print(num, bloc)
        #print(f'Number of stage directions: {counter}')
        # Result: 363 (without the stage directions that directly follow the spoken text!)
        # Total number of stage directions: 671 (result of searching for [ in script 2)
        
        # PRINT ONLY SPEAKER TEXT
        #counter = 0
        #for num, bloc in enumerate(blocs):
           #  matches speaker text, i.e. all blocks that do not start with [
            #if(not re.match(r'^\s*\[', bloc)):
               # counter += 1
                # print the speaker name
                #if (re.match(r'^.*(?=(\r\n))', bloc)):
                   # name = re.match(r'^.*(?=(\r\n))', bloc).group(0)
                   # print(f'Name: {name}')
                # print the speaker text
               # print(num, bloc)
                
       # print(f'Number of speaker texts: {counter}')
        
        # WRITE TO CSV
        
        # set the file name
        filename = 'lotr_skript2_test.csv'

        # delete the file if it already exists:
        if os.path.exists(f'data/{filename}'):
            os.remove(f'data/{filename}')
            
            
        with open(f'data/{filename}', 'a') as f:
            # column headers
            f.write(f'Nr.|Sprecher/Regie|Text|Filmnr.\n')
            print(f'Nr.|Sprecher/Regie|Text|Filmnr.\n')
                
            # iterate over the blocks
            for num, bloc in enumerate(blocs):
                # replace the unreadable character (�, shown as a question mark) with %; otherwise running the cell raises a Unicode error
                bloc = bloc.replace("�", "%")   
                
                # match stage directions, i.e. all blocks that start directly with [
                if(re.match(r'^\s*\[', bloc)):
                    # remove whitespace
                    regieText = bloc.strip()
                    regieText = re.sub(r'\r\n', '', regieText) # line breaks inside the text
                    regieText = re.sub(r'\s+', ' ', regieText) # runs of whitespace become a single space
                    
                    # write to CSV
                    # e.g.: 1 | Regie | Frodo jumps. |                  
                    f.write(f'{num}|Regie|{regieText}|2\n')
                    print(f'{num}|Regie|{regieText}|2\n')
                    
                # match speaker text, i.e. all blocks that do NOT start directly with [
                elif(not re.match(r'^\s*\[', bloc)):
                    # remove superfluous whitespace before the name
                    # positive lookahead, matches any kind of whitespace before the first letter
                    sprecherText = re.sub(r'^\s*(?=([a-zA-Z]))', "", bloc)
                    
                    # matches everything before the first line break, i.e. the speaker name
                    if (re.match(r'^.*(?=(\r\n))', sprecherText)):
                        # extract the name without whitespace from the text
                        # group(0) is needed to access the text in the match object returned by re.match
                        name = re.match(r'^.*(?=(\r\n))', sprecherText).group(0).strip()
                        
                        # everything before the first line break (incl. the name) is deleted
                        sprecherText = re.sub(r'^.*\r\n', '', sprecherText)
                        sprecherText = sprecherText.strip()
                        sprecherText = re.sub(r'\r\n', '', sprecherText) # line breaks inside the text
                        sprecherText = re.sub(r'\s+', ' ', sprecherText) # runs of whitespace become a single space
                        
                        # write to CSV
                        # only write if name is not empty (otherwise there will be empty rows in between)
                        if(name != ""):
                            f.write(f'{num}|{name}|{sprecherText}|2\n')
                            print(f'{num}|{name}|{sprecherText}|2\n')
                    else:
                        f.write(f'{num}|Sprechertext|{sprecherText}|2\n')
                        print(f'{num}|Sprechertext|{sprecherText}|2\n')
                
get_scripts()

NameError: name 'scripts_2' is not defined
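
The block logic of `get_scripts` can be tried out without scraping. The following is a minimal sketch, assuming blocks are separated by `\r\n\r\n` or `\r\n \r\n` as noted in the code above; `classify_blocks` is a hypothetical helper and the sample text is made up. It splits without a capturing group, so the separators do not end up in the result, and drops empty blocks.

```python
import re


def classify_blocks(text):
    """Split raw script text on blank lines and label each block."""
    # no capturing group, so re.split does not return the separators themselves
    blocks = [b for b in re.split(r"\r\n\s?\r\n", text) if b.strip()]
    rows = []
    for block in blocks:
        # collapse line breaks and runs of whitespace inside the block
        flat = re.sub(r"\s+", " ", block).strip()
        if flat.startswith("["):
            # stage direction: the block starts directly with "["
            rows.append(("Regie", flat))
        else:
            # speaker text: first line is the name, the rest is the speech
            lines = block.strip().split("\r\n", 1)
            name = lines[0].strip()
            speech = re.sub(r"\s+", " ", lines[1]).strip() if len(lines) > 1 else ""
            rows.append((name, speech))
    return rows


sample = (
    "     GANDALF\r\n          You cannot pass!\r\n\r\n"
    "[Frodo rushes forward but \r\n     Boromir restrains him.]\r\n \r\n"
    "     FRODO\r\n          Gandalf!"
)
rows = classify_blocks(sample)
for row in rows:
    print(row)
```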

### Old code snippets

In [97]:
# replacement notes - examples 

 # 930 CAMERA CIRCLES SUMMIT:
                # CAMERA CIRCLES SUMMIT: MORE AND MORE TREES are hauled down and
                
                # 1067 PAN ONTO:
                # ringwraiths close behind PAN ONTO: 2 more ringwrait
                
                # 1709 PAN OFF:
                # PAN OFF: DENETHOR'S DEATH PLUNGE to the ROHIRRIM
                
                # 1071 CRANE DOWN:
                # CRANE DOWN: As the White Horse races towards Camera, 
                
                # 1622 Slow motion:
                # Slow motion: As the Balrog falls, he lashes out with his whip of fire... Slow motion: The thongs of the whip lash 
                
                # 1743 QUICK CUTS:
                # QUICK CUTS: LURTZ is quickly armored...
                
                # BLACK SCREEN: and BLACK SCREEN and BLACK SCREEN . . . (3 and 6 in script 3)
                
                # 44 SETTLE ON:
                # SETTLE ON: FRODO and SAM in a FILTHY CULVERT. 
                
                # 122 PUSH IN:
                # PUSH IN: EOWYN standing alone outside the GOLDEN HALL
                
                # 660 PULL BACK:
                # PULL BACK: GANDALF hurries to the BATTLEMENT
                
                # 977 CAMERA CRANES to REVEAL:
                # CAMERA CRANES to REVEAL: THOUSANDS of MEN and HORSES!
                
                # 1224 QUICK BEAT:
                # QUICK BEAT: ARAGORN RAISES ANDURIL
                
                # 1437 TRACKING BACK: 
                # TRACKING BACK: with FRODO as he careers blindly 
                
                # 2397 TRACKING:
                # TRACKING: Passing under a beautiful ELVEN ARCHWAY 
                
                # 1533 REVEAL ON:
                # REVEAL ON: SAMWISE GAMGEE stands before the GIANT SPIDER 
                
                # 2294 SOARING UP:
                # SOARING UP: to REVEAL the COURT OF THE KINGS
                
                # 2332 MATCHING MOVE:
                # MATCHING MOVE: Revealing HOBBITON bathed in a WARM SUNSET 
                
                # 2450 Screenplay by: Fran Walsh, Philippa Boyens, Peter Jackson

In [98]:
# delete irrelevant information
        # voice over and cont'd
        #text = re.sub(r'\s*\(V.O.\)\s*\(cont\'d\)', '', text)
        #text = re.sub(r'\s*V/0', '', text)
        #text = re.sub(r'\s*\( V .O .\)\s*\(c on t\' d \)', '', text)
        #text = re.sub(r'\s*\(cont\'d\)', '', text)
        #text = re.sub(r'\s*\(c ont \'d \)', '', text)
        #text = re.sub(r'\s*\(c ont\' d\)','', text)
        #text = re.sub(r'\s*\(V.O.\)\s*\(CONT\'D\)', '', text)
        #text = re.sub(r'\s*\(CONT\'D\)', '', text)
        #text = re.sub(r'\s*\(V.O.\)', '', text)
        #text = re.sub(r'\s*\(0.S.\)', '', text)
        #text = re.sub(r'\s*\( O . S . \)', '', text)
        #text = re.sub(r'\s*\(O.S.\)', '', text)
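
The many near-identical patterns above differ only in spacing, letter case, and a zero scanned in place of the letter O. As a sketch of how they might be folded into one expression (an assumption, not the approach used in this notebook), the pattern below allows optional whitespace between the characters; `MARKER` and `strip_markers` are hypothetical names. The bare `V/0` variant is not covered.

```python
import re

# optional spaces between the characters; "0" is accepted for a mis-scanned "O"
MARKER = re.compile(
    r"\s*\(\s*(?:V\s*\.\s*[O0]\s*\.|[O0]\s*\.\s*S\s*\.|C\s*O\s*N\s*T\s*'\s*D)\s*\)",
    re.IGNORECASE,
)


def strip_markers(text):
    """Remove (V.O.), (O.S.) and (CONT'D) markers, including spaced-out variants."""
    return MARKER.sub("", text)


print(strip_markers("GALADRIEL (V.O.) (CONT'D)"))
print(strip_markers("FRODO ( O . S . )"))
print(strip_markers("SAM (c ont 'd )"))
```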

In [37]:
# info related to page turns
        #text = re.sub(r'\s*Final Revision - October, 2003 [0-9]+.', '', text)
        #text = re.sub(r'\s*\(CONTINUED\)\s*[0-9]*\.?', '', text)
        #text = re.sub(r'\s*\(M�RE\)', '', text)
        #text = re.sub(r'\s*\(MORE\)', '', text)
    
        #text = re.sub(r'CONTINUED:\s\(\s?[0-9]\s?\)', '\n', text)
        #text = re.sub(r'CONTINUED:', '\r\n\r\n', text)
        
        # matches page numbers (that do not have "continued" before or after it)
        # but does not match the only number in the text: "1296...a very good year" by only matching if there is not more than one .
        #text = re.sub(r'\s*[0-9]{1,3}\.[^\.]\s*', '\r\n\r\n', text) 

In [12]:
# merge sentences (when the text in a cell ends with "..." and the next cell starts with "...")
subtitle_column = dialog_1["SUBTITLE TEXT"]
for num, element in enumerate(subtitle_column.array):
        if(not pd.isna(element) and num < 1793):
                nextElement = subtitle_column.array[num+1]
                if(not pd.isna(nextElement) and re.search(r'\.\.\.$', element) and re.search(r'^\.\.\.', nextElement)):
                    # note: this only rebinds the loop variable; the column itself is not changed
                    element = f'{element} {nextElement}'
                #print(num, subtitle_column.array[num], subtitle_column.array[num+1])
        # re.search takes all lines of input string into account, re.match only the first line
        # check if string ends with "..."
        #thisElement = str(dialog_1["SUBTITLE TEXT"].iloc[[num]])
        #nextElement = str(dialog_1["SUBTITLE TEXT"].iloc[[num +1]])
        #if(re.search(r'\.\.\.$', thisElement) and re.search(r'^\.\.\.', nextElement)):
        #if(re.search(r'\.\.\.$', thisElement)):
            #print(dialog_1["SUBTITLE TEXT"].iloc[[num, num +1]])
# check if next x rows start with "..."
# concat all those rows
#data["Name"]= data["Name"].str.cat(new, sep =", ")

NameError: name 'dialog_1' is not defined
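
The merging idea from the unfinished cell above can be sketched on a plain list of strings. This is a minimal illustration, not the notebook's code: `merge_continued` is a hypothetical helper, the sample lines are made up, and, unlike the snippet above, the two ellipses are dropped when the halves are joined. Missing values would have to be removed before calling it.

```python
def merge_continued(lines):
    """Join consecutive entries where one ends with "..." and the next starts with "..."."""
    merged = []
    for line in lines:
        if merged and merged[-1].endswith("...") and line.startswith("..."):
            # drop both ellipses and glue the two halves together
            merged[-1] = f"{merged[-1][:-3].rstrip()} {line[3:].lstrip()}"
        else:
            merged.append(line)
    return merged


sample = [
    "It began with the forging...",
    "...of the Great Rings.",
    "Three were given to the Elves.",
]
result = merge_continued(sample)
print(result)
```

Because the check always looks at the last merged entry, a sentence spread over three or more rows is joined in a single pass.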

In [13]:
# filter lines that contain speech
speech = lotr[(lotr["Sprecher/Regie"] != "Regie")].copy()

# remove text INSIDE square brackets (these are stage directions and unwanted)

for index, row in speech.iterrows():
    # remove text inside square brackets AND ALSO the brackets themselves
    speech.at[index, "Text"] = re.sub(r"\[[^\]]*\]", "", speech.at[index, "Text"])
    # strip to remove whitespace in beginning and end
    speech.at[index, "Text"] = speech.at[index, "Text"].strip()

# print lines for control
for index, row in speech.iterrows():
    #if "[" in row["Text"]:
        print(index, row["Text"])

# reset index and Nr.-column
count = 1
for index, row in speech.iterrows():
    speech.at[index, "Nr."] = count
    count += 1
speech.reset_index(drop=True, inplace=True)

11 You cannot pass!
12 Gandalf!
13 I am a servant of the Secret Fire, wielder of the Flame of Anor!
15 Argh! Go back to the shadow. The Dark Fire will not avail you, Flame of Udun!  YOU SHALL NOT... PASS!!!
17 Argh!
19 No! No!
20 Gandalf!
21 Fly, you fools!
23 Noooooooooooooooo!!!!
25 Gaaandaaaaalf!!
27 Gandalf!
28 What is it, Mr. Frodo?
29 Nothing. Just a dream.
31 Mordor. The one place in Middle-earth we don't want to see any closer, and the one place we're trying to get to. It's just where we can'tt get. Let's face it, Mr. Frodo, we're lost. I don't think Gandalf meant for us to come this way.
32 He didn't mean for a lot of things to happen, Sam... but they did.
34 Mr. Frodo? It's the Ring, isn't it?
35 It's getting heavier.
36 What food have we got left?
37 Well, let me see.  Oh yes, lovely Lembas bread. And look!  More lembas bread.
39 I don't usually hold with foreign food, but this Elvish stuff, it's not bad.
40 Nothing ever dampens your spirits, does it Sam?
42 Those rain cloud

1046 Timbers! Brace the Gate!
1048 Come on! We can take them!
1049 It's a long way.
1051 Toss me.
1052 What?
1053 I cannot jump the distance! You'll have to toss me!
1055 Oh!  Don't tell the Elf.
1056 Not a word.
1058 ARGH!!
1060 Shore up the door!
1061 Make way!
1062 Follow me to the barricade.
1063 Watch our backs!
1064 Throw another one over here!
1065 Higher!
1067 Hold fast the gate!]
1068 Gimli! Aragorn! Get out of there!
1070 Aragorn!
1072 Pull everybody back! Pull them back!
1073 Fall back! Fall back!
1074 They've broken through! The castle is breached. Retreat!
1075 Fall back!
1076 Retreat!
1077 Hurry! Inside! Get them inside!
1078 Into the Keep!
1082 I will leave you at the western borders of the forest. You can make your way north to your homeland from there.
1084 Wait! Stop! Stop!  Turn around. Turn around. Take us south!
1085 South? But that will lead you past Isengard.
1086 Yes. Exactly. If we go south we can slip past Saruman unnoticed. The closer we are to danger, the fa

In [14]:
# remove elvish (and westron) original lines and keep translation  

for index, row in speech.iterrows():
    # condition: must contain bracket
    if "(" in row["Text"]:
        # remove content outside brackets, regex expression idea from 41686d6564's answer at https://stackoverflow.com/questions/64740437/remove-text-outside-of-bracket
        speech.at[index, "Text"] = re.sub(r"[^\[()]+(?=[(]|$)", "", speech.at[index, "Text"])
        # remove brackets
        speech.at[index, "Text"] = re.sub("[()]", "", speech.at[index, "Text"])
        # strip to remove whitespace in beginning and end
        speech.at[index, "Text"] = speech.at[index, "Text"].strip()

# print for control    
for index, row in speech.iterrows():
    print(index, row["Text"])


0 You cannot pass!
1 Gandalf!
2 I am a servant of the Secret Fire, wielder of the Flame of Anor!
3 Argh! Go back to the shadow. The Dark Fire will not avail you, Flame of Udun!  YOU SHALL NOT... PASS!!!
4 Argh!
5 No! No!
6 Gandalf!
7 Fly, you fools!
8 Noooooooooooooooo!!!!
9 Gaaandaaaaalf!!
10 Gandalf!
11 What is it, Mr. Frodo?
12 Nothing. Just a dream.
13 Mordor. The one place in Middle-earth we don't want to see any closer, and the one place we're trying to get to. It's just where we can'tt get. Let's face it, Mr. Frodo, we're lost. I don't think Gandalf meant for us to come this way.
14 He didn't mean for a lot of things to happen, Sam... but they did.
15 Mr. Frodo? It's the Ring, isn't it?
16 It's getting heavier.
17 What food have we got left?
18 Well, let me see.  Oh yes, lovely Lembas bread. And look!  More lembas bread.
19 I don't usually hold with foreign food, but this Elvish stuff, it's not bad.
20 Nothing ever dampens your spirits, does it Sam?
21 Those rain clouds might.
2

676 Brace the gate!
677 As long as you can give me!
678 Gimli!
679 Timbers! Brace the Gate!
680 Come on! We can take them!
681 It's a long way.
682 Toss me.
683 What?
684 I cannot jump the distance! You'll have to toss me!
685 Oh!  Don't tell the Elf.
686 Not a word.
687 ARGH!!
688 Shore up the door!
689 Make way!
690 Follow me to the barricade.
691 Watch our backs!
692 Throw another one over here!
693 Higher!
694 Hold fast the gate!]
695 Gimli! Aragorn! Get out of there!
696 Aragorn!
697 Pull everybody back! Pull them back!
698 Fall back! Fall back!
699 They've broken through! The castle is breached. Retreat!
700 Fall back!
701 Retreat!
702 Hurry! Inside! Get them inside!
703 Into the Keep!
704 I will leave you at the western borders of the forest. You can make your way north to your homeland from there.
705 Wait! Stop! Stop!  Turn around. Turn around. Take us south!
706 South? But that will lead you past Isengard.
707 Yes. Exactly. If we go south we can slip past Saruman unnoticed. T