# The Complete Shakespeare Corpus

We now restructure the code from the "preprocessing" notebook so that it works on the entire corpus.

First, the initialization:

In [2]:
import re
from collections import Counter
from pathlib import Path
# Load NLTK and fetch the stopword list
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fried\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## A Single Play

Next, we move the parsing of a single play into a function. We also optimize it, as you suggested, by searching directly for the end tags.

We change the structure slightly so that the order of the speeches within the play is preserved; see the slides and the comment in preprocessing:

In [6]:
# Helpers that fill the top-level dictionary
def work_name(filename):
    # File name without directory and extension
    return Path(filename).stem
# Example: for "comedies\Cymbeline.txt" the stem is "Cymbeline"
#     --> work_name = Cymbeline

def category_name(filename):
    # Name of the directory that contains the file
    return Path(filename).parent.name
# Example: for "comedies\Cymbeline.txt" the parent directory is "comedies"
#     --> category_name = comedies

def parse_work(filename):
    work = {
        "name": work_name(filename),
        "category": category_name(filename),
        "speakers": [],
        "texts": []
    } # Each .txt file is mapped to one such dictionary.
    
    # Pass 1: determine the speakers
    with open(filename, encoding = "utf16") as file:
        text = file.read()
        alle_tags = re.findall("</[^>]*>", text)
        
        # Drop all tags for SCENE, ACT, STAGE and SONG
        filtered = [a for a in alle_tags if "SCENE" not in a]
        filtered = [a for a in filtered if "ACT" not in a]
        filtered = [a for a in filtered if "STAGE" not in a]
        filtered = [a for a in filtered if "SONG" not in a]
        
        # Drop all tags that contain lowercase letters
        filtered = [a for a in filtered if not any(c.islower() for c in a)]
        
        # Strip the leading "</" and the trailing ">":
        filtered = [a[2:-1] for a in filtered]
        for s in filtered:
            if s not in work["speakers"]:
                work["speakers"].append(s)
    
    # Pass 2: collect the speeches in their original order
    with open(filename, encoding="utf16") as f:
        current_speaker = ""
        current_text = ""
        stage = False
        for line in f:
            if current_speaker == "":
                # Look for the start tag of any known speaker
                for key in work["speakers"]:
                    if "<" + key + ">" in line:
                        current_speaker = key
                        break
            elif "</" + current_speaker + ">" in line:
                # End tag reached: store the finished speech
                work["texts"].append({
                    "speaker": current_speaker,
                    "text": current_text.strip()
                })
                current_text = ""
                current_speaker = ""
            else:
                if "<STAGE" in line:
                    stage = True
                if "</STAGE" in line:
                    stage = False
                if not stage:
                    # Drop SONG tags and anything up to a closing STAGE DIR tag
                    line = line.replace("<SONG>", "").replace("</SONG>", "")
                    current_text = current_text + " " + re.sub(r'.*</STAGE DIR>', '', line).strip()
    return work
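
To make the first pass easier to follow, here is a minimal sketch that runs the same end-tag filtering on an in-memory string. The sample markup and the helper name `find_speakers` are assumptions made for illustration; they are not taken from the corpus files.

```python
import re

# Hypothetical sample in the tag style the parser expects (not from the corpus).
sample = """<ACT 1>
<SCENE 1>
<STAGE DIR>
Enter two Gentlemen.
</STAGE DIR>
<FIRST GENTLEMAN>
You do not meet a man but frowns.
</FIRST GENTLEMAN>
<SECOND GENTLEMAN>
But what's the matter?
</SECOND GENTLEMAN>
<FIRST GENTLEMAN>
His daughter, and the heir of's kingdom.
</FIRST GENTLEMAN>
</SCENE 1>
</ACT 1>
"""

def find_speakers(text):
    """Return speaker names in order of first appearance, based on end tags."""
    end_tags = re.findall(r"</[^>]*>", text)
    structural = ("SCENE", "ACT", "STAGE", "SONG")
    speakers = []
    for tag in end_tags:
        # Skip structural tags and tags containing lowercase letters
        if any(word in tag for word in structural):
            continue
        if any(c.islower() for c in tag):
            continue
        name = tag[2:-1]  # strip "</" and ">"
        if name not in speakers:
            speakers.append(name)
    return speakers

speakers = find_speakers(sample)
print(speakers)
```

Note that the filter works on substrings, so a speaker whose name happened to contain ACT, SCENE, STAGE or SONG would be dropped as well.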

## All Plays

Now we go through all files and build a list of all plays.

In [13]:
works = []
for filename in Path(".").glob('*/*.txt'):
    w = parse_work(str(filename))
    print("{} ({})".format(w["name"], w["category"]))
    works.append(w)
    

A Midsummer-Night's Dream (comedies)
All’s Well that Ends Well (comedies)
As You Like It (comedies)
Cymbeline (comedies)
Love's Labour's Lost (comedies)
Measure for Measure (comedies)
Much Ado About Nothing (comedies)
Pericles, Prince of Tyre (comedies)
The Comedy of Errors (comedies)
The Merchant Of Venice (comedies)
The Merry Wives of Windsor (comedies)
The Taming of the Shrew (comedies)
The Tempest (comedies)
The Two Gentlemen of Verona (comedies)
The Winter’s Tale (comedies)
Troilus and Cressida (comedies)
Twelfth-Night; or What You Will (comedies)
The Famous History of the Life of King Henry VIII (historical)
The First Part of King Henry IV (historical)
The First Part of King Henry VI (historical)
The Life and Death of King John (historical)
The Life of King Henry V (historical)
The Second Part of King Henry IV (historical)
The Second Part of King Henry VI (historical)
The Third Part of King Henry VI (historical)
The Tragedy of King Richard II (historical)
The Tragedy of King Rich

## Tokenization

Here is the tokenize function that we developed:

In [None]:
def tokenize(text):
    tokens = text.split()
    tokens = [t.strip(".,;!?").lower() for t in tokens]
    return tokens
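
A short usage sketch shows what the function does and where it stops. The sample sentence is chosen for illustration only; the function body is repeated so that the block runs on its own.

```python
def tokenize(text):
    tokens = text.split()
    tokens = [t.strip(".,;!?").lower() for t in tokens]
    return tokens

line = "To be, or not to be: that is the question."
tokens = tokenize(line)
print(tokens)
# ['to', 'be', 'or', 'not', 'to', 'be:', 'that', 'is', 'the', 'question']
```

The colon in `be:` survives because `:` is not in the set of stripped characters, so `be` and `be:` count as two different tokens.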

## Helper Functions

To run our analyses again, we should write a few helper functions:

In [None]:
def work(name):
    for w in works:
        if w["name"] == name:
            return w
    raise ValueError("Work not found: {}".format(name))

def work_text(name):
    alleszusammen = ""
    w = work(name)
    for t in w["texts"]:
        alleszusammen = alleszusammen + " " + t["text"]
    return alleszusammen

def speaker_text(work, speaker):
    alleszusammen = ""
    for t in work["texts"]:
        if t["speaker"] == speaker:
            alleszusammen = alleszusammen + " " + t["text"]
    return alleszusammen

# We check that this still matches the earlier values, cf. preprocessing

tokens = tokenize(work_text("Cymbeline"))
print("Anzahl Tokens: {}".format(len(tokens)))
print("Anzahl unterschiedlicher Tokens: {}".format(len(set(tokens))))

## Stopwords

Here is the stopword removal, as in preprocessing:

In [None]:
def remove_stopwords(tokens):
    return [t for t in tokens if t not in stop]
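
Because `stop` is a list, every membership test scans the whole list. On the full corpus it may be worth converting it to a set once. The sketch below assumes a tiny hand-written stopword list instead of the NLTK one so that it runs without a download; `stop_set` and `remove_stopwords_fast` are hypothetical names.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
stop = ["to", "be", "or", "not", "that", "is", "the"]
stop_set = set(stop)  # average O(1) lookup instead of scanning the list

def remove_stopwords_fast(tokens):
    return [t for t in tokens if t not in stop_set]

tokens = ["to", "be", "or", "not", "to", "be:", "that", "is", "the", "question"]
filtered = remove_stopwords_fast(tokens)
print(filtered)
# ['be:', 'question']
```

The result is the same as with the list-based version; only the lookup cost changes.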


## Homework

### Longest Play


In [None]:
for w in sorted(works, reverse=True, key=lambda w: len(tokenize(work_text(w["name"])))):
    print("{} ({} Tokens)".format(w["name"], len(tokenize(work_text(w["name"])))))
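
The loop above tokenizes every play twice, once for sorting and once for printing. A possible variant counts the tokens once per play and sorts the resulting pairs. The `works` list below is a hypothetical stand-in with made-up texts, and `tokenize` is repeated so that the block is self-contained.

```python
# Hypothetical stand-in for the `works` list built earlier.
works = [
    {"name": "Play A", "texts": [{"speaker": "X", "text": "one two three"}]},
    {"name": "Play B", "texts": [{"speaker": "Y", "text": "one two three four five"},
                                  {"speaker": "X", "text": "six"}]},
    {"name": "Play C", "texts": [{"speaker": "Z", "text": "one"}]},
]

def tokenize(text):
    return [t.strip(".,;!?").lower() for t in text.split()]

# Count tokens once per play, then sort the (name, count) pairs.
lengths = []
for w in works:
    full_text = " ".join(t["text"] for t in w["texts"])
    lengths.append((w["name"], len(tokenize(full_text))))
ranking = sorted(lengths, key=lambda pair: pair[1], reverse=True)

for name, count in ranking:
    print("{} ({} tokens)".format(name, count))
```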
    

### Speaker with the Most Text


In [None]:
all_speakers = [(work["name"], speaker) for work in works for speaker in work["speakers"]]
for s in sorted(all_speakers, reverse=True, key=lambda s: len(tokenize(speaker_text(work(s[0]),s[1])))):
    print("{} in {} ({} Tokens)".format(s[1], s[0], len(tokenize(speaker_text(work(s[0]),s[1])))))
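
The same ranking can be obtained in a single pass over all speeches by summing token counts per (play, speaker) pair in a `Counter`. This is a sketch under the assumption that `works` has the structure built by `parse_work`; the data below is made up.

```python
from collections import Counter

# Hypothetical stand-in for the `works` list built earlier.
works = [
    {"name": "Play A", "texts": [{"speaker": "X", "text": "a b c"},
                                  {"speaker": "Y", "text": "d"},
                                  {"speaker": "X", "text": "e f"}]},
    {"name": "Play B", "texts": [{"speaker": "X", "text": "g h i j"}]},
]

def tokenize(text):
    return [t.strip(".,;!?").lower() for t in text.split()]

# Sum the token counts of all speeches per (play, speaker) pair.
totals = Counter()
for w in works:
    for t in w["texts"]:
        totals[(w["name"], t["speaker"])] += len(tokenize(t["text"]))

top = totals.most_common(2)
for (play, speaker), count in top:
    print("{} in {} ({} tokens)".format(speaker, play, count))
```

Keying on the pair rather than on the speaker alone keeps identically named speakers from different plays apart.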

### Most Similar to Hamlet
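
The notebook does not say which similarity measure is intended here. One possible approach, shown purely as an assumption, is cosine similarity between word-frequency vectors built with `Counter`. The token lists below are invented; in the notebook they would come from `remove_stopwords(tokenize(work_text(name)))`.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical token lists standing in for the real plays.
docs = {
    "Hamlet": ["ghost", "king", "revenge", "king"],
    "Play A": ["king", "ghost", "crown"],
    "Play B": ["forest", "love", "wedding"],
}

hamlet = Counter(docs["Hamlet"])
scores = {
    name: cosine_similarity(hamlet, Counter(toks))
    for name, toks in docs.items()
    if name != "Hamlet"
}
best = max(scores, key=scores.get)
print(best, scores[best])
```

Raw counts favor frequent words, so removing stopwords first matters for this measure; other weightings are equally plausible.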
