# Assignment 05
#### Python Basics V - Text Processing

This tutorial was written by Terry L. Ruas (University of Göttingen). External sources from which this material was adapted, or which inspired it, are credited in the Acknowledgments section at the end of the document.

This notebook will cover the following tasks:

1. Text Pre-Processing
2. Simple Text Analysis

## Task 01 – Text Pre-Processing
A computational analysis of natural language text typically requires several pre-processing steps: excluding irrelevant parts of the text; separating the text into words, phrases, or sentences, depending on the use case; removing so-called stop words, i.e., words that carry little to no semantic meaning; and normalizing the text, e.g., by removing punctuation and capitalization.
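
To make these steps concrete, here is a minimal sketch that applies them to a single sentence. The `preprocess()` helper and the tiny `STOP_WORDS` set are assumptions made only for this illustration; the assignment itself uses the provided stop word lists. Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes such as ’ would survive this step.

```python
import string

# Tiny hand-picked stop word set, for illustration only (not one of the provided lists).
STOP_WORDS = {"the", "is", "a", "and", "of", "to"}


def preprocess(sentence):
    """Lowercase, strip ASCII punctuation, split into words, drop stop words."""
    lowered = sentence.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("Fair is foul, and foul is fair!"))
# ['fair', 'foul', 'foul', 'fair']
```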

Use the *download_file()* function developed in the past assignments to download the plain text versions of Shakespeare’s play [Macbeth](https://ia802707.us.archive.org/1/items/macbeth02264gut/0ws3410.txt) and Bacon’s [New Atlantis](https://ia801309.us.archive.org/24/items/newatlantis02434gut/nwatl10.txt). If you chose not to implement Assignment 4, Task 6, download the files manually. We will also provide some txt files.
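
If you no longer have your own *download_file()* at hand, the following is one possible sketch built on the standard library. Its signature (`url`, `target_path`, `overwrite`) is an assumption and may differ from the version you wrote in the previous assignment.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url, target_path, overwrite=False):
    """Download `url` to `target_path` and return the target as a Path.

    Hypothetical signature; adapt it to your own implementation.
    """
    target = Path(target_path)
    if target.exists() and not overwrite:
        # Skip the download when a local copy is already present.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response straight into the file instead of loading it into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Calling `download_file(url, 'Macbeth.txt')` with the Macbeth link above would then store the play under the file name used in the code cell further down.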

Inspect these real-world texts manually to get an idea of what needs to be done to clean and prepare the texts for computational analysis. Implement the following functions to perform common pre-processing steps on the texts:
1. *get_speaker_text()* – returns only the text spoken by the characters in the plays and removes all other text in the files, such as:
    - Information about the file, acknowledgements, copyright notices etc.
    - Headings indicating the act and scene
    - Stage instructions
    - Character names
2. *normalize_text()*
    - converts all text to lower case
    - removes all punctuation from the texts
3. *remove_stopwords()* – eliminates all words that appear in a list of English stop words (we provide two stop word lists; try both and compare the results)
4. *tokenize_text()* – splits the cleaned text into words

This program is a prerequisite for the next task.
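
The last step, *tokenize_text()*, could look roughly like the sketch below. It assumes the earlier steps hand over a list of cleaned strings (one per speech line) and that a flat list of words is wanted; both are assumptions about the data layout, not requirements of the task.

```python
def tokenize_text(text_list):
    """Split each cleaned line on whitespace and return one flat list of words."""
    tokens = []
    for text in text_list:
        # str.split() without arguments splits on any run of whitespace and
        # never yields empty strings, so blank lines contribute nothing.
        tokens.extend(text.split())
    return tokens


print(tokenize_text(["long liue king", "", "vpon heath"]))
# ['long', 'liue', 'king', 'vpon', 'heath']
```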

In [26]:
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords



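def get_speaker_Macbeth(file_name):
    # Collect the spoken text of every speech line in the Macbeth file
    speaker_text = []
    with open(file_name, 'r', encoding='ISO-8859-1') as file:
        for line in file:
            # Heuristic: speech lines are indented and contain a '.', which is
            # taken to end the speaker name; keep only the text after the first '.'
            if line.startswith((' ', '\t')) and '.' in line:
                speaker_text.append(line.split('.', 1)[1])
    return speaker_text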

file_name = 'Macbeth.txt'


def normalize_text(text_list):
    # Build the punctuation-removal table once and reuse it for every line
    remove_punctuation = str.maketrans('', '', string.punctuation)
    cleaned_text_list = []
    for text in text_list:
        # Convert text to lowercase
        text = text.lower()
        # Remove punctuation
        text = text.translate(remove_punctuation)
        cleaned_text_list.append(text)
    return cleaned_text_list

text_list1 = get_speaker_Macbeth(file_name)
text_list1 = normalize_text(text_list1)

def remove_stop_words(text_list):
    # Use a set for fast membership tests against the NLTK English stop words
    stop_words = set(stopwords.words('english'))
    cleaned_text_list = []
    for text in text_list:
        # Keep only the words that are not stop words, then rejoin the line
        cleaned_text = [word for word in text.split() if word.lower() not in stop_words]
        cleaned_text_list.append(" ".join(cleaned_text))
    return cleaned_text_list

text_list2 = text_list1
print(remove_stop_words(text_list2))

#import re

#def get_speaker_atlantis(file_name):
#    with open(file_name, 'r', encoding='ISO-8859-1') as file:
#        alltext = file.read()
#    split_text = re.findall(r'"(.*?)"', alltext)
#    return split_text



['may however', '', '', '', 'royalties', '', 'whos', 'nay answer stand vnfold', 'long liue king', 'shall three meet againe', 'hurleyburleys done', 'ere set sunne', 'place', 'vpon heath', 'meet macbeth', 'come graymalkin', 'padock calls anon faire foule foule faire', 'bloody man report', 'serieant', 'doubtfull stood', 'valiant cousin worthy gentleman', 'whence sunne gins reflection', 'dismayd captaines macbeth', 'yes sparrowes eagles', 'well thy words become thee thy wounds', 'worthy thane rosse', 'haste lookes eyes', 'god saue king', 'whence camst thou worthy thane', 'fiffe great king', 'great happinesse', 'sweno norwayes king', 'thane cawdor shall deceiue', 'ile see done', 'hath lost noble macbeth hath wonne', 'hast thou beene sister', 'killing swine', 'sister thou', 'saylors wife chestnuts lappe', 'ile giue thee winde', 'thart kinde', 'another', 'selfe haue', 'shew shew', 'haue pilots thumbe', 'drumme drumme', 'weyward sisters hand hand', 'foule faire day haue seene', 'farre ist call



## Task 02 – Simple Text Analysis
The [Baconian theory](https://en.wikipedia.org/wiki/Baconian_theory_of_Shakespeare_authorship) holds that Sir Francis Bacon is the author of Shakespeare’s plays. We want to perform a very simple stylistic analysis between Shakespeare’s play Macbeth and Bacon’s New Atlantis. We check for words that frequently occur in both documents to see whether there are characteristic words that co-occur in the texts, which might give some support to the theory.

Your Task:
1. Download and pre-process the texts as follows:  
  New Atlantis
    1. *get_speaker_text()*
    2. *normalize_text()*
    3. *remove_stopwords()*
    4. *tokenize_text()*   
  
 Macbeth
    1. *get_speaker_text()*
    2. *normalize_text()*
        1. *utils_ocr.correct_ocr_errors()* – we will provide a function to deal with OCR errors.
    3. *remove_stopwords()*
    4. *tokenize_text()*
2. For the pre-processed texts, compute the list of word co-occurrence frequencies, i.e. which words occur in both documents and how often. Use the format:  
[term , frequency_doc1 , frequency_doc2 , sum_of_frequencies]  
Sort the list according to the sum of the frequencies in descending order.
3. Use the csv library to store the ordered word co-occurrence frequency list in a CSV file. **You can zip the csv and upload it to GitHub.**
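
Steps 2 and 3 can be sketched as follows. The function names, the CSV header row, and the alphabetical tie-break for equal sums are assumptions made for this illustration; the task only fixes the row format and the descending sort by the sum of frequencies.

```python
import csv
import zipfile
from collections import Counter
from pathlib import Path


def cooccurrence_frequencies(tokens_doc1, tokens_doc2):
    """Return [term, frequency_doc1, frequency_doc2, sum_of_frequencies] rows
    for every term that occurs in both token lists."""
    counts1 = Counter(tokens_doc1)
    counts2 = Counter(tokens_doc2)
    shared_terms = counts1.keys() & counts2.keys()
    rows = [
        [term, counts1[term], counts2[term], counts1[term] + counts2[term]]
        for term in shared_terms
    ]
    # Descending by sum; ties are broken alphabetically (an assumption).
    rows.sort(key=lambda row: (-row[3], row[0]))
    return rows


def write_csv(rows, csv_path):
    """Write the rows to a CSV file, preceded by a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["term", "frequency_doc1", "frequency_doc2", "sum_of_frequencies"])
        writer.writerows(rows)


def zip_file(csv_path, zip_path):
    """Store the CSV file in a compressed zip archive under its bare file name."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.write(csv_path, arcname=Path(csv_path).name)


# Toy token lists, not taken from the actual texts.
doc1 = ["king", "great", "king", "sea"]
doc2 = ["king", "great", "great", "island"]
print(cooccurrence_frequencies(doc1, doc2))
# [['great', 1, 2, 3], ['king', 2, 1, 3]]
```

With the real data, the two arguments would be the token lists produced by *tokenize_text()* for New Atlantis and Macbeth, followed by a call to `write_csv()` and, if desired, `zip_file()`.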