# Diss Corpus Preprocessing and Word Count

This notebook provides a regular expression based approach to clean the literary text corpus. It opens the corpus texts as a dataframe column and adds new columns as different versions of the texts with annotations as layers. 
In the end, it counts the total word token number of each text separately before summing them up to get the total number of words in the corpus.

In [73]:
# Import 
import os
import pandas as pd
import regex as re
from pathlib import Path
from collections import Counter
import csv
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string



[nltk_data] Downloading package punkt to /Users/sguhr/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Read in the corpus

In the following, we read in the corpus. Here, we only take a subset of the corpus to shorten the process time.  

In [43]:
corpus_directory = "/Users/sguhr/Desktop/Diss_Korpus/Diss_Korpus_202303_bereinigt"


In [44]:
# Generate a corpus by loading all the txt files from the chosen directory 
# and list the names of the first 10 txt files 
corpus = os.listdir('/Users/sguhr/Desktop/Diss_Korpus/Diss_Korpus_202303_bereinigt')
corpus[:10]

['von_Wolzogen_Ernst_Vom_Peperl_und von andern_Raritaeten_Der_Raritaetenliabhaber.txt',
 'Anzengruber_Ludwig_Kalendergeschichten_Treff-Ass.txt',
 'Fontane_Theodor_Cecile.txt',
 'Hollaender_Felix_Die_Briefe_des_Fraeulein.txt',
 'Heyse_Paul_Gegen_den_Strom.txt',
 'Peters_August_Blind_und_doch_sehend.txt',
 'Groller_Balduin_Detektiv_Dagobert_Eine_teure_Depesche.txt',
 'von_Zobeltitz_Fedor_Das_Heiratsjahr.txt',
 'Thoma_Ludwig_Nachbarsleute_Bismarck.txt',
 'Heyse_Paul_Marienkind.txt']

In [45]:
#to delete the .DS_Store file that always pops up with Mac

import os

#corpus_directory = "/path/to/your/corpus/directory"
file_to_delete = ".DS_Store"
file_path = os.path.join(corpus_directory, file_to_delete)

if os.path.exists(file_path):
    os.remove(file_path)
    print(f"{file_to_delete} has been deleted.")
else:
    print(f"{file_to_delete} does not exist in the specified directory.")


.DS_Store does not exist in the specified directory.


With the next cell, we ask for the number of corpus texts in the chosen subset. 

In [46]:
# Print how many txt files are in the corpus
corpus_length = len(corpus)
print(corpus_length)

1442


## Convert the corpus to a dataframe 

We then create an empty dictionary and add the file name and the text of the document as columns to build a dataframe of two columns.

In [47]:
# Create an empty dictionary for preparation of the conversion of the txt-file-corpus to a data frame
empty_dictionary = {}

# Loop through the folder of documents to open and read each one
for document in corpus:
    with open('/Users/sguhr/Desktop/Diss_Korpus/Diss_Korpus_202303_bereinigt/' + document, 'r', encoding = 'utf-8') as to_open:
         empty_dictionary[document] = to_open.read()

# Populate the data frame with two columns: file name and document text
diss_corpus_texts = (pd.DataFrame.from_dict(empty_dictionary, 
                                       orient = 'index')
                .reset_index().rename(index = str, 
                                      columns = {'index': 'file_name', 0: 'document_text'}))

In [49]:
#extract the title of the text as a further column with metadata


# Define the regular expression pattern to extract the title followed by double line break \n\n
pattern = r'^(.*?)\n\n'

# Extract the first line and create a new 'titles' column
diss_corpus_texts['title'] = diss_corpus_texts['document_text'].str.extract(pattern, flags=re.DOTALL)
diss_corpus_texts['title'] = diss_corpus_texts['title'].str.replace('<title>', '') # remove xml tags from title column
diss_corpus_texts['title'] = diss_corpus_texts['title'].str.replace('</title>', '') # remove xml tags from title column

# Print the DataFrame to see the results
diss_corpus_texts[:10]

Unnamed: 0,file_name,document_text,title
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,<title>Der Raritätenliabhaber</title>\n\n » Ja...,Der Raritätenliabhaber
1,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,<title>Treff-Aß</title>\n\n\nGibt es ein Buch ...,Treff-Aß
2,Fontane_Theodor_Cecile.txt,<title>Cécile</title>\n\n\n<chapter>Erstes Kap...,Cécile
3,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,<title>Die Briefe des Fräulein Brandt</title>\...,Die Briefe des Fräulein Brandt
4,Heyse_Paul_Gegen_den_Strom.txt,<title>Gegen den Strom</title>\n\n<chapter>Ers...,Gegen den Strom
5,Peters_August_Blind_und_doch_sehend.txt,<title>Blind und doch sehend</title>\n\n<chapt...,Blind und doch sehend
6,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,<title>Eine teure Depesche</title>\n\n\nSie sa...,Eine teure Depesche
7,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,<title>Das Heiratsjahr</title>\n\n<chapter>Ers...,Das Heiratsjahr
8,Thoma_Ludwig_Nachbarsleute_Bismarck.txt,"<title>Bismarck</title>\n\nDie Wahrheit ist, d...",Bismarck
9,Heyse_Paul_Marienkind.txt,<title>Marienkind</title>\n\nAuf der Landstraß...,Marienkind


In the next cell, we verify the content of the first 10 lines of the dataframe.

In [50]:
# show the first 10 lines of the data frame
diss_corpus_texts[:10]

Unnamed: 0,file_name,document_text,title
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,<title>Der Raritätenliabhaber</title>\n\n » Ja...,Der Raritätenliabhaber
1,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,<title>Treff-Aß</title>\n\n\nGibt es ein Buch ...,Treff-Aß
2,Fontane_Theodor_Cecile.txt,<title>Cécile</title>\n\n\n<chapter>Erstes Kap...,Cécile
3,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,<title>Die Briefe des Fräulein Brandt</title>\...,Die Briefe des Fräulein Brandt
4,Heyse_Paul_Gegen_den_Strom.txt,<title>Gegen den Strom</title>\n\n<chapter>Ers...,Gegen den Strom
5,Peters_August_Blind_und_doch_sehend.txt,<title>Blind und doch sehend</title>\n\n<chapt...,Blind und doch sehend
6,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,<title>Eine teure Depesche</title>\n\n\nSie sa...,Eine teure Depesche
7,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,<title>Das Heiratsjahr</title>\n\n<chapter>Ers...,Das Heiratsjahr
8,Thoma_Ludwig_Nachbarsleute_Bismarck.txt,"<title>Bismarck</title>\n\nDie Wahrheit ist, d...",Bismarck
9,Heyse_Paul_Marienkind.txt,<title>Marienkind</title>\n\nAuf der Landstraß...,Marienkind


You can see still some \n\n these are line breaks. You can use regular expressions to extract the first line as title of the text. And the file name contains the name of the author, but that is not as easy to extract. Better to be extracted from the metadata with a comparison of filename and filename indicated in metadata.

In the next cell, we do some basic text cleaning steps.

## Preprocessing

In the next cell, we extract the title of the text by extracting the first line followed by two line breaks. This is possible because we know about the structure of the text that were manually prepared following that schema.

In [66]:
#create a new column
#use regular expressions to clean the plain text and store the cleaned text in a new column as a further layer of the text without deleting the original version


diss_corpus_texts['clean_text'] = diss_corpus_texts['document_text'].str.replace('&', 'and') # exchange & for 'and'
diss_corpus_texts['clean_text'] = diss_corpus_texts['clean_text'].str.replace('<title>', '') # remove xml tags from clean_text column
diss_corpus_texts['clean_text'] = diss_corpus_texts['clean_text'].str.replace('</title>', '') # remove xml tags from clean_text column
diss_corpus_texts['clean_text'] = diss_corpus_texts['clean_text'].str.replace('<section>', '') # remove xml tags from clean_text column
diss_corpus_texts['clean_text'] = diss_corpus_texts['clean_text'].str.replace('</section>', '') # remove xml tags from clean_text column
diss_corpus_texts['clean_text'] = diss_corpus_texts['clean_text'].str.replace('<chapter>', '') # remove xml tags from clean_text column
diss_corpus_texts['clean_text'] = diss_corpus_texts['clean_text'].str.replace('</chapter>', '') # remove xml tags from clean_text column
diss_corpus_texts['clean_text'] = diss_corpus_texts['clean_text'].str.replace('\n+', ' ') # replace double line break with single
diss_corpus_texts['clean_text'] = diss_corpus_texts['clean_text'].str.replace('\s+', ' ') # replace double white space with single

In [67]:
# show the first 10 lines of the data frame
diss_corpus_texts[:10]

Unnamed: 0,file_name,document_text,title,clean_text
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,<title>Der Raritätenliabhaber</title>\n\n » Ja...,Der Raritätenliabhaber,"Der Raritätenliabhaber\n\n » Ja, grüaß Eahna G..."
1,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,<title>Treff-Aß</title>\n\n\nGibt es ein Buch ...,Treff-Aß,"Treff-Aß\n\n\nGibt es ein Buch des Schicksals,..."
2,Fontane_Theodor_Cecile.txt,<title>Cécile</title>\n\n\n<chapter>Erstes Kap...,Cécile,Cécile\n\n\nErstes Kapitel\n\n\n » Thale. Zwei...
3,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,<title>Die Briefe des Fräulein Brandt</title>\...,Die Briefe des Fräulein Brandt,"Die Briefe des Fräulein Brandt\n\nIserbaude, 7..."
4,Heyse_Paul_Gegen_den_Strom.txt,<title>Gegen den Strom</title>\n\n<chapter>Ers...,Gegen den Strom,Gegen den Strom\n\nErstes Kapitel.\n\nEs war z...
5,Peters_August_Blind_und_doch_sehend.txt,<title>Blind und doch sehend</title>\n\n<chapt...,Blind und doch sehend,Blind und doch sehend\n\nI. Ein junger Arzt.\n...
6,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,<title>Eine teure Depesche</title>\n\n\nSie sa...,Eine teure Depesche,Eine teure Depesche\n\n\nSie saßen wieder zu d...
7,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,<title>Das Heiratsjahr</title>\n\n<chapter>Ers...,Das Heiratsjahr,Das Heiratsjahr\n\nErstes Kapitel.\n\nIn welch...
8,Thoma_Ludwig_Nachbarsleute_Bismarck.txt,"<title>Bismarck</title>\n\nDie Wahrheit ist, d...",Bismarck,"Bismarck\n\nDie Wahrheit ist, daß es in Bernau..."
9,Heyse_Paul_Marienkind.txt,<title>Marienkind</title>\n\nAuf der Landstraß...,Marienkind,"Marienkind\n\nAuf der Landstraße, die in gerin..."


Now, we want to find out more about the texts in our corpus. 


## Word Count

In the following, I will tokenize the texts and count the total number of word tokens excluding the punctuation via "not in string.punctuation".

In [74]:


# Function to count words in a text excluding stopwords and punctuation
def count_words(text):
    # Tokenize the text using the German tokenizer
    words = word_tokenize(text, language='german')
    
    # Exclude stopwords and punctuation
    filtered_words = [word for word in words if word not in string.punctuation]
    
    # Return the count of non-stopwords and non-punctuation words
    return len(filtered_words)

# Apply the count_words function to each row in the DataFrame
diss_corpus_texts['word_count'] = diss_corpus_texts['clean_text'].apply(count_words)

# Display the DataFrame with total word counts
#diss_corpus_texts[:10]

Unnamed: 0,file_name,document_text,title,clean_text,word_count,word_count_new
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,<title>Der Raritätenliabhaber</title>\n\n » Ja...,Der Raritätenliabhaber,"Der Raritätenliabhaber\n\n » Ja, grüaß Eahna G...",4959,4109
1,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,<title>Treff-Aß</title>\n\n\nGibt es ein Buch ...,Treff-Aß,"Treff-Aß\n\n\nGibt es ein Buch des Schicksals,...",5722,4738
2,Fontane_Theodor_Cecile.txt,<title>Cécile</title>\n\n\n<chapter>Erstes Kap...,Cécile,Cécile\n\n\nErstes Kapitel\n\n\n » Thale. Zwei...,67705,57754
3,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,<title>Die Briefe des Fräulein Brandt</title>\...,Die Briefe des Fräulein Brandt,"Die Briefe des Fräulein Brandt\n\nIserbaude, 7...",55281,47298
4,Heyse_Paul_Gegen_den_Strom.txt,<title>Gegen den Strom</title>\n\n<chapter>Ers...,Gegen den Strom,Gegen den Strom\n\nErstes Kapitel.\n\nEs war z...,75434,65474
5,Peters_August_Blind_und_doch_sehend.txt,<title>Blind und doch sehend</title>\n\n<chapt...,Blind und doch sehend,Blind und doch sehend\n\nI. Ein junger Arzt.\n...,14672,13002
6,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,<title>Eine teure Depesche</title>\n\n\nSie sa...,Eine teure Depesche,Eine teure Depesche\n\n\nSie saßen wieder zu d...,9442,8176
7,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,<title>Das Heiratsjahr</title>\n\n<chapter>Ers...,Das Heiratsjahr,Das Heiratsjahr\n\nErstes Kapitel.\n\nIn welch...,101133,85969
8,Thoma_Ludwig_Nachbarsleute_Bismarck.txt,"<title>Bismarck</title>\n\nDie Wahrheit ist, d...",Bismarck,"Bismarck\n\nDie Wahrheit ist, daß es in Bernau...",2980,2617
9,Heyse_Paul_Marienkind.txt,<title>Marienkind</title>\n\nAuf der Landstraß...,Marienkind,"Marienkind\n\nAuf der Landstraße, die in gerin...",34429,30288


In [69]:
# Example DataFrame with a 'text' column
#text = 'Dies ist ein Beispielsatz. Ein weiteres Beispiel für Text mit mehr Wörtern.'
#words = word_tokenize(text, language='german')
#print(words)


# Function to count words in a text
def count_tokens(text):
    # Tokenize the text
    tokens = word_tokenize(text, language='german')
    
    # Return the total token count including punctuation
    return len(tokens)

# Apply the count_tokens function to each row in the DataFrame
diss_corpus_texts['token_count'] = diss_corpus_texts['clean_text'].apply(count_tokens)

# Display the DataFrame with total word counts
#diss_corpus_texts[:10]




Unnamed: 0,file_name,document_text,title,clean_text,word_count
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,<title>Der Raritätenliabhaber</title>\n\n » Ja...,Der Raritätenliabhaber,"Der Raritätenliabhaber\n\n » Ja, grüaß Eahna G...",4959
1,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,<title>Treff-Aß</title>\n\n\nGibt es ein Buch ...,Treff-Aß,"Treff-Aß\n\n\nGibt es ein Buch des Schicksals,...",5722
2,Fontane_Theodor_Cecile.txt,<title>Cécile</title>\n\n\n<chapter>Erstes Kap...,Cécile,Cécile\n\n\nErstes Kapitel\n\n\n » Thale. Zwei...,67705
3,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,<title>Die Briefe des Fräulein Brandt</title>\...,Die Briefe des Fräulein Brandt,"Die Briefe des Fräulein Brandt\n\nIserbaude, 7...",55281
4,Heyse_Paul_Gegen_den_Strom.txt,<title>Gegen den Strom</title>\n\n<chapter>Ers...,Gegen den Strom,Gegen den Strom\n\nErstes Kapitel.\n\nEs war z...,75434
5,Peters_August_Blind_und_doch_sehend.txt,<title>Blind und doch sehend</title>\n\n<chapt...,Blind und doch sehend,Blind und doch sehend\n\nI. Ein junger Arzt.\n...,14672
6,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,<title>Eine teure Depesche</title>\n\n\nSie sa...,Eine teure Depesche,Eine teure Depesche\n\n\nSie saßen wieder zu d...,9442
7,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,<title>Das Heiratsjahr</title>\n\n<chapter>Ers...,Das Heiratsjahr,Das Heiratsjahr\n\nErstes Kapitel.\n\nIn welch...,101133
8,Thoma_Ludwig_Nachbarsleute_Bismarck.txt,"<title>Bismarck</title>\n\nDie Wahrheit ist, d...",Bismarck,"Bismarck\n\nDie Wahrheit ist, daß es in Bernau...",2980
9,Heyse_Paul_Marienkind.txt,<title>Marienkind</title>\n\nAuf der Landstraß...,Marienkind,"Marienkind\n\nAuf der Landstraße, die in gerin...",34429


In [75]:
# Rename the columns
#diss_corpus_texts.rename(columns={'word_count': 'token_count', 'word_count_new': 'word_count'}, inplace=True)


In [76]:
# Display the DataFrame with total word counts
diss_corpus_texts[:10]

Unnamed: 0,file_name,document_text,title,clean_text,token_count,word_count
0,von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...,<title>Der Raritätenliabhaber</title>\n\n » Ja...,Der Raritätenliabhaber,"Der Raritätenliabhaber\n\n » Ja, grüaß Eahna G...",4959,4109
1,Anzengruber_Ludwig_Kalendergeschichten_Treff-A...,<title>Treff-Aß</title>\n\n\nGibt es ein Buch ...,Treff-Aß,"Treff-Aß\n\n\nGibt es ein Buch des Schicksals,...",5722,4738
2,Fontane_Theodor_Cecile.txt,<title>Cécile</title>\n\n\n<chapter>Erstes Kap...,Cécile,Cécile\n\n\nErstes Kapitel\n\n\n » Thale. Zwei...,67705,57754
3,Hollaender_Felix_Die_Briefe_des_Fraeulein.txt,<title>Die Briefe des Fräulein Brandt</title>\...,Die Briefe des Fräulein Brandt,"Die Briefe des Fräulein Brandt\n\nIserbaude, 7...",55281,47298
4,Heyse_Paul_Gegen_den_Strom.txt,<title>Gegen den Strom</title>\n\n<chapter>Ers...,Gegen den Strom,Gegen den Strom\n\nErstes Kapitel.\n\nEs war z...,75434,65474
5,Peters_August_Blind_und_doch_sehend.txt,<title>Blind und doch sehend</title>\n\n<chapt...,Blind und doch sehend,Blind und doch sehend\n\nI. Ein junger Arzt.\n...,14672,13002
6,Groller_Balduin_Detektiv_Dagobert_Eine_teure_D...,<title>Eine teure Depesche</title>\n\n\nSie sa...,Eine teure Depesche,Eine teure Depesche\n\n\nSie saßen wieder zu d...,9442,8176
7,von_Zobeltitz_Fedor_Das_Heiratsjahr.txt,<title>Das Heiratsjahr</title>\n\n<chapter>Ers...,Das Heiratsjahr,Das Heiratsjahr\n\nErstes Kapitel.\n\nIn welch...,101133,85969
8,Thoma_Ludwig_Nachbarsleute_Bismarck.txt,"<title>Bismarck</title>\n\nDie Wahrheit ist, d...",Bismarck,"Bismarck\n\nDie Wahrheit ist, daß es in Bernau...",2980,2617
9,Heyse_Paul_Marienkind.txt,<title>Marienkind</title>\n\nAuf der Landstraß...,Marienkind,"Marienkind\n\nAuf der Landstraße, die in gerin...",34429,30288


In [77]:
#save the dataframe to a new csv.file

diss_corpus_texts.to_csv('/Users/sguhr/Desktop/Diss_Korpus/diss_corpus_texts_df.csv', index=False)

End of the Preprocessing.

Read the saved csv as a pandas dataframe.

In [78]:
csv_file_path = '/Users/sguhr/Desktop/Diss_Korpus/diss_corpus_texts_df.csv'

# Read the CSV file into a Pandas DataFrame
diss_corpus_texts = pd.read_csv(csv_file_path)

# Display the DataFrame
print(diss_corpus_texts)


                                              file_name  \
0     von_Wolzogen_Ernst_Vom_Peperl_und von andern_R...   
1     Anzengruber_Ludwig_Kalendergeschichten_Treff-A...   
2                            Fontane_Theodor_Cecile.txt   
3         Hollaender_Felix_Die_Briefe_des_Fraeulein.txt   
4                        Heyse_Paul_Gegen_den_Strom.txt   
...                                                 ...   
1437  Riehl_Wilhelm_Heinrich_Musiker-Geschichten_Dem...   
1438       Riehl_Wilhelm_Heinrich_Der_Maerzminister.txt   
1439         Willkomm_Ernst_Der_verwandelte_Schmuck.txt   
1440                   Sudermann_Hermann_Frau_Sorge.txt   
1441  Loens_Hermann_Tiergeschichten_Die_Einwanderer.txt   

                                          document_text  \
0     <title>Der Raritätenliabhaber</title>\n\n » Ja...   
1     <title>Treff-Aß</title>\n\n\nGibt es ein Buch ...   
2     <title>Cécile</title>\n\n\n<chapter>Erstes Kap...   
3     <title>Die Briefe des Fräulein Brandt</title>\...

End of the Notebook.