# Sound Analysis
## Measurements

In the following cells, I will calculate: <br>
* Sound Event Density 
* Loudness Level Average

Merging of the dataframes in the end! 
We open the metadata table of theme-d-Prose as a pandas dataframe and merge the dataframe with the pandas dataframe of our corpus folder

In [15]:
import pandas as pd
import os

# Load CSV into DataFrame
csv_path = "/Users/sguhr/Downloads/20240511_Metadata_theme-d-Prose_xml.csv"  # Update with the path to your CSV file
csv_df = pd.read_csv(csv_path)

# Remove file extensions from filenames in the CSV DataFrame
csv_df['filename'] = csv_df['filename'].str.replace('.xml', '')


In [16]:
csv_df

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,word_count
0,Achleitner_Arthur_Das_Schloss_im_Moor,1,Arthur Achleitner,Deutschland,male,116005327,Das Schloß im Moor,1903,Deutschland,61568,53164
1,Achleitner_Arthur_Der_Finanzer,2,Arthur Achleitner,Deutschland,male,116005327,Der Finanzer,1903,Österreich,28240,24600
2,Adolph_Karl_Schackerl,3,Karl Adolph,Österreich,male,123217687,Schackerl,1912,Österreich,48194,41024
3,Adolph_Karl_Toechter,4,Karl Adolph,Österreich,male,123217687,Töchter,1914,Österreich,108949,91169
4,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,3721
...,...,...,...,...,...,...,...,...,...,...,...
1222,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,3487
1223,zu_Reventlow_Franziska_Ellen_Olestjerne,1224,Franziska zu Reventlow,Deutschland,female,118600044,Ellen Olestjerne,1903,Deutschland,73466,64116
1224,zu_Reventlow_Franziska_Herrn_Dames_Aufzeichnungen,1225,Franziska zu Reventlow,Deutschland,female,118600044,Herrn Dames Aufzeichnungen,1913,Deutschland,42090,36984
1225,zu_Reventlow_Franziska_Spiritismus,1226,Franziska zu Reventlow,Deutschland,female,118600044,Spiritismus,1917,Deutschland,3998,3528


## Sound Event Density and Measurement of Loudness Level Average
Given is a sound event annotated XML file provided as a pandas data frame.
In the following, I count the number of words in the XML files excluding the XML elements. 
I further extract the XML content which are the sound event spans, separately for ambient and character sound events, and save them in two separate columns together with their assigned loudness level as dictionaries, having the spans as keys and the loudness level as values. 
Then, I tokenize the dictionary keys, save the tokenized copy to a further data frame column, and calculate the average size of a sound event in words. 

Furthermore, the loudness values of the separate dictionaries are summed up for each text and divided by the total number of loudness labeled sound events, for calculating the average loudness level of a texts character sound events, ambient sound events, and all sound events separately.


In [17]:
folder_path = '/Users/sguhr/Desktop/Diss_notebooks/20240511_theme-d-prose_LL_predicted'

# The following code provides theme-d-Prose as a pandas dataframe with the first column containing the filename, and the second and third column containing sound_span-loudness_level dictionaries separately for ambient and character sound.
# Integrated is also a check, if there are no XML texts with XML errors. This will be omitted afterwards.

In [18]:
# takes 2 minutes

import os
import xml.etree.ElementTree as ET
import pandas as pd
from nltk.tokenize import word_tokenize
import re
import string
import numpy as np

def extract_sound_spans(xml_content):
    sound_spans = {'character_sound': {}, 'ambient_sound': {}}
    root = ET.fromstring(xml_content)

    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}  # Define the namespace

    for elem in root.iter():
        if elem.tag.endswith('character_sound') or elem.tag.endswith('ambient_sound'):
            sound_text = elem.text.strip() if elem.text else ""
            loudness_str = elem.attrib.get('loudness', np.nan)  # Assign NaN if 'loudness' attribute is not present
            loudness = float(loudness_str) if loudness_str != 'S' else np.nan  # Assign NaN if 'loudness' attribute is 'S'
            tokenized_text = word_tokenize(sound_text, language='german')  # Tokenize the text
            filtered_tokens = [token for token in tokenized_text if token not in string.punctuation]  # Filter out punctuation
            sound_spans[elem.tag.split('}')[1].rstrip('_')][tuple(filtered_tokens)] = loudness  # Tokenize the keys
    return sound_spans

def calculate_word_token_length(text):
    # Remove XML tags
    text_without_tags = re.sub(r'<[^>]+>', '', text)
    # Tokenize the text
    tokens = word_tokenize(text_without_tags, language='german')
    # Remove punctuation
    filtered_tokens = [token for token in tokens if token not in string.punctuation]
    # Calculate the word token length
    word_token_length = len(filtered_tokens)
    return word_token_length

def calculate_avg_loudness(sound_spans):
    loudness_values = [v for v in sound_spans.values() if not np.isnan(v)]
    return round(sum(loudness_values) / len(loudness_values), 2) if loudness_values else 0

def process_xml_file(filepath):
    character_sound_spans = {}
    ambient_sound_spans = {}
    word_token_length = 0
    character_loudness_count_all = 0
    ambient_loudness_count_all = 0
    character_se_count_without_nan = 0
    ambient_se_count_without_nan = 0
    with open(filepath, 'r', encoding='utf-8') as file:
        xml_content = file.read()
        sound_spans = extract_sound_spans(xml_content)
        character_sound_spans = sound_spans['character_sound']
        ambient_sound_spans = sound_spans['ambient_sound']
        word_token_length = calculate_word_token_length(xml_content)
        
        # Count the number of character_sound and ambient_sound elements with loudness attribute
        for elem in ET.fromstring(xml_content).iter():
            if elem.tag.endswith('character_sound'):
                character_loudness_count_all += 1
                if 'loudness' in elem.attrib and elem.attrib['loudness'] == 'S':
                    character_loudness_count_all -= 1
                if 'loudness' in elem.attrib and not np.isnan(float(elem.attrib['loudness'])):
                    character_se_count_without_nan += 1
            elif elem.tag.endswith('ambient_sound'):
                ambient_loudness_count_all += 1
                if 'loudness' in elem.attrib and elem.attrib['loudness'] == 'S':
                    ambient_loudness_count_all -= 1
                if 'loudness' in elem.attrib and not np.isnan(float(elem.attrib['loudness'])):
                    ambient_se_count_without_nan += 1
    
    # Calculate average character sound span including NaN
    character_avg_token_count_with_nan = round(sum(len(key) for key in character_sound_spans.keys()) / len(character_sound_spans) if character_sound_spans else 0, 2)

    # Calculate average ambient sound span including NaN
    ambient_avg_token_count_with_nan = round(sum(len(key) for key in ambient_sound_spans.keys()) / len(ambient_sound_spans) if ambient_sound_spans else 0, 2)

    # Calculate total average token count including NaN for both ambient and character sound
    total_avg_token_count_with_nan = round(((sum(len(key) for key in character_sound_spans.keys()) if character_sound_spans else 0) + 
                         (sum(len(key) for key in ambient_sound_spans.keys()) if ambient_sound_spans else 0)) / \
                        ((len(character_sound_spans) if character_sound_spans else 0) + (len(ambient_sound_spans) if ambient_sound_spans else 0)), 2)

    # Calculate average token count excluding NaN for character sound
    character_avg_token_count_without_nan = round(sum(len(key) for key in character_sound_spans.keys() if not np.isnan(character_sound_spans[key])) / len(character_sound_spans) if character_sound_spans else 0, 2)

    # Calculate average token count excluding NaN for ambient sound
    ambient_avg_token_count_without_nan = round(sum(len(key) for key in ambient_sound_spans.keys() if not np.isnan(ambient_sound_spans[key])) / len(ambient_sound_spans) if ambient_sound_spans else 0, 2)

    # Calculate total average token count excluding NaN for both ambient and character sound
    total_avg_token_count_without_nan = round(((sum(len(key) for key in character_sound_spans.keys() if not np.isnan(character_sound_spans[key])) if character_sound_spans else 0) + 
                     (sum(len(key) for key in ambient_sound_spans.keys() if not np.isnan(ambient_sound_spans[key])) if ambient_sound_spans else 0)) / \
                    ((len(character_sound_spans) if character_sound_spans else 0) + (len(ambient_sound_spans) if ambient_sound_spans else 0)), 2)

    # Calculate t_se_aver
    t_se_aver = round(word_token_length / total_avg_token_count_with_nan, 2) if total_avg_token_count_with_nan != 0 else 0

    # Calculate t_se_aver without nan
    t_se_aver_without_nan = round(word_token_length / total_avg_token_count_without_nan, 2) if total_avg_token_count_without_nan != 0 else 0

    return (character_sound_spans, ambient_sound_spans, word_token_length, character_loudness_count_all, ambient_loudness_count_all, t_se_aver, character_avg_token_count_with_nan, ambient_avg_token_count_with_nan, total_avg_token_count_with_nan, character_avg_token_count_without_nan, ambient_avg_token_count_without_nan, total_avg_token_count_without_nan, t_se_aver_without_nan, character_se_count_without_nan, ambient_se_count_without_nan, (character_se_count_without_nan + ambient_se_count_without_nan))


def process_folder(folder_path):
    data = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.xml'):
            filepath = os.path.join(folder_path, filename)
            (character_sound_spans, ambient_sound_spans, word_token_length, character_loudness_count_all, ambient_loudness_count_all, t_se_aver, character_avg_token_count_with_nan, ambient_avg_token_count_with_nan, total_avg_token_count_with_nan, character_avg_token_count_without_nan, ambient_avg_token_count_without_nan, total_avg_token_count_without_nan, t_se_aver_without_nan, character_se_count_without_nan, ambient_se_count_without_nan, total_se_count_without_nan) = process_xml_file(filepath)
            
            # Calculate the average loudness for character sound and ambient sound dictionaries
            character_avg_loudness = calculate_avg_loudness(character_sound_spans)
            ambient_avg_loudness = calculate_avg_loudness(ambient_sound_spans)
            
            # Calculate the text loudness average
            text_loudness_average = round((character_avg_loudness + ambient_avg_loudness) / 2, 2)
            
            data.append({'filename': filename,
                         'character_sound-loudness_dictionary': character_sound_spans,
                         'ambient_sound-loudness_dictionary': ambient_sound_spans,
                         'character_avg_token_count_without_nan': character_avg_token_count_without_nan,
                         'ambient_avg_token_count_without_nan': ambient_avg_token_count_without_nan,
                         'total_avg_token_count_without_nan': total_avg_token_count_without_nan,
                         't_se_aver_without_nan': t_se_aver_without_nan,
                         'character_se_count_without_nan': character_se_count_without_nan,
                         'ambient_se_count_without_nan': ambient_se_count_without_nan,
                         'total_se_count_without_nan': total_se_count_without_nan,
                         'character_avg_loudness': character_avg_loudness,
                         'ambient_avg_loudness': ambient_avg_loudness,
                         'text_loudness_average': text_loudness_average})

    return pd.DataFrame(data)

# Call the modified function
df_new = process_folder(folder_path)
print(df_new)


                                               filename  \
0                        Hille_Peter_Die_Hassenburg.xml   
1                       Bierbaum_Otto_Julius_Stilpe.xml   
2                      Wichert_Ernst_Schuster_Lange.xml   
3                Ganghofer_Ludwig_Der_laufende_Berg.xml   
4                        Raabe_Wilhelm_Der_gute_Tag.xml   
...                                                 ...   
1222                     Ernst_Otto_Semper_der_Mann.xml   
1223                  Meyer-Foerster_Wilhelm_Lena_S.xml   
1224                      Altenberg_Peter_Prodromos.xml   
1225  von_Liliencron_Detlev_Roggen_und_Weizen_Der_Di...   
1226                    Oelschlaeger_Hermann_Klytia.xml   

                    character_sound-loudness_dictionary  \
0     {('Ich', 'verneinte'): 3.0, ('so', 'daß', 'ich...   
1     {('hat',): nan, ('sagte', 'Tante', 'Pauline'):...   
2     {('daß', 'ich', 'einmal', 'einem', 'befreundet...   
3     {('und', 'unter', 'Lachen', 'ein', 'seltsames'...

In [98]:
import numpy as np
# overwrites the column 'text_loudness_average' by excluding the NaN in 'ambient_avg_loudness' which are indicated as 0.0

# Define a custom function to calculate the average loudness level
def calculate_loudness(row):
    if row['ambient_avg_loudness'] != 0.00:
        return (row['character_avg_loudness'] + row['ambient_avg_loudness']) / 2
    else:
        return row['character_avg_loudness']

# Add a new column 'text_loudness_average2' by applying the custom function to each row
df_new['text_loudness_average'] = df_new.apply(calculate_loudness, axis=1)


In [20]:
df_new

Unnamed: 0,filename,character_sound-loudness_dictionary,ambient_sound-loudness_dictionary,character_avg_token_count_without_nan,ambient_avg_token_count_without_nan,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average
0,Hille_Peter_Die_Hassenburg.xml,"{('Ich', 'verneinte'): 3.0, ('so', 'daß', 'ich...","{('da', 'man', 'den', 'Bürgermeister', 'Heitem...",4.33,2.88,3.97,7994.96,196,44,240,2.96,2.75,2.85
1,Bierbaum_Otto_Julius_Stilpe.xml,"{('hat',): nan, ('sagte', 'Tante', 'Pauline'):...","{('Um', 'fünf', 'Uhr', 'frih', 'klingelt', 'ei...",3.94,3.62,3.88,15580.15,440,79,519,3.06,3.58,3.32
2,Wichert_Ernst_Schuster_Lange.xml,"{('daß', 'ich', 'einmal', 'einem', 'befreundet...","{('Die', 'Einen', 'klagen'): 3.0, ('klang', 'e...",4.13,4.33,4.14,5468.36,328,12,340,3.02,2.71,2.87
3,Ganghofer_Ludwig_Der_laufende_Berg.xml,"{('und', 'unter', 'Lachen', 'ein', 'seltsames'...","{('Silberne', 'Fäden', 'schimmernd', 'in', 'de...",4.01,6.01,4.56,17794.30,969,286,1255,2.84,3.11,2.97
4,Raabe_Wilhelm_Der_gute_Tag.xml,"{('verkündigte', 'er', 'von', 'Stockwerk', 'zu...","{('fragten',): 3.0, ('die', 'Hausbewohner'): n...",3.81,3.57,3.78,2609.52,101,10,111,3.05,3.00,3.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1222,Ernst_Otto_Semper_der_Mann.xml,"{('mit', 'den', 'Worten', 'und', 'Überzeugunge...","{('Und', 'als', 'es', 'von', 'ihren', 'Verzwei...",4.50,5.12,4.63,21701.08,998,217,1215,3.18,3.29,3.24
1223,Meyer-Foerster_Wilhelm_Lena_S.xml,"{('sagte', 'sie', 'lachend'): 3.0, ('da', 'er'...","{('Es', 'wurde', 'schon', 'gestern', 'in', 'de...",4.63,5.37,4.75,10811.58,555,95,650,2.79,2.96,2.88
1224,Altenberg_Peter_Prodromos.xml,"{('Und', 'wenn', 'er', 'nur', 'verkündet'): na...","{('Das', 'Orchester', 'singt'): 4.0, ('jauchzt...",4.26,4.10,4.24,7834.67,235,28,263,2.96,3.21,3.08
1225,von_Liliencron_Detlev_Roggen_und_Weizen_Der_Di...,"{('dann', 'plötzlich', 'hielt', 'er', 'wieder'...","{('Spielte', 'die', 'Musik'): 4.0, ('Irgendwoh...",4.54,3.64,4.37,674.83,39,7,46,2.76,3.79,3.27


In [None]:
# Save DataFrame to CSV
df_new.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240512_df_theme-d-Prose.csv', index=False)

 Load the dataframe from the saved csv.

In [None]:
# Load the save csv
import pandas as pd

# Load the CSV file into a pandas DataFrame
df_new = pd.read_csv("/Users/sguhr/Desktop/Diss_notebooks/20240512_df_theme-d-Prose.csv")

# Display the DataFrame
#print(df_theme-d-Prose)

In [None]:
df_new

In [21]:
#Merge the calculation results and dictionaries of new_df with the metadata df 

import pandas as pd
import os

# Load CSV into DataFrame
csv_path = "/Users/sguhr/Downloads/20240511_Metadata_theme-d-Prose_xml.csv"  # Update with the path to your CSV file
csv_df = pd.read_csv(csv_path)

# Remove file extensions from filenames in the CSV DataFrame
csv_df['filename'] = csv_df['filename'].str.replace('.xml', '')
df_new['filename'] = df_new['filename'].str.replace('.xml', '')


# Merge CSV DataFrame with XML DataFrame based on filename
merged_df = pd.merge(csv_df, df_new, on='filename', how='inner')

# Display merged DataFrame
print(merged_df)

                                               filename  ID_theme-d-Prose  \
0                 Achleitner_Arthur_Das_Schloss_im_Moor                 1   
1                        Achleitner_Arthur_Der_Finanzer                 2   
2                                 Adolph_Karl_Schackerl                 3   
3                                  Adolph_Karl_Toechter                 4   
4       Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser                 7   
...                                                 ...               ...   
1222  zu_Reventlow_Franziska_Das_graefliche_Milchges...              1223   
1223            zu_Reventlow_Franziska_Ellen_Olestjerne              1224   
1224  zu_Reventlow_Franziska_Herrn_Dames_Aufzeichnungen              1225   
1225                 zu_Reventlow_Franziska_Spiritismus              1226   
1226                   Zweig_Stefan_Die_Liebe_der_Erika              1227   

            author_used_name author_nationality author_gender GND_number  \

In [None]:
# Save DataFrame to CSV
merged_df.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240512_df_theme-d-Prose_plus_calculations.csv', index=False)

In [None]:
# Load the save csv
import pandas as pd

# Load the CSV file into a pandas DataFrame
df = pd.read_csv("/Users/sguhr/Desktop/Diss_notebooks/20240512_df_theme-d-Prose_plus_calculations.csv")

# Display the DataFrame
#print(df_theme-d-Prose)

In [100]:
# Delete the column 'text_loudness_average2'
#merged_df.drop(columns=['text_loudness_average2'], inplace=True)


In [22]:
# Calculate sound event density (SED)
merged_df['SED_without_nan'] = (merged_df['total_se_count_without_nan'] / merged_df['t_se_aver_without_nan']) * 100


In [213]:
merged_df

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range,publication_year_range,publication_decade
0,Achleitner_Arthur_Das_Schloss_im_Moor,1,Arthur Achleitner,Deutschland,male,116005327,Das Schloß im Moor,1903,Deutschland,61568,...,711,49,760,3.07,2.86,2.965,5.789413,40000-69999_words,1901-1920,1900s
1,Achleitner_Arthur_Der_Finanzer,2,Arthur Achleitner,Deutschland,male,116005327,Der Finanzer,1903,Österreich,28240,...,294,82,376,3.08,3.18,3.130,5.807658,10000-39999_words,1901-1920,1900s
2,Adolph_Karl_Schackerl,3,Karl Adolph,Österreich,male,123217687,Schackerl,1912,Österreich,48194,...,307,81,388,3.08,3.28,3.180,4.471942,40000-69999_words,1901-1920,1910s
3,Adolph_Karl_Toechter,4,Karl Adolph,Österreich,male,123217687,Töchter,1914,Österreich,108949,...,653,121,774,2.97,3.07,3.020,3.786169,70000-100000_words,1901-1920,1910s
4,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,63,8,71,2.94,3.25,3.095,8.910643,2000-9999_words,1871-1900,1890s
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1222,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,41,8,49,3.01,3.31,3.160,5.254016,2000-9999_words,1871-1900,1890s
1223,zu_Reventlow_Franziska_Ellen_Olestjerne,1224,Franziska zu Reventlow,Deutschland,female,118600044,Ellen Olestjerne,1903,Deutschland,73466,...,587,142,729,2.92,3.11,3.015,5.024606,40000-69999_words,1901-1920,1900s
1224,zu_Reventlow_Franziska_Herrn_Dames_Aufzeichnungen,1225,Franziska zu Reventlow,Deutschland,female,118600044,Herrn Dames Aufzeichnungen,1913,Deutschland,42090,...,620,48,668,2.93,3.28,3.105,7.530783,10000-39999_words,1901-1920,1910s
1225,zu_Reventlow_Franziska_Spiritismus,1226,Franziska zu Reventlow,Deutschland,female,118600044,Spiritismus,1917,Deutschland,3998,...,39,12,51,2.77,3.04,2.905,6.313600,2000-9999_words,1901-1920,1910s


In [137]:
# Calculate the mean of "total_avg_token_count_without_nan"
mean_total_avg_token_count = df["text_loudness_average"].mean()

# Print the mean
print("Mean of 'total_avg_token_count_without_nan':", mean_total_avg_token_count)


Mean of 'total_avg_token_count_without_nan': 2.977367563162184


In [58]:
df = merged_df

## Analysis of the Data 

In [56]:
# Sort the DataFrame by 'SED_without_nan' column in descending order
df_sorted = merged_df.sort_values(by='SED_without_nan', descending=False)

# Select the top 100 rows
#df_SED_110 = df_sorted.head(100).copy()
df_SED_50 = df_sorted.head(50).copy()


In [92]:
# Sort the DataFrame by 'SED_without_nan' column in ascending order
df_sorted = merged_df.sort_values(by='SED_without_nan', ascending=True)

# Select the top 100 rows
df_SED_100_lowest = df_sorted.head(100).copy()


In [93]:
df_SED_100_lowest

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
353,von_Hofmannsthal_Hugo_Die_Briefe_des_Zurueckge...,1020,Hugo von Hofmannsthal,Österreich,male,118552759,Die Briefe des Zurückgekehrten,1901,Deutschland,11312,...,3.71,2627.22,19,7,26,2.92,2.21,2.56,0.989639,2000-9999_words
725,Schnitzler_Arthur_Leutnant_Gustl,742,Arthur Schnitzler,Österreich,male,118609807,Lieutenant Gustl,1900,Österreich,15966,...,4.57,2972.21,36,2,38,3.04,3.00,3.02,1.278510,10000-39999_words
487,Loeffler_Ludwig_Was_ich_im_Lande,506,Ludwig Löffler,Deutschland,male,117148415,Was ich im Lande der Thüringer und Franken fand,1857,Deutschland,9853,...,3.71,2344.74,19,12,31,2.82,3.29,3.05,1.322108,2000-9999_words
843,Storch_Ludwig_Die_Plassenburg,856,Ludwig Storch,Deutschland,male,117289337,Die Plassenburg,1861,Deutschland,3084,...,4.18,646.89,9,0,9,3.22,0.00,1.61,1.391272,2000-9999_words
1062,von_Polenz_Wilhelm_Luginsland_Mutter_Maukschen...,1062,Wilhelm von Polenz,Deutschland,male,118803336,Mutter Maukschens Liebster,1901,Österreich,4436,...,2.97,1289.90,12,8,20,3.00,3.06,3.03,1.550508,2000-9999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
442,Kroeger_Timm_Im_Moor,473,Timm Kröger,Deutschland,male,116548088,Im Moor,1904,Deutschland,3674,...,4.45,715.06,15,11,26,3.13,2.95,3.04,3.636059,2000-9999_words
1065,von_Saar_Ferdinand_Ausser_Dienst,1066,Ferdinand von Saar,Österreich,male,118604449,Außer Dienst,1902,Deutschland,3870,...,4.21,795.01,26,3,29,2.52,3.67,3.09,3.647753,2000-9999_words
1126,von_Wildenbruch_Ernst_Die_letzte_Partie,1127,Ernst von Wildenbruch,Deutschland,male,118771760,Die letzte Partie,1909,Deutschland,6016,...,3.71,1394.88,45,6,51,2.78,3.33,3.05,3.656228,2000-9999_words
1103,von_Schmid_Hermann_Der_Schwaerzer,1104,Hermann von Schmid,Deutschland,male,100801919,Der Schwärzer,1867,Deutschland,3392,...,3.82,792.67,26,3,29,3.08,4.00,3.54,3.658521,2000-9999_words


In [94]:
df_SED_100

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
916,Thoma_Ludwig_Franz_und_Cora,930,Ludwig Thoma,Deutschland,male,118622072,Franz und Cora,1907,Deutschland,3955,...,4.81,699.38,158,15,173,3.10,3.43,3.27,24.736195,2000-9999_words
910,Thoma_Ludwig_Coras_Abreise,924,Ludwig Thoma,Deutschland,male,118622072,Coras Abreise,1907,Deutschland,4968,...,4.88,871.93,168,4,172,3.14,4.25,3.70,19.726354,2000-9999_words
915,Thoma_Ludwig_Die_Indianerin,929,Ludwig Thoma,Deutschland,male,118622072,Die Indianerin,1907,Deutschland,3761,...,4.79,680.17,124,3,127,3.19,4.00,3.59,18.671803,2000-9999_words
929,Thoma_Ludwig_Tante_Frieda,943,Ludwig Thoma,Deutschland,male,118622072,Tante Frieda,1907,Deutschland,6368,...,4.37,1254.46,203,5,208,3.18,3.10,3.14,16.580840,2000-9999_words
913,Thoma_Ludwig_Der_vornehme_Knabe,927,Ludwig Thoma,Deutschland,male,118622072,Der vornehme Knabe,1905,Deutschland,3196,...,3.84,701.30,111,5,116,3.15,4.40,3.78,16.540710,2000-9999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1143,Wassermann_Jakob_Der_Moloch,1146,Jakob Wassermann,Deutschland,male,118629387,Der Moloch,1903,Deutschland,94947,...,4.24,19165.33,1547,200,1747,2.80,3.11,2.96,9.115418,70000-100000_words
701,Schmitthenner_Adolf_Kopf_und_Herz,719,Adolf Schmitthenner,Deutschland,male,11681019X,Kopf und Herz,1896,Deutschland,7260,...,4.67,1338.54,110,12,122,2.85,2.83,2.84,9.114408,2000-9999_words
660,Salomon_Ludwig_Die_Bluechertrompete,688,Ludwig Salomon,Deutschland,male,116767235,Die Blüchertrompete,1888,Deutschland,33918,...,4.93,6049.70,437,114,551,2.90,3.07,2.98,9.107890,10000-39999_words
862,Storm_Theodor_Zur_Wald-_und_Wasserfreude,875,Theodor Storm,Deutschland,male,118618725,Zur Wald- und Wasserfreude,1878,Deutschland,21069,...,4.96,3703.02,281,56,337,2.91,2.96,2.94,9.100680,10000-39999_words


In [236]:
import plotly.graph_objects as go

# Sort the DataFrame by 'word_count' column
df_sorted_by_word_count = df.sort_values(by='word_count')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED_without_nan',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for scaled word_count
line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                  y=df_sorted_by_word_count['word_count'] / df_sorted_by_word_count['word_count'].max() * 25,
                  mode='lines+markers',
                  marker=dict(color='red'),
                  name='word_count (scaled)')

# Create layout
layout = go.Layout(title= '', #'SED related to text length in words',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter, line], layout=layout)

# Show plot
fig.show()


KeyError: 'SED_without_nan'

In [ ]:
import plotly.graph_objects as go

# Sort the DataFrame by 'word_count' column
df_sorted_by_word_count = df.sort_values(by='word_count')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED_without_nan',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create layout
layout = go.Layout(title= '', #'SED related to text length in words',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest')

# Create figure (exclude line plot for scaled word count)
fig = go.Figure(data=[scatter], layout=layout)

# Show plot
fig.show()


In [79]:
import plotly.graph_objects as go

# Filter out rows with author "Ludwig Thoma"
df_filtered = df[df['author_used_name'] != 'Ludwig Thoma']

# Sort the filtered DataFrame by 'word_count' column
df_sorted_by_word_count = df_filtered.sort_values(by='word_count')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED_without_nan',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for scaled word_count
line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                  y=df_sorted_by_word_count['word_count'] / df_sorted_by_word_count['word_count'].max() * 25,
                  mode='lines+markers',
                  marker=dict(color='red'),
                  name='word_count (scaled)')

# Create layout
layout = go.Layout(title='SED related to text length in words (excluding author Ludwig Thoma)',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter, line], layout=layout)

# Show plot
fig.show()


In [80]:
import numpy as np

# Calculate cosine smoothed SED values
window_size = 5  # Define the window size for cosine smoothing
cosine_smoothed_SED = np.convolve(df_sorted_by_word_count['SED_without_nan'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED_without_nan',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for smoothed SED
smoothed_line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                           y=cosine_smoothed_SED,
                           mode='lines',
                           line=dict(color='red', width=2),
                           name='Cosine Smoothed SED')

# Create line plot for scaled word_count
line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                  y=df_sorted_by_word_count['word_count'] / df_sorted_by_word_count['word_count'].max() * 25,
                  mode='lines+markers',
                  marker=dict(color='green'),
                  name='word_count (scaled)')

# Create layout
layout = go.Layout(title='SED related to text length in words (excluding author Ludwig Thoma) with Cosine Smoothed SED',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter, smoothed_line, line], layout=layout)

# Show plot
fig.show()


In [83]:
import plotly.graph_objects as go

# Calculate cosine smoothed SED values
window_size = 5  # Define the window size for cosine smoothing
cosine_smoothed_SED = np.convolve(df_sorted_by_word_count['SED_without_nan'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED_without_nan',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for smoothed SED
smoothed_line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                           y=cosine_smoothed_SED,
                           mode='lines',
                           line=dict(color='red', width=2),
                           name='Cosine Smoothed SED')

# Create line plot for scaled word_count
line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                  y=df_sorted_by_word_count['word_count'] / df_sorted_by_word_count['word_count'].max() * 25,
                  mode='lines+markers',
                  marker=dict(color='green'),
                  name='word_count (scaled)')

# Create layout
layout = go.Layout(title='SED related to text length in words (excluding author Ludwig Thoma) with Cosine Smoothed SED',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter, smoothed_line, line], layout=layout)

# Export as PNG
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed_excluding_Thoma.png")

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed_excluding_Thoma.html")


In [84]:
import plotly.graph_objects as go

# Filter out rows with author "Ludwig Thoma"
df_filtered = df[df['author_used_name'] != 'Ludwig Thoma']

# Sort the filtered DataFrame by 'word_count' column
df_sorted_by_word_count = df_filtered.sort_values(by='word_count')

# Calculate cosine smoothed SED values
window_size = 5  # Define the window size for cosine smoothing
cosine_smoothed_SED = np.convolve(df_sorted_by_word_count['SED_without_nan'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED_without_nan',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for smoothed SED
smoothed_line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                           y=cosine_smoothed_SED,
                           mode='lines',
                           line=dict(color='red', width=2),
                           name='Cosine Smoothed SED')

# Create line plot for scaled word_count
line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                  y=df_sorted_by_word_count['word_count'] / df_sorted_by_word_count['word_count'].max() * 25,
                  mode='lines+markers',
                  marker=dict(color='green'),
                  name='word_count (scaled)')

# Create layout
layout = go.Layout(title='SED related to text length in words (including author Ludwig Thoma) with Cosine Smoothed SED',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter, smoothed_line, line], layout=layout)

# Export as PNG
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed.png")

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed.html")


In [91]:
import plotly.graph_objects as go

# Filter out rows with author "Ludwig Thoma"
df_filtered = df[df['author_used_name'] != 'Ludwig Thoma']

# Sort the filtered DataFrame by 'word_count' column
df_sorted_by_word_count = df_filtered.sort_values(by='word_count')

# Calculate cosine smoothed SED values
window_size = 5  # Define the window size for cosine smoothing
cosine_smoothed_SED = np.convolve(df_sorted_by_word_count['SED_without_nan'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for smoothed SED
smoothed_line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                           y=cosine_smoothed_SED,
                           mode='lines',
                           line=dict(color='red', width=2),
                           name='Cosine Smoothed SED')

# Create line plot for scaled word_count
line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                  y=df_sorted_by_word_count['word_count'] / df_sorted_by_word_count['word_count'].max() * 25,
                  mode='lines+markers',
                  marker=dict(color='green'),
                  name='word_count (scaled)')

# Create layout
layout = go.Layout(#title='SED related to text length in words (including author Ludwig Thoma) with Cosine Smoothed SED',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter, smoothed_line, line], layout=layout)

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed_high_res.png", scale=20)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed_high_res.html")


In [105]:
import plotly.graph_objects as go

# Filter out rows with author "Ludwig Thoma"
df_filtered = df[df['author_used_name'] != 'Ludwig Thoma']

# Sort the filtered DataFrame by 'word_count' column
df_sorted_by_word_count = df_filtered.sort_values(by='word_count')

# Calculate cosine smoothed SED values
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_SED = np.convolve(df_sorted_by_word_count['SED_without_nan'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for smoothed SED
smoothed_line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                           y=cosine_smoothed_SED,
                           mode='lines',
                           line=dict(color='red', width=2),
                           name='Cosine Smoothed SED')

# Create layout
layout = go.Layout(#title='SED related to text length in words (including author Ludwig Thoma) with Cosine Smoothed SED',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest')

# Create figure (exclude line plot for scaled word count)
fig = go.Figure(data=[scatter, smoothed_line], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed_50_high_res.png", scale=20)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed_50_high_res.html")


In [215]:
import plotly.graph_objects as go

# Sort the DataFrame by 'word_count' column
df_sorted_by_word_count = merged_df.sort_values(by='word_count')

# Calculate cosine smoothed SED values
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_SED = np.convolve(df_sorted_by_word_count['SED_without_nan'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                     y=df_sorted_by_word_count['SED_without_nan'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='SED',
                     text=df_sorted_by_word_count['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for smoothed SED
smoothed_line = go.Scatter(x=df_sorted_by_word_count['word_count'], 
                           y=cosine_smoothed_SED,
                           mode='lines',
                           line=dict(color='red', width=2),
                           name='Cosine Smoothed SED')

# Create layout
layout = go.Layout(#title='SED related to text length in words (including author Ludwig Thoma) with Cosine Smoothed SED',
                   xaxis=dict(title='Text Length in Words'),
                   yaxis=dict(title='SED', range=[0, 25]),
                   hovermode='closest',
                   legend=dict(x=0.5, y=1.2, orientation='h'))

# Create figure (exclude line plot for scaled word count)
fig = go.Figure(data=[scatter, smoothed_line], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed_50_high_res3.png", scale=10)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/SED_theme-d-prose_cos_smoothed_50_high_res3.html")


In [109]:
# Define a function to calculate count and average for each DataFrame
def calculate_count_and_average(df, threshold):
    # Filter texts with 'SED_without_nan' >= threshold
    filtered_texts = df[df['SED_without_nan'] <= threshold]
    
    # Count the number of texts
    count = len(filtered_texts)
    
    # Calculate the average 'SED_without_nan' for the filtered texts
    average = filtered_texts['SED_without_nan'].mean()
    
    return count, average

# New threshold value
threshold = 3.67

# Calculate count and average for each DataFrame
count_2000_9999, average_2000_9999 = calculate_count_and_average(df_2000_9999, threshold)
count_10000_39999, average_10000_39999 = calculate_count_and_average(df_10000_39999, threshold)
count_40000_69999, average_40000_69999 = calculate_count_and_average(df_40000_69999, threshold)
count_70000_100000, average_70000_100000 = calculate_count_and_average(df_70000_100000, threshold)

# Print the results
print("For 2000-9999_words DataFrame:")
print("Number of texts with 'SED_without_nan' >= 9.82:", count_2000_9999)
print("Average 'SED_without_nan' for the filtered selection:", float(average_2000_9999))
print()

print("For 10000-39999_words DataFrame:")
print("Number of texts with 'SED_without_nan' >= 9.82:", count_10000_39999)
print("Average 'SED_without_nan' for the filtered selection:", float(average_10000_39999))
print()

print("For 40000-69999_words DataFrame:")
print("Number of texts with 'SED_without_nan' >= 9.82:", count_40000_69999)
print("Average 'SED_without_nan' for the filtered selection:", float(average_40000_69999))
print()

print("For 70000-100000_words DataFrame:")
print("Number of texts with 'SED_without_nan' >= 9.82:", count_70000_100000)
print("Average 'SED_without_nan' for the filtered selection:", float(average_70000_100000))


For 2000-9999_words DataFrame:
Number of texts with 'SED_without_nan' >= 9.82: 63
Average 'SED_without_nan' for the filtered selection: 2.876227878471321

For 10000-39999_words DataFrame:
Number of texts with 'SED_without_nan' >= 9.82: 24
Average 'SED_without_nan' for the filtered selection: 2.87277659667899

For 40000-69999_words DataFrame:
Number of texts with 'SED_without_nan' >= 9.82: 9
Average 'SED_without_nan' for the filtered selection: 3.111807007367642

For 70000-100000_words DataFrame:
Number of texts with 'SED_without_nan' >= 9.82: 3
Average 'SED_without_nan' for the filtered selection: 2.8959497302734154


In [111]:
# Define a function to calculate count and average for the top 10% of 'SED_without_nan' values for each DataFrame
def calculate_count_and_average_top_10_percent(df):
    # Sort the DataFrame by 'SED_without_nan' column in descending order
    df_sorted = df.sort_values(by='SED_without_nan', ascending=False)
    
    # Calculate the number of texts that correspond to the top 10%
    top_10_percent_count = int(len(df_sorted) * 0.1)
    
    # Keep only the top 10% of rows
    top_10_percent_df = df_sorted.head(top_10_percent_count)
    
    # Calculate the average 'SED_without_nan' for the top 10% texts
    average_top_10_percent = top_10_percent_df['SED_without_nan'].mean()
    
    return top_10_percent_count, average_top_10_percent

# Calculate count and average for the top 10% of 'SED_without_nan' values for each DataFrame
count_2000_9999_top_10_percent, average_2000_9999_top_10_percent = calculate_count_and_average_top_10_percent(df_2000_9999)
count_10000_39999_top_10_percent, average_10000_39999_top_10_percent = calculate_count_and_average_top_10_percent(df_10000_39999)
count_40000_69999_top_10_percent, average_40000_69999_top_10_percent = calculate_count_and_average_top_10_percent(df_40000_69999)
count_70000_100000_top_10_percent, average_70000_100000_top_10_percent = calculate_count_and_average_top_10_percent(df_70000_100000)

# Print the results
print("For 2000-9999_words DataFrame:")
print("Number of texts in the top 10%:", count_2000_9999_top_10_percent)
print("Average 'SED_without_nan' for the top 10%:", round(average_2000_9999_top_10_percent, 2))
print()

print("For 10000-39999_words DataFrame:")
print("Number of texts in the top 10%:", count_10000_39999_top_10_percent)
print("Average 'SED_without_nan' for the top 10%:", round(average_10000_39999_top_10_percent, 2))
print()

print("For 40000-69999_words DataFrame:")
print("Number of texts in the top 10%:", count_40000_69999_top_10_percent)
print("Average 'SED_without_nan' for the top 10%:", round(average_40000_69999_top_10_percent, 2))
print()

print("For 70000-100000_words DataFrame:")
print("Number of texts in the top 10%:", count_70000_100000_top_10_percent)
print("Average 'SED_without_nan' for the top 10%:", round(average_70000_100000_top_10_percent, 2))


For 2000-9999_words DataFrame:
Number of texts in the top 10%: 42
Average 'SED_without_nan' for the top 10%: 11.33

For 10000-39999_words DataFrame:
Number of texts in the top 10%: 46
Average 'SED_without_nan' for the top 10%: 9.63

For 40000-69999_words DataFrame:
Number of texts in the top 10%: 19
Average 'SED_without_nan' for the top 10%: 10.11

For 70000-100000_words DataFrame:
Number of texts in the top 10%: 14
Average 'SED_without_nan' for the top 10%: 9.8


In [112]:
# Define a function to calculate count and average for the lowest 10% of 'SED_without_nan' values for each DataFrame
def calculate_count_and_average_lowest_10_percent(df):
    # Sort the DataFrame by 'SED_without_nan' column in ascending order
    df_sorted = df.sort_values(by='SED_without_nan', ascending=True)
    
    # Calculate the number of texts that correspond to the lowest 10%
    lowest_10_percent_count = int(len(df_sorted) * 0.1)
    
    # Keep only the lowest 10% of rows
    lowest_10_percent_df = df_sorted.head(lowest_10_percent_count)
    
    # Calculate the average 'SED_without_nan' for the lowest 10% texts
    average_lowest_10_percent = lowest_10_percent_df['SED_without_nan'].mean()
    
    return lowest_10_percent_count, average_lowest_10_percent

# Calculate count and average for the lowest 10% of 'SED_without_nan' values for each DataFrame
count_2000_9999_lowest_10_percent, average_2000_9999_lowest_10_percent = calculate_count_and_average_lowest_10_percent(df_2000_9999)
count_10000_39999_lowest_10_percent, average_10000_39999_lowest_10_percent = calculate_count_and_average_lowest_10_percent(df_10000_39999)
count_40000_69999_lowest_10_percent, average_40000_69999_lowest_10_percent = calculate_count_and_average_lowest_10_percent(df_40000_69999)
count_70000_100000_lowest_10_percent, average_70000_100000_lowest_10_percent = calculate_count_and_average_lowest_10_percent(df_70000_100000)

# Print the results
print("For 2000-9999_words DataFrame:")
print("Number of texts in the lowest 10%:", count_2000_9999_lowest_10_percent)
print("Average 'SED_without_nan' for the lowest 10%:", round(average_2000_9999_lowest_10_percent, 2))
print()

print("For 10000-39999_words DataFrame:")
print("Number of texts in the lowest 10%:", count_10000_39999_lowest_10_percent)
print("Average 'SED_without_nan' for the lowest 10%:", round(average_10000_39999_lowest_10_percent, 2))
print()

print("For 40000-69999_words DataFrame:")
print("Number of texts in the lowest 10%:", count_40000_69999_lowest_10_percent)
print("Average 'SED_without_nan' for the lowest 10%:", round(average_40000_69999_lowest_10_percent, 2))
print()

print("For 70000-100000_words DataFrame:")
print("Number of texts in the lowest 10%:", count_70000_100000_lowest_10_percent)
print("Average 'SED_without_nan' for the lowest 10%:", round(average_70000_100000_lowest_10_percent, 2))


For 2000-9999_words DataFrame:
Number of texts in the lowest 10%: 42
Average 'SED_without_nan' for the lowest 10%: 2.56

For 10000-39999_words DataFrame:
Number of texts in the lowest 10%: 46
Average 'SED_without_nan' for the lowest 10%: 3.41

For 40000-69999_words DataFrame:
Number of texts in the lowest 10%: 19
Average 'SED_without_nan' for the lowest 10%: 3.48

For 70000-100000_words DataFrame:
Number of texts in the lowest 10%: 14
Average 'SED_without_nan' for the lowest 10%: 3.71


In [63]:
# Create a new DataFrame containing texts with lengths between 2,000 and 20,000 words
df_2000_20000 = df[(df['word_count'] >= 2000) & (df['word_count'] <= 20000)].copy()

# Calculate count and average for the top 10% of 'SED_without_nan' values for the new DataFrame
count_2000_20000_top_10_percent, average_2000_20000_top_10_percent = calculate_count_and_average_top_10_percent(df_2000_20000)

# Print the results for the new group
print("For 2000-20000_words DataFrame:")
print("Number of texts in the top 10%:", count_2000_20000_top_10_percent)
print("Average 'SED_without_nan' for the top 10%:", round(average_2000_20000_top_10_percent, 2))
print()


For 2000-20000_words DataFrame:
Number of texts in the top 10%: 68
Average 'SED_without_nan' for the top 10%: 10.69


In [64]:

# Sort the DataFrame by 'SED_without_nan' column in descending order
df_sorted_by_SED = df.sort_values(by='SED_without_nan', ascending=False)

# Select the top 50 texts with the highest 'SED_without_nan'
df_top_50_SED = df_sorted_by_SED.head(50).copy()

# Calculate count and average for the top 10% of 'SED_without_nan' values for the new DataFrame
count_top_50_SED_top_10_percent, average_top_50_SED_top_10_percent = calculate_count_and_average_top_10_percent(df_top_50_SED)

# Print the results for the new group
print("For the group of the 50 highest SED_without_nan:")
print("Number of texts in the top 10%:", count_top_50_SED_top_10_percent)
print("Average 'SED_without_nan' for the top 10%:", round(average_top_50_SED_top_10_percent, 2))
print()


For the group of the 50 highest SED_without_nan:
Number of texts in the top 10%: 5
Average 'SED_without_nan' for the top 10%: 19.25


In [135]:
# Sort the DataFrame by 'text_loudness_average' column in descending order
df_sorted_by_text_loudness = df.sort_values(by='text_loudness_average', ascending=False)

# Select the top 50 texts with the highest 'text_loudness_average'
df_top_50_text_loudness = df_sorted_by_text_loudness.head(50).copy()

# Calculate count and average for the top 10% of 'text_loudness_average' values for the new DataFrame
count_top_50_text_loudness_top_10_percent, average_top_50_text_loudness_top_10_percent = calculate_count_and_average_top_10_percent(df_top_50_text_loudness)

# Print the results for the new group
print("For the group of the 50 highest text_loudness_average:")
print("Number of texts in the top 10%:", count_top_50_text_loudness_top_10_percent)
print("Average 'text_loudness_average' for the top 10%:", round(average_top_50_text_loudness_top_10_percent, 2))


For the group of the 50 highest text_loudness_average:
Number of texts in the top 10%: 5
Average 'text_loudness_average' for the top 10%: 16.11


In [29]:
# Define the bins for word count ranges
bins = [2000, 10000, 40000, 70000, 100000]

# Define the labels for the bins
labels = ['2000-9999_words', '10000-39999_words', '40000-69999_words', '70000-100000_words']

# Create a new column 'word_count_range' that categorizes word counts into the specified bins
df['word_count_range'] = pd.cut(df['word_count'], bins=bins, labels=labels, right=False)

# Create separate dataframes for each word count range
df_2000_9999 = df[df['word_count_range'] == '2000-9999_words'].copy()
df_10000_39999 = df[df['word_count_range'] == '10000-39999_words'].copy()
df_40000_69999 = df[df['word_count_range'] == '40000-69999_words'].copy()
df_70000_100000 = df[df['word_count_range'] == '70000-100000_words'].copy()


In [34]:
df_2000_9999

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
4,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,4.69,796.80,63,8,71,2.94,3.25,3.09,8.910643,2000-9999_words
5,Anklam_Louise_Kindergeschichten_Der_Weihnachts...,8,Louise Anklam,Deutschland,female,125905726,Der Weihnachtsabend,1898,Deutschland,4466,...,5.21,748.56,51,1,52,2.81,4.00,3.41,6.946671,2000-9999_words
6,Anklam_Louise_Kindergeschichten_Der_wilde_Arno,9,Louise Anklam,Deutschland,female,125905726,Der wilde Arno,1898,Deutschland,4613,...,3.97,1012.85,59,5,64,3.03,2.50,2.76,6.318803,2000-9999_words
7,Anklam_Louise_Kindergeschichten_Der_Wunderdoktor,10,Louise Anklam,Deutschland,female,125905726,Der Wunderdoktor,1898,Deutschland,5472,...,5.05,943.96,69,4,73,3.08,3.75,3.42,7.733379,2000-9999_words
8,Annecke_Fritz_Der_verehrter_Herr,11,Fritz Anneke,Deutschland,male,118649523,Der verehrter Herr,1861,Deutschland,2739,...,4.58,528.82,29,0,29,3.16,0.00,1.58,5.483908,2000-9999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1190,Wildermuth_Ottilie_Krieg_und_Frieden,1193,Ottilie Wildermuth,Deutschland,female,118632833,Krieg und Frieden,1866,Deutschland,8470,...,4.07,1792.14,112,13,125,2.95,3.65,3.30,6.974902,2000-9999_words
1191,Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe,1194,Ottilie Wildermuth,Deutschland,female,118632833,Onkel Gottliebs Jugendliebe,1860,Deutschland,9082,...,4.10,1920.24,133,12,145,2.85,2.12,2.49,7.551139,2000-9999_words
1206,Wuerkert_Ludwig_Ein_Haeckerlingsschneider_als_...,1207,Ludwig Würkert,Deutschland,male,117374997,Ein Häckerlingsschneider als Apostel,1874,Deutschland,6791,...,4.58,1296.29,51,24,75,2.98,2.68,2.83,5.785742,2000-9999_words
1222,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,3.74,932.62,41,8,49,3.01,3.31,3.16,5.254016,2000-9999_words


In [31]:
df_10000_39999

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
1,Achleitner_Arthur_Der_Finanzer,2,Arthur Achleitner,Deutschland,male,116005327,Der Finanzer,1903,Österreich,28240,...,3.80,6474.21,294,82,376,3.08,3.18,3.13,5.807658,10000-39999_words
10,Anzengruber_Ludwig_Dorfgaenge_ Die_Märchen_des...,13,Ludwig Anzengruber,Österreich,male,11850357X,Die Märchen des Steinklopferhanns,1879,Österreich,23844,...,3.55,5559.44,211,40,251,3.11,3.15,3.13,4.514843,10000-39999_words
11,Anzengruber_Ludwig_Dorfgaenge_Die_Heimkehr,14,Ludwig Anzengruber,Österreich,male,11850357X,Die Heimkehr,1879,Österreich,15003,...,3.51,3533.05,107,6,113,3.19,3.58,3.38,3.198370,10000-39999_words
12,Anzengruber_Ludwig_Dorfgaenge_Unter_schwerer_A...,15,Ludwig Anzengruber,Österreich,male,11850357X,Unter schwerer Anklage,1879,Österreich,12253,...,3.82,2728.01,96,28,124,3.01,2.61,2.81,4.545438,10000-39999_words
18,Auerbach_Berthold_Der_Viereckig_oder,23,Berthold Auerbach,Deutschland,male,11865103X,Der Viereckig oder die amerikanische Kiste,1852,Deutschland,23129,...,4.35,4653.10,248,62,310,2.96,3.32,3.14,6.662225,10000-39999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1218,Zedelius_Marie_In_sengender_Gluth,1219,Marie Zedelius,Deutschland,female,101198566,In sengender Gluth,1868,Deutschland,19893,...,4.51,3825.06,350,13,363,2.95,3.46,3.21,9.490047,10000-39999_words
1219,Zedelius_Marie_Lorbeeren,1220,Marie Zedelius,Deutschland,female,101198566,Lorbeeren,1873,Deutschland,25825,...,4.49,5034.52,405,25,430,2.97,2.68,2.83,8.541033,10000-39999_words
1220,Zoeller_Charlotte_Am_Meer,1221,Charlotte Zoeller,Deutschland,female,1043840923,Am Meer,1880,Deutschland,14638,...,4.34,2936.87,174,36,210,2.81,2.91,2.86,7.150470,10000-39999_words
1224,zu_Reventlow_Franziska_Herrn_Dames_Aufzeichnungen,1225,Franziska zu Reventlow,Deutschland,female,118600044,Herrn Dames Aufzeichnungen,1913,Deutschland,42090,...,4.17,8870.26,620,48,668,2.93,3.28,3.10,7.530783,10000-39999_words


In [32]:
df_40000_69999

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
0,Achleitner_Arthur_Das_Schloss_im_Moor,1,Arthur Achleitner,Deutschland,male,116005327,Das Schloß im Moor,1903,Deutschland,61568,...,4.05,13127.41,711,49,760,3.07,2.86,2.96,5.789413,40000-69999_words
2,Adolph_Karl_Schackerl,3,Karl Adolph,Österreich,male,123217687,Schackerl,1912,Österreich,48194,...,4.73,8676.32,307,81,388,3.08,3.28,3.18,4.471942,40000-69999_words
14,Aston_Louise_Lydia,19,Louise Aston,Deutschland,female,118650769,Lydia,1850,Deutschland,61815,...,4.95,10852.73,882,45,927,2.89,2.76,2.83,8.541630,40000-69999_words
15,Auerbach_Berthold_Barfuessele,20,Berthold Auerbach,Deutschland,male,11865103X,Barfüßele,1856,Deutschland/Schweiz,76666,...,4.46,14630.27,1059,167,1226,3.00,3.18,3.09,8.379886,40000-69999_words
16,Auerbach_Berthold_Brosi_und_Moni,21,Berthold Auerbach,Deutschland,male,11865103X,Brosi und Mosi,1852,Deutschland,46710,...,4.42,9244.12,637,151,788,3.04,3.08,3.06,8.524338,40000-69999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1209,Zapp_Arthur_Falsches_Geld,1210,Arthur Zapp,Deutschland,male,116965959,Falsches Geld,1920,Deutschland,54381,...,4.42,10719.00,492,48,540,2.92,2.96,2.94,5.037783,40000-69999_words
1210,Zapp_Arthur_Junggesellinnen,1211,Arthur Zapp,Deutschland,male,116965959,Junggesellinnen,1899,Deutschland,67835,...,4.20,13915.95,751,37,788,2.88,3.02,2.95,5.662567,40000-69999_words
1212,Zapp_Arthur_Zwischen_Mann_und_Frau,1213,Arthur Zapp,Deutschland,male,116965959,Zwischen Mann und Frau,1914,Deutschland,73441,...,4.77,13381.13,712,69,781,2.92,3.06,2.99,5.836577,40000-69999_words
1215,Zedelius_Marie_Durch_die_Brandung,1216,Marie Zedelius,Deutschland,female,101198566,Durch die Brandung,1877,Deutschland,60920,...,4.54,11566.74,919,45,964,3.00,3.11,3.05,8.334241,40000-69999_words


In [33]:
df_70000_100000

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
3,Adolph_Karl_Toechter,4,Karl Adolph,Österreich,male,123217687,Töchter,1914,Österreich,108949,...,4.46,20442.83,653,121,774,2.97,3.07,3.02,3.786169,70000-100000_words
9,Anzengruber_Ludwig_Der_Sternsteinhof,12,Ludwig Anzengruber,Österreich,male,11850357X,Der Sternsteinhof,1884,Österreich,103915,...,3.93,22587.28,895,165,1060,3.06,3.28,3.17,4.692907,70000-100000_words
27,Bahr_Hermann_Die_Rahl,33,Hermann Bahr,Österreich,male,118505955,Die Rahl,1908,Österreich,92041,...,4.26,18106.81,1695,230,1925,2.99,3.25,3.12,10.631359,70000-100000_words
28,Bahr_Hermann_Himmelfahrt,34,Hermann Bahr,Österreich,male,118505955,Himmelfahrt,1916,Deutschland/Österreich,107045,...,4.14,21915.70,910,89,999,2.86,2.45,2.66,4.558376,70000-100000_words
35,Bahr_Hermann_O_Mensch,41,Hermann Bahr,Österreich,male,118505955,O Mensch!,1910,Österreich,92300,...,4.44,17559.68,1970,146,2116,3.01,3.03,3.02,12.050333,70000-100000_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1159,Westkirch_Luise_Der_Todfeind,1162,Luise Westkirch,Deutschland,female,11732065X,Der Todfeind,1912,Deutschland,84655,...,4.18,17547.85,711,121,832,2.89,3.19,3.04,4.741322,70000-100000_words
1164,Wiesebach_Wilhelm_Er_und_Ich,1167,Wilhelm Wiesebach,Deutschland,male,117368539,Er und Ich,1914,Deutschland,98439,...,4.73,17948.20,590,143,733,2.96,3.20,3.08,4.083975,70000-100000_words
1195,Wildhagen_Else_Trotzkopfs_Brautzeit,1198,Else Wildhagen,Deutschland,female,116824514,Trotzkopfs Brautzeit,1892,Deutschland,86979,...,4.55,16607.47,1499,119,1618,2.89,3.03,2.96,9.742604,70000-100000_words
1204,Wohlbrueck_Olga_Du_sollst_ein_Mann,1206,Olga Wohlbrück,Österreich,female,117285064,Du sollst ein Mann sein!,1911,Deutschland,111136,...,4.35,22013.10,1485,131,1616,2.99,2.61,2.80,7.341083,70000-100000_words


In [50]:
merged_df

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
0,Achleitner_Arthur_Das_Schloss_im_Moor,1,Arthur Achleitner,Deutschland,male,116005327,Das Schloß im Moor,1903,Deutschland,61568,...,4.05,13127.41,711,49,760,3.07,2.86,2.96,5.789413,40000-69999_words
1,Achleitner_Arthur_Der_Finanzer,2,Arthur Achleitner,Deutschland,male,116005327,Der Finanzer,1903,Österreich,28240,...,3.80,6474.21,294,82,376,3.08,3.18,3.13,5.807658,10000-39999_words
2,Adolph_Karl_Schackerl,3,Karl Adolph,Österreich,male,123217687,Schackerl,1912,Österreich,48194,...,4.73,8676.32,307,81,388,3.08,3.28,3.18,4.471942,40000-69999_words
3,Adolph_Karl_Toechter,4,Karl Adolph,Österreich,male,123217687,Töchter,1914,Österreich,108949,...,4.46,20442.83,653,121,774,2.97,3.07,3.02,3.786169,70000-100000_words
4,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,4.69,796.80,63,8,71,2.94,3.25,3.09,8.910643,2000-9999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1222,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,3.74,932.62,41,8,49,3.01,3.31,3.16,5.254016,2000-9999_words
1223,zu_Reventlow_Franziska_Ellen_Olestjerne,1224,Franziska zu Reventlow,Deutschland,female,118600044,Ellen Olestjerne,1903,Deutschland,73466,...,4.42,14508.60,587,142,729,2.92,3.11,3.01,5.024606,40000-69999_words
1224,zu_Reventlow_Franziska_Herrn_Dames_Aufzeichnungen,1225,Franziska zu Reventlow,Deutschland,female,118600044,Herrn Dames Aufzeichnungen,1913,Deutschland,42090,...,4.17,8870.26,620,48,668,2.93,3.28,3.10,7.530783,10000-39999_words
1225,zu_Reventlow_Franziska_Spiritismus,1226,Franziska zu Reventlow,Deutschland,female,118600044,Spiritismus,1917,Deutschland,3998,...,4.37,807.78,39,12,51,2.77,3.04,2.91,6.313600,2000-9999_words


In [61]:
# Define a function to calculate count and average for each DataFrame
def calculate_count_and_average(df, threshold):
    # Filter texts with 'SED_without_nan' >= threshold
    filtered_texts = df[df['SED_without_nan'] >= threshold]
    
    # Count the number of texts
    count = len(filtered_texts)
    
    # Calculate the average 'SED_without_nan' for the filtered texts
    average = filtered_texts['SED_without_nan'].mean()
    
    return count, average

# Threshold value
threshold = 9.08

# Calculate count and average for each DataFrame
count_2000_9999, average_2000_9999 = calculate_count_and_average(df_2000_9999, threshold)
count_10000_39999, average_10000_39999 = calculate_count_and_average(df_10000_39999, threshold)
count_40000_69999, average_40000_69999 = calculate_count_and_average(df_40000_69999, threshold)
count_70000_100000, average_70000_100000 = calculate_count_and_average(df_70000_100000, threshold)

# Print the results
print("For 2000-9999_words DataFrame:")
print("Number of texts with 'SED_without_nan' >= 9.08:", count_2000_9999)
print("Average 'SED_without_nan' for the filtered selection:", float(average_2000_9999))
print()

print("For 10000-39999_words DataFrame:")
print("Number of texts with 'SED_without_nan' >= 9.08:", count_10000_39999)
print("Average 'SED_without_nan' for the filtered selection:", float(average_10000_39999))
print()

print("For 40000-69999_words DataFrame:")
print("Number of texts with 'SED_without_nan' >= 9.08:", count_40000_69999)
print("Average 'SED_without_nan' for the filtered selection:", float(average_40000_69999))
print()

print("For 70000-100000_words DataFrame:")
print("Number of texts with 'SED_without_nan' >= 9.08:", count_70000_100000)
print("Average 'SED_without_nan' for the filtered selection:", float(average_70000_100000))


For 2000-9999_words DataFrame:
Number of texts with 'SED_without_nan' >= 9.08: 35
Average 'SED_without_nan' for the filtered selection: 11.804704671115944

For 10000-39999_words DataFrame:
Number of texts with 'SED_without_nan' >= 9.08: 37
Average 'SED_without_nan' for the filtered selection: 9.793326384312724

For 40000-69999_words DataFrame:
Number of texts with 'SED_without_nan' >= 9.08: 18
Average 'SED_without_nan' for the filtered selection: 10.185450940270208

For 70000-100000_words DataFrame:
Number of texts with 'SED_without_nan' >= 9.08: 11
Average 'SED_without_nan' for the filtered selection: 10.032671871554903


In [35]:
# Calculate and print the sum of 'total_se_count_without_nan' for each DataFrame
sum_2000_9999 = df_2000_9999['total_se_count_without_nan'].sum()
print("Sum of 'total_se_count_without_nan' for 2000-9999_words DataFrame:", sum_2000_9999)

sum_10000_39999 = df_10000_39999['total_se_count_without_nan'].sum()
print("Sum of 'total_se_count_without_nan' for 10000-39999_words DataFrame:", sum_10000_39999)

sum_40000_69999 = df_40000_69999['total_se_count_without_nan'].sum()
print("Sum of 'total_se_count_without_nan' for 40000-69999_words DataFrame:", sum_40000_69999)

sum_70000_100000 = df_70000_100000['total_se_count_without_nan'].sum()
print("Sum of 'total_se_count_without_nan' for 70000-100000_words DataFrame:", sum_70000_100000)


Sum of 'total_se_count_without_nan' for 2000-9999_words DataFrame: 32687
Sum of 'total_se_count_without_nan' for 10000-39999_words DataFrame: 138859
Sum of 'total_se_count_without_nan' for 40000-69999_words DataFrame: 159586
Sum of 'total_se_count_without_nan' for 70000-100000_words DataFrame: 172121


In [46]:
# Filter texts with 'total_se_count_without_nan' higher/equal to 9.08 and calculate average
def filter_and_avg(df, threshold):
    filtered_df = df[df['total_se_count_without_nan'] >= threshold]
    count = len(filtered_df)
    avg = filtered_df['total_se_count_without_nan'].mean()
    return count, avg

# Define the threshold
threshold = 9.08

# Filter and calculate for each DataFrame
count_2000_9999, avg_2000_9999 = filter_and_avg(df_2000_9999, threshold)
print("For 2000-9999_words DataFrame:")
print("Number of texts with 'total_se_count_without_nan' >= 9.08:", count_2000_9999)
print("Average of 'total_se_count_without_nan':", avg_2000_9999)

count_10000_39999, avg_10000_39999 = filter_and_avg(df_10000_39999, threshold)
print("\nFor 10000-39999_words DataFrame:")
print("Number of texts with 'total_se_count_without_nan' >= 9.08:", count_10000_39999)
print("Average of 'total_se_count_without_nan':", avg_10000_39999)

count_40000_69999, avg_40000_69999 = filter_and_avg(df_40000_69999, threshold)
print("\nFor 40000-69999_words DataFrame:")
print("Number of texts with 'total_se_count_without_nan' >= 9.08:", count_40000_69999)
print("Average of 'total_se_count_without_nan':", avg_40000_69999)

count_70000_100000, avg_70000_100000 = filter_and_avg(df_70000_100000, threshold)
print("\nFor 70000-100000_words DataFrame:")
print("Number of texts with 'total_se_count_without_nan' >= 9.08:", count_70000_100000)
print("Average of 'total_se_count_without_nan':", avg_70000_100000)


For 2000-9999_words DataFrame:
Number of texts with 'total_se_count_without_nan' >= 9.08: 423
Average of 'total_se_count_without_nan': 77.23404255319149

For 10000-39999_words DataFrame:
Number of texts with 'total_se_count_without_nan' >= 9.08: 461
Average of 'total_se_count_without_nan': 301.2125813449024

For 40000-69999_words DataFrame:
Number of texts with 'total_se_count_without_nan' >= 9.08: 196
Average of 'total_se_count_without_nan': 814.2142857142857

For 70000-100000_words DataFrame:
Number of texts with 'total_se_count_without_nan' >= 9.08: 140
Average of 'total_se_count_without_nan': 1229.4357142857143


In [116]:
# Define the bins for word count ranges
bins = [1848, 1871, 1901, 1921]

# Define the labels for the bins
labels = ['1848-1870', '1871-1900', '1901-1920']

# Create a new column 'publication_year_range' that categorizes word counts into the specified bins
df['publication_year_range'] = pd.cut(df['text_used_date'], bins=bins, labels=labels, right=False)

# Create separate dataframes for each word count range
df_1848_1870 = df[df['publication_year_range'] == '1848-1870'].copy()
df_1871_1900 = df[df['publication_year_range'] == '1871-1900'].copy()
df_1901_1920 = df[df['publication_year_range'] == '1901-1920'].copy()

In [117]:
df_1848_1870

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range,publication_year_range
8,Annecke_Fritz_Der_verehrter_Herr,11,Fritz Anneke,Deutschland,male,118649523,Der verehrter Herr,1861,Deutschland,2739,...,528.82,29,0,29,3.16,0.00,3.160,5.483908,2000-9999_words,1848-1870
14,Aston_Louise_Lydia,19,Louise Aston,Deutschland,female,118650769,Lydia,1850,Deutschland,61815,...,10852.73,882,45,927,2.89,2.76,2.825,8.541630,40000-69999_words,1848-1870
15,Auerbach_Berthold_Barfuessele,20,Berthold Auerbach,Deutschland,male,11865103X,Barfüßele,1856,Deutschland/Schweiz,76666,...,14630.27,1059,167,1226,3.00,3.18,3.090,8.379886,40000-69999_words,1848-1870
16,Auerbach_Berthold_Brosi_und_Moni,21,Berthold Auerbach,Deutschland,male,11865103X,Brosi und Mosi,1852,Deutschland,46710,...,9244.12,637,151,788,3.04,3.08,3.060,8.524338,40000-69999_words,1848-1870
17,Auerbach_Berthold_Der_Lehnhold,22,Berthold Auerbach,Deutschland,male,11865103X,Der Lehnhold,1853,Schweiz/Deutschland,66046,...,11921.44,892,156,1048,2.94,3.19,3.065,8.790884,40000-69999_words,1848-1870
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200,Winckler_Willibald_Ein_Wort,1203,Willibald Winckler,Deutschland,male,117399973,Ein Wort,1868,Deutschland,28175,...,5480.72,390,11,401,3.03,2.79,2.910,7.316557,10000-39999_words,1848-1870
1205,Rein_Ludwig_Fabrikantenbrod,637,Ludwig Rein,Deutschland,male,117374997,Fabrikantenbrod,1858,Deutschland,12653,...,2599.28,164,22,186,2.83,2.64,2.735,7.155828,10000-39999_words,1848-1870
1214,Zedelius_Marie_Doctor_Reinhard,1215,Marie Zedelius,Deutschland,female,101198566,Doctor Reinhard,1870,Deutschland,20015,...,3893.72,370,9,379,2.98,3.17,3.075,9.733622,10000-39999_words,1848-1870
1217,Zedelius_Marie_Getrennt,1218,Marie Zedelius,Deutschland,female,101198566,Getrennt,1867,Deutschland,20143,...,3509.20,369,21,390,2.96,3.30,3.130,11.113644,10000-39999_words,1848-1870


In [118]:
df_1871_1900

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range,publication_year_range
4,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,796.80,63,8,71,2.94,3.25,3.095,8.910643,2000-9999_words,1871-1900
5,Anklam_Louise_Kindergeschichten_Der_Weihnachts...,8,Louise Anklam,Deutschland,female,125905726,Der Weihnachtsabend,1898,Deutschland,4466,...,748.56,51,1,52,2.81,4.00,3.405,6.946671,2000-9999_words,1871-1900
6,Anklam_Louise_Kindergeschichten_Der_wilde_Arno,9,Louise Anklam,Deutschland,female,125905726,Der wilde Arno,1898,Deutschland,4613,...,1012.85,59,5,64,3.03,2.50,2.765,6.318803,2000-9999_words,1871-1900
7,Anklam_Louise_Kindergeschichten_Der_Wunderdoktor,10,Louise Anklam,Deutschland,female,125905726,Der Wunderdoktor,1898,Deutschland,5472,...,943.96,69,4,73,3.08,3.75,3.415,7.733379,2000-9999_words,1871-1900
9,Anzengruber_Ludwig_Der_Sternsteinhof,12,Ludwig Anzengruber,Österreich,male,11850357X,Der Sternsteinhof,1884,Österreich,103915,...,22587.28,895,165,1060,3.06,3.28,3.170,4.692907,70000-100000_words,1871-1900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1216,Zedelius_Marie_Ein_unheilvoller_Augenblick,1217,Marie Zedelius,Deutschland,female,101198566,Ein unheilvoller Augenblick,1874,Deutschland,26480,...,4884.99,406,28,434,2.92,2.94,2.930,8.884358,10000-39999_words,1871-1900
1219,Zedelius_Marie_Lorbeeren,1220,Marie Zedelius,Deutschland,female,101198566,Lorbeeren,1873,Deutschland,25825,...,5034.52,405,25,430,2.97,2.68,2.825,8.541033,10000-39999_words,1871-1900
1220,Zoeller_Charlotte_Am_Meer,1221,Charlotte Zoeller,Deutschland,female,1043840923,Am Meer,1880,Deutschland,14638,...,2936.87,174,36,210,2.81,2.91,2.860,7.150470,10000-39999_words,1871-1900
1221,zu_Eulenburg-Hertefeld_Philipp_Erlebnisse_an,1222,Philipp zu Eulenburg-Hertefeld,Deutschland,male,118531352,Erlebnisse an deutschen und fremden Höfen,1894,Österreich,98213,...,18192.99,711,75,786,3.02,2.94,2.980,4.320345,70000-100000_words,1871-1900


In [119]:
df_1901_1920

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range,publication_year_range
0,Achleitner_Arthur_Das_Schloss_im_Moor,1,Arthur Achleitner,Deutschland,male,116005327,Das Schloß im Moor,1903,Deutschland,61568,...,13127.41,711,49,760,3.07,2.86,2.965,5.789413,40000-69999_words,1901-1920
1,Achleitner_Arthur_Der_Finanzer,2,Arthur Achleitner,Deutschland,male,116005327,Der Finanzer,1903,Österreich,28240,...,6474.21,294,82,376,3.08,3.18,3.130,5.807658,10000-39999_words,1901-1920
2,Adolph_Karl_Schackerl,3,Karl Adolph,Österreich,male,123217687,Schackerl,1912,Österreich,48194,...,8676.32,307,81,388,3.08,3.28,3.180,4.471942,40000-69999_words,1901-1920
3,Adolph_Karl_Toechter,4,Karl Adolph,Österreich,male,123217687,Töchter,1914,Österreich,108949,...,20442.83,653,121,774,2.97,3.07,3.020,3.786169,70000-100000_words,1901-1920
27,Bahr_Hermann_Die_Rahl,33,Hermann Bahr,Österreich,male,118505955,Die Rahl,1908,Österreich,92041,...,18106.81,1695,230,1925,2.99,3.25,3.120,10.631359,70000-100000_words,1901-1920
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1212,Zapp_Arthur_Zwischen_Mann_und_Frau,1213,Arthur Zapp,Deutschland,male,116965959,Zwischen Mann und Frau,1914,Deutschland,73441,...,13381.13,712,69,781,2.92,3.06,2.990,5.836577,40000-69999_words,1901-1920
1223,zu_Reventlow_Franziska_Ellen_Olestjerne,1224,Franziska zu Reventlow,Deutschland,female,118600044,Ellen Olestjerne,1903,Deutschland,73466,...,14508.60,587,142,729,2.92,3.11,3.015,5.024606,40000-69999_words,1901-1920
1224,zu_Reventlow_Franziska_Herrn_Dames_Aufzeichnungen,1225,Franziska zu Reventlow,Deutschland,female,118600044,Herrn Dames Aufzeichnungen,1913,Deutschland,42090,...,8870.26,620,48,668,2.93,3.28,3.105,7.530783,10000-39999_words,1901-1920
1225,zu_Reventlow_Franziska_Spiritismus,1226,Franziska zu Reventlow,Deutschland,female,118600044,Spiritismus,1917,Deutschland,3998,...,807.78,39,12,51,2.77,3.04,2.905,6.313600,2000-9999_words,1901-1920


In [126]:
merged_df

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range,publication_year_range
0,Achleitner_Arthur_Das_Schloss_im_Moor,1,Arthur Achleitner,Deutschland,male,116005327,Das Schloß im Moor,1903,Deutschland,61568,...,13127.41,711,49,760,3.07,2.86,2.965,5.789413,40000-69999_words,1901-1920
1,Achleitner_Arthur_Der_Finanzer,2,Arthur Achleitner,Deutschland,male,116005327,Der Finanzer,1903,Österreich,28240,...,6474.21,294,82,376,3.08,3.18,3.130,5.807658,10000-39999_words,1901-1920
2,Adolph_Karl_Schackerl,3,Karl Adolph,Österreich,male,123217687,Schackerl,1912,Österreich,48194,...,8676.32,307,81,388,3.08,3.28,3.180,4.471942,40000-69999_words,1901-1920
3,Adolph_Karl_Toechter,4,Karl Adolph,Österreich,male,123217687,Töchter,1914,Österreich,108949,...,20442.83,653,121,774,2.97,3.07,3.020,3.786169,70000-100000_words,1901-1920
4,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,796.80,63,8,71,2.94,3.25,3.095,8.910643,2000-9999_words,1871-1900
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1222,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,932.62,41,8,49,3.01,3.31,3.160,5.254016,2000-9999_words,1871-1900
1223,zu_Reventlow_Franziska_Ellen_Olestjerne,1224,Franziska zu Reventlow,Deutschland,female,118600044,Ellen Olestjerne,1903,Deutschland,73466,...,14508.60,587,142,729,2.92,3.11,3.015,5.024606,40000-69999_words,1901-1920
1224,zu_Reventlow_Franziska_Herrn_Dames_Aufzeichnungen,1225,Franziska zu Reventlow,Deutschland,female,118600044,Herrn Dames Aufzeichnungen,1913,Deutschland,42090,...,8870.26,620,48,668,2.93,3.28,3.105,7.530783,10000-39999_words,1901-1920
1225,zu_Reventlow_Franziska_Spiritismus,1226,Franziska zu Reventlow,Deutschland,female,118600044,Spiritismus,1917,Deutschland,3998,...,807.78,39,12,51,2.77,3.04,2.905,6.313600,2000-9999_words,1901-1920


In [128]:
# Calculate the average of the values in the column "text_loudness_average"
average_text_loudness = df['text_loudness_average'].mean()

# Print the average
print("Average text loudness:", average_text_loudness)


Average text loudness: 2.977367563162184


In [129]:
# Calculate the average for each sub-dataframe
average_1848_1870 = df_1848_1870['text_loudness_average'].mean()
average_1871_1900 = df_1871_1900['text_loudness_average'].mean()
average_1901_1920 = df_1901_1920['text_loudness_average'].mean()

# Print the averages
print("Average text loudness for 1848-1870 range:", average_1848_1870)
print("Average text loudness for 1871-1900 range:", average_1871_1900)
print("Average text loudness for 1901-1920 range:", average_1901_1920)


Average text loudness for 1848-1870 range: 2.9561580882352945
Average text loudness for 1871-1900 range: 2.982439703153989
Average text loudness for 1901-1920 range: 2.984663461538462


In [237]:
import plotly.graph_objects as go

# Sort the DataFrame by 'word_count' column
df_sorted_by_publication_year = df.sort_values(by='text_used_date')

# Calculate cosine smoothed SED values
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_SED = np.convolve(df_sorted_by_publication_year['text_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for SED_without_nan
scatter = go.Scatter(x=df_sorted_by_publication_year['text_used_date'], 
                     y=df_sorted_by_publication_year['text_loudness_average'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='Average Loudness Level',
                     text=df_sorted_by_publication_year['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for smoothed SED
smoothed_line = go.Scatter(x=df_sorted_by_publication_year['text_used_date'], 
                           y=cosine_smoothed_SED,
                           mode='lines',
                           line=dict(color='red', width=2),
                           name='Cosine Smoothed SED')

# Create layout
layout = go.Layout(#title='SED related to text length in words (including author Ludwig Thoma) with Cosine Smoothed SED',
                   xaxis=dict(title='Publication Year'),
                   yaxis=dict(title='Average Loudness Level', range=[0.5, 5]),
                   hovermode='closest',
                   legend=dict(x=0.5, y=1.2, orientation='h'))

# Create figure (exclude line plot for scaled word count)
fig = go.Figure(data=[scatter, smoothed_line], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/ALoudness_theme-d-prose_cos_smoothed_50_medium_res3.png", scale=10)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/ALoudness_theme-d-prose_cos_smoothed_50_high_res3.html")


In [210]:
import plotly.graph_objects as go

# Sort the DataFrame by 'word_count' column
df_sorted_by_publication_year = df.sort_values(by='text_used_date')

# Calculate cosine smoothed CAL values
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_CAL = np.convolve(df_sorted_by_publication_year['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for character_avg_loudness
scatter = go.Scatter(x=df_sorted_by_publication_year['text_used_date'], 
                     y=df_sorted_by_publication_year['character_avg_loudness'],
                     mode='markers',
                     marker=dict(color='blue'),
                     name='Character Average Loudness Level',
                     text=df_sorted_by_publication_year['filename'],  # Hover text
                     hoverinfo='text')

# Create line plot for smoothed CAL
smoothed_line = go.Scatter(x=df_sorted_by_publication_year['text_used_date'], 
                           y=cosine_smoothed_CAL,
                           mode='lines',
                           line=dict(color='red', width=2),
                           name='Cosine Smoothed Character Average Loudness')

# Create layout
layout = go.Layout(#title='SED related to text length in words (including author Ludwig Thoma) with Cosine Smoothed SED',
                   xaxis=dict(title='Publication Year'),
                   yaxis=dict(title='Character Average Loudness Level', range=[0.5, 5]),
                   hovermode='closest',
                   legend=dict(x=0.5, y=1.2, orientation='h'))

# Create figure (exclude line plot for scaled word count)
fig = go.Figure(data=[scatter, smoothed_line], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_cos_smoothed_50_medium_res3.png", scale=10)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_cos_smoothed_50_high_res3.html")


In [134]:
# Filter texts with 'text_loudness_average' higher/equal to 9.08 and calculate average
def filter_and_avg(df, threshold):
    filtered_df = df[df['text_loudness_average'] >= threshold]
    count = len(filtered_df)
    avg = filtered_df['text_loudness_average'].mean()
    return count, avg

# Define the threshold
threshold = 3.295

# Filter and calculate for each DataFrame
count_1848_1870, avg_1848_1870 = filter_and_avg(df_1848_1870, threshold)
print("For 1848-1870 publication year range DataFrame:")
print("Number of texts with 'text_loudness_average' >= 9.08:", count_1848_1870)
print("Average of 'text_loudness_average':", avg_1848_1870)

count_1871_1900, avg_1871_1900 = filter_and_avg(df_1871_1900, threshold)
print("\nFor 1871-1900 publication year range DataFrame:")
print("Number of texts with 'text_loudness_average' >= 9.08:", count_1871_1900)
print("Average of 'text_loudness_average':", avg_1871_1900)

count_1901_1920, avg_1901_1920 = filter_and_avg(df_1901_1920, threshold)
print("\nFor 1901-1920 publication year range DataFrame:")
print("Number of texts with 'text_loudness_average' >= 9.08:", count_1901_1920)
print("Average of 'text_loudness_average':", avg_1901_1920)


For 1848-1870 publication year range DataFrame:
Number of texts with 'text_loudness_average' >= 9.08: 27
Average of 'text_loudness_average': 3.4066666666666667

For 1871-1900 publication year range DataFrame:
Number of texts with 'text_loudness_average' >= 9.08: 44
Average of 'text_loudness_average': 3.377840909090909

For 1901-1920 publication year range DataFrame:
Number of texts with 'text_loudness_average' >= 9.08: 29
Average of 'text_loudness_average': 3.4584482758620685


## Gender-dependent Character Sound

In [138]:
df_2000_9999

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
4,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,4.69,796.80,63,8,71,2.94,3.25,3.09,8.910643,2000-9999_words
5,Anklam_Louise_Kindergeschichten_Der_Weihnachts...,8,Louise Anklam,Deutschland,female,125905726,Der Weihnachtsabend,1898,Deutschland,4466,...,5.21,748.56,51,1,52,2.81,4.00,3.41,6.946671,2000-9999_words
6,Anklam_Louise_Kindergeschichten_Der_wilde_Arno,9,Louise Anklam,Deutschland,female,125905726,Der wilde Arno,1898,Deutschland,4613,...,3.97,1012.85,59,5,64,3.03,2.50,2.76,6.318803,2000-9999_words
7,Anklam_Louise_Kindergeschichten_Der_Wunderdoktor,10,Louise Anklam,Deutschland,female,125905726,Der Wunderdoktor,1898,Deutschland,5472,...,5.05,943.96,69,4,73,3.08,3.75,3.42,7.733379,2000-9999_words
8,Annecke_Fritz_Der_verehrter_Herr,11,Fritz Anneke,Deutschland,male,118649523,Der verehrter Herr,1861,Deutschland,2739,...,4.58,528.82,29,0,29,3.16,0.00,1.58,5.483908,2000-9999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1190,Wildermuth_Ottilie_Krieg_und_Frieden,1193,Ottilie Wildermuth,Deutschland,female,118632833,Krieg und Frieden,1866,Deutschland,8470,...,4.07,1792.14,112,13,125,2.95,3.65,3.30,6.974902,2000-9999_words
1191,Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe,1194,Ottilie Wildermuth,Deutschland,female,118632833,Onkel Gottliebs Jugendliebe,1860,Deutschland,9082,...,4.10,1920.24,133,12,145,2.85,2.12,2.49,7.551139,2000-9999_words
1206,Wuerkert_Ludwig_Ein_Haeckerlingsschneider_als_...,1207,Ludwig Würkert,Deutschland,male,117374997,Ein Häckerlingsschneider als Apostel,1874,Deutschland,6791,...,4.58,1296.29,51,24,75,2.98,2.68,2.83,5.785742,2000-9999_words
1222,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,3.74,932.62,41,8,49,3.01,3.31,3.16,5.254016,2000-9999_words


In [139]:
# Filter rows where author_gender is female
df_female = df_2000_9999[df_2000_9999['author_gender'] == 'female'].copy()

# Filter rows where author_gender is male
df_male = df_2000_9999[df_2000_9999['author_gender'] == 'male'].copy()


In [140]:
df_male

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
8,Annecke_Fritz_Der_verehrter_Herr,11,Fritz Anneke,Deutschland,male,118649523,Der verehrter Herr,1861,Deutschland,2739,...,4.58,528.82,29,0,29,3.16,0.00,1.58,5.483908,2000-9999_words
13,Anzengruber_Ludwig_Kalendergeschichten_Der_Ver...,16,Ludwig Anzengruber,Österreich,male,11850357X,Der Verschollene,1894,Österreich,8699,...,4.21,1781.24,64,14,78,3.02,2.96,2.99,4.378972,2000-9999_words
20,Auerbach_Berthold_Drei_einzige_Toechter_Auf_Wache,25,Berthold Auerbach,Deutschland,male,11865103X,Auf Wache,1875,Österreich,6160,...,3.95,1335.70,97,9,106,3.04,3.67,3.35,7.935914,2000-9999_words
25,Auerbach_Berthold_Hopfen_und_Gerste,30,Berthold Auerbach,Deutschland,male,11865103X,Hopfen und Gerste,1865,Deutschland,10721,...,4.39,2146.01,139,41,180,2.88,2.96,2.92,8.387659,2000-9999_words
29,Bahr_Hermann_Leander_Der_verstaendige_Herr,35,Hermann Bahr,Österreich,male,118505955,Der verständige Herr,1899,Deutschland/Elsass,2720,...,3.82,613.09,14,7,21,2.86,3.29,3.08,3.425272,2000-9999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1146,Wassermann_Jakob_Drei_Erzaehlungen_Hilperich,1149,Jakob Wassermann,Deutschland,male,118629387,Hilperich,1900,Deutschland,8992,...,4.41,1752.38,130,10,140,2.96,3.70,3.33,7.989135,2000-9999_words
1148,Wassermann_Jakob_Erzaehlungen_Adam_Urbas,1151,Jakob Wassermann,Deutschland,male,118629387,Adam Urbas,1901,Deutschland,10447,...,3.41,2609.38,111,21,132,2.83,2.52,2.67,5.058673,2000-9999_words
1149,Wassermann_Jakob_Erzaehlungen_Jost,1152,Jakob Wassermann,Deutschland,male,118629387,Jost,1901,Österreich,11025,...,4.27,2170.96,106,33,139,2.99,3.22,3.11,6.402697,2000-9999_words
1152,Wehl_Feodor_Der_Diebstahl_aus_Liebe,1155,Feodor Wehl,Deutschland,male,117220264,Der Diebstahl aus Liebe,1855,Deutschland,8287,...,4.27,1715.46,90,7,97,3.12,2.14,2.63,5.654460,2000-9999_words


In [141]:
df_female

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average,SED_without_nan,word_count_range
4,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,4.69,796.80,63,8,71,2.94,3.25,3.09,8.910643,2000-9999_words
5,Anklam_Louise_Kindergeschichten_Der_Weihnachts...,8,Louise Anklam,Deutschland,female,125905726,Der Weihnachtsabend,1898,Deutschland,4466,...,5.21,748.56,51,1,52,2.81,4.00,3.41,6.946671,2000-9999_words
6,Anklam_Louise_Kindergeschichten_Der_wilde_Arno,9,Louise Anklam,Deutschland,female,125905726,Der wilde Arno,1898,Deutschland,4613,...,3.97,1012.85,59,5,64,3.03,2.50,2.76,6.318803,2000-9999_words
7,Anklam_Louise_Kindergeschichten_Der_Wunderdoktor,10,Louise Anklam,Deutschland,female,125905726,Der Wunderdoktor,1898,Deutschland,5472,...,5.05,943.96,69,4,73,3.08,3.75,3.42,7.733379,2000-9999_words
43,Behrens_Bertha_Alte_Liebe_und_anderes_Alte_Liebe,49,Bertha Behrens,Deutschland,female,104352787,Alte Liebe,1904,Deutschland,6666,...,4.23,1347.99,118,8,126,2.85,1.91,2.38,9.347250,2000-9999_words
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1189,Wildermuth_Ottilie_Geschichten_aus_Schwaben_Ke...,1192,Ottilie Wildermuth,Deutschland,female,118632833,Keine Neigungsheirath,1854,Deutschland,4734,...,4.94,832.79,49,4,53,3.13,2.50,2.81,6.364149,2000-9999_words
1190,Wildermuth_Ottilie_Krieg_und_Frieden,1193,Ottilie Wildermuth,Deutschland,female,118632833,Krieg und Frieden,1866,Deutschland,8470,...,4.07,1792.14,112,13,125,2.95,3.65,3.30,6.974902,2000-9999_words
1191,Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe,1194,Ottilie Wildermuth,Deutschland,female,118632833,Onkel Gottliebs Jugendliebe,1860,Deutschland,9082,...,4.10,1920.24,133,12,145,2.85,2.12,2.49,7.551139,2000-9999_words
1222,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,3.74,932.62,41,8,49,3.01,3.31,3.16,5.254016,2000-9999_words


In [142]:
# Count the number of unique names in the "author_used_name" column of df_female
unique_names_count = df_female['author_used_name'].nunique()

# Print the count
print("Number of unique names in 'author_used_name' column of df_female:", unique_names_count)


Number of unique names in 'author_used_name' column of df_female: 34


In [143]:
# Count the number of unique names in the "author_used_name" column of df_female
unique_names_count = df_male['author_used_name'].nunique()

# Print the count
print("Number of unique names in 'author_used_name' column of df_male:", unique_names_count)


Number of unique names in 'author_used_name' column of df_male: 117


In [144]:
import plotly.graph_objects as go

# Sort the DataFrame by 'text_used_date' column
df_sorted_by_publication_year_female = df_female.sort_values(by='text_used_date')
df_sorted_by_publication_year_male = df_male.sort_values(by='text_used_date')

# Calculate cosine smoothed loudness values for females
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_loudness_female = np.convolve(df_sorted_by_publication_year_female['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Calculate cosine smoothed loudness values for males
cosine_smoothed_loudness_male = np.convolve(df_sorted_by_publication_year_male['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for loudness for females
scatter_female = go.Scatter(x=df_sorted_by_publication_year_female['text_used_date'], 
                            y=df_sorted_by_publication_year_female['character_avg_loudness'],
                            mode='markers',
                            marker=dict(color='violet'),
                            name='Average Loudness Level (Female)',
                            text=df_sorted_by_publication_year_female['filename'],  # Hover text
                            hoverinfo='text')

# Create scatter plot for loudness for males
scatter_male = go.Scatter(x=df_sorted_by_publication_year_male['text_used_date'], 
                          y=df_sorted_by_publication_year_male['character_avg_loudness'],
                          mode='markers',
                          marker=dict(color='green'),
                          name='Average Loudness Level (Male)',
                          text=df_sorted_by_publication_year_male['filename'],  # Hover text
                          hoverinfo='text')

# Create layout
layout = go.Layout(xaxis=dict(title='Publication Year'),
                   yaxis=dict(title='Character Average Loudness Level', range=[0.5, 5]),
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter_female, scatter_male], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_high_res.png", scale=20)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_high_res.html")


In [209]:
# This is a nice one! 


import plotly.graph_objects as go

# Sort the DataFrame by 'text_used_date' column
df_sorted_by_publication_year_female = df_female.sort_values(by='text_used_date')
df_sorted_by_publication_year_male = df_male.sort_values(by='text_used_date')

# Calculate cosine smoothed loudness values for females
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_loudness_female = np.convolve(df_sorted_by_publication_year_female['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Calculate cosine smoothed loudness values for males
cosine_smoothed_loudness_male = np.convolve(df_sorted_by_publication_year_male['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for loudness for females
scatter_female = go.Scatter(x=df_sorted_by_publication_year_female['text_used_date'], 
                            y=df_sorted_by_publication_year_female['character_avg_loudness'],
                            mode='markers',
                            marker=dict(color='violet'),
                            name='Average Loudness Level (Female Author)',
                            text=df_sorted_by_publication_year_female['filename'],  # Hover text
                            hoverinfo='text')

# Create scatter plot for loudness for males
scatter_male = go.Scatter(x=df_sorted_by_publication_year_male['text_used_date'], 
                          y=df_sorted_by_publication_year_male['character_avg_loudness'],
                          mode='markers',
                          marker=dict(color='green'),
                          name='Average Loudness Level (Male Author)',
                          text=df_sorted_by_publication_year_male['filename'],  # Hover text
                          hoverinfo='text')

# Create line plot for cosine smoothed loudness for females
line_smoothed_female = go.Scatter(x=df_sorted_by_publication_year_female['text_used_date'], 
                                  y=cosine_smoothed_loudness_female,
                                  mode='lines',
                                  line=dict(color='darkviolet', width=2),
                                  name='Cosine Smoothed Loudness (Female Author)')

# Create line plot for cosine smoothed loudness for males
line_smoothed_male = go.Scatter(x=df_sorted_by_publication_year_male['text_used_date'], 
                                y=cosine_smoothed_loudness_male,
                                mode='lines',
                                line=dict(color='darkgreen', width=2),
                                name='Cosine Smoothed Loudness (Male Author)')

# Create layout
layout = go.Layout(xaxis=dict(title='Publication Year'),
                   yaxis=dict(title='Character Average Loudness Level by Author Gender', range=[0.5, 5]),
                   hovermode='closest',
                   legend=dict(x=0.5, y=1.2, orientation='h'))

# Create figure
fig = go.Figure(data=[scatter_female, scatter_male, line_smoothed_female, line_smoothed_male], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_smoothed_mdeium_res3.png", scale=10)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_smoothed_high_res3.html")


In [155]:
import plotly.graph_objects as go

# Define bins for decades
bins = [1840, 1861, 1871, 1881, 1891, 1901, 1911, 1921]

# Define labels for the decades
labels = ['1840s-1860', '1860s', '1870s', '1880s', '1890s', '1900s', '1910s']

# Create a new column 'publication_decade' that categorizes years into decades
df_male['publication_decade'] = pd.cut(df_male['text_used_date'], bins=bins, labels=labels, right=False)
df_female['publication_decade'] = pd.cut(df_female['text_used_date'], bins=bins, labels=labels, right=False)

# Sort the DataFrames by 'publication_decade' column
df_sorted_by_decade_male = df_male.sort_values(by='publication_decade')
df_sorted_by_decade_female = df_female.sort_values(by='publication_decade')

# Calculate cosine smoothed CAL values for male authors
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_CAL_male = np.convolve(df_sorted_by_decade_male['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Calculate cosine smoothed CAL values for female authors
cosine_smoothed_CAL_female = np.convolve(df_sorted_by_decade_female['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for each text for male authors
scatter_male = go.Scatter(x=df_sorted_by_decade_male['publication_decade'], 
                           y=df_sorted_by_decade_male['character_avg_loudness'],
                           mode='markers',
                           marker=dict(color='green'),
                           name='Male Authors',
                           text=df_sorted_by_decade_male['filename'],  # Hover text
                           hoverinfo='text')

# Create scatter plot for each text for female authors
scatter_female = go.Scatter(x=df_sorted_by_decade_female['publication_decade'], 
                             y=df_sorted_by_decade_female['character_avg_loudness'],
                             mode='markers',
                             marker=dict(color='violet'),
                             name='Female Authors',
                             text=df_sorted_by_decade_female['filename'],  # Hover text
                             hoverinfo='text')

# Create line plot for smoothed character_avg_loudness for male authors
smoothed_line_male = go.Scatter(x=df_sorted_by_decade_male['publication_decade'], 
                                y=cosine_smoothed_CAL_male,
                                mode='lines',
                                line=dict(color='green', width=2),
                                name='Male Authors')

# Create line plot for smoothed character_avg_loudness for female authors
smoothed_line_female = go.Scatter(x=df_sorted_by_decade_female['publication_decade'], 
                                  y=cosine_smoothed_CAL_female,
                                  mode='lines',
                                  line=dict(color='violet', width=2),
                                  name='Female Authors')

# Create layout
layout = go.Layout(xaxis=dict(title='Publication Decade'),
                   yaxis=dict(title='Character Average Loudness Level', range=[1.75, 3.75]),
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter_male, scatter_female, smoothed_line_male, smoothed_line_female], layout=layout)

# Show plot
fig.show()


In [157]:
import plotly.graph_objects as go

# Define bins for decades
bins = [1840, 1861, 1871, 1881, 1891, 1901, 1911, 1921]

# Define labels for the decades
labels = ['1848-1860', '1861-1870', '1871-1880', '1881-1890', '1891-1900', '1901-1910', '1911-1920']

# Create a new column 'publication_decade' that categorizes years into decades
df_male['publication_decade'] = pd.cut(df_male['text_used_date'], bins=bins, labels=labels, right=False)
df_female['publication_decade'] = pd.cut(df_female['text_used_date'], bins=bins, labels=labels, right=False)

# Sort the DataFrames by 'publication_decade' column
df_sorted_by_decade_male = df_male.sort_values(by='publication_decade')
df_sorted_by_decade_female = df_female.sort_values(by='publication_decade')

# Calculate cosine smoothed CAL values for male authors
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_CAL_male = np.convolve(df_sorted_by_decade_male['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Calculate cosine smoothed CAL values for female authors
cosine_smoothed_CAL_female = np.convolve(df_sorted_by_decade_female['character_avg_loudness'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for each text for male authors
scatter_male = go.Scatter(x=df_sorted_by_decade_male['publication_decade'], 
                           y=df_sorted_by_decade_male['character_avg_loudness'],
                           mode='markers',
                           marker=dict(color='green'),
                           name='Male Authors',
                           text=df_sorted_by_decade_male['filename'],  # Hover text
                           hoverinfo='text')

# Create scatter plot for each text for female authors
scatter_female = go.Scatter(x=df_sorted_by_decade_female['publication_decade'], 
                             y=df_sorted_by_decade_female['character_avg_loudness'],
                             mode='markers',
                             marker=dict(color='violet'),
                             name='Female Authors',
                             text=df_sorted_by_decade_female['filename'],  # Hover text
                             hoverinfo='text')

# Create line plot for smoothed character_avg_loudness for male authors
smoothed_line_male = go.Scatter(x=df_sorted_by_decade_male['publication_decade'], 
                                y=cosine_smoothed_CAL_male,
                                mode='lines',
                                line=dict(color='darkgreen', width=2),  # Adjust color
                                name='Male Authors')

# Create line plot for smoothed character_avg_loudness for female authors
smoothed_line_female = go.Scatter(x=df_sorted_by_decade_female['publication_decade'], 
                                  y=cosine_smoothed_CAL_female,
                                  mode='lines',
                                  line=dict(color='darkviolet', width=2),  # Adjust color
                                  name='Female Authors')

# Create layout
layout = go.Layout(xaxis=dict(title='Publication Decade'),
                   yaxis=dict(title='Character Average Loudness Level', range=[1.75, 3.75]),  # Adjust range
                   hovermode='closest')

# Create figure
fig = go.Figure(data=[scatter_male, scatter_female, smoothed_line_male, smoothed_line_female], layout=layout)

# Show plot
fig.show()


## Investigating Gender-associated Loudness 

In [235]:
# diese Zelle!

import pandas as pd

# Load the CSV file into a pandas DataFrame
df_new2 = pd.read_csv("/Users/sguhr/Desktop/Diss_notebooks/20240511_theme-d-prose_subcorpus_shorts_df_Gender.csv")

# Display the DataFrame
print(df_new2)

                                              filename  \
0                       Raabe_Wilhelm_Der_gute_Tag.xml   
1    Wildermuth_Ottilie_Geschichten_aus_Schwaben_Da...   
2                          Ring_Max_Vom_alten_Heim.xml   
3              Viebig_Clara_Die_Cigarrenarbeiterin.xml   
4    Groller_Balduin_Detektiv_Dagobert_Der_Kassenei...   
..                                                 ...   
420  Wildermuth_Ottilie_Geschichten_aus_Schwaben_Da...   
421  Stifter_Adalbert_Leben_und_Haushalt_dreier_Wie...   
422  Berthold_Theodor_Lustige_Gymnasialgeschichten_...   
423  von_Liliencron_Detlev_Roggen_und_Weizen_Der_Di...   
424                    Oelschlaeger_Hermann_Klytia.xml   

                   character_sound-loudness_dictionary  \
0    {'verkündigte er von Stockwerk zu Stockwerk': ...   
1    {'und wurde bei allen Familienfesten gebeten':...   
2    {'Während er sprach': 3.0, 'mit denen er sich ...   
3    {'sie sprach nicht': 3.0, 'und klopfte verlang...   
4    {'fragte

In [221]:
#Merge the calculation results and dictionaries of new_df with the metadata df 

import pandas as pd
import os

# Load CSV into DataFrame
csv_path = "/Users/sguhr/Downloads/20240511_Metadata_theme-d-Prose_xml.csv"  # Update with the path to your CSV file
csv_df = pd.read_csv(csv_path)

# Remove file extensions from filenames in the CSV DataFrame
csv_df['filename'] = csv_df['filename'].str.replace('.xml', '')
df_new2['filename'] = df_new2['filename'].str.replace('.xml', '')


# Merge CSV DataFrame with XML DataFrame based on filename
df_new2 = pd.merge(csv_df, df_new2, on='filename', how='inner')

# Display merged DataFrame
print(df_new2)

                                              filename  ID_theme-d-Prose  \
0      Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser                 7   
1    Anklam_Louise_Kindergeschichten_Der_Weihnachts...                 8   
2       Anklam_Louise_Kindergeschichten_Der_wilde_Arno                 9   
3     Anklam_Louise_Kindergeschichten_Der_Wunderdoktor                10   
4                     Annecke_Fritz_Der_verehrter_Herr                11   
..                                                 ...               ...   
420               Wildermuth_Ottilie_Krieg_und_Frieden              1193   
421     Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe              1194   
422  Wuerkert_Ludwig_Ein_Haeckerlingsschneider_als_...              1207   
423  zu_Reventlow_Franziska_Das_graefliche_Milchges...              1223   
424                 zu_Reventlow_Franziska_Spiritismus              1226   

           author_used_name author_nationality author_gender GND_number  \
0           

In [222]:
df_new2

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,ambient_sound-loudness_dictionary,character_avg_loudness,ambient_avg_loudness,text_loudness_average,male_sound_dict,male_loudness_values,male_loudness_average,female_sound_dict,female_loudness_values,female_loudness_average
0,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,"{'tönte es von allen Seiten': 4.0, 'riefen die...",2.93,3.25,3.09,"{'unterbrach jetzt <Mann>Fritz</Mann>, der ein...","[3.0, 3.0, 3.0, 3.0, 3.5, 3.0, 3.0, 3.0, 3.0, ...",2.891304,{'sagte das kleine <Frau>Lieschen</Frau>': 3.0...,"[3.0, 3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",2.958333
1,Anklam_Louise_Kindergeschichten_Der_Weihnachts...,8,Louise Anklam,Deutschland,female,125905726,Der Weihnachtsabend,1898,Deutschland,4466,...,{'und alle lachten über den kleinen <Mann>Sche...,2.81,4.00,3.41,"{'sagte <Mann>Kläre</Mann>': 3.0, 'Sprachlos v...","[3.0, 0.0, 3.0]",2.000000,{'unterbrach die <Frau>Mutter</Frau> ihr fröhl...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.058824
2,Anklam_Louise_Kindergeschichten_Der_wilde_Arno,9,Louise Anklam,Deutschland,female,125905726,Der wilde Arno,1898,Deutschland,4613,...,"{'und sagten': 3.0, 'Lachend und sich darüber ...",3.03,2.50,2.76,{'<Mann>Nachbars</Mann> <Mann>Knecht</Mann> <M...,"[3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, ...",3.095238,{'erwiderte die bekümmerte <Frau>Mutter</Frau>...,"[3.0, 3.0, 2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",2.666667
3,Anklam_Louise_Kindergeschichten_Der_Wunderdoktor,10,Louise Anklam,Deutschland,female,125905726,Der Wunderdoktor,1898,Deutschland,5472,...,{'die sich untereinander neckten und fröhlich ...,3.06,3.75,3.41,{'So recht aus tiefstem Herzen weinend und sch...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.083333,{'daß die <Frau>Erzieherin</Frau> in verzeihli...,"[3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 0.0, 2.0, ...",2.904762
4,Annecke_Fritz_Der_verehrter_Herr,11,Fritz Anneke,Deutschland,male,118649523,Der verehrter Herr,1861,Deutschland,2739,...,{},3.17,0.00,1.58,{'Der <Mann>Präsident</Mann> richtete dann nac...,"[3.0, 3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, ...",3.294118,{},[],0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
420,Wildermuth_Ottilie_Krieg_und_Frieden,1193,Ottilie Wildermuth,Deutschland,female,118632833,Krieg und Frieden,1866,Deutschland,8470,...,{'sie nicht gar zu viel Lärm und Unordnung mac...,2.95,3.65,3.30,"{'versicherte der <Mann>Martin</Mann>': 3.0, '...","[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",2.979592,{'Manche <Frau>Mutter</Frau> weinte laut': 3.5...,"[3.5, 3.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0]",2.812500
421,Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe,1194,Ottilie Wildermuth,Deutschland,female,118632833,Onkel Gottliebs Jugendliebe,1860,Deutschland,9082,...,"{'Jubel begrüßt': 3.0, 'Da erscholl außen ein ...",2.86,2.32,2.59,"{'lachte der <Mann>Onkel</Mann>': 4.0, 'er hat...","[4.0, 3.0, 3.0, 2.0, 3.0, 0.0, 3.0, 3.0, 2.0, ...",2.700000,{'begann die kleine <Frau>Hedwig</Frau> bedenk...,"[3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.5, ...",3.011628
422,Wuerkert_Ludwig_Ein_Haeckerlingsschneider_als_...,1207,Ludwig Würkert,Deutschland,male,117374997,Ein Häckerlingsschneider als Apostel,1874,Deutschland,6791,...,"{'Kinder weinten': 3.0, 'rief ein hinzutretend...",2.98,2.68,2.83,"{'schon <Mann>Seume</Mann> sagt': 3.0, 'Und wi...","[3.0, 3.0, 3.0, 1.0, 3.0, 3.0, 3.5, 3.0, 3.0, ...",2.863636,{'als sein <Frau>Weib</Frau> und seine fünf un...,"[3.0, 4.0, 4.0, 3.0, 3.5, 3.0, 1.0, 3.0, 3.5, ...",3.181818
423,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,"{'Endlich schlug es vier': 4.0, 'und ein schar...",3.00,3.31,3.16,{'sagte <Mann>Fritz</Mann> <Mann>Beier</Mann>'...,"[3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 3.0]",3.111111,{},[],0.000000


In [223]:
import pandas as pd

# Initialize lists to store the female sound dictionaries and loudness values
female_sound_list = []
female_loudness_values_list = []

# Iterate over the rows of the DataFrame
for _, row in df.iterrows():
    # Extract the character_sound-loudness_dictionary
    char_sound_dict = eval(row['character_sound-loudness_dictionary'])
    
    # Initialize dictionaries for female sounds and loudness values
    female_sound_dict = {}
    female_loudness_values = []
    
    # Iterate over the keys in the character_sound-loudness_dictionary
    for key, value in char_sound_dict.items():
        # Check if the key contains the regex match "<Frau>"
        if "<Frau>" in key:
            # Add the key-value pair to the female_sound_dict
            female_sound_dict[key] = value
            # Append the loudness value to the list
            female_loudness_values.append(value)
            
    # Append the female_sound_dict to the female_sound_list
    female_sound_list.append(female_sound_dict)
    
    # Append the female_loudness_values to the female_loudness_values_list
    female_loudness_values_list.append(female_loudness_values)

# Add the female_sound_list as a new column "female_sound_dict" to the DataFrame
df_new2['female_sound_dict'] = female_sound_list

# Add the female_loudness_values_list as a new column "female_loudness_values" to the DataFrame
df_new2['female_loudness_values'] = female_loudness_values_list

# Calculate the average of the female loudness values
df_new2['female_loudness_average'] = df_new2['female_loudness_values'].apply(lambda x: sum(x) / len(x) if len(x) > 0 else 0)

# Display the DataFrame
print(df_new2)


                                              filename  ID_theme-d-Prose  \
0      Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser                 7   
1    Anklam_Louise_Kindergeschichten_Der_Weihnachts...                 8   
2       Anklam_Louise_Kindergeschichten_Der_wilde_Arno                 9   
3     Anklam_Louise_Kindergeschichten_Der_Wunderdoktor                10   
4                     Annecke_Fritz_Der_verehrter_Herr                11   
..                                                 ...               ...   
420               Wildermuth_Ottilie_Krieg_und_Frieden              1193   
421     Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe              1194   
422  Wuerkert_Ludwig_Ein_Haeckerlingsschneider_als_...              1207   
423  zu_Reventlow_Franziska_Das_graefliche_Milchges...              1223   
424                 zu_Reventlow_Franziska_Spiritismus              1226   

           author_used_name author_nationality author_gender GND_number  \
0           

In [224]:
df_new2

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,ambient_sound-loudness_dictionary,character_avg_loudness,ambient_avg_loudness,text_loudness_average,male_sound_dict,male_loudness_values,male_loudness_average,female_sound_dict,female_loudness_values,female_loudness_average
0,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,"{'tönte es von allen Seiten': 4.0, 'riefen die...",2.93,3.25,3.09,"{'unterbrach jetzt <Mann>Fritz</Mann>, der ein...","[3.0, 3.0, 3.0, 3.0, 3.5, 3.0, 3.0, 3.0, 3.0, ...",2.891304,{},[],0.000000
1,Anklam_Louise_Kindergeschichten_Der_Weihnachts...,8,Louise Anklam,Deutschland,female,125905726,Der Weihnachtsabend,1898,Deutschland,4466,...,{'und alle lachten über den kleinen <Mann>Sche...,2.81,4.00,3.41,"{'sagte <Mann>Kläre</Mann>': 3.0, 'Sprachlos v...","[3.0, 0.0, 3.0]",2.000000,{},[],0.000000
2,Anklam_Louise_Kindergeschichten_Der_wilde_Arno,9,Louise Anklam,Deutschland,female,125905726,Der wilde Arno,1898,Deutschland,4613,...,"{'und sagten': 3.0, 'Lachend und sich darüber ...",3.03,2.50,2.76,{'<Mann>Nachbars</Mann> <Mann>Knecht</Mann> <M...,"[3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, ...",3.095238,{},[],0.000000
3,Anklam_Louise_Kindergeschichten_Der_Wunderdoktor,10,Louise Anklam,Deutschland,female,125905726,Der Wunderdoktor,1898,Deutschland,5472,...,{'die sich untereinander neckten und fröhlich ...,3.06,3.75,3.41,{'So recht aus tiefstem Herzen weinend und sch...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.083333,{'Sie sprach niemals über ihre <Frau>Herrin</F...,"[3.0, 3.0, 3.0, 4.0, 1.0, 2.0, 3.0, 3.0, 3.0, ...",2.900000
4,Annecke_Fritz_Der_verehrter_Herr,11,Fritz Anneke,Deutschland,male,118649523,Der verehrter Herr,1861,Deutschland,2739,...,{},3.17,0.00,1.58,{'Der <Mann>Präsident</Mann> richtete dann nac...,"[3.0, 3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, ...",3.294118,{'fragte ich eine alte <Frau>Frau</Frau>': 3.0...,"[3.0, 3.0, 3.0, 3.0, 3.0]",3.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
420,Wildermuth_Ottilie_Krieg_und_Frieden,1193,Ottilie Wildermuth,Deutschland,female,118632833,Krieg und Frieden,1866,Deutschland,8470,...,{'sie nicht gar zu viel Lärm und Unordnung mac...,2.95,3.65,3.30,"{'versicherte der <Mann>Martin</Mann>': 3.0, '...","[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",2.979592,"{'sagte die <Frau>Wärterin</Frau>': 3.0, 'Und ...","[3.0, 3.0]",3.000000
421,Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe,1194,Ottilie Wildermuth,Deutschland,female,118632833,Onkel Gottliebs Jugendliebe,1860,Deutschland,9082,...,"{'Jubel begrüßt': 3.0, 'Da erscholl außen ein ...",2.86,2.32,2.59,"{'lachte der <Mann>Onkel</Mann>': 4.0, 'er hat...","[4.0, 3.0, 3.0, 2.0, 3.0, 0.0, 3.0, 3.0, 2.0, ...",2.700000,{'fragte <Frau>Frau</Frau> <Frau>Meier</Frau>'...,"[3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 0.0, 3.0, ...",2.909091
422,Wuerkert_Ludwig_Ein_Haeckerlingsschneider_als_...,1207,Ludwig Würkert,Deutschland,male,117374997,Ein Häckerlingsschneider als Apostel,1874,Deutschland,6791,...,"{'Kinder weinten': 3.0, 'rief ein hinzutretend...",2.98,2.68,2.83,"{'schon <Mann>Seume</Mann> sagt': 3.0, 'Und wi...","[3.0, 3.0, 3.0, 1.0, 3.0, 3.0, 3.5, 3.0, 3.0, ...",2.863636,{'Meine <Frau>Mutter</Frau> nannte den Namen':...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.040000
423,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,"{'Endlich schlug es vier': 4.0, 'und ein schar...",3.00,3.31,3.16,{'sagte <Mann>Fritz</Mann> <Mann>Beier</Mann>'...,"[3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 3.0]",3.111111,{'Und meine <Frau>Mutter</Frau> hat gesagt': 3...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.000000


In [225]:
import pandas as pd

# Initialize lists to store the male sound dictionaries and loudness values
male_sound_list = []
male_loudness_values_list = []

# Iterate over the rows of the DataFrame
for _, row in df.iterrows():
    # Extract the character_sound-loudness_dictionary
    char_sound_dict = eval(row['character_sound-loudness_dictionary'])
    
    # Initialize dictionaries for male sounds and loudness values
    male_sound_dict = {}
    male_loudness_values = []
    
    # Iterate over the keys in the character_sound-loudness_dictionary
    for key, value in char_sound_dict.items():
        # Check if the key contains the regex match "<Mann>"
        if "<Mann>" in key:
            # Add the key-value pair to the male_sound_dict
            male_sound_dict[key] = value
            # Append the loudness value to the list
            male_loudness_values.append(value)
            
    # Append the male_sound_dict to the male_sound_list
    male_sound_list.append(male_sound_dict)
    
    # Append the male_loudness_values to the male_loudness_values_list
    male_loudness_values_list.append(male_loudness_values)

# Add the male_sound_list as a new column "male_sound_dict" to the DataFrame
df_new2['male_sound_dict'] = male_sound_list

# Add the male_loudness_values_list as a new column "male_loudness_values" to the DataFrame
df_new2['male_loudness_values'] = male_loudness_values_list

# Calculate the average of the male loudness values
df_new2['male_loudness_average'] = df_new2['male_loudness_values'].apply(lambda x: sum(x) / len(x) if len(x) > 0 else 0)

# Display the DataFrame
print(df_new2)


                                              filename  ID_theme-d-Prose  \
0      Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser                 7   
1    Anklam_Louise_Kindergeschichten_Der_Weihnachts...                 8   
2       Anklam_Louise_Kindergeschichten_Der_wilde_Arno                 9   
3     Anklam_Louise_Kindergeschichten_Der_Wunderdoktor                10   
4                     Annecke_Fritz_Der_verehrter_Herr                11   
..                                                 ...               ...   
420               Wildermuth_Ottilie_Krieg_und_Frieden              1193   
421     Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe              1194   
422  Wuerkert_Ludwig_Ein_Haeckerlingsschneider_als_...              1207   
423  zu_Reventlow_Franziska_Das_graefliche_Milchges...              1223   
424                 zu_Reventlow_Franziska_Spiritismus              1226   

           author_used_name author_nationality author_gender GND_number  \
0           

In [226]:
df_new2

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,ambient_sound-loudness_dictionary,character_avg_loudness,ambient_avg_loudness,text_loudness_average,male_sound_dict,male_loudness_values,male_loudness_average,female_sound_dict,female_loudness_values,female_loudness_average
0,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,"{'tönte es von allen Seiten': 4.0, 'riefen die...",2.93,3.25,3.09,{'Endlich fragte mich der <Mann>Eine</Mann>': ...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]",3.000000,{},[],0.000000
1,Anklam_Louise_Kindergeschichten_Der_Weihnachts...,8,Louise Anklam,Deutschland,female,125905726,Der Weihnachtsabend,1898,Deutschland,4466,...,{'und alle lachten über den kleinen <Mann>Sche...,2.81,4.00,3.41,{'schrie der corpulente <Mann>Berthold</Mann>'...,"[4.0, 3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.054688,{},[],0.000000
2,Anklam_Louise_Kindergeschichten_Der_wilde_Arno,9,Louise Anklam,Deutschland,female,125905726,Der wilde Arno,1898,Deutschland,4613,...,"{'und sagten': 3.0, 'Lachend und sich darüber ...",3.03,2.50,2.76,"{'sagte <Mann>Jochen</Mann>': 3.0, 'mit welche...","[3.0, 3.0, 3.0, 3.0]",3.000000,{},[],0.000000
3,Anklam_Louise_Kindergeschichten_Der_Wunderdoktor,10,Louise Anklam,Deutschland,female,125905726,Der Wunderdoktor,1898,Deutschland,5472,...,{'die sich untereinander neckten und fröhlich ...,3.06,3.75,3.41,"{'meinte ein junger <Mann>Mann</Mann>': 3.0, '...","[3.0, 3.0, 2.0, 3.0, 3.0, 3.0, 3.0]",2.857143,{'Sie sprach niemals über ihre <Frau>Herrin</F...,"[3.0, 3.0, 3.0, 4.0, 1.0, 2.0, 3.0, 3.0, 3.0, ...",2.900000
4,Annecke_Fritz_Der_verehrter_Herr,11,Fritz Anneke,Deutschland,male,118649523,Der verehrter Herr,1861,Deutschland,2739,...,{},3.17,0.00,1.58,"{'nennt sie der <Mann>Seefahrer</Mann>': 3.0, ...","[3.0, 3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 4.0, 3.0, ...",3.272727,{'fragte ich eine alte <Frau>Frau</Frau>': 3.0...,"[3.0, 3.0, 3.0, 3.0, 3.0]",3.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
420,Wildermuth_Ottilie_Krieg_und_Frieden,1193,Ottilie Wildermuth,Deutschland,female,118632833,Krieg und Frieden,1866,Deutschland,8470,...,{'sie nicht gar zu viel Lärm und Unordnung mac...,2.95,3.65,3.30,"{'antwortete der <Mann>Alexianer</Mann>': 3.0,...","[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.153846,"{'sagte die <Frau>Wärterin</Frau>': 3.0, 'Und ...","[3.0, 3.0]",3.000000
421,Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe,1194,Ottilie Wildermuth,Deutschland,female,118632833,Onkel Gottliebs Jugendliebe,1860,Deutschland,9082,...,"{'Jubel begrüßt': 3.0, 'Da erscholl außen ein ...",2.86,2.32,2.59,{'Ich stürzte auf den <Mann>Sprecher</Mann> zu...,"[4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]",3.125000,{'fragte <Frau>Frau</Frau> <Frau>Meier</Frau>'...,"[3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 0.0, 3.0, ...",2.909091
422,Wuerkert_Ludwig_Ein_Haeckerlingsschneider_als_...,1207,Ludwig Würkert,Deutschland,male,117374997,Ein Häckerlingsschneider als Apostel,1874,Deutschland,6791,...,"{'Kinder weinten': 3.0, 'rief ein hinzutretend...",2.98,2.68,2.83,{'als sie der <Mann>Wirth</Mann> unterbrach': ...,"[3.0, 3.0]",3.000000,{'Meine <Frau>Mutter</Frau> nannte den Namen':...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.040000
423,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,"{'Endlich schlug es vier': 4.0, 'und ein schar...",3.00,3.31,3.16,{'und der <Mann>Stationsdiener</Mann> hat gesa...,"[3.0, 3.0, 3.0, 3.0, 0.0, 3.0, 3.0, 3.0, 3.0, ...",3.113636,{'Und meine <Frau>Mutter</Frau> hat gesagt': 3...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.000000


In [227]:
# Save DataFrame to CSV
df_new2.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240514_theme-d-prose_subcorpus_shorts_df_gender_loudness_df_new2.csv', index=False)

In [228]:
# Character Loudness sorted by female and male average character loudness but independent of authors

import plotly.graph_objects as go
import numpy as np

# Sort the DataFrame by 'text_used_date' column
df_sorted_by_publication_year = df_new2.sort_values(by='text_used_date')

# Calculate cosine smoothed loudness values for females
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_loudness_female = np.convolve(df_sorted_by_publication_year['female_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Calculate cosine smoothed loudness values for males
cosine_smoothed_loudness_male = np.convolve(df_sorted_by_publication_year['male_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for loudness for females
scatter_female = go.Scatter(x=df_sorted_by_publication_year['text_used_date'], 
                            y=df_sorted_by_publication_year['female_loudness_average'],
                            mode='markers',
                            marker=dict(color='violet'),
                            name='Average Loudness Level (Female Characters)',
                            text=df_sorted_by_publication_year['filename'],  # Hover text
                            hoverinfo='text')

# Create scatter plot for loudness for males
scatter_male = go.Scatter(x=df_sorted_by_publication_year['text_used_date'], 
                          y=df_sorted_by_publication_year['male_loudness_average'],
                          mode='markers',
                          marker=dict(color='green'),
                          name='Average Loudness Level (Male Characters)',
                          text=df_sorted_by_publication_year['filename'],  # Hover text
                          hoverinfo='text')

# Create line plot for cosine smoothed loudness for females
line_smoothed_female = go.Scatter(x=df_sorted_by_publication_year['text_used_date'], 
                                  y=cosine_smoothed_loudness_female,
                                  mode='lines',
                                  line=dict(color='darkviolet', width=2),
                                  name='Cosine Smoothed Loudness (Female Characters)')

# Create line plot for cosine smoothed loudness for males
line_smoothed_male = go.Scatter(x=df_sorted_by_publication_year['text_used_date'], 
                                y=cosine_smoothed_loudness_male,
                                mode='lines',
                                line=dict(color='darkgreen', width=2),
                                name='Cosine Smoothed Loudness (Male Characters)')

# Create layout
layout = go.Layout(xaxis=dict(title='Publication Year'),
                   yaxis=dict(title='Character Average Loudness Levels of Characters (F/M)', range=[0.5, 5]),
                   hovermode='closest',
                   legend=dict(x=0.5, y=1.2, orientation='h')) 

# Create figure
fig = go.Figure(data=[scatter_female, scatter_male, line_smoothed_female, line_smoothed_male], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_characters_smoothed_medium_res_df_new2.png", scale=10)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_characters_smoothed_medium_res_df_new2.html")


In [229]:
# Create sub-dataframes for each author gender
df_new2_female = df_new2[df_new2['author_gender'] == 'female']
df_new2_male = df_new2[df_new2['author_gender'] == 'male']


In [230]:
df_new2_female

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,ambient_sound-loudness_dictionary,character_avg_loudness,ambient_avg_loudness,text_loudness_average,male_sound_dict,male_loudness_values,male_loudness_average,female_sound_dict,female_loudness_values,female_loudness_average
0,Anklam_Louise_Kindergeschichten_Der_Herr_Kaiser,7,Louise Anklam,Deutschland,female,125905726,Der Herr Kaiser,1898,Deutschland,4276,...,"{'tönte es von allen Seiten': 4.0, 'riefen die...",2.93,3.25,3.09,{'Endlich fragte mich der <Mann>Eine</Mann>': ...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]",3.000000,{},[],0.000000
1,Anklam_Louise_Kindergeschichten_Der_Weihnachts...,8,Louise Anklam,Deutschland,female,125905726,Der Weihnachtsabend,1898,Deutschland,4466,...,{'und alle lachten über den kleinen <Mann>Sche...,2.81,4.00,3.41,{'schrie der corpulente <Mann>Berthold</Mann>'...,"[4.0, 3.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.054688,{},[],0.000000
2,Anklam_Louise_Kindergeschichten_Der_wilde_Arno,9,Louise Anklam,Deutschland,female,125905726,Der wilde Arno,1898,Deutschland,4613,...,"{'und sagten': 3.0, 'Lachend und sich darüber ...",3.03,2.50,2.76,"{'sagte <Mann>Jochen</Mann>': 3.0, 'mit welche...","[3.0, 3.0, 3.0, 3.0]",3.000000,{},[],0.000000
3,Anklam_Louise_Kindergeschichten_Der_Wunderdoktor,10,Louise Anklam,Deutschland,female,125905726,Der Wunderdoktor,1898,Deutschland,5472,...,{'die sich untereinander neckten und fröhlich ...,3.06,3.75,3.41,"{'meinte ein junger <Mann>Mann</Mann>': 3.0, '...","[3.0, 3.0, 2.0, 3.0, 3.0, 3.0, 3.0]",2.857143,{'Sie sprach niemals über ihre <Frau>Herrin</F...,"[3.0, 3.0, 3.0, 4.0, 1.0, 2.0, 3.0, 3.0, 3.0, ...",2.900000
14,Behrens_Bertha_Alte_Liebe_und_anderes_Alte_Liebe,49,Bertha Behrens,Deutschland,female,104352787,Alte Liebe,1904,Deutschland,6666,...,"{'und die Damen sich verabschiedeten': 3.0, 'A...",2.85,1.91,2.38,"{'sagt <Mann>Schiller</Mann>': 3.0, 'Mein <Man...","[3.0, 3.0]",3.000000,{},[],0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
419,Wildermuth_Ottilie_Geschichten_aus_Schwaben_Ke...,1192,Ottilie Wildermuth,Deutschland,female,118632833,Keine Neigungsheirath,1854,Deutschland,4734,...,{'der schlechten Beleuchtung und mangelhaften ...,3.13,2.50,2.81,"{'sagte der kleine <Mann>Willner</Mann>': 3.0,...","[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]",3.000000,{},[],0.000000
420,Wildermuth_Ottilie_Krieg_und_Frieden,1193,Ottilie Wildermuth,Deutschland,female,118632833,Krieg und Frieden,1866,Deutschland,8470,...,{'sie nicht gar zu viel Lärm und Unordnung mac...,2.95,3.65,3.30,"{'antwortete der <Mann>Alexianer</Mann>': 3.0,...","[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.153846,"{'sagte die <Frau>Wärterin</Frau>': 3.0, 'Und ...","[3.0, 3.0]",3.000000
421,Wildermuth_Ottilie_Onkel_Gottliebs_Jugendliebe,1194,Ottilie Wildermuth,Deutschland,female,118632833,Onkel Gottliebs Jugendliebe,1860,Deutschland,9082,...,"{'Jubel begrüßt': 3.0, 'Da erscholl außen ein ...",2.86,2.32,2.59,{'Ich stürzte auf den <Mann>Sprecher</Mann> zu...,"[4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0]",3.125000,{'fragte <Frau>Frau</Frau> <Frau>Meier</Frau>'...,"[3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 0.0, 3.0, ...",2.909091
423,zu_Reventlow_Franziska_Das_graefliche_Milchges...,1223,Franziska zu Reventlow,Deutschland,female,118600044,Das gräfliche Milchgeschäft,1897,Deutschland,4027,...,"{'Endlich schlug es vier': 4.0, 'und ein schar...",3.00,3.31,3.16,{'und der <Mann>Stationsdiener</Mann> hat gesa...,"[3.0, 3.0, 3.0, 3.0, 0.0, 3.0, 3.0, 3.0, 3.0, ...",3.113636,{'Und meine <Frau>Mutter</Frau> hat gesagt': 3...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.000000


In [234]:
df_new2_male

Unnamed: 0,filename,ID_theme-d-Prose,author_used_name,author_nationality,author_gender,GND_number,text_title,text_used_date,text_fictional_space_decision,token_count,...,ambient_sound-loudness_dictionary,character_avg_loudness,ambient_avg_loudness,text_loudness_average,male_sound_dict,male_loudness_values,male_loudness_average,female_sound_dict,female_loudness_values,female_loudness_average
4,Annecke_Fritz_Der_verehrter_Herr,11,Fritz Anneke,Deutschland,male,118649523,Der verehrter Herr,1861,Deutschland,2739,...,{},3.17,0.00,1.58,"{'nennt sie der <Mann>Seefahrer</Mann>': 3.0, ...","[3.0, 3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 4.0, 3.0, ...",3.272727,{'fragte ich eine alte <Frau>Frau</Frau>': 3.0...,"[3.0, 3.0, 3.0, 3.0, 3.0]",3.000000
5,Anzengruber_Ludwig_Kalendergeschichten_Der_Ver...,16,Ludwig Anzengruber,Österreich,male,11850357X,Der Verschollene,1894,Österreich,8699,...,"{'und deuteten höhnisch gegen Himmel': 4.0, 'v...",3.02,2.96,2.99,{},[],0.000000,{},[],0.000000
6,Auerbach_Berthold_Drei_einzige_Toechter_Auf_Wache,25,Berthold Auerbach,Deutschland,male,11865103X,Auf Wache,1875,Österreich,6160,...,"{'Da rief ein <Mann>Officier</Mann>': 4.0, 'sp...",3.04,3.67,3.35,{'hatte dem <Mann>Alten</Mann> nochmals stumm ...,"[0.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 2.0]",2.666667,{'fragt plötzlich die <Frau>Mutter</Frau>': 3....,"[3.0, 4.0, 3.0, 3.0]",3.250000
7,Auerbach_Berthold_Hopfen_und_Gerste,30,Berthold Auerbach,Deutschland,male,11865103X,Hopfen und Gerste,1865,Deutschland,10721,...,{'aber der Laut erstickte ihm in der Kehle': 3...,2.88,2.96,2.92,{'so rief ich mit <Mann>Schiller</Mann> aus vo...,"[4.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 3.0, ...",3.187500,{},[],0.000000
8,Bahr_Hermann_Leander_Der_verstaendige_Herr,35,Hermann Bahr,Österreich,male,118505955,Der verständige Herr,1899,Deutschland/Elsass,2720,...,{'Da ward des Jubels und der Sänge und der Küs...,2.86,3.29,3.08,{'Der <Mann>Doctor</Mann> neckte sie dann wohl...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 3.0, ...",3.000000,{'erklärte er <Frau>Sophien</Frau> seine Neigu...,"[3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, ...",3.055556
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
401,Wassermann_Jakob_Drei_Erzaehlungen_Hilperich,1149,Jakob Wassermann,Deutschland,male,118629387,Hilperich,1900,Deutschland,8992,...,"{'Es gab ein großes Gelächter': 3.5, 'Wir stie...",2.96,3.80,3.38,{'Lachend brachte er dem bestürzten <Genderneu...,"[3.5, 3.0, 3.0, 3.0]",3.125000,{'<Frau>Frau</Frau> <Frau>v.</Frau> <Frau>Magn...,"[4.0, 4.0, 3.0, 4.0, 4.0, 2.0, 3.0, 4.0, 3.0, ...",2.882353
402,Wassermann_Jakob_Erzaehlungen_Adam_Urbas,1151,Jakob Wassermann,Deutschland,male,118629387,Adam Urbas,1901,Deutschland,10447,...,"{'äußerst knappen Worten vorgebracht, wurde pr...",2.83,2.52,2.67,{'sagte der junge <Mann>Mann</Mann> in befehle...,"[3.0, 3.0, 3.0, 4.0, 4.0, 3.0, 3.0, 4.0, 4.0, ...",3.312500,{},[],0.000000
403,Wassermann_Jakob_Erzaehlungen_Jost,1152,Jakob Wassermann,Deutschland,male,118629387,Jost,1901,Österreich,11025,...,{'und in der schwarzen Ebene ein klagendverkli...,2.99,3.32,3.16,{'rief der <Mann>Maurermeister</Mann> <Mann>Br...,"[4.0, 2.5, 0.0, 3.0, 4.0, 3.0, 3.0, 3.0, 2.0, ...",3.205882,{'so lange der <Mann>Herr</Mann> <Mann>Secreta...,"[0.0, 4.0, 4.0, 2.0, 4.0, 2.0, 2.0, 2.0, 2.0, ...",2.500000
404,Wehl_Feodor_Der_Diebstahl_aus_Liebe,1155,Feodor Wehl,Deutschland,male,117220264,Der Diebstahl aus Liebe,1855,Deutschland,8287,...,{'Daß die Musik weichen und empfindsamen Gemüt...,3.11,2.14,2.62,"{'sagte ein <Mann>Polizeiagent</Mann>': 3.0, '...","[3.0, 3.0, 3.0, 3.0, 4.0, 3.0, 3.0, 3.0, 3.0, ...",3.153846,{'und hält vor einem kleinen <Frau>Mädchen</Fr...,"[0.0, 4.0, 3.0]",2.333333


In [231]:
# Character Loudness sorted by female and male average character loudness but independent of authors

import plotly.graph_objects as go
import numpy as np

# Sort the DataFrame by 'text_used_date' column
df_female_sorted_by_publication_year = df_new2_female.sort_values(by='text_used_date')

# Calculate cosine smoothed loudness values for females
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_loudness_female = np.convolve(df_female_sorted_by_publication_year['female_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Calculate cosine smoothed loudness values for males
cosine_smoothed_loudness_male = np.convolve(df_female_sorted_by_publication_year['male_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for loudness for females
scatter_female = go.Scatter(x=df_female_sorted_by_publication_year['text_used_date'], 
                            y=df_female_sorted_by_publication_year['female_loudness_average'],
                            mode='markers',
                            marker=dict(color='violet'),
                            name='Average Loudness Level (Female Characters)',
                            text=df_female_sorted_by_publication_year['filename'],  # Hover text
                            hoverinfo='text')

# Create scatter plot for loudness for males
scatter_male = go.Scatter(x=df_female_sorted_by_publication_year['text_used_date'], 
                          y=df_female_sorted_by_publication_year['male_loudness_average'],
                          mode='markers',
                          marker=dict(color='green'),
                          name='Average Loudness Level (Male Characters)',
                          text=df_female_sorted_by_publication_year['filename'],  # Hover text
                          hoverinfo='text')

# Create line plot for cosine smoothed loudness for females
line_smoothed_female = go.Scatter(x=df_female_sorted_by_publication_year['text_used_date'], 
                                  y=cosine_smoothed_loudness_female,
                                  mode='lines',
                                  line=dict(color='darkviolet', width=2),
                                  name='Cosine Smoothed Loudness (Female Characters)')

# Create line plot for cosine smoothed loudness for males
line_smoothed_male = go.Scatter(x=df_female_sorted_by_publication_year['text_used_date'], 
                                y=cosine_smoothed_loudness_male,
                                mode='lines',
                                line=dict(color='darkgreen', width=2),
                                name='Cosine Smoothed Loudness (Male Characters)')

# Create layout
layout = go.Layout(#title='Character Representation by Female Authors',
                   xaxis=dict(title='Publication Year'),
                   yaxis=dict(title='Character Average Loudness Levels of Characters (F/M)', range=[0.5, 5]),
                   hovermode='closest',
                   legend=dict(x=0.5, y=1.2, orientation='h')) 

# Create figure
fig = go.Figure(data=[scatter_female, scatter_male, line_smoothed_female, line_smoothed_male], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_characters_smoothed_medium_res_female_authors2.png", scale=10)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_characters_smoothed_medium_res_female_authors2.html")


In [233]:
# Character Loudness sorted by female and male average character loudness but independent of authors

import plotly.graph_objects as go
import numpy as np

# Sort the DataFrame by 'text_used_date' column
df_male_sorted_by_publication_year = df_new2_male.sort_values(by='text_used_date')

# Calculate cosine smoothed loudness values for females
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_loudness_female = np.convolve(df_male_sorted_by_publication_year['female_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Calculate cosine smoothed loudness values for males
cosine_smoothed_loudness_male = np.convolve(df_male_sorted_by_publication_year['male_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for loudness for females
scatter_female = go.Scatter(x=df_male_sorted_by_publication_year['text_used_date'], 
                            y=df_male_sorted_by_publication_year['female_loudness_average'],
                            mode='markers',
                            marker=dict(color='violet'),
                            name='Average Loudness Level (Female Characters)',
                            text=df_male_sorted_by_publication_year['filename'],  # Hover text
                            hoverinfo='text')

# Create scatter plot for loudness for males
scatter_male = go.Scatter(x=df_male_sorted_by_publication_year['text_used_date'], 
                          y=df_male_sorted_by_publication_year['male_loudness_average'],
                          mode='markers',
                          marker=dict(color='green'),
                          name='Average Loudness Level (Male Characters)',
                          text=df_male_sorted_by_publication_year['filename'],  # Hover text
                          hoverinfo='text')

# Create line plot for cosine smoothed loudness for females
line_smoothed_female = go.Scatter(x=df_male_sorted_by_publication_year['text_used_date'], 
                                  y=cosine_smoothed_loudness_female,
                                  mode='lines',
                                  line=dict(color='darkviolet', width=2),
                                  name='Cosine Smoothed Loudness (Female Characters)')

# Create line plot for cosine smoothed loudness for males
line_smoothed_male = go.Scatter(x=df_male_sorted_by_publication_year['text_used_date'], 
                                y=cosine_smoothed_loudness_male,
                                mode='lines',
                                line=dict(color='darkgreen', width=2),
                                name='Cosine Smoothed Loudness (Male Characters)')

# Create layout
layout = go.Layout(#title='Character Representation by Male Authors',
                   xaxis=dict(title='Publication Year'),
                   yaxis=dict(title='Character Average Loudness Levels of Characters (F/M)', range=[0.5, 5]),
                   hovermode='closest',
                   legend=dict(x=0.5, y=1.2, orientation='h')) 

# Create figure
fig = go.Figure(data=[scatter_female, scatter_male, line_smoothed_female, line_smoothed_male], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_characters_smoothed_medium_res_male_authors.png", scale=10)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_characters_smoothed_medium_res_male_authors.html")


In [202]:
# Character Loudness sorted by female and male average character loudness but independent of authors

import plotly.graph_objects as go
import numpy as np

# Sort the DataFrame by 'text_used_date' column
df_male_sorted_by_publication_year = df_male.sort_values(by='text_used_date')

# Calculate cosine smoothed loudness values for females
window_size = 50  # Define the window size for cosine smoothing
cosine_smoothed_loudness_female = np.convolve(df_male_sorted_by_publication_year['female_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Calculate cosine smoothed loudness values for males
cosine_smoothed_loudness_male = np.convolve(df_male_sorted_by_publication_year['male_loudness_average'], np.ones(window_size)/window_size, mode='same')

# Create scatter plot for loudness for females
scatter_female = go.Scatter(x=df_male_sorted_by_publication_year['text_used_date'], 
                            y=df_male_sorted_by_publication_year['female_loudness_average'],
                            mode='markers',
                            marker=dict(color='violet'),
                            name='Average Loudness Level (Female Characters)',
                            text=df_male_sorted_by_publication_year['filename'],  # Hover text
                            hoverinfo='text')

# Create scatter plot for loudness for males
scatter_male = go.Scatter(x=df_male_sorted_by_publication_year['text_used_date'], 
                          y=df_male_sorted_by_publication_year['male_loudness_average'],
                          mode='markers',
                          marker=dict(color='green'),
                          name='Average Loudness Level (Male Characters)',
                          text=df_male_sorted_by_publication_year['filename'],  # Hover text
                          hoverinfo='text')

# Create line plot for cosine smoothed loudness for females
line_smoothed_female = go.Scatter(x=df_male_sorted_by_publication_year['text_used_date'], 
                                  y=cosine_smoothed_loudness_female,
                                  mode='lines',
                                  line=dict(color='darkviolet', width=2),
                                  name='Cosine Smoothed Loudness (Female Characters)')

# Create line plot for cosine smoothed loudness for males
line_smoothed_male = go.Scatter(x=df_male_sorted_by_publication_year['text_used_date'], 
                                y=cosine_smoothed_loudness_male,
                                mode='lines',
                                line=dict(color='darkgreen', width=2),
                                name='Cosine Smoothed Loudness (Male Characters)')

# Create layout
layout = go.Layout(#title='Character Representation by Male Authors',
                   xaxis=dict(title='Publication Year'),
                   yaxis=dict(title='Character Average Loudness Levels of Characters (F/M)', range=[0.5, 5]),
                   hovermode='closest',
                   legend=dict(x=0.5, y=1.2, orientation='h'))  # Set x and y coordinates for the legend

# Create figure
fig = go.Figure(data=[scatter_female, scatter_male, line_smoothed_female, line_smoothed_male], layout=layout)

# Show plot
fig.show()

# Export as PNG with high resolution for print
fig.write_image("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_characters_smoothed_medium_res_male_authors2.png", scale=10)

# Export as HTML
fig.write_html("/Users/sguhr/Desktop/Diss_notebooks/Diss_data_visualization/CharacterALoudness_theme-d-prose_female_male_characters_smoothed_medium_res_male_authors2.html")


In [ ]:
Stop ab hier alter Code 

In [0]:
# diese Zelle!

import os
import xml.etree.ElementTree as ET
import pandas as pd
from nltk.tokenize import word_tokenize
import re
import string
import numpy as np

def extract_sound_spans(xml_content):
    sound_spans = {'character_sound': {}, 'ambient_sound': {}}
    root = ET.fromstring(xml_content)

    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}  # Define the namespace

    for elem in root.iter():
        if elem.tag.endswith('character_sound') or elem.tag.endswith('ambient_sound'):
            sound_text = elem.text.strip() if elem.text else ""
            loudness_str = elem.attrib.get('loudness', np.nan)  # Assign NaN if 'loudness' attribute is not present
            loudness = float(loudness_str) if loudness_str != 'S' else np.nan  # Assign NaN if 'loudness' attribute is 'S'
            tokenized_text = word_tokenize(sound_text, language='german')  # Tokenize the text
            filtered_tokens = [token for token in tokenized_text if token not in string.punctuation]  # Filter out punctuation
            sound_spans[elem.tag.split('}')[1].rstrip('_')][tuple(filtered_tokens)] = loudness  # Tokenize the keys
    return sound_spans

def calculate_word_token_length(text):
    # Remove XML tags
    text_without_tags = re.sub(r'<[^>]+>', '', text)
    # Tokenize the text
    tokens = word_tokenize(text_without_tags, language='german')
    # Remove punctuation
    filtered_tokens = [token for token in tokens if token not in string.punctuation]
    # Calculate the word token length
    word_token_length = len(filtered_tokens)
    return word_token_length

def calculate_avg_loudness(sound_spans):
    loudness_values = [v for v in sound_spans.values() if not np.isnan(v)]
    return round(sum(loudness_values) / len(loudness_values), 2) if loudness_values else 0

def process_xml_file(filepath):
    character_sound_spans = {}
    ambient_sound_spans = {}
    word_token_length = 0
    character_loudness_count_all = 0
    ambient_loudness_count_all = 0
    character_se_count_without_nan = 0
    ambient_se_count_without_nan = 0
    with open(filepath, 'r', encoding='utf-8') as file:
        xml_content = file.read()
        sound_spans = extract_sound_spans(xml_content)
        character_sound_spans = sound_spans['character_sound']
        ambient_sound_spans = sound_spans['ambient_sound']
        word_token_length = calculate_word_token_length(xml_content)
        
        # Count the number of character_sound and ambient_sound elements with loudness attribute
        for elem in ET.fromstring(xml_content).iter():
            if elem.tag.endswith('character_sound'):
                character_loudness_count_all += 1
                if 'loudness' in elem.attrib and elem.attrib['loudness'] == 'S':
                    character_loudness_count_all -= 1
                if 'loudness' in elem.attrib and not np.isnan(float(elem.attrib['loudness'])):
                    character_se_count_without_nan += 1
            elif elem.tag.endswith('ambient_sound'):
                ambient_loudness_count_all += 1
                if 'loudness' in elem.attrib and elem.attrib['loudness'] == 'S':
                    ambient_loudness_count_all -= 1
                if 'loudness' in elem.attrib and not np.isnan(float(elem.attrib['loudness'])):
                    ambient_se_count_without_nan += 1
    
    # Calculate average character sound span including NaN
    character_avg_token_count_with_nan = round(sum(len(key) for key in character_sound_spans.keys()) / len(character_sound_spans) if character_sound_spans else 0, 2)

    # Calculate average ambient sound span including NaN
    ambient_avg_token_count_with_nan = round(sum(len(key) for key in ambient_sound_spans.keys()) / len(ambient_sound_spans) if ambient_sound_spans else 0, 2)

    # Calculate total average token count including NaN for both ambient and character sound
    total_avg_token_count_with_nan = round(((sum(len(key) for key in character_sound_spans.keys()) if character_sound_spans else 0) + 
                         (sum(len(key) for key in ambient_sound_spans.keys()) if ambient_sound_spans else 0)) / \
                        ((len(character_sound_spans) if character_sound_spans else 0) + (len(ambient_sound_spans) if ambient_sound_spans else 0)), 2)

    # Calculate average token count excluding NaN for character sound
    character_avg_token_count_without_nan = round(sum(len(key) for key in character_sound_spans.keys() if not np.isnan(character_sound_spans[key])) / len(character_sound_spans) if character_sound_spans else 0, 2)

    # Calculate average token count excluding NaN for ambient sound
    ambient_avg_token_count_without_nan = round(sum(len(key) for key in ambient_sound_spans.keys() if not np.isnan(ambient_sound_spans[key])) / len(ambient_sound_spans) if ambient_sound_spans else 0, 2)

    # Calculate total average token count excluding NaN for both ambient and character sound
    total_avg_token_count_without_nan = round(((sum(len(key) for key in character_sound_spans.keys() if not np.isnan(character_sound_spans[key])) if character_sound_spans else 0) + 
                     (sum(len(key) for key in ambient_sound_spans.keys() if not np.isnan(ambient_sound_spans[key])) if ambient_sound_spans else 0)) / \
                    ((len(character_sound_spans) if character_sound_spans else 0) + (len(ambient_sound_spans) if ambient_sound_spans else 0)), 2)

    # Calculate t_se_aver
    t_se_aver = round(word_token_length / total_avg_token_count_with_nan, 2) if total_avg_token_count_with_nan != 0 else 0

    # Calculate t_se_aver without nan
    t_se_aver_without_nan = round(word_token_length / total_avg_token_count_without_nan, 2) if total_avg_token_count_without_nan != 0 else 0

    return (character_sound_spans, ambient_sound_spans, word_token_length, character_loudness_count_all, ambient_loudness_count_all, t_se_aver, character_avg_token_count_with_nan, ambient_avg_token_count_with_nan, total_avg_token_count_with_nan, character_avg_token_count_without_nan, ambient_avg_token_count_without_nan, total_avg_token_count_without_nan, t_se_aver_without_nan, character_se_count_without_nan, ambient_se_count_without_nan, (character_se_count_without_nan + ambient_se_count_without_nan))


def process_folder(folder_path):
    data = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.xml'):
            filepath = os.path.join(folder_path, filename)
            (character_sound_spans, ambient_sound_spans, word_token_length, character_loudness_count_all, ambient_loudness_count_all, t_se_aver, character_avg_token_count_with_nan, ambient_avg_token_count_with_nan, total_avg_token_count_with_nan, character_avg_token_count_without_nan, ambient_avg_token_count_without_nan, total_avg_token_count_without_nan, t_se_aver_without_nan, character_se_count_without_nan, ambient_se_count_without_nan, total_se_count_without_nan) = process_xml_file(filepath)
            
            # Calculate the average loudness for character sound and ambient sound dictionaries
            character_avg_loudness = calculate_avg_loudness(character_sound_spans)
            ambient_avg_loudness = calculate_avg_loudness(ambient_sound_spans)
            
            # Calculate the text loudness average
            text_loudness_average = round((character_avg_loudness + ambient_avg_loudness) / 2, 2)
            
            data.append({'filename': filename,
                         'character_sound-loudness_dictionary': character_sound_spans,
                         'ambient_sound-loudness_dictionary': ambient_sound_spans,
                         'character_avg_loudness': character_avg_loudness,
                         'ambient_avg_loudness': ambient_avg_loudness,
                         'text_loudness_average': text_loudness_average})

    return pd.DataFrame(data)

# Call the modified function
df_newG = process_folder(folder_path)
print(df_newG)


In [0]:
# Save DataFrame to CSV
df_newG.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240511_theme-d-prose_subcorpus_shorts_df_all_numbers_new13.csv', index=False)

In [0]:
import os
import re
import pandas as pd

def extract_sound_spans(xml_content):
    sound_spans = {'character_sound': {}, 'ambient_sound': {}}

    # Define regex patterns for character_sound and ambient_sound elements
    character_pattern = r'<character_sound[^>]*>(.*?)<\/character_sound>'
    ambient_pattern = r'<ambient_sound[^>]*>(.*?)<\/ambient_sound>'

    # Extract character_sound spans
    character_matches = re.findall(character_pattern, xml_content, re.DOTALL)
    for match in character_matches:
        loudness_match = re.search(r'loudness="([^"]+)"', match)
        loudness = float(loudness_match.group(1)) if loudness_match else 0.0
        sound_spans['character_sound'][match.strip()] = loudness

    # Extract ambient_sound spans
    ambient_matches = re.findall(ambient_pattern, xml_content, re.DOTALL)
    for match in ambient_matches:
        loudness_match = re.search(r'loudness="([^"]+)"', match)
        loudness = float(loudness_match.group(1)) if loudness_match else 0.0
        sound_spans['ambient_sound'][match.strip()] = loudness

    return sound_spans

def process_xml_file(filepath):
    character_sound_spans = {}
    ambient_sound_spans = {}
    try:
        with open(filepath, 'r', encoding='utf-8') as file:
            xml_content = file.read()
        sound_spans = extract_sound_spans(xml_content)
        character_sound_spans = sound_spans['character_sound']
        ambient_sound_spans = sound_spans['ambient_sound']
    except Exception as e:
        print(f"Error processing XML file {filepath}: {e}")
    return character_sound_spans, ambient_sound_spans

def process_folder(folder_path):
    data = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.xml'):
            filepath = os.path.join(folder_path, filename)
            character_sound_spans, ambient_sound_spans = process_xml_file(filepath)
            data.append({'filename': filename,
                         'character_sound-loudness_dictionary': character_sound_spans,
                         'ambient_sound-loudness_dictionary': ambient_sound_spans})
    return pd.DataFrame(data)

#folder_path = 'your_folder_path_here'  # Update with your folder path
df_new2 = process_folder(folder_path)
print(df_new2)


In [0]:
# Save DataFrame to CSV
df_newG.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240511_theme-d-prose_subcorpus_shorts_df_all_numbers_new13.csv', index=False)

In [0]:
import os
import re
import pandas as pd

def extract_sound_spans(xml_content):
    sound_spans = {'character_sound': {}, 'ambient_sound': {}}

    # Define regex patterns for character_sound and ambient_sound elements
    character_pattern = r'<character_sound[^>]*>(.*?)<\/character_sound>'
    ambient_pattern = r'<ambient_sound[^>]*>(.*?)<\/ambient_sound>'

    # Extract character_sound spans
    character_matches = re.findall(character_pattern, xml_content, re.DOTALL)
    for match in character_matches:
        loudness_match = re.search(r'loudness="([^"]+)"', match)
        loudness = float(loudness_match.group(1)) if loudness_match else 0.0
        sound_spans['character_sound'][match.strip()] = loudness

    # Extract ambient_sound spans
    ambient_matches = re.findall(ambient_pattern, xml_content, re.DOTALL)
    for match in ambient_matches:
        loudness_match = re.search(r'loudness="([^"]+)"', match)
        loudness = float(loudness_match.group(1)) if loudness_match else 0.0
        sound_spans['ambient_sound'][match.strip()] = loudness

    return sound_spans

def process_xml_file(filepath):
    character_sound_spans = {}
    ambient_sound_spans = {}
    try:
        with open(filepath, 'r', encoding='utf-8') as file:
            xml_content = file.read()
        sound_spans = extract_sound_spans(xml_content)
        character_sound_spans = sound_spans['character_sound']
        ambient_sound_spans = sound_spans['ambient_sound']
    except Exception as e:
        print(f"Error processing XML file {filepath}: {e}")
    return character_sound_spans, ambient_sound_spans

def process_folder(folder_path):
    data = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.xml'):
            filepath = os.path.join(folder_path, filename)
            character_sound_spans, ambient_sound_spans = process_xml_file(filepath)
            data.append({'filename': filename,
                         'character_sound-loudness_dictionary': character_sound_spans})
    return pd.DataFrame(data)

#folder_path = 'your_folder_path_here'  # Update with your folder path
df = process_folder(folder_path)
print(df)


In [0]:
# Save DataFrame to CSV
df_newG.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240511_theme-d-prose_subcorpus_shorts_df_all_numbers_new14.csv', index=False)

In [162]:
import os
import xml.etree.ElementTree as ET
import pandas as pd
from nltk.tokenize import word_tokenize
import re
import string
import numpy as np

folder_path = '/Users/sguhr/Desktop/Diss_notebooks/gender_predicted_subcorpus_2-10000_predicted_LL'

def extract_sound_spans(xml_content):
    sound_spans = {'character_sound': {}, 'ambient_sound': {}}
    root = ET.fromstring(xml_content)

    ns = {'tei': 'http://www.tei-c.org/ns/1.0'}  # Define the namespace

    for elem in root.iter():
        if elem.tag.endswith('character_sound') or elem.tag.endswith('ambient_sound'):
            sound_text = elem.text.strip() if elem.text else ""
            loudness_str = elem.attrib.get('loudness', np.nan)  # Assign NaN if 'loudness' attribute is not present
            loudness = float(loudness_str) if loudness_str != 'S' else np.nan  # Assign NaN if 'loudness' attribute is 'S'
            tokenized_text = word_tokenize(sound_text, language='german')  # Tokenize the text
            filtered_tokens = [token for token in tokenized_text if token not in string.punctuation]  # Filter out punctuation
            sound_spans[elem.tag.split('}')[1].rstrip('_')][tuple(filtered_tokens)] = loudness  # Tokenize the keys
        elif elem.tag in ['Frau', 'Mann', 'Genderneutral']:  # Add this part for extracting sound spans with XML elements like 'Frau', 'Mann', 'Genderneutral'
            sound_text = elem.text.strip() if elem.text else ""
            loudness_str = elem.getparent().attrib.get('loudness', np.nan)  # Get the loudness value from the parent element
            loudness = float(loudness_str) if loudness_str != 'S' else np.nan  # Assign NaN if 'loudness' attribute is 'S'
            tokenized_text = word_tokenize(sound_text, language='german')  # Tokenize the text
            filtered_tokens = [token for token in tokenized_text if token not in string.punctuation]  # Filter out punctuation
            sound_spans['character_sound'][tuple(filtered_tokens)] = loudness  # Assuming they all belong to 'character_sound'
    return sound_spans

# Other functions remain the same

# Call the modified function
df_newG2 = process_folder(folder_path)
print(df_newG2)


                                              filename  \
0                       Raabe_Wilhelm_Der_gute_Tag.xml   
1    Wildermuth_Ottilie_Geschichten_aus_Schwaben_Da...   
2                          Ring_Max_Vom_alten_Heim.xml   
3              Viebig_Clara_Die_Cigarrenarbeiterin.xml   
4    Groller_Balduin_Detektiv_Dagobert_Der_Kassenei...   
..                                                 ...   
420  Wildermuth_Ottilie_Geschichten_aus_Schwaben_Da...   
421  Stifter_Adalbert_Leben_und_Haushalt_dreier_Wie...   
422  Berthold_Theodor_Lustige_Gymnasialgeschichten_...   
423  von_Liliencron_Detlev_Roggen_und_Weizen_Der_Di...   
424                    Oelschlaeger_Hermann_Klytia.xml   

                   character_sound-loudness_dictionary  \
0    {('verkündigte', 'er', 'von', 'Stockwerk', 'zu...   
1    {('und', 'wurde', 'bei', 'allen', 'Familienfes...   
2    {('Während', 'er', 'sprach'): 3.0, ('mit', 'de...   
3    {('sie', 'sprach', 'nicht'): 3.0, ('und', 'klo...   
4    {('und',

In [168]:
# Save DataFrame to CSV
df_newG2.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240514_theme-d-prose_subcorpus_shorts_df_all_numbers.csv', index=False)

In [169]:
# diese Zelle!

import os
import xml.etree.ElementTree as ET
import pandas as pd
import re
import numpy as np

def extract_sound_spans(xml_content):
    sound_spans = {'character_sound': {}, 'ambient_sound': {}}
    
    # Define the pattern for finding sound spans with loudness attributes
    pattern = r'<(character_sound|ambient_sound)[^>]*loudness="([^"]+)"[^>]*>(.*?)<\/\1>'
    
    # Find all matches in the XML content
    matches = re.findall(pattern, xml_content)
    
    for match in matches:
        sound_type = match[0]  # character_sound or ambient_sound
        loudness = float(match[1]) if match[1] != 'S' else np.nan  # Assign NaN if loudness attribute is 'S'
        text = match[2].strip()  # Get the text content
        
        sound_spans[sound_type][text] = loudness  # Add to the sound_spans dictionary
    
    return sound_spans

# Other functions remain the same

# Call the modified function
df_newG = process_folder(folder_path)
print(df_newG)


                                              filename  \
0                       Raabe_Wilhelm_Der_gute_Tag.xml   
1    Wildermuth_Ottilie_Geschichten_aus_Schwaben_Da...   
2                          Ring_Max_Vom_alten_Heim.xml   
3              Viebig_Clara_Die_Cigarrenarbeiterin.xml   
4    Groller_Balduin_Detektiv_Dagobert_Der_Kassenei...   
..                                                 ...   
420  Wildermuth_Ottilie_Geschichten_aus_Schwaben_Da...   
421  Stifter_Adalbert_Leben_und_Haushalt_dreier_Wie...   
422  Berthold_Theodor_Lustige_Gymnasialgeschichten_...   
423  von_Liliencron_Detlev_Roggen_und_Weizen_Der_Di...   
424                    Oelschlaeger_Hermann_Klytia.xml   

                   character_sound-loudness_dictionary  \
0    {'verkündigte er von Stockwerk zu Stockwerk': ...   
1    {'und wurde bei allen Familienfesten gebeten':...   
2    {'Während er sprach': 3.0, 'mit denen er sich ...   
3    {'sie sprach nicht': 3.0, 'und klopfte verlang...   
4    {'fragte

In [0]:
# Save DataFrame to CSV
df_newG.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240511_theme-d-prose_subcorpus_shorts_df_all_numbers_new16.csv', index=False)

In [216]:
df_newG

Unnamed: 0,filename,character_sound-loudness_dictionary,ambient_sound-loudness_dictionary,character_avg_token_count_without_nan,ambient_avg_token_count_without_nan,total_avg_token_count_without_nan,t_se_aver_without_nan,character_se_count_without_nan,ambient_se_count_without_nan,total_se_count_without_nan,character_avg_loudness,ambient_avg_loudness,text_loudness_average
0,Raabe_Wilhelm_Der_gute_Tag.xml,{'verkündigte er von Stockwerk zu Stockwerk': ...,"{'fragten': 3.0, 'und fügten hinzu': 3.0, 'Es ...",35.76,50.00,37.05,266.23,102,10,112,3.06,3.00,3.03
1,Wildermuth_Ottilie_Geschichten_aus_Schwaben_Da...,{'und wurde bei allen Familienfesten gebeten':...,{'Tagelang erschallte die ganze Straße von den...,45.85,87.00,49.97,55.69,27,3,30,3.13,3.57,3.35
2,Ring_Max_Vom_alten_Heim.xml,"{'Während er sprach': 3.0, 'mit denen er sich ...",{},44.83,0.00,44.83,59.34,35,0,35,3.14,0.00,1.57
3,Viebig_Clara_Die_Cigarrenarbeiterin.xml,"{'sie sprach nicht': 3.0, 'und klopfte verlang...",{'und wenn sie sprachen klangen die Stimmen be...,31.28,34.98,32.98,117.59,54,46,100,2.90,3.27,3.08
4,Groller_Balduin_Detektiv_Dagobert_Der_Kassenei...,{'fragte <Mann>Herr</Mann> <Mann>Grumbach</Man...,"{'Die Uhr schlug eben ein Viertel': 4.0, 'und ...",39.86,44.00,40.05,183.12,66,3,69,3.13,3.67,3.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...
420,Wildermuth_Ottilie_Geschichten_aus_Schwaben_Da...,{'Mit Lachen erzählte er seiner <Frau>Frau</Fr...,{'immer stiller und einfacher wurde das Tauffe...,44.21,46.79,44.80,148.93,48,14,62,2.62,2.32,2.47
421,Stifter_Adalbert_Leben_und_Haushalt_dreier_Wie...,"{'fragte er einen wildfremden, gesetzten, ältl...",{'das mit seinen Frontsäulen und dem ruhigen P...,40.95,45.95,43.39,180.00,21,20,41,2.76,3.44,3.10
422,Berthold_Theodor_Lustige_Gymnasialgeschichten_...,"{'wie er grollend sagt': 3.0, '<Mann>Hans</Man...","{'die Schule lachte': 4.0, 'flüsterten': 2.0, ...",42.97,31.88,41.00,86.90,37,8,45,3.35,3.75,3.55
423,von_Liliencron_Detlev_Roggen_und_Weizen_Der_Di...,"{'dann plötzlich hielt er wieder inne': 1.0, '...","{'Spielte die Musik': 4.0, 'Irgendwoher klang ...",35.28,40.43,36.07,81.76,39,7,46,2.84,3.79,3.31


In [0]:
# diese Zelle!

import pandas as pd

# Load the CSV file into a pandas DataFrame
df = pd.read_csv("/Users/sguhr/Desktop/Diss_notebooks/20240511_theme-d-prose_subcorpus_shorts_df_all_numbers_new16.csv")

# Display the DataFrame
print(df)


In [173]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
#df = pd.read_csv("20240511_theme-d-prose_subcorpus_shorts_df_all_numbers_new16.csv")

# Initialize a list to store the female sound dictionaries
female_sound_list = []

# Iterate over the rows of the DataFrame
for index, row in df.iterrows():
    # Extract the character_sound-loudness_dictionary
    char_sound_dict = eval(row['character_sound-loudness_dictionary'])
    
    # Initialize a new dictionary for female sounds
    female_sound_dict = {}
    
    # Iterate over the keys in the character_sound-loudness_dictionary
    for key, value in char_sound_dict.items():
        # Check if the key contains the regex match "<Frau>"
        if "<Frau>" in key:
            # Add the key-value pair to the female_sound_dict
            female_sound_dict[key] = value
            
    # Append the female_sound_dict to the female_sound_list
    female_sound_list.append(female_sound_dict)

# Add the female_sound_list as a new column "female_sound" to the DataFrame
df['female_sound'] = female_sound_list

# Display the DataFrame
print(df)


TypeError: eval() arg 1 must be a string, bytes or code object

In [174]:
import pandas as pd

# Load the CSV file into a pandas DataFrame
#df = pd.read_csv("20240511_theme-d-prose_subcorpus_shorts_df_all_numbers_new16.csv")

# Initialize a list to store the male sound dictionaries
male_sound_list = []

# Iterate over the rows of the DataFrame
for index, row in df.iterrows():
    # Extract the character_sound-loudness_dictionary
    char_sound_dict = eval(row['character_sound-loudness_dictionary'])
    
    # Initialize a new dictionary for male sounds
    male_sound_dict = {}
    
    # Iterate over the keys in the character_sound-loudness_dictionary
    for key, value in char_sound_dict.items():
        # Check if the key contains the regex match "<Mann>"
        if "<Mann>" in key:
            # Add the key-value pair to the male_sound_dict
            male_sound_dict[key] = value
            
    # Append the male_sound_dict to the male_sound_list
    male_sound_list.append(male_sound_dict)

# Add the male_sound_list as a new column "male_sound_dict" to the DataFrame
df['male_sound'] = male_sound_list

# Display the DataFrame
print(df)


TypeError: eval() arg 1 must be a string, bytes or code object

In [None]:
import pandas as pd



# Initialize a list to store the male sound dictionaries
male_sound_list = []

# Initialize a list to store the male loudness values
male_loudness_values_list = []

# Iterate over the rows of the DataFrame
for index, row in df.iterrows():
    # Extract the character_sound-loudness_dictionary
    char_sound_dict = eval(row['character_sound-loudness_dictionary'])
    
    # Initialize a new dictionary for male sounds
    male_sound_dict = {}
    
    # Iterate over the keys in the character_sound-loudness_dictionary
    for key, value in char_sound_dict.items():
        # Check if the key contains the regex match "<Mann>"
        if "<Mann>" in key:
            # Add the key-value pair to the male_sound_dict
            male_sound_dict[key] = value
            
    # Append the male_sound_dict to the male_sound_list
    male_sound_list.append(male_sound_dict)
    
    # Extract the values from the male_sound_dict
    male_loudness_values = list(male_sound_dict.values())
    
    # Append the male_loudness_values to the male_loudness_values_list
    male_loudness_values_list.append(male_loudness_values)

# Add the male_sound_list as a new column "male_sound_dict" to the DataFrame
df['male_sound'] = male_sound_list

# Add the male_loudness_values_list as a new column "male_loudness_values" to the DataFrame
df['male_loudness_values'] = male_loudness_values_list

# Display the DataFrame
print(df)


In [171]:
# diese Zelle!

import pandas as pd



# Initialize a list to store the male sound dictionaries
male_sound_list = []

# Initialize a list to store the male loudness values
male_loudness_values_list = []

# Iterate over the rows of the DataFrame
for index, row in df.iterrows():
    # Extract the character_sound-loudness_dictionary
    char_sound_dict = eval(row['character_sound-loudness_dictionary'])
    
    # Initialize a new dictionary for male sounds
    male_sound_dict = {}
    
    # Iterate over the keys in the character_sound-loudness_dictionary
    for key, value in char_sound_dict.items():
        # Check if the key contains the regex match "<Mann>"
        if "<Mann>" in key:
            # Add the key-value pair to the male_sound_dict
            male_sound_dict[key] = value
            
    # Append the male_sound_dict to the male_sound_list
    male_sound_list.append(male_sound_dict)
    
    # Extract the values from the male_sound_dict
    male_loudness_values = list(male_sound_dict.values())
    
    # Append the male_loudness_values to the male_loudness_values_list
    male_loudness_values_list.append(male_loudness_values)

# Add the male_sound_list as a new column "male_sound_dict" to the DataFrame
df['male_sound_dict'] = male_sound_list

# Add the male_loudness_values_list as a new column "male_loudness_values" to the DataFrame
df['male_loudness_values'] = male_loudness_values_list

# Calculate the average of the male loudness values
df['male_loudness_average'] = df['male_loudness_values'].apply(lambda x: sum(x) / len(x) if len(x) > 0 else 0)

# Display the DataFrame
print(df)


TypeError: eval() arg 1 must be a string, bytes or code object

In [172]:
# diese Zelle!


import pandas as pd



# Initialize a list to store the female sound dictionaries
female_sound_list = []

# Initialize a list to store the female loudness values
female_loudness_values_list = []

# Iterate over the rows of the DataFrame
for index, row in df.iterrows():
    # Extract the character_sound-loudness_dictionary
    char_sound_dict = eval(row['character_sound-loudness_dictionary'])
    
    # Initialize a new dictionary for female sounds
    female_sound_dict = {}
    
    # Iterate over the keys in the character_sound-loudness_dictionary
    for key, value in char_sound_dict.items():
        # Check if the key contains the regex match "<Frau>"
        if "<Frau>" in key:
            # Add the key-value pair to the female_sound_dict
            female_sound_dict[key] = value
            
    # Append the female_sound_dict to the female_sound_list
    female_sound_list.append(female_sound_dict)
    
    # Extract the values from the female_sound_dict
    female_loudness_values = list(female_sound_dict.values())
    
    # Append the female_loudness_values to the female_loudness_values_list
    female_loudness_values_list.append(female_loudness_values)

# Add the female_sound_list as a new column "female_sound_dict" to the DataFrame
df['female_sound_dict'] = female_sound_list

# Add the female_loudness_values_list as a new column "female_loudness_values" to the DataFrame
df['female_loudness_values'] = female_loudness_values_list

# Calculate the average of the female loudness values
df['female_loudness_average'] = df['female_loudness_values'].apply(lambda x: sum(x) / len(x) if len(x) > 0 else 0)

# Display the DataFrame
print(df)


TypeError: eval() arg 1 must be a string, bytes or code object

In [None]:
# Save DataFrame to CSV
df.to_csv('/Users/sguhr/Desktop/Diss_notebooks/20240511_theme-d-prose_subcorpus_shorts_df_Gender.csv', index=False)