This notebook lets you reproduce the analysis steps performed for our study *They crossed the valley of Catamarca: A study of narrative space in novel openings*. The analysis is based on the annotation data stored as a [CATMA](https://www.catma.de) project in `/CATMA_4AA4ADC0-4C28-54F9-B6A1-5DCEFF34B90B_DH2025_CANSpiN` and as `.tsv` data in `/canspin-deu-19`, `/canspin-deu-20`, `/canspin-lat-19`, and `/canspin-spa-19`.

If you wish to see the analysis results only, it is not necessary to execute this notebook. In this case, see our paper and the content of the `/results` folder.

To use the notebook, install the [gitma-canspin package (v1.6.5)](https://github.com/CANSpiNproject/gitma-canspin/tree/v1.6.5) first, following the instructions of its README.

All files produced with this notebook will be saved in the new folder `/perform_analysis_output`.

## initialization

In [1]:
# imports
from gitma_canspin.canspin import AnnotationAnalyzer

import pandas as pd
import plotly
import plotly.express  # load the submodules used below explicitly
import plotly.graph_objects
import math
import json
import os
import itertools
import shutil

from typing import Tuple, List, Union, Literal

In [2]:
# create the /perform_analysis_output folder and, if necessary, copy the annotation distribution .png files from /results into it;
# those files were created manually with the help of the plotly chart export function of the annotation distribution .html files
# created in step 4 of the steps section below
annotation_distribution_png_filepaths_in_results, annotation_distribution_png_filepaths_in_perform_analysis_output = [
    [
        os.path.join('results', 'visualizations', filename) for filename in [
            'annotation_distribution__CANSpiN-deu-19_001_1-1-1.png',
            'annotation_distribution__CANSpiN-deu-19_030_1-1-1.png',
            'annotation_distribution__CANSpiN-lat-19_004_1.png',
            'annotation_distribution__CANSpiN-lat-19_041_1.png',
            'annotation_distribution__CANSpiN-spa-19_001_1.png',
            'annotation_distribution__CANSpiN-spa-19_008_1.png'
        ]
    ],
    [
        os.path.join('perform_analysis_output', filename) for filename in [
            'annotation_distribution__CANSpiN-deu-19_001_1-1-1.png',
            'annotation_distribution__CANSpiN-deu-19_030_1-1-1.png',
            'annotation_distribution__CANSpiN-lat-19_004_1.png',
            'annotation_distribution__CANSpiN-lat-19_041_1.png',
            'annotation_distribution__CANSpiN-spa-19_001_1.png',
            'annotation_distribution__CANSpiN-spa-19_008_1.png'
        ]
    ]
]

if not (os.path.isdir('perform_analysis_output')):
    os.makedirs('perform_analysis_output')

for index, filepath in enumerate(annotation_distribution_png_filepaths_in_perform_analysis_output):
    if not (os.path.isfile(filepath)):
        shutil.copy(annotation_distribution_png_filepaths_in_results[index], 'perform_analysis_output')

In [3]:
# load the analyzer with the catma project from the CATMA_4AA4ADC0-4C28-54F9-B6A1-5DCEFF34B90B_DH2025_CANSpiN folder
analyzer = AnnotationAnalyzer(init_settings={'project_name': 'CATMA_4AA4ADC0-4C28-54F9-B6A1-5DCEFF34B90B_DH2025_CANSpiN'})

gitma_canspin.project - INFO - Loading tagsets ...
gitma_canspin.project - INFO - 	Found 3 tagset(s).
gitma_canspin.project - INFO - Loading documents ...
gitma_canspin.project - INFO - 	Found 6 document(s).
gitma_canspin.project - INFO - Loading annotation collections ...
gitma_canspin.project - INFO - 	Found 6 annotation collection(s).
gitma_canspin.project - INFO - 	Annotation collection "Collection CS1 v1.1.0 - Gold Standard" for document "El pozo del Yocci"
gitma_canspin.project - INFO - 		Annotations: 201
gitma_canspin.project - INFO - 	Annotation collection "CS1 v1.1.0 - Nils (Gold: 1)" for document "El Señor de Bembibre"
gitma_canspin.project - INFO - 		Annotations: 285
gitma_canspin.project - INFO - 	Annotation collection "CS1 v1.1.0 - Ulrike (Gold standard: 1)" for document "CANSpiN-spa-19-008"
gitma_canspin.project - INFO - 		Annotations: 1332
gitma_canspin.project - INFO - 	Annotation collection "Nils -- CS1 1.1.0 (Gold: 1-1-1)" for document "DEU-19_001"
gitma_canspin.proje

In [4]:
# display loaded tsv annotations from corpus folders
analyzer.print_tsv_annotations_overview()

gitma_canspin.canspin - INFO - tsv files found in canspin project!



overview:
- schema "cs1"
	CANSpiN-deu-19_001_1-1-1.tsv
	CANSpiN-deu-19_030_1-1-1.tsv
	CANSpiN-deu-20_002_1_shuffled.tsv
	CANSpiN-deu-20_021_1_shuffled.tsv
	CANSpiN-lat-19_004_1.tsv
	CANSpiN-lat-19_041_1.tsv
	CANSpiN-spa-19_001_1.tsv
	CANSpiN-spa-19_008_1.tsv


In [5]:
# display loaded annotation collections from catma project
analyzer.print_projects_annotation_collection_list()


Annotation collection list:
index	collection_name	text_title
0	Collection CS1 v1.1.0 - Gold Standard	El pozo del Yocci
1	CS1 v1.1.0 - Nils (Gold: 1)	El Señor de Bembibre
2	CS1 v1.1.0 - Ulrike (Gold standard: 1)	CANSpiN-spa-19-008
3	Nils -- CS1 1.1.0 (Gold: 1-1-1)	DEU-19_001
4	Nils -- CS1 V.1.1.0 (Gold:1-1-1)	DEU-19_030
5	Collection CS1 v1.1.0 - Gold Standard	El falso Inca


## steps
Perform all steps in the specified order to obtain a correct result.

In each step, the result can either be displayed (in a notebook cell or, in the image-based steps 3 and 5, as a temporary image file) or saved to files in the `perform_analysis_output` folder. Use the following parameter cell to control the output behaviour (`'show'` or `'save'`).

Apart from that, steps 1 to 5 can be executed one after another without any further settings.

- [1. get CS1 annotation statistics (as dict / JSON)](#get-CS1-annotation-statistics-as-JSON-file)
- [2. get bar charts with annotation amounts of all chapters normalized to their token amount (as HTML)](#get-bar-charts-with-annotation-amounts-of-all-chapters-normalized-to-their-token-amount)
- [3. get first character event overview (as PNG file)](#get-first-character-event-overview)
- [4. get annotation distribution of each chapter (as HTML)](#get-annotation-distribution-of-each-chapter)
- [5. get first character event CS1 relations (as PNG file)](#get-first-character-event-cs1-relations)
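
The show/save switch that every step applies can be sketched as a small dispatcher. This is only an illustration: `dispatch_output` and its callbacks are hypothetical names and are not part of this notebook or of gitma-canspin, which repeat the `if`/`elif`/`else` in each step instead.

```python
from typing import Callable

def dispatch_output(mode: str, show: Callable[[], str], save: Callable[[], str]) -> str:
    """Run the callback matching `mode`; report an invalid mode otherwise."""
    # hypothetical helper, mirroring the if/elif/else used in each step
    if mode == 'show':
        return show()
    if mode == 'save':
        return save()
    return 'no valid output behaviour defined: nothing displayed or saved'

# toy callbacks standing in for e.g. figure.show() and figure.write_html(...)
shown = dispatch_output('show', lambda: 'shown in cell', lambda: 'saved to file')
saved = dispatch_output('save', lambda: 'shown in cell', lambda: 'saved to file')
invalid = dispatch_output('print', lambda: 'shown in cell', lambda: 'saved to file')
print(shown, saved, invalid, sep='\n')
```

Any value other than `'show'` or `'save'` falls through to the message branch, which matches the behaviour of the step cells below.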

In [6]:
# define behaviour in steps: show in cell or save to file
get_cs1_annotation_statistics_output: Literal['save', 'show'] = 'save'
get_bar_charts_with_annotation_amounts_of_all_chapters_output: Literal['save', 'show'] = 'save'
get_first_character_event_overview_output: Literal['save', 'show'] = 'save'
get_annotation_distribution_of_each_chapter_output: Literal['save', 'show'] = 'save'
get_first_character_event_cs1_relations_output: Literal['save', 'show'] = 'save'

In [7]:
# further settings

# additional token selection for steps 1 and 2 (in addition to the mandatory complete token selection)
token_selection: Tuple[int, int] = (0, 1000)

#### get CS1 annotation statistics as JSON file

In [8]:
# get the annotation statistics from the tsv files (for the entire chapters and for the first tokens respectively)
# and translate the class names into English

results: dict = {
    'whole_chapters': analyzer.get_corpus_annotation_statistics(),
    f'first_{token_selection[1] - token_selection[0]}_token': analyzer.get_corpus_annotation_statistics({
        'calculations': {
            'amount_of_annotations': True,
            'amount_of_annotations_by_class': True,
            'amount_of_token': True,
            'amount_of_annotated_token': True,
            'amount_of_annotated_token_by_class': True,
            'ratios': True,
            'word_lists_by_class': True
        },
        'custom_grouping': None,
        'text_borders': token_selection
    })
}

key_translation = {
    'Ort-Container': 'Place-Container',
    'Ort-Container-BK': 'Place-Container-MC',
    'Ort-Objekt': 'Place-Object',
    'Ort-Objekt-BK': 'Place-Object-MC',
    'Ort-Abstrakt': 'Place-Abstract',
    'Ort-Abstrakt-BK': 'Place-Abstract-MC',
    'Ort-ALT': 'Place-ALT',
    'Bewegung-Subjekt': 'Movement-Subject',
    'Bewegung-Objekt': 'Movement-Object',
    'Bewegung-Licht': 'Movement-Light',
    'Bewegung-Schall': 'Movement-Sound',
    'Bewegung-Geruch': 'Movement-Smell',
    'Bewegung-ALT': 'Movement-ALT',
    'Dimensionierung-Groesse': 'Dimensioning-Size',
    'Dimensionierung-Abstand': 'Dimensioning-Distance',
    'Dimensionierung-Menge': 'Dimensioning-Amount',
    'Dimensionierung-ALT': 'Dimensioning-ALT',
    'Positionierung': 'Positioning',
    'Positionierung-ALT': 'Positioning-ALT',
    'Richtung': 'Direction',
    'Richtung-ALT': 'Direction-ALT',
    'Lugar-Contenedor': 'Place-Container',
    'Lugar-Contenedor-CM': 'Place-Container-MC',
    'Lugar-Objeto': 'Place-Object',
    'Lugar-Objeto-CM': 'Place-Object-MC',
    'Lugar-Abstracto': 'Place-Abstract',
    'Lugar-Abstracto-CM': 'Place-Abstract-MC',
    'Lugar-ALT': 'Place-ALT',
    'Movimiento-Sujeto': 'Movement-Subject',
    'Movimiento-Objeto': 'Movement-Object',
    'Movimiento-Luz': 'Movement-Light',
    'Movimiento-Sonido': 'Movement-Sound',
    'Movimiento-Olfato': 'Movement-Smell',
    'Movimiento-ALT': 'Movement-ALT',
    'Dimensionamiento-Tamaño': 'Dimensioning-Size',
    'Dimensionamiento-Distancia': 'Dimensioning-Distance',
    'Dimensionamiento-Cantitad': 'Dimensioning-Amount',
    'Dimensionamiento-ALT': 'Dimensioning-ALT',
    'Posicionamiento': 'Positioning',
    'Posicionamiento-ALT': 'Positioning-ALT',
    'Dirección': 'Direction',
    'Dirección-ALT': 'Direction-ALT'
}

def merge_word_list_by_class_dicts(first: dict, second: dict) -> dict:
    # work on a copy so that the input dict is not modified as a side effect
    result = dict(first)
    for token, token_amount in second.items():
        result[token] = result.get(token, 0) + token_amount
    # sort tokens by their amount, largest first
    return dict(sorted(result.items(), key=lambda x: int(x[1]), reverse=True))

def translate_dict(input: dict, translation: dict) -> dict:
    translated = {}
    if len([key for key in input if key in translation]) == len(translation):
        # for translating schema totals per class with mixed German and Spanish classes...
        first_value = input[next(iter(translation))]
        if isinstance(first_value, int):
            # ...in case of class instances amounts
            for key, value in input.items():
                translated_key: str = translation[key]
                translated[translated_key] = value if not translated.get(translated_key) else translated[translated_key] + value
            translated = dict(sorted(translated.items(), key=lambda x: int(x[1]), reverse=True))
        elif isinstance(first_value, dict):
            # ...in case of word lists
            for key, value in input.items():
                translated_key: str = translation[key]
                translated[translated_key] = value if not translated.get(translated_key) else merge_word_list_by_class_dicts(translated[translated_key], value)
    else:
        # for translating everything else: keys without a translation are kept unchanged
        translated = {translation.get(k, k): v for k, v in input.items()}
    # translate nested dicts recursively
    for key, value in translated.items():
        if isinstance(value, dict):
            translated[key] = translate_dict(value, translation)
    return translated

for result_type in results:
    results[result_type] = translate_dict(input=results[result_type], translation=key_translation)
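
Because the German and Spanish class names map to the same English labels, translating a dict that mixes both schemas has to add up the counts of colliding keys. A minimal sketch of that merge with made-up counts follows; `translate_and_sum` and the numbers are illustrative assumptions, not the notebook's `translate_dict`:

```python
from collections import Counter
from typing import Dict

def translate_and_sum(counts: Dict[str, int], translation: Dict[str, str]) -> Dict[str, int]:
    """Translate keys and sum the values of keys that collapse to the same label."""
    merged: Counter = Counter()
    for key, value in counts.items():
        # keys without a translation are kept unchanged
        merged[translation.get(key, key)] += value
    # sort by amount, largest first
    return dict(sorted(merged.items(), key=lambda item: item[1], reverse=True))

# invented example data
toy_translation = {
    'Ort-Container': 'Place-Container',
    'Lugar-Contenedor': 'Place-Container',
    'Richtung': 'Direction',
}
toy_counts = {'Ort-Container': 3, 'Lugar-Contenedor': 4, 'Richtung': 2}
summed = translate_and_sum(toy_counts, toy_translation)
print(summed)
```

Here the German and the Spanish container class end up in one `Place-Container` entry whose value is the sum of both counts.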


In [9]:
# print annotation statistics or save it to files

if get_cs1_annotation_statistics_output == 'show':
    for result_type in results:
        print(
            json.dumps(
                results[result_type],
                indent=2, 
                sort_keys=False, 
                ensure_ascii=False
            )
        )
elif get_cs1_annotation_statistics_output == 'save':
    for result_type in results:
        filename: str = f'annotation_statistics__{result_type}.json'
        filepath: str = os.path.join('perform_analysis_output', filename)
        json_file_str: str = json.dumps(results[result_type], indent=2, sort_keys=False, ensure_ascii=False)

        if not (os.path.isdir('perform_analysis_output')):
            os.makedirs('perform_analysis_output')

        if (os.path.isfile(filepath)):
            print(f'JSON file {filepath} already exists and will be overwritten.')

        with open(filepath, 'w+', encoding='utf-8') as file:
            file.write(json_file_str)
            print(f'JSON file {filepath} successfully created.')
else:
    print('#### No valid output behaviour is defined for "get CS1 annotation statistics": No data is displayed or saved. ####')

JSON file perform_analysis_output\annotation_statistics__whole_chapters.json successfully created.
JSON file perform_analysis_output\annotation_statistics__first_1000_token.json successfully created.


#### get bar charts with annotation amounts of all chapters normalized to their token amount

In [10]:
# for annotation data with all tokens;
# you may manipulate and save the plotly diagram as png image file by using the interface of the .html output

data_dict: dict = {
    'text': list(itertools.chain(*[[item] * 21 for item in [
        'DEU-19_030', 
        'DEU-19_001', 
        'DEU-20_002', 
        'DEU-20_021', 
        'SPA-19_001', 
        'SPA-19_008', 
        'LAT-19_004', 
        'LAT-19_041'
    ]])),
    'annotation_class': [
        'Place-Container', 
        'Place-Container-MC', 
        'Place-Object', 
        'Place-Object-MC', 
        'Place-Abstract', 
        'Place-Abstract-MC', 
        'Place-ALT',
        'Movement-Subject',
        'Movement-Object',
        'Movement-Light',
        'Movement-Sound',
        'Movement-Smell',
        'Movement-ALT',
        'Dimensioning-Size',
        'Dimensioning-Distance',
        'Dimensioning-Amount',
        'Dimensioning-ALT',
        'Positioning',
        'Positioning-ALT',
        'Direction',
        'Direction-ALT'
    ] * 8,
    'amount': []
}
data_dict['amount'] = list(itertools.chain(*[[results['whole_chapters']['ratios']['cs1'][corpus_file[0]][corpus_file[1]]['annotations_by_class_in_file:total_token_amount_in_file'].get(annotation_class, 0) * 100] for corpus_file in [
    ('canspin-deu-19', 'CANSpiN-deu-19_030_1-1-1.tsv'), 
    ('canspin-deu-19', 'CANSpiN-deu-19_001_1-1-1.tsv'),
    ('canspin-deu-20', 'CANSpiN-deu-20_002_1_shuffled.tsv'),
    ('canspin-deu-20', 'CANSpiN-deu-20_021_1_shuffled.tsv'),
    ('canspin-spa-19', 'CANSpiN-spa-19_001_1.tsv'),
    ('canspin-spa-19', 'CANSpiN-spa-19_008_1.tsv'),
    ('canspin-lat-19', 'CANSpiN-lat-19_004_1.tsv'),
    ('canspin-lat-19', 'CANSpiN-lat-19_041_1.tsv')
] for annotation_class in data_dict['annotation_class'][:21]]))

data: pd.DataFrame = pd.DataFrame(data_dict)

figure: plotly.graph_objects.Figure = plotly.express.bar(
    data_frame=data,
    x='text',
    y='amount',
    color='annotation_class',
    labels={
        "text": "texts",
        "amount": "annotation amount (in %)",
        "annotation_class": "annotation classes"
    },
    title='CS1 annotation amounts inside the initial chapters with all tokens <br><sub>(in percentage of the total token amount of the respective chapter)</sub>',
    color_discrete_map={
        'Place-Container': '#B6D3FF',
        'Place-Container-MC': '#CCDEFF',
        'Place-Object': '#D4EAFF',
        'Place-Object-MC': '#E6F2FF',
        'Place-Abstract': '#89A8F6',
        'Place-Abstract-MC': '#98C3FA',
        'Place-ALT': '#90A6C7',
        'Movement-Subject': '#FF6D6D',
        'Movement-Object': '#F60D00',
        'Movement-Sound': '#FF4949',
        'Movement-Light': '#CA0B0B',
        'Movement-Smell': '#B60000',
        'Movement-ALT': '#960000',
        'Direction': '#92FFBD',
        'Direction-ALT': '#75CC96',
        'Positioning': '#DB8300',
        'Positioning-ALT': '#B56A01',
        'Dimensioning-Distance': '#8AB6AD',
        'Dimensioning-Size': '#7CD3C0',
        'Dimensioning-Amount': '#7EF5D9',
        'Dimensioning-ALT': '#60847B'
    },
    width=1400,
    height=800
)

figure.update_layout(font={'size': 18})

if get_bar_charts_with_annotation_amounts_of_all_chapters_output == 'show':
    figure.show()
elif get_bar_charts_with_annotation_amounts_of_all_chapters_output == 'save':
    if not (os.path.isdir('perform_analysis_output')):
        os.makedirs('perform_analysis_output')
    figure.write_html(os.path.join('perform_analysis_output', 'cs1_annotation_amounts__all_tokens.html'))
else:
    print('#### No valid output behaviour is defined for "get bar charts with annotation amounts of all chapters": No data is displayed or saved. ####')
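
The column layout used above (each text label repeated once per class, the class list repeated once per text) is the cross product of texts and classes. As a sketch with shortened, illustrative lists, the same ordering can be produced with `itertools.product`:

```python
import itertools

texts = ['DEU-19_030', 'DEU-19_001']                            # shortened for illustration
classes = ['Place-Container', 'Movement-Subject', 'Direction']  # shortened for illustration

# product() varies the last argument fastest: each text is paired with every class in turn
pairs = list(itertools.product(texts, classes))
long_format = {
    'text': [text for text, _ in pairs],
    'annotation_class': [annotation_class for _, annotation_class in pairs],
}
print(long_format['text'])
print(long_format['annotation_class'])
```

Building both columns from one list of pairs keeps them aligned by construction, so the hard-coded multipliers (`21` and `8` in the cell above) cannot drift apart when a class or a text is added.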


In [11]:
# for annotation data with a user defined amount of tokens;
# you may manipulate and save the plotly diagram as png image file by using the interface of the .html output

data_dict: dict = {
    'text': list(itertools.chain(*[[item] * 21 for item in [
        'DEU-19_030', 
        'DEU-19_001', 
        'DEU-20_002', 
        'DEU-20_021', 
        'SPA-19_001', 
        'SPA-19_008', 
        'LAT-19_004', 
        'LAT-19_041'
    ]])),
    'annotation_class': [
        'Place-Container', 
        'Place-Container-MC', 
        'Place-Object', 
        'Place-Object-MC', 
        'Place-Abstract', 
        'Place-Abstract-MC', 
        'Place-ALT',
        'Movement-Subject',
        'Movement-Object',
        'Movement-Light',
        'Movement-Sound',
        'Movement-Smell',
        'Movement-ALT',
        'Dimensioning-Size',
        'Dimensioning-Distance',
        'Dimensioning-Amount',
        'Dimensioning-ALT',
        'Positioning',
        'Positioning-ALT',
        'Direction',
        'Direction-ALT'
    ] * 8,
    'amount': []
}
data_dict['amount'] = list(itertools.chain(*[[results[f'first_{token_selection[1] - token_selection[0]}_token']['ratios']['cs1'][corpus_file[0]][corpus_file[1]]['annotations_by_class_in_file:total_token_amount_in_file'].get(annotation_class, 0) * 100] for corpus_file in [
    ('canspin-deu-19', 'CANSpiN-deu-19_030_1-1-1.tsv'), 
    ('canspin-deu-19', 'CANSpiN-deu-19_001_1-1-1.tsv'),
    ('canspin-deu-20', 'CANSpiN-deu-20_002_1_shuffled.tsv'),
    ('canspin-deu-20', 'CANSpiN-deu-20_021_1_shuffled.tsv'),
    ('canspin-spa-19', 'CANSpiN-spa-19_001_1.tsv'),
    ('canspin-spa-19', 'CANSpiN-spa-19_008_1.tsv'),
    ('canspin-lat-19', 'CANSpiN-lat-19_004_1.tsv'),
    ('canspin-lat-19', 'CANSpiN-lat-19_041_1.tsv')
] for annotation_class in data_dict['annotation_class'][:21]]))

data: pd.DataFrame = pd.DataFrame(data_dict)

figure: plotly.graph_objects.Figure = plotly.express.bar(
    data_frame=data,
    x='text',
    y='amount',
    color='annotation_class',
    labels={
        "text": "texts",
        "amount": "annotation amount (in %)",
        "annotation_class": "annotation classes"
    },
    title=f'CS1 annotation amounts inside the initial chapters with {token_selection[1] - token_selection[0]} tokens <br><sub>(in percentage of the selected token amount of the respective chapter)</sub>',
    color_discrete_map={
        'Place-Container': '#B6D3FF',
        'Place-Container-MC': '#CCDEFF',
        'Place-Object': '#D4EAFF',
        'Place-Object-MC': '#E6F2FF',
        'Place-Abstract': '#89A8F6',
        'Place-Abstract-MC': '#98C3FA',
        'Place-ALT': '#90A6C7',
        'Movement-Subject': '#FF6D6D',
        'Movement-Object': '#F60D00',
        'Movement-Sound': '#FF4949',
        'Movement-Light': '#CA0B0B',
        'Movement-Smell': '#B60000',
        'Movement-ALT': '#960000',
        'Direction': '#92FFBD',
        'Direction-ALT': '#75CC96',
        'Positioning': '#DB8300',
        'Positioning-ALT': '#B56A01',
        'Dimensioning-Distance': '#8AB6AD',
        'Dimensioning-Size': '#7CD3C0',
        'Dimensioning-Amount': '#7EF5D9',
        'Dimensioning-ALT': '#60847B'
    },
    width=1400,
    height=800
)

figure.update_layout(font={'size': 18})

if get_bar_charts_with_annotation_amounts_of_all_chapters_output == 'show':
    figure.show()
elif get_bar_charts_with_annotation_amounts_of_all_chapters_output == 'save':
    if not (os.path.isdir('perform_analysis_output')):
        os.makedirs('perform_analysis_output')
    figure.write_html(os.path.join('perform_analysis_output', f'cs1_annotation_amounts__{token_selection[1] - token_selection[0]}_tokens.html'))
else:
    print('#### No valid output behaviour is defined for "get bar charts with annotation amounts of all chapters": No data is displayed or saved. ####')
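
The values plotted in both charts are percentages. The key `annotations_by_class_in_file:total_token_amount_in_file` suggests that each value is the number of annotations of a class divided by the token amount of the file; the cells above then multiply it by 100. A worked sketch with invented counts (the function name and all numbers are assumptions for illustration; how gitma-canspin computes the ratio internally is not shown in this notebook):

```python
from typing import Dict

def class_percentages(annotations_by_class: Dict[str, int], total_tokens: int) -> Dict[str, float]:
    """Return the annotation amount per class as a percentage of the token amount."""
    if total_tokens <= 0:
        raise ValueError('total_tokens must be positive')
    return {name: amount / total_tokens * 100 for name, amount in annotations_by_class.items()}

# invented counts for a selection of 1000 tokens
toy = class_percentages({'Place-Container': 50, 'Direction': 5}, total_tokens=1000)
print(toy)

# classes that do not occur in a text are plotted as 0, as with .get(annotation_class, 0) above
missing = toy.get('Movement-Smell', 0)
print(missing)
```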


#### get first character event overview

In [12]:
# install and import Pillow for creating and manipulating images
%pip install Pillow==11.2.1
from PIL import Image, ImageDraw, ImageFont

Note: you may need to restart the kernel to use updated packages.


In [13]:
# load first character events data
first_character_event_data: pd.DataFrame = pd.read_csv(filepath_or_buffer=os.path.join('novel_beginning_analysis', 'categorization.tsv'), sep='\t')
first_character_event_data.head(8)

Unnamed: 0,text_id,grammatical_person,narrator,discourse,story,first_character_event_sentence,chapters_total_sentences,first_character_event_token,chapters_total_token,first_character_event_description,general_entrance_description
0,DEU-19_030,3rd,heterodiegetic,high expositionality,in medias res,37.0,315,955.0,7179,professor looks at a manuscript,The opening is a spatial description of a nigh...
1,DEU-19_001,3rd,heterodiegetic,high expositionality,in medias res,13.0,206,430.0,5491,a man steers a horse-drawn carriage,The chapter opens with a spatial description o...
2,DEU-20_002,1st,homodiegetic,medium expositionality,in medias res,1.0,86,10.0,2689,arrival in Bonn,The chapter begins immediately with the arriva...
3,DEU-20_021,3rd,heterodiegetic,medium expositionality,in medias res,1.0,43,13.0,744,Mr. B. buys a car,The chapter immediatly starts with the act of ...
4,SPA-19_001,3rd,heterodiegetic,high expositionality,in medias res,1.0,54,18.0,1883,travel of three knights,The plot starts immediately by three knights c...
5,SPA-19_008,3rd,heterodiegetic,medium expositionality,in medias res,12.0,158,401.0,4309,inhabitant of the tower walks around inside,The novel starts with spatial descriptions of ...
6,LAT-19_004,3rd,heterodiegetic,medium expositionality,in medias res,1.0,37,17.0,1210,travel of indigenous man and women through val...,"In the year 1656, two travellers - an Andalusi..."
7,LAT-19_041,3rd,heterodiegetic,high expositionality,ab ovo,,27,,1074,no character-related event in the first chapter,"The story is set in 1814, in the midst of the ..."


In [14]:
# colors
black = (0, 0, 0)
white = (255, 255, 255)
blueish = (52, 101, 164)
grey = (120, 120, 120)

# fonts
size_18, size_22, size_25 = (ImageFont.truetype(os.path.join('assets', 'fonts', 'Aspekta-400.ttf'), x) for x in [18, 22, 25])

# create first character event overview image
image = Image.new(mode='RGB', size=(1100, 1000), color=white)
draw = ImageDraw.Draw(im=image)

# border
draw.rectangle(xy=(25, 25, 1075, 975), fill=None, outline=black)

# text names
for index, row in first_character_event_data.iterrows():
    draw.text(xy=(50, 150 + (index * 95)), text=row['text_id'], fill=black, font=size_22)

# horizontal blue lines
for index, row in first_character_event_data.iterrows():
    draw.line(xy=(190, 160 + (index * 95), 860, 160 + (index * 95)), fill=blueish, width=2)

# arrow heads
for index, row in first_character_event_data.iterrows():
    draw.polygon(xy=[(860, 150 + (index * 95)), (860, 170 + (index * 95)), (880, 160 + (index * 95))], fill=blueish)

# vertical line next to text names
draw.line(xy=(190, 130, 190, 860), fill=black, width=2)

# token amount text
for index, row in first_character_event_data.iterrows():
    draw.text(xy=(930, 150 + (index * 95)), text=f'{row["chapters_total_token"]} Token', fill=grey, font=size_18)

# vertical dotted line for middle of chapter
for z in range(90, 900, 10):
    draw.line(xy=(800, z, 800, z + 5), fill=black, width=2)

# rectangle sign middle of chapter
draw.rectangle(xy=(705, 40, 895, 90), fill=None, outline=black, width=2)
draw.text(xy=(725, 53), text='middle of chapter 1', fill=black, font=size_18)

# rectangle sign 1st character event example
draw.rectangle(xy=(405, 40, 595, 90), fill=None, outline=grey, width=2)
draw.text(xy=(425, 53), text='1st character event', fill=grey, font=size_18)
draw.line(xy=(393, 40, 393, 90), fill=black, width=3)

# draw 1st character event positions
for index, row in first_character_event_data.iterrows():
    if math.isnan(row['first_character_event_token']):
        draw.text(xy=(325, 140 + (index * 95) + 30), text='no character event inside the 1st chapter', fill=grey, font=size_18)
        continue
    x_pos = ((row['first_character_event_token'] / row['chapters_total_token']) * 1220) + 190
    draw.line(xy=(x_pos, 140 + (index * 95), x_pos, 180 + (index * 95)), fill=black, width=3)
    draw.rectangle(xy=(x_pos + 12, 140 + (index * 95), x_pos + 80, 180 + (index * 95)), fill=white, outline=grey, width=2)
    offset = 30 if int(row['first_character_event_token']) > 99 else 36
    draw.text(xy=(x_pos + offset, 148 + (index * 95)), text=str(int(row['first_character_event_token'])), fill=grey, font=size_18)

# show or export
if get_first_character_event_overview_output == 'show':
    image.show()
elif get_first_character_event_overview_output == 'save':
    if not (os.path.isdir('perform_analysis_output')):
        os.makedirs('perform_analysis_output')
    image.save(os.path.join('perform_analysis_output', 'first_character_event_overview.png'))
else:
    print('#### No valid output behaviour is defined for "get first character event overview": No data is displayed or saved. ####')
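
The horizontal position of each event marker follows from the drawing constants above: the blue line starts at x = 190 and the scale is 1220 pixels per full chapter, so the middle of a chapter falls at x = 800, where the dotted line is drawn. Since the blue line ends at x = 860, only slightly more than the first half of each chapter is shown. A small sketch of this mapping (the function name is hypothetical; the constants and the DEU-19_030 values are taken from the cells above):

```python
def event_x_position(event_token: float, total_tokens: float,
                     line_start: int = 190, full_chapter_px: int = 1220) -> float:
    """Map a token position to an x coordinate on the overview image."""
    return (event_token / total_tokens) * full_chapter_px + line_start

# an event exactly in the middle of a chapter lands on the dotted line
middle = event_x_position(500, 1000)
print(middle)

# DEU-19_030: first character event at token 955 of 7179
deu_19_030 = event_x_position(955, 7179)
print(round(deu_19_030, 1))
```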

#### get annotation distribution of each chapter

In [15]:
save_or_show_setting: Union[str, None] = 'html' if get_annotation_distribution_of_each_chapter_output == 'save' \
                            else ('show' if get_annotation_distribution_of_each_chapter_output == 'show' else None)

german_text_paths: List[str] = [
    os.path.join('canspin-deu-19', 'cs1-tsv', 'CANSpiN-deu-19_030_1-1-1.tsv'),
    os.path.join('canspin-deu-19', 'cs1-tsv', 'CANSpiN-deu-19_001_1-1-1.tsv')
]

spanish_text_paths: List[str] = [
    os.path.join('canspin-spa-19', 'cs1-tsv', 'CANSpiN-spa-19_001_1.tsv'),
    os.path.join('canspin-spa-19', 'cs1-tsv', 'CANSpiN-spa-19_008_1.tsv'),
    os.path.join('canspin-lat-19', 'cs1-tsv', 'CANSpiN-lat-19_004_1.tsv'),
    os.path.join('canspin-lat-19', 'cs1-tsv', 'CANSpiN-lat-19_041_1.tsv'),
]

if save_or_show_setting:
    for text_filename in german_text_paths:
        analyzer.render_progression_bar_chart(
            input_data=[text_filename], 
            render_progression_bar_chart_settings={
                'separation_unit_type': 'token',
                'separation_unit_amount': 300,
                'output_type': save_or_show_setting,
                'width': 1400,
                'height': 800,
                'font_size': 18,
                'title': False,
                'svg_render_engine': 'auto',
                'category_and_class_system_name': 'CS1 v1.1.0 deu',
                'translate_classes_to_english': True,
                'save_data_to_json': True
            },
            export_filename=os.path.join('perform_analysis_output', f'annotation_distribution__{os.path.splitext(os.path.basename(text_filename))[0]}.html')
        )

    for text_filename in spanish_text_paths:
        analyzer.render_progression_bar_chart(
            input_data=[text_filename], 
            render_progression_bar_chart_settings={
                'separation_unit_type': 'token',
                'separation_unit_amount': 200,
                'output_type': save_or_show_setting,
                'width': 1400,
                'height': 800,
                'font_size': 18,
                'title': False,
                'svg_render_engine': 'auto',
                'category_and_class_system_name': 'CS1 v1.1.0 spa',
                'translate_classes_to_english': True,
                'save_data_to_json': True
            },
            export_filename=os.path.join('perform_analysis_output', f'annotation_distribution__{os.path.splitext(os.path.basename(text_filename))[0]}.html')
        )
else:
    print('#### No valid output behaviour is defined for "get annotation distribution of each chapter": No data is displayed or saved. ####')

gitma_canspin.canspin - INFO - JSON file perform_analysis_output\annotation_distribution__CANSpiN-deu-19_030_1-1-1.html.json successfully created.
gitma_canspin.canspin - INFO - HTML file perform_analysis_output\annotation_distribution__CANSpiN-deu-19_030_1-1-1.html successfully created.
gitma_canspin.canspin - INFO - JSON file perform_analysis_output\annotation_distribution__CANSpiN-deu-19_001_1-1-1.html.json successfully created.
gitma_canspin.canspin - INFO - HTML file perform_analysis_output\annotation_distribution__CANSpiN-deu-19_001_1-1-1.html successfully created.
gitma_canspin.canspin - INFO - JSON file perform_analysis_output\annotation_distribution__CANSpiN-spa-19_001_1.html.json successfully created.
gitma_canspin.canspin - INFO - HTML file perform_analysis_output\annotation_distribution__CANSpiN-spa-19_001_1.html successfully created.
gitma_canspin.canspin - INFO - JSON file perform_analysis_output\annotation_distribution__CANSpiN-spa-19_008_1.html.json successfully created

#### get first character event CS1 relations

In [16]:
# define filenames
filenames: List[str] = [
  'annotation_distribution__CANSpiN-deu-19_001_1-1-1',
  'annotation_distribution__CANSpiN-deu-19_030_1-1-1',
  'annotation_distribution__CANSpiN-lat-19_004_1',
  'annotation_distribution__CANSpiN-lat-19_041_1',
  'annotation_distribution__CANSpiN-spa-19_001_1',
  'annotation_distribution__CANSpiN-spa-19_008_1'
]

In [17]:
# create first character event cs1 relations as .png files for all annotation distribution images
for filename in filenames:

    # get text_id by filename and first character event data of text_id
    text_ids: List[str] = first_character_event_data['text_id'].to_list()
    possible_text_ids: List[str] = [text_id for text_id in text_ids if text_id.lower() in filename.lower()]
    text_id: Union[str, None] = possible_text_ids[0] if len(possible_text_ids) == 1 else None

    if not text_id:
        raise ValueError(
            'It was not possible to determine the text_id from the given filename. '
            'Please make sure the filename contains a valid text_id by using the same filename of the .html for naming your .png file.'
        )

    filtered_df = first_character_event_data[first_character_event_data['text_id'].str.match(text_id)]

    # load corresponding .json file
    annotation_distribution_data_filepath: str = os.path.join('perform_analysis_output', f'{filename}.html.json')
    with open(annotation_distribution_data_filepath) as file:
        annotation_distribution_data = json.loads(file.read())
    annotation_distribution_metadata: dict = annotation_distribution_data['METADATA'].copy()
    del annotation_distribution_data['METADATA']
    annotation_distribution_data: pd.DataFrame = pd.DataFrame(data=annotation_distribution_data)
    annotation_distribution_data.head(10)

    # load .png file
    annotation_distribution_image_filepath: str = os.path.join('perform_analysis_output', f'{filename}.png')
    annotation_distribution_image = Image.open(annotation_distribution_image_filepath)

    # create new white image with more height than annotation_distribution_image
    relation_image = Image.new(mode='RGB', size=(1400, 1100), color=white)
    draw = ImageDraw.Draw(im=relation_image)

    # paste annotation_distribution_image into new image
    relation_image.paste(annotation_distribution_image, (0, 100, 1400, 900))

    # draw reference line (length: 1030, starting from 80)
    draw.line(xy=(80, 950, 1110, 950), fill=blueish, width=3)

    # draw start and end point lines
    draw.line(xy=(80, 900, 80, 1000), fill=blueish, width=3)
    draw.line(xy=(1110, 900, 1110, 1000), fill=blueish, width=3)

    # draw token label on reference line
    draw.text(xy=(1200, 1030), text='Token', fill=black, font=size_18)

    # draw min and max token numbers on reference line
    draw.text(xy=(75, 1030), text='0', fill=black, font=size_18)
    draw.text(xy=(1090, 1030), text=str(filtered_df['chapters_total_token'].iloc[0]), fill=black, font=size_18)

    # draw token per bar indication
    draw.text(xy=(450, 190), text=f'{annotation_distribution_metadata["separation_unit_amount"]} token per bar, last bar: {int(filtered_df["chapters_total_token"].iloc[0]) - ((len(annotation_distribution_data.index) - 1) * int(annotation_distribution_metadata["separation_unit_amount"]))} token', fill=black, font=size_18)

    # draw 1st character event line
    if (math.isnan(filtered_df['first_character_event_token'].iloc[0])):
        draw.text(xy=(430, 970), text='no character event inside the 1st chapter', fill=black, font=size_18)
    else:
        bar_amount: int = len(annotation_distribution_data.index)
        bar_width: float = 1031 / (1.25 * bar_amount)
        gap_width: float = bar_width / 4

        def get_x_pos(bar_amount: int, bar_width: float, token_per_bar: int) -> float:
            last_bar_token_amount: int = int(filtered_df['chapters_total_token'].iloc[0]) - ((bar_amount - 1) * token_per_bar)
            last_bar_virtual_width: float = (last_bar_token_amount / token_per_bar) * bar_width
            new_total_length: float = (bar_width * (bar_amount - 1)) + last_bar_virtual_width
            x_pos_in_new_total_length = ((int(filtered_df['first_character_event_token'].iloc[0]) / int(filtered_df['chapters_total_token'].iloc[0])) * new_total_length)
            bar_borders: List[float] = [0] + [x * bar_width for x in range(1, bar_amount)] + [new_total_length]
            bar_number: int = next(iter([index for index, border in enumerate(bar_borders) if x_pos_in_new_total_length > bar_borders[index - 1] and x_pos_in_new_total_length < border]))
            position_in_bar: float = x_pos_in_new_total_length - ((bar_number - 1) * bar_width) \
                                    if bar_number != len(bar_borders) \
                                    else x_pos_in_new_total_length - new_total_length
            real_bar_pixel_start: float = (gap_width / 2) + ((bar_number - 1) * bar_width) + ((bar_number - 1) * gap_width) + 80
            real_x_pos: float = real_bar_pixel_start + position_in_bar

            return real_x_pos

        x_pos = get_x_pos(bar_amount, bar_width, annotation_distribution_metadata['separation_unit_amount'])

        draw.line(xy=(x_pos, 120, x_pos, 1000), fill=grey, width=1)
        draw.text(xy=(x_pos - 7 if int(filtered_df['first_character_event_token'].iloc[0]) < 100 else x_pos - 15, 1030), text=str(int(filtered_df['first_character_event_token'].iloc[0])), fill=black, font=size_18)
        draw.rectangle(xy=(x_pos - 85, 70, x_pos + 85, 130), fill=white, outline=black, width=1)
        draw.text(xy=(x_pos - 77, 90), text='1st character event', fill=black, font=size_18)

    # show or export
    if get_first_character_event_cs1_relations_output == 'show':
        relation_image.show()
    elif get_first_character_event_cs1_relations_output == 'save':
        if not (os.path.isdir('perform_analysis_output')):
            os.makedirs('perform_analysis_output')
        relation_image.save(os.path.join('perform_analysis_output', f'first-character-event-cs1-relation__{text_id}.png'))
    else:
        print('#### No valid output behaviour is defined for "get first character event CS1 relations": No data is displayed or saved. ####')
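
The "last bar" figure printed on each relation image is the remainder of the chapter after all full bars. The notebook reads the number of bars from the exported `.html.json` data; the sketch below instead assumes that one bar is created per started unit of tokens (`math.ceil`), which is an assumption made for illustration, as is the function name. The token amounts are taken from the table in step 3.

```python
import math

def last_bar_token_amount(total_tokens: int, tokens_per_bar: int) -> int:
    """Return the number of tokens covered by the last, possibly shorter, bar."""
    # assumption: one bar per started unit of tokens_per_bar tokens
    bar_amount = math.ceil(total_tokens / tokens_per_bar)
    return total_tokens - (bar_amount - 1) * tokens_per_bar

deu = last_bar_token_amount(7179, 300)  # DEU-19_030, 300 tokens per bar
spa = last_bar_token_amount(1883, 200)  # SPA-19_001, 200 tokens per bar
print(deu, spa)
```

Because the last bar usually covers fewer tokens than the others, `get_x_pos` above treats it as a bar of reduced virtual width when it locates the first character event.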

