# Data processing

**TABLE OF CONTENTS**
1. Imports
2. Transforming the content of the 1st edition of *Handbuch der Naturgeschichte* (without registers) into Documents
3. Transforming the registers of the 1st edition of *Handbuch der Naturgeschichte* into Documents
    * a) Volume 1
    * b) Volume 2: chapters 12th-13th
    * c) Volume 2: chapters 14th-16th
4. Resizing the Documents from the 1st edition
    * a) Document length about 1500, overlap 400
    * b) Document length about 700, overlap 200
5. Transforming the content of the 12th edition of *Handbuch der Naturgeschichte* (without registers) into Documents
6. Resizing the Documents from the 12th edition
7. Combine documents from the 1st and the 12th edition (without registers) & save

**NOTE: Sections 5-7 were added after hyperparameter selection because they cover the 12th edition.**

## 1. Imports

In [1]:
from collections import namedtuple
from langchain.schema import Document
from langchain.text_splitter import SpacyTextSplitter
import os
import pickle
import re

## 2. Transforming the content of the 1st edition of *Handbuch der Naturgeschichte* (without registers) into Documents

In [2]:
def clean_text(text, nl_for_pageNumber = False):
    
    """
    Takes str as an input, removes chapter titles, footnotes, page numbers, annotation about pictures 
    and restores hyphenated words at the end of line. Returns str. 
    """
    
    # remove chapter titles
    pattern_chapters = re.compile(r"\n.+ter\b Abschnitt[^§]+(\n§)")
    text_no_chapters = re.sub(pattern_chapters, r"\1", text)
    
    # remove footnotes 
    pattern_footnotes = re.compile(r"\n{2,}(\*+\)(.*|\n{1})*)(\n{2,})")
    text_no_footnotes = re.sub(pattern_footnotes, r"\3", text_no_chapters)
    
    # remove calls to footnote within paragraphs
    pattern_footnotes_par = re.compile(r"\s\*+\)\s")
    text_no_footnotes_par = re.sub(pattern_footnotes_par, " ", text_no_footnotes)
    
    # remove page numbers
    pattern_pp = re.compile(r"(?:\n\n|)(?:\x0c\n|)\[\d+\/\d+]")
    if not nl_for_pageNumber:
        text_no_pp = re.sub(pattern_pp, "", text_no_footnotes_par)
    else:
        text_no_pp = re.sub(pattern_pp, "\n", text_no_footnotes_par)
    
    # remove annotation about pictures
    pattern_picture = re.compile(r"\[Abbildung( |)\]")
    text_no_picture = re.sub(pattern_picture, "", text_no_pp)
    
    # restore hyphenated words at the end of line
    pattern_hyphen = re.compile(r"([a-züöäß])\-\n([a-züöäß])")
    text_restored_words = re.sub(pattern_hyphen, r"\1\2", text_no_picture)
    
    return text_restored_words


def remove_borders(text):
    
    """
    Takes str as an input, removes whitespaces and newlines from the beginning and the end of it.
    Returns str. 
    """
    
    # \s already covers newlines; a "|" inside a character class would match a literal pipe
    pattern_beginning = re.compile(r"^\s*")
    text = re.sub(pattern_beginning, "", text)
    pattern_end = re.compile(r"\s*$")
    text = re.sub(pattern_end, "", text)
    
    return text

    
def get_sections(text, heading_pattern = r"\n§\.\s\d+\."):

    """
    Takes str as an input, splits it by heading pattern and returns 
    a list of tuples (section heading, section content).
    """
    
    sections = []
    
    # identify start and end positions of all section headings
    compiled_pattern = re.compile(heading_pattern) 
    headings_locs = [(match.start(), match.end()) for match 
                    in re.finditer(compiled_pattern, text)]
    
    if not headings_locs:  # if no sections are found, return an empty heading and the whole text
        sections.append(("", text))
        return sections
    
    # compile pattern for fixing newlines, to be used inside the next for loop
    pattern_newlines = re.compile(r"\b[.,:;?!\"'\(\)\{\}]?(\n)([a-zA-ZüöäßÜÖÄß])")
       
    # iterate through all section headings. Save tuples (heading, content)
    for i, (s, e) in enumerate(headings_locs):
        
        if i == 0 and s != 0:   # deals with a possible "preface" before the first section
            content = text[:s]
            content = remove_borders(content)
            content_fixed_newlines = re.sub(pattern_newlines, r" \2", content)
            if len(content_fixed_newlines) != 0:
                sections.append(("", content_fixed_newlines))

        heading = text[s:e]
        if i != len(headings_locs)-1:  # deals with sections except for the last one
            content = text[e: headings_locs[i+1][0]]
        else:  # deals with the last section
            content = text[e:]
        # remove newlines and spaces from the beginning and end of heading and content
        heading = remove_borders(heading)
        content = remove_borders(content)
        # fix newlines
        content_fixed_newlines = re.sub(pattern_newlines, r" \2", content)
    
        # save
        sections.append((heading, content_fixed_newlines))
    
    return sections


def create_docs(sections, source = "Handbuch der Naturgeschichte", autor = "Blumenbach", edition = 1,
               date = 1779, language = "German"):
    
    """
    Takes a list[tuples] as an input, creates Document objects with tuple[1] as Document.page_content, 
    tuple[0] as part of Document.metadata for each tuple in the list. 
    Returns list[Document].
    """

    docs = []
    
    # create documents with according metadata
    for (par_heading, par_content) in sections:
        metadata = {"source": source,
                           "autor": autor,
                           "edition": edition,
                            "date": date,
                            "language": language,
                           "in-text location": par_heading}
        doc = Document(page_content = par_content, metadata = metadata)
        docs.append(doc)
    
    return docs


def transform_docs_in_chunks(list_of_docs, chunk_size = 1500, chunk_overlap = 400):
    
    """
    Takes list[Document] as an input, transforms them by dividing them into chunks.
    Returns list[Document].
    """
    
    # initiate text splitter to divide in chunks on sentence-basis
    spacy_splitter = SpacyTextSplitter(
        pipeline = "sentencizer", 
        separator = " ",
        chunk_size = chunk_size,
        chunk_overlap = chunk_overlap)
    
    # transform documents
    new_list_of_docs = spacy_splitter.transform_documents(list_of_docs)
    
    return new_list_of_docs
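
To make the effect of the helper functions above easier to follow, here is a minimal sketch that applies the same hyphen-restoration pattern and the default section-heading pattern to a toy string. The sample text is invented for illustration and does not come from the Handbuch; the sketch uses `re.split` instead of the position-based slicing in `get_sections`, so it only approximates that function.

```python
import re

# Toy sample in the layout the pipeline expects (invented for illustration).
sample = "Vorbemerkung.\n§. 1.\nAlle Din-\nge der Erde.\n§. 2.\nDie Gewächse."

# Same pattern as in clean_text: rejoin words hyphenated across a line break.
pattern_hyphen = re.compile(r"([a-züöäß])-\n([a-züöäß])")
restored = pattern_hyphen.sub(r"\1\2", sample)

# Same heading pattern as the default argument of get_sections.
heading_pattern = re.compile(r"\n§\.\s\d+\.")
headings = [m.group().strip() for m in heading_pattern.finditer(restored)]
bodies = [part.strip() for part in heading_pattern.split(restored)]

print(headings)  # ['§. 1.', '§. 2.']
print(bodies)    # ['Vorbemerkung.', 'Alle Dinge der Erde.', 'Die Gewächse.']
```

The first element of `bodies` is the text before the first heading, which `get_sections` stores with an empty heading.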

### a) Volume 1

In [3]:
# Load the text

with open("../data/Auflage 1 Goettingen/part_1/Auflage_1_Goettingen_1779-ohne_Vorrede_ohne_Register.txt", "r") as f:
    text_ed1_part1 = f.read()  # ed1 i.e. edition 1
    
    
# apply the preprocessing pipeline on text_ed1_part1

cleaned_ed1_part1 = clean_text(text_ed1_part1)
paragraphs_ed1_part1 = get_sections(cleaned_ed1_part1)
docs_ed1_part1 = create_docs(paragraphs_ed1_part1, 
                             source = "Handbuch der Naturgeschichte", 
                             autor = "J.F.Blumenbach", 
                             edition = 1, 
                             date = 1779)    
    

In [4]:
# Visualization

for doc in docs_ed1_part1[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 1, 'date': 1779, 'language': 'German', 'in-text location': '§. 1.'} 
 Alle Dinge, die sich auf, und in unsrer Erde finden, zeigen sich entweder in derselben Gestalt, in welcher sie aus der Hand der Natur gekommen; oder so, wie sie durch Menschen oder Thiere, zu bestimmten Absichten, oder auch durch bloßen Zufall verändert und gleichsam umgeschaffen worden sind. Auf diese Verschiedenheit gründet sich die bekannte Eintheilung aller Körper in natürliche (naturalia), und durch Kunst verfertigte (artefacta). Die erstern machen den Gegenstand der Naturgeschichte aus, und man belegt alle Körper mit dem Namen der Naturalien, die nur noch keine wesentliche Veränderung durch Menschenhände erlitten haben; Da hingegen die mehresten von denen so der Zufall umgeändert hat, und beyläufig auch diejenigen so durch die Thiere nach ihren Trieben und zu Stillung ihrer Bedürfnisse verändert und umgebildet worden, mit unter den

### b) Volume 2

In [5]:
# Load the text

with open("../data/Auflage 1 Goettingen/part_2/Auflage_1_Goettingen_1780-ohne_Register.txt", "r") as f:
    text_ed1_part2 = f.read()


# apply the preprocessing pipeline on text_ed1_part2

cleaned_ed1_part2 = clean_text(text_ed1_part2)
paragraphs_ed1_part2 = get_sections(cleaned_ed1_part2)
docs_ed1_part2 = create_docs(paragraphs_ed1_part2, 
                             source = "Handbuch der Naturgeschichte", 
                             autor = "J.F.Blumenbach", 
                             edition = 1, 
                             date = 1780)

In [7]:
# Visualization

for doc in docs_ed1_part2[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 1, 'date': 1780, 'language': 'German', 'in-text location': '§. 170.'} 
 Der gegenwärtige Abschnitt betrift allerdings eine eben so wichtige als anmuthige Untersuchung nemlich die allgemeine Naturgeschichte der Gewächse, die wir so viel möglich in der gleichen Ordnung abfassen wollen, die oben in der allgemeinen Thiergeschichte befolgt worden ist, damit beide desto leichter mit einander verglichen und die Aehnlichkeit oder Abweichung dieser zweyerley Arten von organisirten Körpern um so deutlicher ersehen Werden kan.


{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 1, 'date': 1780, 'language': 'German', 'in-text location': '§. 171.'} 
 Die Gewächse unterscheiden sich von den Thieren (§.3. u. 4.) erstens durch die gänzliche Unfähigkeit irgend einer willkürlichen Bewegung, und dann durch die Wurzeln, wodurch sie ihren Nahrungssaft in sich ziehen, statt daß hingegen die Thiere

## 3. Transforming the registers of the 1st edition of *Handbuch der Naturgeschichte* into Documents

In [8]:
def last_formatting_fix(text):
    
    """
    Takes str as input and removes superfluous newlines. Needed for registers only, which are treated 
    differently than main content of the Handbuch. 
    Returns str. 
    """
    pattern_hyphen = re.compile(r"([a-züöäß])\-\s?\n+\s?([a-züöäß])")
    text_v1 = re.sub(pattern_hyphen, r"\1\2", text)
    pattern_newlines = re.compile(r"\s?\n+\s?")
    return re.sub(pattern_newlines, " ", text_v1)


def create_docs_from_register(sections, subsections_pattern = r"\n\d+\.\s",
                              source = "Handbuch der Naturgeschichte", autor = "Blumenbach", edition = 1,
                              date = 1779, language = "German", location = "Register 4th Chapter: Säugethiere"):
    
    """
    Takes a list[tuples] as input, where the tuples are of form (section heading, section content).
    Invokes get_sections to further divide the content into subsections. Then, creates Document objects
    for each subsection, storing content and metadata. 
    Returns list[Document].
    """
    
    docs = []
    
    # create documents with according metadata
    for (section_heading, section_content) in sections:
        
        subsections = get_sections(section_content, heading_pattern = subsections_pattern)
        
        for (subsection_heading, subsection_content) in subsections:
            
            if len(subsection_heading) != 0:
                current_location = location + ", " + section_heading + ", " + subsection_heading
            else:
                current_location = location + ", " + section_heading

            metadata = {"source": source,
                        "autor": autor,
                        "edition": edition,
                        "date": date,
                        "language": language,
                        "in-text location": current_location}
            doc = Document(page_content = last_formatting_fix(subsection_content), metadata = metadata)
            docs.append(doc)
    
    return docs


### a) Volume 1

In [9]:
# list files containing registers from part 1 (note that not all chapters contain a register)

directory = "../data/Auflage 1 Goettingen/Register/part_1"
files = os.listdir(directory)

for file in files:
    file_split = file.split("_")
    chapter = "Register " + file_split[-2] + ": " + file_split[-1][:-4]
    print(chapter)

Register 6th Chapter: Amphibien
Register 7th Chapter: Fische
Register 8th Chapter: Insecten
Register 5th Chapter: Vögeln
Register 9th Chapter: Würmer
Register 4th Chapter: Säugethiere


In [10]:
# create documents out of the registers

docs_register_part1 = []

for file in files:
    
    # load the register
    with open(f"{directory}/{file}", "r") as f:
        register = f.read()
    
    # clean and divide in sections
    register_cleaned = clean_text(register, nl_for_pageNumber = True)
    pattern_sections = r"\n\n([IVX]+[^\n]+)."
    register_sections = get_sections(register_cleaned, heading_pattern = pattern_sections)
    
    # parameters for create_docs_from_register
    pattern_subsections = r"\n\d+\.\s"
    file_split = file.split("_")
    chapter = "Register " + file_split[-2] + ": " + file_split[-1][:-4]
    
    # create documents
    docs_register_current = create_docs_from_register(register_sections, subsections_pattern = pattern_subsections, 
                                              source = "Handbuch der Naturgeschichte", autor = "Blumenbach", 
                                              edition = 1, date = 1779, language = "German", location = chapter)
    # save
    for doc in docs_register_current:
        docs_register_part1.append(doc)
    

In [12]:
# Visualization

for doc in docs_register_part1[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'Blumenbach', 'edition': 1, 'date': 1779, 'language': 'German', 'in-text location': 'Register 6th Chapter: Amphibien, I. REPTILES.'} 
 Alle Thiere dieser Ordnung sind, wenigstens wenn sie ihre vollkommne Gestalt erlangt haben mit vier Fußen versehn, die nach dem verschiednen Aufenthalt dieser Thiere entweder freye, oder durch eine Schwimmhaut verbundene, oder gar wie in eine Flosse verwachsene Zehen haben. Sie legen sämmtlich Eyer, und manche von ihnen sind überaus fruchtbar.


{'source': 'Handbuch der Naturgeschichte', 'autor': 'Blumenbach', 'edition': 1, 'date': 1779, 'language': 'German', 'in-text location': 'Register 6th Chapter: Amphibien, I. REPTILES., 1.'} 
 testudo. Schildkröte. Corpus testa obtectum, cauda brevis, os mandibulis nudis edentulis Die Schildkröten sind wol die trägsten phlegmanschten Geschöpfe in der Natur. Auch ihr Wachsthum und übrige Lebensgeschäffte gehen auserordentlich langsam von statten, so daß man rechne

### b) Volume 2: chapters 12th and 13th
The registers of these two chapters have a structure similar to those in part 1 and can be processed in the same way.

In [13]:
# list files containing register from the chapters from the 2nd volume

directory = "../data/Auflage 1 Goettingen/Register/part_2"
files = os.listdir(directory)
files = [file for file in files if file != ".DS_Store"]

for file in files:
    file_split = file.split("_")
    chapter = "Register " + file_split[-2] + ": " + file_split[-1][:-4]
    print(chapter)

Register 13th Chapter: Saltze
Register 15th Chapter: Metalle
Register 12th Chapter: Erden und Steine
Register 16th Chapter: Versteinerungen
Register 14th Chapter: Erdharze
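
The cells in this section pick register files by their position in `files`, which relies on the order returned by `os.listdir`; that order is not guaranteed to be the same on every machine. A minimal sketch of selecting by chapter number instead is given below. The helper `chapter_number` and the file names are hypothetical: the names only assume the `<prefix>_<N>th Chapter_<title>.txt` layout implied by the `file.split("_")` logic above.

```python
import re

def chapter_number(filename):
    """Return the chapter number encoded in a name like 'x_13th Chapter_Saltze.txt', or None."""
    match = re.search(r"(\d+)(?:st|nd|rd|th) Chapter", filename)
    return int(match.group(1)) if match else None

# Hypothetical file names, listed in an arbitrary order as os.listdir might return them.
files = ["Register_13th Chapter_Saltze.txt",
         "Register_15th Chapter_Metalle.txt",
         "Register_12th Chapter_Erden und Steine.txt",
         "Register_16th Chapter_Versteinerungen.txt",
         "Register_14th Chapter_Erdharze.txt"]

# Select by chapter number and sort, so the result does not depend on listing order.
files_12to13 = sorted((f for f in files if chapter_number(f) in {12, 13}), key=chapter_number)
files_14to16 = sorted((f for f in files if chapter_number(f) in {14, 15, 16}), key=chapter_number)

print(files_12to13)
print(files_14to16)
```

Note that sorting by chapter number also changes the order in which the resulting documents are appended, compared with the index-based selection.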


In [14]:
# create documents out of the registers

docs_register_part2_12to13 = []

for file in [files[0], files[2]]:
    
    # load the register
    with open(f"{directory}/{file}", "r") as f:
        register = f.read()
    
    # clean and divide in sections
    register_cleaned = clean_text(register, nl_for_pageNumber = True)
    pattern_sections = r"\n\n([IVX]+[^\n]+)."
    register_sections = get_sections(register_cleaned, heading_pattern = pattern_sections)
    
    # parameters for create_docs_from_register
    pattern_subsections = r"\n\d+\.\s"
    file_split = file.split("_")
    chapter = "Register " + file_split[-2] + ": " + file_split[-1][:-4]
    
    # create documents
    docs_register_current = create_docs_from_register(register_sections, subsections_pattern = pattern_subsections, 
                                              source = "Handbuch der Naturgeschichte", autor = "Blumenbach", 
                                              edition = 1, date = 1780, language = "German", location = chapter)
    # save
    for doc in docs_register_current:
        docs_register_part2_12to13.append(doc)

In [15]:
# Visualization

for doc in docs_register_part2_12to13[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'Blumenbach', 'edition': 1, 'date': 1780, 'language': 'German', 'in-text location': 'Register 13th Chapter: Saltze, I. ACIDA.'} 
 1. vitriolum saporis stiptici, calcem in gypsum mutans.


{'source': 'Handbuch der Naturgeschichte', 'autor': 'Blumenbach', 'edition': 1, 'date': 1780, 'language': 'German', 'in-text location': 'Register 13th Chapter: Saltze, I. ACIDA., 1.'} 
 Ferri, Eisenvitriol Von grüngelber Farbe; wird bekanntlich zur Dinte, in der Arzney u. s. w. gebraucht.


{'source': 'Handbuch der Naturgeschichte', 'autor': 'Blumenbach', 'edition': 1, 'date': 1780, 'language': 'German', 'in-text location': 'Register 13th Chapter: Saltze, I. ACIDA., 2.'} 
 Cupri. Kupfervitriol Von himmelblauer oder Seewasserfarbe, nachdem er mehr oder weniger Kupfer hält. Im Rammelsberge bey Goslar, und in andern Cementwassern.




### c) Volume 2: chapters 14th, 15th and 16th
In the registers of chapters 14 to 16, sections and subsections are marked more irregularly. Luckily, these registers are short enough to handle manually: special marker patterns were inserted into the text by hand to allow an efficient split into subsections.

In [16]:
def get_sections_extended(text, heading_pattern, start_pattern, end_pattern):
    
    """
    Takes str as an input, splits it with get_sections and strips the start and end markers 
    from each heading. Returns a list of tuples (section heading, section content).
    """
    
    sections = get_sections(text, heading_pattern = heading_pattern)
    sections_v2 = [(re.sub(start_pattern, "", heading), content) for (heading, content) in sections]
    sections_v3 = [(re.sub(end_pattern, "", heading), content) for (heading, content) in sections_v2]
    
    return sections_v3

In [17]:
Pattern = namedtuple("Pattern", ["heading", "start", "end"])

pattern_hierarchy = [Pattern(r"\^\^PREPRESEC\^\^([^\/]+)\/\/PREPRESEC\^\^", r"\^\^PREPRESEC\^\^", r"\/\/PREPRESEC\^\^"),
                     Pattern(r"\^\^PRESEC\^\^([^\/]+)\/\/PRESEC\^\^", r"\^\^PRESEC\^\^", r"\/\/PRESEC\^\^"),
                     Pattern(r"\^\^SEC\^\^([^\/]+)\/\/SEC\^\^", r"\^\^SEC\^\^", r"\/\/SEC\^\^"),
                     Pattern(r"\^\^SUBSEC\^\^([^\/]+)\/\/SUBSEC\^\^", r"\^\^SUBSEC\^\^", r"\/\/SUBSEC\^\^"),
                     Pattern(r"\^\^SUBSUBSEC\^\^([^\/]+)\/\/SUBSUBSEC\^\^", r"\^\^SUBSUBSEC\^\^", r"\/\/SUBSUBSEC\^\^")]

for pattern in pattern_hierarchy:
    print(pattern.heading, pattern.start, pattern.end)

\^\^PREPRESEC\^\^([^\/]+)\/\/PREPRESEC\^\^ \^\^PREPRESEC\^\^ \/\/PREPRESEC\^\^
\^\^PRESEC\^\^([^\/]+)\/\/PRESEC\^\^ \^\^PRESEC\^\^ \/\/PRESEC\^\^
\^\^SEC\^\^([^\/]+)\/\/SEC\^\^ \^\^SEC\^\^ \/\/SEC\^\^
\^\^SUBSEC\^\^([^\/]+)\/\/SUBSEC\^\^ \^\^SUBSEC\^\^ \/\/SUBSEC\^\^
\^\^SUBSUBSEC\^\^([^\/]+)\/\/SUBSUBSEC\^\^ \^\^SUBSUBSEC\^\^ \/\/SUBSUBSEC\^\^
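
As a rough illustration of how one level of these markers splits a text, the sketch below applies the `SEC` heading pattern to a short invented snippet. The snippet is not taken from the registers, and the sketch uses `re.split` directly rather than `get_sections_extended`, so it shows the idea of the marker convention and not the exact pipeline.

```python
import re

# Invented snippet using the ^^SEC^^ ... //SEC^^ marker convention.
marked = ("^^SEC^^I. Eigentliche Metalle.//SEC^^\nEdle und unedle.\n"
          "^^SEC^^II. Halbmetalle.//SEC^^\nSpröde.")

# Same shape as the SEC entry of pattern_hierarchy: the group captures the heading text.
heading_pattern = re.compile(r"\^\^SEC\^\^([^/]+)//SEC\^\^")

headings = heading_pattern.findall(marked)
# With one capturing group, re.split returns [prefix, heading, content, heading, content, ...].
contents = [part.strip() for part in heading_pattern.split(marked)[2::2]]
pairs = list(zip(headings, contents))

print(pairs)
# [('I. Eigentliche Metalle.', 'Edle und unedle.'), ('II. Halbmetalle.', 'Spröde.')]
```

Because the heading group is `[^/]+`, a heading that itself contains a slash would not be matched under this convention.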


In [18]:
segments_register_part2_14to16 = []

for file in [files[1], files[-2], files[-1]]:
    
    # load the register
    with open(f"{directory}/{file}", "r") as f:
        register = f.read()

    # clean the register
    register_cleaned = clean_text(register)
    
    # get 1st sections
    pattern = pattern_hierarchy[0]
    sections = get_sections_extended(register_cleaned, heading_pattern = pattern.heading, 
                        start_pattern = pattern.start, end_pattern = pattern.end) 
    
    # update first element of each tuple with chapter's number and title
    file_split = file.split("_")
    chapter = "Register " + file_split[-2] + ": " + file_split[-1][:-4]
    sections = [(chapter + ", " + heading, content) if len(heading) != 0
               else (chapter, content) for heading, content in sections]

    # iterate to get sections lower in the hierarchy
    for pattern in pattern_hierarchy[1:]:
        new_sections = []
        for heading, content in sections:

            subsections = get_sections_extended(content, heading_pattern = pattern.heading, 
                                            start_pattern = pattern.start, end_pattern = pattern.end)

            if len(heading) != 0:
                subsections = [(heading + ", " + subheading, subcontent) if len(subheading) != 0 
                                else (heading, subcontent) for subheading, subcontent in subsections]
            else: 
                subsections = [(subheading, subcontent) if len(subheading) != 0 
                                else ("", subcontent) for subheading, subcontent in subsections]

            for subsec in subsections:
                new_sections.append(subsec)
        # the sections found at this level are refined further in the next pass
        sections = new_sections
    
    for segment in new_sections:
        segments_register_part2_14to16.append(segment)
    

In [19]:
# apply last_formatting_fix to the data
segments_register_part2_14to16_last_fix = [(heading, last_formatting_fix(content)) for heading, content 
                                            in segments_register_part2_14to16]

# create documents
docs_register_part2_14to16 = create_docs(segments_register_part2_14to16_last_fix, 
                             source = "Handbuch der Naturgeschichte", 
                             autor = "J.F.Blumenbach", 
                             edition = 1,  # volume 2 of the 1st edition
                             date = 1780)

In [20]:
# Visualization

for doc in docs_register_part2_14to16[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 1, 'date': 1780, 'language': 'German', 'in-text location': 'Register 15th Chapter: Metalle, I. Eigentliche Metalle., A. Edle., 1. avrvm. Gold, flauum, ponderosissimum, maxime ductile.'} 
 Der schwehrste Körper in der Natur: ohne allen Klang: zähe und zum Erstaunen geschmeidig und dehnbar, wie man beym Vergulden sieht.


{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 1, 'date': 1780, 'language': 'German', 'in-text location': 'Register 15th Chapter: Metalle, I. Eigentliche Metalle., A. Edle., 1. avrvm. Gold, flauum, ponderosissimum, maxime ductile., 1. Natiuum, gediegen.'} 
 Meist in Quarz, Spat ꝛc. theils wie Bäumgen, dendritisch, oder auch, doch weit seltner crystallinisch, mit acht dreyeckten Flächen wie der Diamant, vorzüglich schön in Mexiko, Ungarn, Siebenbürgen ꝛc Waschgold findet sich in grössern oder kleinern Körnchen unter dem Sande in einigen Flüssen, die es von G

## 4. Resizing the Documents from the 1st edition

In [36]:
# Combine all docs into one list
all_docs_ed1 = docs_ed1_part1 + docs_ed1_part2 + docs_register_part1 + docs_register_part2_12to13 + docs_register_part2_14to16
docs_ed1_without_register = docs_ed1_part1 + docs_ed1_part2

### a) Document length about 1500, overlap 400

In [37]:
# resize the chunks (on a sentence basis) to approximately 1500 characters, with an overlap of about 400 characters
docs_ed1_with_register_1500_400 = transform_docs_in_chunks(all_docs_ed1, chunk_size = 1500, chunk_overlap = 400)

Created a chunk of size 1619, which is longer than the specified 1500
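
The warning above is presumably a consequence of sentence-based splitting: sentences are never cut, so a single sentence longer than `chunk_size` ends up as an oversized chunk. The sketch below illustrates this with a simplified greedy packer. It is not LangChain's implementation; `pack_sentences` is a hypothetical helper written only to show the idea of packing whole sentences with overlap.

```python
def pack_sentences(sentences, chunk_size, chunk_overlap, sep=" "):
    """Greedily pack whole sentences into chunks of at most chunk_size characters.
    Trailing sentences (up to chunk_overlap characters) are repeated in the next chunk."""
    chunks, current = [], []
    for sentence in sentences:
        if current and len(sep.join(current + [sentence])) > chunk_size:
            chunks.append(sep.join(current))
            # collect trailing sentences that fit into the overlap budget
            overlap = []
            for prev in reversed(current):
                if len(sep.join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # drop overlap again if it would push the new chunk over chunk_size
            while overlap and len(sep.join(overlap + [sentence])) > chunk_size:
                overlap.pop(0)
            current = overlap
        current.append(sentence)
    if current:
        chunks.append(sep.join(current))
    return chunks

# Three short "sentences" and one that is longer than chunk_size on its own.
sentences = ["a" * 30, "b" * 30, "c" * 30, "d" * 120]
chunks = pack_sentences(sentences, chunk_size=70, chunk_overlap=35)
print([len(chunk) for chunk in chunks])  # [61, 61, 120]
```

In this sketch only the last chunk exceeds the limit, because its single sentence cannot be split further.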


In [38]:
# resize the chunks (on a sentence basis) to approximately 1500 characters, with an overlap of about 400 characters
docs_ed1_without_register_1500_400 = transform_docs_in_chunks(docs_ed1_without_register, 
                                                              chunk_size = 1500, chunk_overlap = 400)

In [39]:
print(f"There are {len(docs_ed1_with_register_1500_400)} including register.")
print(f"There are {len(docs_ed1_without_register_1500_400)} excluding register.")

There are 1692 including register.
There are 314 excluding register.


In [40]:
# Visualization
for doc in docs_ed1_with_register_1500_400[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 1, 'date': 1779, 'language': 'German', 'in-text location': '§. 1.'} 
 Alle Dinge, die sich auf, und in unsrer Erde finden, zeigen sich entweder in derselben Gestalt, in welcher sie aus der Hand der Natur gekommen; oder so, wie sie durch Menschen oder Thiere, zu bestimmten Absichten, oder auch durch bloßen Zufall verändert und gleichsam umgeschaffen worden sind. Auf diese Verschiedenheit gründet sich die bekannte Eintheilung aller Körper in natürliche (naturalia), und durch Kunst verfertigte (artefacta). Die erstern machen den Gegenstand der Naturgeschichte aus, und man belegt alle Körper mit dem Namen der Naturalien, die nur noch keine wesentliche Veränderung durch Menschenhände erlitten haben; Da hingegen die mehresten von denen so der Zufall umgeändert hat, und beyläufig auch diejenigen so durch die Thiere nach ihren Trieben und zu Stillung ihrer Bedürfnisse verändert und umgebildet worden, mit unter den

In [41]:
with open("../data/pickles/ed1_docs_with_register_1500_400.pickle", 'wb') as f:
    pickle.dump(docs_ed1_with_register_1500_400, f)
    
with open("../data/pickles/ed1_docs_without_register_1500_400.pickle", 'wb') as f:
    pickle.dump(docs_ed1_without_register_1500_400, f)

### b) Document length about 700, overlap 200

In [42]:
# resize the chunks (on a sentence basis) to approximately 700 characters, with an overlap of about 200 characters
docs_ed1_with_register_700_200 = transform_docs_in_chunks(all_docs_ed1, chunk_size = 700, chunk_overlap = 200)

Created a chunk of size 743, which is longer than the specified 700
Created a chunk of size 966, which is longer than the specified 700
Created a chunk of size 729, which is longer than the specified 700
Created a chunk of size 732, which is longer than the specified 700
Created a chunk of size 708, which is longer than the specified 700
Created a chunk of size 757, which is longer than the specified 700
Created a chunk of size 1619, which is longer than the specified 700
Created a chunk of size 723, which is longer than the specified 700
Created a chunk of size 1403, which is longer than the specified 700
Created a chunk of size 743, which is longer than the specified 700


In [43]:
# resize the chunks (on a sentence basis) to approximately 700 characters, with an overlap of about 200 characters
docs_ed1_without_register_700_200 = transform_docs_in_chunks(docs_ed1_without_register, 
                                                              chunk_size = 700, chunk_overlap = 200)

Created a chunk of size 743, which is longer than the specified 700
Created a chunk of size 966, which is longer than the specified 700
Created a chunk of size 729, which is longer than the specified 700
Created a chunk of size 732, which is longer than the specified 700
Created a chunk of size 708, which is longer than the specified 700
Created a chunk of size 757, which is longer than the specified 700


In [44]:
print(f"There are {len(docs_ed1_with_register_700_200)} including register.")
print(f"There are {len(docs_ed1_without_register_700_200)} excluding register.")

There are 2088 including register.
There are 501 excluding register.


In [45]:
# Visualization
for doc in docs_ed1_with_register_700_200[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 1, 'date': 1779, 'language': 'German', 'in-text location': '§. 1.'} 
 Alle Dinge, die sich auf, und in unsrer Erde finden, zeigen sich entweder in derselben Gestalt, in welcher sie aus der Hand der Natur gekommen; oder so, wie sie durch Menschen oder Thiere, zu bestimmten Absichten, oder auch durch bloßen Zufall verändert und gleichsam umgeschaffen worden sind. Auf diese Verschiedenheit gründet sich die bekannte Eintheilung aller Körper in natürliche (naturalia), und durch Kunst verfertigte (artefacta).


{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 1, 'date': 1779, 'language': 'German', 'in-text location': '§. 1.'} 
 Auf diese Verschiedenheit gründet sich die bekannte Eintheilung aller Körper in natürliche (naturalia), und durch Kunst verfertigte (artefacta). Die erstern machen den Gegenstand der Naturgeschichte aus, und man belegt alle Körper mit dem Namen der Naturali

In [47]:
# save to pickle

with open("../data/pickles/ed1_docs_with_register_700_200.pickle", 'wb') as f:
    pickle.dump(docs_ed1_with_register_700_200, f)
    
with open("../data/pickles/ed1_docs_without_register_700_200.pickle", 'wb') as f:
    pickle.dump(docs_ed1_without_register_700_200, f)

## 5. Transforming the content of the 12th edition of *Handbuch der Naturgeschichte* (without registers) into Documents

**EDIT: This section was added after hyperparameter selection; it features the 12th edition.**

In [4]:
# Load the text

with open("../data/Auflage 12 Goettingen 1830/blumenbach_naturgeschichte_1830-ohne_Vorrede_ohne_Register.txt", "r") as f:
    text_ed12 = f.read()
    
    
# apply the preprocessing pipeline on text_ed12

cleaned_ed12 = clean_text(text_ed12)
paragraphs_ed12 = get_sections(cleaned_ed12)
docs_ed12 = create_docs(paragraphs_ed12, 
                             source = "Handbuch der Naturgeschichte", 
                             autor = "J.F.Blumenbach", 
                             edition = 12, 
                             date = 1830)    
 

In [5]:
# Visualization

for doc in docs_ed12[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 12, 'date': 1830, 'language': 'German', 'in-text location': '§. 1.'} 
 Alle Körper, die sich auf, und in unserer Erde finden, zeigen sich entweder in derselben Gestalt und Beschaffenheit, die sie aus der Hand des Schöpfers erhalten und durch die Wirkung der sich selbst überlassenen Naturkräfte angenommen haben; oder so wie sie durch Menschen und Thiere, zu bestimmten Absichten, oder auch durch bloßen Zufall verändert und gleichsam umgeschaffen worden sind Auf diese Verschiedenheit gründet sich die bekannte Eintheilung derselben in natürliche (naturalia), und durch Kunst verfertigte (artefacta).
Die erstern machen den Gegenstand der Naturgeschichte aus, und man pflegt alle Körper zu den Naturalien zu rechnen, die nur noch keine wesentliche Veränderung durch Menschen erlitten haben. Artefacten werden sie dann genannt, wenn der Mensch absichtlich Veränderungen mit ihnen vorgenommen Anm. 1. Daß übrigens jene B

## 6. Resizing the Documents from the 12th edition

**EDIT: This section was added after hyperparameter selection; it features the 12th edition.**

In [6]:
# resize (length approximately 1500 characters, overlap approximately 400 characters)
docs_ed12 = transform_docs_in_chunks(docs_ed12, chunk_size = 1500, chunk_overlap = 400)

In [7]:
print(f"There are {len(docs_ed12)} documents.")

There are 293 documents.


In [8]:
# Visualization
for doc in docs_ed12[:3]:
    print(doc.metadata, "\n", doc.page_content)
    print("\n")

{'source': 'Handbuch der Naturgeschichte', 'autor': 'J.F.Blumenbach', 'edition': 12, 'date': 1830, 'language': 'German', 'in-text location': '§. 1.'} 
 Alle Körper, die sich auf, und in unserer Erde finden, zeigen sich entweder in derselben Gestalt und Beschaffenheit, die sie aus der Hand des Schöpfers erhalten und durch die Wirkung der sich selbst überlassenen Naturkräfte angenommen haben; oder so wie sie durch Menschen und Thiere, zu bestimmten Absichten, oder auch durch bloßen Zufall verändert und gleichsam umgeschaffen worden sind Auf diese Verschiedenheit gründet sich die bekannte Eintheilung derselben in natürliche (naturalia), und durch Kunst verfertigte (artefacta). 
Die erstern machen den Gegenstand der Naturgeschichte aus, und man pflegt alle Körper zu den Naturalien zu rechnen, die nur noch keine wesentliche Veränderung durch Menschen erlitten haben. Artefacten werden sie dann genannt, wenn der Mensch absichtlich Veränderungen mit ihnen vorgenommen Anm. 1. Daß übrigens jene 

## 7. Combine documents from the 1st and the 12th edition (without registers) & save

**EDIT: This section was added after hyperparameter selection; it features the 12th edition.**

In [9]:
# load the docs from the 1st edition
with open("../data/pickles/ed1_docs_without_register_1500_400.pickle", 'rb') as f:
    docs_ed1_1500_without_register = pickle.load(f)

In [10]:
# combine documents from both editions
docs_1_and_12 = docs_ed1_1500_without_register + docs_ed12

In [11]:
print(f"There are {len(docs_1_and_12)} documents.")

There are 607 documents.
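
A quick sanity check on the combined list is to count documents per edition from the metadata. The sketch below uses `SimpleNamespace` stand-ins instead of the real Document objects so that it runs without the pickles; the counts 314 and 293 are the ones printed earlier in this notebook for the 1st edition (without registers, 1500/400) and the 12th edition.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-ins for the real Documents (assumption: only the "edition" metadata key matters here).
docs_ed1 = [SimpleNamespace(metadata={"edition": 1}) for _ in range(314)]
docs_ed12 = [SimpleNamespace(metadata={"edition": 12}) for _ in range(293)]
docs_1_and_12 = docs_ed1 + docs_ed12

# Count how many documents each edition contributes to the combined list.
per_edition = Counter(doc.metadata["edition"] for doc in docs_1_and_12)
print(per_edition)           # Counter({1: 314, 12: 293})
print(len(docs_1_and_12))    # 607
```

With the real data, the same `Counter` expression can be applied to `docs_1_and_12` directly.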


In [12]:
# save to pickle

with open("../data/pickles/ed1_ed12_docs.pickle", 'wb') as f2:
    pickle.dump(docs_1_and_12, f2)