# Quality Assured Information Extraction from Historical Address Books for Knowledge Graph Construction
Master Thesis by Ilona-Dewi Kusardi

## Coding Section

The following Jupyter Notebooks contain the practical part of the Master Thesis. The coding work is split across several Notebooks, each covering one step that precedes the construction of the Knowledge Graph. Further details and descriptions of the performed steps are given in the corresponding Jupyter Notebooks.

## Notebook1: Reconstructing

This Jupyter Notebook covers the first practical part of the Master Thesis. It reads the OCR txt files and applies the first corrections to the text.

### Import libraries

To begin, all libraries needed for reading the OCR files and applying the first corrections are imported.

1. **re** - The Python library for processing regular expressions.

2. **Numpy** - A library for working with vectors, matrices, and arrays in general. It simplifies various numerical operations.

3. **Codecs** - This module provides access to the most common Python encoders and decoders, for example for text encoding.

4. **Pandas** - A library for analyzing and managing data. It is used to create tables.


In [5]:
# import statements

import re
import numpy
import codecs
import numpy as np
import pandas as pd

# #-*- coding: utf-8 -*-
# from google.colab import drive 
# drive.mount('/content/gdrive')

### OCR File Input

In [13]:
def read_files(path, start=1, end=428):
    """
    Import the OCRed txt files and return their contents as a list of strings.
    Pages start..end-1 are read (the end index is exclusive, as in range()).
    With the default arguments, pages 1 to 427 are loaded.
    """
    ocr = []
    for i in range(start, end):
        # page numbers in the file names are zero-padded to three digits, e.g. F_23_92-007.txt
        filedir = str(path) + "F_23_92-" + str(i).zfill(3) + ".txt"
        # codecs.open leaves the '\r\n' line endings untouched; 'utf-8-sig' strips a leading BOM
        with codecs.open(filedir, "r", "utf-8-sig") as bookpage:
            ocr.append(bookpage.read())
    return ocr

# read in files
#ocr = read_files("/Users/ilona-dewikusardi/Desktop/Datensets/F_23_92.jpg/F_23_92.jpg_txt/", start=1, end=428)
ocr = read_files("./files/F_23_92.jpg/F_23_92.jpg_txt/", start=1, end=428)
# drop the first 26 lines of the first page because these lines do not contain any address book entries
ocr[0] = ocr[0].split('\r\n', 26)[26]
#ocr
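
The page files follow the naming scheme `F_23_92-001.txt`, `F_23_92-002.txt`, and so on, with the page number zero-padded to three digits. As a minimal sketch, the file name for a page can be built with a format specifier; the helper name `page_filename` is hypothetical and not part of the notebook:

```python
def page_filename(page, prefix="F_23_92-", suffix=".txt"):
    """Build the file name of one OCR page (hypothetical helper for illustration)."""
    # :03d pads the page number with leading zeros to at least three digits
    return f"{prefix}{page:03d}{suffix}"


for page in (1, 42, 427):
    print(page_filename(page))
```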

### Corrections in the OCR

In [None]:
# delete first lines if they contain only a number (page number) or only the name range of the entries on that page
# delete empty lines
# also delete if 'Stadtarchiv Nürnberg ...' is scanned

for i in range(0, len(ocr)):
    ocr[i] = ocr[i].replace('—', '-')


indexes_to_pop_outer = []
for i in range(0, len(ocr)):
    # check if string has more than one line
    ocr[i] = ocr[i].split('\r\n')
    if len(ocr[i]) > 1:
        # check for empty lines and whether Stadtarchiv Nürnberg ... was scanned as well or whether line contains "siehe"
        indexes_to_pop_inner = []
        for j in range(0, len(ocr[i])):
            if (ocr[i][j] == "") or (ocr[i][j] == " "):
                indexes_to_pop_inner.append(j)
                continue
            
            # check if line begins with Ztac|Ltac|Ztsc or has 8t3 in it  --> only the case if Stadtarchiv Nürnberg ... is scanned as well
            match = re.search(r'[\s]*[a-zäöüßA-ZÄÖÜ0-9,\.\-\t\: \)\^\s]*(Ztac|Ltac|Ztsc|Ltsc|Lkac|Zlac|8t3){1}[a-zäöüßA-ZÄÖÜ0-9,\.\-\t\: \)\^\s]', ocr[i][j])
            if match:
                indexes_to_pop_inner.append(j)
                continue

            # if line contains S. a. (cf) then delete this line (re.search matches anywhere in the line)
            match = re.search(r'[\s]*(S[\.]{0,1}[\s]*a\.){1}', ocr[i][j])
            if match:
                indexes_to_pop_inner.append(j)
                continue

            # if line contains only a single capital letter and potentially a period then delete it
            match = re.search(r'^[\s]*([A-GJKP-UW-Z]{1}[\.]{0,1}[\s]*){1}$', ocr[i][j])
            if match:
                indexes_to_pop_inner.append(j)
                continue

            # if line contains "siehe" or "Siehe" (cross-reference) then delete it
            match = re.search(r'[Ss]iehe', ocr[i][j])
            if match:
                indexes_to_pop_inner.append(j)
                continue
                
        # pop from the back so that the remaining indexes stay valid
        indexes_to_pop_inner.sort(reverse=True)
        for index in indexes_to_pop_inner:
            ocr[i].pop(index)
        
        # check if first line is only a number (page number) --> delete
        match = re.search(r'[\s]*[\d]{1,}[S]{0,1}[\d]{1,}[\s]*$', ocr[i][0])
        if match:
            ocr[i].pop(0)

        # check if first line is only the name range of entries on page --> delete
        match = re.search(r'([A-Za-zäöü\-]*\s*(straße|platz|gasse|weg|br[uü]cke|halle|markt)|[A-Za-zäöü\-]*\s*(Straße|Platz|Gasse|Weg|Br[uü]cke|Halle|Bahnlinie|Markt))', ocr[i][0])
        if match:
            ocr[i].pop(0)

        # now check again if first line is only a number (page number) --> delete
        match = re.search(r'[\s]*[\d]{1,}[S]{0,1}[\d]{1,}[\s]*$', ocr[i][0])
        if match:
            ocr[i].pop(0)

        # strip leading whitespace from every line
        for j in range(0, len(ocr[i])):
            ocr[i][j] = ocr[i][j].lstrip()
    else:
        indexes_to_pop_outer.append(i)
        
    # delete the last lines of the last page because these only include information on printing
    # (the slice below drops the final four list items)
    if i == len(ocr) - 1:
        ocr[i] = ocr[i][0:-4]
    ocr[i] = '\r\n'.join(ocr[i])
# pop from the back so that earlier removals do not shift the remaining indexes
for index in sorted(indexes_to_pop_outer, reverse=True):
    ocr.pop(index)
    
# make it one big text file
address_book = '\r\n'.join(ocr)
address_book
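
The cell above first collects the indexes of unwanted lines and removes them afterwards. Removing several positions from a list is only safe from the back to the front, because every `pop` shifts the items that follow. A small sketch with sample lines (the helper `drop_indexes` is hypothetical):

```python
def drop_indexes(items, indexes):
    """Return a copy of items without the given positions (hypothetical helper)."""
    result = list(items)
    # highest index first, so earlier positions are not shifted by a removal
    for index in sorted(set(indexes), reverse=True):
        result.pop(index)
    return result


lines = ["12", "", "Pirner, G., Wirt", " ", "Bogner,G., Maurerpaller 0"]
print(drop_indexes(lines, [0, 1, 3]))
```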


In [None]:
# get rid of linebreaks amid entries
# next line starting with straße, gasse or platz --> delete linebreaks

address_book = address_book.replace("\r\nstraße" , "straße")
address_book = address_book.replace("\r\nStraße" , "Straße")
address_book = address_book.replace("\r\ngasse" , "gasse")
address_book = address_book.replace("\r\nGasse" , "Gasse")
address_book = address_book.replace("\r\nplatz" , "platz")
address_book = address_book.replace("\r\nPlatz" , "Platz")
address_book = address_book.replace('\r\nweg' , 'weg')
address_book = address_book.replace('\r\nWeg' , 'Weg')
# str.replace would treat [uü] literally, so these two rules use re.sub
address_book = re.sub(r'\r\nbr[uü]cke', 'brücke', address_book)
address_book = re.sub(r'\r\nBr[uü]cke', 'Brücke', address_book)
address_book = address_book.replace('\r\nmarkt' , 'markt')
address_book = address_book.replace('\r\nMarkt' , 'Markt')

address_book_list = address_book.split('\r\n')
address_book_list

['0',
 'Altcrstratze t.',
 'Steinbühl.',
 ',Von der Espan- zur Land-',
 'grabenstraße.)',
 'Distrikt 50.',
 '1\t»Pirner, G., Wirt',
 '(zur Siegesgöttin)',
 'Bogner,G., Maurerpaller 0',
 'Haßmann, G-, Lackierer I',
 'Frühbeißer, K-, Kleiderm. I',
 '«tzrundler,J.,Feingoldschl.I',
 'Pirner, I., Eisendreher II',
 'Wagner, G., Taglöhner IN',
 'Kortes, P., Schlosser IN',
 'Röder, G., Kutscher IN',
 'MandlingerT.Zimmges.lII',
 'Sebald, P-, Taglöhner IV',
 'Sebald, P-, LadergehilfeIV',
 'Fcrstl, S., Taglöhner IV',
 '2\t*Bühner, I., Oekonom in',
 'Auerau',
 '^zuin neuen Bahnhof)',
 'Gruber, G.< Eisendreher',
 'Wirt',
 'Gareis, F., Friseur',
 'Jordan, I., Drechsler',
 'Wirth,G. Marmorschleiferi',
 'Lippl, I., Taglöhner I',
 'Kantorek, N., Futteralmch.I',
 'Vogel, G., Bleistiftarb. I',
 'Walleter, I., Flaschner I',
 'Hobineier, V., ZementcurN',
 'Wachter, A., Stukkaturer II',
 'Zapf, A., Vernickler II',
 'Zchuhbauer, I., Kutscher II',
 'SemmelmannG.Lokomh.il',
 'Rupprecht, I., Gießer NI',
 'John,
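
The replacements above rejoin street names whose suffix was pushed onto the next line. `str.replace` treats its search text literally, so a pattern with a character class such as `[uü]` has to go through `re.sub`. The following sketch folds all suffixes into one regular expression; the function name `join_street_suffixes` and the sample text are assumptions made for illustration, and unlike the cell above the sketch does not normalise `brucke` to `brücke`:

```python
import re

# suffixes taken from the replacements above
SUFFIXES = r"(?:[Ss]traße|[Gg]asse|[Pp]latz|[Ww]eg|[Bb]r[uü]cke|[Mm]arkt)"


def join_street_suffixes(text):
    """Remove a line break that directly precedes a street-name suffix."""
    # the lookahead keeps the suffix itself unchanged
    return re.sub(r"\r\n(?=" + SUFFIXES + r")", "", text)


sample = "Landgraben\r\nstraße 5\r\nKarls\r\nbrucke 2\r\nPirner, G., Wirt"
print(join_street_suffixes(sample).split("\r\n"))
```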

In [None]:
# delete linebreak when lines begin with complete street name
# when previous line ends with :
address_book=re.sub(r'\:\s*\r\n', r': ', address_book)

# when previous line ends with ,
address_book=re.sub(r'\,\r\n', r', ', address_book) 

# when previous line ends with .
#address_book=re.sub(r'\.\s*\r\n(\s*([A-Z][a-zßöäüA-ZÖÄÜ\s\-\.\,]+)[^\d+])', r'. \1', address_book)
# when previous line ends with -
address_book=re.sub(r'\-\r\n', r'', address_book)


#If line starts with , replace it with (
address_book = re.sub(r'(\r\n)(\,|\^)(.*)\)(.*\r\n)', r'\1(\3)\4', address_book)

# when next line begins with parenthesis
address_book=re.sub(r'\r\n\s*\((.*\s*([A-Z][a-zßöäüA-ZÖÄÜ\s\-\.]+)\s*)', r' (\1', address_book)

# when previous line ends with u. or und
address_book=re.sub(r'\b(u\.)\s*\r\n', r'und ', address_book)
address_book=re.sub(r'(und)\s*\r\n(\s*([A-Z]*[a-zßöäüA-ZÖÄÜ\s\-\.])*\s*([1-9I]*[0-9]*\s*[a-mo-tvz]*))', r'\1 \2', address_book)

# when next line begins with u./und/zur/en
address_book=re.sub(r'\r\n(und|u\.|en|zur)\s*(([A-Z]*[a-zßöäüA-ZÖÄÜ\s\-\.]+)\s*)', r' \1 \2', address_book)

# when previous line ends with im/in/aus/zu/zum/zur/der/für/appr
address_book=re.sub(r'\b(im|in|aus|zu|zum|zur|der|f[uü]r|appr)\b\.*\s*\r\n([A-ZÄÖÜ][a-zäöü][0-9]\^\,\-)*', r' \1 \2', address_book)

#delete lines containing Einmündung/1908
address_book=re.sub(r'(\r\n).*(Einm(ü|il)(n|N)d(ung)|1908).*\r\n', r' \1', address_book)

#delete lines beginning with Bei/Bel/Bci/Zw/Aw/Zwischen
address_book=re.sub(r'^(Bei|Bel|Bci|Zw|Aw|Zwischen)\b.*\r\n', r'', address_book)

address_book_list = address_book.split('\r\n')
address_book_list

['0',
 'Altcrstratze t.',
 'Steinbühl. (Von der Espan- zur Landgrabenstraße.)',
 'Distrikt 50.',
 '1\t»Pirner, G., Wirt (zur Siegesgöttin)',
 'Bogner,G., Maurerpaller 0',
 'Haßmann, G-, Lackierer I',
 'Frühbeißer, K-, Kleiderm. I',
 '«tzrundler,J.,Feingoldschl.I',
 'Pirner, I., Eisendreher II',
 'Wagner, G., Taglöhner IN',
 'Kortes, P., Schlosser IN',
 'Röder, G., Kutscher IN',
 'MandlingerT.Zimmges.lII',
 'Sebald, P-, Taglöhner IV',
 'Sebald, P-, LadergehilfeIV',
 'Fcrstl, S., Taglöhner IV',
 '2\t*Bühner, I., Oekonom  in Auerau (zuin neuen Bahnhof)',
 'Gruber, G.< Eisendreher',
 'Wirt',
 'Gareis, F., Friseur',
 'Jordan, I., Drechsler',
 'Wirth,G. Marmorschleiferi',
 'Lippl, I., Taglöhner I',
 'Kantorek, N., Futteralmch.I',
 'Vogel, G., Bleistiftarb. I',
 'Walleter, I., Flaschner I',
 'Hobineier, V., ZementcurN',
 'Wachter, A., Stukkaturer II',
 'Zapf, A., Vernickler II',
 'Zchuhbauer, I., Kutscher II',
 'SemmelmannG.Lokomh.il',
 'Rupprecht, I., Gießer NI',
 'John,I.,Fabrikarbeiter IN'
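
To see how the substitutions above act on real input, the following sketch replays two of them on the lines `,Von der Espan- zur Land-` and `grabenstraße.)` from the earlier output. Only these two rules are applied here, so the result still differs from the cell output, where the parenthesis is also pulled up onto the `Steinbühl.` line.

```python
import re

# raw OCR lines as they appear in the earlier list output
text = "Steinbühl.\r\n,Von der Espan- zur Land-\r\ngrabenstraße.)\r\nDistrikt 50."

# a trailing hyphen marks a word split across two lines: drop hyphen and line break
text = re.sub(r'\-\r\n', r'', text)
# a line that starts with , or ^ and contains ) gets an opening parenthesis instead
text = re.sub(r'(\r\n)(\,|\^)(.*)\)(.*\r\n)', r'\1(\3)\4', text)

print(text.split('\r\n'))
```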

In [None]:
#Replace common mistakes from OCR
#Replace tab with space
address_book = re.sub(r'\t', r' ', address_book)
#Delete empty lines
address_book = re.sub(r'(\r\n)\s*\r\n', r'\1', address_book)
#Collapse repeated - into a single -
address_book = re.sub(r'\-+', '-', address_book)
#Delete lines with 5 or fewer characters
address_book = re.sub(r'(\r\n)[A-Za-z\s\-\*\.\,\']{0,5}(\r\n)', r'\g<1>', address_book)
#Delete lines without alphabetical or numeral characters
address_book = re.sub(r'(\r\n)[0.,\s\_\-]*\r\n', r'\1', address_book)
#Replace « and » with *
address_book = re.sub(r'(\d+)\s*(\«)\s*(\w)', r'\1 * \3', address_book)
address_book = re.sub(r'(\d+)\s*(\»)\s*(\w)', r'\1 * \3', address_book)

#Replace " with * 
address_book = re.sub(r'(\r\n)\"', r'\1* ', address_book)
#Replace " with * if its after a house number
address_book = re.sub(r'(\r\n)(\d+)\s(\"|\')', r'\1\2 * ', address_book)
address_book = re.sub(r'(\r\n)(\d+)([a-f]{1})\s(\"|\')', r'\1\2\3 * ', address_book)
#If line starts with , replace it with (
address_book = re.sub(r'(\r\n)(\,|\^)', r'\1(', address_book)
#Replace -, with ., if it's not at beginning of line
address_book = re.sub(r'\b([A-ZÄÖÜ]{1,2})\-(\,|\s+)\s*', r'\1., ', address_book)
#Delete characters in front of * if they are not a number or white space at the beginning of a line
address_book = re.sub(r'^([^0-9]{1,10})(\s)*\*\s*(.*)', r'* \3', address_book)
#Delete symbols
address_book = re.sub(r'(\'|\`|\<|\>|\«|\»|\„|\^|\°|\!|\?|\§|\"|\/)', '', address_book)
#delete everything at the beginning of a line that is not a word character, *, ( or -
address_book = re.sub(r'^[^0-9A-ZÄÖÜa-zäöüß*(-]*', '', address_book)

#repeat deleting after -,:
address_book=re.sub(r'(\:|\,|\-)\s*\r\n', r'\1 ', address_book)
address_book_list = address_book.split('\r\n')
address_book_list

['0',
 'Altcrstratze t.',
 'Steinbühl. (Von der Espan- zur Landgrabenstraße.)',
 'Distrikt 50.',
 '1 * Pirner, G., Wirt (zur Siegesgöttin)',
 'Bogner,G., Maurerpaller 0',
 'Haßmann, G., Lackierer I',
 'Frühbeißer, K., Kleiderm. I',
 'tzrundler,J.,Feingoldschl.I',
 'Pirner, I., Eisendreher II',
 'Wagner, G., Taglöhner IN',
 'Kortes, P., Schlosser IN',
 'Röder, G., Kutscher IN',
 'MandlingerT.Zimmges.lII',
 'Sebald, P., Taglöhner IV',
 'Sebald, P., LadergehilfeIV',
 'Fcrstl, S., Taglöhner IV',
 '2 *Bühner, I., Oekonom  in Auerau (zuin neuen Bahnhof)',
 'Gruber, G. Eisendreher',
 'Gareis, F., Friseur',
 'Jordan, I., Drechsler',
 'Wirth,G. Marmorschleiferi',
 'Lippl, I., Taglöhner I',
 'Kantorek, N., Futteralmch.I',
 'Vogel, G., Bleistiftarb. I',
 'Walleter, I., Flaschner I',
 'Hobineier, V., ZementcurN',
 'Wachter, A., Stukkaturer II',
 'Zapf, A., Vernickler II',
 'Zchuhbauer, I., Kutscher II',
 'SemmelmannG.Lokomh.il',
 'Rupprecht, I., Gießer NI',
 'John,I.,Fabrikarbeiter IN',
 'Stiegler
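
Several substitutions in this notebook start their pattern with `^`. By default, `^` matches only at the very beginning of the string; to anchor at the start of every line, `re.sub` needs `flags=re.MULTILINE`. Whether the per-line behaviour is wanted for each rule is a decision for the author; the sketch below only illustrates the difference on a made-up three-line sample:

```python
import re

sample = "?Pirner, G., Wirt\r\n!Bogner, G., Maurer\r\n<Gruber, G., Eisendreher"
pattern = r'^[^0-9A-ZÄÖÜa-zäöüß*(-]*'

# default: ^ anchors only at the start of the whole string
first_only = re.sub(pattern, '', sample)
# re.MULTILINE: ^ also anchors directly after every newline character
every_line = re.sub(pattern, '', sample, flags=re.MULTILINE)

print(first_only.split('\r\n'))
print(every_line.split('\r\n'))
```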

In [None]:
#Correcting common mistakes with Straße/Gasse/Brücke
address_book = re.sub(r'[SL ](tratze|traf\;e|traize|trasze|trake|trahe)', 'Straße', address_book)
address_book = re.sub(r'[sf ](tratz|traf\;|traf|traiz|trasz|trak|trab|trah)[ace]', 'straße', address_book)
# the replacement must be a raw string, otherwise '\1' is the control character \x01 instead of a backreference
address_book = re.sub(r'(G|g){1}(affe)', r'\1asse', address_book)
address_book = re.sub(r'(B|b){1}(rucke)', r'\1rücke', address_book)
#Delete linebreak if new line starts with address
address_book = re.sub(r'\r\n([A-Z]?[a-zäöü\-]*\s*[A-Z]*[a-zäöü\-]*\s*(straße|str\.|platz|gasse|weg|br[uü]cke|halle|markt)|[A-Z]?[a-zäöü\-]+\s*[A-Z]*[a-zäöü\-]*\s*(Straße|Platz|Gasse|Weg|Br[uü]cke|Halle|Bahnlinie|Markt))', r' \1', address_book)
#Delete linebreak if new line starts with vorm. or en or Gesamtgeschlecht or Brauerei
address_book = re.sub(r'\r\n(vorm\.?|en\b|Ge[ls]{1}amtgeschlecht|Brauerei)', r' \1', address_book)

address_book_list = address_book.split("\r\n")
address_book_list

['0 Altcrstraße t.',
 'Steinbühl. (Von der Espan- zur Landgrabenstraße.)',
 'Distrikt 50.',
 '1 * Pirner, G., Wirt (zur Siegesgöttin)',
 'Bogner,G., Maurerpaller 0',
 'Haßmann, G., Lackierer I',
 'Frühbeißer, K., Kleiderm. I',
 'tzrundler,J.,Feingoldschl.I',
 'Pirner, I., Eisendreher II',
 'Wagner, G., Taglöhner IN',
 'Kortes, P., Schlosser IN',
 'Röder, G., Kutscher IN',
 'MandlingerT.Zimmges.lII',
 'Sebald, P., Taglöhner IV',
 'Sebald, P., LadergehilfeIV',
 'Fcrstl, S., Taglöhner IV',
 '2 *Bühner, I., Oekonom  in Auerau (zuin neuen Bahnhof)',
 'Gruber, G. Eisendreher',
 'Gareis, F., Friseur',
 'Jordan, I., Drechsler',
 'Wirth,G. Marmorschleiferi',
 'Lippl, I., Taglöhner I',
 'Kantorek, N., Futteralmch.I',
 'Vogel, G., Bleistiftarb. I',
 'Walleter, I., Flaschner I',
 'Hobineier, V., ZementcurN',
 'Wachter, A., Stukkaturer II',
 'Zapf, A., Vernickler II',
 'Zchuhbauer, I., Kutscher II',
 'SemmelmannG.Lokomh.il',
 'Rupprecht, I., Gießer NI',
 'John,I.,Fabrikarbeiter IN',
 'Stiegler, A.,
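
Backreferences such as `\1` in a replacement only work when the replacement is a raw string (or the backslash is doubled); in a normal string literal, `'\1'` is the control character with code 1. A short check on two made-up OCR misreadings:

```python
import re

# in a normal string literal '\1' is the single character chr(1)
assert '\1' == chr(1)
# in a raw string it stays a backslash followed by the digit 1
assert r'\1' == '\\' + '1'

# raw replacement: group 1 (the original G/g or B/b) is kept
fixed_gasse = re.sub(r'(G|g)(affe)', r'\1asse', 'Bindergaffe 3')
fixed_bruecke = re.sub(r'(B|b)(rucke)', r'\1rücke', 'Karlsbrucke 2')

print(fixed_gasse)
print(fixed_bruecke)
```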

In [None]:
#delete lines with Einmündung
address_book = re.sub(r'(\r\n).*Einm(ü|il)(n|N)d(ung).*\r\n', r'\1', address_book)
#Delete lines with zwischen -> Later, to not lose Street names
#address_book = re.sub(r'(\r\n)[Zz]w(ischen)*\b.*\r\n', r'\1', address_book)
#Delete lines starting with Bei
# (raw string needed: in a normal string '\b' is a backspace character, not a word boundary)
address_book = re.sub(r'(\r\n)Bei\b.*\r\n', r'\1', address_book)
#Delete lines with 1908
address_book = re.sub(r'(\r\n).*1908.*\r\n', r'\1', address_book)
#Delete lines ending with 92
address_book = re.sub(r'(\r\n).*\s92\r\n', r'\1', address_book)

address_book_list = address_book.split('\r\n')
address_book_list

['0 Altcrstraße t.',
 'Steinbühl. (Von der Espan- zur Landgrabenstraße.)',
 'Distrikt 50.',
 '1 * Pirner, G., Wirt (zur Siegesgöttin)',
 'Bogner,G., Maurerpaller 0',
 'Haßmann, G., Lackierer I',
 'Frühbeißer, K., Kleiderm. I',
 'tzrundler,J.,Feingoldschl.I',
 'Pirner, I., Eisendreher II',
 'Wagner, G., Taglöhner IN',
 'Kortes, P., Schlosser IN',
 'Röder, G., Kutscher IN',
 'MandlingerT.Zimmges.lII',
 'Sebald, P., Taglöhner IV',
 'Sebald, P., LadergehilfeIV',
 'Fcrstl, S., Taglöhner IV',
 '2 *Bühner, I., Oekonom  in Auerau (zuin neuen Bahnhof)',
 'Gruber, G. Eisendreher',
 'Gareis, F., Friseur',
 'Jordan, I., Drechsler',
 'Wirth,G. Marmorschleiferi',
 'Lippl, I., Taglöhner I',
 'Kantorek, N., Futteralmch.I',
 'Vogel, G., Bleistiftarb. I',
 'Walleter, I., Flaschner I',
 'Hobineier, V., ZementcurN',
 'Wachter, A., Stukkaturer II',
 'Zapf, A., Vernickler II',
 'Zchuhbauer, I., Kutscher II',
 'SemmelmannG.Lokomh.il',
 'Rupprecht, I., Gießer NI',
 'John,I.,Fabrikarbeiter IN',
 'Stiegler, A.,

In [None]:
'''Adding whitespace after * or , if missing as 
preparation for entry component separation later on'''
#Add space after * or ,
address_book = re.sub(r'(\*|\,)([^\s])', r'\1 \2', address_book)
address_book = re.sub(r'(\*|\,)(\w)', r'\1 \2', address_book)
#Add space after .
#address_book = re.sub(r'(\.)([^\s\,\-])', r'\1 \2', address_book) #?
address_book = re.sub(r'(\.)(\w)', r'\1 \2', address_book)
#Add space in front of *
address_book = re.sub(r'([^(\s|\n)])(\*)', r'\1 \2', address_book)
address_book = re.sub(r'(\r\n)\s*(\*)', r'\1\2', address_book)

#Delete opening parenthesis, if no closing parenthesis is in same line
address_book = re.sub(r'(\r\n)(.*)(\()([^(\)|\n)]+)(\r\n)', r'\1\2\4\5', address_book)
#Deleting closing parentheses, if no opening parenthesis in same line
address_book = re.sub(r'(\r\n)([^(\n]+)(\))(.*)(\r\n)', r'\1\2\4\5', address_book)
#Insert missing white spaces if capital letters occur in the middle of a word
address_book = re.sub(r'([a-zäöüß]{3,})([A-ZÄÖÜ])([a-zäöüß]{3,})', r'\1 \2\3', address_book) #([a-zäöüß])([A-ZÄÖÜ])+([a-zäöüß]{3,}|\b|\.|\,)
address_book = re.sub(r'([a-zäöüß]{3,})([A-ZÄÖÜ])([\.\,]+)', r'\1, \2\3', address_book)
#No line starting with space
address_book = re.sub(r'(\r\n)\s*', r'\1', address_book)
#Remove a house floor (II or IV) at the start of a line
address_book = re.sub(r'(\r\n)(II|IV)\s*', r'\1', address_book)

address_book_list = address_book.split('\r\n')
address_book_list

['0 Altcrstraße t.',
 'Steinbühl. (Von der Espan- zur Landgrabenstraße.)',
 'Distrikt 50.',
 '1 * Pirner, G., Wirt (zur Siegesgöttin)',
 'Bogner, G., Maurerpaller 0',
 'Haßmann, G., Lackierer I',
 'Frühbeißer, K., Kleiderm. I',
 'tzrundler, J., Feingoldschl. I',
 'Pirner, I., Eisendreher II',
 'Wagner, G., Taglöhner IN',
 'Kortes, P., Schlosser IN',
 'Röder, G., Kutscher IN',
 'Mandlinger, T. Zimmges. lII',
 'Sebald, P., Taglöhner IV',
 'Sebald, P., LadergehilfeIV',
 'Fcrstl, S., Taglöhner IV',
 '2 * Bühner, I., Oekonom  in Auerau (zuin neuen Bahnhof)',
 'Gruber, G. Eisendreher',
 'Gareis, F., Friseur',
 'Jordan, I., Drechsler',
 'Wirth, G. Marmorschleiferi',
 'Lippl, I., Taglöhner I',
 'Kantorek, N., Futteralmch. I',
 'Vogel, G., Bleistiftarb. I',
 'Walleter, I., Flaschner I',
 'Hobineier, V., ZementcurN',
 'Wachter, A., Stukkaturer II',
 'Zapf, A., Vernickler II',
 'Zchuhbauer, I., Kutscher II',
 'Semmelmann, G. Lokomh. il',
 'Rupprecht, I., Gießer NI',
 'John, I., Fabrikarbeiter IN'
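
The effect of the spacing rules can be checked on individual entries. The sketch below wraps two of the substitutions from the cell above in a hypothetical helper, `add_missing_spaces`, and applies it to three lines from the earlier output:

```python
import re


def add_missing_spaces(line):
    """Insert a space after * or , (before any non-space) and after . (before a word character)."""
    line = re.sub(r'(\*|\,)([^\s])', r'\1 \2', line)
    line = re.sub(r'(\.)(\w)', r'\1 \2', line)
    return line


raw_entries = ['Bogner,G., Maurerpaller 0', 'tzrundler,J.,Feingoldschl.I', 'John,I.,Fabrikarbeiter IN']
for raw in raw_entries:
    print(add_missing_spaces(raw))
```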

In [None]:
'''Adding missing last Names at the beginning of a line.
This needs to be done before the house number and street name are added!'''
address_book_list = address_book.split('\r\n')

new_address_book_list = []
current = ''

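for entry in address_book_list:
    entryComponents = entry.split(' ')
    lastName = entryComponents[0]
    first_character_i = lastName[0]
    #print(first_character_i)
#    if ord(first_character_i) != 45:
#        current = str(lastName)
    # code points 65-121 ('A' to 'y', including the symbols between 'Z' and 'a'): remember the first token as current last name
    if ord(first_character_i) in range(65,122):
        current = str(lastName)
    # code points 0-43: control characters, space and symbols such as ( and *
    elif ord(first_character_i) in range(0,44):
        current = str(lastName)
    # code point 45 is '-': replace the leading '-' with the current last name
    elif ord(first_character_i) == 45:
        entry = str(current) + entry[1:]
    new_address_book_list.append(entry)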
address_book = '\r\n'.join(new_address_book_list)
#with open('/Users/ilona-dewikusardi/Desktop/Datensets/F_23_92.jpg/2_recon_new_address_book.txt', 'w') as fw:
#with open('/content/gdrive/MyDrive/MA Python/Outputs/2_recon_new_address_book.txt', 'w') as fw:
with open('./Outputs/2_recon_test.txt', 'w') as fw:
    fw.write(address_book)

#new_address_book_list
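
The idea of the loop above is that an entry starting with `-` takes over the last name of the closest preceding entry. The following sketch is a simplified, hypothetical variant (`fill_last_names` is not part of the notebook): only lines that start with a letter update the remembered name, and the `-` entries in the sample are invented for illustration.

```python
def fill_last_names(entries):
    """Replace a leading '-' with the last name of the closest preceding entry."""
    filled = []
    current = ''
    for entry in entries:
        if entry.startswith('-'):
            # '-' stands for "same last name as above"
            entry = current + entry[1:]
        elif entry[:1].isalpha():
            # first space-separated token, e.g. 'Sebald,' including its comma
            current = entry.split(' ')[0]
        filled.append(entry)
    return filled


sample = ['Sebald, P., Taglöhner IV', '- K., Schlosser I', 'Gareis, F., Friseur', '- A., Wirt']
print(fill_last_names(sample))
```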

In [None]:
'''Correcting common house floor mistakes'''
address_book = re.sub(r'([a-zäöüß]{2,})([HIV0oO1Nl]{2,})\b', r'\1 \2', address_book)
address_book = re.sub(r'([a-zäöüß]{2,})([HIV01NO])\b', r'\1 \2', address_book)
address_book = re.sub(r'\b[0oO]\b(\r\n)', r'0\1', address_book)
address_book = re.sub(r'\b[Ili1]\b(\r\n)', r'I\1', address_book)
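# note: unlike the rules below, this one is not anchored to the line end, so a standalone initial such as 'N.' is rewritten to 'II.' as well
address_book = re.sub(r'\b(N|U|[Il1i]{2})\b', r'II', address_book)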
address_book = re.sub(r'\b([NU][I1li]{1}|[Il1i]{3}|IH)\b(\r\n)', r'III\2', address_book)
address_book = re.sub(r'\b[i1lI][NVv]\b', r'IV', address_book)
address_book = re.sub(r'\b(H[0oO])\b(\r\n)', r'H0\2', address_book)
address_book = re.sub(r'\b(H[i1lI]{1})\b(\r\n)', r'HI\2', address_book)
address_book = re.sub(r'\b(H[i1lI]{2}|HN|HU)\b(\r\n)', r'HII\2', address_book)
address_book = re.sub(r'\b(H[i1lI]{3}|HIH)\b(\r\n)', r'HIII\2', address_book)
address_book = re.sub(r'\b(H[i1lI]{1}V)\b(\r\n)', r'HIV\2', address_book)

#with open('/Users/ilona-dewikusardi/Desktop/Datensets/F_23_92.jpg/2_recon_new_address_book.txt', 'w') as fw:
with open('./Outputs/2_recon_new_address_book.txt', 'w') as fw:
#with open('/content/gdrive/MyDrive/MA Python/Outputs/2_recon_test_2.txt', 'w') as fw:
    fw.write(address_book)
    
address_book_list = address_book.split('\r\n')
address_book_list

['0 Altcrstraße t.',
 'Steinbühl. (Von der Espan- zur Landgrabenstraße.)',
 'Distrikt 50.',
 '1 * Pirner, G., Wirt (zur Siegesgöttin)',
 'Bogner, G., Maurerpaller 0',
 'Haßmann, G., Lackierer I',
 'Frühbeißer, K., Kleiderm. I',
 'tzrundler, J., Feingoldschl. I',
 'Pirner, I., Eisendreher II',
 'Wagner, G., Taglöhner IV',
 'Kortes, P., Schlosser IV',
 'Röder, G., Kutscher IV',
 'Mandlinger, T. Zimmges. III',
 'Sebald, P., Taglöhner IV',
 'Sebald, P., Ladergehilfe IV',
 'Fcrstl, S., Taglöhner IV',
 '2 * Bühner, I., Oekonom  in Auerau (zuin neuen Bahnhof)',
 'Gruber, G. Eisendreher',
 'Gareis, F., Friseur',
 'Jordan, I., Drechsler',
 'Wirth, G. Marmorschleiferi',
 'Lippl, I., Taglöhner I',
 'Kantorek, II., Futteralmch. I',
 'Vogel, G., Bleistiftarb. I',
 'Walleter, I., Flaschner I',
 'Hobineier, V., Zementcur II',
 'Wachter, A., Stukkaturer II',
 'Zapf, A., Vernickler II',
 'Zchuhbauer, I., Kutscher II',
 'Semmelmann, G. Lokomh. II',
 'Rupprecht, I., Gießer III',
 'John, I., Fabrikarbeite
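
Floor markers stand at the end of an entry, whereas first-name initials such as `N.` appear in the middle. A rule for misread floor numbers can therefore be restricted to the line end with a lookahead. The pattern below is a hedged sketch of that idea for the `II` case only, not the notebook's rule; the function name `fix_second_floor` is hypothetical and the sample lines are adapted from the earlier outputs:

```python
import re


def fix_second_floor(text):
    """Rewrite a standalone N, U or two of I/l/1/i to II, but only at the end of a line."""
    # (?=\r?\n|$) requires a line break or the end of the text right after the token
    return re.sub(r'\b(N|U|[Il1i]{2})(?=\r?\n|$)', 'II', text)


sample = 'Kantorek, N., Futteralmch. I\r\nHobineier, V., Zementcur N\r\nSemmelmann, G. Lokomh. il'
print(fix_second_floor(sample).split('\r\n'))
```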