## Notebook 7: Levenshtein Distance for professions

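In this Jupyter Notebook, the profession names extracted by OCR are compared against a predefined reference list of professions from 1910. Names without an exact match in that list are then matched approximately using the Levenshtein distance.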

### Import libraries

1. **re** - Python's standard library module for regular expressions.

2. **NumPy** - A library for working with vectors, matrices, and arrays in general. It simplifies various numerical operations.

3. **codecs** - A standard library module that provides access to Python's encoders and decoders, for example for text encoding. It is listed here but not imported in the cell below.

4. **pandas** - A library for analyzing and managing data. It is used here to create and manipulate tables.

5. **Levenshtein** - Provides the `distance` function, which is imported as `levenshtein_distance` and used as a similarity measure.
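
To make the similarity measure concrete: the Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The following is a minimal pure-Python sketch of that definition. The function name `edit_distance` is hypothetical and only serves as an illustration; the notebook itself uses `distance` from the `Levenshtein` package.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    # At the start of each outer iteration, previous[j] is the distance
    # between the prefix of `a` processed so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# OCR-style misspellings compared with plausible correct spellings
print(edit_distance("Bahnarbeirer", "Bahnarbeiter"))  # one substituted letter
print(edit_distance("Maurerpaller", "Maurerpolier"))  # two substituted letters
```

A small distance therefore suggests that an OCR reading is a misspelling of a known profession name, which is the idea behind the correction steps in this notebook.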

In [1]:
!pip install Levenshtein
'''Import Statements'''
import pandas as pd
import re
import numpy as np
from Levenshtein import distance as levenshtein_distance
# from google.colab import drive 
# drive.mount('/content/gdrive')
#import io

Collecting Levenshtein
  Downloading Levenshtein-0.16.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
Collecting rapidfuzz<1.9,>=1.8.2
  Downloading rapidfuzz-1.8.3-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (854 kB)
Installing collected packages: rapidfuzz, Levenshtein
Successfully installed Levenshtein-0.16.0 rapidfuzz-1.8.3
Mounted at /content/gdrive


### Import a Profession list

In [2]:
'''Creation of a professions list as a reference vocabulary for occupations.
The normalized occupation is added into the column "Normalized Occupation"'''

Professions1910 = pd.read_csv('./files/Reference_Vocabularies/nürnberg_1910_teilbestand.xlsx - nürnberg_1910_teilbestand.csv', usecols = ['Beruf', 'Normbezeichnungen', 'OhdAB_01', 'OhdAB_02', 'OhdAB_03', 'OhdAB_04', 'OhdAB_05'])
#Professions1910 = Professions1910.apply(lambda x: x.str.split(';').explode()).drop_duplicates().reset_index().drop(columns=["index"])
Professions1910 = Professions1910.fillna('')
professionGroundTruth = Professions1910

In [3]:
#person_df = pd.read_csv('/content/gdrive/MyDrive/MA Python/2_person_df_straßen.csv', lineterminator='\n', dtype={'Street': str, 'Normalized Street': str, 'House Number': int, 'House Floor': str, 'House Owner': str, 'Last Name': str, 'First Name Abbreviation': str, 'Location Owner Street': str, 'Location Owner Number': str, 'Occupation': str, 'Add Info': str, 'Streetname not found':str, 'LS_Street_1':str, 'LS_Street_2':str, 'LS_Street_3':str})
person_df = pd.read_csv('./Outputs/2_person_df_str_clean.csv', lineterminator='\n', dtype={'Full Name':str, 'Last Name':str, 'First Name Abbreviation':str, 'Occupation':str, 'Occupation 1':str, 'Occupation 2':str,'Add Info':str, 'House Owner':str, 'Full Address':str, 'Street':str, 'Normalized Street':str, 'House Number':str, 'Part of House':str, 'House Floor':str, 'Building':str, 'BuildingPart':str, 'BuildingPartFloor':str, 'Full Owner Address':str, 'Owner Street':str, 'Normalized Owner Street':str, 'Owner Number':str})
#df.head(100)})
person_df = person_df.fillna('')

#company_df = pd.read_csv('/content/gdrive/MyDrive/MA Python/2_company_df.csv', lineterminator='\n', dtype={'Street': str,'House Number': int, 'House Owner': str, 'Company Name': str, 'Location Owner Street': str, 'Location Owner Number': str})
#company_df = company_df.fillna('')

#Creating a subset of the df for working
#df = person_df.iloc[0:1000]
df = person_df

### Comparison and Correction of the Professions

In [4]:
Beruf_ohne_Normb = []
Beruf_ohne_OhdAB_01 = []
Beruf_ohne_OhdAB_02 = []
Beruf_ohne_OhdAB_03 = []
Beruf_ohne_OhdAB_04 = []
Beruf_ohne_OhdAB_05 = []
Levenshtein_1 = []
Levenshtein_2 = []
Levenshtein_3 = []
#df['Normalized Occupation'] = ''
#df['Levenshtein 1'] = ''
#df['Levenshtein 2'] = ''
#df['Levenshtein 3'] = ''

'''If the word (= Occupation in the DataFrame) is present in the professions list (column "Beruf"),
copy the content of the column "Normbezeichnungen" from the professions list into the DataFrame column "Normalized Occupation".
If "Normbezeichnungen" is empty, copy the content of OhdAB_01 instead, and so on.'''

def professionCorrection(table, column, normalizedcolumn, notfound, word, professions):
    if word is not None:
        word = word.strip()
        if word in professions['Beruf'].values:
            idx = np.where(professions['Beruf'] == word)[0][0]
            #print(str(idx) + professions['Beruf'][idx])
            if professions['Normbezeichnungen'][idx] != '':
                #print(table)
                table.loc[table[column] == (word), normalizedcolumn] = (professions['Normbezeichnungen'][idx])
            else:
                Beruf_ohne_Normb.append(professions['Beruf'][idx])
                if professions['OhdAB_01'][idx] != '':
                    table.loc[table[column] == (word), normalizedcolumn] = (professions['OhdAB_01'][idx])
                else:
                    Beruf_ohne_OhdAB_01.append(professions['Beruf'][idx])
                    if professions['OhdAB_02'][idx] != '':
                        table.loc[table[column] == (word), normalizedcolumn] = (professions['OhdAB_02'][idx])
                    else:
                        Beruf_ohne_OhdAB_02.append(professions['Beruf'][idx])
                        if professions['OhdAB_03'][idx] != '':
                            table.loc[table[column] == (word), normalizedcolumn] = (professions['OhdAB_03'][idx])
                        else:
                            Beruf_ohne_OhdAB_03.append(professions['Beruf'][idx])
                            if professions['OhdAB_04'][idx] != '':
                                table.loc[table[column] == (word), normalizedcolumn] = (professions['OhdAB_04'][idx])
                            else:
                                Beruf_ohne_OhdAB_04.append(professions['Beruf'][idx])
                                if professions['OhdAB_05'][idx] != '':
                                    table.loc[table[column] == (word), normalizedcolumn] = (professions['OhdAB_05'][idx])
                                else:
                                    Beruf_ohne_OhdAB_05.append(professions['Beruf'][idx])                       
        else:
            table.loc[table[column] == (word), notfound] = (word)     
    #return table
df['Normalized Occupation'] = ''
df['Normalized Occupation 1'] = ''
df['Normalized Occupation 2'] = ''
df['Occupation not found'] = ''
df['Occupation not found 1'] = ''
df['Occupation not found 2'] = ''

# Series.items() is used because Series.iteritems() was removed in pandas 2.0.
for idxAdrBook, word in df['Occupation'].items():
    if word is not None:
        word = str(word)
        professionCorrection(df, 'Occupation', 'Normalized Occupation', 'Occupation not found', word, professionGroundTruth)

# Clear 'Occupation not found' for every row that has an 'Occupation 1' entry.
df.loc[df['Occupation 1'] != '', 'Occupation not found'] = ''

for idxAdrBook, word in df['Occupation 1'].items():
    if word is not None:
        word = str(word)
        professionCorrection(df, 'Occupation 1', 'Normalized Occupation 1', 'Occupation not found 1', word, professionGroundTruth)

for idxAdrBook, word in df['Occupation 2'].items():
    if word is not None:
        word = str(word)
        professionCorrection(df, 'Occupation 2', 'Normalized Occupation 2', 'Occupation not found 2', word, professionGroundTruth)

df.head(225)

Unnamed: 0,Full Name,Last Name,First Name Abbreviation,Occupation,Occupation 1,Occupation 2,Add Info,House Owner,Full Address,Street,Normalized Street,House Number,Part of House,House Floor,Building,BuildingPart,BuildingPartFloor,Full Owner Address,Owner Street,Normalized Owner Street,Owner Number,Normalized Occupation,Normalized Occupation 1,Normalized Occupation 2,Occupation not found,Occupation not found 1,Occupation not found 2
0,G. Pirner,Pirner,G.,Wirt,,,(zur Siegesgöttin),True,"Altcrstraße 1, Vorderhaus",Altcrstraße,,1,Vorderhaus,,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus",,,,,Wirt/in (Gastwirt/in),,,,,
1,G. Bogner,Bogner,G.,Maurerpaller,,,,False,"Altcrstraße 1, Vorderhaus, 0",Altcrstraße,,1,Vorderhaus,0,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus, 0",,,,,,,,Maurerpaller,,
2,G. Haßmann,Haßmann,G.,Lackierer,,,,False,"Altcrstraße 1, Vorderhaus, 1",Altcrstraße,,1,Vorderhaus,1,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus, 1",,,,,Lackierer/in - allgemein,,,,,
3,K. Frühbeißer,Frühbeißer,K.,Kleiderm,,,,False,"Altcrstraße 1, Vorderhaus",Altcrstraße,,1,Vorderhaus,,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus",,,,,Kleidermacher/in,,,,,
4,J. tzrundler,tzrundler,J.,Feingoldschl,,,,False,"Altcrstraße 1, Vorderhaus",Altcrstraße,,1,Vorderhaus,,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus",,,,,,,,Feingoldschl,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
220,A. Ellinger,Ellinger,A.,Kalkulant,,,,False,"Altcrstraße 18, Vorderhaus, 0",Altcrstraße,,18,Vorderhaus,0,Altcrstraße 18,"Altcrstraße 18, Vorderhaus","Altcrstraße 18, Vorderhaus, 0",,,,,Kalkulator/in,,,,,
221,I. Worner,Worner,I.,Bahnarbeirer,,,,False,"Altcrstraße 18, Vorderhaus, 0",Altcrstraße,,18,Vorderhaus,0,Altcrstraße 18,"Altcrstraße 18, Vorderhaus","Altcrstraße 18, Vorderhaus, 0",,,,,,,,Bahnarbeirer,,
222,I. Tiefe,Tiefe,I.,Bahnarbeiter,,,,False,"Altcrstraße 18, Vorderhaus, 1",Altcrstraße,,18,Vorderhaus,1,Altcrstraße 18,"Altcrstraße 18, Vorderhaus","Altcrstraße 18, Vorderhaus, 1",,,,,,,,Bahnarbeiter,,
223,I. Merz,Merz,I.,Fabrikarbeiter,,,,False,"Altcrstraße 18, Vorderhaus, 2",Altcrstraße,,18,Vorderhaus,2,Altcrstraße 18,"Altcrstraße 18, Vorderhaus","Altcrstraße 18, Vorderhaus, 2",,,,,,,,Fabrikarbeiter,,
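
The fallback order used in `professionCorrection` (first `Normbezeichnungen`, then `OhdAB_01` through `OhdAB_05`) can also be written as a loop over the column names instead of nested `if`/`else` blocks. The following is a minimal sketch of that idea on plain dictionaries. The helper name `first_non_empty` and the sample rows are hypothetical and do not come from the 1910 reference list, and the sketch leaves out the bookkeeping lists such as `Beruf_ohne_Normb`.

```python
FALLBACK_COLUMNS = ["Normbezeichnungen", "OhdAB_01", "OhdAB_02",
                    "OhdAB_03", "OhdAB_04", "OhdAB_05"]


def first_non_empty(row: dict, columns=FALLBACK_COLUMNS) -> str:
    """Return the first non-empty value among `columns`, or '' if all are empty."""
    for column in columns:
        value = row.get(column, "")
        if value != "":
            return value
    return ""


# Made-up rows for illustration only (not taken from the reference vocabulary).
sample_rows = [
    {"Beruf": "example_a", "Normbezeichnungen": "Norm A", "OhdAB_01": "Code A1"},
    {"Beruf": "example_b", "Normbezeichnungen": "", "OhdAB_01": "", "OhdAB_02": "Code B2"},
    {"Beruf": "example_c", "Normbezeichnungen": ""},
]

# Lookup table: occupation spelling -> normalized label ('' if none is available).
lookup = {row["Beruf"]: first_non_empty(row) for row in sample_rows}
print(lookup)
```

With a real DataFrame, a dictionary like this could be applied to a whole column at once, for example with `Series.map`, rather than calling `table.loc[...]` once per row.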


In [5]:
'''If a word does not exist in this form in the professions list (column "Beruf"), find an entry with Levenshtein distance = 1 and repeat the same procedure across all columns.
Then do the same with distances 2 and 3.
Note: Ideally, the Levenshtein 1, 2, and 3 steps should only run if nothing has been found before. I suspect that at the moment everything is run for every word.
The normalized value should also always be written to the column "Normalized Occupation", but I commented that out because it kept overwriting what was already there.'''

def LevenshteinDistance(table, column1, column2, word, professions):
    for idx, profession in professions['Beruf'].items():
        profession = str(profession).strip()
        if word == '':
          return table
        elif word !='':
          # `in` on a Series tests the index labels, so compare against the values instead
          if levenshtein_distance(word, profession) == 1 and word not in professions['Beruf'].values:
              #print('Levenshtein 1: word: ' + (word) + ' profession: ' +(profession))
              if professions['Normbezeichnungen'][idx] != '':
                  table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = professions['Normbezeichnungen'][idx]
                  table.loc[table['Occupation'] == (word), 'Levenshtein 1'] = professions['Normbezeichnungen'][idx]
                  table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                  Levenshtein_1.append(professions['Beruf'][idx])
                  return table
              else:
                  if professions['OhdAB_01'][idx] != '':
                      table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_01'][idx])
                      table.loc[table['Occupation'] == (word), 'Levenshtein 1'] = professions['OhdAB_01'][idx]
                      table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                      Levenshtein_1.append(professions['Beruf'][idx])
                      return table
                  else:
                      if professions['OhdAB_02'][idx] != '':
                          table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_02'][idx])
                          table.loc[table['Occupation'] == (word), 'Levenshtein 1'] = professions['OhdAB_02'][idx]
                          table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                          Levenshtein_1.append(professions['Beruf'][idx])
                          return table
                      else:
                          if professions['OhdAB_03'][idx] != '':
                              table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_03'][idx])
                              table.loc[table['Occupation'] == (word), 'Levenshtein 1'] = professions['OhdAB_03'][idx]
                              table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                              Levenshtein_1.append(professions['Beruf'][idx])
                              return table
                          else:
                              if professions['OhdAB_04'][idx] != '':
                                  table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_04'][idx])
                                  table.loc[table['Occupation'] == (word), 'Levenshtein 1'] = professions['OhdAB_04'][idx]
                                  table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                                  Levenshtein_1.append(professions['Beruf'][idx])
                                  return table
                              else:
                                  if professions['OhdAB_05'][idx] != '':
                                      table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_05'][idx])
                                      table.loc[table['Occupation'] == (word), 'Levenshtein 1'] = professions['OhdAB_05'][idx]
                                      table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                                      Levenshtein_1.append(professions['Beruf'][idx])
                                      return table
                                  else:
                                    #table.loc[table['Occupation'] == (word), 'not found with L 1'] = (word)
                                    return table

          elif levenshtein_distance(word, profession) == 2 and word not in professions['Beruf'].values:
              #print('Levenshtein 2: ' + (word) + ' ' +(profession))
              if professions['Normbezeichnungen'][idx] != '':
                  table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = professions['Normbezeichnungen'][idx]
                  table.loc[table['Occupation'] == (word), 'Levenshtein 2'] = professions['Normbezeichnungen'][idx]
                  table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                  Levenshtein_2.append(professions['Beruf'][idx])
                  return table
              else:
                  if professions['OhdAB_01'][idx] != '':
                      table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_01'][idx])
                      table.loc[table['Occupation'] == (word), 'Levenshtein 2'] = professions['OhdAB_01'][idx]
                      table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                      Levenshtein_2.append(professions['Beruf'][idx])
                      return table
                  else:
                      if professions['OhdAB_02'][idx] != '':
                          table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_02'][idx])
                          table.loc[table['Occupation'] == (word), 'Levenshtein 2'] = professions['OhdAB_02'][idx]
                          table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                          Levenshtein_2.append(professions['Beruf'][idx])
                          return table
                      else:
                          if professions['OhdAB_03'][idx] != '':
                              table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_03'][idx])
                              table.loc[table['Occupation'] == (word), 'Levenshtein 2'] = professions['OhdAB_03'][idx]
                              table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                              Levenshtein_2.append(professions['Beruf'][idx])
                              return table
                          else:
                              if professions['OhdAB_04'][idx] != '':
                                  table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_04'][idx])
                                  table.loc[table['Occupation'] == (word), 'Levenshtein 2'] = professions['OhdAB_04'][idx]
                                  table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                                  Levenshtein_2.append(professions['Beruf'][idx])
                                  return table
                              else:
                                  if professions['OhdAB_05'][idx] != '':
                                      table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_05'][idx])
                                      table.loc[table['Occupation'] == (word), 'Levenshtein 2'] = professions['OhdAB_05'][idx]
                                      table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                                      Levenshtein_2.append(professions['Beruf'][idx])
                                      return table
                                  else:
                                      return table
              return table
                                      
          elif levenshtein_distance(word, profession) == 3 and word not in professions['Beruf'].values:
              #print('Levenshtein 3: ' + (word) + ' ' +(profession))
              if professions['Normbezeichnungen'][idx] != '':
                  table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['Normbezeichnungen'][idx])
                  table.loc[table['Occupation'] == (word), 'Levenshtein 3'] = professions['Normbezeichnungen'][idx]
                  table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                  Levenshtein_3.append(professions['Beruf'][idx])
                  return table
              else:
                  if professions['OhdAB_01'][idx] != '':
                      table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_01'][idx])
                      table.loc[table['Occupation'] == (word), 'Levenshtein 3'] = professions['OhdAB_01'][idx]
                      table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                      Levenshtein_3.append(professions['Beruf'][idx])
                      return table
                  else:
                      if professions['OhdAB_02'][idx] != '':
                          table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_02'][idx])
                          table.loc[table['Occupation'] == (word), 'Levenshtein 3'] = professions['OhdAB_02'][idx]
                          table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                          Levenshtein_3.append(professions['Beruf'][idx])
                          return table
                      else:
                          if professions['OhdAB_03'][idx] != '':
                              table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_03'][idx])
                              table.loc[table['Occupation'] == (word), 'Levenshtein 3'] = professions['OhdAB_03'][idx]
                              table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                              Levenshtein_3.append(professions['Beruf'][idx])
                              return table
                          else:
                              if professions['OhdAB_04'][idx] != '':
                                  table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_04'][idx])
                                  table.loc[table['Occupation'] == (word), 'Levenshtein 3'] = professions['OhdAB_04'][idx]
                                  table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                                  Levenshtein_3.append(professions['Beruf'][idx])
                                  return table
                              else:
                                  if professions['OhdAB_05'][idx] != '':
                                      table.loc[table['Occupation'] == (word), 'Normalized Occupation'] = (professions['OhdAB_05'][idx])
                                      table.loc[table['Occupation'] == (word), 'Levenshtein 3'] = professions['OhdAB_05'][idx]
                                      table.loc[table['Occupation'] == (word), 'Occupation not found'] = None
                                      Levenshtein_3.append(professions['Beruf'][idx])
                                      return table 
                                  else:
                                    return table
        #else:
          #table.loc[table['Occupation'] == (word), 'Occupation not found'] = 'found'
          #return table
        else:
          return table

df['Levenshtein 1'] = ''
df['Levenshtein 2'] = ''
df['Levenshtein 3'] = ''

for idxAdrBook, word in df['Occupation not found'].items():
    if word is not None:
        word = str(word)
        LevenshteinDistance(df, 'Normalized Occupation', 'Occupation not found', word, professionGroundTruth)

#df = df[['Last Name', 'First Name Abbreviation', 'Occupation', 'Normalized Occupation', 'Occupation not found', 'Levenshtein 1', 'Levenshtein 2', 'Levenshtein 3']]                                    
df.head(225)

Unnamed: 0,Full Name,Last Name,First Name Abbreviation,Occupation,Occupation 1,Occupation 2,Add Info,House Owner,Full Address,Street,Normalized Street,House Number,Part of House,House Floor,Building,BuildingPart,BuildingPartFloor,Full Owner Address,Owner Street,Normalized Owner Street,Owner Number,Normalized Occupation,Normalized Occupation 1,Normalized Occupation 2,Occupation not found,Occupation not found 1,Occupation not found 2,Levenshtein 1,Levenshtein 2,Levenshtein 3
0,G. Pirner,Pirner,G.,Wirt,,,(zur Siegesgöttin),True,"Altcrstraße 1, Vorderhaus",Altcrstraße,,1,Vorderhaus,,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus",,,,,Wirt/in (Gastwirt/in),,,,,,,,
1,G. Bogner,Bogner,G.,Maurerpaller,,,,False,"Altcrstraße 1, Vorderhaus, 0",Altcrstraße,,1,Vorderhaus,0,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus, 0",,,,,Maurerpolier/in,,,,,,,Maurerpolier/in,
2,G. Haßmann,Haßmann,G.,Lackierer,,,,False,"Altcrstraße 1, Vorderhaus, 1",Altcrstraße,,1,Vorderhaus,1,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus, 1",,,,,Lackierer/in - allgemein,,,,,,,,
3,K. Frühbeißer,Frühbeißer,K.,Kleiderm,,,,False,"Altcrstraße 1, Vorderhaus",Altcrstraße,,1,Vorderhaus,,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus",,,,,Kleidermacher/in,,,,,,,,
4,J. tzrundler,tzrundler,J.,Feingoldschl,,,,False,"Altcrstraße 1, Vorderhaus",Altcrstraße,,1,Vorderhaus,,Altcrstraße 1,"Altcrstraße 1, Vorderhaus","Altcrstraße 1, Vorderhaus",,,,,,,,Feingoldschl,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
220,A. Ellinger,Ellinger,A.,Kalkulant,,,,False,"Altcrstraße 18, Vorderhaus, 0",Altcrstraße,,18,Vorderhaus,0,Altcrstraße 18,"Altcrstraße 18, Vorderhaus","Altcrstraße 18, Vorderhaus, 0",,,,,Kalkulator/in,,,,,,,,
221,I. Worner,Worner,I.,Bahnarbeirer,,,,False,"Altcrstraße 18, Vorderhaus, 0",Altcrstraße,,18,Vorderhaus,0,Altcrstraße 18,"Altcrstraße 18, Vorderhaus","Altcrstraße 18, Vorderhaus, 0",,,,,,,,Bahnarbeirer,,,,,
222,I. Tiefe,Tiefe,I.,Bahnarbeiter,,,,False,"Altcrstraße 18, Vorderhaus, 1",Altcrstraße,,18,Vorderhaus,1,Altcrstraße 18,"Altcrstraße 18, Vorderhaus","Altcrstraße 18, Vorderhaus, 1",,,,,"Salzarbeiter/in, Hallarbeiter/in",,,,,,,,"Salzarbeiter/in, Hallarbeiter/in"
223,I. Merz,Merz,I.,Fabrikarbeiter,,,,False,"Altcrstraße 18, Vorderhaus, 2",Altcrstraße,,18,Vorderhaus,2,Altcrstraße 18,"Altcrstraße 18, Vorderhaus","Altcrstraße 18, Vorderhaus, 2",,,,,Tabakarbeiter/in,,,,,,,,Tabakarbeiter/in
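
`LevenshteinDistance` returns at the first reference entry whose distance to the word is 1, 2, or 3, in list order, so an early entry at distance 3 can be chosen over a closer entry further down the list. The sketch below shows one possible alternative, not the notebook's method: it scans all candidates and keeps the one with the smallest distance up to a threshold. The names `closest_match` and `max_distance`, the pure-Python `edit_distance` stand-in for `Levenshtein.distance`, and the candidate list are all hypothetical.

```python
from typing import Iterable, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Pure-Python stand-in for Levenshtein.distance."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j - 1] + (char_a != char_b),
                               previous[j] + 1,
                               current[j - 1] + 1))
        previous = current
    return previous[-1]


def closest_match(word: str, candidates: Iterable[str],
                  max_distance: int = 3) -> Optional[Tuple[str, int]]:
    """Return (candidate, distance) with the smallest distance <= max_distance."""
    best = None
    for candidate in candidates:
        d = edit_distance(word, candidate.strip())
        if d <= max_distance and (best is None or d < best[1]):
            best = (candidate, d)
            if d == 0:
                break  # exact match, nothing can be closer
    return best


# Hypothetical candidate list; the closest entry is deliberately placed last.
candidates = ["Salzarbeiter", "Bauarbeiter", "Bahnarbeiter"]
print(closest_match("Bahnarbeirer", candidates))  # closest entry wins, not the first hit
print(closest_match("Xylophonist", candidates))   # nothing within the threshold
```

Whether a distance of 3 should be accepted at all is a separate question, since short profession names can differ by three letters and still denote unrelated occupations.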


In [6]:
#df.to_csv('/content/gdrive/MyDrive/MA Python/Outputs/2_person_df_str_occ_full.csv')
df.to_csv('./Outputs/2_person_df_str_occ.csv', index = False)
#df.to_csv('/content/gdrive/MyDrive/MA Python/Outputs/2_person_df_str_occ_clean.csv', columns=['Street', 'Normalized Streetname', 'House Number', 'House Floor', 'House Owner', 'Last Name', 'First Name Abbreviation', 'Location Owner Street', 'Normalized Owner Street', 'Location Owner Number', 'Occupation', 'Normalized Occupation', 'Normalized Occupation 1', 'Normalized Occupation 2', 'Add Info'], index = False)
df.to_csv('./Outputs/2_person_df_str_occ_clean.csv', columns=['Full Name', 'Last Name', 'First Name Abbreviation', 'Occupation', 'Occupation 1', 'Occupation 2', 'Normalized Occupation', 'Normalized Occupation 1', 'Normalized Occupation 2', 'Add Info', 'House Owner', 'Full Address', 'Street', 'Normalized Street', 'House Number', 'Part of House', 'House Floor', 'Building', 'BuildingPart', 'BuildingPartFloor', 'Full Owner Address', 'Owner Street', 'Normalized Owner Street', 'Owner Number'], index = False)
