# Data preparation

This notebook attempts to match the data of the [Amtliches Verzeichnis der Strassen](https://www.cadastre.ch/de/services/service/registry/street.html) (the official Swiss street directory) against information from Wikidata.

As a first attempt, only data from the canton of Basel-Stadt (BS) is used, to keep the amount of data manageable.

---
## Loading the data
Following the approach in [du-StrassenVZ.ipynb](https://github.com/CaptainInler/cassda-zertifikatsarbeit/blob/main/Dataunderstanding/du-StrassenVZ.ipynb).

In [1]:
import urllib.request
    
url = 'https://data.geo.admin.ch/ch.swisstopo.amtliches-strassenverzeichnis/csv/2056/ch.swisstopo.amtliches-strassenverzeichnis.zip'
filehandle, _ = urllib.request.urlretrieve(url)

In [2]:
from zipfile import ZipFile

# Use a name other than `zip` to avoid shadowing the built-in function
with ZipFile(filehandle, 'r') as archive:
    archive.printdir()
    data = archive.read("pure_str.csv")

File Name                                             Modified             Size
pure_str.csv                                   2022-10-09 02:13:00     23539097
timestamp.txt                                  2022-10-09 02:34:30           10


In [3]:
from io import StringIO
import pandas as pd

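# Decode the raw bytes; 'UTF-8-SIG' strips a leading byte order mark (BOM), if present
daten = StringIO(str(data,'UTF-8-SIG'))

# The file is semicolon-separated
df = pd.read_csv(daten, encoding='UTF-8-SIG', sep=';')
df.head()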

Unnamed: 0,STR_ESID,STN_LABEL,ZIP_LABEL,COM_FOSNR,COM_NAME,COM_CANTON,STR_TYPE,STR_STATUS,STR_OFFICIAL,STR_VALID,STR_MODIFIED,STR_EASTING,STR_NORTHING
0,10258316,Eggwald,6484 Wassen UR,1220,Wassen,UR,Place,real,True,False,10.09.2022,,
1,10023770,Wiedenweg,4203 Grellingen,2786,Grellingen,BL,Street,real,True,True,09.09.2022,2610733.0,1254311.0
2,10179192,Wuhrbärgli,4253 Liesberg,2788,Liesberg,BL,Street,real,True,True,26.08.2022,2598709.0,1249640.0
3,10250501,Hüethütte Unter Trübsee,6390 Engelberg,1511,Wolfenschiessen,NW,Area,real,True,True,07.08.2021,2671798.0,1184817.0
4,10163108,Heimstenstich,4436 Liedertswil,2890,Liedertswil,BL,Area,real,True,True,03.03.2022,2621856.0,1248672.0
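
The effect of decoding with `UTF-8-SIG` can be illustrated without downloading anything. The following is a minimal sketch using only the standard library; the helper name `read_semicolon_csv` and the in-memory archive are made up for illustration, and the single data row is taken from the output further below. It assumes a CSV that starts with a byte order mark and shows that `utf-8-sig` keeps the first column name clean, whereas plain `utf-8` leaves `\ufeff` attached to it.

```python
import csv
import io
from zipfile import ZipFile


def read_semicolon_csv(zip_bytes, member, encoding="utf-8-sig"):
    """Read a semicolon-separated CSV member of a zip archive into a list of dicts."""
    with ZipFile(io.BytesIO(zip_bytes)) as archive:
        text = archive.read(member).decode(encoding)
    return list(csv.DictReader(io.StringIO(text), delimiter=";"))


# Build a small archive in memory (illustrative stand-in for the real download).
sample_csv = "STR_ESID;STN_LABEL;COM_CANTON\n10251567;Weinlagerstrasse;BS\n"
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    # Encoding with 'utf-8-sig' prepends a byte order mark to the file content.
    archive.writestr("pure_str.csv", sample_csv.encode("utf-8-sig"))

rows = read_semicolon_csv(buffer.getvalue(), "pure_str.csv")
rows_with_bom = read_semicolon_csv(buffer.getvalue(), "pure_str.csv", encoding="utf-8")

print(list(rows[0].keys()))           # first key is 'STR_ESID'
print(list(rows_with_bom[0].keys()))  # first key still carries the BOM character
```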


In [4]:
# Keep only entries from canton BS and drop entries of type "Area"
dfBs = df[(df.COM_CANTON == "BS") & (df.STR_TYPE != "Area")]

In [5]:
dfBs.head(10)

Unnamed: 0,STR_ESID,STN_LABEL,ZIP_LABEL,COM_FOSNR,COM_NAME,COM_CANTON,STR_TYPE,STR_STATUS,STR_OFFICIAL,STR_VALID,STR_MODIFIED,STR_EASTING,STR_NORTHING
8671,10251567,Weinlagerstrasse,4056 Basel,2701,Basel,BS,Street,real,True,True,15.08.2022,2610180.0,1269127.0
10035,10256874,Katja Wulff-Anlage,4052 Basel,2701,Basel,BS,Place,real,True,False,16.08.2022,2612763.0,1265337.0
10039,10256872,Wibrandis Rosenblatt-Weg,4052 Basel,2701,Basel,BS,Street,real,True,False,16.08.2022,2612778.0,1265286.0
10040,10256875,Gretel Bolliger-Promenade,4052 Basel,2701,Basel,BS,Place,real,True,False,29.08.2022,2612972.0,1265302.0
18388,10255061,Backstubenweg,4056 Basel,2701,Basel,BS,Street,real,True,True,15.08.2022,2610179.0,1269240.0
18393,10255064,Lichtnelkenweg,4056 Basel,2701,Basel,BS,Street,real,True,True,15.08.2022,2609984.0,1269350.0
18397,10255065,Kasernenhof,4058 Basel,2701,Basel,BS,Place,real,True,True,29.08.2022,2611424.0,1268054.0
18399,10255060,Helli Stehle-Weg,4059 Basel,2701,Basel,BS,Street,real,True,True,29.08.2022,2611017.0,1264793.0
18400,10255063,Nachtkerzenweg,4056 Basel,2701,Basel,BS,Street,real,True,True,15.08.2022,2610054.0,1269211.0
18402,10255066,Beim Wettsteinhäuschen,4058 Basel,2701,Basel,BS,Place,real,True,True,29.08.2022,2611900.0,1267744.0


Try to derive the name stem from the street names (`STN_LABEL`) by removing the suffix (strasse, weg, hof, Anlage).

In [6]:
seperators =["strasse", "-Strasse", "weg", "-Weg", "-Anlage", "anlage", "-Promenade", "rain", "gasse", "gässlein", "gässchen",
             "-Steg", "platz", "-Platz", "-Brücke", "-Passage", "graben", "-Graben", "steg", "-Park", "park", "schanze", "tunnel", "kreisel", "ring", "allee"]

In [7]:
import numpy as np

dfBsStam = dfBs.copy() # Otherwise pandas emits a warning when the new columns are set
dfBsStam["Stamm"] = dfBsStam["STN_LABEL"]
dfBsStam["suffix"] = np.nan

dfBsStam.head(3)

Unnamed: 0,STR_ESID,STN_LABEL,ZIP_LABEL,COM_FOSNR,COM_NAME,COM_CANTON,STR_TYPE,STR_STATUS,STR_OFFICIAL,STR_VALID,STR_MODIFIED,STR_EASTING,STR_NORTHING,Stamm,suffix
8671,10251567,Weinlagerstrasse,4056 Basel,2701,Basel,BS,Street,real,True,True,15.08.2022,2610180.0,1269127.0,Weinlagerstrasse,
10035,10256874,Katja Wulff-Anlage,4052 Basel,2701,Basel,BS,Place,real,True,False,16.08.2022,2612763.0,1265337.0,Katja Wulff-Anlage,
10039,10256872,Wibrandis Rosenblatt-Weg,4052 Basel,2701,Basel,BS,Street,real,True,False,16.08.2022,2612778.0,1265286.0,Wibrandis Rosenblatt-Weg,


In [8]:
for seperator in seperators:

    dfBsStam1 = dfBsStam.STN_LABEL.str.removesuffix(seperator)
    dfBsStam2 = pd.DataFrame(dfBsStam1)
    dfBsStam2.rename(columns = {"STN_LABEL":"temp"}, inplace = True)

    dfBsStam = pd.concat([dfBsStam, dfBsStam2], axis=1)

    dfBsStam['suffix'] = np.where(dfBsStam["STN_LABEL"] != dfBsStam["temp"], seperator, dfBsStam["suffix"])
    dfBsStam['Stamm'] = np.where(dfBsStam['STN_LABEL'] != dfBsStam['temp'], dfBsStam['temp'], dfBsStam['Stamm'])

    del dfBsStam['temp']
    

dfBsStam.head(3)

Unnamed: 0,STR_ESID,STN_LABEL,ZIP_LABEL,COM_FOSNR,COM_NAME,COM_CANTON,STR_TYPE,STR_STATUS,STR_OFFICIAL,STR_VALID,STR_MODIFIED,STR_EASTING,STR_NORTHING,Stamm,suffix
8671,10251567,Weinlagerstrasse,4056 Basel,2701,Basel,BS,Street,real,True,True,15.08.2022,2610180.0,1269127.0,Weinlager,strasse
10035,10256874,Katja Wulff-Anlage,4052 Basel,2701,Basel,BS,Place,real,True,False,16.08.2022,2612763.0,1265337.0,Katja Wulff,-Anlage
10039,10256872,Wibrandis Rosenblatt-Weg,4052 Basel,2701,Basel,BS,Street,real,True,False,16.08.2022,2612778.0,1265286.0,Wibrandis Rosenblatt,-Weg
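
The suffix stripping done by the loop above can also be sketched as a single regular expression applied per label. This is only an illustration in plain Python, not the notebook's method: the names `SUFFIXES` and `split_label` are hypothetical, and unlike the loop the pattern requires a non-empty stem, so a label consisting of nothing but a suffix is returned unchanged.

```python
import re

# Same suffix list as in the notebook, under a hypothetical name.
SUFFIXES = ["strasse", "-Strasse", "weg", "-Weg", "-Anlage", "anlage", "-Promenade",
            "rain", "gasse", "gässlein", "gässchen", "-Steg", "platz", "-Platz",
            "-Brücke", "-Passage", "graben", "-Graben", "steg", "-Park", "park",
            "schanze", "tunnel", "kreisel", "ring", "allee"]

# One anchored pattern: a non-empty stem followed by exactly one known suffix at the end.
# Matching is case-sensitive, like str.removesuffix in the loop.
_pattern = re.compile(
    "^(?P<stem>.+?)(?P<suffix>" + "|".join(re.escape(s) for s in SUFFIXES) + ")$"
)


def split_label(label):
    """Return (stem, suffix); suffix is None when no known suffix matches."""
    match = _pattern.match(label)
    if match is None:
        return label, None
    return match.group("stem"), match.group("suffix")


for label in ["Weinlagerstrasse", "Katja Wulff-Anlage", "Kasernenhof"]:
    print(label, "->", split_label(label))
```

"Kasernenhof" stays unsplit because "hof" is not part of the suffix list, which matches the behaviour visible in the Wikidata lookup further below.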


Matching `Stamm` against Wikidata

In [9]:
from SPARQLWrapper import SPARQLWrapper, JSON

In [10]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

In [11]:
def queryWd(sparql, subject):
    #print(wdKey)
    query = """
    SELECT ?subject ?subjectLabel ?instanceLabel WHERE {
      ?subject rdfs:label "%s"@de;
               wdt:P31 ?instance.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "de" . }   
    }
    """ % (subject)
    #print(query)
    
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql
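
The `%s` interpolation in `queryWd` inserts the label verbatim, so a label containing a double quote or a backslash would produce a malformed query. A minimal sketch of escaping the label first is shown here; the helper names `escape_sparql_literal` and `build_label_query` are hypothetical, and the query text is the one used above.

```python
def escape_sparql_literal(value):
    """Escape backslashes, double quotes and line breaks for a double-quoted SPARQL string."""
    return (
        value.replace("\\", "\\\\")   # backslashes first, so later escapes are not doubled
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def build_label_query(label, language="de"):
    """Build the label lookup query with the label safely embedded."""
    return """
    SELECT ?subject ?subjectLabel ?instanceLabel WHERE {
      ?subject rdfs:label "%s"@%s;
               wdt:P31 ?instance.
      SERVICE wikibase:label { bd:serviceParam wikibase:language "%s" . }
    }
    """ % (escape_sparql_literal(label), language, language)


print(build_label_query("Weinlager"))
print(escape_sparql_literal('Zum "Goldenen" Löwen'))
```

The resulting string could be passed to `sparql.setQuery(...)` in place of the inline query.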

In [12]:
i = 0
for x in dfBsStam.index:
    i+=1
    #print(i)
    subject = dfBsStam['Stamm'][x]
    subjectStr = dfBsStam['STN_LABEL'][x]
    #print(f"Subject: {subject}")
    sparql = queryWd(sparql, subject)
    try:
        results = sparql.query()
        #print(results.info())
    except Exception as e:
        # Ideally this would check for status code 429 (Too Many Requests); unfortunately the status code could not be retrieved.
        print("If a status code 429 occurs: retry the request in about 30 seconds")
        print(e)
        break

    result = results.convert()
    #print(result)

    results_df = pd.json_normalize(result['results']['bindings'])
    #print(results_df)

    if not results_df.empty:
        # Only the first binding is used, even if several entities share the label
        wikiQ = results_df['subject.value'][0]
        wikiQLabel = results_df['subjectLabel.value'][0]
        instance = results_df['instanceLabel.value'][0]

        print(f"{subjectStr} | {wikiQLabel}: {wikiQ} -> {instance}")
    else:
        print(f"{subjectStr} | {subject}: No entry found in Wikidata")

    #print(x)
    # Stop after the first 11 rows
    if i > 10:
        break


Weinlagerstrasse | Weinlager: http://www.wikidata.org/entity/Q102011549 -> Innerortsstraße
Katja Wulff-Anlage | Katja Wulff: http://www.wikidata.org/entity/Q43267718 -> Mensch
Wibrandis Rosenblatt-Weg | Wibrandis Rosenblatt: http://www.wikidata.org/entity/Q120460 -> Mensch
Gretel Bolliger-Promenade | Gretel Bolliger: http://www.wikidata.org/entity/Q40127518 -> Mensch
Backstubenweg | Backstuben: No entry found in Wikidata
Lichtnelkenweg | Lichtnelken: No entry found in Wikidata
Kasernenhof | Kasernenhof: No entry found in Wikidata
Helli Stehle-Weg | Helli Stehle: http://www.wikidata.org/entity/Q22687790 -> Mensch
Nachtkerzenweg | Nachtkerzen: http://www.wikidata.org/entity/Q157658 -> Taxon
Beim Wettsteinhäuschen | Beim Wettsteinhäuschen: No entry found in Wikidata
Kabelstrasse | Kabel: http://www.wikidata.org/entity/Q657153 -> Grotesk
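
Instead of aborting the loop on a rate limit, the request could be retried after a pause. The following is a rough sketch under a stated assumption: that the raised exception exposes the HTTP status as a `.code` attribute, as `urllib.error.HTTPError` does. Whether that holds for the exception seen in this notebook is not confirmed. The helper `call_with_retry` and the stand-in exception `FakeTooManyRequests` are hypothetical and only serve the demonstration.

```python
import time


def call_with_retry(func, max_attempts=3, base_delay=30.0, sleep=time.sleep):
    """Call func(); on a 429 error wait and retry, otherwise re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:
            # Assumption: the exception carries the HTTP status in `.code`.
            if getattr(exc, "code", None) != 429 or attempt == max_attempts:
                raise
            # Wait longer after each failed attempt (30 s, 60 s, ...).
            sleep(base_delay * attempt)


class FakeTooManyRequests(Exception):
    """Stand-in for a rate-limit error, used only for this demonstration."""
    code = 429


calls = {"count": 0}


def flaky():
    # Fails twice with a simulated 429, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeTooManyRequests("429 Too Many Requests")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = call_with_retry(flaky, max_attempts=5, sleep=waits.append)
print(result, waits)
```

In the loop above, this would correspond to `results = call_with_retry(sparql.query)`.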


---
Attempt to use spaCy to extract only "Weinlager" from [compounds](https://de.wikipedia.org/wiki/Komposition_%28Grammatik%29) such as "Weinlagerstrasse".

ToDo: https://stackoverflow.com/questions/21515535/split-decompose-german-words-in-python

In [None]:
#!pip install spacy
#!python -m spacy download de_core_news_md

In [None]:
import spacy #Our NLP tools
import de_core_news_md

#Load a German language model to do NLP - the models we use will influence our results a lot
nlp = spacy.load('de_core_news_md')

In [None]:
text = dfBs.STN_LABEL.tolist()
print(text[0])

In [None]:
doc = nlp(text[0])

In [None]:
print(doc)

In [None]:
for token in doc:
    print(f"{token.text:<20}\t{token.lemma_:<20}\t{token.pos_:<6}\t{token.is_stop}")