In [1]:
%cd ..

/Users/lisaschmidt/Documents/GitHub/data-engineering-showcase


# Dataset 1

### Downloading data

In [2]:
from urllib.request import urlretrieve

In [3]:
url = ("https://mobilithek.info/mdp-api/files/aux/573356838940979200/moin-2022-05-02.1-20220502.131229-1.ttl.bz2")
filename = "data/city_connections.ttl.bz2"

In [4]:
urlretrieve(url, filename)

('data/city_connections.ttl.bz2', <http.client.HTTPMessage at 0x10896b7d0>)

In [5]:
import bz2

In [6]:
compressed = bz2.BZ2File(filename) # open the bz2-compressed file
data = compressed.read() # read and decompress the whole file into memory
newfilepath = filename[:-4] # strip the ".bz2" suffix (assumes the path ends with .bz2)
open(newfilepath, 'wb').write(data) # write an uncompressed file; returns the number of bytes written

167540483
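The cell above holds the full decompressed file (about 167 MB) in memory before writing it. A minimal sketch of a streaming alternative, assuming only the standard library; the helper name `decompress_bz2` and its `chunk_size` parameter are hypothetical and not part of the notebook:

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: str, dst: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream-decompress `src` into `dst` and return the size of `dst` in bytes."""
    with bz2.open(src, "rb") as compressed, open(dst, "wb") as out:
        # copyfileobj copies in chunks, so memory use stays bounded
        shutil.copyfileobj(compressed, out, length=chunk_size)
    return Path(dst).stat().st_size
```

Both files are closed by the `with` block, which the one-line `open(...).write(...)` form does not guarantee.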

In [7]:
import pyoxigraph as graph
import pandas as pd

In [8]:
l = list(graph.parse(newfilepath, "text/turtle", base_iri="http://example.com/"))

### An Overview of the Data Structure

In [9]:
for t in l[5:15]:
    print( t.subject, t.predicate, t.object)

<http://moin-project.org/data/Bremerhaven> <http://moin-project.org/ontology/connectedTo> <http://moin-project.org/data/Marl> <http://moin-project.org/ontology/hasTrip> _:9a646dd42746eda5bca4347d24270ee8
_:9a646dd42746eda5bca4347d24270ee8 <http://moin-project.org/ontology/transportType> <http://moin-project.org/ontology/train>
_:9a646dd42746eda5bca4347d24270ee8 <http://moin-project.org/ontology/startTime> "10:42:00"^^<http://www.w3.org/2001/XMLSchema#time>
_:9a646dd42746eda5bca4347d24270ee8 <http://moin-project.org/ontology/endTime> "16:02:00"^^<http://www.w3.org/2001/XMLSchema#time>
_:9a646dd42746eda5bca4347d24270ee8 <http://moin-project.org/ontology/duration> "PT320M"^^<http://www.w3.org/2001/XMLSchema#duration>
<http://moin-project.org/data/Bremerhaven> <http://moin-project.org/ontology/connectedTo> <http://moin-project.org/data/Marl> <http://moin-project.org/ontology/hasTrip> _:d41001eeaf7f4bb01f680bd00890a8d9
_:d41001eeaf7f4bb01f680bd00890a8d9 <http://moin-project.org/ontology/tra

This graph has different types of triples that are useful to us:

1. A triple which describes that two cities are connected by a trip:

    ```<http://moin-project.org/data/Bremerhaven> <http://moin-project.org/ontology/connectedTo> <http://moin-project.org/data/Marl> <http://moin-project.org/ontology/hasTrip> _:77af2dc5c57b5384d50fd429a6ae23e4```
    
    Note that the subject `<http://moin-project.org/data/Bremerhaven> <http://moin-project.org/ontology/connectedTo> <http://moin-project.org/data/Marl>` is itself a triple. `hasTrip` then links this "city-connection" to a unique trip identifier, here `_:77af2dc5c57b5384d50fd429a6ae23e4`.

2. A triple which describes a property of a trip (e.g. transport type, travel time or distance):

    ```_:77af2dc5c57b5384d50fd429a6ae23e4 <http://moin-project.org/ontology/transportType> <http://moin-project.org/ontology/train>```
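The two triple shapes can be modelled in plain Python to make the nesting explicit. This is only an illustrative stand-in: the `Triple` class below is hypothetical and is not the pyoxigraph type, and the helper `is_trip_declaration` is an assumption made for the example.

```python
from typing import NamedTuple, Union

MOIN = "http://moin-project.org/"


class Triple(NamedTuple):
    """Plain-Python stand-in for an RDF triple (not a pyoxigraph class)."""
    subject: Union[str, "Triple"]
    predicate: str
    object: str


# Shape 1: the subject is itself a triple (the "city-connection")
connection = Triple(MOIN + "data/Bremerhaven", MOIN + "ontology/connectedTo", MOIN + "data/Marl")
trip = Triple(connection, MOIN + "ontology/hasTrip", "_:77af2dc5c57b5384d50fd429a6ae23e4")

# Shape 2: the subject is the trip identifier, the object is a property value
trip_property = Triple(trip.object, MOIN + "ontology/transportType", MOIN + "ontology/train")


def is_trip_declaration(triple: Triple) -> bool:
    """True for shape 1, where the subject is a nested triple."""
    return isinstance(triple.subject, Triple)
```

The trip identifier is the join key: it is the object of a shape 1 triple and the subject of every shape 2 triple that describes the same trip.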

In [10]:
# a connection between two cities
l[0].subject

<Triple subject=<NamedNode value=http://moin-project.org/data/Bremerhaven> predicate=<NamedNode value=http://moin-project.org/ontology/connectedTo> object=<NamedNode value=http://moin-project.org/data/Marl>>

In [11]:
# a triple that states a connection between two cities and links it to a unique trip identifier
l[0]

<Triple subject=<Triple subject=<NamedNode value=http://moin-project.org/data/Bremerhaven> predicate=<NamedNode value=http://moin-project.org/ontology/connectedTo> object=<NamedNode value=http://moin-project.org/data/Marl>> predicate=<NamedNode value=http://moin-project.org/ontology/hasTrip> object=<BlankNode value=d18152a12030057d83065b2d64fdce43>>

In [12]:
print("There are ", len(l), " triples in the database.")

There are  215136  triples in the database.


In [13]:
connections = [t.subject for t in l]
for i in range(20):
    print(i, connections[i])

0 <http://moin-project.org/data/Bremerhaven> <http://moin-project.org/ontology/connectedTo> <http://moin-project.org/data/Marl>
1 _:d18152a12030057d83065b2d64fdce43
2 _:d18152a12030057d83065b2d64fdce43
3 _:d18152a12030057d83065b2d64fdce43
4 _:d18152a12030057d83065b2d64fdce43
5 <http://moin-project.org/data/Bremerhaven> <http://moin-project.org/ontology/connectedTo> <http://moin-project.org/data/Marl>
6 _:9a646dd42746eda5bca4347d24270ee8
7 _:9a646dd42746eda5bca4347d24270ee8
8 _:9a646dd42746eda5bca4347d24270ee8
9 _:9a646dd42746eda5bca4347d24270ee8
10 <http://moin-project.org/data/Bremerhaven> <http://moin-project.org/ontology/connectedTo> <http://moin-project.org/data/Marl>
11 _:d41001eeaf7f4bb01f680bd00890a8d9
12 _:d41001eeaf7f4bb01f680bd00890a8d9
13 _:d41001eeaf7f4bb01f680bd00890a8d9
14 _:d41001eeaf7f4bb01f680bd00890a8d9
15 <http://moin-project.org/data/Bremerhaven> <http://moin-project.org/ontology/connectedTo> <http://moin-project.org/data/Marl>
16 _:61ec8232c8a018afb59a839bdfcea672


We can see above that there are several options for travelling between Bremerhaven and Marl. In this project, we want to compare these travel options, with a special interest in comparing travel by car to travel by train.

### Overview of the Contents of the Data

In [14]:
predicates = [t.predicate for t in l]
pd.DataFrame(predicates).value_counts()

<http://moin-project.org/ontology/duration>               40515
<http://moin-project.org/ontology/hasTrip>                40515
<http://moin-project.org/ontology/transportType>          40515
<http://moin-project.org/ontology/startTime>              30524
<http://moin-project.org/ontology/endTime>                30461
<http://moin-project.org/ontology/connectedTo>            10159
<http://moin-project.org/ontology/route>                   9991
<http://moin-project.org/ontology/drivingDistance>         9991
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>           420
<http://www.w3.org/2000/01/rdf-schema#label>                233
<http://www.wikidata.org/prop/direct/P625>                  226
<http://www.wikidata.org/prop/direct/P17>                   117
<http://www.wikidata.org/prop/direct/P31>                   117
<http://www.wikidata.org/prop/direct/P239>                  117
<http://www.opengis.net/ont/geosparql#asWKT>                103
<http://www.opengis.net/ont/geosparql#de

The most interesting predicates for this project are the `moin-project` predicates:
- duration
- hasTrip
- transportType
- connectedTo
- route

The other `moin-project` predicates (startTime, endTime, drivingDistance, nearestAirport) are currently not relevant.
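The short names used in the list above are the last segment of each predicate IRI. A small sketch of how such names could be derived, assuming the IRIs end in either `/name` or `#name`; the helper `local_name` is hypothetical:

```python
def local_name(iri: str) -> str:
    """Return the part of an IRI after the last '#' or, failing that, the last '/'."""
    iri = iri.strip("<>")  # tolerate the <...> form used in printed triples
    for separator in ("#", "/"):
        if separator in iri:
            return iri.rsplit(separator, 1)[1]
    return iri


print(local_name("<http://moin-project.org/ontology/duration>"))
print(local_name("http://www.w3.org/2000/01/rdf-schema#label"))
```

The same function could later shorten the column names of the trip table, which otherwise carry the full predicate IRIs.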

There are other predicates that we should also investigate, for example the `wikidata` predicates:

- P625 = Coordinate Information
- P17 = country
- P31 = instance of
- P239 = ICAO Airport Code
- P94 = Coat of arms image
- P131 = located in the administrative territorial entity
- P1082 = Population
- P856 = official website
- P2046 = area
- P41 = flag image
- P238 = IATA Airport Code

##### Now let's take a look at the remaining predicates:

- <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
- <http://www.w3.org/2000/01/rdf-schema#label>   
- <http://www.w3.org/2002/07/owl#sameAs>
- <http://www.w3.org/2000/01/rdf-schema#comment>


- <http://schema.org/about>


- <http://www.opengis.net/ont/geosparql#asWKT>            
- <http://www.opengis.net/ont/geosparql#defaultGeometry>  

In [15]:
from pyoxigraph import Store

In [16]:
store = Store()
store.load('./data/city_connections.ttl', mime_type="text/turtle")

In [17]:
PREDICATE = "http://www.opengis.net/ont/geosparql#defaultGeometry"

result = store.query("SELECT ?s ?o WHERE { ?s <"+PREDICATE+"> ?o}")
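The notebook builds this kind of query by string concatenation in several cells. A hedged sketch of a small helper that builds the same query text and rejects characters that are not allowed inside a SPARQL IRI reference; the function name `select_by_predicate` is hypothetical:

```python
# Characters that may not appear inside <...> in SPARQL
_FORBIDDEN_IN_IRI = set('<>"{}|^`\\')


def select_by_predicate(predicate_iri: str) -> str:
    """Build 'SELECT ?s ?o' for all triples that use the given predicate."""
    if any(ch in _FORBIDDEN_IN_IRI or ch <= " " for ch in predicate_iri):
        raise ValueError(f"not a valid IRI: {predicate_iri!r}")
    return f"SELECT ?s ?o WHERE {{ ?s <{predicate_iri}> ?o }}"


query = select_by_predicate("http://www.opengis.net/ont/geosparql#defaultGeometry")
print(query)
```

The resulting string can be passed to `store.query(...)` in the same way as the concatenated version.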

In [18]:
[s for s in result][:10]

[<QuerySolution s=<NamedNode value=http://moin-project.org/data/Moers> o=<BlankNode value=1c73738e92301798d11a6be30c38700>>,
 <QuerySolution s=<NamedNode value=http://moin-project.org/data/Minden> o=<BlankNode value=6419ac8c77de1566f93beae2390bb5f>>,
 <QuerySolution s=<NamedNode value=http://moin-project.org/data/Pforzheim> o=<BlankNode value=bf2180e6774d61f18860db596145930>>,
 <QuerySolution s=<NamedNode value=http://moin-project.org/data/Dessau-Ro%C3%9Flau> o=<BlankNode value=c79446258a36dca960c54090c93f29f>>,
 <QuerySolution s=<NamedNode value=http://moin-project.org/data/Solingen> o=<BlankNode value=dcadc20a22f077f8994bde1f4cf29f9>>,
 <QuerySolution s=<NamedNode value=http://moin-project.org/data/W%C3%BCrzburg> o=<BlankNode value=f8cf675e192fbe43555861e707412e9>>,
 <QuerySolution s=<NamedNode value=http://moin-project.org/data/Frankfurt%20am%20Main> o=<BlankNode value=119f0f3867cb88df0112732c5552ecc7>>,
 <QuerySolution s=<NamedNode value=http://moin-project.org/data/Sterkrade> o=<B

- <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>     
    The type of a resource, for example whether it is an airport.
- <http://www.w3.org/2000/01/rdf-schema#label>   
    This predicate relates an IRI to a literal, for example the IRI of a city to the city's name.
- <http://www.w3.org/2002/07/owl#sameAs>    
    The different IRIs of the same object are related through this predicate.
- <http://www.w3.org/2000/01/rdf-schema#comment>    
    The comment field contains a short description of different cities.
- <http://schema.org/about>      
    This relation links a city's Wikipedia article to the city's IRI.
- <http://www.opengis.net/ont/geosparql#asWKT>      
    The WKT literal contains the coordinates of an object.
- <http://www.opengis.net/ont/geosparql#defaultGeometry>    
    This predicate connects a city's IRI to a blank node that represents its geometry (see the query result above).
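WKT literals are plain text, so the coordinates have to be parsed before they can be used. A minimal sketch, assuming literals of the form `POINT(x y)` or `LINESTRING(x y, x y, ...)` like the `route` values in this dataset, where the pairs appear to be longitude followed by latitude. The function `parse_wkt_coordinates` is hypothetical, and it does not handle other geometry types or a leading CRS IRI.

```python
import re
from typing import List, Tuple

_WKT_PATTERN = re.compile(r"^\s*(POINT|LINESTRING)\s*\((.*)\)\s*$", re.IGNORECASE | re.DOTALL)


def parse_wkt_coordinates(wkt: str) -> List[Tuple[float, float]]:
    """Return the (x, y) pairs of a POINT or LINESTRING WKT literal."""
    match = _WKT_PATTERN.match(wkt)
    if match is None:
        raise ValueError(f"unsupported WKT literal: {wkt!r}")
    coordinates = []
    for pair in match.group(2).split(","):
        x, y = pair.split()[:2]  # ignore an optional third (z) value
        coordinates.append((float(x), float(y)))
    return coordinates


print(parse_wkt_coordinates("LINESTRING(7.46417 51.51505, 7.46177 51.51490)"))
```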

## A Plan for Bringing the Data into the Desired Format

### Requirements:
- One table with one row per trip and the **trip ID as primary key**
- Columns: 
    - duration
    - hasTrip
    - transportType
    - connectedTo
    - route

### Approach
1. Read all trips into a structure and match them with start IRI and destination IRI

2. Find all trip properties by the trip IRI and write them to the trips  

3. Find the labels of start and end destination and replace IRIs with labels 
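The three steps can be sketched on a handful of toy triples before running them on the full graph. Everything below is illustrative: the triples are plain tuples with shortened names rather than pyoxigraph objects, and `build_trip_table` is a hypothetical helper.

```python
# Toy triples as (subject, predicate, object); a nested tuple stands in
# for the "city-connection" triple that is the subject of hasTrip.
toy_triples = [
    (("data/Bremerhaven", "connectedTo", "data/Marl"), "hasTrip", "trip1"),
    ("trip1", "transportType", "train"),
    ("trip1", "duration", "PT320M"),
    ("data/Bremerhaven", "label", "Bremerhaven"),
    ("data/Marl", "label", "Marl"),
]

TRIP_PROPERTIES = ("duration", "transportType", "route")


def build_trip_table(triples):
    trips = {}
    # Step 1: one entry per trip, keyed by trip id, with start and destination
    for subject, predicate, obj in triples:
        if predicate == "hasTrip":
            start, _, end = subject
            trips[obj] = {"start": start, "end": end}
    # Step 2: attach the trip properties to the matching trip
    for subject, predicate, obj in triples:
        if predicate in TRIP_PROPERTIES and subject in trips:
            trips[subject][predicate] = obj
    # Step 3: replace start and destination identifiers with their labels
    labels = {s: o for s, p, o in triples if p == "label"}
    for row in trips.values():
        row["start"] = labels.get(row["start"], row["start"])
        row["end"] = labels.get(row["end"], row["end"])
    return trips


print(build_trip_table(toy_triples))
```

Keying the dictionary by trip id is what makes the trip the primary key once the result is turned into a table.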


## Data Transformations

In [19]:
type(l[0].subject.predicate)

pyoxigraph.NamedNode

In [20]:
connected_to = graph.NamedNode("http://moin-project.org/ontology/connectedTo")
has_trip = graph.NamedNode("http://moin-project.org/ontology/hasTrip")

In [21]:
#PREDICATE = "http://moin-project.org/ontology/connectedTo"
PREDICATE = "http://moin-project.org/ontology/hasTrip"

result = store.query("SELECT ?s ?o WHERE { ?s <"+PREDICATE+"> ?o}")

#### (1) Read all trips into a structure and match them with start IRI and destination IRI

In [22]:
connections = {}
for triple in l:
    try:
        # hasTrip triples have a connectedTo triple as their subject
        if triple.predicate == has_trip:
            connection = triple.subject
            trip_id = triple.object
            connections[trip_id.value] = {
                "iri_start": str(connection.subject.value),
                "iri_end": str(connection.object.value),
            }
    except AttributeError:
        print("Error with triple: ", triple)

In [23]:
df_connections = pd.DataFrame(connections).T
df_connections

Unnamed: 0,iri_start,iri_end
d18152a12030057d83065b2d64fdce43,http://moin-project.org/data/Bremerhaven,http://moin-project.org/data/Marl
9a646dd42746eda5bca4347d24270ee8,http://moin-project.org/data/Bremerhaven,http://moin-project.org/data/Marl
d41001eeaf7f4bb01f680bd00890a8d9,http://moin-project.org/data/Bremerhaven,http://moin-project.org/data/Marl
61ec8232c8a018afb59a839bdfcea672,http://moin-project.org/data/Bremerhaven,http://moin-project.org/data/Marl
52f39a963fa14381183c854d8c280c54,http://moin-project.org/data/Dortmund,http://moin-project.org/data/Karlsruhe
...,...,...
b21de08ef1c7981ec4493c55c857102a,http://moin-project.org/data/Chemnitz,http://moin-project.org/data/Osnabr%C3%BCck
a0fc7c80a605478616051b3931f233cc,http://moin-project.org/data/Halle%20%28Saale%29,http://moin-project.org/data/Bielefeld
8fb712ec320d5e1598a050b1285e4074,http://moin-project.org/data/Halle%20%28Saale%29,http://moin-project.org/data/Bielefeld
e544d69fa55592081dc597ecd64883d1,http://moin-project.org/data/Halle%20%28Saale%29,http://moin-project.org/data/Bielefeld


#### (2) Find all trip properties by the trip IRI and write them to the trips  

In [24]:
duration = graph.NamedNode("http://moin-project.org/ontology/duration")
transport_type = graph.NamedNode("http://moin-project.org/ontology/transportType")
route = graph.NamedNode("http://moin-project.org/ontology/route")
drivingDistance = graph.NamedNode("http://moin-project.org/ontology/drivingDistance")

In [25]:
trip_properties = [duration, transport_type, route, drivingDistance]
for triple in l:
    try:
        if triple.predicate in trip_properties:
            trip_id = triple.subject.value
            # store the value under the full predicate IRI as the column name
            connections[trip_id][str(triple.predicate.value)] = str(triple.object.value)
    except AttributeError:
        print("Error with triple: ", triple)

In [26]:
df_connections = pd.DataFrame(connections).T
df_connections

Unnamed: 0,iri_start,iri_end,http://moin-project.org/ontology/transportType,http://moin-project.org/ontology/route,http://moin-project.org/ontology/duration,http://moin-project.org/ontology/drivingDistance
d18152a12030057d83065b2d64fdce43,http://moin-project.org/data/Bremerhaven,http://moin-project.org/data/Marl,http://moin-project.org/ontology/car,LINESTRING(8.586580000000001 53.55175000000000...,PT9134.0S,288831.37934856815e0
9a646dd42746eda5bca4347d24270ee8,http://moin-project.org/data/Bremerhaven,http://moin-project.org/data/Marl,http://moin-project.org/ontology/train,,PT320M,
d41001eeaf7f4bb01f680bd00890a8d9,http://moin-project.org/data/Bremerhaven,http://moin-project.org/data/Marl,http://moin-project.org/ontology/train,,PT385M,
61ec8232c8a018afb59a839bdfcea672,http://moin-project.org/data/Bremerhaven,http://moin-project.org/data/Marl,http://moin-project.org/ontology/train,,PT322M,
52f39a963fa14381183c854d8c280c54,http://moin-project.org/data/Dortmund,http://moin-project.org/data/Karlsruhe,http://moin-project.org/ontology/car,"LINESTRING(7.46417 51.51505, 7.461770000000000...",PT12015.0S,355366.2358930178e0
...,...,...,...,...,...,...
b21de08ef1c7981ec4493c55c857102a,http://moin-project.org/data/Chemnitz,http://moin-project.org/data/Osnabr%C3%BCck,http://moin-project.org/ontology/train,,PT320M,
a0fc7c80a605478616051b3931f233cc,http://moin-project.org/data/Halle%20%28Saale%29,http://moin-project.org/data/Bielefeld,http://moin-project.org/ontology/car,LINESTRING(11.970030000000001 51.4824400000000...,PT10854.0S,344566.4962791281e0
8fb712ec320d5e1598a050b1285e4074,http://moin-project.org/data/Halle%20%28Saale%29,http://moin-project.org/data/Bielefeld,http://moin-project.org/ontology/train,,PT231M,
e544d69fa55592081dc597ecd64883d1,http://moin-project.org/data/Halle%20%28Saale%29,http://moin-project.org/data/Bielefeld,http://moin-project.org/ontology/train,,PT192M,
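The duration column mixes units: train trips are given in minutes (`PT320M`) and car trips in seconds (`PT9134.0S`). To compare them, the values need a common unit. A minimal sketch that converts the time part of an `xsd:duration` string to minutes, assuming only hours, minutes and seconds occur (no days, months or years); `duration_to_minutes` is a hypothetical helper:

```python
import re

_DURATION = re.compile(
    r"^PT"
    r"(?:(?P<hours>\d+(?:\.\d+)?)H)?"
    r"(?:(?P<minutes>\d+(?:\.\d+)?)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?$"
)


def duration_to_minutes(value: str) -> float:
    """Convert a duration such as 'PT320M' or 'PT9134.0S' to minutes."""
    match = _DURATION.match(value)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours = float(match["hours"] or 0)
    minutes = float(match["minutes"] or 0)
    seconds = float(match["seconds"] or 0)
    return hours * 60 + minutes + seconds / 60


print(duration_to_minutes("PT320M"))
print(round(duration_to_minutes("PT9134.0S"), 1))
```

Applied to the first two rows above, this puts the car trip from Bremerhaven to Marl at roughly 152 minutes against 320 minutes for the first train trip.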


In [27]:
df_connections["http://moin-project.org/ontology/transportType"].value_counts()

http://moin-project.org/ontology/transportType
http://moin-project.org/ontology/train     29106
http://moin-project.org/ontology/car        9991
http://moin-project.org/ontology/flight     1418
Name: count, dtype: int64

#### (3) Find the labels of start and end destination and replace IRIs with labels 

In [28]:
PREDICATE = "http://www.w3.org/2000/01/rdf-schema#label"

In [29]:
IRI2Label = {}

In [30]:
result = store.query("SELECT ?subject ?object WHERE { ?subject <"+PREDICATE+"> ?object}")

for r in result:
    IRI2Label[str(r["subject"].value)] = r["object"].value
    #print(r["subject"].value)
    #print(r["object"].value)

In [31]:
def replace_with_label(iri):
    # dict lookup; returns None if the IRI has no label
    return IRI2Label.get(iri)
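`replace_with_label` returns `None` for any IRI without an `rdfs:label`. A hedged sketch of a variant that falls back to the percent-decoded last path segment of the IRI, assuming the city IRIs follow the `http://moin-project.org/data/<name>` pattern seen above; `label_or_local_name` is a hypothetical helper and takes the label mapping as an argument:

```python
from urllib.parse import unquote


def label_or_local_name(iri: str, labels: dict) -> str:
    """Return the label for `iri`, or its decoded last path segment if none exists."""
    if iri in labels:
        return labels[iri]
    return unquote(iri.rsplit("/", 1)[-1])


print(label_or_local_name("http://moin-project.org/data/Osnabr%C3%BCck", {}))
print(label_or_local_name("http://moin-project.org/data/Halle%20%28Saale%29", {}))
```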

In [32]:
df_connections["iri_start"].apply(replace_with_label)

d18152a12030057d83065b2d64fdce43      Bremerhaven
9a646dd42746eda5bca4347d24270ee8      Bremerhaven
d41001eeaf7f4bb01f680bd00890a8d9      Bremerhaven
61ec8232c8a018afb59a839bdfcea672      Bremerhaven
52f39a963fa14381183c854d8c280c54         Dortmund
                                        ...      
b21de08ef1c7981ec4493c55c857102a         Chemnitz
a0fc7c80a605478616051b3931f233cc    Halle (Saale)
8fb712ec320d5e1598a050b1285e4074    Halle (Saale)
e544d69fa55592081dc597ecd64883d1    Halle (Saale)
b1d5286589ca7826c25db28275a6ecc2    Halle (Saale)
Name: iri_start, Length: 40515, dtype: object

In [33]:
df_connections["iri_end"].apply(replace_with_label)

d18152a12030057d83065b2d64fdce43         Marl
9a646dd42746eda5bca4347d24270ee8         Marl
d41001eeaf7f4bb01f680bd00890a8d9         Marl
61ec8232c8a018afb59a839bdfcea672         Marl
52f39a963fa14381183c854d8c280c54    Karlsruhe
                                      ...    
b21de08ef1c7981ec4493c55c857102a    Osnabrück
a0fc7c80a605478616051b3931f233cc    Bielefeld
8fb712ec320d5e1598a050b1285e4074    Bielefeld
e544d69fa55592081dc597ecd64883d1    Bielefeld
b1d5286589ca7826c25db28275a6ecc2    Bielefeld
Name: iri_end, Length: 40515, dtype: object

In [34]:
def replace_transport_iri_with_label(transport_type_iri):
    if transport_type_iri == "http://moin-project.org/ontology/train":
        return "train"
    if transport_type_iri == "http://moin-project.org/ontology/car":
        return "car"
    if transport_type_iri == "http://moin-project.org/ontology/flight":
        return "flight"

In [35]:
df_connections["http://moin-project.org/ontology/transportType"].apply(replace_transport_iri_with_label)

d18152a12030057d83065b2d64fdce43      car
9a646dd42746eda5bca4347d24270ee8    train
d41001eeaf7f4bb01f680bd00890a8d9    train
61ec8232c8a018afb59a839bdfcea672    train
52f39a963fa14381183c854d8c280c54      car
                                    ...  
b21de08ef1c7981ec4493c55c857102a    train
a0fc7c80a605478616051b3931f233cc      car
8fb712ec320d5e1598a050b1285e4074    train
e544d69fa55592081dc597ecd64883d1    train
b1d5286589ca7826c25db28275a6ecc2    train
Name: http://moin-project.org/ontology/transportType, Length: 40515, dtype: object