# Querying vertically partitioned data as if it were one dataset, with identifier matching
Here we explore whether we can query a vertically partitioned dataset (i.e. the datasets share the same sample ID space, but the feature space is split across them) as if it were one dataset.
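
To make the setup concrete, here is a minimal sketch with two hypothetical pandas frames. The frames and their short column names are illustrative assumptions, not the data model used in this notebook. Both frames describe the same people but hold different features, and the identifying columns are named differently in each.

```python
import pandas as pd

# Hypothetical toy partitions: same people, different feature columns,
# and differently named identifier columns.
left = pd.DataFrame({
    "name": ["Corky Crystal", "Vinko Vork"],
    "birthday": ["2013-01-01", "2001-02-02"],
    "nickname": ["Corks", "Vinker"],
})
right = pd.DataFrame({
    "full_name": ["Vinko Vork", "Corky Crystal"],
    "birth_date": ["2001-02-02", "2013-01-01"],
    "death_date": ["2020-03-03", "2020-03-03"],
})

# Match rows on the identifier pairs; the order of the two lists must correspond.
combined = pd.merge(
    left,
    right,
    left_on=["name", "birthday"],
    right_on=["full_name", "birth_date"],
)
print(combined)
```

The result has one row per person, with the features of both partitions side by side.
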
We make quite strong assumptions:
    * Data is tabular and consists of only 1 table
        (i.e. only one type of subject, and predicate-object pairs are properties)
    * We expect the values of the identifying predicates to be in the same format (e.g. <http://www.w3.org/2001/XMLSchema#date> for birthdates).
    * Users want to have all the data (the query does no filtering or projection)
    
Unlike notebooks/1.0-SB-querying-vertically-partitioned-data-simplest-case.ipynb, the identifiers here are mapped differently in each dataset, so they have to be matched explicitly.

In [35]:
from collections import defaultdict
from typing import List

import rdflib
import pandas as pd

## Load sample person data A
Data is specified using the w3.org vcard ontology and, apart from our identifiers (full name and birthday), also includes a nickname and an email address (our features).

In [3]:
data = '''
  @prefix v:  <http://www.w3.org/2006/vcard/ns#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  <http://example.com/me/corky> a v:VCard ;
     v:fn "Corky Crystal" ;
     v:nickname "Corks" ;
     v:email  <mailto:corky@example.com> ;
     v:bday "2013-01-01"^^xsd:date ; .

  <http://example.com/me/vinko> a v:VCard ;
     v:fn "Vinko Vork" ;
     v:nickname "Vinker" ;
     v:email  <mailto:vinko@example.com> ;
     v:bday "2001-02-02"^^xsd:date ; .

'''
graph_a = rdflib.Graph()
result = graph_a.parse(data=data, format='ttl')
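turtle = graph_a.serialize(format="turtle")
# rdflib < 6 returns bytes here, rdflib >= 6 returns str; handle both
print(turtle.decode("utf-8") if isinstance(turtle, bytes) else turtle)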

@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/me/corky> a v:VCard ;
    v:bday "2013-01-01"^^xsd:date ;
    v:email <mailto:corky@example.com> ;
    v:fn "Corky Crystal" ;
    v:nickname "Corks" .

<http://example.com/me/vinko> a v:VCard ;
    v:bday "2001-02-02"^^xsd:date ;
    v:email <mailto:vinko@example.com> ;
    v:fn "Vinko Vork" ;
    v:nickname "Vinker" .




## Load sample person data B
Data is specified using the dbpedia ontology and, apart from our identifiers (full name and birthday), also includes a death date (our only feature in this dataset).

In [9]:
data = '''
  @prefix dbo: <http://dbpedia.org/ontology#> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  <http://example.com/1> a dbo:Person ;
     dbo:name "Corky Crystal" ;
     dbo:birthDate "2013-01-01"^^xsd:date ;
     dbo:deathDate "2020-03-03"^^xsd:date ; .

  <http://example.com/2> a dbo:Person ;
     dbo:name "Vinko Vork" ;
     dbo:birthDate "2001-02-02"^^xsd:date ;
     dbo:deathDate "2020-03-03"^^xsd:date ; .
'''
graph_b = rdflib.Graph()
result = graph_b.parse(data=data, format='ttl')
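turtle = graph_b.serialize(format="turtle")
# rdflib < 6 returns bytes here, rdflib >= 6 returns str; handle both
print(turtle.decode("utf-8") if isinstance(turtle, bytes) else turtle)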

@prefix dbo: <http://dbpedia.org/ontology#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/1> a dbo:Person ;
    dbo:birthDate "2013-01-01"^^xsd:date ;
    dbo:deathDate "2020-03-03"^^xsd:date ;
    dbo:name "Corky Crystal" .

<http://example.com/2> a dbo:Person ;
    dbo:birthDate "2001-02-02"^^xsd:date ;
    dbo:deathDate "2020-03-03"^^xsd:date ;
    dbo:name "Vinko Vork" .




In [37]:
class VerticalQueryClient():
    """
    Client for querying a vertically partitioned dataset.
    Makes some strong assumptions:
    * Data is tabular and consists of only 1 table
        (i.e. only one type of subject, and predicate-object pairs are properties)
    * Subjects are matched across data sources on the values of user-specified
        identifying predicates (e.g. name and birthday), not on the subject URI
    """
    def __init__(self, left_graph: rdflib.Graph, right_graph: rdflib.Graph):
        self.left_graph, self.right_graph = left_graph, right_graph
        
    @staticmethod
    def _sparql_to_pandas(result):
        """
        Convert sparql result to pandas. Group all properties (predicate-object pairs)
        for corresponding subject. Set subject as index.
        """
        subject2property = defaultdict(dict)
        for s, p, o in result:
            subject2property[str(s)][str(p)] = o

        data = list()
        for subj, properties in subject2property.items():
            properties['subj'] = subj
            data.append(properties)
        df = pd.DataFrame(data)
        df = df.set_index('subj', drop=True)
        return df

    def query(self, left_on: List[str], right_on: List[str]):
        """
        Query vertically partitioned data. Select all data from different data sources.
        Convert data to pandas DataFrame by grouping all properties (predicate-object pairs) for
        corresponding subjects even though they might come from different data sources. User should
        specify on which predicates data should be merged.
        
        Args:
            left_on: merge left graph on these predicates (order matters!)
            right_on: merge right graph on these predicates (order matters!)
        """
        q = '''
            SELECT ?s ?p ?o
            WHERE {?s ?p ?o .}
        '''
        result = self.left_graph.query(q)
        left_df = self._sparql_to_pandas(result)
        
        result = self.right_graph.query(q)
        right_df = self._sparql_to_pandas(result)
        # pd.merge defaults to an inner join: subjects without a match in both graphs are dropped
        return pd.merge(left_df, right_df, left_on=left_on, right_on=right_on)
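
Because `pd.merge` performs an inner join by default, subjects that appear in only one source are dropped from the result. The sketch below uses hypothetical toy frames (the person "Nobody Known" and the short column names are made up for illustration) to show how `how="outer"` together with `indicator=True` keeps unmatched rows and records where each row came from.

```python
import pandas as pd

# Hypothetical toy frames: one person matches, one is unique to each side.
left = pd.DataFrame({
    "fn": ["Corky Crystal", "Vinko Vork"],
    "bday": ["2013-01-01", "2001-02-02"],
    "nickname": ["Corks", "Vinker"],
})
right = pd.DataFrame({
    "name": ["Corky Crystal", "Nobody Known"],
    "birthDate": ["2013-01-01", "1990-09-09"],
    "deathDate": ["2020-03-03", "2020-03-03"],
})

merged = pd.merge(
    left,
    right,
    left_on=["fn", "bday"],
    right_on=["name", "birthDate"],
    how="outer",      # keep rows that have no partner on the other side
    indicator=True,   # adds a "_merge" column: both / left_only / right_only
)
match_counts = merged["_merge"].value_counts().to_dict()
print(merged)
print(match_counts)
```

Whether unmatched subjects should be kept depends on the use case; the `_merge` column at least makes the loss visible instead of silent.
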

In [38]:
client = VerticalQueryClient(left_graph=graph_a, right_graph=graph_b)

Now we query our vertically partitioned dataset by providing the predicates that we want to merge on (i.e. name and birthday, but mapped onto different ontologies).

In [39]:
client.query(left_on=['http://www.w3.org/2006/vcard/ns#fn', 'http://www.w3.org/2006/vcard/ns#bday'],
             right_on=['http://dbpedia.org/ontology#name', 'http://dbpedia.org/ontology#birthDate'])

Unnamed: 0,http://www.w3.org/2006/vcard/ns#nickname,http://www.w3.org/2006/vcard/ns#bday,http://www.w3.org/1999/02/22-rdf-syntax-ns#type_x,http://www.w3.org/2006/vcard/ns#email,http://www.w3.org/2006/vcard/ns#fn,http://www.w3.org/1999/02/22-rdf-syntax-ns#type_y,http://dbpedia.org/ontology#deathDate,http://dbpedia.org/ontology#name,http://dbpedia.org/ontology#birthDate
0,Corks,2013-01-01,http://www.w3.org/2006/vcard/ns#VCard,mailto:corky@example.com,Corky Crystal,http://dbpedia.org/ontology#Person,2020-03-03,Corky Crystal,2013-01-01
1,Vinker,2001-02-02,http://www.w3.org/2006/vcard/ns#VCard,mailto:vinko@example.com,Vinko Vork,http://dbpedia.org/ontology#Person,2020-03-03,Vinko Vork,2001-02-02
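
The merged frame repeats the identifier columns and uses full predicate URIs as column names. As a possible tidy-up step, the sketch below drops a duplicated identifier column and shortens the remaining names. The helper `local_name` is hypothetical, and the frame is a cut-down copy of the output above.

```python
import pandas as pd


def local_name(uri: str) -> str:
    """Return the part of a URI after the last '#' or '/' (hypothetical helper)."""
    return uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


# Cut-down copy of the merged output, for illustration only.
merged = pd.DataFrame({
    "http://www.w3.org/2006/vcard/ns#nickname": ["Corks", "Vinker"],
    "http://www.w3.org/2006/vcard/ns#fn": ["Corky Crystal", "Vinko Vork"],
    "http://dbpedia.org/ontology#name": ["Corky Crystal", "Vinko Vork"],
    "http://dbpedia.org/ontology#deathDate": ["2020-03-03", "2020-03-03"],
})

# After an inner merge the right-hand identifier duplicates the left-hand one.
tidy = merged.drop(columns=["http://dbpedia.org/ontology#name"])
# Replace each full predicate URI with its local name.
tidy = tidy.rename(columns=local_name)
print(tidy)
```

Local names can collide across ontologies (two vocabularies may both define `name`), so shortening is only safe once the duplicated columns are gone.
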
