### Plan

1. Download a batch of World Bank document metadata using the World Bank API
2. Build ontology using the metadata
3. Build a knowledge graph (KG) by populating ontology with instances of World Bank documents (each row in the metadata)
4. Bring in some additional data from Wikidata
5. Query KG using SPARQL
6. Compare the different ways of interacting with data: using SPARQL queries on the RDF, putting raw metadata into LlamaIndex, and putting the RDF data into LlamaIndex

See the accompanying blog post on Medium for full documentation.

## 1. Download some World Bank document metadata

In [492]:
import requests
import json
import pandas as pd
import ast

url = 'https://search.worldbank.org/api/v2/wds'
params = {
    'format': 'json',
    'display_title': '"sustainable development"',
    'rows': 100, #Can adjust this to get more/less data
    'page': 1
}

metadata_list = []

for i in range(1):  # Increase the range to fetch more pages
    response = requests.get(url, params=params)
    response.raise_for_status()  # Stop early on an HTTP error
    data = response.json()
    for doc_id in data['documents']:
        metadata = data['documents'][doc_id]
        metadata_list.append(metadata)
        
    params['page'] += 1
    
df = pd.DataFrame(metadata_list)


In [493]:
#Save raw data to a csv so we can compare results later
df.to_csv("raw.csv")
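
The loop above fetches a single page. A minimal sketch of fetching several pages follows. It assumes the response keeps documents under `documents` (as in the cell above) and that any non-document entry is keyed `facets`; that key name is an assumption, not something confirmed here. The names `flatten_documents`, `fetch_pages`, and `get_json` are hypothetical.

```python
def flatten_documents(payload):
    """Return the per-document metadata dicts from one parsed API response.

    Assumption: an entry keyed 'facets', or any entry that is not a dict,
    is not a document and is skipped.
    """
    documents = payload.get("documents", {})
    return [
        meta for doc_id, meta in documents.items()
        if doc_id != "facets" and isinstance(meta, dict)
    ]


def fetch_pages(get_json, n_pages, rows=100, title='"sustainable development"'):
    """Collect metadata from n_pages pages.

    get_json is any callable that takes the request parameters and returns
    the parsed JSON response, which keeps this function free of network code.
    """
    records = []
    for page in range(1, n_pages + 1):
        params = {"format": "json", "display_title": title,
                  "rows": rows, "page": page}
        records.extend(flatten_documents(get_json(params)))
    return records
```

With `requests`, the callable could be `lambda p: requests.get(url, params=p).json()`, and the result can be passed straight to `pd.DataFrame`.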

## 2. Use metadata to create ontology

In [495]:
from rdflib import Graph, RDF, RDFS, Namespace, URIRef, Literal
from rdflib.namespace import SKOS, XSD
from SPARQLWrapper import SPARQLWrapper, JSON
from tqdm import tqdm

# Create a new RDF graph
g = Graph()

schema = Namespace('http://schema.org/')
wd = Namespace('http://www.wikidata.org/entity/')

# Define namespaces
prefixes = {
    'schema': schema,
    'wd': wd,
    'skos': SKOS,
    'xsd': XSD
}
for p, ns in prefixes.items():
    g.bind(p, ns)
    

In [496]:
def get_entity_label(entity_code):
    # Set up the SPARQL endpoint
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

    # Construct the SPARQL query
    query = f"""
    SELECT ?label WHERE {{
      wd:{entity_code} rdfs:label ?label.
      FILTER (lang(?label) = 'en')
    }}
    """

    # Set the query and response format
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)

    # Execute the query and retrieve the results
    results = sparql.query().convert()

    # Extract and return the entity label
    if 'results' in results and 'bindings' in results['results'] and results['results']['bindings']:
        label = results['results']['bindings'][0]['label']['value']
        return label

    return None
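
`get_entity_label` sends one request per entity. As a sketch of an alternative, the helper below builds a single query for many IDs using a SPARQL `VALUES` clause. The helper name and the ID pattern are assumptions for illustration; the `wd:` and `rdfs:` prefixes are used the same way as in the query above.

```python
import re

# Wikidata item (Q...) and property (P...) identifiers
ID_PATTERN = re.compile(r"^[QP][1-9][0-9]*$")


def build_label_query(codes, lang="en"):
    """Build one SPARQL query that fetches labels for several Wikidata IDs."""
    # Drop anything that is not a plain Q/P identifier before interpolating
    valid = [code for code in codes if ID_PATTERN.match(code)]
    if not valid:
        raise ValueError("no valid Wikidata IDs supplied")
    values = " ".join(f"wd:{code}" for code in valid)
    return (
        "SELECT ?item ?label WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item rdfs:label ?label .\n"
        f"  FILTER (lang(?label) = '{lang}')\n"
        "}"
    )
```

The returned string can be passed to `sparql.setQuery(...)` exactly like the single-entity query; each binding then carries both `item` and `label`.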


In [497]:
def create_subclass_country(column):
    #Note: every entity in this column is placed under the class 'country', even though some are regions like MENA or continents like Africa.
    newClass = URIRef(schema + "country")
    g.add((newClass, RDFS.label, Literal("country", lang='en')))
    df[column] = df[column].astype(str)
    for value in df[column].unique():
        if value != "nan":
            # Check Wikidata for a matching class
            sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
            query = f"""
                SELECT ?class ?label WHERE {{
                    ?class wdt:P31 wd:Q6256 .
                    ?class rdfs:label "{value}"@en .
                    OPTIONAL {{ ?class skos:prefLabel ?label FILTER(lang(?label) = "en") }}
                    FILTER(REGEX(STR(?class), "^http://www.wikidata.org/entity/Q[0-9]+$"))
                }}
            """
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()

            # If there is a match, use the Wikidata class as a subclass
            if results['results']['bindings']:
                #Get URI from Wikidata
                uri = results['results']['bindings'][0]['class']['value']
                #Get the 'Q ID' which is the unique ID at the end of the URI
                qid = uri.split('/')[-1]
                country_label = value
                #Create a subclass for each country under the country class
                subclass = URIRef(schema + country_label.replace(' ', '_'))
                g.add((subclass, RDF.type, RDFS.Class))
                g.add((subclass, RDFS.subClassOf, newClass))
                # Update the "country_URI" column with the URI for the current country
                df.loc[df[column] == value, "country_URI"] = uri
                uri = URIRef(uri)
                # Define the URI for the new Wikidata URI property
                wd_URI_property = URIRef(schema + "wd_URI")
                # Add the property to the RDF graph
                g.add((wd_URI_property, RDF.type, RDF.Property))
                # Add a label to the property
                label = Literal("Wikidata URI", lang="en")
                g.add((wd_URI_property, RDFS.label, label))
                #Add Wikidata URI as a property to each country class
                g.add((subclass, schema.wd_URI, uri))
                #Add label to each Wikidata Q ID code that it is the Q ID for this particular country
                g.add((uri, RDFS.label, Literal(f"{country_label} wikidata code", lang='en')))
                g.add((subclass, RDFS.label, Literal(value, lang='en')))
            else:    
                subclass = URIRef(schema + value.replace(' ', '_').replace('-','_'))
                g.add((subclass, RDF.type, RDFS.Class))
                g.add((subclass, RDFS.subClassOf, newClass))
                g.add((subclass, RDFS.label, Literal(value, lang='en')))
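
The query above interpolates `value` directly into a double-quoted SPARQL literal, so a label containing a quote or a newline would break the query. A minimal sketch of an escaping helper, assuming the standard SPARQL string escapes (the function name is hypothetical):

```python
def escape_sparql_literal(text):
    """Escape a string for use inside a double-quoted SPARQL literal."""
    replacements = {
        "\\": "\\\\",  # backslash
        '"': '\\"',    # double quote
        "\n": "\\n",   # newline
        "\r": "\\r",   # carriage return
        "\t": "\\t",   # tab
    }
    # Character-by-character, so an inserted backslash is never re-escaped
    return "".join(replacements.get(ch, ch) for ch in text)
```

In the f-string query this would be used as `rdfs:label "{escape_sparql_literal(value)}"@en`.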

In [498]:
def create_subclass_world_bank_document(column):
    newClass = URIRef(schema + "world_bank_document")
    g.add((newClass, RDFS.label, Literal("A document produced and written by the World Bank.", lang='en')))   
    df[column] = df[column].astype(str)
    for value in df[column].unique():
        if value != "nan":  # astype(str) turns missing values into "nan"; skip them before querying
            # Check Wikidata for a matching class
            sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
            query = f"""
                SELECT ?class ?label WHERE {{
                    ?class rdfs:label "{value}"@en .
                    OPTIONAL {{ ?class skos:prefLabel ?label FILTER(lang(?label) = "en") }}
                    FILTER(REGEX(STR(?class), "^http://www.wikidata.org/entity/Q[0-9]+$"))
                }}
            """
            
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()
            if value != "nan":
                # If there is a match, use the Wikidata class as a subclass
                if results['results']['bindings']:
                    wd_class = results['results']['bindings'][0]['class']['value']
                    #subclass = URIRef(wd_class)
                    qid = wd_class.split('/')[-1]
                    #label = get_entity_label(qid)
                    label = value
                    #print(label)
                    subclass = URIRef(schema + label.replace(' ', '_'))
                    #label = Literal(results['results']['bindings'][0]['label']['value']) if 'label' in results['results']['bindings'][0] else Literal(value, lang='en')
                    g.add((subclass, RDF.type, RDFS.Class))
                    #g.add((subclass, RDFS.subClassOf, schema[column]))
                    g.add((subclass, RDFS.subClassOf, newClass))

                    wd_uri = URIRef(wd_class)

                    # Define the URI for the new property
                    wd_URI_property = URIRef(schema + "wd_URI")

                    # Add the property to the RDF graph
                    g.add((wd_URI_property, RDF.type, RDF.Property))

                    # Add a label to the property
                    label = Literal("Wikidata URI", lang="en")
                    g.add((wd_URI_property, RDFS.label, label))

                    g.add((subclass, schema.wd_URI, wd_uri))
                    g.add((wd_uri, RDFS.label, Literal("entity wikidata code", lang='en')))
                    #g.add((subclass, SKOS.prefLabel, Literal(value, lang='en')))
                    g.add((subclass, RDFS.label, Literal(value, lang='en')))
                    df[column] = df[column].replace(value, str(subclass))

                else:
                    subclass = URIRef(schema + value.replace(' ', '_').replace('-','_'))
                    g.add((subclass, RDF.type, RDFS.Class))
                    g.add((subclass, RDFS.subClassOf, newClass))
                    #g.add((subclass, SKOS.prefLabel, Literal(value, lang='en')))
                    g.add((subclass, RDFS.label, Literal(value, lang='en')))
                    df[column] = df[column].replace(value, str(subclass))



In [499]:
def create_subclass_multiple_values(column, names = None):
    newClass = URIRef(schema + str(column))
    df[column] = df[column].astype(str)
    for rowValue in df[column].unique():
        values = rowValue.split(",")
        if names is not None:
            for value in values:
                newID = URIRef(schema + str(column) + "/" + str(value))
                g.add((newID, RDF.type, newClass))
                if value in names:
                    name = names[value]
                    g.add((newID, SKOS.prefLabel, Literal(name, lang='en')))

In [500]:
def create_subclass_trustfund(column, ids = None):
    df['trustfund'] = df['trustfund'].astype(str).str.replace('\n', '').str.replace(" ", "_").str.replace("-", "_")
    newClass = URIRef(schema + "trustfund")
    g.add((newClass, RDFS.label, Literal("trustfund")))
    #RDFLoader doesn't allow a comment unless you also add a label on it
    #g.add((newClass, RDFS.comment, Literal("The World Bank Group (WBG) uses trust funds, a financing arrangement set up with contributions from one or more development partner, to complement core funding from the International Bank for Reconstruction and Development (IBRD), the International Development Association (IDA), and the International Finance Corporation (IFC), in support of the World Bank Group’s goals. Trust funds allow the Bank to mobilize and direct concessional resources to strategic development priorities and to mobilize the resources and capabilities of other development actors through partnership programs.")))

    # Define the id property
    id_property = schema.identifier
    g.add((id_property, RDF.type, RDF.Property))
    g.add((id_property, RDFS.label, Literal("Identifier")))
    #g.add((id_property, RDFS.comment, Literal("The unique identifier of the trustfund.")))

    # Associate the id property with the trustfund class
    g.add((id_property, RDFS.domain, newClass))

    df[column] = df[column].astype(str)
    for rowValue in df[column].unique():
        if rowValue != "nan":
            names = rowValue.split(",")
            if ids is not None:
                for name in names:
                    newID = URIRef(schema + "trustfund" + "/" + str(name))
                    g.add((newID, RDF.type, newClass))
                    g.add((newID, RDFS.label, Literal(str(name))))
                    if name in ids:
                        fund_id = ids[name]  # avoid shadowing the built-in id()
                        # Create a URIRef for the trustfund_id resource
                        trustfund_id = URIRef(schema + "trustfund/id/" + str(fund_id))
                        g.add((newID, id_property, trustfund_id))
                        #label = Literal("TEST")
                        label = Literal(f"world bank trustfund ID for {name}", lang="en")
                        g.add((trustfund_id, RDFS.label, label))


In [501]:
def create_subclass_project(column):
    newClass = URIRef(schema + "project")
    g.add((newClass, RDFS.label, Literal("worldbank project")))

    df[column] = df[column].astype(str)
    for value in df[column].unique():
        if value != "nan":
            newID = URIRef(schema + "project/" + str(value).replace(" ","_").replace("-","_"))
            g.add((newID, RDF.type, newClass))
            g.add((newID, RDFS.label, Literal(str(value).replace(" ","_").replace("-","_"), lang='en')))  
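
The same `.replace(" ","_").replace("-","_")` chain recurs throughout this notebook, and it leaves newlines inside titles untouched. A sketch of a single helper that could be used instead is below. Note that it is not a drop-in replacement: it collapses runs of separators into one underscore, so the URIs it produces differ from those built above and it would need to be applied everywhere consistently. The name `to_local_name` is hypothetical.

```python
import re


def to_local_name(value):
    """Turn free text into a URI-friendly local name."""
    # Collapse every run of whitespace (including newlines) or hyphens
    # into a single underscore
    return re.sub(r"[\s\-]+", "_", str(value).strip())
```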


In [502]:
def create_subclass_authors(column):
    newClass = URIRef(schema + str(column))
    g.add((newClass, RDFS.label, Literal("worldbank authors")))
    df[column] = df[column].astype(str)
    for value in df[column].unique():
        if value != "nan":
            # Add author property for each author
            author_dict = ast.literal_eval(value)
            for author_dict_entries in author_dict.values():
                author_name = author_dict_entries['author']
                author_uri = URIRef(schema + "author/" + author_name.replace(" ", "_"))
                g.add((author_uri, RDF.type, newClass))
                #g.add((author_uri, schema.name, Literal(author_name, lang='en')))
                g.add((author_uri, RDFS.label, Literal(author_name, lang='en')))

In [503]:
# Convert the 'trustfund_key' column to string
df['trustfund_key'] = df['trustfund_key'].astype(str)
df['trustfund'] = df['trustfund'].astype(str)

# Create a dictionary that maps trustfund names to trustfund keys
trustfund_dict = {}
for i, row in df.iterrows():
    keys = row['trustfund'].split(',')
    values = row['trustfund_key'].split(',')
    for key, value in zip(keys, values):
        trustfund_dict[key.strip()] = value.strip()

In [504]:
# Convert the 'projectid' and 'projn' columns to string
df['projectid'] = df['projectid'].astype(str)
df['projn'] = df['projn'].astype(str)

# Create a dictionary that maps project IDs to project names
project_dict = {}
for i, row in df.iterrows():
    keys = row['projectid'].split(',')
    values = row['projn'].split(',')
    for key, value in zip(keys, values):
        project_dict[key.strip()] = value.strip()
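
Both dictionary cells pair comma-separated keys with comma-separated values by position. `zip()` silently truncates when the two lists differ in length, which can happen if a name itself contains a comma. A sketch of a shared helper that skips such ambiguous rows (the helper name and the skip policy are assumptions, not the notebook's behaviour):

```python
def build_mapping(pairs):
    """Map each comma-separated key to the value in the same position.

    pairs is an iterable of (keys_text, values_text) tuples, for example
    zip(df['projectid'], df['projn']).
    """
    mapping = {}
    for keys_text, values_text in pairs:
        keys = [k.strip() for k in str(keys_text).split(",")]
        values = [v.strip() for v in str(values_text).split(",")]
        if len(keys) != len(values):
            # Positions cannot be trusted, so skip the row entirely
            continue
        for key, value in zip(keys, values):
            if key != "nan":  # missing values become "nan" after astype(str)
                mapping[key] = value
    return mapping
```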

In [505]:
create_subclass_country('count')
create_subclass_world_bank_document('docty')
create_subclass_trustfund('trustfund', trustfund_dict)
create_subclass_project('projn')
create_subclass_authors('authors')

In [506]:
#Save as a ttl file to view in Protégé
#Also save as N-Triples, which works better for the LlamaIndex RDFReader
g.serialize('SDKG.ttl',format='turtle',prefixes = prefixes, encoding='utf-8')
g.serialize("SDKG_nt",format="nt",prefixes = prefixes, encoding='utf-8')



<Graph identifier=Nebe42be5141842bbbd6e8cbc72b14459 (<class 'rdflib.graph.Graph'>)>

## 3. Build a knowledge graph (KG) by populating ontology with instances of World Bank documents (each row in the metadata)


In [507]:
#Create abstract property
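#Strip both real newlines and literal "\n" sequences from the abstracts
df['abstracts'] = df['abstracts'].astype(str).str.replace('\n', '', regex=False).str.replace('\\n', '', regex=False)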
abstractIs_uri = URIRef(schema + "abstractIs")
g.add((abstractIs_uri, RDF.type, RDF.Property))
g.add((abstractIs_uri, RDFS.label, Literal("Short summary of the document.")))

#Create abstract class
abstract_class = URIRef(schema + "abstract")
g.add((abstract_class, RDFS.label, Literal("Short summary of a document.")))

#Create author properties
authoredBy_uri = URIRef(schema + "authoredBy")
authored_uri = URIRef(schema + "authored")
g.add((authoredBy_uri, RDF.type, RDF.Property))
g.add((authored_uri, RDF.type, RDF.Property))
g.add((authoredBy_uri, RDFS.label, Literal("This document was authored by this author.")))
g.add((authored_uri, RDFS.label, Literal("This author wrote this document.")))

#Define 'part of' property
isPartOf_uri = URIRef(schema + "isPartOf")
g.add((isPartOf_uri, RDF.type, RDF.Property))
g.add((isPartOf_uri, RDFS.label, Literal("This entity is a part of another entity")))

#Define 'countryOfOrigin' property
countryOfOrigin_uri = URIRef(schema + "countryOfOrigin")
g.add((countryOfOrigin_uri, RDF.type, RDF.Property))
g.add((countryOfOrigin_uri, RDFS.label, Literal("Country that this document is about.")))

# Create instances for each document and add author property
for index, row in tqdm(df.iterrows()):
    if not pd.isnull(row['id']) and not pd.isnull(row['docty']) and not pd.isnull(row['authors']):
        try:
            # Create the report instance
            instance = URIRef(schema + "doc/" + str(row['display_title']).replace(" ","_").replace("-","_"))
            g.add((instance, RDFS.label, Literal(str(row['display_title']), lang='en')))
            
            #Connect instances with types of documents
            doctype = URIRef(row['docty'])
            g.add((instance, RDF.type, doctype))
    
            #Connect instances with country of origin
            if row['count'] != "nan":
                country = URIRef(schema + str(row['count']).replace(" ","_").replace("-","_"))
                g.add((instance, countryOfOrigin_uri, country))

            #Connect instances with projects
            if row['projn'] != "nan":
                project = URIRef(schema + "project/" + str(row['projn']).replace(" ","_").replace("-","_"))
                g.add((instance, isPartOf_uri, project))

            #Connect instances with trustfund_keys
            if row['trustfund'] != "nan":
                tf_values = row['trustfund'].split(",")
                for tf in tf_values:
                    trustfund_uri = URIRef(schema + "trustfund/" + str(tf).replace(" ","_").replace("-","_"))
                    g.add((trustfund_uri, RDFS.label, Literal(f"Trustfund: {tf}")))
                    g.add((instance, isPartOf_uri, trustfund_uri))
                    # Only link the trust fund to a country when this row has one
                    if row['count'] != "nan":
                        g.add((trustfund_uri, countryOfOrigin_uri, country))
                
            #Connect instances with authors
            author_dict = ast.literal_eval(row['authors'])
            for author_dict_entries in author_dict.values():
                author_name = author_dict_entries['author']
                author_uri = URIRef(schema + "author/" + author_name.replace(" ", "_"))
                g.add((instance, authoredBy_uri, author_uri))
                g.add((author_uri, authored_uri, instance))

            #Add abstract
            if row['abstracts'] != "nan":
                abstract_uri = URIRef(schema + "abstract/" + str(row['display_title']).replace(" ","_").replace("-","_"))
                g.add((instance, abstractIs_uri, abstract_uri))
                g.add((abstract_uri, RDFS.label, Literal(str(row['abstracts']))))
                g.add((abstract_uri, RDF.type, abstract_class))
                # The abstract is part of its document
                g.add((abstract_uri, isPartOf_uri, instance))
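        except Exception as exc:
            # Skip rows whose metadata cannot be parsed, but report them
            tqdm.write(f"Skipping row {index}: {exc}")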

101it [00:00, 2604.37it/s]
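
The loop above relies on a broad `try`/`except` mainly because `ast.literal_eval` fails on missing or malformed `authors` cells. A sketch of a narrower helper follows, assuming the structure the loop already expects: a dict whose values are dicts with an `'author'` key. The function name is hypothetical.

```python
import ast


def parse_author_names(raw):
    """Return the author names stored in one 'authors' cell, or [] if none."""
    if raw is None or str(raw) == "nan":
        return []  # missing value
    try:
        parsed = ast.literal_eval(str(raw))
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    if not isinstance(parsed, dict):
        return []
    return [
        entry["author"] for entry in parsed.values()
        if isinstance(entry, dict) and "author" in entry
    ]
```

Inside the loop, `for author_name in parse_author_names(row['authors']):` would then replace the `literal_eval` call and the dictionary indexing.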


In [508]:
g.serialize('SDKG.ttl',format='turtle',prefixes = prefixes, encoding='utf-8')
#Write the N-Triples file under the name that the RDFReader loads later (SDKG_nt)
g.serialize("SDKG_nt",format="nt",prefixes = prefixes, encoding='utf-8')



<Graph identifier=Nebe42be5141842bbbd6e8cbc72b14459 (<class 'rdflib.graph.Graph'>)>

## 4. Bring in some additional data from Wikidata


In [346]:
from SPARQLWrapper import SPARQLWrapper, JSON

def get_property_label(property_code):
    # Set up the SPARQL endpoint
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

    # Construct the SPARQL query
    query = f"""
    SELECT ?label WHERE {{
      wd:{property_code} rdfs:label ?label.
      FILTER (lang(?label) = 'en')
    }}
    """

    # Set the query and response format
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)

    # Execute the query and retrieve the results
    results = sparql.query().convert()

    # Extract and return the property label
    if 'results' in results and 'bindings' in results['results'] and results['results']['bindings']:
        label = results['results']['bindings'][0]['label']['value']
        return label

    return None



In [347]:
import numpy as np


sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Create a cache to store property code-label mappings
property_cache = {}
entity_cache = {}

# Prepare a list to collect triples for bulk graph update
triples = []

# Iterate over the URIs and add the properties to the RDF graph
for uri in tqdm(df['country_URI']):        
    if isinstance(uri, str) and uri.startswith('http://www.wikidata.org/entity/Q'):
        class_uri = URIRef(uri)
        country_column = df.loc[df['country_URI'] == uri, 'count'].iloc[0]
        country_column = URIRef(schema + str(country_column).replace(" ", "_"))
        
        # Construct the SPARQL query
        qid = uri.split('/')[-1]
        query = f"""
        SELECT ?prop ?value WHERE {{
          wd:{qid} ?prop ?value .
          OPTIONAL {{ ?prop rdfs:label ?label . FILTER(lang(?label) = 'en') }}
        }}
        """

        # Set the query and response format
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)

        # Execute the query and retrieve the results
        results = sparql.query().convert()

        # Iterate over the results and add them to the RDF graph
        for result in results["results"]["bindings"]:
            prop = result["prop"]["value"]
            value = Literal(result["value"]["value"])

            # Only Wikidata properties are kept; everything else is skipped
            if not prop.startswith('http://www.wikidata.org/prop'):
                continue

            property_code = prop.split('/')[-1]
            # Look up the property label, caching it to avoid repeated queries
            if property_code not in property_cache:
                property_cache[property_code] = get_property_label(property_code)
            property_label = property_cache[property_code]
            if property_label is None:
                # No English label was found, so there is no readable predicate
                continue

            property_label_URI = URIRef(schema + property_label.replace(" ", "_"))
            triple = (country_column, property_label_URI, value)

            if value.startswith('http://www.wikidata.org/entity/Q'):
                entity_code = value.split('/')[-1]
                # Look up the entity label, caching it as well
                if entity_code not in entity_cache:
                    entity_cache[entity_code] = get_entity_label(entity_code)
                entity_label = entity_cache[entity_code]
                entity_label_URI = URIRef(schema + str(entity_label).replace(" ", "_"))
                triple = (country_column, property_label_URI, entity_label_URI)

            triples.append(triple)
        
    elif isinstance(uri, float) and np.isnan(uri):
        continue
    else:
        continue

# Add all collected triples to the RDF graph in bulk
for subject, predicate, object_ in triples:
    if predicate is not None:
        g.add((subject, predicate, object_))



100%|█████████████████████████████████████████| 101/101 [09:11<00:00,  5.46s/it]
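
The loop iterates over every row of `df['country_URI']`, so a country that appears in many rows is queried many times. A minimal sketch of deduplicating the URIs first, keeping the same prefix check as above (the helper name is hypothetical):

```python
WIKIDATA_ENTITY_PREFIX = "http://www.wikidata.org/entity/Q"


def unique_wikidata_uris(values):
    """Return the distinct Wikidata entity URIs, in first-seen order."""
    seen = {}
    for value in values:
        # Missing values arrive as float NaN and are ignored here
        if isinstance(value, str) and value.startswith(WIKIDATA_ENTITY_PREFIX):
            seen.setdefault(value, None)
    return list(seen)
```

Iterating with `for uri in tqdm(unique_wikidata_uris(df['country_URI'])):` should produce the same triples, since the graph stores each triple only once, while sending one query per country rather than one per row.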


In [348]:
g.serialize('SDKG.ttl',format='turtle',prefixes = prefixes, encoding='utf-8')
#g.serialize("SDKG_nt",format="nt",prefixes = prefixes, encoding='utf-8')

<Graph identifier=N213871fefdbd4948b878115aba5fa456 (<class 'rdflib.graph.Graph'>)>

## 5. Query KG using SPARQL


In [462]:
import rdflib
g = rdflib.Graph()
g.parse("path to your KG")


<Graph identifier=N0765c2ce23854b96a2112e0080a699b5 (<class 'rdflib.graph.Graph'>)>

In [463]:
# Step 1: Find the URI of Brazil in your ontology
brazil_uri = "<http://schema.org/Brazil>"  # Replace with the actual URI

# Step 2: Find the most relevant documents related to Brazil
documents_query = f"""
PREFIX schema: <http://schema.org/>
SELECT ?document
WHERE {{
  ?document a/rdfs:subClassOf* schema:world_bank_document ;
      schema:countryOfOrigin {brazil_uri} .

}}
"""
qres = g.query(documents_query)
for row in qres:
    print(f"Document ID: {row.document}")


Document ID: http://schema.org/doc/Brazil___LATIN_AMERICA_AND
____________CARIBBEAN___P107146___Acre_Social_and_Economic_Inclusion_and
____________Sustainable_Development_Project___PROACRE___Audited
____________Financial_Statement
Document ID: http://schema.org/doc/Brazil___LATIN_AMERICA_AND
____________CARIBBEAN___P167455___Ceara_Rural_Sustainable_Development
____________and_Competitiveness_Phase_II___Audited_Financial_Statement
Document ID: http://schema.org/doc/Concept_Environmental_and_Social
____________Review_Summary_(ESRS)___Mato_Grosso_Sustainable_Development
____________of_Family_Farming___P175723
Document ID: http://schema.org/doc/Brazil___Acre_Social_and_Economic
____________Inclusion_and_Sustainable_Development_Project___PROACRE
Document ID: http://schema.org/doc/Brazil___Mato_Grosso_Fiscal
____________Adjustment_and_Environmental_Sustainability_Development
____________Policy_Loan
Document ID: http://schema.org/doc/Disclosable_Version_of_the_ISR__
____________Ceara_Rural_Su
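
The document URIs printed above are hard to read because spaces and hyphens became underscores and the newlines inside titles were kept. For quick display, a rough sketch of turning a URI back into readable text is shown below. It is lossy (hyphens cannot be recovered); the exact title is still available from the `rdfs:label` stored with each instance. The function name is hypothetical.

```python
import re

DOC_PREFIX = "http://schema.org/doc/"


def uri_to_title(uri):
    """Best-effort readable title from a document URI built in section 3."""
    uri = str(uri)
    local_name = uri[len(DOC_PREFIX):] if uri.startswith(DOC_PREFIX) else uri
    # Collapse runs of underscores and whitespace into single spaces
    return re.sub(r"[_\s]+", " ", local_name).strip()
```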

In [351]:
from rdflib import Graph, RDF, RDFS, URIRef

# Step 1: Find the URI of the basic_form_of_government you are interested in
government_form_uri = "<http://schema.org/federal_republic>"  # Replace with the actual URI

# Step 2: Query for authors who have written the most documents associated with countries having the basic_form_of_government as "federal_republic"
authors_query = f"""
PREFIX schema: <http://schema.org/>
PREFIX prop: <http://schema.org/property>
SELECT ?author (COUNT(?document) AS ?numDocuments)
WHERE {{
  ?document a/rdfs:subClassOf* schema:world_bank_document ;
            schema:countryOfOrigin [
                    schema:basic_form_of_government {government_form_uri}
            ] ;
            schema:authoredBy ?author .
}}
GROUP BY ?author
ORDER BY DESC(?numDocuments)
"""

# Execute the query
results = g.query(authors_query)

# Now you can process the results and present them as needed (e.g., using pandas DataFrames)
# For simplicity, here, I'm just printing the author names and the number of documents they wrote
for row in results:
    print(f"Author: {row.author}, Number of Documents: {row.numDocuments}")


Author: http://schema.org/author/World_Bank, Number of Documents: 5
Author: http://schema.org/author/Ru,Jiang, Number of Documents: 4
Author: http://schema.org/author/Tesfaye,Roman, Number of Documents: 3
Author: http://schema.org/author/Techane,Meron_Tadesse, Number of Documents: 2
Author: http://schema.org/author/Noronha_Farinelli,Barbara_Cristina, Number of Documents: 2
Author: http://schema.org/author/Paviot,Marie_Caroline, Number of Documents: 2
Author: http://schema.org/author/Aragaw_Biru, Number of Documents: 2
Author: http://schema.org/author/Zaourak,Gabriel_Roberto, Number of Documents: 2
Author: http://schema.org/author/Estela_Alejandra_Marcolongo, Number of Documents: 1
Author: http://schema.org/author/Glauber_Pinheiro_de_Aquino, Number of Documents: 1
Author: http://schema.org/author/Liduina_Cynthya_Lemos, Number of Documents: 1
Author: http://schema.org/author/Di_Crocco,Paula_Agostina, Number of Documents: 1
Author: http://schema.org/author/Lencina,Fernando_Andres, Number 

## 6. Compare the different ways of interacting with data: using SPARQL queries on the RDF, putting raw metadata into LlamaIndex, and putting the RDF data into LlamaIndex


### Putting raw data into LlamaIndex

In [455]:
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext, LLMPredictor
from langchain import OpenAI
import os 
import openai

os.environ["OPENAI_API_KEY"] = "<YOUR API KEY>"  # replace with yours
openai.api_key = os.environ["OPENAI_API_KEY"]

#Make sure your raw.csv file is in the data folder
documents = SimpleDirectoryReader('data').load_data()

index = GPTVectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()


In [456]:
response = query_engine.query("Show me all of the World Bank documents in the context information about Brazil")
print(response)


Brazil - LATIN AMERICA AND CARIBBEAN - P126452 - Rio Grande do Norte: Regional Development and Governance - Audited Financial Statement
Brazil - LATIN AMERICA AND CARIBBEAN- P158000- Amazon Sustainable Landscapes Project - Procurement Plan


In [457]:
response = query_engine.query("Based on the context information, what documents has Corsi,Anna written?")
print(response)


Corsi,Anna has not written any documents based on the context information.


In [458]:
response = query_engine.query("Tell me more about Anna Corsi")
print(response)


Anna Corsi is not mentioned in the context information.


In [459]:
response = query_engine.query("Tell me more about World Bank's land management infrastructure project in Turkey")
print(response)


The World Bank does not appear to have a land management infrastructure project in Turkey.


### Putting RDF data into LlamaIndex

In [491]:
from llama_index import GPTVectorStoreIndex, download_loader

RDFReader = download_loader("RDFReader")
document = RDFReader().load_data(file="SDKG_nt")

# Define LLM
llm_predictor = LLMPredictor(llm=OpenAI(temperature=0, model_name="text-davinci-002"))

# NOTE: set a chunk size limit to < 1024 tokens 
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, chunk_size_limit=1012)

index = GPTVectorStoreIndex.from_documents(document, service_context=service_context)

query_engine = index.as_query_engine()


In [464]:
response = query_engine.query("Show me all of the World Bank documents in the context information about Brazil")
print(response)


<Brazil - LATIN AMERICA AND CARIBBEAN - P107146 - Acre Social and Economic Inclusion and Sustainable Development Project - PROACRE - Audited Financial Statement>
<Disclosable Version of the ISR - Rio de Janeiro Adjustment and Sustainable Development Policy Loan - P178729 - Sequence No : 01>
<Grosso Fiscal Adjustment and Environmental Sustainability Development Policy Loan>
<Disclosable Version of the ISR - Matanza-Riachuelo Basin Sustainable Development Project - P105680 - Sequence No : 29>
<Disclosable Version of the ISR - Matanza-Riachuelo Basin Sustainable Development Project - P105680 - Sequence No : 30>
<Disclosable Restructuring Paper - Health Sustainable Development Goals Program-for-Results - P123531>


In [436]:
response = query_engine.query("Based on the context information, what documents has Corsi,Anna written?")
print(response)


Concept Project Information Document (PID) - Land administration infrastructure for green and sustainable development - P179217


In [453]:
response = query_engine.query("Tell me more about Anna Corsi")
print(response)


Anna Corsi is an author of the document "Concept Project Information Document (PID) - Land administration infrastructure for green and sustainable development - P179217". This document is about a project to support the development of a mass valuation system in Turkey and generate market values for individual property units. Corsi is also the author of "The Time is Now : How Can Uzbekistan Leverage Urbanization as a Driver of Sustainable Development?", a document about how Uzbekistan can use urbanization to promote sustainable development.


In [432]:
response = query_engine.query("Tell me more about World Bank's land management infrastructure project in Turkey")
print(response)


The objective of the Land Management Infrastructure for Green and Sustainable Development Project is to improve the accuracy and accessibility of land administration information in Turkiye. There are three components to the project, the first component being creating 3D city models and updating cadastre data. This component will support: (i) the creation of 3D city models based on the proven approach tested in the Amasya pilot; and (ii) the completion of the update and verification of cadastral data for 6 million parcels (out of the total remaining 11 million parcels19 not covered by LRCMP), in both urban and rural areas. As part of the cadastre updating activities, capacity building programs for addressing challenges concerning women’s land rights and ownership will be discussed with TKGM to determine how to better address these issues as part of the public consultation step during the surveying process. While activities on the update and verification of cadastral data will be carrie

### Try using both RDF and raw data to see if it helps

In [468]:

#Make sure the data folder now contains both raw.csv and the RDF file
documents = SimpleDirectoryReader('data').load_data()

index = GPTVectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()


In [469]:
response = query_engine.query("Show me all of the World Bank documents in the context information about Brazil")
print(response)


Brazil - LATIN AMERICA AND CARIBBEAN - P126452 - Rio Grande do Norte: Regional Development and Governance - Audited Financial Statement
Brazil - LATIN AMERICA AND CARIBBEAN- P158000- Amazon Sustainable Landscapes Project - Procurement Plan


In [470]:
response = query_engine.query("Based on the context information, what documents has Corsi,Anna written?")
print(response)


Corsi,Anna has not written any documents based on the context information.


In [441]:
response = query_engine.query("Tell me more about Corsi,Anna")
print(response)


Corsi,Anna is not mentioned in the context information provided.


In [442]:
response = query_engine.query("Tell me more about World Bank's land management infrastructure project in Turkey")
print(response)


The World Bank does not appear to have a land management infrastructure project in Turkey.
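
The same four questions are asked of each index above. A small harness, sketched here under the assumption that each query engine exposes a `.query(question)` method as used in the cells above, makes the side-by-side comparison repeatable. `ask_all` and the `engines` dictionary are hypothetical names.

```python
QUESTIONS = [
    "Show me all of the World Bank documents in the context information about Brazil",
    "Based on the context information, what documents has Corsi,Anna written?",
    "Tell me more about Anna Corsi",
    "Tell me more about World Bank's land management infrastructure project in Turkey",
]


def ask_all(engines, questions=QUESTIONS):
    """Ask every question of every engine.

    engines maps a short name (e.g. "raw", "rdf") to a query engine.
    Returns {question: {engine_name: answer_text}}.
    """
    results = {}
    for question in questions:
        results[question] = {
            name: str(engine.query(question))
            for name, engine in engines.items()
        }
    return results
```

For example, `ask_all({"raw": raw_engine, "rdf": rdf_engine})` would return one dictionary with the answers from both indexes per question, assuming each engine was kept under its own variable name rather than reusing `query_engine`.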
