### Plan

1. Download World Bank document metadata using the World Bank API
2. Build an ontology from the metadata
3. Build a knowledge graph (KG) by populating the ontology with instances of World Bank documents (one per row of the metadata)
4. Bring in additional data from Wikidata
5. Query the KG using SPARQL
6. Compare three ways of interacting with the data: running SPARQL queries on the RDF, loading the raw metadata into LlamaIndex, and loading the RDF data into LlamaIndex



![ws](https://miro.medium.com/v2/resize:fit:828/format:webp/1*njagJOgiT-VTJjQ18bugcw.png)

## 1. Download some World Bank document metadata

In [1]:
import requests
import json
import pandas as pd
import ast

url = 'https://search.worldbank.org/api/v2/wds'
params = {
    'format': 'json',
    'display_title': '"sustainable development"',
    'rows': 10, #Can adjust this to get more/less data
    'page': 1
}

metadata_list = []

for i in range(1):
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    data = response.json()
    # 'documents' maps each document ID to its metadata record
    for doc_id, metadata in data['documents'].items():
        metadata_list.append(metadata)

    params['page'] += 1

df = pd.DataFrame(metadata_list)


In [2]:
#Save raw data to a csv so we can compare results later
df.to_csv("raw.csv")

## 2. Use metadata to create ontology

In [3]:
pip install rdflib SPARQLWrapper

Collecting rdflib
  Downloading rdflib-7.0.0-py3-none-any.whl (531 kB)
Collecting SPARQLWrapper
  Downloading SPARQLWrapper-2.0.0-py3-none-any.whl (28 kB)
Collecting isodate<0.7.0,>=0.6.0 (from rdflib)
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
Installing collected packages: isodate, rdflib, SPARQLWrapper
Successfully installed SPARQLWrapper-2.0.0 isodate-0.6.1 rdflib-7.0.0


In [4]:
from rdflib import Graph, RDF, RDFS, Namespace, URIRef, Literal
from rdflib.namespace import SKOS, XSD
from SPARQLWrapper import SPARQLWrapper, JSON
from tqdm import tqdm

# Create a new RDF graph
g = Graph()

# Define namespaces
schema = Namespace('http://schema.org/')
wd = Namespace('http://www.wikidata.org/entity/')

# Bind each prefix to its namespace in the graph
prefixes = {
    'schema': schema,
    'wd': wd,
    'skos': SKOS,
    'xsd': XSD
}
for p, ns in prefixes.items():
    g.bind(p, ns)


In [5]:
def get_entity_label(entity_code):
    # Set up the SPARQL endpoint
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

    # Construct the SPARQL query
    query = f"""
    SELECT ?label WHERE {{
      wd:{entity_code} rdfs:label ?label.
      FILTER (lang(?label) = 'en')
    }}
    """

    # Set the query and response format
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)

    # Execute the query and retrieve the results
    results = sparql.query().convert()

    # Extract and return the entity label
    if 'results' in results and 'bindings' in results['results'] and results['results']['bindings']:
        label = results['results']['bindings'][0]['label']['value']
        return label

    return None


In [6]:
def create_subclass_country(column):
    #This puts all entities in this column under the class 'country' when some of them are regions like MENA or continents like Africa.
    newClass = URIRef(schema + "country")
    g.add((newClass, RDFS.label, Literal("country", lang='en')))
    df[column] = df[column].astype(str)
    for value in df[column].unique():
        if value != "nan":
            # Check Wikidata for a matching class
            sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
            query = f"""
                SELECT ?class ?label WHERE {{
                    ?class wdt:P31 wd:Q6256 .
                    ?class rdfs:label "{value}"@en .
                    OPTIONAL {{ ?class skos:prefLabel ?label FILTER(lang(?label) = "en") }}
                    FILTER(REGEX(STR(?class), "^http://www.wikidata.org/entity/Q[0-9]+$"))
                }}
            """
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()

            # If there is a match, use the Wikidata class as a subclass
            if results['results']['bindings']:
                #Get URI from Wikidata
                uri = results['results']['bindings'][0]['class']['value']
                #Get the 'Q ID' which is the unique ID at the end of the URI
                qid = uri.split('/')[-1]
                country_label = value
                #Create a subclass for each country under the country class
                subclass = URIRef(schema + country_label.replace(' ', '_'))
                g.add((subclass, RDF.type, RDFS.Class))
                g.add((subclass, RDFS.subClassOf, newClass))
                # Update the "country_URI" column with the URI for the current country
                df.loc[df[column] == value, "country_URI"] = uri
                uri = URIRef(uri)
                # Define the URI for the new Wikidata URI property
                wd_URI_property = URIRef(schema + "wd_URI")
                # Add the property to the RDF graph
                g.add((wd_URI_property, RDF.type, RDF.Property))
                # Add a label to the property
                label = Literal("Wikidata URI", lang="en")
                g.add((wd_URI_property, RDFS.label, label))
                #Add Wikidata URI as a property to each country class
                g.add((subclass, schema.wd_URI, uri))
                #Add label to each Wikidata Q ID code that it is the Q ID for this particular country
                g.add((uri, RDFS.label, Literal(f"{country_label} wikidata code", lang='en')))
                g.add((subclass, RDFS.label, Literal(value, lang='en')))
            else:
                subclass = URIRef(schema + value.replace(' ', '_').replace('-','_'))
                g.add((subclass, RDF.type, RDFS.Class))
                g.add((subclass, RDFS.subClassOf, newClass))
                g.add((subclass, RDFS.label, Literal(value, lang='en')))
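
The Wikidata lookup above interpolates `value` directly into a SPARQL string literal, so a label containing a double quote or a backslash would produce a malformed query. The following is a minimal sketch of one way to guard against that. The helper name `sparql_string_literal` is hypothetical and not part of the notebook, rdflib, or SPARQLWrapper.

```python
def sparql_string_literal(text: str, lang: str = "en") -> str:
    """Return text as a double-quoted, language-tagged SPARQL literal.

    Hypothetical helper: escapes the characters that would otherwise
    terminate or corrupt a double-quoted SPARQL string.
    """
    escaped = (
        text.replace("\\", "\\\\")  # escape backslashes first
        .replace('"', '\\"')        # then double quotes
        .replace("\n", "\\n")       # then line breaks
        .replace("\r", "\\r")
    )
    return f'"{escaped}"@{lang}'


# Example: build the label pattern used in the country lookup
value = "Brazil"
pattern = f"?class rdfs:label {sparql_string_literal(value)} ."
```

With a helper like this, the query template would use `{sparql_string_literal(value)}` in place of `"{value}"@en`.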

In [7]:
def create_subclass_world_bank_document(column):
    newClass = URIRef(schema + "world_bank_document")
    g.add((newClass, RDFS.label, Literal("A document produced and written by the World Bank.", lang='en')))
    df[column] = df[column].astype(str)
    for value in df[column].unique():
        # Skip missing values before querying Wikidata (astype(str) turns NaN into "nan")
        if value != "nan":
            # Check Wikidata for a matching class
            sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
            query = f"""
                SELECT ?class ?label WHERE {{
                    ?class rdfs:label "{value}"@en .
                    OPTIONAL {{ ?class skos:prefLabel ?label FILTER(lang(?label) = "en") }}
                    FILTER(REGEX(STR(?class), "^http://www.wikidata.org/entity/Q[0-9]+$"))
                }}
            """

            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            results = sparql.query().convert()
            if value != "nan":
                # If there is a match, use the Wikidata class as a subclass
                if results['results']['bindings']:
                    wd_class = results['results']['bindings'][0]['class']['value']
                    #subclass = URIRef(wd_class)
                    qid = wd_class.split('/')[-1]
                    #label = get_entity_label(qid)
                    label = value
                    #print(label)
                    subclass = URIRef(schema + label.replace(' ', '_'))
                    #label = Literal(results['results']['bindings'][0]['label']['value']) if 'label' in results['results']['bindings'][0] else Literal(value, lang='en')
                    g.add((subclass, RDF.type, RDFS.Class))
                    #g.add((subclass, RDFS.subClassOf, schema[column]))
                    g.add((subclass, RDFS.subClassOf, newClass))

                    wd_uri = URIRef(wd_class)

                    # Define the URI for the new property
                    wd_URI_property = URIRef(schema + "wd_URI")

                    # Add the property to the RDF graph
                    g.add((wd_URI_property, RDF.type, RDF.Property))

                    # Add a label to the property
                    label = Literal("Wikidata URI", lang="en")
                    g.add((wd_URI_property, RDFS.label, label))

                    g.add((subclass, schema.wd_URI, wd_uri))
                    g.add((wd_uri, RDFS.label, Literal("entity wikidata code", lang='en')))
                    #g.add((subclass, SKOS.prefLabel, Literal(value, lang='en')))
                    g.add((subclass, RDFS.label, Literal(value, lang='en')))
                    df[column] = df[column].replace(value, str(subclass))

                else:
                    subclass = URIRef(schema + value.replace(' ', '_').replace('-','_'))
                    g.add((subclass, RDF.type, RDFS.Class))
                    g.add((subclass, RDFS.subClassOf, newClass))
                    #g.add((subclass, SKOS.prefLabel, Literal(value, lang='en')))
                    g.add((subclass, RDFS.label, Literal(value, lang='en')))
                    df[column] = df[column].replace(value, str(subclass))



In [8]:
def create_subclass_multiple_values(column, names = None):
    newClass = URIRef(schema + str(column))
    df[column] = df[column].astype(str)
    for rowValue in df[column].unique():
        values = rowValue.split(",")
        if names is not None:
            for value in values:
                newID = URIRef(schema + str(column) + "/" + str(value))
                g.add((newID, RDF.type, newClass))
                if value in names:
                    name = names[value]
                    g.add((newID, SKOS.prefLabel, Literal(name, lang='en')))

In [9]:
def create_subclass_trustfund(column, ids = None):
    df['trustfund'] = df['trustfund'].astype(str).str.replace('\n', '').str.replace(" ", "_").str.replace("-", "_")
    newClass = URIRef(schema + "trustfund")
    g.add((newClass, RDFS.label, Literal("trustfund")))
    #RDFLoader doesn't allow a comment unless you also add a label on it
    #g.add((newClass, RDFS.comment, Literal("The World Bank Group (WBG) uses trust funds, a financing arrangement set up with contributions from one or more development partner, to complement core funding from the International Bank for Reconstruction and Development (IBRD), the International Development Association (IDA), and the International Finance Corporation (IFC), in support of the World Bank Group’s goals. Trust funds allow the Bank to mobilize and direct concessional resources to strategic development priorities and to mobilize the resources and capabilities of other development actors through partnership programs.")))

    # Define the id property
    id_property = schema.identifier
    g.add((id_property, RDF.type, RDF.Property))
    g.add((id_property, RDFS.label, Literal("Identifier")))
    #g.add((id_property, RDFS.comment, Literal("The unique identifier of the trustfund.")))

    # Associate the id property with the trustfund class
    g.add((id_property, RDFS.domain, newClass))

    df[column] = df[column].astype(str)
    for rowValue in df[column].unique():
        if rowValue != "nan":
            names = rowValue.split(",")
            if ids is not None:
                for name in names:
                    newID = URIRef(schema + "trustfund" + "/" + str(name))
                    g.add((newID, RDF.type, newClass))
                    g.add((newID, RDFS.label, Literal(str(name))))
                    if name in ids:
                        tf_id = ids[name]  # avoid shadowing the built-in id()
                        # Create a URIRef for the trustfund_id resource
                        trustfund_id = URIRef(schema + "trustfund/id/" + str(tf_id))
                        g.add((newID, id_property, trustfund_id))
                        #label = Literal("TEST")
                        label = Literal(f"world bank trustfund ID for {name}", lang="en")
                        g.add((trustfund_id, RDFS.label, label))


In [10]:
def create_subclass_project(column):
    newClass = URIRef(schema + "project")
    g.add((newClass, RDFS.label, Literal("worldbank project")))

    df[column] = df[column].astype(str)
    for value in df[column].unique():
        if value != "nan":
            newID = URIRef(schema + "project/" + str(value).replace(" ","_").replace("-","_"))
            g.add((newID, RDF.type, newClass))
            g.add((newID, RDFS.label, Literal(str(value).replace(" ","_").replace("-","_"), lang='en')))
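
The functions in this section build URIs by replacing each space and hyphen with an underscore. Titles that contain line breaks and indentation therefore end up with long runs of underscores. As a sketch of an alternative (not what the notebook does), a hypothetical helper `to_uri_fragment` could collapse every run of whitespace and hyphens into a single underscore. Note that this changes the resulting URIs, so it would have to be applied everywhere a URI is built.

```python
import re

SCHEMA = "http://schema.org/"  # same base namespace string the notebook uses


def to_uri_fragment(text: str) -> str:
    """Collapse runs of whitespace and hyphens into single underscores.

    Hypothetical helper; the notebook itself replaces one character at a time.
    """
    return re.sub(r"[\s\-]+", "_", text.strip())


def project_uri(name: str) -> str:
    """Build a project URI string under the schema namespace."""
    return SCHEMA + "project/" + to_uri_fragment(name)


example = project_uri("Mato Grosso Sustainable Development")
```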


In [11]:
def create_subclass_authors(column):
    newClass = URIRef(schema + str(column))
    g.add((newClass, RDFS.label, Literal("worldbank authors")))
    df[column] = df[column].astype(str)
    for value in df[column].unique():
        if value != "nan":
            # Add author property for each author
            author_dict = ast.literal_eval(value)
            for author_dict_entries in author_dict.values():
                author_name = author_dict_entries['author']
                author_uri = URIRef(schema + "author/" + author_name.replace(" ", "_"))
                g.add((author_uri, RDF.type, newClass))
                #g.add((author_uri, schema.name, Literal(author_name, lang='en')))
                g.add((author_uri, RDFS.label, Literal(author_name, lang='en')))

In [12]:
# Convert the 'trustfund_key' column to string
df['trustfund_key'] = df['trustfund_key'].astype(str)
df['trustfund'] = df['trustfund'].astype(str)

# Create a dictionary that maps trustfund names to trustfund keys
trustfund_dict = {}
for i, row in df.iterrows():
    keys = row['trustfund'].split(',')
    values = row['trustfund_key'].split(',')
    for key, value in zip(keys, values):
        trustfund_dict[key.strip()] = value.strip()

In [13]:
# Convert the 'projectid' and 'projn' columns to string
df['projectid'] = df['projectid'].astype(str)
df['projn'] = df['projn'].astype(str)

# Create a dictionary that maps project IDs to project names
project_dict = {}
for i, row in df.iterrows():
    keys = row['projectid'].split(',')
    values = row['projn'].split(',')
    for key, value in zip(keys, values):
        project_dict[key.strip()] = value.strip()
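
The two cells above pair comma-separated cells position by position. A small standalone sketch of the same idea, using made-up fund names and keys, shows what the resulting dictionary looks like. The function name `build_mapping` is hypothetical.

```python
def build_mapping(key_cells, value_cells):
    """Pair comma-separated entries from two parallel columns.

    Each cell is split on commas and the pieces are zipped together,
    mirroring the loops used for trustfund_dict and project_dict.
    """
    mapping = {}
    for key_cell, value_cell in zip(key_cells, value_cells):
        for key, value in zip(key_cell.split(","), value_cell.split(",")):
            mapping[key.strip()] = value.strip()
    return mapping


# Made-up sample data, for illustration only
names = ["Fund A, Fund B", "Fund C"]
keys = ["TF001,TF002", "TF003"]
mapping = build_mapping(names, keys)
```

One caveat of this approach: `zip` stops at the shorter sequence, so if a cell holds more names than keys, the extra names are dropped silently. A name that itself contains a comma would also be split into two entries.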

In [14]:
create_subclass_country('count')
create_subclass_world_bank_document('docty')
create_subclass_trustfund('trustfund', trustfund_dict)
create_subclass_project('projn')
create_subclass_authors('authors')

In [15]:
#Save as a ttl file to view in protege
#Also save as ntriples which works better for the LlamaIndex RDFReader
g.serialize('SDKG.ttl',format='turtle',prefixes = prefixes, encoding='utf-8')
g.serialize("SDKG_nt",format="nt",prefixes = prefixes, encoding='utf-8')



<Graph identifier=Na086217764554d7f91563f44b081e796 (<class 'rdflib.graph.Graph'>)>

## 3. Build a knowledge graph (KG) by populating ontology with instances of World Bank documents (each row in the metadata)


In [16]:
#Create abstract property
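# Strip both real newlines and literal "\n" sequences; both calls must go through the .str accessor
df['abstracts'] = df['abstracts'].astype(str).str.replace('\n', '', regex=False).str.replace('\\n', '', regex=False)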
abstractIs_uri = URIRef(schema + "abstractIs")
g.add((abstractIs_uri, RDF.type, RDF.Property))
g.add((abstractIs_uri, RDFS.label, Literal("Short summary of the document.")))

#Create abstract class
abstract_class = URIRef(schema + "abstract")
g.add((abstract_class, RDFS.label, Literal("Short summary of a document.")))

#Create author properties
authoredBy_uri = URIRef(schema + "authoredBy")
authored_uri = URIRef(schema + "authored")
g.add((authoredBy_uri, RDF.type, RDF.Property))
g.add((authored_uri, RDF.type, RDF.Property))
g.add((authoredBy_uri, RDFS.label, Literal("This document was authored by this author.")))
g.add((authored_uri, RDFS.label, Literal("This author wrote this document.")))

#Define 'part of' property
isPartOf_uri = URIRef(schema + "isPartOf")
g.add((isPartOf_uri, RDF.type, RDF.Property))
g.add((isPartOf_uri, RDFS.label, Literal("This entity is a part of another entity")))

#Define 'countryOfOrigin' property
countryOfOrigin_uri = URIRef(schema + "countryOfOrigin")
g.add((countryOfOrigin_uri, RDF.type, RDF.Property))
g.add((countryOfOrigin_uri, RDFS.label, Literal("Country that this document is about.")))

# Create instances for each document and add author property
for index, row in tqdm(df.iterrows()):
    if not pd.isnull(row['id']) and not pd.isnull(row['docty']) and not pd.isnull(row['authors']):
        try:
            # Create the report instance
            instance = URIRef(schema + "doc/" + str(row['display_title']).replace(" ","_").replace("-","_"))
            g.add((instance, RDFS.label, Literal(str(row['display_title']), lang='en')))

            #Connect instances with types of documents
            doctype = URIRef(row['docty'])
            g.add((instance, RDF.type, doctype))

            #Connect instances with country of origin
            if row['count'] != "nan":
                country = URIRef(schema + str(row['count']).replace(" ","_").replace("-","_"))
                g.add((instance, countryOfOrigin_uri, country))

            #Connect instances with projects
            if row['projn'] != "nan":
                project = URIRef(schema + "project/" + str(row['projn']).replace(" ","_").replace("-","_"))
                g.add((instance, isPartOf_uri, project))

            #Connect instances with trustfund_keys
            if row['trustfund'] != "nan":
                tf_values = row['trustfund'].split(",")
                for tf in tf_values:
                    trustfund_uri = URIRef(schema + "trustfund/" + str(tf).replace(" ","_").replace("-","_"))
                    g.add((trustfund_uri, RDFS.label, Literal(f"Trustfund: {tf}")))
                    g.add((instance, isPartOf_uri, trustfund_uri))
                    # Only link the trust fund to a country when this row has one
                    if row['count'] != "nan":
                        g.add((trustfund_uri, countryOfOrigin_uri, country))

            #Connect instances with authors
            author_dict = ast.literal_eval(row['authors'])
            for author_dict_entries in author_dict.values():
                author_name = author_dict_entries['author']
                author_uri = URIRef(schema + "author/" + author_name.replace(" ", "_"))
                g.add((instance, authoredBy_uri, author_uri))
                g.add((author_uri, authored_uri, instance))

            #Add abstract
            if row['abstracts'] != "nan":
                abstract_uri = URIRef(schema + "abstract/" + str(row['display_title']).replace(" ","_").replace("-","_"))
                g.add((instance, abstractIs_uri, abstract_uri))
                g.add((abstract_uri, RDFS.label, Literal(str(row['abstracts']))))
                g.add((abstract_uri, RDF.type, abstract_class))
                # The abstract is part of the document it summarizes
                g.add((abstract_uri, isPartOf_uri, instance))
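        except Exception as exc:
            # Skip rows that cannot be converted, but report them instead of failing silently
            print(f"Skipping row {index}: {exc}")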

11it [00:00, 2392.77it/s]


In [17]:
g.serialize('SDKG.ttl',format='turtle',prefixes = prefixes, encoding='utf-8')
g.serialize("SDKG_XML",format="nt",prefixes = prefixes, encoding='utf-8')

<Graph identifier=Na086217764554d7f91563f44b081e796 (<class 'rdflib.graph.Graph'>)>

## 4. Bring in some additional data from Wikidata


In [18]:
from SPARQLWrapper import SPARQLWrapper, JSON

def get_property_label(property_code):
    # Set up the SPARQL endpoint
    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

    # Construct the SPARQL query
    query = f"""
    SELECT ?label WHERE {{
      wd:{property_code} rdfs:label ?label.
      FILTER (lang(?label) = 'en')
    }}
    """

    # Set the query and response format
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)

    # Execute the query and retrieve the results
    results = sparql.query().convert()

    # Extract and return the property label
    if 'results' in results and 'bindings' in results['results'] and results['results']['bindings']:
        label = results['results']['bindings'][0]['label']['value']
        return label

    return None



In [19]:
import numpy as np


sparql = SPARQLWrapper("https://query.wikidata.org/sparql")

# Create a cache to store property code-label mappings
property_cache = {}
entity_cache = {}

# Prepare a list to collect triples for bulk graph update
triples = []

# Iterate over the URIs and add the properties to the RDF graph
for uri in tqdm(df['country_URI']):
    if isinstance(uri, str) and uri.startswith('http://www.wikidata.org/entity/Q'):
        class_uri = URIRef(uri)
        country_column = df.loc[df['country_URI'] == uri, 'count'].iloc[0]
        country_column = URIRef(schema + str(country_column).replace(" ", "_"))

        # Construct the SPARQL query
        qid = uri.split('/')[-1]
        query = f"""
        SELECT ?prop ?value WHERE {{
          wd:{qid} ?prop ?value .
          OPTIONAL {{ ?prop rdfs:label ?label . FILTER(lang(?label) = 'en') }}
        }}
        """

        # Set the query and response format
        sparql.setQuery(query)
        sparql.setReturnFormat(JSON)

        # Execute the query and retrieve the results
        results = sparql.query().convert()

        # Iterate over the results and add them to the RDF graph
        for result in results["results"]["bindings"]:
            prop = result["prop"]["value"]
            value = Literal(result["value"]["value"])
            triple = (country_column, None, None)  # Placeholder for triple

            if prop.startswith('http://www.wikidata.org/prop'):
                property_code = prop.split('/')[-1]
                # Check if the property code is already in the cache
                if property_code in property_cache:
                    property_label = property_cache[property_code]
                else:
                    # If not in cache, query and retrieve the property label
                    property_label = get_property_label(property_code)
                    # Store the property code-label mapping in the cache
                    property_cache[property_code] = property_label

                # str() guards against a missing (None) English label
                property_label_URI = URIRef(schema + str(property_label).replace(" ", "_"))
                triple = (country_column, property_label_URI, value)

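            # Only rewrite the object when a Wikidata property was resolved for this result
            if prop.startswith('http://www.wikidata.org/prop') and value.startswith('http://www.wikidata.org/entity/Q'):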
                entity_code = value.split('/')[-1]
                # Check if the entity code is already in the cache
                if entity_code in entity_cache:
                    entity_label = entity_cache[entity_code]
                else:
                    # If not in cache, query and retrieve the entity label
                    entity_label = get_entity_label(entity_code)
                    # Store the entity code-label mapping in the cache
                    entity_cache[entity_code] = entity_label

                entity_label_URI = URIRef(schema + str(entity_label).replace(" ", "_"))
                triple = (country_column, property_label_URI, entity_label_URI)

            triples.append(triple)

    elif isinstance(uri, float) and np.isnan(uri):
        continue
    else:
        continue

# Add all collected triples to the RDF graph in bulk
for subject, predicate, object_ in triples:
    if predicate is not None:
        g.add((subject, predicate, object_))



100%|██████████| 11/11 [01:50<00:00, 10.01s/it]
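
The loop above keeps two hand-written dictionaries (`property_cache` and `entity_cache`) so that each Wikidata label is fetched only once. The same pattern can be sketched with `functools.lru_cache` from the standard library. In the sketch below, `fetch_label_remote` is a hypothetical stand-in for `get_property_label` or `get_entity_label` so that the example runs without network access.

```python
from functools import lru_cache

calls = []  # records every simulated remote lookup, for demonstration only


def fetch_label_remote(code: str) -> str:
    """Hypothetical stand-in for a Wikidata label lookup."""
    calls.append(code)
    return f"label for {code}"


@lru_cache(maxsize=None)
def cached_label(code: str) -> str:
    """Return the label for a code, hitting the remote source once per code."""
    return fetch_label_remote(code)


# Repeated codes are served from the cache instead of a new lookup
labels = [cached_label(code) for code in ["P17", "P30", "P17", "P17"]]
```

Given that the run above took roughly 10 seconds per row, avoiding repeated lookups matters. Whether a decorator or explicit dictionaries reads better here is a matter of taste.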


In [20]:
g.serialize('SDKG.ttl',format='turtle',prefixes = prefixes, encoding='utf-8')
#g.serialize("SDKG_nt",format="nt",prefixes = prefixes, encoding='utf-8')

<Graph identifier=Na086217764554d7f91563f44b081e796 (<class 'rdflib.graph.Graph'>)>

## 5. Query KG using SPARQL


In [21]:
import rdflib
g = rdflib.Graph()
g.parse("/content/SDKG.ttl")


<Graph identifier=Naf5942dfd7ce4b4d82d27915ce72b342 (<class 'rdflib.graph.Graph'>)>

In [22]:
# Step 1: Find the URI of Brazil in your ontology
brazil_uri = "<http://schema.org/Brazil>"  # Replace with the actual URI

# Step 2: Find the most relevant documents related to Brazil
documents_query = f"""
PREFIX schema: <http://schema.org/>
SELECT ?document
WHERE {{
  ?document a/rdfs:subClassOf* schema:world_bank_document ;
      schema:countryOfOrigin {brazil_uri} .

}}
"""
qres = g.query(documents_query)
for row in qres:
    print(f"Document ID: {row.document}")


Document ID: http://schema.org/doc/Brazil___LATIN_AMERICA_AND
____________CARIBBEAN___P167455___Ceara_Rural_Sustainable_Development
____________and_Competitiveness_Phase_II___Audited_Financial_Statement
Document ID: http://schema.org/doc/Environmental_and_Social
____________Commitment_Plan_(ESCP)___Mato_Grosso_Sustainable_Development
____________of_Family_Farming___P175723
Document ID: http://schema.org/doc/Appraisal_Environmental_and
____________Social_Review_Summary_(ESRS)___Mato_Grosso_Sustainable
____________Development_of_Family_Farming___P175723
Document ID: http://schema.org/doc/Brazil___LATIN_AMERICA_AND
____________CARIBBEAN__P167455__Ceara_Rural_Sustainable_Development_and
____________Competitiveness_Phase_II___Procurement_Plan
Document ID: http://schema.org/doc/Appraisal_Program_Information
____________Document_(PID)___Rio_de_Janeiro_Fiscal_Management_and
____________Sustainable_Development_Policy_Loan___P179182
Document ID: http://schema.org/doc/Stakeholder_Engagement_Plan_
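
The pattern `?document a/rdfs:subClassOf* schema:world_bank_document` in the query above matches any resource whose type is `world_bank_document` itself or any class that reaches it through zero or more `rdfs:subClassOf` links. As a rough plain-Python sketch of that matching logic (not how rdflib evaluates it), using made-up toy triples:

```python
TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

# Toy triples; the names are made up for illustration
triples = {
    ("schema:Procurement_Plan", SUBCLASS, "schema:world_bank_document"),
    ("schema:doc/A", TYPE, "schema:Procurement_Plan"),
    ("schema:doc/B", TYPE, "schema:world_bank_document"),
    ("schema:author/X", TYPE, "schema:authors"),
}


def superclasses(cls, triples):
    """Return cls plus every class reachable via zero or more subClassOf hops."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(o for s, p, o in triples if s == current and p == SUBCLASS)
    return seen


def instances_of(target, triples):
    """Return subjects whose type is target or a (transitive) subclass of it."""
    return sorted(
        s for s, p, o in triples
        if p == TYPE and target in superclasses(o, triples)
    )


documents = instances_of("schema:world_bank_document", triples)
```

This is why the document-type subclasses created in step 2 do not need to be listed one by one in the query: the property path walks the class hierarchy.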

In [23]:
from rdflib import Graph, RDF, RDFS, URIRef

# Step 1: Find the URI of the basic_form_of_government you are interested in
government_form_uri = "<http://schema.org/federal_republic>"  # Replace with the actual URI

# Step 2: Query for authors who have written the most documents associated with countries having the basic_form_of_government as "federal_republic"
authors_query = f"""
PREFIX schema: <http://schema.org/>
PREFIX prop: <http://schema.org/property>
SELECT ?author (COUNT(?document) AS ?numDocuments)
WHERE {{
  ?document a/rdfs:subClassOf* schema:world_bank_document ;
            schema:countryOfOrigin [
                    schema:basic_form_of_government {government_form_uri}
            ] ;
            schema:authoredBy ?author .
}}
GROUP BY ?author
ORDER BY DESC(?numDocuments)
"""

# Execute the query
results = g.query(authors_query)

# Now you can process the results and present them as needed (e.g., using pandas DataFrames)
# For simplicity, here, I'm just printing the author names and the number of documents they wrote
for row in results:
    print(f"Author: {row.author}, Number of Documents: {row.numDocuments}")


Author: http://schema.org/author/Noronha_Farinelli,Barbara_Cristina, Number of Documents: 3
Author: http://schema.org/author/Liduina_Cynthya_Lemos, Number of Documents: 1
Author: http://schema.org/author/Daniel_Larrache, Number of Documents: 1
Author: http://schema.org/author/ELAINY_CRISTINA_PINHEIRO_VIEIRA, Number of Documents: 1
Author: http://schema.org/author/Lafaete_Almeida_de_Oliveira, Number of Documents: 1
Author: http://schema.org/author/Waksberg_Guerrini,Ana, Number of Documents: 1


### Putting raw data into LlamaIndex

In [24]:
!pip install -q langchain openai chromadb tiktoken

In [25]:
pip install llama_index

Collecting llama_index
  Downloading llama_index-0.8.20-py3-none-any.whl (752 kB)
Collecting urllib3<2 (from llama_index)
  Downloading urllib3-1.26.16-py2.py3-none-any.whl (143 kB)
Installing collected packages: urllib3, llama_index
  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.0.4
    Uninstalling urllib3-2.0.4:
      Successfully uninstalled urllib3-2.0.4
Successfully installed llama_index-0.8.20 urllib3-1.26.16


In [44]:
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, ServiceContext, LLMPredictor
from langchain import OpenAI
import os
import openai

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"  # replace with your own key; never publish a real one
openai.api_key = os.environ["OPENAI_API_KEY"]

#Make sure your raw.csv file is in the data folder
documents = SimpleDirectoryReader('/content/data').load_data()

print(documents)

[Document(id_='cf3f4efc-7c9d-43eb-833e-d9addda84731', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='8330db185671ff6de2ec1e0b1549de31fea890d61050a2c8d146f898d86ace1d', text="0, 34153662.0, 2023-08-31T00:00:00Z, {'0': {'author': 'World Bank'}}, World, 517191.0, {'0': {'docna': 'World Bank Group Partnership Fund for the\\n            Sustainable Development Goals - Annual Report 2023'}}, Annual Report, 563760.0, Partnerships and Practice Groups (ECRPG), Public, English, 120701.0, {'entityid': '34153662'}, 184733, 2023-08-31T00:00:00Z, 2023-08-31T02:30:12Z, 1.0, Publications & Research, 658101.0, {'cdata!': 'At a moment fraught with daunting\\n            global challenges - food and energy insecurity, inflation\\n            and economic uncertainty, conflict and debt, rising poverty\\n            and the ever-present threat of climate change - we must not\\n            only take up the gauntlet to respond to these cha

In [29]:
index = GPTVectorStoreIndex.from_documents(documents)


[nltk_data] Downloading package punkt to /tmp/llama_index...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [31]:
query_engine = index.as_query_engine()

In [36]:
response = query_engine.query("Show me all of the World Bank documents in the context information about Brazil")
print(response)

There are several World Bank documents in the context information about Brazil.


In [37]:
response = query_engine.query("Based on the context information, what documents has Corsi,Anna written?")
print(response)

Corsi, Anna has not written any documents based on the given context information.


In [38]:
response = query_engine.query("Tell me more about Anna Corsi")
print(response)

I'm sorry, but I cannot provide information about Anna Corsi based on the given context.
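
When a vector-index answer like the ones above looks doubtful, it helps to have a ground truth to compare against. The sketch below is one possible way to check the raw CSV directly, assuming the `authors` column holds stringified dictionaries of the form seen in the printed document text (for example `{'0': {'author': 'World Bank'}}`). The function name `titles_by_author` and the sample rows are made up for illustration.

```python
import ast
import csv
import io


def titles_by_author(csv_file, author):
    """Return display titles of rows whose 'authors' cell lists the author."""
    titles = []
    for row in csv.DictReader(csv_file):
        try:
            entries = ast.literal_eval(row.get("authors", ""))
        except (ValueError, SyntaxError):
            continue  # empty or non-literal cell such as 'nan'
        if not isinstance(entries, dict):
            continue
        names = {e.get("author") for e in entries.values() if isinstance(e, dict)}
        if author in names:
            titles.append(row["display_title"])
    return titles


# Made-up sample rows standing in for raw.csv
sample = io.StringIO('''display_title,authors
Made-up Report A,"{'0': {'author': 'Doe,Jane'}}"
Made-up Report B,"{'0': {'author': 'World Bank'}}"
Made-up Report C,nan
''')
found = titles_by_author(sample, "Doe,Jane")
```

Against the real file, the same function could be called on `open("raw.csv", newline="", encoding="utf-8")` with the author string exactly as it appears in the metadata.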
