# A notebook introduction to **toposkg-lib**

### Install library and dependencies

toposkg-lib is distributed as a PyPI package; the cell below installs a development release from TestPyPI. We also recommend installing our version of rdflib, which includes a workaround that speeds up file parsing.

In [None]:
!pip install --index-url https://test.pypi.org/simple/ \
  --extra-index-url https://pypi.org/simple \
  "toposkg[full]==0.1.3.dev4"
!pip install git+https://github.com/SKefalidis/rdflib-speed@main

Looking in indexes: https://test.pypi.org/simple/, https://pypi.org/simple
Collecting git+https://github.com/SKefalidis/rdflib-speed@main
  Cloning https://github.com/SKefalidis/rdflib-speed (to revision main) to /tmp/pip-req-build-8zkg998_
  Running command git clone --filter=blob:none --quiet https://github.com/SKefalidis/rdflib-speed /tmp/pip-req-build-8zkg998_
  Resolved https://github.com/SKefalidis/rdflib-speed to commit f83c401fe21c30574b2aad04fd2044cc25d70348
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
# The latest version of unsloth + transformers is buggy (see https://github.com/unslothai/unsloth/issues/2656), so we pin known-good versions.
!pip install unsloth==2025.5.7 unsloth-zoo==2025.5.8



In [None]:
!pip show toposkg
!pip show unsloth

Name: toposkg
Version: 0.1.3.dev4
Summary: A Python interface to the ToposKG knowledge graph generation pipeline.
Home-page: 
Author: 
Author-email: Sergios-Anestis Kefalidis <skefalidis@di.uoa.gr>, Kostas Plas <kplas@di.uoa.gr>
License: MIT
Location: /usr/local/lib/python3.11/dist-packages
Requires: fsspec, pandas, pyjedai, rich
Required-by: 
Name: unsloth
Version: 2025.5.7
Summary: 2-5X faster LLM finetuning
Home-page: http://www.unsloth.ai
Author: Unsloth AI team
Author-email: info@unsloth.ai
License: 
Location: /usr/local/lib/python3.11/dist-packages
Requires: accelerate, bitsandbytes, datasets, diffusers, hf_transfer, huggingface_hub, numpy, packaging, peft, protobuf, psutil, sentencepiece, torch, torchvision, tqdm, transformers, triton, trl, tyro, unsloth_zoo, wheel, xformers
Required-by: 
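
As an alternative to reading the `pip show` output by eye, the pinned versions can be checked programmatically. The sketch below uses only the standard library; the `expected` dictionary simply repeats the pins from the install cells above, and the helper name `installed_version` is ours, not part of toposkg-lib.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Pins copied from the install cells above.
expected = {
    "toposkg": "0.1.3.dev4",
    "unsloth": "2025.5.7",
    "unsloth-zoo": "2025.5.8",
}

for name, pinned in expected.items():
    found = installed_version(name)
    status = "ok" if found == pinned else "MISMATCH"
    print(f"{name}: expected {pinned}, found {found} [{status}]")
```

A `MISMATCH` line is only a hint to re-run the install cells (and restart the runtime if needed); it does not by itself mean the library will fail.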


## Basics

### Setup and exploring available data sources.

In [None]:
from toposkg.toposkg_lib_core import KnowledgeGraphBlueprintBuilder, KnowledgeGraphSourcesManager

# Load files from the repository. We advise answering "n" to the download prompt so that files are downloaded only when they are needed.
sources_manager = KnowledgeGraphSourcesManager(sources_repositories='https://toposkg.di.uoa.gr')

Do you want to proceed with downloading the entire knowledge graph sources (100gb+)? Any previously downloaded sources will not be redownloaded. (y/n)n
Skipping download of sources...
Loading source information from ~/.toposkg/sources_cache
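
Because the full set of sources exceeds 100 GB, it can be useful to see how much disk space the local cache currently occupies. The following is a minimal standard-library sketch; the cache path is taken from the log line above, and `directory_size_bytes` is a hypothetical helper, not a toposkg-lib function. What exactly the library stores in the cache before a file is downloaded is not shown in this notebook, so treat the number as a rough indication.

```python
from pathlib import Path


def directory_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files below `root` (0 if it does not exist)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


# Cache location as reported in the log line above.
cache_dir = Path("~/.toposkg/sources_cache").expanduser()
size_gb = directory_size_bytes(cache_dir) / 1e9
print(f"{cache_dir}: {size_gb:.2f} GB")
```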


In [None]:
# Available data sources can be shown as a flat list of file paths. In list mode, the results can also be filtered by keyword.
sources_manager.print_available_data_sources(tree=False, filter="Greece")

Available data sources:
/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece
/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece/Greece_0.nt
/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece/Greece_1.nt
/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece/Greece_2.nt
/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece/Greece_all.nt
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_0.nt
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_1.nt
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_2.nt
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_3.nt
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_4.nt
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_5.nt
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_6.nt
/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_all.nt
/root/.topo

In [None]:
# Or they can be shown in a tree-view. You can expand the output to view the tree (it's quite large).
sources_manager.print_available_data_sources()

Available data sources:
/root/.toposkg/sources_cache/
  toposkg/
    GAUL/
      countries/
        Afghanistan/
          Afghanistan_0.nt
          Afghanistan_1.nt
          Afghanistan_2.nt
          Afghanistan_all.nt
        Aland Islands/
          Aland Islands.nt
        Albania/
          Albania_0.nt
          Albania_1.nt
          Albania_2.nt
          Albania_all.nt
        Algeria/
          Algeria_0.nt
          Algeria_1.nt
          Algeria_2.nt
          Algeria_all.nt
        American Samoa/
          American Samoa.nt
        Andorra/
          Andorra_0.nt
          Andorra_all.nt
        Angola/
          Angola_0.nt
          Angola_1.nt
          Angola_all.nt
        Anguilla/
          Anguilla.nt
        Antarctica/
          Antarctica.nt
        Antigua and Barbuda/
          Antigua and Barbuda.nt
        Argentina/
          Argentina_0.nt
          Argentina_1.nt
          Argentina_2.nt
          Argentina_all.nt
        Armenia/
          Armenia_0.

### Building your custom GeoKG by selecting from available data sources.


In [None]:
# Create a KnowledgeGraphBlueprintBuilder object to build the knowledge graph blueprint
builder = KnowledgeGraphBlueprintBuilder()

builder.set_name("ToposKG.nt")
builder.set_output_dir("/content/")

# We add the data sources that we want to include in our GeoKG.
builder.add_source_path("/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece/Greece_all.nt")
builder.add_source_path("/root/.toposkg/sources_cache/toposkg/OSM/forests/Greece/greece_forest.nt")

# Use the blueprint to construct the knowledge graph
blueprint = builder.build()
blueprint.construct(validate=False) # The required files are downloaded at this stage if you chose not to download them earlier.
                                    # We disable file validation to speed up the process, and because our files have already been validated.

'Knowledge graph constructed successfully at /content/ToposKG.nt'
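
A quick way to sanity-check the constructed file is to count its statements. N-Triples is a line-based format (one triple per line, with `#` starting a comment), so a rough count needs nothing beyond the standard library. This is a sketch, not a validator: it does not check that each line is well-formed, and `count_ntriples_statements` is a name we made up for illustration.

```python
from pathlib import Path


def count_ntriples_statements(path: Path) -> int:
    """Count lines that look like N-Triples statements (non-blank, not comments)."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count


# Output path as reported by construct() above.
output_file = Path("/content/ToposKG.nt")
if output_file.exists():
    print(f"{output_file}: {count_ntriples_statements(output_file)} statements")
else:
    print(f"{output_file} not found; run the construction cell first")
```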

In [None]:
#
# There are also more powerful ways to add data sources to the builder.
#

# First we clear the previously added data sources.
builder.clear_source_paths()

# We can add data sources via regex.
builder.add_source_paths_with_regex(sources_manager.get_source_paths(), r"(?i).*Greece_(?!\d).*\.nt") # do not include individual level files
print("REGEX")
builder.print_source_paths()

# Or via substring filters
builder.clear_source_paths()
builder.add_source_paths_with_strings(sources_manager.get_source_paths(), ["Greece", "OSM"]) # only include files that contain the strings Greece and OSM in their filepath
print("SUBSTRINGS")
builder.print_source_paths()

# Or even by adding a folder, in which case its contents become the selected data sources.
builder.clear_source_paths()
builder.add_source_path('/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece')
builder.print_source_paths()

REGEX
Sources paths:
- /root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece/Greece_all.nt
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_all.nt
- /root/.toposkg/sources_cache/toposkg/OSM/forests/Greece/greece_forest.nt
- /root/.toposkg/sources_cache/toposkg/OSM/pois/Greece/greece_poi.nt
- /root/.toposkg/sources_cache/toposkg/OSM/waterbodies/Greece/greece_water.nt
SUBSTRINGS
Sources paths:
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_0.nt
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_1.nt
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_2.nt
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_3.nt
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_4.nt
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_5.nt
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_6.nt
- /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_all.nt
- /r
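
To make the two selection modes concrete, the sketch below applies the same regular expression and the same substrings to a handful of paths copied from the listings above, using plain Python. How toposkg-lib applies them internally is not shown in this notebook: we assume a full match for the regex and an "all substrings must be present" rule for the string filter, and the function names are ours.

```python
import re

# A few paths copied from the listings above.
paths = [
    "/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece",
    "/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece/Greece_0.nt",
    "/root/.toposkg/sources_cache/toposkg/GAUL/countries/Greece/Greece_all.nt",
    "/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_1.nt",
    "/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_all.nt",
    "/root/.toposkg/sources_cache/toposkg/OSM/forests/Greece/greece_forest.nt",
]


def select_by_regex(candidates, pattern):
    """Keep the paths that the pattern matches in full (assumed semantics)."""
    compiled = re.compile(pattern)
    return [p for p in candidates if compiled.fullmatch(p)]


def select_by_substrings(candidates, substrings):
    """Keep the paths that contain every given substring (assumed semantics)."""
    return [p for p in candidates if all(s in p for s in substrings)]


# (?i) makes the match case-insensitive, so "greece_forest.nt" qualifies too;
# (?!\d) rejects the per-level files such as Greece_0.nt.
regex_hits = select_by_regex(paths, r"(?i).*Greece_(?!\d).*\.nt")
substring_hits = select_by_substrings(paths, ["Greece", "OSM"])

print("REGEX")
for p in regex_hits:
    print("-", p)
print("SUBSTRINGS")
for p in substring_hits:
    print("-", p)
```

On this sample, the regex keeps the two `Greece_all.nt` files and `greece_forest.nt`, which agrees with the REGEX listing above, while the substring filter keeps every sample path under `OSM/`.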

## Advanced Features

### Using the translation pipeline to generate English labels.

In [None]:
builder = KnowledgeGraphBlueprintBuilder()

builder.set_name("ToposKG_translation.nt")
builder.set_output_dir("/content/")

builder.add_source_path("/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_1.nt")
builder.add_translation_target(("/root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_1.nt", ["<http://toposkg.di.uoa.gr/ontology/hasName>"]))

blueprint = builder.build()
blueprint.construct(debug=True)

Translating...
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).

==((====))==  Unsloth 2025.5.7: Fast Gemma3 patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


Translating predicates in source path:  /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_1.nt
Predicates list:  ['<http://toposkg.di.uoa.gr/ontology/hasName>']
Loading source file: /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_1.nt
Parsing file: /root/.toposkg/sources_cache/toposkg/OSM/countries/Greece/Greece_1.nt
Translating "Αποκεντρωμένη Διοίκηση Μακεδονίας - Θράκης" to Region of Macedonia and Thrace Decentralized Administration
Translating "Makedonias Thrakis" to Thrace
Translating "Αποκεντρωμένη Διοίκηση Αιγαίου" to Aegean Region Decentralized Administration
Translating "Kritis" to Crete
Translating "Ipeiroy Dytikis Makedonias" to West Macedonia
Translating "Thessaly and Central Greece"@en to Thessaly and Central Greece
Translating "Macedonia and Thrace"@en to Macedonia and Thrace
Translating "Αποκεντρωμένη Διοίκηση Ηπείρου - Δυτικής Μακεδονίας" to Region of Epirus-Western Macedonia Decentralized Administration
Translating "Region of Crete"@en to

'Knowledge graph constructed successfully at /content/ToposKG_translation.nt'
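
A translation target pairs a source file with the predicates whose literals should be translated. To preview which literals such a target would cover, one can scan the N-Triples lines for the predicate and list the literal values with their language tags. The sketch below is a simplified standard-library parser written for illustration: it only recognizes plain and language-tagged string literals, the subject IRIs in `sample` are hypothetical, and the two name literals are taken from the log above. It is not how toposkg-lib reads files (the log shows it parsing the source file itself).

```python
import re

# Matches '<subject> <predicate> "literal"[@lang] .' lines only; typed literals,
# IRIs and blank nodes in object position are deliberately ignored in this sketch.
LITERAL_TRIPLE = re.compile(
    r'^\s*(<[^>]*>)\s+(<[^>]*>)\s+"((?:[^"\\]|\\.)*)"(?:@([A-Za-z0-9-]+))?\s*\.\s*$'
)


def literals_for_predicate(lines, predicate):
    """Yield (literal, language_tag_or_None) for triples that use `predicate`."""
    for line in lines:
        match = LITERAL_TRIPLE.match(line)
        if match and match.group(2) == predicate:
            yield match.group(3), match.group(4)


HAS_NAME = "<http://toposkg.di.uoa.gr/ontology/hasName>"

# Hypothetical triples: the subject IRIs are made up for this example.
sample = [
    f'<http://example.org/region/1> {HAS_NAME} "Kritis" .',
    f'<http://example.org/region/1> {HAS_NAME} "Region of Crete"@en .',
    '<http://example.org/region/1> <http://example.org/other> "ignored" .',
]

for text, lang in literals_for_predicate(sample, HAS_NAME):
    print(f"{text!r} (language tag: {lang})")
```

Note that, judging by the log above, the pipeline also processes literals that already carry an `@en` tag, so the language tag is informational here rather than a filter.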

### Using the materialization pipeline.

In [None]:
# The materialization pipeline relies on Java code, and running Java code on Google Colab is not officially supported, so we do not include examples for this functionality here.
# You can find examples in the official code repository of toposkg-lib.

### Using the entity linking pipeline to link custom data with the official ToposKG data sources.

To extend the available data sources and to interlink data where appropriate, toposkg-lib provides a simple interface to pyjedai. In this example, we link some custom data to the U.S. states.

In [None]:
# download data
!wget https://raw.githubusercontent.com/KwtsPls/ToposKG/refs/heads/main/toposkg_lib/examples/us_states_test_data.csv
!wget https://raw.githubusercontent.com/KwtsPls/ToposKG/refs/heads/main/toposkg_lib/examples/us_states_gaul.nt

--2025-06-20 17:00:49--  https://raw.githubusercontent.com/KwtsPls/ToposKG/refs/heads/main/toposkg_lib/examples/us_states_test_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1606 (1.6K) [text/plain]
Saving to: ‘us_states_test_data.csv’


2025-06-20 17:00:49 (37.0 MB/s) - ‘us_states_test_data.csv’ saved [1606/1606]

--2025-06-20 17:00:49--  https://raw.githubusercontent.com/KwtsPls/ToposKG/refs/heads/main/toposkg_lib/examples/us_states_gaul.nt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39080053 (37M) [text/plain]
Sav

In [None]:
import pandas as pd  # used in the cells below

from toposkg.toposkg_lib_entity_linking import *

# Extract relevant information from the nt file which will be useful for linking with the custom data.
toposkg_nt_to_csv("/content/us_states_gaul.nt", ['http://toposkg.di.uoa.gr/ontology/hasName'], "/content/us_states_gaul.csv")

Unnamed: 0,entity,hasName
0,http://toposkg.di.uoa.gr/resource/usa_2353_31486,Wyoming
1,http://toposkg.di.uoa.gr/resource/usa_2321_32037,Maine
2,http://toposkg.di.uoa.gr/resource/usa_2333_30844,New Mexico
3,http://toposkg.di.uoa.gr/resource/usa_2345_33035,Texas
4,http://toposkg.di.uoa.gr/resource/usa_2341_33325,Rhode Island
5,http://toposkg.di.uoa.gr/resource/usa_2317_33332,Iowa
6,http://toposkg.di.uoa.gr/resource/usa_2329_31448,Nebraska
7,http://toposkg.di.uoa.gr/resource/usa_2309_31575,Delaware
8,http://toposkg.di.uoa.gr/resource/usa_2313_22101,Hawaii
9,http://toposkg.di.uoa.gr/resource/usa_2325_33355,Minnesota


In [None]:
# load the two datasets to link them
gaul_df = pd.read_csv("/content/us_states_gaul.csv", sep=',', engine='python', na_filter=False).astype(str)
target_df = pd.read_csv("/content/us_states_test_data.csv", sep=',', engine='python', na_filter=False).astype(str)

# call pyjedai
toposkg_link_dataframes(
    d1=gaul_df,
    id_column_name_1="hasName",
    d2=target_df,
    id_column_name_2="State",
    output_path="/content/link_output_dfs.csv"
)

# Convert the pyjedai output to an RDF file; this file can be added to the data sources and included in a custom KG.
toposkg_csv_to_nt(
    input_file="/content/link_output_dfs.csv",
    column_id_name="entity",
    output_file="/content/link_output_dfs.nt"
)

Embeddings-NN Block Building [sminilm, faiss, cuda]:   0%|          | 0/101 [00:00<?, ?it/s]

               id1             id2
0          Alabama         Alabama
1           Alaska          Alaska
2          Arizona         Arizona
3         Arkansas        Arkansas
4       California      California
5         Colorado        Colorado
6      Connecticut     Connecticut
7         Delaware        Delaware
8          Florida         Florida
9          Georgia         Georgia
10          Hawaii          Hawaii
11           Idaho           Idaho
12        Illinois        Illinois
13         Indiana         Indiana
14            Iowa            Iowa
15          Kansas          Kansas
16        Kentucky        Kentucky
17       Louisiana       Louisiana
18           Maine           Maine
19        Maryland        Maryland
20   Massachusetts   Massachusetts
21        Michigan        Michigan
22       Minnesota       Minnesota
23     Mississippi     Mississippi
24        Missouri        Missouri
25         Montana         Montana
26        Nebraska        Nebraska
27          Nevada  



<Graph identifier=N0851ef17e7624bc3b48b0b863329a76c (<class 'rdflib.graph.Graph'>)>

In [None]:
pd.read_csv('/content/link_output_dfs.csv')

Unnamed: 0,entity,hasName,State,test_data_1,test_data_2
0,http://toposkg.di.uoa.gr/resource/usa_2353_31486,Wyoming,Wyoming,957.0,82.10551
1,http://toposkg.di.uoa.gr/resource/usa_2321_32037,Maine,Maine,529.0,75.330291
2,http://toposkg.di.uoa.gr/resource/usa_2333_30844,New Mexico,New Mexico,157.0,27.509986
3,http://toposkg.di.uoa.gr/resource/usa_2345_33035,Texas,Texas,385.0,94.979713
4,http://toposkg.di.uoa.gr/resource/usa_2341_33325,Rhode Island,Rhode Island,562.0,6.918714
5,http://toposkg.di.uoa.gr/resource/usa_2317_33332,Iowa,Iowa,639.0,9.904071
6,http://toposkg.di.uoa.gr/resource/usa_2329_31448,Nebraska,Nebraska,304.0,64.854264
7,http://toposkg.di.uoa.gr/resource/usa_2309_31575,Delaware,Delaware,605.0,13.283306
8,http://toposkg.di.uoa.gr/resource/usa_2313_22101,Hawaii,Hawaii,726.0,51.758871
9,http://toposkg.di.uoa.gr/resource/usa_2325_33355,Minnesota,Minnesota,149.0,70.958202
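
The linked table above has the shape of a join: each ToposKG entity row is extended with the columns of the custom row it was matched to. The sketch below reproduces that shape for two rows copied from the tables above, but with a deliberately simpler technique. pyjedai matched the records with embedding-based block building (the progress line mentions `sminilm` and `faiss`); here we swap in an exact match on normalized names, which is only adequate when both sides spell the names identically. The helper names are ours.

```python
def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def link_exact(left_rows, left_key, right_rows, right_key):
    """Join two lists of dicts on normalized key equality (not pyjedai's method)."""
    index = {normalize(row[right_key]): row for row in right_rows}
    linked = []
    for row in left_rows:
        match = index.get(normalize(row[left_key]))
        if match is not None:
            linked.append({**row, **match})
    return linked


# Rows copied from the tables above.
gaul_rows = [
    {"entity": "http://toposkg.di.uoa.gr/resource/usa_2353_31486", "hasName": "Wyoming"},
    {"entity": "http://toposkg.di.uoa.gr/resource/usa_2321_32037", "hasName": "Maine"},
]
custom_rows = [
    {"State": "Maine", "test_data_1": "529.0", "test_data_2": "75.330291"},
    {"State": "Wyoming", "test_data_1": "957.0", "test_data_2": "82.10551"},
]

for row in link_exact(gaul_rows, "hasName", custom_rows, "State"):
    print(row)
```

An exact join like this silently drops rows whose names differ even slightly, which is the case that similarity-based matching in pyjedai is meant to handle.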
