### Using DuckDB with the Python API

In [1]:
from IPython.display import display, Image

# URL of the DuckDB logo
image_url = 'https://duckdb.org/images/logo-dl/DuckDB_Logo.png'

# Display the image in the notebook
display(Image(url=image_url))

DuckDB is an in-process analytical database management system. With its `spatial` extension it can also store and query location-based data, which makes it useful in the geospatial domain.

In simpler terms, DuckDB helps organize and query information tied to locations, such as map layers or other geographic data. It is handy for tasks like tracking objects, analyzing spatial patterns, or finding nearby points on a map, all from inside a Python program.

For many analytical workloads DuckDB is fast and lightweight, and it can be more efficient than some other geospatial packages, although the gain depends on the data and the query.

* Speed: DuckDB is designed to be fast, allowing for quick processing and retrieval of geospatial data. Its optimized architecture enables speedy execution of queries and analyses, making it a favorable choice for applications that require real-time or near-real-time processing of location-based information.

* Lightweight: DuckDB is a lightweight database, meaning it doesn't require significant computational resources. This makes it suitable for use in various environments, including resource-constrained devices or systems where efficient resource utilization is crucial. The lightweight nature of DuckDB can contribute to faster deployment and lower operational costs.

* Efficiency: DuckDB is built with efficiency in mind, providing a balance between performance and resource utilization. Its design allows for quick data retrieval and processing, making it a reliable option for applications dealing with large geospatial datasets. The efficiency of DuckDB can result in improved overall system performance.

* Integration: DuckDB is designed to integrate seamlessly with programming languages commonly used in data science and geospatial analysis, such as Python. This makes it easier for developers and data scientists to incorporate DuckDB into their workflows, benefiting from its speed and efficiency while leveraging familiar programming tools.

### Objective of this task

* DuckDB Exploration:
   - Investigate DuckDB features for spatial analysis.
   - Examine geographic data handling tools.
* OSM Integration:
   - Assess DuckDB's OSM data integration capabilities.
   - Verify compatibility and interoperability.
* Spatial Analysis Techniques:
   - Identify DuckDB's spatial analysis methods.
   - Evaluate advanced analysis using OSM data.
* Nigeria-specific Application:
   - Tailor exploration to Nigeria's geography.
   - Examine DuckDB's effectiveness with OSM data in Nigeria.
* Data Visualization:
   - Assess DuckDB's geographic data visualization.
   - Explore visualization options for Nigerian geography.

In [2]:
# %pip install duckdb leafmap

Import the necessary libraries

In [3]:
# import the libraries
import duckdb
import pandas as pd
# import leafmap
import os
import pyogrio
import geopandas as gpd

In [4]:
# load the SQL magic (provided by jupysql / ipython-sql) before configuring it
%load_ext sql
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

Download the data used in this analysis (uncomment the cell below to run it)

In [5]:
# url = r'https://download.geofabrik.de/africa/nigeria-latest-free.shp.zip'
# leafmap.download_file(url,unzip=True)
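
If leafmap is not installed, the download-and-unzip step can be approximated with the standard library. This is a minimal sketch; the helper names `download_file` and `unzip` are hypothetical and are not leafmap's API.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Download url to dest unless the file is already there."""
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest


def unzip(zip_path: Path, out_dir: Path) -> list[str]:
    """Extract zip_path into out_dir and return the archive member names."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Example usage (commented out because the archive is large):
# url = "https://download.geofabrik.de/africa/nigeria-latest-free.shp.zip"
# zip_path = download_file(url, Path("data") / "nigeria-latest-free.shp.zip")
# unzip(zip_path, Path("data") / "nigeria-latest-free_shp")
```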

In [6]:
# list the shapefiles in the OSM Nigeria shapefile folder
home_folder = 'data'
nigeria_folder = 'nigeria-latest-free_shp'
data_folder = os.path.join(home_folder, nigeria_folder)
data = os.listdir(data_folder)
for item in data:
    if item.endswith('.shp'):
        print(item)

gis_osm_buildings_a_free_1.shp
gis_osm_landuse_a_free_1.shp
gis_osm_natural_a_free_1.shp
gis_osm_natural_free_1.shp
gis_osm_places_a_free_1.shp
gis_osm_places_free_1.shp
gis_osm_pofw_a_free_1.shp
gis_osm_pofw_free_1.shp
gis_osm_pois_a_free_1.shp
gis_osm_pois_free_1.shp
gis_osm_railways_free_1.shp
gis_osm_roads_free_1.shp
gis_osm_traffic_a_free_1.shp
gis_osm_traffic_free_1.shp
gis_osm_transport_a_free_1.shp
gis_osm_transport_free_1.shp
gis_osm_waterways_free_1.shp
gis_osm_water_a_free_1.shp
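
The same listing can be wrapped in a small reusable helper, since the notebook repeats it for more than one folder. This is a sketch using `pathlib`; the function name `list_shapefiles` is hypothetical.

```python
from pathlib import Path


def list_shapefiles(folder):
    """Return the sorted names of all .shp files directly inside folder."""
    return sorted(p.name for p in Path(folder).glob("*.shp"))


# Example usage, assuming the folder layout used in this notebook:
# list_shapefiles(Path("data") / "nigeria-latest-free_shp")
```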


Check the Nigeria admin shapefiles that were added to the data folder

In [7]:
# list the shapefiles in the Nigeria admin boundaries folder
home_folder = 'data'
nigeria_folder = 'NGA_adm'
data_folder = os.path.join(home_folder, nigeria_folder)
data = os.listdir(data_folder)
for item in data:
    if item.endswith('.shp'):
        print(item)

NGA_adm0.shp
NGA_adm1.shp
NGA_adm2.shp


Connecting to DuckDB

Create a database for Nigeria where all the data will be stored before analysis

In [8]:
con = duckdb.connect("nigeria.db")

Install and load the spatial extension

In [9]:
con.install_extension('spatial')
con.load_extension('spatial')

In [10]:
con.sql("SHOW TABLES")

┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ nigeria_state │
└───────────────┘

Load the Nigeria admin shapefile into the database

In [14]:
%%timeit
# try one file first; a loop can load the rest of the data into the database later
nga_admin1 = 'NGA_adm1.shp'
nga_state_data = os.path.join(home_folder, nigeria_folder, nga_admin1)
# optional: read the data into a GeoDataFrame to inspect it
# nga_state_gdf = gpd.read_file(nga_state_data)
# print(nga_state_gdf)

# load the shapefile into the database
# (re)create the table from the shapefile contents returned by ST_Read
query = f"DROP TABLE IF EXISTS nigeria_state; \
        CREATE TABLE nigeria_state AS SELECT * FROM ST_Read('{nga_state_data}')"
con.execute(query)

# if the table already exists, all we need to do is insert into it
# insert into an existing table from the contents of a DataFrame
# con.execute("INSERT INTO existing_table SELECT * FROM loaded_Dataframe")

<duckdb.duckdb.DuckDBPyConnection object at 0x00000294EA083470>
167 ms ± 48.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
con.sql("SHOW TABLES")

┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ nigeria_state │
└───────────────┘

Next, preview the OSM buildings layer by reading the first 10 rows straight from the shapefile

In [None]:
%%time
osm_building = 'gis_osm_buildings_a_free_1.shp'
# the buildings layer is in the OSM folder, not the NGA_adm folder set in an earlier cell
building_data = os.path.join(home_folder, 'nigeria-latest-free_shp', osm_building)
# get the first 10 building records
print(con.sql(f"SELECT * FROM ST_Read('{building_data}') LIMIT 10"))

Count the total number of buildings mapped in Nigeria

In [None]:
%%time
osm_building = 'gis_osm_buildings_a_free_1.shp'
# the buildings layer is in the OSM folder, not the NGA_adm folder set in an earlier cell
building_data = os.path.join(home_folder, 'nigeria-latest-free_shp', osm_building)
# get the total number of buildings mapped on OSM in Nigeria
query = f"SELECT COUNT(*) FROM ST_Read('{building_data}')"
print(con.sql(query))

Use leafmap to visualize the Nigeria building shapefile

Load data into the database using the DataFrame method
* Read the data using pyogrio. pyogrio is used here instead of the more commonly known GeoPandas reader because of its speed; it is reported to be about 18x faster.
* Use a for loop to load all the data in the Nigeria shapefile folder into the connected database, i.e. "nigeria.db".
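
The for loop mentioned above needs a table name and a `CREATE TABLE` statement for each shapefile. A minimal sketch of that bookkeeping follows. The `osm_nigeria_<layer>` naming scheme and both helper names are assumptions; each generated statement could then be passed to `con.execute`.

```python
from pathlib import Path


def table_name_for(shapefile: str) -> str:
    """Map 'gis_osm_roads_free_1.shp' to 'osm_nigeria_roads' (assumed naming scheme)."""
    stem = Path(shapefile).stem  # drop the directory and the .shp extension
    layer = stem.removeprefix("gis_osm_").removesuffix("_free_1")
    return f"osm_nigeria_{layer}"


def create_table_sql(shapefile_path: str) -> str:
    """Build a CREATE TABLE statement that reads one shapefile with ST_Read."""
    table = table_name_for(shapefile_path)
    escaped = shapefile_path.replace("'", "''")  # escape quotes for the SQL literal
    return f"CREATE OR REPLACE TABLE {table} AS SELECT * FROM ST_Read('{escaped}')"


statements = [
    create_table_sql(name)
    for name in ["gis_osm_roads_free_1.shp", "gis_osm_places_a_free_1.shp"]
]
for statement in statements:
    print(statement)
```

Point and polygon variants of a layer (for example `places` and `places_a`) get separate tables under this scheme.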

In [None]:
%%time
# try one file first; a loop can load the rest of the data into the database later
osm_building = 'gis_osm_buildings_a_free_1.shp'
building_data = os.path.join(home_folder, 'nigeria-latest-free_shp', osm_building)
# read the data into a GeoDataFrame
building_gdf = pyogrio.read_dataframe(building_data)

# DuckDB cannot scan shapely geometry objects, so encode the geometry column as WKB bytes
building_df = pd.DataFrame(building_gdf.drop(columns='geometry'))
building_df['geom_wkb'] = building_gdf.geometry.to_wkb().values

# load the data into the database
# create a new table from the contents of the DataFrame
con.register('building_df', building_df)
query = "CREATE OR REPLACE TABLE osm_nigeria_building AS \
        SELECT * EXCLUDE (geom_wkb), ST_GeomFromWKB(geom_wkb) AS geom FROM building_df"
print(con.execute(query))

# if the table already exists, all we need to do is insert into it
# insert into an existing table from the contents of a DataFrame
# con.execute("INSERT INTO existing_table SELECT * FROM loaded_Dataframe")

In [None]:
con.sql("SHOW TABLES")

In [None]:
%%time
# try one file first; a loop can load the rest of the data into the database later
osm_building = 'gis_osm_buildings_a_free_1.shp'
building_data = os.path.join(home_folder, 'nigeria-latest-free_shp', osm_building)
# load the shapefile into the database
# CREATE OR REPLACE keeps the cell re-runnable when the table already exists
query = f"CREATE OR REPLACE TABLE osm_nigeria_building AS SELECT * FROM ST_Read('{building_data}')"
print(con.execute(query))

# if the table already exists, all we need to do is insert into it
# insert into an existing table from the contents of a DataFrame
# con.execute("INSERT INTO existing_table SELECT * FROM loaded_Dataframe")

Show the tables again to check whether the data has been ingested into the database

In [None]:
con.sql("SHOW TABLES")

If that works for one file, we will load the rest of the data into the database in bulk

In [None]:
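# load every shapefile in the OSM Nigeria folder into the database
home_folder = 'data'
nigeria_folder = 'nigeria-latest-free_shp'
data_folder = os.path.join(home_folder, nigeria_folder)
data = os.listdir(data_folder)
for item in data:
    if item.endswith('.shp'):
        print(item)
        shp_path = os.path.join(data_folder, item)
        # one table per layer, e.g. gis_osm_roads_free_1.shp -> osm_nigeria_roads
        # (this naming scheme is an assumption)
        layer = os.path.splitext(item)[0].removeprefix('gis_osm_').removesuffix('_free_1')
        # create a new table from the contents of the shapefile
        query = f"CREATE OR REPLACE TABLE osm_nigeria_{layer} AS SELECT * FROM ST_Read('{shp_path}')"
        print(con.execute(query))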

In [None]:
%sql duckdb:///:memory:
# %sql duckdb:///path/to/file.db

In [None]:
%%sql

SELECT * FROM duckdb_extensions();

## References 

* https://duckdb.org/docs/api/python/data_ingestion