### Using DuckDB with the Python API

In [2]:
from IPython.display import display, Image

# URL of the DuckDB logo
image_url = 'https://duckdb.org/images/logo-dl/DuckDB_Logo.png'

# Display the image in the notebook
display(Image(url=image_url))

DuckDB is an in-process analytical database management system. With its spatial extension, it is useful in the geospatial domain for efficiently storing and querying location-based data.

In simpler terms, DuckDB helps organize and manage information related to locations, like maps or geographic data. It's handy for tasks such as tracking objects, analyzing spatial patterns, or finding nearby points on a map. DuckDB makes working with geospatial data in computer programs easier and more efficient.

DuckDB stands out in the geospatial domain because it is fast, lightweight, and often more efficient than other geospatial packages.

* Speed: DuckDB is designed to be fast, allowing for quick processing and retrieval of geospatial data. Its optimized architecture enables speedy execution of queries and analyses, making it a favorable choice for applications that require real-time or near-real-time processing of location-based information.

* Lightweight: DuckDB is a lightweight database, meaning it doesn't require significant computational resources. This makes it suitable for use in various environments, including resource-constrained devices or systems where efficient resource utilization is crucial. The lightweight nature of DuckDB can contribute to faster deployment and lower operational costs.

* Efficiency: DuckDB is built with efficiency in mind, providing a balance between performance and resource utilization. Its design allows for quick data retrieval and processing, making it a reliable option for applications dealing with large geospatial datasets. The efficiency of DuckDB can result in improved overall system performance.

* Integration: DuckDB is designed to integrate seamlessly with programming languages commonly used in data science and geospatial analysis, such as Python. This makes it easier for developers and data scientists to incorporate DuckDB into their workflows, benefiting from its speed and efficiency while leveraging familiar programming tools.

### Objective of this task

* DuckDB Exploration:
   - Creating a database in DuckDB
   - Loading a table into the database
   - Loading bulk data into the database
* Spatial Analysis Techniques:
   - Performing a buffer with DuckDB
   - Making a spatial selection with DuckDB
* Data Visualization:
   - Using leafmap for data visualization
   - Exploring geopandas for data visualization with the GeoDataFrame.explore() method.

Install the required packages. Running pip inside a notebook is not standard practice; it is better to install all the required libraries in a virtual environment and start the notebook from that environment to run your analysis. If you do want to install a library from within the notebook, use **%pip install** followed by the name of the library.

In [16]:
# %pip install duckdb leafmap

Import the necessary libraries.

In [4]:
# Import the libraries
import duckdb
import pandas as pd
import leafmap
import os
import pyogrio
import geopandas as gpd

Load the jupysql Jupyter extension to create SQL cells.

This lets you write SQL queries in your notebook: use **%sql** for a single line of SQL, or **%%sql** for multiple lines of SQL in a notebook cell.

In [17]:
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


In [6]:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

To download the data used in this analysis, uncomment and run the cell below.

In [7]:
# url = r'https://download.geofabrik.de/africa/nigeria-latest-free.shp.zip'
# leafmap.download_file(url,unzip=True)

Set the folder paths.

In [None]:
nigeria_osm = 'nigeria-latest-free_shp'
nigeria_admin = 'NGA_adm'
data_folder = 'data'

In [23]:
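# List the shapefiles in the OSM Nigeria folder
# (a separate variable keeps data_folder pointing at 'data')
osm_folder = os.path.join(data_folder, nigeria_osm)
for item in os.listdir(osm_folder):
    if item.endswith('.shp'):
        print(item)

Note: running this cell after data_folder has been reassigned to the admin folder raises FileNotFoundError: [WinError 3] The system cannot find the path specified: 'data\\NGA_adm\\nigeria-latest-free_shp'. Using the separate osm_folder variable avoids that.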

Check the Nigeria admin shapefiles that were added to the data folder.

In [9]:
# List the shapefiles in the Nigeria admin folder
# (a separate variable keeps data_folder pointing at 'data')
admin_folder = os.path.join(data_folder, nigeria_admin)
for item in os.listdir(admin_folder):
    if item.endswith('.shp'):
        print(item)

NGA_adm0.shp
NGA_adm1.shp
NGA_adm2.shp
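
The folder listing above is repeated for each data folder, so it can be wrapped in a small helper. The sketch below is one possible version; the name `list_shapefiles` is hypothetical and not part of the original notebook. It returns an empty list when the folder does not exist, instead of raising `FileNotFoundError`.

```python
from pathlib import Path


def list_shapefiles(folder):
    """Return the sorted names of the .shp files directly inside `folder`.

    Hypothetical helper for illustration: a missing folder yields an
    empty list rather than an exception.
    """
    folder = Path(folder)
    if not folder.is_dir():
        return []
    # Compare the suffix case-insensitively so 'X.SHP' is also found
    return sorted(p.name for p in folder.iterdir() if p.suffix.lower() == ".shp")


# Folder names taken from the notebook; the result depends on your local data
print(list_shapefiles(Path("data") / "NGA_adm"))
```

Because the helper never reassigns a shared variable such as `data_folder`, it can be called for the admin folder and the OSM folder in any order.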


Connecting to DuckDB

Create a database for Nigeria where all the data will be stored before analysis.

In [10]:
con = duckdb.connect("nigeria.db")

Install and load the spatial extension.

In [11]:
con.install_extension('spatial')
con.load_extension('spatial')

The first time you run con.sql("SHOW TABLES") on a newly created database, the result is empty, because the database contains no tables until you create or write one. The output below already lists nigeria_state because nigeria.db persisted from an earlier run of this notebook.

In [12]:
con.sql("SHOW TABLES")

┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ nigeria_state │
└───────────────┘

Load the Nigeria admin shapefile into the database

By default DuckDB operates on an in-memory database. That means that any tables that are created are not persisted to disk. Using the .connect method a connection can be made to a persistent database. Any data written to that connection will be persisted, and can be reloaded by re-connecting to the same file.

In [13]:
%%timeit
# Try it with one shapefile first; later a loop can load the rest of the data into the database
nga_admin1 = 'NGA_adm1.shp'
nga_state_data = os.path.join(data_folder, nigeria_admin, nga_admin1)
# Optionally inspect the data as a GeoDataFrame first
# nga_state_gdf = gpd.read_file(nga_state_data)
# print(nga_state_gdf)

# Create a new table directly from the shapefile with ST_Read
# (dropping the table first lets the cell be re-run)
query = f"""
    DROP TABLE IF EXISTS nigeria_state;
    CREATE TABLE nigeria_state AS SELECT * FROM ST_Read('{nga_state_data}')
"""
con.execute(query)

# If the table already exists, insert into it instead
# con.execute("INSERT INTO existing_table SELECT * FROM loaded_dataframe")

174 ms ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
con.sql("SHOW TABLES")

┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ nigeria_state │
└───────────────┘

Let's query the state table.

In [42]:
# query the table
con.table('nigeria_state').show()

┌───────┬─────────┬─────────┬───────┬─────────────┬───┬────────────┬───────────┬───────────┬──────────────────────┐
│ ID_0  │   ISO   │ NAME_0  │ ID_1  │   NAME_1    │ … │ ENGTYPE_1  │ NL_NAME_1 │ VARNAME_1 │         geom         │
│ int64 │ varchar │ varchar │ int64 │   varchar   │   │  varchar   │  varchar  │  varchar  │       geometry       │
├───────┼─────────┼─────────┼───────┼─────────────┼───┼────────────┼───────────┼───────────┼──────────────────────┤
│   163 │ NGA     │ Nigeria │     1 │ Abia        │ … │ State      │ NULL      │ NULL      │ POLYGON ((7.508012…  │
│   163 │ NGA     │ Nigeria │     2 │ Adamawa     │ … │ State      │ NULL      │ NULL      │ POLYGON ((13.72386…  │
│   163 │ NGA     │ Nigeria │     3 │ Akwa Ibom   │ … │ State      │ NULL      │ NULL      │ MULTIPOLYGON (((7.…  │
│   163 │ NGA     │ Nigeria │     4 │ Anambra     │ … │ State      │ NULL      │ NULL      │ POLYGON ((6.915181…  │
│   163 │ NGA     │ Nigeria │     5 │ Bauchi      │ … │ State      │ NUL

In [51]:
con.sql('SELECT * EXCLUDE geom, ST_AsText(geom) AS geom_wkt FROM nigeria_state').df().head()

Unnamed: 0,ID_0,ISO,NAME_0,ID_1,NAME_1,TYPE_1,ENGTYPE_1,NL_NAME_1,VARNAME_1,geom_wkt
0,163,NGA,Nigeria,1,Abia,State,State,,,"POLYGON ((7.508012771606559 6.009689807891846,..."
1,163,NGA,Nigeria,2,Adamawa,State,State,,,POLYGON ((13.723860740661621 10.91172790527349...
2,163,NGA,Nigeria,3,Akwa Ibom,State,State,,,MULTIPOLYGON (((7.610694885253963 4.4729170799...
3,163,NGA,Nigeria,4,Anambra,State,State,,,"POLYGON ((6.915181159973201 6.711037158966064,..."
4,163,NGA,Nigeria,5,Bauchi,State,State,,,POLYGON ((10.734446525573787 12.40430355072015...


In [22]:
%%time
osm_building  = 'gis_osm_buildings_a_free_1.shp'
building_data = os.path.join(data_folder, nigeria_osm, osm_building)
# get the first 10 building data
con.sql(f"SELECT * FROM ST_Read('{building_data}') LIMIT 10")

Wall time: 498 ms


┌──────────┬───────┬──────────┬──────────────────────┬─────────┬───────────────────────────────────────────────────────┐
│  osm_id  │ code  │  fclass  │         name         │  type   │                         geom                          │
│ varchar  │ int32 │ varchar  │       varchar        │ varchar │                       geometry                        │
├──────────┼───────┼──────────┼──────────────────────┼─────────┼───────────────────────────────────────────────────────┤
│ 10561268 │  1500 │ building │ NECOM House          │ NULL    │ POLYGON ((3.3974795 6.4464039, 3.3977594 6.4468257,…  │
│ 28923442 │  1500 │ building │ Senate Building      │ NULL    │ POLYGON ((7.6544579 11.1506115, 7.6547851 11.150889…  │
│ 30047438 │  1500 │ building │ NULL                 │ NULL    │ POLYGON ((3.9726911 7.3604027, 3.9730135 7.3607825,…  │
│ 31895041 │  1500 │ building │ NULL                 │ NULL    │ POLYGON ((7.7204295 11.1023489, 7.720455 11.1025016…  │
│ 223198   │  1500 │ building │ 

Count the total number of buildings mapped in Nigeria.

In [21]:
%%time
osm_building  = 'gis_osm_buildings_a_free_1.shp'
building_data= os.path.join(data_folder, nigeria_osm, osm_building)
# get the total numbers of building mapped on osm in Nigeria
query = f"SELECT COUNT(*) FROM ST_Read('{building_data}')"
print(con.sql(query))

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     10399995 │
└──────────────┘

Wall time: 6min 21s


In [None]:
%%time
# Load the OSM building shapefile into its own table
osm_building = 'gis_osm_buildings_a_free_1.shp'
building_data = os.path.join(data_folder, nigeria_osm, osm_building)

# Create a new table directly from the shapefile with ST_Read
# (dropping the table first lets the cell be re-run)
query = f"""
    DROP TABLE IF EXISTS osm_nigeria_building;
    CREATE TABLE osm_nigeria_building AS SELECT * FROM ST_Read('{building_data}')
"""
con.execute(query)

# If the table already exists, insert into it instead
# con.execute("INSERT INTO existing_table SELECT * FROM loaded_dataframe")

Use leafmap to visualize the Nigeria building shapefile.

Load data into the database from a DataFrame
* Read the data with pyogrio. pyogrio is used here instead of the more commonly known geopandas reader because of its speed: it is reported to be about 18x faster than geopandas.
* Use a for loop to load all the data in the Nigeria shapefile folder into the connected database, i.e. "nigeria.db".

In [None]:
%%time
# Read the building shapefile with pyogrio, then copy the GeoDataFrame into DuckDB
osm_building = 'gis_osm_buildings_a_free_1.shp'
building_data = os.path.join(data_folder, nigeria_osm, osm_building)
building_gdf = pyogrio.read_dataframe(building_data)

# DuckDB cannot read shapely geometries directly, so pass them as WKT text
building_df = pd.DataFrame(building_gdf.drop(columns='geometry'))
building_df['geom_wkt'] = building_gdf.geometry.to_wkt()

# Register the DataFrame under a view name and create the table from it
con.register('building_df', building_df)
con.execute("""
    CREATE OR REPLACE TABLE osm_nigeria_building AS
    SELECT * EXCLUDE geom_wkt, ST_GeomFromText(geom_wkt) AS geom FROM building_df
""")

In [None]:
con.sql("SHOW TABLES")

In [None]:
%%time
# Try it with one shapefile first; later a loop can load the rest of the data into the database
osm_building = 'gis_osm_buildings_a_free_1.shp'
building_data = os.path.join(data_folder, nigeria_osm, osm_building)
# Create a new table directly from the shapefile with ST_Read
query = f"CREATE OR REPLACE TABLE osm_nigeria_building AS SELECT * FROM ST_Read('{building_data}')"
print(con.execute(query))

# If the table already exists, insert into it instead
# con.execute("INSERT INTO existing_table SELECT * FROM loaded_dataframe")

Show the tables again to check whether the data has been ingested into the database.

In [None]:
con.sql("SHOW TABLES")

If that works for one shapefile, we can pass the bulk data (all the tables) into the database.

In [None]:
# Load every shapefile in the OSM Nigeria folder into its own table
home_folder = 'data'
nigeria_folder = 'nigeria-latest-free_shp'
osm_folder = os.path.join(home_folder, nigeria_folder)
for item in sorted(os.listdir(osm_folder)):
    if item.endswith('.shp'):
        # Use the file name without its extension as the table name
        table_name = os.path.splitext(item)[0]
        shp_path = os.path.join(osm_folder, item)
        print(f"Loading {item} into {table_name}")
        con.execute(
            f'CREATE OR REPLACE TABLE "{table_name}" AS '
            f"SELECT * FROM ST_Read('{shp_path}')"
        )
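
For the bulk load, each shapefile needs its own table name and its own `CREATE TABLE ... ST_Read(...)` statement. The sketch below only builds those statements as strings, so they can be inspected before anything is executed. The helpers `table_name_for` and `build_load_sql` are hypothetical, and the naming rule (lower-case file stem with unsupported characters replaced by underscores) is an assumption rather than something DuckDB requires.

```python
import re
from pathlib import Path


def table_name_for(shp_path):
    """Derive a plain SQL identifier from a shapefile name (assumed rule)."""
    stem = Path(shp_path).stem.lower()
    name = re.sub(r"[^a-z0-9_]", "_", stem)
    # Identifiers should not start with a digit, so add a prefix if needed
    if not name or name[0].isdigit():
        name = "t_" + name
    return name


def build_load_sql(shp_path):
    """Build the statement that would load one shapefile into its own table."""
    # Double any single quote so the path stays a valid SQL string literal
    escaped_path = str(shp_path).replace("'", "''")
    return (
        f"CREATE OR REPLACE TABLE {table_name_for(shp_path)} AS "
        f"SELECT * FROM ST_Read('{escaped_path}')"
    )


print(build_load_sql("data/nigeria-latest-free_shp/gis_osm_buildings_a_free_1.shp"))
```

Each generated string could then be passed to `con.execute(...)` on a connection that has the spatial extension loaded.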

In [None]:
%sql duckdb:///:memory:
# %sql duckdb:///path/to/file.db

In [None]:
%%sql

SELECT * FROM duckdb_extensions();

## References 

* https://duckdb.org/docs/api/python/data_ingestion