# SQL Basics with DuckDB Sudan Extension

## Introduction

This notebook is a short introduction to SQL using the [DuckDB Sudan Extension](https://github.com/Osman-Geomatics93/duckdb-sudan-). It provides unified SQL access to Sudan's humanitarian, development, and geospatial data from 5 international APIs.

## Datasets

The following datasets are used in this notebook. No downloads needed — data is fetched live from international APIs or embedded in the extension.

- **SUDAN_States()** — 18 states with bilingual names (Arabic/English), ISO codes, centroids, and polygon boundaries
- **SUDAN_WorldBank()** — World Development Indicators (population, GDP, etc.)
- **SUDAN_Providers()** — List of 5 data providers
- **SUDAN_Boundaries()** — Administrative boundaries as GeoJSON (GADM v4.1)

## Supported Countries

| ISO3 | Country |
|:----:|---------|
| SDN | Sudan |
| EGY | Egypt |
| ETH | Ethiopia |
| TCD | Chad |
| SSD | South Sudan |
| ERI | Eritrea |
| LBY | Libya |
| CAF | Central African Republic |

## References

- [DuckDB SQL Introduction](https://duckdb.org/docs/sql/introduction.html)
- [W3Schools SQL Tutorial](https://www.w3schools.com/sql)
- [Sudan Extension Documentation](https://osman-geomatics93.github.io/duckdb-sudan-/)

## Installation

Uncomment the following cell to install the required packages.

In [9]:
# %pip install duckdb duckdb-engine jupysql

## Library Import and Configuration

In [1]:
import duckdb
import pandas as pd

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

Configure jupysql to return query results as Pandas DataFrames and to suppress status messages and connection details in the notebook output.

In [2]:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

## Connecting to DuckDB and Loading the Sudan Extension

This notebook was run with DuckDB 1.4.4, so the next cells pin that version and confirm it. We then connect to DuckDB with `allow_unsigned_extensions` enabled (required for custom extensions) and install and load the Sudan extension from its online repository.

In [12]:
%pip install duckdb==1.4.4 duckdb-engine jupysql


In [13]:
%pip install duckdb==1.4.4 --no-cache-dir



In [3]:
import duckdb
print(duckdb.__version__)

1.4.4


In [4]:
# Connect to DuckDB with unsigned extensions enabled
conn = duckdb.connect(config={'allow_unsigned_extensions': 'true'})

# Install and load the Sudan extension from the online repository
conn.execute("INSTALL httpfs; LOAD httpfs;")
conn.execute("SET custom_extension_repository = 'https://osman-geomatics93.github.io/duckdb-sudan-';")
conn.execute("INSTALL sudan; LOAD sudan;")

# Register the connection with jupysql for %%sql magic
%sql conn --alias duckdb

print("Sudan extension loaded successfully!")

Sudan extension loaded successfully!


Verify the extension is working by listing all data providers.

In [5]:
%%sql

SELECT * FROM SUDAN_Providers();

Unnamed: 0,provider_id,name,name_ar,description,base_url
0,worldbank,World Bank,البنك الدولي,World Development Indicators and other World B...,https://api.worldbank.org/v2/
1,who,World Health Organization,منظمة الصحة العالمية,Global Health Observatory (GHO) data,https://ghoapi.azureedge.net/api/
2,fao,Food and Agriculture Organization,منظمة الأغذية والزراعة,FAOSTAT agricultural statistics,https://faostatservices.fao.org/api/v1/
3,unhcr,UNHCR,المفوضية السامية,UN Refugee Agency displacement and population ...,https://api.unhcr.org/population/v1/
4,ilo,International Labour Organization,منظمة العمل الدولية,International Labour Organization statistics,https://sdmx.ilo.org/rest/


If your SQL query is one line only, you may use the `%sql` magic command. For multi-line SQL queries, use the `%%sql` magic command.

## Explore Available Datasets

Let's look at the data the extension provides. Sudan has 18 states with bilingual names and polygon boundaries; the outputs shown in this notebook are truncated to the first 10 rows.

In [6]:
%%sql

SELECT state_name, state_name_ar, iso_code, centroid_lon, centroid_lat
FROM SUDAN_States();

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,Khartoum,الخرطوم,SD-KH,32.53,15.55
1,Al Jazirah,الجزيرة,SD-GZ,33.53,14.88
2,Al Qadarif,القضارف,SD-GD,35.4,14.03
3,Kassala,كسلا,SD-KA,36.4,15.45
4,Red Sea,البحر الأحمر,SD-RS,37.22,19.62
5,River Nile,نهر النيل,SD-NR,33.93,17.5
6,Northern,الشمالية,SD-NO,30.22,19.5
7,White Nile,النيل الأبيض,SD-NW,32.17,13.17
8,Blue Nile,النيل الأزرق,SD-NB,34.05,11.25
9,Sennar,سنار,SD-SI,34.13,13.55


## Create Tables

Create a table named `states` from Sudan's 18 states.

In [7]:
%%sql

CREATE TABLE states AS
SELECT state_name, state_name_ar, iso_code, centroid_lon, centroid_lat
FROM SUDAN_States();

Unnamed: 0,Count
0,18


Create a table named `population` from the World Bank API (Sudan and its neighbors, all available years).

> **Note:** This cell fetches live data from the World Bank API. It may take a few seconds.

In [8]:
%%sql

CREATE TABLE population AS
SELECT indicator_id, indicator_name, country, country_name, year, value
FROM SUDAN_WorldBank('SP.POP.TOTL', countries := ['SDN', 'EGY', 'ETH', 'TCD', 'SSD', 'ERI', 'LBY', 'CAF'])
WHERE value IS NOT NULL;

Unnamed: 0,Count
0,520


Create a table named `gdp` from World Bank GDP data.

> **Note:** GDP (current US$) indicator code is `NY.GDP.MKTP.CD`.

In [9]:
%%sql

CREATE TABLE gdp AS
SELECT indicator_id, indicator_name, country, country_name, year, value
FROM SUDAN_WorldBank('NY.GDP.MKTP.CD', countries := ['SDN', 'EGY', 'ETH', 'TCD', 'SSD', 'ERI', 'LBY', 'CAF'])
WHERE value IS NOT NULL;

Unnamed: 0,Count
0,418


Display the table contents.

In [10]:
%%sql

FROM states;

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,Khartoum,الخرطوم,SD-KH,32.53,15.55
1,Al Jazirah,الجزيرة,SD-GZ,33.53,14.88
2,Al Qadarif,القضارف,SD-GD,35.4,14.03
3,Kassala,كسلا,SD-KA,36.4,15.45
4,Red Sea,البحر الأحمر,SD-RS,37.22,19.62
5,River Nile,نهر النيل,SD-NR,33.93,17.5
6,Northern,الشمالية,SD-NO,30.22,19.5
7,White Nile,النيل الأبيض,SD-NW,32.17,13.17
8,Blue Nile,النيل الأزرق,SD-NB,34.05,11.25
9,Sennar,سنار,SD-SI,34.13,13.55


In [11]:
%%sql

FROM population LIMIT 10;

Unnamed: 0,indicator_id,indicator_name,country,country_name,year,value
0,SP.POP.TOTL,"Population, total",SD,Sudan,2024,50448963.0
1,SP.POP.TOTL,"Population, total",SD,Sudan,2023,50042791.0
2,SP.POP.TOTL,"Population, total",SD,Sudan,2022,49383346.0
3,SP.POP.TOTL,"Population, total",SD,Sudan,2021,48066924.0
4,SP.POP.TOTL,"Population, total",SD,Sudan,2020,46789231.0
5,SP.POP.TOTL,"Population, total",SD,Sudan,2019,45548175.0
6,SP.POP.TOTL,"Population, total",SD,Sudan,2018,44230596.0
7,SP.POP.TOTL,"Population, total",SD,Sudan,2017,42714306.0
8,SP.POP.TOTL,"Population, total",SD,Sudan,2016,41259892.0
9,SP.POP.TOTL,"Population, total",SD,Sudan,2015,40024431.0


In [12]:
%%sql

FROM gdp LIMIT 10;

Unnamed: 0,indicator_id,indicator_name,country,country_name,year,value
0,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2024,49672440000.0
1,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2023,39898290000.0
2,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2022,51666880000.0
3,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2021,34229510000.0
4,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2020,1264790000.0
5,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2019,32338080000.0
6,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2018,32333780000.0
7,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2017,41283620000.0
8,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2016,-319297000.0
9,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2015,51726760000.0


## The SQL SELECT Statement

The `SELECT` statement is used to select data from a database. Use either `SELECT *` to select all columns, or `SELECT column1, column2, ...` to select specific columns.

In DuckDB, `SELECT * FROM states` can be shortened to `FROM states`. This FROM-first shorthand is a DuckDB feature and is not part of standard SQL.

In [13]:
%%sql

SELECT * FROM states;

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,Khartoum,الخرطوم,SD-KH,32.53,15.55
1,Al Jazirah,الجزيرة,SD-GZ,33.53,14.88
2,Al Qadarif,القضارف,SD-GD,35.4,14.03
3,Kassala,كسلا,SD-KA,36.4,15.45
4,Red Sea,البحر الأحمر,SD-RS,37.22,19.62
5,River Nile,نهر النيل,SD-NR,33.93,17.5
6,Northern,الشمالية,SD-NO,30.22,19.5
7,White Nile,النيل الأبيض,SD-NW,32.17,13.17
8,Blue Nile,النيل الأزرق,SD-NB,34.05,11.25
9,Sennar,سنار,SD-SI,34.13,13.55


To limit the number of rows returned, use the `LIMIT` keyword. For example, `SELECT * FROM population LIMIT 10` will return only the first 10 rows.

In [14]:
%%sql

SELECT * FROM population LIMIT 10;

Unnamed: 0,indicator_id,indicator_name,country,country_name,year,value
0,SP.POP.TOTL,"Population, total",SD,Sudan,2024,50448963.0
1,SP.POP.TOTL,"Population, total",SD,Sudan,2023,50042791.0
2,SP.POP.TOTL,"Population, total",SD,Sudan,2022,49383346.0
3,SP.POP.TOTL,"Population, total",SD,Sudan,2021,48066924.0
4,SP.POP.TOTL,"Population, total",SD,Sudan,2020,46789231.0
5,SP.POP.TOTL,"Population, total",SD,Sudan,2019,45548175.0
6,SP.POP.TOTL,"Population, total",SD,Sudan,2018,44230596.0
7,SP.POP.TOTL,"Population, total",SD,Sudan,2017,42714306.0
8,SP.POP.TOTL,"Population, total",SD,Sudan,2016,41259892.0
9,SP.POP.TOTL,"Population, total",SD,Sudan,2015,40024431.0


Select a subset of columns from the `population` table and display the first 10 rows.

In [15]:
%%sql

SELECT country_name, year, value FROM population LIMIT 10;

Unnamed: 0,country_name,year,value
0,Sudan,2024,50448963.0
1,Sudan,2023,50042791.0
2,Sudan,2022,49383346.0
3,Sudan,2021,48066924.0
4,Sudan,2020,46789231.0
5,Sudan,2019,45548175.0
6,Sudan,2018,44230596.0
7,Sudan,2017,42714306.0
8,Sudan,2016,41259892.0
9,Sudan,2015,40024431.0


To select distinct values, use the `DISTINCT` keyword. For example, `SELECT DISTINCT country_name FROM population` returns only the unique country names.

In [16]:
%%sql

SELECT DISTINCT country_name FROM population;

Unnamed: 0,country_name
0,Ethiopia
1,Chad
2,Central African Republic
3,Sudan
4,"Egypt, Arab Rep."
5,South Sudan
6,Eritrea
7,Libya


To count the number of rows returned, use the `COUNT(*)` function.

In [17]:
%%sql

SELECT COUNT(*) FROM states;

Unnamed: 0,count_star()
0,18


In [18]:
%%sql

SELECT COUNT(*) FROM population;

Unnamed: 0,count_star()
0,520


To count the number of distinct values, use the `COUNT(DISTINCT column)` function.

In [19]:
%%sql

SELECT COUNT(DISTINCT country_name) FROM population;

Unnamed: 0,count(DISTINCT country_name)
0,8


To calculate the maximum value, use the `MAX(column)` function. For example, the maximum population recorded across all countries and years.

In [20]:
%%sql

SELECT MAX(value) AS max_population FROM population;

Unnamed: 0,max_population
0,132059767.0


To calculate the minimum value, use the `MIN(column)` function.

In [21]:
%%sql

SELECT MIN(value) AS min_population FROM population;

Unnamed: 0,min_population
0,972547.0


To calculate the total value, use the `SUM(column)` function. For example, the total population of all 8 countries in 2023.

In [22]:
%%sql

SELECT SUM(value) AS total_population
FROM population
WHERE year = 2023;

Unnamed: 0,total_population
0,340001163.0


To calculate the average value, use the `AVG(column)` function.

In [23]:
%%sql

SELECT ROUND(AVG(value), 0) AS avg_population
FROM population
WHERE year = 2023;

Unnamed: 0,avg_population
0,42500145.0


To order the results, use the `ORDER BY column` clause. For example, order states alphabetically.

In [24]:
%%sql

SELECT * FROM states ORDER BY state_name;

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,Al Jazirah,الجزيرة,SD-GZ,33.53,14.88
1,Al Qadarif,القضارف,SD-GD,35.4,14.03
2,Blue Nile,النيل الأزرق,SD-NB,34.05,11.25
3,Central Darfur,وسط دارفور,SD-DC,24.23,13.5
4,East Darfur,شرق دارفور,SD-DE,26.13,12.75
5,Kassala,كسلا,SD-KA,36.4,15.45
6,Khartoum,الخرطوم,SD-KH,32.53,15.55
7,North Darfur,شمال دارفور,SD-DN,25.08,15.77
8,North Kordofan,شمال كردفان,SD-KN,29.42,13.83
9,Northern,الشمالية,SD-NO,30.22,19.5


To order the results in descending order, use the `ORDER BY column DESC` clause. For example, rank countries by population in 2023.

In [25]:
%%sql

SELECT country_name, year, value AS population
FROM population
WHERE year = 2023
ORDER BY value DESC;

Unnamed: 0,country_name,year,population
0,Ethiopia,2023,128691692.0
1,"Egypt, Arab Rep.",2023,114535772.0
2,Sudan,2023,50042791.0
3,Chad,2023,19319064.0
4,South Sudan,2023,11483374.0
5,Libya,2023,7305659.0
6,Central African Republic,2023,5152421.0
7,Eritrea,2023,3470390.0


## The WHERE Clause

The `WHERE` clause is used to filter records. It extracts only those records that fulfill a specified condition.

In [26]:
%%sql

SELECT * FROM population WHERE country_name = 'Sudan' AND year >= 2020;

Unnamed: 0,indicator_id,indicator_name,country,country_name,year,value
0,SP.POP.TOTL,"Population, total",SD,Sudan,2024,50448963.0
1,SP.POP.TOTL,"Population, total",SD,Sudan,2023,50042791.0
2,SP.POP.TOTL,"Population, total",SD,Sudan,2022,49383346.0
3,SP.POP.TOTL,"Population, total",SD,Sudan,2021,48066924.0
4,SP.POP.TOTL,"Population, total",SD,Sudan,2020,46789231.0


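You can use boolean operators such as `AND`, `OR`, `NOT` to filter records. Note that the World Bank lists Egypt as `Egypt, Arab Rep.`, so the `'Egypt'` comparison in the query below matches nothing and only Sudan's rows come back. Use the full name or the `country` code `EG` to include Egypt.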

In [27]:
%%sql

SELECT country_name, year, value
FROM population
WHERE (country_name = 'Sudan' OR country_name = 'Egypt')
AND year >= 2020
ORDER BY country_name, year;

Unnamed: 0,country_name,year,value
0,Sudan,2020,46789231.0
1,Sudan,2021,48066924.0
2,Sudan,2022,49383346.0
3,Sudan,2023,50042791.0
4,Sudan,2024,50448963.0


To select states with names starting with the letter `N`, use `LIKE 'N%'`.

In [28]:
%%sql

SELECT * FROM states WHERE state_name LIKE 'N%';

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,Northern,الشمالية,SD-NO,30.22,19.5
1,North Darfur,شمال دارفور,SD-DN,25.08,15.77
2,North Kordofan,شمال كردفان,SD-KN,29.42,13.83


To select all Darfur states, use `LIKE '%Darfur%'`.

In [29]:
%%sql

SELECT * FROM states WHERE state_name LIKE '%Darfur%';

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,South Darfur,جنوب دارفور,SD-DS,24.92,11.75
1,North Darfur,شمال دارفور,SD-DN,25.08,15.77
2,West Darfur,غرب دارفور,SD-DW,22.85,12.83
3,Central Darfur,وسط دارفور,SD-DC,24.23,13.5
4,East Darfur,شرق دارفور,SD-DE,26.13,12.75


To select all Kordofan states, use `LIKE '%Kordofan%'`.

In [30]:
%%sql

SELECT * FROM states WHERE state_name LIKE '%Kordofan%';

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,North Kordofan,شمال كردفان,SD-KN,29.42,13.83
1,South Kordofan,جنوب كردفان,SD-KS,29.67,11.2
2,West Kordofan,غرب كردفان,SD-KW,28.05,12.25


To select from a list of values, use the `IN` operator.

In [31]:
%%sql

SELECT country_name, year, value
FROM population
WHERE country_name IN ('Sudan', 'Egypt', 'Ethiopia')
AND year = 2023
ORDER BY value DESC;

Unnamed: 0,country_name,year,value
0,Ethiopia,2023,128691692.0
1,Sudan,2023,50042791.0


To select rows within a range, use the `BETWEEN` operator. For example, population data between 2015 and 2023.

In [32]:
%%sql

SELECT country_name, year, value
FROM population
WHERE country_name = 'Sudan'
AND year BETWEEN 2015 AND 2023
ORDER BY year;

Unnamed: 0,country_name,year,value
0,Sudan,2015,40024431.0
1,Sudan,2016,41259892.0
2,Sudan,2017,42714306.0
3,Sudan,2018,44230596.0
4,Sudan,2019,45548175.0
5,Sudan,2020,46789231.0
6,Sudan,2021,48066924.0
7,Sudan,2022,49383346.0
8,Sudan,2023,50042791.0


## SQL Joins

Reference: https://www.w3schools.com/sql/sql_join.asp

Here are the different types of JOINs in SQL:

- `(INNER) JOIN`: Returns records that have matching values in both tables
- `LEFT (OUTER) JOIN`: Returns all records from the left table, and the matched records from the right table
- `RIGHT (OUTER) JOIN`: Returns all records from the right table, and the matched records from the left table
- `FULL (OUTER) JOIN`: Returns all records from both tables, pairing rows where a match exists and filling the missing side with `NULL`

![](https://i.imgur.com/mITYzuS.png)

We have two sample tables: `population` and `gdp`.

Both contain data for 8 countries across many years. We'll join them to calculate **GDP per capita**.

In [33]:
%%sql

SELECT COUNT(*) AS population_rows FROM population;

Unnamed: 0,population_rows
0,520


In [34]:
%%sql

SELECT COUNT(*) AS gdp_rows FROM gdp;

Unnamed: 0,gdp_rows
0,418


In [35]:
%%sql

SELECT * FROM population WHERE year = 2023 ORDER BY country_name;

Unnamed: 0,indicator_id,indicator_name,country,country_name,year,value
0,SP.POP.TOTL,"Population, total",CF,Central African Republic,2023,5152421.0
1,SP.POP.TOTL,"Population, total",TD,Chad,2023,19319064.0
2,SP.POP.TOTL,"Population, total",EG,"Egypt, Arab Rep.",2023,114535772.0
3,SP.POP.TOTL,"Population, total",ER,Eritrea,2023,3470390.0
4,SP.POP.TOTL,"Population, total",ET,Ethiopia,2023,128691692.0
5,SP.POP.TOTL,"Population, total",LY,Libya,2023,7305659.0
6,SP.POP.TOTL,"Population, total",SS,South Sudan,2023,11483374.0
7,SP.POP.TOTL,"Population, total",SD,Sudan,2023,50042791.0


In [36]:
%%sql

SELECT * FROM gdp WHERE year = 2023 ORDER BY country_name;

Unnamed: 0,indicator_id,indicator_name,country,country_name,year,value
0,NY.GDP.MKTP.CD,GDP (current US$),CF,Central African Republic,2023,2555492000.0
1,NY.GDP.MKTP.CD,GDP (current US$),TD,Chad,2023,18340230000.0
2,NY.GDP.MKTP.CD,GDP (current US$),EG,"Egypt, Arab Rep.",2023,395926100000.0
3,NY.GDP.MKTP.CD,GDP (current US$),ET,Ethiopia,2023,135874100000.0
4,NY.GDP.MKTP.CD,GDP (current US$),LY,Libya,2023,44027660000.0
5,NY.GDP.MKTP.CD,GDP (current US$),SD,Sudan,2023,39898290000.0


### SQL Inner Join

The `INNER JOIN` keyword selects records that have matching values in both tables. Here we join `population` and `gdp` on `country` and `year` to calculate GDP per capita.

In [37]:
%%sql

SELECT
    p.country_name,
    p.year,
    p.value AS population,
    g.value AS gdp_usd,
    ROUND(g.value / p.value, 2) AS gdp_per_capita
FROM population p
INNER JOIN gdp g ON p.country = g.country AND p.year = g.year
WHERE p.year = 2023
ORDER BY gdp_per_capita DESC;

Unnamed: 0,country_name,year,population,gdp_usd,gdp_per_capita
0,Libya,2023,7305659.0,44027660000.0,6026.52
1,"Egypt, Arab Rep.",2023,114535772.0,395926100000.0,3456.79
2,Ethiopia,2023,128691692.0,135874100000.0,1055.81
3,Chad,2023,19319064.0,18340230000.0,949.33
4,Sudan,2023,50042791.0,39898290000.0,797.28
5,Central African Republic,2023,5152421.0,2555492000.0,495.98


### SQL Left Join

The `LEFT JOIN` keyword returns all records from the left table (`population`), and the matched records from the right table (`gdp`). If there is no match, the right side will contain `NULL`.

In [38]:
%%sql

SELECT
    p.country_name,
    p.year,
    p.value AS population,
    g.value AS gdp_usd
FROM population p
LEFT JOIN gdp g ON p.country = g.country AND p.year = g.year
WHERE p.year >= 2022
ORDER BY p.country_name, p.year;

Unnamed: 0,country_name,year,population,gdp_usd
0,Central African Republic,2022,5098039.0,2382619000.0
1,Central African Republic,2023,5152421.0,2555492000.0
2,Central African Republic,2024,5330690.0,2751494000.0
3,Chad,2022,18455316.0,17828510000.0
4,Chad,2023,19319064.0,18340230000.0
5,Chad,2024,20299123.0,19518820000.0
6,"Egypt, Arab Rep.",2022,112618250.0,476747700000.0
7,"Egypt, Arab Rep.",2023,114535772.0,395926100000.0
8,"Egypt, Arab Rep.",2024,116538258.0,389059900000.0
9,Eritrea,2022,3409447.0,


### SQL Right Join

The `RIGHT JOIN` keyword returns all records from the right table (`gdp`), and the matched records from the left table (`population`).

In [39]:
%%sql

SELECT
    g.country_name,
    g.year,
    p.value AS population,
    g.value AS gdp_usd
FROM population p
RIGHT JOIN gdp g ON p.country = g.country AND p.year = g.year
WHERE g.year >= 2022
ORDER BY g.country_name, g.year;

Unnamed: 0,country_name,year,population,gdp_usd
0,Central African Republic,2022,5098039.0,2382619000.0
1,Central African Republic,2023,5152421.0,2555492000.0
2,Central African Republic,2024,5330690.0,2751494000.0
3,Chad,2022,18455316.0,17828510000.0
4,Chad,2023,19319064.0,18340230000.0
5,Chad,2024,20299123.0,19518820000.0
6,"Egypt, Arab Rep.",2022,112618250.0,476747700000.0
7,"Egypt, Arab Rep.",2023,114535772.0,395926100000.0
8,"Egypt, Arab Rep.",2024,116538258.0,389059900000.0
9,Ethiopia,2022,125384287.0,-1413747000.0


### SQL Full Join

The `FULL JOIN` keyword returns all records from both tables. Rows are paired where a match exists, and the missing side is filled with `NULL` otherwise.

In [40]:
%%sql

SELECT
    COALESCE(p.country_name, g.country_name) AS country_name,
    COALESCE(p.year, g.year) AS year,
    p.value AS population,
    g.value AS gdp_usd
FROM population p
FULL JOIN gdp g ON p.country = g.country AND p.year = g.year
WHERE COALESCE(p.year, g.year) = 2023
ORDER BY country_name;

Unnamed: 0,country_name,year,population,gdp_usd
0,Central African Republic,2023,5152421.0,2555492000.0
1,Chad,2023,19319064.0,18340230000.0
2,"Egypt, Arab Rep.",2023,114535772.0,395926100000.0
3,Eritrea,2023,3470390.0,
4,Ethiopia,2023,128691692.0,135874100000.0
5,Libya,2023,7305659.0,44027660000.0
6,South Sudan,2023,11483374.0,
7,Sudan,2023,50042791.0,39898290000.0


### SQL Union

The `UNION` operator combines the result sets of two or more `SELECT` statements and removes duplicate rows.

In [41]:
%%sql

SELECT country_name, 'population' AS indicator FROM population WHERE year = 2023
UNION
SELECT country_name, 'gdp' AS indicator FROM gdp WHERE year = 2023
ORDER BY country_name, indicator;

Unnamed: 0,country_name,indicator
0,Central African Republic,gdp
1,Central African Republic,population
2,Chad,gdp
3,Chad,population
4,"Egypt, Arab Rep.",gdp
5,"Egypt, Arab Rep.",population
6,Eritrea,population
7,Ethiopia,gdp
8,Ethiopia,population
9,Libya,gdp


## Aggregation

### Group By

The `GROUP BY` statement groups rows that have the same values into summary rows, like "find the latest population of each country".

The `GROUP BY` statement is often used with aggregate functions (`COUNT`, `MAX`, `MIN`, `SUM`, `AVG`) to group the result-set by one or more columns.

In [42]:
%%sql

SELECT country_name, COUNT(*) AS num_years
FROM population
GROUP BY country_name
ORDER BY num_years DESC;

Unnamed: 0,country_name,num_years
0,Sudan,65
1,"Egypt, Arab Rep.",65
2,South Sudan,65
3,Eritrea,65
4,Libya,65
5,Ethiopia,65
6,Chad,65
7,Central African Republic,65


In [43]:
%%sql

SELECT country_name, MAX(value) AS max_population, MIN(value) AS min_population
FROM population
GROUP BY country_name
ORDER BY max_population DESC;

Unnamed: 0,country_name,max_population,min_population
0,Ethiopia,132059767.0,21376693.0
1,"Egypt, Arab Rep.",116538258.0,26896479.0
2,Sudan,50448963.0,8364489.0
3,Chad,20299123.0,3049838.0
4,South Sudan,11943408.0,2931559.0
5,Libya,7381023.0,1492890.0
6,Central African Republic,5330690.0,1702346.0
7,Eritrea,3535603.0,972547.0


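Calculate the average population per decade for Sudan. Note that DuckDB's `/` operator performs floating-point division even on integers, so `(year / 10) * 10` in the query below returns each year unchanged and the result has one row per year rather than one per decade. Integer division (`//`) is needed to bucket years into decades.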

In [44]:
%%sql

SELECT
    (year / 10) * 10 AS decade,
    ROUND(AVG(value), 0) AS avg_population
FROM population
WHERE country_name = 'Sudan'
GROUP BY decade
ORDER BY decade;

Unnamed: 0,decade,avg_population
0,1960.0,8364489.0
1,1961.0,8634941.0
2,1962.0,8919028.0
3,1963.0,9218077.0
4,1964.0,9531109.0
...,...,...
60,2020.0,46789231.0
61,2021.0,48066924.0
62,2022.0,49383346.0
63,2023.0,50042791.0


### Having

The `WHERE` clause filters rows before they are grouped, so it cannot reference aggregate functions. The `HAVING` clause filters the groups after aggregation.

For example, to select countries with a maximum population greater than 50 million:

In [45]:
%%sql

SELECT country_name, MAX(value) AS max_population
FROM population
GROUP BY country_name
HAVING MAX(value) > 50000000
ORDER BY max_population DESC;

Unnamed: 0,country_name,max_population
0,Ethiopia,132059767.0
1,"Egypt, Arab Rep.",116538258.0
2,Sudan,50448963.0


Compute the average GDP per capita by country since 2015, showing only countries whose average exceeds $1,000.

In [46]:
%%sql

SELECT
    p.country_name,
    ROUND(AVG(g.value / p.value), 2) AS avg_gdp_per_capita
FROM population p
INNER JOIN gdp g ON p.country = g.country AND p.year = g.year
WHERE p.year >= 2015
GROUP BY p.country_name
HAVING AVG(g.value / p.value) > 1000
ORDER BY avg_gdp_per_capita DESC;

Unnamed: 0,country_name,avg_gdp_per_capita
0,Libya,7627.84
1,"Egypt, Arab Rep.",3278.74
2,South Sudan,1080.15


## Conditional Statements

The `CASE` statement goes through conditions and returns a value when the first condition is met (like an `IF-THEN-ELSE` statement).

For example, to classify countries by population size:

In [47]:
%%sql

SELECT country_name, value AS population,
CASE
    WHEN value > 100000000 THEN 'Very Large (100M+)'
    WHEN value > 30000000 THEN 'Large (30M+)'
    WHEN value > 10000000 THEN 'Medium (10M+)'
    ELSE 'Small (<10M)'
END AS size_category
FROM population
WHERE year = 2023
ORDER BY value DESC;

Unnamed: 0,country_name,population,size_category
0,Ethiopia,128691692.0,Very Large (100M+)
1,"Egypt, Arab Rep.",114535772.0,Very Large (100M+)
2,Sudan,50042791.0,Large (30M+)
3,Chad,19319064.0,Medium (10M+)
4,South Sudan,11483374.0,Medium (10M+)
5,Libya,7305659.0,Small (<10M)
6,Central African Republic,5152421.0,Small (<10M)
7,Eritrea,3470390.0,Small (<10M)


Classify Sudan's states into regions using `CASE`.

In [48]:
%%sql

SELECT state_name, state_name_ar, iso_code,
CASE
    WHEN state_name LIKE '%Darfur%' THEN 'Darfur'
    WHEN state_name LIKE '%Kordofan%' THEN 'Kordofan'
    WHEN state_name IN ('Khartoum', 'Al Jazirah', 'White Nile', 'Blue Nile', 'Sennar') THEN 'Central'
    WHEN state_name IN ('Kassala', 'Al Qadarif', 'Red Sea') THEN 'Eastern'
    WHEN state_name IN ('River Nile', 'Northern') THEN 'Northern'
    ELSE 'Other'
END AS region
FROM states
ORDER BY region, state_name;

Unnamed: 0,state_name,state_name_ar,iso_code,region
0,Al Jazirah,الجزيرة,SD-GZ,Central
1,Blue Nile,النيل الأزرق,SD-NB,Central
2,Khartoum,الخرطوم,SD-KH,Central
3,Sennar,سنار,SD-SI,Central
4,White Nile,النيل الأبيض,SD-NW,Central
5,Central Darfur,وسط دارفور,SD-DC,Darfur
6,East Darfur,شرق دارفور,SD-DE,Darfur
7,North Darfur,شمال دارفور,SD-DN,Darfur
8,South Darfur,جنوب دارفور,SD-DS,Darfur
9,West Darfur,غرب دارفور,SD-DW,Darfur


## Saving Results

You can save the results of a query to a new table using the `CREATE TABLE AS` statement.

In [49]:
%%sql

DROP TABLE IF EXISTS sudan_summary;
CREATE TABLE sudan_summary AS
SELECT country_name, year, value AS population
FROM population
WHERE country_name = 'Sudan' AND year >= 2000;

Unnamed: 0,Count
0,25


In [51]:
%%sql

FROM sudan_summary;

Unnamed: 0,country_name,year,population
0,Sudan,2024,50448963.0
1,Sudan,2023,50042791.0
2,Sudan,2022,49383346.0
3,Sudan,2021,48066924.0
4,Sudan,2020,46789231.0
5,Sudan,2019,45548175.0
6,Sudan,2018,44230596.0
7,Sudan,2017,42714306.0
8,Sudan,2016,41259892.0
9,Sudan,2015,40024431.0


Use the `INSERT INTO` statement to insert rows into a table.

In [52]:
%%sql

DROP TABLE IF EXISTS darfur_states;
CREATE TABLE darfur_states AS
SELECT * FROM states WHERE state_name LIKE '%Darfur%';

Unnamed: 0,Count
0,5


In [53]:
%%sql

INSERT INTO darfur_states
SELECT * FROM states WHERE state_name LIKE '%Kordofan%';

Unnamed: 0,Count
0,3


In [54]:
%%sql

FROM darfur_states;

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,South Darfur,جنوب دارفور,SD-DS,24.92,11.75
1,North Darfur,شمال دارفور,SD-DN,25.08,15.77
2,West Darfur,غرب دارفور,SD-DW,22.85,12.83
3,Central Darfur,وسط دارفور,SD-DC,24.23,13.5
4,East Darfur,شرق دارفور,SD-DE,26.13,12.75
5,North Kordofan,شمال كردفان,SD-KN,29.42,13.83
6,South Kordofan,جنوب كردفان,SD-KS,29.67,11.2
7,West Kordofan,غرب كردفان,SD-KW,28.05,12.25


## Export to Files

Export data to CSV.

In [55]:
%%sql

COPY (SELECT * FROM sudan_summary) TO 'sudan_population.csv' (HEADER, DELIMITER ',');

Unnamed: 0,Count
0,25


In [57]:
%%sql
RESET custom_extension_repository;

Unnamed: 0,Success


## Geospatial Queries (Bonus)

The Sudan extension embeds real MultiPolygon boundaries (GADM v4.1) for all 18 states. Combined with DuckDB's `spatial` extension, you can run spatial SQL queries.

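Install and load the spatial extension first. The `RESET custom_extension_repository` cell above restores DuckDB's default repository, so `INSTALL spatial` is resolved against it rather than the Sudan extension's repository.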

In [58]:
%%sql

INSTALL spatial;
LOAD spatial;

Unnamed: 0,Success


Convert GeoJSON strings to geometry objects.

In [59]:
%%sql

SELECT state_name, ST_GeomFromGeoJSON(geojson) AS geom
FROM SUDAN_Boundaries('state');

Unnamed: 0,state_name,geom
0,Khartoum,"[5, 4, 0, 0, 0, 0, 0, 0, 231, 251, 252, 65, 8,..."
1,Al Jazirah,"[5, 4, 0, 0, 0, 0, 0, 0, 161, 197, 1, 66, 247,..."
2,Al Qadarif,"[5, 4, 0, 0, 0, 0, 0, 0, 81, 56, 6, 66, 18, 13..."
3,Kassala,"[5, 4, 0, 0, 0, 0, 0, 0, 106, 188, 8, 66, 225,..."
4,Red Sea,"[5, 4, 0, 0, 0, 0, 0, 0, 104, 17, 5, 66, 137, ..."
5,River Nile,"[5, 4, 0, 0, 0, 0, 0, 0, 190, 159, 254, 65, 86..."
6,Northern,"[5, 4, 0, 0, 0, 0, 0, 0, 0, 0, 200, 65, 98, 16..."
7,White Nile,"[5, 4, 0, 0, 0, 0, 0, 0, 126, 106, 252, 65, 22..."
8,Blue Nile,"[5, 4, 0, 0, 0, 0, 0, 0, 79, 141, 4, 66, 143, ..."
9,Sennar,"[5, 4, 0, 0, 0, 0, 0, 0, 161, 197, 3, 66, 30, ..."


Compute centroids from polygon geometries.

In [60]:
%%sql

SELECT
    state_name,
    ROUND(ST_X(ST_Centroid(ST_GeomFromGeoJSON(geojson))), 3) AS centroid_lon,
    ROUND(ST_Y(ST_Centroid(ST_GeomFromGeoJSON(geojson))), 3) AS centroid_lat
FROM SUDAN_Boundaries('state');

Unnamed: 0,state_name,centroid_lon,centroid_lat
0,Khartoum,32.802,15.844
1,Al Jazirah,33.345,14.582
2,Al Qadarif,35.221,14.171
3,Kassala,35.751,15.989
4,Red Sea,35.699,19.833
5,River Nile,33.46,18.355
6,Northern,29.353,19.565
7,White Nile,32.351,13.406
8,Blue Nile,34.125,11.269
9,Sennar,34.053,12.896


Find which state contains a specific point (e.g., Khartoum city at 32.53, 15.59).

In [61]:
%%sql

SELECT state_name, state_name_ar
FROM SUDAN_Boundaries('state')
WHERE ST_Contains(ST_GeomFromGeoJSON(geojson), ST_Point(32.53, 15.59));

Unnamed: 0,state_name,state_name_ar
0,Khartoum,الخرطوم


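Compute the approximate area of each state in km². The query multiplies the area in square degrees by 111.32² (one degree is about 111.32 km at the equator). Because a degree of longitude shrinks with latitude, this overestimates areas somewhat, more so for the northern states.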

In [62]:
%%sql

SELECT
    state_name,
    ROUND(ST_Area(ST_GeomFromGeoJSON(geojson)) * 111.32 * 111.32, 0) AS area_km2
FROM SUDAN_Boundaries('state')
ORDER BY area_km2 DESC;

Unnamed: 0,state_name,area_km2
0,Northern,390924.0
1,North Darfur,332427.0
2,Red Sea,231470.0
3,North Kordofan,195881.0
4,River Nile,133640.0
5,West Kordofan,116117.0
6,South Kordofan,81481.0
7,South Darfur,80078.0
8,Al Qadarif,66469.0
9,East Darfur,56520.0
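
The query above converts square degrees to km² with a constant factor of 111.32², which treats a degree of longitude as equally long at every latitude. On a sphere, a degree of longitude shrinks by the cosine of the latitude. The sketch below estimates how much the flat conversion overstates area at three centroid latitudes taken from the `states` table. It is a rough spherical approximation for intuition only; the function `cell_area_km2` is hypothetical, and projecting to an equal-area coordinate system would be the more accurate route.

```python
import math

KM_PER_DEGREE = 111.32  # the constant used in the area query


def cell_area_km2(lat_deg: float) -> float:
    """Approximate area of a 1-degree by 1-degree cell centred at lat_deg."""
    return KM_PER_DEGREE ** 2 * math.cos(math.radians(lat_deg))


flat_area = KM_PER_DEGREE ** 2  # what the constant factor assumes everywhere

# Centroid latitudes taken from the states table in this notebook
for name, lat in [("Blue Nile", 11.25), ("Khartoum", 15.55), ("Northern", 19.5)]:
    overestimate = flat_area / cell_area_km2(lat) - 1
    print(f"{name}: flat conversion is about {overestimate:.1%} too large")
```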


Export the boundaries to a GeoPackage file (can be opened in QGIS, ArcGIS, etc.).

In [63]:
%%sql

COPY (
    SELECT state_name, state_name_ar, iso_code,
           ST_GeomFromGeoJSON(geojson) AS geom
    FROM SUDAN_Boundaries('state')
) TO 'sudan_states.gpkg' WITH (FORMAT GDAL, DRIVER 'GPKG');

Unnamed: 0,Count
0,18


## SQL Comments

Comments are used to explain sections of SQL statements, or to prevent execution of SQL statements.

### Single Line Comments

Single line comments start with `--`. Any text between `--` and the end of the line will be ignored.

In [64]:
%%sql

SELECT * FROM states -- Show first 5 states
LIMIT 5;

Unnamed: 0,state_name,state_name_ar,iso_code,centroid_lon,centroid_lat
0,Khartoum,الخرطوم,SD-KH,32.53,15.55
1,Al Jazirah,الجزيرة,SD-GZ,33.53,14.88
2,Al Qadarif,القضارف,SD-GD,35.4,14.03
3,Kassala,كسلا,SD-KA,36.4,15.45
4,Red Sea,البحر الأحمر,SD-RS,37.22,19.62


### Multi-line Comments

Multi-line comments start with `/*` and end with `*/`. Any text between `/*` and `*/` will be ignored.

In [65]:
%%sql

SELECT country_name, year, value
FROM population
/*
 * Filter for Sudan only
 * Recent years (2020+)
 */
WHERE country_name = 'Sudan'
AND year >= 2020
ORDER BY year DESC;

Unnamed: 0,country_name,year,value
0,Sudan,2024,50448963.0
1,Sudan,2023,50042791.0
2,Sudan,2022,49383346.0
3,Sudan,2021,48066924.0
4,Sudan,2020,46789231.0
