# GIS-Extraction Notebook for GEP-OnSSET

This is the GEP-OnSSET GIS extraction notebook. It replaces the [QGIS plugin](https://github.com/global-electrification-platform/Cluster-based_extraction_OnSSET/tree/master/Plugin) used in the online course.

The main purpose of this notebook is to let you replace individual datasets without running through the entire plugin again. You can update as many datasets as needed.

In order to run an OnSSET analysis the following datasets are needed:
* Admin boundaries
* Elevation
* Global horizontal irradiation
* Land Cover
* Travel time
* Wind velocity
* Clusters **(these clusters should include: The name of the study area, the amount of nighttime lights, population, population living in areas with nighttime light and an ID column)**. The clusters can be downloaded from [Energydata.info](https://energydata.info/) or generated directly using this [code](https://github.com/global-electrification-platform/Clustering_notebook).

In addition to this there are also some optional datasets that can be used in the analysis:
* Custom Demand - Can be generated using this [code](https://github.com/global-electrification-platform/CREDIT_layer). A description of the methodology is available [here](https://www.mdpi.com/1996-1073/12/7/1395).
* Substations
* Transformers
* Existing mini-grids
* Adm0 & Adm1 boundary layers
* Mini/Small hydro
* Existing and planned HV-lines
* Existing and planned MV-lines 
* Road network

Instructions for each cell follow below. Cells marked with **(Mandatory)** in the title have to be run.

### Useful hints and common error messages
* Make sure that all input layers are using EPSG:4326 as the coordinate system
* Make sure that the "crs" in cell 2 is in a coordinate system using meters as the unit
* It is often useful to clip all the input layers to the country boundaries in order to reduce processing times
* Make sure that each dataset actually has some data within the country boundaries
* Some of the datasets require the user to choose values from a dropdown list below
* For hydro points and mini-grids, the vector layers need some specific column names to work
* If a dataset still does not work, try opening it in QGIS, running the *Fix geometries* tool and saving the result as a new layer.
* If things do not work, it may be useful to go to the very top of this Jupyter Notebook and start again from cell 1
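
One of the hints above asks you to confirm that each dataset actually has data inside the country boundaries. A minimal sketch of such a check is given below. It only compares bounding boxes, so it is a quick first test and not a full geometric intersection. The function name and the extent values are hypothetical. With geopandas, a `(minx, miny, maxx, maxy)` tuple of this kind can be read from `GeoDataFrame.total_bounds`.

```python
def bboxes_overlap(a, b):
    """Return True if two (minx, miny, maxx, maxy) boxes share any area or edge."""
    a_minx, a_miny, a_maxx, a_maxy = a
    b_minx, b_miny, b_maxx, b_maxy = b
    return (a_minx <= b_maxx and b_minx <= a_maxx and
            a_miny <= b_maxy and b_miny <= a_maxy)


# Hypothetical extents in EPSG:4326 degrees (illustrative values only)
country_bounds = (29.0, -12.0, 41.0, -1.0)
layer_bounds = (33.5, -9.0, 36.0, -3.0)
far_away_bounds = (100.0, 10.0, 110.0, 20.0)

print(bboxes_overlap(country_bounds, layer_bounds))     # layer touches the country extent
print(bboxes_overlap(country_bounds, far_away_bounds))  # layer lies elsewhere
```

If the check fails, the layer is most likely in a different coordinate system than EPSG:4326 or covers a different region.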


## Cell 1 - Importing necessary packages (Mandatory)

The required packages and helper functions are imported from funcs.ipynb.

In [1]:
%run funcs.ipynb
import time

## Cell 2 - Setting the target coordinate system (Mandatory)

When calculating distances it is important to choose a coordinate system that represents distances correctly in your area of interest. The coordinate system given below is World Mercator (EPSG:3395). It generally works well, but the distortions get larger as you move away from the equator.

To select your own coordinate system, go to [epsg.io](http://epsg.io/) and type in your area of interest. This will give you a list of coordinate systems to choose from. Once you have selected one, replace the numbers below with the numbers of your coordinate system **(keep the "EPSG" part)**.

**NOTE** When selecting your coordinate system, make sure that it uses meters as its unit. The unit is indicated for every system on [epsg.io](http://epsg.io/).
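
To get a feeling for how much a Mercator projection stretches distances away from the equator, the sketch below uses the spherical approximation of the Mercator scale factor, `1 / cos(latitude)`. This is an approximation chosen for illustration: EPSG:3395 is defined on the WGS 84 ellipsoid, so its exact factor differs slightly. The function name is hypothetical.

```python
import math


def mercator_scale_factor(lat_deg):
    """Approximate Mercator scale factor at a latitude (spherical approximation)."""
    return 1.0 / math.cos(math.radians(lat_deg))


for lat in (0, 15, 30, 45, 60):
    k = mercator_scale_factor(lat)
    # (k - 1) is the relative overestimation of a short distance at this latitude
    print(f"lat {lat:>2} deg: distances overestimated by about {(k - 1) * 100:.1f}%")
```

If the overestimation at the latitude of your study area is too large for your purposes, a local projected system (for example the UTM zone covering the area) is usually a better choice.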

In [2]:
crs = 'EPSG:3395'

## Cell 3 - Select the workspace and the administrative boundaries (Mandatory)

Define the workspace. The output layers will populate this folder. It is highly recommended to select an empty folder as your workspace.

For the administrative boundaries you will have to select a **Polygon** layer representing your area of interest.
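
Because an empty workspace is recommended, it can help to test the folder before running the extraction. The sketch below is one possible check using only the standard library. The helper name `is_empty_folder` is hypothetical and is not part of funcs.ipynb, and a temporary directory stands in for the real workspace.

```python
import os
import tempfile


def is_empty_folder(path):
    """Return True if path is an existing directory with no entries."""
    return os.path.isdir(path) and not os.listdir(path)


# Demonstration with a temporary directory standing in for the workspace
with tempfile.TemporaryDirectory() as tmp:
    empty_before = is_empty_folder(tmp)
    # Simulate a leftover output file from an earlier run
    open(os.path.join(tmp, "old_output.csv"), "w").close()
    empty_after = is_empty_folder(tmp)

print(empty_before, empty_after)
```

In the notebook, the same function could be called on `workspace` right after the folder has been selected.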
        

In [3]:
messagebox.showinfo('OnSSET extraction', 'Output folder')
workspace = filedialog.askdirectory()

In [4]:
messagebox.showinfo('OnSSET', 'Select the admin boundaries')
admin_path = filedialog.askopenfilename(filetypes = (("vector",["*.shp", "*.gpkg", "*.geojson"]),("all files","*.*")))
admin = gpd.read_file(admin_path)

## Cell 4 - Select the population clusters (Mandatory)

Select the clusters to be used in the analysis.

Please also indicate which column represents the population data, as this will be used later.

In [5]:
x, clusters, clusters_path = select_pop_clusters()
clusters = clusters[~clusters.geometry.is_empty & clusters.geometry.notnull()].copy()
# Fix invalid geometries
clusters['geometry'] = clusters['geometry'].buffer(0)

Population column: Population


## Cell 5 - Select the Global Horizontal Irradiation (GHI) map - Raster map (Mandatory)

**If your settlement data already includes GHI data, skip to cell 6. Note however that this dataset is mandatory to run the OnSSET analysis**

Select the GHI map that you wish to use in your analysis. This cell will extract the GHI values from your raster map to your clusters.
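
The raster cells in this notebook all summarise raster values per cluster (a zonal statistic). The sketch below illustrates the idea for the `'mean'` statistic on a tiny grid. It assumes that each raster cell is weighted by the fraction of the cell covered by the cluster polygon, which the name `zonal_stats_exact` suggests but which is not confirmed here. The actual implementation lives in funcs.ipynb. All names and values below are hypothetical.

```python
def weighted_zonal_mean(values, coverage):
    """Coverage-weighted mean of raster cells; nodata (None) cells are skipped."""
    total = 0.0
    weight = 0.0
    for row_values, row_coverage in zip(values, coverage):
        for value, fraction in zip(row_values, row_coverage):
            if value is None or fraction <= 0:
                continue  # ignore nodata cells and cells outside the cluster
            total += value * fraction
            weight += fraction
    return total / weight if weight > 0 else None


# Hypothetical 2x2 GHI raster window and the fraction of each cell
# that falls inside one cluster polygon
ghi = [[2000.0, 2100.0],
       [None,   2200.0]]
cov = [[1.0, 0.5],
       [1.0, 0.0]]

cluster_mean = weighted_zonal_mean(ghi, cov)
print(cluster_mean)
```

A `'max'` statistic would follow the same pattern but keep the largest value among the covered cells instead of averaging them.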

In [None]:
out, ghi_path = zonal_stats_exact('GHI', clusters, 'mean')
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 6 - Select the Travel Time map - Raster map (Mandatory)
 
**If your settlement data already includes travel time data, skip to cell 7. Note however that this dataset is mandatory to run the OnSSET analysis**

Select the travel time map that you wish to use in your analysis. This cell will extract the travel time values in your raster map to your clusters.

In [None]:
out, traveltime_path = zonal_stats_exact('TravelTime', clusters, 'mean')
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 7 - Select the Wind Velocity map - Raster map (Mandatory)

**If your settlement data already includes wind velocity data, skip to cell 8. Note however that this dataset is mandatory to run the OnSSET analysis**

Select the wind velocity map that you wish to use in your analysis. This cell will extract the wind velocity values in your raster map to your clusters.

In [None]:
out, wind_path = zonal_stats_exact('WindVel', clusters, 'mean')
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 8 - Select the Night Lights map - Raster map (Mandatory)

**If your settlement data already includes night lights data, skip to cell 9. Note however that this dataset is mandatory to run the OnSSET analysis**

Select the night lights map that you wish to use in your analysis. This cell will extract the night lights values in your raster map to your clusters.

In [None]:
out, ntl_path = zonal_stats_exact('NightLight', clusters, 'max')
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 9 - Select the Custom Demand OR Demand index layer(s) - Raster map (Optional)

This cell may be used to extract the values from a) the Custom Residential Electricity Demand Indicative Target (CREDIT) layer **OR** b) the AHP classification process preceding that. Both options depend on raster layers that can be generated based on available data. 

You may refer to this [code](https://github.com/global-electrification-platform/CREDIT_layer) for additional information on how to generate these layers.

**Note** that we also use two input rasters for urban/rural settlements indicatively. 

In [None]:
out, customdemand_path = zonal_stats_exact('CustomDemand', clusters, 'mean')
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 10 - Preparing to run the vector data (Mandatory)

**If you are planning on extracting any vector data (substations, transformers, hydro, MV-lines, HV-lines or roads) run this cell**. 

This cell reprojects the settlements to the coordinate system you specified above.
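
To illustrate what reprojecting from EPSG:4326 (degrees) to the default EPSG:3395 (meters) does to a coordinate, the sketch below implements the standard forward formula for the Mercator projection on the WGS 84 ellipsoid. It is meant for illustration only and is not the code used by `preparing_for_vectors`. The function name `to_world_mercator` is hypothetical.

```python
import math

WGS84_A = 6378137.0                     # semi-major axis in meters
WGS84_E = math.sqrt(0.00669437999014)   # first eccentricity


def to_world_mercator(lon_deg, lat_deg):
    """Project a lon/lat pair in degrees to ellipsoidal Mercator x/y in meters."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    esin = WGS84_E * math.sin(phi)
    x = WGS84_A * lam
    y = WGS84_A * math.log(
        math.tan(math.pi / 4 + phi / 2)
        * ((1 - esin) / (1 + esin)) ** (WGS84_E / 2)
    )
    return x, y


x, y = to_world_mercator(36.8, -1.3)  # hypothetical point near the equator
print(round(x), round(y))
```

After this step, coordinates are expressed in meters, which is why the distances computed in the following cells depend on the coordinate system chosen in cell 2.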

In [6]:
clusters = preparing_for_vectors(workspace, clusters, crs)

Processing finished: 2025-07-30 12:24:30


## Cell 11 - Substations - Vector point layer (Optional)

**If you do not have substations or wish to keep the ones already in your settlement file, skip to cell 12.**

Determines the distance from each settlement point to the closest substation.
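
The nearest-point distance can be sketched as below. This is a brute-force illustration in plain Python and not the implementation in `processing_points`. It assumes that both layers are already in a projected coordinate system with meters as the unit, and that each settlement is represented by a single point. All names and coordinates are hypothetical.

```python
import math


def nearest_distance(settlement, points):
    """Smallest Euclidean distance from one (x, y) to a list of (x, y) points."""
    return min(math.hypot(settlement[0] - px, settlement[1] - py)
               for px, py in points)


# Hypothetical projected coordinates in meters
substations = [(0.0, 0.0), (10_000.0, 0.0)]
settlements = [(3_000.0, 4_000.0), (9_000.0, 0.0)]

distances_m = [nearest_distance(s, substations) for s in settlements]
print(distances_m)
```

For large layers a spatial index is normally used instead of comparing every pair of points.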

In [None]:
out, substation_path = processing_points("Substation", admin, crs, workspace, clusters, mg_filter=False)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 12 - Existing high voltage lines - Vector line layer (Optional)

**If you do not have existing high voltage lines or wish to keep the ones already in your settlement file, skip to cell 13.**

Determines the distance from each settlement point to the closest existing high voltage line.
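
For line layers, the closest location is usually somewhere along a segment and not at a vertex. The sketch below shows the underlying point-to-segment calculation in plain Python. It is an illustration under the assumption of projected coordinates in meters and is not the code in `processing_lines`. All names and coordinates are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b (all given as (x, y) tuples)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Position of the projection of p on the segment, clamped to [0, 1]
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_line(p, vertices):
    """Smallest distance from p to a polyline given as a list of vertices."""
    return min(point_segment_distance(p, a, b)
               for a, b in zip(vertices, vertices[1:]))


# Hypothetical HV line with two segments, coordinates in meters
hv_line = [(0.0, 0.0), (10_000.0, 0.0), (10_000.0, 10_000.0)]
print(distance_to_line((5_000.0, 3_000.0), hv_line))
print(distance_to_line((-3_000.0, -4_000.0), hv_line))
```

The same calculation applies to the planned HV lines, the MV lines and the roads in the following cells.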

In [None]:
out, existing_hv_path = processing_lines("Existing_HV", admin, crs, workspace, clusters)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 13 - Planned high voltage lines - Vector line layer (Optional)

**If you do not have planned high voltage lines or wish to keep the ones already in your settlement file, skip to cell 14.**

Determines the distance from each settlement point to the closest planned high voltage line.

In [None]:
out, planned_hv_path = processing_lines("Planned_HV", admin, crs, workspace, clusters)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 14 - Existing medium voltage lines - Vector line layer (Optional)

**If you do not have existing medium voltage lines or wish to keep the ones already in your settlement file, skip to cell 15.**

Determines the distance from each settlement point to the closest existing medium voltage line.

In [None]:
out, existing_mv_path = processing_lines("Existing_MV", admin, crs, workspace, clusters)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 15 - Planned medium voltage lines - Vector line layer (Optional)

**If you do not have planned medium voltage lines or wish to keep the ones already in your settlement file, skip to cell 16.**

Determines the distance from each settlement point to the closest planned medium voltage line.

In [None]:
out, planned_mv_path = processing_lines("Planned_MV", admin, crs, workspace, clusters)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 16 - Roads - Vector line layer (Optional)

**If you do not have roads or wish to keep the ones already in your settlement file, skip to cell 17.**

Determines the distance from each settlement point to the closest road.

In [None]:
out, roads_path = processing_lines("Road", admin, crs, workspace, clusters)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 17 - Service/distribution transformers - Vector point layer (Optional)

**If you do not have transformers or wish to keep the ones already in the settlement file, skip to cell 18.**

Determines the distance from each settlement point to the closest transformer.

In [None]:
out, service_transformer_path = processing_points("Service Transformer", admin, crs, workspace, clusters, mg_filter=False)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 18 - Hydro points - Vector point layer (Optional)

**If you do not have new hydro power points, skip to the next step.**

Select the hydro point layer you wish to use. It is important to have a column representing the power output for each hydro point in your dataset. After selecting the column you will also have to select the unit (W, kW or MW) of that column. 
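
The unit selection matters because the power values have to end up in one common unit. The sketch below shows such a conversion with kW as the target unit. The target unit is an assumption made for this example, and the actual handling is defined in the `hydro` function in funcs.ipynb. The names `TO_KW` and `power_to_kw` are hypothetical.

```python
# Conversion factors from the selectable units to kW (assumed target unit)
TO_KW = {"W": 0.001, "kW": 1.0, "MW": 1000.0}


def power_to_kw(value, unit):
    """Convert a power value given in W, kW or MW to kW."""
    try:
        return value * TO_KW[unit]
    except KeyError:
        raise ValueError(f"Unsupported unit: {unit!r}") from None


print(power_to_kw(2.5, "MW"))
print(power_to_kw(500, "W"))
```

Selecting the wrong unit shifts every hydro capacity by a factor of 1000 or more, so it is worth checking the metadata of the hydro layer before running the cell.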

In [7]:
out, hydro_path = hydro(admin, crs, workspace, clusters)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

PQ50_MW
MW
Processing finished: 2025-07-30 12:26:17


# Extra datasets that can be used to improve the analysis

## Cell 19 - Existing mini-grids - Vector point layer (Optional extra)

This function extracts the nearest mini-grid to each cluster and assigns its key characteristics (e.g. name, MV network status, type).

In [None]:
out, mg_path = processing_points("MiniGrid", admin, crs, workspace, clusters, mg_filter=True)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 20 - Extracting admin 1 name to clusters - Vector polygon layer (Optional extra)

This function extracts the admin level 1 (e.g. Province, State) name to each cluster based on spatial overlay. 
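
The idea of assigning a name by spatial overlay can be sketched with a simple point-in-polygon test. The example assumes that each cluster is represented by one point and that admin regions are simple polygons without holes. How the `admin_1` function in funcs.ipynb handles clusters that span several regions is not covered here. All names and coordinates are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if (x, y) lies inside a polygon given as a vertex list."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def admin_name_for(point, regions):
    """Return the name of the first region containing the point, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(point, polygon):
            return name
    return None


# Hypothetical admin level 1 regions
regions = {
    "North": [(0, 5), (10, 5), (10, 10), (0, 10)],
    "South": [(0, 0), (10, 0), (10, 5), (0, 5)],
}

print(admin_name_for((3, 7), regions))
print(admin_name_for((3, 2), regions))
print(admin_name_for((20, 20), regions))
```

Clusters that fall outside every region receive no name in this sketch, which is one reason to check that the admin 1 layer covers the whole study area.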

**Please provide the correct column name (e.g. "adm1_name") in the pop-up window.**

In [None]:
out, admin_1_path = admin_1("Admin_1", admin, crs, workspace, clusters)
if type(out) == gpd.geodataframe.GeoDataFrame:
    clusters = out

## Cell 21 - Conditioning & Export (Mandatory)

This is the final cell in the extraction. This cell has to be run.

In [None]:
clusters = conditioning(clusters, workspace, x)
print('Workspace: ', workspace)