# Preprocessing land-use/land-cover (LULC) data: enrichment and refinement with vector data

## Environment and dependencies

This preprocessing workflow requires several packages to run most of its processing commands. An Anaconda environment was used to keep library versions consistent and installation seamless. It is recommended to install geopandas and pandas together (to obtain compatible versions) through the Anaconda Prompt:
```
conda install -c conda-forge geopandas pandas
```
Other libraries may be installed through simple commands in your Anaconda Prompt:
```
conda install fiona
conda install gdal
```
The following package is not currently used in the preprocessing workflow, but might be useful in the future:
```
conda install qgis --channel conda-forge
```

Let's import all dependencies required:

In [185]:
import sys
import os
os.environ['USE_PATH_FOR_GDAL_PYTHON'] = 'YES' #to import gdal

import numpy as np
import numpy.ma as ma
import warnings
import fiona
import geopandas as gpd

# import processing if needed (currently not required)
# from qgis.core import QgsVectorLayer
# from qgis.core import QgsProject
# from qgis.core import QgsProcessingUtils
# from qgis.core import QgsGeometryChecker

Because the GDAL installation can fail or be incomplete, its import is wrapped in a separate check that stops with a clear message:

In [186]:
# IMPORTING GDAL (stop with a clear message if the bindings are missing)
try:
    from osgeo import ogr, osr, gdal
except ImportError:
    sys.exit('ERROR: cannot find GDAL/OGR modules')

It is recommended to define a GDAL error handler function and to enable GDAL exceptions:

In [187]:
# specify GDAL error handler function
# (it is only defined here; gdal.PushErrorHandler(gdal_error_handler) would register it)
def gdal_error_handler(err_class, err_num, err_msg):
    errtype = {
        gdal.CE_None: 'None',
        gdal.CE_Debug: 'Debug',
        gdal.CE_Warning: 'Warning',
        gdal.CE_Failure: 'Failure',
        gdal.CE_Fatal: 'Fatal'
    }
    err_msg = err_msg.replace('\n', ' ')
    err_class = errtype.get(err_class, 'None')
    print('Error Number: %s' % (err_num))
    print('Error Type: %s' % (err_class))
    print('Error Message: %s' % (err_msg))

# enable GDAL/OGR exceptions
gdal.UseExceptions()

To check the performance of the code, its running time is measured from this point:

In [188]:
# to measure time to run code
import time

# starting to measure running time
start_time = time.time()
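
As a complement to the single global timer above, individual steps can be timed separately. The following is a minimal sketch, not part of the original workflow: the helper `timed` and the dictionary `timings` are hypothetical names, and the summation only stands in for a real processing step.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}  # hypothetical container for per-step running times

with timed("demo_step", timings):
    # stand-in for a processing step such as buffering or rasterizing
    total = sum(i * i for i in range(100_000))

print(f"demo_step took {timings['demo_step']:.4f} s")
```

`time.perf_counter()` is used here because it is a monotonic clock intended for measuring short durations, whereas `time.time()` follows the system clock.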

## Input data and paths

First, the names of the input data and the paths to them are defined. The current working directory is detected automatically (`os.getcwd()`) to avoid hard-coded paths.
This block also looks for user-defined vector data to refine the raster data. If the user has not uploaded any, the raster data are refined with OpenStreetMap (OSM) data.
The following types of input data are considered:
1. Raster land-use/land-cover (LULC) data in TIFF format (Cloud Optimized GeoTIFF (COG) is preferable)
2. Raster impedance data (derived from LULC data) corresponding to each unique value of the LULC data and reflecting how unsuitable the landscape is for species to pass through
3. Vector data to enrich and refine LULC data (currently roads, railways, water bodies and waterways are considered; the GeoPackage format is supported)
4. Ancillary tabular data mapping LULC types to their specifications: (1) whether a given LULC type should be refined by vector data and (2) whether the negative "edge effect" of a given LULC type should be considered (for instance, roads reduce the suitability of the habitats alongside them for species)
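
The fallback from user-supplied vector data to OSM data can be sketched in isolation as follows. This is an illustration under assumptions rather than the workflow's own code: `choose_vector_file` is a hypothetical helper, a temporary directory stands in for the project folder, and only the file names `user_vector.gpkg` and `osm_merged.gpkg` come from the workflow.

```python
import tempfile
from pathlib import Path

def choose_vector_file(vector_dir, user_name="user_vector.gpkg",
                       fallback_name="osm_merged.gpkg"):
    """Return the user-supplied vector file if it exists, otherwise the OSM fallback."""
    user_path = Path(vector_dir) / user_name
    return user_path if user_path.exists() else Path(vector_dir) / fallback_name

with tempfile.TemporaryDirectory() as tmp:
    # pathlib joins path parts with the separator of the current operating system
    vector_dir = Path(tmp) / "data" / "input" / "vector"
    vector_dir.mkdir(parents=True)

    # no user file yet: the OSM file is selected
    chosen_without_user = choose_vector_file(vector_dir).name

    # once the user file exists, it takes precedence
    (vector_dir / "user_vector.gpkg").touch()
    chosen_with_user = choose_vector_file(vector_dir).name

print(chosen_without_user)
print(chosen_with_user)
```

Building paths from their parts, instead of writing backslashes into strings, keeps the same code usable on Windows, Linux and macOS.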

In [189]:
# specify parent and child directories of code/data
parent_dir = os.getcwd()
print (f"Parent directory: {parent_dir}")

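# build relative paths with os.path.join so that they work on any operating system
lulc_dir = os.path.join('data', 'input', 'lulc')
impedance_dir = os.path.join('data', 'input', 'impedance')
vector_dir = os.path.join('data', 'input', 'vector')

# specify output directory
output_dir = os.path.join('data', 'output')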

# create the output directory if it does not exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

# SPECIFY INPUT DATA
# specifying the file names
lulc = 'lulc_2022.tif'

# check if 'user_vector.gpkg' exists in the folder (uploaded directly by user)
user_vector = os.path.join(parent_dir, vector_dir, 'user_vector.gpkg')
if os.path.exists(user_vector):
    vector_refine = 'user_vector.gpkg'
else:
    vector_refine = 'osm_merged.gpkg'

# print the name of chosen vector file
print(f"Using vector file to refine raster data: {vector_refine}")

# specifying the path to these files through the path variables
lulc = os.path.join(parent_dir,lulc_dir,lulc)
vector_refine = os.path.join(parent_dir,vector_dir,vector_refine)

# specify ancillary csv data 
lulc_definition = os.path.join(parent_dir,lulc_dir,'lulc_definition.csv')
impedance = os.path.join(parent_dir,impedance_dir,'impedance.csv')

Parent directory: c:\Users\kriukovv\Documents\test_python
Using vector file to refine raster data: osm_merged.gpkg


The ancillary CSV tables are then loaded as dataframes (here through `gpd.read_file`):

In [190]:
# transforming csvs to dataframes
lulc_definition = gpd.read_file(lulc_definition)
impedance = gpd.read_file(impedance)

The merged GeoPackage combines all OSM data in a single file, so its features are split into separate GeoPackages according to the `layer_type` attribute:

In [191]:
'''# open geopackage file
with fiona.open(vector_refine) as geopackage:
    # extract unique values from the "layer_type" attribute
    unique_layer_types = set(feature['properties']['layer_type'] for feature in geopackage)
# use Fiona to find all layers in geopackage
available_layers = fiona.listlayers(vector_refine)

print("Available layers in input GeoPackage:")
for layer_name in available_layers:
    print(layer_name)

# TODO - to rewrite the following block to automatically create separate geopackages based on its "layer_type"
# specify the layer name we want to work with
vector_roads_name = 'gdf_roads_filtered'
vector_railways_name = 'gdf_railways_filtered'
vector_water_bodies_name = 'gdf_water_filtered'
vector_water_lines_name = 'gdf_water_lines_filtered'

# use Fiona again to open the GeoPackage file and access specific layers
with fiona.open(vector_refine) as geopackage:
    # access layers of vector file
    vector_roads = geopackage[vector_roads_name]
    vector_railways = geopackage[vector_railways_name]
    vector_water_bodies = geopackage[vector_water_bodies_name]
    vector_water_lines = geopackage[vector_water_lines_name]
'''

# to define functions to separate geopackages
# TODO - to add separation block based on layers in geopackage
def extract_geopackages(gpkg_path, output_folder, attribute_name):
    # Read the GeoPackage file
    gdf = gpd.read_file(gpkg_path)

    # Get unique values in the specified attribute
    unique_values = gdf[attribute_name].unique()

    # Create GeoPackages for each unique value
    for value in unique_values:
        subset_gdf = gdf[gdf[attribute_name] == value]
        output_gpkg = os.path.join(output_folder, f"{value}.gpkg")
        subset_gdf.to_file(output_gpkg, driver="GPKG")
        print(f"Extracted GeoPackage: {output_gpkg}")

# to define variables
geopackage_path = vector_refine
output_folder = output_dir
attribute_name = "layer_type"

# to call function
extract_geopackages(geopackage_path, output_folder, attribute_name)

# to define variables assigned to separate geopackages
vector_roads = os.path.join(output_dir, "gdf_roads_filtered.gpkg")
vector_railways = os.path.join(output_dir, "gdf_railways_filtered.gpkg")
vector_water_bodies = os.path.join(output_dir, "gdf_water_filtered.gpkg")
vector_water_lines = os.path.join(output_dir, "gdf_water_lines_filtered.gpkg")

Extracted GeoPackage: data\output\gdf_roads_filtered.gpkg
Extracted GeoPackage: data\output\gdf_railways_filtered.gpkg
Extracted GeoPackage: data\output\gdf_water_lines_filtered.gpkg
Extracted GeoPackage: data\output\gdf_water_filtered.gpkg


## Vector data

In our case, the vector data used to enrich the raster LULC data were derived from the open-access OpenStreetMap (OSM) portal through the Nominatim API.
This nested workflow is described here:
...


## Initial checks
### Validity of vector geometry

The validity of the vector geometries used to refine the raster LULC data must be checked first.

In [192]:
# open geopackage file
data_source = ogr.Open(vector_refine)

# get the number of layers in geopackage
num_layers = data_source.GetLayerCount()

# iterate through each layer
for i in range(num_layers):
    layer = data_source.GetLayerByIndex(i)

    # check the validity of all geometries in the layer (a missing geometry counts as invalid)
    all_geometries_valid = all(
        feature.GetGeometryRef() is not None and feature.GetGeometryRef().IsValid()
        for feature in layer)

    if all_geometries_valid:
        print("Good news! All vector geometries are valid and can be used to refine your data.")
    else:
        print("At least one vector geometry is invalid. The further executions might be complicated by invalid geometries.")

# close the geopackage file
data_source = None

Good news! All vector geometries are valid and can be used to refine your data.


### Checks on coordinate reference systems
Input raster data should have a projected (Cartesian) CRS, with coordinates in linear units such as metres, so that all distance-based computations are performed correctly.
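
To illustrate why this matters: in a geographic CRS the coordinates are degrees, and the ground length of one degree of longitude depends on latitude, so a distance expressed in map units has no fixed meaning in metres. The sketch below uses a spherical Earth with a mean radius of 6,371 km as a simplifying assumption; the function name and the sample latitudes are illustrative only.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius (spherical approximation, an assumption)

def metres_per_degree_longitude(latitude_deg):
    """Approximate ground length of one degree of longitude at the given latitude."""
    return math.pi / 180 * EARTH_RADIUS_M * math.cos(math.radians(latitude_deg))

# the same one-degree step covers less and less ground towards the poles
for lat in (0, 41, 60):
    km = metres_per_degree_longitude(lat) / 1000
    print(f"latitude {lat:>2} deg: one degree of longitude is about {km:.1f} km")
```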

In [193]:
# raise a warning if the CRS of the input raster data is not a projected (cartesian) one

is_cartesian = None  # stays None if the raster cannot be opened

# open input LULC file (with gdal.UseExceptions() a failed open raises RuntimeError)
try:
    dataset = gdal.Open(lulc)
except RuntimeError:
    dataset = None

if dataset:
    try:
        # get the projection information
        projection = dataset.GetProjection()
        srs = osr.SpatialReference()
        srs.ImportFromWkt(projection)

        # get the CRS information
        crs = srs.ExportToProj4()

        # check if CRS is cartesian (projected)
        is_cartesian = bool(srs.IsProjected())

    finally:
        dataset = None  # close the dataset to free resources
else:
    warning_message_2 = "Failed to open the raster dataset. Please check the path and format of the input raster."
    warnings.warn(warning_message_2, Warning)

# display a warning if the CRS is not cartesian (skipped if the raster could not be opened)
if is_cartesian is False:
    warning_message_3 = "The CRS is not the cartesian one. To exploit this workflow correctly, you should reproject it."
    warnings.warn(warning_message_3, Warning)
elif is_cartesian:
    print("Good news! The CRS of your input raster dataset is the cartesian one.")

Good news! The CRS of your input raster dataset is the cartesian one.


### Check on the consistency of spatial resolution

The X and Y spatial resolution of the input raster must match, i.e. its pixels must be square, because a single cell size is used in the following steps.
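
For reference, GDAL describes the georeferencing of a raster with a six-element geotransform: the pixel width is stored at index 1 and the pixel height at index 5 (negative for north-up rasters). The following sketch shows the check on a plain tuple, without opening any file; the geotransform values and the helper name `pixel_size` are made up for illustration.

```python
import math

def pixel_size(geotransform, rel_tol=1e-9):
    """Return the cell size of a raster, or raise ValueError if its pixels are not square."""
    x_res = abs(geotransform[1])  # pixel width
    y_res = abs(geotransform[5])  # pixel height (negative for north-up rasters)
    # compare with a tolerance, because resolutions are floating-point numbers
    if not math.isclose(x_res, y_res, rel_tol=rel_tol):
        raise ValueError(f"Inconsistent resolution: x={x_res}, y={y_res}")
    return x_res

# (x origin, pixel width, row rotation, y origin, column rotation, pixel height)
example_gt = (250000.0, 30.0, 0.0, 4750000.0, 0.0, -30.0)  # hypothetical values
cellsize_example = pixel_size(example_gt)
print(cellsize_example)
```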


In [194]:
# retrieve raster resolution - cellsize (pixel width from the geotransform, in CRS units)
# note: RasterXSize/RasterYSize are the numbers of columns/rows, not the resolution
src = gdal.Open(lulc)
xres = src.GetGeoTransform()[1]
yres = src.GetGeoTransform()[5]
cellsize = abs(xres)

# define function to raise a warning if there is a mismatch between x and y resolution
def check_res(raster):
    raster_geotransform = raster.GetGeoTransform()
    xres = raster_geotransform[1]
    yres = raster_geotransform[5]
    # compare absolute values, because the y value is represented in negative coordinates
    if abs(xres) != abs(yres):
        print("x:", xres, "y:", yres)
        warning_message = "Spatial resolution (x and y values) of input raster is inconsistent"
        warnings.warn(warning_message, Warning)
    else:
        print("Good news! The spatial resolution of your raster data is consistent between X and Y.")

# run function
check_res(src)

Good news! The spatial resolution of your raster data is consistent between X and Y.


## 1. Processing roads

### 1.1. Buffering roads
Roads are buffered by their width, which is recorded as an attribute (`width_num`). This step matters most for wide primary roads, which can span a significant number of pixels.
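
A rough back-of-the-envelope estimate shows how many pixels a buffered road spans. The sketch assumes that the buffer distance is applied on both sides of the road centre line, so the buffered corridor is twice the buffer distance wide. The widths and the 10 m cell size are hypothetical values, and `pixels_across` is an illustrative helper.

```python
import math

def pixels_across(buffer_distance, cellsize):
    """Approximate number of pixels spanned across a line buffered on both sides."""
    corridor_width = 2 * buffer_distance  # the buffer extends to each side of the line
    return math.ceil(corridor_width / cellsize)

# hypothetical buffer distances (metres) on a hypothetical 10 m raster
for distance in (3, 10, 30):
    print(f"buffer {distance} m -> about {pixels_across(distance, cellsize=10)} pixel(s) across")
```

The actual count after rasterization can be one pixel larger, depending on how the corridor aligns with the grid and on whether all touched pixels are burned.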

In [197]:
vector_ds = ogr.Open(vector_roads)

if vector_ds is None:
    # raise instead of calling exit(), which would stop the notebook kernel
    raise RuntimeError(f"Failed to open the vector file: {vector_roads}")

# Print basic information about the layers
print(f"Number of Layers in the Dataset: {vector_ds.GetLayerCount()}")

for i in range(vector_ds.GetLayerCount()):
    layer = vector_ds.GetLayerByIndex(i)
    print(f"\nLayer {i + 1} Information:")
    print(f"Name: {layer.GetName()}")
    print(f"Geometry Type: {ogr.GeometryTypeToName(layer.GetGeomType())}")
    print(f"Number of Features: {layer.GetFeatureCount()}")
    print(f"Extent: {layer.GetExtent()}")

# Close the dataset
vector_ds = None


# define function to create buffers
def createBuffer(inputfn, outputBufferfn, file_format, layerName, distance_field):

    inputds = ogr.Open(inputfn)
    inputlyr = inputds.GetLayer(layerName)
    if inputlyr is None:
        raise ValueError(f"Layer '{layerName}' not found in {inputfn}")

    # shpdriver = ogr.GetDriverByName('ESRI Shapefile')
    file_driver = ogr.GetDriverByName(file_format)
    if os.path.exists(outputBufferfn):
        file_driver.DeleteDataSource(outputBufferfn)
    outputBufferds = file_driver.CreateDataSource(outputBufferfn)
    # name the output layer after the output file and keep the CRS of the input layer
    bufferlyr_name = os.path.splitext(os.path.basename(outputBufferfn))[0]
    bufferlyr = outputBufferds.CreateLayer(bufferlyr_name, srs=inputlyr.GetSpatialRef(), geom_type=ogr.wkbPolygon)
    featureDefn = bufferlyr.GetLayerDefn()

    for feature in inputlyr:
        ingeom = feature.GetGeometryRef()
        # read the buffer distance of each feature from its attribute field
        buffer_distance = feature.GetField(distance_field)
        if ingeom is None or buffer_distance is None:
            continue  # skip features without geometry or without recorded width
        geomBuffer = ingeom.Buffer(float(buffer_distance))

        outFeature = ogr.Feature(featureDefn)
        outFeature.SetGeometry(geomBuffer)
        bufferlyr.CreateFeature(outFeature)

    # close the data sources to flush the output to disk
    outputBufferds = None
    inputds = None

#write output to file rather than to memory - deal with this later if memory required.
vector_buffered = 'vector_buffered.gpkg'
vector_buffered = os.path.join(parent_dir,output_dir,vector_buffered)

# the extracted roads GeoPackage holds a single layer named 'gdf_roads_filtered'
createBuffer(vector_roads, vector_buffered, 'GPKG', 'gdf_roads_filtered', 'width_num')

print("Buffered vector saved to: ", vector_buffered)

Number of Layers in the Dataset: 1

Layer 1 Information:
Name: gdf_roads_filtered
Geometry Type: Line String
Number of Features: 103786
Extent: (0.0410692, 3.2779618, 40.501556, 42.9278081)

### 1.2. Rasterizing roads

In [None]:
def Rasterize_roads(vector_buffered, vrt_roads, cellsize, field_name=None, NoData_value=-9999):
    # input: read the extent of the buffered layer
    inp_source = ogr.Open(vector_buffered)
    inp_lyr = inp_source.GetLayer(0)
    x_min, x_max, y_min, y_max = inp_lyr.GetExtent()
    inp_source = None

    # output: remove the previous result if it exists
    if os.path.exists(vrt_roads):
        os.remove(vrt_roads)

    # options shared by both rasterization modes
    # (GeoTIFF output is an assumption; the output grid is derived from bounds and cell size)
    options = dict(
        format='GTiff',
        xRes=cellsize, yRes=cellsize,
        outputBounds=[x_min, y_min, x_max, y_max],
        noData=NoData_value,
        outputType=gdal.GDT_Int32,
        allTouched=True)

    if field_name:
        # this will rasterize the vector file according to the specified attribute field
        options['attribute'] = field_name
    else:
        # this will burn 255 wherever there are vector data since no attribute is defined
        options['burnValues'] = [255]

    rasDs = gdal.Rasterize(vrt_roads, vector_buffered, **options)

    # close the data source to write the raster to disk
    rasDs = None

    # return
    return vrt_roads

vrt_roads = os.path.join(parent_dir,output_dir,'vrt_roads.tif')
# input parameter 'vector_buffered' has been already defined
# the buffered layer carries no attribute fields, so a constant value is burned
Rasterize_roads(vector_buffered, vrt_roads, cellsize, field_name=None, NoData_value=-9999)

print("Rasterized roads saved to: ", vrt_roads)