# Preprocessing land-use/land-cover (LULC) data to enrich and refine them by vector data

## Environment and dependencies

This preprocessing workflow requires several packages to be installed before most of the processing commands can run. An Anaconda environment has been used to ensure consistent and seamless installation of the libraries. To get compatible versions, it is recommended to install geopandas and pandas together through the Anaconda Prompt:
```
conda install -c conda-forge geopandas pandas
```
Other libraries may be installed through simple commands in your Anaconda Prompt:
```
conda install fiona
conda install gdal
```
The following package is currently not used in the preprocessing workflow, but might be useful in the future:
```
conda install qgis --channel conda-forge
```
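
Before running the notebook, it can help to confirm that the packages above are importable in the active environment. The sketch below is one way to do this with the standard library only. The helper name `missing_modules` and the list of module names are illustrative assumptions, not part of the workflow.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

# top-level import names used in this notebook (GDAL is imported as 'osgeo')
required = ["numpy", "geopandas", "pygeoprocessing", "osgeo"]
missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

The check only looks the modules up and does not import them, so it stays fast even for heavy libraries.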

Let's import all dependencies required:

In [1]:
import os
# os.environ['USE_PATH_FOR_GDAL_PYTHON'] = 'YES' #to import gdal

import numpy as np
import numpy.ma as ma
import warnings
import geopandas as gpd
import subprocess
import pygeoprocessing as pg

# TODO - delete unused libraries

# import processing if needed (currently not required)
# from qgis.core import QgsVectorLayer
# from qgis.core import QgsProject
# from qgis.core import QgsProcessingUtils
# from qgis.core import QgsGeometryChecker

Since the GDAL installation is a common source of issues, its import is wrapped in a separate check that stops with a clear message if the bindings cannot be found:

In [2]:
# importing GDAL (exit with a clear message if the bindings are missing)
try:
    from osgeo import ogr, osr, gdal
except ImportError:
    import sys
    sys.exit('ERROR: cannot find GDAL/OGR modules')

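It is recommended to define a GDAL error handler function and to enable GDAL exceptions. Note that the handler below is only defined; it takes effect once it is registered, for instance with `gdal.PushErrorHandler(gdal_error_handler)`: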

In [3]:
# specify GDAL error handler function
def gdal_error_handler(err_class, err_num, err_msg):
    errtype = {
        gdal.CE_None: 'None',
        gdal.CE_Debug: 'Debug',
        gdal.CE_Warning: 'Warning',
        gdal.CE_Failure: 'Failure',
        gdal.CE_Fatal: 'Fatal'
    }
    err_msg = err_msg.replace('\n', ' ')
    err_class = errtype.get(err_class, 'None')
    print('Error Number: %s' % (err_num))
    print('Error Type: %s' % (err_class))
    print('Error Message: %s' % (err_msg))

# enable GDAL/OGR exceptions
gdal.UseExceptions()

To check the performance of the code, a timer is started here so that the total running time can be measured:

In [4]:
# to measure time to run code
import time

# starting to measure running time
start_time = time.time()
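
The timer only becomes useful once the elapsed time is reported at the end of the run. A minimal sketch of that reporting step follows, assuming the same `time.time()` approach; the helper `format_elapsed` and the variable `t0` are hypothetical names introduced for illustration.

```python
import time

def format_elapsed(seconds):
    """Format a duration in seconds as hours, minutes and seconds."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(int(minutes), 60)
    return f"{hours}h {minutes}m {secs:.1f}s"

t0 = time.time()
# ... processing steps would run here ...
elapsed = time.time() - t0
print("Running time:", format_elapsed(elapsed))
```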

## Input data and paths

First, define the names of the input data and the paths to them. The current working directory is detected automatically (`os.getcwd()`) to avoid a hard-coded path.
This block also searches for user-defined vector data to refine the raster data. If the user has not uploaded any data, the raster is refined with OpenStreetMap (OSM) data.
The following types of input data are considered:
1. Raster land-use/land-cover (LULC) data in GeoTIFF format. Cloud Optimized GeoTIFF (COG) is preferable; COG with LZW compression is used to optimise data storage.
2. Raster impedance data (derived from LULC data), corresponding to each unique value of the LULC data and reflecting the relative unsuitability of the landscape for species to pass through.
3. Vector data, derived from OSM, to enrich and refine the LULC data. Currently roads, railways, water bodies and waterways are considered, and the GeoPackage format is supported.
4. Ancillary tabular data mapping LULC types to their specifications: (1) whether a given LULC type should be refined by vector data and (2) whether the negative "edge effect" of a given LULC type should be considered (for instance, roads affect the suitability of habitats alongside them).

In [5]:
# specify parent and child directories of code/data
parent_dir = os.getcwd()
print (f"Parent directory: {parent_dir}")

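# build relative paths with os.path.join so that they work on any operating system
lulc_dir = os.path.join('data', 'input', 'lulc')
impedance_dir = os.path.join('data', 'input', 'impedance')
vector_dir = os.path.join('data', 'input', 'vector')

# specify output directory
output_dir = os.path.join('data', 'output')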

# create the output directory if it does not exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

# SPECIFY INPUT DATA
# specifying the file names
lulc = 'lulc_2022.tif'

# check if 'user_vector.gpkg' exists in the folder (uploaded directly by user)
user_vector = os.path.join(parent_dir, vector_dir, 'user_vector.gpkg')
if os.path.exists(user_vector):
    vector_refine = 'user_vector.gpkg'
else:
    vector_refine = 'osm_merged.gpkg'

# print the name of chosen vector file
print(f"Using vector file to refine raster data: {vector_refine}")

# specifying the path to these files through the path variables
lulc = os.path.join(parent_dir,lulc_dir,lulc)
vector_refine = os.path.join(parent_dir,vector_dir,vector_refine)

# specify ancillary csv data
impedance = os.path.join(parent_dir,impedance_dir,'reclassification.csv') # impedance of short-listed LULC data (7 types)
impedance_ext = os.path.join(parent_dir,impedance_dir,'reclassification_ext.csv') # impedance of extended LULC data (25 types)

Parent directory: c:\Users\kriukovv\Documents\pilot_2\preprocessing
Using vector file to refine raster data: osm_merged.gpkg


Next, the CSV tables are read into dataframes, and the LULC codes of roads, railways, built-up areas and inland water are looked up by name:

In [6]:
# transforming csvs to dataframes
impedance = gpd.read_file(impedance)
impedance_ext = gpd.read_file(impedance_ext)

# to find impedance values matching with built-up areas of human impact on habitats (to be used in ... TODO) and inland water
lulc_urban = impedance_ext.loc[impedance_ext['type'].str.contains('urban|built|build|resident|industr|commerc', case = False),'lulc'].iloc[0]

lulc_road = impedance_ext.loc[impedance_ext['type'].str.contains('road', case = False),'lulc']
if not lulc_road.empty:
    lulc_road = lulc_road.iloc[0]  # choose the first matching value if roads are found
else:
    # no fallback is defined for roads: stop early instead of passing an empty Series on
    raise ValueError("No LULC type matching 'road' was found in the reclassification table.")

lulc_railway = impedance_ext.loc[impedance_ext['type'].str.contains('rail|train', case = False),'lulc']
if not lulc_railway.empty:
    lulc_railway = lulc_railway.iloc[0]  # choose the first matching value if railways are found
else:
    lulc_railway = lulc_road   # railways are unlikely specified as a separate LULC code so it is the same

# use names of water LULC codes from extended LULC classification
lulc_water = impedance_ext.loc[impedance_ext['type'].str.contains('continental water|inland water', case=False), 'lulc']
if not lulc_water.empty:
    lulc_water = lulc_water.iloc[0]
else:
    # if no matches found, resort to other names (short LULC classification)
    lulc_water = impedance_ext.loc[impedance_ext['type'].str.contains('water|aqua|river', case=False), 'lulc'].iloc[0]

print("LULC code of roads:", lulc_road,"\n","LULC code of railways:", lulc_railway,"\n","LULC code of urbanized areas:", lulc_urban,"\n","LULC code of inland waters:", lulc_water)

# TODO - create a dictionary with impedance values?

LULC code of roads: 4 
 LULC code of railways: 4 
 LULC code of urbanized areas: 5 
 LULC code of inland waters: 1
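
The lookup pattern used above (take the first LULC code whose type name matches a pattern, otherwise fall back to another value) can be illustrated without pandas. The sketch below uses the standard `csv` and `re` modules. Only the column names `lulc` and `type` come from the notebook; the table contents and the helper `first_matching_code` are hypothetical.

```python
import csv
import io
import re

def first_matching_code(rows, pattern, fallback=None):
    """Return the 'lulc' code of the first row whose 'type' matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    for row in rows:
        if regex.search(row["type"]):
            return row["lulc"]
    return fallback

# hypothetical table contents, for illustration only
sample_csv = """lulc,type
1,Inland water
4,Roads
5,Urban areas
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))

road = first_matching_code(rows, "road")
# railways rarely have their own LULC code, so fall back to the road code
railway = first_matching_code(rows, "rail|train", fallback=road)
water = first_matching_code(rows, "continental water|inland water")
print(road, railway, water)
```

Note that codes read this way are strings; they would need converting with `int()` before any numeric comparison.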


To work with the single GeoPackage file that combines all OSM data, its layers have to be extracted into separate files:

In [7]:
'''# open geopackage file
with fiona.open(vector_refine) as geopackage:
    # extract unique values from the "layer_type" attribute
    unique_layer_types = set(feature['properties']['layer_type'] for feature in geopackage)
# use Fiona to find all layers in geopackage
available_layers = fiona.listlayers(vector_refine)

print("Available layers in input GeoPackage:")
for layer_name in available_layers:
    print(layer_name)

# TODO - to rewrite the following block to automatically create separate geopackages based on its "layer_type"
# specify the layer name we want to work with
vector_roads_name = 'gdf_roads_filtered'
vector_railways_name = 'gdf_railways_filtered'
vector_water_bodies_name = 'gdf_water_filtered'
vector_water_lines_name = 'gdf_water_lines_filtered'

# use Fiona again to open the GeoPackage file and access specific layers
with fiona.open(vector_refine) as geopackage:
    # access layers of vector file
    vector_roads = geopackage[vector_roads_name]
    vector_railways = geopackage[vector_railways_name]
    vector_water_bodies = geopackage[vector_water_bodies_name]
    vector_water_lines = geopackage[vector_water_lines_name]
'''

# to define functions to separate geopackages
# TODO - to add separation block based on layers in geopackage
def extract_geopackages(gpkg_path, output_folder, attribute_name):
    # Read the GeoPackage file
    gdf = gpd.read_file(gpkg_path)

    # Get unique values in the specified attribute
    unique_values = gdf[attribute_name].unique()

    # Create GeoPackages for each unique value
    for value in unique_values:
        subset_gdf = gdf[gdf[attribute_name] == value]
        output_gpkg = os.path.join(output_folder, f"{value}.gpkg")  # avoids the invalid '\{' escape sequence
        subset_gdf.to_file(output_gpkg, driver="GPKG")
        print(f"Extracted GeoPackage: {output_gpkg}")

# to define variables
geopackage_path = vector_refine
output_folder = output_dir
attribute_name = "layer_type"

'''
# to call function
extract_geopackages(geopackage_path, output_folder, attribute_name)
'''

# to define variables assigned to separate geopackages
vector_roads = os.path.join(vector_dir, "roads.gpkg")
vector_railways = os.path.join(vector_dir, "railways.gpkg")
vector_water_bodies = os.path.join(vector_dir, "water_bodies.gpkg")
vector_water_lines = os.path.join(vector_dir, "water_lines.gpkg")



## Vector data

In our case, the vector data used to enrich the raster LULC data have been derived from the open-access OpenStreetMap (OSM) portal through the nested [Nominatim API](./nominatim_api.py). Currently, OSM data are exported as separate GeoPackage files (roads, railways, water bodies and water lines) and as one merged GeoPackage file.

## Initial checks
### Validity of vector geometry

The validity of the vector geometries used to refine the raster LULC data has to be checked:

In [8]:
# open geopackage file
data_source = ogr.Open(vector_refine)

# get the number of layers in geopackage
num_layers = data_source.GetLayerCount()

# iterate through each layer
for i in range(num_layers):
    layer = data_source.GetLayerByIndex(i)

    # check the validity of all geometries in the layer (features without geometry count as invalid)
    all_geometries_valid = all(
        feature.GetGeometryRef() is not None and feature.GetGeometryRef().IsValid()
        for feature in layer
    )

    if all_geometries_valid:
        print("Good news! All vector geometries are valid and can be used to refine your data.")
    else:
        print("At least one vector geometry is invalid. The further executions might be complicated by invalid geometries.")

# close the geopackage file
data_source = None

Good news! All vector geometries are valid and can be used to refine your data.


### Checks on coordinate reference systems
The input raster data should have a projected (Cartesian) CRS so that all computations are performed correctly.

In [9]:
# warn if the CRS of the input raster data is not a cartesian (projected) one

is_cartesian = None  # stays None if the raster cannot be opened

# open input LULC file (gdal.Open raises RuntimeError on failure because exceptions are enabled)
try:
    dataset = gdal.Open(lulc)
except RuntimeError:
    dataset = None

if dataset:
    try:
        # get the projection information
        projection = dataset.GetProjection()
        srs = osr.SpatialReference()
        srs.ImportFromWkt(projection)

        # get the CRS information
        crs = srs.ExportToProj4()

        # check if CRS is cartesian
        is_cartesian = bool(srs.IsProjected())

    finally:
        dataset = None  # close the dataset to free resources
else:
    warning_message_2 = "Failed to open the raster dataset. Please check the path and format of the input raster."
    warnings.warn(warning_message_2, Warning)

# display a warning if the CRS is not cartesian
if is_cartesian is None:
    print("The CRS could not be checked because the raster dataset was not opened.")
elif not is_cartesian:
    warning_message_3 = "The CRS is not a cartesian one. To use this workflow correctly, you should reproject the raster."
    warnings.warn(warning_message_3, Warning)
else:
    print("Good news! The CRS of your input raster dataset is the cartesian one.")

Good news! The CRS of your input raster dataset is the cartesian one.


### Check on the consistency of spatial resolution

The X spatial resolution of the input raster has to match its Y spatial resolution.

In [10]:
# retrieve raster resolution - cellsize
inp_source = gdal.Open(lulc)
geotransform = inp_source.GetGeoTransform()

# define function to raise a warning if x and y resolution do not match
def check_res(raster):
    raster_geotransform = raster.GetGeoTransform()
    xres = raster_geotransform[1]
    yres = raster_geotransform[5]
    # compare absolute values, because the y resolution is negative for north-up rasters
    if abs(xres) != abs(yres):
        print("x:", xres, "y:", yres)
        warning_message = "Spatial resolution (x and y values) of input raster is inconsistent"
        warnings.warn(warning_message, Warning)
    else:
        print("Good news! The spatial resolution of your raster data is consistent between X and Y.")
    return xres, yres
    
# run function and capture the resolution values
xres, yres = check_res(inp_source)
cell_size = abs(xres)

# fetch max/min coordinates to use them later
x_min = geotransform[0]
y_max = geotransform[3]
x_max = x_min + geotransform[1] * inp_source.RasterXSize
y_min = y_max + geotransform[5] * inp_source.RasterYSize

print (f"Spatial resolution (pixel size) is {cell_size} meters")
print (f"x min coordinate is {x_min}")
print (f"y max coordinate is {y_max}")
print (f"x max coordinate is {x_max}")
print (f"y min coordinate is {y_min}")

'''REDUNDANT BLOCK
# get the resolution and extent of LULC raster
lulc_info = gdal.Info(lulc, options=['-json'])
lulc_pixel_size = lulc_info['geoTransform'][1]  # pixel size
lulc_width = lulc_info['size'][0]  # width
lulc_height = lulc_info['size'][1]  # height
lulc_extent = lulc_info['cornerCoordinates']
'''

Good news! The spatial resolution of your raster data is consistent between X and Y.
Spatial resolution (pixel size) is 30.0 meters
x min coordinate is 230205.0
y max coordinate is 4777335.0
x max coordinate is 556485.0
y min coordinate is 4459725.0
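
The extent arithmetic in this cell follows directly from the six-element GDAL geotransform. The sketch below reproduces it in plain Python, assuming a north-up raster without rotation. The function names are hypothetical, and the raster size (10876 x 10587 pixels) is not printed in the notebook; it is derived here from the printed extent and the 30 m pixel size.

```python
import math

def extent_from_geotransform(gt, n_cols, n_rows):
    """Return (x_min, y_min, x_max, y_max) for a north-up raster."""
    x_min = gt[0]
    y_max = gt[3]
    x_max = x_min + gt[1] * n_cols
    y_min = y_max + gt[5] * n_rows  # gt[5] is negative for north-up rasters
    return x_min, y_min, x_max, y_max

def is_square_pixel(gt, rel_tol=1e-9):
    """Compare pixel width and height with a tolerance instead of exact equality."""
    return math.isclose(abs(gt[1]), abs(gt[5]), rel_tol=rel_tol)

# geotransform consistent with the values printed above
gt = (230205.0, 30.0, 0.0, 4777335.0, 0.0, -30.0)
n_cols, n_rows = 10876, 10587  # derived from the printed extent, not read from the file

print(extent_from_geotransform(gt, n_cols, n_rows))
print(is_square_pixel(gt))
```

Comparing with a tolerance is a design choice: resolutions stored as floating-point numbers can differ in the last digits even when the pixels are square in practice.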


## 1. Processing roads

#### 1.1. Buffering roads
Let's buffer roads by their width, which is recorded as an attribute. This step matters most for wide primary roads, which can span several pixels in width.

In [11]:
'''
# 1 option - OGR buffering - doesn't allow copying attributes by now
vector_ds = ogr.Open(vector_roads)

if vector_ds is None:
    print(f"Failed to open the vector file: {vector_roads}")
    exit()

# Print basic information about the layers
print(f"Number of Layers in the Dataset: {vector_ds.GetLayerCount()}")

for i in range(vector_ds.GetLayerCount()):
    layer = vector_ds.GetLayerByIndex(i)
    print(f"\nLayer {i + 1} Information:")
    print(f"Name: {layer.GetName()}")
    print(f"Geometry Type: {ogr.GeometryTypeToName(layer.GetGeomType())}")
    print(f"Number of Features: {layer.GetFeatureCount()}")
    print(f"Extent: {layer.GetExtent()}")

# Close the dataset
vector_ds = None

# define function to create buffers
def createBuffer(inputfn, outputBufferfn, file_format, layerName, distance_field):
    inputds = ogr.Open(inputfn)
    inputlyr = inputds.GetLayer(layerName)
    # check if dataset was opened successfully
    if inputds is None:
        print(f"Failed to open the vector file: {vector_roads}")
        exit()

    # shpdriver = ogr.GetDriverByName('ESRI Shapefile')
    file_driver = ogr.GetDriverByName(file_format)
    if os.path.exists(outputBufferfn):
        file_driver.DeleteDataSource(outputBufferfn)
        
    outputBufferds = file_driver.CreateDataSource(outputBufferfn)

    # define CRS
    output_crs = ogr.osr.SpatialReference()
    output_crs.ImportFromEPSG(25831)

    bufferlyr = outputBufferds.CreateLayer(outputBufferfn, geom_type=ogr.wkbPolygon, srs=output_crs, options=["OVERWRITE=YES"])
    featureDefn = bufferlyr.GetLayerDefn()
    
    #for i in range(featureDefn.GetFieldCount()):
        #fieldDefn = featureDefn.GetFieldDefn(i)
        #bufferlyr.CreateField(fieldDefn)
        
    for feature in inputlyr:
        ingeom = feature.GetGeometryRef()
        # distance_field_value = inputlyr.GetLayerDefn(distance_field)
        buffer_distance = feature.GetField(distance_field)
        geomBuffer = ingeom.Buffer(buffer_distance)  # buffer_distance to get the attribute value

        outFeature = ogr.Feature(featureDefn)
        outFeature.SetGeometry(geomBuffer)
        bufferlyr.CreateFeature(outFeature)
        outFeature = None

        #
        # Copy attributes from input feature to output feature
        #for i in range(featureDefn.GetFieldCount()):
            #field_name = featureDefn.GetFieldDefn(i).GetName()
            #field_value = feature.GetField(i)
            #outFeature.SetField(field_name, field_value)
        
# write output to file rather than to memory - deal with this later if memory required.
vector_roads_buffered = 'vector_roads_buffered.gpkg'
vector_roads_buffered = os.path.join(parent_dir,output_dir,vector_roads_buffered)
# check if the file exists, if it does, delete it
if os.path.exists(vector_roads_buffered):
    os.remove(vector_roads_buffered)

createBuffer(vector_roads, vector_roads_buffered, 'GPKG', "roads", 'width')

print("Buffered vector saved to: ", vector_roads_buffered)
'''


In [12]:
# 2nd option - ogr2ogr command line script - currently stable solution

# write output to file rather than to memory - deal with this later if memory required.
vector_roads = os.path.join(parent_dir,vector_dir, "roads.gpkg")
vector_roads_buffered = 'vector_roads_buffered.gpkg'
vector_roads_buffered = os.path.join(parent_dir,output_dir,vector_roads_buffered)

# other vector data from OSM
vector_waterbodies = os.path.join(parent_dir,vector_dir, "water_bodies.gpkg")
vector_waterways = os.path.join(parent_dir,vector_dir, "water_lines.gpkg")

# check if the file exists, if it does, delete it
if os.path.exists(vector_roads_buffered):
    os.remove(vector_roads_buffered)

# define the ogr2ogr command as a list of arguments
ogr2ogr_buffer_roads = [
    'ogr2ogr',
    '-f', 'GPKG',
    '-dialect', 'SQLite',
    '-sql', "SELECT ST_Buffer(geom, CASE WHEN width IS NULL THEN CASE WHEN highway IN ('motorway', 'motorway_link', 'trunk', 'trunk_link') THEN 30/2 WHEN highway IN ('primary', 'primary_link', 'secondary', 'secondary_link') THEN 20/2 ELSE 10/2 END ELSE width/2 END) AS geometry, * FROM roads",
    vector_roads_buffered,
    vector_roads,
    '-nlt', 'POLYGON',
    '-nln', 'roads'
]

# TODO - specify condition to replace separately null values of width

# execute ogr2ogr command
try:
    subprocess.run(ogr2ogr_buffer_roads, check=True)
    print("Roads buffering has been successfully completed.")
except subprocess.CalledProcessError as e:
    print(f"Error buffering roads: {e}")

Roads buffering has been successfully completed.
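
The `CASE` expression in the SQL statement above encodes a simple rule: use half of the recorded width, and if the width is missing, use half of a default width that depends on the road class. A plain-Python mirror of that rule is sketched below for readability. The function name `road_buffer_distance` is hypothetical; the default widths (30, 20 and 10) are the ones used in the query.

```python
def road_buffer_distance(width, highway):
    """Return the one-sided buffer distance for a road, mirroring the SQL CASE rule."""
    if width is not None:
        return width / 2  # half of the recorded width
    if highway in ("motorway", "motorway_link", "trunk", "trunk_link"):
        return 30 / 2
    if highway in ("primary", "primary_link", "secondary", "secondary_link"):
        return 20 / 2
    return 10 / 2  # default for all other road classes

print(road_buffer_distance(None, "motorway"))
print(road_buffer_distance(None, "residential"))
print(road_buffer_distance(8, "primary"))
```

One difference is worth keeping in mind: in SQLite, dividing an integer by an integer truncates the result, so if `width` is stored as an integer column, `width/2` for a width of 7 gives 3 there, while the Python version gives 3.5.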


In [13]:
# buffering railways
# TODO - merge as one function with roads?

# write output to file rather than to memory - deal with this later if memory required.
vector_railways = os.path.join(parent_dir,vector_dir, "railways.gpkg")
vector_railways_buffered = 'vector_railways_buffered.gpkg'
vector_railways_buffered = os.path.join(parent_dir,output_dir,vector_railways_buffered)
# check if the file exists, if it does, delete it
if os.path.exists(vector_railways_buffered):
    os.remove(vector_railways_buffered)

# define the ogr2ogr command as a list of arguments
## width of railways is not directly specified in OSM, only as 'gauge' key (track width), but it might be classified as text_value (gauge=broad or gauge = 1000;2000)
## TODO - to implement width processing rule
ogr2ogr_buffer_railways = [
    'ogr2ogr',
    '-f', 'GPKG',
    '-dialect', 'SQLite',
    '-sql', "SELECT ST_Buffer(geom, 10/2) AS geometry, * FROM railways", # divide by 2 as buffer value is a value to be covered to one side from spatial feature
    vector_railways_buffered,
    vector_railways,
    '-nlt', 'POLYGON',
    '-nln', 'railways'
]

# execute ogr2ogr command
try:
    subprocess.run(ogr2ogr_buffer_railways, check=True)
    print("Railways buffering has been successfully completed.")
except subprocess.CalledProcessError as e:
    print(f"Error buffering railways: {e}")

    

Railways buffering has been successfully completed.
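
The TODO in the cell above raises the idea of handling roads and railways with one function. A possible shape for that is sketched below: a helper that only builds the `ogr2ogr` argument list, so it can be inspected before being passed to `subprocess.run`. The function name `build_buffer_command` and its parameters are assumptions for illustration; the flags are the same ones already used in this notebook.

```python
def build_buffer_command(src_path, dst_path, layer, distance_sql, out_layer=None):
    """Build an ogr2ogr argument list that buffers one layer with the SQLite dialect."""
    sql = f"SELECT ST_Buffer(geom, {distance_sql}) AS geometry, * FROM {layer}"
    return [
        "ogr2ogr",
        "-f", "GPKG",
        "-dialect", "SQLite",
        "-sql", sql,
        dst_path,   # output file comes first
        src_path,   # input file comes second
        "-nlt", "POLYGON",
        "-nln", out_layer or layer,  # name of the output layer
    ]

cmd = build_buffer_command(
    "railways.gpkg", "vector_railways_buffered.gpkg", "railways", "10/2"
)
print(cmd)
# subprocess.run(cmd, check=True) would execute the command
```

Keeping command construction separate from execution makes it easy to reuse the same helper for roads by passing the longer `CASE` expression as `distance_sql`.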


#### 1.2. Merging buffered roads 
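Merging the buffers before rasterizing turned out to be redundant, because rasterizing is faster without initial merging. The block is kept below, disabled, for reference.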

In [14]:
# pygeo?
'''REVEALED TO BE REDUNDANT BLOCK (RASTERIZING IS FASTER WITHOUT INITIAL MERGING)
# define output file to save merged geometries
roads_buf_merged = 'roads_buf_merged.gpkg'
roads_buf_merged = os.path.join(parent_dir,output_dir,roads_buf_merged)

# define the ogr2ogr command as a list of arguments
ogr2ogr_merge = [
    'ogr2ogr',
    '-f', 'GPKG',
    '-dialect', 'SQLite',
    '-sql', "SELECT ST_Union(geometry) AS geometry, * FROM roads",
    roads_buf_merged,
    vector_roads_buffered,
    '-nln', 'roads'
]

# execute ogr2ogr command
try:
    subprocess.run(ogr2ogr_merge, check=True)
    print("Merging of buffers has been successfully completed.")
except subprocess.CalledProcessError as e:
    print(f"Error merging buffers: {e}")
'''



#### 1.3. Rasterizing roads

In [15]:
'''
# TODO - delete it later when rasterizing is troubleshooted
roads_buf_merged = 'roads_buf_merged.gpkg'
roads_buf_merged = os.path.join(parent_dir,output_dir,roads_buf_merged)

def Rasterize_roads(roads_buf_merged, vrt_roads, cellsize, field_name=True, NoData_value=-9999):
    # input
    inp_driver = ogr.GetDriverByName('GPKG')
    inp_source = inp_driver.Open(roads_buf_merged, 0)
    inp_lyr = inp_source.GetLayer(0)
    inp_srs = inp_lyr.GetSpatialRef()

    # extent
    x_min, x_max, y_min, y_max = inp_lyr.GetExtent()
    x_ncells = int((x_max - x_min) / cellsize)
    y_ncells = int((y_max - y_min) / cellsize)
    # flipping values
    # TODO - redefine cellsize
    # ulx, xres, xskew, uly, yskew, yres  = src.GetGeoTransform()
    print (cellsize)
    print (x_ncells,y_ncells)

    # output 
    out_driver = gdal.GetDriverByName('GTiff')
    if os.path.exists(vrt_roads):
        out_driver.Delete(vrt_roads)
    out_source = out_driver.Create(vrt_roads, x_ncells, y_ncells,1, gdal.GDT_Int16)
    out_source.SetGeoTransform((x_min, cellsize, 0, y_max, 0, -cellsize))
    out_source.SetProjection(inp_srs.ExportToWkt())
    out_lyr = out_source.GetRasterBand(1)
    out_lyr.SetNoDataValue(NoData_value)

    # output extent
    x_min_out, x_max_out = x_min, x_min + (x_ncells * cellsize)
    y_min_out, y_max_out = y_min, y_min + (y_ncells * cellsize)

    if field_name:
    # this will rasterize your shape file according to the specified attribute field
         rasDs = gdal.Rasterize(
               vrt_roads, roads_buf_merged,
               xRes=cellsize, yRes=cellsize,
               outputBounds=[x_min, y_min,x_max, y_max],
               noData=NoData_value,
               outputType=gdal.GDT_Int16,
               attribute='fid', # or whatever your attribute field name is
               allTouched=True)
    else:
    # this will just give burn-in value where there are vector data since no attribute is defined
        rasDs = gdal.Rasterize(
               vrt_roads, roads_buf_merged,
               xRes=cellsize, yRes=cellsize,
               outputBounds=[x_min, y_min,x_max, y_max],
               noData=NoData_value,
               burnValues=2, #to enrich roads, TODO - specify more generic flag from extended LULC map (25 types)
               outputType=gdal.GDT_Int16,
               allTouched=True) # to include pixels that are covered by roads even partly (by default, it must cover at least 50% of pixel area to be rasterized)
        
    rasDs = inp_source = None    
    
    # save and/or close the data sources
    inp_source = None
    out_source = None 

    # return
    return vrt_roads
    
vrt_roads =  os.path.join(parent_dir,output_dir,'vrt_roads.tif')
# input parameter 'vector_roads_buffered' has already been defined
roads_buf_merged = 'roads_buf_merged.gpkg'
roads_buf_merged = os.path.join(parent_dir,output_dir,roads_buf_merged)
# getting cellsize from lulc resolution
cellsize = xres
Rasterize_roads(roads_buf_merged, vrt_roads, cellsize, field_name=False, NoData_value=-9999)

print("Rasterized roads saved to: ", vrt_roads)
vrt_roads = None
'''


In [16]:
# command-line version of rasterization (gdal_rasterize called through subprocess)

def rasterize_vector(vector_path, output_path, nodata_value, burn_value):
    '''
    # input
    inp_driver = ogr.GetDriverByName('GPKG')
    inp_source = inp_driver.Open(vector_path, 0)
    inp_lyr = inp_source.GetLayer(0)
    inp_srs = inp_lyr.GetSpatialRef()

    # getting cellsize from lulc resolution
    cell_size = xres # is not a parameter of function because it must be the same as cell_size of LULC raster

    # input extent # TODO - must be specified from LULC not geopackages!
    x_min, x_max, y_min, y_max = inp_lyr.GetExtent()
    '''
    # define gdal_rasterize command
    gdal_rasterize_cmd = [
        'gdal_rasterize',
        #'-l', 'roads',  # TODO - to put layer name 
        '-tr', str(cell_size), str(cell_size),  # output raster pixel size
        '-te', str(x_min), str(y_min), str(x_max), str(y_max),  # output extent 
        '-a_nodata', str(nodata_value),  # no_data value
        '-ot', 'Int16',   # output raster data type,
        '-burn', str(burn_value),  # burn-in value
        '-at',  # all touched pixels are burned in
        vector_path,  # input vector file
        output_path  # output raster file
    ]

    # execute gdal_rasterize command through subprocess
    subprocess.run(gdal_rasterize_cmd, check=True)

    '''
    # resample output raster to match the resolution and size of LULC raster
    output_path_resampled = output_path.replace('.tif', '_resampled.tif')
    gdalwarp_cmd = [
        'gdalwarp',
        # '-tr', str(lulc_pixel_size), str(lulc_pixel_size),  # target resolution same as LULC raster
        '-te', str(x_min), str(y_min), str(x_max), str (y_max)  # target coordinates same as LULC raster
        '-r', 'near',  # resampling method (better use 'near' for categorical data)
        '-dstnodata', str(nodata_value),  # set nodata value
        output_path,  # input raster to be resampled
        output_path_resampled  # output path for resampled raster
    ]

    # execute gdalwarp command through subprocess
    subprocess.run(gdalwarp_cmd, check=True)
    '''
    
    # compress output 
    output_compressed = output_path.replace('.tif', '_compr.tif')
    gdal_translate_cmd = [
        'gdal_translate',
        output_path,
        output_compressed,
        '-co', 'COMPRESS=LZW',
        '-ot', 'Byte'
    ]
    # execute gdal_translate command through subprocess
    subprocess.run(gdal_translate_cmd, check=True)

    # rename compressed output to original
    os.remove(output_path)
    os.rename(output_compressed, output_path)

    print("Rasterized output saved to:", output_path)

# to resample rasters obtained by LULC

# specify rasterized outputs
vrt_roads = os.path.join(parent_dir,output_dir,'vrt_roads.tif')
vrt_railways = os.path.join(parent_dir,output_dir,'vrt_railways.tif')
vrt_waterbodies = os.path.join(parent_dir,output_dir,'vrt_waterbodies.tif')
vrt_waterways = os.path.join(parent_dir,output_dir,'vrt_waterways.tif')

# vrt_roads_compr = os.path.join(parent_dir,output_dir,'vrt_roads_compr.tif')

# rasterize roads and railways
rasterize_vector(vector_roads_buffered, vrt_roads, nodata_value=0, burn_value=lulc_road)
rasterize_vector(vector_railways_buffered, vrt_railways, nodata_value=0, burn_value=lulc_railway)
rasterize_vector(vector_waterbodies, vrt_waterbodies, nodata_value=0, burn_value=lulc_water)
rasterize_vector(vector_waterways, vrt_waterways, nodata_value=0, burn_value=lulc_water)

Rasterized output saved to: c:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\vrt_roads.tif
Rasterized output saved to: c:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\vrt_railways.tif
Rasterized output saved to: c:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\vrt_waterbodies.tif
Rasterized output saved to: c:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\vrt_waterways.tif
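
`rasterize_vector` converts its output to the `Byte` type, which stores values from 0 to 255, and the calls above use 0 as the nodata value. A burn value outside that range, or equal to the nodata value, would therefore not be usable in the final raster. A small pre-check along these lines is sketched below; the function name `check_burn_value` is hypothetical and not part of the workflow.

```python
def check_burn_value(burn_value, nodata_value=0):
    """Return the burn value as int if it fits into a Byte raster and is not nodata."""
    value = int(burn_value)  # codes read from a CSV table may arrive as strings
    if not 0 <= value <= 255:
        raise ValueError(f"Burn value {value} does not fit into the Byte range 0-255.")
    if value == nodata_value:
        raise ValueError(f"Burn value {value} equals the nodata value.")
    return value

print(check_burn_value("4"))
```

Such a check could be run on the road, railway and water codes before calling `rasterize_vector`.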


## 3. Merge raster files

All rasterized GeoTIFF files are combined with the original LULC raster into a single, updated LULC raster through the raster calculator.
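
The combination rule can be illustrated on a single row of pixels with plain Python lists. This is a sketch only, under two assumptions: the rasterized layers use 0 as nodata (the value passed to `rasterize_vector`), and the LULC nodata value is 255, which is a hypothetical value chosen for the example. The function name `overlay` is also hypothetical.

```python
def overlay(base, layers, base_nodata, layer_nodata=0):
    """Overwrite base values with layer values; later layers take priority."""
    result = list(base)
    for layer in layers:
        for i, value in enumerate(layer):
            # keep nodata pixels of the base raster, skip empty pixels of the layer
            if base[i] != base_nodata and value != layer_nodata:
                result[i] = value
    return result

lulc_row = [255, 2, 2, 3, 3]   # 255 is a hypothetical LULC nodata value
water_row = [1, 1, 0, 0, 0]    # 1 = inland water, 0 = no water feature
road_row = [0, 4, 4, 0, 0]     # 4 = roads, 0 = no road feature

print(overlay(lulc_row, [water_row, road_row], base_nodata=255))
```

The first pixel stays nodata because it lies outside the LULC raster, and the second pixel ends up as a road because roads are applied after water.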

In [17]:
# TODO - perform merging of rasterized files before, it might be easier for raster calculation!

# raster calculator (pygeoprocessing)

lulc_upd = 'lulc_2022_upd.tif'
lulc_upd = os.path.join(parent_dir,output_dir,lulc_upd)
print (lulc_upd)

# TODO - rio-calc, rasterstats, pygeoprocessing, pktools

# list of input raster paths and bands
listraster_uri = [
    (lulc, 1),
    (vrt_waterbodies, 1),
    (vrt_waterways, 1),
    (vrt_railways, 1),
    (vrt_roads, 1)
]

# function to get raster dimensions
def get_raster_dimensions(raster_path):
    dataset = gdal.Open(raster_path)
    if dataset:
        width = dataset.RasterXSize
        height = dataset.RasterYSize
        return width, height
    else:
        raise ValueError(f"Unable to open raster file: {raster_path}")

# print dimensions for each raster to check them against LULC dimension
for raster_path, band in listraster_uri:
    width, height = get_raster_dimensions(raster_path)
    print(f"Dimensions of {os.path.basename(raster_path)}: {width} x {height}")

# Function to get raster nodata value
def get_raster_nodata_value(raster_path):
    dataset = gdal.Open(raster_path)
    if dataset:
        band = dataset.GetRasterBand(1)
        nodata_value = band.GetNoDataValue()
        return nodata_value
    else:
        raise ValueError(f"Unable to open raster file: {raster_path}")

# Fetch nodata value from the LULC raster
lulc_nodata = get_raster_nodata_value(lulc)
print(f"Nodata value of LULC raster: {lulc_nodata}")

# define output raster
rasterout_uri = lulc_upd
# define math expression to update LULC
def raster_upd(lulc, waterbodies, waterways, railways, roads):
    # use the original LULC as the base
    result = np.copy(lulc)
    # valid LULC pixels: OSM values falling beyond the LULC extent are ignored
    valid_lulc = ~np.isclose(lulc, lulc_nodata)
    # overwrite LULC with OSM values wherever the OSM raster holds data
    # layers are applied in order, so later ones take priority (roads are applied last)
    # NOTE: rasterized OSM layers were written with nodata = 0, which equals lulc_nodata here
    for layer in (waterbodies, waterways, railways, roads):
        overwrite = valid_lulc & ~np.isclose(layer, lulc_nodata)
        result[overwrite] = layer[overwrite]
    return result

# run raster calculator
pg.raster_calculator(
            base_raster_path_band_const_list=listraster_uri,
            local_op=raster_upd, 
            target_raster_path=rasterout_uri,
            datatype_target=gdal.GDT_Byte,
            nodata_target=0,
            calc_raster_stats=True)

''' GDAL_CALC
# subprocess of gdal calculator becomes too bulky to compute - might cause issues: https://stackoverflow.com/questions/73921278/python-not-giving-same-results-as-gdal-command-line
merge_raster = [
    'gdal_calc.py',
    '-A', lulc,
    '-B', vrt_roads,
    '-C', vrt_railways,
    '-D', vrt_waterbodies,
    '-E', vrt_waterways,
    '--outfile=lulc_upd',
    '--calc="A*B*C*D*E"', # TODO - or 'B*(B!=0) + A*(B==0)',...
    '--NoDataValue', '0',
    '--debug',
]


# execute sum command through subprocess
subprocess.run(merge_raster, check=True, shell=True) # included shell=true, otherwise 
'''

# TODO - to remove intermediate rasterized files (roads, railways, waterbodies, waterways) once everything is finalised
# TODO - to record time to run code

c:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\lulc_2022_upd.tif
Dimensions of lulc_2022.tif: 10876 x 10587
Dimensions of vrt_waterbodies.tif: 10876 x 10587
Dimensions of vrt_waterways.tif: 10876 x 10587
Dimensions of vrt_railways.tif: 10876 x 10587
Dimensions of vrt_roads.tif: 10876 x 10587
Nodata value of LULC raster: 0.0
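
The merge rule used by `raster_upd` can be checked on a toy example before running it on full rasters. The sketch below is only an illustration under stated assumptions: the arrays, the burn values (50 for water, 60 for roads) and the helper name `overlay` are hypothetical and are not part of the workflow. It assumes, as in the outputs above, that the LULC and the rasterized layers share the nodata value 0.

```python
import numpy as np

NODATA = 0  # assumed nodata value shared by the LULC and the rasterized layers

def overlay(base, layers, nodata=NODATA):
    """Burn each layer onto a copy of base; later layers take priority."""
    result = base.copy()
    # pixels outside the LULC extent must stay nodata
    valid_base = base != nodata
    for layer in layers:
        overwrite = valid_base & (layer != nodata)
        result[overwrite] = layer[overwrite]
    return result

# toy 3 x 3 LULC: 0 = nodata, 1 and 2 = hypothetical land-cover codes
toy_lulc = np.array([[0, 1, 1],
                     [2, 2, 1],
                     [2, 1, 1]], dtype=np.uint8)
# hypothetical burn values: 50 = water, 60 = road
toy_water = np.array([[50, 50, 0],
                      [50, 0, 0],
                      [0, 0, 0]], dtype=np.uint8)
toy_roads = np.array([[0, 60, 0],
                      [0, 60, 0],
                      [0, 60, 0]], dtype=np.uint8)

# water is applied first, roads last, so roads win where both overlap
toy_result = overlay(toy_lulc, [toy_water, toy_roads])
print(toy_result)
```

In this toy case the upper-left pixel stays nodata although the water layer covers it, and the pixel covered by both water and road receives the road value, which mirrors the layer order passed to the raster calculator.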



##### 4. Recalculation of impedance
Let's extract all LULC codes that cause an edge effect and therefore increase landscape impedance. This part of the code reads them from the boolean `edge_effect` column of the impedance CSV file.

In [20]:
# create an empty list to store LULC codes which cause negative impact on habitats and edge effect
edge_effect_list = []
# convert datatype of 'edge_effect' column into integer if needed
impedance['edge_effect'] = impedance['edge_effect'].astype(int)

# iterate through each row in dataframe
for index, row in impedance.iterrows():
    # check if the value in 'edge_effect' column is 1 - user specified that these LULC are affecting habitats
    if row['edge_effect'] == 1:
        # record the value from 'lulc' column
        edge_effect_list.append(row['lulc'])
        print(f"LULC = {row['lulc']} is causing edge effect.")

print(f"LULC types causing edge effect on habitats are: {edge_effect_list}")

LULC = 2 is causing edge effect.
LULC = 102 is causing edge effect.
LULC types causing edge effect on habitats are: ['2', '102']
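
The list printed above holds the codes as strings, while raster values are numeric. The sketch below shows one possible way to turn such codes into a boolean mask over a raster array. It is a minimal illustration with a hypothetical toy array and an assumed nodata value of 0, not the final implementation. Converting the codes to integers first keeps the comparison with the raster values unambiguous.

```python
import numpy as np

# codes as strings, as in the list printed above
edge_effect_codes = ['2', '102']
# convert to integers so they can be compared with raster values
edge_effect_array = np.array([int(code) for code in edge_effect_codes])

# hypothetical toy LULC array; 0 is assumed to be the nodata value
toy_nodata = 0
toy_lulc = np.array([[1, 2, 2],
                     [102, 3, 1],
                     [0, 2, 3]])

# True where the pixel belongs to an LULC type causing edge effect
mask = np.isin(toy_lulc, edge_effect_array)
# keep edge-effect pixels, set everything else to nodata
edge_only = np.where(mask, toy_lulc, toy_nodata)
print(edge_only)
```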


In [19]:
'''
# Convert 'edge_effect' column to integer if needed
impedance['edge_effect'] = impedance['edge_effect'].astype(int)
# Create a list of LULC codes causing edge effect
edge_effect_list = impedance.loc[impedance['edge_effect'] == 1, 'lulc'].tolist()
print(f"LULC types causing edge effect on habitats are: {edge_effect_list}")
edge_effect_array = np.array(edge_effect_list, dtype=int)
print(edge_effect_array)

# open LULC
data_source = gdal.Open(lulc)
band = data_source.GetRasterBand(1)
lulc_data = band.ReadAsArray()
nodata_value = band.GetNoDataValue()

print("NoData value:", nodata_value)

band_data_type = band.DataType
print("Data type of the band:", gdal.GetDataTypeName(band_data_type))

# create a mask based on the 'edge_effect' values from the dataframe
mask = np.isin(lulc_data, edge_effect_array)
if np.any(mask):
    print("True values are present in the mask.")
else:
    print("No True values are present in the mask.")

# apply mask to LULC
masked_data = np.where(mask, lulc_data, nodata_value)
print (masked_data)
if np.any(masked_data != 0):
    print("Valid data is present in masked_data.")
else:
    print("masked_data contains only zeros or nodata values.")

# get the geo-transform and projection from the input raster
geotransform = data_source.GetGeoTransform()
projection = data_source.GetProjection()

# create output raster file
output_raster_path = os.path.join(parent_dir,output_dir,'edge_effect.tif')
driver = gdal.GetDriverByName('GTiff')
out_dataset = driver.Create(output_raster_path, data_source.RasterXSize, data_source.RasterYSize, 1, band.DataType)
out_dataset.SetGeoTransform(geotransform)
out_dataset.SetProjection(projection)

# write the masked data to the new raster file
out_band = out_dataset.GetRasterBand(1)
out_band.WriteArray(masked_data)
nodata_value_int = int(nodata_value)
out_band.SetNoDataValue(nodata_value_int)
print (nodata_value_int)

# flush data to disk
# out_band.FlushCache()
# close datasets
# data_source = None
# data_source = None

print("Masked LULC types affecting habitats with edge effect are saved to:", output_raster_path)

'''
