# Enrichment of land-use/land-cover (LULC) data

This tool enriches and rectifies commonly produced land-use/land-cover (LULC) raster data with auxiliary data from other sources. While these products capture local spatial features robustly, and the raster format provides quick and reliable access to the data, they may lack some important human-made and natural spatial features, for example, narrow roads or waterways overshadowed by vegetation. At the same time, these features can be extremely important for purposes such as ecological monitoring and conservation science, because roads, railways or even waterways can act as ecological barriers that prevent species from crossing them and migrating to other habitats. The workflow described below has been established to add this detail to LULC data.

So far, this workflow has been successfully applied to enrich MUCSC maps of Catalonia, Spain and [LCM (Land Cover Maps) by UKCEH (UK Centre for Ecology and Hydrology)](https://www.ceh.ac.uk/data/ukceh-land-cover-maps), which have spatial resolutions of 30 m and 25 m, respectively.

**This Jupyter Notebook depends on the [2nd step](./2_osm_historical.ipynb), but the [first step](./1_protected_areas/1_preprocessing_pas.ipynb) is not mandatory.**

## Environment and dependencies

This workflow requires specific packages to run most of the processing commands. An Anaconda environment has been used to ensure consistent and seamless installation of libraries. It is recommended to install geopandas and pandas together (to obtain compatible versions) through the Anaconda Prompt:
```
conda install -c conda-forge geopandas pandas
```
Other libraries can be installed with simple commands in the Anaconda Prompt, for example:
```
conda install gdal
```
QGIS is currently not used in the preprocessing workflow, but might be useful in the future:
```
conda install qgis --channel conda-forge
```
Outside of the Anaconda environment, pip is also commonly used to install libraries:
```
pip install gdal
```

Let's import all dependencies required:

##### Importing all dependencies

In [1]:
import numpy as np
import numpy.ma as ma
import geopandas as gpd
import fiona
import pygeoprocessing as pg

# auxiliary libraries
import subprocess
import warnings
import yaml
import os

# for appending scripts and functions
import sys

# own modules
import timing

# TODO - delete unused libraries

"""
import time
# os.environ['USE_PATH_FOR_GDAL_PYTHON'] = 'YES' #to import gdal
"""

# REDUNDANT - import QGIS  processing modules if needed (currently not required)
"""
from qgis.core import QgsVectorLayer
from qgis.core import QgsProject
from qgis.core import QgsProcessingUtils
from qgis.core import QgsGeometryChecker
"""


Because GDAL installation can be problematic, its import is wrapped in a separate check that stops execution with a clear message if the modules cannot be found:

In [2]:
# import GDAL (exit with a clear message if it is not installed)
try:
    from osgeo import ogr, osr, gdal
except ImportError:
    import sys
    sys.exit('ERROR: cannot find GDAL/OGR modules')

It is recommended to define a GDAL error handler function and to enable GDAL/OGR exceptions:

In [3]:
# specify GDAL error handler function
# (it only takes effect once registered, e.g. with gdal.PushErrorHandler(gdal_error_handler))
def gdal_error_handler(err_class, err_num, err_msg):
    errtype = {
        gdal.CE_None: 'None',
        gdal.CE_Debug: 'Debug',
        gdal.CE_Warning: 'Warning',
        gdal.CE_Failure: 'Failure',
        gdal.CE_Fatal: 'Fatal'
    }
    err_msg = err_msg.replace('\n', ' ')
    err_class = errtype.get(err_class, 'None')
    print('Error Number: %s' % (err_num))
    print('Error Type: %s' % (err_class))
    print('Error Message: %s' % (err_msg))

# enable GDAL/OGR exceptions
gdal.UseExceptions()

To monitor performance, the running time of the code is measured:

In [4]:
# call own module and start calculating time
timing.start()

# REDUNDANT SCRIPT
"""
# to measure time to run code
import time

# starting to measure running time
start_time = time.time()
"""


### Configuration

##### Input data

First, the input data, file names and paths must be defined. This block also specifies whether OpenStreetMap (OSM) data or user-specified vector data are used to refine the raster data.
The following types of input data are used:
1. Raster land-use/land-cover (LULC) data in GeoTIFF format. Cloud Optimized GeoTIFF (COG) is preferable (COG with LZW compression is used to optimise data storage). ***MANDATORY***
2. Vector data (GPKG) to enrich and refine LULC data, derived either from OSM or from user-specified data. Currently, roads, railways, water bodies and waterways are processed; support for urban and suburban areas is planned. ***MANDATORY***
3. Ancillary tabular data mapping LULC types to their specifications: (1) whether a particular LULC type should be refined by vector data or not (***MANDATORY***) and (2) whether the negative "edge effect" of a particular LULC type should be considered, for instance, roads reduce the suitability of the habitats alongside them (***OPTIONAL***).
4. Raster impedance (friction, or resistance) data derived from LULC data, corresponding to each unique value of LULC data and reflecting the relative difficulty for species of passing through different LULC types. This dataset is required to compute habitat connectivity, but it is not needed for other purposes. ***OPTIONAL***

##### Paths and filenames
Main variables are defined in the configuration file config.yaml. Let's load the configuration:

In [5]:
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

Define the year ('yearstamp') of the input dataset to be enriched with vector data. It is read from the configuration file:

In [6]:
year = config.get('year')
if year is None: # config.get() returns None both for a missing key and for an empty value
    warnings.warn("Year variable is not found in the configuration file.")
"""
print (year)
"""


Define the file name of the input dataset from the YAML file:

In [7]:
lulc_template = config.get('lulc')

# substitute year from the configuration file
lulc = lulc_template.format(year=year)

print(f"Input raster to be used for processing is {lulc}.")

Input raster to be used for processing is lulc_ukceh_25m_2023.tif.
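
The configuration steps above (reading a key, warning when it is missing, substituting the year into a filename template) can be sketched in isolation as follows. This is a minimal sketch, not the notebook's own code: the `config` dictionary is a hypothetical stand-in for the parsed `config.yaml`, its template values are assumed from the cell outputs, and `resolve_template` is an illustrative helper name.

```python
import warnings

# Hypothetical stand-in for the parsed config.yaml; keys mirror those read in
# this notebook, template values are assumed from the printed cell outputs.
config = {
    "year": 2023,
    "lulc": "lulc_ukceh_25m_{year}.tif",
    "osm_data": "osm_merged_{year}.gpkg",
    "impedance": "reclassification_ukceh.csv",
}


def resolve_template(config, key, year):
    """Return config[key] with {year} substituted, or None if the key is absent."""
    template = config.get(key)
    if template is None:
        warnings.warn(f"'{key}' is not found in the configuration file.")
        return None
    return template.format(year=year)


year = config.get("year")  # None if the key is missing
lulc_name = resolve_template(config, "lulc", year)
print(f"Input raster to be used for processing is {lulc_name}.")
```

Guarding the template before calling `.format()` avoids an `AttributeError` when a key is missing from the configuration file.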


Specify paths to the current directory, input and output datasets:

In [8]:
# specify parent directory
parent_dir = os.getcwd()  # the current working directory is used to avoid a hard-coded path
print (f"Parent directory: {parent_dir}")

# add Python path to search for scripts, modules
sys.path.append(parent_dir)

# specify paths
lulc_dir = config.get('lulc_dir')
impedance_dir = config.get('impedance_dir')
vector_dir = config.get('vector_dir')
output_dir = config.get('output_dir')

"""
# REDUNDANT - replaced with yaml
lulc_dir = r'data\input\lulc'
impedance_dir = r'data\input\impedance'
vector_dir = r'data\input\vector'
# specify output directory
output_dir = r'data\output'
"""

# create the output directory if it does not exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
    print(f"Created directory: {output_dir}")

"""
# REDUNDANT - replaced with yaml file
year = "2023" 
lulc = f'lulc_ukceh_25m_{year}.tif' # changed to test UKCEH LULC Maps 
"""

"""
# REDUNDANT PART ON CHOICE FROM OSM OR USER SPECIFIED DATASET (based on the presence of files)
# check if 'user_vector.gpkg' exists in the folder (uploaded directly by user)
user_vector = os.path.join(parent_dir, vector_dir, 'user_vector.gpkg')
if os.path.exists(user_vector):
    vector_refine = 'user_vector_{year}.gpkg'
else:
    vector_refine = 'osm_merged_{year}.gpkg'
"""

Parent directory: C:\Users\kriukovv\Documents\pilot_2\preprocessing

Users can choose either OpenStreetMap data or datasets from other sources that they have obtained themselves. OpenStreetMap is highly recommended, as it provides highly complete spatial data of various types across the world ([1](https://www.sciencedirect.com/science/article/abs/pii/S0378437114009145), [2](https://www.sciencedirect.com/science/article/pii/S0303243421002786), [3](https://www.sciencedirect.com/science/article/pii/S0143622817301819)).

In [9]:
# specify input vector data
osm_data_template = config.get('osm_data')
if osm_data_template is not None:
    osm_data = osm_data_template.format(year=year)
    user_vector = None
    vector_refine = osm_data # define a new variable which will be equal either osm_data or user_vector (depending on the configuration file)
    print ("Input raster dataset will be enriched with OSM data.")
else:
    osm_data = None
    warnings.warn("OSM data not found in the configuration file.") 
    
    user_vector_template = config.get('user_vector')
    if user_vector_template is not None:
        user_vector = user_vector_template.format(year=year)
        vector_refine = user_vector
        print ("Input raster dataset will be enriched with user-specified data.")
    else:
        # if neither OSM dataset, nor user dataset specified in the config file
        user_vector = None
        vector_refine = None
        warnings.warn("Neither OSM data nor user specified data found in the configuration file.")

if vector_refine is None:
    raise ValueError("No valid input vector data found. Both OSM data and user-specified data are missing.")

# print the name of chosen vector file
print(f"Using vector file to refine raster data: {vector_refine}")

Input raster dataset will be enriched with OSM data.
Using vector file to refine raster data: osm_merged_2023.gpkg
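
The selection logic above (OSM data first, user-specified data as a fallback, an error if neither is configured) can be condensed into one function. This is a sketch of the same priority order under the assumption that both entries are filename templates containing `{year}`; the function name `choose_vector` is illustrative and not part of the workflow.

```python
def choose_vector(config, year):
    """Return (source, filename); OSM data takes priority over user data."""
    for source, key in (("OSM", "osm_data"), ("user-specified", "user_vector")):
        template = config.get(key)
        if template is not None:
            return source, template.format(year=year)
    raise ValueError(
        "No valid input vector data found. "
        "Both OSM data and user-specified data are missing."
    )


source, vector_name = choose_vector({"osm_data": "osm_merged_{year}.gpkg"}, 2023)
print(f"Input raster dataset will be enriched with {source} data: {vector_name}")
```

Keeping the priority order in a single tuple makes it easy to see, and to change, which source wins when both are configured.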


Let's then define the full paths to the input files:

In [10]:
# specifying the path to input files through the path variables
lulc = os.path.join(parent_dir,lulc_dir,lulc)
vector_refine = os.path.join(parent_dir,vector_dir,vector_refine)

# normalise paths (to avoid mixing backslashes and forward slashes)
lulc = os.path.normpath(lulc)
vector_refine = os.path.normpath(vector_refine)

print(f"Path to the input raster dataset: {lulc}")
print(f"Path to the input vector dataset: {vector_refine}")

Path to the input raster dataset: C:\Users\kriukovv\Documents\pilot_2\preprocessing\data\input\lulc\lulc_ukceh_25m_2023.tif
Path to the input vector dataset: C:\Users\kriukovv\Documents\pilot_2\preprocessing\data\input\vector\osm_merged_2023.gpkg
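
Why normalisation matters can be shown with a small sketch. It assumes the directories in the configuration file are written with forward slashes while the parent directory comes from Windows; `ntpath` is used instead of `os.path` only so that the Windows behaviour is reproduced on any platform. The parent directory below is hypothetical.

```python
import ntpath  # Windows flavour of os.path, importable on any platform

parent_dir = r"C:\Users\me\project"  # hypothetical parent directory
lulc_dir = "data/input/lulc"         # forward slashes, as often written in YAML

joined = ntpath.join(parent_dir, lulc_dir, "lulc_ukceh_25m_2023.tif")
normalised = ntpath.normpath(joined)

print(joined)      # separators are mixed after joining
print(normalised)  # only backslashes remain after normalisation
```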


### Tabular data

Let's define auxiliary data (CSV):

In [11]:
# define variable (the tabular data are mandatory, so stop if they are not configured)
impedance = config.get('impedance')
if impedance is None:
    raise ValueError("No valid auxiliary tabular data found in the configuration file.")
print(f"Using auxiliary tabular data from {impedance}.")

# define path
impedance_file = os.path.join(parent_dir,impedance_dir,impedance)

Using auxiliary tabular data from reclassification_ukceh.csv.


Currently, the user can either:
- specify how the main types of OSM features correspond to LULC codes from the input raster dataset (for example, which LULC code roads should be assigned), or
- use the text-matching tool called from an external Python script.

The first option is recommended, as the variety of LULC type descriptions and of the languages used is vast.
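
The idea behind the text-matching option can be sketched with the standard library alone. This is not the `text_matching` module itself: the table below is a hypothetical excerpt with assumed column names `lulc` and `type` and made-up codes, and only the keyword patterns are taken from the earlier version of the matching code kept in this notebook.

```python
import csv
import io
import re

# Hypothetical excerpt of a reclassification table (codes and names are made up)
TABLE = """lulc,type
1,Inland water
2,Road network
3,Railway
"""

# Keyword patterns per feature type
PATTERNS = {
    "lulc_road": r"\broad|highway",
    "lulc_railway": r"rail|train",
    "lulc_water": r"continental water|inland water|freshwater",
}


def match_codes(csv_text, patterns):
    """Return the first LULC code whose description matches each pattern."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    codes = {}
    for name, pattern in patterns.items():
        regex = re.compile(pattern, re.IGNORECASE)
        codes[name] = next(
            (int(row["lulc"]) for row in rows if regex.search(row["type"])),
            None,  # no description matched this pattern
        )
    return codes


codes = match_codes(TABLE, PATTERNS)
print(codes)
```

Returning `None` for an unmatched pattern makes a failed match visible, which matters because descriptions in another language or with unusual wording will silently miss these keywords.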

In [12]:
# to import separate script as a module
import text_matching

# ancillary part - reload the module to reflect recent changes (otherwise might cause issues in Jupyter Notebook)
import importlib
importlib.reload(text_matching)

# read CSV file through geopandas as a dataframe
impedance = gpd.read_file(impedance_file)

# find out from the config file whether the user defines LULC codes on their own or uses the text-matching tool
# (str() handles both YAML booleans and quoted strings; lower() makes the check case-insensitive)
user_matching = str(config.get('user_matching')).lower()

# if user defines mapping on their own
if user_matching == 'true':
    # access variables and subvariables from the configuration file
    lulc_code = config.get('lulc_codes', {})
    lulc_road = lulc_code.get('lulc_road')
    lulc_railway = lulc_code.get('lulc_railway')
    lulc_water = lulc_code.get('lulc_water')
    lulc_urban = lulc_code.get('lulc_urban')
    lulc_suburban = lulc_code.get('lulc_suburban')

    # print codes of areas from OSM corresponding with LULC codes from input raster dataset
    print("User-specified mapping of LULC codes and OSM features is used.")
    print("LULC code of roads:", lulc_road)
    print("LULC code of railways:", lulc_railway)
    print("LULC code of inland waters:", lulc_water)
    print("LULC code of urban areas:", lulc_urban)
    print("LULC code of suburban areas:", lulc_suburban)

# if user defines mapping from text-matching tool
elif user_matching.lower() == 'false': # case-insensitive condition
    # call the function and capture the result
    lulc_codes = text_matching.codes(config, impedance_file)

    # define variables from the lulc_codes object
    lulc_road = lulc_codes.lulc_road
    lulc_railway = lulc_codes.lulc_railway
    lulc_water = lulc_codes.lulc_water
    lulc_urban = lulc_codes.lulc_urban
    lulc_suburban = lulc_codes.lulc_suburban
    
    # print codes of areas from OSM corresponding with LULC codes from input raster dataset
    print("Text matching tool used to map LULC codes and corresponding OSM features.")
    print("LULC code of roads:", lulc_road)
    print("LULC code of railways:", lulc_railway)
    print("LULC code of inland waters:", lulc_water)
    print("LULC code of urban areas:", lulc_urban)
    print("LULC code of suburban areas:", lulc_suburban)
else:
    raise ValueError ("User did not specify mapping between OSM features and LULC types.")


"""
# This block has been moved to the separate text_matching.py
# to find impedance values matching with built-up areas of human impact on habitats and inland water
lulc_urban = impedance.loc[impedance['type'].str.contains('urban|built|build|resident|industr|commerc', case = False),'lulc'].iloc[0] # use .iloc[0] to receive a first row with the text match
# if lulc_urban.empty:
    # print ("No urban areas revealed in LULC data")
lulc_suburban = impedance.loc[impedance['type'].str.contains('suburban|urbanized|urbanised', case = False),'lulc'].iloc[0]
# if lulc_suburban.empty:
    # print ("No suburban areas revealed in LULC data") # error 'str' object has no attribute 'empty'

lulc_road = impedance.loc[impedance['type'].str.contains(r'\broad|highway', case = False),'lulc'] # to choose the first matching value if roads are found
if not lulc_road.empty:
    lulc_road = lulc_road.iloc[0]  # choose the first matching value if roads are found
else:
    lulc_road = lulc_urban   # railways are unlikely specified as a separate LULC code so it is the same

lulc_railway = impedance.loc[impedance['type'].str.contains('rail|train', case = False),'lulc']
if not lulc_railway.empty:
    lulc_railway = lulc_railway.iloc[0]  # choose the first matching value if railways are found
else:
    lulc_railway = lulc_suburban   # railways are unlikely specified as a separate LULC code so it is the same

# use names of water LULC codes from extended LULC classificiation
lulc_water = impedance.loc[impedance['type'].str.contains('continental water|inland water|freshwater', case=False), 'lulc']
if not lulc_water.empty:
    lulc_water = lulc_water.iloc[0]
else:
    # if no matches found, resort to other names (short LULC classification)
    lulc_water = impedance.loc[impedance['type'].str.contains('water|aqua|river', case=False), 'lulc'].iloc[0]

print("LULC code of roads:", lulc_road,"\n","LULC code of railways:", lulc_railway,"\n","LULC code of urban areas:", lulc_urban,"\n","LULC code of suburban areas:", lulc_suburban,"\n","LULC code of inland waters:", lulc_water)

# TODO - create a dictionary with impedance values?
"""



User-specified mapping of LULC codes and OSM features is used.
LULC code of roads: 20
LULC code of railways: 21
LULC code of inland waters: 14
LULC code of urban areas: None
LULC code of suburban areas: None



### Vector data

Historical vector data to enrich the raster LULC data have been derived from the open-access OpenStreetMap (OSM) portal through the nested [Overpass Turbo API](./1_osm_hsitorical.py). Currently, OSM data are exported as a single merged GeoPackage file (roads, railways, water bodies and waterways).

##### Access layers
To work with the merged GeoPackage file that combines OSM data, its separate layers need to be accessed:

In [13]:
def extract_layer_names(gpkg_path):
    """
    Extracts layer names from a GeoPackage file.

    Arguments:
    - gpkg_path (str): Path to the GeoPackage file.

    Returns:
    - layer_names (list): A list of layer names in the GeoPackage.
    """
    with fiona.Env():
        layer_names = fiona.listlayers(gpkg_path)
    return layer_names

# apply function
layers = extract_layer_names(vector_refine)
formatted_layers = ', '.join(layers)  # join layer names with a comma and space for readability
print(f"Layers in the vector file are: {formatted_layers}. Do they match your expectations?")

Layers in the vector file are: railways, roads, waterbodies, waterways. Do they match your expectations?
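
Because a GeoPackage is an SQLite database, the layer names can also be read without any geospatial library by querying its `gpkg_contents` table. The sketch below is an alternative to the Fiona call above, not the notebook's method; the demo file contains only a minimal `gpkg_contents` table and is not a complete, valid GeoPackage, and `list_feature_layers` is an illustrative name.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def list_feature_layers(gpkg_path):
    """List names of vector (feature) layers registered in gpkg_contents."""
    with closing(sqlite3.connect(gpkg_path)) as conn:
        rows = conn.execute(
            "SELECT table_name FROM gpkg_contents "
            "WHERE data_type = 'features' ORDER BY table_name"
        ).fetchall()
    return [name for (name,) in rows]


# Demo on a stand-in file holding only a minimal gpkg_contents table
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.gpkg")
    with closing(sqlite3.connect(demo_path)) as conn:
        conn.execute("CREATE TABLE gpkg_contents (table_name TEXT, data_type TEXT)")
        conn.executemany(
            "INSERT INTO gpkg_contents VALUES (?, ?)",
            [("roads", "features"), ("railways", "features"),
             ("waterbodies", "features"), ("waterways", "features"),
             ("basemap", "tiles")],  # tile layers are filtered out
        )
        conn.commit()
    layers = list_feature_layers(demo_path)

print(", ".join(layers))
```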


In [14]:
'''# open geopackage file
with fiona.open(vector_refine) as geopackage:
    # extract unique values from the "layer_type" attribute
    unique_layer_types = set(feature['properties']['layer_type'] for feature in geopackage)
# use Fiona to find all layers in geopackage
available_layers = fiona.listlayers(vector_refine)

print("Available layers in input GeoPackage:")
for layer_name in available_layers:
    print(layer_name)

# TODO - to rewrite the following block to automatically create separate geopackages based on its "layer_type"
# specify the layer name we want to work with
vector_roads_name = 'gdf_roads_filtered'
vector_railways_name = 'gdf_railways_filtered'
vector_waterbodies_name = 'gdf_water_filtered'
vector_water_lines_name = 'gdf_water_lines_filtered'

# use Fiona again to open the GeoPackage file and access specific layers
with fiona.open(vector_refine) as geopackage:
    # access layers of vector file
    vector_roads = geopackage[vector_roads_name]
    vector_railways = geopackage[vector_railways_name]
    vector_waterbodies = geopackage[vector_waterbodies_name]
    vector_waterways = geopackage[vector_waterways_name]
'''

# to define function to separate geopackages
"""
# TODO - to add separation block based on layers in geopackage
def extract_geopackages(gpkg_path, output_folder, attribute_name):
    # Read the GeoPackage file
    gdf = gpd.read_file(gpkg_path)

    # Get unique values in the specified attribute
    unique_values = gdf[attribute_name].unique()

    # Create GeoPackages for each unique value
    for value in unique_values:
        subset_gdf = gdf[gdf[attribute_name] == value]
        output_gpkg = os.path.join(output_folder, f"{value}.gpkg")
        subset_gdf.to_file(output_gpkg, driver="GPKG")
        print(f"Extracted GeoPackage: {output_gpkg}")

# to define variables
geopackage_path = vector_refine
output_folder = output_dir
attribute_name = "layer_type"

# to call function
extract_geopackages(geopackage_path, output_folder, attribute_name)

# to define variables assigned to separate geopackages
vector_roads = os.path.join(vector_dir, "roads_{year}.gpkg")
vector_railways = os.path.join(vector_dir, "railways_{year}.gpkg")
vector_waterbodies = os.path.join(vector_dir, "waterbodies_{year}.gpkg")
vector_waterways = os.path.join(vector_dir, "waterways_{year}.gpkg")
"""

"""
vector_roads = gpd.read_file(vector_refine, layer = 'roads')
vector_railways = gpd.read_file(vector_refine, layer = 'railways')
vector_waterbodies = gpd.read_file(vector_refine, layer = 'waterbodies')
vector_waterways = gpd.read_file(vector_refine, layer = 'waterways')
"""

##### Validity of vector geometry

It is important to check the validity of the vector geometries used to refine the input raster dataset. If any invalid geometries are detected, the user is warned and informed of the share of features with invalid geometries in the total number of features. It is up to the user whether to proceed with invalid geometries or not. As geometries derived from OpenStreetMap are usually geometrically and topologically correct [(4)](https://www.sciencedirect.com/science/article/abs/pii/S0143622822001138), no errors are raised at this step.

Geometries are fixed while harmonising the outputs of the [Overpass Turbo API queries](./1_osm_hsitorical.py), but they are checked a second time in this workflow.
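
The reporting step described above (warn and state the share of invalid features) can be sketched independently of any geometry library. The function below is illustrative only and is not the `geom_valid()` method used in this workflow; it assumes the per-feature validity flags have already been computed elsewhere, for example with OGR's `IsValid()`.

```python
import warnings


def report_validity(layer_name, validity_flags):
    """Warn about invalid geometries and return their share (0.0 to 1.0)."""
    flags = list(validity_flags)
    invalid = flags.count(False)
    share = invalid / len(flags) if flags else 0.0
    if invalid:
        warnings.warn(
            f"Layer '{layer_name}': {invalid} of {len(flags)} features "
            f"({share:.1%}) have invalid geometries."
        )
    else:
        print(f"Good news! All vector geometries in layer '{layer_name}' are valid.")
    return share


# hypothetical flags: one invalid feature out of four
share = report_validity("roads", [True, True, False, True])
print(share)
```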

In [15]:
# import functions from own .py module
from vector_proc import VectorTransform

# define full path to the directory that contains the vector input file
vector_refine_path = os.path.dirname(vector_refine)

# call function from class
VectorTransform(vector_refine_path).geom_valid()

# Previous version, not yet wrapped in a function
"""
# open geopackage file
data_source = ogr.Open(vector_refine)

# get the number of layers in geopackage
num_layers = data_source.GetLayerCount()

# iterate through each layer
for i in range(num_layers):
    layer = data_source.GetLayerByIndex(i)

# check the validity of all geometries in the layer
    all_geometries_valid = all(feature.GetGeometryRef().IsValid() for feature in layer)

    if all_geometries_valid:
        print("Good news! All vector geometries are valid and can be used to refine your data.")
    else:
        print("At least one vector geometry is invalid. The further executions might be complicated by invalid geometries.")

# close the geopackage file
data_source = None
"""


Good news! All vector geometries in GeoPackage 'osm_merged_2018.gpkg' (layer 'railways') are valid.
----------------------------------------
Good news! All vector geometries in GeoPackage 'osm_merged_2018.gpkg' (layer 'roads') are valid.
----------------------------------------
Good news! All vector geometries in GeoPackage 'osm_merged_2018.gpkg' (layer 'waterbodies') are valid.
----------------------------------------
Good news! All vector geometries in GeoPackage 'osm_merged_2018.gpkg' (layer 'waterways') are valid.
----------------------------------------
Good news! All vector geometries in GeoPackage 'osm_merged_2023.gpkg' (layer 'railways') are valid.
----------------------------------------
Good news! All vector geometries in GeoPackage 'osm_merged_2023.gpkg' (layer 'roads') are valid.
----------------------------------------
Good news! All vector geometries in GeoPackage 'osm_merged_2023.gpkg' (layer 'waterbodies') are valid.
----------------------------------------
Good news! A


### Raster data

##### Checks on coordinate reference systems
Input raster data should have a Cartesian (projected) CRS so that all computations are performed correctly.

In [16]:
# if the CRS of the input raster data is not a cartesian one, a warning is raised

# open input LULC file
dataset = gdal.Open(lulc)
is_cartesian = None  # stays None if the dataset cannot be opened

if dataset:
    try:
        # get the projection information
        projection = dataset.GetProjection()
        srs = osr.SpatialReference()
        srs.ImportFromWkt(projection)

        # get the CRS information
        crs = srs.ExportToProj4()

        # check if CRS is cartesian
        is_cartesian = srs.IsProjected()

    finally:
        dataset = None  # close the dataset to free resources
else:
    warning_message_2 = "Failed to open the raster dataset. Please check the path and format of the input raster."
    warnings.warn(warning_message_2, Warning)

# display a warning if the CRS is not cartesian (skipped if the dataset was not opened)
if is_cartesian is None:
    print("The CRS could not be checked because the raster dataset was not opened.")
elif not is_cartesian:
    warning_message_3 = "The CRS is not the cartesian one. To exploit this workflow correctly, you should reproject it."
    warnings.warn(warning_message_3, Warning)
else:
    print("Good news! The CRS of your input raster dataset is the cartesian one.")

Good news! The CRS of your input raster dataset is the cartesian one.
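
One reason a projected CRS is required is that distances in a geographic CRS are expressed in degrees, and the ground length of a degree of longitude changes with latitude. The sketch below illustrates this with a spherical approximation of the Earth; the radius value and the chosen latitudes are assumptions made for illustration, and the function is not part of the workflow.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius, spherical approximation


def metres_per_degree_lon(lat_deg):
    """Approximate ground length of one degree of longitude at a latitude."""
    return math.pi / 180 * EARTH_RADIUS_M * math.cos(math.radians(lat_deg))


for lat in (0, 41.5, 54):  # illustrative latitudes
    print(f"latitude {lat:>4}: {metres_per_degree_lon(lat):,.0f} m per degree")
```

A buffer or cell size given in metres therefore has no fixed equivalent in degrees, which is why the raster is expected to be reprojected before processing.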


##### Check on the consistency of spatial resolution

The X and Y spatial resolutions of the input raster should match.

In [17]:

# Import the RasterTransform class from the reprojection module
from reprojection import RasterTransform  # this imports RasterTransform class

xres, yres = RasterTransform(lulc).check_res()

# REDUNDANT - casted to function
"""
# retrieve raster resolution - cellsize
inp_source = gdal.Open(lulc)
geo_transform = inp_source.GetGeoTransform()

# define function raise warning if there is some mismatch between x and y resolution
def check_res (raster):
    raster_geotransform = raster.GetGeoTransform()
    xres = raster_geotransform[1]
    yres = raster_geotransform[5]
    # compare absolute values, because the y value is represented in negative coordinates
    if abs(xres) != abs(yres):
        print ("x:",xres,"y:",yres)
        warning_message = f"Spatial resolution (x and y values) of input raster is inconsistent"
        warnings.warn(warning_message, Warning)
    else:
        print ("Good news! The spatial resolution of your raster data is consistent between X and Y.")
    return xres, yres
"""
# TODO - to cast to the function

# get the raster information
x_min, x_max, y_min, y_max, cell_size = RasterTransform(lulc).get_raster_info()

# print the results
print(f"x_min: {x_min}")
print(f"x_max: {x_max}")
print(f"y_min: {y_min}")
print(f"y_max: {y_max}")
print(f"Spatial resolution of input raster dataset (cell size): {cell_size}")

# check if the input raster dataset has a projected (cartesian) CRS
is_cartesian, crs_info = RasterTransform(lulc).check_cart_crs()


#REDUNDANT - casted to function
"""
# run function and capture the resolution values
xres, yres = check_res(inp_source)
cell_size = abs(xres)

# fetch max/min coordinates to use them later
x_min = geo_transform[0]
y_max = geo_transform[3]
x_max = x_min + geo_transform[1] * inp_source.RasterXSize
y_min = y_max + geo_transform[5] * inp_source.RasterYSize

print (f"Spatial resolution (pixel size) is {cell_size} meters")
print (f"x min coordinate is {x_min}")
print (f"y max coordinate is {y_max}")
print (f"x max coordinate is {x_max}")
print (f"y min coordinate is {y_min}")
"""

Good news! The spatial resolution of your raster data is consistent between X and Y.
Input raster dataset C:\Users\kriukovv\Documents\pilot_2\preprocessing\data\input\lulc\lulc_ukceh_25m_2023.tif was opened successfully.
Coordinate reference system of the input raster dataset is EPSG:27700
x_min: 347225.0
x_max: 452300.0
y_min: 343800.0
y_max: 540325.0
Spatial resolution of input raster dataset (cell size): 25.0
Good news! The CRS of your input raster dataset is the Cartesian (projected) one.
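
The resolution check and the extent calculation both follow from the six-element GDAL geotransform. The sketch below reproduces that arithmetic without GDAL, assuming a north-up raster (rotation terms equal to zero). The function names are illustrative, and the raster dimensions are not printed above but inferred from the reported extent and cell size.

```python
import math


def resolution_is_consistent(geotransform, rel_tol=1e-9):
    """Compare absolute X and Y pixel sizes (Y is negative for north-up rasters)."""
    xres, yres = geotransform[1], geotransform[5]
    return math.isclose(abs(xres), abs(yres), rel_tol=rel_tol)


def raster_extent(geotransform, n_cols, n_rows):
    """Return (x_min, x_max, y_min, y_max) of a north-up raster."""
    x_min = geotransform[0]
    y_max = geotransform[3]
    x_max = x_min + geotransform[1] * n_cols
    y_min = y_max + geotransform[5] * n_rows
    return x_min, x_max, y_min, y_max


# origin and cell size as printed above; 4203 x 7861 cells is inferred from the extent
gt = (347225.0, 25.0, 0.0, 540325.0, 0.0, -25.0)
print(resolution_is_consistent(gt))
print(raster_extent(gt, 4203, 7861))
```

Using `math.isclose` rather than `!=` avoids false warnings when the two pixel sizes differ only by floating-point rounding.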



### Buffering features

##### Buffering roads
Let's buffer roads by the width recorded in their attributes. This step matters most for wide primary roads, which can span a significant number of pixels.
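
How many cells a buffered road spans depends on the buffer distance and the cell size. The sketch below is simple arithmetic under stated assumptions: a buffer around a line extends to both sides, so the resulting corridor is twice the buffer distance wide. The distances are hypothetical, and whether the `width` attribute should be passed as the full or the half distance depends on how it is recorded in the data.

```python
import math


def pixels_across(buffer_distance, cell_size):
    """Approximate number of raster cells spanned across a buffered line."""
    corridor_width = 2 * buffer_distance  # a buffer extends to both sides of the line
    return math.ceil(corridor_width / cell_size)


# hypothetical buffer distances in metres, on a 25 m raster
for distance in (3.5, 15.0, 30.0):
    print(f"buffer {distance} m: about {pixels_across(distance, 25.0)} cell(s) across")
```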

In [18]:
'''
# 1 option - OGR buffering - doesn't allow copying attributes by now
vector_ds = ogr.Open(vector_roads)

if vector_ds is None:
    print(f"Failed to open the vector file: {vector_roads}")
    exit()

# Print basic information about the layers
print(f"Number of Layers in the Dataset: {vector_ds.GetLayerCount()}")

for i in range(vector_ds.GetLayerCount()):
    layer = vector_ds.GetLayerByIndex(i)
    print(f"\nLayer {i + 1} Information:")
    print(f"Name: {layer.GetName()}")
    print(f"Geometry Type: {ogr.GeometryTypeToName(layer.GetGeomType())}")
    print(f"Number of Features: {layer.GetFeatureCount()}")
    print(f"Extent: {layer.GetExtent()}")

# Close the dataset
vector_ds = None

# define function to create buffers
def createBuffer(inputfn, outputBufferfn, file_format, layerName, distance_field):
    inputds = ogr.Open(inputfn)
    inputlyr = inputds.GetLayer(layerName)
    # check if dataset was opened successfully
    if inputds is None:
        print(f"Failed to open the vector file: {vector_roads}")
        exit()

    # shpdriver = ogr.GetDriverByName('ESRI Shapefile')
    file_driver = ogr.GetDriverByName(file_format)
    if os.path.exists(outputBufferfn):
        file_driver.DeleteDataSource(outputBufferfn)
        
    outputBufferds = file_driver.CreateDataSource(outputBufferfn)

    # define CRS
    output_crs = ogr.osr.SpatialReference()
    output_crs.ImportFromEPSG(25831)

    bufferlyr = outputBufferds.CreateLayer(outputBufferfn, geom_type=ogr.wkbPolygon, srs=output_crs, options=["OVERWRITE=YES"])
    featureDefn = bufferlyr.GetLayerDefn()
    
    #for i in range(featureDefn.GetFieldCount()):
        #fieldDefn = featureDefn.GetFieldDefn(i)
        #bufferlyr.CreateField(fieldDefn)
        
    for feature in inputlyr:
        ingeom = feature.GetGeometryRef()
        # distance_field_value = inputlyr.GetLayerDefn(distance_field)
        buffer_distance = feature.GetField(distance_field)
        geomBuffer = ingeom.Buffer(buffer_distance)  # buffer_distance to get the attribute value

        outFeature = ogr.Feature(featureDefn)
        outFeature.SetGeometry(geomBuffer)
        bufferlyr.CreateFeature(outFeature)
        outFeature = None

        #
        # Copy attributes from input feature to output feature
        #for i in range(featureDefn.GetFieldCount()):
            #field_name = featureDefn.GetFieldDefn(i).GetName()
            #field_value = feature.GetField(i)
            #outFeature.SetField(field_name, field_value)
        
# write output to file rather than to memory - deal with this later if memory required.
vector_roads_buffered = 'vector_roads_buffered.gpkg'
vector_roads_buffered = os.path.join(parent_dir,output_dir,vector_roads_buffered)
# check if the file exists, if it does, delete it
if os.path.exists(vector_roads_buffered):
    os.remove(vector_roads_buffered)

createBuffer(vector_roads, vector_roads_buffered, 'GPKG', "roads", 'width')

print("Buffered vector saved to: ", vector_roads_buffered)
'''

In [19]:
# This block is using ogr2ogr command line script (currently the most stable solution)

# write output to file rather than to memory
# TODO - try writing to memory if required to save resources 
# vector_roads = os.path.join(parent_dir, vector_dir, f"roads_{year}.gpkg")
vector_roads_buffered = os.path.join(parent_dir, output_dir, f"roads_{year}_buffered.gpkg")
vector_roads_buffered = os.path.normpath(vector_roads_buffered) # normalise path

# check if the file exists from previous calculations and delete it if it does
if os.path.exists(vector_roads_buffered):
    os.remove(vector_roads_buffered)

"""
# other vector data from OSM
vector_waterbodies = os.path.join(parent_dir,vector_dir, f"waterbodies_{year}.gpkg")
vector_waterways = os.path.join(parent_dir,vector_dir, f"waterways_{year}.gpkg")
"""

# define the ogr2ogr command as a list of arguments
## The 'width' column is first cast to real values, because the original OSM data (derived from GeoJSON) store this column as text.
## TODO - replace NULL width with self-defined width in sql query

ogr2ogr_buffer_roads = [
    'ogr2ogr',
    '-f', 'GPKG',
    vector_roads_buffered, # output file path
    vector_refine, # input file path (should be before the SQL statement)
    '-dialect', 'SQLite',
    '-sql', f"""
        SELECT 
            ST_Buffer(
                geom, 
                CASE 
                    WHEN width IS NULL OR CAST(width AS REAL) IS NULL THEN 
                        CASE 
                            WHEN highway IN ('motorway', 'motorway_link', 'trunk', 'trunk_link') THEN 30/2 
                            WHEN highway IN ('primary', 'primary_link', 'secondary', 'secondary_link') THEN 20/2 
                            ELSE 10/2 
                        END 
                    ELSE CAST(width AS REAL)/2 
                END
            ) AS geometry, 
            * 
        FROM roads /* to specify layer of input file */
    """,
    '-nlt', 'POLYGON' # ensure the output is a polygon
    #'-nln', f'roads_{year}_buffered' # define layer in the output file
]

# TODO - specify condition to replace separately null values of width
"""
# redundant solutions
... FROM roads_{year} 
"""

# execute ogr2ogr command
try:
    result = subprocess.run(ogr2ogr_buffer_roads, check=True, capture_output=True, text=True)
    print(f"Successfully buffered 'roads' layer and saved to {vector_roads_buffered}.")
    if result.stderr:
        print(f"Warnings or errors:\n{result.stderr}")
except subprocess.CalledProcessError as e:
    print(f"Error buffering roads: {e.stderr}")
except Exception as e:
    print(f"Unexpected error: {str(e)}")

Successfully buffered 'roads' layer and saved to C:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\roads_2023_buffered.gpkg.
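
The fallback rule encoded in the SQL `CASE` expression above can be sketched in plain Python to make the logic easier to test. This is a minimal illustration, not part of the workflow: the function name `buffer_distance` and the constants `WIDE` and `MEDIUM` are hypothetical. It treats non-numeric text in `width` as missing, which appears to be the intent of the query; SQLite's `CAST` is understood to return 0.0 for text without a numeric prefix, so the two may differ for such values.

```python
from typing import Optional

# OSM 'highway' classes grouped by the default widths used in the SQL query
WIDE = {"motorway", "motorway_link", "trunk", "trunk_link"}
MEDIUM = {"primary", "primary_link", "secondary", "secondary_link"}

def buffer_distance(width: Optional[str], highway: Optional[str]) -> float:
    """Return the one-sided buffer distance (half of the road width) in metres."""
    try:
        full_width = float(width)
    except (TypeError, ValueError):
        # width is missing or not numeric: fall back to a class-based default
        if highway in WIDE:
            full_width = 30.0
        elif highway in MEDIUM:
            full_width = 20.0
        else:
            full_width = 10.0
    return full_width / 2

print(buffer_distance("7.5", "residential"))  # 3.75
print(buffer_distance(None, "motorway"))      # 15.0
```

The division by two reflects that `ST_Buffer` applies the distance to each side of the line.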


##### Buffering railways
Let's buffer railways. OSM does not record the width of railways directly (only the track gauge), so a fixed width of 10 m is applied, i.e. a buffer of 5 m on each side of the line.

In [20]:
# write output to file rather than to memory - deal with this later if memory required (TODO)
vector_railways = os.path.join(parent_dir,vector_dir, f"railways_{year}.gpkg")
vector_railways_buffered = os.path.join(parent_dir,output_dir, f"railways_{year}_buffered.gpkg")
vector_railways_buffered = os.path.normpath(vector_railways_buffered) # normalise path

# check if the file exists, if it does, delete it
if os.path.exists(vector_railways_buffered):
    os.remove(vector_railways_buffered)

# define the ogr2ogr command as a list of arguments
## the width of railways is not specified directly in OSM; only the 'gauge' key (track width) is available, and it may be stored as text (e.g. gauge=broad or gauge=1000;2000)
## TODO - implement width processing rule
ogr2ogr_buffer_railways = [
    'ogr2ogr',
    '-f', 'GPKG',
    vector_railways_buffered, # output file
    vector_refine, # input merged gpkg file
    '-dialect', 'SQLite',
    '-sql', "SELECT ST_Buffer(geom, 10/2) AS geometry, * FROM railways", # divide by 2 because the buffer distance is applied to each side of the feature
    # select from the dedicated layer
    '-nlt', 'POLYGON',
    # '-nln', f'railways_{year}_buffered' # define layer in the output file
]

# execute ogr2ogr command
try:
    subprocess.run(ogr2ogr_buffer_railways, check=True, capture_output=True, text=True)
    print(f"Successfully buffered 'railways' layer and saved to {vector_railways_buffered}.")
except subprocess.CalledProcessError as e:
    print(f"Error buffering railways: {e}")
except Exception as e:
    print(f"Unexpected error: {str(e)}")

# listing all intermediate buffer geometries to delete them once all steps are completed
buffered_geoms = [vector_roads_buffered, vector_railways_buffered]

# TODO - merge as one function with roads?

Successfully buffered 'railways' layer and saved to C:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\railways_2023_buffered.gpkg.
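
The TODO above asks whether road and railway buffering could share one function. A possible sketch, under the assumption that both commands differ only in the layer name and the SQL expression for the distance, is shown below. The helper `build_buffer_cmd` and the example file names are hypothetical; the helper only assembles the argument list and does not run `ogr2ogr`.

```python
def build_buffer_cmd(input_path, output_path, layer, distance_sql):
    """Assemble an ogr2ogr argument list that buffers one layer of a GeoPackage.

    distance_sql is an SQL expression giving the one-sided buffer distance,
    e.g. '10/2' or a CASE expression based on attributes.
    """
    sql = f"SELECT ST_Buffer(geom, {distance_sql}) AS geometry, * FROM {layer}"
    return [
        'ogr2ogr',
        '-f', 'GPKG',
        output_path,   # output file path
        input_path,    # input file path
        '-dialect', 'SQLite',
        '-sql', sql,
        '-nlt', 'POLYGON',  # ensure the output is a polygon
    ]

# hypothetical file names, for illustration only
cmd = build_buffer_cmd('input.gpkg', 'railways_buffered.gpkg', 'railways', '10/2')
print(cmd[-3])
```

The resulting list could then be passed to `subprocess.run(cmd, check=True, capture_output=True, text=True)` in the same way as in the cells above.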


#### Merging buffered roads
*This block is currently tentative, as merging buffers has been found not to be vital for further processing.*

In [21]:
# pygeo?
'''
# REDUNDANT BLOCK (RASTERIZING IS FASTER WITHOUT INITIAL MERGING)
# define output file to save merged geometries
roads_buf_merged = 'roads_buf_merged.gpkg'
roads_buf_merged = os.path.join(parent_dir,output_dir,roads_buf_merged)

# define the ogr2ogr command as a list of arguments
ogr2ogr_merge = [
    'ogr2ogr',
    '-f', 'GPKG',
    '-dialect', 'SQLite',
    '-sql', "SELECT ST_Union(geometry) AS geometry, * FROM roads",
    roads_buf_merged,
    vector_roads_buffered,
    '-nln', 'roads'
]

# execute ogr2ogr command
try:
    subprocess.run(ogr2ogr_merge, check=True)
    print("Merging of buffers has been successfully completed.")
except subprocess.CalledProcessError as e:
    print(f"Error merging buffers: {e}")
'''

### Rasterizing processed vector features

In [22]:
# BASH version of rasterization

# define function
def rasterize_vector(vector_path, output_path, nodata_value, burn_value, layer_name=None):
    '''
    # input
    inp_driver = ogr.GetDriverByName('GPKG')
    inp_source = inp_driver.Open(vector_path, 0)
    inp_lyr = inp_source.GetLayer(0)
    inp_srs = inp_lyr.GetSpatialRef()

    # getting cellsize from lulc resolution
    cell_size = xres # is not a parameter of function because it must be the same as cell_size of LULC raster

    # input extent # TODO - must be specified from LULC not geopackages!
    x_min, x_max, y_min, y_max = inp_lyr.GetExtent()
    '''

    # open the vector data source
    data_source = ogr.Open(vector_path)
    if data_source is None:
        raise RuntimeError(f"Failed to open the vector file: {vector_path}")

    # check the number of layers and write it to the variable
    layer_count = data_source.GetLayerCount()

    # define gdal_rasterize command
    gdal_rasterize_cmd = [
        'gdal_rasterize',
        '-tr', str(cell_size), str(cell_size),  # output raster pixel size
        '-te', str(x_min), str(y_min), str(x_max), str(y_max),  # output extent 
        '-a_nodata', str(nodata_value),  # no_data value
        '-ot', 'Int16',   # output raster data type,
        '-burn', str(burn_value),  # burn-in value
        '-at',  # all touched pixels are burned in
        vector_path,  # input vector file
        output_path  # output raster file
    ]

    # add the layer name if there are multiple layers 
    if layer_count > 1: # specify layer name if using merged geopackage as an input file
        gdal_rasterize_cmd.insert(1, '-l')
        gdal_rasterize_cmd.insert(2, str(layer_name))

    # execute gdal_rasterize command through subprocess
    subprocess.run(gdal_rasterize_cmd, check=True, capture_output=True, text=True)

    '''
    # resample output raster to match the resolution and size of LULC raster
    output_path_resampled = output_path.replace('.tif', '_resampled.tif')
    gdalwarp_cmd = [
        'gdalwarp',
        # '-tr', str(lulc_pixel_size), str(lulc_pixel_size),  # target resolution same as LULC raster
        '-te', str(x_min), str(y_min), str(x_max), str(y_max),  # target coordinates same as LULC raster
        '-r', 'near',  # resampling method (better use 'near' for categorical data)
        '-dstnodata', str(nodata_value),  # set nodata value
        output_path,  # input raster to be resampled
        output_path_resampled  # output path for resampled raster
    ]

    # execute gdalwarp command through subprocess
    subprocess.run(gdalwarp_cmd, check=True)
    '''
    
    # compress output 
    output_compressed = output_path.replace('.tif', '_compr.tif')
    gdal_translate_cmd = [
        'gdal_translate',
        output_path,
        output_compressed,
        '-co', 'COMPRESS=LZW',
        '-ot', 'Byte'
    ]
    # execute gdal_translate command through subprocess
    subprocess.run(gdal_translate_cmd, check=True)

    # rename compressed output to original
    os.remove(output_path)
    os.rename(output_compressed, output_path)

    print("Rasterized output saved to:", output_path)

# to resample rasters obtained by LULC

# specify rasterized temporary outputs
vrt_roads = os.path.join(parent_dir,output_dir,f'vrt_roads_{year}.tif')
vrt_railways = os.path.join(parent_dir,output_dir,f'vrt_railways_{year}.tif')
vrt_waterbodies = os.path.join(parent_dir,output_dir,f'vrt_waterbodies_{year}.tif')
vrt_waterways = os.path.join(parent_dir,output_dir,f'vrt_waterways_{year}.tif')

# appending temporary outputs to a list
rasters_temp = [vrt_roads, vrt_railways, vrt_waterbodies, vrt_waterways]

# vrt_roads_compr = os.path.join(parent_dir,output_dir,'vrt_roads_compr.tif')

# rasterize roads and railways from buffered geometries
rasterize_vector(vector_roads_buffered, vrt_roads, nodata_value=0, burn_value=lulc_road)
rasterize_vector(vector_railways_buffered, vrt_railways, nodata_value=0, burn_value=lulc_railway)

# rasterize waterbodies and waterways from the initial input vector data
rasterize_vector(vector_refine, vrt_waterbodies, layer_name='waterbodies', nodata_value=0, burn_value=lulc_water) # read from the corresponding layer
rasterize_vector(vector_refine, vrt_waterways, layer_name='waterways', nodata_value=0, burn_value=lulc_water) # read from the corresponding layer

# TODO - to define variables on waterbodies and waterways separately

Rasterized output saved to: C:\Users\kriukovv\Documents\pilot_2\preprocessing\data/output\vrt_roads_2023.tif
Rasterized output saved to: C:\Users\kriukovv\Documents\pilot_2\preprocessing\data/output\vrt_railways_2023.tif
Rasterized output saved to: C:\Users\kriukovv\Documents\pilot_2\preprocessing\data/output\vrt_waterbodies_2023.tif
Rasterized output saved to: C:\Users\kriukovv\Documents\pilot_2\preprocessing\data/output\vrt_waterways_2023.tif
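
The `-te` and `-tr` arguments above must describe the LULC grid so that the rasterized layers align with it pixel by pixel. As a sketch of how such an extent can be derived from a GDAL geotransform, the hypothetical helper below assumes a north-up raster without rotation; the sample geotransform is invented for illustration.

```python
def extent_from_geotransform(geotransform, width, height):
    """Return (x_min, y_min, x_max, y_max) for a north-up raster."""
    ulx, xres, xskew, uly, yskew, yres = geotransform
    if xskew != 0 or yskew != 0:
        raise ValueError("Rotated rasters are not handled in this sketch")
    x_min = ulx
    x_max = ulx + width * xres
    y_max = uly
    y_min = uly + height * yres  # yres is negative for north-up rasters
    return x_min, y_min, x_max, y_max

# invented geotransform: 25 m pixels, upper-left corner at (100, 500)
gt = (100.0, 25.0, 0.0, 500.0, 0.0, -25.0)
print(extent_from_geotransform(gt, 4, 2))  # (100.0, 450.0, 200.0, 500.0)
```

With GDAL, the inputs would typically come from `GetGeoTransform()`, `RasterXSize` and `RasterYSize` of the opened LULC dataset.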


In [23]:
'''
# REDUNDANT BLOCK - old version with the Python wrapper of GDAL
roads_buf_merged = 'roads_buf_merged.gpkg'
roads_buf_merged = os.path.join(parent_dir,output_dir,roads_buf_merged)

def Rasterize_roads(roads_buf_merged, vrt_roads, cellsize, field_name=True, NoData_value=-9999):
    # input
    inp_driver = ogr.GetDriverByName('GPKG')
    inp_source = inp_driver.Open(roads_buf_merged, 0)
    inp_lyr = inp_source.GetLayer(0)
    inp_srs = inp_lyr.GetSpatialRef()

    # extent
    x_min, x_max, y_min, y_max = inp_lyr.GetExtent()
    x_ncells = int((x_max - x_min) / cellsize)
    y_ncells = int((y_max - y_min) / cellsize)
    # flipping values
    # TODO - redefine cellsize
    # ulx, xres, xskew, uly, yskew, yres  = src.GetGeoTransform()
    print (cellsize)
    print (x_ncells,y_ncells)

    # output 
    out_driver = gdal.GetDriverByName('GTiff')
    if os.path.exists(vrt_roads):
        out_driver.Delete(vrt_roads)
    out_source = out_driver.Create(vrt_roads, x_ncells, y_ncells,1, gdal.GDT_Int16)
    out_source.SetGeoTransform((x_min, cellsize, 0, y_max, 0, -cellsize))
    out_source.SetProjection(inp_srs.ExportToWkt())
    out_lyr = out_source.GetRasterBand(1)
    out_lyr.SetNoDataValue(NoData_value)

    # output extent
    x_min_out, x_max_out = x_min, x_min + (x_ncells * cellsize)
    y_min_out, y_max_out = y_min, y_min + (y_ncells * cellsize)

    if field_name:
    # this will rasterize your shape file according to the specified attribute field
         rasDs = gdal.Rasterize(
               vrt_roads, roads_buf_merged,
               xRes=cellsize, yRes=cellsize,
               outputBounds=[x_min, y_min,x_max, y_max],
               noData=NoData_value,
               outputType=gdal.GDT_Int16,
               attribute='fid', # or whatever your attribute field name is
               allTouched=True)
    else:
    # this will just give burn-in value where there are vector data since no attribute is defined
        rasDs = gdal.Rasterize(
               vrt_roads, roads_buf_merged,
               xRes=cellsize, yRes=cellsize,
               outputBounds=[x_min, y_min,x_max, y_max],
               noData=NoData_value,
               burnValues=2, #to enrich roads, TODO - specify more generic flag from extended LULC map (25 types)
               outputType=gdal.GDT_Int16,
               allTouched=True) # to include pixels that are covered by roads even partly (by default, it must cover at least 50% of pixel area to be rasterized)
        
    rasDs = inp_source = None    
    
    # save and/or close the data sources
    inp_source = None
    out_source = None 

    # return
    return vrt_roads
    
vrt_roads =  os.path.join(parent_dir,output_dir,'vrt_roads.tif')
# input parameter 'vector_roads_buffered' has already been defined
roads_buf_merged = 'roads_buf_merged.gpkg'
roads_buf_merged = os.path.join(parent_dir,output_dir,roads_buf_merged)
# getting cellsize from lulc resolution
cellsize = xres
Rasterize_roads(roads_buf_merged, vrt_roads, cellsize, field_name=False, NoData_value=-9999)

print("Rasterized roads saved to: ", vrt_roads)
vrt_roads = None
'''

##### Merge raster files

All GeoTIFF files are combined into a single updated LULC raster. The input land-use/land-cover raster dataset is overwritten through NumPy array operations rather than through a raster calculator.

In [24]:
lulc_upd = f'lulc_{year}_upd.tif'
lulc_upd = os.path.join(parent_dir,output_dir,lulc_upd)
lulc_upd = os.path.normpath(lulc_upd) # normalise path

# debug: to print the output filename
print (f"Enriched land-use/land-cover dataset(s) will be fetched to {lulc_upd}")

# list of input raster paths and band indices (each feature type is stored in a separate raster)
listraster_uri = [
    (lulc, 1),
    (vrt_waterbodies, 1),
    (vrt_waterways, 1),
    (vrt_railways, 1),
    (vrt_roads, 1)
]

# debug: function to get raster dimensions
def get_raster_dimensions(raster_path):
    dataset = gdal.Open(raster_path)
    if dataset:
        width = dataset.RasterXSize
        height = dataset.RasterYSize
        return width, height
    else:
        raise ValueError(f"Unable to open raster file: {raster_path}")

# debug: print dimensions for each raster to check them against LULC dimension
for raster_path, band in listraster_uri:
    width, height = get_raster_dimensions(raster_path)
    print(f"Dimensions of {os.path.basename(raster_path)}: {width} x {height}")

# REDUNDANT block - nodata value is defined in the next function
"""
# function to get raster nodata value
def get_raster_nodata_value(raster_path):
    dataset = gdal.Open(raster_path)
    if dataset:
        band = dataset.GetRasterBand(1)
        nodata_value = band.GetNoDataValue()
        return nodata_value
    else:
        raise ValueError(f"Unable to open raster file: {raster_path}")

# fetch nodata value from the LULC raster
lulc_nodata = get_raster_nodata_value(lulc)

# debug: print nodata values
print(f"Nodata value of LULC raster: {lulc_nodata}")
"""

Enriched land-use/land-cover dataset(s) will be fetched to C:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\lulc_2023_upd.tif
Dimensions of lulc_ukceh_25m_2023.tif: 4203 x 7861
Dimensions of vrt_waterbodies_2023.tif: 4203 x 7861
Dimensions of vrt_waterways_2023.tif: 4203 x 7861
Dimensions of vrt_railways_2023.tif: 4203 x 7861
Dimensions of vrt_roads_2023.tif: 4203 x 7861
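
Instead of comparing the printed dimensions by eye, the check can be automated. The sketch below is a hypothetical helper (`find_mismatched` is not part of the notebook) that takes a dictionary of `(width, height)` tuples, such as those returned by `get_raster_dimensions`, and lists the rasters that differ from the reference LULC raster.

```python
def find_mismatched(dimensions, reference_name):
    """Return names of rasters whose (width, height) differ from the reference."""
    reference = dimensions[reference_name]
    return [name for name, size in dimensions.items() if size != reference]

# dimensions taken from the output above
dims = {
    "lulc_ukceh_25m_2023.tif": (4203, 7861),
    "vrt_waterbodies_2023.tif": (4203, 7861),
    "vrt_roads_2023.tif": (4203, 7861),
}
print(find_mismatched(dims, "lulc_ukceh_25m_2023.tif"))  # []
```

A non-empty list would indicate that a raster has to be re-rasterized or resampled before merging.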


In [None]:
# function to overwrite values from input raster by multiple rasters
def overwrite_raster(base_raster, *rasters):
    # open the input raster and read it
    base_ds = gdal.Open(base_raster)
    base_band = base_ds.GetRasterBand(1)
    base_data = base_band.ReadAsArray().astype(np.float32)
    
    # get nodata value for the input raster
    nodata_value = base_band.GetNoDataValue()
    if nodata_value is None:  # if nodata value is not defined, set 0 as a default
        nodata_value = 0
    base_data[base_data == nodata_value] = np.nan  # replace nodata value with nan for processing
    print(f"Nodata value of the input raster dataset: {nodata_value}")
    
    # iterate over other rasters
    for raster in rasters:
        ds = gdal.Open(raster)
        band = ds.GetRasterBand(1)
        data = band.ReadAsArray().astype(np.float32)
        current_nodata = band.GetNoDataValue()
        if current_nodata is None:  # handle missing nodata value
            current_nodata = 0
        data[data == current_nodata] = np.nan  # replace nodata with nan for processing
        
        # overwrite values in base_data where current raster has valid data
        mask = ~np.isnan(data)
        base_data[mask] = data[mask]
    
    # after processing, replace NaNs with the nodata value before saving
    base_data[np.isnan(base_data)] = nodata_value
    
    return base_data, base_ds, nodata_value

# define file paths
raster_a = lulc
raster_b = vrt_waterways
raster_c = vrt_waterbodies
raster_d = vrt_roads
raster_e = vrt_railways
output_raster = lulc_upd

# overwrite rasters over input dataset in the following order: waterways, waterbodies, roads, railways (later rasters take precedence)
output_data, output_ds, nodata_value = overwrite_raster(raster_a, raster_b, raster_c, raster_d, raster_e)

# get the driver to write a new GeoTIFF
driver = gdal.GetDriverByName('GTiff')
out_ds = driver.Create(output_raster, output_ds.RasterXSize, output_ds.RasterYSize, 1, gdal.GDT_Byte)

# set geo-transform and projection from the input raster
out_ds.SetGeoTransform(output_ds.GetGeoTransform())
out_ds.SetProjection(output_ds.GetProjection())

# write the data to the output raster
out_band = out_ds.GetRasterBand(1)
out_band.WriteArray(output_data)

# set nodata value 
out_band.SetNoDataValue(nodata_value)

# flush the data and close files
out_band.FlushCache()
out_ds = None  # close the file
output_ds = None  # close the input file

print(f"Output raster saved to {output_raster}")

# REDUNDANT - previous version of raster calculator through pygeoprocessing
''' 
# define output raster
rasterout_uri = lulc_upd
# define math expression to update LULC
def raster_upd(lulc, waterbodies, waterways, railways, roads):
    # use the original LULC as the base
    result = np.copy(lulc)
    # nodata mask to exclude OSM values beyond LULC
    nodata_mask = np.isclose(lulc, lulc_nodata)
    # overwrite LULC with values from OSM data where there are no nodata value (comparison with tolerance for floating-point values)
    result[~nodata_mask & ~np.isclose(waterbodies, lulc_nodata)] = waterbodies[~nodata_mask & ~np.isclose(waterbodies, lulc_nodata)]
    result[~nodata_mask & ~np.isclose(waterways, lulc_nodata)] = waterways[~nodata_mask & ~np.isclose(waterways, lulc_nodata)]
    result[~nodata_mask & ~np.isclose(railways, lulc_nodata)] = railways[~nodata_mask & ~np.isclose(railways, lulc_nodata)]
    result[~nodata_mask & ~np.isclose(roads, lulc_nodata)] = roads[~nodata_mask & ~np.isclose(roads, lulc_nodata)]
    return result

# run raster calculator (pygeoprocessing)
pg.raster_calculator(
            base_raster_path_band_const_list=listraster_uri,
            local_op=raster_upd, 
            target_raster_path=rasterout_uri,
            datatype_target=gdal.GDT_Byte,
            nodata_target=0,
            calc_raster_stats=True)
'''

# REDUNDANT - alternative block of gdal calculator - causing issues (endless computation)
'''   
# GDAL_CALC
# subprocess of gdal calculator becomes too bulky to compute - might cause issues: https://stackoverflow.com/questions/73921278/python-not-giving-same-results-as-gdal-command-line
# takes much more time than the raw gdal_calc
merge_raster = [
    'gdal_calc.py',
    '-A', lulc,
    '-B', vrt_waterbodies, 
    '-C', vrt_waterways,
    '-D', vrt_railways,
    '-E', vrt_roads,
    '--outfile=lulc_upd',
    '--calc="A+B+C+D+E"', # TODO - or 'B*(B!=0) + A*(B==0)',...
    '--NoDataValue', '0',
    '--debug',
]

# execute sum command through subprocess
subprocess.run(merge_raster, check=True, shell=True) # included shell=true
'''

Nodata value of the input raster dataset: 0.0
Output raster saved to C:\Users\kriukovv\Documents\pilot_2\preprocessing\data\output\lulc_2023_upd.tif
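
The overwrite order matters: each later raster replaces the values written by earlier ones wherever it holds valid data. The following sketch reproduces that idea on small in-memory arrays. It is a simplified variant that works on integer arrays with a boolean mask instead of converting to float and NaN; the function name `overlay` and the sample class codes are assumptions made for illustration.

```python
import numpy as np

def overlay(base, *layers, nodata=0):
    """Overwrite base with each layer in turn wherever the layer is not nodata."""
    result = base.copy()
    for layer in layers:
        mask = layer != nodata  # valid pixels of the current layer
        result[mask] = layer[mask]
    return result

base = np.array([[1, 1, 1], [2, 2, 2]], dtype=np.uint8)   # original LULC codes
water = np.array([[0, 7, 0], [0, 7, 0]], dtype=np.uint8)  # rasterized waterway
roads = np.array([[0, 0, 0], [9, 9, 9]], dtype=np.uint8)  # rasterized road

print(overlay(base, water, roads))
# [[1 7 1]
#  [9 9 9]]
```

Where the road crosses the waterway, the road code wins because it is applied last.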


##### Recalculation of impedance (***OPTIONAL***)

This block should be run only if the user intends to estimate habitat connectivity later on. In that case, the raster datasets on landscape impedance (also called resistance or friction) should be updated to reflect the new enriched LULC datasets.

The 'edge effect' is a common phenomenon caused by the construction of human infrastructure and built-up areas (including residential and commercial ones). It mostly results in reduced biodiversity, the spread of invasive species and a decline in specialist species.

Let's extract all LULC values that cause an edge effect and increase landscape impedance. This part of the code reads the CSV column with boolean values.

In [None]:
# create an empty list to store LULC codes which cause negative impact on habitats and edge effect
edge_effect_list = []
# convert datatype of 'edge_effect' column into integer one if needed
impedance['edge_effect'] = impedance['edge_effect'].astype(int)

# iterate through each row in dataframe
for index, row in impedance.iterrows():
    # check if the value in 'edge_effect' column is 1 - user specified that these LULC are affecting habitats
    if row['edge_effect'] == 1:
        # record the value from 'lulc' column
        edge_effect_list.append(row['lulc'])
        print(f"LULC code = {row['lulc']} is causing edge effect.")

print (f"LULC type codes causing edge effect on habitats are: {edge_effect_list}")

In the case study of terrestrial habitats in Catalonia, Spain, the edge effect is caused by urban areas, urbanised areas with a lower density of buildings, roads and railways (stressors). Waterways and water bodies, also fetched from OpenStreetMap data, do not pose the same threat of edge effect and are excluded from this analysis.

The edge effect decays with distance from the stressors, at rates that differ between cases. Various functions can describe this decay, but linear and exponential ones are generally used, for example in the [InVEST model](https://naturalcapitalproject.stanford.edu/invest/habitat-risk-assessment). Currently, a simple exponential decay is implemented. It is planned to let users specify the decay rate themselves through the configuration file, but it is strongly advised to conduct separate research on the values that represent particular stressors and habitats, or to use empirical data or expert knowledge from similar studies. There is no universal solution: each case depends on the species, study scope, land-use/land-cover and stressors!
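
To make the difference between the two decay types concrete, here is a minimal sketch of generic linear and exponential decay functions. It is not the implementation used in this workflow or in InVEST: the function names, the `rate` parameter and the sample values are hypothetical and would have to be replaced with values justified for the particular stressors and habitats.

```python
import math

def linear_decay(distance, max_distance):
    """Weight falling linearly from 1 at the stressor to 0 at max_distance."""
    if distance >= max_distance:
        return 0.0
    return 1.0 - distance / max_distance

def exponential_decay(distance, rate):
    """Weight falling exponentially with distance; rate is in 1/metre."""
    return math.exp(-rate * distance)

# compare both functions at a few distances (metres) from a stressor
for d in (0, 50, 100):
    print(d, linear_decay(d, 100), round(exponential_decay(d, 0.03), 3))
```

The linear function reaches zero at a fixed distance, whereas the exponential one only approaches zero, so a cut-off distance is usually needed in practice.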

In [None]:
'''
# Convert 'edge_effect' column to integer if needed
impedance['edge_effect'] = impedance['edge_effect'].astype(int)
# Create a list of LULC codes causing edge effect
edge_effect_list = impedance.loc[impedance['edge_effect'] == 1, 'lulc'].tolist()
print(f"LULC types causing edge effect on habitats are: {edge_effect_list}")
edge_effect_array = np.array(edge_effect_list, dtype=int)
print(edge_effect_array)

# open LULC
data_source = gdal.Open(lulc)
band = data_source.GetRasterBand(1)
lulc_data = band.ReadAsArray()
nodata_value = band.GetNoDataValue()

print("NoData value:", nodata_value)

band_data_type = band.DataType
print("Data type of the band:", gdal.GetDataTypeName(band_data_type))

# create a mask based on the 'edge_effect' values from the dataframe
mask = np.isin(lulc_data, edge_effect_array)
if np.any(mask):
    print("True values are present in the mask.")
else:
    print("No True values are present in the mask.")

# apply mask to LULC
masked_data = np.where(mask, lulc_data, nodata_value)
print (masked_data)
if np.any(masked_data != 0):
    print("Valid data is present in masked_data.")
else:
    print("masked_data contains only zeros or nodata values.")

# get the geo-transform and projection from the input raster
geotransform = data_source.GetGeoTransform()
projection = data_source.GetProjection()

# create output raster file
output_raster_path = os.path.join(parent_dir,output_dir,'edge_effect.tif')
driver = gdal.GetDriverByName('GTiff')
out_dataset = driver.Create(output_raster_path, data_source.RasterXSize, data_source.RasterYSize, 1, band.DataType)
out_dataset.SetGeoTransform(geotransform)
out_dataset.SetProjection(projection)

# write the masked data to the new raster file
out_band = out_dataset.GetRasterBand(1)
out_band.WriteArray(masked_data)
nodata_value_int = int(nodata_value)
out_band.SetNoDataValue(nodata_value_int)
print (nodata_value_int)

# flush data to disk
# out_band.FlushCache()
# close datasets
# data_source = None
# data_source = None

print("Masked LULC types affecting habitats with edge effect are saved to:", output_raster_path)

'''

Let's remove intermediate files to free resources:

In [None]:
# remove buffered geometries
for gpkg in buffered_geoms:
    try:
        os.remove(gpkg)
        print (f"Intermediate temporary gpkg file {gpkg} with buffered geometries is deleted.")
    except OSError as e:
        print (f"Intermediate temporary gpkg file {gpkg} with buffered geometries cannot be deleted: {e}.")
        
"""
# remove temporary raster data with buffers
for raster in rasters_temp:
    try:
        os.remove(raster)
        print (f"Intermediate temporary raster file {raster} with buffered geometries is deleted.")
    except OSError as e:
        print (f"Intermediate temporary raster file {raster} with buffered geometries cannot be deleted.:{e}.")
"""

# TODO - to implement VRT

To print the time needed to run this code:

In [None]:
# call own module and stop the timer to report the elapsed time
timing.stop()

###### ***Processing issues***

- Python wrappers of GDAL have been replaced with subprocess calls to the native GDAL command-line tools, as these were found to take less time to run.
- The Pygeoprocessing module for raster calculations is not a reliable solution to execute within Docker...(***TODO - to update the description of issue***). The GDAL raster calculator requires the path to the executable file (gdal_calc.py) to be set manually, which may cause issues when running the raster calculator from Docker. Moreover, an unknown issue occurred in which the output raster file was overwritten with '0' values. It was therefore decided to switch to NumPy array calculations instead of the raster calculator.
- Some modules contained in separate Python files must be reloaded to reflect recent changes (otherwise they might cause issues in Jupyter Notebook). This has been experienced with the 'timing' module.
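
Regarding the last point, one common way to pick up changes in an already imported module without restarting the kernel is `importlib.reload`. The sketch below uses the standard-library module `colorsys` purely as a stand-in, since the local `timing` module is not available outside this project.

```python
import importlib
import colorsys  # stand-in for a local module such as 'timing'

# re-execute the module's source so that recent edits take effect
colorsys = importlib.reload(colorsys)
print(colorsys.__name__)  # colorsys
```

In the notebook, the same call would be applied to the project's own module after editing its source file.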