## Access and harmonisation of historical vector data on land-use/land-cover (LULC) - Open Street Map (OSM) data 

This workflow is aimed at semi-automatic extraction and harmonisation of open-access Open Street Map (OSM) data at various timestamps relevant for researches dealing with land-use/land cover data. Overpass Turbo API applied to fetch specific types of OSM features, including human-built infrastructure (roads and railways) and mostly natural features (inland waters - waterways and water bodies). No user authentication is required to access Overpass Turbo API, but it is advised to extract data in a reasonable manner at spatial scales such as Catalonia, Spain or Nothern England (69 925 km<sup>2</sup> and 20 650 km<sup>2</sup>, respectively) to avoid issues related to API throttling.

#### Limitations
Limitations relevant to the further processing of habitat connectivity are multiple, but mostly caused by logical inconsistency and incompleteness of features (omissions) rather than geometrical and topological inconsistency, especially at older timestamps. The common problem of defining keys and values for filtering data is caused by the lack of relevant information for older timestamps (comprehensive OSM quides and manuals are mostly focused on current rules of assigning tags).
- Due to the active development of Open Street Map and increase in popularity since the launch, earlier timestamps from 2010s might lack a significant number of features compared to the current timestamp. For example, the total length of rivers and canals in Catalonia has increased by 27.3% from the end of the 2013 to 2022, while there were no significant nature- or human-driven changes in the water network. Such a change can be partly explained by the increase in the geometric data accuracy, smoothing and curving linear features, mapping linear features within water bodies, but it is also related to the increase in OSM popularity and adding other water features by new users.
- Some roads are missing key "surface". Therefore, paving of roads is not considered even though some narrow, unpaved and non-frequent roads might be crossed regularly by species.
- Some keys are not consistent throughout years (for example, level of roads is missing in data from 2012 and 2013 years).
- Presence of invalid numerical values (for example, width of roads = "3000" or "6 m" which requires additional preprocessing). Unique values by keys can be explored in detail here through [taginfo tool](https://taginfo.openstreetmap.org.uk/keys/width#values).
- Logical inconsistency in defining tags for types of roads throughout the years (can be defined as *'primary'*, and redefined as a *'secondary'* one later).
- Logical inconsistency in defining keys for water reservoirs (can be defined as *'land_use'* types instead of *'water'* types) at older timestamps (2012-2013).
- Changes in geometry types of water features - some rivers have been mapped as ways at older timestamps, but have been complemented with multipolygon features later. Was experienced in Catalonia (2012, 2013, 2017 timestamps).
- According to checks on the validity of vector data during the testing, waterbodies derived from Open Street Map might have invalid geometry, which might insignificantly complicate the further processing (0.01-0.04% from the total number of features, depending on the bounding box and timestamp).
- During the peak load on Overpass servers, bulky queries might be refused with error: *"Remote connection closed unexpectedly"* (faced with queries on roads). Quite high limit rates on memory consumption and query time are defined in this workflow by default but it is advised to fetch data from OSM at local or regional scale to prevent throttling issues.
- Queries on historical data with specified timestamp tend to be more time-consuming than the same ones without timestamps.

#### Initial setup
Let's import all libraries needed:

In [6]:
import json
import requests
import sys
import geopandas as gpd
import pandas as pd
import os
import tempfile
from shapely.geometry import Point, LineString, MultiLineString, Polygon, MultiPolygon
from osgeo import gdal, osr
import pyproj

# auxiliary libraries
import time
import warnings
import yaml
import subprocess

import timing # own module
timing.start()

Let's specify the working directory:

In [5]:
parent_dir = os.getcwd()
# child_dir = 'data'

Variables are defined in the configuration file config.yaml. Let's load the configuration:

In [6]:
# load configuration from YAML file
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)
    
"""
print(config)
"""

'\nprint(config)\n'

#### Building Overpass Turbo API queries and access through Overpass endpoint
Now, we can define the preprocessor class to fetch Open Street Map data:

Let's define Overpass query for roads. Queries below extract ways and correspoding nodes automatically as it is crucial to record geometries of spatial features. It is also important to define a maximum size of memory consumption, otherwise issue of memory runout might arise for big queries.

Four main categories of OSM features are fetched:

1. Queries for roads (divided by two and merged then as a list, because one large query leads to closing connection by Overpas).
2. Ouery for railways
3. Query for water lines (including natural and artificial ones)
4. Query for water bodies (also includes mistagged features with deprecated definitions to include them for older timestamps when rules for assigning tags were different (a query with actual tags is saved in comments)).

#### Output formats
There are two main options of output data formats - json and csv. Json is more bulky to transform to other spatial data formats and less flexible than csv, but another common library helps to quickly transform OSM jsons to geojsons easily - [osmtogeojson](https://wiki.openstreetmap.org/wiki/Overpass_turbo/GeoJSON). There are also other export options in the [official documentation](https://dev.overpass-api.de/output_formats.html): OSM XML, HTMPL popups and custom formats. However, these solutions are not stable.

UPD: CSV is fetched more quickly and it is suitable with larger sizes of data fetched (including multiple 'tertiary' roads), but coordinates from assigned nodes are glitched (order of nodes is mixed up). It might be possible to fix these issues related to the order of coordinates, but this workflow is not aimed at the optimisation of convertation since the stable solution of transforming json exists. Therefore, JSON has been chosen over CSV.

To use OSM JSON response further in preprocessing, [osmtogeojson](https://github.com/tyrasd/osmtogeojsonlibrary) is used (NPM is pre-installed within the docker container to get an access to osmtogeojson). At this step, non-suitable geometries (points and polygons for roads, railways and waterways) are filtered out. 

In [49]:
import yaml

class Osm_PreProcessor():
    """
    A class to enrich raster data with OSM data
    """
    #TODO 20/09/2024  only do 1 year for now. Later, we can extend it to multiple years
    def __init__(self, config_path:str, output_dir:str) -> None:
        self.config = self.load_yaml(config_path)
        self.output_dir = output_dir

        # make output directory if it does not exist
        if not os.path.exists(output_dir):
            os.makedirs(output_dir)
        
        ## read year (Specify the input raster data)
        self.years = self.config.get('year', None)
        if self.years is None:
            warnings.warn("Year variable is null or not found in the configuration file... \n Defaulting to 2022")
            self.years = [2022]
            self.date = f"{self.years}-12-31T23:59:59Z" # to fetch it as a last second of the year
        elif isinstance(self.years, int):
            # cast to list
            self.years = [self.years]
            self.date = f"{self.years}-12-31T23:59:59Z"
        else:
            # cast to list
            self.years = [int(year) for year in self.years]
            self.date = [f"{year}-12-31T23:59:59Z" for year in self.years]

        print(f"OSM data is to be retrieved for {self.years} years.")
        print ("-" * 30)

        # find directories from config file
        input_dir = self.config.get('input_dir')
        lulc_dir = self.config.get('lulc_dir')

        ## define the input raster dataset that should be enriched with OSM data
        lulc_template = self.config.get('lulc')

        # substitute the year into the lulc string from config file
        # lulcs = [lulc_template.format(year=year) for year in self.years]
        lulcs = {os.path.normpath(os.path.join(lulc_dir,lulc_template.format(year=year))):year for year in self.years}
        
        [print(f"Input rasters to be used for processing is {lulc}, {year}.") for lulc,year in lulcs.items()]
        print ("-" * 30)

        #NOTE: for now just work with one raster file
        #TODO: loop over all lulc files.
        lulc = list(lulcs.keys())[0]
        year = list(lulcs.values())[0]
        self.year = list(lulcs.values())[0]
        x_min_cart, x_max_cart, y_min_cart, y_max_cart, epsg_code = self.get_raster_properties(lulc)
        self.bbox = self.reproject_and_get_bbox(x_min_cart, x_max_cart, y_min_cart, y_max_cart, epsg_code)
        
        # to check the bounding box of input raster
        print(self.bbox)

    def reproject_and_get_bbox(self, x_min_cart:float, x_max_cart:float, y_min_cart:float, y_max_cart:float, epsg_code:int) -> tuple:

        ## Reproject the bounding box of input dataset as Overpass accepts only coordinates in geographical coordinates (WGS 84):
        # defining function to transform
        transform_cart_to_geog = pyproj.Transformer.from_crs(
            pyproj.CRS(f'EPSG:{epsg_code}'),  # applying EPSG code of input raster dataset
            pyproj.CRS('EPSG:4326')   # WGS84 geographic which should be used in OSM APIs
        )

        # running function
        x_min, y_min = transform_cart_to_geog.transform(x_min_cart, y_min_cart)
        x_max, y_max = transform_cart_to_geog.transform(x_max_cart, y_max_cart)

        # print the Cartesian coordinates before transformation
        print("Before Transformation:")
        print("x_min_cart:", x_min_cart)
        print("x_max_cart:", x_max_cart)
        print("y_min_cart:", y_min_cart)
        print("y_max_cart:", y_max_cart)

        # print the transformed geographical coordinates
        print("After Transformation:")
        print("x_min:", x_min)
        print("x_max:", x_max)
        print("y_min:", y_min)
        print("y_max:", y_max)
        bbox=f"{x_min},{y_min},{x_max},{y_max}"
        
        return bbox

    def get_raster_properties(self,lulc:any) -> tuple:
        """
        Get the properties of the raster file
        """
        ## Load the raster file, get its extent, cell size and projection:
        raster = gdal.Open(lulc)
        if raster is not None:
            inp_lyr = raster.GetRasterBand(1)  # get the first band
            x_min_cart, x_max_cart, y_min_cart, y_max_cart = raster.GetGeoTransform()[0], raster.GetGeoTransform()[0] + raster.RasterXSize * raster.GetGeoTransform()[1], raster.GetGeoTransform()[3] + raster.RasterYSize * raster.GetGeoTransform()[5], raster.GetGeoTransform()[3]
            '''
            cellsize = raster.GetGeoTransform()[1]  # Assuming the cell size is constant in both x and y directions
            x_ncells = int((x_max - x_min) / cellsize)
            y_ncells = int((y_max - y_min) / cellsize)
            '''
            print ("Input raster has been successfully found.")

            # extract projection system of input raster file
            info = gdal.Info(raster, format='json')
            if 'coordinateSystem' in info and 'wkt' in info['coordinateSystem']:
                srs = osr.SpatialReference(wkt=info['coordinateSystem']['wkt'])
                if srs.IsProjected():
                    epsg_code = srs.GetAttrValue("AUTHORITY", 1)
                    print(f"Projected coordinate system of the input raster is EPSG:{epsg_code}")
                else:
                    print("Input raster does not have a projected coordinate system.")
            else:
                print("No projection information found in the input raster.")
            # close the raster to keep memory empty
            raster = None
        else:
            print ("Input raster is missing.")

        return x_min_cart, x_max_cart, y_min_cart, y_max_cart, epsg_code



    def load_yaml(self, path:str) -> dict:
        """
        Load a yaml file from the given path to a dictionary

        Args:
            path (str): path to the yaml file

        Returns:
            dict: dictionary containing the yaml file content
        """
        with open(path , 'r') as file:
            return yaml.safe_load(file)
    

    def fetch_osm_data(self,queries:dict, year:int , overpass_url:str = "https://overpass-api.de/api/interpreter", ) -> list:
        intermediate_jsons = []

        # iterate over the queries and execute them
        for query_name, query in queries.items():
            response = requests.get(overpass_url, params={'data': query})
            print(response)
                
            # if response is successful
            if response.status_code == 200:
                print(f"Query to fetch OSM data for {query_name} in the {year} year has been successful.")
                data = response.json()
                
                # Extract elements from data
                elements = data.get('elements', [])
                
                # Print the number of elements
                print(f"Number of elements in {query_name} in the {year} year: {len(elements)}")
                
                # Print the first 3 elements to verify response
                for i, element in enumerate(elements[:3]):
                    print(f"Element {i+1}:")
                    print(json.dumps(element, indent=2))
                
                # Save the JSON data to a file
                output_file = os.path.join(self.output_dir, f"{query_name}_{year}.json")
                with open(output_file, 'w', encoding='utf-8') as f:
                    json.dump(data, f, ensure_ascii=False, indent=4)
                print(f"Data has been saved to {output_file}")
                print ("-" * 30)

                # Add the output file name to the list
                intermediate_jsons.append(output_file)
                
            else:
                print(f"Error: {response.status_code} for {query_name} in the {year} year")
                print(response.text)
                print ("-" * 30)

        return intermediate_jsons

    def overpass_query_builder(self, year:int, bbox:str) -> dict[str, str]:
        """
        A function to build the query for Overpass API
        """
        #TODO: The data limit is 1GB. We should try split the query into smaller parts (bounding boxes) and run them separately.
        #NOTE: the issue with the above is that you might get IP blocked by the server. So, we need to be careful with this.
        query_roads = f"""
        [out:json]
        [maxsize:1073741824]
        [timeout:9000]
        [date:"{year}-12-31T23:59:59Z"]
        [bbox:{bbox}];
        way["highway"~"(motorway|trunk|primary|secondary|tertiary)"];
        /* also includes 'motorway_link',  'trunk_link' etc. because they also restrict habitat connectivity */
        (._;>;);
        out body;
        """
        # '{' characters must be doubled in Python f-string (except for {bbox} because it is a variable)
        # to include statement on paved surfaces use: ["surface"~"(paved|asphalt|concrete|paving_stones|sett|unhewn_cobblestone|cobblestone|bricks|metal|wood)"];
        # it is important to include only paved roads it is important to list all values above, not only 'paved'*/
        # BUT! : 'paved' tag seems to be missing in a lot of features at timestamps from 2010s
        # 'residential' roads are not fetched as these areas are already identified in land-use/land-cover data as urban or residential ones
        # "~" extracts all tags containing this text, for example 'motorway_link'
        
        query_railways = f"""
        [out:json]
        [maxsize:1073741824]
        [timeout:9000]
        [date:"{year}-12-31T23:59:59Z"]
        [bbox:{bbox}];
        way["railway"~"(rail|light_rail|narrow_gauge|tram|preserved)"];
        (._;>;);
        out;
        """
        
        # way["railway"];  # to include features if 'railway' key is found (any value)
        # to include features with values filtered by key. 
        # This statement also includes 'monorail' which are not obstacles for species migration, but these features are extremely rare. Therefore, it was decided not to overcomplicate the query.
        # 31/07/2024 - added filtering on 'preserved' railway during the verification by UKCEH LULC dataset (some railways are marked as 'preserved at older timestamps and 'rail' in newer ones).
    
        query_waterways = f"""
        [out:json]
        [maxsize:1073741824]
        [timeout:9000]
        [date:"{year}-12-31T23:59:59Z"]
        [bbox:{bbox}];
        (
        way["waterway"~"^(river|canal|flowline|tidal_channel)$"];
        way["water"~"^(river|canal)$"];
        );
        /* ^ and $ symbols to exclude 'riverbank' and 'derelict_canal'*/
        /*UPD - second line is added in case if some older features are missing 'way' tag*/
        (._;>;);
        out;
        """

        # Query to bring water features with deprecated tags
        query_waterbodies = f"""
        [out:json]
        [maxsize:1073741824]
        [timeout:9000]
        [date:"{year}-12-31T23:59:59Z"]
        [bbox:{bbox}];
        (
        nwr["natural"="water"];
        nwr["water"~"^(cenote|lagoon|lake|oxbow|rapids|river|stream|stream_pool|canal|harbour|pond|reservoir|wastewater|tidal|natural)$"];
        nwr["landuse"="reservoir"];
        nwr["waterway"="riverbank"];
        /*UPD - second filter was added to catch other water features at all timestamps*/
        /*UPD - third and fourth filters were added to catch other water features at older timestamps*/
        /*it is more reliable to query nodes, ways and relations altogether ('nwr') to fetch the complete polygon spatial features*/
        );
        (._;>;);
        out;
        """
        
        # to include small waterways use way["waterway"~"(^river$|^canal$|flowline|tidal_channel|stream|ditch|drain)"]


        # merge queries into dictonary
        # to include all queries
        return {"roads":query_roads, "railways":query_railways, "waterways":query_waterways, "waterbodies":query_waterbodies}
    

    def convert_to_geojson(self, queries:dict[str,str]):
        for year in self.years:
            for query_name, query in queries.items():
                input_file = os.path.join(self.output_dir, f"{query_name}_{year}.json")
                output_file = os.path.join(self.output_dir, f"{query_name}_{year}.geojson")
                result = subprocess.run(['osmtogeojson', input_file], capture_output=True, text=True)
                with open(output_file, 'w', encoding='utf-8') as f:
                    f.write(result.stdout)

    def fix_invalid_geometries(self, queries:dict[str,str], year:int ,overwrite_original:bool):
        """
        A function to fix invalid geometries in the GeoJSON files

        Args:
            queries (dict): a dictionary of queries
            year (int): the year of the data
            overwrite_original (bool): whether to overwrite the original GeoJSON files

        Returns:
            list: a list of fixed GeoJSON files
        """
        geojson_files=[]

        # iterate over the queries and define outputs
        for query_name, query in queries.items():
            geojson_file = os.path.join(self.output_dir, f"{query_name}_{year}.geojson")

            # check if the non-zero GeoJSON files exist
            if os.path.exists(geojson_file) and os.path.getsize(geojson_file) > 0:
                print(f"Conversion to GeoJSON for {query_name} in the {year} year was successful.")
                
                # read the GeoJSONs
                with open(geojson_file, 'r', encoding='utf-8') as f:
                    geojson_data = json.load(f)
                    features = geojson_data.get('features', [])
                    print(f"Total features: {len(features)}")
                    
                # determine the geometries to filter based on query_name
                # for roads, railways and waterways extract only lines and multilines
                if query_name in ("roads", "railways", "waterways"):
                    geometry_types = ['LineString', 'MultiLineString']
                    # filter based on geometry types and level - it should be 0 (or null)
                    filtered_features = [
                        feature for feature in geojson_data.get('features', [])
                        if feature['geometry']['type'] in geometry_types
                        and (feature['properties'].get('level') in (None, 0)) # filtering by ground level of infrastructure
                    ]
                # for waterbodies extract only polygons and multipolygons
                elif query_name == "waterbodies":
                    geometry_types = ['Polygon', 'MultiPolygon']
                    # filter based on geometry types only
                    filtered_features = [
                        feature for feature in geojson_data.get('features', [])
                        if feature['geometry']['type'] in geometry_types
                    ]
                # for everything else extract everything that can be found
                else:
                    filtered_features = [
                        feature for feature in geojson_data.get('features', [])
                    ]

                # cast all property keys to lowercase
                filtered_features = [
                    {
                        k: {property_key.lower(): property_value for property_key, property_value in v.items()} if k == "properties" else v
                        for k, v in feature.items()
                    }
                    for feature in filtered_features
                ]
                # create a new GeoJSON structure with filtered features
                filtered_geojson_data = {
                    "type": "FeatureCollection",
                    "features": filtered_features
                }

                print(f"Total features after filtering {query_name} in the {year} year: {len(filtered_features)}")
                print ("-" *30)
                
                # create new file 
                if overwrite_original == False:
                    geojson_file = os.path.join(self.output_dir, f"{query_name}_{year}_filtered.geojson")
                
                # overwrite the original GeoJSON file with the filtered one
                with open(geojson_file, 'w', encoding='utf-8') as f:
                    json.dump(filtered_geojson_data, f, ensure_ascii=False, indent=4)

                # write filenames to the list with intermediate geojsons
                geojson_files.append(geojson_file)
            
            else:
                print(f"Conversion to GeoJSON for {query_name} in the {year} year failed.")
                print ("-" *30)

osm = Osm_PreProcessor('config.yaml',"./data/input/osm/")
queries = osm.overpass_query_builder(osm.year, osm.bbox) #TODO check what lulc[0] is in year (currently, osm.year)
# # TODO loop over all lulc files
osm.fetch_osm_data(queries=queries, year=osm.year)
osm.convert_to_geojson(queries=queries)
osm.fix_invalid_geometries(queries,osm.year,False)

OSM data is to be retrieved for [2017] years.
------------------------------
Input rasters to be used for processing is data/input/lulc/lulc_esa_2017.tif, 2017.
------------------------------
Input raster has been successfully found.
Projected coordinate system of the input raster is EPSG:32630
Before Transformation:
x_min_cart: 538670.0
x_max_cart: 610530.0
y_min_cart: 5883540.0
y_max_cart: 5959790.0
After Transformation:
x_min: 53.099904191450165
x_max: 53.77496615869372
y_min: -2.4224482920540216
y_max: -1.322754831494132
53.099904191450165,-2.4224482920540216,53.77496615869372,-1.322754831494132
<Response [200]>
Query to fetch OSM data for roads in the 2017 year has been successful.
Number of elements in roads in the 2017 year: 222701
Element 1:
{
  "type": "node",
  "id": 154915,
  "lat": 53.6526029,
  "lon": -1.5279476
}
Element 2:
{
  "type": "node",
  "id": 154916,
  "lat": 53.650261,
  "lon": -1.5286385
}
Element 3:
{
  "type": "node",
  "id": 154917,
  "lat": 53.6487681,
  "l

#### Postprocessing outputs
Then, geojson datasets should be converted into geopackages since they are more optimised for further processing. *The quickest way to perform the conversion is ogr2ogr library executed as a shell script (currently executed through subprocess).*

We also reprojected OSM data to align with input raster dataset's coordinate system.

*Computation time of this block doesn't differ from the same block executed through the shell script.*

*Additional step is the translation of 'width' column into decimal one since geojson recognizes this column as a text. Currently, it is performed in further processing as a SQL statement for buffering roads by width from this column.*

In [50]:
# import os

# def convert_geojson_to_gpkg(input_geojson, output_gpkg, target_epsg=4326):
#     # run function as a shell script through subprocess library
#     result = subprocess.run(['ogr2ogr', '-f', 'GPKG', '-t_srs', f'EPSG:{target_epsg}', output_gpkg, input_geojson], 
#                             check=True, 
#                             capture_output=True, 
#                             text=True)
    
#     print(f"Converted and modified to GeoPackage: {output_gpkg}")
#     #check error code
#     if len(result.stderr) > 0:
#         print(f"Warnings or errors:\n{result.stderr}")
    
                    
# # Example usage
# input_geojson = os.path.join('data/input/osm', 'waterbodies_2018_filtered.geojson')
# output_gpkg = os.path.join('data/input/osm/gpkg_temp', 'test.gpkg')
# convert_geojson_to_gpkg(input_geojson, output_gpkg)

In [51]:
import shutil

from osgeo import ogr

class OsmGeojson_to_gpkg():
    def __init__(self, input_dir:str,output_dir:str,target_epsg:str) -> None:
        self.input_dir = input_dir
        self.output_dir = output_dir
        self.target_epsg = target_epsg
        # replace .geojson with .gpkg for each file
        self.gpkg_files = [file.replace('.geojson', '.gpkg') for file in self.convert_geojson_to_gpkg()]
        print(self.gpkg_files)

    def convert_geojson_to_gpkg(self, file_ending:str='filtered.geojson') -> list:
        # create output directory if it does not exist
        os.makedirs(self.output_dir, exist_ok=True)
        # loop through all geojson files in directory
        geojson_files = []
        for filename in os.listdir(self.input_dir):
            if filename.endswith(file_ending):
                geojson_file = os.path.join(self.input_dir, filename)
                geopackage_file = os.path.join(self.output_dir, filename.replace('.geojson', '.gpkg'))
            
                try:
                    # run function as a shell script through subprocess library
                    result = subprocess.run(['ogr2ogr', '-f', 'GPKG', '-t_srs', f'EPSG:{self.target_epsg}', geopackage_file, geojson_file], 
                                            check=True, 
                                            capture_output=True, 
                                            text=True)
                    
                    print(f"Converted and modified to GeoPackage: {filename}")

                    #check error code
                    if len(result.stderr) > 0:
                        print(f"Warnings or errors:\n{result.stderr}")

                    # append filenames with a list
                    geojson_files.append(filename)

                except subprocess.CalledProcessError as e:
                    print(f"Error processing {filename}: {e}")
                except Exception as e:
                    print(f"Unexpected error with {filename}: {e}")

        # return the list of GeoJSON files
        return geojson_files
    
    def merge_gpkg_files(self, output_file:str, year:int):
        print(self.gpkg_files)
        # initialize the GeoPackage using the first GeoPackage file
        first_gpkg_file = self.gpkg_files[0]
        layer_name = first_gpkg_file.split(f"_{osm.year}")[0] # replaced with a year variable from OSM_PreProcessor
        first_gpkg_file = os.path.join(self.output_dir, first_gpkg_file)

        subprocess.run(['ogr2ogr', '-f', 'GPKG', output_file, first_gpkg_file, # output and input files
                '-s_srs', f'EPSG:{self.target_epsg}',  # set source CRS
                '-t_srs', f'EPSG:{self.target_epsg}', # set target CRS
                '-nln', layer_name # specify name of the layer
                ], check=True, capture_output=True, text=True) # to show log
        print(f"Initialized merged GeoPackage with CRS EPSG:{self.target_epsg} from {layer_name}.")

        for gpkg_file in self.gpkg_files[1:]:  # skip the first file because it's already added
            layer_name = gpkg_file.split(f"_{osm.year}")[0]
            gpkg_file = os.path.join(self.output_dir, gpkg_file)
            # run appending separate geopackages to empty merged geopackage (update if layers were previously written)
            try:
                result = subprocess.run(['ogr2ogr', '-f', 'GPKG', output_file, '-s_srs', f'EPSG:{self.target_epsg}', # for input file
                                                '-t_srs', f'EPSG:{self.target_epsg}', # for output file
                                                '-nln', layer_name, '-update', '-append', gpkg_file],
                                                check=True, 
                                                capture_output=True, 
                                                text=True)
                
                print(f"Added layer {layer_name} from {gpkg_file} to {output_file}")
                if len(result.stderr) > 0:
                    print(f"Warnings or errors:\n{result.stderr}")

            except subprocess.CalledProcessError as e:
                print(f"Error adding {layer_name}: {e.stderr}")
            except Exception as e:
                print(f"Unexpected error with {layer_name}: {e}")

    def fix_geometries_in_gpkg(self, input_gpkg:str, fixed_gpkg:str=None, overwrite_original:bool=False):
        # if fixed_gpkg is not specified, overwrite the input_gpkg
        if fixed_gpkg is None:
            fixed_gpkg = input_gpkg
        else:
            shutil.copyfile(input_gpkg, fixed_gpkg) # to copy file to a new one

        # open the output GeoPackage for editing
        data_source = ogr.Open(fixed_gpkg, update=1)

        for i in range(data_source.GetLayerCount()):
            layer = data_source.GetLayerByIndex(i)
            layer_name = layer.GetName()
            feature_to_fix_count = 0
            fixed_feature_count = 0
            invalid_feature_count = 0

            # iterate over all features in the layer
            for feature in layer:
                geometry = feature.GetGeometryRef()
                if not geometry.IsValid():
                    feature_to_fix_count += 1 # increment the number of features to be fixed
                    # attempt to fix the geometry
                    fixed_geometry = geometry.MakeValid()

                    if fixed_geometry.IsValid():
                        # replace the geometry with the fixed one
                        feature.SetGeometry(fixed_geometry)
                        layer.SetFeature(feature)  # save the updated feature back to the layer
                        print(f"Fixed invalid geometry in layer '{layer_name}', feature ID: {feature.GetFID()}")
                        fixed_feature_count += 1 # increment the number of fixed features
                    else:
                        print(f"Could not fix geometry in layer '{layer_name}', feature ID: {feature.GetFID()}")
                        invalid_feature_count += 1 # increment the number of features that cannot be fixed

        # estin
        
        if feature_to_fix_count == 0:
            print (f"All geometries of features in the layer '{layer_name}' of the output vector are valid.")
            print("-" * 40)
        else:
            print(f"Layer '{layer_name}: {fixed_feature_count} geometries fixed.") 
            print(f"Layer '{layer_name}': {invalid_feature_count} geometries could not be fixed.")
            print("-" * 40)

        # close the data source
        del data_source

        # remove the original GeoPackage if it should be overwritten
        if overwrite_original:
            shutil.copyfile(fixed_gpkg, input_gpkg)
            print(f"Fixed geometries and saved to {input_gpkg}.")
            os.remove(fixed_gpkg)
        else:
            print(f"Fixed geometries and saved to {fixed_gpkg}.")

    
    def delete_temp_files(self):
        # delete all GeoJSON files
        for file in os.listdir(self.input_dir):
            if file.endswith('.geojson'):
                os.remove(os.path.join(self.input_dir, file))
        print("Deleted all intermediate GeoJSON files.")
        print("-" * 40)

        for file in os.listdir(self.output_dir):
            if file.startswith('osm_merged'):
                continue
            else:
                os.remove(os.path.join(self.output_dir, file))

# Merge geopackage and fix geometries

In [52]:
# run the conversion and modification 
# TODO loop over years
input_dir = os.path.join(os.getcwd(), 'data/input/osm')
output_dir = os.path.join(input_dir, 'gpkg_temp')
ogtg = OsmGeojson_to_gpkg(input_dir,output_dir,target_epsg=4326)
output_file = os.path.join(output_dir, f'osm_merged_{osm.year}.gpkg') # osm.year added from OSM_PreProcessor class 
fixed_gpkg = os.path.join(output_dir, f'osm_merged_{osm.year}_fixed.gpkg')
ogtg.merge_gpkg_files(output_file, {osm.year})
ogtg.fix_geometries_in_gpkg(output_file, fixed_gpkg, overwrite_original=False)
ogtg.delete_temp_files()
#NOTE remember to move file to vector_dir for next notebook
shutil.move(fixed_gpkg, os.path.join(os.getcwd(), 'data/input/vector/osm_merged_2018.gpkg'))

2017
Converted and modified to GeoPackage: railways_2017_filtered.geojson
Converted and modified to GeoPackage: roads_2017_filtered.geojson
Converted and modified to GeoPackage: waterbodies_2017_filtered.geojson
Converted and modified to GeoPackage: waterways_2017_filtered.geojson
['railways_2017_filtered.gpkg', 'roads_2017_filtered.gpkg', 'waterbodies_2017_filtered.gpkg', 'waterways_2017_filtered.gpkg']
['railways_2017_filtered.gpkg', 'roads_2017_filtered.gpkg', 'waterbodies_2017_filtered.gpkg', 'waterways_2017_filtered.gpkg']
Initialized merged GeoPackage with CRS EPSG:4326 from railways.
Added layer roads from /data/data/input/osm/gpkg_temp/roads_2017_filtered.gpkg to /data/data/input/osm/gpkg_temp/osm_merged_2017.gpkg
Added layer waterbodies from /data/data/input/osm/gpkg_temp/waterbodies_2017_filtered.gpkg to /data/data/input/osm/gpkg_temp/osm_merged_2017.gpkg
Added layer waterways from /data/data/input/osm/gpkg_temp/waterways_2017_filtered.gpkg to /data/data/input/osm/gpkg_temp/o

'/data/data/input/vector/osm_merged_2018.gpkg'

In [53]:
# TODO write all data to data folder and give options to remove temp files
# organise folders into json_, geosjson_, gpkg_ and temp folders
# remove filtered_geojson files after conversion to gpkg
# by default keep json files only for each land feature (roads, railways, waterways, waterbodies)

Let's find out how much time does it take to extract and optimise this dataset:

In [7]:
timing.stop()

Elapsed time: 5.61 seconds


#### ***Processing issues***

- a few other ways to access historical OSM data were explored as well:
Nominatim API
Ohsome API
Geofabrik archives (do not provide automatic or semi-automatic access)
- ogr2ogr might interpret X and Y axis in a different order (flipping coordinates). This issue was detected for EPSG:27700 while merging separate geopackages into one even though no reprojection was specified  in flags. The solution is to define axis order explicitly through '-s_srs' (for the input dataset) and '-t_srs' (for the output dataset) flags as the same EPSG code.
- additional step is the translation of 'width' column into decimal one since geojson recognizes this column as a text. Currently, it is performed in further processing (enrichment of raster land-use/land-cover data) as a SQL statement for buffering roads by width from this column.