## Access and harmonisation of historical Open Street Map (OSM) data on land-use/land-cover (LULC)

This workflow is aimed at the semi-automatic extraction and harmonisation of open-access Open Street Map (OSM) data at various timestamps, which is relevant for research dealing with land-use/land-cover data. The Overpass API (the service behind Overpass Turbo) is used to fetch specific types of OSM features, including human-built infrastructure (roads and railways) and mostly natural features (inland waters: waterways and water bodies). No user authentication is required to access the Overpass API, but it is advised to keep extractions to reasonable spatial scales, such as Catalonia, Spain or Northern England (69 925 km<sup>2</sup> and 20 650 km<sup>2</sup>, respectively).

##### Workflow limitations
Limitations relevant to the further processing of habitat connectivity (for more details see issues.docx) are multiple, but they are mostly caused by logical inconsistency and incompleteness of features (omissions) rather than by geometrical and topological inconsistency, especially at older timestamps. Defining keys and values for filtering data is difficult because relevant information for older timestamps is scarce (comprehensive OSM guides and manuals mostly focus on the current tagging rules).
- Due to the active development of Open Street Map and its growing popularity since the launch, earlier timestamps from the 2010s might lack a significant number of features compared to the current timestamp. For example, the total length of rivers and canals in Catalonia increased by 27.3% from the end of 2013 to 2022, while there were no significant nature- or human-driven changes in the water network. Such a change can be partly explained by improved geometric accuracy (smoothing and curving of linear features, mapping of linear features within water bodies), but it is also related to the growing popularity of OSM and to new users adding water features.
- Some roads are missing the key "surface". Therefore, the paving of roads is not considered, even though some narrow, unpaved and rarely used roads might be crossed regularly by species.
- Some keys are not consistent across years (for example, the level of roads is missing in the data from 2012 and 2013).
- Invalid numerical values are present (for example, width of roads = "3000" or "6 m", which requires additional preprocessing). Unique values by key can be explored in detail here: https://taginfo.openstreetmap.org.uk/keys/width#values
- Tags for road types are logically inconsistent across years (a road can be defined as primary and later redefined as secondary).
- Keys for water reservoirs are logically inconsistent at older timestamps (2012-2013): reservoirs can be tagged with 'landuse' instead of 'water'.
- Geometry types of water features change over time: some rivers were mapped as ways at older timestamps and were complemented with multipolygon features later. This was observed in Catalonia (2012, 2013 and 2017 timestamps).
- According to validity checks of vector data during testing, water bodies derived from Open Street Map can have invalid geometries (0.01-0.04% of the total number of features, depending on the bounding box and timestamp), which might slightly complicate further processing.
- During peak load on Overpass servers, bulky queries might be refused with the error "Remote connection closed unexpectedly" (encountered with queries on roads). This workflow sets quite high limits on memory consumption and query time by default, but it is advised to fetch data from OSM at the local or regional scale to prevent throttling issues.
- Queries on historical data with a specified timestamp tend to be more time-consuming than the same queries without timestamps.
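
Refused connections such as the one mentioned in the list above are commonly handled by retrying the request after an increasing delay. The sketch below shows one possible way to do this. It is not part of the original workflow: `fetch_with_retry`, its parameters and the injected `send` callable are illustrative names, and the delays are arbitrary assumptions rather than values recommended by Overpass.

```python
import time


def fetch_with_retry(send, max_attempts=4, base_delay=30.0, sleep=time.sleep):
    """Call send() until it succeeds, waiting exponentially longer after each failure.

    send  -- hypothetical zero-argument callable that performs one request and
             either returns the response or raises ConnectionError
    sleep -- injectable so that the waiting logic can be tried without real delays
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except ConnectionError as error:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            delay = base_delay * 2 ** (attempt - 1)  # 30 s, 60 s, 120 s, ...
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f} s")
            sleep(delay)


# Demonstration with a fake request that fails twice before succeeding
calls = {"n": 0}
waited = []


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Remote connection closed unexpectedly")
    return "ok"


result = fetch_with_retry(flaky_request, sleep=waited.append)
print(result, waited)
```

When the request is sent with `requests`, the exception to catch would be `requests.exceptions.ConnectionError` rather than the built-in `ConnectionError` used in this sketch.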

##### Initial setup
Let's import all libraries needed:

In [1]:
import overpy
import json
import requests
import sys
import geopandas as gpd
import pandas as pd
import os
import tempfile
from shapely.geometry import Point, LineString, MultiLineString, Polygon, MultiPolygon
from osgeo import gdal, osr
import pyproj

# auxiliary libraries
import time
import warnings
import yaml

Let's start measuring the time needed to run this code:

In [2]:
# TODO - import from own module
start_time = time.time()

Overpy is installed from the command prompt with "pip install overpy" (and requests with "pip install requests") after activating the conda environment. Let's specify directories:

In [3]:
parent_dir = os.getcwd()
# child_dir = 'data'

Variables are defined in the configuration file config.yaml. Let's load the configuration:

In [4]:
# load configuration from YAML file
with open('config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# uncomment to inspect the loaded configuration
# print(config)

Specify the year for which OSM data should be retrieved:

In [5]:
# let's define years we want to retrieve OSM data from

# fetch year from the configuration file
year = config.get('year', 2022)  # default to 2022 if not found in configuration file
if year is None or 'year' not in config : # both conditions should be considered
    warnings.warn("Year variable is not found in the configuration file.")

# create the timestamp from year
date = f"{year}-12-31T23:59:59Z" # to fetch it as a last second of the year

print(f"OSM data is to be retrieved for {year} year.")
print ("-" * 30)

# TODO - loop over several years, for example:
# years = [2018, 2020, 2023]
# for year in years:
#     date = f"{year}-12-31T23:59:59Z" # to fetch it as a last second of the year

OSM data is to be retrieved for 2018 year.
------------------------------
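
The loop over years that is still marked as a TODO above could start from a small helper that builds one timestamp per year. This is only a sketch: `year_end_timestamps` is a hypothetical name, and the list of years is the one mentioned in the TODO.

```python
def year_end_timestamps(years):
    """Map each year to the timestamp of its last second, as used in the [date:...] setting."""
    return {yr: f"{yr}-12-31T23:59:59Z" for yr in years}


timestamps = year_end_timestamps([2018, 2020, 2023])
for yr, stamp in timestamps.items():
    print(f"{yr}: {stamp}")
```

Each timestamp could then be substituted into the queries in turn, so that every year produces its own set of output files.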

Let's define input raster dataset that should be enriched with OSM data:

In [6]:
lulc_template = config.get('lulc')

# substitute the year into the lulc string from config file
lulc = lulc_template.format(year=year)

# find directories from config file
input_dir = config.get('input_dir')
lulc_dir = config.get('lulc_dir')

lulc = os.path.join(lulc_dir, lulc) # define path
lulc = os.path.normpath(lulc) # normalise path

print(f"Input raster to be used for processing is {lulc}.")

Input raster to be used for processing is data\input\lulc\lulc_ukceh_25m_2018.tif.
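
The path above is built by substituting the year into a filename template from the configuration file. The sketch below illustrates that step in isolation. The keys are the ones read in this notebook, but the values of `example_config` are assumptions inferred from the printed path, and `resolve_lulc_path` is a hypothetical helper.

```python
import os

# Hypothetical content of config.yaml after loading; the values are assumptions
example_config = {
    "year": 2018,
    "lulc": "lulc_ukceh_25m_{year}.tif",
    "lulc_dir": "data/input/lulc",
}


def resolve_lulc_path(cfg, default_year=2022):
    """Fill the year into the LULC filename template and join it with its directory."""
    yr = cfg.get("year") or default_year  # fall back to the default if the year is missing
    filename = cfg["lulc"].format(year=yr)
    return os.path.normpath(os.path.join(cfg["lulc_dir"], filename))


example_path = resolve_lulc_path(example_config)
print(example_path)
```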


Load the raster file, get its extent, cell size and projection:

In [7]:
raster = gdal.Open(lulc)
if raster is not None:
    inp_lyr = raster.GetRasterBand(1)  # get the first band
    gt = raster.GetGeoTransform()  # (x origin, pixel width, rotation, y origin, rotation, pixel height)
    x_min_cart = gt[0]
    x_max_cart = gt[0] + raster.RasterXSize * gt[1]
    y_min_cart = gt[3] + raster.RasterYSize * gt[5]  # pixel height is negative for north-up rasters
    y_max_cart = gt[3]
    '''
    cellsize = raster.GetGeoTransform()[1]  # Assuming the cell size is constant in both x and y directions
    x_ncells = int((x_max - x_min) / cellsize)
    y_ncells = int((y_max - y_min) / cellsize)
    '''
    print ("Input raster has been successfully found.")

    # extract projection system of input raster file
    info = gdal.Info(raster, format='json')
    if 'coordinateSystem' in info and 'wkt' in info['coordinateSystem']:
        srs = osr.SpatialReference(wkt=info['coordinateSystem']['wkt'])
        if srs.IsProjected():
            epsg_code = srs.GetAttrValue("AUTHORITY", 1)
            print(f"Projected coordinate system of the input raster is EPSG:{epsg_code}")
        else:
            print("Input raster does not have a projected coordinate system.")
    else:
        print("No projection information found in the input raster.")
    # close the raster to keep memory empty
    raster = None
else:
    print ("Input raster is missing.")

Input raster has been successfully found.




Projected coordinate system of the input raster is EPSG:27700
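
The extent calculation in the cell above relies on the GDAL geotransform convention. The sketch below reproduces the arithmetic without GDAL so that it can be checked by hand. The cell size of 25 m (taken from the raster filename) and the grid dimensions are assumptions chosen to be consistent with the extent printed later in this notebook; `extent_from_geotransform` is a hypothetical helper.

```python
def extent_from_geotransform(gt, n_cols, n_rows):
    """Return (x_min, x_max, y_min, y_max) of a north-up raster.

    gt follows the GDAL order: (x origin, pixel width, row rotation,
    y origin, column rotation, pixel height), where the origin is the
    upper-left corner and the pixel height is negative.
    """
    x_min = gt[0]
    x_max = gt[0] + n_cols * gt[1]
    y_max = gt[3]
    y_min = gt[3] + n_rows * gt[5]
    return x_min, x_max, y_min, y_max


# Assumed geotransform and grid size (25 m cells, 4203 columns, 7861 rows)
example_gt = (347225.0, 25.0, 0.0, 540325.0, 0.0, -25.0)
extent = extent_from_geotransform(example_gt, 4203, 7861)
print(extent)

# Area of the bounding box in square kilometres
area_km2 = (extent[1] - extent[0]) * (extent[3] - extent[2]) / 1e6
print(round(area_km2))
```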


We need to reproject the bounding box of the input dataset, as Overpass accepts only geographical coordinates (WGS 84):

In [8]:
# defining function to transform
transform_cart_to_geog = pyproj.Transformer.from_crs(
    pyproj.CRS(f'EPSG:{epsg_code}'),  # applying EPSG code of input raster dataset
    pyproj.CRS('EPSG:4326')   # WGS84 geographic which should be used in OSM APIs
)

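# running function
# NOTE: without always_xy=True, pyproj follows the axis order of EPSG:4326 and returns (latitude, longitude),
# so x_min/x_max hold latitudes and y_min/y_max hold longitudes here
x_min, y_min = transform_cart_to_geog.transform(x_min_cart, y_min_cart)
x_max, y_max = transform_cart_to_geog.transform(x_max_cart, y_max_cart)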

# print the Cartesian coordinates before transformation
print("Before Transformation:")
print("x_min_cart:", x_min_cart)
print("x_max_cart:", x_max_cart)
print("y_min_cart:", y_min_cart)
print("y_max_cart:", y_max_cart)

# print the transformed geographical coordinates
print("After Transformation:")
print("x_min:", x_min)
print("x_max:", x_max)
print("y_min:", y_min)
print("y_max:", y_max)
# Overpass expects the bounding box as south,west,north,east (latitude first)
bbox=f"{x_min},{y_min},{x_max},{y_max}"

# to check the bounding box of input raster
print (bbox)

Before Transformation:
x_min_cart: 347225.0
x_max_cart: 452300.0
y_min_cart: 343800.0
y_max_cart: 540325.0
After Transformation:
x_min: 52.98893670759685
x_max: 54.755166952963556
y_min: -2.7876213063263044
y_max: -1.1889035388429525
52.98893670759685,-2.7876213063263044,54.755166952963556,-1.1889035388429525
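
The printed values show that the variables named `x_*` contain latitudes and the ones named `y_*` contain longitudes, which matches the south,west,north,east order of the Overpass `bbox` setting. A small formatting helper with explicit argument names makes this order harder to get wrong. The sketch below is an illustration only; `overpass_bbox` is a hypothetical function, and the checks cover value ranges but cannot detect every swapped pair.

```python
def overpass_bbox(south, west, north, east):
    """Format a bounding box in the order Overpass expects: south,west,north,east."""
    if not (-90 <= south < north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south < north <= 90")
    if not (-180 <= west <= 180 and -180 <= east <= 180):
        raise ValueError("longitudes must lie within [-180, 180]")
    return f"{south},{west},{north},{east}"


# Coordinates printed by the cell above
bbox_example = overpass_bbox(52.98893670759685, -2.7876213063263044,
                             54.755166952963556, -1.1889035388429525)
print(bbox_example)
```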


##### Building Overpass Turbo API queries

Let's define the Overpass query for roads. The query below extracts ways and their corresponding nodes, as the nodes are needed to record the geometries of spatial features. It is also important to define the maximum memory consumption, otherwise big queries might run out of memory.

Queries for roads are split in two and then merged as a list, because one large query causes Overpass to close the connection.

In [9]:
query_roads = f"""
[out:json]
[maxsize:1073741824]
[timeout:9000]
[date:"{year}-12-31T23:59:59Z"]
[bbox:{bbox}];
way["highway"~"(motorway|trunk|primary|secondary|tertiary)"];
/* also includes 'motorway_link',  'trunk_link' etc. because they also restrict habitat connectivity
*/
/* old version to export only columns needed, but it overwhelms the code:
foreach -> .setWay {{
  .setWay; > -> .setNodes;
  make way geometry = setNodes.set(" " + lat() + ", " + lon()),
  highway = u(t["highway"]),
  name = u(t["name"]),
  width = u(t["width"]),
  level = u(t["level"]),
  bridge = u(t["bridge"]);
  out;
}};
*/
(._;>;);
out;
"""

#  try way(bn) or way(r) to filter out points and polygons

# '{' characters must be doubled in Python f-string (except for {bbox} because it is a variable)
# to include statement on paved surfaces use: ["surface"~"(paved|asphalt|concrete|paving_stones|sett|unhewn_cobblestone|cobblestone|bricks|metal|wood)"];
# to include only paved roads, it is important to list all values above, not only 'paved'
# BUT! : 'paved' tag seems to be missing in a lot of features at timestamps from 2010s
# 'residential' roads are not fetched as these areas are already identified in land-use/land-cover data as urban or residential ones
# "~" extracts all tags containing this text, for example 'motorway_link'


# REDUNDANT BLOCK - to import a date from query itself
# define the regex pattern to match the date
# import re

# date_pattern = r'\[date:"(\d{4})-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"\]'
# Search for the pattern in the query string
# match = re.search(date_pattern, query_roads)

# Extract the year from the matched date string
# year = match.group(1)
# print(f"OSM data is to be retrieved for the {year} year.")

2nd query to extract railways:

In [10]:
query_railways = f"""
[out:json]
[maxsize:1073741824]
[timeout:9000]
[date:"{year}-12-31T23:59:59Z"]
[bbox:{bbox}];
way["railway"~"(rail|light_rail|narrow_gauge|tram|preserved)"];
(._;>;);
out;
"""
# way["railway"];  # to include all features carrying the 'railway' key (any value)
# The statement used in the query includes only features whose 'railway' value matches one of the listed values.
# The regular expression is not anchored, so it also matches 'monorail'. Monorails are not obstacles for species migration, but these features are extremely rare. Therefore, it was decided not to overcomplicate the query.
# 31/07/2024 - added filtering on 'preserved' railways during the verification against the UKCEH LULC dataset (some railways are marked as 'preserved' at older timestamps and as 'rail' at newer ones).

3rd query to extract waterways:

In [11]:
query_waterways = f"""
[out:json]
[maxsize:1073741824]
[timeout:9000]
[date:"{year}-12-31T23:59:59Z"]
[bbox:{bbox}];
(
way["waterway"~"^(river|canal|flowline|tidal_channel)$"];
way["water"~"^(river|canal)$"];
);
/* ^ and $ symbols are used to exclude 'riverbank' and 'derelict_canal' */
/* UPD - the second line is added in case some older features are missing the 'waterway' tag */
(._;>;);
out;
"""

# to include small waterways use way["waterway"~"(^river$|^canal$|flowline|tidal_channel|stream|ditch|drain)"]
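
The effect of the `^` and `$` anchors used in the waterways query can be checked locally. Overpass evaluates POSIX extended regular expressions, whereas the sketch below uses Python's `re` module, so it rests on the assumption that both behave identically for patterns as simple as these.

```python
import re

values = ["river", "riverbank", "canal", "derelict_canal", "tidal_channel"]

# Without anchors the pattern matches anywhere inside the value
unanchored = re.compile(r"(river|canal|flowline|tidal_channel)")
# With anchors the whole value has to be equal to one of the alternatives
anchored = re.compile(r"^(river|canal|flowline|tidal_channel)$")

matched_unanchored = [v for v in values if unanchored.search(v)]
matched_anchored = [v for v in values if anchored.search(v)]

print(matched_unanchored)
print(matched_anchored)
```

The unanchored pattern keeps 'riverbank' and 'derelict_canal', while the anchored one drops them. The same mechanism explains why the unanchored roads query also returns values such as 'motorway_link'.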

The 4th query extracts water bodies. It also includes mistagged features and features with deprecated tags, so that they are captured at older timestamps, when the tagging rules were different:

In [12]:
# Query to bring water features with deprecated tags
query_waterbodies = f"""
[out:json]
[maxsize:1073741824]
[timeout:9000]
[date:"{year}-12-31T23:59:59Z"]
[bbox:{bbox}];
(
nwr["natural"="water"];
nwr["water"~"^(cenote|lagoon|lake|oxbow|rapids|river|stream|stream_pool|canal|harbour|pond|reservoir|wastewater|tidal|natural)$"];
nwr["landuse"="reservoir"];
nwr["waterway"="riverbank"];
/*UPD - second filter was added to catch other water features at all timestamps*/
/*UPD - third and fourth filters were added to catch other water features at older timestamps*/
/*it is more reliable to query nodes, ways and relations altogether ('nwr') to fetch the complete polygon spatial features*/
);
(._;>;);
out;
"""

To compare the completeness of queries, it is useful to keep a query with the current tags only:

In [13]:
# Query to bring water features without deprecated tags
'''

query_waterbodies = f"""
[out:json]
[maxsize:1073741824]
[timeout:9000]
[date:"{year}-12-31T23:59:59Z"]
[bbox:{bbox}];
(
nwr["natural"="water"];
nwr["water"~"^(cenote|lagoon|lake|oxbow|rapids|river|stream|stream_pool|canal|harbour|pond|reservoir|wastewater|tidal)$"];
);
(._;>;);
out;
"""

'''


In [14]:
# merge queries into a dictionary
# to include all queries
queries = {"roads":query_roads, "railways":query_railways, "waterways":query_waterways, "waterbodies":query_waterbodies}
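
All four queries share the same settings block and the same recursion and output statements, so they could also be assembled by one function. The sketch below is a possible refactoring, not the approach used in this notebook; `build_query` and its parameters are hypothetical names, and the bounding box in the example is a rounded placeholder.

```python
def build_query(filters, year, bbox, maxsize=1073741824, timeout=9000):
    """Assemble an Overpass QL query from a list of filter statements."""
    settings = (
        f'[out:json]\n[maxsize:{maxsize}]\n[timeout:{timeout}]\n'
        f'[date:"{year}-12-31T23:59:59Z"]\n[bbox:{bbox}];'
    )
    # wrap all filters into one union statement
    union = "(\n" + "\n".join(f"{statement};" for statement in filters) + "\n);"
    # recurse down to the nodes of the selected elements and print everything
    return f"{settings}\n{union}\n(._;>;);\nout;\n"


railways_query = build_query(
    ['way["railway"~"(rail|light_rail|narrow_gauge|tram|preserved)"]'],
    year=2018,
    bbox="52.9,-2.8,54.8,-1.2",
)
print(railways_query)
```

Keeping the settings in one place means that a change of the timeout or memory limit applies to every query at once.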

In [15]:
# REDUNDANT version of query on roads with fetching only needed columns
'''
query_roads = f"""
[out:csv(print; false)]
[date:"2022-12-31T23:59:59Z"]
[timeout:1800]
[maxsize:1073741824]
[bbox:{bbox}];
way["highway"~"(motorway|trunk|primary|secondary|tertiary)"];
foreach -> .setWay {{
  .setWay; > -> .setNodes;
  make x print = u(t["highway"] + "|" + t["name"] + "|" + t["width"]) + "|" + setNodes.set(lat() + ", " + lon()); out;
}};
"""

'''


##### Accessing Open Street Map through Overpass endpoint
Let's define access to the Overpass endpoint and iterate over the queries.

There are two main output data formats: JSON and CSV. JSON is bulkier and less flexible to transform into other spatial data formats than CSV, but a common library, [osmtogeojson](https://wiki.openstreetmap.org/wiki/Overpass_turbo/GeoJSON), converts OSM JSON to GeoJSON easily. There are also other export options in the [official documentation](https://dev.overpass-api.de/output_formats.html): OSM XML, HTML popups and custom formats. However, these solutions are not stable.

UPD: CSV is fetched more quickly and copes with larger amounts of data (including multiple 'tertiary' roads), but the coordinates of the assigned nodes are glitched (the order of nodes is mixed up). It might be possible to fix the node order, but this workflow is not aimed at optimising the conversion, since a stable solution for transforming JSON exists. Therefore, JSON has been chosen over CSV.

In [17]:
# FOR JSON RESPONSES
# define endpoint
overpass_url = "https://overpass-api.de/api/interpreter"

intermediate_jsons = []

# iterate over the queries and execute them
for query_name, query in queries.items():
    response = requests.get(overpass_url, params={'data': query})
    print(response)
        
    # if response is successful
    if response.status_code == 200:
        print(f"Query to fetch OSM data for {query_name} in the {year} year has been successful.")
        data = response.json()
        
        # Extract elements from data
        elements = data.get('elements', [])
        
        # Print the number of elements
        print(f"Number of elements in {query_name} in the {year} year: {len(elements)}")
        
        # Print the first 3 elements to verify response
        for i, element in enumerate(elements[:3]):
            print(f"Element {i+1}:")
            print(json.dumps(element, indent=2))
        
        # Save the JSON data to a file
        output_file = f"{query_name}_{year}.json"
        with open(output_file, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)
        print(f"Data has been saved to {output_file}")
        print ("-" * 30)

        # Add the output file name to the list
        intermediate_jsons.append(output_file)
        
    else:
        print(f"Error: {response.status_code} for {query_name} in the {year} year")
        print(response.text)
        print ("-" * 30)

"""
print (intermediate_jsons)
"""

<Response [200]>
Query to fetch OSM data for roads in the 2018 year has been successful.
Number of elements in roads in the 2018 year: 593328
Element 1:
{
  "type": "node",
  "id": 154915,
  "lat": 53.6526029,
  "lon": -1.5279476
}
Element 2:
{
  "type": "node",
  "id": 154916,
  "lat": 53.650261,
  "lon": -1.5286385
}
Element 3:
{
  "type": "node",
  "id": 154917,
  "lat": 53.6487681,
  "lon": -1.5293094
}
Data has been saved to roads_2018.json
------------------------------
<Response [200]>
Query to fetch OSM data for railways in the 2018 year has been successful.
Number of elements in railways in the 2018 year: 99764
Element 1:
{
  "type": "node",
  "id": 332513,
  "lat": 53.2589429,
  "lon": -2.7883402
}
Element 2:
{
  "type": "node",
  "id": 332514,
  "lat": 53.2648554,
  "lon": -2.7820511
}
Element 3:
{
  "type": "node",
  "id": 332515,
  "lat": 53.2698036,
  "lon": -2.7768262
}
Data has been saved to railways_2018.json
------------------------------
<Response [200]>
Query to fet


REDUNDANT:
Export of output to csv to check the content:

In [18]:
'''
import pandas as pd
import geopandas as gpd
import requests
from io import StringIO

# Check if the request was successful
if response.status_code == 200:
    # Load the CSV data from the response
    csv_data = StringIO(response.text)
    
    # Read the CSV data into a DataFrame
    df = pd.read_csv(csv_data, sep=',', header=None, names=['highway'])
    
    # Define a function to split and pad the results
    def split_and_pad(row):
        parts = row.split('|')
        if len(parts) < 4:
            parts.extend([None] * (4 - len(parts)))  # Pad with None if less than 4 parts
        elif len(parts) > 4:
            parts = parts[:4]  # Truncate if more than 4 parts
        return parts
    
    # Apply the function to split and pad the 'highway' column
    split_cols = df['highway'].apply(split_and_pad)
    
    # Convert the result to a DataFrame
    split_df = pd.DataFrame(split_cols.tolist(), columns=['highway_type', 'name', 'width', 'geometry'])
    
    # Concatenate the split columns back to the original DataFrame
    df = pd.concat([df, split_df], axis=1)
    
    # Drop the original 'highway' column
    df.drop(columns=['highway'], inplace=True)
    
    # Drop rows where all other columns are NaN
    df.dropna(how='all', subset=['highway_type', 'name', 'width', 'geometry'], inplace=True)
    
    # Reset index
    df.reset_index(drop=True, inplace=True)

    # Save the transformed DataFrame to a CSV file
    # df.to_csv('transformed_highway_data.csv', index=False)
    
    # Display the transformed DataFrame
    print(df.head(1000))
else:
    print(f"Failed to fetch data. Status code: {response.status_code}")
'''


##### Postprocessing outputs
Let's transform OSM data to geopackage as well:


In [19]:
# redundant block on fixing geometries in dataframe
'''
# Function to convert coordinates string to LineString
def create_linestring(coords):
    if isinstance(coords, str):  # Check if coords is a string
        points = []
        for point in coords.split(';'):
            if point.strip() != '':
                try:
                    y, x = map(float, point.split(','))  # Swap x and y here if they are mixed up
                    points.append((x, y))  # Append (x, y) to points
                except ValueError:
                    pass
        if len(points) >= 2:
            return LineString(points)
        else:
            return None
    else:
        return coords  # Return the geometry unchanged if it's already a LineString

# Apply function to the 'geometry' column
df['geometry'] = df['geometry'].apply(create_linestring)

# Drop rows with invalid geometry (empty LineString)
df = df.dropna(subset=['geometry'])

# Convert DataFrame to GeoDataFrame
gdf = gpd.GeoDataFrame(df, geometry='geometry')

# Set CRS to WGS84 (EPSG:4326)
gdf.crs = "EPSG:4326"

# Save GeoDataFrame to GeoPackage
geopackage = "osm_historical.gpkg"
gdf.to_file(geopackage, layer='roads', driver='GPKG')
print(f"Data has been saved to {geopackage}.")
'''


To use the OSM JSON response in further preprocessing, the [osmtogeojson](https://github.com/tyrasd/osmtogeojson) library is used. npm (distributed with Node.js) must be available to install osmtogeojson. To install it on Windows:

In [20]:
!pip install npm



Then, we need to install osmtogeojson:

In [21]:
!npm install -g osmtogeojson


changed 35 packages in 5s

4 packages are looking for funding
  run `npm fund` for details


npm notice 
npm notice New major version of npm available! 9.8.1 -> 10.8.3
npm notice Changelog: <https://github.com/npm/cli/releases/tag/v10.8.3>
npm notice Run `npm install -g npm@10.8.3` to update!
npm notice 


To transform OSM json to geojson, run as a shell script:

In [22]:
for query_name, query in queries.items():
    result = !osmtogeojson {query_name}_{year}.json > {query_name}_{year}.geojson
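
To make clearer what the conversion does for simple linear features, the sketch below builds GeoJSON LineStrings from ways and their nodes in plain Python. It is a simplified illustration and not a replacement for osmtogeojson: relations, multipolygons and closed ways that should become polygons are not handled. `ways_to_geojson` is a hypothetical helper, the two nodes are taken from the response printed earlier, and the way connecting them is invented for the example.

```python
def ways_to_geojson(osm):
    """Convert the ways of an Overpass JSON response into GeoJSON LineStrings."""
    elements = osm.get("elements", [])
    # index node coordinates by id; GeoJSON stores positions as (lon, lat)
    nodes = {
        el["id"]: (el["lon"], el["lat"])
        for el in elements
        if el["type"] == "node"
    }
    features = []
    for el in elements:
        if el["type"] != "way":
            continue
        coords = [list(nodes[ref]) for ref in el.get("nodes", []) if ref in nodes]
        if len(coords) < 2:
            continue  # a line needs at least two known nodes
        features.append({
            "type": "Feature",
            "id": f"way/{el['id']}",
            "properties": dict(el.get("tags", {})),
            "geometry": {"type": "LineString", "coordinates": coords},
        })
    return {"type": "FeatureCollection", "features": features}


sample = {
    "elements": [
        {"type": "node", "id": 154915, "lat": 53.6526029, "lon": -1.5279476},
        {"type": "node", "id": 154916, "lat": 53.650261, "lon": -1.5286385},
        # hypothetical way connecting the two nodes above
        {"type": "way", "id": 1, "nodes": [154915, 154916],
         "tags": {"highway": "primary"}},
    ]
}
geojson_sample = ways_to_geojson(sample)
print(geojson_sample["features"][0]["geometry"]["coordinates"])
```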

It is important to filter out non-suitable geometries (points and polygons for roads, railways and waterways):

In [23]:
# create empty list with file names of geojsons
intermediate_geojsons=[]

# iterate over the queries and define outputs
for query_name, query in queries.items():
    geojson_file = f"{query_name}_{year}.geojson"
    
    # check if the non-zero GeoJSON files exist
    if os.path.exists(geojson_file) and os.path.getsize(geojson_file) > 0:
        print(f"Conversion to GeoJSON for {query_name} in the {year} year was successful.")
        
        # read the GeoJSONs and print the first feature
        with open(geojson_file, 'r', encoding='utf-8') as f:
            geojson_data = json.load(f)
            features = geojson_data.get('features', [])
            print(f"Total features: {len(features)}")
            for i, feature in enumerate(features[:1]):
                print(f"Feature {i+1}:")
                print(json.dumps(feature, indent=2))

        # determine the geometries to filter based on query_name
        # for roads, railways and waterways extract only lines and multilines
        if query_name in ("roads", "railways", "waterways"):
            geometry_types = ['LineString', 'MultiLineString']
            # filter based on geometry types and level - it should be 0 (or null)
            filtered_features = [
                feature for feature in geojson_data.get('features', [])
                if feature['geometry']['type'] in geometry_types
                and (feature['properties'].get('level') in (None, 0, '0')) # filtering by ground level of infrastructure; OSM tag values are strings, hence '0'
            ]
        # for waterbodies extract only polygons and multipolygons
        elif query_name == "waterbodies":
            geometry_types = ['Polygon', 'MultiPolygon']
            # filter based on geometry types only
            filtered_features = [
                feature for feature in geojson_data.get('features', [])
                if feature['geometry']['type'] in geometry_types
            ]
        # for everything else extract everything that can be found
        else:
            filtered_features = [
                feature for feature in geojson_data.get('features', [])
            ]

        # create a new GeoJSON structure with filtered features
        filtered_geojson_data = {
            "type": "FeatureCollection",
            "features": filtered_features
        }

        print(f"Total features after filtering {query_name} in the {year} year: {len(filtered_features)}")
        print ("-" *30)
        
        # overwrite the original GeoJSON file with the filtered one
        with open(geojson_file, 'w', encoding='utf-8') as f:
            json.dump(filtered_geojson_data, f, ensure_ascii=False, indent=4)

        # write filenames to the list with intermediate geojsons
        intermediate_geojsons.append(geojson_file)
    
    else:
        print(f"Conversion to GeoJSON for {query_name} in the {year} year failed.")
        print ("-" *30)
"""
print (intermediate_geojsons)
"""

Conversion to GeoJSON for roads in the 2018 year was successful.
Total features: 81964
Feature 1:
{
  "type": "Feature",
  "id": "way/543575676",
  "properties": {
    "area": "yes",
    "name": "Long Lane",
    "highway": "tertiary",
    "id": "way/543575676"
  },
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [
        [
          -2.3950353,
          53.5723483
        ],
        [
          -2.3949383,
          53.5723441
        ],
        [
          -2.3948666,
          53.5723531
        ],
        [
          -2.3948087,
          53.5723735
        ],
        [
          -2.394755,
          53.5724014
        ],
        [
          -2.3947081,
          53.5724308
        ],
        [
          -2.3946794,
          53.5724836
        ],
        [
          -2.3946709,
          53.5725217
        ],
        [
          -2.3946764,
          53.5725618
        ],
        [
          -2.3946931,
          53.5726086
        ],
        [
          -2.3947


Then, the GeoJSON datasets should be converted into GeoPackages, since these are better optimised for further processing. The quickest way to perform the conversion is the ogr2ogr command-line utility (currently executed through subprocess).

We also reproject the OSM data to align it with the coordinate system of the input raster dataset.

*Computation time of this block doesn't differ from the same block executed through the shell script.*

*An additional step is the conversion of the 'width' column into a decimal one, since it is read from GeoJSON as text. Currently, it is performed in further processing as a SQL statement for buffering roads by the width from this column.*
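
The conversion of the 'width' column could also be done in Python before the data reaches the SQL step. The sketch below parses values such as "6 m" and rejects implausible ones such as "3000". It is an illustration only: `parse_width` is a hypothetical helper, and the upper limit of 100 m is an arbitrary assumption that should be adapted to the study area.

```python
import re

# digits with an optional decimal part, optionally followed by a metre unit
_WIDTH_PATTERN = re.compile(
    r"^\s*(\d+(?:[.,]\d+)?)\s*(m|metres?|meters?)?\s*$", re.IGNORECASE
)


def parse_width(value, max_width=100.0):
    """Return the width in metres as a float, or None if it cannot be trusted."""
    if value is None:
        return None
    match = _WIDTH_PATTERN.match(str(value))
    if not match:
        return None  # free text or an unsupported unit
    width = float(match.group(1).replace(",", "."))
    if not 0 < width <= max_width:
        return None  # e.g. "3000" is more likely a typo or another unit
    return width


for raw in ["6 m", "3000", "7.5", "3,5", "wide", None]:
    print(repr(raw), "->", parse_width(raw))
```

Returning `None` instead of a guessed value keeps the decision on how to treat missing widths (for example, a default buffer per road type) in the later processing step.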

In [24]:
import subprocess

# define parameters
input_dir = os.getcwd()  # current directory
target_epsg = epsg_code # coordinate system of input raster dataset
gpkg_temp = os.path.join(input_dir, 'gpkg_temp')

print(f"Input directory: {input_dir}")
print(f"Output directory: {gpkg_temp}")

geojson_files = []
gpkg_files = []

# define the function converting GeoJSON files to GeoPackages
def geojson2gpkg(input_dir, output_dir):
    
    # create output directory if it does not exist
    os.makedirs(output_dir, exist_ok=True)
    # loop through all geojson files in directory
    for filename in os.listdir(input_dir):
        if filename.endswith('.geojson'):
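            geojson_file = os.path.join(input_dir, filename)
            # NOTE: the GeoPackage is written next to the GeoJSON file (input_dir); output_dir is only created above
            geopackage_file = os.path.join(input_dir, filename.replace('.geojson', '.gpkg'))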
        
            try:
                # run function as a shell script through subprocess library
                result = subprocess.run(['ogr2ogr', '-f', 'GPKG', '-t_srs', f'EPSG:{target_epsg}', geopackage_file, geojson_file], 
                                        check=True, 
                                        capture_output=True, 
                                        text=True)
                
                print(f"Converted and modified to GeoPackage: {filename}")
                print(f"Warnings or errors:\n{result.stderr}")

                # append the filename to the list
                geojson_files.append(filename)

            except subprocess.CalledProcessError as e:
                print(f"Error processing {filename}: {e}")
            except Exception as e:
                print(f"Unexpected error with {filename}: {e}")


# run the conversion and modification
geojson2gpkg(input_dir, gpkg_temp)

# print(geojson_files)

# replace .geojson with .gpkg for each file
gpkg_files = [file.replace('.geojson', '.gpkg') for file in geojson_files]

"""
# Explicit bash code (currently not working on Windows because of compatibility issues)
# Define the input directory
input_directory=$(pwd)  # current directory

# Loop through all geojson files in directory
for filename in "$input_directory"/*.geojson; do
    if [[ -f "$filename" ]]; then
        geojson_file="$filename"
        geopackage_file="${filename%.geojson}.gpkg"

        # Run function as a shell script
        if ogr2ogr -f "GPKG" "$geopackage_file" "$geojson_file"; then
            echo "Converted and modified: $filename"
        else
            echo "Error processing: $filename"
        fi
    else
        echo "No GeoJSON files found in the directory."
        break
    fi
done
"""

Input directory: C:\Users\kriukovv\Documents\pilot_2\preprocessing
Output directory: C:\Users\kriukovv\Documents\pilot_2\preprocessing\gpkg_temp
Converted and modified to GeoPackage: railways_2018.geojson

Converted and modified to GeoPackage: roads_2018.geojson

Converted and modified to GeoPackage: waterbodies_2018.geojson

Converted and modified to GeoPackage: waterways_2018.geojson
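
In the cell above, `output_dir` is created but the GeoPackages are written next to the GeoJSON files in `input_dir`. As a minimal sketch, assuming the converted files should instead land in the output directory, the source and destination paths could be paired with `pathlib` before calling ogr2ogr. The helper name `plan_conversions` and the file names are hypothetical and not part of the notebook; later cells that refer to the GeoPackages by bare filename would need the same paths.

```python
import tempfile
from pathlib import Path

def plan_conversions(input_dir, output_dir):
    """Return (source, destination) path pairs for every GeoJSON file in input_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    pairs = []
    # sorted() keeps the order of files reproducible between runs
    for source in sorted(input_dir.glob("*.geojson")):
        # with_suffix() swaps only the final extension, unlike str.replace()
        destination = output_dir / source.with_suffix(".gpkg").name
        pairs.append((source, destination))
    return pairs

# usage with throwaway files (the file names are illustrative)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("roads_2018.geojson", "railways_2018.geojson", "notes.txt"):
        (Path(tmp) / name).touch()
    planned = plan_conversions(tmp, Path(tmp) / "gpkg_temp")
    planned_names = [(src.name, dst.name, dst.parent.name) for src, dst in planned]
    print(planned_names)
```

Each pair can then be passed to the same `subprocess.run` call as above, with the destination first and the source second.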




Let's write separate geopackages as layers of one merged geopackage:

In [25]:
output_file = os.path.join(input_dir, f'osm_merged_{year}.gpkg')

"""
# to initialise output merged gpkg. '/dev/null' is used for avoiding actual operations with files
subprocess.run(['ogr2ogr', '-f', 'GPKG', output_file, '-a_srs', f'EPSG:{target_epsg}', '-overwrite', '/dev/null'], check=True)
"""

# initialize the GeoPackage using the first GeoPackage file
first_gpkg_file = gpkg_files[0]
layer_name = first_gpkg_file.replace(f'_{year}.gpkg','') # extract layer name from geopackage filenames

subprocess.run(['ogr2ogr', '-f', 'GPKG', output_file, first_gpkg_file, # output and input files
                '-s_srs', f'EPSG:{target_epsg}',  # set source CRS
                '-t_srs', f'EPSG:{target_epsg}', # set target CRS
                '-nln', layer_name # specify name of the layer
                ], check=True, capture_output=True, text=True) # to show log
print(f"Initialized merged GeoPackage with CRS EPSG:{target_epsg} from {layer_name}.")


# ancillary function to check coordinates of non-merged and merged geometries (if flipping issue occurred)
"""
def print_coordinates_from_gpkg(file_path, num_samples=5):

    # Print coordinates from the first few features in a GeoPackage.
    
    #:param file_path: Path to the GeoPackage file
    #:param num_samples: Number of features to sample
    
    gdf = gpd.read_file(file_path)
    print(f"File: {file_path}")
    for i, row in gdf.head(num_samples).iterrows():
        geom = row['geometry']
        coords = geom.xy
        x_coords = coords[0]
        y_coords = coords[1]
        print(f"Feature {i}:")
        print(f"  X coordinates: {x_coords}")
        print(f"  Y coordinates: {y_coords}")

# print coordinates from non-merged Geopackages
print_coordinates_from_gpkg(gpkg_files[0])

# print coordinates from merged GeoPackage
print_coordinates_from_gpkg(output_file)

"""

# append the rest of the GeoPackage files
for gpkg_file in gpkg_files[1:]:  # skip the first file because it's already added
    layer_name = gpkg_file.replace(f'_{year}.gpkg','')  # extract future layer name from geopackage files

    # run appending separate geopackages to empty merged geopackage (update if layers were previously written)
    try:
        result = subprocess.run(['ogr2ogr', '-f', 'GPKG', output_file, '-s_srs', f'EPSG:{target_epsg}', # for input file
                                        '-t_srs', f'EPSG:{target_epsg}', # for output file
                                        '-nln', layer_name, '-update', '-append', gpkg_file],
                                        check=True, 
                                        capture_output=True, 
                                        text=True)
        
        print(f"Added layer {layer_name} from {gpkg_file} to {output_file}")
        print(f"Warnings or errors:\n{result.stderr}")

    except subprocess.CalledProcessError as e:
        print(f"Error adding {layer_name}: {e.stderr}")
    except Exception as e:
        print(f"Unexpected error with {layer_name}: {e}")

Initialized merged GeoPackage with CRS EPSG:27700 from railways.
Added layer roads from roads_2018.gpkg to C:\Users\kriukovv\Documents\pilot_2\preprocessing\osm_merged_2018.gpkg

Added layer waterbodies from waterbodies_2018.gpkg to C:\Users\kriukovv\Documents\pilot_2\preprocessing\osm_merged_2018.gpkg

Added layer waterways from waterways_2018.gpkg to C:\Users\kriukovv\Documents\pilot_2\preprocessing\osm_merged_2018.gpkg
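
The merging cell above repeats the same ogr2ogr flags for every layer. The sketch below shows one way the argument list could be assembled in a single place, so that the identical '-s_srs' and '-t_srs' values are never forgotten. The helper names `layer_name_from_file` and `build_append_command` are hypothetical; only flags already used in this notebook appear in the command, and nothing is executed here.

```python
from pathlib import Path

def layer_name_from_file(gpkg_file, year):
    """Derive the layer name, e.g. 'roads_2018.gpkg' -> 'roads'."""
    stem = Path(gpkg_file).stem          # 'roads_2018'
    suffix = f"_{year}"
    # strip the year only when it is really the trailing part of the name
    return stem[: -len(suffix)] if stem.endswith(suffix) else stem

def build_append_command(merged_gpkg, gpkg_file, year, epsg):
    """Build the ogr2ogr argument list that appends one file as a layer."""
    crs = f"EPSG:{epsg}"
    return [
        "ogr2ogr", "-f", "GPKG",
        "-update", "-append",            # write into the existing merged file
        "-s_srs", crs, "-t_srs", crs,    # same CRS on both sides, as in the cell above
        "-nln", layer_name_from_file(gpkg_file, year),
        str(merged_gpkg), str(gpkg_file),  # destination first, then source
    ]

command = build_append_command("osm_merged_2018.gpkg", "roads_2018.gpkg", 2018, 27700)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True, capture_output=True, text=True)` exactly as in the loop above.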



Let's try to fix the geometries if any of them are invalid:

In [26]:
import shutil
from osgeo import ogr

def fix_geometries_in_gpkg(output_file, fixed_gpkg=None):
    # work on a copy if a separate output path is specified, otherwise edit the file in place
    if fixed_gpkg is None:
        fixed_gpkg = output_file
    else:
        shutil.copyfile(output_file, fixed_gpkg) # to copy file to a new one

    # open the output GeoPackage for editing
    data_source = ogr.Open(fixed_gpkg, update=1)

    for i in range(data_source.GetLayerCount()):
        layer = data_source.GetLayerByIndex(i)
        layer_name = layer.GetName()
        feature_to_fix_count = 0
        fixed_feature_count = 0
        invalid_feature_count = 0

        # iterate over all features in the layer
        for feature in layer:
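            geometry = feature.GetGeometryRef()
            # skip features without a geometry to avoid calling IsValid() on None
            if geometry is None:
                continue
            if not geometry.IsValid():
                feature_to_fix_count += 1 # increment the number of features to be fixed
                # attempt to fix the geometry
                fixed_geometry = geometry.MakeValid()

                # MakeValid() may return None on failure
                if fixed_geometry is not None and fixed_geometry.IsValid():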
                    # replace the geometry with the fixed one
                    feature.SetGeometry(fixed_geometry)
                    layer.SetFeature(feature)  # save the updated feature back to the layer
                    print(f"Fixed invalid geometry in layer '{layer_name}', feature ID: {feature.GetFID()}")
                    fixed_feature_count += 1 # increment the number of fixed features
                else:
                    print(f"Could not fix geometry in layer '{layer_name}', feature ID: {feature.GetFID()}")
                    invalid_feature_count += 1 # increment the number of features that cannot be fixed

        # report the results for the current layer
        if feature_to_fix_count == 0:
            print(f"All geometries of features in the layer '{layer_name}' of the output vector are valid.")
            print("-" * 40)
        else:
            print(f"Layer '{layer_name}': {fixed_feature_count} geometries fixed.")
            print(f"Layer '{layer_name}': {invalid_feature_count} geometries could not be fixed.")
            print("-" * 40)

    # close the data source
    data_source = None

    # copy the fixed geopackage back to the original output_file and delete the temporary copy
    # (skipped when the file was edited in place, otherwise copyfile raises SameFileError)
    if fixed_gpkg != output_file:
        shutil.copyfile(fixed_gpkg, output_file)
        os.remove(fixed_gpkg)

    print(f"Geometries fixed and saved to {fixed_gpkg}")

# usage:
fixed_gpkg = os.path.join(input_dir, f'osm_merged_{year}_fixed.gpkg')
fix_geometries_in_gpkg(output_file, fixed_gpkg)

# TODO - to replace non-fixed gpkg with a fixed one

# REDUNDANT - fixing geometries with shapely
"""
from shapely.geometry import shape
from shapely.validation import make_valid

# read output geopackage as a geodataframe
gdf = gpd.read_file(output_file)

# Function to fix invalid geometries
def fix_invalid_geometries(geometry):
    if geometry.is_valid:
        return geometry
    else:
        fixed_geom = make_valid(geometry)
        return fixed_geom if not fixed_geom.is_empty else None

# Apply the fix to all geometries
gdf['geometry'] = gdf['geometry'].apply(fix_invalid_geometries)

fixed_output_file = os.path.join(input_dir, f'osm_merged_{year}_fixed.gpkg')
gdf.to_file(fixed_output_file, driver="GPKG")
"""

All geometries of features in the layer 'railways' of the output vector are valid.
----------------------------------------
All geometries of features in the layer 'roads' of the output vector are valid.
----------------------------------------
Fixed invalid geometry in layer 'waterbodies', feature ID: 353
Fixed invalid geometry in layer 'waterbodies', feature ID: 432
Fixed invalid geometry in layer 'waterbodies', feature ID: 634
Layer 'waterbodies: 3 geometries fixed.
Layer 'waterbodies': 0 geometries could not be fixed.
----------------------------------------
All geometries of features in the layer 'waterways' of the output vector are valid.
----------------------------------------
Geometries fixed and saved to C:\Users\kriukovv\Documents\pilot_2\preprocessing\osm_merged_2018_fixed.gpkg



Let's delete intermediate jsons and geojsons to save space:

In [27]:
# 'json_file' is used as the loop variable to avoid shadowing the name of the standard json module
for json_file in intermediate_jsons:
    try:
        os.remove(json_file)
        print("Intermediate json file is deleted.")
    except OSError as e:
        print(f"Intermediate json file {json_file} cannot be deleted: {e}.")

for geojson_file in intermediate_geojsons:
    try:
        os.remove(geojson_file)
        print("Intermediate geojson file is deleted.")
    except OSError as e:
        print(f"Intermediate geojson file {geojson_file} cannot be deleted: {e}.")


Intermediate json file is deleted.
Intermediate json file is deleted.
Intermediate json file is deleted.
Intermediate json file is deleted.
Intermediate geojson file is deleted.
Intermediate geojson file is deleted.
Intermediate geojson file is deleted.
Intermediate geojson file is deleted.
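
The clean-up cell prints the same message for every deleted file, so the log does not show which files were removed. As a hedged alternative, the sketch below collects the results and reports them once. The helper `remove_files` and the file names are hypothetical and only illustrate the idea.

```python
import tempfile
from pathlib import Path

def remove_files(paths):
    """Delete the given files and return (deleted, failed) lists."""
    deleted, failed = [], []
    for path in map(Path, paths):
        try:
            path.unlink()
            deleted.append(path)
        except OSError as error:  # covers missing files and permission problems
            failed.append((path, error))
    return deleted, failed

# usage with throwaway files (the file names are illustrative)
with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "roads_2018.geojson"
    existing.touch()
    missing = Path(tmp) / "railways_2018.geojson"  # never created
    deleted, failed = remove_files([existing, missing])
    print(f"{len(deleted)} file(s) deleted, {len(failed)} could not be deleted")
    for path, error in failed:
        print(f"  {path.name}: {error.__class__.__name__}")
```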


Let's find out how much time it takes to extract and optimise this dataset:

In [28]:
# stop recording time
end_time = time.time()
# calculate elapsed time
elapsed_time = end_time - start_time
# print elapsed time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

Elapsed time: 905.03 seconds


###### ***Processing issues***

- A few other ways to access historical OSM data were explored as well:
    - Nominatim API
    - Ohsome API
    - Geofabrik archives (do not provide automatic or semi-automatic access)
- ogr2ogr might interpret the X and Y axes in a different order (flipping coordinates). This issue was detected for EPSG:27700 while merging separate geopackages into one, even though no reprojection was specified in the flags. The solution is to define the axis order explicitly by passing the same EPSG code to the '-s_srs' (input dataset) and '-t_srs' (output dataset) flags.
- An additional step is the conversion of the 'width' column to a decimal type, since it is read from GeoJSON as text. Currently, this is performed in further processing (enrichment of raster land-use/land-cover data) as an SQL statement that buffers roads by the width from this column.
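
The conversion of the 'width' column mentioned in the last item could also be done in Python before the SQL step. The following is a minimal sketch, not the method used in this workflow: it assumes that widths are plain numbers in metres, possibly with a decimal comma or a trailing metre unit, and it returns `None` for anything else so that such values can be reviewed manually. The function `parse_width` and the sample values are hypothetical.

```python
import re

# a non-negative decimal number, optionally followed by a metre unit
_WIDTH = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*(m|metres?|meters?)?\s*$", re.IGNORECASE)

def parse_width(value):
    """Convert a text width to a float in metres, or None if it cannot be parsed."""
    if value is None:
        return None
    match = _WIDTH.match(str(value))
    if match is None:
        return None  # e.g. 'narrow' or other units, left for manual review
    # accept a decimal comma as well as a decimal point
    return float(match.group(1).replace(",", "."))

widths = ["3.5", "4", "2,5 m", "narrow", None]
parsed = [parse_width(w) for w in widths]
print(parsed)
```

Returning `None` rather than a default width keeps unparsed values visible, which matters if the column is later used for buffering roads.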