## Overlaying NFI and AWI datasets

With both the NFI and AWI data available in geospatial format, we can subtract one from the other. This gives a reasonably accurate picture of which woodland areas are native, non-native or felled, and makes for a useful map for the public.

### Approach

Below we'll make use of the AWI dataset we have created and processed in [AWI workflow](uk_gb_awi.ipynb) and the existing NFI datasets to create a combined dataset, showing:

- Native Trees. Areas from the AWI dataset, minus any overlap with NFI areas not designated as "Trees" (i.e. "Barren & Felled" and "Other").
- Non-Native Trees. Areas from the NFI dataset designated as "Trees", minus the AWI overlap.
- Felled Trees. Areas from the NFI dataset designated as "Barren & Felled", minus the AWI overlap. Stored as "Other Felled Trees" in the output.
- Felled Native Trees. Areas from the NFI dataset not designated as "Trees" that overlap with the AWI dataset.
- Other NFI. Areas from the NFI dataset designated as "Other", minus the AWI overlap.
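
The categories above amount to set differences and intersections between the two datasets. As a minimal sketch of that logic, the toy model below replaces real polygons with sets of 1 ha grid cells. The cell coordinates are invented for illustration, and everything NFI does not classify as "Trees" is treated as lost woodland, which is how the overlay code further down in this notebook behaves.

```python
# Toy model: each dataset is a set of 1 ha grid cells instead of real polygons.
# All coordinates are invented for illustration only.
awi = {(0, 0), (0, 1), (0, 2), (1, 0)}  # ancient woodland cells
nfi = {
    "Trees": {(0, 0), (2, 0), (2, 1)},
    "Barren & Felled": {(0, 1), (3, 0)},
    "Other": {(0, 2), (3, 1)},
}

# Everything NFI does not classify as "Trees"
nfi_not_trees = nfi["Barren & Felled"] | nfi["Other"]

layers = {
    "Native Trees": awi - nfi_not_trees,          # AWI still standing
    "Non-Native Trees": nfi["Trees"] - awi,       # NFI trees outside AWI
    "Felled Trees": nfi["Barren & Felled"] - awi, # felled areas outside AWI
    "Felled Native Trees": awi & nfi_not_trees,   # AWI that NFI no longer sees as trees
    "Other NFI": nfi["Other"] - awi,              # remaining NFI land outside AWI
}

for name, cells in layers.items():
    print(f"{name}: {len(cells)} ha")
```

In this toy model the five layers are disjoint and together cover every cell that appears in either dataset, which is the property the validation totals in the main loop are meant to check on the real data.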

### Overlaying Errors

Certain geometries in the NFI dataset can trigger a nasty `found non-noded intersection` error when running `gpd.overlay` with `how='difference'`. It's unclear what causes it, and the commonly recommended fixes such as `buffer(0)` do not work. In total there are 97 entries like this in the NFI dataset, all of them located in Wales. I started my initial troubleshooting with 2022 and found 4 places: 3 in `Trees` and 1 in `Barren & Felled`.
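
Since `buffer(0)` does not help, the workaround used by `overlay_datasets` later in this notebook is to simplify each offending geometry with a per-row tolerance. One way such a tolerance could be found is to retry the failing operation with progressively coarser values. The sketch below is an assumption about that process, not the code that produced the stored tolerances. `first_working_tolerance` and `fake_overlay` are hypothetical names, and the fake operation stands in for simplifying one geometry and re-running `gpd.overlay`.

```python
def first_working_tolerance(operation, tolerances, error_types=(Exception,)):
    """Return the first tolerance for which operation(tolerance) succeeds, else None."""
    for tolerance in tolerances:
        try:
            operation(tolerance)
        except error_types:
            # This tolerance still fails, so try the next, coarser one
            continue
        return tolerance
    return None


def fake_overlay(tolerance):
    # Hypothetical stand-in: pretend the overlay only succeeds once the
    # geometry has been simplified with a tolerance of at least 0.5
    if tolerance < 0.5:
        raise ValueError("found non-noded intersection")
    return "ok"


print(first_working_tolerance(fake_overlay, [0.01, 0.1, 0.5, 1.0]))
```

In real use, `error_types` would be narrowed to whatever exception your Shapely/GEOS version raises for this error, so that unrelated failures are not silently skipped.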

The collection called `nfi_dataset_problematic` is the end result of that work across multiple datasets. It is stored for reuse so the search doesn't have to be repeated, as it easily takes around 20 minutes per year, for a total of about 4 hours. The indexes stored in the collection should be consistent across restarts, but might not be if you have changed the functions that create the initial `nfi_dataset`. The `source_index` column that is generated and read back remedies that.
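
As read by the overlay function later in this notebook, each per-year file has an `Indices` column and a `Tolerances` column. The sketch below shows, under that assumption, how such a file maps row indices to simplification tolerances. The CSV content is invented for illustration, and `tolerance_for` is a hypothetical helper rather than part of the notebook.

```python
import csv
import io

# Illustrative stand-in for a per-year problematic-indices CSV.
# The column names match those read by overlay_datasets; the values are invented.
sample_csv = io.StringIO("Indices,Tolerances\n1021,0.5\n20877,1.0\n")

tolerance_by_index = {
    int(row["Indices"]): float(row["Tolerances"])
    for row in csv.DictReader(sample_csv)
}


def tolerance_for(index, default=None):
    """Return the simplification tolerance recorded for a row index, if any."""
    return tolerance_by_index.get(index, default)


print(tolerance_for(1021))  # a row flagged as problematic
print(tolerance_for(42))    # a row that needs no simplification
```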

### Notes

All runtimes were measured on an Apple M1 Max with 64GB of RAM.

In [None]:
import pandas as pd
import geopandas as gpd
gpd.options.io_engine = 'pyogrio'

import util.geo_ops as gops

In [None]:
## Importing the AWI dataset from parquet
# Runtime: 2s, RAM: 1.5GB
awi_dataset = gpd.read_parquet('../data/processed/gb_awi_dataset.parquet')

gops.geodf_summary(awi_dataset, 'type_combined', 'area_ha')

In [None]:
def overlay_datasets(left_df, right_df, problematic_indices=None, overlay_operation='difference'):
    """Overlay two GeoDataFrames and return the result with a recalculated `area_ha` column.

    `problematic_indices` is an optional DataFrame with `Indices` and `Tolerances` columns,
    naming rows whose geometry must be simplified before the overlay will succeed.
    """
    # Copy to avoid modifying the original DataFrames
    ldf = left_df.copy()
    rdf = right_df.copy()

    # Simplify the problematic geometries, if any. Only rows of the left dataset are touched.
    if problematic_indices is not None:
        for _, row in problematic_indices.iterrows():
            if row['Indices'] in ldf.index:
                ldf.loc[row['Indices'], 'geometry'] = ldf.loc[row['Indices'], 'geometry'].simplify(row['Tolerances'])

    # Run the requested overlay (e.g. 'difference' keeps the parts of ldf not covered by rdf)
    overlay = gpd.overlay(ldf, rdf, how=overlay_operation, keep_geom_type=False)

    # Recalculate the area of each resulting geometry in hectares (CRS units are assumed to be metres)
    overlay['area_ha'] = overlay.geometry.area / 10000

    return overlay

In [None]:
## Creating the final, main dataset with overlay, combined, aggregate and source data preserved across all years
# Runtime 1h 10m, ~6m per year for 11 years, RAM: 12GB Max

summaries = {}

for year in range(2012, 2023):
    # Getting the NFI dataset
    print(f'Started processing year {year}.')
    nfi_dataset = gpd.read_parquet(f'../data/processed/gb_nfi_dataset_{year}.parquet')

    # And the problematic indices calculated in the _errors notebook
    nfi_dataset_problematic = pd.read_csv(f'../data/overlay_issues/problematic_{year}.csv')

    # Ensuring matching CRS
    awi_dataset = awi_dataset.to_crs(epsg=27700)
    nfi_dataset = nfi_dataset.to_crs(epsg=27700)
    print(f'Finished aligning CRS for {year}.')

    # Native trees as difference between AWI Trees and NFI Felled & Other
    awi_dataset_nfi_intact = overlay_datasets(awi_dataset[awi_dataset.type_combined != 'Other (land, pasture, unknown, etc.)'], nfi_dataset[nfi_dataset.type_combined != 'Trees'], problematic_indices=nfi_dataset_problematic, overlay_operation='difference')

    # Non-native trees as difference between NFI "Trees" and AWI as a whole
    nfi_dataset_non_awi = overlay_datasets(nfi_dataset[nfi_dataset.type_combined == 'Trees'], awi_dataset, problematic_indices=nfi_dataset_problematic, overlay_operation='difference')

    # Native Felled Trees as intersection between AWI Trees and NFI Felled & Other
    awi_dataset_nfi_felled = overlay_datasets(awi_dataset[awi_dataset.type_combined != 'Other (land, pasture, unknown, etc.)'], nfi_dataset[nfi_dataset.type_combined != 'Trees'], problematic_indices=nfi_dataset_problematic, overlay_operation='intersection')

    # General Felled Trees as NFI Felled less all AWI
    nfi_dataset_felled_non_awi = overlay_datasets(nfi_dataset[nfi_dataset.type_combined == 'Barren & Felled'], awi_dataset, problematic_indices=nfi_dataset_problematic, overlay_operation='difference')

    # Other NFI as NFI without AWI at all
    nfi_dataset_other_awi = overlay_datasets(nfi_dataset[nfi_dataset.type_combined == 'Other (land, urban, etc.)'], awi_dataset, problematic_indices=nfi_dataset_problematic, overlay_operation='difference')

    # Other AWI is simply a subset
    awi_other = awi_dataset[awi_dataset.type_combined == 'Other (land, pasture, unknown, etc.)']

    # Validation sets
    nfi_x_awi = overlay_datasets(nfi_dataset, awi_dataset, problematic_indices=nfi_dataset_problematic, overlay_operation='intersection')
    
    print(f"Native Trees: {awi_dataset_nfi_intact.area_ha.sum():,.2f} ha")
    print(f"Non-Native Trees: {nfi_dataset_non_awi.area_ha.sum():,.2f} ha")
    print(f"Native Felled Trees: {awi_dataset_nfi_felled.area_ha.sum():,.2f} ha")
    print(f"Other Felled Trees: {nfi_dataset_felled_non_awi.area_ha.sum():,.2f} ha")
    print(f"Other NFI areas: {nfi_dataset_other_awi.area_ha.sum():,.2f} ha")
    # Report the AWI "Other" subset here, not the Native Trees total
    print(f"Other AWI areas: {awi_other.area_ha.sum():,.2f} ha")
    print(f"Total non-unique in both: {awi_dataset.area_ha.sum() + nfi_dataset.area_ha.sum():,.2f} ha")
    print(f"Total intersection between datasets: {nfi_x_awi.area_ha.sum():,.2f} ha")
    print(f"Total unique across both datasets: {awi_dataset.area_ha.sum() + nfi_dataset.area_ha.sum() - nfi_x_awi.area_ha.sum():,.2f} ha")

    nfi_awi_combined = {}

    native_trees_ds = awi_dataset_nfi_intact
    nfi_awi_combined['Native Trees'] = gpd.GeoDataFrame({
        'source_index': native_trees_ds['source_index'],
        'type_overlay': 'Native Trees',
        'type_combined': native_trees_ds['type_combined'],
        'type_aggregate': native_trees_ds['type_aggregate'],
        'type_source': native_trees_ds['type_source'],
        'area_ha': native_trees_ds['area_ha'],
    }, geometry=native_trees_ds.geometry, crs=native_trees_ds.crs)

    non_native_trees_ds = nfi_dataset_non_awi
    nfi_awi_combined['Non-Native Trees'] = gpd.GeoDataFrame({
        'source_index': non_native_trees_ds['source_index'],
        'type_overlay': 'Non-Native Trees',
        'type_combined': non_native_trees_ds['type_combined'],
        'type_aggregate': non_native_trees_ds['type_aggregate'],
        'type_source': non_native_trees_ds['type_source'],
        'area_ha': non_native_trees_ds['area_ha'],
    }, geometry=non_native_trees_ds.geometry, crs=non_native_trees_ds.crs)

    # Taking the right-hand (NFI, suffix _2) attributes to show what the native woodland has been lost to
    felled_native_trees_ds = awi_dataset_nfi_felled
    nfi_awi_combined['Felled Native Trees'] = gpd.GeoDataFrame({
        'source_index': felled_native_trees_ds['source_index_2'],
        'type_overlay': 'Felled Native Trees',
        'type_combined': felled_native_trees_ds['type_combined_2'],
        'type_aggregate': felled_native_trees_ds['type_aggregate_2'],
        'type_source': felled_native_trees_ds['type_source_2'],
        'area_ha': felled_native_trees_ds['area_ha'],
    }, geometry=felled_native_trees_ds.geometry, crs=felled_native_trees_ds.crs)

    felled_trees_ds = nfi_dataset_felled_non_awi
    nfi_awi_combined['Other Felled Trees'] = gpd.GeoDataFrame({
        'source_index': felled_trees_ds['source_index'],
        'type_overlay': 'Other Felled Trees',
        'type_combined': felled_trees_ds['type_combined'],
        'type_aggregate': felled_trees_ds['type_aggregate'],
        'type_source': felled_trees_ds['type_source'],
        'area_ha': felled_trees_ds['area_ha'],
    }, geometry=felled_trees_ds.geometry, crs=felled_trees_ds.crs)

    other_ds = pd.concat([nfi_dataset_other_awi, awi_other])
    nfi_awi_combined['Other (land, pasture, urban, etc.)'] = gpd.GeoDataFrame({
        'source_index': other_ds['source_index'],
        'type_overlay': 'Other (land, pasture, urban, etc.)',
        'type_combined': other_ds['type_combined'],
        'type_aggregate': other_ds['type_aggregate'],
        'type_source': other_ds['type_source'],
        'area_ha': other_ds['area_ha'],
    }, geometry=other_ds.geometry, crs=other_ds.crs)

    gb_nfi_awi_overlay = pd.concat(nfi_awi_combined.values(), ignore_index=True)
    print(f"Sum of overlay features: {format(round(gb_nfi_awi_overlay.area_ha.sum(), 2), ',.2f')} ha")
    gb_nfi_awi_overlay.to_parquet(f'../data/processed/gb_nfi_awi_overlay_{year}.parquet')

    print(f'Finished processing for {year}')

In [None]:
## Writing out the summary CSVs. This can be performed gradually as the data is processed as well for much lower RAM cost
# Runtime: 1m30s, RAM: 40GB
gb_nfi_awi_overlay = {}

for year in range(2012, 2023):
    gb_nfi_awi_overlay[year] = gpd.read_parquet(f'../data/processed/gb_nfi_awi_overlay_{year}.parquet')

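summary_types = ['source', 'aggregate', 'combined', 'overlay']
# Named summary_type rather than type to avoid shadowing the built-in
for summary_type in summary_types:
    gops.geodfs_to_csv(gb_nfi_awi_overlay, f'./sheets/gb_nfi_awi_{summary_type}_summary.csv', f'type_{summary_type}', 'area_ha')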

In [None]:
# Deleting the datasets to free up memory
del nfi_dataset
del awi_dataset
del nfi_dataset_problematic
del gb_nfi_awi_overlay
del native_trees_ds
del non_native_trees_ds
del felled_trees_ds
del felled_native_trees_ds
del other_ds
del awi_other
del nfi_dataset_non_awi
del nfi_dataset_other_awi
del awi_dataset_nfi_intact
del awi_dataset_nfi_felled
del nfi_dataset_felled_non_awi
del nfi_awi_combined
del nfi_x_awi