## Calculating erroneous areas

As per the [uk_gb_nfi_geospatial](uk_gb_nfi_geospatial.ipynb) notebook, the overlay runs the risk of failing with inexplicable errors. To combat that, this compact notebook iterates over the NFI dataset in a memory-efficient way, using the NFI processing from the Geospatial notebook and the AWI dataset.

<details>
    <summary>Details of the troubleshooting for future reference</summary>

For 2022 specifically, when overlaying the `Trees` type subset, the issue is caused by 2-3 geometries which, bizarrely, are not consistent between runs. The candidates are:
- Wales, Pickle Wood, `Trees`, Index 461425, 51.7957357810459, -4.827321677247609, Area 41.147849, geometry `POLYGON ((204593.604 214909.751, 204593.800...`
- Wales, Allt Rhosygilwen, `Trees`, Index 469005, 52.036802759545104, -4.617561294671731, Area 46.6604904321, geometry `POLYGON ((221343.140 241874.840, 221343.515...`
- Wales, Coed y Brenin Forest, `Trees`, Index 537605, 52.815008884136546, -3.8724175720291303, Area 443.201233981, geometry `POLYGON ((272336.730 328993.532, 272343.858...`
- Wales, Dulas Valley, `Barren & Felled`, Index 561434, 52.64613555533024, -3.8447860459273797, Area 52.889922, geometry `POLYGON ((275280.160 307096.420, 275280.860...`

They usually surface as the following error messages (the `Trees` ones are shown below):
```python
TopologyException: found non-noded intersection between LINESTRING (204841 214686, 204858 214683) and LINESTRING (204858 214683, 204841 214686) at 204846.78035594229 214685.10662104361
TopologyException: found non-noded intersection between LINESTRING (273354 325793, 273352 325796) and LINESTRING (273352 325796, 273355 325793) at 273354.15382456378 325793.3069786204
TopologyException: found non-noded intersection between LINESTRING (220596 240170, 220598 240169) and LINESTRING (220598 240169, 220596 240170) at 220595.98972291514 240169.62026441595 557750:557800
```

Simplifying them, usually to a tolerance of `1e-1`, seems to be sufficient to make the overlay work. This segment of the workflow is dedicated to that procedure, specifically the `find_problematic_indices` function, which recursively finds all problematic indices for further simplification. It's advised to start with `1e-5` and step through coarser tolerances using the rule of 3, i.e. `1e-5`, `3e-5`, `1e-4` etc., to find the most precise simplified geometry that doesn't cause the error.
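
The tolerance search described above could be sketched roughly as follows. This is a minimal illustration, not the code used in this notebook: `tolerance_ladder`, `first_working_tolerance` and the `try_overlay` callable are hypothetical names, and the ladder assumes the tolerances alternate between mantissas 1 and 3 per decade. The `geometry` argument only needs a `simplify(tolerance)` method, as shapely geometries have.

```python
def tolerance_ladder(start_exp=-5, stop_exp=2):
    """Yield 1e-5, 3e-5, 1e-4, 3e-4, ... from the finest to the coarsest tolerance."""
    for exp in range(start_exp, stop_exp + 1):
        for mantissa in (1, 3):
            yield float(f'{mantissa}e{exp}')


def first_working_tolerance(geometry, try_overlay, tolerances):
    """Return the smallest tolerance whose simplified geometry overlays cleanly.

    `try_overlay` is a hypothetical callable that raises when the overlay
    fails for the geometry it is given.
    """
    for tolerance in tolerances:
        try:
            try_overlay(geometry.simplify(tolerance))
        except Exception:
            continue  # still failing, try the next (coarser) tolerance
        return tolerance
    return None  # nothing in the ladder worked
```

Because the ladder is walked from fine to coarse, the first tolerance that succeeds is also the one that distorts the geometry the least.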

Importantly, the errors differ depending on which subset is being overlaid, `Trees` or `Barren & Felled`. Values for both are included for consistency, as the recursive function runs against the entire dataset rather than specific subsets to ensure its full compliance.

Lastly, it is important to note that the `intersection` overlay does not produce similar errors, but the simplified geometries will be used for these functions nonetheless, for consistency.

There are some invalid geometries in both the AWI and NFI datasets, but if they are removed, the resulting difference doesn't make sense. "Fixing" them with `buffer(0)` made no difference either.

![The culprit image](../assets/pickle_wood_dark.png)
</details>

### Notes

All runtimes were measured on an Apple M1 Max with 64GB of RAM.

In [None]:
import pandas as pd
import geopandas as gpd
gpd.options.io_engine = 'pyogrio'
import concurrent.futures

import util.geo_ops as gops

from shapely import wkb

In [None]:
# Main recursive function to find problematic indices
def find_problematic_indices(ldf, rdf, start, end, overlay_operation='difference'):
    problematic_indices = []

    if start > end:
        return problematic_indices

    mid = (start + end) // 2

    # Test the midpoint row on its own
    mid_label = ldf.iloc[[mid]].index.item()
    try:
        print(f'Checking midpoint {mid_label}')
        gpd.overlay(ldf.iloc[[mid]], rdf, how=overlay_operation, keep_geom_type=False)
    except Exception:
        problematic_indices.append({'Indices': mid_label, 'Tolerance': None})
        print(f'Error found at midpoint {mid_label}')

    # Check the first half (positions start..mid-1) and recurse only if it fails
    try:
        print(f'Checking the first half at {ldf.iloc[start:mid].index}')
        gpd.overlay(ldf.iloc[start:mid], rdf, how=overlay_operation, keep_geom_type=False)
        print(f'No errors in the first half at {ldf.iloc[start:mid].index}, moving to the second half.')
    except Exception:
        print(f'Error in the first half at {ldf.iloc[start:mid].index}')
        # The midpoint was already tested above, so it is excluded from the recursion
        problematic_indices += find_problematic_indices(ldf, rdf, start, mid - 1, overlay_operation)

    # Check the second half (positions mid+1..end) and recurse only if it fails
    try:
        print(f'Checking the second half at {ldf.iloc[mid + 1:end + 1].index}')
        gpd.overlay(ldf.iloc[mid + 1:end + 1], rdf, how=overlay_operation, keep_geom_type=False)
        print(f'No errors in the second half at {ldf.iloc[mid + 1:end + 1].index}, exiting this layer of search.')
    except Exception:
        print('Error in the second half, narrowing the search range.')
        problematic_indices += find_problematic_indices(ldf, rdf, mid + 1, end, overlay_operation)

    return problematic_indices

def find_problematic_indices_main(left_df, right_df, type_column=None, type_value=None, problematic_indices=None, overlay_operation='difference'):
    # Copy to avoid modifying the original DataFrame
    ldf = left_df.copy()
    rdf = right_df.copy()

    # Simplify the geometries of the problematic indices if any
    if problematic_indices is not None:
        for idx, row in problematic_indices.iterrows():
            if row['Indices'] in ldf.index:
                ldf.loc[row['Indices'], 'geometry'] = ldf.loc[row['Indices'], 'geometry'].simplify(row['Tolerances'])
        print(f'Simplified geometry for problematic indices at {problematic_indices["Indices"].values} to tolerance {problematic_indices["Tolerances"].values}')

    # Filtering by type requested
    if type_column is not None and type_value is not None:
        ldf = ldf[ldf[type_column] == type_value]

    # Initialize the search range
    start, end = 0, len(ldf) - 1

    print(f'Starting search from {start} to {end}')
    problematic_indices = find_problematic_indices(ldf, rdf, start, end, overlay_operation)

    return pd.DataFrame(problematic_indices)
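
The idea behind the recursive search can be shown without any geospatial dependencies. The sketch below is a simplified variant rather than the function above: it does not test the midpoint separately, and `find_failing_items` and `batch_fails` are hypothetical names. `batch_fails` stands in for "the overlay of this batch raises".

```python
def find_failing_items(items, batch_fails):
    """Return every item that fails on its own, skipping ranges that pass as a batch.

    `batch_fails` is a hypothetical callable returning True when processing
    the given list of items raises an error.
    """
    if not items or not batch_fails(items):
        return []  # the whole range is clean, nothing to search
    if len(items) == 1:
        return list(items)  # a single failing item has been isolated
    mid = len(items) // 2
    return (find_failing_items(items[:mid], batch_fails)
            + find_failing_items(items[mid:], batch_fails))


# Toy example: positions 3 and 7 are "bad"
bad = {3, 7}
failing = find_failing_items(list(range(10)), lambda batch: any(i in bad for i in batch))
print(failing)
```

With only a handful of bad rows among hundreds of thousands, most batches pass on the first check, so far fewer overlay calls are needed than when testing every row individually.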

In [None]:
# Auxiliary function for crude debugging if a range is known within 100-ish indices
def debug_overlay_datasets(left_df, right_df, type_column, type_value, index_range=None, drop_indices=None, overlay_operation='difference'):
    # Drop specified indices if any, before filtering
    ldf = left_df
    if drop_indices is not None:
        ldf = ldf.drop(drop_indices)

    # Filter by index range if specified
    if index_range is not None:
        ldf = ldf.loc[index_range[0]:index_range[1]]

    ldf = ldf[ldf[type_column] == type_value]
    rdf = right_df

    # Collect the per-row overlay results
    results = []

    # Iterate over each index in the range
    for idx in ldf.index:
        try:
            # Overlay a single row of ldf against rdf
            overlay_idx = gpd.overlay(ldf.loc[[idx]], rdf, how=overlay_operation, keep_geom_type=False)
            results.append(overlay_idx)
            print(f"Success for index {idx}")
        except Exception as e:
            print(f"Error for index {idx}, area {ldf.loc[idx, 'area_ha']}: {e}")

    # Combine the successful overlays; return an empty GeoDataFrame if none succeeded
    if not results:
        return gpd.GeoDataFrame()
    return pd.concat(results, ignore_index=True)

In [None]:
## Importing the AWI dataset from parquet
# Runtime: 2s, RAM: 1.5GB
df = pd.read_parquet('../data/processed/gb_awi_dataset.parquet')
df['geometry'] = df['geometry'].apply(wkb.loads)
awi_dataset = gpd.GeoDataFrame(df, geometry='geometry')
awi_dataset = awi_dataset.set_crs(epsg=27700)
gops.geodf_summary(awi_dataset, 'type_combined', 'area_ha')

In [None]:
## Importing the NFI dataset from parquet
# Runtime: 1m, RAM: 30GB
nfi_dataset = {}

for year in range(2012, 2023):
    df = pd.read_parquet(f'../data/processed/gb_nfi_dataset_{year}.parquet')
    df['geometry'] = df['geometry'].apply(wkb.loads)
    gdf = gpd.GeoDataFrame(df, geometry='geometry')
    gdf = gdf.set_crs(epsg=27700)
    nfi_dataset[year] = gdf

gops.geodfs_print_summary(nfi_dataset, 'type_combined', 'area_ha')

In [None]:
# Runtime: 20m per year on average, 4h total, 40GB RAM peak
problematic_indices = {}

for year in range(2012, 2023):
    print(f'Finished loading {year}, starting processing')
    problematic_indices[year] = find_problematic_indices_main(nfi_dataset[year], awi_dataset, overlay_operation='difference')
    print(f'Finished processing year {year}')

problematic_indices

In [None]:
problematic_records_dict = {}

for year, df in problematic_indices.items():
    indices = df['Indices']
    # Copy so that adding the 'year' column does not write into a slice of the source dataset
    problematic_records = nfi_dataset[year].loc[indices].copy()
    problematic_records['year'] = year
    problematic_records_dict[year] = problematic_records

problematic_gdf = pd.concat(problematic_records_dict.values(), ignore_index=True)
problematic_gdf = problematic_gdf.sort_values(['year', 'source_index'])

cols = ['year'] + [col for col in problematic_gdf.columns if col != 'year']
problematic_gdf = problematic_gdf[cols].reset_index(drop=True)

problematic_gdf.drop('geometry', axis=1).to_csv('../data/overlay_issues/problematic.csv')
problematic_gdf.to_parquet('../data/overlay_issues/problematic.parquet')

In [None]:
## Adding indications for problematic geometries to the dataset for the overlay function to handle
# They are found empirically in uk_gb_nfi_awi_overlap.ipynb (time to run: 4h)
# Notes for 2022: Pickle Wood (461425) index has not been returned by the advanced algo but still might be relevant?
# There are only so many unique ones however, looking at the map it's about 20-30 places in Wales
# 1e+2 works for 2012, 2013, 2018
# 1e+1 works for 2014, 2015, 2016, 2017, 2019
# 1e-1 works for 2020, 2021, 2022

dataset = {
    2022: pd.DataFrame({'Indices': [469005, 537605, 561434, 604367], 'Tolerances': [1e-1]*4}),
    2021: pd.DataFrame({'Indices': [102122, 377302, 472827, 551872], 'Tolerances': [1e-1]*4}),
    2020: pd.DataFrame({'Indices': [101247, 376396, 376450, 471099, 548850], 'Tolerances': [1e-1]*5}),
    2019: pd.DataFrame({'Indices': [95060, 106935, 108275, 110139, 121142, 122418, 376394, 519912], 'Tolerances': [1e+1]*8}),
    2018: pd.DataFrame({'Indices': [90236, 90684, 93475, 101521, 101885, 104099, 109659, 122946, 127023, 357215, 357303, 357656, 364976, 365065], 'Tolerances': [1e+2]*14}),
    2017: pd.DataFrame({'Indices': [91447, 103102, 104423, 106270, 117173, 118443, 362874, 363289, 369684, 505330], 'Tolerances': [1e+1]*10}),
    2016: pd.DataFrame({'Indices': [91187, 102832, 104153, 105996, 116870, 118127, 360380, 360793, 367187, 500980], 'Tolerances': [1e+1]*10}),
    2015: pd.DataFrame({'Indices': [13102, 13174, 17051, 17876, 67774, 68192, 68851, 84599, 107901, 108724], 'Tolerances': [1e+1]*10}),
    2014: pd.DataFrame({'Indices': [13304, 13890, 16379, 17061, 70714, 71152, 76239, 92419, 117106, 125221], 'Tolerances': [1e+1]*10}),
    2013: pd.DataFrame({'Indices': [12940, 13012, 16770, 17535, 72578, 72995, 73651, 86444, 118417, 119236, 122720], 'Tolerances': [1e+2]*11}),
    2012: pd.DataFrame({'Indices': [525403, 525474, 529062, 529800, 550028, 555959, 556376, 557030, 569161, 569550, 570669], 'Tolerances': [1e+2]*11})
}

# Construct the desired structure
nfi_dataset_problematic = {}

for year, df in dataset.items():
    nfi_dataset_problematic[year] = pd.DataFrame({
        'Indices': df['Indices'].tolist(),
        'Tolerances': df['Tolerances'].tolist()
    })
    nfi_dataset_problematic[year].to_csv(f'../data/overlay_issues/problematic_{year}.csv')

### Raw output after 4h30m

All of the geometries are in Wales, for every year.


![A screenshot of the map with all problematic geometries](../assets/nfi_problematic_geometries.png)

<details>
    <summary>Expand for the raw output of the first run</summary>

{2022:    Indices Tolerance
 0   469005      None
 1   537605      None
 2   561434      None,
 2020:    Indices Tolerance
 0   101247      None
 1   376396      None
 2   376450      None
 3   471099      None,
 2021:    Indices Tolerance
 0   102122      None
 1   377302      None
 2   472827      None,
 2012:    Indices Tolerance
 0   525474      None
 1   529062      None
 2   529800      None
 3   550028      None
 4   555959      None
 5   556376      None
 6   557030      None
 7   569161      None
 8   569550      None
 9   570669      None,
 2013:    Indices Tolerance
 0    13012      None
 1    16770      None
 2    17535      None
 3    72578      None
 4    72995      None
 5    73651      None
 6    86444      None
 7   118417      None
 8   119236      None
 9   122720      None,
 2014:    Indices Tolerance
 0    13890      None
 1    16379      None
 2    17061      None
 3    70714      None
 4    71152      None
 5    76239      None
 6    92419      None
 7   117106      None
 8   125221      None,
 2015:    Indices Tolerance
 0    13174      None
 1    17051      None
 2    17876      None
 3    67774      None
 4    68192      None
 5    68851      None
 6    84599      None
 7   107901      None
 8   108724      None,
 2016:    Indices Tolerance
 0    91187      None
 1   102832      None
 2   104153      None
 3   105996      None
 4   116870      None
 5   118127      None
 6   360793      None
 7   367187      None
 8   500980      None,
 2017:    Indices Tolerance
 0    91447      None
 1   103102      None
 2   104423      None
 3   106270      None
 4   117173      None
 5   118443      None
 6   363289      None
 7   369684      None
 8   505330      None,
 2018:     Indices Tolerance
 0     90236      None
 1     90684      None
 2     93475      None
 3    101521      None
 4    101885      None
 5    104099      None
 6    109659      None
 7    122946      None
 8    127023      None
 9    357215      None
 10   357303      None
 11   357656      None
 12   364976      None
 13   365065      None,
 2019:    Indices Tolerance
 0    95060      None
 1   106935      None
 2   108275      None
 3   110139      None
 4   121142      None
 5   122418      None
 6   376394      None
 7   519912      None}
</details>

<details>
    <summary>Expand for the raw output of the second run</summary>

{2012:     Indices Tolerance
 0    525403      None
 1    525474      None
 2    529062      None
 3    529800      None
 4    550028      None
 5    555959      None
 6    556376      None
 7    557030      None
 8    569161      None
 9    569550      None
 10   570669      None,
 2013:     Indices Tolerance
 0     12940      None
 1     13012      None
 2     16770      None
 3     17535      None
 4     72578      None
 5     72995      None
 6     73651      None
 7     86444      None
 8    118417      None
 9    119236      None
 10   122720      None,
 2014:    Indices Tolerance
 0    13304      None
 1    13890      None
 2    16379      None
 3    17061      None
 4    70714      None
 5    71152      None
 6    76239      None
 7    92419      None
 8   117106      None
 9   125221      None,
 2015:    Indices Tolerance
 0    13102      None
 1    13174      None
 2    17051      None
 3    17876      None
 4    67774      None
 5    68192      None
 6    68851      None
 7    84599      None
 8   107901      None
 9   108724      None,
 2016:    Indices Tolerance
 0    91187      None
 1   102832      None
 2   104153      None
 3   105996      None
 4   116870      None
 5   118127      None
 6   360380      None
 7   360793      None
 8   367187      None
 9   500980      None,
 2017:    Indices Tolerance
 0    91447      None
 1   103102      None
 2   104423      None
 3   106270      None
 4   117173      None
 5   118443      None
 6   362874      None
 7   363289      None
 8   369684      None
 9   505330      None,
 2018:     Indices Tolerance
 0     90236      None
 1     90684      None
 2     93475      None
 3    101521      None
 4    101885      None
 5    104099      None
 6    109659      None
 7    122946      None
 8    127023      None
 9    357215      None
 10   357303      None
 11   357656      None
 12   364976      None
 13   365065      None,
 2019:    Indices Tolerance
 0    95060      None
 1   106935      None
 2   108275      None
 3   110139      None
 4   121142      None
 5   122418      None
 6   376394      None
 7   519912      None,
 2020:    Indices Tolerance
 0   101247      None
 1   376396      None
 2   376450      None
 3   548850      None,
 2021:    Indices Tolerance
 0   102122      None
 1   377302      None
 2   551872      None,
 2022:    Indices Tolerance
 0   469005      None
 1   537605      None
 2   604367      None}
</details>
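
The two runs do not return identical sets, and the hard-coded index lists used earlier appear to be the union of both runs (an observation from the values, not something the outputs state). A small sketch of that merge for 2020-2022, using the indices from the raw outputs above; `merge_runs` is a hypothetical helper:

```python
# Indices reported by each run, copied from the raw outputs above
first_run = {
    2020: [101247, 376396, 376450, 471099],
    2021: [102122, 377302, 472827],
    2022: [469005, 537605, 561434],
}
second_run = {
    2020: [101247, 376396, 376450, 548850],
    2021: [102122, 377302, 551872],
    2022: [469005, 537605, 604367],
}


def merge_runs(*runs):
    """Union the problematic indices per year across any number of runs."""
    merged = {}
    for run in runs:
        for year, indices in run.items():
            merged.setdefault(year, set()).update(indices)
    return {year: sorted(indices) for year, indices in sorted(merged.items())}


merged = merge_runs(first_run, second_run)
for year, indices in merged.items():
    print(year, indices)
```

Taking the union errs on the side of simplifying a geometry that only failed in one run, which is cheap compared with a failed multi-hour overlay.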
