## NFI Spatial Data Overview

This workflow deals with the most valuable dataset the National Forest Inventory (NFI) produces: an annual report on the state of Great Britain's woodlands. It has its limitations, such as not including Trees Outside Woods (TOWs), hedgerows or other woody linear features, but it is the most comprehensive dataset available for GB.

### Methodology

NFI publishes the methodology for this dataset, including how it has evolved, in the [Methodology and Outputs](https://www.forestresearch.gov.uk/tools-and-resources/statistics/about-our-statistics/methodology-and-outputs/) section. The most recent area estimates are available in [Area Estimates](https://cdn.forestresearch.gov.uk/2022/02/mnwoodarea.pdf). The most relevant methodology document for this dataset is the [National Forest Inventory - Description of attributes](https://cdn.forestresearch.gov.uk/2022/02/nfi-description-of-attributes.pdf).

The NFI dataset is slightly inconsistent both in the features available and in formatting, which warrants preprocessing and re-saving in a uniform format. The issues identified and remedied in the NFI datasets include:
- `Coppice` is misspelled on a few records in 2012
- The `Grass` / `Grassland` label changed over the years
- Years 2012 through 2015 lack the `Area_ha` feature
- Years 2014, 2017 and 2018 lack the `OBJECTID` feature
- In 2021 there seem to be no `Power lines` or `Powerlines` records at all
- There's no `Windblow` / `Windthrow` or `Failed` data for 2012 and 2013
- `Failed` itself isn't described in the methodology
- Other minor issues
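
The column gaps listed above can be surfaced with a quick audit before any fixing is done. The following is a minimal sketch, not part of the original workflow: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the example input is toy data that only mimics the kind of gaps described.

```python
from typing import Dict, Iterable, List

# Hypothetical list of columns the later steps rely on.
REQUIRED_COLUMNS = ("IFT_IOA", "Area_ha", "OBJECTID")


def missing_columns(
    columns_by_year: Dict[int, Iterable[str]],
    required: Iterable[str] = REQUIRED_COLUMNS,
) -> Dict[int, List[str]]:
    """Return, per year, the required columns that are absent."""
    required = list(required)
    report = {}
    for year, columns in sorted(columns_by_year.items()):
        present = set(columns)
        absent = [name for name in required if name not in present]
        if absent:
            report[year] = absent
    return report


# Toy input that mimics the gaps listed above (not the real schemas).
example = {
    2012: ["IFT_IOA", "OBJECTID", "geometry"],
    2014: ["IFT_IOA", "geometry"],
    2019: ["IFT_IOA", "Area_ha", "OBJECTID", "geometry"],
}
print(missing_columns(example))
```

With the real data loaded into a dictionary of frames, the same check could be run as `missing_columns({year: df.columns for year, df in nfi_dataset.items()})`.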

Whilst it's possible to go a step further and combine all 11 years of available data into one long-format dataframe, we'll avoid that because the sheer size of the dataset would make I/O harder. Instead, we'll rely on a dictionary holding one long dataset per year.
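
A dictionary of per-year frames can still be compared across years without concatenating the geometries. The sketch below assumes plain pandas and toy data; `summarise_by_year` is a hypothetical helper and is not a description of what `gops.geodfs_to_csv` does internally.

```python
import pandas as pd


def summarise_by_year(frames, type_col, area_col):
    """Build a type-by-year table of summed areas from a dict of frames."""
    per_year = {
        year: frame.groupby(type_col)[area_col].sum()
        for year, frame in frames.items()
    }
    # Only the small per-year Series are combined, not the full frames;
    # types missing in a given year show up as NaN.
    return pd.DataFrame(per_year).sort_index()


# Toy frames standing in for the real per-year GeoDataFrames.
toy = {
    2012: pd.DataFrame({"IFT_IOA": ["Conifer", "Conifer", "Felled"],
                        "Area_ha": [1.0, 2.5, 0.5]}),
    2013: pd.DataFrame({"IFT_IOA": ["Conifer", "Windblow"],
                        "Area_ha": [4.0, 0.25]}),
}
summary = summarise_by_year(toy, "IFT_IOA", "Area_ha")
print(summary)
```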

### Data Range

NFI makes data available starting with 2010, but the data for 2010 and 2011 is of limited use for comparison because the `IFT_IOA` feature is either missing or defined significantly differently. Therefore, the data range for this dataset is 2012-2022.

### Sources
- 2012: [National Forest Inventory GB 2012](https://data-forestry.opendata.arcgis.com/documents/951a84d62b86428192202590c016beee/about), 2012, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2013: [National Forest Inventory GB 2013](https://data-forestry.opendata.arcgis.com/documents/e153b004d5a841898244f547d1e810f3/about), 2013, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2014: [National Forest Inventory GB 2014](https://data-forestry.opendata.arcgis.com/documents/f18ce64eec874b5287a62b20adcffaad/about), 2014, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2015: [National Forest Inventory GB 2015](https://data-forestry.opendata.arcgis.com/documents/5201e715c6c7430d8c1f9afa9c14031d/about), 2015, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2016: [National Forest Inventory GB 2016](https://data-forestry.opendata.arcgis.com/documents/e985fa26cb624ca7967a434c4afa2384/about), 2016, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2017: [National Forest Inventory Woodland GB 2017](https://data-forestry.opendata.arcgis.com/documents/b529a848cb71421a9f26f1a035555cef/about), 2017, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2018: [National Forest Inventory GB 2018](https://data-forestry.opendata.arcgis.com/documents/741b93c083734de68c8d3df56be38cef/about), 2018, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2019: [National Forest Inventory GB 2019](https://data-forestry.opendata.arcgis.com/documents/5d694fb04c4f43558f90095a103f4513/about), 2019, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2020: [National Forest Inventory Woodland GB 2020](https://data-forestry.opendata.arcgis.com/datasets/eb05bd0be3b449459b9ad0692a8fc203_0/about), 2020, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2021: [National Forest Inventory Woodland GB 2021](https://data-forestry.opendata.arcgis.com/datasets/5b91b7041f8b46e099f64aa6d2013e9d_0/about), 2021, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).
- 2022: [National Forest Inventory GB 2022](https://data-forestry.opendata.arcgis.com/datasets/9149ff1ab1a24f87ba1a60837bae872e_0/about), 2022, [Open Government Licence](https://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/).

### Notes

All runtimes are indicated for Apple M1 Max 64GB. 

In [None]:
import geopandas as gpd
gpd.options.io_engine = 'pyogrio'

import util.geo_ops as gops

In [None]:
## Importing the NFI dataset. Runtime: 1min. It's about 30GB of data to hold in memory.
# This code assumes the source archives have been unpacked into the `../data/source` directory
# and that the Shapefile format was selected: it is already in the EPSG:27700 CRS, which is
# useful for restoring the area data from the NFI dataset.
nfi_dataset = {}

nfi_dataset[2012] = gpd.read_file('../data/source/NATIONAL_FOREST_INVENTORY_GB_2012/NATIONAL_FOREST_INVENTORY_GB_2012.shp')
nfi_dataset[2013] = gpd.read_file('../data/source/NATIONAL_FOREST_INVENTORY_GB_2013/NATIONAL_FOREST_INVENTORY_GB_2013.shp')
nfi_dataset[2014] = gpd.read_file('../data/source/GB_NATIONAL_FOREST_INVENTORY_2014/NATIONAL_FOREST_INVENTORY_GB.shp')
nfi_dataset[2015] = gpd.read_file('../data/source/NATIONAL_FOREST_INVENTORY_GB_WOODLAND_2015/NATIONAL_FOREST_INVENTORY_GB_2015.shp')
nfi_dataset[2016] = gpd.read_file('../data/source/NATIONAL_FOREST_INVENTORY_GB_2016/NATIONAL_FOREST_INVENTORY_GB_2016.shp')
nfi_dataset[2017] = gpd.read_file('../data/source/NATIONAL_FOREST_INVENTORY_WOODLAND_GB_2017/NATIONAL_FOREST_INVENTORY_WOODLAND_GB_2017.shp')
nfi_dataset[2018] = gpd.read_file('../data/source/NATIONAL_FOREST_INVENTORY_GB_2018/NFI_GB_IFT_Data_20191001.shp')
nfi_dataset[2019] = gpd.read_file('../data/source/NATIONAL_FOREST_INVENTORY_GB_2019/National Forest Inventory Woodland Map 2019 (GB).shp')
nfi_dataset[2020] = gpd.read_file('../data/source/National_Forest_Inventory_Woodland_GB_2020/National_Forest_Inventory_Woodland_GB_2020.shp')
nfi_dataset[2021] = gpd.read_file('../data/source/National_Forest_Inventory_Woodland_GB_2021/National_Forest_Inventory_Woodland_GB_2021.shp')
nfi_dataset[2022] = gpd.read_file('../data/source/National_Forest_Inventory_GB_2022/National_Forest_Inventory_GB_2022.shp')
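
The loading cell notes that EPSG:27700 is useful for restoring area data. The reason is that the British National Grid uses metres as its unit, so a planar polygon area comes out in square metres and dividing by 10,000 gives hectares. A minimal sketch in plain Python, using made-up coordinates and a hypothetical `polygon_area_m2` helper rather than geopandas:

```python
def polygon_area_m2(coords):
    """Planar area of a simple polygon via the shoelace formula.

    `coords` is a list of (easting, northing) pairs in metres.
    """
    total = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


SQUARE_METRES_PER_HECTARE = 10_000

# A 100 m x 100 m square at made-up British National Grid coordinates.
square = [(400000, 300000), (400100, 300000), (400100, 300100), (400000, 300100)]
area_ha = polygon_area_m2(square) / SQUARE_METRES_PER_HECTARE
print(area_ha)
```

A 100 m by 100 m square is one hectare by definition, which is the same conversion the `geometry.area/10000` expression relies on.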

In [None]:
## Aligning the dataset and fixing up small inconsistencies
# Assigning Area_ha in datasets where it's missing
for year in range(2012, 2016):
    nfi_dataset[year]['Area_ha'] = nfi_dataset[year].geometry.area/10000

# Changing an odd column name in 2016 for consistency
nfi_dataset[2016] = nfi_dataset[2016].rename(columns={'Hectares': 'Area_ha'})

# Fixing inconsistencies in IFT_IOA labels
for year in nfi_dataset:
    nfi_dataset[year]['IFT_IOA'] = nfi_dataset[year]['IFT_IOA'].replace('Copicce', 'Coppice')
    nfi_dataset[year]['IFT_IOA'] = nfi_dataset[year]['IFT_IOA'].replace('Grass', 'Grassland')
    nfi_dataset[year]['IFT_IOA'] = nfi_dataset[year]['IFT_IOA'].replace('Power line', 'Powerline')
    nfi_dataset[year]['IFT_IOA'] = nfi_dataset[year]['IFT_IOA'].replace('Windthrow', 'Windblow')
    nfi_dataset[year]['IFT_IOA'] = nfi_dataset[year]['IFT_IOA'].replace('Road or Railways', 'Road')

# Fixing IDs
nfi_dataset[2014]['OBJECTID'] = nfi_dataset[2014].index
nfi_dataset[2017]['OBJECTID'] = nfi_dataset[2017].index
nfi_dataset[2018]['OBJECTID'] = nfi_dataset[2018].index
nfi_dataset[2020]['OBJECTID'] = nfi_dataset[2020]['OBJECTID_1']
nfi_dataset[2021]['OBJECTID'] = nfi_dataset[2021]['OBJECTID_1']
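
The label fixes above depend on `Series.replace` matching whole values, so replacing `Grass` leaves existing `Grassland` records untouched. A small sketch of that behaviour on toy labels, with the same fixes collected into one mapping (`LABEL_FIXES` is a hypothetical name):

```python
import pandas as pd

# The same fixes as in the cell above, collected into one mapping.
LABEL_FIXES = {
    "Copicce": "Coppice",
    "Grass": "Grassland",
    "Power line": "Powerline",
    "Windthrow": "Windblow",
    "Road or Railways": "Road",
}

# Toy labels, not taken from the real data.
labels = pd.Series(["Grass", "Grassland", "Copicce", "Conifer", "Windthrow"])

# Whole-value matching: "Grassland" is not affected by the "Grass" rule.
cleaned = labels.replace(LABEL_FIXES)
print(cleaned.tolist())
```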

In [None]:
## Taking the source NFI dataset, recreating it with only the necessary columns
nfi_dataset_refined = {}

for year in nfi_dataset:
    nfi_dataset_refined[year] = gpd.GeoDataFrame(geometry = nfi_dataset[year].geometry, crs=27700)
    nfi_dataset_refined[year]['type_source'] = nfi_dataset[year]['IFT_IOA']
    nfi_dataset_refined[year]['area_ha'] = nfi_dataset[year]['Area_ha']

del nfi_dataset
nfi_dataset = nfi_dataset_refined
del nfi_dataset_refined

gops.geodfs_to_csv(nfi_dataset, '../data/sheets/gb_nfi_source_summary.csv', 'type_source', 'area_ha')

In [None]:
# Whilst not exactly scientific, this grouping is a good representation, since it's land that doesn't look like anything and isn't guaranteed to have anything planted currently or in the future
stumps = ['Felled', 'Coppice', 'Coppice with standards', 'Ground prep', 'Windblow', 'Bare area']

# Combining mixed conifer and broadleaved into one category
mixed = ['Mixed mainly broadleaved', 'Mixed mainly conifer']

# Giving the benefit of the doubt here really: "assumed woodland" is only growing year over year even though it's supposed to be declining as per NFI's methodology
# The backslash in 'Cloud \\ shadow' is escaped so the label keeps its literal backslash without an invalid escape sequence
other_vegetation = ['Low density', 'Shrub', 'Other vegetation', 'Cloud \\ shadow', 'Uncertain', 'Assumed woodland']

# These have absolutely no business being designated as "woodland" or reported as such in the total
other = ['Powerline', 'Windfarm', 'River', 'Quarry', 'Road', 'Road or Railways', 'Urban', 'Open water', 'Agriculture land', 'Failed', 'Grassland']

# Adding an aggregated type column to each year's dataframe
for year in nfi_dataset:
    nfi_dataset[year]['type_aggregate'] = (
        nfi_dataset[year]['type_source']
        .replace(stumps, 'Barren & Felled')
        .replace(mixed, 'Mixed (conifer & broadleaved)')
        .replace(other_vegetation, 'Other (assumed & uncertain)')
        .replace(other, 'Other (land, urban, etc.)')
    )

gops.geodfs_to_csv(nfi_dataset, '../data/sheets/gb_nfi_aggregate_summary.csv', 'type_aggregate', 'area_ha')
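
Because the aggregation depends on several hand-written lists, it can help to check that no source label ends up in two groups. The sketch below is an illustration in plain Python, using a shortened subset of the groups and a hypothetical `build_lookup` helper:

```python
def build_lookup(groups):
    """Invert {aggregate: [source labels]} into {source label: aggregate}."""
    lookup = {}
    for aggregate, labels in groups.items():
        for label in labels:
            if label in lookup:
                raise ValueError(f"{label!r} is assigned to two groups")
            lookup[label] = aggregate
    return lookup


# A shortened, illustrative subset of the groups defined above.
groups = {
    "Barren & Felled": ["Felled", "Windblow"],
    "Mixed (conifer & broadleaved)": ["Mixed mainly broadleaved", "Mixed mainly conifer"],
    "Other (land, urban, etc.)": ["Quarry", "Urban"],
}
lookup = build_lookup(groups)

source = ["Felled", "Conifer", "Urban", "Mixed mainly conifer"]
# Labels without a group (e.g. "Conifer") pass through unchanged.
aggregate = [lookup.get(label, label) for label in source]
print(aggregate)
```

On a pandas Series the same dictionary should work with `series.replace(lookup)`, which also leaves unlisted labels as they are.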

In [None]:
# Still including 'Other (assumed & uncertain)' here even though I probably shouldn't
trees = ['Broadleaved', 'Conifer', 'Young trees', 'Mixed (conifer & broadleaved)', 'Other (assumed & uncertain)']

for year in nfi_dataset:
    nfi_dataset[year]['source_index'] = nfi_dataset[year].index
    nfi_dataset[year]['type_combined'] = nfi_dataset[year]['type_aggregate'].replace(trees, 'Trees')
    nfi_dataset[year] = nfi_dataset[year][['source_index', 'type_combined', 'type_aggregate', 'type_source', 'area_ha', 'geometry']]

gops.geodfs_to_csv(nfi_dataset, '../data/sheets/gb_nfi_combined_summary.csv', 'type_combined', 'area_ha')

In [None]:
## Finally, saving our work as Parquet files in the 27700 CRS without any simplification.
# Parquet is very efficient at I/O, which makes it easy to reload the results for tiling and/or any other processing in different workflows.
# Runtime: 1m30s, Size: 11GB

for year in nfi_dataset:
    nfi_dataset[year].to_parquet(f'../data/processed/gb_nfi_dataset_{year}.parquet')

# Deleting the datasets to free up memory
del nfi_dataset
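
For later workflows, the saved files can be read back into the same dictionary-per-year shape. This is a minimal sketch, assuming the file naming used in the saving cell; `nfi_parquet_path` and `load_nfi_years` are hypothetical helpers, and reading relies on `geopandas.read_parquet`, which needs `pyarrow` installed.

```python
from pathlib import Path

# Assumed to match the directory used when saving above.
PROCESSED_DIR = Path("../data/processed")


def nfi_parquet_path(year, base_dir=PROCESSED_DIR):
    """Path of the per-year Parquet file written by the saving cell."""
    return Path(base_dir) / f"gb_nfi_dataset_{year}.parquet"


def load_nfi_years(years, base_dir=PROCESSED_DIR):
    """Read the requested years back into a {year: GeoDataFrame} dict."""
    import geopandas as gpd  # imported lazily; only needed when loading

    return {
        year: gpd.read_parquet(nfi_parquet_path(year, base_dir))
        for year in years
    }


print(nfi_parquet_path(2012))
```

Loading only the years a given analysis needs, for example `load_nfi_years([2012, 2022])`, keeps memory use well below that of the full 11-year set.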