# Geospatial Data Processing Methodology

This notebook documents the methodology employed for processing geospatial data using the GeoPandas library in Python. The primary objective was to refine and clip polygons from the Land Use Plan (LUP) dataset based on boundaries defined in the Limit Subset dataset, ensuring non-overlapping property polygons.

## 2. Data Preparation

### 2.1. Property Filtering and Sorting

- The dataset was filtered to retain properties registered from the year 2000 onwards.
- Properties were sorted chronologically, first by their registration year (anho_capa) and subsequently by their registration date (fecha_res).

### 2.2. Baseline Establishment

- An initial set of properties from the year 2000 was used to establish a baseline in the final_properties GeoDataFrame.

## 3. Data Processing

### 3.1. Polygon Subtraction

- For each subsequent year (2001-2022), the following steps were undertaken:
  - The geometries of older properties were subtracted from the current property to ensure no overlaps.
  - The modified current property was appended to the final_properties GeoDataFrame.

### 3.2. Yearly Subsets Creation

- The dataset was segmented into yearly subsets. Each subset was saved as a separate GeoPackage for granularity.

### 3.3. Data Validation

- Duplicate put_id values in the limit_subset dataset were identified and addressed.
- Rows with empty geometries were filtered out.
- A subset of the LUP dataset, termed lup_subset, was created based on unique put_id values from limit_subset.

### 3.4. Geometry Validation and Repair

- Invalid geometries in both the lup_subset and limit_subset datasets were identified.
- A buffer operation was employed to repair any detected invalid geometries.

## 4. Clipping Process

### 4.1. Polygon Clipping

- For each geometry in the limit_subset dataset:
  - Corresponding polygons from the lup_subset were identified.
  - These polygons were then clipped based on the current limit_subset geometry's boundaries.
  - The resulting clipped polygons were appended to the final_properties GeoDataFrame.

## 5. Final Output

### 5.1. Saving Processed Data

- The final_properties GeoDataFrame, which houses the clipped polygons, was saved as a GeoPackage. This dataset is primed for subsequent analysis or visualization.

In [1]:

import os
import sys
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
from geopandas.tools import clip
from joblib import Parallel, delayed
import numpy as np
from shapely.geometry import Polygon, MultiPolygon, GeometryCollection


In [2]:

# Get the current working directory
current_dir = os.path.abspath('')

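# Search for the 'constants.py' file starting from the current directory and moving up the hierarchy
project_root = current_dir
while not os.path.isfile(os.path.join(project_root, 'constants.py')):
    parent_dir = os.path.dirname(project_root)
    if parent_dir == project_root:
        # Reached the filesystem root without finding the file; stop instead of looping forever
        raise FileNotFoundError("constants.py not found in any parent directory")
    project_root = parent_dir

# Add the project root to the Python path
sys.path.append(project_root)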

In [3]:
from constants import DATA_PATH, LUP, LIMIT, LUP_SUBSET, LIMIT_SUBSET, LUP_PRELABEL

In [None]:
# Load the shapefile using geopandas
limit = gpd.read_file(LIMIT)



In [None]:
limit.crs

In [None]:
lup = gpd.read_file(LUP, layer = 'lup')

In [None]:
lup['anho_capa'] = lup['anho_capa'].astype(int)
# Take an explicit copy so later assignments do not act on a view of lup
filtered_lup = lup[(lup['anho_capa'] >= 2000) & (lup['anho_capa'] <= 2012)].copy()


In [None]:
filtered_lup = filtered_lup.to_crs(limit.crs)

In [None]:
filtered_lup.crs

In [None]:
len(filtered_lup['put_id'].unique())

In [None]:
# Check for invalid geometries in filtered_lup
invalid_lup = filtered_lup[~filtered_lup.geometry.is_valid]
len(invalid_lup['put_id'].unique())


In [None]:
# If there are invalid geometries, you might want to repair them
# One common method is to use the buffer operation with a distance of 0
if not invalid_lup.empty:
    filtered_lup.geometry = filtered_lup.buffer(0)
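
The zero-distance buffer repair can be illustrated on a single self-intersecting polygon. This is a minimal sketch using Shapely directly; the bowtie polygon is made up for illustration and is not taken from the LUP data. `make_valid` is shown only as an alternative to compare against (it assumes Shapely 1.8 or later), because `buffer(0)` can in some cases discard part of a self-intersecting shape.

```python
from shapely.geometry import Polygon
from shapely.validation import make_valid

# A self-intersecting "bowtie" ring, which is invalid as a polygon (toy data)
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

repaired_buffer = bowtie.buffer(0)        # the repair used in this notebook
repaired_make_valid = make_valid(bowtie)  # alternative, assumes Shapely >= 1.8

print(bowtie.is_valid, repaired_buffer.is_valid, repaired_make_valid.is_valid)
print(repaired_buffer.area, repaired_make_valid.area)
```

Comparing the two areas is a quick way to spot geometries where the buffer repair dropped part of the shape.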


In [None]:
# Re-check filtered_lup for invalid geometries after the repair
invalid_lup = filtered_lup[~filtered_lup.geometry.is_valid]
len(invalid_lup['put_id'].unique())

In [None]:
# Create custom_limit_subset from filtered_lup by uniting polygons that share a put_id
# Step 1: Group by 'put_id'
grouped = filtered_lup.groupby('put_id')

# Step 2: Union polygons within each group
unioned_polygons = grouped['geometry'].apply(lambda x: x.unary_union)

# Step 3: Create a new GeoDataFrame
custom_limit_subset = gpd.GeoDataFrame(unioned_polygons, columns=['geometry'])
custom_limit_subset.reset_index(inplace=True)

# Ensure the CRS is consistent
custom_limit_subset.crs = filtered_lup.crs
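
As a possible alternative to the group-and-union steps above, GeoPandas offers `dissolve`, which unions geometries per key and aggregates the remaining columns in one call. The sketch below uses made-up toy boxes; taking the first `anho_capa` per `put_id` is an assumption for illustration, not something the notebook prescribes.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy data: put_id 1 is split into two adjacent polygons
parts = gpd.GeoDataFrame(
    {"put_id": [1, 1, 2], "anho_capa": [2001, 2001, 2003]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 5, 6, 6)],
)

# One row per put_id: geometries are unioned, other columns keep their first value
merged = parts.dissolve(by="put_id", aggfunc="first").reset_index()

print(merged[["put_id", "anho_capa"]])
print(merged.geometry.area.tolist())
```

Because the attribute columns are aggregated at the same time, this avoids a separate merge that could duplicate rows if a `put_id` carried more than one `anho_capa`.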


In [None]:
filtered_lup.columns

In [None]:
# Create a DataFrame with unique 'put_id' and 'anho_capa'
unique_anho_capa = filtered_lup[['put_id', 'anho_capa']].drop_duplicates()

# Merge 'anho_capa' into 'custom_limit_subset'
custom_limit_subset = custom_limit_subset.merge(unique_anho_capa, on='put_id', how='left')

# Create a DataFrame with unique 'put_id' and 'fecha_res'
unique_fecha_res = limit[['put_id', 'fecha_res']].drop_duplicates()

# Merge 'fecha_res' into 'custom_limit_subset'
custom_limit_subset = custom_limit_subset.merge(unique_fecha_res, on='put_id', how='left')



In [None]:
custom_limit_subset

In [None]:

#limit = limit[['id', 'put_id', 'anho_capa','fecha_res', 'geometry' ]]

filtered_limit = custom_limit_subset[(custom_limit_subset['anho_capa'] >= 2000) & (custom_limit_subset['anho_capa'] <= 2012)]
#filtered_limit['area'] = filtered_limit['geometry'].area

In [None]:
# Sort properties by registration year
properties = filtered_limit.sort_values(by='anho_capa')

# Convert fecha_res to datetime format
properties['fecha_res'] = pd.to_datetime(properties['fecha_res'], errors='coerce')

# For rows with NaT (Not a Timestamp) in fecha_res, assign a default date based on their year
properties.loc[properties['fecha_res'].isna(), 'fecha_res'] = pd.to_datetime(properties['anho_capa'].astype(str) + '-01-01')

# Sort properties by registration year, then by fecha_res within each year
properties = properties.sort_values(by=['anho_capa', 'fecha_res'])
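
The date handling in this cell can be checked on a small table. The following is a sketch with invented `put_id` values and dates: missing `fecha_res` entries fall back to 1 January of the registration year, and the rows are then ordered by year and date.

```python
import pandas as pd

# Toy records; two of them have no registration date
props = pd.DataFrame({
    "put_id": [10, 11, 12, 13],
    "anho_capa": [2001, 2000, 2001, 2000],
    "fecha_res": ["2001-06-15", None, None, "2000-03-02"],
})

# Unparseable or missing dates become NaT
props["fecha_res"] = pd.to_datetime(props["fecha_res"], errors="coerce")

# Fall back to the first day of the registration year where the date is missing
fallback = pd.to_datetime(props["anho_capa"].astype(str) + "-01-01")
props["fecha_res"] = props["fecha_res"].fillna(fallback)

# Order by year first, then by date within the year
props = props.sort_values(by=["anho_capa", "fecha_res"]).reset_index(drop=True)
print(props)
```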


In [None]:
# Create an empty GeoDataFrame to store the final processed properties
final_properties = gpd.GeoDataFrame(columns=properties.columns)

# Add properties from the year 2000 to final_properties as the baseline
final_properties = pd.concat([final_properties, properties[properties['anho_capa'] == 2000]])

for year in range(2001, 2023):  # Loop from 2001 to 2022
    # Get properties of the current year
    current_year_properties = properties[properties['anho_capa'] == year]
    
    # Iterate over each property of the current year
    for idx, current_property in current_year_properties.iterrows():
        # Subtract geometries of older properties from the current property
        for _, older_property in final_properties.iterrows():
            current_property['geometry'] = current_property['geometry'].difference(older_property['geometry'])
        
        # Append the "cut" current property to the final_properties GeoDataFrame
        final_properties = pd.concat([final_properties, current_property.to_frame().T])
final_properties.crs = properties.crs
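
The subtraction loop compares every new property against every earlier one, which grows quadratically. One way to express the same "earliest registration keeps the area" rule is to keep a running union of the area already claimed and subtract it once per property. This is a sketch in plain Shapely under that assumption, with a hypothetical helper `subtract_older` and toy boxes; it is not the code used to produce the results here.

```python
from shapely.geometry import box
from shapely.ops import unary_union

def subtract_older(baseline, later):
    """Cut each geometry in `later` by everything registered before it.

    `baseline` geometries are kept unchanged; `later` must already be in date order.
    """
    claimed = unary_union(baseline)   # area already taken by the baseline year
    result = list(baseline)
    for geom in later:
        cut = geom.difference(claimed)          # remove the already claimed area
        result.append(cut)
        claimed = unary_union([claimed, cut])   # the cut property now claims its area
    return result

# Toy example: one baseline parcel and two later, overlapping parcels
baseline = [box(0, 0, 2, 2)]
later = [box(1, 0, 3, 2), box(2, 1, 4, 3)]
final = subtract_older(baseline, later)
print([round(g.area, 6) for g in final])
```

As in the loop above, baseline geometries are not cut against each other; only later registrations lose area.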

In [None]:
# For Visual Check in Qgis

'''output_path = os.path.join(DATA_PATH,'processing')

# Convert the 'fecha_res' column to a string format
final_properties['fecha_res'] = final_properties['fecha_res'].astype(str)

# Create the directory if it doesn't exist
if not os.path.exists(output_path):
    os.makedirs(output_path)
    # Save the GeoDataFrame as a GeoPackage
# Define the filename for the GeoPackage

filename = os.path.join(output_path, "custom-limit-clip.gpkg")
final_properties.to_file(filename, driver="GPKG")'''

In [None]:
# Check for invalid geometries
invalid_geoms = final_properties[~final_properties.geometry.is_valid]
len(invalid_geoms['put_id'].unique())



In [None]:
# If there are invalid geometries, fix them
if len(invalid_geoms) > 0:
    final_properties.geometry = final_properties.geometry.buffer(0)

In [None]:
# Check for invalid geometries
invalid_geoms = final_properties[~final_properties.geometry.is_valid]
len(invalid_geoms['put_id'].unique())

On examining the output, slivers in the property polygons caused problems when clipping the LUP, so buffers were applied; LUP areas that were not captured as a result were recovered in QGIS.

In [None]:
final_properties.geometry = final_properties.buffer(1, join_style= 2)
final_properties.geometry = final_properties.buffer(-1, join_style= 2)


In [None]:
# Check for invalid geometries
invalid_geoms = final_properties[~final_properties.geometry.is_valid]
len(invalid_geoms['put_id'].unique())

In [None]:
final_properties.geometry.is_empty.sum()

In [None]:
# Filter out rows with empty geometries
final_properties = final_properties[~final_properties.geometry.is_empty]

In [None]:
# Check for invalid geometries
invalid_geoms = final_properties[~final_properties.geometry.is_valid]
len(invalid_geoms['put_id'].unique())

In [None]:
final_properties.geometry.is_empty.sum()

In [None]:
final_properties.geometry = final_properties.buffer(-2, join_style= 2)
final_properties.geometry = final_properties.buffer(2, join_style= 2)
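
The two buffer pairs act differently: a positive buffer followed by a negative one closes small gaps, while a negative buffer followed by a positive one removes parts narrower than twice the buffer distance. The sketch below shows the second case on made-up geometry (a 10 by 10 parcel plus a 0.5-wide sliver); `join_style=2` selects mitred corners, so the square corners survive the round trip.

```python
from shapely.geometry import MultiPolygon, box

# Toy geometry: a 10 x 10 parcel plus a 0.5-wide sliver next to it
parcel = box(0, 0, 10, 10)
sliver = box(10.5, 0, 11, 10)
messy = MultiPolygon([parcel, sliver])

# Shrink by 2 (the sliver collapses and disappears), then grow back by 2
opened = messy.buffer(-2, join_style=2).buffer(2, join_style=2)

print(messy.area, opened.area)
```

Any part narrower than 4 units vanishes with a distance of 2, so the distance has to stay well below the width of the smallest property that should survive.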

In [None]:
# For Visual Check in Qgis

output_path = os.path.join(DATA_PATH,'processing')

# Convert the 'fecha_res' column to a string format
final_properties['fecha_res'] = final_properties['fecha_res'].astype(str)

# Create the directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

# Define the filename and save the GeoDataFrame as a GeoPackage
filename = os.path.join(output_path, "custom-limit-clip_buffered.gpkg")
final_properties.to_file(filename, driver="GPKG")

In [None]:
# Load the shapefile using geopandas
'''limit_subset = gpd.read_file(LIMIT_SUBSET)
lup_subset  = gpd.read_file(LUP_SUBSET)'''


In [None]:
limit_subset = final_properties
# Lup from 2000-2012
lup_subset  = filtered_lup

In [None]:
len(limit_subset['put_id'].unique())

In [None]:
len(lup_subset['put_id'].unique())

In [None]:
'''keep=False: This argument specifies how to mark duplicates:
If keep='first' (default), it would mark all duplicates as True except for the first occurrence.
If keep='last', it would mark all duplicates as True except for the last occurrence.
If keep=False, it marks all duplicates as True.
'''
print(limit_subset[limit_subset.duplicated(subset='put_id', keep=False)])


In [None]:
limit_subset.geometry.is_empty.sum()

In [None]:
# Filter out rows with empty geometries
limit_subset = limit_subset[~limit_subset.geometry.is_empty]

# Reset the index if needed
limit_subset.reset_index(drop=True, inplace=True)


In [None]:
len(limit_subset['put_id'].unique())

In [None]:
lup_subset.geometry.is_empty.sum()

In [None]:
# Check for invalid geometries in lup and limit_subset
invalid_lup = lup_subset[~lup_subset.geometry.is_valid]
invalid_limit = limit_subset[~limit_subset.geometry.is_valid]

In [None]:
len(invalid_limit['put_id'].unique())

In [None]:

len(invalid_lup['put_id'].unique())

In [None]:
# Initially, lup_subset contains all land use plans
# Keep a copy of the original lup_subset before filtering
original_lup_subset = lup_subset.copy()

In [None]:
# Extract the unique 'put_id' values from limit_subset
put_ids_to_subset = limit_subset['put_id'].unique()


# Filter lup_subset to only include land use plans with a corresponding property border
lup_subset = lup_subset[lup_subset['put_id'].isin(put_ids_to_subset)]


# Find the land use plans that were excluded in the filtering process
excluded_lup = original_lup_subset[~original_lup_subset['put_id'].isin(put_ids_to_subset)]



In [None]:
# The LUP that don't have a matching property limit polygon need to be added in.
len(excluded_lup['put_id'].unique())

In [None]:
len(original_lup_subset['put_id'].unique())

In [None]:
len(lup_subset['put_id'].unique())

In [None]:
# Create an empty GeoDataFrame to store the clipped lup polygons
final_lup= gpd.GeoDataFrame(columns=lup_subset.columns, crs=lup_subset.crs)

# Outer loop: Iterate through each geometry in limit_subset
for _, row in limit_subset.iterrows():
    current_limit_geom = row.geometry
    current_put_id = row['put_id']
    # If the geometry is a GeometryCollection, keep only its polygonal parts
    if isinstance(current_limit_geom, GeometryCollection):
        polygons = []
        for geom in current_limit_geom.geoms:
            if isinstance(geom, Polygon):
                polygons.append(geom)
            elif isinstance(geom, MultiPolygon):
                # Flatten nested MultiPolygons: MultiPolygon() only accepts Polygons
                polygons.extend(geom.geoms)
        current_limit_geom = MultiPolygon(polygons)
        
    # Gather all the lup_subset polygons with the same put_id
    current_lup_polygons = lup_subset[lup_subset['put_id'] == current_put_id]
    
    # Clip the gathered lup polygons using the current limit_subset geometry
    clipped_lup = gpd.clip(current_lup_polygons, current_limit_geom)
    
    # Ensure the clipped_lup has the same CRS as final_lup before concatenating
    clipped_lup = clipped_lup.to_crs(final_lup.crs)
    
    # Append the clipped lup polygons to the final_lup GeoDataFrame
    final_lup = pd.concat([final_lup, clipped_lup])

# Reset the index of the result GeoDataFrame
final_lup.reset_index(drop=True, inplace=True)
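
Calling `pd.concat` inside the loop copies the accumulated frame on every iteration. A common variation is to collect the clipped pieces in a list and concatenate once at the end. The sketch below assumes a hypothetical helper `clip_by_put_id` and toy data; it mirrors the per-`put_id` clipping above but is not the code that produced the outputs in this notebook.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

def clip_by_put_id(plans, limits):
    """Clip each group of plan polygons by the limit geometry with the same put_id."""
    pieces = []
    for _, limit_row in limits.iterrows():
        matching = plans[plans["put_id"] == limit_row["put_id"]]
        if not matching.empty:
            pieces.append(gpd.clip(matching, limit_row["geometry"]))
    if not pieces:
        return plans.iloc[0:0]  # empty frame with the same columns
    return pd.concat(pieces, ignore_index=True)

# Toy data: each plan is larger than, or offset from, its property limit
plans = gpd.GeoDataFrame(
    {"put_id": [1, 2], "grupo": ["BOSQUES", "OTRAS_COBERTURAS"]},
    geometry=[box(0, 0, 4, 4), box(10, 10, 14, 14)],
)
limits = gpd.GeoDataFrame(
    {"put_id": [1, 2]},
    geometry=[box(1, 1, 3, 3), box(12, 12, 20, 20)],
)

clipped = clip_by_put_id(plans, limits)
print(clipped[["put_id", "grupo"]])
print(clipped.geometry.area.tolist())
```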


In [None]:
# Merge fecha_res from limit to lup based on put_id
lup_filtered_fres = final_lup.merge(limit_subset[['put_id', 'fecha_res']], on='put_id', how='left')

In [None]:
# For visual inspection in qgis
'''output_path = os.path.join(DATA_PATH,'processing' )

#lup_filtered_fres['fecha_res'] = lup_filtered_fres['fecha_res'].astype(str)

# Create the directory if it doesn't exist
if not os.path.exists(output_path):
    os.makedirs(output_path)
    # Save the GeoDataFrame as a GeoPackage
# Define the filename for the GeoPackage

filename = os.path.join(output_path, "custom_clipped_lup.gpkg")
lup_filtered_fres.to_file(filename, driver="GPKG")'''

In [None]:
#len(excluded_lup[~excluded_lup.geometry.is_valid])

In [None]:
'''output_path = os.path.join(DATA_PATH,'processing' )


# Create the directory if it doesn't exist
if not os.path.exists(output_path):
    os.makedirs(output_path)
    # Save the GeoDataFrame as a GeoPackage
# Define the filename for the GeoPackage

filename = os.path.join(output_path, "excluded_lup.gpkg")
excluded_lup.to_file(filename, driver="GPKG")'''

I used the excluded LUP and the final properties to check in QGIS what had been missed during the process. From this I decided that using the lup_subset and taking the difference was the best route, as that gave me all the land use plans that were missed. I applied a -45 and 45 buffer. I still had to manually clean LUPs that overlapped: for the most part they were just duplicates, but in some cases they came from different years. Where the years differed, I selected the oldest LUP.

In [None]:
#prelabel = gpd.read_file(LUP_PRELABEL)
prelabel = lup_filtered_fres

In [None]:
# Step 1: Identify the unique values of 'categoria_ant' for each 'grupo'
unique_values_mapping = prelabel.groupby('grupo')['categoria_ant'].unique().to_dict()

In [None]:
unique_values_mapping

In [None]:
# Given unique_values_mapping as a dictionary from your previous code:
unique_values_mapping = {
    'AREA_AUTORIZADA': np.array(['A-HABILITAR', 'SIN COBERTURA']),
    'BOSQUES': np.array(['FRANJAS', 'RESERVA-FORESTAL', 'PROTECCION-CAUCES', 'PROTECCION',
                         'BOSQUETES', 'REGENERACION', 'FORESTACION', 'A-REFORESTAR',
                         'REMANENTE', 'REFORESTACION', 'A-REGENERAR', 'MANEJO-FORESTAL']),
    'EN_CONFLICTO': np.array(['EN-CONFLICTO']),
    'OTRAS_COBERTURAS': np.array(['NO_FORESTAL', 'PASTO', 'AREA AFECTADA POR M*']),
    'OTRAS_TIERRAS_FORESTALES': np.array(['MATORRAL', 'PALMARES'])
}

# Create a mapping dictionary for 'categoria_ant' values
categoria_ant_to_grupo = {}
for grupo, categorias in unique_values_mapping.items():
    for categoria in categorias.tolist():  # Convert numpy array to list before iterating
        categoria_ant_to_grupo[categoria] = grupo

# Adjust the mapping for 'EN_CONFLICTO'
categoria_ant_to_grupo['EN-CONFLICTO'] = 'BOSQUES'

# Fill in the empty rows of 'grupo'
prelabel.loc[prelabel['grupo'].isna(), 'grupo'] = prelabel.loc[prelabel['grupo'].isna(), 'categoria_ant'].map(categoria_ant_to_grupo)
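
The category-to-group lookup does not have to be typed out by hand; it can be derived from the rows where both columns are present. This is a sketch on an invented table, assuming each `categoria_ant` belongs to a single `grupo` (the first one seen is kept). The manual override for 'EN-CONFLICTO' used above would still need to be applied separately.

```python
import pandas as pd

# Toy labels; the last three rows are missing their group
labels = pd.DataFrame({
    "categoria_ant": ["PASTO", "FRANJAS", "PASTO", "CAMINO", "FRANJAS"],
    "grupo": ["OTRAS_COBERTURAS", "BOSQUES", None, None, None],
})

# Learn category -> group from rows where both values are known
known = labels.dropna(subset=["grupo", "categoria_ant"])
categoria_to_grupo = (
    known.drop_duplicates("categoria_ant").set_index("categoria_ant")["grupo"]
)

# Fill missing groups from the lookup; categories never seen with a group stay NaN
missing = labels["grupo"].isna()
labels.loc[missing, "grupo"] = labels.loc[missing, "categoria_ant"].map(categoria_to_grupo)

# Remaining gaps fall back to the catch-all group, as in the cells below
labels["grupo"] = labels["grupo"].fillna("OTRAS_COBERTURAS")
print(labels)
```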


In [None]:
# Count the number of NaN values in the 'grupo' column
number_of_nans = prelabel['grupo'].isna().sum()

# Print the result
print(f"Number of NaN values in 'grupo' column: {number_of_nans}")


In [None]:
# Identify the unique 'categoria_ant' values where 'grupo' is NaN
missing_categories = prelabel[prelabel['grupo'].isna()]['categoria_ant'].unique()

# Print the missing categories
print("Missing categories that were not mapped:")
print(missing_categories)


In [None]:
# Check if all NaN values in 'categoria_ant' correspond to NaN values in 'grupo'
nan_correspondence_check = prelabel[prelabel['categoria_ant'].isna()]['grupo'].isna().all()

# Print the result of the check
print(f"All NaN values in 'categoria_ant' correspond to NaN values in 'grupo': {nan_correspondence_check}")

# Assign 'B-INUNDABLE' and 'CAMINO' to 'OTRAS_COBERTURAS' if 'grupo' is NaN
prelabel.loc[(prelabel['categoria_ant'].isin(['B-INUNDABLE', 'CAMINO'])) & (prelabel['grupo'].isna()), 'grupo'] = 'OTRAS_COBERTURAS'

# Re-check the number of NaN values in the 'grupo' column after the operation
number_of_nans_after = prelabel['grupo'].isna().sum()

# Print the result
print(f"Number of NaN values in 'grupo' column after the operation: {number_of_nans_after}")


In [None]:
# Fill in the remaining NaN values in 'grupo' with 'OTRAS_COBERTURAS'
# Assign the result back; an in-place fillna on a selected column may not update prelabel
prelabel['grupo'] = prelabel['grupo'].fillna('OTRAS_COBERTURAS')

# Re-check the number of NaN values in the 'grupo' column after the operation
number_of_nans_final = prelabel['grupo'].isna().sum()

# Print the result
print(f"Number of NaN values in 'grupo' column after final operation: {number_of_nans_final}")


In [None]:
prelabel.columns

In [None]:
# Copy the selection so later column assignments do not act on a view of prelabel
selected_columns_df = prelabel[['anho_capa', 'put_id', 'fecha_res', 'grupo', 'geometry']].copy()

In [None]:
output_path = os.path.join(DATA_PATH,'processing')

selected_columns_df['fecha_res'] = selected_columns_df['fecha_res'].astype(str)

# Create the directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

# Define the filename and save the GeoDataFrame as a GeoPackage
filename = os.path.join(output_path, "labeled_dataset.gpkg")
selected_columns_df.to_file(filename, driver="GPKG")

In [None]:
# Perform the spatial difference
difference = limit.overlay(selected_columns_df, how='difference')
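
What the difference overlay returns can be seen on a toy case: the parts of the property limits that no labeled polygon covers, carrying the attributes of the limits layer. The geometries and the `uncovered` name below are made up for illustration.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy layers: one property limit, half of which is covered by a labeled plan
limits = gpd.GeoDataFrame({"put_id": [1]}, geometry=[box(0, 0, 4, 4)])
labeled = gpd.GeoDataFrame({"grupo": ["BOSQUES"]}, geometry=[box(0, 0, 2, 4)])

# Keep only the part of each limit polygon that no labeled polygon covers
uncovered = gpd.overlay(limits, labeled, how="difference")
uncovered["grupo"] = "unclassified"

print(uncovered[["put_id", "grupo"]])
print(uncovered.geometry.area.tolist())
```

Both layers need to share a CRS for the overlay to be meaningful, which in this notebook follows from the earlier reprojection to `limit.crs`.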


In [None]:
# Copy the selection so the buffered geometries can be assigned without acting on a view
difference_subset = difference[(difference['anho_capa'] >= 2000) & (difference['anho_capa'] <= 2012)].copy()


In [None]:
difference_subset.geometry = difference_subset.buffer(-49, join_style= 2)
difference_subset.geometry = difference_subset.buffer(49, join_style= 2)

In [None]:
diff_cols =  difference_subset[['anho_capa', 'put_id', 'fecha_res', 'geometry']]

In [None]:
diff_cols.geometry.is_empty.sum()

In [None]:
# Filter out rows with empty geometries
diff_cols = diff_cols[~diff_cols.geometry.is_empty].copy()
diff_cols.geometry.is_empty.sum()

In [None]:
invalid_diff = diff_cols[~diff_cols.geometry.is_valid]
len(invalid_diff)

In [None]:
diff_cols['grupo'] = 'unclassified'

In [None]:
diff_aligned = diff_cols.reindex(columns=selected_columns_df.columns)

In [None]:
qgis_ready = pd.concat([selected_columns_df, diff_aligned], ignore_index = True)

In [None]:
output_path = os.path.join(DATA_PATH,'processing')


# Create the directory if it doesn't exist
os.makedirs(output_path, exist_ok=True)

# Define the filename and save the GeoDataFrame as a GeoPackage
filename = os.path.join(output_path, "qgis_ready.gpkg")
qgis_ready.to_file(filename, driver="GPKG")

# QGIS

qgis_ready was exported so that polygons that were still overlapping, and for which the choice of which to remove could not be coded, could be removed manually. The majority were simply duplicates of the same polygon, with only a few requiring a decision on which polygons should remain. Those decisions followed the same principle as before: the earliest registered land use plan claims the area, and no later land use plan is allowed over the same area.

From the cleaned file, a dissolved file was created by merging all touching polygons (keeping disjoint features separate) to make individual polygons per year. This file is then brought back in, slivers are cleaned with buffers, and any remaining overlaps are eliminated.
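
The dissolve step was done in QGIS, but a comparable result can be sketched in GeoPandas: dissolve by year, then explode the multipart results so that disjoint areas become separate rows. The toy boxes below are invented, and `index_parts=False` assumes GeoPandas 0.10 or later.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy data: two touching polygons and one disjoint polygon in 2001, one polygon in 2002
cleaned = gpd.GeoDataFrame(
    {"anho_capa": [2001, 2001, 2001, 2002]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(5, 5, 6, 6), box(1, 1, 2, 2)],
)

dissolved = (
    cleaned.dissolve(by="anho_capa")   # union all polygons of the same year
    .explode(index_parts=False)        # split multipart results into single polygons
    .reset_index()                     # bring anho_capa back as a column
)

print(dissolved[["anho_capa"]])
print(dissolved.geometry.area.tolist())
```

Polygons of different years stay separate even when they touch, because the union is computed within each year only.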