## Illogical transitions as a series
- This script takes a batch of vector files from a folder and produces a single output covering the whole series, identifying where illogical transitions happen across the AOI and which ones they are.

- Inputs
    - Path of the folder with the vector files.
    - Path of the headerless csv listing every illogical transition as a (val1, val2) pair.
    - Name of the reference column of the vector files used for rasterization.
    - Path of a headerless csv file containing either:
        - a 1-column table with all the accurate names, or
        - a 2-column table with the values and their corresponding names.
- Outputs
    - Path of the output folder with the final vector file.
    - Path of all the intermediate processing files.
- Processing
    - Read all the vector input files.
    - Optional: Dissolve them to make the rest of the process easier.
    - Check whether the input column is based on values or text.
    - Create the corresponding dictionary for classification.
    - Homogenize the reference column with the dictionary.
    - Create intermediate vector files.
    - Rasterize the reference column.
    - Compare all the years in pairs creating the corresponding illogical files.
        - Raster files with ID values.
        - csv with ID values and pair accumulated values.
    - Create a folder system with all the required iterations.
    - Compare all the files inside the folder system.
    - Get all the generated csvs.
    - Wrap up all the info in a single table.
    - Append the text info to the table.
    - Vectorize the final output.
- Author
    - Rubén Crespo Ceballos
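
The two csv inputs described above can be illustrated with a small parsing sketch. It is not the notebook's own code: the file contents are hypothetical and inlined as strings, the helper names `parse_rules` and `parse_names` are made up for the example, and the column order of the names table (text first, value second) is an assumption taken from how `csv_to_dict` reads its rows.

```python
import csv
import io

# Hypothetical file contents, inlined so the sketch runs without touching disk.
RULES_CSV = "3,12\n19,13\n"                  # val1,val2 per line, no header
NAMES_CSV = "Aeropuertos,3\nArenales,15\n"   # text,value per line, no header (assumed order)

def parse_rules(text):
    """Return the illogical transitions as a set of (val1, val2) float pairs."""
    rules = set()
    for row in csv.reader(io.StringIO(text)):
        if row:  # skip blank lines
            rules.add((float(row[0]), float(row[1])))
    return rules

def parse_names(text):
    """Return {value: text}, reading the text from column 1 and the value from column 2."""
    return {float(row[1]): row[0] for row in csv.reader(io.StringIO(text)) if row}

rules = parse_rules(RULES_CSV)
names = parse_names(NAMES_CSV)
print(rules)
print(names)
```

Storing the rules as a set keeps the later membership test cheap, which matters when it is evaluated once per pixel.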

In [1]:
from osgeo import gdal, ogr
import rasterio
from rasterio import features
import os
from difflib import SequenceMatcher
import numpy as np
import pandas as pd
import geopandas as gpd
import csv

In [None]:
def create_folder_if_not_exists(folder_path):
    """
    Create a folder if it doesn't exist.

    Parameters:
    - folder_path (str): The path of the folder to be created.
    """
    if not os.path.exists(folder_path):
        os.makedirs(folder_path)
        print(f"Folder created at: {folder_path}")
    else:
        print(f"Folder already exists at: {folder_path}")

def get_vector_file_list(path):
    """
    Get a sorted list of the vector files (.shp) inside the folder.

    Parameters:
    - path (str): path of the folder with the resources.

    Returns:
    - file_list (list): full paths of the resources.
    """
    # os.listdir returns names in arbitrary order; sort them so the year pairing is reproducible
    return [os.path.join(path, file) for file in sorted(os.listdir(path)) if file.endswith(".shp")]

def get_raster_file_list(path):
    """
    Get a sorted list of the raster files (.tif, .tiff) inside the folder.

    Parameters:
    - path (str): path of the folder with the resources.

    Returns:
    - file_list (list): full paths of the resources.
    """
    return [os.path.join(path, file) for file in sorted(os.listdir(path)) if file.endswith((".tif", ".tiff"))]

def get_csv_file_list(path):
    """
    Get a sorted list of the csv files inside the folder.

    Parameters:
    - path (str): path of the folder with the resources.

    Returns:
    - file_list (list): full paths of the resources.
    """
    return [os.path.join(path, file) for file in sorted(os.listdir(path)) if file.endswith(".csv")]

def update_names_based_on_similarity(unique_names, gdf, column_name, similarity_threshold=0):
    """
    Update names in gdf based on similarity to the names in a reference list.

    Parameters:
    - unique_names (list): list of the valid (reference) names.
    - gdf (GeoDataFrame): GeoDataFrame whose names need to be updated.
    - column_name (str): name of the column holding the names to match.
    - similarity_threshold (float): minimum similarity ratio (0 to 1) to accept a match.

    Returns:
    - gdf. The same GeoDataFrame, updated in place with the columns
      'valid_text' (best match) and 'simil_val' (its similarity ratio).
    """
    # Add the output columns; rows below the threshold keep the empty values
    gdf['valid_text'] = None
    gdf['simil_val'] = np.nan

    total_elements = len(gdf)  # Get total number of elements

    # Iterate through the rows of gdf
    for position, (index, row) in enumerate(gdf.iterrows(), start=1):

        # Get the value of the column for the current row
        name_gdf = row[column_name]
        highest_similarity_ratio = 0
        best_matching_name = None
        # Iterate through the reference names
        for unique_name in unique_names:
            # Calculate the similarity ratio between the reference name and the row name
            similarity_ratio = SequenceMatcher(None, unique_name, name_gdf).ratio()
            # Update best matching name if similarity ratio is higher
            if similarity_ratio > highest_similarity_ratio:
                highest_similarity_ratio = similarity_ratio
                best_matching_name = unique_name

        if highest_similarity_ratio >= similarity_threshold:
            gdf.at[index, 'valid_text'] = best_matching_name
            gdf.at[index, 'simil_val'] = highest_similarity_ratio

        # Track the progress by position, so it also works with a non-numeric index
        print(f"Processing element {position}/{total_elements}", end="\r")

    return gdf

def create_identifier_dictionary(names):
    """
    Creates a dictionary out of a list of names, assigning a new ID to each one.

    Parameters:
    - names (list): list of strings.

    Returns:
    - value_to_text_dict. dictionary of value: text, with IDs starting at 1 in sorted order.
    """
    value_to_text_dict = {index + 1: value for index, value in enumerate(sorted(names))} # {val: text}
    return value_to_text_dict



def gdal_rasterize_from_shapefile(shapefile_path, resolution, nodata_value, data_type, output_path, cols=None, rows=None):
    """
    Rasterizes a shapefile using GDAL directly, burning the 'raster_val' attribute.

    Parameters:
    - shapefile_path (string): path of the vector file.
    - resolution (int or float): Resolution of the raster (pixel size).
    - nodata_value: The value to use for no-data pixels.
    - data_type: Data type for the output raster (e.g., gdal.GDT_Float32).
    - output_path (str): Path to save the output raster.
    - cols (int, optional): Number of columns in the output raster.
    - rows (int, optional): Number of rows in the output raster.

    Returns:
    - cols, rows (int, int): dimensions of the written raster. The raster itself is saved to output_path.
    """

    # Open the Shapefile using OGR
    shapefile = ogr.Open(shapefile_path)
    layer = shapefile.GetLayer()

    # Get the bounds of the Shapefile. GetExtent returns (xmin, xmax, ymin, ymax)
    xmin, xmax, ymin, ymax = layer.GetExtent()

    # If cols and rows are not provided, calculate them based on resolution
    if cols is None or rows is None:
        cols = int((xmax - xmin) / resolution)
        rows = int((ymax - ymin) / resolution)

    # Create a new raster dataset
    raster_ds = gdal.GetDriverByName('GTiff').Create(
        output_path, cols, rows, 1, data_type,
        options=['COMPRESS=DEFLATE', 'TILED=YES']
    )

    # Set the geotransform (affine transform for the raster)
    geotransform = (xmin, resolution, 0, ymax, 0, -resolution)
    raster_ds.SetGeoTransform(geotransform)

    # Set the CRS (coordinate reference system) from the Shapefile
    srs = layer.GetSpatialRef()
    if srs:
        raster_ds.SetProjection(srs.ExportToWkt())

    # Create the raster band and set no-data value
    band = raster_ds.GetRasterBand(1)
    band.SetNoDataValue(nodata_value)

    # Rasterize the shapefile
    gdal.RasterizeLayer(
        raster_ds,  # Output raster dataset
        [1],        # Raster band to write to
        layer,      # Input OGR layer to rasterize
        options=['ATTRIBUTE=raster_val', 'ALL_TOUCHED=TRUE']
    )

    # Flush and close the raster dataset
    band.FlushCache()
    raster_ds = None  # Close the file and save

    # Close the shapefile
    shapefile = None

    print(f"Rasterization complete: {output_path}")

    return cols, rows

def check_same_dimensions(raster_files):
    """
    Check whether all the input rasters have the same dimensions.

    Parameters:
    - raster_files (list): List of raster files.

    Returns:
    - dimensions_list (list). The list with the (rows, columns) shape of every raster.
    """
    dimensions_list = []

    # Open every raster file in the list and store its shape
    for file_path in raster_files:
        with rasterio.open(file_path) as src:
            dimensions_list.append(src.shape)

    if len(set(dimensions_list)) == 1:
        print(f"All the elements have the same dimensions{dimensions_list[0]}")
    else:
        print("The dimensions are not the same")
    return dimensions_list


def read_csv_in_pairs(csv_file):
    """
    Reads a two column csv and transforms it into a pair value list.

    Parameters:
    - csv_file (str): path of the csv file.

    Returns:
    - rule_values_list: list of unique value pairs.
    """
    rule_values_list = []
    with open(csv_file, 'r') as file:
        # next(file)  # Skip the header row
        for line in file:
            if not line.strip():
                continue  # Skip blank lines
            # Split the line on the comma and convert both values to floats
            rule_value_1, rule_value_2 = map(float, line.split(","))
            rule_values_list.append((rule_value_1, rule_value_2))
    return rule_values_list

def csv_to_dict(file_path):
    """
    Creates a {value: text} dictionary out of a two-column csv (text, value).
    The header skip below is commented out, so the file must not contain a header row.

    Parameters:
    - file_path (string): path of the file.

    Returns:
    - result_dict (dict). dictionary of value: text.
    """
    result_dict = {}

    # Open the CSV file
    with open(file_path, mode='r', newline='', encoding='ISO-8859-1') as file: # Use encoding='utf-8' for UTF-8 files
        reader = csv.reader(file)

        # Skip the header
        # next(reader)

        # Iterate through each row and add to the dictionary
        for row in reader:
            key = row[1]  # Second column as the key (numeric value)
            value = row[0]  # First column as the value (text)
            result_dict[float(key)] = value

    return result_dict

def csv_to_list(file_path):
    """
    Reads a CSV file with one column and transforms it into a list.

    Parameters:
    - file_path (string): Path to the CSV file.

    Returns:
    - return(list): List containing the values from the column
    """
    result = []
    with open(file_path, mode='r', newline='', encoding='ISO-8859-1') as file:
        reader = csv.reader(file)
        for row in reader:
            if row:  # Ensure the row is not empty
                result.append(row[0])
    return result

def generate_consecutive_pairs(paths_list):
    """
    Generates a list of consecutive pairs from the input list in order.

    Parameters:
    - paths_list (list): list of paths.

    Returns:
    - consecutive_pairs: list of consecutive value pairs.
    """
    # Create consecutive pairs using zip
    consecutive_pairs = list(zip(paths_list, paths_list[1:]))
    return consecutive_pairs

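def compare_rasters(raster_pair, output, rule_values_list = None):
    """
    Compares two rasters pixel by pixel and writes a raster with one unique ID per value pair.

    Parameters:
    - raster_pair (list): list of two raster paths.
    - output (path): path of the output folder.
    - rule_values_list (list, optional): (val1, val2) pairs to flag. If None, every pair with a non-zero value is kept.

    Returns:
    - output_loc (str): path of the output file.
    - comparison_df: dataframe with the unique IDs and their value pairs.
    """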
    # Open raster files
    ds1 = gdal.Open(raster_pair[0])
    ds2 = gdal.Open(raster_pair[1])

    if not ds1 or not ds2:
        print("Error: Unable to open raster files.")
        return

    # Check if both rasters have the same height and width
    if ds1.RasterXSize != ds2.RasterXSize or ds1.RasterYSize != ds2.RasterYSize:
        print("Error: Rasters have different dimensions.")
        return

    # Get the first raster information
    width = ds1.RasterXSize
    height = ds1.RasterYSize
    geotransform = ds1.GetGeoTransform()
    projection = ds1.GetProjection()

    # Create output raster
    driver = gdal.GetDriverByName("GTiff")
    layer_1 = os.path.splitext(os.path.basename(raster_pair[0]))[0].split('_')[-1] # Always take the last "_" element, without the extension
    layer_2 = os.path.splitext(os.path.basename(raster_pair[1]))[0].split('_')[-1]
    output_filename = f"illogical_transitions_{layer_1}-{layer_2}.tif" #Customize
    output_loc = os.path.join(output, output_filename)

    output_ds = driver.Create(output_loc, width, height, 1, gdal.GDT_Int16, options= ['COMPRESS=DEFLATE', 'TILED=YES']) # GDT_Int32
    output_ds.GetRasterBand(1).SetNoDataValue(0)
    output_ds.SetGeoTransform(geotransform)
    output_ds.SetProjection(projection)

    output_array = np.zeros((height, width), dtype=np.int16) # int16

    unique_value_dict = {} # pairs : unique_value

    # Loop through each pixel and compare values
    block_size = 256  # Adjust the block size as needed
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            print(f"Comparing pixels at rows/columns ({x},{y}) from ({width}, {height})", end='\r')
            block_width = min(block_size, width - x)
            block_height = min(block_size, height - y)

            block1 = ds1.GetRasterBand(1).ReadAsArray(x, y, block_width, block_height)
            block2 = ds2.GetRasterBand(1).ReadAsArray(x, y, block_width, block_height)

            correct_value = 0

            if rule_values_list:
                if not isinstance(rule_values_list, set):
                    rule_values_list = set(rule_values_list) # Convert only once; set lookups are much faster than list lookups
                for i in range(block_height):
                    for j in range(block_width):
                        value1 = block1[i, j]
                        value2 = block2[i, j]

                        if (value1, value2) in rule_values_list:
                            unique_value = unique_value_dict.setdefault((value1, value2), len(unique_value_dict) + 1)
                            output_array[y + i, x + j] = unique_value
                        else:
                            output_array[y + i, x + j] = correct_value
            else:
                for i in range(block_height):
                    for j in range(block_width):
                        value1 = block1[i, j]
                        value2 = block2[i, j]

                        if value1 or value2:
                            unique_value = unique_value_dict.setdefault((value1, value2), len(unique_value_dict) + 1)
                            output_array[y + i, x + j] = unique_value
                        else:
                            output_array[y + i, x + j] = correct_value


    # Write the output
    output_ds.GetRasterBand(1).WriteArray(output_array)

    # Close datasets
    ds1 = None
    ds2 = None
    output_ds = None

    # Create the dataframe
    unique_values = list(unique_value_dict.values())
    value_pairs = list(unique_value_dict.keys())

    comparison_df = pd.DataFrame({
        'unival': unique_values,
        f'{layer_1}': [pair[0] for pair in value_pairs],
        f'{layer_2}': [pair[1] for pair in value_pairs]
                                })

    # Lets export the csv just in case
    comparison_df.to_csv(os.path.join(output, f"illo_tab_{layer_1}-{layer_2}.csv"), index=False)

    print("Finished with: ", output_loc)
    return output_loc, comparison_df

def vectorize_raster(raster_path, output_path, df):
    """
    Vectorizes an input raster.

    Parameters:
    - raster_path (str): path of the input raster.
    - output_path (path): path of the output file.
    - df (dataframe): dataframe related to the input raster.

    Returns:
    - None. It produces the vector file.
    """
    # Open the raster file
    with rasterio.open(raster_path) as src:
        # Read raster data as numpy array
        data = src.read(1)
        # Get the affine transform and the CRS while the dataset is still open
        transform = src.transform
        crs = src.crs

    # Vectorize the raster data
    raster_shapes = features.shapes(data, transform=transform)

    # Convert the vectorized shapes into a GeoDataFrame
    gdf = gpd.GeoDataFrame.from_features(
        [
            {"geometry": geo_shape, "properties": {"ID": value}}
            for geo_shape, value in raster_shapes
        ],
        crs=crs
    )

    # Merge the dataframe with the GeoDataFrame
    merged_gdf = gdf.merge(df, on='ID')
    # Dissolve based on a column value
    dissolved_gdf = merged_gdf.dissolve(by='ID')

    # Save the merged GeoDataFrame as a new shapefile
    filename = os.path.join(output_path, "illogical_transitions")
    dissolved_gdf.to_file(f'{filename}.shp', driver='ESRI Shapefile')
    return

def add_name_columns_to_dataframe(df, names_dictionary):
    """
    Adds a text column with the corresponding name for every value column of the dataframe.

    Parameters:
    - df (dataframe): dataframe related to the input raster.
    - names_dictionary (dict): dict of the names that we want to append.

    Returns:
    - df (dataframe). It produces the updated dataframe with the names.
    """
    # Get the original columns (without including any new columns)
    original_columns = df.columns.tolist()

    for column in original_columns[1:]: # Skip the first column
        df[column + "txt"] = df[column].map(names_dictionary)
        # Alternative
        # column_list = column.split("_") # Example 2019_1
        # df[column_list[0] + "_text_" + column_list[1]] = df[column].map(names_dictionary)

    # Sort the columns
    # order_columns = [df.columns[0]] + sorted(df.columns[1:])
    # df = df[order_columns]

    return df

def dissolve_geodataframe(gdf, column):
    """
    Dissolves a GeoDataFrame based on unique values of a specified column.

    Parameters:
    gdf (GeoDataFrame): Input GeoDataFrame to be dissolved.
    column (str): Column name based on which to dissolve the GeoDataFrame.

    Returns:
    GeoDataFrame: The dissolved GeoDataFrame.
    """

    print(f"Initial number of geometries: {len(gdf)}")

    # Perform the dissolve operation
    dissolved_gdf = gdf.dissolve(by=column)

    print(f"Final number of geometries after dissolve: {len(dissolved_gdf)}")

    return dissolved_gdf

def process_folders(iteration_folders, idx=0, df=None, aggregated_df=None):
    """
    Recursively processes a series of folders containing CSV files to merge data based on shared columns.

    Parameters:
    iteration_folders (list): A list of folder paths to iterate through, each containing CSV files.
    idx (int): The current folder index being processed. Defaults to 0.
    df (DataFrame): The current DataFrame being merged. Defaults to None for the first iteration.
    aggregated_df (DataFrame): The cumulative aggregated DataFrame. Defaults to None for the first iteration.

    Returns:
    DataFrame: The final aggregated DataFrame containing merged data from all folders.

    Functionality:
    - Reads the initial CSV from the first folder and initializes the aggregation process.
    - Iterates through columns in the current DataFrame (`df`), identifying corresponding CSV files in the next folder.
    - Merges data from matching files into the current DataFrame using inner joins, while retaining unmatched rows with outer joins.
    - Adds unique, incrementing suffixes to column names to avoid duplicates.
    - Recursively processes subsequent folders, updating `aggregated_df` with new merged data at each step.
    - Ensures columns are reordered to keep 'ID' first and all other columns sorted alphabetically.
    - Stops when all folders are processed.
    """
    if idx >= len(iteration_folders) - 1:
        return aggregated_df

    current_folder = iteration_folders[idx]
    next_folder = iteration_folders[idx + 1]

    if df is None and aggregated_df is None:
        # Read the initial csv
        csv_files = [f for f in os.listdir(current_folder) if f.endswith('.csv')]
        file_path = os.path.join(current_folder, csv_files[0])
        df = pd.read_csv(file_path)

        # Create the aggregated_df
        df = df.rename(columns={'unival': 'ID'})
        unival_series = df['ID']
        aggregated_df = pd.DataFrame(unival_series.to_frame())

    column_names = df.columns[1:].tolist() # we dont want the first column

    counter = 1  # Restart the counter at 1 for each folder iteration

    for column_name in column_names:
        # The next iterations names come with a suffix. Remove it for searching
        if "_" in column_name:
            base_column_name = column_name.split("_")[0]
        else:
            base_column_name = column_name

        # Look for the matching CSV in the next folder, search and open the csv
        csv_files = [f for f in os.listdir(next_folder) if f.endswith('.csv')]
        matching_csvs = [file for file in csv_files if file.startswith(base_column_name)]
        if not matching_csvs:
            raise FileNotFoundError(f"No matching file found for {base_column_name} in {next_folder}")
        matching_path = os.path.join(next_folder, matching_csvs[0]) # we expect only one element
        match_df = pd.read_csv(matching_path)

        # Add a suffix to the columns so we can differentiate them. Increment by the number of columns added to match_df (excluding the first column)
        match_df.columns = [match_df.columns[0]] + [f'{col}_{suffix}' for suffix, col in enumerate(match_df.columns[1:], start=counter)]
        counter += len(match_df.columns) - 1

        # Merge the data (internal)
        df = df.rename(columns={column_name: base_column_name}) # Rename the column to do the merge.
        merged_df = pd.merge(df, match_df, left_on=base_column_name, right_on="unival", how='outer') #outer means that we keep all the combinations

        # THIS IS OPTIONAL: Fill NAN to 0.
        merged_df = merged_df.fillna(0)

        # To surpass duplicate names
        merged_df = merged_df.drop(['unival', base_column_name], axis=1)
        merged_df = merged_df.drop(columns=column_names, errors='ignore')

        df = df.rename(columns={base_column_name: column_name}) # We put the name back

        # Merge with aggregated_df (external)
        aggregated_df = pd.merge(merged_df, aggregated_df, on='ID', how='inner')

    # Drop columns that were used for merging in the current iteration
    aggregated_df = aggregated_df.drop(columns=column_names, errors='ignore')

    # Reorder columns after the first column
    order_columns = [aggregated_df.columns[0]] + sorted(aggregated_df.columns[1:], key=lambda x: int(x.split('_')[1])) # Remove ", key=..." if any issue arises
    aggregated_df = aggregated_df[order_columns]
    return process_folders(iteration_folders, idx=idx + 1, df=aggregated_df, aggregated_df=aggregated_df)

In [None]:
"""Specify all the inputs"""
# Path for the vector inputs
input_path = r"Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing"

# Path for the final output
output_path = r"Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing\output_path"

# Path of the illogical rules csv
rule_table_path = r"Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\illogical_transitions.csv"

#Path of all the intermediate outputs (Don't touch this)
output_tmp_path = input_path + r"\output_tmp_path"

#Create the paths
create_folder_if_not_exists(output_tmp_path)
create_folder_if_not_exists(output_path)

"""There are two possible situations here. Comment out the one that does not apply"""

# If we have only strings with no values. We append a created value.
names_list_path = r"Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing\names_list.csv"
names_list = csv_to_list(names_list_path)
names_dictionary = create_identifier_dictionary(names_list) # The output will always be number: text 

# If we have both strings and values
names_dictionary_path = r"Y:\z_resources\ruben\landcover_vector_files_copy\names_dictionary.csv" #It has headers
names_dictionary = csv_to_dict(names_dictionary_path) # {Code: label}
names_list = list(names_dictionary.values())

vector_file_list = get_vector_file_list(input_path)


"""For rasterization"""
# Specify the column to rasterize by.
column_name = 'leyenda'
# Define the resolution of your raster.
resolution = 30  # in meters
# Define the nodata value of your raster.
nodata_value = 0
# Define the data type of the raster.
data_type = gdal.GDT_UInt16
"""
gdal.GDT_Byte,
gdal.GDT_Int16,
gdal.GDT_UInt16,
gdal.GDT_Int32,
gdal.GDT_UInt32,
gdal.GDT_Float32,
gdal.GDT_Float64
"""
# Define the rows and columns for the rasterization (All the files must have the same dimensions).
# You can put the parameters here manually, or let the system take the reference of the first rasterization.
rows = None
columns = None


Folder already exists at: Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing\output_tmp_path
Folder created at: Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing\output_path
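
The names list loaded above is what the notebook later uses to homogenize the reference column. As a minimal sketch of that idea, assuming the same `difflib.SequenceMatcher` ratio and the 0.3 threshold used further down, the hypothetical helper `best_match` below picks the closest valid name for a single string. The misspelt input is invented for the example.

```python
from difflib import SequenceMatcher

def best_match(name, valid_names, threshold=0.3):
    """Return (best_name, ratio); best_name is None when no ratio reaches the threshold."""
    best_name, best_ratio = None, 0.0
    for candidate in valid_names:
        ratio = SequenceMatcher(None, candidate, name).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = candidate, ratio
    if best_ratio < threshold:
        return None, best_ratio
    return best_name, best_ratio

valid = ["Aeropuertos", "Arenales", "Arbustal"]
match, ratio = best_match("Aeropuerto", valid)  # hypothetical misspelt entry
print(match, round(ratio, 2))
```

A low threshold such as 0.3 accepts almost any candidate, so it is worth inspecting the stored ratio (the `simil_val` column in the notebook) before trusting the matches.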


In [None]:
"""OPTIONAL: For heavy vector files, we are going to dissolve based on the designed column, to simplify the rest of the process"""
dissolved_files_path = output_tmp_path + r"\dissolved_files"
create_folder_if_not_exists(dissolved_files_path)

for file in vector_file_list[:]:
    gdf = gpd.read_file(file)
    print("gdf opened")
    gdf = dissolve_geodataframe(gdf, column_name)
    gdf.to_file(os.path.join(dissolved_files_path, os.path.basename(file).replace(".shp", "_dissolved.shp")) , driver='ESRI Shapefile')
    
# Read the new inputs
vector_file_list = get_vector_file_list(dissolved_files_path)

In [None]:
"""Create the "raster_val" for rasterization according to the content of the column"""
optimized_vector_files_path = output_tmp_path + r"\optimized_vector_files"
create_folder_if_not_exists(optimized_vector_files_path)

# Create a list to store filtered DataFrames
filtered_dfs = []

# names_dictionary is {value: text}; invert it to look up the value of each matched name
text_to_value = {text: value for value, text in names_dictionary.items()}

for file in vector_file_list[:]:
    gdf = gpd.read_file(file)
    print(f"gdf {os.path.basename(file)} opened")
    if gdf[column_name].dtype == object:
        print(f"The column '{column_name}' contains strings.")
        gdf = update_names_based_on_similarity(names_list, gdf, column_name, similarity_threshold=0.3)
        print("names updated")
        # Add a new column to the GeoDataFrame containing the unique identifiers
        gdf['raster_val'] = gdf["valid_text"].map(text_to_value)

        """Wrap the info into a df.""" # ACHTUNG: Pending to test
        df = gdf.drop(columns=["geometry"])

        # Get unique rows based on the "valid_text" column
        unique_valid_texts = df["valid_text"].unique()

        for value in unique_valid_texts:
            # Filter rows with the current unique value
            filtered_df = df[df["valid_text"] == value]
            filtered_dfs.append(filtered_df)

    else:
        print(f"The column '{column_name}' contains values.")
        # Get unique values from the specified column, they are always sorted.
        unique_values = sorted(gdf[column_name].unique())
        # Create a dictionary one to one
        value_to_index = {value: value for value in unique_values}
        gdf['raster_val'] = gdf[column_name].map(value_to_index)

    # Opening and saving the file takes a lot of time
    gdf.to_file(os.path.join(optimized_vector_files_path, os.path.basename(file).replace(".shp", "_optimized.shp")) , driver='ESRI Shapefile')

# Saving the csv. It only exists when at least one file had a string column.
if filtered_dfs:
    # Merge all filtered DataFrames into one
    merged_df = pd.concat(filtered_dfs, ignore_index=True)
    merged_df.to_csv(os.path.join(optimized_vector_files_path, 'optimited.csv'), index=False)

# Read the new inputs
vector_file_list = get_vector_file_list(optimized_vector_files_path)


In [None]:
"""Transform the vector files and convert them into rasters"""
rasterized_files_path = output_tmp_path + r"\rasterized_files_path"
create_folder_if_not_exists(rasterized_files_path)

# Use the manual dimensions if given; otherwise the first rasterization sets them for the rest
cols = columns
for file in vector_file_list[:]:
    output_path_file = os.path.join(rasterized_files_path, os.path.basename(file).replace(".shp", ".tif"))
    print(f"{os.path.basename(file)} opened")
    print("Creating" , output_path_file)
    cols, rows = gdal_rasterize_from_shapefile(file, resolution, nodata_value, data_type, output_path_file, cols=cols, rows=rows)


Y:\z_resources\ruben\landcover_vector_files_copy\output_files\cobertura_tierra_2020_epsg3316_100k_2020_dissolved_optimized.tif
Rasterization complete: Y:\z_resources\ruben\landcover_vector_files_copy\output_files\cobertura_tierra_2020_epsg3316_100k_2020_dissolved_optimized.tif
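
The raster grid used in the rasterization step follows from the layer extent and the pixel size. The sketch below works through that arithmetic for a hypothetical extent. `raster_grid` is an illustrative helper, not a notebook function, and it rounds up with `math.ceil` so the grid covers the whole extent, whereas the notebook truncates with `int()`.

```python
import math

def raster_grid(xmin, xmax, ymin, ymax, resolution):
    """Return (cols, rows, geotransform) for a north-up grid covering the extent."""
    cols = math.ceil((xmax - xmin) / resolution)
    rows = math.ceil((ymax - ymin) / resolution)
    # GDAL geotransform: (origin x, pixel width, 0, origin y, 0, negative pixel height)
    geotransform = (xmin, resolution, 0, ymax, 0, -resolution)
    return cols, rows, geotransform

# Hypothetical extent in metres with a 30 m pixel
cols_demo, rows_demo, gt_demo = raster_grid(0.0, 1000.0, 0.0, 620.0, 30)
print(cols_demo, rows_demo, gt_demo)
```

Because every layer computes its own origin from its own extent, reusing the same `cols` and `rows` only guarantees equal shapes. Pixel alignment between years additionally requires that the layers share the same extent.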


In [None]:
"""Check the dimensions of the rasters are all the same"""
raster_list = get_raster_file_list(rasterized_files_path)
check_same_dimensions(raster_list)

All the elements have the same dimensions(9670, 13385)
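
With equal dimensions confirmed, two years can be compared cell by cell. The following plain-Python sketch mirrors the ID assignment used in `compare_rasters` (a `setdefault` call on a dictionary of pairs) on two tiny, hypothetical grids. `flag_transitions` is an illustrative name and the rule set is invented for the example.

```python
def flag_transitions(grid_a, grid_b, rules):
    """Return (id_grid, pair_ids): one ID per distinct illogical (a, b) pair, 0 elsewhere."""
    pair_ids = {}
    id_grid = []
    for row_a, row_b in zip(grid_a, grid_b):
        out_row = []
        for a, b in zip(row_a, row_b):
            if (a, b) in rules:
                # Reuse the ID of a known pair, or assign the next free one
                out_row.append(pair_ids.setdefault((a, b), len(pair_ids) + 1))
            else:
                out_row.append(0)
        id_grid.append(out_row)
    return id_grid, pair_ids

year_1 = [[3, 3], [19, 15]]
year_2 = [[12, 3], [13, 19]]
demo_rules = {(3, 12), (19, 13)}  # hypothetical illogical transitions
ids, pairs = flag_transitions(year_1, year_2, demo_rules)
print(ids)
print(pairs)
```

The dictionary plays the role of the exported csv: it records which pair of class values each ID in the output grid stands for.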


In [None]:
"""Create the illogical transitions 1 to 1"""

illogical_path = output_tmp_path + r"\illogical_files"
create_folder_if_not_exists(illogical_path)

raster_list = get_raster_file_list(rasterized_files_path)
all_pairs_path_list = generate_consecutive_pairs(raster_list)

# Read rule table and create a list of pairs with the info
rule_values_list = read_csv_in_pairs(rule_table_path)

for raster_pair in all_pairs_path_list[:]:
    compare_rasters(raster_pair, illogical_path, rule_values_list)

In [None]:
"""Create the folder system"""
# Delete this when finishing
illogical_path = output_tmp_path + r"\illogical_files"
raster_list = get_raster_file_list(illogical_path)

# Generate a folder per each iteration
number_of_iterations = len(raster_list) #  - 1
iteration_folders = [illogical_path]
for iteration in range(1, number_of_iterations, 1):
    iteration_ouput = os.path.join(output_tmp_path, f"iteration_files_{iteration}")
    create_folder_if_not_exists(iteration_ouput)
    iteration_folders.append(iteration_ouput)

Folder already exists at: Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing\output_tmp_path\iteration_files_1
Folder already exists at: Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing\output_tmp_path\iteration_files_2
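
The folder system reflects how consecutive pairing shrinks the series by one element per level until a single file remains. A minimal sketch of that reduction, using hypothetical year labels instead of raster paths (`pyramid_levels` is an illustrative helper):

```python
def consecutive_pairs(items):
    """Pair each element with its successor, as generate_consecutive_pairs does."""
    return list(zip(items, items[1:]))

def pyramid_levels(layers):
    """Return every level, from the input layers down to a single combined element."""
    levels = [list(layers)]
    while len(levels[-1]) > 1:
        pairs = consecutive_pairs(levels[-1])
        # Join the labels only to show what each combined element contains
        levels.append([f"{a}-{b}" for a, b in pairs])
    return levels

levels = pyramid_levels(["2010", "2015", "2020", "2025"])
for level in levels:
    print(len(level), level)
```

Under these assumptions, four years give three pairwise comparisons, then two, then one. The last label lists the years in the order 2010, 2015, 2015, 2020, 2015, 2020, 2020, 2025, which is the same order as the value columns of the final table.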


In [None]:
"""Do the comparison by each level"""
for folder in range(len(iteration_folders) - 1): # We don't want to iterate the last one.
    current_folder = iteration_folders[folder]
    next_folder = iteration_folders[folder + 1]
   
    print(f"Working in folder: {current_folder}")
    raster_list = get_raster_file_list(current_folder)

    all_pairs_path_list = generate_consecutive_pairs(raster_list)
    for raster_pair in all_pairs_path_list[:]:
        compare_rasters(raster_pair, next_folder, rule_values_list = None)

Wroking in folder: Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing\illogical_files
Wroking in folder: Y:\z_resources\im-nca-senegal\v2_shp_occsol_anat\23-12-22\shp_occsol_anat\testing\iteration_files_1
Comparing pixels at rows/columns (13312,9472) from (13385, 9670)
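
The nested per-pixel loops in `compare_rasters` can be slow on rasters of this size. One possible alternative, offered as a sketch rather than a drop-in replacement, encodes a whole block at once with NumPy. `encode_pairs` is a hypothetical helper that assumes integer class codes, and it numbers the pairs in sorted order instead of first-seen order, so its IDs will not match those of the loop version.

```python
import numpy as np

def encode_pairs(block1, block2, rules=None):
    """Return (ids, pair_ids): an ID array shaped like the inputs and a {(v1, v2): id} dict."""
    a = np.asarray(block1)
    b = np.asarray(block2)
    flat_a, flat_b = a.ravel(), b.ravel()
    if rules:
        rule_arr = np.array(sorted(rules))
        # True where the (v1, v2) pixel pair equals any rule pair
        keep = ((flat_a[:, None] == rule_arr[:, 0]) & (flat_b[:, None] == rule_arr[:, 1])).any(axis=1)
    else:
        # No rules: keep every pair with at least one non-zero value
        keep = (flat_a != 0) | (flat_b != 0)
    ids = np.zeros(flat_a.shape, dtype=np.int32)
    pair_ids = {}
    if keep.any():
        pairs = np.stack([flat_a[keep], flat_b[keep]], axis=1)
        unique_pairs, inverse = np.unique(pairs, axis=0, return_inverse=True)
        ids[keep] = inverse.ravel() + 1  # IDs start at 1; 0 stays as nodata
        pair_ids = {(int(v1), int(v2)): i + 1 for i, (v1, v2) in enumerate(unique_pairs)}
    return ids.reshape(a.shape), pair_ids

block_a = [[3, 3], [19, 15]]
block_b = [[12, 3], [13, 19]]
ids_rules, pairs_rules = encode_pairs(block_a, block_b, rules={(3, 12), (19, 13)})
ids_all, pairs_all = encode_pairs(block_a, block_b)
print(ids_rules.tolist(), pairs_rules)
print(ids_all.tolist(), pairs_all)
```

Applied block by block, the pair dictionary would have to be shared across blocks to keep IDs consistent over the whole raster. The sketch leaves that bookkeeping out.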

In [None]:
"""Reverse the folder list"""
# This is done in a separate cell so the list is not reversed again when re-executing other cells
iteration_folders.reverse()

In [84]:
"""Create the aggregated table and build it"""
aggregated_df = process_folders(iteration_folders[:])
aggregated_df = add_name_columns_to_dataframe(aggregated_df, names_dictionary)

In [86]:
"""Vectorize the final raster and join the aggregated table"""
final_raster = get_raster_file_list(iteration_folders[0])[0] # There is only one file
vectorize_raster(final_raster, output_path, aggregated_df)

In [85]:
aggregated_df

Unnamed: 0,ID,2010_1,2015_2,2015_3,2020_4,2015_5,2020_6,2020_7,2025_8,2010_1txt,2015_2txt,2015_3txt,2020_4txt,2015_5txt,2020_6txt,2020_7txt,2025_8txt
0,1,0.0,0.0,0.0,0.0,0.0,0.0,3.0,12.0,,,,,,,Aeropuertos,Arbustal abierto esclerófilo
1,64,3.0,12.0,0.0,0.0,0.0,0.0,3.0,12.0,Aeropuertos,Arbustal abierto esclerófilo,,,,,Aeropuertos,Arbustal abierto esclerófilo
2,2,19.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,Bosque abierto alto,Arbustal abierto mesófilo,,,,,,
3,5,15.0,19.0,0.0,0.0,0.0,0.0,0.0,0.0,Arenales,Bosque abierto alto,,,,,,
4,7,3.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,Aeropuertos,Arbustal abierto mesófilo,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
258,248,0.0,0.0,0.0,0.0,0.0,0.0,10.0,8.0,,,,,,,Arbustal,Algodón
259,250,0.0,0.0,0.0,0.0,0.0,0.0,4.0,20.0,,,,,,,Afloramientos rocosos,Bosque abierto alto de tierra firme
260,251,0.0,0.0,0.0,0.0,0.0,0.0,1.0,16.0,,,,,,,Aeropuerto con infraestructura asociada,Arracachal
261,257,0.0,0.0,0.0,0.0,0.0,0.0,10.0,7.0,,,,,,,Arbustal,Ajonjolí
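
In the table above, the `txt` columns are empty wherever the value is 0.0, because 0 is the nodata value and has no entry in the names dictionary. A small sketch of that lookup on a single row, using two codes taken from the table (`label_row` is a hypothetical helper that works on plain dictionaries instead of a DataFrame):

```python
# Subset of the names dictionary, taken from the rows shown above
names_subset = {3: "Aeropuertos", 12: "Arbustal abierto esclerófilo"}

def label_row(row, names):
    """Add a '<column>txt' entry per value column; codes without a name (such as 0) give None."""
    labelled = dict(row)
    for column, code in row.items():
        if column != "ID":
            labelled[column + "txt"] = names.get(int(code))
    return labelled

row = {"ID": 1, "2010_1": 0.0, "2020_7": 3.0, "2025_8": 12.0}
labelled = label_row(row, names_subset)
print(labelled)
```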
