## Overview

### Structure

The pipeline has two stages:

Labels (y): Download the city GML files -> Convert to Feather files -> Divide into grids -> Calculate UMP for each grid -> Save as y

Imagery (X): Loop through each grid -> Download the Sentinel imagery -> Store and write as a tensor
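
The "Divide into grids" step is not implemented in this notebook yet. As a rough sketch only, assuming square cells measured in the units of the projected CRS (metres for EPSG:3857), a bounding box could be split into cells as below. The helper name `make_grid_bounds` and the cell size of 1000 are hypothetical placeholders, not values taken from the pipeline.

```python
import math

def make_grid_bounds(minx, miny, maxx, maxy, cell_size):
    """Split a bounding box into square cells (hypothetical helper).

    Returns a list of (minx, miny, maxx, maxy) tuples, row by row starting
    from the bottom-left corner. Cells on the top and right edges may
    extend past the bounding box.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    n_cols = math.ceil((maxx - minx) / cell_size)
    n_rows = math.ceil((maxy - miny) / cell_size)
    cells = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = minx + col * cell_size
            y0 = miny + row * cell_size
            cells.append((x0, y0, x0 + cell_size, y0 + cell_size))
    return cells

# Example: a 2500 x 1000 box split into cells of size 1000 (placeholder size)
cells = make_grid_bounds(0, 0, 2500, 1000, 1000)
print(len(cells))
print(cells[-1])
```

Each tuple can be passed to `shapely.geometry.box(*bounds)` to get a polygon for selecting the buildings that fall inside a cell.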

### Datasets

- X:
    - Sentinel
- Y:
    - Tokyo (Japan, 2021): https://www.geospatial.jp/ckan/dataset/plateau-tokyo23ku/resource/0bab2b7f-6962-41c8-872f-66ad9b40dcb1?inner_span=True
    - Osaka (Japan, 2021): same source as Tokyo
    - New York (USA, 2019): https://github.com/opencitymodel/opencitymodel

## Import Libraries

In [1]:
import geopandas as gpd
import pandas as pd
from glob import glob
import fiona
import os.path
from multiprocessing import Pool
from itertools import repeat

## Convert GML to Feather

### Method 1

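Convert all of the GML files in a folder into a single Feather file. Each GML file is first converted to a temporary Feather file in parallel, and the temporary files are then merged.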

In [5]:
def gml_to_feather(in_path, out_path, mode= None, log_name= "gml_convert", src_crs= "EPSG:6668", tgt_src= "EPSG:3857"):
    """
    Takes in a gml file and outputs it as a feather file\n
    W/R with feather files is much faster and takes up much less space than using shp files\n
    # Parameters:\n
    - in_path: The path for the gml file\n
    - out_path: The output path for the feather file, should end with .feather\n
    - mode: 
        - 'o' = overwrites any file at output path, \n
        - None = raises error if file already exists\n
    - log_name: Name of the log file (logs/<log_name>.txt) listing the files that failed, None disables logging\n
    - src_crs: Source projection\n
    - tgt_src: Target projection\n
    # Returns:\n
    0 on success, otherwise the number of buildings in the file that was skipped\n
    """
    # Extracts features
    with fiona.open(in_path, 'r') as src:
        features = list(src)

    # Converts and places it in geopandas format
    # Some gml files lack the measuredHeight column; those files are logged and skipped
    gdf = gpd.GeoDataFrame.from_features(features)
    try:
        gdf = gdf[['measuredHeight', 'geometry']].rename(columns={'measuredHeight':'height'})
    except KeyError as e:
        print(f"{e}: {os.path.basename(in_path)}")
        if log_name is not None:
            # exist_ok avoids a race between worker processes creating the folder
            os.makedirs("logs", exist_ok= True)
            with open(f"logs/{log_name}.txt", "a") as f:
                f.write(in_path + "\n")
        return len(gdf)

    # Remove the NaN values
    gdf = gdf.dropna().reset_index(drop= True)

    # Convert it to the correct projection and strip it from multi polygon to polygon

    # Indexing the exploded parts raises an error for some files; those files are logged and skipped
    # Most of the extra parts are negligible, hence we will only be taking the first one
    try:
        gdf = gdf.explode(index_parts= True).set_crs(src_crs).to_crs(tgt_src).loc[(slice(None), slice(0)), :].reset_index(drop= True)
    except Exception as e:
        print(f"{e}: {os.path.basename(in_path)}")
        if log_name is not None:
            # exist_ok avoids a race between worker processes creating the folder
            os.makedirs("logs", exist_ok= True)
            with open(f"logs/{log_name}.txt", "a") as f:
                f.write(in_path + "\n")
        return len(gdf)


    # Convert coordinates from 3D to 2D (drops the z value)
    # from_wkb does not carry the CRS over, so it is passed explicitly
    gdf_geometry = gpd.GeoSeries.from_wkb(gdf.to_wkb(output_dimension= 2)["geometry"], crs= gdf.crs)
    gdf.drop(["geometry"], axis= 1, inplace= True)
    gdf = gpd.GeoDataFrame(gdf, geometry= gdf_geometry)

    # Make sure the parent directory exists
    # exist_ok avoids a race between worker processes creating the same folder
    out_dir = os.path.dirname(out_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok= True)

    # Outputs to the desired path
    # Feather files cannot be appended to, so an existing file is either overwritten or an error is raised
    if os.path.exists(out_path) and mode != "o":
        raise FileExistsError("Output path already exists")
    gdf.to_feather(out_path)

    return 0

def batch_gml_to_feather(in_dir, out_path, n_processes= 12, log_name= None, mode= None, src_crs= "EPSG:6668", tgt_src= "EPSG:3857"):
    """
    Converts every gml file in in_dir to a temporary feather file in parallel\n
    and merges the temporary files into a single feather file at out_path\n
    """

    # Get all the paths of the gml files
    in_paths = glob(f"{in_dir}/*.gml")
    print("Total input files:", len(in_paths))

    # Convert each gml file to a temporary feather file in in_dir/temp, one file per task
    with Pool(processes= n_processes) as pool:
        r = pool.starmap(
            gml_to_feather, 
            zip(in_paths, 
                [f'{in_dir}/temp/{os.path.basename(path).replace(".gml", ".feather")}' for path in in_paths], 
                repeat(mode), 
                repeat(log_name),
                repeat(src_crs),
                repeat(tgt_src)))

    # Each result is 0 on success, or the number of buildings in a file that was skipped
    print(f"There are {sum(r)} invalid buildings from {sum(x > 0 for x in r)} files")

    # Get all the paths of the temporary feather files
    in_paths = glob(f"{in_dir}/temp/*.feather")
    print("Total files to merge:", len(in_paths))

    # Merge the temporary files into a single feather file
    gdfs = [gpd.read_feather(in_path) for in_path in in_paths]
    gdf = gpd.GeoDataFrame(pd.concat(gdfs)).reset_index(drop= True)
    gdf.to_feather(out_path)

    # Remove the temporary files
    for temp_file in in_paths:
        os.remove(temp_file)

    return gdf
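
`Pool.starmap` unpacks each tuple produced by `zip(...)` into the positional arguments of `gml_to_feather`, and `repeat(...)` supplies the same value for every file. The sketch below only shows how those tuples are assembled, without starting any worker processes. `build_tasks` is a hypothetical helper and the input paths are made-up placeholders.

```python
import os.path
from itertools import repeat

def build_tasks(in_paths, in_dir, mode, log_name, src_crs, tgt_src):
    """Build one argument tuple per gml file, in the same order as the
    gml_to_feather parameters (hypothetical helper)."""
    out_paths = [
        f'{in_dir}/temp/{os.path.basename(path).replace(".gml", ".feather")}'
        for path in in_paths
    ]
    # zip stops at the shortest iterable, so the endless repeat() iterators are safe
    return list(zip(in_paths, out_paths, repeat(mode), repeat(log_name),
                    repeat(src_crs), repeat(tgt_src)))

# Placeholder paths, for illustration only
tasks = build_tasks(
    ["data/city/a.gml", "data/city/b.gml"],
    "data/city", "o", "city", "EPSG:6668", "EPSG:3857",
)
for task in tasks:
    print(task)
```

Because the arguments are matched by position, the order inside `zip(...)` has to follow the parameter order of `gml_to_feather` exactly.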

### Tokyo

In [6]:
in_dir = "data/13100_tokyo23-ku_2020_citygml_3_2_op/udx/bldg"
out_path = "data/full_Tokyo_plateau/tokyo_full.feather"

batch_gml_to_feather(in_dir, out_path, mode= "o", log_name= "tokyo")

Total input files: 671
cannot do slice indexing on Index with these indexers [0] of type int: 53392642_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53392641_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53392633_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53393631_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53392663_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53392653_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53393671_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53392651_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53392662_bldg_6697_2_op.gml
cannot do slice indexing on Index with these indexers [0] of type int: 53393683_bldg_6
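
Files that fail conversion are appended to `logs/<log_name>.txt`, one path per line. Since the log is opened in append mode, the same path can show up more than once across reruns. A minimal sketch for listing the distinct failed files follows; `read_failed_files` is a hypothetical helper and the demo log content is made up.

```python
import os
import tempfile

def read_failed_files(log_path):
    """Return the distinct file paths recorded in a conversion log,
    in first-seen order (hypothetical helper)."""
    seen = {}
    with open(log_path) as f:
        for line in f:
            path = line.strip()
            if path:
                seen.setdefault(path, None)
    return list(seen)

# Demo with a throwaway log file; the paths are placeholders
with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "demo.txt")
    with open(log_path, "w") as f:
        f.write("data/a.gml\ndata/b.gml\ndata/a.gml\n")
    failed = read_failed_files(log_path)

print(failed)
```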

### Osaka

In [4]:
in_dir = "data/osaka/udx/bldg"
out_path = "data/osaka/osaka_full.feather"

batch_gml_to_feather(in_dir, out_path, mode= "o", log_name= "osaka")

Total input files: 269
"['measuredHeight'] not in index": 51357370_bldg_6697_op.gml


KeyboardInterrupt: 

### New York

In [None]:
in_dir = "data/NewYork_2"
out_path = "data/NewYork_2/new_york.feather"
src_crs = "EPSG:4326"

batch_gml_to_feather(in_dir, out_path, mode= "o", log_name= "new_york", src_crs= src_crs)

Total: 170
There are 0 invalid buildings from 170 files
Total: 170


| | height | geometry |
|---|---|---|
| 0 | 5.73 | POLYGON ((-8209085.418 5566790.279, -8209067.0... |
| 1 | 5.73 | POLYGON ((-8209070.390 5566764.459, -8209066.2... |
| 2 | 4.38 | POLYGON ((-8209005.825 5566671.511, -8208999.0... |
| 3 | 5.73 | POLYGON ((-8209225.236 5566628.949, -8209213.3... |
| 4 | 5.73 | POLYGON ((-8209101.448 5566623.003, -8209085.8... |
| ... | ... | ... |
| 14994 | 4.38 | POLYGON ((-8215328.048 5027787.294, -8215324.3... |
| 14995 | 4.74 | POLYGON ((-8215264.930 5027796.217, -8215262.7... |
| 14996 | 5.17 | POLYGON ((-8215139.829 5027478.327, -8215138.1... |
| 14997 | 5.45 | POLYGON ((-8215386.792 5027729.412, -8215384.6... |


## Divide into grids

## Calculate UMP and export as y

## Download Sentinel and export as X