# 🌇 Welcome to the `loader` module!

This is where your urban data journey begins. Whether you’ve got `CSV`, `Parquet`, or `Shapefiles`, we’ll get them loaded up and ready to explore. UrbanMapper provides two main ways to load data:

1. **Manual Loading of Local Datasets**: You can load datasets available locally in various formats like CSV, Parquet, and Shapefiles. This is the default approach for working with your own data.
2. **Integration with the Hugging Face Datasets Library**: UrbanMapper also supports loading datasets from Hugging Face, either through a pandas DataFrame via `from_dataframe()` or directly via `from_huggingface()`. This makes it easy to bring in external data sources.

**Data sources used**:
- PLUTO data from NYC Open Data: https://www.nyc.gov/content/planning/pages/resources/datasets/mappluto-pluto-change
- Taxi data from NYC Open Data: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
- **The OSCUR Hugging Face dataset source**: the [OSCUR Hugging Face organization](https://huggingface.co/oscur) hosts all datasets associated with [OSCUR](https://oscur.org/) (Open-Source Cyberinfrastructure for Urban Computing), a research initiative focused on enabling reproducible, scalable, and accessible data-driven analysis for urban environments.

By using the OSCUR datasets, you can skip downloading files manually from Google Drive or the official sources. They are ready to use in all subsequent notebook examples.

**What you’ll learn**:
- How to kick off UrbanMapper.
- Loading data from CSV, Parquet, and Shapefile formats.
- Loading datasets from Hugging Face with UrbanMapper.

Ready? Let’s dive in! 🚀

In [1]:
import urban_mapper as um

# Start up UrbanMapper
mapper = um.UrbanMapper()

## Loading CSV Data

First up, let’s load a CSV file. We tell UrbanMapper which columns hold the coordinates (longitude and latitude, or a single geometry column) so it can make sure those columns are well formatted prior to any analysis.

The example below uses a speed humps CSV whose geometry is stored in the `the_geom` column, but you can point it at your own file. Try it out!

In [2]:
csv_loader = (
    mapper
    .loader # From the loader module
    .from_file("VZV_Speed_Humps_with_LatLon.csv") # To update with your own path
    .with_columns(geometry_column="the_geom") # Name the column that holds the geometry
)

gdf = csv_loader.load() # Load the data and create a GeoDataFrame instance

# gdf stands for GeoDataFrame, like df in pandas for DataFrames.
gdf




Unnamed: 0,the_geom,OBJECTID,on_street,from_stree,to_street,humps,date_insta,Shape_STLe,longitude,latitude,geometry,centroid,None
0,MULTILINESTRING ((-73.9799168591299 40.6728580...,1,1 STREET,6 AVENUE,7 AVENUE,2,05/22/2014,775.659477,-73.979917,40.672858,POINT (-73.97874 -73.97874),POINT (-73.97874 40.67229),-73.978737
1,MULTILINESTRING ((-73.99294738655144 40.678995...,2,1 STREET,HOYT STREET,BOND STREET,2,09/11/2017,652.150192,-73.992947,40.678995,POINT (-73.99193 -73.99193),POINT (-73.99193 40.67855),-73.991928
2,MULTILINESTRING ((-73.98623027308248 40.675910...,3,1 STREET,WHITWELL PLACE,DENTON PLACE,1,04/20/2024,230.484634,-73.986230,40.675911,POINT (-73.98588 -73.98588),POINT (-73.98588 40.67574),-73.985881
3,MULTILINESTRING ((-73.73560306475389 40.714945...,4,100 AVENUE,100 DR./220 STREET,SPRINGFIELD BL.,1,12/15/2011,423.930065,-73.735603,40.714945,POINT (-73.73484 -73.73484),POINT (-73.73484 40.71501),-73.734844
4,MULTILINESTRING ((-74.03364741744534 40.612833...,5,100 ST,FOURTH AVE,FT HAMILTON PKWY,1,08/01/1996,591.886182,-74.033647,40.612833,POINT (-74.03275 -74.03275),POINT (-74.03275 40.61239),-74.032752
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4020,MULTILINESTRING ((-73.84799379430255 40.869354...,4021,YOUNG AVENUE,EAST GUN HILL ROAD,ADEE AVENUE,1,01/29/2013,724.461134,-73.847994,40.869354,POINT (-73.84796 -73.84796),POINT (-73.84796 40.87035),-73.847960
4021,MULTILINESTRING ((-73.84646715953151 40.812306...,4022,ZEREGA AVE,CASTLE HILL AVE,WESTCHESTER AVE,2,05/02/2001,9419.157313,-73.846467,40.812306,POINT (-73.84412 -73.84412),POINT (-73.84412 40.82446),-73.844118
4022,MULTILINESTRING ((-73.84984040603005 40.838957...,4023,ZEREGA AVENUE,FULLER STREET,"ST, RAYMONDS AV.",1,06/13/2007,897.048744,-73.849840,40.838958,POINT (-73.85104 -73.85104),POINT (-73.85104 40.83979),-73.851038
4023,MULTILINESTRING ((-73.84667778628474 40.836794...,4024,ZEREGA AVENUE,LYVERE STREET,WESTCHESTER AV.,2,11/01/2016,2322.413384,-73.846678,40.836794,POINT (-73.84979 -73.84979),POINT (-73.84979 40.83893),-73.849792


## Loading Parquet Data

Next, let's load a Parquet dataset. The workflow is the same as for CSV.

In [10]:
parquet_loader = (
    mapper
    .loader # From the loader module
    .from_file("taxi1m.parquet") # To update with your own path
    .with_columns("pickup_longitude", "pickup_latitude") # Inform your long and lat columns
)

gdf = parquet_loader.load() # Load the data and create a GeoDataFrame instance

gdf

ValueError: The following type is not supported for resetting: <class 'pandas.core.frame.DataFrame'>
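
Coordinate columns such as `pickup_longitude` and `pickup_latitude` need to be numeric and within valid ranges before they can become point geometries. The following is a minimal, library-independent sketch of that kind of check. It is not UrbanMapper's implementation: the `find_invalid_coordinates` helper and the sample rows are hypothetical and exist only for illustration.

```python
import csv
import io
import math


def find_invalid_coordinates(rows, lon_col, lat_col):
    """Return (row_index, reason) pairs for rows with unusable coordinates."""
    problems = []
    for index, row in enumerate(rows):
        try:
            lon = float(row[lon_col])
            lat = float(row[lat_col])
        except (KeyError, TypeError, ValueError):
            problems.append((index, "missing or non-numeric"))
            continue
        if math.isnan(lon) or math.isnan(lat):
            problems.append((index, "missing or non-numeric"))
        elif not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
            problems.append((index, "out of range"))
    return problems


# Made-up sample standing in for a real file (assumption: degrees, WGS84-style).
sample = io.StringIO(
    "pickup_longitude,pickup_latitude\n"
    "-73.9857,40.7484\n"
    "not-a-number,40.7000\n"
    "-73.9000,140.0\n"
)
rows = list(csv.DictReader(sample))
issues = find_invalid_coordinates(rows, "pickup_longitude", "pickup_latitude")
print(issues)
```

Running a check like this on your own file before handing it to a loader can make column problems easier to spot than a stack trace.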

## Loading Shapefile Data

Finally, let’s load a Shapefile-based dataset. Shapefiles have geometry built in, so no need to specify columns — UrbanMapper sorts it out for us!

In [None]:
shp_loader = (
    mapper
    .loader # From the loader module
    .from_file("../data/[NYC][USA] MapPluto/Shapefile/MapPLUTO.shp") # To update with your own path
)

gdf = shp_loader.load() # Load the data and create a GeoDataFrame instance

gdf

## Loading Data from Hugging Face

UrbanMapper provides two ways to load datasets from Hugging Face:

1. **Using `from_dataframe()`**: This method allows you to load a dataset into a pandas DataFrame first, giving you flexibility to preprocess or explore the data before loading it into UrbanMapper.
2. **Using `from_huggingface()`**: This method directly loads the dataset into UrbanMapper, skipping the intermediate DataFrame step for simplicity.

### Method 1: Using `from_dataframe()`

This code loads the "oscur/NYC_raised_crosswalk" dataset from Hugging Face, selects the training split, and converts up to the first 1,000 rows into a pandas DataFrame for exploration. The resulting DataFrame is then loaded into UrbanMapper using `from_dataframe()`.

In [4]:
from datasets import load_dataset
import pandas as pd

# Retrieve the dataset from Hugging Face
dataset = load_dataset("oscur/NYC_raised_crosswalk")
# Select the training split
train_ds = dataset["train"]
# Convert the first 1000 rows to a DataFrame
df = pd.DataFrame(train_ds[:1000])

# Load the dataset using UrbanMapper
df_loader = (
    mapper
    .loader # From the loader module
    .from_dataframe(df) # To update with your dataframe
    .with_columns(geometry_column="WKT Geometry") # Name the column that holds the WKT geometry
)

gdf = df_loader.load() # Load the data and create a GeoDataFrame instance

# gdf stands for GeoDataFrame, like df in pandas for DataFrames.
gdf


Unnamed: 0,Treatment,Date,Nodeid,WKT Geometry,latitude,longitude,geometry,centroid
0,Raised Crosswalk,10/18/2023,17575,POINT (995743.99243167 165383.361633316),165383.361633,9.957440e+05,POINT (995743.99243 165383.36163),POINT (995743.99243 165383.36163)
1,Raised Crosswalk,10/25/2024,54958,POINT (1034279.63922118 200674.264404297),200674.264404,1.034280e+06,POINT (1034279.63922 200674.2644),POINT (1034279.63922 200674.2644)
2,Raised crosswalk,09/24/2018,42999,POINT (1006900.4614258 236342.187988296),236342.187988,1.006900e+06,POINT (1006900.46143 236342.18799),POINT (1006900.46143 236342.18799)
3,Raised Crosswalk,07/12/2024,43081,POINT (1007190.95080571 237129.666015655),237129.666016,1.007191e+06,POINT (1007190.95081 237129.66602),POINT (1007190.95081 237129.66602)
4,Raised Crosswalk - S Leg,08/14/2023,44746,POINT (1009562.59301759 250359.288024947),250359.288025,1.009563e+06,POINT (1009562.59302 250359.28802),POINT (1009562.59302 250359.28802)
...,...,...,...,...,...,...,...,...
114,Raised Crosswalk,06/13/2023,13860,POINT (981997.590026855 173821.543212906),173821.543213,9.819976e+05,POINT (981997.59003 173821.54321),POINT (981997.59003 173821.54321)
115,Raised Crosswalk,08/29/2024,19568,POINT (995377.211608931 194393.404418945),194393.404419,9.953772e+05,POINT (995377.21161 194393.40442),POINT (995377.21161 194393.40442)
116,Raised Crosswalk - N Leg,06/27/2023,18459,POINT (999014.497619659 170285.592407227),170285.592407,9.990145e+05,POINT (999014.49762 170285.59241),POINT (999014.49762 170285.59241)
117,Raised Crosswalk,06/10/2024,29354,POINT (1013226.56201178 183620.736022964),183620.736023,1.013227e+06,POINT (1013226.56201 183620.73602),POINT (1013226.56201 183620.73602)
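
A note on `pd.DataFrame(train_ds[:1000])`: slicing a Hugging Face `Dataset` returns a plain dictionary that maps each column name to a list of values, which is why it can be passed straight to the DataFrame constructor. The sketch below mimics that column-oriented layout with a small hand-written dictionary (two values copied from the output above) and turns it into row records using only the standard library. The `batch` variable is a stand-in, not a real dataset slice.

```python
# Hand-written stand-in for the dict a Dataset slice returns:
# one key per column, one list of values per key.
batch = {
    "Treatment": ["Raised Crosswalk", "Raised Crosswalk"],
    "Nodeid": [17575, 54958],
}

# Transpose the column lists into one dict per row.
rows = [dict(zip(batch.keys(), values)) for values in zip(*batch.values())]

for row in rows:
    print(row)
```
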


In [None]:
loader = mapper.loader.from_huggingface("oscur/pluto")
gdf = loader.load()

### Method 2: Using `from_huggingface()`

This method directly loads the "oscur/pluto" dataset into UrbanMapper, skipping the intermediate DataFrame step. It's a simpler and faster way to load datasets hosted on Hugging Face.

In [None]:
# Load only 100 rows of the dataset directly from Hugging Face
loader = (
    mapper.loader
    .from_huggingface("oscur/pluto", number_of_rows=100)
    .with_columns(longitude_column="longitude", latitude_column="latitude")
)
gdf = loader.load()
gdf  # Next steps: analyze or visualize the data

## Previewing Your Loader Instance

You can also preview your loader instance to see which columns you specified and which file path you loaded from. This is handy when you open an urban analysis shared by someone else and want to check which columns it uses.

In [None]:
print(loader.preview()) # preview() belongs to the loader, not to the GeoDataFrame
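
To make the idea concrete, here is a minimal, purely illustrative sketch of a chainable loader that remembers its configuration and can summarize it. `SketchLoader` and its methods are hypothetical stand-ins written for this note. They are not UrbanMapper's implementation, and the real preview output will look different.

```python
class SketchLoader:
    """Hypothetical chainable loader; not UrbanMapper code."""

    def __init__(self):
        self.file_path = None
        self.longitude_column = None
        self.latitude_column = None

    def from_file(self, file_path):
        self.file_path = file_path
        return self  # returning self is what allows method chaining

    def with_columns(self, longitude_column, latitude_column):
        self.longitude_column = longitude_column
        self.latitude_column = latitude_column
        return self

    def preview(self):
        # Summarize the stored configuration without loading any data.
        return (
            "Loader configuration\n"
            f"  file: {self.file_path}\n"
            f"  longitude column: {self.longitude_column}\n"
            f"  latitude column: {self.latitude_column}"
        )


sketch = (
    SketchLoader()
    .from_file("taxi1m.parquet")
    .with_columns("pickup_longitude", "pickup_latitude")
)
summary = sketch.preview()
print(summary)
```

The point of the design is that a preview reads the configuration only, so it stays cheap even when the underlying file is large.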

## Wrapping Up

And that’s that! 🎈 You’ve loaded data from four different sources like a pro: `CSV`, `Parquet`, `Shapefile`, and Hugging Face datasets. Now you’re all set to play with modules like `urban_layer` or `imputer`.

In [None]:
import urban_mapper as um

# Start up UrbanMapper
mapper = um.UrbanMapper()

raster_loader = (
    mapper
    .loader # From the loader module
    .from_file("fake.tif") # To update with your own path
)

raster_gdf = raster_loader.load() # Load the data with the raster loader defined above

# gdf stands for GeoDataFrame, like df in pandas for dataframes.
raster_gdf

In [None]:
visualize_road_flood_risk(raster_gdf)

In [None]:
from typing import Tuple

import geopandas as gpd
import matplotlib.pyplot as plt


def visualize_road_flood_risk(roads_with_risk: gpd.GeoDataFrame,
                              flood_level: int = 10, figsize: Tuple[int, int] = (12, 8)) -> None:
        """Visualize road flood risk on a map.
        
        Args:
            roads_with_risk: Roads with flood risk data
            flood_level: Flood level to visualize
            figsize: Figure size for the plot
        """
        if roads_with_risk is None:
            print("❌ No road risk data available!")
            return
        
        print(f"\n🗺️ VISUALIZING FLOOD RISK ({flood_level}ft level)")
        print("-" * 45)
        
        risk_column = f'flood_risk_{flood_level}ft'
        
        if risk_column not in roads_with_risk.columns:
            print(f"❌ Risk column {risk_column} not found!")
            return
        
        # Create the plot
        _, ax = plt.subplots(figsize=figsize)
        
        # Color mapping for risk levels
        color_map = {'High': 'red', 'Medium': 'orange', 'Low': 'green', 'Unknown': 'gray'}
        
        # Plot roads by risk level
        for risk_level in ['Low', 'Medium', 'High', 'Unknown']:
            subset = roads_with_risk[roads_with_risk[risk_column] == risk_level]
            if len(subset) > 0:
                subset.plot(ax=ax, color=color_map[risk_level], 
                           linewidth=1, label=f'{risk_level} Risk', alpha=0.7)
        
        ax.set_title(f'Road Network Flood Risk - {flood_level}ft Water Level', fontsize=14, fontweight='bold')
        ax.set_xlabel('Longitude')
        ax.set_ylabel('Latitude')
        ax.legend()
        ax.grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
    

In [None]:
# Integration test for RasterLoader (raster support)
from urban_mapper.modules.loader.loader_factory import LoaderFactory

# Dummy path to a raster (the file does not need to exist for this test)
fake_raster_path = "fake.tif"

factory = LoaderFactory()
try:
    loader = factory.from_file(fake_raster_path)
    loader.load()  # Should raise NotImplementedError for now
except FileNotFoundError as e:
    print("✅ Missing raster file detected:", e)
except NotImplementedError as e:
    print("✅ RasterLoader recognized, but not yet implemented:", e)
except Exception as ex:
    print("❌ Unexpected error:", ex)
else:
    print("❌ Error: RasterLoader should have raised NotImplementedError")
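
The contract the test above checks (a `.tif` path is recognized as raster, and loading raises `NotImplementedError` for now) can be sketched without UrbanMapper. The extension mapping, `pick_loader_kind`, and `sketch_load` below are hypothetical names for illustration. They are an assumption about how extension-based dispatch could work, not the actual `LoaderFactory` internals.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a loader kind.
SKETCH_LOADER_KINDS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".shp": "shapefile",
    ".tif": "raster",
    ".tiff": "raster",
}


def pick_loader_kind(file_path):
    """Choose a loader kind from the file extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    if suffix not in SKETCH_LOADER_KINDS:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return SKETCH_LOADER_KINDS[suffix]


def sketch_load(file_path):
    """Pretend to load a file; raster is recognized but not implemented."""
    kind = pick_loader_kind(file_path)
    if kind == "raster":
        raise NotImplementedError("Raster loading is not implemented in this sketch")
    return kind


try:
    sketch_load("fake.tif")
except NotImplementedError as error:
    print("Raster recognized, but not yet implemented:", error)
```

Raising `NotImplementedError` only after the format is recognized keeps "unknown format" and "known but unfinished format" distinguishable for callers.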