In [1]:
import os
import pandas as pd 
import geopandas as gpd
from pyproj import Transformer

In [2]:
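# Helper for datasets that store projected (easting/northing) coordinates; it is not called in the steps below.
# NOTE: `transformer` is not defined in this notebook. Create a pyproj Transformer for your source CRS before calling.
def convert_coordinates(easting, northing):
    longitude, latitude = transformer.transform(easting, northing)
    return latitude, longitude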
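
The helper above depends on a `transformer` object that is not created anywhere in this notebook. A minimal sketch of one way to define it, assuming (hypothetically) that the source coordinates are in ETRS89-LAEA Europe (EPSG:3035); substitute the CRS your dataset actually uses:

```python
from pyproj import Transformer

# Assumption: source data is in EPSG:3035 (ETRS89-LAEA Europe); replace with your dataset's CRS.
# always_xy=True fixes the axis order to (x/easting/longitude, y/northing/latitude).
transformer = Transformer.from_crs("EPSG:3035", "EPSG:4326", always_xy=True)


def convert_coordinates(easting, northing):
    """Return (latitude, longitude) in WGS 84 for a projected point."""
    longitude, latitude = transformer.transform(easting, northing)
    return latitude, longitude


# The projection origin of EPSG:3035 (its false easting/northing) lies at about 52°N, 10°E.
lat, lon = convert_coordinates(4321000, 3210000)
print(round(lat, 4), round(lon, 4))
```

The JRC dataset used below already provides `lat` and `lon` in degrees, so this conversion is only relevant for datasets with projected coordinates.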

# **Creating a Hydropower GeoPackage File for GeoH2**

This example demonstrates how to process a hydropower dataset (e.g., [JRC Hydropower Database](https://github.com/energy-modelling-toolkit/hydro-power-database)) and convert it into a GeoPackage format for use in **GeoH2**. Users can adapt this approach to their own hydropower datasets.

---

## **1. Define File Paths**
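Set up the input and output directories. Paths are built relative to the current working directory (`os.getcwd()`), so run the notebook from the folder that contains `Raw_Spatial_Data/`.

- Replace `jrc-hydro-power-plant-database.csv` in the code below with the name of your dataset file.
- Store the input file in `Raw_Spatial_Data/`. It must be a CSV file containing the columns selected in Step 2.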

In [8]:
script_dir = os.getcwd()
input_path = os.path.join(script_dir, "Raw_Spatial_Data", "jrc-hydro-power-plant-database.csv") 
output_dir = os.path.join(script_dir, "Inputs_Spider", "data")
os.makedirs(output_dir, exist_ok=True) 
output_path = os.path.join(output_dir, "hydropower_dams.gpkg")
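
Because the paths depend on the working directory, it can help to confirm that the input file exists before reading it. A minimal sketch; `check_input_file` is a hypothetical helper, not part of GeoH2:

```python
import os


def check_input_file(path):
    """Raise a descriptive error if the input CSV is missing or is not a file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Input file not found: {path}. "
            "Place the CSV in Raw_Spatial_Data/ or update the file name."
        )
    return path


# Example with the path layout used in this notebook.
script_dir = os.getcwd()
input_path = os.path.join(script_dir, "Raw_Spatial_Data", "jrc-hydro-power-plant-database.csv")
try:
    check_input_file(input_path)
    print("Input file found.")
except FileNotFoundError as err:
    print(err)
```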

## **2. Load and Select Relevant Columns**
After defining file paths, the next step is to read the hydropower dataset and extract only the necessary columns.


### **Explanation**
Loading the full CSV and then selecting a fixed list of columns keeps the DataFrame small and easier to process. It also surfaces a missing or misnamed column immediately, because pandas raises a `KeyError` at the selection step.  

### **Customization**
- If your dataset contains different column names, update the selection accordingly.  
- Ensure that your file includes **latitude (`lat`), longitude (`lon`), installed capacity, generation, and head data**.  
- If your dataset has extra columns, you can keep them if needed, but they won’t be used in this workflow.  


In [10]:
data = pd.read_csv(input_path)

data = data[['id', 'lat', 'lon', 'name', 'type', 
             'installed_capacity_MW', 'avg_annual_generation_GWh', 
             'head', 'country_code']]
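
When adapting the selection to another dataset, it can be useful to list every missing column at once instead of stopping at a bare `KeyError`. A minimal sketch, where `select_required` is a hypothetical helper and the sample frame contains made-up values:

```python
import pandas as pd

# Columns used by this workflow (names follow the JRC dataset loaded above).
REQUIRED_COLUMNS = ['id', 'lat', 'lon', 'name', 'type',
                    'installed_capacity_MW', 'avg_annual_generation_GWh',
                    'head', 'country_code']


def select_required(df, required=REQUIRED_COLUMNS):
    """Return only the required columns, or raise listing every missing one."""
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise KeyError(f"Dataset is missing required columns: {missing}")
    return df[list(required)]


# Tiny illustrative frame (made-up values) that lacks most required columns.
sample = pd.DataFrame({"id": ["X1"], "lat": [46.0], "lon": [7.4]})
try:
    select_required(sample)
except KeyError as err:
    print(err)
```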


## **3. Convert Coordinates to Numeric Format**

### **Explanation**
Hydropower plant locations must have correctly formatted latitude and longitude values. Some datasets may contain errors, such as text values or missing data in these columns. We convert these to numeric format to ensure correct processing.  

### **Customization**
- If your dataset uses different column names for latitude and longitude, update them accordingly.  
- If conversion errors occur, check the dataset for **empty values, incorrect formats, or misplaced decimal points**.  
- Rows with invalid coordinates are assigned `NaN` (missing value) and are removed in Step 5.  

In [None]:
data['lon'] = pd.to_numeric(data['lon'], errors='coerce')
data['lat'] = pd.to_numeric(data['lat'], errors='coerce')
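
To see how many values the coercion discards, the missing-value count can be compared before and after conversion. A minimal sketch on made-up sample values; the range check at the end is an optional addition, not part of the original workflow:

```python
import pandas as pd

# Made-up sample values: one valid number, one text entry, one missing value,
# and one number written with a decimal comma.
sample = pd.DataFrame({"lat": ["46.0731", "unknown", None, "44,17"],
                       "lon": ["7.4034", "7.41", "10.98", "7.42"]})

for col in ["lat", "lon"]:
    before = sample[col].isna().sum()
    sample[col] = pd.to_numeric(sample[col], errors="coerce")
    after = sample[col].isna().sum()
    # Values that were present but could not be parsed as numbers.
    print(f"{col}: {after - before} value(s) could not be parsed")

# Optional sanity check: valid WGS 84 ranges (NaN never passes the comparison).
in_range = sample["lat"].between(-90, 90) & sample["lon"].between(-180, 180)
print(f"Rows with usable coordinates: {in_range.sum()}")
```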

## **4. Rename Columns for Consistency**

### **Explanation**
To ensure compatibility with GeoH2 and maintain a standardized structure, we rename key columns. This step makes it easier to merge or compare data with other datasets later.  

### **Customization**
- If your dataset already uses the target column names (`capacity`, `Latitude`, `Longitude`, `plant_type`), **this step is optional**.  
- If your dataset has different column names for **installed capacity, latitude, longitude, or plant type**, modify the renaming dictionary accordingly.  


In [12]:
data = data.rename(columns={
    "installed_capacity_MW": "capacity",
    "lat": "Latitude",
    "lon": "Longitude",
    "type": "plant_type"
})
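
For a dataset with different source names, only the left-hand side of the dictionary changes. A minimal sketch, assuming hypothetical source columns `capacity_mw`, `y`, `x`, and `kind`; the final check guards against typos in the mapping:

```python
import pandas as pd

# Hypothetical source column names on the left; the target names expected
# later in this notebook on the right.
COLUMN_MAP = {
    "capacity_mw": "capacity",
    "y": "Latitude",
    "x": "Longitude",
    "kind": "plant_type",
}

# Made-up one-row dataset using the hypothetical names.
other = pd.DataFrame({"capacity_mw": [12.5], "y": [46.1], "x": [7.4], "kind": ["HDAM"]})
other = other.rename(columns=COLUMN_MAP)

# rename() silently ignores keys that are absent, so confirm the targets exist.
missing = set(COLUMN_MAP.values()) - set(other.columns)
if missing:
    raise KeyError(f"Columns missing after renaming: {sorted(missing)}")
print(list(other.columns))
```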

## **5. Filter Out Invalid Data**

### **Explanation**
To ensure data quality, we remove rows with missing coordinates and keep only the plant types `HDAM` and `HPHS`. Capacity is also converted to a numeric type. This prevents errors in later spatial processing.  

### **Customization**
- If your dataset contains **different plant type classifications**, adjust the filter accordingly.  
- If you want to keep all plant types, remove the filtering step.  
- `capacity` is converted with `errors='raise'`, so any non-numeric entry stops the notebook with a `ValueError`. Clean such values first, or use `errors='coerce'` to turn them into `NaN`.  


In [14]:
data = data.dropna(subset=['Longitude', 'Latitude'])
data = data[data['plant_type'].isin(['HDAM', 'HPHS'])]
data['capacity'] = pd.to_numeric(data['capacity'], errors='raise')
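
The customization options above can be sketched as a small function with an optional type filter. `filter_plants` is a hypothetical helper and the rows are made up for illustration; inspecting `value_counts()` first shows which classifications a dataset actually contains:

```python
import pandas as pd


def filter_plants(df, keep_types=("HDAM", "HPHS")):
    """Drop rows without coordinates and optionally restrict plant types.

    Pass keep_types=None to keep every plant type.
    """
    out = df.dropna(subset=["Longitude", "Latitude"])
    if keep_types is not None:
        out = out[out["plant_type"].isin(keep_types)]
    out = out.copy()  # work on a copy so the input frame is left untouched
    out["capacity"] = pd.to_numeric(out["capacity"], errors="raise")
    return out


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Longitude": [7.4, None, 10.9, 5.0],
    "Latitude": [46.1, 44.2, 46.0, 60.0],
    "plant_type": ["HDAM", "HPHS", "HPHS", "HROR"],
    "capacity": ["100", "50", "377", "8.5"],
})
print(sample["plant_type"].value_counts())  # inspect the classifications first
print(len(filter_plants(sample)), len(filter_plants(sample, keep_types=None)))
```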

## **6. Filter for Existing "Head" Values**

### **Explanation**
Hydraulic head is a critical parameter for hydropower analysis. This step removes entries where the **head** value is missing, ensuring only complete data is used.  

### **Customization**
- If you want to **keep plants with missing head values**, remove this filtering step.  
- If your dataset has an alternative column name for head height, update `"head"` accordingly.  
- If many plants are missing head data, consider estimating it using **GIS-based elevation analysis** or other methods.  


In [15]:
data_existing = data.dropna(subset=['head'])
# Sanity check: no missing 'head' values should remain after filtering.
print(f"Missing 'head' values after filtering: {data_existing['head'].isna().sum()}")

Missing 'head' values after filtering: 0
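
A count taken after `dropna` is always 0. To see how many plants the filter removes, count the missing values before filtering. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows: two plants have a head value, one does not.
data = pd.DataFrame({"id": ["A", "B", "C"], "head": [120.0, None, 35.5]})

n_missing = data["head"].isna().sum()   # count before filtering
data_existing = data.dropna(subset=["head"])

print(f"Plants without a 'head' value: {n_missing} of {len(data)}")
print(f"Plants kept: {len(data_existing)}")
```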


## **7. Convert to a GeoDataFrame**

### **Explanation**
To perform spatial analysis, we convert the dataset into a **GeoDataFrame**, which allows us to store geographic coordinates in a structured format. Each hydropower plant is represented as a point geometry using **longitude and latitude**.  

### **Customization**
- If your dataset has different column names for **longitude and latitude**, update them in `points_from_xy()`.  
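- The coordinate reference system (CRS) is set to **EPSG:4326 (WGS 84)**, the standard geographic CRS for latitude/longitude in degrees. `set_crs()` only labels the coordinates and does not transform them, so the value must match how your coordinates are stored. To reproject to another CRS, call `to_crs()` afterwards.  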


In [16]:
gdf = gpd.GeoDataFrame(
    data_existing,
    geometry=gpd.points_from_xy(data_existing.Longitude, data_existing.Latitude)
)
gdf.set_crs(epsg=4326, inplace=True)

| | id | Latitude | Longitude | name | plant_type | capacity | avg_annual_generation_GWh | head | country_code | geometry |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | H1 | 46.073100 | 7.403400 | Grande Dixence - Cleuson-Dixence (chandolin-fi... | HDAM | 2069.00 | 1400.0000 | 1748.00 | CH | POINT (7.4034 46.0731) |
| 1 | H10 | 44.177576 | 7.416505 | Chiotas entracque | HPHS | 1064.00 | NaN | 130.00 | IT | POINT (7.41651 44.17758) |
| 2 | H100 | 46.067653 | 10.983605 | S.massenza - Vezzano molveno UP_S MASS CL_1 | HPHS | 377.00 | NaN | 580.90 | IT | POINT (10.9836 46.06765) |
| 4 | H1001 | 46.547808 | 11.007077 | Pracomune | HPHS | 42.00 | NaN | 377.00 | IT | POINT (11.00708 46.54781) |
| 8 | H1005 | 42.466999 | -5.883000 | BARRIOS DE LUNA 1 (MORA LUNA) | HDAM | 38.00 | NaN | 300.30 | ES | POINT (-5.883 42.467) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4173 | N2049 | 68.127445 | 16.521809 | Nedre Russvik | HDAM | 2.80 | NaN | 100.20 | NO | POINT (16.52181 68.12744) |
| 4174 | N2050 | 68.135978 | 16.546919 | Øvre Russvik | HDAM | 5.00 | NaN | 421.70 | NO | POINT (16.54692 68.13598) |
| 4175 | N2031 | 67.671732 | 16.060866 | Raukforsen | HDAM | 5.60 | NaN | 75.75 | NO | POINT (16.06087 67.67173) |
| 4176 | N1981 | 67.054756 | 14.478863 | Breivikelva | HDAM | 8.59 | NaN | 305.50 | NO | POINT (14.47886 67.05476) |
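
If a project needs coordinates in a different CRS, the points have to be reprojected rather than relabelled. A minimal sketch with made-up coordinates; EPSG:3035 is only an example target, not a GeoH2 requirement:

```python
import geopandas as gpd
import pandas as pd

# Made-up coordinates for illustration.
df = pd.DataFrame({"name": ["Plant A", "Plant B"],
                   "Longitude": [7.4034, 10.9836],
                   "Latitude": [46.0731, 46.0677]})

# Passing crs= at construction is equivalent to calling set_crs() afterwards.
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["Longitude"], df["Latitude"]),
    crs="EPSG:4326",
)

# Assumption: the project needs a projected CRS; EPSG:3035 is only an example.
gdf_projected = gdf.to_crs(epsg=3035)
print(gdf.crs.to_epsg(), gdf_projected.crs.to_epsg())
```

`to_crs()` returns a new GeoDataFrame with transformed geometries, while the `Longitude` and `Latitude` attribute columns keep their original values in degrees.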


## **8. Export as GeoPackage**

### **Explanation**
The processed hydropower dataset is saved as a **GeoPackage (GPKG)** file in two locations:  
1. **Inputs_Spider/data/** → For compatibility with Spider-based workflows.  
2. **Inputs_GeoH2/data/** → For integration with the **GeoH2** model.  

This ensures that both workflows can access the same standardized hydropower dataset.  

### **Customization**
- If you need to store the file in additional locations, modify the **output paths** accordingly.  
- To save in a different format (e.g., **Shapefile, GeoJSON**), update the `driver` argument (for example `"ESRI Shapefile"` or `"GeoJSON"`) and give the output file the matching extension.  
- If multiple datasets are processed, ensure **unique filenames** to avoid overwriting.  


In [None]:
# Save to Inputs_Spider/data/
gdf.to_file(output_path, layer='dams', driver="GPKG")

# Save to Inputs_GeoH2/data/
geoH2_output_dir = os.path.join(script_dir, "Inputs_GeoH2", "data")
os.makedirs(geoH2_output_dir, exist_ok=True) 
geoH2_output_path = os.path.join(geoH2_output_dir, "hydropower_dams.gpkg")
gdf.to_file(geoH2_output_path, layer='dams', driver="GPKG")

print(f"GeoPackage files successfully created at:\n- {os.path.relpath(output_path, script_dir)}\n- {os.path.relpath(geoH2_output_path, script_dir)}")