# Step 3: Format of the Inputs 📄

The goal of this step is to **prepare the datasets for modeling**. This involves **merging the Zosteraceae point observations** with the **250 m x 250 m grid** and the corresponding **environmental variables**. Additionally, rows containing **missing values (NaNs)** are removed so that the model runs without errors.

### 📚 Required Libraries
To format and merge the datasets, the following libraries are needed:

- **`pandas`**: For reading, manipulating, and analyzing tabular data.  
- **`geopandas`**: For handling spatial datasets and geometries.  
- **`shapely`**: For parsing the WKT geometry strings of the grid cells (via `shapely.wkt`).

### 🛠️ Steps:
1. **Merge the Data**: 
   - Combine the **Zosteraceae point observations** with the **grid cells** and associated environmental values.  

2. **Handle Missing Values**: 
   - Remove any rows containing **NaN values** to ensure the dataset is clean for modeling.  


### 📄 Reformatting to merge the Zosteraceae points with the grid cells and the environmental values

In [27]:
import pandas as pd
import geopandas as gpd
from shapely import wkt

In [33]:
inputs_variables = pd.read_csv("data/02_inputs_environmental_variables.csv")
zosteraceae_points = pd.read_csv("data/01_filtered_Zosteraceae.csv")
clipped_grid_4326 = gpd.read_file("data/clipped_grid_4326.geojson")

In [34]:
zosteraceae_points.columns

Index(['gbifID', 'datasetKey', 'occurrenceID', 'kingdom', 'phylum', 'class',
       'order', 'family', 'genus', 'species', 'infraspecificEpithet',
       'taxonRank', 'scientificName', 'verbatimScientificName',
       'verbatimScientificNameAuthorship', 'countryCode', 'locality',
       'stateProvince', 'occurrenceStatus', 'individualCount',
       'publishingOrgKey', 'decimalLatitude', 'decimalLongitude',
       'coordinateUncertaintyInMeters', 'coordinatePrecision', 'elevation',
       'elevationAccuracy', 'depth', 'depthAccuracy', 'eventDate', 'day',
       'month', 'year', 'taxonKey', 'speciesKey', 'basisOfRecord',
       'institutionCode', 'collectionCode', 'catalogNumber', 'recordNumber',
       'identifiedBy', 'dateIdentified', 'license', 'rightsHolder',
       'recordedBy', 'typeStatus', 'establishmentMeans', 'lastInterpreted',
       'mediaType', 'issue', 'geometry'],
      dtype='object')

In [35]:
# Convert species points to GeoDataFrame
zosteraceae_gdf = gpd.GeoDataFrame(
    zosteraceae_points, 
    geometry=gpd.points_from_xy(zosteraceae_points['decimalLongitude'], zosteraceae_points['decimalLatitude']),
    crs="EPSG:4326"
)

# Build a GeoDataFrame of grid cells by parsing the WKT strings in the 'geometry' column
inputs_variables_gdf = gpd.GeoDataFrame(
    inputs_variables, 
    geometry=inputs_variables['geometry'].apply(wkt.loads), 
    crs="EPSG:4326"
)

# Spatial join to match points to cells; how='right' keeps every grid cell,
# and a cell containing several points appears once per matching point
zosteraceae_with_env = gpd.sjoin(zosteraceae_gdf, inputs_variables_gdf, how='right', predicate='within')
# Add a presence column: 1 if the row matched a Zosteraceae point, 0 otherwise
zosteraceae_with_env['presence'] = zosteraceae_with_env['gbifID'].notna().astype(int)

In [36]:
zosteraceae_with_env

Unnamed: 0,index_left,gbifID,datasetKey,occurrenceID,kingdom,phylum,class,order,family,genus,...,po4,siconc,sob,thetao,uo,vo,VTM01_SW2,VSDX,geometry,presence
0,,,,,,,,,,,...,0.452668,0.0,35.101162,10.721393,0.054501,0.065077,,,"POLYGON ((10.03034 58.25918, 10.03459 58.25935...",0
1,,,,,,,,,,,...,0.452668,0.0,35.101162,10.721393,0.054501,0.065077,,,"POLYGON ((10.03003 58.26142, 10.03428 58.26158...",0
2,,,,,,,,,,,...,0.452668,0.0,35.101162,10.721393,0.054501,0.065077,,,"POLYGON ((10.02972 58.26366, 10.03396 58.26382...",0
3,,,,,,,,,,,...,0.452668,0.0,35.101162,10.721393,0.054501,0.065077,,,"POLYGON ((10.0294 58.26589, 10.03365 58.26606,...",0
4,,,,,,,,,,,...,0.452668,0.0,35.101162,10.721393,0.054501,0.065077,,,"POLYGON ((10.02909 58.26813, 10.03334 58.26829...",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
240351,,,,,,,,,,,...,0.337169,0.0,13.510467,10.322074,-0.006674,0.015324,2.790750,0.024698,"POLYGON ((13.05616 55.68358, 13.06013 55.68364...",0
240352,,,,,,,,,,,...,0.341800,0.0,13.806524,10.325938,0.010994,0.012605,3.482053,0.027726,"POLYGON ((13.06113 55.66344, 13.0651 55.6635, ...",0
240353,,,,,,,,,,,...,0.341800,0.0,13.806524,10.325938,0.010994,0.012605,3.482053,0.027726,"POLYGON ((13.06102 55.66568, 13.06499 55.66575...",0
240354,,,,,,,,,,,...,0.341800,0.0,13.806524,10.325938,0.010994,0.012605,3.482053,0.027726,"POLYGON ((13.06091 55.66793, 13.06488 55.66799...",0


Once the plants table and the input variables table are merged, we can see that the first columns contain many NaN values because no plant is located in cells `0, 1, 2, 3, 4, 240351, 240352, 240353, 240354` or `240355`. Moreover, the merged table has more rows than the input variables table. Because the join uses `how='right'`, every grid cell is kept and a cell is repeated once per matching point, so the extra rows come from cells that contain several plant observations. Let's verify the column names.

In [37]:
zosteraceae_with_env.columns

Index(['index_left', 'gbifID', 'datasetKey', 'occurrenceID', 'kingdom',
       'phylum', 'class', 'order', 'family', 'genus', 'species',
       'infraspecificEpithet', 'taxonRank', 'scientificName',
       'verbatimScientificName', 'verbatimScientificNameAuthorship',
       'countryCode', 'locality', 'stateProvince', 'occurrenceStatus',
       'individualCount', 'publishingOrgKey', 'decimalLatitude',
       'decimalLongitude', 'coordinateUncertaintyInMeters',
       'coordinatePrecision', 'elevation', 'elevationAccuracy', 'depth',
       'depthAccuracy', 'eventDate', 'day', 'month', 'year', 'taxonKey',
       'speciesKey', 'basisOfRecord', 'institutionCode', 'collectionCode',
       'catalogNumber', 'recordNumber', 'identifiedBy', 'dateIdentified',
       'license', 'rightsHolder', 'recordedBy', 'typeStatus',
       'establishmentMeans', 'lastInterpreted', 'mediaType', 'issue',
       'mean_bathymetry', 'chl', 'no3', 'ph', 'po4', 'siconc', 'sob', 'thetao',
       'uo', 'vo', 'VTM01_SW2
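
Because the join above uses `how='right'`, a cell that contains several observations appears once per matching point in the joined table. If one row per grid cell is wanted before modeling, the repeated cells can be collapsed. The sketch below mimics that situation with a small hand-made table; the values and the `cell_id` index name are illustrative assumptions, not data from this project, and it relies on the joined table keeping the grid-cell index.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for the output of the right spatial join: the index is the
# grid-cell id, and cell 1 appears twice because two points fall inside it.
joined = pd.DataFrame(
    {
        "gbifID": [nan, 101.0, 102.0, nan],
        "thetao": [10.7, 10.3, 10.3, 10.5],
    },
    index=pd.Index([0, 1, 1, 2], name="cell_id"),
)
joined["presence"] = joined["gbifID"].notna().astype(int)

# Keep the first row of every cell. Repeated rows of a cell all come from
# matched points, so they share the same presence value.
one_row_per_cell = joined[~joined.index.duplicated(keep="first")]

print(len(joined), "rows before,", len(one_row_per_cell), "rows after")
```

Whether repeated cells should be collapsed depends on the model: keeping them gives more weight to cells with many observations.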

Let's remove the unneeded columns of the Zosteraceae table. The model only needs the environmental values and the presence values as inputs. It also needs complete rows, which is why I deleted all rows containing a NaN using `.dropna()`.

In [38]:
# Keep only the columns of interest: environmental variables, cell geometry, and presence
zosteraceae_with_env = zosteraceae_with_env[['mean_bathymetry', 'chl', 'no3', 'ph', 'po4', 'siconc', 'sob', 'thetao', 'uo', 'vo', 'VTM01_SW2', 'VSDX', 'geometry', 'presence']]
# Drop the rows that contain any NaN value
zosteraceae_with_env = zosteraceae_with_env.dropna()
zosteraceae_with_env

Unnamed: 0,mean_bathymetry,chl,no3,ph,po4,siconc,sob,thetao,uo,vo,VTM01_SW2,VSDX,geometry,presence
14,-454.978577,1.319354,1.591813,8.212095,0.453248,0.0,35.100361,10.726984,0.060392,0.065536,6.414486,0.036752,"POLYGON ((10.03915 58.25727, 10.04339 58.25744...",0
15,-456.461243,1.313369,1.584467,8.212026,0.452352,0.0,35.100712,10.723794,0.056117,0.064474,6.485106,0.036403,"POLYGON ((10.03883 58.25951, 10.04308 58.25968...",0
16,-457.868994,1.313369,1.584467,8.212026,0.452352,0.0,35.100712,10.723794,0.056117,0.064474,6.485106,0.036403,"POLYGON ((10.03852 58.26175, 10.04276 58.26191...",0
17,-461.075586,1.313369,1.584467,8.212026,0.452352,0.0,35.100712,10.723794,0.056117,0.064474,6.485106,0.036403,"POLYGON ((10.03821 58.26398, 10.04245 58.26415...",0
18,-468.873505,1.313369,1.584467,8.212026,0.452352,0.0,35.100712,10.723794,0.056117,0.064474,6.485106,0.036403,"POLYGON ((10.03789 58.26622, 10.04214 58.26639...",0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
240350,-0.361465,2.017805,4.840452,8.171505,0.337169,0.0,13.510467,10.322074,-0.006674,0.015324,2.790750,0.024698,"POLYGON ((13.05627 55.68134, 13.06024 55.6814,...",0
240352,-0.300009,1.999801,4.590124,8.169117,0.341800,0.0,13.806524,10.325938,0.010994,0.012605,3.482053,0.027726,"POLYGON ((13.06113 55.66344, 13.0651 55.6635, ...",0
240353,-0.300089,1.999801,4.590124,8.169117,0.341800,0.0,13.806524,10.325938,0.010994,0.012605,3.482053,0.027726,"POLYGON ((13.06102 55.66568, 13.06499 55.66575...",0
240354,-0.320632,1.999801,4.590124,8.169117,0.341800,0.0,13.806524,10.325938,0.010994,0.012605,3.482053,0.027726,"POLYGON ((13.06091 55.66793, 13.06488 55.66799...",0
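
`.dropna()` removes a row as soon as any of the selected columns is missing, so it can be useful to check which columns are responsible before dropping. A minimal sketch on a made-up table follows; the column names reuse those above, but the values are invented for illustration.

```python
import pandas as pd

nan = float("nan")

# Invented values; only the column names come from the table above.
cells = pd.DataFrame(
    {
        "thetao": [10.72, 10.72, 10.32, 10.33],
        "VTM01_SW2": [nan, nan, 2.79, 3.48],
        "VSDX": [nan, 0.036, 0.025, 0.028],
        "presence": [0, 0, 0, 1],
    }
)

# Number of missing values per column, largest first
nan_per_column = cells.isna().sum().sort_values(ascending=False)
print(nan_per_column)

# Drop incomplete rows and report how many were lost
complete = cells.dropna()
print(f"{len(cells) - len(complete)} of {len(cells)} rows dropped")
```

Running the same two checks on `zosteraceae_with_env` before the `.dropna()` call would show how many rows each environmental variable costs.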


I need to verify that I still have enough presence rows for the model.

In [39]:
# Verify the presence column
count_presence = zosteraceae_with_env['presence'].value_counts()
print("Presence counts:\n", count_presence)

Presence counts:
 presence
0    229432
1       911
Name: count, dtype: int64
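
The counts above show that presence rows are rare compared with absence rows (well under 1 % of the table). As a small sketch, the share of each class can be computed from those counts; the numbers are copied from the output above, and the variable `share` is introduced here only for illustration.

```python
import pandas as pd

# Counts copied from the output above
count_presence = pd.Series({0: 229432, 1: 911}, name="count")

# Fraction of rows in each class
share = count_presence / count_presence.sum()
print(f"presence rows: {share.loc[1]:.2%} of {count_presence.sum()} rows")
```

On the real table, `zosteraceae_with_env['presence'].value_counts(normalize=True)` gives the same fractions directly.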


I have more than 900 rows with a plant presence. Let's save the result for the next step.

In [40]:
# Save the final input file
zosteraceae_with_env.to_csv("data/03_inputs_environmental_variables.csv", index=False)