# Documentation

## Semantic Information

Running the code, adding libraries, and managing the application in general are all done with Poetry.

In [None]:
poetry run run-app      # for running the code
poetry add ...          # for adding dependencies

## Config Folder

Here you will find the Copernicus configuration files, in which you have to replace the credentials with your own.
For Copernicus ERA5, register at https://cds.climate.copernicus.eu/how-to-api to get your credentials, then replace the file 'csd_api.txt'.
Copernicus Land works the same way but is more involved: following https://land.copernicus.eu/en/how-to-guides/how-to-download-spatial-data/how-to-download-m2m, you have to log in, go to the API tokens, generate a new one, and replace the file 'key.json'.

There is also a file 'paths.py' that contains the paths to the Hinton database; you can point it to your own storage if you adapt it.
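
One way to adapt 'paths.py' to another storage is to derive every path from a single root. The sketch below is only an illustration under assumptions, not the actual content of 'paths.py': the environment variable `BIODT_STORAGE_ROOT` and the constant names `SPECIES_DATASET_FILE` and `TIMESTAMPS_FILE` are hypothetical, while the default root and the file locations are the ones mentioned in this document.

```python
import os
from pathlib import Path

# Default root taken from this document. BIODT_STORAGE_ROOT is a hypothetical
# environment variable used to point the code at a different storage.
STORAGE_ROOT = Path(os.environ.get("BIODT_STORAGE_ROOT", "/data/projects/biodt/storage"))

DATA_DIR = STORAGE_ROOT / "data"
PROCESSED_DIR = STORAGE_ROOT / "processed_data"

# Locations as described in this document, expressed relative to the root.
AGRICULTURE_COMBINED_FILE = DATA_DIR / "Agriculture" / "Europe_combined_agriculture_data.csv"
SPECIES_DATASET_FILE = PROCESSED_DIR / "species_dataset.parquet"  # hypothetical name
TIMESTAMPS_FILE = PROCESSED_DIR / "timestamps" / "dates_2000_2024.csv"  # hypothetical name
```

With this layout, moving to another storage would only require changing the root, not every path.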

## First Step

The first step is to download data. The API scripts let you download from several sources: ERA5 from Copernicus (https://cds.climate.copernicus.eu/datasets), iNaturalist (https://www.inaturalist.org/), Xeno-Canto (https://xeno-canto.org/), BOLD (https://v4.boldsystems.org/), Copernicus Land (https://land.copernicus.eu/en/products/vegetation/normalised-difference-vegetation-index-v3-0-1km) and Map of Life (https://mol.org/).
Each source has its own file with its own methods. There is also a downloader file with functions that the other files reuse, like a parent class. In addition, there are the ingestion scripts, which are functions that extract data from files already in the storage.

Each of these files ends with a function named after the source (e.g. era5()), whose parameters let you set exactly what you want to download.
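
The structure described above (shared helpers in a downloader file, one entry function per source) could be sketched roughly like this. Everything here is hypothetical and only illustrates the idea: the class names, the `target_path` helper and the `data_dir` parameter are assumptions, and the real downloader file may simply be a set of functions rather than a class.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class Downloader(ABC):
    """Hypothetical base class holding helpers shared by all sources."""

    def __init__(self, source: str, data_dir: str):
        self.source = source
        self.base_path = Path(data_dir) / source

    def target_path(self, filename: str) -> Path:
        # Every source stores its files under <data_dir>/<source>/.
        return self.base_path / filename

    @abstractmethod
    def run(self, **params) -> list:
        """Download according to the given preferences and return the paths."""


class Era5Downloader(Downloader):
    def run(self, mode: str = "timestamps", **params) -> list:
        # Placeholder: a real implementation would call the CDS API here.
        return [self.target_path(f"era5_{mode}.nc")]


def era5(data_dir: str = "storage/data", **params) -> list:
    # Entry point named after the source, placed at the end of the file.
    return Era5Downloader("era5", data_dir).run(**params)
```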

### Download Data

#### APIs

Currently you can download data by calling the function from main. For ERA5 there are two modes, 'timestamps' and 'range'.

With 'timestamps', data are downloaded from ERA5 only for the specific dates listed in a timestamps file (/data/projects/biodt/storage/processed_data/timestamps/dates_2000_2024.csv), which we extract from the species dataset. For each date we download four times: 00:00, 06:00, 12:00 and 18:00.

With 'range', files are downloaded for every day of every month in the given range of years.

In [None]:
era5(mode = 'timestamps') 
era5(mode = 'range', start_year = '2000', end_year = '2020')
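
To make the difference between the two modes concrete, here is a minimal sketch of which (date, hour) pairs each mode would request. The function names are hypothetical, and two things are assumptions: that 'range' mode uses the same four hours as 'timestamps' mode, and that `end_year` is inclusive.

```python
from datetime import date, timedelta

HOURS = ("00:00", "06:00", "12:00", "18:00")


def requests_from_timestamps(dates):
    """One request per distinct listed date and per hour."""
    return [(day, hour) for day in sorted(set(dates)) for hour in HOURS]


def requests_from_range(start_year, end_year):
    """One request per calendar day in [start_year, end_year] and per hour."""
    day = date(int(start_year), 1, 1)
    last = date(int(end_year), 12, 31)
    requests = []
    while day <= last:
        requests.extend((day, hour) for hour in HOURS)
        day += timedelta(days=1)
    return requests
```

The 'timestamps' mode therefore scales with the number of observation dates, while 'range' always covers every day of the requested years.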

main also implements a workflow driven by command-line arguments, so you can download data through python or poetry without changing the code or calling the functions manually from main. The instructions and the parameters are documented in main.
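
As an illustration of such an argument-driven workflow, the sketch below uses `argparse`. All argument names here (`source`, `--mode`, `--start-year`, `--end-year`) are hypothetical; the real ones are documented in main and may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # All option names below are hypothetical; check main for the real ones.
    parser = argparse.ArgumentParser(description="Download data from one source.")
    parser.add_argument("source", choices=["era5", "bold", "xeno_canto", "inaturalist", "mapoflife"])
    parser.add_argument("--mode", choices=["timestamps", "range"], default="timestamps")
    parser.add_argument("--start-year", default="2000")
    parser.add_argument("--end-year", default="2020")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.source == "era5":
        kwargs = {"mode": args.mode}
        if args.mode == "range":
            kwargs.update(start_year=args.start_year, end_year=args.end_year)
        return ("era5", kwargs)  # a real main would call era5(**kwargs) here
    return (args.source, {})
```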

BOLD, Xeno-Canto and iNaturalist run without any extra input. Map of Life needs a list of species names, which you get from the files in the modality folder lists (/data/projects/biodt/storage/modality_folder_lists); you select the file from which the species names are extracted.

The Copernicus Land API works as intended, but it has limitations. That is why we download the data directly from the files listed here: https://globalland.vito.be/download/manifest/ndvi_1km_v3_10daily_netcdf/manifest_clms_global_ndvi_1km_v3_10daily_netcdf_latest.txt
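
Downloading from the manifest could look like the sketch below. It assumes the manifest is a plain text file with one file URL per line; the function names and the `limit` parameter are hypothetical, and this is not necessarily how the project script does it.

```python
from pathlib import Path
from urllib.request import urlopen, urlretrieve

MANIFEST_URL = "https://globalland.vito.be/download/manifest/ndvi_1km_v3_10daily_netcdf/manifest_clms_global_ndvi_1km_v3_10daily_netcdf_latest.txt"


def parse_manifest(text: str) -> list:
    """Return the file URLs listed in the manifest, skipping other lines."""
    return [line.strip() for line in text.splitlines() if line.strip().startswith("http")]


def download_all(target_dir: str, limit=None) -> list:
    """Fetch the manifest and download the listed files (needs network access)."""
    with urlopen(MANIFEST_URL) as response:
        urls = parse_manifest(response.read().decode("utf-8"))
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls[:limit]:
        destination = target / url.rsplit("/", 1)[-1]
        if not destination.exists():  # skip files that were already downloaded
            urlretrieve(url, str(destination))
        saved.append(destination)
    return saved
```

Skipping files that already exist makes it possible to rerun the download after an interruption.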

#### Ingestion Scripts

Here you will find scripts that extract data from CSV files that were placed manually in /data/projects/biodt/storage/dataset_files. You run a script, for example for the indicators, either for a region or for the whole world, and it creates a new CSV in /data/projects/biodt/storage/data with the countries, the bounding box of each country and the values. Nothing else.

Agriculture, Land and Forest come from https://data.worldbank.org/indicator. You first have to download the data from there and place them in the modality folders, then run the function to create the new CSV for your dataset.
The data cover 1961-2021, so check whether they have been updated if you need more recent years.

In [None]:
# Process all the agriculture files and create new CSVs.
run_agriculture_data_processing(region="Europe", global_mode=False, irrigated=True, arable=True, cropland=True)

# Then merge them into one file (/data/projects/biodt/storage/data/Agriculture/Europe_combined_agriculture_data.csv).
run_agriculture_merging()
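
To illustrate what such a processing step produces (countries, bounding boxes and values), here is a minimal sketch that reshapes a World Bank style table with one column per year into one row per country and year. The function name and the output keys are hypothetical, and the header line is assumed to come first; real World Bank downloads may contain a few metadata lines before the header that need to be skipped.

```python
import csv
import io


def worldbank_to_long(csv_text, bounding_boxes):
    """Turn wide rows (one column per year) into one row per country and year.

    bounding_boxes maps a country name to (min_lon, min_lat, max_lon, max_lat);
    countries without a bounding box are skipped.
    """
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        country = record["Country Name"]
        bbox = bounding_boxes.get(country)
        if bbox is None:
            continue
        for column, value in record.items():
            # Year columns are purely numeric; empty cells mean "no data".
            if column and column.isdigit() and value:
                rows.append({"country": country, "bbox": bbox,
                             "year": int(column), "value": float(value)})
    return rows
```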

For iNaturalist from files, you will find only a JSON with metadata, but we also downloaded a folder with a lot of images from https://github.com/visipedia/inat_comp/tree/master/2021. We moved all the images and the metadata to the folder /data/projects/biodt/storage/data/Life.

For the Living Planet Index (https://www.livingplanetindex.org/data_portal), we create a CSV file in each folder in /data/projects/biodt/storage/data/Life. It contains the population index, which we call 'distribution'.

UN-Habitat (https://data.unhabitat.org/) provides urban data, but we do not use it. The files are in /data/projects/biodt/storage/dataset_files/Unhabitat.

## Second Step

### Data Preprocessing

Here are all the files with preprocessing functions (with comments) for images, sounds, text and eDNA. The preprocessing file is the important one: it combines all the other files into general functions, which are used when we create the species dataset parquet file.

## Third Step

### Dataset Creation

This is the most important part.

First we have to create the species dataset. For now we do not put all the images and sounds inside.
All the species data are located in /data/projects/biodt/storage/data/Life.

For now there is one species dataset, which is not entirely wrong, but some distributions were not saved: /data/projects/biodt/storage/processed_data/species_dataset.parquet

In the meantime, a function is running on the Hinton cluster to create a new one: /data/projects/biodt/storage/vector_db/species_dataset_2.parquet
This is the one that will be used.

To create the species dataset, run:


In [None]:
create_species_dataset(
    root_folder="/data/projects/biodt/storage/data/Life",
    filepath="/data/projects/biodt/storage/processed_data/species_dataset.parquet",
    start_year=2000,
    end_year=2020,
)

Once the species parquet is created, it holds all the species data. We also have the CSVs for the indicators, the red list and NDVI, and from all of these we have to create the data batches.

The files batch.py and metadata.py follow a class-based logic and contain functions for normalizing some values and checking for mistakes. However, loading the saved batches required these classes to be available, so in the end we did not use them and we save the data as dictionaries and lists instead. The logic is still there.

There are two very useful files, save_data and load_data, for the species dataset and the batches. The same goes for the preprocessing, which also contains the initialization of the tensors. For now we do not put images and audio in the batch, so if you want to include them, you have to change the code there and in create_batch(), basically by uncommenting some parts.
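
The reason plain containers are easier to save and load can be shown with a small round trip. This sketch assumes `pickle` as the storage format and a made-up batch layout; the function names and the keys are hypothetical, and the project's save_data and load_data may work differently. A pickled instance of a custom class can only be loaded where that class is importable, whereas dictionaries and lists load anywhere.

```python
import pickle
import tempfile
from pathlib import Path


def save_batch(batch: dict, path: Path) -> None:
    with open(path, "wb") as handle:
        pickle.dump(batch, handle)


def load_batch(path: Path) -> dict:
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical batch layout built only from dictionaries and lists.
batch = {
    "metadata": {"lat": [34.0, 34.25], "lon": [-30.0, -29.75], "timestamp": "2000-01-01T00:00:00"},
    "surface": {"t2m": [[280.1, 281.3]], "msl": [[101325.0, 101300.0]]},
    "species": [],
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "batch_0.pkl"
    save_batch(batch, target)
    restored = load_batch(target)
```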



#### Batch Creation

We call the function: 

In [None]:
create_dataset(
    species_file="/data/projects/biodt/storage/vector_db/species_dataset_2.parquet",
    era5_directory=paths.ERA5_DIR,
    agriculture_file=paths.AGRICULTURE_COMBINED_FILE,
    land_file=paths.LAND_COMBINED_FILE,
    forest_file=paths.FOREST_FILE,
    species_extinction_file=paths.SPECIES_EXTINCTION_FILE,
    load_type="day-by-day",
)


This loads all the data (species, indicators, ERA5) and then creates the batches.

For now the created batches contain data only for Europe and only for 2000-2012. Why?

Because the ERA5 download until 2024 is still running in the background on Hinton. It has to finish before the function above is run again to create the batches. The species dataset also only covers 2000 to 2020 and needs an update.

The indicators also only go up to 2021, except for NDVI.

So when you have new data, just run the command above again to create the new batches.

Also very important: some surface variables and some pressure levels are currently excluded. This is set in the function create_batch():

In [None]:
pressure_levels = (50, 500, 1000)
# pressure_levels = tuple(int(level) for level in atmospheric_dataset.pressure_level.values)

for var_name in ["t2m", "msl"]:
# for var_name in ["t2m", "msl", "u10", "v10"]:

The correct code for all the variables is in the comments, so just delete the line that is active now and uncomment the other one.
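
Instead of commenting and uncommenting lines, the selection could be made a parameter. The following is only a sketch of that idea, not the current code of create_batch(): both function names and the `use_all` flag are hypothetical.

```python
ALL_SURFACE_VARS = ("t2m", "msl", "u10", "v10")
REDUCED_SURFACE_VARS = ("t2m", "msl")


def select_pressure_levels(available, subset=None):
    """Return the requested levels, or every available level if subset is None."""
    available = tuple(int(level) for level in available)
    if subset is None:
        return available
    missing = [level for level in subset if level not in available]
    if missing:
        raise ValueError(f"Pressure levels not in dataset: {missing}")
    return tuple(subset)


def select_surface_vars(use_all=False):
    """Return the reduced variable list by default, the full one on request."""
    return ALL_SURFACE_VARS if use_all else REDUCED_SURFACE_VARS
```

Calling `select_pressure_levels(levels, (50, 500, 1000))` reproduces the current reduced setting, while passing no subset uses every level in the dataset.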

You change the location in create_batches(), which is currently set to Europe:

In [None]:
min_lon, min_lat, max_lon, max_lat = -30, 34.0, 50.0, 72.0
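
How such a bounding box restricts the data can be sketched as follows. The function names and the record layout (dictionaries with 'lon' and 'lat' keys) are hypothetical, and longitudes are assumed to be expressed in the -180 to 180 convention.

```python
EUROPE_BBOX = (-30.0, 34.0, 50.0, 72.0)  # min_lon, min_lat, max_lon, max_lat


def in_bbox(lon, lat, bbox=EUROPE_BBOX):
    """True if the point lies inside the bounding box (borders included)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def filter_records(records, bbox=EUROPE_BBOX):
    """Keep only records whose 'lon' and 'lat' fall inside the bounding box."""
    return [record for record in records if in_bbox(record["lon"], record["lat"], bbox)]
```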

You can change the first day here, in create_batches():

In [None]:
if load_type == "day-by-day":
    start_date = np.datetime64("2000-01-01T00:00:00", "s")
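
The "day-by-day" loading can be pictured as stepping through the period one day at a time from the start date. This is a minimal sketch with the standard library instead of NumPy; the function name is hypothetical and the end date is assumed to be exclusive.

```python
from datetime import datetime, timedelta


def daily_timestamps(start, end):
    """Yield one timestamp per day from start (inclusive) to end (exclusive)."""
    current = start
    while current < end:
        yield current
        current += timedelta(days=1)


days = list(daily_timestamps(datetime(2000, 1, 1), datetime(2000, 1, 4)))
```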

### Utils

Some important functions: handling timestamp values from Xeno-Canto or iNaturalist, geo functions for bounding boxes, rounding degrees, and getting the ISO codes of countries.
There are also the statistics, which need work, and the plots file.
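
As an example of what the geo helpers do, the sketch below snaps coordinates to a regular grid and computes a bounding box from a set of points. Both function names are hypothetical and the project's own helpers may differ; the default step of 0.25 is chosen because ERA5 is distributed on a 0.25 degree grid.

```python
def round_to_grid(value, step=0.25):
    """Snap a coordinate in degrees to the nearest multiple of step."""
    return round(round(value / step) * step, 6)


def bounding_box(points):
    """Return (min_lon, min_lat, max_lon, max_lat) for (lon, lat) pairs."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return (min(lons), min(lats), max(lons), max(lats))
```

Rounding observation coordinates this way lets a species record be matched to the grid cell that holds the climate values.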

The merge data file contains functions for calculating the average distance between latitude and longitude, for merging images and audio during the species dataset creation, for merging the World Bank data, etc.

#### Label mapping

The endangered species and plants have the following labels: 11824, 2082, 9783, 16067, 16348, 5997, 10261, 327, 13833, 9319, 18673, 16870, 10265, 15761, 9060, 10200, 2393, 511, 20832, 17663, 15861.

They come from https://op.europa.eu/en/publication-detail/-/publication/d426ab4d-fc82-11e5-b713-01aa75ed71a1 and https://portals.iucn.org/library/sites/library/files/documents/RL-4-013.pdf

You can find the mapping in /data/projects/biodt/storage/processed_data/labels_mapping/label_mappings.json
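
A simple way to use the list of endangered labels in code is a lookup set. The sketch below only uses the label IDs listed above; the names `ENDANGERED_IDS` and `is_endangered` are hypothetical, and nothing is assumed about the structure of label_mappings.json.

```python
# Label IDs of the endangered species and plants, as listed in this document.
ENDANGERED_IDS = frozenset({
    11824, 2082, 9783, 16067, 16348, 5997, 10261, 327, 13833, 9319, 18673,
    16870, 10265, 15761, 9060, 10200, 2393, 511, 20832, 17663, 15861,
})


def is_endangered(label_id) -> bool:
    """True if the given label ID (int or numeric string) is in the list."""
    return int(label_id) in ENDANGERED_IDS
```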


### Storage

The data folder contains the raw data.

Dataset_files contains the CSV, TXT or JSON files from the sources, from which we extract the data and save them to the data folder.

The modality folders contain TXT files that show which folders contain which modalities. They were produced by a terminal command that was run once; there is no code for it. They should be updated.
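
Since there is no code for producing these lists, a small script could regenerate them when the data change. The sketch below is an assumption about how that might be done, not a reconstruction of the original terminal command: the mapping from file extension to modality, the function names and the output format (one TXT file per modality, one folder name per line) are all hypothetical.

```python
from pathlib import Path

# Hypothetical mapping from file extension to modality name.
MODALITY_BY_EXTENSION = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".wav": "audio", ".mp3": "audio",
    ".csv": "tabular",
}


def folders_by_modality(root):
    """Map each modality to the sorted sub-folders of root that contain it."""
    found = {}
    for path in Path(root).rglob("*"):
        modality = MODALITY_BY_EXTENSION.get(path.suffix.lower())
        if path.is_file() and modality:
            found.setdefault(modality, set()).add(path.parent.name)
    return {modality: sorted(names) for modality, names in found.items()}


def write_lists(root, output_dir):
    """Write one <modality>.txt per modality, one folder name per line."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for modality, names in folders_by_modality(root).items():
        (out / f"{modality}.txt").write_text("\n".join(names) + "\n")
```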

Processed data contains the label mappings, the timestamps extracted from the species dataset, and the species dataset itself.