# EDITO Datalab Demo: STAC, Parquet, and Zarr

This notebook demonstrates the core workflow of using EDITO Datalab:
1. **Find services** on the datalab website
2. **Configure services** (RStudio, Jupyter, VSCode)
3. **Run analysis** with STAC search, Parquet reading, and Zarr data

Perfect for a 15-minute tutorial! 🚀


## 1. STAC Search - Finding Marine Data

First, let's search the EDITO STAC catalog to find available marine datasets.


In [None]:
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("🌊 EDITO Datalab Jupyter Demo")
print("=" * 40)


In [None]:
# Connect to EDITO STAC API
stac_endpoint = "https://api.dive.edito.eu/data/"

try:
    # A timeout keeps the cell from hanging if the API is unreachable
    response = requests.get(f"{stac_endpoint}collections", timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors
    collections = response.json()

    print("✅ Connected to EDITO STAC API")
    print(f"Found {len(collections['collections'])} data collections")

    # Show the first ten collections
    print("\n📋 Available data collections:")
    for i, collection in enumerate(collections['collections'][:10]):
        print(f"{i+1:2d}. {collection['id']} - {collection.get('title', 'No title')}")

except Exception as e:
    print(f"❌ Error connecting to EDITO API: {e}")


In [None]:
# Search for biodiversity data
print("\n🔍 Searching for biodiversity data...")

try:
    search_url = f"{stac_endpoint}search"
    search_params = {
        "collections": ["eurobis-occurrence-data"],
        "limit": 5
    }

    response = requests.post(search_url, json=search_params, timeout=30)
    response.raise_for_status()  # Fail early on HTTP errors
    search_results = response.json()

    # The count is capped by "limit", so it is not the size of the collection
    print(f"✅ Retrieved {len(search_results['features'])} biodiversity items")

    # Show first item
    if search_results['features']:
        first_item = search_results['features'][0]
        print(f"\n📊 Sample item: {first_item['id']}")
        print(f"Title: {first_item['properties'].get('title', 'No title')}")

        print("\n🔗 Available data formats:")
        for asset_name, asset in first_item['assets'].items():
            print(f"- {asset_name}: {asset['href']}")

except Exception as e:
    print(f"❌ Error searching STAC: {e}")
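
The search above filters only by collection. As a minimal sketch, the request body could also carry a bounding box and a time interval, assuming the EDITO endpoint honours the standard STAC API `bbox` and `datetime` search fields (not verified here). The helper names `to_rfc3339` and `build_search_body` and the filter values are hypothetical, chosen for illustration.

```python
import json
from datetime import datetime, timezone


def to_rfc3339(moment):
    """Format a datetime as an RFC 3339 UTC timestamp; naive values are treated as UTC."""
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_search_body(collections, bbox=None, start=None, end=None, limit=5):
    """Assemble a JSON body for a STAC item search (hypothetical helper)."""
    body = {"collections": list(collections), "limit": limit}
    if bbox is not None:
        if len(bbox) != 4:
            raise ValueError("bbox must be (west, south, east, north)")
        body["bbox"] = list(bbox)
    if start is not None or end is not None:
        # ".." marks an open end of the interval
        start_text = to_rfc3339(start) if start is not None else ".."
        end_text = to_rfc3339(end) if end is not None else ".."
        body["datetime"] = f"{start_text}/{end_text}"
    return body


# Hypothetical filter values: a North Sea box, items from 2020 onwards
search_body = build_search_body(
    ["eurobis-occurrence-data"],
    bbox=(0.0, 50.0, 10.0, 60.0),
    start=datetime(2020, 1, 1),
)
print(json.dumps(search_body, indent=2))
```

The resulting dictionary can be passed as `json=search_body` to `requests.post`, in place of `search_params`.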


## 2. Reading Parquet Data - Biodiversity Analysis

Now let's read the biodiversity data using Parquet format for efficient access.
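
A STAC item lists its files under `assets`, so the Parquet URL could be picked from a search result rather than typed by hand. The sketch below matches on the asset's media type or href; that is a heuristic, since asset names and media types vary between catalogs. The helper `find_assets` and the example item with its `example.org` URLs are hypothetical.

```python
def find_assets(item, keyword):
    """Return {asset_name: href} for assets whose media type or href mentions keyword."""
    keyword = keyword.lower()
    matches = {}
    for name, asset in item.get("assets", {}).items():
        href = asset.get("href")
        if href is None:
            continue  # Skip malformed assets without a link
        searchable = f"{asset.get('type', '')} {href}".lower()
        if keyword in searchable:
            matches[name] = href
    return matches


# Hypothetical item, shaped like one entry of search_results['features']
example_item = {
    "id": "example-item",
    "assets": {
        "parquet": {
            "href": "https://example.org/data/occurrences.parquet",
            "type": "application/vnd.apache.parquet",
        },
        "zarr": {
            "href": "https://example.org/data/temperature.zarr",
            "type": "application/vnd+zarr",
        },
        "thumbnail": {"href": "https://example.org/data/preview.png", "type": "image/png"},
    },
}

parquet_assets = find_assets(example_item, "parquet")
zarr_assets = find_assets(example_item, "zarr")
print(parquet_assets)
print(zarr_assets)
```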


In [None]:
import fsspec  # Assumed to be installed (with aiohttp) for ranged HTTP reads
import pyarrow.parquet as pq

print("📊 Reading biodiversity data from Parquet...")

# EUROBIS biodiversity occurrence data
parquet_url = "https://s3.waw3-1.cloudferro.com/emodnet/biology/eurobis_occurrence_data/eurobis_occurrences_geoparquet_2024-10-01.parquet"

try:
    # Stream only the first 1000 rows rather than loading the whole file;
    # the total row count comes from the Parquet footer metadata
    with fsspec.open(parquet_url, "rb") as f:
        parquet_file = pq.ParquetFile(f)
        total_rows = parquet_file.metadata.num_rows
        first_batch = next(parquet_file.iter_batches(batch_size=1000))
    df_sample = first_batch.to_pandas()

    print(f"✅ Loaded {len(df_sample)} biodiversity records (sample)")
    print(f"📋 Total records in dataset: {total_rows}")
    print(f"\nColumns: {list(df_sample.columns)}")

except Exception as e:
    print(f"❌ Error reading parquet: {e}")
    # Create sample data for demo
    print("Creating sample data for demonstration...")
    df_sample = pd.DataFrame({
        'scientificName': ['Scomber scombrus', 'Gadus morhua', 'Pleuronectes platessa'] * 100,
        'decimalLatitude': np.random.uniform(50, 60, 300),
        'decimalLongitude': np.random.uniform(0, 10, 300),
        'eventDate': pd.date_range('2020-01-01', '2023-12-31', periods=300)
    })


In [None]:
# Filter for marine species
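print("🐠 Filtering for marine species...")

# Keywords only match names that literally contain them (e.g. "Mollusca");
# most binomial names such as "Gadus morhua" will not match
marine_keywords = ['fish', 'pisces', 'mollusca', 'algae', 'crustacea', 'crab', 'mollusk']

if 'scientificName' in df_sample.columns:
    marine_mask = df_sample['scientificName'].str.contains('|'.join(marine_keywords), case=False, na=False)
    marine_data = df_sample[marine_mask].copy()
    if marine_data.empty:
        # EurOBIS occurrences are marine already, so keep all records rather than none
        marine_data = df_sample.copy()
else:
    marine_data = df_sample.copy()  # No name column to filter on

print(f"✅ Found {len(marine_data)} marine species records")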

# Show top species
if len(marine_data) > 0:
    species_count = marine_data['scientificName'].value_counts().head(10)
    print("\nTop 10 marine species:")
    print(species_count)


In [None]:
# Create a simple visualization
if len(marine_data) > 0:
    plt.figure(figsize=(12, 8))
    
    # Plot 1: Species distribution
    plt.subplot(2, 2, 1)
    species_count.head(5).plot(kind='bar')
    plt.title('Top 5 Marine Species')
    plt.xticks(rotation=45)
    
    # Plot 2: Geographic distribution
    plt.subplot(2, 2, 2)
    plt.scatter(marine_data['decimalLongitude'], marine_data['decimalLatitude'], 
               alpha=0.6, s=20)
    plt.xlabel('Longitude')
    plt.ylabel('Latitude')
    plt.title('Geographic Distribution')
    
    # Plot 3: Temporal distribution
    plt.subplot(2, 2, 3)
    if 'eventDate' in marine_data.columns:
        # Compute years in a local Series to avoid modifying a DataFrame slice
        years = pd.to_datetime(marine_data['eventDate'], errors='coerce').dt.year
        years.value_counts().sort_index().plot(kind='line')
        plt.title('Records by Year')
        plt.xlabel('Year')
        plt.ylabel('Count')
    
    # Plot 4: Summary stats
    plt.subplot(2, 2, 4)
    plt.text(0.1, 0.7, f'Total Records: {len(marine_data)}', fontsize=12)
    plt.text(0.1, 0.5, f'Unique Species: {marine_data["scientificName"].nunique()}', fontsize=12)
    plt.text(0.1, 0.3, f'Latitude Range: {marine_data["decimalLatitude"].min():.1f} - {marine_data["decimalLatitude"].max():.1f}', fontsize=12)
    plt.text(0.1, 0.1, f'Longitude Range: {marine_data["decimalLongitude"].min():.1f} - {marine_data["decimalLongitude"].max():.1f}', fontsize=12)
    plt.title('Summary Statistics')
    plt.axis('off')
    
    plt.tight_layout()
    plt.show()
    
else:
    print("❌ No marine data to visualize")


## 3. Reading Zarr Data - Oceanographic Analysis

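Now let's work with oceanographic array data using xarray. To keep the demo self-contained, the cell below generates a synthetic dataset in memory instead of opening a remote Zarr store; a real store URL would come from a STAC search.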


In [None]:
import xarray as xr
import zarr

print("🧊 Reading oceanographic data from Zarr...")

# Example Zarr URL (you would get this from STAC search)
# For demo purposes, we'll create sample oceanographic data
print("Creating sample oceanographic data for demonstration...")

# Create sample oceanographic data
lats = np.linspace(50, 60, 50)
lons = np.linspace(0, 10, 50)
times = pd.date_range('2020-01-01', '2020-12-31', freq='D')
depths = np.array([0, 10, 20, 50, 100, 200, 500, 1000])

# Create temperature data with realistic patterns
temp_data = np.random.normal(10, 2, (len(times), len(depths), len(lats), len(lons)))
# Add seasonal variation
seasonal = 5 * np.sin(2 * np.pi * np.arange(len(times)) / 365.25)
temp_data += seasonal[:, np.newaxis, np.newaxis, np.newaxis]
# Add depth variation
temp_data += -0.01 * depths[np.newaxis, :, np.newaxis, np.newaxis]

# Create xarray Dataset
ds = xr.Dataset({
    'temperature': (['time', 'depth', 'lat', 'lon'], temp_data),
    # Placeholder salinity: temperature plus noise, not physically realistic values
    'salinity': (['time', 'depth', 'lat', 'lon'],
                 temp_data + np.random.normal(0, 0.5, temp_data.shape))
}, coords={
    'time': times,
    'depth': depths,
    'lat': lats,
    'lon': lons
})

print("✅ Created oceanographic dataset")
print(f"Dimensions: {dict(ds.sizes)}")
print(f"Variables: {list(ds.data_vars)}")
print(f"Coordinates: {list(ds.coords)}")


In [None]:
# Analyze the oceanographic data
print("📊 Analyzing oceanographic data...")

# Calculate mean temperature by depth
mean_temp_by_depth = ds.temperature.mean(dim=['time', 'lat', 'lon'])

# Calculate seasonal cycle
seasonal_temp = ds.temperature.mean(dim=['depth', 'lat', 'lon'])

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Temperature profile by depth
axes[0, 0].plot(mean_temp_by_depth, -mean_temp_by_depth.depth)
axes[0, 0].set_xlabel('Temperature (°C)')
axes[0, 0].set_ylabel('Depth (m)')
axes[0, 0].set_title('Mean Temperature Profile')
axes[0, 0].grid(True)

# Plot 2: Seasonal temperature cycle
axes[0, 1].plot(seasonal_temp.time, seasonal_temp)
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Temperature (°C)')
axes[0, 1].set_title('Seasonal Temperature Cycle')
axes[0, 1].tick_params(axis='x', rotation=45)

# Plot 3: Temperature at surface
surface_temp = ds.temperature.isel(depth=0, time=0)
im = axes[1, 0].contourf(surface_temp.lon, surface_temp.lat, surface_temp, levels=20)
axes[1, 0].set_xlabel('Longitude')
axes[1, 0].set_ylabel('Latitude')
axes[1, 0].set_title('Surface Temperature (Jan 1, 2020)')
plt.colorbar(im, ax=axes[1, 0])

# Plot 4: Temperature vs Salinity
temp_flat = ds.temperature.values.flatten()
sal_flat = ds.salinity.values.flatten()
# Sample for plotting
sample_idx = np.random.choice(len(temp_flat), 1000, replace=False)
axes[1, 1].scatter(sal_flat[sample_idx], temp_flat[sample_idx], alpha=0.6, s=1)
axes[1, 1].set_xlabel('Salinity')
axes[1, 1].set_ylabel('Temperature (°C)')
axes[1, 1].set_title('Temperature vs Salinity')
axes[1, 1].grid(True)

plt.tight_layout()
plt.show()

print("✅ Oceanographic analysis complete!")


## 4. Personal Storage - Connect and Transfer Data

Now let's connect to your personal storage and transfer data.


In [None]:
# Connect to personal storage
print("💾 Connecting to personal storage...")

import boto3
import os

# Check that both the credentials and the endpoint are available
if os.getenv("AWS_ACCESS_KEY_ID") and os.getenv("AWS_S3_ENDPOINT"):
    print("✅ Personal storage credentials found!")

    # Create a client for EDITO's MinIO storage using environment variables
    s3 = boto3.client(
        "s3",
        endpoint_url=f"https://{os.getenv('AWS_S3_ENDPOINT')}",
        aws_access_key_id=os.getenv('AWS_ACCESS_KEY_ID'),
        aws_secret_access_key=os.getenv('AWS_SECRET_ACCESS_KEY'),
        aws_session_token=os.getenv('AWS_SESSION_TOKEN'),
        region_name=os.getenv('AWS_DEFAULT_REGION')
    )

    # Creating the client sends no request, so list buckets to verify the connection
    try:
        response = s3.list_buckets()
        print("✅ Connected to personal storage!")
        print(f"📁 Available buckets: {[bucket['Name'] for bucket in response['Buckets']]}")
    except Exception as e:
        print(f"⚠️ Could not list buckets: {e}")

else:
    print("❌ No storage credentials found. Make sure you're running in EDITO Datalab.")
    print("💡 Your credentials are automatically available in EDITO services")
    print("💡 No need to go to project settings - they're already there!")
    
    # For demo purposes, create a mock connection
    print("Creating mock connection for demonstration...")
    s3 = None
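
The connection cell prefixes `AWS_S3_ENDPOINT` with `https://`, which assumes the variable holds a bare host name. A small defensive sketch for building the endpoint URL is shown here; it returns `None` when the variable is unset and leaves an existing scheme alone. The helper name `s3_endpoint_url` and the `minio.example.org` host are hypothetical.

```python
import os


def s3_endpoint_url(raw_endpoint):
    """Turn an endpoint setting into a full URL, or None if it is missing."""
    if not raw_endpoint or not raw_endpoint.strip():
        return None
    endpoint = raw_endpoint.strip().rstrip("/")
    if endpoint.startswith(("http://", "https://")):
        return endpoint  # Scheme already present, keep as given
    return f"https://{endpoint}"


# Hypothetical values for illustration
print(s3_endpoint_url("minio.example.org"))
print(s3_endpoint_url("https://minio.example.org/"))
print(s3_endpoint_url(os.getenv("AWS_S3_ENDPOINT")))
```

The result could be passed as `endpoint_url` to `boto3.client`, after checking that it is not `None`.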


In [None]:
# Process and save data to personal storage
print("📊 Processing data for storage...")

if len(marine_data) > 0:
    # Aggregate per species: mean position and number of dated records
    processed_data = marine_data.groupby('scientificName').agg({
        'decimalLatitude': 'mean',
        'decimalLongitude': 'mean',
        'eventDate': 'count'
    }).reset_index()

    processed_data.columns = ['species', 'mean_latitude', 'mean_longitude', 'count']

    print(f"✅ Processed data: {len(processed_data)} species")
    print(processed_data.head())

    # Save to local file first
    processed_data.to_csv('processed_marine_data.csv', index=False)
    print("✅ Data saved locally as processed_marine_data.csv")

    # Upload to personal storage (if connected)
    if s3:
        try:
            s3.put_object(
                Bucket='your-bucket-name',  # Replace with your actual bucket name
                Key='marine_analysis/processed_marine_data.csv',
                Body=processed_data.to_csv(index=False).encode('utf-8'),  # Send bytes, not str
                ContentType='text/csv'
            )
            print("✅ Data uploaded to personal storage!")
        except Exception as e:
            print(f"❌ Error uploading to storage: {e}")
            print("💡 Make sure to replace 'your-bucket-name' with your actual bucket name")
    else:
        print("💡 To upload to storage, make sure you're running in EDITO Datalab")

else:
    print("❌ No marine data to process")


In [None]:
# Download data from personal storage
print("📥 Downloading data from personal storage...")

if s3:
    try:
        # Download from personal storage
        response = s3.get_object(
            Bucket='your-bucket-name',  # Replace with your actual bucket name
            Key='marine_analysis/processed_marine_data.csv'
        )
        downloaded_data = pd.read_csv(response['Body'])
        print("✅ Data downloaded from personal storage!")
        print(f"Downloaded {len(downloaded_data)} records")
        print(downloaded_data.head())
        
    except Exception as e:
        print(f"❌ Error downloading from storage: {e}")
        print("💡 Make sure the file exists in your storage and bucket name is correct")
else:
    print("💡 To download from storage, make sure you're running in EDITO Datalab")
    print("💡 Your credentials will be automatically available in EDITO services")


## 5. Summary - EDITO Datalab Workflow

This notebook demonstrated the core EDITO Datalab workflow:

### 🎯 Key Steps:
1. **Find Services**: Go to [datalab.dive.edito.eu](https://datalab.dive.edito.eu/) and select a service
2. **Configure Service**: Choose RStudio, Jupyter, or VSCode with appropriate resources
3. **Run Analysis**: Use STAC to find data, Parquet for tabular data, Zarr for arrays
4. **Connect Storage**: Access your personal storage with automatic credentials
5. **Process & Transfer**: Analyze data and save results to your storage

### 🛠️ Services Available:
- **RStudio**: Perfect for statistical analysis and visualization
- **Jupyter**: Ideal for machine learning and data exploration
- **VSCode**: Great for larger projects with R and Python

### 📊 Data Formats:
- **STAC**: Find and discover marine datasets
- **Parquet**: Efficient tabular data (biodiversity, observations)
- **Zarr**: Cloud-optimized arrays (oceanographic, climate data)

### 🚀 Next Steps:
- Try the RStudio service for R-based analysis
- Explore more datasets in the EDITO STAC catalog
- Use personal storage to save your results

**Happy analyzing! 🌊🐠**
