# 🌍 NERC Digital Gathering - Environmental Data Hackathon — Soils & Land Cover Notebook

Welcome to the NERC Digital Gathering hackathon!  
This notebook contains the challenge briefs and starter code for you to explore weather, climate, and soil data available through CEDA and other sources. In the hackathon we invite you to use NERC data and to explore the CEDA archive. Cranfield holds the national soil map, so if your hack involves soil you can use that dataset too.

In the hackathon, you can explore and interact with a range of meteorological and other data held by CEDA, NERC's Centre for Environmental Data Analysis. The datasets we are looking at include ECMWF, HadUK-Grid and MIDAS. These data come in different formats and structures, so you can explore those differences as well.

**MIDAS** (Met Office Integrated Data Archive System): This is a database of raw weather observations from land and marine surface stations, both in the UK and globally. It contains daily, hourly and sub-hourly measurements of various parameters like temperature, rainfall, sunshine, wind, cloud cover, and present weather codes. MIDAS data are station-based time series in CSV format.

**ECMWF** (European Centre for Medium-Range Weather Forecasts): This organisation produces weather forecasts and climate reanalyses. ECMWF data includes estimates of atmospheric parameters like air temperature, pressure, and wind at different altitudes, as well as surface parameters like rainfall, soil moisture content, ocean-wave height, and sea-surface temperature, for the entire globe. They also have ocean reanalysis and analysis systems like OCEAN5. ECMWF has regional gridded data in NetCDF format.

**HadUK-Grid** is a dataset of gridded climate variables for the UK derived from interpolated land surface observations. It focuses on climate variables like temperature, rainfall, sunshine, mean sea level pressure, wind speed, relative humidity, vapour pressure, days of snow lying, and days of ground frost, at daily, monthly, seasonal, and annual timescales. HadUK-Grid provides regional gridded data in NetCDF format.

**Soils and Land Cover**: In addition to the meteorological notebooks, we are also running a fourth notebook that compares soil types and land cover in the county of Bedfordshire. The challenge is to use spatial analysis to look for patterns between the two datasets.

**This notebook sets some challenges using soils and land cover data for Bedfordshire.**

---

## ⚙️ Getting Started

1. **Load libraries**  
   The sorts of libraries you may need include `pandas`, `numpy`, `matplotlib`, `seaborn`, and `geopandas`.  
   (Install with `pip install ...` if missing.)

2. **Accessing external data**  
   Example:
   ```python
   import pandas as pd
   url = "https://nercdigitalgathering2025.github.io/data/Bedfordshire_CORINE_Landcover_2018_100mGrid.csv"
   df = pd.read_csv(url)
   print(df.head())
   ```

3. **Notebook structure**  
   Each challenge is introduced in Markdown with background, tasks, and judging criteria.  
   Under each challenge you'll find starter code cells to help you begin.  

---

## ⚙️ Geographical focus
In this hackathon, the work has a geographical focus. You can look at the UK as a whole, or focus on Bedfordshire, where we are located. Approximate bounding boxes for these areas, in decimal degrees with negative longitudes lying west of Greenwich, are as follows:

* UK bounding box (roughly longitude -10° to 3°, latitude 49° to 61°)
* Bedfordshire bounding box (roughly longitude -0.89° to 0.23°, latitude 51.95° to 52.49°)
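
As a minimal sketch of how these boxes might be applied, the helper below keeps only the rows of a table whose longitude and latitude fall inside a box. The `in_bbox` function, the `lon`/`lat` column names and the three sample points are illustrative assumptions, not part of the hackathon data. The soil and land cover grids may use projected coordinates rather than degrees, in which case the box would need converting first.

```python
import pandas as pd

# Approximate bounding boxes from the text: (min_lon, max_lon, min_lat, max_lat)
UK_BBOX = (-10.0, 3.0, 49.0, 61.0)
BEDFORDSHIRE_BBOX = (-0.89, 0.23, 51.95, 52.49)


def in_bbox(df, bbox, lon_col="lon", lat_col="lat"):
    """Return the rows of df whose coordinates fall inside bbox (hypothetical helper)."""
    min_lon, max_lon, min_lat, max_lat = bbox
    inside = df[lon_col].between(min_lon, max_lon) & df[lat_col].between(min_lat, max_lat)
    return df[inside]


# Made-up points for illustration only
points = pd.DataFrame({
    "name": ["a", "b", "c"],
    "lon": [-0.5, -3.0, 10.0],
    "lat": [52.1, 55.0, 48.0],
})

uk_points = in_bbox(points, UK_BBOX)
beds_points = in_bbox(points, BEDFORDSHIRE_BBOX)
print(list(uk_points["name"]))
print(list(beds_points["name"]))
```

`Series.between` is inclusive at both ends by default, so points lying exactly on the box edge are kept.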

---

## ⚙️ Useful links
Here are a few useful web addresses for CEDA data:
* CEDA Data home: https://data.ceda.ac.uk
* CEDA Help Doc home: https://help.ceda.ac.uk
* MIDAS User Guide: https://zenodo.org/records/7357335
* ECMWF website: https://www.ecmwf.int
* JASMIN Notebooks service help: https://help.jasmin.ac.uk/docs/interactive-computing/jasmin-notebooks-service/

---

# 📝 Challenges


## Challenge 1 — Load and Explore Land Cover Data

### Background
CORINE (Coordination of Information on the Environment) Land Cover data provides detailed information about land use and land cover across Europe. The Bedfordshire dataset contains 100m grid resolution data showing different land cover types such as agricultural areas, forests, urban areas, and water bodies.

### Your Task
Load the CORINE Land Cover dataset and perform initial exploration:
- Load the CSV file from the provided URL
- Examine the data structure, columns, and data types
- Generate basic statistics about land cover distribution
- Create visualisations showing land cover patterns

**Success criteria:** successful data loading, clear understanding of data structure, informative statistics and visualisations.
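
The statistics step can be sketched as follows, assuming each row is one 100 m grid cell (so one cell covers 0.01 km²) and that a single column holds the land cover class. The column name `CLC_CODE`, the helper `cover_summary` and the tiny table are made up for illustration; check the real column names after loading the CSV.

```python
import pandas as pd


def cover_summary(df, class_col, cell_size_m=100):
    """Count cells per class and convert counts to percentage and area (hypothetical helper)."""
    counts = df[class_col].value_counts()
    cell_area_km2 = (cell_size_m ** 2) / 1_000_000  # 100 m x 100 m = 0.01 km²
    return pd.DataFrame({
        "cells": counts,
        "percent": (counts / counts.sum() * 100).round(1),
        "area_km2": counts * cell_area_km2,
    })


# Made-up sample: six grid cells with illustrative class codes
sample = pd.DataFrame({"CLC_CODE": [211, 211, 211, 112, 112, 311]})
summary = cover_summary(sample, "CLC_CODE")
print(summary)
```

Reporting area alongside raw counts makes the result easier to compare with published land cover figures, which are usually given in km² or hectares.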


In [None]:
# Import required libraries for data processing, analysis, and visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
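import warnings

# Silence library warnings to keep the notebook output tidy.
# Comment this line out while debugging, as it also hides deprecation notices.
warnings.filterwarnings('ignore')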

# Set plotting style for better visualisations
plt.style.use('default')
sns.set_palette("husl")

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)


In [None]:
# Define URLs for the datasets
landcover_url = "https://nercdigitalgathering2025.github.io/data/Bedfordshire_CORINE_Landcover_2018_100mGrid.csv"
soils_url = "https://nercdigitalgathering2025.github.io/data/Bedfordshire_NATMAP_Soils_100mGrid.csv"

# Load the CORINE Land Cover dataset
print("Loading CORINE Land Cover data...")
landcover_df = pd.read_csv(landcover_url)

# Display basic information about the dataset
print(f"Dataset shape: {landcover_df.shape}")
print(f"Columns: {list(landcover_df.columns)}")
print("\nFirst few rows:")
landcover_df.head()


In [None]:
# Examine data types and basic statistics
print("Data types:")
print(landcover_df.dtypes)
print("\nBasic statistics:")
print(landcover_df.describe())

# Check for missing values
print("\nMissing values:")
print(landcover_df.isnull().sum())

# Display unique values in categorical columns (if any)
print("\nUnique values in each column:")
for col in landcover_df.columns:
    unique_count = landcover_df[col].nunique()
    print(f"{col}: {unique_count} unique values")
    if unique_count <= 20:  # Show values if not too many
        # Drop NaN first: sorted() raises TypeError when strings are mixed with float NaN
        print(f"  Values: {sorted(landcover_df[col].dropna().unique())}")


## Challenge 2 — Load and Explore Soils Data

### Background
The NATMAP Soils dataset provides detailed soil information for Bedfordshire at 100m grid resolution. This dataset contains various soil properties including soil type, texture, drainage characteristics, and other physical and chemical properties that are crucial for understanding agricultural potential, flood risk, and environmental processes.

### Your Task
Load the NATMAP Soils dataset and perform comprehensive analysis:
- Load the CSV file from the provided URL
- Examine the data structure and identify different soil properties
- Generate statistics for numerical soil properties
- Analyse categorical soil classifications
- Create visualisations of soil distribution patterns

**Success criteria:** successful data loading, thorough understanding of soil data structure, comprehensive statistical analysis, clear visualisations.
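
One way to approach the structure step is a per-column profile listing the kind of data, the missing values and the distinct values, which makes the numerical/categorical split visible at a glance. This is a sketch on a made-up table; `profile_columns`, `SOIL_TYPE` and `CLAY_PCT` are hypothetical names, not the NATMAP schema.

```python
import numpy as np
import pandas as pd


def profile_columns(df):
    """Summarise each column: kind, missing count, distinct count (hypothetical helper)."""
    rows = []
    for col in df.columns:
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        rows.append({
            "column": col,
            "kind": "numerical" if is_numeric else "categorical",
            "n_missing": int(df[col].isna().sum()),
            "n_unique": int(df[col].nunique()),  # nunique() ignores missing values
        })
    return pd.DataFrame(rows).set_index("column")


# Made-up sample with one categorical and one numerical property
sample = pd.DataFrame({
    "SOIL_TYPE": ["clay", "loam", None, "clay"],
    "CLAY_PCT": [45.0, 20.0, np.nan, 50.0],
})
profile = profile_columns(sample)
print(profile)
```

Note that a numeric column can still be categorical in meaning (for example a soil class code), so the `kind` label is only a starting point.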


In [None]:
# Load the NATMAP Soils dataset
print("Loading NATMAP Soils data...")
soils_df = pd.read_csv(soils_url)

# Display basic information about the dataset
print(f"Dataset shape: {soils_df.shape}")
print(f"Columns: {list(soils_df.columns)}")
print("\nFirst few rows:")
soils_df.head()


In [None]:
# Examine data types and basic statistics
print("Data types:")
print(soils_df.dtypes)
print("\nBasic statistics for numerical columns:")
print(soils_df.describe())

# Check for missing values
print("\nMissing values:")
print(soils_df.isnull().sum())

# Identify numerical vs categorical columns
# (everything non-numeric is treated as categorical, whichever string dtype pandas uses)
numerical_cols = soils_df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = soils_df.select_dtypes(exclude=[np.number]).columns.tolist()

print(f"\nNumerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"Categorical columns ({len(categorical_cols)}): {categorical_cols}")

# Display unique values in categorical columns
print("\nUnique values in categorical columns:")
for col in categorical_cols:
    unique_count = soils_df[col].nunique()
    print(f"{col}: {unique_count} unique values")
    if unique_count <= 15:  # Show values if not too many
        # Drop NaN first: sorted() raises TypeError when strings are mixed with float NaN
        print(f"  Values: {sorted(soils_df[col].dropna().unique())}")


## Challenge 3 — Statistical Analysis and Visualisation

### Background
Understanding the distribution and patterns in both land cover and soils data is essential for environmental analysis. Statistical analysis helps identify dominant land cover types, soil characteristics, and spatial patterns that can inform land management decisions.

### Your Task
Perform comprehensive statistical analysis and create informative visualisations:
- Generate frequency distributions for land cover types
- Analyse soil property distributions and correlations
- Create spatial visualisations of both datasets
- Identify patterns and anomalies in the data
- Compare the spatial extent and coverage of both datasets

**Success criteria:** comprehensive statistical analysis, clear and informative visualisations, identification of key patterns and relationships.
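
For the spatial visualisations, a long table of grid cells can be reshaped into a 2D table, which plotting functions such as matplotlib's `imshow` accept once converted to an array. The sketch below assumes columns named `x`, `y` and `value` on a regular grid; these names, the `to_grid` helper and the three-cell sample are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def to_grid(df, x_col="x", y_col="y", value_col="value"):
    """Pivot a long table of cells into a 2D table with north at the top (hypothetical helper)."""
    grid = df.pivot(index=y_col, columns=x_col, values=value_col)
    # Sort rows so the largest y (northernmost) comes first, as on a map
    return grid.sort_index(ascending=False)


# Made-up 2 x 2 grid with one missing cell (x=100, y=100 is absent)
cells = pd.DataFrame({
    "x": [0, 100, 0],
    "y": [0, 0, 100],
    "value": [1.0, 2.0, 3.0],
})
grid = to_grid(cells)
array = grid.to_numpy()
n_gaps = int(np.isnan(array).sum())  # cells with no record show up as NaN
print(array)
print(f"Missing cells: {n_gaps}")
```

Passing `array` to `plt.imshow` would draw the grid as an image. `pivot` raises an error if the same (x, y) pair appears twice, which is a useful check for duplicate cells.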


In [None]:
# Analyse land cover distribution
print("=== LAND COVER ANALYSIS ===")

# Identify the main land cover column (assuming it's one of the columns)
# This will need to be adjusted based on actual column names
landcover_cols = [col for col in landcover_df.columns if 'land' in col.lower() or 'cover' in col.lower() or 'type' in col.lower()]
print(f"Potential land cover columns: {landcover_cols}")

# If we can identify a land cover type column, analyse it
if landcover_cols:
    main_col = landcover_cols[0]
    print(f"\nAnalysing column: {main_col}")
    
    # Frequency analysis
    landcover_counts = landcover_df[main_col].value_counts()
    print("\nLand cover type frequencies:")
    print(landcover_counts)
    
    # Create visualisation
    plt.figure(figsize=(12, 6))
    
    # Bar plot of land cover types
    plt.subplot(1, 2, 1)
    landcover_counts.plot(kind='bar')
    plt.title('Land Cover Type Distribution')
    plt.xlabel('Land Cover Type')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45, ha='right')
    
    # Pie chart of land cover types
    plt.subplot(1, 2, 2)
    landcover_counts.plot(kind='pie', autopct='%1.1f%%')
    plt.title('Land Cover Type Proportions')
    plt.ylabel('')
    
    plt.tight_layout()
    plt.show()
else:
    print("Could not identify land cover column. Please examine the data structure above.")


In [None]:
# Analyse soils data
print("=== SOILS ANALYSIS ===")

# Analyse numerical soil properties
if numerical_cols:
    print(f"\nAnalysing {len(numerical_cols)} numerical soil properties:")
    
    # Create correlation matrix for numerical properties
    if len(numerical_cols) > 1:
        plt.figure(figsize=(10, 8))
        correlation_matrix = soils_df[numerical_cols].corr()
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
                   square=True, fmt='.2f')
        plt.title('Correlation Matrix of Soil Properties')
        plt.tight_layout()
        plt.show()
    
    # Create distribution plots for key numerical properties
    key_properties = numerical_cols[:4]  # Show first 4 numerical properties
    if key_properties:
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        axes = axes.ravel()
        
        for i, prop in enumerate(key_properties):
            if i < len(axes):
                soils_df[prop].hist(bins=30, ax=axes[i], alpha=0.7)
                axes[i].set_title(f'Distribution of {prop}')
                axes[i].set_xlabel(prop)
                axes[i].set_ylabel('Frequency')
        
        # Hide unused subplots
        for i in range(len(key_properties), len(axes)):
            axes[i].set_visible(False)
        
        plt.tight_layout()
        plt.show()

# Analyse categorical soil properties
if categorical_cols:
    print(f"\nAnalysing {len(categorical_cols)} categorical soil properties:")
    
    # Show frequency distributions for categorical properties
    for col in categorical_cols[:3]:  # Show first 3 categorical properties
        print(f"\n{col} distribution:")
        value_counts = soils_df[col].value_counts()
        print(value_counts.head(10))  # Show top 10 values


## Challenge 4 — Spatial Analysis and Grid Comparison

### Background
Both datasets are provided at 100m grid resolution for Bedfordshire. Understanding the spatial relationship between land cover and soil properties is crucial for environmental analysis, agricultural planning, and ecosystem management.

### Your Task
Perform spatial analysis and intercomparison of the datasets:
- Identify common grid coordinates between the datasets
- Create spatial visualisations showing the distribution of both datasets
- Analyse the relationship between land cover types and soil properties
- Identify areas of overlap and potential data gaps
- Create summary statistics for grid cell comparisons

**Success criteria:** successful spatial analysis, clear understanding of dataset relationships, informative spatial visualisations, meaningful intercomparison results.
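
Overlap and data gaps can be counted with an outer join that records which table each cell came from. The sketch assumes both tables share coordinate columns named `x` and `y` holding grid references in metres; those names, the `LC` and `SOIL` columns and the sample rows are hypothetical.

```python
import pandas as pd

# Made-up samples: two cells are shared, one is unique to each table
landcover = pd.DataFrame({"x": [0, 100, 200], "y": [0, 0, 0], "LC": [211, 112, 311]})
soils = pd.DataFrame({"x": [100, 200, 300], "y": [0, 0, 0], "SOIL": ["a", "b", "c"]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
combined = pd.merge(landcover, soils, on=["x", "y"], how="outer", indicator=True)
overlap = combined["_merge"].value_counts().to_dict()
print(overlap)
```

Rows tagged `left_only` are land cover cells with no soil record, and `right_only` rows are the reverse. If the coordinates are stored as floats, rounding them to whole metres before the join avoids mismatches caused by tiny floating-point differences.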


In [None]:
# Spatial analysis and grid comparison
print("=== SPATIAL ANALYSIS ===")

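# Identify coordinate columns by name.
# Single letters ('x', 'y') must match a whole name or an underscore-separated part;
# matching them as substrings would also select columns such as 'texture' or 'type'.
coord_keywords = ['lon', 'lat', 'east', 'north', 'coord']

def is_coord_column(col):
    name = str(col).lower()
    has_xy_part = any(part in ('x', 'y') for part in name.split('_'))
    return has_xy_part or any(keyword in name for keyword in coord_keywords)

landcover_coords = [col for col in landcover_df.columns if is_coord_column(col)]
soils_coords = [col for col in soils_df.columns if is_coord_column(col)]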

print(f"Land cover coordinate columns: {landcover_coords}")
print(f"Soils coordinate columns: {soils_coords}")

# Check if we have coordinate information
if landcover_coords and soils_coords:
    print("\nCoordinate information found in both datasets")
    
    # Display coordinate ranges
    print("\nLand cover coordinate ranges:")
    for col in landcover_coords:
        print(f"{col}: {landcover_df[col].min():.2f} to {landcover_df[col].max():.2f}")
    
    print("\nSoils coordinate ranges:")
    for col in soils_coords:
        print(f"{col}: {soils_df[col].min():.2f} to {soils_df[col].max():.2f}")
    
    # Create spatial scatter plots if we have 2D coordinates
    if len(landcover_coords) >= 2 and len(soils_coords) >= 2:
        fig, axes = plt.subplots(1, 2, figsize=(15, 6))
        
        # Land cover spatial plot
        axes[0].scatter(landcover_df[landcover_coords[0]], landcover_df[landcover_coords[1]], 
                       alpha=0.6, s=1, c='blue')
        axes[0].set_xlabel(landcover_coords[0])
        axes[0].set_ylabel(landcover_coords[1])
        axes[0].set_title('Land Cover Data Spatial Distribution')
        axes[0].grid(True, alpha=0.3)
        
        # Soils spatial plot
        axes[1].scatter(soils_df[soils_coords[0]], soils_df[soils_coords[1]], 
                       alpha=0.6, s=1, c='red')
        axes[1].set_xlabel(soils_coords[0])
        axes[1].set_ylabel(soils_coords[1])
        axes[1].set_title('Soils Data Spatial Distribution')
        axes[1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        plt.show()
else:
    print("Coordinate information not clearly identified. Please examine the data structure above.")


## Challenge 5 — Advanced Intercomparison Analysis

### Background
Understanding the relationships between land cover and soil properties is crucial for environmental management, agricultural planning, and ecosystem services assessment. Advanced analysis can reveal patterns that inform land use decisions and environmental policy.

### Your Task
Perform advanced intercomparison analysis between land cover and soils data:
- Create cross-tabulation analysis between land cover types and soil properties
- Identify correlations between land cover and soil characteristics
- Perform statistical tests to identify significant relationships
- Create comprehensive visualisations showing land cover-soil relationships
- Generate summary reports of key findings

**Success criteria:** thorough intercomparison analysis, identification of meaningful relationships, clear statistical interpretation, comprehensive visualisations, actionable insights.
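
Because land cover and most soil classifications are categorical, a contingency table with a chi-square statistic is often more appropriate than a Pearson correlation. The sketch below builds the table with `pd.crosstab` and computes the chi-square statistic and Cramér's V by hand with NumPy; `scipy.stats.chi2_contingency` offers the same test with a p-value if SciPy is installed. The `cramers_v` helper, the column names and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def cramers_v(table):
    """Chi-square statistic and Cramér's V for a contingency table (no continuity correction)."""
    observed = table.to_numpy(dtype=float)
    total = observed.sum()
    # Expected counts if rows and columns were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
    chi2 = ((observed - expected) ** 2 / expected).sum()
    k = min(observed.shape) - 1  # assumes at least two classes on each axis
    return chi2, np.sqrt(chi2 / (total * k))


# Made-up merged cells: land cover is fully determined by soil type here
merged = pd.DataFrame({
    "LC": ["arable", "arable", "forest", "forest"],
    "SOIL": ["clay", "clay", "sand", "sand"],
})
table = pd.crosstab(merged["LC"], merged["SOIL"])
chi2, v = cramers_v(table)
print(table)
print(f"chi2 = {chi2:.2f}, Cramér's V = {v:.2f}")
```

Cramér's V ranges from 0 (no association) to 1 (one variable fully determines the other), which makes it easier to compare across tables of different sizes than the raw chi-square value. Neighbouring grid cells are not independent observations, so any p-value from such a test should be read cautiously.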


In [None]:
# Advanced intercomparison analysis
print("=== ADVANCED INTERCOMPARISON ANALYSIS ===")

# This analysis will depend on the actual structure of the data
# We'll create a framework that can be adapted based on the column names

# Identify key columns for analysis
print("Available columns for analysis:")
print(f"Land cover: {list(landcover_df.columns)}")
print(f"Soils: {list(soils_df.columns)}")

# Create a merged dataset if we have common coordinates
if landcover_coords and soils_coords:
    print("\nCreating merged dataset for intercomparison...")
    
    # Merge directly on the first two coordinate columns of each dataset (assumed to be
    # x/easting then y/northing in both files). This avoids adding a helper column to the
    # source tables, which the coordinate detection would pick up if the cells were re-run.
    # If the coordinates are floats, round them first: exact-match joins fail on tiny differences.
    merged_df = pd.merge(
        landcover_df, soils_df,
        left_on=landcover_coords[:2], right_on=soils_coords[:2],
        how='inner', suffixes=('_lc', '_soil')
    )
    
    print(f"Merged dataset shape: {merged_df.shape}")
    print(f"Successfully merged {len(merged_df)} grid cells")
    
    # Display merged dataset structure
    print("\nMerged dataset columns:")
    print(list(merged_df.columns))
    
    # Create correlation analysis between land cover and soil properties
    if len(merged_df) > 10:  # Only if we have enough data
        print("\n=== CORRELATION ANALYSIS ===")
        
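        # Identify numerical columns for correlation.
        # Caution: coordinates and numeric class codes are positions or labels rather than
        # measured quantities, so their Pearson coefficients are not meaningful on their own.
        merged_numerical = merged_df.select_dtypes(include=[np.number]).columns.tolist()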
        
        if len(merged_numerical) > 1:
            # Create correlation matrix
            plt.figure(figsize=(12, 10))
            correlation_matrix = merged_df[merged_numerical].corr()
            
            # Create heatmap
            mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
            sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', 
                       center=0, square=True, fmt='.2f', cbar_kws={"shrink": .8})
            plt.title('Correlation Matrix: Land Cover vs Soil Properties')
            plt.tight_layout()
            plt.show()
            
            # Identify strongest correlations
            print("\nStrongest correlations (|r| > 0.3):")
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_val = correlation_matrix.iloc[i, j]
                    if abs(corr_val) > 0.3:
                        print(f"{correlation_matrix.columns[i]} vs {correlation_matrix.columns[j]}: {corr_val:.3f}")
    
    # Create summary statistics for merged data
    print("\n=== MERGED DATA SUMMARY ===")
    print(f"Total merged grid cells: {len(merged_df):,}")
    print(f"Coverage area: {len(merged_df) * 100 * 100 / 1_000_000:.2f} km²")
    
    # Data completeness
    print("\nData completeness in merged dataset:")
    completeness = (1 - merged_df.isnull().sum() / len(merged_df)) * 100
    for col, comp in completeness.items():
        if comp < 100:
            print(f"{col}: {comp:.1f}% complete")
    
else:
    print("Cannot perform intercomparison analysis without coordinate overlap.")
    print("Please check the coordinate columns and ensure both datasets cover the same area.")
