# Case 1: Introduction to exploratory data analysis and statistical modeling

**File name and submission**
- Please save and submit your work as: **`Case_1_Firstname_Lastname.ipynb`**  
- Submit **only** the `.ipynb` file. The examiner will place your file into the appropriate environment where all required data files are available in the same directory.

---

**Before you start**
- Track **how much time** you spend on this homework in total.  
- At the **end of the notebook**, there is a cell where you should report the number of hours (this is only for course development feedback).

---

**General instructions**
1. **Work individually.**  
   You may discuss ideas with classmates, but do **not** copy and paste each other’s code.  
   Your notebook must reflect your own independent work.

2. **Use of external materials and AI tools.**  
   You may use course materials, documentation, examples, or quick help from sources such as Stack Overflow or ChatGPT.  
   If you **rely heavily** on external material or AI-generated code (for example, by copying significant parts), **cite the source**.  
   You do **not** need to cite small pieces of help used only to understand errors or debug.

3. **Do not delete notebook cells.**  
   Do **not** remove any pre-existing cells. Only add your solutions in the designated places.  
   If you accidentally delete a cell, restore it via **`Edit → Undo Delete Cells`**.  
   You may add extra cells if needed, but make sure each answer appears **under the correct question or sub-question**.

---

**Installing and loading packages**
- Import all required packages in the **first code cell** (see below).  
- Some common packages (for example `numpy`, `pandas`, `matplotlib`, `seaborn`) are already imported. Add any additional imports you need.

If you want to use a package that is **not installed** in the environment:
- Install it in the same first cell using:
  - `!pip install package_name`
- After installing, **comment out** the installation command so the examiner can see what you installed, but do **not** delete the line.

Example:
```python
# !pip install scikit-learn
import numpy as np
import pandas as pd
```

### Loading Packages

In [None]:
import numpy as np                # numerical arrays and math
import pandas as pd               # tabular data handling

import matplotlib.pyplot as plt   # plotting
import seaborn as sns             # statistical visualization

## <font color='#fc7202'> Exercise 1: Environmental data and visualization (6 points) </font> 
In this exercise, you will work with real-world environmental datasets and practice basic exploratory data analysis and visualization.

You are **provided with three datasets**, listed below.\
In addition, you must **find and download three more environmentally relevant datasets** from online data repositories.\
This means that you will work with **six datasets in total**.

**Provided datasets:**
- https://bolin.su.se/data/ammar-2024-contaminant-monitoring-1  
- https://bolin.su.se/data/lundevall-zara-2021-asko-methane  
- https://doi.org/10.21334/NPOLAR.2020.6101B7A2

You may find the additional datasets in the following databases, or in other comparable environmental data sources:
- Bolin Centre database: https://bolin.su.se/data/
- PANGAEA database: https://www.pangaea.de/
- Swedish environmental monitoring data:
  - Miljögifter i biota: https://www.ivl.se/
  - Sjöar och vattendrag: https://miljodata.slu.se/
- EBAS database: https://ebas.nilu.no/
- ICOS data portal: https://www.icos-cp.eu/data-services/about-data-portal
- IPCC Data Distribution Centre: https://www.ipcc-data.org/

If you choose to use **other sources** (for example Kaggle), make sure that the data are **observational and real**, not **synthetic or simulated**.  
Synthetic datasets are common on such platforms and are **not suitable** for this exercise.


### <font color='#fc7202'> Task 1: Dataset exploration and description (2 points) </font> 
Using the **three provided datasets** and **three additional datasets that you select and download yourself**, briefly describe each of the six datasets. 

For **each dataset**, include the following information:

- Measured variable(s)
- Measurement location (for example station name, site, or geographic region)
- Time span covered by the data
- Time resolution (for example hourly, daily, monthly)
- Data format (for example CSV, NetCDF, text file)
- Source database

Keep the descriptions concise but informative.

In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#fc7202'> Task 2: Univariate analysis (1 point)</font> 
Select **one dataset** and **one variable** from the six datasets.

- Create a **histogram** of the measured values.
- Label both axes clearly and include appropriate units.
- Briefly explain what the selected variable represents.
- Describe and discuss the **shape of the distribution**.
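
As a rough sketch of how the shape of a distribution can be quantified alongside the histogram, the snippet below computes bin counts and the sample skewness. The values are random placeholder numbers used only to make the example runnable, and the column name `concentration` is hypothetical; they do not replace a real dataset.

```python
import numpy as np
import pandas as pd

# Placeholder data only: right-skewed random numbers standing in for a real column.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=0.8, size=500), name="concentration")

# Bin counts are the numbers a histogram plot displays as bar heights.
counts, bin_edges = np.histogram(values, bins=20)

# Positive skewness and mean > median both point to a long right tail.
skewness = values.skew()
print(f"mean = {values.mean():.2f}, median = {values.median():.2f}, skewness = {skewness:.2f}")
```

The same `values` series can be passed to `plt.hist(values, bins=20)` to draw the figure itself.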

In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#fc7202'> Task 3: Time series analysis (1 point)</font> 
Choose **one dataset** and **one variable** from the six datasets.

- Plot a **time series** of the measured values.
- Include axis labels and units.
- Explain what the variable represents.
- Discuss the **temporal variability** observed in the plot, such as trends, seasonality, or irregular fluctuations, and suggest possible explanations.
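
One possible way to separate a slow trend from a seasonal cycle is a centered rolling mean. The sketch below assumes monthly data and uses an artificial placeholder series (a weak trend plus an annual sine wave) purely for illustration; the window length of 12 months is an assumption that only fits monthly resolution.

```python
import numpy as np
import pandas as pd

# Placeholder monthly series: a weak linear trend plus an annual cycle.
dates = pd.date_range("2015-01-01", periods=60, freq="MS")
months = np.arange(60)
series = pd.Series(0.05 * months + 5.0 * np.sin(2 * np.pi * months / 12), index=dates)

# A centered 12-month rolling mean averages out the annual cycle,
# leaving mainly the slower trend. The edges become NaN.
smoothed = series.rolling(window=12, center=True).mean()

# What remains after subtracting the smoothed curve is dominated by seasonality.
seasonal_part = series - smoothed
print(smoothed.dropna().head())
```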

In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#fc7202'> Task 4: Bivariate and multivariate analysis (2 points) </font> 
From **one dataset**, choose **two continuous variables** that are measured simultaneously over time.

1. Create a **scatter plot (X–Y plot)** to visualize the relationship between the two variables.
2. Plot the **two variables as time series**, either in the same figure or in separate panels.

In your discussion, comment on:
- What information each type of visualization provides about the relationship between the variables
- How the scatter plot and the time series complement each other in interpreting the data

In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

## <font color='#fc7202'> Exercise 2: Simple data analysis and statistical modeling (6 points) </font> 
In this exercise, you will explore relationships between environmental variables and apply basic statistical modeling techniques.

### <font color='#fc7202'>Task 1: Exploring bivariate relationships (2 points)</font>

Using datasets from the databases introduced in **Exercise 1**, select **three datasets** that each contain **at least two continuous variables** measured simultaneously at the **same location**.  
You may **reuse datasets from Exercise 1** if they are suitable for this task. If not, you may **select and download new datasets** from the same databases.

For **each dataset**:
- Choose **two continuous variables** that, based on your scientific understanding, are expected to be related.
- Create a **scatter plot** to visualize the relationship between the two variables.
- Compute the **correlation coefficient (*r*)** and the associated ***p*-value**.

For each variable pair, discuss:
- The **strength** and **direction** of the correlation
- Whether the correlation is **statistically significant**
- Possible **physical, chemical, or biological mechanisms** underlying the observed relationship
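
A minimal sketch of computing *r* and its *p*-value, assuming the Pearson correlation is the intended coefficient and that `scipy` is available in the environment. The arrays `x` and `y` are random placeholders, not real measurements.

```python
import numpy as np
from scipy import stats

# Placeholder data: y depends linearly on x plus random noise.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, size=80)
y = 2.0 * x + rng.normal(0.0, 3.0, size=80)

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis of no linear correlation.
r, p_value = stats.pearsonr(x, y)
print(f"r = {r:.2f}, p = {p_value:.2e}")
```

Note that this *p*-value assumes independent samples, which is often not strictly true for time series. For monotonic but non-linear relationships, `stats.spearmanr` is a rank-based alternative.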



In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#fc7202'> Task 2: Linear regression analysis (2 points) </font> 
From the datasets used in **Exercise 2, Task 1**, select **one dataset** containing two variables that exhibit a clear (and scientifically meaningful) correlation.

- Fit a **linear regression model** to the data.
- Plot the observed data together with the **fitted regression line**.
- Report the **model parameters** (slope and intercept) and the **coefficient of determination ($R^2$)**.

In your discussion:
- Justify your choice of **independent (predictor, $X$)** and **dependent (response, $Y$)** variable
- Interpret the **slope** and **intercept** in the context of the selected variables
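
The quantities requested above can be obtained in several ways. The following sketch uses `np.polyfit` and computes $R^2$ from the residual and total sums of squares; the data are random placeholders with a known slope and intercept.

```python
import numpy as np

# Placeholder data with true intercept 1.5 and true slope 2.0.
rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 50)
y = 1.5 + 2.0 * x + rng.normal(0.0, 1.0, size=50)

# Degree-1 polynomial fit; np.polyfit returns the highest-order coefficient first.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r_squared:.3f}")
```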


In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#fc7202'>Task 3: Model interpretation, hypothesis testing, and applicability (2 points)</font>

Using the linear regression model from **Exercise 2, Task 2**, explain how the fitted relationship can be used to make predictions.

In particular:
- Explain what it means to use the model to **interpolate** values at grid points **within the range of the measured predictor variable**.
- Explain what it means to use the model to **extrapolate** values **outside the range of the measured predictor variable**.

Then:
- Perform a **hypothesis test for the slope** and assess whether it is significantly different from zero.
- Evaluate the **goodness of fit** of the model.

Finally, discuss whether the fitted model is more appropriate for interpolation or extrapolation, and explain why.
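
As an illustration of these steps, the sketch below uses `scipy.stats.linregress`, whose `pvalue` field refers to the null hypothesis that the slope is zero. The data are random placeholders, and the 0.05 significance level is an assumed convention rather than a course requirement.

```python
import numpy as np
from scipy import stats

# Placeholder data; the predictor was "measured" between 0 and 10.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 40)
y = 0.5 + 1.2 * x + rng.normal(0.0, 1.0, size=40)

fit = stats.linregress(x, y)

# Two-sided test of H0: slope = 0 (assumed significance level: 0.05).
slope_is_significant = fit.pvalue < 0.05

# Predictions at new predictor values, flagged by whether they fall
# inside the measured range (interpolation) or outside it (extrapolation).
new_x = np.array([2.5, 7.0, 15.0])
predictions = fit.intercept + fit.slope * new_x
is_interpolation = (new_x >= x.min()) & (new_x <= x.max())

for value, pred, inside in zip(new_x, predictions, is_interpolation):
    kind = "interpolation" if inside else "extrapolation"
    print(f"x = {value:5.1f} -> predicted y = {pred:6.2f} ({kind})")
```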


In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

## <font color='#fc7202'> Exercise 3: Case study: primary production across ecosystems (6 points) </font> 

Human activities have led to a strong increase in atmospheric **greenhouse gas concentrations**, in particular carbon dioxide (CO₂), which is the main driver of **global warming**. Emissions from fossil fuel combustion, cement production, and land-use change continue to add large amounts of CO₂ to the atmosphere, altering the Earth’s energy balance and climate system.

At the same time, not all emitted CO₂ remains in the atmosphere. A substantial fraction is taken up by **natural carbon sinks**, primarily the **oceans** and **terrestrial ecosystems**. These sinks slow the rate of climate change by removing CO₂ from the atmosphere, even though they cannot fully compensate for human emissions.

**Terrestrial ecosystems play a particularly important role** in this context. Through photosynthesis, vegetation absorbs CO₂ from the atmosphere and converts it into organic carbon. Globally, land ecosystems are estimated to absorb roughly **one quarter of anthropogenic CO₂ emissions**, thereby significantly reducing the rate of atmospheric CO₂ increase and associated warming.

Understanding how efficiently ecosystems take up carbon, and how this uptake depends on environmental conditions such as light availability, is therefore central to climate science.

In this exercise, you will study ecosystem carbon uptake using observational data from [**FLUXNET**](https://fluxnet.org/), a global network of ecosystem monitoring sites. FLUXNET sites use the [**eddy covariance**](https://link.springer.com/protocol/10.1007/978-1-0716-3790-6_12) technique to measure exchanges of CO₂, water vapor, and energy between ecosystems and the atmosphere. These measurements allow ecosystem-scale estimates of **gross primary production (GPP)**, which represents the total amount of carbon dioxide "fixed" by plants through photosynthesis per unit area and time.

You will analyze monthly data from **12 FLUXNET sites in Europe**, representing **four major ecosystem types** (**three sites per ecosystem type**):
- **DBF**: Deciduous Broadleaved Forest  
- **ENF**: Evergreen Needleleaved Forest  
- **GRA**: Grassland  
- **CRO**: Croplands  

These ecosystem types differ in vegetation structure, seasonality, and photosynthetic behavior, providing a useful basis for comparing radiation absorption and ecosystem productivity.

The data for this exercise are provided in three separate files with **monthly mean values** for the same 12 FLUXNET sites and time periods, allowing direct comparison and combination of variables across files.

- **`gpp_sites_monthly.tsv`**  
  Mean monthly **gross primary production (GPP)** "observed" at each site.  
  This represents the total carbon uptake by photosynthesis at the ecosystem scale.  
  Unit: gC m⁻² d⁻¹ (grams of carbon fixed per square meter of land surface per day)

- **`par_sites_monthly.tsv`**  
  Mean monthly **photosynthetically active radiation (PAR)** measured at each site.  
  This represents the incoming solar radiation available for photosynthesis.  
  Unit: MJ m⁻² d⁻¹ (megajoules of energy per square meter per day)

- **`fapar_sites_monthly.tsv`**  
  Mean monthly **fraction of absorbed photosynthetically active radiation (fAPAR)** at each site.  
  This variable describes the fraction of incoming PAR that is absorbed by the vegetation canopy.  
  Unit: unitless
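
A tab-separated file can be read with `pd.read_csv` by setting `sep="\t"`. The sketch below reads from an in-memory string so that it runs on its own. The layout shown (a `date` column followed by one column per site, with site names `SITE_A` and `SITE_B`) is purely hypothetical, and the actual files may be organized differently.

```python
import io
import pandas as pd

# Hypothetical file content: the real column names and layout may differ.
example_tsv = (
    "date\tSITE_A\tSITE_B\n"
    "2010-01-01\t0.4\t0.9\n"
    "2010-02-01\t0.7\t1.3\n"
    "2010-03-01\t1.8\t2.6\n"
)

# sep="\t" tells pandas that columns are tab-separated;
# parse_dates and index_col turn the date column into a DatetimeIndex.
gpp = pd.read_csv(io.StringIO(example_tsv), sep="\t", parse_dates=["date"], index_col="date")
print(gpp.dtypes)
```

For a real file, `io.StringIO(example_tsv)` would be replaced by the file name (for example `"gpp_sites_monthly.tsv"`), and a sensible first step is to inspect `gpp.head()` and `gpp.columns` before assuming any structure.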



### <font color='#fc7202'>Task 1: Time series visualization and ecosystem comparison (2 points)</font>

- Load and visualize the **FLUXNET time series** of **GPP**, **PAR**, and **fAPAR** for the 12 sites.

- For **each variable** (`GPP`, `PAR`, and `fAPAR`), create time series plots in which all sites belonging to the same ecosystem type are shown together in a single figure.\
  This should result in **four plots per variable**, one for each ecosystem type (`DBF`, `ENF`, `GRA`, `CRO`).

- Analyze the results and discuss:
  - The **temporal variability** of each variable (for example seasonal patterns).
  - Similarities and differences between sites within the **same ecosystem type**, including how the variables co-vary over time and whether apparent relationships can be identified.
  - Differences in behavior **across ecosystem types**, and how ecosystem characteristics and seasonality may explain the observed patterns.

Ensure that all figures include clear axis labels, appropriate units, and legends.


In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#fc7202'>Task 2: APAR and its relationship to GPP (2 points)</font>

Multiple approaches to estimating GPP at regional to global scales have been developed over the past decades. One widely used approach is the **light-use efficiency (LUE) model**, which estimates GPP from absorbed radiation and an efficiency term describing how effectively ecosystems convert absorbed light into carbon uptake.

A common LUE formulation is:

$$
\mathrm{GPP} = \mathrm{LUE} \cdot \mathrm{APAR}
$$

The product of PAR and fAPAR is called **absorbed photosynthetically active radiation (APAR)**, the amount of PAR absorbed by the canopy:

$$
\mathrm{APAR} = \mathrm{PAR} \cdot \mathrm{fAPAR}
$$

In this framework:
- **fAPAR** is mainly controlled by canopy properties (for example **leaf area index, LAI**), which determine how much incoming radiation is intercepted and absorbed.
- **LUE** summarizes how efficiently absorbed radiation is converted into carbon uptake and can vary with environmental conditions such as temperature, water availability, and nutrient status.
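
The APAR definition above translates into an element-wise product. The sketch below assumes that PAR and fAPAR are held in two DataFrames with the same dates as index and the same sites as columns; the site names and numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical tables: rows are months, columns are sites.
dates = pd.to_datetime(["2010-01-01", "2010-07-01"])
par = pd.DataFrame({"SITE_A": [2.0, 8.0], "SITE_B": [1.0, 10.0]}, index=dates)      # MJ m^-2 d^-1
fapar = pd.DataFrame({"SITE_A": [0.5, 0.75], "SITE_B": [0.2, 0.9]}, index=dates)    # unitless

# pandas multiplies element by element, matching rows and columns by label.
# Since fAPAR is unitless, APAR keeps the unit of PAR.
apar = par * fapar
print(apar)
```

Because pandas aligns on labels, any date or site present in only one of the two tables would produce `NaN` in the result, which is a quick way to spot mismatches.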

**Your tasks:**
- Calculate **APAR** for each site and time step using the provided PAR and fAPAR data.
- Explore the relationship between **APAR** and **GPP** using appropriate visualizations.
- Based on the equations above, what type of **functional relationship** between APAR and GPP would you expect? Explain your reasoning.
- How many **model parameters** are needed to describe your proposed relationship between APAR and GPP? Clearly define what the parameter(s) represent.


In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#fc7202'>Task 3: Preparing data for model calibration and validation (2 points)</font>

In **Exercise 4**, the dataset will be used to develop a **statistical model** that estimates **GPP as a function of APAR**. This requires dividing the available data into two independent subsets: one for **model calibration (training/fitting)** and one for **model validation (testing)**.

- Propose a strategy for splitting the dataset into calibration and validation subsets.
- Motivate the chosen strategy and explain which aspects were important for the decision.
- Discuss the assumptions made about the dataset when applying this split.
- Identify potential weaknesses or limitations of the chosen approach and how they could influence model evaluation.
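
Whatever strategy is chosen, the split itself usually comes down to a boolean mask. The sketch below shows only the mechanics for one of several possible options (holding out entire sites) and is not meant as a recommended strategy. The table layout, site names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format table: one row per site and month.
data = pd.DataFrame({
    "site": ["S1", "S1", "S2", "S2", "S3", "S3", "S4", "S4"],
    "ecosystem": ["DBF", "DBF", "DBF", "DBF", "GRA", "GRA", "GRA", "GRA"],
    "apar": [1.0, 6.0, 1.5, 7.0, 0.8, 5.0, 1.2, 5.5],
    "gpp": [0.9, 7.1, 1.4, 8.0, 0.5, 4.2, 0.9, 4.8],
})

# Illustrative choice: one site per ecosystem type is held out for validation.
validation_sites = {"S2", "S4"}
is_validation = data["site"].isin(validation_sites)

calibration = data[~is_validation]
validation = data[is_validation]
print(len(calibration), "calibration rows,", len(validation), "validation rows")
```

A split by time period or a random split of rows can be expressed the same way by changing how `is_validation` is built.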


<font color='#1b7173'>*Your answer here!*</font>

## <font color='#fc7202'>Exercise 4: Case study: statistical modeling of ecosystem photosynthesis (6 points)</font>

In this exercise, you will develop, evaluate, and interpret a **simple statistical model** that estimates **gross primary production (GPP)** as a function of **absorbed photosynthetically active radiation (APAR)**.\
The calibration and validation datasets should follow the data split defined in **Exercise 3, Task 3**.

### <font color='#fc7202'>Task 1: Model calibration using linear regression (3 points)</font>

Using the **calibration dataset**, perform a **linear regression analysis** of the form:
$$
\mathrm{GPP} = f(\mathrm{APAR})
$$

- Fit a **linear regression model**.
- Visualize the relationship between **APAR** and **GPP**, including the **fitted regression line**, and interpret the results for all ecosystems.
- Analyze the **residuals** of the regression as a function of the predictor variable.
- Discuss whether the model appears to be **unbiased**, and justify your conclusion based on the residual analysis.
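
A simple way to look for systematic structure in residuals is to average them within bins of the predictor. In the sketch below, the placeholder data are deliberately generated from a curved (saturating) relationship so that the pattern is visible. This is an artificial example and says nothing about how the real data behave.

```python
import numpy as np

# Placeholder data from a deliberately curved relationship plus noise.
rng = np.random.default_rng(1)
apar = rng.uniform(0.0, 10.0, size=120)
gpp = 8.0 * (1.0 - np.exp(-0.3 * apar)) + rng.normal(0.0, 0.5, size=120)

# Straight-line fit and residuals (observed minus fitted).
slope, intercept = np.polyfit(apar, gpp, deg=1)
residuals = gpp - (slope * apar + intercept)

# Mean residual in the lower, middle, and upper third of the predictor range.
order = np.argsort(apar)
bin_means = [chunk.mean() for chunk in np.array_split(residuals[order], 3)]
print("overall mean residual:", round(float(residuals.mean()), 6))
print("binned mean residuals:", [round(float(m), 2) for m in bin_means])
```

With an intercept in the model, the overall mean residual of a least-squares fit is zero by construction, so bias has to be judged from how the residuals vary along the predictor, not from their overall mean.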
 

In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#fc7202'>Task 2: Model validation and evaluation (3 points)</font>

Apply the calibrated model to the **validation dataset**.

- Choose appropriate visualizations to assess model performance.
- Interpret the validation results both **mathematically** (for example using goodness-of-fit measures) and **scientifically** (in terms of ecosystem processes).
- Discuss whether the model performs similarly across ecosystem types and whether it can be considered **valid for each ecosystem**.
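
Common goodness-of-fit measures can be computed directly from observed and predicted values. The sketch below assumes a table with hypothetical columns `gpp_obs` and `gpp_pred` and made-up numbers; the choice of RMSE, mean bias, and $R^2$ is one reasonable option among several.

```python
import numpy as np
import pandas as pd

# Hypothetical validation table with observed and model-predicted GPP.
validation = pd.DataFrame({
    "ecosystem": ["DBF", "DBF", "GRA", "GRA", "CRO", "CRO"],
    "gpp_obs": [2.0, 6.0, 1.0, 4.0, 3.0, 8.0],
    "gpp_pred": [2.5, 5.0, 1.5, 4.5, 2.0, 6.0],
})

obs = validation["gpp_obs"]
error = validation["gpp_pred"] - obs

rmse = float(np.sqrt((error ** 2).mean()))   # typical size of the error
bias = float(error.mean())                   # systematic over- or underestimation
r_squared = float(1.0 - (error ** 2).sum() / ((obs - obs.mean()) ** 2).sum())

# The same RMSE, computed separately for each ecosystem type.
rmse_by_type = (error ** 2).groupby(validation["ecosystem"]).mean() ** 0.5
print(f"RMSE = {rmse:.2f}, bias = {bias:.2f}, R^2 = {r_squared:.2f}")
print(rmse_by_type)
```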

Use additional visualizations if needed to support your discussion.

In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

### <font color='#1916b4'>Bonus task: Spatial application of the model (3 extra points)</font>

Using the provided **netCDF** file `radiation.nc`, produce an **annual map of APAR over Europe**.

- Apply the statistical model developed above to generate an **annual European GPP map**.
- Qualitatively discuss possible sources of **model insufficiencies** and **uncertainties in the input data**.
- As a hint, you may find it useful to inspect the provided **netCDF** file `landcover.nc` when interpreting the results.
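
The gridded calculation can be sketched with plain arrays. The example below assumes monthly PAR and fAPAR grids of shape `(month, lat, lon)`, takes "annual" to mean the unweighted average of the 12 monthly values, and uses made-up regression coefficients. The actual variable names and dimensions inside `radiation.nc` are not stated here and need to be inspected first.

```python
import numpy as np

# Placeholder monthly grids with shape (month, lat, lon).
rng = np.random.default_rng(5)
par = rng.uniform(1.0, 12.0, size=(12, 4, 6))
fapar = rng.uniform(0.1, 0.9, size=(12, 4, 6))

# Hypothetical coefficients standing in for the calibrated model.
slope, intercept = 0.8, 0.3

apar = par * fapar
apar_annual = apar.mean(axis=0)                 # annual mean APAR map, shape (lat, lon)
gpp_annual = intercept + slope * apar_annual    # model applied to the annual map

# Because the model is linear, applying it month by month and then
# averaging gives the same map.
gpp_monthly_then_mean = (intercept + slope * apar).mean(axis=0)
print(gpp_annual.shape)
```

One common way to read netCDF files in Python is `xarray.open_dataset("radiation.nc")`, provided `xarray` is installed in the environment.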


In [None]:
# YOUR CODE HERE!

<font color='#1b7173'>*Your answer here!*</font>

## Before you submit

**Don’t forget:**

Please restart the kernel and run all cells from top to bottom to ensure your notebook works correctly.

1. In Jupyter Notebook or JupyterLab:
   Go to the menu bar and select:
   - `Kernel → Restart & Run All` (in JupyterLab, this entry is named `Restart Kernel and Run All Cells…`)
2. In Visual Studio Code (VS Code):
   - Click the `Restart Kernel` button `(↻)` on the toolbar.
   - Then click `Run All` `(▶▶)` to execute all cells in order.

Make sure the notebook runs **end-to-end without errors**.\
If a cell still produces an error that you can’t resolve, simply **comment out that section** so the remaining cells can execute without interruption.

**Time spent**

Please record roughly how long you worked on this assignment:



<font color='#1b7173'>*Total time spent: XXXXX h*</font>

**Comments (optional feedback)**

Please leave any comments about the homework here. For example:
- Was it too hard or too easy for you?
- What would you suggest adding or removing?
- Anything else you would like to tell us?

<font color='#1b7173'>*Your feedback*</font>

Excellent work making it all the way through!