# Homework 2: Hypothesis Testing

**File name:**  
> Please save and submit your work as  
> **`Homework2_py_Firstname_Lastname.ipynb`**\
> Please submit only the `.ipynb` file, unless otherwise stated.\
> The examiner will place your file into the appropriate environment where all required data files are available in the same directory.
---

**Before you start**
- Record **how much time** the homework takes you in total.  
  At the **end of the notebook** there is a cell where you can write the number of hours (for course development feedback).  

---

**General instructions**
1. **Work individually.**  
   You may **discuss ideas** with classmates, but **do not copy–paste** each other’s code.  
   Each notebook must represent your own independent work.

2. **Use of external materials and AI tools.**  
   You may use course materials, documentation, examples, or quick help from sources such as Stack Overflow or ChatGPT.  
   However, if you **heavily rely** on external material or AI-generated code (for example, by copying significant parts),  
   please **cite the source**. There is **no need to cite** small pieces of help used only to understand or debug your code.

3. **Do not delete notebook cells.**  
   Please do **not remove any pre-existing cells**. Only add your own solutions in the designated places.  
   If you accidentally delete a cell, use **`Edit → Undo Delete Cells`** to restore it.  
   You may add extra cells if needed, but make sure every solution is placed **under the correct question or sub-question**.  
   This structure helps us to evaluate your work efficiently and accurately.
---

**Packages and environment**

All required packages are already imported for you in the **first code cell**.  
If running that cell gives an error, remove the leading `#` from the corresponding installation line and run it again, e.g.:

```python
#!pip install pandas matplotlib scipy
```
You may use additional packages if needed, but:
- Add their installation and import statements in a new code cell placed below the import cell.
- This is the only additional cell you may modify (besides the solution and time-reporting cells).

>*Plotting note*: The setup includes multiple plotting options (`pandas .plot`, `Matplotlib`, `seaborn`, and `Plotnine`). \
> Use one approach of your choice; you may remove or comment out imports you don’t need to keep the notebook tidy.
---

**Before submission**

Before submitting your notebook, please **restart the kernel and run all cells from the beginning** to ensure the entire notebook executes without errors.
This step guarantees that all results, figures, and outputs appear correctly when the notebook is evaluated.
If a particular code cell causes an error that you cannot fix, comment out that part so that the notebook still runs fully.

### Installing and Loading Packages

In [1]:
#!pip install palmerpenguins
#!pip install plotnine
#!pip install outlier_utils                    # provides the `outliers` module imported below

In [42]:
import numpy as np                             # numerical computing: arrays, math functions, statistics
import pandas as pd                            # data manipulation and analysis (tables, DataFrames)

import matplotlib.pyplot as plt                # base plotting library for Python (low-level control)
import seaborn as sns                          # high-level plotting library (statistical graphics, built on Matplotlib)
#import plotnine as p9                          # plotting library with syntax similar to ggplot2

from scipy import stats                        # statistical functions
from scipy.stats import ttest_1samp, ttest_ind, ttest_rel, f, shapiro, levene, wilcoxon, mannwhitneyu, f_oneway, kruskal, tukey_hsd
from outliers import smirnov_grubbs as grubbs
import statsmodels.api as sm
from statsmodels.formula.api import ols

#### <font color='#fc7202'> Q1 (3 p): Lemming Grazing and Methane Fluxes in Arctic Tundra
Since vegetation grazing by **brown lemmings** can alter plant biomass and soil conditions, it may also affect **methane fluxes (CH₄_Flux)** in Arctic tundra.  
A long-term **exclosure vs. control** experiment in Arctic Alaska measured greenhouse-gas fluxes, soil temperature, and vegetation across **Dry / Moist (Mst) / Wet** land-cover types.

You will analyze a subset stored in `hw2_data1.tsv`.

| Variable | Description | Unit |
|---|---|---|
| `Plot` | Plot ID | – |
| `LandCoverType` | *Dry*, *Mst* (= Moist), *Wet* | – |
| `Lemmings` | *control* = grazing allowed; *exclusion* = no grazing | – |
| `CH4_Flux` | Methane flux (positive = emission, negative = uptake) | mg C m⁻² day⁻¹ |
| `C-CO2eq` | CO₂ equivalents | g C m⁻² day⁻¹ |
| `GEE` | Gross ecosystem CO₂ exchange | g C m⁻² day⁻¹ |
| `NEE` | Net ecosystem CO₂ exchange (negative = CO₂ uptake) | g C m⁻² day⁻¹ |
| `SoilTemp_1` | Soil temperature at 1 cm depth | °C |
| `SoilTemp_10` | Soil temperature at 10 cm depth | °C |
| `NDVI` | Vegetation greenness | – |
| `Albedo` | Short-wave surface reflectance | – |

  
Do methane fluxes (`CH4_Flux`) differ between plots open to lemming grazing (*control*) and plots where lemmings were excluded (*exclusion*)?

1. **Load the dataset**  

2. **Explore visually and describe with descriptive statistics**  
   Plot `CH4_Flux` by treatment.  
   In 2-3 sentences, describe center, spread, shape, and any potential outliers.  

3. **State hypotheses**  
   Choose your “typical” metric (**mean** or **median**) and justify it from the data features you observed.  
   Use *α* = 0.05 unless you justify otherwise.

4. **Pick the statistical test suitable for your hypothesis and check its assumptions**  
   Explain **why** your chosen test matches your estimand and diagnostics.

5. **Run the analysis and conclude**  
   Report the test statistic and *p*-value.  

   In 2-3 sentences, answer the research question:  
   Did lemming grazing **significantly alter methane fluxes** at the 95% confidence level?  
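
The workflow in steps 1-5 can be sketched roughly as follows. This is only an illustration under stated assumptions: the table is a small synthetic stand-in with made-up values (not `hw2_data1.tsv`), and the sketch runs two candidate tests side by side rather than choosing one, because the appropriate choice depends on what your own diagnostics show.

```python
import io

import pandas as pd
from scipy import stats

# Synthetic stand-in for a tab-separated file; the values are made up
# and are NOT taken from hw2_data1.tsv.
tsv_text = (
    "Lemmings\tCH4_Flux\n"
    "control\t1.2\ncontrol\t0.8\ncontrol\t2.5\ncontrol\t1.9\ncontrol\t9.4\n"
    "exclusion\t0.4\nexclusion\t0.9\nexclusion\t0.3\nexclusion\t1.1\nexclusion\t0.6\n"
)

# For a real file, pass its path instead of the in-memory buffer.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Per-group descriptive statistics (count, mean, std, quartiles, min/max).
summary = df.groupby("Lemmings")["CH4_Flux"].describe()
print(summary)

control = df.loc[df["Lemmings"] == "control", "CH4_Flux"]
exclusion = df.loc[df["Lemmings"] == "exclusion", "CH4_Flux"]

# Shapiro-Wilk normality check per group (a small p-value suggests non-normality).
shapiro_p = {
    name: stats.shapiro(values).pvalue
    for name, values in df.groupby("Lemmings")["CH4_Flux"]
}

# Two candidate tests; which one fits depends on the diagnostics above.
welch = stats.ttest_ind(control, exclusion, equal_var=False)  # compares means
mwu = stats.mannwhitneyu(control, exclusion, alternative="two-sided")  # rank-based

print(shapiro_p)
print(welch.statistic, welch.pvalue)
print(mwu.statistic, mwu.pvalue)
```

A boxplot per group (for example `df.boxplot(column="CH4_Flux", by="Lemmings")`) is one simple way to show center, spread, and potential outliers together.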

<font color='#00bf63'>*Your hypotheses here!*</font>

In [4]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q2 (5 p): Effect of Drought on Finch Beak Depth
Suppose you want to study how birds adapt to a severe environmental change.\
In this case, you will examine the effect of a 6-year drought with very little rainfall on the beak depth of finches living on a small Pacific island.\
Before the drought, the average beak depth of finches on this island was 9.2 mm.

After the drought, a new sample of finches was measured.
Using these measurements, investigate whether the mean beak depth has changed compared to the pre-drought average of 9.2 mm.

Use the dataset provided as a Python list in the cell below to perform your analysis.
1. **Describe and visualize.**  
   Compute descriptive statistics and make an informative plot(s).
   In 2-3 sentences, describe center, spread, shape, and any potential outliers.

2. **Formulate hypotheses and pick α.**  
   State H₀ and H₁ relative to the pre-drought reference value of 9.2 mm and choose a significance level (default α = 0.05 unless you justify otherwise).

3. **Choose the appropriate statistical test.**  
   Decide whether you are testing a difference in **mean** or **median** and select a matching one-sample test. Briefly justify your choice.

4. **If needed, clean the data (transparently).**  
   Identify any questionable values and state whether you keep them, transform the data, or apply a robust method; in each case, justify briefly.

5. **Check assumptions for your chosen test.**  
   Provide minimal diagnostics (e.g., distributional shape/normality check) and comment on whether assumptions appear reasonable.

6. **Perform the test and report clearly.**  
   Report the test used, the test statistic, and the *p*-value.  
   Conclude in 2-3 sentences: did beak depth change relative to **9.2 mm** at your chosen α?

</font>
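
For orientation, a one-sample comparison against a fixed reference value might look like the sketch below. The sample and the reference value are hypothetical (made up for illustration, not the finch data), and the IQR rule is just one common screening heuristic, not a required method.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements and reference value (made up for illustration).
reference = 10.0
sample = np.array([9.1, 10.4, 11.2, 10.8, 9.7, 12.3, 10.6, 11.5, 10.1, 11.9])

# Screen for candidate outliers with the 1.5 * IQR rule.
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
flagged = sample[(sample < q1 - 1.5 * iqr) | (sample > q3 + 1.5 * iqr)]

# Normality diagnostic for the mean-based test.
shapiro_res = stats.shapiro(sample)

# Mean-based option: one-sample t-test against the reference.
t_res = stats.ttest_1samp(sample, popmean=reference)

# Rank-based option: Wilcoxon signed-rank test on the differences from the
# reference (it assumes the differences are symmetric around their median).
w_res = stats.wilcoxon(sample - reference)

print("flagged values:", flagged)
print("Shapiro-Wilk p:", shapiro_res.pvalue)
print("t-test:", t_res.statistic, t_res.pvalue)
print("Wilcoxon:", w_res.statistic, w_res.pvalue)
```

Whichever test you pick, state the hypotheses in terms of the same quantity (mean or median) that the test actually evaluates.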

In [21]:
beak_depth_mm = [6.1, 6.8, 10.3, 10.8, 7.5, 9.8, 9.6, 7.2, 7.8, 9.9, 7.6, 9.7, 11.4, 8.3, 11.3, 10.8, 10.3, 14.5, 10.6, 10.8, 10.9, 11.1, 11.3, 11.2, 11.4, 11.2, 11.3, 11.1]

<font color='#00bf63'>*Your hypotheses here!*</font>

In [5]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q3 (5 p): Regional Differences in Hedgehog Hibernation Duration

Researchers radio-tagged *Erinaceus europaeus* in two U.S. regions (**north** and **south**) and tracked winter movements to measure **hibernation duration** (number of days not leaving the nest). Random samples were taken independently in each region.

Do hedgehogs from the **northern** and **southern** regions differ in **hibernation duration** (days confined to the nest)?

Use the DataFrame provided in the next cell: `nest_days`

1. **Describe & visualize.**  
   Summarize per region and plot the distributions. In 2-3 sentences, describe center, spread, shape, and any potential outliers.

2. **Formulate hypotheses & choose *α*.**  
   Define H₀ and H₁, and state your significance level (α = 0.05 unless you justify otherwise). 

3. **Select a test & check assumptions.**  
    Briefly justify your choice and comment on key assumptions.

4. **Analyze & report.**  
   Report the test statistic and *p*-value.

5. **Conclude.**  
    In 2-3 sentences, answer whether hibernation duration differs between the **north** and **south** regions at the 95% confidence level.  

</font>
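
When two independent samples of unequal size are stored side by side in one DataFrame, the shorter column is padded with `NaN`, and those entries need to be dropped before testing. The sketch below illustrates this with a hypothetical table (column names and values are made up). Letting Levene's test decide between the pooled and Welch variants is one common convention, shown here as an assumption rather than the required approach.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical wide table: the shorter group is padded with NaN.
wide = pd.DataFrame({
    "group_a": [12.1, 13.4, 11.8, 14.2, 12.9, 13.7],
    "group_b": [10.2, 11.5, 10.9, 11.1, np.nan, np.nan],
})

# Drop the padding separately per column so no real observation is lost.
# (With scipy's default nan_policy="propagate", NaN inputs give a NaN result.)
a = wide["group_a"].dropna()
b = wide["group_b"].dropna()

# Per-group summary statistics.
print(wide.describe())

# Levene's test for equal variances (scipy's default centers on the median).
lev = stats.levene(a, b)

# Pooled t-test if the variances look similar, otherwise Welch's t-test.
equal_var = lev.pvalue > 0.05
t_res = stats.ttest_ind(a, b, equal_var=equal_var)

print("Levene p:", lev.pvalue)
print("t-test:", t_res.statistic, t_res.pvalue)
```

If normality looks doubtful, `stats.mannwhitneyu(a, b, alternative="two-sided")` is a rank-based alternative for the same two-sample design.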

In [23]:
nest_days = pd.DataFrame({'north_region': [95, 98, 99, 112, 103, 105, 106, 109, 110, 111, 112, 114, 95, 98, 103],
                          'south_region': [90, 92, 94, 93, 95, 98, 100, 102, 103, 104, 98, 102, 99, np.nan, np.nan]})

<font color='#00bf63'>*Your hypotheses here!*</font>

In [6]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q4 (8 p): PFOS in Human Blood

Perfluorooctane sulfonate (PFOS) is a persistent, bioaccumulative fluorinated compound formerly used in numerous industrial applications (e.g., Teflon-related manufacturing). PFOS is detectable in human blood worldwide, including Sweden.  
You will analyze PFOS concentrations (ng/mL) measured in **2019** from **56 residents** of the Docentbacken neighborhood in Stockholm.

Use the data provided in the file: `hw2_data2.tsv`

1. **Describe & visualize.**  
   Summarize PFOS (n, mean, SD, median, IQR, min-max) and visualize the data.  
   In 2-3 sentences, describe center, spread, shape, and any potential outliers.

2. **Check distribution and screen for outliers.**  
   Test whether the PFOS data are approximately normally distributed using the **Shapiro-Wilk** and/or **Kolmogorov-Smirnov** tests.  
   If the data deviate strongly from normality, consider possible approaches such as **data transformation** or using a **non-parametric** test.  
   Run **Grubbs’ test** once to check for a potential outlier (if you find one, do not remove it). Report the statistic and *p*-value, and briefly comment on what you find.

3. **Formulate hypotheses & choose *α*.**  
   Compare Docentbacken levels to **4.5 ng/mL** (national reference).  
   State H₀ and H₁ and your significance level (*α* = 0.05 unless you justify otherwise). 

4. **Analyze & report.**  
   Based on your previous results, choose an appropriate statistical test to compare the PFOS concentrations at Docentbacken with the national reference value of 4.5 ng/mL (representing the geometric mean for Sweden).  
   Justify your choice depending on whether the data appear normally distributed or not.  
   Report the test statistic and *p*-value.

5. **Conclude.**  
   In 2-3 sentences, state whether PFOS levels at Docentbacken match **4.5 ng/mL** at the 95% confidence level.  

</font>
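
Two ideas from the steps above can be sketched in code. First, the Grubbs statistic can be computed by hand, which shows what a packaged Grubbs test evaluates. Second, because a geometric mean equals the exponential of the mean of the log-values, a comparison with a geometric-mean reference can be run as a one-sample t-test on the log scale. The data, the reference value, and the helper `grubbs_statistic` are all hypothetical, and the helper compares G with a critical value instead of returning a *p*-value.

```python
import numpy as np
from scipy import stats


def grubbs_statistic(x, alpha=0.05):
    """Return the two-sided Grubbs statistic G and its critical value."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # G = largest absolute deviation from the mean, in sample-SD units.
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value from the t distribution with n - 2 degrees of freedom.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit


# Hypothetical right-skewed concentrations (made up for illustration).
conc = np.array([2.1, 2.8, 3.0, 3.4, 3.9, 4.2, 4.8, 5.5, 6.3, 7.9, 9.6, 21.0])
reference_gm = 4.0  # hypothetical geometric-mean reference

g, g_crit = grubbs_statistic(conc)
print("Grubbs G:", g, "critical value:", g_crit)

# Normality on the raw and on the log scale.
shapiro_raw = stats.shapiro(conc)
shapiro_log = stats.shapiro(np.log(conc))
print("Shapiro p (raw):", shapiro_raw.pvalue, "(log):", shapiro_log.pvalue)

# Geometric mean and a log-scale t-test against log(reference).
geo_mean = np.exp(np.log(conc).mean())
t_log = stats.ttest_1samp(np.log(conc), popmean=np.log(reference_gm))
print("geometric mean:", geo_mean)
print("log-scale t-test:", t_log.statistic, t_log.pvalue)
```

If the log-transformed data still look non-normal, a rank-based one-sample test such as `stats.wilcoxon(conc - reference_gm)` is an alternative, with the caveat that it addresses the median rather than the geometric mean.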

<font color='#00bf63'>*Your hypotheses here!*</font>

In [None]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q5 (10 p): Effect of Urbanization on Insect Species Richness 

Biodiversity describes the variety and abundance of living organisms, including plants, fungi, and animals, within an ecosystem.
Human activities such as habitat loss, climate change, and pollution can lead to biodiversity loss by reducing species abundance or causing local extinctions.

Urban environments can also affect biodiversity. Some insect species struggle to persist in cities due to reduced native vegetation and limited water availability, while others may thrive in residential areas that offer green spaces.
You will now explore whether insect **species richness** (the number of different insect species) differs among urban environments with varying degrees of development in the Los Angeles region.

Use the dataset provided in the file: `hw2_data3.tsv`

| Variable                                                 | Description                                                | Unit                 |
| -------------------------------------------------------- | ---------------------------------------------------------- | -------------------- |
| `Sample`                                                 | Sampling site ID                                           | –                    |
| `Site`                                                   | Sampling location name                                     | –                    |
| `Trap Days`                                              | Number of days traps were active                           | days                 |
| `Month`                                                  | Month of sampling                                          | –                    |
| `Season`                                                 | Sampling season                                            | –                    |
| `Urban Type`                                             | Numeric code for level of urbanization                     | –                    |
| `Temp Max`, `Temp Min`, `Temp Mean`                      | Maximum, minimum, and mean air temperature during sampling | °C                   |
| `Rh Max`, `Rh Min`, `Rh Mean`                            | Maximum, minimum, and mean relative humidity               | %                    |
| `Photoperiod`                                            | Average daily daylight hours                               | hours                |
| `Lawn`, `Compost`, `Mulch`, `Drought`, `Native`, `Water` | Presence/absence of local habitat features                 | categorical (Yes/No) |
| `Richness`                                               | Insect species richness (number of species recorded)       | count                |

A separate key (provided below) links numeric `Urban Type` codes to the descriptive categories (`Dense`, `Developed`, `Suburban`, `Natural`).
You will need to use this information to correctly identify which observations belong to each category before analyzing the data.


1. **Load the dataset.**\
   Review all variables and use the provided key to classify sampling sites into the correct urban environment categories.

2. **Explore visually and describe with descriptive statistics.**  
   Explore species richness across the different urban environments. In 2-3 sentences, describe the main patterns in the data.

3. **State hypotheses.**  
   State your null and alternative hypotheses for whether insect species richness differs among urban environments.
   Choose an appropriate significance level (*α*).

4. **Pick the appropriate statistical test and check its assumptions.**\
   Choose a suitable test to evaluate whether insect species richness differs among the urban environments.
   Briefly justify your choice and check that the assumptions of your selected test are reasonably met.
   If your test indicates significant differences, perform an appropriate **post hoc comparison** to explore which groups differ from each other.
   Report the main test results, including the test statistic and *p*-value.

5. **Conclude.**  
   Summarize your findings in 2-3 sentences.
   Discuss whether insect species richness appears to vary with the degree of urbanization, and interpret your results in context.
   Use additional columns (e.g., temperature, humidity, vegetation, water availability) to support your interpretation and suggest possible ecological explanations for the observed differences.

</font> 


In [40]:
urban_names = {'1': 'Natural',
               '3': 'Suburban',
               '8': 'Developed',
               '9': 'Dense'}
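
Note that the keys of a dictionary like the one above are strings, while a numeric column read by pandas usually has an integer dtype, so a direct `.map()` would produce only `NaN`. The sketch below shows one way to handle this and then compare several groups. Everything in it is hypothetical: the key `code_names`, the column names, and the values are made up for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical key with string keys, and a table with integer codes.
code_names = {"1": "Low", "2": "Medium", "3": "High"}
df = pd.DataFrame({
    "code": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "count": [14, 17, 15, 18, 11, 12, 10, 13, 6, 8, 7, 5],
})

# Convert the integer codes to strings so they match the dictionary keys.
df["category"] = df["code"].astype(str).map(code_names)
n_unmapped = int(df["category"].isna().sum())  # should be 0 if every code matched

# Collect one array per category (groupby sorts the names alphabetically).
groups = {name: g["count"].to_numpy() for name, g in df.groupby("category")}

# Omnibus tests: one-way ANOVA (means) and Kruskal-Wallis (rank-based).
anova = stats.f_oneway(*groups.values())
kw = stats.kruskal(*groups.values())

# Post hoc pairwise comparisons with Tukey's HSD; rows and columns of the
# p-value matrix follow the order of the groups passed in.
tukey = stats.tukey_hsd(*groups.values())

print("unmapped rows:", n_unmapped)
print("group order:", list(groups))
print("ANOVA:", anova.statistic, anova.pvalue)
print("Kruskal-Wallis:", kw.statistic, kw.pvalue)
print(tukey)
```

Tukey's HSD pairs naturally with ANOVA. After a Kruskal-Wallis test, a rank-based post hoc procedure (for example pairwise Mann-Whitney tests with a multiple-comparison correction) is the more consistent follow-up.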

<font color='#00bf63'>*Your hypotheses here!*</font>

In [None]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q6 (10 p): Interlaboratory Comparison of PAH Measurements

Four laboratories (**Lab 1-4**) analyzed identical PAH-contaminated water samples using five analytical methods (**A-E**).  
The goal was to assess whether results are consistent across laboratories and methods.  

|      |  A   |   B   |   C   |   D   |   E   |
|:----:|:----:|:-----:|:-----:|:-----:|:-----:|
|Lab 1| 585.2| 567.9 | 559.2 | 580.9 | 550.5 |
|Lab 2| 586.3| 541.9 | 580.9 | 598.2 | 559.2 |
|Lab 3| 615.5| 598.2 | 580.9 | 585.2 | 554.9 |
|Lab 4| 606.9| 550.5 | 572.2 | 567.9 | 585.2 |

1. **Load the dataset.**  
   Import `hw2_data4.tsv` and reshape the data into a tidy (long) format with columns: `Lab`, `Method`, and `PAH`.

2. **Explore visually and describe with descriptive statistics.**  
   Visualize **PAH concentrations** by **Lab** and **Method**.  
   In 2-3 sentences, describe the main patterns in the data.

3. **State hypotheses.**  
   Formulate null and alternative hypotheses for whether mean PAH concentrations differ among **labs** and among **methods**.  
   Choose an appropriate significance level (*α*).

4. **Pick the appropriate statistical test and check its assumptions.**  
   Choose a suitable test or model to evaluate whether **Lab**, **Method**, or both influence PAH concentration.  
   Briefly justify your choice and check that the assumptions of your selected test are reasonably met.  
   If significant differences are detected, perform an appropriate **post hoc comparison** to identify which groups differ.  
   Report the test statistic(s) and *p*-value(s).

5. **Conclude.**  
   Summarize your findings in 2-3 sentences.  
   State whether PAH concentrations vary among labs and/or methods at the 95% confidence level.  
   Briefly discuss what your results might imply for inter-laboratory consistency and method standardization in water quality surveys.

</font>  

<font color='#00bf63'>*Your hypotheses here!*</font>

In [None]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

## Before you submit

**Don’t forget:**

Please restart the kernel and run all cells from top to bottom to ensure your notebook works correctly.

1. In Jupyter Notebook or JupyterLab, go to the menu bar and select:
   - `Kernel → Restart & Run All` (labelled `Restart Kernel and Run All Cells…` in JupyterLab)
2. In Visual Studio Code (VS Code):
   - Click the `Restart Kernel` button `(↻)` on the toolbar.
   - Then click `Run All` `(▶▶)` to execute all cells in order.

Make sure the notebook runs **end-to-end without errors**.\
If a cell still produces an error that you can’t resolve, simply **comment out that section** so the remaining cells can execute without interruption.

**Time spent**

Please record roughly how long you worked on this assignment:



<font color='#00bf63'>*Total time spent: XXXXX h*</font>

**Comments (optional feedback)**

Please leave any comments on the homework here. You may use the following questions as a guide:
- Was it too hard/easy for you?
- What would you suggest to add or remove?
- Anything else you would like to tell us?

<font color='#00bf63'>*Your feedback*</font>

Excellent work making it all the way through!