# Homework 1: Exploratory Data Analysis and Statistical Thinking

**File name:**  
> Please save and submit your work as  
> **`Homework1_py_Firstname_Lastname.ipynb`**\
> Please submit only the `.ipynb` file, unless otherwise stated.\
> The examiner will place your file into the appropriate environment where all required data files are available in the same directory.
---

**Before you start**
- Record **how much time** the homework takes you in total.  
  At the **end of the notebook** there is a cell where you can write the number of hours (for course development feedback).  

---

**General instructions**
1. **Work individually.**  
   You may **discuss ideas** with classmates, but **do not copy–paste** each other’s code.  
   Each notebook must represent your own independent work.

2. **Use of external materials and AI tools.**  
   You may use course materials, documentation, examples, or quick help from sources such as Stack Overflow or ChatGPT.  
   However, if you **heavily rely** on external material or AI-generated code (for example, by copying significant parts),  
   please **cite the source**. There is **no need to cite** small pieces of help used only to understand or debug your code.

3. **Do not delete notebook cells.**  
   Please do **not remove any pre-existing cells**. Only add your own solutions in the designated places.  
   If you accidentally delete a cell, use **`Edit → Undo Delete Cells`** to restore it.  
   You may add extra cells if needed, but make sure every solution is placed **under the correct question or sub-question**.  
   This structure helps us to evaluate your work efficiently and accurately.
---

**Packages and environment**

All required packages are already imported for you in the **first code cell**.  
If running that cell gives an error, remove the leading `#` from the corresponding installation line and run it again, e.g.:

```python
#!pip install pandas matplotlib scipy
```
You may use additional packages if needed, but:
- Add their installation and import statements in a new code cell placed below the import cell.
- This is the only additional cell you may modify (besides the solution and time-reporting cells).

>*Plotting note*: The setup includes multiple plotting options (`pandas .plot`, `Matplotlib`, `seaborn`, and `plotnine`). \
> Use one approach of your choice; you may remove or comment out imports you don’t need to keep the notebook tidy.
---

**Before submission**

Before submitting your notebook, please **restart the kernel and run all cells from the beginning** to ensure the entire notebook executes without errors.
This step guarantees that all results, figures, and outputs appear correctly when the notebook is evaluated.
If a particular code cell causes an error that you cannot fix, comment out that part so that the notebook still runs fully.

### Installing and Loading Packages

In [1]:
#!pip install palmerpenguins
#!pip install plotnine

In [2]:
import numpy as np                             # numerical computing: arrays, math functions, statistics
import pandas as pd                            # data manipulation and analysis (tables, DataFrames)

import matplotlib.pyplot as plt                # base plotting library for Python (low-level control)
import seaborn as sns                          # high-level plotting library (statistical graphics, built on Matplotlib)
#import plotnine as p9                          # plotting library with syntax similar to ggplot2

from scipy import stats                        # statistical functions

from palmerpenguins import load_penguins       # function to load the Palmer Penguins dataset

## Part I: Exploring the Palmer Penguins Dataset

In this part you will work with the **Palmer Penguins** dataset — real measurements of adult foraging penguins from the Palmer Archipelago, Antarctica. The data include species labels and several morphological traits commonly used in ecological studies.

**Citation:**  
Horst, A. M., Hill, A. P., & Gorman, K. B. (2020). *palmerpenguins: Palmer Archipelago (Antarctica) penguin data.* Data collected by Dr. Kristen Gorman and the Palmer Station LTER. (Original dataset made available via Palmer Station LTER; R/Python package “palmerpenguins” for convenient access.)


#### Load the data

We’ve installed the package and imported the loader for you. Run the cell below to create a DataFrame named `penguins`:

In [3]:
# Load the penguins dataset
penguins = load_penguins()

#### <font color='#fc7202'> Q1 (1 p): Describe the dataset:
1. Report how many observations and variables the dataset contains.
2. List each variable, its data type and (when applicable) its units.
3. For each variable, report how many missing values there are.

> *Hint:* Recall from the workshop how to get a quick pandas DataFrame overview. </font>
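
As a reminder of the kind of overview the hint refers to, here is a minimal sketch on a small made-up DataFrame (`toy` and its columns are hypothetical, not the penguins data). `DataFrame.shape`, `DataFrame.dtypes`, `DataFrame.isna()` and `DataFrame.info()` are standard pandas calls; which of them you use in your answer is up to you.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calls
toy = pd.DataFrame({
    "group": ["a", "b", "b", None],
    "length_mm": [10.5, np.nan, 12.0, 11.2],
    "year": [2007, 2008, 2008, 2009],
})

n_rows, n_cols = toy.shape      # number of observations and variables
dtypes = toy.dtypes             # data type of each column
missing = toy.isna().sum()      # number of missing values per column

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(missing)
toy.info()                      # compact overview: dtypes and non-null counts
```

Units are not stored in a DataFrame, so they usually have to be read from column names or the dataset documentation.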

In [4]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q2 (3 p): Answer the following questions:
1. Which island has the most observations? Report the count and percentage of total.
2. Which species occur on which islands? Provide a species × island summary (include row/column totals).
3. How are females and males distributed overall and by species? Report counts and percentages, include a suitable comparative plot.
4. How many penguins (count and %) have body mass > 5000 g? Which species do these penguins belong to (report counts/percentages by species)? In which year was the heaviest penguin observed (also report its species, island, sex, and body mass)?
5. List the 5 lightest penguins (ties allowed), showing only the columns: `species`, `island`, `sex`, `body_mass_g`.
6. For every species that occurs on multiple islands, calculate the mean `flipper_length_mm` for each island and briefly compare the island means.

If there is any missing data, please describe how you take it into account.

> *Hint*: Consider `pandas.crosstab()` and `.value_counts(...)`.
</font>
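
The two functions named in the hint can be sketched as follows on made-up data (the `toy` frame and its category labels are hypothetical). The sketch shows counts, percentages, and a cross-tabulation with totals; it is an illustration of the calls, not a template for the answer.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "species": ["A", "A", "B", "B", "B", "C"],
    "island": ["X", "Y", "X", "X", "Y", "Y"],
})

# Absolute counts and percentages of each category
counts = toy["island"].value_counts()
shares = toy["island"].value_counts(normalize=True) * 100

# Cross-tabulation; margins=True adds row/column totals labelled "All"
table = pd.crosstab(toy["species"], toy["island"], margins=True)

print(counts)
print(shares.round(1))
print(table)
```

By default, both `value_counts` and `crosstab` leave out missing values; passing `dropna=False` keeps them visible, which can help when you describe how missing data were handled.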

In [5]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q3 (5 p): Distributions and summary statistics

1. Create suitable plots to show the distribution of each numeric variable. Use your judgment: omit plots or summaries that are not meaningful for a given variable and briefly state why. For variables where it is appropriate, compute and report the mean, median, and standard deviation, then write 1-2 sentences per variable explaining how these summaries align with the observed distribution.

2. Choose one numeric variable and visualize its distribution by species. For each group, report mean, median, standard deviation, and coefficient of variation. Briefly interpret the differences and relate them to the plot(s).

3. Compute group-wise sample skewness and excess kurtosis for the same variable as in (2); report the sign and magnitude of skewness, and classify kurtosis as leptokurtic, mesokurtic, or platykurtic relative to the normal distribution. Explain whether these metrics align with your visual impression.

> *Hints:*
> - Skewness/kurtosis: check functions like `Series.skew()`, `Series.kurt()` or `scipy.stats.skew`, `scipy.stats.kurtosis`.
> - Be explicit about missing-data handling.

</font>
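
A minimal sketch of group-wise summaries on synthetic data (the `toy` frame, its groups, and its values are made up for illustration). It assumes the coefficient of variation is defined as standard deviation divided by mean, and it uses the bias-corrected estimators that pandas applies in `Series.skew()` and `Series.kurt()`; the SciPy calls need `bias=False` to return the same numbers.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: group "a" is roughly normal, group "b" is right-skewed
toy = pd.DataFrame({
    "group": np.repeat(["a", "b"], 200),
    "value": np.concatenate([rng.normal(50, 5, 200), rng.exponential(10, 200)]),
})
toy.loc[3, "value"] = np.nan  # one missing value; pandas reductions skip NaN by default

grouped = toy.groupby("group")["value"]
summary = grouped.agg(["mean", "median", "std"])
summary["cv"] = summary["std"] / summary["mean"]                  # coefficient of variation
summary["skew"] = grouped.apply(lambda s: s.skew())               # sample skewness
summary["excess_kurtosis"] = grouped.apply(lambda s: s.kurt())    # 0 for a normal distribution
print(summary.round(3))

# SciPy equivalents for group "b"; NaN is dropped explicitly here
b = toy.loc[toy["group"] == "b", "value"].dropna().to_numpy()
scipy_skew = stats.skew(b, bias=False)
scipy_kurt = stats.kurtosis(b, fisher=True, bias=False)
print(scipy_skew, scipy_kurt)
```

With SciPy's default `bias=True` the results differ slightly from pandas, so state which estimator you report.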

In [6]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

## Part II: Weather Time Series (2006–2016)

You are provided with a tab-separated dataset, `part2_data.tsv`, containing weather observations from 2006–2016 (e.g., temperature, humidity, wind, pressure, and visibility).

#### <font color='#fc7202'> Q4 (3 p): Temperature time series

1. Read the data into a `pandas DataFrame` named `weather_data` from `part2_data.tsv`.
2. Describe the dataset:
   - Report rows × columns.
   - List feature names, data types, and units (where available).
   - Summarize missing values per feature.
   - Provide a brief summary (3-5 sentences) of what the dataset contains and any immediate data-quality issues you notice.
3. Plot the temperature time series (2006–2016). To do this:
   - Parse `Formatted Date` to `datetime` using `pd.to_datetime(..., errors="coerce", utc=True)`.
   - Sort the entire `DataFrame` in ascending order by the parsed time column.
   - Plot time (*x*-axis) against `Temperature (C)` (*y*-axis) as a line plot (do not use `Apparent Temperature`).
4. Comment briefly on whether the time series looks reasonable (e.g., seasonality, trends, etc.).

</font>
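
The parse, sort, and plot steps above can be sketched on a few made-up records (the column names `stamp` and `temp_c` and the timestamp strings are hypothetical and may not match the format in `part2_data.tsv`). With `utc=True`, timestamps carrying different UTC offsets are converted to one common time zone; with `errors="coerce"`, unparseable entries become `NaT` instead of raising an error.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical records: mixed UTC offsets and one unparseable entry
toy = pd.DataFrame({
    "stamp": [
        "2006-04-01 01:00:00.000 +0200",
        "2006-04-01 00:00:00.000 +0200",
        "not a date",
        "2006-10-29 02:00:00.000 +0100",
    ],
    "temp_c": [9.4, 9.5, 7.0, 12.1],
})

# Parse to timezone-aware datetimes in UTC; bad entries become NaT
toy["time"] = pd.to_datetime(toy["stamp"], errors="coerce", utc=True)
n_bad = toy["time"].isna().sum()
print("Unparseable timestamps:", n_bad)

# Sort the whole DataFrame by time (NaT rows end up last)
toy = toy.sort_values("time").reset_index(drop=True)

# Line plot of the rows with a valid timestamp
valid = toy.dropna(subset=["time"])
fig, ax = plt.subplots()
ax.plot(valid["time"], valid["temp_c"])
ax.set_xlabel("Time (UTC)")
ax.set_ylabel("Temperature (C)")
```

Checking the number of `NaT` values after parsing is a quick way to spot data-quality problems in the time column.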

In [7]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q5 (5 p): Assessment of data distributions

1. Plot histograms and boxplots for all numeric variables to examine their distributions.
2. For each variable, describe the distribution (center, spread, shape, outliers) and assess whether it aligns with expected physical/measurement behavior for that quantity; if not, specify the inconsistency and a plausible explanation.
3. State whether descriptive statistics are appropriate for each variable: indicate if reporting mean/median/mode/standard deviation is meaningful or not, and justify your decision.

</font>
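
One possible way to produce a histogram and a boxplot for every numeric column is a loop over `select_dtypes`. The sketch below uses synthetic data (the `toy` frame and its columns are made up) and plain Matplotlib; any of the plotting libraries from the setup cell would work equally well.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical data: two numeric columns and one non-numeric column
toy = pd.DataFrame({
    "symmetric": rng.normal(0, 1, 500),
    "skewed": rng.exponential(2, 500),
    "label": ["x"] * 500,
})

# Keep only numeric columns; "label" is excluded automatically
numeric_cols = toy.select_dtypes(include="number").columns

# One row per variable: histogram on the left, boxplot on the right
fig, axes = plt.subplots(len(numeric_cols), 2,
                         figsize=(8, 3 * len(numeric_cols)), squeeze=False)
for row, col in zip(axes, numeric_cols):
    values = toy[col].dropna()          # drop missing values explicitly
    row[0].hist(values, bins=30)
    row[0].set_title(f"{col}: histogram")
    row[1].boxplot(values)
    row[1].set_title(f"{col}: boxplot")
fig.tight_layout()
```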

In [8]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q6 (3 p): Normality assessment (temperature and wind speed)
Evaluate the empirical distributions of **`Temperature (C)`** and **`Wind Speed (km/h)`**.  
Apply **Shapiro–Wilk** and **Kolmogorov–Smirnov** tests and construct **Q–Q plots**.  
Using the **p-values** from the Shapiro–Wilk and Kolmogorov–Smirnov tests **together with the Q–Q plots**, evaluate whether each variable’s data are **compatible with a normal distribution at α = 0.05**; if not, briefly describe the nature of the deviation.

</font>
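
A minimal sketch of the three tools on a synthetic, clearly non-normal sample (the array `x` is made up). Two caveats apply to this sketch: SciPy warns that the Shapiro–Wilk p-value may not be accurate for samples larger than 5000, and a Kolmogorov–Smirnov test against a normal distribution whose mean and standard deviation were estimated from the same sample gives only an approximate p-value.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.exponential(scale=3.0, size=500)   # hypothetical right-skewed sample

# Shapiro-Wilk test of normality
sw_stat, sw_p = stats.shapiro(x)

# Kolmogorov-Smirnov test against a normal distribution with
# parameters estimated from the sample (p-value is approximate)
ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

print(f"Shapiro-Wilk: W = {sw_stat:.3f}, p = {sw_p:.3g}")
print(f"Kolmogorov-Smirnov: D = {ks_stat:.3f}, p = {ks_p:.3g}")

# Q-Q plot against the normal distribution
fig, ax = plt.subplots()
stats.probplot(x, dist="norm", plot=ax)
ax.set_title("Normal Q-Q plot")
```

With very large samples, even small deviations from normality produce tiny p-values, so the Q–Q plot is important for judging how the data deviate.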


In [9]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q7 (1 p): Distributional modeling of wind speed
Propose a plausible **parametric distribution** for **`Wind Speed (km/h)`** and justify your choice using the dataset’s evidence. Explain why the proposed distribution is physically reasonable for wind.
</font>

<font color='#00bf63'>*Your answer here!*</font>

## Part III: Simulation of Normal Distribution

A river reference sample has been measured **1,000** times for **dissolved oxygen (DO)**. The sample **mean** is **8.79 mg/L** and the sample **standard deviation** is **2.74 mg/L**. Assume (for this exercise) that single-measurement results follow a **normal** distribution. What is the probability that the **next measurement** is **≤ 5.00 mg/L**?  
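
Under a normal model, a probability of the form P(X ≤ x) is the normal cumulative distribution function evaluated at x. A minimal sketch with made-up numbers (deliberately not the assignment's values), which can serve as a sanity check for a simulated estimate:

```python
from scipy import stats

# Hypothetical parameters, chosen only for illustration
mu_demo = 10.0
sigma_demo = 2.0
threshold_demo = 7.0

# Standardized distance of the threshold from the mean
z = (threshold_demo - mu_demo) / sigma_demo

# P(X <= threshold) under Normal(mu_demo, sigma_demo)
p = stats.norm.cdf(threshold_demo, loc=mu_demo, scale=sigma_demo)

print(f"z = {z:.2f}, P(X <= {threshold_demo}) = {p:.4f}")
```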




#### <font color='#fc7202'> Q8 (3 p): Effect of sample size
Treat individual readings as coming from a normal distribution with **mean 8.79 mg/L** and **sd 2.74 mg/L**. Use simulation to examine the effect of sample size:
- Generate **10** values and plot a histogram.
- Generate **10,000** values and plot a second histogram.

Write **2-3 sentences** comparing the two plots, focusing on **shape** and the **stability of the sample mean** relative to **8.79**.

> *Hint.* The **next code cell** contains starter code to sample from a normal distribution.
</font>

In [10]:
rng = np.random.default_rng(42) # for reproducibility
mu = 8.79
sigma = 2.74
rng.normal(mu, sigma, size=10)

array([ 9.6249248 ,  5.94044355, 10.84623628, 11.36714732,  3.44416358,
        5.22202815,  9.1402827 ,  7.9234953 ,  8.74396483,  6.45265964])

In [11]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q9 (3 p): Probability
Using the same normal model (**mean 8.79 mg/L**, **sd 2.74 mg/L**), draw **one** sufficiently large synthetic sample (choose the size informed by Q8), **count** the observations **≤ 5.00 mg/L**, and report the resulting **fraction** as your estimate of the required probability.
</font>

In [12]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'> Q10 (5 p): Sampling bias
Assume, hypothetically, that the **true mean** dissolved oxygen concentration is **8.00 mg/L** (with a **standard deviation of 2.74 mg/L**).
What is the probability that an experiment with **1,000 measurements** would produce a **sample mean ≥ 8.79 mg/L**?

- To estimate this probability, perform the following steps:
    1. Simulate the experiment many times: In each iteration, draw a random sample of size 1,000 from a normal distribution with the specified mean (8.00) and standard deviation (2.74).
    2. Compute the sample mean for each simulated dataset.
    3. Estimate the probability as the proportion (fraction) of simulated means that are ≥ 8.79 mg/L.
- Visualize the results by plotting both distributions on the same figure:
    1. The hypothetical true distribution (normal with mean = 8.00 mg/L, sd = 2.74 mg/L)
    2. The observed-sample distribution, corresponding to a normal distribution with mean = 8.79 mg/L and the same sd (2.74 mg/L)
    3. Add a vertical line at 8.79 mg/L to indicate the observed sample mean.

> *Hint:* You can use a simple **for-loop** (or a vectorized alternative) to repeatedly sample and compute means.
</font>
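
The vectorized alternative mentioned in the hint can be sketched as follows. The parameters are made up and deliberately differ from the assignment's values: each row of a 2-D array is one simulated experiment, and the row means are the simulated sample means.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical parameters, chosen only for illustration
mu_demo, sigma_demo = 0.0, 1.0
n_demo = 50              # measurements per simulated experiment
n_sims = 20_000          # number of simulated experiments
threshold_demo = 0.2

# Each row is one experiment of n_demo measurements
samples = rng.normal(mu_demo, sigma_demo, size=(n_sims, n_demo))

# One sample mean per experiment
sample_means = samples.mean(axis=1)

# Fraction of experiments whose mean reaches the threshold
p_hat = np.mean(sample_means >= threshold_demo)
print(f"Estimated P(mean >= {threshold_demo}) = {p_hat:.4f}")
```

A `for` loop that appends one mean per iteration gives the same kind of estimate. Note that sample means vary far less than individual measurements, since their standard deviation shrinks with the square root of the sample size.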

In [13]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

## Before you submit

**Don’t forget:**

Please restart the kernel and run all cells from top to bottom to ensure your notebook works correctly.
1. In Jupyter Notebook or JupyterLab, go to the menu bar and select `Kernel → Restart & Run All`.
2. In Visual Studio Code (VS Code):
   - Click the `Restart Kernel` button `(↻)` on the toolbar.
   - Then click `Run All` `(▶▶)` to execute all cells in order.

Make sure the notebook runs **end-to-end without errors**.\
If a cell still produces an error that you can’t resolve, **comment out that section** so the remaining cells can execute without interruption.

**Time spent**

Please record roughly how long you worked on this assignment:



<font color='#00bf63'>*Total time spent: XXXXX h*</font>

**Comments (optional feedback)**

Please leave any comments about the homework here. For example:
- Was it too hard/easy for you?
- What would you suggest to add or remove?
- Anything else you would like to tell us?

<font color='#00bf63'>*Your feedback*</font>

Excellent work making it all the way through!