# NOTEBOOK 2: Data handling and basic statistics

## Data handling
---
Python has two main data analysis libraries: `pandas` and `polars`. This tutorial uses `pandas`.

The pandas documentation can be found at https://pandas.pydata.org/

### Data loading
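
As a minimal sketch of loading tabular data, the example below reads CSV text into a `DataFrame` with `pd.read_csv`. The CSV content and column names are made up for illustration; in practice you would pass a file path or URL instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """sample,fixed_acidity,ph
a,7.4,3.51
b,7.8,3.20
c,11.2,3.16
"""

# read_csv accepts a path, a URL, or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.dtypes)   # column types inferred by pandas
```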

### Indexing data
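
The following sketch shows the most common ways to select data from a `DataFrame`: by column name, by label (`.loc`), by integer position (`.iloc`), and with a boolean mask. The small table is made up for illustration.

```python
import pandas as pd

# Made-up example table with a labelled index
df = pd.DataFrame(
    {"fixed_acidity": [7.4, 7.8, 11.2], "ph": [3.51, 3.20, 3.16]},
    index=["a", "b", "c"],
)

col = df["ph"]                           # one column as a Series
by_label = df.loc["b", "ph"]             # row label "b", column "ph"
by_position = df.iloc[0, 0]              # first row, first column
acidic = df[df["fixed_acidity"] > 7.5]   # rows where the condition holds

print(acidic)
```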

## Plotting
---
### Pure matplotlib API
Plotting artificial time series data
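
A minimal sketch of what such a plot could look like, using the object-oriented matplotlib API. The series (a sine wave with Gaussian noise) and all parameter values are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Artificial time series: a sine wave with added Gaussian noise
rng = np.random.default_rng(seed=0)
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 50) + rng.normal(scale=0.3, size=t.size)

# Create figure and axes explicitly, then draw on the axes
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(t, signal, label="noisy sine")
ax.set_xlabel("time step")
ax.set_ylabel("value")
ax.legend()
fig.tight_layout()
```

In a notebook the figure is rendered automatically; in a script, call `plt.show()` or `fig.savefig(...)` afterwards.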

### Pandas built-in plotting

Enhance visual appearance
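
As a sketch, `DataFrame.plot` draws every column against the index and returns a matplotlib `Axes`, which can then be styled with the usual matplotlib calls. The two random-walk columns are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Two made-up random walks on a daily date index
rng = np.random.default_rng(seed=1)
idx = pd.date_range("2024-01-01", periods=100, freq="D")
ts = pd.DataFrame(
    {"a": rng.normal(size=100).cumsum(), "b": rng.normal(size=100).cumsum()},
    index=idx,
)

fig, ax = plt.subplots(figsize=(8, 3))
ts.plot(ax=ax, linewidth=1.5)   # pandas draws one line per column

# Styling is done on the matplotlib Axes
ax.set_title("Random walks")
ax.set_xlabel("date")
ax.set_ylabel("value")
ax.grid(True, alpha=0.3)
fig.tight_layout()
```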

## Confidence interval
---

When you estimate the mean of a population, you often want to provide a range of values that is likely to contain the true population parameter. This range is the confidence interval (CI).

### $t$-statistic (one sample mean)

When the population standard deviation is unknown (common in practice) and the sample size is small (<30), use the $t$-distribution instead of the normal ($z$) distribution:

$$
\text{CI} = \bar{x} \pm t \cdot \frac{s}{\sqrt{n}}
$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $t$ is the critical $t$-value for the desired confidence level (e.g. 95%) with $n-1$ degrees of freedom.
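
The formula can be evaluated step by step as in the sketch below. The sample values are made up; `scipy.stats` is used here as an assumption about the available tooling, and `stats.t.interval` serves only as a cross-check of the manual calculation.

```python
import numpy as np
from scipy import stats

# Made-up sample of 8 measurements
sample = np.array([8.1, 8.4, 8.3, 8.6, 8.2, 8.5, 8.0, 8.4])

n = sample.size
mean = sample.mean()
s = sample.std(ddof=1)                   # sample standard deviation (n - 1)
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95 % critical value

half_width = t_crit * s / np.sqrt(n)
ci = (mean - half_width, mean + half_width)

# Cross-check with SciPy's built-in interval
ci_scipy = stats.t.interval(0.95, df=n - 1, loc=mean, scale=stats.sem(sample))

print(ci)
print(ci_scipy)
```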

## Hypothesis testing
---

### One-sample $t$-test

Used to test whether the mean of a sample differs from a known (or specified) population mean.

**The wine manufacturer says that the fixed acidity for red wines is 8.33.**
The value 8.33 might have come from the certificate of analysis of a standard reference material for example.

### Hypotheses

- **Null hypothesis (H₀):** The mean of the population from which the sample was drawn **is equal** to the specified population mean.
- **Alternative hypothesis (H₁):** The mean of the population from which the sample was drawn **is not equal** to the specified population mean.

The module `pingouin` implements many useful statistical tests (see: https://pingouin-stats.org/build/html/index.html).

The results can be interpreted as follows:
If the true population mean is 8.33, the probability of obtaining a sample mean at least as far from 8.33 as the observed 8.32 is 81.2 %. The null hypothesis is therefore not rejected.
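
A minimal sketch of a one-sample $t$-test follows. It uses `scipy.stats.ttest_1samp` as a stand-in rather than `pingouin`, and the sample values are made up, so it will not reproduce the notebook's numbers.

```python
import numpy as np
from scipy import stats

# Made-up measurements of fixed acidity
sample = np.array([8.1, 8.4, 8.3, 8.6, 8.2, 8.5, 8.0, 8.4])
claimed_mean = 8.33   # value stated by the manufacturer

# Two-sided test of H0: population mean == claimed_mean
result = stats.ttest_1samp(sample, popmean=claimed_mean)

# The same t-statistic computed by hand
t_manual = (sample.mean() - claimed_mean) / (sample.std(ddof=1) / np.sqrt(sample.size))

print(result.statistic, result.pvalue)
if result.pvalue < 0.05:
    print("Reject H0 at the 5 % level")
else:
    print("Fail to reject H0 at the 5 % level")
```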

### Two-sample $t$-test
A two-sample t-test (also called independent samples t-test) is used to determine whether the means of two independent groups are significantly different from each other.

When to use it?

- You have two separate groups (e.g. treatment vs. control)
- You want to test if their population means are equal
- Data is approximately normally distributed within each group
- Variances are assumed equal (when unequal use Welch’s t-test)

### Hypotheses

- **Null hypothesis (H₀):** The population means of the two groups **are equal**.
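- **Alternative hypothesis (H₁):** The population means of the two groups **are not equal**.

A p-value that small means that a difference in sample means at least as large as the observed one would be very (very) unlikely if the population means of the two groups were equal, so the null hypothesis is rejected.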
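
The sketch below runs both the equal-variance test and Welch's variant with `scipy.stats.ttest_ind`. The two groups are synthetic normal samples with deliberately different means; the group names, means, and sizes are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic groups with clearly different population means
rng = np.random.default_rng(seed=42)
group_a = rng.normal(loc=8.3, scale=0.5, size=50)
group_b = rng.normal(loc=7.0, scale=0.5, size=50)

# Student's t-test: assumes equal variances in both groups
student = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```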

### ANOVA
---
ANOVA (Analysis of Variance) is a statistical method used to test whether there are significant differences between the means of three or more groups.

**Why not use multiple t-tests?**

If we were to compare each pair of groups with individual t-tests, the probability of making a Type I error (false positive) increases with each test. ANOVA controls this error rate by testing all groups simultaneously in a single procedure.
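
To make the inflation concrete, assume (as a simplification) that $m$ tests are independent and each is run at significance level $\alpha$. The probability of at least one false positive is then

$$
P(\text{at least one Type I error}) = 1 - (1 - \alpha)^m
$$

For three groups there are $m = 3$ pairwise comparisons, so with $\alpha = 0.05$ this gives $1 - 0.95^3 \approx 0.143$. For five groups ($m = 10$) it is already $1 - 0.95^{10} \approx 0.401$. Pairwise tests on shared data are not truly independent, so these numbers are only an approximation of the effect.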

---

#### Hypotheses

- **Null hypothesis (H₀):** All group means are equal.
- **Alternative hypothesis (H₁):** At least one group mean is different.

#### Basic idea

ANOVA compares **two sources of variability**:
- **Between-group variability:** Variance due to differences between the groups (how much the group means differ from the overall mean).
- **Within-group variability:** Variance due to differences within individual groups (natural spread of data).

From these variabilities, an F-statistic (the ratio of between-group to within-group variability) and its corresponding p-value according to the F-distribution are calculated.

If the null hypothesis is rejected, Tukey's HSD (Honestly Significant Difference) post hoc test can be conducted to identify which pairs of groups differ significantly.
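
The procedure can be sketched as follows. The F-statistic is computed by hand from the between-group and within-group sums of squares and compared with `scipy.stats.f_oneway`; `scipy.stats.tukey_hsd` (available in recent SciPy versions) is used for the post hoc test. The three groups are synthetic, and using SciPy here is an assumption, not necessarily the notebook's tooling.

```python
import numpy as np
from scipy import stats

# Three synthetic groups; the third has a clearly higher mean
rng = np.random.default_rng(seed=7)
groups = [
    rng.normal(loc=5.0, scale=0.5, size=30),
    rng.normal(loc=5.1, scale=0.5, size=30),
    rng.normal(loc=7.0, scale=0.5, size=30),
]

k = len(groups)
n_total = sum(g.size for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

anova = stats.f_oneway(*groups)
print(f_manual, anova.statistic, anova.pvalue)

# Post hoc test: pairwise comparisons with family-wise error control
tukey = stats.tukey_hsd(*groups)
print(tukey)   # table of pairwise differences and p-values
```

`tukey.pvalue[i, j]` holds the p-value for the comparison of group `i` with group `j`.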


## EXCURSUS: Plotting the Mandelbrot set
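
A minimal sketch of one way to do this with NumPy and matplotlib: iterate $z \leftarrow z^2 + c$ on a grid of complex numbers $c$ and count how many iterations each point survives before $|z|$ exceeds 2. The function name `mandelbrot`, the grid size, and the iteration limit are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np


def mandelbrot(width=300, height=200, max_iter=50):
    """Return, per grid point, the number of iterations with |z| <= 2."""
    re = np.linspace(-2.0, 1.0, width)
    im = np.linspace(-1.0, 1.0, height)
    c = re[np.newaxis, :] + 1j * im[:, np.newaxis]   # complex grid
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=int)
    for _ in range(max_iter):
        active = np.abs(z) <= 2.0                    # points not yet escaped
        z[active] = z[active] ** 2 + c[active]       # update only those
        counts += active
    return counts


counts = mandelbrot()

fig, ax = plt.subplots(figsize=(6, 4))
ax.imshow(counts, extent=(-2.0, 1.0, -1.0, 1.0), cmap="magma", origin="lower")
ax.set_xlabel("Re(c)")
ax.set_ylabel("Im(c)")
fig.tight_layout()
```

Points that reach `max_iter` are treated as belonging to the set; increasing `max_iter` sharpens the boundary at the cost of run time.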