# Workshop: Python Basics

Welcome to this Python workshop! The goal of this notebook is to introduce you to the essentials of data analysis in Python, from the very basics to some more advanced data wrangling techniques.

---
Before you start, make sure you are familiar with the [Python Syntax Fundamentals](https://github.com/RaHub4AI/MI7032/blob/main/Introduction_to_R_and_Python/Python_Syntax_Fundamentals.md).
These rules cover things like comments, variables, assignment, reserved words, and code readability - and you should keep them in mind while writing your code.

## General Information

In [1]:
# Checking the complete version of the installed Python
!python --version # the leading "!" runs this line as a shell command

Python 3.12.11


In [2]:
import sys
print(sys.version)

3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]


In [3]:
# Check the current working directory
import os
print(os.getcwd())

# Set your working directory (example path)
# os.chdir('C:/Users/YourName/Documents/EDS_projects')

# Check the files in the directory
print(os.listdir())


/content
['.config', 'sample_data']


## Operators
---
#### Arithmetic
Used for performing basic mathematical operations (addition, subtraction, multiplication, division, powers, etc.).
| Operator | Description       | Example   | Result |
|----------|-------------------|-----------|--------|
| `+`      | Addition          | `5 + 3`   | 8      |
| `-`      | Subtraction       | `5 - 3`   | 2      |
| `*`      | Multiplication    | `5 * 3`   | 15     |
| `/`      | Division (float)  | `5 / 2`   | 2.5    |
| `//`     | Floor division    | `5 // 2`  | 2      |
| `%`      | Modulus (remainder) | `5 % 2` | 1      |
| `**`     | Exponentiation    | `2 ** 3`  | 8      |
---
#### Assignment
Used for assigning values to variables and updating them in place (e.g., `x += 1` means increase `x` by `1`).
| Operator | Example   | Equivalent to |
|----------|-----------|---------------|
| `=`      | `x = 5`  | assign value 5 to `x` |
| `+=`     | `x += 3` | `x = x + 3` |
| `-=`     | `x -= 3` | `x = x - 3` |
| `*=`     | `x *= 3` | `x = x * 3` |
| `/=`     | `x /= 3` | `x = x / 3` |
| `//=`    | `x //= 3` | `x = x // 3` |
| `%=`     | `x %= 3` | `x = x % 3` |
| `**=`    | `x **= 3` | `x = x ** 3` |
---
#### Comparison  
Used for comparing values; they return `True` or `False`.
| Operator | Description      | Example   | Result |
|----------|------------------|-----------|--------|
| `==`     | Equal to         | `5 == 3`  | False |
| `!=`     | Not equal to     | `5 != 3`  | True  |
| `>`      | Greater than     | `5 > 3`   | True  |
| `<`      | Less than        | `5 < 3`   | False |
| `>=`     | Greater or equal | `5 >= 5`  | True  |
| `<=`     | Less or equal    | `3 <= 5`  | True  |
---
#### Logical  
Used for combining or inverting boolean values (`and`, `or`, `not`).
| Operator | Description  | Example   | Result |
|----------|--------------|-----------|--------|
| `and`    | Logical AND  | `(5 > 3) and (5 < 10)` | True |
| `or`     | Logical OR   | `(5 > 3) or (5 > 10)`  | True |
| `not`    | Logical NOT  | `not(5 > 3)`           | False |
----
#### Membership
Used to test if a value is part of a sequence (like lists, strings, or tuples).
| Operator | Description                | Example               | Result |
|----------|----------------------------|-----------------------|--------|
| `in`     | Checks if value is in a sequence | `3 in [1, 2, 3]` | True   |
| `not in` | Checks if value is not in a sequence | `4 not in [1, 2, 3]` | True |
---
#### Identity  
Used to check if two variables refer to the same object in memory, not just equal values.
| Operator | Description                      | Example       | Result |
|----------|----------------------------------|---------------|--------|
| `is`     | True if two variables point to the same object | `x is y` | True/False |
| `is not` | True if two variables point to different objects | `x is not y` | True/False |
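
The difference between `==` (equal values) and `is` (same object) is easiest to see with two lists that have the same contents. The following is a minimal sketch; the variable names are illustrative and do not come from the workshop:

```python
# Two separate list objects with identical contents
first = [1, 2, 3]
second = [1, 2, 3]
alias = first                # a second name for the very same object

print(first == second)       # True: the values are equal
print(first is second)       # False: two different objects in memory
print(first is alias)        # True: both names refer to one object
print(first is not second)   # True
```

In practice, `is` is mostly used for comparisons with `None`, for example `if result is None:`.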


### Python as a Calculator
The simplest use of Python is doing computations directly:

In [None]:
# Let's try the arithmetic operators
3 + 5

8

In [4]:
25 ** 2

625

In [5]:
1 / 2

0.5

In [6]:
1 % 2

1
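
Floor division (`//`) and modulus (`%`) belong together: the first gives the whole-number quotient, the second the remainder. A short sketch with made-up numbers, including a negative case where the result may be surprising:

```python
# Quotient and remainder of 17 divided by 5
quotient = 17 // 5
remainder = 17 % 5
print(quotient, remainder)               # 3 2
print(quotient * 5 + remainder == 17)    # True: the two results reconstruct 17

# divmod() returns both values at once
print(divmod(17, 5))                     # (3, 2)

# Floor division rounds toward minus infinity, not toward zero
print(-7 // 2)                           # -4
print(-7 % 2)                            # 1
```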

In Python, many core mathematical functions and constants are provided by the
[`math` module](https://docs.python.org/3/library/math.html).  
To use them, we first need to import the module:

In [7]:
import math

In [8]:
2 + math.sqrt(375769) - 25**2

-10.0

In [9]:
math.sin(math.pi/6) + math.acosh(1)

0.49999999999999994

>Do you see the difference with **R**?
>In R, the same expression prints as exactly `0.5`, while in Python it shows up as `0.49999999999999994`.  
>This is not because Python is less accurate - both R and Python use the same [IEEE 754 double-precision floating-point standard](https://standards.ieee.org/ieee/754/6210/). The difference comes from how the two languages **display results**:  
>- R rounds the output it prints, so small errors are usually hidden.  
>- Python shows more digits by default, so you can see the tiny rounding error.
>
>In fact, if you copy and paste `options(digits = 17)` into the **same cell in your R notebook**, R will print the same number as Python.
> <font color='gold'> However, if you are using **R in Google Colab**, simply setting `options(digits=17)` is not enough, because Colab still rounds output.  To see the same long result as Python, you need to explicitly format the number, for example with: `sprintf("%.17g", sin(pi/6) + acosh(1))`</font>
>
>So both languages are doing the **same computation**, only the presentation differs.
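
The same effect can be reproduced inside Python by choosing how many significant digits to display. This is a small sketch: the format `.7g` is used here only to mimic a display with 7 significant digits, and `math.isclose` is shown as one common way to compare floats with a tolerance:

```python
import math

value = math.sin(math.pi / 6) + math.acosh(1)

print(value)               # Python's default display shows many digits
print(f"{value:.7g}")      # 0.5 when rounded to 7 significant digits
print(f"{value:.17g}")     # 17 significant digits, as in the sprintf example

# Comparing floats with a tolerance is safer than testing exact equality
print(math.isclose(value, 0.5))   # True
```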


In [10]:
math.log(math.exp(1))

1.0

Unlike R's built-in functions, which also accept complex input, Python's `math` module only supports real numbers. For complex numbers, Python provides the [`cmath` module](https://docs.python.org/3/library/cmath.html).

In [13]:
import cmath

cmath.sqrt(-1+0j)  # complex numbers

1j

In [17]:
cmath.sqrt(3)

(1.7320508075688772+0j)

In [18]:
math.factorial(6) / math.comb(4, 2)

120.0

><font color='gold'> *Think!* Why is the answer a float and not an integer? </font>

In [19]:
1 / 0

ZeroDivisionError: division by zero

In [20]:
0 / 0

ZeroDivisionError: division by zero

In **R**, dividing by zero returns special IEEE 754 values such as `Inf`, `-Inf`, or `NaN`. In Python, dividing by zero raises an explicit error instead: `ZeroDivisionError: division by zero`.

This difference comes from design choices: Python signals errors explicitly,
while R follows IEEE 754 rules by default.
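
Because Python raises an error, you can catch it with `try`/`except` and decide yourself what should happen. The sketch below uses a hypothetical helper, `r_style_divide`, that imitates the R results for the three cases shown above; it is a simplification and ignores details such as a negative zero in the denominator:

```python
import math

def r_style_divide(numerator, denominator):
    """Divide two numbers, returning inf, -inf or nan instead of raising an error."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        if numerator == 0:
            return math.nan                       # 0 / 0 has no defined value
        return math.copysign(math.inf, numerator) # sign follows the numerator

print(r_style_divide(1, 0))    # inf
print(r_style_divide(-1, 0))   # -inf
print(r_style_divide(0, 0))    # nan
print(r_style_divide(1, 2))    # 0.5
```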

To work with IEEE 754–style floating-point behavior in Python, we can use
the [NumPy library](https://numpy.org/). NumPy is a powerful package for
numerical computing that extends Python with arrays, vectorized operations,
and mathematical functions (many of which follow R-like conventions).
> *Hint*: We will use NumPy a lot!

*Read more about Python and R and Calculus: https://doi.org/10.1080/09332480.2025.2510168*

> In Python, it is common to import packages with a **short alias** using the `as` keyword.  
>For example, instead of writing `import numpy` and then calling functions as `numpy.sqrt(...)`,  
we usually write: `import numpy as np`.
>
>This makes the code shorter and easier to read.
>
>*The abbreviation `np` is a widely accepted community convention.
>If you see Python code in textbooks, tutorials, or research papers,
`np` will almost always refer to NumPy.
>Using these standard abbreviations helps your code stay familiar and understandable to others.*

In [21]:
import numpy as np

In [22]:
np.divide(1, 0)

  np.divide(1, 0)


np.float64(inf)

In [23]:
np.divide(-1, 0)

  np.divide(-1, 0)


np.float64(-inf)

In [None]:
np.divide(0, 0)

  np.divide(0, 0)


np.float64(nan)

>As you see, in addition to errors, Python also gives you **warnings**. Right now, NumPy warns that you are dividing by zero.  
>This is useful because it helps you notice potential caveats in your code.  However, when using some packages, warnings can become very frequent and distracting.  
>Python provides an option to **suppress warnings** if needed. Here is how you can do it:

```python
import warnings
warnings.filterwarnings("ignore")

import numpy as np
np.divide(1, 0)    # no warning shown
```


In [24]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
np.divide(1, 0)    # no warning shown


np.float64(inf)
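
`warnings.filterwarnings("ignore")` silences warnings for the rest of the session, which can also hide messages you would want to see. A more targeted option is to silence them only inside a `with` block. The sketch below uses the standard-library `warnings.catch_warnings()` context manager; `noisy_half` is a hypothetical function that stands in for a library call that warns:

```python
import warnings

def noisy_half(value):
    """Hypothetical helper that emits a warning, standing in for a chatty library call."""
    warnings.warn("this is only a demonstration warning", RuntimeWarning)
    return value / 2

# Warnings are ignored only inside this block; the previous settings return afterwards
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    quiet_result = noisy_half(10)

print(quiet_result)   # 5.0
```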

## Variables

Creating variables in Python means defining names and assigning values to them.
Variables are used to store data such as numbers, text, lists, or even more complex objects. Assignment is done with the `=` operator, which binds a value to a variable name.

A variable’s name should clearly describe the data it contains, so that the name itself serves as a meaningful reference to the information it holds.  


In [25]:
sum1 = 2 + 3

You can print values in multiple ways:

In [27]:
# Passing several arguments separated by commas (no conversion with str() needed)
print("The sum1 of the numbers is:", sum1)

# Using f-strings (recommended, clearer and more readable)
print(f"The sum2 of the numbers is: {sum1+4}")

The sum1 of the numbers is: 5
The sum2 of the numbers is: 9


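><font color='gold'> Do you notice the difference?</font>
> - With `,`, `print()` automatically inserts a space between the arguments.
> - With an f-string, you write the spacing yourself and can embed expressions such as `{sum1+4}` directly in the text.
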

In [28]:
temperature = 10
location = "Stockholm"

# Printing multiple values at once
print("The temperature in", location, "is", temperature, "°C")

print(f"The temperature in {location} is {temperature} °C")

# Embedding expressions directly inside an f-string
print(f"In five years, the temperature will be {temperature + 5} °C") # very pessimistic prediction

The temperature in Stockholm is 10 °C
The temperature in Stockholm is 10 °C
In five years, the temperature will be 15 °C
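
f-strings can also control how a number is displayed by adding a format specification after a colon. A brief sketch with illustrative values (the variable names are not part of the workshop data):

```python
bmi_value = 24.691358024691358
share = 0.256

print(f"BMI: {bmi_value:.1f}")      # BMI: 24.7  (one decimal place)
print(f"BMI: {bmi_value:.3f}")      # BMI: 24.691
print(f"Share: {share:.1%}")        # Share: 25.6%  (shown as a percentage)

# With commas, the separator between arguments can be changed with sep
print("2025", "06", "04", sep="-")  # 2025-06-04
```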


## Getting Help

Many Python functions have **default parameter values**, so you don’t always need to specify every argument.  
To learn how a function works, you can access its documentation in different ways.

#### 1. Using `help()`

In [29]:
help(len)

Help on built-in function len in module builtins:

len(obj, /)
    Return the number of items in a container.



This shows the function’s signature, description, parameters, and sometimes examples.

#### 2. Using `?` in IPython or Jupyter

In Jupyter notebooks or IPython, you can also type:

In [34]:
math.log?

This will display the docstring (documentation string) for the function.

Help pages in Python are based on the function’s **docstring**.  
A docstring is text written by the developers of the function to explain how it works.  

Typically, Python help pages include:
- **Signature**: the function name with its parameters and default values.  
- **Docstring**: a description of what the function does, sometimes with notes or examples.  
- **Type**: whether it is a built-in function, method, or user-defined function.  

>In addition, most Python packages (like NumPy or pandas) have excellent online documentation,  
which is often the fastest way to learn a new function.
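
Your own functions get help pages in the same way: whatever you write as the docstring is what `help()` and `?` display. A minimal sketch with a hypothetical function, `celsius_to_kelvin`:

```python
import inspect

def celsius_to_kelvin(temp_c, offset=273.15):
    """Convert a temperature from degrees Celsius to kelvin."""
    return temp_c + offset

# The docstring is stored on the function object
print(celsius_to_kelvin.__doc__)

# The signature shows parameter names and default values
print(inspect.signature(celsius_to_kelvin))   # (temp_c, offset=273.15)

# help() combines both pieces of information
help(celsius_to_kelvin)
```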


### <font color='gold'> Task 1 </font>
1. What does the function `np.log()` do in Python?
2. Try one example from the help page (or create your own).

In [31]:
help(np.log)

Help on ufunc:

log = <ufunc 'log'>
    log(x, /, out=None, *, where=True, casting='same_kind', order='K', dtype=None, subok=True[, signature])

    Natural logarithm, element-wise.

    The natural logarithm `log` is the inverse of the exponential function,
    so that `log(exp(x)) = x`. The natural logarithm is logarithm in base
    `e`.

    Parameters
    ----------
    x : array_like
        Input value.
    out : ndarray, None, or tuple of ndarray and None, optional
        A location into which the result is stored. If provided, it must have
        a shape that the inputs broadcast to. If not provided or None,
        a freshly-allocated array is returned. A tuple (possible only as a
        keyword argument) must have length equal to the number of outputs.
    where : array_like, optional
        This condition is broadcast over the input. At locations where the
        condition is True, the `out` array will be set to the ufunc result.
        Elsewhere, the `out` array will 

In [32]:
np.log([1, np.e, np.e**2, 0])

array([  0.,   1.,   2., -inf])

## Data Types

In Python, **data types** define the kind of values a variable can hold and what operations can be performed on them.  
Python is a **dynamically typed language**, so you don’t need to declare the type explicitly - Python infers it from the assigned value.  

You can check the type of an object with the `type()` function.
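
Dynamic typing means the same name can refer to values of different types over time. A short sketch with an illustrative variable:

```python
measurement = 42
print(type(measurement))             # <class 'int'>

measurement = 42.0                   # the same name now refers to a float
print(type(measurement))             # <class 'float'>

measurement = "42"                   # and now to a string
print(type(measurement))             # <class 'str'>

# isinstance() answers a yes/no question about the type
print(isinstance(measurement, str))  # True
```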


#### Numbers

Python has several numeric data types:

- **Integers (`int`)**: whole numbers without a decimal point  
  Examples: `54`, `-12`, `1240`

- **Floating-point numbers (`float`)**: numbers with decimals  
  Examples: `1.54`, `6.0`, `-2.3`

- **Complex numbers (`complex`)**: numbers with a real and imaginary part (written with `j` instead of `i`)  
  Example: `5 + 2j`  
  *(Don’t worry — we won’t work with complex numbers in this course.)*


In [35]:
weight = 80      # int
height = 1.80    # float
bmi = weight / (height**2)
print(bmi)
print(type(bmi))

24.691358024691358
<class 'float'>


#### Strings (`str`)

Strings represent text data. They can be written with single, double, or triple quotes (for multi-line text).

In [36]:
string1 = "I’m going to become an Environmental Data Scientist!"
print(type(string1))

<class 'str'>


Strings in Python are immutable, meaning they cannot be changed after creation.
If you want to modify text, you create a new string instead.

In [37]:
word = "BAT"

In [38]:
word[0]

'B'

In [39]:
word[0] = "C"

TypeError: 'str' object does not support item assignment
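
To "change" a string, you build a new one from pieces of the old one. A minimal sketch that continues the `word` example:

```python
word = "BAT"

# Combine a new first letter with the rest of the original string
new_word = "C" + word[1:]
print(new_word)                 # CAT

# String methods also return new strings
print(word.replace("B", "R"))   # RAT
print(word.lower())             # bat

# The original string is untouched
print(word)                     # BAT
```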

#### Booleans (`bool`)

Booleans represent truth values: `True` or `False`.
They are often the result of comparisons:

In [42]:
is_raining = False
print(type(is_raining))

print(5 > 2)   # True
print(3 < 7)  # True

<class 'bool'>
True
True


#### Other Important Data Types

- List (`list`): an ordered, mutable collection

In [46]:
list_example = ["e", 5, "Introduction to Programming", 5.7]
list_example[1]

5

- Dictionary (`dict`): key–value pairs

In [49]:
dict_example = {
    50: ["E", 'C', 'helloworld'],
    75: "B",
    3: "water"
}
dict_example


{50: ['E', 'C', 'helloworld'], 75: 'B', 3: 'water'}

- Tuple (`tuple`): ordered but immutable

In [50]:
tuple_example = ("ACES", 150.99, 44)
tuple_example

('ACES', 150.99, 44)

- Set (`set`): unordered collection of unique elements

In [51]:
set_example = {1, 2, 2, 3, 4, 5, "Saturday"}
set_example

{1, 2, 3, 4, 5, 'Saturday'}
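
The four collection types differ mainly in whether and how they can be changed. The sketch below uses small made-up collections to show one typical operation for each type:

```python
# List: mutable, so elements can be added and replaced
samples = ["soil", "water"]
samples.append("air")
samples[0] = "sediment"
print(samples)                 # ['sediment', 'water', 'air']

# Dictionary: values are looked up and added by key
grades = {50: "E", 75: "B"}
grades[90] = "A"
print(grades[75])              # B

# Tuple: immutable, so item assignment raises an error
point = (63.49, 20.19)
try:
    point[0] = 0.0
except TypeError as error:
    print("Tuples cannot be modified:", error)

# Set: adding an element that already exists changes nothing
weekdays = {"Mon", "Tue"}
weekdays.add("Mon")
print(len(weekdays))           # 2
```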

Python provides built-in functions to convert between data types (also called *type casting*).

In [52]:
# Convert string to integer
string_to_integer = int("15")
type(string_to_integer)

int

In [53]:
# Convert integer to string
integer_to_string = str(56)
type(integer_to_string)

str

In [54]:
str(56)

'56'

In [55]:
# Convert integer to float
num_float = float(87)
type(num_float)

float

In [56]:
# Convert to collections
print(f'''{list("hello")},
{tuple([1, 2, 3])},
{set([1, 1, 2])}''')    # ['h', 'e', 'l', 'l', 'o'], (1, 2, 3),  {1, 2}

['h', 'e', 'l', 'l', 'o'],
(1, 2, 3),
{1, 2}


In [57]:
int("5") + 5

10
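
Not every conversion succeeds, and some silently drop information. A short sketch of two cases worth knowing:

```python
# int() truncates toward zero; round() rounds to the nearest integer
print(int(3.9))             # 3
print(int(-3.9))            # -3
print(round(3.9))           # 4

# A string with a decimal point cannot be converted to int directly
try:
    int("3.9")
except ValueError as error:
    print("Conversion failed:", error)

# Going through float first works
print(int(float("3.9")))    # 3
```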

### <font color='gold'> Task 2 </font>
- Can you add booleans?
- What is the numeric value of `False`?

In [2]:
int(True) # numeric value of True

1

In [3]:
int(False) # numeric value of False

0

In [1]:
True + True # 1 + 1

2

In [4]:
True + False # 1 + 0

1

In [6]:
False - True  # 0 - 1

-1

In [7]:
True * 10 # 1 * 10

10
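
Because `True` counts as 1 and `False` as 0, summing booleans is a convenient way to count how often a condition holds. A small sketch with made-up temperatures:

```python
temperatures = [12, 25, 31, 18, 27]

# One boolean per value: is the temperature above 20?
is_warm = [t > 20 for t in temperatures]
print(is_warm)                        # [False, True, True, False, True]

# Summing the booleans counts the True values
print(sum(is_warm))                   # 3

# Dividing by the length gives the proportion
print(sum(is_warm) / len(is_warm))    # 0.6
```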

## Functions

Functions package a specific task so you can name it, reuse it, and test it.  
They take inputs, perform work, and (usually) return a result—making programs clearer and less repetitive.

---

#### General syntax

```python
def function_name(parameters):
    """
    Optional docstring that explains what the function does,
    its parameters, return value, and any caveats.
    """
    # operations
    return value

```
- `def` starts the definition
- `function_name` follows Python naming conventions (`snake_case`)
- `parameters` are comma-separated inputs (they may have default values)
- A `return` sends a value back to the caller (optional for “void” functions)

**Example: Body Mass Index (BMI)**

In [60]:
def bmi(weight, height=180):
    """
    Compute Body Mass Index (BMI).

    Parameters
    ----------
    weight : float or int
        Body mass in kilograms.
    height : float or int, default 180
        Body height in centimeters.

    Returns
    -------
    float
        BMI computed as weight / (height/100)**2.

    Notes
    -----
    Raises ValueError if weight or height are non-positive.
    """
    if weight <= 0 or height <= 0:
        raise ValueError("weight and height must be positive")
    return weight / (height / 100) ** 2

# Calls (positional and with default)
print(bmi(85))        # height defaults to 180 cm
print(bmi(85, 195))   # override default height


26.234567901234566
22.353714661406972


> Parameters are the names in the function definition (`weight`, `height`).
> Arguments are the actual values you pass when calling the function (`85`, `195`).
>
> Python supports:
> - Positional arguments: `bmi(85, 180)`
> - Keyword arguments: `bmi(weight=85, height=180)` (clearer, order-independent)
> - Default values: `height=180` lets you omit `height`
>
> `return` hands a value back to the caller. If you omit it, the function returns `None` (a "void" function).
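
The three calling styles can be compared side by side. The sketch below uses a hypothetical function, `describe_station`, with two default values:

```python
def describe_station(name, altitude_m=0, region="unknown"):
    """Return a one-line description of a station (hypothetical helper)."""
    return f"{name} ({region}), {altitude_m} m"

# Positional arguments: the order matters
print(describe_station("Asa", 180, "Götaland"))

# Keyword arguments: the order does not matter
print(describe_station(region="Götaland", name="Asa", altitude_m=180))

# Defaults fill in whatever is omitted
print(describe_station("Asa"))                      # Asa (unknown), 0 m

# Mixing is allowed: positional first, then keywords
print(describe_station("Asa", region="Götaland"))   # Asa (Götaland), 0 m
```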

In [61]:
def square_of(number):
    return number ** 2

def print_greeting(name):
    # No return statement -> returns None
    print(f"Hello, {name}!")

In [62]:
square_of(3)

9

In [63]:
square = square_of(3)
print(square)

9


In [64]:
print_greeting("ACES")

Hello, ACES!


In [65]:
greeting = print_greeting("ACES")
print(greeting) # void function does not return anything

Hello, ACES!
None


You can also add type hints to improve readability and editor support, but they are not enforced at runtime:



In [None]:
def bmi_typed(weight: float, height_cm: float = 180) -> float:
    """BMI with type hints (weight kg, height cm)."""
    if weight <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    return weight / (height_cm / 100) ** 2
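
"Not enforced at runtime" means that Python does not check the hints when the function is called. A minimal sketch with a hypothetical function, `double`, shows this; external tools such as static type checkers can report the mismatch, but Python itself does not:

```python
def double(value: float) -> float:
    """Return twice the input (the hints say float, but nothing checks this)."""
    return value * 2

print(double(2.5))    # 5.0

# A string is accepted without any error, because hints are only annotations
print(double("ab"))   # abab

# The hints are stored on the function and can be inspected
print(double.__annotations__)
```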

You can also use `*args` and `**kwargs` to make your functions more flexible. They let a function accept a variable number of positional and keyword arguments: `*args` collects extra positional arguments into a tuple and `**kwargs` collects extra keyword arguments into a dictionary. They do not change how the function behaves unless you explicitly handle them in your code.

In [73]:
def mean_of(*numbers):
    """Mean of any number of positional arguments."""
    if not numbers:
        raise ValueError("provide at least one number")
    return sum(numbers) / len(numbers)

mean_of(1, 2, 3, 4)


2.5

In [74]:
def labeled_print(prefix, **options):
    """Forward arbitrary keyword options to print()."""
    print(prefix, **options)

labeled_print("Important message", sep=" | ", end=" !!!\n")

Important message !!!


As you have already seen, there are several kinds of functions in Python, and we have used each kind in this notebook:
- **Built-in functions**: These are always available because they are part of Python itself.  
  Examples: `print()` to output data, `int()`, `str()`, `float()` for type conversion, and `input()` to read user input.
- **Imported functions**: Many useful functions live in modules (from the standard library or from installed packages) that you must import before using.  
  Examples: `math.sqrt()`, `math.factorial()`, or `random.randint()`.
- **User-defined functions**: These are the custom functions you write yourself to solve specific tasks.  
  They can also make use of built-in or imported functions.

## Packages

In Python, many extra functions are organized into **packages** (also called libraries).  
You’ve already seen this when we imported packages like `math` and `numpy`.

```python
import math
import numpy as np
```
These packages were already installed in our Python environment. However, if a package is not installed, you must install it before importing it.
In Jupyter notebooks, you can prefix a line with `!` to run it as a shell command, which lets you call `pip` directly from a cell:


In [75]:
!pip install pymannkendall

Collecting pymannkendall
  Downloading pymannkendall-1.4.3-py3-none-any.whl.metadata (14 kB)
Downloading pymannkendall-1.4.3-py3-none-any.whl (12 kB)
Installing collected packages: pymannkendall
Successfully installed pymannkendall-1.4.3


After installation, you can import the package in the usual way:

In [76]:
import pymannkendall
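
If you want to check whether a package is available before importing it, the standard library offers `importlib.util.find_spec`. The sketch below wraps it in a hypothetical helper, `is_installed`; the second package name is deliberately made up so that the lookup fails:

```python
import importlib.util

def is_installed(package_name):
    """Return True if a top-level package or module can be found for import."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("math"))                        # True: part of the standard library
print(is_installed("surely_not_a_real_pkg_xyz"))   # False: made-up name
```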

> <font color='gold'> If you are using Google Colab, many widely used scientific and data analysis packages (like `numpy`, `pandas`, `matplotlib`, `scikit-learn`) are already installed for you.
You only need to install a package if you try to import it and get an error that it’s missing.</font>

There are thousands of Python packages available for science, data analysis, and environmental research.  
The most common places to find them are:
- **[PyPI (Python Package Index)](https://pypi.org/)** → the main repository where almost all Python packages are published.  
- **[Anaconda.org](https://anaconda.org/)** → if you are using Anaconda, you can search for packages there and install them through the Anaconda platform.  
- **Project websites and GitHub** → some research-oriented or niche packages are shared directly on GitHub or project homepages before they appear on PyPI.

You can search directly on PyPI, or even from Google by typing the package name + “PyPI”.  
For example: *“pymannkendall PyPI”*.


### <font color='gold'> Task 3 </font>
Find a Python package that is **not yet installed** in your environment.  
1. Search for it on [PyPI](https://pypi.org/) or Google.  
2. Install it with `!pip install <package-name>`.  
3. Import it into your notebook to confirm the installation.

In [8]:
!pip install rdkit # open-source cheminformatics and machine learning package: https://www.rdkit.org/docs/GettingStartedInPython.html

Collecting rdkit
  Downloading rdkit-2025.3.6-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (4.1 kB)
Downloading rdkit-2025.3.6-cp312-cp312-manylinux_2_28_x86_64.whl (36.1 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2025.3.6


In [9]:
import rdkit

## Lists and NumPy Arrays

In R, the central object type is the **vector**. In Python, we often start with **lists**, and for scientific work, we usually move to **NumPy arrays**, which are very similar to R vectors.


#### Lists

Lists are one-dimensional containers that can store **different types of data** (unlike R vectors, which must hold the same type).  

Examples of creating lists:


In [84]:
# Sequences
#list(range(1, 11))       # 1 to 10
list(range(9, 1, -1))    # reverse sequence

[9, 8, 7, 6, 5, 4, 3, 2]

In [85]:
# Arbitrary elements
[1, 4, 2, 6]

[1, 4, 2, 6]

In [86]:
# Mixing different data types
["A", 1, "B", True, "C", 2, 6, 8]

['A', 1, 'B', True, 'C', 2, 6, 8]

In [None]:
# Repeating patterns
[1, 2] * 5

[1, 2, 1, 2, 1, 2, 1, 2, 1, 2]

Extracting data from a list is done using square brackets `[]`.

<font color='gold'>Remember: in Python indexing starts at `0`, not at `1` like in R.</font>

In [92]:
x = list(range(10, 0, -1))
x

[10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

In [100]:
x[0:3]

[10, 9, 8]

In [93]:
# x holds the numbers 10 down to 1
print(x[0])        # first element (10)
print(x[0:5])      # slice: first 5 elements
print(x[-2])       # second-to-last element (2)

10
[10, 9, 8, 7, 6]
2
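
Negative indices count from the end, and a slice can take a third value, the step. A short sketch that reuses the same list `x`:

```python
x = list(range(10, 0, -1))   # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

print(x[-1])     # 1: last element
print(x[-3:])    # [3, 2, 1]: last three elements
print(x[2:5])    # [8, 7, 6]: positions 2, 3 and 4 (the stop position is excluded)
print(x[::2])    # [10, 8, 6, 4, 2]: every second element
print(x[::-1])   # the whole list in reverse order
print(len(x))    # 10: number of elements
```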


#### NumPy Arrays

For numerical and scientific work, we usually use NumPy arrays.
They behave much more like R vectors: all elements must have the same type, and arithmetic operations are applied elementwise.

In [107]:
import numpy as np

# Creating arrays
#np.arange(1, 11)                  # 1 to 10
#np.array([1, 4, 2, 6])            # arbitrary elements
#np.linspace(0, 1, num=10)         # sequence with equal spacing
#np.tile([1, 2], 5)                # repeat pattern
np.array(["A", "B", "C"])          # character array

array(['A', 'B', 'C'], dtype='<U1')

If you mix types, NumPy will **upcast** everything to a common type that can hold them all.  
You can always check what type NumPy chose by looking at the array’s `.dtype`.

In [108]:
# Mixed strings, numbers, and booleans → everything becomes string
arr1 = np.array(["A", 1, "B", True, "C", 2, 6, 8])
print(arr1)
print(arr1.dtype)   # dtype shows the type of array elements

# Mixing numbers and booleans → booleans are treated as integers (True = 1 and False = 0)
arr2 = np.array([True, 2, 6, 8]) # [1, 2, 6, 8]
print(arr2)
print(arr2.dtype)

['A' '1' 'B' 'True' 'C' '2' '6' '8']
<U21
[1 2 6 8]
int64


Extracting data from NumPy arrays works the same way as it does for lists:

In [12]:
x = np.arange(10, 0, -1) # 10 to 1
#x[0]        # first element
#x[0:5]      # first 5 elements
x[-1]       # last element


np.int64(1)

NumPy arrays support elementwise operations, similar to R vectors:

In [116]:
#np.arange(1, 11) + 5           # add 5 to each element
#np.arange(1, 11) + np.arange(10, 0, -1)  # elementwise addition

np.log10([1, 10, 100, 1000])   # logarithm base 10


array([0., 1., 2., 3.])
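
This is an important difference from plain lists, where `+` and `*` mean concatenation and repetition rather than arithmetic. A minimal side-by-side sketch with illustrative names:

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array([1, 2, 3])

# Lists: * repeats and + concatenates
print(values_list * 2)       # [1, 2, 3, 1, 2, 3]
print(values_list + [10])    # [1, 2, 3, 10]

# Arrays: * and + are applied to every element
print(values_array * 2)      # [2 4 6]
print(values_array + 10)     # [11 12 13]
```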

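You can also apply your own formulas to whole arrays at once:

In [120]:
weights = np.array([85, 90, 95, 100, 105])   # body mass in kg
height = 194                                 # body height in cm, applied to every weight
# Stored as bmi_values so that the bmi() function defined earlier is not overwritten
bmi_values = weights / (height / 100) ** 2
print(bmi_values)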

[22.58475927 23.91327452 25.24178978 26.57030503 27.89882028]
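
To wrap the BMI formula in a reusable function that accepts arrays, the scalar check `if weight <= 0` from the earlier `bmi()` function needs an array-aware form, because a comparison on an array produces one boolean per element. The following is a sketch under that assumption; `bmi_array` is a hypothetical name and `np.any` is one possible way to validate the input:

```python
import numpy as np

def bmi_array(weight_kg, height_cm=180):
    """Compute BMI for single values or arrays (weight in kg, height in cm)."""
    weight_kg = np.asarray(weight_kg, dtype=float)
    height_cm = np.asarray(height_cm, dtype=float)
    # np.any() is True if at least one element fails the check
    if np.any(weight_kg <= 0) or np.any(height_cm <= 0):
        raise ValueError("weight and height must be positive")
    return weight_kg / (height_cm / 100) ** 2

print(bmi_array([85, 90, 95, 100, 105], 194))
```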


Many functions take an array and return a single value:

In [14]:
#np.min(np.arange(1, 11))
#np.max(np.arange(1, 11))
#np.sum(np.arange(1, 11))
np.mean(np.arange(1, 11))


np.float64(5.5)

> **Beyond 1D: Multidimensional Arrays**
>
> So far, we have only worked with **1D arrays** (similar to R vectors).  
> However, NumPy arrays can also be **2D (matrices)** or even **higher-dimensional**.  
> This makes NumPy very powerful for scientific and data analysis tasks.

In [None]:
# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(matrix)
print(matrix.shape)   # (2, 3) → 2 rows, 3 columns

[[1 2 3]
 [4 5 6]]
(2, 3)


In [None]:
# 3D array (tensor)
tensor = np.array([ [[1, 2], [3, 4]],
                    [[5, 6], [7, 8]] ])
print(tensor)
print(tensor.shape)   # (2, 2, 2)

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
(2, 2, 2)
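
With more than one dimension, you index with one position per axis, and many functions accept an `axis` argument. A short sketch that recreates the 2D `matrix` from above:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix[0, 1])          # 2: row 0, column 1
print(matrix[1])             # [4 5 6]: the whole second row
print(matrix[:, 0])          # [1 4]: the whole first column

print(matrix.sum(axis=0))    # [5 7 9]: sums down each column
print(matrix.sum(axis=1))    # [ 6 15]: sums across each row
print(matrix.T.shape)        # (3, 2): the transpose swaps rows and columns
```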


### <font color='gold'> Task 4</font>
Define a function that **rescales** a numeric array to the range [0, 1],  
so that the minimum becomes 0 and the maximum becomes 1.

> *Hint:* for array $x$ and element $x_i$, use:
$$ \frac{x_i - \text{min}(x)}{\text{max}(x) - \text{min}(x)} $$

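Test your function on two integer ranges, for example `np.arange(0, 10)` and `np.arange(-5, 5)` (the stop value is excluded, unlike in R's `0:10` and `-5:5`). Are the results similar or different?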

In [15]:
def rescale_01(x):
  return (x - np.min(x)) / (np.max(x) - np.min(x))

In [16]:
arr1 = np.arange(0, 10)
arr2 = np.arange(-5, 5)

In [17]:
rescale_01(arr1)

array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
       0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])

In [18]:
rescale_01(arr2)

array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
       0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])
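
One edge case is worth knowing: if all values are equal, `max(x) - min(x)` is zero and the formula divides zero by zero, which gives `nan`. The sketch below adds a guard for that case; `rescale_01_safe` is a hypothetical name, and returning zeros for a constant array is just one possible convention:

```python
import numpy as np

def rescale_01_safe(x):
    """Rescale values to [0, 1]; a constant array is mapped to zeros (an assumption)."""
    x = np.asarray(x, dtype=float)
    span = np.max(x) - np.min(x)
    if span == 0:
        return np.zeros_like(x)
    return (x - np.min(x)) / span

print(rescale_01_safe([0, 5, 10]))   # [0.  0.5 1. ]
print(rescale_01_safe([3, 3, 3]))    # [0. 0. 0.]
```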

## Pandas DataFrame

In real-world data analysis, information is often organized in **tables**,  
where columns contain variables of different types. In Python, the most common way to work with such data is with the **Pandas** library.

Official docs: [https://pandas.pydata.org/docs/](https://pandas.pydata.org/docs/)

#### Creating a DataFrame
You can create a DataFrame from a **dictionary** - recall from the [Data Types](#data-types) chapter that dictionaries store data as key–value pairs.  
Here, keys become column names and values become the column data.

In [None]:
!pip install pandas

In [19]:
import pandas as pd

stations = pd.DataFrame({
    "ID": ["101", "102", "103", "104"],
    "name": ["Umeå", "Vindeln", "Siljansfors", "Asa"],
    "longitude": [20.19, 19.46, 14.24, 14.47],
    "latitude": [63.49, 64.14, 60.53, 57.10],
    "altitude": [33, 225, 320, 180]
})

print(stations)
print(type(stations))

    ID         name  longitude  latitude  altitude
0  101         Umeå      20.19     63.49        33
1  102      Vindeln      19.46     64.14       225
2  103  Siljansfors      14.24     60.53       320
3  104          Asa      14.47     57.10       180
<class 'pandas.core.frame.DataFrame'>


Adding a column: simply assign a new Series or list to a column name.

In [20]:
regions = ["Norrland", "Norrland", "Svealand", "Götaland"]
stations["region"] = regions
stations

    ID         name  longitude  latitude  altitude    region
0  101         Umeå      20.19     63.49        33  Norrland
1  102      Vindeln      19.46     64.14       225  Norrland
2  103  Siljansfors      14.24     60.53       320  Svealand
3  104          Asa      14.47     57.10       180  Götaland


Adding a row: use `pd.concat()` to stack rows.

In [22]:
new_station = pd.DataFrame({
    "ID": ["105"],
    "name": ["Tönnersjöheden"],
    "longitude": [13.07],
    "latitude": [56.42],
    "altitude": [80],
    "region": ["Götaland"]
})

stations = pd.concat([stations, new_station], ignore_index=True)
stations

    ID            name  longitude  latitude  altitude    region
0  101            Umeå      20.19     63.49        33  Norrland
1  102         Vindeln      19.46     64.14       225  Norrland
2  103     Siljansfors      14.24     60.53       320  Svealand
3  104             Asa      14.47     57.10       180  Götaland
4  105  Tönnersjöheden      13.07     56.42        80  Götaland


> <font color='gold'>*Think*: Why do we use `ignore_index=True`? </font>
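
One way to explore this question is to compare the row labels with and without `ignore_index=True`. The sketch below uses two tiny illustrative DataFrames that reuse station names from above:

```python
import pandas as pd

first = pd.DataFrame({"name": ["Umeå", "Vindeln"]})
second = pd.DataFrame({"name": ["Asa"]})

# Without ignore_index, each DataFrame keeps its own row labels
kept = pd.concat([first, second])
print(list(kept.index))          # [0, 1, 0]: the label 0 appears twice

# With ignore_index=True, the rows are renumbered from 0
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))    # [0, 1, 2]
```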

If a row is missing some columns, Pandas will fill them with `NaN`.

In [24]:
extra_station = pd.DataFrame({
    "ID": ["106"],
    "name": ["Kulbäcksliden"],
    "longitude": [19.49],
    "latitude": [64.52],
    "region": ["Norrland"]
})

stations = pd.concat([stations, extra_station], ignore_index=True)
stations


    ID            name  longitude  latitude  altitude    region
0  101            Umeå      20.19     63.49      33.0  Norrland
1  102         Vindeln      19.46     64.14     225.0  Norrland
2  103     Siljansfors      14.24     60.53     320.0  Svealand
3  104             Asa      14.47     57.10     180.0  Götaland
4  105  Tönnersjöheden      13.07     56.42      80.0  Götaland
5  106   Kulbäcksliden      19.49     64.52       NaN  Norrland
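
Note that the altitudes are now shown with a decimal point: `NaN` is a floating-point value, so an integer column that receives a missing value is converted to `float64`. The sketch below, on a small illustrative DataFrame, shows this and two common ways to inspect missing values:

```python
import numpy as np
import pandas as pd

altitudes = pd.DataFrame({
    "name": ["Asa", "Kulbäcksliden"],
    "altitude": [180, np.nan],      # one missing value
})

print(altitudes["altitude"].dtype)          # float64, because NaN is a float
print(altitudes["altitude"].isna())         # True where the value is missing
print(altitudes["altitude"].isna().sum())   # 1: number of missing values
print(altitudes["altitude"].mean())         # 180.0: NaN is skipped by default
```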


We often need to subset (extract rows and columns) from our DataFrames.  
In Pandas, there are several ways to do this:

- **Square brackets `[]`** → most common way to access one or more columns.  
- **Dot notation `.`** → shorthand for simple column names (no spaces/special characters).  
- **`.iloc[]`** → position-based indexing (rows and columns by integer index).  
- **`.loc[]`** → label-based indexing (rows and columns by their names).  
- **`.drop()`** → remove specific rows or columns by label or index.
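
The difference between `.iloc[]` and `.loc[]` is hard to see while the row labels are simply 0, 1, 2, and so on. The sketch below uses an illustrative DataFrame with the labels 10, 20, 30, 40 (an assumption made only for this demonstration) to separate positions from labels:

```python
import pandas as pd

demo = pd.DataFrame(
    {"name": ["Umeå", "Vindeln", "Siljansfors", "Asa"],
     "altitude": [33, 225, 320, 180]},
    index=[10, 20, 30, 40],          # row labels that differ from positions
)

# .iloc uses positions; the stop position is excluded
print(demo.iloc[0:2])                # rows at positions 0 and 1

# .loc uses labels; the stop label is included
print(demo.loc[10:30])               # rows labelled 10, 20 and 30

# The same cell, addressed in both ways
print(demo.iloc[2, 0])               # Siljansfors
print(demo.loc[30, "name"])          # Siljansfors
```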


In [26]:
# Column access
#columns_needed = ['name', 'longitude']
#stations[columns_needed]      # using brackets
stations[['name', 'longitude']] # same as previous
#stations.region         # using dot notation (works only for simple names)

             name  longitude
0            Umeå      20.19
1         Vindeln      19.46
2     Siljansfors      14.24
3             Asa      14.47
4  Tönnersjöheden      13.07
5   Kulbäcksliden      19.49


In [30]:
stations

    ID            name  longitude  latitude  altitude    region
0  101            Umeå      20.19     63.49      33.0  Norrland
1  102         Vindeln      19.46     64.14     225.0  Norrland
2  103     Siljansfors      14.24     60.53     320.0  Svealand
3  104             Asa      14.47     57.10     180.0  Götaland
4  105  Tönnersjöheden      13.07     56.42      80.0  Götaland
5  106   Kulbäcksliden      19.49     64.52       NaN  Norrland


In [31]:
# Single cell (row 3, column 2 → remember Python is 0-based)
stations.iloc[2, 1]     # position-based indexing
#stations.loc[2, "name"] # label-based indexing

'Siljansfors'

In [32]:
# Slicing rows
stations.iloc[0:3]      # first 3 rows
#stations.loc[0:2]       # rows with index labels 0, 1, 2

    ID         name  longitude  latitude  altitude    region
0  101         Umeå      20.19     63.49      33.0  Norrland
1  102      Vindeln      19.46     64.14     225.0  Norrland
2  103  Siljansfors      14.24     60.53     320.0  Svealand


In [33]:
# Selecting multiple columns
#stations[["name", "altitude"]]
stations.loc[:, ["name", "altitude"]]

             name  altitude
0            Umeå      33.0
1         Vindeln     225.0
2     Siljansfors     320.0
3             Asa     180.0
4  Tönnersjöheden      80.0
5   Kulbäcksliden       NaN


In [34]:
# Drop a column
stations.drop(columns="longitude")

# Drop multiple columns
stations.drop(columns=["longitude", "latitude"])

# Drop a row (by index label)
stations.drop(index=2)  # removes the 3rd row (index 2)


Unnamed: 0,ID,name,longitude,latitude,altitude,region
0,101,Umeå,20.19,63.49,33.0,Norrland
1,102,Vindeln,19.46,64.14,225.0,Norrland
3,104,Asa,14.47,57.1,180.0,Götaland
4,105,Tönnersjöheden,13.07,56.42,80.0,Götaland
5,106,Kulbäcksliden,19.49,64.52,,Norrland


In [54]:
stations # the original DataFrame is unchanged: .drop() returned new objects and we neither reassigned them nor used inplace=True

Unnamed: 0,ID,name,longitude,latitude,altitude,region
0,101,Umeå,20.19,63.49,33.0,Norrland
1,102,Vindeln,19.46,64.14,225.0,Norrland
2,103,Siljansfors,14.24,60.53,320.0,Svealand
3,104,Asa,14.47,57.1,180.0,Götaland
4,105,Tönnersjöheden,13.07,56.42,80.0,Götaland
5,105,Tönnersjöheden,13.07,56.42,80.0,Götaland
6,105,Tönnersjöheden,13.07,56.42,80.0,Götaland
7,106,Kulbäcksliden,19.49,64.52,,Norrland
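
To actually keep the result of `.drop()`, assign it to a name. A small sketch with a made-up two-row DataFrame (`demo` and `trimmed` are illustrative names, not part of the workshop data):

```python
import pandas as pd

demo = pd.DataFrame({"ID": [101, 102], "longitude": [20.19, 19.46]})

demo.drop(columns="longitude")             # result is discarded; demo is unchanged
trimmed = demo.drop(columns="longitude")   # keep the result under a new name

print(list(demo.columns))
print(list(trimmed.columns))
```

Reassigning (`demo = demo.drop(...)`) works the same way if you do not need the original any more.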


We can also filter rows based on specific conditions.

In [35]:
# Filter rows where altitude is exactly (`==`) 80
stations[stations["altitude"] == 80]

Unnamed: 0,ID,name,longitude,latitude,altitude,region
4,105,Tönnersjöheden,13.07,56.42,80.0,Götaland


In [36]:
# Filter rows where altitude is NOT equal to 80
stations[stations["altitude"] != 80]

Unnamed: 0,ID,name,longitude,latitude,altitude,region
0,101,Umeå,20.19,63.49,33.0,Norrland
1,102,Vindeln,19.46,64.14,225.0,Norrland
2,103,Siljansfors,14.24,60.53,320.0,Svealand
3,104,Asa,14.47,57.1,180.0,Götaland
5,106,Kulbäcksliden,19.49,64.52,,Norrland


In [38]:
# To remove missing values use .dropna()
stations.dropna()

Unnamed: 0,ID,name,longitude,latitude,altitude,region
0,101,Umeå,20.19,63.49,33.0,Norrland
1,102,Vindeln,19.46,64.14,225.0,Norrland
2,103,Siljansfors,14.24,60.53,320.0,Svealand
3,104,Asa,14.47,57.1,180.0,Götaland
4,105,Tönnersjöheden,13.07,56.42,80.0,Götaland


In [39]:
# Filter rows where altitude is between 100 and 200
stations[(stations["altitude"] > 100) & (stations["altitude"] < 200)]

Unnamed: 0,ID,name,longitude,latitude,altitude,region
3,104,Asa,14.47,57.1,180.0,Götaland


In [42]:
# Filter rows where altitude is either > 200 or < 100
stations[(stations["altitude"] > 200) | (stations["altitude"] < 100)]

Unnamed: 0,ID,name,longitude,latitude,altitude,region
0,101,Umeå,20.19,63.49,33.0,Norrland
1,102,Vindeln,19.46,64.14,225.0,Norrland
2,103,Siljansfors,14.24,60.53,320.0,Svealand
4,105,Tönnersjöheden,13.07,56.42,80.0,Götaland
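
Note how the station with a missing altitude behaves in the filters above: it passes `!=` but fails `>` and `<`. A short sketch of this, using a made-up three-row DataFrame and the `between()` and `isna()` methods as one possible way to express the same filters:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "name": ["Umeå", "Asa", "Kulbäcksliden"],
    "altitude": [33.0, 180.0, np.nan],
})

# Comparisons with NaN are False, so the NaN row fails both `>` and `<` ...
strictly_between = demo[demo["altitude"].between(100, 200, inclusive="neither")]
# ... but it passes `!=`, because NaN is not equal to anything
not_80 = demo[demo["altitude"] != 80]
# Rows with a missing altitude have to be selected explicitly
missing = demo[demo["altitude"].isna()]

print(strictly_between["name"].tolist())
print(len(not_80))
print(missing["name"].tolist())
```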


### Categorical Data in Pandas

In R, factors are used for categorical variables with defined levels.  
Pandas provides a similar feature through the **Categorical dtype**.

We can convert a column to categorical with `astype("category")`.

In [43]:
# Example: convert the region column to categorical
stations["region"] = stations["region"].astype("category")

print(stations.dtypes)         # check column types
print(stations["region"].cat.categories)  # view the categories (levels)


ID             object
name           object
longitude     float64
latitude      float64
altitude      float64
region       category
dtype: object
Index(['Götaland', 'Norrland', 'Svealand'], dtype='object')


> Why use categorical data?
> - Saves memory when there are repeated values.  
> - Allows explicit control of category order (useful for plotting or modeling).  
> - Makes data more consistent (only defined categories are allowed).
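
The point about category order can be sketched as follows. The three labels and their order are chosen only for illustration; `pd.CategoricalDtype` is one way to declare an ordered categorical:

```python
import pandas as pd

sizes = pd.Series(["strong", "minor", "great", "minor"])

# Illustrative ordering: weakest to strongest
order = ["minor", "strong", "great"]
sizes_cat = sizes.astype(pd.CategoricalDtype(categories=order, ordered=True))

sorted_sizes = sizes_cat.sort_values()               # sorts by category order, not alphabetically
at_least_strong = sizes_cat[sizes_cat >= "strong"]   # comparisons respect the order

print(list(sorted_sizes))
print(list(at_least_strong))
```

Without the explicit order, sorting would be alphabetical and `"great"` would come first.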


### <font color='gold'> Task 5 </font>  

Create a DataFrame named `earthquake_classes` with three columns:  
- `class`  
- `magnitude`  
- `description`  

Use the information from this image provided by the Alaska Earthquake Center:  
![earthquake.png](https://earthquake.alaska.edu/sites/default/files/inline-images/magnitude%20classes_0.png)  

Finally, convert the `class` column into a **categorical** variable.  



In [44]:
earthquake_classes  = pd.DataFrame({
    'class': ['minor', 'light', 'moderate', 'strong', 'major', 'great'],
    'magnitude': ['3.0-3.9', '4.0-4.9', '5.0-5.9', '6.0-6.9', '7.0-7.9', '8.0 or larger'],
    'description': ['may be felt', 'likely felt', 'minor damage may occur', 'damage may occur', 'damage expected', 'significant damage expected']})
earthquake_classes

Unnamed: 0,class,magnitude,description
0,minor,3.0-3.9,may be felt
1,light,4.0-4.9,likely felt
2,moderate,5.0-5.9,minor damage may occur
3,strong,6.0-6.9,damage may occur
4,major,7.0-7.9,damage expected
5,great,8.0 or larger,significant damage expected


In [47]:
print(f'Before conversion:\n {earthquake_classes.dtypes}') # before conversion
earthquake_classes['class'] = earthquake_classes['class'].astype('category')
print(f'After conversion:\n {earthquake_classes.dtypes}') # after conversion

Before conversion:
 class          category
magnitude        object
description      object
dtype: object
After conversion:
 class          category
magnitude        object
description      object
dtype: object


## Reading and Writing Data Files

Hopefully, Task 5 gave you an idea that **entering data manually can be tedious and error-prone**.  
Fortunately, in most real workflows we can **read data directly from files** instead of typing everything by hand.  

Depending on the field, data can come in very different formats.  
For example:  
- **Text-based tables** (e.g., medical records stored as `.txt` files) that can be read with simple text import functions.  
- **Highly specialized formats**, such as weather radar data stored in the `ODIM HDF5` standard, which require special packages to read and interpret.

In this course, however, we will mainly deal with **tabular data**, most often stored in:  
- **CSV** (`.csv`) → comma-separated (`,`) values  
- **TSV** (`.tsv`) → tab-separated (`\t`) values  
- **Excel** (`.xlsx`, `.xls`) → spreadsheet files  

Although CSV files are popular, they can be problematic because commas often appear inside data values (e.g., in addresses or descriptions).  
Such values have to be quoted consistently; when they are not, parsing becomes difficult and error-prone.  
**TSV files** are usually safer: tabs rarely appear inside the data, making these files easier to parse reliably.  
That’s why I prefer - <font color='#2fa1ff'>and strongly recommend</font> - using TSV files whenever possible.  
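
As a rough illustration of the comma issue, the sketch below writes the same made-up table as CSV and as TSV text. The DataFrame and its column names are invented for this example; `io.StringIO` stands in for a real file.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    "name": ["Asa", "Umeå"],
    "note": ["spruce, pine", "coastal site"],   # the first value contains a comma
})

csv_text = demo.to_csv(index=False)             # pandas wraps the comma value in quotes
tsv_text = demo.to_csv(index=False, sep="\t")   # no quoting is needed with tabs

print(csv_text)
print(tsv_text)

# Both versions can be read back into the same table
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```

Pandas handles the quoting for us here; trouble typically starts with files produced by tools that skip it.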


There are **several ways** to import data into Python.  
The most common and flexible function is `pd.read_csv()`, which works not only for CSV but also for many text-based tabular formats.  
You can also use `pd.read_table()`, which works the same way but assumes tab-separated data by default; pass `sep=` to choose another delimiter.  

To try this out, download the file `berry_data.csv` from the following source:  
Langvall, O. (2021). *Swedish Forest Phenology dataset (Version 1)* [Data set]. Swedish University of Agricultural Sciences. https://doi.org/10.5878/jbab-cy46  
> Dataset page: https://researchdata.se/en/catalogue/dataset/2021-194-1/1  

Once you have the file, place it in your working directory (or upload it if you are using Google Colab).  

Now let’s read it into Python using both `read_table()` and `read_csv()`.

In [87]:
import os

# Check files in the working directory
print(os.listdir())

['.config', 'berry_data.csv', 'sample_data']


In [49]:
# read_table() assumes tabs by default, so the comma delimiter must be given explicitly
#pd.read_table("berry_data.csv", sep=",")

# read_csv() assumes a comma delimiter by default
pd.read_csv("berry_data.csv")


Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
0,103,2440100,2006,135.0,0.0,,
1,103,2440100,2006,142.0,50.3,,
2,103,2440100,2006,144.0,83.4,,
3,103,2440100,2006,149.0,111.7,,
4,103,2440100,2006,153.0,110.5,0.0,
...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1
1885,103,2440200,2020,283.0,,,2.9
1886,103,2440200,2020,289.0,,,2.4
1887,103,2440200,2020,300.0,,,0.9


### <font color='gold'> Task 6 </font>  

1. Read the file `berry_data.csv` into a DataFrame called `berry_data`.  
2. Explore the dataset:  
   - How many **rows** and **columns** does it contain?  
   - What are the **features** (**variables**) in the dataset?  

> *Hint:* For the first part, check the `.shape` attribute of the DataFrame.  
> For the second part, check the `.columns` attribute and consult the accompanying `Metadata.pdf` file.


In [50]:
berry_data = pd.read_csv('berry_data.csv')

In [51]:
berry_data.shape # 1889 rows and 7 columns

(1889, 7)

In [52]:
berry_data.columns

Index(['Station', 'Species', 'Year', 'doy', 'Flowers', 'Unripe', 'Ripe'], dtype='object')

> Files may have quirks (custom missing value markers, unexpected separators, wrong data types, etc.).  
> Most of these can be handled via parameters in `pd.read_csv()` and other `read_*` functions.  
> Check the [Pandas documentation on IO tools](https://pandas.pydata.org/docs/user_guide/io.html) for details and examples.
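
A minimal sketch of handling such quirks. The file content below is invented for illustration (semicolon-separated, with `-999` as a missing value marker) and is read from a string instead of a real file:

```python
import io
import pandas as pd

# Made-up file content: semicolon-separated, with "-999" marking missing values
raw = "Station;Year;Ripe\n103;2006;4.1\n104;2007;-999\n"

demo = pd.read_csv(
    io.StringIO(raw),          # stands in for a file path
    sep=";",                   # non-default separator
    na_values=["-999"],        # treat this marker as missing
    dtype={"Station": str},    # read station codes as text, not numbers
)
print(demo)
print(demo.dtypes)
```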

In [53]:
berry_data

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
0,103,2440100,2006,135.0,0.0,,
1,103,2440100,2006,142.0,50.3,,
2,103,2440100,2006,144.0,83.4,,
3,103,2440100,2006,149.0,111.7,,
4,103,2440100,2006,153.0,110.5,0.0,
...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1
1885,103,2440200,2020,283.0,,,2.9
1886,103,2440200,2020,289.0,,,2.4
1887,103,2440200,2020,300.0,,,0.9


In [55]:
# Quick look at the dataset
berry_data.info()
#berry_data.head(5) # first 5 rows
#berry_data.tail(3) # last 3 rows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1889 entries, 0 to 1888
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Station  1889 non-null   int64  
 1   Species  1889 non-null   int64  
 2   Year     1889 non-null   int64  
 3   doy      1888 non-null   float64
 4   Flowers  904 non-null    float64
 5   Unripe   1267 non-null   float64
 6   Ripe     1030 non-null   float64
dtypes: float64(4), int64(3)
memory usage: 103.4 KB


After processing data in Python, we often want to export it back to a file. Pandas DataFrames provide the convenient methods `to_csv()` and `to_excel()` for writing tabular data.

In [56]:
# Save as TSV (tab-separated values)
berry_data.to_csv("berry_data.tsv", sep="\t", index=False)

In [60]:
# Check files in the working directory
print(os.listdir())

['.config', 'berry_data.tsv', 'berry_data.csv', 'sample_data']
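
To check that an exported file can be read back unchanged, a round trip like the following can help. This is a sketch on a made-up DataFrame; the file name `demo.tsv` and the temporary directory are only there to keep the example self-contained.

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame({"Station": [103, 104], "Ripe": [4.1, None]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.tsv")       # illustrative file name
    demo.to_csv(path, sep="\t", index=False)   # write without the row index
    restored = pd.read_csv(path, sep="\t")     # read back with the same separator

print(restored.equals(demo))
```

Without `index=False`, the row index would be written as an extra first column and come back as `Unnamed: 0` when the file is read again.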


### Saving and loading Python objects (Pickle)

Saving tables as text files (like CSV or TSV) is good for interoperability, but if a file is created in Python and will later be processed again in Python, it can be more efficient and reliable to save it in a **binary format**.  

This can be done with the `pickle` module.  
Pickle allows you to save one or more Python objects into a file and load them again later.

In [103]:
import pickle

In [104]:
#import pickle

x = 1

# Save variable x and the stations DataFrame into a pickle file
with open("objects.pkl", "wb") as f:
    pickle.dump((x, stations), f)

In [105]:
print(os.listdir())

['.config', 'berry_data.tsv', 'berry_data.csv', 'objects.pkl', 'sample_data']


In [107]:
# Load objects back
with open("objects.pkl", "rb") as f:
    x_loaded, stations_loaded = pickle.load(f)

print(x_loaded)
#print(stations_loaded.head())


1


In [108]:
x

1
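
For a single DataFrame, pandas also offers its own pickle helpers, `DataFrame.to_pickle()` and `pd.read_pickle()`. A small sketch on a made-up DataFrame (the file name `demo.pkl` is illustrative), showing that a dtype such as `category` survives the round trip:

```python
import os
import tempfile
import pandas as pd

demo = pd.DataFrame({"region": ["Norrland", "Götaland"]})
demo["region"] = demo["region"].astype("category")   # a dtype that a CSV file would not keep

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")   # illustrative file name
    demo.to_pickle(path)                   # pandas wrapper around pickle
    restored = pd.read_pickle(path)

print(restored.dtypes)
```

Only load pickle files from sources you trust: unpickling can run arbitrary code.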

> 🥒 **Fun fact about names in Python**  
> Why is it called *pickle*? Because just like real pickling preserves food, Python's `pickle` module preserves your Python objects so you can use them later!
> Python has quite a few other modules with funny or surprising names:  
> - **turtle** 🐢 → lets you draw shapes with a turtle crawling on the screen.  
> - **antigravity** 🚀 → opens a classic xkcd comic about Python in your browser.  
> - **this** 📜 → prints “The Zen of Python” with guiding principles for writing Pythonic code.
> - **beautifulsoup** 🍲 → a third-party package (installed as `beautifulsoup4`, imported as `bs4`) for web scraping, because it helps you make sense of messy “soup-like” HTML.  
>
> Part of Python’s charm is that its community doesn’t take itself too seriously - names are often practical, but sometimes also playful!


In [109]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [124]:
!pip3 install ColabTurtle

Collecting ColabTurtle
  Downloading ColabTurtle-2.1.0.tar.gz (6.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ColabTurtle
  Building wheel for ColabTurtle (setup.py) ... done
  Created wheel for ColabTurtle: filename=ColabTurtle-2.1.0-py3-none-any.whl size=7643 sha256=5e7828d7eb6302a6bc6c3d28e787843b1ad6f62e98d073e57e5a1eead9a873c3
  Stored in directory: /root/.cache/pip/wheels/9f/af/64/ffd85f9858ed7d56b7293dcedbc9d461bf13c8cfc97e352bc8
Successfully built ColabTurtle
Installing collected packages: ColabTurtle
Successfully installed ColabTurtle-2.1.0


In [126]:
import ColabTurtle.Turtle as lia
lia.initializeTurtle(initial_speed=5)
lia.color('blue')
lia.forward(100)
lia.right(45)
lia.color('red')
lia.forward(50)

## Data Wrangling and Data Manipulation  

In real projects, data is **rarely in the exact format you need**.  
Before cleaning or transforming, first **explore** what you have.

Exploration helps answer:
- How many rows/columns?
- Which variables (features) are included?
- What are the dtypes?
- Are there missing or unusual values?

**Useful Pandas functions** (with `berry_data` as example):


In [114]:
berry_data.shape # (rows, columns)

(1889, 7)

In [115]:
berry_data.columns # column names

Index(['Station', 'Species', 'Year', 'doy', 'Flowers', 'Unripe', 'Ripe'], dtype='object')

In [116]:
berry_data.info() # dtypes + non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1889 entries, 0 to 1888
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Station  1889 non-null   int64  
 1   Species  1889 non-null   int64  
 2   Year     1889 non-null   int64  
 3   doy      1888 non-null   float64
 4   Flowers  904 non-null    float64
 5   Unripe   1267 non-null   float64
 6   Ripe     1030 non-null   float64
dtypes: float64(4), int64(3)
memory usage: 103.4 KB


In [117]:
berry_data.describe() # quick stats for numeric columns

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
count,1889.0,1889.0,1889.0,1888.0,904.0,1267.0,1030.0
mean,103.338274,2440142.0,2012.823187,199.893008,10.957528,15.287385,7.074856
std,1.106329,49.44074,4.388867,49.14091,22.022614,22.380494,10.738629
min,102.0,2440100.0,2006.0,105.0,0.0,0.0,0.0
25%,102.0,2440100.0,2009.0,160.0,0.0,0.0,0.0
50%,103.0,2440100.0,2013.0,194.0,0.5,4.1,2.0
75%,104.0,2440200.0,2017.0,238.0,10.5,22.2,10.175
max,105.0,2440200.0,2020.0,320.0,134.5,120.0,69.2


In [118]:
berry_data.corr() # correlation matrix between all the numeric columns in the data frame

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
Station,1.0,-0.05645,-0.068834,-0.330701,-0.070824,-0.069632,-0.141287
Species,-0.05645,1.0,-0.033001,0.140806,-0.127518,-0.137048,-0.159318
Year,-0.068834,-0.033001,1.0,0.074191,-0.046718,0.046571,0.05525
doy,-0.330701,0.140806,0.074191,1.0,-0.142158,-0.22428,0.011158
Flowers,-0.070824,-0.127518,-0.046718,-0.142158,1.0,-0.216441,-0.045709
Unripe,-0.069632,-0.137048,0.046571,-0.22428,-0.216441,1.0,-0.303639
Ripe,-0.141287,-0.159318,0.05525,0.011158,-0.045709,-0.303639,1.0


In [119]:
berry_data.head() # first rows
#berry_data.tail(3) # last 3 rows

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
0,103,2440100,2006,135.0,0.0,,
1,103,2440100,2006,142.0,50.3,,
2,103,2440100,2006,144.0,83.4,,
3,103,2440100,2006,149.0,111.7,,
4,103,2440100,2006,153.0,110.5,0.0,


In [61]:
berry_data["Station"].unique() # distinct values in a column

array([103, 102, 104, 105])

In [62]:
berry_data["Station"].value_counts() # frequency counts

Unnamed: 0_level_0,count
Station,Unnamed: 1_level_1
102,581
104,494
103,451
105,363


In [63]:
berry_data.isnull() # missing values

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
0,False,False,False,False,False,True,True
1,False,False,False,False,False,True,True
2,False,False,False,False,False,True,True
3,False,False,False,False,False,True,True
4,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...
1884,False,False,False,False,True,True,False
1885,False,False,False,False,True,True,False
1886,False,False,False,False,True,True,False
1887,False,False,False,False,True,True,False
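
A table of `True`/`False` values is hard to read for a large dataset, so it is common to summarise it per column. A sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "doy": [135.0, 142.0, np.nan],
    "Ripe": [np.nan, np.nan, 4.1],
})

missing_per_column = demo.isna().sum()   # True counts as 1, so the sum is a count
share_missing = demo.isna().mean()       # fraction of missing values per column

print(missing_per_column)
print(share_missing)
```

The same two lines work on `berry_data`.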


Once we know what the dataset looks like, the next step is wrangling:
the process of cleaning, structuring, and transforming raw or messy data into a usable format.

Typical wrangling tasks include:
- Handling missing values
- Merging multiple datasets
- Reshaping between wide and long formats
- Converting variables to the right data types (`dtypes`)

Closely related is data manipulation, which covers operations like:
- Filtering rows
- Sorting observations
- Aggregating values
- Creating or modifying columns

Pandas API docs: https://pandas.pydata.org/docs/


### Core Pandas Operations

Pandas is centered on the **DataFrame/Series** objects and a set of **methods** you chain together.
Most methods return a new object, so you can build readable pipelines (there is no exact equivalent of R's `%>%` pipe, but method chaining and `.pipe()` play a similar role).

**Index/select**
- `df[...]`, `.loc[row_sel, col_sel]`, `.iloc[row_idx, col_idx]`
- `query()` (string conditions), `eval()` (expression evaluation)

**Missing data**
- `isna()`, `notna()`
- `dropna()` (remove missing), `fillna()` (impute/replace)

**Create/modify columns**
- `assign(...)` (add columns), direct assignment `df["new"] = ...`
- `rename(columns=...)`, `astype(...)` (change dtype)
- Vectorized helpers: `np.where`, `np.select`, string ops via `.str.*`

**Summaries & grouping**
- `agg(...)`, `mean()`, `size()`, `nunique()`
- `groupby(...).agg(...)`, `groupby(...).transform(...)` (broadcast group stats back to rows)

**Sorting & ordering**
- `sort_values(...)`, `sort_index(...)`

**Reshape (wide ↔ long)**
- `melt()` (wide → long), `pivot()` / `pivot_table()` (long → wide)
- `stack()`, `unstack()` for index-based reshaping

**Combine tables**
- `merge(left, right, how=..., on=...)` (joins)
- `concat([...], axis=0/1)` (stack rows / add columns)
- `join()` (index-based combine)

**Duplicates & uniques**
- `duplicated()`, `drop_duplicates()`, `unique()`, `value_counts()`

**Categorical & datetimes**
- `astype("category")`, `.cat.categories`, `.cat.reorder_categories(...)`
- `to_datetime(...)`, `.dt` accessor; time series `set_index(...).resample("ME").agg(...)` (month-end frequency; written `"M"` in pandas versions before 2.2)

**I/O**
- `read_csv()`, `read_table()`, `read_excel()`, `to_csv()`, `to_parquet()`…
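
Several of the methods listed above can be combined into one chain. The sketch below uses a made-up DataFrame, and `add_flag` is a hypothetical helper written only to show how `.pipe()` fits into a chain:

```python
import pandas as pd

demo = pd.DataFrame({
    "Station": [102, 102, 103, 103],
    "Year": [2019, 2020, 2020, 2020],
    "Ripe": [1.0, 3.0, 5.0, None],
})

def add_flag(df, threshold):
    # Hypothetical helper for illustration: flag rows above a threshold
    return df.assign(high=df["Ripe"] > threshold)

result = (
    demo
    .query("Year == 2020")                 # filter rows
    .dropna(subset=["Ripe"])               # remove rows with a missing Ripe value
    .pipe(add_flag, threshold=4)           # apply a custom function inside the chain
    .sort_values("Ripe", ascending=False)  # largest values first
    .reset_index(drop=True)                # renumber the rows
)
print(result)
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer pipelines readable.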




In [64]:
# Keep only the columns 'doy' (day of year) and 'Ripe' from the dataset
berry_data[["doy", "Ripe"]]

Unnamed: 0,doy,Ripe
0,135.0,
1,142.0,
2,144.0,
3,149.0,
4,153.0,
...,...,...
1884,276.0,4.1
1885,283.0,2.9
1886,289.0,2.4
1887,300.0,0.9


In [65]:
# Keep all columns except 'doy' and 'Ripe'
berry_data.drop(columns=["doy", "Ripe"])


Unnamed: 0,Station,Species,Year,Flowers,Unripe
0,103,2440100,2006,0.0,
1,103,2440100,2006,50.3,
2,103,2440100,2006,83.4,
3,103,2440100,2006,111.7,
4,103,2440100,2006,110.5,0.0
...,...,...,...,...,...
1884,103,2440200,2020,,
1885,103,2440200,2020,,
1886,103,2440200,2020,,
1887,103,2440200,2020,,


In [66]:
# Rename columns
berry_data.rename(columns={"doy": "day_of_year"})

Unnamed: 0,Station,Species,Year,day_of_year,Flowers,Unripe,Ripe
0,103,2440100,2006,135.0,0.0,,
1,103,2440100,2006,142.0,50.3,,
2,103,2440100,2006,144.0,83.4,,
3,103,2440100,2006,149.0,111.7,,
4,103,2440100,2006,153.0,110.5,0.0,
...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1
1885,103,2440200,2020,283.0,,,2.9
1886,103,2440200,2020,289.0,,,2.4
1887,103,2440200,2020,300.0,,,0.9


In [67]:
# Keep only the rows where the value in 'Year' is greater than 2019
berry_data.loc[berry_data["Year"] > 2019]

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
224,103,2440100,2020,141.0,0.0,,
225,103,2440100,2020,148.0,15.0,0.0,
226,103,2440100,2020,155.0,39.2,0.0,
227,103,2440100,2020,164.0,3.7,26.7,
228,103,2440100,2020,170.0,0.1,21.6,
...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1
1885,103,2440200,2020,283.0,,,2.9
1886,103,2440200,2020,289.0,,,2.4
1887,103,2440200,2020,300.0,,,0.9


In [68]:
# Keep only the rows where the value in 'Year' is exactly 2012
berry_data.loc[berry_data.Year == 2012]

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
119,103,2440100,2012,135.0,0.0,,
120,103,2440100,2012,142.0,1.0,0.0,
121,103,2440100,2012,146.0,87.5,0.0,
122,103,2440100,2012,153.0,74.3,25.4,
123,103,2440100,2012,159.0,29.7,75.9,
...,...,...,...,...,...,...,...
1755,105,2440200,2012,145.0,0.0,,
1756,105,2440200,2012,151.0,10.2,0.0,
1757,105,2440200,2012,165.0,92.2,4.0,
1758,105,2440200,2012,184.0,18.9,62.2,0.0000


In [69]:
# Keep only the rows where 'Station' is either 104 or 105
berry_data[berry_data.Station.isin([104, 105])]

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
549,104,2440100,2006,125.0,0.0,,
550,104,2440100,2006,131.0,6.0,,
551,104,2440100,2006,139.0,14.3,,
552,104,2440100,2006,142.0,14.3,0.0,
553,104,2440100,2006,149.0,7.9,0.0,
...,...,...,...,...,...,...,...
1847,105,2440200,2020,191.0,5.3,49.5,0.0000
1848,105,2440200,2020,206.0,0.5,46.0,0.0000
1849,105,2440200,2020,213.0,0.0,,
1850,105,2440200,2020,243.0,,0.0,


In [70]:
# Keep only the rows where 'Station' is 105 AND 'Year' is 2020
berry_data[(berry_data.Station == 105) & (berry_data.Year == 2020)]

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
1075,105,2440100,2020,120.0,0.0,,
1076,105,2440100,2020,128.0,20.5,0.0,
1077,105,2440100,2020,132.0,40.5,3.0,
1078,105,2440100,2020,141.0,18.8,28.5714,
1079,105,2440100,2020,147.0,5.1,26.8,
1080,105,2440100,2020,154.0,2.4,19.3,
1081,105,2440100,2020,163.0,0.6,17.0,
1082,105,2440100,2020,170.0,0.0,16.9,
1083,105,2440100,2020,184.0,,3.9,0.0
1084,105,2440100,2020,191.0,,1.3,1.2


In [71]:
# You can create new columns with the .assign() method
berry_data.assign(second_half = berry_data["doy"] > 365/2)

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe,second_half
0,103,2440100,2006,135.0,0.0,,,False
1,103,2440100,2006,142.0,50.3,,,False
2,103,2440100,2006,144.0,83.4,,,False
3,103,2440100,2006,149.0,111.7,,,False
4,103,2440100,2006,153.0,110.5,0.0,,False
...,...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1,True
1885,103,2440200,2020,283.0,,,2.9,True
1886,103,2440200,2020,289.0,,,2.4,True
1887,103,2440200,2020,300.0,,,0.9,True


In [72]:
np.where(berry_data["doy"] < 365/2, "first_half", "second_half")

array(['first_half', 'first_half', 'first_half', ..., 'second_half',
       'second_half', 'second_half'], dtype='<U11')

In [160]:
# You can also have more descriptive values
berry_data.assign(half = np.where(berry_data["doy"] < 365/2, "first_half", "second_half"))

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe,half
0,103,2440100,2006,135.0,0.0,,,first_half
1,103,2440100,2006,142.0,50.3,,,first_half
2,103,2440100,2006,144.0,83.4,,,first_half
3,103,2440100,2006,149.0,111.7,,,first_half
4,103,2440100,2006,153.0,110.5,0.0,,first_half
...,...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1,second_half
1885,103,2440200,2020,283.0,,,2.9,second_half
1886,103,2440200,2020,289.0,,,2.4,second_half
1887,103,2440200,2020,300.0,,,0.9,second_half


> `np.where(condition, value_if_true, value_if_false)` checks a condition for each row in the dataframe.
>
> The condition here is `berry_data["doy"] < 365/2`, i.e. is the day of year before the middle of the year (day 182.5)?
> - If the condition is `True`, it returns `"first_half"`.
> - If the condition is `False`, it returns `"second_half"`.
>
> So effectively, this creates a new column called half that labels each observation as belonging to either the first half or the second half of the year, depending on its day-of-year (doy) value.
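
One detail worth knowing: a missing `doy` makes the condition `False`, so such a row is silently labelled `"second_half"`. The sketch below shows this on a made-up Series, together with one possible way to keep the label missing as well:

```python
import numpy as np
import pandas as pd

doy = pd.Series([135.0, 276.0, np.nan])

# NaN < 182.5 is False, so a missing day of year is labelled "second_half"
naive = np.where(doy < 365 / 2, "first_half", "second_half")

# One option: build the labels, then blank them out where doy is missing
half = pd.Series(naive, index=doy.index).where(doy.notna())

print(list(naive))
print(half.isna().tolist())
```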

In [73]:
berry_data.assign(quarter = np.select(
        [berry_data["doy"] <= 91, berry_data["doy"] <= 182, berry_data["doy"] <= 273],
        ["Q1", "Q2", "Q3"],
        default="Q4"
    ))

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe,quarter
0,103,2440100,2006,135.0,0.0,,,Q2
1,103,2440100,2006,142.0,50.3,,,Q2
2,103,2440100,2006,144.0,83.4,,,Q2
3,103,2440100,2006,149.0,111.7,,,Q2
4,103,2440100,2006,153.0,110.5,0.0,,Q2
...,...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1,Q4
1885,103,2440200,2020,283.0,,,2.9,Q4
1886,103,2440200,2020,289.0,,,2.4,Q4
1887,103,2440200,2020,300.0,,,0.9,Q4


> `np.select(conditions, choices, default=...)` evaluates a list of conditions in order.
> Each condition is checked row by row in the dataframe.
> - If the first condition is `True`, the corresponding value from choices is assigned. If not, it moves on to the next condition, and so on.
> - If none of the conditions are `True`, the value in `default` is assigned.
>
> Here:
> - If `doy <= 91`, the row gets `"Q1"`.
> - Else if `doy <= 182`, it gets `"Q2"`.
> - Else if `doy <= 273`, it gets `"Q3"`.
> - Otherwise, it gets `"Q4"`.
>
> So this creates a new column quarter that divides the year into four quarters based on the day-of-year.
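
For binning a numeric column into consecutive ranges, `pd.cut()` is a possible alternative to `np.select()`. A sketch on a made-up Series, using the same cut points as above (the upper edge 366 is an assumption to cover leap years):

```python
import numpy as np
import pandas as pd

doy = pd.Series([91.0, 135.0, 276.0, np.nan])

quarter = pd.cut(
    doy,
    bins=[0, 91, 182, 273, 366],          # right edges are included by default
    labels=["Q1", "Q2", "Q3", "Q4"],
)
print(list(quarter))
```

Unlike the `np.select()` version, where a missing `doy` falls through to the default `"Q4"`, `pd.cut()` leaves it missing and returns a categorical column.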

In [74]:
# Missing data
#berry_data.isna().sum()
#berry_data.fillna({"Flowers": 0})     # fill missing values with a specified value or strategy
berry_data.dropna(how='any') # drop rows with at least one missing value

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
30,103,2440100,2007,172.0,0.1,46.1,0.0
31,103,2440100,2007,180.0,0.0,32.9,0.0
32,103,2440100,2007,187.0,0.0,31.7,8.8
95,103,2440100,2010,183.0,0.0,39.7,0.0
108,103,2440100,2011,168.0,0.0,43.1,0.0
...,...,...,...,...,...,...,...
1810,105,2440200,2017,214.0,0.0,21.9,3.3
1835,105,2440200,2019,212.0,0.0,29.6,0.5
1846,105,2440200,2020,184.0,19.7,27.2,0.0
1847,105,2440200,2020,191.0,5.3,49.5,0.0


In [75]:
# Calculating summary statistics
berry_data['Flowers'].mean() # calculate mean of 'Flowers' column

np.float64(10.957528265732302)

In [177]:
# Grouped summaries
berry_data.groupby(["Station"], dropna=False).agg(mean_flower_count=("Flowers", "mean"), N=("Flowers", "size")).reset_index()

Unnamed: 0,Station,mean_flower_count,N
0,102,12.955981,581
1,103,14.519676,451
2,104,4.528571,494
3,105,11.472199,363


> `berry_data.groupby(["Station"], dropna=False)`
> - Groups the data by the column `"Station"`.
> - `dropna=False` means that if some rows have missing values in `"Station"`, they will still be included as their own group (instead of being dropped).
>
> `.agg(...)`
> - Performs aggregation (summary statistics) on each group.
> - `mean_flower_count=("Flowers", "mean")` → calculates the mean value of the Flowers column in each group and stores it in a new column called `mean_flower_count`.
> - `N=("Flowers", "size")` → counts the number of rows in each group (not the number of non-missing flower values, but the group size).
>
> `.reset_index()`
> - Turns the grouped result back into a regular dataframe, instead of having `"Station"` as the index.
>
> Result:
> You get a new dataframe where each row corresponds to a station, showing:
> - the average flower count for that station, and
> - the number of observations (`N`) available for that station.
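
The difference between `"size"` and `"count"` is easy to see on a tiny example. A sketch with a made-up DataFrame (the output column names are chosen for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Station": [102, 102, 103],
    "Flowers": [1.0, np.nan, 3.0],
})

summary = (
    demo.groupby("Station")
    .agg(
        n_rows=("Flowers", "size"),        # all rows in the group
        n_values=("Flowers", "count"),     # only non-missing Flowers values
        mean_flowers=("Flowers", "mean"),  # the mean skips missing values
    )
    .reset_index()
)
print(summary)
```

Since `Flowers` has many missing values in `berry_data`, the mean there is based on fewer observations than `N` suggests.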

In [76]:
# Arranging data

# Sorts the dataframe by the column "Flowers".
# By default ascending=True (smallest values first).
# Here ascending=False puts the largest values first.
berry_data.sort_values(["Flowers"], ascending=False)

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
1791,105,2440200,2016,159.0,134.5,12.5,
1770,105,2440200,2014,151.0,134.5,1.0,
529,102,2440100,2020,161.0,126.7,0.0,
1817,105,2440200,2018,142.0,118.5,0.0,
450,102,2440100,2016,160.0,115.4,4.0,
...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1
1885,103,2440200,2020,283.0,,,2.9
1886,103,2440200,2020,289.0,,,2.4
1887,103,2440200,2020,300.0,,,0.9


In [77]:
# Reshaping DataFrames (wide ↔ long)
long_df = pd.melt(berry_data, id_vars=["Year", "Species"], value_vars=["Flowers"], var_name="variable", value_name="value")
long_df

Unnamed: 0,Year,Species,variable,value
0,2006,2440100,Flowers,0.0
1,2006,2440100,Flowers,50.3
2,2006,2440100,Flowers,83.4
3,2006,2440100,Flowers,111.7
4,2006,2440100,Flowers,110.5
...,...,...,...,...
1884,2020,2440200,Flowers,
1885,2020,2440200,Flowers,
1886,2020,2440200,Flowers,
1887,2020,2440200,Flowers,


> `pd.melt(...)`
> - Reshapes the dataframe from wide format to long format.
> - Useful when you want one column that contains variable names and another that contains their values (tidy data structure).
>
> `id_vars=["Year", "Species"]`
> - These columns are kept as they are (not melted). They act as identifiers for each row.
>
> `value_vars=["Flowers"]`
> - These are the columns that will be “melted” into long format.
> - In this case, only the column `"Flowers"` is being melted.
>
> `var_name="variable"`
> - The name of the new column that will contain the former column name(s).
> - So, this column will contain `"Flowers"` for every row.
>
> `value_name="value"`
> - The name of the new column that will contain the corresponding values (the flower counts).
>
> Result:
> - You get a new dataframe (`long_df`) with columns: `"Year"`, `"Species"`, `"variable"` (always `"Flowers"`), `"value"` (the values from the `"Flowers"` column).

In [None]:
wide_df = long_df.pivot_table(index=["Year", "Species"], columns="variable", values="value", aggfunc="mean").reset_index()
wide_df



> `long_df.pivot_table(...)`
> - Reshapes the dataframe from long format back to a wide format.
> - Similar to Excel pivot tables: rows (index), columns, values, and an aggregation function.
>
> `index=["Year", "Species"]`
> - These columns become the row identifiers (like grouping keys).
> - So, the table will have one row per unique combination of `Year` and `Species`.
>
> `columns="variable"`
> - The unique values in the variable column become new column headers.
> - Since `"variable"` contained only `"Flowers"` in the melted dataframe, the pivot will create a single column named `"Flowers"`.
>
> `values="value"`
> - This is the column that gets spread into the table cells.
> - Here it takes the numbers from the `"value"` column (the flower counts).
>
> `aggfunc="mean"`
> - If multiple rows map to the same `Year` + `Species` + `variable`, the mean is calculated.
> - (Other options could be `"sum"`, `"count"`, etc.)
>
> `.reset_index()`
> - Converts the grouped row index (`Year`, `Species`) back into regular dataframe columns.
>
> Result:
> - You get a wide-format dataframe where each row is a `Year–Species` combination, and each measured variable (like `"Flowers"`) is a separate column with the mean value filled in.
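
A compact round trip on a made-up table shows both steps together, and also that the trip is not lossless: rows that share the same index values are aggregated.

```python
import pandas as pd

wide = pd.DataFrame({
    "Year": [2019, 2019, 2020],
    "Flowers": [2.0, 4.0, 6.0],
    "Ripe": [1.0, 3.0, 5.0],
})

# Wide → long: one row per (Year, variable) observation
long_demo = wide.melt(id_vars="Year", value_vars=["Flowers", "Ripe"],
                      var_name="variable", value_name="value")

# Long → wide: the two 2019 rows collapse into their mean
back_to_wide = (
    long_demo
    .pivot_table(index="Year", columns="variable", values="value", aggfunc="mean")
    .reset_index()
)
print(long_demo)
print(back_to_wide)
```

Here three original rows become two, one per year.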

In [200]:
# Changing data types
berry_data["Station"].dtype, berry_data["Station"].astype("str").dtype
#pd.to_datetime(berry_data["Year"].astype(str), format="%Y")  # year → datetime (January 1 of that year)

(dtype('int64'), dtype('O'))
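
Dates are another common conversion. As a sketch, a year and a day of year can be combined into a proper date; the DataFrame below is made up, and the approach (January 1 plus an offset in days) is just one possible way to do it:

```python
import pandas as pd

demo = pd.DataFrame({"Year": [2006, 2020], "doy": [135.0, 141.0]})

# January 1 of each year, then add (doy - 1) days
start_of_year = pd.to_datetime(demo["Year"].astype(str), format="%Y")
demo["date"] = start_of_year + pd.to_timedelta(demo["doy"] - 1, unit="D")

print(demo)
```

Once a column has a datetime dtype, the `.dt` accessor gives access to parts such as `.dt.month`.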

### Joining DataFrames  

In many cases, information is spread across multiple tables, and we need to **combine them based on shared keys** (e.g., IDs, station codes, years).  
This process is called a **join**.  

In Pandas, joins are done with the [`merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function (similar to SQL joins).  
The argument `how=` specifies the type of join:  

- `how="inner"` → **inner join**: keeps only rows with matching keys in both tables  
- `how="left"` → **left join**: keeps all rows from `df1` (the left table) and adds matching info from `df2`  
- `how="right"` → **right join**: keeps all rows from `df2` (the right table) and adds matching info from `df1`  
- `how="outer"` → **full outer join**: keeps all rows from both tables, filling in `NaN` where no match is found

![Joins.png](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/joins.jpg)

Pandas does not have built-in `semi_join` or `anti_join` functions, but you can achieve the same with filtering:  
- **Semi join**: `df1[df1["key"].isin(df2["key"])]`  
- **Anti join**: `df1[~df1["key"].isin(df2["key"])]`  

In [78]:
df1 = pd.DataFrame({
    "Station": [101, 102, 103],
    "Name": ["Umeå", "Vindeln", "Siljansfors"]
})

df2 = pd.DataFrame({
    "Station": [102, 103, 104],
    "Altitude": [225, 320, 180]
})

# Inner join
pd.merge(df1, df2, on="Station", how="inner")

Unnamed: 0,Station,Name,Altitude
0,102,Vindeln,225
1,103,Siljansfors,320


In [79]:
# Left join
pd.merge(df1, df2, on="Station", how="left")

Unnamed: 0,Station,Name,Altitude
0,101,Umeå,
1,102,Vindeln,225.0
2,103,Siljansfors,320.0


In [80]:
# Right join
pd.merge(df1, df2, on="Station", how="right")

Unnamed: 0,Station,Name,Altitude
0,102,Vindeln,225
1,103,Siljansfors,320
2,104,,180


In [81]:
# Full outer join
pd.merge(df1, df2, on="Station", how="outer")

Unnamed: 0,Station,Name,Altitude
0,101,Umeå,
1,102,Vindeln,225.0
2,103,Siljansfors,320.0
3,104,,180.0


In [83]:
# Semi join
df1[df1["Station"].isin(df2["Station"])]

Unnamed: 0,Station,Name
1,102,Vindeln
2,103,Siljansfors


In [84]:
# Anti join
df1[~df1["Station"].isin(df2["Station"])]

Unnamed: 0,Station,Name
0,101,Umeå


### <font color='gold'>Task 7</font>

You will work with two DataFrames: `stations` and `berry_data`.

1. **Combine the datasets**  
   Use a Pandas **inner join** to create `stations_berries`, matching on the station ID.  
   > *Hints:*  
   > - In Pandas, use `pd.merge(left, right, how="inner", on="...")`.  
   > - If the key column has **different names**, use `left_on="...", right_on="..."`.  
   > - You can change dtypes with `.astype(...)`, e.g. `df["Station"] = df["Station"].astype(int)`.

2. **Identify missing measurements**  
   Find which stations **do not** have berry measurements.  
   Briefly **reflect** in a sentence: Does it make sense that some stations have no berry data? Why might that be?

3. **Analyze berry observations**   
   - For **2020**, which station had the **highest average count of ripe berries per 0.25 m²**?  
   - For **each station and year**, find the **earliest day of year (`doy`)** when the **average number of ripe berries per 0.25 m²** was **greater than** the **average number of unripe berries per 0.25 m²**.


1.

In [86]:
stations = pd.read_csv("stationlist.csv")
stations


Unnamed: 0,station_id,name,longitude,latitude,altitude
0,101,Umeå,20.19,63.49,33
1,102,Vindeln,19.46,64.14,225
2,103,Siljansfors,14.24,60.53,320
3,104,Asa,14.47,57.1,180
4,105,Tönnersjöheden,13.07,56.42,80


In [87]:
# Merge the two dataframes: match stations.station_id with berry_data.Station
stations_berries = pd.merge(stations, berry_data, left_on="station_id", right_on="Station")
stations_berries

Unnamed: 0,station_id,name,longitude,latitude,altitude,Station,Species,Year,doy,Flowers,Unripe,Ripe
0,102,Vindeln,19.46,64.14,225,102,2440100,2006,145.0,0.0,,
1,102,Vindeln,19.46,64.14,225,102,2440100,2006,152.0,8.3,0.0,
2,102,Vindeln,19.46,64.14,225,102,2440100,2006,160.0,56.6,0.0,
3,102,Vindeln,19.46,64.14,225,102,2440100,2006,172.0,0.0,67.8,
4,102,Vindeln,19.46,64.14,225,102,2440100,2006,179.0,0.0,0.0,0.0010
...,...,...,...,...,...,...,...,...,...,...,...,...
1884,105,Tönnersjöheden,13.07,56.42,80,105,2440200,2020,191.0,5.3,49.5,0.0000
1885,105,Tönnersjöheden,13.07,56.42,80,105,2440200,2020,206.0,0.5,46.0,0.0000
1886,105,Tönnersjöheden,13.07,56.42,80,105,2440200,2020,213.0,0.0,,
1887,105,Tönnersjöheden,13.07,56.42,80,105,2440200,2020,243.0,,0.0,


In [88]:
# Alternative: first rename 'station_id' -> 'Station' in stations,
# then merge directly on the shared 'Station' column
stations.rename(columns={'station_id': 'Station'}, inplace=True)
stations_berries = pd.merge(stations, berry_data, on="Station")
stations_berries

Unnamed: 0,Station,name,longitude,latitude,altitude,Species,Year,doy,Flowers,Unripe,Ripe
0,102,Vindeln,19.46,64.14,225,2440100,2006,145.0,0.0,,
1,102,Vindeln,19.46,64.14,225,2440100,2006,152.0,8.3,0.0,
2,102,Vindeln,19.46,64.14,225,2440100,2006,160.0,56.6,0.0,
3,102,Vindeln,19.46,64.14,225,2440100,2006,172.0,0.0,67.8,
4,102,Vindeln,19.46,64.14,225,2440100,2006,179.0,0.0,0.0,0.0010
...,...,...,...,...,...,...,...,...,...,...,...
1884,105,Tönnersjöheden,13.07,56.42,80,2440200,2020,191.0,5.3,49.5,0.0000
1885,105,Tönnersjöheden,13.07,56.42,80,2440200,2020,206.0,0.5,46.0,0.0000
1886,105,Tönnersjöheden,13.07,56.42,80,2440200,2020,213.0,0.0,,
1887,105,Tönnersjöheden,13.07,56.42,80,2440200,2020,243.0,,0.0,


2.

In [89]:
# Select rows where Unripe, Ripe, and Flowers are all missing (NaN)
stations_berries[(stations_berries["Unripe"].isna()) & (stations_berries["Ripe"].isna()) & stations_berries["Flowers"].isna()]

Unnamed: 0,Station,name,longitude,latitude,altitude,Species,Year,doy,Flowers,Unripe,Ripe
1381,104,Asa,14.47,57.1,180,2440200,2006,132.0,,,
1382,104,Asa,14.47,57.1,180,2440200,2006,139.0,,,
1383,104,Asa,14.47,57.1,180,2440200,2006,142.0,,,
1384,104,Asa,14.47,57.1,180,2440200,2006,149.0,,,


In [266]:
# Same result as before, just written more compactly using .all(axis=1).
stations_berries[stations_berries[["Unripe", "Ripe", "Flowers"]].isna().all(axis=1)]

Unnamed: 0,Station,name,longitude,latitude,altitude,Species,Year,doy,Flowers,Unripe,Ripe
1381,104,Asa,14.47,57.1,180,2440200,2006,132.0,,,
1382,104,Asa,14.47,57.1,180,2440200,2006,139.0,,,
1383,104,Asa,14.47,57.1,180,2440200,2006,142.0,,,
1384,104,Asa,14.47,57.1,180,2440200,2006,149.0,,,


> In `pandas`, `axis=0` means “work down the rows” (column-wise).
> `axis=1` means “work across the columns” (row-wise).
>
> ```python
> stations_berries[["Unripe", "Ripe", "Flowers"]].isna().all(axis=1)
> ```
>
> - `.isna()` checks for missing values in each cell.
> - `.all(axis=1)` asks: “for each row, are all these columns NaN?” That returns a boolean `Series` used to filter the dataframe.
>
> Without `axis=1`, it would instead check column-wise (down the rows).
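
The filter above finds rows in which all three counts are missing. Another reading of the question is "which stations never appear in the measurements table at all?", which is the anti join pattern from the joins section. A minimal sketch, using small stand-in tables (`toy_stations` and `toy_berries` are invented here so the cell runs on its own):

```python
import pandas as pd

toy_stations = pd.DataFrame({
    "Station": [101, 102, 103],
    "name": ["Umeå", "Vindeln", "Siljansfors"],
})
toy_berries = pd.DataFrame({
    "Station": [102, 102, 103],
    "Ripe": [0.0, 8.5, 1.2],
})

# Anti join: keep stations whose ID never occurs in the measurements table
no_data = toy_stations[~toy_stations["Station"].isin(toy_berries["Station"])]
print(no_data)
```

With the real tables the same pattern would be `stations[~stations["Station"].isin(berry_data["Station"])]`, assuming `station_id` has already been renamed to `Station` as in the alternative merge above.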


3.  -

In [274]:
# Find the maximum value of Ripe in year 2020
max_riped_2020 = stations_berries[(stations_berries.Year == 2020)].Ripe.max()

In [275]:
# Select the row(s) from 2020 where Ripe equals the maximum value found that year
stations_berries[(stations_berries.Year == 2020) & (stations_berries.Ripe == max_riped_2020)]

Unnamed: 0,Station,name,longitude,latitude,altitude,Species,Year,doy,Flowers,Unripe,Ripe
290,102,Vindeln,19.46,64.14,225,2440100,2020,223.0,,1.8,48.2


  3.  -

In [90]:
# Remove rows where Ripe or Unripe is missing (NaN)
clean = stations_berries.dropna(subset=['Ripe', 'Unripe'])
clean

Unnamed: 0,Station,name,longitude,latitude,altitude,Species,Year,doy,Flowers,Unripe,Ripe
4,102,Vindeln,19.46,64.14,225,2440100,2006,179.0,0.0,0.0,0.001
12,102,Vindeln,19.46,64.14,225,2440100,2007,183.0,0.0,49.0,0.000
13,102,Vindeln,19.46,64.14,225,2440100,2007,190.0,0.0,46.0,0.000
14,102,Vindeln,19.46,64.14,225,2440100,2007,197.0,,42.5,0.500
15,102,Vindeln,19.46,64.14,225,2440100,2007,204.0,,33.7,8.500
...,...,...,...,...,...,...,...,...,...,...,...
1872,105,Tönnersjöheden,13.07,56.42,80,2440200,2019,212.0,0.0,29.6,0.500
1873,105,Tönnersjöheden,13.07,56.42,80,2440200,2019,226.0,,6.8,7.500
1883,105,Tönnersjöheden,13.07,56.42,80,2440200,2020,184.0,19.7,27.2,0.000
1884,105,Tönnersjöheden,13.07,56.42,80,2440200,2020,191.0,5.3,49.5,0.000


In [284]:
clean[clean['Ripe'] > clean['Unripe']].groupby(['Station', 'Year'], as_index=False).agg(doy_early=('doy', 'min'))

Unnamed: 0,Station,Year,doy_early
0,102,2006,179.0
1,102,2007,211.0
2,102,2008,217.0
3,102,2009,217.0
4,102,2010,214.0
5,102,2011,206.0
6,102,2012,220.0
7,102,2013,213.0
8,102,2014,211.0
9,102,2015,222.0


> `clean[clean['Ripe'] > clean['Unripe']]`
> - Keeps only rows where the number of ripe berries is greater than the number of unripe berries.
>
> `.groupby(['Station', 'Year'], as_index=False)`
> - Groups the filtered data by station and year.
> - `as_index=False` makes sure `Station` and `Year` stay as normal columns (not the index).
>
> `.agg(doy_early=('doy', 'min'))`
> - For each group, finds the smallest day of year (doy) that satisfies the condition (ripe > unripe).
> - Saves it in a new column called `doy_early`.
>
> Result:
> - A dataframe showing, for each station and year, the earliest day when ripe berries outnumbered unripe berries.

## Iterations

**Loops** let you run a block of code repeatedly. They’re essential for automating repetitive work, walking through data structures, and implementing algorithms that repeat until a condition is met. In Python you’ll mainly use two loop constructs: **`while`** and **`for`**.


> #### Updating variables
> Before looping, it’s useful to understand how values get updated.

In [286]:
# Reassignment based on the previous value
x = 0
x = x + 1     # read the right-hand side first, then assign to x
x

1

In [288]:
# Same as previous
x = 0
x += 1
x

1

> Python evaluates the **right side first**, then assigns to the name on the left.\
> If you use a variable before it has a value, you’ll get a `NameError`.\
> Shorthand operators make updates concise: `x += 1` (increment), `x -= 1` (decrement), `x *= 2`, `x /= 2`, etc.\
> Examples (a running total, then cumulative rainfall over a week):

In [297]:
values = [1, 1, 1]
values

[1, 1, 1]

In [299]:
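running_total = 0   # avoid the name `sum`: it would shadow Python's built-in sum()
for value in values:
    running_total += value   # add the current element to the total
    print('after:', running_total)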


after: 1
after: 2
after: 3


In [91]:
rain_mm = [2.0, 0.0, 4.5, 1.2, 0.0, 8.3, 3.1]
total = 0.0

# Loop through daily rainfall, update total, and print cumulative sum each day
for day in rain_mm:
    total += day
    print(total)

# Final total rainfall over the whole week
print(f"Cumulative rainfall over a week: {total} mm")


2.0
2.0
6.5
7.7
7.7
16.0
19.1
Cumulative rainfall over a week: 19.1 mm


In [92]:
# This is the same as simply using Python's built-in sum()
total2 = sum(rain_mm)
print(f"Total rainfall (using sum): {total2} mm")

Total rainfall (using sum): 19.1 mm


### `while` loops

A `while` loop repeats as long as its condition is `True`. The condition is checked before each iteration.
![while_loop.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/while_loop.jpg)

In [326]:
# Sum daily CO₂ measurements until a threshold is reached
co2_readings = [392, 401, 407, 415, 420, 425]
i = 0
total = 0

while i < len(co2_readings) and total <= 12000:
    total += co2_readings[i] # add the next reading to the cumulative total
    i += 1                   # move to the next reading/element in the list
    print(total)             # show running total

total, i  # cumulative sum and how many readings used


392
793
1200
1615
2035
2460


> - `while i < len(co2_readings)` → the loop continues as long as there are readings left.
> - `and total <= 12000` → it also stops once the cumulative sum exceeds 12,000.
>
> Inside the loop:
> - The next CO₂ reading is added to `total`.
> - `i` counts how many readings have been included so far.
> - `print(total)` shows the running sum after each step.
>
> Result:
> At the end, `total` holds the cumulative sum and `i` tells you how many readings were used. With these six readings the sum only reaches 2460, so the loop ends because the list is exhausted, not because the threshold was exceeded.

Use `while` when you don’t know ahead of time how many iterations you need.
\
Make sure something inside the loop changes the condition; otherwise you’ll loop forever.
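
As a small illustration of a loop whose length is not known in advance, the sketch below counts how many days it takes for a value to fall below a limit. The scenario and numbers (a concentration dropping by 20 % per day) are hypothetical.

```python
# Hypothetical values: a pollutant concentration that drops 20 % per day
concentration = 100.0   # starting value (arbitrary units)
safe_level = 50.0
days = 0

while concentration > safe_level:
    concentration *= 0.8   # this update eventually makes the condition False
    days += 1

print(days, concentration)
```

The update `concentration *= 0.8` is what guarantees termination: each pass moves the value closer to `safe_level`.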

#### Infinite loops (and how they happen)
If the condition never becomes `False`, the loop never ends.

In [330]:
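# Example of an infinite loop (a bug): do not run this cell as written
i = 0
while i < 5:
    print(i)
    # i += 1  # forgotten increment → i stays 0, so the condition is always True



0
0
0
... (repeats until the kernel is interrupted)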


Sometimes infinite loops are intentional (e.g., continuously reading a sensor stream); they are then ended with a `break` statement when a stop condition occurs.
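
A minimal sketch of such an intentional loop, assuming a made-up stream of readings in which `None` marks the end of the data (the `readings` iterator stands in for a real sensor):

```python
# Hypothetical sensor stream; None marks the end of the data
readings = iter([21.5, 21.7, 22.0, None, 23.1])
collected = []

while True:                  # intentionally "infinite"
    value = next(readings)   # fetch the next reading
    if value is None:        # stop condition
        break
    collected.append(value)

print(collected)
```

Everything after the `None` marker is never read, because `break` leaves the loop immediately.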

### `for` loops

A `for` loop iterates over items in a sequence (list, string, NumPy array, etc.). Use it when the number of iterations is known or you’re iterating over a collection.
![for_loop.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/for_loop.jpg)

In [331]:
# Print each station name
stations = ["Umeå", "Vindeln", "Siljansfors", "Asa"]
for name in stations:
    print(name)


Umeå
Vindeln
Siljansfors
Asa


#### Iterating with `range()`
`range(stop)` generates `0, 1, 2, ..., stop-1` (the end is exclusive).
You can also use `range(start, stop, step)`.
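
A short sketch of the three-argument form; wrapping `range` in `list(...)` is only done here to make the generated numbers visible:

```python
# Every 7th day of year, starting at day 1 and stopping before day 30
weekly_days = list(range(1, 30, 7))
print(weekly_days)      # [1, 8, 15, 22, 29]

# A negative step counts down; the stop value (0) is still excluded
countdown = list(range(5, 0, -1))
print(countdown)        # [5, 4, 3, 2, 1]
```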

In [332]:
# Days of year: first five days
for doy in range(1, 6):   # 1..5
    print(doy)


1
2
3
4
5


In [335]:
# Simple growing degree-day (GDD) accumulator
base = 10
daily_tmean = [8, 12, 15, 9, 14, 11]  # °C
gdd = 0
for t in daily_tmean:
    gdd += max(0, t - base)
gdd

12

In [334]:
max(0, 12 - 10)

2

### Nested Loops
Sometimes we need a loop inside another loop - this is called a nested loop.
Nested loops are useful when:
- You want to process **two dimensions of data** (e.g., rows × columns in a table or grid).
- You need to **compare every element of one collection with every element of another**.
- You are **working with spatial or temporal data where multiple variables interact**.

In [336]:
range(4)

range(0, 4)

In [342]:
for i in range(2):          # outer loop: runs twice
    print('i', i)
    for j in range(2):      # inner loop: runs twice for every value of i
        print(i+j)

i 0
0
1
i 1
1
2


In [344]:
values = ['A', 'B']
for i in range(1, 4):       # outer loop: 1, 2, 3
    for j in values:        # inner loop: 'A', 'B' for every i
        print(i, j)

1 A
1 B
2 A
2 B
3 A
3 B


In [None]:
# Monitoring daily rainfall across multiple stations
stations = ["Umeå", "Vindeln", "Siljansfors"]
rainfall_data = [
    [5.2, 0.0, 1.3],   # Day 1: rainfall in mm for each station
    [0.0, 0.0, 0.0],   # Day 2
    [12.1, 3.4, 0.0]   # Day 3
]

for day, daily_measurements in enumerate(rainfall_data, start=1):
    for station, rainfall in zip(stations, daily_measurements):
        print(f"Day {day} at {station}: {rainfall} mm")

Day 1 at Umeå: 5.2 mm
Day 1 at Vindeln: 0.0 mm
Day 1 at Siljansfors: 1.3 mm
Day 2 at Umeå: 0.0 mm
Day 2 at Vindeln: 0.0 mm
Day 2 at Siljansfors: 0.0 mm
Day 3 at Umeå: 12.1 mm
Day 3 at Vindeln: 3.4 mm
Day 3 at Siljansfors: 0.0 mm


> *Hint*:
> - `enumerate(rainfall_data, start=1)` → gives both the day number and the list of rainfall values for that day.
> - `zip(stations, daily_measurements)` → pairs each station name with the corresponding rainfall value.
>
> The outer loop iterates over days, and the inner loop iterates over stations for each day.

#### Controlling loop flow: break and continue
- `break` → exit the loop immediately.
- `continue` → skip the rest of the current iteration and move to the next one.

In [346]:
# BREAK: stop scanning once PM2.5 exceeds a health threshold
pm25 = [8, 11, 4, 39, 51, 22, 1]
threshold = 15
for value in pm25:
    if value > threshold:
        print("Alert: unhealthy air quality!", value)
        break


Alert: unhealthy air quality! 39


In [None]:
# CONTINUE: skip missing temperature readings (None) when averaging
temps = [12.5, None, 13.2, 11.8, None, 12.9]
count = 0
total = 0.0

for t in temps:
    if t is None:
        continue
    total += t
    count += 1

avg = total / count if count else None
avg


12.6

> In nested loops, `break` only exits the innermost loop.
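
The sketch below illustrates this with invented rainfall values: the inner loop stops at a station's first dry day, while the outer loop still moves on to the next station.

```python
# Hypothetical rainfall (mm) per station; stop scanning a station at its first dry day
rain_by_station = {
    "Umeå": [5.2, 0.0, 1.3],
    "Vindeln": [3.4, 2.1, 0.0],
}
visited = []

for station, days in rain_by_station.items():
    for mm in days:
        if mm == 0.0:
            break                      # leaves the inner loop only
        visited.append((station, mm))
    # execution continues here, and the outer loop moves to the next station

print(visited)
```

The value 1.3 for Umeå is never reached, yet Vindeln is still processed.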

## Conditional Statements
When we learned about **controlling loops**, we already saw how the keyword `if` can decide whether a block of code should run.
Conditional statements are one of the most important building blocks of programming. They let us make decisions in code:
- Run certain parts of code only if conditions are `True`.
- Skip or choose alternatives if they’re `False`.

This enables us to write programs that adapt to different situations, much like we do in real-world decision-making.

### `if` Statement

The simplest form is the `if` statement.
It checks whether a condition is `True` and, if so, executes the block of code inside it.
![if_condition.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/if_condition.jpg)

In [348]:
# Check if temperature is above freezing
temperature = 3

if temperature > 0:
    print("Water is liquid.")

Water is liquid.


> Here the condition is `True` (3 is greater than 0), so `"Water is liquid."` is printed.\
> If the temperature had been 0 or below, the condition would have been `False` and nothing would have been printed.

### `else` Clause

The `else` clause provides an alternative when the condition in `if` is not met.
![if_and_else_condition.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/if_and_else_condition.jpg)

In [350]:
# Are trees actively photosynthesizing?
sunlight = True

if sunlight:  # a Boolean can be tested directly; `== True` is not needed
    print("Photosynthesis is happening.")
else:
    print("Trees are not photosynthesizing right now.")

Photosynthesis is happening.


The `else` block ensures that something always happens, whether the `if` condition is `True` or not.

### `elif` Clause

The `elif` ("else if") clause allows us to test multiple conditions sequentially.
It’s useful when there are more than two possible outcomes.
![elif_condition.jpg](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/elif_condition.jpg)

In [352]:
# Classifying Air Quality Index (AQI)
aqi = 5

if aqi <= 50:
    print("Air quality is good.")
elif aqi <= 100:
    print("Air quality is moderate.")
elif aqi <= 150:
    print("Air quality is unhealthy for sensitive groups.")
else:
    print("Air quality is unhealthy for everyone.")


Air quality is good.


> Here the program evaluates the conditions in order and executes only the first matching block.
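
To make the "first matching block only" behaviour concrete, the sketch below compares an `if`/`elif` chain with two independent `if` statements for a value that satisfies both conditions. The list names are made up for the comparison.

```python
aqi = 40
labels_elif = []
labels_if = []

# if/elif chain: stops at the first condition that is True
if aqi <= 50:
    labels_elif.append("good")
elif aqi <= 100:
    labels_elif.append("moderate")

# Two separate if statements: each condition is checked on its own
if aqi <= 50:
    labels_if.append("good")
if aqi <= 100:
    labels_if.append("moderate")

print(labels_elif)  # ['good']
print(labels_if)    # ['good', 'moderate']
```

This is also why the order of the conditions matters: putting `aqi <= 100` first would label every value up to 100 as moderate.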

### Nested Conditional Statements

Sometimes, one decision depends on another. That’s when nested `if` statements come in handy.

In [93]:
# Checking if rainfall is enough for crops
rainfall_mm = 8

if rainfall_mm > 0:
    print("It rained today.")
    if rainfall_mm >= 50:
        print("Soil moisture is sufficient for crops.")
    else:
        print("Rainfall was too low, irrigation might be needed.")
else:
    print("No rainfall today.")


It rained today.
Rainfall was too low, irrigation might be needed.


> Be careful: nesting too many `if` statements can make your code hard to read and maintain.\
> Often, there are cleaner alternatives (e.g., combining conditions with `and`/`or`, or using dictionaries).
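
As one possible flattening of the rainfall example, the sketch below replaces the inner `if` with an `elif` chain. The message wording is slightly condensed and is an assumption of this sketch, not the original output.

```python
rainfall_mm = 8

if rainfall_mm <= 0:
    message = "No rainfall today."
elif rainfall_mm < 50:          # reached only when rainfall_mm > 0
    message = "It rained, but too little: irrigation might be needed."
else:                           # rainfall_mm >= 50
    message = "It rained, and soil moisture is sufficient for crops."

print(message)
```

Each branch now sits at the same indentation level, which tends to be easier to scan than an `if` inside an `if`.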

### <font color='gold'>Task 8</font>
Using the dataset `berry_data.csv`, your task is to calculate the maximum number of ripe berries per 0.25 m² for each station across all years.

Steps to guide you:
- Load the dataset into a Pandas DataFrame called `berry_data`.
- Identify the unique stations in the dataset.
- Use a loop to go through each station.
- Inside the loop, use an `if–else` statement to check whether the station has ripe berry data.
- If data exists, calculate the maximum number of ripe berries for that station.
- If no data exists, store a value like `None`.
- Collect the results in a dictionary, where:
  - keys = station IDs (the values in the `Station` column)
  - values = maximum ripe berry counts.
- Convert this dictionary into a new Pandas DataFrame with stations as column names and a single row of values.

> *Hint*: Use `.unique()` to list all stations.

In [129]:
berry_data = pd.read_csv("berry_data.csv") # reading the data into a dataframe

In [131]:
berry_data

Unnamed: 0,Station,Species,Year,doy,Flowers,Unripe,Ripe
0,103,2440100,2006,135.0,0.0,,
1,103,2440100,2006,142.0,50.3,,
2,103,2440100,2006,144.0,83.4,,
3,103,2440100,2006,149.0,111.7,,
4,103,2440100,2006,153.0,110.5,0.0,
...,...,...,...,...,...,...,...
1884,103,2440200,2020,276.0,,,4.1
1885,103,2440200,2020,283.0,,,2.9
1886,103,2440200,2020,289.0,,,2.4
1887,103,2440200,2020,300.0,,,0.9


In [132]:
unique_stations = berry_data["Station"].unique() # identifying the unique stations
unique_stations

array([103, 102, 104, 105])

In [121]:
max_ripe_dict = {} # making an empty dictionary to store results

In [133]:
# Looping through each station
for station in unique_stations:
  # Selecting data for this station
  station_data = berry_data[berry_data["Station"] == station]
  # If there is ripe data for this station
  if station_data["Ripe"].notna().any():
    # Calculating maximum number of ripe berries
    max_ripe_dict[station] = station_data["Ripe"].max()
  else:
    # If no ripe data exists, store None
    max_ripe_dict[station] = None

max_ripe_dict

{np.int64(103): 60.9,
 np.int64(102): 48.2,
 np.int64(104): 69.2,
 np.int64(105): 59.4}

In [134]:
# Converting the dictionary into a DataFrame
max_ripe_df = pd.DataFrame([max_ripe_dict])
max_ripe_df

Unnamed: 0,103,102,104,105
0,60.9,48.2,69.2,59.4
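
The task asks for a loop, but once that version works it can be cross-checked with a loop-free `groupby`. A minimal sketch on a small stand-in table (`toy` and its values are invented so the cell is self-contained):

```python
import pandas as pd

# Small stand-in for berry_data (values invented for illustration)
toy = pd.DataFrame({
    "Station": [102, 102, 103, 104],
    "Ripe": [48.2, 3.0, 60.9, None],
})

# One maximum per station; a station with only missing values gets NaN
max_ripe = toy.groupby("Station")["Ripe"].max()

# Series -> single-row DataFrame with stations as columns
max_ripe_wide = max_ripe.to_frame().T
print(max_ripe_wide)
```

Note one difference from the loop version: a station without any ripe data ends up as `NaN` here rather than `None`.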
