# Workshop: Python Basics

Welcome to this Python workshop!  
In this notebook, you will learn the essentials of data analysis in Python, starting from the very basics and progressing to more advanced data wrangling and visualization techniques.

---

Before you begin, make sure you are familiar with the [Python Syntax Fundamentals](https://github.com/RaHub4AI/MI4011_Block1/blob/main/Introduction_to_Python/Python_Syntax_Fundamentals.md).

That material covers core rules such as comments, variables, assignment, reserved keywords, and code readability. You should keep these principles in mind throughout this workshop while writing your code.


## General Information

In [None]:
# Check the installed Python version.
# The leading "!" tells Jupyter to run the line as a shell command.
!python --version

In [None]:
import sys
print(sys.version)

In [None]:
# Check the current working directory
import os
print(os.getcwd())

# Set your working directory (example path)
# os.chdir('C:/Users/YourName/Documents/EDS_projects')

# Check the files in the directory
print(os.listdir())


## Operators
---
#### Arithmetic
Used for performing basic mathematical operations (addition, subtraction, multiplication, division, powers, etc.).
| Operator | Description       | Example   | Result |
|----------|-------------------|-----------|--------|
| `+`      | Addition          | `5 + 3`   | 8      |
| `-`      | Subtraction       | `5 - 3`   | 2      |
| `*`      | Multiplication    | `5 * 3`   | 15     |
| `/`      | Division (float)  | `5 / 2`   | 2.5    |
| `//`     | Floor division    | `5 // 2`  | 2      |
| `%`      | Modulus (remainder) | `5 % 2` | 1      |
| `**`     | Exponentiation    | `2 ** 3`  | 8      |
---
#### Assignment
Used for assigning values to variables and updating them in place (e.g., `x += 1` means increase `x` by `1`).
| Operator | Example   | Equivalent to |
|----------|-----------|---------------|
| `=`      | `x = 5`  | assign value 5 to `x` |
| `+=`     | `x += 3` | `x = x + 3` |
| `-=`     | `x -= 3` | `x = x - 3` |
| `*=`     | `x *= 3` | `x = x * 3` |
| `/=`     | `x /= 3` | `x = x / 3` |
| `//=`    | `x //= 3` | `x = x // 3` |
| `%=`     | `x %= 3` | `x = x % 3` |
| `**=`    | `x **= 3` | `x = x ** 3` |
---
#### Comparison  
Used for comparing values; they return `True` or `False`.
| Operator | Description      | Example   | Result |
|----------|------------------|-----------|--------|
| `==`     | Equal to         | `5 == 3`  | False |
| `!=`     | Not equal to     | `5 != 3`  | True  |
| `>`      | Greater than     | `5 > 3`   | True  |
| `<`      | Less than        | `5 < 3`   | False |
| `>=`     | Greater or equal | `5 >= 5`  | True  |
| `<=`     | Less or equal    | `3 <= 5`  | True  |
---
#### Logical  
Used for combining or inverting boolean values (`and`, `or`, `not`).
| Operator | Description  | Example   | Result |
|----------|--------------|-----------|--------|
| `and`    | Logical AND  | `(5 > 3) and (5 < 10)` | True |
| `or`     | Logical OR   | `(5 > 3) or (5 > 10)`  | True |
| `not`    | Logical NOT  | `not(5 > 3)`           | False |
---
#### Membership
Used to test if a value is part of a sequence (like lists, strings, or tuples).
| Operator | Description                | Example               | Result |
|----------|----------------------------|-----------------------|--------|
| `in`     | Checks if value is in a sequence | `3 in [1, 2, 3]` | True   |
| `not in` | Checks if value is not in a sequence | `4 not in [1, 2, 3]` | True |
---
#### Identity  
Used to check if two variables refer to the same object in memory, not just equal values.
| Operator | Description                      | Example       | Result |
|----------|----------------------------------|---------------|--------|
| `is`     | True if two variables point to the same object | `x is y` | True/False |
| `is not` | True if two variables point to different objects | `x is not y` | True/False |
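
The difference between equality (`==`) and identity (`is`) is easiest to see with two lists that hold the same values. The following is a minimal sketch; the variable names `a`, `b`, `c`, and `result` are chosen only for illustration.

```python
# Two separate lists with equal contents, plus a second name for the first list
a = [1, 2, 3]
b = [1, 2, 3]
c = a

print(a == b)   # True: the values are equal
print(a is b)   # False: two distinct list objects
print(a is c)   # True: both names refer to the same object

# In everyday code, `is` is mostly used to check for None
result = None
print(result is None)   # True
```

As a rule of thumb, compare values with `==` and reserve `is` for checks against `None`.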


### Python as a Calculator
The simplest use of Python is doing computations directly:

In [None]:
# Let's try the arithmetic operators
3 + 5

In [None]:
25 ** 2

In [None]:
1 / 2

In [None]:
1 % 2

In [None]:
1 / 0  # raises ZeroDivisionError

In Python, many core mathematical functions and constants are provided by the
[`math` module](https://docs.python.org/3/library/math.html).  
To use them, we first need to import the module:

In [None]:
import math

> **Best practice:**  
> Always keep all your imports (`import...`) together at the very top of your script or notebook.  
>
> Why?  
> - It makes your code easier to read: anyone can immediately see which tools you are using.  
> - It avoids errors later, since all dependencies are loaded before running the main analysis.  
> - It makes your work reproducible: when you share your notebook, others can quickly install/load the same packages.  

In [None]:
2 + math.sqrt(375769) - 25**2

Another very powerful tool in Python is [**`NumPy`**](https://numpy.org/).\
While the `math` module provides basic mathematical functions, `NumPy` is designed for working efficiently with **arrays of numbers** and for performing fast numerical computations.

`NumPy` allows you to store data in arrays and apply mathematical operations to entire datasets at once, rather than looping over individual values.\
This makes code both **shorter** and **much faster**, which is especially important when working with large environmental datasets.


> In Python, it is common to import packages with a **short alias** using the `as` keyword.  
> For example, instead of writing `import numpy` and then calling functions as `numpy.sqrt(...)`,  
> we usually write: `import numpy as np`.
>
> This makes the code shorter and easier to read.
>
> *The abbreviation `np` is a widely accepted community convention.
> If you see Python code in textbooks, tutorials, or research papers,
> `np` will almost always refer to `NumPy`.
> Using these standard abbreviations helps your code stay familiar and understandable to others.*

In [None]:
import numpy as np

In [None]:
np.divide(1, 0)

>As you see, in addition to errors, Python also gives you **warnings**. Right now, `NumPy` warns that you are dividing by zero.  
>This is useful because it helps you notice potential caveats in your code.  However, when using some packages, warnings can become very frequent and distracting.  
>Python provides an option to **suppress warnings** if needed. Here is how you can do it:

```python
import warnings
warnings.filterwarnings("ignore")

import numpy as np
np.divide(1, 0)    # no warning shown
```


## Variables

Creating variables in Python means defining names and assigning values to them.
Variables are used to store data such as numbers, text, lists, or even more complex objects.\
Assignment is done with the `=` operator, which binds a value to a variable name.

A variable’s name should clearly describe the data it contains, so that the name itself serves as a meaningful reference to the information it holds.  


In [None]:
sum1 = 2 + 3

You can print values in multiple ways:

In [None]:
# Passing several arguments separated by commas (no str() conversion needed)
print("The sum of the numbers is:", sum1)

# Using f-strings (recommended, clearer and more readable)
print(f"The sum of the numbers is: {sum1}")

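><font color='#ff751f'> Do you notice the difference?</font>
> - With `,`, Python automatically inserts a space between the arguments.
> - With an f-string, the value is placed exactly where the `{}` is, so you control the spacing yourself.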
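
If the automatic space is not wanted, `print()` accepts a `sep` argument, and f-strings accept a format specification after a colon. A small sketch, with `total` and `ratio` as made-up example values:

```python
total = 2 + 3

print("Total:", total)            # a space is inserted between the arguments
print("Total:", total, sep="")    # sep="" removes that space

ratio = 2 / 3
formatted = f"Ratio: {ratio:.2f}"   # .2f formats the number with two decimals
print(formatted)                    # Ratio: 0.67
```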







In [None]:
temperature = 2
location = "Stockholm"

# Printing multiple values at once
print("The temperature in", location, "is", temperature, "°C")

print(f"The temperature in {location} is {temperature} °C")

# Embedding expressions directly inside an f-string
print(f"In five years, the temperature will be {temperature + 5} °C") # very pessimistic prediction

## Getting Help

Many Python functions have **default parameter values**, so you don’t always need to specify every argument.  
To learn how a function works, you can access its documentation in different ways.

#### 1. Using `help()`

In [None]:
help(len)

This shows the function’s signature, description, parameters, and sometimes examples.

#### 2. Using `?` in IPython or Jupyter

In Jupyter notebooks or IPython, you can also type:

In [None]:
len?

This will display the docstring (documentation string) for the function.

Help pages in Python are based on the function’s **docstring**.  
A docstring is text written by the developers of the function to explain how it works.  

Typically, Python help pages include:
- **Signature**: the function name with its parameters and default values.  
- **Docstring**: a description of what the function does, sometimes with notes or examples.  
- **Type**: whether it is a built-in function, method, or user-defined function.  

>In addition, most Python packages (like `NumPy` or `pandas`) have excellent online documentation, which is often the fastest way to learn a new function.
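
Because help pages are built from docstrings, you can also read the raw text yourself through the `__doc__` attribute. This is only an illustrative sketch; the variable name `doc_text` is an assumption made for the example.

```python
import math

# Every documented object stores its docstring in the __doc__ attribute
print(len.__doc__)
print(math.sqrt.__doc__)

# The docstring is an ordinary string, so it can be stored and inspected
doc_text = len.__doc__
print(type(doc_text))   # <class 'str'>
```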


### <font color='#ff751f'> Task 1 </font>
1. What does the function `np.log()` do in Python?

In [None]:
# YOUR CODE HERE!

## Data Types

In Python, **data types** define the kind of values a variable can hold and what operations can be performed on them.  
Python is a **dynamically typed language**, so you don’t need to declare the type explicitly - Python infers it from the assigned value.  

You can check the type of an object with the `type()` function.
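
Dynamic typing means that the same name can hold values of different types over time. A minimal sketch, using a hypothetical variable called `value`:

```python
value = 42
print(type(value))   # <class 'int'>

value = "forty-two"  # the same name can later hold a different type
print(type(value))   # <class 'str'>

# isinstance() is a common way to test a type inside your own code
print(isinstance(value, str))   # True
```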


#### Numbers

Python has several numeric data types:

- **Integers (`int`)**: whole numbers without a decimal point  
  Examples: `54`, `-12`, `1240`

- **Floating-point numbers (`float`)**: numbers with decimals  
  Examples: `1.54`, `6.0`, `-2.3`

- **Complex numbers (`complex`)**: numbers with a real and imaginary part (written with `j` instead of `i`)  
  Example: `5 + 2j`  
  *(Don’t worry — we won’t work with complex numbers in this course.)*


In [None]:
weight = 80      # int
height = 1.80    # float
bmi = weight / (height**2)
print(bmi)
print(type(bmi))

#### Strings (`str`)

Strings represent text data. They can be written with single, double, or triple quotes (for multi-line text).

In [None]:
string1 = "I’m going to become an Environmental Data Scientist!"
print(type(string1))

Strings in Python are immutable, meaning they cannot be changed after creation.
If you want to modify text, you create a new string instead.
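
Creating a new string instead of changing the old one could look like the sketch below. The names `cat` and `rat` are chosen for illustration only.

```python
word = "BAT"

# Build new strings instead of changing the original in place
cat = "C" + word[1:]           # drop the first letter and prepend "C"
rat = word.replace("B", "R")   # replace() returns a new string

print(cat)    # CAT
print(rat)    # RAT
print(word)   # BAT (the original is unchanged)
```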

In [None]:
word = "BAT"

In [None]:
word[0]

In [None]:
word[0] = "C"  # raises TypeError: strings do not support item assignment

#### Booleans (`bool`)

Booleans represent truth values: `True` or `False`.
They are often the result of comparisons:

In [None]:
is_snowing = False
print(type(is_snowing))

print(5 > 2)   # True
print(3 == 7)  # False

#### Other Important Data Types

- List (`list`): an ordered, mutable collection

In [None]:
list_example = ["e", 5, "Introduction to Programming", 5.7]
list_example

- Dictionary (`dict`): key–value pairs

In [None]:
dict_example = {
    1: "methane",
    2: "CO2",
    3: "water"
}
dict_example


- Tuple (`tuple`): ordered but immutable

In [None]:
tuple_example = ("Department of Environmental Science", 150.99, 44)
tuple_example

- Set (`set`): unordered collection of unique elements

In [None]:
set_example = {1, 2, 3, 4, 5, "Saturday"}
set_example

Python provides built-in functions to convert between data types (also called *type casting*).

In [None]:
# Convert string to integer
string_to_integer = int("15")
type(string_to_integer)

In [None]:
# Convert integer to string
integer_to_string = str(56)
type(integer_to_string)

In [None]:
# Convert integer to float
num_float = float(87)
type(num_float)

In [None]:
# Convert to collections
print(f'''{list("hello")},
{tuple([1, 2, 3])},
{set([1, 1, 2])}''')    # ['h', 'e', 'l', 'l', 'o'], (1, 2, 3),  {1, 2}

In [None]:
int("5") + 5
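
Conversions can also lose information or fail. The sketch below shows both cases; it uses `try`/`except`, which has not been introduced in this workshop, purely to keep the example running after the error. The variable names are illustrative.

```python
# float -> int drops the fractional part (it does not round)
truncated = int(3.9)
print(truncated)        # 3

# A string holding a decimal number must go through float() first
from_text = int(float("3.9"))
print(from_text)        # 3

# Text that does not represent a number raises ValueError
conversion_failed = False
try:
    int("fifteen")
except ValueError as err:
    conversion_failed = True
    print("Conversion failed:", err)
```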

### <font color='#ff751f'> Task 2 </font>
- Can you add booleans?
- What is the numeric value of `False`?

In [None]:
# YOUR CODE HERE!

## Functions

Functions package a specific task so you can name it, reuse it, and test it.  
They take inputs, perform work, and (usually) return a result - making programs clearer and less repetitive.

---

#### General syntax

```python
def function_name(parameters):
    """
    Optional docstring that explains what the function does,
    its parameters, return value, and any caveats.
    """
    # operations
    return value

```
- `def` starts the definition
- `function_name` follows Python naming conventions (`snake_case`)
- `parameters` are comma-separated inputs (they may have default values)
- A `return` sends a value back to the caller (optional for “void” functions)

**Example: Body Mass Index (BMI)**

In [None]:
def bmi(weight, height=180):
    """
    Compute Body Mass Index (BMI).

    Parameters
    ----------
    weight : float or int
        Body mass in kilograms.
    height : float or int, default 180
        Body height in centimeters.

    Returns
    -------
    float
        BMI computed as weight / (height/100)**2.

    Notes
    -----
    Raises ValueError if weight or height are non-positive.
    """
    if weight <= 0 or height <= 0:
        raise ValueError("weight and height must be positive")
    return weight / (height / 100) ** 2

# Calls (positional and with default)
print(bmi(85))        # height defaults to 180 cm
print(bmi(85, 195))   # override default height


> Parameters are the names in the function definition (`weight`, `height`).
> Arguments are the actual values you pass when calling the function (`85`, `195`).
>
> Python supports:
> - Positional arguments: `bmi(85, 180)`
> - Keyword arguments: `bmi(weight=85, height=180)` (clearer, order-independent)
> - Default values: `height=180` lets you omit `height`
>
> `return` hands a value back to the caller. If you omit it, the function returns `None` (sometimes called a "void" function).

In [None]:
def square_of(number):
    return number ** 2

def print_greeting(name):
    # No return statement -> returns None
    print(f"Hello, {name}!")

In [None]:
square_of(3)

In [None]:
square = square_of(3)
print(square)

In [None]:
print_greeting("Department of Environmental Science")

In [None]:
greeting = print_greeting("Department of Environmental Science")
print(greeting)

You can also add type hints to improve readability and editor support, but they are not enforced at runtime:



In [None]:
def bmi_typed(weight: float, height_cm: float = 180) -> float:
    """BMI with type hints (weight kg, height cm)."""
    if weight <= 0 or height_cm <= 0:
        raise ValueError("weight and height must be positive")
    return weight / (height_cm / 100) ** 2
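
To see that hints are not enforced, consider this small sketch with a hypothetical function `double()`. Python happily accepts a string even though the hint says `int`.

```python
def double(value: int) -> int:
    """Return twice the input (the hint says int, but Python does not check it)."""
    return value * 2

print(double(4))      # 8
print(double("ab"))   # abab: a str slips through, because hints are not enforced

# The hints are stored as metadata that editors and type checkers can read
print(double.__annotations__)
```

External tools such as type checkers or your editor use these annotations to warn you before the code runs.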

As you have already seen, there are several types of functions in Python - and we’ve actually used all of them already:
- **Built-in functions**: These are always available because they are part of Python itself.  
  Examples: `print()` to output data, `int()`, `str()`, `float()` for type conversion, and `input()` to read user input.
- **Imported functions**: Many useful functions live in external modules that you must import before using.  
  Examples: `math.sqrt()`, `math.factorial()`, or `random.randint()`.
- **User-defined functions**: These are the custom functions you write yourself to solve specific tasks.  
  They can also make use of built-in or imported functions.

## Packages

In Python, many extra functions are organized into **packages** (also called libraries).  
You’ve already seen this when we imported packages like `math` and `NumPy`.

```python
import math
import numpy as np
```
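`NumPy` was already installed in our Python environment, and `math` ships with Python itself as part of the standard library. If a package is not installed, you must install it before importing it.
In Jupyter notebooks, a line starting with `!` is run as a shell command, so you can call `pip` directly. (The `%pip install` magic does the same job and makes sure the package lands in the environment of the running kernel.)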


In [None]:
!pip install cartopy

After installation, you can import the package in the usual way:

In [None]:
import cartopy
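
If you are unsure whether a package is available, you can check before importing it. The helper `is_installed()` below is a hypothetical convenience function, sketched with the standard library's `importlib.util.find_spec()`.

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be imported in the current environment."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("math"))      # True: part of the standard library
print(is_installed("cartopy"))   # True only if cartopy has been installed
```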

There are thousands of Python packages available for science, data analysis, and environmental research.  
The most common places to find them are:
- **[PyPI (Python Package Index)](https://pypi.org/)** → the main repository where almost all Python packages are published.  
- **[Anaconda.org](https://anaconda.org/)** → if you are using Anaconda, you can search for packages here and install them with `conda`.  
- **Project websites and GitHub** → some research-oriented or niche packages are shared directly on GitHub or project homepages before they appear on PyPI.

You can search directly on PyPI, or even from Google by typing the package name + “PyPI”.  
For example: *“cartopy PyPI”*.


## Lists and `NumPy` Arrays

#### Lists

Lists are one-dimensional containers that can store **different types of data**.  

Examples of creating lists:


In [None]:
# Sequences
list(range(1, 11))       # 1 to 10
#list(range(9, 1, -1))    # reverse sequence

In [None]:
# Arbitrary elements
[1, 4, 2, 6]

In [None]:
# Mixing different data types
["A", 1, "B", True, "C", 2, 6, 8]

In [None]:
# Repeating patterns
[1, 2] * 5

Extracting data from a list is done using square brackets `[]`.

<font color='#ff751f'>Remember: in Python indexing starts at `0`.</font>

In [None]:
x = list(range(10, 0, -1))  # 10 down to 1
print(x[0])        # first element (10)
print(x[0:5])      # slice: first 5 elements
print(x[-1])       # last element (1)

#### `NumPy` Arrays

For numerical and scientific work, we often use **`NumPy` arrays**.  
All elements in a `NumPy` array have the same data type, and arithmetic operations are applied **elementwise**.


In [None]:
import numpy as np

# Creating arrays
np.arange(1, 11)                  # 1 to 10
#np.array([1, 4, 2, 6])            # arbitrary elements
#np.linspace(0, 1, num=10)         # sequence with equal spacing
#np.tile([1, 2], 5)                # repeat pattern
#np.array(["A", "B", "C"])         # character array

If you mix types, `NumPy` will **upcast** everything to a common type that can hold them all.  
You can always check what type `NumPy` chose by looking at the array’s `.dtype`.

In [None]:
# Mixed strings, numbers, and booleans → everything becomes string
arr1 = np.array(["A", 1, "B", True, "C", 2, 6, 8])
print(arr1)
print(arr1.dtype)   # dtype shows the type of array elements

# Mixing numbers and booleans → booleans are treated as integers
arr2 = np.array([True, 2, 6, 8])
print(arr2)
print(arr2.dtype)

Extracting data from `NumPy` arrays works the same way as it does for lists:

In [None]:
x = np.arange(10, 0, -1)
x[0]        # first element
#x[0:5]      # first 5 elements


`NumPy` arrays support elementwise operations:

In [None]:
np.arange(1, 11) + 5           # add 5 to each element
#np.arange(1, 11) + np.arange(10, 0, -1)  # elementwise addition
#np.log10([1, 10, 100, 1000])   # logarithm base 10


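You can also apply your own formulas to whole arrays at once:

In [None]:
weights = np.array([85, 90, 95, 100, 105])   # kg
height_cm = 194
# A new variable name keeps the bmi() function defined earlier intact
bmi_values = weights / (height_cm / 100) ** 2
print(bmi_values)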

Many functions take an array and return a single value:

In [None]:
np.min(np.arange(1, 11))
#np.max(np.arange(1, 11))
#np.mean(np.arange(1, 11))


> **Beyond 1D: Multidimensional Arrays**
>
> So far, we have only worked with **1D arrays**.  
> However, `NumPy` arrays can also be **2D (matrices)** or even **higher-dimensional**.  
> This makes `NumPy` very powerful for scientific and data analysis tasks.

In [None]:
# 2D array (matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print(matrix)
print(matrix.shape)   # (2, 3) → 2 rows, 3 columns

### <font color='#ff751f'> Task 3</font>
Define a function that **rescales** a numeric array to the range [0, 1],  
so that the minimum becomes 0 and the maximum becomes 1.

> *Hint:* for array $x$ and element $x_i$, use:
$$ \frac{x_i - \text{min}(x)}{\text{max}(x) - \text{min}(x)} $$

Test your function on the arrays `np.arange(0, 11)` (0 to 10) and `np.arange(-5, 6)` (-5 to 5). Are the results similar or different?

In [None]:
# YOUR CODE HERE!

## Pandas DataFrame

In real-world data analysis, information is often organized in **tables**, where columns contain variables of different types.\
In Python, the most common way to work with such data is with the **`Pandas`** library.

Official docs: [https://pandas.pydata.org/docs/](https://pandas.pydata.org/docs/)

#### Creating a DataFrame
You can create a DataFrame from a **dictionary** - recall from the [Data Types](#data-types) chapter that dictionaries store data as key–value pairs.  
Here, keys become column names and values become the column data.

In [None]:
import pandas as pd

stations = pd.DataFrame({
    "ID": ["101", "102", "103", "104"],
    "name": ["Umeå", "Vindeln", "Siljansfors", "Asa"],
    "longitude": [20.19, 19.46, 14.24, 14.47],
    "latitude": [63.49, 64.14, 60.53, 57.10],
    "altitude": [33, 225, 320, 180]
})

print(stations)
print(type(stations))

Adding a column: simply assign a new Series or list to a column name.

In [None]:
regions = ["Norrland", "Norrland", "Svealand", "Götaland"]
stations["region"] = regions
stations

Adding a row: use `pd.concat()` to stack rows.

In [None]:
new_station = pd.DataFrame({
    "ID": ["105"],
    "name": ["Tönnersjöheden"],
    "longitude": [13.07],
    "latitude": [56.42],
    "altitude": [80],
    "region": ["Götaland"]
})

stations = pd.concat([stations, new_station], ignore_index=True)
stations

> <font color='#ff751f'>*Think*: Why do we use `ignore_index=True`? </font>

If a row is missing some columns, `Pandas` will fill them with `NaN`.

In [None]:
extra_station = pd.DataFrame({
    "ID": ["106"],
    "name": ["Kulbäcksliden"],
    "longitude": [19.49],
    "latitude": [64.52],
    "region": ["Norrland"]
})

stations = pd.concat([stations, extra_station], ignore_index=True)
stations


We often need to subset (extract rows and columns) from our DataFrames.  
In `Pandas`, there are several ways to do this:

- **Square brackets `[]`** → most common way to access one or more columns.  
- **Dot notation `.`** → shorthand for simple column names (no spaces/special characters).  
- **`.iloc[]`** → position-based indexing (rows and columns by integer index).  
- **`.loc[]`** → label-based indexing (rows and columns by their names).  
- **`.drop()`** → remove specific rows or columns by label or index.


In [None]:
# Column access
stations["region"]      # using brackets
#stations.region         # using dot notation (works only for simple names)

In [None]:
# Single cell (row 3, column 2 → remember Python is 0-based)
stations.iloc[2, 1]     # position-based indexing
#stations.loc[2, "name"] # label-based indexing

In [None]:
# Slicing rows
stations.iloc[0:3]      # first 3 rows
#stations.loc[0:2]       # rows with index labels 0, 1, 2

In [None]:
# Selecting multiple columns
stations[["name", "altitude"]]
#stations.loc[:, ["name", "altitude"]]

In [None]:
# Drop a column
stations.drop(columns="longitude")

# Drop multiple columns
#stations.drop(columns=["longitude", "latitude"])

# Drop a row (by index label)
#stations.drop(index=2)  # removes the 3rd row (index 2)
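
In the `stations` table the index labels happen to be 0, 1, 2, ..., so `.iloc[]` and `.loc[]` look almost identical. The difference becomes visible when the labels are something else. The sketch below uses a hypothetical mini-table `demo` whose index holds station IDs.

```python
import pandas as pd

# Hypothetical mini-table whose index labels are station IDs, not 0, 1, 2
demo = pd.DataFrame(
    {"name": ["Umeå", "Vindeln", "Asa"], "altitude": [33, 225, 180]},
    index=[101, 102, 104],
)

print(demo.iloc[0])        # first row, selected by position
print(demo.loc[101])       # the same row, selected by its label
print(demo.iloc[0:2])      # positions 0 and 1 (end excluded)
print(demo.loc[101:102])   # labels 101 and 102 (end included)
```

Note that position slices exclude the end point, while label slices include it.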


We can also filter rows based on specific conditions.

In [None]:
# Filter rows where altitude is exactly (`==`) 80
stations[stations["altitude"] == 80]

In [None]:
# Filter rows where altitude is NOT equal to 80
stations[stations["altitude"] != 80]

In [None]:
# To remove missing values use .dropna()
stations.dropna()

In [None]:
# Filter rows where altitude is between 100 and 200
stations[(stations["altitude"] > 100) & (stations["altitude"] < 200)]

In [None]:
# Filter rows where altitude is either > 200 or < 100
stations[(stations["altitude"] > 200) | (stations["altitude"] < 100)]
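
The filters above use `&` and `|` rather than `and` and `or`, because each comparison produces a whole column of `True`/`False` values (a "mask") that has to be combined element by element. The parentheses are needed because `&` and `|` bind more tightly than comparison operators. A minimal sketch with a hypothetical `altitude` Series:

```python
import pandas as pd

altitude = pd.Series([33, 225, 320, 180, 80])   # hypothetical values

# Each comparison yields a boolean Series with one entry per row
above_100 = altitude > 100
below_200 = altitude < 200

# & combines the masks elementwise; ~ inverts a mask
between = above_100 & below_200
print(between.tolist())              # [False, False, False, True, False]
print(altitude[between].tolist())    # [180]
print(altitude[~between].tolist())   # [33, 225, 320, 80]
```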

## Reading and Writing Data Files

Entering data manually is **time-consuming and error-prone**.  
In real data analysis workflows, we almost never type data by hand. Instead, we **read data directly from files**.

Depending on the field and application, data can come in many different formats. For example:
- **Text-based tables**, such as simple `.txt` files or logs, which can be read using basic text import functions.
- **Highly specialized formats**, such as weather radar data stored in the `ODIM HDF5` standard, which require dedicated libraries to read and interpret correctly.

In practice, most day-to-day data analysis involves **tabular data**. This type of data is most commonly stored as:
- **CSV** (`.csv`) files, where values are separated by commas
- **TSV** (`.tsv`) files, where values are separated by tabs
- **Excel** files (`.xlsx`, `.xls`), typically used for spreadsheets

> I strongly recommend using **`TSV` files** whenever possible.  
> Although `CSV` files are very common, they can be problematic because commas often appear inside data values, such as in addresses or descriptive text.\
> Unless such values are properly quoted, this leads to parsing errors and unexpected behavior.  
>  
> **`TSV` files** are usually more robust, since tab characters rarely appear inside the data itself. This makes them easier and safer to read reliably.
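
To illustrate the comma problem, the sketch below parses one made-up record with Python's built-in `csv` module. The record and the variable names are invented for the example. A proper parser handles a quoted field correctly, while naive splitting on commas does not.

```python
import csv
import io

# One header line and one record; the address contains a comma but is quoted
csv_text = 'name,address\nStation A,"Example Road 1, Uppsala"\n'

# Naive splitting breaks the quoted field into two pieces
naive = csv_text.splitlines()[1].split(",")
print(naive)     # ['Station A', '"Example Road 1', ' Uppsala"']

# A real CSV parser respects the quotes
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows[1])   # ['Station A', 'Example Road 1, Uppsala']
```

Trouble starts when a file was written without quoting, which is exactly the situation a tab delimiter tends to avoid.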



There are **several ways** to import data into Python.  
The most common and flexible function is `pd.read_csv()`, which works not only for CSV but also for many other text-based tabular formats if you set the `sep=` argument.  
You can also use `pd.read_table()`, which does the same job but uses a tab as its default delimiter.  

To try this out, download the file `berry_data.csv` from the following source:  
Langvall, O. (2021). *Swedish Forest Phenology dataset (Version 1)* [Data set]. Swedish University of Agricultural Sciences. https://doi.org/10.5878/jbab-cy46  
> Dataset page: https://researchdata.se/en/catalogue/dataset/2021-194-1/1  

Once you have the file, place it in your working directory.  

Now let’s read it into Python using both `read_table()` and `read_csv()`.

In [None]:
import os

# Check files in the working directory
print(os.listdir())

In [None]:
# Read with explicit delimiter
#pd.read_table("berry_data.csv", sep=",")

# Read with shortcut
#pd.read_csv("berry_data.csv")


### <font color='#ff751f'> Task 4 </font>  

1. Read the file `berry_data.csv` into a DataFrame called `berry_data`.


In [None]:
# YOUR CODE HERE!

>Files may have quirks (custom missing value markers, unexpected separators, wrong data types, etc.).  
Most of these can be handled via parameters in `pd.read_csv()` and other `read_*` functions.  
Check the [Pandas documentation on IO tools](https://pandas.pydata.org/docs/user_guide/io.html) for details and examples.
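
As an illustration of such parameters, the sketch below reads a small made-up text that uses semicolons as separators and `-999` as a missing-value marker. The content and column names are hypothetical; `io.StringIO` merely stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical file content: semicolon-separated, "-999" marks missing values
raw_text = "station;year;flowers\n101;2019;12\n102;2020;-999\n"

df = pd.read_csv(
    io.StringIO(raw_text),    # stands in for a file path
    sep=";",                  # non-default separator
    na_values=["-999"],       # treat -999 as missing
    dtype={"station": str},   # keep station IDs as text
)
print(df)
print(df.dtypes)
```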

After processing data in Python, we often want to export it back to a file.  The `pandas` package provides convenient functions `to_csv()` and `to_excel()` for writing tabular data.

In [None]:
# Save as TSV (tab-separated values)
#berry_data.to_csv("berry_data.tsv", sep="\t", index=False)

In [None]:
# Check files in the working directory
#print(os.listdir())

## Data Wrangling and Data Manipulation  

In real projects, data is **rarely in the exact format you need**.  
Before cleaning or transforming, first **explore** what you have.

Exploration helps answer:
- How many rows/columns?
- Which variables (features) are included?
- What are the dtypes?
- Are there missing or unusual values?

**Useful `pandas` functions** (with `berry_data` as example):


In [None]:
berry_data.shape # (rows, columns)

In [None]:
berry_data.columns # column names

In [None]:
berry_data.info() # dtypes + non-null counts

In [None]:
berry_data.describe() # quick stats for numeric columns

In [None]:
berry_data.corr(numeric_only=True) # correlation matrix of the numeric columns (numeric_only=True skips text columns)

In [None]:
berry_data.head() # first rows
#berry_data.tail(3) # last 3 rows

In [None]:
berry_data["Station"].unique() # distinct values in a column

In [None]:
berry_data["Station"].value_counts() # frequency counts

In [None]:
berry_data.isnull() # missing values

Once we know what the dataset looks like, the next step is wrangling:
the process of cleaning, structuring, and transforming raw or messy data into a usable format.

Typical wrangling tasks include:
- Handling missing values
- Merging multiple datasets
- Reshaping between wide and long formats
- Converting variables to the right data types (`dtypes`)

Closely related is data manipulation, which covers operations like:
- Filtering rows
- Sorting observations
- Aggregating values
- Creating or modifying columns

`pandas` API docs: https://pandas.pydata.org/docs/


### Core `pandas` Operations

`pandas` is centered on the **DataFrame/Series** objects and a set of **methods** you chain together.
Most methods return a new object, so you can build readable pipelines.

**Index/select**
- `df[...]`, `.loc[row_sel, col_sel]`, `.iloc[row_idx, col_idx]`
- `query()` (string conditions), `eval()` (expression evaluation)

**Missing data**
- `isna()`, `notna()`
- `dropna()` (remove missing), `fillna()` (impute/replace)

**Create/modify columns**
- `assign(...)` (add columns), direct assignment `df["new"] = ...`
- `rename(columns=...)`, `astype(...)` (change dtype)
- Vectorized helpers: `np.where`, `np.select`, string ops via `.str.*`

**Summaries & grouping**
- `agg(...)`, `mean()`, `size()`, `nunique()`
- `groupby(...).agg(...)`, `groupby(...).transform(...)` (broadcast group stats back to rows)

**Sorting & ordering**
- `sort_values(...)`, `sort_index(...)`

**Reshape (wide ↔ long)**
- `melt()` (wide → long), `pivot()` / `pivot_table()` (long → wide)
- `stack()`, `unstack()` for index-based reshaping

**Combine tables**
- `merge(left, right, how=..., on=...)` (joins)
- `concat([...], axis=0/1)` (stack rows / add columns)
- `join()` (index-based combine)

**Duplicates & uniques**
- `duplicated()`, `drop_duplicates()`, `unique()`, `value_counts()`

**Categorical & datetimes**
- `astype("category")`, `.cat.categories`, `.cat.reorder_categories(...)`
- `to_datetime(...)`, `.dt` accessor; time series `set_index(...).resample("ME").agg(...)` (month-end; spelled `"M"` in pandas versions before 2.2)

**I/O**
- `read_csv()`, `read_table()`, `read_excel()`, `to_csv()`, `to_parquet()`…
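
Because most of these methods return a new object, several of them can be chained into one readable pipeline. The sketch below uses a small invented table `obs` (not the real `berry_data`) and a `lambda`, which is simply a short unnamed function.

```python
import pandas as pd

# Hypothetical observations (not the real berry_data)
obs = pd.DataFrame({
    "Station": [101, 101, 102, 102, 103],
    "Year":    [2019, 2020, 2019, 2020, 2020],
    "Flowers": [10, 14, 7, None, 21],
})

summary = (
    obs
    .dropna(subset=["Flowers"])                        # remove rows with missing counts
    .query("Year >= 2019")                             # filter rows with a string condition
    .assign(many_flowers=lambda d: d["Flowers"] > 9)   # add a derived column
    .groupby("Station", as_index=False)                # group by station
    .agg(mean_flowers=("Flowers", "mean"), n=("Flowers", "size"))
    .sort_values("mean_flowers", ascending=False)      # largest mean first
)
print(summary)
```

Wrapping the chain in parentheses lets each step sit on its own line, which makes the order of operations easy to follow.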




In [None]:
# Keep only the columns 'doy' (day of year) and 'Ripe' from the dataset
berry_data[["doy", "Ripe"]]

In [None]:
# Keep all columns except 'doy' and 'Ripe'
berry_data.drop(columns=["doy", "Ripe"])


In [None]:
# Rename columns
berry_data.rename(columns={"doy": "day_of_year"})

In [None]:
# Keep only the rows where the value in 'Year' is greater than 2019
berry_data.loc[berry_data["Year"] > 2019]

In [None]:
# Keep only the rows where the value in 'Year' is exactly 2012
berry_data.loc[berry_data.Year == 2012]

In [None]:
# Keep only the rows where 'Station' is either 104 or 105
berry_data[berry_data.Station.isin([104, 105])]

In [None]:
# Keep only the rows where 'Station' is 105 AND 'Year' is 2020
berry_data[(berry_data.Station == 105) & (berry_data.Year == 2020)]

In [None]:
# You can create new columns with function .assign()
berry_data.assign(second_half = berry_data["doy"] > 365/2)[["doy", "second_half"]]

In [None]:
# You can also have more descriptive values
berry_data.assign(half = np.where(berry_data["doy"] < 365/2, "first_half", "second_half"))

In [None]:
berry_data.assign(quarter = np.select(
        [berry_data["doy"] <= 91, berry_data["doy"] <= 182, berry_data["doy"] <= 273],
        ["Q1", "Q2", "Q3"],
        default="Q4"
    ))

In [None]:
# Missing data
berry_data.isna().sum()
#berry_data.fillna({"Flowers": 0})     # fill missing values with a specified value or strategy
#berry_data.dropna(how='any', inplace=False) # drop rows with at least one missing value

In [None]:
# Calculating summary statistics
berry_data.Flowers.mean() # calculate mean of 'Flowers' column

In [None]:
# Grouped summaries
berry_data.groupby(["Station"], dropna=False).agg(mean_flower_count=("Flowers", "mean"), N=("Flowers", "size")).reset_index()

In [None]:
# Arranging data
berry_data.sort_values(["Flowers"], ascending=False)

In [None]:
# Reshaping DataFrames (wide ↔ long)
long_df = pd.melt(berry_data, id_vars=["Year", "Species"], value_vars=["Flowers"], var_name="variable", value_name="value")
wide_df = long_df.pivot_table(index=["Year", "Species"], columns="variable", values="value", aggfunc="mean").reset_index()

In [None]:
long_df

In [None]:
wide_df

In [None]:
# Changing data types
berry_data["Station"].dtype, berry_data["Station"].astype("str").dtype
#pd.to_datetime(berry_data["Year"], format="%Y")  # parse integer years as dates (1 January of each year)

### Joining DataFrames  

In many cases, information is spread across multiple tables, and we need to **combine them based on shared keys** (e.g., IDs, station codes, years).  
This process is called a **join**.  

In `pandas`, joins are done with the [`merge()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) function (similar to SQL joins).  
The argument `how=` specifies the type of join:  

- `how="inner"` → **inner join**: keeps only rows with matching keys in both tables  
- `how="left"` → **left join**: keeps all rows from `df1` (the left table) and adds matching info from `df2`  
- `how="right"` → **right join**: keeps all rows from `df2` (the right table) and adds matching info from `df1`  
- `how="outer"` → **full outer join**: keeps all rows from both tables, filling in `NaN` where no match is found

<p align="left">
  <img src="https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/joins.jpg"
       alt="Joins"
       width="500">
</p>


In [None]:
df1 = pd.DataFrame({
    "Station": [101, 102, 103],
    "Name": ["Umeå", "Vindeln", "Siljansfors"]
})

df2 = pd.DataFrame({
    "Station": [102, 103, 104],
    "Altitude": [225, 320, 180]
})

# Inner join
pd.merge(df1, df2, on="Station", how="inner")

In [None]:
# Left join
pd.merge(df1, df2, on="Station", how="left")

In [None]:
# Right join
pd.merge(df1, df2, on="Station", how="right")

In [None]:
# Full outer join
pd.merge(df1, df2, on="Station", how="outer")
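
When checking a join, it can help to see which table each row came from. One option is the `indicator=True` argument of `merge()`, sketched here on the same two small tables:

```python
import pandas as pd

df1 = pd.DataFrame({"Station": [101, 102, 103],
                    "Name": ["Umeå", "Vindeln", "Siljansfors"]})
df2 = pd.DataFrame({"Station": [102, 103, 104],
                    "Altitude": [225, 320, 180]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
checked = pd.merge(df1, df2, on="Station", how="outer", indicator=True)
print(checked[["Station", "_merge"]])
```

Rows marked `left_only` or `right_only` are the ones an inner join would drop.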

## Iterations

**Loops** let you run a block of code repeatedly. They’re essential for automating repetitive work, walking through data structures, and implementing algorithms that repeat until a condition is met. In Python you’ll mainly use two loop constructs: **`while`** and **`for`**.


> #### Updating variables
> Before looping, it’s useful to understand how values get updated.

In [None]:
# Reassignment based on the previous value
x = 0
x = x + 1     # read the right-hand side first, then assign to x
x

> Python evaluates the **right side first**, then assigns to the name on the left.\
> If you use a variable before it has a value, you’ll get a `NameError`.\
> Shorthand operators make updates concise: `x += 1` (increment), `x -= 1` (decrement), `x *= 2`, `x /= 2`, etc.\
> Example (cumulative rainfall over a week):

In [None]:
rain_mm = [2.0, 0.0, 4.5, 1.2, 0.0, 8.3, 3.1]
total = 0.0
for day in rain_mm:
    total += day
    print(total)
print(f"Cumulative rainfall over a week: {total} mm")


### `while` loops

A `while` loop repeats as long as its condition is `True`. The condition is checked before each iteration.
<p align="left">
  <img src="https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/while_loop.jpg"
       alt="while_loop"
       width="500">
</p>

In [None]:
# Sum daily CO₂ measurements until a threshold is reached
co2_readings = [392, 401, 407, 415, 420, 425]
i = 0
total = 0

while i < len(co2_readings) and total < 1200:
    total += co2_readings[i]
    i += 1

total, i  # cumulative sum and how many readings used


Use `while` when you don’t know ahead of time how many iterations you need.\
Make sure something inside the loop changes the condition; otherwise you’ll loop forever.

#### Infinite loops (and how they happen)
If the condition never becomes `False`, the loop never ends.

In [None]:
# Example of an infinite loop (a bug); left commented out so the notebook does not hang
#i = 0
#while i < 5:
#    print(i)
#    # i += 1  # forgotten increment → infinite loop


Sometimes infinite loops are intentional (e.g., continuously reading a sensor stream); they are then ended with a `break` statement when a stop condition occurs.
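
A minimal sketch of such an intentional loop. The sensor stream is simulated with a made-up list, and `None` is an assumed end-of-stream marker:

```python
# Made-up sensor values; None marks the end of the stream (an assumption for this sketch)
readings = iter([14.2, 14.8, 15.1, None, 15.5])

count = 0
while True:                 # the condition never becomes False: the loop only ends via break
    value = next(readings)
    if value is None:       # stop condition
        break
    count += 1
    print(value)

print(f"Read {count} values before the stream ended.")
```

The value after `None` is never read, because `break` leaves the loop immediately.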

### `for` loops

A `for` loop iterates over items in a sequence (list, string, NumPy array, etc.). Use it when the number of iterations is known or you’re iterating over a collection.
<p align="left">
  <img src="https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/for_loop.jpg"
       alt="for_loop"
       width="500">
</p>

In [None]:
# Print each station name
stations = ["Umeå", "Vindeln", "Siljansfors", "Asa"]
for name in stations:
    print(name)


#### Iterating with `range()`
`range(stop)` generates `0, 1, 2, ..., stop-1` (the end is exclusive).
You can also use `range(start, stop, step)`.

In [None]:
# Days of year: first five days
for doy in range(1, 6):   # 1..5
    print(doy)
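
The three-argument form `range(start, stop, step)` mentioned above works the same way, just with a custom step size. A short sketch with made-up values:

```python
# Every 7th day of January (step = 7); day 32 is the exclusive stop
weekly_days = list(range(1, 32, 7))
print(weekly_days)        # [1, 8, 15, 22, 29]

# A negative step counts down; the stop value is still exclusive
countdown = list(range(5, 0, -1))
print(countdown)          # [5, 4, 3, 2, 1]
```

Wrapping `range()` in `list()` is only needed to print all values at once; a `for` loop can iterate over the `range` directly.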


In [None]:
# Simple growing degree-day (GDD) accumulator
base = 10
daily_tmean = [8, 12, 15, 9, 14, 11]  # °C
gdd = 0
for t in daily_tmean:
    gdd += max(0, t - base)
gdd

## Conditional Statements
Conditional statements are one of the most important building blocks of programming. They let us make decisions in code:
- Run certain parts of code only if conditions are `True`.
- Skip or choose alternatives if they’re `False`.

This enables us to write programs that adapt to different situations, much like we do in real-world decision-making.

### `if` Statement

The simplest form is the `if` statement.
It checks whether a condition is `True` and, if so, executes the block of code inside it.
<p align="left">
  <img src="https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/if_condition.jpg"
       alt="if condition"
       width="270">
</p>

In [None]:
# Check if temperature is above freezing
temperature = -3

if temperature > 0:
    print("Water is liquid.")

> Nothing is printed here, since the condition was `False`.\
> If the temperature had been above 0, the program would have printed `"Water is liquid."`.

### `else` Clause

The `else` clause provides an alternative when the condition in `if` is not met.
<p align="left">
  <img src="https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/if_and_else_condition.jpg"
       alt="if and else condition"
       width="300">
</p>

In [None]:
# Are trees actively photosynthesizing?
sunlight = False

if sunlight:
    print("Photosynthesis is happening.")
else:
    print("Trees are not photosynthesizing right now.")

The `else` block ensures that something always happens, whether the `if` condition is `True` or not.

### `elif` Clause

The `elif` ("else if") clause allows us to test multiple conditions sequentially.
It’s useful when there are more than two possible outcomes.
<p align="left">
  <img src="https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/elif_condition.jpg"
       alt="elif_condition"
       width="500">
</p>

In [None]:
# Classifying the Air Quality Index (AQI)
aqi = 135

if aqi <= 50:
    print("Air quality is good.")
elif aqi <= 100:
    print("Air quality is moderate.")
elif aqi <= 150:
    print("Air quality is unhealthy for sensitive groups.")
else:
    print("Air quality is unhealthy for everyone.")


> Here the program evaluates the conditions in order and executes only the first matching block.
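
Because only the first matching block runs, each `elif` implicitly means "and none of the earlier conditions were true". The sketch below wraps the same thresholds in a function (`classify_aqi` is a name chosen for this illustration) so that several values can be tried:

```python
def classify_aqi(aqi):
    """Return an air-quality label using the same thresholds as the cell above."""
    if aqi <= 50:
        return "good"
    elif aqi <= 100:        # only reached if aqi > 50
        return "moderate"
    elif aqi <= 150:        # only reached if aqi > 100
        return "unhealthy for sensitive groups"
    else:
        return "unhealthy for everyone"

for value in [42, 100, 135, 300]:
    print(value, "->", classify_aqi(value))
```

If the conditions were listed in the opposite order (`aqi <= 150` first), every value up to 150 would receive the same label, so the order of the tests matters.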

### Nested Conditional Statements

Sometimes, one decision depends on another. That’s when nested `if` statements come in handy.

In [None]:
# Checking if rainfall is enough for crops
rainfall_mm = 8

if rainfall_mm > 0:
    print("It rained today.")
    if rainfall_mm >= 50:
        print("Soil moisture is sufficient for crops.")
    else:
        print("Rainfall was too low, irrigation might be needed.")
else:
    print("No rainfall today.")


> Be careful: nesting too many `if` statements can make your code hard to read and maintain.\
> Often, there are cleaner alternatives (e.g., combining conditions with `and`/`or`, or using dictionaries).
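
A rough sketch of both alternatives. The variable `irrigation_available` and the `advice` dictionary are hypothetical additions for illustration and are not part of the example above:

```python
rainfall_mm = 8
irrigation_available = True   # hypothetical extra input

# Two nested ifs collapse into one condition joined with `and`
if rainfall_mm < 50 and irrigation_available:
    action = "irrigate"
else:
    action = "wait"
print(action)

# A dictionary can replace an if/elif chain that only looks up a value
advice = {
    "dry": "Irrigation might be needed.",
    "wet": "Soil moisture is sufficient for crops.",
}
category = "wet" if rainfall_mm >= 50 else "dry"
print(advice[category])
```

Whether this is clearer than nesting depends on the case; nesting is still a reasonable choice when the inner decision only makes sense after the outer one.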

## Data Visualization

One of the most effective ways to understand a dataset is to **visualize it**.  
Plots help reveal patterns, trends, differences between groups, and potential issues such as outliers or missing values that are not always obvious from tables alone.

For this reason, data visualization is a central part of **exploratory data analysis (EDA)** workflows.\
In practice, visualization is often one of the first steps taken after loading and inspecting a dataset.

In this section, we will practice data visualization using the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins/) dataset.

The dataset contains real biological measurements of three penguin species (*Adélie*, *Chinstrap*, and *Gentoo*) collected by Dr. Kristen Gorman and the Palmer Station Long Term Ecological Research (LTER) program in Antarctica’s Palmer Archipelago. Each row corresponds to a single penguin, and the columns contain numerical measurements and categorical information such as species, island, and sex.

The main columns are:

| Column             | Description                                         | Python data type (pandas) |  
|--------------------|-----------------------------------------------------|---------------------------|  
| `species`          | Penguin species (*Adélie*, *Chinstrap*, *Gentoo*)   | categorical (`object`)    |  
| `island`           | Island in the Palmer Archipelago where observed     | categorical (`object`)    |  
| `bill_length_mm`   | Length of the penguin’s bill (mm)                   | numeric (`float64`)       |  
| `bill_depth_mm`    | Depth (thickness) of the penguin’s bill (mm)        | numeric (`float64`)       |  
| `flipper_length_mm`| Length of the penguin’s flipper (mm)                | numeric (`float64`)       |  
| `body_mass_g`      | Body mass of the penguin (g)                        | numeric (`float64`)       |  
| `sex`              | Male or female (some values missing)                | categorical (`object`)    |  
| `year`             | Year of observation (2007–2009)                     | numeric (`int64`)         |  

<p align="center">
  <img src="https://allisonhorst.github.io/palmerpenguins/reference/figures/lter_penguins.png"
       alt="palmer_penguins"
       width="600">
</p>


In [None]:
!pip install palmerpenguins

In [None]:
from palmerpenguins import load_penguins       # function to load the Palmer Penguins dataset

Now that we have installed and loaded our packages, let’s load the `palmerpenguins` dataset and take a first look at it.

In [None]:
# Load the penguins dataset
penguins = load_penguins()

# Preview the first rows
penguins.head()

In [None]:
# Check dimensions (rows x columns)
penguins.shape

In [None]:
# Quick summary of all variables
penguins.describe(include="all")

### Plotting with `pandas`

A convenient way to start visualizing data in Python is to use the **built-in plotting methods in `pandas`**.

These methods work directly on `DataFrame` and `Series` objects and allow you to create common plots with only a few lines of code. This makes them well suited for quick exploration of the data.

The plots are based on a more general plotting system that we will introduce **in the next part**. For now, we use `pandas` plotting as a simple starting point.

> **General pattern**
> ```python
> DataFrame.plot(kind="plot_type", ...)
> Series.plot(kind="plot_type", ...)
> ```

Common plot types include:
- `"line"` (default)
- `"bar"` / `"barh"`
- `"hist"`
- `"box"`
- `"kde"` / `"density"`
- `"area"`
- `"pie"`
- `"scatter"` (`DataFrame` only)
- `"hexbin"` (`DataFrame` only)

We will use these plots to get an initial visual overview of the dataset.


#### Histogram

Let’s start with a simple histogram of penguin flipper lengths.  
We can use the `kind='hist'` argument inside `.plot()`.
> (*Histograms are great for showing the distribution of one numeric variable.*)

In [None]:
# Basic histogram of flipper length
penguins["flipper_length_mm"].plot(kind='hist')

That already works, but `pandas` also gives us **method-specific shortcuts**.  
For example, instead of `kind='hist'`, we can directly use `.plot.hist()`.  
This also allows us to add more options for customization.

In [None]:
# Histogram of flipper length with extra options
penguins["flipper_length_mm"].plot.hist(
    bins=20,                 # number of bins
    color="#c25bc8",         # fill color
    edgecolor="black",       # outline for bars
    title="Distribution of Flipper Length"  # figure title
)

#### Density Plot
A **density plot** (Kernel Density Estimate, or KDE) is a smoothed version of the histogram. Instead of showing counts per bin, it estimates the probability density of the variable.  

This is useful for seeing the *overall distribution shape* without being sensitive to the choice of bin width.
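
To make "smoothed version of the histogram" more concrete, the following sketch builds a Gaussian KDE by hand with NumPy. The values and the bandwidth are made up for illustration; `pandas` delegates the real computation to SciPy and chooses the bandwidth automatically.

```python
import numpy as np

# Made-up flipper lengths in mm (not the real penguin data)
values = np.array([181.0, 186.0, 190.0, 195.0, 210.0, 215.0, 220.0])
bandwidth = 5.0                       # assumed smoothing width (mm), chosen by hand
grid = np.linspace(150, 250, 1001)    # x positions where the curve is evaluated

# One Gaussian bump per observation, evaluated at every grid point
z = (grid[:, None] - values[None, :]) / bandwidth
bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# The density estimate is the average of all bumps
density = bumps.mean(axis=1)

# A density should integrate to approximately 1
step = grid[1] - grid[0]
area = density.sum() * step
print(round(area, 3))
```

A larger `bandwidth` gives a smoother curve, while a smaller one follows individual observations more closely. This plays the same role as the bin width in a histogram.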

In [None]:
# Basic density plot of flipper length
penguins["flipper_length_mm"].plot.kde()


That gives us the smoothed curve.  But we can also customize its appearance.

In [None]:
# KDE with formatting
penguins["flipper_length_mm"].plot.kde(
    lw=2,                      # line width
    color="#c25bc8",           # line color
    title="Density of Flipper Length"
)

> *Notes:*
> - `.plot.kde()` is the same as `.plot(kind="kde")` or `.plot.density()`  
> - KDE plots assume a **continuous numeric variable**  


#### Boxplot

Boxplots are excellent for comparing the **distribution of a numeric variable across categories**.  
They show:
- Median (line inside the box)  
- Interquartile range (the box)  
- Whiskers (spread of most of the data)
- Potential outliers (points beyond whiskers)
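
The quantities in this list can be computed by hand. The sketch below uses made-up body masses and assumes the common 1.5 × IQR whisker rule, which is also the `matplotlib` default:

```python
import numpy as np

# Made-up body masses in g (not the real penguin data)
mass = np.array([3200, 3400, 3550, 3700, 3800, 3950, 4100, 4300, 6300])

q1, median, q3 = np.percentile(mass, [25, 50, 75])
iqr = q3 - q1                                  # height of the box

# Assumed whisker rule: 1.5 * IQR beyond the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the limits
inside = mass[(mass >= lower_limit) & (mass <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is drawn as an individual point
outliers = mass[(mass < lower_limit) | (mass > upper_limit)]

print(median, q1, q3, whisker_low, whisker_high, outliers)
```

Here the single very heavy value ends up as an outlier, while the upper whisker stops at the largest remaining observation.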

In [None]:
# Basic boxplot of body mass by species
penguins.boxplot(
    column="body_mass_g",     # numeric variable
    by="species",             # grouping category
    grid=False
)

> *Notes:*
> - `column=...` → numeric column(s) to plot  
> - `by=...` → categorical column used for grouping  
>
>
> *Boxplots are **very powerful for grouped comparisons** and are often used together with violin plots (we’ll get to that in the `seaborn` section).*

#### Bar Chart

Bar charts are very useful for comparing **counts across categories**.


In [None]:
# Bar chart of species counts
penguins["species"].value_counts().plot.bar(color="#c25bc8", edgecolor="black", title="Number of Penguins per Species")

> *Notes:*
> - `.value_counts()` gives the counts of each category  
> - `.plot.bar()` makes a vertical bar chart (use `.plot.barh()` for horizontal)  


We can also show counts of multiple categories **stacked on top of each other**.

In [None]:
# Count of species split by sex (stacked bars)
penguins.groupby(["species", "sex"]).size().unstack().plot.bar(stacked=True, color=["darkorange", "#184445"], edgecolor="black", title="Penguin Counts by Species and Sex (stacked)")

Instead of stacking, we can put bars **side by side** for easier comparisons.  
This is sometimes called a “dodged” bar plot.

In [None]:
# Count of species split by sex (grouped bars, side by side)
penguins.groupby(["species", "sex"]).size().unstack().plot.bar(stacked=False, color=["darkorange", "#184445"], edgecolor="black", title="Penguin Counts by Species and Sex (grouped)")

> *Notes:*
> - `stacked=True` → layers categories on top of each other  
> - `stacked=False` (default with multiple columns) → grouped bars side by side  
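
Both bar charts are drawn from the same intermediate table. The sketch below shows what `groupby(...).size().unstack()` produces, using a made-up mini table; the `fill_value=0` argument is an addition for this sketch so that missing combinations show as `0` instead of `NaN`.

```python
import pandas as pd

# Made-up mini table (not the real penguin data)
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo"],
    "sex":     ["female", "male",   "male",   "female", "female"],
})

counts = toy.groupby(["species", "sex"]).size()   # one count per (species, sex) pair
table = counts.unstack(fill_value=0)              # the 'sex' levels become columns
print(table)
```

Each row of `table` becomes one position on the x-axis and each column becomes one colored bar (or bar segment when `stacked=True`).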

#### Scatterplot
Scatterplots show the **relationship between two continuous variables**.  
Unlike histograms, here we need both an `x` and a `y` variable.  

> *Recall*: `.plot.scatter()` works only on a **DataFrame** (not a single Series).

In [None]:
# Basic scatterplot: flipper length vs body mass
penguins.plot.scatter(x="flipper_length_mm", y="body_mass_g")

Let's enhance it by **mapping color to species**, which makes the groups easier to see.

In [None]:
# Drop NAs (.copy() avoids a possible SettingWithCopyWarning when adding a column below)
df = penguins.dropna(subset=["bill_length_mm", "bill_depth_mm", "species"]).copy()

# Define the original Palmer penguins colors
penguin_colors = {
    "Adelie": "#FF8C00",    # dark orange
    "Chinstrap": "#9932CC", # purple
    "Gentoo": "#008B8B"     # teal
}

# Map each species to its color
df["species_color"] = df["species"].map(penguin_colors)

# Pandas scatter plot using custom colors
df.plot.scatter(
    x="bill_length_mm",
    y="bill_depth_mm",
    c=df["species_color"],    # supply actual colors, not colormap
    alpha=0.8,
    title="Bill Length vs Bill Depth (colored by species)"
)


### From `pandas` to `matplotlib`

As we saw, `pandas`’ built-in plotting functions are **great for quick visualizations**.  
They are especially useful while you are working on a project and just want to check the data.  

But if we want **more flexibility and customization**, then [**`matplotlib`**](https://matplotlib.org/) is the true workhorse of plotting in Python.  
It gives us control over almost every aspect of a figure: size, colors, markers, labels, titles, styles, and more.  

Below are some **main concepts** to keep in mind when using `matplotlib`:
- **Figure**: the whole canvas (can contain multiple plots)  
- **Axes**: one plot (with *x*- and *y*-axis, labels, etc.)  
- **Axis**: individual *x*- or *y*-axis  
- **Artist**: everything visible (lines, points, text, legends…)  
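
As a rough illustration of how these four concepts appear in code, the sketch below creates one figure with one plot from made-up numbers and looks at the objects involved:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()                 # Figure (canvas) and one Axes (a single plot)
line, = ax.plot([1, 2, 3], [2, 4, 3])    # the plotted line is an Artist on the Axes

ax.set_xlabel("x")                       # labels are set through the Axes ...
ax.xaxis.set_ticks([1, 2, 3])            # ... while ticks belong to the Axis object

print(type(fig).__name__)                # Figure
print(type(ax.xaxis).__name__)           # XAxis
print(line in ax.get_children())         # True: the line is one of the Axes' artists
```

Note the difference between `Axes` (the whole plot area, `ax`) and `Axis` (a single x- or y-axis, `ax.xaxis` or `ax.yaxis`); the similar names are a common source of confusion.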

![cheatsheet1.png](https://matplotlib.org/cheatsheets/_images/cheatsheets-1.png)
![cheatsheet2.png](https://matplotlib.org/cheatsheets/_images/cheatsheets-2.png)

Let’s start with a very simple example.

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Basic histogram with Matplotlib
# Drop missing values to avoid errors
flipper_lengths = penguins["flipper_length_mm"].dropna()

# Create a histogram
plt.hist(flipper_lengths)

# Add title and axis labels
plt.title("Distribution of Flipper Length")
plt.xlabel("Flipper length (mm)")
plt.ylabel("Count")

# Show the figure
plt.show()

That’s the most basic `matplotlib` plot: `plt.hist()` creates a histogram, then we add a title and axis labels.  

Notice how similar this is to what we did with **`pandas`’ built-in plotting**.  

**Other common `matplotlib` functions:**
- `plt.plot()` → line plot  
- `plt.scatter()` → scatter plot  
- `plt.bar()` / `plt.barh()` → bar charts  
- `plt.boxplot()` → boxplots  

In other words: **the same figures we made in `pandas` can also be made here in `matplotlib`**, but `matplotlib` allows finer control over style, layout, labels, colors, and more.


#### Figure Size and Style

We often want to control how our plots look. There are **two main ways**:

1. Globally with `rcParams` (applies to all plots in the notebook):  
```python
plt.rcParams["figure.figsize"] = (6,4)   # default size (width, height in inches)
plt.rcParams["axes.labelsize"] = 12      # font size of axis labels
plt.rcParams["axes.titlesize"] = 14      # font size of titles
```
2. Per-plot inside `plt.figure()` or `plt.subplots()` (only affects one plot):
```python
fig, ax = plt.subplots(figsize=(6,4))    # create one figure and one axes
ax.hist(flipper_lengths, bins=20)        # histogram with 20 bins
ax.set_title("Distribution of Flipper Length")  # set title for this axes
```

In [None]:
fig, ax = plt.subplots(figsize=(12,8))    # create one figure and one axes
ax.hist(flipper_lengths, bins=20)        # histogram with 20 bins
ax.set_title("Distribution of Flipper Length")  # set title for this axes

> *Note:* The object-oriented approach with `fig, ax = plt.subplots()` is recommended, because it gives us full control.

#### Subplots

With `plt.subplots(rows, cols)` you create a **grid of Axes** to draw into.

> *Typical workflow:*
> 1. Create a grid of axes with `fig, axes = plt.subplots(nrows, ncols, figsize=(...))`  
> 2. Loop over your groups (e.g., species)  
> 3. For each group, select the matching `ax` and plot there  
> 4. Tidy labels, titles, legends, colorbars

In [None]:
# Create a 1x3 grid of subplots (one per species)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4), sharey=True)

species = ["Adelie", "Chinstrap", "Gentoo"]

for ax, sp in zip(axes, species):
    data = df.loc[df["species"] == sp, "flipper_length_mm"].dropna()
    ax.hist(data, bins=20)
    ax.set_title(sp)
    ax.set_xlabel("Flipper length (mm)")
    ax.set_ylabel("Count")

fig.suptitle("Flipper length distributions by species")
plt.tight_layout()
plt.show()

In the previous example, all histograms used the default color settings.  
In the next example, we assign a fixed color to each penguin species.

In [None]:
# Define the Palmer penguins colors
penguin_colors = {
    "Adelie": "#FF8C00",    # dark orange
    "Chinstrap": "#9932CC", # purple
    "Gentoo": "#008B8B"     # teal
}

# Create a 1x3 grid of subplots (one per species)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4), sharey=True)

species = ["Adelie", "Chinstrap", "Gentoo"]

for ax, sp in zip(axes, species):
    data = df.loc[df["species"] == sp, "flipper_length_mm"].dropna()
    ax.hist(
        data,
        bins=20,
        color=penguin_colors[sp],
        edgecolor="black"
    )
    ax.set_title(sp)
    ax.set_xlabel("Flipper length (mm)")
    ax.set_ylabel("Count")

fig.suptitle("Flipper length distributions by species")
plt.tight_layout()
plt.show()


### Plotting with `seaborn`

While plotting directly with `matplotlib` gives you full control, it often requires more code and manual adjustments, especially when working with grouped data or statistical summaries.

This is where [**`seaborn`**](https://seaborn.pydata.org/index.html) becomes useful. `seaborn` is built on top of `matplotlib` and is designed specifically for **statistical data visualization**. It makes common tasks such as comparing groups, visualizing distributions, and showing relationships between variables easier and more concise.

For this reason, `seaborn` is widely used in data analysis and exploratory data analysis workflows. In the next examples, we use `seaborn` to create the same types of plots with less code and clearer defaults.


In [None]:
import seaborn as sns

# Define the Palmer penguins colors
penguin_colors = {
    "Adelie": "#FF8C00",    # dark orange
    "Chinstrap": "#9932CC", # purple
    "Gentoo": "#008B8B"     # teal
}

g = sns.FacetGrid(data=df, col="species", hue="species", col_order=["Adelie", "Chinstrap", "Gentoo"], 
                  palette=penguin_colors, height=4, aspect=1, legend_out=False)

g.map_dataframe(sns.histplot, x="flipper_length_mm", bins=20, stat="count", edgecolor="black")

g.set_axis_labels("Flipper length (mm)", "Count")
g.set_titles("{col_name}")
g.figure.suptitle("Flipper length distributions by species", y=1.05)  # .figure replaces the deprecated .fig in recent seaborn versions

plt.show()


Earlier, we used boxplots to compare distributions across groups.  
While boxplots are very compact and informative, they hide some details about the shape of the data.

A closely related visualization is the **violin plot**.  
Violin plots combine the summary information of a boxplot with a smooth estimate of the data distribution. This makes them especially useful when comparing distributions across groups.


In [None]:
plt.figure(figsize=(6, 4))

sns.violinplot(data=penguins, x="species", y="flipper_length_mm", palette=penguin_colors, inner="box")

plt.xlabel("Species")
plt.ylabel("Flipper length (mm)")
plt.title("Flipper length by species")
plt.tight_layout()
plt.show()


In the plot above, `inner="box"` (which is also seaborn's default) draws a small boxplot inside each violin.  
Instead of a boxplot, it is also possible to display the **individual data points**.

This is often done using **jittered points**, which makes it easier to see the actual observations, especially when the sample size is small to moderate.


In [None]:
plt.figure(figsize=(6, 4))

sns.violinplot(data=penguins, x="species", y="flipper_length_mm", palette=penguin_colors, inner=None, alpha=0.5)

sns.stripplot(data=penguins, x="species", y="flipper_length_mm", palette=penguin_colors, jitter=True)

plt.xlabel("Species")
plt.ylabel("Flipper length (mm)")
plt.title("Flipper length by species")
plt.tight_layout()
plt.show()


In addition to basic plot types, **`seaborn`** also provides higher-level visualization tools that are especially useful during exploratory data analysis.

One common example is **`pairplot`**, which automatically creates a grid of plots showing pairwise relationships between multiple numerical variables, along with their distributions. This is a quick way to explore correlations, clusters, and differences between groups.

In the next example, we use `pairplot` to get an overview of relationships between several penguin measurements.


In [None]:
sns.pairplot(penguins, vars=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], corner=True, height=2, aspect=1, plot_kws={'color': 'darkgray'}, diag_kws={'color': 'darkgray'} )
plt.show()

> By default, `pairplot()` creates a full matrix of plots - meaning it repeats the same scatterplots above and below the diagonal.\
> Setting `corner=True` tells seaborn to show only the lower triangle of this grid, removing duplicate plots and making the figure cleaner and easier to read.
>
> We can also color the points by species to reveal how group differences shape these relationships:

In [None]:
sns.pairplot(penguins, vars=['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g'], hue='species',
             palette={'Gentoo': '#1b7173', 'Adelie': '#fb7302', 'Chinstrap': '#c25bc7'}, corner=True, height=2, aspect=1)
plt.show()

---

#### Final note  
*If some of this felt confusing, that’s a good sign - it usually means you’re doing it right.*
