## Image Analysis - DataFrames and Automation

In this notebook, we will explore how to:
- use **pandas** for working with tabular data
- perform basic statistical analysis with **scipy**
- get familiar with another Python plotting library — **seaborn**

Finally, we will practice **automation on multiple files**, both for tabular data and for image processing tasks.

---
---
### Lesson 1: DataFrames (pandas package)

Pandas is a library used for data manipulation and analysis, particularly useful for dealing with structured data like tables. It provides the tools to create and work with DataFrames.

A DataFrame is a two-dimensional data structure similar to a spreadsheet (table).

We typically import pandas like this:
```python
import pandas as pd

```

In [None]:
# You can create a DataFrame from different types of data.  

# Create a DataFrame from a dictionary of lists
import pandas as pd
import numpy as np

measurement_data = {
    'cell_id': [1, 2, 3, 4, 5],
    'area': [150, 230, 95, 180, 40],
    'intensity': [88.5, 95.2, 101.0, 89.7, np.nan]
}

df = pd.DataFrame(measurement_data)

# Display the DataFrame
df

**nan** stands for **Not a Number**

It’s a special floating-point value used to represent missing or undefined numerical data.
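
A short sketch of how NaN behaves in practice, using a small table made up for illustration (the name `toy_df` is not part of the lesson data). NaN never compares equal to itself, so `isna()` is the reliable way to find missing values. pandas reductions such as `mean()` skip NaN by default, while NumPy's plain `np.mean` does not.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy_df = pd.DataFrame({"intensity": [88.5, 95.2, np.nan]})

# NaN is not equal to anything, including itself
nan_equals_itself = (np.nan == np.nan)

# isna() returns a boolean Series marking missing values
missing_mask = toy_df["intensity"].isna()
n_missing = int(missing_mask.sum())

# pandas skips NaN by default; NumPy's plain mean propagates it
pandas_mean = toy_df["intensity"].mean()
numpy_mean = np.mean(toy_df["intensity"].to_numpy())

print(nan_equals_itself, n_missing, pandas_mean, numpy_mean)
```

If a NaN-aware NumPy reduction is needed, `np.nanmean` is one option.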

In [None]:
# We can easily get some statistics for our data with .describe()

print("\n--- Summary Statistics ---")
df.describe()

#### Column/row extraction

In [None]:
# You can extract a single column from a pandas DataFrame by column name

# This returns a pandas Series (keeps the row index)
area_column = df['area']

print(area_column) 
print()
print(type(area_column)) # pandas Series 

In [None]:
# Access the underlying NumPy array of the column
# This returns raw values only (no index)
area_values = df['area'].values
print(area_values)
print()
print(type(area_values))

In [None]:
# Row extraction

row1 = df.iloc[0]   # try with .values
print('First row:\n', row1)

print()

row_selection = df.iloc[[0,-1]]
print('First and last row:\n', row_selection)

#### Data filtering

In [None]:
# You can filter data based on conditions

# Create a new DataFrame containing only cells with area less than a value.
filtered_df = df[df['area'] < 200]
filtered_df

In [None]:
# The comparison below results in a boolean 'mask' (one True/False value per row)
less_than = df['area'] < 200
less_than

In [None]:
# we then use this mask to filter our DataFrame

df[less_than]

In [None]:
# Combining multiple logical conditions

# Each condition must be in parentheses ( )
filtered_df2 = df[(df['area'] < 200) & (df['area'] > 100)]
filtered_df2
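
Besides `&` (and), masks can be combined with `|` (or) and inverted with `~` (not). The sketch below rebuilds the `area` column from this lesson as a toy table; `between()` is a pandas Series method that includes both bounds by default.

```python
import pandas as pd

# Toy copy of the lesson's measurements
toy_df = pd.DataFrame({"cell_id": [1, 2, 3, 4, 5],
                       "area": [150, 230, 95, 180, 40]})

# OR: very small or very large cells
extremes = toy_df[(toy_df["area"] < 100) | (toy_df["area"] > 200)]

# NOT: invert a mask with ~
not_small = toy_df[~(toy_df["area"] < 100)]

# between() includes both bounds by default
mid_sized = toy_df[toy_df["area"].between(100, 200)]

print(extremes["cell_id"].tolist())
print(not_small["cell_id"].tolist())
print(mid_sized["cell_id"].tolist())
```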


In [None]:
# Note in the output above that the rows keep their index labels from the original DataFrame
# Use reset_index() if you want the index renumbered from 0

filtered_df2_reset = filtered_df2.reset_index(drop=True) # check output when drop=False
filtered_df2_reset
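
Why the index matters after filtering: `.loc` selects by index label while `.iloc` selects by position, so the two disagree once rows have been dropped. A minimal sketch with toy data (the variable names are illustrative):

```python
import pandas as pd

toy_df = pd.DataFrame({"area": [150, 230, 95, 180, 40]})
filtered = toy_df[toy_df["area"] > 100]        # keeps index labels 0, 1, 3

first_by_position = filtered.iloc[0]["area"]   # first remaining row
row_label_3 = filtered.loc[3, "area"]          # row whose label is 3

# After reset_index(drop=True), labels and positions match again
renumbered = filtered.reset_index(drop=True)

print(filtered.index.tolist(), renumbered.index.tolist())
```

Note that `filtered.loc[2]` would raise a `KeyError` here, because label 2 was filtered out.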

In [None]:
# Remove rows with missing (nan) values

df_clean = df.dropna()
df_clean

In [None]:
# You can sort dataframes by one or more columns.
# ascending=True/False controls ascending/descending order

sorted_df = df.sort_values(by='intensity', ascending=False)
sorted_df

In [None]:
# Display first or last rows from DataFrame

print(df.head(2))

print(df.tail(2))

##### --- ***Exercise*** ---

You have experimental data in a dictionary. Your task is to analyze it with Pandas.
1. Create a **Pandas DataFrame** from the given data dictionary.
2. Filter the DataFrame to find all rows where the **Treatment** is `'Drug_A'`.

   *Hint: use the equality operator `==`*
3. From the filtered data, compute the **average Score** for Drug A.

   *Hint: you can use .mean() on extracted 'Score' column*
4. Using the original DataFrame, find out **which treatments were `Effective`**.  

   *Hint: you can use `.unique()` on the `Treatment` column, or `numpy.unique(df['Treatment'].values)` to find the unique treatment names*

5. Display the **3 rows with the highest Score**.  

   *Hint: you can sort by `Score` in descending order and use `.head(3)`*

In [None]:
# Experimental data for the exercise
data = {'Animal_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
        'Treatment': ['Drug_A', 'Drug_B', 'Placebo', 'Drug_A', 'Drug_B', 'Placebo', 'Drug_A', 'Drug_B', 'Drug_A', 'Drug_B'],
        'Result': ['Effective', 'Not Effective', 'Not Effective', 'Effective', 'Effective', 'Not Effective', 'Effective', 'Effective', 'Effective', 'Not Effective'],
        'Score': [50.3, 3.5, 10.1, 17.0, 93.3, 1.5, 99.9, 73.7, 69.2, 0.5]}

# Your code here


<details>
<summary>Click to see the example solution</summary>

```python
# 1. Create DataFrame from dictionary
df = pd.DataFrame(data)

# 2. Filter rows where Treatment is 'Drug_A'
drug_a_df = df[df['Treatment'] == 'Drug_A']
print("Rows with Drug_A treatment:")
print(drug_a_df)

# 3. Average Score for Drug_A
avg_score_drug_a = drug_a_df['Score'].mean()
print("\nAverage Score for Drug_A:", avg_score_drug_a)

# 4. Which treatments were effective?
effective_treatments = df[df['Result'] == 'Effective']['Treatment'].unique()
print("\nTreatments with at least one effective result:", effective_treatments)

# 5. Display 3 rows with the highest 'Score'
top3_scores = df.sort_values('Score', ascending=False).head(3)
print("\nTop 3 rows by Score:")
print(top3_scores)
```

</details>


<br><br>

##### Reading and saving

In [None]:
# We can save our DataFrame as a table (CSV, Excel, ...)
sorted_df.to_csv('first_measurements.csv', index=False)

In [None]:
# We can read tables from file into DataFrames
loaded_df = pd.read_csv('first_measurements.csv') # check other file formats
loaded_df

In [None]:
# DataFrames are convenient for plotting
import matplotlib.pyplot as plt 

plt.figure(figsize=(4, 4))
plt.scatter(loaded_df['area'], loaded_df['intensity'])
plt.title('scatterplot')
plt.xlabel('Area')
plt.ylabel('Mean intensity')
plt.show()

##### Aggregation

When we have multiple DataFrames, we can combine them into a single one.

pd.concat() is a pandas function used to combine multiple DataFrames along a particular axis (rows or columns)
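
A small sketch of both concatenation directions, using toy tables in place of real measurement files. The `source` column is an assumption added for illustration: tagging each table before combining makes it possible to tell afterwards where a row came from.

```python
import pandas as pd

# Toy tables standing in for per-image measurement files
df_a = pd.DataFrame({"Area": [10, 12]})
df_b = pd.DataFrame({"Area": [20]})

# Record the origin of each row before combining
df_a["source"] = "image_a"
df_b["source"] = "image_b"

# axis=0 (default) stacks rows; ignore_index=True renumbers them from 0
stacked = pd.concat([df_a, df_b], ignore_index=True)

# axis=1 places data side by side, aligning rows on the index;
# positions without a partner are filled with NaN
side_by_side = pd.concat([df_a["Area"], df_b["Area"]], axis=1, keys=["a", "b"])

print(stacked)
print(side_by_side)
```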

In [None]:
# Aggregation (multiple tables)

df_image1 = pd.read_csv("../data/cells_control.csv")
df_image2 = pd.read_csv("../data/cells_diseased.csv")
df_image3 = pd.read_csv("../data/cells_conditioned.csv")

# Merge dataframes vertically
combined_df = pd.concat([df_image1, df_image2, df_image3], ignore_index=True) # default axis is 0 (rows)

combined_df

##### DataFrame data types

Not all columns have to store numerical data.

Data types can be for example:
- int, float → numeric columns
- bool → boolean
- object → general-purpose column (strings, mixed data...)
- category → categorical data (memory-efficient, useful for grouping)


You can quickly check the data types of each column using:
```python
print(df.dtypes)   # Shows the type of each column
print(df.info())   # Shows summary including data types and non-null counts
```

You can convert string/object columns to categorical with method: `.astype("category")`
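
To see what the conversion does, here is a sketch with a made-up column of repeated group names. A categorical column stores each distinct value once and keeps only a small integer code per row, which is where the memory saving comes from.

```python
import pandas as pd

# Toy column with many repeated group names
groups = pd.Series(["Control", "Diseased", "Control", "Diseased"] * 250)
groups_cat = groups.astype("category")

# The distinct values are stored once...
print(groups_cat.cat.categories.tolist())
# ...and each row holds only an integer code
print(groups_cat.cat.codes.head(4).tolist())

# Compare the memory footprint (in bytes) of both representations
bytes_plain = groups.memory_usage(deep=True)
bytes_category = groups_cat.memory_usage(deep=True)
print(bytes_plain, bytes_category)
```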

In [None]:
print(combined_df.info())

In [None]:
combined_df["Group"] = combined_df["Group"].astype("category")
combined_df["State"] = combined_df["State"].astype("category")

print(combined_df.dtypes)

##### Group-level statistics

When your dataset has **categorical variables** (like experimental groups, treatments, or conditions) and **numeric measurements**, you might be interested in doing statistics **per group**.

- `describe()` can give summary statistics for the **entire DataFrame**, but it does not separate by category
- to get statistics per group, you can use `groupby()` combined with `agg()` (aggregate)
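
The two points above can be sketched on a toy table (column names and values are made up). Selecting the numeric column before calling `agg()` avoids having to drop non-numeric columns, and pandas' named aggregation gives flat, self-describing column names.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Group": ["Control", "Control", "Diseased", "Diseased"],
    "Area": [100.0, 120.0, 200.0, 240.0],
})

# Select the numeric column first, then aggregate per group
summary = toy_df.groupby("Group")["Area"].agg(["count", "mean", "std"])
print(summary)

# Named aggregation: new_column=(source_column, function)
named = toy_df.groupby("Group").agg(
    n_cells=("Area", "count"),
    mean_area=("Area", "mean"),
)
print(named)
```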

In [None]:
# The describe() method is applied to the entire DataFrame without discriminating between categories
combined_df.describe()

In [None]:
# If we want to extract statistics per group, we can use the methods groupby() and agg()

# !! agg() works only on numerical columns
# We need to remove all other types first (except the ones we are grouping by)
reduced_df = combined_df.drop("State", axis=1) # this removes column 'State'

# Now we group by column 'Group'
summary_per_group = reduced_df.groupby('Group', observed=False).agg(["count", "mean", "std"]) 
summary_per_group

In [None]:
# we can use multiple factors
summary_per_group_and_state = combined_df.groupby(['Group','State'], observed=False).agg(["count","sum", "mean", "std"]) 
summary_per_group_and_state

##### --- ***Exercise*** ---


1. On `combined_df`, compute **mean, maximum, and standard deviation** of all numerical columns per factor **State**. 

2. Find out which `State` has the largest average `Circularity`.

 - *Hint: you can extract values from the summary table by specifying the column and sub-column, e.g. `['Circularity', 'mean']`.*
 - *You can use the method `.idxmax()` to find the row label with the maximum value.* 

In [None]:
# Your code here...


<details>
<summary>Click to see the example solution</summary>

```python

# 1. Compute mean, maximum, and std per State
r_df = combined_df.drop('Group', axis=1)
summary_state = r_df.groupby("State", observed=False).agg(["mean", "max", "std"]).round(2) # optional rounding
print(summary_state)

# 2. Find out State with largest mean Circularity
max_circ = summary_state['Circularity', 'mean'].max() # get maximum

# Option 1 - filter rows with the maximum
max_circ_df = summary_state[summary_state['Circularity', 'mean'] == max_circ]
print(max_circ_df.index.tolist()) # after groupby, 'State' is the index, not a column

# Option 2 - use dedicated method
print(summary_state["Circularity", "mean"].idxmax())

```

</details>


##### Visualizing group differences with Seaborn

The **Seaborn** library builds on Matplotlib but provides higher-level plotting functions tailored for statistical data.

A particularly useful feature is the `hue` argument, which colors data points by group/category.

In [None]:
# Seaborn 
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
sns.boxplot(
    data=combined_df, 
    x="Group", 
    y="Mean_Intensity", 
    hue="Group", 
    palette="Set2" # color-coding
    )

plt.show()

In [None]:
# Some seaborn plots also allow discrimination of multiple factors
  
plt.figure(figsize=(6, 6))
sns.scatterplot(
    data=combined_df, 
    x="Area", y="Circularity", 
    hue="Group", style="State"
    )
plt.show()

#### Hypothesis Testing

Visualizations help us see patterns and differences in data, but they are not enough for statistical confirmation.  
To formally test whether groups differ, we use hypothesis testing.

In Python, the `scipy.stats` module provides a variety of statistical tests.

Below, we will use a **t-test** to compare whether the **means of two groups are equal**.
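
To make the test less of a black box, the sketch below computes Student's t statistic by hand on made-up numbers and compares it with `scipy.stats.ttest_ind`. It also shows Welch's variant (`equal_var=False`), which is one option when the two groups cannot be assumed to have equal variances.

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy measurements for two groups (made up for illustration)
group_a = np.array([10.1, 11.3, 9.8, 10.7, 10.4])
group_b = np.array([12.0, 12.9, 11.6, 13.1, 12.4])

# Student's t statistic by hand (pooled variance, equal variances assumed)
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * group_a.var(ddof=1) +
              (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# The same test with scipy (equal_var=True is the default)
t_student, p_student = ttest_ind(group_a, group_b)

# Welch's t-test drops the equal-variance assumption
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print(t_manual, t_student, p_student, p_welch)
```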

In [None]:
# Hypothesis testing

from scipy.stats import ttest_ind

# The following lines combine filtering the DataFrame with extracting the column Mean_Intensity
controls = combined_df[combined_df['Group']=='Control']['Mean_Intensity']
diseased = combined_df[combined_df['Group']=='Diseased']['Mean_Intensity']

# ttest_ind() takes two lists/arrays of values
# by default it performs a two-sided t-test and assumes equal variances
t_stat, p_value = ttest_ind(controls, diseased)

print(f"\nT-test p-value for comparing Mean_Intensity: {p_value:.9f}") # formatting number to 9 decimal places

##### --- ***Exercise*** ---

1. Make a boxplot comparing **Area** across categories of **State**.
2. Add a swarmplot (`sns.swarmplot`) on top of boxplot to see individual datapoints.
3. Perform a t-test comparing "Circularity" between Control and Diseased.
4. Perform a t-test comparing "Mean_Intensity" between Starved Pre-conditioned and Unstarved Pre-conditioned.

In [None]:
# Your code here...


<details>
<summary>Click to see the example solution</summary>

```python

# Boxplot with swarmplot for Area per State
plt.figure(figsize=(6, 4))
sns.boxplot(data=combined_df, x="State", y="Area", palette="Set2", hue='State')
sns.swarmplot(data=combined_df, x="State", y="Area", color='black')
plt.tight_layout()
plt.show()

# T-test Circularity: Control vs Diseased
controls = combined_df[combined_df["Group"] == "Control"]["Circularity"]
diseased = combined_df[combined_df["Group"] == "Diseased"]["Circularity"]
t_stat, p_value = ttest_ind(controls, diseased)
print(f"T-test Circularity (Control vs Diseased): p = {p_value:.4f}")

# T-test Mean_Intensity: Starved vs Unstarved Pre-conditioned
pre_starved = combined_df[
    (combined_df["Group"] == "Pre-conditioned") & (combined_df["State"] == "Starved")
]["Mean_Intensity"]
pre_unstarved = combined_df[
    (combined_df["Group"] == "Pre-conditioned") & (combined_df["State"] == "Unstarved")
]["Mean_Intensity"]
t_stat, p_value = ttest_ind(pre_starved, pre_unstarved)
print(f"T-test Mean_Intensity (Starved vs Unstarved, Pre-conditioned): p = {p_value:.4f}")
```

</details>


---
---

### Lesson 2: Automation

#### A Quick Guide to File Paths

Python offers several ways to handle paths, each with its own strengths. All are useful because they automatically use the correct path separator (`/` or `\`).

Covered packages: `os`, `pathlib`, and `glob`

In [None]:
# --- os.path.join(): The safe way to build paths ---
import os

# This automatically uses the correct separator for your OS (`/` or `\`)
folder = "data"
filename = "image_01.tif"

correct_path = os.path.join(folder, filename)
print(f"OS-safe path: {correct_path}")

correct_path = os.path.join(folder, 'subfolder', filename)
print(f"OS-safe path: {correct_path}")

In [None]:
# Get directory path and filename parts
base_name = os.path.basename(correct_path) # image_01.tif
parent_path = os.path.dirname(correct_path) # data/subfolder (data\subfolder on Windows)
parent_dir = os.path.basename(parent_path) # subfolder

print(f"File name part: {base_name}")
print(f"Parent directory path: {parent_path}")
print(f"Parent directory name: {parent_dir}")
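
A related helper is `os.path.splitext()`, which separates a file name from its extension. The sketch below uses it to derive an output file name from an input path; the folder and file names are hypothetical.

```python
import os

# Hypothetical input path, built the OS-safe way
image_path = os.path.join("data", "subfolder", "image_01.tif")

# Split the file name into stem and extension
stem, extension = os.path.splitext(os.path.basename(image_path))

# Derive a results file name that keeps the original stem
results_name = stem + "_measurements.csv"
results_path = os.path.join("results", results_name)

print(stem, extension)
print(results_path)
```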

In [None]:
# --- pathlib  ---
from pathlib import Path

# pathlib treats paths as objects, you can use the `/` operator to build paths
#csv_file_path = '..' / 'data' / "cells_control.csv" # this gives error
csv_file_path = '..' / Path('data') / "cells_control.csv"
print(f"Path:   {csv_file_path}   is   {type(csv_file_path)}")

print(f"Does this file exist? {csv_file_path.exists()}")

In [None]:
# --- Get parts of the path ---

print(f"Filename only: {csv_file_path.name}")
print(f"Parent folder path: {csv_file_path.parent}")
print(f"Parent folder: {csv_file_path.parent.name}")
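
`Path` objects offer a few more attributes and methods that are handy when deriving output names. A minimal sketch, reusing the path from above (the derived file names are only examples):

```python
from pathlib import Path

csv_path = Path("..") / "data" / "cells_control.csv"

print(csv_path.stem)     # file name without extension
print(csv_path.suffix)   # extension including the dot

# Build related paths without string slicing
xlsx_path = csv_path.with_suffix(".xlsx")
summary_path = csv_path.with_name(csv_path.stem + "_summary.csv")

print(xlsx_path.name, summary_path.name)
```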

Creating new directories (folders) is possible with both os and pathlib

In [None]:
# Create the directory if it doesn’t exist yet

# os
os.makedirs('test_folder_os', exist_ok=True)

# pathlib
folder = Path("test_folder_pathlib")
folder.mkdir(exist_ok=True)
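
For nested folders the two approaches differ slightly: `os.makedirs()` creates missing intermediate folders on its own, while `Path.mkdir()` needs `parents=True`. The sketch below works inside a temporary directory so that nothing is left on disk; the folder names are hypothetical.

```python
import os
import tempfile
from pathlib import Path

# Work inside a temporary folder that is deleted afterwards
with tempfile.TemporaryDirectory() as tmp:
    # os.makedirs creates intermediate folders automatically
    nested_os = os.path.join(tmp, "results_os", "plots")
    os.makedirs(nested_os, exist_ok=True)

    # Path.mkdir needs parents=True to do the same
    nested_pathlib = Path(tmp) / "results_pathlib" / "plots"
    nested_pathlib.mkdir(parents=True, exist_ok=True)

    os_created = os.path.isdir(nested_os)
    pathlib_created = nested_pathlib.is_dir()

print(os_created, pathlib_created)
```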

##### Listing files in folder

**os**
```python
os.listdir()
```

**glob** - finding files with wildcards 
```python
import glob
glob.glob("folder/*.csv") # returns a list of strings

# or 
Path("folder").glob("*.csv") # returns a generator, not a list
```


Wildcards are special characters that help you match patterns in file names — instead of typing exact file names.

For example, `*` matches any number of characters.

| Goal                | glob                                         | pathlib                       |
| ------------------- | -------------------------------------------- | ----------------------------- |
| Find only in folder | `glob.glob("folder/*.tif")`                    | `Path("folder").glob("*.tif")`  |
| Find recursively    | `glob.glob("folder/**/*.tif", recursive=True)` | `Path("folder").rglob("*.tif")` |
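
The table can be tried out on a throwaway folder tree. The sketch below creates a few empty files in a temporary directory (all names are hypothetical) and compares the top-level and recursive searches.

```python
import glob
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a small folder tree with empty files
    root = Path(tmp)
    (root / "sub").mkdir()
    for relative in ["a.tif", "b.csv", "sub/c.tif"]:
        (root / relative).touch()

    # Top level only
    top_level = sorted(p.name for p in root.glob("*.tif"))

    # Including subfolders: rglob, or glob.glob with ** and recursive=True
    recursive_pathlib = sorted(p.name for p in root.rglob("*.tif"))
    recursive_glob = sorted(
        os.path.basename(f)
        for f in glob.glob(os.path.join(tmp, "**", "*.tif"), recursive=True)
    )

print(top_level, recursive_pathlib, recursive_glob)
```

The results are sorted here because neither `glob` nor `pathlib` guarantees a particular order.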

In [None]:
# os.listdir() lists all files/folders

my_path = os.path.join('..', 'data')
files = os.listdir(my_path)

# Output is a list of filenames
print(files[:10]) # print first 10 files

In [None]:
# List comprehension
only_tif_files = [tif for tif in files if tif.endswith('.tif')]

# the above is same as typing:
only_tif_files = []
for tif in files:
    if tif.endswith('.tif'):
        only_tif_files.append(tif)

# print selection
print(only_tif_files[:10])

In [None]:
# --- glob.glob(): the way to find files with wildcards ---
import glob

# Find all files ending with .tif
my_path = os.path.join('..', 'data')

tif_files = glob.glob(os.path.join(my_path, "*.tif"))

print(f"Found {len(tif_files)} TIF files")
print(tif_files[-5:]) # glob returns list of strings


In [None]:
# To search recursively (including subfolders)

tif_files_recursive = glob.glob(os.path.join(r"../data", "**", "*.tif"), recursive=True)
print(len(tif_files_recursive))
print(tif_files_recursive[-5:])

##### --- **Exercise** ---

1. **Create a path** to the `data` folder by combining the relative marker `'..'` and the folder name `'data'` - store the path in a variable `data_path`.
   - Use either **`os`** or **`pathlib`**.
2. **List all files** in your defined path and print how many files were found (`len()` function).  
    - *Hint*: Use `os.listdir()` or `glob.glob(data_path+"/*")`
3. **List all images** in the folder (ending with **.tif** or **.png**).
    - *Hint*: you can filter the previous output with the string method `endswith()`
    - or combine glob searches like `glob.glob("folder/*.tif") + glob.glob("folder/*.png")`
4. **Find all .csv files** that include **'con'** in their name (e.g. control, conditioned)
    - *Hint*: use `'substring' in 'string'` to find if `'con'` is in the filename
    - Read them into a list of DataFrames with pandas - `pd.read_csv()`
    - Combine list of DataFrames into one DataFrame - `pd.concat(list_of_dfs, ignore_index=True)`
    - Save the combined DataFrame to a new folder `results` as `combined.csv` 
        - `os.makedirs()` to create a folder and `dataframe.to_csv()` to save DataFrame

In [None]:
# Your code here...


<details>
<summary>Click to see the example solution</summary>


```python
# EXAMPLE SOLUTION
import os
import glob
import pandas as pd
from pathlib import Path

# --- 1. Create path to data folder ---
data_path = os.path.join("..", "data")
print("Data path:", data_path)
# or
data_path2 = Path("..") / "data"
print("Data path:", data_path2)

# --- 2. List all files and count them ---
all_files = glob.glob(os.path.join(data_path, "*"))
# or
all_files = glob.glob(data_path + "/*")
print(f"Found {len(all_files)} files:")
print(all_files)

# --- 3. List all image files (.tif or .png) ---
image_files = glob.glob(os.path.join(data_path, "*.tif")) + glob.glob(os.path.join(data_path, "*.png"))
print(f"\nFound {len(image_files)} image files:")
print(image_files)

# or list comprehension with tuple ('.tif', '.png')
image_files2 = [file for file in all_files if file.endswith(('.tif', '.png'))]
print(f"\nFound {len(image_files2)} image files.")

# --- 4. Find all CSVs with 'con' in name ---
csv_files = glob.glob(os.path.join(data_path, "*.csv"))
csv_files_con = [f for f in csv_files if "con" in os.path.basename(f)]
print(f"\nCSV files with 'con' in name: {csv_files_con}")

# read the CSVs with pandas and store them in a list
dfs = [pd.read_csv(f) for f in csv_files_con]

# merge the DataFrames into a single one
combined_df = pd.concat(dfs, ignore_index=True)

# Create results folder and save the final DataFrame
os.makedirs("results", exist_ok=True)
combined_df.to_csv("results/combined.csv", index=False)


```

</details>


---
---

### Lesson 3: Image analysis workflow - recap

In the introductory session, we practiced designing and tuning a full image-processing workflow:
- We loaded and visualized microscopy images.
- We adjusted individual processing steps such as filtering, thresholding, and segmentation.
- We extracted quantitative measurements (e.g., area, intensity, circularity).
- We packaged this workflow in a function to be able to run it in a loop.

In this lesson, we will:
- recap that session,
- add analysis of the measurements using the **pandas**, **seaborn**, and **scipy** packages,
- practice automatic processing of a batch of images.

##### Image loading

Example image loading libraries:
- scikit-image (skimage)
- imageio
- bioio
- tifffile

In [None]:
from skimage import io
import imageio.v3 as iio

input_image = io.imread('../data/noisy_cells.tif')
input_image2 = iio.imread('../data/noisy_cells.tif')

print('Image shape:', input_image.shape)
print('Image shape:', input_image2.shape)

print('Are images identical?', np.array_equal(input_image, input_image2))

##### Image processing & segmentation

Libraries like skimage and scipy provide a variety of modules with functions for common image-processing operations — such as filtering, thresholding, transformations, and other manipulations of images or masks.

In [None]:
from skimage import filters

gaussian_blur = filters.gaussian(input_image, sigma=2)         
# Smooths image, reduces noise by averaging neighboring pixels

median_filtered = filters.median(input_image)                  
# Reduces noise while better preserving edges

sobel_edges = filters.sobel(input_image)                       
# Enhances edges by computing the gradient magnitude of the image

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
ax = axes.ravel()

ax[0].imshow(input_image, cmap='gray')
ax[0].set_title('Original Image')

ax[1].imshow(gaussian_blur, cmap='gray')
ax[1].set_title('Gaussian Blur')

ax[2].imshow(median_filtered, cmap='gray')
ax[2].set_title('Median Filter')

ax[3].imshow(sobel_edges, cmap='gray')
ax[3].set_title('Sobel Edges')


for a in ax:
    a.axis('off')

plt.tight_layout()
plt.show()

Segmentation is the process of separating an image into regions. 

In a simple case, we can segment image pixels into two classes (e.g., foreground vs. background) by applying a threshold on pixel values.

The thresholded image produces a mask, which is a boolean array indicating which pixels belong to the foreground (True) and which to the background (False).


The result of instance segmentation using `skimage.measure.label()` is called labels. Labels is typically an integer array where each connected object (e.g., a cell, a nucleus) is assigned a unique integer value, and the background is usually 0. This makes it possible to identify and analyze individual objects separately.
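
The difference between a mask and labels is easiest to see on a tiny synthetic array. To keep dependencies small, this sketch uses `scipy.ndimage.label` as a stand-in for `skimage.measure.label`; both return an integer label image, but their default neighborhoods differ (4-connected for SciPy versus 8-connected for scikit-image in 2D). The objects below are far enough apart that this makes no difference.

```python
import numpy as np
from scipy import ndimage as ndi

# Tiny synthetic image: two bright blobs on a dark background
image = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 0, 0, 0, 8, 8],
    [0, 0, 0, 0, 8, 8],
])

# Thresholding gives a boolean mask (True = foreground)
mask = image > 5

# Labeling assigns one integer per connected object; background stays 0
labels, n_objects = ndi.label(mask)

print(mask.dtype, n_objects)
print(labels)
```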

In [None]:
threshold_m = 30 # manual 
threshold_a = filters.threshold_mean(gaussian_blur) # automatic 

print('Detected threshold value:', threshold_a)

binary_mask = gaussian_blur > threshold_a
print(binary_mask.dtype)

plt.imshow(binary_mask, cmap='gray')
plt.show()

In [None]:
# Improve mask with post-processing
from skimage import morphology
filled = morphology.remove_small_holes(binary_mask, area_threshold=200)
cleaned = morphology.remove_small_objects(filled, min_size=50)

plt.imshow(cleaned, cmap='gray')
plt.show()

In [None]:
from skimage import measure

labels = measure.label(cleaned)
print(labels.dtype)

plt.imshow(labels, cmap = 'nipy_spectral')
plt.axis('off')
plt.show()

##### Object measurements

The `measure.regionprops_table()` function from scikit-image computes quantitative features for each labeled region in an image.  

Each unique region (identified by a unique label) can be characterized by geometric and intensity-based properties.

The output of this function is a dictionary containing one array per measured property.
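
Because the result is a dictionary of equally long arrays, it maps directly onto DataFrame columns. The sketch below uses a hand-written dictionary with made-up values in place of a real `regionprops_table` result, and adds a derived column using one common definition of circularity, 4π · area / perimeter².

```python
import numpy as np
import pandas as pd

# Hand-written stand-in for a regionprops_table result (values are made up)
props_example = {
    "label": np.array([1, 2, 3]),
    "area": np.array([314.0, 200.0, 50.0]),
    "perimeter": np.array([62.8, 60.0, 40.0]),
}

# One key per property becomes one column per property
example_df = pd.DataFrame(props_example)

# Derived feature: circularity = 4 * pi * area / perimeter**2
example_df["circularity"] = (
    4 * np.pi * example_df["area"] / example_df["perimeter"] ** 2
)

print(example_df)
```

A value close to 1 indicates a nearly circular object, and elongated or irregular shapes give smaller values.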

In [None]:
# Measurements
 
# We specify which properties we want to measure for each object.
properties_to_measure = ['label', 'area', 'mean_intensity', 'perimeter', 'eccentricity']

# regionprops_table uses our `labels` image and the original `input_image`
props_dict = measure.regionprops_table(
    labels, # input mask/labels image
    intensity_image=input_image, # intensity image used for the raw intensity measurements
    properties=properties_to_measure # an iterable (list, tuple, ...) of properties to measure
)

props_dict

In [None]:
# Convert the dictionary of results into a Pandas DataFrame
import pandas as pd

image_df = pd.DataFrame(props_dict)

image_df.head(5) # show first 5 rows of a table

---

#### Process all images in folder

Let's combine everything together.

We will use a for-loop to apply a user-defined function (containing image-processing and measurement steps) to all images in a folder.

In [None]:
# Step 1: Define our reusable analysis function
def analyze_image(image_array, image_name):
    blurred_image = filters.gaussian(image_array, sigma=3)
    threshold_value = filters.threshold_otsu(blurred_image)
    mask = blurred_image > threshold_value
    label_image = measure.label(mask)
    props_dict = measure.regionprops_table(label_image, intensity_image=image_array,
                                           properties=('label', 'area', 'mean_intensity'))
    results_df = pd.DataFrame(props_dict)
    # Add the image name to track our data
    results_df['source_image'] = image_name

    return results_df

In [None]:
# Step 2: Find the files
input_folder = r"../data/batch_analysis/input"
file_list = glob.glob(os.path.join(input_folder, "*.png"))

print(file_list)

In [None]:
# Step 3: Loop and aggregate results
import imageio.v3 as iio

all_image_results = []
print("\nStarting batch processing...")

# Iterate over list of filepaths
for file_path in file_list:
    print(f"Processing: {file_path}")
    
    # Read image
    image = iio.imread(file_path)
    
    # Get just the filename from the full path for cleaner tables
    filename_only = os.path.basename(file_path)
    
    # Apply our function - our function returns DataFrame with measurements
    single_image_df = analyze_image(image, filename_only)

    # Add DataFrame for currently processed file to a pre-defined list
    all_image_results.append(single_image_df)
    
print()
print('Number of dataframes in list:', len(all_image_results))

In [None]:
# Step 4: Concatenate into a final DataFrame
final_batch_df = pd.concat(all_image_results, ignore_index=True)

print("\n--- Final Batch Results ---")
final_batch_df
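
Once all results live in one table, the `source_image` column makes per-image summaries straightforward. A minimal sketch with toy tables standing in for the output of the analysis function (file names and values are made up):

```python
import pandas as pd

# Toy per-image tables standing in for the output of an analysis function
per_image = [
    pd.DataFrame({"label": [1, 2, 3], "area": [50, 60, 70], "source_image": "img_a.png"}),
    pd.DataFrame({"label": [1, 2], "area": [80, 90], "source_image": "img_b.png"}),
]
batch_df = pd.concat(per_image, ignore_index=True)

# Number of detected objects per image
objects_per_image = batch_df.groupby("source_image").size()

# Mean object area per image
mean_area_per_image = batch_df.groupby("source_image")["area"].mean()

print(objects_per_image)
print(mean_area_per_image)
```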

##### --- ***Exercise*** --- 

Work with the provided `analyze_nuclei` function (from the introductory session) to solve the following exercise.

The function takes in 2 arguments - a microscopic image (2D array) and its name (string). 

The function should identify individual nuclei, measure their properties (e.g., area and mean intensity), and return both a labeled image and a DataFrame containing the measurements.

***Batch Process the Entire Folder***

You know the tools, you have the function. Now it's time to automate everything!

**Instructions:**
1. Use glob to get a list of paths to all `.tif` files in the `data/batch_analysis/nuclei_data` folder.
2. Create an empty list called `all_nuclei_results`.
3. Write a `for` loop that iterates through your list of file paths. 
- Inside the loop:
    - Load the current image using `iio.imread()`.
    - Get just the filename (without the folder path) - e.g. using `os.path.basename()`.
    - Call your `analyze_nuclei` function with the loaded image and the filename.
    - Append the DataFrame returned by the function to your `all_nuclei_results` list.
4. After the loop, use `pd.concat()` to combine the list of DataFrames into a single, master DataFrame.
5. Print the head and tail of your final DataFrame to see the combined results from all images.
6. Find out how many nuclei per image were detected and print it. 
    - *Hint: you can use .groupby('filename').describe()*
7. As a final analysis, create a boxplot showing the distribution of nuclei `area` for each of the images.
    - *Hint*: `seaborn.boxplot` is great for this, you can use `filename` column as `hue` argument. 
    - *Usage: `sns.boxplot(data=dataframe, x=x_axis_column, y=y_axis_column, hue=group_column)`*

8. Bonus for fast solvers: wrap the steps above in a new function (skipping steps 5 and 6). Ideally, the function takes only a folder path as input and produces the boxplot.

In [None]:
# Run this code cell

from skimage import filters, morphology, measure
import scipy.ndimage as ndi

def analyze_nuclei(image_array, file_name):
    filtered_array = filters.gaussian(image_array, sigma=1)
    threshold_value = filters.threshold_otsu(filtered_array)
    mask = filtered_array > threshold_value
    processed_mask = ndi.binary_fill_holes(mask)
    mask_cleaned = morphology.remove_small_objects(processed_mask, min_size=50)
    label_image = measure.label(mask_cleaned)

    props = measure.regionprops_table(label_image, intensity_image=image_array,
                                      properties=['label', 'area', 'mean_intensity'])
    df = pd.DataFrame(props)

    df['filename'] = file_name

    return df, label_image

In [None]:
# Your code here...


<details>
<summary>Click to see the example solution</summary>

```python
# 1. Get list of all .tif files in the folder
image_files = glob.glob("../data/batch_analysis/nuclei_data/*.tif")

# 2. Create an empty list to store results
all_nuclei_results = []

# 3. Loop through each file
for file_path in image_files:
    # Load image
    image_array = iio.imread(file_path)
    
    # Extract filename only
    filename = os.path.basename(file_path)
    
    # Analyze nuclei
    df, labels = analyze_nuclei(image_array, filename)
    
    # Append results to list
    all_nuclei_results.append(df)

# 4. Combine all DataFrames into a single master DataFrame
master_df = pd.concat(all_nuclei_results, ignore_index=True)

# 5. Inspect results
print(master_df.head())
print(master_df.tail())

# 6. Print counts per image
print(master_df.groupby('filename').describe())

# 7. Boxplot of nuclei area per image
plt.figure(figsize=(10, 6))
sns.boxplot(data=master_df, x='filename', y='area', palette="Set3", hue='filename')
plt.xticks(rotation=45)
plt.ylabel("Nuclei Area")
plt.xlabel("Source Image")
plt.tight_layout()
plt.show()
```

</details>

<details>
<summary>Click to see the example solution for bonus</summary>

```python
# Example solution for bonus
def plot_nuclei_per_image(folder_path):

    image_files = glob.glob(folder_path+"/*.tif")

    dfs = []

    for image_file in image_files:
        
        image_array = iio.imread(image_file)
        image_name = os.path.basename(image_file)

        df, _ = analyze_nuclei(image_array, image_name)
        dfs.append(df)

    master_df = pd.concat(dfs, ignore_index=True)

    fig = plt.figure(figsize=(8, 6))
    sns.boxplot(data=master_df, x='filename', y='area', palette="Set3", hue='filename')
    plt.xticks(rotation=45)
    plt.ylabel("Nuclei Area")
    plt.xlabel("Source Image")
    plt.tight_layout()
    plt.show()

# Calling function on given path
images_path = '../data/batch_analysis/nuclei_data'
plot_nuclei_per_image(images_path)
```

</details>

<br><br>

#### Slice-by-slice reading

When working with image data stored as a series of files, it is often useful to load all slices into a single multi-dimensional array for easier processing.

The **scikit-image** function `imread_collection()` allows you to read multiple images at once using a filename pattern.
You can use a wildcard (*) to match all files in a folder that belong to your dataset. This collection can then be converted into a **NumPy** array, effectively creating an image stack.

However, the number of z-slices, channels or frames is not recognized. You have to reshape the loaded data into the appropriate multi-dimensional form (for example, ZCYX) yourself.

*Note*: Alternatively, you can build your own for-loops to load images from disk manually.
This approach gives you more flexibility — for example, to sort slices and channels, skip specific files, or arrange the data into custom dimensions.
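
A sketch of the manual approach, under the assumption that file names encode time point, z-slice and channel as `img_t<T>_z<Z>_c<C>.tif` (a hypothetical naming scheme). Reading is replaced by a stand-in function `fake_read` so the example runs without any files. The key point is that the sort order and the reshape order must agree.

```python
import re
import numpy as np

# Hypothetical file names, deliberately out of order
file_names = ["img_t1_z0_c1.tif", "img_t0_z1_c0.tif", "img_t0_z0_c1.tif",
              "img_t1_z1_c0.tif", "img_t0_z0_c0.tif", "img_t1_z0_c0.tif",
              "img_t0_z1_c1.tif", "img_t1_z1_c1.tif"]

def tzc_key(name):
    # Extract (t, z, c) as integers so that sorting is numeric, not alphabetical
    t, z, c = re.search(r"_t(\d+)_z(\d+)_c(\d+)", name).groups()
    return int(t), int(z), int(c)

sorted_names = sorted(file_names, key=tzc_key)

def fake_read(name):
    # Stand-in for an image reader: a 4x4 array filled with t*100 + z*10 + c
    t, z, c = tzc_key(name)
    return np.full((4, 4), t * 100 + z * 10 + c)

stack = np.stack([fake_read(name) for name in sorted_names])  # shape (8, 4, 4)

# Sorted with t slowest and c fastest, so reshape to (T, Z, C, Y, X)
image5d = stack.reshape(2, 2, 2, 4, 4)

print(stack.shape, image5d.shape, image5d[1, 0, 1, 0, 0])
```

Encoding the coordinates in the pixel values, as done here, is a simple way to check that each plane ended up at the intended position.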

In [None]:
from skimage import io

im_collection = io.imread_collection('../data/batch_analysis/tiffs/' + "*")
image_stack = im_collection.concatenate()
image_stack.shape

In [None]:
# Change the shape of array with reshape

num_channels = 2
num_z_slices = 5
num_t_frames = 10
image5d = np.reshape(image_stack, (num_t_frames, num_z_slices, num_channels, image_stack.shape[-2], image_stack.shape[-1]))
image5d.shape

In [None]:
# use stackview for interactive multi-dimensional image plot 
import stackview
stackview.slice(image5d)

### USE CASE - Pixel-based colocalization 

Pixel-based colocalization assesses the 'overlap' between two or more channels. (Colocalization only indicates that two signals are present within the same volume resolved by the microscope - it does not prove molecular interaction.)

There are many types of coefficients to characterize colocalization (Pearson’s correlation, Spearman’s rank correlation, Manders’ overlap coefficients, intersection coefficients, cross-correlation analysis...). In this notebook, we will compute Pearson's correlation coefficient.

- **Pearson's Correlation Coefficient (PCC):** Measures the linear relationship between the two channels' intensities. It ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 indicating no correlation. A major drawback is its sensitivity to background pixels, which often creates a strong, artificial positive correlation. Thresholding is essential.
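
To make the definition concrete: PCC is the covariance of the two intensity lists divided by the product of their standard deviations. Below is a minimal NumPy sketch on made-up intensity values, compared against `np.corrcoef`. The helper name `pearson_manual` is illustrative and not a library function.

```python
import numpy as np

def pearson_manual(a, b):
    """Pearson's r: mean-centred cross product divided by both norms."""
    # Cast to float so that integer image types (e.g. uint16) cannot overflow
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return np.sum(a_centered * b_centered) / np.sqrt(
        np.sum(a_centered**2) * np.sum(b_centered**2)
    )

# Made-up pixel intensities for two channels
ch1 = np.array([10, 20, 30, 40, 50])
ch2 = np.array([12, 18, 33, 41, 48])

r_manual = pearson_manual(ch1, ch2)
r_numpy = np.corrcoef(ch1, ch2)[0, 1]
print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Any exact linear relation with a positive slope, such as `2 * ch1 + 3`, gives +1, and a negative slope gives -1, regardless of the offset.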

##### Example

In [None]:
# Set up imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import stackview
from skimage.io import imread
from skimage import filters

In [None]:
# Load a multi-channel image

raw_image = imread(r'../data/cellpainting.tif')

print(raw_image.shape)

In [None]:
# Use stackview for interactive visualization of the channels

stackview.switch(
    {
        "actin": raw_image[0, :, :],
        "er": raw_image[1, :, :],
        "speckles": raw_image[2, :, :],
        "mito": raw_image[3, :, :],
        "nuclei": raw_image[4, :, :],
    },
    colormap=["pure_red", "pure_green", "pure_yellow", "pure_magenta", "pure_cyan"],
    toggleable=True
)

In [None]:
# Store channels in a dictionary for easy access

channels_dict = {
    "actin": raw_image[0,:,:],
    "er": raw_image[1,:,:],
    "speckles": raw_image[2,:,:],
    "mito": raw_image[3,:,:],
    "nuclei": raw_image[4,:,:],
}

Visualize pixel intensities with a cytofluorogram.

**The Cytofluorogram:**

A cytofluorogram is a 2D scatter plot that visualizes the pixel-wise intensity relationship between two image channels. Each point on the plot represents a single pixel, with its intensity in channel 1 on the x-axis and its intensity in channel 2 on the y-axis.

Let's create a cytofluorogram for the **mito** and **er** channels.

In [None]:
# For histogram plotting, we flatten the 2D image arrays into 1D arrays of pixel intensities

# Again, we do this for all channels and store the results in a new dictionary
f_channels_dict = {key: value.ravel() for key, value in channels_dict.items()}

for key, arr in f_channels_dict.items():
    print(key, arr.shape)

# We can also create a pandas DataFrame for convenience
channels_df = pd.DataFrame(f_channels_dict)
channels_df.head()

In [None]:
# --- The Cytofluorogram (Scatter plot) ---

plt.figure(figsize=(6,6))
plt.scatter(x=channels_df['mito'], y=channels_df['er'], s=1, alpha=0.1)
plt.grid(True, linestyle='--', alpha=0.5)
plt.xlabel("Mitochondria intensity")
plt.ylabel("ER intensity")

plt.show()

In [None]:
# A 2D histogram is often more informative for dense data
# Note the high density of points near the origin, representing background pixels.

plt.figure(figsize=(7, 7))
sns.histplot(channels_df, x='mito', y='er', cbar=True, cmap='YlGnBu_r', bins=100)
plt.show()

**Calculating Pearson's Correlation Coefficient (PCC):**

First, we calculate PCC on the raw pixel data, which includes a vast number of dark background pixels. This will demonstrate how background skews the result.

We can use the `scipy` library to compute PCC.

In [None]:
# --- Pearson's Correlation ---
from scipy.stats import pearsonr

pcc, p_value = pearsonr(channels_df['mito'], channels_df['er'])
print(f"Pearson's r (Mito vs. ER): {pcc:.4f}, p-value: {p_value:.4e}")


Now, we'll apply a threshold to each channel to create a mask that separates signal from background.

By analyzing only the pixels within these masks, we get a more biologically meaningful correlation value.
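
The effect of background can be illustrated on synthetic data. The sketch below does not use the notebook's image: it generates two channels that are statistically independent, each consisting of many dark background pixels and fewer bright signal pixels. All sizes, intensities, and the threshold are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two INDEPENDENT channels (no real colocalization)
n_background, n_signal = 90_000, 10_000
background = rng.normal(loc=100, scale=10, size=(2, n_background))
signal = rng.normal(loc=1000, scale=100, size=(2, n_signal))

demo_ch1 = np.concatenate([background[0], signal[0]])
demo_ch2 = np.concatenate([background[1], signal[1]])

# PCC on all pixels: the dark/bright split alone produces a high value
r_all = np.corrcoef(demo_ch1, demo_ch2)[0, 1]

# PCC only on pixels above a threshold in both channels
demo_threshold = 500
demo_mask = (demo_ch1 > demo_threshold) & (demo_ch2 > demo_threshold)
r_masked = np.corrcoef(demo_ch1[demo_mask], demo_ch2[demo_mask])[0, 1]

print(f"all pixels: {r_all:.3f}, masked: {r_masked:.3f}")
```

On all pixels the coefficient is close to 1, simply because both channels are dark in the same places and bright in the same places. Inside the mask it is close to 0, which reflects the fact that the signal intensities were generated independently.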

In [None]:
# --- Step 1: Create a mask for significant pixels in each channel ---

# We can use Otsu's method to automatically find a threshold
thresh_er = filters.threshold_otsu(channels_dict['er'])
print(f"Otsu threshold for 'er': {thresh_er}")

# Or we might use a manual threshold based on visual inspection
threshold = 4000

In [None]:
# We can now simply filter the DataFrame
# Note: for simplicity, we use a single threshold value for both channels;
# this ignores differences in their intensity distributions
new_df = channels_df[(channels_df['mito']>threshold) & (channels_df['er']>threshold)]

print(f'Remaining pixels: {len(new_df)}')
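
If the two channels have different intensity ranges, each channel can get its own threshold. Below is a minimal sketch on small made-up arrays. The values in `thresholds` are assumptions for illustration; in practice they might come from `filters.threshold_otsu` applied to each channel separately.

```python
import numpy as np

# Made-up flattened intensities for two channels
flat_channels = {
    "mito": np.array([500, 4200, 6100, 3900, 8000]),
    "er": np.array([900, 2500, 1800, 2600, 1200]),
}

# Hypothetical per-channel thresholds
thresholds = {"mito": 4000, "er": 1500}

# A pixel is kept only if it exceeds the threshold in every listed channel
keep = np.logical_and.reduce(
    [flat_channels[name] > value for name, value in thresholds.items()]
)

filtered = {name: values[keep] for name, values in flat_channels.items()}
print(keep)                            # [False  True  True False False]
print(filtered["mito"], filtered["er"])  # [4200 6100] [2500 1800]
```

With the DataFrame from above, the same idea would read `channels_df[(channels_df['mito'] > t_mito) & (channels_df['er'] > t_er)]`, where `t_mito` and `t_er` are placeholder names for the two thresholds.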

In [None]:
# Compute PCC on filtered data

pcc, p_value = pearsonr(new_df['mito'], new_df['er'])
print(f"Pearson's r (Mito vs. ER): {pcc:.4f}, p-value: {p_value:.4e}")

In [None]:
# Make new figure
plt.figure(figsize=(7, 7))

# Plot histograms/scatter data
sns.histplot(channels_df, x='mito', y='er', color='gray', bins=100)
sns.histplot(new_df, x='mito', y='er', cmap='YlGnBu_r', bins=100)

# Draw threshold lines to highlight quadrants
plt.axvline(threshold, color='red', linestyle='--', label='Mito threshold')
plt.axhline(threshold, color='blue', linestyle='--', label='ER threshold')

# Legend
plt.legend(loc='upper right')

plt.grid(True, linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

Let's visualize our masks.

In [None]:
# Apply threshold on 2D array data
mask_er = channels_dict['er'] > threshold
mask_mito = channels_dict['mito'] > threshold

# Combine masks using logical operation
combined_mask = np.logical_and(mask_mito, mask_er)
# or, equivalently, using the & operator
combined_mask = mask_mito & mask_er
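
The combined 2D mask selects the same pixels as the DataFrame filter used earlier, because `ravel()` flattens images and masks in the same row-by-row order. A small sketch with made-up 3×3 arrays (all names and values here are illustrative):

```python
import numpy as np

# Made-up 3x3 "channels"
img_a = np.array([[1, 9, 3], [8, 2, 7], [4, 6, 5]])
img_b = np.array([[9, 8, 1], [7, 3, 2], [6, 9, 4]])
demo_threshold = 5

mask_2d = (img_a > demo_threshold) & (img_b > demo_threshold)

# Route 1: boolean indexing on the 2D image
values_from_2d = img_a[mask_2d]

# Route 2: flatten first, then filter (what the DataFrame filter does)
flat_a, flat_b = img_a.ravel(), img_b.ravel()
values_from_flat = flat_a[(flat_a > demo_threshold) & (flat_b > demo_threshold)]

print(values_from_2d, values_from_flat)                # [9 8 6] [9 8 6]
print(np.array_equal(values_from_2d, values_from_flat))  # True
```

This means a correlation computed on `image[mask]` values matches one computed on the filtered DataFrame columns, as long as the same thresholds are used.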

In [None]:
# Visualize all masks side-by-side

# subplots allow us to show a grid of plots in one figure
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].imshow(mask_mito, cmap='gray')
axes[0].set_title('Mito Mask')
axes[0].axis('off')

axes[1].imshow(mask_er, cmap='gray')
axes[1].set_title('ER Mask')
axes[1].axis('off')

axes[2].imshow(combined_mask, cmap='gray')
axes[2].set_title('Combined Mask')
axes[2].axis('off')

plt.tight_layout()
plt.show()

##### Recap: Operations between masks

In image analysis, it is often useful to combine or compare different binary masks to extract specific regions of interest. 

Masks are boolean arrays (True for foreground, False for background) and can be combined with logical operations to create new masks.

Common operations:
- *Inversion - NOT (`~`)* – flips all boolean values:
    ```python
    inverted_mask = ~mask
    ```
- *Intersection (`&`)* – keeps only pixels present in both masks:
    ```python
    overlap_mask = mask1 & mask2
    ```
- *Union (`|`)* – includes all pixels present in either mask:
    ```python
    combined_mask = mask1 | mask2
    ```
- *Difference / Subtraction (`& ~`)* – removes pixels of one mask from another:
    ```python
    cytoplasm_mask = cell_mask & ~nuclei_mask
    ```
- *Exclusive OR (`^`)* – pixels present in one mask or the other, but not both:
    ```python
    xor_mask = mask1 ^ mask2
    ```
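
The operations above can be tried on two tiny made-up masks that cover all four True/False combinations:

```python
import numpy as np

mask1 = np.array([True, True, False, False])
mask2 = np.array([True, False, True, False])

print("NOT :", ~mask1)          # [False False  True  True]
print("AND :", mask1 & mask2)   # [ True False False False]
print("OR  :", mask1 | mask2)   # [ True  True  True False]
print("DIFF:", mask1 & ~mask2)  # [False  True False False]
print("XOR :", mask1 ^ mask2)   # [False  True  True False]

# Caution: `~` is a logical NOT only on boolean arrays. On an integer 0/1 mask
# it flips every bit instead, so convert with .astype(bool) first.
int_mask = np.array([1, 0, 1], dtype=np.uint8)
print(~int_mask)               # [254 255 254]
print(~int_mask.astype(bool))  # [False  True False]
```

The last two lines matter when a mask was saved to disk and read back as an integer image rather than as booleans.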

##### --- Exercise ---

Repeat the colocalization analysis steps above for a different channel pair. You can work with the created dictionaries.

1. Choose another pair of channels to analyze (e.g., actin vs. er, or nuclei vs. speckles).

2. Plot a cytofluorogram. Create a scatter or histogram plot showing the pixel intensity correlation between the two channels.

3. Determine appropriate thresholds for each channel (either manually or using automatic methods).

4. Generate binary masks for both channels based on the chosen thresholds, combine them (e.g., using logical AND), and compute the Pearson correlation coefficient inside the masked area.

5. Visualize the combined mask.

In [None]:
# Your code here...


<details>
<summary>Click to see the example solution</summary>

```python
# EXAMPLE SOLUTION

# Visualize pixel intensity correlation
plt.figure(figsize=(7, 7))
sns.histplot(channels_df, x='actin', y='er', bins=100, cbar=True, cmap='YlGnBu_r')
plt.show()

# Create masks for significant pixels
thresh_1 = filters.threshold_otsu(channels_dict['actin'])
print(f"Otsu threshold for 'actin': {thresh_1}")
mask_1 = channels_dict['actin'] > thresh_1

thresh_2 = 4000
mask_2 = channels_dict['er'] > thresh_2

# Combine masks
combined_mask = mask_1 & mask_2

# Apply masks 
masked_df = channels_df[(channels_df['actin']>thresh_1) & (channels_df['er']>thresh_2)]

# Calculate masked Pearson correlation
pcc_masked, p_value = pearsonr(masked_df['actin'], masked_df['er'])
print(f"Masked Pearson's r (actin vs. er): {pcc_masked:.4f}, p-value: {p_value:.4e}")

# Visualize the combined mask
plt.figure(figsize=(6, 6))
plt.imshow(combined_mask, cmap='gray')
plt.title('Combined Mask')
plt.axis('off')
plt.show()
```