## Image Analysis - DataFrames and Automation

In this notebook, we will explore how to use **pandas** for working with tabular data, perform basic statistical analysis with **scipy**, and get familiar with another Python plotting library — **seaborn**.
Finally, we will practice **automation on multiple files**, both for tabular data and for image processing tasks.

---
---
### Lesson 1: DataFrames (pandas package)

Pandas is a library used for data manipulation and analysis, particularly useful for dealing with structured data like tables. It provides the tools to create and work with DataFrames.

A DataFrame is a two-dimensional data structure similar to a spreadsheet (table).

We typically import pandas like this:
```python
import pandas as pd

```

In [None]:
# You can create a DataFrame from different types of data.  

# Create a DataFrame from a dictionary of lists
import pandas as pd
import numpy as np

measurement_data = {
    'cell_id': [1, 2, 3, 4, 5],
    'area': [150, 230, 95, 180, 40],
    'intensity': [88.5, 95.2, 101.0, 89.7, np.nan]
}

df = pd.DataFrame(measurement_data)

# Display the DataFrame
df

**nan** stands for “**Not a Number**”.

It’s a special floating-point value used to represent missing or undefined numerical data.
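A minimal sketch of how `nan` behaves in practice, reusing the intensity values from the cell above (the variable names here are only illustrative):

```python
import numpy as np
import pandas as pd

intensity = pd.Series([88.5, 95.2, 101.0, 89.7, np.nan])

# nan is never equal to anything, not even to itself,
# so use isna() instead of == to find missing values
print(np.nan == np.nan)   # False
missing = intensity.isna()
print(missing.sum())      # 1

# pandas skips nan by default when computing statistics
mean_skipping_nan = intensity.mean()

# plain NumPy functions propagate nan; the nan-aware variants skip it
mean_numpy = np.mean(intensity.to_numpy())
mean_nan_aware = np.nanmean(intensity.to_numpy())

print(mean_skipping_nan, mean_numpy, mean_nan_aware)
```

This is why `df.describe()` can still report a mean for a column that contains missing values.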

In [None]:
# We can easily get statistics for our data with .describe()

print("\n--- Summary Statistics ---")
print(df.describe())

In [None]:
# You can extract a single column from a pandas DataFrame using its column name

# This way it keeps the row index and behaves like a labeled one-dimensional array.
area_column = df['area']
print(area_column) 
print(type(area_column)) # pandas Series 

print()

# .values returns the underlying NumPy array, without the index or Series structure
area_values = df['area'].values
print(area_values)

In [None]:
# You can filter data based on conditions

# This creates a new DataFrame containing only cells with area less than a value.
filtered_df = df[df['area'] < 200]
filtered_df

In [None]:
# Combining multiple logical conditions
# Each condition must be in parentheses ( )
filtered_df2 = df[(df['area'] < 200) & (df['area'] > 100)]
filtered_df2


In [None]:
# Remove rows with missing (nan) values

df_clean = df.dropna()
df_clean

In [None]:
# You can sort dataframes by one or more columns.
# ascending=True/False controls ascending/descending order

sorted_df = df.sort_values(by='intensity', ascending=False)
sorted_df

##### --- ***Exercise*** ---

You have experimental data in a dictionary. Your task is to analyze it with Pandas.
1. Create a **Pandas DataFrame** from the given data dictionary.
2. Filter the DataFrame to find all rows where the **Treatment** was `'Drug_A'`. 
  *Hint: use the equality operator `==`*
3. From the filtered data, compute the **average Score** for Drug A.
   *Hint: you can use .mean() on extracted 'Score' column*
4. Using the entire DataFrame, find out **which treatments were effective**.  
   *Hint: you can use .unique() on pandas Series object (extracted Treatment) to find unique treatments*
5. Display the **3 rows with the highest Score**.  
   *Hint: use `.sort_values('Score', ascending=False).head(3)`*

In [None]:
# Data for the exercise
data = {'Animal_id': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
        'Treatment': ['Drug_A', 'Drug_B', 'Placebo', 'Drug_A', 'Drug_B', 'Placebo', 'Drug_A', 'Drug_B', 'Drug_A', 'Drug_B'],
        'Result': ['Effective', 'Not Effective', 'Not Effective', 'Effective', 'Effective', 'Not Effective', 'Effective', 'Effective', 'Effective', 'Not Effective'],
        'Score': [50.3, 3.5, 10.1, 17.0, 93.3, 1.5, 99.9, 73.7, 69.2, 0.5]}


# Your code here


<details>
<summary>Click to see the example solution</summary>

```python
# 1. Create DataFrame from dictionary
df = pd.DataFrame(data)

# 2. Filter rows where Treatment is 'Drug_A'
drug_a_df = df[df['Treatment'] == 'Drug_A']
print("Rows with Drug_A treatment:")
print(drug_a_df)

# 3. Average Score for Drug_A
avg_score_drug_a = drug_a_df['Score'].mean()
print("\nAverage Score for Drug_A:", avg_score_drug_a)

# 4. Which treatments were effective?
effective_treatments = df[df['Result'] == 'Effective']['Treatment'].unique()
print("\nTreatments with at least one effective result:", effective_treatments)

# 5. Display 3 rows with the highest 'Score'
top3_scores = df.sort_values('Score', ascending=False).head(3)
print("\nTop 3 rows by Score:")
print(top3_scores)
```

</details>


<br><br>

##### Reading and saving

In [None]:
# We can save our DataFrame as table (csv, excel, ...)
sorted_df.to_csv('first_measurements.csv', index=False)

In [None]:
# We can read tables from file into DataFrames
loaded_df = pd.read_csv('first_measurements.csv') # check other file formats
loaded_df

In [None]:
# DataFrames are convenient for plotting
import matplotlib.pyplot as plt 

plt.figure(figsize=(4, 4))
plt.scatter(loaded_df['area'], loaded_df['intensity'])
plt.title('scatterplot')
plt.xlabel('Area')
plt.ylabel('Mean intensity')
plt.show()

##### Aggregation

When we have multiple DataFrames, we can combine them into a single one.

`pd.concat()` is a pandas function used to combine multiple DataFrames along a particular axis (rows or columns).
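As a small sketch of the two axes, here are made-up tables (not the course data) combined row-wise and column-wise:

```python
import pandas as pd

# Two small, made-up tables with the same columns
df_a = pd.DataFrame({"cell_id": [1, 2], "area": [150, 230]})
df_b = pd.DataFrame({"cell_id": [3, 4], "area": [95, 180]})

# axis=0 (default): stack the rows; ignore_index=True renumbers them 0..n-1
stacked = pd.concat([df_a, df_b], ignore_index=True)

# axis=1: place tables side by side, aligned on the row index
extra = pd.DataFrame({"intensity": [88.5, 95.2]})
side_by_side = pd.concat([df_a, extra], axis=1)

print(stacked)
print(side_by_side)
```

Without `ignore_index=True`, the stacked table would keep the original row labels (0, 1, 0, 1), which is easy to trip over when selecting rows later.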

In [None]:
# Aggregation (multiple tables)

df_image1 = pd.read_csv("../data/cells_control.csv")
df_image2 = pd.read_csv("../data/cells_diseased.csv")
df_image3 = pd.read_csv("../data/cells_conditioned.csv")

# Merge dataframes vertically
combined_df = pd.concat([df_image1, df_image2, df_image3], ignore_index=True)

combined_df

##### DataFrame data types

Not all columns have to store numerical data.

Data types can be:
- int64, float64 → numeric columns
- object → usually strings (categories, text)
- category → categorical data (memory-efficient, useful for grouping)


You can quickly check the data types of each column using:
```python
print(df.dtypes)   # Shows the type of each column
print(df.info())   # Shows summary including data types and non-null counts
```

You can convert string/object columns to categorical with method: `.astype("category")`
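A minimal sketch of what the conversion does, using a made-up Series of group names:

```python
import pandas as pd

groups = pd.Series(["Control", "Diseased", "Control", "Diseased"])

# Convert the text column to a categorical column
groups_cat = groups.astype("category")

print(groups_cat.dtype)                  # category
print(list(groups_cat.cat.categories))   # ['Control', 'Diseased']

# Internally, each value is stored as a small integer code pointing to a category
print(groups_cat.cat.codes.tolist())     # [0, 1, 0, 1]
```

Storing integer codes instead of repeated strings is what makes the `category` type memory-efficient for columns with few distinct values.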

In [None]:
print(combined_df.info())

In [None]:
combined_df["Group"] = combined_df["Group"].astype("category")
combined_df["State"] = combined_df["State"].astype("category")

print(combined_df.dtypes)

##### Group-level statistics

When your dataset has **categorical variables** (like experimental groups, treatments, or conditions) and **numeric measurements**, it is often useful to calculate statistics **per group**.

- `describe()` can give summary statistics for the **entire DataFrame**, but it does not separate by category
- to get statistics per group, you can use `groupby()` combined with `agg()` (aggregate)
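A minimal sketch of the `groupby()` + `agg()` pattern on a made-up table. Selecting the measurement column before aggregating is one way to keep non-numeric columns out of the calculation:

```python
import pandas as pd

toy = pd.DataFrame({
    "Group": ["Control", "Control", "Diseased", "Diseased"],
    "Area": [100.0, 120.0, 200.0, 240.0],
})

# One row per group, one column per requested statistic
per_group = toy.groupby("Group")["Area"].agg(["count", "mean", "std"])
print(per_group)
```

The result is itself a DataFrame indexed by group name, so a single value can be looked up with, for example, `per_group.loc["Control", "mean"]`.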

In [None]:
# describe() is applied to the entire DataFrame without discriminating between categories
combined_df.describe()

In [None]:
# if we want to extract statistics per group, we can use the methods groupby() and agg()

# !! numeric aggregations such as "mean" and "std" work only on numerical columns
# we need to remove all other types first (except the ones we are grouping by)
reduced_df = combined_df.drop("State", axis=1) # this removes column 'State'

# Now we group by column 'Group'
summary_per_group = reduced_df.groupby('Group', observed=False).agg(["count", "mean", "std"]) 
summary_per_group

In [None]:
# we can use multiple factors
summary_per_group_and_state = combined_df.groupby(['Group','State'], observed=False).agg(["count","sum", "mean", "std"]) 
summary_per_group_and_state

##### Visualizing group differences with Seaborn

The **Seaborn** library builds on Matplotlib but provides higher-level plotting functions tailored for statistical data.

A particularly useful feature is the `hue` argument, which colors data points by group/category.

In [None]:
# Seaborn 
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.boxplot(
    data=combined_df, 
    x="Group", 
    y="Mean_Intensity", 
    hue="Group", 
    palette="Set2" # color-coding
    )

plt.show()

In [None]:
# Some seaborn plots also allow discrimination of multiple factors
  
plt.figure(figsize=(6, 6))
sns.scatterplot(
    data=combined_df, 
    x="Area", y="Circularity", 
    hue="Group", style="State"
    )
plt.show()

#### Hypothesis Testing

Visualizations help us see patterns and differences in data, but they are not enough for statistical confirmation.  
To formally test whether groups differ, we use hypothesis testing.

In Python, the `scipy.stats` module provides a variety of statistical tests.

Below, we will use a **t-test** to compare whether the **means of two groups are equal**.
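Before applying the test to the measurements, here is a minimal sketch on made-up numbers. It compares the default Student's t-test, which assumes equal variances, with Welch's t-test (`equal_var=False`), which does not. The group sizes, means, and spreads are assumptions chosen only for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
group_a = rng.normal(loc=100, scale=5, size=30)    # narrow spread
group_b = rng.normal(loc=110, scale=15, size=30)   # wide spread

student = ttest_ind(group_a, group_b)                  # assumes equal variances
welch = ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test

print(f"Student p = {student.pvalue:.4f}")
print(f"Welch   p = {welch.pvalue:.4f}")
```

When the spreads of the two groups clearly differ, as in the boxplots of some real datasets, Welch's variant is usually the safer choice.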

In [None]:
# Hypothesis testing

from scipy.stats import ttest_ind

# the following is combination of filtering DataFrame and extracting column Mean_Intensity
controls = combined_df[combined_df['Group']=='Control']['Mean_Intensity']
diseased = combined_df[combined_df['Group']=='Diseased']['Mean_Intensity']

# the t-test takes two lists/arrays 
# by default the ttest_ind() performs two-sided t-test and assumes equal variance
t_stat, p_value = ttest_ind(controls, diseased)

print(f"\nT-test p-value for comparing mean intensities: {p_value:.9f}")


##### --- ***Exercise*** ---


On `combined_df` from the previous cells, perform the following:

1. Compute the mean and standard deviation of all numerical columns per factor **State**. 
    - Which condition has the larger average area?
2. Make a boxplot comparing **Area** across categories of **State**.
3. Add a swarmplot (`sns.swarmplot`) on top of the boxplot to see individual datapoints.
4. Perform a t-test comparing "Circularity" between Control and Diseased.
5. Perform a t-test comparing "Mean_Intensity" between Starved Pre-conditioned and Unstarved Pre-conditioned.

In [None]:
# Your code here


<details>
<summary>Click to see the example solution</summary>

```python
# Compute mean and std per State
r_df = combined_df.drop('Group', axis=1)
summary_state = r_df.groupby("State", observed=False).agg(["mean", "std"]).round(2) # optional rounding
print(summary_state)

# Boxplot with swarmplot for Area per State
plt.figure(figsize=(6, 4))
sns.boxplot(data=combined_df, x="State", y="Area", palette="Set2", hue='State')
sns.swarmplot(data=combined_df, x="State", y="Area", color='black')
plt.tight_layout()
plt.show()

# T-test Circularity: Control vs Diseased
controls = combined_df[combined_df["Group"] == "Control"]["Circularity"]
diseased = combined_df[combined_df["Group"] == "Diseased"]["Circularity"]
t_stat, p_value = ttest_ind(controls, diseased)
print(f"T-test Circularity (Control vs Diseased): p = {p_value:.4f}")

# T-test Mean_Intensity: Starved vs Unstarved Pre-conditioned
pre_starved = combined_df[
    (combined_df["Group"] == "Pre-conditioned") & (combined_df["State"] == "Starved")
]["Mean_Intensity"]
pre_unstarved = combined_df[
    (combined_df["Group"] == "Pre-conditioned") & (combined_df["State"] == "Unstarved")
]["Mean_Intensity"]
t_stat, p_value = ttest_ind(pre_starved, pre_unstarved)
print(f"T-test Mean_Intensity (Starved vs Unstarved, Pre-conditioned): p = {p_value:.4f}")
```

</details>


---
---

### Lesson 2: Automation

#### A Quick Guide to File Paths

Python offers several ways to handle paths, each with its own strengths. All are useful because they automatically use the correct path separator (`/` or `\`).

Packages: `os`, `pathlib`, and `glob`

In [None]:
# --- os.path.join(): The safe way to build paths ---
import os

# This automatically uses the correct separator for your OS (`/` or `\`)
folder = "data"
filename = "image_01.tif"
correct_path = os.path.join(folder, "subfolder", filename)
print(f"OS-safe path: {correct_path}")


# Get directory and filename parts
parent_dir = os.path.dirname(correct_path)
base_name = os.path.basename(correct_path)
grandparent_dir = os.path.dirname(parent_dir)

print(f"File name part: {base_name}")
print(f"Parent directory part: {parent_dir}")
print(f"Grandparent directory part: {grandparent_dir}")


In [None]:
# --- pathlib  ---
from pathlib import Path

# It uses objects and the `/` operator
p = Path("data")
csv_file_path = '..' / p / "cells_control.csv"
print(f"\nPathlib object: {csv_file_path}")
print(f"Does this file exist? {csv_file_path.exists()}")


# --- Get parts of the path ---
print(f"Parent folder: {csv_file_path.parent}")
print(f"Filename only: {csv_file_path.name}")

Creating new directories (folders) is possible with both `os` and `pathlib`.

In [None]:
# Create the directory if it doesn’t exist yet

# os
os.makedirs('test_folder_os', exist_ok=True)

# pathlib
folder = Path("test_folder_pathlib")
folder.mkdir(exist_ok=True)

##### Listing files in folder

- `os.listdir()`
- glob: finding files with wildcards 
    - `glob.glob("folder/*.csv")` or `Path("folder").glob("*.csv")`


Wildcards are special characters that match patterns in file names, so you do not have to type exact file names.
For example, `*` matches any number of characters.

| Goal                | glob                                         | pathlib                       |
| ------------------- | -------------------------------------------- | ----------------------------- |
| Find only in folder | `glob.glob("folder/*.tif")`                    | `Path("folder").glob("*.tif")`  |
| Find recursively    | `glob.glob("folder/**/*.tif", recursive=True)` | `Path("folder").rglob("*.tif")` |
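The four calls in the table can be tried safely in a temporary folder. This is a sketch with made-up file names; the folder is created and deleted by `tempfile`, so nothing in your data directory is touched:

```python
import glob
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    # Create empty example files, one of them inside a subfolder
    for name in ["a.tif", "b.tif", "notes.csv", "sub/c.tif"]:
        (root / name).touch()

    # Only the top-level folder
    flat_glob = sorted(glob.glob(os.path.join(tmp, "*.tif")))
    flat_pathlib = sorted(root.glob("*.tif"))

    # Including subfolders
    deep_glob = sorted(glob.glob(os.path.join(tmp, "**", "*.tif"), recursive=True))
    deep_pathlib = sorted(root.rglob("*.tif"))

    print(len(flat_glob), len(flat_pathlib))   # 2 2
    print(len(deep_glob), len(deep_pathlib))   # 3 3
```

Note that `glob.glob()` returns plain strings, while the `pathlib` methods yield `Path` objects. Wrapping the results in `sorted()` also gives a reproducible file order, which is not guaranteed otherwise.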

In [None]:
# os.listdir() lists all files

my_path = os.path.join('..', 'data')
files = os.listdir(my_path)

# Output is a list of filenames
print(files[:10]) # print first 10 files

In [None]:
# List comprehension
only_tif_files = [tif for tif in files if tif.endswith('.tif')]

# the above is same as typing:
only_tif_files = []
for tif in files:
    if tif.endswith('.tif'):
        only_tif_files.append(tif)

# print selection
print(only_tif_files[:10])

In [None]:
# --- glob.glob(): the way to find files with wildcards ---
import glob

# Find all files ending with .tif
my_path = os.path.join('..', 'data')
tif_files = glob.glob(os.path.join(my_path, "*.tif"))
print(f"\nNumber of TIF files found: {len(tif_files)}")

# The output is a list of paths
print(tif_files[-5:])

In [None]:
# To search recursively (including subfolders)

tif_files_recursive = glob.glob(os.path.join(r"../data", "**", "*.tif"), recursive=True)
print(len(tif_files_recursive))
print(tif_files_recursive[-5:])

##### --- **Exercise** ---

1. **Create a path** to the `data` folder by combining the relative marker `'..'` and the folder name `'data'` - store the path in a variable `data_path`.
   - Use either **`os.path.join()`** or **`pathlib.Path`**.
2. **List all files** in your defined path and print how many files were found (`len()` function).  
    *Hint*: Use `os.listdir(data_path)` or `glob.glob(os.path.join(data_path, "*"))`
3. **List all images** in the folder (ending with **.tif** or **.png**).
    *Hint*: you can combine the previous output with the `endswith()` method or combine glob searches (`glob.glob("folder/*.tif") + glob.glob("folder/*.png")`)
4. **Find all .csv files** that include **'con'** in their name (e.g. control, conditioned)
    - *Hint*: use `'substring' in 'string'` to find if `'con'` is in the filename
    - read them into a list of DataFrames - `pd.read_csv()`
    - combine them into one DataFrame - `pd.concat(list_of_dfs, ignore_index=True)`
    - save it to a new folder `results` under the name `combined.csv` - `os.makedirs()` to create a folder and `df.to_csv()` to save the DataFrame


In [None]:
# Your code here


<details>
<summary>Click to see the example solution</summary>


```python
# EXAMPLE SOLUTION
import os
import glob
import pandas as pd
from pathlib import Path

# --- 1. Create path to data folder ---
data_path = os.path.join("..", "data")
print("Data path:", data_path)
# or
data_path2 = Path("..") / "data"
print("Data path:", data_path2)

# --- 2. List all files and count them ---
all_files = glob.glob(os.path.join(data_path, "*"))
print(f"Found {len(all_files)} files:")
print(all_files)

# --- 3. List all image files (.tif or .png) ---
image_files = glob.glob(os.path.join(data_path, "*.tif")) + glob.glob(os.path.join(data_path, "*.png"))
print(f"\nFound {len(image_files)} image files:")
print(image_files)

# or list comprehension with tuple ('.tif', '.png')
image_files2 = [file for file in all_files if file.endswith(('.tif', '.png'))]
print(f"\nFound {len(image_files2)} image files.")

# --- 4. Find all CSVs with 'con' in name, combine, and save ---
csv_files = glob.glob(os.path.join(data_path, "*.csv"))
csv_files_con = [f for f in csv_files if "con" in os.path.basename(f)]
print(f"\nCSV files with 'con' in name: {csv_files_con}")

dfs = [pd.read_csv(f) for f in csv_files_con]
combined_df = pd.concat(dfs, ignore_index=True)

# Create results folder and save
os.makedirs("results", exist_ok=True)
combined_df.to_csv("results/combined.csv", index=False)


```

</details>


---
---

### Lesson 3: Image analysis workflow - recap

In the introductory session we practiced designing and tuning a full image-processing workflow:
- We loaded and visualized microscopy images.
- We adjusted individual processing steps such as filtering, thresholding, and segmentation.
- We extracted quantitative measurements (e.g., area, intensity, circularity).
- We packaged this workflow in a function to be able to run it in a loop.

Now, we will:
- Recap that session.
- Analyze those measurements using the **Pandas**, **Seaborn**, and **scipy** packages.
- Practice automatic processing of a batch of images and applying different settings.

##### Image loading

In [None]:
from skimage import io
import imageio.v3 as iio

input_image = io.imread('../data/noisy_cells.tif')
input_image2 = iio.imread('../data/noisy_cells.tif')

print('Image shape:', input_image.shape)
print('Image shape:', input_image2.shape)

print('Are images identical?', np.array_equal(input_image, input_image2))

##### Image processing & segmentation

Within `skimage` and `scipy`, there are many modules that provide functions for applying common processing operations on images or masks — such as filtering, thresholding, morphology.

In [None]:
from skimage import filters

gaussian_blur = filters.gaussian(input_image, sigma=2)         
# Smooths image, reduces noise by averaging neighboring pixels

median_filtered = filters.median(input_image)                  
# Reduces noise while better preserving edges

sobel_edges = filters.sobel(input_image)                       
# Detects edges by computing the gradient magnitude of the image

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
ax = axes.ravel()

ax[0].imshow(input_image, cmap='gray')
ax[0].set_title('Original Image')

ax[1].imshow(gaussian_blur, cmap='gray')
ax[1].set_title('Gaussian Blur')

ax[2].imshow(median_filtered, cmap='gray')
ax[2].set_title('Median Filter')

ax[3].imshow(sobel_edges, cmap='gray')
ax[3].set_title('Sobel Edges')


for a in ax:
    a.axis('off')

plt.tight_layout()
plt.show()

Segmentation is the process of separating an image into meaningful regions. In a simple case, we can segment image pixels into two classes (e.g., foreground vs. background) by applying a threshold on pixel values.

The thresholded image produces a mask, which is a boolean array indicating which pixels belong to the foreground (True) and which to the background (False).


The result of instance segmentation using `skimage.measure.label()` is called labels, which is an integer array where each connected object (e.g., a nucleus) is assigned a unique integer value, and the background is usually 0. This allows you to identify and analyze individual objects separately.
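The difference between a mask and labels is easiest to see on a tiny made-up image. This sketch uses `scipy.ndimage.label` as a stand-in for `skimage.measure.label`: both perform connected-component labeling, although their default treatment of diagonal neighbours differs, which does not matter for the well-separated blobs here:

```python
import numpy as np
import scipy.ndimage as ndi

# Tiny made-up image: two bright blobs on a dark background
image = np.array([
    [10, 10, 10, 10, 10, 10],
    [10, 90, 90, 10, 10, 10],
    [10, 90, 90, 10, 80, 10],
    [10, 10, 10, 10, 80, 10],
    [10, 10, 10, 10, 10, 10],
])

mask = image > 50                     # boolean array: True = foreground
labels, n_objects = ndi.label(mask)   # integer array: 0 = background, 1..n = objects

print(mask.dtype)    # bool
print(n_objects)     # 2
print(labels)
```

The mask only says whether a pixel is foreground, while the label image also says which object the pixel belongs to, so `labels == 1` selects the first blob alone.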

In [None]:
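# manual threshold (overwritten by the automatic one below)
# note: filters.gaussian returns a float image, rescaled to [0, 1] for integer input,
# so a manual value must be chosen on the scale of gaussian_blur
threshold = 30
# automatic threshold
threshold = filters.threshold_mean(gaussian_blur)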

binary_mask = gaussian_blur > threshold
print(binary_mask.dtype)

plt.imshow(binary_mask, cmap='gray')

In [None]:
from skimage import measure

labels = measure.label(binary_mask)
print(labels.dtype)
plt.imshow(labels, cmap = 'nipy_spectral')

##### Parameter Sweep

When tuning a processing workflow, it is typical to test multiple parameters.

Let’s compare how different levels of Gaussian blur affect the same image with help of a loop.

- We define a list of parameter values (sigma_values) to test.
- We use a simple **loop** to apply the same function with each parameter.
- We use another **loop** to plot images with subplots

In [None]:
# Define sigma values for Gaussian blur
sigma_values = [1, 3, 5, 7, 9, 11]

# Apply Gaussian filter with different sigmas
blurred_images = []
for sigma in sigma_values:
    blurred = filters.gaussian(input_image, sigma=sigma)
    blurred_images.append(blurred)

# Alternative: list comprehension
# blurred_images = [filters.gaussian(input_image, sigma=s) for s in sigma_values]

In [None]:
# Plot results side by side
fig, axes = plt.subplots(1, len(sigma_values),figsize=(25, 5))

# zip() is a built-in Python function 
# it lets you iterate over multiple sequences (lists, tuples...) in parallel
for ax, s, img in zip(axes, sigma_values, blurred_images):
    ax.imshow(img, cmap="gray")
    ax.set_title(f"σ = {s}")
    ax.axis("off")

plt.suptitle("Effect of Gaussian Blur with Different σ", fontsize=16)
plt.show()
 

##### --- ***Mini Exercise*** ---

In [None]:
numbers = [1,2,3,4]
dictionary = {
    'dogs': 2,
    'cats': 9,
    'sloths': 11,
    'spiders': 999

}

# Work with me to use zip


##### --- ***Exercise*** ---

Test how different thresholding algorithms segment the same image.

- Use the image from the previous cells (`input_image` or `gaussian_blur`).
- Create a dictionary of thresholding methods (Otsu, Li, Yen, Minimum).
    - such as `methods = {'name1': function1, 'name2': function2}` (store the functions themselves, without calling them)
- Create a new empty dictionary to store results.
    - such as `new_dictionary = {}`
- Use a loop to iterate over the methods in the dictionary (*Hint: loop through the dictionary items using `for key, value in methods.items()`*)
    - apply the thresholding method to get a threshold value
    - generate a binary mask with the threshold value
    - store the binary array in the new dictionary under the method name (*Hint: store value under key with `new_dictionary[key] = value`*)
- Optional: Plot the binary arrays in one figure. (*Hint: you can iterate over axes and dict items together with `for ax, (key, value) in zip(axes, new_dictionary.items())`*)
    - Optional: Plot the original image alongside the binary arrays in one figure.

In [None]:
# Your code here


<details>
<summary>Click to see the example solution</summary>

```python
# load thresholding functions
from skimage.filters import threshold_otsu, threshold_li, threshold_yen, threshold_minimum

# Dictionary of thresholding methods
methods = {
    "Otsu": threshold_otsu,
    "Li": threshold_li,
    "Yen": threshold_yen,
    "Min": threshold_minimum
}

# Create empty dictionary to store masks
masks = {}

# Apply each thresholding method in a loop
for name, func in methods.items():
    thresh = func(input_image)        
    mask = input_image > thresh      
    masks[name] = mask          

# Add original image to the new dictionary to make plotting easier
masks['raw'] = input_image

# Plot original image + threshold results
fig, axes = plt.subplots(1, len(masks), figsize=(16, 5))

for ax, (name, mask) in zip(axes, masks.items()):
    ax.imshow(mask, cmap="gray")
    ax.set_title(name)
    ax.axis("off")

plt.show()
```

</details>

<br><br>

In [None]:
# By the way, skimage has a function to try all thresholds

filters.try_all_threshold(input_image, figsize=(8, 5), verbose=False)
plt.tight_layout()


##### Object measurements

The `measure.regionprops_table()` function from scikit-image computes quantitative features for each labeled region in an image.  

Each connected region (identified by a unique label) can be characterized by geometric and intensity-based properties.

The output of this function is a dictionary containing one array per measured property.

In [None]:
# Measurements
 
# We specify which properties we want to measure for each object.
properties_to_measure = ('label', 'area', 'mean_intensity', 'perimeter', 'eccentricity')

# regionprops_table uses our `labels` image and the original `input_image`
props_dict = measure.regionprops_table(
    labels,
    intensity_image=input_image,
    properties=properties_to_measure
)

props_dict

In [None]:
# Convert the dictionary of results into a Pandas DataFrame
import pandas as pd

image_df = pd.DataFrame(props_dict)

image_df.head(5) # show first 5 rows of a table

---

#### Process all images in folder

Let's combine everything together.

We will use a for loop to apply a user-defined function, containing the processing and measurement steps, to all images in a folder.

In [None]:
# Step 1: Define our final, reusable analysis function
def analyze_image_final(image_array, source_filename):
    blurred_image = filters.gaussian(image_array, 3)
    threshold_value = filters.threshold_otsu(blurred_image)
    mask = blurred_image > threshold_value
    label_image = measure.label(mask)
    props_dict = measure.regionprops_table(label_image, intensity_image=image_array,
                                           properties=('label', 'area', 'mean_intensity'))
    results_df = pd.DataFrame(props_dict)
    # Add the source filename to track our data
    results_df['source_file'] = source_filename

    return results_df

In [None]:
# Step 2: Find the files
input_folder = r"../data/batch_analysis/input"
file_list = glob.glob(os.path.join(input_folder, "*.png"))

print(file_list)

In [None]:
# Step 3: Loop and aggregate results
import imageio.v3 as iio

all_image_results = []
print("\nStarting batch processing...")

# Iterate over list of filepaths
for file_path in file_list:
    print(f"Processing: {file_path}")
    
    # Read image
    image = iio.imread(file_path)
    
    # Get just the filename from the full path for cleaner tables
    filename_only = os.path.basename(file_path)
    
    # Apply our function - our function returns DataFrame with measurements
    single_image_df = analyze_image_final(image, filename_only)

    # Add DataFrame for currently processed file to a pre-defined list
    all_image_results.append(single_image_df)
    
print()
print('Number of dataframes in list:', len(all_image_results))

In [None]:
# Step 4: Concatenate into a final DataFrame
final_batch_df = pd.concat(all_image_results, ignore_index=True)

print("\n--- Final Batch Results ---")
final_batch_df

##### --- ***Exercise*** --- 

In the introductory session you were asked to create an `analyze_nuclei` function that takes two arguments: a microscopy image (2D array) and its name (string). The function should identify individual nuclei, measure their properties (e.g., area and mean intensity), and return a DataFrame containing the measurements, with a column storing the filename, followed by a labeled image.

The version of this function from the example solution is defined below; use it to solve the following exercise. 

***Batch Process the Entire Folder***

You have the tools, you have the function. Now it's time to automate everything!

**Instructions:**
1. Use glob to get a list of all `.tif` files in the `data/batch_analysis/nuclei_data` folder.
2. Create an empty list called `all_nuclei_results`.
3. Write a `for` loop that iterates through your list of file paths.
4. Inside the loop:
    - Load the current image using `iio.imread()`.
    - Get just the filename (without the folder path) using `os.path.basename()`.
    - Call your `analyze_nuclei` function with the loaded image and the filename.
    - Append the DataFrame returned by the function to your `all_nuclei_results` list.
5. After the loop, use `pd.concat()` to combine the list of DataFrames into a single, master DataFrame.
6. Print the head and tail of your final DataFrame to see the combined results from all images.
7. As a final analysis, create a boxplot showing the distribution of nuclei `area` for each of the images.
    - *Hint*: `seaborn.boxplot` is great for this; you can use the `filename` column for both the `x` and `hue` arguments. 
    - Usage: `sns.boxplot(data=dataframe, x=x_axis_column, y=y_axis_column, hue=group_column)`

In [None]:
# run this code cell
from skimage import filters, morphology, measure
import scipy.ndimage as ndi

def analyze_nuclei(image_array, source):
    filtered_array = filters.gaussian(image_array, sigma=1)
    threshold_value = filters.threshold_otsu(filtered_array)
    mask = filtered_array > threshold_value
    processed_mask = ndi.binary_fill_holes(mask)
    mask_cleaned = morphology.remove_small_objects(processed_mask, min_size=50)
    label_image = measure.label(mask_cleaned)

    props = measure.regionprops_table(label_image, intensity_image=image_array,
                                      properties=['label', 'area', 'mean_intensity'])
    df = pd.DataFrame(props)

    df['filename'] = source

    return df, label_image

In [None]:
# Your code here


<details>
<summary>Click to see the example solution</summary>

```python
# 1. Get list of all .tif files in the folder
image_files = glob.glob("../data/batch_analysis/nuclei_data/*.tif")

# 2. Create an empty list to store results
all_nuclei_results = []

# 3. Loop through each file
for file_path in image_files:
    # Load image
    image_array = iio.imread(file_path)
    
    # Extract filename only
    filename = os.path.basename(file_path)
    
    # Analyze nuclei
    df, labels = analyze_nuclei(image_array, filename)
    
    # Append results to list
    all_nuclei_results.append(df)

# 5. Combine all DataFrames into a single master DataFrame
master_df = pd.concat(all_nuclei_results, ignore_index=True)

# 6. Inspect results
print(master_df.head())
print(master_df.tail())

# 7. Boxplot of nuclei area per image
plt.figure(figsize=(10, 6))
sns.boxplot(data=master_df, x='filename', y='area', palette="Set3", hue='filename')
plt.xticks(rotation=45)
plt.ylabel("Nuclei Area")
plt.xlabel("Source Image")
plt.tight_layout()
plt.show()
```

</details>

<br><br>
#### Slice-by-slice reading

When working with image data stored as a series of files, it is often useful to load all slices into a single multi-dimensional array for easier processing.

The **scikit-image** function `imread_collection()` allows you to read multiple images at once using a filename pattern.
You can use a wildcard (*) to match all files in a folder that belong to your dataset. This collection can then be converted into a **NumPy** array, effectively creating an image stack.

However, the number of z-slices, channels, or frames is not recognized. You have to reshape the loaded data into the appropriate multi-dimensional form (for example, (z, c, y, x)) yourself.

*Note*: Alternatively, you can build your own for-loops to load images from disk manually.
This approach gives you more flexibility — for example, to sort slices and channels, skip specific files, or arrange the data into custom dimensions.
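Reshaping only works if the files were read in an order that matches the target shape. The sketch below uses a made-up stack in which image number `i` is filled with the value `i`, so you can see where each file ends up. The dimension sizes and the tiny 4 x 4 image size are assumptions for illustration:

```python
import numpy as np

num_t, num_z, num_c = 10, 5, 2
height, width = 4, 4  # tiny made-up image size

# Fake stack: image number i is filled with the value i
n_files = num_t * num_z * num_c
fake_stack = np.stack([np.full((height, width), i) for i in range(n_files)])

image5d = fake_stack.reshape(num_t, num_z, num_c, height, width)

# With shape (t, z, c, y, x), the channel index changes fastest from file to file,
# then the z-slice, then the time frame
t, z, c = 3, 2, 1
expected_file_index = t * num_z * num_c + z * num_c + c

print(image5d.shape)                                  # (10, 5, 2, 4, 4)
print(image5d[t, z, c, 0, 0], expected_file_index)    # 35 35
```

If your files are sorted differently (for example, all slices of channel 1 before channel 2), reshape to a matching order first and then rearrange the axes, for instance with `np.moveaxis`.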

In [None]:
from skimage import io

im_collection = io.imread_collection('../data/batch_analysis/tiffs/' + "*")
image_stack = im_collection.concatenate()
image_stack.shape

In [None]:
# Change the shape of array with reshape

num_channels = 2
num_z_slices = 5
num_t_frames = 10
image5d = np.reshape(image_stack, (num_t_frames, num_z_slices, num_channels, image_stack.shape[-2], image_stack.shape[-1]))
image5d.shape

In [None]:
import stackview
stackview.slice(image5d)

#### Pixel-Based Colocalization 

Pixel-based colocalization assesses the statistical correlation between the intensity values of corresponding pixels in two or more channels. It's important to remember that colocalization only indicates that two signals are present within the same volume resolved by the microscope - it does not prove molecular interaction.

**The Cytofluorogram (Scatter Plot):**
A 2D scatter plot is the primary visualization tool for colocalization. Each point on the plot represents a single pixel, with its x-coordinate being its intensity in Channel 1 and its y-coordinate its intensity in Channel 2.

There are many types of coefficients you can evaluate (Pearson’s correlation, Spearman’s rank correlation, Manders’ overlap coefficient, cross-correlation analysis...). In this notebook we will compute the Pearson correlation coefficient.

- **Pearson's Correlation Coefficient (PCC):** Measures the linear relationship between the two channels' intensities. It ranges from -1 (perfect anti-correlation) to +1 (perfect correlation), with 0 indicating no linear correlation. A major drawback is its sensitivity to background pixels: because background is dark in both channels, it often creates a strong, artificial positive correlation. Background should therefore be excluded, for example by thresholding, before computing PCC.
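
As a rough sketch of what PCC computes, the helper below (a hypothetical function, not part of any library) divides the covariance of the two intensity arrays by the product of their standard deviations. The intensity values are synthetic and chosen so that the relationship is exactly linear:

```python
import numpy as np

def pearson_from_definition(a, b):
    """PCC: covariance of a and b divided by the product of their standard deviations."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

# Synthetic intensities (assumed values): ch2 is an exact linear function of ch1
ch1 = np.array([100, 200, 300, 400])
ch2 = 2 * ch1 + 50

r_linear = pearson_from_definition(ch1, ch2)
r_inverse = pearson_from_definition(ch1, -ch2)
print(r_linear, r_inverse)
```

The scale and offset of the linear relationship do not affect the result, only its direction and how tightly the points follow a line. In practice a library routine such as `scipy.stats.pearsonr` or `np.corrcoef` is preferable to a hand-written version.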


##### Example

In [None]:
# --- Setup: Imports and Load a Multi-Channel Image ---
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import stackview
from skimage.io import imread
from skimage import filters

In [None]:
raw_image = imread(r'../data/cellpainting.tif')

print(raw_image.shape)

In [None]:
# Use stackview for interactive visualization of the channels

stackview.switch(
    {"actin":   raw_image[0,:,:],
     "er":raw_image[1,:,:],
     "speckles":      raw_image[2,:,:],
     "mito":      raw_image[3,:,:],
     "nuclei":      raw_image[4,:,:],
    },
    colormap=['pure_red',"pure_green", "pure_yellow", "pure_magenta", 'pure_cyan'],
    toggleable=True
)

In [None]:
# Store channels in a dictionary for easy access

channels_dict = {
    "actin": raw_image[0,:,:],
    "er": raw_image[1,:,:],
    "speckles": raw_image[2,:,:],
    "mito": raw_image[3,:,:],
    "nuclei": raw_image[4,:,:],
}

Visualize Pixel Intensity Correlations with a cytofluorogram.

A cytofluorogram is a 2D histogram (scatter-like plot) showing how pixel intensities in two channels correlate.

Let's create a cytofluorogram for the **mito** and **er** channels.

In [None]:
# For efficient analysis, flatten the 2D image arrays into 1D arrays of pixels
flattened_pixels = {key: value.ravel() for key, value in channels_dict.items()}
df_pixels = pd.DataFrame(flattened_pixels)

print(f"Total number of pixels: {len(df_pixels)}")
df_pixels.head()

In [None]:
# --- The Cytofluorogram (Scatter Plot) ---
plt.figure(figsize=(6,6))
plt.scatter(x=df_pixels['mito'], y=df_pixels['er'], s=1, alpha=0.1)
plt.xlabel('mito intensity')
plt.ylabel('er intensity')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

In [None]:
# A 2D histogram is often more informative for dense data
# Note the high density of points near the origin, representing background pixels.

plt.figure(figsize=(7, 7))
sns.histplot(df_pixels, x='mito', y='er', cbar=True, cmap='YlGnBu_r', bins=200)
plt.show()

Calculating Pearson's Correlation Coefficient (PCC)

First, we calculate PCC on the raw pixel data, which includes a vast number of dark background pixels. This will demonstrate how background skews the result.
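
The effect of background can be illustrated with purely synthetic data. In the sketch below, all numbers (group sizes, means, spreads, and the threshold of 1000) are assumptions. The two channels are generated independently, so any correlation over all pixels comes only from the shared split into dark and bright pixels:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): 90% dark background, 10% bright foreground,
# and the two channels are statistically independent inside each group
n_background, n_foreground = 90_000, 10_000
ch1 = np.concatenate([rng.normal(100, 10, n_background),
                      rng.normal(3000, 500, n_foreground)])
ch2 = np.concatenate([rng.normal(100, 10, n_background),
                      rng.normal(3000, 500, n_foreground)])

# PCC over all pixels: the shared dark/bright split dominates the result
pcc_all = np.corrcoef(ch1, ch2)[0, 1]

# PCC over foreground pixels only (selected with a fixed threshold of 1000)
foreground = (ch1 > 1000) & (ch2 > 1000)
pcc_foreground = np.corrcoef(ch1[foreground], ch2[foreground])[0, 1]

print(f"all pixels: {pcc_all:.2f}, foreground only: {pcc_foreground:.2f}")
```

Over all pixels the coefficient is close to 1 even though the signals are unrelated, while the foreground-only value is close to 0.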

In [None]:
# --- Pearson's Correlation ---
from scipy.stats import pearsonr

pcc, p_value = pearsonr(df_pixels['mito'], df_pixels['er'])
print(f"Pearson's r (Mito vs. ER): {pcc:.4f}, p-value: {p_value:.4e}")


Now, we'll apply a threshold to each channel to create masks that separate signal from background. By analyzing only the pixels within these masks, we get a more biologically meaningful correlation value.

In [None]:
# --- Step 1: Create a mask for significant pixels in each channel ---

# We can use Otsu's method to automatically find a threshold
thresh_er = filters.threshold_otsu(channels_dict['er'])
print(f"Otsu threshold for 'er': {thresh_er}")


# Or we might use a manual threshold based on visual inspection
threshold = 2500
mask_er = channels_dict['er'] > threshold
mask_mito = channels_dict['mito'] > threshold

# --- Step 2: Combine masks ---
combined_mask = np.logical_and(mask_mito, mask_er)
# or
combined_mask = mask_mito & mask_er

In [None]:
# --- Step 3: Visualize all masks side-by-side ---
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].imshow(mask_mito, cmap='gray')
axes[0].set_title('Mito Mask')
axes[0].axis('off')

axes[1].imshow(mask_er, cmap='gray')
axes[1].set_title('ER Mask')
axes[1].axis('off')

axes[2].imshow(combined_mask, cmap='gray')
axes[2].set_title('Combined Mask')
axes[2].axis('off')

plt.tight_layout()
plt.show()

##### Operations between masks

In image analysis, it is often useful to combine or compare different binary masks to extract specific regions of interest. 

Masks are boolean arrays (True for foreground, False for background) and can be combined with logical operations to create new masks.

Common operations:
- *Inversion - NOT (`~`)* – flips all boolean values:
    ```python
    inverted_mask = ~mask
    ```
- *Intersection (`&`)* – keeps only pixels present in both masks:
    ```python
    overlap_mask = mask1 & mask2
    ```
- *Union (`|`)* – includes all pixels present in either mask:
    ```python
    combined_mask = mask1 | mask2
    ```
- *Difference / Subtraction (`& ~`)* – removes pixels of one mask from another:
    ```python
    cytoplasm_mask = cell_mask & ~nuclei_mask
    ```
- *Exclusive OR (`^`)* – pixels present in one mask or the other, but not both:
    ```python
    xor_mask = mask1 ^ mask2
    ```
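
As a quick check of the operations listed above, the sketch below applies them to two tiny boolean masks. The mask values are assumptions chosen to cover every combination of True and False:

```python
import numpy as np

# Two toy masks (assumed values) covering all four True/False combinations
mask1 = np.array([True, True, False, False])
mask2 = np.array([True, False, True, False])

inverted = ~mask1              # NOT: flips every value
intersection = mask1 & mask2   # AND: True only where both masks are True
union = mask1 | mask2          # OR: True where at least one mask is True
difference = mask1 & ~mask2    # in mask1 but not in mask2
exclusive = mask1 ^ mask2      # XOR: True where exactly one mask is True

for name, result in [("NOT", inverted), ("AND", intersection), ("OR", union),
                     ("DIFF", difference), ("XOR", exclusive)]:
    print(f"{name:>4}: {result}")
```

These operators work element-wise on boolean arrays. Masks stored as integers (0 and 1) should first be converted with `.astype(bool)`, because `~` on integers performs a bitwise inversion instead.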

In [None]:
# Apply the combined mask to both channels
mito_masked = channels_dict['mito'][combined_mask]
er_masked = channels_dict['er'][combined_mask]

# --- Step 4: Calculate Masked Pearson's ---
pcc_masked, p_value = pearsonr(mito_masked, er_masked)
print(f"Masked Pearson's r (Mito vs. ER): {pcc_masked:.4f}, p-value: {p_value:.4e}")

In [None]:
plt.figure(figsize=(7, 7))

# Plot all pixels as a scatter plot (light alpha to avoid overplotting)
sns.scatterplot(df_pixels, x='mito', y='er', s=1, alpha=0.1, color='gray')

# Draw threshold lines to highlight quadrants
plt.axvline(threshold, color='red', linestyle='--', label='Mito threshold')
plt.axhline(threshold, color='blue', linestyle='--', label='ER threshold')

# Legend
plt.legend(loc='upper right')

plt.grid(True, linestyle='--', alpha=0.3)
plt.tight_layout()
plt.show()

##### --- Exercise ---

Repeat the colocalization analysis steps above for a different channel pair. You can work with the created dictionaries.

1. Choose another pair of channels to analyze (e.g., actin vs. er, or nuclei vs. speckles).

2. Plot a cytofluorogram. Create a scatter or density plot showing the pixel intensity correlation between the two channels.

3. Determine appropriate thresholds for each channel (either automatically or manually).

4. Generate binary masks for both channels based on the chosen thresholds, combine them (e.g., using logical AND), and compute the masked Pearson correlation coefficient.

5. Visualize the combined mask.

In [None]:
# Your code here


<details>
<summary>Click to see the example solution</summary>

```python
# EXAMPLE SOLUTION

# Visualize pixel intensity correlation
plt.figure(figsize=(7, 7))
sns.histplot(df_pixels, x='actin', y='er', bins=100, cbar=True, cmap='YlGnBu_r')
plt.show()

# Create masks for significant pixels
# actin: automatic threshold with Otsu's method
thresh_1 = filters.threshold_otsu(channels_dict['actin'])
print(f"Otsu threshold for 'actin': {thresh_1}")
mask_1 = channels_dict['actin'] > thresh_1
# er: manual threshold chosen by visual inspection
mask_2 = channels_dict['er'] > 2000

# Combine masks
combined_mask = mask_1 & mask_2
ch1_masked = channels_dict['actin'][combined_mask]
ch2_masked = channels_dict['er'][combined_mask]

# Calculate masked Pearson correlation
pcc_masked, p_value = pearsonr(ch1_masked, ch2_masked)
print(f"Masked Pearson's r (Actin vs. ER): {pcc_masked:.4f}, p-value: {p_value:.4e}")

# Visualize the combined mask
plt.figure(figsize=(6, 6))
plt.imshow(combined_mask, cmap='gray')
plt.title('Combined Mask')
plt.axis('off')
plt.show()
```