# Week 19 - Progress Log

I have **290 food webs**, each with `Nodes`, `Edges`, and `Connectance`. To group them into **small, medium, and large** by the number of **Edges (links)**, a sensible proxy for computational cost, here is a recommended approach:

---

### ✅ **Step 1: Define Size Categories by Edges**

A **quantile-based partitioning** produces groups of roughly equal size, which suits a dataset whose edge counts vary widely.

#### Option A: Use **Quantiles (tertiles)**

```text
Small:        bottom 33% (low edge count)
Medium:       middle 33%
Large:        top 33% (high edge count)
```
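As a rough illustration of Option A, `pandas.qcut` can assign tertile labels in one call. This is a minimal sketch on made-up edge counts (the `edges` values below are hypothetical, not taken from the dataset), not the method used in the notebook cells further down, which compute the quantiles explicitly.

```python
import pandas as pd

# Hypothetical edge counts standing in for the real 'Edges' column.
edges = pd.Series([17, 27, 58, 102, 150, 300, 417, 900, 4000], name="Edges")

# q=3 cuts at the 1/3 and 2/3 quantiles; labels are assigned from low to high.
# Bins are right-inclusive, so a value equal to a cut point falls in the lower bin.
size_category = pd.qcut(edges, q=3, labels=["small", "medium", "large"])

counts = size_category.value_counts().to_dict()
print(counts)
```

One caveat: if many webs share the same edge count, two cut points can coincide and `qcut` raises an error unless `duplicates="drop"` is passed, in which case fewer than three bins may come back.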

#### Option B: Define **custom edge thresholds**

Based on ecological logic or runtime performance:

```text
Small:        Edges ≤ 500
Medium:       500 < Edges ≤ 3000
Large:        Edges > 3000
```
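The fixed thresholds of Option B could be applied with `pandas.cut`. The sketch below uses illustrative edge counts chosen to sit on and around the boundaries, so the inclusive/exclusive behaviour is visible; the values are assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Illustrative edge counts placed on and next to the thresholds.
edges = pd.Series([16, 500, 501, 3000, 3001, 16041], name="Edges")

# right=True (the default) makes each bin include its upper bound,
# matching "Edges <= 500" and "500 < Edges <= 3000".
bins = [-np.inf, 500, 3000, np.inf]
size_category = pd.cut(edges, bins=bins, labels=["small", "medium", "large"])

labels = size_category.tolist()
print(labels)
```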

I recommend starting with **Option A (tertiles)** to see how the data distributes naturally. Then we can refine.

---

In [35]:

```python
import pandas as pd

# Reload the uploaded CSV
file_path = "../../data/processed/foodweb_metrics.csv"
df = pd.read_csv(file_path)

# Calculate tertiles based on the 'Edges' column
tertile_1 = df['Edges'].quantile(1/3)
tertile_2 = df['Edges'].quantile(2/3)

# Define categories
def categorize_edges(edges):
    if edges <= tertile_1:
        return 'small'
    elif edges <= tertile_2:
        return 'medium'
    else:
        return 'large'

# Apply categorization
df['SizeCategory'] = df['Edges'].apply(categorize_edges)

# Split into separate DataFrames
df_small = df[df['SizeCategory'] == 'small']
df_medium = df[df['SizeCategory'] == 'medium']
df_large = df[df['SizeCategory'] == 'large']

# Save to CSV files
small_path = "../../data/processed/small_tertile_foodwebs_16-120.csv"
medium_path = "../../data/processed/medium_tertile_foodwebs_121-417.csv"
large_path = "../../data/processed/large_tertile_foodwebs_418-16041.csv"

df_small.to_csv(small_path, index=False)
df_medium.to_csv(medium_path, index=False)
df_large.to_csv(large_path, index=False)

print("Total amount of Foodwebs:")
print(df_small["Edges"].count() + df_medium["Edges"].count() + df_large["Edges"].count())
```

```text
Total amount of Foodwebs:
290
```


The dataset has been split into **tertiles** by number of edges (links) and written to the following CSV files:

* `../../data/processed/small_tertile_foodwebs_16-120.csv`
* `../../data/processed/medium_tertile_foodwebs_121-417.csv`
* `../../data/processed/large_tertile_foodwebs_418-16041.csv`
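The output file names in the cell above appear to encode each category's edge range, typed in by hand. A possible way to check the split and derive such names from the data is sketched below; the toy DataFrame is a made-up stand-in for the categorized `df`, and the file-name pattern is only an assumption modelled on the names used above.

```python
import pandas as pd

# Toy stand-in for the categorized DataFrame (values are illustrative).
df = pd.DataFrame({
    "Foodweb": ["a", "b", "c", "d", "e", "f"],
    "Edges": [16, 120, 121, 417, 418, 16041],
    "SizeCategory": ["small", "small", "medium", "medium", "large", "large"],
})

# Number of webs and the edge range covered by each category.
summary = df.groupby("SizeCategory")["Edges"].agg(["count", "min", "max"])
print(summary)

# Build one file name per category from the observed min and max edge counts.
filenames = {
    category: f"{category}_tertile_foodwebs_{int(row['min'])}-{int(row['max'])}.csv"
    for category, row in summary.iterrows()
}
print(filenames)
```

Deriving the range from `summary` keeps the file names in step with the data if `foodweb_metrics.csv` is ever regenerated.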

In [46]:

```python
import pandas as pd

# Reload the dataset
file_path = "../../data/processed/foodweb_metrics.csv"
df = pd.read_csv(file_path)

# Recompute tertiles
tertile_1 = df['Edges'].quantile(1/3)
tertile_2 = df['Edges'].quantile(2/3)

# Re-categorize by edge count
def categorize_edges(edges):
    if edges <= tertile_1:
        return 'small'
    elif edges <= tertile_2:
        return 'medium'
    else:
        return 'large'

df['SizeCategory'] = df['Edges'].apply(categorize_edges)

# Split categories
df_small = df[df['SizeCategory'] == 'small']
df_medium = df[df['SizeCategory'] == 'medium']
df_large = df[df['SizeCategory'] == 'large']

# Sample up to 10 webs from each category.
# random_state alone makes each draw reproducible, so no global NumPy seed is needed.
sampled_small = df_small.sample(n=min(10, len(df_small)), random_state=42)
sampled_medium = df_medium.sample(n=min(10, len(df_medium)), random_state=42)
sampled_large = df_large.sample(n=min(10, len(df_large)), random_state=42)

# Combine and save
sampled_combined = pd.concat([sampled_small, sampled_medium, sampled_large]).reset_index(drop=True)
sampled_path = "../../data/processed/stratified_random_sampled_foodwebs.csv"
sampled_combined.to_csv(sampled_path, index=False)

sampled_combined
```

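|   | Foodweb | Nodes | Edges | Connectance | SizeCategory |
|---|---------|-------|-------|-------------|--------------|
| 0 | PGUBP3 | 21 | 68 | 0.161905 | small |
| 1 | CGP1 | 18 | 27 | 0.088235 | small |
| 2 | SF2M2 | 15 | 30 | 0.142857 | small |
| 3 | Twin Lake East | 13 | 17 | 0.108974 | small |
| 4 | SF1I2 | 20 | 58 | 0.152632 | small |
| 5 | SF1I4 | 21 | 80 | 0.190476 | small |
| 6 | PP1I1 | 27 | 119 | 0.169516 | small |
| 7 | CGP3 | 31 | 103 | 0.110753 | small |
| 8 | Indian Lake | 35 | 102 | 0.085714 | small |
| 9 | Brook trout lake | 15 | 19 | 0.090476 | small |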


Here is the final sample of food webs, 10 from each size category (small, medium, large), stored in one CSV. The output above shows only the first 10 rows, all from the small category.

* `../../data/processed/stratified_random_sampled_foodwebs.csv`

This subset is a small, size-balanced sample for testing generalization across size categories without processing the full dataset.
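For comparison, the same kind of stratified draw can be written with `DataFrameGroupBy.sample` (available in pandas 1.1 and later), followed by a quick check that every category contributed 10 webs. This is a sketch on a made-up frame with hypothetical web names; it is not guaranteed to pick the same rows as the per-category `.sample` calls in the notebook cell.

```python
import pandas as pd

# Toy frame with 12 hypothetical webs per category.
df = pd.DataFrame({
    "Foodweb": [f"web{i}" for i in range(36)],
    "SizeCategory": ["small"] * 12 + ["medium"] * 12 + ["large"] * 12,
})

# Draw 10 rows from each category in a single call.
sampled = df.groupby("SizeCategory").sample(n=10, random_state=42)

# Sanity check: rows drawn per category.
per_category = sampled["SizeCategory"].value_counts().to_dict()
print(per_category)
```

Unlike the `min(10, len(...))` guard in the notebook cell, `n=10` here raises an error if any category has fewer than 10 rows, which is not a concern when 290 webs are split into tertiles.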