# Week 19 - Progress Log

I have **290 food webs**, each with `Nodes`, `Edges`, and `Connectance`. To group them into **small, medium, and large** classes by number of **Edges (links)**, which is the quantity that matters computationally, there are two options:

---

## ✅ **Step 1: Define Size Categories by Edges**

**Quantile-based partitioning** puts roughly the same number of food webs in each class, which suits a dataset whose sizes vary as widely as this one (from 16 to 16041 edges).

### Option A: Use **Quantiles (tertiles)**

```text
Small:        bottom 33% (low edge count)
Medium:       middle 33%
Large:        top 33% (high edge count)
```

### Option B: Define **custom edge thresholds**

Based on ecological logic or runtime performance:

```text
Small:        Edges ≤ 500
Medium:       500 < Edges ≤ 3000
Large:        Edges > 3000
```

I recommend starting with **Option A (tertiles)** to see how the data distributes naturally. Then we can refine.
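For comparison, Option B can be sketched as a plain function with fixed cut-offs. The function name `categorize_by_thresholds` and its parameter names are hypothetical; the default values 500 and 3000 are the thresholds listed under Option B.

```python
def categorize_by_thresholds(edges, small_max=500, medium_max=3000):
    """Assign a size class from fixed edge-count thresholds (Option B)."""
    if edges <= small_max:
        return 'small'
    elif edges <= medium_max:
        return 'medium'
    return 'large'

# Illustrative edge counts chosen to sit on and around the boundaries
example_edges = [16, 500, 501, 3000, 3001]
example_labels = [categorize_by_thresholds(e) for e in example_edges]
print(example_labels)
```

With a DataFrame that has an `Edges` column, the same function could be applied as `df['Edges'].apply(categorize_by_thresholds)`.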

---

In [35]:
import pandas as pd

# Load the food web metrics CSV
file_path = "../../data/processed/foodweb_metrics.csv"
df = pd.read_csv(file_path)

# Calculate tertiles based on the 'Edges' column
tertile_1 = df['Edges'].quantile(1/3)
tertile_2 = df['Edges'].quantile(2/3)

# Define categories
def categorize_edges(edges):
    if edges <= tertile_1:
        return 'small'
    elif edges <= tertile_2:
        return 'medium'
    else:
        return 'large'

# Apply categorization
df['SizeCategory'] = df['Edges'].apply(categorize_edges)

# Split into separate DataFrames
df_small = df[df['SizeCategory'] == 'small']
df_medium = df[df['SizeCategory'] == 'medium']
df_large = df[df['SizeCategory'] == 'large']

# Save to CSV files (each file name encodes the edge-count range of its class)
small_path = "../../data/processed/small_tertile_foodwebs_16-120.csv"
medium_path = "../../data/processed/medium_tertile_foodwebs_121-417.csv"
large_path = "../../data/processed/large_tertile_foodwebs_418-16041.csv"

df_small.to_csv(small_path, index=False)
df_medium.to_csv(medium_path, index=False)
df_large.to_csv(large_path, index=False)

print("Total amount of Foodwebs:")
print(df_small["Edges"].count() + df_medium["Edges"].count() + df_large["Edges"].count())

Total amount of Foodwebs:
290


The dataset is now split into **tertiles** by number of edges (links), and each class is written to its own CSV file in `../../data/processed/`:

* `small_tertile_foodwebs_16-120.csv`
* `medium_tertile_foodwebs_121-417.csv`
* `large_tertile_foodwebs_418-16041.csv`

In [46]:
# Imports (repeated so this cell can run on its own)
import pandas as pd
import numpy as np

# Reload the dataset
file_path = "../../data/processed/foodweb_metrics.csv"
df = pd.read_csv(file_path)

# Recompute tertiles
tertile_1 = df['Edges'].quantile(1/3)
tertile_2 = df['Edges'].quantile(2/3)

# Re-categorize by edge count
def categorize_edges(edges):
    if edges <= tertile_1:
        return 'small'
    elif edges <= tertile_2:
        return 'medium'
    else:
        return 'large'

df['SizeCategory'] = df['Edges'].apply(categorize_edges)

# Split categories
df_small = df[df['SizeCategory'] == 'small']
df_medium = df[df['SizeCategory'] == 'medium']
df_large = df[df['SizeCategory'] == 'large']

# Sample from each category
np.random.seed(42)
sampled_small = df_small.sample(n=min(10, len(df_small)), random_state=42)
sampled_medium = df_medium.sample(n=min(10, len(df_medium)), random_state=42)
sampled_large = df_large.sample(n=min(10, len(df_large)), random_state=42)

# Combine and save
sampled_combined = pd.concat([sampled_small, sampled_medium, sampled_large]).reset_index(drop=True)
sampled_path = "../../data/processed/stratified_random_sampled_foodwebs.csv"
sampled_combined.to_csv(sampled_path, index=False)

sampled_combined

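| Unnamed: 0 | Foodweb          | Nodes | Edges | Connectance | SizeCategory |
| ---------- | ---------------- | ----- | ----- | ----------- | ------------ |
| 0          | PGUBP3           | 21    | 68    | 0.161905    | small        |
| 1          | CGP1             | 18    | 27    | 0.088235    | small        |
| 2          | SF2M2            | 15    | 30    | 0.142857    | small        |
| 3          | Twin Lake East   | 13    | 17    | 0.108974    | small        |
| 4          | SF1I2            | 20    | 58    | 0.152632    | small        |
| 5          | SF1I4            | 21    | 80    | 0.190476    | small        |
| 6          | PP1I1            | 27    | 119   | 0.169516    | small        |
| 7          | CGP3             | 31    | 103   | 0.110753    | small        |
| 8          | Indian Lake      | 35    | 102   | 0.085714    | small        |
| 9          | Brook trout lake | 15    | 19    | 0.090476    | small        |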


The final sample contains up to 10 food webs from each size category (small, medium, large), stored in a single CSV:

* `../../data/processed/stratified_random_sampled_foodwebs.csv`

This subset lets us test generalization across size classes without processing all 290 food webs.

In [2]:
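# Generate .mat files (adjacency matrix, taxonomy names, mean body mass) for the sampled food webs.
# Species are ordered alphabetically by taxonomy name; the mass vector follows that same order.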

import os
import pandas as pd
import numpy as np
from scipy.sparse import csc_matrix
from scipy.io import savemat

df = pd.read_csv("../../data/processed/stratified_random_sampled_foodwebs.csv")
foodweb_filenames = df['Foodweb'].unique()

input_dir = "../../data/processed/foodwebs_csv"
output_dir = "../../data/processed/foodwebs_mat_by_ecosystem/MAT_mass"
os.makedirs(output_dir, exist_ok=True)

failed_files = []

for filename in foodweb_filenames:
    filename = filename.strip()
    if not filename.endswith(".csv"):
        filename += ".csv"

    csv_path = os.path.join(input_dir, filename)
    mat_path = os.path.join(output_dir, filename.replace(".csv", "_tax_mass.mat"))

    try:
        df_web = pd.read_csv(csv_path)
        df_web['res.taxonomy'] = df_web['res.taxonomy'].astype(str)
        df_web['con.taxonomy'] = df_web['con.taxonomy'].astype(str)
        prey = df_web['res.taxonomy']
        predator = df_web['con.taxonomy']
        species = sorted(set(prey).union(set(predator)))
        species_index = {name: i for i, name in enumerate(species)}
        N = len(species)

        adj_matrix = np.zeros((N, N), dtype=int)
        for res, con in zip(prey, predator):
            i = species_index[res]
            j = species_index[con]
            adj_matrix[i, j] = 1
        net_sparse = csc_matrix(adj_matrix)

        res_masses = df_web[['res.taxonomy', 'res.mass.mean.g.']].dropna().rename(
            columns={'res.taxonomy': 'species', 'res.mass.mean.g.': 'mass'})
        con_masses = df_web[['con.taxonomy', 'con.mass.mean.g.']].dropna().rename(
            columns={'con.taxonomy': 'species', 'con.mass.mean.g.': 'mass'})
        all_masses = pd.concat([res_masses, con_masses])
        species_mass = all_masses.groupby('species')['mass'].mean()

        taxonomy_names = np.array(species, dtype=object)
        mean_masses = np.array([species_mass.get(name, np.nan) for name in species])

        savemat(mat_path, {
            "net": net_sparse,
            "taxonomy": taxonomy_names,
            "mass": mean_masses
        })

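    except Exception as e:
        # Record the failure and continue with the remaining food webs
        failed_files.append((filename, str(e)))

# Report any food webs that could not be converted
for name, error in failed_files:
    print(f"Failed: {name}: {error}")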

## Review and understand the purpose of the `evaluate_on_all_unseen` input parameter of the `sample_neg()` function

---

### ✅ **DEFAULT BEHAVIOR: `evaluate_on_all_unseen = false`**

#### Logic:

* Combine all known links: `net = train + test`.
* Find all non-links: `neg_net = (net == 0)`.
* Randomly **sample `k*(train_size + test_size)` negative edges** from those non-links.

  * First `k * train_size` → `train_neg`
  * Remaining `k * test_size` → `test_neg`

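#### Example:

If:

* `train_size = 8`, `test_size = 2`, and `k = 2`

Then:

* `train_neg` contains `2 * 8 = 16` non-links
* `test_neg` contains `2 * 2 = 4` non-links

This is exactly what we expected: `k` negatives per positive in both splits.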
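The default logic can be sketched in a few lines of Python to check the arithmetic. This is not the actual `sample_neg()` code: the function name `sample_negatives`, the edge-list representation, and the choice to treat every pair without a link (including the diagonal) as a non-link are assumptions made for illustration.

```python
import random

def sample_negatives(n_nodes, train_pos, test_pos, k=1, seed=0):
    """Sample k negatives per positive and split them into train and test parts."""
    known = set(train_pos) | set(test_pos)  # net = train + test
    # Assumption: every ordered pair without a known link is a candidate non-link
    non_links = [(i, j) for i in range(n_nodes) for j in range(n_nodes)
                 if (i, j) not in known]
    n_train = k * len(train_pos)
    n_test = k * len(test_pos)
    if n_train + n_test > len(non_links):
        raise ValueError("not enough non-links to sample from")
    rng = random.Random(seed)
    sampled = rng.sample(non_links, n_train + n_test)  # without replacement
    return sampled[:n_train], sampled[n_train:]

# Toy graph with 6 nodes: 8 training links and 2 test links (made-up data)
train_pos = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]
test_pos = [(4, 5), (0, 5)]
train_neg, test_neg = sample_negatives(6, train_pos, test_pos, k=2)
print(len(train_neg), len(test_neg))
```

Because the negatives are drawn in a single call without replacement, `train_neg` and `test_neg` cannot share a pair.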

---

### ✅ **MODIFIED BEHAVIOR: `evaluate_on_all_unseen = true`**

This mode **changes only how `test_neg` is constructed**, to enable evaluation on **all** possible test negative links.

#### What happens:

* `train_neg` is sampled just like before: `k * train_size`
* `test_neg = all non-links` **minus** the non-links already sampled into `train_neg`

  * So the test positives are evaluated **against all remaining non-links** (i.e. a *larger* and *more realistic* test scenario)

#### Implications:

* The number of test negatives is no longer controlled. It will be:

  ```
  test_neg = total_possible_neg_links - train_neg
  ```

* So if the graph is sparse and `k` is small, `test_neg` will contain **far more** samples than `test_pos`.

This mode is useful when:

* You want to do **open-world** evaluation (i.e., predict whether any unseen link might be real).
* You care about **ranking performance (e.g., AUC)** over the entire non-observed space.
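The construction of `test_neg` in this mode can be sketched as follows. As before, this is an illustration and not the actual `sample_neg()` code: the function name `split_negatives_all_unseen`, the edge-list representation, and treating every pair without a link (including the diagonal) as a non-link are assumptions.

```python
import random

def split_negatives_all_unseen(n_nodes, train_pos, test_pos, k=1, seed=0):
    """Sample k * train_size training negatives; all other non-links become test negatives."""
    known = set(train_pos) | set(test_pos)
    # Assumption: every ordered pair without a known link is a candidate non-link
    non_links = [(i, j) for i in range(n_nodes) for j in range(n_nodes)
                 if (i, j) not in known]
    rng = random.Random(seed)
    train_neg = rng.sample(non_links, k * len(train_pos))
    used = set(train_neg)
    test_neg = [pair for pair in non_links if pair not in used]
    return train_neg, test_neg

# Toy graph with 6 nodes: 36 ordered pairs, 10 links, so 26 non-links (made-up data)
train_pos = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)]
test_pos = [(4, 5), (0, 5)]
train_neg, test_neg = split_negatives_all_unseen(6, train_pos, test_pos, k=2)
print(len(train_neg), len(test_neg))  # 16 training negatives, 26 - 16 = 10 test negatives
```

Here the 2 test positives face 10 test negatives instead of the 4 that `k = 2` would give in the default mode, and the gap widens quickly as the graph grows.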

---

### ✅ Summary Table

| Aspect                  | `evaluate_on_all_unseen = false` | `evaluate_on_all_unseen = true`                        |
| ----------------------- | -------------------------------- | ------------------------------------------------------ |
| `train_neg` size        | `k * train_pos`                  | `k * train_pos`                                        |
| `test_neg` size         | `k * test_pos`                   | All non-links not in `train_neg`                       |
| Useful for...           | Controlled experiments           | Open-world, realistic evaluation                       |
| Does negative sampling? | Yes (train & test)               | Yes (only train)                                       |
| AUC interpretation      | Controlled (fixed 1:`k` ratio)   | More generalizable (covers all remaining non-links)    |

---

#### Recommendation

If we're testing **small/medium/large food webs** and care about generalization and ranking, the suggestion is:

* Use `evaluate_on_all_unseen = true` when reporting **AUC**.
* Use `evaluate_on_all_unseen = false` for threshold-based metrics such as precision, recall, or F1, which are easier to interpret on a balanced sample.
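A small numeric sketch of why the metric choice depends on the mode: precision at a fixed threshold drops as the number of test negatives grows, while a pairwise AUC stays the same when the negatives follow the same score distribution. The scores below are made up for illustration, and the helper names `auc` and `precision` are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def precision(pos_scores, neg_scores, threshold=0.5):
    """Share of predicted links (score >= threshold) that are true links."""
    tp = sum(s >= threshold for s in pos_scores)
    fp = sum(s >= threshold for s in neg_scores)
    return tp / (tp + fp) if tp + fp else 0.0

pos = [0.9, 0.6]          # scores of the test positives
few_neg = [0.8, 0.3]      # a small sampled set of negatives
many_neg = few_neg * 5    # five times as many negatives, same score distribution

auc_few, auc_many = auc(pos, few_neg), auc(pos, many_neg)
prec_few, prec_many = precision(pos, few_neg), precision(pos, many_neg)
print(auc_few, auc_many)    # 0.75 in both cases
print(prec_few, prec_many)  # 2/3 with few negatives, 2/7 with many
```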