# Week - 38

### What I need to add:

1. Compute, for each food web, the baseline fraction of rare links present in the training set when using the Random strategy (averaged over numExperiments), using a clear rarity definition (bottom rareQuantile by degree-based rarity score).

2. Log those baselines (per-web, per-ratio) and a global average per train ratio.

3. Derive the sweep range for RareLinksStrategy automatically around that baseline (unless I keep the fixed range).


---

## Baseline-Centered Rare-Link Sampling for WLNM

## 1. Objective

We refine the train/test split used by WLNM to systematically evaluate the effect of **rare ecological interactions** in training. Instead of choosing training links purely at random, we (i) estimate how many rare links would enter training **under a Random split**, and (ii) **center the RareLinks sweep around that empirical baseline**.

## 2. Connectivity & Rarity Definitions

* **Directed connectivity (per node):**
  $\deg^{out}(i) = \sum_j A_{ij},\quad \deg^{in}(i) = \sum_j A_{ji},\quad \deg(i)=\deg^{out}(i)+\deg^{in}(i)$
* **Edge rarity score (per link $i\to j$):**
  $r_{ij} = \deg(i)+\deg(j)$ (lower $r_{ij}$ ⇒ rarer interaction).
* **Rare set (quantile-based):**
  For all observed links, compute $r_{ij}$ and mark the bottom $q$ proportion as **rare** (default $q=0.20$).
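The definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the WLNM preprocessing code: the edge-list representation, the function names `rarity_scores` and `rare_set`, and the choice to round the rare-set size up with `ceil` are all assumptions made for the example.

```python
import math
from collections import Counter


def rarity_scores(edges):
    """Return {(i, j): deg(i) + deg(j)} for a list of unique directed links."""
    deg = Counter()
    for i, j in edges:
        deg[i] += 1  # contributes to the out-degree of i
        deg[j] += 1  # contributes to the in-degree of j
    return {(i, j): deg[i] + deg[j] for i, j in edges}


def rare_set(edges, q=0.20):
    """Bottom-q proportion of links by rarity score (ties keep input order)."""
    scores = rarity_scores(edges)
    ranked = sorted(edges, key=lambda e: scores[e])  # rarest first
    # Assumption: the rare-set size is rounded up; the inner round() only
    # guards against floating-point noise before taking the ceiling.
    n_rare = math.ceil(round(q * len(edges), 9))
    return set(ranked[:n_rare])


# Toy directed web (hypothetical species labels).
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]
toy_scores = rarity_scores(toy_edges)
toy_rare = rare_set(toy_edges, q=0.20)
```

In this toy web the link `d -> e` has the lowest score (both endpoints are weakly connected), so it is the only member of the rare set at $q=0.20$.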

## 3. Baseline Estimation (Random Strategy)

For each food web and train ratio:

1. Compute the **rare set** using the rarity definition above (on the full observed network).
2. Repeat **Random** train/test splitting for $T$ runs (same directed logic and optional reachability constraint as used in WLNM preprocessing).
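3. For each run, record the **rare-link fraction of the training set**, $|E_{\text{rare}} \cap E_{\text{train}}| / |E_{\text{train}}|$, i.e. the share of training links that belong to the rare set. This is the same quantity that RareFraction controls in Section 4, so the two are directly comparable.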
4. **Baseline** = mean of these fractions across the $T$ runs.

We log one CSV per food web (rare fraction under Random) and a per-ratio summary (mean baseline across food webs).

**Rationale.** This yields a dataset-specific, train-ratio-specific reference for how many rare links a purely random split would include, accounting for any reachability constraints in the split procedure.
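A minimal sketch of the baseline estimate follows. It assumes the baseline is measured as the share of training links that are rare (so that it is comparable to RareFraction), and it omits the optional reachability constraint for brevity. The names `random_baseline`, `runs` and `seed` are illustrative, not taken from the actual implementation.

```python
import math
import random
from collections import Counter


def rare_set(edges, q=0.20):
    """Bottom-q share of links by r_ij = deg(i) + deg(j)."""
    deg = Counter()
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    ranked = sorted(edges, key=lambda e: deg[e[0]] + deg[e[1]])
    return set(ranked[: math.ceil(round(q * len(edges), 9))])


def random_baseline(edges, alpha, q=0.20, runs=100, seed=0):
    """Mean share of rare links in the training set over `runs` random splits.

    Simplification: no reachability constraint is applied here.
    """
    rng = random.Random(seed)
    rare = rare_set(edges, q)  # computed once, on the full observed network
    m = math.ceil(round(alpha * len(edges), 9))  # training-set size
    fractions = []
    for _ in range(runs):
        train = rng.sample(edges, m)  # uniform sample without replacement
        fractions.append(sum(e in rare for e in train) / m)
    return sum(fractions) / runs


# Toy directed web (hypothetical species labels).
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "g")]
baseline = random_baseline(toy_edges, alpha=0.8)
```

Without a reachability constraint the expected value is simply $q$, so in practice the measured baseline is mainly informative when the constraint (or the rounding of set sizes on small webs) shifts it away from $q$.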

## 4. Baseline-Centered RareLinks Sweep

Let $\hat{\rho}$ be the Random baseline fraction for a dataset at a given train ratio.

* Define a sweep range
  $[\,\max(\rho_{\min}, \hat{\rho}-w),\, \min(\rho_{\max}, \hat{\rho}+w)\,]$
  with step $\Delta$ (defaults: $w=0.05$, $\Delta=0.01$, $\rho_{\min}=0.01$, $\rho_{\max}=0.50$).
* Each sweep point sets **RareFraction** = proportion of training links drawn from the rare set; the remainder is filled from non-rare links.

This anchors the exploration to a **realistic center** (what Random already gives) while testing controlled over/under-representation of rare interactions.
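As a worked illustration, the sweep grid can be generated as below. The function name `sweep_points` and the handling of an empty range are assumptions for this sketch; the defaults mirror the ones listed above.

```python
import math


def sweep_points(baseline, w=0.05, step=0.01, rho_min=0.01, rho_max=0.50):
    """RareFraction values in [max(rho_min, b - w), min(rho_max, b + w)]."""
    start = max(rho_min, baseline - w)
    stop = min(rho_max, baseline + w)
    if stop < start:
        return []  # assumption: a baseline far outside the clamp yields no sweep
    # Small epsilon so float noise does not drop the final grid point.
    n_steps = math.floor((stop - start) / step + 1e-9)
    return [round(start + k * step, 10) for k in range(n_steps + 1)]


centered = sweep_points(0.20)  # band fits entirely inside the clamp
clamped = sweep_points(0.03)   # lower end is clamped to rho_min
```

With a baseline of 0.20 the grid runs from 0.15 to 0.25 in 11 points; with a baseline of 0.03 the lower end is clamped to 0.01 and the grid runs from 0.01 to 0.08.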

## 5. Train/Test Selection Procedures

### 5.1 Random Strategy (Reference)

* **Input:** directed network $A$, train ratio $\alpha$.
* **Procedure:** uniformly sample links for test until reaching $(1-\alpha)|E|$, optionally preserving a reachability constraint (if removing a link would sever a required path, skip it). The remaining links form training.
* **Output:** $A_{\text{train}}, A_{\text{test}}$.
* **Use:** Baseline performance; baseline rare-fraction estimation.
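The procedure can be sketched as follows. Because the exact reachability rule is not spelled out here, the sketch takes it as a caller-supplied predicate `keeps_reachability`; the example predicate `keeps_all_nodes` is only a hypothetical stand-in, not the check used in WLNM preprocessing. The test-set target is taken as $|E| - \lceil \alpha |E| \rceil$ so that it agrees with the training size used by RareLinks.

```python
import math
import random


def random_split(edges, alpha, keeps_reachability=None, seed=0):
    """Move uniformly chosen links to the test set until the target is reached.

    `keeps_reachability(remaining_train)` is an optional predicate; a candidate
    link is skipped when removing it would make the predicate fail.
    Assumes the edge list contains no duplicate links.
    """
    rng = random.Random(seed)
    n_test = len(edges) - math.ceil(round(alpha * len(edges), 9))
    train, test = list(edges), []
    candidates = list(edges)
    rng.shuffle(candidates)  # uniform order = sampling without replacement
    for e in candidates:
        if len(test) >= n_test:
            break
        remaining = [x for x in train if x != e]
        if keeps_reachability is not None and not keeps_reachability(remaining):
            continue  # removing e would violate the constraint; skip it
        train = remaining
        test.append(e)
    return train, test  # test may fall short of n_test if many links are rejected


toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "g")]
all_nodes = {n for e in toy_edges for n in e}


def keeps_all_nodes(remaining):
    """Hypothetical constraint: every node keeps at least one training link."""
    return {n for e in remaining for n in e} == all_nodes


train, test = random_split(toy_edges, alpha=0.8)
train_c, test_c = random_split(toy_edges, alpha=0.8,
                               keeps_reachability=keeps_all_nodes)
```

Under the stand-in constraint the link `f -> g` can never move to the test set, because it is the only link touching `g`.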

### 5.2 RareLinks Strategy (Baseline-Centered)

* **Input:** $A$, train ratio $\alpha$, rare set (bottom $q$), RareFraction $\rho$.
* **Procedure:**

  1. Compute $r_{ij}$ and sort links ascending (rarest first).
  2. Target training size $m=\lceil \alpha |E| \rceil$.
  3. Select $\lceil \rho m\rceil$ links from the rare set (starting from the rarest), then fill the remaining $m-\lceil \rho m\rceil$ slots from the other links.
  4. The **test set** is $E \setminus E_{\text{train}}$.
* **Output:** $A_{\text{train}}, A_{\text{test}}$.
* **Note:** The selection itself does not enforce a reachability check (consistent with our current implementation); connectivity effects are reflected in the Random baseline.
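A minimal sketch of this selection is given below. Several details are assumptions made for the example: the remaining slots are filled from the non-rare links in rarity order (a random fill would be an equally plausible reading), the rare quota is capped at the size of the rare set, and the training set is left short if the non-rare pool runs out. The function name `rare_links_split` is illustrative.

```python
import math
from collections import Counter


def rare_links_split(edges, alpha, rho, q=0.20):
    """Train set = ceil(rho * m) rarest links + fill from non-rare links."""
    deg = Counter()
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    ranked = sorted(edges, key=lambda e: deg[e[0]] + deg[e[1]])  # rarest first

    n_rare = math.ceil(round(q * len(edges), 9))
    rare, common = ranked[:n_rare], ranked[n_rare:]

    m = math.ceil(round(alpha * len(edges), 9))  # target training size
    # Assumption: the quota cannot exceed the number of rare links available.
    k = min(math.ceil(round(rho * m, 9)), len(rare))
    # Assumption: fill in rarity order; may be shorter than m - k if the
    # non-rare pool is too small.
    train = rare[:k] + common[: m - k]

    chosen = set(train)
    test = [e for e in edges if e not in chosen]
    return train, test


toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"),
             ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "g")]
train_hi, test_hi = rare_links_split(toy_edges, alpha=0.8, rho=0.25)
train_lo, test_lo = rare_links_split(toy_edges, alpha=0.8, rho=0.125)
```

On the toy web the rare set is `{f -> g, e -> f}`. With $\rho = 0.25$ both rare links enter training; with $\rho = 0.125$ only the rarest (`f -> g`) does and `e -> f` ends up in the test set.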

## 6. Hyperparameters (defaults used here)

* Rare quantile $q=0.20$.
* Baseline sweep band $w=0.05$ and step $\Delta=0.01$, clamped to $[0.01, 0.50]$.
* Number of Random repeats $T$ for baseline = number of experiments (can be set higher for a more stable baseline).
* Directed logic; optional reachability check in the Random baseline (adaptive: off for $n\ge 30$, on for smaller $n$).

## 7. Outputs & Logging

* **Per-web baseline CSV:** `Foodweb, TrainRatio, RareQuantile, RandomTrainRareFraction`.
* **Per-ratio summary CSV:** `TrainRatio, NumFoodwebs, MeanRandomRareFraction`.
* **Performance logs (per strategy):** AUC, best threshold, Precision/Recall/F1, $K$, TrainRatio; RareLinks also logs RareFraction.
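The two baseline files could be written along these lines. The column names come from the list above; the function name `write_baseline_csvs`, the in-memory buffers, and the food web labels and numbers in the usage example are placeholders for illustration only, not measured results.

```python
import csv
import io
from collections import defaultdict


def write_baseline_csvs(rows, per_web_file, summary_file):
    """rows: iterable of (foodweb, train_ratio, rare_quantile, rare_fraction)."""
    rows = list(rows)

    per_web = csv.writer(per_web_file)
    per_web.writerow(["Foodweb", "TrainRatio", "RareQuantile",
                      "RandomTrainRareFraction"])
    per_web.writerows(rows)

    # Group the per-web baselines by train ratio for the summary.
    by_ratio = defaultdict(list)
    for _, ratio, _, fraction in rows:
        by_ratio[ratio].append(fraction)

    summary = csv.writer(summary_file)
    summary.writerow(["TrainRatio", "NumFoodwebs", "MeanRandomRareFraction"])
    for ratio in sorted(by_ratio):
        fractions = by_ratio[ratio]
        summary.writerow([ratio, len(fractions), sum(fractions) / len(fractions)])


# Placeholder rows (made-up labels and values, not results).
placeholder_rows = [("web_A", 0.8, 0.2, 0.25), ("web_B", 0.8, 0.2, 0.75)]
per_web_buf, summary_buf = io.StringIO(), io.StringIO()
write_baseline_csvs(placeholder_rows, per_web_buf, summary_buf)
summary_lines = summary_buf.getvalue().splitlines()
```

Passing open file handles instead of `io.StringIO` buffers writes the same content to disk; when opening real files for `csv.writer`, use `newline=""`.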

## 8. Computational Considerations

* Computing rarity scores is $O(|E|)$; sorting for the rare set is $O(|E|\log|E|)$.
* Random baseline adds $T$ train/test splits (linear in $|E|$ with possible rejections due to reachability).
* RareLinks selection is dominated by one sort and linear scans to fill quotas.

## 9. Expected Behaviour & Interpretation

* **Random** provides an unbiased reference both for performance (AUC/F1) and for the typical rare-link fraction in training.
* **Baseline-centered RareLinks** probes how nudging the rare-link share above or below that reference affects prediction. We expect a higher share to improve recall on rare interactions, possibly at the cost of precision on common links.
* Centering the sweep on $\hat{\rho}$ prevents arbitrary ranges and makes cross-web comparisons more principled.

---

### Comparison (updated)

| Feature               | Random (Reference)                            | RareLinks (Baseline-Centered)                              |
| --------------------- | --------------------------------------------- | ---------------------------------------------------------- |
| Selection basis       | Uniform sampling                              | Sorted by rarity; quota $\rho$ from rare set               |
| RareFraction $\rho$   | Emergent (measured)                           | **Set** by sweep **around** measured baseline $\hat{\rho}$ |
| Rare definition       | N/A (but measured using rare set)             | Bottom $q$ by $r_{ij}=\deg(i)+\deg(j)$                     |
| Connectivity handling | Optional path preservation (used in baseline) | Not enforced during selection (as implemented)             |
| Best for              | Baseline metrics & reference rare fraction    | Sensitivity analysis near realistic rare-link prevalence   |
