# Likely activation potential: SPEC

This notebook will implement the *likely co-occurrence* component and likely activation potential from `main.tex`, using the same ontology input and operational choices as `typical_proof.ipynb`.


## Goal
Compute likely activation potential for resources used in each proof `N`, using:
- context `C` (same as in `typical_proof.ipynb`),
- a *salient set* `S` defined by one of two variants (selectable by a flag),
- likely co-occurrence `Φ_{L_c}(r, C, S)` and likely activation `Φ_L(r, C, S)`.

Output one CSV per analysis; filenames must encode key parameters and include a timestamp.


## Inputs / Parameters
- `START_PROPOSITION`, `END_PROPOSITION`: proof range (same semantics as `typical_proof.ipynb`).
- `EPSILON`: weight for `Φ_L = ε * Φ_h + (1 - ε) * Φ_{L_c}` (parallel to `DELTA` in typical).
- `HISTORY_WEIGHTS`: 3-tuple `(α, β, γ)` for direct/hierarchical/mereological histories (same validation rules).
- `TYPE_SELECTION`: boolean, if `True` use relation/operation types in proposition/proof queries; if `False` use direct concepts.
- `S_VARIANT`: enum flag in `{"statement_only", "statement_plus_related_chunks"}` selecting the salient set definition.
- `EXCLUDED_CONCEPT_IRIS`, `EXCLUDED_CONCEPT_IRI_SUBSTRINGS`: same filtering behavior as in `typical_proof.ipynb`.
- Input TTL: use the same selection logic as `typical_proof.ipynb` (latest TTL in `ontologies/`).


## Context `C` (same as `typical_proof.ipynb`)
For each proof `N`:
- `C` includes resources from definitions, postulates, common notions, and propositions up to and including `N`, plus proofs up to and including `N-1`.
- Use the same query families as `typical_proof.ipynb` for direct / hierarchical / mereological histories and Hebbian co-occurrence.
- Apply the same exclusion filters after each query materializes.
- Build `hebb_C` using the same operational definition of “together” as `typical_proof.ipynb`:
  resources co-occur if they are used in the same definition, postulate, common notion, proposition, or proof (via `refers_to` / `contains_concept`).
- Preserve the same ordered-pair caveat used in `typical_proof.ipynb` (queries return `(o1, o2)` pairs).
- Counting convention: keep duplicates (multiset semantics). When combining query outputs, do not deduplicate by resource; retain multiplicities/counts and aggregate by summing counts.
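
The counting convention can be illustrated with a minimal sketch. The `(resource, count)` row shape, the `ex:` resource names, and the helper `aggregate_counts` are assumptions made for illustration; the actual query outputs in `typical_proof.ipynb` may be shaped differently.

```python
from collections import Counter


def aggregate_counts(*query_outputs):
    """Sum counts per resource across query outputs, keeping multiplicities."""
    total = Counter()
    for rows in query_outputs:
        for resource, count in rows:
            # Add the count instead of overwriting it: no deduplication by resource.
            total[resource] += count
    return total


# Hypothetical rows from two query families.
direct = [("ex:line", 2), ("ex:point", 1)]
hierarchical = [("ex:line", 1), ("ex:figure", 3)]

combined = aggregate_counts(direct, hierarchical)
print(combined["ex:line"])  # 3
```

A resource returned by several queries contributes the sum of its counts, which is what multiset semantics requires here.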


## Salient set `S` (two variants)
Let `last_proposition` be proposition `N` (the statement immediately preceding proof `N`).
Assumption: proof `N` is always immediately after proposition `N`.

**Variant 1: `S_VARIANT = "statement_only"`**
- `S` = resources in the *statement* of `last_proposition`.
- Implement via a dedicated SPARQL query that extracts resources from the statement, with `TYPE_SELECTION` applied.
- What "resources from the statement" means will be operationalized by the queries below. It does _not_ mean that the resources ought to occur in the statement _directly_.

**Variant 2: `S_VARIANT = "statement_plus_related_chunks"`**
- Let `S0` be defined operationally as follows: collect statement-resource candidates from the statement-only extraction queries for `last_proposition`, apply `EXCLUDED_CONCEPT_IRIS` and `EXCLUDED_CONCEPT_IRI_SUBSTRINGS`, then deduplicate by IRI. In symbols: `S0 := dedup(filter(candidates_from_statement(last_proposition)))`.
- Let `R` be the set of resources that occur in any chunk (definition, postulate, common notion, proposition, or proof) that precedes proposition `N` and shares at least one resource with `S0`. How "occurs in" is operationalized is specified below, and it may differ between cases.
- Then `S = S0 ∪ R`.
- Implement via a dedicated SPARQL query that
  (a) identifies chunks sharing at least one resource with `S0`, and
  (b) returns all resources from those chunks, with `TYPE_SELECTION` applied.

For each variant, apply the same exclusion filters as in `typical_proof.ipynb`.

The multiset "keep duplicates" rule applies to context/history/hebb aggregations; `S` is always deduplicated (set semantics).
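
As a rough in-memory illustration of Variant 2, the sketch below builds `S = S0 ∪ R` from a mapping of prior chunks to their resources. In the notebook this step is meant to be done through SPARQL queries; the dictionary `prior_chunks`, the `ex:` names, and the helper `build_salient_set` are hypothetical, and the exclusion filters are assumed to have been applied already.

```python
def build_salient_set(s0, prior_chunks):
    """Return S = S0 | R, where R gathers the resources of every prior
    chunk that shares at least one resource with S0."""
    s0 = set(s0)  # set semantics: duplicates in the candidates are dropped
    related = set()
    for chunk_resources in prior_chunks.values():
        if s0 & set(chunk_resources):
            related.update(chunk_resources)
    return s0 | related


# Hypothetical chunks that precede proposition N.
prior_chunks = {
    "definition_1": ["ex:point"],
    "definition_2": ["ex:line", "ex:length"],
    "postulate_1": ["ex:circle", "ex:centre"],
}
s0_candidates = ["ex:line", "ex:line", "ex:triangle"]

salient = build_salient_set(s0_candidates, prior_chunks)
print(sorted(salient))  # ['ex:length', 'ex:line', 'ex:triangle']
```

Only `definition_2` shares a resource with `S0`, so only its resources enter `R`. With an empty `S0` the function returns an empty set, which matches the empty-`S` edge case.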

## Likely co-occurrence (`Φ_{L_c}`)
- Build `hebb_C` from query outputs as ordered pairs `(o1, o2, links)`, aggregated by sum of `links` per ordered pair.
- Define `deg_C^S(r) = sum_{s in S, s != r} (hebb_C(r, s) + hebb_C(s, r))` (symmetrize at use time).
- Define denominator over unique context resources: let `U_C` be the set of unique resource IRIs in context `C` after filtering; then `Z = sum_{u in U_C} deg_C^S(u)`.
- Set `Φ_{L_c}(r, C, S) = deg_C^S(r) / Z` if `Z > 0`, else `0`.
- This is computed only for resources *used in proof `N`* (same as typical).
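
The definitions above can be sketched with plain dictionaries. This is not the signature of `modules.likely_activation.compute_phi_lc`; the representation of `hebb_C` as a dict keyed by ordered pairs and the helper names `deg_s` and `phi_lc` are assumptions for illustration.

```python
def deg_s(r, hebb, salient):
    """deg_C^S(r): symmetrized link count between r and each salient s != r."""
    return sum(
        hebb.get((r, s), 0) + hebb.get((s, r), 0)
        for s in salient
        if s != r
    )


def phi_lc(proof_resources, hebb, salient, context_resources):
    """Normalize deg_C^S(r) by Z, the sum over unique context resources."""
    z = sum(deg_s(u, hebb, salient) for u in set(context_resources))
    return {
        r: (deg_s(r, hebb, salient) / z if z > 0 else 0.0)
        for r in proof_resources
    }


# Hypothetical ordered-pair link counts, already summed per pair.
hebb = {("a", "b"): 2, ("b", "a"): 1, ("a", "c"): 1}
salient = {"b"}
context = ["a", "b", "c"]

result = phi_lc(["a", "c"], hebb, salient, context)
print(result)  # {'a': 1.0, 'c': 0.0}
```

Here `deg_C^S(a) = 2 + 1 = 3`, while `b` and `c` have degree `0` (`b` is excluded from its own sum, and `c` has no link to `b`), so `Z = 3`. An empty salient set or an empty `hebb` gives `Z = 0` and therefore `0` for every resource.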

## Likely activation potential (`Φ_L`)
- Compute historical component `Φ_h(r, C)` using the same direct/hierarchical/mereological histories and weights as typical.
- Combine with likely co-occurrence:
  `Φ_L(r, C, S) = ε * Φ_h(r, C) + (1 - ε) * Φ_{L_c}(r, C, S)`
  where `ε = EPSILON` in `[0, 1]`.
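
A small numeric check of the combination, assuming the two components are already available as floats (the helper name `combine_phi_l` is hypothetical and is not the notebook's `compute_phi_l`):

```python
def combine_phi_l(phi_h, phi_lc, epsilon):
    """Convex combination of the historical and likely co-occurrence components."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError(f"epsilon must be in [0, 1], got {epsilon}.")
    return epsilon * phi_h + (1.0 - epsilon) * phi_lc


value = combine_phi_l(phi_h=0.5, phi_lc=0.25, epsilon=0.5)
print(value)  # 0.375

# With phi_lc = 0 (for example, an empty salient set) only the
# historical term remains, scaled by epsilon.
history_only = combine_phi_l(phi_h=0.5, phi_lc=0.0, epsilon=0.5)
print(history_only)  # 0.25
```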


## Four analyses (S variant × TYPE_SELECTION)
Run all four combinations:
1. `S_VARIANT=statement_only`, `TYPE_SELECTION=False`
2. `S_VARIANT=statement_only`, `TYPE_SELECTION=True`
3. `S_VARIANT=statement_plus_related_chunks`, `TYPE_SELECTION=False`
4. `S_VARIANT=statement_plus_related_chunks`, `TYPE_SELECTION=True`

Execution mode note: this notebook is organized into separate sections so each combination can be run independently.
You can pick and choose which combinations to execute; the list above is a coverage checklist, not a single mandatory run.

Each run should iterate proofs `N` in `[START_PROPOSITION, END_PROPOSITION]`, compute
`Φ_h`, `Φ_{L_c}`, `Φ_L`, and the set of new resources (same definition as in typical).


## Outputs
For each analysis, write one CSV with rows per `(proof, resource)` and mirror the structure of `typical_proof.ipynb`:
- `proof`
- `resource_used_in_proof`
- `number_of_resources_used_in_proof`
- `phi_h`
- `phi_lc` (likely co-occurrence)
- `phi_l` (likely activation)
- `new_resources`
- `number_of_new_resources`

Note: `new_resources` and `number_of_new_resources` are per-proof data and are repeated across all rows of the same proof (as in `typical_proof.ipynb`).

**Filename requirements**
- Must include: `S_VARIANT`, `TYPE_SELECTION`, proof range, `EPSILON`, `HISTORY_WEIGHTS`, and a timestamp.
- Example pattern: `likely_{s_variant}_type-{type_flag}_eps-{epsilon}_w-{history_weights}_p{start}-{end}_{YYYYmmdd-HHMMSS}.csv`

The output directory should mirror `typical_proof.ipynb` (e.g., `output/`).


## SPARQL queries


CASE 1: salient_statement_resources
For salient_statement_resources(last_proposition, type_selection=False) use the following SPARQL queries:
- queries.direct_template_propositions_proofs(last_proposition)
- queries.hierarchical_template_propositions_proofs(last_proposition) [super-concepts of statement resources are statement resources]
- queries.mereological_template_propositions_proofs(last_proposition) [components of statement resources are statement resources]

For salient_statement_resources(last_proposition, type_selection=True) use the following SPARQL queries:
- queries.direct_template_last_item_types(last_proposition)
- queries.hierarchical_template_propositions_proofs(last_proposition) [super-concepts of statement resources are statement resources]
- queries.mereological_template_propositions_proofs(last_proposition) [components of statement resources are statement resources]

Clarification: for `TYPE_SELECTION=True`, this mixed setup is intentional and correct: direct statement resources are taken at type level, while hierarchical and mereological expansions remain concept-level as listed.

These SPARQL queries return resources and counts, which the notebook then uses for the required calculations.

CASE 2: salient_statement_plus_related_chunks 
For salient_statement_plus_related_chunks(last_proposition, type_selection=False) use the following SPARQL queries:
(a) queries.direct_template_propositions_proofs(last_proposition)
(b) queries.hierarchical_template_propositions_proofs(last_proposition) [super-concepts of statement resources are statement resources]
(c) queries.mereological_template_propositions_proofs(last_proposition) [components of statement resources are statement resources]
(d) queries.find_salient_definitions_postulates_common_notions(resource_iris) [outputs: IRIs of definitions, postulates, common notions]
Note on query (e): in `queries.find_salient_propositions_proofs(resource_iris, proposition)`, `proposition` must be a proposition IRI string (e.g., `<https://www.foom.com/core#proposition_17>`), not a numeric index. The query returns propositions/proofs strictly before that proposition.
(e) queries.find_salient_propositions_proofs(resource_iris, proposition) [outputs: IRIs of propositions, and proofs]
(f) queries.direct_definitions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(g) queries.direct_postulates_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(h) queries.direct_common_notions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(i) queries.hierarchical_definitions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(j) queries.hierarchical_postulates_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(k) queries.hierarchical_common_notions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(l) queries.mereological_definitions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(m) queries.mereological_postulates_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(n) queries.mereological_common_notions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(o) queries.direct_template_propositions_proofs_selected_values(iri_of_salient_resources) [use IRIs from queries (e)]
Contract for (d): it must return the IRIs to be used as `iri_of_salient_resources` VALUES input for queries (f) through (n).
Contract for (e): it must return the IRIs to be used as `iri_of_salient_resources` VALUES input for query (o).

For salient_statement_plus_related_chunks(last_proposition, type_selection=True) use the following SPARQL queries:
(a) queries.direct_template_last_item_types(last_proposition)
(b) queries.hierarchical_template_propositions_proofs(last_proposition) [super-concepts of statement resources are statement resources]
(c) queries.mereological_template_propositions_proofs(last_proposition) [components of statement resources are statement resources]
(d) queries.find_salient_definitions_postulates_common_notions(resource_iris)  [outputs: IRIs of definitions, postulates, common notions]
(e) queries.find_salient_propositions_proofs(resource_iris, proposition) [outputs: IRIs of propositions, and proofs]
(f) queries.direct_definitions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(g) queries.direct_postulates_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(h) queries.direct_common_notions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(i) queries.hierarchical_definitions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(j) queries.hierarchical_postulates_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(k) queries.hierarchical_common_notions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(l) queries.mereological_definitions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(m) queries.mereological_postulates_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(n) queries.mereological_common_notions_selected_values(iri_of_salient_resources) [use IRIs from queries (d)]
(o) queries.direct_template_last_item_types(iri_of_salient_resources) [use IRIs from queries (e)]
Contract for (d): it must return the IRIs to be used as `iri_of_salient_resources` VALUES input for queries (f) through (n). 
Contract for (e): it must return the IRIs to be used as `iri_of_salient_resources` VALUES input for query (o).

Queries (a), (b), and (c) return both the resources that are salient in the last proposition and their counts.
Use resources from (a)-(c) only to construct `resource_iris` for queries (d) and (e) after filtering and deduplication.
Use counts from (f) through (o) only to build chunk/resource multiplicities for downstream context/history/hebb aggregations; do not use any salient-extraction counts as weights in `S`, `deg_C^S`, or `Φ_{L_c}`.
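
The hand-off from (d) and (e) to the `*_selected_values` queries amounts to turning a list of IRIs into a SPARQL `VALUES` block. The sketch below shows one way to render such a block; the helper `values_clause`, the variable name `?chunk`, and the example IRIs are hypothetical, and the real templates in `queries` may build the clause differently.

```python
def values_clause(variable, iris):
    """Render a SPARQL VALUES block for a list of IRIs.

    Duplicates are dropped while preserving first-seen order."""
    unique = list(dict.fromkeys(iris))
    terms = " ".join(f"<{iri}>" for iri in unique)
    return f"VALUES ?{variable} {{ {terms} }}"


# Hypothetical chunk IRIs, as query (d) might return them.
chunk_iris = [
    "https://www.foom.com/core#definition_1",
    "https://www.foom.com/core#definition_1",
    "https://www.foom.com/core#postulate_3",
]

clause = values_clause("chunk", chunk_iris)
print(clause)
```

When the IRI list is empty, the dependent query should be skipped and its output treated as empty instead of sending an empty `VALUES` block.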

`TYPE_SELECTION=True` can narrow chunk retrieval in Variant 2; this is an intended and accepted consequence.
`TYPE_SELECTION` is applied as intended in Variant 2: only some queries are affected by it.

The spec requires that the results of every query are filtered by `EXCLUDED_CONCEPT_IRIS` and `EXCLUDED_CONCEPT_IRI_SUBSTRINGS` immediately after each query result is materialized.
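
A minimal sketch of that filtering step, assuming query results arrive as `(iri, count)` rows. The helper names and the `set_12` and `concept__line` IRIs are illustrative; the actual logic in `modules.exclusion_filters` may differ, for example in how IRIs are normalized.

```python
def is_excluded(iri, excluded_iris, excluded_substrings):
    """True if the IRI matches exactly or contains an excluded substring."""
    return iri in excluded_iris or any(
        fragment in iri for fragment in excluded_substrings
    )


def filter_rows(rows, excluded_iris, excluded_substrings):
    """Drop excluded resources; surviving rows keep their counts."""
    return [
        (iri, count)
        for iri, count in rows
        if not is_excluded(iri, excluded_iris, excluded_substrings)
    ]


excluded_iris = {"https://www.foom.com/core#concept__one"}
excluded_substrings = ["https://www.foom.com/core#set_"]

rows = [
    ("https://www.foom.com/core#concept__one", 4),   # exact match: dropped
    ("https://www.foom.com/core#set_12", 1),         # substring match: dropped
    ("https://www.foom.com/core#concept__line", 2),  # kept
]

kept = filter_rows(rows, excluded_iris, excluded_substrings)
print(kept)  # [('https://www.foom.com/core#concept__line', 2)]
```

Filtering right after materialization keeps excluded resources out of every later aggregation, including `hebb_C`, `U_C`, and `S`.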

NOTE: `main.tex` intentionally leaves "used in C" and "used together" open to different operationalizations, so the asymmetry in `likely_proof.ipynb` is allowed and does not contradict the theory; it only needs to be stated explicitly as a modeling choice.



Note on salient-query counts vs set `S`
In this notebook, `S` is a **set** (membership-only), not a weighted set.

- Counts returned by salient-set extraction queries `(a)(b)(c)` are used only to gather/aggregate candidate resources before filtering and deduplication.
- After exclusions, candidates are deduplicated by IRI to form `S`.
- These counts do **not** weight `deg_C^S` or `Φ_{L_c}`.

Computation remains:
\[
\deg_C^S(r)=\sum_{s\in S,\ s\neq r}\big(\mathrm{hebb}_C(r,s)+\mathrm{hebb}_C(s,r)\big),\quad
\Phi_{L_c}(r,C,S)=\frac{\deg_C^S(r)}{\sum_{u\in U_C}\deg_C^S(u)}
\]
(with `0` when the denominator is `0`; here `U_C` is the set of unique context resource IRIs after filtering).

If weighted salience is desired, that is a different model and requires explicit formula changes.

## Validation / Edge cases
- `S` is a set and may be empty after filtering. If `S = ∅`, define `deg_C^S(r) = 0` for all resources `r`, so `Φ_{L_c}(r, C, S) = 0` for all `r`, and therefore `Φ_L(r, C, S) = EPSILON * Φ_h(r, C)`.
  Operational rule: if salient-resource extraction yields no resources, skip the salient-dependent SPARQL steps (`find_salient_*`, `*_selected_values`) and treat their outputs as empty. Continue the proof iteration normally and still write output rows.

- If `hebb_C` is empty, `Φ_{L_c}` must be 0.

- Ensure `EPSILON ∈ [0,1]` and `HISTORY_WEIGHTS` sum to 1.

- For `N=1`, context contains definitions, postulates, common notions, and proposition 1; there are no prior proofs.


# Likely activation potential: IMPLEMENTATION

Focused implementation of the approved spec using minimal new modules and the existing typical pipeline helpers.


In [None]:
from __future__ import annotations

import datetime as dt
from pathlib import Path

import pandas as pd

from modules import file_utils, rdf_utils
from modules.exclusion_filters import normalize_excluded_iris
from modules.likely_activation import compute_phi_l, compute_phi_lc
from modules.likely_context import build_context_for_proof
from modules.likely_salience import S_VARIANTS, build_salient_set_for_proof
from modules.query_runner import QueryRunner
from modules.typical_activation import compute_new_resources, compute_phi_h
from modules.typical_proof_resources import resources_in_proof


In [None]:
OUTPUT_DIR = Path('output')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

INPUT_TTL = file_utils.latest_file(folder=Path('ontologies'), filename_fragment='ontology_', extension='ttl')
graph = rdf_utils.load_graph(INPUT_TTL)
runner = QueryRunner(graph)

# Parameters
EPSILON = 1/2
HISTORY_WEIGHTS = (6 / 9, 1 / 9, 2 / 9)
START_PROPOSITION = 1
END_PROPOSITION = 48

EXCLUDED_CONCEPT_IRIS = [
    "https://www.foom.com/core#concept__one",
    "https://www.foom.com/core#concept__two",
    "https://www.foom.com/core#concept__three",
    "https://www.foom.com/core#concept__four",
    "https://www.foom.com/core#concept__are_same",
    "https://www.foom.com/core#concept__possible",
    "https://www.foom.com/core#concept__is_half",
    "https://www.foom.com/core#concept___thing",
]

EXCLUDED_CONCEPT_IRI_SUBSTRINGS = [
    "https://www.foom.com/core#set_",
]

def validate_params() -> None:
    if not (0.0 <= EPSILON <= 1.0):
        raise ValueError(f"EPSILON must be in [0, 1], got {EPSILON}.")
    if len(HISTORY_WEIGHTS) != 3:
        raise ValueError(f"HISTORY_WEIGHTS must have length 3, got {len(HISTORY_WEIGHTS)}.")
    if any((w < 0.0 or w > 1.0) for w in HISTORY_WEIGHTS):
        raise ValueError("All HISTORY_WEIGHTS must be in [0, 1].")
    total = sum(HISTORY_WEIGHTS)
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"HISTORY_WEIGHTS must sum to 1, got {total}.")
    if START_PROPOSITION > END_PROPOSITION:
        raise ValueError("START_PROPOSITION must be <= END_PROPOSITION.")
    if any(not isinstance(value, str) for value in EXCLUDED_CONCEPT_IRIS):
        raise ValueError("EXCLUDED_CONCEPT_IRIS must contain only strings.")
    if any(not isinstance(value, str) for value in EXCLUDED_CONCEPT_IRI_SUBSTRINGS):
        raise ValueError("EXCLUDED_CONCEPT_IRI_SUBSTRINGS must contain only strings.")

validate_params()
EXCLUDED_IRIS = normalize_excluded_iris(EXCLUDED_CONCEPT_IRIS)


In [None]:
def run_analysis(s_variant: str, type_selection: bool) -> tuple[pd.DataFrame, Path]:
    if s_variant not in S_VARIANTS:
        raise ValueError(f"Unknown s_variant {s_variant}. Expected one of {sorted(S_VARIANTS)}")

    rows: list[dict[str, object]] = []

    for proof_n in range(START_PROPOSITION, END_PROPOSITION + 1):
        context_resources, family_dfs, hebb_df = build_context_for_proof(
            proof_n,
            runner=runner,
            type_selection=type_selection,
            excluded_iris=EXCLUDED_IRIS,
            excluded_substrings=EXCLUDED_CONCEPT_IRI_SUBSTRINGS,
        )
        proof_resources = resources_in_proof(
            proof_n,
            runner=runner,
            type_selection=type_selection,
            excluded_iris=EXCLUDED_IRIS,
            excluded_substrings=EXCLUDED_CONCEPT_IRI_SUBSTRINGS,
        )
        proof_resources_sorted = sorted(proof_resources)

        salient_set = build_salient_set_for_proof(
            proof_n,
            runner=runner,
            type_selection=type_selection,
            s_variant=s_variant,
            excluded_iris=EXCLUDED_IRIS,
            excluded_substrings=EXCLUDED_CONCEPT_IRI_SUBSTRINGS,
        )

        phi_h_df = compute_phi_h(proof_resources_sorted, family_dfs, HISTORY_WEIGHTS)
        phi_lc_df = compute_phi_lc(
            proof_resources_sorted,
            hebb_df,
            salient_set,
            context_resources,
        )
        phi_l_df = compute_phi_l(proof_resources_sorted, phi_h_df, phi_lc_df, EPSILON)

        new_resources = compute_new_resources(proof_resources_sorted, context_resources)
        proof_count = len(proof_resources_sorted)
        new_count = len(new_resources)

        phi_h_map = dict(zip(phi_h_df["resource_used_in_proof"], phi_h_df["phi_h"]))
        phi_lc_map = dict(zip(phi_lc_df["resource_used_in_proof"], phi_lc_df["phi_lc"]))
        phi_l_map = dict(zip(phi_l_df["resource_used_in_proof"], phi_l_df["phi_l"]))

        for resource in proof_resources_sorted:
            rows.append({
                "proof": proof_n,
                "resource_used_in_proof": resource,
                "number_of_resources_used_in_proof": proof_count,
                "phi_h": float(phi_h_map.get(resource, 0.0)),
                "phi_lc": float(phi_lc_map.get(resource, 0.0)),
                "phi_l": float(phi_l_map.get(resource, 0.0)),
                "new_resources": new_resources,
                "number_of_new_resources": new_count,
            })

    results_df = pd.DataFrame(
        rows,
        columns=[
            "proof",
            "resource_used_in_proof",
            "number_of_resources_used_in_proof",
            "phi_h",
            "phi_lc",
            "phi_l",
            "new_resources",
            "number_of_new_resources",
        ],
    )

    history_weights_label = "-".join(f"{w:.4f}" for w in HISTORY_WEIGHTS)
    epsilon_label = f"{EPSILON:.4f}"
    timestamp = dt.datetime.now().strftime("%Y%m%d-%H%M%S")
    type_label = str(type_selection).lower()
    output_path = OUTPUT_DIR / (
        f"likely_{s_variant}_type-{type_label}_eps-{epsilon_label}_w-{history_weights_label}_"
        f"p{START_PROPOSITION}-{END_PROPOSITION}_{timestamp}.csv"
    )
    results_df.to_csv(output_path, index=False)

    print(f"[{s_variant} | type={type_label}] proofs={END_PROPOSITION - START_PROPOSITION + 1}, rows={len(results_df)}")
    print(f"Saved: {output_path}")
    return results_df, output_path


In [None]:
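# Set RUN_ALL_COMBINATIONS to False to run only the first entry of SELECTED_RUNS;
# reorder or edit SELECTED_RUNS to choose which combination that is.
RUN_ALL_COMBINATIONS = True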
SELECTED_RUNS = [
    ("statement_only", False),
    ("statement_only", True),
    ("statement_plus_related_chunks", False),
    ("statement_plus_related_chunks", True),
]

runs = SELECTED_RUNS if RUN_ALL_COMBINATIONS else SELECTED_RUNS[:1]
all_results: dict[tuple[str, bool], tuple[pd.DataFrame, Path]] = {}
for s_variant, type_selection in runs:
    all_results[(s_variant, type_selection)] = run_analysis(s_variant, type_selection)

all_results
