# Evolver Loop 74 Analysis

Goal: assess whether **A/B ordering asymmetry** exists in the *full* dataset, and how much it could affect an **order-dependent applicability-domain (AD)** layer (kNN distance + fallback).

Motivation: the evaluator flagged that `_dist_features_full(..., flip=False)` is order-dependent, whereas the base model's predictions are flip-invariant.
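To make the concern concrete, here is a minimal toy sketch of how a nearest-neighbour distance computed on concatenated A/B features can change when the same mixture is written in the opposite orientation. It is not the actual `_dist_features_full` implementation: the helper names, the 2-D descriptors, and the assumption that the B fraction lies in [0, 1] are all hypothetical and for illustration only.

```python
import numpy as np

def pair_features(feat_a, feat_b, frac_b):
    """Concatenate per-solvent features and the B fraction (order-dependent)."""
    return np.concatenate([feat_a, feat_b, [frac_b]])

def nn_distance(query, train):
    """Euclidean distance from the query to its nearest training row."""
    return float(np.min(np.linalg.norm(train - query, axis=1)))

def nn_distance_flip_invariant(feat_a, feat_b, frac_b, train):
    """Minimum over both orientations, so swapping A and B cannot change it."""
    d_fwd = nn_distance(pair_features(feat_a, feat_b, frac_b), train)
    d_rev = nn_distance(pair_features(feat_b, feat_a, 1.0 - frac_b), train)
    return min(d_fwd, d_rev)

# Hypothetical 2-D descriptors for two solvents.
x = np.array([0.0, 1.0])
y = np.array([3.0, 4.0])

# The training set holds the mixture in one orientation only: (x, y, 30% B).
train = np.array([pair_features(x, y, 0.3)])

d_same = nn_distance(pair_features(x, y, 0.3), train)     # same orientation
d_swapped = nn_distance(pair_features(y, x, 0.7), train)  # same mixture, swapped
d_fixed = nn_distance_flip_invariant(y, x, 0.7, train)    # checks both orientations
print(d_same, d_swapped, d_fixed)
```

In this toy case the swapped query looks far from the training data even though it describes the same mixture. Taking the minimum over both orientations is one possible remedy; canonicalising the order before computing features is another.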

In [None]:
import numpy as np
import pandas as pd
DATA_PATH='/home/data'
full=pd.read_csv(f'{DATA_PATH}/catechol_full_data_yields.csv')
full.head(), full.shape

In [None]:
# Basic stats: how often A/B is already in canonical (lexicographic) order?
a=full['SOLVENT A NAME'].astype(str)
b=full['SOLVENT B NAME'].astype(str)
canon_a=np.where(a<b,a,b)
canon_b=np.where(a<b,b,a)
is_canon=(a==canon_a) & (b==canon_b)
print('Rows:',len(full))
print('Fraction already in canonical order:',is_canon.mean())
print('Fraction in non-canonical order:',(~is_canon).mean())

# How many unique ordered pairs vs canonical pairs?
ordered_pairs=pd.Series(list(zip(a,b))).nunique()
canon_pairs=pd.Series(list(zip(canon_a,canon_b))).nunique()
print('Unique ordered pairs:',ordered_pairs)
print('Unique canonical pairs:',canon_pairs)
print('Avg orientations per canonical pair:',ordered_pairs/canon_pairs)

In [None]:
# For each canonical pair, how many rows appear in both orientations?
df=full[['SOLVENT A NAME','SOLVENT B NAME','SolventB%']].copy()
df['A']=a; df['B']=b
# canonical key
df['key']=np.where(a<b, a+'||'+b, b+'||'+a)
# ties (A == B) count as canonical, matching is_canon above
df['orientation']=np.where(a<=b,'A<B','A>B')

orient_counts=df.pivot_table(index='key', columns='orientation', values='SolventB%', aggfunc='size', fill_value=0)
orient_counts['both_orientations']=(orient_counts.get('A<B',0)>0) & (orient_counts.get('A>B',0)>0)
print('Canonical pairs:',len(orient_counts))
print('Fraction of pairs appearing in both orientations:',orient_counts['both_orientations'].mean())
print('Fraction of pairs in only one orientation:',(~orient_counts['both_orientations']).mean())

# Distribution of imbalance
orient_counts['imbalance']=np.abs(orient_counts.get('A<B',0)-orient_counts.get('A>B',0)) / (orient_counts.get('A<B',0)+orient_counts.get('A>B',0))
print('Imbalance quantiles:',orient_counts['imbalance'].quantile([0,0.25,0.5,0.75,0.9,0.95,1.0]).to_dict())

In [None]:
# Check SolventB% symmetry: if swapped orientation exists, does pct transform as 1-pct approximately?
# We'll compare distributions of pct for both orientations within the same canonical pair.

df2=df.copy()
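# make canonical pct: if A>B, pct should become 1-pct for canonical representation
# NOTE: this assumes SolventB% is stored as a fraction in [0, 1]; if it is a percentage, use 100-pct instead
pct=df2['SolventB%'].astype(float)
df2['pct_canon']=np.where(df2['orientation']=='A<B', pct, 1.0-pct)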

# summarize pct ranges per canonical pair
stats=df2.groupby('key')['pct_canon'].agg(['min','max','nunique','count'])
print('pct_canon min/max quantiles:')
print(stats[['min','max']].quantile([0,0.25,0.5,0.75,0.9,0.95,1.0]))

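# how many pairs have both orientations with overlapping canonical pct ranges?
# aggregate per (key, orientation) instead of a row-level self-merge on 'key', which grows quadratically per pair
rng=df2.groupby(['key','orientation'])['pct_canon'].agg(['min','max'])
lo=rng['min'].unstack('orientation').reindex(columns=['A<B','A>B'])
hi=rng['max'].unstack('orientation').reindex(columns=['A<B','A>B'])
both=lo.notna().all(axis=1)
# two intervals overlap when the larger lower bound does not exceed the smaller upper bound
overlap=both & (lo.max(axis=1)<=hi.min(axis=1))
print('Pairs with both orientations:',int(both.sum()))
print('Pairs with overlapping pct_canon ranges:',int(overlap.sum()))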
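
If the asymmetry turns out to matter, one possible mitigation is to canonicalise every row before computing AD features. The sketch below is only an illustration under stated assumptions: `canonicalize_pairs` is a hypothetical helper (not part of the existing pipeline), the toy solvent names are made up, and `SolventB%` is assumed to be a fraction in [0, 1].

```python
import numpy as np
import pandas as pd

def canonicalize_pairs(frame, col_a="SOLVENT A NAME", col_b="SOLVENT B NAME",
                       col_pct="SolventB%"):
    """Return a copy with A/B in lexicographic order.

    Rows where A > B are swapped and their B fraction becomes 1 - pct.
    Assumes the B fraction is stored in [0, 1] (an assumption, not verified here).
    """
    out = frame.copy()
    a = out[col_a].astype(str)
    b = out[col_b].astype(str)
    swap = (a > b).to_numpy()
    out[col_a] = np.where(swap, b, a)
    out[col_b] = np.where(swap, a, b)
    pct = out[col_pct].astype(float)
    out[col_pct] = np.where(swap, 1.0 - pct, pct)
    out["swapped"] = swap  # keep track of which rows were reoriented
    return out

# Hypothetical toy rows: the first two describe the same mixture in opposite orientations.
toy = pd.DataFrame({
    "SOLVENT A NAME": ["Water", "Ethanol", "Water"],
    "SOLVENT B NAME": ["Ethanol", "Water", "Water"],
    "SolventB%": [0.25, 0.75, 0.0],
})
canon = canonicalize_pairs(toy)
print(canon)
```

After canonicalisation the first two toy rows coincide, so any distance feature built on them no longer depends on how the pair was originally written. Rows with identical A and B are left untouched.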