Summary of the annotated tests

Paired samples: 150 pairs for each value of lambda_max (0.1, 0.2, 0.4).

Medians: Levina-Bickel m_hat median = 7.940; spectral d_s median = 0.212 (λ_max = 0.1), 0.299 (λ_max = 0.2), 0.518 (λ_max = 0.4).

Mean difference (Levina − spectral) ≈ 7.73 (λ_max = 0.1), 7.64 (λ_max = 0.2), 7.43 (λ_max = 0.4); standard deviation of the differences ≈ 0.053–0.055.

Statistical tests: Wilcoxon signed-rank p ≈ 2.30e-26 for each λ_max (highly significant). The value is identical across the three settings because all 150 differences have the same sign, so the test returns the smallest p-value its normal approximation can give for n = 150. Paired t-test p is essentially 0 (on the order of 1e-319, down to 0 in floating point), so the difference is statistically significant by a very wide margin.

Correlations: Spearman rho ≈ 0.11 (p ≈ 0.18) for λ_max = 0.1, −0.036 (p ≈ 0.66) for 0.2, ≈ 0.003 (p ≈ 0.97) for 0.4. There is no robust correlation between m_hat and d_s at any of the three settings.

Effect size (Cohen's d, paired): very large (≈ 135–147), because the standard deviation of the differences (≈ 0.05) is tiny relative to their mean (≈ 7.4–7.7). The value is mathematically correct but uninformative here: it only restates that the differences are ≫ their standard deviation.
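
The statistics listed above can be computed in a few lines. The sketch below is only an illustration on synthetic paired data whose means and spreads loosely mimic the reported values; the arrays m_hat and d_s are generated here (an assumption) and are not the actual 150 pairs.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins (assumption): NOT the real paired estimates.
rng = np.random.default_rng(0)
n_pairs = 150
m_hat = 7.94 + 0.05 * rng.standard_normal(n_pairs)   # Levina-Bickel-like values
d_s = 0.30 + 0.02 * rng.standard_normal(n_pairs)     # spectral-slope-like values

diff = m_hat - d_s

# Paired tests on the same subsamples
w_stat, w_p = stats.wilcoxon(m_hat, d_s)
t_stat, t_p = stats.ttest_rel(m_hat, d_s)

# Rank correlation between the two estimators
rho, rho_p = stats.spearmanr(m_hat, d_s)

# Paired Cohen's d: mean difference divided by the SD of the differences
cohen_d = diff.mean() / diff.std(ddof=1)

print(f"Wilcoxon p={w_p:.3g}, paired t p={t_p:.3g}")
print(f"Spearman rho={rho:.3f} (p={rho_p:.3g}), Cohen d={cohen_d:.1f}")
```

With a mean difference this far from zero, the two tests say little beyond "the scales differ"; the Spearman correlation is the more informative number for deciding whether the estimators track each other.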

Concise interpretation and implication
The two estimators do not return values that are comparable on an absolute scale: Levina-Bickel returns a local dimension ≈ 8, whereas the spectral slope, over the chosen small-λ range, returns d_s ≈ 0.2–0.3 (λ_max = 0.1, 0.2) or ≈ 0.5 (λ_max = 0.4).

The tests show a statistically robust difference between the methods, and the weak or absent correlation indicates that the two estimates do not vary together in any consistent way across the same subsamples.

Practical conclusion: you can state that "Levina-Bickel and the spectral slope (with these λ_max) give estimates that are incompatible in absolute value". You then need to specify which definition of "dimension" you are reporting.

Python cell: linearity and residual diagnostics for the log-log fits (selected subsamples)

In [18]:
# Cell: Linear regression diagnostics on log N(lambda) vs log lambda for selected bootstrap subsamples
import os
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from scipy import sparse
from scipy.sparse.linalg import eigsh
from scipy import stats
import matplotlib.pyplot as plt

# Parameters (modifiable)
csv_path = 'data/sunspots_raw/Sunspots.csv'
value_col_candidates = ['Number', 'Total Sunspot', 'Total Sunspot Number', 'Monthly Mean']
embedding_dim = 10
tau = 1
k_neighbors = 10
n_eig = 400                 # compute more eigenvalues to inspect small-to-mid range
subsample_frac = 0.6
rng_seed = 42
n_diagnostics = 10          # number of bootstrap samples to inspect (choose 10 representative)
lambda_max_diag = 0.2       # primary lambda_max used for the diagnostics (adjustable)
min_points_for_fit = 6
out_dir = 'results/spectral_diagnostics_linearity'
os.makedirs(out_dir, exist_ok=True)

# Utilities
def takens_embed(x, dim, tau):
    m = len(x) - (dim - 1) * tau
    if m <= 0:
        return None
    embed = np.empty((m, dim))
    for i in range(dim):
        embed[:, i] = x[i * tau : i * tau + m]
    return embed

def build_laplacian_eigs(X_points, k_neighbors, n_eig):
    n_nodes_local = X_points.shape[0]
    nbrs = NearestNeighbors(n_neighbors=min(k_neighbors + 1, n_nodes_local), algorithm='auto').fit(X_points)
    distances, indices = nbrs.kneighbors(X_points)
    adj = sparse.lil_matrix((n_nodes_local, n_nodes_local), dtype=np.float32)
    for i in range(n_nodes_local):
        for j in indices[i, 1:]:
            adj[i, j] = 1.0
            adj[j, i] = 1.0
    adj = adj.tocsr()
    deg = np.array(adj.sum(axis=1)).flatten()
    deg[deg == 0] = 1.0
    D_inv_sqrt = sparse.diags(1.0 / np.sqrt(deg))
    I = sparse.identity(n_nodes_local, format='csr')
    L_norm = I - D_inv_sqrt @ adj @ D_inv_sqrt
    n_eig_local = min(n_eig, n_nodes_local - 1)
    try:
        eigvals, _ = eigsh(L_norm, k=n_eig_local, which='SM', tol=1e-6, maxiter=5000)
    except Exception:
        try:
            from scipy.linalg import eigh
            Ld = L_norm.toarray()
            eigvals_all = eigh(Ld, eigvals_only=True)
            eigvals = np.sort(eigvals_all)[:n_eig_local]
        except Exception as e:
            print("Eigen decomposition failed:", e)
            return None
    return np.sort(eigvals)

def spectral_counting(eigvals):
    eps = 1e-12
    lams = eigvals[eigvals > eps]
    lam_vals = np.unique(lams)
    N_vals = np.array([np.searchsorted(lams, lam, side='right') for lam in lam_vals])
    return lam_vals, N_vals

def linreg_diagnostics(lam_fit, N_fit):
    # linear regression on log-log
    x = np.log(lam_fit)
    y = np.log(N_fit)
    n = len(x)
    slope, intercept, r_value, p_value, stderr = stats.linregress(x, y)
    y_pred = intercept + slope * x
    resid = y - y_pred
    mse = np.sum(resid**2) / max(n - 2, 1)
    ss_tot = np.sum((y - np.mean(y))**2)
    r2 = 1.0 - np.sum(resid**2) / ss_tot if ss_tot > 0 else np.nan
    # leverage for simple linear regression
    xbar = np.mean(x)
    Sxx = np.sum((x - xbar)**2)
    h = np.repeat(1.0/n, n)
    if Sxx > 0:
        h = 1.0/n + ((x - xbar)**2) / Sxx
    # standardized residuals
    with np.errstate(divide='ignore', invalid='ignore'):
        std_resid = resid / np.sqrt(mse * (1 - h))
    # Cook's distance (standard closed form for OLS with p fitted parameters)
    p = 2  # intercept + slope
    cooks = (resid**2) / (p * mse) * (h / (1 - h)**2)
    return dict(
        slope=slope, intercept=intercept, r_value=r_value, p_value=p_value, stderr=stderr,
        r2=r2, mse=mse, x=x, y=y, y_pred=y_pred, resid=resid, std_resid=std_resid, h=h, cooks=cooks
    )

# Load series and embedding
df0 = pd.read_csv(csv_path)
col = next((c for c in value_col_candidates if c in df0.columns), None)
if col is None:
    numeric_cols = df0.select_dtypes(include=[np.number]).columns.tolist()
    if not numeric_cols:
        raise RuntimeError("No numeric column found in CSV.")
    col = numeric_cols[-1]
series = pd.to_numeric(df0[col], errors='coerce').dropna().values
X_full = takens_embed(series, embedding_dim, tau)
if X_full is None:
    raise RuntimeError("Embedding too short for given embedding_dim/tau.")
n_nodes = X_full.shape[0]

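# Select diagnostic subsample indices reproducibly (seeded RNG);
# cap the size at n_nodes so that sampling without replacement cannot fail
rng = np.random.default_rng(rng_seed)
sub_size = min(n_nodes, max(120, int(np.floor(subsample_frac * n_nodes))))
indices_list = [rng.choice(np.arange(n_nodes), size=sub_size, replace=False)
                for _ in range(n_diagnostics)]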

summary_rows = []
for i, idx in enumerate(indices_list, start=1):
    X_sub = X_full[idx, :]
    eigvals = build_laplacian_eigs(X_sub, k_neighbors, n_eig)
    if eigvals is None:
        print(f"diag {i}: eig failed; skipping")
        continue
    lam_vals, N_vals = spectral_counting(eigvals)
    mask = lam_vals <= lambda_max_diag
    lam_fit = lam_vals[mask]
    N_fit = N_vals[mask]
    n_points = len(lam_fit)
    result = None
    if n_points >= min_points_for_fit:
        result = linreg_diagnostics(lam_fit, N_fit)
    # Save raw counting and eigvals
    pd.DataFrame({'lambda': lam_vals, 'N_lambda': N_vals}).to_csv(f"{out_dir}/diag_{i:02d}_counting.csv", index=False)
    pd.DataFrame({'eig_index': np.arange(1, len(eigvals)+1), 'eigval': eigvals}).to_csv(f"{out_dir}/diag_{i:02d}_eigvals.csv", index=False)
    # Plot log-log with fit and diagnostic panels
    plt.figure(figsize=(10,4))
    ax1 = plt.subplot2grid((1,3), (0,0), colspan=2)
    ax2 = plt.subplot2grid((1,3), (0,2))
    # left: log-log + fit
    ax1.loglog(lam_vals, N_vals, 'o', markersize=4, alpha=0.6, label='N(lambda)')
    if result is not None:
        # plot fitted line over lam_fit
        lam_line = np.linspace(lam_fit.min(), lam_fit.max(), 200)
        ax1.loglog(lam_line, np.exp(result['intercept']) * lam_line**(result['slope']), '-', color='C1', lw=1.5,
                   label=f'fit (<= {lambda_max_diag}) slope={result["slope"]:.3f} (stderr={result["stderr"]:.3f}) R2={result["r2"]:.3f}')
    ax1.set_xlabel('lambda (eigenvalue)')
    ax1.set_ylabel('N(lambda)')
    ax1.set_title(f'diag {i}: log-log N(lambda) (n_points_fit={n_points})')
    ax1.legend(fontsize=8)
    ax1.grid(alpha=0.3, which='both')
    # right: residuals vs fitted + Cook's high points
    if result is not None:
        ax2.plot(result['y_pred'], result['resid'], 'o', ms=5, alpha=0.7)
        ax2.axhline(0, color='gray', lw=1)
        ax2.set_xlabel('fitted log N')
        ax2.set_ylabel('residuals')
        ax2.set_title('residuals vs fitted')
        # highlight potential influential points (Cook > 4/n)
        cooks = result['cooks']
        thresh = 4.0 / len(cooks) if len(cooks)>0 else 0
        infl_idx = np.where(cooks > thresh)[0]
        for ii in infl_idx:
            ax2.annotate(f"{ii+1}", (result['y_pred'][ii], result['resid'][ii]), fontsize=7, color='red')
        # inset: qqplot of standardized residuals
        plt.sca(ax2)
        ax_in = ax2.inset_axes([0.05, -0.65, 0.9, 0.6])
        stats.probplot(result['std_resid'], dist="norm", plot=ax_in)
        ax_in.set_title('QQ std resid', fontsize=7)
    else:
        ax2.text(0.1, 0.5, f"Insufficient fit points\n(n_points={n_points})", transform=ax2.transAxes)
    plt.tight_layout()
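    # bbox_inches='tight' keeps the QQ inset (drawn below ax2) inside the saved image
    plt.savefig(f"{out_dir}/diag_{i:02d}_diagnostic.png", dpi=150, bbox_inches='tight')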
    plt.close()
    # collect summary
    summary_rows.append({
        'diag': i,
        'n_nodes_sub': int(X_sub.shape[0]),
        'n_eig_computed': int(len(eigvals)),
        'n_lambda_total': int(len(lam_vals)),
        'n_points_fit': int(n_points),
        'fit_ok': bool(result is not None),
        'slope': float(result['slope']) if result is not None else np.nan,
        'stderr_slope': float(result['stderr']) if result is not None else np.nan,
        'r2': float(result['r2']) if result is not None else np.nan,
        'mse': float(result['mse']) if result is not None else np.nan,
        'max_cook': float(np.nanmax(result['cooks'])) if result is not None else np.nan,
        'n_influential': int(np.sum(result['cooks'] > (4.0 / max(1, len(result['cooks']))))) if result is not None else 0
    })
    print(f"Saved diag {i}: n_points_fit={n_points}, fit_ok={result is not None}")

# Save diagnostics summary
pd.DataFrame(summary_rows).to_csv(f"{out_dir}/linearity_diagnostics_summary.csv", index=False)
print("Diagnostics saved to", out_dir)


Saved diag 1: n_points_fit=17, fit_ok=True
Saved diag 2: n_points_fit=17, fit_ok=True
Saved diag 3: n_points_fit=16, fit_ok=True
Saved diag 4: n_points_fit=17, fit_ok=True
Saved diag 5: n_points_fit=17, fit_ok=True
Saved diag 6: n_points_fit=17, fit_ok=True
Saved diag 7: n_points_fit=17, fit_ok=True
Saved diag 8: n_points_fit=18, fit_ok=True
Saved diag 9: n_points_fit=16, fit_ok=True
Saved diag 10: n_points_fit=16, fit_ok=True
Diagnostics saved to results/spectral_diagnostics_linearity


Key observations (from linearity_diagnostics_summary.csv)
Fits retained: 10 diagnostics, n_points_fit = 16–18, so the range λ ≤ 0.2 provides enough points for the log-log regression.

Slopes for λ_max = 0.2: ≈ 0.140–0.154, so d_s = 2*slope ≈ 0.28–0.31 for these subsamples.

Standard error of the slopes: ~0.027–0.029, i.e. roughly one fifth of the slope, so the slope is estimated reasonably precisely.

R² of the fits: ~0.64–0.66, so a straight line explains the log N vs log λ relation moderately well over the chosen range.

Influence diagnostics: max_cook ≫ 1 (≈ 100+) and n_influential = 1 for every diag, so each fit has at least one highly influential point. The cell uses the standard Cook's distance formula, so these values are not an artifact of an approximation. With n = 16–18 points, a Cook's distance of this size requires a leverage h close to 1, i.e. a point whose log λ lies far from all the others.

MSE and residuals: MSE ~0.23–0.25 (in log space), so the residuals are non-negligible but not catastrophic.
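
The relation d_s = 2*slope used above rests on assuming a Weyl-type scaling N(λ) ∝ λ^(d_s/2) at small λ, so that log N = (d_s/2) log λ + const. As a sanity check of that convention, the sketch below builds a synthetic spectrum with a known exponent and recovers it with the same log-log regression. The eigenvalues are generated, not taken from the sunspot graph, and d_true is an arbitrary illustrative value.

```python
import numpy as np
from scipy import stats

d_true = 1.2          # hypothetical spectral dimension for the synthetic test
K = 200               # number of synthetic eigenvalues

# Eigenvalues chosen so that the k-th one satisfies N(lam_k) = k = K * lam_k**(d_true/2)
k = np.arange(1, K + 1)
lam = (k / K) ** (2.0 / d_true)

# Counting function: number of eigenvalues <= lam
N = np.searchsorted(lam, lam, side='right')

# Fit on the small-lambda range only, as in the diagnostics cell
mask = lam <= 0.2
fit = stats.linregress(np.log(lam[mask]), np.log(N[mask]))
d_s_est = 2.0 * fit.slope

print(f"slope={fit.slope:.3f}, d_s estimate={d_s_est:.3f} (true {d_true})")
```

On a clean power law the regression returns the exponent; any gap between this ideal case and the sunspot fits comes from the data departing from a single power law over λ ≤ 0.2.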

Quick interpretation
The log-log regressions on λ ≤ 0.2 are reasonably linear for these subsamples (R² ~0.65); the slopes are stable across diagnostics.

Having 1 "influential" point in every fit suggests that the slope is pulled by a few small eigenvalues / counting points, which may explain the small absolute scale of d_s compared with m_hat (Levina).

In short: over the chosen λ range, the spectral method gives an estimate that is consistent across subsamples, with a small stderr, but on a very different scale from Levina. The discrepancy does not come from a very poor fit but from what each method measures.
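
To see how a single counting point can dominate such a fit: in a simple linear regression the leverage is h_i = 1/n + (x_i − x̄)²/S_xx, so one log λ far from the rest gets h close to 1 and can carry a very large Cook's distance. The sketch below illustrates this on synthetic data: one tiny eigenvalue (1e-8, an arbitrary choice) plus 16 values in [0.05, 0.2]. Whether the real influential points look like this is an assumption to check against the saved counting CSVs.

```python
import numpy as np

def leverage_and_cooks(x, y):
    """Slope, leverage and Cook's distance for a simple OLS fit y ~ a + b*x."""
    n, p = len(x), 2
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    mse = np.sum(resid**2) / (n - p)
    h = 1.0 / n + (x - x.mean())**2 / np.sum((x - x.mean())**2)
    cooks = resid**2 / (p * mse) * h / (1.0 - h)**2
    return slope, h, cooks

# Synthetic counting data (assumption): one isolated tiny eigenvalue + a bulk
lam = np.concatenate(([1e-8], np.linspace(0.05, 0.2, 16)))
N = np.arange(1, len(lam) + 1)          # N(lambda) = rank of each eigenvalue
x, y = np.log(lam), np.log(N)

slope_all, h, cooks = leverage_and_cooks(x, y)
slope_bulk, _, _ = leverage_and_cooks(x[1:], y[1:])

print(f"leverage of isolated point: {h[0]:.3f}")
print(f"Cook's distance of isolated point: {cooks[0]:.1f} (threshold {4/len(x):.3f})")
print(f"slope with all points: {slope_all:.3f}, without isolated point: {slope_bulk:.3f}")
```

If the real influential point turns out to be the smallest positive eigenvalue, this is the mechanism to suspect: the fit is pinned by that one point rather than by the bulk of the counting curve.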

Python cell: influence diagnostics, refit without influential points, and robust regressions (Theil-Sen + RANSAC)

In [20]:
# Cell: Identify influential points (Cook threshold), refit without them, compare with Theil-Sen and RANSAC
import os
import numpy as np
import pandas as pd
from sklearn.linear_model import TheilSenRegressor, RANSACRegressor, LinearRegression
from sklearn.metrics import mean_squared_error
from scipy import stats
import matplotlib.pyplot as plt

# Parameters (adjust as needed)
in_dir = 'results/spectral_diagnostics_linearity'   # folder holding the previous diagnostics
out_dir = 'results/spectral_influence_refit'
lambda_max_diag = 0.2   # same range as the previous diagnostics
min_points_for_fit = 6
os.makedirs(out_dir, exist_ok=True)

# Load the CSVs produced earlier (diag_XX_counting.csv files; the diag_XX_eigvals.csv files are not read here)
diag_files = sorted([f for f in os.listdir(in_dir) if f.endswith('_counting.csv')])
if not diag_files:
    raise RuntimeError(f"No counting CSVs found in {in_dir}")

summary_rows = []

def fit_linear_on_log(lam, N):
    x = np.log(lam).reshape(-1, 1)
    y = np.log(N)
    lr = LinearRegression()
    lr.fit(x, y)
    y_pred = lr.predict(x)
    resid = y - y_pred
    n = len(x)
    p = 2
    mse = np.sum(resid**2) / max(n - p, 1)
    # standard error for slope
    se_slope = np.sqrt(mse / np.sum((x.flatten() - x.mean())**2)) if np.sum((x.flatten() - x.mean())**2) > 0 else np.nan
    # R^2
    ss_tot = np.sum((y - y.mean())**2)
    r2 = 1.0 - np.sum(resid**2) / ss_tot if ss_tot > 0 else np.nan
    return {
        'model': lr,
        'slope': float(lr.coef_[0]),
        'intercept': float(lr.intercept_),
        'y_pred': y_pred,
        'resid': resid,
        'mse': mse,
        'se_slope': se_slope,
        'r2': r2,
        'x': x.flatten(),
        'y': y
    }

def compute_leverage_and_cooks(x, resid, mse):
    # x: 1D log-lam values
    n = len(x)
    xbar = np.mean(x)
    Sxx = np.sum((x - xbar)**2)
    if Sxx <= 0:
        h = np.repeat(1.0/n, n)
    else:
        h = 1.0/n + ((x - xbar)**2) / Sxx
    with np.errstate(divide='ignore', invalid='ignore'):
        std_resid = resid / np.sqrt(mse * (1 - h))
    p = 2
    with np.errstate(divide='ignore', invalid='ignore'):
        cooks = (resid**2) / (p * mse) * (h / (1 - h)**2)
    return h, std_resid, cooks

def fit_theilsen(lam, N):
    x = np.log(lam).reshape(-1, 1)
    y = np.log(N)
    if len(x) < 3:
        return None
    ts = TheilSenRegressor(random_state=0)
    ts.fit(x, y)
    y_pred = ts.predict(x)
    resid = y - y_pred
    mse = mean_squared_error(y, y_pred)
    return {'model': ts, 'slope': float(ts.coef_[0]), 'intercept': float(ts.intercept_),
            'y_pred': y_pred, 'resid': resid, 'mse': mse}

def fit_ransac(lam, N):
    x = np.log(lam).reshape(-1, 1)
    y = np.log(N)
    if len(x) < 3:
        return None
    base = LinearRegression()
    ransac = RANSACRegressor(estimator=base, random_state=0)
    try:
        ransac.fit(x, y)
    except Exception:
        return None
    # Try to extract slope/intercept robustly
    slope = np.nan
    intercept = np.nan
    try:
        # estimator_ exists and should be a fitted LinearRegression
        est = getattr(ransac, 'estimator_', None)
        if est is not None and hasattr(est, 'coef_'):
            slope = float(est.coef_[0])
            intercept = float(est.intercept_)
        else:
            # fallback: predict two points to infer slope/intercept
            xp = np.array([[x.min()], [x.max()]])
            yp = ransac.predict(xp)
            slope = float((yp[1] - yp[0]) / (xp[1,0] - xp[0,0]))
            intercept = float(yp[0] - slope * xp[0,0])
    except Exception:
        slope = np.nan
        intercept = np.nan
    y_pred = ransac.predict(x)
    resid = y - y_pred
    mse = mean_squared_error(y, y_pred)
    return {'model': ransac, 'slope': slope, 'intercept': intercept,
            'y_pred': y_pred, 'resid': resid, 'mse': mse}

# Process each diagnostic file
for fname in diag_files:
    diag_tag = fname.replace('_counting.csv', '').replace('diag_', '')
    csv_path = os.path.join(in_dir, fname)
    df = pd.read_csv(csv_path)
    lam_vals = df['lambda'].values
    N_vals = df['N_lambda'].values
    # select <= lambda_max_diag
    mask = lam_vals <= lambda_max_diag
    lam_fit = lam_vals[mask]
    N_fit = N_vals[mask]
    n_points = len(lam_fit)
    if n_points < min_points_for_fit:
        summary_rows.append({
            'diag': diag_tag,
            'n_points_fit': n_points,
            'fit_ok': False
        })
        print(f"{diag_tag}: insufficient points ({n_points})")
        continue

    # Original OLS on log-log
    res_orig = fit_linear_on_log(lam_fit, N_fit)
    h, std_resid, cooks = compute_leverage_and_cooks(res_orig['x'], res_orig['resid'], res_orig['mse'])
    # influential threshold: Cook > 4/n (common rule of thumb)
    cook_thresh = 4.0 / max(1, n_points)
    infl_idx = np.where(cooks > cook_thresh)[0]
    n_infl = int(len(infl_idx))

    # Refit excluding influential points (if any)
    if n_infl > 0:
        lam_noinfl = np.delete(lam_fit, infl_idx)
        N_noinfl = np.delete(N_fit, infl_idx)
        res_noinfl = fit_linear_on_log(lam_noinfl, N_noinfl) if len(lam_noinfl) >= min_points_for_fit else None
    else:
        res_noinfl = None

    # Robust fits
    res_ts = fit_theilsen(lam_fit, N_fit)
    res_ransac = fit_ransac(lam_fit, N_fit)

    # Save a comparison plot (log-log)
    plt.figure(figsize=(7,4))
    ax = plt.gca()
    ax.loglog(lam_fit, N_fit, 'o', ms=4, alpha=0.6, label='N(lambda) data')
    xline = np.linspace(lam_fit.min(), lam_fit.max(), 200)

    # OLS
    y_ols = np.exp(res_orig['intercept']) * xline**(res_orig['slope'])
    ax.loglog(xline, y_ols, '-', color='C1', lw=1.5, label=f'OLS slope={res_orig["slope"]:.3f}')
    # OLS without influents
    if res_noinfl is not None:
        y_noinfl = np.exp(res_noinfl['intercept']) * xline**(res_noinfl['slope'])
        ax.loglog(xline, y_noinfl, '--', color='C2', lw=1.5, label=f'OLS w/o influ slope={res_noinfl["slope"]:.3f}')
    # Theil-Sen
    if res_ts is not None:
        y_ts = np.exp(res_ts['intercept']) * xline**(res_ts['slope'])
        ax.loglog(xline, y_ts, '-.', color='C3', lw=1.5, label=f'Theil-Sen slope={res_ts["slope"]:.3f}')
    # RANSAC
    if res_ransac is not None and not np.isnan(res_ransac['slope']):
        y_ransac = np.exp(res_ransac['intercept']) * xline**(res_ransac['slope'])
        ax.loglog(xline, y_ransac, ':', color='C4', lw=1.5, label=f'RANSAC slope={res_ransac["slope"]:.3f}')

    ax.set_xlabel('lambda (eigenvalue)')
    ax.set_ylabel('N(lambda)')
    ax.set_title(f'{diag_tag} influence/refit (n_points={n_points}, n_infl={n_infl})')
    ax.legend(fontsize=8)
    ax.grid(alpha=0.3, which='both')

    # annotate influential points on plot
    if n_infl > 0:
        for ii in infl_idx:
            lam_pt = lam_fit[ii]
            N_pt = N_fit[ii]
            ax.loglog([lam_pt], [N_pt], 's', color='red', ms=6, label='_nolegend_')
            ax.annotate(str(ii+1), xy=(lam_pt, N_pt), xytext=(5, -5), textcoords='offset points', color='red', fontsize=7)

    plt.tight_layout()
    plt.savefig(os.path.join(out_dir, f'{diag_tag}_influence_refit.png'), dpi=150)
    plt.close()

    # Save table of influential points (if any)
    infl_table = []
    for ii in infl_idx:
        infl_table.append({'diag': diag_tag, 'index_in_fit': int(ii), 'lambda': float(lam_fit[ii]), 'N_lambda': int(N_fit[ii]), 'cook': float(cooks[ii]), 'leverage': float(h[ii]), 'std_resid': float(std_resid[ii])})
    if infl_table:
        pd.DataFrame(infl_table).to_csv(os.path.join(out_dir, f'{diag_tag}_influential_points.csv'), index=False)

    # Collect summary row
    summary_rows.append({
        'diag': diag_tag,
        'n_points_fit': int(n_points),
        'fit_ok': True,
        'slope_orig': float(res_orig['slope']),
        'stderr_orig': float(res_orig['se_slope']),
        'r2_orig': float(res_orig['r2']),
        'n_influential': int(n_infl),
        'slope_noinfl': float(res_noinfl['slope']) if res_noinfl is not None else np.nan,
        'stderr_noinfl': float(res_noinfl['se_slope']) if res_noinfl is not None else np.nan,
        'r2_noinfl': float(res_noinfl['r2']) if res_noinfl is not None else np.nan,
        'slope_theilsen': float(res_ts['slope']) if res_ts is not None else np.nan,
        'slope_ransac': float(res_ransac['slope']) if res_ransac is not None else np.nan,
        'cook_thresh': float(cook_thresh),
        'max_cook': float(np.max(cooks))
    })

    print(f"{diag_tag}: n_points={n_points}, n_infl={n_infl}, slope_orig={res_orig['slope']:.4f}, slope_noinfl={summary_rows[-1]['slope_noinfl'] if not np.isnan(summary_rows[-1]['slope_noinfl']) else 'NA'}")

# Save summary CSV
pd.DataFrame(summary_rows).to_csv(os.path.join(out_dir, 'influence_refit_summary.csv'), index=False)
print("Saved influence/refit outputs to", out_dir)


01: n_points=17, n_infl=1, slope_orig=0.1542, slope_noinfl=0.5558200020115645
02: n_points=17, n_infl=1, slope_orig=0.1459, slope_noinfl=0.5501239674590426
03: n_points=16, n_infl=1, slope_orig=0.1478, slope_noinfl=0.5445538208615759
04: n_points=17, n_infl=1, slope_orig=0.1510, slope_noinfl=0.560048104583994
05: n_points=17, n_infl=1, slope_orig=0.1518, slope_noinfl=0.5635590512413825
06: n_points=17, n_infl=1, slope_orig=0.1544, slope_noinfl=0.5603363702308413
07: n_points=17, n_infl=1, slope_orig=0.1496, slope_noinfl=0.5629016353983204
08: n_points=18, n_infl=1, slope_orig=0.1541, slope_noinfl=0.5625108274735132
09: n_points=16, n_infl=1, slope_orig=0.1461, slope_noinfl=0.5495966987448233
10: n_points=16, n_infl=1, slope_orig=0.1403, slope_noinfl=0.5428415309250458
Saved influence/refit outputs to results/spectral_influence_refit


Short, immediate summary

Removing the influential point changes the slope drastically: the original OLS slopes are ≈ 0.140–0.154 (d_s ≈ 0.28–0.31), whereas the slopes refitted without the influential point rise to ≈ 0.54–0.56 (d_s ≈ 1.09–1.13), a factor of about 3.6–3.9 on the slope.

The robust estimators (Theil-Sen, RANSAC) also give high slopes: slope_ransac ≈ slope_noinfl for most diagnostics, while slope_theilsen ≈ 0.72–0.85 is higher still.

So the spectral estimate on λ ≤ 0.2 is highly sensitive to a single influential point per fit: either that point is an artifact / isolated small λ that flattens the slope, or OLS is ill-suited and the "real" slope is higher.
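
The contrast between OLS and a median-based estimator can be reproduced on a toy case. In the sketch below the bulk points lie exactly on a line of slope 0.55 (a hypothetical value) and one isolated small-λ point sits far off that line. It uses scipy.stats.theilslopes for brevity rather than the scikit-learn TheilSenRegressor from the cell above, and the data are synthetic, not the sunspot counting curves.

```python
import numpy as np
from scipy import stats

true_slope = 0.55                     # hypothetical bulk slope
lam = np.concatenate(([1e-8], np.linspace(0.05, 0.2, 16)))
x = np.log(lam)
y = true_slope * x + 3.0              # bulk lies exactly on a line
y[0] = 0.0                            # isolated point sits far off that line

# OLS is pulled by the high-leverage point
ols = stats.linregress(x, y)

# Theil-Sen: median of all pairwise slopes, dominated by the bulk pairs
ts_slope, ts_intercept, ts_lo, ts_hi = stats.theilslopes(y, x)

print(f"OLS slope       = {ols.slope:.3f}")
print(f"Theil-Sen slope = {ts_slope:.3f}")
```

The median of pairwise slopes ignores the 16 pairs that involve the isolated point because the 120 bulk pairs outnumber them, which is why a robust estimator and "OLS without the influential point" can agree with each other while disagreeing with plain OLS.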

What the output files contain (where to look first)

results/spectral_influence_refit/influence_refit_summary.csv: comparative summary (original OLS, OLS without influential point, Theil-Sen, RANSAC). You attached it.

results/spectral_influence_refit/##_influential_points.csv: for each diag, the indices and values (lambda, N_lambda, cook, leverage, std_resid) of the influential points. Open these to see which eigenvalue is responsible. Note that the cell strips the diag_ prefix from the tag, so the files are named 01_influential_points.csv, 02_influential_points.csv, and so on.

results/spectral_influence_refit/##_influence_refit.png: comparison plot (data, OLS, OLS w/o influential point, Theil-Sen, RANSAC), named the same way without the diag_ prefix. Inspect these images.
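
To inspect the influential points across all diagnostics at once, something like the following could work. It is a minimal sketch: load_influential_points is a hypothetical helper (not part of the cells above), and the glob pattern assumes only that the file names end in _influential_points.csv, with the columns written by the influence cell.

```python
import glob
import os
import pandas as pd

def load_influential_points(in_dir):
    """Concatenate every *_influential_points.csv found in in_dir (hypothetical helper)."""
    pattern = os.path.join(in_dir, '*_influential_points.csv')
    files = sorted(glob.glob(pattern))
    if not files:
        return pd.DataFrame()
    table = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
    # Largest Cook's distance first, so the dominant points come on top
    return table.sort_values('cook', ascending=False).reset_index(drop=True)

pts = load_influential_points('results/spectral_influence_refit')
if pts.empty:
    print('No influential-point files found.')
else:
    print(pts[['diag', 'index_in_fit', 'lambda', 'N_lambda', 'cook', 'leverage']])
```

If index_in_fit is 0 and lambda is orders of magnitude below the other eigenvalues in every row, the influential point is the smallest positive eigenvalue in each subsample.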

Practical interpretation

The original OLS (with all points) gives a low, stable slope, but one that rests on a set of points in which a single one is highly influential (Cook ≫ threshold).

The robust approaches, or removing the influential point, give markedly higher slopes: the methodological decision (include/exclude, or use a robust estimator) radically changes the quantitative interpretation of d_s.

Numerically: OLS with the influential point gives d_s ≪ m_hat; OLS without influential points, or a robust fit, gives a larger d_s (still far from Levina, but closer).

Python cell: comparative summary of d_s (orig, without influential point, Theil-Sen, RANSAC) + paired tests and figure

In [21]:
# Cell: summarize d_s = 2*slope for methods (OLS orig, OLS w/o influent, Theil-Sen, RANSAC),
# compute medians/IQR, paired Wilcoxon tests between methods, save CSV and plot.
import os
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

in_dir = 'results/spectral_influence_refit'
paired_in_path = 'results/paired_levina_spectral_raw.csv'  # available but not used for diag-level summary
out_dir = 'results/spectral_influence_summary'
os.makedirs(out_dir, exist_ok=True)

# Load the influence/refit summary produced previously
summary_fp = os.path.join(in_dir, 'influence_refit_summary.csv')
if not os.path.exists(summary_fp):
    raise RuntimeError(f"File not found: {summary_fp}")

inf_df = pd.read_csv(summary_fp)

# Keep only rows with fit_ok True (safety)
inf_df = inf_df[inf_df['fit_ok'] == True].copy()
if inf_df.empty:
    raise RuntimeError("No valid fits in influence_refit_summary.csv")

# Compute d_s = 2 * slope for each method
def two(x): 
    return 2.0 * x

inf_df['d_s_orig'] = two(inf_df['slope_orig'])
inf_df['d_s_noinfl'] = two(inf_df['slope_noinfl'])
inf_df['d_s_theilsen'] = two(inf_df['slope_theilsen'])
inf_df['d_s_ransac'] = two(inf_df['slope_ransac'])

# Produce aggregated summary (median, IQR)
methods = ['d_s_orig','d_s_noinfl','d_s_theilsen','d_s_ransac']
agg_rows = []
for m in methods:
    vals = inf_df[m].dropna().values
    if vals.size == 0:
        med = iqr_low = iqr_high = np.nan
    else:
        med = float(np.median(vals))
        q1 = float(np.percentile(vals,25))
        q3 = float(np.percentile(vals,75))
        iqr_low, iqr_high = q1, q3
    agg_rows.append({'method': m, 'n': int(np.sum(~inf_df[m].isna())), 'median': med, 'iqr_lower': iqr_low, 'iqr_upper': iqr_high})

agg_df = pd.DataFrame(agg_rows)
agg_df.to_csv(os.path.join(out_dir, 'd_s_methods_aggregate.csv'), index=False)

# Paired comparisons (Wilcoxon) between methods across diagnostics (rows are paired by diag)
paired_tests = []
pairs = [('d_s_orig','d_s_noinfl'), ('d_s_orig','d_s_theilsen'), ('d_s_orig','d_s_ransac'),
         ('d_s_noinfl','d_s_theilsen'), ('d_s_noinfl','d_s_ransac'), ('d_s_theilsen','d_s_ransac')]
for a,b in pairs:
    A = inf_df[a].values
    B = inf_df[b].values
    mask = (~np.isnan(A)) & (~np.isnan(B))
    if mask.sum() >= 2:
        try:
            w = stats.wilcoxon(A[mask], B[mask], alternative='two-sided', zero_method='wilcox')
            stat, pval = float(w.statistic), float(w.pvalue)
        except Exception:
            stat, pval = np.nan, np.nan
        # also report median differences
        med_diff = float(np.median(A[mask] - B[mask]))
        paired_tests.append({'method_A': a, 'method_B': b, 'n_pairs': int(mask.sum()), 'wilcoxon_stat': stat, 'wilcoxon_p': pval, 'median_diff': med_diff})
    else:
        paired_tests.append({'method_A': a, 'method_B': b, 'n_pairs': int(mask.sum()), 'wilcoxon_stat': np.nan, 'wilcoxon_p': np.nan, 'median_diff': np.nan})

pd.DataFrame(paired_tests).to_csv(os.path.join(out_dir, 'd_s_paired_tests.csv'), index=False)

# Save the per-diag d_s table
per_diag = inf_df[['diag','n_points_fit','n_influential','max_cook','slope_orig','slope_noinfl','slope_theilsen','slope_ransac',
                   'd_s_orig','d_s_noinfl','d_s_theilsen','d_s_ransac']].copy()
per_diag.to_csv(os.path.join(out_dir, 'd_s_per_diag.csv'), index=False)

# Plot: grouped strip + box for d_s distributions (small K diagnostics)
plt.figure(figsize=(7,4))
data = [inf_df[m].dropna().values for m in methods]
# boxplot
bp = plt.boxplot(data, positions=np.arange(len(methods)), widths=0.6, patch_artist=True, showfliers=False)
colors = ['#c6dbef','#9ecae1','#6baed6','#3182bd']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
# overlay points (jitter)
for i, arr in enumerate(data):
    x = np.random.normal(i, 0.08, size=arr.size)
    plt.scatter(x, arr, color='k', s=12, alpha=0.8)
plt.xticks(np.arange(len(methods)), ['orig','no_infl','Theil-Sen','RANSAC'])
plt.ylabel('d_s = 2 * slope')
plt.title('Comparison of d_s across methods (diagnostics)')
plt.grid(alpha=0.25, axis='y')
plt.tight_layout()
plt.savefig(os.path.join(out_dir, 'd_s_methods_comparison.png'), dpi=150)
plt.close()

# Small printed summary for quick inspection
print("Saved aggregates:", os.path.join(out_dir, 'd_s_methods_aggregate.csv'))
print("Saved paired tests:", os.path.join(out_dir, 'd_s_paired_tests.csv'))
print("Saved per-diag table:", os.path.join(out_dir, 'd_s_per_diag.csv'))
print("Saved figure:", os.path.join(out_dir, 'd_s_methods_comparison.png'))


Saved aggregates: results/spectral_influence_summary\d_s_methods_aggregate.csv
Saved paired tests: results/spectral_influence_summary\d_s_paired_tests.csv
Saved per-diag table: results/spectral_influence_summary\d_s_per_diag.csv
Saved figure: results/spectral_influence_summary\d_s_methods_comparison.png


Short summary of the results you provided

The original OLS fits (all points) give a median d_s ≈ 0.30 (slope ≈ 0.15) for λ_max = 0.2.

Excluding the influential point or using robust estimators strongly increases d_s:

d_s without influential point ≈ 1.12 (median)

d_s Theil-Sen ≈ 1.56 (median)

d_s RANSAC ≈ 1.12 (median)

The paired tests between methods (K = 10 diagnostics) show statistically significant differences (Wilcoxon p ≈ 0.002) between orig and each of no_infl / Theil-Sen / RANSAC. Note that 2/2^10 ≈ 0.00195 is the smallest two-sided p-value an exact Wilcoxon test can return with 10 pairs: it means all 10 differences have the same sign, and it cannot get smaller at K = 10.
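
The p ≈ 0.002 reported above is worth reading against the floor of the test. The sketch below uses synthetic d_s values (an assumption, chosen so that one method is larger in every pair) to show that an exact two-sided Wilcoxon test on 10 pairs returns 2/2^10 in that situation.

```python
import numpy as np
from scipy import stats

K = 10
# Synthetic d_s values (assumption): method B is larger than method A in every pair
d_s_a = 0.30 + 0.001 * np.arange(K)
d_s_b = 1.10 + 0.004 * np.arange(K)

res = stats.wilcoxon(d_s_a, d_s_b, alternative='two-sided')
p_floor = 2.0 / 2**K   # two-sided exact p when all K differences share one sign

print(f"Wilcoxon p = {res.pvalue:.6f}, theoretical floor = {p_floor:.6f}")
```

So with K = 10 the test can say "every diagnostic moved in the same direction" but cannot quantify how strongly; the median differences carry that information.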

In short: the spectral slope estimated on λ ≤ 0.2 is very sensitive to a single influential point per fit. The methodological decision (include/exclude, or use a robust estimator) changes the numerical conclusion about d_s.

Immediate practical interpretation
The "orig" OLS gives low slopes that are essentially dictated by 1 influential point per diagnostic.

The robust estimators (Theil-Sen, RANSAC) and the OLS version without the influential point give slopes that are, by median, about 3.7× (no_infl, RANSAC) to 5.2× (Theil-Sen) higher.

There is no single "truth" here: you have to choose and defend a policy (robust vs keeping all points), or present both results with a justification.