# NMD AlphaFold-Predicted Structural Property Analysis

This notebook analyzes **AlphaFold-predicted structural features** in the **NMD-affected region** of proteins carrying frameshift and nonsense variants. It compares **variant–wild-type (Var−WT) differences** across candidate and control variant groups.

The analysis includes:
- Solvent accessibility (SASA)
- pLDDT structural confidence
- Aromaticity
- Net charge
- Isoelectric point (pI)
- Amino-acid composition of the NMD region

## 1. Load NMD-Region AlphaFold Property Table

In this section, we load the full AlphaFold2 feature table and prepare it for downstream analysis.

### Key steps:
- Read input data (`*.csv`)
- Remove index-like or duplicate lowercase columns
- Inspect:
  - dimensions  
  - column names  
  - missing values  
  - summary statistics  
- Collapse the table to one row per variant ID (`df_NMD_unique`)


In [None]:
import pandas as pd
# --- 1. Load the CSV ---
df_NMD = pd.read_csv("WT_var_NMD_4c_AD.csv")

# Drop the auto-index column if it exists
if "...1" in df_NMD.columns:
    df_NMD = df_NMD.drop(columns=["...1"])

print("Shape:", df_NMD.shape)
print("Columns:", df_NMD.columns.tolist())

# --- 2. Quick look at the data ---
display(df_NMD.head())
df_NMD.info()  # info() prints its report directly and returns None, so it is not wrapped in display()
display(df_NMD.describe(include="all").transpose())

# --- 3. Uniqueness checks for IDs ---
print("Unique ids:", df_NMD["id"].nunique())

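# Collapse NMD df to one row per ID.
# Note: groupby().first() keeps the first non-null value of each column within an ID,
# so duplicated rows that disagree are merged column by column.
df_NMD_unique = df_NMD.groupby("id").first().reset_index()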

print("Original rows:", df_NMD.shape[0])
print("Unique IDs:", df_NMD_unique.shape[0])
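
Because `groupby("id").first()` takes the first non-null value per column, duplicated rows that disagree are merged without warning. The sketch below shows one way to check for such disagreements before collapsing. The helper name `conflicting_ids` and the toy values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def conflicting_ids(df: pd.DataFrame, id_col: str = "id") -> pd.Series:
    """Return, for each ID whose rows disagree, the number of disagreeing columns."""
    # nunique() counts distinct non-null values per ID and column;
    # a count above 1 means the duplicated rows disagree in that column.
    n_distinct = df.groupby(id_col).nunique(dropna=True)
    n_conflicting_cols = (n_distinct > 1).sum(axis=1)
    return n_conflicting_cols[n_conflicting_cols > 0]

# Toy data (hypothetical values, not taken from WT_var_NMD_4c_AD.csv)
toy = pd.DataFrame({
    "id": ["v1", "v1", "v2", "v2"],
    "pI_WT_nmd": [6.1, 6.1, 7.0, 7.4],
    "source": ["minus1", "minus1", "plus1", "plus1"],
})
conflicts = conflicting_ids(toy)
print(conflicts)
```

An empty result from `conflicting_ids(df_NMD)` would indicate that the duplicated rows are identical wherever they are non-null, in which case collapsing with `first()` loses nothing.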

## 2. Compute Variant-Minus-Wild-Type Differences

For each structural feature, we compute:

`metric_Diff = metric_Var − metric_WT`


Included features:
- `aromaticity_Diff`  
- `pI_Diff`  
- `plddt_mean_Diff`  
- `net_charge_Diff`  
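- `rel_sasa_diff` (computed in Section 5; its name is kept lowercase)

If a lowercase duplicate of a difference column (e.g. `pI_diff`) exists alongside the capitalized one, the lowercase copy is dropped; if only the lowercase version exists, it is renamed to the capitalized form.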

This produces a standardized dataframe for downstream visualization.

In [None]:
# Identify WT and Var pairs based on column names
pairs = [
    ("aromaticity_WT_nmd", "aromaticity_vars_nmd"),
    ("pI_WT_nmd", "pI_vars_nmd"),
    ("plddt_mean_WT_nmd", "plddt_mean_vars_nmd"),
    ("net_charge_WT_nmd", "net_charge_vars_nmd")
]

# Add difference columns
for wt_col, var_col in pairs:
    diff_col = wt_col.replace("_WT_nmd", "_Diff")  # e.g. aromaticity_Diff
    df_NMD_unique[diff_col] = df_NMD_unique[var_col] - df_NMD_unique[wt_col]

# Quick check of new columns
print(df_NMD_unique[[c for c in df_NMD_unique.columns if c.endswith("_Diff")]].head())

In [None]:
# choose the capitalized names to keep
keep = {
    "aromaticity_diff": "aromaticity_Diff",
    "pI_diff":          "pI_Diff",
    "plddt_mean_diff":  "plddt_mean_Diff",
    "net_charge_diff":  "net_charge_Diff"
}

# if lowercase exists but capitalized also exists, drop lowercase
for low, cap in keep.items():
    if low in df_NMD_unique.columns and cap in df_NMD_unique.columns:
        df_NMD_unique.drop(columns=[low], inplace=True)
    elif low in df_NMD_unique.columns:  # only lowercase exists → rename to capitalized
        df_NMD_unique.rename(columns={low: cap}, inplace=True)

In [None]:
df_NMD_unique.head(5)

## 3. Normalize Variant Groups

Variant labels are harmonized into the following categories:

- Minus1  
- Plus1  
- Nonsense  
- Minus1_Control  
- Plus1_Control  
- Nonsense_Control  

Two helper variables are added:
- `family`  → Minus1 / Plus1 / Nonsense  
- `status`  → Candidate / Control  

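This ensures consistent grouping, coloring, and statistics across all figures. In this notebook, the normalization is applied inside the plotting function defined in Section 4, using the `source` column.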
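
The harmonization described above can also be sketched as a standalone helper. This is a minimal illustration, not the notebook's own implementation: the name `normalize_groups`, the `LABEL_MAP` table, and the choice of exact, case-insensitive matching are assumptions. The raw labels (`minus1`, `plus1`, `snv`, and their `_control` forms) are the substrings handled in the plotting function.

```python
import pandas as pd

# Raw label -> (family, status). Treating the raw labels as exact,
# case-insensitive matches is an assumption of this sketch.
LABEL_MAP = {
    "minus1":         ("Minus1",   "Candidate"),
    "plus1":          ("Plus1",    "Candidate"),
    "snv":            ("Nonsense", "Candidate"),
    "minus1_control": ("Minus1",   "Control"),
    "plus1_control":  ("Plus1",    "Control"),
    "snv_control":    ("Nonsense", "Control"),
}

def normalize_groups(df, source_col="source"):
    """Return a copy of df with harmonized group, family, and status columns."""
    out = df.copy()
    keys = out[source_col].astype(str).str.strip().str.lower()
    mapped = [LABEL_MAP.get(k, (None, None)) for k in keys]
    out["family"] = [fam for fam, _ in mapped]
    out["status"] = [status for _, status in mapped]
    # Rebuild the six-category label; unrecognized labels stay missing.
    out["group"] = [
        None if fam is None else (f"{fam}_Control" if status == "Control" else fam)
        for fam, status in mapped
    ]
    return out

# Toy input (hypothetical labels for illustration)
toy = pd.DataFrame({"source": ["minus1", "SNV_control", "Plus1_Control", "other"]})
normalized = normalize_groups(toy)
print(normalized)
```

Unlike substring replacement, an explicit lookup leaves unrecognized labels as missing values, so they can be inspected rather than silently counted as candidates.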

## 4. Visualization Setup

This section defines a unified plotting function that:

- Uses color-safe palettes per variant family  
- Lightens colors for control groups  
- Generates publication-ready boxplots  
- Adds:
  - sample sizes  
  - two-sided Mann–Whitney U test p-values  
  - significance stars (*, **, ***) or "ns"  
- Sets y-axis limits automatically, leaving headroom for the annotation boxes  
- Saves the figure (PNG in this notebook) when `savepath` is given

This ensures consistent styling across all metrics analyzed.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.patches import Patch
from scipy.stats import mannwhitneyu

FAMILY_COLOR = {'Minus1':'#4472C4', 'Plus1':'#C65911', 'Nonsense':'#70AD47'}

def lighten_color(color, amount=0.55):
    r, g, b = mcolors.to_rgb(color)
    return (1 - amount) + amount*r, (1 - amount) + amount*g, (1 - amount) + amount*b

def plot_diff_candidate_vs_control_colorsafe(
    df, diff_col, metric_label=None, context_label="Full length",
    families=("Minus1","Plus1","Nonsense"), figsize=(12,6), savepath=None
):
    # ---- prep ----
    d = df.copy()
    g = d["source"].astype(str)
    g = g.str.replace("snv_control","Nonsense_Control",case=False,regex=False)
    g = g.str.replace("snv","Nonsense",case=False,regex=False)
    g = g.str.replace("minus1_control","Minus1_Control",case=False,regex=False)
    g = g.str.replace("plus1_control","Plus1_Control",case=False,regex=False)
    g = g.str.replace("minus1","Minus1",case=False,regex=False)
    g = g.str.replace("plus1","Plus1",case=False,regex=False)
    d["group"]  = g
    d["family"] = d["group"].str.replace("_Control","",regex=False)
    d["status"] = np.where(d["group"].str.endswith("_Control"), "Control", "Candidate")
    d = d[d["family"].isin(families)].copy()
    d[diff_col] = pd.to_numeric(d[diff_col], errors="coerce")

    # ---- figure ----
    fig, ax = plt.subplots(figsize=figsize)
    ax.grid(True, axis='y', alpha=0.3, linestyle='--'); ax.set_axisbelow(True)

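    # compute y limits for padding/annotations (fail early if nothing is plottable)
    y = d[diff_col].dropna().to_numpy()
    if y.size == 0:
        plt.close(fig)
        raise ValueError(f"No numeric values in '{diff_col}' for families {families}")
    y_min, y_max = y.min(), y.max()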
    pad = 0.15 * max(y_max - y_min, 1.0)
    ax.set_ylim(y_min - 0.05*pad, y_max + 2*pad)
    y_annot = y_max + pad

    # plot each family separately with its own palette
    xpos = {fam:i for i, fam in enumerate(families)}
    for fam in families:
        sub = d[d["family"]==fam]
        if sub.empty: 
            continue
        xi = xpos[fam]
        pal = {
            "Candidate": FAMILY_COLOR[fam],
            "Control":   lighten_color(FAMILY_COLOR[fam])
        }
        sub = sub.assign(xslot=xi)
        sns.boxplot(
            data=sub, x="xslot", y=diff_col, hue="status",
            hue_order=["Candidate","Control"], order=[xi], dodge=True, width=0.6,
            palette=pal, ax=ax, legend=False
        )
        # stats (Mann–Whitney)
        a = sub.loc[sub["status"]=="Candidate", diff_col].dropna().to_numpy()
        b = sub.loc[sub["status"]=="Control",   diff_col].dropna().to_numpy()
        if len(a) and len(b):
            p = mannwhitneyu(a,b,alternative="two-sided").pvalue
            stars = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else "ns"
            ax.text(xi, y_annot, f"{stars}\nn={len(a)}/{len(b)}\np={p:.3g}",
                    ha="center", va="bottom",
                    bbox=dict(boxstyle="round,pad=0.25", fc="white", ec="0.6"), fontsize=10)

    # axes/labels
    ax.set_xlim(-0.5, len(families)-0.5)
    ax.set_xticks(range(len(families)))
    ax.set_xticklabels(families, rotation=20, ha="right")
    ax.set_xlabel("Variant Type", fontsize=11, fontweight="bold")
    ax.set_ylabel(metric_label or diff_col, fontsize=11, fontweight="bold")
    ttl = metric_label or diff_col
    ax.set_title(f"{ttl} (Var − WT): Candidate vs Control ({context_label})",
                 fontsize=14, fontweight="bold", pad=10)

    # legend (simple)
    leg_handles = [Patch(facecolor="#6e6e6e", edgecolor="k", label="Candidate"),
                   Patch(facecolor="#cfcfcf", edgecolor="k", label="Control")]
    ax.legend(handles=leg_handles, title="Group", loc="upper left",
              bbox_to_anchor=(1.01, 1.0), borderaxespad=0.)
    plt.tight_layout()
    if savepath:
        fig.savefig(savepath, dpi=300, bbox_inches="tight")
    return fig, ax
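
The statistical annotation can be tried in isolation. The sketch below runs a two-sided Mann–Whitney U test on synthetic data and maps the p-value to stars using the same thresholds as the plotting function (0.05, 0.01, 0.001). The helper name `p_to_stars` and the simulated values are illustrative assumptions, not part of the original notebook or its data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def p_to_stars(p):
    """Map a p-value to the star notation used in the figure annotations."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"

# Synthetic Var − WT differences (illustrative only, not real data)
rng = np.random.default_rng(0)
candidate = rng.normal(loc=0.5, scale=1.0, size=200)
control = rng.normal(loc=0.0, scale=1.0, size=200)

# Two-sided test: are the two distributions shifted relative to each other?
p = mannwhitneyu(candidate, control, alternative="two-sided").pvalue
print(f"{p_to_stars(p)}  n={len(candidate)}/{len(control)}  p={p:.3g}")
```

The test is rank-based, so it makes no normality assumption about the Var−WT differences.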


## 5. Candidate vs Control Comparisons by Structural Feature

For each metric, we compare the distribution of Candidate Var−WT differences against the distribution of Control Var−WT differences.


Metrics compared:
- **Isoelectric point difference (pI_Diff)**
- **Mean pLDDT difference**
- **Aromaticity difference**
- **Net charge difference**
- **Relative SASA difference**

Each figure contains:
- Two groups per family (Candidate vs Control)
- Clear labeling
- Statistical annotation via Mann–Whitney U test
- Saved PNG output for external use


In [None]:
targets = [
    ("pI_Diff",          "Isoelectric point (pI)"),
    ("plddt_mean_Diff",  "pLDDT (mean)"),
    ("aromaticity_Diff", "Aromaticity"),
    ("net_charge_Diff",  "Net Charge"),
]

# Only the Minus1 and Plus1 families are plotted here; Nonsense is excluded via `families`.
for col, label in targets:
    plot_diff_candidate_vs_control_colorsafe(
        df=df_NMD_unique, diff_col=col, families=("Minus1", "Plus1"),
        metric_label=label, context_label="NMD region",
        savepath=f"NMD_{col}_Candidate_vs_Control.png"
    )

In [None]:
# 1) Compute Var − WT for relative SASA (NMD region)
required = {"rel_sasa_var_nmd", "rel_sasa_WT_nmd"}
missing = required - set(df_NMD_unique.columns)
if missing:
    raise KeyError(f"Missing columns in df_NMD_unique: {missing}")

df_NMD_unique["rel_sasa_diff"] = (
    df_NMD_unique["rel_sasa_var_nmd"] - df_NMD_unique["rel_sasa_WT_nmd"]
)


# 2) Plot with the same function used for the other metrics
plot_diff_candidate_vs_control_colorsafe(
    df=df_NMD_unique,
    diff_col="rel_sasa_diff",
    metric_label="Relative SASA",
    context_label="NMD region",
    families=("Minus1", "Plus1"),
    savepath="NMD_rel_sasa_diff_Candidate_vs_Control.png"
)

## 6. Generated Outputs

This notebook produces:

- A cleaned and standardized NMD-region structural dataset (`df_NMD_unique`, held in memory)
- Var−WT difference columns added to that dataframe
- PNG figures written to the working directory:
  - `NMD_pI_Diff_Candidate_vs_Control.png`
  - `NMD_plddt_mean_Diff_Candidate_vs_Control.png`
  - `NMD_aromaticity_Diff_Candidate_vs_Control.png`
  - `NMD_net_charge_Diff_Candidate_vs_Control.png`
  - `NMD_rel_sasa_diff_Candidate_vs_Control.png`
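
The code cells above save only the figures. If the difference table is also needed on disk, a minimal sketch along the following lines could be used. The helper `export_diff_table`, the output filename, and the toy values are hypothetical, and the column selection (anything ending in `_diff`, case-insensitive) is an assumption.

```python
import pandas as pd

def export_diff_table(df, path, id_col="id", group_col="source"):
    """Write the ID, group label, and every Var − WT difference column to CSV."""
    # Match both the capitalized (_Diff) and lowercase (_diff) naming.
    diff_cols = [c for c in df.columns if c.lower().endswith("_diff")]
    keep = [c for c in (id_col, group_col) if c in df.columns] + diff_cols
    df[keep].to_csv(path, index=False)
    return keep

# Toy frame standing in for df_NMD_unique (hypothetical values)
toy = pd.DataFrame({
    "id": ["v1", "v2"],
    "source": ["minus1", "plus1_control"],
    "pI_Diff": [0.4, -0.1],
    "rel_sasa_diff": [0.02, 0.00],
    "pI_WT_nmd": [6.1, 7.0],
})
written = export_diff_table(toy, "NMD_diff_table_example.csv")
print(written)
```

In the notebook itself, the call would take `df_NMD_unique` in place of the toy frame.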