# Binary Phenotype Formatting for GWAS Pipeline

Prepares **binary-coded** (case/control) phenotype files for BOLT-LMM GWAS.

Loads the four raw extracted phenotype TSVs and writes a single output file with
PLINK-compatible binary coding (1 = control, 2 = case, NA = missing; the higher value marks the more isolated group):

| Derived Trait | Raw Fields Used | Coding |
|---|---|---|
| **Loneliness** | Field 2020 | 0 -> 1 (control), 1 -> 2 (case) |
| **AbilityToConfide** | Field 2110 | 0 (never/almost never) -> 2 (case), else -> 1 (control) |
| **FreqSoc** | Fields 709 + 1031 | Lives alone (709 = 1) AND rarely visited (1031 = 6 or 7) -> 2 (case), else -> 1 (control) |

Output: `isolation_run_binary.tsv.gz` with columns `FID, IID, Loneliness, AbilityToConfide, FreqSoc`
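
The coding rules in the table can be sketched on a toy frame as follows. This is only an illustration under stated assumptions: the values are made up, and the column names (`f2020`, `f2110`, `f709`, `f1031`) and the helper `code_traits` are hypothetical, not part of the pipeline.

```python
import numpy as np
import pandas as pd


def code_traits(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the 1 = control / 2 = case rules; missing inputs stay missing."""
    out = pd.DataFrame(index=df.index)

    # Loneliness (Field 2020): 0 -> 1 (control), 1 -> 2 (case)
    out["Loneliness"] = df["f2020"].map({0: 1, 1: 2})

    # AbilityToConfide (Field 2110): 0 -> 2 (case), any other answer -> 1
    confide = pd.Series(np.where(df["f2110"] == 0, 2.0, 1.0), index=df.index)
    out["AbilityToConfide"] = confide.where(df["f2110"].notna())

    # FreqSoc: lives alone (709 = 1) AND rarely visited (1031 in {6, 7}) -> 2
    isolated = (df["f709"] == 1) & df["f1031"].isin([6, 7])
    freqsoc = pd.Series(np.where(isolated, 2.0, 1.0), index=df.index)
    # Either input missing -> the composite trait is missing
    out["FreqSoc"] = freqsoc.where(df["f709"].notna() & df["f1031"].notna())
    return out


# Hypothetical participants: one case on every trait, one control, one missing
toy = pd.DataFrame({
    "f2020": [0, 1, np.nan],
    "f2110": [0, 3, np.nan],
    "f709":  [1, 1, 2],
    "f1031": [7, 2, np.nan],
})
coded = code_traits(toy)
print(coded)
```

The sketch keeps missing values as NaN rather than the string `"NA"`; the string form only matters when the file is written.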

In [None]:
import os
import pandas as pd
from functools import reduce

INPUT_DIR = "/home/mabdel03/data/files/Isolation_Genetics/GWAS/Basket_File/extracted_phenotypes"
OUTPUT_DIR = "/home/mabdel03/data/files/Isolation_Genetics/GWAS/Scripts/ukb21942/pheno"

os.makedirs(OUTPUT_DIR, exist_ok=True)

## Load Raw Phenotype Files

These TSVs are produced by `extract_all_phenotypes.sh` from the UKBB basket file.
Each has columns: `FID`, `IID`, `Phenotype`.

In [None]:
raw_phenotypes = {
    "Loneliness":       f"{INPUT_DIR}/phenotype_2020.tsv",   # Field 2020
    "AbilityToConfide": f"{INPUT_DIR}/phenotype_2110.tsv",   # Field 2110
    "FreqVisit":        f"{INPUT_DIR}/phenotype_1031.tsv",   # Field 1031
    "NumHousehold":     f"{INPUT_DIR}/phenotype_709.tsv",    # Field 709
}

df_list = []
for name, path in raw_phenotypes.items():
    df = pd.read_csv(path, sep="\t")
    df_list.append((df, name))
    print(f"Loaded {name}: {len(df)} rows from {os.path.basename(path)}")

## Fix Column Format and Merge

The raw extraction may produce rows where FID, IID, and phenotype values are
space-separated in a single column. Parse and split these into proper columns,
then outer-merge all phenotypes on FID/IID.
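
To picture the repair, assume a packed row looks like `"1001 1001 0"` inside the `FID` column. One possible vectorized sketch uses `Series.str.split` with `expand=True`, which splits on whitespace into separate columns. The participant IDs and values below are invented for illustration.

```python
import pandas as pd

# Toy frame mimicking a raw file whose rows were packed into the FID column
raw = pd.DataFrame({
    "FID": ["1001 1001 0", "1002  1002 1", "1003 1003 -3"],
    "IID": [None, None, None],
    "Phenotype": [None, None, None],
})

# With no pattern given, str.split splits on any run of whitespace,
# so the double space in the second row does not create an empty field
parts = raw["FID"].astype(str).str.split(n=2, expand=True)
parts.columns = ["FID", "IID", "Loneliness"]

# A second toy phenotype covering a different set of participants
other = pd.DataFrame({
    "FID": ["1002", "1004"],
    "IID": ["1002", "1004"],
    "AbilityToConfide": ["0", "5"],
})

# The outer merge keeps participants present in either file
merged = pd.merge(parts, other, on=["FID", "IID"], how="outer")
print(merged)
```

Participants missing from one file end up with NaN in that file's phenotype column, which the cleaning step then treats as missing.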

In [None]:
corrected_dfs = []

for df, phenotype_name in df_list:
    fid_vals = []
    iid_vals = []
    pheno_vals = []

    for val in df["FID"].astype(str):
        # split() with no argument splits on any run of whitespace, so repeated
        # or leading spaces do not produce empty fields
        parts = val.split()

        if len(parts) >= 3:
            fid_vals.append(parts[0])
            iid_vals.append(parts[1])
            pheno_vals.append(parts[2])
        else:
            # Fewer than three fields: keep what is present, mark the rest NA
            fid_vals.append(parts[0] if len(parts) > 0 else "NA")
            iid_vals.append(parts[1] if len(parts) > 1 else "NA")
            pheno_vals.append("NA")

    corrected_df = pd.DataFrame({
        "FID": fid_vals,
        "IID": iid_vals,
        phenotype_name: pheno_vals,
    })
    corrected_dfs.append(corrected_df)

master_df = reduce(
    lambda left, right: pd.merge(left, right, on=["FID", "IID"], how="outer"),
    corrected_dfs,
)

print(f"Merged master dataframe: {master_df.shape[0]} participants, {master_df.shape[1]} columns")
master_df.head()

## Clean and Create Binary Traits

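Map UKBB special codes (-1 = "do not know", -3 = "prefer not to answer") to NA,
then apply binary coding for each derived trait. A missing input yields NA for the
derived trait; for FreqSoc, NA results if either of its two inputs is missing.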
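
As a minimal sketch of the special-code step, assuming the answers arrive as strings: `pd.to_numeric` with `errors="coerce"` turns non-numeric entries into NaN, and `Series.mask` blanks the special codes. The constant name `SPECIAL_CODES` and the toy values are illustrative only.

```python
import pandas as pd

SPECIAL_CODES = [-1, -3]  # "do not know", "prefer not to answer"

# Toy answers as they might appear after parsing, including stray whitespace
raw = pd.Series(["0", "1", "-1", "-3", "NA", " 1 "])

# errors="coerce" turns non-numeric strings such as "NA" into NaN
numeric = pd.to_numeric(raw.str.strip(), errors="coerce")

# mask() replaces values with NaN wherever the condition is True
cleaned = numeric.mask(numeric.isin(SPECIAL_CODES))
print(cleaned.tolist())
```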

In [None]:
# Map UKBB special codes to NA
for col in ["Loneliness", "AbilityToConfide", "FreqVisit", "NumHousehold"]:
    master_df[col] = master_df[col].astype(str).str.strip()
    master_df.loc[master_df[col].isin(["-1", "-3", "nan", "None"]), col] = "NA"

# Convert to numeric for coding; errors="coerce" turns the "NA" strings into NaN
loneliness_num = pd.to_numeric(master_df["Loneliness"], errors="coerce")
ability_num = pd.to_numeric(master_df["AbilityToConfide"], errors="coerce")
freqvisit_num = pd.to_numeric(master_df["FreqVisit"], errors="coerce")
numhouse_num = pd.to_numeric(master_df["NumHousehold"], errors="coerce")

# --- Loneliness binary ---
# Field 2020: 0 = No -> 1 (control), 1 = Yes -> 2 (case)
master_df["Loneliness_binary"] = loneliness_num.map({0.0: "1", 1.0: "2"}).fillna("NA")

# --- AbilityToConfide binary ---
# Field 2110: 0 = Never or almost never -> 2 (case, isolated), anything else -> 1 (control)
master_df["AbilityToConfide_binary"] = ability_num.apply(
    lambda x: "NA" if pd.isna(x) else ("2" if x == 0 else "1")
)

# --- FreqSoc binary (composite) ---
# Case (2): lives alone (NumHousehold=1) AND rarely visited (FreqVisit in {6,7})
# Control (1): otherwise
freqsoc_binary = []
for nh, fv in zip(numhouse_num, freqvisit_num):
    if pd.isna(nh) or pd.isna(fv):
        freqsoc_binary.append("NA")
    elif nh == 1 and fv in (6, 7):
        freqsoc_binary.append("2")
    else:
        freqsoc_binary.append("1")

master_df["FreqSoc_binary"] = freqsoc_binary

# Set IID = FID (UKBB uses same value for both in single-sample data)
master_df["IID"] = master_df["FID"]

print("Binary trait value counts:")
for trait in ["Loneliness_binary", "AbilityToConfide_binary", "FreqSoc_binary"]:
    print(f"\n{trait}:")
    print(master_df[trait].value_counts())

## Save Binary Output

Output file: `isolation_run_binary.tsv.gz`
Columns: `FID, IID, Loneliness, AbilityToConfide, FreqSoc`
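
A quick sanity check on a file in this layout might look like the sketch below. It writes a toy frame to a temporary directory rather than touching the real output, and the file name `toy_binary.tsv.gz` and the values are made up.

```python
import os
import tempfile

import pandas as pd

# Toy frame in the output layout, with "NA" marking missing phenotypes
toy = pd.DataFrame({
    "FID": ["1001", "1002"],
    "IID": ["1001", "1002"],
    "Loneliness": ["1", "NA"],
    "AbilityToConfide": ["2", "1"],
    "FreqSoc": ["NA", "1"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_binary.tsv.gz")
    toy.to_csv(path, sep="\t", index=False, compression="gzip", na_rep="NA")
    # dtype=str and keep_default_na=False keep "NA" as a literal string;
    # gzip compression is inferred from the .gz extension
    check = pd.read_csv(path, sep="\t", dtype=str, keep_default_na=False)

expected_cols = ["FID", "IID", "Loneliness", "AbilityToConfide", "FreqSoc"]
cols_ok = list(check.columns) == expected_cols
# Every phenotype cell should be "1", "2", or "NA"
values_ok = bool(check[expected_cols[2:]].isin(["1", "2", "NA"]).all().all())
print(cols_ok, values_ok)
```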

In [None]:
binary_df = master_df[["FID", "IID", "Loneliness_binary", "AbilityToConfide_binary", "FreqSoc_binary"]].copy()
binary_df.columns = ["FID", "IID", "Loneliness", "AbilityToConfide", "FreqSoc"]

output_file = f"{OUTPUT_DIR}/isolation_run_binary.tsv.gz"
binary_df.to_csv(output_file, sep="\t", index=False, compression="gzip", na_rep="NA")

print(f"Saved: {output_file}")
print(f"Shape: {binary_df.shape}")
print()
binary_df.head(10)