# Continuous Phenotype Formatting for GWAS Pipeline

Prepares a **continuous-coded** phenotype file for BOLT-LMM GWAS.

Loads the four raw extracted phenotype TSVs (one per UK Biobank field) and combines
them into a single output with three derived traits, each coded so that a higher
value means more isolated:

| Derived Trait | Raw Fields Used | Coding |
|---|---|---|
| **Loneliness** | Field 2020 | Raw value (0 or 1; 1 = lonely) |
| **AbilityToConfide** | Field 2110 | 5 - raw value (reverses scale so higher = more isolated) |
| **FreqSoc** | Fields 1031 + 709 | Mean of z-scored FreqVisit and z-scored negated NumHousehold |
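
The FreqSoc composite can also be written out explicitly. The notation below is introduced here for illustration only: $\bar{x}$ and $s_x$ denote the sample mean and sample standard deviation of a variable over participants with a non-missing value (the pandas default `std()`, which divides by $n - 1$).

$$
z(x_i) = \frac{x_i - \bar{x}}{s_x},
\qquad
\mathrm{FreqSoc}_i = \frac{1}{2}\left[\, z(\mathrm{FreqVisit}_i) + z(-\mathrm{NumHousehold}_i) \,\right]
$$

Because $z(-x_i) = -z(x_i)$, the composite is half the difference between the FreqVisit z-score and the NumHousehold z-score, so it rises with less frequent visits and with fewer household members.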

Output: `isolation_run_continuous.tsv.gz` with columns `FID, IID, Loneliness, AbilityToConfide, FreqSoc`

In [None]:
import os
import pandas as pd
from functools import reduce

INPUT_DIR = "/home/mabdel03/data/files/Isolation_Genetics/GWAS/Basket_File/extracted_phenotypes"
OUTPUT_DIR = "/home/mabdel03/data/files/Isolation_Genetics/GWAS/Scripts/ukb21942/pheno"

os.makedirs(OUTPUT_DIR, exist_ok=True)

## Load Raw Phenotype Files

These TSVs are produced by `extract_all_phenotypes.sh` from the UKBB basket file.
Each has columns: `FID`, `IID`, `Phenotype`.
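
Before loading the real files, it can help to confirm that each one has the header described above. The following is a minimal sketch on a made-up in-memory TSV; `check_header` and `EXPECTED_COLUMNS` are hypothetical names introduced for illustration and are not part of the pipeline.

```python
import io
import pandas as pd

# Column layout described above (assumed to be exact and in this order)
EXPECTED_COLUMNS = ["FID", "IID", "Phenotype"]

def check_header(df, name):
    """Return True if the dataframe has exactly the expected columns, in order."""
    ok = list(df.columns) == EXPECTED_COLUMNS
    if not ok:
        print(f"{name}: unexpected columns {list(df.columns)}")
    return ok

# Toy stand-in for one extracted file (values are made up)
toy_tsv = "FID\tIID\tPhenotype\n1\t1\t0\n2\t2\t1\n"
toy_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t")
header_ok = check_header(toy_df, "toy")
```

The same check could be called on each dataframe inside the loading loop below if the extraction output is not fully trusted.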

In [None]:
raw_phenotypes = {
    "Loneliness":       f"{INPUT_DIR}/phenotype_2020.tsv",   # Field 2020
    "AbilityToConfide": f"{INPUT_DIR}/phenotype_2110.tsv",   # Field 2110
    "FreqVisit":        f"{INPUT_DIR}/phenotype_1031.tsv",   # Field 1031
    "NumHousehold":     f"{INPUT_DIR}/phenotype_709.tsv",    # Field 709
}

df_list = []
for name, path in raw_phenotypes.items():
    df = pd.read_csv(path, sep="\t")
    df_list.append((df, name))
    print(f"Loaded {name}: {len(df)} rows from {os.path.basename(path)}")

## Fix Column Format and Merge

The raw extraction may produce rows where the FID, IID, and phenotype value are
whitespace-separated inside the single `FID` column. Split each such value into
proper columns (rows with fewer than three fields get `"NA"` for the missing
pieces), then outer-merge all phenotypes on FID/IID.
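
The split-and-merge step can also be expressed with vectorized string methods. This is a sketch on toy data, not the notebook's own method: `split_packed` is a hypothetical helper, and unlike the explicit loop it leaves missing pieces as NaN rather than the string `"NA"`.

```python
import pandas as pd
from functools import reduce

def split_packed(df, name):
    """Split 'FID IID value' strings held in the FID column into three columns."""
    # With no pattern, str.split splits on any run of whitespace
    parts = df["FID"].astype(str).str.split(expand=True)
    # Keep exactly three positional columns; absent ones are filled with NaN
    parts = parts.reindex(columns=range(3))
    parts.columns = ["FID", "IID", name]
    return parts

# Toy inputs (made-up IDs and values); participant "2" appears in both files
toy_a = pd.DataFrame({"FID": ["1 1 0", "2 2 1"]})
toy_b = pd.DataFrame({"FID": ["2 2 3", "3 3 4"]})

frames = [
    split_packed(toy_a, "Loneliness"),
    split_packed(toy_b, "AbilityToConfide"),
]

# The outer merge keeps every participant seen in any file
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["FID", "IID"], how="outer"),
    frames,
)
```

In this toy case the merged frame has one row per distinct participant, with NaN wherever a participant is absent from one of the inputs.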

In [None]:
corrected_dfs = []

for df, phenotype_name in df_list:
    fid_vals = []
    iid_vals = []
    pheno_vals = []

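    for val in df["FID"].astype(str):
        # split() with no argument splits on any run of whitespace (spaces or tabs)
        # and never yields empty strings, so repeated spaces cannot shift the fields
        parts = val.split()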

        if len(parts) >= 3:
            fid_vals.append(parts[0])
            iid_vals.append(parts[1])
            pheno_vals.append(parts[2])
        else:
            fid_vals.append(parts[0] if len(parts) > 0 else "NA")
            iid_vals.append(parts[1] if len(parts) > 1 else "NA")
            pheno_vals.append("NA")

    corrected_df = pd.DataFrame({
        "FID": fid_vals,
        "IID": iid_vals,
        phenotype_name: pheno_vals,
    })
    corrected_dfs.append(corrected_df)

master_df = reduce(
    lambda left, right: pd.merge(left, right, on=["FID", "IID"], how="outer"),
    corrected_dfs,
)

print(f"Merged master dataframe: {master_df.shape[0]} participants, {master_df.shape[1]} columns")
master_df.head()

## Clean and Create Continuous Traits

Map UKBB special codes (-1 = "do not know", -3 = "prefer not to answer") to NA,
then apply continuous coding for each derived trait. All traits are oriented so
that higher values indicate greater social isolation.
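
The cleaning and coding rules can be illustrated on a few made-up rows. This is a sketch under stated assumptions: `to_clean_numeric`, `zscore`, and `SPECIAL_CODES` are hypothetical names used only here, and the masking is done on numeric values rather than on strings.

```python
import pandas as pd

SPECIAL_CODES = [-1, -3]  # "do not know", "prefer not to answer"

def to_clean_numeric(series):
    """Convert to numeric and blank out the special codes."""
    num = pd.to_numeric(series, errors="coerce")
    return num.mask(num.isin(SPECIAL_CODES))

def zscore(series):
    """Standardize using the sample standard deviation (pandas default, ddof=1)."""
    return (series - series.mean()) / series.std()

# Toy rows (made-up values); row 2 and row 3 each contain one special code
toy = pd.DataFrame({
    "AbilityToConfide": ["0", "5", "-1", "3"],
    "FreqVisit":        ["1", "7", "4", "-3"],
    "NumHousehold":     ["4", "1", "2", "3"],
})

ability = to_clean_numeric(toy["AbilityToConfide"])
freqvisit = to_clean_numeric(toy["FreqVisit"])
numhouse = to_clean_numeric(toy["NumHousehold"])

# Reverse the scale so that higher = less ability to confide
ability_rev = 5.0 - ability

# Composite: NaN in either component propagates to the result
freqsoc = (zscore(freqvisit) + zscore(-numhouse)) / 2
```

Here the row with the least frequent visits and the smallest household (index 1) receives the largest composite score, and the row whose FreqVisit was a special code has a missing composite.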

In [None]:
# Map UKBB special codes to NA
for col in ["Loneliness", "AbilityToConfide", "FreqVisit", "NumHousehold"]:
    master_df[col] = master_df[col].astype(str).str.strip()
    master_df.loc[master_df[col].isin(["-1", "-3", "nan", "None"]), col] = "NA"

# Convert to numeric; errors="coerce" turns the "NA" placeholders (and any other
# non-numeric text) into NaN
loneliness_num = pd.to_numeric(master_df["Loneliness"], errors="coerce")
ability_num = pd.to_numeric(master_df["AbilityToConfide"], errors="coerce")
freqvisit_num = pd.to_numeric(master_df["FreqVisit"], errors="coerce")
numhouse_num = pd.to_numeric(master_df["NumHousehold"], errors="coerce")

# --- Loneliness continuous ---
# Raw 0/1 value (already oriented: 1 = lonely)
master_df["Loneliness_continuous"] = loneliness_num

# --- AbilityToConfide continuous ---
# Reverse scale: 5 - raw value (higher = less ability to confide = more isolated)
master_df["AbilityToConfide_continuous"] = 5.0 - ability_num

# --- FreqSoc continuous (composite) ---
# Average of z-scored FreqVisit and z-scored negated NumHousehold
# Higher FreqVisit codes = less frequent visits (more isolated)
# Negated NumHousehold: fewer people = more isolated
freqvisit_z = (freqvisit_num - freqvisit_num.mean()) / freqvisit_num.std()
numhouse_neg = -numhouse_num
numhouse_neg_z = (numhouse_neg - numhouse_neg.mean()) / numhouse_neg.std()

master_df["FreqSoc_continuous"] = (freqvisit_z + numhouse_neg_z) / 2
master_df.loc[freqvisit_num.isna() | numhouse_num.isna(), "FreqSoc_continuous"] = pd.NA

# Set IID = FID (UKBB uses same value for both in single-sample data)
master_df["IID"] = master_df["FID"]

print("Continuous trait summary statistics:")
for trait in ["Loneliness_continuous", "AbilityToConfide_continuous", "FreqSoc_continuous"]:
    vals = pd.to_numeric(master_df[trait], errors="coerce")
    print(f"\n{trait}:")
    print(f"  N valid: {vals.notna().sum()}")
    print(f"  Mean:    {vals.mean():.4f}")
    print(f"  Std:     {vals.std():.4f}")
    print(f"  Range:   [{vals.min():.4f}, {vals.max():.4f}]")

## Save Continuous Output

Writes `isolation_run_continuous.tsv.gz` (gzipped, tab-separated, missing values
written as `NA`) with columns `FID, IID, Loneliness, AbilityToConfide, FreqSoc`.
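
A quick round-trip check can confirm that a file written this way reads back with the expected columns and missing-value markers. The sketch below uses a made-up two-row frame and a temporary directory; the file name `toy_continuous.tsv.gz` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Toy output frame (made-up values) with one missing entry in two columns
toy_out = pd.DataFrame({
    "FID": ["1", "2"],
    "IID": ["1", "2"],
    "Loneliness": [0.0, None],
    "AbilityToConfide": [5.0, 2.0],
    "FreqSoc": [None, 0.25],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_continuous.tsv.gz")
    # Same writer settings as the save cell: tab-separated, gzip, NA for missing
    toy_out.to_csv(toy_path, sep="\t", index=False, compression="gzip", na_rep="NA")
    # Compression is inferred from the .gz extension; "NA" is parsed as missing
    check = pd.read_csv(toy_path, sep="\t", dtype={"FID": str, "IID": str})

columns_ok = list(check.columns) == [
    "FID", "IID", "Loneliness", "AbilityToConfide", "FreqSoc"
]
n_missing = int(check.isna().sum().sum())
```

Reading the IDs back as strings avoids any numeric reformatting of `FID` and `IID`.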

In [None]:
continuous_df = master_df[["FID", "IID", "Loneliness_continuous", "AbilityToConfide_continuous", "FreqSoc_continuous"]].copy()
continuous_df.columns = ["FID", "IID", "Loneliness", "AbilityToConfide", "FreqSoc"]

output_file = f"{OUTPUT_DIR}/isolation_run_continuous.tsv.gz"
continuous_df.to_csv(output_file, sep="\t", index=False, compression="gzip", na_rep="NA")

print(f"Saved: {output_file}")
print(f"Shape: {continuous_df.shape}")
print()
continuous_df.head(10)