# PGS001298 Polygenic Score Analysis

This notebook performs preprocessing, exploratory analysis, and chromosome-specific
visualization of the PGS001298 scoring file using GRCh38 harmonized positions.
It runs the full workflow once the project's GitHub repository has been cloned.

In [None]:
# Step 0: Clone the repo (run once)
!git clone https://github.com/Arun21P/PGS001298_analysis.git
%cd PGS001298_analysis

In [None]:
from pathlib import Path
import gzip
import shutil
import os
import pandas as pd
import matplotlib.pyplot as plt

# Project root
PROJECT_ROOT = Path.cwd()

DATA_DIR = PROJECT_ROOT / "data"
OUTPUT_DIR = PROJECT_ROOT / "output"
OUTPUT_DIR.mkdir(exist_ok=True)

## Downloading and Unzipping Input File

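This step decompresses the gzipped PGS scoring file into the data directory, skipping the extraction if the unzipped file already exists. Dataset cleaning and preprocessing are performed in the following section.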

In [None]:
def unzip_gz_file(gz_path, data_dir):
    """
    Unzips a .gz file and saves the unzipped file ONLY in data directory.
    Returns the unzipped file path.
    """
    gz_path = Path(gz_path)
    data_dir = Path(data_dir)

    txt_filename = gz_path.stem
    txt_path = data_dir / txt_filename

    if not txt_path.exists():
        with gzip.open(gz_path, "rb") as f_in:
            with open(txt_path, "wb") as f_out:
                shutil.copyfileobj(f_in, f_out)
        print(f"Unzipped file saved to: {txt_path}")
    else:
        print(f"Unzipped file already exists: {txt_path}")

    return txt_path
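
As a quick sanity check of the naming and copy logic, the sketch below round-trips a small gzip file in a temporary directory. The file name and contents are made up for illustration. It shows that `Path.stem` strips only the final `.gz` suffix, so `example.txt.gz` becomes `example.txt`.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

# Hypothetical file name and contents, used only for this demonstration.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    gz_demo = tmp_dir / "example.txt.gz"
    with gzip.open(gz_demo, "wb") as f:
        f.write(b"rsID\teffect_weight\nrs1\t0.01\n")

    # Path.stem drops only the last suffix (".gz"), leaving "example.txt".
    txt_demo = tmp_dir / gz_demo.stem

    # Streaming copy, as in unzip_gz_file: decompress without reading
    # the whole file into memory at once.
    with gzip.open(gz_demo, "rb") as f_in, open(txt_demo, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

    demo_name = txt_demo.name
    demo_text = txt_demo.read_text()

print(demo_name)
print(demo_text)
```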

## A.-->a. Data Preprocessing

This function loads the PGS scoring file, removes missing values, keeps only
autosomal variants, enforces correct data types, and saves a cleaned version
for downstream analysis.

In [None]:
def preprocess_pgs_data(file_path):
    """
    Loads and preprocesses the full PGS scoring file.
    """
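    # Read hm_chr as text so that numeric and X/Y labels share one dtype;
    # the autosome filter below compares against string labels.
    df = pd.read_csv(file_path, sep="\t", comment="#", dtype={"hm_chr": str})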


    # Report missing values per column before cleaning
    print(df.isna().sum())

    # Output:
    # rsID                    118
    # chr_name                  0
    # chr_position              0
    # effect_allele             0
    # other_allele              0
    # effect_weight             0
    # hm_source                 0
    # hm_rsID                 120
    # hm_chr                    3
    # hm_pos                    3
    # hm_inferOtherAllele    9227
    # dtype: int64



    # Drop columns that are entirely empty, then rows with any missing value
    df = df.dropna(axis=1, how="all")
    df = df.dropna()
    print(df.isna().sum())

    # Output:
    # rsID             0
    # chr_name         0
    # chr_position     0
    # effect_allele    0
    # other_allele     0
    # effect_weight    0
    # hm_source        0
    # hm_rsID          0
    # hm_chr           0
    # hm_pos           0
    # dtype: int64




    # Alternative (disabled): keep all chromosomes (autosomes + sex) by mapping X/Y to integers
    # chr_map = {"X": 23, "Y": 24}
    # df["hm_chr"] = df["hm_chr"].replace(chr_map)
    # df["hm_chr"] = df["hm_chr"].astype(int)

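    # Keep only autosomes (1–22); .copy() avoids SettingWithCopyWarning
    # when the columns are reassigned below
    autosomes = [str(i) for i in range(1, 23)]
    df = df[df["hm_chr"].astype(str).isin(autosomes)].copy()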

    # Correct data types
    df["hm_chr"] = df["hm_chr"].astype(int)
    df["hm_pos"] = df["hm_pos"].astype(int)
    df["effect_weight"] = df["effect_weight"].astype(float)
    df["rsID"] = df["rsID"].astype(str)

    # Drop GRCh37 columns
    df = df.drop(columns=["chr_name", "chr_position"])

    output_file = OUTPUT_DIR / "PGS001298_hmPOS_GRCh38_cleaned.txt"
    df.to_csv(output_file, sep="\t", index=False)

    print(f"Cleaned DataFrame saved to {output_file}")
    return df
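
To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same sequence (drop empty columns, drop incomplete rows, keep autosomes, cast types) to a five-row synthetic table. The rows are invented for illustration and are not taken from PGS001298. Reading `hm_chr` as text is an assumption made here so that the string comparison in the autosome filter behaves predictably.

```python
import io
import pandas as pd

# Synthetic table (not real PGS001298 data) in a tab-separated layout.
toy_tsv = (
    "rsID\thm_chr\thm_pos\teffect_weight\thm_inferOtherAllele\n"
    "rs1\t1\t1000\t0.02\t\n"
    "rs2\t21\t2000\t-0.01\t\n"
    "\t2\t3000\t0.03\t\n"       # missing rsID
    "rs4\tX\t4000\t0.05\t\n"    # sex chromosome
    "rs5\t\t\t0.04\t\n"         # missing harmonized position
)
toy = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype={"hm_chr": str})

toy = toy.dropna(axis=1, how="all")  # removes the all-empty hm_inferOtherAllele column
toy = toy.dropna()                   # removes the rows with a missing rsID or position

autosomes = [str(i) for i in range(1, 23)]
toy = toy[toy["hm_chr"].isin(autosomes)].copy()  # removes the X row

toy["hm_chr"] = toy["hm_chr"].astype(int)
toy["hm_pos"] = toy["hm_pos"].astype(int)
print(toy)
```

Only the two complete autosomal rows remain, which mirrors what `preprocess_pgs_data` does on the full file.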

## A.-->b. Exploratory Data Analysis

This step reports dataset dimensions, chromosome-wise variant counts, and
visualizes the distribution of effect weights across all autosomal chromosomes.

In [None]:
def exploratory_analysis(clean_df):
    # Number of variants and columns
    print("Number of variants:", clean_df.shape[0])
    print("Number of columns:", clean_df.shape[1])
    # Summary stats for numeric columns
    print(clean_df[["hm_chr", "hm_pos", "effect_weight"]].describe())
    # Unique chromosomes
    print("Unique chromosomes:", sorted(clean_df["hm_chr"].unique()))
    # Count of variants per chromosome
    print("\nVariants per chromosome:")
    print(clean_df["hm_chr"].value_counts().sort_index())

    plt.figure(figsize=(6,4))
    plt.hist(clean_df["effect_weight"], bins=50)
    plt.xlabel("Effect Weight")
    plt.ylabel("Frequency")
    plt.title("Distribution of Effect Weight (All Chromosomes)")

    # Save the figure to the output directory (created during setup)
    save_path = OUTPUT_DIR / "Distribution_of_Effect_Weight.png"
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    plt.show()

    print(f"Plot saved to {save_path}")

## A.-->c. Chromosome-Specific Effect Weight Distribution (Chromosome 21)

This function generates a histogram of effect weights for chromosome 21 and
summarizes variant-level statistics.

In [None]:
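def plot_chr_effect_weight_histogram(clean_df, chr_num):
    """
    Plots and saves a histogram of effect weights for one chromosome and
    prints summary statistics. chr_num must be an integer (e.g. 21),
    matching the integer hm_chr column of the cleaned DataFrame.
    """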
    chr_df = clean_df[clean_df["hm_chr"] == chr_num][["hm_chr", "hm_pos", "effect_weight"]]

    if chr_df.empty:
        print(f"No variants found for chromosome {chr_num}.")
        return

    plt.figure(figsize=(6,4))
    plt.hist(chr_df["effect_weight"], bins=50)
    plt.xlabel("Effect Weight")
    plt.ylabel("Frequency")
    plt.title(f"Effect Weight Distribution on Chromosome {chr_num}")

    save_path = OUTPUT_DIR / f"chr{chr_num}_effect_weight_hist.png"
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    plt.show()

    print(f"Plot saved to {save_path}")
    print(chr_df["effect_weight"].describe())
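
To compare a single chromosome against the full set of variants, the two `describe()` summaries can be placed side by side. The sketch below is one way to do this, assuming a cleaned table with integer `hm_chr` and numeric `effect_weight` columns. The values in `toy_df` are synthetic and only illustrate the layout.

```python
import pandas as pd

# Synthetic weights (not real PGS001298 values) in the cleaned-table layout.
toy_df = pd.DataFrame({
    "hm_chr": [1, 1, 2, 21, 21, 21],
    "effect_weight": [0.02, -0.01, 0.03, -0.02, 0.01, 0.00],
})

chr_num = 21
chr_weights = toy_df.loc[toy_df["hm_chr"] == chr_num, "effect_weight"]

# One column per group; rows are the statistics returned by describe().
summary = pd.concat(
    {"all": toy_df["effect_weight"].describe(), f"chr{chr_num}": chr_weights.describe()},
    axis=1,
)
print(summary)
```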

## Main Execution and Conclusion

In [None]:
gz_file = DATA_DIR / "PGS001298_hmPOS_GRCh38.txt.gz"
txt_file = unzip_gz_file(gz_file, DATA_DIR)

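# Alternative: pandas can read the .gz file directly, which skips the unzip step.
# The result is the raw table and still needs the cleaning in preprocess_pgs_data.
# raw_df = pd.read_csv(gz_file, sep="\t", comment="#", compression="gzip")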

clean_df = preprocess_pgs_data(txt_file)
exploratory_analysis(clean_df)
plot_chr_effect_weight_histogram(clean_df, chr_num=21)

print("PGS001298 Analysis Completed")
print("Conclusion: The effect weight distribution across all chromosomes is approximately symmetric and centered around zero, indicating that most variants have small additive effects, as expected for a polygenic trait. Chromosome 21 shows a similar pattern with fewer variants and no chromosome-specific bias.")

### Conclusion

The effect weight distribution across all chromosomes is approximately symmetric and centered around zero, indicating that most variants have small additive effects, as expected for a polygenic trait. Chromosome 21 shows a similar pattern with fewer variants and no chromosome-specific bias.
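
The symmetry described above can also be checked numerically rather than by eye. The following is a minimal sketch on synthetic weights, not a result from PGS001298. In the notebook, `clean_df["effect_weight"]` would take the place of `weights`. A mean and median near zero, a sample skewness near zero, and a share of positive weights near one half would all be consistent with a distribution that is symmetric around zero.

```python
import pandas as pd

# Synthetic, perfectly symmetric weights used only to illustrate the checks.
weights = pd.Series([-0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03])

checks = {
    "mean": weights.mean(),
    "median": weights.median(),
    "skewness": weights.skew(),               # sample skewness; 0 for symmetric data
    "share_positive": (weights > 0).mean(),   # fraction of variants with a positive weight
}

for name, value in checks.items():
    print(f"{name}: {value:.4f}")
```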