## Description
There are three main parts:
1. Preliminary processing for munging
2. Munging for LDSC
3. LDSC to calculate heritability

## 1. Preliminary
This preliminary section is run in `python3`.

Read `CAD_clean.csv` and write the columns needed for munging to a tab-delimited `.txt` file.

In [1]:
import pandas as pd

# Step 1: Read the CSV file
df = pd.read_csv("CAD_clean.csv")

# Step 2: Select and reorder the columns needed for munging
df_subset = df[["SNP", "A1", "A2", "P", "Z", "MAF"]]

# Step 3: Save as a tab-delimited text file
df_subset.to_csv("CAD_clean_munge.txt", sep="\t", index=False)


In [2]:
df_subset.head()

Unnamed: 0,A1,A2,MAF,P,SNP,Z
0,C,T,0.158264,0.452802,rs143225517,0.75075
1,A,G,0.763018,0.73946,rs3094315,-0.332568
2,G,A,0.740969,0.846265,rs3131972,-0.193885
3,C,T,0.744287,0.775066,rs3131971,0.285755
4,A,C,0.775368,0.706527,rs61770173,-0.376526
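
As a quick sanity check on the table above, the `P` and `Z` columns should agree if `P` is a two-sided p-value from a standard normal `Z`. The sketch below assumes that relationship (the source does not state it) and uses only the standard library; `two_sided_p` is a hypothetical helper name, not part of LDSC or pandas.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)) for Z ~ N(0, 1)
    return math.erfc(abs(z) / math.sqrt(2))

# First two rows of the table above: (Z, reported P)
rows = [(0.75075, 0.452802), (-0.332568, 0.73946)]
for z, p_reported in rows:
    # Print the recomputed p-value next to the reported one for comparison
    print(z, round(two_sided_p(z), 4), p_reported)
```

If the recomputed and reported values disagree noticeably, the `Z` column may have been derived differently (for example from a one-sided test), which is worth resolving before munging.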


## 2. Munging for LDSC
The rest of the notebook is run in `python2`, which LDSC version 1.0.1 requires.

In [3]:
%%bash

# Input parameters
munge='/gpfs/gibbs/project/bdsi/shared/Genetics/ldsc/munge_sumstats.py' # Path to munge script
gwas='CAD_clean_munge.txt' # Path to GWAS summary statistics
path_out='./' # output path
file_out='munge_CAD' # prefix of munged output
## required columns
snp='SNP' # rsID
a1='A1' # effect allele
a2='A2' # non-effect allele
frq='MAF' # frequency of effect allele
p='P' # p-value
effect_size='Z' # column holding the signed statistic: 'OR' (odds ratio) or 'Z' (z statistic)
null_value=$([ "$effect_size" = "OR" ] && echo 1 || echo 0) # null value of the effect size
## Optional column: if the file has no sample size column, set the total sample size manually
N=184305
## Set threshold for quality control on minor allele frequency
maf=0.01 # default

# Run
python ${munge} \
    --sumstats ${gwas} \
    --snp ${snp} \
    --a1 ${a1} \
    --a2 ${a2} \
    --frq ${frq} \
    --p ${p} \
    --N ${N} \
    --signed-sumstats ${effect_size},${null_value} \
    --maf-min ${maf} \
    --out ${path_out}/${file_out}

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./munge_sumstats.py \
--signed-sumstats Z,0 \
--out .//munge_CAD \
--frq MAF \
--N 184305.0 \
--a1 A1 \
--a2 A2 \
--snp SNP \
--sumstats CAD_clean_munge.txt \
--p P 

Interpreting column names as follows:
A1:	Allele 1, interpreted as ref allele for signed sumstat.
P:	p-Value
A2:	Allele 2, interpreted as non-ref allele for signed sumstat.
SNP:	Variant ID (e.g., rs number)
MAF:	Allele frequency
Z:	Directional summary statistic as specified by --signed-sumstats.

Reading sumstats from CAD_clean_munge.txt into memory 5000000 SNPs at a time.
.. done
Read 9455778 SNPs from --sumstats file.
Removed 0 SNPs with missing values.
Removed 0 SNPs with I
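
The `null_value` line in the cell above picks the value of the signed statistic that corresponds to no effect: 1 for an odds ratio, 0 for a z statistic. A minimal Python sketch of the same rule follows; `signed_sumstats_arg` is a hypothetical helper and, like the bash line, it treats any column name other than `OR` as having a null of 0.

```python
def signed_sumstats_arg(column):
    """Build the 'NAME,NULL' string passed to --signed-sumstats."""
    # An odds ratio of 1 means no effect; z statistics are centered at 0.
    null_value = 1 if column == "OR" else 0
    return "{},{}".format(column, null_value)

print(signed_sumstats_arg("Z"))
print(signed_sumstats_arg("OR"))
```

For the `Z` column used here, this gives the same `Z,0` argument that appears in the logged call.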

### 2.1 Briefly view the munged output file

In [2]:
%%bash
zcat munge_CAD.sumstats.gz|head

SNP	A1	A2	Z	N
rs143225517	C	T	0.751	184305.000
rs3094315	A	G	-0.333	184305.000
rs3131972	G	A	-0.194	184305.000
rs3131971	C	T	0.286	184305.000
rs61770173	A	C	-0.377	184305.000
rs2073813	A	G	0.346	184305.000
rs3131969	G	A	-0.394	184305.000
rs3131968	G	A	-0.410	184305.000
rs3131967	C	T	-0.555	184305.000



gzip: stdout: Broken pipe
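
The `gzip: stdout: Broken pipe` message is harmless: `head` exits after ten lines and closes the pipe while `zcat` is still writing. One way to preview the file without that message is to read only the first lines in Python. This is a sketch using the standard library; `preview_gz` is a hypothetical helper name.

```python
import gzip
import os
from itertools import islice

def preview_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    # islice stops after n lines, so the rest of the file is never read
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

if __name__ == "__main__":
    path = "munge_CAD.sumstats.gz"  # munged output from the cell above
    if os.path.exists(path):
        for line in preview_gz(path):
            print(line)
```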


## 3. LDSC to calculate heritability

In [4]:
%%bash

# Input parameters
ldsc='/gpfs/gibbs/project/bdsi/shared/Genetics/ldsc/ldsc.py' # path to ldsc.py
ldscore='/gpfs/gibbs/project/bdsi/shared/Genetics/ldsc/LDscore/' # folder for LD scores and corresponding regression weights
gwas='munge_CAD.sumstats.gz' # path to munged gwas summary statistics
output_path='./' # directory of output files

python ${ldsc} \
    --h2 ${gwas} \
    --ref-ld-chr ${ldscore}LDscore.@ \
    --w-ld-chr ${ldscore}LDscore.@ \
    --out ${output_path}ldsc_CAD

*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call: 
./ldsc.py \
--h2 munge_CAD.sumstats.gz \
--ref-ld-chr /gpfs/gibbs/project/bdsi/shared/Genetics/ldsc/LDscore/LDscore.@ \
--out ./ldsc_CAD \
--w-ld-chr /gpfs/gibbs/project/bdsi/shared/Genetics/ldsc/LDscore/LDscore.@ 

Beginning analysis at Thu Jul 17 09:25:30 2025
Reading summary statistics from munge_CAD.sumstats.gz ...
Read summary statistics for 7088729 SNPs.
Reading reference panel LD Score from /gpfs/gibbs/project/bdsi/shared/Genetics/ldsc/LDscore/LDscore.[1-22] ... (ldscore_fromlist)
Read reference panel LD Scores for 1190321 SNPs.
Removing partitioned LD Scores with zero variance.
Reading regression weight LD Score from /gpfs/gibbs/pro

  coef = np.linalg.lstsq(x, y)
