## Description
This notebook reads the GWAS summary statistics for the CAD dataset, cleans them, and writes the result to a `.csv` file.
- Input file: `GWAS_CAD.txt`
- Output file: `CAD_clean.csv`

**Data Cleaning steps:**
1. Rename columns
2. Drop unneeded columns
3. Compute Z-score column (`Z = BETA / SE_BETA`)
4. Compute OR column (`OR = exp(BETA)`)
5. Reorder columns
6. Sort dataframe by `(CHR, BP)`
7. Write out cleaned dataset to `.csv` file format
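
Steps 3 and 4 are element-wise transformations. As a minimal sketch, assuming `BETA` is a log odds ratio (which the OR step implies), the toy frame below uses the `BETA` and `SE_BETA` values that appear in the first row of the cleaned output (`rs143225517`). The name `toy` is illustrative only and is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy values matching the first row of the cleaned output (rs143225517).
toy = pd.DataFrame({"BETA": [0.013006], "SE_BETA": [0.017324]})

# Step 3: the Z-score is the effect estimate divided by its standard error.
toy["Z"] = toy["BETA"] / toy["SE_BETA"]

# Step 4: assuming BETA is a log odds ratio, exponentiating gives the OR.
toy["OR"] = np.exp(toy["BETA"])

print(toy.round(6))
```

The printed `Z` and `OR` should agree with the corresponding row of the cleaned dataset up to rounding of the inputs.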

**Expected final cleaned dataset columns:**
1. `CHR`
2. `SNP`
3. `BP`
4. `A1`
5. `A2`
6. `OR`
7. `BETA`
8. `SE_BETA`
9. `P`
10. `Z`
11. `MAF`
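
One way to confirm that a cleaned frame matches this layout is sketched below. The helper `has_expected_layout` and the two-row `toy` frame are hypothetical and not part of the notebook; the toy values are copied from the first two rows of the output shown later.

```python
import pandas as pd

EXPECTED_COLUMNS = ["CHR", "SNP", "BP", "A1", "A2", "OR",
                    "BETA", "SE_BETA", "P", "Z", "MAF"]

def has_expected_layout(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if columns match and rows are sorted by (CHR, BP)."""
    if list(df.columns) != EXPECTED_COLUMNS:
        return False
    # If the frame is already sorted, re-sorting leaves the row order unchanged.
    resorted = df.sort_values(by=["CHR", "BP"])
    return resorted.index.equals(df.index)

# Two rows copied from the cleaned output, already in (CHR, BP) order.
toy = pd.DataFrame(
    [
        [1, "rs143225517", 751756, "C", "T", 1.013091, 0.013006, 0.017324, 0.452802, 0.75075, 0.158264],
        [1, "rs3094315", 752566, "A", "G", 0.994771, -0.005243, 0.015765, 0.73946, -0.332568, 0.763018],
    ],
    columns=EXPECTED_COLUMNS,
)

print(has_expected_layout(toy))
```

A check like this could be run just before the final write step to catch a misspelled column name or a skipped sort.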

In [1]:
import pandas as pd
import numpy as np

df1 = pd.read_csv("/gpfs/gibbs/project/bdsi/shared/Genetics/data/real_data/GWAS_CAD.txt", sep="\t")

In [2]:
## Briefly look at the uncleaned dataset's column names
#df1.columns

In [3]:
## Step 1: Rename the column names

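df1_new = df1.rename(columns={
    "markername": "SNP",
    "chr": "CHR",
    "bp_hg19": "BP",
    "effect_allele": "A1",
    "noneffect_allele": "A2",
    # Note: this is the effect-allele frequency, not folded to the minor allele,
    # so values in the "MAF" column can exceed 0.5.
    "effect_allele_freq": "MAF",
    "beta": "BETA",
    "se_dgc": "SE_BETA",
    "p_dgc": "P"
})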

In [4]:
## Step 2: Drop unneeded columns
df1_new = df1_new.drop(columns=["median_info", "model", "het_pvalue", "n_studies"], errors='ignore')

## Step 3: Compute Z-score column
df1_new["Z"] = df1_new["BETA"] / df1_new["SE_BETA"]

## Step 4: Compute OR column
df1_new['OR'] = np.exp(df1_new['BETA'])

## Step 5: Reorder columns to desired fixed order
desired_order = ["CHR", "SNP", "BP", "A1", "A2", "OR", "BETA", "SE_BETA", "P", "Z", "MAF"]
df1_new = df1_new[desired_order]


## Step 6: Sort by CHR then BP
df1_new = df1_new.sort_values(by=['CHR', 'BP'], na_position='last').reset_index(drop=True)

df1_new.head()
#df1_new.dtypes
#len(df1_new) ## 9455778 rows

|   | CHR | SNP | BP | A1 | A2 | OR | BETA | SE_BETA | P | Z | MAF |
|---|-----|-----|----|----|----|----|------|---------|---|---|-----|
| 0 | 1 | rs143225517 | 751756 | C | T | 1.013091 | 0.013006 | 0.017324 | 0.452802 | 0.750750 | 0.158264 |
| 1 | 1 | rs3094315 | 752566 | A | G | 0.994771 | -0.005243 | 0.015765 | 0.739460 | -0.332568 | 0.763018 |
| 2 | 1 | rs3131972 | 752721 | G | A | 0.996973 | -0.003032 | 0.015638 | 0.846265 | -0.193885 | 0.740969 |
| 3 | 1 | rs3131971 | 752894 | C | T | 1.004651 | 0.004640 | 0.016238 | 0.775066 | 0.285755 | 0.744287 |
| 4 | 1 | rs61770173 | 753405 | A | C | 0.993729 | -0.006291 | 0.016708 | 0.706526 | -0.376526 | 0.775368 |


In [5]:
#print("CAD dataset data cleaning done.")

In [6]:
## Step 7: Write cleaned CAD dataset to .csv file
df1_new.to_csv("CAD_clean.csv", index=False)
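
Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra `Unnamed: 0` column. The round trip below is a small sketch on a toy frame; the names `toy`, `back`, and `toy_clean.csv` are illustrative, and a temporary directory is used so nothing is written next to the real output.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a few of the cleaned columns (values from the first two output rows).
toy = pd.DataFrame({
    "CHR": [1, 1],
    "SNP": ["rs143225517", "rs3094315"],
    "BP": [751756, 752566],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_clean.csv")
    toy.to_csv(path, index=False)   # write without the row index
    back = pd.read_csv(path)        # read the file back in

print(list(back.columns))
```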