In [1]:
import gwaslab as gl 
import genal
from config import *
import pandas as pd 




## Load Data

1. `gl.Sumstats` supports loading data from various sources, such as file paths and pandas DataFrames.
2. You can build a pipeline to convert and save raw summary statistics into supported formats. Several commonly used formats are already supported — see the [formats directory](https://github.com/Cloufield/formatbook/tree/main/formats) for details (e.g., GWAS Catalog format).

> **Note:** When working with large datasets, parsing CSV or gzip-compressed text files can be time-consuming. It is highly recommended to preprocess your data once and save it in a binary columnar format such as `parquet` (used below) or `feather` for significantly faster loading and downstream analysis.



In [None]:
%%time
filename = "GCST90449056.tsv.gz"
data = pd.read_csv(data_dir / filename, sep = r"\s+", compression="gzip")
data

Unnamed: 0,chromosome,base_pair_location,effect_allele,other_allele,beta,standard_error,effect_allele_frequency,p_value,rs_id,n
0,1,752566,G,A,-0.003855,0.002427,0.1637,0.112194,rs3094315,663139
1,1,885689,G,A,-0.012647,0.004002,0.0537,0.001575,rs4970452,656956
2,1,885699,A,G,-0.012867,0.003977,0.0544,0.001216,rs4970376,656954
3,1,886006,T,C,-0.012928,0.003981,0.0543,0.001164,rs4970375,656955
4,1,887801,A,G,-0.012811,0.003977,0.0544,0.001278,rs3828047,656954
...,...,...,...,...,...,...,...,...,...,...
5753790,22,51172460,C,T,-0.004041,0.005831,0.0225,0.488332,rs5770824,715356
5753791,22,51175626,G,A,0.001762,0.003547,0.0613,0.619277,rs3810648,739079
5753792,22,51177257,T,C,-0.000743,0.004725,0.0348,0.875139,rs73174437,713374
5753793,22,51178090,A,G,0.001494,0.003735,0.0549,0.689183,rs2285395,739077


This summary statistics file was downloaded directly from the GWAS Catalog and is already in a supported format, so it can be loaded with `gl.Sumstats()` without additional conversion.

Because columns such as `rs_id` and `n` may not always be present in the summary statistics, the next cell checks for them before adding them to the column mapping.

In [None]:
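# Build the keyword arguments that map input columns to GWASLab columns
fmt_dict = dict(
    fmt = "gwascatalog",
    build = "38"  # genome build of the input; verify it first, or run versionConvert beforehand
)
# Optional columns: map them only if the file actually has them
if "rs_id" in data.columns:
    fmt_dict["rsid"] = "rs_id"
if "n" in data.columns:
    fmt_dict["n"] = "n"

# Check for the column that is actually being mapped (variant_id), not "snpid"
if "variant_id" in data.columns:
    fmt_dict["snpid"] = "variant_id"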

sumstats = gl.Sumstats(
    sumstats=data,
    **fmt_dict
)
sumstats.data

2025/06/10 01:27:22 GWASLab v3.6.3 https://cloufield.github.io/gwaslab/
2025/06/10 01:27:22 (C) 2022-2025, Yunye He, Kamatani Lab, GPL-3.0 license, gwaslab@gmail.com
2025/06/10 01:27:22 Python version: 3.9.23 | packaged by conda-forge | (main, Jun  4 2025, 17:57:12) 
[GCC 13.3.0]
2025/06/10 01:27:22 Start to load format from formatbook....
2025/06/10 01:27:22  -gwascatalog format meta info:
2025/06/10 01:27:22   - format_name  : gwascatalog
2025/06/10 01:27:22   - format_source  : https://www.ebi.ac.uk/gwas/docs/methods/summary-statistics
2025/06/10 01:27:22   - format_version  : 20220726
2025/06/10 01:27:22  -gwascatalog to gwaslab format dictionary:
2025/06/10 01:27:22   - gwascatalog keys: variant_id,chromosome,base_pair_location,other_allele,effect_allele,beta,effect_allele_frequency,standard_error,p-value,odds_ratio,ci_lower,ci_upper
2025/06/10 01:27:22   - gwaslab values: SNPID,CHR,POS,NEA,EA,BETA,EAF,SE,P,OR,OR_95L,OR_95U
2025/06/10 01:27:22 Start to initialize gl.Sumstats from 

Unnamed: 0,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,N,STATUS
0,rs3094315,1,752566,G,A,0.1637,-0.003855,0.002427,663139,3899999
1,rs4970452,1,885689,G,A,0.0537,-0.012647,0.004002,656956,3899999
2,rs4970376,1,885699,A,G,0.0544,-0.012867,0.003977,656954,3899999
3,rs4970375,1,886006,T,C,0.0543,-0.012928,0.003981,656955,3899999
4,rs3828047,1,887801,A,G,0.0544,-0.012811,0.003977,656954,3899999
...,...,...,...,...,...,...,...,...,...,...
5753790,rs5770824,22,51172460,C,T,0.0225,-0.004041,0.005831,715356,3899999
5753791,rs3810648,22,51175626,G,A,0.0613,0.001762,0.003547,739079,3899999
5753792,rs73174437,22,51177257,T,C,0.0348,-0.000743,0.004725,713374,3899999
5753793,rs2285395,22,51178090,A,G,0.0549,0.001494,0.003735,739077,3899999
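
The `build` value passed above has to match the coordinates in the file. One cheap sanity check, sketched below in plain Python, is to compare the largest position seen on a chromosome with that chromosome's length in each assembly: a position beyond the end of a chromosome rules that assembly out. The helper name `builds_compatible_with` is hypothetical and not part of GWASLab, and only chr1 and chr22 lengths are listed to keep the sketch short. The check can only exclude a build, never confirm one. Note that a chr22 position such as 51,178,090 (the last row above) lies beyond the GRCh38 length of chr22, so it is worth double-checking the build of this file.

```python
# Chromosome lengths in base pairs for two assemblies.
# Only chr1 and chr22 are listed; extend the tables as needed.
CHROM_LENGTHS = {
    "GRCh37": {"1": 249_250_621, "22": 51_304_566},
    "GRCh38": {"1": 248_956_422, "22": 50_818_468},
}


def builds_compatible_with(max_pos_by_chrom):
    """Return the builds whose chromosome lengths can hold every observed position.

    max_pos_by_chrom maps a chromosome name (str) to the largest position seen.
    Chromosomes missing from CHROM_LENGTHS are ignored.
    """
    compatible = []
    for build, lengths in CHROM_LENGTHS.items():
        fits = all(
            pos <= lengths[chrom]
            for chrom, pos in max_pos_by_chrom.items()
            if chrom in lengths
        )
        if fits:
            compatible.append(build)
    return compatible


# Largest positions taken from the rows displayed above
observed = {"1": 887_801, "22": 51_178_090}
print(builds_compatible_with(observed))
```

With a real table, the input dictionary could be built from the per-chromosome maximum of the position column.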


Standardize the data to remove duplicates and fix common errors, so that it is suitable for further analysis.

In [17]:
n_cores = 6  # adjust to the number of CPU cores available
sumstats.basic_check(remove=True,remove_dup=True, n_cores=n_cores)


2025/06/10 01:28:44 Start to check SNPID/rsID...v3.6.3
2025/06/10 01:28:44  -Current Dataframe shape : 5753795 x 10 ; Memory usage: 356.19 MB
2025/06/10 01:28:44  -Checking rsID data type...
2025/06/10 01:28:44  -Checking if rsID is rsxxxxxx...
2025/06/10 01:28:51  -Checking if CHR:POS:NEA:EA is mixed in rsID column ...
2025/06/10 01:28:53  -Number of CHR:POS:NEA:EA mixed in rsID column : 0
2025/06/10 01:28:54  -Number of Unrecognized rsID : 0
2025/06/10 01:28:54  -A look at the unrecognized rsID : set() ...
2025/06/10 01:28:54 Finished checking SNPID/rsID.
2025/06/10 01:28:54 Start to fix chromosome notation (CHR)...v3.6.3
2025/06/10 01:28:54  -Current Dataframe shape : 5753795 x 10 ; Memory usage: 356.19 MB
2025/06/10 01:28:54  -Checking CHR data type...
2025/06/10 01:28:55  -Variants with standardized chromosome notation: 5753795
2025/06/10 01:28:56  -All CHR are already fixed...
2025/06/10 01:29:01 Finished fixing chromosome notation (CHR).
2025/06/10 01:29:01 Start to fix basepair

In [18]:
sumstats.data

Unnamed: 0,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,N,STATUS
0,rs3094315,1,752566,G,A,0.1637,-0.003855,0.002427,663139,3850099
1,rs4970452,1,885689,G,A,0.0537,-0.012647,0.004002,656956,3850099
2,rs4970376,1,885699,A,G,0.0544,-0.012867,0.003977,656954,3850099
3,rs4970375,1,886006,T,C,0.0543,-0.012928,0.003981,656955,3850099
4,rs3828047,1,887801,A,G,0.0544,-0.012811,0.003977,656954,3850099
...,...,...,...,...,...,...,...,...,...,...
5753790,rs5770824,22,51172460,C,T,0.0225,-0.004041,0.005831,715356,3850099
5753791,rs3810648,22,51175626,G,A,0.0613,0.001762,0.003547,739079,3850099
5753792,rs73174437,22,51177257,T,C,0.0348,-0.000743,0.004725,713374,3850099
5753793,rs2285395,22,51178090,A,G,0.0549,0.001494,0.003735,739077,3850099


The file has now passed the basic checks and has unified column names.

Save it in Parquet format for later use in Python.

In [24]:
filename.split(".")[0]

'GCST90449056'
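
The `split(".")[0]` idiom above keeps everything before the first dot, which is what is needed for a double extension such as `.tsv.gz`. The sketch below wraps it in a small helper (the name `base_name` is hypothetical) and contrasts it with `Path.stem`, which strips only the last suffix. It assumes the accession itself contains no dots.

```python
from pathlib import Path


def base_name(filename):
    """Return the part of the file name before the first dot.

    Any leading directories are dropped first, so dots in parent
    folder names do not interfere.
    """
    return Path(filename).name.split(".")[0]


print(base_name("GCST90449056.tsv.gz"))               # everything before the first dot
print(base_name("some.dir/GCST90449056.gwaslab.parquet"))
print(Path("GCST90449056.tsv.gz").stem)               # only the last suffix is removed
```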

In [25]:
parquet_path = data_dir / "parquet"

parquet_path.mkdir(exist_ok=True, parents=True)
to_file = str(parquet_path / filename.split(".")[0])
sumstats.to_format(
    path=to_file,
    fmt="gwaslab",
    tab_fmt="parquet"
)

2025/06/10 01:38:43 Start to convert the output sumstats in:  gwaslab  format
2025/06/10 01:38:44  -Start outputting sumstats in gwaslab format...
2025/06/10 01:38:44  -gwaslab format will be loaded...
2025/06/10 01:38:44  -gwaslab format meta info:
2025/06/10 01:38:44   - format_name  : gwaslab
2025/06/10 01:38:44   - format_source  : https://cloufield.github.io/gwaslab/
2025/06/10 01:38:44   - format_version  : 20231220_v4
2025/06/10 01:38:44  -Output path: /home/xutingfeng/job/finemap_tools/data/parquet/GCST90449056.gwaslab.parquet
2025/06/10 01:38:44  -Output columns: rsID,CHR,POS,EA,NEA,EAF,BETA,SE,N,STATUS
2025/06/10 01:38:44  -Writing sumstats to: /home/xutingfeng/job/finemap_tools/data/parquet/GCST90449056.gwaslab.parquet...
2025/06/10 01:38:48  -Saving log file to: /home/xutingfeng/job/finemap_tools/data/parquet/GCST90449056.gwaslab.log
2025/06/10 01:38:48 Finished outputting successfully!


## Reformat to other formats

Sometimes the data must be converted to another format, such as LDSC, so that it can be loaded by other software.



### LDSC

In [28]:
filename = "GCST90449056.gwaslab.parquet"
to_load_file = parquet_path / filename

data = pd.read_parquet(to_load_file)
data

Unnamed: 0,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,N,STATUS
0,rs3094315,1,752566,G,A,0.1637,-0.003855,0.002427,663139,3850099
1,rs4970452,1,885689,G,A,0.0537,-0.012647,0.004002,656956,3850099
2,rs4970376,1,885699,A,G,0.0544,-0.012867,0.003977,656954,3850099
3,rs4970375,1,886006,T,C,0.0543,-0.012928,0.003981,656955,3850099
4,rs3828047,1,887801,A,G,0.0544,-0.012811,0.003977,656954,3850099
...,...,...,...,...,...,...,...,...,...,...
5753790,rs5770824,22,51172460,C,T,0.0225,-0.004041,0.005831,715356,3850099
5753791,rs3810648,22,51175626,G,A,0.0613,0.001762,0.003547,739079,3850099
5753792,rs73174437,22,51177257,T,C,0.0348,-0.000743,0.004725,713374,3850099
5753793,rs2285395,22,51178090,A,G,0.0549,0.001494,0.003735,739077,3850099


Now load it into a `gl.Sumstats` object.

In [27]:
mysumstats = gl.Sumstats(
    data, 
    fmt = "gwaslab"
)
mysumstats.data

2025/06/10 01:42:11 GWASLab v3.6.3 https://cloufield.github.io/gwaslab/
2025/06/10 01:42:11 (C) 2022-2025, Yunye He, Kamatani Lab, GPL-3.0 license, gwaslab@gmail.com
2025/06/10 01:42:11 Python version: 3.9.23 | packaged by conda-forge | (main, Jun  4 2025, 17:57:12) 
[GCC 13.3.0]
2025/06/10 01:42:11 Start to load format from formatbook....
2025/06/10 01:42:11  -gwaslab format meta info:
2025/06/10 01:42:11   - format_name  : gwaslab
2025/06/10 01:42:11   - format_source  : https://cloufield.github.io/gwaslab/
2025/06/10 01:42:11   - format_version  : 20231220_v4
2025/06/10 01:42:11 Start to initialize gl.Sumstats from pandas DataFrame ...
2025/06/10 01:42:12  -Reading columns          : EAF,EA,CHR,POS,rsID,SE,NEA,STATUS,N,BETA
2025/06/10 01:42:12  -Renaming columns to      : EAF,EA,CHR,POS,rsID,SE,NEA,STATUS,N,BETA
2025/06/10 01:42:12  -Current Dataframe shape : 5753795  x  10
2025/06/10 01:42:12  -Initiating a status column: STATUS ...
2025/06/10 01:42:14 Start to reorder the columns.

Unnamed: 0,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,N,STATUS
0,rs3094315,1,752566,G,A,0.1637,-0.003855,0.002427,663139,9999999
1,rs4970452,1,885689,G,A,0.0537,-0.012647,0.004002,656956,9999999
2,rs4970376,1,885699,A,G,0.0544,-0.012867,0.003977,656954,9999999
3,rs4970375,1,886006,T,C,0.0543,-0.012928,0.003981,656955,9999999
4,rs3828047,1,887801,A,G,0.0544,-0.012811,0.003977,656954,9999999
...,...,...,...,...,...,...,...,...,...,...
5753790,rs5770824,22,51172460,C,T,0.0225,-0.004041,0.005831,715356,9999999
5753791,rs3810648,22,51175626,G,A,0.0613,0.001762,0.003547,739079,9999999
5753792,rs73174437,22,51177257,T,C,0.0348,-0.000743,0.004725,713374,9999999
5753793,rs2285395,22,51178090,A,G,0.0549,0.001494,0.003735,739077,9999999


In [33]:
ldsc_dir = data_dir / "ldsc"
ldsc_dir.mkdir(exist_ok=True, parents=True)
to_file = str(ldsc_dir/filename.split(".")[0])
mysumstats.to_format(to_file,fmt="ldsc",hapmap3=False,exclude_hla=True,md5sum=True)


2025/06/10 01:44:06 Start to convert the output sumstats in:  ldsc  format
2025/06/10 01:44:07  -Excluded 32666 variants in HLA region (chr6: 25000000-34000000 )...
2025/06/10 01:44:07  -Formatting statistics ...
2025/06/10 01:44:13  -Float statistics formats:
2025/06/10 01:44:13   - Columns       : ['EAF', 'BETA', 'SE']
2025/06/10 01:44:13   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}']
2025/06/10 01:44:13  -Start outputting sumstats in ldsc format...
2025/06/10 01:44:13  -ldsc format will be loaded...
2025/06/10 01:44:13  -ldsc format meta info:
2025/06/10 01:44:13   - format_name  : ldsc
2025/06/10 01:44:13   - format_source  : https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
2025/06/10 01:44:13   - format_source2  : https://github.com/bulik/ldsc/blob/master/munge_sumstats.py
2025/06/10 01:44:13   - format_version  :  20150306
2025/06/10 01:44:13  -gwaslab to ldsc format dictionary:
2025/06/10 01:44:13   - gwaslab keys: rsID,NEA,EA,EAF,N,BETA,P,Z,INFO,OR,CHR,POS
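
The log above reports that `exclude_hla=True` dropped variants in `chr6: 25000000-34000000`. To make that filter concrete, here is a minimal plain-Python sketch of the same kind of region exclusion. It is an illustration only, not GWASLab's implementation; the helper name `in_hla_region` is hypothetical, and treating both boundaries as inclusive is an assumption.

```python
# Region boundaries taken from the log line above.
# Inclusive boundaries are an assumption of this sketch.
HLA_CHROM = "6"
HLA_START = 25_000_000
HLA_END = 34_000_000


def in_hla_region(chrom, pos):
    """True if (chrom, pos) falls inside the chr6 25-34 Mb window."""
    chrom = str(chrom).replace("chr", "")  # accept 6, "6" or "chr6"
    return chrom == HLA_CHROM and HLA_START <= int(pos) <= HLA_END


# Toy variants as (id, chromosome, position); the ids are made up
variants = [
    ("var_a", 6, 24_999_999),
    ("var_b", 6, 30_000_000),
    ("var_c", "chr6", 34_000_001),
    ("var_d", 22, 30_000_000),
]
kept = [v for v in variants if not in_hla_region(v[1], v[2])]
print([v[0] for v in kept])
```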


In [32]:
to_file = str(ldsc_dir/filename.split(".")[0])
to_file

'/home/xutingfeng/job/finemap_tools/data/ldsc/GCST90449056'
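
The standardized table above carries `BETA` and `SE` but no `Z` or `P` column, while the ldsc key list in the log includes both. If those statistics are needed downstream, they can be derived from the effect size and its standard error under the usual normal approximation: Z = BETA / SE, with a two-sided P-value from the standard normal distribution. The sketch below uses only the standard library; the function name `z_and_p` is hypothetical, and whether GWASLab fills these columns automatically on export is not assumed here.

```python
import math


def z_and_p(beta, se):
    """Return the Z-score and two-sided P-value for an effect estimate.

    Uses the normal approximation: P = 2 * (1 - Phi(|Z|)) = erfc(|Z| / sqrt(2)).
    """
    if se <= 0:
        raise ValueError("standard error must be positive")
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# BETA and SE of the first row shown above (rs3094315)
z, p = z_and_p(-0.003855, 0.002427)
print(f"Z = {z:.4f}, P = {p:.4f}")
```

For this row the result agrees closely with the `p_value` of 0.112194 in the raw file, which is a quick way to confirm that the effect and standard error columns were mapped correctly.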

## Other situations

- If the file is not on GRCh38, convert it with `versionConvert.py`.
- If there is no rsID, the file should at least have a `variant_id` column in `chr:pos:ref:alt` format, with ref and alt sorted, as produced by `resetID2.py`.

> Why should ref:alt be sorted? The alleles may be presented as A1/A2, EA/NEA, and so on, so we cannot be sure which one is ref and which one is alt. That distinction is not important for the variant name, which only serves as a key for mapping between different naming schemes.
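
The effect of sorting can be seen in a small sketch. The function below is an illustration only, not the code of `resetID2.py`; the name `make_variant_id` and the choice to uppercase alleles are assumptions. Because the two alleles are sorted alphabetically before being joined, the same variant gets the same ID regardless of which allele a file lists first.

```python
def make_variant_id(chrom, pos, allele1, allele2):
    """Build a chr:pos:a1:a2 key with the two alleles in sorted order."""
    first, second = sorted([allele1.upper(), allele2.upper()])
    return f"{chrom}:{pos}:{first}:{second}"


# Same variant, alleles listed in opposite orders (e.g. EA/NEA vs. NEA/EA)
id_from_file_a = make_variant_id(1, 752566, "G", "A")
id_from_file_b = make_variant_id(1, 752566, "A", "G")
print(id_from_file_a, id_from_file_b, id_from_file_a == id_from_file_b)
```

Such an ID is a join key only. It deliberately discards which allele is the effect allele, so effect directions still have to be harmonized using the `EA` and `NEA` columns.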

For the preprocessing stage, we recommend keeping column names and variant names in a consistent format, but not filtering variants (deciding which to keep or remove); that should be done in downstream analysis.