# Summary statistics merger

## Aim

- 1. To merge multiple summary statistics files into new summary statistics files that contain the common SNPs.
- 2. To handle allele flip and reverse issues during merging.

## Notes
 - 1. If there are duplicated `indels` in the summary statistics, they will be removed. For example, suppose there are two variants at position 10000 on chr1: one has `A0` = `T` and `A1` = `TC`, whereas the other has `A0` = `TC` and `A1` = `T`. Both of them will be removed. [More about `indels` issues](https://github.com/statgenetics/UKBB_GWAS_dev/issues/81#issuecomment-1015556800).
 - 2. If duplicated `chr:pos` (GWAS) or `gene:chr:pos` (TWAS) entries exist, a recursive match is run for each pair of them between the two summary statistics files (`query`, which is each of the inputs, and `subject`, which is the target file).
 - 3. Under the same `chr:pos` or `gene:chr:pos`, the variants' `A0` and `A1` are matched by the exact, flip, reverse, or flip+reverse model. When only one of these models is `True`, the variant is considered matched between the two files. If the match is by flip or flip+reverse, the sign of the `query`'s `STAT` is inverted, and the `query`'s `A0` and `A1` are set to the `subject`'s `A0` and `A1`.
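
The matching rule in note 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the `LDtools` implementation: the function names (`complement`, `match_modes`, `harmonize`) are hypothetical, "flip" is taken to mean that `A0` and `A1` are swapped, and "reverse" is taken to mean that both alleles are replaced by their base-wise complement.

```python
# Illustrative sketch only; names and conventions are assumptions, not LDtools code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(allele):
    """Base-wise complement of an allele string (assumed meaning of 'reverse')."""
    return allele.upper().translate(COMPLEMENT)


def match_modes(query, subject):
    """Evaluate the four models for a query (A0, A1) against a subject (A0, A1)."""
    q0, q1 = (a.upper() for a in query)
    s0, s1 = (a.upper() for a in subject)
    c0, c1 = complement(q0), complement(q1)
    return {
        "exact": (q0, q1) == (s0, s1),
        "flip": (q0, q1) == (s1, s0),
        "reverse": (c0, c1) == (s0, s1),
        "flip+reverse": (c0, c1) == (s1, s0),
    }


def harmonize(query, stat, subject):
    """Return (A0, A1, STAT, model) aligned to the subject, or None if unmatched.

    A variant counts as matched only when exactly one model is True.
    """
    hits = [mode for mode, ok in match_modes(query, subject).items() if ok]
    if len(hits) != 1:
        return None
    mode = hits[0]
    # Flip-type matches swap the effect allele, so the statistic changes sign.
    sign = -1.0 if mode in ("flip", "flip+reverse") else 1.0
    return subject[0], subject[1], sign * stat, mode


print(harmonize(("G", "A"), 0.5, ("A", "G")))
print(harmonize(("A", "T"), 0.5, ("T", "A")))
```

In this sketch a pair such as A/T against T/A satisfies both the flip and the reverse model, so it is left unmatched, which is one way to see why such alleles are called ambiguous.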

## Pre-requisites

Make sure you install the prerequisites before running this notebook:

```
pip install LDtools
```

## Input

- `--cwd`, the path of the working directory.
- `--yml_list`, the path of a tab-separated file that lists the yaml files; the second column holds the path of each yaml file.
- `--keep-ambiguous`, boolean, default `False`. If `--keep-ambiguous` is given, ambiguous alleles that cannot be resolved by flip or reverse, such as A/T or C/G, are kept. Otherwise, they are removed.
- `--intersect`, boolean, default `False`. If `--intersect` is given, only the SNPs shared by all input files are output.
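
As a rough illustration of what `--keep-ambiguous` controls, the sketch below flags A/T and C/G allele pairs and drops them unless asked to keep them. The helper names (`is_ambiguous`, `filter_variants`) and the dictionary-based records are hypothetical; the actual filtering is done inside `LDtools` and may differ in detail.

```python
# Hypothetical helpers for illustration; not part of LDtools.
AMBIGUOUS_PAIRS = {frozenset(("A", "T")), frozenset(("C", "G"))}


def is_ambiguous(a0, a1):
    """True for allele pairs that look the same on both strands (A/T, C/G)."""
    return frozenset((a0.upper(), a1.upper())) in AMBIGUOUS_PAIRS


def filter_variants(variants, keep_ambiguous=False):
    """Drop ambiguous variants unless keep_ambiguous is set."""
    if keep_ambiguous:
        return list(variants)
    return [v for v in variants if not is_ambiguous(v["A0"], v["A1"])]


# Made-up records used only to exercise the helpers.
demo_variants = [
    {"SNP": "rs1", "A0": "A", "A1": "G"},
    {"SNP": "rs2", "A0": "A", "A1": "T"},
    {"SNP": "rs3", "A0": "G", "A1": "C"},
]
kept = filter_variants(demo_variants)
print([v["SNP"] for v in kept])
```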

### The format of the input yaml file 

For GWAS summary statistics: `ID` is `CHR,POS,A0,A1`, which can be used as a unique label for each variant.

```
INPUT:
  - ./data/testflip/*.gz:
        ID: CHR,POS,A0,A1
        CHR: CHR
        POS: POS
        A0: REF
        A1: ALT
        SNP: SNP
        STAT: BETA
        SE: SE
        P: P
  - ./data/testflip/flip/snps500_flip.regenie.snp_stats.gz:
  
TARGET: 
  - ./data/testflip/snps500.regenie.snp_stats.gz:
        ID: CHR,POS,A0,A1
        CHR: CHR
        POS: POS
        A0: REF
        A1: ALT
        SNP: SNP
        STAT: BETA
        SE: SE
        P: P
OUTPUT: data/testflip/output/
```

For TWAS summary statistics: `ID` is `GENE,CHR,POS,A0,A1`. The `GENE` name is added because a variant can be associated with multiple genes.

```
INPUT:
  - data/twas/*.txt:
        ID: GENE,CHR,POS,A0,A1
        CHR: chrom
        POS: pos
        A0: ref
        A1: alt
        SNP: variant_id
        GENE: gene
        STAT: beta
        SE: se
        P: pval
 
  
TARGET: 
  - data/twas/DLPFC.chr6.mol_phe.cis_long_table.reformated.txt:
        ID: GENE,CHR,POS,A0,A1
        CHR: chrom
        POS: pos
        A0: ref
        A1: alt
        SNP: variant_id
        GENE: gene
        STAT: beta
        SE: se
        P: pval
OUTPUT: ../data/twas/output/
```

There are three parts in the input yaml file.
- INPUT
   - A list of yml files, as output by yml_generator; each yml file documents a set of inputs.
       - The input summary statistics files, with their column names as described below.
       - The input files can come from multiple directories and be in different formats. The input paths must follow Unix shell rules (for example, `*` wildcards). The format pairs the column names of each file with the keys (CHR, POS, A0, A1, SNP, STAT, SE, P). If no pairing is provided, the column names of the input file are used as the default keys.
       - An input summary statistics file cannot have duplicated chr:pos.
       - An input summary statistics file cannot have `#` in its header.
       - `ID` in the yml is a unique identifier for each SNP. The default is `CHR,POS`. When identifiers are duplicated, only the first one in the file is kept.
- TARGET
   - The target file is a reference summary statistics file, or a file with at least chr, pos, a0, and a1 columns, against which the other files are compared.
- OUTPUT
   - The path of an output directory for the new summary statistics files.
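
To make the column pairing and the `ID` rule above more concrete, here is a minimal standard-library sketch. It assumes a tab-separated input, uses a hypothetical function name (`read_unique_variants`), and works on made-up rows; it is not how `read_sumstat` in `LDtools` is implemented.

```python
import csv
import io

# Standard key -> column name in the (made-up) input file, as in the GWAS yaml example.
COLUMN_MAP = {
    "CHR": "CHR", "POS": "POS", "A0": "REF", "A1": "ALT",
    "SNP": "SNP", "STAT": "BETA", "SE": "SE", "P": "P",
}


def read_unique_variants(handle, column_map, id_keys=("CHR", "POS")):
    """Rename columns to the standard keys and keep the first row per ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    seen = set()
    rows = []
    for raw in reader:
        row = {key: raw[col] for key, col in column_map.items()}
        variant_id = ":".join(row[k] for k in id_keys)
        if variant_id in seen:
            continue  # a later duplicate of an ID already kept
        seen.add(variant_id)
        rows.append(row)
    return rows


# Made-up data: the second row repeats the ID of the first one.
demo = io.StringIO(
    "CHR\tPOS\tREF\tALT\tSNP\tBETA\tSE\tP\n"
    "1\t10000\tA\tG\trs1\t0.10\t0.02\t0.001\n"
    "1\t10000\tA\tG\trs1_dup\t0.30\t0.02\t0.001\n"
    "1\t20000\tC\tT\trs2\t-0.05\t0.01\t0.2\n"
)
variants = read_unique_variants(demo, COLUMN_MAP, id_keys=("CHR", "POS", "A0", "A1"))
print([v["SNP"] for v in variants])
```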

## Output
New summary statistics files with the common SNPs in all input files. The sign of the statistics has been corrected so that it is consistent across the different data sets.
   - For each input sumstat file, a QCed version will be generated.
   - The generated sumstat files will have the header `CHR`, `POS`, `A0`, `A1`, `SNP`, `STAT`, `SE`, `P`, regardless of the input header.
   - The generated sumstat files will be in gz format.
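
A quick way to inspect one of the generated files is sketched below with the standard library only. It assumes the GWAS-style header listed above and tab-separated, gzip-compressed content; the function name `read_merged_sumstat` and the demo row are made up for illustration.

```python
import csv
import gzip
import os
import tempfile

EXPECTED_HEADER = ["CHR", "POS", "A0", "A1", "SNP", "STAT", "SE", "P"]


def read_merged_sumstat(path):
    """Read a gzipped, tab-separated sumstat file and verify its header."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames != EXPECTED_HEADER:
            raise ValueError(f"Unexpected header: {reader.fieldnames}")
        return list(reader)


# Build a tiny stand-in for an output file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, "demo.snp_stats.gz")
    with gzip.open(demo_path, "wt", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(EXPECTED_HEADER)
        writer.writerow(["1", "10000", "A", "G", "rs1", "0.12", "0.03", "6e-05"])
    rows = read_merged_sumstat(demo_path)

print(rows[0]["SNP"], rows[0]["STAT"])
```

To use it on real output, point `read_merged_sumstat` at a file under the `OUTPUT` directory named in the yaml file.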

## Example command

```
sos run ./summary_stats_merger.ipynb --cwd data --yml_list data/yml_list.txt --keep-ambiguous --intersect
```

In [None]:
[global]
# Work directory where output will be saved to
parameter: cwd = path
# Path to a list of yml files, with columns #chr and dir
parameter: yml_list = path
import pandas as pd
yml_path = pd.read_csv(yml_list,sep = "\t").values.tolist()
# If --keep-ambiguous is set, keep ambiguous alleles that cannot be resolved by flip or reverse, such as A/T or C/G. Otherwise, remove them.
parameter: keep_ambiguous = False
# If --intersect is set, output only the SNPs shared by all input files.
parameter: intersect = False
# Container that contains the necessary packages
parameter: container = str

## Workflow codes

In [113]:
[default_1 (export utils script)]
depends: Py_Module('LDtools')
output: f'{cwd:a}/utils.py'
report: expand = '${ }', output=f'{cwd:a}/utils.py'

    import os
    from LDtools.sumstat import read_sumstat
    from LDtools.utils import *
    def merge_sumstats(yml,keep_ambiguous,intersect):
        #parse yaml
        yml = load_yaml(yml)
        input_dict = parse_input(yml['INPUT'])
        target_dict = parse_input(yml['TARGET'])
        output_path = yml['OUTPUT']

        input_dict[list(target_dict.keys())[0]] = list(target_dict.values())[0]
        lst_sumstats_file = [os.path.basename(i) for i in input_dict.keys()]
        print('Total number of sumstats: ',len(lst_sumstats_file))
        if len(set(lst_sumstats_file))<len(lst_sumstats_file):
            raise Exception("There are duplicated names in {}".format(lst_sumstats_file))
        #read all sumstats
        print(input_dict)
        lst_sumstats = {os.path.basename(i):read_sumstat(i,j) for i,j in input_dict.items()}
        nqs = []
        #check duplicated indels and remove them.
        subject = check_indels(lst_sumstats[os.path.basename(list(target_dict.keys())[0])])
        for query in lst_sumstats.values():
            #check duplicated indels and remove them.
            query = check_indels(query)
            #under the same chr:pos or gene:chr:pos. match A0 and A1 by exact, flip, reverse, or flip+reverse.
            #if duplicated chr_pos or gene_chr_pos exist, run a recursive match for each pair of them between query and subject.
            nq,_ = snps_match(query,subject,keep_ambiguous)
            nqs.append(nq)
        if intersect:
            #get common snps
            common_snps = set.intersection(*[set(nq.SNP) for nq in nqs])
            print('Total number of common SNPs: ',len(common_snps))
            #write out new sumstats
            for output_sumstats,nq in zip(lst_sumstats_file,nqs):
                sumstats = nq[nq.SNP.isin(common_snps)]
                sumstats.to_csv(os.path.join(output_path, output_sumstats), sep = "\t", header = True, index = False,compression='gzip')
        else:
            for output_sumstats,nq in zip(lst_sumstats_file,nqs):
                #output match SNPs with target SNPs.
                nq.to_csv(os.path.join(output_path, output_sumstats), sep = "\t", header = True, index = False,compression='gzip')
        print('All are done!!!')

In [2]:
[default_2 (merge sumstats)]
depends: f'{cwd:a}/utils.py'
input: for_each = "yml_path"
python: expand = '${ }', input = f'{cwd:a}/utils.py', stderr = f'{cwd:a}/output.stderr', stdout = f'{cwd:a}/output.stdout'

    yml = "${_yml_path[1]}"
    keep_ambiguous = ${keep_ambiguous}
    intersect = ${intersect}
    print(yml, keep_ambiguous,intersect)
    merge_sumstats(yml, keep_ambiguous,intersect)