# Genotype data preprocessing

This document performs genotype data quality control and preprocessing.

## Overview

### Analysis steps

1. Genotype data quality control (QC). See here for the [QC default settings](https://cumc.github.io/xqtl-pipeline/pipeline/data_preprocessing/genotype/GWAS_QC.html).
2. Principal component analysis (PCA) based QC, and PC computation for each sub-population available in the genotype data.
3. Genomic relationship matrix (GRM) computation.
4. Genotype data reformatting for downstream fine-mapping analysis.

### Input data requirement

1. Genotype data. See here for [format details](https://cumc.github.io/xqtl-pipeline/pipeline/data_preprocessing/genotype/genotype_formatting.html).
2. [Optional] A sample information file that specifies population information, needed if external data such as HapMap or 1000 Genomes are to be integrated into the PCA to visualize and assess population structure in the genotype data. See here for [format details](https://cumc.github.io/xqtl-pipeline/pipeline/data_preprocessing/genotype/genotype_formatting.html).

## QC for VCF (Variant Call Format) files
#### 2.1.1 Input
A VCF file of subject genotypes, covering genome-wide or regional variation, to be quality controlled.


In [None]:
setwd('/home/ubuntu/xqtl_protocol_exercise')
library(data.table)
# genotype VCF before QC
geno = fread('data/WGS/vcf/ENSG00000073921.variants.add_chr.vcf.gz')
dim(geno)
geno[1:4,1:11]

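| #CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | sample0 | sample1 |
|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<int>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| chr11 | 84957209 | chr11:84957209_G_C | G | C | . | . | PR;AC=99;AN=300 | GT | 0/0 | 0/0 |
| chr11 | 84957210 | chr11:84957210_C_T | C | T | . | . | PR;AC=0;AN=300 | GT | 0/0 | 0/0 |
| chr11 | 84957254 | chr11:84957254_A_C | A | C | . | . | PR;AC=0;AN=300 | GT | 0/0 | 0/0 |
| chr11 | 84957263 | chr11:84957263_C_T | C | T | . | . | PR;AC=0;AN=300 | GT | 0/0 | 0/0 |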


In [None]:
# dbSNP variants file used to annotate rsIDs
# columns: chrom, start, end, rsID (one row per SNP)
cd /home/ubuntu/xqtl_protocol_exercise
zcat reference_data/00-All.add_chr.variants.gz | head

chr1	10019	10020	rs775809821
chr1	10039	10039	rs978760828
chr1	10043	10043	rs1008829651
chr1	10051	10051	rs1052373574
chr1	10055	10055	rs768019142
chr1	10055	10055	rs892501864
chr1	10063	10063	rs1010989343
chr1	10077	10077	rs1022805358
chr1	10108	10108	rs62651026
chr1	10109	10109	rs376007522


#### 2.1.2 Command

- 📍 Step 1: `variant preprocessing`

**Purpose**

Prepare and clean up raw variant records to standard format and annotate known variants.

**Procedures**

- **Split multi-allelic variants** into multiple bi-allelic records
- **Left-normalize indels** and **correct REF/ALT** based on the reference FASTA
- **Annotate variants using dbSNP**, adding RSID to known variants

**Output File**

```
ENSG00000073921.variants.add_chr.leftnorm.vcf.gz

```

**Changes**

- `ID` field may change from `.` or `chr:pos` format to `rsXXXX` if matched in dbSNP
- Each variant record will represent only one REF/ALT pair (bi-allelic format)
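
The left-normalization step can be illustrated with a minimal Python sketch. It assumes a bi-allelic record, a 1-based position, and the chromosome sequence held in memory as a string. The function name `left_normalize` and the toy `REFERENCE` sequence are hypothetical and only serve the illustration; the pipeline performs this step with its own tooling, not with this code.

```python
def left_normalize(pos, ref, alt, reference):
    """Left-align and trim one bi-allelic variant.

    pos is 1-based; reference is the chromosome sequence as a string.
    Illustrative sketch only, not the pipeline's implementation.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    # Step 1: trim a shared trailing base; when an allele becomes empty,
    # extend both alleles one base to the left using the reference.
    changed = True
    while changed:
        changed = False
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Step 2: trim shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat (hypothetical sequence).
REFERENCE = "GGGCACACACAGGG"
# A CA deletion reported inside the repeat is shifted to its leftmost position.
print(left_normalize(8, "CAC", "C", REFERENCE))  # (3, 'GCA', 'G')
```

Reporting every indel at its leftmost position gives equivalent variants the same `POS`/`REF`/`ALT`, which is what makes the later position-based annotation against dbSNP meaningful.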

---

- 📍 Step 2: `variant level QC`

**Purpose**

Filter out low-quality or unreliable variants and genotypes.

**Procedures**

- For each genotype, filter by:
    - **DP (Depth)**
    - **GQ (Genotype Quality)**
    - **AB (Allele Balance)**
- Filter out:
    - **Monomorphic sites** (only one allele observed across all samples)
    - **Variants with high missingness**
    - **Variants failing HWE threshold** (optional)

**Output File**

```
ENSG00000073921.variants.add_chr.leftnorm.bcftools_qc.vcf.gz

```

**Changes**

- Low-confidence genotypes are set to `./.`
- Variants with no remaining informative genotypes are removed
- The output file contains fewer variant records and is smaller than the input
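
The two levels of filtering described above can be sketched in Python. The thresholds (`min_dp`, `min_gq`, `het_ab_range`, `max_missing`) and the function names are illustrative placeholders, not the pipeline's defaults; see the QC default settings linked in the overview for the values actually used.

```python
MISSING = "./."


def mask_genotype(gt, dp, gq, ad_ref, ad_alt,
                  min_dp=10, min_gq=20, het_ab_range=(0.2, 0.8)):
    """Return gt, or './.' when the call fails a depth, quality, or balance check.

    All thresholds are hypothetical example values.
    """
    if gt == MISSING:
        return gt
    if dp < min_dp or gq < min_gq:
        return MISSING
    if gt in ("0/1", "1/0"):
        total = ad_ref + ad_alt
        if total == 0:
            return MISSING
        allele_balance = ad_alt / total  # fraction of reads supporting ALT
        low, high = het_ab_range
        if not low <= allele_balance <= high:
            return MISSING
    return gt


def site_passes(genotypes, max_missing=0.1):
    """Keep a site only if it is not too sparse and not monomorphic."""
    if not genotypes:
        return False
    called = [g for g in genotypes if g != MISSING]
    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate > max_missing:
        return False
    observed_alleles = {allele for g in called for allele in g.split("/")}
    return len(observed_alleles) > 1


# A heterozygous call supported by only 2 of 30 reads is masked.
print(mask_genotype("0/1", dp=30, gq=99, ad_ref=28, ad_alt=2))  # ./.
# A site where every sample is 0/0 is monomorphic and is dropped.
print(site_passes(["0/0", "0/0", "0/0"]))  # False
```

Masking genotypes first and filtering sites second matters: a site can become monomorphic or too sparse only after its low-confidence calls have been set to `./.`.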

---

- 📍 Step 3: `genotype data summary statistics`

**Purpose**

Evaluate the effectiveness of QC using summary statistics.

**Procedures**

- Use `bcftools stats` to compute:
    - Total variants, SNPs/indels, missingness, heterozygosity, etc.
- Use `SnpSift tstv` to compute:
    - **Transition/Transversion ratio (TS/TV)**
- Statistics are separated into:
    - **Known variants** (with RSID)
    - **Novel variants** (no RSID)
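
As a rough illustration of what the Ts/Tv summary measures, the sketch below counts transitions (A<->G, C<->T) and transversions (all other single-base changes) separately for known and novel variants. The function names are hypothetical, and treating an ID that starts with `rs` as "known" is an assumption made for this example. The four records are the ones shown in the input preview above.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Ts/Tv for (ref, alt) pairs; None when there are no transversions."""
    ts = tv = 0
    for ref, alt in snps:
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # skip indels and non-variant records
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else None


def summarize_ts_tv(records):
    """Split (id, ref, alt) records into known/novel and compute Ts/Tv for each."""
    groups = {"known": [], "novel": []}
    for variant_id, ref, alt in records:
        key = "known" if variant_id.startswith("rs") else "novel"
        groups[key].append((ref, alt))
    return {name: ts_tv_ratio(snps) for name, snps in groups.items()}


records = [
    ("chr11:84957209_G_C", "G", "C"),  # transversion
    ("chr11:84957210_C_T", "C", "T"),  # transition
    ("chr11:84957254_A_C", "A", "C"),  # transversion
    ("chr11:84957263_C_T", "C", "T"),  # transition
]
print(summarize_ts_tv(records))  # {'known': None, 'novel': 1.0}
```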

**Output Files**

```
.novel_variant_sumstats
.known_variant_sumstats
.novel_variant.snipsift_tstv
.known_variant.snipsift_tstv
```

Perform QC on VCF files. The QC-ed data will also be exported to PLINK format for the next analysis steps.

In [2]:
sos run pipeline/VCF_QC.ipynb qc \
    --genoFile data/WGS/vcf/ENSG00000073921.variants.add_chr.vcf.gz \
    --dbsnp-variants reference_data/00-All.add_chr.variants.gz \
    --reference-genome reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --skip_vcf_header_filtering \
    --cwd output/vcf/ 

INFO: Running variant preprocessing: Handel multi-allelic sites, left normalization of indels and add variant ID
INFO: variant preprocessing is completed.
INFO: variant preprocessing output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/vcf/ENSG00000073921.variants.add_chr.leftnorm.vcf.gz
INFO: Running variant level QC: genotype QC
INFO: variant level QC is completed.
INFO: variant level QC output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/vcf/ENSG00000073921.variants.add_chr.leftnorm.bcftools_qc.vcf.gz
INFO: Running genotype data summary statistics: 
INFO: qc_3 (index=1) is completed.
INFO: qc_3 (index=0) is completed.
INFO: genotype data summary statistics output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/vcf/ENSG00000073921.variants.add_chr.leftnorm.novel_variant_sumstats /mnt/va

### 2.2 Converting VCF to PLINK format.

- Input: VCF files
- Output: PLINK format

The PLINK1 (traditional) format consists of three files:
- `.bed`: Binary genotype data file
- `.bim`: Variant information file (includes chromosome, position, variant ID, etc.)
- `.fam`: Sample information file (includes family ID, individual ID, etc.)
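
To make the "binary genotype data" in `.bed` concrete, here is a minimal Python sketch that decodes the variant-major PLINK1 layout: three magic bytes, then for each variant a block of bytes holding four samples each, two bits per sample, lowest-order bits first. The function name `decode_bed` and the five-sample example bytes are hypothetical. In practice the sample count comes from `.fam` and the variant count from `.bim`.

```python
BED_MAGIC = b"\x6c\x1b\x01"  # PLINK1 .bed header, variant-major mode

# Two-bit codes mapped to the number of A1 alleles (A1 = first allele in .bim).
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def decode_bed(data, n_samples):
    """Decode raw .bed bytes into one list of A1 allele counts per variant."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if data[:3] != BED_MAGIC:
        raise ValueError("not a variant-major PLINK1 .bed file")
    bytes_per_variant = (n_samples + 3) // 4  # four samples per byte
    body = data[3:]
    if len(body) % bytes_per_variant:
        raise ValueError("file size does not match the sample count")
    variants = []
    for start in range(0, len(body), bytes_per_variant):
        block = body[start:start + bytes_per_variant]
        calls = []
        for i in range(n_samples):
            # Sample i lives in byte i // 4, at bit offset 2 * (i % 4).
            code = (block[i // 4] >> (2 * (i % 4))) & 0b11
            calls.append(CODE_TO_A1_COUNT[code])
        variants.append(calls)
    return variants


# One variant, five samples: hom A1, missing, het, hom A2, het.
example = BED_MAGIC + bytes([0b11100100, 0b00000010])
print(decode_bed(example, n_samples=5))  # [[2, None, 1, 0, 1]]
```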

In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
sos run pipeline/genotype_formatting.ipynb vcf_to_plink \
    --genoFile `ls data/WGS/vcf/wgs.chr*.random.vcf.gz` \
    --cwd output/plink/ 

INFO: Running vcf_to_plink: 
INFO: vcf_to_plink (index=0) is ignored due to saved signature
INFO: vcf_to_plink (index=1) is ignored due to saved signature
INFO: vcf_to_plink (index=2) is ignored due to saved signature
INFO: vcf_to_plink (index=3) is ignored due to saved signature
INFO: vcf_to_plink (index=4) is ignored due to saved signature
INFO: vcf_to_plink (index=5) is ignored due to saved signature
INFO: vcf_to_plink (index=6) is ignored due to saved signature
INFO: vcf_to_plink (index=7) is ignored due to saved signature
INFO: vcf_to_plink (index=8) is ignored due to saved signature
INFO: vcf_to_plink (index=9) is ignored due to saved signature
INFO: vcf_to_plink (index=10) is ignored due to saved signature
INFO: vcf_to_plink (index=11) is 

In [None]:
# merge the per-chromosome PLINK bed files into a single file set
sos run pipeline/genotype_formatting.ipynb merge_plink \
    --genoFile `ls output/plink/wgs.chr*.random.bed` \
    --name wgs.merged \
    --cwd output/plink/ 

### 2.3 Genotype PLINK File Quality Control

About `qc`:

1. `[qc_no_prune, qc_1 (basic QC filters)]`
    - `aim`: Filter SNPs and select individuals based on various quality control (QC) criteria. The goal is to ensure that the genotype data is of high quality and free from potential errors or biases before further analysis.

`Input`:
- `genoFile`: The primary input file containing genotype data.
- Parameters that dictate the QC criteria:
    - `maf_filter`, `maf_max_filter`: Minimum and maximum Minor Allele Frequency (MAF) thresholds.
    - `mac_filter`, `mac_max_filter`: Minimum and maximum Minor Allele Count (MAC) thresholds.
    - `geno_filter`: Maximum missingness per variant.
    - `mind_filter`: Maximum missingness per sample.
    - `hwe_filter`: Hardy-Weinberg Equilibrium (HWE) filter threshold.
    - `other_args`: Other optional PLINK arguments.
    - `meta_only`: Flag to determine if only SNP and sample lists should be output.
    - `rm_dups`: Flag to remove duplicate variants.

`Output`: A file (or set of files) with the suffix `.plink_qc` (and possibly `.extracted` if specific variants are kept). The exact format (e.g., `.bed` or `.snplist`) depends on the `meta_only` parameter.
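
To show how the missingness and allele-count thresholds interact, here is a simplified Python sketch on a small dosage matrix (rows are variants, columns are samples, `None` is a missing call). The function `qc_filter` is hypothetical and covers only `geno_filter`, `mind_filter`, `maf_filter` and `mac_filter`. It leaves out the HWE test, the maximum MAF/MAC bounds and duplicate removal, and the order in which PLINK applies its filters may differ from the order assumed here (samples first, then variants).

```python
def qc_filter(dosages, geno_filter=0.1, mind_filter=0.1,
              maf_filter=0.0, mac_filter=0):
    """Return (kept variant indices, kept sample indices).

    dosages[i][j] is the ALT allele count (0, 1, 2) of sample j at variant i,
    or None when the call is missing. Illustrative sketch only.
    """
    n_variants = len(dosages)
    n_samples = len(dosages[0]) if dosages else 0
    # Sample-level missingness (mind_filter): drop samples missing too many calls.
    kept_samples = [
        j for j in range(n_samples)
        if sum(row[j] is None for row in dosages) / n_variants <= mind_filter
    ]
    kept_variants = []
    for i, row in enumerate(dosages):
        calls = [row[j] for j in kept_samples]
        observed = [c for c in calls if c is not None]
        if not observed:
            continue  # no usable genotype left for this variant
        # Variant-level missingness (geno_filter).
        if 1 - len(observed) / len(calls) > geno_filter:
            continue
        alt_count = sum(observed)
        mac = min(alt_count, 2 * len(observed) - alt_count)  # minor allele count
        maf = mac / (2 * len(observed))                      # minor allele frequency
        if mac < mac_filter or maf < maf_filter:
            continue
        kept_variants.append(i)
    return kept_variants, kept_samples


toy = [
    [0, 1, 2, 0],           # passes
    [None, None, 0, None],  # 75% missing
    [0, 0, 0, 0],           # monomorphic, MAC = 0
    [1, None, 1, 0],        # 25% missing
]
print(qc_filter(toy, geno_filter=0.3, mind_filter=0.5, mac_filter=1))
# ([0, 3], [0, 1, 2, 3])
```

With `--mac-filter 0`, as in the command that follows, monomorphic variants such as the third row would be kept, since a minor allele count of 0 is not below the threshold.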


In [None]:
cd /home/ubuntu/xqtl_protocol_exercise
sos run pipeline/GWAS_QC.ipynb qc_no_prune \
   --cwd output/plink \
   --genoFile output/plink/wgs.merged.bed \
   --geno-filter 0.1 \
   --mind-filter 0.1 \
   --hwe-filter 1e-08 \
   --mac-filter 0 

INFO: Running qc_no_prune: Filter SNPs and select individuals
INFO: qc_no_prune is completed.
INFO: qc_no_prune output:   /mnt/vast/hpc/homes/al4225/xqtl_protocol_data/output/plink/wgs.merged.plink_qc.bed
INFO: Workflow qc_no_prune (ID=w6697f77cea0f6dc2) is executed successfully with 1 completed step.


### 2.4 Splitting QC-ed PLINK files by chromosome

In [None]:
sos run pipeline/genotype_formatting.ipynb genotype_by_chrom \
    --genoFile output/plink/wgs.merged.plink_qc.bed \
    --cwd output/genotype_by_chrom \
    --chrom `cut -f 1 output/plink/wgs.merged.plink_qc.bim | uniq | sed "s/chr//g"`

INFO: Running genotype_by_chrom_1: 
INFO: genotype_by_chrom_1 (index=3) is completed.
INFO: genotype_by_chrom_1 (index=6) is completed.
INFO: genotype_by_chrom_1 (index=0) is completed.
INFO: genotype_by_chrom_1 (index=5) is completed.
INFO: genotype_by_chrom_1 (index=1) is completed.
INFO: genotype_by_chrom_1 (index=4) is completed.
INFO: genotype_by_chrom_1 (index=2) is completed.
INFO: genotype_by_chrom_1 (index=9) is completed.
INFO: genotype_by_chrom_1 (index=7) is completed.
INFO: genotype_by_chrom_1 (index=12) is completed.
INFO: genotype_by_chrom_1 (index=11) is completed.
INFO: genotype_by_chrom_1 (index=13) is completed.
INFO: genotype_by_chrom_1 (index=8) is completed.
INFO: genotype_by_chrom_1 (index=10)
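
The `--chrom` argument in the command above is built by a `cut | uniq | sed` pipeline over the `.bim` file. A Python equivalent is sketched below for readers who want to inspect the list before launching the workflow. The function name `chromosomes_from_bim` and the example rows are hypothetical, and the sketch assumes a tab-delimited `.bim` whose rows are sorted by chromosome, as the `uniq` call also does.

```python
def chromosomes_from_bim(lines):
    """Return chromosome labels in file order, without the 'chr' prefix.

    Mirrors `cut -f 1 | uniq | sed "s/chr//g"`: only adjacent duplicates
    are collapsed, so an unsorted file can yield repeated labels.
    """
    chroms = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        chrom = line.split("\t", 1)[0].replace("chr", "")
        if not chroms or chroms[-1] != chrom:
            chroms.append(chrom)
    return chroms


# Hypothetical .bim rows: chrom, variant ID, cM, position, A1, A2.
example_bim = [
    "chr1\tvar1\t0\t10019\tA\tG",
    "chr1\tvar2\t0\t10039\tC\tT",
    "chr2\tvar3\t0\t20011\tG\tA",
]
print(chromosomes_from_bim(example_bim))  # ['1', '2']
```

To run it on real data, pass an open file handle, for example `chromosomes_from_bim(open("output/plink/wgs.merged.plink_qc.bim"))`.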