In [None]:
## This notebook shows how to download data from the 1000 Genomes Project and convert it into the PLINK binary format.
## The converted PLINK files are provided in the data folder for the tutorial. This notebook is provided for those who are interested in trying this out themselves.


Read environment variables from the file .env.
The file sets the following environment variables:
PROJECT_DIR=
DATA_DIR=
SINGULARITY_CACHEDIR=
SINGULARITY_TMPDIR=
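
One way to read such a file into the current shell is sketched below. It assumes the file uses plain `KEY=value` shell syntax and sits in the current directory; the helper name `load_env` is made up for this illustration and is not part of the tutorial.

```shell
load_env() {
    # $1: path to a file of KEY=value lines written in shell syntax
    if [ ! -f "$1" ]; then
        echo "env file not found: $1" >&2
        return 1
    fi
    set -a      # export every variable assigned from here on
    . "$1"      # source the file into the current shell
    set +a      # restore normal assignment behaviour
}

# Load .env from the current directory if it is there
if [ -f ./.env ]; then
    load_env ./.env
    echo "PROJECT_DIR=${PROJECT_DIR}"
fi
```

Because the variables are exported, child processes such as `singularity` also see `SINGULARITY_CACHEDIR` and `SINGULARITY_TMPDIR`.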

### load modules

In [1]:
module load singularity

Loading singularity/3.9.5
  Loading requirement: golang/1.16

In [7]:
# assuming that you have git in your PATH
cd ~/GIT/
git clone 

/home/wongs6/miniconda3/bin/git


In [6]:
cd ~/GIT/statgen_workshop_tutorial/ # you may need to change this to wherever you cloned statgen_workshop_tutorial
export PROJECT_DIR=$(pwd)
export DATA_DIR=$PROJECT_DIR/data

In [5]:
cd ${PROJECT_DIR}
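
The later cells write into `${DATA_DIR}` and `${PROJECT_DIR}/containers`. A small sketch for creating both up front follows; the fallback to the current directory when `PROJECT_DIR` is unset is an assumption made only so the snippet runs on its own.

```shell
# Fall back to the current directory if the variables are not set yet (assumption)
PROJECT_DIR="${PROJECT_DIR:-$(pwd)}"
DATA_DIR="${DATA_DIR:-${PROJECT_DIR}/data}"
export PROJECT_DIR DATA_DIR

# Create the output directories used by the download and container steps
mkdir -p "${DATA_DIR}" "${PROJECT_DIR}/containers"

# Report whether each directory is writable
for d in "${DATA_DIR}" "${PROJECT_DIR}/containers"; do
    if [ -w "$d" ]; then
        echo "ok: $d"
    else
        echo "not writable: $d" >&2
    fi
done
```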

Download the 1000 Genomes chr21 VCF

In [6]:
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz -P ${DATA_DIR} 

--2023-07-09 20:57:37--  https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... 193.62.193.140
Connecting to ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)|193.62.193.140|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 209774472 (200M) [application/x-gzip]
Saving to: ‘/DCEG/Projects/CoherentLogic/statgen_workshop/data/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz’


2023-07-09 21:02:38 (682 KB/s) - ‘/DCEG/Projects/CoherentLogic/statgen_workshop/data/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz’ saved [209774472/209774472]
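
A quick sanity check of the download can be sketched as follows. The expected size is the `Length` value reported by wget above; the helper name `file_size` is hypothetical, and the check is skipped when the file is not present.

```shell
expected_bytes=209774472   # "Length" reported by wget in the log above
vcf="${DATA_DIR:-.}/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"

file_size() {
    # Print the size of file $1 in bytes
    wc -c < "$1" | tr -d ' '
}

if [ -f "$vcf" ]; then
    actual_bytes=$(file_size "$vcf")
    if [ "$actual_bytes" -eq "$expected_bytes" ]; then
        echo "size matches: ${actual_bytes} bytes"
    else
        echo "size mismatch: got ${actual_bytes}, expected ${expected_bytes}" >&2
    fi
    # gzip -t reads the whole archive and fails if the stream is corrupt
    gzip -t "$vcf" && echo "gzip stream is intact"
else
    echo "VCF not found at ${vcf}; skipping check"
fi
```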



Download the 1000 Genomes PED file

In [7]:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped -P ${DATA_DIR}

--2023-07-09 21:04:12--  ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped
           => ‘/DCEG/Projects/CoherentLogic/statgen_workshop/data/20130606_g1k.ped’
Resolving ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)... 193.62.193.140
Connecting to ftp.1000genomes.ebi.ac.uk (ftp.1000genomes.ebi.ac.uk)|193.62.193.140|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /vol1/ftp/technical/working/20130606_sample_info ... done.
==> SIZE 20130606_g1k.ped ... 155869
==> PASV ... done.    ==> RETR 20130606_g1k.ped ... done.
Length: 155869 (152K) (unauthoritative)


2023-07-09 21:04:14 (413 KB/s) - ‘/DCEG/Projects/CoherentLogic/statgen_workshop/data/20130606_g1k.ped’ saved [155869]
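
The pedigree file records each sample's sex, which a VCF does not carry. As a hedged sketch, the snippet below turns it into a three-column table (FID, IID, sex) of the kind PLINK's `--update-sex` accepts. It assumes the file is tab-delimited with a header row, that column 2 is the individual ID and column 5 the sex code (1 = male, 2 = female), and that FID and IID are both set to the sample ID. The helper name `make_sex_file` and the output file name are made up for this example.

```shell
make_sex_file() {
    # $1: tab-delimited pedigree file with a header row
    # $2: output path for the "FID IID SEX" table
    # Assumes column 2 = individual ID and column 5 = sex (1 = male, 2 = female)
    awk -F'\t' 'NR > 1 && ($5 == 1 || $5 == 2) { print $2, $2, $5 }' "$1" > "$2"
}

ped="${DATA_DIR:-.}/20130606_g1k.ped"
if [ -f "$ped" ]; then
    make_sex_file "$ped" "${DATA_DIR:-.}/update_sex.txt"
    echo "samples with a known sex code: $(wc -l < "${DATA_DIR:-.}/update_sex.txt")"
else
    echo "pedigree file not found at ${ped}; skipping"
fi
```

It is worth inspecting the header with `head -n 1` first to confirm the column positions before relying on the result.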



## While plink and plink1.9 can be easily downloaded as binaries and usually run without problems on most Unix systems, here we use Singularity to pull the plink Docker image for a reproducible compute environment.

In [10]:
### we use singularity to pull the image
singularity pull ${PROJECT_DIR}/containers/plink1.9.sif docker://biocontainers/plink1.9:v1.90b6.6-181012-1-deb_cv1 

INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob 478cd0aa93c0 done
Copying blob 94d6a239eb0e done
Copying blob e8e87313e9cb done
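
Once the pull has finished, a short check that the image runs can look like the sketch below. It assumes the binary inside the container is called `plink1.9`, as in the conversion command of this notebook; the helper name `check_container` is hypothetical.

```shell
check_container() {
    # $1: path to the .sif image
    if ! command -v singularity >/dev/null 2>&1; then
        echo "singularity not found in PATH" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "image not found: $1" >&2
        return 1
    fi
    # Print the PLINK version from inside the container
    singularity exec "$1" plink1.9 --version
}

check_container "${PROJECT_DIR:-.}/containers/plink1.9.sif" \
    || echo "container check did not pass; see the message above"
```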

Convert the VCF to PLINK binary format (bed, bim, fam). To reduce the file size, we keep only variants with a minor allele frequency (MAF) of at least 0.05.
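
To make the threshold concrete, here is a small sketch of how a minor allele frequency is obtained from allele counts. The helper name `maf` and the example counts are illustrative only; 5008 is two alleles for each of the 2504 samples in this release.

```shell
maf() {
    # $1: number of copies of one allele
    # $2: total number of called alleles (2 x samples on an autosome)
    awk -v c="$1" -v n="$2" 'BEGIN {
        p = c / n                  # frequency of the counted allele
        if (p > 0.5) p = 1 - p     # the minor allele is the rarer of the two
        printf "%.4f\n", p
    }'
}

maf 250 5008     # just under 0.05, so --maf 0.05 would drop this variant
maf 4758 5008    # same variant counted from the other allele
maf 1000 5008    # well above the threshold, so the variant is kept
```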

In [6]:
singularity run --bind /DCEG:/DCEG ${PROJECT_DIR}/containers/plink1.9.sif \
   plink1.9 \
   --vcf $DATA_DIR/ALL.chr21.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
   --maf 0.05 \
   --make-bed \
   --out $DATA_DIR/ALL_chr21_maf0.05

PLINK v1.90b6.6 64-bit (12 Oct 2018)           www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /DCEG/Projects/CoherentLogic/statgen_workshop/data//ALL_chr21_maf0.05.log.
Options in effect:
  --maf 0.05
  --make-bed
  --out /DCEG/Projects/CoherentLogic/statgen_workshop/data//ALL_chr21_maf0.05
  --vcf /DCEG/Projects/CoherentLogic/statgen_workshop/data//ALL.chr21.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz

515899 MB RAM detected; reserving 257949 MB for main workspace.
--vcf: 1105k variants complete.
/DCEG/Projects/CoherentLogic/statgen_workshop/data//ALL_chr21_maf0.05-temporary.bed
+
/DCEG/Projects/CoherentLogic/statgen_workshop/data//ALL_chr21_maf0.05-temporary.bim
+
/DCEG/Projects/CoherentLogic/statgen_workshop/data//ALL_chr21_maf0.05-temporary.fam
written.
1105538 variants loaded from .bim file.
2504 people (0 males, 0 females, 2504 ambiguous) loaded from .fam.
Ambiguous sex IDs written to
