<a href="https://colab.research.google.com/github/luuloi/GWAS_Introduction_2023/blob/main/06_PolygenicRiskScore.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. SETUP WORKING ENVIRONMENT

## Download Necessary Packages

In [None]:
# Install using pip

!pip install rpy2==3.5.1
!pip install -q condacolab
!pip install gdown

In [13]:
# Ignore rpy2's warnings

import rpy2
import warnings
from rpy2.rinterface import RRuntimeWarning

warnings.filterwarnings("ignore", category=RRuntimeWarning)

In [14]:
# Initialize conda

import condacolab
condacolab.install()

✨🍰✨ Everything looks OK!


## Set up R environment by installing required packages

In [15]:
# activate R magic
%load_ext rpy2.ipython

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


## Install Plink Package

In [None]:
# Install plink

!conda install -y -c bioconda plink

## Prepare Data

In [17]:
%%bash

# Create folders to store the necessary data; -p also avoids an error if they already exist
mkdir -p data/base
mkdir -p data/target


In [None]:
%%bash
wget -O data/base/Height.gwas.txt.gz https://raw.githubusercontent.com/luuloi/GWAS_Introduction_2023/main/data/Height.gwas.txt.gz
# Quote the URL so that bash does not treat "&" as a background operator
gdown -O data/target/EUR.zip "https://drive.google.com/uc?id=1uhJR_3sn7RA8U5iYQbcmTp6vFdQiF4F2&export=download"

In [24]:
!unzip data/target/EUR.zip -d data/target

Archive:  data/target/EUR.zip
  inflating: data/target/EUR.bed     
  inflating: data/target/EUR.bim     
  inflating: data/target/EUR.cov     
  inflating: data/target/EUR.fam     
  inflating: data/target/EUR.height  
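
Before moving on, it can help to confirm that every expected target file was actually extracted. The following is a minimal sketch using only the Python standard library; the helper name `missing_files` and the constant `TARGET_EXTENSIONS` are illustrative assumptions, and the list of extensions is taken from the unzip listing above.

```python
from pathlib import Path

def missing_files(directory, prefix, extensions):
    """Return the expected file names that are absent from `directory`."""
    folder = Path(directory)
    expected = [f"{prefix}.{ext}" for ext in extensions]
    return [name for name in expected if not (folder / name).is_file()]

# Extensions taken from the unzip listing above.
TARGET_EXTENSIONS = ["bed", "bim", "fam", "cov", "height"]

missing = missing_files("data/target", "EUR", TARGET_EXTENSIONS)
print("All target files present" if not missing else f"Missing: {missing}")
```

An empty list means the download and extraction steps completed; any names printed point to a file that needs to be fetched again.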


## Have a look at our data

In [19]:
!zcat data/base/Height.gwas.txt.gz | head

CHR	BP	SNP	A1	A2	N	SE	P	OR	INFO	MAF
1	756604	rs3131962	A	G	388028	0.00301666	0.483171	0.997886915712657	0.890557941364774	0.369389592764921
1	768448	rs12562034	A	G	388028	0.00329472	0.834808	1.00068731609353	0.895893511351165	0.336845754096289
1	779322	rs4040617	G	A	388028	0.00303344	0.42897	0.997603556067569	0.897508290615237	0.377368010940814
1	801536	rs79373928	G	T	388028	0.00841324	0.808999	1.00203569922793	0.908962856432993	0.483212245374095
1	808631	rs11240779	G	A	388028	0.00242821	0.590265	1.00130832511154	0.893212523690488	0.450409558999587
1	809876	rs57181708	G	A	388028	0.00336785	0.71475	1.00123165786833	0.923557624081969	0.499743932656759
1	835499	rs4422948	G	A	388028	0.0023758	0.710884	0.999119752645202	0.906437735120596	0.481016005816168
1	838555	rs4970383	A	C	388028	0.00235773	0.150993	0.996619945289758	0.907716506801574	0.327164029672754
1	840753	rs4970382	C	T	388028	0.00207377	0.199967	0.99734567895614	0.914602590137255	0.498936220426316


In [26]:
!head data/target/EUR.height

FID	IID	Height
HG00096	HG00096	169.132168767547
HG00097	HG00097	171.256258630279
HG00099	HG00099	171.534379938588
HG00101	HG00101	169.850176470551
HG00102	HG00102	172.788360878389
HG00103	HG00103	169.862973824923
HG00105	HG00105	168.939248611414
HG00107	HG00107	168.972346393861
HG00108	HG00108	171.311736719186


# 1. QUALITY CONTROL

## A. QC of Base Data

**Reading the base data file**

In [28]:
!zcat data/base/Height.gwas.txt.gz | head

CHR	BP	SNP	A1	A2	N	SE	P	OR	INFO	MAF
1	756604	rs3131962	A	G	388028	0.00301666	0.483171	0.997886915712657	0.890557941364774	0.369389592764921
1	768448	rs12562034	A	G	388028	0.00329472	0.834808	1.00068731609353	0.895893511351165	0.336845754096289
1	779322	rs4040617	G	A	388028	0.00303344	0.42897	0.997603556067569	0.897508290615237	0.377368010940814
1	801536	rs79373928	G	T	388028	0.00841324	0.808999	1.00203569922793	0.908962856432993	0.483212245374095
1	808631	rs11240779	G	A	388028	0.00242821	0.590265	1.00130832511154	0.893212523690488	0.450409558999587
1	809876	rs57181708	G	A	388028	0.00336785	0.71475	1.00123165786833	0.923557624081969	0.499743932656759
1	835499	rs4422948	G	A	388028	0.0023758	0.710884	0.999119752645202	0.906437735120596	0.481016005816168
1	838555	rs4970383	A	C	388028	0.00235773	0.150993	0.996619945289758	0.907716506801574	0.327164029672754
1	840753	rs4970382	C	T	388028	0.00207377	0.199967	0.99734567895614	0.914602590137255	0.498936220426316


* **CHR**: The chromosome in which the SNP resides
* **BP**: Chromosomal co-ordinate of the SNP
* **SNP**: SNP ID, usually in the form of rs-ID
* **A1**: The effect allele of the SNP
* **A2**: The non-effect allele of the SNP
* **N**: Number of samples used to obtain the effect size estimate
* **SE**: The standard error (SE) of the effect size estimate
* **P**: The P-value of association between the SNP genotypes and the base phenotype
* **OR**: The effect size estimate of the SNP, given as an odds ratio (OR) when the outcome is binary/case-control. If the outcome is continuous, or treated as continuous, this column will usually be BETA instead
* **INFO**: The imputation information score
* **MAF**: The minor allele frequency (MAF) of the SNP
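
The column layout above can also be read programmatically. The following is a minimal sketch, using only the Python standard library, that parses tab-separated summary statistics into typed records and adds a log odds ratio (`BETA = ln(OR)`), the additive scale on which odds ratios are usually summed. The helper name `parse_gwas` and the in-memory `sample` (two rows copied from the output above) are illustrative assumptions; for the real file, a handle from `gzip.open("data/base/Height.gwas.txt.gz", "rt")` could be passed instead.

```python
import csv
import io
import math

# Columns that hold numbers, mapped to the type used to parse them.
# CHR is left as a string because chromosome labels are not always numeric.
NUMERIC = {"BP": int, "N": int, "SE": float, "P": float,
           "OR": float, "INFO": float, "MAF": float}

def parse_gwas(handle):
    """Yield one dict per SNP with typed fields and BETA = ln(OR)."""
    for row in csv.DictReader(handle, delimiter="\t"):
        rec = {k: NUMERIC[k](v) if k in NUMERIC else v for k, v in row.items()}
        rec["BETA"] = math.log(rec["OR"])
        yield rec

# Header and first two data rows, copied from the output above.
sample = (
    "CHR\tBP\tSNP\tA1\tA2\tN\tSE\tP\tOR\tINFO\tMAF\n"
    "1\t756604\trs3131962\tA\tG\t388028\t0.00301666\t0.483171\t"
    "0.997886915712657\t0.890557941364774\t0.369389592764921\n"
    "1\t768448\trs12562034\tA\tG\t388028\t0.00329472\t0.834808\t"
    "1.00068731609353\t0.895893511351165\t0.336845754096289\n"
)

records = list(parse_gwas(io.StringIO(sample)))
for rec in records:
    print(rec["SNP"], rec["A1"], f"{rec['BETA']:+.6f}")
```

An OR below 1 gives a negative BETA and an OR above 1 gives a positive one, always with respect to the effect allele in `A1`.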



### QC checklist: Base data

**i. Heritability check**

We recommend that PRS analyses be performed on base data with a chip-heritability estimate $h_{snp}^{2} > 0.05$.

**ii. Effect allele**

**iii. Genome build**