# Genomic characterization and data formats
***

## **Step 1.** We start our demo with VCF files obtained from the 1000 Genomes Project.

`ls data/*.vcf`

```
data/CLM.chr22.vcf  data/IBS.chr22.vcf	data/PEL.chr22.vcf  data/YRI.chr22.vcf
```

### These VCF files correspond to 4 different populations:
* Colombian in Medellín, Colombia (CLM)
* Iberian populations in Spain (IBS)
* Peruvian in Lima, Peru (PEL)
* Yoruba in Ibadan, Nigeria (YRI)


#### Let's have a look at these files.

`head -n255 data/CLM.chr22.vcf | tail -n3`

```
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG01112	HG01113	HG01119	HG01121	HG01122	HG01124	HG01125	HG01130	HG01131	HG01133	HG01134	HG01136	HG01137	HG01139	HG01140	HG01142	HG01148	HG01149	HG01250	HG01251	HG01253	HG01254	HG01256	HG01257	HG01259	HG01260	HG01269	HG01271	HG01272	HG01275	HG01277	HG01280	HG01281	HG01284	HG01341	HG01342	HG01344	HG01345	HG01348	HG01350	HG01351	HG01353	HG01354	HG01356	HG01357	HG01359	HG01360	HG01362	HG01363	HG01365	HG01366	HG01369	HG01372	HG01374	HG01375	HG01377	HG01378	HG01383	HG01384	HG01389	HG01390	HG01431	HG01432	HG01435	HG01437	HG01438	HG01440	HG01441	HG01443	HG01444	HG01447	HG01455	HG01456	HG01459	HG01461	HG01462	HG01464	HG01465	HG01468	HG01474	HG01479	HG01485	HG01486	HG01488	HG01489	HG01491	HG01492	HG01494	HG01495	HG01497	HG01498	HG01550	HG01551	HG01556
22	16050075	rs587697622	A	G	100	PASS	AC=1;AF=0.000199681;AN=5008;NS=2504;DP=8012;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0;SAS_AF=0.001;AA=.|||;VT=SNP	GT	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0
22	16050115	rs587755077	G	A	100	PASS	AC=32;AF=0.00638978;AN=5008;NS=2504;DP=11468;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0.0234;EUR_AF=0;SAS_AF=0;AA=.|||;VT=SNP	GT	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|1	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0
```

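Each VCF record has nine fixed, tab-separated columns followed by one genotype column per sample:
* `CHROM`, `POS`: chromosome and 1-based position of the variant
* `ID`: variant identifier (here an rsID)
* `REF`, `ALT`: reference and alternate alleles
* `QUAL`, `FILTER`: call quality and filter status (`PASS` when all filters pass)
* `INFO`: semicolon-separated annotations such as allele count (`AC`), allele frequency (`AF`) and per-superpopulation frequencies (`AMR_AF`, `AFR_AF`, ...)
* `FORMAT`: layout of the per-sample fields (here only `GT`, the genotype)
* Sample columns (`HG01112`, ...): phased genotypes, where `0` is the reference allele and `1` the alternate allele, so `0|1` is a heterozygous call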
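To check this layout on a real file, the sketch below counts the sample columns in a VCF (everything after the ninth column, `FORMAT`) and prints the fixed fields of each record. The function names `vcf_sample_count` and `vcf_fixed_fields` are made up for this illustration, and the sketch assumes an uncompressed, tab-delimited VCF.

```shell
# Count sample columns: the #CHROM header has 9 fixed columns, then one per sample.
vcf_sample_count() {
    awk -F'\t' '/^#CHROM/ { print NF - 9; exit }' "$1"
}

# Print CHROM, POS, ID, REF and ALT for every variant record (non-header line).
vcf_fixed_fields() {
    awk -F'\t' -v OFS='\t' '!/^#/ { print $1, $2, $3, $4, $5 }' "$1"
}

# Example usage (assumes the file from Step 1 is present):
# vcf_sample_count data/CLM.chr22.vcf
# vcf_fixed_fields data/CLM.chr22.vcf | head -n3
```

For `data/CLM.chr22.vcf` the sample count should agree with the number of people PLINK reports when it loads the converted file.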

___
***

## **Step 2.** We convert the VCF files to PED (Plink pedigree & genotype) file format. This file format is required by several ancestry estimation tools.

`plink --vcf data/CLM.chr22.vcf --recode --out data/CLM.chr22`

### PED files are always accompanied by MAP files (which hold information about the SNPs present in the PED files)
#### Let's have a look at these files.

`cut -f1-50 -d " " data/CLM.chr22.ped | head -n10`

```
HG01112 HG01112 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01113 HG01113 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01119 HG01119 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01121 HG01121 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01122 HG01122 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01124 HG01124 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01125 HG01125 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01130 HG01130 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01131 HG01131 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
HG01133 HG01133 0 0 0 -9 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
```

`head data/CLM.chr22.map`

```
22	rs587697622	0	16050075
22	rs587755077	0	16050115
22	rs587654921	0	16050213
22	rs587712275	0	16050319
22	rs587769434	0	16050527
22	rs587638893	0	16050568
22	rs587720402	0	16050607
22	rs587593704	0	16050627
22	rs587670191	0	16050646
```
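A PED row holds six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per variant, while the MAP file has one line per variant (chromosome, variant ID, genetic distance, base-pair position). A quick consistency check is therefore that every PED row has `6 + 2 × (MAP lines)` fields. The helper name `ped_matches_map` below is hypothetical, and this is only a minimal sketch of that check.

```shell
# Succeeds if every .ped row has 6 + 2 * (number of .map lines) fields.
# $1 = path to the .ped file, $2 = path to the .map file
ped_matches_map() {
    n_variants=$(awk 'END { print NR }' "$2")
    awk -v expected=$((6 + 2 * n_variants)) \
        'NF != expected { bad = 1 } END { exit bad + 0 }' "$1"
}

# Example usage (assumes the files from Step 2 are present):
# ped_matches_map data/CLM.chr22.ped data/CLM.chr22.map && echo "PED and MAP agree"
```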

***
___

## **Step 3.** Ancestry estimation tools can work with as few as 10,000 SNPs. We will perform LD pruning to bring the number of SNPs down.

### Let's look at the initial number of SNPs in our PED files.

`wc -l data/CLM.chr22.map`

```
1103547 data/CLM.chr22.map
```

### Before pruning, we have to combine the PED files from all 4 populations. Pruning is based on linkage disequilibrium (LD), so we need genotype calls from every sample before we estimate LD.

`cat data/IBS.chr22.ped data/PEL.chr22.ped data/YRI.chr22.ped data/CLM.chr22.ped > data/allCombined.ped`

`cp data/CLM.chr22.map data/allCombined.map`
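Concatenating the PED files and reusing a single MAP file is only valid if all four populations list the same variants in the same order. The sketch below checks that before combining; `all_identical` is a hypothetical helper name, and the usage line assumes that Step 2 was run for every population so that all four MAP files exist.

```shell
# Succeeds only if every listed file is byte-identical to the first one.
all_identical() {
    first=$1
    shift
    for f in "$@"; do
        cmp -s "$first" "$f" || return 1
    done
}

# Example usage (assumes one MAP file per population):
# all_identical data/CLM.chr22.map data/IBS.chr22.map \
#               data/PEL.chr22.map data/YRI.chr22.map && echo "MAP files match"
```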

### We will now perform LD pruning using plink.
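* Plink creates a list of SNPs that should be kept after pruning (`.prune.in`) and a list of SNPs to drop (`.prune.out`).
* The three values passed to `--indep-pairwise` are the window size (1000 variants), the step size (5 variants) and the r² threshold (0.3).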

`plink --file data/allCombined --indep-pairwise 1000 5 0.3 --out data/allCombined`

```
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to data/allCombined.log.
Options in effect:
  --file data/allCombined
  --indep-pairwise 1000 5 0.3
  --out data/allCombined

516841 MB RAM detected; reserving 258420 MB for main workspace.
.ped scan complete (for binary autoconversion).

Performing single-pass .bed write (1103547 variants, 394 people).
--file: data/allCombined-temporary.bed + data/allCombined-temporary.bim 
    
data/allCombined-temporary.fam written.
1103547 variants loaded from .bim file.
394 people (0 males, 0 females, 394 ambiguous) loaded from .fam.
Ambiguous sex IDs written to data/allCombined.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 394 founders and 0 nonfounders present.
Calculating allele frequencies... 
Total genotyping rate is 0.999866.
1103547 variants and 394 people pass filters and QC.
Note: No phenotypes present.
Pruned 964115 variants from chromosome 22, leaving 139432.
Pruning complete.  964115 of 1103547 variants removed.
Marker lists written to data/allCombined.prune.in and
data/allCombined.prune.out .

```

* We will now extract the SNPs which passed the LD pruning.

`plink --file data/allCombined --extract data/allCombined.prune.in --make-bed --out data/allCombinedLDPruned`

```
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to data/allCombinedLDPruned.log.
Options in effect:
  --extract data/allCombined.prune.in
  --file data/allCombined
  --make-bed
  --out data/allCombinedLDPruned

516841 MB RAM detected; reserving 258420 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (1103547 variants, 394 people).
--file: data/allCombinedLDPruned-temporary.bed
data/allCombinedLDPruned-temporary.bim + data/allCombinedLDPruned-temporary.fam
written.
1103547 variants loaded from .bim file.
394 people (0 males, 0 females, 394 ambiguous) loaded from .fam.
Ambiguous sex IDs written to data/allCombinedLDPruned.nosex .
--extract: 139469 variants remaining.
Warning: At least 15 duplicate IDs in --extract file.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 394 founders and 0 nonfounders present.
Calculating allele frequencies...  done.
Total genotyping rate is 0.999632.
139469 variants and 394 people pass filters and QC.
Note: No phenotypes present.
--make-bed to data/allCombinedLDPruned.bed + data/allCombinedLDPruned.bim +
data/allCombinedLDPruned.fam ... .

```
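The extraction log reports 139469 variants even though pruning left 139432, and PLINK warns about duplicate IDs. One plausible explanation, which is an assumption here and not something the log states outright, is that a few IDs occur on more than one variant, so `--extract` keeps every variant that shares a listed ID. The sketch below lists IDs that appear more than once in the second column of a MAP or BIM file; `duplicate_ids` is a hypothetical helper name.

```shell
# Print every variant ID (column 2) that occurs more than once in a .map or .bim file.
duplicate_ids() {
    awk '{ print $2 }' "$1" | sort | uniq -d
}

# Example usage (assumes the combined MAP file from this step is present):
# duplicate_ids data/allCombined.map | head
```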

***
___

## **Step 4.** During the last step of LD pruning, we also created a BED file from our original PED file. BED files hold exactly the same information as PED files, but in binary format.
### You can create BED files from PED files using the following command (shown here for the CLM population; the same command applies to the other populations)

`plink --file data/CLM.chr22 --make-bed --out data/CLM.chr22`

```
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to data/CLM.chr22.log.
Options in effect:
  --file data/CLM.chr22
  --make-bed
  --out data/CLM.chr22

516841 MB RAM detected; reserving 258420 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (1103547 variants, 94 people).
--file: data/CLM.chr22-temporary.bed + data/CLM.chr22-temporary.bim 
data/CLM.chr22-temporary.fam written.
1103547 variants loaded from .bim file.
94 people (0 males, 0 females, 94 ambiguous) loaded from .fam.
Ambiguous sex IDs written to data/CLM.chr22.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 94 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999881.
1103547 variants and 94 people pass filters and QC.
Note: No phenotypes present.
--make-bed to data/CLM.chr22.bed + data/CLM.chr22.bim + data/CLM.chr22.fam ....
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to data/IBS.chr22.log.
Options in effect:
  --file data/IBS.chr22
  --make-bed
  --out data/IBS.chr22

```

#### BED files are accompanied by BIM & FAM files which hold variant & sample information.

`ls data/*allCombinedLD*`

```
data/allCombinedLDPruned.bed  data/allCombinedLDPruned.log
data/allCombinedLDPruned.bim  data/allCombinedLDPruned.nosex
data/allCombinedLDPruned.fam
```
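As a sanity check on the binary fileset, a variant-major PLINK 1 `.bed` file starts with the three bytes `6c 1b 01` and then stores each variant in `ceil(samples / 4)` bytes (two bits per genotype). The sketch below derives the expected file size from the `.bim` and `.fam` line counts and prints the leading bytes; the helper names `expected_bed_size` and `bed_magic` are hypothetical, and `head -c` is assumed to be available.

```shell
# Expected .bed size in bytes: 3 magic bytes + variants * ceil(samples / 4).
# $1 = fileset prefix, e.g. data/allCombinedLDPruned
expected_bed_size() {
    n_variants=$(awk 'END { print NR }' "$1.bim")
    n_samples=$(awk 'END { print NR }' "$1.fam")
    echo $((3 + n_variants * ((n_samples + 3) / 4)))
}

# Print the first three bytes of the .bed file as hex, without spaces.
bed_magic() {
    head -c 3 "$1.bed" | od -An -tx1 | tr -d ' \n'
}

# Example usage (assumes the pruned fileset from Step 3 is present):
# expected_bed_size data/allCombinedLDPruned   # compare with: wc -c < data/allCombinedLDPruned.bed
# bed_magic data/allCombinedLDPruned
```

With the 139469 variants and 394 people reported in the extraction log, this formula gives 3 + 139469 × 99 = 13807434 bytes.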

***
___