# Parsing VCF Files by Hand
In this tutorial you will learn some of the basics of `python` syntax while also learning to parse VCF files by hand! I am assuming that you have a basic understanding of coding (in some language) and of the VCF file specification, but if not you should read through the [documentation](https://samtools.github.io/hts-specs/VCFv4.3.pdf) first.  
The first thing we need to do is import __ONLY__ the packages necessary to read a bgzipped VCF file, since the point of this tutorial is to parse the VCF file by hand! In fact, we only need to import the `gzip` package, which can read bgzipped files because the bgzip format is gzip-compatible.

In [1]:
# Import the necessary packages.
import gzip

YOU ARE OFF TO A GREAT START! Now let's save the path to our VCF file, which holds the first 500 lines of the chromosome 1 TGP VCF file and which I provided in the same directory, to the variable `my_vcf`. To do so, we will write the file path as a string, which is denoted with either single quotes (`' '`) or double quotes (`" "`). It doesn't matter which one you use, just be consistent!

In [2]:
# Store the path to the vcf file.
my_vcf = './tgp_chr1_first_500_lines.vcf.gz'
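
As a quick check of the quoting rule, here is a minimal sketch showing that `python` treats single-quoted and double-quoted string literals the same way. The two variable names are made up for this illustration.

```python
# Two hypothetical variables holding the same path, quoted differently.
single_quoted = './tgp_chr1_first_500_lines.vcf.gz'
double_quoted = "./tgp_chr1_first_500_lines.vcf.gz"

# Both literals produce equal strings of the same type.
print(single_quoted == double_quoted)
print(type(single_quoted))
```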

GREAT JOB FRIEND! Note that the prefix `./` denotes the current working directory and is optional for a relative path like this one. Next we will use the `gzip` package to print out only the meta info and header lines.

In [3]:
# Using the gzip package open the vcf file such that it is readable and save it to the variable data.
with gzip.open(my_vcf, 'rt') as data:
    # Loop through every line in the vcf file.
    for line in data:
        # If the current line is a part of the meta info or header...
        if line.startswith('#'):
            # Print the meta info or header line.
            print(line)
        # Else...
        else:
            # Break the current loop.
            break

##fileformat=VCFv4.1

##FILTER=<ID=PASS,Description="All filters passed">

##fileDate=20150218

##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

##source=1000GenomesPhase3Pipeline

##contig=<ID=1,assembly=b37,length=249250621>

##contig=<ID=2,assembly=b37,length=243199373>

##contig=<ID=3,assembly=b37,length=198022430>

##contig=<ID=4,assembly=b37,length=191154276>

##contig=<ID=5,assembly=b37,length=180915260>

##contig=<ID=6,assembly=b37,length=171115067>

##contig=<ID=7,assembly=b37,length=159138663>

##contig=<ID=8,assembly=b37,length=146364022>

##contig=<ID=9,assembly=b37,length=141213431>

##contig=<ID=10,assembly=b37,length=135534747>

##contig=<ID=11,assembly=b37,length=135006516>

##contig=<ID=12,assembly=b37,length=133851895>

##contig=<ID=13,assembly=b37,length=115169878>

##contig=<ID=14,assembly=b37,length=107349540>

##contig=<ID=15,assembly=b37,length=102531392>

##contig=<ID=16,assembly=b37,lengt
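
The blank lines between the printed lines appear because every line read from the file still ends with a newline (`\n`), and `print()` adds one of its own. Here is a minimal sketch of one way to avoid that, assuming a tiny stand-in file with made-up contents (not the TGP data) written to a temporary directory.

```python
import gzip
import os
import tempfile

# Build a tiny gzipped stand-in file (hypothetical contents, not the TGP data).
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.vcf.gz')
with gzip.open(demo_path, 'wt') as out:
    out.write('##fileformat=VCFv4.1\n')
    out.write('##source=demo\n')
    out.write('#CHROM\tPOS\n')
    out.write('1\t100\n')

# 'rt' opens the file in text mode, so every line is a str ending in '\n'.
header_lines = []
with gzip.open(demo_path, 'rt') as data:
    for line in data:
        if line.startswith('#'):
            # end='' stops print() from adding a second newline.
            print(line, end='')
            header_lines.append(line)
        else:
            break
```

Calling `line.rstrip('\n')` before printing would have the same visible effect.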

AWESOME, RIGHT? When we are parsing VCF files we rarely record the meta info or header line, but the same principle can be used to skip over that information! In fact, let's now parse the VCF file, skipping the meta info but saving the header line to the variable `header_info` and printing it. NOTE: Meta info lines have the prefix `##` while the header line has the prefix `#`.

In [4]:
# Using the gzip package open the vcf file such that it is readable and save it to the variable data.
with gzip.open(my_vcf, 'rt') as data:
    # Loop through every line in the vcf file.
    for line in data:
        # If the current line is a part of the meta info...
        if line.startswith('##'):
            # Continue to the next line in the vcf file.
            continue
        # Else-if the current line is the header line.
        elif line.startswith('#'):
            # Save the header line.
            header_info = line
            # Print the header line.
            print(header_info)
        # Else...
        else:
            # Break the current loop.
            break

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099	HG00100	HG00101	HG00102	HG00103	HG00105	HG00106	HG00107	HG00108	HG00109	HG00110	HG00111	HG00112	HG00113	HG00114	HG00115	HG00116	HG00117	HG00118	HG00119	HG00120	HG00121	HG00122	HG00123	HG00125	HG00126	HG00127	HG00128	HG00129	HG00130	HG00131	HG00132	HG00133	HG00136	HG00137	HG00138	HG00139	HG00140	HG00141	HG00142	HG00143	HG00145	HG00146	HG00148	HG00149	HG00150	HG00151	HG00154	HG00155	HG00157	HG00158	HG00159	HG00160	HG00171	HG00173	HG00174	HG00176	HG00177	HG00178	HG00179	HG00180	HG00181	HG00182	HG00183	HG00185	HG00186	HG00187	HG00188	HG00189	HG00190	HG00231	HG00232	HG00233	HG00234	HG00235	HG00236	HG00237	HG00238	HG00239	HG00240	HG00242	HG00243	HG00244	HG00245	HG00246	HG00250	HG00251	HG00252	HG00253	HG00254	HG00255	HG00256	HG00257	HG00258	HG00259	HG00260	HG00261	HG00262	HG00263	HG00264	HG00265	HG00266	HG00267	HG00268	HG00269	HG00271	HG00272	HG00273	HG00274	HG00275	HG00276	HG00277	HG00278	HG00280	HG00281	HG00282	HG00284	HG
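
The order of the two checks matters: a meta info line starting with `##` would also pass a test for `#`, so the `##` test has to come first. Here is a small sketch of that logic, using a hypothetical helper named `classify_line` and made-up example lines.

```python
def classify_line(line):
    """Label a VCF line as 'meta', 'header', or 'record' (hypothetical helper)."""
    # Test for '##' first: a meta line would also pass startswith('#').
    if line.startswith('##'):
        return 'meta'
    elif line.startswith('#'):
        return 'header'
    else:
        return 'record'

# Made-up example lines, one of each kind.
demo_lines = ['##fileformat=VCFv4.1\n', '#CHROM\tPOS\tID\n', '1\t10177\trs367896724\n']
labels = [classify_line(line) for line in demo_lines]
print(labels)
```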

WOW, GREAT JOB! Although I just said we normally skip the meta info and header lines, let's inspect the header line.

In [5]:
# Print the data type.
print(type(header_info))
# Display the header line.
header_info

<class 'str'>


'#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\tHG00099\tHG00100\tHG00101\tHG00102\tHG00103\tHG00105\tHG00106\tHG00107\tHG00108\tHG00109\tHG00110\tHG00111\tHG00112\tHG00113\tHG00114\tHG00115\tHG00116\tHG00117\tHG00118\tHG00119\tHG00120\tHG00121\tHG00122\tHG00123\tHG00125\tHG00126\tHG00127\tHG00128\tHG00129\tHG00130\tHG00131\tHG00132\tHG00133\tHG00136\tHG00137\tHG00138\tHG00139\tHG00140\tHG00141\tHG00142\tHG00143\tHG00145\tHG00146\tHG00148\tHG00149\tHG00150\tHG00151\tHG00154\tHG00155\tHG00157\tHG00158\tHG00159\tHG00160\tHG00171\tHG00173\tHG00174\tHG00176\tHG00177\tHG00178\tHG00179\tHG00180\tHG00181\tHG00182\tHG00183\tHG00185\tHG00186\tHG00187\tHG00188\tHG00189\tHG00190\tHG00231\tHG00232\tHG00233\tHG00234\tHG00235\tHG00236\tHG00237\tHG00238\tHG00239\tHG00240\tHG00242\tHG00243\tHG00244\tHG00245\tHG00246\tHG00250\tHG00251\tHG00252\tHG00253\tHG00254\tHG00255\tHG00256\tHG00257\tHG00258\tHG00259\tHG00260\tHG00261\tHG00262\tHG00263\tHG00264\tHG00265\tHG00266\tHG00267\

GREAT JOB! So now we know that the variable `header_info` is a string, which makes sense given what we just displayed. If you look closely you will notice that all of the text in the header line is separated by tabs (`\t`) and that the line ends with a newline (`\n`). This makes sense considering that a VCF file is just a tab-delimited text file! Let's now take advantage of the fact that each line is a string of tab-separated fields and split this string using the built-in string method `.split()`.

In [6]:
# Strip the trailing newline and split the header line by tabs.
header_info_split = header_info.rstrip('\n').split('\t')
# Print out the data type.
print(type(header_info_split))
# Display the split header info.
print(header_info_split)

<class 'list'>
['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101', 'HG00102', 'HG00103', 'HG00105', 'HG00106', 'HG00107', 'HG00108', 'HG00109', 'HG00110', 'HG00111', 'HG00112', 'HG00113', 'HG00114', 'HG00115', 'HG00116', 'HG00117', 'HG00118', 'HG00119', 'HG00120', 'HG00121', 'HG00122', 'HG00123', 'HG00125', 'HG00126', 'HG00127', 'HG00128', 'HG00129', 'HG00130', 'HG00131', 'HG00132', 'HG00133', 'HG00136', 'HG00137', 'HG00138', 'HG00139', 'HG00140', 'HG00141', 'HG00142', 'HG00143', 'HG00145', 'HG00146', 'HG00148', 'HG00149', 'HG00150', 'HG00151', 'HG00154', 'HG00155', 'HG00157', 'HG00158', 'HG00159', 'HG00160', 'HG00171', 'HG00173', 'HG00174', 'HG00176', 'HG00177', 'HG00178', 'HG00179', 'HG00180', 'HG00181', 'HG00182', 'HG00183', 'HG00185', 'HG00186', 'HG00187', 'HG00188', 'HG00189', 'HG00190', 'HG00231', 'HG00232', 'HG00233', 'HG00234', 'HG00235', 'HG00236', 'HG00237', 'HG00238', 'HG00239', 'HG00240', 'HG00242'
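
One detail worth knowing: `.split()` called with no argument splits on any run of whitespace, while `.split('\t')` splits on tabs only. The two give the same result for this header, but they can differ when a field contains a space. A minimal sketch, using a made-up record whose INFO field contains a space:

```python
# A made-up record whose INFO field contains a space.
demo_line = '1\t100\t.\tA\tT\t50\tPASS\tNOTE=two words\n'

# No argument: split on any run of whitespace, including the space in INFO.
by_whitespace = demo_line.split()
# Explicit tab: strip the trailing newline first, then split on tabs only.
by_tab = demo_line.rstrip('\n').split('\t')

print(len(by_whitespace), by_whitespace[-2:])
print(len(by_tab), by_tab[-1])
```

Splitting on tabs keeps the eight fields intact, while splitting on whitespace breaks the last field in two.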

WOOHOO, YOU ARE A ROCKSTAR! Let's now play around with the `header_info_split` list to gain intuition for how to parse VCF files. Note that `python` uses zero-based indexing, which simply means we count from 0 up to the length of the list minus 1.

In [7]:
# Print the first column.
print('Column 1 = {0}'.format(header_info_split[0]))
# Print the second column.
print('Column 2 = {0}'.format(header_info_split[1]))
# Print the third column.
print('Column 3 = {0}'.format(header_info_split[2]))
# Print the fourth column.
print('Column 4 = {0}'.format(header_info_split[3]))
# Print the fifth column.
print('Column 5 = {0}'.format(header_info_split[4]))
# Print the sixth column.
print('Column 6 = {0}'.format(header_info_split[5]))
# Print the seventh column.
print('Column 7 = {0}'.format(header_info_split[6]))
# Print the eighth column.
print('Column 8 = {0}'.format(header_info_split[7]))
# Print the ninth column.
print('Column 9 = {0}'.format(header_info_split[8]))
# Print the rest of columns.
print('Columns 10-End = {0}'.format(header_info_split[9:]))

Column 1 = #CHROM
Column 2 = POS
Column 3 = ID
Column 4 = REF
Column 5 = ALT
Column 6 = QUAL
Column 7 = FILTER
Column 8 = INFO
Column 9 = FORMAT
Columns 10-End = ['HG00096', 'HG00097', 'HG00099', 'HG00100', 'HG00101', 'HG00102', 'HG00103', 'HG00105', 'HG00106', 'HG00107', 'HG00108', 'HG00109', 'HG00110', 'HG00111', 'HG00112', 'HG00113', 'HG00114', 'HG00115', 'HG00116', 'HG00117', 'HG00118', 'HG00119', 'HG00120', 'HG00121', 'HG00122', 'HG00123', 'HG00125', 'HG00126', 'HG00127', 'HG00128', 'HG00129', 'HG00130', 'HG00131', 'HG00132', 'HG00133', 'HG00136', 'HG00137', 'HG00138', 'HG00139', 'HG00140', 'HG00141', 'HG00142', 'HG00143', 'HG00145', 'HG00146', 'HG00148', 'HG00149', 'HG00150', 'HG00151', 'HG00154', 'HG00155', 'HG00157', 'HG00158', 'HG00159', 'HG00160', 'HG00171', 'HG00173', 'HG00174', 'HG00176', 'HG00177', 'HG00178', 'HG00179', 'HG00180', 'HG00181', 'HG00182', 'HG00183', 'HG00185', 'HG00186', 'HG00187', 'HG00188', 'HG00189', 'HG00190', 'HG00231', 'HG00232', 'HG00233', 'HG00234', '
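
To build a little more intuition for zero-based indexing and slicing, here is a minimal sketch on a shortened, made-up header list (the sample names `S1`, `S2`, and `S3` are placeholders).

```python
# A shortened, made-up header list for illustration.
columns = ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'S1', 'S2', 'S3']

first = columns[0]      # The first element lives at index 0.
last = columns[-1]      # Negative indices count from the end.
fixed = columns[:9]     # Indices 0 through 8 (the stop index is excluded).
samples = columns[9:]   # Index 9 through the end of the list.

print(first, last)
print(len(fixed), samples)
# enumerate pairs each index with its value.
for index, name in enumerate(columns[:3]):
    print(index, name)
```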

AMAZING, RIGHT? Let's recap what this means when we are parsing VCF files:  
* Index `0` = Chromosome
* Index `1` = Position
* Index `2` = Variant ID
* Index `3` = Reference Allele
* Index `4` = Alternative Allele
* Index `5` = Quality Score
* Index `6` = Filter Status
* Index `7` = Info Field
* Index `8` = Format Field
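* Indices `9:` = All of the Samples  
Note that these indices are the same for any VCF file that contains genotype data (the first eight columns are always present, while `FORMAT` and the sample columns appear only when genotypes are included)! Let's parse through the VCF file and convince ourselves this is true!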

In [8]:
# Using the gzip package open the vcf file such that it is readable and save it to the variable data.
with gzip.open(my_vcf, 'rt') as data:
    # Loop through every line in the vcf file.
    for line in data:
        # If the current line is a part of the meta info or header line...
        if line.startswith('#'):
            # Continue to the next line in the vcf file.
            continue
        # Else...
        else:
            # Convert the line to a list by stripping the newline and splitting the string by tabs.
            spline = line.rstrip('\n').split('\t')
            # If the position comes before the fourth site at 10505 (I peeked inside the vcf).
            if (int(spline[1]) < 10505):
                # Print out the line information by column.
                print('CHROM = {0}'.format(spline[0]))
                print('POS = {0}'.format(spline[1]))
                print('ID = {0}'.format(spline[2]))
                print('REF = {0}'.format(spline[3]))
                print('ALT = {0}'.format(spline[4]))
                print('QUAL = {0}'.format(spline[5]))
                print('FILTER = {0}'.format(spline[6]))
                print('INFO = {0}'.format(spline[7]))
                print('FORMAT = {0}'.format(spline[8]))
                print('SAMPLES = {0}'.format(spline[9:]))
                print('------------------------------')
            # Else...
            else:
                # Break the loop to stop printing too much info.
                break

CHROM = 1
POS = 10177
ID = rs367896724
REF = A
ALT = AC
QUAL = 100
FILTER = PASS
INFO = AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL
FORMAT = GT
SAMPLES = ['1|0', '0|1', '0|1', '1|0', '0|0', '1|0', '1|0', '1|0', '1|0', '0|0', '0|0', '0|0', '0|0', '0|0', '0|0', '0|0', '0|1', '1|0', '0|0', '0|0', '1|0', '0|0', '0|0', '0|0', '0|1', '1|0', '0|1', '0|1', '0|1', '0|1', '1|0', '0|0', '1|0', '1|0', '0|0', '0|1', '0|0', '0|0', '1|0', '0|1', '1|0', '0|0', '1|0', '1|0', '0|0', '1|0', '0|1', '0|1', '0|0', '0|0', '1|0', '1|0', '0|0', '0|0', '0|1', '0|0', '0|0', '1|0', '1|1', '1|0', '0|1', '0|0', '0|0', '1|1', '0|1', '0|0', '0|1', '0|1', '0|0', '1|0', '1|0', '1|0', '0|1', '0|0', '1|0', '1|0', '1|0', '0|0', '1|0', '0|0', '0|1', '0|1', '1|0', '0|1', '1|1', '0|0', '0|1', '0|0', '1|0', '0|0', '0|0', '1|0', '0|0', '0|0', '0|0', '1|0', '1|0', '0|0', '0|1', '0|0', '1|0', '0|0', '1|0', '0|1', '1|0', '0
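
Once a line is split, numeric indices can be swapped for names, and the INFO field can be broken down further since it is a `;`-separated list of `key=value` entries. The sketch below is one possible way to do this, not part of the original workflow: `parse_info` is a hypothetical helper, and the INFO string is a shortened copy of the one in the output above.

```python
def parse_info(info):
    """Turn an INFO string into a dict (hypothetical helper)."""
    parsed = {}
    # A lone '.' marks a missing INFO field.
    if info == '.':
        return parsed
    for entry in info.split(';'):
        # key=value entries are split once; bare flags get the value True.
        if '=' in entry:
            key, value = entry.split('=', 1)
            parsed[key] = value
        else:
            parsed[entry] = True
    return parsed

# The first eight columns of the first record (INFO shortened for readability).
names = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO']
fields = ['1', '10177', 'rs367896724', 'A', 'AC', '100', 'PASS',
          'AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;VT=INDEL']

# zip pairs each column name with its value, and dict makes them a lookup table.
record = dict(zip(names, fields))
info = parse_info(record['INFO'])
print(record['POS'], info['AF'], info['VT'])
```

All values stay strings here, so convert them with `int()` or `float()` before doing any arithmetic.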

GREAT JOB! The last thing we will do is filter the VCF file to only contain biallelic SNPs, which is a common filtering scheme for many population genetic analyses, and write the filtered sites to a new uncompressed VCF file named `tgp_chr1_first_500_lines_biallelic_only.vcf`. We are lucky that the TGP has already pre-filtered this VCF file for quality, so we will just remove sites that are multiallelic, indels, or structural variants!

In [9]:
# Using the gzip package open the vcf file such that it is readable and save it to the variable data.
with gzip.open(my_vcf, 'rt') as data:
    # Initialize a vcf file to write the filtered output to.
    new_vcf = open('./tgp_chr1_first_500_lines_biallelic_only.vcf', 'w')
    # Loop through every line in the vcf file.
    for line in data:
        # If the current line is a part of the meta info or header line...
        if line.startswith('#'):
            # Write the header information to the new vcf file.
            new_vcf.write(line)
        # Else...
        else:
            # Convert the line to a list by stripping the newline and splitting the string by tabs.
            spline = line.rstrip('\n').split('\t')
            # Grab the position number.
            pos = spline[1]
            # Grab the reference allele.
            ref = spline[3]
            # Grab the alternate allele.
            alt = spline[4]
            # If the site is a biallelic snp (both alleles are a single character)...
            if (len(ref) == 1) and (len(alt) == 1):
                # Print the biallelic site info.
                print('Keeping site... POS: {0}; REF: {1}; ALT: {2}'.format(pos, ref, alt))
                # Write the filtered line to the new vcf file.
                new_vcf.write(line)
            # Else...
            else:
                # Print the site info for the site we are filtering out.
                print('Filtering site... POS: {0}; REF: {1}; ALT: {2}'.format(pos, ref, alt))
    # Close the new vcf file.
    new_vcf.close()

Filtering site... POS: 10177; REF: A; ALT: AC
Filtering site... POS: 10235; REF: T; ALT: TA
Filtering site... POS: 10352; REF: T; ALT: TA
Keeping site... POS: 10505; REF: A; ALT: T
Keeping site... POS: 10506; REF: C; ALT: G
Keeping site... POS: 10511; REF: G; ALT: A
Keeping site... POS: 10539; REF: C; ALT: A
Keeping site... POS: 10542; REF: C; ALT: T
Keeping site... POS: 10579; REF: C; ALT: A
Filtering site... POS: 10616; REF: CCGCCGTTGCAAAGGCGCGCCG; ALT: C
Keeping site... POS: 10642; REF: G; ALT: A
Keeping site... POS: 11008; REF: C; ALT: G
Keeping site... POS: 11012; REF: C; ALT: G
Keeping site... POS: 11063; REF: T; ALT: G
Keeping site... POS: 13011; REF: T; ALT: G
Keeping site... POS: 13110; REF: G; ALT: A
Keeping site... POS: 13116; REF: T; ALT: G
Keeping site... POS: 13118; REF: A; ALT: G
Keeping site... POS: 13156; REF: G; ALT: C
Keeping site... POS: 13259; REF: G; ALT: A
Keeping site... POS: 13273; REF: G; ALT: C
Keeping site... POS: 13284; REF: G; ALT: A
Keeping site... POS: 1
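
The length check works because multiallelic sites list their alternate alleles separated by commas (for example `A,T`), and indels and symbolic alleles are longer than one character. It would, however, also keep single-character placeholders such as `*` or `.` if a file contained them. Here is a sketch of a slightly stricter check, assuming alleles are written as uppercase `A`, `C`, `G`, or `T`. The helper `is_biallelic_snp` and the example sites are made up for illustration.

```python
def is_biallelic_snp(ref, alt):
    """Return True when REF and ALT are each a single nucleotide (hypothetical helper)."""
    # Assumption: alleles are uppercase A, C, G, or T.
    nucleotides = {'A', 'C', 'G', 'T'}
    # `and` is the logical operator, whereas `&` is the bitwise operator.
    return ref in nucleotides and alt in nucleotides

# Made-up (REF, ALT) pairs: a SNP, an insertion, a multiallelic site,
# a symbolic allele, and two single-character placeholders.
demo_sites = [('A', 'T'), ('A', 'AC'), ('C', 'A,T'), ('G', '<CN0>'), ('T', '*'), ('T', '.')]
kept = [site for site in demo_sites if is_biallelic_snp(*site)]
print(kept)
```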