## fastq2vcf
### Tutorial on the analysis of human NGS data, in particular the processing of fastq files to obtain variants in VCF format.


Links: 

[https://jupyter.org/](https://jupyter.org/)  
[https://pandas.pydata.org/](https://pandas.pydata.org/)


#### Fastq Downloading

We are going to use public data from the Simons Genome Diversity Project ([SGDP](https://reichdata.hms.harvard.edu/pub/datasets/sgdp/)), specifically 7 selected genomes, one from each 'Region' included in the dataset. The SGDP project ID in the ENA repository is [PRJEB9586](https://www.ebi.ac.uk/ena/browser/view/PRJEB9586).

The code below fetches partial fastq files (2 million read pairs per individual, i.e. the first 8,000,000 lines of each of the two fastq files) for one genome from each of the 7 regions defined in the SGDP (CentralAsiaSiberia, Africa, EastAsia, WestEurasia, America, SouthAsia and Oceania). The number of reads per sample is reduced to speed up the analysis, but the code can easily be modified to fetch and process complete genomes.


In [8]:
import pandas as pd
import os

SGDP_ENA_PID = 'PRJEB9586'
# ENA file report listing the sequencing runs of the project (tab-separated)
ENA_URL = 'https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=' + SGDP_ENA_PID + '&result=read_run'
# SGDP sample metadata table
metadata_URL = 'https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/SGDP_metadata.279public.21signedLetter.samples.txt'

# Keep only the runs submitted as a pair of FASTQ files
download_table = pd.read_csv(ENA_URL, sep='\t')
download_table = download_table[download_table['submitted_format']=='FASTQ;FASTQ']

# Keep fully public samples from panel C, then a single sample (the last one listed) per region
metadata = pd.read_csv(metadata_URL, encoding="ISO-8859-1", sep='\t')
metadata = metadata[(metadata['#Sequencing_Panel']=='C') & (metadata['Embargo']=='FullyPublic')]
metadata = metadata.drop_duplicates(subset=['Region'], keep='last')
# metadata.head()

# Restrict the ENA table to the runs of the selected samples
download_table = download_table[download_table['library_name'].isin(metadata['Illumina_ID'])]
# download_table.head()

# [wd] = !pwd
for sample_ID in metadata['Illumina_ID']:
    !mkdir -p fastq/{sample_ID}

    print("Downloading fastq partial files for sample:", sample_ID)
    # Each selected sample is expected to match exactly one run in the ENA table
    sample_rows = download_table[download_table['library_name']==sample_ID]
    [fastq_url] = sample_rows['fastq_ftp'].str.split(';').values
    [run_accession] = sample_rows['run_accession'].values

    # Keep the first 8,000,000 lines (4 lines per read, so 2 million reads) of each mate file
    if os.path.exists('fastq/' + sample_ID + '/' + run_accession + '_1.fastq.gz'):
        print("__fastq 1 file already present__")
    else:
        !wget --quiet -O - '{fastq_url[0]}' | zcat 2>/dev/null | head -n 8000000 - | gzip - > fastq/{sample_ID}/{run_accession}_1.fastq.gz
    if os.path.exists('fastq/' + sample_ID + '/' + run_accession + '_2.fastq.gz'):
        print("__fastq 2 file already present__")
    else:
        !wget --quiet -O - '{fastq_url[1]}' | zcat 2>/dev/null | head -n 8000000 - | gzip - > fastq/{sample_ID}/{run_accession}_2.fastq.gz


Downloading fastq partial files for sample: LP6005443-DNA_B04
__fastq 1 file already present__
__fastq 2 file already present__
Downloading fastq partial files for sample: LP6005441-DNA_A08
__fastq 1 file already present__
__fastq 2 file already present__
Downloading fastq partial files for sample: LP6005441-DNA_B09
__fastq 1 file already present__
__fastq 2 file already present__
Downloading fastq partial files for sample: LP6005441-DNA_G09
__fastq 1 file already present__
__fastq 2 file already present__
Downloading fastq partial files for sample: LP6005519-DNA_G02
__fastq 1 file already present__
__fastq 2 file already present__
Downloading fastq partial files for sample: LP6005442-DNA_B12
__fastq 1 file already present__
__fastq 2 file already present__
Downloading fastq partial files for sample: LP6005443-DNA_F08
__fastq 1 file already present__
__fastq 2 file already present__
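
A quick sanity check after the download is to confirm that each reduced file holds the expected number of reads. A fastq record spans 4 lines, so 8,000,000 lines correspond to 2 million reads. The sketch below is not part of the original notebook: `count_fastq_reads` is a hypothetical helper, and it is demonstrated on a tiny synthetic file so that it runs without the downloaded data.

```python
import gzip
import os
import tempfile

def count_fastq_reads(path):
    """Count the reads in a gzipped fastq file, assuming 4 lines per record."""
    with gzip.open(path, 'rt') as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4 (incomplete record?)")
    return n_lines // 4

# Demonstration on a tiny synthetic file (3 reads), so the example is self-contained
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo_1.fastq.gz')
with gzip.open(demo_path, 'wt') as handle:
    for i in range(3):
        handle.write(f"@read{i}\nACGT\n+\nIIII\n")

n_reads = count_fastq_reads(demo_path)
print(n_reads)
```

Applied to one of the files fetched above, the same helper would be expected to return 2000000, provided the original run contains at least that many reads.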


#### MD5 checksum

[MD5](https://en.wikipedia.org/wiki/MD5) is a hash function that produces a 128-bit digest, usually written as a 32-character hexadecimal string ([checksum](https://en.wikipedia.org/wiki/Checksum)), for a given file. It is widely used to check for corruption in data files after their transmission or storage. Since two different files are extremely unlikely to share the same checksum by accident, if the checksum is the same before and after downloading a data file, the two copies are identical and no data corruption occurred during the transfer.
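
As a minimal illustration of this property (not part of the original notebook), the sketch below uses Python's standard `hashlib` module on two made-up byte strings standing in for file contents.

```python
import hashlib

# Made-up content standing in for a data file, and a copy with one base changed
original = b"@read1\nACGTACGT\n+\nIIIIIIII\n"
corrupted = b"@read1\nACGTACGA\n+\nIIIIIIII\n"

md5_original = hashlib.md5(original).hexdigest()
md5_copy = hashlib.md5(bytes(original)).hexdigest()
md5_corrupted = hashlib.md5(corrupted).hexdigest()

print(len(md5_original))              # length of the hexadecimal checksum string
print(md5_original == md5_copy)       # identical content gives an identical checksum
print(md5_original == md5_corrupted)  # a single changed byte gives a different checksum
```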

In my experience, especially when working with relatively big datasets (hundreds to thousands of samples, but also smaller numbers), it is not rare for some of the downloaded files to suffer some small data corruption even when the wget retrieval (as used above) finishes with a 'successful' exit status. It is annoying to find out about this issue at later stages of the data processing, so it is much better to have a quick way to run a checksum test and re-download the data files that do not pass it.

In this case, since we modified the original fastq files from the ENA repository by reducing the number of reads in each file, we can no longer use the MD5 checksums originally provided for the complete files. However, I have computed them in advance for these reduced fastq files, and below is an example of how the test can be done.

In [None]:
# for sample_ID in metadata['Illumina_ID']:
#     !cd fastq/{sample_ID} 
    
#     print("MD5 checksum for fastq files of sample:", sample_ID)    
#     # NOTE: 'fastq_md5' in the ENA table refers to the complete files, not the reduced ones
#     [md5_pair] = download_table[download_table['library_name']==sample_ID]['fastq_md5'].str.split(';').values
#     md5_1, md5_2 = md5_pair
    
#     # md5 command
#     # ..
# #     !cd {wd}
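
One possible way to run such a test in pure Python is sketched below. It is an illustration under stated assumptions, not the notebook's own method: `md5sum` and `find_failed_files` are hypothetical helpers, and `expected_md5` is a hypothetical table mapping each file path to its pre-computed checksum. Two small temporary files stand in for the fastq.gz files so the example runs on its own.

```python
import hashlib
import os
import tempfile

def md5sum(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def find_failed_files(expected_md5):
    """Return the paths that are missing or whose MD5 differs from the expected value."""
    failed = []
    for path, expected in expected_md5.items():
        if not os.path.exists(path) or md5sum(path) != expected:
            failed.append(path)
    return failed

# Two small temporary files standing in for downloaded fastq.gz files
demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, 'good.fastq.gz')
bad_path = os.path.join(demo_dir, 'bad.fastq.gz')
with open(good_path, 'wb') as handle:
    handle.write(b'intact content')
with open(bad_path, 'wb') as handle:
    handle.write(b'corrupted content')

# Hypothetical table of pre-computed checksums (path -> expected MD5)
expected_md5 = {
    good_path: hashlib.md5(b'intact content').hexdigest(),
    bad_path: hashlib.md5(b'original content').hexdigest(),
}

failed = find_failed_files(expected_md5)
print(failed)  # only the file whose content no longer matches its checksum
```

The paths returned by `find_failed_files` are the candidates for re-download. Reading in chunks keeps memory use flat, which matters once the same check is applied to complete genomes.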