## fastq2vcf
### A tutorial on the analysis of human NGS data, in particular the processing of FASTQ files to obtain variants in VCF format.


Links: 

[https://jupyter.org/](https://jupyter.org/)  
[https://pandas.pydata.org/](https://pandas.pydata.org/)


#### Downloading FASTQ files

We will use public data from the Simons Genome Diversity Project ([SGDP](https://reichdata.hms.harvard.edu/pub/datasets/sgdp/)): specifically, seven genomes, one selected from each 'Region' included in the dataset. The SGDP project ID in the ENA repository is [PRJEB9586](https://www.ebi.ac.uk/ena/browser/view/PRJEB9586).

In [22]:
import pandas as pd
import os

SGDP_ENA_PID = 'PRJEB9586'
ENA_URL = 'https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=' + SGDP_ENA_PID + '&result=read_run'
metadata_URL = 'https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/SGDP_metadata.279public.21signedLetter.samples.txt'

download_table = pd.read_csv(ENA_URL, sep='\t')
download_table = download_table[download_table['submitted_format']=='FASTQ;FASTQ']
# download_table.head()

metadata = pd.read_csv(metadata_URL, encoding="ISO-8859-1", sep='\t')
metadata = metadata[(metadata['#Sequencing_Panel']=='C') & (metadata['Embargo']=='FullyPublic')]
metadata = metadata.drop_duplicates(subset=['Region'], keep='last')

download_table = download_table[download_table['library_name'].isin(metadata['Illumina_ID'])]
# download_table.head()

[wd] = !pwd
for sample_ID in metadata['Illumina_ID']:
    !mkdir -p fastq/{sample_ID}
    
    print("Downloading partial fastq files for sample:", sample_ID)
    # The list unpacking below assumes exactly one ENA run per sample
    [fastq_url] = download_table[download_table['library_name']==sample_ID]['fastq_ftp'].str.split(';').values
    [run_accession] = download_table[download_table['library_name']==sample_ID]['run_accession'].values
    
    # Keep only the first 8,000,000 lines of each file (2,000,000 reads at 4 lines per read)
    if os.path.exists('fastq/' + sample_ID + '/' + run_accession + '_1.fastq.gz'):
        print("__fastq 1 file already present__")
    else:
        !wget --quiet -O - '{fastq_url[0]}' | zcat 2>/dev/null | head -n 8000000 - | gzip - > fastq/{sample_ID}/{run_accession}_1.fastq.gz
    if os.path.exists('fastq/' + sample_ID + '/' + run_accession + '_2.fastq.gz'):
        print("__fastq 2 file already present__")
    else:
        !wget --quiet -O - '{fastq_url[1]}' | zcat 2>/dev/null | head -n 8000000 - | gzip - > fastq/{sample_ID}/{run_accession}_2.fastq.gz


Dowloading fastq files for sample: LP6005443-DNA_B04
__fastq  file already present__
__fastq  file already present__
Dowloading fastq files for sample: LP6005441-DNA_A08
__fastq  file already present__
__fastq  file already present__
Dowloading fastq files for sample: LP6005441-DNA_B09
__fastq  file already present__
__fastq  file already present__
Dowloading fastq files for sample: LP6005441-DNA_G09
__fastq  file already present__
__fastq  file already present__
Dowloading fastq files for sample: LP6005519-DNA_G02
__fastq  file already present__
__fastq  file already present__
Dowloading fastq files for sample: LP6005442-DNA_B12
__fastq  file already present__
__fastq  file already present__
Dowloading fastq files for sample: LP6005443-DNA_F08
__fastq  file already present__
__fastq  file already present__
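
The download cell keeps only the first 8,000,000 lines of each file, which corresponds to 2,000,000 reads if every record uses the usual four-line FASTQ layout. A quick way to confirm that a partial file was written completely is to count its records. The sketch below is one possible check, not part of the original notebook; the helper name `count_fastq_reads` is hypothetical and the four-line layout is an assumption.

```python
import gzip

# Assumption: each FASTQ record spans exactly 4 lines
# (header, sequence, '+' separator, quality string).
LINES_PER_READ = 4


def count_fastq_reads(path):
    """Return the number of complete 4-line records in a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // LINES_PER_READ
```

Called on one of the files under `fastq/`, this would be expected to return at most 2,000,000. A much smaller value may point to an interrupted download.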


#### MD5 checksum


...........

In [None]:
# for sample_ID in metadata['Illumina_ID']:
#     !cd fastq/{sample_ID} 
    
#     print("MD5 checksum for fastq files of sample:", sample_ID)    
#     [md5_pair] = download_table[download_table['library_name']==sample_ID]['fastq_md5'].str.split(';').values
#     md5_1, md5_2 = md5_pair  # one checksum per mate file
    
#     # md5 command
#     # ..
# #     !cd {wd}
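
The checksum step itself is left open in the cell above. As a minimal sketch, assuming the goal is to compare a local file against one of the values in the `fastq_md5` column, the digest can be computed with Python's standard `hashlib` module. The helper names `file_md5` and `md5_matches` are hypothetical. Note that `fastq_md5` describes the complete file published by ENA, so it would not be expected to match the truncated and recompressed files created in the download step; the comparison only makes sense for a full download.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTQ files are never held in memory at once
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Compare a file's MD5 digest with an expected hexadecimal string."""
    return file_md5(path) == expected_md5.strip().lower()
```

With a fully downloaded mate file, `md5_matches(path, md5_1)` would then report whether the transfer was intact.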