## fastq2vcf
### A tutorial on the analysis of human NGS data, in particular on processing fastq files to obtain variants in VCF format.

Links: 




#### Fastq Downloading

We are going to use public data from the Simons Genome Diversity Project ([SGDP](https://reichdata.hms.harvard.edu/pub/datasets/sgdp/)), specifically 7 genomes, one selected from each 'Region' included in the dataset. The SGDP project ID in the ENA repository is [PRJEB9586](https://www.ebi.ac.uk/ena/browser/view/PRJEB9586).
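
As a rough illustration of how one genome per 'Region' can be picked, the sketch below applies a panel/embargo filter followed by `drop_duplicates(subset=['Region'], keep='last')` to a small made-up table. The sample IDs and the `SignedLetter` embargo value are invented for the example; only the column names are taken from the SGDP metadata file.

```python
import pandas as pd

# Toy stand-in for the SGDP metadata table (all values are hypothetical).
toy = pd.DataFrame({
    'Illumina_ID': ['S1', 'S2', 'S3', 'S4', 'S5'],
    'Region': ['Africa', 'Africa', 'Africa', 'Oceania', 'Oceania'],
    '#Sequencing_Panel': ['C', 'C', 'B', 'C', 'C'],
    'Embargo': ['FullyPublic', 'FullyPublic', 'FullyPublic', 'FullyPublic', 'SignedLetter'],
})

# Keep only panel C samples that are fully public.
public_c = toy[(toy['#Sequencing_Panel'] == 'C') & (toy['Embargo'] == 'FullyPublic')]

# For each Region keep a single row: the last one in table order.
selected = public_c.drop_duplicates(subset=['Region'], keep='last')
print(selected[['Illumina_ID', 'Region']])
```

With `keep='last'` the choice within a region depends only on row order in the table, so a different ordering of the metadata file would select different samples.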

In [None]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

SGDP_ENA_PID = 'PRJEB9586'
ENA_URL = 'https://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=' + SGDP_ENA_PID + '&result=read_run'
metadata_URL = 'https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/SGDP_metadata.279public.21signedLetter.samples.txt'

download_table = pd.read_csv(ENA_URL, sep='\t')
download_table = download_table[download_table['submitted_format']=='FASTQ;FASTQ']
# download_table.head()

metadata = pd.read_csv(metadata_URL, encoding="ISO-8859-1", sep='\t')
metadata = metadata[(metadata['#Sequencing_Panel']=='C') & (metadata['Embargo']=='FullyPublic')]
metadata = metadata.drop_duplicates(subset=['Region'], keep='last')

download_table = download_table[download_table['library_name'].isin(metadata['Illumina_ID'])]
# download_table

for sample_ID in metadata['Illumina_ID']:
    !mkdir -p fastq/{sample_ID}

    print("Downloading fastq files for sample:", sample_ID)
    # 'fastq_ftp' holds both mates as 'url_1;url_2'; take the first run of the library
    fastq = download_table[download_table['library_name']==sample_ID]['fastq_ftp'].str.split(';').values
    fastq_1, fastq_2 = fastq[0][0], fastq[0][1]
    # -b: run in background, -c: resume partial files, -t 100: retry up to 100 times,
    # -P: save directly into the sample directory (a later `mv` would race the background jobs)
    !wget -b -c -t 100 -P fastq/{sample_ID} {fastq_1}
    !wget -b -c -t 100 -P fastq/{sample_ID} {fastq_2}
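
Outside a notebook, the same download step could be sketched in plain Python. The helper names `split_pair` and `wget_commands` and the example URLs below are hypothetical and only illustrate the idea: split the `;`-separated `fastq_ftp` field, check that it really contains two entries, and build one `wget` call per mate that writes straight into `fastq/<sample>`.

```python
from typing import List, Tuple


def split_pair(field: str) -> Tuple[str, str]:
    """Split an 'a;b' field into its two entries, failing loudly otherwise."""
    parts = field.split(';')
    if len(parts) != 2:
        raise ValueError(f"expected 2 ';'-separated entries, got {len(parts)}: {field!r}")
    return parts[0], parts[1]


def wget_commands(fastq_ftp: str, sample_id: str, tries: int = 100) -> List[List[str]]:
    """Build one wget argument list per mate; -P sets the destination directory."""
    dest = f"fastq/{sample_id}"
    return [["wget", "-c", "-t", str(tries), "-P", dest, url]
            for url in split_pair(fastq_ftp)]


# Made-up field value, for illustration only (not a real ENA path).
example = "ftp.example.org/run/reads_1.fastq.gz;ftp.example.org/run/reads_2.fastq.gz"
for cmd in wget_commands(example, "SAMPLE_X"):
    print(" ".join(cmd))
    # To actually download (blocking until done), one could `import subprocess`
    # and call subprocess.run(cmd, check=True) here.
```

Running the downloads in the foreground like this is slower than `wget -b`, but it guarantees the files are complete before any later step reads them.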

#### MD5 checksum


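Compare the MD5 checksum of each downloaded fastq file with the value reported by ENA in the `fastq_md5` column of the download table.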

In [None]:
for sample_ID in metadata['Illumina_ID']:
    # Note: `!cd` runs in its own subshell and does not change the notebook's directory,
    # so files are addressed by their path under fastq/{sample_ID} instead
    print("MD5 checksum for fastq files of sample:", sample_ID)
    # 'fastq_md5' holds the expected checksums as 'md5_1;md5_2', in the same order as 'fastq_ftp'
    md5 = download_table[download_table['library_name']==sample_ID]['fastq_md5'].str.split(';').values
    md5_1, md5_2 = md5[0][0], md5[0][1]

    # md5 command
    # ..
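
The checksum command itself is left open above. One possible way to do the comparison in pure Python, shown here only as a sketch, is to hash each file with `hashlib` and compare the result with the expected value. The function names `md5sum` and `verify_md5` are assumptions made for this example, and the check is only meaningful once the background downloads have finished.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """True if the file exists and its MD5 matches the expected hex digest."""
    path = Path(path)
    return path.is_file() and md5sum(path) == expected.strip().lower()
```

Inside the loop, each downloaded file (for instance `Path("fastq") / sample_ID / Path(url).name`, where `url` is one entry of the `fastq_ftp` field) could then be passed to `verify_md5` together with the matching entry of `fastq_md5`, and any file returning `False` downloaded again.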