# Illumina data analysis

The Illumina data has to be split (demultiplexed) according to the index. This is generally done by the BaseSpace software.
Each sample should be run through the pipeline separately in order to obtain a profile for each sample.


## Unzipping the FASTQ files and concatenating them into one file

In [35]:
%%bash

cd /media/genomics/senne/Illumina_multiplex_data2/
gunzip -d *.fastq.gz




gzip: Senne2800_S19_L001_R1_001.fastq: Operation not permitted
gzip: Senne2800_S19_L001_R1_001.fastq: Operation not permitted
gzip: Senne2800_S19_L002_R1_001.fastq: Operation not permitted
gzip: Senne2800_S19_L002_R1_001.fastq: Operation not permitted
gzip: Senne2800_S19_L003_R1_001.fastq: Operation not permitted
gzip: Senne2800_S19_L003_R1_001.fastq: Operation not permitted
gzip: Senne2800_S19_L004_R1_001.fastq: Operation not permitted
gzip: Senne2800_S19_L004_R1_001.fastq: Operation not permitted
gzip: Senne47_S20_L001_R1_001.fastq: Operation not permitted
gzip: Senne47_S20_L001_R1_001.fastq: Operation not permitted
gzip: Senne47_S20_L002_R1_001.fastq: Operation not permitted
gzip: Senne47_S20_L002_R1_001.fastq: Operation not permitted
gzip: Senne47_S20_L003_R1_001.fastq: Operation not permitted
gzip: Senne47_S20_L003_R1_001.fastq: Operation not permitted
gzip: Senne47_S20_L004_R1_001.fastq: Operation not permitted
gzip: Senne47_S20_L004_R1_001.fastq: Operation not permitted
gzip: Se

In [4]:
!pwd

/home/senne/nanopore


In [40]:
%%bash

cd /media/genomics/senne/Illumina_multiplex_data2/

cat Senne48_*.fastq > 48.fastq
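
The same merge can be sketched in Python, reading the compressed lane files directly so that no separate `gunzip` step is needed. This is only an illustration: `merge_lanes` and its parameters are hypothetical names, and it assumes that all lane files of one sample share a common filename prefix and the `.fastq.gz` suffix.

```python
import glob
import gzip
import os
import shutil


def merge_lanes(directory, sample_prefix, out_name, suffix='.fastq.gz'):
    """Concatenate all lane files of one sample into a single plain FASTQ file.

    Only files matching <sample_prefix>*<suffix> are read, so a lane that
    exists both compressed and uncompressed is not merged twice.
    Returns the list of input files in the order they were merged.
    """
    out_path = os.path.join(directory, out_name)
    pattern = os.path.join(directory, sample_prefix + '*' + suffix)
    # Sort so that lanes are merged in a reproducible order (L001, L002, ...)
    lane_files = sorted(glob.glob(pattern))
    opener = gzip.open if suffix.endswith('.gz') else open
    with open(out_path, 'wb') as out:
        for path in lane_files:
            with opener(path, 'rb') as src:
                shutil.copyfileobj(src, out)
    return lane_files
```

With the paths used in this notebook, the call would look like `merge_lanes('/media/genomics/senne/Illumina_multiplex_data2/', 'Senne48_', '48.fastq')`.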

## Converting a FASTQ to a FASTA

In [44]:
%%bash

cd /home/senne/nanopore/Multiplex/Reference_sequences/SNP/9948_Illumina_0.5µM/

sed -n '1~4s/^@/>/p;2~4p' 48.fastq > 48.fasta
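
The `sed` expression above keeps lines 1 and 2 of every 4-line FASTQ record and turns the leading `@` of the header into `>`. A minimal Python sketch of the same conversion is given below. `fastq_to_fasta` is a hypothetical helper name. Like the `sed` version, it assumes strict 4-line records (no wrapped sequences). Unlike `sed`, it raises an error on a malformed header instead of silently skipping it.

```python
def fastq_to_fasta(fastq_path, fasta_path):
    """Write the header and sequence line of every FASTQ record as FASTA.

    Returns the number of records written.
    """
    n_records = 0
    with open(fastq_path) as src, open(fasta_path, 'w') as dst:
        for i, line in enumerate(src):
            if i % 4 == 0:
                # Header line: replace the leading '@' with '>'
                if not line.startswith('@'):
                    raise ValueError('line {} is not a FASTQ header'.format(i + 1))
                dst.write('>' + line[1:])
                n_records += 1
            elif i % 4 == 1:
                # Sequence line: copied unchanged
                dst.write(line)
            # Lines 3 ('+') and 4 (qualities) of each record are dropped
    return n_records
```

The returned record count can be compared with `grep -c ">"` on the resulting FASTA file as a quick consistency check.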

## Input

In [2]:
# Init
#

snpFile   = '/home/senne/nanopore/SNP/known_SNP_sequence/SNP_sequence.fasta'  # reference sequences, one per SNP
readFile  = '/home/senne/nanopore/Multiplex/Reference_sequences/SNP/9948_Illumina_0.5µM/48.fasta' # each sample has a separate fasta file
resultDir = '/home/senne/nanopore/Multiplex/Reference_sequences/SNP/9948_Illumina_0.5µM/'
fastq_file_name= '/home/senne/nanopore/Multiplex/Reference_sequence/SNP/Male_Promega/PromegaM.fastq' # Not actually needed here.
bwa       = '/opt/tools/bwa-0.7.15'            # v0.7.15
samtools  = '/opt/tools/samtools-1.3.1' # v1.3.1
bcftools  = '/opt/tools/bcftools-1.3.1' # v1.3.1

# Check
!ls {snpFile}
print('Number of SNPs in reference file:')
!grep -c ">" {snpFile}



/home/senne/nanopore/SNP/known_SNP_sequence/SNP_sequence.fasta
Number of SNPs in reference file:
52
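
Counting header lines confirms the number of references, but not their length. Since the profile step later in this notebook reads the SNP at position 26 of each reference, a short sanity check can be sketched as follows. `read_fasta` and `too_short` are hypothetical helpers, not part of any tool used here.

```python
def read_fasta(path):
    """Return a dict mapping record name (first word of the header) to sequence."""
    chunks = {}
    name = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                parts = line[1:].split()
                name = parts[0] if parts else ''
                chunks[name] = []
            elif name is not None:
                # Sequence lines may be wrapped, so collect all of them
                chunks[name].append(line)
    return {rec: ''.join(seq) for rec, seq in chunks.items()}


def too_short(records, snp_pos=26):
    """Names of references that are shorter than the 1-based SNP position."""
    return sorted(rec for rec, seq in records.items() if len(seq) < snp_pos)
```

Calling `too_short(read_fasta(snpFile))` should return an empty list if every reference is long enough to contain the SNP position.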


## Mapping using BWA

In [65]:
# Map reads to reference sequences
#

# Build index of the references
!{bwa} index {snpFile}

# Map reads
!{bwa} mem -a {snpFile} {readFile} > {resultDir}/48.sam


[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.00 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.00 sec
[main] Version: 0.7.15-r1140
[main] CMD: /opt/tools/bwa-0.7.15 index /home/senne/nanopore/SNP/known_SNP_sequence/SNP_sequence.fasta
[main] Real time: 0.017 sec; CPU: 0.006 sec
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 148786 sequences (10000010 bp)...
[M::process] read 149412 sequences (10000139 bp)...
[M::mem_process_seqs] Processed 148786 reads in 3.256 CPU sec, 3.154 real sec
[M::process] read 149482 sequences (10000102 bp)...
[M::mem_process_seqs] Processed 149412 reads in 3.655 CPU sec, 3.249 real sec
[M::process] read 149622 sequences (10000029 bp)...
[M::mem_process_seqs] Processed 149482 reads in 3.963 CPU sec, 3.610 real sec
[M::process] read 149542 sequences (10000057 bp)...
[M::mem_process_seqs] Pro
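
To get a quick impression of how the reads are distributed over the SNP references, the SAM file can be summarised directly. The sketch below counts mapped primary alignments per reference and skips the secondary alignments that `bwa mem -a` may report. `count_primary_hits` is a hypothetical helper; the flag bits follow the SAM format specification.

```python
from collections import Counter


def count_primary_hits(sam_path):
    """Count mapped primary alignments per reference sequence in a SAM file."""
    counts = Counter()
    with open(sam_path) as fh:
        for line in fh:
            if line.startswith('@'):
                continue  # header line
            fields = line.rstrip('\n').split('\t')
            if len(fields) < 11:
                continue  # not a complete alignment record
            flag, rname = int(fields[1]), fields[2]
            # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records
            if flag & (0x4 | 0x100 | 0x800):
                continue
            counts[rname] += 1
    return counts
```

For example, `count_primary_hits(resultDir + '/48.sam').most_common()` lists the references from most to least covered.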

## Sorting SAM file and converting to BAM

In [3]:

!{samtools} view -Sbu {resultDir}/48.sam | {samtools} sort -o {resultDir}/48.bam -
!{samtools} index {resultDir}/48.bam {resultDir}/48.bam.bai
print('Done')

[bam_sort_core] merging from 11 files...
Done


## Extracting SNP profile
1) Creating a vcf file

2) SNP profile generation

In [4]:
# Generate vcf file from bam file. Needs the reference and its index file 
#
# Note: the commands below are for samtools and bcftools v1.3.1 (will not work on v0.1.19!)

# Reporting all positions
!{samtools} mpileup -d 1000000 -uf {snpFile} {resultDir}/48.bam | {bcftools} call -V indels -m - > {resultDir}/48.bam.vcf

# Reporting variants only (excludes SNPs homozygous for reference allele)
#!{samtools} mpileup -d 100000 -uf {snpFile} {resultDir}/test_sorted.bam | {bcftools} call -V indels -mv - > {resultDir}/test_sorted.bam.vcf

print('Done')

[mpileup] 1 samples in 1 input files
Note: Neither --ploidy nor --ploidy-file given, assuming all sites are diploid
Done
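
The profile step relies on two INFO tags written by bcftools: `DP` (raw read depth) and `DP4` (high-quality reference-forward, reference-reverse, alternate-forward and alternate-reverse read counts). The sketch below shows, on made-up numbers, how these tags can be parsed and how the 10 % minor-allele rule used in this notebook turns the depths into a genotype. The function names and the `min_ratio` parameter are hypothetical and only serve as an illustration.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries without '=' map to True."""
    fields = {}
    for item in info.split(';'):
        key, sep, value = item.partition('=')
        fields[key] = value if sep else True
    return fields


def allele_depths(info_fields):
    """Return (ref_depth, alt_depth) by summing both strands of the DP4 tag."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in info_fields['DP4'].split(','))
    return ref_fwd + ref_rev, alt_fwd + alt_rev


def call_genotype(ref, alt, ref_depth, alt_depth, min_ratio=0.1):
    """Call a diploid genotype, ignoring a minor allele below min_ratio of the major one."""
    if alt == '.':
        return ref + ref  # no alternate allele was observed
    major = max(ref_depth, alt_depth)
    minor = min(ref_depth, alt_depth)
    if major > 0 and minor / major < min_ratio:
        # The minor allele is treated as noise: homozygous for the major allele
        return ref + ref if ref_depth > alt_depth else alt + alt
    return ref + alt


# Illustrative values only, not taken from the data set
info = parse_info('DP=1000;DP4=300,130,400,170;MQ=60')
ref_depth, alt_depth = allele_depths(info)
print(call_genotype('A', 'G', ref_depth, alt_depth))
```

The threshold is kept as a parameter because 0.1 is a choice made for a single-source sample; a mixture would need a different treatment.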


In [5]:
# Get SNP profile


snpData = {}

with open(resultDir+'/48.bam.vcf') as f:
    for l in f:
        if l.startswith('#'):
            continue
            
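        # VCF columns; trailing underscores avoid shadowing the built-ins id() and filter()
        snp, pos, id_, ref, alt, qual, filter_, info, fmt, sample = l.split()
        
        # Our SNP of interest is always at position 26 of the reference
        if int(pos) != 26:
            continue

        # Parse the INFO column; flag entries without '=' are stored as True
        par = {}
        for p in info.split(';'):
            key, sep, value = p.partition('=')
            par[key] = value if sep else True
        
        snpData[snp] = {'pos': pos, 'ref': ref, 'alt': alt, 'qual': qual, 'filter': filter_, 'info': par}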

# DEBUG
print('Got data for {} SNPs:'.format(len(snpData)))

# Save/print results
with open(resultDir + '/ttt_profile.csv', 'w') as f:
    # Table header
    f.write('snp,coverage,ref_allele,ref_percent,alt_allele,alt_percent,genotype\n')
    
    # Table data
    for s in sorted(snpData.keys()):
        totalDepth = int(snpData[s]['info']['DP'])
        depthList  = [int(d) for d in snpData[s]['info']['DP4'].split(',')]
        refDepth   = sum(depthList[0:2])
        altDepth   = sum(depthList[2:4])
        
        # Estimate the diploid genotype: when the minor allele has less than 10 % of the depth of the
        # major allele, it is treated as noise and ignored (this seems reasonable for a pure, single-source sample).
        if refDepth > altDepth and altDepth/refDepth < 0.1:
            genotype = snpData[s]['ref'] + snpData[s]['ref']
        elif altDepth > refDepth and refDepth/altDepth < 0.1:
            genotype = snpData[s]['alt'] + snpData[s]['alt']
        else:
            genotype = snpData[s]['ref'] + snpData[s]['alt']
        
        if snpData[s]['alt'] == '.':
            # Only 1 allele was observed
            f.write(','.join([s, str(totalDepth), snpData[s]['ref'], '{:.1f}'.format(100*refDepth/totalDepth), '', '', snpData[s]['ref']+snpData[s]['ref']]) + '\n')
            # DEBUG
            print('  {} ({})  {} ({:.1f} %)'.format(s, totalDepth, snpData[s]['ref'], 100*refDepth/totalDepth))
        else:
            # Two alleles were observed
            f.write(','.join([s, str(totalDepth), snpData[s]['ref'], '{:.1f}'.format(100*refDepth/totalDepth), snpData[s]['alt'], '{:.1f}'.format(100*altDepth/totalDepth), genotype]) + '\n')
            # DEBUG
            print('  {} ({})  {} ({:.1f} %)  {} ({:.1f} %)'.format(s, totalDepth, snpData[s]['ref'], 100*refDepth/totalDepth, snpData[s]['alt'], 100*altDepth/totalDepth))


Got data for 52 SNPs:
  rs1005533 (144996)  A (99.8 %)
  rs1015250 (248717)  C (0.0 %)  G (100.0 %)
  rs1024116 (402530)  A (43.0 %)  G (57.0 %)
  rs1028528 (128801)  A (69.1 %)  G (30.9 %)
  rs1029047 (79115)  A (99.7 %)
  rs1031825 (277887)  A (50.5 %)  C (49.5 %)
  rs10495407 (1000011)  A (50.1 %)  G (49.9 %)
  rs1335873 (299477)  A (0.1 %)  T (99.9 %)
  rs1355366 (656223)  A (51.9 %)  G (48.1 %)
  rs1357617 (345388)  A (99.6 %)
  rs1360288 (30642)  C (71.1 %)  T (28.9 %)
  rs1382387 (189198)  G (52.6 %)  T (47.4 %)
  rs1413212 (998999)  A (0.1 %)  G (99.9 %)
  rs1454361 (804990)  A (53.3 %)  T (46.7 %)
  rs1463729 (226122)  A (0.1 %)  G (99.9 %)
  rs1490413 (202631)  A (0.2 %)  G (99.8 %)
  rs1493232 (999987)  A (99.7 %)
  rs1528460 (8310)  C (0.2 %)  T (99.8 %)
  rs1886510 (114319)  C (54.4 %)  T (45.6 %)
  rs1979255 (88400)  C (99.8 %)
  rs2016276 (246601)  A (46.5 %)  G (53.5 %)
  rs2040411 (93944)  A (99.8 %)
  rs2046361 (152422)  A (0.1 %)  T (99.9 %)
  rs2056277 (232780)  C (