# Genome Assembly

__Versions__  
Canu: 1.7

### Canu: 
Assembly of nanopore reads

In [None]:
#!/bin/bash
#SBATCH --partition=ellison_1 # Partition (job queue)
#SBATCH --job-name=canu200 # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=200G

canu -p D_triauraria_all_seqs -d D_triauraria_all_sequs_assembly_200m gnuplotTested=true overlapper=mhap utgReAlign=true useGrid=0 genomeSize=200m -nanopore-raw all_TA_combined_seqs.fastq >& all_TA_combined_seqs_output_200.txt

# Quality Control

__Versions__   
Purge Haplotigs: v1.0.1    
Nanopolish: 0.8.4  
Pilon: 1.22  
bowtie2: 2.2.9  

### Purge Haplotigs:
Identifies heterozygosity and determines the consensus sequence

In [None]:
#!/bin/bash
#SBATCH --partition=ellison_1 # Partition (job queue)
#SBATCH --job-name=PHap # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=200G

python nanopolish_makerange.py D_triauraria_all_seqs.contigs.fasta | parallel --results nanopolish.results -P 8 \
    nanopolish variants --consensus -o polished.{1}.vcf -w {1} -r reads.fastq -b reads.sorted.bam -g draft.fa -t 4 --min-candidate-frequency 0.1

### Nanopolish:
Polishes the assembly to correct errors from nanopore sequencing

In [None]:
#!/bin/bash
#SBATCH --partition=genetics_1 # Partition (job queue)
#SBATCH --job-name=nanopol # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=150G
#SBATCH --mail-type=END
#SBATCH --mail-user=ama390@scarletmail.rutgers.edu

python nanopolish_makerange.py /all_TA_combined/curated.fa | parallel --results nanopolish.results -P 6 \
 nanopolish variants --consensus -o polished.{1}.vcf -w {1} -r /run_3_sequs.fastq -b /curated_run3_sequs.sorted.bam -g /all_TA_combined/curated.fa -t 4 --min-candidate-frequency 0.1

In [None]:
#!/bin/bash
#SBATCH --partition=ellison_1 # Partition (job queue)
#SBATCH --job-name=nanopol # Assign an 8-character name to your job, no spaces
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=72:00:00
#SBATCH --mem=150G

nanopolish vcf2fasta -g /all_TA_combined/curated.fa polished.*.vcf > polished_genome.fa

### Pilon:
Polishes the nanopore-based assembly with Illumina reads to correct remaining sequencing errors.

In [None]:
#!/bin/bash
#SBATCH --partition=ellison_1 # Partition (job queue)
#SBATCH --job-name=allpilon # Assign an 8-character name to your job
#SBATCH --nodes=1 # Number of compute nodes
#SBATCH --ntasks=1 # Processes (usually = cores) on each node
#SBATCH --cpus-per-task=28 # Threads per process (or per core)
#SBATCH --export=ALL # Export your current environment settings to the job environment
#SBATCH --time=12:00:00
#SBATCH --mem=128G
module load bowtie2
module load samtools

bowtie2-build pilon_1.fasta pilon_1_index

bowtie2 -X 1200 -p 28 --no-unal -x pilon_1_index -S TA_illumina_Pilon_1.sam -1 /TA_illumina/sample_trimmed_1P.fastq.gz -2 /TA_illumina/sample_trimmed_2P.fastq.gz

samtools view -b TA_illumina_Pilon_1.sam > TA_illumina_Pilon_1.bam

samtools sort TA_illumina_Pilon_1.bam > TA_illumina_Pilon_1.sorted.bam

samtools index TA_illumina_Pilon_1.sorted.bam

pilon --genome pilon_1.fasta --frags TA_illumina_Pilon_1.sorted.bam --output pilon_2 --changes --minmq 15 &> pilon_1.log

# Scaffolding

### Juicer and 3D DNA alignment pipeline  

__Versions__:  
BWA:0.7.13  
JUICER:1.5.6  
JUICEBOX: 1.11.08   
3D-DNA: 180419  

Prepare reference genome

In [None]:
bwa index reference.fasta
samtools faidx reference.fasta
#fasta.fai has the length of each scaffold. Print first 2 columns for scaffold ID and length info
awk '{print $1,$2}' reference.fasta.fai > reference.chrom.sizes
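The same two-column extraction can be sketched in Python, which may be convenient if the sizes are needed inside a script. This is only an illustration: the function names <tt>fai_to_chrom_sizes</tt> and <tt>write_chrom_sizes</tt> are hypothetical, and the sketch assumes a standard tab-separated <tt>.fai</tt> index as written by <tt>samtools faidx</tt>.

```python
def fai_to_chrom_sizes(fai_lines):
    """Return (name, length) pairs from the lines of a samtools .fai index."""
    sizes = []
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Column 1 is the scaffold ID, column 2 is its length in bases
        sizes.append((fields[0], int(fields[1])))
    return sizes


def write_chrom_sizes(fai_path, out_path):
    """Write 'name length' lines, space separated like the awk output."""
    with open(fai_path) as fai, open(out_path, "w") as out:
        for name, length in fai_to_chrom_sizes(fai):
            out.write(f"{name} {length}\n")
```

For example, <tt>write_chrom_sizes("reference.fasta.fai", "reference.chrom.sizes")</tt> would be expected to produce the same file as the awk command.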

Combine replicate HiC files 

In [None]:
#submit SLURM script with the following to combine replicate fastq files
zcat ./d_species_HiC_*_R1_001.fastq.gz | gzip > combined_reps_R1.fastq.gz
zcat ./d_species_HiC_*_R2_001.fastq.gz | gzip > combined_reps_R2.fastq.gz
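If a Python step is preferred over <tt>zcat</tt>, the merge can be sketched as below. The helper name <tt>combine_gzipped</tt> is hypothetical; the sketch assumes the replicate files match a shell-style pattern and sorts the matches so that R1 and R2 files are combined in the same replicate order.

```python
import glob
import gzip
import shutil


def combine_gzipped(pattern, out_path):
    """Decompress every file matching pattern, in sorted order, into one gzip file."""
    inputs = sorted(glob.glob(pattern))
    with gzip.open(out_path, "wb") as out:
        for path in inputs:
            with gzip.open(path, "rb") as handle:
                # Stream the decompressed reads so large files are never held in memory
                shutil.copyfileobj(handle, out)
    return inputs
```

Calling it once with the R1 pattern and once with the R2 pattern keeps the read pairs in matching order, provided the replicate file names differ only in the <tt>R1</tt>/<tt>R2</tt> part.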

Run JUICER to align the HiC reads to the reference assembly and to create a contact matrix (.hic) and a "merged_nodups.txt" file for input to 3D-DNA

In [None]:
juicer.sh -d /Dtria/juicer/work/Dtria -p /scratch/tg484/Dtria/juicer/references/Dtria_20181107_r2_illumina_reformat.chrom.sizes -s none -z /Dtria/juicer/references/Dtria_20181107_r2_illumina.reformat.fa -D /Dtria/juicer -t 28 -b GCTGAGGGATCCCTCAGC

Run **3D-DNA**  
First run without --editor-repeat-coverage flag. 

In [None]:
run-asm-pipeline.sh --build-gapped-map -m diploid -i 5000 -r 3 /Dtria/juicer/references/Dtria_20181107_r2_illumina.reformat.fa /Dtria/juicer/work/Dtria/aligned/merged_nodups.txt >& /Dtria/3D-DNA/3d-dna.log

View the **rawchrom.assembly** output in Juicebox. If the output looks poorly aligned, run the following two commands to view a text-based histogram of coverage, then rerun 3D-DNA with the --editor-repeat-coverage flag set to the value at the peak of the histogram.

In [None]:
bash ~/pkg/3d-dna/edit/run-coverage-analyzer.sh ./3D-DNA_d_species/reference.0.hic
awk -f plot_coverage.awk coverage_wide.wig 
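Reading the peak off a text histogram is a judgment call, so a numeric cross-check can help. The sketch below is not part of 3D-DNA: the helper names are hypothetical, and it assumes that each data line of <tt>coverage_wide.wig</tt> ends with a numeric coverage value while header lines do not. It bins the values and reports the most populated bin.

```python
from collections import Counter


def read_wig_values(lines):
    """Collect numeric coverage values, skipping wig header and blank lines."""
    values = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            # Assumption: the coverage value is the last field of a data line
            values.append(float(fields[-1]))
        except ValueError:
            continue  # header line such as 'fixedStep chrom=... step=...'
    return values


def coverage_peak(values, bin_width=1.0):
    """Return the lower edge of the most populated coverage bin."""
    if not values:
        raise ValueError("no coverage values found")
    counts = Counter(int(v // bin_width) for v in values)
    # On ties, prefer the lower-coverage bin
    best_bin, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best_bin * bin_width
```

With a file handle, this would be used as <tt>coverage_peak(read_wig_values(open("coverage_wide.wig")))</tt>; the result is only a starting point for choosing the --editor-repeat-coverage value and should be compared against the histogram.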

In [None]:
#final 3D-DNA command using "--editor-repeat-coverage 6" because peak of histogram was around 6
run-asm-pipeline.sh --build-gapped-map --editor-repeat-coverage 6 -m diploid -i 5000 -r 3 /Dtria/juicer/references/Dtria_20181107_r2_illumina.reformat.fa /Dtria/juicer/work/Dtria/aligned/merged_nodups.txt >& /Dtria/3D-DNA/3d-dna.log

Once the output is fairly well aligned, edit any remaining misalignments in JUICEBOX. Export the new assembly (called **.review.assembly**) and create a new fasta file using the following command. The command outputs a CAPITALIZED **(.FINAL.fasta)** file that should be used in downstream analysis.

In [None]:
module load java
./pkg/3d-dna/run-asm-pipeline-post-review.sh -r reference.rawchrom.review.assembly ./juicer/references/reference.fa ./juicer/work/d_species/aligned/merged_nodups.txt >& final_fasta.log
#the .review.assembly file is the edited file from JUICEBOX. This command will output a CAPITALIZED (.FINAL.fasta) file that should be used in downstream analysis.

Assign D. triauraria megascaffolds to corresponding Muller element.

Sanity check: Check the number of genes on each Muller element in the final <tt>chrom_count_table</tt>.    
The count should be roughly 1000-3500 for each of Muller A - E and roughly 50-75 for Muller F.

In [None]:
module load samtools
module load blast

# GET LENGTHS OF SCAFFOLDS
# scaffold lengths will be stored in the second column of reference.FINAL.fasta.fai
samtools faidx reference.FINAL.fasta 
makeblastdb -dbtype nucl -in reference.FINAL.fasta 
tblastn -query dmel_r6_longest_isoform.pep.fasta -db reference.FINAL.fasta -evalue 1e-4 -outfmt 6 > d_species.tblastn
# KEEP ONLY THE BEST HIT FOR EACH QUERY
python /projects/genetics/ellison_lab/scripts/best.py d_species.tblastn > d_species.tblastn.best

# COUNT THE NUMBER OF PEPTIDES FROM EACH DMEL CHROMOSOME THAT MATCH EACH SCAFFOLD IN THE ASSEMBLY
cut -f 1,2 d_species.tblastn.best | sed -r 's/_/\t/' | sort -k3 | bedtools groupby -g 3 -c 2 -o freqdesc > d_species.chrom_count

echo -e "SCAFFOLD\tLENGTH\tMULLER_A\tMULLER_B\tMULLER_C\tMULLER_D\tMULLER_E\tMULLER_F\tOTHER\tTOTAL" > d_species.chrom_count_table.txt

python blast_chrom_count2table.py d_species.chrom_count reference.FINAL.fasta.fai | sort -k2 -nr >> d_species.chrom_count_table.txt
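The <tt>best.py</tt> script called above is not reproduced in this notebook. As a rough sketch of what "keep only the best hit for each query" could look like, the hypothetical <tt>best_hits</tt> helper below assumes that "best" means the highest bit score, which is column 12 of the default <tt>-outfmt 6</tt> table. The lab script may use a different criterion.

```python
def best_hits(lines):
    """Keep the highest-bit-score line per query from BLAST tabular (outfmt 6) output."""
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        # outfmt 6: column 1 is the query ID, column 12 is the bit score
        query, bitscore = fields[0], float(fields[11])
        # Strict '>' keeps the first hit seen when bit scores tie
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, line)
    # Queries come out in the order they first appeared in the input
    return [line for _, line in best.values()]
```

Printing each returned line would give a file in the same format as <tt>d_species.tblastn.best</tt>, with one row per peptide.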

Contents of <tt>blast_chrom_count2table.py</tt>:

In [None]:
import sys

# Usage: python blast_chrom_count2table.py <chrom_count file> <fasta .fai index>
countfile = sys.argv[1]
lengthfile = sys.argv[2]

# Scaffold lengths: column 1 of the .fai is the scaffold name, column 2 its length
lengths = {}
with open(lengthfile) as le:
    for scaff in le:
        slist = scaff.rstrip().split("\t")
        lengths[slist[0]] = slist[1]

# The header line is written by the calling shell commands, in this column order:
# SCAFFOLD, LENGTH, MULLER_A ... MULLER_F, OTHER, TOTAL

# D. melanogaster chromosome arm -> Muller element
muller = {"2L":"Muller_B","2R":"Muller_C","3L":"Muller_D","3R":"Muller_E","4":"Muller_F","X":"Muller_A"}
muller_elements = ["Muller_A","Muller_B","Muller_C","Muller_D","Muller_E","Muller_F"]

with open(countfile) as fh:
    for line in fh:
        # Each line: scaffold <tab> comma-separated chrom:count pairs (bedtools freqdesc)
        list1 = line.rstrip().split("\t")
        list2 = list1[1].split(",")
        outdict = dict.fromkeys(muller_elements, 0)
        other = {}
        chromSum = 0
        for c in list2:
            chrom, num = c.split(":")
            chromSum += int(num)
            if chrom in muller:
                outdict[muller[chrom]] += int(num)
            else:
                # Hits to any other D. melanogaster sequence are reported in OTHER
                other[chrom] = num
        outline = [list1[0], lengths[list1[0]]]
        outline += [outdict[m] for m in muller_elements]
        if other:
            other_list = [m + ":" + n for m, n in other.items()]
        else:
            other_list = ["NA:NA"]
        outline.append(",".join(other_list))
        outline.append(chromSum)
        print("\t".join(str(i) for i in outline))
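The sanity check described earlier can be partly automated. The sketch below is an illustration only: <tt>check_muller_counts</tt> is a hypothetical helper, it assigns each scaffold to the Muller element with the most hits, and it inspects only the first <tt>top_n</tt> rows of the length-sorted table on the assumption that these are the megascaffolds. The ranges are the approximate ones stated above.

```python
MULLER_COLUMNS = ["MULLER_A", "MULLER_B", "MULLER_C",
                  "MULLER_D", "MULLER_E", "MULLER_F"]
# Approximate expected gene counts taken from the sanity check in the text
DEFAULT_RANGE = (1000, 3500)
EXPECTED_RANGE = {"MULLER_F": (50, 75)}


def check_muller_counts(table_lines, top_n=6):
    """Report (scaffold, element, count, in_range) for the first top_n table rows."""
    header = table_lines[0].rstrip("\n").split("\t")
    col = {name: i for i, name in enumerate(header)}
    report = []
    for line in table_lines[1:top_n + 1]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < len(header):
            continue  # skip blank or truncated rows
        counts = {m: int(fields[col[m]]) for m in MULLER_COLUMNS}
        # Assign the scaffold to the element contributing the most hits
        element = max(MULLER_COLUMNS, key=lambda m: counts[m])
        low, high = EXPECTED_RANGE.get(element, DEFAULT_RANGE)
        in_range = low <= counts[element] <= high
        report.append((fields[0], element, counts[element], in_range))
    return report
```

Passing the lines of <tt>d_species.chrom_count_table.txt</tt> returns one tuple per inspected scaffold; any row flagged <tt>False</tt> is worth a closer look in Juicebox rather than proof of a misassembly, since the ranges are approximate.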
