### HERVd: the Human Endogenous RetroViruses Database
- https://herv.img.cas.cz/

<!-- here I'm using a custom [scallop](https://github.com/Kingsford-Group/scallop) + [gtfmerge](https://github.com/Kingsford-Group/rnaseqtools#gtfmerge) pipeline to create merged gtf annotation of hg19 and tinat transcripts introduced in this  -->

In [72]:
import pandas as pd

In [18]:
import gzip

In [80]:
def write_fasta(path, fa):
    """Write a {header: sequence} dict to `path` in FASTA format."""
    with open(path, 'w') as file:
        for name, seq in fa.items():
            file.write('>' + name + '\n')
            file.write(seq + '\n')


def read_fasta(path):
    """Read a plain or gzipped FASTA file into a {header: sequence} dict."""
    # Open gzipped files in text mode so lines are str rather than bytes.
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as file:
        lines = file.read().splitlines()
    # Header lines start with '>'; each sequence runs until the next header.
    n = [i for i, s in enumerate(lines) if s.startswith('>')]
    ids = [lines[i][1:] for i in n]
    n.append(len(lines))
    sequences = [''.join(lines[i+1:j]) for i, j in zip(n[:-1], n[1:])]
    return dict(zip(ids, sequences))

In [48]:
ls herv-genome/

[0m[38;5;9mpackage-entities-dna.gff3.gz[0m  [38;5;9mpackage-entities-erv.gff3.gz[0m
package-entities-erv.c.gff3   [38;5;27mpackage-entities-erv_salmon_index[0m/
[38;5;9mpackage-entities-erv.fa.gz[0m    package-entities-erv_salmon_index.log


In [3]:
!zcat herv-genome/package-entities-erv.gff3.gz | grep -v "^#" \
> herv-genome/package-entities-erv.c.gff3

## STAR

https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html

In [133]:
!head -n 30 herv-genome/package-entities-erv.c.gff3

chr1	.	BED_feature	34597	34659	500	-	.	ID=BED_feature1;Name=12:ERV:MLT1L
chr1	.	BED_thick_feature	34597	34659	500	-	.	Parent=BED_feature1;Name=12:ERV:MLT1L
chr1	.	BED_block	34597	34659	500	-	.	Parent=BED_feature1;Name=12:ERV:MLT1L
chr1	.	BED_feature	34566	34972	500	-	.	ID=BED_feature2;Name=14:ERV:MLT1J2
chr1	.	BED_thick_feature	34566	34972	500	-	.	Parent=BED_feature2;Name=14:ERV:MLT1J2
chr1	.	BED_block	34566	34972	500	-	.	Parent=BED_feature2;Name=14:ERV:MLT1J2
chr1	.	BED_feature	40850	41007	500	-	.	ID=BED_feature3;Name=18:ERV:LTR16C
chr1	.	BED_thick_feature	40850	41007	500	-	.	Parent=BED_feature3;Name=18:ERV:LTR16C
chr1	.	BED_block	40850	41007	500	-	.	Parent=BED_feature3;Name=18:ERV:LTR16C
chr1	.	BED_feature	41804	42717	500	-	.	ID=BED_feature4;Name=19:ERV:ERV3-16A3_I
chr1	.	BED_thick_feature	41804	42717	500	-	.	Parent=BED_feature4;Name=19:ERV:ERV3-16A3_I
chr1	.	BED_block	41804	42717	500	-	.	Parent=BED_feature4;Name=19:ERV:ERV3-16A3_I
chr1	.	BED_feature	46766	46842	500	-	.	ID=BED_featur
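
The records above follow the nine-column GFF3 layout, with the HERVd entry name stored in the `Name` attribute and the exon-like intervals typed as `BED_block`. As a minimal sketch (not part of the original pipeline), the features could be read into Python roughly as follows. The helper names `parse_attributes` and `read_gff3` are hypothetical, and the demo input is copied from the first feature printed above.

```python
import io

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_attributes(field):
    """Split 'ID=BED_feature1;Name=12:ERV:MLT1L' into a {key: value} dict."""
    return dict(item.split("=", 1) for item in field.split(";") if "=" in item)


def read_gff3(handle, feature_type=None):
    """Yield one dict per feature line, optionally keeping a single feature type."""
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        record = dict(zip(GFF3_COLUMNS, line.split("\t")))
        if feature_type is not None and record["type"] != feature_type:
            continue
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["attributes"] = parse_attributes(record["attributes"])
        yield record


# Demo on the first feature shown in the `head` output above.
example = io.StringIO(
    "chr1\t.\tBED_feature\t34597\t34659\t500\t-\t.\tID=BED_feature1;Name=12:ERV:MLT1L\n"
    "chr1\t.\tBED_thick_feature\t34597\t34659\t500\t-\t.\tParent=BED_feature1;Name=12:ERV:MLT1L\n"
    "chr1\t.\tBED_block\t34597\t34659\t500\t-\t.\tParent=BED_feature1;Name=12:ERV:MLT1L\n"
)
blocks = list(read_gff3(example, feature_type="BED_block"))
for b in blocks:
    print(b["seqid"], b["start"], b["end"], b["strand"], b["attributes"]["Name"])
```

For the real file, the same generator could be fed `open('herv-genome/package-entities-erv.c.gff3')`. Filtering on `BED_block` mirrors the `--sjdbGTFfeatureExon BED_block` choice in the STAR cell below.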

In [127]:
%%bash

cd herv-genome/
mkdir -p STAR.index  # -p: do not fail if the directory already exists

# The annotation is GFF3, so the BED_block features are used as exons.
# Note: the STAR manual also recommends --sjdbGTFtagExonParentTranscript Parent
# for GFF3 input; that option was not set in this run.
nohup STAR \
    --runThreadN 32 \
    --runMode genomeGenerate \
    --sjdbGTFfeatureExon BED_block \
    --genomeDir STAR.index \
    --limitGenomeGenerateRAM 400000000000 \
    --genomeFastaFiles /data_gilbert/home/aarab/genomes/hg38/gencode.v34/GRCh38.primary_assembly.genome.fa \
    --sjdbGTFfile package-entities-erv.c.gff3 \
    --sjdbOverhang 99 &> STAR.index.log &
cd ../


In [134]:
cat herv-genome/STAR.index.log

Nov 13 23:59:50 ..... started STAR run
Nov 13 23:59:50 ... starting to generate Genome files
Nov 14 00:00:55 ... starting to sort Suffix Array. This may take a long time...
Nov 14 00:01:08 ... sorting Suffix Array chunks and saving them to disk...
Nov 14 00:17:14 ... loading chunks from disk, packing SA...
Nov 14 00:18:42 ... finished generating suffix array
Nov 14 00:18:42 ... generating Suffix Array index
Nov 14 00:22:47 ... completed Suffix Array index
Nov 14 00:22:47 ..... processing annotations GTF
Nov 14 00:22:57 ... writing Genome to disk ...
Nov 14 00:23:04 ... writing Suffix Array to disk ...
Nov 14 00:23:51 ... writing SAindex to disk
Nov 14 00:23:58 ..... finished successfully


## Salmon

In [81]:
fa = read_fasta('herv-genome/package-entities-erv.fa')

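# Use the value of the `cloud_id=` field (sixth space-separated token of each header) as the sequence name.
fa = {n.split(' ')[5].replace('cloud_id=', ''): seq for n, seq in fa.items()}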

write_fasta('herv-genome/package-entities-erv.c.fa', fa)
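
The renaming step above depends on `cloud_id=` always being the sixth space-separated token, and a dict silently overwrites an entry if two headers map to the same key. A more defensive variant might look like the sketch below. It assumes only that each header contains a whitespace-delimited `cloud_id=<value>` token; the function name `rename_by_cloud_id` and the toy header are hypothetical, since the real header layout is not shown in this notebook.

```python
import re

# Match a `cloud_id=<value>` token anywhere in the header line.
CLOUD_ID = re.compile(r"(?:^|\s)cloud_id=(\S+)")


def rename_by_cloud_id(fa):
    """Return a new {cloud_id: sequence} dict, failing loudly on bad headers."""
    renamed = {}
    for header, seq in fa.items():
        match = CLOUD_ID.search(header)
        if match is None:
            raise ValueError(f"no cloud_id= token in header: {header!r}")
        key = match.group(1)
        if key in renamed:
            raise ValueError(f"duplicate cloud_id: {key}")
        renamed[key] = seq
    return renamed


# Toy header for illustration only (hypothetical field names and values).
toy = {"entry1 a=1 b=2 c=3 d=4 cloud_id=42 e=5": "ACGT"}
renamed = rename_by_cloud_id(toy)
print(renamed)
```

Raising on a missing or duplicated ID makes a malformed header visible before the Salmon index is built, rather than after quantification.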

In [82]:
!salmon index -t herv-genome/package-entities-erv.c.fa \
    -i herv-genome/package-entities-erv_salmon_index \
&> herv-genome/package-entities-erv_salmon_index.log

In [83]:
!tail herv-genome/package-entities-erv_salmon_index.log

[2022-11-10 22:19:05.217] [puff::index::jointLog] [info] chunk size = 212,572,123
[2022-11-10 22:19:05.217] [puff::index::jointLog] [info] chunk 0 = [0, 212,572,139)
[2022-11-10 22:19:05.217] [puff::index::jointLog] [info] chunk 1 = [212,572,139, 425,144,216)
[2022-11-10 22:20:00.373] [puff::index::jointLog] [info] finished populating pos vector
[2022-11-10 22:20:00.373] [puff::index::jointLog] [info] writing index components
[2022-11-10 22:20:02.841] [puff::index::jointLog] [info] finished writing dense pufferfish index
[2022-11-10 22:20:03.145] [jLog] [info] done building index
for info, total work write each  : 2.331    total work inram from level 3 : 4.322  total work raw : 25.000 
Bitarray      1279781248  bits (100.00 %)   (array + ranks )
final hash             0  bits (0.00 %) (nb in final hash 0)


### DAC

In [84]:
%%bash
cd DAC/RNA-seq/
mkdir -p herv/quants

# Quantify each single-end FASTQ file against the HERV Salmon index,
# printing the sample name and start/end timestamps.
for f in fastq/*; do
	f=$(basename "$f")
	o=${f/.fastq.gz/}
	echo "$o"
	date
	salmon quant \
		-i ../../herv-genome/package-entities-erv_salmon_index \
		-l A \
		-r "fastq/$f" -o "herv/quants/$o" -p 10 &> "herv/quants/$o.log"
	date
done
cd ../../

hl60_120h_t_1
Thu Nov 10 22:20:04 PST 2022
Thu Nov 10 22:22:24 PST 2022
hl60_120h_t_2
Thu Nov 10 22:22:24 PST 2022
Thu Nov 10 22:24:57 PST 2022
hl60_120h_u_1
Thu Nov 10 22:24:57 PST 2022
Thu Nov 10 22:26:58 PST 2022
hl60_120h_u_2
Thu Nov 10 22:26:58 PST 2022
Thu Nov 10 22:29:22 PST 2022
hl60_6h_t_1
Thu Nov 10 22:29:22 PST 2022
Thu Nov 10 22:31:40 PST 2022
hl60_6h_t_2
Thu Nov 10 22:31:40 PST 2022
Thu Nov 10 22:35:37 PST 2022
hl60_6h_u_1
Thu Nov 10 22:35:37 PST 2022
Thu Nov 10 22:37:18 PST 2022
hl60_6h_u_2
Thu Nov 10 22:37:18 PST 2022
Thu Nov 10 22:39:49 PST 2022
hl60_72h_t_1
Thu Nov 10 22:39:49 PST 2022
Thu Nov 10 22:42:23 PST 2022
hl60_72h_t_2
Thu Nov 10 22:42:23 PST 2022
Thu Nov 10 22:44:59 PST 2022
hl60_72h_u_1
Thu Nov 10 22:44:59 PST 2022
Thu Nov 10 22:48:34 PST 2022
hl60_72h_u_2
Thu Nov 10 22:48:34 PST 2022
Thu Nov 10 22:52:54 PST 2022
kg1_t_1
Thu Nov 10 22:52:54 PST 2022
Thu Nov 10 22:55:20 PST 2022
kg1_t_2
Thu Nov 10 22:55:20 PST 2022
Thu Nov 10 22:55:56 PST 2022
kg1_t_3
Thu Nov 

### DAC-rg3039

In [85]:
%%bash
cd DAC-rg3039/RNA-seq/
mkdir -p herv/quants

# Quantify each single-end FASTQ file against the HERV Salmon index,
# printing the sample name and start/end timestamps.
for f in fastq/*; do
	f=$(basename "$f")
	o=${f/.fastq.gz/}
	echo "$o"
	date
	salmon quant \
		-i ../../herv-genome/package-entities-erv_salmon_index \
		-l A \
		-r "fastq/$f" -o "herv/quants/$o" -p 10 &> "herv/quants/$o.log"
	date
done
cd ../../

hl60_combination_1
Thu Nov 10 23:44:51 PST 2022
Thu Nov 10 23:49:15 PST 2022
hl60_combination_2
Thu Nov 10 23:49:15 PST 2022
Thu Nov 10 23:54:34 PST 2022
hl60_decitabine_1
Thu Nov 10 23:54:34 PST 2022
Thu Nov 10 23:57:57 PST 2022
hl60_decitabine_2
Thu Nov 10 23:57:57 PST 2022
Fri Nov 11 00:01:43 PST 2022
hl60_dmso_1
Fri Nov 11 00:01:43 PST 2022
Fri Nov 11 00:07:02 PST 2022
hl60_dmso_2
Fri Nov 11 00:07:02 PST 2022
Fri Nov 11 00:10:02 PST 2022
hl60_rg3039_1
Fri Nov 11 00:10:02 PST 2022
Fri Nov 11 00:13:02 PST 2022
hl60_rg3039_2
Fri Nov 11 00:13:02 PST 2022
Fri Nov 11 00:17:46 PST 2022
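
Each `salmon quant` run above writes a `quant.sf` table (columns `Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`) into its own `herv/quants/<sample>/` directory. One possible way to gather those per-sample tables into a single structure is sketched below. The helper names `read_quant` and `collect_quants` are hypothetical, and the demo runs on toy files in a temporary directory rather than on the real outputs.

```python
import csv
import tempfile
from pathlib import Path


def read_quant(path, column="TPM"):
    """Read one quant.sf file into a {Name: value} dict for the chosen column."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["Name"]: float(row[column]) for row in reader}


def collect_quants(quants_dir, column="TPM"):
    """Return {sample: {Name: value}} for every <sample>/quant.sf under quants_dir."""
    files = sorted(Path(quants_dir).glob("*/quant.sf"))
    return {p.parent.name: read_quant(p, column) for p in files}


# Toy demo: two fake samples with a single entry each (values are made up).
header = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
with tempfile.TemporaryDirectory() as tmp:
    for sample, tpm in [("sample_a", "10.5"), ("sample_b", "0.0")]:
        out = Path(tmp) / sample
        out.mkdir()
        (out / "quant.sf").write_text(header + "12\t63\t20.0\t" + tpm + "\t3.0\n")
    table = collect_quants(tmp)
print(table)
```

On the real data this would be called as `collect_quants('DAC-rg3039/RNA-seq/herv/quants')`. The `*/quant.sf` pattern skips the `<sample>.log` files that sit next to the sample directories, and `pd.DataFrame(table)` turns the result into an entries-by-samples matrix.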



In [86]:
!date

Fri Nov 11 00:17:46 PST 2022
