# Pseudoalignment with Kallisto

In [1]:
# `!echo $USER` returns a list of output lines; take the first (and only) one
MIAMIID = !echo $USER
MIAMIID = MIAMIID[0]
print(MIAMIID)

schroe51


### We need to download the following pieces of data in order to use Kallisto:

- Reference Transcriptome: File containing all of the known transcripts of the mouse genome
- Reference Annotations: File with information on location and structure of the genes in the mouse genome

Now let's download the transcriptome file:

In [2]:
%%sh
wget ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz

--2022-12-08 13:38:40--  ftp://ftp.ensembl.org/pub/release-97/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz
           => ‘Mus_musculus.GRCm38.cdna.all.fa.gz’
Resolving ftp.ensembl.org... 193.62.193.139
Connecting to ftp.ensembl.org|193.62.193.139|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-97/fasta/mus_musculus/cdna ... done.
==> SIZE Mus_musculus.GRCm38.cdna.all.fa.gz ... 51982200
==> PASV ... done.    ==> RETR Mus_musculus.GRCm38.cdna.all.fa.gz ... done.
Length: 51982200 (50M) (unauthoritative)

     0K .......... .......... .......... .......... ..........  0%  235K 3m36s
    50K .......... .......... .......... .......... ..........  0%  519K 2m37s
   100K .......... .......... .......... .......... ..........  0% 13.9M 1m46s
   150K .......... .......... .......... .......... ..........  0% 1.34M 88s
   200K .......... .......... .......... .......... ..........  0%  7

Now let's make a folder and move the transcriptome into it to keep things organized.

In [4]:
%mkdir -p /home/{MIAMIID}/test/transcriptome
%mv Mus_musculus.GRCm38.cdna.all.fa.gz /home/{MIAMIID}/test/transcriptome
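Before indexing, it can be worth checking that the download is intact. The sketch below counts the sequences in a gzipped FASTA file by counting its `>` header lines. The helper name `count_fasta_records` is our own, and the path in the usage comment assumes the folder layout used in this notebook. The count is expected to correspond to the "number of targets" that Kallisto reports when it later loads the index.

```python
import gzip


def count_fasta_records(path):
    """Count sequences in a gzipped FASTA file by counting '>' header lines."""
    n = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                n += 1
    return n


# Usage (path assumes the folders created above):
# count_fasta_records(f"/home/{MIAMIID}/test/transcriptome/Mus_musculus.GRCm38.cdna.all.fa.gz")
```

A truncated download usually fails here with a gzip error, which is easier to diagnose than a failure during indexing.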

Next we use Kallisto's `index` command to prepare the transcriptome. The index is a precomputed lookup of the k-mers found in every transcript, which lets Kallisto match reads to transcripts quickly without performing a full alignment. First, let's make a folder to keep things organized.

In [2]:
%mkdir -p /home/{MIAMIID}/test/kallisto
%cd /home/{MIAMIID}/test/kallisto/

/home/schroe51/test/kallisto


In [4]:
!kallisto index --index="Mus_musculus.GRCm38_index" /home/{MIAMIID}/test/transcriptome/Mus_musculus.GRCm38.cdna.all.fa.gz


[build] loading fasta file /home/schroe51/test/transcriptome/Mus_musculus.GRCm38.cdna.all.fa.gz
[build] k-mer length: 31
        from 645 target sequences
        with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ...  done 
[build] creating equivalence classes ...  done
[build] target de Bruijn graph has 736280 contigs and contains 100745897 k-mers 
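To make the idea of a k-mer index more concrete, here is a toy sketch. It is not Kallisto's implementation (Kallisto builds a de Bruijn graph with k = 31 and groups transcripts into equivalence classes); it only illustrates the principle that a read is compatible with the transcripts that share all of its k-mers. The sequences and function names are invented for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript names that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts that contain every k-mer of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = hits if compatible is None else compatible & hits
    return compatible or set()


# Toy sequences, invented for illustration only
transcripts = {"tx1": "ACGTACGGA", "tx2": "TTGTACGGA"}
index = build_index(transcripts, k=5)

print(sorted(pseudoalign("GTACGG", index, k=5)))  # shared region: ['tx1', 'tx2']
print(sorted(pseudoalign("ACGTAC", index, k=5)))  # unique to tx1: ['tx1']
```

The first read falls in a region shared by both transcripts, so it is ambiguous; the second read is only compatible with `tx1`. Resolving such ambiguity across millions of reads is what the quantification step does later.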



The index is now ready for pseudoalignment, so we move it into the transcriptome folder next to the FASTA file.

In [5]:
%mv Mus_musculus.GRCm38_index /home/{MIAMIID}/test/transcriptome/

We will also need to download the mouse GTF file. This file contains the genomic coordinates and descriptions of all annotated genes.

In [7]:
!wget ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.chr.gtf.gz

--2022-12-08 15:30:27--  ftp://ftp.ensembl.org/pub/release-97/gtf/mus_musculus/Mus_musculus.GRCm38.97.chr.gtf.gz
           => ‘Mus_musculus.GRCm38.97.chr.gtf.gz’
Resolving ftp.ensembl.org... 193.62.193.139
Connecting to ftp.ensembl.org|193.62.193.139|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /pub/release-97/gtf/mus_musculus ... done.
==> SIZE Mus_musculus.GRCm38.97.chr.gtf.gz ... 30157973
==> PASV ... done.    ==> RETR Mus_musculus.GRCm38.97.chr.gtf.gz ... done.
Length: 30157973 (29M) (unauthoritative)


2022-12-08 15:30:31 (10.2 MB/s) - ‘Mus_musculus.GRCm38.97.chr.gtf.gz’ saved [30157973]



Now we will move this file to a new folder named "annotations" inside the kallisto folder to keep things organized.

In [8]:
%mkdir -p /home/{MIAMIID}/test/kallisto/annotations

In [9]:
%mv Mus_musculus.GRCm38.97.chr.gtf.gz /home/{MIAMIID}/test/kallisto/annotations
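For orientation, each data line of a GTF file has nine tab-separated fields: sequence name, source, feature type, start, end, score, strand, frame, and a semicolon-separated attribute list. The sketch below parses one such line. The function name `parse_gtf_line` and the example record are made up for illustration; the record is not copied from the Ensembl file.

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine fields and an attribute dict."""
    seqname, source, feature, start, end, score, strand, frame, attrs = (
        line.rstrip("\n").split("\t")
    )
    attributes = {}
    for item in attrs.strip().split(";"):
        item = item.strip()
        if item:
            # Each attribute looks like: key "value"
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Made-up record for illustration only
example = '1\tensembl\tgene\t1000\t2000\t.\t+\t.\tgene_id "GENE0001"; gene_name "Example1";'
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attributes"]["gene_name"])
```

This simple split assumes attribute values contain no semicolons. Lines starting with `#` are comments and should be skipped before parsing.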

Finally, we will download a file containing the name and length of each chromosome. This will help us produce visualizations.

In [10]:
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6/GCF_000001635.26_GRCm38.p6_assembly_report.txt

--2022-12-08 15:31:19--  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6/GCF_000001635.26_GRCm38.p6_assembly_report.txt
           => ‘GCF_000001635.26_GRCm38.p6_assembly_report.txt’
Resolving ftp.ncbi.nlm.nih.gov... 130.14.250.7, 165.112.9.230, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov|130.14.250.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6 ... done.
==> SIZE GCF_000001635.26_GRCm38.p6_assembly_report.txt ... 24706
==> PASV ... done.    ==> RETR GCF_000001635.26_GRCm38.p6_assembly_report.txt ... done.
Length: 24706 (24K) (unauthoritative)


2022-12-08 15:31:20 (879 KB/s) - ‘GCF_000001635.26_GRCm38.p6_assembly_report.txt’ saved [24706]



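Now we will cut out the information we need from this file: lines 43-63 of the report hold chromosomes 1-19, X, and Y, and columns 1 and 9 hold the sequence name and sequence length.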

In [15]:
!head -n63 GCF_000001635.26_GRCm38.p6_assembly_report.txt|tail -n21|cut -f1,9 > /home/{MIAMIID}/test/kallisto/annotations/mouse_chromosomes.tsv

In [16]:
cat /home/{MIAMIID}/test/kallisto/annotations/mouse_chromosomes.tsv

1	195471971
2	182113224
3	160039680
4	156508116
5	151834684
6	149736546
7	145441459
8	129401213
9	124595110
10	130694993
11	122082543
12	120129022
13	120421639
14	124902244
15	104043685
16	98207768
17	94987271
18	90702639
19	61431566
X	171031299
Y	91744698
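The `head`/`tail` approach depends on the chromosomes sitting at fixed line numbers. As an alternative sketch, the rows can be selected by their content instead. This assumes the usual NCBI assembly report layout, in which column 2 is the sequence role and column 4 the molecule type; check the `#` header line of your own report before relying on it. The function name and the placeholder fields in the demo rows are our own.

```python
def chromosome_lengths(report_lines):
    """Return (name, length) pairs for assembled nuclear chromosomes."""
    rows = []
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blank lines
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: col 1 = Sequence-Name, col 2 = Sequence-Role,
        # col 4 = Assigned-Molecule-Location/Type, col 9 = Sequence-Length
        if fields[1] == "assembled-molecule" and fields[3] == "Chromosome":
            rows.append((fields[0], int(fields[8])))
    return rows


# Demo rows: names and lengths match the table above, other fields are placeholders
demo = [
    "# header line\n",
    "1\tassembled-molecule\t1\tChromosome\tna\t=\tna\tna\t195471971\tchr1\n",
    "Y\tassembled-molecule\tY\tChromosome\tna\t=\tna\tna\t91744698\tchrY\n",
    "MT\tassembled-molecule\tMT\tMitochondrion\tna\t=\tna\tna\t16299\tchrM\n",
]
for name, length in chromosome_lengths(demo):
    print(f"{name}\t{length}")
```

To use it on the real report, pass an open file handle, for example `chromosome_lengths(open("GCF_000001635.26_GRCm38.p6_assembly_report.txt"))`, and write the resulting pairs to `mouse_chromosomes.tsv`.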


# Working with pre-trimmed data

In [2]:
%cd /home/{MIAMIID}/test/trimmed_reads
!ls

/home/schroe51/test/trimmed_reads
fastqc_trimmed_results		   SRR5017133_trimmed.fastq.gz_quant
quantFastq.sh			   SRR5017135_trimmed.fastq.gz
SRR5017128_trimmed.fastq.gz	   SRR5017135_trimmed.fastq.gz_quant
SRR5017128_trimmed.fastq.gz_quant  SRR5017137_trimmed.fastq.gz
SRR5017132_trimmed.fastq.gz	   SRR5017138_trimmed.fastq.gz
SRR5017132_trimmed.fastq.gz_quant  *_trimmed.fastq.gz_quant
SRR5017133_trimmed.fastq.gz


### We will now use Kallisto to analyze all of our trimmed sequences using the mouse reference files we just downloaded

Note: This will take roughly an hour or more.

In [3]:
# Write a shell script that runs `kallisto quant` on every trimmed FASTQ file.
# A backslash at the end of a line inside the Python string joins it to the
# next line, so the whole loop is written to quantFastq.sh as a single line.
# Each output folder is named after the full input file name,
# e.g. SRR5017128_trimmed.fastq.gz_quant.
sh = f"""
for file in /home/{MIAMIID}/test/trimmed_reads/*_trimmed.fastq.gz; do output=$(basename $file)_quant; kallisto quant\
 --single\
 --threads=24\
 --index=/home/{MIAMIID}/test/transcriptome/Mus_musculus.GRCm38_index\
 --bootstrap-samples=25\
 --fragment-length=200\
 --sd=20\
 --output-dir=$output\
 --genomebam\
 --gtf=/home/{MIAMIID}/test/kallisto/annotations/Mus_musculus.GRCm38.97.chr.gtf.gz\
 --chromosomes=/home/{MIAMIID}/test/kallisto/annotations/mouse_chromosomes.tsv\
 $file; done
"""
with open('quantFastq.sh', 'w') as script:
    script.write(sh)
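The same command can also be assembled in Python as an argument list, which avoids shell quoting issues and makes each option easy to change. This is a sketch rather than a replacement for the script: `build_quant_command` is a hypothetical helper, the paths in the demo are placeholders, and the flags are the ones used in this notebook.

```python
import shlex
from pathlib import Path


def build_quant_command(fastq, index, gtf, chromosomes, out_root=".", threads=24):
    """Return the `kallisto quant` argument list for one single-end FASTQ file."""
    fastq = Path(fastq)
    # Output folder named after the full input file name, as in this notebook
    output_dir = Path(out_root) / f"{fastq.name}_quant"
    return [
        "kallisto", "quant",
        "--single",
        f"--threads={threads}",
        f"--index={index}",
        "--bootstrap-samples=25",
        "--fragment-length=200",
        "--sd=20",
        f"--output-dir={output_dir}",
        "--genomebam",
        f"--gtf={gtf}",
        f"--chromosomes={chromosomes}",
        str(fastq),
    ]


# Placeholder paths for illustration; substitute your own
cmd = build_quant_command(
    "/home/USER/test/trimmed_reads/SRR5017128_trimmed.fastq.gz",
    index="/home/USER/test/transcriptome/Mus_musculus.GRCm38_index",
    gtf="/home/USER/test/kallisto/annotations/Mus_musculus.GRCm38.97.chr.gtf.gz",
    chromosomes="/home/USER/test/kallisto/annotations/mouse_chromosomes.tsv",
)
print(shlex.join(cmd))
```

To actually run a command built this way, pass the list to `subprocess.run(cmd, check=True)`, which raises an error if Kallisto exits with a failure.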

In [4]:
!source quantFastq.sh


[quant] fragment length distribution is truncated gaussian with mean = 200, sd = 20
[index] k-mer length: 31
[index] number of targets: 118,783
[index] number of k-mers: 100,745,897
[index] number of equivalence classes: 434,774
[quant] running in single-end mode
[quant] will process file 1: /home/schroe51/test/trimmed_reads/SRR5017128_trimmed.fastq.gz
[quant] finding pseudoalignments for the reads ... done
[quant] processed 41,112,323 reads, 36,330,050 reads pseudoaligned
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 1,409 rounds
[bstrp] number of EM bootstraps complete: 25
[  bam] writing pseudoalignments to BAM format .. done
[  bam] sorting BAM files .. done
[  bam] indexing BAM file .. done


[quant] fragment length distribution is truncated gaussian with mean = 200, sd = 20
[index] k-mer length: 31
[index] number of targets: 118,783
[index] number of k-mers: 100,745,897
[index] number of equivalence classes: 434,774
[quant] ru
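A quick number to pull out of this log is the fraction of reads that pseudoaligned. For SRR5017128 the log reports 41,112,323 reads processed and 36,330,050 pseudoaligned; the small helper below (our own, for illustration) turns that into a percentage.

```python
def pseudoalignment_rate(n_processed, n_pseudoaligned):
    """Percentage of processed reads that were pseudoaligned."""
    return 100.0 * n_pseudoaligned / n_processed


# Read counts copied from the SRR5017128 log above
rate = pseudoalignment_rate(41_112_323, 36_330_050)
print(f"{rate:.1f}% of reads pseudoaligned")  # prints: 88.4% of reads pseudoaligned
```

A rate that is much lower for one sample than for the others can point to contamination, heavy adapter content, or a mismatched reference, so it is worth comparing across samples.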

Now move the analyzed files into their own folder to keep things organized

In [5]:
%mkdir -p /home/{MIAMIID}/test/kallisto/analyzed
%mv *_quant /home/{MIAMIID}/test/kallisto/analyzed
%cd /home/{MIAMIID}/test/kallisto/analyzed
!ls

/home/schroe51/test/kallisto/analyzed
SRR5017128_trimmed.fastq.gz_quant  SRR5017137_trimmed.fastq.gz_quant
SRR5017132_trimmed.fastq.gz_quant  SRR5017138_trimmed.fastq.gz_quant
SRR5017133_trimmed.fastq.gz_quant  *_trimmed.fastq.gz_quant
SRR5017135_trimmed.fastq.gz_quant


# Now we can look at the output files for the High-Fat Diet Control 1 sample

In [6]:
%cd /home/{MIAMIID}/test/kallisto/analyzed/SRR5017135_trimmed.fastq.gz_quant
!ls

/home/schroe51/test/kallisto/analyzed/SRR5017135_trimmed.fastq.gz_quant
abundance.h5   pseudoalignments.bam	 run_info.json
abundance.tsv  pseudoalignments.bam.bai
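Since every sample should produce the same set of files, a quick check can flag folders where the run did not finish. The earlier listing of the analyzed folder also includes an entry literally named `*_trimmed.fastq.gz_quant`, which is possibly a leftover from a loop whose file pattern matched nothing; a check like this would catch it. The helper name and the list of expected files (taken from the listing above) are our own choices.

```python
from pathlib import Path

# Files we expect in every finished quant folder (from the listing above)
EXPECTED = ("abundance.tsv", "abundance.h5", "run_info.json")


def incomplete_quant_dirs(root):
    """Return names of *_quant folders under root that lack an expected file."""
    incomplete = []
    for folder in sorted(Path(root).glob("*_quant")):
        if folder.is_dir() and any(not (folder / name).exists() for name in EXPECTED):
            incomplete.append(folder.name)
    return incomplete


# Usage (path assumes the folders created above):
# incomplete_quant_dirs(f"/home/{MIAMIID}/test/kallisto/analyzed")
```

An empty list means every folder has the expected outputs; any names returned are worth inspecting or re-running.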


In [7]:
!head -n 100 abundance.tsv

target_id	length	eff_length	est_counts	tpm
ENSMUST00000196221.1	9	2.52391	0	0
ENSMUST00000179664.1	11	2.58312	0	0
ENSMUST00000177564.1	16	2.66459	0	0
ENSMUST00000178537.1	12	2.60424	0	0
ENSMUST00000178862.1	14	2.63753	0	0
ENSMUST00000179520.1	11	2.58312	0	0
ENSMUST00000179883.1	16	2.66459	0	0
ENSMUST00000195858.1	10	2.55707	0	0
ENSMUST00000179932.1	12	2.60424	0	0
ENSMUST00000180001.1	17	2.67702	0	0
ENSMUST00000178815.1	10	2.55707	0	0
ENSMUST00000177965.1	17	2.67702	0	0
ENSMUST00000178909.1	29	2.82093	0	0
ENSMUST00000177646.1	10	2.55707	0	0
ENSMUST00000178230.1	17	2.67702	0	0
ENSMUST00000178483.1	29	2.82093	0	0
ENSMUST00000179262.1	10	2.55707	0	0
ENSMUST00000178549.1	17	2.67702	0	0
ENSMUST00000193012.1	29	2.82093	0	0
ENSMUST00000179166.1	10	2.55707	0	0
ENSMUST00000179560.1	17	2.67702	0	0
ENSMUST00000177839.1	17	2.67702	0	0
ENSMUST00000103439.1	23	2.74792	0	0
ENSMUST00000180266.1	17	2.67702	0	0
ENSMUST00000103441.1	23	2.74792	0	0
ENSMUST00000103567.5	371	172	0	0
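The first rows are very short transcripts with no reads, so they are not very informative. The sketch below loads `abundance.tsv`, lists the most abundant transcripts, and recomputes TPM from the other columns using the standard definition, TPM_i = 10^6 * (est_counts_i / eff_length_i) / sum_j (est_counts_j / eff_length_j). The function names are our own; the recomputed values should agree with Kallisto's `tpm` column up to rounding. Note also that the last row shown (length 371, eff_length 172) is consistent with the common approximation eff_length = length - mean fragment length + 1, given `--fragment-length=200`.

```python
import csv


def read_abundance(path):
    """Load a kallisto abundance.tsv into a list of dicts with numeric columns."""
    rows = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            rows.append({
                "target_id": row["target_id"],
                "length": int(row["length"]),
                "eff_length": float(row["eff_length"]),
                "est_counts": float(row["est_counts"]),
                "tpm": float(row["tpm"]),
            })
    return rows


def recompute_tpm(rows):
    """Recompute TPM from est_counts and eff_length."""
    rates = [r["est_counts"] / r["eff_length"] for r in rows]
    total = sum(rates)
    return [1e6 * rate / total if total else 0.0 for rate in rates]


def top_transcripts(rows, n=10):
    """Return the n rows with the highest TPM."""
    return sorted(rows, key=lambda r: r["tpm"], reverse=True)[:n]


# Usage, from inside a sample's quant folder:
# rows = read_abundance("abundance.tsv")
# for r in top_transcripts(rows):
#     print(r["target_id"], r["tpm"])
```

Sorting by `tpm` rather than `est_counts` accounts for transcript length, so long transcripts do not dominate the list simply because they collect more reads.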

# Now all that is left is to visualize this data to better understand how it all fits together