# E. coli outbreak investigation

**Project 3.**\
Lab journal by Anna Ogurtsova

---


### Step 0. Environment and project directory creation



In [None]:
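!mkdir -p project_3
# `!cd` runs in a throwaway subshell; the %cd magic changes the notebook's working directory
%cd project_3
# -y skips the interactive confirmation prompt, which cannot be answered from a notebook cell
!mamba create -y -n project_3 -c conda-forge -c bioconda fastqc bwa samtools igv varscan prokka spades quast
# Activation does not persist across `!` subshells: run `mamba activate project_3`
# in a terminal before launching Jupyter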

### Step 1.  Exploring the dataset

For this project, we provide three libraries from the TY2482 sample with the following insert sizes and orientation:

- SRR292678 - paired end, insert size 470 bp ([forward reads](https://d28rh4a8wq0iu5.cloudfront.net/bioinfo/SRR292678sub_S1_L001_R1_001.fastq.gz), [reverse reads](https://d28rh4a8wq0iu5.cloudfront.net/bioinfo/SRR292678sub_S1_L001_R2_001.fastq.gz), 400 Mb each)
- SRR292862 - mate pair, insert size 2 kb (forward reads, reverse reads, 200 Mb each)
- SRR292770 - mate pair, insert size 6 kb (forward reads, reverse reads, 200 Mb each)

I ran a Snakemake workflow to generate FastQC reports for all six FASTQ files and obtained the read counts listed below:

In [None]:
!snakemake --cores 12 --config sample=SRR292770_S1 reads=L001_R1_001
!snakemake --cores 12 --config sample=SRR292770_S1 reads=L001_R2_001
!snakemake --cores 12 --config sample=SRR292862_S2 reads=L001_R1_001
!snakemake --cores 12 --config sample=SRR292862_S2 reads=L001_R2_001

Number of reads:
- SRR292678sub_S1_L001_R1_001.fastq: **5499346**
- SRR292678sub_S1_L001_R2_001.fastq: **5499346**
- SRR292770_S1_L001_R1_001.fastq: **5102041**
- SRR292770_S1_L001_R2_001.fastq: **5102041**
- SRR292862_S2_L001_R1_001.fastq: **5102041**
- SRR292862_S2_L001_R2_001.fastq: **5102041**
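
The read counts above come from the FastQC reports. As an independent cross-check, a minimal Python sketch (not part of the original workflow) can count reads directly, assuming well-formed FASTQ files with exactly four lines per record. The function name `count_fastq_reads` is hypothetical.

```python
import gzip
from pathlib import Path


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    path = Path(path)
    # Handle both plain-text and gzip-compressed files.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Calling `count_fastq_reads("SRR292678sub_S1_L001_R1_001.fastq")` should agree with the "Total Sequences" value in the corresponding FastQC report.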


### Step 2. K-mer profile and genome size estimation

Using the paired-end library (SRR292678) and Jellyfish, let's count the k-mers and estimate the genome size.

Options of the Jellyfish count command:

- -m ("mer") specifies the k-mer length.
- -C tells it to ignore directionality (each k-mer is treated the same as its reverse complement).
- -s is an initial estimate for the size of the hash table Jellyfish uses; set it larger than the genome size.
- -o specifies the name of the output file; choose a name with the k-mer length in it.

In [None]:
%%bash
jellyfish count -o kmers_count -m 31 -C -s 6000000 SRR292678sub_S1_L001_R1_001.fastq SRR292678sub_S1_L001_R2_001.fastq
jellyfish histo -o hist.csv kmers_count

The genome size was estimated with the use of R script:

In [None]:
%%R
# Before: k-mer spectrum from hist.csv
spec1 <- read.table("hist.csv")
plot(spec1[5:200, ], type = "l",
     main = "K-mers distribution_before",
     xlab = "depth", ylab = "count")
points(spec1[16:200, ])
# Total number of k-mers in the distribution, starting from depth 16
sum(as.numeric(spec1[16:1500, 1] * spec1[16:1500, 2]))
# Genome size = total number of k-mers / peak position (125)
genome_size <- sum(as.numeric(spec1[16:1500, 1] * spec1[16:1500, 2])) / 125
genome_size / 1000000 # 5.14 Mb

# After: k-mer spectrum from hist_after.csv
spec2 <- read.table("hist_after.csv")
plot(spec2[1:20, ], type = "l",
     main = "K-mers distribution_after",
     xlab = "depth", ylab = "count")
points(spec2[1:20, ])
sum(as.numeric(spec2[1:15, 1] * spec2[1:15, 2]))
genome_size_after <- sum(as.numeric(spec2[1:15, 1] * spec2[1:15, 2])) / 2
genome_size_after / 1000000 # 5.32 Mb
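
The estimate above divides the total number of k-mers by the depth of the main peak. The same arithmetic can be sketched in Python, assuming the histogram is a two-column text file of depth and count. Unlike the R cell, which uses a peak position read off the plot, this sketch picks the peak automatically as the most frequent depth at or above a cutoff. The names `read_histogram`, `estimate_genome_size` and `min_depth` are hypothetical.

```python
def read_histogram(path):
    """Read a two-column (depth, count) k-mer histogram into a list of int pairs."""
    with open(path) as handle:
        return [tuple(int(x) for x in line.split()) for line in handle if line.strip()]


def estimate_genome_size(histogram, min_depth):
    """Return (genome_size, peak_depth) from (depth, count) pairs.

    Depths below min_depth are ignored, mirroring the row cutoff in the R code.
    """
    kept = [(depth, count) for depth, count in histogram if depth >= min_depth]
    if not kept:
        raise ValueError("no histogram rows at or above min_depth")
    # Total number of k-mers = sum over depths of depth * count.
    total_kmers = sum(depth * count for depth, count in kept)
    # Peak position = depth with the highest count among the kept rows.
    peak_depth = max(kept, key=lambda row: row[1])[0]
    return total_kmers / peak_depth, peak_depth
```

For example, `estimate_genome_size(read_histogram("hist.csv"), min_depth=16)` applies the same cutoff as the first half of the R cell; the automatically chosen peak should be checked against the plot.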

### Step 3. Assembling E. coli X genome from paired reads

For read correction and assembly I used the SPAdes assembler.
I ran SPAdes in paired-end mode, providing the paired reads of E. coli X from the library SRR292678 (forward and reverse).

In [None]:
%%bash
spades.py -o ./result -1 SRR292678sub_S1_L001_R1_001.fastq -2 SRR292678sub_S1_L001_R2_001.fastq 

QUAST report analysis

In [None]:
%%bash 
quast -o . contigs.fasta scaffolds.fasta
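
One of the key metrics in a QUAST report is N50. As an illustration of what that number means, here is a minimal Python sketch that computes it from a FASTA file. It is not how QUAST is implemented, and by default QUAST only considers contigs of at least 500 bp, so its figures can differ from an unfiltered calculation. The function names are hypothetical.

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record.
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no sequences given")
```

For example, `n50(read_fasta_lengths("contigs.fasta"))` gives the N50 over all contigs in the file.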

Effect of read correction

In [None]:
%%bash
jellyfish count -o kmers_count_corrected -m 31 -C -s 6000000 contigs.fasta scaffolds.fasta
jellyfish histo -o hist_after.csv kmers_count_corrected

### Step 4. Gene prediction and annotation with Prokka

Prokka identifies the coordinates of putative genes within contigs and then uses BLAST for similarity-based annotation against all proteins from sequenced bacterial genomes in the RefSeq database.

In [None]:
%%bash
prokka scaffolds.fasta --centre XXX

```--centre XXX``` sets the sequencing centre ID; it is used here so that Prokka renames all contigs to be NCBI (and Mauve) compliant. Mauve is a system for constructing multiple genome alignments in the presence of large-scale evolutionary events such as rearrangement and inversion.

Prokka output results:
```
organism: Genus species strain
contigs: 327
bases: 5437160
CDS: 5057
rRNA: 23
repeat_region: 1
tRNA: 80
tmRNA: 1
```

### Step 5. Finding the closest relative of E. coli X

To find the known genome most similar to the pathogenic strain (and infer properties of E. coli X from it), I first needed to locate the 16S rRNA genes in the assembled E. coli X genome.
I used the rRNA gene prediction tool [Barrnap](https://github.com/tseemann/barrnap).

In [None]:
%%bash 
barrnap -o rrna.fa < contigs.fasta > rrna.gff
head -n 3 rrna.gff
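
Barrnap reports all rRNA types it finds, while only the 16S sequences are needed for the BLAST search. A minimal Python sketch for pulling them out of the FASTA output follows. It assumes that each header in rrna.fa begins with the rRNA type (for example `16S_rRNA::...`); check the actual headers before relying on it. The function name `select_records` is hypothetical.

```python
def select_records(fasta_text, prefix="16S_rRNA"):
    """Return (header, sequence) pairs whose header starts with the given prefix."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return [(h, s) for h, s in records if h.startswith(prefix)]
```

Reading rrna.fa into a string and passing it to `select_records` yields the candidate 16S sequences to paste into BLAST.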

Next, I used BLAST to search the RefSeq database for the genome whose 16S rRNA is most similar to the 16S rRNA I had just found.

``` output
Escherichia coli 55989, complete sequence
Sequence ID: NC_011748.1 Length: 5154862 Number of Matches: 7
```

### Step 6. What is the genetic cause of HUS?

To understand the genetic cause of HUS, I performed a genome-wide comparison with the reference genome and analyzed the regions where the two strains differ.
I used Mauve, which visualizes an alignment as a series of conserved segments called Locally Collinear Blocks (LCBs), which are similar to synteny blocks. Insertions and deletions in LCBs correspond to insertions and deletions in a bacterial genome.

Found genes:
- stxB (3483605..3483874): Shiga toxin subunit B precursor, locus_tag PROKKA_03323
- stxA (3483886..3484845): Shiga toxin subunit A precursor, locus_tag PROKKA_03324

### Step 7. Tracing the source of toxin genes in E. coli

Based on the annotation of nearby proteins, the toxin genes in E. coli were possibly acquired from a bacteriophage:

- Int-Tn (PROKKA_03362): [transposase from transposon Tn916](https://www.ncbi.nlm.nih.gov/gene/?term=Int-TN&report=full_report)
- In GenBank, this Int-Tn gene encodes an integrase that occurs in the Escherichia phage phi191 genome. This integrase was also previously reported in [Clostridium difficile R20291](https://www.ncbi.nlm.nih.gov/gene/8469373).
- In the Escherichia phage phi191 genome, a [Shiga toxin subunit also occurs](https://www.ncbi.nlm.nih.gov/gene/?term=Shiga+toxin+StxA++phage+phi191).
- Other genes found nearby: nanS_5, an HTH-type transcriptional regulator, rrrD_2, and kilR_2 (killing protein KilR).
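
The neighbouring genes listed above were identified by inspecting the annotation. A minimal Python sketch for listing the features around a gene of interest follows. It assumes a standard nine-column GFF3 annotation whose attributes column carries `gene=` tags, with an optional `##FASTA` section at the end. The function names and the `window` parameter are hypothetical.

```python
def parse_gff_features(gff_text):
    """Parse GFF3 text into a list of feature dicts, stopping at the ##FASTA section."""
    features = []
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        # Attributes look like "ID=...;gene=stxB;product=..."
        attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
        features.append({
            "seqid": cols[0],
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "attrs": attrs,
        })
    return features


def neighbours(features, gene, window):
    """Return features on the same contig overlapping +/- window bp around the gene."""
    targets = [f for f in features if f["attrs"].get("gene") == gene]
    hits = []
    for target in targets:
        low, high = target["start"] - window, target["end"] + window
        for feature in features:
            if feature is target or feature["seqid"] != target["seqid"]:
                continue
            if feature["end"] >= low and feature["start"] <= high:
                hits.append(feature)
    return hits
```

Printing the `product` attribute of each hit for `gene="stxB"` and a window of a few tens of kilobases gives a quick overview of the genomic neighbourhood.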

### Step 8. Antibiotic resistance detection

To search for genes responsible for antibiotic resistance, I used [ResFinder](https://cge.food.dtu.dk/services/ResFinder/), which specifically searches a database of genes implicated in antibiotic resistance, identifying similarities between the sequenced genome and this database using local alignment.

A BLAST search of the regions flanking the bla1 and bla2 genes returned:

- Escherichia coli plasmid pKC90-L DNA, complete sequence, strain: KC90
- Escherichia coli plasmid pB5-L DNA
