# What causes antibiotic resistance?

**Project 1.**\
Lab journal by Anna Ogurtsova

---

Tools I used:

- **FastQC v0.12.1** to evaluate the quality of reads 
- **Trimmomatic-0.39** to perform reads filtering 
- **bwa-0.7.17** to align reads to a reference genome 
- **samtools 1.7** for SAM file compression, BAM file sorting, and indexing 
- **VarScan.v2.3.9** for variant calling 
- **SnpEff 5.2** for automatic SNP annotation


### Step 0. Environment and project directory creation

In [None]:
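!mkdir project_1
# `!cd` runs in a temporary subshell and does not persist, so use the %cd magic
%cd project_1
!mamba create -y -n project_1 -c bioconda trimmomatic fastqc bwa samtools igv varscan snpeff
# `!mamba activate` also runs in a subshell and does not change the notebook's environment;
# activate project_1 in the terminal before launching Jupyter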

### Step 1. Download raw data

1. Download the reference E. coli genome sequence [GCF_000005845.2_ASM584v2_genomic.fna.gz](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz)

In [None]:
!wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz

2. Download the annotation in .gff format (`*_genomic.gff.gz`):

In [None]:
!wget -c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gff.gz

3. Download raw Illumina reads from shotgun sequencing of an ampicillin-resistant E. coli strain.

In [None]:
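# --content-disposition keeps the server-supplied file names
# (assumed here to be amp_res_1.fastq.gz and amp_res_2.fastq.gz, the names used below)
!wget --content-disposition https://figshare.com/ndownloader/files/23769689
!wget --content-disposition https://figshare.com/ndownloader/files/23769692

### Step 2. Raw sequencing data inspection

Count the number of reads in each file

In [None]:
!gunzip amp_res_*.fastq.gz
!wc -l amp_res_1.fastq # output: 1823504 amp_res_1.fastq, so number of reads = 1823504 / 4 = 455876, in line with the FastQC report
!wc -l amp_res_2.fastq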
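
The same line-count arithmetic can be done inside the notebook. This is a minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record; `count_fastq_reads` is a hypothetical helper name, not part of any tool listed above.

```python
from pathlib import Path


def count_fastq_reads(path):
    """Return the number of reads in an uncompressed FASTQ file.

    Assumes the standard layout of exactly four lines per record
    (header, sequence, '+', qualities) with no wrapped sequences.
    """
    with Path(path).open("rb") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


# The arithmetic from the cell above: 1823504 lines / 4 lines per read.
print(1823504 // 4)  # 455876
```

Calling `count_fastq_reads("amp_res_1.fastq")` after decompression should then give the same number as the `wc -l` calculation.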

### Step 3. Inspect raw sequencing data with FastQC

Run FastQC on forward and reverse reads

In [None]:
%%bash
mkdir -p fastqc
fastqc -o fastqc amp*.fastq # write the reports into the fastqc directory created above

### Step 4. (Optional, 1 bonus point) Filtering the reads

**Explanation of the options:**

1. `PE`: Indicates that we are running Trimmomatic in paired-end mode.
2. `-phred33`: Specifies that the input quality scores are in Phred+33 format.
3. `amp_res_1.fastq` and `amp_res_2.fastq`: Input paired-end FASTQ files (forward and reverse reads).
4. `output_amp_res_1_paired.fastq.gz` and `output_amp_res_2_paired.fastq.gz`: Output files for reads whose mate also survived trimming.
5. `output_amp_res_1_unpaired.fastq.gz` and `output_amp_res_2_unpaired.fastq.gz`: Output files for reads whose mate was dropped.
6. `LEADING:20`: Cut bases off the start of a read while their quality is below 20.
7. `TRAILING:20`: Cut bases off the end of a read while their quality is below 20.
8. `SLIDINGWINDOW:10:20`: Scan the read with a 10-base window and cut once the average quality within the window drops below 20.
9. `MINLEN:20`: Drop the read if it is shorter than 20 bases after trimming.
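
To make the quality-based options concrete, here is a simplified sketch of how Phred+33 decoding, `LEADING`/`TRAILING` clipping, and `MINLEN` act on a single read. It is an illustration only, not Trimmomatic's implementation; `SLIDINGWINDOW` is left out, and the function names and the example read are made up.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 quality string into integer quality scores."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_read(seq, qual, leading=20, trailing=20, minlen=20):
    """Clip low-quality ends, then apply a minimum-length filter.

    Returns the trimmed (seq, qual) pair, or None if the read is dropped.
    """
    scores = phred33_scores(qual)
    start, end = 0, len(scores)
    # LEADING: remove bases from the start while quality is below the threshold.
    while start < end and scores[start] < leading:
        start += 1
    # TRAILING: remove bases from the end while quality is below the threshold.
    while end > start and scores[end - 1] < trailing:
        end -= 1
    # MINLEN: drop the read if it is too short after trimming.
    if end - start < minlen:
        return None
    return seq[start:end], qual[start:end]


# Hypothetical read: '#' encodes Q2 and 'I' encodes Q40.
seq = "AC" + "G" * 25 + "TT"
qual = "##" + "I" * 25 + "##"
print(trim_read(seq, qual))
```

In this example the two low-quality bases at each end are clipped and the remaining 25 bases pass the length filter.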

Trimming the reads

In [None]:
%%bash
mkdir trimmed
trimmomatic PE -phred33 amp_res_1.fastq amp_res_2.fastq \
    trimmed/output_amp_res_1_paired.fastq.gz trimmed/output_amp_res_1_unpaired.fastq.gz \
    trimmed/output_amp_res_2_paired.fastq.gz trimmed/output_amp_res_2_unpaired.fastq.gz \
    LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:20

Trimming results (input read pairs: 455876):

- Both Surviving: 446259 (97.89%)
- Forward Only Surviving: 9216 (2.02%)
- Reverse Only Surviving: 273 (0.06%)
- Dropped: 128 (0.03%)
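
As a quick sanity check, the four categories should add up to the number of input pairs, and the percentages follow directly from the counts. A small sketch using only the numbers reported by Trimmomatic:

```python
# Counts copied from the Trimmomatic summary.
input_pairs = 455876
counts = {
    "Both Surviving": 446259,
    "Forward Only Surviving": 9216,
    "Reverse Only Surviving": 273,
    "Dropped": 128,
}

# Every input pair must fall into exactly one category.
assert sum(counts.values()) == input_pairs

for label, n in counts.items():
    print(f"{label}: {n} ({100 * n / input_pairs:.2f}%)")
```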

### Step 5. Aligning sequences to reference

In [None]:
%%bash
bwa index GCF_000005845.2_ASM584v2_genomic.fna.gz 2> reference.bwaindex.log # index the reference genome (bwa writes its log to stderr)
mkdir -p align
bwa mem GCF_000005845.2_ASM584v2_genomic.fna.gz trimmed/output_amp_res_1_paired.fastq.gz trimmed/output_amp_res_2_paired.fastq.gz > align/alignment.sam # align the trimmed paired reads to the reference


In [None]:
!samtools view -S -b align/alignment.sam > align/alignment.bam # convert .sam format to .bam format
!samtools sort align/alignment.bam -o align/alignment.sorted.bam # sorting of bam file by sequence coordinate on reference
!samtools index align/alignment.sorted.bam # index bam file for faster search

In [None]:
%%bash
samtools flagstat align/alignment.bam #basic statistics

892776 + 0 in total (QC-passed reads + QC-failed reads) \
891649 + 0 mapped (99.87% : N/A) \
888554 + 0 properly paired (99.56% : N/A)
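
The percentages can be reproduced from the counts. The sketch below assumes that flagstat reports the mapped percentage relative to all records and the properly paired percentage relative to reads flagged as paired. The paired-read count is inferred from the Trimmomatic summary (2 × 446259), because the corresponding flagstat line is not shown here.

```python
# Counts copied from the flagstat output and the Trimmomatic summary.
total_records = 892776
mapped = 891649
properly_paired = 888554
paired_reads = 2 * 446259  # reads that survived trimming as pairs (inferred)

print(f"mapped: {100 * mapped / total_records:.2f}%")
print(f"properly paired: {100 * properly_paired / paired_reads:.2f}%")
print(f"records beyond the paired reads: {total_records - paired_reads}")
```

The extra records beyond the paired reads would be consistent with secondary or supplementary alignments, which flagstat includes in the total; that is an inference, not something the output above shows.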

### Step 6. Variant calling

Create an mpileup file for variant calling

In [None]:
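!gunzip -c GCF_000005845.2_ASM584v2_genomic.fna.gz > GCF_000005845.2_ASM584v2_genomic.fna # samtools needs an uncompressed (or bgzip-compressed) reference
!samtools mpileup -f GCF_000005845.2_ASM584v2_genomic.fna align/alignment.sorted.bam > my.mpileup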

Call variants that are present in at least 20% of reads

In [None]:
!varscan mpileup2snp my.mpileup  --min-var-freq 0.2 --variants --output-vcf 1 > VarScan_results.vcf

**Results**\
Only SNPs will be reported

Warning: No p-value threshold provided, so p-values will not be calculated

Min coverage: 8 \
Min reads2:	2 \
Min var freq:	0.2 \
Min avg qual:	15 \
P-value thresh:	0.01 \
4641343 bases in pileup file

**9 variant positions (6 SNP, 3 indel)**

**1 were failed by the strand-filter**

**5 variant positions reported (5 SNP, 0 indel)**
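
The thresholds in this log can be read as a simple per-position filter. The sketch below is a simplified illustration, not VarScan's implementation: it ignores base-quality filtering, the strand filter, and p-values, and the function name and example counts are hypothetical.

```python
def passes_thresholds(ref_reads, var_reads,
                      min_coverage=8, min_reads2=2, min_var_freq=0.2):
    """Apply coverage, supporting-read, and frequency cutoffs to one position.

    Defaults mirror the values printed in the VarScan log above.
    """
    coverage = ref_reads + var_reads
    if coverage < min_coverage:
        return False
    if var_reads < min_reads2:
        return False
    return var_reads / coverage >= min_var_freq


# Hypothetical positions: reads supporting the reference vs. the variant.
print(passes_thresholds(ref_reads=0, var_reads=17))   # True
print(passes_thresholds(ref_reads=90, var_reads=10))  # False: frequency 0.1 < 0.2
print(passes_thresholds(ref_reads=3, var_reads=3))    # False: coverage 6 < 8
```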


### Step 7. Automatic SNP annotation

Download the reference annotation in GenBank format

In [None]:
!wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gbff.gz

Build the SnpEff database and annotate the variants:

1. Create a `snpEff.config` file containing a single line: `k12.genome : ecoli_K12`

In [4]:
! echo "k12.genome : ecoli_K12" > snpEff.config

2. Create a folder for the database

In [None]:
!mkdir -p data/k12

3. Put the GenBank file there (unzip it and rename it to `genes.gbk`)

In [None]:
%%bash
gunzip GCF_000005845.2_ASM584v2_genomic.gbff.gz
cp GCF_000005845.2_ASM584v2_genomic.gbff data/k12/genes.gbk

4. Build the database (see the [instructions for building a database from a GenBank file](http://pcingola.github.io/SnpEff/snpeff/build_db/#step-2-option-2-building-a-database-from-genbank-files))

In [None]:
!snpEff build -genbank -v k12

5. Annotate

In [None]:
!snpEff ann k12 VarScan_results.vcf > VarScan_results_annotated.vcf
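
SnpEff writes its annotations into the `ANN` key of the VCF INFO column as pipe-separated subfields. A minimal sketch for pulling out the first few subfields is shown below; `parse_ann`, the short field names, and the example INFO string are made up for illustration and are not real results from this project.

```python
# Names for the first eleven ANN subfields, in the order SnpEff writes them.
ANN_FIELDS = [
    "allele", "annotation", "impact", "gene_name", "gene_id",
    "feature_type", "feature_id", "transcript_biotype", "rank",
    "hgvs_c", "hgvs_p",
]


def parse_ann(info):
    """Return one dict per ANN entry found in a VCF INFO string."""
    for item in info.split(";"):
        if item.startswith("ANN="):
            entries = item[len("ANN="):].split(",")
            # zip() keeps only the subfields named in ANN_FIELDS.
            return [dict(zip(ANN_FIELDS, entry.split("|"))) for entry in entries]
    return []


# Made-up INFO string in the SnpEff layout (not taken from the real VCF).
example_info = (
    "ADP=17;ANN=T|missense_variant|MODERATE|geneX|b0000|transcript|b0000|"
    "protein_coding|1/1|c.10A>T|p.Thr4Ser|10/300|10/300|4/99||"
)
for ann in parse_ann(example_info):
    print(ann["gene_name"], ann["annotation"], ann["hgvs_p"])
```

Applied to the INFO column of each non-header line in `VarScan_results_annotated.vcf`, this gives the affected gene, effect type, and amino acid change per variant.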