# Lab report

please leave all comments in the notebook cells

## 16.10.22 Day first

### Step 1

First, I counted the number of lines in the .fastq files.

```bash 
zcat <path_to_file>/amp_res_1.fastq.gz | wc -l
> 1823504
zcat <path_to_file>/amp_res_2.fastq.gz | wc -l
> 1823504
```
`<path_to_file>` stands for the absolute (or sometimes relative) path to the file.

The number of lines is equal in both files, so all is OK. To find out how many reads are in the files, we need to divide this number by four (because each read takes four lines in .fastq). I tried to store the output of the previous command in a Python variable. It did something unfriendly: the number was cast to a string and wrapped in a list.

```python
a = !zcat <path_to_file>/amp_res_1.fastq.gz | wc -l
print(a)
# ['1823504']
```
After this, I cast the variable to `int`. It worked. Then I counted the number of reads.

```python
a = int(a[0])
print(a)
# 1823504

```
```python
lines_fastq1 = !zcat <path_to_file>/amp_res_1.fastq.gz | wc -l
lines_fastq2 = !zcat <path_to_file>/amp_res_2.fastq.gz | wc -l
reads_fastq1 = int(lines_fastq1[0]) // 4  # four lines per read
reads_fastq2 = int(lines_fastq2[0]) // 4
print(reads_fastq1, reads_fastq2)
# 455876 455876

```

The number of reads is equal. This means we have no problem with discordance between the two files.
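
The same count can also be done without the shell. Below is a minimal sketch in plain Python, assuming a well-formed gzipped FASTQ file with exactly four lines per read; the function name `count_fastq_reads` is hypothetical and not part of any tool used in this report.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzipped FASTQ file, assuming 4 lines per read."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests a truncated file
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

For example, `count_fastq_reads("<path_to_file>/amp_res_1.fastq.gz")` should agree with the `zcat ... | wc -l` result divided by four.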

### Step 2

Then I used FastQC (v. 0.11.9) to get some information about the reads. I saved FastQC's output in the project's directory.

```bash
!fastqc -o . <path_to_file>/amp_res_1.fastq.gz  <path_to_file>/amp_res_2.fastq.gz
```

_FastQC checks several properties of the reads: quality, GC content, duplication level, adapter content and many others._

The basic statistics match what we calculated for the number of reads.

We have some warnings from FastQC related to the per base sequence quality and per tile sequence quality in the first (forward reads) file, and to the per base sequence quality in the second (reverse reads) file.

The loss of quality towards the end of a read is a common situation for Illumina sequencing. It's usually not a problem.

Warnings from per tile sequence quality on forward reads are more problematic. The patterns of quality loss in the upper-left part of the figure are associated with air bubbles. Narrow lines in the right part of the figure may be associated with dirt on the tile.

### Step 3

After this, Trimmomatic (v. 0.39) was installed.

First, I needed to know which directory the tool is located in (this is needed to run it). I used the recommended command:

```bash
dpkg -L trimmomatic | grep '.jar'
#/usr/share/java/trimmomatic-0.39.jar
#/usr/share/java/trimmomatic.jar
```

The next command trimmed reads with quality loss at the trailing end. We removed bases with quality below 20 at the start and the end of each read, cut reads where the mean quality in a 10-base-wide sliding window dropped below 20, and discarded reads shorter than 20 bases.

```bash
java -jar /usr/share/java/trimmomatic-0.39.jar PE -phred33 <path_to_file>/amp_res_1.fastq.gz <path_to_file>/amp_res_2.fastq.gz <path_to_file>/po1.fq <path_to_file>/upo1.fq <path_to_file>/po2.fq <path_to_file>/upo2.fq LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:20
```
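
To make the four trimming steps more concrete, here is a simplified Python sketch of what they do to a single read. It is an illustration only and not Trimmomatic's actual implementation: the function names `phred33` and `trim_read` are hypothetical, and the window logic simply cuts the read at the start of the first window whose mean quality is below the threshold.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(ch) - 33 for ch in qual]


def trim_read(seq, qual, leading=20, trailing=20, window=10, min_avg=20, minlen=20):
    """Simplified LEADING -> TRAILING -> SLIDINGWINDOW -> MINLEN trimming.

    Returns (seq, qual) for the kept part, or None if the read is dropped.
    """
    scores = phred33(qual)
    start, end = 0, len(scores)
    # LEADING: remove low-quality bases from the start of the read
    while start < end and scores[start] < leading:
        start += 1
    # TRAILING: remove low-quality bases from the end of the read
    while end > start and scores[end - 1] < trailing:
        end -= 1
    # SLIDINGWINDOW: cut at the first window whose mean quality is too low
    for i in range(start, end - window + 1):
        chunk = scores[i:i + window]
        if sum(chunk) / len(chunk) < min_avg:
            end = i
            break
    # MINLEN: drop reads that became too short
    if end - start < minlen:
        return None
    return seq[start:end], qual[start:end]
```

The default arguments mirror the values from the command above (`LEADING:20 TRAILING:20 SLIDINGWINDOW:10:20 MINLEN:20`).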

## 18.10.22 Day second

### Step 4

I used bwa (v. 0.7.17-r1188) for indexing the reference fasta sequence.

```bash
bwa index <path_to_file>/GCF_000005845.2_ASM584v2_genomic.fna.gz
```
_bwa is a utility that builds a Burrows-Wheeler transform index for the subsequent mapping of large numbers of sequences._

The output of the `ls` command was:

```bash
amp_res_1.fastq.gz
amp_res_2.fastq.gz
GCF_000005845.2_ASM584v2_genomic.fna.gz
GCF_000005845.2_ASM584v2_genomic.fna.gz.amb   #
GCF_000005845.2_ASM584v2_genomic.fna.gz.ann   #
GCF_000005845.2_ASM584v2_genomic.fna.gz.bwt   #
GCF_000005845.2_ASM584v2_genomic.fna.gz.pac   #
GCF_000005845.2_ASM584v2_genomic.fna.gz.sa    #
GCF_000005845.2_ASM584v2_genomic.gff.gz

```

To align the reads, I used bwa mem.
```bash
bwa mem <path_to_file>/GCF_000005845.2_ASM584v2_genomic.fna.gz <path_to_file>/po1.fq <path_to_file>/po2.fq > alignments.sam
```

Head of `alignments.sam`:

```
@SQ	SN:lcl|NC_000913.3_cds_NP_414542.1_1	LN:66
@SQ	SN:lcl|NC_000913.3_cds_NP_414543.1_2	LN:2463
@SQ	SN:lcl|NC_000913.3_cds_NP_414544.1_3	LN:933
@SQ	SN:lcl|NC_000913.3_cds_NP_414545.1_4	LN:1287
@SQ	SN:lcl|NC_000913.3_cds_NP_414546.1_5	LN:297
@SQ	SN:lcl|NC_000913.3_cds_NP_414547.1_6	LN:777
@SQ	SN:lcl|NC_000913.3_cds_NP_414548.1_7	LN:1431
@SQ	SN:lcl|NC_000913.3_cds_NP_414549.1_8	LN:954
@SQ	SN:lcl|NC_000913.3_cds_NP_414550.1_9	LN:588
@SQ	SN:lcl|NC_000913.3_cds_NP_414551.1_10	LN:567
```
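
The `@SQ` header lines list each reference sequence name (`SN`) and its length (`LN`), separated by tabs. As a small sketch, assuming tab-separated header lines as in the SAM format, they can be collected into a dictionary; the function name `parse_sq_lines` is hypothetical.

```python
def parse_sq_lines(header_text):
    """Map reference names (SN) to lengths (LN) from SAM @SQ header lines."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue  # skip other header lines and alignment records
        # Each field after the record type looks like TAG:VALUE
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        lengths[fields["SN"]] = int(fields["LN"])
    return lengths


# Two of the header lines shown above
header = (
    "@SQ\tSN:lcl|NC_000913.3_cds_NP_414542.1_1\tLN:66\n"
    "@SQ\tSN:lcl|NC_000913.3_cds_NP_414543.1_2\tLN:2463\n"
)
print(parse_sq_lines(header))
```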

To compress this file into the binary BAM format I used samtools (v. 1.16.1).

```bash
samtools view -S -b alignments.sam > alignments.bam
```

To sort the BAM file I used samtools too...

```bash
samtools sort alignments.bam -o al_sorted.bam
```

... and then I indexed the sorted file to work with it faster.

```bash
samtools index al_sorted.bam
```



## 20.10.22 Day third
### Step 5

All results were successfully reproduced by Kirill Sizov. The head of alignments.sam was added to the lab report of the previous day. He gave some recommendations for the previous lab notes.

## 21.10.22 Day fourth

### Step 6

To distinguish actual mutations from sequencing errors I used the mpileup tool.

```bash
samtools mpileup -f ./raw_data/GCF_000005845.2_ASM584v2_genomic.fna al_sorted.bam >  my.mpileup
```

First ten rows of data in my.mpileup:

```bash
cat my.mpileup | head -10
#NC_000913.3	1	A	5	^].^].^],^],^],	>;EEB
#NC_000913.3	2	G	7	...,,,,	E1=JII0
#NC_000913.3	3	C	7	...,,,,	H9=IGH8
#NC_000913.3	4	T	7	...,,,,	E;CGCD9
#NC_000913.3	5	T	7	...,,,,	C<EIJJ:
#NC_000913.3	6	T	7	...,,,,	E=EIII1
#NC_000913.3	7	T	7	...,,,,	C@EHHG1
#NC_000913.3	8	C	6	...,,,	DE>JDJ
#NC_000913.3	9	A	6	...,,,	DH?JHI
#NC_000913.3	10	T	7	...,,,,	FGCIHEH
```
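
In each row, the fifth column encodes what the reads show at that position: `.` and `,` are matches to the reference on the forward and reverse strand, letters are mismatches, `^` plus one character marks a read start, and `$` marks a read end. The sketch below counts these symbols for one row; it is a simplified reader written for illustration (the names `count_pileup_bases` and `parse_pileup_line` are hypothetical), not what samtools or VarScan run internally.

```python
import re


def count_pileup_bases(bases):
    """Count reference matches and mismatching bases in an mpileup bases string."""
    counts = {"ref": 0, "A": 0, "C": 0, "G": 0, "T": 0, "N": 0}
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            i += 2  # read start marker plus its mapping-quality character
            continue
        if ch in "+-":
            # Indel: +<length><sequence> or -<length><sequence>
            m = re.match(r"\d+", bases[i + 1:])
            if m:
                i += 1 + len(m.group()) + int(m.group())
                continue
        elif ch in ".,":
            counts["ref"] += 1
        elif ch.upper() in counts:
            counts[ch.upper()] += 1
        # '$' (read end), '*' (deleted base) and anything else is skipped
        i += 1
    return counts


def parse_pileup_line(line):
    """Split one mpileup row into position info and base counts."""
    chrom, pos, ref, depth, bases, _quals = line.rstrip("\n").split("\t")
    return chrom, int(pos), ref, int(depth), count_pileup_bases(bases)


# First row of my.mpileup shown above
print(parse_pileup_line("NC_000913.3\t1\tA\t5\t^].^].^],^],^],\t>;EEB"))
```

In the first ten rows all reads match the reference, so the `ref` count equals the depth in the fourth column.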



## 23.10.22 Day fifth.
### Step 7

I used VarScan (v. 2.3.9) to differentiate sequencing errors from SNPs.

```bash
java -jar ../VarScan.v2.3.9.jar  mpileup2snp my.mpileup --min-var-freq 0.1 --variants --output-vcf 1 > VarScan_results.vcf
```

After this, I repeated the scan with `--min-var-freq 0.05` and `--min-var-freq 0.5`, but got the same result in IGV.
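
The idea behind a minimum variant frequency can be sketched as below. This is a simplified illustration with hypothetical read counts, not VarScan's real calculation (which also applies base-quality and coverage filters); the function names are made up for this example. A variant supported by nearly all reads passes every threshold tried here, which is one possible reason why changing the threshold gives the same result.

```python
def variant_frequency(ref_count, alt_count):
    """Fraction of reads supporting the variant at one position."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0


def passes_threshold(ref_count, alt_count, min_var_freq):
    """True if the variant frequency reaches the chosen minimum."""
    return variant_frequency(ref_count, alt_count) >= min_var_freq


# Hypothetical counts: (reference reads, variant reads)
for ref_n, alt_n in [(0, 20), (17, 3)]:
    results = [passes_threshold(ref_n, alt_n, t) for t in (0.05, 0.1, 0.5)]
    print(ref_n, alt_n, results)
```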

### Step 8

To identify the loci of the mutations I used SnpEff (v. 5.1d).

First, I created a database for E. coli K12. The annotation and sequence were downloaded from [GenBank](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.gbff.gz).

Then I created a new file in the working directory with `vim snpeff.config`, with the following content:
> k12.genome : ecoli_K12

The downloaded file was moved to another directory, unpacked and renamed to genes.gbk.

I built the database with the command `snpEff build -genbank -v k12` while in the "data" directory. I had problems with the `snpEff ann k12 VarScan_results.vcf` command from the guideline, so I tried to solve the problem manually.

Upd.: my co-author did this step, so we had a file to check ourselves against. Everything was correct.
Upd.2.0: the solution to the problem was described in the workshop's chat.

It was very important to check the strand orientation of the gene of interest and click "flip strand" to change the amino acid sequence of the annotation.
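
Flipping the strand matters because a gene on the reverse strand is read as the reverse complement of the reference sequence, so both the bases and the codons change. A minimal sketch (the helper `reverse_complement` is hypothetical, written only for this illustration):

```python
# Translation table for complementing DNA bases (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# The same three reference bases read from the opposite strand
print(reverse_complement("GGC"))  # GCC
```

So a codon that looks like GGC on the forward strand is read as GCC for a gene on the reverse strand.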

The first locus has an SNP (G instead of C) in the middle position of an alanine codon. It changes alanine (GCC) to glycine (GGC). This is a missense mutation in the CDS. Full information about this locus is below.
``` 
NC_000913.3:91413-93179

Type: gene
ID: gene-b0084
Dbxref: ASAP:ABE-0000309,ECOCYC:EG10341,GeneID:944799
Name: ftsI
gbkey: Gene
gene: ftsI
gene_biotype: protein_coding
gene_synonym: ECK0085,pbpB,sep
locus_tag: b0084
---------------------------
Type: CDS
ID: cds-NP_414626.1
Parent: gene-b0084
Dbxref: UniProtKB/Swiss-Prot:P0AD68,Genbank:NP_414626.1,ASAP:ABE-0000309,ECOCYC:EG10341,GeneID:944799
Name: NP_414626.1
gbkey: CDS
gene: ftsI
locus_tag: b0084
orig_transcript_id: gnl|b0084|mrna.NP_414626
product: peptidoglycan DD-transpeptidase FtsI
protein_id: NP_414626.1
transl_table: 11
```
The second locus was very strange. After flipping strands it showed that A was substituted by... A. I took a screenshot of IGV for further consultation and kept looking.

The third locus didn't contain a protein-coding gene, but had a small RNA exon and a substitution (G instead of T). I have some ambiguous ideas about this mutation and kept looking at the IGV results.
```
NC_000913.3:852725-853064

Type: gene
ID: gene-b4416
Dbxref: ASAP:ABE-0047240,ECOCYC:G0-8881,GeneID:2847681
Name: rybA
gbkey: Gene
gene: rybA
gene_biotype: ncRNA
gene_synonym: ECK0806
locus_tag: b4416
```
```
NC_000913.3:852725-853064

Type: ncRNA
ID: rna-b4416
Parent: gene-b4416
Dbxref: ASAP:ABE-0047240,ECOCYC:G0-8881,GeneID:2847681
gbkey: ncRNA
gene: rybA
locus_tag: b4416
product: small RNA RybA
---------------------------
Type: exon
ID: exon-b4416-1
Parent: rna-b4416
Dbxref: ASAP:ABE-0047240,ECOCYC:G0-8881,GeneID:2847681
gbkey: ncRNA
gene: rybA
locus_tag: b4416
product: small RNA RybA

```

The fourth locus contained a substitution (A instead of G). It is a missense mutation, because replacing GGT with GAT changes glycine to aspartic acid. Full information about this locus is below.

```
NC_000913.3:1905688-1906254

Type: gene
ID: gene-b1821
Dbxref: ASAP:ABE-0006065,ECOCYC:G6999,GeneID:946341
Name: mntP
gbkey: Gene
gene: mntP
gene_biotype: protein_coding
gene_synonym: ECK1819,yebN
locus_tag: b1821
---------------------------
Type: CDS
ID: cds-NP_416335.4
Parent: gene-b1821
Dbxref: UniProtKB/Swiss-Prot:P76264,Genbank:NP_416335.4,ASAP:ABE-0006065,ECOCYC:G6999,GeneID:946341
Name: NP_416335.4
gbkey: CDS
gene: mntP
locus_tag: b1821
orig_transcript_id: gnl|b1821|mrna.NP_416335
product: Mn(2(+)) exporter
protein_id: NP_416335.4
transl_table: 11

```

The fifth is a substitution (C instead of T). It is a missense mutation, because changing GTA to GCA replaces valine with alanine.
```
NC_000913.3:3534516-3535868

Type: gene
ID: gene-b3404
Dbxref: ASAP:ABE-0011108,ECOCYC:EG10269,GeneID:947272
Name: envZ
gbkey: Gene
gene: envZ
gene_biotype: protein_coding
gene_synonym: ECK3391,ompB,perA,tpo
locus_tag: b3404
---------------------------
Type: CDS
ID: cds-NP_417863.1
Parent: gene-b3404
Dbxref: UniProtKB/Swiss-Prot:P0AEJ4,Genbank:NP_417863.1,ASAP:ABE-0011108,ECOCYC:EG10269,GeneID:947272
Name: NP_417863.1
gbkey: CDS
gene: envZ
locus_tag: b3404
orig_transcript_id: gnl|b3404|mrna.NP_417863
product: sensor histidine kinase EnvZ
protein_id: NP_417863.1
transl_table: 11

```


The sixth is a substitution (T instead of C). It is a synonymous (same-sense) mutation, because GCC and GCT code for the same amino acid, alanine.
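
The codon changes described above can be double-checked with a short script. This is a sketch that builds the standard genetic code from the usual TCAG ordering (the amino acid assignments are the same in translation table 11, which the annotation uses) and classifies each change; the function name `classify` is hypothetical, and stop-loss changes are not treated as a separate class here.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for codons TTT, TTC, TTA, TTG, TCT, ... in TCAG order
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify(ref_codon, alt_codon):
    """Return (reference amino acid, new amino acid, mutation type)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return ref_aa, alt_aa, kind


# Codon changes reported for the first, fourth, fifth and sixth loci
for ref, alt in [("GCC", "GGC"), ("GGT", "GAT"), ("GTA", "GCA"), ("GCC", "GCT")]:
    print(ref, "->", alt, classify(ref, alt))
```

The one-letter codes A, G, D and V stand for alanine, glycine, aspartic acid and valine.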

## 25-26.10.22 Days sixth and seventh
### Step 9

We wrote a mini-article using Overleaf. Dixi et animam levavi.
