# “Exploring Hemagglutinin Mutations Throughout Influenza Infection”

Project #2

*Lab Journal by Artem Vasilev and Tatiana Lisitsa*

---

## Preparing

Update packages:

In [None]:
! sudo apt update && sudo apt upgrade

Make sure you have Java installed on your PC (it will be used in the next steps):

In [None]:
! java --version

You may want to update it. To do so, check the latest Java version (e.g. [here](https://www.codejava.net/java-se/java-se-versions-history)) and install it from the terminal:

In [None]:
! sudo apt install openjdk-19-jdk

Create a virtual environment from the `environment.yaml` file, which you can find on the project's [GitHub](https://github.com/ArtemVaska/BI_Practice_Project_2) page:

In [None]:
# ! mamba env create -f environment.yaml -p /home/user/(anaconda3 or conda)/envs/env_name

# uncomment the line above and set the install path for your username

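Activate it and go to the environment directory. Each `!` command runs in its own temporary subshell, so `! mamba activate` and `! cd` have no effect on later cells: activate the environment in the terminal before launching Jupyter, and use the `%cd` magic to change the working directory:

In [None]:
# run this in the terminal before starting Jupyter:
# mamba activate Practice_Project_2

In [None]:
# this path may differ on your machine!
%cd ~/conda/envs/Practice_Project_2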

Download VarScan from [here](https://github.com/dkoboldt/varscan/blob/99ba0e047d7f048d533f411edb6bb1a189a4fa5d/VarScan.v2.4.6.jar):

In [None]:
! mkdir tools/

In [None]:
! mv ~/Downloads/VarScan.v2.4.6.jar ./tools/

Download IGV:

In [None]:
# `! cd` does not persist between cells, so use the `%cd` magic instead
%cd tools/

In [None]:
! wget https://data.broadinstitute.org/igv/projects/downloads/2.16/IGV_Linux_2.16.2_WithJava.zip

In [None]:
! unzip IGV_Linux_2.16.2_WithJava.zip && rm IGV_Linux_2.16.2_WithJava.zip

In [None]:
%cd ..

Download the `Snakefile` from [GitHub](https://github.com/ArtemVaska/BI_Practice_Project_2) and move it to the folder you created:

In [None]:
! mv ~/Downloads/Snakefile ~/conda/envs/Practice_Project_2

---

## Analyzing data

Most of the steps are done using `Snakemake`. For details, see the contents of the `Snakefile`.

In [None]:
! snakemake --cores=all -p results/variants/reference.roommate_freq.vcf results/variants/reference.roommate_rare.vcf

After running this command, your repository will have the following structure:

```
-/Practice_Project_2/
 |- raw_data
       |- ref
             |- reference.fasta
             |- reference.fasta.amb
             |- reference.fasta.ann
             |- reference.fasta.bwt
             |- reference.fasta.fai
             |- reference.fasta.pac
             |- reference.fasta.sa
       |- roommate.fastq
       |- roommate.fastq.gz
 |- results
       |- bwa
             |- reference.roommate.sorted.bam
             |- reference.roommate.sorted.bam.bai
       |- variants
             |- reference.roommate.mpileup
             |- reference.roommate_freq.vcf  # --min-var-freq 0.95
             |- reference.roommate_rare.vcf  # --min-var-freq 0.001
```

In this step, you download the **reference sequence** for the influenza hemagglutinin gene from the NCBI GenBank database (ID: KF848938.1) and the **sequencing results** (accession: SRR1705851) from the NCBI Sequence Read Archive (SRA).

You also 1) index the reference file, 2) align your roommate’s viral data to the reference sequence, 3) build an mpileup file from the alignment, and 4) look for common and rare variants with VarScan.

It’s a good idea to check that you selected the correct reference by counting the percentage of reads that mapped:

In [None]:
! samtools view -c results/bwa/reference.roommate.sorted.bam  # 361349

In [None]:
! samtools view -c -F 4 results/bwa/reference.roommate.sorted.bam  # 361116

`-c`: count the records that pass the filter instead of printing them

`-F 4`: exclude records with flag 4 set, i.e. unmapped reads
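The two counts give the mapped percentage directly. Below is a minimal Python sketch using the numbers from the comments above; the function name `mapped_percentage` is ours, for illustration only. Note that `samtools view -c` counts alignment records, so any secondary or supplementary alignments present in the BAM file would be included in both numbers.

```python
def mapped_percentage(total_records: int, mapped_records: int) -> float:
    """Return the percentage of records that aligned to the reference."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 100.0 * mapped_records / total_records


# Counts reported by the two `samtools view -c` calls above
total = 361349
mapped = 361116
print(f"{mapped_percentage(total, mapped):.2f}% of records mapped")
```

A value this close to 100% suggests the reference was chosen correctly.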

---

Inspect data quality using FastQC:

In [None]:
! mkdir results/fastqc

In [None]:
! fastqc -o ./results/fastqc/ --noextract ./raw_data/roommate.fastq.gz

Given the quality of the bases and the lack of adaptor sequences, we assume that the data have been preprocessed.

---

Next, extract the variants into a convenient format using `awk`:

In [None]:
! snakemake --cores=all -p results/variants/reference.roommate_freq_parsed.txt results/variants/reference.roommate_rare_parsed.txt

```
-/Practice_Project_2/
 |- raw_data
       ...
 |- results
       |- bwa
             ...
       |- variants
             ...
             |- reference.roommate_freq_parsed.txt
             |- reference.roommate_rare_parsed.txt
```

You will get two parsed files listing the SNPs that were found.
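The parsing itself is done by `awk` inside the `Snakefile`, whose exact command is not shown here. As a rough Python equivalent, the sketch below pulls the position, reference base, alternative base and frequency out of VCF records. It assumes that VarScan writes the frequency as a percentage string under the `FREQ` key of the sample column; the function name and the example record are made up for illustration.

```python
def parse_varscan_vcf(lines):
    """Yield (position, ref, alt, freq_percent) for each variant record."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta-information lines and the column header
        fields = line.split("\t")
        pos, ref, alt = int(fields[1]), fields[3], fields[4]
        # Look FREQ up by name in the FORMAT column rather than by position
        keys = fields[8].split(":")
        values = fields[9].split(":")
        sample = dict(zip(keys, values))
        freq = float(sample["FREQ"].rstrip("%"))
        yield pos, ref, alt, freq


# Made-up record for illustration only (not taken from the real data)
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1",
    "KF848938.1\t100\t.\tA\tG\t.\tPASS\tADP=1000\tGT:DP:FREQ\t0/1:1000:0.5%",
]
for pos, ref, alt, freq in parse_varscan_vcf(example):
    print(pos, ref, alt, freq)
```

Looking the key up by name keeps the sketch independent of the exact order of the FORMAT fields.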

---

After that, you need to inspect and align the **control sample sequencing data**, which consist of **three controls** (from sequencing of isogenic reference samples).

Download them:

In [None]:
! snakemake --cores=all -p raw_data/control1.fastq raw_data/control2.fastq raw_data/control3.fastq

Then repeat the same steps as for the roommate sample for all three controls, with a minimum variant frequency of 0.001 (0.1%):

In [None]:
! snakemake --cores=all -p results/variants/reference.control1_rare.vcf

In [None]:
! snakemake --cores=all -p results/variants/reference.control1_rare_parsed.txt

In [None]:
! snakemake --cores=all -p results/variants/reference.control2_rare.vcf

In [None]:
! snakemake --cores=all -p results/variants/reference.control2_rare_parsed.txt

In [None]:
! snakemake --cores=all -p results/variants/reference.control3_rare.vcf

In [None]:
! snakemake --cores=all -p results/variants/reference.control3_rare_parsed.txt
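Since `snakemake` accepts several targets in one call (as in the roommate commands earlier), the six invocations above could probably be merged into one. The Python sketch below only builds that command line from the file naming pattern used in this journal; it assumes the `Snakefile` rules resolve these targets the same way when requested together.

```python
controls = ["control1", "control2", "control3"]
suffixes = ["_rare.vcf", "_rare_parsed.txt"]

# One target per control and per output type, following the naming above
targets = [
    f"results/variants/reference.{name}{suffix}"
    for name in controls
    for suffix in suffixes
]
command = "snakemake --cores=all -p " + " ".join(targets)
print(command)
```

The printed command can be pasted into a cell after a `!`.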

```
-/Practice_Project_2/
 |- raw_data
       ...
       |- control1.fastq
       |- control1.fastq.gz
       |- control2.fastq
       |- control2.fastq.gz
       |- control3.fastq
       |- control3.fastq.gz
 |- results
       |- bwa
             ...
             |- reference.control1.sorted.bam
             |- reference.control1.sorted.bam.bai
             |- reference.control2.sorted.bam
             |- reference.control2.sorted.bam.bai
             |- reference.control3.sorted.bam
             |- reference.control3.sorted.bam.bai
       |- variants
             ...
             |- reference.control1.mpileup
             |- reference.control1_rare.vcf
             |- reference.control1_rare_parsed.txt
             |- reference.control2.mpileup
             |- reference.control2_rare.vcf
             |- reference.control2_rare_parsed.txt
             |- reference.control3.mpileup
             |- reference.control3_rare.vcf
             |- reference.control3_rare_parsed.txt
```

Compare the control results with your roommate’s results.
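The journal does not state the criterion used for this comparison. One common approach, shown here purely as an assumption, is to treat the control frequencies as sequencing error: for each control compute the mean plus three standard deviations of its variant frequencies, then keep only roommate variants above the highest of these thresholds. The function names, the 3 SD cutoff and all numbers below are illustrative, not taken from our data.

```python
from statistics import mean, stdev


def error_threshold(control_freqs, n_sd=3.0):
    """Mean + n_sd sample standard deviations of one control's frequencies."""
    return mean(control_freqs) + n_sd * stdev(control_freqs)


def above_background(sample_variants, controls, n_sd=3.0):
    """Keep sample variants whose frequency exceeds every control threshold."""
    cutoff = max(error_threshold(freqs, n_sd) for freqs in controls)
    return [v for v in sample_variants if v[3] > cutoff]


# Made-up frequencies (in %) for illustration only
controls = [
    [0.20, 0.25, 0.30],
    [0.22, 0.24, 0.26],
    [0.18, 0.25, 0.32],
]
# Made-up (position, ref, alt, freq_percent) tuples
roommate = [(100, "A", "G", 0.23), (200, "C", "T", 0.94), (300, "G", "A", 99.9)]
print(above_background(roommate, controls))
```

Taking the maximum threshold across controls is the conservative choice: a variant has to stand out against all three error profiles to be kept.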

Here is what we got: [MAF of variants in sample and controls](https://github.com/ArtemVaska/BI_Practice_Project_2/blob/main/MAF%20of%20variants%20in%20sample%20and%20controls.pdf)