# "Breathing Without Oxygen: Analysis of Differential Gene Expression in Yeast Under Hypoxic Conditions"

Project #6

*Lab Journal by Artem Vasilev and Tatiana Lisitsa*

---

### **System Info**

OS: Ubuntu 22.04.4 LTS

Tool versions:
- snakemake 7.32.4
- fastqc 0.12.1
- multiqc 1.21
- hisat2 2.2.1
- samtools 1.19
- gffread 0.12.7
- featureCounts (subread package) 2.0.6

## **Preparation**

Update packages:

`$ sudo apt update && sudo apt upgrade`

### **Virtual environment setup and tool installation**

Create a mamba virtual environment:

`$ mamba create -n bi_practice_6 && mamba activate bi_practice_6`

Install required tools:

`$ mamba install -c conda-forge -c bioconda snakemake fastqc multiqc hisat2 samtools gffread subread`

Create folder for this project:

`$ mkdir Project_6 && cd Project_6`

### **Installing the newest version of R ([Method 2](https://phoenixnap.com/kb/install-r-ubuntu))**

1. `$ sudo apt install software-properties-common dirmngr -y`
2. `$ wget -qO- https://cloud.r-project.org/bin/linux/ubuntu/marutter_pubkey.asc | sudo tee -a /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc`
3. `$ gpg --show-keys /etc/apt/trusted.gpg.d/cran_ubuntu_key.asc`  # verify that the key fingerprint is E298A3A825C0D65DFD57CBB651716619E084DAB9
4. `$ sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/"`
5. `$ sudo apt install r-base r-base-dev -y`
6. `$ R`  # 4.3.2

---

## **Data processing**

Many of the steps are run with **Snakemake**. For details, see the contents of the `Snakefile`.

### **Downloading**

**Yeast reads:**

- SRR941816: fermentation 0 minutes replicate 1
- SRR941817: fermentation 0 minutes replicate 2
- SRR941818: fermentation 30 minutes replicate 1
- SRR941819: fermentation 30 minutes replicate 2

**+ Reference genome *Saccharomyces cerevisiae* strain S288c assembly R64 and annotation**

```shell
$ snakemake --cores=all -p \
raw_data/y00_1.fastq.gz \
raw_data/y00_2.fastq.gz \
raw_data/y30_1.fastq.gz \
raw_data/y30_2.fastq.gz \
reference/Saccharomyces_cerevisiae.fna \
reference/Saccharomyces_cerevisiae.gff
```
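The relationship between the run accessions and the requested file names can be written down as a small sample sheet. This is only a sketch: the pairing of `y00_1` ... `y30_2` with the SRR accessions is inferred from the run list and the target names above, and the names `SAMPLES` and `raw_read_targets` are hypothetical, not taken from the `Snakefile`.

```python
# Hypothetical sample sheet: the sample-to-accession pairing is an assumption
# inferred from the run list and the snakemake targets, not read from the Snakefile.
SAMPLES = {
    "y00_1": "SRR941816",  # fermentation 0 minutes, replicate 1
    "y00_2": "SRR941817",  # fermentation 0 minutes, replicate 2
    "y30_1": "SRR941818",  # fermentation 30 minutes, replicate 1
    "y30_2": "SRR941819",  # fermentation 30 minutes, replicate 2
}


def raw_read_targets(samples=SAMPLES, directory="raw_data"):
    """Return the FASTQ paths requested from snakemake, in sample order."""
    return [f"{directory}/{name}.fastq.gz" for name in samples]


if __name__ == "__main__":
    # Print one target per line, ready to paste after `snakemake --cores=all -p`.
    print("\n".join(raw_read_targets()))
```

Keeping the mapping in one place makes it easier to add replicates later without editing every command by hand.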

### **FastQC + MultiQC**

```shell
$ snakemake --cores=all -p \
results/fastqc/y00_1_fastqc.html \
results/fastqc/y00_2_fastqc.html \
results/fastqc/y30_1_fastqc.html \
results/fastqc/y30_2_fastqc.html \
results/multiqc/multiqc_report.html
```

![](fastqc_sequence_counts_plot.png)

![](fastqc_per_base_sequence_quality_plot.png)

### **Analysis Pipeline**

```shell
$ snakemake --cores=all -p \
results/deseq2/genes.txt \
results/heatmaps/output.pdf
```

This pipeline includes:
- Aligning with `HISAT2`
- Quantifying with `featureCounts`
- Finding differentially expressed genes with `DESeq2`
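
To make the first two stages more concrete, the sketch below assembles the kind of command lines such a pipeline might run. It is an illustration under stated assumptions, not the rules from the `Snakefile`: the reads are assumed to be single-end (one FASTQ file per sample), and the index prefix `results/hisat/index/yeast`, the thread count, and the function names are hypothetical. Only the commands are built here; nothing is executed.

```python
import shlex


def hisat2_command(index_prefix, fastq, threads=4):
    """Build a HISAT2 call for single-end reads (assumed layout); SAM goes to stdout."""
    return ["hisat2", "-p", str(threads), "-x", index_prefix, "-U", fastq]


def sort_command(bam_out):
    """Build a samtools call that sorts SAM read from stdin into a BAM file."""
    return ["samtools", "sort", "-o", bam_out, "-"]


def featurecounts_command(gtf, bams, counts_out, threads=4):
    """Build a featureCounts call that counts reads per gene across all BAM files."""
    return ["featureCounts", "-T", str(threads), "-g", "gene_id",
            "-a", gtf, "-o", counts_out, *bams]


if __name__ == "__main__":
    samples = ["y00_1", "y00_2", "y30_1", "y30_2"]
    for name in samples:
        align = hisat2_command("results/hisat/index/yeast",  # hypothetical prefix
                               f"raw_data/{name}.fastq.gz")
        sort = sort_command(f"results/BAM/{name}.bam")
        print(shlex.join(align), "|", shlex.join(sort))
    bams = [f"results/BAM/{name}.bam" for name in samples]
    print(shlex.join(featurecounts_command(
        "reference/Saccharomyces_cerevisiae.gtf", bams,
        "results/count/all_samples.tsv")))
```

The DESeq2 step is left out because it runs in R rather than as a single command-line call.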

After running the three `snakemake` commands above, the project directory has the following structure:
```
~/Practice/Project_6/
 |- logs
       |- feature_all_samples.log (5,4 kB)
       |- featureCounts.log (5,4 kB)
 |- raw_data
       |- y00_1.fastq.gz (433,5 MB)
       |- y00_2.fastq.gz (477,1 MB)
       |- y30_1.fastq.gz (83,1 MB)
       |- y30_2.fastq.gz (295,6 MB)
 |- reference
       |- Saccharomyces_cerevisiae.fna (12,3 MB)
       |- Saccharomyces_cerevisiae.fna.gz (3,8 MB)
       |- Saccharomyces_cerevisiae.gff (12,3 MB)
       |- Saccharomyces_cerevisiae.gff.gz (2,2 MB)
       |- Saccharomyces_cerevisiae.gtf (2,4 MB)
 |- results
       |- BAM
            |- y00_1.bam (321,8 MB)
            |- y00_2.bam (356,3 MB)
            |- y30_1.bam (67,9 MB)
            |- y30_2.bam (226,0 MB)
       |- count
            |- all_samples.tsv (388,7 kB)
            |- all_samples.tsv.summary (608 bytes)
            |- simple_counts.txt (179,1 kB)
       |- deseq2
            |- genes.txt (393 bytes)
            |- norm-matrix-deseq2.txt (488,4 kB)
            |- result.txt (833,4 kB)
       |- fastqc
            |- 8 items (4,2 MB)
       |- heatmaps
            |- output.pdf (283,6 kB)
       |- hisat
            |- index
                 |- 8 items (22,8 MB)
       |- multiqc
            |- multiqc_data
                 |- 7 items (366,9 kB)
            |- multiqc_report.html (4,7 MB)
```

### **Draw another heatmap**

First, rename the `output.pdf` file in the `results/heatmaps` folder so that the next run does not overwrite it.

Second, change the following parameters in `r_scripts/draw-heatmap.r`:

`cexRow = 0.5` → `cexRow = 0.2`

`cexCol = 0.8` → `cexCol = 0.6`

Take the 100 most differentially expressed genes from the top of `result.txt`:

```shell
$ head -n 100 results/deseq2/result.txt | cut -f 1 | cut -d "-" -f 2 > results/deseq2/100_genes.txt
```
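
Two details of the one-liner are worth checking against the actual file. If `result.txt` starts with a header line, `head -n 100` returns that header plus 99 genes, and `cut -d "-" -f 2` keeps only the text between the first and second hyphen, so an ID that itself contains a hyphen would be shortened. The Python sketch below does the same selection while handling both cases. It assumes a tab-separated file whose first column holds IDs with a `gene-` prefix (inferred from the `cut` call) and whose first line is a header; the function name and the toy rows are hypothetical.

```python
def top_gene_ids(lines, n=100, prefix="gene-", has_header=True):
    """Return the first-column IDs of the first n data rows, without the prefix."""
    ids = []
    for i, line in enumerate(lines):
        if has_header and i == 0:
            continue  # skip the column names (assumed to be on the first line)
        if len(ids) == n:
            break
        first = line.rstrip("\n").split("\t")[0]
        if not first:
            continue  # ignore blank lines
        if first.startswith(prefix):
            first = first[len(prefix):]  # drop only the leading prefix
        ids.append(first)
    return ids


if __name__ == "__main__":
    # Toy rows for illustration only; the numeric values are made up.
    toy = ["id\tbaseMean\n", "gene-YDR342C\t1.0\n", "gene-YAL068W-A\t2.0\n"]
    print("\n".join(top_gene_ids(toy, n=2)))
```

With the real data, the lines would come from `open("results/deseq2/result.txt")` and the result would be written to `results/deseq2/100_genes.txt`.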

Run `get_data_subset.py` to obtain the subset of the data for the heatmap:

```shell
$ python3 get_data_subset.py
```
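
For readers without the repository at hand, the filtering step can be sketched as follows. This is not the authors' `get_data_subset.py`; it is a minimal illustration that assumes the normalized matrix is tab-separated, has a header on its first line, and stores the gene ID (optionally with a `gene-` prefix) in the first column. The function name `subset_matrix` and the toy rows are hypothetical.

```python
def subset_matrix(matrix_lines, gene_ids, prefix="gene-"):
    """Keep the header line and every row whose first-column ID is in gene_ids."""
    wanted = set(gene_ids)
    kept = []
    for i, line in enumerate(matrix_lines):
        first = line.split("\t", 1)[0].strip()
        if first.startswith(prefix):
            first = first[len(prefix):]  # compare IDs without the prefix
        if i == 0 or first in wanted:
            kept.append(line)  # row is kept unchanged, prefix included
    return kept


if __name__ == "__main__":
    # Toy matrix for illustration only; the counts are made up.
    toy = [
        "id\ty00_1\ty30_1\n",
        "gene-YDR342C\t10\t2\n",
        "gene-YAL001C\t5\t5\n",
    ]
    print("".join(subset_matrix(toy, ["YDR342C"])), end="")
```

In the project layout above, the input would be `results/deseq2/norm-matrix-deseq2.txt`, the ID list `results/deseq2/100_genes.txt`, and the output `results/deseq2/norm-matrix-deseq2_100_genes.txt`.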

Then draw the second heatmap manually:

```shell
$ cat results/deseq2/norm-matrix-deseq2_100_genes.txt | R -f r_scripts/draw-heatmap.r
```

---

## **Data analysis**

Go to http://www.yeastgenome.org/cgi-bin/GO/goSlimMapper.pl

For your top 100 differentially expressed genes:
- in step 1 press "Choose file" and upload `100_genes.txt`
- in step 2, select "Yeast GO-Slim: Process"
- in step 3, make sure "SELECT ALL TERMS" is highlighted. Press "Submit Form"

Most of our top 100 genes are upregulated, and **6 are downregulated**:

- YDR342C
- YKR097W
- YLR327C
- YCR021C
- YMR081C
- YNL117W
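
One way to separate the two groups programmatically is to look at the sign of the log2 fold change. The sketch below assumes a tab-separated table with a `log2FoldChange` column (the name used in standard DESeq2 results) and an ID column called `id`, which is a guess about `result.txt`. Which sign means "downregulated" depends on the order of the conditions in the comparison, so check that before relying on the labels.

```python
import csv
import io


def split_by_direction(tsv_text, id_column="id", lfc_column="log2FoldChange"):
    """Split gene IDs into (positive, negative) lists by log2 fold change sign."""
    up, down = [], []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        value = row[lfc_column]
        if value in ("", "NA"):
            continue  # no fold change estimate for this gene
        lfc = float(value)
        if lfc > 0:
            up.append(row[id_column])
        elif lfc < 0:
            down.append(row[id_column])
    return up, down


if __name__ == "__main__":
    # Toy table for illustration only; IDs and values are made up.
    toy = "id\tlog2FoldChange\ngeneA\t2.5\ngeneB\t-1.0\ngeneC\tNA\n"
    up, down = split_by_direction(toy)
    print("positive:", up)
    print("negative:", down)
```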

[Heatmap for all genes](https://github.com/ArtemVaska/BI_Practice_Project_6/blob/main/output_all_genes.pdf)


[Heatmap for 100 genes](https://github.com/ArtemVaska/BI_Practice_Project_6/blob/main/output_100_genes.pdf)

The tables of **annotated genes** are available in the GitHub repository.

Download them and open them in a browser to see the results.