# **CHAPTER 2. Differential analysis**

**Install conda env and activate it**

```
conda env create -f diff_an.yaml
```

```
conda activate diff_an
```

## **Part 0. Copy kreports from the server**

```
scp -r username@host.com:"/path/to/kreports/folder" data/kreports/
```

## **Part 1. Data parsing**

Rename files. Delete `_kraken_report` from file names.

In [4]:
# Usage
# {path_to_script} {path_to_folder}
%run scripts/rename_files.py data/kreports
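
The contents of `scripts/rename_files.py` are not shown here, so the following is only a minimal sketch of the renaming step described above. The function name `strip_suffix_from_names` and its `token` parameter are hypothetical, and the refusal to overwrite existing files is an assumption, not necessarily what the script does.

```python
from pathlib import Path


def strip_suffix_from_names(folder, token="_kraken_report"):
    """Rename every file in `folder` whose name contains `token`, removing it."""
    renamed = []
    # sorted() materialises the listing first, so renaming is safe while looping
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and token in path.name:
            target = path.with_name(path.name.replace(token, ""))
            if target.exists():
                # Assumption: never overwrite an existing file silently
                raise FileExistsError(f"{target} already exists")
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

Calling `strip_suffix_from_names("data/kreports")` would return the list of new file names.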

Install KrakenTools

In [8]:
! git clone https://github.com/jenniferlu717/KrakenTools

Cloning into 'KrakenTools'...
remote: Enumerating objects: 360, done.
remote: Counting objects: 100% (110/110), done.
remote: Compressing objects: 100% (53/53), done.
remote: Total 360 (delta 79), reused 69 (delta 57), pack-reused 250
Receiving objects: 100% (360/360), 129.37 KiB | 1003.00 KiB/s, done.
Resolving deltas: 100% (218/218), done.


Create a folder for the MPA files

In [5]:
! mkdir data/mpa

Convert kraken reports in `data/kreports` folder to MPA format and place files in `data/mpa` folder

In [6]:
# Usage
# {path_to_script} {path_to_txt_files} {path_to_output_mpa_files}
! ./scripts/run_kreport2mpa.sh data/kreports data/mpa
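
`run_kreport2mpa.sh` itself is not reproduced in this notebook. As a rough Python equivalent, the sketch below builds one `kreport2mpa.py` call per report. The function name `build_kreport2mpa_commands`, the `.mpa.txt` output naming, and the use of `--display-header` are assumptions; `-r` and `-o` are the report and output options of KrakenTools' `kreport2mpa.py`.

```python
from pathlib import Path


def build_kreport2mpa_commands(report_dir, mpa_dir,
                               script="KrakenTools/kreport2mpa.py"):
    """Return one kreport2mpa.py command (as an argument list) per report."""
    commands = []
    for report in sorted(Path(report_dir).glob("*.txt")):
        # Hypothetical naming scheme for the converted files
        output = Path(mpa_dir) / f"{report.stem}.mpa.txt"
        commands.append([
            "python", script,
            "-r", str(report),     # input Kraken report
            "-o", str(output),     # output MPA-style file
            "--display-header",    # assumption: keep a header with the sample name
        ])
    return commands


# To actually run the conversion (requires the cloned KrakenTools folder):
# import subprocess
# for cmd in build_kreport2mpa_commands("data/kreports", "data/mpa"):
#     subprocess.run(cmd, check=True)
```

Building the commands separately from running them makes it easy to inspect what will be executed before touching any files.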

Combine mpa files

In [7]:
# Usage
# {path_to_script} -i {path_to_mpa_files} -o {output_file_name}
%run KrakenTools/combine_mpa.py -i data/mpa/* -o data/COMBINED.txt

 Number of files to parse: 10
 Number of classifications to write: 17159
	17159 classifications printed


### **Part 1.1. Getting `counts.csv` files**

Prepare folders to place data

In [8]:
%%bash
# -p creates parent folders as needed and does not fail if they already exist
mkdir -p counts/txt counts/csv

Parse `data/COMBINED.txt` file to counts files on several taxonomic levels

#### **Part 1.1.1. _`Species`_ level**

In [46]:
%%bash

grep -E "s__" data/COMBINED.txt \
| grep -v "t__" \
| grep -v "s__Homo_sapiens" \
| sed "s/^.*|//g" \
| sed "s/SRS[0-9]*-//g" \
> counts/txt/counts_species.txt

#### **Part 1.1.2. _`Genus`_ level**

In [49]:
%%bash

grep -E "g__" data/COMBINED.txt \
| grep -v "t__" \
| grep -v "s__" \
| grep -v "g__Homo" \
| sed "s/^.*|//g" \
| sed "s/SRS[0-9]*-//g" \
> counts/txt/counts_genus.txt

#### **Part 1.1.3. _`Family`_ level**

In [50]:
%%bash

grep -E "f__" data/COMBINED.txt \
| grep -v "t__" \
| grep -v "s__" \
| grep -v "g__" \
| grep -v "f__Hominidae" \
| sed "s/^.*|//g" \
| sed "s/SRS[0-9]*-//g" \
> counts/txt/counts_family.txt

#### **Part 1.1.4. _`Order`_ level**

In [12]:
%%bash

grep -E "o__" data/COMBINED.txt \
| grep -v "t__" \
| grep -v "s__" \
| grep -v "g__" \
| grep -v "f__" \
| grep -v "o__Primates" \
| sed "s/^.*|//g" \
| sed "s/SRS[0-9]*-//g" \
> counts/txt/counts_order.txt

#### **Part 1.1.5. _`Class`_ level**

In [13]:
%%bash

grep -E "c__" data/COMBINED.txt \
| grep -v "t__" \
| grep -v "s__" \
| grep -v "g__" \
| grep -v "f__" \
| grep -v "o__" \
| grep -v "c__Mammalia" \
| sed "s/^.*|//g" \
| sed "s/SRS[0-9]*-//g" \
> counts/txt/counts_class.txt

#### **Part 1.1.6. _`Phylum`_ level**

In [14]:
%%bash

grep -E "p__" data/COMBINED.txt \
| grep -v "t__" \
| grep -v "s__" \
| grep -v "g__" \
| grep -v "f__" \
| grep -v "o__" \
| grep -v "c__" \
| grep -v "p__Chordata" \
| sed "s/^.*|//g" \
| sed "s/SRS[0-9]*-//g" \
> counts/txt/counts_phylum.txt
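
The six pipelines above differ only in the rank prefix and the host taxon being excluded. The sketch below expresses the same idea once in Python: keep rows whose deepest taxon is at the requested rank, drop host rows, and keep only the last lineage segment. The names `RANK_PREFIXES` and `extract_rank` are hypothetical, and the sample-ID clean-up done by the second `sed` call is left out.

```python
RANK_PREFIXES = {
    "phylum": "p__", "class": "c__", "order": "o__",
    "family": "f__", "genus": "g__", "species": "s__",
}


def extract_rank(lines, rank, exclude=()):
    """Keep rows whose deepest taxon is at `rank`; return 'taxon<TAB>counts' rows.

    `exclude` lists substrings (for example host taxa) whose rows are dropped.
    """
    prefix = RANK_PREFIXES[rank]
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        lineage, _, counts = line.partition("\t")
        taxon = lineage.split("|")[-1]  # deepest assigned rank in the lineage
        if not taxon.startswith(prefix):
            continue
        if any(token in lineage for token in exclude):
            continue
        kept.append(f"{taxon}\t{counts}")
    return kept
```

For example, `extract_rank(open("data/COMBINED.txt"), "genus", exclude=("g__Homo",))` would play the role of the genus-level pipeline.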

### **Part 1.2. Process counts files**

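`processing_script` does the following:
1. Restores the first line of `data/COMBINED.txt`, which holds the sample IDs, as the header of the counts file
2. Removes the `[X]__` prefix and the `_` characters from organism names, where `[X]` is the one-letter rank code (`s` for species, `g` for genus, etc.)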
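
The two steps above can be sketched as follows. This is not the code of `scripts/processing_script.py`; the function names are hypothetical, and replacing underscores with spaces (rather than removing them outright) is an assumption about what the script does.

```python
import re

RANK_PREFIX = re.compile(r"^[a-z]__")  # e.g. "s__", "g__", "f__"


def clean_taxon(name):
    """Strip the rank prefix and turn underscores into spaces (assumption)."""
    return RANK_PREFIX.sub("", name).replace("_", " ")


def add_header_and_clean(header_line, count_lines):
    """Return the header followed by rows with cleaned taxon names."""
    processed = [header_line.rstrip("\n")]
    for line in count_lines:
        taxon, sep, counts = line.rstrip("\n").partition("\t")
        processed.append(f"{clean_taxon(taxon)}{sep}{counts}")
    return processed
```

Here `header_line` would be the first line of `data/COMBINED.txt` and `count_lines` the lines of one `counts/txt/counts_*.txt` file.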

In [47]:
# Usage
# {path_to_script} {path_to_txt_file} {path_to_output_file}
%run scripts/processing_script.py data/COMBINED.txt counts/txt/counts_species.txt

In [51]:
# Usage
# {path_to_script} {path_to_txt_file} {path_to_output_file}
%run scripts/processing_script.py data/COMBINED.txt counts/txt/counts_genus.txt

In [52]:
# Usage
# {path_to_script} {path_to_txt_file} {path_to_output_file}
%run scripts/processing_script.py data/COMBINED.txt counts/txt/counts_family.txt

In [21]:
# Usage
# {path_to_script} {path_to_txt_file} {path_to_output_file}
%run scripts/processing_script.py data/COMBINED.txt counts/txt/counts_order.txt

In [22]:
# Usage
# {path_to_script} {path_to_txt_file} {path_to_output_file}
%run scripts/processing_script.py data/COMBINED.txt counts/txt/counts_class.txt

In [23]:
# Usage
# {path_to_script} {path_to_txt_file} {path_to_output_file}
%run scripts/processing_script.py data/COMBINED.txt counts/txt/counts_phylum.txt

### **Part 1.3. Convert to `csv` file format**

`convert2csv` does the following:
1. Takes `counts_tax_lvl.txt` as the input
2. Writes `counts_tax_lvl.csv` as the output
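
A conversion of this kind can be sketched with the standard `csv` module. The sketch assumes the `.txt` files are tab-separated, as MPA-style tables are; `tsv_to_csv` is a hypothetical name and not the function used in `scripts/convert2csv.py`.

```python
import csv


def tsv_to_csv(txt_path, csv_path):
    """Rewrite a tab-separated text file as a comma-separated file."""
    with open(txt_path, newline="") as src, open(csv_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        rows_written = 0
        for row in reader:
            writer.writerow(row)  # fields containing commas are quoted automatically
            rows_written += 1
    return rows_written
```

Using `csv.writer` instead of a plain string replacement keeps the output valid even if a taxon name happens to contain a comma.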

In [48]:
# Usage
# {path_to_script} {path_to_input_file} {path_to_output_file}
%run scripts/convert2csv.py counts/txt/counts_species.txt counts/csv/counts_species.csv

Data has been successfully converted and saved as 'counts/csv/counts_species.csv'.


In [53]:
# Usage
# {path_to_script} {path_to_input_file} {path_to_output_file}
%run scripts/convert2csv.py counts/txt/counts_genus.txt counts/csv/counts_genus.csv

Data has been successfully converted and saved as 'counts/csv/counts_genus.csv'.


In [54]:
# Usage
# {path_to_script} {path_to_input_file} {path_to_output_file}
%run scripts/convert2csv.py counts/txt/counts_family.txt counts/csv/counts_family.csv

Data has been successfully converted and saved as 'counts/csv/counts_family.csv'.


In [27]:
# Usage
# {path_to_script} {path_to_input_file} {path_to_output_file}
%run scripts/convert2csv.py counts/txt/counts_order.txt counts/csv/counts_order.csv

Data has been successfully converted and saved as 'counts/csv/counts_order.csv'.


In [28]:
# Usage
# {path_to_script} {path_to_input_file} {path_to_output_file}
%run scripts/convert2csv.py counts/txt/counts_class.txt counts/csv/counts_class.csv

Data has been successfully converted and saved as 'counts/csv/counts_class.csv'.


In [29]:
# Usage
# {path_to_script} {path_to_input_file} {path_to_output_file}
%run scripts/convert2csv.py counts/txt/counts_phylum.txt counts/csv/counts_phylum.csv

Data has been successfully converted and saved as 'counts/csv/counts_phylum.csv'.


### **Part 1.4. Create metadata**

In [1]:
import csv

In [2]:
# Define the data
data = [
    {'sample_id': 'D1', 'Group': 'Vespertilio murinus'},
    {'sample_id': 'D2', 'Group': 'Vespertilio murinus'},
    {'sample_id': 'D3', 'Group': 'Vespertilio murinus'},
    {'sample_id': 'D4', 'Group': 'Vespertilio murinus'},
    {'sample_id': 'D5', 'Group': 'Vespertilio murinus'},
    {'sample_id': 'P1', 'Group': 'Nyctalus noctula'},
    {'sample_id': 'P2', 'Group': 'Nyctalus noctula'},
    {'sample_id': 'P3', 'Group': 'Nyctalus noctula'},
    {'sample_id': 'P4', 'Group': 'Nyctalus noctula'},
    {'sample_id': 'P5', 'Group': 'Nyctalus noctula'}
]

# Define the CSV file name
filename = 'metadata.csv'

# Write the data to the CSV file
with open(filename, mode='w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=['sample_id', 'Group'])
    writer.writeheader()
    writer.writerows(data)

print(f'{filename} has been created successfully.')

metadata.csv has been created successfully.


## **Part 2. Comparative statistics**

### **Part 2.1. Differential Microbial Abundance**

In [None]:
import pandas as pd

`MaAsLin2` is the next generation of `MaAsLin` (Microbiome Multivariable Association with Linear Models).

`MaAsLin2` is a comprehensive R package for efficiently determining multivariable associations between clinical metadata and microbial meta-omics features. `MaAsLin2` relies on general linear models to accommodate most modern epidemiological study designs, including cross-sectional and longitudinal ones, along with a variety of filtering, normalization, and transform methods.

In [None]:
! mkdir MaAsLin2_results

#### **Part 2.1.1. _`Species`_ level**

In [None]:
%%bash
# Usage
# {path_to_script} {path_to_metadata} {path_to_counts} {path_to_output}
Rscript.exe scripts/MaAsLin2.R metadata.csv counts/csv/counts_species.csv MaAsLin2_results/species

In [33]:
MaAsLin2_results_species = pd.read_csv('MaAsLin2_results/species/significant_results.tsv', sep='\t')
MaAsLin2_results_species

Unnamed: 0,feature,metadata,value,coef,stderr,N,N.not.0,pval,qval


#### **Part 2.1.2. _`Genus`_ level**

In [None]:
%%bash
# Usage
# {path_to_script} {path_to_metadata} {path_to_counts} {path_to_output}
Rscript.exe scripts/MaAsLin2.R metadata.csv counts/csv/counts_genus.csv MaAsLin2_results/genus

In [36]:
MaAsLin2_results_genus = pd.read_csv('MaAsLin2_results/genus/significant_results.tsv', sep='\t')
MaAsLin2_results_genus

Unnamed: 0,feature,metadata,value,coef,stderr,N,N.not.0,pval,qval


#### **Part 2.1.3. _`Family`_ level**

In [None]:
%%bash
# Usage
# {path_to_script} {path_to_metadata} {path_to_counts} {path_to_output}
Rscript.exe scripts/MaAsLin2.R metadata.csv counts/csv/counts_family.csv MaAsLin2_results/family

In [38]:
MaAsLin2_results_family = pd.read_csv('MaAsLin2_results/family/significant_results.tsv', sep='\t')
MaAsLin2_results_family

Unnamed: 0,feature,metadata,value,coef,stderr,N,N.not.0,pval,qval


#### **Part 2.1.4. _`Order`_ level**

In [None]:
%%bash
# Usage
# {path_to_script} {path_to_metadata} {path_to_counts} {path_to_output}
Rscript.exe scripts/MaAsLin2.R metadata.csv counts/csv/counts_order.csv MaAsLin2_results/order

In [40]:
MaAsLin2_results_order = pd.read_csv('MaAsLin2_results/order/significant_results.tsv', sep='\t')
MaAsLin2_results_order

Unnamed: 0,feature,metadata,value,coef,stderr,N,N.not.0,pval,qval


#### **Part 2.1.5. _`Class`_ level**

In [None]:
%%bash
# Usage
# {path_to_script} {path_to_metadata} {path_to_counts} {path_to_output}
Rscript.exe scripts/MaAsLin2.R metadata.csv counts/csv/counts_class.csv MaAsLin2_results/class

In [42]:
MaAsLin2_results_class = pd.read_csv('MaAsLin2_results/class/significant_results.tsv', sep='\t')
MaAsLin2_results_class

Unnamed: 0,feature,metadata,value,coef,stderr,N,N.not.0,pval,qval


#### **Part 2.1.6. _`Phylum`_ level**

In [None]:
%%bash
# Usage
# {path_to_script} {path_to_metadata} {path_to_counts} {path_to_output}
Rscript.exe scripts/MaAsLin2.R metadata.csv counts/csv/counts_phylum.csv MaAsLin2_results/phylum

In [44]:
MaAsLin2_results_phylum = pd.read_csv('MaAsLin2_results/phylum/significant_results.tsv', sep='\t')
MaAsLin2_results_phylum

Unnamed: 0,feature,metadata,value,coef,stderr,N,N.not.0,pval,qval


As can be seen, there are no significant results in differential microbial abundance at any taxonomic level. Still, let's visualize these results to take a closer look!
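
Since the same check is repeated for six levels, the number of significant associations can also be gathered in one pass. The sketch below uses only the standard library and assumes the `MaAsLin2_results/<level>/significant_results.tsv` layout used in the cells above; `count_significant` is a hypothetical helper.

```python
import csv
from pathlib import Path

LEVELS = ["species", "genus", "family", "order", "class", "phylum"]


def count_significant(results_root="MaAsLin2_results", levels=LEVELS):
    """Map each taxonomic level to the number of rows in its significant_results.tsv."""
    counts = {}
    for level in levels:
        table = Path(results_root) / level / "significant_results.tsv"
        if not table.exists():
            counts[level] = None  # no results file found for this level
            continue
        with open(table, newline="") as fh:
            # DictReader consumes the header, so only data rows are counted
            counts[level] = sum(1 for _ in csv.DictReader(fh, delimiter="\t"))
    return counts
```

A value of `0` means the table exists but holds only its header, which is the situation shown in the outputs above.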

### **Part 2.2. Visualization**

Please open `RStudio` and go through the `Volcano_plots_journal.R` script.<br>
The plots require many manual adjustments, so the script cannot simply be executed from this notebook.

### **Part 2.3. Alpha- and Beta diversities**

In [None]:
! mkdir Alpha_div

Alpha diversity calculations

In [None]:
%%bash
# Usage
# {path_to_script} {path_to_counts} {path_to_output}
Rscript.exe scripts/Alpha_div_calculations.R counts/csv/counts_species.csv Alpha_div/alpha_div_cult.csv
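
Which indices `Alpha_div_calculations.R` computes is not stated here. As an illustration of what an alpha diversity calculation involves, the sketch below computes two common per-sample measures, the Shannon index and observed richness, from one sample's counts. Both function names are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def observed_richness(counts):
    """Number of taxa detected (count > 0) in the sample."""
    return sum(1 for c in counts if c > 0)
```

Two equally abundant taxa give a Shannon index of `ln 2`, while a sample dominated by a single taxon gives a value close to zero.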

#### **Part 2.3.1. Alpha- and Beta diversities visualization**

Please open `RStudio` and go through the `Alpha_Beta_div_journal.R` script.<br>
The plots require many manual adjustments, so the script cannot simply be executed from this notebook.

## **Part 3. Bar-plot - Mean Relative Abundance**

Please open `RStudio` and go through the `Bar_plots_journal.R` script.<br>
The plots require many manual adjustments, so the script cannot simply be executed from this notebook.
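
For orientation, mean relative abundance can be computed as follows: each sample's counts are divided by that sample's total, and the resulting fractions are averaged within each group. This is a sketch under that definition, not the code of `Bar_plots_journal.R`; the function names and the dictionary-based input layout are assumptions.

```python
def relative_abundance(sample_counts):
    """Convert {taxon: count} for one sample into {taxon: fraction of total}."""
    total = sum(sample_counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in sample_counts}
    return {taxon: count / total for taxon, count in sample_counts.items()}


def mean_relative_abundance(samples, groups):
    """Average per-sample relative abundances within each group.

    samples: {sample_id: {taxon: count}}; groups: {sample_id: group_name}.
    """
    sums, sizes = {}, {}
    for sample_id, counts in samples.items():
        group = groups[sample_id]
        sizes[group] = sizes.get(group, 0) + 1
        group_sums = sums.setdefault(group, {})
        for taxon, fraction in relative_abundance(counts).items():
            group_sums[taxon] = group_sums.get(taxon, 0.0) + fraction
    # A taxon missing from a sample contributes zero to the group mean
    return {
        group: {taxon: value / sizes[group] for taxon, value in taxa.items()}
        for group, taxa in sums.items()
    }
```

Normalizing per sample before averaging keeps samples with deeper sequencing from dominating the group mean.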