# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

The objective of the study was to quantify and characterize the molecular (especially transcriptomic) changes in mice after opioid treatment and subsequent withdrawal under chronic pain. As controls, these mice were compared to mice without chronic pain and to mice that were given a placebo instead of the opioid.

What do the conditions mean?

oxy:
Oxy is short for oxycodone, an opioid commonly used to treat severe pain.


sal:
Sal is short for saline solution, which does not contain any active ingredients and serves as a placebo.

What do the genotypes mean?

SNI:
SNI is short for spared nerve injury. Parts of the sciatic nerve were surgically removed in the mice, causing chronic pain.

Sham:
Sham surgeries are placebo surgeries: the mice received the same surgical procedure but without the removal of parts of the nerve, and hence serve as a negative control for the chronic pain.

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

Which groups would you compare to each other?

Please also mention which outcome you would expect to see from each comparison.

In general, I would set up a differential abundance pipeline that identifies the differentially expressed genes between the conditions. Afterwards, I would explore different clusterings/groupings of the transcriptomes and examine the biological relevance of the genes using Gene Set Enrichment Analysis (GSEA).  
The comparisons are oxy vs. sal within both SNI and Sham mice, to capture the response to opioid withdrawal, and SNI-oxy vs. Sham-oxy as well as SNI-sal vs. Sham-sal, to capture how the transcriptional response to opioid withdrawal changes when the mice suffer from chronic pain.  
One would expect to find differentially expressed genes related to neuronal activation, especially of neurons related to reward patterns, when comparing the withdrawal group to the placebo group. Additionally, one would expect that the difference in these signatures is more drastic in the mice that suffer from chronic pain.
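
The four comparisons described above can be written down explicitly before any pipeline is run. The following is a minimal sketch; the column names (`id`, `variable`, `reference`, `target`, `subset`) are a hypothetical layout chosen for illustration and not a format required by any particular tool.

```python
import pandas as pd

# Hypothetical layout: one row per pairwise comparison.
# "subset" names the group of mice within which the comparison is made.
comparisons = pd.DataFrame(
    [
        # effect of oxycodone withdrawal, separately for each surgery group
        {"id": "oxy_vs_sal_in_SNI", "variable": "condition",
         "reference": "sal", "target": "oxy", "subset": "SNI"},
        {"id": "oxy_vs_sal_in_Sham", "variable": "condition",
         "reference": "sal", "target": "oxy", "subset": "Sham"},
        # effect of chronic pain, separately for each treatment group
        {"id": "SNI_vs_Sham_in_oxy", "variable": "genotype",
         "reference": "Sham", "target": "SNI", "subset": "oxy"},
        {"id": "SNI_vs_Sham_in_sal", "variable": "genotype",
         "reference": "Sham", "target": "SNI", "subset": "sal"},
    ]
)

print(comparisons.to_string(index=False))
```

Writing the reference level explicitly (sal and Sham) makes the sign of every log fold change unambiguous later on.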

Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) to get the information you need for each run they uploaded to the SRA.<br>
So, instead of diving directly into downloading the data and starting the analysis, you first need to sort the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results.
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

1. 8 samples per condition  
2. 8 samples per genotype  
3. 4 samples per condition within each genotype

In [3]:
import pandas as pd

# read an Excel file
df = pd.read_excel("/Users/peterbrederlow/Documents/Uni/MasterBioinformatik/SS25/WorkflowCourse/computational-workflows-2025/notebooks/day_02/conditions_runs_oxy_project.xlsx")

# look at the first few rows
print(df.head())

  Patient          Run RNA-seq  DNA-seq condition: Sal Condition: Oxy  \
0       ?  SRR23195505       x      NaN              x            NaN   
1       ?  SRR23195506       x      NaN            NaN              x   
2       ?  SRR23195507       x      NaN              x            NaN   
3       ?  SRR23195508       x      NaN            NaN              x   
4       ?  SRR23195509       x      NaN            NaN              x   

  Genotype: SNI Genotype: Sham  
0             x            NaN  
1           NaN              x  
2           NaN              x  
3             x            NaN  
4             x            NaN  


In [14]:
# encode the "x" marks as 1 and empty cells as 0
df = df.fillna(0).replace("x", 1)

# add the two one-hot columns of each pair (possible values: 0, 1, 2)
df["Sham/Sal"] = df["Genotype: Sham"] + df["condition: Sal"]
df["Oxy/Sal"] = df["Condition: Oxy"] + df["condition: Sal"]

# collapse 0/1/2 to 0/1: 1 if the run carries at least one of the two marks
df["Sham/Sal"] = (df["Sham/Sal"] > 0).astype(int)
df["Oxy/Sal"] = (df["Oxy/Sal"] > 0).astype(int)

# cast every indicator column to int (all columns except "Patient" and "Run")
df = df.astype({col: "int" for col in df.columns if col not in ["Patient", "Run"]})

# sort descending by both indicator columns (runs flagged 1 come first)
df = df.sort_values(by=["Sham/Sal", "Oxy/Sal"], ascending=[False, False])

# keep only the run accession and the two indicator columns
df = df[["Run", "Sham/Sal", "Oxy/Sal"]]

print(df.head())

           Run  Sham/Sal  Oxy/Sal
0  SRR23195505         1        1
2  SRR23195507         1        1
5  SRR23195510         1        1
7  SRR23195512         1        1
8  SRR23195513         1        1


In [None]:
#plot?
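
One way to answer the three questions is to first collapse the one-hot "x" columns into a single `condition` and a single `genotype` column, and then count. The sketch below rebuilds only the five rows printed by the first `df.head()` so that it runs on its own; on the real data the same steps would be applied to the full table read from the Excel file. The names `raw`, `collapse`, `tidy`, and `plot_overview` are illustrative.

```python
import pandas as pd

# First five rows of the sheet as printed above ("x" marks membership).
raw = pd.DataFrame(
    {
        "Run": ["SRR23195505", "SRR23195506", "SRR23195507",
                "SRR23195508", "SRR23195509"],
        "condition: Sal": ["x", None, "x", None, None],
        "Condition: Oxy": [None, "x", None, "x", "x"],
        "Genotype: SNI": ["x", None, None, "x", "x"],
        "Genotype: Sham": [None, "x", "x", None, None],
    }
)


def collapse(frame, columns):
    """Return, for each row, the label of the single column marked with "x"."""
    marks = frame[list(columns)].notna()
    if not (marks.sum(axis=1) == 1).all():
        raise ValueError("each run needs exactly one mark per column group")
    # idxmax gives the name of the marked column; map it to a short label
    return marks.idxmax(axis=1).map(columns)


tidy = pd.DataFrame(
    {
        "Run": raw["Run"],
        "condition": collapse(raw, {"condition: Sal": "sal", "Condition: Oxy": "oxy"}),
        "genotype": collapse(raw, {"Genotype: SNI": "SNI", "Genotype: Sham": "Sham"}),
    }
)

per_condition = tidy["condition"].value_counts()                    # question 1
per_genotype = tidy["genotype"].value_counts()                      # question 2
per_combination = pd.crosstab(tidy["genotype"], tidy["condition"])  # question 3
print(per_combination)


def plot_overview(table):
    """Grouped bar chart of samples per genotype and condition (needs matplotlib)."""
    ax = table.plot(kind="bar", rot=0)
    ax.set_ylabel("number of samples")
    return ax
```

In a notebook, `plot_overview(per_combination)` draws the chart. On the full table, the counts should agree with the answers given above (8 per condition, 8 per genotype, 4 per combination).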

They were also kind enough to provide the number of bases per run, so that you know how much space the data will take up on your cluster.<br>
Add a new column to your fancy table with this information (base_counts.csv) and sort your dataframe according to this information and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [15]:
runs = pd.read_csv("/Users/peterbrederlow/Documents/Uni/MasterBioinformatik/SS25/WorkflowCourse/computational-workflows-2025/notebooks/day_02/base_counts.csv")
print(runs.head())





           Run       Bases
0  SRR23195505  6922564500
1  SRR23195506  7859530800
2  SRR23195507  8063298900
3  SRR23195508  6927786900
4  SRR23195509  7003550100


In [16]:
# attach the base counts to the metadata (left join on the run accession)
df_merged = pd.merge(df, runs, on="Run", how="left")

print(df_merged.head())

           Run  Sham/Sal  Oxy/Sal       Bases
0  SRR23195505         1        1  6922564500
1  SRR23195507         1        1  8063298900
2  SRR23195510         1        1  7377388500
3  SRR23195512         1        1  7462857900
4  SRR23195513         1        1  8099181600
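
A left merge silently leaves `NaN` in `Bases` for any run that is missing from `base_counts.csv`. A minimal sketch of a stricter merge, using the `validate` and `indicator` options of `pd.merge`, is shown below on three rows copied from the output above; the variable names `meta`, `bases`, and `checked` are illustrative.

```python
import pandas as pd

meta = pd.DataFrame({"Run": ["SRR23195505", "SRR23195507", "SRR23195510"]})
bases = pd.DataFrame(
    {
        "Run": ["SRR23195505", "SRR23195507", "SRR23195510"],
        "Bases": [6922564500, 8063298900, 7377388500],
    }
)

# validate="one_to_one" raises if a run accession is duplicated on either side;
# indicator=True adds a "_merge" column telling where each row was found.
checked = pd.merge(meta, bases, on="Run", how="left",
                   validate="one_to_one", indicator=True)

# rows found only in the metadata have no base count
missing = checked.loc[checked["_merge"] != "both", "Run"].tolist()
if missing:
    raise ValueError(f"no base count for runs: {missing}")

checked = checked.drop(columns="_merge")
print(checked)
```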


In [17]:
# sort by base count, smallest runs first
df_sorted = df_merged.sort_values(by="Bases", ascending=True)

# show the runs with the smallest base count
print(df_sorted[["Run", "Bases"]].head())

# the two smallest runs
top2 = df_sorted[["Run"]].head(2)

# save the accessions one per line, without index or header
top2.to_csv("top2_ids.csv", index=False, header=False)

            Run       Bases
14  SRR23195516  6203117700
9   SRR23195511  6456390900
15  SRR23195517  6863840400
0   SRR23195505  6922564500
12  SRR23195508  6927786900
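
The exercise asks for sorting by both the condition and the base count. A minimal sketch of that, using only the five runs whose condition is visible in the first `df.head()` output, is given below; the `condition` labels and the variable names are illustrative. Because only five runs are included, the two smallest runs here differ from the two smallest of the full table printed above.

```python
import pandas as pd

# Five runs with the condition and base count shown in the outputs above.
table = pd.DataFrame(
    {
        "Run": ["SRR23195505", "SRR23195506", "SRR23195507",
                "SRR23195508", "SRR23195509"],
        "condition": ["sal", "oxy", "sal", "oxy", "oxy"],
        "Bases": [6922564500, 7859530800, 8063298900, 6927786900, 7003550100],
    }
)

# sort by condition first, then by base count within each condition
by_condition = table.sort_values(by=["condition", "Bases"]).reset_index(drop=True)
print(by_condition)

# the two smallest runs overall
smallest = table.nsmallest(2, "Bases")

# alternatively, the smallest run of each condition
smallest_per_condition = by_condition.groupby("condition").head(1)
```

Picking one small run per condition instead of the two smallest overall can be convenient when the downloaded test data should cover both treatment groups.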


In [None]:
# download the two smallest runs with nf-core/fetchngs
nextflow run nf-core/fetchngs \
  --input /Users/peterbrederlow/Documents/Uni/MasterBioinformatik/SS25/WorkflowCourse/computational-workflows-2025/notebooks/day_02/top2_ids.csv \
  --outdir "./SRFetch_results" \
  -profile docker
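
The input file for the download holds one accession per line, so a stray header or empty line is worth catching before the pipeline is launched. The following is a small sketch of such a check; the pattern is a simplifying assumption (run accessions starting with SRR, ERR, or DRR followed by digits), and the helper name `invalid_ids` is illustrative.

```python
import re

# Assumed pattern for run accessions: SRR/ERR/DRR followed by digits.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def invalid_ids(lines):
    """Return all non-empty lines that do not look like a run accession."""
    stripped = [line.strip() for line in lines if line.strip()]
    return [entry for entry in stripped if not RUN_ACCESSION.fullmatch(entry)]


# the two smallest runs selected above
ids = ["SRR23195516", "SRR23195511"]
print(invalid_ids(ids))

# a header row that slipped into the file would be reported
print(invalid_ids(["Run", "SRR23195516"]))
```

On the real file, the lines could be read with `open("top2_ids.csv").read().splitlines()` and passed to the same function.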


While your files are downloading, get back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, shout so we can discuss the different ideas.

The nf-core/rnaseq pipeline can be used to align and quality-control the reads. It offers HISAT2, the aligner used in the paper (which counted reads with HTSeq), and its output can then be prepared for differential analysis.
Since the paper used DESeq2, a classic R tool for differential expression, to analyze the transcriptomes, it makes sense to continue with the nf-core/differentialabundance pipeline, which runs DESeq2 for the differential expression analysis (its `deseq2_differential` step) and can also be used to perform the GSEA.