# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

The study examines how oxycodone treatment and the subsequent withdrawal affect mice with or without chronic pain. There are four main groups of mice, combining chronic pain (SNI) vs. control (Sham) with treatment (Oxy) vs. control (Sal). The authors compare weight, behavior and especially gene expression in tissue from different brain regions to identify differences at the transcriptomic level.

What do the conditions mean?

oxy: these groups of mice were treated with oxycodone, an opioid.

sal: the control groups received sterile saline instead.

What do the genotypes mean?

SNI: stands for spared nerve injury. These mice underwent a surgery that induces chronic nerve pain.

Sham: the control group. These mice went through the same treatment and surgery steps, but the final nerve injury was not introduced, so they should not develop chronic pain.

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

Which groups would you compare to each other?

Please also mention which outcome you would expect to see from each comparison.

If I were asked to analyze the raw files from this study, I would first:
- check which experiments I have and which files belong to which experiment
- check how the time points are distributed (is every analysis time point a separate table?)
- get an overview of the data
    
I would compare the groups between which I expect to find differences:
- SNI treated vs. SNI untreated: to determine the effect of the oxycodone treatment in the presence of chronic pain.
- Sham treated vs. Sham untreated: to determine the effect of the treatment without pain.
- SNI treated vs. Sham treated: to determine how chronic pain changes the response to the treatment.

It would also be possible to take untreated Sham as the baseline and compare all other groups to it. However, I think some differentially expressed genes could be missed that way, for example when a gene is up-regulated in one condition and down-regulated in the other.
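
The comparisons listed above could be written down explicitly before any analysis is run. The following is only a sketch: the group labels such as `SNI_Oxy` are hypothetical names chosen here, not identifiers from the paper or the spreadsheet.

```python
from itertools import product

# Hypothetical group labels: genotype (SNI/Sham) combined with condition (Oxy/Sal).
genotypes = ["SNI", "Sham"]
conditions = ["Oxy", "Sal"]
groups = [f"{g}_{c}" for g, c in product(genotypes, conditions)]

# Each contrast is (test group, reference group, question it addresses).
contrasts = [
    ("SNI_Oxy", "SNI_Sal", "treatment effect under chronic pain"),
    ("Sham_Oxy", "Sham_Sal", "treatment effect without pain"),
    ("SNI_Oxy", "Sham_Oxy", "effect of pain among treated mice"),
]

# Guard against typos: every contrast must refer to a defined group.
for test, reference, _ in contrasts:
    if test not in groups or reference not in groups:
        raise ValueError(f"unknown group in contrast {test} vs {reference}")

for test, reference, question in contrasts:
    print(f"{test} vs {reference}: {question}")
```

Keeping the contrasts in one list makes it easy to review them with the group before any pipeline is started.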

Your group gave you a very suboptimal excel sheet (conditions_runs_oxy_project.xlsx) to get the information you need for each run they uploaded to the SRA.<br>
So, instead of directly diving into downloading the data and starting the analysis, you first need to sort the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [1]:
import pandas as pd

metadata = pd.read_excel("conditions_runs_oxy_project.xlsx", index_col="Run")
metadata

Run,Patient,RNA-seq,DNA-seq,condition: Sal,Condition: Oxy,Genotype: SNI,Genotype: Sham
SRR23195505,?,x,,x,,x,
SRR23195506,?,x,,,x,,x
SRR23195507,?,x,,x,,,x
SRR23195508,?,x,,,x,x,
SRR23195509,?,x,,,x,x,
SRR23195510,?,x,,x,,x,
SRR23195511,?,x,,,x,,x
SRR23195512,?,x,,x,,,x
SRR23195513,?,x,,x,,x,
SRR23195514,?,x,,,x,,x


In [2]:
# How many samples do you have per condition?
numSal = metadata["condition: Sal"].notna().sum()
numOxy = metadata["Condition: Oxy"].notna().sum()
print("Samples for the Sal condition:", numSal)
print("Samples for the Oxy condition:", numOxy)

# How many samples do you have per genotype?
numSni = metadata["Genotype: SNI"].notna().sum()
numSham = metadata["Genotype: Sham"].notna().sum()
print("Samples for the SNI genotype:", numSni)
print("Samples for the Sham genotype:", numSham)

# How often do you have each condition per genotype?
numSniSal = (metadata["Genotype: SNI"].notna() & metadata["condition: Sal"].notna()).sum()
numShamSal = (metadata["Genotype: Sham"].notna() & metadata["condition: Sal"].notna()).sum()
numSniOxy = (metadata["Genotype: SNI"].notna() & metadata["Condition: Oxy"].notna()).sum()
numShamOxy = (metadata["Genotype: Sham"].notna() & metadata["Condition: Oxy"].notna()).sum()
print("Samples for the SNI genotype with saline treatment:", numSniSal)
print("Samples for the Sham genotype with saline treatment:", numShamSal)
print("Samples for the SNI genotype with oxycodone treatment:", numSniOxy)
print("Samples for the Sham genotype with oxycodone treatment:", numShamOxy)

Samples for the Sal condition: 8
Samples for the Oxy condition: 8
Samples for the SNI genotype: 8
Samples for the Sham genotype: 8
Samples for the SNI genotype with saline treatment: 4
Samples for the Sham genotype with saline treatment: 4
Samples for the SNI genotype with oxycodone treatment: 4
Samples for the Sham genotype with oxycodone treatment: 4
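
The same counts can also be obtained from a tidy version of the table, which is closer to the "more sensible order" the task asks for. The sketch below uses a four-run toy table built in memory (the first four rows shown in the output of cell 1) instead of the real spreadsheet; the helper `collapse` and the new column names `condition` and `genotype` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the spreadsheet: an "x" marks the column that applies,
# all other cells are empty (None / NaN).
wide = pd.DataFrame(
    {
        "condition: Sal": ["x", None, "x", None],
        "Condition: Oxy": [None, "x", None, "x"],
        "Genotype: SNI": ["x", None, None, "x"],
        "Genotype: Sham": [None, "x", "x", None],
    },
    index=pd.Index(
        ["SRR23195505", "SRR23195506", "SRR23195507", "SRR23195508"], name="Run"
    ),
)


def collapse(df, columns):
    """Return, per run, the label of the single marked column.

    `columns` maps original column names to the short label to report.
    """
    marks = df[list(columns)].notna().rename(columns=columns)
    # Every run must carry exactly one mark; fail loudly otherwise.
    if not (marks.sum(axis=1) == 1).all():
        raise ValueError("each run needs exactly one mark among: " + ", ".join(columns))
    return marks.idxmax(axis=1)


tidy = pd.DataFrame(
    {
        "condition": collapse(wide, {"condition: Sal": "Sal", "Condition: Oxy": "Oxy"}),
        "genotype": collapse(wide, {"Genotype: SNI": "SNI", "Genotype: Sham": "Sham"}),
    }
)

# One table answers all three questions: cells are condition-per-genotype
# counts, row sums are per genotype, column sums are per condition.
counts = pd.crosstab(tidy["genotype"], tidy["condition"])
print(tidy)
print(counts)
```

On the real `metadata` frame the same two `collapse` calls should apply unchanged, and `counts.plot(kind="bar")` would draw the grouped bar chart the task asks for, provided matplotlib is installed.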


They were kind enough to also provide you with the number of bases per run, so that you know how much space the data will take on your cluster.<br>
Add a new column to your fancy table with this information (base_counts.csv) and sort your dataframe according to this information and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [11]:
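# Load the base counts; assigning the column aligns the values on the shared "Run" index
counts = pd.read_csv("base_counts.csv", index_col="Run")
metadata["Bases"] = counts["Bases"]
# Smallest runs first (named sorted_metadata to avoid shadowing the built-in sorted())
sorted_metadata = metadata.sort_values("Bases", ascending=True)

In [12]:
# Prepare csv file: one run ID per line, no header, for the two smallest runs
id_csv = pd.DataFrame(sorted_metadata.index[0:2])
id_csv.to_csv("ids_selfmade.csv", header=False, index=False)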
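
Before an ID file is handed to the pipeline, a quick format check can catch stray headers or typos. This is a minimal sketch and not part of nf-core/fetchngs: the regular expression only covers run accessions of the form SRR/ERR/DRR followed by digits, which is an assumption that fits the runs in this project, and the function name `invalid_ids` is made up for this example. The two IDs used below are simply taken from the table above, not the two smallest runs.

```python
import re

# Assumption: only run accessions (SRR/ERR/DRR + digits) are expected in the file.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def invalid_ids(lines):
    """Return all non-empty entries that do not look like a run accession."""
    cleaned = (line.strip() for line in lines)
    return [entry for entry in cleaned if entry and not RUN_ACCESSION.match(entry)]


# Example content of an ID file, one entry per line.
example_lines = ["SRR23195505\n", "SRR23195506\n", "Run\n", "\n"]
bad = invalid_ids(example_lines)
print("Entries that need fixing:", bad)
```

With a real file, the lines could be read via `open("ids_selfmade.csv")` and passed to the same function.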

In [2]:
# Run pipeline
!nextflow run nf-core/fetchngs -profile docker --input ids.csv --outdir test --max_memory "5GB"

Nextflow 25.04.7 is available - Please consider updating your version to it

 N E X T F L O W   ~  version 25.04.0

Launching `https://github.com/nf-core/fetchngs` [amazing_khorana] DSL2 - revision: 8ec2d934f9 [master]

WARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`


------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/fetchngs v1.12.0-g8ec2d93
------------------------------------------------------
Core Nextflow options

While your files are downloading, get back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, shout so we can discuss the different ideas.

The idea is to go through the Methods section of the paper and try to reproduce the analysis from it. Because that section does not clearly state the package versions and similar details, this might be difficult.
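
To avoid the same problem in our own analysis, the versions of the tools we use could be recorded alongside the results. The sketch below is one possible way to do this with the Python standard library; the function name `collect_versions` and the package list are illustrative assumptions, and tools run outside Python (for example Nextflow itself) would need to be recorded separately.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def collect_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    found = {"python": platform.python_version()}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found


# Illustrative package list; adjust it to whatever the notebook actually imports.
versions = collect_versions(["pandas", "openpyxl"])
for name, ver in versions.items():
    print(f"{name}=={ver}")
```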