# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

Identify the transcriptomic effects of oxycodone withdrawal on mice with chronic pain.

What do the conditions mean?

oxy: Oxy stands for oxycodone, an opioid. Mice in the oxy cohorts were treated with oxycodone after their surgery and then subjected to withdrawal.


sal: Sal stands for Saline, which was used to provide a control group of mice without oxycodone treatment.

What do the genotypes mean?

SNI: SNI stands for spared nerve injury. SNI mice suffered chronic pain caused by this nerve injury.


Sham: Sham mice had no chronic pain and served as the control group for the effects of chronic pain.

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

On a conceptual level, I would look for similarities and differences between the groups to identify changes in gene expression. In practice, this means comparing the data with a variety of measures, for example a PCA or a clustering based on gene expression, identifying differences in maximum, minimum, mean, or median expression, and checking whether those differences are statistically significant.
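
As a rough illustration of the PCA idea, the sketch below runs a PCA via singular value decomposition on a small synthetic expression matrix. Everything in it is made up for the example: the two groups, the number of genes, and the size of the shift. With the real data, the matrix would come from a quantification step and would need normalization first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic log-expression matrix: 8 samples x 50 genes, two groups of 4 samples.
n_genes = 50
group_a = rng.normal(loc=5.0, scale=0.5, size=(4, n_genes))
group_b = rng.normal(loc=5.0, scale=0.5, size=(4, n_genes))
group_b[:, :10] += 3.0  # pretend the first 10 genes are up-regulated in group B

expr = np.vstack([group_a, group_b])

# PCA: center each gene, then take the SVD of the centered matrix.
centered = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u * s                    # sample coordinates on the principal components
explained = s**2 / np.sum(s**2)   # fraction of variance captured by each component

print("PC1 scores:", np.round(scores[:, 0], 2))
print("variance explained by PC1:", round(float(explained[0]), 2))
```

If the groups differ in expression, the samples should separate along the first component; samples that fall far from their own group are candidates for a closer quality check.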

Which groups would you compare to each other?

Sham-sal to Sham-oxy, to isolate the effects of the treatment; Sham-sal to SNI-sal, to look only at the effects of the chronic pain; SNI-sal to SNI-oxy, to see the effects of treatment on mice suffering chronic pain; Sham-oxy to SNI-oxy, to observe the different effects of treatment on mice that need it versus those that don't; and lastly Sham-sal to SNI-oxy, to see how the effects of treatment and injury interact.
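
Four groups allow six pairwise comparisons, and the answer above names five of them. The following small sketch enumerates all pairs and reports which one is not covered; the group labels are simply the shorthand used in the answer.

```python
from itertools import combinations

groups = ["Sham-sal", "Sham-oxy", "SNI-sal", "SNI-oxy"]

# Comparisons named in the answer above, stored order-independently.
planned = {
    frozenset(("Sham-sal", "Sham-oxy")),
    frozenset(("Sham-sal", "SNI-sal")),
    frozenset(("SNI-sal", "SNI-oxy")),
    frozenset(("Sham-oxy", "SNI-oxy")),
    frozenset(("Sham-sal", "SNI-oxy")),
}

all_pairs = [frozenset(pair) for pair in combinations(groups, 2)]
missing = [sorted(pair) for pair in all_pairs if pair not in planned]

print(len(all_pairs), "possible pairs; not covered:", missing)
```

The uncovered pair changes both treatment and injury at once, like the last comparison in the answer, so leaving it out is a defensible choice.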

Please also mention which outcome you would expect to see from each comparison.

I would expect that the non-control groups show disrupted transcription and impaired behaviour compared to the control groups. However, during active oxycodone treatment these effects should be lessened in SNI mice only to worsen during the withdrawal.

Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) to get the information you need for each run they uploaded to the SRA.<br>
So, instead of diving straight into downloading the data and starting the analysis, you first need to sort the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results:
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [1]:
import pandas as pd

excel_data = "conditions_runs_oxy_project.xlsx"

df = pd.read_excel(excel_data, index_col="Run")


In [2]:
df = df.fillna(False)
df = df.replace("x",True)
df



Run,Patient,RNA-seq,DNA-seq,condition: Sal,Condition: Oxy,Genotype: SNI,Genotype: Sham
SRR23195505,?,True,False,True,False,True,False
SRR23195506,?,True,False,False,True,False,True
SRR23195507,?,True,False,True,False,False,True
SRR23195508,?,True,False,False,True,True,False
SRR23195509,?,True,False,False,True,True,False
SRR23195510,?,True,False,True,False,True,False
SRR23195511,?,True,False,False,True,False,True
SRR23195512,?,True,False,True,False,False,True
SRR23195513,?,True,False,True,False,True,False
SRR23195514,?,True,False,False,True,False,True


In [None]:
# Sample count for Genotype Sham: 8
# Sample count for Genotype SNI: 8
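
The counts for the three questions can be derived from the boolean table. The following is a minimal sketch, assuming the column names shown in the output above and that exactly one flag of each pair is set per row. The helper `tidy_design` and the three-row `demo` frame (the first three rows of the output above) are introduced here for illustration only.

```python
import pandas as pd


def tidy_design(wide: pd.DataFrame) -> pd.DataFrame:
    """Collapse the one-hot flag columns into one 'condition' and one 'genotype' column."""
    tidy = pd.DataFrame(index=wide.index)
    # Assumption: exactly one of the two flags is True in every row.
    tidy["condition"] = wide["condition: Sal"].map({True: "Sal", False: "Oxy"})
    tidy["genotype"] = wide["Genotype: SNI"].map({True: "SNI", False: "Sham"})
    return tidy


# Hypothetical three-row stand-in for the real table.
demo = pd.DataFrame(
    {
        "condition: Sal": [True, False, True],
        "Condition: Oxy": [False, True, False],
        "Genotype: SNI": [True, False, False],
        "Genotype: Sham": [False, True, True],
    },
    index=pd.Index(["SRR23195505", "SRR23195506", "SRR23195507"], name="Run"),
)

tidy = tidy_design(demo)
per_condition = tidy["condition"].value_counts()               # question 1
per_genotype = tidy["genotype"].value_counts()                 # question 2
per_group = pd.crosstab(tidy["genotype"], tidy["condition"])   # question 3
print(per_group)
```

With matplotlib installed, `per_group.plot.bar()` would turn the cross-tabulation into a grouped bar chart for the plotting part of the task.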

They were kind enough to also provide the number of bases per run, so that you can estimate how much space the data will take on your cluster.<br>
Add a new column to your fancy table with this information (base_counts.csv) and sort your dataframe according to this information and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [3]:
base_counts = pd.read_csv("base_counts.csv", index_col="Run")
data = pd.merge(df, base_counts, on="Run")
data

Run,Patient,RNA-seq,DNA-seq,condition: Sal,Condition: Oxy,Genotype: SNI,Genotype: Sham,Bases
SRR23195505,?,True,False,True,False,True,False,6922564500
SRR23195506,?,True,False,False,True,False,True,7859530800
SRR23195507,?,True,False,True,False,False,True,8063298900
SRR23195508,?,True,False,False,True,True,False,6927786900
SRR23195509,?,True,False,False,True,True,False,7003550100
SRR23195510,?,True,False,True,False,True,False,7377388500
SRR23195511,?,True,False,False,True,False,True,6456390900
SRR23195512,?,True,False,True,False,False,True,7462857900
SRR23195513,?,True,False,True,False,True,False,8099181600
SRR23195514,?,True,False,False,True,False,True,7226808600


In [4]:
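# Sort by condition first, then by base count within each condition.
# A single multi-key sort avoids relying on the second sort preserving the first one's order.
data = data.sort_values(["condition: Sal", "Bases"])
data
# The smallest runs are SRR23195516 and SRR23195511 with 6203117700 and 6456390900 bases respectively.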

Run,Patient,RNA-seq,DNA-seq,condition: Sal,Condition: Oxy,Genotype: SNI,Genotype: Sham,Bases
SRR23195516,?,True,False,False,True,True,False,6203117700
SRR23195511,?,True,False,False,True,False,True,6456390900
SRR23195517,?,True,False,False,True,True,False,6863840400
SRR23195508,?,True,False,False,True,True,False,6927786900
SRR23195519,?,True,False,False,True,False,True,6996050100
SRR23195509,?,True,False,False,True,True,False,7003550100
SRR23195514,?,True,False,False,True,False,True,7226808600
SRR23195506,?,True,False,False,True,False,True,7859530800
SRR23195505,?,True,False,True,False,True,False,6922564500
SRR23195510,?,True,False,True,False,True,False,7377388500
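
Selecting the two smallest runs and writing them to an ID file for the download can be sketched as follows. This assumes that fetchngs accepts a plain file with one accession per line and no header. `write_ids` is a hypothetical helper, and the `demo` frame contains only three base counts copied from the table above. The example writes to a temporary directory so it does not overwrite a real `ids.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_ids(table: pd.DataFrame, n: int, path: Path) -> list[str]:
    """Write the accessions of the n smallest runs (by 'Bases') to path, one per line."""
    smallest = table.nsmallest(n, "Bases")
    ids = smallest.index.tolist()
    path.write_text("\n".join(ids) + "\n")
    return ids


# Three rows copied from the sorted table; the index holds the run accessions.
demo = pd.DataFrame(
    {"Bases": [6863840400, 6203117700, 6456390900]},
    index=pd.Index(["SRR23195517", "SRR23195516", "SRR23195511"], name="Run"),
)

ids_path = Path(tempfile.gettempdir()) / "ids_demo.csv"
selected = write_ids(demo, 2, ids_path)
print(selected)
```

Using `nsmallest` on the full table picks the smallest runs regardless of how the table is currently sorted, which is safer than reading the first two rows off a displayed output.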


In [7]:
!nextflow run nf-core/fetchngs --input ids.csv -profile singularity --outdir day2_output --max_memory "4GB"


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/fetchngs` [0;2m[[0;1;36mspecial_angela[0;2m] DSL2 - [36mrevision: [0;36m8ec2d934f9 [master][m
[K
[33mWARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`[39m[K


-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/fetchngs v1.12.0-g8ec2d93[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrevision       : [0;32mmaster[0m
  [0;34mrunName        : [0;32mspecial_angel

While your files are downloading, go back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, shout, so we can discuss the different ideas.

Unfortunately, the paper describes its methods only superficially. We can run the same tools on the same data, but we would have to guess which tool versions, parameters, and options were used. This makes reproducing the analysis very difficult.

In [8]:
!nextflow run nf-core/fetchngs --input ids.csv -profile apptainer --outdir day2_output --max_memory "4GB"


[1m[38;5;232m[48;5;43m N E X T F L O W [0;2m  ~  [mversion 25.04.7[m
[K
Launching[35m `https://github.com/nf-core/fetchngs` [0;2m[[0;1;36msilly_raman[0;2m] DSL2 - [36mrevision: [0;36m8ec2d934f9 [master][m
[K
[33mWARN: Access to undefined parameter `monochromeLogs` -- Initialise it to a default value eg. `params.monochromeLogs = some_value`[39m[K


-[2m----------------------------------------------------[0m-
                                        [0;32m,--.[0;30m/[0;32m,-.[0m
[0;34m        ___     __   __   __   ___     [0;32m/,-._.--~'[0m
[0;34m  |\ | |__  __ /  ` /  \ |__) |__         [0;33m}  {[0m
[0;34m  | \| |       \__, \__/ |  \ |___     [0;32m\`-._,-`-,[0m
                                        [0;32m`._,._,'[0m
[0;35m  nf-core/fetchngs v1.12.0-g8ec2d93[0m
-[2m----------------------------------------------------[0m-
[1mCore Nextflow options[0m
  [0;34mrevision       : [0;32mmaster[0m
  [0;34mrunName        : [0;32msilly_raman[0m
