# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

The study investigated how physical dependence and addiction disorders develop when opioid analgesics are misused in pain therapy.
The authors wanted to examine the transcriptomic effects of chronic opioid exposure and physical dependence under chronic pain states in the brain reward circuitry.
A further aim was to identify non-opioid medications that alleviate pain while attenuating withdrawal. To do this, they used a mouse model in which mice first received doses of the opioid oxycodone for a certain time and then underwent spontaneous withdrawal over the following weeks. Two main groups were observed, one with neuropathic pain and one without. For both groups, saline-treated control mice were observed as well.

What do the conditions mean?

oxy: oxycodone (mice with or without spared nerve injury (SNI) were exposed to high doses of oxycodone for 2 weeks, which was then withdrawn spontaneously over the following 3 weeks)


sal: saline (a solution of salt in water; control groups of mice with or without SNI were treated with saline instead of oxycodone)

What do the genotypes mean?

SNI: mice with spared nerve injury


Sham: sham groups do not receive the actual treatment but, for instance, a placebo, and can therefore be used as control groups.

Both groups experience the surgical stress, but only the first group has the SNI.

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

Which groups would you compare to each other?

Please also mention which outcome you would expect to see from each comparison.
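To get an overview of the possible comparisons, the 2x2 design (genotype x condition) can be enumerated programmatically. The following is only a sketch: the group labels are taken from the metadata described above, while the variable names are hypothetical and the choice of which comparisons are biologically meaningful remains up to you.

```python
from itertools import combinations, product

# Factor levels as described in the metadata (SNI/Sham, oxy/sal).
genotypes = ["SNI", "Sham"]
conditions = ["oxy", "sal"]

# The four experimental groups of the 2x2 design.
groups = list(product(genotypes, conditions))

# All pairwise comparisons between the four groups.
pairs = list(combinations(groups, 2))

# Comparisons in which exactly one factor differs are the easiest to interpret,
# because any difference can be attributed to that single factor.
single_factor = [(a, b) for a, b in pairs if (a[0] == b[0]) != (a[1] == b[1])]

for a, b in single_factor:
    print(f"{a[0]}_{a[1]} vs {b[0]}_{b[1]}")
```

The remaining two pairs differ in both genotype and condition, so a difference seen there cannot be assigned to one factor alone.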

Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) with the information you need for each run they uploaded to the SRA.<br>
So, instead of diving directly into downloading the data and starting the analysis, you first need to sort the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results:
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [16]:
import pandas as pd
import openpyxl


# load
df = pd.read_excel('conditions_runs_oxy_project.xlsx')

# fill empty (NaN) cells with False and replace the "x" marks with True
df = df.fillna(False)
df = df.replace("x", True)

# sort
df_sorted = df.sort_values(by=['condition: Sal', 'Condition: Oxy', 'Genotype: SNI', 'Genotype: Sham'])

# display the sorted table as a check
df_sorted

# 1. 8

# the column name contains a colon ("Condition: Oxy"), see the sort above
df["Condition: Oxy"].sum()


# 2.

# 3.
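
One possible way to answer the three counting questions is sketched here. It assumes the column names used in the sorting step (`condition: Sal`, `Condition: Oxy`, `Genotype: SNI`, `Genotype: Sham`) and runs on a small made-up table, so the numbers it produces are not the real sample counts. The names `toy`, `condition` and `genotype` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned table; the values are invented for illustration.
toy = pd.DataFrame({
    "Run": ["run_a", "run_b", "run_c", "run_d", "run_e"],
    "condition: Sal": [True, False, True, False, False],
    "Condition: Oxy": [False, True, False, True, True],
    "Genotype: SNI": [True, True, False, False, True],
    "Genotype: Sham": [False, False, True, True, False],
})

# Collapse each pair of boolean flag columns into one categorical column.
toy["condition"] = np.where(toy["Condition: Oxy"], "Oxy", "Sal")
toy["genotype"] = np.where(toy["Genotype: SNI"], "SNI", "Sham")

# 1. samples per condition
per_condition = toy["condition"].value_counts()

# 2. samples per genotype
per_genotype = toy["genotype"].value_counts()

# 3. how often each condition occurs per genotype
condition_by_genotype = pd.crosstab(toy["genotype"], toy["condition"])

print(per_condition, per_genotype, condition_by_genotype, sep="\n\n")
```

With a single column per factor, each of the three results can be plotted directly, for example with `.plot(kind="bar")` if matplotlib is installed.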

They were so kind to also provide you with the information of the number of bases per run, so that you can know how much space the data will take on your Cluster.<br>
Add a new column to your fancy table with this information (base_counts.csv) and sort your dataframe according to this information and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [14]:
bases_per_run_csv = "base_counts.csv"

bases = pd.read_csv(bases_per_run_csv, index_col = "Run")
bases

| Run | Bases |
| --- | --- |
| SRR23195505 | 6922564500 |
| SRR23195506 | 7859530800 |
| SRR23195507 | 8063298900 |
| SRR23195508 | 6927786900 |
| SRR23195509 | 7003550100 |
| SRR23195510 | 7377388500 |
| SRR23195511 | 6456390900 |
| SRR23195512 | 7462857900 |
| SRR23195513 | 8099181600 |
| SRR23195514 | 7226808600 |


In [15]:
df = df.merge(bases, on="Run")
df

|  | Patient | Run | RNA-seq | DNA-seq | condition: Sal | Condition: Oxy | Genotype: SNI | Genotype: Sham | Bases |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | ? | SRR23195505 | True | False | True | False | True | False | 6922564500 |
| 1 | ? | SRR23195506 | True | False | False | True | False | True | 7859530800 |
| 2 | ? | SRR23195507 | True | False | True | False | False | True | 8063298900 |
| 3 | ? | SRR23195508 | True | False | False | True | True | False | 6927786900 |
| 4 | ? | SRR23195509 | True | False | False | True | True | False | 7003550100 |
| 5 | ? | SRR23195510 | True | False | True | False | True | False | 7377388500 |
| 6 | ? | SRR23195511 | True | False | False | True | False | True | 6456390900 |
| 7 | ? | SRR23195512 | True | False | True | False | False | True | 7462857900 |
| 8 | ? | SRR23195513 | True | False | True | False | True | False | 8099181600 |
| 9 | ? | SRR23195514 | True | False | False | True | False | True | 7226808600 |
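
The sorting and selection step is not shown above, so here is a minimal sketch of it. It rebuilds only the `Run`, `Condition: Oxy` and `Bases` columns from the merged table, sorts by condition and size, and picks the two smallest runs. Assuming the hinted pipeline is nf-core/fetchngs, the selected accessions could be saved one per line in an ID file (called `ids.csv` here, a hypothetical name) and passed to it via `--input`.

```python
import pandas as pd

# Run, condition flag and base count copied from the merged table above.
runs = pd.DataFrame({
    "Run": ["SRR23195505", "SRR23195506", "SRR23195507", "SRR23195508",
            "SRR23195509", "SRR23195510", "SRR23195511", "SRR23195512",
            "SRR23195513", "SRR23195514"],
    "Condition: Oxy": [False, True, False, True, True,
                       False, True, False, False, True],
    "Bases": [6922564500, 7859530800, 8063298900, 6927786900, 7003550100,
              7377388500, 6456390900, 7462857900, 8099181600, 7226808600],
})

# Sort by condition first, then by the number of bases within each condition.
runs_sorted = runs.sort_values(by=["Condition: Oxy", "Bases"])

# Select the two runs with the fewest bases overall.
smallest = runs.nsmallest(2, "Bases")
ids = smallest["Run"].tolist()

# One accession per line; saving this text as ids.csv would give an ID file
# for a call such as (assumed invocation, adapt profile and output directory):
#   nextflow run nf-core/fetchngs --input ids.csv --outdir <OUTDIR> -profile <PROFILE>
ids_text = "\n".join(ids)
print(ids_text)
```

Note that the two smallest runs here are chosen by size alone, so it is worth checking which condition and genotype they belong to before using them for a test comparison.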


While your files are downloading, go back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, shout so we can discuss the different ideas.