# Day 2

Today, we will start using nf-core pipelines to find differentially abundant genes in our dataset. 
We are using data from the following paper: https://www.nature.com/articles/s41593-023-01350-3#Sec10

1. Please take some time to read through the paper and understand their approach, hypotheses and goals.

What was the objective of the study?

The goal is to understand the transcriptomic effects of chronic opioid exposure and physical dependence under chronic pain states in the brain reward circuitry (using high doses of oxycodone over a prolonged time).

What do the conditions mean?

oxy: oxycodone (treatment drug)


sal: saline (placebo)

What do the genotypes mean?

SNI: spared nerve injury, the group with chronic nerve pain


Sham: control group without spared nerve injury

Imagine you are the bioinformatician in the group who conducted this study. They hand you the raw files and ask you to analyze them.

What would you do?

Essentially, compare the gene expression profiles of the different groups to identify the differences caused by the nerve injury or the opioid exposure.

Which groups would you compare to each other?
1. SNI-Oxy vs SHAM-Oxy to assess the difference the nerve injury made.
2. SHAM-Sal vs SNI-Oxy because they should differ the most.
3. SNI-Sal vs SNI-Oxy to assess the difference the oxycodone made in addition to the nerve injury.
4. SHAM-Sal vs SHAM-Oxy to assess the difference the oxycodone made in healthy (no nerve pain) mice.

Please also mention which outcome you would expect to see from each comparison.
1. SNI-Oxy mice should show weaker pain relief from the oxycodone than SHAM-Oxy mice, and a stronger reaction to the withdrawal.
2. SNI-Oxy mice should show more pain and sensitivity than SHAM-Sal mice, and a different pattern of gene expression in the brain.
3. SNI-Oxy mice should have reduced nerve pain compared to the SNI-Sal mice, but strong withdrawal effects.
4. SHAM-Oxy mice should be less social and active than SHAM-Sal mice, but recover more quickly from the oxycodone withdrawal than their SNI counterparts.

Your group gave you a very suboptimal Excel sheet (conditions_runs_oxy_project.xlsx) to get the information you need for each run they uploaded to the SRA.<br>
So, instead of diving directly into downloading the data and starting the analysis, you first need to sort the lazy table.<br>
Use Python and Pandas to get the table into a more sensible order.<br>
Then, perform some overview analysis and plot the results:
1. How many samples do you have per condition?
2. How many samples do you have per genotype?
3. How often do you have each condition per genotype?

In [2]:
import pandas as pd

df = pd.read_excel("conditions_runs_oxy_project.xlsx")

# Samples per condition and per genotype: count the non-empty cells in each marker column.
print("Samples with Sal condition: ", df["condition: Sal"].count())
print("Samples with Oxy condition: ", df["Condition: Oxy"].count())
print("Samples with SNI genotype: ", df["Genotype: SNI"].count())
print("Samples with Sham genotype: ", df["Genotype: Sham"].count())

# Samples per genotype/condition combination: rows where both marker columns are filled in.
print("Samples with SNI-Sal:", (df["Genotype: SNI"].notna() & df["condition: Sal"].notna()).sum())
print("Samples with SNI-Oxy:", (df["Genotype: SNI"].notna() & df["Condition: Oxy"].notna()).sum())
print("Samples with Sham-Sal:", (df["Genotype: Sham"].notna() & df["condition: Sal"].notna()).sum())
print("Samples with Sham-Oxy:", (df["Genotype: Sham"].notna() & df["Condition: Oxy"].notna()).sum())


Samples with Sal condition:  8
Samples with Oxy condition:  8
Samples with SNI genotype:  8
Samples with Sham genotype:  8
Samples with SNI-Sal: 4
Samples with SNI-Oxy: 4
Samples with Sham-Sal: 4
Samples with Sham-Oxy: 4
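
The same counts can also be derived from a tidy table with one `condition` and one `genotype` column per run, which is closer to the "more sensible order" the task asks for. This is a minimal sketch, assuming each of the four marker columns holds some non-empty value (here a hypothetical `"x"`) in the rows it applies to. The toy frame and its `Run` IDs are placeholders standing in for the Excel sheet, not the real data, and `marker_to_label` is a helper name invented for this example.

```python
import pandas as pd

# Toy stand-in for the Excel sheet (placeholder run IDs, assumed "x" markers).
raw = pd.DataFrame({
    "Run": ["R1", "R2", "R3", "R4"],
    "condition: Sal": ["x", None, "x", None],
    "Condition: Oxy": [None, "x", None, "x"],
    "Genotype: SNI": ["x", "x", None, None],
    "Genotype: Sham": [None, None, "x", "x"],
})


def marker_to_label(frame, columns):
    """Return, per row, the label of the first marker column that is filled in.

    Assumes every row has at least one of the columns filled in; a row with
    none would silently get the label of the first column.
    """
    labels = {col: col.split(": ")[1] for col in columns}
    filled = frame[columns].notna()
    return filled.idxmax(axis=1).map(labels)


tidy = pd.DataFrame({
    "Run": raw["Run"],
    "condition": marker_to_label(raw, ["condition: Sal", "Condition: Oxy"]),
    "genotype": marker_to_label(raw, ["Genotype: SNI", "Genotype: Sham"]),
})

# Samples per genotype/condition combination.
counts = pd.crosstab(tidy["genotype"], tidy["condition"])
print(counts)
```

With matplotlib installed, `counts.plot(kind="bar")` would draw the grouped bar chart for the third question, and `tidy["condition"].value_counts()` answers the first.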


They were kind enough to also provide you with the number of bases per run, so that you know how much space the data will take up on your cluster.<br>
Add a new column to your fancy table with this information (base_counts.csv) and sort your dataframe according to this information and the condition.

Then select the 2 smallest runs from your dataset and download them from SRA (maybe an nf-core pipeline can help here?...)

In [6]:
df_2 = pd.read_csv("base_counts.csv")

# Look up the base count of each run by its ID; runs missing from the CSV get 0.
bases_by_run = df_2.drop_duplicates("Run", keep="last").set_index("Run")["Bases"]
df["Bases"] = df["Run"].map(bases_by_run).fillna(0).astype("int64")

# Sort by size so that the smallest runs come first.
df.sort_values(by="Bases", inplace=True)
print(df.head(2)["Run"])

11    SRR23195516
6     SRR23195511
Name: Run, dtype: object
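
The lookup, the sort by condition and size, and the ID file for the download can be sketched with a merge. This assumes `base_counts.csv` has the columns `Run` and `Bases` (as the code above implies), that the table already has a single `condition` column, and that nf-core/fetchngs is given a plain list with one accession per line. The `Bases` values, the `condition` labels and the ID `SRR_EXAMPLE` below are made-up placeholders; only the two real run IDs come from the output above.

```python
import os
import tempfile

import pandas as pd

# Placeholder tables: Bases values, conditions and SRR_EXAMPLE are invented for illustration.
runs = pd.DataFrame({
    "Run": ["SRR23195516", "SRR23195511", "SRR_EXAMPLE"],
    "condition": ["Oxy", "Sal", "Sal"],
})
bases = pd.DataFrame({
    "Run": ["SRR23195511", "SRR_EXAMPLE", "SRR23195516"],
    "Bases": [200, 900, 100],
})

# Left merge keeps every run and attaches its base count (NaN if a run is missing).
merged = runs.merge(bases, on="Run", how="left")

# Sort by condition first, then by size within each condition.
by_condition = merged.sort_values(["condition", "Bases"])

# The two smallest runs overall, written one accession per line without a header.
smallest = merged.nsmallest(2, "Bases")["Run"]
ids_path = os.path.join(tempfile.mkdtemp(), "ids.csv")
smallest.to_csv(ids_path, index=False, header=False)

with open(ids_path) as handle:
    ids_text = handle.read()
print(ids_text)
```

In the notebook, the file would be written to `./ids.csv` instead of a temporary directory so that the `--input` argument of the download command can find it.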


In [None]:
!nextflow run nf-core/fetchngs -profile docker --max_memory "6GB" --input ./ids.csv --outdir ./output_ngs/

While your files are downloading, go back to the paper and explain how you would try to reproduce the analysis.<br>
When you are done, give a shout so we can discuss the different ideas.

Given the FASTQ files of the tissue samples from the different groups, compare the groups and identify the genes whose expression is unique to or changed in a group, and by how much. Then try to find what caused these expression changes in the affected mouse groups. For example, the nf-core/rnaseq pipeline could be used to quantify expression, with its DESeq2 output files and plots as the starting point for the differential abundance analysis.