This notebook recaps some of the Python programming you have learnt over the last 2 years. Work through it in pairs: one person types (the driver) while the other asks questions or makes suggestions (the navigator) until you reach the answer provided. Switch roles for each exercise. You are not expected to remember everything immediately; use the cheat sheets, ELM and Google to remind yourselves.

# Exercise 1:
When we run an experiment, we give each sample a unique identifier. When these results are stored in a public database, this unique identifier is called an <i>accession</i>. 

To analyse a dataset, we often want to be able to loop through the list of samples, processing each sample in the same way.

The file `data/Schistosoma_mansoni/list_ids.txt` contains a single word on each line, representing a sample accession. Use Python to read the file and make a list of accessions. Make sure you remove whitespace/newline characters using `.strip()`. Print how many accessions there are in the file.

In [None]:
accessions = []

# (1) open the file in read mode

# (2) use a for loop over the lines in the file 

# (3) remove any whitespace/newline characters from the line

# (4) add the new accession to the list

print(f"There are {len(accessions)} accessions in the file")

In [1]:
accessions = []

# (1) open the file in read mode
with open("data/Schistosoma_mansoni/list_ids.txt", "r") as f:
    # (2) use a for loop over the lines in the file 
    for line in f:
        # (3) remove any whitespace/newline characters from the line
        # (4) add the new accession to the list
        accessions.append(line.strip())

print(f"There are {len(accessions)} accessions in the file")

There are 12 accessions in the file


<details>
<summary><i>Hints and tips</i></summary>

    
    Note that there is more than one correct solution. 

    For example, you can open the file directly in step (1), and then you need to close the file again after the end of the loop, or you can use a `with open...` statement, in which case you need to indent the code you want to run while the file is open.

    You can combine steps (3) and (4) into a single line.

</details>
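The two ways of opening a file mentioned in the hints can be sketched side by side. This is only an illustration: the file `example_ids.txt` and the `SAMPLE_*` names are made up and written to a temporary folder so that the code runs without the course data.

```python
import os
import tempfile

# Hypothetical example file (not part of the course data), written to a
# temporary folder so that this sketch runs anywhere.
example_path = os.path.join(tempfile.mkdtemp(), "example_ids.txt")
with open(example_path, "w") as handle:
    handle.write("SAMPLE_A\nSAMPLE_B\n  SAMPLE_C  \n")

# Option A: open the file directly, then close it yourself after the loop
f = open(example_path, "r")
accessions_a = []
for line in f:
    accessions_a.append(line.strip())  # steps (3) and (4) combined in one line
f.close()

# Option B: a with statement closes the file for you when the indented block ends
with open(example_path, "r") as f:
    accessions_b = [line.strip() for line in f]

print(accessions_a)
print(accessions_a == accessions_b)
```

Both options should give the same list. Option B is usually the safer habit, because the file is closed even if something goes wrong inside the loop.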

# Exercise 2:

Tables of data are often stored in a CSV file. We can use `pandas` to read these files. Sometimes a table will have row names, otherwise pandas gives each row a numeric id.

Read the file `data/Schistosoma_mansoni/metadata.csv` into a pandas dataframe, using the first column as the row names. Print out the index.

In [None]:
import pandas as pd

# (1) create a dataframe called metadata from the file

print(metadata.index)

In [2]:
import pandas as pd

# (1) create a dataframe called metadata from the file
metadata = pd.read_csv("data/Schistosoma_mansoni/metadata.csv", index_col=0)

print(metadata.index)

Index(['ERR022872', 'ERR022873', 'ERR022874', 'ERR022875', 'ERR022876',
       'ERR022877', 'ERR022878', 'ERR022879', 'ERR022880', 'ERR022881',
       'ERR022882', 'ERR022883'],
      dtype='object', name='accession')
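
To see the difference between row names and the default numeric ids, it can help to read the same small table both ways. The sketch below uses made-up CSV text held in memory (via `io.StringIO`) rather than the course file, so the accessions and stages shown are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file
csv_text = "accession,stage\nSAMPLE_A,cercarium\nSAMPLE_B,adult\n"

# Without index_col, pandas numbers the rows 0, 1, ...
numbered = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row names
named = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(numbered.index))
print(list(named.index))
```

In the first case `accession` stays an ordinary column; in the second it becomes the index, so rows can be looked up by name with `.loc`.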


# Exercise 3:
Subset the dataframe from Exercise 2 to include only rows where the "stage" column is `cercarium` or `24 hr schistosomulum`. How many rows are there now?

In [None]:
# (1) create a dataframe called metadata_s with the subset of rows

print(f"The new dataframe has {len(metadata_s)} rows")

In [3]:
# (1) create a dataframe called metadata_s with the subset of rows
metadata_s = metadata[metadata["stage"].isin(["cercarium","24 hr schistosomulum"])]

print(f"The new dataframe has {len(metadata_s)} rows")

The new dataframe has 8 rows
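
As with Exercise 1, there is more than one way to write this subset. The sketch below, on a small made-up dataframe (the `SAMPLE_*` names are hypothetical), compares `.isin()` with two `==` tests combined using `|` (or).

```python
import pandas as pd

# Hypothetical toy table, not the course metadata
toy = pd.DataFrame(
    {"stage": ["cercarium", "adult", "24 hr schistosomulum", "cercarium"]},
    index=["SAMPLE_A", "SAMPLE_B", "SAMPLE_C", "SAMPLE_D"],
)

# Approach 1: .isin() tests each value against a list of allowed values
keep = toy["stage"].isin(["cercarium", "24 hr schistosomulum"])
subset_isin = toy[keep]

# Approach 2: two comparisons joined with | (each one needs its own brackets)
subset_or = toy[(toy["stage"] == "cercarium") | (toy["stage"] == "24 hr schistosomulum")]

print(len(subset_isin))
print(subset_isin.equals(subset_or))
```

`.isin()` tends to be easier to read once there are more than two allowed values.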


# Exercise 4:

We often want to combine information from more than one file. Read in a second metadata table from `data/Schistosoma_mansoni/metadata2.tsv`. It has an accession column and a date column. Create a combined dataframe with `accession`, `stage` and `date` columns.

In [None]:
# (1) create a dataframe called metadata2 from the new file

# (2) create a combined dataframe

print(combined_metadata)

In [4]:
# (1) create a dataframe called metadata2 from the new file
metadata2 = pd.read_csv("data/Schistosoma_mansoni/metadata2.tsv", index_col=0, sep="\t")

# (2) create a combined dataframe
combined_metadata = metadata.join(metadata2)

print(combined_metadata)

                          stage        date
accession                                  
ERR022872             cercarium  2025-08-22
ERR022873   platyhelminth adult         NaN
ERR022874   3 hr schistosomulum  2025-08-22
ERR022875             cercarium  2025-08-21
ERR022876   3 hr schistosomulum  2025-08-23
ERR022877             cercarium  2025-08-22
ERR022878             cercarium         NaN
ERR022879   3 hr schistosomulum  2025-08-20
ERR022880  24 hr schistosomulum         NaN
ERR022881  24 hr schistosomulum         NaN
ERR022882  24 hr schistosomulum         NaN
ERR022883  24 hr schistosomulum  2025-08-20


<details>
<summary><i>Hints and tips</i></summary>

    In this example, both dataframes had the same set of row names in the same order. You may want to remind yourselves how you might join the dataframes if this were not the case.

</details>
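
As a reminder for the case raised in the hint, here is a minimal sketch of joining two dataframes whose row names do not match. The tables and `SAMPLE_*` names are made up for illustration. `.join()` matches rows by name rather than by position, and the `how` argument decides which rows are kept.

```python
import pandas as pd

# Two hypothetical tables that share only one row name (SAMPLE_C)
stages = pd.DataFrame(
    {"stage": ["cercarium", "adult", "cercarium"]},
    index=pd.Index(["SAMPLE_A", "SAMPLE_B", "SAMPLE_C"], name="accession"),
)
dates = pd.DataFrame(
    {"date": ["2025-08-21", "2025-08-22"]},
    index=pd.Index(["SAMPLE_C", "SAMPLE_D"], name="accession"),
)

# Default (how="left"): keep every row of stages, NaN where there is no date
left = stages.join(dates)

# how="inner": keep only the row names found in both tables
inner = stages.join(dates, how="inner")

# how="outer": keep the row names found in either table
outer = stages.join(dates, how="outer")

print(left)
print(inner)
print(outer)
```

Printing all three is a quick way to check which rows were kept and where missing values appeared.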

# Exercise 5:

Note that several of the dates in the combined dataframe are missing (shown as `NaN`). Set them to `2025-08-24`.

In [None]:
# (1) fill in empty dates

print(combined_metadata)

In [5]:
# (1) fill in empty dates
combined_metadata.fillna("2025-08-24", inplace=True)

print(combined_metadata)

                          stage        date
accession                                  
ERR022872             cercarium  2025-08-22
ERR022873   platyhelminth adult  2025-08-24
ERR022874   3 hr schistosomulum  2025-08-22
ERR022875             cercarium  2025-08-21
ERR022876   3 hr schistosomulum  2025-08-23
ERR022877             cercarium  2025-08-22
ERR022878             cercarium  2025-08-24
ERR022879   3 hr schistosomulum  2025-08-20
ERR022880  24 hr schistosomulum  2025-08-24
ERR022881  24 hr schistosomulum  2025-08-24
ERR022882  24 hr schistosomulum  2025-08-24
ERR022883  24 hr schistosomulum  2025-08-20
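
Calling `fillna` on a whole dataframe fills missing values in every column. That is fine here because only `date` has gaps, but if other columns could also contain missing values you may prefer to fill one column at a time. A small sketch on a made-up table (the `SAMPLE_*` names and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy table with gaps in both columns
toy = pd.DataFrame(
    {"stage": ["cercarium", None, "adult"], "date": ["2025-08-22", None, None]},
    index=["SAMPLE_A", "SAMPLE_B", "SAMPLE_C"],
)

# Fill only the date column; the missing stage is left alone
toy["date"] = toy["date"].fillna("2025-08-24")

print(toy)
```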


# Exercise 6:

Later in this workshop, we will be working with differential expression analysis. We will need to filter rows based on the values in several columns.

Read in a dataframe from `analysis/Schistosoma_mansoni/cercarium_vs_24h_schistosomulum.full.csv`. This contains differential expression data for a number of genes whose expression may vary between the cercarium and schistosomulum life stages. Filter to the rows where both `padj < 0.05` and `abs(log2FoldChange) > 1`.

Use `head()`, `describe()` and `info()` on the filtered dataframe to get an idea of what it looks like. 

In [None]:
# (1) read in the dataframe - does it have row names?

# (2) create a filtered dataframe 


In [7]:
# (1) read in the dataframe - does it have row names?
df = pd.read_csv("analysis/Schistosoma_mansoni/cercarium_vs_24h_schistosomulum.full.csv", index_col=0)

# (2) create a filtered dataframe 
filtered_df = df[(df.padj < 0.05) & (abs(df.log2FoldChange) > 1)]
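
If the combined condition is hard to read, it can be built up one step at a time. The sketch below does this on a made-up table (the gene names and values are hypothetical, not taken from the real results). Each comparison gives a True/False column, and `&` keeps only the rows where both are True. When writing it on one line, each comparison needs its own brackets because `&` is evaluated before `<` and `>`.

```python
import pandas as pd

# Hypothetical toy results table
toy = pd.DataFrame(
    {
        "log2FoldChange": [2.5, -3.0, 0.4, -1.5],
        "padj": [0.001, 0.2, 0.01, 0.03],
    },
    index=["gene_a", "gene_b", "gene_c", "gene_d"],
)

# Each comparison produces one True/False value per row
significant = toy["padj"] < 0.05
large_change = toy["log2FoldChange"].abs() > 1

# & keeps rows where both conditions are True
toy_filtered = toy[significant & large_change]

print(list(toy_filtered.index))
```

Here `gene_b` is dropped because its adjusted p-value is too high, and `gene_c` because its fold change is too small.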

In [8]:
filtered_df.head()

gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
gene:Smp_000050,2824.491025,-1.787652,0.281826,-6.343098,2.251895e-10,7.984448e-10
gene:Smp_000080,12.055362,-1.99681,0.907545,-2.200231,0.02779048,0.04268445
gene:Smp_000100,26042.364286,2.74156,0.086656,31.637303,1.1337530000000001e-219,1.08296e-216
gene:Smp_000150,5933.919227,1.861954,0.238959,7.791934,6.59914e-15,3.227598e-14
gene:Smp_000160,821.898894,-3.572021,0.229266,-15.580266,9.914077000000001e-55,3.322781e-53


In [9]:
filtered_df.describe()

,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
count,3918.0,3918.0,3918.0,3918.0,3918.0,3918.0
mean,3761.137551,0.095166,0.341424,0.182178,0.0007691458,0.001246876
std,23089.923115,2.739577,0.282832,9.650298,0.003468202,0.005456012
min,1.908885,-10.592026,0.067425,-32.764926,4.548872e-309,2.225074e-308
25%,151.236041,-1.899617,0.183211,-7.450183,8.701742000000001e-27,8.47063e-26
50%,731.102596,-1.014627,0.258149,-2.288461,3.072762e-14,1.436306e-13
75%,2460.575794,1.985204,0.379917,7.709872,1.695995e-07,4.705582e-07
max,834912.545669,12.561855,3.193867,40.996409,0.0320492,0.04871641


In [10]:
filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3918 entries, gene:Smp_000050 to gene:Smp_900110
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   baseMean        3918 non-null   float64
 1   log2FoldChange  3918 non-null   float64
 2   lfcSE           3918 non-null   float64
 3   stat            3918 non-null   float64
 4   pvalue          3918 non-null   float64
 5   padj            3918 non-null   float64
dtypes: float64(6)
memory usage: 214.3+ KB
