# Welcome
In this notebook, you will practice some of the Python programming you have learnt over the last two years. We will also introduce a couple of new commands that will be useful during our analysis.

Work on it in pairs, with one person typing (driver) and the other person asking questions or making suggestions (navigator) until you get to the answer provided. Switch roles after each exercise.

You are not expected to remember it all immediately: use the cheat sheets, ELM or Google to remind yourselves. Lecturers and demonstrators are here to help you, so please ask as many questions as you need.

# Exercise 1:

Tables of data are often stored in a CSV file. We can use `pandas` to read these files as 'dataframes' and work with them. You worked with pandas dataframes in <i>Data Exploration in Biology 2</i> (DExB2) and <i>Variation</i>. If you need a refresher on pandas, you can have a look [here](https://pandas.pydata.org/docs/user_guide/index.html).
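
As a reminder of how reading a CSV works, here is a minimal sketch on made-up data. The accessions (`S001`, ...) and the text in `csv_text` are invented for illustration and are not the contents of the real file; `io.StringIO` simply stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file (assumption, not the course data)
csv_text = """accession,stage
S001,cercarium
S002,adult
S003,24 hr schistosomulum
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.head())  # first rows of the dataframe
print(toy.shape)   # (number of rows, number of columns)
```

With a real file, you would pass its path as the first argument instead of the `StringIO` object.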

<div class="alert alert-block alert-warning">
Use pandas to read the CSV file 'metadata' into a dataframe called 'metadata'. The file is stored in the folder "Schistosoma_mansoni" inside the folder "data".

In [None]:
import pandas as pd

# (1) create a dataframe called metadata from the file
metadata = ...

# (2) have a look at the dataframe


Pandas has given each row a numeric id. However, we would like to use the accession number as the row identifier. In other words, we would like to use the `accession` column as the [index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.index.html).
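
A small sketch of the idea, again on invented data: the `index_col` argument of `read_csv` tells pandas which column to use as row labels. The accessions below are hypothetical.

```python
import io
import pandas as pd

# Made-up CSV content (assumption, not the course data)
csv_text = """accession,stage
S001,cercarium
S002,adult
"""

# index_col=0 uses the first column as the row index
toy_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy_indexed.index)    # the row labels are now the accessions
print(toy_indexed.columns)  # 'accession' is no longer an ordinary column
```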

<div class="alert alert-block alert-warning">
Read in the pandas dataframe again, this time using the first column as the index. Print out the index. 

In [None]:
# (1) create a dataframe called metadata from the file, using the first column as index


# (2) print the index


# Exercise 2:

We often want to focus our analysis on part of a dataframe. You learned how to subset a dataframe in DExB2, week 4, class 8. 
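
As a refresher, one common way to subset is to build a column of `True`/`False` values (a boolean mask) and use it to select rows. The sketch below uses a made-up dataframe with hypothetical accessions.

```python
import pandas as pd

# Made-up dataframe (assumption, not the course data)
toy = pd.DataFrame(
    {"stage": ["cercarium", "adult", "24 hr schistosomulum", "adult"]},
    index=["S001", "S002", "S003", "S004"],
)

# isin returns True for every row whose value is in the list
wanted = ["cercarium", "24 hr schistosomulum"]
mask = toy["stage"].isin(wanted)

# Keep only the rows where the mask is True
toy_subset = toy[mask]

print(toy_subset)
print(len(toy_subset))  # number of rows kept
```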

<div class="alert alert-block alert-warning">
Subset the dataframe from Exercise 1 to include only rows where the "stage" column is `cercarium` or `24 hr schistosomulum`. How many rows are there now?

<details>
<summary><i>Hints and tips</i></summary>

You can use `isin` to subset the dataframe.

</details>

In [None]:
# (1) create a dataframe called metadata_s with only the subset of rows

# (2) how many rows are there now?

# Exercise 3:

We often want to combine information from more than one file. You learned how to combine datasets in DExB2, class 7.

We have a second metadata table called "metadata2.tsv" stored in data/Schistosoma_mansoni. This table has an accession column, a column with the date of collection, a column with the ug of RNA obtained, and a column with the purity of the RNA obtained. We want to combine this with the metadata table from Exercise 1.
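
The sketch below illustrates the two ingredients on invented data: reading a tab-separated file with `sep="\t"`, and combining two dataframes that share an index with `join`. The accessions, dates and variable names are hypothetical.

```python
import io
import pandas as pd

# First made-up table, indexed by accession (assumption, not the course data)
left = pd.DataFrame({"stage": ["cercarium", "adult"]}, index=["S001", "S002"])
left.index.name = "accession"

# Second made-up table in TSV format; note the rows are in a different order
tsv_text = "accession\tdate\nS002\t2025-01-10\nS001\t2025-01-09\n"
right = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col=0)

# join matches rows by index label, not by position
combined = left.join(right)

print(combined)
```

Because `join` aligns on the index labels, each accession receives its own date even though the two tables list the rows in a different order.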

<div class="alert alert-block alert-warning">
Create a combined dataframe with `accession`, `stage` and `date` columns.

<details>
<summary><i>Hints and tips</i></summary>

You can use the method `join` to combine both dataframes.

</details>

In [None]:
# (1) create a dataframe called metadata2 from the new file

# (2) combine the dataframes metadata and metadata2

# (3) print the combined dataframe


<details>
<summary><i>Hints and tips</i></summary>

In this example, both dataframes had the same set of row names in the same order. You may want to remind yourselves how you might join the dataframes if this were not the case.

</details>

# Exercise 4:

Note that several of the dates in the combined dataframe are missing and appear as `NaN`. We want to set those `NaN` entries to an actual value. You learned how to do this in DExB2 week 2, class 4.
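
A minimal sketch of replacing missing values, using a made-up column and hypothetical accessions rather than the course data:

```python
import numpy as np
import pandas as pd

# Made-up dataframe with one missing date (assumption, not the course data)
toy = pd.DataFrame(
    {"date": ["2025-01-09", np.nan, "2025-01-11"]},
    index=["S001", "S002", "S003"],
)

print(toy["date"].isna().sum())  # how many values are missing before filling

# fillna returns a new column in which every NaN is replaced by the given value
toy["date"] = toy["date"].fillna("2025-08-24")

print(toy)
```

Note that `fillna` returns a new object, so the result has to be assigned back for the change to be kept.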

<div class="alert alert-block alert-warning">
    
Set the `NaN` values to `2025-08-24`

<details>
<summary><i>Hints and tips</i></summary>

You can use the method `fillna`.
</details> 

In [None]:
# (1) fill in empty dates

# (2) print the result
print(combined_metadata)

# Exercise 5:

Sometimes, we need to filter the rows of a dataset based on the values in several columns. You learned how to do that in DExB2 week 4, class 8. 
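
Here is a sketch of filtering on two columns at once. The column names `rna_ug` and `purity`, the accessions and the values are all invented for illustration; the real table may use different names.

```python
import pandas as pd

# Made-up dataframe (assumption, not the course data)
toy = pd.DataFrame(
    {"rna_ug": [60, 40, 75, 51], "purity": [85, 90, 70, 80]},
    index=["S001", "S002", "S003", "S004"],
)

# Each condition gives a True/False column; & keeps rows where both are True.
# The parentheses around each condition are required.
keep = (toy["rna_ug"] > 50) & (toy["purity"] >= 80)
toy_filtered = toy[keep]

print(toy_filtered)
```

Note the difference between `>` (strictly more than) and `>=` (this value or more).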

<div class="alert alert-block alert-warning">
    
Filter the dataframe from the previous exercise (combined_metadata) so that it only contains samples with more than 50 ug of RNA and purity of 80 or more.

<details>
<summary><i>Hints and tips</i></summary>

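Use the syntax `dataframe[(dataframe[column1] condition1) & (dataframe[column2] condition2)]`, where each condition is a comparison such as `> 50`. Each condition needs its own pair of parentheses.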
</details> 

In [None]:
# (1) create a filtered dataframe


# (2) print the result
print(filtered_metadata)

Sometimes, it is also useful to get an overview of the values in a dataframe. You learned how to do this in DExB2 class 1.
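
As an illustration on invented numbers (the column names `rna_ug` and `purity` are hypothetical), a summary table can be produced and then queried like any other dataframe:

```python
import pandas as pd

# Made-up numeric dataframe (assumption, not the course data)
toy = pd.DataFrame({"rna_ug": [60, 40, 75, 51], "purity": [85, 90, 70, 80]})

# describe returns a dataframe with count, mean, std, min, quartiles and max
summary = toy.describe()

print(summary)
print(summary.loc["mean", "rna_ug"])  # pick out a single statistic
```

In a dataframe that mixes numeric and text columns, `describe()` summarises only the numeric ones by default; `describe(include="all")` also reports on the others.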

<div class="alert alert-block alert-warning">

Get descriptive statistics of the dataframe

<details>
<summary><i>Hints and tips</i></summary>

Use `.describe()`
</details> 

In [None]:
# get descriptive statistics


# Exercise 6:
When we run an experiment, we give each sample a unique identifier. When these results are stored in a public database, this unique identifier is called an <i>accession</i>. 

In our example, the accessions are stored in the file `data/Schistosoma_mansoni/list_ids.txt`. This file contains a single word on each line, representing a sample accession. We want to use Python to read the file and make a list of accessions.

This is a repetitive process: for each line in the file, we want python to (i) read the line, (ii) remove whitespace/newline characters and (iii) add the accession number to a list. We use `for` loops to perform this kind of repetitive action. You learned about `for` loops in <i>Variation</i>, coding practical notebook 3. 
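
The three steps above can be sketched on a small throwaway file. Everything here is invented for illustration: the file name `toy_ids.txt`, the temporary folder and the accessions are not part of the course data.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Write a small made-up file, one accession per line (one with extra spaces)
    path = os.path.join(tmpdir, "toy_ids.txt")
    with open(path, "w") as handle:
        handle.write("S001\nS002\n  S003  \n")

    toy_ids = []
    with open(path, "r") as handle:
        # (i) the loop reads the file one line at a time
        for line in handle:
            # (ii) strip removes spaces and the newline at both ends
            cleaned = line.strip()
            # (iii) append adds the cleaned text to the end of the list
            toy_ids.append(cleaned)

print(toy_ids)
print(len(toy_ids))
```

Without `.strip()`, every item in the list would keep its trailing newline character.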



<div class="alert alert-block alert-warning">
    
Complete the `...` in the code below to make the list of accessions and print how many accessions there are in the file. 

We have provided `.append` to add the accession number to the list, and `.strip()` to remove whitespace/newline characters. 

<details>
<summary><i>Hints and tips</i></summary>

    
Note that there is more than one way to obtain the solution.

For example, you can open the file directly in step (2), in which case you need to close the file again after the end of the loop. Alternatively, you can use a `with open...` statement, in which case you need to indent the code you want to run while the file is open.

</details>

In [None]:
# (1) Create a new list called accessions. The list will be empty for now.
accessions = []

# (2) open the file in read mode
with open("...", "r") as file:
    # (3) use a for loop over the lines in the file 
    for ... :
        # (4) remove any whitespace/newline characters from the line
        # (5) add the new accession to the list
        accessions.append(line.strip())
# (6) have a look at the list to check it all worked well
...