# Data Science Competition 2023

This file is here to help you get started with the dataset. 

## Reminders

- The dataset is part of an ongoing, real-world research project, so we kindly ask you not to publish it online or share it with anyone outside this event. Thank you for your understanding!
- You're encouraged to read the instructions (`competition-instructions.pdf`) and the dataset description (`dataset_description.pdf`) first, and then this notebook.
- If you are handling too many datapoints at a time, you can always:
  - ignore the "time series" aspect of the data (e.g. look at a specific time step, or look at the abundance regardless of the time step).
  - look at certain taxonomic ranks (e.g. "R") or sub-ranks (e.g. "R1").
  - look at a specific aquaculture system, replicate, or treatment.

### Delivery

In research, everything should be reproducible, so the biology lab wants to receive your source code. However, the biology researchers are not programming experts, so they expect you to share your results and explain your methodology in a way that does not require running your code. Rather than a plain Python file (`.py`), they would therefore prefer a **Jupyter notebook** (`.ipynb`) file (using Python or R, and potentially Google Colab) **with the output of each cell visible and comments on your results and methodology throughout the notebook**. If you can only provide a Python/R script without a notebook, they will also need a report (as a `.pdf` file) explaining your general approach and presenting your results with comments.

Once all the deliveries are received, the lab will give you feedback! There will be winners in each category. The category of a group is defined as the highest category among its members (e.g. a group of 2 bachelor and 1 master students is in the master category).

The plan is to **announce the winners of the competition on Friday, November 17, from 14:00 in the [Big Auditorium](https://link.mazemap.com/xG4u7s0W)**. We will keep you updated on that.

**What to submit**:

A `.zip` file containing:

- a jupyter notebook (`.ipynb`) (+ optionally a `.pdf`)
- **OR** a python/R script + pdf.

**Where to submit**: by email to `natacha.galmiche@uib.no`. In the filename of your submission, please include:

- Your category ("bachelor", "master" or "PhD")
- Your group number as specified in the `groups.xlsx` file (e.g. "Gr1") or your name (e.g. "Natacha_Galmiche")
- Example `PhD-Natacha_Galmiche.zip` or `Master-Gr12.zip`.

**Deadline**: Sunday, 5th November at 23:59.


In [1]:
import pandas as pd

## Read the files

In [2]:
df = pd.read_csv("./abundance_table.csv")
df

Unnamed: 0,Scientific Name,Taxonomic Rank,A0,A1_1A,A1_1B,A1_1C,A1_2A,A1_2B,A1_2C,A1_3A,...,B9_1A,B9_1B,B9_1C,B9_2A,B9_2B,B9_2C,B9_3A,B9_3B,B9_3BR,B9_3C
0,...,S,0.00,0.05,0.05,0.05,0.05,0.07,0.08,0.05,...,0.04,0.04,0.04,0.03,0.03,0.03,0.02,0.02,0.03,0.03
1,...,G,0.00,0.05,0.05,0.05,0.05,0.07,0.08,0.05,...,0.04,0.04,0.04,0.03,0.03,0.03,0.02,0.02,0.03,0.03
2,...,F1,0.00,0.05,0.05,0.05,0.05,0.07,0.08,0.05,...,0.04,0.04,0.04,0.03,0.03,0.03,0.02,0.02,0.03,0.03
3,...,F,0.00,0.05,0.05,0.05,0.05,0.07,0.08,0.05,...,0.04,0.04,0.04,0.03,0.03,0.03,0.02,0.02,0.03,0.03
4,...,O4,0.00,0.05,0.05,0.05,0.05,0.07,0.08,0.05,...,0.04,0.04,0.04,0.03,0.03,0.03,0.02,0.02,0.03,0.03
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23488,Viruses,D,0.00,0.01,0.01,0.01,0.01,0.01,0.01,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01
23489,cellular organisms,R1,99.32,33.81,29.53,31.56,27.60,26.84,26.21,16.60,...,20.08,20.36,19.89,18.19,17.87,18.28,17.07,17.70,17.47,17.28
23490,other entries,R1,0.00,0.00,0.01,0.00,0.00,0.00,0.00,0.01,...,0.01,0.01,0.00,0.00,0.00,0.00,0.01,0.01,0.00,0.00
23491,root,R,99.93,33.83,29.55,31.59,27.63,26.86,26.23,16.63,...,20.11,20.39,19.91,18.21,17.89,18.30,17.10,17.72,17.49,17.30


In [60]:
# Collect the positions of all rows whose rank contains "D" (D, D1, D2, ...)
idxs = []
for i, rank in enumerate(df["Taxonomic Rank"]):
    if "D" in rank:
        idxs.append(i)

In [61]:
df.iloc[idxs]["A2_1C"].sum()

28.900000000000002
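
The loop above can also be written with a boolean mask, which is the more usual pandas idiom. The sketch below uses a small made-up table (`toy`, with invented abundance values) rather than the real file, so the numbers are for illustration only. It also shows the difference between matching every rank that contains "D" and matching the exact rank "D": if sub-rank rows are nested inside rank rows, summing over both may count the same abundance more than once, so it is worth checking `dataset_description.pdf` before interpreting such a sum.

```python
import pandas as pd

# Toy stand-in for the abundance table (values are made up for illustration)
toy = pd.DataFrame({
    "Scientific Name": ["Bacteria", "Viruses", "Varidnaviria", "root"],
    "Taxonomic Rank": ["D", "D", "D1", "R"],
    "A2_1C": [20.0, 0.5, 0.25, 99.0],
})

# Boolean mask: True for rows whose rank contains "D" (D, D1, D2, ...)
contains_d = toy["Taxonomic Rank"].str.contains("D", regex=False)

# Sum one sample column over the selected rows (same idea as the loop)
total_all_d = toy.loc[contains_d, "A2_1C"].sum()

# Restrict to the exact rank "D" to leave sub-ranks out of the sum
total_exact_d = toy.loc[toy["Taxonomic Rank"] == "D", "A2_1C"].sum()

print(total_all_d, total_exact_d)
```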

In [3]:
df_meta = pd.read_csv("./env_parameter_sample.csv")
df_meta

Unnamed: 0,time,TAN_DF,pH_DF,NO2_DF,Alkalinity_DF,TAN_DGS,pH_DGS,NO2_DGS,Alkalinity_DGS,Flow_rate,Treatment,Module,TAN_removal_biocarrier,co2_mgl,h2s_ugl,o2_mgl,o2_sat,salinity,temp,sample_name
0,29/08/2022 08:30,0.240,7.880,0.64,342.40,0.180,7.860,0.71,344.00,60.00,200,12,0.000034,7.03,0.16,7.79,83.78,14.88,15.1,B1_1
1,29/08/2022 08:31,0.280,8.070,0.64,433.40,0.220,8.020,0.55,433.90,60.00,200,1,0.000038,4.93,0.32,8.41,89.33,13.02,15.1,A1_1
2,02/09/2022 08:30,0.220,7.650,0.51,219.80,0.150,8.031,0.76,211.00,60.00,200,12,0.000040,7.99,0.17,7.82,85.07,16.38,15.1,No_sample
3,02/09/2022 08:31,0.200,7.800,0.51,236.00,0.070,8.030,0.53,236.30,60.00,200,1,0.000083,6.76,0.26,8.21,87.59,13.40,15.1,No_sample
4,05/09/2022 08:30,0.160,7.600,0.46,234.00,0.120,7.900,0.62,234.00,60.00,200,12,0.000023,8.76,0.13,7.57,81.98,16.36,15.3,B1_2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119,11/01/2023 08:31,0.255,7.128,0.06,72.29,0.135,7.411,0.05,73.07,54.55,70,1,0.000081,8.33,0.26,8.60,96.42,14.82,15.1,No_sample
120,12/01/2023 08:30,0.220,7.553,0.08,184.10,0.130,7.771,0.06,185.80,62.75,200,12,0.000070,7.66,0.13,8.28,94.29,17.28,15.3,No_sample
121,12/01/2023 08:31,0.225,7.226,0.08,74.13,0.125,7.298,0.05,72.83,57.25,70,1,0.000071,7.89,0.23,8.56,95.37,14.73,15.0,No_sample
122,16/01/2023 08:30,0.250,7.564,0.08,192.60,0.165,7.688,0.07,205.90,73.65,200,12,0.000078,7.49,0.18,8.24,93.87,17.21,15.2,B9_3


In [39]:
df_meta["sample_name"].value_counts()

No_sample    84
B1_1          1
B8_1          1
B6_1          1
A6_1          1
B6_2          1
A6_2          1
B7_1          1
A7_1          1
B7_2          1
A7_2          1
A8_1          1
B5_2          1
B8_2          1
A8_2          1
B9_1          1
A9_1          1
B9_2          1
A9_2          1
B9_3          1
A5_2          1
A5_1          1
A1_1          1
A2_2          1
B1_2          1
A1_2          1
B1_3          1
A1_3          1
B2_1          1
A2_1          1
B2_2          1
B3_1          1
B5_1          1
A3_1          1
B3_2          1
A3_2          1
B4_1          1
A4_1          1
B4_2          1
A4_2          1
A9_3          1
Name: sample_name, dtype: int64

In [8]:
df.columns[2:]

Index(['A0', 'A1_1A', 'A1_1B', 'A1_1C', 'A1_2A', 'A1_2B', 'A1_2C', 'A1_3A',
       'A1_3B', 'A1_3C',
       ...
       'B9_1A', 'B9_1B', 'B9_1C', 'B9_2A', 'B9_2B', 'B9_2C', 'B9_3A', 'B9_3B',
       'B9_3BR', 'B9_3C'],
      dtype='object', length=123)

In [14]:
set_meta = set(df_meta["sample_name"])
set_data = set(df.columns[2:])

In [18]:
for d in set_data:
    if d not in set_meta:
        print(d)

A6_1A
A4_2C
B7_1C
B9_1A
B8_2A
B8_1B
A3_1B
B1_1C
B8_1A
B7_2A
B8_1C
A7_2B
A3_1C
A1_1B
A1_2B
B8_2B
B5_1B
A8_1B
B9_1B
A8_2C
A1_3A
B1_1A
A7_1B
A7_2C
A6_2A
B6_1C
A9_3B
B7_2C
B4_1B
B3_1C
A9_1B
B9_3C
B5_2B
A8_1A
B6_2B
A9_2B
A5_2C
B9_3BR
B2_2C
B3_2C
A6_2B
B5_2A
A7_1A
A2_2C
B9_2C
B3_2B
B8_2C
B3_2A
A6_2C
B4_2B
B6_1A
A5_1B
B7_1B
A9_3A
A7_1C
A8_2B
B1_3B
A2_2B
A8_2A
B6_2A
A1_1A
B7_1A
A3_2B
B2_2A
B1_3C
B6_1B
A1_2A
A9_1A
A9_1C
A3_2C
B9_2B
B3_1A
A9_3AR
B1_2C
B5_2C
B3_1B
A7_2A
B9_3B
B5_1C
B2_2B
A2_1A
B2_1C
B4_1C
A1_2C
A8_1C
B1_3A
B9_3A
A6_1C
A4_1C
B2_1A
A5_1C
B7_2B
B5_1A
B4_2C
A9_2A
B1_2B
A3_2A
B2_1B
B9_2A
B4_2A
A5_1A
A3_1A
A0
B1_2A
B6_2C
A4_1A
A9_3C
A1_3C
B4_1A
B1_1B
A2_1C
B9_1C
A2_1B
A9_2C
A4_2B
A1_1C
A5_2B
A6_1B
A5_2A
A2_2A
A4_1B
A4_2A
A1_3B
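
Every sample column is printed because the abundance table names each replicate (e.g. `A1_1A`), while the metadata only names the sampling event (e.g. `A1_1`). One possible way to link the two, sketched below on a handful of column names, is to keep the part of the name before the replicate letter. The regular expression and the variable names (`sample_columns`, `meta_key`, `mapping`) are assumptions made for this illustration, not part of the provided material.

```python
import pandas as pd

# A few column names as they appear in the abundance table
sample_columns = pd.Series(["A0", "A1_1A", "A1_1B", "B9_3B", "B9_3BR"])

# Keep the "<system><digit>_<digit>" prefix and drop the replicate suffix.
# Names that do not follow this pattern (e.g. "A0") become NaN.
meta_key = sample_columns.str.extract(r"^([AB]\d_\d)", expand=False)

# Dictionary from abundance column name to metadata sample name
mapping = dict(zip(sample_columns, meta_key))
print(mapping)
```

With such a mapping, each abundance column can be matched to a row of `df_meta` through its `sample_name`. Columns such as `A0` have no counterpart in the metadata and need to be handled separately.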


In [4]:
# "df.set_index()" uses the values of a column as row labels (the index)
df_meta = df_meta.set_index("sample_name")
df_meta

sample_name,time,TAN_DF,pH_DF,NO2_DF,Alkalinity_DF,TAN_DGS,pH_DGS,NO2_DGS,Alkalinity_DGS,Flow_rate,Treatment,Module,TAN_removal_biocarrier,co2_mgl,h2s_ugl,o2_mgl,o2_sat,salinity,temp
B1_1,29/08/2022 08:30,0.240,7.880,0.64,342.40,0.180,7.860,0.71,344.00,60.00,200,12,0.000034,7.03,0.16,7.79,83.78,14.88,15.1
A1_1,29/08/2022 08:31,0.280,8.070,0.64,433.40,0.220,8.020,0.55,433.90,60.00,200,1,0.000038,4.93,0.32,8.41,89.33,13.02,15.1
No_sample,02/09/2022 08:30,0.220,7.650,0.51,219.80,0.150,8.031,0.76,211.00,60.00,200,12,0.000040,7.99,0.17,7.82,85.07,16.38,15.1
No_sample,02/09/2022 08:31,0.200,7.800,0.51,236.00,0.070,8.030,0.53,236.30,60.00,200,1,0.000083,6.76,0.26,8.21,87.59,13.40,15.1
B1_2,05/09/2022 08:30,0.160,7.600,0.46,234.00,0.120,7.900,0.62,234.00,60.00,200,12,0.000023,8.76,0.13,7.57,81.98,16.36,15.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
No_sample,11/01/2023 08:31,0.255,7.128,0.06,72.29,0.135,7.411,0.05,73.07,54.55,70,1,0.000081,8.33,0.26,8.60,96.42,14.82,15.1
No_sample,12/01/2023 08:30,0.220,7.553,0.08,184.10,0.130,7.771,0.06,185.80,62.75,200,12,0.000070,7.66,0.13,8.28,94.29,17.28,15.3
No_sample,12/01/2023 08:31,0.225,7.226,0.08,74.13,0.125,7.298,0.05,72.83,57.25,70,1,0.000071,7.89,0.23,8.56,95.37,14.73,15.0
B9_3,16/01/2023 08:30,0.250,7.564,0.08,192.60,0.165,7.688,0.07,205.90,73.65,200,12,0.000078,7.49,0.18,8.24,93.87,17.21,15.2
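
Because many metadata rows are labelled `No_sample`, the index above contains repeated labels. If you only need the measurements taken when a sample was collected, one option is to filter those rows out before setting the index, and to parse the `time` column so it can be sorted or plotted. The sketch below does this on three rows copied from the table above. The names `toy_meta` and `sampled` are hypothetical, and the day-first date format is inferred from values such as `29/08/2022`.

```python
import pandas as pd

# Three rows copied from the metadata table (only two columns kept)
toy_meta = pd.DataFrame({
    "time": ["29/08/2022 08:30", "02/09/2022 08:30", "05/09/2022 08:30"],
    "pH_DF": [7.88, 7.65, 7.60],
    "sample_name": ["B1_1", "No_sample", "B1_2"],
})

# Keep only the measurements that correspond to a collected sample
has_sample = toy_meta["sample_name"] != "No_sample"
sampled = toy_meta[has_sample].set_index("sample_name")

# Parse the day-first timestamps into proper datetime values
sampled["time"] = pd.to_datetime(sampled["time"], format="%d/%m/%Y %H:%M")

print(sampled)
```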


## Looking at the data from different angles

### Specific taxonomic ranks

In [5]:
# Extract organisms that are of taxonomic rank F and/or subranks F1, F2, etc.
# This syntax extracts the sub-dataframe that satisfies a condition on a column
df_rankF_all = df[df["Taxonomic Rank"].str.contains("F")]
df_rankF_all

Unnamed: 0,Scientific Name,Taxonomic Rank,A0,A1_1A,A1_1B,A1_1C,A1_2A,A1_2B,A1_2C,A1_3A,...,B9_1A,B9_1B,B9_1C,B9_2A,B9_2B,B9_2C,B9_3A,B9_3B,B9_3BR,B9_3C
2,...,F1,0.0,0.05,0.05,0.05,0.05,0.07,0.08,0.05,...,0.04,0.04,0.04,0.03,0.03,0.03,0.02,0.02,0.03,0.03
3,...,F,0.0,0.05,0.05,0.05,0.05,0.07,0.08,0.05,...,0.04,0.04,0.04,0.03,0.03,0.03,0.02,0.02,0.03,0.03
37,...,F6,0.0,0.01,0.01,0.01,0.01,0.02,0.03,0.01,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01
43,...,F6,0.0,0.03,0.03,0.03,0.03,0.04,0.05,0.03,...,0.02,0.02,0.02,0.02,0.02,0.02,0.01,0.01,0.01,0.02
54,...,F6,0.0,0.02,0.02,0.02,0.02,0.03,0.04,0.02,...,0.01,0.01,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23476,Globuloviridae,F,,,,,,,,,...,,,,0.00,,,,,,
23479,Polydnaviriformidae,F,,,0.00,,,0.00,0.00,0.00,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
23480,Portogloboviridae,F,,,,,,,,,...,,,,,,,,0.00,,
23483,Thaspiviridae,F,,0.00,0.00,0.00,0.00,0.00,0.00,0.00,...,,,,,,,,,,


In [6]:
# Extract organisms that are exactly of taxonomic rank F1.
# This syntax extracts the sub-dataframe that satisfies a condition on a column
df_rankF1 = df[df["Taxonomic Rank"] == "F1"]
df_rankF1

Unnamed: 0,Scientific Name,Taxonomic Rank,A0,A1_1A,A1_1B,A1_1C,A1_2A,A1_2B,A1_2C,A1_3A,...,B9_1A,B9_1B,B9_1C,B9_2A,B9_2B,B9_2C,B9_3A,B9_3B,B9_3BR,B9_3C
2,...,F1,0.0,0.05,0.05,0.05,0.05,0.07,0.08,0.05,...,0.04,0.04,0.04,0.03,0.03,0.03,0.02,0.02,0.03,0.03
194,Apioideae,F1,0.0,0.01,0.01,0.01,0.01,0.01,0.01,0.01,...,0.00,0.01,0.01,0.00,0.00,0.00,0.00,0.00,0.00,0.00
245,Acalypho...,F1,0.0,0.02,0.02,0.02,0.02,0.03,0.03,0.02,...,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01
246,Amygdalo...,F1,0.0,0.02,0.02,0.02,0.02,0.03,0.03,0.02,...,0.01,0.01,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.01
250,Asteroideae,F1,0.0,0.03,0.03,0.03,0.03,0.05,0.06,0.04,...,0.03,0.03,0.03,0.02,0.02,0.02,0.02,0.02,0.02,0.02
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21853,unclassified Straboviridae,F1,,,,,,,,,...,,,,,,,,,,
21862,unclassified Zierdtviridae,F1,,,,,,,,,...,,,,,,,,,,
23416,Geminialphasatellitinae,F1,,,,,,,,,...,,,,,,,,,,
23460,unclassified Anelloviridae,F1,,,,,,,,,...,,,,,,,,,,


### Transpose of the dataset

In [7]:
# Take the transpose of the dataframe (rows become columns and vice versa)
# (optional: whether you need it depends on your final objective)
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,23483,23484,23485,23486,23487,23488,23489,23490,23491,23492
Scientific Name,...,...,...,...,...,...,...,...,...,...,...,Thaspiviridae,Tolecusatellitidae,Varidnaviria,other sequences,unclassified viruses,Viruses,cellular organisms,other entries,root,unclassified
Taxonomic Rank,S,G,F1,F,O4,O3,S,S1,S,S,...,F,F,D1,R2,D1,D,R1,R1,R,U
A0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,...,,,,0.0,,0.0,99.32,0.0,99.93,0.07
A1_1A,0.05,0.05,0.05,0.05,0.05,0.05,0.01,0.01,0.0,0.0,...,0.0,,0.0,0.0,0.0,0.01,33.81,0.0,33.83,66.17
A1_1B,0.05,0.05,0.05,0.05,0.05,0.05,0.01,0.01,0.0,0.0,...,0.0,,0.0,0.01,0.0,0.01,29.53,0.01,29.55,70.45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
B9_2C,0.03,0.03,0.03,0.03,0.03,0.03,0.01,0.0,0.0,0.0,...,,,0.0,0.0,0.0,0.01,18.28,0.0,18.3,81.7
B9_3A,0.02,0.02,0.02,0.02,0.02,0.02,0.01,0.0,0.0,0.0,...,,,0.0,0.01,0.0,0.01,17.07,0.01,17.1,82.9
B9_3B,0.02,0.02,0.02,0.02,0.02,0.02,0.01,0.0,0.0,0.0,...,,0.0,0.0,0.01,0.0,0.01,17.7,0.01,17.72,82.28
B9_3BR,0.03,0.03,0.03,0.03,0.03,0.03,0.01,0.0,0.0,0.0,...,,,0.0,0.0,0.0,0.01,17.47,0.0,17.49,82.51


#### Treat NaN values as 0 

See `dataset_description.pdf`, where it is specified that NA values should actually be considered as 0. 

In [8]:
df.T.fillna(0.0)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,23483,23484,23485,23486,23487,23488,23489,23490,23491,23492
Scientific Name,...,...,...,...,...,...,...,...,...,...,...,Thaspiviridae,Tolecusatellitidae,Varidnaviria,other sequences,unclassified viruses,Viruses,cellular organisms,other entries,root,unclassified
Taxonomic Rank,S,G,F1,F,O4,O3,S,S1,S,S,...,F,F,D1,R2,D1,D,R1,R1,R,U
A0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,99.32,0.0,99.93,0.07
A1_1A,0.05,0.05,0.05,0.05,0.05,0.05,0.01,0.01,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,33.81,0.0,33.83,66.17
A1_1B,0.05,0.05,0.05,0.05,0.05,0.05,0.01,0.01,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.01,29.53,0.01,29.55,70.45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
B9_2C,0.03,0.03,0.03,0.03,0.03,0.03,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,18.28,0.0,18.3,81.7
B9_3A,0.02,0.02,0.02,0.02,0.02,0.02,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.01,17.07,0.01,17.1,82.9
B9_3B,0.02,0.02,0.02,0.02,0.02,0.02,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.01,17.7,0.01,17.72,82.28
B9_3BR,0.03,0.03,0.03,0.03,0.03,0.03,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,17.47,0.0,17.49,82.51


In [9]:
def process_transpose(df):
    """
    Transpose the dataframe, convert NaN to 0, re-index its columns
    and drop superfluous information
    """
    dfT = df.T.fillna(0.0)
    # Use the scientific names as column labels
    # "df.loc" selects rows by label (labels can be strings)
    # "dfT.columns = XXX" allows you to rename your columns
    dfT.columns = dfT.loc["Scientific Name"].to_list()
    # Drop the scientific name and the taxonomic rank rows (the first two rows)
    # "df.iloc" selects rows by integer position
    dfT = dfT.iloc[2:]
    return dfT

dfT_rankF1 = process_transpose(df_rankF1)
dfT_rankF1

Unnamed: 0,Homininae,Apioideae,Acalyphoideae,Amygdaloideae,Asteroideae,Aurantioideae,Benincaseae,Brassiceae,Byttnerioideae,Camelineae,...,unclassified Kyanoviridae,unclassified Mesyanzhinovviridae,unclassified Peduoviridae,unclassified Salasmaviridae,unclassified Schitoviridae,unclassified Straboviridae,unclassified Zierdtviridae,Geminialphasatellitinae,unclassified Anelloviridae,unclassified Bicaudaviridae
A0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1_1A,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1_1B,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1_1C,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
A1_2A,0.05,0.01,0.02,0.02,0.03,0.01,0.04,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
B9_2C,0.03,0.0,0.01,0.01,0.02,0.0,0.02,0.01,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B9_3A,0.02,0.0,0.01,0.01,0.02,0.0,0.01,0.01,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B9_3B,0.02,0.0,0.01,0.01,0.02,0.0,0.01,0.01,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B9_3BR,0.03,0.0,0.01,0.01,0.02,0.0,0.02,0.01,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
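
One thing to keep in mind: when a dataframe that mixes text and numeric columns is transposed, pandas stores every resulting column with the generic `object` dtype, even after the text rows are dropped. Converting to `float` afterwards is usually needed before computing statistics. The sketch below reproduces the steps of `process_transpose` on a tiny made-up table (`toy`) and then converts the result. It is an illustration, not a required step of the competition.

```python
import pandas as pd

# Tiny stand-in for a rank-filtered abundance table
toy = pd.DataFrame({
    "Scientific Name": ["Homininae", "Apioideae"],
    "Taxonomic Rank": ["F1", "F1"],
    "A0": [0.0, float("nan")],
    "A1_1A": [0.05, 0.01],
})

# Same steps as process_transpose
toyT = toy.T.fillna(0.0)
toyT.columns = toyT.loc["Scientific Name"].to_list()
toyT = toyT.iloc[2:]

# The columns still have the generic "object" dtype at this point
print(toyT.dtypes)

# Convert to float so that numerical methods (mean, corr, ...) work as expected
toyT_numeric = toyT.astype(float)
print(toyT_numeric.dtypes)
```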


### Add sample name information as columns

In [10]:
def process_sample_name(dfT):
    """
    Create 4 new categorical columns based on the information stored
    in the sample name

    Note that these operations are done on the transpose of the df,
    and that the columns are added to dfT in place
    """
    # Information on the aquaculture system (A or B) of the sample
    aquaculture_system = [n[0] for n in dfT.index]
    # Information on the treatment (1, 2, ..., 9; "A0" gives 0)
    treatment = [int(n[1]) if len(n) > 1 else None for n in dfT.index]
    # Information on the time step (1, 2, or 3)
    time_step = [int(n[3]) if len(n) > 3 else None for n in dfT.index]
    # Information on the replicate (A, B or C) corresponding to the sample
    # (a trailing "R", as in "B9_3BR", is ignored)
    replicate = [n[4] if len(n) > 4 else None for n in dfT.index]

    # Add columns to the dataframe
    dfT["Aquaculture System"] = aquaculture_system
    dfT["Treatment"] = treatment
    dfT["Time Step"] = time_step
    dfT["Replicate"] = replicate

    return dfT

process_sample_name(dfT_rankF1)
dfT_rankF1

Unnamed: 0,Homininae,Apioideae,Acalyphoideae,Amygdaloideae,Asteroideae,Aurantioideae,Benincaseae,Brassiceae,Byttnerioideae,Camelineae,...,unclassified Schitoviridae,unclassified Straboviridae,unclassified Zierdtviridae,Geminialphasatellitinae,unclassified Anelloviridae,unclassified Bicaudaviridae,Aquaculture System,Treatment,Time Step,Replicate
A0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,A,0,,
A1_1A,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,1.0,A
A1_1B,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,1.0,B
A1_1C,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,1.0,C
A1_2A,0.05,0.01,0.02,0.02,0.03,0.01,0.04,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,2.0,A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
B9_2C,0.03,0.0,0.01,0.01,0.02,0.0,0.02,0.01,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,B,9,2.0,C
B9_3A,0.02,0.0,0.01,0.01,0.02,0.0,0.01,0.01,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,B,9,3.0,A
B9_3B,0.02,0.0,0.01,0.01,0.02,0.0,0.01,0.01,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,B,9,3.0,B
B9_3BR,0.03,0.0,0.01,0.01,0.02,0.0,0.02,0.01,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,B,9,3.0,B
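
The character positions used above work for the names in this dataset, but they silently ignore anything unexpected. As a possible alternative, the sketch below parses a sample name with a regular expression and raises an error for names it does not recognise. The pattern, the function name `parse_sample_name`, and the extra `Has R Suffix` field are assumptions made for this illustration. The meaning of the trailing "R" (as in `B9_3BR`) is not stated in this notebook, so check `dataset_description.pdf`.

```python
import re

# System letter, treatment digit, then optionally "_<time step><replicate>"
# followed by an optional trailing "R" (as in "B9_3BR")
SAMPLE_PATTERN = re.compile(r"^([AB])(\d)(?:_(\d)([ABC])(R?))?$")

def parse_sample_name(name):
    """Split a sample name into its parts; missing parts are None."""
    match = SAMPLE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected sample name: {name!r}")
    system, treatment, time_step, replicate, suffix = match.groups()
    return {
        "Aquaculture System": system,
        "Treatment": int(treatment),
        "Time Step": int(time_step) if time_step is not None else None,
        "Replicate": replicate,
        "Has R Suffix": suffix == "R",
    }

for name in ["A0", "A1_1A", "B9_3BR"]:
    print(name, parse_sample_name(name))
```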


### Look at a specific aquaculture system and treatment

In [11]:
# Extracting sub-df based on multiple conditions
mask = (dfT_rankF1["Aquaculture System"] == "A") & (dfT_rankF1["Treatment"] == 1)
dfT_rankF1_A1 = dfT_rankF1[mask]
dfT_rankF1_A1

Unnamed: 0,Homininae,Apioideae,Acalyphoideae,Amygdaloideae,Asteroideae,Aurantioideae,Benincaseae,Brassiceae,Byttnerioideae,Camelineae,...,unclassified Schitoviridae,unclassified Straboviridae,unclassified Zierdtviridae,Geminialphasatellitinae,unclassified Anelloviridae,unclassified Bicaudaviridae,Aquaculture System,Treatment,Time Step,Replicate
A1_1A,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,1.0,A
A1_1B,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,1.0,B
A1_1C,0.05,0.01,0.02,0.02,0.03,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,1.0,C
A1_2A,0.05,0.01,0.02,0.02,0.03,0.01,0.04,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,2.0,A
A1_2B,0.07,0.01,0.03,0.03,0.05,0.01,0.05,0.04,0.01,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,2.0,B
A1_2C,0.08,0.01,0.03,0.03,0.06,0.01,0.06,0.05,0.01,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,2.0,C
A1_3A,0.05,0.01,0.02,0.02,0.04,0.01,0.04,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,3.0,A
A1_3B,0.06,0.01,0.02,0.03,0.05,0.01,0.05,0.03,0.01,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,3.0,B
A1_3C,0.05,0.01,0.02,0.02,0.04,0.01,0.03,0.03,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,A,1,3.0,C
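
Once the sample information is available as columns, replicates can be summarised per time step with `groupby`. The sketch below averages the three replicates of each time step for one organism, using the `Homininae` values of `A1_1A` to `A1_2C` shown above. The names `toy` and `mean_per_step` are hypothetical. On a dataframe obtained through a transpose, the abundance columns may first need to be converted with `.astype(float)`.

```python
import pandas as pd

# Homininae abundance for system A, treatment 1, time steps 1 and 2
toy = pd.DataFrame(
    {
        "Homininae": [0.05, 0.05, 0.05, 0.05, 0.07, 0.08],
        "Time Step": [1, 1, 1, 2, 2, 2],
        "Replicate": ["A", "B", "C", "A", "B", "C"],
    },
    index=["A1_1A", "A1_1B", "A1_1C", "A1_2A", "A1_2B", "A1_2C"],
)

# Average over the replicates of each time step
mean_per_step = toy.groupby("Time Step")["Homininae"].mean()
print(mean_per_step)
```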
