# Working with experimental conditions

#### Download, install, and import all necessary libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load the data file with transcripts

In [3]:
df = pd.read_csv('RSEM.gene.TMM.EXPR.matrix.csv', index_col=0)
df.head()

Unnamed: 0,rsem_outdir_NAi13t3n2_S1,rsem_outdir_NAi13t3n14_S2,rsem_outdir_NAi13t3n26_S3,rsem_outdir_NAi13t3n27_S4,rsem_outdir_NAi13t3n31_S5,rsem_outdir_NAi6t3n1_S6,rsem_outdir_NAi6t3n10_S7,rsem_outdir_NAi6t3n16_S8,rsem_outdir_NAi6t3n24_S9,rsem_outdir_NAi6t3n29_S10,rsem_outdir_NAi13t12n7_S11,rsem_outdir_NAi13t12n10_S12,rsem_outdir_NAi13t12n11_S13,rsem_outdir_NAi13t12n16_S14,rsem_outdir_NAi13t12n19_S15,rsem_outdir_NAi6t12n3_S16,rsem_outdir_NAi6t12n4_S17,rsem_outdir_NAi6t12n18_S18,rsem_outdir_NAi6t12n23_S19,rsem_outdir_NAi6t12n27_S20
TRINITY_DN0_c0_g1,756.624,906.464,307.957,739.13,206.39,154.113,220.125,554.147,317.122,183.428,125.687,403.675,1387.289,470.699,401.953,534.054,420.981,509.926,603.937,374.14
TRINITY_DN0_c13_g1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
TRINITY_DN0_c1_g1,357.557,352.441,211.422,268.329,205.709,247.21,346.917,252.998,325.901,216.055,157.471,224.673,429.794,145.129,164.364,240.231,204.186,263.276,249.055,293.112
TRINITY_DN0_c1_g2,1.1,1.022,1.04,1.243,1.052,1.458,0.378,0.503,1.32,1.47,0.41,1.034,1.657,0.949,1.143,0.814,0.866,0.716,1.038,1.163
TRINITY_DN0_c2_g1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Data organization

 - **columns**
     - Experimental subjects

 - **rows**
     - codes for transcripts

 - **numerical value**
     - expression level (count per million base pairs)
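
As a quick check of this layout, the sketch below builds a tiny stand-in matrix and confirms that samples sit on the columns and transcripts on the index. The name `toy` is hypothetical; its values are copied from the first two rows of the table above, restricted to two samples.

```python
import pandas as pd

# Toy stand-in for the expression matrix: two transcripts, two samples.
# (Assumption: a small hand-made frame instead of the real CSV file.)
toy = pd.DataFrame(
    {
        "rsem_outdir_NAi13t3n2_S1": [756.624, 0.0],
        "rsem_outdir_NAi6t3n1_S6": [154.113, 0.0],
    },
    index=["TRINITY_DN0_c0_g1", "TRINITY_DN0_c13_g1"],
)

# shape is (rows, columns), i.e. (transcripts, samples)
n_transcripts, n_samples = toy.shape
print(f"{n_transcripts} transcripts (rows) x {n_samples} samples (columns)")

# Transcripts whose expression level is zero in every sample
all_zero = toy.index[(toy == 0).all(axis=1)].tolist()
print("Transcripts with no expression in any sample:", all_zero)
```

The same two checks can be run on the full `df` once it is loaded.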


## Experiment 1: Extract names of experimental subjects

The columns of the data frame **df** contain the names of the experimental samples (each name also encodes information about the experimental conditions). In the cell below, write a short, easy-to-read Python script that extracts the sample names into a list, array, or dataframe (whichever is most convenient for you). Use the variable **subject_names** to hold your list/array/dataframe.

In [4]:
#Declaring a variable (list) that holds the column names of the DataFrame, df
subject_names = list(df.columns)
print("The names of the experimental subjects are: " + str(subject_names))

The names of the experimental subjects are: ['rsem_outdir_NAi13t3n2_S1', 'rsem_outdir_NAi13t3n14_S2', 'rsem_outdir_NAi13t3n26_S3', 'rsem_outdir_NAi13t3n27_S4', 'rsem_outdir_NAi13t3n31_S5', 'rsem_outdir_NAi6t3n1_S6', 'rsem_outdir_NAi6t3n10_S7', 'rsem_outdir_NAi6t3n16_S8', 'rsem_outdir_NAi6t3n24_S9', 'rsem_outdir_NAi6t3n29_S10', 'rsem_outdir_NAi13t12n7_S11', 'rsem_outdir_NAi13t12n10_S12', 'rsem_outdir_NAi13t12n11_S13', 'rsem_outdir_NAi13t12n16_S14', 'rsem_outdir_NAi13t12n19_S15', 'rsem_outdir_NAi6t12n3_S16', 'rsem_outdir_NAi6t12n4_S17', 'rsem_outdir_NAi6t12n18_S18', 'rsem_outdir_NAi6t12n23_S19', 'rsem_outdir_NAi6t12n27_S20']


## Experiment 2: Separate experimental subjects using light conditions

The names of the experimental samples contain a lot of information. In this and the following experiments we will extract that information from the names and use it to divide the samples into experimental groups.

All sample names start with the name of the experiment (“rsem_outdir_NA”), followed by information on the light conditions. There are only two light conditions in this experiment:

   - **i13** - the experimental subject received a light PULSE

   - **i6** - the experimental subject received NO light pulse

Use appropriate Python code to make two new lists/dataframes, **pulse** and **no_pulse**.

   - The list/dataframe **pulse** should contain the names of experimental subjects that received a light pulse (for example, *rsem_outdir_NAi13t12n10_S12*).

   - The list/dataframe **no_pulse** should contain the names of experimental subjects that received no light pulse (for example, *rsem_outdir_NAi6t3n16_S8*). Provide an informative printout.


In [5]:
#Declaring two lists to store the names of samples with a pulse and with no pulse, respectively
pulse = []
no_pulse = []
#Looping through the column names in the DataFrame, df
for col in df:
    #Conditional statement: If i13 is in the column name, add it to the "pulse" list
    if "i13" in col:
        pulse.append(col)
    #Second conditional statement: If i6 is in the column name, add it to the "no_pulse" list
    elif "i6" in col:
        no_pulse.append(col)
print("The experimental subjects that received a light pulse are: " + str(pulse))
print("The experimental subjects that did not receive a light pulse are: " + str(no_pulse))

The experimental subjects that received a light pulse are: ['rsem_outdir_NAi13t3n2_S1', 'rsem_outdir_NAi13t3n14_S2', 'rsem_outdir_NAi13t3n26_S3', 'rsem_outdir_NAi13t3n27_S4', 'rsem_outdir_NAi13t3n31_S5', 'rsem_outdir_NAi13t12n7_S11', 'rsem_outdir_NAi13t12n10_S12', 'rsem_outdir_NAi13t12n11_S13', 'rsem_outdir_NAi13t12n16_S14', 'rsem_outdir_NAi13t12n19_S15']
The experimental subjects that did not receive a light pulse are: ['rsem_outdir_NAi6t3n1_S6', 'rsem_outdir_NAi6t3n10_S7', 'rsem_outdir_NAi6t3n16_S8', 'rsem_outdir_NAi6t3n24_S9', 'rsem_outdir_NAi6t3n29_S10', 'rsem_outdir_NAi6t12n3_S16', 'rsem_outdir_NAi6t12n4_S17', 'rsem_outdir_NAi6t12n18_S18', 'rsem_outdir_NAi6t12n23_S19', 'rsem_outdir_NAi6t12n27_S20']
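
Once the names are split into groups, each list can be used to select the matching columns of the expression matrix. The sketch below illustrates this on a hypothetical four-sample frame called `toy`, with values taken from the first row of the table above. The names `toy_pulse`, `toy_no_pulse`, `pulse_expr`, and `no_pulse_expr` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for df: one transcript, four samples (assumed subset of the real data)
toy = pd.DataFrame(
    {
        "rsem_outdir_NAi13t3n2_S1": [756.624],
        "rsem_outdir_NAi6t3n1_S6": [154.113],
        "rsem_outdir_NAi13t12n7_S11": [125.687],
        "rsem_outdir_NAi6t12n3_S16": [534.054],
    },
    index=["TRINITY_DN0_c0_g1"],
)

# The same split by light condition, written as list comprehensions
toy_pulse = [col for col in toy.columns if "i13" in col]
toy_no_pulse = [col for col in toy.columns if "i6" in col]

# Indexing a DataFrame with a list of column names keeps only those columns
pulse_expr = toy[toy_pulse]
no_pulse_expr = toy[toy_no_pulse]

# Per-transcript mean expression within each group
print(pulse_expr.mean(axis=1))
print(no_pulse_expr.mean(axis=1))
```

On the real data, `df[pulse]` and `df[no_pulse]` would give the two sub-matrices in the same way.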


## Experiment 3: Separate experimental subjects using collection time

After the light information, the next several characters in the name of an experimental subject encode the time of sample collection. There were only two collection times in our experiment:

   - t3 - samples were collected at circadian time 3

   - t12 - samples were collected at circadian time 12

Use an appropriate Python script to make two new lists/dataframes, **zt3** and **zt12**.

   - The list/dataframe **zt3** should contain the names of experimental subjects that were collected at circadian time 3 (for example, *rsem_outdir_NAi6t3n1_S6*).

   - The list/dataframe **zt12** should contain the names of experimental subjects that were collected at circadian time 12 (for example, *rsem_outdir_NAi13t12n10_S12*). Make an informative printout.


In [6]:
#Declaring two lists to store the names of samples with a zt of 3 and a zt of 12, respectively
zt3 = []
zt12 = []
#Looping through the sample names extracted in Experiment 1
for name in subject_names:
    #Conditional statement: if t3 is in the sample name, then add it to the zt3 list
    if "t3" in name:
        zt3.append(name)
    #Second conditional statement: if t12 is in the sample name, then add it to the zt12 list
    elif "t12" in name:
        zt12.append(name)
print("The names of the samples that were collected at circadian time 3 are: " + str(zt3))
print("The names of the samples that were collected at circadian time 12 are: " + str(zt12))

The names of the samples that were collected at circadian time 3 are: ['rsem_outdir_NAi13t3n2_S1', 'rsem_outdir_NAi13t3n14_S2', 'rsem_outdir_NAi13t3n26_S3', 'rsem_outdir_NAi13t3n27_S4', 'rsem_outdir_NAi13t3n31_S5', 'rsem_outdir_NAi6t3n1_S6', 'rsem_outdir_NAi6t3n10_S7', 'rsem_outdir_NAi6t3n16_S8', 'rsem_outdir_NAi6t3n24_S9', 'rsem_outdir_NAi6t3n29_S10']
The names of the samples that were collected at circadian time 12 are: ['rsem_outdir_NAi13t12n7_S11', 'rsem_outdir_NAi13t12n10_S12', 'rsem_outdir_NAi13t12n11_S13', 'rsem_outdir_NAi13t12n16_S14', 'rsem_outdir_NAi13t12n19_S15', 'rsem_outdir_NAi6t12n3_S16', 'rsem_outdir_NAi6t12n4_S17', 'rsem_outdir_NAi6t12n18_S18', 'rsem_outdir_NAi6t12n23_S19', 'rsem_outdir_NAi6t12n27_S20']
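
Plain substring tests work for this data set because only `t3` and `t12` occur. A stricter alternative, sketched below, reads the number between `t` and `n` with a regular expression. The helper `collection_time`, the constant `TIME_PATTERN`, and the third sample name (with an invented time `t36`) are hypothetical and serve only to show where a substring test would go wrong.

```python
import re

# Hypothetical pattern: "t<time>n<number>_S<number>" at the end of the name
TIME_PATTERN = re.compile(r"t(\d+)n\d+_S\d+$")


def collection_time(name):
    """Return the collection time encoded in a sample name, or None if absent."""
    match = TIME_PATTERN.search(name)
    return int(match.group(1)) if match else None


names = [
    "rsem_outdir_NAi6t3n1_S6",
    "rsem_outdir_NAi13t12n10_S12",
    "rsem_outdir_NAi6t36n1_S99",  # invented name, not part of the data set
]

for name in names:
    # The substring test treats t36 as t3; the parsed value does not
    print(name, "| contains 't3':", "t3" in name, "| parsed time:", collection_time(name))
```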


## Experiment 4: Report subject IDs

Finally, the last few characters in a sample name indicate the species (S for spider) and the experimental ID (the number of the spider). For example, **S11** indicates that the spider ID is 11.

Use appropriate Python code to make a new dataframe, **spiders**. It should have two columns:

  -  the first column should contain the full sample name (for example, *rsem_outdir_NAi6t3n1_S6*)

  -  the second column should contain the subject number (for the previous example, *6*)


In [7]:
#Declaring a variable (dictionary) to store the subject number, with the full sample name as the key
spider_dict = {}
# Looping through the sample names
for name in subject_names:
    #If the name contains a subject number (as denoted by the S), add it to the dictionary
    if name.find("S") != -1:
        #Everything after the "S" is the subject number
        spider_dict[name] = name[(name.find("S")+1):]
#Converting the dictionary into a Pandas DataFrame and saving it in a variable called "spider"
spider = pd.DataFrame.from_dict(spider_dict, orient = "index")
display(spider)

Unnamed: 0,0
rsem_outdir_NAi13t3n2_S1,1
rsem_outdir_NAi13t3n14_S2,2
rsem_outdir_NAi13t3n26_S3,3
rsem_outdir_NAi13t3n27_S4,4
rsem_outdir_NAi13t3n31_S5,5
rsem_outdir_NAi6t3n1_S6,6
rsem_outdir_NAi6t3n10_S7,7
rsem_outdir_NAi6t3n16_S8,8
rsem_outdir_NAi6t3n24_S9,9
rsem_outdir_NAi6t3n29_S10,10
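
The task asks for two columns, whereas the cell above keeps the sample name in the index. One possible way to get both as regular columns is sketched below. The column names `sample_name` and `subject_number` are assumptions, and only the two example names from the text are used.

```python
import pandas as pd

# Two sample names taken from the examples in the text
names = ["rsem_outdir_NAi6t3n1_S6", "rsem_outdir_NAi13t12n10_S12"]

# First column: the full sample name
spiders = pd.DataFrame({"sample_name": names})

# Second column: everything after the final "_S", converted to an integer
spiders["subject_number"] = (
    spiders["sample_name"].str.rsplit("_S", n=1).str[-1].astype(int)
)

print(spiders)
```

Splitting on the last `_S` rather than searching for the first `S` avoids surprises if an uppercase `S` ever appears earlier in a name.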


## Experiment 5: Summarizing experimental information in a dataframe

Finally, we will combine all experimental conditions in a single dataframe. Make a new dataframe called **metadata**.

  -  The first column should contain the full sample names (for example, *rsem_outdir_NAi13t3n2_S1*).

  -  The second column, named collection_times, should contain the collection time (**3** or **12**).

  -  The third column, named pulse, should contain the light pulse indicator:
      -  **yes** for pulse
      -  **no** for no pulse

Here is an example of the first three lines of such a dataframe:

![image.png](attachment:image.png)
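
The table described above can also be derived directly from the sample names in one pass. The sketch below does this with `Series.str.extract` and named groups on four sample names from the data set. The names `parts` and `metadata_sketch`, the group names `light` and `time`, and the choice to store the collection time as an integer are assumptions made for illustration.

```python
import pandas as pd

# Four sample names from the data set, one per condition
names = pd.Series(
    [
        "rsem_outdir_NAi13t3n2_S1",
        "rsem_outdir_NAi6t3n1_S6",
        "rsem_outdir_NAi13t12n7_S11",
        "rsem_outdir_NAi6t12n3_S16",
    ]
)

# Named groups become the column names of the extracted frame
parts = names.str.extract(r"NAi(?P<light>\d+)t(?P<time>\d+)n\d+_S\d+$")

metadata_sketch = pd.DataFrame(
    {
        "collection_times": parts["time"].astype(int).to_numpy(),
        # i13 means a light pulse was given, i6 means no pulse
        "pulse": parts["light"].map({"13": "yes", "6": "no"}).to_numpy(),
    },
    index=names.to_numpy(),
)

print(metadata_sketch)
```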

In [9]:
#Declaring 2 variables to store the pulse/no pulse indicator and the collection time of each sample
pulse_list = []
collection_times_list = []

#Loop through the sample names (each name is a column name in the DataFrame, df)
for name in subject_names:
    #Conditional statement: if the name is in the pulse list from Experiment 2, then add yes to the list for that name
    if name in pulse:
        pulse_list.append("yes")
    #Conditional statement: if the name is in the no_pulse list from Experiment 2, then add no to the list for that name
    elif name in no_pulse:
        pulse_list.append("no")
    #Conditional statement: if the name is in the zt3 list from Experiment 3, then add a 3 to the list for that name
    if name in zt3:
        collection_times_list.append("3")
    #Conditional statement: if the name is in the zt12 list from Experiment 3, then add a 12 to the list for that name
    elif name in zt12:
        collection_times_list.append("12")
#Declaring a variable (dict) to store the sample names, collection times, and pulse/no pulse
combined_dict = {"": subject_names, "collection_times": collection_times_list, "pulse": pulse_list}
#Converting the dictionary into a Pandas DataFrame and saving it in a new variable called "metadata"
metadata = pd.DataFrame.from_dict(combined_dict)
#Using the unnamed first column (the sample names) as the index to match the given format
metadata = metadata.set_index("")
print(metadata)

                            collection_times pulse
                                                  
rsem_outdir_NAi13t3n2_S1                   3   yes
rsem_outdir_NAi13t3n14_S2                  3   yes
rsem_outdir_NAi13t3n26_S3                  3   yes
rsem_outdir_NAi13t3n27_S4                  3   yes
rsem_outdir_NAi13t3n31_S5                  3   yes
rsem_outdir_NAi6t3n1_S6                    3    no
rsem_outdir_NAi6t3n10_S7                   3    no
rsem_outdir_NAi6t3n16_S8                   3    no
rsem_outdir_NAi6t3n24_S9                   3    no
rsem_outdir_NAi6t3n29_S10                  3    no
rsem_outdir_NAi13t12n7_S11                12   yes
rsem_outdir_NAi13t12n10_S12               12   yes
rsem_outdir_NAi13t12n11_S13               12   yes
rsem_outdir_NAi13t12n16_S14               12   yes
rsem_outdir_NAi13t12n19_S15               12   yes
rsem_outdir_NAi6t12n3_S16                 12    no
rsem_outdir_NAi6t12n4_S17                 12    no
rsem_outdir_NAi6t12n18_S18     