## **Notebook for the conversion of transcriptomic data from txt files to csvs in EDD format** 

By Garrett Roell and Christina Schenk

Tested on biodesign_3.7 kernel on jprime

This notebook first analyzes the proper EDD format and the format of the transcriptomics data that Winston sent. <br>
Then it goes through in high detail how to convert the CPM .txt file into a properly formatted EDD .csv file <br>
Last it runs the other three .txt files (FPKM, MR, and TMM) through the same pipeline

#### Imports

In [1]:
import pandas as pd

#### Define functions to help with conversion

In [18]:
def henson_row_to_line_name(row):
    line_name = row.Strain
    
    if row.Media == 'sodium benzoate':
        line_name += '-B'
    if row.Media == 'glucose':
        line_name += '-Glu'
    if row.Media == 'phenol':
        line_name += '-P'
    if row.Media == 'Mixture':
        line_name += '-M'
    if row.Media == 'guaiacol':
        line_name += '-Gua'
    if row.Media == 'vanillic acid':
        line_name += '-V'
    if row.Media == '4-hydroxybenzoic acid':
        line_name += '-H'
        
    print(row.Replicate)
    line_name += '-R' + str(row.Replicate)
        
    return line_name

In [4]:
# This mapping is based on the methods in the Henson paper
def henson_row_to_time_value(row):
    if row['Media'] == 'glucose' and row['Time.point'] == 't=1':
        return 10
    elif row['Media'] == 'glucose' and row['Time.point'] == 't=2':
        return 13
    elif row['Media'] == 'Mixture' and row['Time.point'] == 't=1':
        return 20
    elif row['Media'] == 'Mixture' and row['Time.point'] == 't=2':
        return 32
    elif row['Media'] == 'phenol' and row['Strain'] == 'WT':
        return 24
    elif row['Media'] == 'phenol' and row['Strain'] == 'PVHG':
        return 21
    elif row['Media'] == 'guaiacol':
        return 19
    elif row['Media'] == '4-hydroxybenzoic acid':
        return 11
    elif row['Media'] == 'sodium benzoate':
        return 12
    elif row['Media'] == 'vanillic acid':
        return 24
    else:
        print(f'No time data for {row.Strain} {row.Media} {row["Time.point"]}')

In [8]:
def henson_txt_file_to_EDD_csv(input_file_name, output_file_name):
    henson_data = pd.read_table(input_file_name, delim_whitespace=True)
    print(f'total length of dataframe is {len(henson_data)}')
    
    henson_data = henson_data[henson_data['product_accession'] != 'Test']
    
    henson_data = henson_data.reset_index(drop=True)
    print(henson_data.head())
    
    
    # Define columns needed for EDD
    henson_data['Line Name'] = [henson_row_to_line_name(row) for _, row in henson_data.iterrows()]
    henson_data['Measurement Type'] = [row.product_accession.replace('.', '_') for _, row in henson_data.iterrows()]
    henson_data['Time'] = [henson_row_to_time_value(row) for _, row in henson_data.iterrows()]
    henson_data['count'] = [row['Count'] for _, row in henson_data.iterrows()]
    
    # remove typo from data to make EDD compatible
    if 'FPKM' in input_file_name:
        henson_data['Units'] = ['FPKM'] * len(henson_data)
    
    henson_data.drop(['Strain', 'Media', 'Time.point', 'product_accession', 'Replicate', 'count'], axis=1, inplace=True)
    EDD_data = henson_data[['Line Name', 'Measurement Type', 'Time', 'Count', 'Units']]
    
    EDD_data.to_csv(output_file_name, index=False)


Test area

In [9]:
henson_data = pd.read_table('txt/henson_CPM_melted.txt', delim_whitespace=True)

henson_data = henson_data.reset_index(drop=True)
display(henson_data.head())

Unnamed: 0,locus_tag,product_accession,Replicate,Strain,Media,Units,Time.point,Test,Count
0,K2Z90_RS00005,WP_005263480.1,1,WT,Mixture,CPM,t=1,WT.Mixture.t=1,19.00342
1,K2Z90_RS00005,WP_005263480.1,2,WT,Mixture,CPM,t=1,WT.Mixture.t=1,16.800668
2,K2Z90_RS00005,WP_005263480.1,3,WT,Mixture,CPM,t=1,WT.Mixture.t=1,12.95522
3,K2Z90_RS00005,WP_005263480.1,1,WT,Mixture,CPM,t=2,WT.Mixture.t=2,10.984975
4,K2Z90_RS00005,WP_005263480.1,2,WT,Mixture,CPM,t=2,WT.Mixture.t=2,9.520107


#### Run the fuction for all normalization methods

In [10]:
henson_txt_file_to_EDD_csv(
    'txt/henson_CPM_melted.txt', 
    'csv/henson_CPM_melted.csv'
)

total length of dataframe is 434754
       locus_tag product_accession  Replicate Strain    Media Units  \
0  K2Z90_RS00005    WP_005263480.1          1     WT  Mixture   CPM   
1  K2Z90_RS00005    WP_005263480.1          2     WT  Mixture   CPM   
2  K2Z90_RS00005    WP_005263480.1          3     WT  Mixture   CPM   
3  K2Z90_RS00005    WP_005263480.1          1     WT  Mixture   CPM   
4  K2Z90_RS00005    WP_005263480.1          2     WT  Mixture   CPM   

  Time.point            Test      Count  
0        t=1  WT.Mixture.t=1  19.003420  
1        t=1  WT.Mixture.t=1  16.800668  
2        t=1  WT.Mixture.t=1  12.955220  
3        t=2  WT.Mixture.t=2  10.984975  
4        t=2  WT.Mixture.t=2   9.520107  


In [15]:
henson_txt_file_to_EDD_csv(
    'txt/henson_FKPM_melted.txt', 
    'csv/henson_FPKM_melted.csv'
)

total length of dataframe is 434754
       locus_tag product_accession  Replicate Strain    Media Units  \
0  K2Z90_RS00005    WP_005263480.1          1     WT  Mixture  FKPM   
1  K2Z90_RS00005    WP_005263480.1          2     WT  Mixture  FKPM   
2  K2Z90_RS00005    WP_005263480.1          3     WT  Mixture  FKPM   
3  K2Z90_RS00005    WP_005263480.1          1     WT  Mixture  FKPM   
4  K2Z90_RS00005    WP_005263480.1          2     WT  Mixture  FKPM   

  Time.point            Test      Count  
0        t=1  WT.Mixture.t=1  11.974430  
1        t=1  WT.Mixture.t=1  10.586432  
2        t=1  WT.Mixture.t=1   8.163339  
3        t=2  WT.Mixture.t=2   6.921849  
4        t=2  WT.Mixture.t=2   5.998807  


In [16]:
henson_txt_file_to_EDD_csv(
    'txt/henson_MR_melted.txt',
    'csv/henson_MR_melted.csv'
)

total length of dataframe is 434754
       locus_tag product_accession  Replicate Strain    Media Units  \
0  K2Z90_RS00005    WP_005263480.1          1     WT  Mixture    MR   
1  K2Z90_RS00005    WP_005263480.1          2     WT  Mixture    MR   
2  K2Z90_RS00005    WP_005263480.1          3     WT  Mixture    MR   
3  K2Z90_RS00005    WP_005263480.1          1     WT  Mixture    MR   
4  K2Z90_RS00005    WP_005263480.1          2     WT  Mixture    MR   

  Time.point            Test       Count  
0        t=1  WT.Mixture.t=1  223.983527  
1        t=1  WT.Mixture.t=1  198.020822  
2        t=1  WT.Mixture.t=1  152.696503  
3        t=2  WT.Mixture.t=2  129.474240  
4        t=2  WT.Mixture.t=2  112.208600  


In [19]:
henson_txt_file_to_EDD_csv(
    'txt/henson_TMM_melted.txt',
    'csv/henson_TMM_melted.csv'
)

total length of dataframe is 428553
       locus_tag product_accession  Replicate Strain    Media Units  \
0  K2Z90_RS00005    WP_005263480.1        1.0     WT  Mixture   TMM   
1  K2Z90_RS00005    WP_005263480.1        2.0     WT  Mixture   TMM   
2  K2Z90_RS00005    WP_005263480.1        3.0     WT  Mixture   TMM   
3  K2Z90_RS00005    WP_005263480.1        1.0     WT  Mixture   TMM   
4  K2Z90_RS00005    WP_005263480.1        2.0     WT  Mixture   TMM   

  Time.point            Test      Count  
0        t=1  WT.Mixture.t=1  18.441150  
1        t=1  WT.Mixture.t=1  16.208554  
2        t=1  WT.Mixture.t=1  12.751585  
3        t=2  WT.Mixture.t=2  10.500214  
4        t=2  WT.Mixture.t=2   9.147403  
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3.0
1.0
2.0
3

TypeError: unsupported operand type(s) for +=: 'float' and 'str'