# The conversion of files from Winston's format to EDD format
This notebook first examines the expected EDD format and the format of the transcriptomics data that Winston sent. <br>
Then it walks through, step by step, how to convert one of the .txt files (the FPKM file, as loaded in the cells below) into a properly formatted EDD .csv file. <br>
Finally, it runs all four .txt files (CPM, FPKM, MR, and TMM) through the same pipeline.

In [1]:
import pandas as pd

### Proper EDD Format Example

In [2]:
old_edd_yoneda_data = pd.read_csv('../../EDD_Yoneda_data/Yoneda_set3_transcriptomics_data.csv')
print(f'The Yoneda data frame in EDD has {len(old_edd_yoneda_data)} rows')
old_edd_yoneda_data.head()

The Yoneda data frame in EDD has 71501 rows


Unnamed: 0,Line Name,Measurement Type,Time,Value,Units,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,WT-G-R1,PD630_LPD06575,14,708,counts,,,,
1,WT-G-R1,PD630_LPD06576,14,6513,counts,,,,
2,WT-G-R1,PD630_LPD00131,14,1015,counts,,,,
3,WT-G-R1,PD630_LPD06740,14,289,counts,,,,
4,WT-G-R1,PD630_LPD06741,14,1159,counts,,,,


In [3]:
print('The set of line names in Yoneda data already in EDD is:')
_ = [print(line_name) for line_name in set(old_edd_yoneda_data["Line Name"])]

The set of line names in Yoneda data already in EDD is:
EVOL33-G-R1
EVOL33-L-R1
EVOL40-G-R1
EVOL40-H-R1
WT-G-R1
EVOL40-L-R1
EVOL33-H-R1
WT-L-R1


Define a function to find the set of times used for each line name

In [4]:
def line_name_to_set_of_times(line_name):
    return set(old_edd_yoneda_data[old_edd_yoneda_data['Line Name'] == line_name].Time)

In [5]:
print('The set of times used for each line name in the Yoneda data already in EDD is:')
_ = [print(line_name, line_name_to_set_of_times(line_name)) for line_name in set(old_edd_yoneda_data["Line Name"])]

The set of times used for each line name in the Yoneda data already in EDD is:
EVOL33-G-R1 {14}
EVOL33-L-R1 {24}
EVOL40-G-R1 {14}
EVOL40-H-R1 {32}
WT-G-R1 {14}
EVOL40-L-R1 {24}
EVOL33-H-R1 {32}
WT-L-R1 {24}


There is a pattern: all 'G' lines have time 14, all 'L' lines have time 24, and all 'H' lines have time 32.
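The pattern can also be checked programmatically rather than by eye. The following is a minimal sketch on a toy frame that copies the line names and times printed above; the `Code` column and the `times_by_code` name are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the EDD export: one row per line name, times as printed above.
toy = pd.DataFrame({
    "Line Name": ["WT-G-R1", "WT-L-R1", "EVOL33-G-R1", "EVOL33-L-R1",
                  "EVOL33-H-R1", "EVOL40-G-R1", "EVOL40-L-R1", "EVOL40-H-R1"],
    "Time": [14, 24, 14, 24, 32, 14, 24, 32],
})

# The carbon-source code is the second dash-separated field of the line name.
toy["Code"] = toy["Line Name"].str.split("-").str[1]

# Collect the distinct times observed for each code (plain ints for readable printing).
times_by_code = {
    code: sorted(int(t) for t in times.unique())
    for code, times in toy.groupby("Code")["Time"]
}
print(times_by_code)
```

If any code mapped to more than one time, its list would have more than one entry, which would break the one-time-per-condition assumption used later in the notebook.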

### Characterization of Yoneda data from Winston

In [7]:
# sep=r'\s+' is equivalent to delim_whitespace=True, which is deprecated in recent pandas releases
yoneda_data = pd.read_table('../winston_data/yoneda/yoneda_reprocess_FKPM_melted.txt', sep=r'\s+')
print(f'There are {len(yoneda_data)} lines in data Winston recently sent')
yoneda_data.head()

There are 192168 lines in data Winston recently sent


Unnamed: 0,Strain,variable,Condition,Replicate,value,Units
1,3A,WP_000104864.1,1g/L_glucose,1,0.0,FKPM
2,3A,WP_000104864.1,1g/L_glucose,2,0.0,FKPM
3,3A,WP_000104864.1,1g/L_glucose,3,0.0,FKPM
4,3B,WP_000104864.1,0.75g/L_phenol,1,0.0,FKPM
5,3B,WP_000104864.1,0.75g/L_phenol,2,0.0,FKPM


There are roughly 2.7 times as many data points in the new data Winston sent (192168 rows versus 71501), which suggests the original version was missing replicates: every line name already in EDD ends in R1.

In [8]:
print(f'The set of strains in Yoneda data Winston sent are {set(yoneda_data.Strain)}')
print(f'The set of conditions in Yoneda data Winston sent are {set(yoneda_data.Condition)}')
print(f'The set of replicates in Yoneda data Winston sent are {set(yoneda_data.Replicate)}')

The set of strains in Yoneda data Winston sent are {'3B', '4A', '4B', '4C', 'WA', 'WB', '3A', '3C'}
The set of conditions in Yoneda data Winston sent are {'0.75g/L_phenol', '1.5g/L_phenol', '1g/L_glucose'}
The set of replicates in Yoneda data Winston sent are {1, 2, 3}
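As a rough sanity check on the row counts printed above, the arithmetic below compares the two files and estimates how many measurements each line carries. It assumes that every one of the 8 strain codes has all 3 replicates and that every line has the same number of rows; neither assumption is verified here.

```python
old_rows = 71501    # rows in the data already in EDD (printed above)
new_rows = 192168   # rows in the data Winston sent (printed above)

# How much larger is the new file?
ratio = new_rows / old_rows
print(f"new / old = {ratio:.2f}")

# Assumption: 8 strain codes x 3 replicates = 24 lines, all the same length.
n_lines = 8 * 3
rows_per_line, remainder = divmod(new_rows, n_lines)
print(f"{rows_per_line} rows per line, remainder {remainder}")
```

A remainder of zero is consistent with (but does not prove) every line carrying the same set of measurement types.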


Define a function to find the set of conditions used for each strain

In [9]:
def strain_to_condition_set(strain):
    return set(yoneda_data[yoneda_data['Strain'] == strain].Condition)

In [10]:
print('The set of conditions for each strain in the data winston sent is:')
_ = [print(strain, strain_to_condition_set(strain)) for strain in set(yoneda_data.Strain)]

The set of conditions for each strain in the data winston sent is:
3B {'0.75g/L_phenol'}
4A {'1g/L_glucose'}
4B {'0.75g/L_phenol'}
4C {'1.5g/L_phenol'}
WA {'1g/L_glucose'}
WB {'0.75g/L_phenol'}
3A {'1g/L_glucose'}
3C {'1.5g/L_phenol'}


There is a pattern in the second character of the strain code: A = 1g/L_glucose, B = 0.75g/L_phenol, and C = 1.5g/L_phenol.

The first character encodes the strain: W = wild type, 3 = evol33, and 4 = evol40.
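The two-character strain codes can be decoded with a pair of lookup tables. This is only a sketch of the pattern described above; the names `STRAIN_BY_CHAR`, `CONDITION_BY_CHAR`, and `decode_strain_code` are hypothetical and do not appear elsewhere in the notebook.

```python
# First character of the code -> strain; second character -> growth condition.
STRAIN_BY_CHAR = {"W": "WT", "3": "EVOL33", "4": "EVOL40"}
CONDITION_BY_CHAR = {
    "A": "1g/L_glucose",
    "B": "0.75g/L_phenol",
    "C": "1.5g/L_phenol",
}

def decode_strain_code(code):
    """Split a two-character code such as '3B' into (strain, condition)."""
    strain_char, condition_char = code[0], code[1]
    # A KeyError here means the code does not follow the observed pattern.
    return STRAIN_BY_CHAR[strain_char], CONDITION_BY_CHAR[condition_char]

for code in ["WA", "WB", "3A", "3B", "3C", "4A", "4B", "4C"]:
    print(code, decode_strain_code(code))
```

A dictionary lookup fails loudly on an unexpected code, which is useful when the same conversion is later applied to other files.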

# Creation of EDD formatted dataframe

The column names of the data should be Line Name, Measurement Type, Time, Value, Units

| Column Name        | Information                                  |
| -----------        | -----------                                  |
| Line Name          | {Strain}-{carbon source}-R{replicate number} |
| Measurement Type   | gene annotation                              |
| Time               | measurement time in hours                    |
| Value              | measurement value                            |
| Units              | unit name                                    |

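Define a function that builds the line name from a row of the data Winston sent. Note that the names it produces, such as `EVOL33-LN-G-R1`, carry an extra `LN` field after the strain and use `LP` and `HP` for the 0.75g/L and 1.5g/L phenol conditions.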

In [13]:
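def row_to_line_name(row):
    # Strain prefix: W = wild type, 3 = evol33, 4 = evol40
    if row.Strain.startswith('W'):
        line_name = 'WT-LN'
    elif row.Strain.startswith('3'):
        line_name = 'EVOL33-LN'
    elif row.Strain.startswith('4'):
        line_name = 'EVOL40-LN'
    else:
        # Fail loudly instead of hitting an unbound variable below
        raise ValueError(f'Unrecognized strain: {row.Strain}')

    # Carbon-source code
    if row.Condition == '1g/L_glucose':
        line_name += '-G-R'
    elif row.Condition == '0.75g/L_phenol':
        line_name += '-LP-R'
    elif row.Condition == '1.5g/L_phenol':
        line_name += '-HP-R'
    else:
        raise ValueError(f'Unrecognized condition: {row.Condition}')

    # Replicate number
    line_name += str(row.Replicate)

    return line_name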

Add line names to the data Winston sent

In [14]:
yoneda_data['Line Name'] = [row_to_line_name(row) for _, row in yoneda_data.iterrows()]
yoneda_data.head()

Unnamed: 0,Strain,variable,Condition,Replicate,value,Units,Line Name
1,3A,WP_000104864.1,1g/L_glucose,1,0.0,FKPM,EVOL33-LN-G-R1
2,3A,WP_000104864.1,1g/L_glucose,2,0.0,FKPM,EVOL33-LN-G-R2
3,3A,WP_000104864.1,1g/L_glucose,3,0.0,FKPM,EVOL33-LN-G-R3
4,3B,WP_000104864.1,0.75g/L_phenol,1,0.0,FKPM,EVOL33-LN-LP-R1
5,3B,WP_000104864.1,0.75g/L_phenol,2,0.0,FKPM,EVOL33-LN-LP-R2


Add measurement types to the data Winston sent

In [15]:
yoneda_data['Measurement Type'] = [row.variable.replace('.', '_') for _, row in yoneda_data.iterrows()]
yoneda_data.head()

Unnamed: 0,Strain,variable,Condition,Replicate,value,Units,Line Name,Measurement Type
1,3A,WP_000104864.1,1g/L_glucose,1,0.0,FKPM,EVOL33-LN-G-R1,WP_000104864_1
2,3A,WP_000104864.1,1g/L_glucose,2,0.0,FKPM,EVOL33-LN-G-R2,WP_000104864_1
3,3A,WP_000104864.1,1g/L_glucose,3,0.0,FKPM,EVOL33-LN-G-R3,WP_000104864_1
4,3B,WP_000104864.1,0.75g/L_phenol,1,0.0,FKPM,EVOL33-LN-LP-R1,WP_000104864_1
5,3B,WP_000104864.1,0.75g/L_phenol,2,0.0,FKPM,EVOL33-LN-LP-R2,WP_000104864_1
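The period-to-underscore replacement can also be done on the whole column at once, which avoids iterating over rows. The sketch below uses a one-element toy Series holding the identifier shown above; passing `regex=False` makes the intent explicit, so that `'.'` is always treated as a literal period and never as a regular-expression wildcard.

```python
import pandas as pd

# Toy column holding the identifier from the table above.
ids = pd.Series(["WP_000104864.1"])

# Literal replacement of '.' with '_' across the whole column.
measurement_types = ids.str.replace(".", "_", regex=False)
print(measurement_types.tolist())
```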


Define a function to return the proper time based on the line name

In [16]:
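def line_name_to_time(line_name):
    # Times follow the pattern found in the data already in EDD
    if '-G-' in line_name:
        return 14
    if '-LP-' in line_name:
        return 24
    if '-HP-' in line_name:
        return 32
    # Fail loudly rather than silently returning None for an unknown condition
    raise ValueError(f'No time defined for line name: {line_name}')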

Add Time to the dataframe

In [17]:
yoneda_data['Time'] = [line_name_to_time(row['Line Name']) for _, row in yoneda_data.iterrows()]
yoneda_data.head()

Unnamed: 0,Strain,variable,Condition,Replicate,value,Units,Line Name,Measurement Type,Time
1,3A,WP_000104864.1,1g/L_glucose,1,0.0,FKPM,EVOL33-LN-G-R1,WP_000104864_1,14
2,3A,WP_000104864.1,1g/L_glucose,2,0.0,FKPM,EVOL33-LN-G-R2,WP_000104864_1,14
3,3A,WP_000104864.1,1g/L_glucose,3,0.0,FKPM,EVOL33-LN-G-R3,WP_000104864_1,14
4,3B,WP_000104864.1,0.75g/L_phenol,1,0.0,FKPM,EVOL33-LN-LP-R1,WP_000104864_1,24
5,3B,WP_000104864.1,0.75g/L_phenol,2,0.0,FKPM,EVOL33-LN-LP-R2,WP_000104864_1,24


Copy the `value` column into a capitalized `Value` column, which is the name EDD expects

In [18]:
yoneda_data['Value'] = yoneda_data['value']
yoneda_data.head()

Unnamed: 0,Strain,variable,Condition,Replicate,value,Units,Line Name,Measurement Type,Time,Value
1,3A,WP_000104864.1,1g/L_glucose,1,0.0,FKPM,EVOL33-LN-G-R1,WP_000104864_1,14,0.0
2,3A,WP_000104864.1,1g/L_glucose,2,0.0,FKPM,EVOL33-LN-G-R2,WP_000104864_1,14,0.0
3,3A,WP_000104864.1,1g/L_glucose,3,0.0,FKPM,EVOL33-LN-G-R3,WP_000104864_1,14,0.0
4,3B,WP_000104864.1,0.75g/L_phenol,1,0.0,FKPM,EVOL33-LN-LP-R1,WP_000104864_1,24,0.0
5,3B,WP_000104864.1,0.75g/L_phenol,2,0.0,FKPM,EVOL33-LN-LP-R2,WP_000104864_1,24,0.0


### Properly organize columns

In [19]:
# Selecting the EDD columns in order also discards Strain, variable, Condition, Replicate, and value
yoneda_data = yoneda_data[['Line Name', 'Measurement Type', 'Time', 'Value', 'Units']]
yoneda_data.head()

Unnamed: 0,Line Name,Measurement Type,Time,Value,Units
1,EVOL33-LN-G-R1,WP_000104864_1,14,0.0,FKPM
2,EVOL33-LN-G-R2,WP_000104864_1,14,0.0,FKPM
3,EVOL33-LN-G-R3,WP_000104864_1,14,0.0,FKPM
4,EVOL33-LN-LP-R1,WP_000104864_1,24,0.0,FKPM
5,EVOL33-LN-LP-R2,WP_000104864_1,24,0.0,FKPM


In [22]:
yoneda_data.to_csv('../winston_data/yoneda/yoneda_trans_edd_formatted_FPKM.csv', index=False)
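Before uploading a file like the one written above, it can help to read it back and confirm its shape. The following is a minimal sketch: `check_edd_frame` is a hypothetical helper, and the two checks it performs (exact column order, no missing values) are assumptions drawn from the column table in this notebook, not a statement of EDD's full import rules. An in-memory CSV stands in for the real file.

```python
import io
import pandas as pd

EDD_COLUMNS = ["Line Name", "Measurement Type", "Time", "Value", "Units"]

def check_edd_frame(frame):
    """Return a list of problems found (an empty list means none were found)."""
    problems = []
    if list(frame.columns) != EDD_COLUMNS:
        problems.append(f"unexpected columns: {list(frame.columns)}")
        return problems
    # A missing Time would appear if a line name matched no known condition.
    if frame.isna().any().any():
        problems.append("missing values present")
    return problems

# In-memory stand-in for the written CSV, using rows shown earlier in the notebook.
csv_text = """Line Name,Measurement Type,Time,Value,Units
EVOL33-LN-G-R1,WP_000104864_1,14,0.0,FKPM
EVOL33-LN-LP-R1,WP_000104864_1,24,0.0,FKPM
"""
frame = pd.read_csv(io.StringIO(csv_text))
print(check_edd_frame(frame))
```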

# Define function to convert all 4 versions of Yoneda data to EDD compatible csvs

In [17]:
def winston_yoneda_txt_file_to_EDD_csv(input_file_name, output_file_name):
    # Read the whitespace-delimited text file (sep=r'\s+' is equivalent to delim_whitespace=True)
    yoneda_data = pd.read_table(input_file_name, sep=r'\s+')

    # Build the EDD columns from the columns in Winston's file
    yoneda_data['Line Name'] = yoneda_data.apply(row_to_line_name, axis=1)
    yoneda_data['Measurement Type'] = yoneda_data['variable'].str.replace('.', '_', regex=False)
    yoneda_data['Time'] = yoneda_data['Line Name'].map(line_name_to_time)
    yoneda_data['Value'] = yoneda_data['value']

    # Keep only the EDD columns, in the expected order
    EDD_data = yoneda_data[['Line Name', 'Measurement Type', 'Time', 'Value', 'Units']]

    EDD_data.to_csv(output_file_name, index=False)


### Run the function 4 times to convert the text files to EDD formatted csv files

In [18]:
winston_yoneda_txt_file_to_EDD_csv('../winston_data/yoneda/yoneda_reprocess_CPM_melted.txt', '../winston_data/yoneda/yoneda_reprocess_CPM_melted.csv')

In [19]:
winston_yoneda_txt_file_to_EDD_csv('../winston_data/yoneda/yoneda_reprocess_FKPM_melted.txt', '../winston_data/yoneda/yoneda_reprocess_FPKM_melted.csv')

In [20]:
winston_yoneda_txt_file_to_EDD_csv('../winston_data/yoneda/yoneda_reprocess_MR_melted.txt', '../winston_data/yoneda/yoneda_reprocess_MR_melted.csv')

In [21]:
winston_yoneda_txt_file_to_EDD_csv('../winston_data/yoneda/yoneda_reprocess_TMM_melted.txt', '../winston_data/yoneda/yoneda_reprocess_TMM_melted.csv')
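The four calls above differ only in their file names, so they could be driven from a single table. The sketch below only builds and prints the (input, output) pairs so that it runs on its own; `DATA_DIR`, `FILE_PAIRS`, and `jobs` are illustrative names. The file names are copied from the calls above, including the `FKPM` spelling of the FPKM input file.

```python
from pathlib import PurePosixPath

DATA_DIR = PurePosixPath("../winston_data/yoneda")

# Input name -> output name, exactly as used in the four calls above.
FILE_PAIRS = {
    "yoneda_reprocess_CPM_melted.txt": "yoneda_reprocess_CPM_melted.csv",
    "yoneda_reprocess_FKPM_melted.txt": "yoneda_reprocess_FPKM_melted.csv",
    "yoneda_reprocess_MR_melted.txt": "yoneda_reprocess_MR_melted.csv",
    "yoneda_reprocess_TMM_melted.txt": "yoneda_reprocess_TMM_melted.csv",
}

jobs = [(str(DATA_DIR / src), str(DATA_DIR / dst)) for src, dst in FILE_PAIRS.items()]

for input_file_name, output_file_name in jobs:
    print(input_file_name, "->", output_file_name)
    # Inside the notebook, the conversion itself would be called here:
    # winston_yoneda_txt_file_to_EDD_csv(input_file_name, output_file_name)
```

Keeping the mapping explicit, rather than deriving the output name from the input name, preserves the FKPM-to-FPKM rename in the output file.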