# Description

Generates the training and testing CSVs used for CNN_6_0 and later. Details:
 - Uses an 80% training / 20% testing split
 - Normalizes the expression from 0 (no expression) to 1 (highest expression). This is the inverse of 'Observed log(TX/Txref)': the code takes the absolute value of that column and min-max scales it
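
The normalization step can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code (which uses scikit-learn's `MinMaxScaler`): the helper name `normalize_abs_minmax` and the sample values are hypothetical.

```python
def normalize_abs_minmax(values):
    """Min-max scale the absolute values of `values` into the range [0, 1]."""
    magnitudes = [abs(v) for v in values]
    lo, hi = min(magnitudes), max(magnitudes)
    if hi == lo:
        # All magnitudes are equal; return zeros to avoid dividing by zero.
        return [0.0 for _ in magnitudes]
    return [(m - lo) / (hi - lo) for m in magnitudes]


# Made-up log(TX/Txref) values, for illustration only (not taken from the dataset).
sample = [-4.0, -1.0, 0.0, -2.0]
normalized = normalize_abs_minmax(sample)
print(normalized)  # [1.0, 0.25, 0.0, 0.5]
```

With this scheme the value with the largest magnitude maps to 1 and the value with the smallest magnitude maps to 0, regardless of sign.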

It does not include augmented data. It only uses the data from La Fleur's supplemental materials, which include:
 - La Fleur et al (and De Novo Designs)
 - Urtecho et al
 - Hossain et al
 - Yu et al
 - Lagator (36N, Pl, and Pr)
 - Anderson Series

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [2]:
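# Load the combined supplemental data
file_path = '../Data/combined/LaFleur_supp.csv'
df = pd.read_csv(file_path)

# Convert the promoter sequences to upper case
df['Promoter Sequence'] = df['Promoter Sequence'].str.upper()

# Take the absolute value of the observed log(TX/Txref), then min-max scale it to [0, 1].
# Note: the scaler is fit on the full dataset, before the train/test split.
df['Normalized Expression'] = MinMaxScaler().fit_transform(df[['Observed log(TX/Txref)']].abs())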

# Split the dataframe into 80% training and 20% testing
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), 'training samples')
print(len(test_df), 'testing samples')

39013 training samples
9754 testing samples
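
The counts above match the usual way a fractional `test_size` is converted into row counts, assuming scikit-learn rounds the test count up and assigns the remaining rows to training. A short check of that arithmetic, where the total of 48767 rows is simply the sum of the two printed counts and the helper name `split_sizes` is hypothetical:

```python
import math


def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test), rounding the test count up."""
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# Total row count inferred from the printed output: 39013 + 9754.
n_total = 39013 + 9754
n_train, n_test = split_sizes(n_total, 0.2)
print(n_train, 'training samples')  # 39013 training samples
print(n_test, 'testing samples')    # 9754 testing samples
```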


In [3]:
train_df.head()

| Unnamed: 0 | File Name | Upstream DNA | Promoter Sequence | Downstream DNA | Observed log(TX/Txref) | Normalized Expression |
|---|---|---|---|---|---|---|
| 3287 | La Fleur et al (Fig 3a).csv | CTCGGTACCAAATTCCAGAA | TTTTCTATCTACGTACTCTTGGCTATTTCCTATTTCTCTTATAATT... | GAATTCGATCAAATTTCGAG | -2.529635 | 0.18577 |
| 8661 | Urtecho et al (Fig 3c, S7b).csv | | TTGCGGTTTTTTCGGTTCAATCACCGCCTGCTGACGAGCTGGGCGC... | | -1.505491 | 0.110559 |
| 15653 | Urtecho et al (Fig 3c, S7b).csv | | AGCCGCTTTTAGCGGACGACGTGAGTAAACAAAACCCAGACATCAT... | | -1.700999 | 0.124917 |
| 41540 | Lagator Pr.csv | | GCGCCCGCTGATCCTCCTCGAGGATAAATATCTAATACCGTGCGTG... | | -5.049856 | 0.370848 |
| 36243 | Lagator Pl.csv | | GCGCCCGCTGATCCTCCTCGAGGATAAATATTACACACAGGTGGTG... | | -1.473306 | 0.108196 |


In [4]:
# Save the training and testing data as CSV files
train_df.to_csv('../Data/Train Test/train_data.csv', index=False)
test_df.to_csv('../Data/Train Test/test_data.csv', index=False)
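
`DataFrame.to_csv` writes into an existing directory and does not create missing folders, so it can help to make sure `../Data/Train Test/` exists before saving. A minimal sketch using only the standard library; the helper name `ensure_parent_dir` is hypothetical, and a temporary directory stands in for the real project path:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of `file_path` if it is missing and return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# A temporary directory stands in for the project root in this sketch.
with tempfile.TemporaryDirectory() as tmp_root:
    train_path = ensure_parent_dir(Path(tmp_root) / 'Data' / 'Train Test' / 'train_data.csv')
    created = train_path.parent.is_dir()
    print(created)  # True
```

In the notebook, the returned path could be passed directly to `train_df.to_csv(...)` with `index=False`, as in the cell above.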