# L1000 to RNA-seq conversion pipeline

This pipeline converts L1000 profiles to RNA-seq-like profiles. It takes 978-dimensional Level3 L1000 profiles as input and returns 25,312-dimensional RNA-seq-like profiles. In step 1, a cycleGAN model converts L1000 gene expression values to RNA-seq-like values for the landmark genes only. In step 2, the output profiles of step 1 are extrapolated to full-genome profiles covering 25,312 genes.
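
The data flow described above can be summarized as a dimension contract. The following is only a sketch: `step1_translate`, `step2_extrapolate`, and `convert` are hypothetical placeholder names, not functions from this repository, and the bodies are stand-ins that preserve shapes without doing any modeling.

```python
N_LANDMARK = 978   # landmark genes per Level3 L1000 profile (figure from the text)
N_FULL = 25312     # genes in the RNA-seq-like output profile (figure from the text)

def step1_translate(l1000_profile):
    """Placeholder for step 1: landmark L1000 values -> landmark RNA-seq-like values."""
    if len(l1000_profile) != N_LANDMARK:
        raise ValueError("expected %d landmark values" % N_LANDMARK)
    # Stand-in only: a trained cycleGAN generator would change the values here.
    return list(l1000_profile)

def step2_extrapolate(landmark_profile):
    """Placeholder for step 2: landmark genes -> full-genome profile."""
    if len(landmark_profile) != N_LANDMARK:
        raise ValueError("expected %d landmark values" % N_LANDMARK)
    # Stand-in only: a trained extrapolation model would predict these values.
    return [0.0] * N_FULL

def convert(l1000_profile):
    """Chain the two steps: 978 values in, 25,312 values out."""
    return step2_extrapolate(step1_translate(l1000_profile))
```

The point of the sketch is that step 1 keeps the dimensionality fixed (landmark genes in, landmark genes out) and only step 2 expands it.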



In [116]:
import os
import time

In [117]:
# saved model parameters
step1_exp_index = 17
step1_model_index = 99
step2_exp_index = 15



In [118]:
folder = "../data/LINCS_CFDE/L1000_GSE92742_landmark_only/"
preprocessing_folder = "../data/LINCS_CFDE/L1000_GSE92742_landmark_only_feather/"
output_folder_step1 = "../data/LINCS_CFDE/L1000_GSE92742_prediction_results_step1/"
output_folder_step2 = "../data/LINCS_CFDE/L1000_GSE92742_prediction_results_step2/"
filenames = os.listdir(folder)
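
`os.listdir` returns every entry in the folder, in arbitrary order, while the loop that follows assumes each name ends in `.gctx`. A minimal sketch of a stricter setup is shown here. The helpers `list_gctx_files` and `ensure_folders` are hypothetical names introduced for illustration, and it is an assumption that the output folders may not exist yet (the scripts may well create them on their own).

```python
import os

def list_gctx_files(folder):
    """Return the .gctx file names in `folder`, sorted for a reproducible order."""
    return sorted(name for name in os.listdir(folder) if name.endswith(".gctx"))

def ensure_folders(*folders):
    """Create each folder if it is missing; existing folders are left untouched."""
    for path in folders:
        os.makedirs(path, exist_ok=True)
```

With these helpers, `filenames = list_gctx_files(folder)` could replace the plain `os.listdir(folder)` call, and `ensure_folders(preprocessing_folder, output_folder_step1, output_folder_step2)` could run once before the loop.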

In [120]:
for filename in filenames:
    start_time = time.time()
    # Derive the per-file output names from the input .gctx name.
    preprocessing_filename = filename.replace(".gctx", ".f")
    prediction_filename_step1 = filename.replace(".gctx", ".txt")
    prediction_filename_step2 = filename.replace(".gctx", ".f")

    # Preprocessing: convert the .gctx input into a .f file in preprocessing_folder.
    !python functions/preprocessing_input_data.py --input_filename $folder$filename --output_filename $preprocessing_folder$preprocessing_filename

    # Step 1: cycleGAN prediction for the landmark genes.
    !python functions/cyclegan_transcript.py --ispredicting --exp_index $step1_exp_index --load_model_index $step1_model_index --eval_dataset_nameA $preprocessing_folder$preprocessing_filename --y_pred_output_filename $prediction_filename_step1 --prediction_folder $output_folder_step1

    # Step 2: extrapolation from the step 1 output to the full-genome profile.
    !python functions/extrapolation_transcript.py --ispredicting --exp_index $step2_exp_index --eval_input_dataset_name $output_folder_step1$prediction_filename_step1 --y_pred_output_filename $prediction_filename_step2 --prediction_folder $output_folder_step2

    # Elapsed wall-clock time for this file.
    print(time.time() - start_time, "sec")
    # break  # uncomment to stop after the first file
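
The loop above relies on IPython's `!` shell syntax, so it only runs inside a notebook or IPython session. A rough equivalent for a plain Python script, using `subprocess.run`, might look like the sketch below. The script paths and flags are copied from the cell above. `build_commands` and `run_pipeline` are hypothetical helper names, and using `sys.executable` instead of the bare `python` command is a substitution made here, not something the notebook does.

```python
import os
import subprocess
import sys

def build_commands(filename, folder, preprocessing_folder,
                   output_folder_step1, output_folder_step2,
                   step1_exp_index, step1_model_index, step2_exp_index):
    """Build the three command lines (preprocessing, step 1, step 2) for one file."""
    stem = os.path.splitext(filename)[0]      # "x.gctx" -> "x"
    feather_path = preprocessing_folder + stem + ".f"
    step1_name = stem + ".txt"
    step2_name = stem + ".f"
    py = sys.executable                       # assumption: reuse the current interpreter
    return [
        [py, "functions/preprocessing_input_data.py",
         "--input_filename", folder + filename,
         "--output_filename", feather_path],
        [py, "functions/cyclegan_transcript.py", "--ispredicting",
         "--exp_index", str(step1_exp_index),
         "--load_model_index", str(step1_model_index),
         "--eval_dataset_nameA", feather_path,
         "--y_pred_output_filename", step1_name,
         "--prediction_folder", output_folder_step1],
        [py, "functions/extrapolation_transcript.py", "--ispredicting",
         "--exp_index", str(step2_exp_index),
         "--eval_input_dataset_name", output_folder_step1 + step1_name,
         "--y_pred_output_filename", step2_name,
         "--prediction_folder", output_folder_step2],
    ]

def run_pipeline(filename, **settings):
    """Run the three stages in order; a failing stage raises CalledProcessError."""
    for cmd in build_commands(filename, **settings):
        subprocess.run(cmd, check=True)
```

Passing `check=True` stops the pipeline as soon as one stage exits with a non-zero status, whereas the `!` form continues to the next stage regardless.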

Loading L1000 data..... ../data/LINCS_CFDE/L1000_GSE92742_landmark_only/L1000_GSE92742_80.gctx
Saved! ../data/LINCS_CFDE/L1000_GSE92742_landmark_only_feather/L1000_GSE92742_80.f
{'epoch_resume': 0, 'n_epochs': 500, 'dataset_nameA': 'L1000', 'dataset_nameB': 'ARCHS4', 'batch_size': 100, 'lr': 0.0002, 'b1': 0.9, 'b2': 0.999, 'decay_epoch': 50, 'n_cpu': 8, 'input_dimA': 962, 'hidden_dimA': 512, 'output_dimA': 128, 'input_dimB': 962, 'hidden_dimB': 512, 'output_dimB': 128, 'num_samples': 50000, 'sample_interval': 100, 'checkpoint_interval': 10, 'n_residual_blocks': 1, 'lambda_cyc': 10.0, 'lambda_id': 5.0, 'load_model_index': 100, 'eval_dataset_nameA': 'GTEx', 'eval_dataset_nameB': 'GTEx', 'exp_index': 10, 'ispredicting': False, 'cell_line': None, 'gamma': 0.1, 'shuffle': False, 'evaluation': False, 'data_version': 'v2', 'y_true_output_filename': None, 'y_pred_output_filename': None}
Namespace(b1=0.9, b2=0.999, batch_size=1, cell_line=None, checkpoint_interval=10, dataset_nameA='L1000', dat