# L1000 to RNA-seq conversion pipeline - Training, Predicting & Evaluating

This notebook runs an L1000 to RNA-seq conversion pipeline. The pipeline takes 978-dimensional Level 3 L1000 profiles as input and returns 25,312-dimensional RNA-seq-like profiles. In step 1, a cycleGAN model converts L1000 expression values to RNA-seq-like values for the landmark genes only. In step 2, a second model takes the step 1 output and extrapolates it to the full 25,312-gene profile.



In [4]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

import os
import pandas as pd
import numpy as np
import random
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr
from numpy.random import seed
import umap
from sklearn.manifold import TSNE
import time

randomState = 123
seed(randomState)         # seed NumPy's global RNG
random.seed(randomState)  # seed Python's built-in RNG as well

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Parameters

In [5]:
step1_exp_index = 30
step2_exp_index = 30
num_samples = 50000

step1_y_true_filename = "y_true_L1000_MCF7.txt"
step1_y_pred_filename = "y_pred_L1000_MCF7.txt"
step2_y_true_filename = "y_true_ARCHS4_MCF7.txt"
step2_y_pred_filename = "y_pred_ARCHS4_MCF7.txt"
eval_dataset_nameA = "L1000_MCF7"
eval_dataset_nameB = "ARCHS4_MCF7_landmark"

# step1_y_true_filename = "y_true_L1000_GTEx.txt"
# step1_y_pred_filename = "y_pred_L1000_GTEx.txt"
# step2_y_true_filename = "y_true_ARCHS4_GTEx.txt"
# step2_y_pred_filename = "y_pred_ARCHS4_GTEx.txt"
# eval_dataset_nameA = "GTEx_L1000"
# eval_dataset_nameB = "GTEx_RNAseq_landmark"
# eval_output_dataset_name = "GTEx_RNAseq"




## Training: Step 1

In [15]:
!python functions/delete.py --exp_index $step1_exp_index

In [16]:
!python functions/cyclegan_transcript.py --dataset_nameA "L1000" --dataset_nameB "ARCHS4" --n_epochs 100 --decay_epoch 50 --input_dimA 962 --hidden_dimA 512 --output_dimA 128 --input_dimB 962 --hidden_dimB 512 --output_dimB 128 --num_samples $num_samples --batch_size 100 --exp_index $step1_exp_index --prediction_folder "../output/"$step1_exp_index"/prediction/" --lambda_id 0.0 --benchmark_evaluation --eval_dataset_nameA $eval_dataset_nameA --eval_dataset_nameB $eval_dataset_nameB 

[34m[1mwandb[0m: Currently logged in as: [33mmjjeon[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.10.24 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.10.10
[34m[1mwandb[0m: Syncing run [33mastral-frog-45[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/mjjeon/L1000toRNAseq_step1[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/mjjeon/L1000toRNAseq_step1/runs/37idc7mr[0m
[34m[1mwandb[0m: Run data is saved locally in /home/maayanlab/Projects/minji/cycleGAN_gene_expression/scripts/wandb/run-20210331_174655-37idc7mr
[34m[1mwandb[0m: Run `wandb off` to turn off syncing.

Namespace(b1=0.9, b2=0.999, batch_size=100, benchmark_evaluation=True, cell_line=None, checkpoint_interval=10, dataset_nameA='L1000', dataset_nameB='ARCHS4', decay_epoch=50, epoch_resume=0, eval_dataset_nameA='L1000_MCF7', eval_dataset_na

## Training: Step 2

In [24]:
!python functions/delete.py --exp_index $step2_exp_index --step2

In [25]:
!python functions/extrapolation_transcript.py --input_dataset_name "ARCHS4_50000_input" --output_dataset_name "ARCHS4_50000_output" --n_epochs 100 --decay_epoch 10 --input_dim 962 --hidden_dim 2048 4096 8192 --output_dim 23614 --num_samples $num_samples --batch_size 100 --exp_index $step2_exp_index --valid_ratio 0.01 --test_ratio 0.01 --y_pred_output_filename "y_pred.txt" --y_true_output_filename "y_true.txt" --early_stopping --early_stopping_epoch 3 --early_stopping_tol 0.0001 --prediction_folder ../output_step2/$step2_exp_index/prediction

[34m[1mwandb[0m: Currently logged in as: [33mmjjeon[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.10.24 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade
[34m[1mwandb[0m: Tracking run with wandb version 0.10.10
[34m[1mwandb[0m: Syncing run [33mclassic-oath-84[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/mjjeon/L1000toRNAseq_step2[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/mjjeon/L1000toRNAseq_step2/runs/1kdfocn6[0m
[34m[1mwandb[0m: Run data is saved locally in /home/maayanlab/Projects/minji/cycleGAN_gene_expression/scripts/wandb/run-20210331_180215-1kdfocn6
[34m[1mwandb[0m: Run `wandb off` to turn off syncing.

Namespace(b1=0.9, b2=0.999, batch_size=100, cell_line=None, checkpoint_interval=10, decay_epoch=10, early_stopping=True, early_stopping_epoch=3, early_stopping_tol=0.0001, epoch_resume=0, eval_exp_index=8, eval_input_dataset_name='GTEx',

## Predicting: Step 0 Preprocessing input file (Optional)

This optional step converts the 978 landmark genes in a GCTX file to the 962 landmark genes used by the models and writes them in feather format.

In [10]:
!python functions/preprocessing_input_data.py --input_filename ../data/LINCS_CFDE/L1000_GSE92742_landmark_only/L1000_GSE92742_1.gctx --output_filename ../data/LINCS_CFDE/L1000_GSE92742_landmark_only_feather/L1000_GSE92742_1.gctx

Traceback (most recent call last):
  File "functions/preprocessing_input_data.py", line 40, in <module>
    main()
  File "functions/preprocessing_input_data.py", line 17, in main
    with open(opt.gene_names, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/overlap_landmark_file.txt'
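The traceback above shows that the preprocessing script reads a gene-name file (`opt.gene_names`) and fails when that file is absent. The internals of `preprocessing_input_data.py` are not shown here, so the following is only a minimal sketch of the gene-subsetting idea. It assumes a gene list with one symbol per line and a DataFrame with samples as rows and genes as columns; `read_gene_list` and `subset_to_genes` are hypothetical helper names, not functions from the repository.

```python
from pathlib import Path

import pandas as pd


def read_gene_list(path):
    """Read gene symbols, one per line (assumed format), skipping blank lines."""
    path = Path(path)
    if not path.is_file():
        # Fail with an explicit message instead of a bare open() error.
        raise FileNotFoundError(f"Gene list not found: {path}")
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def subset_to_genes(profiles: pd.DataFrame, genes):
    """Keep only the columns named in `genes`, in that order.

    Assumes samples are rows and genes are columns.
    """
    missing = [g for g in genes if g not in profiles.columns]
    if missing:
        raise KeyError(f"{len(missing)} genes missing from input, e.g. {missing[:5]}")
    return profiles.loc[:, genes]


# Tiny in-memory demonstration with made-up values.
demo = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    index=["sample_1", "sample_2"],
    columns=["EGFR", "MYC", "TP53"],
)
print(subset_to_genes(demo, ["TP53", "EGFR"]))
```

To store the result as feather, pandas offers `DataFrame.to_feather`, which needs `pyarrow` and a default index, so a `reset_index()` call is typically required first.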


## Predicting: Step 1 Running cycleGAN (L1000->RNA-seq)

Input file format: feather
Output file format: txt (tab-separated)
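For inspecting these files in the notebook, a small loader along the following lines may help. It is a sketch under assumptions: `load_profiles` is a hypothetical helper, the `.f` suffix is treated as feather because the evaluation files in this notebook use it, the sample IDs in feather files are assumed to sit in a column named `index`, and the tab-separated output is assumed to carry sample IDs in its first column.

```python
from pathlib import Path

import pandas as pd


def load_profiles(path, index_col="index"):
    """Load an expression matrix from feather or tab-separated text."""
    path = Path(path)
    if path.suffix in {".f", ".feather"}:
        df = pd.read_feather(path)  # requires pyarrow
        # Feather stores no row index, so sample IDs are assumed to be a column.
        if index_col in df.columns:
            df = df.set_index(index_col)
        return df
    # Assumption: first column of the text file holds the sample IDs.
    return pd.read_csv(path, sep="\t", index_col=0)
```

For example, `load_profiles("../prediction/step1.txt")` would return a samples-by-genes DataFrame once step 1 prediction has produced that file.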

In [11]:
!python functions/cyclegan_transcript.py --ispredicting --exp_index $step1_exp_index --load_model_index $step1_model_index --eval_dataset_nameA ../data/Evaluation/GSE92742_Broad_LINCS_Level3_INF_mlr12k_n203x962_celllineMCF7.f --y_pred_output_filename step1.txt --prediction_folder "../prediction/" 

usage: cyclegan_transcript.py [-h] [--epoch_resume EPOCH_RESUME]
                              [--n_epochs N_EPOCHS]
                              [--dataset_nameA DATASET_NAMEA]
                              [--dataset_nameB DATASET_NAMEB]
                              [--batch_size BATCH_SIZE] [--lr LR] [--b1 B1]
                              [--b2 B2] [--weight_decay WEIGHT_DECAY]
                              [--decay_epoch DECAY_EPOCH] [--n_cpu N_CPU]
                              [--input_dimA INPUT_DIMA]
                              [--hidden_dimA HIDDEN_DIMA]
                              [--output_dimA OUTPUT_DIMA]
                              [--input_dimB INPUT_DIMB]
                              [--hidden_dimB HIDDEN_DIMB]
                              [--output_dimB OUTPUT_DIMB]
                              [--num_samples NUM_SAMPLES]
                              [--sample_interval SAMPLE_INTERVAL]
                              [--checkpoint_interval CHECKPOINT_INTERVA

## Predicting: Step 2 Extrapolating (962 dim RNA-seq -> 25,312 dim RNA-seq)

In [12]:
!python functions/extrapolation_transcript.py --ispredicting --exp_index $step2_exp_index --eval_input_dataset_name ../prediction/step1.txt --y_pred_output_filename step2.txt --prediction_folder "../prediction/"

Traceback (most recent call last):
  File "functions/extrapolation_transcript.py", line 151, in <module>
    with open(log_folder+"args.txt", "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '../output_step2/30/logs/args.txt'
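The traceback shows that prediction mode reads `args.txt` from the log folder of the step 2 experiment, so it fails when no trained run exists for the given `exp_index` (for example, after `delete.py --step2` has cleared it and training has not finished). A quick pre-flight check before launching the command could look like the sketch below. The path layout is copied from the error message; `step2_args_path` and `missing_files` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def step2_args_path(exp_index, output_root="../output_step2"):
    """Build the args.txt path using the layout seen in the traceback."""
    return Path(output_root) / str(exp_index) / "logs" / "args.txt"


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [str(p) for p in paths if not Path(p).is_file()]


required = [step2_args_path(30), Path("../prediction/step1.txt")]
absent = missing_files(required)
if absent:
    print("Missing inputs for step 2 prediction:", absent)
```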


## Evaluation
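The internals of `functions/evaluation.py` are not shown in this notebook. As a rough illustration of how predicted and measured profiles can be compared, the sketch below computes per-sample Pearson and Spearman correlations over the genes and samples the two matrices share. The function names are hypothetical, and this is not necessarily the metric the script reports.

```python
import numpy as np
import pandas as pd


def rowwise_pearson(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pearson correlation between matching rows of two equally shaped arrays."""
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (b ** 2).sum(axis=1))
    # Rows with zero variance give a zero denominator and therefore NaN.
    return (a * b).sum(axis=1) / denom


def per_sample_correlations(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> pd.DataFrame:
    """Correlate each sample's true and predicted profile (samples as rows)."""
    genes = y_true.columns.intersection(y_pred.columns)
    samples = y_true.index.intersection(y_pred.index)
    t = y_true.loc[samples, genes]
    p = y_pred.loc[samples, genes]
    pearson = rowwise_pearson(t.to_numpy(dtype=float), p.to_numpy(dtype=float))
    # Spearman correlation equals the Pearson correlation of (average) ranks.
    spearman = rowwise_pearson(
        t.rank(axis=1).to_numpy(dtype=float),
        p.rank(axis=1).to_numpy(dtype=float),
    )
    return pd.DataFrame({"pearson": pearson, "spearman": spearman}, index=samples)
```

Aligning on shared sample IDs and gene symbols first avoids silently comparing mismatched rows or columns when the two files are ordered differently.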

In [14]:
!python functions/evaluation.py --y_true ../data/Evaluation/ARCHS4_human_matrix_v9_n203x25312_celllineMCF7.f --y_pred ../prediction/step2.txt

Loading... ../data/Evaluation/ARCHS4_human_matrix_v9_n203x25312_celllineMCF7.f
Loading... ../prediction/step2.txt
                A1BG      A1CF       A2M  ...       ZYX     ZZEF1      ZZZ3
index                                     ...                              
GSM1244820  0.499148  0.077632  0.146216  ...  2.026383  1.929692  1.914497
GSM1069746  0.624828  0.158088  0.042253  ...  2.001709  2.007102  1.742260
GSM942209   0.740817  0.033552  0.184026  ...  1.668504  2.165832  1.819612
GSM1244818  0.484494  0.000000  0.043525  ...  2.023462  1.996412  1.838877
GSM1244822  0.450148  0.066832  0.130643  ...  2.004296  1.941199  1.883010
...              ...       ...       ...  ...       ...       ...       ...
GSM3538846  0.847936  0.088696  0.020267  ...  1.614222  2.083172  2.243796
GSM4081348  1.266858  0.000000  0.494607  ...  1.999887  1.888118  1.464791
GSM4081349  1.009861  0.000000  0.072061  ...  1.798051  1.837976  1.745063
GSM4081352  1.119146  0.160137  0.398369  ...  1.8