# Predicting Drug-Induced Gene Expression

### Project Introduction

This project tackles a core challenge in drug discovery: predicting how a drug will affect a cell's biology before it's ever tested in the lab. 
```
Gene Expression Change = f(Drug, Cell Type)
```
We use the LINCS L1000 (GSE92742) dataset, one of the largest public collections of drug-perturbation gene expression signatures, because it was designed for exactly this project's goal: predicting how a cell's gene expression changes in response to a drug. The dataset covers each variable in the equation above:

  * **Gene Expression Change**: the Z-scores stored in the .gctx file.

  * **Drug**: the perturbation applied in each experiment (in the sig_info and pert_info files).

  * **Cell Type**: the cell line used in each experiment (also in the sig_info file).
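
As a rough illustration of how these three variables line up, the sketch below joins a toy expression matrix to toy signature metadata. The column names `sig_id`, `pert_id`, and `cell_id` are assumed to match the sig_info file, and every value is invented for the example.

```python
import pandas as pd

# Toy stand-in for the sig_info metadata (invented values; column names are assumptions).
sig_info = pd.DataFrame(
    {
        "sig_id": ["sig_a", "sig_b", "sig_c"],
        "pert_id": ["drug_1", "drug_2", "drug_1"],
        "cell_id": ["A375", "A375", "MCF7"],
    }
)

# Toy stand-in for the expression matrix: rows are genes, columns are signatures.
expression = pd.DataFrame(
    [[0.5, -1.2, 2.0], [1.1, 0.3, -0.7]],
    index=["gene_1", "gene_2"],
    columns=["sig_a", "sig_b", "sig_c"],
)

# One row per signature: the (drug, cell type) inputs next to the expression-change targets.
samples = sig_info.set_index("sig_id").join(expression.T)
```

Transposing the matrix puts one experiment on each row, which is the layout most model-training code expects.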

We build a deep generative model to address this challenge by learning how different drugs change gene expression. By mapping post-treatment gene expression to a compact latent space with a Variational Autoencoder (VAE), the model can infer a gene expression profile for a novel drug that it has never seen before.

The steps carried out in this notebook include:

  1. **Exploratory Data Analysis (EDA)**: ...
  
  

--- 
### References

**Data Sources and Platforms**

* [LINCS L1000 (GSE92742)](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742): The LINCS L1000 dataset used in this notebook, a large public resource of gene expression signatures measured in response to drug perturbations.

**Libraries**
* [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/en/latest/)

---
### Table of Contents

1. [Imports](#imports)
2. [Utility Functions](#utility)
3. [Constants & Data Import](#constants)
4. [Data Exploration](#data-exploration)
5. [Data Pre-processing](#data-pre-processing)
6. [Build the Model](#model)
7. [Clustering & Visualization](#clustering-&-visualization)
8. [Summary](#summary)

---
### 1. Imports <a class="anchor" id="imports"></a>

In [1]:
import os
from pathlib import Path
from typing import Tuple

import matplotlib.pyplot as plt
import seaborn as sns

import scanpy as sc
import squidpy as sq

import pandas as pd
import numpy as np

from sklearn.metrics import roc_auc_score

import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from torch_geometric.utils import negative_sampling, train_test_split_edges

from tqdm.autonotebook import tqdm
from sklearn.preprocessing import StandardScaler


---
### 2. Utility Functions <a class="anchor" id="utility"></a>

#### Model Architecture

Define the Variational Autoencoder

In [2]:
pass
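
The cell above is still a placeholder. As a minimal sketch of what a VAE over expression profiles could look like in plain PyTorch, consider the block below. The class name `ExpressionVAE`, the layer sizes, and the `beta` weight are illustrative assumptions rather than this notebook's final architecture, and the sketch leaves out any drug or cell-type conditioning.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpressionVAE(nn.Module):
    """Hypothetical VAE over gene expression vectors (illustrative only)."""

    def __init__(self, n_genes: int, hidden_dim: int = 256, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, hidden_dim), nn.ReLU())
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, n_genes)
        )

    def encode(self, x):
        # Map an expression profile to the mean and log-variance of q(z | x).
        h = self.encoder(x)
        return self.to_mu(h), self.to_logvar(h)

    def reparameterize(self, mu, logvar):
        # Sample z = mu + sigma * eps so gradients flow through mu and logvar.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar


def vae_loss(recon, x, mu, logvar, beta: float = 1.0):
    """Mean squared reconstruction error plus a beta-weighted KL term to N(0, I)."""
    recon_loss = F.mse_loss(recon, x, reduction="mean")
    # KL divergence summed over latent dimensions, averaged over the batch.
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    return recon_loss + beta * kl
```

A forward pass would look like `recon, mu, logvar = model(x)` followed by `vae_loss(recon, x, mu, logvar)`, where `x` is a float tensor of shape `(batch_size, n_genes)`.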

#### Training Functions

In [3]:
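def train(model, optimizer, data):
    """Train the model for one epoch.

    Args:
        model: Model exposing `encode` and a `recon_loss` that returns `(loss, auc_score)`.
        optimizer: Optimizer that updates the model parameters.
        data: Graph data with `x` and `train_pos_edge_index`.

    Returns:
        tuple: The reconstruction loss (float) and the AUC score on the training edges.
    """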
    model.train()
    optimizer.zero_grad()
    z = model.encode(data.x, data.train_pos_edge_index)
    # calculate the loss on the training edges
    loss, auc_score = model.recon_loss(z, data.train_pos_edge_index)

    loss.backward()
    optimizer.step()
    return float(loss), auc_score


@torch.no_grad()
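def test(model, data):
    """Evaluate the model on the validation edges.

    Args:
        model: Model exposing `encode` and a `recon_loss` that returns `(loss, auc_score)`.
        data: Graph data with `x`, `train_pos_edge_index`, and `val_pos_edge_index`.

    Returns:
        tuple: The reconstruction loss (float) and the AUC score on the validation edges.
    """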
    model.eval()
    z = model.encode(data.x, data.train_pos_edge_index)

    loss, auc_score = model.recon_loss(z, data.val_pos_edge_index)

    return float(loss), auc_score
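
One way the two functions above might be driven is a simple epoch loop with early stopping. The helper below is a sketch under that assumption: `fit` and `patience` are hypothetical names, and `train_step` and `eval_step` stand in for closures such as `lambda: train(model, optimizer, data)` and `lambda: test(model, data)`.

```python
from typing import Callable, Dict, List, Tuple

# Each step returns (loss, auc_score), mirroring train() and test().
Step = Callable[[], Tuple[float, float]]


def fit(train_step: Step, eval_step: Step, n_epochs: int, patience: int = 5) -> List[Dict[str, float]]:
    """Run up to n_epochs and stop once validation loss fails to improve `patience` times in a row."""
    history: List[Dict[str, float]] = []
    best_val_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(1, n_epochs + 1):
        train_loss, train_auc = train_step()
        val_loss, val_auc = eval_step()
        history.append(
            {
                "epoch": epoch,
                "train_loss": train_loss,
                "train_auc": train_auc,
                "val_loss": val_loss,
                "val_auc": val_auc,
            }
        )

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stalled

    return history
```

Passing the steps as callables keeps the loop independent of the model, so the same helper works whichever architecture ends up in the cell above.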

---
### 3. Constants & Data Import <a class="anchor" id="constants"></a>


The necessary data files can be accessed and downloaded via the [Gene Expression Omnibus Portal](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742). The following specific datasets are required:

* **The Gene Expression Data**

    *GSE92742_Broad_LINCS_Level5_COMPZ.MODZ_n473647x12328.gctx.gz* 
    
    The main data matrix, containing the results of nearly half a million experiments. Each value is a Z-score describing how much a gene's expression went up or down in a single experiment compared to a control.

* **The Signature Info File**

    *GSE92742_Broad_LINCS_sig_info.txt*

    Metadata file that explains what each column in the expression matrix means.

* **The Perturbation Info File**

    *GSE92742_Broad_LINCS_pert_info.txt*

    The drug dictionary that provides details about the perturbations used in the experiments.

* **The Gene Info File**

    *GSE92742_Broad_LINCS_gene_info.txt*

    The gene dictionary that provides details about the genes measured.


The notebook expects all four files to be placed in the `data/LINCS_L1000` folder, with the `.gctx.gz` archive decompressed to `.gctx`.

In [4]:
ROOT = Path(os.getcwd()).parents[0]

DATA_PATH = os.path.join(ROOT, "data", "LINCS_L1000")
GCTX_DATA_PATH = os.path.join(DATA_PATH, "GSE92742_Broad_LINCS_Level5_COMPZ.MODZ_n473647x12328.gctx")

GENE_INFO_PATH = os.path.join(DATA_PATH, "GSE92742_Broad_LINCS_gene_info.txt")
PERTURBATION_DATA_PATH = os.path.join(DATA_PATH, "GSE92742_Broad_LINCS_pert_info.txt")
SIGNATURE_DATA_PATH = os.path.join(DATA_PATH, "GSE92742_Broad_LINCS_sig_info.txt")
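
Before loading anything, it can help to confirm that the expected files are in place. The check below is a small sketch using only the file names listed above. `REQUIRED_FILES` and `find_missing_files` are hypothetical helper names, not part of the notebook.

```python
from pathlib import Path
from typing import List

# File names taken from the list above (the .gctx entry assumes the archive is decompressed).
REQUIRED_FILES = [
    "GSE92742_Broad_LINCS_Level5_COMPZ.MODZ_n473647x12328.gctx",
    "GSE92742_Broad_LINCS_gene_info.txt",
    "GSE92742_Broad_LINCS_pert_info.txt",
    "GSE92742_Broad_LINCS_sig_info.txt",
]


def find_missing_files(data_dir, filenames: List[str] = REQUIRED_FILES) -> List[str]:
    """Return the names in `filenames` that are not regular files inside `data_dir`."""
    return [name for name in filenames if not (Path(data_dir) / name).is_file()]
```

Calling `find_missing_files(DATA_PATH)` returns an empty list when everything is in place, and otherwise names the files that still need to be downloaded.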

---
### 4. Data Exploration <a class="anchor" id="data-exploration"></a>

Load the data

In [10]:
from cmapPy.pandasGEXpress.parse import parse as parse_gctx

# gctoo_object = parse_gctx(GCTX_DATA_PATH)
# expression_df = gctoo_object.data_df

# expression_df

col_metadata_df = parse_gctx(GCTX_DATA_PATH, col_meta_only=True)
all_sig_ids = col_metadata_df.index.tolist()
len(all_sig_ids)  # Total number of signatures


473647

In [6]:
all_sig_ids[:10]  # Display the first 10 signature IDs

['CPC005_A375_6H:BRD-A85280935-003-01-7:10',
 'CPC005_A375_6H:BRD-A07824748-001-02-6:10',
 'CPC004_A375_6H:BRD-K20482099-001-01-1:10',
 'CPC005_A375_6H:BRD-K62929068-001-03-3:10',
 'CPC005_A375_6H:BRD-K43405658-001-01-8:10',
 'CPC004_A375_6H:BRD-K03670461-001-02-0:10',
 'CPC004_A375_6H:BRD-K36737713-001-01-6:10',
 'CPC005_A375_6H:BRD-K51223576-001-01-3:10',
 'CPC004_A375_6H:BRD-A14966924-001-03-0:10',
 'CPC004_A375_6H:BRD-K79131256-001-08-8:10']
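
The ten IDs above appear to follow a `batch_cell_time:perturbagen:dose` layout. The sketch below splits an ID on that assumption. The field names are an interpretation inferred from these examples only, other signatures in the dataset may not follow the same layout, and the sig_info file remains the authoritative source for this metadata.

```python
from typing import NamedTuple


class SignatureId(NamedTuple):
    """Fields inferred from the example IDs (an assumption, not an official schema)."""

    batch: str
    cell_id: str
    time_point: str
    pert_id: str
    dose: str


def parse_sig_id(sig_id: str) -> SignatureId:
    """Split an ID shaped like 'CPC005_A375_6H:BRD-A85280935-003-01-7:10'.

    Raises ValueError if the ID does not have exactly three colon-separated
    parts or the first part does not have exactly three underscore-separated parts.
    """
    plate, pert_id, dose = sig_id.split(":")
    batch, cell_id, time_point = plate.split("_")
    return SignatureId(batch, cell_id, time_point, pert_id, dose)
```

For the first ID in the list, `parse_sig_id(all_sig_ids[0]).cell_id` would give `'A375'`.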

In [None]:
batch_gctoo = parse_gctx(GCTX_DATA_PATH, cid=all_sig_ids[:10])
batch_gctoo.data_df


| rid | CPC005_A375_6H:BRD-A85280935-003-01-7:10 | CPC005_A375_6H:BRD-A07824748-001-02-6:10 | CPC004_A375_6H:BRD-K20482099-001-01-1:10 | CPC005_A375_6H:BRD-K62929068-001-03-3:10 | CPC005_A375_6H:BRD-K43405658-001-01-8:10 | CPC004_A375_6H:BRD-K03670461-001-02-0:10 | CPC004_A375_6H:BRD-K36737713-001-01-6:10 | CPC005_A375_6H:BRD-K51223576-001-01-3:10 | CPC004_A375_6H:BRD-A14966924-001-03-0:10 | CPC004_A375_6H:BRD-K79131256-001-08-8:10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 5720 | 0.773769 | -0.645586 | -5.449666 | 0.193408 | 1.006298 | -5.388713 | -1.000240 | 0.490110 | 0.063297 | 0.560929 |
| 466 | -0.818468 | -0.810749 | 2.393775 | -0.582243 | 0.455536 | 1.867731 | -1.106092 | 0.595174 | -0.962553 | -0.656688 |
| 6009 | 0.189572 | 0.459060 | 1.279790 | -0.178977 | 0.631738 | 0.281383 | -0.422545 | -0.224163 | 0.521553 | 0.520286 |
| 2309 | -0.146031 | -0.224676 | 2.167868 | -1.182025 | -0.936414 | 1.378175 | 0.406279 | -0.244783 | 0.182361 | -0.315654 |
| 387 | -0.654002 | -0.335681 | 2.333199 | -1.012651 | -1.213203 | 1.290522 | -0.218671 | -0.124029 | 0.572183 | -0.187850 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25960 | 0.240643 | -0.086766 | 3.620893 | 0.082145 | 0.508581 | 0.874223 | -0.092895 | -0.667292 | 0.093789 | -0.469189 |
| 6376 | 0.941109 | 2.821144 | -1.866171 | 0.781728 | 1.217399 | -2.777992 | 0.924472 | 0.196522 | -0.062907 | -0.018892 |
| 11033 | 0.931256 | 0.413081 | 2.035219 | 0.367824 | -0.496499 | -0.140473 | 1.190756 | -0.371865 | -0.417451 | 0.226126 |
| 54869 | 0.635310 | 1.291829 | 2.424871 | 2.049168 | -0.006612 | 1.533514 | 1.030570 | -0.885176 | 0.401086 | 0.115130 |
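
Because the notebook reads only a subset of columns at a time through the `cid` argument, the full list of signature IDs can be walked in fixed-size chunks. The generator below is a sketch of that idea. `iter_batches` and the batch size are hypothetical choices, not something the notebook defines.

```python
from typing import Iterator, List, Sequence


def iter_batches(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


# Possible use with the objects defined earlier (left commented out because it needs the data file):
# for batch_ids in iter_batches(all_sig_ids, 10_000):
#     batch_df = parse_gctx(GCTX_DATA_PATH, cid=batch_ids).data_df
```

Each chunk yields a DataFrame with one column per signature in the batch, which keeps memory use bounded compared with loading all 473,647 signatures at once.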


**Observations**: 
* ...

---
### 6. Build the Model <a class="anchor" id="model"></a>

Define model parameters

**Observations**:
- ...

---
### 8. Summary <a class="anchor" id="summary"></a>

This project demonstrates ...


The primary findings include:

* ...

Next Steps:
* Investigate the effect of ...