# First-Principles Approach to Solving My Graduation Project
---

## Current Suggested Approach

1. **What the model sees as input:**  
   The current design of the model takes two things:
   - A representation of how the cell’s gene expression changes when exposed to a particular drug. In other words, you give the model a snapshot of the cell’s transcriptional profile after it has already been treated.
   - The chemical descriptors (or structure) of the drug itself.

2. **What the model outputs:**  
   Given these two pieces of information (the observed response of the cell’s transcriptome to the drug and the drug’s own structural characteristics), the model predicts cell viability after treatment. Low viability means the drug killed the cancer cells effectively; high viability means it did not.

3. **What this means in practice:**  
   This setup effectively lets the model learn the following relationship:  

   *"Whenever an **arbitrary** cell line shows this specific pattern of gene expression change after being treated with a drug that has these particular molecular features, we often see a certain level of cell death or survival."*

   In other words, the model is learning a mapping:
   $$
   f(\text{Drug Structure}, \text{Post-Treatment Gene Expression Changes}) \rightarrow \text{Viability}.
   $$

   This is a correlation-based, supervised learning problem where the model leverages direct evidence of how the drug affects the cells at the molecular level (transcriptome) and combines it with the drug’s own intrinsic properties.

4. **From first principles, what knowledge does the model acquire?**  
   - The model learns which gene expression patterns (induced by a given drug) are associated with higher or lower cell survival. It’s essentially learning the "fingerprints" of effective vs. ineffective drug-induced perturbations on the cell’s biology.
   - It may also pick up how drug structure and transcriptional response go together. Over many training examples, it can infer that drugs with certain chemical traits tend to co-occur with specific transcriptional signatures, which in turn correlate with certain viability outcomes. Note that it is never trained to predict the transcriptional response from the structure, because both are supplied as inputs.

   In essence, the model is encoding a conditional relationship: *If a drug looks like this (chemically) and produces these changes in gene expression, then the cell viability will be such and such.*  

5. **Why this might not achieve the ultimate goal yet:**  
   The main practical aim is to predict whether a new drug will kill tumor cells **without first testing it in the lab**. However, the current model needs to be given the gene expression changes caused by that drug on that cell line, which is something you only get after you've already exposed the cells to the drug. This is a catch-22: to predict viability, the model relies on the very data (post-treatment transcriptomics) that you wouldn't have before testing.

   So while the model “learns” a sophisticated correlation between (drug structure + observed cellular response) and viability, it doesn't learn to predict viability from only the drug and baseline cell information. It requires the post-perturbation expression data, effectively making it a "reactive" predictor rather than a "proactive" one. It’s like the model is very good at saying “Given I see these molecular changes after the drug has done its work, here’s the outcome,” rather than “Given this drug and this cell line, what would happen if I apply the drug?”

6. **In conclusion:**  
   The current model, as described, fundamentally learns a mapping from *observed drug-induced cellular changes* (transcriptomic perturbation) combined with the *drug’s intrinsic properties* to a final viability outcome. It does not inherently learn how to predict viability without the intermediate step of knowing the gene expression changes induced by the drug. In first-principles terms, it is a function that takes a known effect (the transcriptomic change) and a cause (the drug) and predicts the final outcome (viability), rather than predicting the outcome directly from just the cause and the baseline conditions alone.

This means the model is not yet solving the intended problem of predicting efficacy before any testing is done. It's learning a useful relationship, but one that still depends on having post-treatment data that you wouldn’t have in a real predictive scenario.
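
The catch-22 above can be made concrete with a minimal sketch. It is only an illustration under assumptions: the `Sample` class, its field names, and the toy numbers are hypothetical and do not come from the project code. It contrasts the inputs of the current ("reactive") design with the inputs a pre-experiment ("proactive") predictor would be limited to.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    """One (cell line, drug) pair. All field names are hypothetical."""
    drug_descriptors: List[float]        # chemical features, known before any experiment
    baseline_expression: List[float]     # untreated expression, known before any experiment
    post_treatment_delta: Optional[List[float]] = None  # only measurable after treatment


def reactive_features(sample: Sample) -> List[float]:
    """Inputs of the current design: drug structure + post-treatment expression change."""
    if sample.post_treatment_delta is None:
        raise ValueError("post-treatment expression is unavailable before the experiment")
    return sample.drug_descriptors + sample.post_treatment_delta


def proactive_features(sample: Sample) -> List[float]:
    """Inputs a pre-experiment predictor could use: drug structure + baseline expression."""
    return sample.drug_descriptors + sample.baseline_expression


# A drug/cell-line pair that has already been profiled in the lab (toy values).
tested = Sample([0.2, 1.5], [5.0, 3.1, 7.4], post_treatment_delta=[-1.2, 0.4, 2.0])
# A new drug that has never been applied to this cell line (toy values).
untested = Sample([0.9, 0.3], [5.0, 3.1, 7.4])

print(len(reactive_features(tested)))     # input can be built for the tested pair
print(len(proactive_features(untested)))  # input can be built without any experiment

try:
    reactive_features(untested)
except ValueError as err:
    print("cannot build reactive input:", err)
```

The reactive input cannot even be constructed for the untested drug, which is exactly the scenario the project is meant to address.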

---

## Building A New Approach

1. **Thinking in Terms of Vector Spaces:**
   You start with a baseline gene expression vector that represents the cell line’s "state." A drug can be thought of as an operator or transformation that maps this baseline vector into a new vector representing the cell’s perturbed state. If the drug is effective, it pushes (or, loosely speaking, "rotates") this state vector into a region of the state space associated with low viability (i.e., the cell is likely to die).

   In this analogy:
   - **Baseline State Vector:** Encodes the cell line’s intrinsic biological configuration under normal conditions.
   - **Drug:** Acts like a transformation (linear or nonlinear) that takes the baseline vector to a new "perturbed" vector.
   - **Viability:** A function that assigns a "score" or "label" to each point in the gene expression space. Some regions correspond to healthy, viable cells; other regions correspond to stressed or dying cells.

![Vector Intuition](../assets/images/Intuition.png)

2. **Is the Transformation Universal?**
   One might imagine a perfect world where the same drug always moves the vector from a "viable region" to a "non-viable region." But in reality:
   - The effect of the drug (the "rotation") depends on both the drug’s properties and **the cell line’s baseline biology**.
   - Different cell lines, even starting from similar baseline states, may not respond identically to the same drug. The transformation induced by the drug can be cell-line specific because the drug interacts with the particular molecular networks, receptors, and pathways unique to that cell line.

   Thus, the "map" from baseline → perturbed state → viability is not one-size-fits-all. The drug’s effect is **context-dependent** and can vary significantly across different cell lines.

3. **Different Viability Reactions to the Same Final State:**
   Consider the question: If two different cell lines, through different means, arrive at the same final gene expression vector, do they have the same viability?

   In theory, if the exact final state vector is identical and fully captures all relevant biology, and the viability function is solely a function of that final state vector, then yes, they would have the same viability. But in practice:
   - The viability outcome might also depend on contextual, cell-line-specific factors not fully captured by the gene expression vector alone. For example, gene expression is not the only determinant of cell fate. Epigenetic context, protein modifications, metabolic states, and other hidden variables could differ between cell lines.
   - The model may implicitly encode cell-line identity in the baseline features. If baseline differences are part of the input, then the same "final state vector" might not actually be interpreted identically by the model for two different cell lines.

   Essentially, the final viability is not just a function of the final gene expression vector alone, but also of the cell line context represented in the baseline state. Two cell lines may map the same final gene expression pattern to different viability outcomes if they differ in some underlying, unmodeled dimensions or if the model includes baseline identifiers that change how it interprets that final state.

4. **Conclusion:**
   - Conceptually, you can think of the drug as something that tries to move the baseline gene expression vector into a "low-viability" region.
   - However, each cell line has its own "viability landscape." The same final gene expression pattern might not yield the same viability across different cell lines if other contextual factors differ.
   - In practice, the model (and biological reality) is more complicated than a simple uniform map of state vectors to viability. The cell line context, baseline conditions, and unobserved variables mean that different cell lines can indeed have different viability reactions to a given "state vector."
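
The two points above (a context-dependent transformation, and a viability function that depends on more than the final state) can be sketched with a toy example. Everything in it is an assumption made for illustration: the multiplicative form of `apply_drug`, the logistic `viability` score, the `context_bias` term standing in for unobserved cell-line factors, and all numbers. It is not a biological model.

```python
import math
from typing import List


def apply_drug(baseline: List[float], drug_effect: List[float]) -> List[float]:
    """Toy context-dependent perturbation: each gene's shift scales with its baseline level."""
    return [x + x * e for x, e in zip(baseline, drug_effect)]


def viability(state: List[float], weights: List[float], context_bias: float) -> float:
    """Toy viability score in (0, 1): logistic function of the state plus a cell-line term."""
    z = sum(w * x for w, x in zip(weights, state)) + context_bias
    return 1.0 / (1.0 + math.exp(-z))


drug_effect = [-0.5, 0.2]  # hypothetical per-gene effect of one drug
cell_a = [2.0, 1.0]        # hypothetical baseline state vectors
cell_b = [0.5, 3.0]

state_a = apply_drug(cell_a, drug_effect)
state_b = apply_drug(cell_b, drug_effect)

# Same drug, but a different displacement of the state vector in each cell line.
delta_a = [p - x for p, x in zip(state_a, cell_a)]
delta_b = [p - x for p, x in zip(state_b, cell_b)]
print(delta_a, delta_b)

# Same final state, but a different hidden context gives a different viability.
weights = [1.0, -1.0]
v_sensitive = viability(state_a, weights, context_bias=-2.0)
v_resistant = viability(state_a, weights, context_bias=2.0)
print(v_sensitive, v_resistant)
```

In this toy setting the drug's displacement depends on the baseline it acts on, and an identical final state vector maps to different viability scores once a cell-line-specific term is included.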

```python
import pandas as pd

df = pd.read_csv("../data/raw/instinfo_beta.txt", sep="\t")

# Only select rows where the column qc_pass is 1.0 and where the pert_type is trt_cp
df_qc = df[(df["qc_pass"] == 1.0) & (df["pert_type"] == "trt_cp")]

# Print shapes to compare
print("Original shape:", df.shape)
print("Shape after filtering:", df_qc.shape)
```

```
Original shape: (3026460, 30)
Shape after filtering: (1312170, 30)
```


```python
df.head()
```

| | bead_batch | nearest_dose | pert_dose | pert_dose_unit | pert_idose | pert_time | pert_itime | pert_time_unit | cell_mfc_name | pert_mfc_id | ... | sample_id | pert_type | cell_iname | qc_pass | dyn_range | inv_level_10 | build_name | failure_mode | project_code | cmap_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b11 | | 20.0 | uL | 20 uL | 72.0 | 72 h | h | VCAP | ERG_11 | ... | ERG013_VCAP_72H_X3_B11:O14 | trt_sh | VCAP | 0.0 | 4.20788 | 4220.5 | | dyn_range | ERG | ERG |
| 1 | b10 | | 1.0 | uL | 1 uL | 96.0 | 96 h | h | U2OS | TRCN0000072237 | ... | TAK004_U2OS_96H_X2_B10_DUO52HI53LO:D10 | ctl_vector | U2OS | 0.0 | 4.73906 | 1462.0 | | inv_level_10 | TAK | LACZ |
| 2 | b12 | | 0.1 | ng/ml | 0.1 ng/ml | 2.0 | 2 h | h | HEPG2 | SOD3 | ... | CYT001_HEPG2_2H_X2_B12:N12 | trt_lig | HEPG2 | 1.0 | 6.79642 | 3038.0 | | | CYT | SOD3 |
| 3 | b12 | | 150.0 | ng | 150 ng | 48.0 | 48 h | h | HEK293T | ENTRY00543 | ... | HSF038_HEK293T_48H_X2_B12:M01 | trt_oe | HEK293T | 0.0 | 23.7971 | 1642.0 | | inv_level_10 | HSF | PDGFRA |
| 4 | f3b5 | 6.66 | 5.33 | uM | 6.66 uM | 24.0 | 24 h | h | A375 | BRD-K79781870 | ... | DOS043_A375_24H_X1_F3B5_DUO52HI53LO:D17 | trt_cp | A375 | 0.0 | 6.78867 | 1558.0 | | inv_level_10,qc_iqr | DOS | BRD-K79781870 |