# Phase 1 - Establishing Baselines with Simple Models
---
In this first phase, we focus on leveraging **transcriptomics data** to predict tumor cell viability using traditional machine learning regression techniques. By exploring these simpler models, we aim to:

- **Benchmark Predictive Power**: Evaluate performance using metrics such as Mean Squared Error (MSE), R², and Pearson correlation.
- **Understand Feature Importance**: Analyze which genes, and potentially pathways, are most predictive of viability through model coefficients.
- **Set a Performance Baseline**: Create a reference point to measure improvements in subsequent phases.

### Key Questions for Phase 1

1. How accurately can simple regression models predict tumor cell viability using transcriptomics data alone?
2. What are the limitations of these models in terms of predictive performance and feature interpretability?
3. Can this phase establish meaningful benchmarks for later comparisons with multimodal models?

By addressing these questions, Phase 1 will provide a critical foundation for the progression to more complex modeling approaches in subsequent phases.

## Import Required Libraries
Import the necessary libraries, including scikit-learn, pandas, SciPy, and Plotly, along with the project's local modules.

In [1]:
## Import required libraries and modules
# Add src to path
import sys
import os

# Add the src directory to the Python path
sys.path.append(os.path.abspath(os.path.join("..", "src")))

# Standard library imports
import logging

# Third-party imports
import pandas as pd
import plotly.express as px
from scipy.stats import randint, uniform
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import make_scorer
from sklearn.linear_model import (
    LinearRegression,
    LassoCV,
    RidgeCV,
    ElasticNetCV,
    BayesianRidge,
    HuberRegressor,
)
from sklearn.ensemble import (
    RandomForestRegressor,
    GradientBoostingRegressor,
    AdaBoostRegressor,
)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Local application imports
from evaluation import weighted_score_func
from preprocess.preprocess import perform_pca, split_data
from pipelines import ModelPipeline, NonLinearModelPipeline
from utils import load_config
from visualizations import plot_pca_variance, create_pca_biplot, create_tsne_plot
from visualizations import (
    create_feature_set_distribution_visualization,
    create_feature_set_weighted_score_visualization,
    create_model_weighted_score_visualization,
)


# Load Config to ensure reproducibility and syncing with other scripts
config = load_config("../config.yaml")

# Set logging configurations
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s - %(message)s",
)

## Load and Preprocess Dataset

The first step in our workflow involves preparing the transcriptomics and gene expression data for machine learning. This ensures the data is clean, standardized, and ready for model training. Below are the general steps we followed to preprocess the data:

### 1. **Loading the Dataset**
   - **Transcriptomics Data (TF Dataset):**  
     We load the transcriptomics features from a `.csv` file and remove unnecessary columns (e.g., IDs or metadata).  
     The target variable (tumor cell viability) is loaded as a separate column and later merged with the features.  
   - **Gene Expression Data (Gene Dataset):**  
     For gene expression, we load a binary file containing the expression values, which is reshaped into a matrix.  
     Gene information, such as gene symbols and feature space (landmark or full set), is extracted from an accompanying metadata file.

### 2. **Standardization**
   - To ensure that features are on the same scale, we standardize the values to have a mean of 0 and a standard deviation of 1 using `StandardScaler` from `sklearn`.
   - This step is critical for machine learning models like Ridge and Lasso regression that are sensitive to feature magnitudes.
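
A minimal sketch of the standardization step on a small made-up expression matrix. The column names and values are illustrative and not taken from the project data, and fitting the scaler on the training rows only is an assumption about how the step is applied:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy expression matrix: 6 samples x 3 hypothetical genes (illustrative values only)
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(loc=[5.0, 50.0, 500.0], scale=[1.0, 10.0, 100.0], size=(6, 3)),
    columns=["gene_a", "gene_b", "gene_c"],
)
X_train, X_test = X.iloc[:4], X.iloc[4:]

# Fit on the training rows only, then apply the same transform to held-out rows
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 per column
print(X_train_scaled.std(axis=0).round(6))   # approximately 1 per column
```

Reusing the fitted scaler on held-out rows keeps test-set statistics from leaking into training.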

### 3. **Feature Selection**
   - For the gene expression dataset, we provide the option to select **landmark genes** (a reduced set of informative genes) or the full set of genes.  
   - This enables flexibility for comparing performance based on different feature spaces.

### 4. **Handling Missing Values**
   - Any rows with missing values are removed to ensure data integrity during training and evaluation.

### 5. **Saving Preprocessed Data**
   - The cleaned and preprocessed datasets are saved as `.csv` files for use in subsequent steps of the project, so they can be loaded directly into pandas `DataFrame` objects.

### 6. **Splitting the Dataset**
   - The preprocessed dataset can be split into **training**, **validation**, and **test** sets based on configurable ratios. This lets us:
     - Train models on the training set.
     - Tune hyperparameters and avoid overfitting using the validation set.
     - Evaluate the final model performance on the test set.
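
The project's `split_data` helper lives in `src/preprocess` and is not shown here. As a rough sketch of one common way to produce such a three-way split, two calls to `train_test_split` can be chained. The helper name `three_way_split`, the ratio values, and the toy data are all hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target_name, val_ratio=0.15, test_ratio=0.15, seed=42):
    """Hypothetical helper: split df into train/val/test features and targets."""
    X = df.drop(columns=[target_name])
    y = df[target_name]
    # First carve off the test set
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_ratio, random_state=seed
    )
    # Then split the remainder; rescale so val_ratio refers to the full dataset
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_ratio / (1.0 - test_ratio), random_state=seed
    )
    return X_train, y_train, X_val, y_val, X_test, y_test


# Toy data: 100 samples, 4 features, plus a "viability" target column
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
toy["viability"] = rng.uniform(size=100)

X_tr, y_tr, X_va, y_va, X_te, y_te = three_way_split(toy, "viability")
print(len(X_tr), len(X_va), len(X_te))
```

The return order mirrors the way `split_data` is unpacked in this notebook, which makes the sketch easy to compare against the real helper.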

For a more detailed overview of these preprocessing steps, please see the related code in the `src/preprocess` directory.

In [3]:
## Load, Split and Preprocess Dataset
# Load the dataset
tf_df = pd.read_csv(config["data_paths"]["preprocessed_tf_file"])
landmark_df = pd.read_csv(config["data_paths"]["preprocessed_landmark_file"])

# Define chunk size
chunk_size = 1000

# Initialize an empty list to store processed chunks
chunks = []

# Read the CSV file in chunks
for chunk in pd.read_csv(
    config["data_paths"]["preprocessed_gene_file"],
    chunksize=chunk_size,
    nrows=1000,
):
    # Optionally process the chunk (e.g., drop columns, filter rows)
    chunks.append(chunk)

# Combine chunks into a single DataFrame (if needed)
gene_df = pd.concat(chunks, axis=0)
# gene_df = pd.read_csv(config["data_paths"]["preprocessed_gene_file"])

# Only sample a subset of the data for faster training
tf_df = tf_df.sample(n=10000, random_state=42)
landmark_df = landmark_df.sample(n=10000, random_state=42)

logging.debug(
    f"TF data shape: {tf_df.shape}, Landmark data shape: {landmark_df.shape}, Gene data shape: {gene_df.shape}"
)

# Split the data into train, validation and test sets as well as features and target
X_tf_train, y_tf_train, X_tf_val, y_tf_val, X_tf_test, y_tf_test = split_data(
    tf_df, config, target_name="viability"
)

X_landmark_train, y_landmark_train, X_landmark_val, y_landmark_val, X_landmark_test, y_landmark_test = (
    split_data(landmark_df, config, target_name="viability")
)

X_gene_train, y_gene_train, X_gene_val, y_gene_val, X_gene_test, y_gene_test = (
    split_data(gene_df, config, target_name="viability")
)

## Dimensionality Reduction and High-Dimensional Data Visualization

To enhance the interpretability of our high-dimensional datasets (Transcriptomics and Gene Expression data), we apply **dimensionality reduction** techniques. These steps are optional and can be toggled based on the project configuration. Below is an outline of the methodology:

---

### 1. **Variance Thresholding**
   - **Objective**: Reduce the number of features by removing those with low variance across samples, as they contribute little to predictive power.
   - **Process**:
     - Separate features and the target (viability).
     - Apply the `VarianceThreshold` technique using configurable thresholds specific to each dataset.
     - Retain only features with variance above the set threshold.
   - **Impact**:
     - Variance Thresholding reduces the dimensionality of the data while retaining the most informative features.
   - **Outcome**:
     - Both datasets (TF and Gene data) are reduced in feature dimensionality, and the shapes are logged for reference.
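
To make the effect of the threshold concrete, here is a small sketch on a made-up feature table. The column names and the threshold of 0.01 are illustrative and not the project's configured values:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy feature table: "flat" never changes, "almost_flat" barely changes
features = pd.DataFrame(
    {
        "flat": [1.0, 1.0, 1.0, 1.0],
        "almost_flat": [0.0, 0.0, 0.0, 0.1],
        "variable": [0.0, 1.0, 2.0, 3.0],
    }
)

selector = VarianceThreshold(threshold=0.01)  # illustrative threshold
selector.fit(features)

# get_support() returns a boolean mask over the input columns
kept = features.columns[selector.get_support()].tolist()
print(kept)  # ['variable']
```

Because variance depends on feature scale, a single threshold is only meaningful when the features share comparable units.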

### 2. **Principal Component Analysis (PCA)**
   - **Objective**: Reduce the dataset's dimensionality while preserving most of the variance.
   - **Steps**:
     1. **Perform PCA**:
        - Extract the variance ratio and compute the principal components for both datasets.
        - Configure the amount of variance to retain (e.g., 95%).
     2. **Scree Plot**:
        - Visualize the explained variance ratio of each principal component to determine the retained dimensionality.
     3. **2D and 3D Biplots**:
        - Visualize the data in lower dimensions (2D and 3D) for interpretability and exploration of variance patterns.
        - Highlight the top contributing features (loadings) in the reduced space.
   - **Visual Outputs**:
     - **Scree Plots**: Show the cumulative explained variance across principal components.
     - **2D and 3D PCA Biplots**:
       - Represent the samples in the reduced dimensional space.
       - Include annotated loadings of the top contributing features for interpretability.
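
The project's `perform_pca` helper is not shown in this notebook. The sketch below illustrates, on synthetic data, how a retained-variance target such as 95% can be passed to scikit-learn's `PCA`. That `perform_pca` works this way is an assumption:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 200 samples, 20 correlated features driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(200, 20))

# A float n_components in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(X_pca.shape, cumulative.round(3))
```

The `cumulative` array is the quantity a cumulative scree plot would display.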

---

### 3. **t-SNE Visualization**
   - **Objective**: Visualize the complex patterns and relationships in the high-dimensional data by mapping it to 2D or 3D spaces.
   - **Steps**:
     - Perform t-SNE on the PCA-transformed data to further highlight separations or clusters within the data.
     - Visualize both datasets in 2D and 3D spaces.
   - **Visual Outputs**:
     - **2D t-SNE Plots**: Provide a planar visualization of clusters in the reduced feature space.
     - **3D t-SNE Plots**: Add depth to the visualization for better exploration of high-dimensional relationships.
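
As a sketch of the PCA-then-t-SNE sequence described above, using synthetic data and illustrative parameter values (the project's `create_tsne_plot` helper is not shown, so its internals are not assumed here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Toy data: two well-separated groups of 50 samples in 30 dimensions
rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(0.0, 1.0, size=(50, 30)), rng.normal(6.0, 1.0, size=(50, 30))]
)

# Reduce with PCA first, then embed the PCA scores with t-SNE
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)
embedding = TSNE(
    n_components=2, perplexity=20, init="pca", random_state=0
).fit_transform(X_pca)

print(embedding.shape)  # (100, 2)
```

Setting `n_components=3` would give the coordinates for a 3D plot. Distances between t-SNE clusters are not directly interpretable, so the embedding is best treated as an exploratory view.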

---

### Summary of Configurable Steps
1. **Variance Thresholding**: Enabled/disabled via `config["preprocess"]["use_vt"]`.
2. **PCA**: Enabled/disabled via `config["preprocess"]["use_pca"]` with adjustable variance thresholds for TF and Gene datasets.
3. **t-SNE**: Applied to the PCA-transformed data for visualization; the `create_tsne_plot` calls are currently commented out in the cell below.

These dimensionality reduction and visualization techniques help ensure that we retain meaningful information while exploring patterns in the data. They provide a strong foundation for understanding the datasets before applying predictive modeling.

In [4]:

## Perform Feature Selection and Dimensionality Reduction
if config["preprocess"]["use_vt"]:
    # Separate features and target
    tf_features = tf_df.drop(columns=["viability"])
    tf_target = tf_df["viability"]
    landmark_features = landmark_df.drop(columns=["viability"])
    landmark_target = landmark_df["viability"]
    gene_features = gene_df.drop(columns=["viability"])
    gene_target = gene_df["viability"]

    # Instantiate separate VarianceThreshold selectors for each dataset
    tf_selector = VarianceThreshold(threshold=config["preprocess"]["vt_threshold_tf"])
    landmark_selector = VarianceThreshold(
        threshold=config["preprocess"]["vt_threshold_landmark"]
    )
    gene_selector = VarianceThreshold(
        threshold=config["preprocess"]["vt_threshold_gene"]
    )

    # Apply VarianceThreshold to the features
    tf_features_selected = tf_selector.fit_transform(tf_features)
    landmark_features_selected = landmark_selector.fit_transform(landmark_features)
    gene_features_selected = gene_selector.fit_transform(gene_features)

    # Convert the results back to DataFrames
    tf_features_selected = pd.DataFrame(
        tf_features_selected, columns=tf_features.columns[tf_selector.get_support()]
    )
    landmark_features_selected = pd.DataFrame(
        landmark_features_selected,
        columns=landmark_features.columns[landmark_selector.get_support()],
    )
    gene_features_selected = pd.DataFrame(
        gene_features_selected,
        columns=gene_features.columns[gene_selector.get_support()],
    )

    # Concatenate the target column back to the selected features
    tf_df = pd.concat([tf_features_selected, tf_target.reset_index(drop=True)], axis=1)
    landmark_df = pd.concat(
        [landmark_features_selected, landmark_target.reset_index(drop=True)], axis=1
    )
    gene_df = pd.concat(
        [gene_features_selected, gene_target.reset_index(drop=True)], axis=1
    )

    # Log the shape of the datasets after applying VarianceThreshold
    logging.debug(
        f"TF data shape after VarianceThreshold: {tf_df.shape}, Landmark data shape after VarianceThreshold: {landmark_df.shape}, Gene data shape after VarianceThreshold: {gene_df.shape}"
    )

if config["preprocess"]["use_pca"]:
    # Step 1: Extract the original feature names
    tf_feature_names = X_tf_train.columns.to_list()
    landmark_feature_names = X_landmark_train.columns.to_list() 
    gene_feature_names = X_gene_train.columns.to_list()

    # Step 2: Perform PCA
    X_tf_train_pca, tf_pca = perform_pca(
        X_tf_train, config["preprocess"]["pca_var_tf"]
    )
    X_landmark_train_pca, landmark_pca = perform_pca(
        X_landmark_train, config["preprocess"]["pca_var_landmark"]
    )
    X_gene_train_pca, gene_pca = perform_pca(
        X_gene_train, config["preprocess"]["pca_var_gene"]
    )

    logging.debug(
        f"TF data shape after PCA: {X_tf_train_pca.shape}, Landmark data shape after PCA: {X_landmark_train_pca.shape}, Gene data shape after PCA: {X_gene_train_pca.shape}"
    )
    # Step 3: Scree Plot
    plot_pca_variance(tf_pca, "TF Data", save_path="../assets/figures/Phase 1/TF_PCA_Scree_Plot.html")
    plot_pca_variance(landmark_pca, "Landmark Data", save_path="../assets/figures/Phase 1/Landmark_PCA_Scree_Plot.html")
    plot_pca_variance(gene_pca, "Gene Data", save_path="../assets/figures/Phase 1/Gene_PCA_Scree_Plot.html")

    # Step 4: 2D PCA Biplots
    create_pca_biplot(
        pca=tf_pca,
        X=X_tf_train_pca,
        Y=y_tf_train,
        features=tf_feature_names,
        dimension="2D",
        dataset_name="TF Dataset",
        top_n_loadings=10,
        sample_size=30000,
        loading_scale=10,
        save_path="../assets/figures/Phase 1/TF_PCA_Biplot.html",
    )
    
    create_pca_biplot(
        pca=landmark_pca,
        X=X_landmark_train_pca,
        Y=y_landmark_train,
        features=landmark_feature_names,
        dimension="2D",
        dataset_name="Landmark Dataset",
        top_n_loadings=10,
        sample_size=30000,
        loading_scale=10,
        save_path="../assets/figures/Phase 1/Landmark_PCA_Biplot.html",
    )

    # create_pca_biplot(
    #     pca=gene_pca,
    #     X=X_gene_train_pca,
    #     Y=y_gene_train,
    #     features=gene_feature_names,
    #     dimension="2D",
    #     dataset_name="Gene Dataset",
    #     top_n_loadings=10,
    #     sample_size=1000,
    #     loading_scale=10,
    #     save_path="../assets/figures/Phase 1/Gene_PCA_Biplot.html",
    # )

    # # Step 5: TSNE Plots
    # create_tsne_plot(
    #     X_tf_train_pca,
    #     y_tf_train,
    #     target_column="viability",
    #     sample_size=1000,
    #     dimension="2D",
    #     dataset_name="TF Data TSNE",
    #     save_path="../assets/figures/Phase 1/TF_TSNE_Plot.html",
    # )

    # create_tsne_plot(
    #     X_gene_train_pca,
    #     y_gene_train,
    #     target_column="viability",
    #     sample_size=1000,
    #     dimension="3D",
    #     dataset_name="Gene Data TSNE",
    #     save_path="../assets/figures/Phase 1/Gene_TSNE_Plot.html",
    # )

In [5]:
# if config["preprocess"]["use_pca"]:	
#     # Set the X data variable to the PCA transformed data
#     X_tf_train = X_tf_train_pca
#     X_landmark_train = X_landmark_train_pca
#     X_gene_train = X_gene_train_pca

#     # Also transform the validation and test data
#     X_tf_val = tf_pca.transform(X_tf_val)
#     X_tf_test = tf_pca.transform(X_tf_test)
    
#     X_landmark_val = landmark_pca.transform(X_landmark_val)
#     X_landmark_test = landmark_pca.transform(X_landmark_test)

#     X_gene_val = gene_pca.transform(X_gene_val)
#     X_gene_test = gene_pca.transform(X_gene_test)

## Baseline Model Training and Evaluation

In this step, we train a series of **linear regression models** on the **TF Data** and **Landmark Data** feature sets to establish baseline performance (the full **Gene Data** set is loaded but excluded from this training run). This serves as a foundational benchmark for assessing the predictive capability of simpler models before progressing to more complex techniques in later phases.

### 1. **Regression Models**
We evaluate the following linear regression models, each offering unique characteristics to handle different data distributions and complexities:
- **Linear Regression**: A simple baseline for comparison.
- **Ridge Regression**: Adds L2 regularization to handle multicollinearity and prevent overfitting.
- **Lasso Regression**: Adds L1 regularization for feature selection and sparsity.
- **Elastic Net Regression**: Combines L1 and L2 regularization for a balanced approach.
- **Bayesian Ridge Regression**: Incorporates Bayesian inference to capture uncertainty in predictions.
- **Huber Regression**: Robust to outliers in the data.
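
The practical difference between L1 and L2 regularization can be illustrated on synthetic data where only a few features carry signal. The data and the `alpha` values below are illustrative and unrelated to the project datasets:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Toy regression: only the first 3 of 50 features carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_coef = np.zeros(50)
true_coef[:3] = [2.0, -1.5, 1.0]
y = X @ true_coef + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# L1 drives many coefficients to exactly zero; L2 only shrinks them
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print(n_zero_lasso, n_zero_ridge)
```

A large zero count for Lasso next to a zero count of 0 for Ridge is the sparsity effect referred to in the list above.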


### 2. **Feature Sets**
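We train the models on two feature sets, with a third available but disabled in this run:
- **TF Data**: Features derived from transcriptomics (transcription factor) data.
- **Landmark Data**: Features restricted to the landmark genes of the gene expression dataset.
- **Gene Data**: Features derived from the full gene expression dataset (commented out in the `feature_sets` dictionary below).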

The models are trained on the respective training datasets, and their performance is evaluated on the test datasets.

### 3. **Evaluation Metrics**
To comprehensively assess model performance, we calculate the following metrics:
- **R² Score**: Measures the proportion of variance explained by the model.
- **Mean Squared Error (MSE)**: Quantifies the average squared difference between predicted and actual values.
- **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual values.
- **Pearson Correlation Coefficient**: Assesses the linear relationship between predictions and true values.
- **Weighted Score**: Combines R² and Pearson correlation into a single, weighted metric for better interpretability.
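
The project's `weighted_score_func` is defined in the `evaluation` module and is not shown in this notebook. As a rough sketch, assuming it is a weighted average of R² and Pearson correlation, the metrics can be computed as follows. The function name `weighted_score_sketch`, the equal 0.5/0.5 weights, and the toy values are assumptions:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def weighted_score_sketch(y_true, y_pred, r2_weight=0.5, pearson_weight=0.5):
    """Hypothetical stand-in for the project's weighted_score_func."""
    r2 = r2_score(y_true, y_pred)
    pearson = pearsonr(y_true, y_pred)[0]
    return r2_weight * r2 + pearson_weight * pearson


# Toy viability values and predictions (illustrative only)
y_true = np.array([0.2, 0.5, 0.7, 0.9, 0.4])
y_pred = np.array([0.25, 0.45, 0.65, 0.8, 0.5])

metrics = {
    "MSE": mean_squared_error(y_true, y_pred),
    "MAE": mean_absolute_error(y_true, y_pred),
    "R2": r2_score(y_true, y_pred),
    "Pearson": pearsonr(y_true, y_pred)[0],
    "Weighted": weighted_score_sketch(y_true, y_pred),
}
print({name: round(float(value), 3) for name, value in metrics.items()})
```

A function with this `(y_true, y_pred)` signature can be wrapped with `make_scorer`, which is how the notebook passes its scorer to the pipelines.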

### 4. **Visualization of Results**
We use interactive **Plotly** visualizations to interpret the results and compare model performance:

#### Bar Plots
- **Weighted Score by Model**: Compares the performance of different regression models across both feature sets.
- **Weighted Score by Feature Set**: Highlights how the TF Data and Gene Data compare in terms of predictive performance for each model.

#### Box Plot
- **Weighted Score Distribution**: Visualizes the spread of Weighted Scores for the TF Data and Gene Data feature sets, providing insights into variability and reliability.

### Code Highlights
1. **Training and Evaluation Pipeline**:
   - The `ModelPipeline` class modularizes model training, evaluation, and visualization for better code maintainability.
   - Each model is trained and evaluated with a cross-validation strategy using the configured number of folds (`cv`).

2. **Coefficient Analysis**:
   - We analyze the coefficients of trained models to understand feature importance and their contributions to predictions.

3. **Interactive Visualizations**:
   - Bar and box plots provide an intuitive comparison of model performance across feature sets and models.
   - The use of colors and interactivity enables deeper exploration of results.
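
The internals of `ModelPipeline` are not shown in this notebook. The loop below is a simplified sketch of the train, cross-validate, and test pattern described above, using synthetic data and plain R² as the scorer. It is not the class's actual implementation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for one feature set
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {"Linear": LinearRegression(), "Ridge": RidgeCV(alphas=[0.1, 1.0, 10.0])}
scorer = make_scorer(r2_score)  # the notebook passes its weighted scorer instead

results = []
for name, model in models.items():
    # Cross-validate on the training split, then refit and score on the test split
    cv_scores = cross_val_score(model, X_train, y_train, scoring=scorer, cv=5)
    model.fit(X_train, y_train)
    results.append(
        {
            "model": name,
            "cv_mean": cv_scores.mean(),
            "cv_std": cv_scores.std(),
            "test_r2": r2_score(y_test, model.predict(X_test)),
        }
    )

for row in results:
    print(row)
```

Collecting one dictionary per model makes it straightforward to build a results table with `pd.DataFrame(results)`.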

---

### Insights from this Step
By comparing the performance of simple regression models, we can:
1. Identify the strengths and limitations of each feature set (TF vs. Gene).
2. Establish a clear baseline against which we can compare more complex models in later phases.
3. Gain an understanding of the most predictive features for tumor viability.

This step concludes the first phase of the project, providing a solid benchmark for future model development and experimentation.


In [5]:
## Train and Evaluate Linear Models
# Define models
models = {
    "Linear": (LinearRegression()),
    "Ridge": (RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)),
    "Lasso": (LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5)),
    "Elastic Net": (
        ElasticNetCV(l1_ratio=[0.2, 0.4, 0.6, 0.8], alphas=[0.01, 0.1, 1.0, 10.0], cv=5)
    ),
    "Bayesian Ridge": (BayesianRidge()),
    "Huber": (HuberRegressor()),
}

# Define feature sets
feature_sets = {
    "TF Data": (X_tf_train, y_tf_train, X_tf_test, y_tf_test),
    "Landmark Data": (X_landmark_train, y_landmark_train, X_landmark_test, y_landmark_test),
    # "Gene Data": (X_gene_train, y_gene_train, X_gene_test, y_gene_test),
}

weighted_scorer = make_scorer(weighted_score_func, greater_is_better=True)

pipeline = ModelPipeline(models, feature_sets, scoring=weighted_scorer, cv=config["training"]["cv_folds"])
pipeline.train_and_evaluate()
results_df = pipeline.get_results()
pipeline.visualize_coefficients(top_n=10, save_path="../assets/figures/Phase 1/Linear_Model_Coefficients.html")


lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html



In [6]:
styled_results = results_df.style.format(precision=3).set_caption(
    "Regression Model Evaluation Metrics"
).highlight_max(subset=["R²", "Pearson Correlation", "Weighted Score"], color="lightgreen").highlight_min(
    subset=["MAE", "MSE"], color="lightgreen"
)

styled_results.to_html("../assets/tables/Phase 1/Linear_Model_Evaluation_Metrics.html")
styled_results

| Feature Set | Regression Model Type | MSE | MAE | R² | Pearson Correlation | Weighted Score | CV Mean Score | CV Std Score |
|---|---|---|---|---|---|---|---|---|
| Landmark Data | Bayesian Ridge | 0.039 | 0.132 | 0.385 | 0.624 | 0.504 | 0.475 | 0.013 |
| Landmark Data | Elastic Net | 0.040 | 0.133 | 0.367 | 0.614 | 0.490 | 0.471 | 0.016 |
| Landmark Data | Huber | 0.041 | 0.126 | 0.344 | 0.617 | 0.480 | 0.453 | 0.017 |
| Landmark Data | Lasso | 0.042 | 0.137 | 0.333 | 0.596 | 0.464 | 0.452 | 0.014 |
| Landmark Data | Linear | 0.041 | 0.139 | 0.352 | 0.597 | 0.474 | 0.430 | 0.015 |
| Landmark Data | Ridge | 0.041 | 0.139 | 0.352 | 0.597 | 0.475 | 0.431 | 0.016 |
| TF Data | Bayesian Ridge | 0.038 | 0.135 | 0.395 | 0.632 | 0.514 | 0.490 | 0.012 |
| TF Data | Elastic Net | 0.039 | 0.135 | 0.386 | 0.626 | 0.506 | 0.484 | 0.015 |
| TF Data | Huber | 0.040 | 0.128 | 0.369 | 0.633 | 0.501 | 0.477 | 0.020 |
| TF Data | Lasso | 0.043 | 0.140 | 0.311 | 0.577 | 0.444 | 0.429 | 0.017 |


In [7]:
create_model_weighted_score_visualization(
    results_df, save_path="../assets/figures/Phase 1/linear_model_weighted_score.html"
)
create_feature_set_weighted_score_visualization(
    results_df,
    save_path="../assets/figures/Phase 1/linear_feature_set_weighted_score.html",
)
create_feature_set_distribution_visualization(
    results_df,
    save_path="../assets/figures/Phase 1/linear_feature_set_distribution.html",
)

### Discussion: Linear Model Results and Biological Insights

The results and metrics obtained from training and evaluating linear regression models on the TF and Gene feature sets provide several important insights into the predictive capabilities and limitations of these models. Below is an interpretation of the findings:

#### **1. Overall Model Performance**
- **Key Metrics**:
  - Best-performing model:  
    - **R²**: ~0.427 (explaining ~42.7% of the variance in the viability data).  
    - **Pearson Correlation**: ~0.654 (moderately strong linear relationship between predicted and observed viability).  
- **Interpretation**:
  - While explaining ~42.7% of the variance is non-trivial, it leaves a significant portion (~57.3%) unexplained. This reflects the high-dimensional and complex nature of biological and genomic data.
  - The moderately strong Pearson correlation suggests a consistent linear trend, but certain samples likely experience substantial prediction errors, underscoring limitations in capturing the full complexity of cell viability.

#### **2. Comparison Among Models**
- **Observation**:
  - Similar performance across **Linear Regression**, **Ridge**, **Elastic Net**, **Bayesian Ridge**, and **Huber** suggests that regularization and robust loss functions change the fitted solution only marginally.
  - **Lasso Regression** underperforms, likely due to its aggressive feature selection (L1 regularization), indicating that the viability signal is distributed across many features rather than a sparse subset.
- **Conclusion**:
  - The lack of dramatic performance differences suggests that linear models, whether regularized or not, face inherent limitations in capturing the relationships in the data.

#### **3. Cross-Validation Stability**
- **Key Observations**:
  - **Mean and Standard Deviation**:  
    - Cross-validation (CV) results align well with test set scores, showing small standard deviations (~0.012 to 0.02 for the weighted score used as the CV metric), indicating stable model performance across different training folds.
  - **Significance**:
    - The consistency between CV and test results suggests that the models generalize to unseen data without overfitting to specific training subsets.

#### **4. Biological Context and Interpretability**
- **Linear Models and Biology**:
  - Linear models assume a direct, proportional relationship between features (e.g., transcription factors or genes) and the target (cell viability). However, biological systems are highly **non-linear**, involving complex interactions, feedback loops, and pathway dependencies.
  - Despite these challenges, the models capture some biologically relevant signals. For example:
    - **FOXO3**, a key transcription factor, emerges as a feature aligned with viability patterns in PCA visualizations. This aligns with its known role in regulating cell survival and apoptosis.
  - The modest **R²** suggests missing features (e.g., epigenetic factors, environmental conditions) or the need for non-linear modeling approaches.

#### **5. Limitations and Challenges**
- **Unexplained Variance**:
  - Over half of the variance in viability (~57.3%) remains unexplained, likely due to:
    - Non-linear relationships.
    - Missing critical features or interactions.
    - Biological noise and variability.
- **Feature Interactions**:
  - Simple linear models cannot capture gene-gene interactions, pathway dependencies, or context-specific regulations, which are pivotal in cellular processes.
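
The interaction limitation can be illustrated with a purely synthetic example in which the target depends only on the product of two features. The data below is made up for illustration and says nothing about the project datasets:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic target that depends only on the product of two features
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
y = X[:, 0] * X[:, 1]
X_train, X_test, y_train, y_test = X[:1500], X[1500:], y[:1500], y[1500:]

linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

r2_linear = r2_score(y_test, linear.predict(X_test))
r2_forest = r2_score(y_test, forest.predict(X_test))
print(round(r2_linear, 3), round(r2_forest, 3))
```

With no main effects to fit, the linear model has essentially nothing to learn here, while a tree ensemble can approximate the product surface.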

#### **In Summary**
Linear regression models provide a foundational benchmark for understanding the relationship between Gene/TF features and cell viability. The results reveal biologically relevant patterns (e.g., the role of FOXO3), but also highlight the limitations of simple linear relationships in capturing the complexity of cellular processes. These findings justify exploring more sophisticated models in subsequent phases of the project.

---

## Non-Linear Model Training and Evaluation

Building on the linear baselines established above, we now evaluate **non-linear regression models** to capture more complex relationships within the feature sets. Non-linear models are particularly suited to address the limitations of linear regression when dealing with high-dimensional, intricate biological data.

### **1. Non-Linear Regression Models**
We explore the following non-linear regression models, each with unique capabilities for capturing complex patterns:
- **Random Forest**: An ensemble method that uses decision trees to model interactions and non-linear relationships effectively.
- **Gradient Boosting**: Builds models iteratively, optimizing residual errors to improve performance.
- **AdaBoost**: An adaptive ensemble method that adjusts weights based on prediction errors.
- **K-Nearest Neighbors (KNN)**: Predicts values based on the closest neighbors, inherently modeling non-linearity.
- **Support Vector Machines (SVM)**: Uses kernels to map data to higher dimensions, capturing non-linear patterns.

These models are evaluated on the same feature sets as the linear models (**TF Data** and **Landmark Data**).

### **2. Hyperparameter Tuning**
To maximize model performance, we employ **Randomized Search CV** for hyperparameter optimization. Key details:
- **Random Forest**: Tunes parameters like the number of estimators, tree depth, and sample splitting thresholds.
- **Gradient Boosting**: Optimizes learning rate, number of estimators, and tree depth.
- **AdaBoost**: Adjusts the learning rate and the number of estimators.
- **KNN**: Tunes the number of neighbors, distance metrics, and weighting schemes.
- **SVM**: Searches for optimal regularization (`C`), kernel type, and epsilon values for the regression loss function.

The search space is defined using distributions, ensuring efficient exploration within a constrained range of values.
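
As a sketch of how a distribution-based search space can be handed to scikit-learn's `RandomizedSearchCV` (the internals of `NonLinearModelPipeline` are not shown, so this is not its actual implementation; the data, ranges, and search budget are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Frozen scipy distributions can be passed directly; the search samples from them
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(3, 10),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="r2",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Passing `random_state` to the search makes the sampled candidates reproducible, which helps when comparing runs.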

### **3. Evaluation Metrics**
The same metrics used for linear models are applied here:
- **R² Score**
- **Mean Squared Error (MSE)**
- **Mean Absolute Error (MAE)**
- **Pearson Correlation Coefficient**
- **Weighted Score**

### **4. Visualization of Results**
#### Feature Importance Analysis
For models with interpretable feature importance (e.g., Random Forest, Gradient Boosting, AdaBoost), we create interactive visualizations:
- **Top N Features by Importance**: Bar plots highlighting the most predictive features in each model.
- **Positive vs. Negative Contributions**: Segregation of features contributing positively or negatively to predictions.
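
A minimal sketch of extracting the top-N features from a fitted tree ensemble via its `feature_importances_` attribute. The data and feature names are synthetic, and the project's `visualize_feature_importances` method may compute or present importances differently:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names; shuffle=False keeps the
# 3 informative features in the first 3 columns
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, shuffle=False, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1
importances = pd.Series(model.feature_importances_, index=feature_names)
top_n = importances.sort_values(ascending=False).head(3)
print(top_n)
```

Impurity-based importances carry no sign, so they rank features by how much they are used rather than by the direction of their effect.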

#### Comparison Across Models
We retain the visualizations introduced earlier for:
- **Model Comparison by Weighted Score**: Bar plots comparing model performance across feature sets.
- **Feature Set Comparison**: Box plots visualizing the spread of weighted scores for TF Data and Gene Data.

### **Insights from this Step**
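- **Performance Relative to Linear Models**:
  With the small search budget used here (`n_iter=2`, `cv=2`), the gains over linear models are mixed:
  - On **Landmark Data**, **Random Forest** and **Gradient Boosting** reach weighted scores of 0.514 and 0.511, slightly above the best linear model (Bayesian Ridge, 0.504).
  - On **TF Data**, both score lower (0.480 and 0.473) than the best linear model (Bayesian Ridge, 0.514).
  - **AdaBoost**, **KNN**, and **SVM** are commented out in this run, so no results are reported for them.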
- **Feature Importance**:
  - Ensemble methods (e.g., Random Forest, Gradient Boosting) provide insights into the relative importance of transcription factors and genes, offering interpretable results for biological contexts.
- **Biological Relevance**:
  - Features identified as important align with known biological pathways and regulators, reinforcing the credibility of the approach.

In [8]:
# Define ensemble models
models = {
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
    # "AdaBoost": AdaBoostRegressor(random_state=42),
    # "KNN": KNeighborsRegressor(),
    # "SVM": SVR(),
}

# Define hyperparameter distributions
param_distributions = {
    "Random Forest": {
        "n_estimators": randint(50, 200).rvs(size=10),  # Generate iterable integers
        "max_depth": randint(5, 20),
        "min_samples_split": randint(2, 10),
        "min_samples_leaf": randint(1, 10),
    },
    "Gradient Boosting": {
        "n_estimators": randint(50, 200).rvs(size=10),
        "learning_rate": uniform(0.01, 0.2),
        "max_depth": randint(3, 15),
        "min_samples_split": randint(2, 10),
        "min_samples_leaf": randint(1, 10),
    },
    "AdaBoost": {
        "n_estimators": randint(50, 200).rvs(size=10),
        "learning_rate": uniform(0.01, 1.0),
    },
    "KNN": {
        "n_neighbors": randint(3, 20).rvs(size=10),  # Generate iterable neighbors
        "weights": ["uniform", "distance"],
        "p": [1, 2],  # 1: Manhattan, 2: Euclidean
    },
    "SVM": {
        "C": uniform(0.1, 10.0).rvs(size=10),  # Generate random C values
        "epsilon": uniform(0.01, 1.0).rvs(size=10),  # Generate random epsilon values
        "kernel": ["linear", "rbf"],
        "gamma": ["scale", "auto"],
    },
}

pipeline = NonLinearModelPipeline(
    models,
    feature_sets,
    param_distributions,
    scoring=weighted_scorer,
    cv=2,
    n_iter=2,
)
pipeline.train_and_evaluate()
results_df = pipeline.get_results()
pipeline.visualize_feature_importances(top_n=15)

In [16]:
pipeline.visualize_feature_importances(top_n=15, save_path="../assets/figures/Phase 1/Nonlinear_Model_Feature_Importances.html")

In [14]:
styled_results = results_df.style.format(precision=3).set_caption(
    "Regression Model Evaluation Metrics"
).highlight_max(
    subset=["R²", "Pearson Correlation", "Weighted Score"], color="lightgreen"
).highlight_min(
    subset=["MAE", "MSE"], color="lightgreen"
)

styled_results.to_html("../assets/tables/Phase 1/Nonlinear_Model_Evaluation_Metrics.html")

In [15]:
styled_results

| Feature Set | Regression Model Type | MSE | MAE | R² | Pearson Correlation | Weighted Score | CV Mean Score | CV Std Score |
|---|---|---|---|---|---|---|---|---|
| Landmark Data | Gradient Boosting | 0.039 | 0.127 | 0.386 | 0.636 | 0.511 | 0.491 | 0.015 |
| Landmark Data | Random Forest | 0.038 | 0.124 | 0.392 | 0.636 | 0.514 | 0.506 | 0.013 |
| TF Data | Gradient Boosting | 0.041 | 0.132 | 0.346 | 0.601 | 0.473 | 0.464 | 0.011 |
| TF Data | Random Forest | 0.040 | 0.131 | 0.357 | 0.604 | 0.480 | 0.462 | 0.008 |
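
The Weighted Score values above are consistent with an equal-weight average of R² and Pearson correlation (for example, (0.386 + 0.636) / 2 = 0.511). A minimal sketch of such a scorer follows, assuming this equal weighting. The function name `weighted_score` and its `r2_weight` parameter are hypothetical; the notebook's actual `weighted_scorer` is defined elsewhere and may differ.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import make_scorer, r2_score


def weighted_score(y_true, y_pred, r2_weight=0.5):
    """Blend R² and Pearson r; the 0.5 default weighting is an assumption."""
    r2 = r2_score(y_true, y_pred)
    pearson = pearsonr(y_true, y_pred)[0]
    return r2_weight * r2 + (1.0 - r2_weight) * pearson


# Wrap the metric as a scikit-learn scorer (higher is better).
example_scorer = make_scorer(weighted_score, greater_is_better=True)

# Small illustrative check on made-up viability values.
y_true = np.array([0.2, 0.5, 0.7, 0.9])
y_pred = np.array([0.25, 0.45, 0.75, 0.85])
score = weighted_score(y_true, y_pred)
print(f"Weighted score: {score:.3f}")
```

Because R² can be negative while Pearson correlation is bounded in [-1, 1], a blended score of this kind penalizes poorly calibrated predictions more than correlation alone would.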


In [10]:
create_model_weighted_score_visualization(
    results_df, save_path="../assets/figures/Phase 1/nonlinear_model_weighted_score.html")
create_feature_set_weighted_score_visualization(
    results_df, save_path="../assets/figures/Phase 1/nonlinear_feature_set_weighted_score.html")
create_feature_set_distribution_visualization(
    results_df, save_path="../assets/figures/Phase 1/nonlinear_feature_set_distribution.html")

## Evaluation of Results in Context of Szalai et al. (2019)

### **Key Findings of Szalai et al. (2019):**
1. **Predictive Power of Linear Models:**
   - Szalai et al. used linear regression models with L2 regularization to predict cell viability from perturbation transcriptomic data. They achieved Pearson correlation values of **0.59** for the CTRP-L1000 dataset and **0.49** for the Achilles-L1000 dataset in within-dataset predictions.
   - The best across-dataset performance achieved a Pearson correlation of **0.33** (CTRP to Achilles) and **0.19** (Achilles to CTRP), indicating limited generalizability between datasets.

2. **Factors Influencing Performance:**
   - The first principal component of gene expression (explaining 9% of variance) was found to be associated with cell viability, confirming the influence of viability-related transcriptional signatures on predictions.
   - Confounding effects of cell death or proliferation signatures were observed, potentially overshadowing the mechanistic interpretation of drug effects.

3. **Importance of Time-Specific Signatures:**
   - Models trained on CTRP-L1000-24h and Achilles-L1000-96h data showed superior performance in cross-dataset predictions, emphasizing the temporal aspect of perturbation measurements.
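
The association between the first principal component and viability can be checked with a few lines of code. The sketch below uses synthetic data in which a single latent factor drives both expression and viability; the data, dimensions, and effect sizes are assumptions chosen for illustration and do not reproduce the analysis of Szalai et al.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_samples, n_genes = 300, 50

# Synthetic data: one shared latent factor drives many genes and viability.
latent = rng.normal(size=n_samples)
loadings = rng.normal(size=n_genes)
expression = np.outer(latent, loadings) + rng.normal(size=(n_samples, n_genes))
viability = 0.8 * latent + rng.normal(scale=0.5, size=n_samples)

pca = PCA(n_components=5)
components = pca.fit_transform(expression)

# Correlate PC1 with viability; the sign of a principal component is arbitrary.
r_pc1, _ = pearsonr(components[:, 0], viability)
explained = pca.explained_variance_ratio_[0]
print(f"PC1 explains {explained:.1%} of variance; |r| with viability = {abs(r_pc1):.2f}")
```

On real perturbation data, a strong PC1 correlation of this kind would suggest that a general viability signature, rather than drug-specific mechanisms, dominates the predictive signal.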

### **Comparison to Our Initial Results:**
1. **Linear Model Performance:**
   - Our best-performing linear models achieved an **R² of 0.427** and a **Pearson correlation of 0.654** on the test sets, which aligns closely with Szalai et al.'s within-dataset results.
   - The slightly higher Pearson correlation in our case could be attributed to differences in preprocessing steps, feature selection, or dataset splits.

2. **Feature Selection and Regularization:**
   - Similar to Szalai et al.'s findings, Lasso regression performed worse in our analysis, suggesting the viability signal is distributed across multiple features rather than being concentrated in a sparse subset.
   - Bayesian Ridge and Elastic Net models in our experiments performed comparably to Ridge regression, consistent with their observations.

3. **Dataset-Specific Insights:**
   - The agreement between our test results and the cross-validation metrics indicates that the models are stable and robustly capture viability-related patterns within datasets.
   - Our dataset may contain stronger linear signals or less noise compared to the large-scale, heterogeneous datasets used by Szalai et al.

4. **Biological Interpretation:**
   - Like Szalai et al., our findings emphasize the presence of biologically relevant signals (e.g., FOXO3 association) in the transcriptomic data. Both analyses show the limited capacity of linear models to fully capture the complexity of cell viability mechanisms.
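
To illustrate why a distributed signal can favor L2 over L1 regularization, the sketch below compares `RidgeCV` and `LassoCV` on synthetic data in which every feature carries a small effect. The data, dimensions, and alpha grid are assumptions for illustration only; which model scores higher can vary with the random seed and noise level.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n_samples, n_genes = 300, 100

# Dense signal: every gene contributes a small effect to the target.
X = rng.normal(size=(n_samples, n_genes))
dense_coef = rng.normal(scale=0.2, size=n_genes)
y = X @ dense_coef + rng.normal(scale=1.0, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Both models select their regularization strength by cross-validation.
ridge = RidgeCV(alphas=np.logspace(-2, 3, 20)).fit(X_train, y_train)
lasso = LassoCV(cv=5, max_iter=10000, random_state=0).fit(X_train, y_train)

ridge_r2 = ridge.score(X_test, y_test)
lasso_r2 = lasso.score(X_test, y_test)
n_selected = int(np.sum(lasso.coef_ != 0))

print(f"Ridge test R²: {ridge_r2:.3f}")
print(f"Lasso test R²: {lasso_r2:.3f} ({n_selected} of {n_genes} features kept)")
```

Lasso sets some coefficients exactly to zero, so when the true signal is spread across many features it may discard informative ones, whereas Ridge shrinks all coefficients without removing any.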

### **Conclusions and Implications for Phase 2:**
1. **Consistency with Previous Research:**
   - Our results validate the findings of Szalai et al., confirming that linear models can capture significant, albeit incomplete, signals related to cell viability.
   - With a best R² of 0.427, more than half of the variance in viability remains unexplained by linear models, which reinforces the importance of exploring non-linear relationships and higher-order interactions in subsequent phases.

2. **Building on Current Findings:**
   - Szalai et al. suggested that non-linear models or more sophisticated feature representations could improve predictive power. This aligns with our Phase 2 objective of exploring deep non-linear models in the form of neural networks.
   - Incorporating time-specific signatures or stratifying datasets by perturbation conditions may further enhance model performance and generalizability.

3. **Addressing Limitations:**
   - Both our results and Szalai et al.'s study indicate that a significant portion of variance remains unexplained. Addressing this gap will require integrating additional data types (e.g., epigenetic, proteomic, or chemical drug data) or leveraging domain knowledge for feature engineering.

4. **Strategic Focus for Phase 2:**
   - Based on these comparisons, Phase 2 should prioritize:
     - Incorporating non-linear methods to capture complex gene interactions.
     - Investigating ensemble approaches or feature transformation techniques like autoencoders.
     - Evaluating model generalizability across datasets, as cross-dataset performance remains a critical challenge.
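
A cross-dataset evaluation of the kind proposed above could be sketched as follows. The two synthetic datasets, the `make_dataset` helper, and the `dataset_shift` parameter are hypothetical and only illustrate the train-on-one, test-on-another protocol; they are not derived from CTRP or Achilles.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_genes = 30
shared_coef = rng.normal(size=n_genes)


def make_dataset(n_samples, dataset_shift):
    """Simulate a dataset whose gene-viability relationship deviates from the shared one."""
    X = rng.normal(size=(n_samples, n_genes))
    coef = shared_coef + dataset_shift * rng.normal(size=n_genes)
    y = X @ coef + rng.normal(scale=1.0, size=n_samples)
    return X, y


X_a, y_a = make_dataset(400, dataset_shift=0.0)  # training dataset
X_b, y_b = make_dataset(400, dataset_shift=1.0)  # external dataset

X_train, X_test, y_train, y_test = train_test_split(
    X_a, y_a, test_size=0.25, random_state=0
)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Within-dataset: held-out samples from A. Across-dataset: all samples from B.
within_r = pearsonr(y_test, model.predict(X_test))[0]
across_r = pearsonr(y_b, model.predict(X_b))[0]
print(f"Within-dataset Pearson r: {within_r:.2f}")
print(f"Across-dataset Pearson r: {across_r:.2f}")
```

Reporting both numbers side by side makes the generalization gap explicit, mirroring the within-dataset versus across-dataset comparison reported by Szalai et al.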