#### QSAR Pipeline for Ionic-Liquid Cytotoxicity

This module implements a fully automated **Quantitative Structure–Activity Relationship (QSAR)** pipeline for predicting the cytotoxicity of ionic liquids using multiple curated CSV datasets. It supports both **regression (CC50 prediction)** and **binary toxicity classification**, and produces all required molecular fingerprints, physicochemical descriptors, metadata encodings, and trained machine-learning artifacts.

---

**1. Architectural Overview**

  **1.1 Data Aggregation**

* Automatically loads all cytotoxicity CSV files from
  `data/cytotoxicity_ionic_liquids/csv_datasets/`
* Adds `Family` labels based on the file names.
* Standardizes numeric formats such as:

  * `"1,5"` → `1.5`
  * `"1–2"` → mean of the range
  * strings containing units, parentheses, or trailing text → the embedded number
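
The loading and labeling steps above can be sketched with the standard library alone. The function name `load_family_csvs` is hypothetical, and the rule that the `Family` label equals the file name without its extension is an assumption; the actual pipeline may derive the label differently.

```python
import csv
from pathlib import Path


def load_family_csvs(csv_dir):
    """Read every CSV in csv_dir and tag each row with a Family label.

    Assumption: the label is the file name without its extension.
    Values are kept as raw strings; numeric cleanup happens later.
    """
    rows = []
    for path in sorted(Path(csv_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["Family"] = path.stem
                rows.append(row)
    return rows
```

Called as `load_family_csvs("data/cytotoxicity_ionic_liquids/csv_datasets/")`, this would return one dictionary per data row, each carrying the label of the file it came from.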

  **1.2 Molecular Representation**

The core structural representation is a **2048-bit Morgan Fingerprint (ECFP)** generated using RDKit:

* Radius = 2
* Vector length = 2048 bits (`fp_0` … `fp_2047`)
* Automatic salt/pair stripping and SMILES normalization
* Graceful zero-vector fallback if parsing fails

This fixed-length bit vector records which circular substructures are present around each atom, which is the structural signal that SAR models rely on.

  **1.3 Physicochemical Descriptors**

Each molecule also receives three RDKit-calculated descriptors:

| Descriptor     | Meaning                                          |
| -------------- | ------------------------------------------------ |
| **MolWt_calc** | Molecular Weight (Da)                            |
| **LogP_calc**  | Logarithm of octanol/water partition coefficient |
| **TPSA_calc**  | Topological Polar Surface Area (Å²)              |

These descriptors complement the fingerprint with whole-molecule properties (size, lipophilicity, polarity) that individual substructure bits do not capture directly.

  **1.4 Metadata Encoding**

Several experimental or biological categorical fields are factorized:

* `Family_enc`
* `Cell type_enc`
* `Organism_enc`
* `Full name of method_enc`

These codes let the model account for differences in assay conditions, cell types, and data sources, which would otherwise show up as unexplained noise in the potency values.
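
To illustrate what factorizing a categorical column means, here is a minimal pure-Python sketch. It follows the convention of `pandas.factorize` (codes assigned in order of first appearance, `-1` for missing values), but whether the pipeline uses that function is an assumption, as is treating empty strings as missing.

```python
def factorize(values):
    """Map each distinct category to an integer code.

    Codes follow the order of first appearance. Missing entries
    (None or empty string, an assumption of this sketch) get -1.
    Returns (codes, categories).
    """
    codes = []
    mapping = {}
    for value in values:
        if value is None or value == "":
            codes.append(-1)
            continue
        # setdefault assigns the next free code only for unseen categories.
        codes.append(mapping.setdefault(value, len(mapping)))
    return codes, list(mapping)


codes, categories = factorize(["HeLa", "MCF-7", "HeLa", None])
# codes      -> [0, 1, 0, -1]
# categories -> ["HeLa", "MCF-7"]
```

Note that these integer codes are arbitrary identifiers, not an ordering; tree-based models can split on them, but they should not be read as a scale.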

  **1.5 Target Construction**

Two parallel machine learning targets are produced:

  **Regression target**

```text
log_CC50 = log10(CC50_mM)
```

Log-transforming CC50 stabilizes variance and better models multiplicative toxicity effects.

  **Classification target**

```text
toxic_label = 1 if CC50_mM < threshold_mM else 0
```

Default threshold: `1.0 mM` (configurable).
Compounds with a CC50 below the threshold are labeled toxic (`1`); all others are labeled non-toxic (`0`).
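
Both targets can be derived from a single CC50 value, as in the sketch below. The function name `build_targets` is hypothetical, and returning `None` for missing or non-positive values is an assumption made here because `log10` is undefined for them.

```python
import math


def build_targets(cc50_mm, threshold_mm=1.0):
    """Return (log_CC50, toxic_label) for one CC50 value given in mM.

    Returns None when the value is missing or not strictly positive
    (an assumption of this sketch).
    """
    if cc50_mm is None or math.isnan(cc50_mm) or cc50_mm <= 0:
        return None
    log_cc50 = math.log10(cc50_mm)
    toxic_label = int(cc50_mm < threshold_mm)  # strict "<", as in the rule above
    return log_cc50, toxic_label
```

With the default threshold, a compound at exactly `1.0 mM` is labeled non-toxic because the comparison is strict.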

---

**2. Cytotoxicity Calculation Logic**

  **2.1 Parsing Raw CC50/IC50/EC50 Data**

The pipeline standardizes values from the column
`CC50/IC50/EC50, mM` using `extract_float()`, which:

* supports commas, parentheses, units, and mixed text
* extracts numbers safely
* averages numeric ranges (e.g., `"1–3"` → `2.0`)
* converts invalid entries to `NaN`

Rows missing either a SMILES string or a potency value are removed.
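
The parsing rules listed above could look roughly like the following. This is not the module's `extract_float()`; it is an illustrative stand-in named `extract_float_sketch`, and it assumes that a comma is always a decimal mark (never a thousands separator) and that scientific notation does not occur.

```python
import math
import re

# A number with an optional decimal part; "," or "." as the decimal mark.
_NUMBER = r"\d+(?:[.,]\d+)?"
# Two numbers joined by a hyphen or an en dash are treated as a range.
_RANGE = re.compile(rf"({_NUMBER})\s*[-\u2013]\s*({_NUMBER})")
_SINGLE = re.compile(_NUMBER)


def _to_float(token):
    return float(token.replace(",", "."))


def extract_float_sketch(raw):
    """Pull one float out of a messy potency cell, or return NaN."""
    if raw is None:
        return math.nan
    text = str(raw).strip()
    match = _RANGE.search(text)
    if match:
        low, high = (_to_float(group) for group in match.groups())
        return (low + high) / 2  # ranges collapse to their midpoint
    match = _SINGLE.search(text)
    return _to_float(match.group()) if match else math.nan
```

Under these assumptions, `"1,5"` parses to `1.5`, `"1–3"` to `2.0`, and `"2.5 mM (n=3)"` to `2.5`, while a cell such as `"n/a"` yields `NaN`. When a cell holds several unrelated numbers, the sketch keeps the first one.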

  **2.2 Log Transform**

Cytotoxicity spans orders of magnitude. To normalize the distribution:

```text
log_CC50 = log10(CC50_mM)
```

This gives the regressor a target whose distribution is less skewed than that of the raw concentrations.

  **2.3 Binary Toxicity Threshold**

Toxicity is defined by a configurable threshold (default `1.0 mM`):

```text
toxic_label = 1 if CC50_mM < 1.0 mM else 0
```

Because a lower CC50 means that less compound is needed to reduce cell viability by half, values below the threshold map to the toxic class.

---

**3. Model Architecture & Accuracy Rationale**

  **3.1 Learning Algorithms**

The pipeline trains:

* **RandomForestRegressor** for predicting log(CC50)
* **RandomForestClassifier** for toxicity classification

  **3.2 Why Random Forests?**

Random Forests are chosen because they:

* typically perform well on **high-dimensional binary fingerprints**
* can model **non-linear SAR** trends
* tolerate noisy biological data, such as that found in prototyping datasets
* need comparatively little hyperparameter tuning
* resist overfitting through bagging and feature subsampling, although they are not immune to it

  **3.3 Feature Channels**

The model integrates three categories of features:

1. **Structural fingerprints** (2048 bits)
2. **Physicochemical descriptors** (MolWt, LogP, TPSA)
3. **Encoded assay/cell-line metadata**

Combining the three channels gives the model structural, physicochemical, and biological context at the same time.
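
One way to picture the combined feature vector is as an ordered list of column names. The sketch below uses the column names given in this document; the ordering, the helper `row_to_vector`, and the choice to fill absent columns with `0.0` are assumptions for illustration.

```python
FINGERPRINT_COLUMNS = [f"fp_{i}" for i in range(2048)]
DESCRIPTOR_COLUMNS = ["MolWt_calc", "LogP_calc", "TPSA_calc"]
METADATA_COLUMNS = [
    "Family_enc",
    "Cell type_enc",
    "Organism_enc",
    "Full name of method_enc",
]

# Assumed order: fingerprint bits, then descriptors, then metadata codes.
FEATURE_COLUMNS = FINGERPRINT_COLUMNS + DESCRIPTOR_COLUMNS + METADATA_COLUMNS


def row_to_vector(row, columns=FEATURE_COLUMNS):
    """Turn a dict-like row into a fixed-order list of floats.

    Columns absent from the row are filled with 0.0 (an assumption).
    """
    return [float(row.get(name, 0.0)) for name in columns]
```

Keeping one fixed column order matters because the trained models expect features in the same positions at inference time, which is presumably why a feature list is saved with the models.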

  **3.4 Reported Metrics**

During training, the pipeline computes:

* **Regression:** R², MAE (log scale)
* **Classification:** Accuracy, ROC-AUC

These metrics are computed on the held-out test set, so they estimate performance on compounds not seen during training.
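
For reference, the four metrics can be written out in plain Python as below. The pipeline most likely relies on a library implementation; these hypothetical helpers (`r2`, `mae`, `accuracy`, `roc_auc`) only spell out the definitions and assume non-empty inputs that contain both classes.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def mae(y_true, y_pred):
    """Mean absolute error, here in log10(CC50) units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels, predicted):
    """Fraction of labels predicted exactly."""
    return sum(a == b for a, b in zip(labels, predicted)) / len(labels)


def roc_auc(labels, scores):
    """Probability that a random toxic compound outscores a random
    non-toxic one; ties count as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because the regression target is on a log scale, the MAE is easiest to read as a fold error: an MAE of 0.3 log units corresponds to predictions that are off by roughly a factor of 2 in CC50.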

---

**4. Input Requirements**

  **4.1 Required Inputs for Dataset Building**

You must supply CSV files containing at minimum:

* `Canonical SMILES`
* `CC50/IC50/EC50, mM`

Optional additional columns (cell-line data, methods, incubation time, etc.) are automatically merged if available.

  **4.2 Input for Model Training**

Training functions consume the ML-ready dataset produced by:

```python
build_qsar_dataset()
```

which returns a `pandas.DataFrame`.

  **4.3 Input for Inference**

For prediction, the pipeline accepts **a single SMILES string**:

```python
predict_from_smiles("C[N+](C)(C)CCC[N+](C)(C)C")
```

The output includes:

```json
{
  "log_CC50_pred": ...,
  "CC50_mM_pred": ...,
  "prob_toxic": ...,
  "toxic_label_pred": 0 or 1
}
```
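
The fields in this output are related to one another: `CC50_mM_pred` is the back-transform of `log_CC50_pred`, and the label follows from the predicted probability. The sketch below shows one plausible way to assemble the dictionary; the helper `format_prediction` and the `0.5` probability cutoff are assumptions, not confirmed behavior of `predict_from_smiles`.

```python
def format_prediction(log_cc50_pred, prob_toxic, decision_cutoff=0.5):
    """Assemble the result dictionary from the two model outputs.

    log_cc50_pred: regressor output, log10 of CC50 in mM.
    prob_toxic: classifier probability of the toxic class.
    decision_cutoff: assumed probability cutoff for the label.
    """
    return {
        "log_CC50_pred": log_cc50_pred,
        "CC50_mM_pred": 10 ** log_cc50_pred,  # undo the log10 transform
        "prob_toxic": prob_toxic,
        "toxic_label_pred": int(prob_toxic >= decision_cutoff),
    }
```

Since the regressor and the classifier are trained separately, the predicted CC50 and the predicted label can occasionally disagree with respect to the toxicity threshold.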

---

**5. Saved Artifacts**

Artifacts produced by training are saved under `models/`:

```text
models/
 ├── qsar_regressor_rf.pkl
 ├── qsar_classifier_rf.pkl
 ├── qsar_scaler.pkl
 └── qsar_feature_list.json
```

The ML-ready dataset is exported to:

```text
master_data/biological/biological_qsar_ml_ready.csv
```

Saving the models, scaler, and feature list alongside the exported dataset lets downstream applications, such as a web front end, reload the same configuration that was used for training.

---