# Generate Synthetic Test Data

This notebook generates synthetic datasets mimicking the structure of real Tupro (single-cell RNA) and Fast Drug (drug response) data formats. These datasets are useful for testing, development, and demonstration purposes.

## Overview

The notebook creates three main output files:
1. **clone_infos.csv** - Clone metadata and cell type assignments
2. **sample2data/sample_*.csv** - Single-cell RNA expression data with clone assignments (one file per sample)
3. **FD_data.csv** - Drug response measurements (Fast Drug format)

## Data Generation Parameters

- **Samples (N)**: 50 patient samples
- **Clones (Kmax)**: 7 clones per sample (1 non-malignant + 6 tumor clones)
- **Drugs (D)**: 30 different drugs
- **Replicates (R)**: 20 drug response replicates
- **Features (L)**: 15 RNA expression dimensions
- **Negative Binomial (neg_bin)**: 100 (controls noise level)
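
To build intuition for how `neg_bin` affects noise, the sketch below draws negative binomial counts with a fixed mean and two dispersion values. It assumes the common mean/dispersion parameterization, in which the variance equals `mean + mean**2 / n`; whether scClone2DR uses exactly this parameterization internally is an assumption, and `nb_counts` is a hypothetical helper, not part of the package.

```python
import numpy as np

def nb_counts(mean, n, size, rng):
    """Draw negative binomial counts with a given mean and dispersion n.

    Assumed parameterization: variance = mean + mean**2 / n,
    so a larger n gives counts closer to Poisson (less overdispersion).
    """
    p = n / (n + mean)  # success probability that yields the requested mean
    return rng.negative_binomial(n, p, size=size)

rng = np.random.default_rng(0)
low_noise = nb_counts(mean=50.0, n=100, size=100_000, rng=rng)
high_noise = nb_counts(mean=50.0, n=2, size=100_000, rng=rng)

# Same mean, very different spread
print(low_noise.mean(), low_noise.var())
print(high_noise.mean(), high_noise.var())
```

Under this assumption, a value of 100 keeps the counts only mildly overdispersed relative to a Poisson distribution with the same mean.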

## Output Structure

### clone_infos.csv
Contains clone metadata with columns:
- `cloneID`: Unique clone identifier (0 to Kmax-1)
- `clonelabel`: Malignancy status (`healthy` for clone 0, `tumor` for all other clones)
- `clonecategory`: Same values as `clonelabel`
- `clonetype_sample_i`: Cell type for each clone in each sample (T cells, B cells, or Melanoma)
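
As a minimal sketch of how this table can be queried, the example below builds a toy version with three clones and two samples and looks up the cell type of a clone in a given sample. The values are made up, and `clone_type` is a hypothetical helper.

```python
import pandas as pd

# Toy clone_infos table (Kmax = 3 clones, 2 samples); values are illustrative only
clone_infos = pd.DataFrame({
    "cloneID": [0, 1, 2],
    "clonelabel": ["healthy", "tumor", "tumor"],
    "clonecategory": ["healthy", "tumor", "tumor"],
    "clonetype_sample_0": ["T cells", "Melanoma", "Melanoma"],
    "clonetype_sample_1": ["B cells", "Melanoma", "Melanoma"],
}).set_index("cloneID")

def clone_type(clone_infos, clone_id, sample):
    """Return the cell type of one clone in one sample."""
    return clone_infos.loc[clone_id, f"clonetype_sample_{sample}"]

print(clone_type(clone_infos, 0, 1))  # B cells
print(clone_type(clone_infos, 2, 0))  # Melanoma
```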

### sample2data/sample_*.csv
Single-cell data for each sample with columns:
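- `cell_id`: Cell identifier (`cell_id_0`, `cell_id_1`, ...); numbering restarts for each clone, so IDs are unique only within a clone
- `dim_1` to `dim_L`: RNA expression features
- `celltype`, `clonetype`: Cell type assignments (both set to the clone's cell type in that sample)
- `cellcategory`: Same value as `clonecategory`
- `initial_cloneID`, `cloneID`: Clone index (identical in this synthetic data)
- `clonelabel`, `clonecategory`: Clone annotations copied from clone_infos.csv

Each sample contains between 200 and 499 cells, allocated to clones by a multinomial draw from the sample's simulated clone proportions. All cells of a clone share that clone's feature vector.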
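
The allocation of cells to clones can be sketched as follows. The proportions here are hypothetical stand-ins for one row of `data_ref['proportions']`, and the sketch uses NumPy's `Generator` API rather than the legacy `np.random` functions called in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical clone proportions for one sample (Kmax = 7)
props = np.array([0.30, 0.20, 0.15, 0.15, 0.10, 0.05, 0.05], dtype=np.float64)
props = props / props.sum()  # renormalize to guard against rounding drift

# Sample size: the upper bound is exclusive, so the range is 200 to 499
total_cells = int(rng.integers(200, 500))

# Number of cells assigned to each clone; the counts always sum to total_cells
cells_per_clone = rng.multinomial(total_cells, props)

print(total_cells, cells_per_clone)
```

A clone with a small proportion can receive zero cells, which is why the export loop only appends a clone's rows when at least one cell was drawn.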

### FD_data.csv
Drug response data with columns:
- `SampleID`: Patient sample identifier
- `Concentration`: Drug concentration (fixed at '5')
- `Drug`: Drug name (DMSO for control, Drug_0 to Drug_29)
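- `Number_tumor_cells`: Tumor cell count, computed as `Number_all_cells` minus the simulated `n0` count (`n_c - n0_c` for control wells, `n_r - n0_r` for treated wells)
- `Number_all_cells`: Total cell count
- `Well_position_1`, `Well_position_2`: Plate well coordinates (random integers in 0-23 and 0-6)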
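
To illustrate how a table in this layout might be consumed, the sketch below computes the mean tumor fraction per sample and drug from a few toy rows. The numbers are made up, and the `tumor_fraction` column is an illustrative addition, not part of the Fast Drug format.

```python
import pandas as pd

# Toy rows in the FD_data.csv layout (values are illustrative only)
fd = pd.DataFrame(
    [
        ["sample_0", "5", "DMSO", 80, 100, 3, 1],
        ["sample_0", "5", "DMSO", 60, 100, 7, 2],
        ["sample_0", "5", "Drug_0", 30, 100, 11, 4],
        ["sample_0", "5", "Drug_0", 20, 100, 15, 5],
    ],
    columns=["SampleID", "Concentration", "Drug", "Number_tumor_cells",
             "Number_all_cells", "Well_position_1", "Well_position_2"],
)

# Fraction of tumor cells in each well
fd["tumor_fraction"] = fd["Number_tumor_cells"] / fd["Number_all_cells"]

# Average over replicate wells for each (sample, drug) pair
summary = fd.groupby(["SampleID", "Drug"])["tumor_fraction"].mean()
print(summary)
```

Comparing a drug's mean against the DMSO control of the same sample gives a quick plausibility check on the simulated responses.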

In [6]:
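import sys, os
sys.path.append('../')  # make the local scClone2DR package importable
import scClone2DR as sccdr
import matplotlib.pyplot as plt
import numpy as np
import torch
import pandas as pd
import random

# Compatibility shim: NumPy 2.0 removed the np.float_ alias that older code may still reference
np.float_ = np.float64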


modelscClone2DR = sccdr.models.scClone2DR()
R = 20
neg_bin = 100
data_ref = modelscClone2DR.get_simulated_training_data(
    {'C': 24, 'R': R, 'N': 50, 'Kmax': 7, 'D': 30, 'theta_rna': 15},
    neg_bin_n=neg_bin,
    mode_nu="noise_correction",
    mode_theta="not shared decoupled",
)
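
The notebook draws random values through `random`, `np.random`, and possibly PyTorch, and it sets no seed, so each run produces different files. A minimal sketch for making runs repeatable is shown below; `seed_everything` is a hypothetical helper, and whether `get_simulated_training_data` draws from the PyTorch or NumPy global generators is an assumption. To take effect it would need to be called before the simulation step.

```python
import random
import numpy as np

def seed_everything(seed):
    """Seed the global RNGs used in this notebook (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # seeded only if PyTorch is installed
        torch.manual_seed(seed)
    except ImportError:
        pass

# Two runs with the same seed produce the same draws
seed_everything(0)
first = (random.random(), int(np.random.randint(200, 500)))
seed_everything(0)
second = (random.random(), int(np.random.randint(200, 500)))
print(first == second)  # True
```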

In [7]:
from pathlib import Path
path2save_data = "./"

Path(os.path.join(path2save_data)).mkdir(parents=True, exist_ok=True)
Path(os.path.join(path2save_data, "sample2data")).mkdir(parents=True, exist_ok=True)

## Saving clone metadata

In [8]:
Kmax = data_ref['Kmax']
N = data_ref['N']
dic = {"cloneID": np.arange(Kmax),
       "clonelabel": ["healthy"] + ["tumor"] * (Kmax - 1),
       "clonecategory": ["healthy"] + ["tumor"] * (Kmax - 1),
      }
# Clone 0 is the non-malignant clone (T cells or B cells); all other clones are melanoma
for i in range(N):
    dic[f'clonetype_sample_{i}'] = [random.choice(['T cells', 'B cells'])] + ['Melanoma'] * (Kmax - 1)

clone_infos = pd.DataFrame(dic)
clone_infos.to_csv(os.path.join(path2save_data, 'clone_infos.csv'))

In [9]:
clone_infos = clone_infos.set_index("cloneID")

## Saving RNA data at the single-cell level

In [10]:
L = data_ref['X'].shape[2]

# Column layout shared by every sample file
columns = (
    ['cell_id']
    + [f'dim_{k+1}' for k in range(L)]
    + [
        'celltype',
        'cellcategory',
        'initial_cloneID',
        'clonetype',
        'clonelabel',
        'clonecategory',
        'cloneID'
    ]
)

for sample in range(N):

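    # Draw the sample size (200 to 499 cells; randint's upper bound is exclusive)
    total_cells = np.random.randint(200, 500)
    # Split the cells across clones according to the simulated clone proportions
    props = data_ref['proportions'][sample, :]
    cloneID2nb_cells = np.random.multinomial(total_cells, props)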

    cells = []

    for cloneID in range(Kmax):

        clonecells = []

        clone_info = clone_infos.iloc[cloneID]
        clonetype_sample = clone_info[f'clonetype_sample_{sample}']
        
        for cell_id in range(cloneID2nb_cells[cloneID]):

            feature = (
                [f'cell_id_{cell_id}']
                + [el.item() for el in data_ref['X'][cloneID, sample, :]]
                + [
                    clonetype_sample,
                    clone_info['clonecategory'],
                    cloneID,
                    clonetype_sample,
                    clone_info['clonelabel'],
                    clone_info['clonecategory'],
                    cloneID
                ]
            )

            clonecells.append(feature)

        if clonecells:
            cells.append(pd.DataFrame(clonecells, columns=columns))

    df_sample = pd.concat(cells, ignore_index=True)

    os.makedirs(os.path.join(path2save_data, "sample2data"), exist_ok=True)

    df_sample.to_csv(
        os.path.join(path2save_data, "sample2data", f"sample_{sample}.csv"),
        index=False
    )
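
A quick way to sanity-check an exported sample file is to read it back and confirm the per-clone cell counts and the shared feature vectors. The sketch below does this on a toy file written to a temporary directory; the toy values and the reduced column set are assumptions made to keep the example self-contained.

```python
import os
import tempfile
import pandas as pd

# Toy sample in the sample_*.csv layout (reduced column set, illustrative values)
toy = pd.DataFrame({
    "cell_id": ["cell_id_0", "cell_id_1", "cell_id_0"],
    "dim_1": [0.1, 0.1, 0.7],
    "dim_2": [1.5, 1.5, 2.5],
    "cloneID": [0, 0, 1],
    "clonelabel": ["healthy", "healthy", "tumor"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_0.csv")
    toy.to_csv(path, index=False)
    df = pd.read_csv(path)

# Number of cells exported for each clone
cells_per_clone = df["cloneID"].value_counts().sort_index()

# Every cell of a clone should carry the same feature vector,
# so each feature has exactly one distinct value per clone
feature_cols = [c for c in df.columns if c.startswith("dim_")]
distinct_profiles = df.groupby("cloneID")[feature_cols].nunique().max(axis=1)

print(cells_per_clone.tolist(), distinct_profiles.tolist())
```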

## Saving the Fast Drug data

In [12]:
# n0_r is indexed as (treated replicate, drug, sample); n0_c as (control replicate, sample)
Rt, D, N = data_ref['n0_r'].shape
Rc, N = data_ref['n0_c'].shape

columns = ['SampleID', 'Concentration', 'Drug', 'Number_tumor_cells', 'Number_all_cells', 'Well_position_1', 'Well_position_2']
data = []
for sample in range(N):
    for c in range(Rc):
        ntumor = data_ref['n_c'][c,sample].item()-data_ref['n0_c'][c,sample].item()
        well_x = np.random.randint(0,24)
        well_y = np.random.randint(0,7)
        data.append([f'sample_{sample}', '5', 'DMSO', ntumor, data_ref['n_c'][c,sample].item(), well_x, well_y])
    for r in range(Rt):
        for d in range(D):
            ntumor = data_ref['n_r'][r,d,sample].item()-data_ref['n0_r'][r,d,sample].item()
            well_x = np.random.randint(0,24)
            well_y = np.random.randint(0,7)
            data.append([f'sample_{sample}', '5', f'Drug_{d}', ntumor, data_ref['n_r'][r,d,sample].item(), well_x, well_y])
FD_data = pd.DataFrame(data, columns=columns)
FD_data.to_csv(os.path.join(path2save_data, 'FD_data.csv'), index=False)
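
The export loop writes `Rc` control rows plus `Rt * D` treated rows for every sample, which gives a simple row-count check for the saved file. The sketch below computes that expected count; `expected_fd_rows` is a hypothetical helper, and treating the `'C': 24` simulation parameter as the number of control replicates `Rc` is an assumption.

```python
def expected_fd_rows(n_samples, n_control_reps, n_treated_reps, n_drugs):
    """Rows written by the export loop: control wells plus treated wells per sample."""
    return n_samples * (n_control_reps + n_treated_reps * n_drugs)

# Assumption: 'C': 24 is the number of control replicates (Rc)
print(expected_fd_rows(n_samples=50, n_control_reps=24, n_treated_reps=20, n_drugs=30))  # 31200
```

In the notebook itself, the same check would compare `len(FD_data)` against `N * (Rc + Rt * D)`.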