# Clone Generation
This script details the process of obtaining clones from the data. As mentioned in the paper, our aim was to mitigate the influence of clones on integration accuracy. We therefore employed two clustering methods, agglomerative clustering and intNMF, and ran each of them on both untransformed and log-transformed data.

In [6]:
import pandas as pd
import utils
from scanpy.pp import log1p
import os
from sklearn.cluster import AgglomerativeClustering

## 1. Data preparation
Here, we use two strategies: clustering on the original CNA data and clustering on the log-transformed CNA data. Because CNA data can have an extremely right-skewed distribution, we applied the transform $\log(x + 1)$, where $x$ is the original copy number.
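
To illustrate the effect of the transform, here is a minimal sketch on made-up copy-number values. The numbers and the helper `max_to_median_ratio` are hypothetical and are not taken from the dataset; the ratio is only a rough indicator of right skew.

```python
import numpy as np

# Hypothetical copy numbers: mostly near-diploid, plus one high-level amplification.
copy_numbers = np.array([0.0, 1.0, 2.0, 2.0, 3.0, 50.0])

# log(x + 1) is defined at x = 0 and compresses large values.
log_copy_numbers = np.log1p(copy_numbers)


def max_to_median_ratio(values):
    """Ratio of the largest value to the median, a rough indicator of right skew."""
    return float(np.max(values) / np.median(values))


raw_ratio = max_to_median_ratio(copy_numbers)
log_ratio = max_to_median_ratio(log_copy_numbers)
print(f"raw: {raw_ratio:.1f}, log1p: {log_ratio:.1f}")
```

After the transform the amplified value sits much closer to the bulk of the data, so it contributes less to the distances used during clustering.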

In [12]:
# read data as dataframe
DataDir = "Data/"
dna = pd.read_csv(DataDir+"cnv_raw.csv",index_col=0)

ClusterDir = "Cluster/"
if not os.path.isdir(ClusterDir):
    os.mkdir(ClusterDir)

### 1.1 untransformed CNA data

In [13]:
ClusterDataDir = ClusterDir + "cluster_data/"
if not os.path.isdir(ClusterDataDir):
    os.mkdir(ClusterDataDir)

# explicit copy, so that later operations on `dna` cannot alter the untransformed data
dna_untran = dna.copy()
dna_untran.to_csv(ClusterDataDir + "dna_untransformed.csv")

### 1.2 log-transformed CNA data

In [14]:
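def log_transform(df):
    # scanpy.pp.log1p needs n_obs × n_vars
    # copy the values first so the input DataFrame can never be modified in place
    values = log1p(df.to_numpy(copy=True).T).T
    return pd.DataFrame(values, index=list(df.index), columns=list(df.columns))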

dna_log = log_transform(dna)
dna_log.to_csv(ClusterDataDir + "dna_log.csv")

## 2. Clustering
We applied two methods to obtain clusters: [agglomerative clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) and [intNMF](https://pubmed.ncbi.nlm.nih.gov/28459819/). For each method, we identified clusters from the untransformed data and the log-transformed data separately.

### 2.1 Agglomerative clustering

Because it is difficult to determine the optimal number of clusters with agglomerative clustering alone, we adopted a multi-resolution approach. By constructing clusters at several levels of resolution, we aimed to reduce the dependence of the results on any single choice of cluster number and to obtain a more comprehensive view of the data.

Specifically, we first obtained four clusters from the dataset and then iteratively merged similar clusters until only two clusters remained.

In [5]:
# specify the initial cluster number
cluster_num = 4

# agglomerative clusters using untransformed data
utils.cluster_and_merge(dna_untran, cluster_num, ClusterDir+"agg_cluster_untransformed/")
# agglomerative clusters using log-transformed data
utils.cluster_and_merge(dna_log, cluster_num, ClusterDir+"agg_cluster_log/")
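
The internals of `utils.cluster_and_merge` are not shown in this script. As a rough illustration of the multi-resolution idea only, the sketch below cuts one agglomerative hierarchy at 4, 3, and 2 clusters. The function name `multi_resolution_labels`, the default Ward linkage, the row orientation (one row per item to cluster), the toy data, and the output layout are all assumptions and may differ from what `utils.cluster_and_merge` actually does.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering


def multi_resolution_labels(data, initial_clusters=4, final_clusters=2):
    """Return one column of cluster labels per resolution, from fine to coarse.

    Assumes each row of `data` is one item to cluster (hypothetical orientation).
    """
    labels = {}
    for k in range(initial_clusters, final_clusters - 1, -1):
        # scikit-learn's default linkage is Ward; the source may use another one.
        model = AgglomerativeClustering(n_clusters=k)
        labels[f"k={k}"] = model.fit_predict(data.to_numpy())
    return pd.DataFrame(labels, index=data.index)


# Toy data: four well-separated groups of five items each (hypothetical values).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [0, 10], [30, 0], [30, 10]])
toy = pd.DataFrame(
    np.vstack([c + rng.normal(scale=0.1, size=(5, 2)) for c in centers]),
    index=[f"cell_{i}" for i in range(20)],
    columns=["bin_1", "bin_2"],
)

result = multi_resolution_labels(toy)
print(result.nunique())
```

Because every resolution is a cut of the same hierarchy, moving from `k` to `k - 1` clusters joins two existing clusters rather than reshuffling items, which matches the "merge until two remain" description above.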

### 2.2 intNMF

intNMF can determine the optimal number of clusters itself, so no merging step is needed. The code is in `run_IntNMF.R`.