Suppose we have single-cell DNA methylation data for thousands of cells, with one allc file per cell:

```text
chr1    3001733 +       CTG     0       1       1
chr1    3001743 +       CAA     0       1       1
chr1    3001746 +       CCT     0       1       1
chr1    3001747 +       CTG     0       1       1
chr1    3001752 +       CTG     0       1       1
chr1    3001758 +       CTT     0       1       1
chr1    3001761 +       CTG     0       1       1
chr1    3001764 +       CTT     0       1       1
chr1    3001768 +       CTT     0       1       1
chr1    3001776 +       CTG     0       1       1
```

Columns are chrom, position, strand, context, mc (methylated count), cov (coverage) and fraction.
The first four columns are identical across all cells, so there is no need to store them for every cell. Instead, we can store the reference coordinates once and keep only mc and cov for each cell.
We can still query and view each cell's allc data together with the reference information.
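
As a rough illustration of this split, the sketch below separates a few allc rows from the example into shared reference fields and per-cell counts, then compares the packed record sizes. The `split_row` helper is hypothetical, and the little-endian, unpadded `struct` layout is an assumption made for the arithmetic; it is not taken from czip's source.

```python
import struct

# Three allc rows copied from the example above:
# chrom, pos, strand, context, mc, cov, fraction
ALLC_ROWS = [
    "chr1\t3001733\t+\tCTG\t0\t1\t1",
    "chr1\t3001743\t+\tCAA\t0\t1\t1",
    "chr1\t3001746\t+\tCCT\t0\t1\t1",
]


def split_row(line):
    """Split one allc row into shared reference fields and per-cell counts."""
    chrom, pos, strand, context, mc, cov, _fraction = line.split("\t")
    reference = (chrom, int(pos), strand, context)  # identical in every cell
    cell = (int(mc), int(cov))                      # specific to this cell
    return reference, cell


reference_rows, cell_rows = zip(*(split_row(row) for row in ALLC_ROWS))

# Bytes per record for the struct formats used in this document
# (assumed layout: little-endian, no padding).
ref_bytes = struct.calcsize("<Qc3s")  # pos, strand, context
cell_bytes = struct.calcsize("<HH")   # mc, cov
print(ref_bytes, cell_bytes)
```

Under these assumptions a per-cell record needs a third of the bytes of a reference record before compression, which is where the saving over thousands of cells comes from.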

## Generate allc coordinate reference .cz file

```shell
czip AllC -G ~/Ref/mm10/mm10_ucsc_with_chrL.fa -O mm10_with_chrL.allc.cz -n 20 run
# took 15 minutes using 20 cpus
```

```shell
# create subset index for all CG (including forward and reverse strand)
czip generate_ssi mm10_with_chrL.allc.cz -p CGN -o mm10_with_chrL.allc.cz.CGN.ssi
# took about 5 minutes with 1 core
```

```shell
# create subset index for all CG (forward strand only)
czip generate_ssi mm10_with_chrL.allc.cz -p +CGN -o mm10_with_chrL.allc.cz.CGN.forward.ssi
# about 5 minutes
```

```shell
# using forward CG subset index to extract forward strand CG coordinates from reference
czip extract -i mm10_with_chrL.allc.cz -s mm10_with_chrL.allc.cz.CGN.forward.ssi -o mm10_with_chrL.allCG.forward.cz
# about 1m23.855s
```

A .ssi file is itself a czip file, so it can be viewed with `czip Reader -I *.ssi view -s 0`.

## View reference .cz file

### View header

In [2]:
czip Reader -I ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz print_header

magic  :  b'BMZIP'
version  :  1.0
total_size  :  3074591556
message  :  /home/x-wding2/Ref/mm10/mm10_ucsc_with_chrL.fa
Formats  :  ['Q', 'c', '3s']
Columns  :  ['pos', 'strand', 'context']
Dimensions  :  ['chrom']
header_size  :  99


### Summary chunks

In [3]:
czip Reader -I ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz summary_chunks | head

chrom	chunk_start_offset	chunk_size	chunk_tail_offset	chunk_nblocks	chunk_nrows
chr1	99	219469648	219585440	14459	78962721
chr10	219585440	146325915	365988449	9634	52609184
chr11	365988449	144624498	510689185	9527	52027265
chr12	510689185	135669225	646429920	8936	48799752
chr13	646429920	135593184	782094542	8927	48750883
chr14	782094542	138973134	921140930	9154	49987736
chr15	921140930	117415601	1038618417	7733	42230765
chr16	1038618417	108168134	1146843557	7123	38899643
chr17	1146843557	108852498	1255753437	7170	39153472


Every chunk is labeled by a dimension (chrom, sample, cell type, or a combination of these).
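
The `summary_chunks` table can be sanity-checked with a few lines of Python. This sketch parses the first three rows shown above and derives rows per block and compressed bytes per row. It also checks that each chunk starts where the previous one ends and that the first chunk starts at `header_size`. It only reads the printed table and makes no claim about czip internals.

```python
import csv
import io

# First rows of the summary_chunks output above (tab-separated)
SUMMARY = (
    "chrom\tchunk_start_offset\tchunk_size\tchunk_tail_offset\tchunk_nblocks\tchunk_nrows\n"
    "chr1\t99\t219469648\t219585440\t14459\t78962721\n"
    "chr10\t219585440\t146325915\t365988449\t9634\t52609184\n"
    "chr11\t365988449\t144624498\t510689185\t9527\t52027265\n"
)

rows = list(csv.DictReader(io.StringIO(SUMMARY), delimiter="\t"))

for row in rows:
    nrows = int(row["chunk_nrows"])
    nblocks = int(row["chunk_nblocks"])
    size = int(row["chunk_size"])
    print(row["chrom"],
          f"rows/block={nrows / nblocks:.1f}",
          f"compressed bytes/row={size / nrows:.2f}")

# Chunks are laid out back to back: each one starts where the previous ends.
contiguous = all(
    int(prev["chunk_tail_offset"]) == int(cur["chunk_start_offset"])
    for prev, cur in zip(rows, rows[1:])
)
first_offset = int(rows[0]["chunk_start_offset"])  # equals header_size (99)
```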

## Converting .allc.tsv.gz into .cz file

In [4]:
czip bed2cz --help

INFO: Showing help with the command 'czip bed2cz -- --help'.

NAME
    czip bed2cz - convert allc.tsv.gz to .cz file.

SYNOPSIS
    czip bed2cz INPUT OUTFILE <flags>

DESCRIPTION
    convert allc.tsv.gz to .cz file.

POSITIONAL ARGUMENTS
    INPUT
        path to allc.tsv.gz, should has .tbi index.
    OUTFILE
        output .cz file

FLAGS
    -r, --reference=REFERENCE
        Type: Optional[]
        Default: None
        path to reference coordinates.
    -m, --missing_value=MISSING_VALUE
        Default: [0, 0]
    -F, --Formats=FORMATS
        Default: ['H', 'H']
        When reference is provided, we only need to pack mc and cov, ['H', 'H'] is suggested (H is unsigned short integer, only 2 bytes), if reference is not provided, we also need to pack position (Q is recommanded), in this case, Formats should be ['Q','H','H'].
    -C, --Columns=COLUMNS
        Defaul

### Pack .allc.tsv.gz to .cz with coordinates

```shell
czip bed2cz FC_E17a_3C_1-1-I3-F13.allc.tsv.gz FC_E17a_3C_1-1-I3-F13.with_coordinate.cz -F Q,H,H -C pos,mc,cov -u 1,4,5
# took about 2m3.650s
```
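
To make the `-F Q,H,H -C pos,mc,cov -u 1,4,5` options concrete, here is a minimal sketch of what packing one allc row with those settings could look like. It assumes that `-u` lists zero-based column indices and that records are packed little-endian without padding; `pack_row` is a hypothetical helper, not czip's implementation.

```python
import struct

# pos as unsigned 64-bit (Q), mc and cov as unsigned 16-bit (H).
# Byte order and padding are assumptions made for this sketch.
RECORD = struct.Struct("<QHH")
USECOLS = (1, 4, 5)  # zero-based allc columns: pos, mc, cov


def pack_row(line):
    """Pack the pos, mc and cov fields of one allc row into 12 bytes."""
    fields = line.rstrip("\n").split("\t")
    pos, mc, cov = (int(fields[i]) for i in USECOLS)
    return RECORD.pack(pos, mc, cov)


packed = pack_row("chr9\t3000294\t-\tCAT\t54\t63\t1")
unpacked = RECORD.unpack(packed)
print(len(packed), unpacked)
```

Note that `H` can only hold values from 0 to 65535, so `struct` raises an error for larger counts. That is rarely a concern for single cells, but it matters once counts are summed across many cells.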

In [5]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.with_coordinate.cz view -s 0 |head

chrom	pos	mc	cov
chr1	3001733	0	1
chr1	3001743	0	1
chr1	3001746	0	1
chr1	3001747	0	1
chr1	3001752	0	1
chr1	3001758	0	1
chr1	3001761	0	1
chr1	3001764	0	1
chr1	3001768	0	1


In [6]:
# query
czip Reader -I FC_E17a_3C_1-1-I3-F13.with_coordinate.cz query -D chr9 -s 3000294 -e 3000472 | awk '$3>50'

chrom	pos	mc	cov
chr9	3000294	54	63
chr9	3000342	69	85
chr9	3000354	77	82
chr9	3000381	52	64
chr9	3000382	74	87
chr9	3000399	66	67
chr9	3000441	84	138
chr9	3000457	139	162
chr9	3000458	64	74
chr9	3000472	161	183


### Pack .allc.tsv.gz to .cz without coordinates (using reference)

```shell
czip bed2cz FC_E17a_3C_1-1-I3-F13.allc.tsv.gz FC_E17a_3C_1-1-I3-F13.cz -r ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz
# took 8m40.090s, it will be faster after implementing C/C++ version
```

In [7]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz view -s 0 |head

chrom	mc	cov
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0


View FC_E17a_3C_1-1-I3-F13.cz together with the reference:

In [8]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz view -s 0 -r ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz |head

chrom	pos	strand	context	mc	cov
chr1	3000003	+	CTG	0	0
chr1	3000005	-	CAG	0	0
chr1	3000009	+	CTA	0	0
chr1	3000016	-	CAA	0	0
chr1	3000018	-	CAC	0	0
chr1	3000019	-	CCA	0	0
chr1	3000023	+	CTT	0	0
chr1	3000027	-	CAA	0	0
chr1	3000029	-	CTC	0	0


### Query

#### query allc.tsv.gz using tabix

In [10]:
tabix FC_E17a_3C_1-1-I3-F13.allc.tsv.gz chr9 | awk '$5 > 50' |head

chr9	3000294	-	CAT	54	63	1
chr9	3000342	-	CGA	69	85	1
chr9	3000354	-	CGT	77	82	1
chr9	3000381	+	CGT	52	64	1
chr9	3000382	-	CGG	74	87	1
chr9	3000399	+	CGA	66	67	1
chr9	3000441	+	CGT	84	138	1
chr9	3000457	+	CGT	139	162	1
chr9	3000458	-	CGA	64	74	1
chr9	3000472	+	CGT	161	183	1


#### query allc.cz using czip

In [11]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz query -r ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz -D chr9 -s 3000294 -e 3000472 |awk '$5>50'

chrom	pos	strand	context	mc	cov
chr9	3000294	-	CAT	54	63
chr9	3000342	-	CGA	69	85
chr9	3000354	-	CGT	77	82
chr9	3000381	+	CGT	52	64
chr9	3000382	-	CGG	74	87
chr9	3000399	+	CGA	66	67
chr9	3000441	+	CGT	84	138
chr9	3000457	+	CGT	139	162
chr9	3000458	-	CGA	64	74
chr9	3000472	+	CGT	161	183
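
The query output is plain tab-separated text, so it can also be post-processed in Python rather than awk. This sketch parses a few rows copied from the output above, keeps sites with `mc > 50`, and computes the methylation level `mc / cov` for each. The function name and threshold parameter are illustrative only.

```python
import csv
import io

# Three rows copied from the query output above (tab-separated)
QUERY_OUTPUT = (
    "chrom\tpos\tstrand\tcontext\tmc\tcov\n"
    "chr9\t3000294\t-\tCAT\t54\t63\n"
    "chr9\t3000399\t+\tCGA\t66\t67\n"
    "chr9\t3000441\t+\tCGT\t84\t138\n"
)


def methylation_levels(text, min_mc=50):
    """Return (pos, mc / cov) for rows whose mc exceeds min_mc."""
    levels = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        mc, cov = int(row["mc"]), int(row["cov"])
        if mc > min_mc and cov > 0:  # skip uncovered sites to avoid dividing by zero
            levels.append((int(row["pos"]), mc / cov))
    return levels


levels = methylation_levels(QUERY_OUTPUT)
for pos, level in levels:
    print(pos, f"{level:.3f}")
```

Unlike the awk filter, which lets the header line through because a non-numeric field is compared as a string, this version consumes the header explicitly.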


## Cat multiple .cz files into one .cz file

In [12]:
czip Writer -O cat.cz -F Q,c,3s -C pos,strand,context -D chrom catcz --help

INFO: Showing help with the command 'czip Writer -O cat.cz -F Q,c,3s -C pos,strand,context -D chrom catcz -- --help'.

NAME
    czip Writer -O cat.cz -F Q,c,3s -C pos,strand,context -D chrom catcz - Cat multiple .cz files into one .cz file.

SYNOPSIS
    czip Writer -O cat.cz -F Q,c,3s -C pos,strand,context -D chrom catcz <flags>

DESCRIPTION
    Cat multiple .cz files into one .cz file.

FLAGS
    -I, --Input=INPUT
        Type: Optional[]
        Default: None
        Either a str (including *, as input for glob, should be inside the double quotation marks if using fire) or a list.
    -d, --dim_order=DIM_ORDER
        Type: Optional[]
        Default: None
        If dim_order=None, Input will be sorted using python sorted. If dim_order is a list, tuple or array of basename.rstrip(.cz), sorted as dim_order. If dim_order is a file path (for example, chrom size path to dim_order chroms or only use selected chroms) will be sorted as the 1

```shell
czip Writer -O mm10_with_chrL.allc.cz -F Q,c,3s -C pos,strand,context -D chrom catcz -I "cell_type/*.cz" \
            --dim_order ~/Ref/mm10/mm10_ucsc_with_chrL.chrom.sizes --add_dim True --title "cell_id"
```

In this example, we concatenate multiple .cz files into one .cz file and add another dimension (cell_id) to each chunk.
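
The effect of `--dim_order` can be pictured as reordering the input files before they are concatenated. The sketch below contrasts Python's default `sorted` order (which places chr10 before chr2, as in the `summary_chunks` output earlier) with an explicit order such as the first column of a chrom sizes file. The file names and the `order_inputs` helper are hypothetical. Since the help text above is truncated, the exact matching rule czip uses is an assumption here.

```python
from pathlib import Path


def order_inputs(paths, dim_order):
    """Order .cz paths by an explicit list of dimension names.

    Names are matched against the file name without its .cz suffix;
    paths whose name is not listed in dim_order are dropped.
    """
    by_name = {Path(p).name[: -len(".cz")]: p for p in paths if p.endswith(".cz")}
    return [by_name[name] for name in dim_order if name in by_name]


# Hypothetical per-chromosome inputs
paths = ["chroms/chr10.cz", "chroms/chr2.cz", "chroms/chr1.cz"]

plain_sorted = sorted(paths)                             # default when dim_order is None
ordered = order_inputs(paths, ["chr1", "chr2", "chr10"])  # e.g. chrom sizes order
print(plain_sorted)
print(ordered)
```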

## Extract CG from .cz and merge strand

In [13]:
czip extractCG --help

INFO: Showing help with the command 'czip extractCG -- --help'.

NAME
    czip extractCG

SYNOPSIS
    czip extractCG <flags>

FLAGS
    -i, --input=INPUT
        Type: Optional[]
        Default: None
    -o, --outfile=OUTFILE
        Type: Optional[]
        Default: None
    -s, --ssi=SSI
        Type: Optional[]
        Default: None
        ssi should be ssi to mm10_with_chrL.allc.cz.CGN.ssi, not forward strand ssi, but after merge (if merge_strand is True), forward ssi mm10_with_chrL.allc.cz.+CGN.ssi should be used to generate reference, one can
    -c, --chunksize=CHUNKSIZE
        Default: 5000
    -m, --merge_strand=MERGE_STRAND
        Default: True
        after merging, only forward strand would be kept, reverse strand values would be added to the corresponding forward strand.


```shell
czip extractCG -i cz/FC_P13a_3C_2-1-E5-D13.cz -o FC_P13a_3C_2-1-E5-D13.CGN.cz -s ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz.CGN.ssi

# view CG .cz
czip Reader -I FC_P13a_3C_2-1-E5-D13.CGN.cz view -s 0 -r ~/Ref/mm10/annotations/mm10_with_chrL.allCG.forward.cz
```
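
Strand merging can be illustrated with the chr9 CG sites from the query output earlier in this document. In a CpG dinucleotide the reverse-strand cytosine sits one base after its forward-strand partner, so its counts are added to the record at `pos - 1`. This is a minimal sketch of that idea under the stated assumption; `merge_cg_strands` is a hypothetical helper and not czip's implementation.

```python
def merge_cg_strands(records):
    """Merge reverse-strand CG counts into their forward-strand partners.

    records: (pos, strand, mc, cov) tuples for CG sites on one chromosome.
    Returns sorted (pos, mc, cov) tuples keyed by forward-strand position.
    """
    merged = {}
    for pos, strand, mc, cov in records:
        key = pos if strand == "+" else pos - 1  # reverse C pairs with pos - 1
        old_mc, old_cov = merged.get(key, (0, 0))
        merged[key] = (old_mc + mc, old_cov + cov)
    return sorted((pos, mc, cov) for pos, (mc, cov) in merged.items())


# Two CpG pairs taken from the chr9 query output in this document
records = [
    (3000381, "+", 52, 64),
    (3000382, "-", 74, 87),
    (3000457, "+", 139, 162),
    (3000458, "-", 64, 74),
]

merged = merge_cg_strands(records)
print(merged)
```

After merging, only forward-strand positions remain, which is why the forward-strand reference (`mm10_with_chrL.allCG.forward.cz`) is used when viewing the result.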

## Merge multiple .cz files into one .cz file

In [14]:
czip merge_cz --help

INFO: Showing help with the command 'czip merge_cz -- --help'.

NAME
    czip merge_cz

SYNOPSIS
    czip merge_cz <flags>

FLAGS
    -i, --indir=INDIR
        Type: Optional[]
        Default: None
    -c, --cz_paths=CZ_PATHS
        Type: Optional[]
        Default: None
    -o, --outfile=OUTFILE
        Default: 'merged.cz'
    -n, --n_jobs=N_JOBS
        Default: 12
    -f, --formats=FORMATS
        Default: ['I', 'I']
    -P, --Path_to_chrom=PATH_TO_CHROM
        Default...
    -k, --keep_cat=KEEP_CAT
        Default: False
    -b, --batchsize=BATCHSIZE
        Default: 10


```shell
czip merge_cz -i cz-CGN/ -o merged.cz
```
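
Assuming that merging means summing mc and cov site by site across cells that share the same reference, the operation can be sketched as below. That reading would also explain why the default output format is `['I', 'I']` (unsigned 32-bit) rather than `['H', 'H']`, since sums over many cells can exceed 65535. The `merge_cells` helper and the example counts are hypothetical.

```python
import struct


def merge_cells(cells):
    """Sum (mc, cov) site by site across cells aligned to the same reference."""
    return [
        (sum(mc for mc, _ in site), sum(cov for _, cov in site))
        for site in zip(*cells)
    ]


# Two hypothetical cells covering the same three reference positions
cell_a = [(0, 1), (54, 63), (0, 0)]
cell_b = [(1, 1), (69, 85), (0, 0)]

merged = merge_cells([cell_a, cell_b])

# Pack the merged counts as unsigned 32-bit integers
# (little-endian byte order is an assumption of this sketch).
packed = b"".join(struct.pack("<II", mc, cov) for mc, cov in merged)
print(merged, len(packed))
```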

## Merge .cz files belonging to the same cell type

In [15]:
czip merge_cell_type --help

INFO: Showing help with the command 'czip merge_cell_type -- --help'.

NAME
    czip merge_cell_type

SYNOPSIS
    czip merge_cell_type <flags>

FLAGS
    -i, --indir=INDIR
        Type: Optional[]
        Default: None
    -c, --cell_table=CELL_TABLE
        Type: Optional[]
        Default: None
    -o, --outdir=OUTDIR
        Type: Optional[]
        Default: None
    -n, --n_jobs=N_JOBS
        Default: 64
    -P, --Path_to_chrom=PATH_TO_CHROM
        Type: Optional[]
        Default: None
    -e, --ext=EXT
        Default: '.CGN.merged.cz'


## Run czip allc2cz on GCP

```shell
wget https://raw.githubusercontent.com/DingWB/czip/main/data/allc2mz.smk
```

```shell
snakemake --printshellcmds --immediate-submit --notemp -s allc2mz.smk --config indir="gs://mouse_pfc/test_allc" outdir="test_mz" \
            reference="mm10_with_chrL.allc.cz" ref_prefix="gs://wubin_ref/mm10/annotations" \
            chrom="mm10_ucsc_with_chrL.main.chrom.sizes.txt" chrom_prefix="gs://wubin_ref/mm10" \
            gcp=True --default-remote-prefix mouse_pfc --default-remote-provider GS \
            --google-lifesciences-region us-west1 --scheduler greedy -j 96 -np
```

## Run czip extractCG on GCP

```shell
wget https://raw.githubusercontent.com/DingWB/czip/main/data/extractCG.smk
```

```shell
snakemake --use-conda --printshellcmds -s extractCG.smk \
          --config algorithm="bmzip" indir=test_mz files_path=mz.path".0$SKYPILOT_NODE_RANK" \
          outdir=pfc_mz-CGN bmi=mm10_with_chrL.allc.mz.CGN.bmi bmi_prefix=gs://wubin_ref/mm10/annotations \
          gcp=True --default-remote-prefix mouse_pfc --default-remote-provider GS \
          --google-lifesciences-region us-west1 --scheduler greedy -j 96
```