## Download example dataset

```shell
pip install pyfigshare
pip uninstall -y ChunkZIP && pip install git+https://github.com/DingWB/czip
# or pip install ChunkZIP
figshare download 25374073 -o czip_example_data
```

## View reference .cz file

### View header

In [42]:
import os,sys
import czip
reader=czip.Reader("czip_example_data/FC_E17a_3C_1-1-I3-F13.cz")
reader.print_header()

### Summarize chunks & blocks

In [None]:
reader.summary_chunks(printout=False)
reader.summary_blocks(printout=False)

Each chunk is identified by a dimension: a chromosome, a sample, a cell type, or a combination of these.

### Query

In [21]:
for record in reader.query(Dimension="chr9",start=3000294,end=3000472,reference="czip_example_data/mm10_with_chrL.allc.cz",printout=False):
    print(record)

chrom	pos	mc	cov
chr9	3000294	54	63
chr9	3000342	69	85
chr9	3000354	77	82
chr9	3000381	52	64
chr9	3000382	74	87
chr9	3000399	66	67
chr9	3000441	84	138
chr9	3000457	139	162
chr9	3000458	64	74
chr9	3000472	161	183


In [21]:
for record in reader.query(Dimension="chr9",start=3000294,end=3000472,printout=False):
    print(record)

chrom	pos	mc	cov
chr9	3000294	54	63
chr9	3000342	69	85
chr9	3000354	77	82
chr9	3000381	52	64
chr9	3000382	74	87
chr9	3000399	66	67
chr9	3000441	84	138
chr9	3000457	139	162
chr9	3000458	64	74
chr9	3000472	161	183
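
Each record above carries a methylated-read count (`mc`) and a coverage (`cov`), so a per-site methylation fraction follows as `mc / cov`. The sketch below parses the tab-separated text as printed above (first three rows only) and does not call czip itself; the helper name `methylation_fractions` is hypothetical.

```python
# First rows of the query output above, copied verbatim (tab-separated).
QUERY_OUTPUT = """chrom\tpos\tmc\tcov
chr9\t3000294\t54\t63
chr9\t3000342\t69\t85
chr9\t3000354\t77\t82
"""


def methylation_fractions(text):
    """Parse 'chrom pos mc cov' rows into (chrom, pos, mc / cov) tuples.

    Hypothetical helper for illustration; it is not part of czip.
    """
    lines = text.strip().splitlines()
    header = lines[0].split("\t")
    mc_i, cov_i = header.index("mc"), header.index("cov")
    result = []
    for line in lines[1:]:
        fields = line.split("\t")
        mc, cov = int(fields[mc_i]), int(fields[cov_i])
        # A site with zero coverage has no defined fraction.
        fraction = mc / cov if cov > 0 else None
        result.append((fields[0], int(fields[1]), fraction))
    return result


fractions = methylation_fractions(QUERY_OUTPUT)
for chrom, pos, fraction in fractions:
    print(chrom, pos, fraction)
```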


### Pack .allc.tsv.gz to .cz without coordinates (using reference)

```shell
czip bed2cz FC_E17a_3C_1-1-I3-F13.allc.tsv.gz FC_E17a_3C_1-1-I3-F13.cz -r ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz
# took 8m40.090s; a C/C++ implementation should be faster
```

In [22]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz view -s 0 |head

chrom	mc	cov
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0
chr1	0	0


View FC_E17a_3C_1-1-I3-F13.cz together with the reference:

In [23]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz view -s 0 -r ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz |head

chrom	pos	strand	context	mc	cov
chr1	3000003	+	CTG	0	0
chr1	3000005	-	CAG	0	0
chr1	3000009	+	CTA	0	0
chr1	3000016	-	CAA	0	0
chr1	3000018	-	CAC	0	0
chr1	3000019	-	CCA	0	0
chr1	3000023	+	CTT	0	0
chr1	3000027	-	CAA	0	0
chr1	3000029	-	CTC	0	0
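
The two views above suggest how packing "without coordinates" works: the sample .cz keeps only `mc` and `cov`, and the reference supplies `pos`, `strand`, and `context` for the same rows. The sketch below illustrates that idea with plain tuples, assuming the rows are aligned by order; it is not czip's implementation, and `view_with_reference` is a hypothetical name.

```python
# Reference rows (chrom, pos, strand, context), copied from the output above.
REFERENCE_ROWS = [
    ("chr1", 3000003, "+", "CTG"),
    ("chr1", 3000005, "-", "CAG"),
    ("chr1", 3000009, "+", "CTA"),
]

# Sample rows (mc, cov), as stored without coordinates.
DATA_ROWS = [(0, 0), (0, 0), (0, 0)]


def view_with_reference(reference_rows, data_rows):
    """Attach coordinates to coordinate-free counts, row by row.

    Assumption: both inputs list the same sites in the same order.
    """
    if len(reference_rows) != len(data_rows):
        raise ValueError("reference and data must have the same number of rows")
    return [ref + data for ref, data in zip(reference_rows, data_rows)]


joined = view_with_reference(REFERENCE_ROWS, DATA_ROWS)
for row in joined:
    print("\t".join(str(value) for value in row))
```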


### Query

#### Query allc.tsv.gz using tabix

In [28]:
tabix FC_E17a_3C_1-1-I3-F13.allc.tsv.gz chr9 | awk '$5 > 50' |head

chr9	3000294	-	CAT	54	63	1
chr9	3000342	-	CGA	69	85	1
chr9	3000354	-	CGT	77	82	1
chr9	3000381	+	CGT	52	64	1
chr9	3000382	-	CGG	74	87	1
chr9	3000399	+	CGA	66	67	1
chr9	3000441	+	CGT	84	138	1
chr9	3000457	+	CGT	139	162	1
chr9	3000458	-	CGA	64	74	1
chr9	3000472	+	CGT	161	183	1


#### Query allc.cz using czip

In [29]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz query -r ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz -D chr9 -s 3000294 -e 3000472 |awk '$5>50'

chrom	pos	strand	context	mc	cov
chr9	3000294	-	CAT	54	63
chr9	3000342	-	CGA	69	85
chr9	3000354	-	CGT	77	82
chr9	3000381	+	CGT	52	64
chr9	3000382	-	CGG	74	87
chr9	3000399	+	CGA	66	67
chr9	3000441	+	CGT	84	138
chr9	3000457	+	CGT	139	162
chr9	3000458	-	CGA	64	74
chr9	3000472	+	CGT	161	183
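
The tabix and czip outputs above should describe the same sites; they differ only in the header line and in the trailing seventh column of the allc file. As a quick sanity check, the following sketch compares the first rows of both outputs as printed above. The helper name `rows` is hypothetical.

```python
# First rows of the tabix output above (seven columns, no header).
TABIX_OUTPUT = """chr9\t3000294\t-\tCAT\t54\t63\t1
chr9\t3000342\t-\tCGA\t69\t85\t1
chr9\t3000354\t-\tCGT\t77\t82\t1
"""

# First rows of the czip output above (six columns, with header).
CZIP_OUTPUT = """chrom\tpos\tstrand\tcontext\tmc\tcov
chr9\t3000294\t-\tCAT\t54\t63
chr9\t3000342\t-\tCGA\t69\t85
chr9\t3000354\t-\tCGT\t77\t82
"""


def rows(text, n_columns=6, has_header=False):
    """Split tab-separated text into rows, keeping the first n_columns."""
    parsed = [line.split("\t")[:n_columns] for line in text.strip().splitlines()]
    return parsed[1:] if has_header else parsed


outputs_agree = rows(TABIX_OUTPUT) == rows(CZIP_OUTPUT, has_header=True)
print("outputs agree:", outputs_agree)
```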


## Cat multiple .cz files into one .cz file

In [30]:
czip Writer -O cat.cz -F Q,c,3s -C pos,strand,context -D chrom catcz --help

INFO: Showing help with the command 'czip Writer -O cat.cz -F Q,c,3s -C pos,strand,context -D chrom catcz -- --help'.

NAME
    czip Writer -O cat.cz -F Q,c,3s -C pos,strand,context -D chrom catcz - Cat multiple .cz files into one .cz file.

SYNOPSIS
    czip Writer -O cat.cz -F Q,c,3s -C pos,strand,context -D chrom catcz <flags>

DESCRIPTION
    Cat multiple .cz files into one .cz file.

FLAGS
    -I, --Input=INPUT
        Type: Optional[]
        Default: None
        Either a str (including *, as input for glob, should be inside the double quotation marks if using fire) or a list.
    -d, --dim_order=DIM_ORDER
        Type: Optional[]
        Default: None
        If dim_order=None, Input will be sorted using python sorted. If dim_order is a list, tuple or array of basename.rstrip(.cz), sorted as dim_order. If dim_order is a file path (for example, chrom size path to dim_order chroms or only use selected chroms) wi

```shell
czip Writer -O mm10_with_chrL.allc.cz -F Q,c,3s -C pos,strand,context -D chrom catcz -I "cell_type/*.cz" \
            --dim_order ~/Ref/mm10/mm10_ucsc_with_chrL.chrom.sizes --add_dim True --title "cell_id"
```

In this example, multiple .cz files are concatenated into a single .cz file, and an additional dimension (cell_id) is added to each chunk.
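
The `--dim_order` help text describes two ordering modes: plain `sorted` when no order is given, or an explicit list of base names without the `.cz` extension. The sketch below mimics that behavior under stated assumptions; it is not czip's code. The function name `order_inputs` and the file names are hypothetical, and silently skipping names that have no matching file is a guess.

```python
import os


def order_inputs(paths, dim_order=None):
    """Order .cz paths roughly as the catcz help text describes."""
    if dim_order is None:
        # Default: plain lexicographic sort, so chr10 sorts before chr2.
        return sorted(paths)
    by_name = {}
    for path in paths:
        name = os.path.basename(path)
        # Remove the '.cz' suffix explicitly; str.rstrip('.cz') would strip
        # any trailing '.', 'c', or 'z' characters rather than the suffix.
        if name.endswith(".cz"):
            name = name[: -len(".cz")]
        by_name[name] = path
    # Keep only the requested names, in the requested order.
    return [by_name[name] for name in dim_order if name in by_name]


example_paths = ["cell_type/chr10.cz", "cell_type/chr2.cz", "cell_type/chr1.cz"]
print(order_inputs(example_paths))
print(order_inputs(example_paths, dim_order=["chr1", "chr2", "chr10"]))
```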

## Extract CG from .cz and merge strand

In [31]:
czip extractCG --help

INFO: Showing help with the command 'czip extractCG -- --help'.

NAME
    czip extractCG - Extract CG context from .cz file

SYNOPSIS
    czip extractCG <flags>

DESCRIPTION
    Extract CG context from .cz file

FLAGS
    -i, --input=INPUT
        Type: Optional[]
        Default: None
    -o, --outfile=OUTFILE
        Type: Optional[]
        Default: None
    -s, --ssi=SSI
        Type: Optional[]
        Default: None
        ssi should be ssi to mm10_with_chrL.allc.cz.CGN.ssi, not forward strand ssi, but after merge (if merge_cg is True), forward ssi mm10_with_chrL.allc.cz.+CGN.ssi should be used to generate reference, one can
    -c, --chunksize=CHUNKSIZE
        Default: 5000
    -m, --merge_cg=MERGE_CG
        Default: False
        after merging, only forward strand would be kept, reverse strand values would be added to the corresponding forward strand.


```shell
czip extractCG -i cz/FC_P13a_3C_2-1-E5-D13.cz -o FC_P13a_3C_2-1-E5-D13.CGN.cz -s ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz.CGN.ssi

# view CG .cz
czip Reader -I FC_P13a_3C_2-1-E5-D13.CGN.cz view -s 0 -r ~/Ref/mm10/annotations/mm10_with_chrL.allCG.forward.cz
```
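
The `--merge_cg` help text says that reverse-strand values are added to the corresponding forward strand. A minimal sketch of that merge follows, using four CG sites from the chr9 query output earlier in this document. It assumes the reverse-strand cytosine of a CpG sits one base after the forward-strand cytosine; czip's own implementation may differ, and `merge_cg_strands` is a hypothetical name.

```python
# (chrom, pos, strand, mc, cov) rows taken from the chr9 query output.
CG_ROWS = [
    ("chr9", 3000381, "+", 52, 64),
    ("chr9", 3000382, "-", 74, 87),
    ("chr9", 3000457, "+", 139, 162),
    ("chr9", 3000458, "-", 64, 74),
]


def merge_cg_strands(rows):
    """Fold reverse-strand CG counts into the paired forward-strand site."""
    merged = {}
    for chrom, pos, strand, mc, cov in rows:
        # Assumption: a '-' site at pos pairs with the '+' site at pos - 1.
        key = (chrom, pos if strand == "+" else pos - 1)
        prev_mc, prev_cov = merged.get(key, (0, 0))
        merged[key] = (prev_mc + mc, prev_cov + cov)
    # Dicts keep insertion order, so the input order is preserved.
    return [(chrom, pos, "+", mc, cov) for (chrom, pos), (mc, cov) in merged.items()]


merged_rows = merge_cg_strands(CG_ROWS)
for row in merged_rows:
    print(row)
```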

## Merge multiple .cz files into one .cz file
Sum the mc and cov values across the input files.

In [32]:
czip merge_cz --help

INFO: Showing help with the command 'czip merge_cz -- --help'.

NAME
    czip merge_cz - Merge multiple .cz files. For example: czip merge_cz -i ./ -o major_type.2D.txt -n 96 -f 2D -P ~/Ref/mm10/mm10_ucsc_with_chrL.main.chrom.sizes.txt -r ~/Ref/mm10/annotations/mm10_with_chrL.allCG.forward.cz

SYNOPSIS
    czip merge_cz <flags>

DESCRIPTION
    Merge multiple .cz files. For example: czip merge_cz -i ./ -o major_type.2D.txt -n 96 -f 2D -P ~/Ref/mm10/mm10_ucsc_with_chrL.main.chrom.sizes.txt -r ~/Ref/mm10/annotations/mm10_with_chrL.allCG.forward.cz

FLAGS
    -i, --indir=INDIR
        Type: Optional[]
        Default: None
        If cz_paths is not provided, indir will be used to get cz_paths.
    --cz_paths=CZ_PATHS
        Type: Optional[]
        Default: None
    --class_table=CLASS_TABLE
        Type: Optional[]
        Defau

```shell
czip merge_cz -i cz-CGN/ -o merged.cz
```
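
Conceptually, merging sums `mc` and `cov` site by site across the inputs. The sketch below shows that arithmetic on small in-memory lists, assuming every input lists the same sites in the same order (for example, files packed against the same reference). The counts and the name `merge_counts` are made up for illustration.

```python
def merge_counts(files):
    """Sum (mc, cov) pairs position-wise across several inputs.

    `files` is a list of lists; each inner list holds one (mc, cov)
    pair per site. Assumption: all inputs are aligned to the same sites.
    """
    if not files:
        return []
    n_sites = len(files[0])
    if any(len(counts) != n_sites for counts in files):
        raise ValueError("all inputs must cover the same number of sites")
    return [
        (sum(mc for mc, _ in site), sum(cov for _, cov in site))
        for site in zip(*files)
    ]


# Hypothetical counts for three sites in two cells.
cell_a = [(1, 2), (0, 0), (3, 4)]
cell_b = [(0, 1), (2, 2), (1, 5)]
merged_counts = merge_counts([cell_a, cell_b])
print(merged_counts)
```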

## Merge .cz files belonging to the same cell type
Sum the mc and cov values across the cells of each cell type.

In [33]:
czip merge_cell_type --help

INFO: Showing help with the command 'czip merge_cell_type -- --help'.

NAME
    czip merge_cell_type

SYNOPSIS
    czip merge_cell_type <flags>

FLAGS
    -i, --indir=INDIR
        Type: Optional[]
        Default: None
    -c, --cell_table=CELL_TABLE
        Type: Optional[]
        Default: None
    -o, --outdir=OUTDIR
        Type: Optional[]
        Default: None
    -n, --n_jobs=N_JOBS
        Default: 64
    -P, --Path_to_chrom=PATH_TO_CHROM
        Type: Optional[]
        Default: None
    -e, --ext=EXT
        Default: '.CGN.merged.cz'
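
The idea behind merging by cell type can be sketched as a group-by followed by a position-wise sum. The example below assumes the cell table maps each cell ID to a cell type; the actual `--cell_table` file format is not shown here. All names and counts are hypothetical.

```python
from collections import defaultdict


def merge_by_cell_type(cell_counts, cell_types):
    """Group per-cell (mc, cov) lists by cell type and sum them site by site.

    cell_counts: {cell_id: [(mc, cov), ...]}, all aligned to the same sites.
    cell_types:  {cell_id: cell_type}, a stand-in for the cell table.
    """
    groups = defaultdict(list)
    for cell_id, counts in cell_counts.items():
        groups[cell_types[cell_id]].append(counts)
    return {
        cell_type: [
            (sum(mc for mc, _ in site), sum(cov for _, cov in site))
            for site in zip(*members)
        ]
        for cell_type, members in groups.items()
    }


# Hypothetical counts for two sites in three cells.
counts = {
    "cell_1": [(1, 2), (0, 1)],
    "cell_2": [(2, 2), (1, 1)],
    "cell_3": [(0, 3), (4, 4)],
}
types = {"cell_1": "typeA", "cell_2": "typeA", "cell_3": "typeB"}
merged_types = merge_by_cell_type(counts, types)
print(merged_types)
```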


## Run czip allc2cz on GCP

```shell
wget https://raw.githubusercontent.com/DingWB/czip/main/data/allc2mz.smk
```

```shell
snakemake --printshellcmds --immediate-submit --notemp -s allc2mz.smk --config indir="gs://mouse_pfc/test_allc" outdir="test_mz" \
            reference="mm10_with_chrL.allc.cz" ref_prefix="gs://wubin_ref/mm10/annotations" \
            chrom="mm10_ucsc_with_chrL.main.chrom.sizes.txt" chrom_prefix="gs://wubin_ref/mm10" \
            gcp=True --default-remote-prefix mouse_pfc --default-remote-provider GS \
            --google-lifesciences-region us-west1 --scheduler greedy -j 96 -np
```

## Run czip extractCG on GCP

```shell
wget https://raw.githubusercontent.com/DingWB/czip/main/data/extractCG.smk
```

```shell
snakemake --use-conda --printshellcmds -s extractCG.smk \
          --config algorithm="bmzip" indir=test_mz files_path=mz.path".0$SKYPILOT_NODE_RANK" \
          outdir=pfc_mz-CGN bmi=mm10_with_chrL.allc.cz.CGN.bmi bmi_prefix=gs://wubin_ref/mm10/annotations \
          gcp=True --default-remote-prefix mouse_pfc --default-remote-provider GS \
          --google-lifesciences-region us-west1 --scheduler greedy -j 96
```

## Convert a .cz file back to allc.tsv.gz

Convert both CG and CH to allc:

In [35]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz view -s 0 -r ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz -h False |head

chr1	3000003	+	CTG	0	0
chr1	3000005	-	CAG	0	0
chr1	3000009	+	CTA	0	0
chr1	3000016	-	CAA	0	0
chr1	3000018	-	CAC	0	0
chr1	3000019	-	CCA	0	0
chr1	3000023	+	CTT	0	0
chr1	3000027	-	CAA	0	0
chr1	3000029	-	CTC	0	0
chr1	3000030	-	CCT	0	0


```shell
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz view -s 0 -r ~/Ref/mm10/annotations/mm10_with_chrL.allc.cz -h False | 
    awk 'BEGIN{FS=OFS="\t"}; {print $0,1}' | bgzip > test1.allc.tsv.gz && tabix -f -s 1 -b 2 -e 2 test1.allc.tsv.gz
```
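
The `awk` step in the pipeline above appends a constant seventh column of `1` to every row, matching the seven-column allc rows shown in the tabix output earlier. A Python equivalent of just that step is sketched below; `to_allc_lines` is a hypothetical helper, and the result would still need to be compressed with bgzip before tabix can index it.

```python
def to_allc_lines(view_lines):
    """Append a tab and the constant '1' to each six-column row."""
    for line in view_lines:
        yield line.rstrip("\n") + "\t1\n"


# Two rows as printed by `czip Reader ... view -h False` above.
VIEW_LINES = [
    "chr1\t3000003\t+\tCTG\t0\t0\n",
    "chr1\t3000005\t-\tCAG\t0\t0\n",
]

allc_lines = list(to_allc_lines(VIEW_LINES))
for line in allc_lines:
    print(line, end="")
```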

Convert only CG to allc:

In [46]:
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz view -s 0 -r ~/Ref/mm10/annotations/mm10_with_chrL.allCG.forward.cz -h False |head

chr1	3000827	+	CGT	0	0
chr1	3001007	+	CGG	0	0
chr1	3001018	+	CGT	0	0
chr1	3001277	+	CGA	0	0
chr1	3001629	+	CGT	0	0
chr1	3003226	+	CGG	0	0
chr1	3003339	+	CGC	0	0
chr1	3003379	+	CGT	0	0
chr1	3003582	+	CGC	0	0
chr1	3003640	+	CGG	0	0


```shell
czip Reader -I FC_E17a_3C_1-1-I3-F13.cz view -s 0 -r ~/Ref/mm10/annotations/mm10_with_chrL.allCG.forward.cz -h False | 
    awk 'BEGIN{FS=OFS="\t"}; {print $0,1}' | bgzip > test1.CG.allc.tsv.gz && tabix -f -s 1 -b 2 -e 2 test1.CG.allc.tsv.gz
```