<a href="https://colab.research.google.com/github/PopGenClustering/Clumppling/blob/master/online_notebook_for_clumppling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [*Clumppling*](https://github.com/PopGenClustering/Clumppling)

v 1.3 (Last update: Oct, 2025)

**How to run *Clumppling* online**
*   Make sure to disconnect and delete runtime before running this notebook (Runtime -> Disconnect and delete runtime).
*   Follow the instructions and click the run button (the little round button with the white triangle) on the left of each cell one by one.
*   You don’t need to view or edit the code, but you can show or hide code cells at any time by clicking "Show code" or by double-clicking a cell header.


In [None]:
#@title Installation
#@markdown **Note: You may need to restart runtime if prompted. Simply click the button "RESTART RUNTIME".**
!pip install clumppling

In [None]:
#@title Import Packages
import clumppling.__main__ as clumppling_main

from google.colab import files
import zipfile
import os
import shutil
import builtins
import argparse
import sys

import logging

logger = logging.getLogger()  # root logger
logger.setLevel(logging.INFO)  # or DEBUG

# Check if a StreamHandler is already attached
if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
    handler = logging.StreamHandler(sys.stdout)  # send logs to notebook stdout
    formatter = logging.Formatter("%(asctime)s [%(levelname)s]  %(message)s")
    handler.setFormatter(formatter)
    logger.addHandler(handler)

logger.info("Logging is now visible in Colab!")

## Steps to Run Clumppling
---
**Upload clustering data (in a zip file)**. The files must sit directly in the zip archive (with no subfolder) and must be clustering outputs of one of the following types:

1.   *STRUCTURE* output (i.e., the *_f* files)
2.   *ADMIXTURE* output (i.e., the *.Q* files, or the *.indivq* files)
3.   *fastSTRUCTURE* output (i.e., the *.meanQ* files)
4.   *General Q files*: files containing general membership output of any mixed-membership clustering method saved as space-delimited matrices in *.Q* files.

It is okay for the zip folder to contain extra files (e.g., *.P* files for *ADMIXTURE* output), as long as their file extensions do not conflict with those specified.
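
As a rough illustration of the layout described above, the following sketch writes a few General Q files (space-delimited membership matrices), zips them at the archive root, and checks that no entry sits in a subfolder. The helper names (`write_q_file`, `make_flat_zip`, `nested_members`) and the file names are hypothetical and not part of *Clumppling*.

```python
import os
import random
import tempfile
import zipfile


def write_q_file(path, n_individuals, n_clusters, rng):
    """Write a space-delimited membership matrix; each row sums to 1 up to rounding."""
    with open(path, "w") as fh:
        for _ in range(n_individuals):
            raw = [rng.random() + 1e-9 for _ in range(n_clusters)]
            total = sum(raw)
            fh.write(" ".join(f"{x / total:.6f}" for x in raw) + "\n")


def make_flat_zip(file_paths, zip_path):
    """Store each file under its base name so it lands at the archive root."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for p in file_paths:
            zf.write(p, arcname=os.path.basename(p))


def nested_members(zip_path):
    """Return archive entries that are folders or sit inside one (should be empty)."""
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist() if "/" in name]


rng = random.Random(0)
workdir = tempfile.mkdtemp()
q_paths = []
for k in (2, 3):          # hypothetical numbers of clusters
    for rep in (1, 2):    # hypothetical replicate runs
        p = os.path.join(workdir, f"example_K{k}_rep{rep}.Q")
        write_q_file(p, n_individuals=5, n_clusters=k, rng=rng)
        q_paths.append(p)

zip_path = os.path.join(workdir, "example_generalQ.zip")
make_flat_zip(q_paths, zip_path)
print("Entries inside a subfolder:", nested_members(zip_path))
```

Running `nested_members` on your own zip file before uploading is a quick way to catch an archive that wraps the files in a folder.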

**Specify the input data type** (one of ``structure``, ``fastStructure``, ``admixture``, ``generalQ``), **the input file extension** (e.g., ``.Q``; if left empty, all files under the input directory will be loaded), **and any other optional parameters** you would like to change.
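
To make the effect of the extension setting concrete, here is a minimal sketch of suffix-based file selection. It is not *Clumppling*'s own code: it assumes a plain "file name ends with the extension" match, and the helper `select_input_files` and the demo file names are hypothetical.

```python
import os
import tempfile


def select_input_files(directory, extension=""):
    """Return sorted file names in `directory` whose names end with `extension`.

    An empty extension matches every file, because str.endswith("") is True.
    """
    return sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name)) and name.endswith(extension)
    )


# Build a small demo directory with mixed file types (hypothetical names).
demo_dir = tempfile.mkdtemp()
for name in ("run1.2.Q", "run1.2.P", "run2.3.Q", "notes.txt"):
    open(os.path.join(demo_dir, name), "w").close()

print(select_input_files(demo_dir, ".Q"))  # only the .Q files
print(select_input_files(demo_dir))        # every file in the directory
```

Under this assumption, leaving the extension empty while extra files are present would pull those files in as well, so giving an explicit extension is the safer choice when the zip holds mixed content.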

**Run the main program**. Progress and time used will be displayed.

**Download the results**.

**Visualize the results in the notebook**. This step is optional and just for the convenience of the user to see the results live.

In [None]:
#@title Upload files
#@markdown Click "Choose Files" and upload your file(s) all at once:
#@markdown \
#@markdown (Required) A zip file containing the input files from population structure analysis (e.g., [*capeverde_admixtureQ.zip*](https://github.com/PopGenClustering/Clumppling/blob/master/examples/capeverde_admixtureQ.zip) which you can download from the GitHub page)
#@markdown \
#@markdown (Optional) An individual label file (e.g., [*capeverde_ind_labels.txt*](https://github.com/PopGenClustering/Clumppling/blob/master/examples/capeverde_ind_labels.txt))
#@markdown \
#@markdown Both files should be selected and uploaded at the same time. You will see the uploading progress.

!rm -rf /content/* # remove previously uploaded files, if any

uploaded = files.upload()

filenames = list(uploaded.keys())
f_input = [f for f in filenames if f.endswith('.zip')]
assert len(f_input) == 1, "Please upload one (and only one) zip file."
f_input = f_input[0]
input = os.path.join("/content", os.path.splitext(f_input)[0])  # strip only the trailing ".zip", so dots elsewhere in the name are kept
with zipfile.ZipFile(f_input, 'r') as z:
  z.extractall(input)

print("Uploaded files (/content): {}".format(os.listdir("/content")))
print("Input path: {}".format(input))

In [None]:
#@title Set input parameters
import random
random.seed(42)

def str2bool(v):
    if isinstance(v, bool):
        return v
    v = v.lower()
    if v in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise ValueError("Boolean value expected.")

parser = argparse.ArgumentParser()

# required
parser.add_argument("-i", "--input", type=str, required=True, help="Input file path")
parser.add_argument("-o", "--output", type=str, required=True, help="Output file directory")
parser.add_argument("-f", "--format", type=str, required=True, choices=["generalQ", "admixture", "structure", "fastStructure"],
                    help="File format")
# optional
parser.add_argument('-v', '--vis', type=str2bool, default=True, required=False, help='Whether to generate figure(s): True (default)/False')
parser.add_argument('--custom_cmap', type=str, default='', required=False, help='A plain text file containing customized colors (one per line; in hex code): if empty (default), using the default colormap, otherwise use the user-specified colormap')
parser.add_argument("--plot_type", type=str, default="graph", required=False, choices=["graph", "list", "withinK", "major", "all"],
                      help="Type of plot to generate: 'graph' (default), 'list', 'withinK', 'major', 'all'")
parser.add_argument("--include_cost", type=str2bool, default=True, required=False, help="Whether to include cost values in the graph plot: True (default)/False")
parser.add_argument("--include_label", type=str2bool, default=True, required=False, help="Whether to include individual labels in the plot: True (default)/False")
parser.add_argument("--alt_color", type=str2bool, default=False, required=False, help="Whether to use alternative colors for connection lines: True/False (default)")
parser.add_argument("--ind_labels", type=str, default="", required=False,
                    help="A plain text file containing individual labels (one per line) (default: last column from labels in input file, which consists of columns [0, 1, 3] separated by delimiter)")
parser.add_argument("--regroup_ind", type=str2bool, default=True, required=False,
                    help="Whether to regroup individuals so that those with the same labels stay together (if labels are available): True (default)/False")
parser.add_argument("--reorder_within_group", type=str2bool, default=True, required=False,
                    help="Whether to reorder individuals within each label group in the plot (if labels are available): True (default)/False")
parser.add_argument("--reorder_by_max_k", type=str2bool, default=True, required=False,
                    help="Whether to reorder individuals based on the major mode with largest K: True (default)/False (based on the major mode with smallest K)")
parser.add_argument("--order_cls_by_label", type=str2bool, default=True, required=False,
                    help="Whether to reorder clusters based on total memberships within each label group in the plot: True (default)/False (by overall total memberships)")
parser.add_argument("--plot_unaligned", type=str2bool, default=False, required=False, help="Whether to plot unaligned modes (in a list): True/False (default)")
parser.add_argument("--fig_format", type=str, default="tiff", required=False, choices=["png", "jpg", "jpeg", "tif", "tiff", "svg", "pdf", "eps", "ps", "bmp", "gif"], help="Figure format for output files (default: tiff)")

parser.add_argument("--extension", type=str, default="", required=False, help="Extension of input files")
parser.add_argument("--skip_rows", type=int, default=0, required=False, help="Skip top rows in input files")
parser.add_argument("--remove_missing", type=str2bool, default=True, required=False, help="Remove individuals with missing data: True (default)/False")

parser.add_argument("--cd_method", type=str, default="louvain", required=False,
                      choices=["louvain", "leiden", "infomap", "markov_clustering", "label_propagation", "walktrap", "custom"],
                      help="Community detection method to use (default: louvain)")
parser.add_argument("--cd_res", type=float, default=1.0, required=False,
                      help="Resolution parameter for the default Louvain community detection (default: 1.0)")
parser.add_argument("--test_comm", type=str2bool, default=True, required=False,
                      help="Whether to test community structure (default: True)")
parser.add_argument("--comm_min", type=float, default=1e-6, required=False,
                      help="Minimum threshold for cost matrix (default: 1e-6)")
parser.add_argument("--comm_max", type=float, default=1e-2, required=False,
                      help="Maximum threshold for cost matrix (default: 1e-2)")
parser.add_argument("--merge", type=str2bool, default=True, required=False,
                      help="Whether to merge two clusters when aligning K+1 to K (default: True)")
parser.add_argument("--use_rep", type=str2bool, default=True, required=False, help="Use representative modes (alternative: average): True (default)/False")
parser.add_argument("--use_best_pair", type=str2bool, default=True, required=False, help="Use best pair as anchor for across-K alignment (alternative: major): True (default)/False")

#@markdown ### Enter the required argument:
#@markdown Choose the format of the input files:
format = "admixture" #@param ["generalQ", "admixture", "structure", "fastStructure"] {type:"string"}
#@markdown \
#@markdown ### Change optional arguments if needed:
#@markdown Leave empty to load all input files, or provide a file extension such as ``.Q``, ``_f``, etc.:
extension = "" #@param {type:"string"}
#@markdown Set the resolution parameter of the Louvain algorithm for community detection:
cd_res = 1.0 #@param {type:"number"}
#@markdown Choose whether to test for community structure before applying community detection (will force community detection if False):
test_comm = True #@param ["False", "True"] {type:"raw"}
#@markdown Choose whether to use a representative replicate as the mode consensus (otherwise the average is used):
use_rep = False #@param ["False", "True"] {type:"raw"}
#@markdown Choose whether to merge all pairs of clusters when aligning K+1 and K (will be ignored if K differs by more than 1):
merge = True #@param ["False", "True"] {type:"raw"}
#@markdown Choose whether to generate corresponding visualizations:
vis = True #@param ["False", "True"] {type:"raw"}
plot_type = "graph" #@param ["graph", "list", "withinK", "major", "all"] {type:"string"}
fig_format = "png" #@param ["png", "jpg", "jpeg", "tif", "tiff", "svg", "pdf", "eps", "ps", "bmp", "gif"] {type:"string"}
#@markdown Leave empty to use the default colormap, or provide a custom colormap in the form of ``#D65859,#00AAC1,#01C0F6,#FDF0C4``:
custom_cmap = "" #@param {type:"string"}
#@markdown Leave empty if not using additional population labels, or provide the name of an individual label file (one label per line), e.g., ``capeverde_ind_labels.txt``:
ind_labels = "capeverde_ind_labels.txt" #@param {type:"string"}

if ind_labels != "":
  ind_labels = os.path.join("/content", ind_labels)
  assert os.path.exists(ind_labels), "The provided individual labels file does not exist."

# clean up cached output files
output = '/content/output'
if os.path.exists(output):
  shutil.rmtree(output)
if os.path.exists(output+".zip"):
  os.remove(output+".zip")

args_list = ['--input', input,
             '--output', output,
             '--format', format,
             '--extension',extension,
             '--test_comm', '1' if test_comm else '0',
             '--cd_res', str(cd_res),
             '--use_rep', '1' if use_rep else '0',
             '--merge', '1' if merge else '0',
             '--vis', '1' if vis else '0',
             '--plot_type', plot_type,
             '--fig_format', fig_format,
             '--custom_cmap',custom_cmap,
             '--ind_labels', ind_labels]

args = parser.parse_args(args_list)


In [None]:
#@title Run program
#@markdown You will see the progress output in the console.
clumppling_main.main(args)

In [None]:
#@title Download results
#@markdown A downloading window will pop up once you run this cell.
files.download('/content/output.zip')

In [None]:
#@title Display results
#@markdown The alignment result figures are displayed here (if you chose to generate any visualization).
from IPython.display import display
from PIL import Image

image_folder = '/content/output/visualization'
for fn in sorted(os.listdir(image_folder)):  # sorted so figures appear in a stable order
  if fn.endswith(fig_format):
    img = Image.open(os.path.join(image_folder, fn))
    display(img)

In [None]:
#@title Show statistics
#@markdown The within-K alignment performance is printed here. More statistics can be found in the output files you downloaded.
with open('/content/output/modes/mode_average_stats.txt') as f:
  for line in f:
    print(line.strip().replace(',','\t'))