# Get to Know a Dataset: Telomere-to-Telomere Korean Pangenome Project
This notebook serves as a guided tour of the 
[Telomere-to-Telomere Korean Pangenome Project](https://registry.opendata.aws/telomere-to-telomere-korean-pangenome-project) 
dataset. More usage examples, tutorials, and documentation for this dataset and others can be found at the 
[Registry of Open Data on AWS](https://registry.opendata.aws/).

### Q: How have you organized your dataset? Help us understand the key prefix structure of your S3 bucket.

The Telomere-to-Telomere Korean Pangenome Project dataset is organized into a small number of
top-level prefixes, each corresponding to a major data modality. This structure allows users to
navigate raw sequencing reads, assemblies, variants, and metadata without needing to download
the full dataset.

At the top level of the S3 bucket (final bucket name to be added after provisioning), users will find:

- `/reads/`
  - `/ont_ul/`: Oxford Nanopore ultra-long reads
  - `/hifi/`: PacBio HiFi reads
  - `/hic/`: Hi-C reads
- `/ubam/`: Unaligned BAM files
- `/assemblies/`
  - `/t2t_fasta/`: Telomere-to-telomere FASTA assemblies
  - `/t2t_gfa/`: Graphical Fragment Assembly (GFA) files
  - `/diploid/`: Haplotype-resolved assemblies
- `/variants/`
  - `/vcf/`: SNP / INDEL / SV VCF files
  - `/graph/`: Graph genome (VG format)
- `/metadata/`
  - `sample_metadata.csv`
  - `sequencing_summary.json`

This organization is designed so that users can directly load only the data relevant to their workflow
(e.g., assemblies for benchmarking, VCFs for variant analysis, or raw reads for re-assembly).
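
As a rough illustration of how this layout could be used from code, the sketch below maps a few data types to their prefixes and builds S3 URIs from them. The bucket name is the placeholder used in the draft cells of this notebook and may change after provisioning. The `PREFIXES` table and the `s3_uri` helper are hypothetical names introduced here, and the table covers only a subset of the prefixes.

```python
from posixpath import join

# Placeholder bucket name from the draft examples; the final name may differ.
BUCKET = "t2t-korean-pangenome"

# A subset of the prefixes described above, keyed by a short data-type label.
PREFIXES = {
    "ont_ul": "reads/ont_ul",
    "hifi": "reads/hifi",
    "hic": "reads/hic",
    "t2t_fasta": "assemblies/t2t_fasta",
    "t2t_gfa": "assemblies/t2t_gfa",
    "diploid": "assemblies/diploid",
    "vcf": "variants/vcf",
    "graph": "variants/graph",
}


def s3_uri(data_type: str, filename: str = "") -> str:
    """Build an S3 URI for a data type, optionally ending in a file name."""
    if data_type not in PREFIXES:
        raise KeyError(f"unknown data type: {data_type!r}")
    prefix = PREFIXES[data_type]
    key = join(prefix, filename) if filename else prefix + "/"
    return f"s3://{BUCKET}/{key}"


print(s3_uri("hifi"))
print(s3_uri("vcf", "KPP0001.phased.snps.vcf.gz"))
```

Keeping the prefixes in one table means a workflow only needs updating in one place if the final bucket layout changes.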

Full descriptive documentation is available at:  
https://github.com/KoreanPangenome/KoreanPangenome

In [None]:
# TODO: After S3 provisioning, this code cell will demonstrate listing top-level prefixes.
# Example:
# import s3fs
# fs = s3fs.S3FileSystem(anon=True)
# fs.ls("s3://t2t-korean-pangenome/")

### Q: What data formats are present in your dataset? What kinds of data are stored using these formats? Can you give any advice for how you work with these data formats?

The Telomere-to-Telomere Korean Pangenome Project includes several file formats commonly used in
long-read sequencing, genome assembly, and pangenome construction. These formats each represent
different biological data types and are chosen based on stability, interoperability, and compatibility
with existing genomics tools.

**1. FASTQ (ONT Ultra-long, PacBio HiFi)**
- Stores raw sequencing reads with base qualities.
- Chosen because it is the de facto standard for raw long-read data and is accepted by the major long-read aligners.
- Recommended tools: `minimap2`, `seqkit`, `NanoPlot`, `pbccs`.
- AWS: Can be read directly from S3 using `s3fs` or streamed into alignment jobs on EC2.

**2. uBAM (Unaligned BAM)**
- Stores unaligned reads in the binary BAM container, preserving read-level metadata (e.g., read groups and auxiliary tags) that FASTQ does not carry.
- Useful for workflows requiring read grouping, UUID tracking, or metadata-aware QC.
- Tools: `samtools`, `picard`, `htslib`.
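
To show where that metadata lives, here is a small sketch that reads the header of a BAM stream using only the standard library. It relies on two facts from the SAM/BAM specification: BAM is BGZF-compressed, which Python's `gzip` module can decompress, and the decompressed stream starts with the magic bytes `BAM\x01`, followed by the header text and the number of reference sequences. The `read_bam_header` helper and the demo header contents are hypothetical. In practice `samtools view -H` or `pysam` would be the usual route.

```python
import gzip
import io
import struct
from typing import BinaryIO, Tuple


def read_bam_header(handle: BinaryIO) -> Tuple[str, int]:
    """Return (SAM header text, number of reference sequences) from a BAM stream."""
    with gzip.GzipFile(fileobj=handle, mode="rb") as bam:
        if bam.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file (bad magic bytes)")
        (l_text,) = struct.unpack("<i", bam.read(4))  # little-endian int32
        text = bam.read(l_text).rstrip(b"\x00").decode("utf-8")
        (n_ref,) = struct.unpack("<i", bam.read(4))
    return text, n_ref


# Build a tiny in-memory BAM header for illustration (made-up read group).
header_text = b"@HD\tVN:1.6\tSO:unsorted\n@RG\tID:demo\tSM:sampleA\n"
raw = b"BAM\x01" + struct.pack("<i", len(header_text)) + header_text + struct.pack("<i", 0)
demo = io.BytesIO(gzip.compress(raw))

text, n_ref = read_bam_header(demo)
print(text)
print("reference sequences:", n_ref)
```

An unaligned BAM normally declares no reference sequences, so `n_ref` being zero is a quick sanity check that a file really is a uBAM.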

**3. FASTA (T2T and haplotype-resolved assemblies)**
- Represents assembled genome sequences.
- Used because it is the universal format for genome assemblies.
- Tools: `samtools faidx`, `seqkit`, `minimap2`.
- AWS: FASTA indexes can be stored alongside sequence files for fast region-based retrieval.
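
The sketch below illustrates why a `.fai` index makes region retrieval cheap: each index line records the sequence name, its length, the byte offset of its first base, the bases per line, and the bytes per line. With those five numbers a region maps to a single seek and read. `FaiEntry`, `parse_fai`, and `fetch` are hypothetical helpers, and the sequence is a made-up toy. The same arithmetic should apply to any seekable binary handle, including one opened through `s3fs`.

```python
import io
from typing import BinaryIO, Dict, NamedTuple


class FaiEntry(NamedTuple):
    name: str
    length: int      # sequence length in bases
    offset: int      # byte offset of the first base
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the newline


def parse_fai(text: str) -> Dict[str, FaiEntry]:
    """Parse the five standard tab-separated columns of a .fai index."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, offset, line_bases, line_width = line.split("\t")[:5]
        entries[name] = FaiEntry(name, int(length), int(offset),
                                 int(line_bases), int(line_width))
    return entries


def fetch(fasta: BinaryIO, entry: FaiEntry, start: int, end: int) -> str:
    """Return bases [start, end) (0-based, half-open) with one seek and one read."""
    if not 0 <= start <= end <= entry.length:
        raise ValueError("region out of bounds")

    def byte_pos(p: int) -> int:
        return entry.offset + (p // entry.line_bases) * entry.line_width + p % entry.line_bases

    fasta.seek(byte_pos(start))
    raw = fasta.read(byte_pos(end) - byte_pos(start))
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")


# Toy FASTA with 4 bases per line and its matching index line.
fasta = io.BytesIO(b">chrDemo\nACGT\nTTGA\nCC\n")
index = parse_fai("chrDemo\t10\t9\t4\t5\n")
region = fetch(fasta, index["chrDemo"], 2, 7)
print(region)
```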

**4. GFA (Graphical Fragment Assembly)**
- Describes assembly graphs for telomere-to-telomere sequences.
- Chosen because it captures repeat structure and alternative paths that FASTA cannot represent.
- Tools: `Bandage`, `GFAKluge`.
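
As a sketch of what a GFA file contains, the code below reads GFA 1 `S` (segment) and `L` (link) lines, totals the segment lengths, and flags segments with more than one link attached to the same end. Such ends are where alternative paths diverge. The `summarize_gfa` helper and the three-segment graph are hypothetical, and other GFA record types are ignored.

```python
import io
from collections import Counter
from typing import Dict, TextIO


def summarize_gfa(handle: TextIO) -> Dict[str, object]:
    """Count segments and links in a GFA 1 stream and flag branching segments."""
    seg_lengths = {}
    end_degree = Counter()  # number of links attached to each (segment, side)
    n_links = 0
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] == "S":
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":
                # Sequence not stored inline; fall back to the optional LN:i: tag.
                length = next((int(t[5:]) for t in fields[3:] if t.startswith("LN:i:")), 0)
            seg_lengths[name] = length
        elif fields[0] == "L":
            src, src_orient, dst, dst_orient = fields[1:5]
            # A forward source is left through its end, a reverse one through its start.
            end_degree[(src, "end" if src_orient == "+" else "start")] += 1
            # A forward target is entered at its start, a reverse one at its end.
            end_degree[(dst, "start" if dst_orient == "+" else "end")] += 1
            n_links += 1
    branching = sorted({seg for (seg, _), deg in end_degree.items() if deg > 1})
    return {
        "segments": len(seg_lengths),
        "total_length": sum(seg_lengths.values()),
        "links": n_links,
        "branching_segments": branching,
    }


demo = io.StringIO(
    "H\tVN:Z:1.0\n"
    "S\ts1\tACGT\n"
    "S\ts2\t*\tLN:i:10\n"
    "S\ts3\tGG\n"
    "L\ts1\t+\ts2\t+\t0M\n"
    "L\ts1\t+\ts3\t-\t0M\n"
)
summary = summarize_gfa(demo)
print(summary)
```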

**5. VCF (SNPs, INDELs, Structural Variants)**
- Represents variant calls in a structured, indexed format.
- Each `.vcf.gz` is accompanied by a `.tbi` index for random access.
- Tools: `bcftools`, `cyvcf2`, `htslib`.
- AWS: Amazon Athena can query tabular derivatives of the variant calls (e.g., VCFs converted to Parquet).
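
The following sketch shows the record structure in plain Python: it skips header lines, splits the eight fixed VCF columns, and turns the INFO column into a dictionary. Rows in this shape are roughly what a tabular derivative would hold. `read_vcf_records` is a hypothetical helper and the two records are made up. For real files, `cyvcf2` or `bcftools` handle compression, indexing, and typing properly.

```python
import io
from typing import Dict, Iterator, TextIO


def read_vcf_records(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per VCF data line, parsing the fixed columns and INFO."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the column header, and blank lines
        chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\r\n").split("\t")[:8]
        info_dict = {}
        for item in info.split(";"):
            if not item or item == ".":
                continue
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True  # flags have no value
        yield {
            "CHROM": chrom,
            "POS": int(pos),
            "ID": vid,
            "REF": ref,
            "ALT": alt.split(","),
            "FILTER": flt,
            "INFO": info_dict,
        }


demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\t.\tA\tG\t60\tPASS\tDP=30\n"
    "chr1\t5000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SVLEN=-120;END=5120;IMPRECISE\n"
)
records = list(read_vcf_records(demo))
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["ALT"], rec["INFO"])
```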

**6. VG (Graph-based pangenome format)**
- Encodes genome graphs, GBWT haplotypes, and indexing for graph-based alignment.
- Chosen because it supports pangenome-aware analysis and improves SV genotyping accuracy.
- Tools: `vg` (including the `vg giraffe` aligner), `toil-vg`.
- AWS: Graph indexing workflows can be executed on EC2 or AWS Batch.

These formats were selected to ensure compatibility with existing human pangenome pipelines and to
support high-throughput analysis directly in AWS.


In [None]:
# TODO: Example of reading FASTQ from S3 will be added after bucket provisioning

### Q: Can you show us an example of downloading and loading data from your dataset?

Because the S3 bucket for the Telomere-to-Telomere Korean Pangenome Project has not yet been
provisioned, this section provides an outline of how users will load and inspect data once the
dataset is published.

Below is a draft example demonstrating the intended workflow for accessing raw sequencing data,
assemblies, and variant files directly from S3.


In [None]:
# TODO: Replace with the final S3 bucket URI once provisioning is complete.
# No trailing slash, so the f-string paths below do not contain "//".
BUCKET = "s3://t2t-korean-pangenome"

# Draft example for listing top-level prefixes
# import s3fs
# fs = s3fs.S3FileSystem(anon=True)
# fs.ls(BUCKET)

# Draft example for reading a FASTQ file from S3
# import gzip
# import s3fs
#
# fs = s3fs.S3FileSystem(anon=True)
# with fs.open(f"{BUCKET}/reads/hifi/KPP0001.hifi.fastq.gz", "rb") as f:
#     with gzip.open(f, "rt") as fastq:
#         for i in range(8):   # print first two reads
#             print(next(fastq).strip())

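# Draft example for reading a VCF header
# Assumes the htslib build behind cyvcf2 supports s3:// URLs; if it does not,
# download the .vcf.gz and its .tbi index first and pass the local path.
# import cyvcf2
#
# vcf_path = f"{BUCKET}/variants/vcf/KPP0001.phased.snps.vcf.gz"
# vcf = cyvcf2.VCF(vcf_path)
# print(vcf.raw_header)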

Once the dataset is publicly available on AWS, these examples will be replaced with executable
code demonstrating:

- how to stream FASTQ / uBAM reads directly from S3 without download  
- how to load large FASTA assemblies efficiently  
- how to work with indexed VCF files for region-based variant queries  
- how to load pangenome graph files using the `vg` toolkit  

This section will be updated after the S3 bucket is provisioned.

### Q: A picture is worth a thousand words. Show us a visual (or several!) from your dataset that either illustrates something informative about your dataset, or that you think might excite someone to dig in further.

Because the dataset has not yet been published to S3, this section provides examples of the kinds of
visualizations we plan to include once the data becomes accessible.

The Telomere-to-Telomere Korean Pangenome Project dataset lends itself to several informative and
exciting visual summaries, including:

- Distribution of Oxford Nanopore ultra-long read lengths  
- Coverage depth across samples  
- Assembly contiguity metrics (e.g., N50, T2T completeness)  
- Structural variant size distribution  
- Graph-based pangenome node/edge counts  
- Comparison of haplotype divergence between individuals  
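
One of the contiguity metrics listed above, N50, is simple enough to sketch directly: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The `n50` function and the contig lengths are illustrative placeholders, not values from the dataset.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or read) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]  # only reached if every length is zero


# Placeholder contig lengths (total 290, so half is 145).
contigs = [100, 80, 50, 30, 20, 10]
assembly_n50 = n50(contigs)
print("N50:", assembly_n50)
```

The same function works for read lengths, which gives a read N50 for comparing sequencing runs.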

Below is draft code showing how these visualizations will be produced once the S3 bucket is available.


In [None]:
# TODO: Replace with actual S3 paths once the dataset is provisioned.
# Example: Read length distribution for ONT ultra-long reads.

# import matplotlib.pyplot as plt
# import numpy as np

# # Placeholder synthetic data for illustration
# placeholder_read_lengths = np.random.lognormal(mean=10, sigma=0.5, size=50000)

# plt.figure(figsize=(10, 5))
# plt.hist(placeholder_read_lengths, bins=100, color='#3A6EA5', alpha=0.8)
# plt.title("Draft Example: ONT Ultra-long Read Length Distribution")
# plt.xlabel("Read length (bp)")
# plt.ylabel("Count")
# plt.grid(alpha=0.3)
# plt.show()


### Q: What is one question that you have answered using these data? Can you show us how you came to that answer?

Because the dataset has not yet been deployed to S3, this section shows a toy example that illustrates
the kind of analysis users will be able to perform once the full dataset becomes available.

One representative question for the Telomere-to-Telomere Korean Pangenome Project is:

**"How do long-read sequencing technologies differ in read length distributions and what implications does this have for telomere-to-telomere assembly?"**

ONT ultra-long reads are very long, often exceeding 100 kb, while PacBio HiFi reads are shorter
but more accurate. Understanding these differences helps users choose appropriate data types for
assembly, variant calling, or graph construction.

Below is a toy example demonstrating how this comparison would be visualized.


In [None]:
# Toy example: comparing ONT UL vs PacBio HiFi read length distributions
# import numpy as np
# import matplotlib.pyplot as plt

# # Synthetic placeholder data
# ont_lengths = np.random.lognormal(mean=11.0, sigma=0.6, size=20000)   # ONT UL reads
# hifi_lengths = np.random.lognormal(mean=9.2, sigma=0.3, size=20000)   # PacBio HiFi reads

# plt.figure(figsize=(10,5))
# plt.hist(ont_lengths, bins=120, alpha=0.6, label="ONT ultra-long", color="#3A6EA5")
# plt.hist(hifi_lengths, bins=120, alpha=0.6, label="PacBio HiFi", color="#A53A3A")
# plt.title("Draft Example: Synthetic Read Length Distributions")
# plt.xlabel("Read length (bp)")
# plt.ylabel("Count")
# plt.legend()
# plt.grid(alpha=0.3)
# plt.show()


Once the dataset is publicly available, this notebook will reproduce the real distributions using
the raw FASTQ files directly from S3. These kinds of visualizations help highlight:

- why ONT UL reads are valuable for spanning repeats and assembling telomeres
- why PacBio HiFi reads are important for polishing assemblies and accurate variant calling
- how combining both platforms enables true telomere-to-telomere (T2T) assemblies

This “toy example” demonstrates the type of reasoning users can apply to the real dataset once
it becomes accessible.
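
The same comparison can be summarized numerically. The sketch below uses only the standard library and draws synthetic lengths with the same log-normal parameters as the toy plot, so the numbers describe the placeholder distributions, not the real data. The `summarize` helper and the 100 kb threshold are illustrative choices.

```python
import random
import statistics
from typing import Dict, List


def summarize(lengths: List[float], threshold: int = 100_000) -> Dict[str, float]:
    """Return the median length and the fraction of reads at or above a threshold."""
    return {
        "median": statistics.median(lengths),
        "frac_over": sum(length >= threshold for length in lengths) / len(lengths),
    }


rng = random.Random(0)  # fixed seed so the synthetic draw is repeatable
# Same synthetic parameters as the toy plot: (mu, sigma) of the underlying normal.
ont_lengths = [rng.lognormvariate(11.0, 0.6) for _ in range(20000)]
hifi_lengths = [rng.lognormvariate(9.2, 0.3) for _ in range(20000)]

ont_summary = summarize(ont_lengths)
hifi_summary = summarize(hifi_lengths)
print("ONT UL :", ont_summary)
print("HiFi   :", hifi_summary)
```

With real reads, the lengths would come from the FASTQ or uBAM files instead of a random generator.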

### Q: What is one unanswered question that you think could be answered using these data? Do you have any recommendations or advice for someone wanting to answer this question?

One meaningful open question that can be explored using the Telomere-to-Telomere Korean Pangenome
Project dataset is:

**"Which structural variants unique to Korean and East Asian populations alter gene regulation or
disease-relevant pathways, and can pangenome-based methods improve their detection?"**

Traditional short-read methods struggle to resolve many variants in repetitive, medically relevant
regions. With ultra-long ONT reads, HiFi reads, and telomere-to-telomere assemblies, this dataset
makes it possible to systematically:

- detect previously unresolved structural variants  
- anchor them in T2T-resolved regions  
- compare haplotype-resolved breakpoints across individuals  
- evaluate potential functional impact (e.g., promoter disruptions, regulatory element shifts)
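
As a rough sketch of the breakpoint comparison step, the code below pairs structural variant calls from two haplotypes (or two individuals) when they share a chromosome and type, start within a tolerance of each other, and have similar lengths. The `SV` record, the `match_svs` helper, the thresholds (`max_shift=50`, `min_len_ratio=0.8`), and the coordinates are all illustrative assumptions, not the project's method. Dedicated SV comparison tools handle many cases this greedy matcher ignores.

```python
from typing import List, NamedTuple, Tuple


class SV(NamedTuple):
    chrom: str
    pos: int     # start coordinate
    svtype: str  # e.g. "DEL" or "INS"
    svlen: int   # absolute length in bp


def match_svs(calls_a: List[SV], calls_b: List[SV],
              max_shift: int = 50, min_len_ratio: float = 0.8
              ) -> Tuple[List[Tuple[SV, SV]], List[SV], List[SV]]:
    """Greedily pair similar SVs; return (pairs, only_in_a, only_in_b)."""
    pairs, used_a, used_b = [], set(), set()
    for i, a in enumerate(calls_a):
        for j, b in enumerate(calls_b):
            if j in used_b or (a.chrom, a.svtype) != (b.chrom, b.svtype):
                continue
            longer = max(a.svlen, b.svlen)
            similar = longer > 0 and min(a.svlen, b.svlen) / longer >= min_len_ratio
            if abs(a.pos - b.pos) <= max_shift and similar:
                pairs.append((a, b))
                used_a.add(i)
                used_b.add(j)
                break  # each call is paired at most once
    only_a = [a for i, a in enumerate(calls_a) if i not in used_a]
    only_b = [b for j, b in enumerate(calls_b) if j not in used_b]
    return pairs, only_a, only_b


# Made-up calls for two haplotypes.
hap1 = [SV("chr1", 10_000, "DEL", 300), SV("chr1", 50_000, "INS", 1200), SV("chr2", 7_500, "DEL", 90)]
hap2 = [SV("chr1", 10_020, "DEL", 310), SV("chr1", 50_400, "INS", 1200), SV("chr2", 7_510, "DEL", 400)]

pairs, only_hap1, only_hap2 = match_svs(hap1, hap2)
print("shared:", len(pairs), "hap1 only:", len(only_hap1), "hap2 only:", len(only_hap2))
```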

**Recommendations for someone tackling this question:**

1. Start with graph-based indexing (VG toolkit, GBWT) to ensure SVs are genotyped against a
   pangenome rather than a linear reference.
2. Use diploid assemblies to validate SV breakpoints at haplotype resolution.
3. Consider integrating publicly available transcriptomic or epigenomic datasets to assess likely
   regulatory impact.
4. For computational scale, use AWS Batch or EC2 spot instances, as genome-graph workflows can
   be resource-intensive.
5. Most importantly, explore complex loci—immune regions, segmental duplications, and
   pharmacogenomic hotspots—where population-specific SVs tend to hide.

This challenge invites the community to perform analyses that are only possible with T2T and
graph-based representations, and could lead to discoveries with direct relevance to precision
medicine in East Asian populations.
