# Xiong Lab Single-nucleus ATAC-seq Preprocessing Pipeline
---
## Overview

This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) data from the Kellis lab (Xiong et al.) for downstream chromatin accessibility QTL (caQTL) analysis. It processes pseudobulk peak count data across six major brain cell types.

**Pipeline Purpose:**
- Transform raw pseudobulk peak counts into analysis-ready formats
- Remove technical confounders while preserving biological variation
- Generate QTL-ready phenotype files for genome-wide caQTL mapping

**Supported Cell Types:**
- **Mic** - Microglia
- **Astro** - Astrocytes
- **Oligo** - Oligodendrocytes
- **Ex** - Excitatory neurons
- **In** - Inhibitory neurons
- **OPC** - Oligodendrocyte precursor cells

---

## Workflow Structure

This pipeline consists of **three sequential steps**:

#### Step 0: Sample ID Mapping

**Input:**
- Sample mapping file: `rosmap_sample_mapping_data.csv`
- Original metadata files: `metadata_{celltype}.csv`
- Original count files: `pseudobulk_peaks_counts_{celltype}.csv.gz`

**Process:**
1. Loads sample ID mapping between individualID and sampleid
2. Processes metadata files:
   - Adds a `sampleid` column and moves it to the front of the table
   - Maps individualID to sampleid where mapping exists
   - Keeps original individualID for unmapped samples
3. Processes count matrix files:
   - Renames column headers from individualID to sampleid
   - Maintains count data integrity

#### Step 1: Pseudobulk QC & Calculate Residuals with biological variation

**Input:**
- Mapped metadata: `metadata_{celltype}.csv` (from Step 0)
- Mapped peak counts: `pseudobulk_peaks_counts_{celltype}.csv.gz` (from Step 0)
- Sample covariates: `rosmap_cov.txt`
- hg38 blacklist: `hg38-blacklist.v2.bed.gz`

**Process:**
1. Loads pseudobulk peak count matrix and metadata
2. **Keeps only samples with n_nuclei > 20**
3. Calculates technical QC metrics per sample:
   - `log_n_nuclei`: Log-transformed number of nuclei
   - `med_nucleosome_signal`: Median nucleosome signal
   - `med_tss_enrich`: Median TSS enrichment score
   - `log_med_n_tot_fragment`: Log-transformed median total fragments (sequencing depth)
   - `log_total_unique_peaks`: Log-transformed count of unique peaks detected
4. Filters blacklisted genomic regions using `foverlaps()`
5. Merges with covariates (pmi, study) - **excludes msex and age_death**
6. Applies expression filtering with `filterByExpr()`:
   - `min.count = 5`: Minimum count (applied as a CPM cutoff) that a peak must reach in a sufficient number of samples
   - `min.total.count = 15`: Minimum 15 total reads across all samples
   - `min.prop = 0.1`: For large sample sizes, the minimum proportion of samples in which a peak must reach the cutoff (see `?filterByExpr` for the exact rule)
7. Saves **filtered raw counts** (used for region-specific analysis if needed)
8. TMM normalization with `calcNormFactors()`
9. Handles sequencingBatch and Library as covariates
10. Fits linear model using `voom()` and `lmFit()`:
    ```r
    model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + 
               log_med_n_tot_fragment + log_total_unique_peaks + 
               sequencingBatch_factor + Library_factor + pmi + study
    ```
11. Calculates residuals using `predictOffset()`: `offset + residuals`
    - **Preserves biological variation** (sex, age)
    - Removes technical variation and study effects

**Key Variables Regressed Out:**
- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch, library
- Study effects: pmi, study cohort

**Key Variables Preserved:**
- Sex (msex)
- Age at death (age_death)


#### Step 2: Phenotype Reformatting

**Input:**
- `{celltype}_residuals.txt` from Step 1 (in `2_residuals/{celltype}/`)

**Process:**
1. Reads residuals file with proper handling of peak IDs and sample columns
2. Parses peak coordinates from peak IDs (format: `chr-start-end`)
3. Converts peaks to **midpoint coordinates**:

**Use for:**
- Genome-wide caQTL mapping with FastQTL, TensorQTL, or MatrixEQTL
- Analysis that accounts for or investigates sex/age effects
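
As an illustration of the coordinate parsing in Step 2, the base R sketch below splits peak IDs of the form `chr-start-end` and computes an integer midpoint. The two peak IDs are taken from the Step 1 output in this notebook. The single-base interval `[mid, mid + 1)` and the column names are assumptions for illustration only; the exact layout written by Step 2 is not shown in this section.

```r
# Peak IDs in chr-start-end format (as printed in the Step 1 output)
peak_ids <- c("chr1-817077-817577", "chr1-827285-827785")

# Split each ID into its three fields
parts <- do.call(rbind, strsplit(peak_ids, "-", fixed = TRUE))
start <- as.integer(parts[, 2])
end   <- as.integer(parts[, 3])

# Integer midpoint of each peak
mid <- (start + end) %/% 2L

# ASSUMPTION: represent each peak as a one-base interval at its midpoint
bed <- data.frame(
  chr   = parts[, 1],
  start = mid,
  end   = mid + 1L,
  ID    = peak_ids,
  stringsAsFactors = FALSE
)
print(bed)
```

Note that splitting on `-` assumes the chromosome name itself contains no hyphen.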

---

### Pipeline Outputs

**From Step 0:**
- `metadata_{celltype}.csv`: Metadata with mapped sampleid
- `pseudobulk_peaks_counts_{celltype}.csv.gz`: Counts with mapped sampleid headers

**From Step 1:**
- `{celltype}_residuals.txt`: Covariate-adjusted residuals (log2-CPM scale)
- `{celltype}_filtered_raw_counts.txt`: Filtered raw counts (saved before TMM normalization)
- `{celltype}_results.rds`: Complete analysis results
- `{celltype}_summary.txt`: QC and filtering statistics
- `{celltype}_variable_explanation.txt`: Variable documentation

**From Step 2:**
- `{celltype}_kellis_xiong_snatac_phenotype.bed.gz`: Genome-wide QTL-ready BED file

---

**Input files** needed to run this pipeline can be downloaded [here](https://drive.google.com/drive/folders/1l1RJx5toqg_WOlWW3gy-ynkrodi8oqXv?usp=drive_link).

#### Before you start, let's set your working paths.

In [29]:
input_dir <- " " # insert your input dir
output_dir <- " " #insert your output dir

## Step 0: Check sample ID

**Purpose:** Maps original sample identifiers (individualID) to standardized sample IDs (sampleid) across metadata and count matrix files.

---

#### Input:

**Sample Mapping Reference:**
- `rosmap_sample_mapping_data.csv`: Contains mapping between individualID and sampleid

**Metadata Files (per cell type):**
- `metadata_Ast.csv`
- `metadata_Ex.csv`
- `metadata_In.csv`
- `metadata_Microglia.csv`
- `metadata_Oligo.csv`
- `metadata_OPC.csv`

**Count Matrix Files (per cell type):**
- `pseudobulk_peaks_counts_Ast.csv.gz`
- `pseudobulk_peaks_counts_Ex.csv.gz`
- `pseudobulk_peaks_counts_In.csv.gz`
- `pseudobulk_peaks_counts_Microglia.csv.gz`
- `pseudobulk_peaks_counts_Oligo.csv.gz`
- `pseudobulk_peaks_counts_OPC.csv.gz`


#### Process:

**Part 1: Process Metadata Files**

1. Loads sample mapping dictionary from `rosmap_sample_mapping_data.csv`
2. Creates a keyed data.table for fast lookups: `individualID → sampleid`
3. For each metadata file:
   - Reads the CSV file
   - Checks that the `individualID` column is present
   - Creates a new `sampleid` column
   - For each sample:
     - If mapping exists: uses the mapped sampleid
     - If no mapping: uses the original individualID (preserves unmapped samples)
   - Moves the `sampleid` column to the front of the table
   - Saves updated metadata file

**Part 2: Process Count Matrix Files**

1. For each count matrix file (gzipped):
   - Extracts header line (first row with column names)
   - First column is `peak_id` (kept as-is)
   - Remaining columns are sample IDs (individualID format)
   - Maps sample IDs to sampleid where mapping exists
   - Creates new header with mapped IDs
   - Replaces original header with new header
   - Recompresses with gzip
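
The header remapping described in Part 2 amounts to a lookup with a fallback to the original ID. A minimal base R sketch of that idea follows; the `id_lookup` vector, the `remap_header()` helper, and all IDs are hypothetical and are not taken from `rosmap_sample_mapping_data.csv`.

```r
# Hypothetical mapping: names are individualIDs, values are sampleids
id_lookup <- c(R1111111 = "SM-AAAAA", R2222222 = "SM-BBBBB")

# Hypothetical count-matrix header: peak_id followed by individualIDs
header <- c("peak_id", "R1111111", "R9999999", "R2222222")

remap_header <- function(header, lookup) {
  sample_cols <- header[-1]
  mapped <- unname(lookup[sample_cols])                    # NA where no mapping exists
  new_cols <- ifelse(is.na(mapped), sample_cols, mapped)   # fall back to the original ID
  c(header[1], new_cols)
}

new_header <- remap_header(header, id_lookup)
print(new_header)
```

The vectorized lookup avoids a per-column loop, and unmapped IDs pass through unchanged, which matches the behavior described in the list above.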

#### Output:
Output Directory: `output/1_files_with_sampleid/`

Metadata Files (with sampleid):
- `metadata_Ast.csv`
- `metadata_Ex.csv`
- `metadata_In.csv`
- `metadata_Microglia.csv`
- `metadata_Oligo.csv`
- `metadata_OPC.csv`

Count Matrix Files (with sampleid headers):
- `pseudobulk_peaks_counts_Ast.csv.gz`
- `pseudobulk_peaks_counts_Ex.csv.gz`
- `pseudobulk_peaks_counts_In.csv.gz`
- `pseudobulk_peaks_counts_Microglia.csv.gz`
- `pseudobulk_peaks_counts_Oligo.csv.gz`
- `pseudobulk_peaks_counts_OPC.csv.gz`


#### Load libraries

In [2]:
library(data.table)

#### Load input

In [3]:
# Read mapping data
map_file <- file.path(input_dir, "data/rosmap_sample_mapping_data.csv")
map <- fread(map_file)
cat("Read mapping file, rows:", nrow(map), "\n")

# Create mapping dictionary
id_map <- map[, .(individualID, sampleid)]
setkey(id_map, individualID)

# Define cell types and paths
celltype <- c("Ast", "Ex", "In", "Microglia", "Oligo", "OPC")

# Your specific metadata file paths
metadata_files <- file.path(input_dir, paste0("1_files_with_sampleid/metadata_", celltype, ".csv"))


# Create the Step 0 output directory (shared by all cell types)
specific_dir <- file.path(output_dir, "1_files_with_sampleid")
if (!dir.exists(specific_dir)) {
  dir.create(specific_dir, recursive = TRUE)
  cat("Created directory:", specific_dir, "\n")
}

Read mapping file, rows: 1200 


### Process metadata files

In [4]:
# Function to process metadata files - adds sampleid and uses individualID for unmapped cases
process_metadata <- function(file_path, celltype_name) {
  cat("\nProcessing metadata file:", basename(file_path), "\n")
  
  # Read data
  meta <- fread(file_path)
  cat("Original rows:", nrow(meta), "columns:", ncol(meta), "\n")
  
  # Find the position of individualID column
  id_col_index <- which(colnames(meta) == "individualID")
  if (length(id_col_index) == 0) {
    cat("Warning: individualID column not found\n")
    return(NULL)
  }
  
  # Find the mapped sampleids for each individualID
  meta$sampleid <- character(nrow(meta))  # Initialize with empty strings
  
  for (i in seq_len(nrow(meta))) {
    ind_id <- meta$individualID[i]
    mapped_id <- id_map[ind_id, sampleid]
    
    # If mapping found, use it; otherwise use the original individualID
    if (length(mapped_id) > 0 && !is.na(mapped_id)) {
      meta$sampleid[i] <- mapped_id
    } else {
      # Use the original individualID instead of NA
      meta$sampleid[i] <- ind_id
    }
  }
  
  # Move sampleid column to the front
  setcolorder(meta, c("sampleid", setdiff(names(meta), "sampleid")))
  
  # Save results
  output_file <- file.path(output_dir, "1_files_with_sampleid",basename(file_path))
  cat("Output file will be saved to:", output_file, "\n")
  fwrite(meta, output_file)
  
  # Count mapped and unmapped IDs
  mapped_count <- sum(meta$sampleid != meta$individualID)
  unmapped_count <- sum(meta$sampleid == meta$individualID)
  
  cat("Saved to:", output_file, "\n")
  cat("Converted rows:", nrow(meta), "columns:", ncol(meta), "\n")
  cat("Mapped IDs:", mapped_count, "Unmapped IDs:", unmapped_count, "\n")
  
  # Return processing summary
  list(
    file = basename(file_path),
    mapped_ids = mapped_count,
    unmapped_ids = unmapped_count,
    total_ids = nrow(meta)
  )
}

# Process all metadata files
meta_results <- mapply(process_metadata, metadata_files, celltype, SIMPLIFY = FALSE)
meta_summary <- do.call(rbind, lapply(meta_results, as.data.table))

cat("\nMetadata file processing summary:\n")
print(meta_summary)


Processing metadata file: metadata_Ast.csv 
Original rows: 93 columns: 10 
Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ast.csv 
Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ast.csv 
Converted rows: 93 columns: 10 
Mapped IDs: 84 Unmapped IDs: 9 

Processing metadata file: metadata_Ex.csv 
Original rows: 92 columns: 10 
Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ex.csv 
Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ex.csv 
Converted rows: 92 columns: 10 
Mapped IDs: 83 Unmapped IDs: 9 

Processing metadata file: metadata_In.csv 
Original rows: 93 columns: 10 
Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_In.csv 
Saved to: /restricted/projectnb/xqtl/jaempawi/ata

### Process count matrix files

In [5]:
# Your specific count matrix file paths
count_files <- file.path(input_dir, paste0("1_files_with_sampleid/pseudobulk_peaks_counts_", celltype, ".csv.gz"))


# Direct column renaming for count matrix files
process_counts_simple <- function(file_path) {
  cat("\nProcessing count matrix file:", basename(file_path), "\n")
  
  # Get header line only
  header_command <- paste0("zcat ", file_path, " | head -n 1")
  header_line <- system(header_command, intern = TRUE)
  
  # Parse column names
  col_names <- unlist(strsplit(header_line, ","))
  cat("Original columns:", length(col_names), "\n")
  
  # First column is peak_id, remaining columns are sample IDs
  peak_id_col <- col_names[1]
  sample_cols <- col_names[-1]
  
  # Map sample IDs
  new_sample_cols <- character(length(sample_cols))
  mapped_count <- 0
  
  for (i in seq_along(sample_cols)) {
    ind_id <- sample_cols[i]
    mapped_id <- id_map[ind_id, sampleid]
    
    if (length(mapped_id) > 0 && !is.na(mapped_id)) {
      new_sample_cols[i] <- mapped_id
      mapped_count <- mapped_count + 1
    } else {
      # Keep original individualID if no mapping found
      new_sample_cols[i] <- ind_id
    }
  }
  
  # Create new header
  new_col_names <- c(peak_id_col, new_sample_cols)
  
  # Create temporary header file
  temp_header <- tempfile()
  writeLines(paste(new_col_names, collapse = ","), temp_header)
  
  # Output file path
  output_file <- file.path(output_dir, "1_files_with_sampleid", basename(file_path))
  
  # Use system command to process the file without chunking
  # This extracts the data (excluding header), prepends new header, and compresses
  cmd <- paste0(
    "zcat ", file_path, " | tail -n +2 | cat ", temp_header, " - | gzip > ", output_file
  )
  
  cat("Executing command:", cmd, "\n")
  system_result <- system(cmd)
  
  # Check if command succeeded
  if (system_result != 0) {
    cat("ERROR: Command failed with exit code", system_result, "\n")
    cat("Attempting backup method...\n")
    
    # Backup method using R's built-in file handling
    tryCatch({
      # Create a named vector for mapping
      id_mapping <- setNames(new_sample_cols, sample_cols)
      
      # Open connections
      in_conn <- gzfile(file_path, "r")
      out_conn <- gzfile(output_file, "w")
      
      # Read and discard the header line
      readLines(in_conn, n = 1)
      
      # Write the new header
      writeLines(paste(new_col_names, collapse = ","), out_conn)
      
      # Copy the rest of the file line by line
      while (length(line <- readLines(in_conn, n = 1)) > 0) {
        writeLines(line, out_conn)
      }
      
      # Close connections
      close(in_conn)
      close(out_conn)
      
      cat("Backup method successful\n")
    }, error = function(e) {
      cat("Backup method also failed:", e$message, "\n")
    })
  } else {
    # Check file sizes to verify completion
    input_size <- system(paste("ls -lh", file_path), intern = TRUE)
    output_size <- system(paste("ls -lh", output_file), intern = TRUE)
    cat("Input file size: ", input_size, "\n")
    cat("Output file size:", output_size, "\n")
  }
  
  # Delete temporary file
  file.remove(temp_header)
  
  cat("File processing completed and saved to:", output_file, "\n")
  
  # Return processing summary
  list(
    file = basename(file_path),
    total_columns = length(col_names),
    mapped_columns = mapped_count,
    unmapped_columns = length(sample_cols) - mapped_count
  )
}

# Process all count files
count_results <- lapply(count_files, process_counts_simple)

# Summarize results
count_summary <- do.call(rbind, lapply(count_results, as.data.table))
cat("\nCount matrix file processing summary:\n")
print(count_summary)

cat("\nAll files processed!\n")


Processing count matrix file: pseudobulk_peaks_counts_Ast.csv.gz 
Original columns: 93 
Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a5b135eff - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz 
Input file size:  -rw-r--r-- 1 jaempawi xqtl 22M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz 
Output file size: -rw-r--r-- 1 jaempawi xqtl 22M Feb 12 15:32 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz 
File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz 

Processing count matrix file: pseudobulk_pea

## Step 1: Pseudobulk QC (noBIOvar: biological variables not regressed out)
**Purpose:** Performs quality control on pseudobulk ATAC-seq data, filters low-quality samples and peaks, normalizes data, and calculates covariate-adjusted residuals while preserving biological variation (sex, age).

---

#### Input:

**From Step 0 (required):**
- `metadata_{celltype}.csv` (in `output/1_files_with_sampleid/`)
- `pseudobulk_peaks_counts_{celltype}.csv.gz` (in `output/1_files_with_sampleid/`)

**Reference Files:**
- `rosmap_cov.txt`: Sample covariates (pmi, study)
- `hg38-blacklist.v2.bed.gz`: ENCODE blacklist regions

**Cell Types** (set `celltype` to the label used in the Step 0 file names):
- `Microglia` (Microglia)
- `Ast` (Astrocytes)
- `Oligo` (Oligodendrocytes)
- `Ex` (Excitatory neurons)
- `In` (Inhibitory neurons)
- `OPC` (Oligodendrocyte precursor cells)

#### Process:

1. Load Data
2. Sample Quality Filtering
3. Calculate Technical QC Metrics
4. Process Peak Coordinates
5. Filter Blacklisted Regions
6. Merge Covariates
7. Create DGE Object
8. Expression Filtering
9. Save Filtered Raw Counts
10. TMM Normalization
11. Handle Batch and Library Variables
12. Build Linear Model
13. Voom Transformation & Model Fitting
14. Calculate Offsets and Residuals

#### Output:
Output Directory: `output/2_residuals/{celltype}/`

1. Residuals File: `{celltype}_residuals.txt`
2. Results Object: `{celltype}_results.rds`
3. Summary Report: `{celltype}_summary.txt`
4. Variable Explanation: `{celltype}_variable_explanation.txt`
5. Filtered Raw Counts: `{celltype}_filtered_raw_counts.txt`

#### Load libraries

In [6]:
library(data.table)
library(stringr)
library(dplyr)
library(edgeR)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:data.table’:

    between, first, last


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: limma



In [8]:
# Set cell type and create output directory
#args <- commandArgs(trailingOnly = TRUE)

celltype <- "Oligo"
cat("Processing celltype:", celltype, "\n")

# Create an output directory for each cell type being processed
for (ct in celltype) {
  specific_dir <- file.path(output_dir, "2_residuals", ct)
  if (!dir.exists(specific_dir)) {
    dir.create(specific_dir, recursive = TRUE)
    cat("Created directory:", specific_dir, "\n")
  }
}

Processing celltype: Oligo 


#### Create predictOffset function

In [9]:
predictOffset <- function(fit) {
  # Define which variables are factors and which are continuous
  usedFactors <- c("sequencingBatch", "Library", "study") 
  usedContinuous <- c("log_n_nuclei", "med_nucleosome_signal", "med_tss_enrich", "log_med_n_tot_fragment",
                      "log_total_unique_peaks", "pmi")
  
  # Filter to only use variables actually in the design matrix
  usedFactors <- usedFactors[sapply(usedFactors, function(f) any(grepl(paste0("^", f), colnames(fit$design))))]
  usedContinuous <- usedContinuous[sapply(usedContinuous, function(f) any(grepl(paste0("^", f), colnames(fit$design))))]
  
  # Get indices for factor and continuous variables
  facInd <- unlist(lapply(as.list(usedFactors), 
                         function(f) {return(grep(paste0("^", f), 
                                                colnames(fit$design)))}))
  contInd <- unlist(lapply(as.list(usedContinuous), 
                          function(f) {return(grep(paste0("^", f), 
                                                 colnames(fit$design)))}))
  
  # Add the intercept
  all_indices <- c(1, facInd, contInd)
  
  # Verify design matrix structure (using sorted indices to avoid duplication warning)
  all_indices_sorted <- sort(unique(all_indices))
  stopifnot(all(all_indices_sorted %in% 1:ncol(fit$design)))
  
  # Create new design matrix with median values
  D <- fit$design
  D[, facInd] <- 0  # Set all factor levels to reference level
  
  # For continuous variables, set to median value
  if (length(contInd) > 0) {
    medContVals <- apply(D[, contInd, drop=FALSE], 2, median)
    for (i in 1:length(medContVals)) {
      D[, names(medContVals)[i]] <- medContVals[i]
    }
  }
  
  # Calculate offsets
  stopifnot(all(colnames(coefficients(fit)) == colnames(D)))
  offsets <- apply(coefficients(fit), 1, function(c) {
    return(D %*% c)
  })
  offsets <- t(offsets)
  colnames(offsets) <- rownames(fit$design)
  
  return(offsets)
}
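
To make the `offset + residuals` idea concrete, here is a small base R sketch on simulated data for a single peak. It uses `lm()` instead of `voom()`/`lmFit()`, and the variables `batch`, `depth`, and `sex` are invented for illustration. The offset is the fitted value at reference covariates (factor columns at 0, continuous columns at their median), so adding it back to the residuals removes the modeled technical effects while leaving variation from unmodeled variables in place.

```r
set.seed(1)
n <- 60
batch <- factor(rep(c("b1", "b2"), each = n / 2))  # technical factor
depth <- rnorm(n, mean = 9, sd = 0.5)              # technical continuous covariate
sex   <- rep(c(0, 1), times = n / 2)               # biological variable, NOT in the model

# Simulated log-accessibility for one peak
y <- 5 + 1.5 * (batch == "b2") + 0.8 * depth + 1.0 * sex + rnorm(n, sd = 0.2)

# Technical-only model, analogous to the pipeline's design
fit <- lm(y ~ batch + depth)

# Reference design: factor columns at 0, continuous columns at their median
D <- model.matrix(fit)
D[, "batchb2"] <- 0
D[, "depth"]   <- median(D[, "depth"])

offset   <- as.vector(D %*% coef(fit))   # fitted value at the reference covariates
adjusted <- offset + residuals(fit)      # covariate-adjusted values

cat("Correlation with depth after adjustment:", cor(adjusted, depth), "\n")
cat("Sex difference after adjustment:",
    mean(adjusted[sex == 1]) - mean(adjusted[sex == 0]), "\n")
```

Because `sex` is left out of the model, its effect remains in `adjusted`, which is the sense in which the pipeline preserves biological variation.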



#### Load data

In [11]:
celltype <- "Oligo"
meta_path <- paste0(output_dir, "/1_files_with_sampleid/metadata_", celltype, ".csv")
peak_path <- paste0(output_dir, "/1_files_with_sampleid/pseudobulk_peaks_counts_", celltype, ".csv.gz")

# Blacklist and covariates are read from the source directory (input_dir)
blacklist_file <- file.path(input_dir, "data/hg38-blacklist.v2.bed.gz")
covariates_file <- file.path(input_dir, "data/rosmap_cov.txt")

# Load metadata
meta <- fread(meta_path)
cat("Loaded metadata with", nrow(meta), "samples\n")

# Filter samples with n_nuclei > 20
meta_filtered <- meta[n.nuclei > 20]
cat("Filtered to", nrow(meta_filtered), "samples with > 20 nuclei\n")

# Load peak data
peak_data <- fread(peak_path)
cat("Loaded peak data with", nrow(peak_data), "peaks\n")

# Extract peak_id and set as rownames
peak_id <- peak_data$peak_id
peak_data <- peak_data[, -1, with = FALSE]  # Remove peak_id column

# Filter peak data to keep only samples with >20 nuclei
valid_samples <- meta_filtered$sampleid
cat("Valid samples after nuclei filtering:", length(valid_samples), "\n")

# Find which valid samples actually exist in the peak data
available_samples <- intersect(valid_samples, colnames(peak_data))
cat("Valid samples present in peak data:", length(available_samples), "\n")

# Create filtered peak matrix
peak_data_filtered <- peak_data[, ..available_samples]
cat("Original peak data dimensions:", nrow(peak_data), "×", ncol(peak_data), "\n")
cat("Filtered peak data dimensions:", nrow(peak_data_filtered), "×", ncol(peak_data_filtered), "\n")

# Convert to matrix for downstream analysis
peak_matrix <- as.matrix(peak_data_filtered)
rownames(peak_matrix) <- peak_id

# Update metadata to match filtered samples
meta_filtered <- meta_filtered[sampleid %in% available_samples]
cat("Final metadata samples after filtering:", nrow(meta_filtered), "\n")

Loaded metadata with 93 samples
Filtered to 92 samples with > 20 nuclei
Loaded peak data with 363775 peaks
Valid samples after nuclei filtering: 92 
Valid samples present in peak data: 90 
Original peak data dimensions: 363775 × 92 
Filtered peak data dimensions: 363775 × 90 
Final metadata samples after filtering: 90 


#### Process technical variables from meta data

In [12]:
# Column name normalization (for easier handling)
meta_clean <- meta_filtered %>%
  rename(
    med_nucleosome_signal = med.nucleosome_signal.ct,
    med_tss_enrich = med.tss.enrich.ct,
    med_n_tot_fragment = med.n_tot_fragment.ct,
    n_nuclei = n.nuclei
  )

# Calculate peak metrics - number of peaks with at least one read per sample
peak_metrics <- data.frame(
  sampleid = colnames(peak_matrix),
  total_unique_peaks = colSums(peak_matrix > 0)
) %>%
  mutate(log_total_unique_peaks = log(total_unique_peaks + 1))

# Calculate median peak width for each sample using count as weight
calculate_median_peakwidth <- function(peak_matrix, peak_info) {
  # Create a data frame with peak widths
  peak_widths <- peak_info$end - peak_info$start
  
  # Initialize a vector to store median peak widths
  median_peak_widths <- numeric(ncol(peak_matrix))
  names(median_peak_widths) <- colnames(peak_matrix)
  
  # For each sample, calculate the weighted median peak width
  for (i in 1:ncol(peak_matrix)) {
    sample_counts <- peak_matrix[, i]
    # Only consider peaks with counts > 0
    idx <- which(sample_counts > 0)
    
    if (length(idx) > 0) {
      # Method 1: Use counts as weights
      weights <- sample_counts[idx]
      # Repeat each peak width by its count for weighted calculation
      all_widths <- rep(peak_widths[idx], times=weights)
      median_peak_widths[i] <- median(all_widths)
    } else {
      median_peak_widths[i] <- NA
    }
  }
  
  return(median_peak_widths)
}
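
Expanding every peak width by its read count with `rep()` can use a lot of memory for deeply sequenced samples. The sketch below computes the same count-weighted median from sorted values and cumulative weights instead. The helper name `weighted_median_counts()` is hypothetical and is not part of the pipeline.

```r
weighted_median_counts <- function(x, w) {
  # x: values (e.g. peak widths); w: non-negative integer weights (e.g. read counts)
  w <- as.numeric(w)                 # avoid integer overflow in cumsum
  keep <- w > 0
  x <- x[keep]
  w <- w[keep]
  if (length(x) == 0) return(NA_real_)

  ord <- order(x)
  x <- x[ord]
  cum_w <- cumsum(w[ord])            # position of the last copy of each value
  n <- cum_w[length(cum_w)]          # length of the (virtual) expanded vector

  # k-th element of the expanded, sorted vector
  value_at <- function(k) x[which(cum_w >= k)[1]]

  if (n %% 2 == 1) {
    value_at((n + 1) / 2)
  } else {
    (value_at(n / 2) + value_at(n / 2 + 1)) / 2
  }
}

# Toy comparison against the expanded-vector approach
x <- c(500, 300, 800, 300)
w <- c(3, 0, 2, 5)
cat("Weighted median:", weighted_median_counts(x, w), "\n")
cat("median(rep()):  ", median(rep(x, times = w)), "\n")
```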

#### Process peaks

In [13]:
# Process peak coordinates
peak_df <- data.table(
  peak_name = peak_id,
  chr = sapply(strsplit(peak_id, "-"), `[`, 1),
  start = as.integer(sapply(strsplit(peak_id, "-"), `[`, 2)),
  end = as.integer(sapply(strsplit(peak_id, "-"), `[`, 3)),
  stringsAsFactors = FALSE
)

# Verify peak coordinates were extracted correctly
cat("Sample of peak coordinates:\n")
print(head(peak_df))

if (file.exists(blacklist_file)) {
  blacklist_df <- fread(blacklist_file)
  if (ncol(blacklist_df) >= 4) {
    colnames(blacklist_df)[1:4] <- c("chr", "start", "end", "label")
  } else {
    colnames(blacklist_df)[1:3] <- c("chr", "start", "end")
  }
  
  # Filter blacklisted peaks
  setkey(blacklist_df, chr, start, end)
  setkey(peak_df, chr, start, end)
  overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0)
  blacklisted_peaks <- unique(overlapping_peaks$peak_name)
  cat("Number of blacklisted peaks:", length(blacklisted_peaks), "\n")
  
  filtered_peak_idx <- !(peak_id %in% blacklisted_peaks)
  filtered_peak <- peak_matrix[filtered_peak_idx, ]
  cat("Number of peaks after blacklist filtering:", nrow(filtered_peak), "\n")
} else {
  cat("Warning: Blacklist file not found at", blacklist_file, "\n")
  cat("Proceeding without blacklist filtering\n")
  filtered_peak <- peak_matrix
}

Sample of peak coordinates:
            peak_name    chr  start    end
               <char> <char>  <int>  <int>
1: chr1-817077-817577   chr1 817077 817577
2: chr1-827285-827785   chr1 827285 827785
3: chr1-850237-850737   chr1 850237 850737
4: chr1-869660-870160   chr1 869660 870160
5: chr1-903662-904162   chr1 903662 904162
6: chr1-904504-905004   chr1 904504 905004
Number of blacklisted peaks: 29 
Number of peaks after blacklist filtering: 363746 
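
The overlap test behind the blacklist filter can be illustrated without `data.table`. Two closed intervals `[s1, e1]` and `[s2, e2]` on the same chromosome overlap when `s1 <= e2` and `e1 >= s2`. The sketch below applies that rule in base R; the coordinates and the `overlaps_any()` helper are invented for illustration.

```r
# TRUE for each peak that overlaps at least one blacklist region
overlaps_any <- function(chr, start, end, bl_chr, bl_start, bl_end) {
  vapply(seq_along(chr), function(i) {
    any(bl_chr == chr[i] & start[i] <= bl_end & end[i] >= bl_start)
  }, logical(1))
}

# Hypothetical peaks and blacklist regions
peaks <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100L, 1000L, 100L),
  end   = c(600L, 1500L, 600L),
  stringsAsFactors = FALSE
)
blacklist <- data.frame(
  chr   = c("chr1", "chr2"),
  start = c(550L, 5000L),
  end   = c(900L, 6000L),
  stringsAsFactors = FALSE
)

peaks$blacklisted <- overlaps_any(peaks$chr, peaks$start, peaks$end,
                                  blacklist$chr, blacklist$start, blacklist$end)
print(peaks)
```

BED files use 0-based, half-open coordinates, so treating them as closed intervals can differ by one base at region boundaries. For 500 bp peaks this rarely changes which peaks are flagged, but it is worth keeping in mind.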


#### Load covariates

In [14]:
covariates_file <- file.path(input_dir,'data/rosmap_cov.txt')

if (file.exists(covariates_file)) {
  covariates <- fread(covariates_file)
  # Check column names and adjust if needed
  if ('#id' %in% colnames(covariates)) {
    id_col <- '#id'
  } else if ('individualID' %in% colnames(covariates)) {
    id_col <- 'individualID'
  } else {
    cat("Warning: Could not identify ID column in covariates file. Available columns:", 
        paste(colnames(covariates), collapse=", "), "\n")
    id_col <- colnames(covariates)[1]
    cat("Using", id_col, "as ID column\n")
  }
  
  # Select relevant columns - excluding msex and age_death
  cov_cols <- intersect(c(id_col, 'pmi', 'study'), colnames(covariates))
  covariates <- covariates[, ..cov_cols]
  
  # Merge with metadata
  meta_with_ind <- meta_clean %>%
    select(sampleid, everything())
  
  all_covs <- meta_with_ind %>%
    inner_join(peak_metrics, by = "sampleid") %>%
    inner_join(covariates, by = setNames(id_col, "sampleid"))
  
  # Impute missing values
  for (col in c("pmi")) {
    if (col %in% colnames(all_covs) && any(is.na(all_covs[[col]]))) {
      cat("Imputing missing values for", col, "\n")
      all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)
    }
  }
} else {
  cat("Warning: Covariates file", covariates_file, "not found.\n")
  cat("Proceeding with only technical variables.\n")
  all_covs <- meta_clean %>%
    inner_join(peak_metrics, by = "sampleid")
}


# Perform log transformations on necessary variables
# Add a small constant to avoid log(0)
epsilon <- 1e-6

all_covs$log_n_nuclei <- log(all_covs$n_nuclei + epsilon)
all_covs$log_med_n_tot_fragment <- log(all_covs$med_n_tot_fragment + epsilon)

# Show distribution of original and log-transformed variables
cat("\nVariable statistics before and after log transformation:\n")
for (var in c("n_nuclei", "med_n_tot_fragment")) {
  orig_var <- all_covs[[var]]
  log_var <- all_covs[[paste0("log_", var)]]
  
  cat(sprintf("%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\n", 
              var, min(orig_var), median(orig_var), max(orig_var), sd(orig_var)))
  cat(sprintf("log_%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\n", 
              var, min(log_var), median(log_var), max(log_var), sd(log_var)))
}

cat("Number of samples after joining:", nrow(all_covs), "\n")
cat("Sample IDs:", paste(head(all_covs$sampleid), collapse=", "), "...\n")
cat("Available covariates:", paste(colnames(all_covs), collapse=", "), "\n")


Variable statistics before and after log transformation:
n_nuclei: min=39.00, median=849.00, max=4394.00, SD=1080.03
log_n_nuclei: min=3.66, median=6.74, max=8.39, SD=1.05
med_n_tot_fragment: min=1308.50, median=7521.00, max=30629.00, SD=5373.50
log_med_n_tot_fragment: min=7.18, median=8.93, max=10.33, SD=0.69
Number of samples after joining: 83 
Sample IDs: SM-CTECR, SM-CJK5G, SM-CJEKQ, SM-CJGGY, SM-CJK3S, SM-CTEGU ...
Available covariates: sampleid, individualID, sequencingBatch, Library, Celltype4, n_nuclei, avg.pct.read.in.peak.ct, med_nucleosome_signal, med_n_tot_fragment, med_tss_enrich, total_unique_peaks, log_total_unique_peaks, pmi, study, log_n_nuclei, log_med_n_tot_fragment 


#### Create DGE object

In [15]:
valid_samples <- intersect(colnames(filtered_peak), all_covs$sampleid)
cat("Number of valid samples:", length(valid_samples), "\n")

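# Subset and order covariates so rows line up with the count-matrix columns
all_covs_filtered <- all_covs[match(valid_samples, all_covs$sampleid), ]
filtered_peak_filtered <- filtered_peak[, valid_samples]
stopifnot(identical(as.character(all_covs_filtered$sampleid), colnames(filtered_peak_filtered)))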

dge <- DGEList(
  counts = filtered_peak_filtered,
  samples = all_covs_filtered
)
rownames(dge$samples) <- dge$samples$sampleid

Number of valid samples: 83 


####  Filter low counts and normalize

In [18]:
cat("Number of peaks before filtering:", nrow(dge), "\n")
keep <- filterByExpr(dge, 
                   min.count = 5,     # for one sample, min reads 
                   min.total.count = 15, # min reads overall
                   min.prop = 0.1) 

dge <- dge[keep, , keep.lib.sizes=FALSE]
cat("Number of peaks after filtering:", nrow(dge), "\n") #66154 in OPC

# Save filtered raw count data
filtered_raw_counts <- dge$counts
write.table(filtered_raw_counts,
            file = paste0(output_dir, "/2_residuals/", celltype, "/", celltype, "_filtered_raw_counts.txt"), 
            quote=FALSE, sep="\t", row.names=TRUE, col.names=TRUE)
cat("Saved filtered raw counts to", paste0(output_dir, "/2_residuals/", celltype, "/", celltype, "_filtered_raw_counts.txt"), "\n")

dge <- calcNormFactors(dge, method="TMM")


Number of peaks before filtering: 176039 


“All samples appear to belong to the same group.”


Number of peaks after filtering: 176039 
Saved filtered raw counts to /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/Oligo_filtered_raw_counts.txt 
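
To show how the three `filterByExpr()` arguments interact, the sketch below re-implements the filtering rule in base R for the one-group case (the warning above indicates all samples were treated as one group). This is a simplified illustration based on the rule described in the edgeR help page. It is not edgeR's code, the function name `sketch_filter()` is hypothetical, and `filterByExpr()` itself should be used in the pipeline.

```r
sketch_filter <- function(counts, min.count = 5, min.total.count = 15,
                          min.prop = 0.1, large.n = 10) {
  lib_size  <- colSums(counts)
  n_samples <- ncol(counts)          # one group: all samples

  # Number of samples that must pass; relaxed by min.prop when the group is large
  min_n <- if (n_samples > large.n) {
    large.n + (n_samples - large.n) * min.prop
  } else {
    n_samples
  }

  # min.count is converted to a CPM cutoff using the median library size
  cpm_cutoff <- min.count / median(lib_size) * 1e6
  cpm <- t(t(counts) / lib_size) * 1e6

  keep_cpm   <- rowSums(cpm >= cpm_cutoff) >= min_n
  keep_total <- rowSums(counts) >= min.total.count
  keep_cpm & keep_total
}

# Toy matrix: 3 peaks x 20 samples
counts <- rbind(
  peak_a = rep(10, 20),                 # modest signal in every sample
  peak_b = c(rep(10, 5), rep(0, 15)),   # signal in only 5 samples
  peak_c = rep(1000, 20)                # strong signal everywhere
)
keep <- sketch_filter(counts)
print(keep)
```

Under this reading of the rule, 20 samples require at least 10 + 10 × 0.1 = 11 samples above the cutoff, so `peak_b` is dropped even though its total count exceeds `min.total.count`.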


#### Handle batch and library as technical variables

In [19]:
# We'll handle batch and library as technical variables rather than doing batch adjustment
cat("Handling sequencingBatch and Library as technical variables\n")

# Check batch information
batches <- dge$samples$sequencingBatch
cat("Found", length(unique(batches)), "unique sequencing batches\n")

# Check batch size
batch_counts <- table(batches)
cat("Batch sizes:\n")
print(batch_counts)

# Convert sequencingBatch to a factor (single-level placeholder if only one batch is present)
if (length(unique(batches)) < 2) {
  cat("Only one sequencing batch found. Adding dummy batch for model compatibility.\n")
  # Create a dummy batch factor to avoid model errors
  dge$samples$sequencingBatch_factor <- factor(rep("batch1", ncol(dge)))
} else {
  # Use the existing batch information
  dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)
}

# Check library information
libraries <- dge$samples$Library
cat("Found", length(unique(libraries)), "unique libraries\n")

# Check library size
library_counts <- table(libraries)
cat("Library sizes:\n")
print(library_counts)

# Convert Library to a factor (single-level placeholder if only one library is present)
if (length(unique(libraries)) < 2) {
  cat("Only one library found. Adding dummy library for model compatibility.\n")
  # Create a dummy library factor to avoid model errors
  dge$samples$Library_factor <- factor(rep("lib1", ncol(dge)))
} else {
  # Use the existing library information
  dge$samples$Library_factor <- factor(dge$samples$Library)
}

Handling sequencingBatch and Library as technical variables
Found 2 unique sequencing batches
Batch sizes:
batches
190820Kel 191203Kel 
        7        76 
Found 7 unique libraries
Library sizes:
libraries
Library10 Library11  Library2  Library4  Library5  Library7  Library9 
       26         6         7         6         7        23         8 
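
Because sequencing batch and library are both modeled as factors, it can be worth checking whether one is nested inside the other. Nesting is one common reason a design matrix ends up rank deficient. The sketch below uses a hypothetical toy sample sheet (not the real batch/library assignments) to show how a cross-tabulation reveals nesting and how that shows up as a rank lower than the number of columns.

```r
# Toy sample sheet (hypothetical): each library belongs to exactly one batch
toy <- data.frame(
  batch   = factor(c("B1", "B1", "B2", "B2", "B2", "B2")),
  library = factor(c("L1", "L1", "L2", "L2", "L3", "L3"))
)

# Cross-tabulate: a library row with a single non-zero cell is nested in one batch
xtab <- table(toy$library, toy$batch)
print(xtab)
nested <- all(rowSums(xtab > 0) == 1)

# Nested factors produce linearly dependent columns in the design matrix
design_toy <- model.matrix(~ batch + library, data = toy)
rank_toy <- qr(design_toy)$rank
cat("Nested:", nested, " Columns:", ncol(design_toy), " Rank:", rank_toy, "\n")
```

In this toy case the `batchB2` column equals the sum of the two library columns, so one column is redundant. Running the same cross-tabulation on `dge$samples` would show whether that applies to the real data.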


#### Create model and run voom

In [20]:
# Define the model based on available covariates - using log-transformed variables
# Removed msex and age_death from the model
if ("study" %in% colnames(dge$samples) && "pmi" %in% colnames(dge$samples)) {
  # Technical model with pmi and study
  cat("Using model with technical covariates plus pmi and study\n")
  model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment  +
    log_total_unique_peaks  + sequencingBatch_factor + Library_factor + pmi + study
} else if ("pmi" %in% colnames(dge$samples)) {
  # Technical model with pmi only
  cat("Using model with technical covariates and pmi\n")
  model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment  +
    log_total_unique_peaks  + sequencingBatch_factor + Library_factor + pmi
} else {
  # Technical variables only model
  cat("Using model with technical covariates only\n")
  model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment  +
    log_total_unique_peaks  + sequencingBatch_factor + Library_factor
}

# Print the model formula
cat("Model formula:", deparse(model), "\n")

# Warn about factor variables with only one level. They are left as factors
# (not converted to character) so the fallback below can detect and drop them.
for (col in colnames(dge$samples)) {
  if (is.factor(dge$samples[[col]]) && nlevels(dge$samples[[col]]) < 2) {
    cat("Warning: Factor variable", col, "has only one level.\n")
  }
}

# Create design matrix with error checking
tryCatch({
  design <- model.matrix(model, data=dge$samples)
  cat("Successfully created design matrix with", ncol(design), "columns\n")
}, error = function(e) {
  cat("Error in creating design matrix:", e$message, "\n")
  cat("Attempting to fix model formula...\n")
  
  # Check each term in the model
  all_terms <- all.vars(model)
  valid_terms <- character(0)
  
  for (term in all_terms) {
    if (term %in% colnames(dge$samples)) {
      # Check if it's a factor with at least 2 levels
      if (is.factor(dge$samples[[term]])) {
        if (nlevels(dge$samples[[term]]) >= 2) {
          valid_terms <- c(valid_terms, term)
        } else {
          cat("Skipping factor", term, "with only", nlevels(dge$samples[[term]]), "level\n")
        }
      } else {
        # Non-factor variables are fine
        valid_terms <- c(valid_terms, term)
      }
    } else {
      cat("Variable", term, "not found in sample data\n")
    }
  }
  
  # Create a simplified model with valid terms.
  # Use <<- so the updated model and design are visible outside this handler;
  # a plain <- would only assign inside the handler function.
  if (length(valid_terms) > 0) {
    model_str <- paste("~", paste(valid_terms, collapse = " + "))
    model <<- as.formula(model_str)
    cat("New model formula:", model_str, "\n")
    design <<- model.matrix(model, data=dge$samples)
    cat("Successfully created design matrix with", ncol(design), "columns\n")
  } else {
    stop("Could not create a valid model with the available variables")
  }
})

# Check if the design matrix is full rank
if (!is.fullrank(design)) {
  cat("Design matrix is not full rank. Adjusting...\n")
  # Keep a maximal set of linearly independent columns, in their original order
  qr_res <- qr(design)
  keep <- sort(qr_res$pivot[seq_len(qr_res$rank)])
  design <- design[, keep, drop = FALSE]
  cat("Adjusted design matrix columns:", ncol(design), "\n")
}

# Run voom and fit model
v <- voom(dge, design, plot=FALSE) #logCPM
fit <- lmFit(v, design)
fit <- eBayes(fit)

# Calculate offset and residuals
cat("Calculating offsets and residuals...\n")
offset <- predictOffset(fit)
resids <- residuals(fit, y=v)

# Verify that offset and residuals are aligned before combining them
stopifnot(all(dim(offset) == dim(resids)))
stopifnot(all(rownames(offset) == rownames(resids)),
          all(colnames(offset) == colnames(resids)))

# Final adjusted data: offset plus residuals
final_data <- offset + resids

Using model with technical covariates plus pmi and study
Model formula: ~log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +      log_total_unique_peaks + sequencingBatch_factor + Library_factor +      pmi + study 
Successfully created design matrix with 15 columns
Design matrix is not full rank. Adjusting...
Adjusted design matrix columns: 14 
Calculating offsets and residuals...
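
To make the `offset + resids` step more concrete, here is a minimal base-R sketch for a single peak. It assumes the offset is the part of the fit that is kept (taken here to be the intercept only), which may differ from what `predictOffset` retains in this pipeline. The covariate `tech` and response `y` are simulated, hypothetical values.

```r
set.seed(1)
n <- 50
tech <- rnorm(n)                         # hypothetical technical covariate
y <- 5 + 2 * tech + rnorm(n, sd = 0.5)   # hypothetical log2-CPM for one peak

fit_toy <- lm(y ~ tech)

# Assumed retained component: the intercept (baseline accessibility)
offset_toy <- coef(fit_toy)[["(Intercept)"]]

# Adjusted signal = retained component + what the covariates could not explain
adjusted <- offset_toy + residuals(fit_toy)

cat("cor(y, tech):       ", round(cor(y, tech), 3), "\n")
cat("cor(adjusted, tech):", round(cor(adjusted, tech), 3), "\n")
```

The adjusted values stay on the original scale (their mean equals the intercept) while the linear dependence on the technical covariate is removed.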


#### Save results

In [22]:
# Save results
saveRDS(list(
  dge = dge,
  offset = offset,
  residuals = resids,
  final_data = final_data,
  valid_samples = colnames(dge),
  design = design,
  fit = fit,
  model = model
), file = paste0(output_dir, "/2_residuals/", celltype, "/", celltype, "_results.rds"))

# Write final residual data to file
write.table(final_data,
            file = paste0(output_dir, "/2_residuals/", celltype,  "/", celltype, "_residuals.txt"), 
            quote=FALSE, sep="\t", row.names=TRUE, col.names=TRUE)

# Write summary statistics
sink(file = paste0(output_dir, "/2_residuals/", celltype,  "/", celltype, "_summary.txt"))
cat("*** Processing Summary for", celltype, "***\n\n")
cat("Original peak count:", length(peak_id), "\n")
cat("Peaks after blacklist filtering:", nrow(filtered_peak), "\n")
cat("Peaks after expression filtering:", nrow(dge), "\n\n")
cat("Number of samples:", ncol(dge), "\n")
cat("Number of samples after nuclei (>20) filtering:", ncol(peak_matrix), "\n")
cat("\nTechnical Variables Used:\n")
cat("- log_n_nuclei: Log-transformed number of nuclei per sample\n")
cat("- med_nucleosome_signal: Median nucleosome signal\n")
cat("- med_tss_enrich: Median TSS enrichment\n")
cat("- log_med_n_tot_fragment: Log-transformed median number of total fragments\n")
cat("- log_total_unique_peaks: Log-transformed count of unique peaks per sample\n")
cat("- sequencingBatch_factor: Sequencing batch ID\n")
cat("- Library_factor: Library ID\n")
cat("\nOther Variables Used:\n")
cat("- pmi: Post-mortem interval\n")
cat("- study: Study cohort\n")
sink()

# Write an additional explanation file about the variables and log transformation
sink(file = paste0(output_dir, "/2_residuals/", celltype,  "/", celltype, "_variable_explanation.txt"))
cat("# ATAC-seq Technical Variables Explanation\n\n")


cat("## Why Log Transformation?\n")
cat("Log transformation is applied to certain variables for several reasons:\n")
cat("1. To make the distribution more symmetric and closer to normal\n")
cat("2. To stabilize variance across the range of values\n")
cat("3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\n")
cat("4. To be consistent with the approach used in related studies like haQTL\n\n")

cat("## Variables and Their Meanings\n\n")

cat("### Technical Variables\n")
cat("- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\n")
cat("  * Filtered to include only samples with >20 nuclei\n")
cat("  * Log-transformed because count data typically has a right-skewed distribution\n\n")

cat("- med_n_tot_fragment: Median number of total fragments per cell\n")
cat("  * Represents sequencing depth\n")
cat("  * Log-transformed because sequencing depth typically has exponential effects\n\n")

cat("- total_unique_peaks: Number of unique peaks detected in each sample\n")
cat("  * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\n\n")

cat("- med_nucleosome_signal: Median nucleosome signal\n")
cat("  * Measures the degree of nucleosome positioning\n")
cat("  * Not log-transformed as it's already a ratio/normalized metric\n\n")

cat("- med_tss_enrich: Median transcription start site enrichment score\n")
cat("  * Indicates the quality of the ATAC-seq data\n")
cat("  * Not log-transformed as it's already a ratio/normalized metric\n\n")


cat("- sequencingBatch: Batch ID for the sequencing run\n")
cat("  * Treated as a factor to account for batch effects\n\n")

cat("- Library: Library preparation batch ID\n")
cat("  * Treated as a factor to account for library preparation effects\n\n")

cat("### Other Variables\n")
cat("- pmi: Post-mortem interval (time between death and tissue collection)\n")
cat("- study: Study cohort (ROSMAP, MAP, ROS)\n\n")

cat("## Relationship to voom Transformation\n")
cat("The voom transformation converts count data to log2-CPM (counts per million) values ")
cat("and estimates the mean-variance relationship. By log-transforming certain technical ")
cat("covariates, we ensure they're on a similar scale to the transformed accessibility data, ")
cat("which can improve the fit of the linear model used for removing unwanted variation.\n")
sink()

cat("Processing completed. Results and documentation saved to:", paste0(output_dir, "/2_residuals/", celltype,  "/"), "\n")

Processing completed. Results and documentation saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/ 
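
A saved results file can be sanity-checked after reloading it. The sketch below is only an illustration: it builds a small hypothetical list with the same element names (`offset`, `residuals`, `final_data`), writes it to a temporary file, and confirms the components are still consistent after `readRDS`.

```r
# Hypothetical stand-in for the saved list (real objects come from the cells above)
dn <- list(c("p1", "p2"), c("S1", "S2", "S3"))
toy_results <- list(
  offset    = matrix(5, nrow = 2, ncol = 3, dimnames = dn),
  residuals = matrix(c(0.1, -0.2, 0.3, 0, -0.1, 0.2), nrow = 2, dimnames = dn)
)
toy_results$final_data <- toy_results$offset + toy_results$residuals

# Round trip through an RDS file in a temporary location
rds_path <- tempfile(fileext = ".rds")
saveRDS(toy_results, rds_path)
res <- readRDS(rds_path)

# final_data should equal offset + residuals, with matching dimensions
ok <- isTRUE(all.equal(res$final_data, res$offset + res$residuals))
cat("final_data consistent with offset + residuals:", ok, "\n")
```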


## Step 2: Phenotype Reformat
**Purpose:** Converts covariate-adjusted residuals from Step 1 into genome-wide BED format suitable for QTL mapping tools (FastQTL, TensorQTL, MatrixEQTL).

---

#### Input:

**From Step 1 (required):**
- `{celltype}_residuals.txt` (in `output/2_residuals/{celltype}/`)

**Cell Types:**
- `Mic` (Microglia)
- `Astro` (Astrocytes)
- `Oligo` (Oligodendrocytes)
- `Ex` (Excitatory neurons)
- `In` (Inhibitory neurons)
- `OPC` (Oligodendrocyte precursor cells)


#### Process:

1. Set cell type and paths
2. Load residuals file
3. Extract and parse peak IDs
4. Convert to midpoint coordinates
5. Create BED format
6. Sort by genomic position
7. Write BED file
8. Compress with bgzip

#### Output:
Output Directory: `output/3_phenotype_reformatting/{celltype}/`

Output File: `{celltype}_kellis_xiong_snatac_phenotype.bed.gz`

#### Load libraries

In [23]:
library(data.table)
library(stringr)

In [25]:
#!/usr/bin/env Rscript

# Script to reformat ATAC-seq residuals into BED format and compress with bgzip
# Usage: Rscript reformat_residuals.R [celltype]

# Get command line arguments
#args <- commandArgs(trailingOnly = TRUE)
#if (length(args) < 1) {
#  celltype <- "Ex"  # Default cell type
#  cat("No cell type specified, using default:", celltype, "\n")
#} else {
#  celltype <- args[1]
#  cat("Processing cell type:", celltype, "\n")
#}

# Define input and output paths
#input_dir <- "/home/al4225/project/kellis_snatac/output/xiong/2_residuals"
#output_dir <- "/home/al4225/project/kellis_snatac/output/3_phenotype_processing"
pheno_reformat_output_dir <- paste0(output_dir, "/3_phenotype_reformatting/", celltype)

# Create output directory if it doesn't exist
dir.create(pheno_reformat_output_dir, recursive = TRUE, showWarnings = FALSE)

# Check if input directory exists
celltype_dir <- paste0(output_dir,"/2_residuals/", celltype)
if (!dir.exists(celltype_dir)) {
  cat("Cell type directory not found:", celltype_dir, "\n")
  cat("Using backup directory...\n")
  celltype_dir <- file.path(output_dir,paste0("2_residuals/backup/", celltype))
  if (!dir.exists(celltype_dir)) {
    stop("Backup directory not found either: ", celltype_dir)
  }
}

input_file <- file.path(celltype_dir, paste0(celltype, "_residuals.txt"))
output_bed <- file.path(pheno_reformat_output_dir, paste0(celltype, "_kellis_xiong_snatac_phenotype.bed"))

# Check if input file exists
if (!file.exists(input_file)) {
  stop("Input file not found: ", input_file)
}

# Read the first line manually to get the column names
first_line <- readLines(input_file, n = 1)
col_names <- unlist(strsplit(first_line, split = "\t"))
cat("Column names from first line:", paste(head(col_names), collapse = ", "), "...\n")

Column names from first line: SM-CTECR, SM-CJK5G, SM-CJEKQ, SM-CJGGY, SM-CJK3S, SM-CTEGU ...


#### Load input

In [26]:
cat("Reading residuals file:", input_file, "\n")
first_line <- readLines(input_file, n = 1)
col_names  <- unlist(strsplit(first_line, split = "\t"))

residuals <- fread(input_file, header = FALSE, skip = 1)

# write.table(row.names = TRUE) writes a header with one fewer field than the
# data rows, so the first data column holds the peak IDs in either layout
peak_ids  <- residuals[[1]]
residuals <- residuals[, -1, with = FALSE]
if (ncol(residuals) == length(col_names)) {
  setnames(residuals, col_names)
} else {
  setnames(residuals, col_names[-1]) # Header has a leading empty/ID field
}

Reading residuals file: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/Oligo_residuals.txt 
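
The header handling above exists because `write.table(..., row.names = TRUE, col.names = TRUE)` writes a header line with one fewer field than the data rows. The sketch below shows this on a toy matrix with hypothetical peak and sample IDs, and also shows that base R's `read.table` detects this layout and uses the first column as row names.

```r
# Toy residual matrix (hypothetical peak and sample IDs)
toy <- matrix(c(1.5, -0.2, 0.3, 2.1), nrow = 2,
              dimnames = list(c("chr1-100-600", "chr2-50-450"),
                              c("SM-AAAAA", "SM-BBBBB")))
tmp <- tempfile(fileext = ".txt")
write.table(toy, file = tmp, quote = FALSE, sep = "\t",
            row.names = TRUE, col.names = TRUE)

# Header has 2 fields (samples only); data rows have 3 (peak ID + samples)
lines <- readLines(tmp, n = 2)
header_fields <- length(strsplit(lines[1], "\t")[[1]])
row_fields    <- length(strsplit(lines[2], "\t")[[1]])
cat("Header fields:", header_fields, " Data row fields:", row_fields, "\n")

# read.table uses the first column as row names when the header is one field short;
# check.names = FALSE keeps the hyphen in sample IDs such as "SM-AAAAA"
back <- as.matrix(read.table(tmp, header = TRUE, sep = "\t", check.names = FALSE))
```

Without `check.names = FALSE`, the hyphenated sample IDs would be rewritten with dots, which would break matching against genotype sample IDs later.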


#### Coordinate Parsing (BED format)

In [27]:
cat("Parsing peak IDs into BED format with midpoint coordinates\n")
parts <- strsplit(peak_ids, "-")
chrs  <- sapply(parts, `[`, 1)
starts_raw <- as.numeric(sapply(parts, `[`, 2))
ends_raw   <- as.numeric(sapply(parts, `[`, 3))

# Calculate midpoints for a 1bp window (Standard for QTLtools)
# This centers the peak signal on a single genomic coordinate
mids <- as.integer((starts_raw + ends_raw) / 2)

parsed_peaks <- data.table(
  '#chr' = chrs,
  start = mids,
  end   = mids + 1,
  ID    = peak_ids
)

# Combine and sort by chromosome, then by position
bed_data <- cbind(parsed_peaks, residuals)
setorderv(bed_data, c("#chr", "start"))


Parsing peak IDs into BED format with midpoint coordinates
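
The parsing and midpoint logic can be tried on a few peak IDs in isolation. This sketch uses made-up IDs in the `chr-start-end` form and an anchored regular expression instead of `strsplit`, so that a malformed ID is caught early. The variable names and the regex are illustrative and not part of the pipeline.

```r
peak_ids <- c("chr1-1000-1500", "chr10-200-701", "chr2-50-450")  # hypothetical IDs

# Anchored pattern: the last two hyphen-separated fields must be integers
pattern <- "^(.+)-([0-9]+)-([0-9]+)$"
stopifnot(all(grepl(pattern, peak_ids)))

chr   <- sub(pattern, "\\1", peak_ids)
start <- as.integer(sub(pattern, "\\2", peak_ids))
end   <- as.integer(sub(pattern, "\\3", peak_ids))

# 1 bp window at the peak midpoint (integer division rounds down)
mid <- (start + end) %/% 2L
bed <- data.frame(chr = chr, start = mid, end = mid + 1L, ID = peak_ids,
                  stringsAsFactors = FALSE)
names(bed)[1] <- "#chr"

# Lexicographic chromosome order, then position
bed <- bed[order(bed[["#chr"]], bed$start), ]
print(bed)
```

Note that lexicographic ordering places `chr10` before `chr2`. Tabix indexing generally only needs each chromosome's records to be contiguous and position-sorted, but some downstream tools may expect a specific chromosome order.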


#### Save and compress

In [28]:
cat("Writing BED file to:", output_bed, "\n")
fwrite(bed_data, output_bed, sep = "\t", col.names = TRUE, quote = FALSE)

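cat("Compressing with bgzip...\n")
# Quote paths for the shell and stop if compression fails
status <- system(paste("bgzip -f", shQuote(output_bed)))
if (status != 0) stop("bgzip failed for: ", output_bed)

# Highly recommended: index with tabix (-f overwrites a stale index)
status <- system(paste("tabix -f -p bed", shQuote(paste0(output_bed, ".gz"))))
if (status != 0) warning("tabix indexing failed for: ", output_bed, ".gz")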

cat("Process completed for", celltype, "\n")

Writing BED file to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/3_phenotype_reformatting/Oligo/Oligo_kellis_xiong_snatac_phenotype.bed 
Compressing with bgzip...
Process completed for Oligo 
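
Before handing the phenotype file to a QTL tool, a few structural checks can catch sorting or coordinate problems. The helper below, `check_bed`, is a hypothetical sketch (not part of the pipeline) that works on a data frame with `#chr`, `start` and `end` columns, such as the phenotype file read back into R.

```r
# Hypothetical helper: structural checks on a BED-like data frame
check_bed <- function(bed) {
  # Every record should be a 1 bp window
  one_bp <- all(bed$end - bed$start == 1L)
  # Chromosomes must form contiguous blocks: each name appears in exactly one run
  runs <- rle(as.character(bed[["#chr"]]))
  contiguous <- !any(duplicated(runs$values))
  # Within each chromosome, start positions must be non-decreasing
  sorted <- all(tapply(bed$start, bed[["#chr"]], function(s) !is.unsorted(s)))
  c(one_bp = one_bp, contiguous = contiguous, sorted = sorted)
}

good <- data.frame(chr = c("chr1", "chr1", "chr2"),
                   start = c(10L, 50L, 5L), end = c(11L, 51L, 6L))
names(good)[1] <- "#chr"
bad <- good[c(2, 1, 3), ]   # swap the first two rows to break the sort order

print(check_bed(good))
print(check_bed(bad))
```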
