# 🎉 Welcome to Building Your Reference Panel! 🎉

---

###  **Let’s get started!** Building a reference panel is an exciting and crucial step in your analysis. Here’s a roadmap to guide you through the process:

---

## **Locate Your Open Access Data**  
The first step is to identify where your open-access data currently resides.  
- **What does this mean?**  
  - You’ll need to find the links or sources for the datasets you plan to use.  


**Pro tip:**  
- Keep track of these links in a well-organized document or spreadsheet. You'll thank yourself later!

---

### 🎯 **You’re off to a great start!**  
Finding and organizing your data sources is the foundation of building a robust reference panel. Let’s dive in!  

In [None]:
#Download your data and take note of where it lives!
#The leading "!" tells Jupyter to run the line as a shell command.
!wget -nc LINK TO YOUR DATA
!wget -nc ....

# We want to have all our data in the same file format. For the processes we are going to run together, PLINK's BED format is going to be the best fit.

# Understanding PLINK Binary Files (BED/BIM/FAM)

You are going to see three files with the same prefix, and they are all important in their own way:

## 1. BED (Binary PED file)
- A compact **binary** format for storing genotype data (you will not be able to read this one in a text editor).
- Each genotype is encoded using **two bits per individual** (homozygous for allele 1, heterozygous, homozygous for allele 2, or missing).
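
To make the two-bit encoding concrete, here is a minimal sketch that decodes a single byte of a variant-major `.bed` file into four genotype calls. The helper name `decode_bed_byte` and the label strings are made up for illustration. The bit-to-genotype mapping follows the PLINK 1 binary format, where the lowest two bits of each byte belong to the first sample stored in it.

```python
# Two-bit genotype codes used by PLINK 1 .bed files (variant-major mode).
GENOTYPE_CODES = {
    0b00: "hom_A1",   # homozygous for allele 1 in the .bim file
    0b01: "missing",
    0b10: "het",
    0b11: "hom_A2",   # homozygous for allele 2 in the .bim file
}

def decode_bed_byte(byte, n_samples=4):
    """Decode one .bed byte into up to four genotype labels (illustrative helper)."""
    # Samples are packed from the low-order bits upward.
    return [GENOTYPE_CODES[(byte >> (2 * i)) & 0b11] for i in range(n_samples)]

print(decode_bed_byte(0b11100100))
```

In practice you never need to do this by hand, since PLINK reads the file for you, but it shows why the BED file is so much smaller than a text genotype table.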

## 2. BIM (Binary Marker Information file)
- A **map file** describing **SNPs** in your dataset.
- Contains six columns:
  1. **Chromosome**
  2. **SNP ID**
  3. **Genetic distance** (often set to 0)
  4. **Physical position** (The location on the genome)
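  5. **Allele 1** (usually the minor allele by PLINK default; later in this tutorial we force it to be the reference allele)
  6. **Allele 2** (usually the major allele by PLINK default; the alternate allele once Allele 1 has been forced to the reference)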

## 3. FAM (Family Information file)
- A **sample metadata file** describing individuals.
- Contains six columns:
  1. **Family ID**
  2. **Individual ID**
  3. **Paternal ID** (0 if unknown)
  4. **Maternal ID** (0 if unknown)
  5. **Sex** (1 = male, 2 = female, 0 = unknown)
  6. **Phenotype** (1 = unaffected, 2 = affected, -9 or 0 = missing)

## What These Files Tell You
- The **BED file** tells you the **genotypes** of individuals.
- The **BIM file** tells you **which SNPs** those genotypes correspond to.
- The **FAM file** tells you **who the individuals are** and their metadata.
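
If you would like to peek inside the two text files, the sketch below parses one line of each into named fields. The record and function names are hypothetical, and the example lines (variant ID, position, sample IDs) are invented purely for illustration.

```python
from collections import namedtuple

# Field names follow the six-column layouts described above.
BimRecord = namedtuple("BimRecord", "chrom snp_id cm pos allele1 allele2")
FamRecord = namedtuple("FamRecord", "fid iid father mother sex phenotype")

def parse_bim_line(line):
    """Split one whitespace-delimited .bim line into typed fields."""
    chrom, snp_id, cm, pos, a1, a2 = line.split()
    return BimRecord(chrom, snp_id, float(cm), int(pos), a1, a2)

def parse_fam_line(line):
    """Split one whitespace-delimited .fam line into typed fields."""
    fid, iid, father, mother, sex, pheno = line.split()
    return FamRecord(fid, iid, father, mother, int(sex), pheno)

# Made-up example lines
print(parse_bim_line("1\trs123\t0\t752566\tA\tG"))
print(parse_fam_line("FAM1 IND1 0 0 2 -9"))
```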

# If you do not have PLINK preinstalled, be sure to download it! https://www.cog-genomics.org/plink/



In [4]:
from collections import defaultdict
import pandas as pd
from IPython.display import display
import os
import glob
import sys
import pysam
import subprocess
from tabulate import tabulate

#If you are missing one of these packages, you can install it from your command line with 'pip install ...'!

In [5]:
#Set up plink to work in jupyter notebook (Compute Canada)
!module load StdEnv/2020 && module load plink/1.9b_6.21-x86_64 && which plink

In [6]:
#Copy the output from above into this next command -- or just the absolute path to your downloaded plink
plink_path = 'path/to/plink/command'

Only run the next cell if there is inconsistency in your file formats. 

In [None]:

# VCF to PLINK BED
subprocess.run([plink_path, "--vcf", "YOUR_FILE", "--make-bed", "--real-ref-alleles", "--out", "NAME_YOU_WANT"], check=True)

# EIGENSOFT files to PLINK BED
!git clone https://github.com/roberta-davidson/ADMIXTURE-smartPCA-PLINK-and-EIGENSOFT.git
!ADMIXTURE-smartPCA-PLINK-and-EIGENSOFT/CONVERTF_EIG_to_PLINK.sh YOUR FILE PREFIX
!rm -r ADMIXTURE-smartPCA-PLINK-and-EIGENSOFT

# If your data is provided as a matrix text file (vintage!), please refer to matrix_to_vcf.py and matrix_to_vcf.sh

# 🚀 LIFTOVER!  

## What is a Genome Build?  
A **genome build** is a **reference assembly** of a species' genome, providing a **standardized coordinate system** for genetic variants. Each build is an improved version of previous assemblies, correcting errors, adding missing sequences, and improving accuracy.  

### Human Genome Builds:  
- **hg18 (NCBI36)** – Released in **2006**  
- **hg19 (GRCh37)** – Released in **2009**  
- **hg38 (GRCh38)** – Released in **2013** (**Current Build**)  

---

## Why LiftOver?  
You wouldn’t use an outdated **operating system** on your computer, right?  
Similarly, we want to **update** our genomic data to the latest **genome build** to ensure accuracy.  

This process is called **"lifting over"** because we lift the coordinates of our old data over to the newest build.  

### Failing to lift over can result in:  
❌ **Mismatched variant positions**  
❌ **Incorrect gene annotations**  
❌ **Data incompatibility** with current research tools  

---

## Let's Gather Our Tools!  

 

In [None]:

# Chain files (old build -> hg38) and the hg38 reference FASTA
!wget -nc https://hgdownload.soe.ucsc.edu/gbdb/hg19/liftOver/hg19ToHg38.over.chain.gz
!wget -nc https://hgdownload.soe.ucsc.edu/gbdb/hg18/liftOver/hg18ToHg38.over.chain.gz
!wget -nc https://hgdownload.soe.ucsc.edu/goldenPath/currentGenomes/Homo_sapiens/bigZips/hg38.fa.gz
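
One thing to watch for: the alignment step later in this notebook opens the FASTA with `pysam.FastaFile`, which, as far as I know, wants either an uncompressed FASTA or a bgzip-compressed one. If your download turns out to be plain gzip and pysam refuses to open it, a minimal workaround is to decompress it first. The helper name `gunzip_file` and the output file name are assumptions for illustration.

```python
import gzip
import shutil

def gunzip_file(gz_path, out_path):
    """Decompress a gzip file to out_path, streaming so large files do not fill memory."""
    with gzip.open(gz_path, "rb") as f_in, open(out_path, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return out_path

# Example (placeholder paths, uncomment to run):
# gunzip_file("hg38.fa.gz", "hg38.fa")
```

If you go this route, remember to point the later steps at the decompressed file instead of the `.gz` one.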


# There are a lot of different tools that can perform a liftover (so many choices!). This tutorial uses CrossMap (you can find the documentation here: https://crossmap.readthedocs.io/en/latest/). But go ahead and pick another tool and switch the code up if you would like; this is your journey!

In [None]:
%pip install git+https://github.com/liguowang/CrossMap.git
#Copy the path where this was downloaded below


In [2]:
CrossMap_path = '/path/to/CrossMap.py'

Now let's set up a class in Python to perform our initial liftover, and then we will reconvene!

In [3]:

# Shared QC table to track all studies
shared_qc_table = []

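def count_variants(study_name, bim_file, fam_file, step_name):
    """
    Count variants per chromosome class and individuals per sex, then append
    the counts as a new row of the shared QC table.

    Args: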
        study_name : Name of the study (e.g., "Study1").
        bim_file : Path to the BIM file.
        fam_file : Path to the FAM file.
        step_name : Name of the step (e.g., "Start", "After Class1", "After Class2").
    
    Returns:
        dict: A dictionary containing counts for autosomal, X, Y, MT variants,
              total individuals, males, females, and ambiguous individuals.
    """
    autosomal = 0
    x_chr = 0
    y_chr = 0
    mt_chr = 0

    
    with open(bim_file, 'r') as f:
        for line in f:
            parts = line.strip().split()
            chrom = parts[0]
            if chrom.startswith("chr"):
                chrom_clean = chrom.replace("chr", "")
            else:
                chrom_clean = chrom
            if chrom_clean in ['X', '23', '25']:
                x_chr += 1
            elif chrom_clean in ['Y', '24']:
                y_chr += 1
            elif chrom_clean in ['MT', 'M', '26']:
                mt_chr += 1
            elif chrom_clean.isdigit():
                if 1 <= int(chrom_clean) <= 22:
                    autosomal += 1

    
    individuals = 0
    males = 0
    females = 0
    ambiguous = 0
    
    with open(fam_file, 'r') as f:
        for line in f:
            parts = line.strip().split()
            sex_code = int(parts[4])
            if sex_code == 1:
                males += 1
            elif sex_code == 2:
                females += 1
            elif sex_code == 0:
                ambiguous += 1
            individuals += 1
    
    shared_qc_table.append([
        study_name,
        step_name,
        autosomal,
        x_chr,
        y_chr,
        mt_chr,
        individuals,
        males,
        females,
        ambiguous,
    ])
    
    return {
        "autosomal": autosomal,
        "x_chr": x_chr,
        "y_chr": y_chr,
        "mt_chr": mt_chr,
        "individuals": individuals,
        "males": males,
        "females": females,
        "ambiguous": ambiguous,
    }
# Column headers used when printing shared_qc_table with tabulate
headers = [
    "Study Name", "Step Name", "Autosomal", "X Chr", "Y Chr", "MT Chr",
    "Individuals", "Males", "Females", "Ambiguous"
]

def save_qc_table(filename="QC_results.txt"):
    """
    Saves the shared QC table to a text file.

    Args:
        filename : Name of the output file.
    """
    with open(filename, "w") as f:
        # Write the header
        f.write("Study\tStep\tAutosomal\tX_Chr\tY_Chr\tMT_Chr\tIndividuals\tMales\tFemales\tAmbiguous\n")
        
        # Write each row of data
        for row in shared_qc_table:
            f.write("\t".join(map(str, row)) + "\n")
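
Since every step appends a row to `shared_qc_table`, a natural follow-up is to ask how much each step actually removed. Here is a minimal sketch of that idea. The helper `qc_deltas` is hypothetical, it assumes the row layout written by `count_variants`, and the example rows are made up.

```python
def qc_deltas(rows, study_name):
    """Return (step, autosomal_lost, individuals_lost) relative to the previous step."""
    study_rows = [r for r in rows if r[0] == study_name]
    deltas = []
    for prev, curr in zip(study_rows, study_rows[1:]):
        # Row layout: study, step, autosomal, X, Y, MT, individuals, males, females, ambiguous
        deltas.append((curr[1], prev[2] - curr[2], prev[6] - curr[6]))
    return deltas

# Made-up rows for illustration only
example_rows = [
    ["Study1", "Start", 1000, 20, 5, 1, 50, 25, 25, 0],
    ["Study1", "Sex Mismatch", 1000, 20, 5, 1, 48, 24, 24, 0],
    ["Study1", "After liftover", 990, 19, 5, 1, 48, 24, 24, 0],
]
print(qc_deltas(example_rows, "Study1"))
```

A large drop at a single step is usually the first hint that something needs a closer look.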

In [4]:


class LiftoverProcessor:
    def __init__(self, study_name, base_name, chain_file, out_dir):
        """
        Initialize the LiftoverProcessor class.

        Args:
            study_name : Name of the study.
            base_name : Base name of the input files (without extensions).
            chain_file : Path to the chain file for liftover.
            out_dir : Directory where all output files will be saved.
        """
        self.study_name = study_name
        self.base_name = base_name
        self.chain_file = chain_file
        self.out_dir = out_dir
        # Create the output directory if it doesn't exist
        os.makedirs(self.out_dir, exist_ok=True)
        # Initialize current BIM and FAM files
        self.current_base = base_name  # Use the full path provided in base_name
        self.current_bim = f"{self.current_base}.bim"
        self.current_fam = f"{self.current_base}.fam"
        self.intermediate_files = []
        # Initialize QC table as a list of rows
        self.qc_table = [["Step", "Autosomal", "X Chr", "Y Chr", "MT Chr", "Individuals", "Males", "Females", "Ambiguous"]]
        count_variants(self.study_name, self.current_bim, self.current_fam, "Start")

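    def check_sex(self, ycount_threshold=(0.3, 0.7)):
        """Check for sex mismatches and remove problematic samples.

        Despite its name, ycount_threshold is passed straight to PLINK's
        --check-sex, which reads the two values as the maximum F estimate for a
        female call and the minimum F estimate for a male call.
        """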
        sexcheck_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_sexcheck")
        remove_list = []

        # Check if there are sex chromosomes (X or Y) in the dataset
        has_sex_chromosomes = False
        has_polymorphic_x = False

        # Read the BIM file to check for sex chromosomes and polymorphic X loci
        with open(f"{self.current_base}.bim", 'r') as bim_file:
            for line in bim_file:
                parts = line.strip().split()
                chrom = parts[0]
                if chrom in ['23', 'X', '24', 'Y', '25', 'XY']:  # Check for X, Y, or pseudo-autosomal (XY) codes
                    has_sex_chromosomes = True
                if chrom in ['23', 'X'] and parts[4] != '0':  # Check for polymorphic X loci
                    has_polymorphic_x = True

        # If no sex chromosomes or no polymorphic X loci, skip sex check
        if not has_sex_chromosomes or not has_polymorphic_x:
            print("No sex chromosomes or no polymorphic X loci detected. Skipping sex check.")
            return

        # Run PLINK sex check
        subprocess.run([plink_path, '--bfile', self.current_base, '--check-sex', str(ycount_threshold[0]), str(ycount_threshold[1]), '--out', sexcheck_base], check=True)

        # Identify problematic samples
        with open(f"{sexcheck_base}.sexcheck", 'r') as f:
            next(f)  # Skip header
            for line in f:
                if 'PROBLEM' in line:
                    parts = line.strip().split()
                    remove_list.append(f"{parts[0]}\t{parts[1]}")

        # Print problematic samples
        print("Samples with sex discrepancies:")
        for sample in remove_list:
            print(sample)

        # Remove problematic samples using PLINK
        new_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_good_sex")
        remove_file = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_remove_badsex.txt")
        with open(remove_file, 'w') as f:
            f.write("\n".join(remove_list))

        subprocess.run([plink_path, '--bfile', self.current_base, '--remove', remove_file, '--merge-x', 'no-fail', '--make-bed', '--output-chr', 'MT', '--out', new_base], check=True)

        # Clean up intermediate files
        self.intermediate_files.extend([remove_file])


        # Update current BIM/FAM and count
        self.current_base = new_base
        self.current_bim = f"{new_base}.bim"
        self.current_fam = f"{new_base}.fam"
        count_variants(self.study_name, self.current_bim, self.current_fam, "Sex Mismatch")

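    def create_bed_file(self):
        """Create a BED file (0-based, half-open intervals) for liftover."""
        # Translate PLINK chromosome codes to the UCSC names used in chain files
        ucsc_names = {'23': 'X', '24': 'Y', '25': 'X', 'XY': 'X', '26': 'M', 'MT': 'M'}
        bed_file = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_bedfile")
        with open(self.current_bim, 'r') as f_in, open(bed_file, 'w') as f_out:
            for line in f_in:
                parts = line.strip().split()
                chrom = parts[0][3:] if parts[0].startswith('chr') else parts[0]
                pos = int(parts[3])
                var_id = parts[1]
                chrom_clean = f"chr{ucsc_names.get(chrom, chrom)}"
                # BIM positions are 1-based, so the BED interval is [pos-1, pos)
                f_out.write(f"{chrom_clean}\t{pos-1}\t{pos}\t{var_id}\n")

        self.bedfile_path = bed_file

    def run_liftover(self):
        """Lift the BED file over with CrossMap and build the list of variants to exclude."""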
        # Define output files
        map_file = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_mapfile")
        unmapped_file = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_unmapped")
        exclude_file = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_excludefile")

        # Perform the liftover using CrossMap
        subprocess.run([CrossMap_path, "bed", self.chain_file, self.bedfile_path, map_file, "--unmap-file", unmapped_file], check=True)

        def extract_variants_to_exclude(input_file, exclude_f):
            if os.path.exists(input_file):
                with open(input_file, 'r') as f:
                    for line in f:
                        if not line.startswith('#'):  # Skip header lines
                            parts = line.strip().split()
                            if len(parts) >= 4:  # Ensure the line has enough columns
                                chrom = parts[0]
                                var_id = parts[3]
                                # Exclude non-standard chromosomes or unmapped variants
                                if input_file == unmapped_file or '_' in chrom or 'chrUn' in chrom or 'chr_alt' in chrom:
                                    exclude_f.write(f"{var_id}\n")

        # Process the output files to create an exclude list
        with open(exclude_file, 'w') as exclude_f:
            # Process unmapped_file and map_file
            extract_variants_to_exclude(unmapped_file, exclude_f)
            extract_variants_to_exclude(map_file, exclude_f)

        # Save the final mapped file (only standard chromosomes)
        map_file_final = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_mapfile_final")
        with open(map_file_final, 'w') as final_f:
            if os.path.exists(map_file):
                with open(map_file, 'r') as mapped_f:
                    for line in mapped_f:
                        parts = line.strip().split()
                        if len(parts) >= 4:  # Ensure the line has enough columns
                            chrom = parts[0]
                            # Include only standard chromosomes
                            if '_' not in chrom and 'chrUn' not in chrom and 'chr_alt' not in chrom:
                                final_f.write(line)

        # Update class attributes
        self.mapfile_path = map_file_final
        self.exclude_file = exclude_file

        # Clean up intermediate files
        self.intermediate_files.extend([map_file, unmapped_file])


    def update_plink_files(self):
        """Update PLINK files after liftover."""
        temp_base = os.path.join(self.out_dir, f"temp_{os.path.basename(self.base_name)}")
        new_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_38_merged")

        # Exclude problematic variants
        subprocess.run([plink_path, '--bfile', self.current_base, '--exclude', self.exclude_file, '--make-bed', '--out', temp_base], check=True)

        # Prepare chromosome and position updates
        chr_update_file = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_chr_update.txt")
        pos_update_file = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_pos_update.txt")

        with open(chr_update_file, 'w') as f_chr, open(pos_update_file, 'w') as f_pos, \
                open(self.mapfile_path, 'r') as f_map:
            for line in f_map:
                parts = line.strip().split()
                var_id = parts[3]
                chrom = parts[0].replace('chr', '')
                pos = parts[2]

                # Convert chromosome names to PLINK format
                if chrom == 'X':
                    plink_chrom = '23'
                elif chrom == 'Y':
                    plink_chrom = '24'
                elif chrom in ['MT', 'M']:
                    plink_chrom = '26'
                else:
                    plink_chrom = chrom

                f_chr.write(f"{var_id}\t{plink_chrom}\n")
                f_pos.write(f"{var_id}\t{pos}\n")

        # Apply updates
        subprocess.run([plink_path, '--bfile', temp_base, '--update-chr', chr_update_file, '--update-map', pos_update_file, '--make-bed', '--output-chr', 'chrMT', '--out', new_base], check=True)

        # Clean up intermediate files
        self.intermediate_files.extend([
            chr_update_file,
            pos_update_file,
            f"{temp_base}.bed",
            f"{temp_base}.bim",
            f"{temp_base}.fam",
            self.mapfile_path,
        ])

        # Update current files and count
        self.current_base = new_base
        self.current_bim = f"{new_base}.bim"
        self.current_fam = f"{new_base}.fam"
        count_variants(self.study_name, self.current_bim, self.current_fam, "After liftover")
    
        # Delete every intermediate file collected so far
        for file in self.intermediate_files:
            if os.path.exists(file):
                os.remove(file)

    def run_pipeline(self):
        """Run the full liftover operation: sex check, BED creation, liftover, PLINK update."""
        self.check_sex()
        self.create_bed_file()
        self.run_liftover()
        self.update_plink_files()
 

    def get_output_files(self):
        """Return the base name (path without extension) of the final PLINK files after liftover."""
        return self.current_base
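
One detail in the class above that is easy to trip over is the coordinate convention. BIM positions are 1-based, while BED intervals are 0-based and half-open, so a single SNP at position `pos` becomes the interval `[pos-1, pos)`, and after liftover the new position is the interval's end. A minimal sketch of that round trip, with hypothetical helper names:

```python
def bim_to_bed_interval(chrom, pos):
    """Convert a 1-based BIM position to a 0-based, half-open BED interval."""
    chrom = chrom if chrom.startswith("chr") else f"chr{chrom}"
    return chrom, pos - 1, pos

def bed_interval_to_pos(start, end):
    """Recover the 1-based position from a single-base BED interval."""
    if end - start != 1:
        raise ValueError("expected a single-base interval")
    return end

chrom, start, end = bim_to_bed_interval("7", 1000)
print(chrom, start, end)
print(bed_interval_to_pos(start, end))
```

This is why `update_plink_files` reads the third column of the lifted file as the new position rather than the second.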

## Now we can do the initial liftover of our data! Exciting stuff!

In [None]:
#Add as many studies as needed! Just be sure to add them downstream!

study1 = LiftoverProcessor('Study1', 'path/data_base_name', 'hg19ToHg38.over.chain.gz', 'output_directory')
study2 = LiftoverProcessor('Study2', 'path/data_base_name', 'hg18ToHg38.over.chain.gz', 'output_directory')

In [None]:
study1.check_sex()
study2.check_sex()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

### Do we see any notable reduction in sample size? If so, make sure there is not a more systematic problem with your data that may need some manual fixing!
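
If many samples were dropped, it helps to look at why. PLINK's `.sexcheck` report has the columns `FID IID PEDSEX SNPSEX STATUS F`, and a `SNPSEX` of 0 means PLINK could not make a call, which is a different situation from a confident call that disagrees with the recorded sex. The sketch below pulls out the problem rows by column name. The helper `problem_samples` is hypothetical and the example lines are made up.

```python
def problem_samples(sexcheck_lines):
    """Yield (FID, IID, PEDSEX, SNPSEX) for rows whose STATUS column is PROBLEM."""
    header = sexcheck_lines[0].split()
    for line in sexcheck_lines[1:]:
        if not line.strip():
            continue  # ignore blank lines
        row = dict(zip(header, line.split()))
        if row["STATUS"] == "PROBLEM":
            yield row["FID"], row["IID"], row["PEDSEX"], row["SNPSEX"]

# Made-up report for illustration only
example = [
    " FID IID PEDSEX SNPSEX STATUS F",
    " F1 S1 1 1 OK 0.98",
    " F2 S2 2 1 PROBLEM 0.95",
    " F3 S3 1 0 PROBLEM 0.45",
]
for fid, iid, pedsex, snpsex in problem_samples(example):
    reason = "no confident call" if snpsex == "0" else "recorded sex disagrees"
    print(fid, iid, reason)
```

If most of your problem rows fall in the "no confident call" group, the thresholds or the X-chromosome coverage of the chip may be the real issue rather than the samples themselves.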

In [None]:
study1.create_bed_file()
study2.create_bed_file()
study1.run_liftover()
study2.run_liftover()

### We have liftoff! Now let's really get this car into gear and update our PLINK files to reflect our newly mapped variants!

In [None]:
study1.update_plink_files()
study2.update_plink_files()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

# Ok! We have successfully performed a liftover! How do you feel? While the literal process of lifting over the coordinates has been completed, we are not quite done:
    
There’s still some housekeeping to take care of—after all, we didn’t generate this data ourselves, so we need to make sure everything’s in order. Here’s the game plan: 

---
- **Map to Reference FASTA**: 
    - Let’s make sure those lifted-over variants actually match the sequence of the new genome build.
- **Strand Flips**: 
    - Some chips report alleles on the reverse strand (for example, an A/G SNP recorded as T/C). We flip these back to the forward strand, and we also fix cases where the reference and alternate alleles are simply swapped.
- **Palindromic SNPs**: 
    - A/T and C/G variants look the same on the forward and reverse strands, making them a headache for allele assignment. Since they’re too ambiguous to resolve confidently, we’ll filter them out entirely.
- **Invalid SNPs**: 
    - Last but not least, filter out any variants where the reference allele doesn’t match the new build. Only the real deal makes the cut!
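
To see how these four checks fit together, here is a simplified sketch of the decision for a single SNP, given its two alleles and the base found in the reference FASTA. The function `classify_snp` and its return labels are hypothetical. The class below follows the same idea but writes PLINK input files instead of returning labels.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def classify_snp(a1, a2, ref_base):
    """Return the action needed to align a biallelic SNP to the reference base."""
    ref_base = ref_base.upper()  # soft-masked FASTA files use lowercase bases
    if a1 not in COMPLEMENT or a2 not in COMPLEMENT:
        return "remove_non_snp"          # indels, missing alleles, etc.
    if COMPLEMENT[a1] == a2:
        return "remove_palindromic"      # A/T or C/G: strand cannot be resolved
    if ref_base == a1:
        return "keep"
    if ref_base == a2:
        return "swap_alleles"            # alleles are right, order is reversed
    if ref_base == COMPLEMENT[a1]:
        return "flip_strand"
    if ref_base == COMPLEMENT[a2]:
        return "flip_strand_and_swap"
    return "remove_no_match"             # neither strand matches the reference

for alleles in [("A", "G", "A"), ("A", "G", "G"), ("A", "G", "T"), ("A", "T", "A")]:
    print(alleles, "->", classify_snp(*alleles))
```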
    

In [4]:

class ReferenceAligner:
    def __init__(self, study_name, base_name, genome_fasta, output_directory):
        """
        Initialize the ReferenceAligner class.

        Args:
            study_name : Name of the study.
            base_name : Base name of the input files (without extensions).
            genome_fasta : Path to the reference genome FASTA file.
            output_directory : Directory where temporary and output files will be saved.
        """
        self.study_name = study_name
        self.base_name = base_name
        self.genome_fasta = genome_fasta
        self.output_directory = output_directory
        self.intermediate_files = []
        # Create the output directory if it doesn't exist
        os.makedirs(self.output_directory, exist_ok=True)

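        # Initialize current base name and BIM/FAM files
        # (current_base is set here so get_output_files() works before alignment)
        self.current_base = self.base_name
        self.current_bim = f"{self.base_name}.bim"
        self.current_fam = f"{self.base_name}.fam"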

    def strand_flip(self, a):
        """Helper function to flip alleles."""
        return {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}[a]

    def generate_alignment_files(self):
        """Generate the files needed for alignment (remove.txt, strand_flip.txt, force_a1.txt)."""
        
        n_total_variants = 0
        n_non_snps = 0
        n_palindromic = 0
        n_flip_strand = 0
        n_force_ref_allele = 0
        n_no_ref_match = 0

        # Define paths for output files 
        self.remove_file = os.path.join(self.output_directory, f"{os.path.basename(self.base_name)}.remove.txt")
        self.flip_file = os.path.join(self.output_directory, f"{os.path.basename(self.base_name)}.strand_flip.txt")
        self.force_file = os.path.join(self.output_directory, f"{os.path.basename(self.base_name)}.force_a1.txt")

        with open(self.current_bim, 'rt') as ibim, pysam.FastaFile(self.genome_fasta) as ifasta, \
            open(self.remove_file, 'wt') as oremove, \
            open(self.flip_file, 'wt') as oflip, \
            open(self.force_file, 'wt') as oforce:

            fasta_chroms = set(ifasta.references)
            for line in ibim:
                fields = line.rstrip().split()
                chrom, varid, pos, a1, a2 = fields[0], fields[1], int(fields[3]), fields[4], fields[5]
            
                n_total_variants += 1

                # Handle chromosome naming quirks
                if chrom not in fasta_chroms:
                    chrom = chrom[3:] if chrom.startswith('chr') else f'chr{chrom}'
                    if chrom not in fasta_chroms:
                        print(f'Warning: skipping chromosome {fields[0]} because it is not in FASTA file.')
                        continue

                # Skip non-SNPs
                if a1 not in {'A', 'C', 'G', 'T'} or a2 not in {'A', 'C', 'G', 'T'}:
                    oremove.write(f'{varid}\n')
                    n_non_snps += 1
                    continue

                # Skip palindromic SNPs
                if (a1 in {'A', 'T'} and a2 in {'A', 'T'}) or (a1 in {'C', 'G'} and a2 in {'C', 'G'}):
                    oremove.write(f'{varid}\n')
                    n_palindromic += 1
                    continue

                # Retrieve the reference allele from the hg38 FASTA.
                # Upper-case it because soft-masked FASTA files store repeats in lowercase.
                ref_base = ifasta.fetch(chrom, pos - 1, pos).upper()

                if ref_base == a2:
                    n_force_ref_allele += 1
                    oforce.write(f'{varid}\t{ref_base}\n')
                elif ref_base != a1:
                    flipped_a1 = self.strand_flip(a1)
                    flipped_a2 = self.strand_flip(a2)
                    if ref_base == flipped_a2:
                        n_force_ref_allele += 1
                        oforce.write(f"{varid}\t{ref_base}\n")
                        n_flip_strand += 1
                        oflip.write(f"{varid}\n")
                    elif ref_base == flipped_a1:
                        n_flip_strand += 1
                        oflip.write(f"{varid}\n")
                    else:
                        n_no_ref_match += 1
                        oremove.write(f"{varid}\n")

        print(f"Total variants: {n_total_variants:,}")
        print(f"Not valid SNPs: {n_non_snps:,}")
        print(f"Palindromic SNPs: {n_palindromic:,}")
        print(f"Strand flips: {n_flip_strand:,}")
        print(f"Force reference allele: {n_force_ref_allele:,}")
        print(f"A1/A2 didn't match reference allele: {n_no_ref_match:,}")

    def align_to_reference(self):
        """
        Aligns the data to the reference genome: removes SNPs that cannot be fixed,
        flips strands, and forces A1 to be the reference allele.
        """
        self.remove_file = os.path.join(self.output_directory, f"{os.path.basename(self.base_name)}.remove.txt")
        self.flip_file = os.path.join(self.output_directory, f"{os.path.basename(self.base_name)}.strand_flip.txt")
        self.force_file = os.path.join(self.output_directory, f"{os.path.basename(self.base_name)}.force_a1.txt")
        # Define paths for intermediate files in the output directory
        temp_base = os.path.join(self.output_directory, "temp")
        temp2_base = os.path.join(self.output_directory, "temp2")
        aligned_base = os.path.join(self.output_directory, f"{os.path.basename(self.base_name)}_aligned")
    
        # Remove impossible to adjust SNPs
        subprocess.run([plink_path, '--bfile', self.base_name, '--exclude', self.remove_file, '--make-bed', '--out', temp_base], check=True)
        count_variants(self.study_name, f"{temp_base}.bim", self.current_fam, "After Removal of Impossible SNPs")

        # Strand flip
        subprocess.run([plink_path, '--bfile', temp_base, '--flip', self.flip_file, '--make-bed', '--out', temp2_base], check=True)
        count_variants(self.study_name, f"{temp2_base}.bim", self.current_fam, "After Strand Flip")

        # Force the REF/ALT designation
        subprocess.run([plink_path, '--bfile', temp2_base, '--a1-allele', self.force_file, '--make-bed', '--out', aligned_base], check=True)
        count_variants(self.study_name, f"{aligned_base}.bim", self.current_fam, "After Alignment")
    
        # Clean up intermediate files
        for temp_file in [temp_base, temp2_base]:
            for ext in [".bed", ".bim", ".fam", ".log", ".nosex"]:
                file_path = f"{temp_file}{ext}"
                if os.path.exists(file_path):
                    self.intermediate_files.append(file_path)

        # Add alignment files to intermediate_files list
        for file in [self.remove_file, self.flip_file, self.force_file]:
            if os.path.exists(file):
                self.intermediate_files.append(file)

        # Clean up all intermediate files
        for file in self.intermediate_files:
            if os.path.exists(file):
                os.remove(file)
        
        self.current_base = aligned_base
        self.current_bim = f"{aligned_base}.bim"
        self.current_fam = f"{aligned_base}.fam"
        print(f"Alignment completed for {self.study_name}.")
        
    def get_output_files(self):
        """Return the base name (path without extension) of the final PLINK files after alignment."""
        return self.current_base
    
        

# Now let's get to aligning! Be sure to remember where you sent the output of that liftover!

In [None]:
# Take note of your last output from the last section!
final_base = study1.get_output_files()
print(f"Final base file: {final_base}")


In [None]:
# Take note of your last output from the last section!
final_base = study2.get_output_files()
print(f"Final base file: {final_base}")


# Now we can feed that to our alignment!

---

Be sure that your studies are being sent to **different directories** so that their temporary files do not collide during the alignment process!

In [None]:
study1_A = ReferenceAligner('Study1', 'path/to/final_bim_base_name', 'path/to/hg38.fa.gz', 'output_directory_study1')
study2_A = ReferenceAligner('Study2', 'path/to/final_bim_base_name', 'path/to/hg38.fa.gz', 'output_directory_study2')

In [None]:
study1_A.generate_alignment_files()
study2_A.generate_alignment_files()

As you can see, there was definitely some tidying to do, no matter how clean the data you inherited was!

In [None]:
study1_A.align_to_reference()
study2_A.align_to_reference()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

#  We're Almost There! 

---

###  **Great progress so far!** We've tackled the bulk of the challenging work, and now we’re down to the final steps to ensure our data is polished and ready for standard quality control. Let’s break it down:

---

##  **Handling Duplicate Variants**  
Occasionally, multiple variants can appear at the same genomic position, which can complicate downstream analysis.  
**Our goal:**  
- Identify these duplicates and retain only the variant that is best represented in our dataset, meaning the one with the fewest missing genotype calls.   
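
As a small illustration of that rule, the sketch below takes groups of duplicate variant IDs plus a missing-genotype count per ID, keeps the ID with the fewest missing calls in each group, and returns the rest for exclusion. It reads "the variant we retain" as the one with the fewest missing genotype calls, which is how the class below ranks duplicates. The helper name and the example IDs and counts are invented.

```python
def duplicates_to_exclude(duplicate_groups, missing_counts):
    """For each group of duplicate IDs, keep the one with the fewest missing genotypes."""
    exclude = []
    for group in duplicate_groups:
        # On ties, min() keeps the first ID listed in the group.
        keep = min(group, key=lambda var_id: missing_counts[var_id])
        exclude.extend(var_id for var_id in group if var_id != keep)
    return exclude

# Made-up IDs and counts for illustration only
groups = [["rsA", "chipA"], ["rsB", "chipB", "chipB2"]]
missing = {"rsA": 3, "chipA": 0, "rsB": 1, "chipB": 5, "chipB2": 2}
print(duplicates_to_exclude(groups, missing))
```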

---

##  **Standardizing Variant IDs**  
Different studies often use different genotyping chips, each with its own naming conventions for variant IDs. This inconsistency can create headaches when integrating multiple datasets.  

**Our solution:**  
- Rename variant IDs using a universal format: **`chr:pos:ALT:REF`**.  
- This ensures consistency across studies and makes our data more interoperable. Think of it as giving everyone the same map to follow! 🗺️  
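
A minimal sketch of building such an ID from a variant's fields is shown below. The helper `make_variant_id` is hypothetical, the example variant is made up, and whether the chromosome keeps its `chr` prefix is an assumption you should match to your own files.

```python
def make_variant_id(chrom, pos, ref, alt):
    """Build a chr:pos:ALT:REF identifier (illustrative helper)."""
    # Assumption: IDs carry a "chr" prefix; drop this line if yours should not.
    chrom = chrom if chrom.startswith("chr") else f"chr{chrom}"
    return f"{chrom}:{pos}:{alt}:{ref}"

print(make_variant_id("1", 752566, ref="A", alt="G"))  # chr1:752566:G:A
```

Because the ID depends only on position and alleles, the same variant gets the same name no matter which chip it came from.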

---

##  **Addressing Heterozygous Haploid Variants**  
- **What’s the issue?**  
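  - Some genotype calls come out heterozygous on chromosomes that carry only one copy (e.g., the non-pseudo-autosomal part of the X chromosome in males). PLINK flags these as heterozygous haploid calls.  
- **Why does this matter?**  
  - These calls are biologically impossible, so they can skew analysis and often point to data quality issues such as genotyping errors or a wrong sex assignment.  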

**Our approach:**  
- Set these heterozygous haploid variants to **missing** for now.  
- We’ll revisit these later to ensure accuracy and robustness in our analysis. Safety first! 🛡️  
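
To make the definition concrete, here is a deliberately simplified sketch of when a call counts as heterozygous haploid. The function name, the `in_par` flag, and the exact rules are assumptions for illustration only. PLINK applies its own, more complete logic, and PLINK 1.9 offers a `--set-hh-missing` flag for erasing such calls.

```python
def is_het_haploid(chrom, sex, allele_a, allele_b, in_par=False):
    """Flag a heterozygous call at a position where only one copy is expected."""
    chrom = chrom.replace("chr", "")
    if allele_a == allele_b or "0" in (allele_a, allele_b):
        return False  # homozygous or missing calls are never het-haploid
    if chrom in {"X", "23"}:
        # Males (sex code 1) carry one X outside the pseudo-autosomal region
        return sex == 1 and not in_par
    if chrom in {"Y", "24"}:
        return True   # Y is single-copy, so any heterozygous call is suspect
    return False

print(is_het_haploid("X", 1, "A", "G"))   # male X, heterozygous
print(is_het_haploid("X", 2, "A", "G"))   # female X, heterozygous
```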

---

###  **You’re doing amazing!**  
- These final steps will ensure our data is clean, consistent, and ready for the next phase of analysis. 
- Let’s power through and get this done—you’ve got this! 💪  

In [7]:


class Dup_Renamer:
    def __init__(self, study_name, base_name, out_dir):

        """
        Initialize the Dup_Renamer class.

        Args:
            study_name : Name of the study.
            base_name : Base name of the input files (including full path, e.g., "path/to/data_base_name").
            out_dir : Directory where all output files will be saved.
        """
        self.study_name = study_name
        self.base_name = base_name  # Full path to the base name (e.g., "path/to/data_base_name")
        self.out_dir = out_dir

        # Create the output directory if it doesn't exist
        os.makedirs(self.out_dir, exist_ok=True)

        # Initialize current BIM and FAM files
        self.current_base = os.path.join(self.out_dir, os.path.basename(base_name))
        self.current_bim = f"{self.current_base}.bim"
        self.current_fam = f"{self.current_base}.fam"

        # Track intermediate files for cleanup
        self.intermediate_files = []

    def list_duplicate_vars(self):
        """Run PLINK to list duplicate variants."""
        subprocess.run([plink_path, '--bfile', self.current_base, '--list-duplicate-vars', '--out', self.current_base], check=True)
        subprocess.run([plink_path, '--bfile', self.current_base, '--freq', 'counts', '--out', self.current_base], check=True)

        # Track intermediate files
        self.intermediate_files.extend([
            f"{self.current_base}.dupvar",
            f"{self.current_base}.frq.counts",
            f"{self.current_base}.log"
        ])

    def prioritize_duplicates(self):
        """Prioritize duplicates based on missingness and generate an exclude list."""
        dupvar_file = f"{self.current_base}.dupvar"
        counts_file = f"{self.current_base}.frq.counts"
        exclude_file = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_exclude_dup.txt")

        missingness = dict()

        # Read the missing-genotype count (G0 column) of each variant from the .frq.counts file
        with open(counts_file, 'rt') as icounts:
            header = icounts.readline().split()
            for line in icounts:
                fields = dict(zip(header, line.split()))
                missingness[fields['SNP']] = int(fields['G0'])

        # For each group of duplicates, keep the variant with the fewest missing genotypes
        # and write every other ID in the group to the exclude file
        with open(dupvar_file, 'rt') as idupvar, open(exclude_file, 'wt') as ofile:
            idupvar.readline()  # skip the header line
            for line in idupvar:
                var_ids = line.strip().split('\t')[-1].split()
                # Sort from most to least missing so the variant to keep ends up last
                var_ids_sorted = sorted(var_ids, key=lambda var_id: missingness[var_id], reverse=True)
                for var_id in var_ids_sorted[:-1]:
                    ofile.write(var_id + '\n')

        self.exclude_file = exclude_file
        self.intermediate_files.append(exclude_file)

    def remove_duplicates(self):
        """Remove duplicates using the exclude list."""
        dedup_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_dedup")
        subprocess.run([plink_path, '--bfile', self.current_base, '--exclude', self.exclude_file, '--output-chr', 'chrMT', '--keep-allele-order', '--make-bed', '--out', dedup_base], check=True)
        self.current_base = dedup_base
        self.current_bim = f"{dedup_base}.bim"
        self.current_fam = f"{dedup_base}.fam"
        count_variants(self.study_name, self.current_bim, self.current_fam, "After removal of duplicates")

        # Track intermediate files
        self.intermediate_files.extend([
            f"{dedup_base}.bed",
            f"{dedup_base}.bim",
            f"{dedup_base}.fam",
            f"{dedup_base}.log"
        ])

    def rename_variants(self):
        """Rename variants to chr:pos:ALT:REF, taking REF from BIM column 5 and ALT from column 6."""
        rename_file = os.path.join(self.out_dir, "rename.txt")
        # Write "new_id old_id" pairs; --update-name below reads the new ID from column 1
        # and the old ID from column 2
        with open(self.current_bim, 'r') as f_in, open(rename_file, 'w') as f_out:
            for line in f_in:
                parts = line.split()
                chrom, old_id, pos, ref, alt = parts[0], parts[1], parts[3], parts[4], parts[5]
                # Files written with --output-chr chrMT already carry the prefix, so avoid "chrchr"
                if not chrom.startswith("chr"):
                    chrom = f"chr{chrom}"
                f_out.write(f"{chrom}:{pos}:{alt}:{ref} {old_id}\n")
        renamed_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_renamed")
        subprocess.run([plink_path, '--bfile', self.current_base, '--update-name', rename_file, '1', '2', '--make-bed', '--keep-allele-order', '--output-chr', 'chrMT', '--out', renamed_base], check=True)
        self.current_base = renamed_base
        self.current_bim = f"{renamed_base}.bim"
        self.current_fam = f"{renamed_base}.fam"
        count_variants(self.study_name, self.current_bim, self.current_fam, "After renaming")

        # Track intermediate files
        self.intermediate_files.extend([
            rename_file,
            f"{renamed_base}.bed",
            f"{renamed_base}.bim",
            f"{renamed_base}.fam",
            f"{renamed_base}.log"
        ])

    def split_and_merge_x(self):
        """Split and merge X chromosomes."""
        xsplit_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_Xsplit")
        subprocess.run([plink_path, '--bfile', self.current_base, '--split-x', 'no-fail', 'b38', '--keep-allele-order', '--make-bed', '--output-chr', 'chrMT', '--out', xsplit_base], check=True)
        self.current_base = xsplit_base
        self.current_bim = f"{xsplit_base}.bim"
        self.current_fam = f"{xsplit_base}.fam"

        xsplit_temp_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_Xsplit_temp")
        subprocess.run([plink_path, '--bfile', self.current_base, '--keep-allele-order', '--set-hh-missing', '--make-bed', '--output-chr', 'chrMT', '--out', xsplit_temp_base], check=True)
        self.current_base = xsplit_temp_base
        self.current_bim = f"{xsplit_temp_base}.bim"
        self.current_fam = f"{xsplit_temp_base}.fam"

        final_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_hg38")
        subprocess.run([plink_path, '--bfile', self.current_base, '--merge-x', 'no-fail', '--keep-allele-order', '--make-bed', '--output-chr', 'chrMT', '--out', final_base], check=True)
        self.current_base = final_base
        self.current_bim = f"{final_base}.bim"
        self.current_fam = f"{final_base}.fam"
        count_variants(self.study_name, self.current_bim, self.current_fam, "Final counts Lift over!")

        # Track intermediate files
        self.intermediate_files.extend([
            f"{xsplit_base}.bed",
            f"{xsplit_base}.bim",
            f"{xsplit_base}.fam",
            f"{xsplit_base}.log",
            f"{xsplit_temp_base}.bed",
            f"{xsplit_temp_base}.bim",
            f"{xsplit_temp_base}.fam",
            f"{xsplit_temp_base}.log"
            
        ])

        # Delete every tracked intermediate file that still exists on disk
        for file_path in self.intermediate_files:
            if os.path.exists(file_path):
                os.remove(file_path)
        print("Cleaned up intermediate files.")
        print(f"Liftover is all done for {self.study_name}, happy hunting!!")
        


    def get_output_files(self):
        """Return the base path (no extension) of the final BED/BIM/FAM fileset."""
        return self.current_base
        


In [None]:
# Take note of your last output from the last section!
final_base = study2_A.get_output_files()
print(f"Final base file: {final_base}")

In [None]:
# Take note of your last output from the last section!
final_base = study1_A.get_output_files()
print(f"Final base file: {final_base}")

In [None]:
Study1_D = Dup_Renamer('Study1', '/home/belleza/scratch/1KGHGDP/test_aligner/HGDP1KG_OPT_aligned', '/home/belleza/scratch/1KGHGDP/test_aligner/')
Study1_D.list_duplicate_vars()
Study1_D.prioritize_duplicates()
Study1_D.remove_duplicates()  # apply the exclude list written by prioritize_duplicates()



### Make sure each study uses a different output directory!
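
If you want a quick sanity check before creating your `Dup_Renamer` objects, here is a minimal sketch that flags studies pointing at the same output directory. The helper name `find_shared_out_dirs` and the example paths are hypothetical and not part of the pipeline.

```python
import os


def find_shared_out_dirs(out_dirs):
    """Given {study_name: out_dir}, return the groups of studies that share a directory."""
    by_dir = {}
    for study, out_dir in out_dirs.items():
        # abspath normalizes trailing slashes and relative paths before comparing
        by_dir.setdefault(os.path.abspath(out_dir), []).append(study)
    return [studies for studies in by_dir.values() if len(studies) > 1]


print(find_shared_out_dirs({"Study1": "out/study1", "Study2": "out/study2"}))  # []
print(find_shared_out_dirs({"Study1": "out", "Study2": "out/"}))  # [['Study1', 'Study2']]
```

An empty list means every study has its own directory, so files with fixed names such as `rename.txt` cannot collide.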

In [None]:
Study1_D.rename_variants()
Study1_D.split_and_merge_x()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

In [None]:

Study1_D = Dup_Renamer('Study1', 'path/to/study1_final_base_name', 'study1_output_directory')
Study2_D = Dup_Renamer('Study2', 'path/to/study2_final_base_name', 'study2_output_directory')

In [None]:
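Study1_D.list_duplicate_vars()
Study1_D.prioritize_duplicates()
Study1_D.remove_duplicates()  # apply the exclude list written by prioritize_duplicates()
Study2_D.list_duplicate_vars()
Study2_D.prioritize_duplicates()
Study2_D.remove_duplicates()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))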

### This ***should*** be the last step that loses any variants, so keep an eye on the counts!
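
To see where the lost variants go, here is a toy sketch of the rule that `prioritize_duplicates` applies: within each group of duplicates, the variant with the fewest missing genotypes is kept and the rest land on the exclude list. The function name `pick_duplicates_to_exclude` and all IDs and counts are made up for illustration. In the class itself, the groups come from the `.dupvar` file and the missing counts from the `G0` column of `.frq.counts`.

```python
def pick_duplicates_to_exclude(duplicate_groups, missing_counts):
    """For each group of duplicate IDs, keep the ID with the fewest missing genotypes."""
    exclude = []
    for group in duplicate_groups:
        # Sort from most to least missing; the last entry is the one we keep
        ranked = sorted(group, key=lambda var_id: missing_counts[var_id], reverse=True)
        exclude.extend(ranked[:-1])
    return exclude


# Hypothetical duplicate groups and missing-genotype counts
groups = [["rsA", "rsA_dup"], ["rsB", "rsB_dup1", "rsB_dup2"]]
missing = {"rsA": 5, "rsA_dup": 0, "rsB": 1, "rsB_dup1": 3, "rsB_dup2": 2}
print(pick_duplicates_to_exclude(groups, missing))  # ['rsA', 'rsB_dup1', 'rsB_dup2']
```

Each group of n duplicates contributes n - 1 excluded variants, so the drop in your QC table should match the number of lines in the exclude file.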

In [None]:
Study1_D.rename_variants()
Study2_D.rename_variants()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

In [None]:
Study1_D.split_and_merge_x()
Study2_D.split_and_merge_x()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

In [None]:
#Last step
save_qc_table()

# 🎉 **You Did It! Look at You Go, Hot Shot!** 🎉

---

### **Great job!** Our reference data now has that fresh, polished **"new genomic build smell"**, and it wasn’t even that hard, was it? 😎  

---

## **What’s Next?**  
Now that our data is looking sharp, it’s time to ensure it’s in tip-top shape before we bring everything together.  

**Here’s the plan:**  
- Perform **standard quality control (QC) measures** on each dataset independently.  
- This step is crucial to catch any potential issues early and ensure our data is reliable and ready for integration.  

---

### **You’re crushing it!**  
Head over to the next section to dive into the QC steps. See you there—keep up the fantastic work! 💪  