# Welcome back! 

## Is your lifted-over data treating you well? 

## While we may have **updated** data, we do not necessarily have **clean** data
- When conducting our future analyses we want the cleanest data possible -- you deserve only the best!
## We are going to go on a **Q**uality **C**ontrol (QC) adventure together 
- We will go through some standard and necessary steps to make sure our data is in fighting shape! 

## **Here’s What We’ll Do:**  

### **Screen for Missingness**  
- We’ll check for missing data at both the **variant level** (e.g., specific SNPs with too much missing data) and the **individual level** (e.g., samples with too much missing data).  
- Missing data can skew results and reduce the power of your analysis. Let’s nip this in the bud! 
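
To make the two levels of missingness concrete, here is a minimal sketch on a toy genotype table. The sample names, variant names, and values are made up for illustration; in the tutorial itself PLINK computes these rates for us.

```python
import numpy as np
import pandas as pd

# Toy genotype matrix (hypothetical data): rows are samples, columns are variants.
# Values are alternate-allele counts (0, 1, 2); NaN marks a missing call.
genotypes = pd.DataFrame(
    {
        "rs1": [0, 1, 2, np.nan],
        "rs2": [np.nan, np.nan, 1, np.nan],
        "rs3": [2, 2, 1, 0],
    },
    index=["sample_A", "sample_B", "sample_C", "sample_D"],
)

# Variant-level missingness: fraction of samples with no call at each variant.
variant_missing = genotypes.isna().mean(axis=0)

# Individual-level missingness: fraction of variants with no call in each sample.
sample_missing = genotypes.isna().mean(axis=1)

print(variant_missing)
print(sample_missing)
```

Here `rs2` is missing in three of four samples, and `sample_D` is missing two of three variants, so both would be candidates for removal.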

---

### **Check Hardy-Weinberg Equilibrium (HWE)**  
- We’ll test if the variants in our dataset follow Hardy-Weinberg expectations.  
- Deviations from HWE can indicate genotyping errors, population stratification, or other issues. It’s like a litmus test for data quality!  
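
As a rough illustration of what "following Hardy-Weinberg expectations" means, the sketch below compares observed genotype counts with the counts expected from the allele frequencies using a chi-square test. This is only to show the idea: PLINK reports an exact-test p-value rather than this approximation, and the function name `hwe_chi_square` and the counts are hypothetical.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square HWE test from observed genotype counts (1 degree of freedom)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("At least one genotype call is required.")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts that match expectations exactly versus counts with no heterozygotes at all.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)
print(chi2_ok, p_ok)
print(chi2_bad, p_bad)
```

A variant with no heterozygotes despite both alleles being common is the kind of pattern that often points to a genotyping problem.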

---

### **Remove Variant Duplicates**  

- Sometimes, the same variant appears multiple times in the dataset (thanks, genotyping chips!).  
- We’ll identify and remove these duplicates to avoid redundancy and confusion. Out with the extras!  
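
One simple way to think about duplicate variants is "same chromosome, same position, same alleles, different ID". The sketch below flags such records in a small BIM-style table. The table contents are hypothetical, and it does not handle allele codes listed in swapped order.

```python
import pandas as pd

# Hypothetical BIM-style table: chromosome, variant ID, position, and the two alleles.
bim = pd.DataFrame(
    {
        "CHR": ["1", "1", "1", "2"],
        "SNP": ["rs100", "chipA_100", "rs200", "rs300"],
        "BP": [1000, 1000, 2000, 1000],
        "A1": ["A", "A", "C", "A"],
        "A2": ["G", "G", "T", "G"],
    }
)

# Records count as duplicates here when chromosome, position, and both alleles match.
# keep="first" leaves the first record of each group unflagged, so one copy survives.
is_dup = bim.duplicated(subset=["CHR", "BP", "A1", "A2"], keep="first")
ids_to_remove = bim.loc[is_dup, "SNP"].tolist()
bim_dedup = bim.loc[~is_dup].reset_index(drop=True)

print(ids_to_remove)
```

Note that `rs300` shares a position and alleles with `rs100` but sits on a different chromosome, so it is kept.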

---
### **Remove Duplicate Samples**  
- Occasionally, the same individual might appear more than once in the dataset (oops!).  
- We’ll detect and remove these duplicates to ensure each individual is represented only once. No clones allowed!  
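
To give a feel for how duplicate samples can be spotted, here is a toy sketch that compares every pair of samples and flags pairs whose genotype calls agree almost everywhere. The data, the sample names, and the 0.99 cutoff are assumptions for illustration; real pipelines usually estimate relatedness over many independent markers instead.

```python
import itertools
import numpy as np

# Hypothetical genotype calls (0, 1, 2 alternate-allele counts) for four samples.
genotypes = {
    "sample_A": np.array([0, 1, 2, 1, 0, 2, 1, 1]),
    "sample_B": np.array([2, 1, 0, 0, 1, 2, 0, 1]),
    "sample_A_rerun": np.array([0, 1, 2, 1, 0, 2, 1, 1]),
    "sample_C": np.array([1, 1, 2, 0, 0, 1, 1, 2]),
}

CONCORDANCE_CUTOFF = 0.99  # assumed cutoff for this sketch

duplicate_pairs = []
for (id1, g1), (id2, g2) in itertools.combinations(genotypes.items(), 2):
    # Fraction of variants where the two samples have the same call.
    concordance = float(np.mean(g1 == g2))
    if concordance >= CONCORDANCE_CUTOFF:
        duplicate_pairs.append((id1, id2))

# Keep the first member of each flagged pair and drop the second.
to_drop = {second for _, second in duplicate_pairs}
unique_samples = [s for s in genotypes if s not in to_drop]

print(duplicate_pairs)
print(unique_samples)
```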

---

### **You’ve Got This!**  
These QC steps are like the foundation of a house—they ensure everything built on top is solid and reliable. Let’s get to it and make your data shine!

- First, let's gather our tools and get coding!

In [None]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import glob
from collections import defaultdict
import subprocess
from IPython.display import display
import shutil
from tabulate import tabulate
import logging

#Set up plink to work in jupyter notebook (Compute Canada)
!module load StdEnv/2020 && module load plink/1.9b_6.21-x86_64 && which plink
!module load StdEnv/2020 && module load plink/2.00a3.6 && which plink2


In [26]:
#Copy the output from above into this next command -- or just the absolute path to your downloaded plink
plink_path = 'path/to/plink command'
plink2_path = 'path/to/plink/command'

In [28]:

# Shared QC table to track all studies
shared_qc_table = []

def count_variants(study_name, bim_file, fam_file, step_name):
    """
    Args:
        study_name : Name of the study (e.g., "Study1").
        bim_file : Path to the BIM file.
        fam_file : Path to the FAM file.
        step_name : Name of the step (e.g., "Start", "After Class1", "After Class2").
    
    Returns:
        dict: A dictionary containing counts for autosomal, X, Y, MT variants,
              total individuals, males, females, and ambiguous individuals.
    """
    autosomal = 0
    x_chr = 0
    y_chr = 0
    mt_chr = 0

    
    with open(bim_file, 'r') as f:
        for line in f:
            parts = line.strip().split()
            chrom = parts[0]
            if chrom.startswith("chr"):
                chrom_clean = chrom.replace("chr", "")
            else:
                chrom_clean = chrom
            if chrom_clean in ['X', '23', '25']:
                x_chr += 1
            elif chrom_clean in ['Y', '24']:
                y_chr += 1
            elif chrom_clean in ['MT', 'M', '26']:
                mt_chr += 1
            elif chrom_clean.isdigit():
                if 1 <= int(chrom_clean) <= 22:
                    autosomal += 1

    
    individuals = 0
    males = 0
    females = 0
    ambiguous = 0
    
    with open(fam_file, 'r') as f:
        for line in f:
            parts = line.strip().split()
            sex_code = int(parts[4])
            if sex_code == 1:
                males += 1
            elif sex_code == 2:
                females += 1
            elif sex_code == 0:
                ambiguous += 1
            individuals += 1
    
    shared_qc_table.append([
        study_name,
        step_name,
        autosomal,
        x_chr,
        y_chr,
        mt_chr,
        individuals,
        males,
        females,
        ambiguous,
    ])
    
    return {
        "autosomal": autosomal,
        "x_chr": x_chr,
        "y_chr": y_chr,
        "mt_chr": mt_chr,
        "individuals": individuals,
        "males": males,
        "females": females,
        "ambiguous": ambiguous,
    }
headers = [
    "Study Name", "Step Name", "Autosomal", "X Chr", "Y Chr", "MT Chr",
    "Individuals", "Males", "Females", "Ambiguous"
]

def save_qc_table(filename="QC_results.txt"):
    """
    Saves the shared QC table to a text file.

    Args:
        filename : Name of the output file.
    """
    with open(filename, "w") as f:
        # Write the header
        f.write("Study\tStep\tAutosomal\tX_Chr\tY_Chr\tMT_Chr\tIndividuals\tMales\tFemales\tAmbiguous\n")
        
        # Write each row of data
        for row in shared_qc_table:
            f.write("\t".join(map(str, row)) + "\n")
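
Once a few steps have been logged, it can help to see how much each step removed. The sketch below loads rows in the same layout as `shared_qc_table` into a pandas DataFrame and computes the change from the previous step. The rows, step names, and the two "Change" column names are made up for illustration.

```python
import pandas as pd

headers = [
    "Study Name", "Step Name", "Autosomal", "X Chr", "Y Chr", "MT Chr",
    "Individuals", "Males", "Females", "Ambiguous",
]

# Hypothetical rows in the same layout that count_variants appends to shared_qc_table.
example_rows = [
    ["Study1", "Start", 500000, 12000, 300, 50, 1000, 480, 510, 10],
    ["Study1", "After variant filter", 480000, 0, 0, 0, 1000, 480, 510, 10],
    ["Study1", "After sample filter", 480000, 0, 0, 0, 990, 476, 505, 9],
]

qc = pd.DataFrame(example_rows, columns=headers)

# Change relative to the previous step within each study (negative means removed).
count_cols = ["Autosomal", "Individuals"]
deltas = qc.groupby("Study Name")[count_cols].diff().fillna(0).astype(int)
qc["Autosomal Change"] = deltas["Autosomal"]
qc["Individuals Change"] = deltas["Individuals"]

print(qc[["Step Name", "Autosomal", "Autosomal Change", "Individuals", "Individuals Change"]])
```

Grouping by study keeps the first row of each study at zero change instead of comparing it with the last row of the previous study.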

In [10]:

class QCPlots:
    def __init__(self, prefix_path, chromosomes=""):
        self.prefix_path = prefix_path
        self.chromosomes = chromosomes
        self.working_directory = os.path.dirname(self.prefix_path)
    def check_sex_chromosomes(self):
       #Check if the dataset contains sex chromosomes (X and Y) and print a message.
        
        
        bim_file = f"{self.prefix_path}.bim"
        if not os.path.exists(bim_file):
            raise FileNotFoundError(f"BIM file not found: {bim_file}")

        
        has_x = False
        has_y = False

        # Check for X and Y chromosomes
        with open(bim_file, 'r') as f:
            for line in f:
                # Split the line into columns
                parts = line.strip().split()
                if len(parts) < 1:
                    continue  # Skip empty lines

                # Extract the chromosome column
                chrom = parts[0]

                # Check for X chromosome representations
                if chrom in ['chrX', 'X', '23']:
                    has_x = True

                # Check for Y chromosome representations
                if chrom in ['chrY', 'Y', '24']:
                    has_y = True

        # Print results
        if has_x:
            print("We've got X chromosomes!")
        else:
            print("No X chromosomes :(.")

        if has_y:
            print("We've got Y chromosomes!")
        else:
            print("No Y chromosomes :(.")
    def read_files(self, pattern):
        """Read files matching a pattern and concatenate them into a DataFrame."""
        files = [f for f in os.listdir(self.working_directory) if f.endswith(pattern)]
        if not files:
            raise FileNotFoundError(f"No files found with pattern {pattern}")
        # Raw string so that \s is passed to the regex engine instead of being treated as a string escape
        df_list = [pd.read_csv(os.path.join(self.working_directory, f), sep=r'\s+') for f in files]
        return pd.concat(df_list, ignore_index=True)

    def plot_sample_missingness(self, df, chromosome):
        """Plot sample missingness."""
        plt.figure(figsize=(10, 10))
        sns.scatterplot(x=1 - df['F_MISS'], y=df['IID'], color='dodgerblue')
        plt.xlabel('Call Rate')
        plt.yticks([])
        plt.title(f'Sample ({chromosome}) Missingness')
        plt.show()

    def plot_variant_missingness(self, df, chromosome):
        """Plot variant missingness."""
        plt.figure(figsize=(10, 10))
        sns.histplot(1 - df['F_MISS'], bins=60, color='dodgerblue', kde=False)
        plt.xlabel('Call Rate')
        plt.title(f'Variant ({chromosome}) Missingness')
        plt.show()

    def plot_heterozygosity(self, het_df):
        """Plot heterozygosity rate."""
        init_mean_het = het_df['F'].mean()
        init_sd_het = het_df['F'].std()
        plt.figure(figsize=(10, 10))
        sns.scatterplot(x=het_df['F'], y=het_df['IID'], color='dodgerblue')
        plt.axvline(init_mean_het, color='black')
        plt.axvline(init_mean_het + 3 * init_sd_het, color='firebrick', linestyle='--')
        plt.axvline(init_mean_het - 3 * init_sd_het, color='firebrick', linestyle='--')
        plt.xlabel('F')
        plt.yticks([])
        plt.title('Sample Heterozygosity')
        plt.show()

    def plot_maf(self, maf_df):
        """Plot minor allele frequency (MAF)."""
        plt.figure(figsize=(10, 10))
        sns.histplot(maf_df['MAF'], bins=60, color='dodgerblue', kde=False)
        plt.xlabel('MAF')
        plt.title('Minor Allele Frequency')
        plt.show()

    def plot_afs(self, freqs_df):
        """Plot allele frequency spectrum (AFS)."""
        freqs_df['C1'] = pd.to_numeric(freqs_df['C1'])
        freqs_table = freqs_df['C1'].value_counts().sort_index().head(41)
        plt.figure(figsize=(20, 10))
        sns.barplot(x=freqs_table.index, y=freqs_table.values, color='dodgerblue')
        plt.xlabel('Alternate Allele Count')
        plt.ylabel('Frequency')
        plt.title('Allele Frequency Spectrum')
        plt.show()

    def plot_hwe(self, hwe_df):
        """Plot Hardy-Weinberg Equilibrium (HWE)."""
        hwe_df = hwe_df[hwe_df['TEST'].str.contains('ALL', na=False)]
        plt.figure(figsize=(10, 10))
        sns.histplot(hwe_df['P'], bins=300, color='dodgerblue', kde=False)
        plt.xlabel('HWE p-value')
        plt.title('Hardy-Weinberg Equilibrium')
        plt.show()

    def make_plot(self):
        #Generate plots based on the chromosome type (all or sex chromosomes).
        if self.chromosomes == "sex":
            # Read and plot X chromosome data
            X_ind_miss = self.read_files('X.imiss')
            X_var_miss = self.read_files('X.lmiss')
            X_maf = self.read_files('X.frq')
            X_freqs = self.read_files('X.frq.counts')
            X_hwe = self.read_files('X.hwe')
            Y_ind_miss = self.read_files('Y.imiss')
            Y_var_miss = self.read_files('Y.lmiss')

            Y_maf = self.read_files('Y.frq')
            Y_freqs = self.read_files('Y.frq.counts')
            Y_hwe = self.read_files('Y.hwe')            

            self.plot_sample_missingness(X_ind_miss, 'X')
            self.plot_variant_missingness(X_var_miss, 'X')

            self.plot_maf(X_maf)
            self.plot_afs(X_freqs)
            self.plot_hwe(X_hwe)
            self.plot_sample_missingness(Y_ind_miss, 'Y')
            self.plot_variant_missingness(Y_var_miss, 'Y')
            self.plot_maf(Y_maf)
            self.plot_afs(Y_freqs)
            self.plot_hwe(Y_hwe)

        elif self.chromosomes == "all":
            # Read and plot other QC metrics for all
            ind_miss = self.read_files('.imiss')
            var_miss = self.read_files('.lmiss')
            het = self.read_files('.het')
            maf = self.read_files('.frq')
            freqs = self.read_files('.frq.counts')
            hwe = self.read_files('.hwe')

            self.plot_sample_missingness(ind_miss, 'All')
            self.plot_variant_missingness(var_miss, 'All')
            self.plot_heterozygosity(het)
            self.plot_maf(maf)
            self.plot_afs(freqs)
            self.plot_hwe(hwe)

        else:
            raise ValueError("Invalid chromosome type. Use 'all' or 'sex'.")
    def run_plink(self):
        """
        What you need:
        - self.prefix_path: Absolute path to the PLINK file prefix (e.g., '/path/to/plink/files/data').
        - chromosomes: Type of chromosomes to plot ("all" or "sex"). Default is "all".
        """
        # Run PLINK commands describing certain key quality control areas
        print("Running key quality control evaluations...")
        if self.chromosomes == "sex":
            # Read the chromosome codes once, closing the BIM file afterwards
            bim_file = f"{self.prefix_path}.bim"
            chroms = set()
            if os.path.exists(bim_file):
                with open(bim_file) as f:
                    chroms = {line.split()[0] for line in f if line.strip()}
            has_x = bool(chroms & {'X', 'chrX', '23'})
            has_y = bool(chroms & {'Y', 'chrY', '24'})

            if has_x:
                subprocess.run([plink_path, '--bfile', self.prefix_path, '--chr', 'X', '--missing', '--out', f"{self.prefix_path}_95_preQCX"], check=True)
                subprocess.run([plink_path, '--bfile', self.prefix_path, '--chr', 'X', '--freq', '--out', f"{self.prefix_path}_95_preQCX"], check=True)
                subprocess.run([plink_path, '--bfile', self.prefix_path, '--chr', 'X', '--freq', 'counts', '--out', f"{self.prefix_path}_95_preQCX"], check=True)
                subprocess.run([plink_path, '--bfile', self.prefix_path, '--chr', 'X', '--filter-females', '--hardy', '--out', f"{self.prefix_path}_95_preQCX"], check=True)
            else:
                print("No X chromosome found in the dataset. Skipping X chromosome processing.")

            if has_y:
                subprocess.run([plink_path, '--bfile', self.prefix_path, '--chr', 'Y', '--filter-males' , '--missing', '--out', f"{self.prefix_path}_95_preQCY"], check=True)
                subprocess.run([plink_path, '--bfile', self.prefix_path, '--chr', 'Y', '--freq', '--out', f"{self.prefix_path}_95_preQCY"], check=True)
                subprocess.run([plink_path, '--bfile', self.prefix_path, '--chr', 'Y', '--freq', 'counts', '--out', f"{self.prefix_path}_95_preQCY"], check=True)
                subprocess.run([plink_path, '--bfile', self.prefix_path, '--chr', 'Y', '--hardy', '--out', f"{self.prefix_path}_95_preQCY"], check=True)
            else:
                print("No Y chromosome found in the dataset. Skipping Y chromosome processing.")

        else:
            subprocess.run([plink_path, '--bfile', self.prefix_path, '--missing', '--out', f"{self.prefix_path}_95_preQC"], check=True)
            subprocess.run([plink_path, '--bfile', self.prefix_path, '--het', '--out', f"{self.prefix_path}_95_preQC"], check=True)
            subprocess.run([plink_path, '--bfile', self.prefix_path, '--freq', '--out', f"{self.prefix_path}_95_preQC"], check=True)
            subprocess.run([plink_path, '--bfile', self.prefix_path, '--freq', 'counts', '--out', f"{self.prefix_path}_95_preQC"], check=True)
            subprocess.run([plink_path, '--bfile', self.prefix_path, '--hardy', '--out', f"{self.prefix_path}_95_preQC"], check=True)
        
    def generate_plots(self):
        self.run_plink()
        print("Plotting results using QCPlots...")
        self.make_plot()

        # Clean up intermediate files
        print("Cleaning up intermediate files...")
        intermediate_files = glob.glob(f"{self.prefix_path}_95_preQC*")
        for file in intermediate_files:
            if os.path.exists(file):
                print(f"Removing: {file}")
                os.remove(file)
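
The heterozygosity plot draws dashed lines at three standard deviations from the mean F. If you would rather have a list of the samples falling outside those lines, a sketch like the following could be used. The table is hypothetical and only mimics the `IID` and `F` columns that `plot_heterozygosity` reads.

```python
import pandas as pd

# Hypothetical .het-style table: one row per sample with its inbreeding coefficient F.
het_df = pd.DataFrame(
    {
        "IID": [f"sample_{i}" for i in range(1, 21)],
        "F": [0.01, -0.02, 0.00, 0.02, -0.01, 0.01, 0.00, -0.01, 0.02, -0.02,
              0.01, 0.00, -0.01, 0.01, 0.00, 0.02, -0.02, 0.01, -0.01, 0.60],
    }
)

# Same bounds as the dashed lines in the plot: mean plus or minus 3 standard deviations.
mean_f = het_df["F"].mean()
sd_f = het_df["F"].std()
lower, upper = mean_f - 3 * sd_f, mean_f + 3 * sd_f

outliers = het_df[(het_df["F"] < lower) | (het_df["F"] > upper)]
print(outliers["IID"].tolist())
```

Keep in mind that a single extreme sample also inflates the standard deviation, so this rule is more forgiving in small datasets.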


# Now we need our files so that we can get started! Be sure to go back to our last tutorial together and check where you placed them! 
---
# The first thing we want to do is visualize what we are dealing with quality-wise! Let's start by looking at the entirety of the data.

In [None]:
#Add as many studies as needed! Just be sure to add them downstream!
#Use chromosomes='all' here; the 'sex' option is used in a later cell.
study1= QCPlots('prefix_path_study_1', chromosomes='all')
study2= QCPlots('prefix_path_study_2', chromosomes='all')

In [None]:
study1.generate_plots()
study2.generate_plots()

# So what are we observing? Do we see any noticeable trends in your data? Keep these findings in mind as we continue along!
- Quickly, let us check if there are sex chromosomes in your dataset! 

In [None]:
study1.check_sex_chromosomes()
study2.check_sex_chromosomes()

# If there were sex chromosomes in your dataset, run these next cells and let's see how they stand quality-wise!

In [None]:
study1= QCPlots('prefix_path_study_1', chromosomes='sex')
study2= QCPlots('prefix_path_study_2', chromosomes='sex')

In [None]:
study1.generate_plots()
study2.generate_plots()

# Now that we have an understanding of how we're starting off, we can really dive in and begin to process our data!
 - Let's get it coded up!
---

- When we pick thresholds here, we screen out variants or individuals whose missingness is above the given threshold (e.g., with a `--mind` threshold of 0.05, an individual with 7% overall missingness is screened out).
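
The threshold rule can be sketched on a small table shaped like the `IID` and `F_MISS` columns of a PLINK `.imiss` report. The values below are hypothetical; the point is that a sample sitting exactly at the threshold is kept, and only samples above it are removed.

```python
import pandas as pd

# Hypothetical per-sample missingness, shaped like the IID and F_MISS columns of a .imiss file.
imiss = pd.DataFrame(
    {
        "IID": ["sample_A", "sample_B", "sample_C", "sample_D"],
        "F_MISS": [0.00, 0.03, 0.05, 0.07],
    }
)

mind_threshold = 0.05

# Samples are removed only when their missingness is strictly above the threshold.
removed = imiss[imiss["F_MISS"] > mind_threshold]
kept = imiss[imiss["F_MISS"] <= mind_threshold]

print("Removed:", removed["IID"].tolist())
print("Kept:", kept["IID"].tolist())
```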

In [29]:
class Missingness:
    def __init__(self, study_name, base_name, out_dir):
        """
        Args:
            study_name : Name of the study.
            base_name : Base name of the input files (including full path, e.g., "path/to/data_base_name").
            out_dir : Directory where all output files will be saved.
        """
        self.study_name = study_name
        self.base_name = base_name
        self.out_dir = out_dir

        # Create the output directory if it doesn't exist
        os.makedirs(self.out_dir, exist_ok=True)

        # Initialize current BIM and FAM files
        self.current_base = base_name
        self.current_bim = f"{self.current_base}.bim"
        self.current_fam = f"{self.current_base}.fam"
        self.intermediate_files = []
        self.original_bim = f"{self.current_base}.bim"
        self.original_base = base_name
        self.original_fam = f"{self.current_base}.fam"
        count_variants(self.study_name, self.current_bim, self.current_fam, "Start")
        

    def filter_all(self, geno_threshold=0.05, mind_threshold=0.05):
        """Filter autosomes (chromosomes 1-22) for missingness."""
        # Filter for variant missingness (--geno); the "_all_sample" file suffix is kept from the original naming
        
        all_sample_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_all_sample")
        subprocess.run([plink_path, '--bfile', self.current_base, '--chr', '1-22', '--geno', str(geno_threshold), '--keep-allele-order', '--make-bed', '--out', all_sample_base], check=True)
        count_variants(self.study_name, f"{all_sample_base}.bim", f"{all_sample_base}.fam", "Autosomal Filter missing per variant")

        # Filter for sample missingness (--mind); the "_all_variant" file suffix is kept from the original naming
        all_variant_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_all_variant")
        subprocess.run([plink_path, '--bfile', all_sample_base, '--mind', str(mind_threshold), '--keep-allele-order', '--make-bed', '--out', all_variant_base], check=True)
        count_variants(self.study_name, f"{all_variant_base}.bim", f"{all_variant_base}.fam", "Autosomal Filter missing per individual")

        self.current_base = all_variant_base
        self.current_bim = f"{all_variant_base}.bim"
        self.current_fam = f"{all_variant_base}.fam"

        self.intermediate_files.extend([
            f"{all_sample_base}.bed",
            f"{all_sample_base}.bim",
            f"{all_sample_base}.fam",
            f"{all_sample_base}.log",
            f"{all_variant_base}.bed",
            f"{all_variant_base}.bim",
            f"{all_variant_base}.fam",
            f"{all_variant_base}.log"
        ])

    # Sex chromosomes (especially for older data) can be rather problematic and may need to be thrown out
    # based on missingness (lots of gaps!), so this next function looks a bit more involved to ensure we don't crash!
    def filter_sex_chromosomes(self, x_mind_threshold=0.1, x_geno_threshold=0.05, y_mind_threshold=0.85, y_geno_threshold=0.25):
        # Filter sex chromosomes (X and Y) for missingness.
        # Collect the chromosome codes once: running two any() scans over the same open file
        # would leave the second scan with a partly or fully consumed iterator.
        with open(self.original_bim, 'r') as f:
            chroms = {line.split()[0] for line in f if line.strip()}
        has_x = bool(chroms & {'X', 'chrX', '23', '25'})
        has_y = bool(chroms & {'Y', 'chrY', '24'})
        
        if has_x:
            try:
                # Split X chromosome into PAR and non-PAR regions
                split_x_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_splitX")
                subprocess.run([plink_path, '--bfile', self.original_base, '--chr', 'X', '--split-x', 'hg38', 'no-fail', '--keep-allele-order', '--make-bed', '--out', split_x_base], check=True, timeout=600)
                
                count_variants(self.study_name, f"{split_x_base}.bim", f"{split_x_base}.fam", "Split X Chromosome into PAR and non-PAR regions")

                # Filter for sample missingness (X)
                split_x_sample_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_splitX_sample")
                subprocess.run([plink_path, '--bfile', split_x_base, '--mind', str(x_mind_threshold), '--keep-allele-order', '--make-bed', '--out', split_x_sample_base], check=True, timeout=600)
                count_variants(self.study_name, f"{split_x_sample_base}.bim", f"{split_x_sample_base}.fam", "X Chromosome Filter missing per individual")

                # Filter for variant missingness (X)
                split_x_variant_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_splitX_variant")
                subprocess.run([plink_path, '--bfile', split_x_sample_base, '--geno', str(x_geno_threshold), '--keep-allele-order', '--make-bed', '--out', split_x_variant_base], check=True, timeout=600)
                count_variants(self.study_name, f"{split_x_variant_base}.bim", f"{split_x_variant_base}.fam", "X Chromosome Filter missing per variant")
                self.x_base = split_x_variant_base
            except subprocess.CalledProcessError as e:
                logging.error(f"Error processing X chromosome: {e}")
                self.x_base = None
                shared_qc_table.append([
                    self.study_name,
                    "X Chromosome QC--failed filtering",
                    0, 0, 0, 0, 0, 0, 0, 0
                    
                ])   
        if has_y:
            try:
                # Filter for sample missingness (Y)
                y_sample_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_Y_sample")
                subprocess.run([plink_path, '--bfile', self.original_base, '--chr', 'Y', '--filter-males', '--mind', str(y_mind_threshold), '--keep-allele-order', '--make-bed', '--out', y_sample_base], check=True, timeout=600)
                count_variants(self.study_name, f"{y_sample_base}.bim", f"{y_sample_base}.fam", "Y Chromosome Filter missing per individual")

                # Filter for variant missingness (Y)
                y_variant_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_Y_variant")
                subprocess.run([plink_path, '--bfile', y_sample_base, '--geno', str(y_geno_threshold), '--keep-allele-order', '--make-bed', '--out', y_variant_base], check=True, timeout=600)
                count_variants(self.study_name, f"{y_variant_base}.bim", f"{y_variant_base}.fam", "Y Chromosome Filter missing per variant")
                self.y_base = y_variant_base 
            except subprocess.CalledProcessError as e:
                logging.error(f"Error processing Y chromosome: {e}")
                self.y_base = None
                shared_qc_table.append([
                    self.study_name,
                    "Y Chromosome QC--failed filtering",
                    0, 0, 0, 0, 0, 0, 0, 0
                    
                ])   
  
                       
    def get_output_files(self):
        return (
            self.current_base,  # Always return self.current_base
            getattr(self, 'y_base', None),  # Return self.y_base if it exists, otherwise None
            getattr(self, 'x_base', None)   # Return self.x_base if it exists, otherwise None
        )

In [None]:
study1_M = Missingness('Study1', 'prefix_path_study_1', 'path/to/output/directory')
study2_M = Missingness('Study2', 'prefix_path_study_2', 'path/to/output/directory')

In [None]:
study1_M.filter_all()
study2_M.filter_all()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

In [None]:
study1_M.filter_sex_chromosomes()
study2_M.filter_sex_chromosomes()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

# How are we looking? 
- Did you have a higher missingness than you would expect? 
    - Losing variants is never fun, but in the next sections take a look at the genotyping rate in the PLINK output; it should be a lot higher!

# Let us get the Hardy-Weinberg screenings up! 
- When we are handling the X chromosome, we only screen the X chromosomes of biological females. 
    - Why? Because biological males have only one X chromosome (making them hemizygous), the standard test for Hardy-Weinberg Equilibrium is thrown off.

---

- Now let's get this coded up. When running HWE screens, we remove variants whose HWE p-value is more significant (smaller) than the given threshold.
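
To see which variants a given threshold would catch, the sketch below applies it to a small table shaped like the `SNP`, `TEST`, and `P` columns of a PLINK `.hwe` report. The rows are hypothetical, and only the `ALL` test rows are considered, mirroring what `plot_hwe` does.

```python
import pandas as pd

# Hypothetical rows shaped like a PLINK .hwe report (only the columns used here).
hwe = pd.DataFrame(
    {
        "SNP": ["rs1", "rs2", "rs3", "rs4"],
        "TEST": ["ALL", "ALL", "ALL", "ALL"],
        "P": [0.43, 1e-30, 3e-8, 1e-26],
    }
)

hwe_threshold = 1e-25

# Variants with a p-value below the threshold are flagged for removal.
flagged = hwe.loc[(hwe["TEST"] == "ALL") & (hwe["P"] < hwe_threshold), "SNP"].tolist()
print(flagged)
```

With a threshold as lenient as 1e-25, a variant at 3e-8 (`rs3`) still passes, so only the most extreme deviations are removed.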

In [31]:
class HWEProcessor:
    def __init__(self, study_name, base_name, out_dir, x_base="", y_base=""):
        """
        Args:
            study_name : Name of the study.
            base_name : Base name of the input files (including full path, e.g., "path/to/data_base_name").
            out_dir : Directory where all output files will be saved.
            x_base : Path to the X chromosome files (if available).
            y_base : Path to the Y chromosome files (if available).
        """
        self.study_name = study_name
        self.base_name = base_name
        self.out_dir = out_dir

        # Create the output directory if it doesn't exist
        os.makedirs(self.out_dir, exist_ok=True)

        # Initialize current BIM and FAM files
        self.current_base = os.path.join(self.out_dir, os.path.basename(base_name))
        self.current_bim = f"{self.current_base}.bim"
        self.current_fam = f"{self.current_base}.fam"

        # Track sex chromosome files
        self.x_base = x_base
        self.y_base = y_base

    def filter_all(self, hwe_threshold=1e-25):
        """Filter all for Hardy-Weinberg Equilibrium."""
        all_hwe_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_all_hwe")
        subprocess.run([plink_path, '--bfile', self.current_base, '--hwe', str(hwe_threshold), '--keep-allele-order', '--make-bed', '--out', all_hwe_base], check=True)
        count_variants(self.study_name, f"{all_hwe_base}.bim", f"{all_hwe_base}.fam", f"Autosomal Hardy-Weinberg Filtering ({hwe_threshold})")

        self.current_base = all_hwe_base
        self.current_bim = f"{all_hwe_base}.bim"
        self.current_fam = f"{all_hwe_base}.fam"

    def filter_sex_chromosomes(self, hwe_threshold=1e-25):
        """Filter sex chromosomes (X and Y) for Hardy-Weinberg Equilibrium."""
        if self.x_base:
            # Filter X chromosome for HWE
            # HWE can only be tested on biologically female X chromosomes since biological males are hemizygous for X!
            x_hwe_males_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_splitX_hwe_males")
            subprocess.run([plink_path, '--bfile', self.x_base, '--filter-males', '--keep-allele-order', '--make-bed', '--out', x_hwe_males_base], check=True)

            x_hwe_females_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_splitX_hwe_females")
            subprocess.run([plink_path, '--bfile', self.x_base, '--filter-females', '--hwe', str(hwe_threshold), '--keep-allele-order', '--make-bed', '--out', x_hwe_females_base], check=True)

            x_hwe_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_splitX_hwe")
            subprocess.run([plink_path, '--bfile', x_hwe_males_base, '--bmerge', x_hwe_females_base, '--keep-allele-order', '--make-bed', '--out', x_hwe_base], check=True)
            count_variants(self.study_name, f"{x_hwe_base}.bim", f"{x_hwe_base}.fam", f"X Chromosome Hardy-Weinberg Filtering ({hwe_threshold})")

            # Update X chromosome file
            self.x_base = x_hwe_base

        if self.y_base:
            # Filter Y chromosome for HWE
            y_hwe_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_Y_hwe")
            subprocess.run([plink_path, '--bfile', self.y_base, '--hwe', str(hwe_threshold), '--filter-males', '--keep-allele-order', '--make-bed', '--out', y_hwe_base], check=True)
            count_variants(self.study_name, f"{y_hwe_base}.bim", f"{y_hwe_base}.fam", f"Y Chromosome Hardy-Weinberg Filtering ({hwe_threshold})")

            # Update Y chromosome file
            self.y_base = y_hwe_base

    def get_output_files(self):
        return (
            self.current_base,  # Always return self.current_base
            getattr(self, 'y_base', None),  # Return self.y_base if it exists, otherwise None
            getattr(self, 'x_base', None)   # Return self.x_base if it exists, otherwise None
        )

In [None]:
# Take note of the output from the last section! The tuple is ordered (autosomal base, Y base, X base).
final_base = study1_M.get_output_files()
print(f"Final base file: {final_base}")

In [None]:
# Take note of the output from the last section! The tuple is ordered (autosomal base, Y base, X base).
final_base = study2_M.get_output_files()
print(f"Final base file: {final_base}")

In [None]:
study1_H = HWEProcessor('Study1', 'prefix_path_study_1', 'path/to/output/directory', 'path/to/x_base_if_applicable', 'path/to/y_base_if_applicable - if none then delete')
study2_H = HWEProcessor('Study2', 'prefix_path_study_2', 'path/to/output/directory', 'path/to/x_base_if_applicable', 'path/to/y_base_if_applicable - if none then delete')

In [None]:
study1_H.filter_all()
study2_H.filter_all()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

In [None]:
study1_H.filter_sex_chromosomes()
study2_H.filter_sex_chromosomes()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

# Alright! We are making great progress, eh? (This tutorial was written in Canada)
- The majority of the human genome is in Hardy-Weinberg Equilibrium, so we shouldn't encounter ***too*** many variants going the way of the dodo
    
    - Now let's snuff out those duplicates!

In [33]:
class DuplicateProcessor:
    def __init__(self, study_name, base_name, out_dir, x_base=None, y_base=None):
        """
        Args:
            study_name : Name of the study.
            base_name : Base name of the input files (including full path).
            out_dir : Directory where all output files will be saved.
            x_base : Path to the X chromosome files (if available).
            y_base : Path to the Y chromosome files (if available).
        """
        self.study_name = study_name
        self.base_name = base_name
        self.out_dir = out_dir

        # Create the output directory if it doesn't exist
        os.makedirs(self.out_dir, exist_ok=True)

        # Initialize current BIM and FAM files
        # Start from the input prefix as given; remove_duplicates() reads self.base_name directly
        self.current_base = base_name
        self.current_bim = f"{self.current_base}.bim"
        self.current_fam = f"{self.current_base}.fam"

        # Track sex chromosome files
        self.x_base = x_base
        self.y_base = y_base

    def remove_duplicates(self):
        """Remove duplicates in all and sex chromosomes."""
        all_dup_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_dup")
        subprocess.run([plink_path, '--bfile', self.base_name, '--list-duplicate-vars', 'ids-only', 'suppress-first', '--out', 'temp'], check=True)
        subprocess.run([plink_path, '--bfile', self.base_name, '--exclude', 'temp.dupvar', '--keep-allele-order', '--make-bed', '--out', all_dup_base], check=True)
        count_variants(self.study_name, f"{all_dup_base}.bim", f"{all_dup_base}.fam", "Duplicate SNP Removal")

        self.current_base = all_dup_base
        self.current_bim = f"{all_dup_base}.bim"
        self.current_fam = f"{all_dup_base}.fam"

        # Remove duplicates in sex chromosomes (if present)
        if self.x_base:
            x_dup_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_X_dup")
            subprocess.run([plink_path, '--bfile', self.x_base, '--list-duplicate-vars', 'ids-only', 'suppress-first', '--out', 'temp'], check=True)
            subprocess.run([plink_path, '--bfile', self.x_base, '--exclude', 'temp.dupvar', '--keep-allele-order', '--make-bed', '--out', x_dup_base], check=True)
            count_variants(self.study_name, f"{x_dup_base}.bim", f"{x_dup_base}.fam", "Duplicate SNP (X) Removal")

            self.x_base = x_dup_base

        if self.y_base:
            y_dup_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_Y_dup")
            subprocess.run([plink_path, '--bfile', self.y_base, '--list-duplicate-vars', 'ids-only', 'suppress-first', '--out', 'temp'], check=True)
            subprocess.run([plink_path, '--bfile', self.y_base, '--exclude', 'temp.dupvar', '--keep-allele-order', '--make-bed', '--out', y_dup_base], check=True)
            count_variants(self.study_name, f"{y_dup_base}.bim", f"{y_dup_base}.fam", "Duplicate SNP (Y) Removal")

            self.y_base = y_dup_base

    def merge_sex_chromosomes(self):
        """Merge sex chromosomes back with all."""
        merge_base = f"{self.base_name}_merge"
        if self.x_base and self.y_base:
            # Merge X and Y with all
            intermediate_base = os.path.join(self.out_dir, "intermediate")
            subprocess.run([plink_path, '--bfile', self.current_base, '--bmerge', self.x_base, '--keep-allele-order', '--make-bed', '--out', intermediate_base], check=True)
            subprocess.run([plink_path, '--bfile', intermediate_base, '--bmerge', self.y_base, '--keep-allele-order', '--make-bed', '--out', merge_base], check=True)
            count_variants(self.study_name, f"{merge_base}.bim", f"{merge_base}.fam", "Sex Chromosome merger")
        elif self.x_base:
            # Merge X with all
            subprocess.run([plink_path, '--bfile', self.current_base, '--bmerge', self.x_base, '--keep-allele-order', '--make-bed', '--out', merge_base], check=True)
            count_variants(self.study_name, f"{merge_base}.bim", f"{merge_base}.fam", "X but no Y Merger")
        elif self.y_base:
            # Merge Y with all
            subprocess.run([plink_path, '--bfile', self.current_base, '--bmerge', self.y_base, '--keep-allele-order', '--make-bed', '--out', merge_base], check=True)
            count_variants(self.study_name, f"{merge_base}.bim", f"{merge_base}.fam", "No X but Y Merger")
        else:
            # No sex chromosomes survived QC, so there is nothing to merge
            count_variants(self.study_name, f"{self.current_base}.bim", f"{self.current_base}.fam", "No sex chromosome survived QC")
            return

        # Point the current files at the merged dataset so get_output_files() returns it
        self.current_base = merge_base
        self.current_bim = f"{merge_base}.bim"
        self.current_fam = f"{merge_base}.fam"
    
    def get_output_files(self):
        return self.current_base

In [None]:
# Take note of your last output from the last section!
final_base = study1_H.get_output_files()
print(f"Final base file: {final_base}")

In [None]:
# Take note of your last output from the last section!
final_base = study2_H.get_output_files()
print(f"Final base file: {final_base}")

In [None]:
study1_D = DuplicateProcessor('Study1', 'prefix_path_study_1', 'path/to/output/directory', 'path/to/x_base_if_applicable', 'path/to/y_base_if_applicable')
study2_D = DuplicateProcessor('Study2', 'prefix_path_study_2', 'path/to/output/directory', 'path/to/x_base_if_applicable', 'path/to/y_base_if_applicable')

# Let's start with just removing duplicate variants!

In [None]:
study1_D.remove_duplicates()
study2_D.remove_duplicates()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

# We have no need to keep our sex chromosomes separate anymore, so let us bring those back into the fold before our final step!

In [None]:
study1_D.merge_sex_chromosomes()
study2_D.merge_sex_chromosomes()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

# Last step!
- Duplicate samples are not especially common, but it is always worth checking!

---

- Here we are going to screen for "monozygotic twins" (i.e., pairs of samples that are genetically the exact same individual), and we'll only keep the individual with more non-missing genotype calls
  - If both have exactly the same number of non-missing calls, the pair is written out for manual curation.
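
The keep-or-drop rule described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the notebook's actual implementation: the kinship rows, the call counts, and the helper name `choose_twin_to_remove` are hypothetical, and the 0.354 KING kinship cutoff is the one used in this notebook's `TwinProcessor`.

```python
KING_TWIN_THRESHOLD = 0.354  # pairs above this kinship are treated as duplicates/twins

def choose_twin_to_remove(pair, non_missing):
    """Return the sample to drop from a twin pair, or None when the pair is tied."""
    twin1, twin2 = pair
    if non_missing[twin1] > non_missing[twin2]:
        return twin2  # twin1 is more complete, so twin2 goes
    if non_missing[twin1] < non_missing[twin2]:
        return twin1  # twin2 is more complete, so twin1 goes
    return None       # identical completeness: needs manual curation

# Hypothetical kinship rows: ((FID, IID), (FID, IID), KING kinship)
kinship_rows = [
    (("F1", "S1"), ("F1", "S2"), 0.49),  # twin pair, S2 is more complete
    (("F2", "S3"), ("F3", "S4"), 0.26),  # close relatives, not twins
    (("F4", "S5"), ("F4", "S6"), 0.50),  # twin pair, tied
]

# Hypothetical non-missing genotype counts per sample
non_missing = {
    ("F1", "S1"): 9800, ("F1", "S2"): 9950,
    ("F2", "S3"): 9900, ("F3", "S4"): 9900,
    ("F4", "S5"): 9700, ("F4", "S6"): 9700,
}

to_remove, tied = [], []
for sample1, sample2, kinship in kinship_rows:
    if kinship <= KING_TWIN_THRESHOLD:
        continue  # not similar enough to be the same individual
    loser = choose_twin_to_remove((sample1, sample2), non_missing)
    if loser is None:
        tied.append((sample1, sample2))
    else:
        to_remove.append(loser)

print("Remove:", to_remove)  # Remove: [('F1', 'S1')]
print("Tied:", tied)         # Tied: [(('F4', 'S5'), ('F4', 'S6'))]
```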

In [35]:

class TwinProcessor:
    def __init__(self, study_name, base_name, out_dir):
        """
        Args:
            study_name : Name of the study.
            base_name : Base name of the input files (including full path, e.g., "path/to/data_base_name").
            out_dir : Directory where all output files will be saved.
        """
        self.study_name = study_name
        self.base_name = base_name
        self.out_dir = out_dir

        # Create the output directory if it doesn't exist
        os.makedirs(self.out_dir, exist_ok=True)
        self.intermediate_files = []
        # Initialize current BIM and FAM files
        self.current_base = base_name  # keep the full path so the .bim/.fam paths resolve
        self.current_bim = f"{self.current_base}.bim"
        self.current_fam = f"{self.current_base}.fam"

    def identify_twins(self):
        # Create a temporary file for processing
        temp_base = os.path.join(self.out_dir, "temp")
        subprocess.run([plink_path, "--bfile", self.base_name, "--keep-allele-order", "--make-bed", "--out", temp_base], check=True)

        # Calculate relatedness using PLINK2
        subprocess.run([plink2_path, "--bfile", temp_base, "--make-king-table", "--king-table-filter", "0.25", "--out", "relatedness"], check=True)

        # Create twins.txt file with a header
        twins_file_path = os.path.join(self.out_dir, "twins.txt")
        with open(twins_file_path, "w") as twins_file:
            twins_file.write("FID1\tIID1\tFID2\tIID2\n")  # Write header

            # Extract twin pairs with KING coefficient > 0.354
            if os.path.exists("relatedness.kin0"):
                with open("relatedness.kin0", "r") as kin_file:
                    next(kin_file)  # Skip the header line
                    for line in kin_file:
                        fields = line.strip().split()
                        if float(fields[7]) > 0.354:  # KING coefficient in column 8
                            twins_file.write(f"{fields[0]}\t{fields[1]}\t{fields[2]}\t{fields[3]}\n")

        self.intermediate_files.append(f"{temp_base}.*")
        self.intermediate_files.append("relatedness.kin0")
        self.intermediate_files.append(twins_file_path)
        self.twins_file = twins_file_path

    def screen_twins(self):
        # Ensure twins.txt exists
        if not os.path.exists(self.twins_file):
            print("Twins file does not exist. Creating an empty file.")
            with open(self.twins_file, "w") as twins_file:
                twins_file.write("FID1\tIID1\tFID2\tIID2\n")  # Write header

        # Run PLINK to calculate missingness
        subprocess.run([plink_path, "--bfile", self.base_name, "--missing", "--out", "missingness"], check=True)

        # Read missingness data
        # PLINK 1.9 .imiss columns: FID IID MISS_PHENO N_MISS N_GENO F_MISS
        non_missing_calls = {}
        if os.path.exists("missingness.imiss"):
            with open("missingness.imiss", "r") as imiss_file:
                next(imiss_file)  # Skip the header line
                for line in imiss_file:
                    fields = line.strip().split()
                    fid, iid = fields[0], fields[1]
                    # Non-missing calls = genotypes considered (N_GENO) minus missing calls (N_MISS)
                    non_missing_calls[(fid, iid)] = int(fields[4]) - int(fields[3])
            self.intermediate_files.append("missingness.imiss")

        # Initialize lists for twins to remove and tied twins
        twins_to_remove = []
        twins_tied = []

        # Process twins file
        with open(self.twins_file, "r") as twins_file:
            header = next(twins_file)  # Read header
            for line in twins_file:
                fields = line.strip().split()
                twin1 = (fields[0], fields[1])
                twin2 = (fields[2], fields[3])
                if twin1 in non_missing_calls and twin2 in non_missing_calls:
                    # Drop the twin with fewer non-missing calls; keep the more complete sample
                    if non_missing_calls[twin1] < non_missing_calls[twin2]:
                        twins_to_remove.append(twin1)
                    elif non_missing_calls[twin1] > non_missing_calls[twin2]:
                        twins_to_remove.append(twin2)
                    else:
                        twins_tied.append((twin1, twin2))

        # Create twins_to_remove.txt with header
        remove_file_path = os.path.join(self.out_dir, "twins_to_remove.txt")
        with open(remove_file_path, "w") as remove_file:
            remove_file.write("FID\tIID\n")  # Write header
            for fid, iid in twins_to_remove:
                remove_file.write(f"{fid}\t{iid}\n")

        # Create twins_tied.txt with header
        tied_file_path = os.path.join(self.out_dir, "twins_tied.txt")
        with open(tied_file_path, "w") as tied_file:
            tied_file.write(header)  # Write header from twins.txt
            for twin1, twin2 in twins_tied:
                tied_file.write(f"{twin1[0]}\t{twin1[1]}\t{twin2[0]}\t{twin2[1]}\n")

        self.tied = tied_file_path
        self.remove = remove_file_path


    def remove_twins(self):
        """Remove flagged twins from the dataset."""
        qc_base = os.path.join(self.out_dir, f"{os.path.basename(self.base_name)}_95_QC")
        subprocess.run([plink_path, "--bfile", self.base_name, "--remove", self.remove, "--keep-allele-order", "--make-bed", "--out", qc_base], check=True)

        # Update current BIM and FAM files
        self.current_base = qc_base
        self.current_bim = f"{qc_base}.bim"
        self.current_fam = f"{qc_base}.fam"
        count_variants(self.study_name, self.current_bim, self.current_fam, "Removal of Monozygotic Twins")

        # Count tied pairs (excluding the header line) that still need manual curation
        with open(self.tied, "r") as tied_file:
            total_manual = max(sum(1 for _ in tied_file) - 1, 0)
        print(f"Total number of twin pairs needing manual curation: {total_manual}")

        # Print total number of variants (one per line in the BIM file)
        with open(self.current_bim, "r") as bim_file:
            total_variants = sum(1 for _ in bim_file)
        print(f"Total number of variants: {total_variants}")

        print("Removing intermediate files...")
        for pattern in self.intermediate_files:
            for file_path in glob.glob(pattern):
                print(f"Removing {file_path}")
                try:
                    os.remove(file_path)
                except FileNotFoundError:
                    print(f"File {file_path} not found. Skipping...")
                except Exception as e:
                    print(f"Error removing {file_path}: {e}")
        for file_path in glob.glob(os.path.join(self.out_dir, "*")):
            # Keep the final "_95_QC" outputs and the tied-twins list, which is needed for manual curation
            if "_95_QC" not in os.path.basename(file_path) and file_path != self.tied:
                print(f"Removing {file_path}")
                try:
                    os.remove(file_path)
                except FileNotFoundError:
                    print(f"File {file_path} not found. Skipping...")
                except Exception as e:
                    print(f"Error removing {file_path}: {e}")
        print(f"You have successfully screened the data for {self.study_name}, the data is squeaky clean! Happy Hunting!")


In [None]:
final_base = study1_D.get_output_files()
print(f"Final base file: {final_base}")

In [None]:
final_base = study2_D.get_output_files()
print(f"Final base file: {final_base}")

In [17]:
study1_T = TwinProcessor('Study1', 'prefix_path_study_1', 'path/to/output/directory')
study2_T = TwinProcessor('Study2', 'prefix_path_study_2', 'path/to/output/directory')

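- Reminder! Be sure to send each study to a different output directory. The intermediate files use fixed names (e.g., `twins.txt`), and the cleanup step deletes almost everything in the output directory except the final `_95_QC` files!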
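
One simple way to follow this reminder is to derive each study's output directory from a shared root. The sketch below is only a suggestion: the helper name `study_output_dir` is hypothetical, and the temporary directory stands in for your real output location.

```python
import os
import tempfile

def study_output_dir(root, study_name):
    """Build and create a dedicated output directory for one study."""
    out_dir = os.path.join(root, study_name)
    os.makedirs(out_dir, exist_ok=True)
    return out_dir

# Stand-in root for the demo; replace with your real output location
root = tempfile.mkdtemp()
out_dirs = {name: study_output_dir(root, name) for name in ("Study1", "Study2")}
print(out_dirs)
```

Each returned path can then be passed as the `out_dir` argument when creating the `TwinProcessor` for that study.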

In [None]:
study1_T.identify_twins()
study2_T.identify_twins()

# Do we see any twins? If not, that is OK! Still run the last cell; it will clean up those pesky intermediate files!
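
To answer that question without opening the file by hand, you could read the tab-separated `twins.txt` that `identify_twins()` writes (its path is stored in the `twins_file` attribute). The helper name `read_twin_pairs` is hypothetical, and the demo runs on a throwaway file so it does not depend on your data. Do this before the final cell, since cleanup removes `twins.txt`.

```python
import csv
import os
import tempfile

def read_twin_pairs(twins_path):
    """Return the twin pairs listed in a tab-separated twins file (header skipped)."""
    with open(twins_path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        next(reader, None)  # skip the FID1/IID1/FID2/IID2 header
        return [tuple(row) for row in reader if row]

# Demo on a throwaway file; for real data use e.g. study1_T.twins_file
demo_dir = tempfile.mkdtemp()
twins_path = os.path.join(demo_dir, "twins.txt")
with open(twins_path, "w") as handle:
    handle.write("FID1\tIID1\tFID2\tIID2\n")
    handle.write("F1\tS1\tF1\tS2\n")

pairs = read_twin_pairs(twins_path)
print(f"{len(pairs)} twin pair(s) found: {pairs}")
```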

In [None]:
study1_T.screen_twins()
study1_T.remove_twins()
study2_T.screen_twins()
study2_T.remove_twins()
print(tabulate(shared_qc_table, headers=headers, tablefmt="pretty"))

# 🎉 **You're on fire!** 🎉

---

### **Great job!** We have now ensured that each study in our reference panel (or each array within a single study) is of the utmost quality! 

---

## **What’s Next?**  
Now that we are confident that our data is as clean as possible, we can finally bring them together to form one comprehensive reference panel!

**Here’s the plan:**  
- We are going to affix suffixes to our sample IDs for ease of backtracking
- We are going to merge the data together solely at the intersection of variants
- We are going to generate a Manhattan plot to check for evidence of a batch effect between studies

---

### **You’re crushing it!**  
Head over to the next section to get this data finally merged together. See you there, and keep up the fantastic work! 💪  