#### Detection of non-B DNA-forming motifs in the peri/centromeric region

Before running this notebook, make sure you have the following:
1. A .bed file containing the genomic coordinates of the regions of interest
2. The .fasta file of the genome of interest, containing the DNA sequences of those regions
3. The non-B DNA Motif Search Tool (nBMST): https://github.com/abcsFrederick/non-B_gfa

In [5]:
# Import the necessary libraries
import pandas as pd
from Bio import SeqIO

In [2]:
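# Read the .bed file with the genomic coordinates of the regions of interest.
# The file is expected to be tab-separated with a header row naming the columns
# Chromosome, Start, End and Region, which the extraction step relies on.
# This path is an example; change it to match your system.
bed = pd.read_csv('CHM13_centric_transition.bed', sep='\t')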
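
The extraction function later in this notebook looks up each row by the column names `Chromosome`, `Start`, `End` and `Region`, so the file needs a header line containing them. A standard BED file has no header, in which case `pd.read_csv` would treat the first record as the column names. The following is a minimal sketch of a check that can be run before loading the file. `missing_bed_columns` is a hypothetical helper written for illustration; it is not part of pandas or nBMST.

```python
REQUIRED_COLUMNS = ("Chromosome", "Start", "End", "Region")


def missing_bed_columns(bed_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the first line.

    Hypothetical helper: assumes a tab-separated file whose first line is
    meant to be a header row.
    """
    with open(bed_path) as handle:
        header = handle.readline().rstrip("\r\n").split("\t")
    return [name for name in required if name not in header]
```

An empty list means all four columns are present. If the file turns out to be headerless, one option is to pass `header=None` together with `names=[...]` to `pd.read_csv`, assuming the columns appear in that order.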

In [6]:
# Function to read the FASTA file
def read_fasta_file(fasta_path):
    return {record.id: str(record.seq) for record in SeqIO.parse(fasta_path, "fasta")}

fasta_path = "./ALLCHROMOSOMES/CHM13.fasta"
genome_sequences = read_fasta_file(fasta_path)

In [None]:
# Function to extract and save sequences
def extract_and_save_sequences(df, genome_sequences, output_path):
    with open(output_path, "w") as output_file:
        for index, row in df.iterrows():
            chrom_id = row["Chromosome"]
            start_pos = row["Start"]
            end_pos = row["End"]
            region = row["Region"]
            
            if chrom_id not in genome_sequences:
                print(f"Chromosome {chrom_id} not found in the FASTA file.")
                continue
            
            # Assumes BED-style 0-based, half-open coordinates, which map directly onto Python slicing
            seq = genome_sequences[chrom_id][start_pos:end_pos]
            output_file.write(f">{chrom_id}_{start_pos}_{end_pos}_{region}\n")
            output_file.write(str(seq) + "\n\n")

output_path = "extracted_sequences.fasta"
extract_and_save_sequences(df=bed, genome_sequences=genome_sequences, output_path=output_path)
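
Python slicing silently truncates when `End` is larger than the chromosome length, so a region can come out shorter than requested without any error. The sketch below re-reads the extracted FASTA file and reports records whose sequence length differs from `End - Start`. It assumes the header layout written above (`>{chrom}_{start}_{end}_{region}`) and that the region label does not itself begin with a number followed by an underscore. `find_length_mismatches` and `read_record_lengths` are hypothetical names used only for this illustration.

```python
import re

# Header layout written by the extraction step: >{chrom}_{start}_{end}_{region}
HEADER_RE = re.compile(r"^>(.+?)_(\d+)_(\d+)_(.*)$")


def read_record_lengths(fasta_path):
    """Return a list of [header, sequence_length] pairs, skipping blank lines."""
    records = []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                records.append([line, 0])
            elif records:
                records[-1][1] += len(line)
    return records


def find_length_mismatches(fasta_path):
    """Return (header, expected_length, actual_length) for suspicious records.

    expected_length is None when the header does not follow the assumed layout.
    """
    mismatches = []
    for header, length in read_record_lengths(fasta_path):
        match = HEADER_RE.match(header)
        if match is None:
            mismatches.append((header, None, length))
            continue
        expected = int(match.group(3)) - int(match.group(2))
        if length != expected:
            mismatches.append((header, expected, length))
    return mismatches
```

Calling `find_length_mismatches("extracted_sequences.fasta")` returns an empty list when every region was extracted at full length.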

Run the following command in the terminal to run nBMST v2.0:

In [None]:
#./gfa -seq <input_fasta_filename> -out <output_file_prefix> [optional_switches]

In [None]:
mkdir output_folder
./gfa -seq extracted_sequences.fasta -out ./output_folder/output

The output folder will contain one .tsv file for each type of non-B DNA motif selected.
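
A quick way to inspect these results is to count the rows in each .tsv file. The sketch below makes two assumptions that this notebook does not confirm: every .tsv file starts with a single header line, and each following non-empty line describes one motif. The exact file names depend on the tool, so the files are matched by the output prefix only. `count_motif_rows` is a hypothetical helper, not part of nBMST.

```python
import csv
import glob
import os


def count_motif_rows(output_prefix):
    """Map each .tsv file starting with output_prefix to its number of data rows.

    Assumption: the first line of every file is a header and is not counted.
    """
    counts = {}
    for path in sorted(glob.glob(output_prefix + "*.tsv")):
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            next(reader, None)  # skip the assumed header line
            counts[os.path.basename(path)] = sum(1 for row in reader if row)
    return counts
```

With the command above, the prefix would be `"./output_folder/output"`. A count of zero means the file holds only a header line, or nothing at all.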