# Vista Enhancer
    
The VISTA Enhancer Browser is a central resource for experimentally validated human and mouse noncoding fragments with gene enhancer activity as assessed in transgenic mice. Most of these noncoding elements were selected for testing based on their extreme conservation in other vertebrates or epigenomic evidence (ChIP-Seq) of putative enhancer marks. The results of this in vivo enhancer screen are provided through this publicly available website.

## URL: https://enhancer.lbl.gov/

### Data download - 06/12/2023

#### Query selection

Query the database through the 'Advanced search' form: https://enhancer.lbl.gov/cgi-bin/imagedb3.pl?form=ext_search&show=1

1) Expression pattern: All
2) Status: 'Positives' only
3) Organism: 'Human' only

The query returns 1002 elements, together with a data download link:

https://enhancer.lbl.gov/cgi-bin/imagedb3.pl?action=search;page=1;search.org=Human;search.status=Positives;search.gene=;form=ext_search;search.result=yes;page_size=100;show=1;search.sequence=1

The output is preformatted text, delivered inside an HTML 'pre' element.
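
Because the records sit inside a 'pre' element, the surrounding HTML can be dropped before any parsing. The following is a minimal sketch of that step, not the method used later in this document; `extract_pre_text` and the shortened `sample_html` are hypothetical and only illustrate the shape of the page.

```python
import re

def extract_pre_text(html: str) -> str:
    """Return the text inside the first <pre>...</pre> element, or '' if there is none."""
    match = re.search(r"<pre[^>]*>(.*?)</pre\s*>", html, re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else ""

# Hypothetical, shortened page content (illustration only, not real VISTA data).
sample_html = (
    "<html><body><pre>"
    ">Human|chr1:100-200 | element 1 | positive | limb[3/5]\n"
    "ACGT\n"
    "</pre></body></html>"
)

pre_text = extract_pre_text(sample_html)
print(pre_text)
```

Working on the extracted text avoids having to clean HTML remnants out of the last record afterwards.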



## Citing the Enhancer Browser

The following publication should be referenced for any analysis in which data from the VISTA Enhancer Browser was used:

Visel A, Minovitsky S, Dubchak I, Pennacchio LA (2007). VISTA Enhancer Browser-a database of tissue-specific human enhancers. Nucleic Acids Res 35:D88-92

When referring to specific datasets within the Enhancer Browser, please report the respective dataset ID, e.g. hs112 or mm23.

## Data Extraction and Processing

In [19]:
#!/usr/bin/env python

import requests
import re
import subprocess
import os
import glob

In [20]:
# Function to get the strand information using samtools faidx.
# Requires samtools on PATH and the hg19.fa reference in the working directory.
def get_strand(sequence):
    fa = "hg19.fa"
    output = subprocess.check_output(["samtools", "faidx", fa, sequence, "--mark-strand", "sign"]).decode("utf-8")
    # The first output line is the FASTA header, e.g. >chr16:78510608-78511944(+)
    header = output.split('\n')[0].replace('>', '')
    # Extract the value inside the parentheses
    value = re.search(r'\((.*?)\)', header).group(1)
    return value
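
The header-parsing step can be checked without samtools or a reference genome. The sketch below isolates it in a hypothetical helper, `parse_strand`, which is stricter than the regular expression above in that it accepts only `+` or `-` at the end of the header. Note that samtools is expected to report the forward strand unless reverse complementing is requested, so the value is likely to be `+` for every region queried this way.

```python
import re

def parse_strand(fasta_header: str) -> str:
    """Extract the strand sign from a header such as '>chr16:78510608-78511944(+)'."""
    match = re.search(r"\(([+-])\)\s*$", fasta_header)
    if match is None:
        raise ValueError(f"No strand marker found in header: {fasta_header!r}")
    return match.group(1)

# Header format taken from the comment in get_strand above.
strand = parse_strand(">chr16:78510608-78511944(+)")
print(strand)
```

Raising an explicit `ValueError` makes a missing strand marker easier to diagnose than the `AttributeError` that `re.search(...).group(1)` would produce.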

In [21]:
# Send a GET request to the webpage
url = 'https://enhancer.lbl.gov/cgi-bin/imagedb3.pl?page=1;show=1;form=ext_search;order=;search.gene=;search.result=yes;search.status=Positives;action=search;search.org=Human;page_size=100;search.sequence=1'
response = requests.get(url, timeout=60)
# Stop early if the server returned an HTTP error instead of the data
response.raise_for_status()
text_content = response.text

In [22]:
# Find all entries starting with ">Human"
entries = re.findall(r'>Human.*?(?=^>Human|\Z)', text_content, re.MULTILINE | re.DOTALL)

In [23]:
# The last entry runs to the end of the page, so drop '</pre' and everything after it
last_entry = entries[-1]
entries[-1] = re.sub(r'</pre.*', '', last_entry, flags=re.DOTALL)
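
To see how the entry-splitting regular expression behaves, it can be run on a small in-memory sample. This is a sketch under assumptions: `sample_text` is hypothetical and much shorter than the real page, and the last entry is trimmed with `str.split` rather than `re.sub`.

```python
import re

# Hypothetical page fragment with two records (illustration only).
sample_text = (
    "<pre>>Human|chr1:100-200 | element 1 | positive | limb[3/5]\n"
    "ACGT\nTTGA\n"
    ">Human|chr2:300-400 | element 2 | positive | heart[2/4]\n"
    "GGCC\n"
    "</pre></body></html>"
)

# Each entry starts at '>Human' and runs up to the next line that begins
# with '>Human', or to the end of the text for the last entry.
entries = re.findall(r">Human.*?(?=^>Human|\Z)", sample_text, re.MULTILINE | re.DOTALL)

# The last entry extends to the end of the page, so cut it at the closing tag.
entries[-1] = entries[-1].split("</pre", 1)[0]

for entry in entries:
    print(repr(entry))
```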

In [24]:
# Define the output file name
output_file = 'VISTA_Human_enhancers_sequences.bed'

In [25]:
# Create the output file and write the header
with open(output_file, 'w') as file:
    # Write the header
    headers = ['#chrom', 'chromStart', 'chromEnd', 'name', 'strand', 'annotation', 'sequence', 'expressionPattern']
    file.write('\t'.join(headers) + '\n')

    # Create a set to store the unique expression patterns
    tissue_categories = set()

    # Process and write each entry to the output file
    for entry in entries:
        # Remove '>' character and newlines, replace '|' with tab '\t' (removing surrounding spaces)
        entry = re.sub(r'[>\n]|(\s*\|\s*)', lambda m: '\t' if m.group(1) else '', entry)
        # Split the columns
        columns = entry.split('\t')
        # Rearrange the last column (ear[6/6]CTCCCCTgg)
        last_column_parts = re.split(r'(\])', columns[-1])
        # Combine the last_column_parts with the rest of the columns
        combined_columns = columns[:-1] + [last_column_parts[0] + last_column_parts[1], last_column_parts[2]]
        # Join the combined columns with '\t' separator
        combined_entry = '\t'.join(combined_columns)
        # Extract the expression patterns (every column between the 4th and the last)
        split_last_column = combined_entry.split('\t')[4:-1]
        # Get strand
        strand = get_strand(combined_entry.split('\t')[1])
        # Split sequence identifier into chromosome, start and end
        chromosome, position = combined_entry.split('\t')[1].split(':')
        start_position, end_position = position.split('-')

        # Iterate over expression patterns
        for item in split_last_column:
            # Parse the expression pattern ('trigeminal V (ganglion, cranial)[3/5]')
            expression_pattern = item.strip().split('[')[0].strip().replace(' ', '_').replace(',','').capitalize().replace('(','').replace(')','')
            rearranged_columns = [chromosome, start_position, end_position, combined_entry.split('\t')[2], strand, combined_entry.split('\t')[3], combined_entry.split('\t')[-1], item]
            # Write the entry with rearranged columns
            file.write('\t'.join(rearranged_columns) + '\n')
            # Add the expression pattern to the set
            tissue_categories.add(expression_pattern)
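
The per-entry parsing above can also be written as two small functions, which makes each step easy to test in isolation. This is an illustrative sketch, not a replacement for the cell above: `normalize_pattern` mirrors the chained string calls used there, while `parse_entry` and the sample record are hypothetical and assume the `>Human|locus | name | status | pattern...` header layout followed by sequence lines.

```python
def normalize_pattern(item: str) -> str:
    """Turn 'trigeminal V (ganglion, cranial)[3/5]' into 'Trigeminal_v_ganglion_cranial'."""
    name = item.strip().split("[")[0].strip()
    name = name.replace(" ", "_").replace(",", "").capitalize()
    return name.replace("(", "").replace(")", "")

def parse_entry(entry: str) -> dict:
    """Split one '>Human|...' record into its header fields and its sequence."""
    header, _, body = entry.lstrip(">").partition("\n")
    fields = [field.strip() for field in header.split("|")]
    _organism, locus, name, annotation = fields[:4]
    chrom, span = locus.split(":")
    start, end = span.split("-")
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "name": name,
        "annotation": annotation,
        "patterns": fields[4:],            # e.g. 'limb[3/12]'
        "sequence": "".join(body.split()),  # join the wrapped sequence lines
    }

# Hypothetical record (illustration only, not real VISTA data).
sample_entry = (
    ">Human|chr1:100-200 | element 1 | positive | neural tube[12/12] | limb[3/12]\n"
    "AACTGAAG\nGGCTT\n"
)
parsed = parse_entry(sample_entry)
print(parsed["chrom"], parsed["start"], parsed["end"], parsed["patterns"])
print(normalize_pattern("trigeminal V (ganglion, cranial)[3/5]"))
```

Splitting the header from the sequence before splitting on `|` avoids having to separate the last expression pattern from the sequence afterwards.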

In [26]:
# Create the directory to store the expression pattern files
expression_pattern_files_directory = 'Tissue_specific_files'
os.makedirs(expression_pattern_files_directory, exist_ok=True)

In [27]:
# Split the entries based on tissue category
for tissue in tissue_categories:
    tissue_filename = f'{expression_pattern_files_directory}/{tissue}.{output_file}'
    with open(output_file, 'r') as input_file, open(tissue_filename, 'w') as tissue_file:
        # Write the header to the individual tissue files
        tissue_file.write('\t'.join(headers) + '\n')
        for line in input_file:
            # Normalize the expressionPattern column the same way as the tissue names
            pattern = line.split('\t')[-1].strip().split('[')[0].strip().replace(' ', '_').replace(',','').capitalize().replace('(','').replace(')','')
            if pattern == tissue:
                tissue_file.write(line)
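
The loop above re-reads the combined file once per tissue. A possible alternative, sketched here under assumptions, groups the rows in a single pass with a dictionary; `group_rows_by_pattern` and the sample rows are hypothetical, and `normalize_pattern` repeats the normalization used in this document.

```python
from collections import defaultdict

def normalize_pattern(item: str) -> str:
    """Same normalization as the chained string calls used for tissue names."""
    name = item.strip().split("[")[0].strip()
    name = name.replace(" ", "_").replace(",", "").capitalize()
    return name.replace("(", "").replace(")", "")

def group_rows_by_pattern(lines):
    """Map each normalized expression pattern to the rows that carry it."""
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#"):  # skip the header row
            continue
        pattern = normalize_pattern(line.rstrip("\n").split("\t")[-1])
        groups[pattern].append(line)
    return dict(groups)

# Hypothetical rows in the eight-column layout written above (illustration only).
rows = [
    "#chrom\tchromStart\tchromEnd\tname\tstrand\tannotation\tsequence\texpressionPattern\n",
    "chr1\t100\t200\telement 1\t+\tpositive\tACGT\tlimb[3/5]\n",
    "chr1\t100\t200\telement 1\t+\tpositive\tACGT\theart[2/5]\n",
    "chr2\t300\t400\telement 2\t+\tpositive\tGGCC\tlimb[4/4]\n",
]
groups = group_rows_by_pattern(rows)
for pattern, members in sorted(groups.items()):
    print(pattern, len(members))
```

Each group can then be written to its own file, so the combined file is read only once regardless of the number of tissues.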

In [28]:
# Collect the tissue-specific files so that each one can be sorted individually
file_pattern = f"{expression_pattern_files_directory}/*.bed"
file_list = glob.glob(file_pattern)

In [29]:
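for file_path in file_list:
    # Sort by chromosome, start and end; LC_ALL=C gives byte-order sorting, which keeps the '#chrom' header first.
    # Passing the arguments as a list avoids shell quoting problems with unusual file names.
    subprocess.run(["sort", "-k1,1", "-k2,2n", "-k3,3n", file_path, "-o", file_path],
                   env={**os.environ, "LC_ALL": "C"}, check=True)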

In [30]:
# Compress and index the sorted files using bgzip and tabix
for file_path in file_list:
    # check=True raises an error if either tool fails instead of continuing silently
    subprocess.run(["bgzip", "-f", file_path], check=True)
    subprocess.run(["tabix", "-f", "-p", "bed", f"{file_path}.gz"], check=True)
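
The steps above shell out to several external programs, so it can help to confirm they are installed before starting a run. The sketch below is one way to do that with the standard library; `REQUIRED_TOOLS` and `find_missing_tools` are hypothetical names, and the tool list is inferred from the commands used in this document.

```python
import shutil

# External command-line tools that the cells above call.
REQUIRED_TOOLS = ["samtools", "sort", "bgzip", "tabix"]

def find_missing_tools(tools):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing tools: " + ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```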

In [31]:
print(f"Output has been written to '{output_file}' file.")
print(f"Tissue specific files have been created in the '{expression_pattern_files_directory}' directory.")

Output has been written to 'VISTA_Human_enhancers_sequences.bed' file.
Tissue specific files have been created in the 'Tissue_specific_files' directory.
