# Regulatory Element Identification in Genomic Sequences

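This notebook walks through a machine learning workflow for identifying regulatory elements in DNA sequences: loading the data, extracting sequence features, training models, and evaluating the results.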

## Regulatory Elements

Regulatory elements are DNA sequences that control gene expression:

1. **Promoters**: Sequences where the transcription machinery assembles to initiate transcription, located at and immediately upstream of the transcription start site (TSS)
   - Contain motifs such as the TATA box (consensus TATAAA), CAAT box, and GC box
   - Typically span about 100-1000 bp upstream of the TSS

2. **Enhancers**: Sequences that increase transcription rates
   - Can be located far from the gene (up to 1 Mb away)
   - Work in either orientation
   - Contain binding sites for transcription factors

3. **Silencers**: Sequences that repress transcription
   - Similar to enhancers but with a repressive function
   - Bind repressor proteins

4. **Insulators**: Sequences that block enhancer-promoter interactions
   - Create boundaries between regulatory domains
   - Prevent inappropriate gene activation
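
As a small illustration of the motif idea above, the sketch below scans a sequence for exact matches of the TATA box consensus on both strands. The helper names (`reverse_complement`, `find_motif`) and the example sequence are hypothetical and are not part of `RegulatoryElementIdentifier`. Real motifs tolerate mismatches and are usually modeled with position weight matrices, so treat exact matching as a simplifying assumption.

```python
import re

# Illustrative consensus only; real TATA boxes vary around this sequence.
TATA_BOX = "TATAAA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(seq: str, motif: str) -> list[tuple[int, str]]:
    """Return (start, strand) for every exact match, including overlaps.

    Start positions are 0-based on the forward strand. A '-' hit means the
    motif occurs on the reverse strand at that forward-strand position.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        # A zero-width lookahead lets finditer report overlapping matches.
        for match in re.finditer(f"(?={re.escape(pattern)})", seq):
            hits.append((match.start(), strand))
    return sorted(hits)


# Hypothetical example sequence with one hit on each strand.
example = "GGCTATAAAGGCCTTTATAGC"
print(find_motif(example, TATA_BOX))  # [(3, '+'), (13, '-')]
```

Scanning both strands matters because a binding site can sit on either strand of the double helix.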

## Approach

We train machine learning models to identify regulatory elements from the following sequence features:
- K-mer frequencies (sequence patterns)
- Nucleotide composition (GC content, dinucleotides)
- Known regulatory motifs
- Sequence complexity measures
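
The feature types listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the code behind `RegulatoryElementIdentifier.prepare_features`; the function names are hypothetical, and Shannon entropy is used here as just one possible complexity measure.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq: str, k: int = 3) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the A/C/G/T alphabet."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only count k-mers made of A/C/G/T, so windows containing N are ignored.
    total = sum(counts[kmer] for kmer in vocabulary)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocabulary}


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the base composition, in bits per base."""
    seq = seq.upper()
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def sequence_features(seq: str, k: int = 3) -> dict[str, float]:
    """Combine k-mer, dinucleotide, GC, and complexity features in one dict."""
    features = {f"kmer_{kmer}": freq for kmer, freq in kmer_frequencies(seq, k).items()}
    features.update({f"dinuc_{d}": freq for d, freq in kmer_frequencies(seq, 2).items()})
    features["gc_content"] = gc_content(seq)
    features["entropy"] = shannon_entropy(seq)
    return features


features = sequence_features("ACGTACGT", k=3)
print(len(features))           # 64 trinucleotides + 16 dinucleotides + 2 scalars = 82
print(features["gc_content"])  # 0.5
```

Using a fixed vocabulary of all 4^k k-mers keeps the feature vector the same length for every sequence, which is what a tabular classifier expects.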


In [None]:
# Import necessary libraries
import sys
import os

# Make the project's local modules in ../src importable
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from regulatory_element_identification import RegulatoryElementIdentifier

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)
%matplotlib inline


## 1. Load and Explore Data


In [None]:
# Initialize the identifier
identifier = RegulatoryElementIdentifier('../data/genomics_data.csv')

# Load data
df = identifier.load_data()
df.head()


## 2. Prepare Features, Train, and Evaluate Models


In [None]:
# Prepare features and split data
X, y = identifier.prepare_features(k=3)
X_train, X_test, y_train, y_test = identifier.split_data(test_size=0.2, random_state=42)

# Train models
models = identifier.train_models()

# Evaluate models
results = identifier.evaluate_models()

# Generate visualizations
identifier.plot_results(results, save_path='../results')
