ori_analyzer

ori_analyzer is a Python package designed for analyzing single DNA sequences and identifying potential origin of replication (ori) regions with their respective DnaA boxes. This package offers a range of functionalities to facilitate genomic analysis, including calculating GC skew, generating k-mers, determining pattern frequencies, and finding the most frequent patterns.

Summary of Features:

Reading DNA Sequences: The package provides functionality to read DNA sequences from input files.
Calculating GC Skew: Users can calculate GC skew scores for each position in the sequence, which can be indicative of ori regions.
Plotting GC Skew: Visualize GC skew scores as a function of positions in the genome.
Finding Min-Max Skew Positions: Identify positions where the GC skew is minimum and maximum.
Generating k-mers: Generate unique k-mers from specified ranges of genomic DNA sequences.
Generating Neighbourhood Dictionary: Create a dictionary of k-mers and their corresponding d-neighbourhoods, useful for motif analysis.
Determining Pattern Frequency: Determine the frequency of patterns in a DNA sequence based on their presence in a neighborhood dictionary.
Finding Most Frequent Patterns: Identify the most frequent patterns in a dictionary of pattern frequencies.

Example analysis

The Jupyter notebook contains an example analysis of the E. coli genome. The GC plot (Fig. 1) exhibits two distinct turning points, aiding in the identification of the OriC region. The determination of 9-mers frequency, along with their 1-d neighborhood, from the minimum point to 1000 base pairs downstream, resulted in a list where the DnaA box sequence in E. coli, TTATCCACA, along with its reverse complement, TGTGGATAA, is prominently featured.

Figure 1: The GC skew is plotted across the relative genomic position of the E. coli genome.

ori_pipeline

The ori_pipeline script provides a simple default pipeline for analyzing genome sequences to identify putative DnaA boxes. It assumes the following prerequisites:

The genome sequence is stored as a plain text file named "genome.txt".
The analyzed k-mer length is set to 9, which is the default for E. coli.
The region to analyze extends from the minimum skew position to 500 base pairs downstream.

Usage and output for E. coli genome:

>>> python3 ori_pipeline.py

Positions with minimum skew values: [3923620, 3923621, 3923622, 3923623]
Positions with maximum skew values: [1550413]
The following 9-mers have been found 4 times in the given sequence range
AGCTGGGAT
GATCCCAGC
GCTGGGATC
CTGGGATCA
GGATCCTGG
TGTGGATAA
TTATCCACA
GTGGATAAC
GTTATCCAC

Reference

This project draws inspiration from the book "Bioinformatics Algorithms" by Phillip Compeau & Pavel Pevzner, as well as the associated ROSALIND challenges.

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
__pycache__		__pycache__
functions		functions
tests		tests
README.md		README.md
example_analysis.ipynb		example_analysis.ipynb
example_gc_plot.png		example_gc_plot.png
genome.txt		genome.txt
ori_analyzer.py		ori_analyzer.py
ori_pipeline.py		ori_pipeline.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ori_analyzer

Summary of Features:

Example analysis

ori_pipeline

Reference

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ori_analyzer

Summary of Features:

Example analysis

ori_pipeline

Reference

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages