Merge pull request #6 from AlfonsoJan/dev
Version 1.0.1
AlfonsoJan committed Feb 10, 2024
2 parents c3b4e22 + 35dbb5f commit ff6e4d0
Showing 6 changed files with 124 additions and 23 deletions.
2 changes: 1 addition & 1 deletion Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "sbsgenerator"
version = "1.0.0"
version = "1.0.1"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
22 changes: 11 additions & 11 deletions README.md
@@ -3,7 +3,7 @@

# SBSGenerator

SBSGenerator is a comprehensive Python package designed for bioinformaticians and researchers working in the field of genomics. This package offers a robust set of tools for generating, analyzing, and interpreting single base substitutions (SBS) mutations from Variant Call Format (VCF) files. With a focus on ease of use, efficiency, and scalability, SBSGenerator facilitates the detailed study of genomic mutations, aiding in the understanding of their roles in various biological processes and diseases. Uniquely developed using a hybrid of Python and Rust, SBSGenerator leverages the PyO3 library for seamless integration between Python's flexible programming capabilities and Rust's unparalleled performance. This innovative approach ensures that SBSGen is not only user-friendly but also incredibly efficient and capable of handling large-scale genomic data with ease.
SBSGenerator is a comprehensive Python package designed for bioinformaticians and researchers working in the field of genomics. This package offers a robust set of tools for generating, analyzing, and interpreting single base substitution (SBS) mutations from Variant Call Format (VCF) files. With a focus on ease of use, efficiency, and scalability, SBSGenerator facilitates the detailed study of genomic mutations, aiding in the understanding of their roles in various biological processes and diseases. Uniquely developed using a hybrid of Python and Rust, SBSGenerator leverages the PyO3 library for seamless integration between Python's flexible programming capabilities and Rust's unparalleled performance. This innovative approach ensures that SBSGenerator is not only user-friendly but also incredibly efficient and capable of handling large-scale genomic data with ease.

- [Installation](#installation)
- [Usage](#usage)
@@ -18,37 +18,37 @@ $ pip install sbsgenerator

## Usage

Create mutliple SBS files. With increasing context.
The `SBSGenerator` package is designed to facilitate the generation and analysis of SBS mutation data from VCF files across different genomic contexts. Depending on the specified context size, it can create comprehensive dataframes listing all possible SBS mutations, ranging from simple 3-nucleotide contexts to more complex 7-nucleotide contexts, with the potential number of mutation combinations exponentially increasing with context size.

The sbs.96.txt file contains all of the following the pyrimidine single nucleotide variants, N[{C > A, G, or T} or {T > A, G, or C}]N.
*4 possible starting nucleotides x 6 pyrimidine variants x 4 ending nucleotides = 96 total combinations.*
- Context 3: The dataframe contains all of the following pyrimidine single nucleotide variants, N[{C > A, G, or T} or {T > A, G, or C}]N. *4 possible starting nucleotides x 6 pyrimidine variants x 4 possible ending nucleotides = 96 total combinations.*

The sbs.1536.txt file contains all of the following the pyrimidine single nucleotide variants, NN[{C > A, G, or T} or {T > A, G, or C}]NN.
- Context 5: The dataframe contains all of the following pyrimidine single nucleotide variants, NN[{C > A, G, or T} or {T > A, G, or C}]NN.
*16 (4x4) possible starting nucleotides x 6 pyrimidine variants x 16 (4x4) possible ending nucleotides = 1536 total combinations.*

The sbs.24576.txt file contains all of the following the pyrimidine single nucleotide variants, NNN[{C > A, G, or T} or {T > A, G, or C}]NNN.
- Context 7: The dataframe contains all of the following pyrimidine single nucleotide variants, NNN[{C > A, G, or T} or {T > A, G, or C}]NNN.
*64 (4x4x4) possible starting nucleotides x 6 pyrimidine variants x 64 (4x4x4) possible ending nucleotides = 24576 total combinations.*
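
The counts in the list above follow one pattern: a context of size k has (k - 1) / 2 flanking nucleotides on each side of the substitution, so the number of combinations is 6 x 4^(k - 1). The sketch below enumerates the labels to confirm the three totals. The helper name `sbs_labels` and the `A[C>A]A` label format are assumptions made for illustration; they are not taken from the package.

```python
from itertools import product

NUCLEOTIDES = "ACGT"
# The six pyrimidine-centred substitutions named in the list above.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs_labels(context: int) -> list[str]:
    """Enumerate labels such as 'A[C>A]A' for an odd context size >= 3.

    Hypothetical helper for illustration; not part of sbsgenerator.
    """
    if context < 3 or context % 2 == 0:
        raise ValueError("context must be an odd integer of at least 3")
    flank = (context - 1) // 2  # nucleotides on each side of the substitution
    flanks = ["".join(p) for p in product(NUCLEOTIDES, repeat=flank)]
    return [
        f"{left}[{sub}]{right}"
        for left in flanks
        for sub in SUBSTITUTIONS
        for right in flanks
    ]


for context in (3, 5, 7):
    # Prints 96, 1536 and 24576, matching 6 * 4 ** (context - 1).
    print(context, len(sbs_labels(context)))
```

Because each extra flanking position on both sides multiplies the total by 16, a context of 9 would already give 393216 combinations.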


```python
from sbsgenerator import generator
# Context size (must be an odd integer of at least 3)
context_size = 7
# List with all the vcf files
vcf_files = [str(Path(__file__).parent / "files" / "test.vcf")]
vcf_files = ["data/test.vcf"]
# Where the ref genomes will be downloaded to
ref_genome = Path(__file__).parent / "files"
ref_genome = "temp/ref_genomes"
sbsgen = generator.SBSGenerator(
context=context_size,
vcf_files=vcf_files,
ref_genome=Path(__file__).parent / "files"
ref_genome=ref_genome
)
sbsgen.count_mutations()
```

## Contributing

I welcome contributions to SBSGen! If you have suggestions for improvements or bug fixes, please open an issue or submit a pull request.
I welcome contributions to SBSGenerator! If you have suggestions for improvements or bug fixes, please open an issue or submit a pull request.

## License

SBSGen is released under the MIT License. See the LICENSE file for more details.
SBSGenerator is released under the MIT License. See the LICENSE file for more details.
25 changes: 23 additions & 2 deletions pyproject.toml
@@ -4,11 +4,32 @@ build-backend = "maturin"

[project]
name = "sbsgenerator"
description = "Blazingly fast SBS matrix generator library"
readme = "README.md"
authors = [
{ name = "Jan Alfonso Busker", email = "alfonsobusker@gmail.com" },
]
license = { file = "LICENSE" }
repository = "https://github.com/AlfonsoJan/sbsgenerator"
requires-python = ">=3.8"
keywords = ["sbs", "matrix", "sbsgenerator", "Single Base Substitution", "bioinformatics", "genomics", "genetics", "biology", "sequence", "sequence analysis"]
classifiers = [
"Programming Language :: Rust",
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: MIT License",
"Operating System :: MacOS",
"Operating System :: Microsoft :: Windows",
"Operating System :: POSIX :: Linux",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
"Programming Language :: Python",
"Programming Language :: Rust",
]
dependencies = [
"numpy == 1.23.1",
41 changes: 36 additions & 5 deletions python/sbsgenerator/generator.py
@@ -21,6 +21,7 @@
import itertools
from pathlib import Path
import math
from functools import wraps

import numpy as np
import pandas as pd
@@ -82,26 +83,56 @@ def increase_mutations(context: int) -> list[str]:
return new_mutations


class NoCorrectSBSMutationsFound(Exception):
class NotCorrectSBSMutationsFound(Exception):
"""Exception raised when no correct SBS mutations are found."""

pass


class NotADirectoryError(Exception):
"""Exception raised when the argument is not a folder."""

pass


def validate_input(func):
@wraps(func)
def wrapper(context, vcf_files, ref_genome, **kwargs):
# Check if context is an odd number greater than 1 and an integer
if not isinstance(context, int) or context < 2 or context % 2 == 0:
raise ValueError("Context must be an odd number greater than 1.")
# Ensure vcf_files is a list or tuple
if not isinstance(vcf_files, (list, tuple)):
raise TypeError("Input 'vcf_files' must be a list or tuple.")
# Verify that vcf_files contain existing file paths
exist_vcf_files = [str(vcf_file) for vcf_file in vcf_files if Path(vcf_file).exists()]
if len(exist_vcf_files) < len(vcf_files):
missing_files = set(vcf_files) - set(exist_vcf_files)
raise FileNotFoundError(f"The following files do not exist: {', '.join(missing_files)}")
# Normalize ref_genome to a Path object, ensuring it exists
ref_genome_path = Path(ref_genome)
if not ref_genome_path.is_dir():
raise NotADirectoryError("This argument must be a folder.")
return func(context, vcf_files, ref_genome_path, **kwargs)

return wrapper


@validate_input
class SBSGenerator:
def __init__(self, context: int, vcf_files: list[str], ref_genome: str) -> None:
def __init__(self, context: int, vcf_files: list[str], ref_genome: Path) -> None:
"""
Initialize the Generator object.
Args:
context (int): The context value.
vcf_files (list[str]): List of VCF file paths.
ref_genome (str): Path to the reference genome.
ref_genome (Path): Path to the reference genome.
Returns:
None
"""
download.download_ref_genomes(Path(ref_genome))
download.download_ref_genomes(ref_genome)
self._logger = logging.SingletonLogger()
self.context = context
self.vcf_files = vcf_files
@@ -133,7 +164,7 @@ def parse_vcf_files(self, vcf_files) -> tuple[np.ndarray, np.ndarray]:
self._logger.log_info("Parsing VCF files")
filtered_vcf = parse_vcf_files(vcf_files, str(self.ref_genome), self.context)
if filtered_vcf.shape[0] == 0:
raise NoCorrectSBSMutationsFound("No correct SBS mutations found in VCF files")
raise NotCorrectSBSMutationsFound("No correct SBS mutations found in VCF files")
samples = np.unique(filtered_vcf[:, 0])
self._logger.log_info("Done parsing VCF files")
return filtered_vcf, samples
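
One detail of the missing-file check in `validate_input` may be worth a second look. It compares `set(vcf_files)` with a set of strings, so if a caller passes `Path` objects and at least one file is missing, every `Path` lands in the difference (a `Path` never equals a `str`) and `', '.join(...)` would raise `TypeError` instead of the intended `FileNotFoundError`. A minimal sketch of a variant that treats both types alike, assuming a hypothetical helper named `find_missing_files` that is not part of this commit:

```python
from pathlib import Path
from typing import Iterable, Union

PathLike = Union[str, Path]


def find_missing_files(paths: Iterable[PathLike]) -> list[str]:
    """Return, in input order, the paths that do not exist, as strings.

    Hypothetical helper for illustration; not part of sbsgenerator.
    """
    return [str(p) for p in paths if not Path(p).exists()]


# Mixed str and Path input is handled the same way.
missing = find_missing_files(["no/such/file.vcf", Path("also/missing.vcf")])
if missing:
    # Every element is a str, so join() is safe here.
    print(f"The following files do not exist: {', '.join(missing)}")
```

Building the list directly also keeps the input order, which a set difference does not guarantee.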
55 changes: 52 additions & 3 deletions python/tests/test_sbsgenerator.py
@@ -49,21 +49,21 @@ def test_sbsgenerator_with_context(context_size: int, expected_shape: tuple[int,
3, # For context size 3
],
)
def test_bad_sbsgenerator(context_size: int) -> None:
def test_bad_vcf_sbsgenerator(context_size: int) -> None:
"""
Test case for the SBSGenerator class when no correct SBS mutations are found in VCF files.
Args:
context_size (int): The size of the context to consider for SBS mutations.
Raises:
generator.NoCorrectSBSMutationsFound: If no correct SBS mutations are found in the VCF files.
generator.NotCorrectSBSMutationsFound: If no correct SBS mutations are found in the VCF files.
Returns:
None
"""
with pytest.raises(
generator.NoCorrectSBSMutationsFound,
generator.NotCorrectSBSMutationsFound,
match="No correct SBS mutations found in VCF files",
) as _:
sbsgen = generator.SBSGenerator(
@@ -72,3 +72,52 @@ def test_bad_sbsgenerator(context_size: int) -> None:
ref_genome=Path(__file__).parent / "files",
)
sbsgen.count_mutations()


def test_bad_context() -> None:
"""
Test case for when the context is invalid.
This test ensures that a ValueError is raised when the context is not an odd number greater than 1.
"""
with pytest.raises(
ValueError,
match="Context must be an odd number greater than 1.",
) as _:
generator.SBSGenerator(
context=1,
vcf_files=[str(Path(__file__).parent / "files" / "bad_test.vcf")],
ref_genome=Path(__file__).parent / "files",
)


def test_file_does_not_exist() -> None:
"""
Test case to verify that the SBSGenerator raises a FileNotFoundError
when the specified VCF file does not exist.
"""
vcf_file = str(Path(__file__).parent / "files" / "does_not_exist.vcf")
with pytest.raises(
FileNotFoundError,
match=f"The following files do not exist: {vcf_file}",
) as _:
generator.SBSGenerator(
context=3,
vcf_files=[vcf_file],
ref_genome=Path(__file__).parent / "files",
)


def test_folder_is_a_file() -> None:
"""
Test case to verify that an exception is raised when the provided folder is actually a file.
"""
with pytest.raises(
generator.NotADirectoryError,
match="This argument must be a folder.",
) as _:
generator.SBSGenerator(
context=3,
vcf_files=[str(Path(__file__).parent / "files" / "bad_test.vcf")],
ref_genome=Path(__file__).parent / "files.txt",
)
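
A note on the `match=` arguments used in these tests: `pytest.raises` applies the pattern with `re.search`, so it is read as a regular expression rather than a literal string. That matters for `test_file_does_not_exist`, where a file path is interpolated into the pattern. The sketch below uses only the standard `re` module to show the effect; the Windows-style path is a made-up example, not a path from the repository.

```python
import re

# Hypothetical error message containing a Windows-style path.
message = r"The following files do not exist: C:\files\does_not_exist.vcf"

# Used unescaped, "\f" and "\d" are read as regex escapes (form feed, digit),
# so the pattern does not match the very message it was copied from.
unescaped_fails = re.search(message, message) is None

# re.escape produces a pattern that matches the text literally.
escaped_matches = re.search(re.escape(message), message) is not None

# An unescaped "." matches any character, which can hide a wrong message.
dot_is_wildcard = re.search("greater than 1.", "greater than 10") is not None

print(unescaped_fails, escaped_matches, dot_is_wildcard)
```

Under that assumption, wrapping the interpolated path in `re.escape(...)` inside the test's `match=` argument would keep the assertion literal on every platform.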
