Merge pull request #4 from AlfonsoJan/dev
Version 1.0.0
AlfonsoJan authored Feb 8, 2024
2 parents c4be4c3 + c55d88a commit c3b4e22
Showing 18 changed files with 7,287 additions and 25 deletions.
5 changes: 0 additions & 5 deletions .github/workflows/deploy.yml
@@ -9,11 +9,6 @@ on:
push:
branches:
- main
- master
tags:
- '*'
pull_request:
workflow_dispatch:

permissions:
contents: read
6 changes: 3 additions & 3 deletions .github/workflows/integrate.yml
@@ -28,7 +28,7 @@ jobs:
ARCHITECTURE: ${{ matrix.target }}
- name: Build and Test
run: |
python -m pip install .[test] && pytest
python -m pip install .
macos:
runs-on: macos-latest
@@ -50,7 +50,7 @@ jobs:
ARCHITECTURE: ${{ matrix.target }}
- name: Build and Test
run: |
python -m pip install .[test] && pytest
python -m pip install .
windows:
runs-on: windows-latest
@@ -71,4 +71,4 @@ jobs:
pip install setuptools wheel setuptools-rust
- name: Build and Test
run: |
python -m pip install .[test] && pytest
python -m pip install .
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,5 +1,7 @@
/target

# Test Folder

# Byte-compiled / optimized / DLL files
__pycache__/
.pytest_cache/
80 changes: 79 additions & 1 deletion Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

3 changes: 2 additions & 1 deletion Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "sbsgenerator"
-version = "0.1.0"
+version = "1.0.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
@@ -10,3 +10,4 @@ crate-type = ["cdylib"]

[dependencies]
pyo3 = "0.20.2"
+numpy = "0.20.0"
53 changes: 52 additions & 1 deletion README.md
@@ -1,3 +1,54 @@
[![Docs](https://img.shields.io/badge/docs-latest-blue.svg)](https://osf.io/t6j7u/wiki/home/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

# SBSGenerator

SBSGenerator is a Python package for bioinformaticians and researchers working in genomics. It provides tools for generating, analyzing, and interpreting single base substitution (SBS) mutations from Variant Call Format (VCF) files. With a focus on ease of use, efficiency, and scalability, SBSGenerator supports the detailed study of genomic mutations and their roles in biological processes and diseases. It is written in a hybrid of Python and Rust, using the PyO3 library to combine Python's flexibility with Rust's performance, so it can handle large-scale genomic data efficiently.

- [Installation](#installation)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)

## Installation

```bash
$ pip install sbsgenerator
```

## Usage

Create multiple SBS files with increasing context size.

The sbs.96.txt file contains all of the following pyrimidine single nucleotide variants: N[{C > A, G, or T} or {T > A, G, or C}]N.
*4 possible starting nucleotides x 6 pyrimidine variants x 4 possible ending nucleotides = 96 total combinations.*

The sbs.1536.txt file contains all of the following pyrimidine single nucleotide variants: NN[{C > A, G, or T} or {T > A, G, or C}]NN.
*16 (4x4) possible starting dinucleotides x 6 pyrimidine variants x 16 (4x4) possible ending dinucleotides = 1536 total combinations.*

The sbs.24576.txt file contains all of the following pyrimidine single nucleotide variants: NNN[{C > A, G, or T} or {T > A, G, or C}]NNN.
*64 (4x4x4) possible starting trinucleotides x 6 pyrimidine variants x 64 (4x4x4) possible ending trinucleotides = 24576 total combinations.*
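
The counts above can be checked with a short, standalone sketch. It does not use the sbsgenerator API; the function name `sbs_contexts` and the `A[C>A]A` label format are assumptions made for illustration only, and the labels may not match what the package writes to its output files.

```python
from itertools import product

BASES = "ACGT"
# The six pyrimidine-centred substitutions listed above.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]


def sbs_contexts(context_size: int) -> list[str]:
    """Enumerate every SBS label for an odd context size (hypothetical helper)."""
    if context_size < 3 or context_size % 2 == 0:
        raise ValueError("context_size must be odd and at least 3")
    flank = (context_size - 1) // 2  # bases on each side of the mutated base
    flanks = ["".join(p) for p in product(BASES, repeat=flank)]
    return [
        f"{left}[{sub}]{right}"
        for left in flanks
        for sub in SUBSTITUTIONS
        for right in flanks
    ]


# Number of labels for context sizes 3, 5 and 7.
counts = {size: len(sbs_contexts(size)) for size in (3, 5, 7)}
print(counts)
```

Each flank contributes 4^k combinations for k bases, so the total is 4^k x 6 x 4^k, which gives the 96, 1536, and 24576 figures quoted above.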

```python
from pathlib import Path

from sbsgenerator import generator

# Context size (must be larger than 3 and odd)
context_size = 7
# List with all the VCF files
vcf_files = [str(Path(__file__).parent / "files" / "test.vcf")]
# Folder the reference genomes will be downloaded to
ref_genome = Path(__file__).parent / "files"
sbsgen = generator.SBSGenerator(
    context=context_size,
    vcf_files=vcf_files,
    ref_genome=ref_genome,
)
sbsgen.count_mutations()
```

## Contributing

I welcome contributions to SBSGenerator! If you have suggestions for improvements or bug fixes, please open an issue or submit a pull request.

## License

SBSGenerator is released under the MIT License. See the LICENSE file for more details.
12 changes: 9 additions & 3 deletions pyproject.toml
@@ -10,10 +10,16 @@ classifiers = [
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
[project.optional-dependencies]
test = [
"pytest",
dependencies = [
"numpy == 1.23.1",
"requests == 2.31.0",
"dask == 2024.1.1",
"pandas == 1.5.0",
]

[project.optional-dependencies]
test = ["pytest"]
dev = ["ruff"]
dynamic = ["version"]

[tool.maturin]
2 changes: 1 addition & 1 deletion python/sbsgenerator/__init__.py
@@ -3,4 +3,4 @@

__doc__ = sbsgenerator.__doc__
if hasattr(sbsgenerator, "__all__"):
-    __all__ = sbsgenerator.__all__
+    __all__ = sbsgenerator.__all__
134 changes: 134 additions & 0 deletions python/sbsgenerator/download.py
@@ -0,0 +1,134 @@
#!/usr/bin/env python3
"""
This module provides functions to download reference genomes.
The download is skipped if the reference genome already exists.
If some of the files are missing, the existing files are deleted and the reference genome is downloaded again.
* download_ref_genomes(folder: Path) -> None: Download reference genomes (GRCh37, GRCh38).
"""
import os
import requests
import tarfile
from pathlib import Path

from . import logging

# Dictionary of reference genomes and their chromosomes *.txt files
CHECK_REF_GENOMES = {
    "GRCh37": list(range(1, 23)) + ["Y", "X", "MT"],
    "GRCh38": list(range(1, 23)) + ["Y", "X", "MT"],
}


def download_ref_genomes(folder: Path) -> None:
    """
    Download reference genomes.
    Args:
        folder (Path): The folder where the reference genomes will be downloaded.
    Returns:
        None
    """
    ref_genomes = ["GRCh37", "GRCh38"]
    for _ref_genome in ref_genomes:
        download_ref_genome(folder, _ref_genome)


def download_ref_genome(folder: Path, ref_genome: str) -> None:
    """
    Downloads a reference genome if it doesn't already exist in the specified folder.
    Args:
        folder (Path): The folder where the reference genome will be downloaded.
        ref_genome (str): The name of the reference genome.
    Returns:
        None
    """
    # Log to the console
    logger: logging.SingletonLogger = logging.SingletonLogger()
    if check_ref_genome(folder, ref_genome):
        logger.log_info(f"{ref_genome} already exists in '{folder / ref_genome}'")
        return
    # Delete children if the folder exists but not complete
    if (folder / ref_genome).exists():
        delete_children(folder / ref_genome)
    download_path = folder / f"{ref_genome}.tar.gz"
    logger.log_info(
        f"Beginning downloading of reference {ref_genome}. "
        "This may take up to 40 minutes to complete."
    )
    # Download the tar.gz file
    url = f"https://ngs.sanger.ac.uk/scratch/project/mutographs/SigProf/{ref_genome}.tar.gz"
    download_tar_url(url, download_path, folder, ref_genome)


def delete_children(folder: Path) -> None:
    """
    Deletes all the files in the given folder.
    Args:
        folder (Path): The folder to delete the files from.
    Returns:
        None
    """
    for child in folder.iterdir():
        if child.is_file():
            child.unlink()


def check_ref_genome(folder: Path, ref_genome: str) -> bool:
    """
    Check if all the required reference genome files exist in the specified folder.
    Args:
        folder (Path): The folder where the reference genome files are located.
        ref_genome (str): The name of the reference genome.
    Returns:
        bool: True if all the required files exist, False otherwise.
    """
    t_path = folder / ref_genome
    for chr in CHECK_REF_GENOMES[ref_genome]:
        if not (t_path / f"{chr}.txt").exists():
            return False
    return True


def download_tar_url(url: str, download_path: Path, extracted_path: Path, genome: str) -> None:
    """
    Download a tar.gz file from the provided URL, extract its contents, and clean up.
    Args:
        url (str): URL of the tar.gz file.
        download_path (Path): Path to save the downloaded tar.gz file.
        extracted_path (Path): Path to extract the contents of the tar.gz file.
        genome (str): Name of the reference genome.
    Returns:
        None
    """
    logger: logging.SingletonLogger = logging.SingletonLogger()
    # Download the tar.gz file
    response = requests.get(url)
    # Check if the request was successful
    if response.status_code == 200:
        logger.log_info("Finished downloading the file")
        # Save the downloaded tar.gz file
        with open(download_path, "wb") as file:
            file.write(response.content)
        # Extract the contents of the tar.gz file
        with tarfile.open(download_path, "r:gz") as tar:
            tar.extractall(extracted_path)
        logger.log_info(f"Finished extracting {genome} to '{extracted_path}'!")
        # Clean up by removing the downloaded tar.gz file
        os.remove(download_path)
    else:
        logger.log_warning(
            (
                "The Sanger ftp site is not responding. "
                "Please check your internet connection/try again later."
            )
        )
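
Note on `download_tar_url`: `requests.get(url)` followed by `response.content` holds the entire archive in memory before it is written to disk, which may be costly for a reference genome archive. The sketch below shows one possible chunked alternative. `save_chunks` is a hypothetical helper that is not part of this commit, and the `requests` usage appears only in a comment because it needs network access.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], dest: Path) -> int:
    """Write byte chunks to dest one at a time and return the bytes written.

    Hypothetical helper, not part of sbsgenerator.
    """
    written = 0
    with open(dest, "wb") as file:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += file.write(chunk)
    return written


# Assumed usage with requests (not executed here, requires network access):
#   with requests.get(url, stream=True, timeout=60) as response:
#       response.raise_for_status()
#       save_chunks(response.iter_content(chunk_size=1024 * 1024), download_path)

# Offline demonstration: in-memory chunks stand in for a response body.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.tar.gz"
    demo_written = save_chunks([b"abc", b"", b"de"], demo_path)
    demo_size = demo_path.stat().st_size
```

Because the helper accepts any iterable of bytes, it can be exercised without a network connection, and only one chunk is held in memory at a time.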