Merge pull request #4 from AlfonsoJan/dev
Version 1.0.0
Showing 18 changed files with 7,287 additions and 25 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,7 @@ | ||
/target | ||
|
||
# Test Folder | ||
|
||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
.pytest_cache/ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,54 @@ | ||
[![Docs](https://img.shields.io/badge/docs-latest-blue.svg)](https://osf.io/t6j7u/wiki/home/) | ||
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) | ||
|
||
# SBSGenerator | ||
|
||
SBSGenerator is a Python package for bioinformaticians and genomics researchers. It generates, analyzes, and interprets single base substitution (SBS) mutations from Variant Call Format (VCF) files, with a focus on ease of use, efficiency, and scalability. The package is written in a hybrid of Python and Rust, using the PyO3 library to combine Python's flexibility with Rust's performance, so that it can handle large-scale genomic data. | ||
|
||
- [Installation](#installation) | ||
- [Usage](#usage) | ||
- [Contributing](#contributing) | ||
- [License](#license) | ||
|
||
## Installation | ||
|
||
```bash | ||
$ pip install sbsgenerator | ||
``` | ||
|
||
## Usage | ||
|
||
Create multiple SBS files with increasing context size. | ||
|
||
The sbs.96.txt file contains all pyrimidine single nucleotide variants of the form N[{C > A, G, or T} or {T > A, G, or C}]N. | ||
*4 possible starting nucleotides x 6 pyrimidine variants x 4 ending nucleotides = 96 total combinations.* | ||
|
||
The sbs.1536.txt file contains all pyrimidine single nucleotide variants of the form NN[{C > A, G, or T} or {T > A, G, or C}]NN. | ||
*16 (4x4) possible starting dinucleotides x 6 pyrimidine variants x 16 (4x4) possible ending dinucleotides = 1536 total combinations.* | ||
|
||
The sbs.24576.txt file contains all pyrimidine single nucleotide variants of the form NNN[{C > A, G, or T} or {T > A, G, or C}]NNN. | ||
*64 (4x4x4) possible starting trinucleotides x 6 pyrimidine variants x 64 (4x4x4) possible ending trinucleotides = 24576 total combinations.* | ||
|
||
```python | ||
from pathlib import Path | ||
from sbsgenerator import generator | ||
# Context size (must be larger than 3 and odd) | ||
context_size = 7 | ||
# List with all the vcf files | ||
vcf_files = [str(Path(__file__).parent / "files" / "test.vcf")] | ||
# Where the ref genomes will be downloaded to | ||
ref_genome = Path(__file__).parent / "files" | ||
sbsgen = generator.SBSGenerator( | ||
    context=context_size, | ||
    vcf_files=vcf_files, | ||
    ref_genome=ref_genome, | ||
) | ||
sbsgen.count_mutations() | ||
``` | ||
|
||
## Contributing | ||
|
||
I welcome contributions to SBSGenerator! If you have suggestions for improvements or bug fixes, please open an issue or submit a pull request. | ||
|
||
## License | ||
|
||
SBSGenerator is released under the MIT License. See the LICENSE file for more details. |
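The channel counts quoted in the README usage section (96, 1536, 24576) all follow from one formula: an odd context size k leaves (k - 1) / 2 flanking bases on each side, so there are 4^(k - 1) flank combinations times 6 pyrimidine substitutions. The sketch below checks that arithmetic and enumerates the labels. The names `sbs_channel_count` and `sbs_channels` and the label format are illustrative assumptions, not part of the sbsgenerator API.

```python
from itertools import product
from typing import List

# The six pyrimidine-centred substitutions listed in the README.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
BASES = "ACGT"


def sbs_channel_count(context_size: int) -> int:
    """Number of SBS channels for an odd context size, e.g. 3 gives 96."""
    # This check only guards the formula; it is not the package's own validation.
    if context_size < 3 or context_size % 2 == 0:
        raise ValueError("context_size must be an odd integer >= 3")
    return 4 ** (context_size - 1) * len(SUBSTITUTIONS)


def sbs_channels(context_size: int) -> List[str]:
    """Enumerate labels such as 'A[C>A]A' (hypothetical label format)."""
    flank = (context_size - 1) // 2
    sides = ["".join(p) for p in product(BASES, repeat=flank)]
    return [
        f"{left}[{sub}]{right}"
        for left in sides
        for sub in SUBSTITUTIONS
        for right in sides
    ]


for k in (3, 5, 7):
    print(k, sbs_channel_count(k))
```

The loop prints 96, 1536, and 24576 for context sizes 3, 5, and 7, matching the three file names in the README.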
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
#!/usr/bin/env python3 | ||
""" | ||
This module provides functions to download reference genomes. | ||
The download is skipped if the reference genome already exists. | ||
If some of the files are missing, the files in the genome folder are deleted and the reference genome is downloaded again. | ||
* download_ref_genomes(folder: Path) -> None: Download reference genomes (GRCh37, GRCh38). | ||
""" | ||
import os | ||
import requests | ||
import tarfile | ||
from pathlib import Path | ||
|
||
from . import logging | ||
|
||
# Dictionary of reference genomes and their chromosomes *.txt files | ||
CHECK_REF_GENOMES = { | ||
    "GRCh37": list(range(1, 23)) + ["Y", "X", "MT"], | ||
    "GRCh38": list(range(1, 23)) + ["Y", "X", "MT"], | ||
} | ||
|
||
|
||
def download_ref_genomes(folder: Path) -> None: | ||
    """ | ||
    Download reference genomes. | ||
    Args: | ||
        folder (Path): The folder where the reference genomes will be downloaded. | ||
    Returns: | ||
        None | ||
    """ | ||
    ref_genomes = ["GRCh37", "GRCh38"] | ||
    for _ref_genome in ref_genomes: | ||
        download_ref_genome(folder, _ref_genome) | ||
|
||
|
||
def download_ref_genome(folder: Path, ref_genome: str) -> None: | ||
    """ | ||
    Downloads a reference genome if it doesn't already exist in the specified folder. | ||
    Args: | ||
        folder (Path): The folder where the reference genome will be downloaded. | ||
        ref_genome (str): The name of the reference genome. | ||
    Returns: | ||
        None | ||
    """ | ||
    # Log to the console | ||
    logger: logging.SingletonLogger = logging.SingletonLogger() | ||
    if check_ref_genome(folder, ref_genome): | ||
        logger.log_info(f"{ref_genome} already exists in '{folder / ref_genome}'") | ||
        return | ||
    # Delete leftover files if the folder exists but is incomplete | ||
    if (folder / ref_genome).exists(): | ||
        delete_children(folder / ref_genome) | ||
    download_path = folder / f"{ref_genome}.tar.gz" | ||
    logger.log_info( | ||
        f"Beginning downloading of reference {ref_genome}. " | ||
        "This may take up to 40 minutes to complete." | ||
    ) | ||
    # Download the tar.gz file | ||
    url = f"https://ngs.sanger.ac.uk/scratch/project/mutographs/SigProf/{ref_genome}.tar.gz" | ||
    download_tar_url(url, download_path, folder, ref_genome) | ||
|
||
|
||
def delete_children(folder: Path) -> None: | ||
    """ | ||
    Deletes all the files in the given folder. | ||
    Args: | ||
        folder (Path): The folder to delete the files from. | ||
    Returns: | ||
        None | ||
    """ | ||
    for child in folder.iterdir(): | ||
        if child.is_file(): | ||
            child.unlink() | ||
|
||
|
||
def check_ref_genome(folder: Path, ref_genome: str) -> bool: | ||
    """ | ||
    Check if all the required reference genome files exist in the specified folder. | ||
    Args: | ||
        folder (Path): The folder where the reference genome files are located. | ||
        ref_genome (str): The name of the reference genome. | ||
    Returns: | ||
        bool: True if all the required files exist, False otherwise. | ||
    """ | ||
    t_path = folder / ref_genome | ||
    # Named `chrom` to avoid shadowing the built-in chr() | ||
    for chrom in CHECK_REF_GENOMES[ref_genome]: | ||
        if not (t_path / f"{chrom}.txt").exists(): | ||
            return False | ||
    return True | ||
|
||
|
||
def download_tar_url(url: str, download_path: Path, extracted_path: Path, genome: str) -> None: | ||
    """ | ||
    Download a tar.gz file from the provided URL, extract its contents, and clean up. | ||
    Args: | ||
        url (str): URL of the tar.gz file. | ||
        download_path (Path): Path to save the downloaded tar.gz file. | ||
        extracted_path (Path): Path to extract the contents of the tar.gz file. | ||
        genome (str): Name of the reference genome. | ||
    Returns: | ||
        None | ||
    """ | ||
    logger: logging.SingletonLogger = logging.SingletonLogger() | ||
    # Download the tar.gz file | ||
    response = requests.get(url) | ||
    # Check if the request was successful | ||
    if response.status_code == 200: | ||
        logger.log_info("Finished downloading the file") | ||
        # Save the downloaded tar.gz file | ||
        with open(download_path, "wb") as file: | ||
            file.write(response.content) | ||
        # Extract the contents of the tar.gz file | ||
        with tarfile.open(download_path, "r:gz") as tar: | ||
            tar.extractall(extracted_path) | ||
        logger.log_info(f"Finished extracting {genome} to '{extracted_path}'!") | ||
        # Clean up by removing the downloaded tar.gz file | ||
        os.remove(download_path) | ||
    else: | ||
        logger.log_warning( | ||
            ( | ||
                "The Sanger ftp site is not responding. " | ||
                "Please check your internet connection/try again later." | ||
            ) | ||
        ) |
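Review note on `download_tar_url`: it reads the whole archive into memory through `response.content` before writing it to disk, which may be costly for a download that is expected to take up to 40 minutes. A hedged sketch of a streaming alternative follows. It uses only the standard library (`urllib.request` rather than `requests`); the name `stream_and_extract`, the chunk size, and the timeout are assumptions for illustration and are not code from this commit.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path


def stream_and_extract(
    url: str, download_path: Path, extracted_path: Path, timeout: float = 60.0
) -> None:
    """Stream a tar.gz to disk, extract it, then delete the archive."""
    extracted_path.mkdir(parents=True, exist_ok=True)
    # Copy the response to disk in 1 MiB chunks instead of buffering it all.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(
        download_path, "wb"
    ) as out:
        shutil.copyfileobj(response, out, length=1024 * 1024)
    try:
        with tarfile.open(download_path, "r:gz") as tar:
            # The "data" filter rejects unsafe members (for example paths that
            # escape the target folder); use it only where tarfile provides it.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(extracted_path, filter="data")
            else:
                tar.extractall(extracted_path)
    finally:
        # Remove the archive even if extraction fails.
        download_path.unlink()
```

Compared with the committed version, the archive is removed in a `finally` block, so a failed extraction does not leave a partial `.tar.gz` behind. An HTTP error raises `urllib.error.HTTPError` instead of producing a logged warning, so a caller would need to catch it to keep the current logging behaviour.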