Modernize bioinfo scripts with argparse and add comprehensive test suite by Copilot · Pull Request #1 · Mxrcon/Bioinfo-python-scripts

Copilot · 2025-10-09T13:31:19Z

Overview

This PR modernizes three bioinformatics Python scripts: each now parses its arguments with argparse, validates its inputs, reports progress, and logs with timestamps instead of calling print(). It also adds a test suite of 11 automated tests covering all three scripts.

Changes Made

1. Modernized cdsselector.py

Replaced positional sys.argv arguments with argparse for a professional CLI experience:

Before:

python cdsselector.py $GBK_FOLDER $OUTDIR $LIST_FILE

After:

python cdsselector.py --input-folder gbk_files --genes-list genes.txt --output-folder filtered_gbk
# Short flags also available:
python cdsselector.py -i gbk_files -g genes.txt -o filtered_gbk
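
A minimal sketch of how a parser with these three options could be declared. The flag names come from the commands above; marking every option as required, the help texts, and the function name build_parser are assumptions, not the script's confirmed code.

```python
import argparse


def build_parser():
    """Build a parser exposing the three options named in this PR (sketch)."""
    parser = argparse.ArgumentParser(
        description="Filter GenBank files by a list of genes (illustrative)."
    )
    # required=True is an assumption; the real script may define defaults.
    parser.add_argument("-i", "--input-folder", required=True,
                        help="Folder containing the GenBank files")
    parser.add_argument("-g", "--genes-list", required=True,
                        help="Text file listing the genes to keep")
    parser.add_argument("-o", "--output-folder", required=True,
                        help="Folder where filtered files are written")
    return parser


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args(
    ["-i", "gbk_files", "-g", "genes.txt", "-o", "filtered_gbk"]
)
# argparse turns the dashes in option names into underscores.
print(args.input_folder, args.genes_list, args.output_folder)
```

Long and short flags map to the same attribute, so `-i gbk_files` and `--input-folder gbk_files` are interchangeable.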

2. Modernized "take genes into aminoacid.py"

Updated the amino acid extraction script with the same professional improvements:

Before:

python "take genes into aminoacid.py" $INPUT_FOLDER $OUTPUT_FOLDER
# Gene list was hardcoded in the script

After:

python "take genes into aminoacid.py" --input-folder gbk_files --genes-list genes.txt --output-folder amino_acids
# Short flags: -i, -g, -o

Key improvements:

Gene list now loaded from file parameter instead of being hardcoded
Progress tracking: "Processing file 1/5: genome1.gbk"
Professional logging with timestamps
Comprehensive error handling
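
Loading the gene list from a file might look like the sketch below. It assumes one gene name per line, which the PR does not state; load_genes and the gene names geneA and geneB are hypothetical.

```python
import os
import tempfile


def load_genes(path):
    """Return gene names from a text file, one per line (assumed format)."""
    with open(path, encoding="utf-8") as handle:
        # Strip surrounding whitespace and skip blank lines.
        return [line.strip() for line in handle if line.strip()]


# Demo with a throwaway file; the gene names are made up.
with tempfile.TemporaryDirectory() as tmp:
    genes_file = os.path.join(tmp, "genes.txt")
    with open(genes_file, "w", encoding="utf-8") as handle:
        handle.write("geneA\n\n  geneB  \n")
    genes = load_genes(genes_file)
```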

3. Modernized "take genes into nucleotides.py"

Updated the nucleotide extraction script with matching improvements:

Before:

python "take genes into nucleotides.py" $INPUT_FOLDER $OUTPUT_FOLDER
# Gene list was hardcoded in the script

After:

python "take genes into nucleotides.py" --input-folder gbk_files --genes-list genes.txt --output-folder nucleotides
# Short flags: -i, -g, -o

Key improvements:

Gene list now loaded from file parameter
Progress tracking showing "Processing file X/Y"
Professional error handling and logging

4. Comprehensive Test Suite

Created a complete testing infrastructure:

Structure:

tests/
├── test_data/
│   ├── input/              # Sample GenBank files
│   │   ├── test_genome1.gbk
│   │   └── test_genome2.gbk
│   └── genes.txt           # Test gene list
├── test_scripts.py         # Automated test runner
└── README.md               # Testing documentation

Test Coverage:

✅ 11 automated tests covering all three scripts
✅ Tests for help commands, input validation, and output verification
✅ Colored terminal output for easy identification of pass/fail
✅ 100% test success rate

Run tests with: python tests/test_scripts.py
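
The contents of tests/test_scripts.py are not shown in this PR description. As a rough illustration of the kind of help-command and input-validation checks listed above, the sketch below runs a script in a subprocess and inspects its exit code; run_script and the stand-in script are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_script(script_path, *args):
    """Run a script with the current interpreter; return (exit code, stdout)."""
    result = subprocess.run(
        [sys.executable, script_path, *args],
        capture_output=True, text=True,
    )
    return result.returncode, result.stdout


# Stand-in script so the sketch runs without the repository checked out.
STAND_IN = (
    "import argparse\n"
    "p = argparse.ArgumentParser()\n"
    "p.add_argument('-i', '--input-folder', required=True)\n"
    "p.parse_args()\n"
)

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "stand_in.py")
    with open(script, "w", encoding="utf-8") as handle:
        handle.write(STAND_IN)
    # --help should exit cleanly and print the usage text to stdout.
    help_code, help_text = run_script(script, "--help")
    # Omitting a required option should make argparse exit with status 2.
    missing_code, _ = run_script(script)
```

Passing the command as a list avoids shell quoting problems with script names that contain spaces, such as "take genes into aminoacid.py".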

Common Improvements Across All Scripts

Command-Line Interface:

--help / -h flag with comprehensive usage documentation
Clear parameter names that describe their purpose
Built-in validation and error messages

Progress Tracking:
All scripts now show processing progress:

Processing file 1/2: test_genome1.gbk
Processing file 2/2: test_genome2.gbk
Processing complete: 2 files succeeded, 0 files failed
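
One way to produce output of this shape is a counting loop that logs "i/n" before each file and a summary at the end. This is a sketch under assumptions: process_batch and the process_one callback are hypothetical names, not the scripts' actual functions.

```python
import logging
import os

logger = logging.getLogger("progress_sketch")


def process_batch(paths, process_one):
    """Process each path, logging progress; return (succeeded, failed)."""
    succeeded = failed = 0
    total = len(paths)
    for index, path in enumerate(paths, start=1):
        logger.info("Processing file %d/%d: %s",
                    index, total, os.path.basename(path))
        try:
            process_one(path)
            succeeded += 1
        except Exception as exc:
            # One bad file is recorded and the batch continues.
            logger.error("Failed to process %s: %s", path, exc)
            failed += 1
    logger.info("Processing complete: %d files succeeded, %d files failed",
                succeeded, failed)
    return succeeded, failed
```

Returning the two counts lets the caller choose an exit status, for example non-zero when any file failed.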

Enhanced Error Handling:

Validates that input folder exists and contains GenBank files
Checks gene list file existence before processing
Provides informative error messages instead of cryptic stack traces
Gracefully handles individual file processing errors without stopping the entire batch
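
The up-front checks described above could be sketched as follows. The function name find_genbank_files, the error messages, and matching only the .gbk extension are assumptions; the scripts may accept other GenBank extensions.

```python
import os
import tempfile


def find_genbank_files(input_folder, genes_list, extensions=(".gbk",)):
    """Validate inputs; return sorted GenBank paths or raise a clear error."""
    if not os.path.isdir(input_folder):
        raise FileNotFoundError(f"Input folder not found: {input_folder}")
    if not os.path.isfile(genes_list):
        raise FileNotFoundError(f"Gene list file not found: {genes_list}")
    files = sorted(
        os.path.join(input_folder, name)
        for name in os.listdir(input_folder)
        if name.lower().endswith(extensions)
    )
    if not files:
        raise ValueError(f"No GenBank files found in: {input_folder}")
    return files


# Demo in a throwaway folder: first without, then with a .gbk file.
with tempfile.TemporaryDirectory() as tmp:
    genes = os.path.join(tmp, "genes.txt")
    open(genes, "w").close()
    try:
        find_genbank_files(tmp, genes)
    except ValueError as exc:
        empty_message = str(exc)
    open(os.path.join(tmp, "genome1.gbk"), "w").close()
    found = [os.path.basename(p) for p in find_genbank_files(tmp, genes)]
```

Catching these exceptions in main() and logging the message is what replaces a raw stack trace with a one-line error.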

Professional Code Structure:

Added shebang line (#!/usr/bin/env python) for Unix compatibility
Implemented main() function with if __name__ == "__main__" guard
Organized code into well-documented, single-purpose functions
Added comprehensive docstrings following Python conventions

Improved Logging:
Replaced basic print() statements with structured logging:

2025-10-09 13:35:31 - INFO - Gene to Nucleotide Extractor started
2025-10-09 13:35:31 - INFO - Loaded 4 genes from genes.txt
2025-10-09 13:35:31 - INFO - Found 2 GenBank file(s) in input_folder
2025-10-09 13:35:31 - INFO - Processing file 1/2: genome1.gbk
2025-10-09 13:35:31 - INFO - Processing complete: 2 files succeeded, 0 files failed
2025-10-09 13:35:31 - INFO - Extracted 4 nucleotide sequences
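
Log lines in this "timestamp - level - message" shape can be produced with the standard logging module. The sketch below is one possible configuration, not necessarily the one used in the scripts; configure_logging and the logger name are hypothetical.

```python
import io
import logging


def configure_logging(stream=None, level=logging.INFO):
    """Return a logger that writes 'timestamp - LEVEL - message' lines."""
    logger = logging.getLogger("extractor_sketch")
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate lines if called twice
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Capture one line in memory to show the resulting format.
buffer = io.StringIO()
log = configure_logging(buffer)
log.info("Loaded %d genes from %s", 4, "genes.txt")
line = buffer.getvalue().strip()
```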

Code Quality Improvements:

Fixed logic bug in cdsselector.py: Original code incorrectly reset the FEATURES list
Follows PEP 8 style guidelines
Uses os.path.join() for proper cross-platform path operations
Fixed mixed tabs/spaces indentation
Added .gitignore to exclude Python cache files and test outputs

Better Documentation:

Updated module docstrings with clear usage examples
Added detailed function docstrings with parameter descriptions
Included author information and email contact

Testing

Automated Tests:

✅ 11/11 tests passing (100%)
✅ Tests for all three scripts (cdsselector, aminoacid, nucleotides)
✅ Help command functionality verified
✅ Input validation tested
✅ Output file generation verified

Manual Testing:

✅ Tested with sample GenBank files to verify correct filtering behavior
✅ Verified error handling for missing files and directories
✅ Confirmed both long and short flag options work correctly
✅ Validated PEP 8 compliance
✅ Passed automated code review

Migration Notes

For cdsselector.py:

# Old: python cdsselector.py $GBK_FOLDER $OUTDIR $LIST_FILE
# New: python cdsselector.py --input-folder $GBK_FOLDER --genes-list $LIST_FILE --output-folder $OUTDIR

For gene extraction scripts:

# Old: python "take genes into aminoacid.py" $INPUT_FOLDER $OUTPUT_FOLDER
# New: python "take genes into aminoacid.py" --input-folder $INPUT_FOLDER --genes-list $GENES_FILE --output-folder $OUTPUT_FOLDER

Important: Gene lists are now loaded from files instead of being hardcoded in the scripts. This makes the scripts more flexible and reusable.

- Replace sys.argv with argparse for proper argument parsing
- Add --input-folder, --genes-list, and --output-folder parameters (with short flags)
- Implement proper error handling and validation
- Add logging instead of simple print statements
- Follow PEP 8 style guidelines
- Add main() function with if __name__ == "__main__" guard
- Fix logic bug where FEATURES list was reset incorrectly
- Use os.path.join() for path operations
- Add comprehensive docstrings for all functions
- Add shebang line for better Unix compatibility

Co-authored-by: Mxrcon <48180517+Mxrcon@users.noreply.github.com>

Co-authored-by: Mxrcon <48180517+Mxrcon@users.noreply.github.com>

Mxrcon · 2025-10-09T13:43:31Z

@copilot continue this work by updating the scripts "take genes into aminoacid.py" and "take genes into nucleotides.py". Please add progress output to track how many GBK files have been processed, and create a testing folder to make sure that all Python scripts are tested and working.

- Update "take genes into aminoacid.py" with argparse and progress tracking
- Update "take genes into nucleotides.py" with argparse and progress tracking
- Add professional structure matching cdsselector.py
- Create comprehensive test suite in tests/ folder
- Add test data with sample GenBank files
- Add progress output showing "Processing file X/Y"
- All 11 tests passing (100%)

Co-authored-by: Mxrcon <48180517+Mxrcon@users.noreply.github.com>

Copilot · 2025-10-09T13:51:12Z

@copilot continue this work updating the scripts "take genes into aminoacid.py" and "take genes into nucleotides.py" please add a progress output to track how many gbks have been processed, also creat...

Done! I've updated both scripts and created a comprehensive test suite. All changes committed in bf49a01.

Updates made:

"take genes into aminoacid.py": Modernized with argparse (--input-folder, --genes-list, --output-folder), progress tracking showing "Processing file X/Y", and professional error handling.
"take genes into nucleotides.py": Same improvements: argparse parameters, progress output, and robust error handling.

Test suite created:
- tests/test_data/ with sample GenBank files
- tests/test_scripts.py - automated test runner
- tests/README.md - testing documentation
- All 11 tests passing (100%)

Run tests with: python tests/test_scripts.py

Both scripts now show progress like:

Processing file 1/2: test_genome1.gbk
Processing file 2/2: test_genome2.gbk
Processing complete: 2 files succeeded, 0 files failed
Extracted 4 sequences

Initial plan

3c5d400

Copilot AI assigned Copilot and Mxrcon Oct 9, 2025

Copilot started work on behalf of Mxrcon October 9, 2025 13:32 View session

Copilot AI and others added 2 commits October 9, 2025 13:39

Add .gitignore to exclude Python cache files

e630cb5

Co-authored-by: Mxrcon <48180517+Mxrcon@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Update cdsselector.py to use parameters for input, genes, and output~~ Modernize cdsselector.py with argparse and professional code structure Oct 9, 2025

Copilot finished work on behalf of Mxrcon October 9, 2025 13:43

Copilot started work on behalf of Mxrcon October 9, 2025 13:44 View session

Copilot AI changed the title ~~Modernize cdsselector.py with argparse and professional code structure~~ Modernize bioinfo scripts with argparse and add comprehensive test suite Oct 9, 2025

Copilot AI requested a review from Mxrcon October 9, 2025 13:53

Copilot finished work on behalf of Mxrcon October 9, 2025 13:53

Mxrcon marked this pull request as ready for review October 14, 2025 11:27

Mxrcon merged commit b4ac1d7 into master Oct 14, 2025
1 check passed

Mxrcon merged 4 commits into master from copilot/update-cdsselector-script
