
Modernize bioinfo scripts with argparse and add comprehensive test suite #1

Merged
Mxrcon merged 4 commits into master from copilot/update-cdsselector-script
Oct 14, 2025

Copilot AI commented Oct 9, 2025

Overview

This PR modernizes three bioinformatics Python scripts to follow professional best practices, making them more maintainable, user-friendly, and robust. Additionally, it includes a comprehensive test suite to ensure all scripts work correctly.

Changes Made

1. Modernized cdsselector.py

Replaced positional sys.argv arguments with argparse for a professional CLI experience:

Before:

python cdsselector.py $GBK_FOLDER $OUTDIR $LIST_FILE

After:

python cdsselector.py --input-folder gbk_files --genes-list genes.txt --output-folder filtered_gbk
# Short flags also available:
python cdsselector.py -i gbk_files -g genes.txt -o filtered_gbk
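The flag layout above could be declared with argparse roughly as follows. This is a minimal sketch, not the merged code: the long and short flag names come from the usage lines above, while the helper name `build_parser`, the help strings, and the choice to mark all three options as required are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a parser with the three options shown above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="cdsselector command-line interface (illustrative text)"
    )
    # Each option has a long form and a one-letter short form.
    parser.add_argument("-i", "--input-folder", required=True,
                        help="folder containing the GenBank files")
    parser.add_argument("-g", "--genes-list", required=True,
                        help="text file listing the genes of interest")
    parser.add_argument("-o", "--output-folder", required=True,
                        help="folder where results are written")
    return parser


# argparse turns "--input-folder" into the attribute name "input_folder".
args = build_parser().parse_args(
    ["-i", "gbk_files", "-g", "genes.txt", "-o", "filtered_gbk"]
)
print(args.input_folder, args.genes_list, args.output_folder)
```

Passing an explicit argument list to `parse_args()` keeps the example runnable without a real command line; a script would normally call `parse_args()` with no arguments so that `sys.argv` is used.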

2. Modernized "take genes into aminoacid.py"

Updated the amino acid extraction script with the same professional improvements:

Before:

python "take genes into aminoacid.py" $INPUT_FOLDER $OUTPUT_FOLDER
# Gene list was hardcoded in the script

After:

python "take genes into aminoacid.py" --input-folder gbk_files --genes-list genes.txt --output-folder amino_acids
# Short flags: -i, -g, -o

Key improvements:

  • Gene list now loaded from file parameter instead of being hardcoded
  • Progress tracking: "Processing file 1/5: genome1.gbk"
  • Professional logging with timestamps
  • Comprehensive error handling
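Loading the gene list from a file might look like the sketch below. The PR does not state the file format, so one gene name per line is an assumption here, as are the function name `load_genes` and the made-up gene names in the demo.

```python
import os
import tempfile


def load_genes(path):
    """Return the gene names in `path`, assuming one name per line."""
    with open(path, encoding="utf-8") as handle:
        # Strip whitespace and skip blank lines so stray newlines are harmless.
        return [line.strip() for line in handle if line.strip()]


# Demo with a throwaway file; the gene names are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    genes_path = os.path.join(tmp, "genes.txt")
    with open(genes_path, "w", encoding="utf-8") as handle:
        handle.write("geneA\ngeneB\n\ngeneC\n")
    demo_genes = load_genes(genes_path)

print(f"Loaded {len(demo_genes)} genes from genes.txt")
```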

3. Modernized "take genes into nucleotides.py"

Updated the nucleotide extraction script with matching improvements:

Before:

python "take genes into nucleotides.py" $INPUT_FOLDER $OUTPUT_FOLDER
# Gene list was hardcoded in the script

After:

python "take genes into nucleotides.py" --input-folder gbk_files --genes-list genes.txt --output-folder nucleotides
# Short flags: -i, -g, -o

Key improvements:

  • Gene list now loaded from file parameter
  • Progress tracking showing "Processing file X/Y"
  • Professional error handling and logging

4. Comprehensive Test Suite

Created a complete testing infrastructure:

Structure:

tests/
├── test_data/
│   ├── input/              # Sample GenBank files
│   │   ├── test_genome1.gbk
│   │   └── test_genome2.gbk
│   └── genes.txt           # Test gene list
├── test_scripts.py         # Automated test runner
└── README.md               # Testing documentation

Test Coverage:

  • ✅ 11 automated tests covering all three scripts
  • ✅ Tests for help commands, input validation, and output verification
  • ✅ Colored terminal output for easy identification of pass/fail
  • ✅ 100% test success rate

Run tests with: python tests/test_scripts.py
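A help-command check of the kind listed above could be written along these lines. This is a sketch under assumptions, not the contents of tests/test_scripts.py: the helper name `help_exits_cleanly` is hypothetical, and the demo runs against a small stand-in script so the example does not depend on the repository layout.

```python
import os
import subprocess
import sys
import tempfile


def help_exits_cleanly(script_path):
    """Return True if `script --help` exits with status 0 and prints usage text."""
    result = subprocess.run(
        [sys.executable, script_path, "--help"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and "usage" in result.stdout.lower()


# Stand-in script with an argparse interface (not one of the real scripts).
STAND_IN = (
    "import argparse\n"
    "parser = argparse.ArgumentParser()\n"
    "parser.add_argument('-i', '--input-folder', required=True)\n"
    "parser.parse_args()\n"
)

with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "stand_in.py")
    with open(script, "w", encoding="utf-8") as handle:
        handle.write(STAND_IN)
    help_ok = help_exits_cleanly(script)

print("help check passed:", help_ok)
```

Using `sys.executable` runs the script under the same interpreter as the test runner, which avoids depending on whichever `python` is first on the PATH.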

Common Improvements Across All Scripts

Command-Line Interface:

  • --help / -h flag with comprehensive usage documentation
  • Clear parameter names that describe their purpose
  • Built-in validation and error messages

Progress Tracking:
All scripts now show processing progress:

Processing file 1/2: test_genome1.gbk
Processing file 2/2: test_genome2.gbk
Processing complete: 2 files succeeded, 0 files failed
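A loop producing progress lines in this shape might be structured as below. It is a minimal sketch: the message wording is taken from the output above, while the function name `process_all`, the `process_one` callback, and the stand-in `fake_process` are hypothetical. Catching the exception per file is one way to keep a single bad file from stopping the batch.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("batch_sketch")


def process_all(files, process_one):
    """Call process_one on each file, logging progress and counting outcomes."""
    succeeded = 0
    failed = 0
    total = len(files)
    for index, name in enumerate(files, start=1):
        logger.info("Processing file %d/%d: %s", index, total, name)
        try:
            process_one(name)
            succeeded += 1
        except Exception as exc:  # keep going so one bad file does not end the run
            logger.error("Failed to process %s: %s", name, exc)
            failed += 1
    logger.info("Processing complete: %d files succeeded, %d files failed",
                succeeded, failed)
    return succeeded, failed


def fake_process(name):
    """Stand-in for real per-file work; fails on one made-up file name."""
    if name == "broken.gbk":
        raise ValueError("could not parse record")


counts = process_all(["test_genome1.gbk", "broken.gbk"], fake_process)
print(counts)  # (1, 1)
```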

Enhanced Error Handling:

  • Validates that input folder exists and contains GenBank files
  • Checks gene list file existence before processing
  • Provides informative error messages instead of cryptic stack traces
  • Gracefully handles individual file processing errors without stopping the entire batch
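The up-front checks described in this list could be sketched as follows. The helper names `find_genbank_files` and `check_genes_list` are hypothetical, the exception types are one reasonable choice, and matching only the `.gbk` extension is an assumption based on the file names used in this PR.

```python
import os
import tempfile


def find_genbank_files(input_folder, extensions=(".gbk",)):
    """Return sorted GenBank file paths, raising a clear error if none are usable."""
    if not os.path.isdir(input_folder):
        raise FileNotFoundError(f"Input folder not found: {input_folder}")
    names = sorted(
        name for name in os.listdir(input_folder)
        if name.lower().endswith(extensions)
    )
    if not names:
        raise ValueError(f"No GenBank files found in: {input_folder}")
    return [os.path.join(input_folder, name) for name in names]


def check_genes_list(path):
    """Raise a clear error if the gene list file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Gene list file not found: {path}")


# Demo in a temporary folder holding one GenBank file and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("test_genome1.gbk", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in find_genbank_files(tmp)]

print(found)  # ['test_genome1.gbk']
```

A `main()` function could catch these exceptions, log the message, and exit with a non-zero status, which is what replaces the raw stack trace for the user.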

Professional Code Structure:

  • Added shebang line (#!/usr/bin/env python) for Unix compatibility
  • Implemented main() function with if __name__ == "__main__" guard
  • Organized code into well-documented, single-purpose functions
  • Added comprehensive docstrings following Python conventions

Improved Logging:
Replaced basic print() statements with structured logging:

2025-10-09 13:35:31 - INFO - Gene to Nucleotide Extractor started
2025-10-09 13:35:31 - INFO - Loaded 4 genes from genes.txt
2025-10-09 13:35:31 - INFO - Found 2 GenBank file(s) in input_folder
2025-10-09 13:35:31 - INFO - Processing file 1/2: genome1.gbk
2025-10-09 13:35:31 - INFO - Processing complete: 2 files succeeded, 0 files failed
2025-10-09 13:35:31 - INFO - Extracted 4 nucleotide sequences
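Log lines in this "timestamp - level - message" shape can be produced with the standard logging module. The format strings below are inferred from the sample output and are an assumption about the scripts' configuration; the helper name `make_logger` and the logger name are hypothetical.

```python
import io
import logging

# Inferred from the sample lines above, not copied from the scripts.
LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_logger(name, stream=None):
    """Return a logger that writes timestamped records to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines through the root logger
    return logger


# Write to an in-memory buffer so the formatted line can be inspected.
buffer = io.StringIO()
log = make_logger("extractor_sketch", stream=buffer)
log.info("Loaded %d genes from %s", 4, "genes.txt")
print(buffer.getvalue(), end="")
```

For a single script, `logging.basicConfig(level=logging.INFO, format=LOG_FORMAT, datefmt=DATE_FORMAT)` is a shorter way to get the same layout on the root logger.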

Code Quality Improvements:

  • Fixed logic bug in cdsselector.py: Original code incorrectly reset the FEATURES list
  • Follows PEP 8 style guidelines
  • Uses os.path.join() for proper cross-platform path operations
  • Fixed mixed tabs/spaces indentation
  • Added .gitignore to exclude Python cache files and test outputs

Better Documentation:

  • Updated module docstrings with clear usage examples
  • Added detailed function docstrings with parameter descriptions
  • Included author information and email contact

Testing

Automated Tests:

  • ✅ 11/11 tests passing (100%)
  • ✅ Tests for all three scripts (cdsselector, aminoacid, nucleotides)
  • ✅ Help command functionality verified
  • ✅ Input validation tested
  • ✅ Output file generation verified

Manual Testing:

  • ✅ Tested with sample GenBank files to verify correct filtering behavior
  • ✅ Verified error handling for missing files and directories
  • ✅ Confirmed both long and short flag options work correctly
  • ✅ Validated PEP 8 compliance
  • ✅ Passed automated code review

Migration Notes

For cdsselector.py:

# Old: python cdsselector.py $GBK_FOLDER $OUTDIR $LIST_FILE
# New: python cdsselector.py --input-folder $GBK_FOLDER --genes-list $LIST_FILE --output-folder $OUTDIR

For gene extraction scripts:

# Old: python "take genes into aminoacid.py" $INPUT_FOLDER $OUTPUT_FOLDER
# New: python "take genes into aminoacid.py" --input-folder $INPUT_FOLDER --genes-list $GENES_FILE --output-folder $OUTPUT_FOLDER

Important: Gene lists are now loaded from files instead of being hardcoded in the scripts. This makes the scripts more flexible and reusable.

Related

Closes the issue requesting professional updates to cdsselector.py and gene extraction scripts with proper parameter handling, progress tracking, and testing infrastructure.

Original prompt

On cdsselector.py, update this script to be more professional, also use --input_folder, --genes-list and --output_folder as parameters



Copilot AI and others added 2 commits October 9, 2025 13:39
- Replace sys.argv with argparse for proper argument parsing
- Add --input-folder, --genes-list, and --output-folder parameters (with short flags)
- Implement proper error handling and validation
- Add logging instead of simple print statements
- Follow PEP 8 style guidelines
- Add main() function with if __name__ == "__main__" guard
- Fix logic bug where FEATURES list was reset incorrectly
- Use os.path.join() for path operations
- Add comprehensive docstrings for all functions
- Add shebang line for better Unix compatibility

Co-authored-by: Mxrcon <48180517+Mxrcon@users.noreply.github.com>
Co-authored-by: Mxrcon <48180517+Mxrcon@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Update cdsselector.py to use parameters for input, genes, and output" to "Modernize cdsselector.py with argparse and professional code structure" Oct 9, 2025

Mxrcon commented Oct 9, 2025

@copilot continue this work by updating the scripts "take genes into aminoacid.py" and "take genes into nucleotides.py". Please add progress output to track how many gbks have been processed, and also create a testing folder to make sure that all Python scripts are tested and working.

- Update "take genes into aminoacid.py" with argparse and progress tracking
- Update "take genes into nucleotides.py" with argparse and progress tracking
- Add professional structure matching cdsselector.py
- Create comprehensive test suite in tests/ folder
- Add test data with sample GenBank files
- Add progress output showing "Processing file X/Y"
- All 11 tests passing (100%)

Co-authored-by: Mxrcon <48180517+Mxrcon@users.noreply.github.com>

Copilot AI commented Oct 9, 2025

@copilot continue this work updating the scripts "take genes into aminoacid.py" and "take genes into nucleotides.py" please add a progress output to track how many gbks have been processed, also creat...

Done! I've updated both scripts and created a comprehensive test suite. All changes committed in bf49a01.

Updates made:

  1. "take genes into aminoacid.py": Modernized with argparse (--input-folder, --genes-list, --output-folder), progress tracking showing "Processing file X/Y", and professional error handling.

  2. "take genes into nucleotides.py": Same improvements - argparse parameters, progress output, and robust error handling.

  3. Test suite created:

    • tests/test_data/ with sample GenBank files
    • tests/test_scripts.py - automated test runner
    • tests/README.md - testing documentation
    • All 11 tests passing (100%)

Run tests with: python tests/test_scripts.py

Both scripts now show progress like:

Processing file 1/2: test_genome1.gbk
Processing file 2/2: test_genome2.gbk
Processing complete: 2 files succeeded, 0 files failed
Extracted 4 sequences

Copilot AI changed the title from "Modernize cdsselector.py with argparse and professional code structure" to "Modernize bioinfo scripts with argparse and add comprehensive test suite" Oct 9, 2025
Copilot AI requested a review from Mxrcon October 9, 2025 13:53
Mxrcon marked this pull request as ready for review October 14, 2025
Mxrcon merged commit b4ac1d7 into master Oct 14, 2025
1 check passed