Welcome to this repository containing a practical textbook on integrating Large Language Models (LLMs) into coding and research workflows. This guide is geared toward biostatisticians, bioinformaticians, and data scientists, providing hands-on steps and examples to enhance productivity and maintain rigorous standards when employing GenAI tools.
Chapters
- Chapter 1: Laying the Foundation — Plan & Context
- Chapter 2: Knowledge is Power — Do Your Research
- Chapter 3: The AI Assistant — Prompt LLM to Generate Code
- Chapter 4: Critical Eye — Review & Understand the Code
  - Introduces Git-based version control practices for managing LLM-generated code iterations.
- Chapter 5: Making it Better — Refine Code & Add Features
- Chapter 6: The Iterative Journey — Iterate Until Satisfied
- Chapter 7: Standardization — Refactor to Lab Template
  - Explains how to use the R Code Refactoring Prompt and the R Code Template to enforce lab-wide coding standards.
- Chapter 8: Sharing Your Work — Generate Documentation
  - Demonstrates how to use the GitHub Repository Documentation Guidelines to create robust project docs.
- Conclusions
  - A concise summary of the eight-step process that emphasizes version control, collaboration, and reproducibility.
Docs
- Appendix A: Recommended LLM Tools and Platforms
- Appendix B: Prompt Library for Biostatistics, Bioinformatics, and Data Science
- Appendix C: Further Reading and Resources
- Appendix D: Project Plans and Code Templates
  - Directs you to the individual plan and template files under docs/templates.
- References
  - Hyperlinked citations used throughout the textbook.
Templates (in docs/templates/)
- R Code Templates & Prompts
  - R_CodeTemplate.R: A production-quality script template featuring metadata, library management, CLI parsing, and more.
  - R_CodeRefactoringPromptExample.md: Guidance for prompting an LLM to refactor your existing code according to lab standards.
- GitHub Documentation Guide
  - GitHubRepoDocumentationGuidelines.md: A prompt and template for creating comprehensive README.md files, environment setup instructions, and usage docs.
- Project Plans
- Coding Prompt
  - CodingPromptForRScript.md: An example prompt for generating production-ready R scripts with data.table and optparse.
Follow the 8-Step Process
- Read the chapters in order to learn how to plan, prompt, refine, and document your AI-assisted code effectively.
Leverage the Appendices
- Check out the recommended LLM tools, curated prompts, further reading, and code templates to accelerate your workflow.
Explore the Templates
- Use the R templates, refactoring prompts, and documentation guidelines in docs/templates/ to standardize your code and streamline collaboration.
Apply Version Control
- As described in Chapter 4, store each LLM-generated or refined code iteration in Git (e.g., GitHub) to maintain a clear history, review diffs, and revert if needed.
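As a rough illustration of the habit described above, the sketch below commits two successive iterations of a script, reviews the diff between them, and reverts the second one. The file name analysis.R, its contents, and the commit messages are hypothetical placeholders, not taken from Chapter 4; it assumes only that Git is installed.

```shell
#!/bin/sh
# Sketch: keep each LLM-generated or refined iteration as its own Git commit.
# analysis.R, its contents, and the commit messages are hypothetical examples.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

git init -q .
# Set the identity locally so the example also runs without a global Git config.
git config user.name "Example User"
git config user.email "user@example.com"

# Iteration 1: first draft produced by the LLM.
printf 'x <- c(1, 2, NA)\nmean(x)\n' > analysis.R
git add analysis.R
git commit -q -m "Add LLM draft of analysis.R (iteration 1)"

# Iteration 2: refined after review to handle missing values.
printf 'x <- c(1, 2, NA)\nmean(x, na.rm = TRUE)\n' > analysis.R
git commit -q -am "Handle missing values in mean (iteration 2)"

# Review exactly what changed between the two iterations.
git diff HEAD~1 HEAD -- analysis.R

# Undo iteration 2 with a new commit; the earlier history stays intact.
git revert --no-edit HEAD >/dev/null

n_commits=$(git rev-list --count HEAD)
echo "commits: $n_commits"
```

Using git revert rather than rewriting history keeps every iteration visible in the log, which suits the goal of a clear, reviewable record. On a real project you would push these commits to a remote such as GitHub instead of working in a temporary directory.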
Stay Tuned for References
- All references are hyperlinked and listed in docs/References.md.
If you'd like to test your workflows or the 8-step process on example data, we provide a script that simulates EHR and demographic records with realistic data quality issues:
- Script: scripts/simulate_ehr_data.R
- Purpose: Generates BMI, height, weight, and demographic data for a specified number of individuals, optionally introducing missing or implausible values.
Rscript scripts/simulate_ehr_data.R \
--output_ehr "./data/raw/ehr_bmi_simulated_data.tsv" \
--output_ehr_dict "./data/raw/data_dictionary.txt" \
--output_demo "./data/raw/demographics_simulated_data.tsv" \
--output_demo_dict "./data/raw/demographics_data_dictionary.txt" \
--seed 123 \
--n_individuals 1000
Explanation of Arguments:
- --output_ehr: Where to save the simulated EHR (BMI) data (TSV).
- --output_ehr_dict: Where to write the EHR data dictionary (TXT).
- --output_demo: Where to save the demographics data (TSV).
- --output_demo_dict: Where to write the demographics data dictionary (TXT).
- --seed: Random seed for reproducibility.
- --n_individuals: Number of unique individuals to simulate (default=1000).
By running this script, you can quickly produce synthetic data for testing code-refinement prompts, checking your data-cleaning pipelines, or practicing the entire 8-step process from ingestion to documentation.
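Before handing a generated file to a data-cleaning pipeline, a quick structural check can be useful. The sketch below shows one possible check on a tab-separated file using standard POSIX tools. Because this README does not list the simulated columns, it builds a small toy file whose column names (id, height_cm, weight_kg, bmi) are assumptions for illustration only; point the tsv variable at a real output file and adjust the column index as needed.

```shell
#!/bin/sh
# Sketch: sanity-check a tab-separated file before using it in a pipeline.
# The toy file and its column names are hypothetical stand-ins for real output.
set -eu

tsv=$(mktemp)
# Toy data: a header plus three records, one of them with an empty bmi field.
printf 'id\theight_cm\tweight_kg\tbmi\n' >  "$tsv"
printf '1\t170\t65\t22.5\n'              >> "$tsv"
printf '2\t160\t80\t\n'                  >> "$tsv"
printf '3\t182\t90\t27.2\n'              >> "$tsv"

# Number of data rows, excluding the header line.
n_rows=$(awk 'NR > 1' "$tsv" | wc -l | tr -d ' ')

# Number of distinct field counts; 1 means every line has the same width.
n_widths=$(awk -F '\t' '{ print NF }' "$tsv" | sort -u | wc -l | tr -d ' ')

# Number of data rows whose fourth column is empty.
n_missing=$(awk -F '\t' 'NR > 1 && $4 == ""' "$tsv" | wc -l | tr -d ' ')

echo "rows=$n_rows distinct_widths=$n_widths missing_col4=$n_missing"
rm -f "$tsv"
```

On the toy file this prints rows=3 distinct_widths=1 missing_col4=1. A distinct_widths value above 1 would suggest ragged rows, and the row count can be compared with the value passed to --n_individuals, keeping in mind that a file may legitimately hold more than one record per individual.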
- We welcome pull requests for refinements, corrections, or extensions.
- Please open issues for any questions, or join discussions to keep this textbook accurate and helpful.
This repository is licensed under the GNU General Public License v3.0.
Happy learning and coding with GenAI tools!