FritscheLab/practical-genai-coding-guide

GenAI Tools for Coding & Research Workflows — A Practical 8-Step Process

Welcome! This repository contains a practical textbook on integrating Large Language Models (LLMs) into coding and research workflows. The guide is geared toward biostatisticians, bioinformaticians, and data scientists, and provides hands-on steps and examples for boosting productivity while maintaining rigorous standards when using GenAI tools.

Repository Overview

How to Use This Repository

  1. Follow the 8-Step Process

    • Read the chapters in order to learn how to plan, prompt, refine, and document your AI-assisted code effectively.
  2. Leverage the Appendices

    • Check out the recommended LLM tools, curated prompts, further reading, and code templates to accelerate your workflow.
  3. Explore the Templates

    • Use the R templates, refactoring prompts, and documentation guidelines in docs/templates/ to standardize your code and streamline collaboration.
  4. Apply Version Control

    • As described in Chapter 4, store each LLM-generated or refined code iteration in Git (e.g., GitHub) to maintain a clear history, review diffs, and revert if needed.
  5. Stay Tuned for References
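
As an illustration of the version-control step in item 4, here is a minimal sketch of committing each LLM-generated or refined iteration as its own Git commit. The helper name `save_iteration`, the example file path, and the commit message are hypothetical and not part of this repository. The sketch assumes it is run inside an existing Git repository with a configured user identity.

```shell
# Hypothetical helper: commit one LLM-generated or refined iteration of a file.
# Usage: save_iteration <file> <commit message>
save_iteration() {
  file="$1"
  message="$2"
  # Stage only the named file so unrelated changes stay out of the commit.
  git add -- "$file" || return 1
  # Record the iteration; this fails if the file has not changed.
  git commit --quiet -m "$message"
}

# Example calls (the path scripts/clean_bmi.R is an assumption for illustration):
#   save_iteration scripts/clean_bmi.R "LLM iteration 2: handle missing heights"
#   git log --oneline -- scripts/clean_bmi.R     # list the saved iterations
#   git diff HEAD~1 HEAD -- scripts/clean_bmi.R  # review what the last one changed
#   git revert HEAD                              # undo the last iteration if needed
```

With one commit per iteration, each diff shows exactly what a given prompt changed, which makes review and rollback straightforward.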

Data Simulation Script

If you'd like to test your workflows or the 8-step process on example data, we provide a script that simulates electronic health record (EHR) and demographic data with realistic data quality issues:

  • Script: scripts/simulate_ehr_data.R
  • Purpose: Generates BMI, height, weight, and demographic data for a specified number of individuals, optionally introducing missing or implausible values.

Example Usage

Rscript scripts/simulate_ehr_data.R \
  --output_ehr "./data/raw/ehr_bmi_simulated_data.tsv" \
  --output_ehr_dict "./data/raw/data_dictionary.txt" \
  --output_demo "./data/raw/demographics_simulated_data.tsv" \
  --output_demo_dict "./data/raw/demographics_data_dictionary.txt" \
  --seed 123 \
  --n_individuals 1000

Explanation of Arguments:

  • --output_ehr: Where to save the simulated EHR (BMI) data (TSV).
  • --output_ehr_dict: Where to write the EHR data dictionary (TXT).
  • --output_demo: Where to save the demographics data (TSV).
  • --output_demo_dict: Where to write the demographics data dictionary (TXT).
  • --seed: Random seed for reproducibility.
  • --n_individuals: Number of unique individuals to simulate (default=1000).

By running this script, you can quickly produce synthetic data for testing code-refinement prompts, checking your data-cleaning pipelines, or practicing the entire 8-step process from ingestion to documentation.
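
As a quick sanity check on the generated files, the following sketch counts the data rows and the missing values per column in a TSV. The helper name `count_missing` is hypothetical. The sketch assumes the file has a header row and that missing values appear as `NA` or as empty fields, so check the data dictionary for the actual encoding. It does not assume any particular column names.

```shell
# Hypothetical helper: report, per column, how many values are missing in a TSV.
# Assumes a header row and that missing values appear as "NA" or as empty fields.
count_missing() {
  awk -F '\t' '
    # Header row: remember the column names and the column count.
    NR == 1 { ncol = NF; for (i = 1; i <= NF; i++) name[i] = $i; next }
    # Data rows: count the row and tally missing fields per column.
    {
      rows++
      for (i = 1; i <= ncol; i++) if ($i == "NA" || $i == "") missing[i]++
    }
    # Print the row count, then one "column<TAB>missing count" line per column.
    END {
      printf "rows\t%d\n", rows
      for (i = 1; i <= ncol; i++) printf "%s\t%d\n", name[i], missing[i]
    }
  ' "$1"
}

# Example call, using the output path from the usage example above:
#   count_missing ./data/raw/ehr_bmi_simulated_data.tsv
```

Running it before and after a cleaning step shows at a glance whether the pipeline changed the number of rows or the amount of missing data.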

Contributing

  • We welcome pull requests for refinements, corrections, or extensions.
  • Please open issues for any questions, or join discussions to keep this textbook accurate and helpful.

License

This repository is licensed under the GNU General Public License v3.0.


Happy learning and coding with GenAI tools!
