DocForge

AI-Powered Automatic Code Documentation Generation

IEEE Envision Project 2026


Team

Mentors: Shriya Bharadwaj, Priyadharshni S

Mentees: Dhruv Bhavesh Chokshi, Harsh Raj, Aadit Munje, Shreevarna S Rao, Dharsini Nakulan


Overview

High-quality documentation is the backbone of maintainable software, yet it remains one of the most neglected aspects of development. DocForge bridges the gap between code and comprehension by automatically generating clear, complete documentation for functions with a fine-tuned CodeT5 transformer model.

Given a function, DocForge generates:

  • A natural language description of what the function does
  • Parameter explanations (@param)
  • Return value descriptions (@return)

Results

| Model | BLEU | ROUGE-L |
|---|---|---|
| Paper: Zero-shot Llama 3.1 8B | 0.0302 | 0.0786 |
| Paper: Fine-tuned Llama 3.1 8B | 0.0391 | 0.0975 |
| Ours: CodeT5-base (Run 1) | 0.2691 | 0.4621 |
| Ours: CodeT5-base (Run 2) | 0.2866 | 0.4686 |

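Our fine-tuned CodeT5-base (Run 2) scores 7.3x higher on BLEU (0.2866 vs. 0.0391) and 4.8x higher on ROUGE-L (0.4686 vs. 0.0975) than the paper's fine-tuned Llama 3.1 8B, despite having about 36x fewer parameters (222M vs. 8B).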
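
The headline ratios can be re-derived from the reported scores and parameter counts. The short check below uses only numbers stated in this README; the variable names are illustrative.

```python
# Scores reported in the Results table (Run 2 vs. the paper's fine-tuned Llama 3.1 8B).
bleu_ours, bleu_paper = 0.2866, 0.0391
rouge_ours, rouge_paper = 0.4686, 0.0975

# Parameter counts as stated in this README.
params_llama, params_codet5 = 8e9, 222e6

bleu_ratio = bleu_ours / bleu_paper
rouge_ratio = rouge_ours / rouge_paper
size_ratio = params_llama / params_codet5

print(f"BLEU ratio:    {bleu_ratio:.1f}x")   # 7.3x
print(f"ROUGE-L ratio: {rouge_ratio:.1f}x")  # 4.8x
print(f"Size ratio:    {size_ratio:.0f}x")   # 36x
```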


Dataset

We use the Code2Doc dataset (arXiv:2512.18748) — a curated benchmark of 13,358 high-quality function-documentation pairs across Python, Java, TypeScript, JavaScript, and C++.

Each sample contains:

  • codet5_input: the model prompt, in the format "Summarize {language}: {code}"
  • codet5_target: the target docstring
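
A minimal sketch of how one sample in this format might be assembled. The helper name `build_codet5_input`, the capitalization of the language name, and the example function are assumptions for illustration; only the prompt template and the two field names come from the description above.

```python
def build_codet5_input(language, code):
    """Build the prompt string in the 'Summarize {language}: {code}' format."""
    return f"Summarize {language}: {code}"


# Hypothetical function-documentation pair, not taken from the dataset.
code = "def add(a, b):\n    return a + b"
sample = {
    "codet5_input": build_codet5_input("Python", code),
    "codet5_target": "Adds two numbers and returns the result.",
}

print(sample["codet5_input"])
```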

Model

We fine-tune Salesforce/codet5-base — an encoder-decoder transformer pre-trained on code.

Why CodeT5 over Llama?

  • CodeT5 is an encoder-decoder model: the encoder reads the full function, and the decoder generates the documentation
  • It is pre-trained specifically on code understanding and generation tasks
  • 222M parameters vs. 8B: about 36x smaller, with 7.3x higher BLEU
  • In this comparison, domain-specific pretraining beats raw model scale

Interactive Dashboard

[Three screenshots of the Streamlit dashboard]


Training Configuration

| Setting | Run 1 | Run 2 |
|---|---|---|
| Epochs | 3 | 5 |
| Learning Rate | 5e-5 | 3e-5 |
| Warmup Steps | 200 | 300 |
| Effective Batch Size | 16 | 16 |
| Hardware | NVIDIA T4 | NVIDIA T4 |
| BLEU | 0.2691 | 0.2866 |
| ROUGE-L | 0.4621 | 0.4686 |
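
The two runs can be written down as plain configuration objects, sketched below. The epochs, learning rates, and warmup steps are the values from the table. The split of the effective batch size of 16 into a per-device batch of 8 with 2 gradient-accumulation steps on a single GPU is an assumption, since the README reports only the product. The `RunConfig` class is hypothetical; its field names were chosen to mirror the corresponding Hugging Face `TrainingArguments` fields.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    num_train_epochs: int
    learning_rate: float
    warmup_steps: int
    # Assumption: the README states only the effective batch size (16).
    per_device_train_batch_size: int = 8
    gradient_accumulation_steps: int = 2

    @property
    def effective_batch_size(self):
        # Samples contributing to each optimizer step on a single GPU.
        return self.per_device_train_batch_size * self.gradient_accumulation_steps


RUN_1 = RunConfig(num_train_epochs=3, learning_rate=5e-5, warmup_steps=200)
RUN_2 = RunConfig(num_train_epochs=5, learning_rate=3e-5, warmup_steps=300)

for name, cfg in (("Run 1", RUN_1), ("Run 2", RUN_2)):
    print(name, asdict(cfg), "effective batch:", cfg.effective_batch_size)
```

Because the field names match, `asdict(RUN_2)` could in principle be unpacked into `Seq2SeqTrainingArguments` together with an `output_dir`; the training notebook remains the reference for the actual setup.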

Pretrained Model on Hugging Face

The fine-tuned DocForge model is hosted on Hugging Face Hub for easy access and deployment:

Model: imshriya/docforge-codet5-base-v1

Loading the Model

The model is automatically downloaded and cached when you run the dashboard:

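from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hugging Face Hub ID of the fine-tuned DocForge checkpoint
MODEL_NAME = "imshriya/docforge-codet5-base-v1"

# The first call downloads the files; later calls read them from the local cache
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)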

The model is cached locally, so subsequent runs skip the download and load from disk.
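
Once loaded, the model can also be called outside the dashboard. The sketch below is one plausible way to do that, not the dashboard's actual code: the helper names `load` and `generate_doc`, the 512-token input limit, the beam width, and the output length are all assumptions. The prompt follows the "Summarize {language}: {code}" format described in the Dataset section.

```python
MODEL_NAME = "imshriya/docforge-codet5-base-v1"


def load(model_name=MODEL_NAME):
    """Load tokenizer and model (downloads on first use, then reads the cache)."""
    # Imported lazily so that importing this module stays cheap.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    return tokenizer, model


def generate_doc(code, language, tokenizer, model, max_new_tokens=128, num_beams=4):
    """Generate documentation for one function.

    max_new_tokens, num_beams, and the 512-token input limit are assumed
    defaults, not values documented by the project.
    """
    prompt = f"Summarize {language}: {code}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, num_beams=num_beams
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (requires the transformers and torch packages):
#   tokenizer, model = load()
#   print(generate_doc("def add(a, b):\n    return a + b", "Python", tokenizer, model))
```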


Repository Structure

DocForge/
├── dashboard.py                    # Streamlit web interface for documentation generation
├── requirements.txt                # Python package dependencies
├── README.md                       # Project documentation
├── app/                            # Application modules
├── notebooks/                      # Jupyter notebooks
│   ├── docforge-codet5-model.ipynb # Model training & evaluation
│   ├── eda-docforge.ipynb          # Dataset exploration & analysis
│   └── preprocessing-docforge.ipynb # Data preprocessing pipeline
└── .gitignore

Getting Started

Prerequisites

  • Python 3.8+
  • pip or conda

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd DocForge

  2. Install dependencies:

    pip install -r requirements.txt

Running the Dashboard

Launch the interactive Streamlit dashboard to generate documentation for your code:

streamlit run dashboard.py

The dashboard opens in your browser at http://localhost:8501, Streamlit's default address.

Features:

  • Paste or type code snippets
  • Get instant AI-generated documentation
  • View parameter and return value descriptions

Using Jupyter Notebooks

Explore the project step-by-step:

jupyter notebook notebooks/

  • eda-docforge.ipynb - Dataset exploration and statistics
  • preprocessing-docforge.ipynb - Data cleaning and preprocessing
  • docforge-codet5-model.ipynb - Model training and evaluation

Technologies

  • Python
  • PyTorch
  • Hugging Face Transformers
  • Datasets
  • Evaluate (BLEU, ROUGE)
  • Streamlit
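
To give a feel for what the ROUGE-L numbers reported in Results measure, here is a simplified, dependency-free sketch of the metric: an F1 score over the longest common subsequence (LCS) of tokens. It uses plain whitespace tokenization and is only an illustration; it is not the Evaluate library's implementation and will not reproduce the reported scores exactly. The function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated vs. reference docstrings; the LCS is "the sum of two".
score = rouge_l_f1(
    "returns the sum of two numbers",
    "return the sum of two integers",
)
print(f"{score:.4f}")
```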
