AI-Powered Automatic Code Documentation Generation IEEE Envision Project 2026
Mentors: Shriya Bharadwaj, Priyadharshni S
Mentees: Dhruv Bhavesh Chokshi, Harsh Raj, Aadit Munje, Shreevarna S Rao, Dharsini Nakulan
High-quality documentation is the backbone of maintainable software, yet it remains one of the most neglected aspects of development. DocForge bridges the gap between code and comprehension by automatically generating clear, complete documentation for functions using fine-tuned transformer models.
Given a function, DocForge generates:
- A natural language description of what the function does
- Parameter explanations (`@param`)
- Return value descriptions (`@return`)
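To make the three parts concrete, here is a minimal sketch of splitting a generated docstring into description, parameters, and return value. The exact layout of the model's output is not specified in this README, so the sketch assumes one `@param name text` or `@return text` tag per line; `ParsedDoc`, `parse_doc`, and the sample text are hypothetical and not part of DocForge.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ParsedDoc:
    """Hypothetical container for the three parts of a generated docstring."""
    description: str = ""
    params: Dict[str, str] = field(default_factory=dict)
    returns: Optional[str] = None


def parse_doc(text: str) -> ParsedDoc:
    """Split a docstring into description, @param entries, and @return text.

    Assumption: every tag fits on a single line. Untagged lines are treated
    as part of the description.
    """
    doc = ParsedDoc()
    description_lines = []
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        param = re.match(r"@param\s+(\w+)\s*(.*)", line)
        ret = re.match(r"@returns?\s*(.*)", line)
        if param:
            doc.params[param.group(1)] = param.group(2).strip()
        elif ret:
            doc.returns = ret.group(1).strip()
        elif line:
            description_lines.append(line)
    doc.description = " ".join(description_lines)
    return doc


# Hand-written sample for illustration, not actual model output.
sample = """Adds two numbers.
@param a The first number.
@param b The second number.
@return The sum of a and b."""

parsed = parse_doc(sample)
print(parsed.description)
print(parsed.params)
print(parsed.returns)
```

A structure like this could feed a UI that shows parameters and return values separately, but the real output format should be checked against the model before relying on it.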
| Model | BLEU | ROUGE-L |
|---|---|---|
| Paper: Zero-shot Llama 3.1 8B | 0.0302 | 0.0786 |
| Paper: Fine-tuned Llama 3.1 8B | 0.0391 | 0.0975 |
| Ours: CodeT5-base (Run 1) | 0.2691 | 0.4621 |
| Ours: CodeT5-base (Run 2) | 0.2866 | 0.4686 |
Our best fine-tuned CodeT5-base run (Run 2, BLEU 0.2866) scores about 7.3x higher on BLEU than the paper's fine-tuned Llama 3.1 8B (BLEU 0.0391), despite being roughly 36x smaller.
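The ratios quoted here can be reproduced from the table with a few lines of arithmetic. This is only a check of the numbers, using Run 2 as the CodeT5 figure and the nominal parameter counts (8B for Llama 3.1 8B, 222M for CodeT5-base) as stated in this README.

```python
# Scores copied from the results table above.
bleu_llama_finetuned = 0.0391
bleu_codet5_run2 = 0.2866
rouge_llama_finetuned = 0.0975
rouge_codet5_run2 = 0.4686

# Nominal parameter counts as quoted in this README.
params_llama = 8_000_000_000
params_codet5 = 222_000_000

bleu_ratio = bleu_codet5_run2 / bleu_llama_finetuned
rouge_ratio = rouge_codet5_run2 / rouge_llama_finetuned
size_ratio = params_llama / params_codet5

print(f"BLEU ratio:    {bleu_ratio:.1f}x")
print(f"ROUGE-L ratio: {rouge_ratio:.1f}x")
print(f"Size ratio:    {size_ratio:.0f}x")
```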
We use the Code2Doc dataset (arXiv:2512.18748) — a curated benchmark of 13,358 high-quality function-documentation pairs across Python, Java, TypeScript, JavaScript, and C++.
Each sample contains:
- `codet5_input`: prompt in the format `Summarize {language}: {code}`
- `codet5_target`: the target docstring
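As a small illustration of that format, the sketch below builds one sample with the two fields. The helper name `build_codet5_input` and the example function are hypothetical; whether the dataset normalizes whitespace or language names before formatting is not stated here, so the code is inserted verbatim.

```python
def build_codet5_input(language: str, code: str) -> str:
    """Format a prompt as 'Summarize {language}: {code}'."""
    return f"Summarize {language}: {code}"


# Hand-written example pair, not taken from the Code2Doc dataset.
code = "def add(a, b):\n    return a + b"
sample = {
    "codet5_input": build_codet5_input("Python", code),
    "codet5_target": "Adds two numbers and returns the result.",
}

print(sample["codet5_input"])
```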
We fine-tune Salesforce/codet5-base — an encoder-decoder transformer pre-trained on code.
Why CodeT5 over Llama?
- CodeT5 is an encoder-decoder model: the encoder reads the whole function, and the decoder generates the documentation
- Pre-trained specifically on code understanding and generation tasks
- 222M parameters vs. 8B: about 36x smaller, with about 7.3x higher BLEU in our runs
- In this comparison, domain-specific pretraining outweighs raw model scale
| Setting | Run 1 | Run 2 |
|---|---|---|
| Epochs | 3 | 5 |
| Learning Rate | 5e-5 | 3e-5 |
| Warmup Steps | 200 | 300 |
| Effective Batch Size | 16 | 16 |
| Hardware | NVIDIA T4 | NVIDIA T4 |
| BLEU | 0.2691 | 0.2866 |
| ROUGE-L | 0.4621 | 0.4686 |
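The two runs can be written down as a small config sketch. The epochs, learning rates, and warmup steps come from the table; the split of the effective batch size into a per-device batch of 4 with 4 gradient-accumulation steps, and the linear shape of the warmup, are assumptions made for illustration, since the README only states the effective size of 16 and the number of warmup steps. `RunConfig` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hyperparameters for one fine-tuning run (hypothetical container)."""
    epochs: int
    learning_rate: float
    warmup_steps: int
    per_device_batch_size: int = 4        # assumption: not stated in the table
    gradient_accumulation_steps: int = 4  # assumption: not stated in the table

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step on a single GPU.
        return self.per_device_batch_size * self.gradient_accumulation_steps

    def warmup_lr(self, step: int) -> float:
        """Learning rate at an optimizer step, assuming linear warmup.

        After warmup this returns the peak rate; any later decay is not modeled.
        """
        if step >= self.warmup_steps:
            return self.learning_rate
        return self.learning_rate * step / self.warmup_steps


RUN_1 = RunConfig(epochs=3, learning_rate=5e-5, warmup_steps=200)
RUN_2 = RunConfig(epochs=5, learning_rate=3e-5, warmup_steps=300)

for name, run in (("Run 1", RUN_1), ("Run 2", RUN_2)):
    halfway = run.warmup_steps // 2
    print(name, run.effective_batch_size, run.warmup_lr(halfway))
```

Gradient accumulation is a common way to reach a larger effective batch on a memory-limited GPU such as the T4, which is why it is used as the assumed split here.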
The fine-tuned DocForge model is hosted on Hugging Face Hub for easy access and deployment:
Model: imshriya/docforge-codet5-base-v1
The model is automatically downloaded and cached when you run the dashboard:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
MODEL_NAME = "imshriya/docforge-codet5-base-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

The model is cached locally, so subsequent runs load instantly.
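Once the model is loaded, generating documentation for a function might look roughly like the sketch below. It reuses the `Summarize {language}: {code}` prompt format from the dataset; the helper names `load_docforge` and `generate_doc`, the 512-token input limit, the 128 new tokens, and beam search with 4 beams are assumptions chosen for illustration, not settings taken from the dashboard.

```python
def load_docforge(model_name: str = "imshriya/docforge-codet5-base-v1"):
    """Load tokenizer and model from the Hugging Face Hub (downloads on first use)."""
    # Imported lazily so the helpers below can be used without loading the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def generate_doc(code, language, tokenizer, model,
                 max_input_tokens=512, max_new_tokens=128, num_beams=4):
    """Generate documentation for one function (generation settings are assumptions)."""
    prompt = f"Summarize {language}: {code}"
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads the model on the first call):
# tokenizer, model = load_docforge()
# print(generate_doc("def add(a, b):\n    return a + b", "Python", tokenizer, model))
```

Passing the tokenizer and model in as arguments keeps the loading cost out of the per-request path, which matters for an interactive app that generates documentation repeatedly.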
DocForge/
├── dashboard.py # Streamlit web interface for documentation generation
├── requirements.txt # Python package dependencies
├── README.md # Project documentation
├── app/ # Application modules
├── notebooks/ # Jupyter notebooks
│ ├── docforge-codet5-model.ipynb # Model training & evaluation
│ ├── eda-docforge.ipynb # Dataset exploration & analysis
│ └── preprocessing-docforge.ipynb # Data preprocessing pipeline
└── .gitignore
- Python 3.8+
- pip or conda
1. Clone the repository:

       git clone <repository-url>
       cd DocForge

2. Install dependencies:

       pip install -r requirements.txt
Launch the interactive Streamlit dashboard to generate documentation for your code:
streamlit run dashboard.py

The dashboard will open in your browser at http://localhost:8501
Features:
- Paste or type code snippets
- Get instant AI-generated documentation
- View parameter and return value descriptions
Explore the project step-by-step:
jupyter notebook notebooks/

- `eda-docforge.ipynb` - Dataset exploration and statistics
- `preprocessing-docforge.ipynb` - Data cleaning and preprocessing
- `docforge-codet5-model.ipynb` - Model training and evaluation
- Python
- PyTorch
- Hugging Face Transformers
- Datasets
- Evaluate (BLEU, ROUGE)
- Streamlit
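For readers unfamiliar with the ROUGE-L metric reported in this README, the sketch below shows the idea behind it: an F1 score based on the longest common subsequence (LCS) of tokens between a generated docstring and its reference. This is a simplified pure-Python illustration with lowercased whitespace tokenization, not the scorer from the Evaluate library, so its values will not exactly match the reported ones. The function names and example strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 using lowercased whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hand-written pair for illustration only.
score = rouge_l_f1(
    "returns sum of two integers",
    "returns the sum of two numbers",
)
print(f"{score:.4f}")
```

In this pair the LCS is "returns sum of two" (4 tokens), giving a precision of 4/5 and a recall of 4/6.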
- Karaman, R.K. & Akarsu, M. (2025). Code2Doc: A Quality-First Curated Dataset for Code Documentation. arXiv:2512.18748
- Wang, Y. et al. (2021). CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models. EMNLP 2021
- Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 2017


