Replication package for: LLMs for Generation of Architectural Components: An Exploratory Empirical Study in the Serverless World
This project investigates the capabilities of Large Language Models (LLMs) to generate architectural components, specifically focusing on Functions as a Service (FaaS), commonly referred to as serverless functions. By extending the scope of LLM-generated code from snippets to complete architectural components, this work introduces the potential to bridge design decisions directly to deployment, streamlining the software development process.

This replication package is applying for the Research Object Reviewed (ROR) badges (ROR-Functional and ROR-Reusable) and the Open Research Object (ORO) badge. We are not applying for the Results Validated badges, Results Reproduced (ROR-R) and Results Replicated (RER), because we do not have independent researchers for verification and because reproducing all results is not low effort: generating the serverless functions with the LLMs requires a significant investment of time (execution and person hours) and incurs a non-zero cost.
This study aims to evaluate the degree to which LLMs are able to generate software architecture components. The degree here refers to both the functional correctness and the quality of the generated code. We formalize our goal as follows:
Analyze the effectiveness of LLMs
For the purpose of generating software architecture components
With respect to automatic software architectural component generation
From the viewpoint of software architects and developers
In the context of the Function-as-a-Service (FaaS) architectural style
The experiment involves generating functions and calculating their metrics across multiple repositories, prompt types, and models. Here's the process:
- Repositories and Functions:
- A total of 4 repositories are considered.
- From three of the repositories, 3 functions each were selected, and from the remaining repository, 1 function was selected, resulting in 10 functions in total.
Given below is the information about the selected repositories:
| Repository Name | Language | Stars | Forks | No. of Functions | Link to Repository |
|---|---|---|---|---|---|
| codebox-npm | JavaScript | 352 | 27 | 10 | Link |
| laconia | JavaScript | 326 | 30 | 15 | Link |
| TagBot | Python | 91 | 18 | 2 | Link |
| StackJanitor | TypeScript | 37 | 2 | 5 | Link |
- Models and Prompt Types:
- 5 different models are used to generate code for each selected function.
Given below is the information about the selected models:
| Model Name | Number of Parameters | Context Window Size (in tokens) | Availability | License Type |
|---|---|---|---|---|
| Artigenz-Coder-DS-6.7B | 6.7B | 16,384 | Local/API | Open |
| CodeQwen1.5-7B-Chat | 7B | 64K | Local/API | Open |
| DeepSeek-V2.5 | 236B | 128K | Local/API | Open |
| GPT-3.5-Turbo | Unknown | 4,096 | API | Proprietary |
| GPT-4 | Unknown | 8,192 | API | Proprietary |
- Each model generates code using 3 different prompt types:
- Zero Shot with README (Type 1 Prompt)
- Zero Shot with Codebase Summarization (Type 2 Prompt)
- Few Shot with Codebase Summarization (Type 3 Prompt)
- Function Generation:
- For each function, code is generated by every model using all 3 prompt types.
- This results in 145 generated functions:
- Evaluation:
- We perform three kinds of evaluations on the LLM generated serverless functions:
- Functional Correctness Through Testing: We evaluated both the original and the generated code using the existing tests in each repository. The evaluation was conducted twice: once without human intervention and once with minimal human intervention.
- Code Quality through Code Metrics: We quantify code quality using code level metrics—Lines of Code (LoC), Cyclomatic Complexity, Cognitive Complexity, Halstead Metrics.
- Code Similarity using CodeBLEU: We measure how syntactically similar LLM generated serverless functions are to human written ones through the CodeBLEU metric.
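To make the code-level metrics above more concrete, the following is a minimal sketch of how Lines of Code and an approximate cyclomatic complexity could be computed for Python source with the standard-library `ast` module. It is an illustration under stated assumptions, not the implementation in `CodeMetricCalculator.py`: the set of node types counted as decision points and the `handler` sample are hypothetical.

```python
import ast

# Node types treated as decision points in this simplified count.
# This choice is an assumption for illustration only.
_DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def count_loc(source: str) -> int:
    """Count lines that are neither blank nor pure comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity as 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` contributes two additional branches.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical serverless handler used only as sample input.
SAMPLE = '''
def handler(event, context):
    # reject empty or malformed events
    if not event or "name" not in event:
        return {"statusCode": 400}
    for key in event:
        if key.startswith("_"):
            continue
    return {"statusCode": 200}
'''

print(count_loc(SAMPLE))              # 7
print(cyclomatic_complexity(SAMPLE))  # 5
```

The sample has two `if` statements, one `for` loop, and one `or` expression, which gives 1 + 4 = 5 under this counting scheme. Other tools may count decision points differently.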
Link to the Artifact:
Steps to Reproduce:
See the INSTALL.md file
|_experiments
    |_Repo1
        |_function1
            |_codebleu-results
                |_type1
                    |_GPT-3_5-Turbo.txt
                    |_GPT-4.txt
                    |_DeepSeek-Coder-V2.txt
                    |_CodeQwen1_5-7B-Chat.txt
                    |_Artigenz-Coder-DS-6_7B.txt
                |_type2
                |_type3
            |_GENERATED
                |_type1
                    |_Artigenz-Coder-DS-6_7B
                        |_GENERATED-function1_1.js
                    |_CodeQwen1_5-7B-Chat
                        |_GENERATED-function1_1.js
                    |_DeepSeek-Coder-V2
                        |_GENERATED-function1_1.js
                    |_GPT-3_5-Turbo
                        |_GENERATED-function1_1.js
                    |_GPT-4
                        |_GENERATED-function1_1.js
                |_type2
                |_type3
            |_prompts
                |_function-generation-prompt
                    |_type1.txt
                    |_type2.txt
                    |_type3.txt
                |_codebase-summarization-prompt.txt
                |_function-description-prompt.txt
            |_codebase-summary.txt
            |_config.json
            |_context-files-paths.txt
            |_function-description.txt
            |_ORIGINAL-function1.js
        |_function2
        |_function3
        |_README.md
    |_Repo2
    |_Repo3
    |_Repo4
|_prompt-templates
    |_function-generation-prompt-template
        |_type1.txt
        |_type2.txt
        |_type3.txt
    |_codebase-summarization-prompt-template
    |_function-description-prompt-template.txt
|_csvs
    |_code quality metrics
    |_consistency
|_plots
|_test-results
|_runner.ipynb
|_code_metrics.ipynb
|_codebleu_scores.ipynb
|_consistency_check.ipynb
|_visulaization.ipynb
|_CodebleuCalculator.py
|_CodeMetricCalculator.py
|_HelperFunction.py
|_CreatePrompt.py
|_LLMInterface.py
|_ArtigenzCoder.py
|_Gemini.py
|_CodeQwen.py
|_DeepSeek-Coder-V2.py
|_OpenAIModel.py
|_LoacalLLM.py
|_config_files.txt
|_config_template.json
|_package-lock.json
|_package.json
|_repository-selection
    |_filter_dataset.ipynb
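As a rough aid for navigating the layout listed above, the sketch below enumerates where the generated files for a single function would be expected. It assumes that `GENERATED` sits under `experiments/<repo>/<function>/`, that the nesting is prompt type then model, and that the `_1` suffix in the file name is a run index. These are readings of the listing, not confirmed behaviour of `runner.ipynb`, and the helper name `generated_paths` is hypothetical.

```python
from itertools import product
from pathlib import PurePosixPath

# Directory names as they appear in the listing.
MODELS = [
    "Artigenz-Coder-DS-6_7B",
    "CodeQwen1_5-7B-Chat",
    "DeepSeek-Coder-V2",
    "GPT-3_5-Turbo",
    "GPT-4",
]
PROMPT_TYPES = ["type1", "type2", "type3"]


def generated_paths(repo: str, function: str, ext: str = "js", run: int = 1):
    """Return the expected output paths for one function (assumed layout)."""
    base = PurePosixPath("experiments") / repo / function / "GENERATED"
    return [
        base / prompt_type / model / f"GENERATED-{function}_{run}.{ext}"
        for prompt_type, model in product(PROMPT_TYPES, MODELS)
    ]


paths = generated_paths("Repo1", "function1")
print(len(paths))  # 15 (5 models x 3 prompt types)
print(paths[0])
# experiments/Repo1/function1/GENERATED/type1/Artigenz-Coder-DS-6_7B/GENERATED-function1_1.js
```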
Notebooks:
- `filter_dataset.ipynb`: Filters repositories:
  - Checks for the presence of tests in the repository using the keyword `test` and filters out repositories without tests.
  - The filtered repositories are then sorted based on the number of stars and forks.
- `runner.ipynb`: Orchestrates the experiment workflow:
  - Loads and validates the configuration file.
  - Creates codebase and function description prompts.
  - Uses Gemini to create summarizations and descriptions.
  - Creates function generation prompts (Type 1, 2, and 3).
  - Generates function code using 5 models across 3 prompt types.
- `code_metrics.ipynb`: Calculates code metrics: Lines of Code (LoC), Cyclomatic Complexity, Cognitive Complexity, and Halstead Metrics.
- `codebleu_scores.ipynb`: Computes and saves CodeBLEU scores for functions generated by 5 models using 3 prompt types, comparing each with its original counterpart.
- `consistency_check.ipynb`: Evaluates consistency by comparing multiple generated functions for the same context using CodeBLEU. Includes a plot of Average Pairwise CodeBLEU Scores per Model.
- `visualization.ipynb`: Generates visualizations for metrics and CodeBLEU scores, including:
  - Code Quality Metrics for Original and Generated Functions.
  - Average CodeBLEU Scores per Model and Prompt Type.
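The consistency check described above averages a similarity score over all pairs of functions generated for the same context. The sketch below shows only that pairwise-averaging step. Note the assumption: it uses `difflib.SequenceMatcher` as a stand-in similarity so that the example runs with the standard library alone. The notebook uses CodeBLEU, which can be passed in through the hypothetical `score` parameter.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]. This is NOT CodeBLEU."""
    return SequenceMatcher(None, a, b).ratio()


def average_pairwise_score(samples, score=similarity) -> float:
    """Average `score` over every unordered pair of generated samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples to compare")
    return mean(score(a, b) for a, b in pairs)


# Hypothetical outputs of one model for the same prompt.
runs = [
    "exports.handler = async (event) => ({ statusCode: 200 });",
    "exports.handler = async (event) => ({ statusCode: 200 });",
    "module.exports.handler = async () => ({ statusCode: 204 });",
]
print(round(average_pairwise_score(runs), 3))
```

A score close to 1 indicates that repeated generations are nearly identical. Three samples yield three pairs, four samples yield six, and so on.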
Models
- `LLMInterface.py`: Defines a common interface for loading models and generating responses.
- `ArtigenzCoder.py`: Implements LLMInterface for the Artigenz-Coder-DS-6.7B model via the Gradio API on Hugging Face Spaces.
- `CodeQwen.py`: Implements LLMInterface for the CodeQwen1.5-7B-Chat model via the Gradio API on Hugging Face Spaces.
- `DeepSeek.py`: Implements LLMInterface for the DeepSeek-V2.5 model using its OpenAI-compatible API.
- `OpenAIModel.py`: Implements LLMInterface for OpenAI's GPT-3.5-Turbo and GPT-4 models.
- `Gemini.py`: Implements LLMInterface for Google's Gemini-1.5-Pro model.
- `LocalLLM.py`: Implements LLMInterface for interacting with locally hosted Artigenz-Coder-DS-6.7B and CodeQwen1.5-7B-Chat models.
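To illustrate the idea of a common interface shared by all model backends, here is a minimal sketch. The method names `load_model` and `generate_response`, the `EchoModel` class, and the `generate_with_all` helper are hypothetical and chosen only for illustration. The actual signatures in `LLMInterface.py` may differ.

```python
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Hypothetical shape of a common model interface (names are assumptions)."""

    @abstractmethod
    def load_model(self) -> None:
        """Prepare the backend (API client, local weights, ...)."""

    @abstractmethod
    def generate_response(self, prompt: str) -> str:
        """Return the model's text output for a prompt."""


class EchoModel(LLMInterface):
    """Toy backend that makes no network calls, for demonstration only."""

    def __init__(self) -> None:
        self.loaded = False

    def load_model(self) -> None:
        self.loaded = True

    def generate_response(self, prompt: str) -> str:
        if not self.loaded:
            raise RuntimeError("call load_model() first")
        return f"// generated for: {prompt}"


def generate_with_all(models: dict, prompt: str) -> dict:
    """Run the same prompt through every backend behind the interface."""
    results = {}
    for name, model in models.items():
        model.load_model()
        results[name] = model.generate_response(prompt)
    return results


print(generate_with_all({"echo": EchoModel()}, "type1 prompt"))
# {'echo': '// generated for: type1 prompt'}
```

With this shape, the orchestration code can loop over backends without knowing whether a model is reached through an API or hosted locally.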
Helper Files
- `CodebleuCalculator.py`: Contains methods to compute CodeBLEU scores.
- `CodeMetricCalculator.py`: Computes code metrics (LoC, Cyclomatic Complexity, Cognitive Complexity, and Halstead Metrics) for Python and JavaScript code.
- `CreatePrompt.py`: Methods to create Type 1, 2, and 3 prompts for function generation.
- `HelperFunctions.py`: Utility functions for configuration validation, file handling, and prompt management in function generation experiments.
For detailed implementation documentation, see comments in specific files.
Configuration Files
- There is a configuration file for each function in the repository. The configuration file contains the fields listed in the `config_template.json` file.
- This file is used to generate code for the corresponding function in the repository using the various models and prompt types.
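One simple way to check a per-function configuration against the template is to compare top-level keys. The sketch below assumes both files are flat JSON objects. The field names (`repo_name`, `function_path`, `language`) are hypothetical placeholders, not the actual fields of `config_template.json`, and the validation logic in `HelperFunctions.py` may work differently.

```python
import json


def missing_fields(config: dict, template: dict) -> list:
    """Return template keys that the config does not define, sorted by name."""
    return sorted(set(template) - set(config))


# Hypothetical template and config contents, inlined so the example is self-contained.
TEMPLATE = json.loads('{"repo_name": "", "function_path": "", "language": ""}')
CONFIG = json.loads('{"repo_name": "Repo1", "language": "javascript"}')

missing = missing_fields(CONFIG, TEMPLATE)
if missing:
    print("config is missing:", ", ".join(missing))  # config is missing: function_path
```

In practice the two dictionaries would be loaded from `config_template.json` and from the function's `config.json` with `json.load`.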