Replication package for: LLMs for Generation of Architectural Components: An Exploratory Empirical Study in the Serverless World

Authors: Shrikara Arun*, Meghana Tedla*, Karthik Vaidhyanathan (* indicates equal contribution)

Overview

This project investigates how well Large Language Models (LLMs) can generate architectural components, focusing on Functions as a Service (FaaS), commonly referred to as serverless functions. By extending the scope of LLM-generated code from snippets to complete architectural components, this work explores the potential to bridge design decisions directly to deployment, streamlining the software development process.

This replication package applies for the Research Object Reviewed (ROR) badges ROR-Functional and ROR-Reusable, and for the Open Research Object (ORO) badge. We are not applying for the Results Validated badges (Results Reproduced (ROR-R) and Results Replicated (RER)) because we do not have independent researchers for verification, and because reproducing all results is not low effort: it requires a significant investment of time (execution and person hours) and incurs a non-zero cost to generate the serverless functions with the LLMs.

Key Objectives

This study evaluates the degree to which LLMs can generate software architecture components, where degree refers to both the functional correctness and the quality of the generated code. We formalize our goal as follows:

Analyze the effectiveness of LLMs

For the purpose of generating software architecture components

With respect to automatic software architectural component generation

From the viewpoint of software architects and developers

In the context of the Function-as-a-Service (FaaS) architectural style

Description

The experiment involves generating functions and calculating their metrics across multiple repositories, prompt types, and models. Here's the process:

Repositories and Functions:

A total of 4 repositories are considered.
Three functions were selected from each of three repositories and one function from the fourth, resulting in 10 functions in total.

Given below is the information about the selected repositories:

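| Repository Name | Language | Stars | Forks | No. of Functions | Link to Repository |
| --- | --- | --- | --- | --- | --- |
| codebox-npm | JavaScript | 352 | 27 | 10 | Link |
| laconia | JavaScript | 326 | 30 | 15 | Link |
| TagBot | Python | 91 | 18 | 2 | Link |
| StackJanitor | TypeScript | 37 | 2 | 5 | Link |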

Models and Prompt Types:

5 different models are used to generate code for each selected function.

Given below is the information about the selected models:

| Model Name | Number of Parameters | Context Window Size (in tokens) | Availability | License Type |
| --- | --- | --- | --- | --- |
| Artigenz-Coder-DS-6.7B | 6.7B | 16,384 | Local/API | Open |
| CodeQwen1.5-7B-Chat | 7B | 64K | Local/API | Open |
| DeepSeek-V2.5 | 236B | 128K | Local/API | Open |
| GPT-3.5-Turbo | Unknown | 4,096 | API | Proprietary |
| GPT-4 | Unknown | 8,192 | API | Proprietary |

Each model generates code using 3 different prompt types:
- Zero Shot with README (Type 1 Prompt)
- Zero Shot with Codebase Summarization (Type 2 Prompt)
- Few Shot with Codebase Summarization (Type 3 Prompt)

Function Generation:

For each function, code is generated by every model using all 3 prompt types.
This results in 145 generated functions.
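The experimental grid described above can be enumerated with a few lines of Python. This is a minimal sketch, not code from the package: the model and prompt-type names come from this README, while the function identifiers are placeholders because the ten selected functions are not named in this section. Note that the full cross product yields 150 combinations, whereas the package reports 145 generated functions, so some combinations are presumably absent; the reason is not stated here.

```python
from itertools import product

# Model and prompt-type names are taken from this README.
MODELS = [
    "Artigenz-Coder-DS-6.7B",
    "CodeQwen1.5-7B-Chat",
    "DeepSeek-V2.5",
    "GPT-3.5-Turbo",
    "GPT-4",
]
PROMPT_TYPES = ["type1", "type2", "type3"]

# Placeholder identifiers: the real function names are not listed here.
FUNCTIONS = [f"function{i}" for i in range(1, 11)]


def experiment_grid():
    """Return every (function, model, prompt type) combination."""
    return list(product(FUNCTIONS, MODELS, PROMPT_TYPES))


grid = experiment_grid()
print(len(grid))  # 10 * 5 * 3 = 150 combinations in the full grid
```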

Evaluation:

We perform three kinds of evaluations on the LLM generated serverless functions:
- Functional Correctness Through Testing: We evaluated both the original and generated code using the existing tests in each repository. The evaluation was conducted both without human intervention and with minimal human intervention.
- Code Quality through Code Metrics: We quantify code quality using code-level metrics: Lines of Code (LoC), Cyclomatic Complexity, Cognitive Complexity, and Halstead Metrics.
- Code Similarity using CodeBLEU: We measure how syntactically similar LLM-generated serverless functions are to human-written ones using the CodeBLEU metric.
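To make one of the code metrics above concrete, the sketch below computes the standard Halstead measures from operator and operand counts. It is an illustration under stated assumptions, not the package's CodeMetricCalculator: the counts themselves must come from a language-specific tokenizer, which is not shown, and the `Halstead` dataclass and `halstead` function are names invented for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Halstead:
    vocabulary: int
    length: int
    volume: float
    difficulty: float
    effort: float


def halstead(n1: int, n2: int, N1: int, N2: int) -> Halstead:
    """Compute Halstead measures from operator/operand counts.

    n1, n2: number of distinct operators and distinct operands.
    N1, N2: total occurrences of operators and of operands.
    """
    vocabulary = n1 + n2
    length = N1 + N2
    # Volume: program length scaled by the bits needed per vocabulary symbol.
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    # Difficulty grows with operator variety and operand repetition.
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    return Halstead(vocabulary, length, volume, difficulty, difficulty * volume)


# 4 distinct operators, 4 distinct operands, each kind used 8 times in total.
print(halstead(n1=4, n2=4, N1=8, N2=8))
```

With these counts the vocabulary is 8 and the length is 16, so the volume is 16 × log2(8) = 48, the difficulty is (4 / 2) × (8 / 4) = 4, and the effort is 192.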

Reproducing Results

Link to the Artifact:

Steps to Reproduce: See the INSTALL.md file

Project Structure

|_experiments
    |_Repo1
        |_function1
            |_codebleu-results
                |_type1
                    |_GPT-3_5-Turbo.txt
                    |_GPT-4.txt
                    |_DeepSeek-Coder-V2.txt
                    |_CodeQwen1_5-7B-Chat.txt
                    |_Artigenz-Coder-DS-6_7B.txt
                |_type2
                
                |_type3
                    
            |_GENERATED
                |_type1
                    |_Artigenz-Coder-DS-6_7B
                        |_GENERATED-function1_1.js
                    |_CodeQwen1_5-7B-Chat
                        |_GENERATED-function1_1.js
                    |_DeepSeek-Coder-V2
                        |_GENERATED-function1_1.js
                    |_GPT-3_5-Turbo
                        |_GENERATED-function1_1.js
                    |_GPT-4
                        |_GENERATED-function1_1.js

                |_type2

                |_type3
                    
            |_prompts
                |_function-generation-prompt
                    |_type1.txt
                    |_type2.txt
                    |_type3.txt
                |_codebase-summarization-prompt.txt
                |_function-description-prompt.txt

            |_codebase-summary.txt
            |_config.json
            |_context-files-paths.txt
            |_function-description.txt
            |_ORIGINAL-function1.js

        |_function2
        
        |_function3

        |_README.md

    |_Repo2
    |_Repo3
    |_Repo4  

    |_prompt-templates
        |_function-generation-prompt-template
            |_type1.txt
            |_type2.txt
            |_type3.txt
        |_codebase-summarization-prompt-template
        |_function-description-prompt-template.txt

    |_csvs
        |_code quality metrics
        |_consistency
    |_plots
    |_test-results

    |_runner.ipynb
    |_code_metrics.ipynb
    |_codebleu_scores.ipynb
    |_consistency_check.ipynb
    |_visulaization.ipynb
    |_CodebleuCalculator.py
    |_CodeMetricCalculator.py
    |_HelperFunction.py
    |_CreatePrompt.py
    |_LLMInterface.py
    |_ArtigenzCoder.py
    |_Gemini.py
    |_CodeQwen.py
    |_DeepSeek-Coder-V2.py
    |_OpenAIModel.py
    |_LoacalLLM.py
    |_config_files.txt
    |_config_template.json
    |_package-lock.json
    |_package.json

|_repository-selection
    |_filter_dataset.ipynb

File Descriptions

Notebooks:

filter_dataset.ipynb: Filters repositories:
1. Checks for the presence of tests in the repository using the keyword test and filters out repositories without tests.
2. The filtered repositories are then sorted based on the number of stars and forks.
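The two filtering steps above can be sketched as follows. This is a hedged illustration rather than the notebook's code: the record layout, the `has_tests` and `filter_and_sort` helpers, the file paths, and the `no-tests-repo` entry are all invented, and descending order is an assumption. Only the star and fork counts are taken from the repository table in this README.

```python
def has_tests(paths):
    """Heuristic: a repository has tests if any path contains the keyword 'test'."""
    return any("test" in p.lower() for p in paths)


def filter_and_sort(repos):
    """Drop repositories without tests, then sort by stars and forks, highest first."""
    kept = [r for r in repos if has_tests(r["paths"])]
    return sorted(kept, key=lambda r: (r["stars"], r["forks"]), reverse=True)


# Paths are invented for illustration; stars and forks come from the README table.
repos = [
    {"name": "TagBot", "stars": 91, "forks": 18, "paths": ["test/test_web.py"]},
    {"name": "codebox-npm", "stars": 352, "forks": 27, "paths": ["test/get.test.js"]},
    {"name": "no-tests-repo", "stars": 500, "forks": 40, "paths": ["src/index.js"]},
    {"name": "laconia", "stars": 326, "forks": 30, "paths": ["packages/core/test/a.js"]},
]

names = [r["name"] for r in filter_and_sort(repos)]
print(names)
```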
runner.ipynb: Orchestrates the experiment workflow:
1. Loads and validates the configuration file.
2. Creates codebase and function description prompts.
3. Uses Gemini to create summarizations and descriptions.
4. Creates function generation prompts (Type 1, 2, and 3).
5. Generates function code using 5 models across 3 prompt types.
code_metrics.ipynb: Calculates code metrics: Lines of Code (LoC), Cyclomatic Complexity, Cognitive Complexity, and Halstead Metrics.
codebleu_scores.ipynb: Computes and saves CodeBLEU scores for functions generated by 5 models using 3 prompt types, comparing each with its original counterpart.
consistency_check.ipynb: Evaluates consistency by comparing multiple generated functions for the same context using CodeBLEU. Includes a plot of Average Pairwise CodeBLEU Scores per Model.
visualization.ipynb: Generates visualizations for metrics and CodeBLEU scores, including:
1. Code Quality Metrics for Original and Generated Functions.
2. Average CodeBLEU Scores per Model and Prompt Type.
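The averaging idea behind consistency_check.ipynb, scoring every pair of functions generated for the same context and taking the mean, can be sketched as below. This is an assumption-laden illustration: the package scores pairs with CodeBLEU, whereas this example substitutes `difflib.SequenceMatcher` from the standard library as a stand-in so that it runs without extra dependencies. The function names are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the package uses CodeBLEU instead."""
    return SequenceMatcher(None, a, b).ratio()


def average_pairwise(samples, score=similarity):
    """Average the score over all unordered pairs of generated samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return mean(score(a, b) for a, b in pairs)


# Three hypothetical generations for the same prompt.
generations = [
    "exports.handler = async (event) => ({ statusCode: 200 });",
    "exports.handler = async (event) => ({ statusCode: 200 });",
    "module.exports.handler = async () => ({ statusCode: 204 });",
]
print(average_pairwise(generations))
```

Any scoring function that takes two strings and returns a number can be passed as `score`, so a CodeBLEU-based scorer could replace the stand-in without changing the averaging logic.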

Models

LLMInterface.py: Defines a common interface for loading models and generating responses.
ArtigenzCoder.py: Implements LLMInterface for the Artigenz-Coder-DS-6.7B model via the Gradio API on Hugging Face Spaces.
CodeQwen.py: Implements LLMInterface for the CodeQwen1.5-7B-Chat model via the Gradio API on Hugging Face Spaces.
DeepSeek.py: Implements LLMInterface for the DeepSeek-V2.5 model using its OpenAI-compatible API.
OpenAIModel.py: Implements LLMInterface for OpenAI's GPT-3.5-Turbo and GPT-4 models.
Gemini.py: Implements LLMInterface for Google's Gemini-1.5-Pro model.
LocalLLM.py: Implements LLMInterface for the Artigenz-Coder-DS-6.7B and CodeQwen1.5-7B-Chat models when they are hosted locally.
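The pattern described above, one common interface with a concrete class per model, might look like the following sketch. The class and method names (`LLMInterfaceSketch`, `load_model`, `generate`, `EchoModel`) are hypothetical and chosen only for illustration; the actual signatures are defined in LLMInterface.py and may differ.

```python
from abc import ABC, abstractmethod


class LLMInterfaceSketch(ABC):
    """Hypothetical shape of a shared interface: load a model, then generate text."""

    @abstractmethod
    def load_model(self) -> None:
        """Prepare the model or API client for use."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's response to the given prompt."""


class EchoModel(LLMInterfaceSketch):
    """Toy implementation used only to exercise the interface."""

    def __init__(self) -> None:
        self.loaded = False

    def load_model(self) -> None:
        self.loaded = True

    def generate(self, prompt: str) -> str:
        if not self.loaded:
            raise RuntimeError("call load_model() first")
        return f"// generated for: {prompt}"


model = EchoModel()
model.load_model()
print(model.generate("type1 prompt"))
```

Because callers depend only on the abstract class, a runner can loop over a list of model objects without knowing whether each one talks to a hosted API or a local model.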

Helper Files

CodebleuCalculator.py: Contains methods to compute CodeBLEU scores.
CodeMetricCalculator.py: Computes code metrics (LoC, Cyclomatic Complexity, Cognitive Complexity, and Halstead Metrics) for Python and JavaScript code.
CreatePrompt.py: Provides methods to create Type 1, 2, and 3 prompts for function generation.
HelperFunctions.py: Utility functions for configuration validation, file handling, and prompt management in function generation experiments.

For detailed implementation documentation, see comments in specific files.
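As background for CodebleuCalculator.py, CodeBLEU is defined as a weighted sum of four component scores: n-gram match, weighted n-gram match, syntax (AST) match, and data-flow match. The sketch below shows only that final combination step, with equal weights assumed as the default. Computing the components requires parsing the code and is not shown, and `combine_codebleu` is a name invented for this example rather than a function from the package.

```python
def combine_codebleu(ngram, weighted_ngram, syntax, dataflow,
                     weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of the four CodeBLEU component scores, each in [0, 1]."""
    components = (ngram, weighted_ngram, syntax, dataflow)
    if any(not 0.0 <= c <= 1.0 for c in components):
        raise ValueError("component scores must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * c for w, c in zip(weights, components))


# Hypothetical component scores for one generated function.
print(combine_codebleu(0.4, 0.4, 0.8, 0.8))
```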

Configuration Files

Each function in a repository has its own configuration file (config.json), which contains the fields listed in config_template.json.
The experiment uses this file to generate code for the corresponding function with each model and prompt type.
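A configuration check against the template could be as simple as comparing key sets. The sketch below is illustrative only: the field names are invented (the real ones are listed in config_template.json), and `missing_fields` is a hypothetical helper, not the validation routine in the package.

```python
import json


def missing_fields(config: dict, template: dict) -> list:
    """Return the template keys that are absent from the config, sorted."""
    return sorted(set(template) - set(config))


# Invented field names; see config_template.json for the real ones.
template = json.loads('{"repository": "", "function_name": "", "language": ""}')
config = json.loads('{"repository": "Repo1", "function_name": "function1"}')

missing = missing_fields(config, template)
print(missing)
```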
