Meghanatedla/LLM-ComponentGen

Replication package for: LLMs for Generation of Architectural Components: An Exploratory Empirical Study in the Serverless World

Authors: Shrikara Arun*, Meghana Tedla*, Karthik Vaidhyanathan (* indicates equal contribution)

Overview

This project investigates the capabilities of Large Language Models (LLMs) to generate architectural components, focusing on Functions as a Service (FaaS), commonly referred to as serverless functions. By extending the scope of LLM-generated code from snippets to complete architectural components, this work introduces the potential to bridge design decisions directly to deployment, streamlining the software development process.

This replication package is applying for the Research Object Reviewed (ROR) badges ROR-Functional and ROR-Reusable, and for the Open Research Object (ORO) badge. We are not applying for the Results Validated badges (Results Reproduced (ROR-R) and Results Replicated (RER)) because we do not have independent researchers for verification, and reproducing all results is not low effort: it requires a significant investment of time (execution and person hours) and a non-zero cost to generate the serverless functions using the LLMs.

Key Objectives

This study aims to evaluate the degree to which LLMs are able to generate software architecture components. The degree here refers to both the functional correctness and the quality of the code. We formalize our goal as follows:

Analyze the effectiveness of LLMs

For the purpose of generating software architecture components

With respect to automatic software architectural component generation

From the viewpoint of software architects and developers

In the context of the Function-as-a-Service (FaaS) architectural style

Description

The experiment involves generating functions and calculating their metrics across multiple repositories, prompt types, and models. Here's the process:

  1. Repositories and Functions:
  • A total of 4 repositories are considered.
  • From each of three repositories, 3 functions were selected, and from the remaining repository, 1 function was selected, resulting in 10 functions in total.

Given below is the information about the selected repositories:

| Repository Name | Language | Stars | Forks | No. of Functions | Link to Repository |
|---|---|---|---|---|---|
| codebox-npm | JavaScript | 352 | 27 | 10 | Link |
| laconia | JavaScript | 326 | 30 | 15 | Link |
| TagBot | Python | 91 | 18 | 2 | Link |
| StackJanitor | TypeScript | 37 | 2 | 5 | Link |
  2. Models and Prompt Types:
  • 5 different models are used to generate code for each selected function.

Given below is the information about the selected models:

| Model Name | Number of Parameters | Context Window Size (in tokens) | Availability | License Type |
|---|---|---|---|---|
| Artigenz-Coder-DS-6.7B | 6.7B | 16,384 | Local/API | Open |
| CodeQwen1.5-7B-Chat | 7B | 64K | Local/API | Open |
| DeepSeek-V2.5 | 236B | 128K | Local/API | Open |
| GPT-3.5-Turbo | Unknown | 4,096 | API | Proprietary |
| GPT-4 | Unknown | 8,192 | API | Proprietary |
  • Each model generates code using 3 different prompt types:
    • Zero Shot with README (Type 1 Prompt)
    • Zero Shot with Codebase Summarization (Type 2 Prompt)
    • Few Shot with Codebase Summarization (Type 3 Prompt)
  3. Function Generation:
  • For each function, code is generated by every model using all 3 prompt types.
  • This results in 145 generated functions.
  4. Evaluation:
  • We perform three kinds of evaluations on the LLM generated serverless functions:
    • Functional Correctness Through Testing: We evaluated both the original and the generated code using the existing tests in each repository. The evaluation was conducted in two settings: without human intervention and with minimal human intervention.
    • Code Quality through Code Metrics: We quantify code quality using code-level metrics: Lines of Code (LoC), Cyclomatic Complexity, Cognitive Complexity, and Halstead Metrics.
    • Code Similarity using CodeBLEU: We measure how syntactically similar LLM generated serverless functions are to human written ones through the CodeBLEU metric.
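The Halstead Metrics named in the evaluation list are all derived from four token counts. The following is a minimal sketch of the standard formulas, assuming the code has already been split into operator and operand tokens. The function name `halstead_metrics` and the hand-tokenized input are illustrative only and are not taken from this package's CodeMetricCalculator.py.

```python
import math


def halstead_metrics(operators, operands):
    """Compute basic Halstead metrics from pre-tokenized operators and operands.

    Tokenization is assumed to have happened elsewhere; this sketch only
    applies the standard formulas to the resulting token lists.
    """
    n1 = len(set(operators))  # number of distinct operators
    n2 = len(set(operands))   # number of distinct operands
    N1 = len(operators)       # total operator occurrences
    N2 = len(operands)        # total operand occurrences

    vocabulary = n1 + n2
    length = N1 + N2
    # Volume: program length scaled by the bits needed per vocabulary symbol.
    volume = length * math.log2(vocabulary) if vocabulary > 0 else 0.0
    # Difficulty grows with operator variety and operand reuse.
    difficulty = (n1 / 2) * (N2 / n2) if n2 > 0 else 0.0
    effort = difficulty * volume

    return {
        "vocabulary": vocabulary,
        "length": length,
        "volume": volume,
        "difficulty": difficulty,
        "effort": effort,
    }


# Hand-tokenized statement `z = x + y` (illustrative input).
metrics = halstead_metrics(operators=["=", "+"], operands=["z", "x", "y"])
print(metrics)
```

In practice the operator and operand lists would come from a language-aware parser, since what counts as an operator differs between Python and JavaScript.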

Reproducing Results

Link to the Artifact:

  1. GitHub Link to the artifact
  2. Zenodo Link to the artifact

Steps to Reproduce: See the INSTALL.md file

Project Structure

|_experiments
    |_Repo1
        |_function1
            |_codebleu-results
                |_type1
                    |_GPT-3_5-Turbo.txt
                    |_GPT-4.txt
                    |_DeepSeek-Coder-V2.txt
                    |_CodeQwen1_5-7B-Chat.txt
                    |_Artigenz-Coder-DS-6_7B.txt
                |_type2
                
                |_type3
                    
            |_GENERATED
                |_type1
                    |_Artigenz-Coder-DS-6_7B
                        |_GENERATED-function1_1.js
                    |_CodeQwen1_5-7B-Chat
                        |_GENERATED-function1_1.js
                    |_DeepSeek-Coder-V2
                        |_GENERATED-function1_1.js
                    |_GPT-3_5-Turbo
                        |_GENERATED-function1_1.js
                    |_GPT-4
                        |_GENERATED-function1_1.js

                |_type2

                |_type3
                    
            |_prompts
                |_function-generation-prompt
                    |_type1.txt
                    |_type2.txt
                    |_type3.txt
                |_codebase-summarization-prompt.txt
                |_function-description-prompt.txt

            |_codebase-summary.txt
            |_config.json
            |_context-files-paths.txt
            |_function-description.txt
            |_ORIGINAL-function1.js

        |_function2
        
        |_function3

        |_README.md

    |_Repo2
    |_Repo3
    |_Repo4  

    |_prompt-templates
        |_function-generation-prompt-template
            |_type1.txt
            |_type2.txt
            |_type3.txt
        |_codebase-summarization-prompt-template
        |_function-description-prompt-template.txt

    |_csvs
        |_code quality metrics
        |_consistency
    |_plots
    |_test-results

    |_runner.ipynb
    |_code_metrics.ipynb
    |_codebleu_scores.ipynb
    |_consistency_check.ipynb
    |_visulaization.ipynb
    |_CodebleuCalculator.py
    |_CodeMetricCalculator.py
    |_HelperFunction.py
    |_CreatePrompt.py
    |_LLMInterface.py
    |_ArtigenzCoder.py
    |_Gemini.py
    |_CodeQwen.py
    |_DeepSeek-Coder-V2.py
    |_OpenAIModel.py
    |_LoacalLLM.py
    |_config_files.txt
    |_config_template.json
    |_package-lock.json
    |_package.json

|_repository-selection
    |_filter_dataset.ipynb
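
The layout above implies one output file per prompt type and model for each function. As a rough sketch, the expected locations for a single function could be enumerated as follows. The directory names are copied from the tree; the helper `generated_paths`, the `extension` parameter, and the reading of the trailing `_1` as a sample index are assumptions made for illustration.

```python
from itertools import product
from pathlib import PurePosixPath

# Model directory names as they appear in the project structure.
MODEL_DIRS = [
    "Artigenz-Coder-DS-6_7B",
    "CodeQwen1_5-7B-Chat",
    "DeepSeek-Coder-V2",
    "GPT-3_5-Turbo",
    "GPT-4",
]
PROMPT_TYPES = ["type1", "type2", "type3"]


def generated_paths(repo, function, extension="js", sample=1):
    """List expected generated-file paths for one function (hypothetical helper).

    `sample` is assumed to be the index behind the `_1` suffix in the tree.
    """
    base = PurePosixPath("experiments") / repo / function / "GENERATED"
    return [
        base / prompt_type / model / f"GENERATED-{function}_{sample}.{extension}"
        for prompt_type, model in product(PROMPT_TYPES, MODEL_DIRS)
    ]


paths = generated_paths("Repo1", "function1")
print(len(paths))  # one path per (prompt type, model) pair
print(paths[0])
```

A listing like this can be compared against the files actually present on disk to spot missing generations before computing metrics.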

File Descriptions

Notebooks:

  • filter_dataset.ipynb: Filters repositories:

    1. Checks for the presence of tests in the repository using the keyword test and filters out repositories without tests.
    2. The filtered repositories are then sorted based on the number of stars and forks.
  • runner.ipynb: Orchestrates the experiment workflow:

    1. Loads and validates the configuration file.
    2. Creates codebase and function description prompts.
    3. Uses Gemini to create summarizations and descriptions.
    4. Creates function generation prompts (Type 1, 2, and 3).
    5. Generates function code using 5 models across 3 prompt types.
  • code_metrics.ipynb: Calculates code metrics: Lines of Code (LoC), Cyclomatic Complexity, Cognitive Complexity, and Halstead Metrics.

  • codebleu_scores.ipynb: Computes and saves CodeBLEU scores for functions generated by 5 models using 3 prompt types, comparing each with its original counterpart.

  • consistency_check.ipynb: Evaluates consistency by comparing multiple generated functions for the same context using CodeBLEU. Includes a plot of Average Pairwise CodeBLEU Scores per Model.

  • visualization.ipynb: Generates visualizations for metrics and CodeBLEU scores, including:

    1. Code Quality Metrics for Original and Generated Functions.
    2. Average CodeBLEU Scores per Model and Prompt Type.
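
The consistency check described for consistency_check.ipynb averages a similarity score over every pair of generations produced for the same context. The sketch below shows that averaging step only. It uses a simple token-overlap score as a stand-in for CodeBLEU so that it runs without extra dependencies; `token_overlap` and `average_pairwise_score` are hypothetical names, and the real notebook may structure this differently.

```python
from itertools import combinations
from statistics import mean


def token_overlap(a, b):
    """Stand-in similarity: Jaccard overlap of whitespace tokens (NOT CodeBLEU)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def average_pairwise_score(samples, similarity=token_overlap):
    """Average a similarity function over all unordered pairs of samples."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples to compare")
    return mean(similarity(a, b) for a, b in pairs)


# Three generations for the same prompt (illustrative strings).
samples = ["return a + b", "return a + b", "return a - b"]
print(average_pairwise_score(samples))
```

Passing a CodeBLEU-based callable as `similarity` would reproduce the metric named in the notebook description; a score near 1 means the model returns nearly the same code on each run.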

Models

  1. LLMInterface.py: Defines a common interface for loading models and generating responses.
  2. ArtigenzCoder.py: Implements LLMInterface for the Artigenz-Coder-DS-6.7B model via the Gradio API on Hugging Face Spaces.
  3. CodeQwen.py: Implements LLMInterface for the CodeQwen1.5-7B-Chat model via the Gradio API on Hugging Face Spaces.
  4. DeepSeek.py: Implements LLMInterface for the DeepSeek-V2.5 model using its OpenAI-compatible API.
  5. OpenAIModel.py: Implements LLMInterface for OpenAI's GPT-3.5-Turbo and GPT-4 models.
  6. Gemini.py: Implements LLMInterface for Google's Gemini-1.5-Pro model.
  7. LocalLLM.py: Implements LLMInterface to interact with the Artigenz-Coder-DS-6.7B and CodeQwen1.5-7B-Chat models hosted locally.
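
To illustrate the idea of a common interface for loading models and generating responses, here is a minimal sketch using an abstract base class. The class name `LLMInterfaceSketch`, the method names `load_model` and `generate`, and the `EchoModel` stand-in are all assumptions for illustration; the actual signatures in LLMInterface.py may differ.

```python
from abc import ABC, abstractmethod


class LLMInterfaceSketch(ABC):
    """Hypothetical shape of a shared model interface (not the package's code)."""

    @abstractmethod
    def load_model(self) -> None:
        """Prepare the model or API client for use."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's response to a prompt."""


class EchoModel(LLMInterfaceSketch):
    """Dummy backend that echoes the prompt, standing in for a real LLM."""

    def __init__(self):
        self.loaded = False

    def load_model(self) -> None:
        self.loaded = True

    def generate(self, prompt: str) -> str:
        if not self.loaded:
            raise RuntimeError("call load_model() before generate()")
        return f"// generated for: {prompt}"


def generate_with_all(models, prompt):
    """Run the same prompt through every backend via the shared interface."""
    results = {}
    for name, model in models.items():
        model.load_model()
        results[name] = model.generate(prompt)
    return results


print(generate_with_all({"echo": EchoModel()}, "write a handler"))
```

The benefit of this design is that the calling code never branches on which model is in use, so adding a backend only requires a new subclass.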

Helper Files

  • CodebleuCalculator.py: Contains methods to compute CodeBLEU scores.
  • CodeMetricCalculator.py: Computes code metrics—LoC, Cyclomatic Complexity, Cognitive Complexity, and Halstead Metrics—for Python and JavaScript code.
  • CreatePrompt.py: Contains methods to create Type 1, 2, and 3 prompts for function generation.
  • HelperFunctions.py: Utility functions for configuration validation, file handling, and prompt management in function generation experiments.

For detailed implementation documentation, see comments in specific files.
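
As a rough illustration of how the three prompt types differ, the sketch below assembles a prompt from the pieces each type is described as using: the README for Type 1, a codebase summary for Type 2, and a codebase summary plus examples for Type 3. The function `build_prompt`, its parameters, and the section labels are hypothetical; the real wording lives in the prompt template files and is not reproduced here.

```python
def build_prompt(prompt_type, function_description, readme="",
                 codebase_summary="", examples=()):
    """Assemble a generation prompt for one of the three prompt types.

    Section labels ("Context", "Example", "Task") are placeholders, not the
    text of the package's prompt templates.
    """
    if prompt_type == 1:
        context = readme            # Zero Shot with README
    elif prompt_type in (2, 3):
        context = codebase_summary  # Zero/Few Shot with Codebase Summarization
    else:
        raise ValueError(f"unknown prompt type: {prompt_type}")

    parts = [f"Context:\n{context}"]
    if prompt_type == 3:
        # Few-shot: include example functions before the task.
        for index, example in enumerate(examples, start=1):
            parts.append(f"Example {index}:\n{example}")
    parts.append(f"Task:\n{function_description}")
    return "\n\n".join(parts)


prompt = build_prompt(
    3,
    "Write the publish handler.",
    codebase_summary="SUMMARY",
    examples=["function a() {}"],
)
print(prompt)
```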

Configuration Files

  • Each function selected from a repository has its own configuration file, which contains the fields listed in config_template.json.
  • The configuration file drives code generation for its function across the various models and prompt types.
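
One way to check a configuration against the template is to confirm that every key in the template also appears in the function's configuration. The sketch below does this for top-level keys only. The helper names and the example field names (`function_name`, `language`) are placeholders, since the actual fields are defined in config_template.json and are not listed here.

```python
import json


def missing_fields(config, template):
    """Return the template's top-level keys that are absent from the config."""
    return sorted(set(template) - set(config))


def load_config(config_text, template_text):
    """Parse a config and reject it if any template field is missing."""
    config = json.loads(config_text)
    template = json.loads(template_text)
    missing = missing_fields(config, template)
    if missing:
        raise ValueError(f"config is missing fields: {', '.join(missing)}")
    return config


# Placeholder field names; the real ones come from config_template.json.
template_text = '{"function_name": "", "language": ""}'
config_text = '{"function_name": "function1", "language": "javascript"}'
print(load_config(config_text, template_text))
```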
