# Using CLDK to validate code translation

In this tutorial, we will use CLDK to translate code and check properties of the translated code. You'll explore some of the benefits of using CLDK to perform quick and easy program analysis for this task. By the end of this tutorial, you will have implemented a simple Java-to-Python code translator that also performs light-weight property checking on the translated code.

Specifically, you will learn how to perform the following tasks on a Java application to create LLM prompts for code translation and to check the translated code:

1. Create a new instance of the CLDK class.
2. Create an analysis object for the target Java application.
3. Iterate over all files in the application.
4. Iterate over all classes in a file.
5. Sanitize the class for prompting the LLM.
6. Create Tree-sitter-based Java and Python analysis objects.

We will write a couple of helper methods to (1) format the LLM instruction for translating a Java class to Python and (2) prompt the LLM via Ollama. We will then use CLDK to analyze code and get context information for translating code and also checking properties of the translated code.

## Prerequisites

Before we get started, let's make sure you have the following installed:

1. Python 3.11 or later (you can use [pyenv](https://github.com/pyenv/pyenv) to install Python)
2. Java 11 or later (you can use [SDKMAN!](https://sdkman.io) to install Java)
3. Ollama 0.3.4 or later (you can get Ollama here: [Ollama download](https://ollama.com/download))

We will use Ollama to spin up a local [Granite code model](https://ollama.com/library/granite-code), which will serve as our LLM for this tutorial.

### Download Granite code model

After starting the Ollama server, please download the latest version of the Granite code 8b-instruct model by running the following command. There are other Granite code models available, but for this tutorial, we will use Granite code 8b-instruct. If you prefer to use a different Granite code model, you can replace `8b-instruct` with the tag of another version (see [Granite code tags](https://ollama.com/library/granite-code/tags)).

In [None]:
%%bash
ollama pull granite-code:8b-instruct

Let's make sure the model is downloaded by running the following command:

In [None]:
%%bash
ollama run granite-code:8b-instruct "Write a python function to print 'Hello, World!'"

### Install Ollama Python SDK

In [None]:
pip install ollama

### Install CLDK
CLDK is available at https://github.com/IBM/codellm-devkit. You can install it by running the following command:

In [None]:
pip install git+https://github.com/IBM/codellm-devkit.git

### Get the sample Java application
For this tutorial, we will use [Apache Commons CLI](https://github.com/apache/commons-cli) as the Java application under test. You can download the source code to a temporary directory by running the following command:

In [None]:
%%bash
wget https://github.com/apache/commons-cli/archive/refs/tags/rel/commons-cli-1.7.0.zip -O /tmp/commons-cli-1.7.0.zip && unzip -o /tmp/commons-cli-1.7.0.zip -d /tmp

The project will now be extracted to `/tmp/commons-cli-rel-commons-cli-1.7.0`.
<!-- We'll remove these files later, so don't worry about the location. -->

## Translate Java code to Python and build a light-weight property checker (for translation validation)
Code translation aims to convert source code from one programming language to another. Given the promising abilities of large language models (LLMs) in code synthesis, researchers are exploring their potential to automate code translation. In our recent work, [presented at ICSE'24](https://dl.acm.org/doi/10.1145/3597503.3639226), we found that LLM-based code translation is very promising. In this example, we will walk through the steps of translating a Java class to Python and checking various properties of translated code (e.g., number of methods, number of fields, formal arguments, etc.) as a simple form of translation validation.

Step 1: Import the required modules

In [None]:
from cldk.analysis.python.treesitter import PythonSitter
from cldk.analysis.java.treesitter import JavaSitter
import ollama
from cldk import CLDK
from cldk.analysis import AnalysisLevel

Step 2: Define a function for creating the LLM prompt, which instructs the LLM to translate a Java class to Python and includes the body of the Java class after removing all the comments and import statements.

In [None]:
def format_inst(code, focal_class, language):
    """
    Format the translation instruction for the given class and source language.
    """
    inst = f"Question: Can you translate the Java class `{focal_class}` below to Python and generate under code block (```)?\n"

    inst += "\n"
    inst += f"```{language}\n"
    inst += code
    inst += "```" if code.endswith("\n") else "\n```"
    inst += "\n"
    return inst
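
Step 2 describes removing import statements as well as comments, while the sanitization call used in this tutorial (`remove_all_comments`) only strips comments. The following is a minimal sketch of one way to drop import lines before building the prompt. It assumes simple line-based regex filtering, and `strip_java_imports` is a hypothetical helper that is not part of CLDK.

```python
import re

# Matches a whole Java import line, including `import static` and wildcard imports.
# Assumption: each import statement starts on its own line.
_IMPORT_RE = re.compile(
    r"^[ \t]*import\s+(?:static\s+)?[\w.]+(?:\.\*)?\s*;[ \t]*\r?\n?",
    re.MULTILINE,
)


def strip_java_imports(source: str) -> str:
    """Return the Java source with import statements removed (hypothetical helper)."""
    return _IMPORT_RE.sub("", source)


java_source = (
    "package demo;\n\n"
    "import java.util.List;\n"
    "import static java.lang.Math.*;\n\n"
    "public class Foo {\n"
    "    List<String> xs;\n"
    "}\n"
)
print(strip_java_imports(java_source))
```

Because this is a regex rather than a parser, a line inside a multi-line string that happens to look like an import would also be removed. Running it after comment removal avoids matching imports that appear inside comments.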

Step 3: Define a function to call the LLM (in this case, Granite code 8b-instruct) using Ollama.

In [None]:
def prompt_ollama(message: str, model_id: str = "granite-code:8b-instruct") -> str:
    """Prompt local model on Ollama"""
    response_object = ollama.generate(model=model_id, prompt=message)
    return response_object["response"]
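
The prompt asks the model to answer inside a fenced code block, so the raw response may include fence markers and surrounding prose in addition to the Python code. The sketch below shows one way to isolate the code before analyzing it. It assumes the model uses standard triple-backtick fences, and `extract_code_block` is a hypothetical helper that is not part of CLDK or the Ollama SDK.

```python
import re

# Build the fence marker programmatically to keep this example easy to embed.
FENCE = "`" * 3

# Captures the optional language tag and the body of each fenced block.
_FENCE_RE = re.compile(
    FENCE + r"[ \t]*([\w+-]*)[ \t]*\r?\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code_block(response: str, language: str = "python") -> str:
    """Return the first fenced block tagged with `language`.

    Falls back to the first fenced block of any kind, and to the whole
    response if no fence is found (hypothetical helper).
    """
    blocks = _FENCE_RE.findall(response)
    if not blocks:
        return response.strip()
    for tag, body in blocks:
        if tag.lower() == language:
            return body.strip()
    return blocks[0][1].strip()


sample = "Here is the translation:\n" + FENCE + "python\nclass A:\n    pass\n" + FENCE + "\nDone."
print(extract_code_block(sample))
```

With a helper like this, the value returned by `prompt_ollama` could be passed through `extract_code_block` before it is handed to a Python analysis object.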

Step 4: Translate a class of the Java application to Python and check two properties of the translated code: the number of translated methods and the number of translated fields.

In [None]:
# Create an instance of CLDK for Java analysis
cldk = CLDK(language="java")

# Create an analysis object for the Java application, providing the application path
analysis = cldk.analysis(project_path="/tmp/commons-cli-rel-commons-cli-1.7.0", analysis_level=AnalysisLevel.symbol_table)

# For simplicity, we run the code translation on a single class (this filter can be removed to run this code over the entire application)
target_class = "org.apache.commons.cli.GnuParser"

# Go through all the classes in the application
for class_name in analysis.get_classes():
    
    if class_name == target_class:
        # Get the location of the Java class
        class_path = analysis.get_java_file(qualified_class_name=class_name)
        
        # Skip the class if its source file cannot be located
        if not class_path:
            continue

        # Read the file content
        with open(class_path, "r", encoding="utf-8", errors="ignore") as f:
            class_body = f.read()
        
        # Sanitize the file content by removing comments
        sanitized_class = JavaSitter().remove_all_comments(source_code=class_body)

        # Create prompt for translating sanitized Java class to Python
        inst = format_inst(code=sanitized_class, language="java", focal_class=class_name.split(".")[-1])

        print(f"Instruction:\n{inst}\n")
        print("Translating Java code to Python . . .\n")

        # Prompt the local model on Ollama
        translated_code = prompt_ollama(message=inst)
        
        # Print translated code
        print(f"Translated Python code: {translated_code}\n")

        # Create python sitter instance for analyzing translated Python code
        py_cldk = PythonSitter()

        # Compute method, function, and field counts for the translated code
        all_methods = py_cldk.get_all_methods(module=translated_code)
        all_functions = py_cldk.get_all_functions(module=translated_code)
        all_fields = py_cldk.get_all_fields(module=translated_code)
        
        # Check counts against method and field counts for Java code
        num_translated_methods = len(all_methods) + len(all_functions)
        if num_translated_methods != len(analysis.get_methods_in_class(qualified_class_name=class_name)):
            print(f'Number of translated methods does not match in class {class_name}')
        else:
            print(f'Number of translated methods in class {class_name} is {num_translated_methods}')
        if all_fields is not None:
            if len(all_fields) != len(analysis.get_class(qualified_class_name=class_name).field_declarations):
                print(f'Number of translated fields does not match in class {class_name}')
            else:
                print(f'Number of translated fields in class {class_name} is {len(all_fields)}')
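
The same pattern extends to the other properties mentioned earlier, such as formal arguments. The following is a minimal sketch that counts the formal parameters of each translated function using Python's standard `ast` module instead of CLDK, an assumption made so the example runs on its own. `python_method_arities` is a hypothetical helper, and the expected counts are hard-coded for illustration where a real check would take them from the Java analysis.

```python
import ast


def python_method_arities(source: str) -> dict[str, int]:
    """Map each function or method name to its number of named formal parameters.

    A leading `self` or `cls` is not counted; `*args` and `**kwargs` are ignored.
    If two functions share a name, the one visited last wins (hypothetical helper).
    """
    arities: dict[str, int] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names = [a.arg for a in node.args.posonlyargs + node.args.args]
            if names and names[0] in ("self", "cls"):
                names = names[1:]
            arities[node.name] = len(names) + len(node.args.kwonlyargs)
    return arities


# Stand-in for the LLM output; in the tutorial this would be the translated code.
translated = """
class GnuParser:
    def flatten(self, options, arguments, stop_at_non_option):
        return []
"""

# Illustrative expectation only; a real check would derive it from the Java class.
expected = {"flatten": 3}

actual = python_method_arities(translated)
for name, count in expected.items():
    if actual.get(name) != count:
        print(f"Formal argument count does not match for method {name}")
    else:
        print(f"Method {name} has {count} formal arguments")
```

Matching by method name is a simplification: Java overloads collapse to a single Python name, and a translation may rename methods to snake_case, so a real check would need a name-mapping step.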
