In [1]:
path_to_repo = "./sample_java_repo/csc-java-course-spring-2023-key-value-store-EgorShibaev/"
method1_name = "org.csc.java.spring2023.KeyValueStoreImplementation.openValueStream"
method2_name = "org.csc.java.spring2023.KeyValueStoreImplementation.getIndexManager"

This notebook supports only Java repositories.

Java code is parsed with the `javalang` library.

In [2]:
import javalang
import os

def extract_code_block(code: str, start_position: int) -> tuple[str, int]:
    """
    Extracts the code block starting at the given position by matching opening and closing braces.
    Braces inside string literals or comments are not treated specially.
    :param code: Java or Kotlin code
    :param start_position: position at or before the opening brace of the code block
    :return: tuple of the code block and the end position of the code block
    """
    open_braces = 0
    code_block_start = start_position
    code_block_end = -1

    for i in range(start_position, len(code)):
        if code[i] == "{":
            open_braces += 1
        elif code[i] == "}":
            open_braces -= 1
            if open_braces == 0:
                code_block_end = i + 1
                break

    if code_block_end == -1:
        raise ValueError("Code block extraction failed: no matching closing brace found")

    while code[code_block_start] == "\n":
        code_block_start += 1

    return code[code_block_start:code_block_end], code_block_end



def extract_method_code(java_source: str, class_name: str, method_name: str) -> str | None:
    """
    This function extracts the code of the method with the
    given name and class from the Java source code
    :param java_source: Java source code
    :param class_name: name of the class
    :param method_name: name of the method
    :return: code of the method
    """
    tree = javalang.parse.parse(java_source)

    for path, node in tree.filter(javalang.tree.MethodDeclaration):
        if class_name == path[1][0].name and method_name == node.name:
            start_line = node.position.line - 1
            start_position = sum(len(line) + 1 for line in java_source.splitlines()[:start_line]) \
                             + node.position.column - 1
            return extract_code_block(java_source, start_position)[0]

    return None

def extract_method_code_by_repo_path_and_fqname(path_to_repository: str, fully_qualified_name: str) -> str | None:
    """
    This function extracts the code of the method with the
    given fully qualified name from the repository
    :param path_to_repository: path to the repository
    :param fully_qualified_name: fully qualified name of the method
    :return: code of the method
    """

    if fully_qualified_name.count(".") >= 2:
        package, class_name, method_name = fully_qualified_name.rsplit(".", 2)
    else:
        class_name, method_name = fully_qualified_name.rsplit(".", 1)
        package = ""

    package_path = package.replace(".", os.path.sep)
    file_path = os.path.join(path_to_repository, "src", "main", "java", package_path, f"{class_name}.java")

    with open(file_path, "r", encoding="utf-8") as java_file:
        java_source = java_file.read()

    # Extract the method code
    return extract_method_code(java_source, class_name, method_name)
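
As a rough illustration of the lookup convention used above, the sketch below splits a fully qualified method name and builds the expected source path. It assumes the conventional `src/main/java` layout and a top-level class stored in a file named after it; the helper name `resolve_source_path` is hypothetical and not part of the notebook.

```python
import os


def resolve_source_path(repo_root: str, fully_qualified_name: str) -> tuple[str, str, str]:
    """Return (file_path, class_name, method_name) for a fully qualified method name.

    Hypothetical helper: assumes the src/main/java layout and a top-level
    class stored in <ClassName>.java.
    """
    parts = fully_qualified_name.split(".")
    if len(parts) < 2:
        raise ValueError("expected at least 'ClassName.methodName'")
    # Everything before the last two components is the package.
    *package_parts, class_name, method_name = parts
    file_path = os.path.join(
        repo_root, "src", "main", "java", *package_parts, f"{class_name}.java"
    )
    return file_path, class_name, method_name


path, cls, method = resolve_source_path(
    "repo", "org.csc.java.spring2023.KeyValueStoreImplementation.openValueStream"
)
print(path)    # path to KeyValueStoreImplementation.java under repo/src/main/java
print(cls, method)
```

Under these assumptions, nested classes and overloaded methods are not distinguished: only the class name and the method name are used for the lookup.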

In [3]:
method1_code = extract_method_code_by_repo_path_and_fqname(path_to_repo, method1_name)
method2_code = extract_method_code_by_repo_path_and_fqname(path_to_repo, method2_name)

In [4]:
method1_code

'InputStream openValueStream(byte[] key) throws IOException {\n    Objects.requireNonNull(key);\n    if (!contains(key)) {\n      throw new IOException("No such key in store");\n    }\n    check_closed();\n\n\n    var blocks = indexManager.getFileBlocksLocations(key);\n    var stream = InputStream.nullInputStream();\n\n    for (var block : blocks) {\n      stream = new SequenceInputStream(stream, valueStoreManager.openBlockStream(block));\n    }\n    return stream;\n  }'

In [5]:
method2_code

'IndexManager getIndexManager() {\n    check_closed();\n\n    return indexManager;\n  }'

I will try three models for this task: two trained on Java code and one trained on natural-language text. The first two are `microsoft/CodeGPT-small-java-adaptedGPT2` and `neulab/codebert-java` from Hugging Face; the third is `all-MiniLM-L6-v2` from the Sentence Transformers library.

For each model, I compute an embedding for each method in a pair and use the cosine similarity between the two embeddings as the similarity measure.

I expect the models trained on Java code to extract meaningful features from the code, so that methods with similar semantics end up closer to each other.

The Sentence Transformers model is trained for semantic textual similarity, so it may also be suitable for code similarity tasks.
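To make the scoring step concrete, here is a minimal pure-Python sketch of mean pooling followed by cosine similarity. The token vectors are made-up toy values, not real model outputs, and the function names `mean_pool` and `cosine_similarity` are illustrative only; the notebook itself relies on `torch` for both steps.

```python
import math


def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average per-token vectors into a single fixed-size embedding."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "token embeddings" for two snippets (made-up numbers, 2 dimensions).
snippet_1 = [[1.0, 0.0], [0.0, 1.0]]              # two tokens
snippet_2 = [[2.0, 2.0], [1.0, 1.0], [3.0, 3.0]]  # three tokens

embedding_1 = mean_pool(snippet_1)  # [0.5, 0.5]
embedding_2 = mean_pool(snippet_2)  # [2.0, 2.0]

# The pooled vectors point in the same direction, so the score is close to 1.0
# even though their lengths differ.
print(cosine_similarity(embedding_1, embedding_2))
```

Because cosine similarity ignores vector length, snippets with different numbers of tokens can still be compared once each is pooled into one vector.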

In [6]:
from transformers import AutoTokenizer, AutoModel
from sentence_transformers import SentenceTransformer

# wrapper classes for models from different sources
class SentenceTransformerModel:

    def __init__(self, model_name):
        self.model = SentenceTransformer(model_name)
    
    def encode(self, method_code):
        return self.model.encode(method_code, convert_to_tensor=True).view(1, -1)

class HuggingfaceModel:

    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
    
    def encode(self, method_code):
        # truncate inputs that exceed the model's maximum sequence length
        inputs = self.tokenizer(method_code, return_tensors="pt", truncation=True)
        outputs = self.model(**inputs)
        return outputs.last_hidden_state.mean(dim=1)

In [7]:
models = {
    "CodeGPT": HuggingfaceModel("microsoft/CodeGPT-small-java-adaptedGPT2"),
    "CodeBERT": HuggingfaceModel("neulab/codebert-java"),
    "MiniLM": SentenceTransformerModel("all-MiniLM-L6-v2"),
}

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of RobertaModel were not initialized from the model checkpoint at neulab/codebert-java and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
import torch

def get_similarity(model_name, method1_code, method2_code):

    with torch.no_grad():
        embedding_1 = models[model_name].encode(method1_code)
        embedding_2 = models[model_name].encode(method2_code)

    similarity = torch.nn.functional.cosine_similarity(embedding_1, embedding_2)

    return similarity.item()

In [9]:
for model_name in models:
    similarity = get_similarity(model_name, method1_code, method2_code)
    print(f"Similarity by {model_name}: {similarity:.3f}")

Similarity by CodeGPT: 0.535
Similarity by CodeBERT: 0.911
Similarity by MiniLM: 0.368


## Comparing the models

In [10]:
def compare_models(code1, code2):
    for model_name in models:
        similarity = get_similarity(model_name, code1, code2)
        print(f"Similarity by {model_name}: {similarity:.3f}")

In [11]:
compare_models("""
public int factorial(int n) {
    if (n == 0) {
        return 1;
    } else {
        return n * factorial(n - 1);
    }
}
""","""
public String[] parseCSVLine(String csvLine) {
    return csvLine.split(",");
}
""")

Similarity by CodeGPT: 0.441
Similarity by CodeBERT: 0.906
Similarity by MiniLM: 0.014


In [12]:
compare_models("""
public int calculateSquare(int num) {
    return num * num;
}
""","""
public boolean isNumeric(String str) {
    try {
        Double.parseDouble(str);
        return true;
    } catch (NumberFormatException e) {
        return false;
    }
}
""")

Similarity by CodeGPT: 0.672
Similarity by CodeBERT: 0.902
Similarity by MiniLM: 0.385


In [13]:
compare_models("""
public int findMax(int[] arr) {
    int max = arr[0];
    for (int i = 1; i < arr.length; i++) {
        if (arr[i] > max) {
            max = arr[i];
        }
    }
    return max;
}
""","""
public String toUpperCase(String input) {
    return input.toUpperCase();
}
""")

Similarity by CodeGPT: 0.453
Similarity by CodeBERT: 0.867
Similarity by MiniLM: 0.036


I consider the two methods below similar, so I expect a high similarity score for them.

CodeGPT and CodeBERT output high scores, but the Sentence Transformers model outputs a much lower score, so it does not recognize the similarity between these methods.

In [14]:
compare_models("""
public static int sum(List<Integer> numbers) {
        int sum = 0;
        for (int num : numbers) {
            sum += num;
        }
        return sum;
}
""","""
public static int product(List<Integer> numbers) {
        int product = 1;
        for (int num : numbers) {
            product *= num;
        }
        return product;
}
""")

Similarity by CodeGPT: 0.968
Similarity by CodeBERT: 0.998
Similarity by MiniLM: 0.601


In [15]:
compare_models("""
public static int sqrt(int a) {
    return a * a;
}
""","""
public static long sqrt(int a) {
    return Math.pow(a, 2);
}
""")

Similarity by CodeGPT: 0.840
Similarity by CodeBERT: 0.973
Similarity by MiniLM: 0.768


### Summary

I compared three models on pairs of Java methods.
- `all-MiniLM-L6-v2` from the Sentence Transformers library is not the best choice for this task: it was trained on large amounts of text data to produce sentence embeddings that capture the semantic meaning of __text__, so it is not well suited to code similarity.
- CodeBERT outputs a high score for every pair (at least 0.867 in the examples above), even for unrelated methods, which should be taken into account when interpreting its scores.
- CodeGPT seems to perform better than CodeBERT for this task, so I will use CodeGPT for the second task.

In [16]:
!jupyter nbconvert --to html task1.ipynb

[NbConvertApp] Converting notebook task1.ipynb to html
[NbConvertApp] Writing 646514 bytes to task1.html
