# Fine-tuning Dataset Exploration

### This notebook explores the AtCoder dataset and the Google CodeJam dataset released by https://github.com/Chenning-Tao/C4.

### AtCoder dataset and Google CodeJam dataset:

By definition, cross-language code clones are two or more code blocks written in different programming languages that are functionally similar to each other.

AtCoder is a programming competition website based in Japan, and Google CodeJam is a programming competition run by Google. These contest sites accept solutions to a posted problem without any language restriction. At the same time, all submitted code blocks are tested against the same inputs and expected outputs to validate them. Accepted solutions to a single problem, in any programming language, are therefore functional clones of each other, which makes this a well-validated cross-language clone dataset.

In this dataset, accepted solutions to a single problem statement are validated cross-language clones of each other, whereas accepted solutions to two different problem statements form validated cross-language non-clone pairs.
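
The labeling rule described above can be sketched in a few lines. This is a minimal illustration, not code from the C4 repository: `label_pair` and the toy `solutions` list are hypothetical names and data introduced only for this example.

```python
def label_pair(task_a, task_b):
    """Return 1 (clone) if both solutions answer the same task, else 0 (non-clone)."""
    return 1 if task_a == task_b else 0


# Hypothetical accepted solutions: (task id, language, solution id)
solutions = [
    (101, "java", "s1"),
    (101, "py", "s2"),
    (202, "py", "s3"),
]

# Label every pair of solutions written in two different languages.
labeled = [
    (a[2], b[2], label_pair(a[0], b[0]))
    for i, a in enumerate(solutions)
    for b in solutions[i + 1:]
    if a[1] != b[1]
]
print(labeled)
```

Here `s1` and `s2` solve the same task and form a clone pair, while `s1` and `s3` solve different tasks and form a non-clone pair; `s2` and `s3` share a language and are skipped.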

### Notes:
C4 dataset: the program-level pair data contains only the original code strings, which have to be tokenized into lists of code tokens before being fed to the model. In addition, non-clone pairs have to be constructed.   
Note: many original code strings are very long (normally > 512 code tokens), which exceeds the maximum input length of BERT-based models.
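
A rough way to see which programs exceed the 512-token limit is sketched below. The regex tokenizer is only a stand-in chosen for illustration (an assumption, not the tokenizer used for the model); a subword tokenizer would usually produce more tokens than this, and a pair of programs has to share the same input budget.

```python
import re

MAX_TOKENS = 512  # input limit mentioned in the note above

# Stand-in tokenizer (assumption): identifiers, numbers, or single symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def rough_tokenize(code_str):
    """Split a code string into coarse code tokens."""
    return TOKEN_PATTERN.findall(code_str)


def fits_model_input(code_str, max_tokens=MAX_TOKENS):
    """Check whether the coarse token count stays within the limit."""
    return len(rough_tokenize(code_str)) <= max_tokens


short_code = "print(1 + 2)"
long_code = "x = 1\n" * 300  # 3 tokens per line

print(len(rough_tokenize(short_code)), fits_model_input(short_code))
print(len(rough_tokenize(long_code)), fits_model_input(long_code))
```

Programs that fail this check would need truncation or another length-reduction step before they can be used as model input.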

In [1]:
import os
import json
import jsonlines

import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 300)
from pprint import pprint

In [2]:
# define dataset root_dir

c4_root = '/Users/rongdang/Desktop/semantic-code-clone/dataset/cross-language/C4'


In [3]:
# calculate distinct programming languages

set_name = 'pair_train.jsonl'
set_program_dir = os.path.join(c4_root, set_name)

language_list = list()
with open(set_program_dir, 'r') as f:
    sample_file = f.readlines()
    for sample_line in sample_file:
        sample_data = json.loads(sample_line)
        category1 = sample_data['Category1']
        category2 = sample_data['Category2']
        language_list.append(category1)
        language_list.append(category2)
    language_list = list(set(language_list))

print('total number of programming language: ' + str(len(language_list)))
print('individual programming languages: ')
print(language_list)


total number of programming language: 5
individual programming languages: 
['cs', 'cpp', 'py', 'java', 'c']


In this dataset, solutions are written in 5 programming languages: C# (cs), Java (java), C (c), Python (py) and C++ (cpp). We are interested in the Java-Python and Python-Java language pairs.

### Part 1: Preview the Dataset
We display one java-py data point and one py-java data point in the training set.

In [4]:
# display one java-py program pair

with open(set_program_dir, 'r') as f:
    sample_file = f.readlines()
    for sample_line in sample_file:
        sample_data = json.loads(sample_line)
        category1 = sample_data['Category1']
        category2 = sample_data['Category2']
        if category1 == 'java' and category2 == 'py':
            pprint(sample_data)
            break


{'Category1': 'java',
 'Category2': 'py',
 'Code1': 'import java.io.*; \n'
          'import java.math.*; \n'
          'import java.util.*; \n'
          ' \n'
          'public class ReorderingTrainCars \n'
          '{ \n'
          ' static Scanner sc = new Scanner(System.in); \n'
          ' static PrintWriter out = new PrintWriter(System.out); \n'
          ' public static void main(String[] args) \n'
          ' { \n'
          '  for(int '
          'caseId=1,totalCases=sc.nextInt();caseId<=totalCases;caseId++) \n'
          '  { \n'
          '   out.println("Case #"+caseId+": "+solve()); \n'
          '   //Add logic here \n'
          '   out.flush(); \n'
          '  } \n'
          ' } \n'
          '  \n'
          ' static int solve() { \n'
          '  int[] prefix = new int[26], suffix = new int[26]; \n'
          '  Arrays.fill(prefix, -1); Arrays.fill(suffix, -1); \n'
          '  int[] enclosed = new int[26], filled = new int[26]; \n'
          '  Arrays.fill(enclos

In [5]:
# display one py-java program pair

with open(set_program_dir, 'r') as f:
    sample_file = f.readlines()
    for sample_line in sample_file:
        sample_data = json.loads(sample_line)
        category1 = sample_data['Category1']
        category2 = sample_data['Category2']
        if category1 == 'py' and category2 == 'java':
            pprint(sample_data)
            break


{'Category1': 'py',
 'Category2': 'java',
 'Code1': '#!/usr/bin/env python\n'
          'from __future__ import unicode_literals\n'
          'import codecs\n'
          'import collections\n'
          'import itertools\n'
          'import sys\n'
          '\n'
          '\n'
          'MODULO = 1000000007\n'
          'FACT_ARRAY = [1, 1]\n'
          'def fact(n):\n'
          '    if n >= len(FACT_ARRAY):\n'
          '        curr_n = len(FACT_ARRAY)\n'
          '        for c in xrange(curr_n, n+1):\n'
          '            FACT_ARRAY.append((FACT_ARRAY[-1]*c) % MODULO)\n'
          '    return FACT_ARRAY[n]\n'
          '\n'
          'def get_answer_for_combs(combs):\n'
          '    result = 1\n'
          '    for num in combs:\n'
          '        result = (result * fact(num)) % MODULO\n'
          '    return result\n'
          '\n'
          'def get_result(strs):\n'
          '    # print\n'
          '    # print("Starting new case: {}".format(strs))\n'
          '

In [6]:
# display attributes of data point

with open(set_program_dir, 'r') as f:
    sample_file = f.readlines()
sample_data = json.loads(sample_file[0])

display_data = dict()
field_name = list()
field_type = list()
for key, value in sample_data.items():
    field_name.append(key)
    field_type.append(type(value))
display_data['attr_name'] = field_name
display_data['attr_type'] = field_type

df = pd.DataFrame(display_data, index=['attr_1','attr_2','attr_3','attr_4','attr_5','attr_6','attr_7'])
display(df)


| | attr_name | attr_type |
|---|---|---|
| attr_1 | Task | `<class 'int'>` |
| attr_2 | ID1 | `<class 'int'>` |
| attr_3 | Category1 | `<class 'str'>` |
| attr_4 | Code1 | `<class 'str'>` |
| attr_5 | ID2 | `<class 'int'>` |
| attr_6 | Category2 | `<class 'str'>` |
| attr_7 | Code2 | `<class 'str'>` |


As shown above, each data point has 7 attributes. 'Task' is the task id of the problem. 'ID1' and 'ID2' are the solution ids of the two programs. 'Code1' and 'Code2' are the solution code in the source language and in the target language, stored as strings. 'Category1' and 'Category2' are the languages of the source code and the target code, respectively.
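
Because a record may list either language first, it can help to normalize each record so that Java is always the source and Python the target. The sketch below assumes the 7 attributes described above; `normalize_pair` is a hypothetical helper, and its output keys follow the `task_id` / `src_*` / `tgt_*` naming used in Part 3 of this notebook.

```python
def normalize_pair(sample):
    """Return the record with Java as source and Python as target, or None."""
    langs = (sample["Category1"], sample["Category2"])
    if langs == ("java", "py"):
        first, second = "1", "2"
    elif langs == ("py", "java"):
        first, second = "2", "1"
    else:
        return None  # not a Java/Python pair
    return {
        "task_id": sample["Task"],
        "src_id": sample["ID" + first],
        "src_code": sample["Code" + first],
        "tgt_id": sample["ID" + second],
        "tgt_code": sample["Code" + second],
        "category": "java-python",
    }


# Toy record (hypothetical values) with Python listed first.
toy = {
    "Task": 7,
    "ID1": 1, "Category1": "py", "Code1": "print(1)",
    "ID2": 2, "Category2": "java", "Code2": "class A {}",
}
print(normalize_pair(toy))
```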

### Part 2: Analyze the Dataset
We intend to explore Java-Python and Python-Java program pairs. We will investigate dataset statistics in terms of: 1) the number of problems (tasks); 2) the total number of language pairs; 3) the average number of code lines in each language.

In [7]:
def count_code_lines(code_str):
    # count the lines that contain at least one non-whitespace character
    line_num = 0
    for line in code_str.split('\n'):
        if line.strip() != '':
            line_num += 1
    return line_num

In [8]:
# analyze the Java-Python program pair across train, valid, test datasets

src_language = 'java'
tgt_language = 'py'
sets = ['pair_train.jsonl', 'pair_valid.jsonl', 'pair_test.jsonl']

task_list = list()
pair_nums = 0
java_line_nums = list()
python_line_nums = list()

for set_name in sets:
    set_program_dir = os.path.join(c4_root, set_name)
    with open(set_program_dir, 'r') as f:
        sample_file = f.readlines()
    for sample_line in sample_file:
        sample_data = json.loads(sample_line)
        category1 = sample_data['Category1']
        category2 = sample_data['Category2']
        if category1 == src_language and category2 == tgt_language:
            task_list.append(sample_data['Task'])
            pair_nums += 1
            java_line_nums.append(count_code_lines(sample_data['Code1']))
            python_line_nums.append(count_code_lines(sample_data['Code2']))
    f.close()

print('-----Java-Python-----program-level statistics-----')
print('total tasks: ' + str(len(list(set(task_list)))))
print('total pairs: ' + str(pair_nums))
print('avg number of lines in Java program: ' + str(round(sum(java_line_nums)/len(java_line_nums), 2)))
print('avg number of lines in Python program: ' + str(round(sum(python_line_nums)/len(python_line_nums), 2)))


-----Java-Python-----program-level statistics-----
total tasks: 1018
total pairs: 4057
avg number of lines in Java program: 93.02
avg number of lines in Python program: 24.04


In [9]:
# analyze the Python-Java program pair across train, valid, test datasets

src_language = 'py'
tgt_language = 'java'
sets = ['pair_train.jsonl', 'pair_valid.jsonl', 'pair_test.jsonl']

task_list = list()
pair_nums = 0
python_line_nums = list()
java_line_nums = list()

for set_name in sets:
    set_program_dir = os.path.join(c4_root, set_name)
    with open(set_program_dir, 'r') as f:
        sample_file = f.readlines()
    for sample_line in sample_file:
        sample_data = json.loads(sample_line)
        category1 = sample_data['Category1']
        category2 = sample_data['Category2']
        if category1 == src_language and category2 == tgt_language:
            task_list.append(sample_data['Task'])
            pair_nums += 1
            python_line_nums.append(count_code_lines(sample_data['Code1']))
            java_line_nums.append(count_code_lines(sample_data['Code2']))
    f.close()

print('-----Python-Java-----program-level statistics-----')
print('total tasks: ' + str(len(list(set(task_list)))))
print('total pairs: ' + str(pair_nums))
print('avg number of lines in Python program: ' + str(round(sum(python_line_nums)/len(python_line_nums), 2)))
print('avg number of lines in Java program: ' + str(round(sum(java_line_nums)/len(java_line_nums), 2)))


-----Python-Java-----program-level statistics-----
total tasks: 1014
total pairs: 4142
avg number of lines in Python program: 23.99
avg number of lines in Java program: 92.52


### Part 3: Construct a Fine-tuning Dataset based on the C4 Dataset
We intend to construct a fine-tuning dataset based on the C4 dataset. Specifically, 1) we will collect all Java-Python and Python-Java program pairs and find the set of problems that have both Java and Python solutions (the intersection problem set); 2) we will collect the Java and Python solutions of these intersection problems and build Java-Python program clone pairs; 3) we will build Java-Python program non-clone pairs; 4) we will pre-process the whole dataset and split it into 3 separate sets: the training set, the validation set and the test set. The ratio of training to validation to test pairs is approximately 8:1:1.
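
One way to realize step 4 without leaking a problem across sets is to split on task ids rather than on individual pairs. The sketch below is an illustration under stated assumptions, not the procedure used in this notebook: `split_tasks` is a hypothetical helper, the shuffle and seed are arbitrary choices, and an 8:1:1 ratio over tasks only approximates an 8:1:1 ratio over pairs because tasks have different numbers of solutions.

```python
import random


def split_tasks(task_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the unique task ids and cut them into train/valid/test sets."""
    unique_ids = sorted(set(task_ids))
    random.Random(seed).shuffle(unique_ids)
    n_train = int(len(unique_ids) * ratios[0])
    n_valid = int(len(unique_ids) * ratios[1])
    train = set(unique_ids[:n_train])
    valid = set(unique_ids[n_train:n_train + n_valid])
    test = set(unique_ids[n_train + n_valid:])  # remainder goes to test
    return train, valid, test


train, valid, test = split_tasks(range(100))
print(len(train), len(valid), len(test))
```

Every pair is then assigned to the set that contains its task id, so all solutions to one problem stay in the same set.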

In [3]:
# define original and pre-processed dataset root_dir

original_root = os.path.join(c4_root, 'dataset')
preprocessed_root = os.path.join(c4_root, 'preprocessed_dataset')


In [32]:
# collect all Java-Python and Python-Java program pairs from 'pair_train.jsonl', 'pair_valid.jsonl' and 'pair_test.jsonl'

src_language = 'java'
tgt_language = 'py'
sets = ['pair_train.jsonl', 'pair_valid.jsonl', 'pair_test.jsonl']

original_data = list()
pair_clones = os.path.join(original_root, 'pair_clones.jsonl')

for set_name in sets:
    set_program_dir = os.path.join(c4_root, set_name)
    with open(set_program_dir, 'r') as f:
        lines = f.readlines()
        for line in lines:
            sample_data = json.loads(line)
            # add java-py pairs
            if sample_data['Category1'] == src_language and sample_data['Category2'] == tgt_language:
                item = dict()
                item['task_id'] = sample_data['Task']
                item['src_id'] = sample_data['ID1']
                item['src_code'] = sample_data['Code1']
                item['src_tokens'] = list()
                item['tgt_id'] = sample_data['ID2']
                item['tgt_code'] = sample_data['Code2']
                item['tgt_tokens'] = list()
                item['category'] = 'java-python'
                if item not in original_data and item['src_code'] != '\'' and item['tgt_code'] != '\'':
                    original_data.append(item)
            # add py-java pairs
            elif sample_data['Category1'] == tgt_language and sample_data['Category2'] == src_language:
                item = dict()
                item['task_id'] = sample_data['Task']
                item['src_id'] = sample_data['ID2']
                item['src_code'] = sample_data['Code2']
                item['src_tokens'] = list()
                item['tgt_id'] = sample_data['ID1']
                item['tgt_code'] = sample_data['Code1']
                item['tgt_tokens'] = list()
                item['category'] = 'java-python'
                if item not in original_data and item['src_code'] != '\'' and item['tgt_code'] != '\'':
                    original_data.append(item)
    f.close()

# sort the program pairs by task_id (int)
sorted_data = sorted(original_data, key=lambda k: k['task_id'])
print('Start writing Java-Python pair_clones.jsonl...')
print('Total number of clone pairs: ' + str(len(sorted_data)))
with jsonlines.open(pair_clones, 'w') as writer:
    writer.write_all(sorted_data)


Start writing Java-Python pair_clones.jsonl...
Total number of clone pairs: 7868


In [11]:
# display one pre-processed Java-Python clone program pair (pre-process procedure implemented in PyCharm)

preprocessed_pair_clones = os.path.join(preprocessed_root, 'pair_clones.jsonl')
with open(preprocessed_pair_clones, 'r') as f:
    lines = f.readlines()
    sample_data = json.loads(lines[7851])
    pprint(sample_data)
#     print(sample_data['src_code'])
#     print('----------------------')
#     print(sample_data['tgt_code'])
f.close()


{'category': 'java-python',
 'src_code': 'import java.util.Scanner; \n'
             'class Main{ \n'
             ' static int G; \n'
             ' static int D; \n'
             ' static int[] p; \n'
             ' static int[] c; \n'
             ' static boolean[] used; \n'
             ' static long ans = 1000000000; \n'
             ' public static void main(String[] args) { \n'
             '  Scanner sc = new Scanner(System.in);   \n'
             '  String ss = sc.next(); \n'
             '  long[][] dp = new long[ss.length()+1][4]; \n'
             '  dp[0][0] = 1; \n'
             '  long mod = 1000000007; \n'
             '  for(int i = 0;i < ss.length();i++){ \n'
             "   if(ss.charAt(i) == 'A' ){ \n"
             '    dp[i+1][0] += dp[i][0] %mod;   \n'
             '    dp[i+1][1] += dp[i][1] % mod;   \n'
             '    dp[i+1][2] += dp[i][2] % mod;   \n'
             '    dp[i+1][3] += dp[i][3] % mod;   \n'
             '    dp[i+1][1] += dp[i][0] % mod; \n'


In [5]:
# collect Java and Python solutions of the intersection problems
# build an all_programs.json file

solutions = {
    'java': dict(),
    'python': dict()
}

preprocessed_pair_clones = os.path.join(preprocessed_root, 'pair_clones.jsonl')
all_programs = os.path.join(preprocessed_root, 'all_programs.json')

with open(preprocessed_pair_clones, 'r') as f:
    lines = f.readlines()
    for line in lines:
        sample_data = json.loads(line)
        task_id = sample_data['task_id']
        src_id = sample_data['src_id']
        tgt_id = sample_data['tgt_id']
        # setdefault creates the per-task dict the first time a task_id is seen
        solutions['java'].setdefault(task_id, dict())[src_id] = sample_data['src_code']
        solutions['python'].setdefault(task_id, dict())[tgt_id] = sample_data['tgt_code']

with open(all_programs, mode='w', encoding='utf-8') as json_file_to_write:
    json_file_to_write.write(json.dumps(solutions, indent=4))
json_file_to_write.close()

In [4]:
# build a correct_programs.json file based on the all_programs.json file (implemented in PyCharm)
# calculate the total number of Java and Python solutions among the intersection problems in the correct_programs.json

correct_programs = os.path.join(preprocessed_root, 'correct_programs.json')
java_functions_num = 0
python_functions_num = 0

with open(correct_programs, mode='r', encoding='utf-8') as json_file_to_read:
    json_data = json.load(json_file_to_read)
    java_pool = json_data['java']
    for task, task_set in java_pool.items():
        java_functions_num += len(task_set.keys())
    python_pool = json_data['python']
    for task, task_set in python_pool.items():
        python_functions_num += len(task_set.keys())
json_file_to_read.close()

print('Total number of Java tasks: ' + str(len(java_pool.keys())))
print('Total number of Java programs: ' + str(java_functions_num))
print('Total number of Python tasks: ' + str(len(python_pool.keys())))
print('Total number of Python programs: ' + str(python_functions_num))

Total number of Java tasks: 993
Total number of Java programs: 5660
Total number of Python tasks: 1007
Total number of Python programs: 6239


In [None]:
# build a one-sentence-per-line raw corpus (txt file) for sentencepiece model training based on the correct_programs.json

correct_programs = os.path.join(preprocessed_root, 'correct_programs.json')
txt_file = os.path.join(preprocessed_root, 'programs.txt')
result = list()

with open(correct_programs, mode='r', encoding='utf-8') as json_file_to_read:
    json_data = json.load(json_file_to_read)
    java_pool = json_data['java']
    python_pool = json_data['python']
    
    for _, solutions in java_pool.items():
        for _, solution in solutions.items():
            function = solution.replace('\n', ' ').strip() + '\n'
            result.append(function)
    
    
    for _, solutions in python_pool.items():
        for _, solution in solutions.items():
            function = solution.replace('\n', ' ').strip() + '\n'
            result.append(function)
json_file_to_read.close()


with open(txt_file, mode='w') as output_file:
    output_file.writelines(result)
output_file.close()

In [13]:
# build the Java-Python program clone pairs based on the correct_programs.json file
# display the total number of program clone pairs

clone_pair_programs = os.path.join(preprocessed_root, 'pair_clone_programs.jsonl')
with open(clone_pair_programs, 'r') as f:
    lines = f.readlines()
    print('Total number of pre-processed program clone pairs: ' + str(len(lines)))
#     sample_data_1 = json.loads(lines[32656])
#     print(sample_data_1['task_id'])
#     sample_data_2 = json.loads(lines[32657])
#     print(sample_data_2['task_id'])
f.close()


Total number of pre-processed program clone pairs: 40863
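
The pair-building step itself was run outside this notebook. A minimal sketch of how clone pairs could be derived from a dict shaped like correct_programs.json is given below, assuming every Java solution of a task is paired with every Python solution of the same task. `build_clone_pairs` and the toy data are hypothetical, and whether the 40863 pairs above were produced exactly this way is not confirmed by the notebook.

```python
from itertools import product


def build_clone_pairs(programs):
    """Pair each Java solution with each Python solution of the same task."""
    pairs = []
    shared_tasks = sorted(set(programs["java"]) & set(programs["python"]))
    for task_id in shared_tasks:
        java_items = programs["java"][task_id].items()
        python_items = programs["python"][task_id].items()
        for (src_id, src_code), (tgt_id, tgt_code) in product(java_items, python_items):
            pairs.append({
                "task_id": task_id,
                "src_id": src_id,
                "src_code": src_code,
                "tgt_id": tgt_id,
                "tgt_code": tgt_code,
                "category": "java-python",
            })
    return pairs


# Toy input (hypothetical): only task "1" has solutions in both languages.
toy_programs = {
    "java": {"1": {"a": "class A {}", "b": "class B {}"}, "2": {"c": "class C {}"}},
    "python": {"1": {"x": "print(1)"}, "3": {"y": "print(3)"}},
}
clone_pairs = build_clone_pairs(toy_programs)
print(len(clone_pairs))
```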


In [31]:
# build non-clone program pairs: 1) build a train_programs.json file, 
#                                   construct non-clone program pairs for the training set based on this json file

solutions = {
    'java': dict(),
    'python': dict()
}

preprocessed_pair_clones = os.path.join(preprocessed_root, 'pair_clone_programs.jsonl')
train_programs = os.path.join(preprocessed_root, 'train_programs.json')

with open(preprocessed_pair_clones, 'r') as input_file:
    lines = input_file.readlines()
    for line in lines[0:32657]:
        sample_data = json.loads(line)
        task_id = sample_data['task_id']
        src_id = sample_data['src_id']
        tgt_id = sample_data['tgt_id']
        # setdefault creates the per-task dict the first time a task_id is seen
        solutions['java'].setdefault(task_id, dict())[src_id] = sample_data['src_code']
        solutions['python'].setdefault(task_id, dict())[tgt_id] = sample_data['tgt_code']

with open(train_programs, mode='w', encoding='utf-8') as json_file_to_write:
    json_file_to_write.write(json.dumps(solutions, indent=4))
json_file_to_write.close()

In [32]:
with open(train_programs, mode='r', encoding='utf-8') as json_file_to_read:
    json_data = json.load(json_file_to_read)
    # assert json_data['java'].keys() == json_data['python'].keys()
    print('Total number of Java tasks: ' + str(len(json_data['java'].keys())))
    print('Total number of Python tasks: ' + str(len(json_data['python'].keys())))
json_file_to_read.close()

Total number of Java tasks: 804
Total number of Python tasks: 804


In [33]:
java_solutions = list()
python_solutions = list()

with open(train_programs, mode='r', encoding='utf-8') as json_file_to_read:
    json_data = json.load(json_file_to_read)
    for task_id, solutions in json_data['java'].items():
        java_solutions.extend(solutions.keys())
    for task_id, solutions in json_data['python'].items():
        python_solutions.extend(solutions.keys())
    print('Total number of Java solutions: ' + str(len(java_solutions)))
    print('Total number of Python solutions: ' + str(len(python_solutions)))
json_file_to_read.close()

Total number of Java solutions: 4572
Total number of Python solutions: 4994


In [34]:
# build non-clone program pairs: 2) build a valid_programs.json file, 
#                                   construct non-clone program pairs for the validation set based on this json file

solutions = {
    'java': dict(),
    'python': dict()
}

preprocessed_pair_clones = os.path.join(preprocessed_root, 'pair_clone_programs.jsonl')
valid_programs = os.path.join(preprocessed_root, 'valid_programs.json')

with open(preprocessed_pair_clones, 'r') as input_file:
    lines = input_file.readlines()
    for line in lines[32657:36763]:
        sample_data = json.loads(line)
        task_id = sample_data['task_id']
        src_id = sample_data['src_id']
        tgt_id = sample_data['tgt_id']
        # setdefault creates the per-task dict the first time a task_id is seen
        solutions['java'].setdefault(task_id, dict())[src_id] = sample_data['src_code']
        solutions['python'].setdefault(task_id, dict())[tgt_id] = sample_data['tgt_code']

with open(valid_programs, mode='w', encoding='utf-8') as json_file_to_write:
    json_file_to_write.write(json.dumps(solutions, indent=4))
json_file_to_write.close()

In [36]:
with open(valid_programs, mode='r', encoding='utf-8') as json_file_to_read:
    json_data = json.load(json_file_to_read)
    # assert json_data['java'].keys() == json_data['python'].keys()
    print('Total number of Java tasks: ' + str(len(json_data['java'].keys())))
    print('Total number of Python tasks: ' + str(len(json_data['python'].keys())))
json_file_to_read.close()

Total number of Java tasks: 97
Total number of Python tasks: 97


In [37]:
java_solutions = list()
python_solutions = list()

with open(valid_programs, mode='r', encoding='utf-8') as json_file_to_read:
    json_data = json.load(json_file_to_read)
    for task_id, solutions in json_data['java'].items():
        java_solutions.extend(solutions.keys())
    for task_id, solutions in json_data['python'].items():
        python_solutions.extend(solutions.keys())
    print('Total number of Java solutions: ' + str(len(java_solutions)))
    print('Total number of Python solutions: ' + str(len(python_solutions)))
json_file_to_read.close()

Total number of Java solutions: 560
Total number of Python solutions: 640


In [38]:
# build non-clone program pairs: 3) build a test_programs.json file, 
#                                   construct non-clone program pairs for the test set based on this json file

solutions = {
    'java': dict(),
    'python': dict()
}

preprocessed_pair_clones = os.path.join(preprocessed_root, 'pair_clone_programs.jsonl')
test_programs = os.path.join(preprocessed_root, 'test_programs.json')

with open(preprocessed_pair_clones, 'r') as input_file:
    lines = input_file.readlines()
    for line in lines[36763:]:
        sample_data = json.loads(line)
        task_id = sample_data['task_id']
        src_id = sample_data['src_id']
        tgt_id = sample_data['tgt_id']
        # setdefault creates the per-task dict the first time a task_id is seen
        solutions['java'].setdefault(task_id, dict())[src_id] = sample_data['src_code']
        solutions['python'].setdefault(task_id, dict())[tgt_id] = sample_data['tgt_code']

with open(test_programs, mode='w', encoding='utf-8') as json_file_to_write:
    json_file_to_write.write(json.dumps(solutions, indent=4))
json_file_to_write.close()

In [39]:
with open(test_programs, mode='r', encoding='utf-8') as json_file_to_read:
    json_data = json.load(json_file_to_read)
    # assert json_data['java'].keys() == json_data['python'].keys()
    print('Total number of Java tasks: ' + str(len(json_data['java'].keys())))
    print('Total number of Python tasks: ' + str(len(json_data['python'].keys())))
json_file_to_read.close()

Total number of Java tasks: 88
Total number of Python tasks: 88


In [40]:
java_solutions = list()
python_solutions = list()

with open(test_programs, mode='r', encoding='utf-8') as json_file_to_read:
    json_data = json.load(json_file_to_read)
    for task_id, solutions in json_data['java'].items():
        java_solutions.extend(solutions.keys())
    for task_id, solutions in json_data['python'].items():
        python_solutions.extend(solutions.keys())
    print('Total number of Java solutions: ' + str(len(java_solutions)))
    print('Total number of Python solutions: ' + str(len(python_solutions)))
json_file_to_read.close()

Total number of Java solutions: 522
Total number of Python solutions: 577


In [14]:
# display one pre-processed Java-Python non-clone program pair

non_clone_pair_programs = os.path.join(preprocessed_root, 'pair_non_clone_programs_test.jsonl')
with open(non_clone_pair_programs, 'r') as f:
    lines = f.readlines()
    sample_data = json.loads(lines[0])
    pprint(sample_data)
f.close()


{'category': 'java-python',
 'src_code': 'import java.util.*; \n'
             'public class Main { \n'
             ' public static void main (String[] args) { \n'
             '  Scanner sc = new Scanner(System.in); \n'
             '  int n = sc.nextInt(); \n'
             '  if (isPrime(n)) { \n'
             '   System.out.println("YES"); \n'
             '  } else { \n'
             '   System.out.println("NO"); \n'
             '  } \n'
             ' } \n'
             ' static boolean isPrime(int x) { \n'
             '  for (int i = 2; i <= 1000 && i < x; i++) { \n'
             '   if (x % i == 0) { \n'
             '    return false; \n'
             '   } \n'
             '  } \n'
             '  return true; \n'
             ' } \n'
             '}',
 'src_id': '79154',
 'task_id': ['1245', '1367'],
 'tgt_code': 'N,K,Q = map(int,input().split()) \n'
             'A = list(map(int,input().split())) \n'
             'R = sorted(set(A)) \n'
             'ans = 10**9 \n'
    

### Finally, the fine-tuning dataset contains 81726 Java-Python program pairs: 40863 clone pairs and 40863 non-clone pairs. The training set has 65314 program pairs (32657 of each kind), the validation set has 8212 (4106 of each) and the test set has 8200 (4100 of each). The clone program pairs are split by task id.
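
The non-clone pairs were also built outside this notebook. The sketch below shows one possible approach, assuming pairs are sampled at random from a per-split program pool shaped like test_programs.json, with the Java and Python solutions taken from different tasks. `sample_non_clone_pairs`, the seed and the retry cap are hypothetical choices; the two-element `task_id` list mirrors the non-clone record displayed earlier.

```python
import random


def sample_non_clone_pairs(programs, num_pairs, seed=0):
    """Sample distinct Java/Python pairs whose solutions come from different tasks."""
    rng = random.Random(seed)
    java_pool = [(t, sid, code) for t, sols in programs["java"].items()
                 for sid, code in sols.items()]
    python_pool = [(t, sid, code) for t, sols in programs["python"].items()
                   for sid, code in sols.items()]
    if not java_pool or not python_pool:
        return []

    pairs, seen = [], set()
    attempts, max_attempts = 0, num_pairs * 100  # cap retries on small pools
    while len(pairs) < num_pairs and attempts < max_attempts:
        attempts += 1
        src_task, src_id, src_code = rng.choice(java_pool)
        tgt_task, tgt_id, tgt_code = rng.choice(python_pool)
        if src_task == tgt_task or (src_id, tgt_id) in seen:
            continue  # same task would be a clone; duplicates are skipped
        seen.add((src_id, tgt_id))
        pairs.append({
            "task_id": [src_task, tgt_task],
            "src_id": src_id,
            "src_code": src_code,
            "tgt_id": tgt_id,
            "tgt_code": tgt_code,
            "category": "java-python",
        })
    return pairs


# Toy pool (hypothetical) with two tasks.
toy_pool = {
    "java": {"1": {"a": "class A {}", "b": "class B {}"}, "2": {"c": "class C {}"}},
    "python": {"1": {"x": "print(1)"}, "2": {"y": "print(2)"}},
}
non_clones = sample_non_clone_pairs(toy_pool, num_pairs=2)
print(len(non_clones))
```

Requesting as many non-clone pairs as there are clone pairs in a split would give the balanced 1:1 counts reported above.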