# Data Augmentation

### CodeSearchNet dataset:
CodeSearchNet is a large corpus of functions and methods extracted from popular GitHub repositories. The primary dataset consists of 2 million (comment, code) pairs from open-source libraries. Concretely, a 'comment' is a top-level function or method comment (for example, a docstring in Python), and 'code' is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java and PHP functions. The dataset is partitioned into train, validation and test sets such that code from the same repository can only exist in one partition.

We work with the cleaned version of the CodeSearchNet dataset released by Microsoft and look at the function-level data in Java and Python.  
Unlike the original CodeSearchNet, the answer to each query is retrieved from the whole development and testing code corpus instead of from 1,000 candidate codes. In addition, some queries contain content unrelated to the code, such as a link to external resources, so the dataset is filtered to improve its quality.

Statistics about the cleaned CodeSearchNet dataset released by Microsoft are shown below.  
   | Programming Language      | Train     | Valid      | Test     | Candidate codes    |   
   | :---------- | :---------- | :---------- | :---------- | :----------  |     
   | Java  |164,923      | 5,183         | 10,955         | 13,981        |  
   | Python  |251,820    | 13,914        | 14,918         | 42,827        |    

For the CodeSearchNet dataset, we intend to explore Java and Python functions. First, we will apply the TransCoder model (https://github.com/facebookresearch/CodeGen/blob/main/docs/transcoder.md#pre-trained-models) to translate Java functions into their Python counterparts. Then, we will filter out translated Python functions that cannot be parsed. Finally, we will construct Java-Python function pairs and investigate the statistics of the generated dataset in terms of: 1) the average number of code lines and 2) the average number of code tokens in each language.
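The filtering step can be sketched with Python's built-in `ast` module. This is only an illustration under assumptions: the helper names `is_parsable_python` and `filter_parsable` are hypothetical, the toy inputs are invented, and the check used in the actual pipeline may differ (for example, it may depend on the Python version used for parsing).

```python
import ast


def is_parsable_python(code: str) -> bool:
    """Return True if `code` parses as Python under the running interpreter."""
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        # SyntaxError covers IndentationError; ValueError covers e.g. null bytes
        return False
    return True


def filter_parsable(translations):
    """Keep (java_code, python_code) tuples whose Python side parses."""
    return [(java, py) for java, py in translations if is_parsable_python(py)]


# Toy inputs (invented for illustration): the second translation is broken.
candidates = [
    ("int add(int a, int b) { return a + b; }", "def add(a, b):\n    return a + b"),
    ("int neg(int a) { return -a; }", "def neg(a):\n    return -a ="),
]
kept = filter_parsable(candidates)
print(len(kept), "of", len(candidates), "translations kept")
```

Note that `ast.parse` only checks syntax; a translation that parses can still be semantically wrong.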

In [1]:
import os
import sys
import json
import jsonlines

import numpy as np
import pandas as pd
pd.set_option('max_colwidth', 300)
from pprint import pprint

In [2]:
# define csn_generated dir

csn_generated = '/Users/rongdang/Desktop/semantic-code-clone/dataset/pre-training/CSN-Generated'

### Part 1: Preview the generated dataset
We applied the TransCoder_DOBF model to translate Java functions from the CSN dataset into corresponding Python functions, then filtered out the translated Python functions that could not be parsed. This produced 178,087 Java-Python function pairs. We split the generated dataset into two sets: the training set (function pairs from the original training set) and the validation set (function pairs from the original valid and test sets). The train-to-valid ratio in the generated dataset is roughly 10:1 (162,174 : 15,913), i.e. about 91% train and 9% valid.  
The augmented pre-training dataset contains the original pre-training dataset and the generated dataset.
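A minimal sketch of how the pairs could be assembled and split as described above. The output keys (`category`, `src_code`, `tgt_code`, `task_id`) match the sample record shown in this notebook; the input record layout (`partition`, `java`, `python`), the `SPLIT_OF` mapping name, and the helper functions are assumptions made for illustration.

```python
import json

# Original CSN partition -> split in the generated dataset.
# Valid and test are merged into the validation set, as described above.
SPLIT_OF = {"train": "train", "valid": "valid", "test": "valid"}


def build_pairs(records):
    """Group translated records into 'train' and 'valid' lists of pair dicts.

    Each input record is assumed (hypothetically) to be a dict with the keys
    'task_id', 'partition', 'java' and 'python'.
    """
    splits = {"train": [], "valid": []}
    for rec in records:
        pair = {
            "category": "java-python",
            "src_code": rec["java"],
            "tgt_code": rec["python"],
            "task_id": str(rec["task_id"]),
        }
        splits[SPLIT_OF[rec["partition"]]].append(pair)
    return splits


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines text, one JSON object per line."""
    return "\n".join(json.dumps(p) for p in pairs)


# Toy records, invented for illustration.
records = [
    {"task_id": 1, "partition": "train", "java": "int a() { return 1; }", "python": "def a():\n    return 1"},
    {"task_id": 2, "partition": "valid", "java": "int b() { return 2; }", "python": "def b():\n    return 2"},
    {"task_id": 3, "partition": "test", "java": "int c() { return 3; }", "python": "def c():\n    return 3"},
]
splits = build_pairs(records)
print({name: len(pairs) for name, pairs in splits.items()})
```

The text returned by `to_jsonl(splits['train'])` could then be written to a file such as `pair_train.jsonl`, the file name read in the next cell.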

In [3]:
# display one Java-Python pair in the generated dataset

sets = ['train', 'valid']

file_name = 'pair_' + sets[0] + '.jsonl'
file_path = os.path.join(csn_generated, file_name)

# read only the first record; the with-block closes the file automatically
with open(file_path, 'r', encoding='utf-8') as f:
    first_line = f.readline()
sample_data = json.loads(first_line)
pprint(sample_data)

{'category': 'java-python',
 'src_code': 'public Set<BsonValue> getPausedDocumentIds(final MongoNamespace '
             'namespace) {\n'
             '    this.waitUntilInitialized();\n'
             '\n'
             '    try {\n'
             '      ongoingOperationsGroup.enter();\n'
             '      final Set<BsonValue> pausedDocumentIds = new HashSet<>();\n'
             '\n'
             '      for (final CoreDocumentSynchronizationConfig config :\n'
             '          this.syncConfig.getSynchronizedDocuments(namespace)) '
             '{\n'
             '        if (config.isPaused()) {\n'
             '          pausedDocumentIds.add(config.getDocumentId());\n'
             '        }\n'
             '      }\n'
             '\n'
             '      return pausedDocumentIds;\n'
             '    } finally {\n'
             '      ongoingOperationsGroup.exit();\n'
             '    }\n'
             '  }',
 'task_id': '154879',
 'tgt_code': 'def get_paused_document_ids (

### Part 2: Analyze the generated dataset

### Analyze the statistics of the CSN-Generated dataset using the pre-trained CodeBERT tokenizer

-----CSN-Generated dataset statistics-----  
avg number of lines in Java function: 14.26  
avg number of code tokens in Java function: 137.03  
avg number of lines in Python function: 9.1  
avg number of code tokens in Python function: 111.31  
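
The code that produced these numbers is not shown here. As a rough sketch, averages of this kind could be computed as follows, assuming each pair is a dict with `src_code` (Java) and `tgt_code` (Python). The function name `dataset_stats` is hypothetical, and the default tokenizer is plain whitespace splitting so the example runs without downloads; it is also an assumption that blank lines are counted. Token counts from the CodeBERT tokenizer will therefore differ from what this default produces.

```python
from statistics import mean


def dataset_stats(pairs, tokenize=str.split):
    """Average line and token counts for the Java and Python sides of `pairs`.

    `tokenize` is any callable mapping a code string to a list of tokens.
    To use the CodeBERT tokenizer instead of whitespace splitting, one option
    (assuming the Hugging Face `transformers` package is installed) is:
        from transformers import RobertaTokenizer
        tokenize = RobertaTokenizer.from_pretrained("microsoft/codebert-base").tokenize
    """
    def avg_lines(key):
        # Assumption: every line counts, including blank ones.
        return round(mean(len(p[key].splitlines()) for p in pairs), 2)

    def avg_tokens(key):
        return round(mean(len(tokenize(p[key])) for p in pairs), 2)

    return {
        "java_lines": avg_lines("src_code"),
        "java_tokens": avg_tokens("src_code"),
        "python_lines": avg_lines("tgt_code"),
        "python_tokens": avg_tokens("tgt_code"),
    }


# Toy pair, invented for illustration.
pairs = [
    {"src_code": "int f() {\n  return 1;\n}", "tgt_code": "def f():\n    return 1"},
]
stats = dataset_stats(pairs)
print(stats)
```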