<a href="https://colab.research.google.com/github/Dipeshpal/CodeSearch/blob/main/code_search_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# https://github.com/github/CodeSearchNet

## Data

The primary dataset consists of 2 million (comment, code) pairs from open source libraries. Concretely, a comment is a top-level function or method comment (e.g. docstrings in Python), and code is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code. Throughout this repo, we use the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in this notebook.

For more information about how to obtain the data, see this section.

In [1]:
import json
from pathlib import Path
from pprint import pprint

import pandas as pd

# Show up to 300 characters per cell so code and docstrings stay readable.
pd.set_option('display.max_colwidth', 300)

## Data Exploration
This notebook explores the pre-processed data and shows some basic statistics that may be useful.


## Preview The Dataset
Before downloading the entire dataset, it may be useful to explore a small sample in order to understand the format and structure of the data. While the full dataset can be automatically downloaded with the /script/setup script located in this repo, we can alternatively download a subset of the data from S3.

### The S3 links follow this pattern:

https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,ruby,javascript}.zip
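The brace pattern above expands to one archive per language. A minimal sketch of that expansion follows; the helper name `dataset_url` and the constants are assumptions made for illustration, not part of the CodeSearchNet repo.

```python
# Languages listed in the URL pattern above.
LANGUAGES = ["python", "java", "go", "php", "ruby", "javascript"]
BASE_URL = "https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2"


def dataset_url(language):
    """Return the archive URL for one language (hypothetical helper)."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    return f"{BASE_URL}/{language}.zip"


# Print one URL per language in the pattern.
for lang in LANGUAGES:
    print(dataset_url(lang))
```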

* For example, the link for Python is:

https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip

* First we download and decompress this dataset:

In [2]:
!wget https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
!unzip python.zip

--2021-06-25 08:15:18--  https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/python.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.136.200
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.136.200|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 940909997 (897M) [application/zip]
Saving to: ‘python.zip’


2021-06-25 08:15:38 (45.6 MB/s) - ‘python.zip’ saved [940909997/940909997]

Archive:  python.zip
   creating: python/
   creating: python/final/
   creating: python/final/jsonl/
   creating: python/final/jsonl/train/
  inflating: python/final/jsonl/train/python_train_9.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_12.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_10.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_0.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_6.jsonl.gz  
  inflating: python/final/jsonl/train/python_train_2.jsonl.gz  
  inflating: python/final/jsonl/train/p

In [3]:
!gzip -d python/final/jsonl/test/python_test_0.jsonl.gz
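Note that `gzip -d` replaces the `.gz` file with its decompressed version, so a later search for `*.gz` files will no longer find it. As an alternative, records can be read straight from the compressed file with the standard `gzip` module. This is a minimal sketch; the helper name `read_first_records` is hypothetical.

```python
import gzip
import json


def read_first_records(path, n=1):
    """Return the first n JSON records from a .jsonl.gz file (hypothetical helper)."""
    records = []
    # 'rt' opens the compressed stream in text mode, so each line is a str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if len(records) >= n:
                break
            records.append(json.loads(line))
    return records
```

For example, `read_first_records('python/final/jsonl/train/python_train_0.jsonl.gz')` would return a one-element list holding the first record of that file, leaving the archive untouched on disk.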

In [5]:
with open('python/final/jsonl/test/python_test_0.jsonl', 'r') as f:
    sample_file = f.readlines()
pprint(json.loads(sample_file[0]))

{'code': 'def get_vid_from_url(url):\n'
         '        """Extracts video ID from URL.\n'
         '        """\n'
         "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
         "          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n"
         "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
         "          match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n"
         "          parse_query_param(url, 'v') or \\\n"
         "          parse_query_param(parse_query_param(url, 'u'), 'v')",
 'code_tokens': ['def',
                 'get_vid_from_url',
                 '(',
                 'url',
                 ')',
                 ':',
                 'return',
                 'match1',
                 '(',
                 'url',
                 ',',
                 "r'youtu\\.be/([^?/]+)'",
                 ')',
                 'or',
                 'match1',
                 '(',
                 'url',
                 ',',
        

In [7]:
python_files = sorted(Path('./python/').glob('**/*.gz'))
print(f'Total number of files: {len(python_files):,}')

Total number of files: 15
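The total alone does not show how the files are spread across partitions. Since each partition lives in its own directory (for example `python/final/jsonl/train/`), grouping by parent directory name gives a quick breakdown. This is a sketch under that assumption, and `count_files_by_partition` is a hypothetical helper. A file that was decompressed with `gzip -d` no longer matches the `*.gz` pattern, so a count like this may help confirm which partitions are actually included.

```python
from collections import Counter
from pathlib import Path


def count_files_by_partition(root, pattern="**/*.gz"):
    """Count matching files per parent directory name (hypothetical helper)."""
    return Counter(p.parent.name for p in Path(root).glob(pattern))
```

Calling `count_files_by_partition('./python/')` returns a `Counter` keyed by directory name, which can be printed or compared against the expected layout.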


In [8]:
columns_long_list = ['repo', 'path', 'url', 'code', 
                     'code_tokens', 'docstring', 'docstring_tokens', 
                     'language', 'partition']

columns_short_list = ['code_tokens', 'docstring_tokens', 
                      'language', 'partition']

def jsonl_list_to_dataframe(file_list, columns=columns_long_list):
    """Load a list of jsonl.gz files into a pandas DataFrame."""
    return pd.concat([pd.read_json(f, 
                                   orient='records', 
                                   compression='gzip',
                                   lines=True)[columns] 
                      for f in file_list], sort=False)

In [9]:
pydf = jsonl_list_to_dataframe(python_files)

In [14]:
pydf.shape

(435285, 9)
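The shape gives only the total row count. A per-partition breakdown can be read from the `partition` column. The sketch below uses a small made-up frame in place of `pydf`, and `rows_per_partition` is a hypothetical helper.

```python
import pandas as pd


def rows_per_partition(df):
    """Return the number of rows for each value of the 'partition' column."""
    return df["partition"].value_counts().sort_index()


# Toy frame standing in for pydf; the values are made up for illustration.
toy = pd.DataFrame({
    "partition": ["train", "train", "valid"],
    "language": ["python"] * 3,
})
print(rows_per_partition(toy))
```

Applied to the real data, this would be `rows_per_partition(pydf)`.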

In [10]:
pydf.head(3)

Unnamed: 0,repo,path,url,code,code_tokens,docstring,docstring_tokens,language,partition
0,ageitgey/face_recognition,examples/face_recognition_knn.py,https://github.com/ageitgey/face_recognition/blob/c96b010c02f15e8eeb0f71308c641179ac1f19bb/examples/face_recognition_knn.py#L46-L108,"def train(train_dir, model_save_path=None, n_neighbors=None, knn_algo='ball_tree', verbose=False):\n """"""\n Trains a k-nearest neighbors classifier for face recognition.\n\n :param train_dir: directory that contains a sub-directory for each known person, with its name.\n\n (View in s...","[def, train, (, train_dir, ,, model_save_path, =, None, ,, n_neighbors, =, None, ,, knn_algo, =, 'ball_tree', ,, verbose, =, False, ), :, X, =, [, ], y, =, [, ], # Loop through each person in the training set, for, class_dir, in, os, ., listdir, (, train_dir, ), :, if, not, os, ., path, ., isdir...","Trains a k-nearest neighbors classifier for face recognition.\n\n :param train_dir: directory that contains a sub-directory for each known person, with its name.\n\n (View in source code to see train_dir example tree structure)\n\n Structure:\n <train_dir>/\n ├── <person...","[Trains, a, k, -, nearest, neighbors, classifier, for, face, recognition, .]",python,train
1,ageitgey/face_recognition,examples/face_recognition_knn.py,https://github.com/ageitgey/face_recognition/blob/c96b010c02f15e8eeb0f71308c641179ac1f19bb/examples/face_recognition_knn.py#L111-L150,"def predict(X_img_path, knn_clf=None, model_path=None, distance_threshold=0.6):\n """"""\n Recognizes faces in given image using a trained KNN classifier\n\n :param X_img_path: path to image to be recognized\n :param knn_clf: (optional) a knn classifier object. if not specified, model_s...","[def, predict, (, X_img_path, ,, knn_clf, =, None, ,, model_path, =, None, ,, distance_threshold, =, 0.6, ), :, if, not, os, ., path, ., isfile, (, X_img_path, ), or, os, ., path, ., splitext, (, X_img_path, ), [, 1, ], [, 1, :, ], not, in, ALLOWED_EXTENSIONS, :, raise, Exception, (, ""Invalid im...","Recognizes faces in given image using a trained KNN classifier\n\n :param X_img_path: path to image to be recognized\n :param knn_clf: (optional) a knn classifier object. if not specified, model_save_path must be specified.\n :param model_path: (optional) path to a pickled knn classifie...","[Recognizes, faces, in, given, image, using, a, trained, KNN, classifier]",python,train
2,ageitgey/face_recognition,examples/face_recognition_knn.py,https://github.com/ageitgey/face_recognition/blob/c96b010c02f15e8eeb0f71308c641179ac1f19bb/examples/face_recognition_knn.py#L153-L181,"def show_prediction_labels_on_image(img_path, predictions):\n """"""\n Shows the face recognition results visually.\n\n :param img_path: path to image to be recognized\n :param predictions: results of the predict function\n :return:\n """"""\n pil_image = Image.open(img_path).conv...","[def, show_prediction_labels_on_image, (, img_path, ,, predictions, ), :, pil_image, =, Image, ., open, (, img_path, ), ., convert, (, ""RGB"", ), draw, =, ImageDraw, ., Draw, (, pil_image, ), for, name, ,, (, top, ,, right, ,, bottom, ,, left, ), in, predictions, :, # Draw a box around the face u...",Shows the face recognition results visually.\n\n :param img_path: path to image to be recognized\n :param predictions: results of the predict function\n :return:,"[Shows, the, face, recognition, results, visually, .]",python,train
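The introduction states that code from the same repository can only exist in one partition. With the `repo` and `partition` columns loaded, that property can be checked directly. The following is a sketch on a made-up frame; `repos_in_multiple_partitions` is a hypothetical helper, not part of the CodeSearchNet code.

```python
import pandas as pd


def repos_in_multiple_partitions(df):
    """Return repos that appear in more than one partition (hypothetical helper)."""
    partitions_per_repo = df.groupby("repo")["partition"].nunique()
    return sorted(partitions_per_repo[partitions_per_repo > 1].index)


# Toy frame standing in for pydf; 'c/z' is deliberately placed in two partitions.
toy = pd.DataFrame({
    "repo": ["a/x", "a/x", "b/y", "c/z", "c/z"],
    "partition": ["train", "train", "valid", "train", "test"],
})
print(repos_in_multiple_partitions(toy))  # → ['c/z']
```

If the partitioning holds, `repos_in_multiple_partitions(pydf)` should return an empty list.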
