# Preprocess Data


In [1]:
!pip install wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9674 sha256=e38c95dfeac79c81f63134ab980e9c33ec2939e0dc9a94e416376312e92c8f1f
  Stored in directory: /root/.cache/pip/wheels/bd/a8/c3/3cf2c14a1837a4e04bd98631724e81f33f462d86a1d895fae0
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [2]:
!pip install nmslib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nmslib
  Downloading nmslib-2.1.1-cp38-cp38-manylinux2010_x86_64.whl (13.4 MB)
Collecting pybind11<2.6.2
  Downloading pybind11-2.6.1-py2.py3-none-any.whl (188 kB)
Installing collected packages: pybind11, nmslib
Successfully installed nmslib-2.1.1 pybind11-2.6.1


In [3]:
!pip install pathos

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pathos
  Downloading pathos-0.3.0-py3-none-any.whl (79 kB)
Collecting ppft>=1.7.6.6
  Downloading ppft-1.7.6.6-py3-none-any.whl (52 kB)
Collecting pox>=0.3.2
  Downloading pox-0.3.2-py3-none-any.whl (29 kB)
Collecting multiprocess>=0.70.14
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Installing collected packages: ppft, pox, multiprocess, pathos
Successfully installed multiprocess-0.70.14 pathos-0.3.0 pox-0.3.2 ppft-1.7.6.6


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
%cd /content/drive/MyDrive/Automate

/content/drive/MyDrive/Automate


In [6]:
%load_ext autoreload
%autoreload 2

import ast
import glob
import re
from pathlib import Path

import astor
import pandas as pd
import spacy
from tqdm import tqdm
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split

from general_utils import apply_parallel, flattenlist

EN = spacy.load("en_core_web_sm")



In [7]:
%%time
# Read the data into a pandas DataFrame and parse out some metadata

df = pd.concat([pd.read_csv(f'https://storage.googleapis.com/kubeflow-examples/code_search/raw_data/00000000000{i}.csv')
                for i in range(1, 3)])
print(len(df))

df['nwo'] = df['repo_path'].apply(lambda r: r.split()[0])
df['path'] = df['repo_path'].apply(lambda r: r.split()[1])
df.drop(columns=['repo_path'], inplace=True)
df = df[['nwo', 'path', 'content']]
df.head()

248090
CPU times: user 13.7 s, sys: 4.96 s, total: 18.6 s
Wall time: 21 s


| | nwo | path | content |
|---|---|---|---|
| 0 | bitsanity/rateboard | krakenticker.py | `#!/usr/bin/python\n# -*- coding: utf-8 -*-\n\n...` |
| 1 | rusty1s/embedded_gcnn | lib/tf/convert.py | `import numpy as np\nimport tensorflow as tf\n\...` |
| 2 | mackorone/mms | util/ttf2png.py | `import os\nimport sys\nimport string\n\ndef en...` |
| 3 | nicksergeant/snipt | accounts/models.py | `from annoying.functions import get_object_or_N...` |
| 4 | huaxz1986/git_book | chapters/Model_Selection/validation_curve.py | `# -*- coding: utf-8 -*-\n"""\n 模型选择\n ~~...` |


In [8]:
# Inspect shape of the raw data
df.shape

(248090, 3)

## Functions to parse data and tokenize

Our goal is to parse the Python files into (code, docstring) pairs. The Python standard library includes the [ast](https://docs.python.org/3.6/library/ast.html) module, which lets us extract function definitions and their docstrings from source files.

We also use the [astor](http://astor.readthedocs.io/en/latest/) library to strip comments from the code: we convert the code to an [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree) and then convert the AST back to code. Comments are not represented in the AST, so they are dropped in the round trip.
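
As a minimal sketch of the round trip described above, the snippet below parses a small module, pulls out a function's docstring with `ast.get_docstring`, and regenerates the source without comments. It assumes Python 3.9+ and uses the standard library's `ast.unparse` as a stand-in for `astor.to_source`; the `sample` string and its `add` function are made up for illustration.

```python
import ast

# Hypothetical sample module, used only for illustration.
sample = '''
def add(a, b):
    """Add two numbers.

    Longer description that the pipeline would discard.
    """
    # this comment does not survive the round trip
    return a + b
'''

module = ast.parse(sample)
func = module.body[0]                  # the ast.FunctionDef node for `add`

docstring = ast.get_docstring(func)    # cleaned docstring text
summary = docstring.split('\n\n')[0]   # keep only the first paragraph

# Comments are not part of the AST, so regenerating source drops them.
# ast.unparse (Python 3.9+) stands in for astor.to_source here.
regenerated = ast.unparse(func)

print(summary)
print(regenerated)
```

The `split('\n\n')[0]` step mirrors what the parsing function in the next cell does: only the first paragraph of each docstring is kept.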

In [9]:
def tokenize_docstring(text):
    "Apply tokenization using spacy to docstrings."
    tokens = EN.tokenizer(text)
    return [token.text.lower() for token in tokens if not token.is_space]


def tokenize_code(text):
    "A very basic procedure for tokenizing code strings."
    return RegexpTokenizer(r'\w+').tokenize(text)


def get_function_docstring_pairs(blob):
    "Extract (function/method, docstring) pairs from a given code blob."
    pairs = []
    try:
        module = ast.parse(blob)
        classes = [node for node in module.body if isinstance(node, ast.ClassDef)]
        functions = [node for node in module.body if isinstance(node, ast.FunctionDef)]
        for _class in classes:
            functions.extend([node for node in _class.body if isinstance(node, ast.FunctionDef)])

        for f in functions:
            source = astor.to_source(f)
            docstring = ast.get_docstring(f) if ast.get_docstring(f) else ''
            function = source.replace(ast.get_docstring(f, clean=False), '') if docstring else source

            pairs.append((f.name,
                          f.lineno,
                          source,
                          ' '.join(tokenize_code(function)),
                          ' '.join(tokenize_docstring(docstring.split('\n\n')[0]))
                         ))
    except (AssertionError, MemoryError, SyntaxError, UnicodeEncodeError):
        pass
    return pairs


def get_function_docstring_pairs_list(blob_list):
    """apply the function `get_function_docstring_pairs` on a list of blobs"""
    return [get_function_docstring_pairs(b) for b in blob_list]

The convenience function `apply_parallel` below parses the code in parallel using process-based parallelism.
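
`apply_parallel` and `flattenlist` come from the local `general_utils` module, whose source is not shown here. As a rough sketch of the pattern (split the inputs into chunks, run a list-level function on each chunk in a worker pool, then flatten the per-chunk results), the following uses the standard library's `concurrent.futures`. The names `chunk`, `apply_in_chunks`, and `flatten` are hypothetical stand-ins, not the real helpers.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import chain


def chunk(items, n_chunks):
    "Split `items` into at most `n_chunks` contiguous, roughly equal slices."
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def apply_in_chunks(func, items, n_workers=4, executor_cls=ProcessPoolExecutor):
    "Run `func` on each chunk in a worker pool; one result per chunk, in order."
    with executor_cls(max_workers=n_workers) as pool:
        return list(pool.map(func, chunk(items, n_workers)))


def flatten(list_of_lists):
    "Concatenate per-chunk results back into a single flat list."
    return list(chain.from_iterable(list_of_lists))
```

With these stand-ins, the call in the next cell would look like `flatten(apply_in_chunks(get_function_docstring_pairs_list, df.content.tolist(), n_workers=32))`. On platforms that start workers by spawning a fresh interpreter, the call should sit under an `if __name__ == '__main__':` guard.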

In [10]:
pairs = flattenlist(apply_parallel(get_function_docstring_pairs_list, df.content.tolist(), cpu_cores=32))

In [11]:
assert len(pairs) == df.shape[0], f'Row count mismatch. `df` has {df.shape[0]:,} rows; `pairs` has {len(pairs):,} rows.'
df['pairs'] = pairs
df.head()

| | nwo | path | content | pairs |
|---|---|---|---|---|
| 0 | bitsanity/rateboard | krakenticker.py | `#!/usr/bin/python\n# -*- coding: utf-8 -*-\n\n...` | `[]` |
| 1 | rusty1s/embedded_gcnn | lib/tf/convert.py | `import numpy as np\nimport tensorflow as tf\n\...` | `[(sparse_to_tensor, 5, def sparse_to_tensor(va...` |
| 2 | mackorone/mms | util/ttf2png.py | `import os\nimport sys\nimport string\n\ndef en...` | `[(end_of_path_index, 5, def end_of_path_index(...` |
| 3 | nicksergeant/snipt | accounts/models.py | `from annoying.functions import get_object_or_N...` | `[(get_blog_posts, 84, def get_blog_posts(self)...` |
| 4 | huaxz1986/git_book | chapters/Model_Selection/validation_curve.py | `# -*- coding: utf-8 -*-\n"""\n 模型选择\n ~~...` | `[(test_validation_curve, 17, def test_validati...` |


## Flatten code, docstring pairs and extract meta-data

Flatten the (code, docstring) pairs so that each row holds a single function.

In [12]:
%%time
# flatten pairs
df = df.set_index(['nwo', 'path'])['pairs'].apply(pd.Series).stack()
df = df.reset_index()
df.columns = ['nwo', 'path', '_', 'pair']

CPU times: user 1min 24s, sys: 8.89 s, total: 1min 32s
Wall time: 1min 35s
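
The `.apply(pd.Series).stack()` idiom above builds a wide intermediate frame before stacking it. As a possible alternative, assuming pandas 0.25 or newer, `DataFrame.explode` produces one row per list element directly. The sketch below runs on a made-up toy frame (`toy` and its values are illustrative only) and has not been timed against the full dataset.

```python
import pandas as pd

# Toy frame with the same layout as `df` after the previous cell (made-up values).
toy = pd.DataFrame({
    'nwo': ['a/repo', 'b/repo', 'c/repo'],
    'path': ['x.py', 'y.py', 'z.py'],
    'pairs': [[('f', 1), ('g', 5)], [], [('h', 2)]],
})

flat = (
    toy[['nwo', 'path', 'pairs']]
    .explode('pairs')                    # one row per element of each list
    .dropna(subset=['pairs'])            # files with no functions become NaN rows
    .rename(columns={'pairs': 'pair'})
    .reset_index(drop=True)
)
print(flat)
```

Unlike the cell above, this version does not produce the throwaway `_` column, and files with an empty `pairs` list have to be dropped explicitly.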


Extract meta-data and format dataframe.  

We have not optimized this code.  Pull requests are welcome!

In [13]:
%%time
df['function_name'] = df['pair'].apply(lambda p: p[0])
df['lineno'] = df['pair'].apply(lambda p: p[1])
df['original_function'] = df['pair'].apply(lambda p: p[2])
df['function_tokens'] = df['pair'].apply(lambda p: p[3])
df['docstring_tokens'] = df['pair'].apply(lambda p: p[4])
df = df[['nwo', 'path', 'function_name', 'lineno', 'original_function', 'function_tokens', 'docstring_tokens']]
df['url'] = df[['nwo', 'path', 'lineno']].apply(lambda x: 'https://github.com/{}/blob/master/{}#L{}'.format(x['nwo'], x['path'], x['lineno']), axis=1)
df.head()

CPU times: user 16.7 s, sys: 158 ms, total: 16.8 s
Wall time: 16.9 s


| | nwo | path | function_name | lineno | original_function | function_tokens | docstring_tokens | url |
|---|---|---|---|---|---|---|---|---|
| 0 | rusty1s/embedded_gcnn | lib/tf/convert.py | sparse_to_tensor | 5 | `def sparse_to_tensor(value):\n """Convert a...` | def sparse_to_tensor value row np reshape valu... | convert a scipy sparse matrix to a tensorflow ... | https://github.com/rusty1s/embedded_gcnn/blob/... |
| 1 | mackorone/mms | util/ttf2png.py | end_of_path_index | 5 | `def end_of_path_index(full_path):\n return ...` | def end_of_path_index full_path return full_pa... | | https://github.com/mackorone/mms/blob/master/u... |
| 2 | mackorone/mms | util/ttf2png.py | get_path | 8 | `def get_path(full_path):\n return full_path...` | def get_path full_path return full_path end_of... | | https://github.com/mackorone/mms/blob/master/u... |
| 3 | mackorone/mms | util/ttf2png.py | get_name | 11 | `def get_name(full_path):\n return full_path...` | def get_name full_path return full_path end_of... | | https://github.com/mackorone/mms/blob/master/u... |
| 4 | mackorone/mms | util/ttf2png.py | ttf2png | 14 | `def ttf2png(chars_path, font_path, dest_path):...` | def ttf2png chars_path font_path dest_path opt... | | https://github.com/mackorone/mms/blob/master/u... |
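
The extraction cell above calls `.apply` once per field and once more, row by row, for the URL. One way it might be tightened is to build all five columns from the tuples in a single pass and assemble the URL with column-wise string concatenation. This is only a sketch on a made-up toy frame; `toy`, `fields`, `parts`, and `out` are illustrative names, and no timing on the full dataset is claimed.

```python
import pandas as pd

# Toy stand-in for the flattened frame; the tuples mirror the five fields
# produced by get_function_docstring_pairs (values are made up).
toy = pd.DataFrame({
    'nwo': ['a/repo', 'a/repo'],
    'path': ['x.py', 'x.py'],
    'pair': [('f', 1, 'def f(): pass', 'def f pass', ''),
             ('g', 5, 'def g(): pass', 'def g pass', 'do g things')],
})

fields = ['function_name', 'lineno', 'original_function',
          'function_tokens', 'docstring_tokens']

# Build all five columns in one pass instead of five separate .apply calls.
parts = pd.DataFrame(toy['pair'].tolist(), columns=fields, index=toy.index)
out = pd.concat([toy[['nwo', 'path']], parts], axis=1)

# Column-wise string concatenation avoids a row-wise .apply for the URL.
out['url'] = ('https://github.com/' + out['nwo'] + '/blob/master/' +
              out['path'] + '#L' + out['lineno'].astype(str))
print(out)
```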


## Remove Duplicates

In [14]:
%%time
# remove observations where the same function appears more than once
before_dedup = len(df)
df = df.drop_duplicates(['original_function', 'function_tokens'])
after_dedup = len(df)
df = df.dropna(axis=0)
after_dropna = len(df)

print(f'Removed {before_dedup - after_dedup:,} duplicate rows')
print(f'Removed {after_dedup - after_dropna} null rows')

Removed 108,604 duplicate rows
Removed 0 null rows
CPU times: user 5.49 s, sys: 31.7 ms, total: 5.52 s
Wall time: 5.51 s


In [15]:
df.shape

(1210414, 8)

## Separate functions without docstrings

In [16]:
def listlen(x):
    if not isinstance(x, list):
        return 0
    return len(x)

# separate functions without docstrings
# a docstring must contain at least 3 words to be considered valid

with_docstrings = df[df.docstring_tokens.str.split().apply(listlen) >= 3]
without_docstrings = df[df.docstring_tokens.str.split().apply(listlen) < 3]

## Partition code by repository to minimize leakage between train, valid & test sets
We make the rough assumption that each repository has its own style, so we avoid placing code from the same repository in both the training set and the validation or holdout set.

In [17]:
grouped = with_docstrings.groupby('nwo')

In [18]:
# train, valid, test splits
train, test = train_test_split(list(grouped), train_size=0.87, shuffle=True, random_state=8081)
train, valid = train_test_split(train, train_size=0.82, random_state=8081)
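
Because the split is applied to the list of repository groups, the two nested cuts work out to roughly 71% of repositories for training (0.87 × 0.82), 16% for validation (0.87 × 0.18), and 13% for testing. These are shares of repositories, not of rows. The sketch below illustrates the idea with the standard library only; the repository names and the `split_by_repo` helper are hypothetical, and it does not reproduce scikit-learn's exact sampling.

```python
import random

# Hypothetical repository names standing in for the groups in `grouped`.
repos = [f'owner/repo{i}' for i in range(100)]


def split_by_repo(repos, first_frac=0.87, second_frac=0.82, seed=8081):
    "Shuffle repositories once, then cut twice so no repo lands in two sets."
    shuffled = repos[:]
    random.Random(seed).shuffle(shuffled)
    n_trainval = int(len(shuffled) * first_frac)
    n_train = int(n_trainval * second_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_trainval],
            shuffled[n_trainval:])


train_repos, valid_repos, test_repos = split_by_repo(repos)

# Shares of repositories (not rows) implied by the two nested cuts.
print(f'train {0.87 * 0.82:.4f}, valid {0.87 * 0.18:.4f}, test {0.13:.4f}')

# No repository appears in more than one split.
assert not set(train_repos) & set(valid_repos)
assert not set(train_repos) & set(test_repos)
assert not set(valid_repos) & set(test_repos)
```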

In [19]:
train = pd.concat([d for _, d in train]).reset_index(drop=True)
valid = pd.concat([d for _, d in valid]).reset_index(drop=True)
test = pd.concat([d for _, d in test]).reset_index(drop=True)

In [20]:
print(f'train set num rows {train.shape[0]:,}')
print(f'valid set num rows {valid.shape[0]:,}')
print(f'test set num rows {test.shape[0]:,}')
print(f'without docstring rows {without_docstrings.shape[0]:,}')

train set num rows 217,207
valid set num rows 50,421
test set num rows 40,793
without docstring rows 901,993


Preview the training set. The function tokens and docstring tokens are what will be fed downstream into the models; the other columns are kept for diagnostics and bookkeeping.

In [21]:
train.head()

| | nwo | path | function_name | lineno | original_function | function_tokens | docstring_tokens | url |
|---|---|---|---|---|---|---|---|---|
| 0 | o19s/lazy-semantic-indexing | search_index.py | docIds | 1 | `def docIds(es, index='stackexchange', doc_type...` | def docIds es index stackexchange doc_type pos... | fetch the i d of all docs of type " doc_type "... | https://github.com/o19s/lazy-semantic-indexing... |
| 1 | o19s/lazy-semantic-indexing | search_index.py | justTfandDf | 20 | `def justTfandDf(terms):\n """ Format the st...` | def justTfandDf terms tfAndDf for term value i... | format the stats for each term into a compact ... | https://github.com/o19s/lazy-semantic-indexing... |
| 2 | o19s/lazy-semantic-indexing | search_index.py | `_termVectorBatch` | 28 | `def _termVectorBatch(es, docIds, index='stacke...` | `def _termVectorBatch es docIds index stackexch...` | returns term vectors for specified batch of do... | https://github.com/o19s/lazy-semantic-indexing... |
| 3 | o19s/lazy-semantic-indexing | search_index.py | termVectors | 45 | `def termVectors(es, docIds, field='Body.bigram...` | def termVectors es docIds field Body bigramed ... | returns term vectors for corpus one doc at a t... | https://github.com/o19s/lazy-semantic-indexing... |
| 4 | RPi-Distro/python-sense-emu | sense_emu/terminal.py | handle | 229 | `def handle(self, exc_type, exc_value, exc_trac...` | def handle self exc_type exc_value exc_trace i... | global application exception handler | https://github.com/RPi-Distro/python-sense-emu... |


## Output each set to train/valid/test .function, .docstring and .lineage files


In [22]:
def write_to(df, filename, path='./data/processed_data/'):
    "Helper function to write processed files to disk."
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)  # also create missing parent directories
    df.function_tokens.to_csv(out/'{}.function'.format(filename), index=False)
    df.original_function.to_json(out/'{}_original_function.json.gz'.format(filename), orient='values', compression='gzip')
    if filename != 'without_docstrings':
        df.docstring_tokens.to_csv(out/'{}.docstring'.format(filename), index=False)
    df.url.to_csv(out/'{}.lineage'.format(filename), index=False)

In [23]:
import os
os.makedirs('data/', exist_ok=True)
# write to output files
write_to(train, 'train')
write_to(valid, 'valid')
write_to(test, 'test')
write_to(without_docstrings, 'without_docstrings')