# Chapter 3 - GitHub Embeddings

In this notebook we go beyond pre-trained embeddings and models downloaded from the internet and start to create our own secondary models, which can improve the primary model through transfer learning. We will train text and code embeddings on GitHub's [CodeSearchNet](https://github.com/rjurney/CodeSearchNet) datasets. They include both docstrings and code for about 2 million functions. The CodeSearchNet authors use the data to map text search queries to code; we will use it to create separate BERT embeddings to drive our Stack Overflow tagger.

The paper for CodeSearchNet is on arXiv at [CodeSearchNet Challenge: Evaluating the State of Semantic Code Search](https://arxiv.org/abs/1909.09436).

In [54]:
import csv
import gc
from pathlib import Path
import os
import sys
import warnings

from bs4 import BeautifulSoup
from nltk.tokenize.punkt import PunktSentenceTokenizer
import pandas as pd

# Add parent directory to path
parent_dir = os.path.dirname(os.getcwd())
sys.path.append(parent_dir)

from lib.utils import extract_text_plain

# Disable all warnings
warnings.filterwarnings("ignore")

## Load CodeSearchNet Data

We load the entire CodeSearchNet dataset for Go, Java, PHP, Python and Ruby. The dataset doesn't cover every language, but I'm hoping these five are diverse enough that the embeddings will still help with other languages.

In [2]:
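# Load all gzipped JSON Lines files in the data directory
frames = []
for filename in Path('../data/CodeSearchNet').glob('**/*.jsonl.gz'):
    frames.append(pd.read_json(filename, lines=True))

# Concatenate once (repeated concat in a loop copies the data every time)
# and renumber the rows so that every function gets a unique index
df = pd.concat(frames, ignore_index=True)

# Carefully manage memory
del frames
gc.collect()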

df.head()

Unnamed: 0,code,code_tokens,docstring,docstring_tokens,func_name,language,original_string,partition,path,repo,sha,url
0,protected final void fastPathOrderedEmit(U val...,"[protected, final, void, fastPathOrderedEmit, ...",Makes sure the fast-path emits in order.\n@par...,"[Makes, sure, the, fast, -, path, emits, in, o...",QueueDrainObserver.fastPathOrderedEmit,java,protected final void fastPathOrderedEmit(U val...,test,src/main/java/io/reactivex/internal/observers/...,ReactiveX/RxJava,ac84182aa2bd866b53e01c8e3fe99683b882c60e,https://github.com/ReactiveX/RxJava/blob/ac841...
1,@CheckReturnValue\n @NonNull\n @Schedule...,"[@, CheckReturnValue, @, NonNull, @, Scheduler...",Mirrors the one ObservableSource in an Iterabl...,"[Mirrors, the, one, ObservableSource, in, an, ...",Observable.amb,java,@CheckReturnValue\n @NonNull\n @Schedule...,test,src/main/java/io/reactivex/Observable.java,ReactiveX/RxJava,ac84182aa2bd866b53e01c8e3fe99683b882c60e,https://github.com/ReactiveX/RxJava/blob/ac841...
2,"@SuppressWarnings(""unchecked"")\n @CheckRetu...","[@, SuppressWarnings, (, ""unchecked"", ), @, Ch...",Mirrors the one ObservableSource in an array o...,"[Mirrors, the, one, ObservableSource, in, an, ...",Observable.ambArray,java,"@SuppressWarnings(""unchecked"")\n @CheckRetu...",test,src/main/java/io/reactivex/Observable.java,ReactiveX/RxJava,ac84182aa2bd866b53e01c8e3fe99683b882c60e,https://github.com/ReactiveX/RxJava/blob/ac841...
3,"@SuppressWarnings({ ""unchecked"", ""rawtypes"" })...","[@, SuppressWarnings, (, {, ""unchecked"", ,, ""r...",Concatenates elements of each ObservableSource...,"[Concatenates, elements, of, each, ObservableS...",Observable.concat,java,"@SuppressWarnings({ ""unchecked"", ""rawtypes"" })...",test,src/main/java/io/reactivex/Observable.java,ReactiveX/RxJava,ac84182aa2bd866b53e01c8e3fe99683b882c60e,https://github.com/ReactiveX/RxJava/blob/ac841...
4,"@SuppressWarnings({ ""unchecked"", ""rawtypes"" })...","[@, SuppressWarnings, (, {, ""unchecked"", ,, ""r...",Returns an Observable that emits the items emi...,"[Returns, an, Observable, that, emits, the, it...",Observable.concat,java,"@SuppressWarnings({ ""unchecked"", ""rawtypes"" })...",test,src/main/java/io/reactivex/Observable.java,ReactiveX/RxJava,ac84182aa2bd866b53e01c8e3fe99683b882c60e,https://github.com/ReactiveX/RxJava/blob/ac841...
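The `.jsonl.gz` files loaded above are gzipped JSON Lines: one JSON object per line, one object per function. As a rough illustration of the format, here is a standard-library sketch that writes and reads back a tiny stand-in file. The helper `read_jsonl_gz` and the two sample records are hypothetical; only the field names (`func_name`, `language`, `docstring`) come from the columns shown above.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical sample records, shaped like a subset of the real columns
records = [
    {'func_name': 'Observable.amb', 'language': 'java',
     'docstring': 'Mirrors the one ObservableSource.'},
    {'func_name': 'Observable.concat', 'language': 'java',
     'docstring': 'Concatenates elements.'},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / 'sample.jsonl.gz'

    # Write one JSON object per line, gzip-compressed
    with gzip.open(sample_path, 'wt', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')

    # Read the records back while the temporary file still exists
    loaded = list(read_jsonl_gz(sample_path))

print(len(loaded), loaded[0]['func_name'])
```

Reading line by line like this keeps only one record in memory at a time, which can help when the full `DataFrame` is too large to hold comfortably.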


In [3]:
print(
    f'There are {len(df["docstring"].index):,} functions'
)

There are 2,070,536 functions


## Extract Sentences from Docstrings

Docstrings can contain HTML, so we parse them with `BeautifulSoup` and extract the plain text through the `extract_text_plain` helper imported above.

In [6]:
code = df['code']
docs = df['docstring'].apply(extract_text_plain)
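The source of `extract_text_plain` lives in `lib/utils.py` and is not shown in this notebook. To give a feel for what this kind of tag stripping does, here is a minimal stand-in that uses only the standard library's `html.parser` instead of `BeautifulSoup`. The names `_TextExtractor` and `strip_html` are hypothetical, and the real helper may behave differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored
        self.chunks.append(data)


def strip_html(fragment):
    """Return the text of `fragment` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return ''.join(parser.chunks)


docstring = '<dt><b>Scheduler:</b></dt>\n@param <T> the common element type'
print(repr(strip_html(docstring)))
```

One side effect to be aware of: an HTML parser treats Javadoc generic parameters such as `<T>` as tags, so they vanish along with the real markup. The notebook output shows the same effect, where `@param <T> the common element type` becomes `@param the common element type`.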

## Inspect the Result of the HTML Removal

In [12]:
pd.set_option('display.max_colwidth', 500)
doc_df = pd.DataFrame({'docs': docs, 'docstring': df['docstring']})

doc_df.head()

Unnamed: 0,docs,docstring
0,"Makes sure the fast-path emits in order.\n@param value the value to emit or queue up\n@param delayError if true, errors are delayed until the source has terminated\n@param disposable the resource to dispose if the drain terminates","Makes sure the fast-path emits in order.\n@param value the value to emit or queue up\n@param delayError if true, errors are delayed until the source has terminated\n@param disposable the resource to dispose if the drain terminates"
1,Mirrors the one ObservableSource in an Iterable of several ObservableSources that first either emits an item or sends\na termination notification.\n\n\n\nScheduler:\n{@code amb} does not operate by default on a particular {@link Scheduler}.\n\n\n@param the common element type\n@param sources\nan Iterable of ObservableSource sources competing to react first. A subscription to each source will\noccur in the same order as in the Iterable.\n@return an Observable that emits the same sequence as ...,"Mirrors the one ObservableSource in an Iterable of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</b></dt>\n<dd>{@code amb} does not operate by default on a particular {@link Scheduler}.</dd>\n</dl>\n\n@param <T> the common element type\n@param sources\nan Iterable of ObservableSource sources compe..."
2,Mirrors the one ObservableSource in an array of several ObservableSources that first either emits an item or sends\na termination notification.\n\n\n\nScheduler:\n{@code ambArray} does not operate by default on a particular {@link Scheduler}.\n\n\n@param the common element type\n@param sources\nan array of ObservableSource sources competing to react first. A subscription to each source will\noccur in the same order as in the array.\n@return an Observable that emits the same sequence as whic...,"Mirrors the one ObservableSource in an array of several ObservableSources that first either emits an item or sends\na termination notification.\n<p>\n<img width=""640"" height=""385"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/amb.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</b></dt>\n<dd>{@code ambArray} does not operate by default on a particular {@link Scheduler}.</dd>\n</dl>\n\n@param <T> the common element type\n@param sources\nan array of ObservableSource sources compet..."
3,Concatenates elements of each ObservableSource provided via an Iterable sequence into a single sequence\nof elements without interleaving them.\n\n\n\nScheduler:\n{@code concat} does not operate by default on a particular {@link Scheduler}.\n\n@param the common value type of the sources\n@param sources the Iterable sequence of ObservableSources\n@return the new Observable instance,"Concatenates elements of each ObservableSource provided via an Iterable sequence into a single sequence\nof elements without interleaving them.\n<p>\n<img width=""640"" height=""380"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/concat.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</b></dt>\n<dd>{@code concat} does not operate by default on a particular {@link Scheduler}.</dd>\n</dl>\n@param <T> the common value type of the sources\n@param sources the Iterable sequence of Observa..."
4,"Returns an Observable that emits the items emitted by each of the ObservableSources emitted by the source\nObservableSource, one after the other, without interleaving them.\n\n\n\nScheduler:\n{@code concat} does not operate by default on a particular {@link Scheduler}.\n\n\n@param the common element base type\n@param sources\nan ObservableSource that emits ObservableSources\n@param prefetch\nthe number of ObservableSources to prefetch from the sources sequence.\n@return an Observable that e...","Returns an Observable that emits the items emitted by each of the ObservableSources emitted by the source\nObservableSource, one after the other, without interleaving them.\n<p>\n<img width=""640"" height=""380"" src=""https://raw.github.com/wiki/ReactiveX/RxJava/images/rx-operators/concat.png"" alt="""">\n<dl>\n<dt><b>Scheduler:</b></dt>\n<dd>{@code concat} does not operate by default on a particular {@link Scheduler}.</dd>\n</dl>\n\n@param <T> the common element base type\n@param sources\nan Obser..."


## Generate CSV for BERT

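We need to write the sentences in the format that BERT's pre-training data generator expects. Despite the `.csv` extension we use below, this is a plain text file, which the BERT documentation describes as follows: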

> Here's how to run the data generation. The input is a plain text file, with one sentence per line. (It is important that these be actual sentences for the "next sentence prediction" task). Documents are delimited by empty lines. The output is a set of tf.train.Examples serialized into TFRecord file format.

In [55]:
sentence_tokenizer = PunktSentenceTokenizer()
sentences = docs.apply(sentence_tokenizer.tokenize)

sentences.head(2)

0                                                                                                                                                                                                                                                                               [Makes sure the fast-path emits in order., @param value the value to emit or queue up\n@param delayError if true, errors are delayed until the source has terminated\n@param disposable the resource to dispose if the drain terminates]
1    [Mirrors the one ObservableSource in an Iterable of several ObservableSources that first either emits an item or sends\na termination notification., Scheduler:\n{@code amb} does not operate by default on a particular {@link Scheduler}., @param  the common element type\n@param sources\nan Iterable of ObservableSource sources competing to react first., A subscription to each source will\noccur in the same order as in the Iterable., @return an Observable that emits the same sequence as wh

In [78]:
with open('../data/sentences.csv', 'w') as f:

    for i, doc in enumerate(sentences):
        # Insert an empty line to separate documents
        if i > 0:
            f.write('\n')
        # Write each sentence exactly as it appeared, one per line; escaping
        # embedded newlines keeps every sentence on a single line
        for sentence in doc:
            f.write(sentence.encode('unicode-escape').decode().replace('\\\\', '\\') + '\n')
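To make the target layout concrete, here is a small sketch that writes two made-up documents to an in-memory buffer using the same escaping as the cell above. `write_documents` and the sample sentences are hypothetical and exist only for illustration.

```python
import io


def write_documents(documents, out):
    """Write one sentence per line, with an empty line between documents."""
    for i, sentences in enumerate(documents):
        if i > 0:
            out.write('\n')
        for sentence in sentences:
            # Escape embedded newlines so each sentence stays on one line
            line = sentence.encode('unicode-escape').decode().replace('\\\\', '\\')
            out.write(line + '\n')


# Two hypothetical documents; the second sentence contains a real newline
documents = [
    ['Makes sure the fast-path emits in order.',
     '@param value the value\nto emit'],
    ['Mirrors the one ObservableSource.'],
]

buffer = io.StringIO()
write_documents(documents, buffer)
print(buffer.getvalue())
```

The printed text has the two sentences of the first document on consecutive lines, then an empty line, then the second document. The newline inside the second sentence appears as the two characters `\n` rather than as a line break.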

## Using `sentencepiece` to Extract a WordPiece Vocabulary

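BERT needs a WordPiece vocabulary file to run, so we need to decide on a vocabulary size (20,000 tokens here) and then run `sentencepiece` to extract a list of valid tokens from our sentences. Note that `spm_train` defaults to a unigram model (`model_type: UNIGRAM` in the log below), so the result approximates a WordPiece vocabulary rather than reproducing one.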

In [99]:
%%bash

cd ../models
spm_train --input="../data/sentences.csv" --model_prefix=wsl --vocab_size=20000

sentencepiece_trainer.cc(49) LOG(INFO) Starts training with : 
TrainerSpec {
  input: ../data/sentences.csv
  input_format: 
  model_prefix: wsl
  model_type: UNIGRAM
  vocab_size: 20000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  treat_whitespace_as_suffix: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
NormalizerSpec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}

trainer_interface.cc(267) LOG(INFO) Loading corpus: ../data/sentences.csv
trainer_interface.cc(139) LOG(INFO) Lo
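The `.vocab` file that `spm_train` writes holds one piece per line followed by a tab and a score, with `▁` (U+2581) marking the start of a word. BERT's tokenizer instead reads one token per line and marks word continuations with a `##` prefix. Below is a minimal sketch of one way to bridge the two formats. `sentencepiece_to_bert_vocab` is a hypothetical helper, not part of either library, and the sample lines are made up. The converted vocabulary is only an approximation, since BERT's greedy WordPiece matching will not necessarily segment text the way SentencePiece would.

```python
def sentencepiece_to_bert_vocab(spm_vocab_lines):
    """Convert SentencePiece `piece<TAB>score` lines into BERT-style tokens."""
    # Special tokens used by BERT's tokenizer and pre-training scripts
    special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']
    # SentencePiece's own control pieces, which BERT does not use
    spm_control = {'<unk>', '<s>', '</s>'}

    tokens = list(special_tokens)
    seen = set(tokens)
    for line in spm_vocab_lines:
        piece = line.rstrip('\n').split('\t')[0]
        if not piece or piece in spm_control:
            continue
        if piece.startswith('\u2581'):
            token = piece[1:]      # word-initial piece: drop the marker
        else:
            token = '##' + piece   # continuation piece
        # Skip the bare marker piece and any duplicates
        if token and token not in seen:
            tokens.append(token)
            seen.add(token)
    return tokens


# Hypothetical sample of a SentencePiece vocabulary file
sample = [
    '<unk>\t0', '<s>\t0', '</s>\t0',
    '\u2581the\t-3.1', 'ing\t-4.2', '\u2581\t-2.0', '\u2581return\t-5.0',
]
bert_vocab = sentencepiece_to_bert_vocab(sample)
print(bert_vocab)
```

If you convert the vocabulary this way, the number of lines in the resulting file will generally differ from 20,000, and the `vocab_size` in the BERT configuration has to match the converted file.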


## Using BERT to Pretrain a Language Model

Next we use the WordPiece vocabulary to pre-train a BERT model, which we will then use, as a transfer learning strategy, to encode the text of Stack Overflow questions.

### Creating a BERT conda environment

BERT needs `tensorflow==1.14.0`, which we install into its own conda environment. It is not possible to create and activate a new conda environment from inside this notebook, so you will need to run the following commands in a shell, from the root directory of this project.


```bash
conda create -y -n bert python=3.7.4
conda init bash
```

Now, in a new shell, change directory to the root of the project:

```bash
cd /path/to/weakly_supervised_learning_code
```

Now run:

```bash
conda activate bert
pip install tensorflow-gpu==1.14.0
```

### Running BERT Pre-Training

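We need to configure BERT to use our vocabulary size, so we create a `bert_config.json` file in the `bert/` directory. Then we run `create_pretraining_data.py`, which turns our sentences into the masked `TFRecord` examples that BERT trains on. The pre-training itself is a separate step performed by BERT's `run_pretraining.py`, which is the script that reads `bert_config.json` and the training hyperparameters.

```bash
# Tell BERT how many tokens to use (read later by run_pretraining.py)
echo '{ "vocab_size": 20000 }' > bert/bert_config.json

# Generate the masked training examples. The --bert_config_file,
# --num_train_steps, --num_warmup_steps and --learning_rate flags belong to
# run_pretraining.py, not to this script, so they are not passed here.
python bert/create_pretraining_data.py \
   --input_file=data/sentences.csv \
   --output_file=data/tf_examples.tfrecord \
   --vocab_file=models/wsl.vocab \
   --do_lower_case=False \
   --max_seq_length=128 \
   --max_predictions_per_seq=20 \
   --random_seed=1337

conda deactivate
```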
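The one-key `bert_config.json` above states only the vocabulary size and leaves the model's shape implicit. If you prefer to spell the architecture out, the sketch below builds a fuller configuration as a Python dict. `make_bert_config` is a hypothetical helper; the values mirror the published BERT-Base configuration, and choosing that model size is an assumption on my part rather than something this notebook requires.

```python
import json


def make_bert_config(vocab_size):
    """Return a BERT-Base-sized config dict with a custom vocabulary size."""
    return {
        'attention_probs_dropout_prob': 0.1,
        'hidden_act': 'gelu',
        'hidden_dropout_prob': 0.1,
        'hidden_size': 768,
        'initializer_range': 0.02,
        'intermediate_size': 3072,
        'max_position_embeddings': 512,
        'num_attention_heads': 12,
        'num_hidden_layers': 12,
        'type_vocab_size': 2,
        # Must match the number of lines in the vocabulary file
        'vocab_size': vocab_size,
    }


config = make_bert_config(vocab_size=20000)
config_json = json.dumps(config, indent=2, sort_keys=True)
print(config_json)
```

Writing `config_json` to `bert/bert_config.json` would replace the one-line `echo` above. A smaller `hidden_size` or fewer layers would train faster on a corpus of this size, at the cost of model capacity.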