# Chapter 3 - GitHub Embeddings

In this notebook we go beyond pre-trained embeddings and models downloaded from the internet and start creating our own secondary models that can improve the primary model through transfer learning. We're going to train text and code embeddings based on GitHub's [CodeSearchNet](https://github.com/rjurney/CodeSearchNet) datasets. They include both docstrings and code for about 2 million functions. While the CodeSearchNet authors use the data to map from text search queries to code, we'll use it to create separate [BERT](https://arxiv.org/abs/1810.04805) embeddings to drive our Stack Overflow tagger.

The paper for CodeSearchNet is on arXiv at [CodeSearchNet Challenge: Evaluating the State of Semantic Code Search](https://arxiv.org/abs/1909.09436).

In [3]:
import csv
import gc
from pathlib import Path
import os
import random
import sys
import warnings

from bs4 import BeautifulSoup
from nltk.tokenize.punkt import PunktSentenceTokenizer
import pandas as pd

random.seed(1337)

# Add parent directory to path
parent_dir = os.path.dirname(os.getcwd())
sys.path.append(parent_dir)

from lib.utils import extract_text_plain

# Disable all warnings
warnings.filterwarnings("ignore")

## Load CodeSearchNet Data

We load the entire CodeSearchNet dataset for Go, Java, PHP, Python and Ruby. While these don't cover every language, I'm hoping they are diverse enough that the embeddings will still help performance on code written in other languages.

In [4]:
# Load all Gzipped JSON Lines files in the data directory
frames = []
for filename in Path('../data/CodeSearchNet').glob('**/*.jsonl.gz'):
    frames.append(pd.read_json(filename, lines=True))

# Concatenate once, rather than once per file, and give every row a unique index
df = pd.concat(frames, ignore_index=True)

# Carefully manage memory
del frames
gc.collect()

df.head()

                                                code  \
0  protected final void fastPathOrderedEmit(U val...   
1  @CheckReturnValue\n    @NonNull\n    @Schedule...   

                                         code_tokens  \
0  [protected, final, void, fastPathOrderedEmit, ...   
1  [@, CheckReturnValue, @, NonNull, @, Scheduler...   

                                           docstring  \
0  Makes sure the fast-path emits in order.\n@par...   
1  Mirrors the one ObservableSource in an Iterabl...   
2  Mirrors the one ObservableSource in an array o...   
3  Concatenates elements of each ObservableSource...   
4  Returns an Observable that emits the items emi...   

                                    docstring_tokens  \
0  [Makes, sure, the, fast, -, path, emits, in, o...   
1  [Mirrors, the, one, ObservableSource, in, an, ...   
2  [Mirrors, the, one, ObservableSource, in, an, ...   
3  [Concatenates, elements, of, each, ObservableS...   
4  [Returns, an, Observable, that, emits, the

In [5]:
print(
    f'There are {len(df["docstring"].index):,} functions'
)

There are 2,070,536 functions


## Extract Text from Docstrings

Docstrings can contain HTML, so we parse them and extract text using `BeautifulSoup`.

In [None]:
code = df['code']
docs = df.docstring.apply(lambda x: extract_text_plain(x))
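
The `extract_text_plain` helper lives in `lib/utils` and its body is not shown in this notebook. As a rough illustration of the kind of tag stripping it performs, here is a minimal sketch that swaps in the standard library's `html.parser` for `BeautifulSoup`. The name `strip_html` and its whitespace handling are assumptions made for this example, not the project's actual implementation.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; back into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(docstring):
    """Hypothetical stand-in for extract_text_plain: drop tags, keep text."""
    parser = _TextCollector()
    parser.feed(docstring)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of spaces within each line but keep the line breaks
    return "\n".join(" ".join(line.split()) for line in text.splitlines()).strip()


print(strip_html("Returns an <code>Observable</code> that emits <b>items</b>.\n@param x the value"))
```

A function like this could be passed to `df['docstring'].apply(...)` in the same way as `extract_text_plain`.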

## Inspect the result of the code removal

In [None]:
pd.set_option('max_colwidth', 500)
doc_df = pd.DataFrame({'docs': docs, 'docstring': df['docstring']})

doc_df.head(3)

## Generate CSV for BERT

The [Google BERT GitHub project](https://github.com/google-research/bert) is a submodule of this project. From within your clone of [this project](https://github.com/rjurney/weakly_supervised_learning_code), you can check it out with:

```bash
git submodule init
git submodule update
```

We need to generate a file in the format that BERT expects. Although we give it a `.csv` extension, it is plain text. The BERT README describes the format:

> Here's how to run the data generation. The input is a plain text file, with one sentence per line. (It is important that these be actual sentences for the "next sentence prediction" task). Documents are delimited by empty lines. The output is a set of tf.train.Examples serialized into TFRecord file format.

In [None]:
sentence_tokenizer = PunktSentenceTokenizer()
sentences = docs.apply(sentence_tokenizer.tokenize)

sentences.head(2)

In [None]:
with open('../data/sentences.csv', 'w') as f:

    for i, doc in enumerate(sentences):
        # Insert an empty line to separate documents
        if i > 0:
            f.write('\n')
        # Write each sentence exactly as it appeared, one per line
        for sentence in doc:
            f.write(sentence.encode('unicode-escape').decode().replace('\\\\', '\\') + '\n')
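
To make the file layout concrete, here is a small self-contained sketch of the same format: one sentence per line, with an empty line between documents. The helper names `write_documents` and `read_documents` and the toy sentences are made up for illustration, and the sketch leaves out the `unicode-escape` step used in the cell above.

```python
import io


def write_documents(documents, handle):
    """Write one sentence per line, with an empty line between documents."""
    for i, sentences in enumerate(documents):
        if i > 0:
            handle.write("\n")
        for sentence in sentences:
            handle.write(sentence + "\n")


def read_documents(handle):
    """Parse the same format back into a list of documents."""
    documents, current = [], []
    for line in handle:
        line = line.rstrip("\n")
        if line:
            current.append(line)
        elif current:
            # An empty line closes the current document
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


# Toy data: two documents, the first with two sentences
toy_docs = [
    ["Makes sure the fast-path emits in order.", "Returns nothing."],
    ["Mirrors the one ObservableSource in an Iterable."],
]
buffer = io.StringIO()
write_documents(toy_docs, buffer)
print(buffer.getvalue())

# Reading the text back recovers the original document boundaries
buffer.seek(0)
round_trip = read_documents(buffer)
```

Reading the file back like this is a cheap way to confirm that document boundaries survived before spending hours on pre-training.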

## Using `sentencepiece` to Extract a WordPiece Vocabulary

BERT needs a WordPiece vocabulary file to run, so we need to decide on a number of tokens and then run `sentencepiece` to extract a list of valid tokens.

The `sentencepiece` PyPI package isn't sufficient for our needs, so we need to clone the GitHub repo, then build and install the software to create our vocabulary.

Make sure you're in the root directory of this project and run:

```bash
git clone https://github.com/google/sentencepiece
cd sentencepiece

mkdir build
cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v
```

Now we can use `spm_train` to create a vocabulary from our 4.7 million sentences.

In [1]:
%%bash

cd ../models
spm_train --input="../data/sentences.csv" --model_prefix=wsl --vocab_size=20000

# Prepend the [CLS], [SEP], [UNK] and [MASK] tags, or pre-training will error out.
# Write to a temporary file first so wsl.vocab is never read and overwritten in one command.
printf '[CLS]\t0\n[SEP]\t0\n[UNK]\t0\n[MASK]\t0\n' | cat - wsl.vocab > wsl.vocab.tmp
mv wsl.vocab.tmp wsl.vocab

# Remove the numbers, just retain the token column
cut -f1 wsl.vocab > wsl.stripped.vocab

sentencepiece_trainer.cc(49) LOG(INFO) Starts training with : 
TrainerSpec {
  input: ../data/sentences.csv
  input_format: 
  model_prefix: wsl
  model_type: UNIGRAM
  vocab_size: 20000
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  treat_whitespace_as_suffix: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
}
NormalizerSpec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv: 
}

trainer_interface.cc(267) LOG(INFO) Loading corpus: ../data/sentences.csv
trainer_interface.cc(139) LOG(INFO) Lo
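
The two shell post-processing steps in the cell above (prepend the special tags, then keep only the first tab-separated column) can also be sketched in Python. This is an illustrative equivalent rather than part of the original pipeline: the function name `build_bert_vocab` is hypothetical, and the toy rows and their scores are invented to mimic the `<piece>\t<score>` layout.

```python
SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[UNK]", "[MASK]"]


def build_bert_vocab(spm_vocab_lines, special_tokens=SPECIAL_TOKENS):
    """Prepend the special tokens and drop the score column."""
    pieces = [
        line.rstrip("\n").split("\t")[0]
        for line in spm_vocab_lines
        if line.strip()  # skip empty lines
    ]
    return list(special_tokens) + pieces


# Toy rows; the real wsl.vocab has one row per piece requested from spm_train
toy_vocab = ["<unk>\t0\n", "<s>\t0\n", "</s>\t0\n", "▁the\t-3.2\n", "s\t-3.5\n"]
vocab = build_bert_vocab(toy_vocab)
print(vocab)
print(len(vocab))
```

Applied to the real file, the result would hold the four tags followed by every piece, one entry per line of `wsl.stripped.vocab`.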

## Using BERT to Pretrain a Language Model

Next we use the WordPiece vocabulary to pre-train a BERT model that we will then use, as a transfer learning strategy, to encode the text of Stack Overflow questions.

### Creating a BERT conda environment

BERT needs `tensorflow==1.14.0`, which we install into its own conda environment. It is not possible to create that environment from within this notebook, so you will need to run the following commands in a shell, from the root directory of this project.


```bash
conda create -y -n bert python=3.7.4
conda init bash
```

Now, in a new shell, change directory to the root of the project:

```bash
cd /path/to/weakly_supervised_learning_code
```

Now run:

```bash
conda activate bert
pip install tensorflow-gpu==1.14.0
```

### Creating BERT Pre-Training Data

Before we can train a BERT model or extract static embeddings, we need to create the pre-training data that the model trains on. The output file will be 20 GB, so make sure you have the space available!

As the [BERT README](https://github.com/google-research/bert/blob/master/README.md) excerpt quoted earlier explains, the input is our plain text file with one sentence per line and an empty line between documents, and the output is a TFRecord file of serialized `tf.train.Example`s.

We need to configure BERT to use our vocabulary size, so we create a `bert_config.json` file in the `bert/` directory.

```bash
# Tell BERT how many tokens to use
echo '{ "vocab_size": 20004 }' > bert/bert_config.json 
```
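
The value 20004 is the 20,000 pieces requested from `spm_train` plus the four special tags prepended to the vocabulary. Rather than hard-coding it, the number could be derived from the vocabulary file itself. The sketch below shows one way to do that; the function name `write_bert_config` is hypothetical, and it runs on a throwaway toy vocabulary rather than `models/wsl.stripped.vocab`.

```python
import json
import tempfile
from pathlib import Path


def write_bert_config(vocab_path, config_path):
    """Count the vocabulary entries and store the total as vocab_size."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab_size = sum(1 for line in f if line.rstrip("\n"))
    Path(config_path).write_text(json.dumps({"vocab_size": vocab_size}))
    return vocab_size


# Demonstrate on a temporary six-entry vocabulary
with tempfile.TemporaryDirectory() as tmp:
    vocab_file = Path(tmp) / "toy.vocab"
    vocab_file.write_text("[CLS]\n[SEP]\n[UNK]\n[MASK]\n<unk>\n▁the\n", encoding="utf-8")
    config_file = Path(tmp) / "bert_config.json"
    size = write_bert_config(vocab_file, config_file)
    config_text = config_file.read_text()

print(size, config_text)
```

Deriving the size this way keeps the config and the vocabulary in step if you later change `--vocab_size`.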

Then we run `create_pretraining_data.py` to generate the pre-training data.

```bash
python bert/create_pretraining_data.py \
   --input_file=data/sentences.csv \
   --output_file=data/tf_examples.tfrecord \
   --vocab_file=models/wsl.stripped.vocab \
   --bert_config_file=bert/bert_config.json \
   --do_lower_case=False \
   --max_seq_length=128 \
   --max_predictions_per_seq=20 \
   --num_train_steps=20 \
   --num_warmup_steps=10 \
   --random_seed=1337 \
   --learning_rate=2e-5
```
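
The choice of `--max_predictions_per_seq=20` follows from the sequence length. The BERT README suggests setting it to roughly `max_seq_length * masked_lm_prob`, and passing the same value to both the data generation and the pre-training scripts. The quick check below assumes the masking probability is left at 0.15, which to my knowledge is the script's default and is not set explicitly in the command above.

```python
import math

max_seq_length = 128   # from the command above
masked_lm_prob = 0.15  # assumption: default masking rate, not set in the command

# Expected number of masked tokens in a full-length sequence
expected_masks = max_seq_length * masked_lm_prob

# Round up so the cap never truncates a typical sequence
max_predictions_per_seq = math.ceil(expected_masks)

print(expected_masks, max_predictions_per_seq)
```

If you change `--max_seq_length`, the same arithmetic gives a reasonable starting value for `--max_predictions_per_seq`.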

Now we can run pre-training. If your GPU has only 8 GB of memory, reduce the training batch size to 16 or 24.

```bash
python bert/run_pretraining.py \
  --input_file=data/tf_examples.tfrecord \
  --output_dir=models/pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=bert/bert_config.json \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=10000 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
```

Finally, deactivate the conda environment:

```bash
conda deactivate
```