# Readme Embedding using BERT

This notebook embeds the README of each repository as a vector using **BERT**.

There are two main reasons for using BERT:
1. The README text used here does not preserve order information.
2. In general, BERT produces better sentence representations than doc2vec.

In the READMEs used in this study, the original order of the text is not preserved, because code and emoji were removed and titles were mixed into the body text.

doc2vec, however, is a sequential language model that learns words based on their order. It is therefore not well suited to this README data.

In [1]:
import pandas as pd
from transformers import BertTokenizer, BertModel
from tqdm import tqdm 
import warnings 

warnings.filterwarnings(action='ignore')

In [2]:
# Load data 
data = pd.read_csv('data/data/filtered_data.csv')

In [3]:
# BERT input normally needs special tokens ([CLS], [SEP]) around each sentence.
# Note that tokenizer.tokenize() only splits the text; it does not add them.
corpus = list(data.readme)

# Split each README into WordPiece tokens with the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_corpus = []
for sent in tqdm(corpus):
    tokenized_corpus.append(tokenizer.tokenize(sent))

# Map each token to its vocabulary index
token_index = [tokenizer.convert_tokens_to_ids(v) for v in tokenized_corpus]

100%|██████████| 3367/3367 [00:54<00:00, 61.24it/s]
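
To make the two steps above concrete, here is a minimal sketch of what `tokenize` and `convert_tokens_to_ids` do conceptually: WordPiece splits each word greedily into the longest subwords found in the vocabulary, and each subword is then looked up by index. The vocabulary `TOY_VOCAB` and the helpers `toy_wordpiece` and `toy_convert_tokens_to_ids` are made up for illustration. The real tokenizer also handles lowercasing, punctuation, and a far larger vocabulary.

```python
# Toy vocabulary (an assumption for illustration, not the real BERT vocabulary).
TOY_VOCAB = {"[UNK]": 0, "read": 1, "##me": 2, "em": 3, "##bed": 4, "##ding": 5}


def toy_wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation of the word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no subword matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_convert_tokens_to_ids(tokens, vocab):
    """Look up each token's index, falling back to the [UNK] index."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


tokens = toy_wordpiece("readme", TOY_VOCAB) + toy_wordpiece("embedding", TOY_VOCAB)
ids = toy_convert_tokens_to_ids(tokens, TOY_VOCAB)
print(tokens)
print(ids)
```

Splitting into subwords is what lets a fixed vocabulary cover words it has never seen as a whole, which matters for READMEs full of project-specific names.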


In [4]:
# BERT is pre-trained on next sentence prediction (NSP) together with masked language modeling (MLM).
# Its input therefore carries a segment ID for every token, marking which sentence the token belongs to.
# Each README is treated as a single sentence here, so all of its tokens share one ID.
# By BERT convention a single-segment input uses ID 0 (ID 1 marks the second sentence of a pair).
segment_ids = []
for token_corpus in token_index:
    segment_ids.append([0] * len(token_corpus))
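
For comparison, the sketch below shows how segment IDs look when two sentences are packed into one input, which is the case the IDs exist for. The helper `build_pair_inputs` and the example tokens are hypothetical. Only the layout (`[CLS] A [SEP] B [SEP]`, with 0 for the first segment and 1 for the second) follows the usual BERT convention.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Pack two token lists into one BERT-style input with matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and its [SEP] belong to segment 0;
    # sentence B and the final [SEP] belong to segment 1.
    pair_segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, pair_segment_ids


pair_tokens, pair_segment_ids = build_pair_inputs(
    ["install", "the", "package"], ["run", "the", "tests"]
)
for token, segment in zip(pair_tokens, pair_segment_ids):
    print(segment, token)
```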

---

# BERT practice 

This project was my first time using BERT.

The code below summarizes what I learned while implementing it, so you can skip it.

In [6]:
# Split the corpus into batches
batch_size = 16
batch_len = (len(corpus) + batch_size - 1) // batch_size  # ceiling division

# Start index of every batch, followed by the end of the corpus.
# Using len(corpus) instead of -1 as the final bound keeps the last README in the last batch.
idx_list = list(range(0, len(corpus), batch_size)) + [len(corpus)]

print('number of batches : {}'.format(batch_len))

number of batches : 211


A total of 211 batches are created.

Each batch contains up to 16 READMEs.
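
The batching step can also be sketched with plain slicing, which avoids tracking an index list by hand. The helper `make_batches` and the placeholder corpus are assumptions for illustration. The corpus size of 3367 is taken from the progress bar shown earlier.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive batches; the last batch may be smaller."""
    return [items[start:start + batch_size]
            for start in range(0, len(items), batch_size)]


# Placeholder documents standing in for the real READMEs.
toy_corpus = ["readme {}".format(i) for i in range(3367)]
batches = make_batches(toy_corpus, 16)

print(len(batches))      # number of batches
print(len(batches[-1]))  # size of the final, partial batch
```

Slicing past the end of a list is safe in Python, so the final batch simply holds whatever is left over and no document is dropped.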

In [9]:
import torch

# Special tokens ([CLS], [SEP]) do not need to be added by hand here:
# batch_encode_plus inserts them automatically.
corpus = list(data.readme)

# batch_encode_plus also handles the token-to-index mapping and the segment IDs
# that were built manually in the cells above.
tokenizer_test = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()  # inference only: disable dropout

bert_result = []
for i, start in enumerate(range(0, len(corpus), batch_size)):
    # show progress
    print('{}/{}'.format(i, batch_len))

    # tokenize one batch, padding to its longest README and truncating at BERT's 512-token limit
    token_output = tokenizer_test.batch_encode_plus(
        corpus[start:start + batch_size],
        padding=True, truncation=True, max_length=512, return_tensors='pt')

    # embed the batch with the BERT base model; no gradients are needed for inference
    with torch.no_grad():
        result = bert_model(**token_output)
    bert_result.append(result)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


0/211
1/211
2/211
3/211
4/211
5/211


101 is the ID of the [CLS] token.

Do not add '[CLS]' manually when using the batch_encode_plus method.

It adds the special tokens automatically.

In [None]:
# Embed the tokenized batch with the BERT model
# I am not sure why ** is needed in front of the argument
bert_model = BertModel.from_pretrained('bert-base-uncased')
result = bert_model(**token_output)

### Questions
1. Why is ** added in front of the input argument when calling the BERT model instance?

2. What is 'pooler_output' in the BERT model's output, and are there other outputs?
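
A likely answer to the first question: `batch_encode_plus` returns a dict-like object whose keys (`input_ids`, `token_type_ids`, `attention_mask`) match keyword arguments of the model's forward call, and `**` unpacks such a mapping into keyword arguments. The sketch below shows this with plain Python. `toy_model` and `toy_token_output` are hypothetical stand-ins, not part of the transformers library.

```python
def toy_model(input_ids, token_type_ids=None, attention_mask=None):
    """Hypothetical stand-in for a model call; it only reports what it received."""
    return {
        "n_sequences": len(input_ids),
        "got_token_type_ids": token_type_ids is not None,
        "got_attention_mask": attention_mask is not None,
    }


# A plain dict shaped like the tokenizer output (the IDs are illustrative).
toy_token_output = {
    "input_ids": [[101, 2023, 102]],
    "token_type_ids": [[0, 0, 0]],
    "attention_mask": [[1, 1, 1]],
}

# These two calls are equivalent: ** turns each key into a keyword argument.
unpacked = toy_model(**toy_token_output)
explicit = toy_model(
    input_ids=toy_token_output["input_ids"],
    token_type_ids=toy_token_output["token_type_ids"],
    attention_mask=toy_token_output["attention_mask"],
)
print(unpacked == explicit)
```

Without `**`, the whole dictionary would be passed as the single positional argument `input_ids`, which is not what the model expects.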

In [None]:
result.last_hidden_state.size()
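
On the second question above: as described in the Hugging Face documentation, `last_hidden_state` holds one vector per token with shape (batch, tokens, hidden), while `pooler_output` is the final hidden state of the first token ([CLS]) passed through a linear layer and a tanh. Other outputs such as `hidden_states` and `attentions` are returned only when requested. The toy sketch below reproduces the pooling computation in plain Python. The function `toy_pooler` and all weights and values are made up for illustration.

```python
import math


def toy_pooler(last_hidden_state, weight, bias):
    """For each sequence: take the first token's vector, apply a dense layer, then tanh."""
    pooled = []
    for sequence in last_hidden_state:
        cls_vector = sequence[0]  # hidden state of the first ([CLS]) token
        pooled.append([
            math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
            for row, b in zip(weight, bias)
        ])
    return pooled


# One sequence of 3 tokens with hidden size 2 (toy numbers).
toy_hidden = [[[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]]
toy_weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, for readability
toy_bias = [0.0, 0.0]

toy_pooled = toy_pooler(toy_hidden, toy_weight, toy_bias)
print(toy_pooled)
```

The result has one vector per sequence rather than one per token, which is why `pooler_output` is often used as a fixed-size summary of the input.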

In [None]:
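# Flatten each sequence's (tokens x hidden) matrix into one vector,
# e.g. 512 tokens x 768 hidden units = 393216 values per sequence.
# reshape is used instead of the in-place resize_, which fails on tensors that require grad.
result.last_hidden_state = result.last_hidden_state.reshape(result.last_hidden_state.size(0), -1)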

In [None]:
result.last_hidden_state