# ML-CodeParrot Pretraining Process
- Pretrained on code from the GitHub dataset on BigQuery, keeping only files that use ML libraries (torch, transformers, ..., sklearn)
- The pretrained tokenizer and model can both be [loaded](https://huggingface.co/rockmiin) from the Hugging Face Hub
- Results are compared against the [original CodeParrot model](https://huggingface.co/transformersbook/codeparrot-small) throughout
## Overall Process
- ### Extract ML-Github Dataset
- ### Tokenizer
- ### Model
- ### Experiment
- ### Conclusion

## Extract ML-Github Dataset
- The SQL query below was run on BigQuery to extract only `.py` files that use `torch`, `sklearn`, or the Hugging Face libraries (`transformers`, `datasets`, `tokenizers`); the query finished within 2 minutes
- From 2.7TB in total, 5.61GB (446,595 samples) was extracted and used (about 3% of the data used for the [CodeParrot](https://huggingface.co/transformersbook/codeparrot-small) model)
- The extracted data was split 9:1 into [Train](https://huggingface.co/datasets/rockmiin/ml-codeparrot-train) and [Valid](https://huggingface.co/datasets/rockmiin/ml-codeparrot-valid) datasets


In [None]:
SQL = """SELECT
  f.repo_name, f.path, c.copies, c.size, c.content, l.license
FROM
  `bigquery-public-data.github_repos.files` AS f
JOIN
  `bigquery-public-data.github_repos.contents` AS c
ON
  f.id = c.id
JOIN
  `bigquery-public-data.github_repos.licenses` AS l
ON
  f.repo_name = l.repo_name
WHERE
  NOT c.binary
  AND ((f.path LIKE '%.py')
    AND (c.size BETWEEN 1024 AND 1048575))
  AND REGEXP_CONTAINS(c.content, r'torch|sklearn|transformers|datasets|tokenizers')
"""
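
The `REGEXP_CONTAINS` condition matches the library names anywhere in a file, not only in import statements. The sketch below reproduces the `WHERE` conditions locally with Python's `re` module to show which files would pass. The helper name `passes_filter` and the sample files are made up for illustration; BigQuery uses RE2 syntax, but a plain alternation like this one should behave the same way.

```python
import re

# Same alternation as in the SQL query above.
ML_PATTERN = re.compile(r"torch|sklearn|transformers|datasets|tokenizers")

def passes_filter(path: str, content: str) -> bool:
    """Mimic the WHERE clause: .py file, 1024 to 1048575 bytes, mentions an ML library."""
    size = len(content.encode("utf-8"))
    return (
        path.endswith(".py")
        and 1024 <= size <= 1048575
        and ML_PATTERN.search(content) is not None
    )

padding = "# filler\n" * 200  # pads each sample past the 1024-byte minimum

# Hypothetical files, not taken from the real dataset.
samples = {
    "train.py": "import torch\n" + padding,
    "views.py": "import django\n" + padding,
    "loader.py": "# reads several datasets from disk\n" + padding,  # comment only
    "tiny.py": "import torch\n",  # below the size threshold
}

for path, content in samples.items():
    print(path, passes_filter(path, content))
```

Note that `loader.py` passes even though it imports no ML library, so the extracted set probably contains some false positives.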

## Tokenizer
- Inspect the vocab token lists of the pretrained codeparrot and ml-codeparrot tokenizers
- Identify tokens that appear in only one of the two tokenizer vocabs

In [2]:
from transformers import AutoTokenizer

ml_tokenizer= AutoTokenizer.from_pretrained('rockmiin/ml-codeparrot')
org_tokenizer= AutoTokenizer.from_pretrained('transformersbook/codeparrot-small')


In [3]:
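topk = 1000
# Sort by token id and skip the first 257 entries (presumably the 256 byte-level base tokens plus one special token)
top_ml_vocab = [tok for tok, idx in sorted(ml_tokenizer.vocab.items(), key=lambda x: x[1])[257:257+topk]]
top_org_vocab = [tok for tok, idx in sorted(org_tokenizer.vocab.items(), key=lambda x: x[1])[257:257+topk]]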

In [21]:
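def get_top_tokens(tokenizer, n):
    # Sort the vocab by value (token id) in ascending order, skip the first 257 entries, and keep the next n.
    # In a BPE vocab, lower ids roughly correspond to earlier (more frequent) merges.
    sorted_tokens = sorted(tokenizer.vocab.items(), key=lambda x: x[1])[257:257+n]
    # Keep only the token strings
    top_tokens = [token for token, idx in sorted_tokens]
    return top_tokens

# Example: extract and display the first 100 tokens
top_tokens = get_top_tokens(ml_tokenizer, 100)
top_tokens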

['ĠĠ',
 'ĠĠĠĠ',
 'ĠĠĠ',
 'ĠĠĠĠĠĠĠĠ',
 'in',
 'se',
 'at',
 're',
 'ĠĠĠĠĠĠĠ',
 'or',
 'er',
 'on',
 'Ġt',
 'st',
 'ĊĠĠĠ',
 'ĊĠĠĠĠĠĠĠ',
 'Ġ=',
 'al',
 'ar',
 'ĊĠĠĠĠĠĠĠĠ',
 'le',
 'an',
 'de',
 'he',
 'me',
 'it',
 '--',
 'Ġc',
 'Ġn',
 'Ġi',
 'as',
 'Ġf',
 'en',
 'ion',
 'Ġs',
 'mp',
 'lf',
 '##',
 'ra',
 'Ġp',
 'ro',
 'ct',
 'self',
 'ut',
 'Ġthe',
 'Ġin',
 'ĊĠĠĠĠĠĠĠĠĠĠĠ',
 'Ġo',
 'es',
 'ing',
 'Ġd',
 'lo',
 '==',
 "Ġ'",
 'Ġ"',
 'Ġa',
 'ed',
 'co',
 'ata',
 'el',
 'Ġm',
 'ic',
 'Ġre',
 'est',
 'Ġ#',
 'Ġb',
 'pe',
 'ge',
 'ĊĊĠĠĠ',
 'and',
 'Ġw',
 'Ġself',
 '----',
 '):',
 'ur',
 'ĊĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ',
 'is',
 'un',
 'ig',
 'ame',
 'Ġ(',
 'ce',
 '####',
 'ue',
 "',",
 'ul',
 'ab',
 'res',
 'Ġde',
 'ts',
 'Ġ1',
 'ate',
 'id',
 'Ġof',
 'ser',
 'Ġto',
 'ch',
 'ort',
 'ex',
 'di']

In [4]:
top_ml_vocab[:100]  # output is identical to the top_tokens list above

In [5]:
top_org_vocab[:100]

['ĠĠ',
 'ĠĠĠĠ',
 'ĠĠĠ',
 'ĠĠĠĠĠĠĠĠ',
 'se',
 'in',
 'ĠĠĠĠĠĠĠ',
 're',
 'on',
 'te',
 'ĊĠĠĠĠĠĠĠ',
 'ĊĠĠĠĠĠĠĠĠ',
 'or',
 'st',
 'de',
 'ĊĠĠĠ',
 'th',
 'le',
 'Ġ=',
 'lf',
 'self',
 'me',
 'al',
 'ti',
 'er',
 'Ġa',
 "Ġ'",
 'Ġi',
 'ar',
 'Ġc',
 'en',
 'ĊĠĠĠĠĠĠĠĠĠĠĠ',
 'Ġf',
 'an',
 'Ġself',
 'at',
 'ro',
 'Ġth',
 'Ġre',
 'tion',
 "',",
 'Ġ"',
 'Ġp',
 'ur',
 'ce',
 'Ġn',
 'ge',
 '):',
 'as',
 '--',
 'Ġt',
 'Ġs',
 '##',
 'ue',
 'mp',
 'Ġo',
 'ame',
 'Ġthe',
 'Ġin',
 'ing',
 'li',
 'def',
 'ct',
 'lo',
 'pe',
 'ri',
 'ate',
 'un',
 'Ġe',
 'ĊĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ',
 'Ġ#',
 'di',
 'fi',
 'Ġb',
 'co',
 'ser',
 'Ġm',
 'Ġ(',
 'ch',
 'Ġw',
 'ut',
 'si',
 'ĊĊĠĠĠ',
 'Ġif',
 '""',
 '()',
 'nt',
 'id',
 'ra',
 'ck',
 'Ġdef',
 'ul',
 'urn',
 'ad',
 'ter',
 'el',
 'turn',
 'name',
 'ĊĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ',
 "':"]

In [6]:
# Tokens in the ml-vocab top-k that are not in the org-vocab top-k
print([tok for tok in top_ml_vocab if tok not in top_org_vocab])
# e.g. [train, iter, shape, features]



In [7]:
# Tokens in the org-vocab top-k that are not in the ml-vocab top-k
print([tok for tok in top_org_vocab if tok not in top_ml_vocab])
# e.g. [url, db, jango, assertEqual]

['ti', 'tion', 'fi', 'si', 'xt', 'alue', 'la', '::', 'eld', 'gs', 'bu', 'bj', 'lin', 'ls', 'ht', 'bject', 'ci', 'tem', 'our', 'app', 'module', 'wor', 'mm', 'tri', 'Ġar', 'lic', 'Ġma', 'ource', 'assertEqual', 'mode', 'url', 'ry', "'),", 'Ġra', 'fa', 'ader', 'ĊĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ', 'sp', 'tp', 'Ġpa', 'ca', 'db', 'field', 'Ġns', 'ssage', 'ĠĊĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠĠ', 'vice', 'ten', 'scri', 'ponse', "'):", 'EN', 'jo', 'tes', 'AL', 'late', 'user', 'ress', 'wa', 'ari', 'tic', 'method', 'ock', 'tent', 'ET', 'vent', 'gn', 'object', 'tribu', "Ġ['", 'Ġ}', 'Field', 'work', 'pp', 'Type', 'ze', 'temp', 'ception', 'stri', 'lay', 'Ġmode', 'ec', 'kw', 'ble', 'son', 'ST', 'Ġcls', '11', 'net', 'jang', 'tions', 'lif', '.__', 'tive', 'jango', 'Ġmodule', 'andle', 'mmand', 'Ġtry', 'lient', '////', 'Ġexcept', 'De', 'update', 'string', 'ters', 'AN', 'Ġro', 'umber', "''", 'atus', 'SE', 'Ġfield', 'gin', 'LE', 'RE', 'peci', 'ght', 'Pro', 'Ex', 'Ġuser', 'group', 'ape', 'Ġ##', 'opy', 'Ġpath', 'node', 'UT', 'uti', "'}", '

- The vocab composition changes considerably depending on the data used for training
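
One way to put a number on this difference is the Jaccard overlap between the two top-k lists. The sketch below is only an illustration: `vocab_overlap` is a hypothetical helper, and the toy lists stand in for `top_ml_vocab` and `top_org_vocab` rather than reproducing them.

```python
def vocab_overlap(top_a, top_b):
    """Return (only_a, only_b, jaccard) for two lists of tokens."""
    set_a, set_b = set(top_a), set(top_b)
    only_a = [tok for tok in top_a if tok not in set_b]  # keeps the rank order of top_a
    only_b = [tok for tok in top_b if tok not in set_a]  # keeps the rank order of top_b
    union = set_a | set_b
    jaccard = len(set_a & set_b) / len(union) if union else 1.0
    return only_a, only_b, jaccard

# Toy stand-ins for the two top-k lists (not the real vocabularies).
toy_ml = ["self", "train", "shape", "in", "features"]
toy_org = ["self", "url", "in", "db", "assertEqual"]

only_ml, only_org, jaccard = vocab_overlap(toy_ml, toy_org)
print(only_ml)
print(only_org)
print(round(jaccard, 2))
```

Using sets for the membership test also avoids the quadratic list lookups of the cells above, which matters little at `topk = 1000` but helps for full vocabularies.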

## Model
- Uses the code from the [huggingface codeparrot repository](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot)
- Trained on the [ml-codeparrot-train dataset](https://huggingface.co/datasets/rockmiin/ml-codeparrot-train) for **2 epochs** (200,000 steps)
- [gpt2](https://huggingface.co/gpt2) is used as the baseline model
- Trained on a single A5000 24GB GPU (about 28 hours)

### Dataset
The full dataset was split 9:1 into train and valid datasets.

| Dataset             | Raw size |
|---------------------|----------|
| ml-codeparrot-train | 5.05GB   |
| ml-codeparrot-valid | 0.56GB   |
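
The source does not say how the split was made, so the following is only a minimal sketch of a 9:1 split, assuming a seeded random shuffle. The function name `split_train_valid` and the file names are hypothetical.

```python
import random

def split_train_valid(samples, valid_ratio=0.1, seed=42):
    """Shuffle a copy of the samples and hold out valid_ratio of them."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]

# Hypothetical file list standing in for the extracted samples.
files = [f"file_{i}.py" for i in range(1000)]
train, valid = split_train_valid(files)
print(len(train), len(valid))
```

Fixing the seed keeps the train and valid sets stable across reruns, which matters when the two are uploaded as separate datasets.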

### Baseline Models
Pretraining was performed using gpt2 as the base model.

| Model | Model size | Vocab size |
|-------|------------|------------|
| gpt2  | 117M       | 32768      |

### Monitoring
**Train Loss**<br>
<center><img src="./images/train_loss.png" width="900" height="300"></center>

**Eval Loss**<br>
<center><img src="./images/eval_loss.png" width="900" height="300"></center>

**Eval Perplexity**<br>
<center><img src="./images/eval_perplexity.png" width="900" height="300"></center>
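
The eval perplexity curve follows directly from the eval loss. Assuming the usual definition (perplexity is the exponential of the mean per-token cross-entropy loss, in nats), the relationship can be sketched as follows. The loss values are illustrative and not read from the plots.

```python
import math

def perplexity(mean_loss: float) -> float:
    """Perplexity as exp(mean cross-entropy); returns inf if exp overflows."""
    try:
        return math.exp(mean_loss)
    except OverflowError:
        return float("inf")

# Illustrative loss values only.
for loss in (0.0, 1.0, 2.0):
    print(loss, round(perplexity(loss), 3))
```

Because of the exponential, small drops in eval loss late in training show up as visibly larger drops in perplexity.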

## Experiment
- Compare the generation results of ml-codeparrot and codeparrot

In [8]:
from transformers import pipeline, set_seed

model_ckpt= 'rockmiin/ml-codeparrot'
generation = pipeline('text-generation', model=model_ckpt)

In [9]:
org_model_ckpt= 'transformersbook/codeparrot-small'
org_generation = pipeline('text-generation', model=org_model_ckpt)

In [10]:
import re

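def first_block(string):
    # Keep only the text before the next top-level statement (class, def, comment, decorator, print, if)
    return re.split('\nclass|\ndef|\n#|\n@|\nprint|\nif', string)[0].rstrip()

def complete_code(pipe, prompt, max_length=64, num_completions=4, seed=42):
    set_seed(seed)  # make sampling reproducible
    # Nucleus sampling: top_k=0 disables top-k filtering, num_beams=1 disables beam search
    gen_kwargs = {"temperature":0.6, "top_p":0.90, "top_k":0, "num_beams":1,
                  "do_sample":True,}
    # max_length counts the prompt tokens as well as the generated ones
    code_gens = pipe(prompt, num_return_sequences=num_completions, 
                            max_length=max_length, **gen_kwargs)
    code_strings = []
    for code_gen in code_gens:
        # Strip the prompt from the generated text, then truncate to the first block
        generated_code = first_block(code_gen['generated_text'][len(prompt):])
        code_strings.append(generated_code)
    # Print the completions separated by a line of '=' characters
    print(('\n'+'='*80 + '\n').join(code_strings))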
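
To see what `first_block` keeps and discards, it can be run on a made-up completion without loading any model. The `raw` strings below are hypothetical, not actual model output.

```python
import re

def first_block(string):
    # Same splitting rule as in the cell above.
    return re.split('\nclass|\ndef|\n#|\n@|\nprint|\nif', string)[0].rstrip()

# Hypothetical raw completion: a function body followed by an unrelated function.
raw = "\n    return torch.cat([a, b])\n\ndef other():\n    pass\n"
print(repr(first_block(raw)))

# Indented comments survive, because the pattern only matches '#' at the start of a line.
raw_with_comment = "\n    x = 1\n    # note\n    return x\nprint(x)"
print(repr(first_block(raw_with_comment)))
```

The split only triggers on keywords at column 0, which is why indented comments such as `# Compute the concatenated tensors` can still appear in the printed completions.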

In [11]:
prompt = '''
import torch
def concat_tensor(a, b):
    """
    Return concatenated tensor of two input tensors.
    Assume the sizes of two tensors are equal.
    """'''

complete_code(org_generation, prompt, max_length=64)
# LABEL : return torch.concat([a, b])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



    return a.concat(b)

    return a * b

    # Compute the concatenated tensors
    if isinstance(a, tensor.Tensor):
        a = a.value()
    # Add the concatenated

    return torch.cat(a, b)


In [12]:
prompt = '''
import torch
def concat_tensor(a, b):
    """
    Return concatenated tensor of two input tensors.
    Assume the sizes of two tensors are equal.
    """'''

complete_code(generation, prompt, max_length=64)
# LABEL : return torch.concat([a, b])




    return [
        torch.zeros(1, len(a), dtype=torch.float),
        torch.zeros(1, len

    return a.to(b)

    return torch.cat(a, dim=-1)

    return torch.stack([a, b], axis=-1)


In [13]:
prompt = '''
def encode(sentence, tokenizer):
    """
    Return tokenized list of input sentences.
    
    Example:
    sentence: ["Hi", "how are you"] -> output: [[1], [34, 5656, 32]]
    """'''

complete_code(org_generation, prompt, max_length=64)
# LABEL : return [tokenizer(s)['input_ids'] for s in sentence]




    # TODO: use tokenize.tokenize to handle sentence
    sentence = tokenizer

    return ''.join(sentence.split())

    # TODO: Use tokenizer to decode sentence
    sentence = tokenizer.tokenize

    return [encode_sentence(sentence, tokenizer) for sentence in sentence


In [14]:
prompt = '''
def encode(sentence, tokenizer):
    """
    Return tokenized list of input sentences.
    
    Example:
    sentence: ["Hi", "how are you"] -> output: [[1], [34, 5656, 32]]
    """'''

complete_code(generation, prompt, max_length=64)
# LABEL : return [tokenizer(s)['input_ids'] for s in sentence]




    return [
        tokenizer(sentence) for sentence in sentence]

    return [x.encode(tokenizer.tokenize(x)) for x in

    tokenized = tokenizer.encode(sentence)
    tokenized = tokenized.encode(

    return [
        [tokenizer(sentence, tokenizer, max_tokens=max


In [15]:
# from chatGPT
import torch

def concat_tensor(a, b):
    """
    Return concatenated tensor of two input tensors.
    Assume the sizes of two tensors are equal.
    """
    return torch.cat((a, b), dim=0)
# LABEL : return torch.concat([a, b])

In [16]:
# from chatGPT
def encode(sentence, tokenizer): 
    """
    Return tokenized list of input sentences.
    
    Example:
    sentence: ["Hi", "how are you"] -> output: [[1], [34, 5656, 32]]
    """
    return [tokenizer.encode(s) for s in sentence]
# LABEL : return [tokenizer(s)['input_ids'] for s in sentence]

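## Conclusion

Pretraining with only a task-specific (ML) code dataset
- With BigQuery, the data can be extracted very quickly
- Despite using only a small amount of data (3%), the results suggest that a model focused on one specific task is feasible
- Coding-test and math problems can be scored with unit tests, but ML problems seem harder to evaluate; how best to evaluate them is still an open question
- Even after all this, a general-purpose LLM still performed better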
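
For prompts that do have a checkable answer, the unit-test style of evaluation mentioned above can be sketched by executing each completion against a known input/output pair. Everything here is hypothetical: the toy tokenizer, the candidate strings (loosely modeled on the outputs shown earlier), and the helper `passes_unit_test`. Real model output should only be executed inside a sandbox.

```python
PROMPT = '''
def encode(sentence, tokenizer):
    """Return tokenized list of input sentences."""'''

# Hypothetical completions, loosely modeled on the generations shown earlier.
CANDIDATES = [
    "\n    return [tokenizer(s)['input_ids'] for s in sentence]",
    "\n    return ''.join(sentence.split())",
    "\n    return [tokenizer(sentence) for sentence in",  # truncated, so a syntax error
]

def toy_tokenizer(text):
    """Stand-in tokenizer: one id per whitespace-separated word (the word length)."""
    return {"input_ids": [len(word) for word in text.split()]}

def passes_unit_test(completion: str) -> bool:
    """Execute prompt + completion and compare against one expected output."""
    namespace = {}
    try:
        exec(PROMPT + completion, namespace)  # only run trusted or sandboxed code
        result = namespace["encode"](["Hi", "how are you"], toy_tokenizer)
    except Exception:
        return False  # syntax errors and runtime errors both count as failures
    return result == [[2], [3, 3, 3]]

results = [passes_unit_test(c) for c in CANDIDATES]
print(results)
print(f"pass rate: {sum(results)}/{len(results)}")
```

This works for small, deterministic functions; it does not cover ML code whose correctness depends on training behavior or numerical results, which is the harder case raised above.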


[Repository](https://github.com/RockMiin/ML-CodeParrot)