<a href="https://colab.research.google.com/github/AdrianKazi/AWS-DeepRacer-PPO-Agent/blob/main/Slingerr_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# (Prod) Data

Fetches text from the listed URLs and stores it in `data/corpus.txt`. Sources that fail to download are skipped and reported.

In [None]:
import requests
from pathlib import Path

def fetch_corpus(output_path="data/corpus.txt", timeout=30):
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)

    DATA_SOURCES = {
        # TECH / ENGINEERING LANGUAGE
        "github_opensource_guide": "https://raw.githubusercontent.com/github/opensource.guide/main/README.md",
        "kubernetes_intro": "https://raw.githubusercontent.com/kubernetes/website/main/content/en/docs/concepts/overview/what-is-kubernetes.md",
        "docker_docs": "https://raw.githubusercontent.com/docker/docs/main/README.md",

        # PROGRAMMING / Q&A STYLE
        "reddit_programming": "https://raw.githubusercontent.com/taivop/joke-dataset/master/reddit-programming.txt",

        # JOB / ROLE / CORP LANGUAGE (synthetic but legit)
        "job_postings": "https://raw.githubusercontent.com/IBM/dataset-job-postings/master/data/job_postings.txt",
    }

    with open(output_path, "w", encoding="utf-8") as f:
        for name, url in DATA_SOURCES.items():
            print(f"Fetching {name}")
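            try:
                r = requests.get(url, timeout=timeout)
                r.raise_for_status()  # turn HTTP 4xx/5xx into an exception
                f.write(r.text)
                f.write("\n")
            except requests.RequestException as e:
                # Skip unreachable sources so one bad URL does not abort the run.
                print(f"SKIPPED {name}: {e}")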

    print("DONE. Corpus at:", output_path)

# usage (fetch_corpus returns None; the corpus is written to disk):
fetch_corpus()

Fetching github_opensource_guide
Fetching kubernetes_intro
SKIPPED kubernetes_intro: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/kubernetes/website/main/content/en/docs/concepts/overview/what-is-kubernetes.md
Fetching docker_docs
Fetching reddit_programming
SKIPPED reddit_programming: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/taivop/joke-dataset/master/reddit-programming.txt
Fetching job_postings
SKIPPED job_postings: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/IBM/dataset-job-postings/master/data/job_postings.txt
DONE. Corpus at: data/corpus.txt
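
The log above shows three of the five sources returning 404, so the corpus ends up holding only two README files. One way to make that visible to later cells is to return the failures instead of only printing them. The following is a minimal sketch, not the notebook's own code: `collect_sources`, `fake_fetch`, and the `example.com` URLs are hypothetical names introduced for illustration, and `fake_fetch` stands in for a real HTTP call so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple


def collect_sources(
    sources: Dict[str, str],
    fetch: Callable[[str], str],
) -> Tuple[List[str], Dict[str, str]]:
    """Fetch every source and return (texts, failures) instead of only printing."""
    texts: List[str] = []
    failures: Dict[str, str] = {}
    for name, url in sources.items():
        try:
            texts.append(fetch(url))
        except Exception as exc:  # record the reason rather than silently skipping
            failures[name] = str(exc)
    return texts, failures


def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP GET; fails for URLs containing 'missing'."""
    if "missing" in url:
        raise RuntimeError("404 Client Error")
    return f"text from {url}"


# Hypothetical demo sources, not the notebook's DATA_SOURCES.
demo_sources = {
    "ok_source": "https://example.com/ok.txt",
    "bad_source": "https://example.com/missing.txt",
}
texts, failures = collect_sources(demo_sources, fake_fetch)
print(f"fetched {len(texts)} of {len(demo_sources)} sources")
for name, reason in failures.items():
    print(f"FAILED {name}: {reason}")
```

With a return value like this, a later cell could refuse to train a tokenizer when too few sources were fetched.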


In [None]:
with open("data/corpus.txt", "r", encoding="utf-8") as f:
    print(f.read(2000))


# Open Source Guides
[![Build Status](https://github.com/github/opensource.guide/workflows/GitHub%20Actions%20CI/badge.svg)](https://github.com/github/opensource.guide/actions)

Open Source Guides (https://opensource.guide/) are a collection of resources for individuals, communities, and companies who want to learn how to run and contribute to an open-source project.

## Background
Open Source Guides were created and are curated by GitHub, along with input from outside community reviewers, but they are not exclusive to GitHub products. One reason we started this project is that we felt that there weren't enough resources for people creating open-source projects.

Our goal was to aggregate community best practices, *not* what GitHub (or any other individual or entity) thinks is best. Therefore, we used examples and quotations from others to illustrate our points.

## Contributing

This site is powered by [Jekyll](https://jekyllrb.com/). Check out our [contributing guidelines](/CONTRIBUT

# Tokenizer

## Own Tokenizer

Trains a Byte Pair Encoding (BPE) tokenizer on the corpus with SentencePiece.

In [None]:
import sentencepiece as spm
from pathlib import Path

def train_tokenizer(
    corpus_path="data/corpus.txt",
    model_prefix="slingerra_tok",
    vocab_size=2000,
):
    corpus_path = Path(corpus_path)
    assert corpus_path.exists() and corpus_path.stat().st_size > 0, "Corpus missing or empty"

    spm.SentencePieceTrainer.train(
        input=str(corpus_path),
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type="bpe",
        byte_fallback=True,
        character_coverage=1.0,
        normalization_rule_name="identity"
    )

    print(f"DONE → {model_prefix}.model / {model_prefix}.vocab")

# usage:
train_tokenizer()

DONE → slingerra_tok.model / slingerra_tok.vocab
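
The training call above leaves the merging itself to SentencePiece. To show what BPE does in principle, here is a toy sketch of the merge-learning loop in plain Python. It is an illustration under simplifying assumptions, not SentencePiece's implementation: it works on a hypothetical list of words rather than raw sentences, ignores the `▁` whitespace marker and byte fallback, and breaks ties by taking the first pair seen. The function name `learn_bpe_merges` and the sample words are made up for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe_merges(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges BPE merges from a list of words (toy version)."""
    # Each distinct word starts as a tuple of single characters, with its count.
    vocab: Dict[Tuple[str, ...], int] = Counter(tuple(w) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties resolve to the first pair seen
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab: Dict[Tuple[str, ...], int] = Counter()
        for symbols, freq in vocab.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical mini-corpus, chosen only to make the merges easy to follow.
merges = learn_bpe_merges(["low", "low", "lower", "newest", "newest", "widest"], 3)
print(merges)
```

Each merge adds one symbol to the vocabulary, so `vocab_size` in the real trainer effectively limits how many merges are learned.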


### Import Tokenizer

In [None]:
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("slingerra_tok.model")

True

### Test Tokenizer

In [None]:
text = "Senior Python engineer with AWS and Docker experience"

ids = sp.encode(text, out_type=int)
tokens = sp.encode(text, out_type=str)

print(ids)
print(tokens)

decoded = sp.decode(ids)
print(decoded)


[343, 281, 1649, 414, 1949, 259, 284, 885, 1941, 604, 262, 1044, 809, 1980, 1967, 301, 351, 491, 1937, 1611, 281, 1352]
['▁S', 'en', 'ior', '▁P', 'y', 'th', 'on', '▁en', 'g', 'ine', 'er', '▁with', '▁A', 'W', 'S', '▁and', '▁Docker', '▁ex', 'p', 'eri', 'en', 'ce']
Senior Python engineer with AWS and Docker experience
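
The output above splits most words into several pieces, which is expected from a 2000-entry vocabulary trained on a small corpus. A rough way to quantify this is tokens per word. The sketch below regroups the pieces printed above by treating each piece that starts with `▁` as the beginning of a new word. The helper name `group_pieces` is hypothetical, and the piece list is copied from the output rather than recomputed.

```python
from typing import List

# Pieces copied verbatim from the tokenizer output above.
pieces = ['▁S', 'en', 'ior', '▁P', 'y', 'th', 'on', '▁en', 'g', 'ine', 'er',
          '▁with', '▁A', 'W', 'S', '▁and', '▁Docker', '▁ex', 'p', 'eri', 'en', 'ce']


def group_pieces(pieces: List[str], marker: str = "▁") -> List[List[str]]:
    """Group pieces into words; a piece starting with the marker opens a new word."""
    words: List[List[str]] = []
    for piece in pieces:
        if piece.startswith(marker) or not words:
            words.append([piece])
        else:
            words[-1].append(piece)
    return words


words = group_pieces(pieces)
tokens_per_word = len(pieces) / len(words)
for group in words:
    # Show each reconstructed word with the number of pieces it needed.
    print("".join(group).lstrip("▁"), len(group))
print(f"{len(pieces)} tokens / {len(words)} words = {tokens_per_word:.2f} tokens per word")
```

Only `with`, `and`, and `Docker` survive as single pieces here, which fits a corpus built from two README files.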


## (Prod) GPT2

In [None]:
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

text = "Senior Python engineer with AWS and Docker experience welding machine and aerospace engieenering based in Ukraine AWS Lambda"
ids = tokenizer.encode(text)
decoded = tokenizer.decode(ids)

print(ids)
print(tokenizer.tokenize(text))
print(decoded)

[31224, 11361, 11949, 351, 30865, 290, 25716, 1998, 47973, 4572, 290, 40439, 1786, 494, 877, 278, 1912, 287, 7049, 30865, 21114, 6814]
['Senior', 'ĠPython', 'Ġengineer', 'Ġwith', 'ĠAWS', 'Ġand', 'ĠDocker', 'Ġexperience', 'Ġwelding', 'Ġmachine', 'Ġand', 'Ġaerospace', 'Ġeng', 'ie', 'ener', 'ing', 'Ġbased', 'Ġin', 'ĠUkraine', 'ĠAWS', 'ĠLamb', 'da']
Senior Python engineer with AWS and Docker experience welding machine and aerospace engieenering based in Ukraine AWS Lambda
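
In GPT-2's byte-level BPE, `Ġ` marks a token that begins with a space, so tokens without it continue the previous word. The sketch below uses that convention to list the words from the output above that were broken into more than one token. The helper name `split_words` is hypothetical, and the token list is copied from the output rather than recomputed.

```python
from typing import List

# Tokens copied verbatim from the GPT-2 output above.
tokens = ['Senior', 'ĠPython', 'Ġengineer', 'Ġwith', 'ĠAWS', 'Ġand', 'ĠDocker',
          'Ġexperience', 'Ġwelding', 'Ġmachine', 'Ġand', 'Ġaerospace', 'Ġeng',
          'ie', 'ener', 'ing', 'Ġbased', 'Ġin', 'ĠUkraine', 'ĠAWS', 'ĠLamb', 'da']


def split_words(tokens: List[str], marker: str = "Ġ") -> List[str]:
    """Return the words that were split into more than one token."""
    groups: List[List[str]] = []
    for tok in tokens:
        # A token with the space marker (or the very first token) starts a word.
        if tok.startswith(marker) or not groups:
            groups.append([tok])
        else:
            groups[-1].append(tok)
    return ["".join(g).lstrip(marker) for g in groups if len(g) > 1]


fragmented = split_words(tokens)
print(fragmented)
```

For this sentence only the misspelled `engieenering` and `Lambda` are fragmented. Every other word maps to a single GPT-2 token, in contrast to the custom tokenizer's output for the shorter test sentence.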
