# This is the beginning of Cloaky-LM

## Setting up the libraries

1. torch: This is PyTorch, the fundamental deep-learning framework we will use. Think of it as the engine and raw materials (like steel and circuits) for our model.

2. transformers: From Hugging Face, this is the most important library for our project. It provides pre-built Transformer model architectures (such as GPT-2) and high-level tools, including a Trainer class that will manage our training loop for us. It's our master blueprint and toolbox.

3. datasets: Also from Hugging Face, this library makes it incredibly simple to download, load, and process the vast amounts of text data our model needs to learn from.

4. tokenizers: An efficient library for the crucial step of tokenization: splitting our text into tokens and mapping them to the integer IDs that the model can understand.

5. accelerate: A helper library that works with transformers to place the model and data on whatever hardware we have (like the T4 GPU in Colab), so the same training code runs efficiently without device-specific changes.


In [None]:
!pip install transformers datasets tokenizers torch accelerate
!pip install --upgrade datasets
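
After the install cell finishes, it can be useful to confirm that all five libraries are actually present and to record their versions. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_versions` is made up for this illustration and is not part of any of the packages.

```python
from importlib import metadata

# The packages installed by the cell above.
REQUIRED = ["torch", "transformers", "datasets", "tokenizers", "accelerate"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<14}{version or 'NOT INSTALLED'}")
```

If any line shows NOT INSTALLED, re-running the install cell (or restarting the runtime) is the usual first step.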

In [None]:


from datasets import load_dataset

# Download and load the wikitext-2-raw-v1 configuration of the WikiText dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
print(dataset)


In [None]:
# Look at one raw example from the training split (index 9, i.e. the 10th example)
print(dataset["train"][9]['text'])

## Tokenising the dataset, with EOS as the padding token

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer of the 'gpt2' model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 was trained without a padding token, so its tokenizer has none by default.
# A common workaround is to reuse the end-of-sequence (EOS) token, <|endoftext|>,
# as the padding token, which is what we do here.
tokenizer.pad_token = tokenizer.eos_token


def tokenize_function(examples):
    """Convert a batch of texts to 'input_ids' (plus an 'attention_mask'), cut to at most 128 tokens each."""
    return tokenizer(examples["text"], truncation=True, max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
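
Note that the cell above only registers a padding token; since the tokenizer is called without a padding argument, nothing is padded yet. Padding typically happens later, when examples of different lengths are stacked into one batch. The toy sketch below shows what padding with the EOS id looks like and why an attention mask accompanies it. It is plain Python for illustration, not the tokenizer's own implementation; `pad_batch` is a hypothetical helper, the input ids are made up, and right-padding is assumed.

```python
EOS_ID = 50256  # id of GPT-2's <|endoftext|> token, reused here as the pad id


def pad_batch(sequences, pad_id=EOS_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[10, 11, 12], [20]])  # toy ids, not real tokenizer output
print(batch["input_ids"])       # [[10, 11, 12], [20, 50256, 50256]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Because the pad id and the EOS id are the same number, the attention mask is what distinguishes padding from a genuine end-of-text token.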





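*   We loaded a tokenizer that already knows GPT-2's vocabulary of 50,257 byte-pair-encoding (BPE) tokens.
*   We used .map() with batched=True to apply this function to every split of our dataset (train, validation, test), many examples at a time.
*   We passed remove_columns=["text"] because once we have the input_ids, we no longer need the original raw text.
*   With truncation=True and max_length=128, any example longer than 128 tokens is cut off at 128; shorter examples are left as they are.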
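
To make the points above concrete, here is a toy, pure-Python imitation of what a batched map does. It is only a sketch: `toy_tokenize` and `toy_map` are hypothetical stand-ins (one id per whitespace-separated word, a batch size of 2), not the real GPT-2 tokenizer or the datasets implementation. The part worth noticing is the shape of the data: with batched=True the function receives a dict of lists rather than one example at a time.

```python
VOCAB = {}  # toy vocabulary, filled in as new words are seen


def toy_tokenize(examples, max_length=4):
    """Hypothetical stand-in for tokenize_function: one id per word, then truncate."""
    batch_ids = []
    for text in examples["text"]:
        ids = [VOCAB.setdefault(word, len(VOCAB)) for word in text.split()]
        batch_ids.append(ids[:max_length])  # like truncation=True with max_length
    return {"input_ids": batch_ids}


def toy_map(rows, fn, batch_size=2):
    """Simplified imitation of dataset.map(fn, batched=True, remove_columns=["text"])."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # The function sees a dict of lists, not a list of dicts.
        columns = {"text": [row["text"] for row in chunk]}
        result = fn(columns)
        # Only the new column is kept; the raw "text" column is dropped.
        out.extend({"input_ids": ids} for ids in result["input_ids"])
    return out


rows = [{"text": "the cat sat on the mat"}, {"text": ""}, {"text": "the cat"}]
print(toy_map(rows, toy_tokenize))
# [{'input_ids': [0, 1, 2, 3]}, {'input_ids': []}, {'input_ids': [0, 1]}]
```

The first example has six words but is truncated to four ids, and the empty string produces an empty list, which is also how blank lines in a raw text dataset end up after tokenization.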


In [None]:
# Print the first 20 token IDs for the 10th example.
print(tokenized_datasets["train"][9]['input_ids'][:20])
