<a href="https://colab.research.google.com/github/Eya-Laouini/LLMs-Text-Generation-with-GPT-2/blob/main/LLMs%5BText_Generation_with_GPT2%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Large Language Models - Text Generation with GPT2**

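The command '!pip install transformers' installs the transformers library. The leading ! is notebook syntax (Jupyter, Colab) that hands the rest of the line to the system shell. It is not valid in a plain Python script; in a terminal, run pip install transformers without the !.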


transformers: This is the name of the package you want to install. The transformers library, created by Hugging Face, provides a collection of state-of-the-art machine learning models specifically designed for natural language processing (NLP) tasks. It includes pre-trained models like BERT, GPT-2, T5, and others, which can be used for a variety of NLP tasks like text classification, translation, summarization, and more.
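
Before running the rest of the notebook, it can help to confirm that the package is importable. The sketch below uses only the standard library; the helper name is_installed is hypothetical and is not part of transformers.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


# 'json' ships with Python, so this check is always True.
print(is_installed("json"))
# Whether this prints True depends on your environment.
print(is_installed("transformers"))
```

If the second check prints False, run the install cell below and try again.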

In [None]:
!pip install transformers



The next cell imports the libraries and classes used in the rest of the notebook.

**import tensorflow as tf**: This imports the TensorFlow library into your Python environment under the alias tf. TensorFlow is a popular open-source library developed by Google for numerical computation and machine learning. It's widely used for creating deep learning models, especially neural networks.

**from transformers import GPT2LMHeadModel, GPT2Tokenizer**: This line has two parts.

from transformers: This specifies that you are importing from the transformers library. As mentioned earlier, transformers is a library that provides many state-of-the-art machine learning models, particularly for natural language processing tasks.

import GPT2LMHeadModel, GPT2Tokenizer: This imports two specific components from the transformers library:

GPT2LMHeadModel: This is the model class for GPT-2 (Generative Pre-trained Transformer 2), a powerful language model that can generate human-like text. The 'LMHead' part indicates that this model variant includes a language modeling head on top of the base GPT-2 model, which makes it suitable for tasks like text generation.

GPT2Tokenizer: This is a tokenizer specifically designed for the GPT-2 model. A tokenizer is used to convert text into a format that can be understood and processed by the model. For GPT-2, this involves converting text into tokens (like words or subwords) and encoding these tokens as numerical values.

With these imports in place, the notebook can load the GPT-2 model and its tokenizer. Note that GPT2LMHeadModel is the PyTorch implementation of the model, and the later cells request PyTorch tensors (return_tensors='pt'), so the tensorflow import is not actually used by the code in this notebook.

In [None]:
import tensorflow as tf
from transformers import GPT2LMHeadModel, GPT2Tokenizer

The line tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large") loads the tokenizer that goes with the GPT-2 model used in this notebook.

GPT2Tokenizer: This refers to the class GPT2Tokenizer from the transformers library. A tokenizer is used in natural language processing (NLP) to convert text into a format that a machine learning model can understand. Specifically, it breaks down the text into smaller units called tokens (like words or subwords) and then converts these tokens into numbers that the model can process.

from_pretrained: This is a class method of GPT2Tokenizer. It loads a tokenizer whose vocabulary has already been built, rather than creating one from scratch. On first use it downloads the tokenizer files (vocab.json, merges.txt, and so on) and caches them locally.

"gpt2-large": This is a string argument to the from_pretrained method. It names the model repository to load the tokenizer from. The gpt2-large model is a larger version of the GPT-2 model, which means it has more parameters and can potentially generate more coherent and contextually relevant text than the smaller versions. The tokenizer itself does not depend on model size: all GPT-2 variants share the same vocabulary, so using the same name for the tokenizer and the model simply keeps the two consistent.

tokenizer: This is the variable name assigned to the initialized tokenizer. Once the tokenizer is loaded and assigned to this variable, you can use tokenizer in your code to tokenize text in a way that's compatible with the gpt2-large model.
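
To make the text-to-numbers step concrete, here is a minimal sketch of a tokenizer in plain Python. It is an illustration under loud assumptions: the vocabulary and IDs are invented, and the greedy longest-match rule is a simplification. The real GPT-2 tokenizer uses byte-level byte-pair encoding with learned merge rules.

```python
# Hypothetical toy vocabulary; these pieces and IDs are NOT GPT-2's.
TOY_VOCAB = {"what": 0, " is": 1, " LL": 2, "MS": 3, "<unk>": 4}


def toy_encode(text):
    """Split text into the longest known pieces and return their IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        match = None
        # Try the longest substring starting at pos first.
        for end in range(len(text), pos, -1):
            if text[pos:end] in TOY_VOCAB:
                match = text[pos:end]
                break
        if match is None:
            # Unknown character: emit the <unk> ID and move on.
            ids.append(TOY_VOCAB["<unk>"])
            pos += 1
        else:
            ids.append(TOY_VOCAB[match])
            pos += len(match)
    return ids


print(toy_encode("what is LLMS"))  # [0, 1, 2, 3]
```

Note how a leading space is part of the piece (" is", " LL"), and how an uncommon word is split into several pieces rather than mapped to a single ID.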

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

This line of code is for loading a pre-trained model from the transformers library, specifically a large variant of the GPT-2 model with a specified padding token.

GPT2LMHeadModel: This refers to the class GPT2LMHeadModel in the transformers library. The GPT2LMHeadModel is a variant of the GPT-2 model that includes a language modeling head on top of the base GPT-2 architecture. This head allows the model to perform language modeling tasks, such as text generation.

from_pretrained("gpt2-large"): This method is used to load a pre-trained version of the GPT-2 model.

"gpt2-large": This string specifies the size of the GPT-2 model you want to load. gpt2-large is a larger variant of GPT-2, meaning it has more parameters than the base model (gpt2) and is capable of more complex language understanding and generation, but also requires more computational resources.
pad_token_id=tokenizer.eos_token_id: This is an additional argument passed to the from_pretrained method.

pad_token_id: This parameter sets the ID of the token the model uses for padding. Padding fills shorter sequences so that every sequence in a batch has the same length as the longest one.
tokenizer.eos_token_id: This is the ID of the tokenizer's end-of-sequence token, <|endoftext|>. GPT-2 does not define a dedicated padding token, so the end-of-sequence token is reused for padding. Setting it explicitly here also avoids the notice that generate otherwise prints when it has to fall back to the end-of-sequence token on its own.
model: This is the variable to which the loaded model is assigned. After this line of code, the model variable refers to the gpt2-large model, configured with the padding token ID described above.
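
The effect of padding can be sketched without loading the model. The helper pad_batch below is hypothetical (it is not a transformers function); it pads on the right with the <|endoftext|> ID, 50256, and builds a mask that marks real tokens with 1 and padding with 0.

```python
EOS_TOKEN_ID = 50256  # ID of <|endoftext|> in the GPT-2 vocabulary


def pad_batch(sequences, pad_token_id):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return padded, attention_mask


batch = [[10919, 318, 27140, 5653], [10919, 318]]
padded, mask = pad_batch(batch, EOS_TOKEN_ID)
print(padded)  # [[10919, 318, 27140, 5653], [10919, 318, 50256, 50256]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

This notebook only ever passes a single prompt, so no padding is actually added; the setting matters once several prompts of different lengths are batched together.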


In [None]:
model = GPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)

model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [None]:
tokenizer

GPT2Tokenizer(name_or_path='gpt2-large', vocab_size=50257, model_max_length=1024, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}

Add the prompt

In [None]:
sentence = 'what is LLMS'

The next cell converts the prompt into a format that can be processed by a machine learning model, specifically a GPT-2 model.

tokenizer.encode(sentence, return_tensors='pt'):

tokenizer: This is the GPT-2 tokenizer you initialized earlier. It's used to convert text into a numerical format that the GPT-2 model can understand.
encode: This is a method of the tokenizer. It takes a string of text (in this case, the variable sentence) and converts it into a list of numerical tokens. Each token corresponds to a word or a part of a word.
sentence: This is the text that you want to encode. It should be a string variable containing the sentence or text input you wish to process.
return_tensors='pt': This argument tells the tokenizer to return the tokens in a format suitable for PyTorch (indicated by 'pt'). PyTorch is a popular deep learning framework, and this format is a tensor, which is essentially a multi-dimensional array used in PyTorch for model input.
input_ids = ...:

This part of the code is assigning the output of the tokenizer.encode method to a variable named input_ids.
input_ids now contains the encoded version of your input sentence, formatted as a tensor suitable for input into a PyTorch model.

In [None]:
input_ids = tokenizer.encode(sentence, return_tensors='pt')
input_ids

tensor([[10919,   318, 27140,  5653]])

The output tensor([[10919, 318, 27140, 5653]]) represents a PyTorch tensor containing a sequence of numerical tokens. These tokens are the result of processing a text sentence through a tokenizer, specifically the GPT-2 tokenizer in your case. Here's what each part of this output means:

tensor: This indicates that the data structure is a PyTorch tensor. Tensors are multi-dimensional arrays used in PyTorch, a popular library for machine learning and deep learning.

[[10919, 318, 27140, 5653]]: Inside the tensor are the numerical tokens. Each number here is a token ID, representing a specific word or subword as understood by the GPT-2 model.

The tokenizer has a vocabulary of tokens (words, parts of words, symbols, etc.), and each unique piece of text is assigned a specific number. For example, the word "hello" might be tokenized into the number 1256, "world" might become 5678, and so on.
The sequence 10919, 318, 27140, 5653 corresponds to the tokenized form of your input sentence. Each number maps to a word or part of a word in that sentence.
The double brackets [[...]] indicate that this is a 2-dimensional tensor of shape (1, 4): one row per input sequence (here a single sentence) and one column per token (here four tokens).
Contextual Meaning: The numerical values themselves (e.g., 10919, 318) don't have inherent meaning without the context of the tokenizer's vocabulary. To understand what text each number represents, you would need to look up these IDs in the tokenizer's vocabulary.

Usage: This tensor format is what you would typically feed into a machine learning model like GPT-2 for it to perform tasks like text generation, classification, or answering questions. The model reads these numbers and uses its trained neural network to interpret them and generate appropriate outputs based on its training.

In summary, tensor([[10919, 318, 27140, 5653]]) is a PyTorch tensor containing a tokenized version of a text sentence, ready to be input into a machine learning model for further processing.
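
The batch-by-tokens layout can be checked with an ordinary nested list that mirrors the tensor above. This is a plain-Python sketch; on the real tensor, input_ids.shape reports the same two sizes.

```python
# Same values as the tensor printed above, as a nested Python list.
input_ids = [[10919, 318, 27140, 5653]]

batch_size = len(input_ids)          # number of sequences (rows)
sequence_length = len(input_ids[0])  # number of tokens in the first row
first_sequence = input_ids[0]        # drops the outer (batch) dimension

print(batch_size, sequence_length)  # 1 4
print(first_sequence)               # [10919, 318, 27140, 5653]
```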

The next cell generates text from the prompt using the pre-trained GPT-2 model.

model.generate(input_ids, ...):

model: This is your pre-trained GPT-2 model, loaded earlier in your code.
generate: This is a method of the GPT-2 model that's used for text generation.
input_ids: This is the tokenized input text you've prepared. It acts as a prompt or context for the generated text.
Parameters of generate Method:

max_length=50: This parameter sets the maximum length, in tokens, of the sequence to be generated. The value 50 includes both your input text (the context) and the new text that the model will generate. With the 4-token prompt used here, that leaves room for at most 46 new tokens. The model stops generating additional tokens once this length is reached.
num_beams=5: This enables beam search with 5 beams. Beam search is a text generation technique in which the model considers multiple possible next tokens at each step and keeps only the most promising partial sequences (or "beams"). Using 5 beams means the model keeps track of 5 candidate sequences at each step, which can lead to higher-quality output but requires more computation.
no_repeat_ngram_size=2: This setting prevents any n-gram (in this case, any sequence of 2 consecutive tokens) from appearing more than once in the output. It helps reduce repetitiveness in the generated text.
early_stopping=True: This parameter controls when beam search stops. With True, generation ends as soon as num_beams complete candidate sequences have been found, instead of continuing to search for better ones. This can make generation faster, at the cost of possibly missing a higher-scoring sequence.
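
The parameters above can be illustrated with a small, self-contained beam search. This is a simplified sketch, not the transformers implementation: the probability table TOY_MODEL is invented, the functions are hypothetical, the toy model looks only at the previous token, and the early_stopping heuristic is left out (the search runs until every beam has ended or hit max_length).

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
# A real model conditions on the whole sequence, not just the last token.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "the": 0.3, "<eos>": 0.2},
    "a": {"cat": 0.9, "<eos>": 0.1},
    "cat": {"the": 0.5, "<eos>": 0.5},
}


def repeats_ngram(tokens, n):
    """Return True if the last n tokens already appear earlier in tokens."""
    if n <= 0:
        return False
    last = tuple(tokens[-n:])
    earlier = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n)]
    return last in earlier


def beam_search(start, num_beams, max_length, no_repeat_ngram_size=0):
    """Return the best (tokens, log_probability) pair found by beam search."""
    beams = [([start], 0.0)]
    finished = []
    while beams:
        candidates = []
        for tokens, score in beams:
            for next_token, prob in TOY_MODEL[tokens[-1]].items():
                extended = tokens + [next_token]
                if repeats_ngram(extended, no_repeat_ngram_size):
                    continue  # skip continuations that repeat an n-gram
                candidates.append((extended, score + math.log(prob)))
        # Keep only the num_beams highest-scoring continuations.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "<eos>" or len(tokens) >= max_length:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
    return max(finished, key=lambda c: c[1]) if finished else None


best_tokens, best_score = beam_search(
    "<s>", num_beams=2, max_length=6, no_repeat_ngram_size=2
)
print(best_tokens, round(math.exp(best_score), 2))
```

With this table the search returns ['<s>', 'a', 'cat', '<eos>'] with probability 0.18, even though "the" is the more likely first token (0.6 against 0.4). Keeping a second beam alive is what lets the search find the sequence that is better overall.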


When you execute output, it displays the generated token IDs. These are the numerical representations of the generated text. To convert these back into human-readable text, you would use the tokenizer's decode method (e.g., tokenizer.decode(output[0])).


In [None]:
# Generate text until the output length (which includes the prompt) reaches 50 tokens
output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, early_stopping=True)
output

tensor([[10919,   318, 27140,  5653,    30,   198,   198,  3069,  5653,   318,
           281,  1280,    12, 10459,    11,  3272,    12, 24254,  3788,  2478,
          2858,   357, 10305,    42,     8,   329,  2615,    11,  4856,    11,
           290, 29682,  3788,    13, 27140, 15996,   318,   257, 17050,   290,
          2792,   263,   326,   318,   973,   284,  1382,   290,  1057, 27140]])

The line of code print(tokenizer.decode(output[0], skip_special_tokens=True)) is for converting the generated output from the GPT-2 model into human-readable text.

tokenizer.decode(...):

tokenizer: This refers to the GPT-2 tokenizer you initialized earlier. The tokenizer not only converts text to tokens (numerical representations) but can also do the reverse – converting tokens back to text.
decode: This is a method of the tokenizer that translates the tokenized output back into a string of text.
output[0]: The output is the tensor containing the generated token IDs from the GPT-2 model. It is a 2-dimensional tensor with one row per generated sequence, so output[0] selects the first (and here the only) sequence. This is the tokenized text that you want to decode.
skip_special_tokens=True: This argument tells the decoder to ignore special tokens like padding tokens or end-of-sequence tokens that are used for model processing but are not meaningful in the final generated text.
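
The effect of skip_special_tokens can be sketched with a toy decoder. Everything here is hypothetical: the ID-to-text table is invented for illustration and toy_decode is not the transformers method, but the filtering step is the same idea.

```python
# Hypothetical toy table; these IDs are NOT GPT-2's real IDs.
TOY_ID_TO_PIECE = {0: "what", 1: " is", 2: " LL", 3: "MS", 4: "<|endoftext|>"}
SPECIAL_IDS = {4}  # IDs that carry no readable text


def toy_decode(ids, skip_special_tokens=False):
    """Join the text pieces for ids, optionally dropping special tokens."""
    pieces = [
        TOY_ID_TO_PIECE[i]
        for i in ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return "".join(pieces)


generated = [0, 1, 2, 3, 4]
print(toy_decode(generated))                            # what is LLMS<|endoftext|>
print(toy_decode(generated, skip_special_tokens=True))  # what is LLMS
```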


In [None]:
print(tokenizer.decode(output[0], skip_special_tokens=True))

what is LLMS?

LLMS is an open-source, cross-platform software development environment (SDK) for building, testing, and deploying software. LLVM is a compiler and linker that is used to build and run LL
