
In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the GPT-2 tokenizer and model
model_name = "gpt2"  # You can use other models like "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 ships without a padding token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

# Open and read the text file (explicit UTF-8, since the text contains curly quotes)
with open("/content/Maran.txt", "r", encoding="utf-8") as book:
    content = book.read()

# Print length of the content
print("Length of the content:", len(content))

# Split the content into sentences (based on full stops)
text_split = content.split(".")
print(f"Text split into {len(text_split)} sentences")

# Clean up the text by stripping extra spaces from each sentence
text_split = [line.strip() for line in text_split if line.strip()]
print(text_split[:5])  # Print the first 5 cleaned sentences to verify

# Tokenize all sentences as one batch, padding each to the longest sentence in the batch
inputs = tokenizer(text_split, return_tensors="pt", padding=True, truncation=True)

# The tokenizer already returns the attention mask (1 = real token, 0 = padding)
attention_mask = inputs['attention_mask']

print(f"Input shape: {inputs['input_ids'].shape}")
print(f"Attention mask shape: {attention_mask.shape}")

Length of the content: 2505
Text split into 17 sentences
['Maran’s Vision for the Future:\nProfessional Ambition\nWith 3 years of experience in HCL Technologies, Maran is on a mission to accelerate his career in AI', 'He envisions running a variety of cutting-edge automation and AI projects, driving innovation that pushes the boundaries of what’s possible', 'His passion for AI spans from machine learning to natural language processing (NLP), and he is constantly seeking opportunities to build intelligent systems that will revolutionize industries', 'Maran isn’t just looking for a job—he wants to make an impact', 'Whether it’s designing smart automation systems, or pioneering AI-driven solutions, he dreams of working on high-impact projects that not only challenge him but also allow him to shape the future of technology']
Input shape: torch.Size([17, 47])
Attention mask shape: torch.Size([17, 47])
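
The two shapes above match because the tokenizer pads every sentence to the longest one in the batch and marks the padding in the attention mask. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation. The helper name `pad_batch` and the toy token ids are made up for illustration; 50256 is GPT-2's EOS id, which the cell above reuses as the pad id.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build a matching 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return input_ids, attention_mask


# Toy token ids (hypothetical, not real tokenizer output)
batch = [[10, 11, 12], [20], [30, 31]]
ids, mask = pad_batch(batch, pad_id=50256)
print(ids)
print(mask)
```

Every row ends up with the same length, which is what lets the batch be stacked into a single tensor such as the `[17, 47]` one printed above.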


In [None]:
def generate_text_transformer(seed_text, num_words_to_generate, model, tokenizer):
    """Continue seed_text by up to num_words_to_generate tokens (tokens, not words)."""
    # Tokenize the input seed text and convert it to a tensor
    input_ids = tokenizer.encode(seed_text, return_tensors="pt")

    # Create the attention mask (a single unpadded sequence, so every token is attended to)
    attention_mask = torch.ones_like(input_ids)

    # Generate text using the model
    output = model.generate(
        input_ids=input_ids,  # Input seed tokens
        attention_mask=attention_mask,  # Mask for the seed text
        max_new_tokens=num_words_to_generate,  # Upper bound on newly generated tokens
        num_return_sequences=1,  # Generate one sequence
        temperature=0.7,  # Below 1.0 makes sampling less random
        top_p=0.9,  # Nucleus sampling: sample from the most probable tokens covering 90% of the mass
        do_sample=True,  # Sampling mode (not greedy decoding)
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS for padding
    )

    # Decode the generated token IDs back to text
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

# Input seed text for generation
seed_text = "Maran’s Professional Ambition"
# seed_text = content[:1000]
num_words_to_generate = 50
# num_words_to_generate = 25  # Number of words to generate

# Generate text based on the seed text
generated_text = generate_text_transformer(seed_text, num_words_to_generate, model, tokenizer)

# Display generated text
print("\nGenerated Text:")
print(generated_text)


Generated Text:
torch.Size([17, 47])
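
The `temperature` and `top_p` arguments in the cell above control how the next token is drawn. The sketch below illustrates the general technique in plain Python on a toy vocabulary; it is not the library's implementation, and the function names and logits are invented for the example.

```python
import math
import random
from typing import Dict, List, Tuple


def top_p_filter(logits: Dict[str, float], temperature: float, top_p: float) -> List[Tuple[str, float]]:
    """Return the renormalised (token, probability) pairs kept by nucleus sampling."""
    # Temperature scaling followed by a numerically stable softmax
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)

    # Keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1
    mass = sum(prob for _, prob in kept)
    return [(tok, prob / mass) for tok, prob in kept]


def sample_token(logits: Dict[str, float], temperature: float = 0.7,
                 top_p: float = 0.9, rng=random) -> str:
    """Draw one token from the filtered distribution."""
    tokens, probs = zip(*top_p_filter(logits, temperature, top_p))
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical logits for four candidate next tokens
toy_logits = {"AI": 2.0, "data": 1.0, "cats": -1.0, "the": 0.5}
kept = top_p_filter(toy_logits, temperature=0.7, top_p=0.9)
print(kept)
print(sample_token(toy_logits, rng=random.Random(0)))
```

With these toy values only the two most probable tokens survive the filter, so low-probability candidates can never be sampled.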


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the GPT-2 model and tokenizer
model_name = "gpt2"  # You can use other models like "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 ships without a padding token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

# Read the source text from the .txt file (explicit UTF-8, since the text contains curly quotes)
with open("/content/Maran.txt", "r", encoding="utf-8") as book:
    content = book.read()

# Function to generate an answer from a question plus the file content
def generate_answer_from_content(question, content, model, tokenizer):
    # Combine the question with the content for context
    prompt = f"Question: {question}\nContent: {content}\nAnswer:"

    # Tokenize the prompt; calling the tokenizer also returns the attention mask
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_length = inputs["input_ids"].shape[1]

    # Sample up to 100 new tokens after the prompt
    output = model.generate(
        input_ids=inputs["input_ids"],            # Prompt tokens
        attention_mask=inputs["attention_mask"],  # Mask for the prompt
        max_new_tokens=100,                       # Upper bound on newly generated tokens
        num_return_sequences=1,                   # Generate one sequence
        temperature=0.7,                          # Below 1.0 makes sampling less random
        top_p=0.9,                                # Nucleus sampling
        do_sample=True,                           # Sampling mode (not greedy decoding)
        pad_token_id=tokenizer.eos_token_id,      # GPT-2 has no pad token; reuse EOS
    )

    # Decode only the newly generated tokens. Splitting the full text on "Answer:"
    # would break if the content itself contained that string.
    answer = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True).strip()
    return answer

# Example: Asking a question about Maran's profession
question = "What is Maran's profession?"

# Generate an answer based on the question and content
generated_answer = generate_answer_from_content(question, content, model, tokenizer)

# Display the result
print("\nGenerated Answer:")
print(generated_answer)


Generated Answer:
I want to share my dream home with you!

I am currently working on a project with a group of young professionals. They are currently looking to expand their workforce by creating a new team that will be able to provide a more sustainable workplace for their young clients. I hope to work with them to develop a new approach to building a sustainable business.

I am looking for people to join my team.

Please share this dream with your friends and family.

Please join me
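
The prompt in the cell above embeds the whole file, and GPT-2 can only attend to 1024 tokens in total, prompt and generated tokens combined. A longer file would overflow that window. One possible way to stay within it is to trim only the content part of the prompt, as sketched below on plain lists of token ids. The helper name `build_prompt_ids` and the toy ids are hypothetical, and tokenizing the three pieces separately may split text slightly differently at the boundaries than tokenizing the full prompt at once.

```python
from typing import List

GPT2_CONTEXT = 1024  # GPT-2's maximum sequence length


def build_prompt_ids(prefix_ids: List[int], content_ids: List[int], suffix_ids: List[int],
                     max_new_tokens: int, context: int = GPT2_CONTEXT) -> List[int]:
    """Trim content_ids so that prefix + content + suffix + new tokens fit in the context."""
    budget = context - max_new_tokens - len(prefix_ids) - len(suffix_ids)
    if budget < 0:
        raise ValueError("Question and answer marker alone exceed the context window")
    # Keep the question (prefix) and the "Answer:" marker (suffix) intact
    return prefix_ids + content_ids[:budget] + suffix_ids


# Toy ids standing in for the tokenized question, content, and "Answer:" marker
question_ids = [1, 2]
long_content_ids = list(range(100, 2100))  # 2000 tokens, too long on its own
answer_marker_ids = [9]

prompt_ids = build_prompt_ids(question_ids, long_content_ids, answer_marker_ids, max_new_tokens=100)
print(len(prompt_ids), len(prompt_ids) + 100)
```

Trimming the content rather than the start of the prompt keeps the question and the trailing answer marker in place.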


In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Input prompt (seed text)
seed_text = "The future of artificial intelligence is"

# Tokenize the input
input_ids = tokenizer.encode(seed_text, return_tensors="pt")

# Generate text
output = model.generate(
    input_ids=input_ids,  # Input text
    max_length=50,        # Maximum total length in tokens, prompt included
    temperature=0.7,      # Control randomness (lower = more focused)
    top_p=0.9,            # Nucleus sampling (focus on top tokens)
    do_sample=True,       # Enable sampling
    pad_token_id=tokenizer.eos_token_id
)

# Decode and print the output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("Generated Text:")
print(generated_text)


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Generated Text:
The future of artificial intelligence is likely to be a lot more complex than that of the human mind.

This is because we are in a fundamentally different era. We have a new kind of intelligence, and we are going to have a new kind
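
The attention-mask warning in the output above appears because `tokenizer.encode` returns only the token ids, so `generate` receives no mask. A minimal sketch of one way to avoid it, assuming the `model` and `tokenizer` objects loaded in the cell above: call the tokenizer directly and forward its `attention_mask`. The wrapper name `generate_with_mask` and its default of 40 new tokens are choices made for this example.

```python
def generate_with_mask(model, tokenizer, seed_text, max_new_tokens=40):
    """Generate a continuation while passing the tokenizer's attention mask explicitly."""
    # Calling the tokenizer (rather than tokenizer.encode) returns both
    # input_ids and attention_mask
    inputs = tokenizer(seed_text, return_tensors="pt")
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # Explicit mask, nothing left to infer
        max_new_tokens=max_new_tokens,            # Counts generated tokens only
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

It would be called as `print(generate_with_mask(model, tokenizer, seed_text))`. Because sampling is enabled, the generated text differs from run to run.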


In [2]:
from transformers import pipeline

# Load NER pipeline
ner_pipeline = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

# Input text
text = "Elon Musk is the CEO of SpaceX, which is headquartered in California."

# Perform NER
entities = ner_pipeline(text)
for entity in entities:
    print(f"{entity['word']} → {entity['entity']} ({entity['score']:.2f})")

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


El → I-PER (1.00)
##on → I-PER (1.00)
Mu → I-PER (1.00)
##sk → I-PER (1.00)
Space → I-ORG (1.00)
##X → I-ORG (1.00)
California → I-LOC (1.00)
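
The output above lists WordPiece fragments (`El`, `##on`) rather than whole names, because the pipeline reports one prediction per token by default. The pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens into entities. The sketch below shows the same idea by hand, using only the `word` and `entity` fields printed above. The helper name `merge_wordpieces` is made up, and the grouping is simplified: it joins neighbouring words that share a label without checking their positions in the text.

```python
from typing import Dict, List


def merge_wordpieces(tokens: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Glue '##' continuation pieces onto the previous word, then join
    neighbouring words that share an entity type."""
    words: List[Dict[str, str]] = []
    for tok in tokens:
        piece = tok["word"]
        label = tok["entity"].split("-")[-1]  # "I-PER" -> "PER"
        if piece.startswith("##") and words:
            words[-1]["word"] += piece[2:]    # continuation of the previous word
        else:
            words.append({"word": piece, "entity": label})

    entities: List[Dict[str, str]] = []
    for word in words:
        if entities and entities[-1]["entity"] == word["entity"]:
            entities[-1]["word"] += " " + word["word"]  # same label: extend the entity
        else:
            entities.append(dict(word))
    return entities


# Token-level predictions copied from the output above (scores omitted)
tokens = [
    {"word": "El", "entity": "I-PER"},
    {"word": "##on", "entity": "I-PER"},
    {"word": "Mu", "entity": "I-PER"},
    {"word": "##sk", "entity": "I-PER"},
    {"word": "Space", "entity": "I-ORG"},
    {"word": "##X", "entity": "I-ORG"},
    {"word": "California", "entity": "I-LOC"},
]

merged = merge_wordpieces(tokens)
for entity in merged:
    print(f"{entity['word']} → {entity['entity']}")
```

On this input the seven fragments collapse into three entities: a person, an organization, and a location.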
