“Building a Story Generator Through Text Generation Models”


Data Loading:
We use pandas to load and manipulate a dataset of book descriptions.

Data Preprocessing and Cleaning:
Removed HTML tags using BeautifulSoup.
Lowercased text, removed special characters, and normalized spaces.
(BeautifulSoup parses HTML and pulls out specific pieces such as text, links, or images. It is helpful when the data you want is embedded in HTML markup.)

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Text preprocessing
import re  # Regular expressions for searching, matching, and manipulating strings
from bs4 import BeautifulSoup  # For removing HTML tags
import nltk  # For tokenization and other NLP tasks
from nltk.tokenize import word_tokenize

# Download NLTK data (NLTK, the Natural Language Toolkit, provides easy access to resources needed for NLP tasks like tokenization and part-of-speech tagging)
nltk.download('punkt')  # Tokenizer data

# Machine Learning/Deep Learning
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# For saving/loading models and visualizing training
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

# Optional: Hugging Face Transformers for GPT-like models.
# The library provides pre-trained transformer models that are widely used in NLP tasks such as text generation.
!pip install transformers
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Now access your file
file_path = '/content/drive/MyDrive/Time traveler dataset mini pro.csv'

# You can use pandas or other libraries to read the file
import pandas as pd
data = pd.read_csv(file_path)

# Display the first few rows of the data
data.head()


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Unnamed: 0,Book_Title,Original_Book_Title,Author_Name,Edition_Language,Rating_score,Rating_votes,Review_number,Book_Description,Year_published,Genres,url
0,Outlander,Outlander,Diana Gabaldon,English,4.23,852563,46268,"The year is 1945. Claire Randall, a former com...",1991,"{'Historical (Historical Fiction) ': 11192, 'R...",https://www.goodreads.com/book/show/10964.Outl...
1,The Time Traveler's Wife,The Time Traveler's Wife,Audrey Niffenegger,English,3.98,1604511,47705,"A funny, often poignant tale of boy meets girl...",2003,"{'Fiction': 10255, 'Romance': 5903, 'Fantasy':...",https://www.goodreads.com/book/show/18619684-t...
2,11/22/63,Eleven twenty-two sixty-three,Stephen King,English,4.31,434183,38801,Jake Epping is a thirty-five-year-old high sch...,2011,"{'Fiction': 5134, 'Historical (Historical Fict...",https://www.goodreads.com/book/show/10644930-1...
3,Dragonfly in Amber,Dragonfly in Amber,Diana Gabaldon,English,4.32,294887,15200,From the author of Outlander... a magnificent ...,1992,"{'Historical (Historical Fiction) ': 6068, 'Ro...",https://www.goodreads.com/book/show/5364.Drago...
4,Rubinrot,Rubinrot,Kerstin Gier,German,4.09,118940,10146,"Manchmal ist es ein echtes Kreuz, in einer Fam...",2009,"{'Fantasy': 3411, 'Young Adult': 2650, 'Scienc...",https://www.goodreads.com/book/show/6325285-ru...


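Preprocess the Data:
Cleaning: Lowercase the text, strip HTML tags, remove every character other than a-z and whitespace, and collapse repeated spaces. Note that this regex also drops digits, hyphens, and accented letters, so "1945" disappears and "thirty-five-year-old" becomes "thirtyfiveyearold" in the output below.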

In [None]:
# Import required libraries
from bs4 import BeautifulSoup
import re
import pandas as pd

# Check for the presence of the column and handle NaN values
if 'Book_Description' in data.columns:
    data['Book_Description'] = data['Book_Description'].fillna("")  # Replace NaN with an empty string

    # Define the cleaning function
    def clean_text(text):
        # Check if the text is a string
        if isinstance(text, str):
            text = text.lower()  # Convert text to lowercase
            text = BeautifulSoup(text, "html.parser").get_text()  # Remove HTML tags
            text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic characters
            return ' '.join(text.split())  # Remove extra spaces
        else:
            return ""  # Return an empty string if the input is not a string

    # Apply the cleaning function to the 'Book_Description' column
    data['cleaned_description'] = data['Book_Description'].apply(clean_text)

    # Display the original and cleaned columns
    print(data[['Book_Description', 'cleaned_description']].head())
else:
    print("The 'Book_Description' column is not found in the dataset.")


                                    Book_Description  \
0  The year is 1945. Claire Randall, a former com...   
1  A funny, often poignant tale of boy meets girl...   
2  Jake Epping is a thirty-five-year-old high sch...   
3  From the author of Outlander... a magnificent ...   
4  Manchmal ist es ein echtes Kreuz, in einer Fam...   

                                 cleaned_description  
0  the year is claire randall a former combat nur...  
1  a funny often poignant tale of boy meets girl ...  
2  jake epping is a thirtyfiveyearold high school...  
3  from the author of outlander a magnificent epi...  
4  manchmal ist es ein echtes kreuz in einer fami...  


Tokenization & Embeddings:
You can tokenize the text using libraries like nltk or spaCy.

Tokenization:
Used nltk for splitting text into tokens.
(NLTK (Natural Language Toolkit) is a Python library for working with human language data. Splitting text into tokens means breaking it into smaller units such as words or punctuation marks.)
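
To make "splitting text into tokens" concrete, here is a minimal regex-based sketch. `simple_tokenize` is a hypothetical helper written only for illustration; it is not NLTK's algorithm, and `nltk.word_tokenize` handles many more cases (for example, it splits contractions and possessives differently).

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: a token is either a run of letters/digits
    # (optionally followed by an apostrophe suffix such as 's) or a
    # single punctuation character.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]", text)

tokens = simple_tokenize("Claire's journey begins in 1945, in Scotland.")
print(tokens)
```

The cleaned descriptions in this notebook contain only lowercase letters and spaces, so on that data a tokenizer mostly reduces to splitting on whitespace.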

In [None]:
import nltk
nltk.download('punkt')

# Tokenize the cleaned descriptions ('data' is the DataFrame cleaned above)
data['tokenized_descriptions'] = data['cleaned_description'].apply(nltk.word_tokenize)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


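Used spaCy to convert each description into a vector. (The small `en_core_web_sm` model does not ship with static word vectors, so the resulting `.vector` comes from the pipeline's context tensors; a larger model such as `en_core_web_md` includes real word vectors.)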

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

# Convert each cleaned description to a single document vector
data['spacy_vectors'] = data['cleaned_description'].apply(lambda x: nlp(x).vector)


Train the Text Generation Model
Model Choice:
You can use RNN (LSTM/GRU) models, Transformers, or GPT-like models for text generation.
Using LSTM/GRU for Text Generation:
Keras is a convenient library for building RNN-based models. The cells below load and prepare the data, then define a simple LSTM model:

In [None]:
import pandas as pd

# Load the dataset into the 'df' DataFrame
df = pd.read_csv('/content/Time traveler dataset mini pro.csv')

# Sample 10% of the dataset (fixed seed so the sample is reproducible)
df_sampled = df.sample(frac=0.1, random_state=42)

# Note: the cells below use the full 'df'; swap in df_sampled for faster experiments


TensorFlow is an open-source machine learning library for building and deploying machine learning models. The next cell checks whether a GPU is available to it.

In [None]:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))


Num GPUs Available:  0


In [None]:
import pandas as pd
import re
from bs4 import BeautifulSoup  # Make sure to import BeautifulSoup

# Load dataset
df = pd.read_csv('/content/Time traveler dataset mini pro.csv')

# Define cleaning function
def clean_text(text):
    if not isinstance(text, str):  # NaN or other non-string values become empty text
        return ""
    text = text.lower()  # Convert text to lowercase
    text = BeautifulSoup(text, "html.parser").get_text()  # Remove HTML tags
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetical characters
    return ' '.join(text.split())  # Collapse repeated whitespace

# Apply cleaning function to 'Book_Description' (or the relevant column)
df['cleaned_description'] = df['Book_Description'].apply(clean_text)

# Verify that the 'cleaned_description' column is created
print(df.head())


                 Book_Title            Original_Book_Title  \
0                 Outlander                      Outlander   
1  The Time Traveler's Wife       The Time Traveler's Wife   
2                  11/22/63  Eleven twenty-two sixty-three   
3        Dragonfly in Amber             Dragonfly in Amber   
4                  Rubinrot                       Rubinrot   

          Author_Name Edition_Language  Rating_score  Rating_votes  \
0      Diana Gabaldon          English          4.23        852563   
1  Audrey Niffenegger          English          3.98       1604511   
2        Stephen King          English          4.31        434183   
3      Diana Gabaldon          English          4.32        294887   
4        Kerstin Gier           German          4.09        118940   

   Review_number                                   Book_Description  \
0          46268  The year is 1945. Claire Randall, a former com...   
1          47705  A funny, often poignant tale of boy meets girl

This code is used to convert text data into a format suitable for training an LSTM model, where the text is transformed into integer sequences, padded to a consistent length, and ready for model input.
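
To show what the tokenize-and-pad step does to the data, here is a pure-Python sketch of the same idea. The helper names (`fit_word_index`, `texts_to_sequences`, `pad_pre`) are hypothetical and only approximate the Keras behaviour: words are ranked by frequency, index 0 is reserved for padding, and zeros are added on the left.

```python
from collections import Counter

def fit_word_index(texts):
    # Rank words by frequency; the most common word gets index 1.
    # Index 0 is left free so it can be used for padding.
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts, word_index):
    # Replace each known word by its integer index; unknown words are skipped.
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_pre(sequences, maxlen):
    # Left-pad with zeros; sequences longer than maxlen keep their last maxlen items.
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]

texts = ["the time traveler", "the time machine", "the end"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)
padded = pad_pre(sequences, maxlen=max(len(s) for s in sequences))
print(word_index)
print(padded)
```

Every row of `padded` has the same length, which is what allows the descriptions to be stacked into one 2D array such as the `(1248, 1507)` shape reported below.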

In [None]:
# Tokenize and prepare sequences for LSTM
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['cleaned_description'])
sequences = tokenizer.texts_to_sequences(df['cleaned_description'])
max_sequence_len = max([len(x) for x in sequences])
input_sequences = pad_sequences(sequences, maxlen=max_sequence_len, padding='pre')

# Build and train the model (as described before)


In [None]:
print(any(seq is None for seq in input_sequences))  # Should return False


False


In [None]:
print(input_sequences.shape)  # Should print a 2D shape, e.g., (num_samples, max_sequence_len)


(1248, 1507)


This cell repeats the preparation with explicit imports: it tokenizes the text, converts it into sequences of integers, and pads them so that all sequences have the same length, ready for use in a model such as an LSTM (Long Short-Term Memory) network.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Assuming df['cleaned_description'] is cleaned text data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['cleaned_description'])

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(df['cleaned_description'])

# Find the maximum sequence length
max_sequence_len = max([len(x) for x in sequences])

# Pad sequences to the same length
input_sequences = pad_sequences(sequences, maxlen=max_sequence_len, padding='pre')

print(input_sequences.shape)  # Check the shape again


(1248, 1507)


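The next cell builds an LSTM model with an embedding layer, an LSTM layer, dropout for regularization, and a softmax output layer over the vocabulary (one probability per word, for next-word prediction).
The text data was prepared above by tokenizing it and padding the sequences to a uniform input length.
The padded sequences are in a format suitable for input to the LSTM model. Note that the cell only defines the architecture; the model is not compiled or trained in this notebook.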

In [None]:
model = Sequential()
model.add(Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=128))  # Removed input_length
model.add(LSTM(150))
model.add(Dropout(0.2))
model.add(Dense(len(tokenizer.word_index)+1, activation='softmax'))
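
The notebook stops at the model definition, so the following is only a sketch of how next-word training pairs could be derived from the integer sequences. `make_next_word_pairs` is a hypothetical helper, and the approach (every prefix of a description predicts the word that follows it) is one common choice, not necessarily what the author intended.

```python
def make_next_word_pairs(sequences, maxlen):
    # For each sequence, every prefix of length >= 1 becomes an input
    # and the token right after that prefix becomes the target.
    X, y = [], []
    for seq in sequences:
        for i in range(1, len(seq)):
            prefix = seq[:i][-maxlen:]                       # keep at most maxlen tokens
            X.append([0] * (maxlen - len(prefix)) + prefix)  # left-pad with zeros
            y.append(seq[i])
    return X, y

X, y = make_next_word_pairs([[1, 2, 3], [1, 5]], maxlen=2)
print(X)
print(y)
```

Under these assumptions, `X` and `y` could be converted with `np.array` and passed to `model.fit` after compiling the model with `loss='sparse_categorical_crossentropy'`, since the targets are integer word indices. With 1507-token descriptions the number of prefixes grows quickly, so a small `maxlen` or the 10% sample would likely be needed in practice.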


This code generates new text with the GPT-2 model from an initial input text and then prints the continuation.

Input: "Once upon a time in a distant future"
Output: The model generates a continuation of this input, up to a maximum total length of **200 tokens** (the prompt tokens count toward `max_length`). Because the call uses the default greedy decoding, the continuation can fall into a repetitive loop, as the output below shows.
GPT-2 is a large language model capable of generating coherent and contextually relevant text, making it useful for applications like creative writing and content generation.

In [None]:
!pip install transformers
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Example: Tokenizing input text and generating text
input_text = "Once upon a time in a distant future"
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=200, num_return_sequences=1)
print(tokenizer.decode(output[0], skip_special_tokens=True))




The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Once upon a time in a distant future, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger


`tokenizer.decode` converts the generated token IDs back into human-readable text.

In [None]:
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)


Once upon a time in a distant future, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger


The next cell generates three distinct continuations of the same prompt with GPT-2. Sampling is enabled with `do_sample=True`, and `temperature`, `top_k`, and `top_p` control how random or creative the output is: they keep the continuations diverse while still favoring likely tokens. The three results are printed one by one.
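
To illustrate how temperature, top-k, and top-p interact at a single generation step, here is a simplified pure-Python sketch. `filter_and_sample` is a hypothetical function written for this explanation; it is not the Hugging Face implementation, and the four-element logits list stands in for a real vocabulary.

```python
import math
import random

def filter_and_sample(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    """Return (sampled_token_id, kept_token_ids) for one generation step."""
    # Temperature: divide logits before the softmax; values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    mass = sum(probs[i] for i in ranked)  # used to renormalize after top-k

    # Top-p (nucleus): keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Sample one id from the surviving tokens, weighted by probability.
    token_id = rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
    return token_id, kept

rng = random.Random(0)
token_id, kept = filter_and_sample(
    [2.0, 1.0, 0.1, -1.0], temperature=1.0, top_k=3, top_p=0.85, rng=rng
)
print(token_id, kept)
```

In this toy example top-k discards the least likely token and top-p then trims the candidate set to the two most likely ones, so only token ids 0 and 1 can be sampled. Lowering `temperature` would shift even more probability toward id 0.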

In [None]:
output = model.generate(
    input_ids,
    max_length=200,
    num_return_sequences=3,  # Generate multiple sequences
    temperature=0.7,  # Controls the randomness
    top_k=50,  # Limits to the top 50 tokens at each step
    top_p=0.9,  # Nucleus sampling to choose tokens within the top 90% probability mass
    do_sample=True  # Enables sampling, allowing num_return_sequences > 1
)

for i, sequence in enumerate(output):
    print(f"Generated Text {i+1}:")
    print(tokenizer.decode(sequence, skip_special_tokens=True))
    print("\n")


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Text 1:
Once upon a time in a distant future, the Earth and the Moon are united in the cause of Earth's liberation. The Earth is not a separate planet from the Moon. The Earth and the Moon are the same.

The Earth and the Moon are united in the cause of Earth's liberation. The Earth and the Moon are the same. The Earth is a part of the Sun. The Earth and the Moon are the same.

The Earth and the Moon are the same. The Earth and the Moon are united in the cause of Earth's liberation. The Earth and the Moon are the same.

The Earth and the Moon are united in the cause of Earth's liberation. The Earth and the Moon are the same. The Earth and the Moon are united in the cause of Earth's liberation. The Earth and the Moon are the same.

The Earth and the Moon are united in the cause of Earth's liberation. The Earth and the Moon are the same. The


Generated Text 2:
Once upon a time in a distant future, we had a child. The child was a child of the Lord, who loved our children.

I am

Fine-Tuning GPT-2 on Your Dataset
Fine-tune GPT-2 on your custom story dataset if you'd like the model to generate text more specific to your task (e.g., science fiction or fantasy stories).
This process continues training the pre-trained GPT-2 model on your dataset so that it better captures the style, structure, and themes of your story data.
Hugging Face provides the Trainer API for fine-tuning, along with tutorials and examples.
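
The notebook does not include a fine-tuning run, so the following is an untested sketch of what one could look like with the Trainer API. The function names, `output_dir`, `block_size`, batch size, and the choice to concatenate descriptions into fixed-length blocks are all assumptions made for illustration.

```python
def group_into_blocks(token_ids, block_size):
    # Cut one long list of token ids into equal-size training examples;
    # the incomplete remainder at the end is dropped.
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]

def fine_tune(texts, output_dir="gpt2-stories", block_size=128, epochs=1):
    # Imports are local so that group_into_blocks works without these packages.
    from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                              GPT2Tokenizer, Trainer, TrainingArguments)

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    # Concatenate all descriptions, separated by the EOS token, then cut into blocks.
    ids = []
    for text in texts:
        ids.extend(tokenizer.encode(text) + [tokenizer.eos_token_id])
    train_dataset = [{"input_ids": block} for block in group_into_blocks(ids, block_size)]

    # mlm=False selects causal language modeling: labels are a copy of the inputs.
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs,
                             per_device_train_batch_size=2, report_to="none")
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                      data_collator=collator)
    trainer.train()
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir

print(group_into_blocks(list(range(10)), block_size=4))
```

A call such as `fine_tune(data['Book_Description'].dropna().tolist())` would train on the raw descriptions rather than the lowercased, letters-only `cleaned_description` column, which is probably preferable for GPT-2 because it was pre-trained on natural text with punctuation and capitalization. Without a GPU (the notebook reports zero), even one epoch may be slow.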

In [None]:
!pip install datasets


Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (1

In [None]:
# Add a padding token to the tokenizer
tokenizer.pad_token = tokenizer.eos_token  # Use the EOS token as the padding token


In [None]:
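# Alternative to reusing EOS: register a dedicated [PAD] token (this overrides the previous cell; the model then needs model.resize_token_embeddings(len(tokenizer)))
tokenizer.add_special_tokens({'pad_token': '[PAD]'})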


1


Saving and Exporting the Model
After generating text or fine-tuning the model, you can save the fine-tuned model to your local machine or Google Drive for future use.

In [None]:
# Save the model and tokenizer to Google Drive (same 'MyDrive' mount path used when loading the dataset)
model.save_pretrained("/content/drive/MyDrive/fine_tuned_gpt2_model")
tokenizer.save_pretrained("/content/drive/MyDrive/fine_tuned_gpt2_tokenizer")


Streamlit is a Python library for creating interactive web applications, especially for data science and machine learning projects. It allows you to quickly build and deploy data-driven web apps with minimal coding.

In [None]:
!pip install streamlit


Collecting streamlit
  Downloading streamlit-1.39.0-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.1-py2.py3-none-any.whl.metadata (4.1 kB)
Collecting watchdog<6,>=2.1.5 (from streamlit)
  Downloading watchdog-5.0.3-py3-none-manylinux2014_x86_64.whl.metadata (41 kB)
Downloading streamlit-1.39.0-py2.py3-none-any.whl (8.7 MB)
Downloading pydeck-0.9.1-py2.py3-none-any.whl (6.9 MB)
Downloading watchdog-5.0.3-py3-none-manylinux2014_x86_64.whl (79 kB)

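A simple Streamlit app that lets users enter a prompt and generate a story with a text generation model (like GPT-2). Streamlit apps are meant to be saved to a .py file and launched with `streamlit run`; executing the code directly in a notebook cell only produces the warning shown below the cell. The `streamlit run your_app.py` cell further down fails for a related reason: no file named your_app.py was ever written.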

In [None]:
import streamlit as st

# Simple Streamlit app to interact with the model
st.title("Story Generator")

input_text = st.text_input("Enter a prompt:", "Once upon a time")

if st.button("Generate Story"):
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    output = model.generate(input_ids, max_length=200, num_return_sequences=1)
    story = tokenizer.decode(output[0], skip_special_tokens=True)
    st.write(story)


2024-10-24 03:02:57.977 
  command:

    streamlit run /usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py [ARGUMENTS]
2024-10-24 03:02:57.990 Session state does not function when running a script without `streamlit run`


In [None]:
!streamlit run your_app.py


Usage: streamlit run [OPTIONS] TARGET [ARGS]...
Try 'streamlit run --help' for help.

Error: Invalid value: File does not exist: your_app.py


A Flask app lets you interact with the GPT-2 model by sending a text prompt in a POST request.
Flask handles the web requests and serves the model on a local server, while Transformers and PyTorch load and run the pre-trained GPT-2 model.

In [None]:
pip install flask transformers torch
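
The notebook installs Flask but does not show the app itself, so here is a minimal sketch of what such a service could look like. The `/generate` route, the `prompt` JSON field, and the helper names `make_payload` and `create_app` are assumptions chosen for illustration; the generator is passed in as a function so the web layer stays independent of the model.

```python
def make_payload(prompt, generate_fn):
    # Validate the prompt and wrap the generated text in a JSON-serializable dict.
    if not isinstance(prompt, str) or not prompt.strip():
        return {"error": "Field 'prompt' must be a non-empty string."}, 400
    return {"prompt": prompt, "generated_text": generate_fn(prompt)}, 200

def create_app(generate_fn):
    # Local import: Flask is only needed when the app is actually created.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/generate", methods=["POST"])
    def generate():
        # Expect a JSON body such as {"prompt": "Once upon a time"}.
        data = request.get_json(silent=True) or {}
        payload, status = make_payload(data.get("prompt"), generate_fn)
        return jsonify(payload), status

    return app

# Example wiring (commented out because it downloads GPT-2 and starts a server):
# from transformers import pipeline
# generator = pipeline("text-generation", model="gpt2")
# app = create_app(lambda p: generator(p, max_length=50)[0]["generated_text"])
# app.run(port=5000)

# Quick check of the request handling with a stand-in generator
print(make_payload("Once upon a time", lambda p: p + " ..."))
```

Once the server is running, a client would POST JSON to `http://localhost:5000/generate` and receive the prompt and generated text back as JSON.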




`save_pretrained` saves a pre-trained GPT-2 model and its tokenizer to a specific directory, so they can be reused later without downloading them again. This is helpful for fine-tuning, deployment, and sharing models.

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Save the model and tokenizer
model.save_pretrained('/content/gpt2_model')
tokenizer.save_pretrained('/content/gpt2_model')



('/content/gpt2_model/tokenizer_config.json',
 '/content/gpt2_model/special_tokens_map.json',
 '/content/gpt2_model/vocab.json',
 '/content/gpt2_model/merges.txt',
 '/content/gpt2_model/added_tokens.json')

In [None]:
from google.colab import files
!zip -r gpt2_model.zip /content/gpt2_model
files.download('gpt2_model.zip')


  adding: content/gpt2_model/ (stored 0%)
  adding: content/gpt2_model/merges.txt (deflated 53%)
  adding: content/gpt2_model/tokenizer_config.json (deflated 54%)
  adding: content/gpt2_model/generation_config.json (deflated 24%)
  adding: content/gpt2_model/special_tokens_map.json (deflated 74%)
  adding: content/gpt2_model/model.safetensors (deflated 7%)
  adding: content/gpt2_model/vocab.json (deflated 68%)
  adding: content/gpt2_model/config.json (deflated 52%)



Gradio is a Python library for quickly creating and sharing user-friendly web interfaces for machine learning models. It lets developers build interactive demos with minimal code, making it easy to test, deploy, and share models with non-technical users.

In [None]:
!pip install gradio




The `generate_text` function should encode the input text into tokens, pass those tokens to the model to generate output, and then decode the result back into readable text.

In [None]:
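def generate_text(prompt, max_length=200):
    # Encode the prompt, generate token IDs with GPT-2, then decode them back to text
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    output = model.generate(input_ids, max_length=max_length, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output[0], skip_special_tokens=True)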


Load Pre-trained Model: The pipeline automatically downloads the GPT-2 model and its associated tokenizer if they are not already downloaded.
Prepare for Text Generation: The pipeline prepares the GPT-2 model for text generation. It handles tokenization, model inference, and decoding for you, so you don’t need to manually process the input and output.

In [None]:
from transformers import pipeline
model = pipeline("text-generation", model="gpt2")  # or any other model



In [None]:
def generate_text(prompt):
    # Call the pipeline object directly with the prompt
    generated_text = model(prompt)  # This will return the generated text
    return generated_text[0]['generated_text']  # Adjust to get the actual text output if needed


When the user enters a prompt in the input field (such as "Once upon a time"), the generate_text function is called.
The function generates a continuation of the input text using the GPT-2 model and returns it.
The generated text is displayed in the output field on the web interface.

Example prompts to try:
Dragonfly in Amber
Time Out of Joint
Goddess of the Sea
Hour of the Olympics

In [None]:
import gradio as gr
from transformers import pipeline

# Load the text-generation model
model = pipeline("text-generation", model="gpt2")

# Define the text generation function
def generate_text(prompt):
    return model(prompt, max_length=50)[0]["generated_text"]

# Set up the Gradio interface
iface = gr.Interface(
    fn=generate_text,
    inputs="text",
    outputs="text",
    title="AI Text Generator"
)

# Launch the Gradio app
iface.launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://f2fe833fb38cf35230.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


