## Downloading Dataset

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("akmittal/quotes-dataset")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/akmittal/quotes-dataset/versions/1


In [None]:
import os

os.listdir(path)

['quotes.json']

In [None]:
import json
import pandas as pd

json_path = os.path.join(path, "quotes.json")

# Loading JSON file (explicit encoding avoids platform-dependent defaults)
with open(json_path, 'r', encoding='utf-8') as file:
    data = json.load(file)

# Converting to DataFrame
df = pd.DataFrame(data)
print(df.head())

                                               Quote          Author  \
0  Don't cry because it's over, smile because it ...       Dr. Seuss   
1  Don't cry because it's over, smile because it ...       Dr. Seuss   
2  I'm selfish, impatient and a little insecure. ...  Marilyn Monroe   
3  I'm selfish, impatient and a little insecure. ...  Marilyn Monroe   
4  I'm selfish, impatient and a little insecure. ...  Marilyn Monroe   

                                                Tags  Popularity   Category  
0  [attributed-no-source, cry, crying, experience...    0.155666       life  
1  [attributed-no-source, cry, crying, experience...    0.155666  happiness  
2  [attributed-no-source, best, life, love, mista...    0.129122       love  
3  [attributed-no-source, best, life, love, mista...    0.129122       life  
4  [attributed-no-source, best, life, love, mista...    0.129122      truth  


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48391 entries, 0 to 48390
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Quote       48391 non-null  object 
 1   Author      48391 non-null  object 
 2   Tags        48391 non-null  object 
 3   Popularity  48391 non-null  float64
 4   Category    48391 non-null  object 
dtypes: float64(1), object(4)
memory usage: 1.8+ MB


In [None]:
# Dropping nulls and keeping only the 'Quote' column
df = df[['Quote']].dropna()
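
The preview above lists the same quote once per category, so keeping only the `Quote` column leaves exact duplicates in the training data. A minimal sketch of how they might be dropped before tokenization follows. The sample rows and the helper name `unique_quotes` are illustrative assumptions, not part of the dataset or the notebook.

```python
import pandas as pd

# Illustrative stand-in for the real DataFrame: one quote listed under two categories.
sample = pd.DataFrame({
    "Quote": ["Be yourself.", "Be yourself.", "Stay curious."],
    "Category": ["life", "happiness", "wisdom"],
})

def unique_quotes(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep only the 'Quote' column, dropping nulls and exact duplicate texts."""
    return (
        frame[["Quote"]]
        .dropna()
        .drop_duplicates(subset="Quote")
        .reset_index(drop=True)
    )

deduped = unique_quotes(sample)
print(len(sample), "->", len(deduped))
```

Applied to the notebook's `df`, this would reduce the number of training steps and avoid showing the model the same text several times per epoch.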

## Tokenization

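I used the GPT-2 tokenizer because it shares its vocabulary with the `distilgpt2` model loaded below. GPT-2 defines no padding token, so the end-of-sequence token is reused for padding, and every quote is padded or truncated to 64 tokens to keep memory usage low.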

In [None]:
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Convert quotes to tokens
inputs = tokenizer(
    df['Quote'].tolist(),
    return_tensors="pt",
    max_length=64,
    padding="max_length",
    truncation=True
)
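
With `max_length=64` and `truncation=True`, longer quotes are cut off silently. Before fixing the limit, it can help to check what share of quotes fits. The helper below is a hypothetical sketch (the name `fraction_within` and the example counts are assumptions) that works on a plain list of token counts.

```python
def fraction_within(token_counts, max_length):
    """Return the share of sequences whose token count fits within max_length."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    fitting = sum(1 for n in token_counts if n <= max_length)
    return fitting / len(token_counts)

# Made-up token counts for illustration. With the notebook's tokenizer they could be
# obtained as: [len(ids) for ids in tokenizer(df['Quote'].tolist())['input_ids']]
example_counts = [12, 30, 64, 65, 120]
coverage = fraction_within(example_counts, max_length=64)
print(f"{coverage:.0%} of quotes fit without truncation")
```

If the share is low, raising `max_length` trades memory for fewer truncated quotes.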


## Loading a Small GPT2 Model




In [None]:
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("distilgpt2")
# Reusing EOS as the pad token adds no new tokens, so this leaves the embedding size unchanged
model.resize_token_embeddings(len(tokenizer))


Embedding(50257, 768)

## Preparing DataLoader (Low Memory)

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# Dataset
dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'])

# DataLoader (small batch size to save RAM)
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)
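
Because the pad token is the EOS token and every sequence is padded to 64 tokens, using `input_ids` directly as labels means the padded positions also count toward the loss. One common alternative is to set the labels to `-100` at padded positions, which the Hugging Face causal-LM loss ignores. The sketch below uses small hand-written tensors, and the helper name `mask_padding_labels` is an assumption for illustration.

```python
import torch

def mask_padding_labels(input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Copy input_ids and set padded positions to -100 so the loss skips them."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100
    return labels

# Toy batch: 50256 is GPT-2's EOS id, reused here as the padding id.
input_ids = torch.tensor([[40, 716, 50256, 50256],
                          [1212, 318, 257, 1332]])
attention_mask = torch.tensor([[1, 1, 0, 0],
                               [1, 1, 1, 1]])

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

The resulting tensor could be stored as a third entry in the `TensorDataset` and passed as `labels=` during training. One trade-off to keep in mind: with every EOS position masked, the model gets no training signal for where a quote should end.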

## Training Loop (Basic, RAM-efficient)

In [None]:
from torch.optim import AdamW

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.train()

optimizer = AdamW(model.parameters(), lr=5e-5)

for epoch in range(1):  # A single epoch keeps training time manageable (one step per quote at batch_size=1)
    for batch in dataloader:
        input_ids, attention_mask = [x.to(device) for x in batch]

        outputs = model(input_ids, attention_mask=attention_mask, labels=input_ids)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()

    # Note: this reports the loss of the last batch only, not an average over the epoch
    print(f"Epoch complete with loss: {loss.item():.4f}")



Epoch complete with loss: 0.9839
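
With `batch_size=1`, every optimizer step is driven by a single quote, which keeps memory low but makes updates noisy. Gradient accumulation is one common way to get a larger effective batch at the same memory cost. The sketch below is not the notebook's training loop: it uses a hypothetical toy linear model (`toy_model`) and random data purely to show the pattern.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (assumptions for illustration): a linear model and random data.
toy_model = nn.Linear(4, 1)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
samples = [(torch.randn(1, 4), torch.randn(1, 1)) for _ in range(8)]

accumulation_steps = 4  # effective batch size = 1 sample * 4 steps
optimizer_updates = 0

toy_optimizer.zero_grad()
for step, (x, y) in enumerate(samples, start=1):
    # Scale the loss so the accumulated gradient is an average, not a sum.
    loss = loss_fn(toy_model(x), y) / accumulation_steps
    loss.backward()  # gradients add up across backward() calls
    if step % accumulation_steps == 0:
        toy_optimizer.step()
        toy_optimizer.zero_grad()
        optimizer_updates += 1

print("optimizer updates:", optimizer_updates)
```

In the GPT-2 loop the same idea would mean dividing `outputs.loss` by the accumulation factor and calling `optimizer.step()` only every few batches.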


## Generating Quotes

In [None]:
model.eval()
prompt = "wisdom begins"

# Tokenize the prompt; this returns both input_ids and attention_mask
encoded = tokenizer(prompt, return_tensors="pt").to(device)

output = model.generate(
    **encoded,  # passes input_ids and attention_mask explicitly
    max_length=50,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated pad token
)

print("Generated quote:\n", tokenizer.decode(output[0], skip_special_tokens=True))



Generated quote:
 wisdom begins to make us want to believe in someone, so in our experience we are able to see the beauty of that person.
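
The `temperature`, `top_k`, and `top_p` arguments shape the distribution each token is sampled from. The pure-Python sketch below is a simplified illustration of that filtering, not the library's implementation. The function name `filter_distribution` and the toy logits are assumptions.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, top-k, then top-p filtering; return renormalized probabilities."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    ranked = sorted(scaled.items(), key=lambda pair: pair[1], reverse=True)

    # top-k: keep only the k highest-scoring tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = ranked[0][1]
    weights = [(tok, math.exp(value - peak)) for tok, value in ranked]
    total = sum(weight for _, weight in weights)
    probs = [(tok, weight / total) for tok, weight in weights]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in probs:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}

toy_logits = {"to": 2.0, "with": 1.0, "in": 0.5, "when": -1.0}
candidates = filter_distribution(toy_logits, temperature=0.8, top_k=3, top_p=0.95)
print(candidates)
```

Lower `top_p` or `top_k` values narrow the candidate set and make output more predictable, while higher values allow more varied quotes.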
