# **RAG-Driven Poetry Generation: Combining Shakespeare and Semantic Search**

In [1]:
!pip install chromadb
!pip install sentence-transformers


Collecting chromadb
  Downloading chromadb-0.6.2-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.7.5-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.20.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.29.0-py3-

Initialize the LLM for text generation

In [2]:
import chromadb
from sentence_transformers import SentenceTransformer
import requests
import pandas as pd
from transformers import GPT2LMHeadModel, GPT2Tokenizer

generator_model = GPT2LMHeadModel.from_pretrained("gpt2")
generator_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
generator_tokenizer.pad_token = generator_tokenizer.eos_token

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Loading shakespeark text and 100 poems kaggle dataset

In [4]:
# Load the Shakespeare text
shakespeare_url = 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt'
shakespeare_text = requests.get(shakespeare_url).text

# #
# def load_poems(poem_file="poems-100.csv"):
#     poems_df = pd.read_csv(poem_file)
#     return poems_df['text'].tolist()  # Return the list of poems from the text column

# poems = load_poems("poems-100.csv")  # Load the poems dataset

poems = pd.read_csv("poems-100.csv")['text'].tolist()


Building and Populating a Semantic Search Database with Shakespeare and Poetry

In [5]:
# Combine Shakespeare text and poems
chunks = shakespeare_text.splitlines() + poems

model = SentenceTransformer('all-MiniLM-L6-v2')
client = chromadb.Client()

# Create collection
collection = client.create_collection("shakespeare_poetry3")

embeddings = model.encode(chunks)

collection.add(
    documents=chunks,
    embeddings=embeddings,
    metadatas=[{"source": f"Text_{i}"} for i in range(len(chunks))],
    ids=[f"id_{i}" for i in range(len(chunks))]
)

print("Shakespeare text and poems added to Chroma DB successfully.")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Shakespeare text and poems added to Chroma DB successfully.


Retrieve the most relevant context from the combined Shakespeare and poems dataset in Chroma DB.

In [7]:
def retrieve_relevant_context(query, top_k=3):
    query_embedding = model.encode([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )

    # Extract relevant text - flatten the documents
    relevant_texts = []
    for doc in results['documents']:
        relevant_texts.extend(doc)  # Flattening the list of sentences or phrases

    return ' '.join(relevant_texts[:5])


In [25]:
def generate_poem(query, length=100, temperature=0.9, num_lines=2, top_k=3):
    context = retrieve_relevant_context(query, top_k=top_k)
    context_str = context + " " + query
    inputs = generator_tokenizer.encode(context_str, return_tensors="pt", padding=True, truncation=True)

    if inputs.ndimension() == 1:
        inputs = inputs.unsqueeze(0)

    max_input_length = inputs.shape[-1]
    max_total_length = 1024
    max_new_tokens = max_total_length - max_input_length

    outputs = generator_model.generate(
        inputs,
        max_new_tokens=min(length, max_new_tokens),
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        temperature=temperature,
        top_k=top_k,
        top_p=0.95,
        do_sample=True,
        pad_token_id=generator_tokenizer.eos_token_id
    )

    poem = generator_tokenizer.decode(outputs[0], skip_special_tokens=True)
    return extract_lines(poem, num_lines)


In [9]:
# Helper function to extract lines
def extract_lines(poem, num_lines):

    lines = poem.split('\n')
    extracted_poem = '\n'.join(lines[:num_lines])

    if not extracted_poem.endswith('.'):
        extracted_poem += '.'

    return extracted_poem

Illustrative Example of Poem Generation

In [11]:
#1
query = input("Enter query Here: ")
poem = generate_poem(query, length=100, temperature=0.8, num_lines=2, top_k=3)
print("Generated Poem:\n", poem)


Enter query Here: War and peace cannot go hand to hand
Generated Poem:
 Let me have war, say I; it exceeds peace as far as I will not peace. What is the matter, Aumerle. Could not have made this peace. War and peace cannot go hand to hand. Peace is a matter of words. It is not a question of the future. There is no future, no present, only the past. The future will become something like the present. I do not mean to say that one should have peace, but that that is what I mean.


In [13]:
#2
query = "Love is a gamble"
poem = generate_poem(query, length=100, temperature=0.8, num_lines=2, top_k=3)
print("Generated Poem:\n", poem)


Generated Poem:
 Only make trial what your love can do With such a kind of love as might become In love? Love is a gamble, a gift, and the only way to get to it is by love! The only one who can save our souls!
.


In [24]:
#3
query = "What is the meaning of lost love?"
poem = generate_poem(query, length=100, temperature=0.8, num_lines=2, top_k=3)
print("Generated Poem:\n", poem)


Generated Poem:
 I do give lost; for I do feel it gone, It were lost sorrow to wail one that's lost. I am not yours, not lost in you,
Not lost, although I long to be.


In [15]:
query = "How does the night sky make you feel?"
poem = generate_poem(query, length=100, temperature=0.8, num_lines=2, top_k=3)
print("Generated Poem:\n", poem)


Generated Poem:
 Upon the heavy middle of the night When the sun sets, who doth not look for night? That all the world will be in love with night How does the night sky make you feel?
.
