<a href="https://colab.research.google.com/github/CNielsen94/Random_data_repo/blob/main/notebooks/RAG_based_StoryTelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction
Not long after I started on DBS, I had the idea to create an app that would generate and read custom-tailored stories, and now I get to show you guys a cheap and semi-hacky way to use Retrieval-Augmented Generation (RAG) for this specific purpose. God I love being a nerd.

![](https://raw.githubusercontent.com/CNielsen94/Random_data_repo/main/media/Nerd_is_a_compliment.png)

Let's get started!

#Pip installs

In [1]:
!pip install transformers datasets faiss-gpu
!pip install sentence-transformers
!pip install openai



#Database setup + fillin' it with good ol' data

First, let's set up a SQLite database to store the text data. Here’s how you can create a database and a table for storing texts:

This code segment is used to interact with a SQLite database in Python. It starts by importing the sqlite3 module, which allows for communication with SQLite databases. Then, it creates (or opens, if it already exists) a database file named gutenberg_texts.db.

A cursor object c is created from the connection object conn. This cursor is used to execute SQL commands. Here, it creates a table named texts (if it does not already exist) with the columns id, title, and content. Because id is declared INTEGER PRIMARY KEY, SQLite assigns it automatically whenever a row is inserted without one; title and content are text fields.

conn.commit() is called to save the changes made to the database.

In [2]:
import sqlite3

# Create a new SQLite database
conn = sqlite3.connect('gutenberg_texts.db')
c = conn.cursor()

# Create a new table to store texts
c.execute('''CREATE TABLE IF NOT EXISTS texts
             (id INTEGER PRIMARY KEY, title TEXT, content TEXT)''')
conn.commit()
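
To see what `INTEGER PRIMARY KEY` does in practice, here is a small sketch against a throwaway in-memory database. The `:memory:` connection, the `demo_` names, and the two sample rows are assumptions made up for illustration; they are not part of the notebook's data.

```python
import sqlite3

# Use a throwaway in-memory database so gutenberg_texts.db is left untouched
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE IF NOT EXISTS texts "
    "(id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
)

# Insert two made-up rows without specifying an id
demo_conn.executemany(
    "INSERT INTO texts (title, content) VALUES (?, ?)",
    [("demo-a", "First sample text."), ("demo-b", "Second sample text.")],
)
demo_conn.commit()

# SQLite fills in the id column on its own
demo_rows = demo_conn.execute("SELECT id, title FROM texts ORDER BY id").fetchall()
print(demo_rows)
demo_conn.close()
```

The printed rows should carry ids 1 and 2 even though the inserts never mention the id column.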

Now we need some text to populate the database with. In this case I took the semi-easy route of just using NLTK's 'gutenberg' corpus. <br> For context, Project Gutenberg is an online library of books that are in the public domain, and NLTK ships a small selection of them, meaning we aren't breaking any of those pesky IP protection rules. <br>
As I'm sure you're aware by now, the setup steps I am following ***may*** differ significantly from yours depending on what data you want to use.

In [3]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')


# Get a list of file identifiers for the texts in the Gutenberg corpus
file_ids = nltk.corpus.gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
for file_id in file_ids:
    # Fetch the text content and title
    content = nltk.corpus.gutenberg.raw(file_id)
    title = file_id.replace('.txt', '').replace('.zip', '')  # Simple cleanup for title

    # Insert the title and content into the database
    c.execute("INSERT INTO texts (title, content) VALUES (?, ?)", (title, content))

# Commit changes and close the connection
conn.commit()
conn.close()
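
One thing to watch out for: re-running the setup and insert cells appends every book a second time, because nothing in the table prevents duplicate titles. Below is a minimal sketch of one way around that, assuming a `UNIQUE` constraint on the title and `INSERT OR IGNORE`. The table name `texts_unique`, the helper `insert_texts_once`, and the sample rows are hypothetical and only there for the demo.

```python
import sqlite3

def insert_texts_once(conn, rows):
    """Insert (title, content) pairs, skipping titles that already exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS texts_unique "
        "(id INTEGER PRIMARY KEY, title TEXT UNIQUE, content TEXT)"
    )
    # INSERT OR IGNORE skips rows that would violate the UNIQUE constraint
    conn.executemany(
        "INSERT OR IGNORE INTO texts_unique (title, content) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM texts_unique").fetchone()[0]

demo_conn = sqlite3.connect(":memory:")
sample_rows = [("austen-emma", "Emma Woodhouse..."), ("carroll-alice", "Alice was...")]
first_count = insert_texts_once(demo_conn, sample_rows)
second_count = insert_texts_once(demo_conn, sample_rows)  # simulate re-running the cell
print(first_count, second_count)
demo_conn.close()
```

Both counts come out the same, so a second run leaves the table unchanged.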

This function, fetch_all_texts_from_db, is designed to interact with a SQLite database to retrieve stored texts, aligning with the Retrieval part of RAG (Retrieval-Augmented Generation). <br>
It opens a connection to the gutenberg_texts.db database, selects the content of all entries in the texts table, and fetches these entries. <br>
The results, initially in a list of tuples (because that's how database queries return data), are transformed into a list of strings, where each string is the content of a text from the database. <br>
This list of texts can then be used as the corpus for retrieval-based tasks in a RAG setup, providing context or source material for GPT (or another generative model) to generate new content based on the retrieved information. <br>
Finally, the database connection is closed to ensure resource efficiency.

In [5]:
def fetch_all_texts_from_db():
    # Connect to the SQLite database
    conn = sqlite3.connect('gutenberg_texts.db')
    cursor = conn.cursor()

    # Select all content from the texts table
    cursor.execute("SELECT content FROM texts")

    # Fetch all results
    results = cursor.fetchall()  # List of tuples

    # Convert list of tuples to list of strings
    texts = [text[0] for text in results]

    # Close the database connection
    conn.close()

    return texts
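
If you also want to know which book a retrieved text came from, a variant that returns the title alongside the content can help. This is just a sketch: `fetch_titles_and_texts` and its `db_path` parameter are names I made up, and it assumes the same `texts` table as above.

```python
import sqlite3
from contextlib import closing

def fetch_titles_and_texts(db_path="gutenberg_texts.db"):
    """Return a list of (title, content) tuples from the texts table."""
    # closing() makes sure the connection is closed even if the query raises
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(
            "SELECT title, content FROM texts ORDER BY id"
        ).fetchall()
```

Something like `titles = [title for title, _ in fetch_titles_and_texts()]` then gives you a list of titles in the same order as the texts.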

#Setting up RAG index for custom story telling:

This code segment integrates the retrieval part of RAG and prepares data for GPT. It uses the sentence_transformers library to load a pre-trained model (all-MiniLM-L6-v2) that converts texts into embeddings, numerical vectors representing textual information. <br><br>
These embeddings, derived from the database texts, allow for semantic similarity comparison. The FAISS library is used to create an efficient index for these embeddings, enabling quick retrieval of the most relevant texts based on vector similarity. <br><br>
This retrieval process complements GPT's generation by providing relevant context or source material.

In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Assuming you've populated the database with text data, fetch it, vectorize, and index
texts = fetch_all_texts_from_db()  # This should come from your database
embeddings = model.encode(texts)

# Create a FAISS index
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(np.array(embeddings))
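
A caveat worth knowing: every entry in `texts` is an entire book, and sentence-transformers models truncate input that exceeds their maximum sequence length (256 word pieces by default for all-MiniLM-L6-v2, as far as I can tell from its model card). Each embedding therefore mostly reflects the opening of a book. A common workaround is to split the books into smaller passages first. Here is a rough sketch that uses word counts as a stand-in for tokens; `chunk_text`, the chunk size of 150, and the overlap of 30 are all assumptions you would want to tune.

```python
def chunk_text(text, chunk_words=150, overlap_words=30):
    """Split text into overlapping chunks of roughly chunk_words words each."""
    if overlap_words >= chunk_words:
        raise ValueError("overlap_words must be smaller than chunk_words")
    words = text.split()
    step = chunk_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Made-up sample: 400 placeholder words
sample = " ".join(f"w{i}" for i in range(400))
sample_chunks = chunk_text(sample)
print(len(sample_chunks), [len(c.split()) for c in sample_chunks])
```

To use this in the notebook you would build a flat list such as `chunks = [c for t in texts for c in chunk_text(t)]`, encode and index `chunks` instead of `texts`, and pass `chunks` wherever the retrieval step expects `texts`.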



Since I have some spare OpenAI credits lying around, I decided to use their API as the backend. The benefit is that I don't have to run the model locally or in Colab, which allows a bit more freedom and creativity in terms of model implementation, as the GPU resources are less scarce.<br>
(And it's dirt cheap)

In [7]:
import openai
import os

In [8]:
with open("openai_key.txt", "r") as file:
    openai.api_key = file.read().strip()
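
If you'd rather not keep a key file around, here is a small sketch that checks an environment variable first and only then falls back to the file. `load_api_key` is a hypothetical helper; `OPENAI_API_KEY` is the variable name the openai library itself looks for, and the file name is the one used above.

```python
import os

def load_api_key(path="openai_key.txt", env_var="OPENAI_API_KEY"):
    """Return the API key from the environment if set, else from a text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    with open(path, "r") as file:
        key = file.read().strip()
    if not key:
        raise ValueError(f"{path} is empty and {env_var} is not set")
    return key
```

With that in place, the cell above becomes `openai.api_key = load_api_key()`.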

#Functionality setup
I'll set up some extra functions to tie this application together.


First we want to retrieve the texts in the FAISS index that are most similar to a query.
This function, **retrieve_texts()**, embodies the retrieval aspect of RAG. <br><br>
It uses embeddings to find texts most similar to a user's query within a corpus. <br>It encodes the query into a vector, searches the pre-indexed embeddings for the closest matches, and retrieves the corresponding texts based on their index positions. <br>
This selection serves as context or inspiration for subsequent text generation, enhancing GPT's outputs with relevant, query-specific information.

In [9]:
def retrieve_texts(query, model, index, texts, top_k=5):
    # Encode the query using the same model
    query_vector = model.encode([query])

    # Perform the search
    D, I = index.search(np.array(query_vector), top_k)

    # Retrieve the texts corresponding to the indexed IDs
    retrieved_texts = [texts[i] for i in I[0]]

    return retrieved_texts
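
To make `D` and `I` a bit less mysterious: `index.search` returns the distances and the positions of the closest vectors, and `IndexFlatL2` gets them by comparing the query against every stored vector using squared Euclidean distance. The sketch below is a pure-Python stand-in for that idea, not FAISS itself; `brute_force_search` and the toy 2-D vectors are made up for illustration.

```python
def brute_force_search(query_vector, vectors, top_k=2):
    """Return (distances, indices) of the top_k vectors closest to query_vector."""
    scored = []
    for position, vector in enumerate(vectors):
        # Squared Euclidean distance, the metric IndexFlatL2 reports
        distance = sum((q - v) ** 2 for q, v in zip(query_vector, vector))
        scored.append((distance, position))
    scored.sort()  # smallest distance first
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]

toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_texts = ["text A", "text B", "text C"]
distances, indices = brute_force_search([0.9, 1.2], toy_vectors)
print([toy_texts[i] for i in indices])
```

The query sits closest to the second vector, so "text B" comes back first and "text A" second. One practical note: if `top_k` is larger than the number of indexed vectors, FAISS pads the index array with -1, so it is worth keeping `top_k` at or below the corpus size.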

This function, **generate_story_based_on_texts()**, integrates the Generation part of RAG. <br>
It takes user-defined elements (like setting, character) and retrieved texts to craft a detailed prompt, then uses GPT-3.5 (the `gpt-3.5-turbo-instruct` completion model by default) to generate a narrative. <br>
This approach combines structured storytelling with dynamic, AI-driven content creation through prompt engineering.

In [10]:
def generate_story_based_on_texts(retrieved_texts,
                                  setting="a mysterious forest",
                                  character="a brave young explorer",
                                  objective="find the lost city of gold",
                                  tone="adventurous",
                                  style="light-hearted",
                                  model="gpt-3.5-turbo-instruct",
                                  max_length=1000):

    # Construct the initial prompt using the user's input
    initial_prompt = f"""
    The setting of the story should be {setting}.
    The main character is {character}.
    The main character's objective should be {objective}.
    The tone of the story should be {tone} and the style of the story should be {style}.
    """

    # Combine the retrieved texts with the initial prompt
    combined_text = f"""
    You will be provided with the task to generate a new, original story. 
    The story should contain the following elements:
    {initial_prompt}.

    You should use the following as inspiration, without directly mentioning any of it.  
    {' '.join(retrieved_texts)}
    """

    # Call the OpenAI completions endpoint to generate the story
    response = openai.completions.create(
        model=model,
        prompt=combined_text,
        max_tokens=max_length,
        temperature=0.7,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # Extracting the generated text according to the new response structure
    story = response.choices[0].text.strip()
    return story
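
Since the completion model has a limited context window, it can be handy to build the prompt in a separate step where you can inspect it and cap how much retrieved text goes in. The following is only a sketch: `build_story_prompt`, the reworded instructions, and the 4000-character budget are my own assumptions, not what the function above does.

```python
import textwrap

def build_story_prompt(inspiration, setting, character, objective, tone, style,
                       max_inspiration_chars=4000):
    """Assemble the story prompt, capping the inspiration text at a character budget."""
    # Cap the retrieved material so the prompt leaves room for the generated story
    inspiration = inspiration[:max_inspiration_chars]
    prompt = textwrap.dedent(f"""\
        Write a new, original story with the following elements.
        Setting: {setting}
        Main character: {character}
        Objective: {objective}
        Tone: {tone}
        Style: {style}

        Use the following text as inspiration without quoting or mentioning it directly:
        """)
    # Append after dedenting so line breaks in the inspiration don't affect indentation
    return prompt + inspiration
```

Printing the result of `build_story_prompt(...)` before sending it to the API is a cheap way to check what the model actually sees.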

The **refine_texts()** function takes the retrieved texts and keeps only a specified number of leading sentences from each, joining them into one string before they're used for story generation. <br><br>
This is a crude stand-in for summarization: it simply truncates each text to its opening sentences so the prompt stays short, hopefully giving the model a focused context that results in more coherent and contextually relevant narrative outputs.

In [11]:
def refine_texts(retrieved_texts, max_sentences=3):
    # This function could be enhanced to include summarization
    refined_texts = []
    for text in retrieved_texts:
        sentences = nltk.sent_tokenize(text)
        refined_texts.append(' '.join(sentences[:max_sentences]))  # Take only the first few sentences
    return ' '.join(refined_texts)
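
To get a feel for what "the first few sentences" means, here is a toy version you can run without downloading any NLTK data. It swaps `nltk.sent_tokenize` for a naive regex split, which is cruder (it will trip over abbreviations like "Mr."); `first_sentences` and the sample text are made up for the demo.

```python
import re

def first_sentences(text, max_sentences=3):
    """Return the first max_sentences sentences using a naive regex split."""
    # Split after ., ! or ? followed by whitespace; cruder than nltk.sent_tokenize
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])

sample_text = "The bear woke up. He yawned! Was it morning? He went to the brook. The end."
print(first_sentences(sample_text, max_sentences=2))
```

This prints only "The bear woke up. He yawned!", which shows that the step is plain truncation: whatever comes after the cutoff never reaches the model.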

Finally we combine all the functionalities into one: <br>

The **create_story_from_query()** function combines the retrieval, refinement, and generation steps of RAG: it fetches texts related to a query, cuts them down to their opening sentences, merges this with a narrative starter, and uses GPT-3.5 to unfold a story, leveraging structured inputs for creative output.

In [12]:
def create_story_from_query(query, initial_prompt="Once upon a time"):
    retrieved_texts = retrieve_texts(query, model, index, texts, top_k=5)
    refined_text = refine_texts(retrieved_texts)
    combined_text = f"{initial_prompt} {refined_text}"
    story = generate_story_based_on_texts([combined_text])  # Adjusted to accept combined text directly
    return story
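
If you want to check the flow without spending API credits, one option is to pass each step in as a function so it can be swapped for a stub. This is a sketch under that assumption; `compose_story` and the stub lambdas are hypothetical and not part of the notebook's pipeline.

```python
def compose_story(query, retrieve, refine, generate, initial_prompt="Once upon a time"):
    """Run retrieve -> refine -> generate, with each step passed in as a callable."""
    retrieved = retrieve(query)
    refined = refine(retrieved)
    return generate([f"{initial_prompt} {refined}"])

# Stub steps (made up for this demo) so the flow can be checked offline
demo_story = compose_story(
    "adventure in the mountains",
    retrieve=lambda q: [f"A passage about {q}."],
    refine=lambda passages: " ".join(passages),
    generate=lambda inspiration: f"STORY FROM: {inspiration[0]}",
)
print(demo_story)
```

With the real functions plugged in, the call would look like `compose_story(query, lambda q: retrieve_texts(q, model, index, texts), refine_texts, generate_story_based_on_texts)`.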

#Let's try it out!

In [13]:
# Generate the story
query = "adventure in the mountains"
story = create_story_from_query(query)
print(story)

After
that he yawned until it seemed as if his jaws would crack, and started
off towards the Laughing Brook.

As he drew near, the sound of the Laughing Brook grew louder and louder.
It was the same merry, bubbling laugh that had greeted Buster Bear every
morning just as long as he could remember. It was coming from the
smooth, black pool in which the Laughing Brook ended its long journey
down the side of the Green Mountains. Buster loved that pool. He loved
to watch the speckled trout darting to and fro in its clear depths. He
loved to watch the flies dancing above it, and the little birds
splashing in it. But most of all, he loved the taste of the trout who
were foolish enough to let him catch them.

With a loud splash, Buster plunged into the pool and began to swim
around in search of breakfast. Suddenly he stopped and turned his head,
for he


#Can we make this even more nerdy and fun? (ALWAYS!)
Let's add some custom Text-To-Speech to wrap this notebook up. There are lots of models that allow you to mimic a voice with very few (or even just a single) examples. 
**(This part of the notebook isn't fully implemented yet. I need to make sure the data loading etc for the voice clip functions correctly)**

In [None]:
!git clone https://github.com/coqui-ai/TTS.git
%cd TTS
!pip install -r /content/TTS/requirements.txt

!pip install coqpit

In [None]:
!wget -O /content/voice.wav 'https://github.com/CNielsen94/Random_data_repo/raw/main/media/380339__scottemoil__male-deep-voice-lines-of-dialogue.wav'

In [None]:
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

In [None]:
# generate speech by cloning a voice using default settings
tts.tts_to_file(text=story,
                file_path="output.wav",
                speaker_wav=["/content/voice.wav"],
                language="en",
                split_sentences=True
                )
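
Since the voice clip loading is the shaky part, a quick sanity check on the reference file before handing it to the TTS model might help. Here is a sketch using only the standard library's `wave` module, which reads uncompressed PCM WAV files and raises an error on anything else; `wav_info` is a hypothetical helper.

```python
import wave

def wav_info(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    return frames / rate, rate, channels
```

Calling `wav_info("/content/voice.wav")` should confirm that the download produced a readable audio file and tell you how long the clip is.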