# RAG- and HyDE-Based Question-Answering System on the Lex Fridman Podcast Dataset


### Loading the Data

In [1]:
import pandas as pd
df = pd.read_csv('podcastdata_dataset.csv')

## Analysis of the Data

In [2]:
df.head()

Unnamed: 0,id,guest,title,text
0,1,Max Tegmark,Life 3.0,"As part of MIT course 6S099, Artificial Genera..."
1,2,Christof Koch,Consciousness,As part of MIT course 6S099 on artificial gene...
2,3,Steven Pinker,AI in the Age of Reason,"You've studied the human mind, cognition, lang..."
3,4,Yoshua Bengio,Deep Learning,What difference between biological neural netw...
4,5,Vladimir Vapnik,Statistical Learning,The following is a conversation with Vladimir ...


In [3]:
print(f"There are {df['guest'].nunique()} unique guests and {df['title'].nunique()} unique podcast titles in the dataset")


There are 281 unique guests and 317 unique podcast titles in the dataset


In [4]:
# Average transcript length in characters
average_length_of_a_podcast = df["text"].str.len().mean()
# 2 * (x // 2) rounds the average down to the nearest even integer
print("The average number of characters in a transcript of a podcast is", int(2 * (average_length_of_a_podcast // 2)))

The average number of characters in a transcript of a podcast is 118604


## Methodology


### Converting each podcast transcript to tokens


In [5]:
%pip install transformers

Note: you may need to restart the kernel to use updated packages.


#### Word Tokenizer

In [6]:
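new_column = []
# Work on the first two transcripts only; .copy() avoids pandas' SettingWithCopyWarning
df = df.head(2).copy()
for i in df["text"]:
  # Naive word tokenization: split on single spaces
  tokens = i.split(' ')
  print(tokens)
  new_column.append(tokens)
df['tokens'] = new_column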

['As', 'part', 'of', 'MIT', 'course', '6S099,', 'Artificial', 'General', 'Intelligence,', "I've", 'gotten', 'the', 'chance', 'to', 'sit', 'down', 'with', 'Max', 'Tegmark.', 'He', 'is', 'a', 'professor', 'here', 'at', 'MIT.', "He's", 'a', 'physicist,', 'spent', 'a', 'large', 'part', 'of', 'his', 'career', 'studying', 'the', 'mysteries', 'of', 'our', 'cosmological', 'universe.', 'But', "he's", 'also', 'studied', 'and', 'delved', 'into', 'the', 'beneficial', 'possibilities', 'and', 'the', 'existential', 'risks', 'of', 'artificial', 'intelligence.', 'Amongst', 'many', 'other', 'things,', 'he', 'is', 'the', 'cofounder', 'of', 'the', 'Future', 'of', 'Life', 'Institute,', 'author', 'of', 'two', 'books,', 'both', 'of', 'which', 'I', 'highly', 'recommend.', 'First,', 'Our', 'Mathematical', 'Universe.', 'Second', 'is', 'Life', '3.0.', "He's", 'truly', 'an', 'out', 'of', 'the', 'box', 'thinker', 'and', 'a', 'fun', 'personality,', 'so', 'I', 'really', 'enjoy', 'talking', 'to', 'him.', 'If', "you'd
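
One caveat about `split(' ')`: it cuts at single space characters only, so runs of spaces produce empty tokens, and newlines or tabs stay inside tokens. A minimal sketch of the difference, using a made-up sentence rather than the dataset:

```python
# Hypothetical sample text (not taken from the dataset)
sample = "Deep  learning\nis fun"

# split(' ') cuts at every single space: the double space yields an empty
# token and the newline stays embedded in a token
space_tokens = sample.split(' ')

# split() with no argument cuts on any run of whitespace
whitespace_tokens = sample.split()

print(space_tokens)
print(whitespace_tokens)
```

If the transcripts contain line breaks or doubled spaces, `split()` may give cleaner word tokens.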

### Converting each row of the transcript to chunks of tokens with some overlap

In [7]:
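def chunk_text(tokens, chunk_size, overlap_size, padding=True, padding_type='zero'):
    """Split a list of tokens into overlapping chunks, each returned as one string."""
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap_size
    token_chunks = []
    start_idx = 0
    while start_idx < len(tokens):
        # Extract the tokens of the current chunk
        end_idx = min(start_idx + chunk_size, len(tokens))
        token_chunks.append(list(tokens[start_idx:end_idx]))

        # Stop once a chunk reaches the end, so no overlap-only chunk is emitted
        if end_idx == len(tokens):
            break

        # Move the start index for the next chunk
        start_idx += step

    # Padding: bring the last chunk up to chunk_size, counted in tokens (not characters)
    if padding and token_chunks and len(token_chunks[-1]) < chunk_size:
        missing = chunk_size - len(token_chunks[-1])
        if padding_type == 'zero':
            token_chunks[-1] += ['0'] * missing  # Add zero-padding tokens
        elif padding_type == 'duplicate':
            # Repeat the last token of the previous chunk (or of this chunk if it is the only one)
            source = token_chunks[-2] if len(token_chunks) > 1 else token_chunks[-1]
            token_chunks[-1] += [source[-1]] * missing
        elif padding_type == 'boundary':
            # Needs tokens from the next sentence/document, which this function does not receive
            raise NotImplementedError("'boundary' padding requires the following document's tokens")

    return [" ".join(chunk) for chunk in token_chunks]

# Example usage:
chunk_size = 1000
overlap_size = 30
new_column = []
for tokens in df["tokens"]:
  chunks = chunk_text(tokens, chunk_size, overlap_size)
  new_column.append(chunks)
df['chunks'] = new_column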
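
To see how `chunk_size` and `overlap_size` interact, the sketch below lists the token ranges a sliding window covers. `window_ranges` is a hypothetical helper, not part of the notebook, and it stops once a window reaches the end of the token list.

```python
def window_ranges(n_tokens, chunk_size, overlap_size):
    """Return (start, end) token index pairs for overlapping windows."""
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must be in [0, chunk_size)")
    step = chunk_size - overlap_size  # how far each window advances
    ranges = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        ranges.append((start, end))
        if end == n_tokens:  # the last window already reaches the end
            break
        start += step
    return ranges

# With the notebook's settings, a 2,500-token transcript gives three windows,
# and neighbouring windows share 30 tokens.
print(window_ranges(2500, chunk_size=1000, overlap_size=30))
```

Each window starts `chunk_size - overlap_size` tokens after the previous one, so a larger overlap means more chunks and more embeddings to store.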


### Converting each chunk into a vector using an embedding model

In [8]:
%pip install sentence_transformers

Note: you may need to restart the kernel to use updated packages.


In [9]:
from sentence_transformers import SentenceTransformer

def chunk_to_vector(chunk, model_name_or_path):
    # Load the Sentence Transformers model
    # (this reloads the model on every call; load it once outside when encoding many chunks)
    model = SentenceTransformer(model_name_or_path)

    # Encode the chunk; the result is a NumPy array, so it is returned directly
    chunk_vector = model.encode(chunk)

    return chunk_vector


### Creating a vector space and storing it

In [10]:
import pandas as pd
from sentence_transformers import SentenceTransformer

# Initialize Sentence Transformers model
model_name_or_path = "bert-base-nli-mean-tokens"
model = SentenceTransformer(model_name_or_path)

# Initialize the list of rows; the first entry holds the column names
vector_embeddings = [["guest", "title", "chunk", "chunk_vector"]]

# Assuming df is your dataframe containing guest, title, and chunks
for index, row in df.iterrows():
    guest = row["guest"]
    title = row["title"]
    chunks = row["chunks"]

    for chunk in chunks:
        # Convert each chunk to a vector using Sentence Transformers model
        chunk_vector = model.encode(chunk)

        # Append guest, title, chunk text, and corresponding vector to vector_embeddings list
        vector_embeddings.append([guest, title, chunk, chunk_vector])

# Convert the list of lists to a pandas DataFrame
df_vector_embeddings = pd.DataFrame(vector_embeddings[1:], columns=vector_embeddings[0])

# Save the DataFrame as a CSV file
df_vector_embeddings.to_csv("vector_embeddings.csv", index=False)
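
Writing a NumPy array into a CSV cell stores its printed text form, which then has to be parsed back by hand. One alternative, sketched here with plain Python lists and an in-memory file, is to serialize each vector as JSON. The sample row and its values are made up for illustration; with real embeddings, `chunk_vector.tolist()` would produce the list.

```python
import csv
import io
import json

# Hypothetical row: a tiny stand-in for a real embedding vector
rows = [{"guest": "Max Tegmark", "title": "Life 3.0",
         "chunk": "example chunk text",
         "chunk_vector": [0.12, -0.5, 0.33]}]

# Write: encode the vector as a JSON string so it fits in one CSV cell
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["guest", "title", "chunk", "chunk_vector"])
writer.writeheader()
for row in rows:
    writer.writerow({**row, "chunk_vector": json.dumps(row["chunk_vector"])})

# Read: decode the JSON string back into a list of floats
buffer.seek(0)
restored = [
    {**row, "chunk_vector": json.loads(row["chunk_vector"])}
    for row in csv.DictReader(buffer)
]
print(restored[0]["chunk_vector"])
```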




### Creating a function which will look for the top-k documents which best suit the input prompt

In [11]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_top_k_documents(prompt_vector, vector_database, k=5):
    # Extract chunk vectors from the vector database
    chunk_vectors = [item[3] for item in vector_database]

    # Calculate cosine similarity between the prompt vector and all chunk vectors
    similarities = cosine_similarity([prompt_vector], chunk_vectors)[0]

    # Get indices of top-k documents based on cosine similarity
    top_k_indices = np.argsort(similarities)[::-1][:k]

    return top_k_indices


# vector_embeddings[0] is the header row, so only the data rows are searched
# input_prompt = "Computer Vision"
# prompt_vector = model.encode(input_prompt)
# data_rows = vector_embeddings[1:]
# top_k_indices = retrieve_top_k_documents(prompt_vector, data_rows)
# print("Top-k document indices:", top_k_indices)
# for i in top_k_indices:
#   print(data_rows[i][2])  # chunk text
#   print(data_rows[i][1])  # podcast title
#   print("\n\n")
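
For intuition, the ranking step can be written without scikit-learn: cosine similarity is the dot product of two vectors divided by the product of their norms, and the top-k documents are the k highest scores. A minimal sketch on made-up 2-D vectors (`cosine` and `top_k` are illustrative names, not notebook functions):

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k(query, vectors, k=5):
    """Indices of the k vectors most similar to query, best first."""
    scores = [cosine(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]

# Made-up vectors: index 2 points the same way as the query, index 0 is orthogonal
vectors = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
print(top_k([1.0, 0.0], vectors, k=2))
```

Because cosine similarity ignores vector length, `[2.0, 0.0]` scores the same as `[1.0, 0.0]` would against this query.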



### Implementing the HyDE Algorithm

In [12]:
%pip install openai==0.28




In [13]:
import openai

# Configuration for OpenAI API
openai.api_base = "http://localhost:1234/v1"
openai.api_key = "lm-studio"

# Function to create a chat completion with a dynamic user prompt
def create_chat_completion(history):
    return openai.ChatCompletion.create(
        model="TheBloke/Llama-2-7B-Chat-GGUF",
        messages=history,
        temperature=0.7,
        stream=True,
    )

def better_prompt_generation(user_prompt):
    model_input = "Original prompt: " + user_prompt + "\n\n" + "How can we enrich this question to provide more context and depth?\n\n" + "Just give me your version of the prompt and nothing else!"
    
    # Predefined system message
    system_message = (
        "You are an expert writer. Your task is to enhance the given prompt by adding more context and depth, without altering the original tone."
    )

    history = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": model_input},
    ]

    completion = create_chat_completion(history)
    new_message = {"role": "assistant", "content": ""}

    for chunk in completion:
        if 'content' in chunk.choices[0].delta:
            new_message["content"] += chunk.choices[0].delta.content

    if new_message["content"]:
        history.append(new_message)

    return new_message["content"]
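
As HyDE is usually described, the language model drafts a hypothetical passage that answers the question, and that passage, rather than the raw question, is embedded and matched against the stored chunks. The notebook's variant embeds an enriched version of the prompt instead. The sketch below shows the general flow only: `fake_generate` and the bag-of-words `embed` are toy stand-ins (assumptions for illustration), where the notebook would call the LLM and the Sentence Transformers model.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_retrieve(question, generate, corpus, k=1):
    """Embed a generated pseudo-document and return the k closest corpus texts."""
    pseudo_document = generate(question)   # step 1: draft a hypothetical answer
    query_vector = embed(pseudo_document)  # step 2: embed the draft, not the question
    ranked = sorted(corpus, key=lambda doc: cosine(query_vector, embed(doc)), reverse=True)
    return ranked[:k]                      # step 3: nearest real documents

def fake_generate(question):
    # Stand-in for an LLM call; returns a canned hypothetical answer
    return "neural networks learn representations from data"

corpus = [
    "the universe is expanding and galaxies move apart",
    "deep neural networks learn useful representations from large data sets",
]
print(hyde_retrieve("How does deep learning work?", fake_generate, corpus))
```

The question itself shares only one word with the second corpus entry, while the drafted answer shares several, which is the effect HyDE relies on.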



#### Creating a pseudo document for the input prompt to give it a context

In [14]:
user_prompt = input("Enter your prompt here: ")
output_prompt = better_prompt_generation(user_prompt)
print(output_prompt)

Here is an enriched version of the original prompt:

"Can you tell me more about the history of podcasts and specifically, what was the first podcast ever produced, who created it, and how did it shape the medium as we know it today?"


##### Converting the output_prompt to a vector

In [15]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer
import pandas as pd

model_name_or_path = "bert-base-nli-mean-tokens"

model = SentenceTransformer(model_name_or_path)

output_prompt_vector = model.encode(output_prompt)

ve = pd.read_csv('vector_embeddings.csv')

chunks_with_similarity_indices = []

for index, row in ve.iterrows():
    guest = row['guest']
    title = row['title']
    chunk = row['chunk']
    # Convert the string representation of the chunk vector to a numpy array
    vector = np.fromstring(row['chunk_vector'][1:-1], sep=' ')
    # Calculate cosine similarity between the chunk vector and the output prompt vector
    similarity = cosine_similarity(vector.reshape(1, -1), output_prompt_vector.reshape(1, -1))
    # Append similarity score, chunk text, guest, and title to the result list
    chunks_with_similarity_indices.append([similarity[0][0], chunk, guest, title])

# Sort by similarity score only (highest first), so ties never fall back to comparing text
chunks_with_similarity_indices.sort(key=lambda item: item[0], reverse=True)

# Keep the six most similar chunks
chunks_with_similarity_indices = chunks_with_similarity_indices[:6]




### Fetching the top-k documents most similar to the vector of the pseudo document

In [16]:
chunks_with_similarity_indices = chunks_with_similarity_indices[:6]

### Generating a final prompt using prompt structures, input prompt and the top k documents fetched

In [17]:
final_prompt = (
    user_prompt + "\nAbove is the user prompt\n\n"
    + output_prompt + "\n Above is a better version of the prompt\n\n"
    + "Below are the related chunks of data : \n\n"
)
# Each entry is [similarity, chunk, guest, title]; only the chunk text is appended
for i in chunks_with_similarity_indices:
  final_prompt += i[1] + "\n\n"

final_prompt += "\n\n" + "Based on the input prompt, output prompt and the relevant chunks of data, answer the input prompt"
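
With 1,000-word chunks, six retrieved chunks can make the final prompt very long, possibly longer than the model's context window. One way to guard against that, sketched under the assumption that a word count is an acceptable rough proxy for tokens, is to add chunks only while a budget allows. `build_prompt` and `max_words` are hypothetical names, and the template strings are simplified.

```python
def build_prompt(question, chunks, max_words=3000):
    """Join the question with as many chunks as fit in a rough word budget."""
    header = question + "\n\nBelow are the related chunks of data:\n\n"
    footer = "\n\nBased on the chunks above, answer the question."
    used = len(header.split()) + len(footer.split())
    kept = []
    for chunk in chunks:  # chunks are assumed to be sorted best-first
        words = len(chunk.split())
        if used + words > max_words:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += words
    return header + "\n\n".join(kept) + footer

# Tiny demonstration with made-up chunks and a small budget
demo = build_prompt("What is HyDE?", ["alpha beta gamma", "delta " * 50], max_words=30)
print(demo)
```

Here the first chunk fits and the 50-word second chunk is dropped; a real budget would depend on the model in use.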


In [18]:
print(final_prompt)

Give me some information about the first podcast
Above is the user prompt

Here is an enriched version of the original prompt:

"Can you tell me more about the history of podcasts and specifically, what was the first podcast ever produced, who created it, and how did it shape the medium as we know it today?"
 Above is a better version of the prompt

Below are the related chunks of data : 

little part. And amazingly, they're basically the same part. Yeah, it's almost like our world was created for, I mean, they kind of come together. Yeah, well, you could say maybe where the world was created for us, but I have a more modest interpretation, which is that the world was created for us, but I have a more modest interpretation, which is that instead evolution endowed us with neural networks precisely for that reason. Because this particular architecture, as opposed to the one in your laptop, is very, very well adapted to solving the kind of problems that nature kept presenting our ancestor

### Generating the final response to the final prompt using a language model

In [19]:
def Final_Answer_generator(history, final_prompt):
    # Predefined system message
    system_message = (
        "Consider the prompts and data provided as a comprehensive resource to craft an accurate response. Feel free to extract relevant details or modify the prompts for clarity." 
    )

    history.append({"role": "system", "content": system_message})
    history.append({"role": "user", "content": final_prompt})

    completion = create_chat_completion(history)
    new_message = {"role": "assistant", "content": ""}

    for chunk in completion:
        if 'content' in chunk.choices[0].delta:
            new_message["content"] += chunk.choices[0].delta.content

    if new_message["content"]:
        history.append(new_message)

    return new_message["content"]

# Example usage:
history = []  # Initialize history as an empty list
Final_Output = Final_Answer_generator(history, final_prompt)
print(Final_Output)

KeyboardInterrupt: 
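
Both `better_prompt_generation` and `Final_Answer_generator` accumulate streamed pieces with the same loop, which could be factored into one helper. A minimal sketch, using plain dictionaries as stand-ins for the streamed chunk objects (an assumption for illustration; the real objects come from the OpenAI client):

```python
def collect_stream(chunks):
    """Concatenate the 'content' pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        # Some deltas carry only a role or are empty; skip those
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)

# Made-up stream: a role-only delta first, then content pieces, then an empty delta
fake_stream = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", world"}}]},
    {"choices": [{"delta": {}}]},
]
print(collect_stream(fake_stream))
```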

In [20]:
%pip install python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [21]:
import os
from dotenv import load_dotenv
from langchain.llms import OpenAI

# Load environment variables from .env file
load_dotenv()

# Get the OpenAI API key from the environment variable
api_key = os.getenv("OPEN_API_KEY")
llm = OpenAI(openai_api_key=api_key, temperature=0)
text = final_prompt
print(llm.predict(text))





 with a summary of the conversation.

The input prompt was: "What is your take on the Fermi Paradox? Do you think intelligent life exists elsewhere in the universe?"

Summary:

Dr. Wolfram discussed the Fermi Paradox, stating that it's possible that we are alone in the universe and that there might not be any other intelligent life forms out there. He mentioned that if we assume a uniform distribution of intelligent life across the universe, then the probability of finding another civilization within 10^16 meters is extremely low. However, he also emphasized the importance of being responsible with our existence as the only known intelligent life form in the universe.

Regarding the possibility of AGI (Artificial General Intelligence), Dr. Wolfram expressed skepticism about the idea that there might be AGI already existing in cellular automata or other systems without us noticing it. He believes that if AGI were to exist, we would soon notice it due to its immense capabilities.

The co

In [22]:
dummy_prompt = user_prompt + "\nAbove is the user prompt\n\n" + "Below are the related chunks of data : \n\n"
for i in chunks_with_similarity_indices:
  dummy_prompt += i[1] + "\n\n"

dummy_prompt += "\n\n" + "Based on the input prompt, output prompt and the relevant chunks of data, answer the input prompt"

print(dummy_prompt)

Give me some information about the first podcast
Above is the user prompt

Below are the related chunks of data : 

little part. And amazingly, they're basically the same part. Yeah, it's almost like our world was created for, I mean, they kind of come together. Yeah, well, you could say maybe where the world was created for us, but I have a more modest interpretation, which is that the world was created for us, but I have a more modest interpretation, which is that instead evolution endowed us with neural networks precisely for that reason. Because this particular architecture, as opposed to the one in your laptop, is very, very well adapted to solving the kind of problems that nature kept presenting our ancestors with. So it makes sense that why do we have a brain in the first place? It's to be able to make predictions about the future and so on. So if we had a sucky system, which could never solve it, we wouldn't have a world. So this is, I think, a very beautiful fact. Yeah. We als

In [23]:
text= dummy_prompt
print(llm.predict(text))

 with a summary of the conversation.

The input prompt was: "What is your take on the Fermi Paradox? Do you think intelligent life exists elsewhere in the universe?"

Summary:

Dr. Wolfram discussed the Fermi Paradox, stating that it's possible that advanced civilizations may have already gone extinct or are too far away to detect. He also mentioned that there could be a false sense of security due to the possibility of other life forms coming to rescue us if we make mistakes.

Regarding intelligent life existing elsewhere in the universe, Dr. Wolfram believes that it's unlikely that we're alone, given the vast number of Earth-like planets and the probability of life arising on at least one of them. However, he thinks that advanced civilizations may be extremely rare due to the difficulty of achieving a high level of intelligence.

Dr. Wolfram also touched upon the idea that there might already be intelligent systems in existence, but we just don't know how to communicate with them or 

In [28]:
print(text)




In [27]:
# Drop the first line of the prompt (a variable named `list` would shadow the built-in)
lines = text.split("\n")[1:]
text = "".join(line + "\n" for line in lines)
print(text)




In [29]:
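import subprocess
import sys

import streamlit as st

# Function to run Vector_Space_Creation.py
def create_vector_space():
    st.write("Creating vector space...")
    # Use the current interpreter and report a failing script instead of ignoring it
    result = subprocess.run([sys.executable, "Vector_Space_Creation.py"])
    if result.returncode == 0:
        st.write("Vector space creation completed.")
    else:
        st.write("Vector space creation failed.")

# Streamlit interface (save as a script and launch it with `streamlit run`, not from a notebook cell)
st.title("Vector Space Creation")

# Button to trigger vector space creation
if st.button("Create Vector Space"):
    create_vector_space()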

