# Simple Demo for KI Seminar

## Content to create

- Podcast script
- Presentation
- Code Demo?

## Sequence - Podcast Script

- Ask for the topic
- Figure out the requirements
  - How long does it need to be (time wise)
    - (google wpm talking speed)
    - calculate the words to time with wpm talking speed
  - Topics (How do they work?, ChatGPT vs open models, ...)
- Gather information on the topic and requirements (google search or wikipedia)
- output information
  - maybe to file even?
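
The word budget mentioned in the list above is simple arithmetic: duration in minutes multiplied by speaking rate. Below is a minimal sketch, assuming a rate of 150 words per minute (the figure used in a prompt further down in this notebook). The function names `words_for_duration` and `estimated_minutes` are illustrative and not part of the notebook code.

```python
def words_for_duration(minutes: float, words_per_minute: int = 150) -> int:
    """Return the approximate number of words needed to fill `minutes` of speech."""
    return round(minutes * words_per_minute)


def estimated_minutes(script: str, words_per_minute: int = 150) -> float:
    """Estimate the reading time of a script by counting whitespace-separated words."""
    return len(script.split()) / words_per_minute


# assumed example: a 20-minute podcast at the default speaking rate
target_words = words_for_duration(20)
print(target_words)
```

The second helper can be used as a rough check on a generated script, since it reverses the same calculation.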


## Interesting ideas / modifications

- could generate an interesting topic inside the domain of large language models


In [None]:
import os
from dotenv import load_dotenv

from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA, LLMChain
from langchain.document_loaders import TextLoader
from langchain.prompts import PromptTemplate
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
import wikipedia
import pickle
import re
import time

load_dotenv()


In [None]:
# setup chatgpt

openai_api_key = os.environ.get('OPENAI_API_KEY')
llm = OpenAI(openai_api_key=openai_api_key, temperature=0.9)  # type: ignore


In [None]:
# choose the topic for which we want to create the podcast
T1 = 'Large Language Models'
T2 = 'Time series analysis'
T3 = 'Face aging'
T4 = 'Colorize'
T5 = 'Recommendation Systems'
T6 = 'Bayesian modelling'
T7 = 'Process Mining'
T8 = 'Voice Recognition'
T9 = 'Dialect in speech recognition'
T10 = 'Transfer Learning in Speech Recognition'
T11 = 'Auto Deep Learning'
T12 = 'Automatic feature extraction'

topic = T1


## Step 1: Gathering information from the lecture notes

We can use ChatGPT to interpret the lecture slides, which list the requirements for the podcast.
By loading this document we can figure out:

- the duration the podcast should have (20 min)
- how many words to use in the podcast script
- which topics should be covered


In [None]:
# document loaders
loader = TextLoader(file_path='../documents/VL1-1-processed-eng.txt', encoding='utf-8')
document_content = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=600, chunk_overlap=0)
split_content = text_splitter.split_documents(document_content)


In [None]:
# create embeddings
if os.path.isfile('../objects/embeddings.pkl'):
    with open('../objects/embeddings.pkl', 'rb') as embeddings_file:
        embeddings = pickle.load(embeddings_file)
else:
    embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)  # type: ignore
embeddings_search = Chroma.from_documents(split_content, embeddings)
embeddings_search


In [None]:
# create prompt template to get usable results
prompt_template_text_document = """
Instruction:
- Use the following pieces of context to answer the question at the end.
- If you don't know the answer output: NULL
- Just answer the question without providing any additional information

Context:
    {context}

Question:
    {question}

Answer:
"""

prompt_template_documents = PromptTemplate(template=prompt_template_text_document, input_variables=['context', 'question'])
chain_type_kwargs = {'prompt': prompt_template_documents}


In [None]:
# create retriever
# usage with prompt templates see: https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html
# to see the source documents set: return_source_documents=True
qa = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=embeddings_search.as_retriever(), chain_type_kwargs=chain_type_kwargs)


In [None]:
# get the duration of the podcast in minutes
query_duration = 'Which time duration should the podcast have?'
res_duration = qa.run(query_duration)
res_duration


In [None]:
# This isn't used later, but it is still an interesting application
prompt_template_text_num_words = """
Instructions:
  - if you are not sure make an estimate
  - output just a number
  - don't use any text in the answer like for example "approximately" or "words"

Context:
  {context}

Question:
  How many words should I use in my podcast script to be able to talk the entire duration if a human speaks at 150 words per minute?

Answer:
"""
prompt_template_num_words = PromptTemplate(template=prompt_template_text_num_words, input_variables=['context'])
num_words_chain = LLMChain(llm=llm, prompt=prompt_template_num_words)


In [None]:

# ask chatgpt how many words to use for the podcast
res_words = num_words_chain.run(context=res_duration)
res_words


In [None]:
def get_num_words(text: str) -> int:
    """
    Extract the number of words from the model output and convert it to an int value.

    :param text: the model output that tells us the number of words to use
    :returns: the last number found in the text, capped at max_tokens (2048);
        max_tokens is also returned if no number could be found
    """
    max_tokens = 2048
    regex = r"\b(\d+[,.]\d+|\d+)\b"
    results: list[str] = re.findall(regex, text, re.MULTILINE)
    if not results:
        return max_tokens

    no_comma = re.sub(r'[,]', '', results[-1])
    float_result = float(no_comma)
    suggestion = round(float_result)
    return max_tokens if suggestion > max_tokens else suggestion


In [None]:
# convert the output from chatgpt to an integer
num_words_to_use = get_num_words(res_words)
num_words_to_use


In [None]:
# query for the topics which should be used in the podcast
query_topics = f'Which topics should be covered in the podcast about {topic}?'
res_topics = qa.run(query_topics)
res_topics


In [None]:
# split the answer into a clean list of sub-topics
# note: inside a character class `|` is a literal pipe, not an alternation
formatted_topics = re.split(r'[,\n?]', res_topics)
formatted_topics = [sub_topic.strip() for sub_topic in formatted_topics if sub_topic.strip()]
formatted_topics


## Step 2: Doing research about the topics

Now that we know the requirements and topics, we can research them.
A variety of tools (including LangChain integrations) could be used for this; here we chose the Wikipedia API.
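
The research step can be tried offline by separating the lookup from the collection logic. Below is a minimal sketch under that assumption: `collect_summaries` and `fake_fetch` are hypothetical names, and `fake_fetch` is only a stub standing in for a real lookup. Unlike the notebook code, this sketch also skips duplicate topics and empty results.

```python
from typing import Callable


def collect_summaries(topics: list[str], fetch: Callable[[str], str]) -> str:
    """Fetch one summary per topic, skipping duplicate topics and empty results."""
    seen: set[str] = set()
    summaries: list[str] = []
    for topic in topics:
        key = topic.strip().lower()
        if not key or key in seen:
            continue
        seen.add(key)
        summary = fetch(topic).strip()
        if summary:
            summaries.append(summary)
    return '\n'.join(summaries)


# Hypothetical stand-in for a real lookup, so the sketch runs without network access.
def fake_fetch(topic: str) -> str:
    known = {'Transformers': 'A transformer is a deep learning architecture.'}
    return known.get(topic, '')


print(collect_summaries(['Transformers', 'transformers', 'Unknown topic'], fake_fetch))
```

Passing the lookup in as a function keeps the collection logic testable; a real lookup function can be supplied in place of the stub.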


In [None]:
def ask_wikipedia(topic: str):
    """
    Ask wikipedia for the summary of the topic.
    Note: results may be inaccurate!

    :param topic: the topic to ask wikipedia for
    :returns: the summary of the topic or an empty string if nothing was found
    """
    time.sleep(0.3)
    search = wikipedia.search(topic)
    if not search:
        return ''

    try:
        return wikipedia.summary(search[0], sentences=5)
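    # an ambiguous title raises DisambiguationError instead of PageError
    except (wikipedia.PageError, wikipedia.DisambiguationError):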
        return ''


In [None]:
wiki_research = '\n'.join([ask_wikipedia(topic) for topic in formatted_topics])
wiki_research


## Step 3: Generating the podcast script

Now that we have done our research about the topic we can generate our podcast script.
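
Because a single completion is limited in length, the script is generated section by section until the model signals that it is done. The control flow can be tried without an API key. Below is a minimal sketch with a hypothetical stub (`fake_generate`) in place of the language model; `generate_in_sections` is an illustrative name, and unlike the notebook's loop it removes the stop word from the final text.

```python
from typing import Callable


def generate_in_sections(generate: Callable[[str], str], max_iterations: int = 8, stop_word: str = '[END]') -> str:
    """Call `generate` with the previous section until it emits the stop word."""
    script = ''
    previous_section = ''
    for _ in range(max_iterations):
        section = generate(previous_section)
        finished = stop_word in section
        # keep the text but drop the stop marker itself
        script += section.replace(stop_word, '')
        previous_section = section
        if finished:
            break
    return script.strip()


# Hypothetical canned output that stands in for the language model.
canned_sections = iter(['Host: Welcome!\n', 'Expert: Thanks for having me.\n', 'Host: Goodbye! [END]'])


def fake_generate(previous_section: str) -> str:
    return next(canned_sections)


print(generate_in_sections(fake_generate))
```

The iteration cap guards against a model that never emits the stop word, which would otherwise loop forever.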


In [None]:
prompt_template_text_script = """
Sub Topics:
  {sub_topics}

Context:
  \"{context}\"

Podcast Participants:
  - Host
  - Expert

Previous Section of the Podcast Script:
  \"{previous_section}\"

Task:
  - Your task is to write a podcast script about \"{topic}\".
  - The Sub Topics refine the main topic and need to be addressed!
  - Use your own knowledge and the knowledge provided in Context if you think it fits the topic.
  - Continue from the previous section and output the new content.
  - If you think you are done output [END]

"""
prompt_template_podcast = PromptTemplate(template=prompt_template_text_script, input_variables=['sub_topics', 'context', 'topic', 'previous_section'])


In [None]:
def run_repeated_chain(chain: LLMChain, max_iterations: int = 8, stop_word: str = '[END]', **chain_kwargs) -> str:
    """
    Run the chain repeatedly until the stop word is found or max_iterations is reached.

    :param chain: the chain to run
    :param max_iterations: the maximum number of iterations to run the chain
    :param stop_word: the word that signals the end of the script
    :param chain_kwargs: the kwargs to pass to the chain
    :returns: the concatenated output of all iterations
    """
    i = 0
    found_end = False
    script = ''
    while not found_end:
        if i >= max_iterations:
            print(f'[WARNING] could not finish task in {max_iterations} iterations')
            break
        output = chain.run(**chain_kwargs)
        script += output
        if 'previous_section' in chain_kwargs:
            chain_kwargs['previous_section'] = output
        found_end = stop_word in output
        i += 1
        print(f'iteration {i} of {max_iterations} finished')

    return script


In [None]:
sub_topics = '\n'.join(f'  - {sub_topic}' for sub_topic in formatted_topics)
sub_topics


In [None]:
podcast_chain = LLMChain(llm=llm, prompt=prompt_template_podcast)

podcast_script = run_repeated_chain(podcast_chain, sub_topics=sub_topics, context=wiki_research, topic=topic, previous_section='')


In [None]:
# write it to a file for later use
with open('../out/podcast_script_wiki.txt', 'w') as script_file:
    script_file.write(podcast_script)


## Using tools and agents

Instead of combining other frameworks and manual lookups as before (Wikipedia), we can use LangChain's integrated tools together with agents to automate the process further.


In [None]:
tools = load_tools(['serpapi'], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)


In [None]:
all_topics = [topic, *formatted_topics]
web_research = []

# use a separate loop variable so the main `topic` is not overwritten
for research_topic in all_topics:
    agent_res = agent.run(f"Task: Do a thorough web research about {research_topic}. Provide at least 3 sentences of information.")
    web_research.append(agent_res)

web_research


In [None]:
formatted_web_research = '\n\n'.join(web_research)
formatted_web_research


In [None]:
podcast_chain = LLMChain(llm=llm, prompt=prompt_template_podcast)

podcast_script_web = run_repeated_chain(podcast_chain, sub_topics=sub_topics, context=formatted_web_research, topic=topic, previous_section='')
podcast_script_web


In [None]:
with open('../out/podcast_script_web.txt', 'w') as script_file:
    script_file.write(podcast_script_web)
