# Demo _Seminar Aktuelle Themen der KI_

This is a demo for the course _Seminar Aktuelle Themen der KI_, KI-B-6, TH Deggendorf for the topic "T1: Large Language Models".
We are going to use ChatGPT and the tool suite of [LangChain](https://python.langchain.com/en/latest/index.html) to build an application that would otherwise be either impossible or very tedious to build.

For this demo we are going to build a system that fully automates one of the subtasks of this course: creating a podcast script.
We will use our lecture slides to extract the requirements and the topic, search Wikipedia and the web for current information, and use the results to create our podcast script.

This is done to show three things:

- How powerful the tool sets have become, automating a wide array of problems once thought to be impossible to solve
- The problem the educational sector faces with examination of the students
- The advantage people who have access to such tools (GPT4, ChatGPT Plugins) will have over those who don't

The second and third points are especially problematic because OpenAI is working on a [plugin system](https://openai.com/blog/chatgpt-plugins) that will be available to the average user and will be able to do more or less what is shown here.

> All personal information in the lecture script was removed for privacy reasons.

> This demo is inspired by the work done on [Auto GPT](https://github.com/Significant-Gravitas/Auto-GPT)


In [None]:
import os
from dotenv import load_dotenv

from langchain.llms import OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA, LLMChain
from langchain.document_loaders import TextLoader
from langchain.prompts import PromptTemplate
from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType
import wikipedia
import pickle
import re
import time

load_dotenv()


In [None]:
# setup chatgpt

openai_api_key = os.environ.get('OPENAI_API_KEY')
llm = OpenAI(openai_api_key=openai_api_key, temperature=0.9)  # type: ignore


In [None]:
# choose the topic for which we want to create the podcast
T1 = 'Large Language Models'
T2 = 'Time series analysis'
T3 = 'Face aging'
T4 = 'Colorize'
T5 = 'Recommendation Systems'
T6 = 'Bayesian modelling'
T7 = 'Process Mining'
T8 = 'Voice Recognition'
T9 = 'Dialect in speech recognition'
T10 = 'Transfer Learning in Speech Recognition'
T11 = 'Auto Deep Learning'
T12 = 'Automatic feature extraction'

topic = T1


## Step 1: Gathering information from the lecture notes

We can use ChatGPT to interpret the lecture slides, which list the requirements for the podcast.
By loading this document we can extract the information it contains, such as:

- the duration the podcast should have (20 min)
- how many words to use in the podcast script
- which topics should be covered

> In the end we do not use the calculated number of words because it worsens the quality of the output. Experimenting with other prompts might reveal a good way to incorporate it.
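
The word budget is plain arithmetic, so it can also be computed without asking the model. The following is a minimal sketch, assuming the first number in the duration answer is the length in minutes and a speaking rate of 150 words per minute. `words_for_duration` is a hypothetical helper and not part of the demo:

```python
import re


def words_for_duration(duration_text: str, words_per_minute: int = 150) -> int:
    """Estimate the script length in words from an answer such as '20 minutes'."""
    # Assumption: the first number in the text is the duration in minutes.
    match = re.search(r'\d+(?:\.\d+)?', duration_text)
    if match is None:
        raise ValueError(f'no duration found in {duration_text!r}')
    return round(float(match.group()) * words_per_minute)


print(words_for_duration('The podcast should be 20 minutes long.'))  # 3000
```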


In [None]:
# document loaders
loader = TextLoader(file_path='../documents/VL1-1-processed-eng.txt', encoding='utf-8')
document_content = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=600, chunk_overlap=0)
split_content = text_splitter.split_documents(document_content)
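
To make the chunking step less of a black box, here is a simplified sketch of what a character-based splitter does: split on a separator, then merge the pieces until the size limit is reached. It is an illustration only, not LangChain's implementation. `split_into_chunks` and the blank-line separator are assumptions made for this example:

```python
def split_into_chunks(text: str, chunk_size: int = 600, separator: str = '\n\n') -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks: list[str] = []
    current = ''
    for piece in text.split(separator):
        piece = piece.strip()
        if not piece:
            continue
        candidate = f'{current}{separator}{piece}' if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole, so it can exceed the limit.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = 'Requirements\n\nThe podcast lasts 20 minutes.\n\nTopics are listed below.'
print(split_into_chunks(sample, chunk_size=45))
```

Smaller chunks make retrieval more precise, but each chunk then carries less surrounding context.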


In [None]:
# create embeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)  # type: ignore
embeddings_search = Chroma.from_documents(split_content, embeddings)
embeddings_search
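
The vector store answers a question by embedding it and returning the chunks whose vectors are closest to it. The sketch below shows that idea with a toy bag-of-words vector and cosine similarity from the standard library. The `embed` function is a stand-in assumption, since the real embeddings come from the OpenAI API, and `most_similar` is a hypothetical helper:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are closest to the query vector."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine_similarity(query_vector, embed(chunk)), reverse=True)
    return ranked[:k]


chunks = ['the podcast should last 20 minutes', 'slides are due on friday']
print(most_similar('how long should the podcast last', chunks))
```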


In [None]:
# create prompt template to get usable results
prompt_template_text_document = """
Instruction:
- Use the following pieces of context to answer the question at the end.
- If you don't know the answer output: NULL
- Just answer the question without providing any additional information

Context:
    {context}

Question:
    {question}

Answer:
"""

prompt_template_documents = PromptTemplate(template=prompt_template_text_document, input_variables=['context', 'question'])
chain_type_kwargs = {'prompt': prompt_template_documents}


In [None]:
# create retriever
# usage with prompt templates see: https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa.html
# to see the source documents set: return_source_documents=True
qa = RetrievalQA.from_chain_type(llm=llm, chain_type='stuff', retriever=embeddings_search.as_retriever(), chain_type_kwargs=chain_type_kwargs)
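
The `stuff` chain type places all retrieved chunks into the `{context}` slot of the prompt and sends a single request. Below is a minimal sketch of that assembly step in plain Python. `build_stuff_prompt` is a hypothetical helper, and the blank-line separator between chunks is an assumption:

```python
QA_TEMPLATE = """Context:
{context}

Question:
{question}

Answer:"""


def build_stuff_prompt(retrieved_chunks: list[str], question: str) -> str:
    """Join all retrieved chunks and fill them into a single prompt."""
    context = '\n\n'.join(chunk.strip() for chunk in retrieved_chunks)
    return QA_TEMPLATE.format(context=context, question=question)


print(build_stuff_prompt(['The podcast lasts 20 minutes.', 'Two speakers are required.'],
                         'Which time duration should the podcast have?'))
```

Because everything ends up in one prompt, this approach only works while the retrieved chunks fit into the model's context window.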


In [None]:
# get the duration of the podcast in minutes
query_duration = 'Which time duration should the podcast have?'
res_duration = qa.run(query_duration)
res_duration


In [None]:
# This isn't used later, but it is still an interesting application
prompt_template_text_num_words = """
Instructions:
  - if you are not sure make an estimate
  - output just a number
  - don't use any text in the answer, for example "approximately" or "words"

Context:
  {context}

Question:
  How many words should I use in my podcast script to be able to talk the entire duration if a human speaks at 150 words per minute?

Answer:
"""
prompt_template_num_words = PromptTemplate(template=prompt_template_text_num_words, input_variables=['context'])
num_words_chain = LLMChain(llm=llm, prompt=prompt_template_num_words)


In [None]:

# ask chatgpt how many words to use for the podcast
res_words = num_words_chain.run(context=res_duration)
res_words


In [None]:
def get_num_words(text: str) -> int:
    """
    Extract the number of words from the model output and convert it to an int.

    :param text: the model output that tells us the number of words to use
    :returns: the last number found in the text, capped at max_tokens (2048);
        max_tokens is also returned if no number could be found
    """
    max_tokens = 2048
    regex = r"\b(\d+[,.]\d+|\d+)\b"
    results: list[str] = re.findall(regex, text, re.MULTILINE)
    if not results:
        return max_tokens

    no_comma = re.sub(r'[,]', '', results[-1])
    float_result = float(no_comma)
    suggestion = round(float_result)
    return max_tokens if suggestion > max_tokens else suggestion


In [None]:
# convert the output from chatgpt to an integer
num_words_to_use = get_num_words(res_words)
num_words_to_use


In [None]:
# query for the topics which should be used in the podcast
query_topics = f'Which topics should be covered in the podcast about {topic}?'
res_topics = qa.run(query_topics)
res_topics


In [None]:
# format the topics in a way to be interpretable
# split on commas, pipes, newlines and question marks, then drop empty entries
formatted_topics = [part.strip() for part in re.split(r'[,|\n?]', res_topics) if part.strip()]
formatted_topics


## Step 2: Doing research about the topics

As ChatGPT has a training cutoff in September 2021, there may be more recent information that is relevant to the topic.
We want to include it to get the best possible and most up-to-date result.

Now that we know the requirements and topics, we can research them.
For this we can use a variety of tools (including LangChain integrations). We start with Wikipedia.


In [None]:
def ask_wikipedia(topic: str) -> str:
    """
    Ask Wikipedia for the summary of the topic.
    Note: results may be inaccurate!

    :param topic: the topic to ask Wikipedia for
    :returns: the summary of the topic or an empty string if nothing was found
    """
    time.sleep(0.3)  # short pause between requests
    search = wikipedia.search(topic)
    if not search:
        return ''

    try:
        return wikipedia.summary(search[0], sentences=5)
    except (wikipedia.PageError, wikipedia.DisambiguationError):
        # no page was found or the title is ambiguous
        return ''


In [None]:
# get all wiki research for the topics and all subtopics
wiki_research = '\n'.join([ask_wikipedia(topic) for topic in formatted_topics])
wiki_research


## Generate podcast script

Now that we have done our research on the topic, we can generate our podcast script. The output length of the model varies and the number of output tokens per request is capped (a default maximum of about 2048 and an absolute maximum of about 4096 tokens, depending on the model and its settings), so we may need multiple runs to get a full script.
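
Whether the research context and the previous section still fit into one request can be estimated before calling the model. The sketch below uses the rough rule of thumb of about four characters per token for English text. This is only a heuristic, and both the helper names and the limits passed in are assumptions for illustration. An exact count requires the model's tokenizer.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; a real tokenizer gives exact numbers."""
    return round(len(text) / chars_per_token)


def fits_in_context(prompt: str, max_output_tokens: int, context_window: int) -> bool:
    """Check whether the prompt plus the reserved output tokens fit into the context window."""
    return estimate_tokens(prompt) + max_output_tokens <= context_window


prompt = 'Large language models ' * 100
print(estimate_tokens(prompt), fits_in_context(prompt, max_output_tokens=1024, context_window=4096))
```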


In [None]:
prompt_template_text_script = """
Sub Topics:
  {sub_topics}

Context:
  \"{context}\"

Podcast Participants:
  - Host
  - Expert

Previous Section of the Podcast Script:
  \"{previous_section}\"

Task:
  - Your task is to write a podcast script about \"{topic}\".
  - The Sub Topics refine the main topic and need to be addressed!
  - Use your own knowledge and the knowledge provided in Context if you think it fits the topic.
  - Continue from the previous section and output the new content.
  - If you think you are done output [END]

"""
prompt_template_podcast = PromptTemplate(template=prompt_template_text_script, input_variables=['sub_topics', 'context', 'topic', 'previous_section'])


In [None]:
def run_repeated_chain(chain: LLMChain, max_iterations: int = 8, stop_word: str = '[END]', **chain_kwargs) -> str:
    """
    Run the chain repeatedly until the stop word is found or max_iterations is reached.

    :param chain: the chain to run
    :param max_iterations: the maximum number of iterations to run the chain
    :param stop_word: the word that signals the end of the script
    :param chain_kwargs: the kwargs to pass to the chain
    :returns: the concatenated output of all iterations
    """
    i = 0
    found_end = False
    script = ''
    while not found_end:
        if i >= max_iterations:
            print(f'[WARNING] could not finish task in {max_iterations} iterations')
            break
        output = chain.run(**chain_kwargs)
        script += output
        if 'previous_section' in chain_kwargs:
            chain_kwargs['previous_section'] = output
        found_end = stop_word in output
        i += 1
        print(f'iteration {i} of {max_iterations} finished')

    return script
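
The loop can be tried out without spending API credits by replacing the chain with a stub. The sketch below is a stripped-down variant of the same idea that accepts any callable instead of an `LLMChain`. `run_until_stop` and `fake_model` are hypothetical names used only for this illustration:

```python
from typing import Callable


def run_until_stop(step: Callable[[str], str], max_iterations: int = 8, stop_word: str = '[END]') -> str:
    """Call step(previous_section) repeatedly until the stop word appears or the limit is hit."""
    script = ''
    previous_section = ''
    for _ in range(max_iterations):
        output = step(previous_section)
        script += output
        previous_section = output
        if stop_word in output:
            break
    return script


# Stub that pretends to be the model: it returns three canned sections.
sections = iter(['Host: Welcome! ', 'Expert: Thanks for having me. ', 'Host: Goodbye! [END]'])


def fake_model(previous_section: str) -> str:
    return next(sections)


print(run_until_stop(fake_model))
```

Note that, as in the loop above, the stop word itself stays in the returned script and may need to be removed before publishing.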


In [None]:
sub_topics = '\n'.join(f'  - {sub_topic}' for sub_topic in formatted_topics)
sub_topics


In [None]:
podcast_chain = LLMChain(llm=llm, prompt=prompt_template_podcast)

podcast_script = run_repeated_chain(podcast_chain, sub_topics=sub_topics, context=wiki_research, topic=topic, previous_section='')


In [None]:
# write it to a file for later use ('w' overwrites the file, so reruns do not append duplicates)
with open('../out/podcast_script_wiki.txt', 'w') as script_file:
    script_file.write(podcast_script)


## Using tools and agents

What we did before with manual methods and separate libraries (Wikipedia), we can automate further by combining integrated tools with agents.

The following example shows how to use current web search results to get the most up-to-date information.


In [None]:
tools = load_tools(['serpapi'], llm=llm)
agent = initialize_agent(tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True)
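
Under the hood, a ReAct-style agent alternates between asking the model what to do and running the chosen tool, until the model produces a final answer. The toy loop below illustrates that pattern with a scripted stand-in for the model and a fake search tool. It is a sketch of the general idea, not LangChain's implementation. `run_react_loop`, `scripted_llm` and `fake_search` are hypothetical:

```python
import re
from typing import Callable


def run_react_loop(llm: Callable[[str], str], tools: dict[str, Callable[[str], str]],
                   question: str, max_steps: int = 5) -> str:
    """Alternate between model output and tool calls until a final answer appears."""
    scratchpad = ''
    for _ in range(max_steps):
        reply = llm(f'Question: {question}\n{scratchpad}')
        final = re.search(r'Final Answer:\s*(.+)', reply, re.DOTALL)
        if final:
            return final.group(1).strip()
        action = re.search(r'Action:\s*(.+?)\nAction Input:\s*(.+)', reply, re.DOTALL)
        if action is None:
            raise ValueError(f'could not parse model output: {reply!r}')
        tool_name, tool_input = action.group(1).strip(), action.group(2).strip()
        observation = tools[tool_name](tool_input)
        # The observation is fed back to the model in the next round.
        scratchpad += f'{reply}\nObservation: {observation}\n'
    raise RuntimeError(f'no final answer after {max_steps} steps')


def fake_search(query: str) -> str:
    return f'(search results for "{query}")'


def scripted_llm(prompt: str) -> str:
    # First round: request a search. Once an observation is present: answer.
    if 'Observation:' not in prompt:
        return 'Thought: I should look this up.\nAction: Search\nAction Input: large language models'
    return 'Thought: I have enough information.\nFinal Answer: LLMs are neural networks trained on text.'


print(run_react_loop(scripted_llm, {'Search': fake_search}, 'What are large language models?'))
```

Every round is a separate model call, which is why agent runs take noticeably longer and cost more than a single prompt.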


In [None]:
# do a web research for all topics
all_topics = [topic, *formatted_topics]
web_research = []

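# use a separate loop variable so the global `topic` is not overwritten before the script is generated
for research_topic in all_topics:
    agent_res = agent.run(f"Task: Do a thorough web research about {research_topic}. Provide at least 3 sentences of information.")
    web_research.append(agent_res)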

web_research


In [None]:
formatted_web_research = '\n\n'.join(web_research)
formatted_web_research


In [None]:
# generate the podcast script like before just now with the serpapi agent web research
podcast_chain = LLMChain(llm=llm, prompt=prompt_template_podcast)

podcast_script_web = run_repeated_chain(podcast_chain, sub_topics=sub_topics, context=formatted_web_research, topic=topic, previous_section='')
podcast_script_web


In [None]:
with open('../out/podcast_script_web.txt', 'w') as script_file:
    script_file.write(podcast_script_web)


## Ending Thoughts

### Prompt Engineering

While doing my research, testing various models (closed and open source) and programming the demo, I noticed how important it is to choose the right prompt.
This became very clear when I tried the models from [GPT4All](https://github.com/nomic-ai/gpt4all) without a prompt template: the model is not able to do anything useful without a suitable prompt.
It does output text, but none that makes sense. To solve such problems we need to use trial and error and look for patterns in the dataset.
The model stands and falls with the right prompts and prompt templates. The field of [Prompt Engineering](https://en.wikipedia.org/wiki/Prompt_engineering) is no longer a laughing matter but reality.

### API Costs

As things stand, most of the usable LLMs are hidden behind a paywall. This would be bearable if we only had to pay for one service, but as your requirements grow it would not be surprising to end up subscribed to several APIs. Running applications in an "AutoGPT manner" in particular will drain your wallet very quickly. Also make sure you have a credit card, otherwise you won't be able to subscribe to almost any API.
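
To get a feeling for how quickly repeated runs add up, the cost of a run can be estimated from its token counts. The sketch below uses a made-up price per 1,000 tokens and made-up token counts, so check the provider's current price list before relying on any number. `estimate_cost` is a hypothetical helper:

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int, price_per_1k_tokens: float) -> float:
    """Estimate the cost of one request when prompt and completion share one price."""
    return (prompt_tokens + completion_tokens) / 1000 * price_per_1k_tokens


# Placeholder numbers: 8 iterations with 3,000 prompt and 1,000 completion tokens each.
PRICE_PER_1K = 0.02  # hypothetical price in USD, not an official figure
total = sum(estimate_cost(3000, 1000, PRICE_PER_1K) for _ in range(8))
print(f'estimated total: ${total:.2f}')
```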

### Privacy

There are serious privacy concerns when using ChatGPT to work with your private data.
Not only can you be sure that the provider will know your most private information (which is sadly reality nowadays). With LLMs it is also easier than ever to analyze large amounts of unstructured data, which makes it harder to stay hidden in the masses. Another concern is that this data will probably end up in the training data, and god help you if another user gets output derived from your private data.

### Open source models

[ChatBot Arena](https://chat.lmsys.org/?arena)

As _GPT3.5-turbo_ and _GPT4_ dominate the current market of large language models, the open source variants are increasingly rising to the occasion.
![ChatGPT, Bard, Vicuna13B performance](https://lmsys.org/images/blog/vicuna/chart.svg)
We can see that especially the _Vicuna-13B_ model reaches an astounding [performance of 92%](https://lmsys.org/blog/2023-03-30-vicuna/) compared to the _GPT3.5-turbo_ model, in an evaluation judged by GPT-4.
This is especially astounding if we look at the number of parameters of each model:

- _GPT3.5-turbo_: 175B (the published size of GPT-3; OpenAI has not disclosed the size of GPT3.5-turbo)
- _Vicuna-13B_: 13B

As close as the results seem, open source models currently face several problems:

1. Training resources
   The first one is obvious: the training cost involved. For an average user it is nearly impossible to train even a 13B parameter model. You may be able to do some fine-tuning, but that is about it. So the community relies on a few individuals or companies who do not have as many training resources as the monetized competitors.
   To address this problem there is a very interesting open source project named [petals](https://github.com/bigscience-workshop/petals), which tries its hand at running and fine-tuning large models in a decentralized way. But it is still young and we have yet to see what is to come.

2. Training datasets
   This is another simple one. Even though there are open source datasets for training large language models, they don't come close to the ones kept at companies like Google, Microsoft, Facebook and OpenAI.
   To solve this problem the open source community found a very easy but interesting approach: they use ChatGPT output to train their models. On the other hand, this means that models trained exclusively with that method cannot become better than their trainer.

3. Imitating output vs understanding output
   The big problem noticed with open source models is that there seems to be a gap between imitating and understanding the output. As established above, many open source models are trained on ChatGPT inputs and outputs.
   Leaving the other problems aside, they seem to merely imitate ChatGPT, which is a big problem for transfer, zero-shot or one-shot learning.
   To address this, other approaches used both _GPT3.5-turbo_ and _GPT4_ and instructed them to also explain their reasoning process, which dramatically improved performance.

But there is very good news for us if we can trust the contents of a [leaked memo by Google](https://www.semianalysis.com/p/google-we-have-no-moat-and-neither). The memo states that there is a high probability that open source models will close the gap in the near future, both to Google's own models and even to those of OpenAI.

### Interesting Tools

- [AutoGPT](https://github.com/Significant-Gravitas/Auto-GPT)
  - AI agent doing everything possible to achieve whichever goal you give it
- [perplexity.ai](https://www.perplexity.ai/)
  - free AI search engine which uses and shows its sources
- [privateGPT](https://github.com/imartinez/privateGPT)
  - Document search engine which uses just private and open source models and technology
- [PentestGPT](https://github.com/GreyDGL/PentestGPT)
  - Automatic penetration testing / hacking tool. Similar to AutoGPT
- [MiniGPT-4](https://github.com/Vision-CAIR/MiniGPT-4)
  - open source chatbot with vision / image understanding
- [petals](https://github.com/bigscience-workshop/petals)
  - open source distributed inference and fine-tuning of a BLOOM 176B parameter model
- [GPT4All](https://github.com/nomic-ai/gpt4all)
  - collection of open source LLMs

### Implications for the average user

As we have seen in this example, and even more so when looking at the [interesting tools](#interesting-tools), applications leveraging the power of LLMs are rapidly becoming more powerful and sophisticated.
Once the public exposure, cost, variety and usability of LLMs improve, we will see a new era of applications which will change our lives forever.
We should start to review the systems that are in place and adapt them.
Take the educational sector, for example. We are still using the same methods of learning that were used 100 years ago. Even though we have the technology to make learning more interactive and fun, we still rely on the same old methods.
At the same time there is also a dark side to this: problems with privacy, people relying fully on ChatGPT and no longer thinking for themselves, and many more.

Rather than complain, we should do the following things:

- Talk about the capabilities of such tools
- Talk about the risks and disadvantages
- Talk about how to incorporate such tools in our lives
- Talk about boundaries and limits

Until this is done, we cannot create a promising strategy for solving the problems we are facing.
