# 1.1 Building AI-powered Tools: LangChain & OpenAI

In this series I will be exploring how we can build powerful AI applications simply & quickly using LangChain and OpenAI.

Every use case detailed is designed to make your life easier, help you find answers quicker, & ultimately give you more headspace to be more creative in your day-to-day.

These workbooks are designed to help you understand how to build your own AI tools. I will endeavour to explain how things work along the way so that you can apply them in different ways, rather than just copying and pasting the codebase.

Join me on my journey to the 4 day work week.

Note, I have built this using Jupyter Notebooks. If you have a Mac, you can access Jupyter Notebooks by downloading Anaconda. Learn how to do this here: https://www.youtube.com/watch?v=PM60D-Pg890&ab_channel=ProgrammingKnowledge

Or, you can use Google Colab: https://colab.research.google.com/


# Use case 1: Using LangChain & OpenAI to QA & Summarise YouTube videos


Today, we will be using LangChain to run OpenAI queries and custom prompts on any YouTube video.

To kick off: what is LangChain?


## LangChain is your Swiss Army knife for creating LLM applications



LangChain is a framework that helps create applications that use language models. 

It has two important qualities:

1. Data-aware: It allows the language model to connect with other sources of information or data. This means the model can access and use data from different places to make better decisions or provide more accurate answers.

2. Agentic: It enables the language model to interact with its environment. This means the model can actively engage with its surroundings, understand instructions, and perform tasks based on that understanding.

The main benefits of LangChain are:

Components: LangChain provides tools that make it easy to work with language models. These tools are like building blocks that can be used together or separately. They are designed to be flexible and user-friendly, whether you're using the entire LangChain framework or just specific components.

Off-the-shelf chains: LangChain offers pre-designed combinations of components that are ready to use for specific tasks. These pre-designed chains make it simple to get started quickly. If you have more complex requirements or specific needs, you can easily customize existing chains or build your own using the available components.

In simpler terms, LangChain is a toolbox that helps create applications using language models. It allows the model to use data from different sources and interact with its surroundings. With LangChain, you can use pre-designed tools or create custom solutions to accomplish specific tasks.

For this, you will need your OpenAI API key. You can create one here: https://platform.openai.com/account/api-keys

Protect yourself by not sharing this API key and by setting rate limits in case it is stolen. You can do that here: https://platform.openai.com/account/rate-limits

N.B. Throughout this walkthrough I will be using OpenAI's gpt-3.5-turbo model. There are others available; however, I recommend you use this one. Check out all the models here: https://platform.openai.com/account/rate-limits
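One way to keep the key out of the notebook itself is to read it from the environment and only prompt for it when it is missing. The following is a minimal sketch using only the standard library; the helper name `get_openai_key` is my own invention for illustration and is not part of LangChain or the OpenAI library.

```python
import os
from getpass import getpass


def get_openai_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, prompting once if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value, so the key never shows up in cell output
        key = getpass(f"Enter {env_var}: ")
        # store it for the rest of the session so later cells can read os.environ
        os.environ[env_var] = key
    return key
```

Calling `get_openai_key()` once near the top of the notebook means the key is never typed into a cell that might later be shared.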

# LET'S BUILD!

## Load in the Libraries & Download the YouTube Transcript

A Python library is like a collection of pre-built tools or resources that can help make programming in Python easier and more efficient. It's similar to a toolbox filled with different tools that you can use to perform specific tasks.

In [1]:
#import libraries
from langchain.document_loaders import YoutubeLoader #youtube loader
from langchain.llms import OpenAI #openAI access
import os #use this to load your openAI API into functions

In [1]:
#set your OpenAI key as an environment variable
#paste the raw key after the equals sign (sk-...), with no quotes and no trailing comment: %env stores everything after "=" as the value
%env OPENAI_API_KEY=YOUR_API_KEY


env: OPENAI_API_KEY=YOUR_API_KEY


In [3]:
#Add the URL of the YouTube video whose transcript you want to load
YouTube_URL = "https://www.youtube.com/watch?v=uIYujpFmvo8&ab_channel=YCombinator"

In [4]:
#load in the data from YouTube
yt_loader = YoutubeLoader.from_youtube_url(
    YouTube_URL, add_video_info=False
)


In [5]:
#load the YouTube transcript in
YouTube_Transcript = yt_loader.load()

In [6]:
#the docs have loaded in
YouTube_Transcript

[Document(page_content="over the past few batches we've seen a pretty big uptick in the number of AI startups that are applying to YC and building AI focused companies so today we thought we would do an episode Focus entirely on AI startup websites welcome to another episode of design review [Music] today we are very lucky to be joined by YC's president Gary tan thanks for having me so Gary why did you want to focus on AI websites for this episode the number one question that is sort of the most important for a lot of startups is actually why now all of the activity around large language models whether using open AI staff or anthropic or open source it's just the most exciting why now that's happening because we just didn't have this level of capability so when you have this crazy new why now often you're actually literally trying to figure out how do you describe a new category that didn't exist before so that's some of the stuff I think we're going to dive into today awesome so we're

In [7]:
#convert the list to a string
transcript_string = str(YouTube_Transcript)

# Use NLP (Natural Language Processing) to create semantic chunks

NLP is a set of techniques (many of them built on neural networks) for processing human language. We will be using the spaCy library here to make sense of the unstructured transcript above.

There is an alternative called NLTK. However, spaCy is generally faster and better suited to processing large amounts of text.

Please note that the longer the text, the longer the model will take to complete the task.

In this example, we use spaCy to split the input text into sentences. Then we add any missing full stops and group the sentences into short paragraphs of up to a set width, separated by blank lines.

In [10]:
import spacy #import the NLP library
import textwrap #join sentences back together

In [14]:
#install the NLP model required from GitHub. Use %pip inside a notebook: a bare "pip install" line is not valid Python and raises a SyntaxError. Pick the model release that matches your installed spaCy version.
%pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

In [16]:
nlp = spacy.load("en_core_web_sm") #load the model, this is the small (sm) model. Good for speed where accuracy isn't key


In [18]:
# Process the text with spaCy
doc = nlp(transcript_string)

With the library and model loaded, we can now add formatting to the string block (e.g. grammar and paragraphs).

In addition, we need to separate the paragraphs with line breaks (\n), as this is what our 'Text splitter' will use to split the massive chunks of text into smaller ones to feed to OpenAI.
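To see why the blank-line separators matter, here is a toy chunker in plain Python. It is only a sketch of the idea (split on a separator, then pack pieces up to a size limit) and not LangChain's actual implementation; the function name `chunk_by_paragraph` is hypothetical, and overlap between chunks is left out for brevity.

```python
def chunk_by_paragraph(text, chunk_size=1000, separator="\n\n"):
    """Greedily pack paragraphs into chunks of at most chunk_size characters."""
    chunks = []
    current = ""
    for paragraph in text.split(separator):
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            # still fits (or it is the first piece of a new chunk): keep growing
            current = candidate
        else:
            # adding this paragraph would overflow: close the chunk, start a new one
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


print(chunk_by_paragraph("aaaa\n\nbbbb\n\ncccc", chunk_size=10))
```

Because the split happens at paragraph boundaries, sentences that belong together tend to land in the same chunk.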

In [19]:
# Your input text
text = transcript_string

# Split the text into sentences
sentences = [sent.text for sent in doc.sents]

# Add full stops to sentences
sentences_with_full_stops = [sentence + "." if not sentence.endswith((".", "!", "?")) else sentence for sentence in sentences]

# Set the desired width for each line
line_width = 80

# Create formatted paragraphs with consistent line breaks and separators
formatted_text = []
paragraph = []
for sentence in sentences_with_full_stops:
    if len(" ".join(paragraph + [sentence])) > line_width:
        formatted_text.append(" ".join(paragraph))
        paragraph = [sentence]
    else:
        paragraph.append(sentence)
formatted_text.append(" ".join(paragraph))

# Add separators between paragraphs
formatted_text_with_separators = "\n\n".join(formatted_text)

# Print the formatted text with separators and full stops
print(formatted_text_with_separators)



[Document(page_content="over the past few batches we've seen a pretty big uptick in the number of AI startups that are applying to YC and building AI focused companies so today we thought we would do an episode Focus entirely on AI startup websites welcome to another episode of design review [Music] today we are very lucky to be joined by YC's president.

Gary tan thanks for having me.

so Gary why did you want to focus on AI websites for this episode the number one question that is sort of the most important for a lot of startups is actually why now all of the activity around large language models whether using open AI staff or anthropic or open source it's just the most exciting why now that's happening because we just didn't have this level of capability so when you have this crazy new why now often you're actually literally trying to figure out how do you describe a new category that didn't exist before so that's some of the stuff I think we're going to dive into today.

awesome 

# Using OpenAI for Q&A's & Custom prompts 🤖

So, I am going to give you two methods to interact with the text chunks we are about to feed to OpenAI.

## 1st Q&A's

We will create a simple Q&A function where we can ask specific questions of the text. The outputs are very direct and to the point. This differs from ChatGPT in that the chain is prompted to answer from the transcript chunks it retrieves, so it is far less inclined to fill in the blanks, make up information or use sources outside of the text you have fed it.

I think this is a great use case if you don't have time to watch an entire video and want to know whether specific things are mentioned.

## 2nd Custom Prompts

From here, we will look to add custom prompts which direct OpenAI to take specific actions. This is very powerful as you can essentially give OpenAI a goal when looking at the text & get it to give you specific answers based on the context you give it.

The power here is that you can use custom prompts to create your 'so what'. 

For example, a general summary is useful. But it doesn't guide you on how this information might affect your circumstances or goals. In the example below, I am going to feed it the transcript, which is a video on how AI startups should design their sites. So, I am going to tell it I am a startup founder and need tips for my site setup.

Let's see what it gives me.



### 1. Creating a Q&A function

In [20]:
#import the text splitter. This will split documents recursively by different characters - starting with "\n\n", then "\n", then " ". This is nice because it will try to keep all the semantically relevant content in the same place for as long as possible.
from langchain.text_splitter import RecursiveCharacterTextSplitter

#OpenAI’s text embeddings measure the relatedness of text strings.
from langchain.embeddings.openai import OpenAIEmbeddings

#Faiss is a library for efficient similarity search and clustering of dense vectors.
from langchain.vectorstores.faiss import FAISS

#RetrievalQA is the chain that retrieves the most relevant chunks and lets you select how LangChain feeds them to OpenAI
from langchain.chains import RetrievalQA

#import the summarise chain library
from langchain.chains.summarize import load_summarize_chain

In [21]:
#call OpenAI and feed it the OpenAI key. Temperature controls randomness (the API accepts 0 to 2, but 0 to 1 is the usual range). At 0, running the same query should give you (nearly) the same output every time. As you increase it, OpenAI will get more creative.
llm = OpenAI(temperature=0, openai_api_key=os.environ['OPENAI_API_KEY'])

In [22]:
#split your text into chunks and tell it how much you want the chunks to overlap. The below is standard, however, you may need to change these if you find that you are going over the OpenAI limits
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

#Using the above to split our formatted text
texts = text_splitter.split_text(formatted_text_with_separators)

#Embedding our splits into a vector store. A vector store is like a giant library that organizes information in a way that makes it easier for computers to understand and find related pieces of information quickly.
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(texts, embeddings)

In [23]:
#Setting up a system that can answer questions by retrieving relevant information from a vector store
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectorstore.as_retriever())

In [24]:
#Enter your query
query = "How should I design my AI site?"

#run the query with the QA system above
qa.run(query)

' You should design your AI site by making sure to include a clear call to action and contrast to help draw attention to your main KPI and outcome. Ask your existing users to describe your product and use that phrase at the top of your site. Make sure to prioritize the main goal for your website visitors.'
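Under the hood, the retriever picks the chunks whose embeddings sit closest to the embedding of the query. The toy sketch below shows that ranking step with hand-made three-dimensional vectors and cosine similarity. The vectors, texts and function names are all made up for illustration: real embeddings come from `OpenAIEmbeddings` and have far more dimensions, and FAISS uses optimised indexes rather than a plain sort.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for the vector store
toy_store = [
    ("Use one clear call to action", [1.0, 0.0, 0.0]),
    ("Pricing page tips", [0.0, 1.0, 0.0]),
    ("Use contrast to guide the eye", [0.6, 0.8, 0.0]),
]
toy_query = [1.0, 0.05, 0.0]

print(top_k(toy_query, toy_store))
```

Only the top-ranked chunks are passed to the language model, which is why the answer stays close to what the video actually says.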

### A note on the different core chains for working with Documents

There are 4 different chain types within LangChain. They are useful for summarizing documents, answering questions over documents, extracting information from documents, and more.

You'll see in the cell above & code below ('QA'), that we used stuff. But what happens when we use the others?

`qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectorstore.as_retriever())`


Let's first look at each chain type. Stuff and map_reduce are the most commonly used. So, we'll start with those.


### 1. Stuff

- The "stuff" documents chain is a simple way of using a language model to process a list of documents.
- It takes the documents you provide, puts them together as a single input, and then asks the language model to work on that combined information.
- This approach works best when you have small documents and only need to process a few of them at a time.

![Stuff](https://python.langchain.com/assets/images/stuff-818da4c66ee17911bc8861c089316579.jpg)
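As a rough sketch of the "stuff" idea in plain Python: join every document into one prompt and make a single model call. This is not LangChain's code; `stuff_chain` and `echo_llm` are hypothetical names, and `echo_llm` is a stand-in for a real model so the example runs offline.

```python
def stuff_chain(llm, question, documents):
    """Put all documents into one prompt and call the model once."""
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(prompt)


def echo_llm(prompt):
    # Stand-in for a real model call: it just reports how much text it received
    return f"(model saw {len(prompt)} characters)"


print(stuff_chain(echo_llm, "How should I design my AI site?", ["chunk one", "chunk two"]))
```

The single prompt grows with every document added, which is why this approach suits a handful of small chunks.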


### 2. Map_Reduce

- The map reduce documents chain is like a two-step process. First, it takes each document separately and applies a specific set of actions to it based on your prompt. Then, it combines all the processed documents together to get a final result.
- You can think of it like this. Imagine you have a pile of papers about European tourism that you need to fit into an envelope to send to a client. The envelope is limited in size & they only care about information on Paris.
- In the first step (Map), you take each paper and perform some actions on it, like highlighting important parts about Paris.
- Once you've gone through all the papers, you gather them together again. In the second step (Reduce), you combine all the modified papers into one final document; cutting out anything that isn't highlighted. The information on Paris is fed back into the LLM.
- To make sure the papers fit nicely in the second step, you might need to compress or condense them, making them smaller if necessary (changing the chunk & overlap size). If the compressed papers are still too big, you repeat the compression process until they fit.

![Map_reduce](https://python.langchain.com/assets/images/map_reduce-c65525a871b62f5cacef431625c4d133.jpg)
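The two steps can be sketched in plain Python as follows. This is an illustration of the pattern rather than LangChain's implementation: `map_reduce_chain` is a hypothetical name, the prompts are my own wording, and the extra "compress until it fits" step is left out.

```python
def map_reduce_chain(llm, question, documents):
    """Ask the question of each document separately, then merge the partial answers."""
    # Map step: one model call per document
    partial_answers = [
        llm(f"Context:\n{doc}\n\nExtract anything relevant to: {question}")
        for doc in documents
    ]
    # Reduce step: one final call over the combined partial answers
    combined = "\n".join(partial_answers)
    return llm(f"Partial answers:\n{combined}\n\nGive one final answer to: {question}")


def fake_llm(prompt):
    # Stand-in for a real model: returns the first line of whatever it was given
    return prompt.splitlines()[0]


print(map_reduce_chain(fake_llm, "What about Paris?", ["paper A", "paper B", "paper C"]))
```

For N documents this makes N + 1 model calls, which is the price paid for never needing all the documents in one prompt.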

### 3. Refine

- The refine documents chain works by going through the input documents and gradually refining its answer. It does this by repeatedly updating its response based on each document it examines.
- Think of it like solving a puzzle or completing a task where you have several pieces of information. The refine chain takes these pieces one by one and incorporates them into its answer. For each document, it considers the information it already has, any additional inputs that are not documents, the current document being examined, and the most recent answer it has generated. Then it uses all of this combined information as input to a language model, which provides a new answer based on the updated context.
- The refine chain is useful when you have many documents to analyze, and it's not possible to consider them all together due to the limitations of the model's context. By analyzing the documents one at a time, it can work with a larger set of documents without running into context limitations. However, this approach also means that the refine chain needs to make more calls to the language model, which can be computationally intensive.
- There are some cases where the refine chain may not perform optimally. For example, if the documents frequently refer to each other or if the task requires detailed information from multiple documents, the iterative nature of the refine chain might face difficulties in producing accurate results.
- In summary, the refine documents chain gradually refines its answer by analyzing each document separately, using the information from previous steps and the latest answer. It is suitable for tasks involving a large number of documents, but it may have limitations in certain scenarios where document cross-references or detailed information from multiple documents are crucial.

![Refine](https://python.langchain.com/assets/images/refine-a70f30dd7ada6fe5e3fcc40dd70de037.jpg)
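A plain-Python sketch of the refine loop, under the assumption that the model is simply asked to improve its previous answer with each new document. The name `refine_chain` and the prompt wording are hypothetical, not taken from LangChain.

```python
def refine_chain(llm, question, documents):
    """Build an answer from the first document, then update it once per extra document."""
    if not documents:
        raise ValueError("refine_chain needs at least one document")
    # Initial answer from the first document only
    answer = llm(f"Context:\n{documents[0]}\n\nAnswer the question: {question}")
    # Every later document sees the current answer and may improve it
    for doc in documents[1:]:
        answer = llm(
            f"Question: {question}\n"
            f"Existing answer: {answer}\n"
            f"New context:\n{doc}\n\n"
            "Improve the existing answer if the new context helps; otherwise repeat it."
        )
    return answer
```

The calls must run one after another because each depends on the previous answer, which helps explain why this chain is the slowest of the four.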


### 4. Map re-rank

- The map re-rank documents chain works by running an initial prompt on each document. This prompt does two things: it tries to find an answer to a specific task, and it also assigns a score to indicate how confident it is in that answer. The chain then compares all the answers and selects the one with the highest score as the final response.
- Imagine you have a question and a set of documents. The chain goes through each document one by one and asks the question of that document. It not only tries to find an answer but also gives that answer a score to indicate how sure it is about the correctness of the answer.
- For example, let's say you have a document about cats. The chain would run a prompt on that document, asking a question like "What is the average lifespan of a cat?" The chain then provides an answer, say "The average lifespan of a cat is around 15 years," and assigns a score to this answer, say 90 out of 100, indicating high confidence in the accuracy of the answer.
- This process is repeated for each document, using the same prompt and producing a separate answer and score each time. Once all the documents have been processed, the chain compares the scores assigned to each answer. It selects the answer with the highest score, which represents the most confident response.
- In simpler terms, the map re-rank documents chain reads through each document, asks a specific question about it, gives an answer with a confidence score, and then chooses the answer with the highest score as the final response.

![Map Re-rank](https://python.langchain.com/assets/images/map_rerank-0302b59b690c680ad6099b7bfe6d9fe5.jpg)
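Here is a small plain-Python sketch of the score-and-pick step. It assumes the model is asked to end its reply with a line such as `Score: 90`; the function names, the prompt wording and the stand-in model are all hypothetical and not LangChain's own.

```python
import re


def parse_answer_and_score(raw):
    """Split a reply like 'Some answer\\nScore: 90' into ('Some answer', 90)."""
    match = re.search(r"(.*?)\s*Score:\s*(\d+)", raw, re.DOTALL)
    if match is None:
        # No score found: keep the text but treat it as zero confidence
        return raw.strip(), 0
    return match.group(1).strip(), int(match.group(2))


def map_rerank_chain(llm, question, documents):
    """Ask the same question of every document and keep the highest-scoring answer."""
    scored = [
        parse_answer_and_score(
            llm(
                f"Context:\n{doc}\n\nQuestion: {question}\n"
                "Answer, then add a line 'Score: <0-100>' for your confidence."
            )
        )
        for doc in documents
    ]
    best_answer, _ = max(scored, key=lambda pair: pair[1])
    return best_answer
```

Note that only one document's answer survives, so information spread across several chunks is not combined.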


Let's see the outputs we get when feeding the same question and documents but changing the chain type

#### 1. Stuff

In [25]:
#Setting up a system that can answer questions by retrieving relevant information from a vector store
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=vectorstore.as_retriever())

#Enter your query
query = "How should I design my AI site?"

#run the query with the QA system above
qa.run(query)

' Make sure to have a clear call to action and contrast in order to draw attention to the element you want users to click on. Choose the main KPI you want users to achieve and make that the focus of the website.'

#### 2. Map_Reduce

In [26]:
#Setting up a system that can answer questions by retrieving relevant information from a vector store
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="map_reduce", retriever=vectorstore.as_retriever())

#Enter your query
query = "How should I design my AI site?"

#run the query with the QA system above
qa.run(query)

' Design your AI site by asking existing users how they would describe your product and use the phrase they say at the top of your site. Make the call to action to try the beta the main focus and use contrast to be opinionated to the user. Figure out the main KPI and outcome you want from people that visit the website.'

#### 3. Refine

In [27]:
#Setting up a system that can answer questions by retrieving relevant information from a vector store
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="refine", retriever=vectorstore.as_retriever())

#Enter your query
query = "How should I design my AI site?"

#run the query with the QA system above
qa.run(query)

"\n\nWhen designing your AI site, focus on clearly communicating why now is the time to use your product and what it does. Draw on the words and phrases used by your existing users to describe your product and place them at the top of the page. Additionally, be sure to highlight any impressive features or capabilities your product has. Since AI is a new and exciting technology, make sure to emphasize the unique advantages it can bring to your users. Make sure to use contrast to make your call-to-action stand out - if you want people to try the beta, make that the main call-to-action and make the 'Learn More' button a link instead of a button. Additionally, provide a clear and concise breakdown of what your product does and the value it provides to your users. Figure out what your main KPI is and what outcome you want from people who visit your site. If your product is focused on game assets, make sure that is the main focus and prioritize it accordingly. Finally, make sure to provide c

#### 4. Map Re-Rank

In [28]:
#Setting up a system that can answer questions by retrieving relevant information from a vector store
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="map_rerank", retriever=vectorstore.as_retriever())

#Enter your query
query = "How should I design my AI site?"

#run the query with the QA system above
qa.run(query)

' You should focus on having clear call-to-actions, a good design hierarchy, and a logical flow.'

### Analysing their outputs

#### Stuff Chain:

- Good: The output suggests designing an AI site with a clear call to action, good hierarchy and flow, and contrast to draw the user's attention. It emphasizes focusing on the main key performance indicator (KPI) and desired outcomes from visitors.
- Bad: The output is relatively short and lacks some details. It does not provide specific guidance on user engagement or product communication.
- Reason: The Stuff chain is designed to process small documents, so it may prioritize brevity and simplicity. It focuses on providing a quick overview rather than in-depth guidance.
- Approach: In the future, we may want to consider using the Stuff chain for tasks that require concise information or when we need a brief summary. However, if we require more detailed recommendations, we should explore other chains.


#### Map Reduce Chain:

- Good: The output advises designing the AI site with clear call-to-action, good hierarchy and flow, and contrast. It suggests being opinionated towards the user and making the main KPI and desired outcome the focus.
- Bad: The output is relatively short and does not provide much specific guidance on user engagement or product communication.
- Reason: The Map Reduce chain combines individual document processing with a reduction step, aiming to provide an overall response. However, it may still prioritize brevity and simplicity in the output.
- Approach: To get more detailed and comprehensive recommendations, we may need to explore other chains or consider additional processing steps.

#### Refine Chain:

- Good: The output offers detailed guidance on designing the AI site, including user-focused considerations, highlighting product strengths, and emphasizing the call to action. It covers various aspects such as product description, technology utilization, and visual appeal.
- Bad: The output is quite lengthy, and there is a risk of being too verbose or overwhelming for the user. This chain also takes the longest, so bear that in mind if you have a massive piece of text and want speedy responses.
- Reason: The Refine chain iteratively refines the answer by considering each document separately, leading to a more detailed and comprehensive response.
- Approach: When detailed guidance is required, the Refine chain can be valuable. However, it's important to ensure that the output remains concise and user-friendly. Careful editing and prioritization of information can help strike the right balance.

#### Map Re-rank Chain:

- Good: The output suggests designing the AI site by focusing on the main KPI and desired outcomes from visitors. It advises using contrast to make the call to action stand out and being opinionated about user interaction.
- Bad: The output is relatively short and lacks some specific details on product communication or user engagement strategies.
- Reason: The Map Re-rank chain aims to re-rank the answers based on scores assigned to each document. While it may provide a higher-scoring response, it may not always capture all the nuances or specific details.
- Approach: If we need more comprehensive recommendations, we should consider using other chains or combining multiple approaches to achieve better results.


After analyzing these outputs, I have determined that combining the Map Re-Rank and Refine chains would provide the most effective methodology. The Refine chain offers comprehensive recommendations, giving you a fuller picture of the actions to consider. On the other hand, the Map Re-Rank chain prioritizes the most important information, helping you focus on the actions that require immediate attention. By leveraging the Refine chain for a macro view and then employing Map Re-Rank to zero in on prioritizing those actions, you can achieve a well-balanced approach. This combination ensures that you have a holistic understanding of the recommendations while being able to identify and address the critical areas effectively.


### Now that we understand how Q&A and the different chain types work, let's jump into creating custom prompts

# 2. Custom Prompts

So, I will start from the beginning again here so you can choose whether you want to do QA or custom prompts without having to think about library dependencies from project to project.

In [35]:
#remember the text that we processed with the Spacy NLP, I am going to shorten the name down cause I'm lazy and don't want to keep typing that

text= formatted_text_with_separators

In [33]:
#import the libraries

# Loaders
from langchain.schema import Document

# Splitters
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Model
from langchain.chat_models import ChatOpenAI

# Embedding Support
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings

# Summarizer we'll use for Map Reduce
from langchain.chains.summarize import load_summarize_chain

# Data Science
import numpy as np
from sklearn.cluster import KMeans

In [34]:
#replace any tabs with spaces in the formatted transcript
#(formatted_text_with_separators is already a single string, so there are no Document objects to loop over)
text = formatted_text_with_separators.replace('\t', ' ')

In [36]:
#split the text into chunks

text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", "\t"], chunk_size=1000, chunk_overlap=300)

docs = text_splitter.create_documents([text])

In [40]:
#let's see how the text splitter has split our document
num_documents = len(docs)

print (f"Now our transcript is split up into {num_documents} documents")

Now our transcript is split up into 38 documents


In [42]:
#embed the data in a vectorstore
embeddings = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
vectors = embeddings.embed_documents([x.page_content for x in docs])

Now let's cluster our embeddings.

We cluster embeddings to group similar data points together based on their underlying patterns or similarities. By clustering embeddings, we can organize and understand large amounts of data more effectively. This process allows us to discover relationships, patterns, or categories within the data that might not be immediately apparent.

So, in this case, we want to cluster together similar contexts, which will help us feed all the relevant information related to our custom prompt into OpenAI.

In [47]:
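# Choose the number of clusters; adjust this to the length and content of your transcript.
# An online resource found ~10 works well: with 10 passages from a book/PDF/transcript you can usually tell what it's about
num_clusters = 10

# Fit K-means on the embeddings; later cells rely on kmeans.cluster_centers_ (random_state=42 is an arbitrary seed for repeatable runs)
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)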

We have a massive piece of text that is now embedded. In the code below we find, for each cluster, the chunk whose embedding sits closest to the cluster centre. The clusters are not predefined topics; they are groups of similar chunks discovered from the embeddings, and the closest chunk acts as the representative of its group.

Imagine you have a very long text that talks about different topics, such as sports, cooking, and technology. By using embeddings, we can represent each part of the text as a numerical vector. Then, we can calculate the distances between these vectors and the cluster centers, which are representative vectors of each category. By finding the closest embeddings to each cluster center, we can determine which parts of the text are most similar to each category.

For example, if we apply this process to a news article, it can help us identify which paragraphs or sentences are talking about sports, cooking, or technology. This can be useful for organizing and analyzing large amounts of text, extracting relevant information, or categorizing different topics within the text.
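The "closest chunk to each cluster centre" step can be illustrated with toy two-dimensional points and the standard library alone. The points and centres below are invented for the example (in the notebook the centres come from K-means on the real embeddings), and `closest_to_each_center` is a hypothetical helper name.

```python
import math


def closest_to_each_center(vectors, centers):
    """For every centre, return the index of the vector nearest to it."""
    closest = []
    for center in centers:
        # Euclidean distance from this centre to every vector
        distances = [math.dist(vector, center) for vector in vectors]
        closest.append(distances.index(min(distances)))
    return closest


# Toy "embeddings": two points near the origin, two near (5, 5), one near (9, 0)
toy_vectors = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (9.0, 0.0)]
# Hand-picked centres standing in for kmeans.cluster_centers_
toy_centers = [(0.05, 0.0), (5.1, 4.9), (9.0, 0.0)]

print(sorted(closest_to_each_center(toy_vectors, toy_centers)))
```

Sorting the indices afterwards keeps the selected chunks in their original order, so the summaries follow the flow of the video.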

In [48]:
# Find the closest embeddings to the centroid cluster

# Create an empty list that will hold your closest points
closest_indices = []

# Loop through the number of clusters you have
for i in range(num_clusters):
    
    # Get the list of distances from that particular cluster center
    distances = np.linalg.norm(vectors - kmeans.cluster_centers_[i], axis=1)
    
    # Find the list position of the closest one (using argmin to find the smallest distance)
    closest_index = np.argmin(distances)
    
    # Append that position to your closest indices list
    closest_indices.append(closest_index)

In [50]:
#so, these are the indices of the chunks that sit closest to each cluster centre
selected_indices = sorted(closest_indices)
selected_indices

[0, 1, 4, 11, 15, 20, 23, 26, 29, 34]

In [57]:
#create an LLM variable to summarise each of the clusters
llm = ChatOpenAI(temperature=0,
                 openai_api_key=os.environ['OPENAI_API_KEY'],
                 max_tokens=150,
                 model='gpt-3.5-turbo'
                )

In [56]:
from langchain import PromptTemplate #import the prompt template library to create our prompt

#creating a prompt for OpenAI, these are the actions they will take on each cluster
map_prompt = """
You will be given a single passage of a YouTube Transcript. This section will be enclosed in triple backticks (```)
Your goal is to give a summary of this section so that a reader will have a full understanding of what happened.
Your response should be at least three paragraphs and fully encompass what was said in the passage.

```{text}```
FULL SUMMARY:
"""
map_prompt_template = PromptTemplate(template=map_prompt, input_variables=["text"])

In [58]:
#combining the LLM variable with the prompt template. Here we will use the chain 'stuff' for speed.
map_chain = load_summarize_chain(llm=llm,
                             chain_type="stuff",
                             prompt=map_prompt_template)

In [60]:
#The line of code creates a new list called selected_docs by selecting specific elements from the docs list based on the indices specified in the selected_indices list.
selected_docs = [docs[doc] for doc in selected_indices]

Now we will iterate over a set of selected documents, generating summaries for each document using a specified mapping chain, and store the summaries in a list.

In [62]:
# Make an empty list to hold your summaries
summary_list = []

    # Loop through your selected docs
for i, doc in enumerate(selected_docs):
    
    # Go get a summary of the chunk
    chunk_summary = map_chain.run([doc])
    
    # Append that summary to your list
    summary_list.append(chunk_summary)
    
    print(f"Summary #{i} (chunk #{selected_indices[i]}) - Preview: {chunk_summary[:250]} \n")

Summary #0 (chunk #0) - Preview: In this passage, the speaker discusses the increasing number of AI startups that are applying to YC and building AI-focused companies. As a result, the episode of "Design Review" will focus entirely on AI startup websites. The speaker, Gary Tan, who  

Summary #1 (chunk #1) - Preview: In this section of the YouTube transcript, the speaker and Gary are discussing a product called Rosebud. The speaker mentions that Rosebud is related to AI-generated game assets and game development. They note that there is a distracting element on t 

Summary #2 (chunk #4) - Preview: In this passage, the speaker is discussing the possibility of there being multiple products. They mention that Sprites makes sense as one of these products. They also note that there are multiple calls to action, which they find great. They mention s 

Summary #3 (chunk #11) - Preview: In this passage, the speaker expresses their appreciation for the "how it works" segment of a YouTube video,
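The pattern used here (pick the representative chunks, summarise each one, then join the results) can be tried without spending any API credits. In this sketch `fake_summarise` is a hypothetical stand-in for `map_chain.run` that simply keeps the first sentence, and the `docs` list is made-up text.

```python
def fake_summarise(text: str) -> str:
    """Hypothetical stand-in for map_chain.run: keeps only the first sentence."""
    return text.split(".")[0].strip() + "."

# Made-up chunks standing in for the transcript documents
docs = [
    "Intro to the show. More intro.",
    "Filler chat. Nothing here.",
    "Review of site one. Details follow.",
]
selected_indices = [0, 2]

# Pick the representative chunks, then summarise each one separately (the "map" step)
selected_docs = [docs[i] for i in selected_indices]
summary_list = [fake_summarise(doc) for doc in selected_docs]

# Join the individual summaries into one block of text
summaries = "\n".join(summary_list)
print(summaries)
```

Swapping `fake_summarise` for the real chain gives the loop above; everything else stays the same.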

In [64]:
#join the summaries together
summaries = "\n".join(summary_list)

# Convert it back to a document
summaries = Document(page_content=summaries)

# Print the token count so we can check it fits within the token limit for GPT 3.5 turbo (our selected model)
print(f"Your total summary has {llm.get_num_tokens(summaries.page_content)} tokens")

Your total summary has 1426 tokens
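Why does the token count matter? The summary, the instructions, and the model's reply all have to share one context window. Here is a rough back-of-the-envelope check. The 4,096-token window, the prompt size, and the reply budget are all assumptions for illustration, so check the limit for the model you are actually using.

```python
def fits_in_context(summary_tokens: int, prompt_tokens: int,
                    reply_tokens: int, context_window: int = 4096) -> bool:
    """Return True if prompt + summary + reply fit within the context window."""
    return summary_tokens + prompt_tokens + reply_tokens <= context_window

summary_tokens = 1426   # the count printed by the cell above
prompt_tokens = 400     # assumed size of the instructions (a guess)
reply_tokens = 1000     # assumed room left for the model's answer (a guess)

total = summary_tokens + prompt_tokens + reply_tokens
print(total, fits_in_context(summary_tokens, prompt_tokens, reply_tokens))
```

If the check fails, the usual options are fewer clusters, shorter per-chunk summaries (a lower `max_tokens`), or a model with a larger window.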


#### We're going to convert this summary to JSON.

Quick explanation of what a JSON file is:

A JSON file stores information in a structured, organised way. JSON stands for "JavaScript Object Notation," but don't worry too much about the technical term. Think of it as a way to store data in a format that computers can easily understand.

A JSON file is made up of pairs of information called "key-value" pairs. The key is a label that describes what the value represents. For example, a key might be "name," and the corresponding value would be the actual name. You can have multiple key-value pairs in a JSON file, allowing you to store and organise different types of information together.

JSON is commonly used to store and exchange data in web applications, websites, and other software. It is easy to read and write for both humans and computers, and most programming languages, including Python, can work with it, which makes it a popular choice for passing data between different systems.

In simpler terms, JSON is like a digital recipe book where every piece of data sits under a clear label, so both you and the computer know exactly what each value means.
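Here is a tiny example of key-value pairs in practice, using Python's built-in `json` module. The dictionary contents are made up for illustration.

```python
import json

# A Python dictionary holding two key-value pairs (made-up example data)
video = {"title": "Design Review", "minutes": 25}

# Serialise the dictionary to a JSON string
as_json = json.dumps(video)
print(as_json)

# Parse the JSON string back into a dictionary
restored = json.loads(as_json)
print(restored["title"])
```

`json.dumps` turns Python data into JSON text, and `json.loads` turns that text back into Python data, so nothing is lost on the round trip.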

In [72]:
import json #to convert to JSON

In [71]:
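# Copy the Document's attributes into a plain dictionary
document_dict = summaries.__dict__.copy()
# Drop the metadata, which we don't need in the prompt
del document_dict['metadata']
# Serialise the dictionary to a JSON string
json_string = json.dumps(document_dict)
print(json_string)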

{"page_content": "In this passage, the speaker discusses the increasing number of AI startups that are applying to YC and building AI-focused companies. As a result, the episode of \"Design Review\" will focus entirely on AI startup websites. The speaker, Gary Tan, who is the president of YC, explains that the most important question for startups is why now, referring to the recent surge in activity around large language models. He mentions the use of open AI staff, anthropic, and open source as contributing factors to this excitement. The speaker also highlights the challenge of describing a new category that didn't exist before, which will be explored further in the episode.\nIn this section of the YouTube transcript, the speaker and Gary are discussing a product called Rosebud. The speaker mentions that Rosebud is related to AI-generated game assets and game development. They note that there is a distracting element on the left side of the screen, but they focus on the three concept

In [80]:
# Create our custom prompt for the LLM. Here it will play the role of an AI startup founder
prompt_AI = '''
As an AI startup founder, you're looking to build a cutting-edge website powered by artificial intelligence. You'll be provided with a document that contains valuable information on building AI websites. Use this prompt as a guide to maximize your learning and extract the best practices from the documents.

---

1. Begin by familiarizing yourself with the document content. Pay attention to the different aspects covered, such as design, user experience, and technology implementation.

2. Take notes on key recommendations, innovative ideas, and success stories mentioned in the documents. These insights will help you understand the best practices in building an AI website.

3. Identify any common patterns or trends among the documents. Look for recurring themes or strategies that successful AI websites employ.

4. Evaluate the importance of different design elements, user interactions, and AI integrations in creating a compelling user experience. Consider how these elements can enhance the overall functionality and value of your website.

5. Assess the scalability and maintainability of the suggested AI technologies and frameworks. Look for guidance on how to effectively leverage AI capabilities while ensuring long-term sustainability.

6. Pay attention to security considerations and ethical implications discussed in the documents. Ensure that your AI website maintains privacy standards, data protection, and responsible AI usage.

7. As you navigate through the documents, stay open to new ideas and innovative approaches. Consider how you can adapt and tailor these best practices to align with your unique business goals and target audience.

Remember, building an AI website requires a blend of creativity, technical expertise, and strategic thinking. Use the information in these documents to inspire your vision and empower you to create an exceptional AI-driven online presence for your startup.

Do not add any information that is not within the document I will give you below
---
'''


I used ChatGPT to create that prompt. You can do the same by using my prompt below as a framework for your own goals.

"Create a chatgpt prompt for an AI startup founder who is about to be fed documents relating to building an AI website, use prompting best practices"

In [78]:
import openai  # import the OpenAI library

# Finally, combine everything (the LLM, the prompt, and the summary in json_string) to get your output
response = openai.ChatCompletion.create(
    temperature = 1.0,
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": prompt_AI},
        {"role": "user", "content": json_string}
    ]
)

# & your final output is 🥁....

In [83]:
print(response['choices'][0]['message']['content'])


Key Recommendations and Innovative Ideas:
- Focus on the "why now" factor for AI startups and highlight the recent surge in activity around large language models.
- Leverage open AI staff, anthropic, and open source platforms to contribute to the excitement around AI.
- Use clear and concise taglines to avoid confusion about the product's purpose.
- Incorporate multiple calls to action, but minimize the need for extra clicks to avoid drop-off in user engagement.
- Provide a step-by-step guide or debug tool to enhance user experience.
- Implement accessible onboarding processes that prompt users to create an account only when necessary.
- Utilize facial detection and authentication features to prevent the spread of fabricated content and enhance fraud prevention.
- Demonstrate impressive feats or strengths of the company to impress potential users.
- Highlight social proof from reputable organizations to increase credibility.
- Show upfront cost savings for products targeting Cloud infr
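One practical note on the request in cell In [78]: `openai.ChatCompletion.create` belongs to the older (pre-1.0) `openai` package. If you have a newer version installed, the sketch below shows what I believe is the equivalent call. The helper names `build_messages` and `ask_model` are my own, and `ask_model` needs a valid `OPENAI_API_KEY`, so treat this as a starting point rather than a drop-in replacement.

```python
def build_messages(system_prompt: str, document_json: str) -> list:
    """Pair the instructions (system role) with the summaries (user role)."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": document_json},
    ]

def ask_model(system_prompt: str, document_json: str) -> str:
    """Send the request with the openai>=1.0 client (needs OPENAI_API_KEY set)."""
    from openai import OpenAI  # imported here so build_messages works without the package

    client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=1.0,
        messages=build_messages(system_prompt, document_json),
    )
    return response.choices[0].message.content

# Building the payload needs no API key, so it is easy to inspect before sending anything
messages = build_messages("You are a helpful reviewer.", '{"page_content": "..."}')
print([m["role"] for m in messages])
```

With the notebook's variables, the call would be `print(ask_model(prompt_AI, json_string))`.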

And there you have it: a complete, comprehensive summary of a 25-minute YouTube video. You can see how powerful this approach is at producing both a summary of what was said and actionable advice for you to take.