# LangChain Pipeline: YouTube Transcript + Parallel Chains

This notebook demonstrates how to:
- Fetch a transcript from a YouTube video using `youtube-transcript-api`
- Split it into chunks
- Create embeddings with OpenAI
- Store vectors in FAISS
- Use LangChain's `RunnableParallel` to retrieve context and pass the question through in one step, then answer questions about the video


### Setup and Imports

This section imports all necessary libraries and modules required for fetching transcripts, processing text, embedding, vector storage, prompt creation, and running the language model chains.

In [29]:
import os
from dotenv import load_dotenv
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda

### Load Environment Variables

Load the OpenAI API key from the `.env` file to authenticate API requests securely.

In [30]:
load_dotenv(".env")
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

### Fetch YouTube Video Transcript

A function that fetches the transcript of a YouTube video by its ID. If transcripts are disabled for the video, it returns `None` instead of raising.

In [31]:
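def getVideoTranscript(video_id):
    """Return the video's transcript as one string, or None if transcripts are disabled."""
    try:
        api = YouTubeTranscriptApi()
        # fetch() returns the transcript as a sequence of timed snippets
        transcript_list = api.fetch(video_id=video_id)
        # Join the snippet texts into a single block of text
        transcript = " ".join(chunk.text for chunk in transcript_list.snippets)
        return transcript
    except TranscriptsDisabled:
        # Transcripts are disabled for this video
        return None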

transcript = getVideoTranscript(video_id = "Gfr50f6ZBvo")
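
Because the function above can return `None`, the later splitting step would fail on a video without a transcript. A minimal sketch of an early check follows; `require_transcript` is a hypothetical helper, not part of the original notebook or of any library.

```python
def require_transcript(transcript):
    """Hypothetical guard: stop early when no transcript text is available."""
    if not transcript:
        # Covers both None (transcripts disabled) and an empty string.
        raise ValueError("No transcript available for this video.")
    return transcript

# With real transcript text the value passes through unchanged.
checked = require_transcript("hello world")
```

In the notebook this could wrap the call, for example `transcript = require_transcript(getVideoTranscript(video_id="Gfr50f6ZBvo"))`.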

### Split Transcript into Chunks

Split the entire transcript text into smaller chunks to make embedding and retrieval more efficient. We use an overlap to maintain context between chunks.

In [32]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.create_documents([transcript])
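
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is an illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and spaces, so its real chunk boundaries will differ. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks

# Tiny numbers so the overlap is easy to see.
demo_chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

Each chunk repeats the last two characters of the previous one, which is the same idea as the 200-character overlap used above.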

### Create Embeddings and Vector Store

Generate vector embeddings from the chunks using OpenAI embeddings, then store them in a FAISS vector store for similarity search.

In [5]:
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = FAISS.from_documents(chunks, embeddings)
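
Similarity search over embeddings can be sketched with toy two-dimensional vectors. This is a conceptual illustration under stated assumptions: it ranks by cosine similarity in plain Python, whereas a FAISS index may use a different distance metric and a far more efficient search. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" for three chunks and one query.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
nearest = top_k([1.0, 0.1], doc_vecs, k=2)
print(nearest)
```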

### Define Prompt Template

Set up a prompt template that instructs the language model to answer using only the retrieved transcript context and to say it doesn't know when that context is insufficient.

In [33]:
prompt = PromptTemplate(
    template="""
      You are a helpful assistant.
      Answer ONLY from the provided transcript context.
      If the context is insufficient, just say you don't know.

      {context}
      Question: {question}
    """,
    input_variables = ['context', 'question']
)
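
To see what the model eventually receives, the placeholders can be filled by hand. The sketch below uses plain `str.format` on a copy of the template text as a stand-in; for simple named placeholders like these, it should produce the same kind of result as the template's own formatting. The context and question values are made up for illustration.

```python
# Copy of the template text, with indentation removed for readability.
template = (
    "You are a helpful assistant.\n"
    "Answer ONLY from the provided transcript context.\n"
    "If the context is insufficient, just say you don't know.\n"
    "\n"
    "{context}\n"
    "Question: {question}\n"
)

# Illustrative values; in the pipeline these come from the retriever and the user.
filled = template.format(
    context="(retrieved transcript chunks go here)",
    question="Is the topic of aliens discussed?",
)
print(filled)
```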


### Setup Retriever and Formatter

Configure a retriever to find the top 4 most similar transcript chunks for a given question and a formatter to concatenate retrieved document texts.

In [34]:

retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 4})

def formatter(retrieved_docs):
    return "\n\n".join(doc.page_content for doc in retrieved_docs)
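
The formatter can be checked without calling the retriever. In this sketch `FakeDoc` is a hypothetical stand-in that models only the `page_content` attribute the formatter reads.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is modeled."""
    page_content: str

def formatter(retrieved_docs):
    # Same logic as the notebook's formatter: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in retrieved_docs)

context = formatter([FakeDoc("first chunk"), FakeDoc("second chunk")])
print(context)
```

The blank line between chunks keeps them visually separate inside the prompt.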

### Build Parallel Chain

Create a parallel runnable that receives the question and produces two outputs: `context`, built by retrieving and formatting the matching chunks, and `question`, passed through unchanged for the prompt.

In [None]:
parallel_chain = RunnableParallel({
    'context': retriever | RunnableLambda(formatter),
    'question': RunnablePassthrough()
})
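
Conceptually, the parallel step sends the same input to every branch and collects the results in a dictionary keyed by branch name. The following is a plain-Python sketch of that idea, not LangChain's implementation; `run_parallel` and both stand-in branches are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(branches, value):
    """Apply every branch to the same input and collect results by key."""
    with ThreadPoolExecutor() as pool:
        futures = {key: pool.submit(fn, value) for key, fn in branches.items()}
        return {key: future.result() for key, future in futures.items()}

branches = {
    # Stand-in for `retriever | RunnableLambda(formatter)`.
    "context": lambda q: f"(chunks retrieved for: {q})",
    # Stand-in for `RunnablePassthrough()`: returns the input unchanged.
    "question": lambda q: q,
}

result = run_parallel(branches, "Are aliens discussed?")
print(result)
```

The resulting dictionary has exactly the `context` and `question` keys that the prompt template expects.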


### Initialize Language Model and Parser

Set up the ChatOpenAI language model (`gpt-4o-mini`) with a low temperature (0.2) for more consistent, less random responses, and an output parser that extracts the response text as a plain string.

In [36]:
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)

parser = StrOutputParser()

### Combine Chain Components

Chain together retrieval, prompt formatting, language model inference, and parsing into one end-to-end processing pipeline.

In [None]:
chain = parallel_chain | prompt | llm | parser
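
The `|` operator composes steps so that each step's output becomes the next step's input. Here is a toy sketch of that composition in plain Python; the `Step` class and all three stand-in steps are hypothetical and only mimic the shape of the real chain.

```python
class Step:
    """Toy wrapper that lets plain functions be chained with `|`."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # a | b runs a first, then feeds its output into b.
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value):
        return self.fn(value)

# Stand-ins for parallel_chain, prompt, and llm | parser.
build_inputs = Step(lambda q: {"context": "ctx", "question": q})
fill_prompt = Step(lambda d: f"{d['context']}\nQuestion: {d['question']}")
fake_llm = Step(lambda text: text.upper())

toy_chain = build_inputs | fill_prompt | fake_llm
output = toy_chain.invoke("why?")
print(output)
```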

### Ask a Question

Invoke the chain with a sample question regarding the video transcript and print the answer.

In [37]:
question = "is the topic of aliens discussed in this video? if yes then what was discussed"
answer = chain.invoke(question)
answer

'Yes, the topic of aliens is discussed in this video. The speaker expresses their personal opinion that they believe we are likely alone in the universe, citing a lack of evidence for the existence of alien civilizations. They mention that despite the advancements in technology and searches for extraterrestrial life, we have not heard any signs of alien civilizations. The speaker also discusses the possibility of different types of alien civilizations, their potential behaviors, and the idea of a "great filter" that might prevent civilizations from advancing to a multi-planetary stage. They reflect on the implications of these questions for humanity, particularly regarding self-destruction.'

### Visualize Chain Graph (Optional)

Print an ASCII diagram showing the flow and components of the chain pipeline for debugging or learning purposes.

In [26]:
chain.get_graph().print_ascii()

            +---------------------------------+         
            | Parallel<context,question>Input |         
            +---------------------------------+         
                    **               ***                
                 ***                    **              
               **                         ***           
+----------------------+                     **         
| VectorStoreRetriever |                      *         
+----------------------+                      *         
            *                                 *         
            *                                 *         
            *                                 *         
      +-----------+                    +-------------+  
      | formatter |                    | Passthrough |  
      +-----------+*                   +-------------+  
                    **               ***                
                      ***         ***                   
                         **    