# Task: 

### Implementation of IDEA using yt transcrpits

1. choosing best LLM for our usecase: where usecase="summarization of the context"
    - **choosen model** :Mistral
2. Dataset creation



In [81]:
token ="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"
yt_link = "https://youtu.be/CDZ9REOh2xA?si=HaSORmywGwQA97_T"

In [82]:
# ! pip install -q langchain langchain_chroma langchain_community langchain_core langchain_huggingface transformers
# ! pip install -qU langchain_huggingface
# ! pip install pytube
# ! pip install --upgrade --quiet  youtube-transcript-api

In [83]:
import os
import time
from dotenv import load_dotenv
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_chroma import Chroma
from langchain_community.llms import HuggingFaceHub
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_community.document_loaders import UnstructuredCSVLoader
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface.embeddings import HuggingFaceEndpointEmbeddings
from langchain_huggingface import HuggingFaceEndpoint
from langchain.chains import HypotheticalDocumentEmbedder

In [84]:
from langchain_community.document_loaders import YoutubeLoader

## Laoding text from video

In [85]:
import re


In [86]:
loader = YoutubeLoader.from_youtube_url(yt_link)
docs = loader.load()
print(f"number of words: {len(docs[0].page_content.split())}")

number of words: 1422


In [87]:
# Cleaning the text using regular expressions


def clean_text(text):
  """Cleans the text by removing unwanted characters and symbols.

  Args:
    text: The input text to be cleaned.

  Returns:
    The cleaned text.
  """

  # Remove punctuation and special characters except for a few symbols
  allowed_symbols = r"[^a-zA-Z0-9\s\.\,\?\!\-\(\)]"
  text = re.sub(allowed_symbols, '', text)

  # Remove extra spaces
  text = re.sub(r'\s+', ' ', text)

  return text

content = clean_text((docs[0].page_content))

In [88]:
# saving the loaded docs to the txt files

with open("texts\lex-elon.txt", "w") as obj:
    obj.write(content)

## for Chunking and converting into embbedings

In [89]:
EmbedModel = HuggingFaceEndpointEmbeddings(model="sentence-transformers/sentence-t5-xxl", huggingfacehub_api_token=token)

In [90]:
text_splitter = RecursiveCharacterTextSplitter()
chunck_list = text_splitter.split_documents(docs)

## Loading chunked docs to vector DB

In [91]:
len(chunck_list)

2

In [92]:
vector_store = Chroma.from_documents(chunck_list, EmbedModel, persist_directory="./chroma_db")
# vector_store = Chroma(persist_directory="./chroma_db", embedding_function=EmbedModel)

## Load the LLM

In [93]:
llm = HuggingFaceEndpoint(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",
    task="text-generation",
    max_new_tokens=3000,
    do_sample=False,
    huggingfacehub_api_token=token
    
)


The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [94]:
# Define the contextualization prompt for reformulating questions based on chat history
contextualize_q_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", """Given a chat history and the latest user question 
which might reference context in the chat history, formulate a standalone question 
which can be understood without the chat history. Do NOT answer the question, 
just reformulate it if needed and otherwise return it as is."""),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

In [95]:
# Statefully manage chat history
chat_history_store = {}

def get_chat_session_history(session_id: str) -> BaseChatMessageHistory:
    """Fetches the chat history for the given session."""
    if session_id not in chat_history_store:
        chat_history_store[session_id] = ChatMessageHistory()
    return chat_history_store[session_id]

In [96]:
# Define the chat prompt template for QA
qa_prompt_template = ChatPromptTemplate.from_template("""
Answer the following question based only on the provided context. 
Think step by step before providing a detailed answer. 
I will tip you $1000 if the user finds the answer helpful.
<context>
{context}
</context>
Question: {input}
""")

In [97]:
# Create the question answer chain
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt_template)

# Create the history-aware retriever
history_aware_retriever = create_history_aware_retriever(
    llm, vector_store.as_retriever(), contextualize_q_prompt
)

# Create the retrieval chain
retrieval_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)

# Create the conversational RAG chain with chat history management
conversational_rag_chain = RunnableWithMessageHistory(
    retrieval_chain,
    get_chat_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

In [98]:
user_question = "summarize the whole podcast"
session_id = "u1"

In [102]:
user_questio = input("Enter: ")
response = conversational_rag_chain.invoke(
    {"input": user_question},
    config={"configurable": {"session_id": session_id}},
)
print(response['answer'])

Enter:  convert to 2 persons like person a and b



The podcast discusses the characteristics of a great engineering team, as observed in the Memphis supercomputer cluster project, and the first principles algorithm used to achieve simplicity, efficiency, and automation in engineering. The algorithm consists of five steps: question the requirements, try to delete the process steps, optimize or simplify, speed up only after deletion and optimization, and automate. The speaker emphasizes the importance of being willing to delete and redo work, as well as the need to overcome the human tendency to overcomplicate things. The speaker also mentions the Memphis project's challenges with power fluctuation issues and the need to optimize the electrical system for the extreme power demands of the supercomputer cluster. The speaker also discusses the need for a powerful training compute for AI and the importance of human talent, unique access to data, and real-world data collection for the success of AI systems. The speaker also mentions the pote