In [1]:
MODEL = 'llama3'

## Load model

In [2]:
from langchain_community.llms import Ollama

model = Ollama(model=MODEL, top_k=128)
print(model.invoke("Tell me a joke about scientists"))

Why did the scientist take out his doorbell?

Because he wanted to see some "experimental" results!


## Parser

In [3]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

'The Los Angeles Dodgers won the 2020 World Series, which was played during the COVID-19 pandemic. The series was played from October 5 to October 27, 2020, and the Dodgers defeated the Tampa Bay Rays in six games (4-2). This was their first World Series title since 1988.'
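
The `|` operator above chains two steps so that the output of the first becomes the input of the second. Below is a minimal pure-Python sketch of that idea. It is not how LangChain implements it; `Step`, `fake_model`, and `fake_parser` are hypothetical names used only for illustration, and no real model is called.

```python
class Step:
    """Toy stand-in for a chainable component (hypothetical, not a LangChain class)."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Hypothetical stand-ins: the "model" wraps its reply in a dict,
# and the "parser" pulls the plain string back out.
fake_model = Step(lambda prompt: {"content": f"echo: {prompt}"})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_model | fake_parser
result = toy_chain.invoke("hello")
print(result)
```

Because `toy_chain` is itself a `Step`, it can be piped into further steps, which is the property the later cells rely on when they reuse `chain` inside a larger chain.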

## Prompt template

In [4]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below no need to explain the results. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
print(prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?"))

Human: 
Answer the question based on the context below no need to explain the results. If you can't 
answer the question, reply "I don't know".

Context: Mary's sister is Susana

Question: Who is Mary's sister?



In [5]:
chain = prompt | model | parser
chain.invoke({
    "context": "Mary's sister is Susana",
    "question": "Who is Mary's sister?"
})

'Susana'

## Combining chains

In [6]:
translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)

In [7]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

print(translation_chain.invoke(
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have and what is her name?",
        "language": "Spanish",
    }
))

The translation of "Sister" in this context is "Hermana".

So, the correct translation would be:

1 hermana: Susana -> Una hermana llamada Susana (One sister named Susana)
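
The dictionary at the start of `translation_chain` passes the same input to each of its values and collects the results under the matching keys, which then fill the translation prompt. A rough pure-Python sketch of that fan-out follows. `run_mapping` and `fake_qa_chain` are hypothetical helpers, and the QA step is hard-coded rather than calling a model.

```python
from operator import itemgetter


def run_mapping(mapping, inputs):
    """Call every value in `mapping` with the same inputs and collect the results."""
    return {key: step(inputs) for key, step in mapping.items()}


def fake_qa_chain(inputs):
    # Hypothetical stand-in for `chain`; a real chain would query the model here.
    return "Susana"


mapping = {"answer": fake_qa_chain, "language": itemgetter("language")}
inputs = {
    "context": "Mary's sister is Susana",
    "question": "Who is Mary's sister?",
    "language": "Spanish",
}

filled = run_mapping(mapping, inputs)
translation_request = "Translate {answer} to {language}".format(**filled)
print(translation_request)
```

`itemgetter("language")` simply forwards the `language` field untouched, while the `answer` key is computed by running the whole question-answering step first.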


## Transcribing a YouTube video

In [8]:
import os
import tempfile
import whisper
from pytube import YouTube

# This is the YouTube video we're going to use.
YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=cdiD-9MMpb0"

# Let's do this only if we haven't created the transcription file yet.
if not os.path.exists("transcription.txt"):
    youtube = YouTube(YOUTUBE_VIDEO)
    audio = youtube.streams.filter(only_audio=True).first()

    # Let's load the base model. This is not the most accurate
    # model but it's fast.
    whisper_model = whisper.load_model("base")

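    with tempfile.TemporaryDirectory() as tmpdir:
        # Download the audio into a temporary directory and transcribe it.
        audio_path = audio.download(output_path=tmpdir)
        transcription = whisper_model.transcribe(audio_path, fp16=False)["text"].strip()

        # Cache the transcription so the download and transcription run only once.
        with open("transcription.txt", "w") as file:
            file.write(transcription)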

In [9]:
with open("transcription.txt") as file:
    transcription = file.read()

transcription[:100]

"I think it's possible that physics has exploits and we should be trying to find them. arranging some"

## Entire transcription as context

In [12]:
try:
    print(chain.invoke({
        "context": transcription,
        "question": "Is reading papers a good idea?"
    }))
except Exception as e:
    print(e)

According to Andre Karpathy, reading papers can be an interesting experience when AI-generated content becomes more prevalent. He mentions that if the cost of content creation falls to zero, Hollywood might start using AI to generate scenes, which could open up new possibilities for creating movies like Avatar for under a million dollars or even just by talking to your phone. He also thinks that generating synthetic content could be humbling, as humans tend to treat themselves as special for being able to create art and ideas.


## Text Splitter

In [14]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('transcription.txt')
text_documents = loader.load()

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(text_documents)

[Document(page_content="I think it's possible that physics has exploits and we should be trying to find them. arranging some", metadata={'source': 'transcription.txt'}),
 Document(page_content='arranging some kind of a crazy quantum mechanical system that somehow gives you buffer overflow,', metadata={'source': 'transcription.txt'}),
 Document(page_content='buffer overflow, somehow gives you a rounding error in the floating point. Synthetic intelligences', metadata={'source': 'transcription.txt'}),
 Document(page_content="intelligences are kind of like the next stage of development. And I don't know where it leads to.", metadata={'source': 'transcription.txt'}),
 Document(page_content='where it leads to. Like at some point, I suspect the universe is some kind of a puzzle. These', metadata={'source': 'transcription.txt'}),
 Document(page_content='of a puzzle. These synthetic AIs will uncover that puzzle and solve it. The following is a', metadata={'source': 'transcription.txt'}),
 Docum
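
The output above shows consecutive chunks sharing some text at their boundaries. The sketch below illustrates the `chunk_size` / `chunk_overlap` idea with a plain sliding window over characters. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as whitespace, which this hypothetical `chunk_text` helper does not do.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = chunk_text(sample, chunk_size=12, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbor, at the cost of storing some text twice.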

## Similarity Search

In [20]:
from langchain_community.embeddings.ollama import OllamaEmbeddings

# Assumption: embed with the same model as above; OllamaEmbeddings() would otherwise use its own default.
embeddings = OllamaEmbeddings(model=MODEL)

# embed_query is the synchronous call. aembed_query returns a coroutine
# that must be awaited, which is why calling len() on its result failed.
embedded_query = embeddings.embed_query("Who is Mary's sister?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])
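
Once a query and some documents are embedded, similarity search comes down to comparing vectors. The sketch below ranks documents by cosine similarity. The notebook does not say which metric is used, so cosine similarity is an assumption here, and the three-dimensional vectors are made-up values rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
documents = {
    "Mary's sister is Susana": [0.9, 0.1, 0.8],
    "Pedro's mother is a teacher": [0.0, 1.0, 0.1],
}

# Sort document texts from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda text: cosine_similarity(query, documents[text]),
    reverse=True,
)
print(ranked[0])
```

With real embeddings the vectors would come from `embeddings.embed_query` and have many more dimensions, but the ranking step would look the same.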
