# Camel - Loading a YouTube video using LlamaIndex and Llama collectors

In this example, rather than training a model from scratch, we take a pre-trained model and extend what it can answer by giving it the transcript of a YouTube video, loaded with Llama collectors (here, the `YoutubeTranscriptReader` loader) and indexed with LlamaIndex. This approach, known as in-context learning, leaves the model's weights unchanged, so it uses far fewer resources than training while still tailoring the answers to a specific domain, in this case geoscience.
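The idea of answering from supplied context can be sketched in a few lines of plain Python: the retrieved text is pasted into the prompt ahead of the question, and the model itself is not modified. The function name `build_context_prompt`, the template wording, and the placeholder chunks are assumptions made for illustration; they are not the prompts llama_index uses internally.

```python
def build_context_prompt(context_chunks, question):
    """Place retrieved text ahead of the question in a single prompt string."""
    context = "\n\n".join(chunk.strip() for chunk in context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and no prior knowledge, "
        "answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks, invented for this illustration only.
example_chunks = [
    "First transcript excerpt goes here.",
    "Second transcript excerpt goes here.",
]
prompt = build_context_prompt(example_chunks, "What is the video about?")
print(prompt)
```

Everything the model "learns" about the video lives in that string, which is why the approach needs no training step.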

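The setup cell below sets an OpenAI API key, because llama_index falls back to OpenAI models by default whenever no LLM or embedding model is passed explicitly. The queries in this example run on the Camel-5B model from Hugging Face instead.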

In [1]:
import os
os.environ['OPENAI_API_KEY'] = "insert your openai key "
os.environ["CUDA_VISIBLE_DEVICES"] = "0" 


from llama_index import GPTListIndex, SimpleDirectoryReader
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext
from llama_index.llm_predictor import HuggingFaceLLMPredictor
from llama_index import GPTVectorStoreIndex, GPTEmptyIndex

from pathlib import Path  # needed for the pdf connector
from llama_index import download_loader

# setup prompts - specific to Camel
from llama_index.prompts.prompts import SimpleInputPrompt

import torch


## From Context


In [2]:
# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = SimpleInputPrompt(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)
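To see what the model actually receives, the sketch below fills the same template with a query using plain `str.format`. This only mimics the substitution of `{query_str}`; how llama_index applies the wrapper internally may differ, and the names `CAMEL_TEMPLATE` and `wrap_query` are hypothetical.

```python
# Same instruction template as in the cell above, as a plain string.
CAMEL_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)


def wrap_query(query_str: str) -> str:
    """Substitute the user's query into the instruction template."""
    return CAMEL_TEMPLATE.format(query_str=query_str)


wrapped = wrap_query("What is the video about")
print(wrapped)
```

The text ends with `### Response:`, so the model's continuation is read as the answer.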

In [3]:
hf_predictor = HuggingFaceLLMPredictor(
    max_input_size=2048, 
    max_new_tokens=256,
    # temperature=0.25,
    # do_sample=False,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    # model_kwargs={"torch_dtype": torch.bfloat16}
    model_kwargs={"torch_dtype": torch.float16},
)

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

service_context = ServiceContext.from_defaults(chunk_size_limit=512, llm_predictor=hf_predictor, embed_model=embed_model)

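The `chunk_size_limit=512` setting caps how much text goes into each indexed chunk. A rough sketch of that idea follows, using whitespace-separated words as a stand-in for model tokens. That stand-in is an assumption: llama_index counts tokens with a tokenizer, and its splitter may differ in detail. The function `split_into_chunks` and its `overlap` parameter are hypothetical.

```python
def split_into_chunks(text, chunk_size=512, overlap=0):
    """Split text into pieces of at most `chunk_size` words each."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Synthetic text of 1200 "words", for illustration only.
sample_text = " ".join(f"w{i}" for i in range(1200))
sample_chunks = split_into_chunks(sample_text, chunk_size=512)
print([len(chunk.split()) for chunk in sample_chunks])
```

Smaller chunks keep each prompt within the model's 2048-token input window, at the cost of more chunks to process per query.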

In [7]:
YoutubeTranscriptReader = download_loader("YoutubeTranscriptReader")

loader = YoutubeTranscriptReader()
documents = loader.load_data(ytlinks=['https://www.youtube.com/watch?v=-irZQY9H6mg'])

# new_index = GPTVectorStoreIndex.from_documents(documents)
new_index = GPTListIndex.from_documents(documents, service_context=service_context)
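The cell above picks a list index over the commented-out vector store index. As a rough sketch of the difference, a list-style index hands every chunk to the LLM in order, while a vector-style index keeps only the chunks whose embeddings are closest to the query. The functions and the two-dimensional toy vectors below are made up for illustration and are not the llama_index implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def list_retrieve(nodes):
    """List-style retrieval: every node is used, in document order."""
    return list(nodes)


def vector_retrieve(nodes, embeddings, query_embedding, top_k=2):
    """Vector-style retrieval: keep the top_k nodes most similar to the query."""
    scored = sorted(
        zip(nodes, embeddings),
        key=lambda pair: cosine_similarity(pair[1], query_embedding),
        reverse=True,
    )
    return [node for node, _ in scored[:top_k]]


# Toy data, invented for this illustration only.
nodes = ["chunk 0", "chunk 1", "chunk 2"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_embedding = [1.0, 0.1]

print(list_retrieve(nodes))
print(vector_retrieve(nodes, embeddings, query_embedding, top_k=2))
```

For a summary-style question about a whole video, reading every chunk is a reasonable choice; for a narrow question over a large corpus, similarity-based retrieval reduces the number of LLM calls.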

In [8]:
# query with embed_model specified
query_engine = new_index.as_query_engine(
    response_mode='tree_summarize',
    verbose=False,
)
response = query_engine.query("What deep learning methods are used to interpolate seismic data")
print(response)

Token indices sequence length is longer than the specified maximum sequence length for this model (858 > 512). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Deep learning methods used to interpolate seismic data include auto encoders (e.g., unit and resnet), discriminator-based methods (e.g., MDA gun), and hybrid methods (e.g., fusion).
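The `tree_summarize` response mode used above can be pictured as a bottom-up reduction: groups of chunks are condensed, then the condensed pieces are grouped and condensed again until one answer remains. The sketch below shows that shape under stated assumptions: `summarize` is a hypothetical callable standing in for the LLM call, `fan_in` is an illustrative parameter, and the real llama_index logic packs chunks by token budget rather than by a fixed count.

```python
from typing import Callable, List


def tree_summarize(chunks: List[str],
                   summarize: Callable[[str], str],
                   fan_in: int = 2) -> str:
    """Reduce chunks to one string by summarizing groups level by level."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")

    level = list(chunks)
    while True:
        # Summarize each group of `fan_in` neighbours into one parent node.
        level = [
            summarize("\n".join(level[i:i + fan_in]))
            for i in range(0, len(level), fan_in)
        ]
        if len(level) == 1:
            return level[0]


def fake_summarize(text: str) -> str:
    # Stand-in for an LLM call: it only brackets the pieces it was given.
    return "[" + " | ".join(text.split("\n")) + "]"


result = tree_summarize(["a", "b", "c", "d", "e"], fake_summarize)
print(result)
```

The nesting of brackets in the printed result mirrors the tree: each pair of brackets corresponds to one call that a real run would send to the model.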


In [9]:
# query with embed_model specified
query_engine = new_index.as_query_engine(response_mode='tree_summarize')
response_2 = query_engine.query("What is the video about")
print(response_2)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The video is about the application of artificial intelligence in the oil and gas industry, specifically in the field of seismic data interpretation.
