<a href="https://colab.research.google.com/github/Troyanovsky/Building-with-GenAI/blob/main/tutorial_generative_ai_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build with GenAI: Generative AI Search with Local LLM

- Runs on a local LLM instead of OpenAI's API, so you can run the code on your own computer and keep everything private. You can also use Google Colab's free T4 GPU: choose Runtime > Change runtime type > T4 GPU, then run all cells.
- You can easily adapt the code to other tasks, such as searching arXiv or searching and summarizing local documents.

This Colab notebook is the accompanying code for my article at: https://medium.com/design-bootcamp/build-with-genai-generative-search-with-local-llm-342eb5a5037a


This is part of the "Build with GenAI" series. Other tutorial projects can be found at: https://github.com/Troyanovsky/Building-with-GenAI/tree/main

In [None]:
# Install llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir

%cd /content
!apt-get update -qq && apt-get install -y -qq aria2

# Download a local large language model. I'm using OpenHermes-2.5-Mistral-7B-16k-GGUF, which has a longer context window and good quality for its size.
# If you want to use other local models that can easily run on consumer hardware, check out this repo: https://github.com/Troyanovsky/Local-LLM-Comparison-Colab-UI/
# The URL is quoted so the shell does not treat the "?" as a glob character
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M "https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-16k-GGUF/resolve/main/openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf?download=true" -d /content/model/ -o openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.63.tar.gz (37.5 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting diskcache>=5.6.1 (from llama-cpp-python)
  Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.63-cp310-cp310-linux_x86_64.whl size=39168030 sha256=1b3a605d471975601119bf1ce3a515a808104258e5601568cdbcf4b804f56c5f
  Stored in dire

In [None]:
import requests
import subprocess
import json
import time

# Function for calling search API
def get_search_results(search_term, max_retries=2, retry_delay=2):
    url = "https://google.serper.dev/search"
    payload = json.dumps({"q": search_term})
    headers = {
        'X-API-KEY': '<your_api_key>', # Replace with your own API Key
        'Content-Type': 'application/json'
    }

    retries = 0
    while retries < max_retries:
        try:
            # A timeout keeps a stalled connection from blocking the retry loop forever
            response = requests.post(url, headers=headers, data=payload, timeout=30)
            response.raise_for_status()  # Raise an exception for non-2xx status codes
            data = response.json()
            organic_results = data.get("organic", [])

            search_results = []
            search_results_str = ""
            for index, result in enumerate(organic_results):
                title = result.get("title", "")
                link = result.get("link", "")
                snippet = result.get("snippet", "")
                search_results.append({"title": title, "link": link, "snippet": snippet})
                search_results_str += f"index: {index}\ntitle: {title}\nlink: {link}\nsnippet: {snippet}\n\n"
            return search_results, search_results_str
        except requests.exceptions.RequestException as e:
            retries += 1
            print(f"Error: {e}. Retrying in {retry_delay} seconds... (Attempt {retries}/{max_retries})")
            time.sleep(retry_delay)

    raise Exception("Maximum retries exceeded. Failed to retrieve search results.")


def fetch_url_content(url):
    # Prepend "https://r.jina.ai/" to the input URL
    # This converts the URL into LLM-friendly format. Check out their GitHub: https://github.com/jina-ai/reader
    prefixed_url = f"https://r.jina.ai/{url}"


    try:
        curl_cmd = [
            "curl",
            "-H",
            "Accept: text/event-stream",
            prefixed_url,
        ]
        curl_process = subprocess.Popen(curl_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, stderr = curl_process.communicate()

        if curl_process.returncode == 0:
            content = stdout.decode("utf-8")

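            # Keep only the payload of each server-sent event ("data: " lines)
            content_lines = [line[6:] for line in content.split("\n") if line.startswith("data: ")]
            # Try the joined payload first, then each event from last to first,
            # in case the stream carried several JSON events.
            for candidate in ["\n".join(content_lines)] + content_lines[::-1]:
                try:
                    return json.loads(candidate)["content"]
                except (ValueError, KeyError, TypeError):
                    continue

            return ""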
        else:
            error_message = stderr.decode("utf-8")
            raise Exception(f"cURL request failed: {error_message}")

    except Exception as e:
        raise Exception(f"An error occurred: {e}")
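
The rest of the pipeline only needs a list of `title`/`link`/`snippet` dictionaries plus the indexed text block that the model picks from, so swapping in another source (arXiv, local documents) mostly means producing that same shape. A minimal sketch, assuming a hypothetical helper named `format_results` that is not part of the original notebook:

```python
def format_results(results):
    """Build the indexed text block that the model chooses a result from.

    `results` is a list of dicts with optional "title", "link" and "snippet"
    keys, the same shape get_search_results() returns. The function name and
    the sample data below are illustrative, not part of the notebook.
    """
    blocks = []
    for index, result in enumerate(results):
        blocks.append(
            f"index: {index}\n"
            f"title: {result.get('title', '')}\n"
            f"link: {result.get('link', '')}\n"
            f"snippet: {result.get('snippet', '')}\n\n"
        )
    return "".join(blocks)


# Made-up entries standing in for results from another source
sample = [
    {"title": "Paper A", "link": "https://example.org/a", "snippet": "First abstract."},
    {"title": "Paper B", "link": "https://example.org/b"},
]
print(format_results(sample))
```

Any function that returns a list in this shape together with `format_results(list)` could stand in for `get_search_results` without touching the later cells.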

In [None]:
# Setting up a local LLM for summarization or chat
from llama_cpp import Llama

def load_llama():
    llm = Llama(
            model_path="/content/model/openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf", # If you're using another model, change the name
            chat_format="chatml", # Use the chat_format that matches the model
            n_gpu_layers=-1, # Use -1 for all layers on GPU
            n_ctx=12288 # Set context size
    )
    return llm

def call_llama(prompt: str, llm) -> str:
    # Send a single user message to the model and return the reply text
    output = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": "You're a helpful assistant.",
            }, # Feel free to modify the prompt to suit your own formatting needs
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )
    output_text = output['choices'][0]['message']['content']
    return output_text
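
Chat models often wrap short replies in quotation marks or add stray whitespace; the recorded run at the end of this notebook shows the search term coming back as `"Llama-3 release date"` with the quotes included. A small sketch of a cleanup step that could be applied to the output of `call_llama`, assuming a hypothetical helper named `clean_reply` that is not part of the original notebook:

```python
def clean_reply(text):
    """Trim whitespace and one pair of matching wrapping quotes from a reply."""
    text = text.strip()
    # Only strip when the first and last characters are the same quote character
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "\"'`":
        text = text[1:-1].strip()
    return text


print(clean_reply('"Llama-3 release date"'))  # prints: Llama-3 release date
```

Replies without wrapping quotes pass through unchanged apart from the whitespace trim.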

In [None]:
def pick_url(query, search_results_str, search_results, llm, max_retries=2):
    prompt = f"Given the following question, which of the following URLs is most likely to contain the answer? Reply with ONLY the index number. Question: ```{query}``` List: ```{search_results_str}```"

    # Ask once, then re-ask up to max_retries times if the reply is not a usable index
    for _ in range(max_retries + 1):
        reply = call_llama(prompt, llm)
        try:
            index = int(reply.strip())
        except ValueError:
            continue
        # Reject numbers that do not point at one of the search results
        if 0 <= index < len(search_results):
            return index

    raise Exception("Failed to get a valid result index from the model after multiple retries.")
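
`pick_url` relies on the model replying with a bare number, but a chat model may answer with something like `Index: 3`. A sketch of a more tolerant parser, assuming a hypothetical helper named `parse_index` (not part of the original notebook) that takes the first run of digits in the reply and checks it against the number of results:

```python
import re


def parse_index(reply, n_results):
    """Return the first integer in `reply` if it is a valid list index, else None."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    index = int(match.group())
    # Only accept numbers that point at an existing search result
    return index if 0 <= index < n_results else None


print(parse_index("Index: 3", 10))      # prints: 3
print(parse_index("none of them", 10))  # prints: None
print(parse_index("12", 10))            # prints: None
```

Returning `None` instead of raising lets the caller decide whether to re-ask the model or fall back to the first result.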

In [None]:
def search_with_ai(user_input):
    llm = load_llama()

    search_term_prompt = f"Based on the following question, please come up with a search term to use in the search engine. Reply with the search term only. Question: ```{user_input}```"
    search_term = call_llama(search_term_prompt, llm)
    print(f"Searching: {search_term}")

    # Search with the search API
    search_results, search_results_str = get_search_results(search_term)

    # Pick the most relevant URL
    try:
        top_url_index = pick_url(user_input, search_results_str, search_results, llm)
    except Exception as e:
        print(f"Error picking URL: {e}")
        return

    # Fetch the content from the top URL
    try:
        top_url = search_results[top_url_index]["link"]
        top_snippet = search_results[top_url_index]["snippet"]
        print(f"Crawling: {top_url}")
        content = fetch_url_content(top_url)
    except Exception as e:
        print(f"Error fetching URL content: {e}")
        del llm
        return

    # Truncate the content if it's longer than 36864 characters (3 x the 12288-token context size).
    # This is a very lazy estimate; you can count actual tokens instead.
    if len(content) > 36864:
        content = content[:36864]

    # Call LLM with the content and get the answer
    answer_prompt = f"Answer the question from the given content. Question: ```{user_input}```\n\nContent:```From URL: {top_url} Snippet: {top_snippet}\n{content}```"
    try:
        answer = call_llama(answer_prompt, llm)
        return answer
    except Exception as e:
        print(f"Error calling LLM: {e}")
        return
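
The character cut in `search_with_ai` is only an estimate of how much text fits in the context window. Counting tokens is more precise: llama-cpp-python's `Llama` class exposes `tokenize` (bytes to token ids) and `detokenize` (token ids to bytes). The sketch below assumes a hypothetical helper named `truncate_to_tokens` and uses a stand-in tokenizer, also hypothetical, so that it runs without a model file:

```python
def truncate_to_tokens(text, llm, max_tokens):
    """Cut `text` so it encodes to at most `max_tokens` tokens.

    `llm` needs tokenize(bytes) -> list[int] and detokenize(list[int]) -> bytes,
    which is the interface llama-cpp-python's Llama class provides.
    """
    tokens = llm.tokenize(text.encode("utf-8"))
    if len(tokens) <= max_tokens:
        return text
    # errors="ignore" drops a multi-byte character that the cut split in half
    return llm.detokenize(tokens[:max_tokens]).decode("utf-8", errors="ignore")


class ByteTokenizer:
    """Stand-in for a loaded model so the sketch runs without a GGUF file:
    every UTF-8 byte counts as one token."""

    def tokenize(self, data):
        return list(data)

    def detokenize(self, tokens):
        return bytes(tokens)


fake_llm = ByteTokenizer()
print(truncate_to_tokens("hello world", fake_llm, 5))  # prints: hello
```

With the notebook's model, a call such as `content = truncate_to_tokens(content, llm, 8000)` could replace the character cut. The figure 8000 is an arbitrary example that leaves room under `n_ctx=12288` for the prompt and the answer, and passing `add_bos=False` to `tokenize` may be needed to keep a beginning-of-sequence token out of the round trip.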

In [None]:
question = input("What is your question? \n")
answer = search_with_ai(question)
print(answer)

What is your question? What is Llama-3? When is it released?


llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from /content/model/openhermes-2.5-mistral-7b-16k.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = nurtureai_openhermes-2.5-mistral-7b-16k
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - 

"Llama-3 release date"


Llama.generate: prefix-match hit

llama_print_timings:        load time =     245.22 ms
llama_print_timings:      sample time =       1.30 ms /     2 runs   (    0.65 ms per token,  1539.65 tokens per second)
llama_print_timings: prompt eval time =    1299.16 ms /  1079 tokens (    1.20 ms per token,   830.54 tokens per second)
llama_print_timings:        eval time =      26.44 ms /     1 runs   (   26.44 ms per token,    37.82 tokens per second)
llama_print_timings:       total time =    1349.01 ms /  1080 tokens


https://www.geeksforgeeks.org/llama-3-metas-new-ai-model/


Llama.generate: prefix-match hit

llama_print_timings:        load time =     245.22 ms
llama_print_timings:      sample time =     102.07 ms /   157 runs   (    0.65 ms per token,  1538.13 tokens per second)
llama_print_timings: prompt eval time =    6396.92 ms /  4822 tokens (    1.33 ms per token,   753.80 tokens per second)
llama_print_timings:        eval time =    5685.59 ms /   156 runs   (   36.45 ms per token,    27.44 tokens per second)
llama_print_timings:       total time =   12852.58 ms /  4978 tokens


Llama-3 is Meta's latest and most powerful large language model (LLM). It was released on April 18, 2024. It uses a powerful tokenizer with a vocabulary of 128,000 tokens and is trained on 15 trillion tokens, making it 7 times larger than its predecessor Llama-2. Llama-3 excels at understanding language, enhancing the performance of Meta's platforms like Facebook, Instagram, WhatsApp, and Messenger. It offers features such as improved creativity, increased productivity, accessibility improvements, and search integration within the apps. The full open-source model is expected to be released in July 2024.
