# 🚀 Run the new Meta Llama 3.1-8B-Instruct for free with Ollama

Meta Llama 3.1 is the latest open-source LLM released by Meta. It supports 8 languages: `English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai`. This notebook runs the model for free, in the quantized form provided by `Ollama`, on the `T4 GPU` of Google Colab.

The **quantized** 🎯 form is smaller than the original model, which saves disk space and download bandwidth and gives faster inference.
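
As a rough illustration of why quantization shrinks the download, the sketch below estimates the weight size of an 8-billion-parameter model at two precisions. The rounded parameter count and the 4-bit figure are assumptions about a typical quantized build, not measurements of the exact file Ollama serves. Real files tend to be somewhat larger because of metadata and layers kept at higher precision.

```python
def approx_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimate the size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 8e9  # assumption: "8B" rounded to 8 billion parameters

fp16_gb = approx_weight_size_gb(N_PARAMS, 16)  # original 16-bit weights
q4_gb = approx_weight_size_gb(N_PARAMS, 4)     # assumed 4-bit quantization

print(f"16-bit: ~{fp16_gb:.0f} GB, 4-bit: ~{q4_gb:.0f} GB")
```

Under these assumptions the quantized weights take about a quarter of the space, which is what lets the model fit comfortably in the memory of a free Colab GPU.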

Run the cells from top to bottom, that is, from **Install Ollama** to **Run the model**.

NOTE: This Colab notebook performs better than my previous notebook, Llama 3.1-8B_Colab, in both inference speed and model size. It also spares you the Hugging Face access request and token hassle.







| Notebook | Google Colab |
|:--|:-:|
| ⭐ **Llama 3.1-8B_Colab** | [![Open in Colab](https://raw.githubusercontent.com/hollowstrawberry/kohya-colab/main/assets/colab-badge.svg)](https://colab.research.google.com/drive/10c_GQ8wqVXuX5JciX0gHVstO0WHaUbqD?usp=sharing) |
| 🌟 **Llama 3.1-8B_QuantisedxOllama** | [![Open in Colab](https://raw.githubusercontent.com/hollowstrawberry/kohya-colab/main/assets/colab-badge.svg)](https://colab.research.google.com/drive/1S9q6cvH8y2WMml7pczg0Bl-VS6Le-jzZ?usp=sharing) |

❗ Quantization can reduce the model's performance in some use cases. For example, it may make more mistakes or hallucinate, so always verify important information. If you have access to the model on Hugging Face and the Pro version of Google Colab, you can use the original, unquantized model instead.

In [None]:
#@title ##Install Ollama🦙
!sudo apt-get install -y pciutils # provides lspci, which helps the installer detect the GPU
!curl -fsSL https://ollama.com/install.sh | sh # download and install Ollama
from IPython.display import clear_output
clear_output()
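
Because the install cell clears its own output, a failed install or a missing GPU is easy to overlook. The following sketch, which is not part of the original notebook, checks that the `ollama` binary is on the `PATH` and asks `nvidia-smi` for the GPU name and memory. The helper names `find_tool` and `gpu_summary` are hypothetical.

```python
import shutil
import subprocess

def find_tool(name: str):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)

def gpu_summary():
    """Return one line per GPU (name, total memory), or None if none is reported."""
    if find_tool("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None

print("ollama binary:", find_tool("ollama") or "not found")
print("GPU:", gpu_summary() or "not detected (check Runtime > Change runtime type)")
```

If no GPU is detected, Ollama can still run on the CPU, but inference will likely be much slower.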

In [None]:
#@title ##Start Ollama API Server🌐
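import os
import subprocess
import threading

def ollama():
    # Listen on all interfaces on port 11434 and accept requests from any origin.
    os.environ['OLLAMA_HOST'] = '0.0.0.0:11434'
    os.environ['OLLAMA_ORIGINS'] = '*'
    # Popen returns immediately; the server keeps running in the background.
    subprocess.Popen(["ollama", "serve"])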

ollama_thread = threading.Thread(target=ollama)
ollama_thread.start()
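
The cell above returns as soon as the server process is launched, so the next cell can run before the API is ready to accept requests. A minimal sketch of a readiness check follows, assuming the server answers plain HTTP requests at `http://127.0.0.1:11434/`. The function name `wait_for_ollama` and its default timings are hypothetical.

```python
import time
import urllib.request

def wait_for_ollama(url: str = "http://127.0.0.1:11434/",
                    timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_ollama()` before `ollama pull` is one way to get a clear `False` when the server did not start, rather than a confusing error in a later cell.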

In [None]:
#@title ##Download the model - Meta-Llama-3.1-8B-Instruct⬇️

from IPython.display import clear_output
!ollama pull llama3.1:8b # download the quantized model weights
!pip install -U "lightrag[ollama]" # quoted so the shell does not expand the brackets
!pip install gradio
clear_output()
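
Since this cell also clears its output, it can be useful to confirm that the model really was downloaded. The sketch below assumes the local server exposes its model list at `/api/tags` as JSON with a `models` array whose entries carry a `name` field. The helper names are hypothetical, and the sample response is hand-written for illustration.

```python
import json
import urllib.request

def list_model_names(tags_response: dict) -> list:
    """Extract model names from the JSON body returned by /api/tags."""
    return [m.get("name", "") for m in tags_response.get("models", [])]

def fetch_tags(base_url: str = "http://127.0.0.1:11434") -> dict:
    """Ask the local Ollama server which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Offline example with a hand-written response of the assumed shape:
sample = {"models": [{"name": "llama3.1:8b"}]}
print("llama3.1:8b" in list_model_names(sample))
```

With the server running, `"llama3.1:8b" in list_model_names(fetch_tags())` performs the same check against the real model list.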

In [None]:
#@title ##Run the model🚀
#@markdown ❗Run this cell before running the `Gradio UI` cell.

from lightrag.core.generator import Generator
from lightrag.core.component import Component
from lightrag.core.model_client import ModelClient
from lightrag.components.model_client import OllamaClient
from IPython.display import Markdown, display



qa_template = r"""<SYS>
You are a helpful assistant.
</SYS>
User: {{input_str}}
You:"""

class SimpleQA(Component):
    def __init__(self, model_client: ModelClient, model_kwargs: dict):
        super().__init__()
        self.generator = Generator(
            model_client=model_client,
            model_kwargs=model_kwargs,
            template=qa_template,
        )

    def call(self, input: str):
        # Returns the generator's output object; the text is in its `.data` field.
        return self.generator.call({"input_str": str(input)})

    async def acall(self, input: str):
        return await self.generator.acall({"input_str": str(input)})

model = {
    "model_client": OllamaClient(),
    "model_kwargs": {"model": "llama3.1:8b"}
}
qa = SimpleQA(**model)
users_prompt = "Hey! What's up?" #@param {type:"string"}
output = qa(users_prompt)
display(Markdown(f"**Answer:** {output.data}"))
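
The `SimpleQA` component sends its requests to the Ollama server started earlier. As a cross-check that does not depend on `lightrag`, the sketch below posts a prompt straight to the server's `/api/generate` endpoint using only the standard library. It bypasses `qa_template`, so answers may differ slightly, and the helper names are hypothetical.

```python
import json
import urllib.request

def build_generate_request(prompt: str, model: str = "llama3.1:8b") -> dict:
    """Build the JSON body for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt: str, base_url: str = "http://127.0.0.1:11434") -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    body = json.dumps(build_generate_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]

# Requires the Ollama server to be running and the model to be pulled:
# print(generate("Hey! What's up?"))
```

If this call works but the `lightrag` cell fails, the problem is more likely in the Python packages than in the server or the model download.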

In [None]:
#@title ##Gradio UI🖼️
import gradio as gr

def get_answer(users_prompt):
    output = qa.call(users_prompt)
    return output.data

# Gradio interface
gradioUI = gr.Interface(
    fn=get_answer,
    inputs=gr.Textbox(lines=2, placeholder="Chat with Llama here..."),
    outputs="text",
    title="Run Llama 3.1-8B-Instruct <br> Notebook By <a href='https://github.com/73LIX' target='_blank'>GouravYdv</a>",
    description="Llama 3.1 can make mistakes. Check for important info."
)

# Launch
gradioUI.launch()
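
`get_answer` returns `output.data` directly, so a failed generation may show up in the UI as an empty box. One possible way to surface failures is sketched below. It assumes that the output object leaves `data` empty on failure and may carry an `error` attribute, which should be verified against the installed `lightrag` version. The helper `answer_or_error` and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace

def answer_or_error(output) -> str:
    """Return the generated text, or a readable message if generation failed."""
    data = getattr(output, "data", None)
    if data:
        return str(data)
    # Assumption: a failed call leaves `data` empty and may set an `error` field.
    error = getattr(output, "error", None)
    return f"Generation failed: {error}" if error else "The model returned no text."

# Stand-in objects used only to illustrate the two branches:
ok = SimpleNamespace(data="Hello!", error=None)
failed = SimpleNamespace(data=None, error="connection refused")
print(answer_or_error(ok))
print(answer_or_error(failed))
```

Inside `get_answer`, returning `answer_or_error(qa.call(users_prompt))` instead of `output.data` would apply this check to every request.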