# Ollama

**Running LLMs locally - Master Applied AI - Michiel Bontenbal - 12 December 2024**

Ollama is a tool that lets you run open-source large language models (LLMs) locally on your laptop. It supports a variety of models, including Llama2, Mistral, CodeLlama and many others.

You'll need to install Ollama first. Download it from www.ollama.com.

Some code examples are courtesy of ollama.com / Jeffrey Morgan.
License: MIT License

### Contents
0. Install and settings
1. Run first script
2. Streaming the response
3. Creating a gradio front end

### Sources
- https://github.com/ollama/ollama-python
- https://github.com/ollama/ollama/blob/main/docs/api.md#api
- https://pypi.org/project/ollama/


## 0. Install and settings

*Before running this code, make sure you've installed ollama on your laptop!*

In [2]:
# Check your version of python. To run ollama with python you will need Python 3.8 or higher.
from platform import python_version
print(python_version())

3.12.7
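
The same requirement can also be checked in code rather than by eye. The sketch below compares a version tuple against the 3.8 minimum mentioned in the comment above; the helper name `meets_minimum` is made up for this example.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the cell above

def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) part of the version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum

if meets_minimum():
    print("Python version is OK.")
else:
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or higher is required.")
```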


In [3]:
# Before downloading the model, check the available disk space. You will need at least 20 GB!
import shutil
usage = shutil.disk_usage("/")
free_space_bytes = usage.free
free_space_gb = free_space_bytes / (1024 * 1024 * 1024)  # Convert to GB
print(f'free disk space = {round(free_space_gb,1)} Gb')

free disk space = 14.8 Gb
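
If you want the notebook to warn you instead of comparing the numbers yourself, the check can be wrapped in a small helper. This is only a sketch: the names `free_space_gb` and `has_enough_space` are made up here, and the 20 GB threshold is simply the figure from the comment above.

```python
import shutil

REQUIRED_GB = 20  # threshold taken from the comment in the cell above

def free_space_gb(path="/"):
    """Return the free disk space at `path`, using 1024**3 bytes per GB as above."""
    return shutil.disk_usage(path).free / (1024 ** 3)

def has_enough_space(path="/", required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` is free at `path`."""
    return free_space_gb(path) >= required_gb

if has_enough_space():
    print("Enough free disk space.")
else:
    print(f"Only {free_space_gb():.1f} GB free, {REQUIRED_GB} GB recommended.")
```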


In [4]:
#Check processor and RAM
import psutil
import platform
print("Processor:", platform.processor())
memory = psutil.virtual_memory()
print(f"Total RAM: {memory.total/1000000000} Gb")
print(f"Available RAM: {memory.available/1000000000} Gb")
print(f"RAM Usage: {memory.percent}%")

Processor: i386
Total RAM: 8.589934592 Gb
Available RAM: 1.791295488 Gb
RAM Usage: 79.1%


In [5]:
%pip install --upgrade ollama

Note: you may need to restart the kernel to use updated packages.


In [6]:
# Make sure you run this from a local disk. Running it from OneDrive or another cloud-synced folder makes it much slower.
import os
print(f"Current working directory: {os.getcwd()}")

Current working directory: /Users/michielbontenbal/Library/CloudStorage/OneDrive-HvA/GitHub/ollama_master


In [7]:
# Download a model from the Ollama server. This may take a minute. Comment out once the model is on your device.
import ollama
ollama.pull('llama3.2:1b')

ProgressResponse(status='success', completed=None, total=None, digest=None)

In [8]:
#get all the models on your device
ollama.list()

ListResponse(models=[Model(model='llama3.2:1b', modified_at=datetime.datetime(2024, 12, 12, 12, 50, 46, 322160, tzinfo=TzInfo(+01:00)), digest='baf6a787fdffd633537aa2eb51cfd54cb93ff08e28040095462bb63daf552878', size=1321098329, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='1.2B', quantization_level='Q8_0')), Model(model='hf.co/BramVanroy/GEITje-7B-ultra-GGUF:Q3_K_M', modified_at=datetime.datetime(2024, 12, 12, 10, 57, 43, 528854, tzinfo=TzInfo(+01:00)), digest='7595df917f18a22cc1ee275332b7ebb8b23e8976a542e5dd6c74c1c8ac3d6304', size=3518986848, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='7.24B', quantization_level='unknown')), Model(model='hf.co/BramVanroy/fietje-2-chat-gguf:Q3_K_M', modified_at=datetime.datetime(2024, 12, 10, 13, 37, 12, 900267, tzinfo=TzInfo(+01:00)), digest='29b0a169fcaa64dca25ab1b5325a26f8dad0e42217460ca1694fac629c902035', size=1423223271, details=

In [9]:
# Let's unpack it a bit (ollama changed its API this week...)
models = ollama.list()
print(models)
# Print only the name of each installed model
for m in models.models:
    print(m.model)

models=[Model(model='llama3.2:1b', modified_at=datetime.datetime(2024, 12, 12, 12, 50, 46, 322160, tzinfo=TzInfo(+01:00)), digest='baf6a787fdffd633537aa2eb51cfd54cb93ff08e28040095462bb63daf552878', size=1321098329, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='1.2B', quantization_level='Q8_0')), Model(model='hf.co/BramVanroy/GEITje-7B-ultra-GGUF:Q3_K_M', modified_at=datetime.datetime(2024, 12, 12, 10, 57, 43, 528854, tzinfo=TzInfo(+01:00)), digest='7595df917f18a22cc1ee275332b7ebb8b23e8976a542e5dd6c74c1c8ac3d6304', size=3518986848, details=ModelDetails(parent_model='', format='gguf', family='llama', families=['llama'], parameter_size='7.24B', quantization_level='unknown')), Model(model='hf.co/BramVanroy/fietje-2-chat-gguf:Q3_K_M', modified_at=datetime.datetime(2024, 12, 10, 13, 37, 12, 900267, tzinfo=TzInfo(+01:00)), digest='29b0a169fcaa64dca25ab1b5325a26f8dad0e42217460ca1694fac629c902035', size=1423223271, details=ModelDetails(
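
The raw output is hard to read, but each entry carries a `model` name and a `size` in bytes, as the printout above shows. A minimal sketch for a one-line-per-model summary follows; `summarize_models` is a made-up helper, and stand-in objects with the names and sizes from the output are used so it runs without a server.

```python
from types import SimpleNamespace

def summarize_models(models):
    """Return one 'name: size GB' string per model entry."""
    return [f"{m.model}: {m.size / 1e9:.2f} GB" for m in models]

# Stand-in entries using names and sizes from the output above.
# With a running Ollama server you could pass ollama.list().models instead.
example = [
    SimpleNamespace(model='llama3.2:1b', size=1321098329),
    SimpleNamespace(model='hf.co/BramVanroy/GEITje-7B-ultra-GGUF:Q3_K_M', size=3518986848),
]
for line in summarize_models(example):
    print(line)
```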

In [10]:
#printing the details of a model
ollama.show('llama3.2:1b')

ShowResponse(modified_at=datetime.datetime(2024, 12, 12, 12, 50, 46, 322160, tzinfo=TzInfo(+01:00)), template='<|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\n\n{{ if .System }}{{ .System }}\n{{- end }}\n{{- if .Tools }}When you receive a tool call response, use the output to format an answer to the orginal user question.\n\nYou are a helpful assistant with tool calling capabilities.\n{{- end }}<|eot_id|>\n{{- range $i, $_ := .Messages }}\n{{- $last := eq (len (slice $.Messages $i)) 1 }}\n{{- if eq .Role "user" }}<|start_header_id|>user<|end_header_id|>\n{{- if and $.Tools $last }}\n\nGiven the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.\n\nRespond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.\n\n{{ range $.Tools }}\n{{- . }}\n{{ end }}\n{{ .Content }}<|eot_id|>\n{{- else }}\n\n{{ .Content }}<|

In [11]:
# Show all names the ollama package exposes (functions and classes)
print(dir(ollama))

['AsyncClient', 'ChatResponse', 'Client', 'EmbedResponse', 'EmbeddingsResponse', 'GenerateResponse', 'ListResponse', 'Message', 'Options', 'ProcessResponse', 'ProgressResponse', 'RequestError', 'ResponseError', 'ShowResponse', 'StatusResponse', 'Tool', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '_client', '_types', '_utils', 'chat', 'copy', 'create', 'delete', 'embed', 'embeddings', 'generate', 'list', 'ps', 'pull', 'push', 'show']


In [12]:
# Delete a model.
# ollama.delete('<your model>')  # replace <your model> with the name of the model

## 1. Run first script

In [13]:
#first set the model
model = 'llama3.2:1b'

In [14]:
#first script from ollama website (https://github.com/ollama/ollama-python)
import ollama
response = ollama.chat(model=model, messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
])
print(response['message']['content'])

The sky appears blue to us because of a phenomenon called Rayleigh scattering, named after the British physicist Lord Rayleigh, who first described it in the late 19th century. Here's what happens:

When sunlight enters Earth's atmosphere, it encounters tiny molecules of gases such as nitrogen (N2) and oxygen (O2). These molecules are much smaller than the wavelength of light, so they scatter the light in all directions.

The shorter wavelengths of light, like blue and violet, are scattered more than the longer wavelengths, like red and orange. This is because the smaller molecules have a greater tendency to absorb and scatter the shorter wavelengths.

As a result, the blue and violet light that reaches our eyes from the sky has been scattered in all directions, giving the sky its blue appearance. The other colors of the visible spectrum, on the other hand, are reflected back to us from the surface of the Earth, which is why we see more red, orange, and yellow hues in the sky during su

In [23]:
#Create the ollama function
import ollama

def ask_ollama(question):
    """
    Sends a question to the Ollama API and returns the response.
    """
    response = ollama.chat(
        model=model,
        messages=[
            {'role': 'user', 'content': question},
        ],
    )

    return response['message']['content']

# Example usage
response_content = ask_ollama("Tell me a joke?")
print(response_content)

A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it's here or not."
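
`ask_ollama` sends a single user message each time, so a follow-up question arrives without the earlier ones. To hold a conversation, the earlier messages have to be sent along again. The sketch below keeps a history list of role/content dicts; `ask_with_history` and `chat_fn` are names made up for this example, and the chat call is passed in as a parameter so the bookkeeping can be tried with a stand-in instead of a running model.

```python
def ask_with_history(history, question, chat_fn):
    """Append the question, call the model, then append and return its answer.

    `history` is a list of {'role': ..., 'content': ...} dicts, updated in place.
    `chat_fn` takes that list and returns the answer text.
    """
    history.append({'role': 'user', 'content': question})
    answer = chat_fn(history)
    history.append({'role': 'assistant', 'content': answer})
    return answer

# Stand-in for the real model so the example runs offline (illustration only).
def fake_chat(messages):
    return f"(answer to: {messages[-1]['content']})"

history = []
ask_with_history(history, "Tell me a joke.", fake_chat)
ask_with_history(history, "Explain it.", fake_chat)
print([m['role'] for m in history])
```

With Ollama running, `chat_fn` could be `lambda msgs: ollama.chat(model=model, messages=msgs)['message']['content']`.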


## 2. Streaming the response

With streaming, the response is printed on the screen while the LLM is still generating the answer. Generating the full answer takes about as long, but the first words appear much sooner, so it feels faster. Try it out yourself!

In [16]:
question = input('Your question:')

In [17]:
# Streaming version of the chat call, wrapped in a function
import ollama

def ollama_chat_stream(question):
    """
    Streams the chat response from Ollama using the model set in section 1.
    """
    # Initialize the chat with Ollama
    stream = ollama.chat(
        model=model,
        messages=[{'role': 'user', 'content': question}],
        stream=True,
    )

    # Stream and print the responses
    for chunk in stream:
        print(chunk['message']['content'], end='', flush=True)

# Example usage
ollama_chat_stream(question)


A man walked into a library and asked the librarian, "Do you have any books on Pavlov's dogs and Schrödinger's cat?" The librarian replied, "It rings a bell, but I'm not sure if it's here or not."
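
One way to see what streaming changes is to time the first chunk against the complete answer. The sketch below does this for any iterable of chunks shaped like the ones in the loop above; `time_stream` and `fake_stream` are made-up names, and the fake stream with its artificial delay only stands in for `ollama.chat(..., stream=True)` so the code runs offline.

```python
import time

def time_stream(stream):
    """Consume a chunk stream; return (text, seconds_to_first_chunk, total_seconds)."""
    start = time.perf_counter()
    first = None
    parts = []
    for chunk in stream:
        if first is None:
            first = time.perf_counter() - start
        parts.append(chunk['message']['content'])
    total = time.perf_counter() - start
    return ''.join(parts), first, total

def fake_stream(words, delay=0.01):
    """Stand-in for a streamed chat response: yields one chunk per word."""
    for word in words:
        time.sleep(delay)
        yield {'message': {'content': word}}

text, first, total = time_stream(fake_stream(['It ', 'rings ', 'a ', 'bell.']))
print(f"first chunk after {first:.3f}s, full answer after {total:.3f}s")
```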

## 3. Creating a gradio front end

Gradio is a very high-level Python library that lets you create a front end very quickly. It is often used to demo a model. Gradio starts a web server for you (like Flask or Node.js).

In [18]:
# Install Gradio (skip this cell if it is already installed)
%pip install --upgrade gradio



In [19]:
import gradio

In [25]:
# A Gradio front end. Make sure you have run the previous cells first.
import gradio as gr

iface = gr.Interface(
    fn=ask_ollama,  # use the function we defined under 1
    inputs="text",
    outputs="text"
)

iface.launch()

* Running on local URL:  http://127.0.0.1:7863

To create a public link, set `share=True` in `launch()`.
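
The interface above waits for the full answer before showing anything. Gradio also accepts generator functions and updates the output on every `yield`, which combines well with the streaming from section 2. The following is a hedged sketch that assumes a recent Gradio version: `accumulate` and `build_streaming_demo` are names made up here, and the model name is passed in as a parameter instead of being read from the notebook's global variable.

```python
def accumulate(stream):
    """Yield the answer so far after every chunk of a streamed chat response."""
    text = ''
    for chunk in stream:
        text += chunk['message']['content']
        yield text

def build_streaming_demo(model='llama3.2:1b'):
    """Build a Gradio interface whose output grows while the model generates."""
    # Imported here so `accumulate` can be tried without these packages installed.
    import gradio as gr
    import ollama

    def answer(question):
        stream = ollama.chat(
            model=model,
            messages=[{'role': 'user', 'content': question}],
            stream=True,
        )
        yield from accumulate(stream)

    return gr.Interface(fn=answer, inputs="text", outputs="text")

# Offline check of the accumulating logic with stand-in chunks.
chunks = [{'message': {'content': 'Hel'}}, {'message': {'content': 'lo'}}]
print(list(accumulate(chunks)))

# To start the app (needs Ollama running and the model pulled):
# build_streaming_demo().launch()
```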


