### How to run TinyLlama?

Define environment variables

In [1]:
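# A bare `!VAR=...` only sets the variable in a throwaway subshell; %env keeps it for the pip build.
%env CMAKE_ARGS=-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS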

Install the llama-cpp-python package

In [2]:
!pip install llama-cpp-python



In [3]:
!pip install huggingface-hub



Download the quantized TinyLlama model with the Hugging Face CLI

In [4]:
!huggingface-cli download TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --local-dir .

Downloading 'tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf' to '.cache/huggingface/download/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf.9fecc3b3cd76bba89d504f29b616eedf7da85b96540e490ca5824d3f7d2776a0.incomplete'
tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf: 100%|██| 669M/669M [00:31<00:00, 21.4MB/s]
Download complete. Moving file to tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
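
Before loading the model, it can be worth checking that the downloaded file really is a GGUF file. The sketch below reads only the fixed-size header. The helper name `read_gguf_header` is hypothetical, and the layout assumed here (4-byte `GGUF` magic, 32-bit version, then 64-bit tensor and key-value counts, little-endian) is that of GGUF version 2 and later.

```python
import struct

def read_gguf_header(path):
    """Return (version, tensor_count, kv_count) read from a GGUF file header."""
    with open(path, "rb") as f:
        header = f.read(24)  # 4-byte magic + u32 version + two u64 counts
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: u32 version, u64 tensor count, u64 metadata key-value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")` should give values that agree with the loader log further down, which reports the GGUF version together with the number of tensors and key-value pairs.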


Import the library

In [5]:
from llama_cpp import Llama

Load the model (the parameters are explained below)

In [10]:
# On macOS (Apple M3), use:
llm = Llama(model_path="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
            n_ctx=2048,
            n_threads=8,
            n_gpu_layers=0)

# On an Intel system with no GPU, use:
#llm = Llama(
#    model_path="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # Path to your TinyLlama model
#    n_ctx=2048,  # Context window size
#    n_threads=4,  # Number of CPU threads to use
#    n_batch=32,    # Batch size for processing tokens (lower if memory is limited)
#    n_gpu_layers=0  # If you do not have GPU
#)

llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 ll

## Key Parameters Explained
### model_path:

Path to the quantized TinyLlama model file. Ensure this file is accessible on your machine.

### n_ctx (Context Window):
Specifies the maximum number of tokens (prompt plus generated output) that the model can handle at once. Set to 2048, which matches the model's own context length (llama.context_length in the loader log above).
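
Because the prompt and the generated text share the same window, the room left for generation shrinks as the prompt grows. A minimal sketch of that arithmetic, using a hypothetical helper `max_new_tokens` and the 339-token prompt reported in the run log later in this notebook:

```python
def max_new_tokens(n_ctx, prompt_tokens):
    """Tokens left for generation once the prompt occupies part of the window."""
    remaining = n_ctx - prompt_tokens
    if remaining <= 0:
        raise ValueError("the prompt alone fills the context window")
    return remaining

# 2048-token window, 339-token prompt (figure taken from the perf log)
print(max_new_tokens(2048, 339))  # 1709
```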

### n_threads (CPU Threads):
Set this to roughly the number of physical cores on your CPU, leaving some headroom for other processes.
Example:
If your Intel CPU has 4 cores/8 threads, use n_threads=4.
For a high-end CPU with 8 cores/16 threads, you can increase n_threads to 8 or 12.
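
The rule of thumb above can be sketched with the standard library. The helper `suggest_n_threads` is hypothetical, and it assumes two hardware threads per physical core (typical for hyper-threaded Intel CPUs, but not for every machine), so treat the result as a starting point only.

```python
import os

def suggest_n_threads(logical_cpus=None):
    """Suggest n_threads as half the logical CPU count, with a minimum of 1."""
    if logical_cpus is None:
        # os.cpu_count() reports logical CPUs and may return None
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // 2)

print(suggest_n_threads(8))   # 4 cores / 8 threads -> 4
print(suggest_n_threads(16))  # 8 cores / 16 threads -> 8
```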

### n_batch (Batch Size):

Controls how many prompt tokens are processed per batch. Lower values reduce memory usage but may slow down prompt processing.
Recommended: Start with n_batch=32. If you run into memory issues, reduce it to 16 or 8.
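
The "start at 32, then step down" advice can be sketched as a small retry loop. Both `load_with_fallback` and its `loader` argument are hypothetical names. The sketch also assumes a failed load surfaces as a Python exception, which is not guaranteed: on some systems memory exhaustion may terminate the process instead.

```python
def load_with_fallback(loader, batch_sizes=(32, 16, 8)):
    """Try each batch size in order; return (model, n_batch) for the first that loads."""
    last_error = None
    for n_batch in batch_sizes:
        try:
            return loader(n_batch), n_batch
        except Exception as err:  # assumption: a failed load raises an exception
            last_error = err
    raise RuntimeError("no batch size worked") from last_error
```

With llama-cpp-python, `loader` could be something like `lambda b: Llama(model_path="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", n_ctx=2048, n_batch=b)`.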

### n_gpu_layers:

If you have a GPU:

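Use this parameter to offload model layers to the GPU, which speeds up processing.
Balance the number of layers offloaded with the available GPU memory.
For TinyLlama, 35 is a good value: the model has only 22 blocks (llama.block_count in the loader log above), so 35 offloads every layer.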

If you don’t have a GPU:

Set n_gpu_layers=0 or omit it entirely (it’s irrelevant since there’s no GPU to use).
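
As a recap, the parameters described in this section can be collected into one dictionary and passed to the constructor. This is only a sketch: `build_llama_kwargs` is a hypothetical helper, and the values are the ones suggested in this notebook, not tuned recommendations.

```python
def build_llama_kwargs(model_path, has_gpu, n_threads=4):
    """Collect the constructor arguments discussed above into one dict."""
    return {
        "model_path": model_path,
        "n_ctx": 2048,                         # context window size
        "n_threads": n_threads,                # CPU threads
        "n_batch": 32,                         # lower to 16 or 8 if memory is tight
        "n_gpu_layers": 35 if has_gpu else 0,  # 0 keeps everything on the CPU
    }

kwargs = build_llama_kwargs("./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", has_gpu=False)
# llm = Llama(**kwargs)  # requires: from llama_cpp import Llama
```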

Full system prompt, written by ChatGPT

In [11]:
system_role= """You are a highly knowledgeable and empathetic lecturer specializing in mental diseases and mental health. Your role is to provide detailed, accurate, and evidence-based explanations about mental diseases based on the raw data you receive.
The raw data will typically include abstracts of research papers, clinical studies, or other medical documents. Analyze the data provided to you and synthesize the information into a clear and concise explanation. Your responses must:
1. Summarize the main points of the abstract, highlighting key findings or relevant data.
2. Explain concepts in a way suitable for a professional audience, such as medical students, researchers, or mental health practitioners, while remaining approachable for non-experts if necessary.
3. Use medical terminology appropriately, but ensure definitions or explanations are provided for complex terms.
4. Organize responses into structured formats when appropriate (e.g., bullet points, numbered lists, or sections like "Background," "Findings," "Implications").
5. Provide references to the data where relevant (e.g., 'According to the abstract, the study found...').
6. Adopt a professional, empathetic tone while avoiding judgmental language or bias.
If the input data is unclear or insufficient, request clarification or more context to ensure an accurate response."""

Optional shorter system prompt

In [12]:
#system_role= "You are a highly knowledgeable and empathetic lecturer specializing in mental diseases"

To read the question from user input instead:

In [13]:
#user_content=input("What do you want ?\n")

In [14]:
user_content="What do you know about Alzeihmer ?"

In [15]:
response = llm.create_chat_completion(
    messages=[
        {
            "role": "system",
            "content": system_role,
        },
        {
            "role": "user",
            "content": user_content,
        },
    ]
)

llama_perf_context_print:        load time =    4829.93 ms
llama_perf_context_print: prompt eval time =       0.00 ms /   339 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   258 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   22710.12 ms /   597 tokens
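
In this run the per-phase timings print as 0.00 ms and `inf`, so they cannot be used directly. An overall rate can still be derived from the total line. The sketch below uses a hypothetical helper `tokens_per_second`. Note that the result blends prompt evaluation and generation, so it is only a rough indicator.

```python
def tokens_per_second(total_tokens, total_ms):
    """Overall throughput from the 'total time' line of the perf log."""
    return total_tokens / (total_ms / 1000.0)

# Figures from the log above: 597 tokens in 22710.12 ms
print(round(tokens_per_second(597, 22710.12), 2))  # 26.29
```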


In [16]:
content = response['choices'][0]['message']['content']

print("Generated Response:\n")
print(content)

Generated Response:

I do not have the latest information on alzeihmer. However, I can provide you with some general information about alzeihmer, a type of dementia that is characterized by memory loss, impaired judgment, and behavioral changes. Alzeihmer is a subtype of dementia that is often associated with other forms of dementia, such as vascular dementia, frontotemporal dementia, and frontotemporal lobar degeneration. It is also known as the "white matter disease" because the brain tissue in the white matter regions of the brain, which are involved in memory and cognitive function, is affected. Alzeihmer is considered a progressive disease, meaning it worsens over time. It is often diagnosed in older adults, but it can affect younger individuals as well. Alzeihmer is often associated with other forms of dementia, such as vascular dementia, frontotemporal dementia, and frontotemporal lobar degeneration. It is also associated with other conditions, such as Parkinson's disease, Alzhe
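
The request and extraction steps above can be wrapped in one helper that also returns the reason generation stopped. This is a sketch: the name `ask` is hypothetical, and it assumes the OpenAI-style response dictionary in which each choice carries a `finish_reason` field (for example `"stop"` or `"length"`).

```python
def ask(llm, system_role, user_content, max_tokens=256):
    """Send one system + user exchange; return (text, finish_reason)."""
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": system_role},
            {"role": "user", "content": user_content},
        ],
        max_tokens=max_tokens,
    )
    choice = response["choices"][0]
    # "length" would indicate that generation stopped at the token limit
    return choice["message"]["content"], choice.get("finish_reason")
```

Usage would look like `content, reason = ask(llm, system_role, user_content)`. If `reason` is `"length"` and the reply ends mid-sentence, raising `max_tokens` is the first thing to try.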