# IPEX-LLM using llama.cpp on Intel GPUs

## Introduction

This notebook demonstrates how to install IPEX-LLM on Windows and use it to run llama.cpp on Intel GPUs. It applies to the integrated GPUs (iGPUs) of Intel Core Ultra and 11th to 14th Gen Intel Core processors, as well as Intel Arc series GPUs.

## What is an AI PC

What is an AI PC, you ask?

Here is an [explanation](https://www.intel.com/content/www/us/en/newsroom/news/what-is-an-ai-pc.htm#gs.a55so1) from Intel:

“An AI PC has a CPU, a GPU and an NPU, each with specific AI acceleration capabilities. An NPU, or neural processing unit, is a specialized accelerator that handles artificial intelligence (AI) and machine learning (ML) tasks right on your PC instead of sending data to be processed in the cloud. The GPU and CPU can also process these workloads, but the NPU is especially good at low-power AI calculations. The AI PC represents a fundamental shift in how our computers operate. It is not a solution for a problem that didn’t exist before. Instead, it promises to be a huge improvement for everyday PC usages.”

## Install Prerequisites

### Step 1: System Preparation

To set up your AI PC to run on Intel iGPUs, follow these essential steps:

1. Update Intel GPU drivers: Ensure your system has the latest Intel GPU drivers, which are crucial for optimal performance and compatibility. You can download them directly from Intel's [official website](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). Once you have installed the official drivers, you can also install Intel Arc Control to monitor the GPU:

   <img src="Assets/gpu_arc_control.png">


2. Install Visual Studio 2022 Community edition with C++: Visual Studio 2022 with the “Desktop development with C++” workload is required. This prepares your environment for the C++ based extensions used by the Intel SYCL backend that powers accelerated llama.cpp. You can download VS 2022 Community edition from the official site, [here](https://visualstudio.microsoft.com/downloads/).

3. Install Miniforge (conda-forge): Miniforge manages your Python environments and dependencies, providing a clean, minimal base for your Python setup. Visit conda-forge's [installation site](https://conda-forge.org/download/) to download the installer for Windows.

   

## Step 2: Install IPEX-LLM

### After installing Miniforge, open the Miniforge Prompt and create a new Python environment
```
conda create -n llm-cpp python=3.11
```

### Activate the new environment
```
conda activate llm-cpp
```

<img src="Assets/llm4.png">
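
As an optional sanity check, a small Python sketch can confirm which conda environment and Python version are active. It relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets; the helper name `describe_environment` is hypothetical and only used for illustration.

```python
import os
import sys


def describe_environment(environ=None):
    """Return (active conda environment name, "major.minor" Python version)."""
    environ = os.environ if environ is None else environ
    # conda sets CONDA_DEFAULT_ENV when an environment is activated
    env_name = environ.get("CONDA_DEFAULT_ENV", "<no conda environment active>")
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return env_name, python_version


if __name__ == "__main__":
    name, version = describe_environment()
    print(f"conda environment: {name}")
    print(f"Python version:    {version}")
```

If the steps above succeeded, the environment name should read `llm-cpp` and the version `3.11` when this is run from the Miniforge Prompt.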

### With the llm-cpp environment active, use pip to install ipex-llm for GPU

```
pip install --pre --upgrade ipex-llm[cpp]
```

<img src="Assets/llm5.png">
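
To confirm that the install landed in the active environment, the following sketch queries the package metadata with the standard library. The helper name `installed_version` is hypothetical; the only assumption about IPEX-LLM is that the distribution is named `ipex-llm`, as in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("ipex-llm")
    if version is None:
        print("ipex-llm is not installed in this environment")
    else:
        print(f"ipex-llm {version} is installed")
```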

### Create llama-cpp directory

```
mkdir llama-cpp
cd llama-cpp
```

<img src="Assets/llm6.png">

### Run the following command with administrator privileges in the Miniforge Prompt. You should then see many soft links to llama.cpp’s executable files in the current directory.
```
init-llama-cpp.bat
```

<img src="Assets/llm7.png">
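
One way to verify that the soft links were created is to list the symbolic links in the current directory. This is a minimal sketch using only the standard library; the helper name `list_symlinks` is hypothetical, and the exact set of links depends on the installed ipex-llm version.

```python
from pathlib import Path
from typing import List


def list_symlinks(directory: str = ".") -> List[str]:
    """Return the sorted names of entries in `directory` that are symbolic links."""
    return sorted(p.name for p in Path(directory).iterdir() if p.is_symlink())


if __name__ == "__main__":
    links = list_symlinks(".")
    print(f"Found {len(links)} symbolic link(s)")
    for name in links:
        print(" ", name)
```

Run it from inside the `llama-cpp` directory. An empty result suggests the batch file was not run with administrator privileges.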

### Set the following environment variable for your device to use GPU acceleration
For Intel iGPU:
```
set SYCL_CACHE_PERSISTENT=1
```
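
The `set` command only affects the current prompt session. When the binary is launched from Python instead, the same variable can be passed to the child process explicitly. The sketch below assumes nothing beyond the variable name shown above; `sycl_env` is a hypothetical helper, and the child process is a stand-in for the llama.cpp binary.

```python
import os
import subprocess
import sys


def sycl_env(base=None):
    """Return a copy of the environment with SYCL_CACHE_PERSISTENT set to "1"."""
    env = dict(os.environ if base is None else base)
    env["SYCL_CACHE_PERSISTENT"] = "1"
    return env


if __name__ == "__main__":
    # Stand-in child process that prints the variable it inherited.
    # In practice the command would be the llama.cpp binary instead.
    child = [sys.executable, "-c",
             "import os; print(os.environ['SYCL_CACHE_PERSISTENT'])"]
    result = subprocess.run(child, env=sycl_env(), capture_output=True, text=True)
    print(result.stdout.strip())
```

Copying `os.environ` rather than mutating it keeps the setting scoped to the launched process.
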
### The following example shows how to run a community GGUF model with IPEX-LLM
* Download a GGUF model and run it, for example:

```
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "What is AI" -t 8 -e -ngl 33 --color
```

<img src="Assets/llm8.png">
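
The command above packs several flags into one line. The sketch below builds the same argument list in Python and annotates each flag as understood from llama.cpp's command-line help; `build_llama_command` is a hypothetical helper, and the defaults simply mirror the example values.

```python
from typing import List


def build_llama_command(exe: str, model: str, prompt: str,
                        n_predict: int = 32, threads: int = 8,
                        gpu_layers: int = 33) -> List[str]:
    """Assemble the llama.cpp command line as a list of arguments."""
    return [
        exe,
        "-m", model,              # path to the GGUF model file
        "-n", str(n_predict),     # number of tokens to generate
        "--prompt", prompt,       # prompt text
        "-t", str(threads),       # number of CPU threads
        "-e",                     # process escape sequences in the prompt
        "-ngl", str(gpu_layers),  # number of model layers offloaded to the GPU
        "--color",                # colorize the output
    ]


if __name__ == "__main__":
    cmd = build_llama_command(
        "main", "mistral-7b-instruct-v0.1.Q4_K_M.gguf", "What is AI"
    )
    print(" ".join(cmd))
```

Keeping the arguments in a list means the prompt does not need manual quoting when the list is later handed to `subprocess`.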

### Below is an example output

<img src="Assets/llm9.png">


<img src="Assets/llm10.png">




In [None]:
! C:\workshop\llama-cpp\main.exe -m ../models/llama-2-7b-chat.Q5_K_M.gguf -n 100 --prompt "What is AI" -t 16 -ngl 999 --color -e 

## Complete code snippet

In [None]:
%%writefile src/st_ipexllm_native.py
import streamlit as st
import subprocess
import os
import threading
import time

st.title("Chat with me!")

# Get the inputs from the text fields with required logs
# Raw strings keep the Windows backslashes from being read as escape sequences
exe_path = st.text_input("Enter the path to the main.exe binary generated by the steps outlined:", value=r"..\llama-cpp\main.exe", key="exe_path")
print(f"{exe_path}\n")
if exe_path:
    if os.path.exists(exe_path):
        if os.path.isfile(exe_path):
            print(f"valid file path: {exe_path}")
        else:
            st.error(f"The path {exe_path} is not a file")
    else:
        st.error(f"The path {exe_path} does not exist")
else:
    print("Please enter the file path")

model_path = st.text_input("Enter model file path:", value=r"..\models\llama-2-7b-chat.Q5_K_M.gguf", key="model_name")
print(f"{model_path}\n")
if model_path:
    if os.path.exists(model_path):
        if os.path.isfile(model_path):
            print(f"valid file path: {model_path}")
        else:
            st.error(f"The path {model_path} is not a file")
    else:
        st.error(f"The path {model_path} does not exist")
else:
    print("Please enter the file path")


# The value is passed to -n, which limits the number of generated tokens rather than words
num_words = st.text_input("Enter the maximum number of tokens to generate:", value="100", key="num_words")
print(f"{num_words}\n")

question = st.text_input("Enter your question", value="What is AI", key="question")
question = f'"{question}"'
print(f"{question}\n")
num_cores = st.text_input("Enter the number of cores", value="16", key="num_cores")
print(f"{num_cores}\n")
 
gpu_layers = st.text_input("Enter number of GPU layers:", value="999", key="gpu_layers")
print(f"{gpu_layers}\n")

def stdout_typewriter_effect(stdout_container, current_stdout):
    current_char = ""
    for char in current_stdout:
        current_char+=char
        stdout_container.markdown(current_char)
        time.sleep(0.01)

def launch_exe():
    stdout_chunks = []
    stderr_llama_time = []
    
    def append_stdout(pipe, stdout_lines):
        for line in iter(pipe.readline, ''):
            if line:
                print(line.strip())
                stdout_lines.append(line.strip())
        pipe.close()

    def append_stderr(pipe, stderr_lines):
        for line in iter(pipe.readline, ''):
            if line.startswith("llama_print_timings"):
                print(line.strip())
                stderr_lines.append(line.strip())
        pipe.close()

    filter_command = '| findstr "^"'
    # Build the command line; the question was already wrapped in double quotes above
    commandparams = (
        f"{exe_path} -m {model_path} -n {num_words} --prompt {question} "
        f"-t {num_cores} -e -ngl {gpu_layers} {filter_command}"
    )
    # Log the command for easy debugging
    print(commandparams)
    try:
        # Use subprocess.Popen() to execute the EXE file with command-line parameters and capture the output in real-time
        result = subprocess.Popen(commandparams, shell=True, stdout=subprocess.PIPE, stderr = subprocess.PIPE, text=True)

        stdout_thread = threading.Thread(target=append_stdout, args=(result.stdout, stdout_chunks))
        stderr_thread = threading.Thread(target=append_stderr, args=(result.stderr, stderr_llama_time))
        stdout_thread.start()
        stderr_thread.start()
        stdout_container = st.empty()
        stderr_container = st.empty()

        # result.poll() returns None while the subprocess is still running; otherwise it returns the exit code.
        # It does not block, so keep refreshing the UI until the process has exited and both reader threads are done.
        while result.poll() is None or stdout_thread.is_alive() or stderr_thread.is_alive():
            stdout_typewriter_effect(stdout_container, '\n'.join(stdout_chunks))
            stderr_container.text('\n'.join(stderr_llama_time))
            stdout_thread.join(timeout=0.1)
            stderr_thread.join(timeout=0.1)

        stdout_thread.join()
        stderr_thread.join()
        # Render once more so that lines received after the last refresh are shown
        stdout_container.markdown('\n'.join(stdout_chunks))
        stderr_container.text('\n'.join(stderr_llama_time))

    except FileNotFoundError:
        st.error("The specified EXE file does not exist.")
    
if st.button("Generate"):
    with st.spinner("Running....Please wait..🐎"): 
        launch_exe()

In [None]:
! streamlit run src/st_ipexllm_native.py
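
The script above launches the binary through a shell command string and reads both output pipes on separate threads. As a hedged alternative, the sketch below shows the same two-thread pattern with a list of arguments and no shell, which also means the `findstr` filter is dropped. The names `_drain` and `run_and_collect` are hypothetical, and the child process is a stand-in so the example runs without the llama.cpp binary.

```python
import subprocess
import sys
import threading
from typing import List, Tuple


def _drain(pipe, sink: List[str]) -> None:
    """Append every line read from `pipe` to `sink`, then close the pipe."""
    for line in iter(pipe.readline, ""):
        sink.append(line.rstrip("\n"))
    pipe.close()


def run_and_collect(args: List[str]) -> Tuple[int, List[str], List[str]]:
    """Run `args` without a shell and return (exit code, stdout lines, stderr lines)."""
    proc = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    out_lines: List[str] = []
    err_lines: List[str] = []
    # One reader per pipe so a full stderr buffer cannot block stdout, and vice versa
    readers = [
        threading.Thread(target=_drain, args=(proc.stdout, out_lines)),
        threading.Thread(target=_drain, args=(proc.stderr, err_lines)),
    ]
    for reader in readers:
        reader.start()
    for reader in readers:
        reader.join()
    return proc.wait(), out_lines, err_lines


if __name__ == "__main__":
    # Stand-in child process so the sketch runs anywhere
    demo = [sys.executable, "-c",
            "import sys; print('answer'); print('timing', file=sys.stderr)"]
    code, out, err = run_and_collect(demo)
    print(code, out, err)
```

Without `shell=True`, a missing executable raises `FileNotFoundError` directly from `subprocess.Popen`. With `shell=True`, the shell starts successfully and reports the missing executable itself, so that exception is not raised.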

### Streamlit sample output

Below is the output of a sample run of the Streamlit application, with inference offloaded to the iGPU.

<img src="Assets/llm11.png"> <img src="Assets/output2.png">




* Reference: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html