### Overview

Large language models (LLMs) have revolutionized natural language processing by enabling advanced applications such as text generation, sentiment analysis, and conversational agents. Running these models on a local machine allows developers and enthusiasts to leverage their power without relying on cloud services, promoting greater accessibility and experimentation. However, successfully implementing LLMs locally necessitates a thorough understanding of the associated hardware and software requirements, as well as optimal setup procedures.

To effectively utilize LLMs on a personal computer, users must meet specific hardware specifications, including system compatibility, adequate memory, and powerful GPUs capable of handling extensive computational tasks. For instance, models with billions of parameters demand substantial RAM and GPU resources, making it essential for users to evaluate their system capabilities before installation. Additionally, software frameworks like TensorFlow and PyTorch are integral to managing LLMs, necessitating a well-configured environment to ensure seamless operation.

Setting up a local environment involves several critical steps, including creating virtual environments, installing necessary libraries, and managing dependencies. Users must also navigate potential challenges such as installation errors and performance optimization, particularly regarding memory management and GPU acceleration. Addressing these concerns is crucial for maximizing the performance and responsiveness of LLMs on personal machines, allowing users to tailor models to specific tasks effectively.

While local implementation of LLMs is increasingly accessible, it is not without its complexities. Notable challenges include hardware limitations, compatibility issues, and the need for efficient troubleshooting strategies. Understanding these intricacies is vital for users aiming to harness the full potential of LLMs, ensuring they can execute advanced AI applications locally while mitigating common pitfalls associated with setup and performance optimization.

### Requirements
To effectively run large language models (LLMs) on a local machine, it is crucial to understand and meet specific hardware and software requirements. This ensures optimal performance and a smooth user experience while leveraging the capabilities of LLMs.

#### Hardware Requirements
#### System Compatibility

Before installing an LLM, make sure your operating system is compatible. For macOS users, version 11 (Big Sur) or later is recommended, while Linux users should have Ubuntu 18.04 or later. Windows users can use Windows Subsystem for Linux 2 (WSL2) to run LLMs effectively on their machines[[1]](#1).


#### Memory Specifications
Memory plays a vital role in running LLMs. For smaller models, such as those with around 3 billion parameters, a minimum of 8GB of RAM is necessary. As the model size increases, so do the memory requirements; 7B models require 16GB of RAM, while 13B models benefit from at least 32GB[[1]](#1)[[2]](#2). Ensuring adequate memory is essential for smooth operation and responsiveness of the models.
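
As a rough cross-check on these figures, the memory needed just to hold a model's weights can be estimated as the parameter count multiplied by the bytes used per parameter. The sketch below is only an illustration: the function name `estimate_weight_memory_gb` is hypothetical, and the estimate ignores activations, the KV cache, and runtime overhead, so real usage will be higher.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int = 16) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) needed to store the weights alone."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Compare 16-bit weights with a 4-bit quantized copy of the same model.
    for name, params in [("3B", 3e9), ("7B", 7e9), ("13B", 13e9)]:
        fp16 = estimate_weight_memory_gb(params, 16)
        q4 = estimate_weight_memory_gb(params, 4)
        print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

Under this estimate a 7B model needs about 14 GB at 16-bit precision, which is in line with the 16GB recommendation above, and roughly a quarter of that when quantized to 4 bits.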

#### GPU Requirements

The GPU is arguably the most critical component when it comes to running LLMs. It is responsible for handling the complex matrix multiplications and parallel processing tasks required for both training and inference. Recommended GPUs for LLM tasks include the NVIDIA A100 Tensor Core GPU, which has 40GB or more of VRAM, and consumer-grade options like the NVIDIA RTX 4090/3090, which feature 24GB of VRAM[[2]](#2)[[3]](#3). Additionally, for larger models, having a GPU with at least 10GB of VRAM is advisable to ensure efficient processing[[4]](#4).

#### Disk Space and Storage
Adequate disk space is essential for installing the necessary software and storing model files. While the specific requirements can vary based on the models being used, ensuring you have sufficient space to accommodate multiple models and datasets is vital.

#### Software Requirements
#### Operating System and Software Support
It is important to have a system that supports the necessary software frameworks and libraries for LLMs, such as TensorFlow, PyTorch, Hugging Face Transformers, and DeepSpeed. Most AI developers prefer Linux-based systems due to better support for these tools[[2]](#2)[[3]](#3).

#### Networking and Connectivity
For larger deployments or setups involving multiple machines, a robust networking setup is essential. A 10 Gigabit Ethernet connection is recommended for transferring large model checkpoints or datasets efficiently. If relying on wireless connectivity, WiFi 6 is advisable for improved throughput and reduced latency[[2]](#2)[[3]](#3).

### Setting Up the Environment
To successfully utilize large language models (LLMs) on a local machine, it's crucial to set up your development environment correctly. This process involves creating virtual environments, installing necessary dependencies, and ensuring that your system meets specific requirements.

#### Downloading Large Language Models
Downloading large language models (LLMs) for local use is a crucial step for individuals and developers aiming to leverage these powerful tools for various applications. This section outlines the key considerations and steps involved in obtaining LLMs.

#### Accessing Model Repositories
The primary source for downloading LLMs is online repositories, with Hugging Face being one of the most popular platforms. Users can search for models based on their specific needs and select from a wide range of options, including models tuned for conversational tasks, such as instruct-tuned variants designed to handle interactive dialogues effectively[[5]](#5)[[6]](#6).

#### Choosing the Right Model
When selecting a model, it is essential to consider its architecture and intended use case. Many models come pre-trained and can be fine-tuned for specific tasks or applications. For example, models like GPT-3 and BERT are widely recognized for their versatility in natural language processing tasks, ranging from text generation to sentiment analysis[[7]](#7). Additionally, it's important to pay attention to model specifications, such as the number of parameters and quantization options, which can affect performance and resource requirements[[8]](#8)[[5]](#5).


#### Downloading and Managing Models
Once a suitable model is identified, downloading it is typically straightforward. Many LLM frameworks provide built-in mechanisms to handle model downloads seamlessly, often defaulting to versions optimized for local usage, such as 4-bit quantized models[[6]](#6). For a manual approach, users can directly visit the Hugging Face Model Hub, where they can select the desired model repository and download the model files. The process generally involves copying the model name and specific file needed for implementation[[5]](#5).
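
The manual route described above can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper names `pick_quantized_file` and `download_model_file` are hypothetical, and the repository and file names you pass in must come from the model page you chose.

```python
from pathlib import Path


def pick_quantized_file(filenames, quant="Q4_K_M"):
    """Return the first GGUF filename containing the quantization tag, or None."""
    for name in filenames:
        lowered = name.lower()
        if lowered.endswith(".gguf") and quant.lower() in lowered:
            return name
    return None


def download_model_file(repo_id: str, filename: str, local_dir: str = "models") -> Path:
    """Download one file from a Hugging Face repository and return its local path."""
    # Imported lazily so the helper above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=repo_id, filename=filename, local_dir=local_dir)
    return Path(path)
```

A typical use would be to copy the repository name and the list of files from the model page, pass the list to `pick_quantized_file`, and then call `download_model_file` with the result. The tag `Q4_K_M` is only an example default; not every repository publishes that variant.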

#### Setting Up the Environment
After downloading, users will need to set up their local environment to run the model. This often includes installing required libraries and dependencies specific to the framework being used, such as TensorFlow or PyTorch. Users should familiarize themselves with the framework's documentation to ensure proper configuration and optimization for their hardware setup[[7]](#7)[[9]](#9).

### Running Large Language Models Locally
Running large language models (LLMs) locally has become increasingly accessible, allowing users to leverage powerful AI tools directly on their personal computers. With advancements in model efficiency and the availability of various frameworks, users can now experiment with openly available models such as GPT-2 and LLaMA without relying solely on cloud-based solutions.

#### Getting Started with Local LLMs
Before diving into running LLMs locally, it's essential to understand the necessary prerequisites. Users should familiarize themselves with concepts such as transformer architecture, pre-training, and fine-tuning, which are critical for effectively utilizing these models[[7]](#7). For initial experimentation, it is advisable to start with smaller models like GPT-2, which are easier to work with and require less computational power[[7]](#7).


#### Installation and Setup

To run an LLM locally, users need to install the relevant libraries and dependencies. Popular libraries for this purpose include Hugging Face's Transformers and llama.cpp. After ensuring that the necessary software is installed, users can download a model using the model browser provided by these frameworks. This tool integrates with platforms like Hugging Face to facilitate file management[[9]](#9)[[5]](#5).

#### Model Loading and Configuration

Once a model is downloaded, users must configure it according to their system capabilities. This involves specifying parameters such as context length and the number of CPU threads to utilize. For instance, one might set the context length to 4096 tokens and allocate four CPU threads for processing[[5]](#5). After configuring these settings, the model can be instantiated and tested for basic functionality.
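
One way to express the configuration described above is with the `llama-cpp-python` bindings for llama.cpp. This is a sketch under assumptions: the package must be installed separately, the helper names `build_llama_kwargs` and `load_model` are hypothetical, and the model path is whatever GGUF file you downloaded.

```python
def build_llama_kwargs(model_path: str, n_ctx: int = 4096, n_threads: int = 4) -> dict:
    """Collect the loading options: context length in tokens and CPU thread count."""
    if n_ctx <= 0 or n_threads <= 0:
        raise ValueError("n_ctx and n_threads must be positive")
    return {"model_path": model_path, "n_ctx": n_ctx, "n_threads": n_threads}


def load_model(model_path: str, **overrides):
    """Instantiate the model with the settings above (requires llama-cpp-python)."""
    # Imported lazily so the settings helper can be used without the package.
    from llama_cpp import Llama

    return Llama(**build_llama_kwargs(model_path, **overrides))
```

After `llm = load_model("path/to/your-model.gguf")`, a quick functional test is to call `llm("Q: What is 2 + 2? A:", max_tokens=16)` and inspect `["choices"][0]["text"]` in the returned dictionary. Lower `n_ctx` if the machine runs short of memory.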

#### Fine-Tuning Models

Fine-tuning is a crucial step in adapting pre-trained models to specific tasks or datasets. If a user possesses a labeled dataset tailored for a particular application, such as sentiment analysis or question answering, they can fine-tune the model to enhance its performance on that task. This process involves using training scripts and defining appropriate training parameters within the chosen framework[[10]](#10). Detailed documentation is typically available to guide users through this process effectively.

#### Challenges and Considerations
While running LLMs locally is more feasible than ever, users should be aware of potential challenges. The computational demands of larger models can strain consumer-grade hardware, making it essential to choose the right model based on system specifications[[11]](#11). Additionally, experimenting with local models may require troubleshooting and adjustments to ensure compatibility and performance stability.

### Performance Optimization
Optimizing the performance of large language models (LLMs) when running locally is crucial for achieving efficient and effective results. Several strategies can be employed to maximize the capabilities of LLMs like Llama 2 and Llama 3.1, ensuring that they run smoothly while utilizing system resources effectively.

#### GPU Acceleration
One of the most significant enhancements in performance can be achieved through the use of GPU acceleration. Modern GPUs, especially high-end models such as the NVIDIA GeForce RTX series, are designed to handle computationally intensive tasks with greater efficiency compared to CPUs. To leverage this, users should ensure that their systems are configured to utilize GPU capabilities by installing the necessary drivers and libraries, such as CUDA for NVIDIA GPUs or ROCm for AMD GPUs[[12]](#12). Utilizing GPUs can drastically reduce processing time, particularly when managing large inputs or executing multiple tasks concurrently[[12]](#12)[[13]](#13).

#### Batching Techniques
Batching is another effective method for optimizing throughput and latency in model inference. By processing multiple input sequences simultaneously, batching maximizes the efficiency of matrix operations, enabling the model to handle more tokens per unit time. However, selecting the appropriate batch size involves trade-offs; smaller batch sizes can reduce latency but may lower overall throughput, while larger batch sizes can enhance throughput at the expense of increased latency[[14]](#14). It is essential to tailor the batch size to the specific use case requirements, balancing the need for quick responses against the desire for maximum processing efficiency[[14]](#14).
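
To make the batch-size trade-off concrete, the sketch below groups a list of prompts into fixed-size batches. It is illustrative only: `iter_batches` is a hypothetical helper, and how each batch is actually sent to a model depends on the framework in use.

```python
from typing import Iterator, List, Sequence


def iter_batches(prompts: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size prompts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield list(prompts[start:start + batch_size])


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(5)]
    # A larger batch_size means fewer model calls but a longer wait per call.
    for batch in iter_batches(prompts, batch_size=2):
        print(len(batch), batch)
```

With `batch_size=1` every prompt is answered as soon as possible; raising it trades per-request latency for throughput, as described above.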


#### Memory Management
Efficient memory usage plays a crucial role in optimizing performance. Implementing batch processing not only reduces redundant operations but also optimizes resource utilization. Careful alignment of batch sizes with system memory capacity can help prevent crashes and enhance performance. Additionally, monitoring system resource usage with tools like NVIDIA’s nvidia-smi or Linux’s htop can help identify bottlenecks in memory, GPU, or CPU usage, allowing users to make necessary adjustments[[12]](#12)[[15]](#15).
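
GPU monitoring with `nvidia-smi` can also be scripted. The sketch below assumes an NVIDIA driver is installed and that `nvidia-smi` is on the `PATH`; the helper names are hypothetical, and the parser assumes every queried field is reported as a number (some devices report `[N/A]`, which this sketch does not handle).

```python
import subprocess

# Query name, memory use and utilization as plain CSV without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Turn the CSV output of the query above into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, used, total, util = [field.strip() for field in line.rsplit(",", 3)]
        stats.append({
            "name": name,
            "memory_used_mib": int(used),
            "memory_total_mib": int(total),
            "utilization_pct": int(util),
        })
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed statistics."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(output)
```

Calling `read_gpu_stats()` before and after loading a model gives a quick view of how much VRAM the model occupies and whether the GPU is actually being used.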


#### Regular Updates and Dependency Management
Keeping software dependencies up to date is vital for maintaining optimal performance. Regularly updating Python libraries, CUDA drivers, and the Llama repository can lead to significant performance enhancements, as updates often include bug fixes and new features that improve functionality[[12]](#12).

#### Fine-Tuning Model Parameters
Fine-tuning model parameters can also yield considerable improvements in performance. Parameter-efficient tuning methods, such as Low-Rank Adaptation (LoRA) and Prefix Tuning, allow for fine-tuning only a subset of model parameters, which reduces computational and memory overhead[[16]](#16). These techniques have been shown to achieve performance levels comparable to full fine-tuning while being significantly less resource-intensive.
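
The saving from LoRA can be estimated with simple arithmetic. For a weight matrix of shape `d_out x d_in`, LoRA keeps the original weights frozen and trains two low-rank factors of shapes `d_out x r` and `r x d_in`. The sketch below counts parameters for a single layer; the function names and the 4096-wide example layer are illustrative assumptions, not values from any particular model.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when fine-tuning the whole weight matrix."""
    return d_in * d_out


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors trained by LoRA."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096  # hypothetical layer width
    rank = 8
    full = full_params(d_in, d_out)
    lora = lora_trainable_params(d_in, d_out, rank)
    print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.4%}")
```

For this example layer, rank 8 trains 65,536 parameters instead of 16,777,216, which is under half a percent of the full matrix.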

#### Structural Optimization
Structural optimizations, such as the implementation of FlashAttention and PagedAttention, can enhance computational speed by minimizing memory access operations during forward propagation. These optimizations utilize a chunked computation approach, which significantly boosts performance by reducing the number of accesses to high bandwidth memory (HBM) and speeding up the overall inference process[[17]](#17)[[16]](#16). By employing these strategies, users can significantly enhance the performance of large language models on local machines, making them faster, more responsive, and better suited for a variety of applications.
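
The chunked computation idea behind FlashAttention can be illustrated with a toy example. The sketch below computes a softmax-weighted sum for a single query in chunks, keeping only a running maximum, a running normalizer, and a running output, and checks it against the direct computation. It is a simplification under stated assumptions: values are scalars, inputs are non-empty, and nothing here reflects the fused GPU kernels of the real implementation.

```python
import math


def attention_reference(scores, values):
    """Softmax-weighted sum computed over the whole sequence at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def attention_chunked(scores, values, chunk_size):
    """Same result, but visiting the sequence one chunk at a time."""
    running_max = float("-inf")
    running_sum = 0.0  # normalizer, expressed relative to running_max
    running_out = 0.0  # weighted sum, expressed relative to running_max
    for start in range(0, len(scores), chunk_size):
        s_chunk = scores[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        new_max = max(running_max, max(s_chunk))
        # Rescale earlier partial results to the new maximum (0.0 on the first chunk).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_chunk, v_chunk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum
```

Because only the three running quantities are carried between chunks, the full row of attention weights never needs to be stored, which is the property that lets tiled implementations reduce memory traffic.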

### Troubleshooting Common Issues
When working with large language models (LLMs) on a local machine, users may encounter various issues that can hinder their experience. Below are some common problems and recommended solutions to address them effectively.

#### Node-Level Replacement Issues
One common challenge arises when node-level replacement is needed during model execution. If a node fails, collectives may hang instead of throwing an error. To mitigate this, it is essential to set appropriate timeouts on collectives to ensure they throw an error when necessary. Additionally, implementing a monitoring client that tracks CloudWatch logs and metrics can help identify abnormal patterns, such as halted logging or zero GPU usage, which indicate job hangs or convergence issues. This setup allows for prompt remediation by enabling automatic job stop/retry actions[[18]](#18).

#### Context-Memory Conflicts
Context-memory conflicts can occur when external contexts contradict the internal knowledge of the LLM. To address these conflicts, fine-tuning the model on counterfactual contexts can help prioritize external information. Using specialized prompts reinforces adherence to context, while decoding techniques can amplify context probabilities. Additionally, pre-training on diverse contexts across documents aids in reducing the incidence of such conflicts[[19]](#19).

#### Installation Issues
Users may experience difficulties when installing packages like PyTorch on their machines. It is advisable to follow the official installation instructions to avoid errors. For instance, for a standard installation on Windows using Anaconda, one might use a command like `conda install pytorch torchvision cpuonly -c pytorch` for the CPU version, or include the CUDA toolkit for GPU support[[20]](#20)[[21]](#21). If issues persist, verifying the installation with a simple PyTorch code snippet can help confirm that everything is set up correctly[[21]](#21).
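
A verification snippet of the kind mentioned above might look like the following. It is a minimal sketch: the function name `check_pytorch` is hypothetical, and it only reports whether PyTorch imports and whether it can see a CUDA device.

```python
def check_pytorch() -> dict:
    """Report whether PyTorch is importable and whether CUDA is usable."""
    try:
        import torch
    except ImportError:
        return {"installed": False, "version": None, "cuda_available": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(check_pytorch())
```

If `installed` is `False`, the package is missing from the active environment; if `cuda_available` is `False` on a machine with an NVIDIA GPU, the CPU-only build or a driver mismatch is a likely cause.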

#### Model Downloading Errors
When models are downloaded on one machine and run on another with different GPU capabilities, users may encounter CUDA out-of-memory (OOM) errors if the target machine has insufficient resources. It is crucial to confirm that the model fits the machine that will execute it, not only the one that downloads it. For more reliable model management, tools such as `huggingface-cli` or the `hf_hub_download` function can be used to fetch models[[22]](#22)[[23]](#23).

#### Post-Installation Setup
After installing an LLM application like Jan.AI or Ollama, users should ensure their system's environment is correctly set up. For instance, ensuring that the video card (preferably NVIDIA) is recognized is critical for optimal performance. If a suitable GPU is unavailable, the model may resort to CPU usage, significantly slowing down processing times[[24]](#24). By addressing these common issues with proactive measures, users can enhance their experience with large language models on local machines, ensuring smoother operations and better outcomes.






<a id="References "></a>
### References



<a id="1">[1]</a> 
https://aitoolmall.com/news/running-local-llms-made-easy-with-ollama-ai/

<a id="2">[2]</a> 
https://www.linkedin.com/pulse/interacting-metas-latest-llama-32-llms-locally-using-ollama-stafford-fhduc/

<a id="3">[3]</a> 
https://www.pcmag.com/how-to/how-to-run-your-own-chatgpt-like-llm-for-free-and-in-private

<a id="4">[4]</a> 
https://www.hardware-corner.net/llm-database/LLaMA/

<a id="5">[5]</a> 
https://www.restack.io/p/transformers-answer-hugging-face-install-cat-ai

<a id="6">[6]</a> 
https://scoris.medium.com/a-beginners-journey-into-ai-guide-to-running-large-language-models-locally-4a9c44fa8fef

<a id="7">[7]</a> 
https://medium.com/@dinakarchennupati777/the-recommended-way-to-setup-pytorch-environment-on-your-local-machine-b90ec1eef5dd

<a id="8">[8]</a> 
https://www.tomshardware.com/news/running-your-own-chatbot-on-a-single-gpu

<a id="9">[9]</a> 
https://www.pcguide.com/apps/can-chatgpt-run-locally/


<a id="10">[10]</a> 
https://www.almabetter.com/bytes/articles/install-pytorch


<a id="11">[11]</a> 
https://blog.ahmadwkhan.com/running-open-source-llm-llama3-locally-using-ollama-and-langchain

<a id="12">[12]</a> 
https://medium.com/predict/a-simple-comprehensive-guide-to-running-large-language-models-locally-on-cpu-and-or-gpu-using-c0c2a8483eee


<a id="13">[13]</a> 
https://www.edtech247.com/blog/local-llm/

<a id="14">[14]</a> 
https://medium.com/@siamsoftlab/getting-started-with-large-language-models-945b6d943f01

<a id="15">[15]</a> 
https://raptorhacker.medium.com/20-minutes-for-beginners-to-deploy-large-language-model-locally-3ab66d71ef1c

<a id="16">[16]</a> 
**seven ways of running llm locally**
https://kleiber.me/blog/2024/01/07/seven-ways-running-llm-locally/


<a id="17">[17]</a> 
https://mljourney.com/how-to-use-hugging-face-step-by-step-guide/


<a id="18">[18]</a> 
https://www.toolify.ai/ai-news/running-large-language-models-on-your-computer-a-guide-to-koboldcpp-and-autogptq-1217326



<a id="19">[19]</a> 
https://mljourney.com/how-to-run-llama-2-locally-a-step-by-step-guide/


<a id="20">[20]</a> 
https://www.hardware-corner.net/guides/hardware-for-120b-llm/



<a id="21">[21]</a> 
https://medium.com/@yvan.fafchamps/how-to-benchmark-and-optimize-llm-inference-performance-for-data-scientists-1dbacdc7412a


<a id="22">[22]</a> 
https://medium.com/@mne/how-to-choose-the-ideal-large-language-model-for-local-inference-c2b25931205


<a id="23">[23]</a> 
https://arxiv.org/html/2401.02038v2


<a id="24">[24]</a> 
https://developers.googleblog.com/en/large-language-models-on-device-with-mediapipe-and-tensorflow-lite/





#### Create a Virtual Environment

Create a new conda environment:

- `conda create --name llm_local`

List the available virtual environments:

- `conda env list`

Activate the virtual environment:

- `conda activate llm_local`

### Ollama

Ollama is an open-source tool for downloading and running large language models on your own machine. It is designed to be easy to use and provides a simple REST API for interacting with the models it serves. To use Ollama from the terminal, first install it with the installer available from this [link](https://ollama.com/download). You can also use it from Python by installing the client library with `pip install ollama`.

For example, to chat with Llama 3.3, run `ollama run llama3.3` in the terminal. This downloads the model if it is not already present, after which you can chat with it. Other models are listed [here](https://ollama.com/search). To download a model without starting a chat, pull it first: the command `ollama pull llama3.2:latest` pulls the Llama 3.2 3B model. This installs a 4-bit quantized version of the 3B model, which requires 2.0 GB of disk space and has an identical hash to the 3b-instruct-q4_K_M model. The smaller 1B model is available as `llama3.2:1b`. After pulling, run `ollama run llama3.2:1b` (or whichever tag you pulled) to start chatting with the model in the terminal.

In [10]:
#!ollama pull llama3.2:1b 
!ollama pull gemma2:2b
#!ollama pull llama3.2:latest
#!pip install streamlit ollama
#!pip install langchain langchain-community langchain-core langchain-ollama streamlit watchdog 

pulling manifest
pulling 7462734796d6... 100% ▕████████████████▏ 1.6 GB
pulling e0a42594d802... 100% ▕████████████████▏  358 B
pulling 097a36493f71... 100% ▕████████████████▏ 8.4 KB
pulling 2490e7468436... 100% ▕████████████████▏   65 B
pulling e18ad7af7efb... 100% ▕████████████████▏  487 B
verifying sha256 digest
writing manifest
success


The Ollama Python library can also be used to interact with LLMs, as shown below:

In [19]:
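%%time
# Requires the Ollama Python client: pip install ollama
from ollama import chat
from ollama import ChatResponse


def generate_response(prompt, model):
    """Send a single user prompt to a local Ollama model and return the reply text."""
    try:
        response: ChatResponse = chat(
            model=model,
            messages=[{'role': 'user', 'content': prompt}],
        )
    except Exception as e:
        # Return None on failure so the caller never touches an undefined response.
        print(f"An error occurred during response generation: {e}")
        return None
    return response['message']['content']


generate_response(prompt='what is perplexity in llm?', model='gemma2:2b')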

CPU times: total: 0 ns
Wall time: 45.5 s


'Let\'s break down perplexity and how it relates to large language models (LLMs):\n\n**What is Perplexity?**\n\nIn essence, perplexity is a measure of how well an LLM predicts the next word in a sequence.  Imagine you have a sentence: "The quick brown fox jumps over the lazy dog." \n\n* **High Perplexity:** If your LLM has trouble predicting what the next word should be (like "over" or "the"), it indicates high perplexity. This suggests that the model doesn\'t understand the context well and struggles to predict the most likely follow-up words.\n* **Low Perplexity:**  If the model predicts the next word with high accuracy, it will have low perplexity. The model can grasp the relationship between words and make accurate predictions about what should come next.\n\n**Perplexity Explained**\n\nPerplexity is calculated based on the probabilities of all possible next words within a given context:\n\n1. **Training Data:** An LLM learns from massive amounts of text data (think books, articles,

Equivalently, you can access fields directly on the response object, for example `response.message.content`.

### Ollama with Streamlit and LangChain

We will use Streamlit and LangChain to build chat applications for the Llama 3.2 1B and 3B models served by Ollama. The first example below uses only Streamlit and the `ollama` library. Save it as `ollama_streamlit.py`, then run `python -m streamlit run ollama_streamlit.py` in the terminal.

In [None]:
import streamlit as st
from ollama import chat
from ollama import ChatResponse


def generate_response(prompt, model):
    """Send a single user prompt to a local Ollama model and return the reply text."""
    try:
        response: ChatResponse = chat(
            model=model,
            messages=[{'role': 'user', 'content': prompt}],
        )
    except Exception as e:
        # Show the error in the app and signal failure to the caller.
        st.error(f"An error occurred during response generation: {e}")
        return None
    return response['message']['content']


# Model picker; each listed model must already be pulled with `ollama pull`.
model_selected = st.selectbox('Select model', ['llama3.2:1b', 'gemma2:2b'])


def main():
    st.title("Ollama LLM App")

    # User input field
    user_input = st.text_input("Enter your prompt:")

    # Button to trigger generation
    if st.button("Generate"):
        if user_input:
            response = generate_response(prompt=user_input, model=model_selected)
            if response is not None:
                st.write("Model Response:")
                st.write(response)
        else:
            st.warning("Please enter a prompt.")


if __name__ == "__main__":
    main()

Ollama can also be used through its REST API. You can first check that the API is running by opening `http://localhost:11434/` in a browser. In Windows PowerShell, an example command to call the API is `Invoke-WebRequest -Uri http://localhost:11434/api/generate -Method Post -Body '{"model": "llama3.2:1b","prompt": "what is the biggest country in the world?"}' -ContentType "application/json"`. In a Linux/Unix terminal, a similar request can be sent with `curl`: `curl http://localhost:11434/api/generate -d '{ "model": "llama3.2:1b", "prompt": "What is the happiest place on earth?" }'`.

Another way to access the Ollama API is through Python. For example, the code below can be saved as a Python script and run in the terminal.


In [None]:
import requests

# Default address of the local Ollama server
url = "http://localhost:11434/api/generate"

payload = {
    "model": "llama3.2:1b",
    "prompt": "what is the biggest country in the world",
    # Return the full answer as one JSON object instead of a stream of chunks
    "stream": False,
}

# json= serializes the payload and sets the Content-Type header
response = requests.post(url, json=payload, timeout=120)

if response.status_code == 200:
    # The generated text is stored under the "response" key
    print(response.json()["response"])
else:
    print("Error: ", response.status_code, response.text)
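
When `"stream"` is not set to `False`, the `/api/generate` endpoint returns a sequence of JSON objects, one per line, each carrying a piece of the answer in its `response` field and a `done` flag on the last one. The sketch below shows one way to reassemble such a stream; the helper name `collect_streamed_response` is hypothetical.

```python
import json


def collect_streamed_response(lines) -> str:
    """Join the "response" fragments of a line-delimited JSON stream."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip keep-alive blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

With `requests`, the lines can be obtained by posting with `stream=True` and passing `response.iter_lines()` to this function. Streaming is useful when you want to display text as it is generated rather than wait for the complete answer.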

In [None]:
# Ollama-Streamlit-LangChain-Chat-App
# Streamlit app for chatting with Meta Llama 3.2 using Ollama and LangChain
# Author: Gary A. Stafford
# Date: 2024-09-26

import logging
from typing import Dict, Any

import streamlit as st
from langchain_community.chat_message_histories import StreamlitChatMessageHistory
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_ollama import ChatOllama

# Constants
PAGE_TITLE = "Llama 3.2 Chat"
PAGE_ICON = "🦙"
SYSTEM_PROMPT = "You are a friendly AI chatbot having a conversation with a human."
DEFAULT_MODEL = "llama3.2:latest"

# Configure logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)


def initialize_session_state() -> None:
    defaults: Dict[str, Any] = {
        "model": DEFAULT_MODEL,
        "input_tokens": 0,
        "output_tokens": 0,
        "total_tokens": 0,
        "total_duration": 0,
        "num_predict": 2048,
        "seed": 1,
        "temperature": 0.5,
        "top_p": 0.9,
    }
    for key, value in defaults.items():
        if key not in st.session_state:
            st.session_state[key] = value


def create_sidebar() -> None:
    with st.sidebar:
        st.header("Inference Settings")
        st.session_state.system_prompt = st.text_area(
            label="System",
            value=SYSTEM_PROMPT,
            help="Sets the context in which to interact with the AI model. It typically includes rules, guidelines, or necessary information that help the model respond effectively.",
        )

        st.session_state.model = st.selectbox(
            "Model",
            ["llama3.2:1b", "llama3.2:latest"],
            index=1,
            help="Select the model to use.",
        )
        st.session_state.seed = st.slider(
            "Seed",
            min_value=1,
            max_value=9007199254740991,
            value=round(9007199254740991 / 2),
            step=1,
            help="Controls the randomness of how the model selects the next tokens during text generation.",
        )
        st.session_state.temperature = st.slider(
            "Temperature",
            min_value=0.0,
            max_value=1.0,
            value=0.5,
            step=0.01,
            help="Sets an LLM's entropy. Low temperatures render outputs that are predictable and repetitive. Conversely, high temperatures encourage LLMs to produce more random, creative responses.",
        
        )
        st.session_state.top_p = st.slider(
            "Top P",
            min_value=0.0,
            max_value=1.0,
            value=0.90,
            step=0.01,
            help="Sets the probability threshold for the nucleus sampling algorithm. It controls the diversity of the model's responses.",
        )
        st.session_state.num_predict = st.slider(
            "Response Tokens",
            min_value=0,
            max_value=8192,
            value=2048,
            step=16,
            help="Sets the maximum number of tokens the model can generate in response to a prompt.",
        )

        st.markdown("---")
        st.text(
            f"""Stats:
- model: {st.session_state.model}
- seed: {st.session_state.seed}
- temperature: {st.session_state.temperature}
- top_p: {st.session_state.top_p}
- num_predict: {st.session_state.num_predict}
        """
        )


def create_chat_model() -> ChatOllama:
    return ChatOllama(
        model=st.session_state.model,
        seed=st.session_state.seed,
        temperature=st.session_state.temperature,
        top_p=st.session_state.top_p,
        num_predict=st.session_state.num_predict,
    )


def create_chat_chain(chat_model: ChatOllama):
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", st.session_state.system_prompt),
            MessagesPlaceholder(variable_name="chat_history"),
            ("human", "{input}"),
        ]
    )
    return prompt | chat_model


def update_sidebar_stats(response: Any) -> None:
    total_duration = response.response_metadata["total_duration"] / 1e9
    st.session_state.total_duration = f"{total_duration:.2f} s"
    st.session_state.input_tokens = response.usage_metadata["input_tokens"]
    st.session_state.output_tokens = response.usage_metadata["output_tokens"]
    st.session_state.total_tokens = response.usage_metadata["total_tokens"]
    token_per_second = (
        response.response_metadata["eval_count"]
        / response.response_metadata["eval_duration"]
    ) * 1e9
    st.session_state.token_per_second = f"{token_per_second:.2f} tokens/s"

    with st.sidebar:
        st.text(
            f"""
- input_tokens: {st.session_state.input_tokens}
- output_tokens: {st.session_state.output_tokens}
- total_tokens: {st.session_state.total_tokens}
- total_duration: {st.session_state.total_duration}
- token_per_second: {st.session_state.token_per_second}
        """
        )


def main() -> None:
    st.set_page_config(page_title=PAGE_TITLE, page_icon=PAGE_ICON, layout="wide")
    st.markdown(
        """
        <style>
            #MainMenu {visibility: hidden;}
            footer {visibility: hidden;}
            header {visibility: hidden;}
        </style>
        """,
        unsafe_allow_html=True,
    )

    st.title(f"{PAGE_TITLE} {PAGE_ICON}")

    st.markdown("##### Chat")

    initialize_session_state()
    create_sidebar()

    chat_model = create_chat_model()
    chain = create_chat_chain(chat_model)

    msgs = StreamlitChatMessageHistory(key="special_app_key")
    if not msgs.messages:
        msgs.add_ai_message("How can I help you?")

    chain_with_history = RunnableWithMessageHistory(
        chain,
        lambda session_id: msgs,
        input_messages_key="input",
        history_messages_key="chat_history",
    )

    for msg in msgs.messages:
        st.chat_message(msg.type).write(msg.content)

    if prompt := st.chat_input("Type your message here..."):
        st.chat_message("human").write(prompt)

        with st.spinner("Thinking..."):
            config = {"configurable": {"session_id": "any"}}
            response = chain_with_history.invoke({"input": prompt}, config)
            logger.info({"input": prompt, "config": config})
            st.chat_message("ai").write(response.content)
            logger.info(response)
            update_sidebar_stats(response)

    if st.button("Clear Chat History"):
        msgs.clear()
        st.rerun()


if __name__ == "__main__":
    main()
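
The arithmetic inside `update_sidebar_stats` can be checked in isolation. The sketch below assumes the metadata carries the same keys used above (`total_duration`, `eval_count`, `eval_duration`) with durations in nanoseconds; `summarize_timing` and the sample values are hypothetical and only illustrate the unit conversion.

```python
from typing import Any, Mapping

NANOSECONDS_PER_SECOND = 1e9


def summarize_timing(metadata: Mapping[str, Any]) -> dict:
    """Convert nanosecond timing metadata into seconds and tokens per second."""
    total_seconds = metadata["total_duration"] / NANOSECONDS_PER_SECOND
    eval_seconds = metadata["eval_duration"] / NANOSECONDS_PER_SECOND
    # Guard against a zero duration so the division cannot fail.
    tokens_per_second = metadata["eval_count"] / eval_seconds if eval_seconds else 0.0
    return {
        "total_duration_s": round(total_seconds, 2),
        "tokens_per_second": round(tokens_per_second, 2),
    }


# Made-up sample: 120 tokens generated in 2 s, 3.5 s end to end.
sample = {
    "total_duration": 3_500_000_000,
    "eval_count": 120,
    "eval_duration": 2_000_000_000,
}
print(summarize_timing(sample))
```

Keeping the conversion in a pure function like this makes it easy to test without a running model.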

### Hugging Face




Hugging Face has become a go-to platform for accessing and using a large library of pre-trained machine learning models. From natural language processing (NLP) tasks like text generation and sentiment analysis to computer vision applications like image classification and object detection, Hugging Face offers a user-friendly interface and a diverse collection of models.
Hugging Face provides a simple API that makes it easy to integrate these models into your projects. The platform hosts a wide range of models, from well-known architectures like BERT and GPT to more specialized models for specific tasks. The Hugging Face community is active and supportive, providing resources such as tutorials, documentation, and forums. Many models can be fine-tuned on your own data, allowing you to adapt them to your specific needs.
To get started, follow these steps:

**Create a Hugging Face Account:** Sign up for a free account on the Hugging Face platform.

**Explore the Model Hub:** Browse the Model Hub to discover models relevant to your needs.

**Choose and Load a Model:** Select a model and use the Hugging Face library to load it into your project.


**Use the Model:** Utilize the model for your desired task, such as text generation, sentiment analysis, or image classification.

**Fine-tune (Optional):** If needed, fine-tune the model on your own data to improve its performance on specific tasks.

Hugging Face provides a powerful and accessible platform for leveraging the power of AI. By utilizing their pre-trained models, developers and researchers can quickly and easily integrate advanced machine learning capabilities into their projects, accelerating innovation and driving progress in various fields.
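
The load-and-use steps above can be sketched with the `pipeline` helper from `transformers`. This is a minimal illustration, not the only way to do it: `classify_sentiment` is a hypothetical wrapper, and when no classifier is passed it relies on whichever default sentiment checkpoint the library selects, which is downloaded from the Model Hub on first use.

```python
from typing import Callable, List, Optional


def classify_sentiment(texts: List[str], classifier: Optional[Callable] = None) -> List[str]:
    """Return one sentiment label per input text."""
    if classifier is None:
        # Imported lazily so the function can be defined without transformers installed.
        from transformers import pipeline

        # With no model argument, transformers picks its default sentiment checkpoint.
        classifier = pipeline("sentiment-analysis")
    results = classifier(texts)
    return [result["label"] for result in results]
```

Calling `classify_sentiment(["I enjoy running models locally."])` triggers the download and returns a list with one label. Passing your own `classifier` lets you swap in a specific or fine-tuned model.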

In [2]:
#pip install transformers
#!pip install huggingface_hub
#!pip install sentencepiece 
#!pip3 install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
import torch
torch.__version__
#pip install 'transformers[torch]'
#pip install 'transformers[tf-cpu]'
#!pip install modelscope
#!pip install "accelerate>=0.26.0"
#!pip install jmespath
#!pip install  pkginfo  pycryptodomex  texttable  types-PyYAML types-requests
#!pip install -U bitsandbytes
#!pip install "transformers>=4.45.1"



'2.5.1+cpu'

In [3]:
#!pip install python-dotenv
from dotenv import load_dotenv
import os

load_dotenv("C:/Users/nboateng/OneDrive - Nice Systems Ltd/Documents/Research/LLM/huggingface_api/.env")  # take environment variables from .env.

# Read the Hugging Face token from the environment (stored under API_KEY in .env)
API_KEY = os.getenv('API_KEY')
HF_TOKEN = API_KEY

# Confirm the token was loaded without printing the secret itself
print(f'HF_TOKEN loaded: {HF_TOKEN is not None}')

from huggingface_hub import login
login(token = HF_TOKEN)

HF_TOKEN loaded: True


In [41]:
# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM-360M"
models  = ["HuggingFaceTB/SmolLM-1.7B-Instruct","HuggingFaceTB/SmolLM-360M-Instruct","HuggingFaceTB/SmolLM-135M-Instruct"]
#checkpoint = models[1]
#checkpoint =    "nroggendorff/smallama-it"
device = "cpu"  # use "cuda" for GPU or "cpu" for CPU
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
prompt = "what is machine learning?"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


what is machine learning?
Machine learning is a branch of artificial intelligence that enables computers to learn from data without being explicitly programmed
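
The attention-mask warning above appears because `tokenizer.encode` returns only token ids. A minimal sketch of one way to avoid it is shown below: call the tokenizer directly so it also returns `attention_mask`, and pass `pad_token_id` explicitly. `generate_with_mask` is a hypothetical helper that expects an already loaded `tokenizer` and `model`. Inputs are left on the CPU, as in the cell above, and would need to be moved for other devices.

```python
def generate_with_mask(tokenizer, model, prompt: str, max_new_tokens: int = 50) -> str:
    """Generate text while passing the attention mask and pad token explicitly."""
    # Calling the tokenizer directly returns both input_ids and attention_mask.
    encoded = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With the objects from the cell above, usage would be `generate_with_mask(tokenizer, model, "what is machine learning?")`.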


In [49]:
%%time

from transformers import AutoModelForCausalLM, AutoTokenizer
models  = ["HuggingFaceTB/SmolLM-1.7B-Instruct","HuggingFaceTB/SmolLM-360M-Instruct","HuggingFaceTB/SmolLM-135M-Instruct"]

# Function to generate text
def generate_text(prompt,device,model,top_p,temperature,max_new_tokens):
    """Generate a chat-style reply from the given checkpoint name."""
    tokenizer = AutoTokenizer.from_pretrained(model)
    # for multiple GPUs install accelerate and do `AutoModelForCausalLM.from_pretrained(model, device_map="auto")`
    lm = AutoModelForCausalLM.from_pretrained(model).to(device)
    # Wrap the prompt in the model's chat template before tokenizing
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
    output = lm.generate(inputs, temperature=temperature, top_p=top_p, max_new_tokens=max_new_tokens, do_sample=True)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text



# checkpoint = models[0]

# device = "cpu" # for GPU usage or "cpu" for CPU usage
# tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# # for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
# model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# messages = [{"role": "user", "content": "What is the capital of the united states of america?"}]
# input_text=tokenizer.apply_chat_template(messages, tokenize=False)
# print(input_text)
# inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
# outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
# print(tokenizer.decode(outputs[0]))


prompt = "What is the capital of the united states of america?" 
device = "cpu"
model = models[0]
temperature=0.2
top_p=0.9
max_new_tokens= 50
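# Pass the sampling settings by keyword so temperature and top_p cannot be swapped
generate_text(prompt, device, model, top_p=top_p, temperature=temperature, max_new_tokens=max_new_tokens)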


CPU times: total: 4.42 s
Wall time: 5.48 s


'user\nWhat is the capital of the united states of america?\nassistant\nThe capital of the United States of America is Washington, D.C.'

In [22]:
%%time

# pip install transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
models  = ["HuggingFaceTB/SmolLM-1.7B-Instruct","HuggingFaceTB/SmolLM-360M-Instruct","HuggingFaceTB/SmolLM-135M-Instruct"]

checkpoint = models[0]

device = "cpu"  # use "cuda" for GPU or "cpu" for CPU
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "What is the capital of the united states of america?"}]
input_text=tokenizer.apply_chat_template(messages, tokenize=False)
print(input_text)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=50, temperature=0.2, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0]))


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development



<|im_start|>user
What is the capital of the united states of america?<|im_end|>

<|im_start|>user
What is the capital of the united states of america?<|im_end|>
<|im_start|>assistant
The capital of the United States of America is Washington, D.C.<|im_end|>
CPU times: total: 15.2 s
Wall time: 1min 35s
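
Much of the wall time above goes to downloading and loading the checkpoint, not to generation, and the `generate_text` function shown earlier reloads the tokenizer and model on every call. A minimal sketch of one way to pay that cost only once per session is to cache the loaded objects. `load_checkpoint` is a hypothetical helper, and the cache size of 2 is an arbitrary choice to limit memory use.

```python
from functools import lru_cache


@lru_cache(maxsize=2)
def load_checkpoint(checkpoint: str, device: str = "cpu"):
    """Load a tokenizer and model once per (checkpoint, device) pair."""
    # Imported lazily so defining the function does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
    return tokenizer, model
```

Repeated calls such as `load_checkpoint("HuggingFaceTB/SmolLM-360M-Instruct")` then return the same objects instead of reading the weights from disk again.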


In [9]:
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

HF_MODEL = 'lmsys/fastchat-t5-3b-v1.0'
#HF_MODEL = 'Genius-Society/gpt4all'
#model_id =  "SweatyCrayfish/llama-3-8b-quantized"
HF_MODEL_PATH = HF_MODEL.split('/')[1]
HF_MODEL_PATH


'fastchat-t5-3b-v1.0'
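
The cell above imports `snapshot_download` and derives a folder name but stops before using either. One plausible continuation, sketched under assumptions, is to download the repository into a local folder and load the model from that path later. `local_dir_for`, `download_model`, and the `models` root folder are hypothetical names, not part of the original notebook.

```python
from pathlib import Path


def local_dir_for(repo_id: str, root: str = "models") -> Path:
    """Map 'owner/name' to a local folder such as models/name."""
    return Path(root) / repo_id.split("/")[1]


def download_model(repo_id: str, root: str = "models") -> Path:
    """Download every file of a Hub repository into a local folder."""
    # Imported lazily; requires the huggingface_hub package and network access.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

After `path = download_model(HF_MODEL)`, the model can be loaded offline with `AutoModelForSeq2SeqLM.from_pretrained(path)` and `AutoTokenizer.from_pretrained(path)`.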

### GPT4All

https://www.nomic.ai/gpt4all

Nomic.ai's GPT4All is an open-source desktop application that allows you to run LLMs directly on your device, without the need for cloud access or expensive hardware. With GPT4All, you can use LLMs for a wide range of tasks, including:

**Writing:** GPT4All can help you brainstorm ideas, overcome writer's block, and generate text in a range of formats, such as poems, code, scripts, musical pieces, emails, and letters.

**Research:** Use GPT4All to summarize complex topics, translate research papers, and find relevant information from your local documents.

**Coding:** GPT4All can assist you with coding tasks by generating code snippets, debugging code, and writing documentation.

**Learning:** GPT4All can be a valuable tool for learning new things. Ask it questions, get explanations of complex concepts, and even have it generate practice problems.


Getting started with GPT4All is easy. Simply download the application from the Nomic.ai website and install it on your device. Once installed, you can start using GPT4All to explore the power of LLMs.

GPT4All is helping to democratize access to LLMs, and with continued development it is poised to play a major role in the future of artificial intelligence. The recent GPT4All v3.4.0 update includes faster models and expanded filetype support. GPT4All is not only for individuals; businesses can also use it for enterprise purposes.


### LM Studio

https://lmstudio.ai/






LM Studio is a desktop application that allows you to run large language models (LLMs) directly on your computer. This means you can use these models for a variety of tasks without relying on cloud services, and smaller models can run without a high-end graphics card. Given a prompt, the loaded model generates natural-sounding text in response.

**Key Features of LM Studio**

**Local Processing:** Run LLMs on your own hardware, keeping your data private and secure. Because the models run locally, your data never leaves your device. Once a model has been downloaded, LM Studio can also be used without an internet connection, which makes it useful where connectivity is limited.

**Wide Model Support:** LM Studio supports a variety of LLM models, including Llama 3.2, Mistral, Phi, Gemma, and DeepSeek. This gives you the flexibility to choose the model that best suits your needs.

**Chat Interface:** Interact with LLMs through a user-friendly chat interface, allowing you to have conversations and ask questions in a natural way.

**OpenAI Compatibility:** LM Studio's local server is compatible with the OpenAI API, making it easy to integrate with existing tools and workflows.

**Hugging Face Integration:** Download compatible model files directly from the Hugging Face model repository, giving you access to a vast collection of pre-trained models.

**Integration with Other Tools:** LM Studio's local server can be integrated with other tools, such as text editors, chatbots, and AI development platforms, making it easy to collaborate and build custom solutions.

**Community Support:** Join the LM Studio community on Discord to ask questions, share resources, and collaborate with other users.

**Cost and Availability:** LM Studio is free to use for personal purposes, making it an accessible option for individuals and hobbyists.
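
The OpenAI compatibility mentioned above means a plain HTTP request in the OpenAI chat format is enough to talk to the local server. The sketch below uses only the standard library. It assumes the server has been started in LM Studio and listens on `http://localhost:1234/v1`; the `local-model` name and both helper functions are hypothetical placeholders to adjust for your setup.

```python
import json
import urllib.request

# Assumption: the LM Studio local server is running on this address.
BASE_URL = "http://localhost:1234/v1"


def build_chat_request(prompt: str, model: str = "local-model", temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """Send one prompt to the local server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Supplying data makes this a POST request.
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a model loaded and the server running, `chat("What is machine learning?")` returns the generated reply as a string.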


In [None]:
!pip install resource 



ERROR: Could not find a version that satisfies the requirement signal (from versions: none)
ERROR: No matching distribution found for signal


In [11]:
from vllm import LLM, SamplingParams

prompt = ["Hello, my name is", "The capital of united states is"]

#llm = LLM(model="meta-llama/Llama-2-70b-chat-hf")
llm = LLM(model="gpt2")
sampling_params = SamplingParams(temperature=0.5)
outputs = llm.generate(prompt, sampling_params=sampling_params)

outputs

ModuleNotFoundError: No module named 'resource'
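
The `ModuleNotFoundError` above arises because `resource` is a standard-library module that exists only on Unix-like systems, so the import fails on Windows. A minimal sketch of a guard that checks for both modules before attempting the import is shown below; `vllm_prerequisites_met` is a hypothetical helper, and a `True` result only means the imports can be attempted, not that vLLM will run.

```python
import importlib.util
import sys


def vllm_prerequisites_met() -> bool:
    """Return True when both 'resource' and 'vllm' can be located."""
    # `resource` is a Unix-only standard-library module, so it is absent on Windows.
    has_resource = importlib.util.find_spec("resource") is not None
    has_vllm = importlib.util.find_spec("vllm") is not None
    return has_resource and has_vllm


if vllm_prerequisites_met():
    print("vLLM can be imported on this platform.")
else:
    print(f"Skipping vLLM on {sys.platform}: missing 'resource' or 'vllm'.")
```

On a Windows machine, running the same code inside a Linux environment such as WSL or a container is one way to satisfy the check.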

https://huggingface.co/legraphista/DeepSeek-V2-Lite-Chat-IMat-GGUF

### JellyBox

https://jellybox.com/

### Local AI

https://localai.io/


LocalAI lets you run advanced AI models on your own computer. It functions as a drop-in replacement for the OpenAI API, meaning you can use it with existing applications designed for OpenAI. You can run LLMs and generate images, audio, and more, locally or on-premises with consumer-grade hardware. It supports multiple model families and architectures and does not require a GPU.


In [1]:
import requests
import json