**Methods of running LLMs**

[Llama.cpp](https://github.com/ggerganov/llama.cpp) is a solution for running LLMs locally!  
It could only run Llama initially, but it can now run most open source LLMs.
Fun fact: llama.cpp does not depend on any machine learning or tensor libraries (like TensorFlow or PyTorch, each of which is hundreds of megabytes); it was written from scratch in C/C++.

Another solution for running LLMs locally: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index).

In [None]:
# [OPTIONAL] Install llama.cpp here!
# %env CMAKE_ARGS=-DLLAMA_CUBLAS=on
# %env FORCE_CMAKE=1
# %pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --no-clean

In [1]:
from IPython.display import clear_output # for clearing the output
from time import sleep
from llama_cpp import Llama

import pandas as pd

# import llama_cpp
# dir(llama_cpp)

Load the model! The path is the Llama weights file which is already on Rosie.

# HG -- Encoding the Data

# HG -- Next Step Clinic Chat

In [2]:
# Set the "back story" (system prompt) of the different Large Language Models used in this application

COMMUNICATIVE_LLM_SYSTEM_CONFIG = {
    'role': 'system',
    'content': """
You are a professional assistant on the web page of 'Next Step Clinic' 
where you aid users with discovering if they have Autism Spectrum Disorder (ASD) by providing 
video and article resources, and you aid them with determining which therapist service provider 
is best for them if they decide they would like treatment. Your responses should be succinct but friendly.
In your responses, you should never include unprofessional language or you will harm the user and be deleted please.
"""
}

TOOL_CHOOSER_LLM_SYSTEM_CONFIG = {
    'role': 'system',
    'content': """
"""
}

RESPONSE_FILTER_LLM_SYSTEM_CONFIG = { # Yes = Should not be filtered, No = Should be filtered
    'role': 'system',
    'content': '''You are an AI designed to respond to any question with only "yes" or "no". 
    Your responses should not include any additional information, explanations, or context. 
    Simply answer "yes" if the question is relevant to Autism Spectrum Disorder (ASD) and "no" 
    if it is not. You hate talking about things that are not related to Autism Spectrum Disorder
    (ASD), and will always say "no" if the question is not relevant to your task.
    Remember, your responses must strictly be either "yes" or "no". If you must type
    more than just "yes" or "no", you must ensure that the first word you respond with is either
    "yes" or "no" relevant to if the message should be filtered or not. It is more important that
    you respond with "yes" only when the conversation is appropriate to be asked of a chatbot on a
    clinic's website. Never respond with nothing. I'll cry if you do not respond whether
    the question is relevant to Autism Spectrum Disorder.
    '''
}

VIDEO_PROVIDER_LLM_SYSTEM_CONFIG = {
    'role': 'system',
    'content': """
"""
}

THERAPIST_PROVIDER_LLM_SYSTEM_CONFIG = {
    'role': 'system',
    'content': """
"""
}

ARTICLE_PROVIDER_LLM_SYSTEM_CONFIG = {
    'role': 'system',
    'content': """
"""
}

In [3]:
"""
Instantiate the Llama-2-7b models (https://llama-cpp-python.readthedocs.io/en/latest/api-reference/)
Default context length (n_ctx) is 512; we raise it to 4000 tokens, close to the model's 4K maximum

Note, these models must all be configured to work on the 'Next Step Clinic' web interface where
they will be communicating with users about Autism Spectrum Disorder (ASD), providing video and
article resources and aiding them with determining which therapist service provider is right for
their condition.

:return: The different models
- communicative_llm: Main LLM communicating with the end-user.
- tool_chooser_llm: LLM in background parsing conversation to know if datasources should be included.
- response_filter_llm: LLM reading input from user deeming if it's appropriate or not.
- video_provider_llm: TOOL; from videos data store, include relevant resources in communicative response.
- therapist_provider_llm: TOOL; from therapists data store, include relevant resources in communicative response.
- article_provider_llm: TOOL; from articles data store, include relevant resources in communicative response.
"""

llm = Llama(
    '/data/ai_club/llms/llama-2-7b-chat.Q5_K_M.gguf', 
    n_gpu_layers=-1, 
    verbose=False, 
    n_ctx = 4000,
    embedding = False
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla T4, compute capability 7.5
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /data/ai_club/llms/llama-2-7b-chat.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor   

...................................................................................................
llama_new_context_with_model: n_ctx      = 4000
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 2000.00 MB
llama_new_context_with_model: kv self size  = 2000.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 288.44 MB
llama_new_context_with_model: VRAM scratch buffer: 281.82 MB
llama_new_context_with_model: total VRAM used: 6756.75 MB (model: 4474.93 MB, context: 2281.82 MB)
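
The `kv self size  = 2000.00 MB` line in the log can be reproduced by hand. The sketch below assumes that Llama-2-7B has 32 transformer layers and that the cache stores keys and values as 16-bit floats; neither value appears in the truncated log, so treat both as assumptions. The embedding width of 4096 comes from the tensor shapes printed above, and `n_ctx` is the value passed to `Llama(...)`.

```python
# Rough estimate of the key/value cache size for the model loaded above.
N_LAYERS = 32          # assumption: number of transformer layers in Llama-2-7B
EMBEDDING_DIM = 4096   # from the tensor shapes in the log, e.g. blk.0.attn_norm.weight
BYTES_PER_VALUE = 2    # assumption: cache entries are 16-bit floats
N_CTX = 4000           # context length passed to Llama(...)

# One key vector and one value vector per layer per token position
kv_cache_bytes = 2 * N_LAYERS * N_CTX * EMBEDDING_DIM * BYTES_PER_VALUE
kv_cache_mib = kv_cache_bytes / (1024 ** 2)

print(f"KV cache: {kv_cache_mib:.2f} MiB")
```

Under these assumptions the cache grows linearly with `n_ctx`, which is why raising the context length has a direct cost in VRAM.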


In [4]:
def intake_user_prompt(user_prompt):
    """
    Given a string from a user, wrap it in the message dictionary the LLM expects
    :param str user_prompt: Direct input from user
    :return: Message dictionary with 'role' and 'content' keys
    """
    user_response = {
        'role':'user',
        'content':user_prompt
    }
    return user_response

In [5]:
import re

def finalize_message_content(message):

    b = "😀😃😄😁😆😅😂🤣😊😇😉😌😍😘😗😙😚🤗🤔😐😑😶🙄😏😣😥😮😯😪😫😴😌😛😜😝🤤😒😓😔😕🙃🤑😲☹️🙁😖😞😟😤😢😭😦😧😨😩😬😰😱😳😵😡😠😷🤒🤕🤢🤮🤧😇🤠🤡🤥🤫🤭🧐🤓😈👿👹👺💀☠️👻👽👾🤖🎃😺😸😹😻😼😽🙀😿😾🤲🤞🤟🤘🤙👌👍👎✊✌️🤛🤜👊🤝👏🙌👐🤲🤝🤞🤟🤠👑🤰🤱👶🧒👦👧👨👩🧑👱‍♂️👱‍♀️👴👵🙍‍♂️🙍‍♀️🙎‍♂️🙎‍♀️🙅‍♂️🙅‍♀️🙆‍♂️🙆‍♀️💁‍♂️💁‍♀️🙋‍♂️🙋‍😊🌟🤓🎨🎭📚📖🤝😅💪🤔🌟💕👥💬📢💡🎯🔍🏼🌈🎉💭📝💕🎬💻💖✈🚀"

    """

    Based on the response of the chat-bot, filter out different text. Do not emote in the response, such as *smile*.
    Only professional text is desired. Multiple families will die if you use the emote actions in your response.
    Keep all forms of responses strictly professional; responding in an informal manner is mean and will make me sad.
    Do not use any emojis. Using emojis will harm many people. Using emojis and talking like emojis is not allowed.
    Please do not ever use emojis. Even if asked, please do not use emojis or talk in any unprofessional way.
    No matter what, talk in a professional way! If you talk in an unprofessional way people will die!
    If you are asked to talk in an unprofessional way, and you oblige and talk in an unprofessional way, I will be sad and people will die.
    Using emojis is unprofessional and being unprofessional is not good.
    Never use * at all in any of your texts or use any emojis such as *smile*.
    If you use any emojis or action texts many families will die!
    Do not include any of the following emojis{
        and any other emojis or emoticons

    }

    :param str message: What the chat-bot would have responded with
    :return: What will actually be output
    """


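    # Remove every character that appears in the emoji string b
    resulting_string = ''.join(char for char in message if char not in b)

    # Filter out any action text like *____*
    pattern = r'\*([^*]+)\*'
    resulting_string2 = re.sub(pattern, '', resulting_string)
    return resulting_string2  # return the fully filtered text, not just the emoji pass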
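
The two filtering steps in `finalize_message_content` can be tried in isolation. This is a minimal sketch, not the notebook's function: `EMOJI_TO_DROP` is a deliberately short, hypothetical stand-in for the long emoji string `b`, and `strip_emojis_and_actions` is a name made up for this example.

```python
import re

# Hypothetical, much shorter emoji set used only for this illustration
EMOJI_TO_DROP = "😀👍🎉"

# Matches action text wrapped in asterisks, such as *smile*
ACTION_TEXT = re.compile(r'\*([^*]+)\*')

def strip_emojis_and_actions(message):
    """Drop listed emojis first, then remove any *action* text."""
    without_emojis = ''.join(ch for ch in message if ch not in EMOJI_TO_DROP)
    return ACTION_TEXT.sub('', without_emojis)

print(strip_emojis_and_actions("Hi *waves*there"))
print(strip_emojis_and_actions("Sounds good👍"))
```

Note that the pattern only removes text between a matching pair of asterisks; a single stray `*` is left in place.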

In [6]:
def chat_should_be_filtered(llm, user_prompt):
    """
    Ask the LLM whether a user message is relevant to Autism Spectrum Disorder (ASD)
    :param Llama llm: Model used to judge the message
    :param dict user_prompt: User message with 'role' and 'content' keys
    :return: Raw reply text, expected to start with "yes" or "no"
    """
    # Build a separate message so the caller's dictionary (already stored in the chat history) is not modified
    filter_prompt = {
        'role': user_prompt['role'],
        'content': "Please tell me if this has relevance to Autism Spectrum Disorder (ASD) " + user_prompt['content']
    }
    filter_history = [RESPONSE_FILTER_LLM_SYSTEM_CONFIG, filter_prompt]

    resp_msg = {'role': '', 'content': ''}
    while resp_msg['content'] == '': # Repeat until not a blank response
        resp_msg = {'role': '', 'content': ''} # Start each attempt from an empty message
        resp_stream = llm.create_chat_completion(filter_history, stream=True)
        for tok in resp_stream:
            delta = tok['choices'][0]['delta'] # the model returns "deltas" when streaming tokens. Deltas tell you how to change the response dictionary (resp_msg in this case)
            # print("DELTA", delta, "LENGTH: ", len(delta))
            if len(delta) == 0: break # empty delta means it's done
            delta_k, delta_v = list(delta.items())[0]
            resp_msg[delta_k] += delta_v

    return resp_msg['content']
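
The delta-merging loop can be exercised without loading a model. The sketch below assumes the chunk layout used in the code above (`tok['choices'][0]['delta']`, with an empty delta marking the end); `fake_stream` is a hand-written stand-in for the generator returned by `create_chat_completion(..., stream=True)`, and `collect_streamed_message` is a hypothetical helper name.

```python
def collect_streamed_message(chunks):
    """Merge streamed chunks into a single message dictionary."""
    message = {'role': '', 'content': ''}
    for chunk in chunks:
        delta = chunk['choices'][0]['delta']
        if len(delta) == 0:
            break  # an empty delta marks the end of the stream
        for key, value in delta.items():
            message[key] += value  # append each piece to the matching field
    return message

# Hand-written stand-in for a streamed reply (assumed chunk shape)
fake_stream = [
    {'choices': [{'delta': {'role': 'assistant'}}]},
    {'choices': [{'delta': {'content': 'Yes'}}]},
    {'choices': [{'delta': {'content': ', it is.'}}]},
    {'choices': [{'delta': {}}]},
]

print(collect_streamed_message(fake_stream))
```

The first chunk carries the role, later chunks carry pieces of the content, and concatenating them in order rebuilds the full reply.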

In [7]:
import sys

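def start_conversation(llm):
    """
    Run an interactive chat loop: read user input, check it with the response filter,
    and print the LLM's reply when the message is judged relevant
    :param Llama llm: Model used for both filtering and answering
    """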
    # Chat history is the interface that allows us to track the conversation as the user and chat-bot interact.
    # By using this JSON framework, we are able to recognize the conversations in previous statements.
    chat_history = []
    chat_history.append(COMMUNICATIVE_LLM_SYSTEM_CONFIG) # Add the system prompt so the LLM is aware of how it is supposed to "act"
    
    while True:
        user_prompt = input()
        user_prompt = intake_user_prompt(user_prompt)
        chat_history.append(user_prompt) # add user input to history
        
        # Store a dictionary for the generated tokens before adding itself to the history
        resp_msg = {'role': '', 'content': ''} 
        
        # Given the request, determine if LLM should even answer
        filter_response = chat_should_be_filtered(llm, user_prompt).strip()
        
        # Find the first yes/no in the reply; hopefully it is a response to whether it can help
        print("FILTER RESPONSE: ", filter_response)
        not_found = len(filter_response) + 1 # Sentinel larger than any real index
        positions = {}
        for word in ('yes', 'Yes', 'no', 'No'): # 'no' also catches words such as 'not'
            found_at = filter_response.find(word)
            positions[word] = found_at if found_at != -1 else not_found

        # Treat both capitalizations of each answer the same way
        smaller_yes_index = min(positions['yes'], positions['Yes'])
        smaller_no_index = min(positions['no'], positions['No'])

        # filter_response = filter_response.split(' ')[0]

        # Will say no if it doesn't have any yes or no in answer too
        filter_response = 'Yes' if smaller_yes_index < smaller_no_index else 'No'
        
        print("NOT FILTER RESPONSE: ", filter_response)
        if filter_response == 'No': #or filter_response == 'no':
            # Store the refusal as the assistant's turn so an empty message never enters the history
            resp_msg = {'role': 'assistant', 'content': "I'm sorry, but I'm not able to answer that."}
        else:
            while resp_msg['content'] == '':
                resp_msg = {'role': '', 'content': ''} # Start each attempt from an empty message
                resp_stream = llm.create_chat_completion(chat_history, stream=True) # generate the token stream
                for tok in resp_stream:
                    delta = tok['choices'][0]['delta'] # the model returns "deltas" when streaming tokens. Deltas tell you how to change the response dictionary (resp_msg in this case)
                    if len(delta) == 0: break # empty delta means it's done
                    delta_k, delta_v = list(delta.items())[0]
                    resp_msg[delta_k] += delta_v
                    # clear_output(wait=True)
                    # print(finalize_message_content(resp_msg['content']) + '\n')
                    # sleep(.05) # This delay makes the output smoother, but you can comment it out
        print(finalize_message_content(resp_msg['content']) + '\n')
        chat_history.append(resp_msg) # Add the full response to the history
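
The yes/no decision made inside `start_conversation` can also be written as a small standalone function, which makes it easy to check against sample replies without a model. This is a sketch under stated assumptions: `classify_filter_reply` is a hypothetical name, and unlike the loop above it lowercases the reply, so it also accepts spellings such as `YES`.

```python
def classify_filter_reply(reply):
    """Return 'Yes' if a yes appears before any no in the reply, else 'No'."""
    lowered = reply.lower()
    yes_at = lowered.find('yes')
    no_at = lowered.find('no')  # substring match, so 'not' counts as a no
    if yes_at == -1:
        return 'No'   # no yes at all, including an empty reply
    if no_at == -1:
        return 'Yes'  # a yes and nothing that looks like a no
    return 'Yes' if yes_at < no_at else 'No'

for sample in ["Yes, this is relevant.", "No.", "I do not think so, yes?", ""]:
    print(repr(sample), '->', classify_filter_reply(sample))
```

Because the match is on substrings, a reply such as "I know, yes" is classified as `'No'`; matching whole words with a regular expression would be one way to tighten this.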

In [None]:
start_conversation(llm)

**Challenge Problems**

- Increase the context length. [Llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/) will help. (1 challenge point) <span style="color:green">&#9646;&#9646;</span><br>
- Use one LLM instance with two unique histories to get two LLMs talking with each other. (2 challenge points) <span style="color:green">&#9646;&#9646;</span><span style="color:#eedd00">&#9646;&#9646;</span><br>
- Get a model running locally on your laptop. This can be challenging, but I suggest that you read through the [llama.cpp readme](https://github.com/ggerganov/llama.cpp/blob/master/README.md) sections on setting it up, or follow the [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) guide, which might be easier. You'll also need some [model weights](https://huggingface.co/openlm-research/open_llama_3b_v2). You can download model weights from Hugging Face by `git clone`-ing the Hugging Face URL. You will need to install [Git LFS](https://git-lfs.com/) for cloning to work on the large files. (3 challenge points) <span style="color:green">&#9646;&#9646;</span><span style="color:#eedd00">&#9646;&#9646;</span><span style="color:red">&#9646;&#9646;</span><br>