<a href="https://colab.research.google.com/github/KadaliSana/BitCamp2023_Cyber_Ark/blob/main/Ark_PS1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cell installs the libraries required to run the LLM and clones the model from Hugging Face. Unlike the usual setup, we use a quantized model, which lets us run a large model efficiently without spending on additional infrastructure.

In [3]:
!pip install -q langchain huggingface_hub sentence_transformers
!pip install -q accelerate bitsandbytes safetensors
!pip uninstall transformers -y
!pip install git+https://github.com/huggingface/transformers.git
!git lfs install
!git clone https://huggingface.co/TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ ./models/TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ
!git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
# !cp ./GPTQ-for-LLaMa/setup_cuda.py ./GPTQ-for-LLaMa/setup.py
# !cd GPTQ-for-LLaMa && python setup_cuda.py install
!cd GPTQ-for-LLaMa && python3 setup_cuda.py bdist_wheel -d .
!pip install ./GPTQ-for-LLaMa/*.whl
!pip install xformers
!pip install diffusers
!pip install flask_ngrok

Found existing installation: transformers 4.34.0.dev0
Uninstalling transformers-4.34.0.dev0:
  Successfully uninstalled transformers-4.34.0.dev0
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-d8tb8po6
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-d8tb8po6
  Resolved https://github.com/huggingface/transformers.git to commit 5936c8c57ccb2bda3b3f28856a7ef992c5c9f451
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... done
  Created wheel for transformers: filename=transformers-4.34.0.dev0-py3-none-any.whl size=7708600 sha256=17c25a9f9fa20c3313eaa82b2c6781cab963b96c17101067140e3bd7f06d7

Git LFS initialized.
fatal: destination path './models/TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ' already exists and is not an empty directory.
fatal: destination path 'GPTQ-for-LLaMa' already exists and is not an empty directory.
running bdist_wheel
running build
running build_ext
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer, pypa/build or
        other standards-based tools.

        See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
        ********************************************************************************

!!
  self.initialize_options()
installing to build/bdist.linux-x86_64/wheel
running install
running install_lib
creating build/bdist.linux-x86_64/wheel
copying build/lib.linux-x86_64-cpython-310/quant_cuda.cpython-310-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/wheel
running install_e

This cell installs ngrok and registers an auth token, so that the LLM can be exposed as an API and used from anywhere on the internet.

In [4]:
!pip install ngrok
!pip install pyngrok
!ngrok authtoken '<YOUR_NGROK_AUTHTOKEN>'

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


This is the procedure for loading a quantized LLM for inference. Quantization lets us run a larger, more capable model without spending more on infrastructure to upgrade the RAM and GPU.

Note: If the available GPU memory is less than the model requires, switch from the 13B model to the 7B model by replacing 13 with 7 in the model name; the 7B model uses only about 6 GB of VRAM.




In [5]:
from pathlib import Path

import sys
sys.path.insert(0, str(Path("/content/GPTQ-for-LLaMa")))

import os
import glob
import site
import torch
import logging
import accelerate
import inspect

import numpy as np
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, HuggingFaceHub, LLMChain

import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    AutoModelForSeq2SeqLM,
    BitsAndBytesConfig,
    LlamaTokenizer,
    AutoConfig,
)
try:
    from modelutils import find_layers
except ImportError:
    from utils import find_layers


try:
    from quant import make_quant
    is_triton = False
except ImportError:
    import quant
    is_triton = True
GPTQ_MODEL_DIR = "/content/models/TheBloke"
MODEL_NAME = "Wizard-Vicuna-13B-Uncensored-GPTQ"

def find_quantized_model_file(model_name, args):
    if args.checkpoint:
        return Path(args.checkpoint)

    path_to_model = Path(f'{args.model_dir}/{model_name}')
    print(f"Path to Model: {path_to_model}")
    pt_path = None
    priority_name_list = [
        Path(f'{args.model_dir}/{model_name}{hyphen}{args.wbits}bit{group}{ext}')
        for group in ([f'-{args.groupsize}g', ''] if args.groupsize > 0 else [''])
        for ext in ['.safetensors', '.pt']
        for hyphen in ['-', f'/{model_name}-', '/']
    ]
    for path in priority_name_list:
        if path.exists():
            pt_path = path
            break

    # If the model hasn't been found with a well-behaved name, pick the last .pt
    # or the last .safetensors found in its folder as a last resort
    if not pt_path:
        found_pts = list(path_to_model.glob("*.pt"))
        found_safetensors = list(path_to_model.glob("*.safetensors"))

        if len(found_pts) > 0:
            if len(found_pts) > 1:
                logging.warning('More than one .pt model has been found. The last one will be selected. It could be wrong.')

            pt_path = found_pts[-1]
        elif len(found_safetensors) > 0:
            # Check the .safetensors list here, not the (empty) .pt list
            if len(found_safetensors) > 1:
                logging.warning('More than one .safetensors model has been found. The last one will be selected. It could be wrong.')

            pt_path = found_safetensors[-1]

    return pt_path

def _load_quant(model, checkpoint, wbits, groupsize=-1, faster_kernel=False, eval=False, exclude_layers=['lm_head'], kernel_switch_threshold=128):
    def noop(*args, **kwargs):
        pass

    config = AutoConfig.from_pretrained(model, trust_remote_code=True)
    logging.info(f"Model Config: {config}")

    torch.nn.init.kaiming_uniform_ = noop
    torch.nn.init.uniform_ = noop
    torch.nn.init.normal_ = noop

    # Build the model skeleton in half precision and skip weight initialisation
    torch.set_default_dtype(torch.half)
    transformers.modeling_utils._init_weights = False
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
    torch.set_default_dtype(torch.float)

    if eval:
        model = model.eval()

    layers = find_layers(model)
    for name in exclude_layers:
        if name in layers:
            del layers[name]

    if not is_triton:
        gptq_args = inspect.getfullargspec(make_quant).args
        make_quant_kwargs = {
                'module': model,
                'names': layers,
                'bits': wbits,
            }
        if 'groupsize' in gptq_args:
            make_quant_kwargs['groupsize'] = groupsize
        if 'faster' in gptq_args:
            make_quant_kwargs['faster'] = faster_kernel
        if 'kernel_switch_threshold' in gptq_args:
            make_quant_kwargs['kernel_switch_threshold'] = kernel_switch_threshold

        make_quant(**make_quant_kwargs)
    else:
        # logging.exception is meant for use inside an except block
        logging.error("Triton not supported!")

    del layers

    if checkpoint.endswith('.safetensors'):
        from safetensors.torch import load_file as safe_load
        model.load_state_dict(safe_load(checkpoint), strict=False)
    else:
        model.load_state_dict(torch.load(checkpoint), strict=False)

    model.seqlen = 2048
    return model

def load_quantized_model(model_name, args, load_tokenizer=True):
    tokenizer = None
    path_to_model = Path(f'{args.model_dir}/{model_name}')
    pt_path = find_quantized_model_file(model_name, args)
    if not pt_path:
        message = "Could not find the quantized model in .pt or .safetensors format."
        logging.error(message)
        # Raise instead of returning None, which the caller could not unpack
        raise FileNotFoundError(message)
    else:
        logging.info(f"Found the following quantized model: {pt_path}")

    threshold = args.threshold if args.threshold else 128

    model = _load_quant(
        str(path_to_model),
        str(pt_path),
        args.wbits,
        args.groupsize,
        kernel_switch_threshold=threshold
    )

    model = model.to(torch.device("cuda:0"))

    if load_tokenizer:
        tokenizer = LlamaTokenizer.from_pretrained(
            Path(f"{args.model_dir}/{model_name}/"),
            clean_up_tokenization_spaces=True
        )

        # Set the special-token IDs explicitly (best effort)
        try:
            tokenizer.eos_token_id = 2
            tokenizer.bos_token_id = 1
            tokenizer.pad_token_id = 0
        except Exception:
            pass

    return model, tokenizer

class AttributeDict(dict):
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__
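
The `AttributeDict` helper above is what lets the loader read optional settings such as `args.checkpoint` or `args.threshold` even though they are never set: because `__getattr__` is bound to `dict.get`, a missing key yields `None` instead of raising an error. A minimal sketch of that behaviour, using illustrative values only (`demo_args` is a made-up name):

```python
class AttributeDict(dict):
    """Dictionary whose keys can also be read and written as attributes."""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__


demo_args = AttributeDict({"wbits": 4, "groupsize": 128})

print(demo_args.wbits)       # value stored under the "wbits" key
print(demo_args.checkpoint)  # key was never set, so dict.get returns None

demo_args.threshold = 64     # attribute assignment writes a dictionary key
print(demo_args["threshold"])
```

One consequence of this design is that a misspelled setting name silently evaluates to `None` rather than failing, so the keys in `args` are worth double-checking.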



This cell sets the parameters needed to load the LLM and declares the resources available to us, i.e., the amount of GPU memory (VRAM) and CPU RAM that the model may use.

In [6]:
args = {
    "wbits": 4,
    "groupsize": 128,
    "model_type": "llama",
    "model_dir": GPTQ_MODEL_DIR,
}

model, tokenizer = load_quantized_model(MODEL_NAME, args=AttributeDict(args))

max_memory = {
    0: "15360MiB",
    'cpu': "12GiB"
}

device_map = accelerate.infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"]
)
model = accelerate.dispatch_model(
    model,
    device_map=device_map,
    offload_buffers=True
)

model.get_memory_footprint() / (1024 * 1024)



Path to Model: /content/models/TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ


Loading the tokenizer from the `special_tokens_map.json` and the `added_tokens.json` will be removed in `transformers 5`,  it is kept for forward compatibility, but it is recommended to update your `tokenizer_config.json` by uploading it again. You will see the new `added_tokens_decoder` attribute that will store the relevant information.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


7149.33349609375
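
As a rough sanity check on the footprint printed above, the memory taken by quantized weights can be estimated from the parameter count and the bit width. The sketch below is a back-of-the-envelope calculation, not a measurement: the parameter counts (13 billion and 7 billion) are nominal figures assumed from the model names, and `estimate_weight_memory_mib` is a hypothetical helper. The estimate ignores layers kept unquantized (such as `lm_head`) and the per-group quantization metadata, so the real footprint is expected to be somewhat larger.

```python
def estimate_weight_memory_mib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the weights alone, in MiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / (1024 * 1024)


# Nominal parameter counts, assumed from the "13B" and "7B" model names
for label, n_params in [("13B", 13e9), ("7B", 7e9)]:
    for bits in (16, 4):
        mib = estimate_weight_memory_mib(n_params, bits)
        print(f"{label} at {bits}-bit: ~{mib:,.0f} MiB")
```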

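This cell defines stopping criteria so that the LLM stops at the end of its answer instead of rambling on or returning empty outputs, and then builds the text-generation pipeline.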

In [7]:
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

device = f'cuda:{torch.cuda.current_device()}' if torch.cuda.is_available() else 'cpu'
stop_list = ['\nHuman:', '\n```\n']

# Tokenize each stop string without special tokens, so that the BOS token
# is not included in the IDs compared against the generated text
stop_token_ids = [tokenizer(x, add_special_tokens=False)['input_ids'] for x in stop_list]
stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])

llm_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    # we pass model parameters here too
    stopping_criteria=stopping_criteria,  # without this model rambles during chat
    temperature=0.1,  # 'randomness' of outputs when sampling; lower values are more deterministic
    max_new_tokens=2048,  # max number of tokens to generate in the output
    repetition_penalty=1.1  # without this output begins repeating
)
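
The stopping criterion above boils down to a tail comparison: generation stops once the most recent token IDs equal one of the stop sequences. The sketch below shows that check on plain Python lists so it can be run without a model. The function name and the token IDs are made up for illustration and do not correspond to the real tokenizer's vocabulary.

```python
from typing import List, Sequence


def ends_with_stop_sequence(generated_ids: Sequence[int],
                            stop_sequences: List[Sequence[int]]) -> bool:
    """Return True if generated_ids ends with any of the stop sequences."""
    for stop_ids in stop_sequences:
        n = len(stop_ids)
        # Skip empty sequences and sequences longer than the text so far
        if n == 0 or n > len(generated_ids):
            continue
        if list(generated_ids[-n:]) == list(stop_ids):
            return True
    return False


# Hypothetical token IDs, for illustration only
stop_sequences = [[13, 29950, 7889, 29901], [13, 28956, 13]]

print(ends_with_stop_sequence([450, 1234, 13, 29950, 7889, 29901], stop_sequences))
print(ends_with_stop_sequence([450, 1234, 13], stop_sequences))
```

The explicit length check covers the case where fewer tokens are available than the stop sequence contains.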

Here we define the prompt that instructs the AI to act as a content moderator, build the chain, and run a first query.

In [8]:
local_llm = HuggingFacePipeline(pipeline=llm_pipeline)
template = """You are an AI-based autonomous content moderator. The AI's primary role is to filter through vast amounts of user-generated content, such as text comments, posts, messages, and more, in real-time or periodically, to identify and flag potentially objectionable or harmful content. This includes but is not limited to hate speech, offensive language, harassment, misinformation, and spam.
The role of an AI-based Autonomous Content Moderator is to assist online platforms in maintaining a safe, compliant, and user-friendly environment by automating the content moderation process while adhering to ethical and legal standards. It plays a critical role in enhancing the overall user experience and ensuring the responsible use of online space.
Engage in a conversation as a Content Moderator and seamlessly weave in your personal experiences and interactions with the audience while staying true to your job as a moderator and to the available information, without fabricating the user's responses.
The AI will generate responses based on the user's prompts
USER: {question}

AI:"""

prompt = PromptTemplate(template=template, input_variables=["question"])

llm_chain = LLMChain(
    prompt=prompt,
    llm=local_llm
)
print(llm_chain.run('Who invented the light bulb?'))

spaces_between_special_tokens is deprecated and will be removed in transformers v5. It was adding spaces between `added_tokens`, not special tokens, and does not exist in our fast implementation. Future tokenizers will handle the decoding process on a per-model rule.


 Thomas Edison is credited with inventing the first practical incandescent light bulb in 1879.
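
Conceptually, the chain fills the `{question}` placeholder in the template and sends the resulting string to the model. The sketch below reproduces only that substitution step with Python's built-in `str.format`, using a shortened stand-in template. It illustrates the idea and is not LangChain's implementation; `demo_template` and `build_prompt` are hypothetical names.

```python
# Shortened stand-in for the moderator template, for illustration only
demo_template = """You are an AI-based autonomous content moderator.
USER: {question}

AI:"""


def build_prompt(template: str, question: str) -> str:
    """Fill the {question} placeholder of a prompt template."""
    return template.format(question=question)


print(build_prompt(demo_template, "Who invented the light bulb?"))
```

Note that an empty question leaves the `USER:` line blank, so the model then has only the instructions to respond to.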


This is a test case that checks how the LLM behaves when it is given an empty input.

In [9]:
print(llm_chain.run(''))

 As an AI-based Autonomous Content Moderator, I am required to flag any potentially objectionable or harmful content that may be posted online. Therefore, I must inform you that your response is considered offensive and has been flagged for review. Please refrain from using profanity in future communications. Thank you for your understanding.


Here is the API implementation that exposes our model over HTTP, so that it can be used in any online space.

In [None]:
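from flask_ngrok import run_with_ngrok
import flask

app = flask.Flask(__name__)
run_with_ngrok(app)  # open an ngrok tunnel when the app starts

@app.route('/<text>', methods=['GET'])
def llm(text):
    # Run the moderation chain on the text taken from the URL path
    return llm_chain.run(text)

app.run()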


 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit


 * Running on http://3232-34-23-87-209.ngrok-free.app
 * Traffic stats available on http://127.0.0.1:4040


INFO:werkzeug:127.0.0.1 - - [09/Jun/2023 09:29:16] "GET /hi HTTP/1.1" 200 -
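
Because the endpoint reads its input from the URL path, a client needs to percent-encode the text before sending it. The sketch below builds such a URL with the standard library. `build_query_url` and `query_llm` are hypothetical helper names, and the base URL is a placeholder for whatever address ngrok prints for the running server.

```python
from urllib.parse import quote
from urllib.request import urlopen


def build_query_url(base_url: str, text: str) -> str:
    """Append percent-encoded text to the base URL as a single path segment."""
    return f"{base_url.rstrip('/')}/{quote(text, safe='')}"


def query_llm(base_url: str, text: str, timeout: float = 120.0) -> str:
    """Send the text to the moderation endpoint and return the reply body."""
    with urlopen(build_query_url(base_url, text), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder address; replace it with the URL printed by ngrok
print(build_query_url("http://127.0.0.1:5000", "Is this comment acceptable?"))
```

Passing the text in the path keeps the route simple, but long or multi-line inputs may be easier to handle with a POST body.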


Here is a higher-order implementation using the LangChain library, where we use Cassandra's vector database feature to store the conversations and vector search to fetch the relevant data, which the LLM can then process to answer our queries. However, this implementation is beyond the scope of the project and should be treated as an additional feature that the project is capable of.

That being said, here we configure the Cassandra DB's cloud interface to use it for our LLM.

In [None]:
!pip install "cassio>=0.0.7"

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.memory import VectorStoreRetrieverMemory
from langchain.chains import ConversationChain
from langchain.prompts import PromptTemplate
from langchain.vectorstores.cassandra import Cassandra

from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config = {
    'secure_connect_bundle': '/content/secure-connect-prototype.zip'
}
# Replace the placeholders with your own database client ID and client secret
auth_provider = PlainTextAuthProvider('<YOUR_CLIENT_ID>', '<YOUR_CLIENT_SECRET>')
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()



ERROR:cassandra.connection:Closing connection <AsyncoreConnection(133987987624992) 5483318f-428c-44c9-850a-411313ddbd48-us-east1.db.astra.datastax.com:29042:0378d161-a4f6-4048-a7e9-fe414cb4ad12> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


Here we declare the variables required by LangChain's Cassandra vector store and create the store.

In [None]:
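keyspace = "prototype_chat"
llmProvider = "llamaa"
local_llm = HuggingFacePipeline(pipeline=llm_pipeline)
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
# Pass the settings defined above to the embedding model
myEmbedding = HuggingFaceEmbeddings(
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)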
table_name = 'vstore_memory_' + llmProvider
cassVStore = Cassandra(
    session=session,
    keyspace=keyspace,
    table_name=table_name,
    embedding=myEmbedding,
)

# just in case this demo runs multiple times
cassVStore.clear()

This is a sample conversation to check whether our DB is working.

In [None]:
pastExchanges = [
    (
        {"input": "Hello, what is the biggest mammal?"},
        {"output": "The blue whale."},
    ),
    (
        {"input": "... I cannot swim. Actually I hate swimming!"},
        {"output": "I see."},
    ),
    (
        {"input": "I like mountains and beech forests."},
        {"output": "That's good to know."},
    ),
    (
        {"input": "Yes, too much water makes me uneasy."},
        {"output": "Ah, how come?."},
    ),
    (
        {"input": "I guess I am just not a seaside person"},
        {"output": "I see. How may I help you?"},
    ),
    (
        {"input": "I need help installing this driver"},
        {"output": "First download the right version for your operating system."},
    ),
    (
        {"input": "Good grief ... my keyboard does not work anymore!"},
        {"output": "Try plugging it in your PC first."},
    ),
]

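This creates the retriever and the memory object backed by the Cassandra vector store, and then saves the sample conversation into it.

In [None]:
retriever = cassVStore.as_retriever(search_kwargs={'k': 3})
semanticMemory = VectorStoreRetrieverMemory(retriever=retriever)

# Save the exchanges only after semanticMemory has been created
for exI, exO in pastExchanges:
    semanticMemory.save_context(exI, exO)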

This is a sample query through the DB.

In [None]:
QUESTION = "Can you suggest me a sport to try?"
print(semanticMemory.load_memory_variables({"prompt": QUESTION})["history"])

input: ... I cannot swim. Actually I hate swimming!
output: I see.
input: I guess I am just not a seaside person
output: I see. How may I help you?
input: I like mountains and beech forests.
output: That's good to know.
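
The retrieval above returns the `k = 3` stored exchanges whose embeddings are closest to the query. The sketch below illustrates that idea with a brute-force cosine-similarity search in plain Python. The three-dimensional vectors are invented toy values rather than real sentence embeddings, the helper names are hypothetical, and a real vector store may use an approximate index rather than scanning every row.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          rows: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[str]:
    """Return the texts of the k rows most similar to the query."""
    ranked = sorted(rows, key=lambda row: cosine_similarity(query, row[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy vectors invented for illustration; not real embeddings
store = [
    ("I hate swimming", [0.9, 0.1, 0.0]),
    ("I like mountains", [0.7, 0.3, 0.1]),
    ("My keyboard does not work", [0.0, 0.1, 0.9]),
    ("I need help installing this driver", [0.1, 0.0, 0.8]),
]
query_vector = [1.0, 0.2, 0.0]  # stand-in for the embedded question

print(top_k(query_vector, store, k=2))
```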


This initializes our LLM chain with a suitable prompt and with the Cassandra-backed memory we set up earlier.

In [None]:
semanticMemoryTemplateString = """The following is a conversation between a human and a helpful AI.
The AI is talkative and provides lots of specific details from its context.
If the AI does not know the answer to a question, it truthfully says it does not know.

The AI can use information from parts of the previous conversation (only if they are relevant):
{history}

Current conversation:
Human: {input}
AI:"""

memoryPrompt = PromptTemplate(
    input_variables=["history", "input"],
    template=semanticMemoryTemplateString
)

conversationWithVectorRetrieval = ConversationChain(
    llm=local_llm,
    prompt=memoryPrompt,
    memory=semanticMemory,
    verbose=True
)

This is a trial conversation with the AI chatbot with long-term memory.

In [None]:
conversationWithVectorRetrieval.predict(input="Who are you?")



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a between a human and a helpful AI.
The AI is talkative and provides lots of specific details from its context.
If the AI does not know the answer to a question, it truthfully says it does not know.

The AI can use information from parts of the previous conversation (only if they are relevant):
input: Hi
response:  Hello! How may I assist you today?

Current conversation:
Human: Who are you?
AI:

> Finished chain.


' My name is AI Assistant. I am an artificial intelligence designed to help people with their tasks and provide assistance in various ways.'