<a href="https://colab.research.google.com/github/TrelisResearch/jupyter-llama/blob/main/Jupyter_Llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## *About Jupyter Llama*

---

A Chat Assistant built on Llama 2.
- Upload pdf or text files for analysis.
- No data goes to OpenAI.
- No data is used for training language models.

Built by Trelis. Find us on [HuggingFace](https://huggingface.co/Trelis).

# Setup and Installation

- You can run Jupyter Llama on a free Google Colab Notebook.
- Optional: save a copy of this notebook via File -> Save a copy in Drive. You need your own copy if you want to make changes.
- Go to the menu -> Runtime -> Change Runtime Type - Select GPU (T4).
- Then go to Runtime -> Run all.
- Installation takes about 2 minutes and happens entirely in the cloud, within this notebook.
- Once all cells have run, you'll find the chat interface at the bottom.
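
Before choosing Run all, it can help to confirm that the runtime really has a GPU attached. The following is a minimal sketch, assuming the NVIDIA `nvidia-smi` tool is on the PATH whenever a GPU runtime is active (typically true on Colab GPU runtimes). `gpu_visible` is a hypothetical helper name and is not part of this notebook.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if `nvidia-smi` exists and exits successfully (assumed GPU indicator)."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed: most likely a CPU-only runtime.
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

If this prints `False` on Colab, the runtime type has probably not been switched to GPU yet.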

Trelis has no access to your data when you run this notebook. All of your data remains within your Google Drive and Google's computers.

In [1]:
# Set the runtime to cpu or gpu. Leave as gpu for Google Colab.
runtime = "gpu"  # OR "cpu"

if runtime == "cpu":
    runtimeFlag = "cpu"
elif runtime == "gpu":
    runtimeFlag = "cuda:0"
else:
    print("Invalid runtime. Please set it to either 'cpu' or 'gpu'.")
    runtimeFlag = None

cache_dir = None # by default, don't set a cache directory
print("Runtime flag is:", runtimeFlag)

Runtime flag is: cuda:0
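
The cell above prints a message and continues with `runtimeFlag = None` when the runtime name is not recognized, so the error only surfaces later. As a hedged alternative, the sketch below fails immediately instead. `resolve_device` is a hypothetical helper name introduced for illustration; the mapping itself is taken from the cell above.

```python
def resolve_device(runtime):
    """Map the notebook's runtime name ('cpu' or 'gpu') to a torch-style device string."""
    devices = {"cpu": "cpu", "gpu": "cuda:0"}
    try:
        return devices[runtime]
    except KeyError:
        # Fail fast rather than carrying a None device into later cells.
        raise ValueError(
            f"Invalid runtime {runtime!r}. Please set it to either 'cpu' or 'gpu'."
        ) from None


print("Runtime flag is:", resolve_device("gpu"))
```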


In [2]:
### Select the language model
model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
# model_name_or_path = "TheBloke/Llama-2-13b-Chat-GPTQ" #for better but slower responses

model_basename = "gptq_model-4bit-128g"

#### Google Drive Mounting (optional)
- Allows you to download the model to Google Drive for faster startup next time.

In [3]:
# import os
# from google.colab import drive
# drive.mount('/content/drive')

In [4]:
# ## Allow the model to be saved to Google Drive for faster startup next time

# # This is the path to the Google Drive folder.
# drive_path = "/content/drive"

# # This is the path where you want to store your cache.
# cache_dir_path = os.path.join(drive_path, "My Drive/huggingface_cache")

# # Check if the Google Drive folder exists. If it does, use it as the cache_dir.
# # If not, set cache_dir to None to use the default Hugging Face cache location.
# if os.path.exists(drive_path):
#     cache_dir = cache_dir_path
#     os.makedirs(cache_dir, exist_ok=True) # Ensure the directory exists
# else:
#     cache_dir = None

# print(cache_dir)
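
The commented-out cell above boils down to one decision: use a folder on Drive as the cache if Drive is mounted, otherwise fall back to the default Hugging Face cache. A minimal sketch of that decision follows, using only the standard library so it can be tried outside Colab. `pick_cache_dir` is a hypothetical helper name, and a temporary directory stands in for the mounted Drive.

```python
import os
import tempfile


def pick_cache_dir(drive_root, subdir="My Drive/huggingface_cache"):
    """Return a cache folder under drive_root, or None if drive_root is not mounted."""
    if not os.path.isdir(drive_root):
        return None  # caller falls back to the default cache location
    cache_dir = os.path.join(drive_root, subdir)
    os.makedirs(cache_dir, exist_ok=True)  # ensure the directory exists
    return cache_dir


# Demo: a temporary directory plays the role of /content/drive.
with tempfile.TemporaryDirectory() as fake_drive:
    print(pick_cache_dir(fake_drive))
    print(pick_cache_dir(os.path.join(fake_drive, "not_mounted")))
```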

### Install

In [5]:
# Download GPTQ wheels for colab for accelerated installation.
!git lfs clone https://huggingface.co/Trelis/google-colab-gptq-wheels

          with new flags from 'git clone'

'git clone' has been updated in upstream Git to have comparable
speeds to 'git lfs clone'.
Cloning into 'google-colab-gptq-wheels'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 7 (delta 1), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), 929 bytes | 464.00 KiB/s, done.


In [6]:
!pip install ./google-colab-gptq-wheels/auto_gptq-0.3.2+cu118-cp310-cp310-linux_x86_64.whl

Processing ./google-colab-gptq-wheels/auto_gptq-0.3.2+cu118-cp310-cp310-linux_x86_64.whl
Collecting accelerate>=0.19.0 (from auto-gptq==0.3.2+cu118)
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 244.2/244.2 kB 2.9 MB/s eta 0:00:00
Collecting datasets (from auto-gptq==0.3.2+cu118)
  Downloading datasets-2.14.3-py3-none-any.whl (519 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 519.1/519.1 kB 8.7 MB/s eta 0:00:00
Collecting rouge (from auto-gptq==0.3.2+cu118)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting safetensors (from auto-gptq==0.3.2+cu118)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 11.3 MB/s eta 0:00:00
Collecting transformers>=4.31.0 (from auto-gptq==0.3.2+cu118)
 

In [None]:
!pip install -q -U pdfminer.six # could maybe add pre-built wheels to speed this up.

In [7]:
# https://stackoverflow.com/questions/56081324/why-are-google-colab-shell-commands-not-working
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

In [8]:
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/accelerate.git
# !pip install -q -U einops
# !pip install -q -U safetensors
# !pip install -q -U torch
# !pip install -q -U xformers
# !pip install -q -U auto-gptq
# !pip install -q -U pdfminer.six

### Import

In [9]:
import transformers
import torch
import json
import os
from transformers import AutoTokenizer, TextStreamer

## Load Model
If Google Drive is mounted, the model loads from the Drive cache. On the first run with Drive it is downloaded and saved there instead.
- Takes about 2 mins the first time.
- Takes about 1 min from the second time onwards with Google Drive.

In [10]:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

os.environ["SAFETENSORS_FAST_GPU"] = "1"

use_triton = False

# 1.0 keeps the model's native context length. A factor of 2.0 allows a max sequence length
# of 8192 tokens (~6k words), but requires Colab Pro and a V100 or A100 to have sufficient RAM.
extrapolation_factor = 1.0

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device=runtimeFlag,  # same device the input tensors are moved to in generate_response
        use_triton=use_triton,
        quantize_config=None,
        rope_scaling={"type": "dynamic", "factor": extrapolation_factor},  # dynamic RoPE scaling; a factor of 2.0 targets a max sequence length of 8192 tokens
        cache_dir=cache_dir)

# """
# To download from a specific branch, use the revision parameter, as in this example:

# model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
#         revision="gptq-4bit-32g-actorder_True",
#         model_basename=model_basename,
#         use_safetensors=True,
#         trust_remote_code=True,
#         device="cuda:0",
#         quantize_config=None)
# """

Downloading (…)lve/main/config.json:   0%|          | 0.00/572 [00:00<?, ?B/s]

Downloading (…)quantize_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Downloading (…)bit-128g.safetensors:   0%|          | 0.00/3.90G [00:00<?, ?B/s]



In [11]:
print(model.config.max_length*extrapolation_factor)

4096.0


In [26]:
print(model.config)

LlamaConfig {
  "_name_or_path": "TheBloke/Llama-2-7b-Chat-GPTQ",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_length": 4096,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0",
  "use_cache": true,
  "vocab_size": 32000
}



## Set up the Tokenizer

In [27]:
from transformers import AutoTokenizer

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir, use_fast=True) # will use the Rust fast tokenizer if available

In [28]:
print("BOS token:", tokenizer.bos_token)
print("EOS token:", tokenizer.eos_token)

BOS token: <s>
EOS token: </s>


In [29]:
from IPython.display import display, HTML, clear_output, Markdown
import textwrap, json
import ipywidgets as widgets
import re, time
from google.colab import files
from pdfminer.high_level import extract_text
import io

In [30]:
DEFAULT_SYSTEM_PROMPT = 'You are a helpful assistant that provides accurate and concise responses. Respond in markdown.'
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

# max_doc_length = 50
max_doc_length = int(0.75 * model.config.max_length*extrapolation_factor)  # max doc length is 75% of the context length
max_doc_words = int(0.75*max_doc_length)
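
To make the budget arithmetic concrete, here is a small worked sketch. It uses the same fractions as the cell above, plus the 85% prompt limit and 10% generation limit applied inside `generate_response`. `token_budgets` is a hypothetical helper name used only for illustration.

```python
def token_budgets(context_length, extrapolation_factor=1.0):
    """Reproduce the notebook's token and word limits for a given context length."""
    effective = context_length * extrapolation_factor
    max_doc_length = int(0.75 * effective)       # tokens kept from an uploaded document
    max_doc_words = int(0.75 * max_doc_length)   # rough word estimate shown in the title
    max_prompt_len = int(0.85 * effective)       # longest prompt accepted for generation
    max_gen_len = int(0.10 * max_prompt_len)     # cap on newly generated tokens
    return {
        "max_doc_length": max_doc_length,
        "max_doc_words": max_doc_words,
        "max_prompt_len": max_prompt_len,
        "max_gen_len": max_gen_len,
    }


print(token_budgets(4096))
# {'max_doc_length': 3072, 'max_doc_words': 2304, 'max_prompt_len': 3481, 'max_gen_len': 348}
print(token_budgets(4096, extrapolation_factor=2.0))
# {'max_doc_length': 6144, 'max_doc_words': 4608, 'max_prompt_len': 6963, 'max_gen_len': 696}
```

With the default factor of 1.0, an uploaded document is therefore cut to 3072 tokens and each reply is limited to 348 new tokens.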

In [31]:
def generate_response(dialogs, temperature=0.3, top_p=0.9, logprobs=False):
    torch.cuda.empty_cache()
    max_prompt_len = int(0.85 * model.config.max_length*extrapolation_factor)
    max_gen_len = int(0.10 * max_prompt_len)

    prompt_tokens = []
    for dialog in dialogs:
        if dialog[0]["role"] != "system":
            dialog = [
                {
                    "role": "system",
                    "content": DEFAULT_SYSTEM_PROMPT,
                }
            ] + dialog
        dialog_tokens = [tokenizer(
            f"{B_INST} {B_SYS}{(dialog[0]['content']).strip()}{E_SYS}{(dialog[1]['content']).strip()} {E_INST}",
            return_tensors="pt",
            add_special_tokens=True
        ).input_ids.to(runtimeFlag)]
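        # Remaining turns come in (assistant, user) pairs. Stopping at len(dialog) - 1
        # avoids an IndexError if the history happens to end with an assistant message.
        for i in range(2, len(dialog) - 1, 2):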
            user_tokens = tokenizer(
                f"{B_INST} {(dialog[i+1]['content']).strip()} {E_INST}",
                return_tensors="pt",
                add_special_tokens=True
            ).input_ids.to(runtimeFlag)
            assistant_w_eos = dialog[i]['content'].strip() + tokenizer.eos_token
            assistant_tokens = tokenizer(
                            assistant_w_eos,
                            return_tensors="pt",
                            add_special_tokens=False
                        ).input_ids.to(runtimeFlag)
            tokens = torch.cat([assistant_tokens, user_tokens], dim=-1)
            dialog_tokens.append(tokens)
        prompt_tokens.append(torch.cat(dialog_tokens, dim=-1))

    input_ids = prompt_tokens[0]
    if len(input_ids[0]) > max_prompt_len:
        return "\n\n **The language model's input limit has been reached. Clear the chat and start afresh!**"

    generation_output = model.generate(
        input_ids=input_ids,
        do_sample=True,
        max_new_tokens=max_gen_len,
        temperature=temperature,
        top_p=top_p,
    )

    new_tokens = generation_output[0][input_ids.shape[-1]:]
    # print(tokenizer.decode(input_ids[0], skip_special_tokens=False))
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
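
The prompt layout that `generate_response` assembles can be easier to follow as a flat string. The sketch below is an approximation for illustration only: the notebook tokenizes each piece separately and lets the tokenizer add the BOS token, whereas here the `<s>` and `</s>` markers (taken from the tokenizer output printed earlier) are written out literally. `build_prompt` is a hypothetical helper name.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
BOS, EOS = "<s>", "</s>"  # literal stand-ins for the tokenizer's special tokens


def build_prompt(dialog):
    """Flatten [system, user, assistant, user, ...] into one Llama 2 chat prompt string."""
    system = dialog[0]["content"].strip()
    first_user = dialog[1]["content"].strip()
    # The system prompt is wrapped in <<SYS>> tags inside the first [INST] block.
    parts = [f"{BOS}{B_INST} {B_SYS}{system}{E_SYS}{first_user} {E_INST}"]
    # Later turns come in (assistant, user) pairs.
    for i in range(2, len(dialog) - 1, 2):
        assistant = dialog[i]["content"].strip()
        user = dialog[i + 1]["content"].strip()
        parts.append(f"{assistant}{EOS}{BOS}{B_INST} {user} {E_INST}")
    return "".join(parts)


example = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Bye"},
]
print(build_prompt(example))
```

Each completed assistant reply is closed with the EOS marker, and every new user turn opens with BOS and its own `[INST]` block.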


In [32]:
def print_wrapped(text):
    # Regular expression pattern to detect code blocks
    code_pattern = r'```(.+?)```'
    matches = list(re.finditer(code_pattern, text, re.DOTALL))

    if not matches:
        # If there are no code blocks, display the entire text as Markdown
        display(Markdown(text))
        return

    start = 0
    for match in matches:
        # Display the text before the code block as Markdown
        before_code = text[start:match.start()].strip()
        if before_code:
            display(Markdown(before_code))

        # Display the code block
        code = match.group(0).strip()  # Extract code block
        display(Markdown(code))  # Display code block

        start = match.end()

    # Display the text after the last code block as Markdown
    after_code = text[start:].strip()  # Text after the last code block
    if after_code:
        display(Markdown(after_code))


def grab_and_shorten_text(max_doc_length):

    uploaded = files.upload()

    file_name = list(uploaded.keys())[0]

    # Check the file extension (case-insensitive, so '.PDF' and '.TXT' also work)
    lower_name = file_name.lower()
    if lower_name.endswith('.txt'):
        text = uploaded[file_name].decode()
    elif lower_name.endswith('.pdf'):
        pdf_bytes = io.BytesIO(uploaded[file_name])
        text = extract_text(pdf_bytes)
    else:
        raise ValueError('Unsupported file type. Please upload a .txt or .pdf file.')

    with alert_out:
        clear_output()  # Clear the previous alert
        print("Shortening the text...")

    tokens = tokenizer.encode(text, truncation=True, max_length=max_doc_length, return_tensors='pt')

    shortened_text = tokenizer.decode(tokens[0], skip_special_tokens=True)

    return file_name, shortened_text

dialog_history = [{"role": "system", "content": DEFAULT_SYSTEM_PROMPT}]

button = widgets.Button(description="Send")
upload_button = widgets.Button(description="Upload .txt or .pdf")
text = widgets.Textarea(layout=widgets.Layout(width='800px'))

output_log = widgets.Output()

def on_button_clicked(b):
    user_input = text.value
    dialog_history.append({"role": "user", "content": user_input})

    text.value = ''

    # Change button description and color, and disable it
    button.description = 'Processing...'
    button.style.button_color = '#ff6e00'  # Use hex color codes for better color choices
    button.disabled = True  # Disable the button when processing

    with output_log:
        clear_output()
        for message in dialog_history:
            print_wrapped(f'**{message["role"].capitalize()}**: {message["content"]}\n')

    try:
        assistant_response = generate_response([dialog_history])
    finally:
        # Re-enable the button and reset its description and color,
        # even if generation raised an error
        button.description = 'Send'
        button.style.button_color = 'lightgray'
        button.disabled = False

    dialog_history.append({"role": "assistant", "content": assistant_response})

    with output_log:
        clear_output()
        for message in dialog_history:
            print_wrapped(f'**{message["role"].capitalize()}**: {message["content"]}\n')

button.on_click(on_button_clicked)

# Create an output widget for alerts
alert_out = widgets.Output()

# Define the 'Upload' button click event handler
def on_upload_button_clicked(b):

    file_name, uploaded_text = grab_and_shorten_text(max_doc_length)

    with alert_out:
        clear_output()  # Clear the previous alert
        print(f"Upload successful: {file_name}, processing the file...")

    user_input = f"Uploaded document [{file_name}]: {uploaded_text}"
    dialog_history.append({"role": "user", "content": user_input})

    time.sleep(0.1)  # slight delay to ensure order

    assistant_input = f"You have uploaded text from {file_name}"
    dialog_history.append({"role": "assistant", "content": assistant_input})

    with output_log:
        clear_output()
        for message in dialog_history:
            print_wrapped(f'**{message["role"].capitalize()}**: {message["content"]}\n')

    with alert_out:
        clear_output()  # Clear the previous alert
        # print(f"File processing completed.")

upload_button.on_click(on_upload_button_clicked)

clear_button = widgets.Button(description="Clear Chat")

def on_clear_button_clicked(b):
    # Clear the dialog history
    dialog_history.clear()
    # Add back the initial system prompt
    dialog_history.append({"role": "system", "content": DEFAULT_SYSTEM_PROMPT})
    # Clear the output log
    with output_log:
        clear_output()

clear_button.on_click(on_clear_button_clicked)

def save_chat(b):
    # Serialize the chat history into a JSON string
    chat_json = json.dumps(dialog_history)

    # Write the chat history to a temporary file
    with open('chat_history.json', 'w') as f:
        f.write(chat_json)

    # Download the file
    files.download('chat_history.json')

save_button = widgets.Button(description="Save Chat")
save_button.on_click(save_chat)
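
The `print_wrapped` helper in the cell above finds fenced code blocks with a non-greedy regular expression and renders the text around them separately. The sketch below isolates that splitting step so it can be tested without IPython. `split_segments` is a hypothetical helper name, and the fence marker is built with string multiplication purely to keep the example easy to embed.

```python
import re

FENCE = "`" * 3  # three backticks
# Non-greedy match, with DOTALL so a code block may span several lines.
CODE_BLOCK = re.compile(re.escape(FENCE) + r".+?" + re.escape(FENCE), re.DOTALL)


def split_segments(text):
    """Return a list of (kind, chunk) pairs, where kind is 'text' or 'code'."""
    segments = []
    start = 0
    for match in CODE_BLOCK.finditer(text):
        before = text[start:match.start()].strip()
        if before:
            segments.append(("text", before))
        segments.append(("code", match.group(0)))
        start = match.end()
    after = text[start:].strip()
    if after:
        segments.append(("text", after))
    return segments


sample = f"Here is code:\n{FENCE}python\nprint('hi')\n{FENCE}\nDone."
for kind, chunk in split_segments(sample):
    print(kind, repr(chunk))
```

An unclosed fence is not matched by this pattern, so a reply that is cut off mid-block is treated as plain text.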

In [33]:
# Define the function to upload chat
def upload_chat(b):
    # Upload the file
    uploaded = files.upload()

    # Get the file name
    file_name = list(uploaded.keys())[0]

    # Ensure the file is a .json file
    if not file_name.endswith('.json'):
        print('Error: Incorrect file type. Please upload a .json file.')
        return

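    # Load the content of the file
    chat_data = uploaded[file_name].decode()

    # Parse the JSON data and update the history in place, so every handler
    # keeps working with the same list object
    try:
        loaded_history = json.loads(chat_data)
    except json.JSONDecodeError:
        print('Error: File is not in the correct format. Please upload a properly formatted .json file.')
        return
    if not isinstance(loaded_history, list):
        print('Error: Expected a JSON list of chat messages.')
        return
    dialog_history[:] = loaded_history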

    with output_log:
        clear_output()
        for message in dialog_history:
            print_wrapped(f'**{message["role"].capitalize()}**: {message["content"]}\n')

# Create the upload button and set the on_click event handler
upload_chat_button = widgets.Button(description="Upload Chat")
upload_chat_button.on_click(upload_chat)
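
A saved chat is a JSON list of `{"role", "content"}` messages, so an uploaded file can be checked against that shape before it replaces the history. The following is a minimal sketch of such a check. `parse_chat_history` and `VALID_ROLES` are hypothetical names, and the set of accepted roles is an assumption based on the roles this notebook uses.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}  # roles used by this notebook (assumed complete)


def parse_chat_history(raw):
    """Parse a JSON chat export and check that every message has a role and text content."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, list):
        raise ValueError("Chat history must be a JSON list of messages.")
    for index, message in enumerate(data):
        if not isinstance(message, dict):
            raise ValueError(f"Message {index} is not an object.")
        if message.get("role") not in VALID_ROLES:
            raise ValueError(f"Message {index} has an unknown role.")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"Message {index} has no text content.")
    return data


saved = json.dumps([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
])
print(parse_chat_history(saved))
```

Because `json.JSONDecodeError` subclasses `ValueError`, a caller can catch `ValueError` alone to cover both malformed JSON and a wrong message shape.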


In [52]:
from IPython.display import display, HTML
from ipywidgets import HBox, VBox

# Create the title with HTML
title = f"<h1 style='color: #ff6e00;'>Jupyter Llama 🦙</h1> <p>(uploaded files will be shortened to {max_doc_words} words)</p>"

# output_log, alert_out, and text are the widgets created in the cells above
first_row = HBox([button, clear_button, upload_button])  # Arrange these buttons horizontally
second_row = HBox([save_button, upload_chat_button])  # Arrange these buttons horizontally

# Arrange the two rows of buttons and other display elements vertically
layout = VBox([output_log, alert_out, text, first_row, second_row])

# Chat

In [53]:
display(HTML(title))  # Use HTML function to display the title
display(layout)

VBox(children=(Output(), Output(), Textarea(value='', layout=Layout(width='800px')), HBox(children=(Button(des…