# **Create LLM**

The following code is adapted from https://blog.gopenai.com/bye-bye-llama-2-mistral-7b-is-taking-over-get-started-with-mistral-7b-instruct-1504ff5f373c


## Step 1.  Import Libraries

In [5]:
# Import libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import tensorflow
import torch
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

device = 'cuda' if torch.cuda.is_available() else 'cpu'

2024-01-24 21:59:05.921879: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-24 21:59:05.944558: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-24 21:59:05.944583: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-24 21:59:05.945254: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-24 21:59:05.949676: I tensorflow/core/platform/cpu_feature_guar

## Step 2.  Download the Mistral 7B Instruct Model and Tokenizer

In [6]:
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", load_in_4bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

config.json: 100%|██████████| 596/596 [00:00<00:00, 3.46MB/s]
model.safetensors.index.json: 100%|██████████| 25.1k/25.1k [00:00<00:00, 40.4MB/s]
model-00001-of-00003.safetensors: 100%|██████████| 4.94G/4.94G [00:56<00:00, 87.3MB/s]
model-00002-of-00003.safetensors: 100%|██████████| 5.00G/5.00G [00:57<00:00, 86.9MB/s]
model-00003-of-00003.safetensors: 100%|██████████| 4.54G/4.54G [00:52<00:00, 87.1MB/s]
Downloading shards: 100%|██████████| 3/3 [02:46<00:00, 55.57s/it]
Loading checkpoint shards: 100%|██████████| 3/3 [00:12<00:00,  4.03s/it]
generation_config.json: 100%|██████████| 111/111 [00:00<00:00, 221kB/s]
tokenizer_config.json: 100%|██████████| 1.46k/1.46k [00:00<00:00, 5.21MB/s]
tokenizer.model: 100%|██████████| 493k/493k [00:00<00:00, 537MB/s]
tokenizer.json: 100%|██████████| 1.80M/1.80M [00:00<00:00, 36.9MB/s]
special_tokens_map.json: 100%|██████████| 72.0/72.0 [00:00<00:00, 255kB/s]


3.  Test Mistral

In [12]:
# Test prompt 1
prompt = """### Instruction: Act as a historian

Explain in three phrases the downfall of the Roman Empire.
 """

encoded_instruction = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
model_inputs = encoded_instruction.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

<s> ### Instruction: Act as a historian

Explain in three phrases the downfall of the Roman Empire.
 1. Military and Economic Overextension: As the Roman Empire expanded its territories through conquest and absorbed diverse cultures, it stretched its military resources thin and incurred immense economic costs maintaining its vast territories.
2. Political Instability and Corruption: Internal power struggles and constant infighting amongst the Roman elite, as well as corruption within the ruling classes, weakened the empire's political stability and diverted resources away from essential public services and infrastructure.
3. Barbarian Invasions and Succession Crises: The Roman Empire's borders were constantly threatened by barbarian invasions, and the empire's weakness during succession crises left it vulnerable to conquest and fragmentation.</s>


II.  Integrate LangChain with Mistral

Step 1.  Create Text Generation Pipeline

In [8]:
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,

)

Step 2.  Create an instance of Minstral

In [9]:
mistral_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

**III. Build a RAG Pipeline**
https://blog.gopenai.com/rag-pipeline-with-mistral-7b-instruct-model-a-step-by-step-guide-138df378a0c2

Step 1.  Create an LLM chain (see above)

In [10]:
from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

Test to see what model knows about topic beforehand

In [14]:
# Test prompt 1
prompt = """### Instruction: Explain what the Mamba model is in machine learning.
 """

encoded_instruction = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
model_inputs = encoded_instruction.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

# Model completely hallucinates here

<s> ### Instruction: Explain what the Mamba model is in machine learning.
  The Mamba (Massively Multiclass Accuracy Based Trainer) model is a powerful and efficient machine learning algorithm that is used primarily for multiclass classification tasks. It is an extension of the popular gradient boosting algorithm (Gradient Boosting Decision Trees or GBDT), and it was developed to address the limitations of GBDT in handling large datasets and high dimensionality.

  Mamba works by training multiple weak learners in parallel, known as base learners, on different subsets of the data. Each base learner is trained to predict the probability of an instance belonging to each of the classes. During training, Mamba uses a subsampling strategy toselect a random subset of instances and features for each tree. This helps to reduce correlation between trees and improve the overall accuracy of the model.

  One of the key advantages of the Mamba model is its ability to handle massive datasets and la

Now feed it with information

In [37]:
filepath = "sample_text_for_RAG.txt"

with open(filepath, encoding='utf-8') as file:
    data = file.read()

In [38]:
print(data[:1000])

The following is an article about an ML model, you should base all your future answers on it:

Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally
based on the Transformer architecture and its core attention module. Many subquadratic-time architectures
such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs)
have been developed to address Transformers’ computational inefficiency on long sequences, but they have not
performed as well as attention on important modalities such as language. We identify that a key weakness of
such models is their inability to perform content-based reasoning, and make several improvements. First, simply
letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing
the model to selectively propagate or forget information along the se

In [39]:
# Create Document object from text documents
docs = [Document(page_content=post) for post in [data]]

# Split documents into chunks

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=10, separators=['\n\n', '\n', '.']
)

document_chunks = text_splitter.split_documents(docs)

Step 5:  Download an Embedding Model

In [13]:
# Downloading embedding model
embedding_model = SentenceTransformerEmbeddings(model_name='BAAI/bge-large-zh-v1.5')

.gitattributes: 100%|██████████| 1.52k/1.52k [00:00<00:00, 6.57MB/s]
1_Pooling/config.json: 100%|██████████| 191/191 [00:00<00:00, 1.49MB/s]
README.md: 100%|██████████| 27.7k/27.7k [00:00<00:00, 48.9MB/s]
config.json: 100%|██████████| 1.00k/1.00k [00:00<00:00, 6.89MB/s]
config_sentence_transformers.json: 100%|██████████| 124/124 [00:00<00:00, 894kB/s]
model.safetensors: 100%|██████████| 1.30G/1.30G [00:33<00:00, 38.8MB/s]
pytorch_model.bin: 100%|██████████| 1.30G/1.30G [00:30<00:00, 42.1MB/s]
sentence_bert_config.json: 100%|██████████| 52.0/52.0 [00:00<00:00, 265kB/s]
special_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 624kB/s]
tokenizer.json: 100%|██████████| 439k/439k [00:00<00:00, 28.0MB/s]
tokenizer_config.json: 100%|██████████| 394/394 [00:00<00:00, 480kB/s]
vocab.txt: 100%|██████████| 110k/110k [00:00<00:00, 10.8MB/s]
modules.json: 100%|██████████| 349/349 [00:00<00:00, 1.08MB/s]


### Step 6. Initiate a Vector Store Instance

In [40]:
# Initiate a chromadb instance
chroma_db = Chroma.from_documents(document_chunks, embedding_model)
retriever = chroma_db.as_retriever()

### Step 7.  Create Your Question Answering (QA) Chain

In [41]:
# Prompt template
qa_template = """<s>[INST] You are a ML engineer.
Use the following context to Answer the question below briefly:

{context}

{question} [/INST] </s>
"""


# Create a prompt instance
QA_PROMPT = PromptTemplate.from_template(qa_template)

# Custom QA Chain
qa_chain = RetrievalQA.from_chain_type(
    llm = mistral_llm,
    retriever=retriever,
    chain_type_kwargs={"prompt": QA_PROMPT}
)

### Step 8.  Query Mistral 7B Instruct Model

In [43]:
# Your Question
question = "Based on the article given, explain what the Mamba model is in machine learning."

# Query Mistral 7B Instruct model
response = qa_chain({"query": question})

# Print your result
print(response['result'])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


The Mamba model mentioned in the context is not a widely known or standard machine learning model. Instead, it appears to be a specific implementation of gradient boosting machines developed by the authors of the article for their research purposes. The name "Mamba" is likely an acronym for their methodology, which stands for "Maximum Absolute Mean Difference Boosting Algorithm."

In essence, Mamba is a gradient boosting algorithm that aims to improve the performance of traditional gradient boosting methods by addressing some of their limitations. Specifically, it focuses on handling noisy data and reducing overfitting by maximizing the absolute difference between the residuals of successive trees instead of minimizing the squared error. This approach helps to increase the robustness of the model and improve its generalization ability.

Therefore, the Mamba model is a custom-built gradient boosting algorithm designed to address certain challenges in machine learning applications, parti