### Large Language Models: Going back from OSS SoTA to the roots and back

#### Agenda
- Discussion of Llama-3-70B-Instruct
  - Instruction following and prompt format
  - Temperature
  - Hallucinations

- From the roots: BERT (2018)
  - Inference Process - steps involved when you call a model
  - Looking inside the model - what are parameters and what do they really look like
  - Tokenization
  - Extracting feature vectors / embeddings

- Next step: GPT-2 (2019)
  - What the text generation process actually looks like
  - Uncontrollable text generation
  - Wrapping it into a Hugging Face pipeline

- Back to the future: MPT-7B and MPT-7B-Instruct (2023)
  - Controllable generation: The difference created by instruction following
  - Aside on RLHF / DPO for model alignment and detoxification

- RAG: Why is it even necessary?
  - Context is everything - AG
  - Retrieval for finding relevant context leads to RAG


In [0]:
import pandas as pd
import mlflow.pyfunc
import requests 
import json
import torch
import numpy as np
from scipy.spatial.distance import cosine

### Evolution of LLMs

link: https://youtu.be/dMH0bHeiRNg?si=T_enhbnix15h6iN8

![Evolution of dance](https://i.makeagif.com/media/12-03-2015/EpkTKe.gif "Evolution of dance")

#### Current OSS state of the art: Llama-3-70B-Instruct, released April 2024

In [0]:
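# Note: [INST] / <<SYS>> is the Llama 2 chat template, kept here to illustrate what a prompt format looks like.
# The chat endpoint called below takes role-tagged messages and applies the model's own template.
formatted_text =   """<s>[INST] <<SYS>>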
  {system_prompt}
  <</SYS>>

  {user_message} [/INST]
  """

system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature."

user_message = "Write an email informing a customer about Enzyme in DLT. Be Concise"
#user_message = "Write me a verse about Databricks in the style of Outkast"


In [0]:
formatted_text = formatted_text.format(system_prompt= system_prompt, user_message = user_message)
print(formatted_text)

<s>[INST] <<SYS>>
  You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
  <</SYS>>

  Write an email informing a customer about Enzyme in DLT. Be Concise [/INST]
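
The template printed above follows the Llama 2 chat convention (`[INST]`, `<<SYS>>`). Llama 3 models use a different, header-based layout. The sketch below renders a list of role-tagged messages into that layout, assuming the special tokens from Meta's published Llama 3 prompt format; `format_llama3_prompt` and `example_messages` are hypothetical names introduced only for illustration. In practice a chat endpoint or the tokenizer's own chat template does this step for you.

```python
def format_llama3_prompt(messages):
    """Render chat messages into a Llama 3 style prompt string (illustrative sketch)."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn starts with a role header and ends with an end-of-turn token
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    # An open assistant header asks the model to write the next turn
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


# Hypothetical messages, mirroring the system/user split used in this notebook
example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write an email informing a customer about Enzyme in DLT. Be Concise"},
]
print(format_llama3_prompt(example_messages))
```

The point of the format is the same in both generations of the model: it marks where the system instructions end, where the user turn ends, and where the model is expected to start answering.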
  


In [0]:
import requests

databricks_token = "<your personal-access-token>"
url = "https://<workspace-id>.databricks.com/serving-endpoints/databricks-meta-llama-3-70b-instruct/invocations"

payload = {
    "messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message}
    ],
    "temperature": 0,
    "top_p": 0.95,
    "max_tokens": 500
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {databricks_token}"
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()  # surface HTTP errors instead of failing later with a KeyError

print(response.json()['choices'][0]['message']['content'])

Here is a concise email informing a customer about Enzyme in DLT:

Subject: Introduction to Enzyme in DLT

Dear [Customer's Name],

I hope this email finds you well. I wanted to take a moment to introduce you to Enzyme, a key component in our Distributed Ledger Technology (DLT) solution.

Enzyme is a powerful encryption algorithm that ensures the secure and efficient processing of transactions within our DLT network. It enables fast and reliable data validation, while maintaining the highest levels of security and integrity.

In simple terms, Enzyme helps to:

* Protect sensitive data from unauthorized access
* Validate transactions quickly and accurately
* Ensure the integrity of our DLT network

If you have any questions or would like to learn more about Enzyme or our DLT solution, please don't hesitate to reach out. We're always here to help.

Best regards,

[Your Name]


In [0]:
def generate(endpoint: str, token: str, system: str, user: str, temperature: float = 0.0) -> str:
  """Send a system and a user message to a chat-style serving endpoint and return the reply text."""
  payload = {
      "messages": [
          {"role": "system", "content": system},
          {"role": "user", "content": user}
      ],
      "temperature": temperature,  # 0 keeps the output (nearly) deterministic
      "top_p": 0.95,
      "max_tokens": 500
  }

  headers = {
      "Content-Type": "application/json",
      "Authorization": f"Bearer {token}"
  }

  response = requests.post(endpoint, headers=headers, json=payload)
  response.raise_for_status()  # surface HTTP errors instead of failing later with a KeyError

  return response.json()['choices'][0]['message']['content']


In [0]:
output = generate(url, databricks_token, system_prompt, user_message)
print(output)

Here is a concise email informing a customer about Enzyme in DLT:

Subject: Introduction to Enzyme in DLT

Dear [Customer's Name],

I hope this email finds you well. I wanted to take a moment to introduce you to Enzyme, a key component in our Distributed Ledger Technology (DLT) system.

Enzyme is a powerful encryption algorithm that enables secure and efficient data processing within our DLT network. It plays a crucial role in ensuring the integrity and confidentiality of transactions, making it an essential element of our system.

In simple terms, Enzyme helps to:

* Protect sensitive data from unauthorized access
* Ensure fast and reliable transaction processing
* Maintain the integrity of our DLT network

If you have any questions or would like to learn more about Enzyme or our DLT system, please don't hesitate to reach out. We're always here to help.

Best regards,

[Your Name]


In [0]:
#Change temperature from 0 to 0.5 to 0.75 and try again
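
Why temperature changes the output: before a token is sampled, the model's raw next-token scores (logits) are divided by the temperature and passed through a softmax. The sketch below shows that effect on three made-up scores; `softmax_with_temperature` and `toy_logits` are hypothetical names, and a temperature of exactly 0 is treated by most serving stacks as "always pick the top token" rather than as a division by zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax instead")
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.1]
for t in (0.25, 0.5, 0.75, 1.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, probability mass concentrates on the highest-scoring token, so repeated calls look more alike; as it rises, lower-ranked tokens get sampled more often and the emails start to diverge.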

### Starting from the bottom: smaller Language Models   

#### BERT - released October 2018


![BERT Training](https://jalammar.github.io/images/bert-transfer-learning.png "BERT training")


link: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

link: https://jalammar.github.io/illustrated-bert/

![BERT Architecture](https://jalammar.github.io/images/bert-output-vector.png "BERT architecture")

![LLM inference process](https://www.oreilly.com/api/v2/epubs/9781098150952/files/assets/tokens_token_embeddings_963889_03.png "LLM Inference Process")

#### Looking inside the model

In [0]:
from transformers import BertTokenizer, BertModel
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained("bert-base-uncased")

In [0]:
bert_model

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
  

In [0]:
bert_model.num_parameters()

109482240

In [0]:
# So roughly 109.5 million parameters.
# Print the names, shapes, and values of the first five parameter tensors
for i, (name, param) in enumerate(bert_model.named_parameters()):
    if i >= 5:
        break
    print(f"Parameter name: {name}, Shape: {param.shape}, Param: {param}")

Parameter name: embeddings.word_embeddings.weight, Shape: torch.Size([30522, 768]), Param: Parameter containing:
tensor([[-0.0102, -0.0615, -0.0265,  ..., -0.0199, -0.0372, -0.0098],
        [-0.0117, -0.0600, -0.0323,  ..., -0.0168, -0.0401, -0.0107],
        [-0.0198, -0.0627, -0.0326,  ..., -0.0165, -0.0420, -0.0032],
        ...,
        [-0.0218, -0.0556, -0.0135,  ..., -0.0043, -0.0151, -0.0249],
        [-0.0462, -0.0565, -0.0019,  ...,  0.0157, -0.0139, -0.0095],
        [ 0.0015, -0.0821, -0.0160,  ..., -0.0081, -0.0475,  0.0753]],
       requires_grad=True)
Parameter name: embeddings.position_embeddings.weight, Shape: torch.Size([512, 768]), Param: Parameter containing:
tensor([[ 1.7505e-02, -2.5631e-02, -3.6642e-02,  ...,  3.3437e-05,
          6.8312e-04,  1.5441e-02],
        [ 7.7580e-03,  2.2613e-03, -1.9444e-02,  ...,  2.8910e-02,
          2.9753e-02, -5.3247e-03],
        [-1.1287e-02, -1.9644e-03, -1.1573e-02,  ...,  1.4908e-02,
          1.8741e-02, -7.3140e-03],
  

In [0]:
user_message

'Write an email informing a customer about Enzyme in DLT. Be Concise'

#### Tokenization



In [0]:
text = user_message
tokenized_input = bert_tokenizer(text, return_tensors='pt', padding=True)

In [0]:
tokenized_input

{'input_ids': tensor([[  101,  4339,  2019, 10373, 21672,  1037,  8013,  2055,  9007,  1999,
         21469,  2102,  1012,  2022,  9530, 18380,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [0]:
from transformers import AutoTokenizer  # imported here so the function works as soon as it is defined

def tokenizer_mappings(text, model_name='bert-base-uncased'):
    """
    Prints out the tokenizer mappings for a given text, including special tokens.

    Parameters:
    text (str): The text to tokenize and print mappings for.
    model_name (str): The model identifier for the tokenizer.
    """
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Let the tokenizer add its own special tokens ([CLS] and [SEP] for BERT)
    token_ids = tokenizer(text)['input_ids']

    # Convert the IDs back to their token strings
    tokens = tokenizer.convert_ids_to_tokens(token_ids)

    # Print the mapping of tokens to IDs, in order
    # (pairs are printed directly so repeated tokens are not collapsed by a dict)
    print("Token to ID mapping:")
    for token, token_id in zip(tokens, token_ids):
        print(f"{token}: {token_id}")

    # Additionally, print the mapping of IDs back to tokens
    print("\nID to Token mapping:")
    for token_id, token in zip(token_ids, tokens):
        print(f"{token_id}: {token}")





In [0]:
from transformers import AutoTokenizer
#Token to word mappings
tokenizer_mappings("Write an email informing a customer about Enzyme in DLT. Be Concise")

Token to ID mapping:
[CLS]: 101
write: 4339
an: 2019
email: 10373
informing: 21672
a: 1037
customer: 8013
about: 2055
enzyme: 9007
in: 1999
dl: 21469
##t: 2102
.: 1012
be: 2022
con: 9530
##cise: 18380
[SEP]: 102

ID to Token mapping:
101: [CLS]
4339: write
2019: an
10373: email
21672: informing
1037: a
8013: customer
2055: about
9007: enzyme
1999: in
21469: dl
2102: ##t
1012: .
2022: be
9530: con
18380: ##cise
102: [SEP]
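
Notice that `dlt` became `dl` + `##t` and `concise` became `con` + `##cise`. BERT's WordPiece tokenizer splits a word it does not know by repeatedly taking the longest prefix that is in its vocabulary, marking continuation pieces with `##`. The sketch below reproduces that greedy longest-match-first rule on a tiny hypothetical vocabulary (`toy_vocab`); the real tokenizer also lowercases, splits punctuation, and uses a vocabulary of about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into WordPiece tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate piece from the right until it is in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# Hypothetical mini-vocabulary; the real one ships with the pretrained tokenizer
toy_vocab = {"write", "enzyme", "dl", "##t", "con", "##cise"}
for w in ["write", "dlt", "concise"]:
    print(w, wordpiece_tokenize(w, toy_vocab))
```

This is why a product acronym such as DLT carries little meaning for the model: it is seen as two generic fragments rather than one known concept.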


### Performing inference with the model

In [0]:
output = bert_model(**tokenized_input)
output

BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.5539, -0.0599, -0.3867,  ..., -0.2954,  0.1448,  0.4046],
         [ 0.4885,  0.3727,  0.1662,  ...,  0.1482,  0.3383,  0.0029],
         [ 0.2084,  0.2156,  0.0060,  ...,  0.0378, -0.0881,  0.5661],
         ...,
         [-0.0743, -0.5236,  0.2663,  ..., -0.0671,  0.0327, -0.1273],
         [ 0.2363, -0.1622,  0.2775,  ...,  0.0386,  0.2413, -0.2018],
         [ 0.5561,  0.0881, -0.4061,  ...,  0.3340, -0.6033, -0.2583]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-9.1370e-01, -4.0207e-01, -9.3675e-01,  7.7835e-01,  8.3409e-01,
         -3.2733e-01,  7.9987e-01,  3.7112e-01, -8.9480e-01, -9.9999e-01,
         -7.6652e-01,  8.9397e-01,  9.6921e-01,  5.5964e-01,  8.6713e-01,
         -7.5099e-01, -2.7724e-01, -6.6277e-01,  3.6119e-01,  1.5148e-01,
          6.6745e-01,  1.0000e+00, -2.6902e-01,  4.8091e-01,  6.0734e-01,
          9.8295e-01, -7.3657e-01,  9.0130e-01,  9.3619e-01,  6.984

In [0]:
embedding = output.last_hidden_state.detach().numpy()[:, 0, :]
embedding.shape

(1, 768)

In [0]:
embedding

array([[-5.53901792e-01, -5.99062964e-02, -3.86650622e-01,
         1.71672598e-01, -3.02498370e-01, -6.82645679e-01,
         2.72584464e-02,  6.25620127e-01, -2.71181855e-02,
        -2.25335017e-01, -2.13734671e-01,  6.08832873e-02,
        -8.00492391e-02,  1.75523072e-01, -2.11149499e-01,
         2.49073222e-01,  9.43429321e-02,  6.28977656e-01,
         1.94459215e-01,  2.56132841e-01, -6.48359835e-01,
        -5.16234279e-01,  1.25005245e-01, -1.03662685e-01,
         1.88452646e-01, -1.76306024e-01, -3.49137634e-01,
         1.24744721e-01, -3.03795904e-01, -4.22742426e-01,
        -3.05943191e-01,  7.24825636e-02, -1.02582388e-01,
        -5.57676494e-01,  5.91061532e-01,  7.19033852e-02,
         5.67032576e-01, -7.40324259e-02,  4.68637049e-02,
         1.51682928e-01, -7.17560410e-01,  4.83835042e-02,
         6.34507179e-01,  9.20557082e-02, -1.82615116e-01,
        -1.79851539e-02, -3.43808842e+00, -4.36769277e-02,
         9.54044163e-02, -2.60890752e-01,  3.65607232e-0

In [0]:
# First five values of the [CLS] token embedding
embedding[0][:5]

array([-0.5539018 , -0.0599063 , -0.38665062,  0.1716726 , -0.30249837],
      dtype=float32)

### Embeddings are a learned numeric representation of data (text, image, video, audio, geospatial or even tabular!)
![Embedding Generation](https://cdn.openai.com/embeddings/draft-20220124e/vectors-1.svg "Embedding Generation")

source: https://openai.com/blog/introducing-text-and-code-embeddings


### Embeddings of similar strings are closer to one another
![Embeddings](https://images.openai.com/blob/6feca3be-2b6b-4a99-9a64-34ba58980fae/Graphofsimilarembeddings.svg?width=10&height=10&quality=50 "Embeddings")

In [0]:
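# Wrap this into a function
def get_embeddings(model, tokenizer, text):
  """Return the [CLS] token embedding for `text` as a (1, hidden_size) NumPy array."""
  # truncation=True keeps the input within the model's maximum sequence length (512 tokens for BERT)
  encoded_input = tokenizer(text, return_tensors='pt', truncation=True)
  with torch.no_grad():  # inference only, so skip gradient tracking
    output = model(**encoded_input)
  # Position 0 of the sequence is the [CLS] token
  return output.last_hidden_state[:, 0, :].numpy()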

In [0]:
#test this out
get_embeddings(bert_model, bert_tokenizer, text)[0][:10]

array([-0.5539018 , -0.0599063 , -0.38665062,  0.1716726 , -0.30249837,
       -0.6826457 ,  0.02725845,  0.6256201 , -0.02711819, -0.22533502],
      dtype=float32)

One potential use of these embeddings is classification or document clustering, e.g.:
![classification](https://jalammar.github.io/images/BERT-classification-spam.png "classifier")

#### GPT-2 - released February 2019

![GPT2](https://jalammar.github.io/images/xlnet/gpt-2-output.gif "GPT2")

![GPT2](https://jalammar.github.io/images/xlnet/gpt-2-autoregression-2.gif "GPT2")

In [0]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

model.num_parameters()

124439808

In [0]:
# Encode text input
text = "Emzyme is a cool way to do"
encoded_input = tokenizer.encode(text, return_tensors='pt')

# Generate text
output = model.generate(encoded_input, max_length=50, num_return_sequences=1)

# Decode the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Emzyme is a cool way to do it.

The first thing you need to do is to add the following to your.bashrc file:

#!/bin/bash # This will create a new file called "scripts/scripts
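
What the generation call is doing under the hood is a loop: score every possible next token given the tokens so far, pick one, append it, and repeat until a stop token or the length limit. The sketch below shows the greedy variant of that loop with a toy lookup table standing in for the model; `greedy_generate`, `TOY_TABLE`, and `toy_next_token_scores` are hypothetical names, and a real model conditions on the whole sequence rather than only the last token.

```python
def greedy_generate(next_token_scores, prompt_tokens, max_new_tokens, eos_token="<eos>"):
    """Autoregressive decoding: repeatedly append the highest-scoring next token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)        # one "forward pass"
        next_token = max(scores, key=scores.get)  # greedy choice: take the top score
        if next_token == eos_token:
            break
        tokens.append(next_token)                 # the output becomes part of the next input
    return tokens


# Toy stand-in for a language model: scores depend only on the last token
TOY_TABLE = {
    "enzyme": {"is": 0.9, "<eos>": 0.1},
    "is": {"a": 0.7, "cool": 0.3},
    "a": {"cool": 0.8, "<eos>": 0.2},
    "cool": {"way": 0.6, "<eos>": 0.4},
    "way": {"<eos>": 1.0},
}


def toy_next_token_scores(tokens):
    return TOY_TABLE.get(tokens[-1], {"<eos>": 1.0})


print(greedy_generate(toy_next_token_scores, ["enzyme"], max_new_tokens=10))
```

Nothing in this loop knows what the user wants; it only continues the text. That is why the base GPT-2 model drifts into `.bashrc` instructions instead of explaining anything.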


In [0]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2-large')
set_seed(42)


In [0]:
#Attempt 1
text = "Write an email informing a customer about Enzyme in DLT. Be Concise"
output = generator(text, max_length=100, num_return_sequences=1)
output

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Write an email informing a customer about Enzyme in DLT. Be Concise\n\nYou've been working with the company for months now and are confident with its reputation. The emails will be consistent. Follow up with a question if there are any questions. Be Concise\n\nYour message has been forwarded to a member of the team; send a follow-up email to check their contact. Be Concise\n\nYour message has been forwarded to a member of the team. Be Concise"}]

In [0]:
#Attempt 2
text = "Here's a concise email informing a customer about Enzyme in DLT.\n Dear Customer, "
output = generator(text, max_length=100, num_return_sequences=1)
print(output[0]['generated_text'])

Here's a concise email informing a customer about Enzyme in DLT.
 Dear Customer,  you will receive your new Enzyme DLT products shortly after shipment. You may also contact the sales team at enzyme@enzyme.com to arrange your delivery. If you encounter any issues, please contact the company as soon as possible at (0071-4270-3436) or (02-862-0890) to get an item status. You will be eligible


#### GPT-3, released in June 2020, is a (much) bigger version of this: 175 billion parameters versus roughly 124 million for the base GPT-2 model loaded above

### Fast forward to 2023: Instruction finetuning

![Fine-tuning](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc4935aa8-722e-4461-9e9e-88341b5558f6_1379x446.png "fine-tuning")

In [0]:
dbrx_url = "https://<workspace-id>.databricks.com/serving-endpoints/databricks-dbrx-instruct/invocations"
output = generate(dbrx_url, databricks_token, system_prompt, user_message)
print(output)

Subject: Introduction to Enzyme and its Role in Distributed Ledger Technology

Dear [Customer's Name],

I hope this email finds you well. I am writing to introduce you to Enzyme, a cutting-edge technology that is revolutionizing the world of distributed ledger technology (DLT).

Enzyme is an open-source, decentralized platform that enables the creation and management of digital assets and smart contracts. It is built on top of the Ethereum blockchain and utilizes a unique consensus algorithm called "Proof of Stake" (PoS) to validate transactions and maintain the network's security.

Enzyme's primary goal is to provide a more efficient, secure, and scalable platform for the development and deployment of decentralized applications (dApps). It offers a wide range of features, including built-in privacy, high transaction throughput, and low fees, making it an attractive alternative to traditional blockchain platforms.

One of the key benefits of Enzyme is its ability to support the creatio

### Enter RAG: Why context is everything for reliable and repeatable use cases

#### Let's start with Context

Before RAG, let's start with the roots of RAG: providing context.
Let's take a passage that describes what Enzyme is from this blog post: https://www.databricks.com/blog/2022/06/29/delta-live-tables-announces-new-capabilities-and-performance-optimizations.html

In [0]:
context = """Transforming data to prepare it for downstream analysis is a prerequisite for most other workloads on the Databricks platform. While SQL and DataFrames make it relatively easy for users to express their transformations, the input data constantly changes. This requires recomputation of the tables produced by ETL. Recomputing the results from scratch is simple, but often cost-prohibitive at the scale many of our customers operate.

We are pleased to announce that we are developing project Enzyme, a new optimization layer for ETL. Enzyme efficiently keeps up-to-date a materialization of the results of a given query stored in a Delta table. It uses a cost model to choose between various techniques, including techniques used in traditional materialized views, delta-to-delta streaming, and manual ETL patterns commonly used by our customers."""

In [0]:
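# Note: the trailing [/INST] is a leftover Llama 2 tag; the chat endpoint applies its own template, so it is not required here.
formatted_text =   """Given the following context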

  {context}

  given the above context,

  {user_message} [/INST]
  """

system_prompt = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature."

user_message = "Write an email informing a customer about Enzyme in DLT. Be Concise"
#user_message = "Write me a verse about Databricks in the style of Outkast"

In [0]:
formatted_text = formatted_text.format(context = context, user_message = user_message)
print(formatted_text)

Given the following context

  Transforming data to prepare it for downstream analysis is a prerequisite for most other workloads on the Databricks platform. While SQL and DataFrames make it relatively easy for users to express their transformations, the input data constantly changes. This requires recomputation of the tables produced by ETL. Recomputing the results from scratch is simple, but often cost-prohibitive at the scale many of our customers operate.

We are pleased to announce that we are developing project Enzyme, a new optimization layer for ETL. Enzyme efficiently keeps up-to-date a materialization of the results of a given query stored in a Delta table. It uses a cost model to choose between various techniques, including techniques used in traditional materialized views, delta-to-delta streaming, and manual ETL patterns commonly used by our customers.

  given the above context,

  Write an email informing a customer about Enzyme in DLT. Be Concise [/INST]
  


In [0]:
output = generate(dbrx_url, databricks_token, system_prompt, formatted_text)
print(output)

Subject: Introducing Project Enzyme: Optimizing ETL on Databricks

Dear [Customer],

I hope this email finds you well. I am excited to share some news about an upcoming project on the Databricks platform that I believe will greatly benefit your data transformation and analysis workloads.

Project Enzyme is a new optimization layer for ETL that efficiently maintains up-to-date materializations of query results stored in Delta tables. By utilizing a cost model, Enzyme selects the most efficient techniques, such as those used in traditional materialized views, delta-to-delta streaming, and manual ETL patterns commonly employed by our customers.

With Enzyme, you can expect improved performance and reduced costs for your ETL workloads, as it minimizes the need for recomputing tables from scratch. This feature will be particularly beneficial for customers operating at scale.

We are committed to continuously enhancing the Databricks platform to better serve your needs, and we believe Projec

This is essentially the 'Augmented Generation' (AG) part of RAG. Now let's move on to the R: retrieval.
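
The augmentation step can be captured in a small helper that keeps the template separate from the filled-in prompt, so the same template can be reused with different contexts. This is a minimal sketch; `PROMPT_TEMPLATE` and `build_augmented_prompt` are hypothetical names, and the wording of the template is just one reasonable choice.

```python
PROMPT_TEMPLATE = """Given the following context

{context}

given the above context,

{question}"""


def build_augmented_prompt(context, question, template=PROMPT_TEMPLATE):
    """Fill the template with a context passage and the user's request."""
    return template.format(context=context.strip(), question=question.strip())


# Hypothetical context and question, for illustration only
prompt = build_augmented_prompt(
    context="Enzyme is a new optimization layer for ETL.",
    question="Write an email informing a customer about Enzyme in DLT. Be Concise",
)
print(prompt)
```

Because the template itself is never overwritten, calling the helper again with a different passage produces a fresh prompt each time.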

### RAG!!!

#### Retrieval Augmented Generation

Let's build a simple RAG system from scratch

In [0]:
source_passages = [
    "Popeye the Sailor Man is a fictional cartoon character created by Elzie Crisler Segar.[40][41][42][43] The character first appeared on January 17, 1929, in the daily King Features comic strip Thimble Theatre. The strip was in its tenth year when Popeye made his debut, but the one-eyed sailor quickly became the lead character, and Thimble Theatre became one of King Features' most popular properties during the 1930s. Following Segar's death in 1938, Thimble Theatre was continued by several writers and artists, most notably Segar's assistant Bud Sagendorf. It was formally renamed Popeye. The strip continues to appear in first-run installments on Sundays, written and drawn by R.K. Milholland. The daily strips are reprints of old Sagendorf stories.",
    "Transforming data to prepare it for downstream analysis is a prerequisite for most other workloads on the Databricks platform. While SQL and DataFrames make it relatively easy for users to express their transformations, the input data constantly changes. This requires recomputation of the tables produced by ETL. Recomputing the results from scratch is simple, but often cost-prohibitive at the scale many of our customers operate. We are pleased to announce that we are developing project Enzyme, a new optimization layer for ETL. Enzyme efficiently keeps up-to-date a materialization of the results of a given query stored in a Delta table. It uses a cost model to choose between various techniques, including techniques used in traditional materialized views, delta-to-delta streaming, and manual ETL patterns commonly used by our customers.",
    "The Philadelphia Eagles are a professional American football team based in Philadelphia. The Eagles compete in the National Football League (NFL) as a member club of the league's National Football Conference (NFC) East division. The team plays its home games at Lincoln Financial Field in the South Philadelphia Sports Complex.[7]",
    "Databricks, Inc. is an American enterprise software company founded by the creators of Apache Spark.[2] Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks. The company develops Delta Lake, an open-source project to bring reliability to data lakes for machine learning and other data science use cases.[3]",
    "The Ferrari Roma (Type F169) is a grand touring car by Italian manufacturer Ferrari. It has a front mid-engine, rear-wheel-drive layout with a turbocharged V8 engine and a 2+2 seating arrangement.[1] Based on the Ferrari Portofino, the car is placed between the Portofino and the F8 Tributo in Ferrari's range of sports cars.",
]
source_passages 

["Popeye the Sailor Man is a fictional cartoon character created by Elzie Crisler Segar.[40][41][42][43] The character first appeared on January 17, 1929, in the daily King Features comic strip Thimble Theatre. The strip was in its tenth year when Popeye made his debut, but the one-eyed sailor quickly became the lead character, and Thimble Theatre became one of King Features' most popular properties during the 1930s. Following Segar's death in 1938, Thimble Theatre was continued by several writers and artists, most notably Segar's assistant Bud Sagendorf. It was formally renamed Popeye. The strip continues to appear in first-run installments on Sundays, written and drawn by R.K. Milholland. The daily strips are reprints of old Sagendorf stories.",
 'Transforming data to prepare it for downstream analysis is a prerequisite for most other workloads on the Databricks platform. While SQL and DataFrames make it relatively easy for users to express their transformations, the input data const

In [0]:
passage_embeddings = [get_embeddings(bert_model, bert_tokenizer, passage) for passage in source_passages]
# passage_embeddings[0]

In [0]:
query_text = "Write an email informing a customer about Enzyme in DLT. Be Concise"

In [0]:
query_embedding = get_embeddings(bert_model, bert_tokenizer, query_text)
query_embedding

array([[-5.53901792e-01, -5.99062964e-02, -3.86650622e-01,
         1.71672598e-01, -3.02498370e-01, -6.82645679e-01,
         2.72584464e-02,  6.25620127e-01, -2.71181855e-02,
        -2.25335017e-01, -2.13734671e-01,  6.08832873e-02,
        -8.00492391e-02,  1.75523072e-01, -2.11149499e-01,
         2.49073222e-01,  9.43429321e-02,  6.28977656e-01,
         1.94459215e-01,  2.56132841e-01, -6.48359835e-01,
        -5.16234279e-01,  1.25005245e-01, -1.03662685e-01,
         1.88452646e-01, -1.76306024e-01, -3.49137634e-01,
         1.24744721e-01, -3.03795904e-01, -4.22742426e-01,
        -3.05943191e-01,  7.24825636e-02, -1.02582388e-01,
        -5.57676494e-01,  5.91061532e-01,  7.19033852e-02,
         5.67032576e-01, -7.40324259e-02,  4.68637049e-02,
         1.51682928e-01, -7.17560410e-01,  4.83835042e-02,
         6.34507179e-01,  9.20557082e-02, -1.82615116e-01,
        -1.79851539e-02, -3.43808842e+00, -4.36769277e-02,
         9.54044163e-02, -2.60890752e-01,  3.65607232e-0

In [0]:
# Normalize the query embedding
query_embedding_norm = query_embedding / np.linalg.norm(query_embedding)

![cosine similarity](https://i.stack.imgur.com/ewM2b.png "cosine similarity")

In [0]:
# Calculate the cosine similarity between the query and each passage
cosine_similarities = []
for cls_embedding in passage_embeddings:
    # scipy's cosine() returns the cosine *distance* and normalizes its inputs internally,
    # so similarity = 1 - distance; ravel() flattens the (1, 768) arrays to 1-D vectors
    similarity = 1 - cosine(query_embedding.ravel(), cls_embedding.ravel())
    cosine_similarities.append(similarity)

In [0]:
relevant_context = source_passages[cosine_similarities.index(max(cosine_similarities))]
relevant_context

'Transforming data to prepare it for downstream analysis is a prerequisite for most other workloads on the Databricks platform. While SQL and DataFrames make it relatively easy for users to express their transformations, the input data constantly changes. This requires recomputation of the tables produced by ETL. Recomputing the results from scratch is simple, but often cost-prohibitive at the scale many of our customers operate. We are pleased to announce that we are developing project Enzyme, a new optimization layer for ETL. Enzyme efficiently keeps up-to-date a materialization of the results of a given query stored in a Delta table. It uses a cost model to choose between various techniques, including techniques used in traditional materialized views, delta-to-delta streaming, and manual ETL patterns commonly used by our customers.'
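
The retrieval step above boils down to "score every passage against the query and keep the best ones". The sketch below expresses that as a reusable top-k function in plain Python, using made-up three-dimensional vectors in place of the 768-dimensional BERT embeddings; `cosine_similarity`, `retrieve`, and the `toy_*` values are hypothetical and only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, passage_vecs, passages, k=1):
    """Return the k passages most similar to the query, with their scores."""
    scores = [cosine_similarity(query_vec, vec) for vec in passage_vecs]
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in ranked[:k]]


# Made-up passages and vectors standing in for real embeddings
toy_passages = ["cartoon sailor", "ETL optimization layer", "football team"]
toy_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.0, 0.2, 0.9]]
toy_query = [0.2, 0.8, 0.1]

for passage, score in retrieve(toy_query, toy_vectors, toy_passages, k=2):
    print(f"{score:.3f}  {passage}")
```

Returning more than one passage (k > 1) is a common design choice, since the single best match is not always sufficient context. In practice, embedding models trained specifically for retrieval tend to rank passages more reliably than raw BERT [CLS] vectors, and a vector index replaces the linear scan once the corpus grows.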

In [0]:
# formatted_text was already filled in earlier, so calling .format on it again would be a no-op.
# Rebuild the prompt so the retrieved passage is actually inserted.
formatted_text = f"Given the following context\n\n  {relevant_context}\n\n  given the above context,\n\n  {user_message} [/INST]\n  "
print(formatted_text)

Given the following context

  Transforming data to prepare it for downstream analysis is a prerequisite for most other workloads on the Databricks platform. While SQL and DataFrames make it relatively easy for users to express their transformations, the input data constantly changes. This requires recomputation of the tables produced by ETL. Recomputing the results from scratch is simple, but often cost-prohibitive at the scale many of our customers operate. We are pleased to announce that we are developing project Enzyme, a new optimization layer for ETL. Enzyme efficiently keeps up-to-date a materialization of the results of a given query stored in a Delta table. It uses a cost model to choose between various techniques, including techniques used in traditional materialized views, delta-to-delta streaming, and manual ETL patterns commonly used by our customers.

  given the above context,

  Write an email informing a customer about Enzyme in DLT. Be Concise [/INST]
  


In [0]:
output = generate(dbrx_url, databricks_token, system_prompt, formatted_text)
print(output)

Subject: Introducing Project Enzyme: Optimizing ETL on Databricks

Dear [Customer],

I hope this email finds you well. I am excited to share some news about an upcoming project on the Databricks platform that I believe could greatly benefit your data processing needs.

Project Enzyme is a new optimization layer for ETL (Extract, Transform, Load) that aims to efficiently maintain up-to-date materializations of query results stored in Delta tables. By utilizing a cost model, Enzyme will choose the most efficient technique for your specific use case, including traditional materialized view techniques, delta-to-delta streaming, and manual ETL patterns commonly used by our customers.

This new feature will help reduce the cost of recomputing tables produced by ETL, especially when dealing with constantly changing input data. By automating and optimizing these processes, Project Enzyme will enable you to focus more on your data analysis and less on the technical aspects of data transformatio