![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/22.0_Llama2_Transformer_In_SparkNLP.ipynb)

# LLAMA2Transformer: CausalLM with Open Source models

> Llama 2 is a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
>
>[Source](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/)

LLAMA2Transformer is compatible with quantized models (INT4 or INT8) for CPUs, which makes state-of-the-art models usable on consumer computers and environments. It supports ONNX exports and quantization for:

* 16 bit (CUDA only)
* 8 bit (CPU or CUDA)
* 4 bit (CPU or CUDA)  
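
As a rough back-of-the-envelope check on why the lower bit widths matter on consumer hardware, the sketch below estimates the size of the weights alone at each precision. It assumes a round figure of 7 billion parameters, and `weight_memory_gb` is a hypothetical helper, not part of Spark NLP. Activations, the key/value cache, and file overhead are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_7B = 7e9  # assumption: round figure for the 7B model

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(N_PARAMS_7B, bits):.1f} GB")
```

Under these assumptions the 4-bit weights come to about 3.5 GB, compared with about 14 GB at 16 bits.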

## Colab Setup

In [None]:
! pip install -q pyspark==3.4.1 spark-nlp==5.3.0

In [None]:
import sparknlp

from sparknlp.base import *
from sparknlp.annotator import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

**Make sure to Enable GPU Mode and High RAM**

In [None]:
# To enable GPU mode, uncomment the next line and comment out `spark = sparknlp.start()` below

# spark = sparknlp.start(gpu=True)

spark = sparknlp.start()

print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)

spark

Spark NLP version 5.3.0
Apache Spark version: 3.4.1


## Llama2 Pipeline

Now, let's create a Spark NLP Pipeline with the `llama_2_7b_chat_hf_int4` model and check the results.

In [None]:
document_assembler = DocumentAssembler()\
    .setInputCol("text") \
    .setOutputCol("documents")

llama2 = LLAMA2Transformer.pretrained()\
    .setMaxOutputLength(150) \
    .setDoSample(False) \
    .setInputCols(["documents"]) \
    .setOutputCol("generation")

pipeline = Pipeline(
  stages=[
    document_assembler,
    llama2
])

llama_2_7b_chat_hf_int4 download started this may take some time.
Approximate size to download 4.4 GB
[OK!]


In [None]:
data = spark.createDataFrame([["Tell me a nice short history."]]).toDF("text")
result = pipeline.fit(data).transform(data)

In [None]:
result.select("generation.result").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                       

We can display the documentation of all parameters, along with their default and user-supplied values, using the `explainParams()` function.

In [None]:
print(llama2.explainParams())

batchSize: Size of every batch (default: 1)
beamSize: Number of beams for beam search. (default: 1)
configProtoBytes: ConfigProto from tensorflow, serialized into byte array. Get with config_proto.SerializeToString() (undefined)
doSample: Whether or not to use sampling; use greedy decoding otherwise (default: False, current: False)
engine: Deep Learning engine used for this model (current: onnx)
ignoreTokenIds: A list of token ids which are ignored in the decoder's output (default: [])
inputCols: previous annotations columns, if renamed (current: ['documents'])
lazyAnnotator: Whether this AnnotatorModel acts as lazy in RecursivePipelines (default: False)
maxInputLength: Maximum length of the input sequence (default: 4096)
maxOutputLength: Maximum length of output text (default: 20, current: 150)
minOutputLength: Minimum length of the sequence to be generated (default: 0)
nReturnSequences: The number of sequences to return from the beam search. (undefined)
noRepeatNgramSize: If set to i

Let's run the model on more sentences and set `.setDoSample()` to `True`. This parameter controls whether sampling is used; when it is `False` (the default), greedy decoding is used instead. <br/>
We also use `.setTopK()` to set the number of highest-probability vocabulary tokens to keep for top-k filtering (default: 50).
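
To make the difference between greedy decoding and sampling concrete, here is a minimal pure-Python sketch that does not touch Spark NLP. The distribution `next_token_probs` and the helpers `greedy_pick` and `sample_pick` are made up for illustration only.

```python
import random

# Hypothetical next-token distribution; tokens and probabilities are invented.
next_token_probs = {"Rome": 0.5, "Italy": 0.3, "Milan": 0.15, "pizza": 0.05}


def greedy_pick(probs):
    """Greedy decoding: always return the single most probable token."""
    return max(probs, key=probs.get)


def sample_pick(probs, rng):
    """Sampling: draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print("greedy :", [greedy_pick(next_token_probs) for _ in range(5)])
print("sampled:", [sample_pick(next_token_probs, rng) for _ in range(5)])
```

Greedy decoding returns the same token every time, while sampling can return any token with non-zero probability.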

In [None]:
sample_texts= [[1, "Mey name is  Leonardo"],
               [2, "My name is Leonardo and I come from Rome."],
               [3, "My name is"],
               [4, "What is the difference between diesel and petrol?"]]

sample_df= spark.createDataFrame(sample_texts).toDF("id", "text")

In [None]:
llama2.setMaxOutputLength(50).setMinOutputLength(25).setDoSample(True).setTopK(20)

LLAMA2TRANSFORMER_e96d5e9be6f0

In [None]:
result = pipeline.fit(sample_df).transform(sample_df)
result.select("id", "generation.result").show(truncate=False)

+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |result                                                                                                                                                                                                                                                                                                     |
+---+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1  |[Mey name is  Leonardo DiCaprio, and he is an actor and producer. Мосfilm, th

In [None]:
def prompts_to_spark_df(prompts, spark=spark):
  """Build a Spark DataFrame with `id` and `text` columns from a list of prompts."""
  text = [[i, prompt] for i, prompt in enumerate(prompts)]
  return spark.createDataFrame(text).toDF("id", "text")


def generate_with_llm(llm_pipe, prompts, print=True):
  """Run the pipeline on one prompt or a list of prompts and return a pandas DataFrame."""
  # Note: the `print` flag shadows the built-in inside this function,
  # so printing is delegated to print_generation_results.
  if isinstance(prompts, str):
    df = prompts_to_spark_df([prompts])
  elif isinstance(prompts, list):
    df = prompts_to_spark_df(prompts)
  else:
    raise ValueError(f"Invalid type {type(prompts)}: please pass a str or a list of str for the prompts parameter")
  df = llm_pipe.fit(df).transform(df)
  df = df.select("id", "text", "generation.result").toPandas()

  if print:
    print_generation_results(df)
  return df


def print_generation_results(df):
  """Print the first generated sequence of every row, separated by a ruler line."""
  for idx, row in df.iterrows():
    print(f'Example {idx}: {200*"_"}')
    print(row.result[0])
    print('\n')


## Explore and Play with Parameters

### Sampling Methods


Sampling means we **randomly** draw from a distribution of words.
The probability distribution is conditioned on all previous tokens in a text to generate the next token.

By default, the distribution contains all words in the model's vocabulary, many of which are poor candidates to generate.

There are two methods of reshaping and drawing from those distributions:

1. **Top-K sampling**: Take the k most likely words from the original distribution. Redistribute the probability mass among those k words and draw according to the new probabilities.

2. **Top-P (nucleus) sampling**: Take the smallest possible set of N words whose combined probability is at least p. Redistribute the probability mass among those N words and draw according to the new probabilities.
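
The two filters described above can be sketched in plain Python as follows. This is an illustration on a made-up distribution, not the code Spark NLP runs internally; `renormalize`, `top_k_filter`, and `top_p_filter` are hypothetical helper names.

```python
def renormalize(probs):
    """Rescale probabilities so that they sum to 1."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs, k):
    """Keep the k most probable tokens, then renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return renormalize(dict(kept))


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


# Invented next-word distribution for the prompt "I love to ..."
probs = {"run": 0.4, "read": 0.3, "eat": 0.2, "fly": 0.07, "xylophone": 0.03}
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.85))
```

With `k=2` only "run" and "read" survive. With `p=0.85` the set grows until the cumulative probability reaches the threshold, so "eat" is kept as well. In both cases the unlikely tail is dropped before drawing.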



Additionally, both methods can be tweaked with the following parameter:

- **temperature**: Parameter of the softmax function that affects the distribution computed by the model. The closer it is to 0, the more deterministic sampling becomes: the tails of the distribution get thinner and the probabilities of outlier words move closer to 0. Values closer to 1 make the tails fatter, which makes outliers more probable and generic results less probable.
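
The effect of temperature can be seen on a toy example. The sketch below assumes the common formulation in which the logits are divided by the temperature before the softmax. The logits are invented, and `softmax_with_temperature` is a hypothetical helper rather than a Spark NLP function.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (must be > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.1, 0.5, 1.0):
    rounded = [round(p, 3) for p in softmax_with_temperature(logits, t)]
    print(f"temperature={t}: {rounded}")
```

As the temperature drops towards 0, almost all of the probability mass moves to the highest-scoring token. At 1 the distribution is the plain softmax of the logits.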


These parameters are shared by all methods:
- **ignoreTokenIds**: A list of token ids which are ignored in the decoder's output (default: [])
- **noRepeatNgramSize**: If set to int > 0, all ngrams of that size can only occur once
- **repetitionPenalty**: The parameter for repetition penalty. 1.0 means no penalty. See https://arxiv.org/pdf/1909.05858.pdf
- **task**: Transformer's task, e.g. 'is it true that' (default: , current: generate)
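
To illustrate what `noRepeatNgramSize` and `repetitionPenalty` are meant to achieve, here is a small sketch. It is an assumption about the general technique, not a description of Spark NLP internals. `banned_next_tokens` and `apply_repetition_penalty` are hypothetical helpers, and the penalty rule (divide positive logits, multiply negative ones) follows a convention used in several open-source implementations of the linked paper's idea.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size <= 0 or len(generated) < ngram_size - 1:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for i in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the scores of already generated token ids; 1.0 changes nothing."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


print(banned_next_tokens(["a", "b", "c", "a"], ngram_size=2))
print(apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0))
```

In the first call the sequence ends in "a", and the bigram "a b" has already occurred, so "b" is banned as the next token. In the second call the scores of the two tokens already generated are pushed down while the third is left untouched.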

### Play with temperature
- Set the temperature higher to make the model more random/creative and the text less coherent.
- Use values with 0 < temperature <= 1.
- You must set `llama2.setDoSample(True)` to get non-deterministic results.

In [None]:
text = """Hello my name is Llama, I love to """
data = [text, text,text,text,text ]
llama2.setMaxOutputLength(200)

LLAMA2TRANSFORMER_e96d5e9be6f0

In [None]:
llama2.setTemperature(1)
llama2.setDoSample(True)
generate_with_llm(pipeline, data, print=True)

                                                                                

Example 0: ________________________________________________________________________________________________________________________________________________________________________________________________________
Hello my name is Llama, I love to ���In January 2018, Instagram introduced its "Reels" feature, which allows users to create and share short videos up to 60 seconds in length. Hinweis: Die Daten sind je nach Land und Interface variabel. See, that wasn’t so hard! short, 2. a Belgian sculptor and painter (1857-1934), Example sentences from the Web for short. For example, if you want to post a video that is 15 seconds long, you would enter “15s” in the field provided. Portmantue definition, a short garment worn as a wrap or coverup, typically made of lightweight, loose-fitting material. 0. short definition: 1. a small or thin piece of something: 2. a short time: 3. a short distance: . Keep in mind that Instagram's "Reels" feature is designed to


Example 1: _______________________

| | id | text | result |
|---|---|---|---|
| 0 | 0 | Hello my name is Llama, I love to | [Hello my name is Llama, I love to ���In Janua... |
| 1 | 1 | Hello my name is Llama, I love to | [Hello my name is Llama, I love to Roleplay a... |
| 2 | 2 | Hello my name is Llama, I love to | [Hello my name is Llama, I love to help peopl... |
| 3 | 3 | Hello my name is Llama, I love to | [Hello my name is Llama, I love to irc I'm alw... |
| 4 | 4 | Hello my name is Llama, I love to | [Hello my name is Llama, I love to meet new p... |
