# LLM Experiment - `llama.cpp`

### Table of Contents

1. [Import Packages](#1-import-packages)    
2. [Load Data](#2-load-data)
3. [Evaluation Function](#3-evaluation-function)
4. [Models](#4-models)
5. [Experiment - Llama-2-7B-Chat-GGML](#5-experiment---llama-2-7b-chat-ggml)
6. [Overall Results](#9-overall-results)
7. [Insights](#10-insights)

### Objective

The primary goal of this notebook is to test the feasibility of using a Large Language Model (LLM) for our multitask objectives of sentiment classification and response generation. The model is run through `llama-cpp-python`, which provides the Python bindings for `llama.cpp`.

## 1. Import Packages

In [1]:
import pandas as pd
from llama_cpp import Llama
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics import accuracy_score, f1_score
from textstat import flesch_reading_ease

## 2. Load Data

In [2]:
test_dataframe = pd.read_csv("../data/split/test.csv")
subset_test_dataframe = test_dataframe.sample(n=20, random_state=42)
subset_test_dataframe

| Unnamed: 0 | rating | text | resp |
|---|---|---|---|
| 1783813 | 2 | AYCE Shrimp good value if you like the flavors... | Were so glad you had a great experience at our... |
| 2035152 | 2 | Dwayne was our salesman and he really put fort... | Hello Barbara, thank you we appreciate your fe... |
| 2946095 | 2 | Good value on cabinets and flooring .Very nice... | Thank you! |
| 1371890 | 2 | Great atmosphere, knowledgeable, and friendly.... | Thank you much appreciated brother see you bac... |
| 962314 | 2 | Casual atmosphere with great steak and sides. ... | Thank you, Jake! So glad you love our place! W... |
| 2467452 | 0 | If I could give this office zero stars, I woul... | Were sorry to hear about your negative experie... |
| 920144 | 2 | They really get to know your animal! They love... | Hi Kristen Thank you so much for sharing your ... |
| 3125892 | 2 | Nick is amazing, as well as the rest of the st... | Thank you for your kind review |
| 80766 | 2 | Best place for equine stuff | Thanks Lisa...we appreciate the FIVE Star Revi... |
| 3572085 | 2 | No indoor dining. Great food, terrific Jamoca ... | Thank you for reaching out to us. |


## 3. Evaluation Function

In [3]:
def evaluate_sentiment(
    df: pd.DataFrame, true_labels_col: str, predicted_labels_col: str
) -> pd.DataFrame:
    """
    Evaluates the sentiment classification performance by calculating 
    the accuracy and F1 score based on true and predicted labels.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the true and predicted 
            labels.
        true_labels_col (str): The name of the column containing the true 
            labels.
        predicted_labels_col (str): The name of the column containing the 
            predicted labels.

    Returns:
        pd.DataFrame: The DataFrame with added 'Accuracy' and 'F1_Score' 
            columns.
    """
    df['Accuracy'] = accuracy_score(
        df[true_labels_col], df[predicted_labels_col]
    )
    df['F1_Score'] = f1_score(
        df[true_labels_col], df[predicted_labels_col], average='weighted'
    )
    print(f"Overall Accuracy: {df['Accuracy'].mean()*100:.2f}%")
    print(f"Overall F1 Score: {df['F1_Score'].mean()*100:.2f}%")
    
    
    return df

def average_sentence_length(text: str) -> float:
    """
    Calculate the average sentence length of a text.

    Parameters:
        text (str): The text to analyze.

    Returns:
        float: The average sentence length.
    """
    sentences = sent_tokenize(text)
    return sum(
        len(word_tokenize(sentence)) for sentence in sentences
    ) / len(sentences)

def average_word_length(text: str) -> float:
    """
    Calculate the average word length of a text.

    Parameters:
        text (str): The text to analyze.

    Returns:
        float: The average word length.
    """
    words = word_tokenize(text)
    return sum(len(word) for word in words) / len(words)

def lexical_diversity(text: str) -> float:
    """
    Calculate the lexical diversity of a text.

    Parameters:
        text (str): The text to analyze.

    Returns:
        float: The lexical diversity.
    """
    words = word_tokenize(text)
    return len(set(words)) / len(words)

def evaluate_generated_response(
    df: pd.DataFrame, response_col: str
) -> pd.DataFrame:
    """
    Evaluates the quality of generated text responses by calculating 
    readability, fluency, and complexity metrics.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the generated responses.
        response_col (str): The name of the column containing the generated 
            responses.

    Returns:
        pd.DataFrame: The DataFrame with added evaluation metrics as columns.
    """
    readability_scores = []
    avg_sentence_lengths = []
    avg_word_lengths = []
    lexical_diversities = []

    for response in df[response_col]:
        # Evaluate Readability
        readability_scores.append(flesch_reading_ease(response))
        
        # Evaluate Complexity
        avg_sentence_lengths.append(average_sentence_length(response))
        avg_word_lengths.append(average_word_length(response))
        lexical_diversities.append(lexical_diversity(response))

    df['Readability_Score'] = readability_scores
    df['Avg_Sentence_Length'] = avg_sentence_lengths
    df['Avg_Word_Length'] = avg_word_lengths
    df['Lexical_Diversity'] = lexical_diversities

    # Calculate and print overall scores
    print(f"Overall Readability Score: {df['Readability_Score'].mean():.2f}")
    print(f"Overall Average Sentence Length: ")
    print(f"{df['Avg_Sentence_Length'].mean():.2f}")
    print(f"Overall Average Word Length: {df['Avg_Word_Length'].mean():.2f}")
    print(f"Overall Lexical Diversity: {df['Lexical_Diversity'].mean():.2f}")

    return df
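
The helper functions above divide by the number of sentences or words, so an empty generated response would raise `ZeroDivisionError`. The following is a minimal sketch of guarded versions. It assumes a simple regex tokenizer (the hypothetical `simple_words` and `simple_sentences`) in place of NLTK so that it runs without tokenizer data, and it lowercases words before counting, so its values will differ slightly from those produced by `word_tokenize`.

```python
import re


def simple_words(text: str) -> list:
    """Hypothetical stand-in for nltk's word_tokenize: runs of letters, digits, apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def simple_sentences(text: str) -> list:
    """Hypothetical stand-in for nltk's sent_tokenize: split on '.', '!' or '?'."""
    parts = re.split(r"[.!?]+", text)
    return [part.strip() for part in parts if part.strip()]


def safe_average_sentence_length(text: str) -> float:
    """Average words per sentence, or 0.0 when the text has no sentences."""
    sentences = simple_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(simple_words(s)) for s in sentences) / len(sentences)


def safe_lexical_diversity(text: str) -> float:
    """Unique words divided by total words, or 0.0 when the text has no words."""
    words = [word.lower() for word in simple_words(text)]
    if not words:
        return 0.0
    return len(set(words)) / len(words)


sample = "Thank you for your review. We hope to see you again soon!"
print(safe_average_sentence_length(sample))  # 6.0 (5 and 7 words)
print(safe_lexical_diversity(sample))        # 11 unique words out of 12
print(safe_lexical_diversity(""))            # 0.0 instead of an exception
```

Returning `0.0` for empty text is one possible convention; returning `float("nan")` would keep such rows out of the column means instead.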

## 4. Models

<div style="background-color: black; color: white; padding: 10px">
    <p><b>In this notebook, I experiment with the following model to test the prompt template and finalize any required post-processing steps:</b></p>
    <p>Llama-2-7B Chat</p>
</div>

## 5. Experiment - Llama-2-7B-Chat-GGML

First, we experiment with the Llama 2 chat model from Meta AI, which is fine-tuned for dialogue. The file loaded below is a 4-bit quantized (`Q4_K_M`) GGUF build.

**Hugging Face Model card** - https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML


In [4]:
model = Llama("../models/llama-2-7b-chat.Q4_K_M.gguf", verbose=True, n_ctx=2048)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../models/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    7:         blk.0.attn_output.weight q4_K     [  4096,  4096,     

**Prompt function to perform Sentiment Classification and Response Generation Task**:

In [11]:
def use_llama_prompt(review: str) -> dict:
    """Build the zero-shot prompt for a review and return the raw completion dict."""
    prompt = (
        "[INST] <<SYS>>\n"
        "You are a helpful and attentive business owner who always "
        "appreciates customer feedback. If the customer had a positive "
        "experience, make sure to thank them. If the customer had a "
        "negative or neutral experience, apologize and address their "
        "concerns respectfully and constructively. Additionally, try to "
        "identify any potential improvements or criticisms mentioned in the review. "
        "Do limit your response to 3 sentences.\n"
        "<</SYS>>\n\n"
        "Please classify the following customer review as either "
        "\"Positive,\" \"Negative,\" or \"Neutral.\" Then, provide a business owner's "
        "response accordingly. Also, explicitly identify any potential improvements "
        "or criticisms mentioned in the review, filling them in the corresponding "
        "fields. If none are mentioned, write \"None\" for each field.\n"
        f"The customer review is: \"{review}\"\n"
        "Strictly format and return your output in the following JSON format: "
        "{\"sentiment\": \"<Sentiment>\", \"response\": \"<Response>\", "
        "\"improvements\": \"<Improvements>\", \"criticisms\": \"<Criticisms>\"}\n"
        "[/INST]\n\n "
    )

    output = model(prompt, max_tokens=150)
    return output


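import ast


def try_use_llama_prompt(text):
    """Query the model up to three times and return the parsed output dict."""
    max_attempts = 3
    for attempt in range(max_attempts):
        output = use_llama_prompt(text)
        try:
            # literal_eval parses a dict literal without executing arbitrary
            # code, which eval would do with untrusted model output
            return ast.literal_eval(output['choices'][0]['text'].strip())
        except (ValueError, SyntaxError):
            print(
                f"Attempt {attempt+1} failed: output was gibberish. Trying again."
            )
    # All attempts failed; downstream code must handle a missing result
    return None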
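
Chat models often wrap the requested object in extra prose, which is one reason a parse attempt can fail. As a sketch only, the hypothetical helper `extract_output` below pulls the span from the first `{` to the last `}` out of the raw text and parses it as JSON, falling back to `ast.literal_eval` for Python-style literals such as `None`. It assumes the model emits a single object per completion, and it is not the parsing used for the results reported in this notebook.

```python
import ast
import json
import re
from typing import Optional


def extract_output(raw_text: str) -> Optional[dict]:
    """Return the {...} object embedded in raw_text, or None if it cannot be parsed."""
    # Greedy match: from the first opening brace to the last closing brace
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match is None:
        return None
    candidate = match.group(0)
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        try:
            # Handles single-quoted keys and Python literals such as None
            parsed = ast.literal_eval(candidate)
        except (ValueError, SyntaxError, TypeError):
            return None
    return parsed if isinstance(parsed, dict) else None


wrapped = (
    'Sure! Here is the output:\n'
    '{"sentiment": "Positive", "response": "Thank you!", '
    '"improvements": "None", "criticisms": "None"}'
)
python_style = "{'sentiment': 'Neutral', 'response': 'Thanks.', 'improvements': None}"

print(extract_output(wrapped)["sentiment"])
print(extract_output(python_style)["improvements"])
print(extract_output("no object here"))
```

Returning `None` on failure keeps the retry loop simple: the caller can retry while the result is `None` and record the review as unparsed once the attempts run out.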

**Apply function:**

In [12]:
subset_test_dataframe['output'] = subset_test_dataframe['text'].apply(
    try_use_llama_prompt
)

Llama.generate: prefix-match hit

llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =    66.42 ms /    94 runs   (    0.71 ms per token,  1415.28 tokens per second)
llama_print_timings: prompt eval time =  6498.29 ms /   137 tokens (   47.43 ms per token,    21.08 tokens per second)
llama_print_timings:        eval time =  4927.03 ms /    93 runs   (   52.98 ms per token,    18.88 tokens per second)
llama_print_timings:       total time = 11608.65 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =    60.65 ms /    86 runs   (    0.71 ms per token,  1418.09 tokens per second)
llama_print_timings: prompt eval time =  6216.45 ms /   105 tokens (   59.20 ms per token,    16.89 tokens per second)
llama_print_timings:        eval time =  4513.90 ms /    85 runs   (   53.10 ms per token,    18.83 tokens per second)
llama_print_timings:       total time = 10900.01 ms
Llama.gene

Attempt 1 failed: output was gibberish. Trying again.



llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =    55.70 ms /    79 runs   (    0.71 ms per token,  1418.31 tokens per second)
llama_print_timings: prompt eval time =  5739.95 ms /    74 tokens (   77.57 ms per token,    12.89 tokens per second)
llama_print_timings:        eval time =  4103.00 ms /    78 runs   (   52.60 ms per token,    19.01 tokens per second)
llama_print_timings:       total time =  9999.69 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =    81.81 ms /   116 runs   (    0.71 ms per token,  1417.87 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  6153.38 ms /   116 runs   (   53.05 ms per token,    18.85 tokens per second)
llama_print_timings:       total time =  6382.07 ms
Llama.generate: prefix-match hit


Attempt 1 failed: output was gibberish. Trying again.



llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =   105.94 ms /   150 runs   (    0.71 ms per token,  1415.84 tokens per second)
llama_print_timings: prompt eval time =  6250.80 ms /    99 tokens (   63.14 ms per token,    15.84 tokens per second)
llama_print_timings:        eval time =  8026.16 ms /   149 runs   (   53.87 ms per token,    18.56 tokens per second)
llama_print_timings:       total time = 14579.62 ms
Llama.generate: prefix-match hit


Attempt 2 failed: output was gibberish. Trying again.



llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =    70.01 ms /    99 runs   (    0.71 ms per token,  1414.08 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  5284.25 ms /    99 runs   (   53.38 ms per token,    18.73 tokens per second)
llama_print_timings:       total time =  5480.60 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =    64.94 ms /    92 runs   (    0.71 ms per token,  1416.63 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  4902.69 ms /    92 runs   (   53.29 ms per token,    18.77 tokens per second)
llama_print_timings:       total time =  5085.89 ms
Llama.generate: prefix-match hit

llama_pri

Attempt 1 failed: output was gibberish. Trying again.



llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =    67.98 ms /    96 runs   (    0.71 ms per token,  1412.18 tokens per second)
llama_print_timings: prompt eval time =  5668.50 ms /    86 tokens (   65.91 ms per token,    15.17 tokens per second)
llama_print_timings:        eval time =  5033.69 ms /    95 runs   (   52.99 ms per token,    18.87 tokens per second)
llama_print_timings:       total time = 10894.61 ms
Llama.generate: prefix-match hit

llama_print_timings:        load time =  8157.74 ms
llama_print_timings:      sample time =    77.56 ms /   109 runs   (    0.71 ms per token,  1405.38 tokens per second)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =  5819.78 ms /   109 runs   (   53.39 ms per token,    18.73 tokens per second)
llama_print_timings:       total time =  6039.17 ms
Llama.generate: prefix-match hit

llama_pri

**Post Processing:**

In [13]:
# Unpack each field of the parsed output into its own column
subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)
# Store improvements and criticisms separately so they do not overwrite
# the 'sentiment' and 'response' columns
subset_test_dataframe['improvements'] = subset_test_dataframe['output'].apply(
    lambda x: x['improvements']
)
subset_test_dataframe['criticisms'] = subset_test_dataframe['output'].apply(
    lambda x: x['criticisms']
)

**Print output samples:**

In [14]:
for i in range(len(subset_test_dataframe)):
    print(f"Index {i}\n")
    print(subset_test_dataframe['text'].iloc[i])
    print("\n")
    print(subset_test_dataframe['output'].iloc[i])
    print("\n")

Index 0

AYCE Shrimp good value if you like the flavors they seve but sevice was ok


{'sentiment': 'Neutral', 'response': 'Thank you for taking the time to share your feedback about your experience at our AYCE Shrimp station. We apologize that the service did not meet your expectations, and we will take this feedback into consideration to improve our services in the future. We appreciate your loyalty and hope to serve you better next time.', 'improvements': None, 'criticisms': None}


Index 1

Dwayne was our salesman and he really put forth the extra effort to make sure we were satisfied with our purchase. They really had A LOT of quality cars trucks, etc. to choose from !


{'sentiment': 'Positive', 'response': "Thank you for taking the time to share your feedback! We're glad to hear that Dwayne provided excellent service and helped you find a quality vehicle. We appreciate your kind words and will continue to strive for excellence in our customer service. ", 'improvements': None, 'c

**Evaluation:**

In [10]:
# SENTIMENT_MAPPING = {0: "Negative", 1: "Neutral", 2: "Positive"}
SENTIMENT_MAPPING = {"Negative": 0, "Neutral": 1, "Positive": 2}

subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['sentiment'].map(
    SENTIMENT_MAPPING
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)
subset_test_dataframe = evaluate_sentiment(
    subset_test_dataframe, 'rating', 'sentiment'
)
subset_test_dataframe = evaluate_generated_response(
    subset_test_dataframe, 'response'
)

Overall Accuracy: 90.00%
Overall F1 Score: 85.26%
Overall Readability Score: 79.55
Overall Average Sentence Length: 
15.99
Overall Average Word Length: 3.99
Overall Lexical Diversity: 0.81
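
`Series.map` returns `NaN` for any label that is not an exact key of `SENTIMENT_MAPPING`, for example `positive` or `Positive.`, and such rows cannot be scored reliably. Below is a small sketch of a normalisation step. The helper `normalise_sentiment` is hypothetical, and returning `None` for unrecognised labels is an assumed convention rather than something this notebook does.

```python
from typing import Optional

SENTIMENT_MAPPING = {"Negative": 0, "Neutral": 1, "Positive": 2}


def normalise_sentiment(label: object) -> Optional[int]:
    """Map a raw model label to its integer class, tolerating case and punctuation."""
    if not isinstance(label, str):
        return None
    # Trim whitespace and stray punctuation, then match the mapping's capitalisation
    cleaned = label.strip().strip(".,!\"'").capitalize()
    return SENTIMENT_MAPPING.get(cleaned)


print(normalise_sentiment("positive"))
print(normalise_sentiment(" Neutral. "))
print(normalise_sentiment("Mixed"))
```

Under these assumptions, `subset_test_dataframe['sentiment'].apply(normalise_sentiment)` could stand in for `.map(SENTIMENT_MAPPING)`, with rows that come back as `None` either dropped or counted as misclassifications before scoring.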


**Few-shot Prompt function to perform Sentiment Classification and Response Generation Task**:

In [15]:
def use_llama_prompt(review: str) -> dict:
    """Build the few-shot prompt for a review and return the raw completion dict."""
    prompt = (
        "[INST] <<SYS>>\n"
        "You are a helpful and attentive business owner who always "
        "appreciates customer feedback. Your task is to categorize customer "
        "reviews as 'Positive,' 'Neutral,' or 'Negative,' and respond accordingly. "
        "Additionally, identify any potential improvements or criticisms mentioned in "
        "the review. Limit your response to 3 sentences. "
        "If no improvements or criticisms are found, specify 'None' for each field.\n"
        "<</SYS>>\n\n"
        "-- Example 1 --\n"
        "Customer Review: The food was excellent and the service was top-notch.\n"
        "Output: {\"sentiment\": \"Positive\", \"response\": \"Thank you for your kind words! "
        "We're glad you enjoyed your experience. Come back soon!\", \"improvements\": \"None\", \"criticisms\": \"None\"}\n\n"
        
        "-- Example 2 --\n"
        "Customer Review: The service was slow and the food was mediocre.\n"
        "Output: {\"sentiment\": \"Negative\", \"response\": \"We're sorry to hear about your "
        "experience. We'll work on our speed and food quality.\", \"improvements\": \"Speed up service, Improve food quality\", "
        "\"criticisms\": \"Slow service, Mediocre food\"}\n\n"
        
        "-- Example 3 --\n"
        "Customer Review: The food was good, but the music was too loud.\n"
        "Output: {\"sentiment\": \"Neutral\", \"response\": \"Thank you for your feedback. "
        "We're glad you liked the food but sorry the music was too loud.\", \"improvements\": \"Adjust music volume\", \"criticisms\": \"Loud music\"}\n\n"
        
        "Please classify the following customer review and provide a business owner's "
        f"response accordingly. The customer review is: \"{review}\"\n"
        "Strictly format and return your output in the following JSON format: "
        "{\"sentiment\": \"<Sentiment>\", \"response\": \"<Response>\", "
        "\"improvements\": \"<Improvements>\", \"criticisms\": \"<Criticisms>\"}\n"
        "[/INST]\n\n "
    )

    output = model(prompt, max_tokens=150)
    return output


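import ast


def try_use_llama_prompt(text):
    """Query the model up to three times and return the parsed output dict."""
    max_attempts = 3
    for attempt in range(max_attempts):
        output = use_llama_prompt(text)
        try:
            # literal_eval parses a dict literal without executing arbitrary
            # code, which eval would do with untrusted model output
            return ast.literal_eval(output['choices'][0]['text'].strip())
        except (ValueError, SyntaxError):
            print(
                f"Attempt {attempt+1} failed: output was gibberish. Trying again."
            )
    # All attempts failed; downstream code must handle a missing result
    return None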
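
The few-shot prompt above escapes every example by hand. As an illustration, the same layout can be assembled from a list of examples, with `json.dumps` handling the quoting of both the example outputs and the review text. The names `SYSTEM_TEXT`, `FEW_SHOT_EXAMPLES` and `build_few_shot_prompt` are hypothetical, the system text and examples are abbreviated, and the whitespace is only assumed to be close to the hand-written prompt, not identical.

```python
import json

# Abbreviated system instruction (the notebook's version is longer)
SYSTEM_TEXT = (
    "You are a helpful and attentive business owner who always "
    "appreciates customer feedback."
)

# (review, expected output) pairs used as demonstrations
FEW_SHOT_EXAMPLES = [
    (
        "The food was excellent and the service was top-notch.",
        {
            "sentiment": "Positive",
            "response": "Thank you for your kind words!",
            "improvements": "None",
            "criticisms": "None",
        },
    ),
    (
        "The service was slow and the food was mediocre.",
        {
            "sentiment": "Negative",
            "response": "We're sorry to hear about your experience.",
            "improvements": "Speed up service, Improve food quality",
            "criticisms": "Slow service, Mediocre food",
        },
    ),
]


def build_few_shot_prompt(review: str) -> str:
    """Assemble a Llama 2 chat prompt with numbered demonstrations."""
    parts = [f"[INST] <<SYS>>\n{SYSTEM_TEXT}\n<</SYS>>\n\n"]
    for number, (example_review, example_output) in enumerate(
        FEW_SHOT_EXAMPLES, start=1
    ):
        parts.append(f"-- Example {number} --\n")
        parts.append(f"Customer Review: {example_review}\n")
        parts.append(f"Output: {json.dumps(example_output)}\n\n")
    # json.dumps also escapes any double quotes inside the review itself
    parts.append(f"The customer review is: {json.dumps(review)}\n")
    parts.append("[/INST]\n\n ")
    return "".join(parts)


print(build_few_shot_prompt('Staff said "no refunds" and left.'))
```

Keeping the demonstrations as data makes it easier to add, remove or reorder examples when comparing zero-shot and few-shot runs.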

## 9. Overall Results

**Llama-2-7B Chat - GGML Original 4-bit Quantized:**

Overall Accuracy: `85.00%`   
Overall F1 Score: `87.65%`     
Overall Readability Score: `76.75` (Flesch Reading Ease Score - 0 to 100)    
Overall Average Sentence Length: `16.98` words per sentence   
Overall Average Word Length: `4.05` Characters per Word    
Overall Lexical Diversity: `0.81` (Lexical Diversity Ratio - 0 to 1)   
Time taken: `4 minutes 45 seconds`

**Llama-2-7B Chat - GGUF 4-bit Quantized [`Zero-Shot`]:**

Overall Accuracy: `90.00%`   
Overall F1 Score: `85.26%`     
Overall Readability Score: `76.02` (Flesch Reading Ease Score - 0 to 100)    
Overall Average Sentence Length: `16.54` words per sentence   
Overall Average Word Length: `4.04` Characters per Word    
Overall Lexical Diversity: `0.83` (Lexical Diversity Ratio - 0 to 1)   
Time taken: `2 minutes 53 seconds`

**Llama-2-7B Chat - GGUF 4-bit Quantized [`Few-Shot`]:**

Overall Accuracy: `%`   
Overall F1 Score: `%`     
Overall Readability Score: `` (Flesch Reading Ease Score - 0 to 100)     
Overall Average Sentence Length: `` words per sentence     
Overall Average Word Length: `` Characters per Word      
Overall Lexical Diversity: `` (Lexical Diversity Ratio - 0 to 1)     
Time taken: ``   
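
To compare the two completed runs per review, the reported wall-clock times can be divided by the number of sampled reviews. This is plain arithmetic on the figures above. It assumes both runs covered the same 20-review subset, and any retries are included in the totals.

```python
N_REVIEWS = 20  # sample size from section 2, assumed to apply to both runs

# Reported wall-clock times converted to seconds
run_times_seconds = {
    "GGML 4-bit": 4 * 60 + 45,
    "GGUF 4-bit zero-shot": 2 * 60 + 53,
}

seconds_per_review = {
    name: total / N_REVIEWS for name, total in run_times_seconds.items()
}

for name, seconds in seconds_per_review.items():
    print(f"{name}: {seconds:.2f} s per review")
# GGML 4-bit: 14.25 s per review
# GGUF 4-bit zero-shot: 8.65 s per review
```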