# LLM Experiment - GPT4All

### Table of Contents

1. [Import Packages](#1-import-packages)    
2. [Load Data](#2-load-data)
3. [Evaluation Function](#3-evaluation-function)
4. [Models](#4-models)
5. [Experiment - Llama-2-7B-Chat-GGML](#5-experiment---llama-2-7b-chat-ggml)
6. [Experiment - GPT4All Falcon](#6-experiment---gpt4all-falcon)
7. [Experiment - Hermes](#7-experiment---hermes)
8. [Experiment - Mini Orca (Large)](#8-experiment---mini-orca-large)
9. [Overall Results](#9-overall-results)
10. [Insights](#10-insights)

### Objective

The primary goal of this notebook is to test the feasibility of utilising Large Language Model (LLM) for our multitask objectives of sentiment classification and response generation. GPT4All library, by Nomic AI, is chosen as it provides Python bindings and provides an ecosystem of open-source LLMs.

## 1. Import Packages

In [6]:
import pandas as pd
from gpt4all import GPT4All
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.metrics import accuracy_score, f1_score
from textstat import flesch_reading_ease

## 2. Load Data

In [2]:
test_dataframe = pd.read_csv("../data/split/test.csv")
subset_test_dataframe = test_dataframe.sample(n=20, random_state=42)
subset_test_dataframe

Unnamed: 0,rating,text,resp
1783813,2,AYCE Shrimp good value if you like the flavors...,Were so glad you had a great experience at our...
2035152,2,Dwayne was our salesman and he really put fort...,"Hello Barbara, thank you we appreciate your fe..."
2946095,2,Good value on cabinets and flooring .Very nice...,Thank you!
1371890,2,"Great atmosphere, knowledgeable, and friendly....",Thank you much appreciated brother see you bac...
962314,2,Casual atmosphere with great steak and sides. ...,"Thank you, Jake! So glad you love our place! W..."
2467452,0,"If I could give this office zero stars, I woul...",Were sorry to hear about your negative experie...
920144,2,They really get to know your animal! They love...,Hi Kristen Thank you so much for sharing your ...
3125892,2,"Nick is amazing, as well as the rest of the st...",Thank you for your kind review
80766,2,Best place for equine stuff,Thanks Lisa...we appreciate the FIVE Star Revi...
3572085,2,"No indoor dining. Great food, terrific Jamoca ...",Thank you for reaching out to us.


## 3. Evaluation Function

In [3]:
def evaluate_sentiment(
    df: pd.DataFrame, true_labels_col: str, predicted_labels_col: str
) -> pd.DataFrame:
    """
    Evaluates the sentiment classification performance by calculating 
    the accuracy and F1 score based on true and predicted labels.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the true and predicted 
            labels.
        true_labels_col (str): The name of the column containing the true 
            labels.
        predicted_labels_col (str): The name of the column containing the 
            predicted labels.

    Returns:
        pd.DataFrame: The DataFrame with added 'Accuracy' and 'F1_Score' 
            columns.
    """
    df['Accuracy'] = accuracy_score(
        df[true_labels_col], df[predicted_labels_col]
    )
    df['F1_Score'] = f1_score(
        df[true_labels_col], df[predicted_labels_col], average='weighted'
    )
    print(f"Overall Accuracy: {df['Accuracy'].mean()*100:.2f}%")
    print(f"Overall F1 Score: {df['F1_Score'].mean()*100:.2f}%")
    
    return df


def average_sentence_length(text: str) -> float:
    """
    Calculate the average sentence length of a text.

    Parameters:
        text (str): The text to analyze.

    Returns:
        float: The average sentence length.
    """
    sentences = sent_tokenize(text)
    return sum(
        len(word_tokenize(sentence)) for sentence in sentences
    ) / len(sentences)

def average_word_length(text: str) -> float:
    """
    Calculate the average word length of a text.

    Parameters:
        text (str): The text to analyze.

    Returns:
        float: The average word length.
    """
    words = word_tokenize(text)
    return sum(len(word) for word in words) / len(words)

def lexical_diversity(text: str) -> float:
    """
    Calculate the lexical diversity of a text.

    Parameters:
        text (str): The text to analyze.

    Returns:
        float: The lexical diversity.
    """
    words = word_tokenize(text)
    return len(set(words)) / len(words)

def evaluate_generated_response(
    df: pd.DataFrame, response_col: str
) -> pd.DataFrame:
    """
    Evaluates the quality of generated text responses by calculating 
    readability, fluency, and complexity metrics.

    Parameters:
        df (pd.DataFrame): The DataFrame containing the generated responses.
        response_col (str): The name of the column containing the generated 
            responses.

    Returns:
        pd.DataFrame: The DataFrame with added evaluation metrics as columns.
    """
    readability_scores = []
    avg_sentence_lengths = []
    avg_word_lengths = []
    lexical_diversities = []

    for response in df[response_col]:
        # Evaluate Readability
        readability_scores.append(flesch_reading_ease(response))
        
        # Evaluate Complexity
        avg_sentence_lengths.append(average_sentence_length(response))
        avg_word_lengths.append(average_word_length(response))
        lexical_diversities.append(lexical_diversity(response))

    df['Readability_Score'] = readability_scores
    df['Avg_Sentence_Length'] = avg_sentence_lengths
    df['Avg_Word_Length'] = avg_word_lengths
    df['Lexical_Diversity'] = lexical_diversities

    # Calculate and print overall scores
    print(f"Overall Readability Score: {df['Readability_Score'].mean():.2f}")
    print(f"Overall Average Sentence Length: ")
    print(f"{df['Avg_Sentence_Length'].mean():.2f}")
    print(f"Overall Average Word Length: {df['Avg_Word_Length'].mean():.2f}")
    print(f"Overall Lexical Diversity: {df['Lexical_Diversity'].mean():.2f}")

    return df

## 4. Models

a. **Wizard v1.1 - 13B**   
b. **GPT4All Falcon - 7B**   
c. **Hermes - 13B**   
d. **Snoozy - 13B**      
e. **Mini Orca - 7B**     
f. **Mini Orca (Small) - 3B**    
g. **Mini Orca (Large) - 7B**    
h. **Wizard Uncensored - 13B**    
i. **Replit - 3B**    
j. **Starcoder (Small) - 3B**    
k. **Starcoder - 7B**    
l. **Llama-2-7B Chat**     

<div style="background-color: black; color: white; padding: 10px">
    <p><b>In this notebook, I will experiment with a couple of the available models to test the prompt template as well as finalize any post-processing steps required:

1. Llama2-7B Chat
2. GPT4All Falcon
3. Hermes
4. Mini Orca (Large)
    </p>
</div>

## 5. Experiment - Llama-2-7B-Chat-GGML

First, we experiment with LlaMA2 model from Meta AI. It is fine-tuned for dialogue.

**Hugging Face Model card** - https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML


In [9]:
model = GPT4All("llama-2-7b-chat.ggmlv3.q4_0.bin")

Found model file at  /Users/Alpha/.cache/gpt4all/llama-2-7b-chat.ggmlv3.q4_0.bin


llama.cpp: using Metal
llama.cpp: loading model from /Users/Alpha/.cache/gpt4all/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4013.73 MB (+ 1024.00 MB per state)
llama_new_context_w

**Prompt function to perform Sentiment Classification and Response Generation Task**:

In [32]:
def use_llama_prompt(review: str) -> str:
    prompt = (
        "[INST] <<SYS>>\n"
        "You are a helpful and attentive business owner who always "
        "appreciates customer feedback. If the customer had a positive "
        "experience, make sure to thank them. If the customer had a "
        "negative or neutral experience, apologize and address their "
        "concerns respectfully and constructively. "
        "Do limit your response to 3 sentences.\n"
        "<</SYS>>\n\n"
        "Please classify the following customer review as either "
        f"\"Positive,\" \"Negative,\" or \"Neutral\" and provide a business "
        f"owner's response accordingly. The customer review is: \"{review}\"\n"
        "Format your output in JSON format: "
        "{\"sentiment\": \"<Sentiment>\", \"response\": \"<Response>\"}\n"
        "[/INST]\n\n "
    )

    output = model.generate(prompt, max_tokens=150)
    return output.strip()

**Apply function:**

In [6]:
subset_test_dataframe['output'] = subset_test_dataframe['text'].apply(
    use_llama_prompt
)

**Post Processing:**

In [7]:
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: x if x.endswith("}") else x + "}"
)
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: eval(x)
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)

**Print output samples:**

In [8]:
for i, output in enumerate(subset_test_dataframe['output'].iloc[:5]):
    print(f"Index {i}:\n{output}")

Index 0:
{'sentiment': 'Neutral', 'response': "Thank you for taking the time to share your feedback! We appreciate your input and are glad to hear that our AYCE shrimp is a good value. However, we apologize for any inconvenience caused by the service. Please let us know if there's anything else we can do to improve your experience."}
Index 1:
{'sentiment': 'Positive', 'response': "Thank you so much for taking the time to leave us a review! We're thrilled to hear that Dwayne provided excellent service and helped you find the perfect vehicle. Your satisfaction is our top priority, and we appreciate your business!"}
Index 2:
{'sentiment': 'Positive', 'response': "Thank you so much for taking the time to leave us a review! We're thrilled to hear that you found our cabinets and flooring to be good value. Our team works hard to provide excellent service, and we appreciate your recognition. Please don't hesitate to reach out if there's anything else we can help with!"}
Index 3:
{'sentiment': 

**Evaluation:**

In [9]:
# SENTIMENT_MAPPING = {0: "Negative", 1: "Neutral", 2: "Positive"}
SENTIMENT_MAPPING = {"Negative": 0, "Neutral": 1, "Positive": 2}

subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['sentiment'].map(
    SENTIMENT_MAPPING
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)
subset_test_dataframe = evaluate_sentiment(
    subset_test_dataframe, 'rating', 'sentiment'
)
subset_test_dataframe = evaluate_generated_response(
    subset_test_dataframe, 'response'
)

Overall Accuracy: 85.00%
Overall F1 Score: 87.65%
Overall Readability Score: 76.75
Overall Average Sentence Length: 
16.98
Overall Average Word Length: 4.05
Overall Lexical Diversity: 0.81


## 6. Experiment - GPT4All Falcon

Next, we experiment with GPT4All Falcon which is fine-tuned by Nomic AI. It is described by `Nomic AI` as the best overall smaller model with fast responses.

**Hugging Face Model card** - https://huggingface.co/nomic-ai/gpt4all-falcon-ggml

In [10]:
model = GPT4All("ggml-model-gpt4all-falcon-q4_0.bin")

Found model file at  /Users/Alpha/.cache/gpt4all/ggml-model-gpt4all-falcon-q4_0.bin
falcon_model_load: loading model from '/Users/Alpha/.cache/gpt4all/ggml-model-gpt4all-falcon-q4_0.bin' - please wait ...
falcon_model_load: n_vocab   = 65024
falcon_model_load: n_embd    = 4544
falcon_model_load: n_head    = 71
falcon_model_load: n_head_kv = 1
falcon_model_load: n_layer   = 32
falcon_model_load: ftype     = 2
falcon_model_load: qntvr     = 0
falcon_model_load: ggml ctx size = 3872.64 MB
falcon_model_load: memory_size =    32.00 MB, n_mem = 65536
falcon_model_load: ........................ done
falcon_model_load: model size =  3872.59 MB / num tensors = 196


ggml_metal_free: deallocating


**Prompt function to perform Sentiment Classification and Response Generation Task**:

In [27]:
def use_falcon_prompt(review: str) -> str:
    instruction = (
        "You are a helpful and attentive business owner who always "
        "appreciates customer feedback. If the customer had a positive "
        "experience, make sure to thank them. If the customer had a "
        "negative or neutral experience, apologize and address their "
        "concerns respectfully and constructively. "
        "Do limit your response to 3 sentences.\n\n"
        "Please classify the following customer review as either "
        "\"Positive,\" \"Negative,\" or \"Neutral\" and provide a business "
        f"owner's response accordingly. The customer review is: \"{review}\"\n"
        "Strictly format your string output in JSON format: "
        "{\"sentiment\": \"<Sentiment>\", \"response\": \"<Response>\"}"
    )

    falcon_prompt = f"### Instruction:\n{instruction}\n### Response:\n " + "\{"
    output = model.generate(falcon_prompt, max_tokens=150)
    
    return output.strip()

**Apply function:**

In [33]:
subset_test_dataframe['output'] = subset_test_dataframe['text'].apply(
    use_llama_prompt
)
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: x if x.endswith("}") else x + "}"
)
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: x if x.startswith("{") else "{" + x
)
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: eval(x)
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)

**Print output samples**

In [34]:
for i, output in enumerate(subset_test_dataframe['output'].iloc[:5]):
    print(f"Index {i}:\n{output}")

Index 0:
{'sentiment': 'Positive', 'response': 'Thank you for the positive feedback! We are glad to hear that you enjoyed our shrimp. If you have any further questions or concerns, please do not hesitate to contact us.'}
Index 1:
{'sentiment': 'Positive', 'response': 'Thank you for the positive feedback! We are glad to hear that our salesman Dwayne put forth extra effort to ensure your satisfaction with your purchase. Your feedback is valuable and helps us improve our services. If you have any further questions or concerns, please do not hesitate to contact us.'}
Index 2:
{'sentiment': 'Positive', 'response': "Thank you for the positive feedback! We are glad to hear that our products and staff provided good value. If you have any further questions or concerns, please don't hesitate to contact us."}
Index 3:
{'sentiment': 'Positive', 'response': 'Thank you for the positive feedback! We are always happy to hear that our customers enjoy their experience at our shop. Your kind words mean a

**Evaluation:**

In [35]:
subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['sentiment'].map(
    SENTIMENT_MAPPING
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)
subset_test_dataframe = evaluate_sentiment(
    subset_test_dataframe, 'rating', 'sentiment'
)
subset_test_dataframe = evaluate_generated_response(
    subset_test_dataframe, 'response'
)

Overall Accuracy: 95.00%
Overall F1 Score: 92.57%
Overall Readability Score: 76.77
Overall Average Sentence Length: 
13.66
Overall Average Word Length: 4.17
Overall Lexical Diversity: 0.85


## 7. Experiment - Hermes

The next LLM to experiment with is Hermes which is instructed based that is curated with 300,000 uncensored instructions. Trained by Nous Research.


**Hugging Face Model card** - https://huggingface.co/TheBloke/Nous-Hermes-13B-GGML

In [36]:
model = GPT4All("nous-hermes-13b.ggmlv3.q4_0.bin")

100%|██████████| 7.32G/7.32G [10:50<00:00, 11.3MiB/s]


Model downloaded at:  /Users/Alpha/.cache/gpt4all/nous-hermes-13b.ggmlv3.q4_0.bin


llama.cpp: using Metal
llama.cpp: loading model from /Users/Alpha/.cache/gpt4all/nous-hermes-13b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 7477.73 MB (+ 1600.00 MB per state)
llama_new_context_

**Prompt function to perform Sentiment Classification and Response Generation Task**:

In [37]:
def use_hermes_prompt(review):
    instruction = (
        "You are a helpful and attentive business owner who always "
        "appreciates customer feedback. If the customer had a positive "
        "experience, make sure to thank them. If the customer had a "
        "negative or neutral experience, apologize and address their "
        "concerns respectfully and constructively. "
        "Do limit your response to 3 sentences.\n\n"
        "Please classify the following customer review as either "
        "\"Positive,\" \"Negative,\" or \"Neutral\" and provide a business "
        f"owner's response accordingly. The customer review is: \"{review}\"\n"
        "Format your output in JSON format: "
        "{\"sentiment\": \"<Sentiment>\", \"response\": \"<Response>\"}"
    )

    hermes_prompt = f"### Instruction:\n{instruction}\n### Response:\n"
    
    output = model.generate(hermes_prompt, max_tokens=150)
    
    return output.strip()

**Apply function:**

In [38]:
subset_test_dataframe['output'] = subset_test_dataframe['text'].apply(
    use_hermes_prompt
)
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: x if x.endswith("}") else x + "}"
)
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: eval(x)
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)

**Print output samples**

In [39]:
for i, output in enumerate(subset_test_dataframe['output'].iloc[:5]):
    print(f"Index {i}:\n{output}")

Index 0:
{'sentiment': 'Positive', 'response': "Thank you for your feedback! We're glad to hear that our AYCE Shrimp is good value. We will work on improving our service to ensure a better experience next time."}
Index 1:
{'sentiment': 'Positive', 'response': "Thank you for taking the time to share your positive experience with us! We're glad Dwayne was able to help you find a great vehicle, and we appreciate your kind words about our selection. Thank you again for choosing our dealership!"}
Index 2:
{'sentiment': 'Positive', 'response': "Thank you for taking the time to share your experience with us! We're glad that you found our products and staff to be of good value. Your feedback is appreciated, and we hope to continue providing excellent service in the future."}
Index 3:
{'sentiment': 'Positive', 'response': "Thank you for taking the time to share your experience! We're thrilled that our atmosphere, knowledgeable staff and friendly service made your visit a positive one. Please do

**Evaluation:**

In [40]:
subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['sentiment'].map(
    SENTIMENT_MAPPING
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)
subset_test_dataframe = evaluate_sentiment(
    subset_test_dataframe, 'rating', 'sentiment'
)
subset_test_dataframe = evaluate_generated_response(
    subset_test_dataframe, 'response'
)

Overall Accuracy: 95.00%
Overall F1 Score: 93.33%
Overall Readability Score: 76.04
Overall Average Sentence Length: 
15.46
Overall Average Word Length: 4.07
Overall Lexical Diversity: 0.85


## 8. Experiment - Mini Orca (Large)

The next LLM is Mini Orca (Large). GPT4All offers several variation of sizes for Mini Orca - namely 3B, 7B and 13B. The Mini Orca (Large) with 13B parameters will be chosen to represent the entire catalogue of Mini Orca LLM. 

**Hugging Face Model card** - https://huggingface.co/TheBloke/orca_mini_13B-GGML

In [41]:
model = GPT4All("orca-mini-13b.ggmlv3.q4_0.bin")

100%|██████████| 7.32G/7.32G [16:17<00:00, 7.49MiB/s]  


Model downloaded at:  /Users/Alpha/.cache/gpt4all/orca-mini-13b.ggmlv3.q4_0.bin


llama.cpp: using Metal
llama.cpp: loading model from /Users/Alpha/.cache/gpt4all/orca-mini-13b.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_head_kv  = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.11 MB
llama_model_load_internal: mem required  = 7477.72 MB (+ 1600.00 MB per state)
llama_new_context_wi

**Prompt function to perform Sentiment Classification and Response Generation Task**:

In [42]:
def use_mini_orca_prompt(review):
    system = (
        "You are a helpful and attentive business owner who always "
        "appreciates customer feedback. If the customer had a positive "
        "experience, make sure to thank them. If the customer had a "
        "negative or neutral experience, apologize and address their "
        "concerns respectfully and constructively. "
        "Do limit your response to 3 sentences.\n\n"
    )
    instruction = (
        "Please classify the following customer review as either "
        "\"Positive,\" \"Negative,\" or \"Neutral\" and provide a business "
        f"owner's response accordingly. The customer review is: \"{review}\"\n"
        "Format your output in JSON format: "
        "{\"sentiment\": \"<Sentiment>\", \"response\": \"<Response>\"}"
    )

    hermes_prompt = (
        f"### System:\n{system} "
        f"### Instruction:\n{instruction}\n### Response:\n"
    )

    output = model.generate(hermes_prompt, max_tokens=150)

    return output.strip()

**Apply function:**

In [43]:
subset_test_dataframe['output'] = subset_test_dataframe['text'].apply(
    use_mini_orca_prompt
)
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: x if x.endswith("}") else x + "}"
)
subset_test_dataframe['output'] = subset_test_dataframe['output'].apply(
    lambda x: eval(x)
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)

**Print output samples**

In [44]:
for i, output in enumerate(subset_test_dataframe['output'].iloc[:5]):
    print(f"Index {i}:\n{output}")

Index 0:
{'sentiment': 'Negative', 'response': 'Thank you for taking the time to share your feedback. We are sorry to hear that you had a negative experience with our service. Please let us know how we can improve and we will take appropriate measures to ensure that this does not happen again in the future.'}
Index 1:
{'sentiment': 'Positive', 'response': 'Thank you for taking the time to share your feedback with us. We are glad to hear that you had a positive experience with our salesman, Dwayne. He is always looking out for his customers and going above and beyond to ensure their satisfaction. We appreciate your business and hope to see you again soon.'}
Index 2:
{'sentiment': 'Positive', 'response': 'Thank you for taking the time to leave us a review. We are glad to hear that you had a positive experience with our products and services. Your satisfaction is important to us, and we will continue to strive for excellence in providing quality products and excellent customer service.'}


**Evaluation:**

In [45]:
subset_test_dataframe['sentiment'] = subset_test_dataframe['output'].apply(
    lambda x: x['sentiment']
)
subset_test_dataframe['sentiment'] = subset_test_dataframe['sentiment'].map(
    SENTIMENT_MAPPING
)
subset_test_dataframe['response'] = subset_test_dataframe['output'].apply(
    lambda x: x['response']
)
subset_test_dataframe = evaluate_sentiment(
    subset_test_dataframe, 'rating', 'sentiment'
)
subset_test_dataframe = evaluate_generated_response(
    subset_test_dataframe, 'response'
)

Overall Accuracy: 65.00%
Overall F1 Score: 73.11%
Overall Readability Score: 70.83
Overall Average Sentence Length: 
14.60
Overall Average Word Length: 4.31
Overall Lexical Diversity: 0.82


## 9. Overall Results

**LlaMA2-7B Chat:**

Overall Accuracy: `85.00%`   
Overall F1 Score: `87.65%`     
Overall Readability Score: `76.75` (Flesch Reading Ease Score - 0 to 100)
Overall Average Sentence Length: `16.98` words per sentence   
Overall Average Word Length: `4.05` Characters per Word    
Overall Lexical Diversity: `0.81` (Lexical Diversity Ratio - 0 to 1)   
Time taken: `4 minutes 45 seconds`

**GPT4All Falcon:**

Overall Accuracy: `95.00%`
Overall F1 Score: `92.57%`    
Overall Readability Score: `76.77` (Flesch Reading Ease Score - 0 to 100)   
Overall Average Sentence Length: `13.66` words per sentence   
Overall Average Word Length: `4.17` Characters per Word   
Overall Lexical Diversity: `0.85` (Lexical Diversity Ratio - 0 to 1)   
Time taken: ~8 minutes

**Hermes:**

Overall Accuracy: `95.00%`    
Overall F1 Score: `93.33%`        
Overall Readability Score: `76.04` (Flesch Reading Ease Score - 0 to 100)    
Overall Average Sentence Length: `15.46` words per sentence   
Overall Average Word Length: `4.07` Characters per Word   
Overall Lexical Diversity: `0.85` (Lexical Diversity Ratio - 0 to 1)   
Time taken: `~8 minutes`

**Mini Orca (Large):**

Overall Accuracy: `65.00%`     
Overall F1 Score: `73.11%`    
Overall Readability Score: `70.83` (Flesch Reading Ease Score - 0 to 100)    
Overall Average Sentence Length: `14.60` words per sentence   
Overall Average Word Length: `4.31` Characters per Word   
Overall Lexical Diversity: `0.82` (Lexical Diversity Ratio - 0 to 1)   
Time taken: `~8 minutes`

## 10. Insights

**Best Accuracy score:** `GPT4All Falcon` & `Hermes`


**Best F1 score:** `Hermes`


**Best Readability score:** `LlaMa2-7B Chat`


**Shortest Average Sentence Length:** `LlaMa2-7B Chat`


**Longest Average Sentence Length:** `Hermes`


**Shortest Average Word Length:** `LlaMa2-7B Chat`


**Longest Average Word Length:** `Mini Orca (Large)`


**Most Diverse output:** `GPT4All Falcon` & `Hermes`


**Least Diverse output:** `LlaMa2-7B Chat`


**Fastest Model:** `LlaMA2-7B Chat`


**Human Evaluation of Generated Response:**


- All the above models are able to generate proper responses corresponding to the predicted sentiment of input customer reviews
- With the exception of `Mini Orca (Large)`, all the remaining LLMs are able to generate response containing references to the input review which adds to the personalised response and not a system generated response
- For `LlaMA2-7B Chat` and `Hermes`, both LLMs notably reference highly specific food menu item mentioned in the input review
- With the exception of `Mini Orca (Large)`, all the remaining LLMs achieved similar performance in Accuracy and F1 score for sentiment classification task