# Evaluating the LLM Models
Mahan Madani - Mohammad Mehdi Begmaz

## Load basic libraries and models

In [10]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

import pandas as pd
import numpy as np
import torch

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer)

import evaluate
from evaluate import load
from rouge_score import rouge_scorer
import jiwer

In [2]:
df = pd.read_csv("./dataset/BG3_reviews_preprocessed.csv")  # load the preprocessed version of the dataset
print(df.columns)
print(df.shape)

Index(['review', 'voted_up', 'votes_up', 'votes_funny', 'weighted_vote_score',
       'word_count', 'profanity'],
      dtype='object')
(10000, 7)


In [3]:
model_v1 = AutoModelForCausalLM.from_pretrained("./model/v1")
tokenizer_v1 = AutoTokenizer.from_pretrained("./model/v1")

In [4]:
model_v2 = AutoModelForCausalLM.from_pretrained("./model/v2")
tokenizer_v2 = AutoTokenizer.from_pretrained("./model/v2")

In [5]:
model_v3 = AutoModelForCausalLM.from_pretrained("./model/v3")
tokenizer_v3 = AutoTokenizer.from_pretrained("./model/v3")

In [6]:
model_v4 = AutoModelForCausalLM.from_pretrained("./model/v4")
tokenizer_v4 = AutoTokenizer.from_pretrained("./model/v4")

In [7]:
def generate_samples(model, tokenizer, count=10):
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    sample = []
    for i in range(count):
        generated_text = model.generate(do_sample=True, top_k=50, top_p=0.95, pad_token_id=tokenizer.pad_token_id, max_new_tokens=200)
        generated_text = tokenizer.decode(generated_text[0], skip_special_tokens=True)
        print(generated_text)
        sample.append(generated_text)
    
    return sample

In [8]:
from transformers.utils import logging
import transformers

logging.set_verbosity(transformers.logging.ERROR)

## Evaluating the text generator models

### Evaluation using  ROUGE

In [41]:
def evaluate_rouge(reference_texts, generated_text):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge_scores = scorer.score(' '.join(reference_texts), ' '.join(generated_text))
    print(f"\nROUGE Scores: {rouge_scores}")

In [42]:
evaluate_rouge(df['review'].sample(100), generate_samples(model_v1, tokenizer_v1, 5))

the game is a best one game. of the game is the game that country. it that. that with a game is my worlds of the 1979, this game for the game of all the world. i would be a game. it is the new game is a game, i was a [unused708] a few based based of this game is [unused636] i can's gate game is a new game, the most for the game. it. the game is a few in the game is a lot in for the game. of theour and it!, but it're a world of the game'm a bit game and your going. there is ー it was the game is the game of the game. it is a bit with a world. this is a lot of a good of the game is a great at it has a game [unused859] the game. you can have a senior i can'm the game and i can't play that a game..
the best game game. i've played of the game of the game is a [unused656] good of what i am game. i get this game is played to be me to get this game! to your played, this game as there are in the characters. this game is the game, and the game is a few [unused253] a examination the game has still

In [52]:
evaluate_rouge(df['review'].sample(100), generate_samples(model_v2, tokenizer_v2, 5))

Sitting down to play through it, the game is well done, it's just buggy at times but all of that is confirmed with feedback that i've been very excited to play, the game is ready to kick it off as soon as i get an actual pc, so i can definitely recommend it.  update it is very well done and i'm very impressed by how well it has done with its content! even with some minor issues the quality is so good. as an avid dd player this game is a game of a lifetime. with all the bugs and crashes you could potentially encounter, the game is well worth the price.   update as of now,  the game is in early access. this is one of the few games where you will be able to level the overall level or simply the individual quest levels very easily. this also means you can always go on a quest to improve the level up and progress in each progression. the only other option is to explore specific paths in the
Might this one be good?  i think it would be interesting in a way that would be interesting to see, b

In [53]:
evaluate_rouge(df['review'].sample(100), generate_samples(model_v3, tokenizer_v3, 5))

For the first time in gaming, i can say that this game is absolutely amazing! the replayability is unparalleled, and the choices and actions matter so far. the replayability is amazing, the story is compelling, the combat is challenging, i like a challenge that is very personal, and i love every single situation, even the simple one.  i have to say, if you haven't already had your chance at this title, it is a very enjoyable experience and i can't recommend it enough!   edit after spending 3 hours and 45 minutes on this game, i'm finally excited to play it. i don't want to spoil things by waiting till after launch, but i have enjoyed the game so far, and its honestly one of my favorites. i love how the choices have really influenced how i play this game. i love how the replayability is so low, and if i had any doubts or issues, i can say that my doubts were confirmed.   i highly recommend
The original baldur's gate games, as well as the baldur's gate 4 series of games that preceded the

In [57]:
evaluate_rouge(df['review'].sample(100), generate_samples(model_v4, tokenizer_v4, 5))

"if you're going to play this game, you need this game."  larian studios is working on a rpg based on dd. it is amazing. you can do it in about an hour.  the game has tons of content and you can go to any location to get all sorts of content.  the story is pretty amazing and if you're reading the reviews on here, you've seen all kinds of stuff, so why waste any time on a game that's so good?   the graphics are amazing. everything feels very realistic. it seems like every character has a different hairstyle or outfit.  the dialogue is amazing. i was really hoping for more character dialog, but after reading all the reviews, i wasn't sure.   my only problem is that the story isn't very detailed. i found myself thinking about how to avoid having the adventure in front of me, which makes it feel less exciting. if you play through the early access phase, you'll
A few things to note about this game, it is the first game to completely rerelease the release date for this one.  it is also the f

### Evaluation using perplexity

In [None]:
def evaluate_perplexity(reference_texts, generated_text):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge_scores = scorer.score(' '.join(reference_texts), ' '.join(generated_text))
    print(f"\nROUGE Scores: {rouge_scores}")

In [None]:
evaluate_perplexity(df['review'].sample(100), generate_samples(model_v1, tokenizer_v1, 5))

In [1]:
# from transformers import pipeline

# def calculate_perplexity(model, tokenizer, text):
#     text_ids = tokenizer.encode(text, return_tensors="pt")
#     logits = model(text_ids).logits
#     perplexity = pipeline("text-generation").compute_perplexity(logits)
#     return perplexity

# perplexity = calculate_perplexity(model_v1, tokenizer_v1, "Your generated text goes here.")
# print(f"Perplexity: {perplexity}")


In [None]:
perplexity = load("perplexity", module_type="metric")
results = perplexity.compute(predictions=generated_text, model_id='gpt2')

In [None]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Load your fine-tuned model
model_name = "your_fine_tuned_model_name"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example prompt
prompt = "Once upon a time..."

# Tokenize the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate text
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
generated_text = generator(prompt, max_length=50, num_return_sequences=1)[0]['generated_text']

# Compute perplexity
perplexity = generator.compute_perplexity(generated_text)
print(f"Perplexity: {perplexity}")


### Evaluation using WER

In [14]:
def evaluate_wer(generated_text, reference_text):
    # Tokenize the texts into words
    generated_tokens = ' '.join(generated_text).split()
    reference_tokens = ' '.join(reference_text).split()

    # Calculate WER
    wer_score = jiwer.wer(reference_tokens, generated_tokens)
    print(f"WER Score: {wer_score}")

In [15]:
evaluate_wer(df['review'].sample(3), generate_samples(model_v1, tokenizer_v1, 3))

[unused24] a best game. this game of the game is the best is a game. the best is the ° the dissertation of the story in the best. it is the [unused253]. i am early access to be one is network of the game. and the dnd, no played to play you can's in the first is no game is a good of your story a game, and i can's the game... and the most based game is my few the game of a game at the game is a bit of a lot that i's example.. i's a new review, the character out with to play. i can be a few game and i can's gate is a little, and the combat, i's a lot, there's a lot. you م me, and play you're be the game has a lot of the game. the game't be a game. you'm out. theour. i '
ˤ. and it, and i like the game is a lot and i can be a world! i can be early access is. this game. if you can say the story, the game, you can activation the game is a worlds and i can's been all the game. the game i't be a lot, and i can's gate game. i can's still a game is a lot is [unused790] it. it've played.. i can ge

ValueError: After applying the transforms on the reference and hypothesis sentences, their lengths must match. Instead got 452 reference and 265 hypothesis sentences.