In [55]:
pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [64]:
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering
import pandas as pd



# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("deepset/deberta-v3-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/deberta-v3-base-squad2")

# Define the question answering pipeline
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

In [65]:
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Using CPU")

Using GPU: Tesla T4


In [66]:
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering
import pandas as pd

# Load the dataset
df = pd.read_csv("train.csv")

# Create the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("deepset/deberta-v3-base-squad2")
model = AutoModelForQuestionAnswering.from_pretrained("deepset/deberta-v3-base-squad2")

# Define the question answering pipeline
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer, batch_size = 1, device = 0)

# Loop through each row in the dataset and train the model
for index, row in df.iterrows():
    context = row["targetParagraphs"] +" "+ str(row["targetTitle"]) +" "+ str(row["targetDescription"])
    question = row["postText"]
    answer = row["spoiler"]
    qa_pipeline(context=context, question=question, answer=answer)



In [67]:
model.save_pretrained("/newmodel")

In [68]:

from transformers import AutoModelForQuestionAnswering
from google.colab import drive
import pandas as pd

# Mount Google Drive
drive.mount('/content/draive')

Drive already mounted at /content/draive; to attempt to forcibly remount, call drive.mount("/content/draive", force_remount=True).


In [71]:

# Use the trained model to generate responses

question = "Five Nights at Freddy’s Sequel Delayed for Weird Reason Five Nights at Freddy's creator Scott Cawthon takes to Steam to tease a possible delay for Five Nights at Freddy's: Sister Location, the fifth game in the series. "
context = "Five Nights at Freddy’s creator Scott Cawthon takes to Steam to tease a possible delay for Five Nights at Freddy’s: Sister Location, the fifth game in the series., For the past couple of years, horror gaming fans have been able to look forward to one new entry in the Five Nights at Freddy’s series after another, with four core games, one RPG spinoff, and a novel released so far. The next game in the franchise, Five Nights at Freddy’s: Sister Location, was scheduled to release this coming Friday, October 7th, but if developer Scott Cawthon is to be believed, the project has been delayed by a few months.,According to a post by Cawthon on the Five Nights at Freddy’s: Sister Location Steam page, the game is being delayed because it’s too dark. Cawthon said that some of the plot elements are so disturbing that they are making him feel sick, and so he is thinking about delaying the game so that he can rework it entirely \into something kid-friendly.\,Delays happen in the gaming industry all the time, but it’s rare for a game to be delayed mere days before its release. Five Nights at Freddy’s fans are confused and angry about this latest development, as many were looking forward to playing the game on Friday. Something else upsetting fans is Cawthon’s reasoning that the game is too dark to release, as being dark and disturbing are two characteristics that many consumers look for in a horror game.Cawthon’s reason for suddenly delaying Five Nights at Freddy’s Sister Location from its planned October 7th release date doesn’t make much sense. A more likely scenario is that this is just a weird publicity stunt meant to hype the game as being so disturbing that its developer almost didn’t even release it. Alternatively, perhaps Cawthon is delaying the game for technical reasons and decided to concoct this story instead of admitting that the fifth core game in the series has issues.,Fans should also consider the possibility that Cawthon is just trolling in an attempt to throw them off the scent of an early release. Cawthon has a habit of surprising fans by releasing Five Nights at Freddy’s games early, and it wouldn’t be all that shocking for Five Nights at Freddy’s: Sister Location to carry on that tradition, despite Cawthon’s post to the contrary., With October 7th just a few days away, fans will learn soon enough whether or not Cawthon is serious about Sister Location‘s delay. If the game is delayed, it will be interesting to see if Cawthon actually does rework it to be more kid-friendly,\ or if he goes with a slightly altered version of his original vision., Five Nights at Freddy’s: Sister Location is scheduled to launch on October 7th for PC as well as iOS and Android mobile devices., Source: Scott Cawthon"

response = qa_pipeline(context=context, question=question)

print(response["answer"])

 publicity stunt


In [72]:

question = "Say it ain't so! Jon Stewart has set his official departure date from #TheDailyShow"
context =  "Jon Stewart now has a firm departure date from Comedy Central’s \The Daily Show.\ The comic announced on Monday’s broadcast of the program that he will leave the show after its August 6th broadcast.,The disclosure paves the way for the show’s new host, Trevor Noah, and suggests that Stewart will not hang around as candidates start to make announcements this year about running in the 2016 election for U.S. President.Stewart had said previously that he would continue to do the program until some time between July and the end of 2015.,Stewart made the announcement at the very end of the evening’s broadcast, just before rolling the program’s signature \"Moment of Zen\" final segment. He offered few details about what he might do for his final broadcast, but did reiterate a contest that would give a viewer the chance to attend the program’s last taping.,Stewart’s announcement sets the stage for Trevor Noah, a South African comic who has hosted a late-night program in that country, to take the reins of the series. Noah is a relative unknown in the United States and has already come under scrutiny for a series of controversial tweets made in past years that were discovered on social media after Comedy Central announced him as Stewart’s heir.,The Viacom-owned network and Stewart have both come out in support of Noah, urging audiences to give him a chance before passing judgement on his humor."
   
response = qa_pipeline(context=context, question=question)

print(response["answer"])

 August 6th


In [73]:

question = "what is the date?"
context =  "Jon Stewart now has a firm departure date from Comedy Central’s \The Daily Show.\ The comic announced on Monday’s broadcast of the program that he will leave the show after its August 6th broadcast.,The disclosure paves the way for the show’s new host, Trevor Noah, and suggests that Stewart will not hang around as candidates start to make announcements this year about running in the 2016 election for U.S. President.Stewart had said previously that he would continue to do the program until some time between July and the end of 2015.,Stewart made the announcement at the very end of the evening’s broadcast, just before rolling the program’s signature \"Moment of Zen\" final segment. He offered few details about what he might do for his final broadcast, but did reiterate a contest that would give a viewer the chance to attend the program’s last taping.,Stewart’s announcement sets the stage for Trevor Noah, a South African comic who has hosted a late-night program in that country, to take the reins of the series. Noah is a relative unknown in the United States and has already come under scrutiny for a series of controversial tweets made in past years that were discovered on social media after Comedy Central announced him as Stewart’s heir.,The Viacom-owned network and Stewart have both come out in support of Noah, urging audiences to give him a chance before passing judgement on his humor."
   
response = qa_pipeline(context=context, question=question)

print(response["answer"])

 August 6th


In [74]:

question = "How big is Justin Bieber's dick really?"
context =  "A single question now plagues the minds of all Americans, weighing down our brains as we slump in our office chairs, then slump in our cars, then slump in our couches, and then slump into bed: how big is Justin Bieber's penis really? The swaggy lil pop star and his cavalry of minders would have us believe that Justin Bieber has a huge dickLast week, Calvin Klein released photos of Bieber modeling their underwear for a new ad campaign. One memorable shot showed off the singer's protruding package in arresting profile. Shortly after the photos hit the Internet, a web site called Breathe Heavy posted what it claimed was the same image prior to re-touching. If that claim were accurate, it would mean that Calvin Klein (well, not him personally, although maybe) had stuffed Bieber's stocking nearly to bursting. Here are the two images side-by-side:, Bieber's team immediately insisted that Breathe Heavy's photo was fake, and requested the web site take it down. Breathe Heavy complied, originally replacing the photos with an editor's note, but eventually removing the entire post altogether. In that since-deleted note, Breathe Heavy's editor seems to accept Bieber's explanation at gunpoint.,Bieber denies the photo is real, and I respect that and will believe him.,The question, therefore, is: Are the claims of the Bieber camp correct, and the photo fake?,Or did Breathe Heavy have the real photo, and capitulate in the face of legal intimidation?,(It's easy to make a case that Breathe Heavy's photos are the real deal: We know that at least one photo was significantly retouched prior to publication, as Bieber's camp did not dispute an earlier TMZ story alleging that Calvin Klein sculpted Bieber's pecs, filled out his abs and bestowed him pubes in this ad from the campaign; furthermore, virtually every celebrity photoshoot in America gets touched up at some point. Why would Bieber's dick be a grand outlier?),But in many ways this dispute is just a lead-in to an essential American question: What exactly is Bieber packing?, Let's be true detectives.,This is a screencap taken from a video of the Calvin Klein shoot that Bieber himself posted to Instagram. Here we have a direct, unaltered view of his package and can plainly see that it looks quite different than the massive knot he is sporting in the photo advertisement. Front-bulge will almost always look less impressive than side-bluge, granted, and this is a fine bulge, certainly, but one that seems far off of Calvin Klein's idealized Burmese python.,Last September, Bieber appeared onstage at the Fashion Rocks concert. For some reason he stripped down to his underwear, which produced a number of generally alarming photos such as this one.,There are a number of things we can glean from this photo. One is that Justin Bieber has muscles. Look at the strong boy! Another is that his happy trail does indeed appear to stop abruptly right about where it does in the pre-Photoshop version of the Calvin Klein shot in which a model gropes him. But because Bieber wore jet black briefs that reveal no hint of bulge, this photo doesn't help us understand how big his dick actually is.,For that, we must consult more candid shots.,In 2013, Bieber went to Hawaii and jumped off a cliff. After exiting the water, he was photographed walking on the beach, resulting in the image you see here:,This is perhaps the most revealing shot of Bieber's bulge in the wild. Does it look exceedingly large? I'd say not. In fact, it looks like any man's normal penis. Of course, it should be noted that it's unfair to judge a dick by what it looks like immediately after being submerged in the sea. However we can only work with the materials we have.,Next we will consult a Tumblr called Justin Bieber's Bulge, a blog \"dedicated to Justin Bieber's glorious, wonderful bulge,\" which is not run by me. For a Tumblr devoted to one man's dick, it's a pretty boring blog, but there is one compelling photo.,Here is a fan shot of Justin Bieber in concert, his leather drop-crotch pants dropped well below his crotch. We can see a hint of bulge, and from this angle it does not look like Justin Bieber is trying to smuggle a butternut squash through airport security, as Calvin Klein might want us to believe.,That is evidence supporting the theory that Justin Bieber is adequately endowed. Arguing in favor of Justin Bieber's alleged big dick are two people: Tati Neves, a Brazilian model, and Bieber's trainer Patrick Nilsson. These two claim to have seen Bieber's flesh in the flesh, and if we're to believe them, Calvin Klein has staked its reputation on the right massive dong.,Neves claims to have slept with Bieber during his infamous Brazilian sex romp. Here is what she told a British tabloid about Bieber's D:,Speaking to The Sun, Brazilian model Tati Nevas said: \Take it from me, he's well endowed - and very good in bed.\,Nilsson, meanwhile, was shuttled out to do damage control in the wake of the Calvin Klein Photoshop controversy. Here, according to Breathe Heavy, is his assessment,And to make up it, here's a new quote from Justin's trainer Patrick Nilsson, who says JB is packing. \I can definitely confirm that he is a well-endowed guy. I sound weird saying that, but yes.,Indeed you do.,Two people claim to have personal connections with Justin Bieber's dick and claim it is large, but one is on Justin Bieber's payroll. While we will consider their opinions, the overwhelming visual evidence suggests that Justin Bieber's penis is perfectly average—large enough to adequately fill out a pair of briefs, but not so large that it could arouse envy and terror when plastered across sprawling billboards, or choke a cow, without enhancement.,In this case, it appears, Justin Bieber is the same as any man.,Still, we don't know for sure, and here is where we turn to you, our readers: Have you ever had sex with Justin Bieber? Have you ever seen his dick? Do you know someone who has? Are you Scooter Braun? Let's settle this debate once and for all. Email me at jordan@gawker.com or leave a comment below."

response = qa_pipeline(context=context, question=question)

print(response["answer"])

 Burmese python.,Last


In [75]:

import pandas as pd
df_validation = pd.read_csv("validation.csv")

In [76]:
spoliers = []

for index, row in df_validation.iterrows():
    context = row["targetParagraphs"]+ str(row["targetTitle"]) +" "+ str(row["targetDescription"])
    question = row["postText"]
    response = qa_pipeline(context=context, question=question)
    spoliers.append(response["answer"])
    print(response["answer"])




 Scott CawthonFive
 the judge didn’t want to see Arpaio reelected.
 no tip necessary.
 candy corn,
 photography.
 below.Sprite
 176
 jealous of McGonagall,
 He gives a definitive answer.
 Well, it couldn't have been any faker
 below.Elettra Wiedemann,
 Anthony Bourdain
 Myriam Ducre-Lemay,
 Does she think Simpson really did it?
 Beverly Hills,
 naled
 birds rarely or never fart.
 despicable thing
 squalor
 and...
 later in the day worked the same number of hours as those who started earlier,
 a hormone released by the brain that helps regulate the body's circadian rhythm
 cheesy.'"Why There Will Probably
 four surgeries
 Apple iOS 9.3.2 is bricking iPad Pros.
 Stace Nelson,
 £1.7 million
 If they ever did decide to marry, it would further complicate their business relationships.
 we can't have nice things.
 Frito's, Cheetos
 Storyful.
 Edward Gorey
 Rag & Bone's
 pixie cut.
 homemade oven cleaner,'
 congressmen wouldn’t have to do the dirty work."
 nan
 Twitter.29 Of The Most Beautiful

In [77]:
df_validation['generated_spoiler'] = spoliers

In [78]:


pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [79]:

import pandas as pd
from nltk.tokenize import TreebankWordTokenizer

# Define the tokenizer function
tokenizer = TreebankWordTokenizer().tokenize

# Tokenize the generated spoilers
df_validation['tokenized_generated_spoiler'] = df_validation['generated_spoiler'].apply(tokenizer)

In [80]:
df_validation.to_csv('calculateBleu.csv', index=False)

In [81]:

import pandas as pd

# Load the dataset
df_validation = pd.read_csv("calculateBleu.csv")

In [82]:

import nltk
# Calculate the BLEU score

true_spoiler = df_validation['tokenized_spoiler'].tolist()
myoutput = df_validation['tokenized_generated_spoiler'].tolist()
bleu_score = nltk.translate.bleu_score.corpus_bleu(true_spoiler,myoutput)

print("BLEU score: ", bleu_score)

BLEU score:  1.3516735709644542e-231


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


In [83]:

pip install seqeval

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [84]:

from nltk.metrics import f_measure, precision, recall

# Define true labels and predicted labels as lists of sentences
y_true = df_validation['spoiler']
y_pred = df_validation['generated_spoiler']


# Tokenize the sentences and convert them to tuples of words
tokenize = lambda sent: tuple(sent.split())
y_true_tok = [tokenize(sent) for sent in y_true]
y_pred_tok = [tokenize(str(sent)) for sent in y_pred]

# Calculate precision, recall and F1 score
precision_score = precision(set(y_true_tok), set(y_pred_tok))
recall_score = recall(set(y_true_tok), set(y_pred_tok))
f1_score = f_measure(set(y_true_tok), set(y_pred_tok))

print("Precision: {:.2f}".format(precision_score))
print("Recall: {:.2f}".format(recall_score))
print("F1 score: {:.2f}".format(f1_score))


Precision: 0.06
Recall: 0.06
F1 score: 0.06


In [85]:
pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [86]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [87]:

pip install bert_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [88]:

import pandas as pd
from bert_score import score


generated_spoilers = df_validation['generated_spoiler'].tolist()
true_spoilers = df_validation['spoiler'].tolist()

# Convert any floats to strings
generated_spoilers = [str(s) for s in generated_spoilers]
true_spoilers = [str(s) for s in true_spoilers]

# Calculate the BERT score for each pair of sentences
scores = score(generated_spoilers, true_spoilers, lang='en', verbose=False)

# Extract precision, recall, and F1 scores
precision, recall, f1 = scores

# Print average scores
print(f"BERT precision score: {precision.mean():.4f}")
print(f"BERT recall score: {recall.mean():.4f}")
print(f"BERT F1 score: {f1.mean():.4f}")

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BERT precision score: 0.8658
BERT recall score: 0.8476
BERT F1 score: 0.8559


In [89]:

pip install meteor_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
[31mERROR: Could not find a version that satisfies the requirement meteor_score (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for meteor_score[0m[31m
[0m

In [90]:
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [91]:
import nltk
from nltk.translate import meteor_score

# Example reference and hypothesis sentences
ref = df_validation['spoiler'].tolist()
hyp = df_validation['generated_spoiler'].tolist()

# Tokenize the reference and hypothesis sentences
ref_tokens = [nltk.word_tokenize(sentence) for sentence in ref]
hyp_tokens = nltk.word_tokenize(hyp[0])

# Calculate the METEOR score
score = meteor_score.meteor_score(ref_tokens, hyp_tokens)

print(score)


0.0


In [92]:
import nltk

bleu_scores = []

for index, row in df_validation.iterrows():
    if row['tags'] == "phrase":
        true_spoiler = row['tokenized_spoiler']
        myoutput = row['tokenized_generated_spoiler']
        bleu_score = nltk.translate.bleu_score.sentence_bleu([true_spoiler], myoutput)
        bleu_scores.append(bleu_score)

average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f"Average BLEU score for phrase: {average_bleu_score}")

Average BLEU score for phrase: 0.3577942853248229


In [93]:
import nltk

bleu_scores = []

for index, row in df_validation.iterrows():
    if row['tags'] == "passage":
        true_spoiler = row['tokenized_spoiler']
        myoutput = row['tokenized_generated_spoiler']
        bleu_score = nltk.translate.bleu_score.sentence_bleu([true_spoiler], myoutput)
        bleu_scores.append(bleu_score)

average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f"Average BLEU score for passage: {average_bleu_score}")

Average BLEU score for passage: 0.13267946841380449


In [94]:
import nltk

bleu_scores = []

for index, row in df_validation.iterrows():
    if row['tags'] == "multi":
        true_spoiler = row['tokenized_spoiler']
        myoutput = row['tokenized_generated_spoiler']
        bleu_score = nltk.translate.bleu_score.sentence_bleu([true_spoiler], myoutput)
        bleu_scores.append(bleu_score)

average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print(f"Average BLEU score for multi: {average_bleu_score}")

Average BLEU score for multi: 0.08540943768291809


In [95]:
import nltk
from nltk.translate import meteor_score
import re

meteor_scores = []

for index, row in df_validation.iterrows():
    if row['tags'] == "phrase":
        ref = row['spoiler']
        hyp = row['generated_spoiler']

        # Check if ref and hyp are strings
        if type(ref) != str or type(hyp) != str:
            continue

        # Check if ref and hyp are empty or have length 0
        if len(ref) == 0 or len(hyp) == 0:
            continue

        # Remove non-ascii characters and special characters
        ref = re.sub(r'[^\x00-\x7F]+',' ', ref)
        hyp = re.sub(r'[^\x00-\x7F]+',' ', hyp)
        ref = re.sub(r'[^a-zA-Z0-9\s]','', ref)
        hyp = re.sub(r'[^a-zA-Z0-9\s]','', hyp)

        ref_tokens = nltk.word_tokenize(ref)
        hyp_tokens = nltk.word_tokenize(hyp)

        score = meteor_score.meteor_score([ref_tokens], hyp_tokens)
        meteor_scores.append(score)

average_meteor_score = sum(meteor_scores) / len(meteor_scores)
print(f"Average METEOR score phrase: {average_meteor_score}")

Average METEOR score phrase: 0.3261704874537227


In [96]:
import nltk
from nltk.translate import meteor_score
import re

meteor_scores = []

for index, row in df_validation.iterrows():
    if row['tags'] == "passage":
        ref = row['spoiler']
        hyp = row['generated_spoiler']

        # Check if ref and hyp are strings
        if type(ref) != str or type(hyp) != str:
            continue

        # Check if ref and hyp are empty or have length 0
        if len(ref) == 0 or len(hyp) == 0:
            continue

        # Remove non-ascii characters and special characters
        ref = re.sub(r'[^\x00-\x7F]+',' ', ref)
        hyp = re.sub(r'[^\x00-\x7F]+',' ', hyp)
        ref = re.sub(r'[^a-zA-Z0-9\s]','', ref)
        hyp = re.sub(r'[^a-zA-Z0-9\s]','', hyp)

        ref_tokens = nltk.word_tokenize(ref)
        hyp_tokens = nltk.word_tokenize(hyp)

        score = meteor_score.meteor_score([ref_tokens], hyp_tokens)
        meteor_scores.append(score)

average_meteor_score = sum(meteor_scores) / len(meteor_scores)
print(f"Average METEOR score passage: {average_meteor_score}")

Average METEOR score passage: 0.09561982154726002


In [97]:
import nltk
from nltk.translate import meteor_score
import re

meteor_scores = []

for index, row in df_validation.iterrows():
    if row['tags'] == "multi":
        ref = row['spoiler']
        hyp = row['generated_spoiler']

        # Check if ref and hyp are strings
        if type(ref) != str or type(hyp) != str:
            continue

        # Check if ref and hyp are empty or have length 0
        if len(ref) == 0 or len(hyp) == 0:
            continue

        # Remove non-ascii characters and special characters
        ref = re.sub(r'[^\x00-\x7F]+',' ', ref)
        hyp = re.sub(r'[^\x00-\x7F]+',' ', hyp)
        ref = re.sub(r'[^a-zA-Z0-9\s]','', ref)
        hyp = re.sub(r'[^a-zA-Z0-9\s]','', hyp)

        ref_tokens = nltk.word_tokenize(ref)
        hyp_tokens = nltk.word_tokenize(hyp)

        score = meteor_score.meteor_score([ref_tokens], hyp_tokens)
        meteor_scores.append(score)

average_meteor_score = sum(meteor_scores) / len(meteor_scores)
print(f"Average METEOR score multi: {average_meteor_score}")

Average METEOR score multi: 0.07868105273778735


In [98]:
pip install bert_score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [99]:
import pandas as pd
from bert_score import score

total_precision, total_recall, total_f1, count = 0, 0, 0, 0

for index, row in df_validation.iterrows():
    if row['tags'] == "phrase":
        # Convert floats to strings if necessary
        if isinstance(row['generated_spoiler'], float):
            gen_spoiler = str(row['generated_spoiler'])
        else:
            gen_spoiler = row['generated_spoiler']
        if isinstance(row['spoiler'], float):
            spoiler = str(row['spoiler'])
        else:
            spoiler = row['spoiler']

        # Calculate the BERT score for each pair of sentences
        scores = score([gen_spoiler], [spoiler], lang='en', verbose=False)

        # Extract precision, recall, and F1 scores
        precision, recall, f1 = scores

        # Add to running total
        total_precision += precision.mean()
        total_recall += recall.mean()
        total_f1 += f1.mean()
        count += 1

if count > 0:
    # Calculate and print average scores
    avg_precision = total_precision / count
    avg_recall = total_recall / count
    avg_f1 = total_f1 / count
    print(f"BERT precision score (average): {avg_precision:.4f}")
    print(f"BERT recall score (average): {avg_recall:.4f}")
    print(f"BERT F1 score (average): {avg_f1:.4f}")
else:
    print("No rows with 'tags' equal to 'phrase' found in DataFrame")


Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaM

BERT precision score (average): 0.8824
BERT recall score (average): 0.8811
BERT F1 score (average): 0.8810


In [100]:
import pandas as pd
from bert_score import score

total_precision, total_recall, total_f1, count = 0, 0, 0, 0

for index, row in df_validation.iterrows():
    if row['tags'] == "passage":
        # Convert floats to strings if necessary
        if isinstance(row['generated_spoiler'], float):
            gen_spoiler = str(row['generated_spoiler'])
        else:
            gen_spoiler = row['generated_spoiler']
        if isinstance(row['spoiler'], float):
            spoiler = str(row['spoiler'])
        else:
            spoiler = row['spoiler']

        # Calculate the BERT score for each pair of sentences
        scores = score([gen_spoiler], [spoiler], lang='en', verbose=False)

        # Extract precision, recall, and F1 scores
        precision, recall, f1 = scores

        # Add to running total
        total_precision += precision.mean()
        total_recall += recall.mean()
        total_f1 += f1.mean()
        count += 1

if count > 0:
    # Calculate and print average scores
    avg_precision = total_precision / count
    avg_recall = total_recall / count
    avg_f1 = total_f1 / count
    print(f"BERT precision score (average): {avg_precision:.4f}")
    print(f"BERT recall score (average): {avg_recall:.4f}")
    print(f"BERT F1 score (average): {avg_f1:.4f}")
else:
    print("No rows with 'tags' equal to 'phrase' found in DataFrame")

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaM

BERT precision score (average): 0.8546
BERT recall score (average): 0.8314
BERT F1 score (average): 0.8423


In [101]:
import pandas as pd
from bert_score import score

total_precision, total_recall, total_f1, count = 0, 0, 0, 0

for index, row in df_validation.iterrows():
    if row['tags'] == "multi":
        # Convert floats to strings if necessary
        if isinstance(row['generated_spoiler'], float):
            gen_spoiler = str(row['generated_spoiler'])
        else:
            gen_spoiler = row['generated_spoiler']
        if isinstance(row['spoiler'], float):
            spoiler = str(row['spoiler'])
        else:
            spoiler = row['spoiler']

        # Calculate the BERT score for each pair of sentences
        scores = score([gen_spoiler], [spoiler], lang='en', verbose=False)

        # Extract precision, recall, and F1 scores
        precision, recall, f1 = scores

        # Add to running total
        total_precision += precision.mean()
        total_recall += recall.mean()
        total_f1 += f1.mean()
        count += 1

if count > 0:
    # Calculate and print average scores
    avg_precision = total_precision / count
    avg_recall = total_recall / count
    avg_f1 = total_f1 / count
    print(f"BERT precision score (average): {avg_precision:.4f}")
    print(f"BERT recall score (average): {avg_recall:.4f}")
    print(f"BERT F1 score (average): {avg_f1:.4f}")
else:
    print("No rows with 'tags' equal to 'phrase' found in DataFrame")

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaM

BERT precision score (average): 0.8519
BERT recall score (average): 0.8054
BERT F1 score (average): 0.8275
