Initial Setup

In [10]:
%pip install transformers torch sentencepiece rouge-score nltk


Note: you may need to restart the kernel to use updated packages.


Problem 1

In [9]:
import json
import random
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

def setup_model():
    # Load the model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
    model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
    return tokenizer, model

def load_sample_data():
    try:
        # Read the file as JSON Lines: one JSON object per line
        with open("TimeTravel/train_supervised_small.json", 'r', encoding='utf-8') as f:
            content = f.read()

        # Parse each non-empty line as a separate JSON object
        data = []
        for line in content.strip().split('\n'):
            if line.strip():  # Skip empty lines
                try:
                    data.append(json.loads(line))
                except json.JSONDecodeError:
                    # Report the malformed line and keep going
                    print(f"Warning: Couldn't parse line: {line[:50]}...")
        
        if not data:
            raise ValueError("No valid JSON objects found in the file")
        
        # Select two random datapoints
        random.seed(42)  # For reproducibility
        sample_points = random.sample(data, min(2, len(data)))
        
        # Print selected datapoints
        print("\nSelected Datapoints for Evaluation:")
        for i, point in enumerate(sample_points):
            print(f"\nDatapoint {i+1}:")
            print(f"Premise: {point.get('premise', 'N/A')}")
            print(f"Initial: {point.get('initial', 'N/A')}")
            print(f"Original Ending: {point.get('original_ending', 'N/A')}")
            print(f"Counterfactual: {point.get('counterfactual', 'N/A')}")
            print(f"Edited Ending: {point.get('edited_ending', 'N/A')}")
        
        return sample_points
    
    except FileNotFoundError:
        print("Error: Could not find the file 'TimeTravel/train_supervised_small.json'")
        print("Please make sure the file exists in the TimeTravel directory")
        return None
    except Exception as e:
        print(f"Error loading data: {e}")
        print("Check that the file contains one valid JSON object per line")
        return None

if __name__ == "__main__":
    print("Setting up model and tokenizer...")
    tokenizer, model = setup_model()
    print("Loading sample data...")
    sample_points = load_sample_data()
    
    if sample_points is None:
        print("\nPlease verify that:")
        print("1. The TimeTravel directory exists")
        print("2. The file 'train_supervised_small.json' is in the TimeTravel directory")
        print("3. The JSON file is properly formatted")

Setting up model and tokenizer...
Loading sample data...

Selected Datapoints for Evaluation:

Datapoint 1:
Premise: The only thing I had of my dad's was his iguana.
Initial: Since I couldn't spend time with my dad I played with the lizard.
Original Ending: One day there was a solar eclipse. When I got home the lizard was dead. As a dumb kid I believed the eclipse was to blame.
Counterfactual: Since I was afraid of lizards, I sold the iguana.
Edited Ending: ['One day there was a solar eclipse.', 'When I got home the lizard had found its way back to my house.', 'As a dumb kid I believed the eclipse was to blame.']

Datapoint 2:
Premise: Tammy hated being a waitress.
Initial: Last week she almost quit.
Original Ending: Customers walked into her job five minutes before closing. They demanded lots of food and left a poor tip. She cried herself to sleep two days in a row.
Counterfactual: Tammy was given the day off today because she knew Tammy was considering quitting.
Edited Ending: ['The 
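
The loader above treats the data file as JSON Lines (one JSON object per line) and skips lines that fail to parse. The following is a minimal sketch of that parsing step on an in-memory string, so it can be tried without the TimeTravel file. The helper name `parse_json_lines` and the sample records are hypothetical and are not part of the dataset.

```python
import json

def parse_json_lines(text):
    """Parse JSON Lines text; return (records, number_of_skipped_lines)."""
    records, skipped = [], 0
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines are ignored, not counted as errors
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            skipped += 1  # malformed line: count it and keep going
    return records, skipped

# Hypothetical sample: two valid records, one blank line, one malformed line
sample = '{"premise": "A", "counterfactual": "B"}\n\nnot json\n{"premise": "C"}\n'
records, skipped = parse_json_lines(sample)
print(len(records), skipped)  # 2 1
```

Returning the skipped count alongside the records makes it easy to notice when a file is mostly unparseable, which the per-line warnings alone can hide.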

Problem 2

In [12]:
from rouge_score import rouge_scorer
import numpy as np
import string

class StoryEvaluator:
    def __init__(self):
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    
    def evaluate_rouge(self, generated_ending: str, reference_ending: str) -> dict:
        """
        Method 1: ROUGE Score Evaluation
        Evaluates the similarity between generated and reference endings using ROUGE metrics
        """
        # Handle list input for reference ending
        if isinstance(reference_ending, list):
            reference_ending = ' '.join(reference_ending)
            
        scores = self.rouge_scorer.score(reference_ending, generated_ending)
        return {
            'rouge1': round(scores['rouge1'].fmeasure, 4),
            'rouge2': round(scores['rouge2'].fmeasure, 4),
            'rougeL': round(scores['rougeL'].fmeasure, 4)
        }
    
    def simple_tokenize(self, text: str) -> set:
        """
        Simple tokenization function that splits on whitespace and removes punctuation
        """
        # Remove punctuation and convert to lowercase
        text = text.lower()
        for punct in string.punctuation:
            text = text.replace(punct, ' ')
        
        # Split on whitespace and filter out empty strings
        return set(token for token in text.split() if token)
    
    def evaluate_semantic_consistency(self, generated_ending: str, premise: str, counterfactual: str) -> float:
        """
        Method 2: Semantic Consistency Score
        Measures how well the generated ending maintains key elements from premise and counterfactual
        """
        # Tokenize the texts using simple tokenization
        premise_tokens = self.simple_tokenize(premise)
        counterfactual_tokens = self.simple_tokenize(counterfactual)
        generated_tokens = self.simple_tokenize(generated_ending)
        
        # Calculate overlap with both premise and counterfactual
        premise_overlap = len(premise_tokens.intersection(generated_tokens)) / len(premise_tokens) if premise_tokens else 0
        counterfactual_overlap = len(counterfactual_tokens.intersection(generated_tokens)) / len(counterfactual_tokens) if counterfactual_tokens else 0
        
        # Return weighted average of overlaps
        return round((0.4 * premise_overlap + 0.6 * counterfactual_overlap), 4)

    def subjective_evaluation(self, generated_ending: str, prompt: str) -> int:
        """
        Subjective Evaluation: Logical Flow (1-5 scale)
        Evaluates how logically the generated ending follows from the premise and counterfactual
        
        1: Completely illogical or contradictory
        2: Mostly illogical with minor connections
        3: Partially logical with some inconsistencies
        4: Mostly logical with minor gaps
        5: Perfectly logical and consistent
        """
        print("\nSubjective Evaluation (Logical Flow):")
        print(f"Generated ending: {generated_ending}")
        print("\nPlease rate on a scale of 1-5:")
        print("1: Completely illogical or contradictory")
        print("2: Mostly illogical with minor connections")
        print("3: Partially logical with some inconsistencies")
        print("4: Mostly logical with minor gaps")
        print("5: Perfectly logical and consistent")
        
        while True:
            try:
                score = int(input("Enter score (1-5): "))
                if 1 <= score <= 5:
                    return score
                print("Please enter a number between 1 and 5")
            except ValueError:
                print("Please enter a valid number")

# Test the evaluator
if __name__ == "__main__":
    evaluator = StoryEvaluator()
    
    # Test data taken from the Problem 1 output
    test_data = {
        'datapoint1': {
            'premise': "The only thing I had of my dad's was his iguana.",
            'counterfactual': "Since I was afraid of lizards, I sold the iguana.",
            'edited_ending': "One day there was a solar eclipse. When I got home the lizard had found its way back to my house. As a dumb kid I believed the eclipse was to blame."
        },
        'datapoint2': {
            'premise': "Tammy hated being a waitress.",
            'counterfactual': "Tammy was given the day off today because she knew Tammy was considering quitting.",
            'edited_ending': "The next day, customers walked into her job five minutes before closing. They demanded lots of food and left a poor tip. She cried herself to sleep two days in a row."
        }
    }
    
    # Test both evaluation methods
    for dp_name, dp in test_data.items():
        print(f"\nTesting {dp_name}:")
        
        # Test ROUGE scores
        rouge_scores = evaluator.evaluate_rouge(
            dp['edited_ending'],
            dp['edited_ending']  # Using same text for demonstration
        )
        print(f"ROUGE Scores: {rouge_scores}")
        
        # Test semantic consistency
        consistency_score = evaluator.evaluate_semantic_consistency(
            dp['edited_ending'],
            dp['premise'],
            dp['counterfactual']
        )
        print(f"Semantic Consistency Score: {consistency_score}")


Testing datapoint1:
ROUGE Scores: {'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0}
Semantic Consistency Score: 0.3667

Testing datapoint2:
ROUGE Scores: {'rouge1': 1.0, 'rouge2': 1.0, 'rougeL': 1.0}
Semantic Consistency Score: 0.23
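
The consistency score printed above is a purely lexical measure: the fraction of unique premise tokens and of unique counterfactual tokens that reappear in the ending, combined with weights 0.4 and 0.6. The sketch below reproduces the arithmetic for datapoint 1 outside the class, using only the standard library. The function names `tokenize` and `consistency` are hypothetical stand-ins for the methods of `StoryEvaluator`.

```python
import string

def tokenize(text):
    """Lowercase, replace punctuation with spaces, return the set of unique tokens."""
    text = text.lower()
    for punct in string.punctuation:
        text = text.replace(punct, " ")
    return set(text.split())

def consistency(generated, premise, counterfactual, w_premise=0.4, w_counterfactual=0.6):
    """Return (premise overlap, counterfactual overlap, weighted score)."""
    gen = tokenize(generated)
    prem = tokenize(premise)
    cf = tokenize(counterfactual)
    prem_overlap = len(prem & gen) / len(prem) if prem else 0.0
    cf_overlap = len(cf & gen) / len(cf) if cf else 0.0
    score = round(w_premise * prem_overlap + w_counterfactual * cf_overlap, 4)
    return prem_overlap, cf_overlap, score

premise = "The only thing I had of my dad's was his iguana."
counterfactual = "Since I was afraid of lizards, I sold the iguana."
ending = ("One day there was a solar eclipse. When I got home the lizard had "
          "found its way back to my house. As a dumb kid I believed the eclipse "
          "was to blame.")

prem_overlap, cf_overlap, score = consistency(ending, premise, counterfactual)
print(sorted(tokenize(premise) & tokenize(ending)))         # shared premise tokens
print(sorted(tokenize(counterfactual) & tokenize(ending)))  # shared counterfactual tokens
print(score)  # 0.3667 = 0.4 * (5/12) + 0.6 * (3/9)
```

The shared tokens here are all function words ("the", "i", "was", "had", "my"). "lizard" in the ending does not match "lizards" in the counterfactual, and "iguana" never appears in the ending, so the score reflects surface overlap rather than meaning.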


In [13]:
from transformers import AutoTokenizer, T5ForConditionalGeneration
from typing import List, Dict

class ZeroShotPrompter:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
        self.model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large")
        self.evaluator = StoryEvaluator()
    
    def generate_ending(self, prompt: str) -> str:
        """Generate a story ending using the model"""
        inputs = self.tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
        # Beam search is deterministic; `temperature` only takes effect with
        # do_sample=True, so it is omitted here.
        outputs = self.model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_length=150,
            num_beams=4,
            no_repeat_ngram_size=2
        )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    def evaluate_prompts(self, datapoints: List[Dict]) -> List[Dict]:
        # Two different zero-shot prompts
        prompts = [
            """
            Write a new ending to this story that follows from the changed fact:
            Story beginning: {premise}
            Changed fact: {counterfactual}
            New ending:
            """,
            
            """
            Rewrite how this story ends, considering that {counterfactual}
            Original beginning: {premise}
            Write the new ending:
            """
        ]
        
        results = []
        for i, dp in enumerate(datapoints):
            print(f"\nEvaluating Datapoint {i+1}")
            print(f"Premise: {dp['premise']}")
            print(f"Counterfactual: {dp['counterfactual']}")
            
            for prompt_idx, prompt_template in enumerate(prompts, 1):
                # Format prompt
                prompt = prompt_template.format(
                    premise=dp['premise'],
                    counterfactual=dp['counterfactual']
                )
                
                # Generate ending
                generated_ending = self.generate_ending(prompt)
                
                # Evaluate using both methods
                rouge_scores = self.evaluator.evaluate_rouge(
                    generated_ending,
                    dp['edited_ending']
                )
                
                consistency_score = self.evaluator.evaluate_semantic_consistency(
                    generated_ending,
                    dp['premise'],
                    dp['counterfactual']
                )
                
                # Get subjective score
                print(f"\nPrompt {prompt_idx} generated ending:")
                print(generated_ending)
                subjective_score = self.evaluator.subjective_evaluation(generated_ending, prompt)
                
                results.append({
                    'datapoint': i+1,
                    'prompt_version': prompt_idx,
                    'generated_ending': generated_ending,
                    'rouge_scores': rouge_scores,
                    'consistency_score': consistency_score,
                    'subjective_score': subjective_score
                })
        
        return results

if __name__ == "__main__":
    # Initialize prompter
    prompter = ZeroShotPrompter()
    
    # Use the same test data
    test_data = [
        {
            'premise': "The only thing I had of my dad's was his iguana.",
            'counterfactual': "Since I was afraid of lizards, I sold the iguana.",
            'edited_ending': "One day there was a solar eclipse. When I got home the lizard had found its way back to my house. As a dumb kid I believed the eclipse was to blame."
        },
        {
            'premise': "Tammy hated being a waitress.",
            'counterfactual': "Tammy was given the day off today because she knew Tammy was considering quitting.",
            'edited_ending': "The next day, customers walked into her job five minutes before closing. They demanded lots of food and left a poor tip. She cried herself to sleep two days in a row."
        }
    ]
    
    # Run zero-shot prompting evaluation
    results = prompter.evaluate_prompts(test_data)
    
    # Print final results
    print("\nFinal Results:")
    for result in results:
        print(f"\nDatapoint {result['datapoint']}, Prompt Version {result['prompt_version']}:")
        print(f"Generated: {result['generated_ending']}")
        print(f"ROUGE Scores: {result['rouge_scores']}")
        print(f"Consistency Score: {result['consistency_score']}")
        print(f"Subjective Score: {result['subjective_score']}")


Evaluating Datapoint 1
Premise: The only thing I had of my dad's was his iguana.
Counterfactual: Since I was afraid of lizards, I sold the iguana.

Prompt 1 generated ending:
I was sad that I had to give up my dad's iguana.

Subjective Evaluation (Logical Flow):
Generated ending: I was sad that I had to give up my dad's iguana.

Please rate on a scale of 1-5:
1: Completely illogical or contradictory
2: Mostly illogical with minor connections
3: Partially logical with some inconsistencies
4: Mostly logical with minor gaps
5: Perfectly logical and consistent

Prompt 2 generated ending:
The iguana was the only thing I had of my dad's.

Subjective Evaluation (Logical Flow):
Generated ending: The iguana was the only thing I had of my dad's.

Please rate on a scale of 1-5:
1: Completely illogical or contradictory
2: Mostly illogical with minor connections
3: Partially logical with some inconsistencies
4: Mostly logical with minor gaps
5: Perfectly logical and consistent

Evaluating Datapoint 2
Premise: Tammy hated being a waitress.
Counterfactual: Tammy was given the day off today because she knew Tammy was considering quitting.

Promp