## Dataset Overview

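This notebook adds two features to each prompt in the synthetic energy prompt dataset:

- **Token Count**: the number of tokens produced by the `bert-base-uncased` tokenizer (Hugging Face `transformers`), including the `[CLS]` and `[SEP]` special tokens.
- **Readability Score**: the Flesch Reading Ease score computed with `textstat`; higher scores indicate easier text.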
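
For reference, the Flesch Reading Ease formula can be sketched from raw counts as below. This is only an illustration of the standard published formula: `flesch_from_counts` and the example counts are hypothetical, and `textstat` does its own sentence and syllable counting, so its scores may differ from a hand calculation.

```python
def flesch_from_counts(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease from raw counts (hypothetical helper, not textstat's API)."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_from_counts(100, 5, 150)
print(round(score, 1))  # roughly 59.6
```

Longer sentences and longer words both lower the score, which is why dense technical prompts tend to score lower than short conversational ones.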

**Input**: `data/synthetic/energy_data.csv`  
**Output**: `data/processed/processed_data.csv`
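
The two relative paths above can be built from a single project root, which may help keep the notebook portable across machines. A minimal sketch, assuming the project root is known at run time; `resolve_dataset_paths` and its `project_root` argument are hypothetical names and not part of the original notebook.

```python
from pathlib import Path


def resolve_dataset_paths(project_root):
    """Return (input_path, output_path) under the given project root (hypothetical helper)."""
    root = Path(project_root)
    input_path = root / "data" / "synthetic" / "energy_data.csv"
    output_path = root / "data" / "processed" / "processed_data.csv"
    return input_path, output_path


# Example: resolve against the current working directory.
input_path, output_path = resolve_dataset_paths(".")
print(input_path.as_posix())
print(output_path.as_posix())
```

Both returned values are `Path` objects, which `pd.read_csv` and `DataFrame.to_csv` accept directly.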


In [5]:
import pandas as pd
from transformers import AutoTokenizer
import textstat

# Load dataset
df = pd.read_csv(r"C:\Users\Jasmine\PycharmProjects\New_project\data\synthetic\energy_data.csv")

# Confirm column names
print("Columns in CSV:", df.columns.tolist())

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Feature: Token Count
# The count includes the [CLS] and [SEP] special tokens that the BERT tokenizer adds by default.
def compute_token_count(prompt):
    return len(tokenizer(prompt)["input_ids"])

# Feature: Readability Score (Flesch Reading Ease; higher means easier to read)
def compute_readability_score(prompt):
    return textstat.flesch_reading_ease(prompt)

# Compute both features from the "prompt_text" column (existing columns with the same names are overwritten)
df["token_count"] = df["prompt_text"].apply(compute_token_count)
df["readability_score"] = df["prompt_text"].apply(compute_readability_score)

# Save processed dataset
df.to_csv(r"C:\Users\Jasmine\PycharmProjects\New_project\data\processed\processed_data.csv", index=False)


Columns in CSV: ['prompt_id', 'prompt_text', 'token_count', 'readability_score', 'hardware_type', 'energy_consumption']
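
The printed column list shows that `token_count` and `readability_score` are already present in the input CSV, so the assignments in the cell above replace those values. Whether that is intended is not stated in the notebook. A small check along the following lines could flag the overlap before it happens; `find_overwritten_columns` is a hypothetical helper introduced only for illustration.

```python
def find_overwritten_columns(existing_columns, computed_columns):
    """Return the computed column names that already exist in the input (hypothetical helper)."""
    existing = set(existing_columns)
    return [name for name in computed_columns if name in existing]


# Column list as printed by the notebook.
csv_columns = [
    "prompt_id", "prompt_text", "token_count",
    "readability_score", "hardware_type", "energy_consumption",
]
overwritten = find_overwritten_columns(csv_columns, ["token_count", "readability_score"])
print(overwritten)  # ['token_count', 'readability_score']
```

In the notebook, the same helper could be called with `df.columns.tolist()` right after loading the CSV.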
