## Dataset Overview

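This notebook preprocesses the synthetic energy dataset. If the CSV contains a column of prompt text, it adds two text features:

- **Token Count** using a transformer tokenizer (BERT, `bert-base-uncased`).
- **Readability Score** using the Flesch Reading Ease formula (via `textstat`).

If no such column is present, the feature step is skipped and the dataset is saved unchanged.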

**Input**: `data/synthetic/energy_data.csv`  
**Output**: `data/processed/processed_data.csv`
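
As a rough illustration of what the Readability Score measures, the sketch below implements the Flesch Reading Ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), in plain Python. The vowel-run syllable counter and the helper names are assumptions made for this example; `textstat` uses its own syllable counting, so its scores will generally differ from this approximation.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count runs of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Approximate Flesch Reading Ease; higher scores mean easier text."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        # Convention chosen for this sketch only: empty input scores 0.0.
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


score = flesch_reading_ease("The cat sat.")
print(round(score, 2))
```

Short sentences built from short words score high; long sentences with many multi-syllable words push the score down.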

In [None]:
import pandas as pd
from transformers import AutoTokenizer
import textstat
import os

# Load dataset
df = pd.read_csv(r"E:\SustainableAI_FinalProject\data\synthetic\energy_data.csv")

# Confirm column names
print("Columns in CSV:", df.columns.tolist())

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Optional text features. Despite its name, 'tokens_per_prompt' is assumed
# here to hold the raw prompt text; the step is skipped if the column is absent.
if "tokens_per_prompt" in df.columns:
    prompts = df["tokens_per_prompt"].astype(str)
    # Token count includes the [CLS] and [SEP] special tokens that BERT adds.
    df["token_count"] = prompts.apply(lambda prompt: len(tokenizer(prompt)["input_ids"]))
    df["readability_score"] = prompts.apply(textstat.flesch_reading_ease)
else:
    print("⚠️ Column 'tokens_per_prompt' not treated as prompt text. Skipping feature engineering.")

# Feature and label setup (match actual column names)
features = df[["num_layers", "training_hours", "flops_per_hour"]]
labels = df["energy_consumption"]

# Save processed dataset
output_path = r"E:\SustainableAI_FinalProject\data\processed"
os.makedirs(output_path, exist_ok=True)
df.to_csv(os.path.join(output_path, "processed_data.csv"), index=False)

print("✅ Preprocessing complete. Saved to processed_data.csv")

Columns in CSV: ['num_layers', 'training_hours', 'flops_per_hour', 'energy_consumption']


⚠️ Column 'tokens_per_prompt' not treated as prompt text. Skipping feature engineering.
✅ Preprocessing complete. Saved to processed_data.csv
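
The column listing printed above shows that this CSV has no `tokens_per_prompt` column, so only the four numeric columns are carried through. Selecting features by name raises a `KeyError` if the header ever changes, so a small guard could report missing columns first. The sketch below is one possible way to do that; the `missing_columns` helper and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

# Column names taken from the CSV header printed by the notebook.
REQUIRED_COLUMNS = ["num_layers", "training_hours", "flops_per_hour", "energy_consumption"]


def missing_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Toy frame for illustration only; the values are made up and the label column is left out.
sample = pd.DataFrame(
    {"num_layers": [12], "training_hours": [3.5], "flops_per_hour": [1.0e12]}
)

missing = missing_columns(sample, REQUIRED_COLUMNS)
if missing:
    print("Missing columns:", missing)
```

Running such a check right after `pd.read_csv` would surface a header mismatch before the feature and label selection is attempted.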
