# Data Feature Engineering Notebook

This notebook demonstrates the **feature extraction pipeline** that transforms raw prompts into numerical features for machine learning.

## Pipeline Overview

```
Raw Prompts → Clean Text → Tokenize → Extract Features → Save CSV
```

## Core Features (Calibrated Model - December 2025)

The production model uses **6 core features** that were found to have the strongest correlation with energy consumption:

| Feature | Description | Correlation |
|---------|-------------|-------------|
| **token_count** | Number of tokens | 0.946 |
| **word_count** | Number of words | 0.915 |
| **char_count** | Total characters | High |
| **complexity_score** | Linguistic complexity | 0.516 |
| **avg_word_length** | Mean word length | Medium |
| **avg_sentence_length** | Mean sentence length | Medium |
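
The table reports correlations, but the notebook does not show how they were computed. The following is a minimal sketch, assuming Pearson correlation against a target column named `energy_kwh` (the name used for the synthetic labels later in this notebook). The helper name `feature_correlations` is hypothetical, and the toy values are invented purely for illustration; they are not the project's data.

```python
import pandas as pd

# Toy data, invented for illustration only (not the project's measurements).
toy = pd.DataFrame({
    "token_count": [10, 20, 30, 40, 50],
    "avg_word_length": [4.0, 5.5, 4.2, 5.0, 4.8],
    "energy_kwh": [1.0, 1.9, 3.1, 4.0, 5.2],
})

def feature_correlations(df, target="energy_kwh"):
    """Pearson correlation of each numeric feature with the target column."""
    numeric = df.select_dtypes("number")
    features = numeric.drop(columns=[target])
    # corrwith correlates every feature column with the target Series
    return features.corrwith(numeric[target]).sort_values(ascending=False)

correlations = feature_correlations(toy)
print(correlations)
```

Ranking features this way only captures linear relationships, so a feature with a low Pearson value may still be useful to a non-linear model.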

## Model Performance (After Calibration)

| Metric | Achieved |
|--------|----------|
| **R² Score** | 0.9813 |
| **MAPE** | 6.8% |
| **Prediction Bias** | 0.9988 |
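
For readers who want to reproduce metrics of this kind, here is a hedged sketch using only NumPy. R² and MAPE follow their standard definitions. The notebook does not define "Prediction Bias", so the ratio of mean prediction to mean actual value used below is an assumption (chosen because the reported value is close to 1). The function name `regression_metrics` and the sample numbers are hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, MAPE (percent), and a mean-ratio bias for two arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Coefficient of determination: 1 - residual SS / total SS
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    # Mean absolute percentage error (requires non-zero y_true)
    mape_pct = np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

    # ASSUMPTION: "bias" as mean(prediction) / mean(actual); 1.0 means unbiased
    bias_ratio = y_pred.mean() / y_true.mean()

    return {"r2": r2, "mape_pct": mape_pct, "bias_ratio": bias_ratio}

# Invented values, for illustration only
metrics = regression_metrics([1.0, 2.0, 4.0], [1.1, 1.8, 4.0])
print(metrics)
```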

## Output Files
- `data/processed/features_df.csv` - Extracted features
- `data/synthetic/energy_dataset.csv` - Features plus synthetic energy labels

## 1. Import Libraries

Load required packages:
- **pandas/numpy**: Data manipulation
- **transformers**: HuggingFace tokenizer (DistilBERT)
- **nltk**: Natural language toolkit for stopwords

In [None]:
import os
import pandas as pd
import numpy as np
import re
from transformers import AutoTokenizer
import nltk
from nltk.corpus import stopwords

## 2. Initialize Tokenizer & Stopwords

Set up the NLP tools:
- **DistilBERT tokenizer**: Converts text to tokens (subword units)
- **Stopwords**: Common words like "the", "is", "a" (to calculate stopword ratio)

In [None]:
nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

## 3. Load and Clean Raw Data

Load prompts from `data/raw/raw_prompts.csv` and apply basic cleaning:
- Remove extra whitespace
- Filter out very short strings (5 characters or fewer)
- Drop null values

**Expected Output**: ~50 rows loaded
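
To make the cleaning rules concrete, the sketch below applies them to a few invented prompts. The `clean_text` function is repeated here only so the snippet runs on its own; the sample strings are made up for illustration.

```python
import re
import pandas as pd

def clean_text(text):
    """Collapse whitespace and keep only strings longer than 5 characters."""
    if not isinstance(text, str):
        text = str(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text) > 5 else None

# Invented sample prompts: messy whitespace, a 5-character string, a missing value
sample = pd.DataFrame(
    {"prompt": ["  Explain   gradient\tdescent ", "hello", None, "What is AI?"]}
)
sample["prompt"] = sample["prompt"].apply(clean_text)
cleaned = sample.dropna(subset=["prompt"]).reset_index(drop=True)
print(cleaned["prompt"].tolist())
```

Note that a missing value is first converted to a short string (such as `"None"` or `"nan"`) and then dropped by the length check, so nulls are removed by the same rule as short prompts.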

In [None]:
def clean_text(text):
    # Basic cleaning: collapse runs of whitespace and drop very short strings
    # (5 characters or fewer)
    if not isinstance(text, str):
        text = str(text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text if len(text) > 5 else None

def load_clean_raw(csv_path="../../data/raw/raw_prompts.csv"):
    # Load the CSV and clean the prompt column
    df = pd.read_csv(csv_path)
    df["prompt"] = df["prompt"].apply(clean_text)
    df = df.dropna(subset=["prompt"]).reset_index(drop=True)
    return df

df_raw = load_clean_raw()
print(f"Raw data loaded: {len(df_raw)} rows")

## 4. Feature Engineering Functions

Define the core feature extraction logic:

### `compute_feature(prompt, num_layers, training_hours, flops_per_hour)`

Extracts these features:
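- **token_count**: Number of DistilBERT token IDs, including the `[CLS]` and `[SEP]` special tokens that `tokenizer.encode` adds by default
- **char_count**: Total characters
- **punct_ratio**: Punctuation characters (`.,!?;:`) / total chars
- **avg_word_length**: Mean length of whitespace-separated words
- **stopword_ratio**: Stopwords / total words
- **flops_per_layer**: `flops_per_hour / num_layers`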
- **training_efficiency**: Hours per layer
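
The text-based ratios can be checked by hand on a single prompt. The sketch below is an illustration only: it skips the DistilBERT tokenizer (which needs a model download) and uses a tiny hard-coded stopword set, `DEMO_STOPWORDS`, as a stand-in for NLTK's English list. Both the set and the helper name `text_ratios` are assumptions made for this example.

```python
# ASSUMPTION: a tiny stand-in for NLTK's English stopword list
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def text_ratios(prompt, stop_words=DEMO_STOPWORDS):
    """Compute the character- and word-level features for one prompt."""
    words = prompt.split()
    chars = len(prompt)
    punct = sum(1 for c in prompt if c in ".,!?;:")
    return {
        "char_count": chars,
        "punct_ratio": punct / max(chars, 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "stopword_ratio": sum(w.lower() in stop_words for w in words) / max(len(words), 1),
    }

demo = text_ratios("What is the capital of France?")
print(demo)
```

For this prompt there are 30 characters, 1 punctuation mark, and 6 words, of which 3 ("is", "the", "of") are stopwords. Because words are split on whitespace only, trailing punctuation stays attached: "France?" counts as 7 characters, and a stopword followed directly by punctuation (for example "the,") would not match the stopword set.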

In [None]:

def compute_feature(prompt, num_layers, training_hours, flops_per_hour):
    # Compute text and model-parameter features for a single prompt
    words = prompt.split()
    chars = len(prompt)

    # Token count from the DistilBERT tokenizer; encode() adds the [CLS] and
    # [SEP] special tokens by default, so they are included in the count
    token_count = len(tokenizer.encode(prompt))

    # Ratio of punctuation characters to all characters
    punct_ratio = sum(1 for c in prompt if c in ".,!?;:") / max(chars, 1)

    # Average length of whitespace-separated words
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)

    # Ratio of stopwords to all words
    stopword_ratio = sum(1 for w in words if w.lower() in stop_words) / max(len(words), 1)

    # Derived numeric features
    flops_per_layer = flops_per_hour / max(num_layers, 1)
    training_efficiency = training_hours / max(num_layers, 1)

    return {
        "prompt": prompt,
        "token_count": token_count,
        "char_count": chars,
        "punct_ratio": punct_ratio,
        "avg_word_length": avg_word_len,
        "stopword_ratio": stopword_ratio,
        "num_layers": num_layers,
        "training_hours": training_hours,
        "flops_per_hour": flops_per_hour,
        "flops_per_layer": flops_per_layer,
        "training_efficiency": training_efficiency,
    }

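def create_feature_pipeline(df, num_layers_list, training_hours_list, flops_per_hour_list):
    # Pair each prompt with its parameters by position, so the result does not
    # depend on the DataFrame index being exactly 0..n-1
    rows = [
        compute_feature(prompt, num_layers, training_hours, flops_per_hour)
        for prompt, num_layers, training_hours, flops_per_hour in zip(
            df["prompt"], num_layers_list, training_hours_list, flops_per_hour_list
        )
    ]
    return pd.DataFrame(rows)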

## 5. Generate Model Parameters & Compute Features

Simulate LLM parameters for each prompt:
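- **num_layers**: Random integer from 4 to 47 (`np.random.randint(4, 48)` excludes the upper bound; a typical range for transformer models)
- **training_hours**: Random 0.5-20 hours (uniform)
- **flops_per_hour**: Random 1e9 to 1e12 FLOPs per hour (uniform)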

**Output**: Saves to `data/processed/features_df.csv`
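
Because the parameters are drawn without a fixed seed, each run produces a different `features_df.csv`. If reproducibility matters, one option is NumPy's `Generator` API, sketched below. The helper name `sample_model_params` and the seed value 42 are hypothetical, and passing `endpoint=True` is an assumption about intent: it makes 48 a possible layer count, which the `np.random.randint(4, 48)` call in this notebook does not.

```python
import numpy as np

def sample_model_params(n, seed=42):
    """Draw simulated model parameters reproducibly (seed is an arbitrary choice)."""
    rng = np.random.default_rng(seed)
    # endpoint=True makes the upper bound inclusive, so values span 4..48
    layers = rng.integers(4, 48, size=n, endpoint=True)
    hours = rng.uniform(0.5, 20, size=n)
    flops = rng.uniform(1e9, 1e12, size=n)
    return layers, hours, flops

layers_a, hours_a, flops_a = sample_model_params(50)
layers_b, hours_b, flops_b = sample_model_params(50)
print(layers_a[:5], (layers_a == layers_b).all())
```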

In [None]:

n = len(df_raw)
layers = np.random.randint(4, 48, size=n)
hours = np.random.uniform(0.5, 20, size=n)
flops = np.random.uniform(1e9, 1e12, size=n)

features_df = create_feature_pipeline(df_raw, layers, hours, flops)

# Save processed dataset
os.makedirs("../../data/processed", exist_ok=True)
features_df.to_csv("../../data/processed/features_df.csv", index=False)
print(f"Processed dataset saved with {len(features_df)} rows")
features_df.head()


## 6. Generate Synthetic Energy Labels

Create synthetic energy consumption labels (in kWh) from a fixed linear formula plus Gaussian noise; the result is clipped to a minimum of 0.01 kWh:

```python
energy_kwh = (
    0.5                                   # Base energy
    + token_count * 0.003                 # More tokens = more energy
    + avg_word_length * 0.10              # Longer words = more processing
    + num_layers * 0.01                   # Deeper models = more compute
    + log10(flops_per_hour + 1) * 0.05    # Higher FLOPs = more energy
    + random_noise                        # Gaussian noise, mean 0, std 0.02
)
```

**Output**: Saves to `data/synthetic/energy_dataset.csv`
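
As a worked example of the formula, the sketch below evaluates it without the noise term for one invented set of inputs. The function name `expected_energy_kwh` and the input values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def expected_energy_kwh(token_count, avg_word_length, num_layers, flops_per_hour):
    """Noise-free version of the synthetic energy formula."""
    return (
        0.5
        + token_count * 0.003
        + avg_word_length * 0.10
        + num_layers * 0.01
        + math.log10(flops_per_hour + 1) * 0.05
    )

# Invented inputs: 100 tokens, mean word length 5, 12 layers, 1e10 FLOPs/hour
example = expected_energy_kwh(100, 5.0, 12, 1e10)
print(round(example, 4))
```

Term by term this is 0.5 + 0.30 + 0.50 + 0.12 + 0.50, or about 1.92 kWh. Over the simulated range of 1e9 to 1e12 FLOPs per hour, the logarithmic term only moves between 0.45 and 0.60, so `flops_per_hour` contributes little variation compared with `token_count`.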

In [None]:
def generate_energy_data(df):
    """Generate synthetic energy labels for testing."""
    # Create synthetic energy consumption based on features
    energy = (
        0.5 +  # Base energy
        df["token_count"] * 0.003 +
        df["avg_word_length"] * 0.10 +
        df["num_layers"] * 0.01 +
        np.log10(df["flops_per_hour"] + 1) * 0.05 +
        np.random.normal(0, 0.02, size=len(df))  # Small noise
    )
    
    df_energy = df.copy()
    df_energy["energy_kwh"] = np.maximum(0.01, energy)  # Ensure positive
    
    os.makedirs("../../data/synthetic", exist_ok=True)
    df_energy.to_csv("../../data/synthetic/energy_dataset.csv", index=False)
    return df_energy

# Generate energy labels
energy_df = generate_energy_data(features_df)
print(f"Synthetic energy dataset saved with {len(energy_df)} rows")
print(f"Energy range: {energy_df['energy_kwh'].min():.4f} - {energy_df['energy_kwh'].max():.4f} kWh")
energy_df.head()

## Summary

This notebook demonstrated the feature engineering pipeline:

| Step | Input | Output |
|------|-------|--------|
| 1. Load | raw_prompts.csv | Cleaned DataFrame |
| 2. Tokenize | Text prompts | Token counts |
| 3. Extract | Prompts + params | 11 columns (prompt + 10 numeric features) |
| 4. Generate | Features | Energy labels |

### Files Generated
- `data/processed/features_df.csv` - Feature matrix
- `data/synthetic/energy_dataset.csv` - Features + energy labels

### Next Steps
Run `Energy Prediction ML.ipynb` to train models on this data.