# Prompt Improvement Service – Metrics Analysis

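This notebook measures how your **prompt‑improvement service** changes prompts, comparing each original with its improved version on token, word, and character counts, lexical diversity, and readability.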

It expects **two CSV files**—one for *image‑generation* prompts and one for *chat* prompts—each with
two columns:

1. `original_prompt`
2. `improved_prompt`

> **Tip:** If your schema or file type is different, edit the *Load data* cell.

## 1. Setup

In [7]:
# Uncomment the next line on a fresh environment
# !pip install pandas numpy matplotlib tqdm textstat tiktoken scikit-learn scipy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from tqdm import tqdm
from prompt_rating import PromptRatingHandler, InputTypes
import textstat              # readability scores
import tiktoken              # token counts
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import ttest_ind
promptRatingService = PromptRatingHandler()
plt.rcParams['figure.figsize'] = (9, 4)
plt.rcParams['figure.dpi'] = 110
plt.rcParams['axes.grid'] = True



### 1.1 Improve prompts

## 2. Load your prompt datasets

In [2]:
IMAGE_PATH = Path('./datasets/image_generation_prompts.csv')
CHAT_PATH  = Path('./datasets/chatgpt_prompts.csv')

image_df = pd.read_csv(IMAGE_PATH)
chat_df  = pd.read_csv(CHAT_PATH)

print(f'Loaded {len(image_df):,} image prompt pairs')
print(f'Loaded {len(chat_df):,} chat prompt pairs')


Loaded 10,000 image prompt pairs
Loaded 153 chat prompt pairs
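
The analysis cells index both frames by `original_prompt` and `improved_prompt`, so a file with a different schema only fails later with a `KeyError`. A minimal sketch of an early check, assuming a hypothetical helper `missing_columns` (not part of the notebook) that accepts any iterable of column names:

```python
REQUIRED_COLUMNS = ('original_prompt', 'improved_prompt')

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# A header that only has a `prompt` column is missing both required names.
print(missing_columns(['Unnamed: 0', 'prompt']))
# The expected schema yields an empty list.
print(missing_columns(['original_prompt', 'improved_prompt']))
```

In the notebook this could be called as `missing_columns(image_df.columns)` right after loading, before any metric is computed.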


## 3. Helper functions

In [None]:
enc = tiktoken.get_encoding('cl100k_base')

def token_count(text: str) -> int:
    """Approximate GPT‑4 token count using tiktoken."""
    return len(enc.encode(text))

def char_len(text: str) -> int:
    return len(text)

def word_len(text: str) -> int:
    return len(text.split())

def lexical_diversity(text: str) -> float:
    words = text.split()
    return len(set(words)) / max(len(words), 1)

def readability(text: str) -> float:
    """Flesch Reading Ease (higher = easier)."""
    try:
        return textstat.flesch_reading_ease(text)
    except Exception:
        return np.nan

def compute_pairwise_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Row‑wise deltas between improved and original prompts."""
    return pd.DataFrame({
        'tokens_delta'          : df['improved_prompt'].map(token_count)      - df['original_prompt'].map(token_count),
        'chars_delta'           : df['improved_prompt'].map(char_len)         - df['original_prompt'].map(char_len),
        'words_delta'           : df['improved_prompt'].map(word_len)         - df['original_prompt'].map(word_len),
        'lexical_diversity_delta': df['improved_prompt'].map(lexical_diversity) - df['original_prompt'].map(lexical_diversity),
        'readability_delta'     : df['improved_prompt'].map(readability)      - df['original_prompt'].map(readability),
    })

def summarize(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Aggregate descriptive metrics."""
    rows = []
    for col in ['original_prompt', 'improved_prompt']:
        rows.append({
            'dataset'     : label,
            'version'     : col.split('_')[0].title(),
            'avg_tokens'  : df[col].map(token_count).mean(),
            'avg_words'   : df[col].map(word_len).mean(),
            'avg_chars'   : df[col].map(char_len).mean(),
            'avg_lex_div' : df[col].map(lexical_diversity).mean(),
            'avg_readab'  : df[col].map(readability).mean(),
        })
    return pd.DataFrame(rows)
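
To make the helper definitions concrete, here is a small worked example on a made-up prompt pair (the prompts are illustrative, not taken from the datasets). It re-declares `word_len` and `lexical_diversity` as defined above so it runs on its own, and it shows that whitespace splitting treats `Cat`, `cat`, and `cat,` as three different words.

```python
def word_len(text: str) -> int:
    return len(text.split())

def lexical_diversity(text: str) -> float:
    words = text.split()
    return len(set(words)) / max(len(words), 1)

original = 'a cat and a dog'                      # hypothetical prompt
improved = 'a fluffy cat and a small brown dog'   # hypothetical prompt

print(word_len(original), lexical_diversity(original))   # 5 words, 4 unique -> 0.8
print(word_len(improved), lexical_diversity(improved))   # 8 words, 7 unique -> 0.875
print(lexical_diversity(''))                             # empty text -> 0.0
print(lexical_diversity('Cat cat cat,'))                 # case and punctuation count -> 1.0
```

If case or punctuation should not affect the score, the text would need to be normalised before splitting; the notebook's helpers do not do that.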



In [None]:
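# Rate each image prompt; assumes calculate() returns the rating for one prompt.
# `ai_type=In` is incomplete as written and needs the InputTypes member for image prompts.
ratings = [promptRatingService.calculate(entry, ai_type=In) for entry in tqdm(image_df['prompt'])]
DatasetOfRatings = pd.DataFrame({'prompt': image_df['prompt'], 'rating': ratings})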

Unnamed: 0,prompt
0,simple mountain landscape
1,mind-blowing mountain landscape
2,mind-blowing desert caravan
3,mind-blowing cyber samurai
4,highly detailed robot


## 4. Compute summary metrics

In [5]:
image_metrics = summarize(image_df, 'Image')
chat_metrics  = summarize(chat_df,  'Chat')
metrics = pd.concat([image_metrics, chat_metrics], ignore_index=True)
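# Apply the float format to numeric columns only; 'dataset' and 'version' hold strings.
metrics.style.format('{:.2f}', subset=metrics.select_dtypes('number').columns)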


KeyError: 'original_prompt'

## 5. Improvement deltas (improved − original)

In [None]:
image_delta = compute_pairwise_stats(image_df)
chat_delta  = compute_pairwise_stats(chat_df)

def describe_delta(deltas, name):
    print(f'\n{name} prompts (Δ improved − original)')
    display(deltas.describe(percentiles=[.25, .5, .75]).T)

describe_delta(image_delta, 'Image')
describe_delta(chat_delta,  'Chat')


## 6. Visualise the distribution of gains

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, column in zip(axes, ['tokens_delta', 'lexical_diversity_delta', 'readability_delta']):
    ax.hist(image_delta[column], alpha=0.6, label='Image', bins=30)
    ax.hist(chat_delta[column],  alpha=0.6, label='Chat',  bins=30)
    ax.set_title(column)
    ax.legend()

plt.tight_layout()
plt.show()


## 7. Statistical significance

In [None]:
from scipy.stats import ttest_ind

def sig_test(before, after):
    """Welch t‑test (unequal var)."""
    return ttest_ind(after, before, equal_var=False)

for df_, name in [(image_df, 'Image'), (chat_df, 'Chat')]:
    t_stat, p = sig_test(df_['original_prompt'].map(token_count),
                         df_['improved_prompt'].map(token_count))
    print(f'{name} token length: t={t_stat:.2f}, p={p:.3g}')
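
Each improved prompt is derived from one specific original, so the two samples are paired rather than independent. A paired test on the per-row differences is a common alternative to Welch's test in that situation; in SciPy this is `scipy.stats.ttest_rel`. The sketch below is not the notebook's method: it computes only the paired t statistic and a standardized mean difference by hand with the standard library, on made-up token counts, and the function name `paired_t` is hypothetical.

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic and standardized mean difference (mean / sd of differences)."""
    if len(before) != len(after):
        raise ValueError('paired samples must have the same length')
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, mean_diff / sd_diff

# Made-up token counts for four prompt pairs.
before = [1, 2, 3, 4]
after = [2, 4, 5, 4]
t_stat, effect = paired_t(before, after)
print(f't={t_stat:.2f}, effect size={effect:.2f}, df={len(before) - 1}')
```

The sketch stops at the statistic because the p-value needs the t distribution, which the standard library does not provide. It also assumes the differences are not all identical, since a zero standard deviation would make the division fail.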


## 8. Save the summary

In [None]:
metrics.to_csv('summary_metrics.csv', index=False)
print('📁 summary_metrics.csv written to working directory')


---
### Next steps
* Add more specialised metrics (e.g. **CLIP** or **LLM‑based** quality judgments).
* Integrate this notebook into a CI pipeline for continuous tracking.
* Swap in your preferred tokeniser or readability metric if needed.

Happy evaluating! 🚀