## Correction and Grading

### Prompt

```
I have questions generated by visual genome data and template questions. Some of them are still wrong. Can you correct them and grade them to score from 1 to 5 using criterias (Grammatical,Relevance,Clarity,Specificity,Prettiness,Average Score,Reason for Corrections)?

Please return the output as CSV file with
Original Question,Corrected Question,Grammatical,Relevance,Clarity,Specificity,Prettiness,Average Score,Reason for Corrections as headers

Here is the generated questions along with its context data
```
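
The prompt asks for a CSV with nine fixed headers, so it can help to check the model's reply before it is used for grading comparisons. The sketch below is one possible check, not part of the original pipeline: `validate_graded_csv`, `EXPECTED_HEADERS`, `SCORE_COLUMNS`, and the one-row `sample` reply are all hypothetical names and data introduced for illustration.

```python
import io

import pandas as pd

# Headers requested in the prompt above.
EXPECTED_HEADERS = [
    "Original Question", "Corrected Question", "Grammatical", "Relevance",
    "Clarity", "Specificity", "Prettiness", "Average Score",
    "Reason for Corrections",
]
# The five criteria that are scored from 1 to 5.
SCORE_COLUMNS = ["Grammatical", "Relevance", "Clarity", "Specificity", "Prettiness"]


def validate_graded_csv(csv_text):
    """Parse a CSV reply and check its headers and score ranges (hypothetical helper)."""
    df = pd.read_csv(io.StringIO(csv_text))
    missing = [h for h in EXPECTED_HEADERS if h not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    for col in SCORE_COLUMNS:
        scores = pd.to_numeric(df[col], errors="coerce")
        # Non-numeric entries become NaN and are rejected along with out-of-range values.
        if scores.isna().any() or not scores.between(1, 5).all():
            raise ValueError(f"Column {col!r} has values outside 1-5")
    return df


# Made-up single-row reply, for illustration only (not real model output).
sample = (
    "Original Question,Corrected Question,Grammatical,Relevance,Clarity,"
    "Specificity,Prettiness,Average Score,Reason for Corrections\n"
    '"What color the cat is?","What color is the cat?",5,5,5,4,5,4.8,"Fixed word order"\n'
)
graded = validate_graded_csv(sample)
print(graded[SCORE_COLUMNS])
```

Rejecting a malformed reply at this point keeps missing columns or out-of-range scores from silently reaching the statistics computed later.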

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Root directory for your data
root_directory = "data/corrected/"
df_llm = pd.read_csv(root_directory + 'gpt3-5_graded_corrected_questions.csv')
df_human = pd.read_csv(root_directory + 'manually_graded_corrected_questions.csv')

# List of categories to compare
categories = ["Grammatical", "Relevance", "Clarity", "Specificity", "Prettiness"]

t_tests = {}
hypothesis_results = {}

mae_scores = {}
rmse_scores = {}
recall_scores = {}
precision_scores = {}
accuracy_scores = {}

# Calculating mean and std for each category for both LLM and Human scores
means_std = {}

for category in categories:
    llm_scores = df_llm[category]
    human_scores = df_human[category]
    
    # Calculate mean and std for LLM
    llm_mean = llm_scores.mean()
    llm_std = llm_scores.std()

    # Calculate mean and std for Human
    human_mean = human_scores.mean()
    human_std = human_scores.std()

    # Store mean and std in a dictionary
    means_std[category] = {
        'LLM Mean': llm_mean,
        'LLM Std': llm_std,
        'Human Mean': human_mean,
        'Human Std': human_std
    }

    # Z-score normalization (handling std = 0)
    if llm_std == 0:
        llm_zscores = np.zeros_like(llm_scores)
    else:
        llm_zscores = (llm_scores - llm_mean) / llm_std

    if human_std == 0:
        human_zscores = np.zeros_like(human_scores)
    else:
        human_zscores = (human_scores - human_mean) / human_std

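    # Perform paired t-test on z-scores.
    # Caveat: each series is standardized separately, so both have mean 0 and
    # the mean paired difference is 0 by construction. The test therefore
    # returns t ≈ 0 and p ≈ 1 whatever the raw scores are, and cannot detect
    # a difference in means between LLM and human scores.
    t_stat, p_value = ttest_rel(llm_zscores, human_zscores)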
    t_tests[category] = (t_stat, p_value)
    
    alpha = 0.05
    if p_value < alpha:
        hypothesis_results[category] = "Reject H_0"
    else:
        hypothesis_results[category] = "Don't reject H_0"
    
    # Calculate MAE and RMSE (on original scores, not z-scores)
    mae = mean_absolute_error(human_scores, llm_scores)
    rmse = np.sqrt(mean_squared_error(human_scores, llm_scores))
    mae_scores[category] = mae
    rmse_scores[category] = rmse

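    # Binarize the scores for recall, precision, and accuracy.
    # Positive class: a score below the threshold (the question was judged less than perfect).
    # Human scores are treated as ground truth; LLM scores are the predictions.
    negative_threshold = 5  # scores >= 5 count as negative (no issue flagged)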

    # Calculate true positives, false negatives, false positives, true negatives
    true_positives = np.sum((llm_scores < negative_threshold) & (human_scores < negative_threshold))
    false_negatives = np.sum((llm_scores >= negative_threshold) & (human_scores < negative_threshold))
    false_positives = np.sum((llm_scores < negative_threshold) & (human_scores >= negative_threshold))
    true_negatives = np.sum((llm_scores >= negative_threshold) & (human_scores >= negative_threshold))

    # Recall, Precision, Accuracy
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    accuracy = (true_positives + true_negatives) / (len(llm_scores))  # Total correct predictions / total predictions

    recall_scores[category] = recall
    precision_scores[category] = precision
    accuracy_scores[category] = accuracy

# Output mean and std results
print("Mean and Standard Deviation for each category:")
for category, stats in means_std.items():
    print(f"{category}: LLM Mean = {stats['LLM Mean']:.3f}, LLM Std = {stats['LLM Std']:.3f}, "
          f"Human Mean = {stats['Human Mean']:.3f}, Human Std = {stats['Human Std']:.3f}")

# Output t-test results
print("\nT-tests (t-statistic, p-value) and Hypothesis Decision:")
for category, result in t_tests.items():
    t_stat, p_value = result
    decision = hypothesis_results[category]
    print(f"{category}: t-statistic = {t_stat:.3f}, p-value = {p_value:.3f} => {decision}")

# Output Mean Absolute Error results
print("\nMean Absolute Error (MAE):")
for category, mae in mae_scores.items():
    print(f"{category}: {mae:.3f}")

# Output Root Mean Squared Error results
print("\nRoot Mean Squared Error (RMSE):")
for category, rmse in rmse_scores.items():
    print(f"{category}: {rmse:.3f}")

# Output Recall results
print("\nRecall:")
for category, recall in recall_scores.items():
    print(f"{category}: {recall:.3f}")

# Output Precision results
print("\nPrecision:")
for category, precision in precision_scores.items():
    print(f"{category}: {precision:.3f}")

# Output Accuracy results
print("\nAccuracy:")
for category, accuracy in accuracy_scores.items():
    print(f"{category}: {accuracy:.3f}")


Mean and Standard Deviation for each category:
Grammatical: LLM Mean = 4.667, LLM Std = 0.478, Human Mean = 5.000, Human Std = 0.000
Relevance: LLM Mean = 5.000, LLM Std = 0.000, Human Mean = 4.778, Human Std = 0.929
Clarity: LLM Mean = 5.000, LLM Std = 0.000, Human Mean = 4.778, Human Std = 0.929
Specificity: LLM Mean = 5.000, LLM Std = 0.000, Human Mean = 4.750, Human Std = 0.937
Prettiness: LLM Mean = 4.667, LLM Std = 0.478, Human Mean = 4.889, Human Std = 0.667

T-tests (t-statistic, p-value) and Hypothesis Decision:
Grammatical: t-statistic = -0.000, p-value = 1.000 => Don't reject H_0
Relevance: t-statistic = -0.000, p-value = 1.000 => Don't reject H_0
Clarity: t-statistic = -0.000, p-value = 1.000 => Don't reject H_0
Specificity: t-statistic = 0.000, p-value = 1.000 => Don't reject H_0
Prettiness: t-statistic = -0.000, p-value = 1.000 => Don't reject H_0
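
The t-statistics of 0 and p-values of 1 above are what a paired t-test returns whenever both series are z-scored separately first: each standardized series has mean 0, so the mean paired difference is 0 by construction. The sketch below illustrates this with made-up scores (the `llm` and `human` arrays and the `zscore` helper are hypothetical, not taken from the CSV files) and contrasts it with the same test on raw scores.

```python
import numpy as np
from scipy.stats import ttest_rel


def zscore(x):
    """Standardize with the sample std (ddof=1, like pandas); all zeros if std == 0."""
    x = np.asarray(x, dtype=float)
    std = x.std(ddof=1)
    return np.zeros_like(x) if std == 0 else (x - x.mean()) / std


# Made-up scores with an obvious gap between the two graders (illustration only).
llm = np.array([5, 4, 5, 5, 4, 5, 4, 5], dtype=float)
human = np.array([3, 2, 4, 3, 3, 4, 2, 3], dtype=float)

# Paired t-test after separate z-scoring: the gap in means is erased.
t_z, p_z = ttest_rel(zscore(llm), zscore(human))

# Paired t-test on the raw scores: the gap in means is tested directly.
t_raw, p_raw = ttest_rel(llm, human)

print(f"z-scored: t = {t_z:.3f}, p = {p_z:.3f}")
print(f"raw:      t = {t_raw:.3f}, p = {p_raw:.5f}")
```

Under these assumptions the z-scored test cannot reject H_0 even though the raw means differ by more than a point, which suggests reading the table above as a property of the normalization rather than as evidence that the LLM and human grades agree.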

Mean Absolute Error (MAE):
Grammatical: 0.333
Relevance: 0.222
Clarity: 0.222
Specificity: 0.250
Prettiness: 0.444

Root Mean