# Rewriting Comments

- In this notebook, I rewrote all the thesis-relevant YouTube comments one by one using the OpenAI API (GPT-4o Mini), without using batch processing.  
- I loaded the original dataset and dropped the columns I didn’t need (`Data_Source` and `Source_File`).  
- The prompt was designed to clean up each comment by fixing grammar and spelling, expanding abbreviations, and clarifying slang, while keeping the original meaning, tone, and any explicit language.  
- Before rewriting, the script checked which comments had already been processed so it wouldn’t redo anything.  
- Each comment was sent individually to GPT-4o Mini, and the rewritten version was saved to a new CSV alongside the original info.  
- This continued until all comments were done, and the final output was saved to `Rewritten_YouTube_Comments.csv`, ready to use for the next steps in the project.
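
The skip-and-append pattern described above can be sketched offline as follows. This is a minimal illustration, not the notebook's code: `fake_rewrite` is a hypothetical stand-in for the GPT-4o Mini call, the sketch uses the standard `csv` module instead of pandas, and it tracks finished work by `Comment_ID` (an assumption; the first pass in this notebook keys on the comment text).

```python
import csv
import os
import tempfile

def fake_rewrite(comment):
    # Hypothetical stand-in for the GPT-4o Mini call, so the sketch runs offline.
    return comment.strip().capitalize()

def process_resumable(rows, output_path, rewrite=fake_rewrite):
    """Rewrite each row once, appending results so an interrupted run can resume."""
    fieldnames = ["Comment_ID", "Comment", "Rewritten Comment"]
    done = set()
    if os.path.exists(output_path):
        # Collect the IDs that a previous run already wrote.
        with open(output_path, newline="", encoding="utf-8") as f:
            done = {r["Comment_ID"] for r in csv.DictReader(f)}
    else:
        # First run: create the file with a header row only.
        with open(output_path, "w", newline="", encoding="utf-8") as f:
            csv.DictWriter(f, fieldnames=fieldnames).writeheader()

    written = 0
    with open(output_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        for row in rows:
            if row["Comment_ID"] in done:
                continue  # already processed in an earlier run
            writer.writerow({**row, "Rewritten Comment": rewrite(row["Comment"])})
            f.flush()  # push each result to disk before moving on
            done.add(row["Comment_ID"])
            written += 1
    return written

rows = [
    {"Comment_ID": "a1", "Comment": "gr8 vid"},
    {"Comment_ID": "b2", "Comment": "lol so tru"},
]
output_path = os.path.join(tempfile.mkdtemp(), "rewritten_demo.csv")
first_run = process_resumable(rows, output_path)   # rewrites both rows
second_run = process_resumable(rows, output_path)  # finds nothing left to do
print(first_run, second_run)
```

Appending one row at a time means a crash loses at most the comment in flight, at the cost of more frequent disk writes.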


### Importing Libraries

In [None]:
import openai
import pandas as pd
import os
import json
from tqdm import tqdm

In [3]:
# OpenAI API Key
OPENAI_API_KEY = "************"  
client = openai.OpenAI(api_key=OPENAI_API_KEY)

# File paths
COMMENTS_FILE = "All_YouTube_Comments.csv"
OUTPUT_CSV = "Rewritten_YouTube_Comments.csv"

# Load CSV File
df_comments = pd.read_csv(COMMENTS_FILE)

# Drop unnecessary columns if needed
df_comments = df_comments.drop(columns=["Data_Source", "Source_File"], errors='ignore')

# Ensure output CSV exists before processing
if not os.path.exists(OUTPUT_CSV):
    pd.DataFrame(columns=list(df_comments.columns) + ["Rewritten Comment"]).to_csv(OUTPUT_CSV, index=False)

# Function to rewrite comment using GPT-4o Mini
def rewrite_comment(comment):
    prompt = f"""
    You are a text cleaner. Your task is to rewrite the following YouTube comment to make it more readable while keeping its original meaning intact. 
    - Fix typos and grammatical errors.
    - Expand abbreviations where necessary.
    - Keep slang words if they contribute to the meaning but clarify if needed.
    - **Do NOT remove or censor foul language**.
    - Do NOT change the sentiment or tone of the comment.
    
    **Original Comment:**
    "{comment}"
    
    **Rewritten Version:**
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are rewriting text to be more readable while preserving meaning, including any explicit language."},
            {"role": "user", "content": prompt}
        ]
    )
    
    return response.choices[0].message.content.strip()

# Load processed comments if they exist
if os.path.exists(OUTPUT_CSV):
    df_existing = pd.read_csv(OUTPUT_CSV)
    processed_comments = set(df_existing["Comment"].astype(str))  # Convert to set for fast lookup
else:
    processed_comments = set()

# Process each comment with GPT-4o Mini, skipping already processed ones
for index, row in tqdm(df_comments.iterrows(), total=len(df_comments), desc="Rewriting Comments"):
    comment = str(row["Comment"])  # Ensure it's a string

    # Skip if already processed
    if comment in processed_comments:
        continue

    try:
        rewritten_comment = rewrite_comment(comment)
        new_row = row.to_dict()
        new_row["Rewritten Comment"] = rewritten_comment

        pd.DataFrame([new_row]).to_csv(OUTPUT_CSV, mode='a', header=False, index=False)
    except Exception as e:
        print(f"Failed to rewrite comment: {comment[:50]}... | Error: {e}", flush=True)

print(f"Processing complete! New CSV file saved as: {OUTPUT_CSV}")


Rewriting Comments: 100%|████████████████████| 108463/108463 [3:55:03<00:00, 7.69it/s]
Processing complete! New CSV file saved as: Rewritten_YouTube_Comments.csv
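
The loop above catches any exception, prints it, and moves on, so a comment whose request fails is left out of the output file. One possible way to reduce such gaps is to retry a failed call a few times with exponential backoff before giving up. The sketch below is an assumption about how that could look, not something this notebook does: `call_with_retries` and `flaky_rewrite` are hypothetical names, and the `sleep` parameter exists only so the demo runs without waiting.

```python
import time

def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); after a failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller log and skip the comment
            sleep(base_delay * (2 ** attempt))

# Hypothetical flaky function standing in for rewrite_comment:
# it fails twice, then succeeds.
attempts = {"count": 0}

def flaky_rewrite(comment):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated transient error")
    return comment.upper()

delays = []  # record the requested waits instead of actually sleeping
result = call_with_retries(flaky_rewrite, "nice video", sleep=delays.append)
print(result, delays)
```

In the notebook's loop, the call `rewrite_comment(comment)` could be wrapped as `call_with_retries(rewrite_comment, comment)` inside the existing `try` block, so a comment is only skipped after every attempt has failed.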


# Rewriting Missed Comments

- I checked for any comments that were missed during the initial rewriting process.  
- I compared the original dataset with the rewritten file and identified missing entries by `Comment_ID`, falling back to the `Comment` text when no ID column is available.  
- The missed comments were then rewritten one by one using GPT-4o Mini and added back to the original rewritten file.  
- Finally, I merged everything again and saved a fully updated dataset with all comments rewritten.
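
The comparison step amounts to an anti-join: keep the original rows whose key has no usable rewrite. The notebook does this with a pandas left merge followed by `isna()`; the sketch below shows the same idea on toy data in plain Python. The function name `find_missing` and the sample rows are made up for illustration.

```python
def find_missing(original_rows, rewritten_rows, key="Comment_ID"):
    """Return the original rows whose key has no non-empty rewrite."""
    rewritten_keys = {
        r[key]
        for r in rewritten_rows
        if (r.get("Rewritten Comment") or "").strip()
    }
    return [r for r in original_rows if r[key] not in rewritten_keys]

original = [
    {"Comment_ID": "a1", "Comment": "gr8 vid"},
    {"Comment_ID": "b2", "Comment": "lol so tru"},
    {"Comment_ID": "c3", "Comment": "thx 4 sharing"},
]
rewritten = [
    {"Comment_ID": "a1", "Rewritten Comment": "Great video."},
    {"Comment_ID": "c3", "Rewritten Comment": ""},  # present but empty counts as missed
]
missed = find_missing(original, rewritten)
print([r["Comment_ID"] for r in missed])
```

Treating an empty rewrite as missing mirrors the notebook's step of replacing empty strings with a missing value before the merge.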


In [5]:
# OpenAI API Key
OPENAI_API_KEY = "**********"
client = openai.OpenAI(api_key=OPENAI_API_KEY)

# File paths
COMMENTS_FILE = "All_YouTube_Comments.csv"  # Original dataset
REWRITTEN_FILE = "Rewritten_YouTube_Comments.csv"  # Processed dataset
OUTPUT_FILE = "Rewritten_YouTube_Comments_Corrected.csv"  # Final merged dataset
MISSED_FILE = "Missed_Comments.csv"  # Comments that were not rewritten

# Load original and processed datasets
df_comments = pd.read_csv(COMMENTS_FILE)
df_rewritten = pd.read_csv(REWRITTEN_FILE)

# Ensure 'Rewritten Comment' column exists
if "Rewritten Comment" not in df_rewritten.columns:
    df_rewritten["Rewritten Comment"] = None

# Handle empty values properly
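# Assign the result back: calling replace(..., inplace=True) on a selected
# column is deprecated in recent pandas and may not modify the DataFrame
df_rewritten["Rewritten Comment"] = df_rewritten["Rewritten Comment"].replace("", pd.NA)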

# Identify merge key
if "Comment_ID" in df_comments.columns and "Comment_ID" in df_rewritten.columns:
    merge_key = "Comment_ID"
elif "Comment" in df_comments.columns and "Comment" in df_rewritten.columns:
    merge_key = "Comment"
else:
    raise ValueError("No valid unique identifier found for merging.")

# Merge to find missing rewritten comments
df_corrected = df_comments.merge(df_rewritten[[merge_key, "Rewritten Comment"]], 
                                 on=merge_key, how="left")

# Extract missed comments
df_missed = df_corrected[df_corrected["Rewritten Comment"].isna()].copy()

# Save missed comments for reference
df_missed.to_csv(MISSED_FILE, index=False)
print(f"{len(df_missed)} comments were missed. They are saved in '{MISSED_FILE}' for reprocessing.")

# Function to rewrite comment using GPT-4o Mini
def rewrite_comment(comment):
    prompt = f"""
    You are a text cleaner. Your task is to rewrite the following YouTube comment to make it more readable while keeping its original meaning intact. 
    - Fix typos and grammatical errors.
    - Expand abbreviations where necessary.
    - Keep slang words if they contribute to the meaning but clarify if needed.
    - **Do NOT remove or censor foul language**.
    - Do NOT change the sentiment or tone of the comment.
    
    **Original Comment:**
    "{comment}"
    
    **Rewritten Version:**
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are rewriting text to be more readable while preserving meaning, including any explicit language."},
            {"role": "user", "content": prompt}
        ]
    )
    
    return response.choices[0].message.content.strip()

# Process only the missed comments
if not df_missed.empty:
    print("Processing missed comments...")

    rewritten_comments = []
    for index, row in tqdm(df_missed.iterrows(), total=len(df_missed), desc="Rewriting Comments"):
        comment = str(row["Comment"])  # Ensure it's a string

        try:
            rewritten_comment = rewrite_comment(comment)
            rewritten_comments.append((row[merge_key], rewritten_comment))
        except Exception as e:
            print(f"Failed to rewrite comment: {comment[:50]}... | Error: {e}", flush=True)

    # Convert results to DataFrame
    df_new_rewritten = pd.DataFrame(rewritten_comments, columns=[merge_key, "Rewritten Comment"])

    # Append newly processed comments to the existing rewritten file
    # (ignore_index avoids repeated index labels after concatenation)
    df_rewritten = pd.concat([df_rewritten, df_new_rewritten], ignore_index=True)
    df_rewritten.to_csv(REWRITTEN_FILE, index=False)

    print(f"{len(df_new_rewritten)} missed comments have been processed and added back to '{REWRITTEN_FILE}'.")

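# Re-merge after processing. Keep one non-empty rewrite per key first, so a key
# that appears twice in df_rewritten cannot duplicate rows in the left merge.
df_lookup = (df_rewritten.dropna(subset=["Rewritten Comment"])
                         .drop_duplicates(subset=merge_key, keep="last"))
df_corrected = df_comments.merge(df_lookup[[merge_key, "Rewritten Comment"]],
                                 on=merge_key, how="left")

# Save final corrected dataset
df_corrected.to_csv(OUTPUT_FILE, index=False)
print(f"Final corrected file saved as '{OUTPUT_FILE}'.")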


Rewriting Comments: 100%|████████████████████| 1552/1552 [17:14<00:00, 1.50it/s]
1552 missed comments have been processed and added back to 'Rewritten_YouTube_Comments.csv'
Final corrected file saved as: Rewritten_YouTube_Comments_Corrected.csv


In [None]:
# Function to rewrite comment using GPT-4o Mini
def rewrite_comment(comment):
    prompt = f"""
    You are a text cleaner. Your task is to rewrite the following YouTube comment 
    to make it more readable while keeping its original meaning intact. 
    - Fix typos and grammatical errors.
    - Expand abbreviations where necessary.
    - Keep slang words if they contribute to meaning, but clarify if needed.
    - Do NOT remove or censor foul language.
    - Do NOT change the sentiment or tone of the comment.

    Original Comment:
    "{comment}"

    Rewritten Version:
    """
    
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are rewriting text to be more readable while preserving meaning."},
            {"role": "user", "content": prompt}
        ]
    )
    
    return response.choices[0].message.content.strip()
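
Because the prompt is a triple-quoted f-string written inside the function, every line of it carries the function's four-space indentation, and that whitespace is sent to the model. This is probably harmless, but if a cleaner prompt is wanted, one option is to dedent the template once and fill in the comment afterwards. The sketch below is a suggestion under that assumption; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the prompt text is copied from the cell above.

```python
import textwrap

# Dedent the template before inserting the comment, so a multi-line comment
# cannot interfere with the indentation that dedent removes.
PROMPT_TEMPLATE = textwrap.dedent("""\
    You are a text cleaner. Your task is to rewrite the following YouTube comment
    to make it more readable while keeping its original meaning intact.
    - Fix typos and grammatical errors.
    - Expand abbreviations where necessary.
    - Keep slang words if they contribute to meaning, but clarify if needed.
    - Do NOT remove or censor foul language.
    - Do NOT change the sentiment or tone of the comment.

    Original Comment:
    "{comment}"

    Rewritten Version:
    """)

def build_prompt(comment):
    """Insert the comment into the dedented template."""
    return PROMPT_TEMPLATE.format(comment=comment)

prompt = build_prompt("gr8 vid lol")
print(prompt)
```

Inside `rewrite_comment`, the user message could then be `build_prompt(comment)` in place of the inline f-string.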
