
# Children's News Reframing Pipeline (DeepSeek) - SAFE

This notebook reproduces your exact script, split into separate, runnable cells so you can see progress as the job runs.  
**Run each cell in order.** You'll see a status print for each batch while reframing.


In [1]:
import torch

In [None]:

import os, time, json, re
import pandas as pd
import torch
import gc
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

# ---------------- Config ----------------
CSV_PATH         = "../data/raw/original_data.csv"   # your source file
TEXT_COLUMN      = "COMBINED_INPUT"      # constructed input column ([title] <TSEP> [summary_long_500])
DEEPSEEK_MODEL   = "deepseek-ai/deepseek-llm-7b-chat"
MAX_INPUT_TOKENS = 4096
# NOTE: the four settings below are not read by the generation cells,
# which hard-code batch_size=50, max_new_tokens=500, temperature=0.1, top_p=0.9.
BATCH_SIZE       = 16
MAX_NEW_TOKENS   = 512
TEMPERATURE      = 0.7
TOP_P            = 0.9

In [17]:
INSTRUCTION_PROMPT = r"""You are an ethical children's news editor for readers aged 10. You MUST follow the EXACT output format below with no deviations.

=====================
### STYLE REQUIREMENTS:
- Preserve all facts; remove graphic/adult detail
- No added opinions or interpretations.
- Vocabulary simple enough for FK ≤ 4.5
- Keep sentences short (≈ 8-15 words on average), words simple (≤1.5 syllables)
=====================
### CRITICAL FORMAT RULES:
1. Start with title.
2. Then exactly 8-10 sentences in one paragraph
3. No additional text, explanations, or deviations
=====================
### INPUT FORMAT:
[News Title] <TSEP> [Full Article Text]
=====================
### OUTPUT FORMAT (MUST MATCH EXACTLY):
[News Title]
A calm paragraph (**8–10 sentences**, ≤400 words) summarizing what happened in simple, factual language suitable for 8–10 year-olds.

=======
EXAMPLE:
### INPUT:
An 'Accidental Dictionary' Explores How Errors Created The English Language <TSEP> The Accidental Dictionary is the work of writer and etymologist Paul Anthony Jones. It's a 100-word dictionary that reveals the many lives each word has lived. "Clumsy" once meant “numb with cold,” and “prestigious” once means “deceitful.” Jones: The book is a reminder that the words we use have never been static, and that's not going to change any time soon.. The book's little potted history works like a single short story I guess, so the book has ended up acting like an etyMological anthology. It’s easy to think that once a word finds its way into the dictionary it's set in stone, but that’t of course not the case. Not only are new words being coined and old words being lost every day, but existing words are being molded and mutated, and knocked into different shapes to better fit what we need them to mean. It was originally a table that was originally yellow (and that sense probably derives from an old German dialect word for “to urinate”). But the stories I keep coming back to are those of words like man, girl and bimbo, which originally meant just “child,’ so could be applied to both girls and boys. And a burbo was burboly, so it does in words like “human” or “manslaughter”.

### OUTPUT: 
Title: An 'Accidental Dictionary' Explores How Errors Created The English Language 
=========================

The book "The Accidental Dictionary," written by expert Paul Anthony Jones, shows us that words in the English language don't always stay the same. The book is a special list of 100 words that reveals how their meanings have changed over time, proving that no word is truly "set in stone." For example, the word "clumsy" once had a totally different meaning, which was "numb with cold." Another interesting change is the word "prestigious," which is a compliment today but used to mean "deceitful," or tricky. Jones also found that words like "man," "girl," and a word called "bimbo" all used to mean just "child." This little history book reminds us that as people change, our language changes right along with us.
"""
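
The style requirements in the prompt (FK ≤ 4.5, 8-10 sentences) can be spot-checked on generated text. The cell below is a minimal sketch, not part of the original script: `count_syllables`, `split_sentences`, and `fk_grade` are hypothetical helpers, the syllable count is a rough vowel-group heuristic, and the grade uses the standard Flesch-Kincaid formula, so treat the scores as approximate.

```python
import re

def count_syllables(word: str) -> int:
    """Rough estimate: count vowel groups and drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows '.', '!' or '?' (heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

sample = "The cat sat on the mat. The dog ran to the park."
print("sentences:", len(split_sentences(sample)))
print("FK grade:", round(fk_grade(sample), 2))
```

Applied to a rewrite, a sentence count outside 8-10 or a grade well above 4.5 would flag the row for manual review.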


In [7]:

# ----------------- RUN: Prepare Input -----------------
# Load the CSV and assemble [title] <TSEP> [summary_long_500] for the SAFE rows
src = pd.read_csv(CSV_PATH)

# Normalize/guard: ensure presence and strings
required_cols = ["label", "title", "summary_long_500"]
missing = [c for c in required_cols if c not in src.columns]

if missing:
    raise ValueError(f"Missing columns in {CSV_PATH}: {missing}")

src["title_norm"] = src["title"].astype(str).str.strip()
src["summary_norm"] = src["summary_long_500"].astype(str).str.strip()

src = src[src["label"] == "SAFE"].copy()  # copy so the column assignment below works on an independent frame

src[TEXT_COLUMN] = src["title_norm"] + " <TSEP> " + src["summary_norm"]
src = src[src[TEXT_COLUMN].str.len() > 0].reset_index(drop=True)

src = src[["label", "title", "summary_long_500", TEXT_COLUMN]]
print("Prepared rows:", len(src))
print(src.head(1))
print(src[TEXT_COLUMN].iloc[10])

Prepared rows: 2754
  label                                              title  \
0  SAFE  Man Surprises Girlfriend By Drawing Them In Di...   

                                    summary_long_500  \
0  Kellen Hickey, 26, drew himself and his girlfr...   

                                      COMBINED_INPUT  
0  Man Surprises Girlfriend By Drawing Them In Di...  
An 'Accidental Dictionary' Explores How Errors Created The English Language <TSEP> The Accidental Dictionary is the work of writer and etymologist Paul Anthony Jones. It's a 100-word dictionary that reveals the many lives each word has lived. "Clumsy" once meant “numb with cold,” and “prestigious” once means “deceitful.” Jones: The book is a reminder that the words we use have never been static, and that's not going to change any time soon.. The book's little potted history works like a single short story I guess, so the book has ended up acting like an etyMological anthology. It’s easy to think that once a wor

In [10]:
os.makedirs("reframer_safe_output", exist_ok=True)  # make sure the output folder exists
src.to_csv("reframer_safe_output/input_safe.csv")

In [11]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

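# Load the tokenizer; decoder-only models need left padding for batched generation
tok = AutoTokenizer.from_pretrained(
    DEEPSEEK_MODEL, 
    use_fast=True,
    padding_side="left",      # pad on the left so every prompt ends where generation starts
    truncation_side="right"   # over-long prompts lose their tail, including the final "Assistant:" cue
)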

# Ensure pad_token is set BEFORE model loading
if tok.pad_token is None:
    if tok.eos_token is not None:
        tok.pad_token = tok.eos_token
    else:
        # Add a pad token if neither exists
        tok.add_special_tokens({'pad_token': '[PAD]'})

model = AutoModelForCausalLM.from_pretrained(
    DEEPSEEK_MODEL,
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
    trust_remote_code=True,
    device_map="auto" if device == "cuda" else None,  # Better GPU memory management
).eval()

# If you added a new pad token, resize model embeddings
if tok.pad_token == '[PAD]':
    model.resize_token_embeddings(len(tok))

print(f"Pad token: {tok.pad_token}")
print(f"Model loaded successfully on {device}")

Using device: cuda




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Pad token: <｜end▁of▁sentence｜>
Model loaded successfully on cuda




In [None]:
# # ----------------- TEST MODE (no Ray) -----------------
# start_idx = 0
# end_idx = 20
# bad_count = 0
# print(f"🔍 Running test mode: {start_idx} to {end_idx}...")

# test_df = src[start_idx:end_idx+1].copy()

# def build_prompts(texts, instruction):
#     prompts = []
#     for x in texts:
#         prompt = f"User: {instruction}\n\nArticle: {str(x)}\n\nNow please analyze this article and provide a child-friendly version following the specified format.\n\nAssistant:"
#         prompts.append(prompt)
#     return prompts

# texts = test_df[TEXT_COLUMN].astype(str).tolist()
# prompts = build_prompts(texts, INSTRUCTION_PROMPT)

# # Tokenize
# inputs = tok(prompts, return_tensors="pt", padding=True, truncation=True, max_length=MAX_INPUT_TOKENS).to(device)

# with torch.inference_mode():
#     with torch.no_grad():
#         # Test with bare minimum
#         gen = model.generate(
#             **inputs,
#             max_new_tokens=500,
#             do_sample=True,
#             temperature=0.1,
#             top_p=0.9,
#             pad_token_id=tok.pad_token_id
#         )

#         # Decode only completions
#         attn = inputs["attention_mask"]
#         outputs = []
#         for i in range(gen.size(0)):
#             in_len = int(attn[i].sum().item())
#             new_tokens = gen[i] #gen[i, in_len:]
#             text = tok.decode(new_tokens, skip_special_tokens=True)
#             text = text.strip()

#             assistant_idx = text.rfind("Assistant:")
#             if assistant_idx != -1:
#                 # Take everything after "Assistant:"
#                 response = text[assistant_idx + len("Assistant:"):].strip()
#                 outputs.append(response)
#             else:
#                 bad_count+=1
#                 outputs.append(text)

#         test_df["RAW_OUTPUT"] = outputs

# test_df[["RAW_OUTPUT"]].to_csv(f"sample_output_safe.csv")
# print("\n✅ Test generation complete. Output written to " + f"sample_output_safe.csv")
# print("bad_count", bad_count)

🔍 Running test mode: 0 to 20...

✅ Test generation complete. Output written to sample_output_safe.csv
bad_count 0
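
The test cell decodes the full sequence (prompt plus completion) and keeps whatever follows the last `Assistant:` marker, counting rows where the marker is missing. The same logic can be sketched as a standalone function. `extract_response` is a hypothetical name, and the decoded string below is invented for illustration.

```python
MARKER = "Assistant:"

def extract_response(decoded: str, marker: str = MARKER) -> tuple[str, bool]:
    """Return (response, found): the text after the last marker, or the full text if absent."""
    text = decoded.strip()
    idx = text.rfind(marker)
    if idx == -1:
        # Marker missing, e.g. because the prompt was truncated before "Assistant:"
        return text, False
    return text[idx + len(marker):].strip(), True

decoded = (
    "User: rewrite this.\n\nArticle: A <TSEP> B\n\n"
    "Assistant: Title: A\n\nB happened."
)
response, found = extract_response(decoded)
print(found)
print(response)
```

Because `rfind` takes the last occurrence, a completion that itself contains the string `Assistant:` would be cut at that point, so rows like that are worth checking by hand.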


In [21]:
# ----------------- BATCH PROCESSING WITH SEPARATE FILES -----------------
batch_size = 50
total_samples = len(src)
bad_count = 0

def build_prompts(texts, instruction):
    prompts = []
    for x in texts:
        prompt = f"User: {instruction}\n\nArticle: {str(x)}\n\nNow please analyze this article and provide a child-friendly version following the specified format.\n\nAssistant:"
        prompts.append(prompt)
    return prompts

print(f"🔍 Running batch processing: {total_samples} total samples in batches of {batch_size}...")

for start_idx in range(0, total_samples, batch_size):
    end_idx = min(start_idx + batch_size - 1, total_samples - 1)
    print(f"Processing batch: {start_idx} to {end_idx}...")
    
    test_df = src[start_idx:end_idx+1].copy()

    texts = test_df[TEXT_COLUMN].astype(str).tolist()
    prompts = build_prompts(texts, INSTRUCTION_PROMPT)

    # Tokenize
    inputs = tok(prompts, return_tensors="pt", padding=True, truncation=True, max_length=MAX_INPUT_TOKENS).to(device)

    # inference_mode already disables autograd, so a nested no_grad is not needed
    with torch.inference_mode():
        gen = model.generate(
            **inputs,
            max_new_tokens=500,
            do_sample=True,
            temperature=0.1,
            top_p=0.9,
            pad_token_id=tok.pad_token_id
        )

    outputs = []
    for i in range(gen.size(0)):
        # Decode the full sequence (prompt + completion), then keep the text
        # after the last "Assistant:" marker
        text = tok.decode(gen[i], skip_special_tokens=True).strip()

        assistant_idx = text.rfind("Assistant:")
        if assistant_idx != -1:
            outputs.append(text[assistant_idx + len("Assistant:"):].strip())
        else:
            # Marker not found: keep the full decoded text and count the row
            bad_count += 1
            outputs.append(text)

    test_df["RAW_OUTPUT"] = outputs

    # Save each batch separately
    test_df[["label", "title", "summary_long_500", "RAW_OUTPUT"]].to_csv(f"reframer_safe_output/batch_output_{start_idx}_{end_idx}.csv")
    
    # Clear memory
    del inputs, gen
    torch.cuda.empty_cache()
    gc.collect()

    print("\n✅ Output written to " + f"reframer_safe_output/batch_output_{start_idx}_{end_idx}.csv")

print(f"\n✅ Batch processing complete! Processed {total_samples} samples.")
print(f"Total bad counts: {bad_count}")

🔍 Running batch processing: 2754 total samples in batches of 50...
Processing batch: 0 to 49...

✅ Output written to reframer_safe_output/batch_output_0_49.csv
Processing batch: 50 to 99...

✅ Output written to reframer_safe_output/batch_output_50_99.csv
Processing batch: 100 to 149...

✅ Output written to reframer_safe_output/batch_output_100_149.csv
Processing batch: 150 to 199...

✅ Output written to reframer_safe_output/batch_output_150_199.csv
Processing batch: 200 to 249...

✅ Output written to reframer_safe_output/batch_output_200_249.csv
Processing batch: 250 to 299...

✅ Output written to reframer_safe_output/batch_output_250_299.csv
Processing batch: 300 to 349...

✅ Output written to reframer_safe_output/batch_output_300_349.csv
Processing batch: 350 to 399...

✅ Output written to reframer_safe_output/batch_output_350_399.csv
Processing batch: 400 to 449...

✅ Output written to reframer_safe_output/batch_output_400_449.csv
Processing batch: 450 to 499...
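
The batch loop names each file by an inclusive `(start, end)` index pair. As a small sketch of that arithmetic, the hypothetical helper `batch_ranges` below lists the pairs for the 2,754 rows and batch size of 50 used in this run. It does no model work, so it is cheap to run before launching a long job.

```python
def batch_ranges(total: int, batch_size: int) -> list[tuple[int, int]]:
    """Inclusive (start, end) pairs, mirroring the batch_output_{start}_{end}.csv names."""
    return [
        (start, min(start + batch_size - 1, total - 1))
        for start in range(0, total, batch_size)
    ]

ranges = batch_ranges(2754, 50)
print("batches:", len(ranges))
print("first:", ranges[0])
print("last:", ranges[-1])
```

Comparing this list against the files already on disk is one way to resume an interrupted run without regenerating finished batches.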

In [22]:
import glob
import os

# Combine the per-batch CSVs, located with a glob pattern
def combine_batch_files(folder_path, output_filename=f"combined_batch_output_0_{total_samples}.csv"):
    # Find all batch output files
    batch_files = glob.glob(os.path.join(folder_path, "batch_output_*.csv"))
    
    # Sort files numerically by the start index
    batch_files.sort(key=lambda x: int(x.split('_')[-2]))
    
    print(f"Found {len(batch_files)} batch files:")
    for file in batch_files:
        print(f"  {file}")
    
    # Read and combine all files
    combined_df = pd.concat([pd.read_csv(file) for file in batch_files], ignore_index=True)
    
    # Save combined file
    combined_df.to_csv(os.path.join(folder_path, output_filename), index=False)
    
    print(f"\n✅ Combined {len(batch_files)} files into {output_filename}")
    print(f"Total rows: {len(combined_df)}")
    
    return combined_df

# Usage
folder_path = "reframer_safe_output"  # folder that holds the batch_output_*.csv files
combined_data = combine_batch_files(folder_path)

Found 56 batch files:
  reframer_safe_output/batch_output_0_49.csv
  reframer_safe_output/batch_output_50_99.csv
  reframer_safe_output/batch_output_100_149.csv
  reframer_safe_output/batch_output_150_199.csv
  reframer_safe_output/batch_output_200_249.csv
  reframer_safe_output/batch_output_250_299.csv
  reframer_safe_output/batch_output_300_349.csv
  reframer_safe_output/batch_output_350_399.csv
  reframer_safe_output/batch_output_400_449.csv
  reframer_safe_output/batch_output_450_499.csv
  reframer_safe_output/batch_output_500_549.csv
  reframer_safe_output/batch_output_550_599.csv
  reframer_safe_output/batch_output_600_649.csv
  reframer_safe_output/batch_output_650_699.csv
  reframer_safe_output/batch_output_700_749.csv
  reframer_safe_output/batch_output_750_799.csv
  reframer_safe_output/batch_output_800_849.csv
  reframer_safe_output/batch_output_850_899.csv
  reframer_safe_output/batch_output_900_949.csv
  reframer_safe_output/batch_output_950_999.csv
  reframer_safe_output/
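
The combine cell sorts files numerically by start index because a plain string sort would put `100_149` before `50_99`. Its key, `x.split('_')[-2]`, depends on where underscores fall in the whole path. A slightly more defensive variant, sketched here with a hypothetical `batch_start` helper, parses only the file's base name with a regular expression.

```python
import os
import re

def batch_start(path: str) -> int:
    """Start index parsed from a base name like batch_output_100_149.csv."""
    match = re.fullmatch(r"batch_output_(\d+)_(\d+)\.csv", os.path.basename(path))
    if match is None:
        raise ValueError(f"Unexpected batch file name: {path}")
    return int(match.group(1))

files = [
    "reframer_safe_output/batch_output_100_149.csv",
    "reframer_safe_output/batch_output_50_99.csv",
    "reframer_safe_output/batch_output_0_49.csv",
]
lexicographic = sorted(files)              # string order: 0, 100, 50
numeric = sorted(files, key=batch_start)   # numeric order: 0, 50, 100
print(lexicographic)
print(numeric)
```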

In [23]:
def print_detailed_memory():
    # Nothing to report on CPU-only machines; the device queries below need a GPU
    if not torch.cuda.is_available():
        print("CUDA is not available; no GPU memory to report.")
        return

    print("\n=== Detailed GPU Memory ===")
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"Cached:    {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    
    # Per-device breakdown if multiple GPUs
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:")
        print(f"  Allocated: {torch.cuda.memory_allocated(i) / 1024**3:.2f} GB")
        print(f"  Cached:    {torch.cuda.memory_reserved(i) / 1024**3:.2f} GB")
    
    # Total memory of GPU 0, in GB
    total = torch.cuda.get_device_properties(0).total_memory
    print(f" GPU total memory:   {total / 1024**3}")
    # "Free" is total minus allocated; cached-but-unused memory is not subtracted
    print(f"GPU Free:   {(total - torch.cuda.memory_allocated()) / 1024**3:.2f} GB available")

print_detailed_memory()


=== Detailed GPU Memory ===
Allocated: 12.94 GB
Cached:    12.95 GB
GPU 0:
  Allocated: 12.94 GB
  Cached:    12.95 GB
 GPU total memory:   178.3616943359375
GPU Free:   165.42 GB available


In [None]:
gc.collect()
torch.cuda.empty_cache()

## OUTPUT CLEANING

In [24]:
import pandas as pd
df = pd.read_csv("reframer_safe_output/combined_batch_output_0_2754.csv")

In [25]:
print(len(df))
df["RAW_OUTPUT"].head()

2754


0    Title: Man Surprises Girlfriend By Drawing The...
1    Title: This Artist Gives Renaissance-Style Scu...
2    Title: Sculptures From This International Ice ...
3    Title: 60 Books We Can't Wait To Read In 2018\...
4    Title: Why Do We Call That Holiday Game Yankee...
Name: RAW_OUTPUT, dtype: object

In [26]:
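# Rows with two "Title:" markers: the model went on to echo a second "Article:" block.
# Cut everything from the first "\nArticle:" onward.
mask = (df["RAW_OUTPUT"].str.count("Title:") == 2)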

df.loc[mask, "RAW_OUTPUT"] = (
    df.loc[mask, "RAW_OUTPUT"]
    .str.replace(r'(?si)\nArticle:.*', '', regex=True)
    .str.strip()
)

print(mask.sum())
for line in df.loc[mask, "RAW_OUTPUT"]:
    print(line)
    print("+++++++++++++++++++++++++++++")


8
[News Title]
A calm paragraph (**8–10 sentences**, ≤400 words) summarizing what happened in simple, factual language suitable for 8–10 year-olds.


Title: Philly Special, And How Doug Pederson Out-Coached Bill Belichick In The Super Bowl

In the Super Bowl, Doug Pederson, the coach of the Philadelphia Eagles, out-coached Bill Belichick, the coach of the New England Patriots. Pederson made some clever decisions, like going for it on fourth-and-goal at the 1-yard line and calling a trick play with a tight end passing to his quarterback. This trick play, called the "Philly Special," was actually borrowed from another team, Clemson University. The Eagles won the Super Bowl with a score of 41-33.
+++++++++++++++++++++++++++++
Title: Mummified Baboons Found in Egypt Have Puzzled Scientists for 118 Years

Scientists found some very old, well-preserved baboons in Egypt. These baboons were not with any important people from the past, and they were not in a group of other baboons. Scient

In [36]:
# 1️⃣ Read the raw text (replace this with your variable or file read)
manually_rewritten_input_file = "reframer_safe_output/manually_rewritten_safe.txt"

# Read text from a local file
with open(manually_rewritten_input_file, 'r', encoding='utf-8') as f:
    text = f.read()

# 2️⃣ Split the text by the delimiter
chunks = text.split('+++++++++++++++++++++++++++++')

# 3️⃣ Clean up any leading/trailing whitespace
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

# 4️⃣ Create a DataFrame
df_manual1 = pd.DataFrame(chunks, columns=['Content'])

print(len(df_manual1))
for line in df_manual1['Content']:
    print(line)
    print("********")


30
Title: An 'Accidental Dictionary' Explores How Errors Created The English Language
This book by Paul Anthony Jones shows how word meanings change over time. “Clumsy” once meant “numb with cold.” “Prestigious” once meant “tricky” or “deceitful.” It’s a set of 100 words with quick, fun histories. Some words like “man,” “girl,” and “bimbo” once meant “child.” The book reminds us that language never sits still. Words are shaped by how people use them. Change in words is normal, not a mistake. That’s part of what makes English lively and fun!
********
Title: Julia Haft Candell Confounds The Infinite At Parrasch Heijnen Gallery
This art show features new clay sculptures by Julia Haft-Candell. Her shapes look like the infinity sign twisting and looping. Tiny plates fit together to make bigger forms. Her wall pieces nod to famous clay artists Ken Price and Peter Voulkos. Black-and-white marks make a bold, secret-looking alphabet. The show runs through

In [37]:
# Index positions of the rows in df whose RAW_OUTPUT contains no "Title:"
idx_no_title = df[df["RAW_OUTPUT"].str.count("Title:") == 0].index

# The manual rewrites are assigned by position, so the counts must match
assert len(idx_no_title) == len(df_manual1), "Length mismatch between idx_no_title and df_manual1!"

# Replace RAW_OUTPUT for those rows with df_manual1["Content"], in order
df.loc[idx_no_title, "RAW_OUTPUT"] = df_manual1["Content"].values


In [48]:
# rows whose RAW_OUTPUT does not start with "Title:" (missing values are included)
mask = ~df["RAW_OUTPUT"].str.startswith("Title:", na=False)

# strip everything before the first "Title:" (multiline-safe, case-insensitive optional)
df.loc[mask, "RAW_OUTPUT"] = (
    df.loc[mask, "RAW_OUTPUT"]
      .str.replace(r'(?is)^.*?(Title\s*:)', r'\1', regex=True)  # (?s)=dotall, (?i)=ignore case
      .str.strip()
)

# verify
print(mask.sum())
for line in df.loc[mask, "RAW_OUTPUT"]:
    print(line)
    print("+++++++++++++++++++++++++++++")


29
Title: Jane Austen Apparently Made Up Two Fake Marriages, For The Lulz

Jane Austen, the famous writer of romantic stories, was known for writing about people getting married. But did you know that she made up two fake marriage announcements when she was younger? She filled out papers to say that she was getting married, just for fun! These fake announcements were found in a book called the Steventon parish marriage register. Some people think this shows that Jane Austen liked to have a little fun when she was young. She never got married and passed away when she was 41 years old.
+++++++++++++++++++++++++++++
Title: Thoughts on 54 Below, 'Blood Brothers' and Cabaret

Julia Murney opened a new cabaret show at Feinstein's/54 Below. The show was well done, but it didn't feel like a cabaret. The writer says he's always nervous going to cabaret shows. He likes cabaret because it's honest and tells interesting stories. But he doesn't like simple concerts of standards and isn't a fan of L
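
The two regex steps in this section (cutting an echoed `Article:` tail and stripping any preamble before the first `Title:`) can be tried on a single string before touching the DataFrame. The sketch below uses the same patterns with Python's `re` module. `clean_output` is a hypothetical helper, the sample string is invented, and unlike the notebook it applies both steps unconditionally rather than behind a mask.

```python
import re

def clean_output(raw: str) -> str:
    # Step 1: drop everything from the first "\nArticle:" onward (echoed input)
    text = re.sub(r"(?si)\nArticle:.*", "", raw)
    # Step 2: drop any preamble before the first "Title:" (non-greedy match)
    text = re.sub(r"(?is)^.*?(Title\s*:)", r"\1", text)
    return text.strip()

raw = (
    "[News Title]\nA calm paragraph.\n\n"
    "Title: Ice Festival\n\nIt was cold.\n"
    "Article: Other <TSEP> text\nTitle: Other"
)
cleaned = clean_output(raw)
print(cleaned)
```

The `[News Title]` placeholder is not treated as a title marker because the pattern requires a colon after `Title`. Text with no `Title:` at all passes through unchanged apart from the tail cut and whitespace trim.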

In [50]:
df.to_csv(f"reframer_safe_output/cleanedFinal_combined_batch_output_0_{len(df)}.csv")

## COMBINE AND CREATE FINAL DATASET

In [2]:
import pandas as pd
df1 = pd.read_csv("reframer_safe_output/cleanedFinal_combined_batch_output_0_2754.csv")
df2 = pd.read_csv("reframer_sensitive_output/cleanedFinal_combined_batch_output_0_3031.csv")

In [3]:
import pandas as pd

# keep only these columns, stack the two DFs, then rename
cols = ["label", "title", "summary_long_500", "RAW_OUTPUT"]

out = (
    pd.concat(
        [df1[cols], df2[cols]],   # <- your two DataFrames
        ignore_index=True
    )
    .rename(columns={
        "summary_long_500": "article_500",
        "RAW_OUTPUT": "rewrite"
    })
)

# (optional) quick check
print(out.head())
print(out.columns.tolist())


  label                                              title  \
0  SAFE  Man Surprises Girlfriend By Drawing Them In Di...   
1  SAFE  This Artist Gives Renaissance-Style Sculptures...   
2  SAFE  Sculptures From This International Ice And Sno...   
3  SAFE             60 Books We Can't Wait To Read In 2018   
4  SAFE  Why Do We Call That Holiday Game Yankee Swap, ...   

                                         article_500  \
0  Kellen Hickey, 26, drew himself and his girlfr...   
1  Barcelona-based artist Gerard Mas mixes Renais...   
2  The Harbin International Ice and Snow Festival...   
3  The coming year's literary crop looks bountifu...   
4  Gift exchanges are a big part of American Chri...   

                                             rewrite  
0  Title: Man Surprises Girlfriend By Drawing The...  
1  Title: This Artist Gives Renaissance-Style Scu...  
2  Title: Sculptures From This International Ice ...  
3  Title: 60 Books We Can't Wait To Read In 2018\...  
4  Title: Why D

In [6]:
out.to_csv("../data/raw/kid_rewrite_corpus.csv")
len(out)

5785