
# Children's News Reframing Pipeline (DeepSeek) - SAFE

This notebook reproduces your exact script, split into separate, runnable cells so you can see progress as the job runs.  
**Run each cell in order.** You'll see status prints and a tqdm progress bar while reframing.


In [1]:
import torch

In [None]:

import os, time, json, re
import pandas as pd
import torch
import gc
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

# ---------------- Config ----------------
CSV_PATH         = "../data/raw/original_data.csv"   # your source file
TEXT_COLUMN      = "COMBINED_INPUT"      # constructed input column: [title] <TSEP> [summary_long_500]
DEEPSEEK_MODEL   = "deepseek-ai/deepseek-llm-7b-chat"
BATCH_SIZE       = 16
MAX_INPUT_TOKENS = 4096
MAX_NEW_TOKENS   = 512                   # allow room for text + JSON
TEMPERATURE      = 0.7
TOP_P            = 0.9

In [3]:
INSTRUCTION_PROMPT = r"""You are an ethical children's news editor for readers aged 8-10. You MUST follow the EXACT output format below with no deviations.

=====================
### STYLE REQUIREMENTS:
- Calm, reassuring, educational; no sensational language or moralizing
- Vocabulary simple enough for FK ≤ 4.5
- Keep sentences short (≈ 8-15 words on average), words simple (≤1.5 syllables)
- Preserve true facts; remove graphic/adult detail
=====================
### CRITICAL FORMAT RULES:
1. Start with title.
2. Then the line **A Lesson about [Core Value]**
3. Then exactly 5-7 sentences in one paragraph
4. Then exactly: **What We Can Learn:**
5. Then exactly two bullet points formatted as: * **[Value Name]:** [One sentence]
6. End with exactly one empathy question
7. No additional text, explanations, or deviations
=====================
### INPUT FORMAT:
[News Title] <TSEP> [Full Article Text]
=====================
### OUTPUT FORMAT (MUST MATCH EXACTLY):
[News Title]
A Lesson about [Core Value]
A calm, factual paragraph (5–7 sentences, ≤400 words) describing the event in a safe, educational tone.

Then include:
What We Can Learn:
* [Value 1]: [1-sentence takeaway]
* [Value 2]: [1-sentence takeaway]

[Empathy-based reflective question encouraging fairness or kindness]

=======
EXAMPLE:
### INPUT:
Duncan Jones Founds Online Book Club in Honor of His Father, David Bowie <TSEP> Duncan Jones has launched an online book club in honor of his late father. The first book will be Peter Ackroyd’s 1985 crime novel Hawskmoor. Bowie died at age 69 in January 2016 following an 18-month struggle with cancer. Jones told one tweeter who asked how to join that the person “just did,” indicating it was open to anyone on Twitter who wanted to read and then discuss the book. It's not clear whether Jones will stick to the list for future book club picks, however. The musician's official website revealed his “Top 100 Books” in 2013, with Hawksmoor appearing on the list, along with The Brief Wondrous Life Of Oscar Wao by Junot Diaz and The Stranger by Albert Camus.

### OUTPUT: 
Title: Duncan Jones Founds Online Book Club in Honor of His Father, David Bowie
=========================
A Lesson about Resilience
=========================

Duncan Jones has created an online book club to honor his late father, David Bowie. The first book to be discussed is Peter Ackroyd's 1985 crime novel, Hawksmoor. Despite the loss of his father, who passed away in January 2016 after an 18-month battle with cancer, Jones has found a way to remember him through literature. The club is open to anyone on Twitter who wants to read and discuss the book.

What We Can Learn:
=========================

* Resilience: Despite the loss of his father, Duncan Jones found a way to honor him by creating an online book club.
* Creativity: Jones used his father's love of literature as a way to remember him and bring people together.

How can we be more like Duncan Jones and use our passions to remember and honor loved ones?
"""


In [None]:

# ----------------- RUN: Prepare Input -----------------
# Load the CSV and assemble [title] <TSEP> [summary_long_500] for rows labeled SENSITIVE
src = pd.read_csv(CSV_PATH)

# Normalize/guard: ensure presence and strings
required_cols = ["label", "title", "summary_long_500"]
missing = [c for c in required_cols if c not in src.columns]

if missing:
    raise ValueError(f"Missing columns in {CSV_PATH}: {missing}")

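# fillna("") first: astype(str) alone would turn missing values into the literal string "nan"
src["title_norm"] = src["title"].fillna("").astype(str).str.strip()
src["summary_norm"] = src["summary_long_500"].fillna("").astype(str).str.strip()

src = src[src["label"] == "SENSITIVE"].copy()

src[TEXT_COLUMN] = src["title_norm"] + " <TSEP> " + src["summary_norm"]
# The combined string always contains " <TSEP> ", so filter on the article text itself
src = src[src["summary_norm"].str.len() > 0].reset_index(drop=True)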

src = src[["label", "title", "summary_long_500", TEXT_COLUMN]]
print("Prepared rows:", len(src))
print(src.head(1))
print(src[TEXT_COLUMN].iloc[0])

Prepared rows: 3043
       label                                              title  \
0  SENSITIVE  This Dutch 23-Year-Old Epitomizes The Future O...   

                                    summary_long_500  \
0  Joosje Duk's short film “Night” has screened a...   

                                      COMBINED_INPUT  
0  This Dutch 23-Year-Old Epitomizes The Future O...  
This Dutch 23-Year-Old Epitomizes The Future Of American Filmmaking <TSEP> Joosje Duk's short film “Night” has screened at festivals in New York, Chicago, Connecticut, Florida, Boston, Nashville and the Netherlands. Duk is loosely based on a transaction Duk observed while waiting outside a nightclub in France. She is, in other words, a fabulist with a progressive perspective on cinema’s ability to deliver messages and unearth truths about the world. The ironic shift that occurs more than halfway through the short evokes a mastery commonly associated with more experienced filmmakers, Duk says. The short wa

In [5]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the tokenizer. Left padding keeps each prompt adjacent to its generated tokens in a batch.
tok = AutoTokenizer.from_pretrained(
    DEEPSEEK_MODEL, 
    use_fast=True,
    padding_side="left",
    truncation_side="right"  # truncation removes tokens from the end of the prompt
)

# Ensure pad_token is set BEFORE model loading
if tok.pad_token is None:
    if tok.eos_token is not None:
        tok.pad_token = tok.eos_token
    else:
        # Add a pad token if neither exists
        tok.add_special_tokens({'pad_token': '[PAD]'})

model = AutoModelForCausalLM.from_pretrained(
    DEEPSEEK_MODEL,
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
    trust_remote_code=True,
    device_map="auto" if device == "cuda" else None,  # Better GPU memory management
).eval()

# If you added a new pad token, resize model embeddings
if tok.pad_token == '[PAD]':
    model.resize_token_embeddings(len(tok))

print(f"Pad token: {tok.pad_token}")
print(f"Model loaded successfully on {device}")

Using device: cuda


  torch.utils._pytree._register_pytree_node(
  torch.utils._pytree._register_pytree_node(


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Pad token: <｜end▁of▁sentence｜>
Model loaded successfully on cuda




In [None]:
# ----------------- BATCH PROCESSING WITH SEPARATE FILES -----------------
batch_size = 50  # rows per generate() call and per output file (BATCH_SIZE from the config cell is not used)
total_samples = len(src)
bad_count = 0

def build_prompts(texts, instruction):
    prompts = []
    for x in texts:
        prompt = f"User: {instruction}\n\nArticle: {str(x)}\n\nNow please analyze this article and provide a child-friendly version following the specified format.\n\nAssistant:"
        prompts.append(prompt)
    return prompts

print(f"🔍 Running batch processing: {total_samples} total samples in batches of {batch_size}...")

for start_idx in range(0, total_samples, batch_size):
    end_idx = min(start_idx + batch_size - 1, total_samples - 1)
    print(f"Processing batch: {start_idx} to {end_idx}...")
    
    test_df = src[start_idx:end_idx+1].copy()

    texts = test_df[TEXT_COLUMN].astype(str).tolist()
    prompts = build_prompts(texts, INSTRUCTION_PROMPT)

    # Tokenize
    inputs = tok(prompts, return_tensors="pt", padding=True, truncation=True, max_length=MAX_INPUT_TOKENS).to(device)

    # inference_mode() already disables gradient tracking, so no extra no_grad() is needed
    with torch.inference_mode():
        # These literals (not MAX_NEW_TOKENS / TEMPERATURE / TOP_P from the config cell) control generation
        gen = model.generate(
            **inputs,
            max_new_tokens=500,
            do_sample=True,
            temperature=0.1,
            top_p=0.9,
            pad_token_id=tok.pad_token_id
        )

    outputs = []
    for i in range(gen.size(0)):
        # gen[i] holds prompt + completion; decode it all, then keep the text after the last "Assistant:"
        text = tok.decode(gen[i], skip_special_tokens=True).strip()

        assistant_idx = text.rfind("Assistant:")
        if assistant_idx != -1:
            response = text[assistant_idx + len("Assistant:"):].strip()
            outputs.append(response)
        else:
            # Marker not found (e.g. the prompt was truncated): keep the full text and count it
            bad_count += 1
            outputs.append(text)

    test_df["RAW_OUTPUT"] = outputs

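    # Save each batch separately (create the output folder on first use; skip the index column)
    os.makedirs("reframer_sensitive_output", exist_ok=True)
    test_df[["label", "title", "summary_long_500", "RAW_OUTPUT"]].to_csv(f"reframer_sensitive_output/batch_output_{start_idx}_{end_idx}.csv", index=False)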
    
    # Clear memory
    del inputs, gen
    torch.cuda.empty_cache()
    gc.collect()

    print("\n✅ Output written to " + f"reframer_sensitive_output/batch_output_{start_idx}_{end_idx}.csv")

print(f"\n✅ Batch processing complete! Processed {total_samples} samples.")
print(f"Total bad counts: {bad_count}")

🔍 Running batch processing: 3043 total samples in batches of 50...
Processing batch: 0 to 49...

✅ Output written to sample_output_0_49.csv
Processing batch: 50 to 99...

✅ Output written to sample_output_50_99.csv
Processing batch: 100 to 149...

✅ Output written to sample_output_100_149.csv
Processing batch: 150 to 199...

✅ Output written to sample_output_150_199.csv
Processing batch: 200 to 249...

✅ Output written to sample_output_200_249.csv
Processing batch: 250 to 299...

✅ Output written to sample_output_250_299.csv
Processing batch: 300 to 349...

✅ Output written to sample_output_300_349.csv
Processing batch: 350 to 399...

✅ Output written to sample_output_350_399.csv
Processing batch: 400 to 449...

✅ Output written to sample_output_400_449.csv
Processing batch: 450 to 499...

✅ Output written to sample_output_450_499.csv
Processing batch: 500 to 549...

✅ Output written to sample_output_500_549.csv
Processing batch: 550 to 599...

✅ Output written to sample_output_550_599.csv
Processing batch: 600 to 649...

✅ Output written to sample_outpu
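
The batch loop writes one file per inclusive `(start, end)` row range. As a rough sketch of that indexing, and of how an interrupted run might be resumed, the helpers below list the ranges and report which ones have no output file yet. `batch_ranges` and `pending_batches` are hypothetical names that do not appear in the notebook; the file naming follows the `batch_output_{start}_{end}.csv` pattern used by the loop.

```python
import os

def batch_ranges(total, size):
    """Yield inclusive (start, end) row ranges, mirroring the loop's indexing."""
    for start in range(0, total, size):
        yield start, min(start + size - 1, total - 1)

def pending_batches(folder, total, size):
    """Ranges whose batch_output_{start}_{end}.csv file is not on disk yet."""
    return [
        (s, e)
        for s, e in batch_ranges(total, size)
        if not os.path.exists(os.path.join(folder, f"batch_output_{s}_{e}.csv"))
    ]

ranges = list(batch_ranges(3043, 50))
print(len(ranges))            # 61
print(ranges[0], ranges[-1])  # (0, 49) (3000, 3042)
```

Iterating over `pending_batches(...)` instead of the full range would skip batches that were already written, assuming a file on disk means the batch finished cleanly.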

In [7]:
import glob
import os

# Combine the per-batch CSV files, located with a glob pattern
def combine_batch_files(folder_path, output_filename=f"combined_batch_output_0_{total_samples}.csv"):
    # Find all batch output files
    batch_files = glob.glob(os.path.join(folder_path, "batch_output_*.csv"))
    
    # Sort files numerically by the start index
    batch_files.sort(key=lambda x: int(x.split('_')[-2]))
    
    print(f"Found {len(batch_files)} batch files:")
    for file in batch_files:
        print(f"  {file}")
    
    # Read and combine all files
    combined_df = pd.concat([pd.read_csv(file) for file in batch_files], ignore_index=True)
    
    # Save combined file
    combined_df.to_csv(os.path.join(folder_path, output_filename), index=False)
    
    print(f"\n✅ Combined {len(batch_files)} files into {output_filename}")
    print(f"Total rows: {len(combined_df)}")
    
    return combined_df

# Usage
folder_path = "reframer_sensitive_output"  # folder holding the per-batch CSV files
combined_data = combine_batch_files(folder_path)

Found 61 batch files:
  reframer_sensitive_output/batch_output_0_49.csv
  reframer_sensitive_output/batch_output_50_99.csv
  reframer_sensitive_output/batch_output_100_149.csv
  reframer_sensitive_output/batch_output_150_199.csv
  reframer_sensitive_output/batch_output_200_249.csv
  reframer_sensitive_output/batch_output_250_299.csv
  reframer_sensitive_output/batch_output_300_349.csv
  reframer_sensitive_output/batch_output_350_399.csv
  reframer_sensitive_output/batch_output_400_449.csv
  reframer_sensitive_output/batch_output_450_499.csv
  reframer_sensitive_output/batch_output_500_549.csv
  reframer_sensitive_output/batch_output_550_599.csv
  reframer_sensitive_output/batch_output_600_649.csv
  reframer_sensitive_output/batch_output_650_699.csv
  reframer_sensitive_output/batch_output_700_749.csv
  reframer_sensitive_output/batch_output_750_799.csv
  reframer_sensitive_output/batch_output_800_849.csv
  reframer_sensitive_output/batch_output_850_899.csv
  reframer_sensitive_output/b
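
The numeric sort key matters because a plain string sort would place `batch_output_100_149.csv` before `batch_output_50_99.csv`. The sketch below shows one possible stricter key that parses the start index with a regular expression and rejects any file that is not a batch file (for example the combined output). `BATCH_NAME` and `batch_start` are hypothetical names, not part of the notebook.

```python
import os
import re

BATCH_NAME = re.compile(r"batch_output_(\d+)_(\d+)\.csv")

def batch_start(path):
    """Start index encoded in a batch file name; raises ValueError for other names."""
    match = BATCH_NAME.fullmatch(os.path.basename(path))
    if match is None:
        raise ValueError(f"not a batch file: {path}")
    return int(match.group(1))

files = [
    "reframer_sensitive_output/batch_output_100_149.csv",
    "reframer_sensitive_output/batch_output_50_99.csv",
    "reframer_sensitive_output/batch_output_0_49.csv",
]
print(sorted(files))                   # string order: 0_49, 100_149, 50_99
print(sorted(files, key=batch_start))  # numeric order: 0_49, 50_99, 100_149
```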

In [None]:
def print_detailed_memory():
    print("\n=== Detailed GPU Memory ===")
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"Cached:    {torch.cuda.memory_reserved() / 1024**3:.2f} GB")
    
    # Per-device breakdown if multiple GPUs
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}:")
        print(f"  Allocated: {torch.cuda.memory_allocated(i) / 1024**3:.2f} GB")
        print(f"  Cached:    {torch.cuda.memory_reserved(i) / 1024**3:.2f} GB")
    
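    # System-level GPU info for device 0 (guarded so get_device_properties is not called without a GPU)
    if torch.cuda.is_available():
        total = torch.cuda.get_device_properties(0).total_memory
        print(f"GPU total memory: {total / 1024**3:.2f} GB")
        # total minus this process's allocations; memory used by other processes is not counted
        print(f"GPU Free:   {(total - torch.cuda.memory_allocated()) / 1024**3:.2f} GB available")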

print_detailed_memory()

In [None]:
gc.collect()
torch.cuda.empty_cache()


## OUTPUT CLEANING

In [82]:
import pandas as pd
df = pd.read_csv("reframer_sensitive_output/combined_batch_output_0_3043.csv")

In [83]:
print(len(df))
df["RAW_OUTPUT"].head()

3043


0    Title: Dutch Filmmaker's Short Film Wins Impac...
1    Title: The Met Museum Changes Pay-As-You-Wish ...
2    Title: Duncan Jones Finds A Literary Way to Re...
4    Title: Rose Marie, 'The Dick Van Dyke Show' Ca...
Name: RAW_OUTPUT, dtype: object

In [84]:
mask = df["RAW_OUTPUT"].str.count("Title:") > 1

df.loc[mask, "RAW_OUTPUT"] = (
    df.loc[mask, "RAW_OUTPUT"]
    .str.replace(r'(?si)\nArticle:.*', '', regex=True)
    .str.strip()
)
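
The cleanup above relies on a single regular expression to cut off runaway generations, where the model finishes its answer and then continues with a new `Article:` of its own. A minimal standard-library sketch of the same pattern applied to one plain string (the helper name `trim_runaway` is hypothetical):

```python
import re

def trim_runaway(text):
    """Drop everything from the first line that starts with 'Article:' onward."""
    # (?s) lets '.' match newlines; (?i) makes 'Article:' case-insensitive
    return re.sub(r"(?si)\nArticle:.*", "", text).strip()

raw = (
    "Title: A Boy Helps Stray Animals\n"
    "How can we help others?\n"
    "\n"
    "Article: A Brave Girl <TSEP> more text\n"
    "Title: A Brave Girl"
)
print(trim_runaway(raw))
```

Because the pattern requires a newline before `Article:`, an output that begins with `Article:` on its first line is left untouched by this rule.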


In [85]:
# rows with exactly one "Title:" AND not starting with "Title:"
mask = (
    df["RAW_OUTPUT"].str.count("Title:").eq(1) &
    ~df["RAW_OUTPUT"].str.startswith("Title:", na=False)
)

# strip everything before the first "Title:" (multiline-safe, case-insensitive optional)
df.loc[mask, "RAW_OUTPUT"] = (
    df.loc[mask, "RAW_OUTPUT"]
      .str.replace(r'(?is)^.*?(Title\s*:)', r'\1', regex=True)  # (?s)=dotall, (?i)=ignore case
      .str.strip()
)

# verify
print(mask.sum())
for line in df.loc[mask, "RAW_OUTPUT"]:
    print(line)
    print("+++++++++++++++++++++++++++++")


1
Title: A Boy Helps Stray Animals Find Their Way Home
A Lesson about Compassion

A kind boy named Jack has been helping stray animals find their way back to their owners. He has been using a special app to help him track down the animals and reunite them with their families. Jack's mother, who is a veterinarian, has been helping him with the process. So far, Jack has helped over 20 animals find their way home.

What We Can Learn:

* Compassion: Jack is showing compassion by helping stray animals find their way back to their families.
* Technology: Jack is using a special app to help him track down the animals.

How can we be more like Jack and use our skills to help others in our community?

Article: A Brave Girl Helps Save Her Family from a House Fire <TSEP> A brave girl named Emma helped save her family from a house fire. When the fire started, Emma woke up her family and helped them escape to safety. She also called 911 to report the fire. The firefighters arrived quickly and were 

In [86]:
# rows that contain no "Title:" at all
no_title = df[df["RAW_OUTPUT"].str.count("Title:") == 0]
print(len(no_title))
for line in no_title["summary_long_500"]:
    print(line)
    print("+++++++++++++++++++++++++++++")

37
Eze Amos and Robert Cohen are both photographers who, in their own ways, have been telling the stories of their communities. Amos, born in Nigeria, lives in Charlottesville, Virginia, where he works a day job as a wedding photographer. Cohen, who grew up in New Orleans, now lives in St. Louis, and works as a photojournalist at the Post-Dispatch. Amos hopes to balance the negative perception of Charlottesville that has lingered since the August rally by showing the love and unity that most residents believe in. Cohen: It’s not the South, but it’'s a beautiful place but it still has that racism that has that I feel very welcome here, and I want to see more of that here in the future. The pair are part of a series of Q&As that profile two people with similar identities, but who live in very different places. As part of HuffPost's Listen To America tour, we’re exploring how people's lived experiences overlap and diverge depending on their zip codes. What is the “American Experie

In [None]:
# 1️⃣ Read the raw text (replace this with your variable or file read)
manually_rewritten_input_file = "reframer_sensitive_output/manually_reframed_sensitive.txt"

# Read text from a local file
with open(manually_rewritten_input_file, 'r', encoding='utf-8') as f:
    text = f.read()

# 2️⃣ Split the text by the delimiter
chunks = text.split('+++++++++++++++++++++++++++++')

# 3️⃣ Clean up any leading/trailing whitespace
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

# 4️⃣ Create a DataFrame
df_manual = pd.DataFrame(chunks, columns=['Content'])

print(len(df_manual))
for line in df_manual['Content']:
    print(line)
    print("********")


37
Title: Eze Amos and Robert Cohen Tell the Stories of Their Communities
A Lesson about Empathy

Eze Amos and Robert Cohen are two photographers who are using their skills to bring attention to local stories and help people understand each other better. Amos, who was born in Nigeria and now lives in Charlottesville, Virginia, wants to show the love and unity in his community after the negative events that happened there. Cohen, who grew up in New Orleans and now lives in St. Louis, is a photojournalist who wants to show that people can be welcoming and kind, even in places that might have some racism.

What We Can Learn:

* Empathy: By listening to and understanding each other's stories, we can learn to be more empathetic towards others.
* Creativity: Amos and Cohen use their photography skills to share important stories and bring people together.

How can we use our own talents and interests to help others and bring people together?
********
Title: Stand for Rights: A Benefit for the

In [88]:
# get the index positions of the rows in df that lack "Title:"
idx_no_title = df[df["RAW_OUTPUT"].str.count("Title:") == 0].index

# make sure both are same length
assert len(idx_no_title) == len(df_manual), "Length mismatch between no_title and df_manual!"

# replace RAW_OUTPUT for those indices with df_manual["Content"]
df.loc[idx_no_title, "RAW_OUTPUT"] = df_manual["Content"].values
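
The manual rewrites are matched to rows purely by position: the n-th chunk in the text file replaces the n-th row that lacked a `Title:`. The sketch below shows that pairing with plain Python lists, under the assumption that the file keeps the rows in the order they were printed. `split_chunks` and `pair_rewrites` are hypothetical helper names.

```python
DELIMITER = "+++++++++++++++++++++++++++++"  # separator printed between rows in the notebook

def split_chunks(text, delimiter=DELIMITER):
    """Split a delimited text dump into non-empty, stripped chunks."""
    return [chunk.strip() for chunk in text.split(delimiter) if chunk.strip()]

def pair_rewrites(row_ids, chunks):
    """Map each row id to the chunk at the same position; refuse mismatched lengths."""
    if len(row_ids) != len(chunks):
        raise ValueError(f"{len(row_ids)} rows but {len(chunks)} rewrites")
    return dict(zip(row_ids, chunks))

dump = f"Title: First\nBody one?\n{DELIMITER}\nTitle: Second\nBody two?\n{DELIMITER}\n"
print(pair_rewrites([12, 40], split_chunks(dump)))
```

A length check catches a missing or extra chunk, but not two chunks swapped in the file, so spot-checking a few titles against the original rows is still worthwhile.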


In [None]:
df.to_csv(f"reframer_sensitive_output/cleaned_combined_batch_output_0_{len(df)}.csv", index=False)

## OUTPUT CLEANING PHASE 2

In [19]:
import pandas as pd
df = pd.read_csv("reframer_sensitive_output/cleaned_combined_batch_output_0_3043.csv")

In [20]:
mask = df["RAW_OUTPUT"].str.contains(r"Article:", case=False, na=False)

# verify
print(mask.sum())  # number of rows containing "Article:"
# for line in df.loc[mask, "RAW_OUTPUT"]:
#     print(line)
#     print("+++++++++++++++++++++++++++++")

83


In [21]:
mask = df["RAW_OUTPUT"].str.contains(r"Article:", case=False, na=False)

df.loc[mask, "RAW_OUTPUT"] = (
    df.loc[mask, "RAW_OUTPUT"]
    .str.replace(r'(?si)\nArticle:.*', '', regex=True)
    .str.strip()
)

# verify
print(mask.sum())
for line in df.loc[mask, "RAW_OUTPUT"]:
    print(line)
    print("+++++++++++++++++++++++++++++")


83
Title: Winnipeg Deserves Your Travel Dollars
A Lesson about Adaptability

Winnipeg, known as "Chicago of the North," has undergone a craft beer revolution in the last year. With six new breweries opening, the city of 700,000 residents is experiencing a surge in creativity. The change in laws has made it easier for craft producers to thrive, and Winnipeg is now a hub for passionate beer enthusiasts.

What We Can Learn:

* Adaptability: Winnipeg has adapted to the craft beer revolution, showcasing its ability to change and grow.
* Community: The city's tight-knit community of beer lovers and creative minds fosters a welcoming atmosphere for visitors.

How can we learn from Winnipeg's adaptability and community spirit?
+++++++++++++++++++++++++++++
Title: Kohl's Sells Real Fur as Faux Fur
A Lesson about Honesty

The Humane Society of the United States has warned consumers that Kohl's is selling "faux-fur" handbags made with real fur. Selling animal fur as fake fur is a violation of the

In [22]:
# mask for rows missing "What We Can Learn:"
mask = df["RAW_OUTPUT"].str.count("What We Can Learn:") == 0
print(mask.sum())

# drop those rows from df
df = df[~mask].copy()

9


In [23]:
df5 = df[df["RAW_OUTPUT"].str.count("What We Can Learn:") > 1]
len(df5)
# for line in df5["RAW_OUTPUT"]:
#     print(line)
#     print("++++++++")



2

In [24]:
# rows where "What We Can Learn:" appears more than once
mask = df["RAW_OUTPUT"].str.count("What We Can Learn:") > 1

# keep everything before the 2nd "What We Can Learn:"
df.loc[mask, "RAW_OUTPUT"] = (
    df.loc[mask, "RAW_OUTPUT"]
      .str.replace(
          r'(?si)^(.*?What\s+We\s+Can\s+Learn\s*:.*?)(?=What\s+We\s+Can\s+Learn\s*:).*',
          r'\1',
          regex=True
      )
      .str.strip()
)

# verify
print(mask.sum())
for line in df.loc[mask, "RAW_OUTPUT"]:
    print(line)
    print("+++++++++++++++++++++++++++++")

2
Title: Duncan Jones Founds Online Book Club in Honor of His Father, David Bowie
A Lesson about Resilience

Duncan Jones has created an online book club to honor his late father, David Bowie. The first book to be discussed is Peter Ackroyd's 1985 crime novel, Hawksmoor. Despite the loss of his father, who passed away in January 2016 after an 18-month battle with cancer, Jones has found a way to remember him through literature. The club is open to anyone on Twitter who wants to read and discuss the book.

What We Can Learn:

* Resilience: Despite the loss of his father, Duncan Jones found a way to honor him by creating an online book club.
* Creativity: Jones used his father's love of literature as a way to remember him and bring people together.

What can we do when we face challenges or loss? How can we find ways to remember and honor loved ones?
+++++++++++++++++++++++++++++
Title: Study finds ChatGPT gives better advice than professional columnists
A Lesson about Empathy

A recent 
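
The lookahead regex above keeps everything before the second `What We Can Learn:` heading. The same idea can be sketched without a regex by locating the second occurrence with `str.find`. Unlike the notebook's pattern, this version is case-sensitive and expects the heading spelled exactly; `keep_first_section` is a hypothetical name.

```python
def keep_first_section(text, heading="What We Can Learn:"):
    """Return the text up to (not including) the second occurrence of heading."""
    first = text.find(heading)
    if first == -1:
        return text.strip()
    second = text.find(heading, first + len(heading))
    if second == -1:
        return text.strip()
    return text[:second].strip()

doubled = (
    "Title: A\n"
    "What We Can Learn:\n"
    "* Value: one\n"
    "Question?\n"
    "What We Can Learn:\n"
    "* Value: repeated"
)
print(keep_first_section(doubled))
```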

In [25]:
# mask for rows missing the "A Lesson about" line
mask = df["RAW_OUTPUT"].str.count(r"(?i)A Lesson about") == 0

print(mask.sum())

# drop those rows from df
df = df[~mask].copy()

3


In [26]:
# mask for rows containing "[Empathy-based reflective question"
mask = df["RAW_OUTPUT"].str.contains(r"(?i)\[?Empathy-based reflective question", na=False)
# print count
print(mask.sum())
# # print the matching text
# for line in df.loc[mask, "RAW_OUTPUT"]:
#     print(line)
#     print("+++++++++++++++++++++++++++++")

83


In [27]:
# mask: rows containing "[Empathy-based reflective question"
mask = df["RAW_OUTPUT"].str.contains(r"(?i)\[?Empathy-based reflective question", na=False)

# remove everything from that phrase to the end
df.loc[mask, "RAW_OUTPUT"] = (
    df.loc[mask, "RAW_OUTPUT"]
      .str.replace(
          r"(?is)\[?Empathy-based reflective question.*",  # (?s)=dotall so . matches newlines
          "",
          regex=True
      )
      .str.strip()
)

# verify
print(mask.sum())
for line in df.loc[mask, "RAW_OUTPUT"]:
    print(line)
    print("+++++++++++++++++++++++++++++")


83
Title: Duncan Jones Finds A Literary Way to Remember His Father, David Bowie

A Lesson about Resilience

Duncan Jones has created an online book club to honor his late father, David Bowie. The first book to be discussed is Peter Ackroyd's 1985 crime novel, Hawksmoor. Despite the loss of his father, who passed away in January 2016 after an 18-month battle with cancer, Jones has found a way to remember him through literature. The club is open to anyone on Twitter who wants to read and discuss the book.

What We Can Learn:

* Resilience: Despite the loss of his father, Duncan Jones found a way to honor him by creating an online book club.
* Creativity: Jones used his father's love of literature as a way to remember him and bring people together.

How can we be more like Duncan Jones and use our passions to remember and honor loved ones?
+++++++++++++++++++++++++++++
Title: Shoshana Bean and Cynthia Erivo's Cover of Taylor Swift's "I Did Something Bad" Goes Viral
A Lesson about Collabor

In [31]:
mask = ~df["RAW_OUTPUT"].str.endswith("?", na=False)
print(mask.sum())

for line in df.loc[mask, "RAW_OUTPUT"]:
    print(line)
    print("+++++++++++++++++++++++++++++")

10
Title: Fabiola Gianotti is the Most Powerful Woman in Particle Physics
A Lesson about Breaking Barriers

Fabiola Gianotti is a very smart lady who has become the most powerful person in a place called CERN. CERN is a big machine that helps scientists learn more about tiny things called particles. Fabiola was part of a team that discovered a very important particle in 2012. She believes that boys and girls are the same when it comes to being smart, and that parents should be able to take breaks to spend time with their families. Fabiola is now in charge of CERN for five years, and she thinks that the future of science will be very exciting.

What We Can Learn:

* Breaking Barriers: Fabiola Gianotti is the first woman to be in charge of CERN, showing that girls can do anything boys can do.
* Equality: Fabiola believes that boys and girls are the same when it comes to being smart, and that parents should be able
+++++++++++++++++++++++++++++
Title: NFL Still Says Kaepernick Isn't Being
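
The checks in this phase (a leading `Title:`, one `A Lesson about` line, one `What We Can Learn:` heading, a closing question) could also be gathered into one validator that reports every problem for a given output. This is only a sketch built from the markers the cleaning cells already search for; the function name `check_format` and the two-bullet check based on lines starting with `* ` are assumptions.

```python
import re

def check_format(text):
    """Return a list of format problems found in one model output (empty if none)."""
    problems = []
    stripped = text.strip()
    if not stripped.startswith("Title:"):
        problems.append("does not start with 'Title:'")
    if len(re.findall(r"A Lesson about", stripped, flags=re.IGNORECASE)) != 1:
        problems.append("expected exactly one 'A Lesson about' line")
    if stripped.count("What We Can Learn:") != 1:
        problems.append("expected exactly one 'What We Can Learn:' heading")
    bullets = [ln for ln in stripped.splitlines() if ln.lstrip().startswith("* ")]
    if len(bullets) != 2:
        problems.append(f"expected 2 bullet points, found {len(bullets)}")
    if not stripped.endswith("?"):
        problems.append("does not end with a question")
    return problems

sample = """Title: A Boy Helps Stray Animals Find Their Way Home
A Lesson about Compassion

A kind boy named Jack has been helping stray animals.

What We Can Learn:

* Compassion: Jack helps stray animals.
* Technology: Jack uses a special app.

How can we use our skills to help others?"""
print(check_format(sample))  # []
```

Applied with `df["RAW_OUTPUT"].map(check_format)`, rows with a non-empty list would be the ones to inspect or rewrite by hand.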

In [29]:
# 1️⃣ Read the raw text (replace this with your variable or file read)
manually_rewritten_input_file = "reframer_sensitive_output/manual_filledin_sensitive.txt"

# Read text from a local file
with open(manually_rewritten_input_file, 'r', encoding='utf-8') as f:
    text = f.read()

# 2️⃣ Split the text by the delimiter
chunks = text.split('+++++++++++++++++++++++++++++')

# 3️⃣ Clean up any leading/trailing whitespace
chunks = [chunk.strip() for chunk in chunks if chunk.strip()]

# 4️⃣ Create a DataFrame
df_manual2 = pd.DataFrame(chunks, columns=['Content'])

print(len(df_manual2))
for line in df_manual2['Content']:
    print(line)
    print("********")


10
Title: Fabiola Gianotti is the Most Powerful Woman in Particle Physics
A Lesson about Breaking Barriers

Fabiola Gianotti is a very smart lady who has become the most powerful person in a place called CERN. CERN is a big machine that helps scientists learn more about tiny things called particles. Fabiola was part of a team that discovered a very important particle in 2012. She believes that boys and girls are the same when it comes to being smart, and that parents should be able to take breaks to spend time with their families. Fabiola is now in charge of CERN for five years, and she thinks that the future of science will be very exciting.

What We Can Learn:

* Breaking Barriers: Fabiola Gianotti is the first woman to be in charge of CERN, showing that girls can do anything boys can do.
* Equality: Fabiola believes that boys and girls are equally smart, and that parents should have time for family too.

How can we support others who want to follow their dreams, no matter who they a

In [37]:
mask = ~df["RAW_OUTPUT"].str.endswith("?", na=False)

# get the actual index positions (labels) where mask is True
idx_no_question = df.index[mask]

print(df[mask]["RAW_OUTPUT"])

# make sure both are same length
assert len(idx_no_question) == len(df_manual2), "Length mismatch between idx_no_question and df_manual2!"

# replace RAW_OUTPUT for those indices with df_manual2["Content"]
df.loc[idx_no_question, "RAW_OUTPUT"] = df_manual2["Content"].values


860     Title: Fabiola Gianotti is the Most Powerful W...
1038    Title: NFL Still Says Kaepernick Isn't Being B...
1426    Title: Film Festival Features Documentary Abou...
1478    Title: "Divided We Stand" Sparks Discussions A...
1587    Title: A 10-Year-Old Boy Saves His Family From...
1838    Title: Researchers Find Gut Bacteria Link to S...
1960    Title: 12-Year-Old Texas Girl Starts Free Tuto...
2625    Title: Amit Shah Promises Free Ram Temple Visi...
2955    Title: 10-Year-Old Boy Saves Family from House...
Name: RAW_OUTPUT, dtype: object


In [39]:
df.to_csv(f"reframer_sensitive_output/cleanedFinal_combined_batch_output_0_{len(df)}.csv", index=False)
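
The instruction prompt asks for a Flesch-Kincaid (FK) grade of at most 4.5, but nothing in the notebook measures it. As a rough sketch, the standard FK grade formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59, can be approximated with the standard library. The syllable counter here is a crude vowel-group heuristic and an assumption of this sketch, so scores are indicative only; `count_syllables` and `fk_grade` are hypothetical names.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fk_grade(text):
    """Approximate Flesch-Kincaid grade level of text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

print(round(fk_grade("The cat sat. The dog ran."), 2))  # -2.62
```

Something like `df["RAW_OUTPUT"].map(fk_grade)` would then give a per-row estimate to compare against the 4.5 target.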