## Chapter 2 - Prepping Data for AI
This notebook contains examples of data preparation strategies for AI, including data cleaning, feature engineering, and handling data sensitivity. Using open-source tools like Pandas, LangChain, and ChromaDB, it explores design patterns for crafting high-quality datasets. It also covers techniques for ensuring data privacy and security, highlighting methods like data masking and synthetic data generation to safeguard sensitive information.

### Listing 2-1: Defining Dataset Constants
The first block defines the **dataset** file names, builds the URLs, and creates a small `load_and_describe` helper. The second block calls that helper twice to load the Superheroes Info and Superheroes Powers CSVs into Pandas DataFrames. For each dataset, you’ll see a quick summary: row count, column count, and the first few column names. This confirms your files loaded correctly before you begin cleaning and analysis. **Run both code blocks below in order.**


In [None]:
# Filename and URL Constants used throughout this Notebook

# Base GitHub URL
BASE_URL = "https://opensourceai-book.github.io/code/datasets/"

# Dataset file names
INFO_FILE = "superheroes_info.csv"
INFO_CLEAN_FILE = "superheroes_info_cleansed.csv"
POWERS_FILE = "superheroes_powers.csv"
INFO_POWERS_FILE = "superheroes_info_powers.csv"
PLOTS_FILE = "superheroes_story_plots.csv"

# Construct full dataset URLs
SUPERHEROES_INFO_URL = f"{BASE_URL}{INFO_FILE}"
SUPERHEROES_INFO_CLEAN_URL = f"{BASE_URL}{INFO_CLEAN_FILE}"
SUPERHEROES_POWERS_URL = f"{BASE_URL}{POWERS_FILE}"
SUPERHEROES_INFO_POWERS_URL = f"{BASE_URL}{INFO_POWERS_FILE}"
SUPERHEROES_INFO_PLOTS_URL = f"{BASE_URL}{PLOTS_FILE}"

In [None]:
import pandas as pd

def load_and_describe(name, url):
    """Load CSV and print a brief summary."""
    # Read the CSV file into a DataFrame
    df = pd.read_csv(url)
    # Show the first five column names for a quick preview
    cols = ", ".join(df.columns[:5])
    print(f"{name}\n  Rows: {df.shape[0]}  Cols: {df.shape[1]}")
    print(f"  First columns: {cols}\n" + "=" * 75)
    return df

# Load the two core superhero datasets and display summaries
info_df = load_and_describe("Superheroes Info", SUPERHEROES_INFO_URL)
powers_df = load_and_describe("Superheroes Powers", SUPERHEROES_POWERS_URL)


### Listing 2-2: Analyzing and Detecting Duplicates and Sparse Fields
Identifies duplicate rows by superhero name and assesses sparse fields, reporting the percentage of missing data for each dataset column.

In [None]:

# Find rows that share a 'name' with another row (keep=False flags every copy)
duplicates = info_df[info_df.duplicated(subset=['name'], keep=False)]

# List unique superheroes with duplicate rows
duplicate_names = duplicates['name'].unique()

print("Superheroes with duplicate rows:")
print(duplicate_names)
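
The `keep` argument decides which members of a duplicate group are flagged. The sketch below uses a small hand-made DataFrame (the rows are hypothetical, not taken from the superhero CSV) to contrast `keep=False`, as used above, with `keep="first"`.

```python
import pandas as pd

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    "name": ["Atlas", "Atlas", "Nova", "Vision"],
    "Publisher": ["Marvel Comics", "DC Comics", "Marvel Comics", "Marvel Comics"],
})

# keep=False marks every member of a duplicate group
dupes_all = toy[toy.duplicated(subset=["name"], keep=False)]

# keep="first" marks only the second and later occurrences
dupes_later = toy[toy.duplicated(subset=["name"], keep="first")]

print(dupes_all)
print(dupes_later)
```

With `keep=False` both `Atlas` rows come back, which is convenient for reviewing conflicting records side by side before deciding which one to drop.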


**Run this cell next** to detect sparse fields by calculating the percentage of missing data for each column.

In [None]:
import numpy as np

def print_sparse_fields(sparse):
    """Pretty-print percent missing per column."""
    print("\nSparse Fields (Percentage of Missing Data):")
    print("{:<20} {:>10}".format("Column", "Missing %"))
    print("-" * 30)
    for col, pct in sparse.items():
        print("{:<20} {:>8.2f}%".format(col, pct))

# Normalize placeholder values to NaN
info_df.replace({'-': np.nan, -99: np.nan}, inplace=True)

# Remove the counter column (assumed first)
info_df.drop(info_df.columns[0], axis=1, inplace=True)

# Compute percent missing per column using NumPy on boolean masks
sparse_fields = {
    col: float(np.mean(info_df[col].isna().to_numpy())) * 100.0
    for col in info_df.columns
}

# Display results
print_sparse_fields(sparse_fields)
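
The same missing-percentage calculation can be tried on a tiny hand-made table. This is a minimal sketch with hypothetical values. It scopes each placeholder to the column where it is expected, a variation on the global `replace` above, so that a legitimate `-99` in some other numeric column would not be wiped out.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows using the same placeholders ('-' and -99)
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "Skin color": ["-", "green", "-", "-"],
    "Weight": [-99, 81, 70, -99],
})

# Nested dict: replace '-' only in 'Skin color' and -99 only in 'Weight'
cleaned = toy.replace({"Skin color": {"-": np.nan}, "Weight": {-99: np.nan}})

# The mean of a boolean mask is the fraction of True values
missing_pct = cleaned.isna().mean().mul(100).round(2)
print(missing_pct)
```

Here `Skin color` is 75% missing and `Weight` is 50% missing, which is the kind of signal used later to decide which columns to drop.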

### Listing 2-3: Infer Superhero Species with LangChain
This function uses LangChain and an open-source model from Hugging Face Hub to infer a superhero's species based on their name and publisher.

⚠️ Make sure you've set your `HF_TOKEN` (and, optionally, `OPENAI_API_KEY`) in Colab secrets.
Refer to **Chapter 1** (notebook or book text) for setup instructions.  
The next two code cells install the required packages and configure API keys in the environment for use with LangChain.


In [None]:
# Install required packages for Hugging Face and LangChain usage

%pip install -q "langchain>=0.2" "langchain-huggingface>=0.0.3" \
                 "huggingface_hub>=0.23"

In [None]:
# Constants and API Key Configuration
import os
from google.colab import userdata

# === Load API keys securely from Google Colab Secrets ===
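def load_api_keys():
    """Copy API keys from Colab secrets into environment variables."""
    # HF_TOKEN is required; OPENAI_API_KEY is needed only by Listings 2-9 and 2-10
    for key, required in (("HF_TOKEN", True), ("OPENAI_API_KEY", False)):
        try:
            value = userdata.get(key)
        except Exception:  # assumed: Colab raises if the secret is absent or not shared
            value = None
        if value:
            os.environ[key] = value
        elif required:
            raise ValueError(f"❌ Missing {key}. Please set this API key in Colab secrets.")
        else:
            print(f"⚠️ {key} not set; the OpenAI-based listings will fail.")
    print("✅ API keys loaded and configured.")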

# Execute API key loading upon running this cell
load_api_keys()

#### Define the default LLM (Text Generation Model) to use from Hugging Face
Run the code cell below to define the `DEFAULT_MODEL` constant.
> ⚠️ If you get an error running LangChain code due to a missing model, welcome to open-source AI development. Models are updated or replaced often. Check Hugging Face’s list of supported text generation models here:  
> https://huggingface.co/docs/api-inference/en/supported-models


In [None]:
# Candidate Models

#DEFAULT_MODEL = "openai/gpt-oss-20b"
#DEFAULT_MODEL = "HuggingFaceH4/zephyr-7b-beta"
#DEFAULT_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
DEFAULT_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"

In [None]:
import re
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Build chat LLM on Hugging Face serverless
chat_llm = ChatHuggingFace(
    llm=HuggingFaceEndpoint(
        repo_id=DEFAULT_MODEL, task="conversational",
        temperature=0.1, max_new_tokens=24, return_full_text=False
    )
)

# Prompt that constrains the output format
prompt = ChatPromptTemplate.from_messages([
    ('system', 'Provide only the superhero race in one word, surrounded by '
               'parentheses (). If you don’t know, respond with "".'),
    ('human', 'What is the race of {hero_name} from {publisher}?')
])

# Chain: Prompt → LLM → plain text
chain = prompt | chat_llm | StrOutputParser()

# Extract the one word in parentheses
def get_species(hero_name, publisher):
    txt = chain.invoke({"hero_name": hero_name, "publisher": publisher})
    m = re.search(r"\(([A-Za-z\-']+)\)", txt)
    return m.group(1) if m else ""

# Example calls
print("Spider-Man →", get_species("Spider-Man", "Marvel Comics"))
print("Batman →", get_species("Batman", "DC Comics"))
print("Vision →", get_species("Vision", "Marvel Comics"))
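
The parsing step in `get_species` can be checked offline, without calling a model. The sketch below reuses the same regular expression on a few hand-written strings that stand in for model replies. The replies and the name `parse_species` are hypothetical.

```python
import re

# Same pattern as get_species: one word of letters, hyphens, or apostrophes in parentheses
PATTERN = re.compile(r"\(([A-Za-z\-']+)\)")

def parse_species(raw_text):
    """Return the parenthesized word, or "" when the reply does not match."""
    match = PATTERN.search(raw_text)
    return match.group(1) if match else ""

# Hypothetical replies written by hand for illustration
samples = [
    "(Human)",
    "The race is (Android).",
    "(Half-Elf)",
    "I am not sure.",
    "(Human mutant)",  # two words: rejected by the one-word pattern
]
for reply in samples:
    print(repr(reply), "->", repr(parse_species(reply)))
```

Replies that ignore the format, including multi-word answers, fall through to the empty string, so the caller can treat them as still missing.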

### Listing 2-4: Clean and Normalize Dataset
Cleans and normalizes the dataset by removing unnecessary columns, filling missing `Species` values with the LangChain-based `get_species` function from Listing 2-3, and imputing missing heights and weights with per-species averages.

In [None]:
# Start from info_df (loaded earlier) and work on a copy
df = info_df.copy()

# Drop sparse or unused columns
df.drop(columns=['Unnamed: 0', 'Eye color', 'Hair color', 'Skin color'],
        errors='ignore', inplace=True)

# Normalize placeholders to NaN for consistent missing-value handling
df.replace({'-': np.nan, -99: np.nan}, inplace=True)

# Fill Species via LLM (see Listing 2-3)
miss = df['Species'].isna()
filled = 0
for i, row in df.loc[miss, ['name', 'Publisher']].iterrows():
    sp = get_species(row['name'], row['Publisher'])
    if sp:
        df.at[i, 'Species'] = sp
        filled += 1
print(f"Species filled: {filled}, still missing: {df['Species'].isna().sum()}")

# Impute Height and Weight by Species mean (rounded to 1 decimal)
grp = (df.groupby('Species')[['Height', 'Weight']]
         .transform('mean')
         .round(1))
df['Height'] = df['Height'].fillna(grp['Height'])
df['Weight'] = df['Weight'].fillna(grp['Weight'])

# Save and preview a sample of the cleaned result
df.to_csv('superheroes_info_cleansed.csv', index=False)
print(df[['name', 'Species', 'Height', 'Weight']].sample(10))
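
The group-mean imputation relies on `transform('mean')` returning one value per original row. A minimal sketch on hypothetical toy rows shows the effect, including what happens to rows whose `Species` is still missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows for illustration only
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "Species": ["Human", "Human", "Android", "Android", np.nan],
    "Height": [180.0, np.nan, 190.0, 200.0, np.nan],
})

# One group mean per row, aligned to the original index
group_mean = toy.groupby("Species")["Height"].transform("mean").round(1)

# Only missing heights are replaced; known heights are left as they are
toy["Height_filled"] = toy["Height"].fillna(group_mean)
print(toy)
```

Row `B` receives the Human mean, while row `E` stays empty because a missing `Species` belongs to no group. This is why filling `Species` first matters for the later imputation.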

### Listing 2-5: Calculating Quality with Gini Coefficient
We calculate the **Gini coefficient** for the "Alignment" column to assess imbalance between categories, helping us evaluate potential skew in model predictions.

In [None]:
import pandas as pd
import numpy as np

# Load the Superheroes Info dataset
df_info = pd.read_csv(SUPERHEROES_INFO_CLEAN_URL)

# Function to calculate the Gini coefficient
def gini_coefficient(counts):
    sorted_counts = np.sort(counts)  # Sort counts in ascending order
    n = len(counts)
    index = np.arange(1, n + 1)  # 1-based rank of each sorted count
    gini = (np.sum((2 * index - n - 1) * sorted_counts)) / (
        n * np.sum(sorted_counts)
    )
    return gini

# Count occurrences of each alignment category (good, bad, neutral)
alignment_counts = df_info['Alignment'].value_counts()

# Calculate Gini coefficient for the Alignment column
gini_score = gini_coefficient(alignment_counts.values)

# Display the counts and Gini coefficient
print("Alignment Counts:\n", alignment_counts)
print(f"Gini Coefficient for 'Alignment' categories: {gini_score}")
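
To build intuition for the score, the sketch below applies the same rank-weighted formula to three hand-made count vectors. The counts are hypothetical, and `gini_from_counts` is an illustrative restatement of the function above under a different name.

```python
import numpy as np

def gini_from_counts(counts):
    """Gini coefficient of category counts using the rank-weighted formula."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = len(x)
    rank = np.arange(1, n + 1)
    return float(np.sum((2 * rank - n - 1) * x) / (n * np.sum(x)))

balanced = gini_from_counts([10, 10, 10])  # every category equally common
skewed = gini_from_counts([5, 15, 80])     # one category dominates
extreme = gini_from_counts([0, 0, 30])     # all rows in one category

print(balanced, skewed, extreme)
```

With this formula a perfectly balanced column scores 0, and a column with every row in one of `n` categories scores `(n - 1) / n`, so three categories top out at about 0.667 rather than 1.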

### Listing 2-6: Data Relevance Using EDA
Measures imbalance across `Gender`, `Alignment`, and `Species`, plus `Height` and `Weight` discretized into five bins, by computing a Gini coefficient for each and printing the results. This cell reuses `gini_coefficient` from Listing 2-5.

In [None]:
# Load the dataset
df = pd.read_csv(SUPERHEROES_INFO_CLEAN_URL)

# Drop rows with missing values in key columns
df = df.dropna(subset=['Gender', 'Alignment', 'Species', 'Height', 'Weight'])

# Analyze imbalance across categorical columns
categories = ['Gender', 'Alignment', 'Species']

gini_results = {}
for category in categories:
    counts = df[category].value_counts()
    gini_results[category] = gini_coefficient(counts.values)

# Analyze imbalance for discretized height and weight
df['Height_bins'] = pd.cut(df['Height'], bins=5)
df['Weight_bins'] = pd.cut(df['Weight'], bins=5)

gini_results['Height'] = gini_coefficient(df['Height_bins'].value_counts().values)
gini_results['Weight'] = gini_coefficient(df['Weight_bins'].value_counts().values)

# Print Gini coefficients for each category
print("Gini Coefficients for Dataset Imbalances:")
for category, gini_score in gini_results.items():
    print(f"{category}: {gini_score:.3f}")

### Listing 2-7: Superhero Dataset Merge Analysis
Analyzes how well the cleansed info dataset and the powers dataset line up by inner-merging them on the `hero_names` field and calculating the match percentage, a measure of **feature integration depth**.

In [None]:
# Import pandas
import pandas as pd

# Load the datasets from the URLs
info_df = pd.read_csv(SUPERHEROES_INFO_CLEAN_URL)
powers_df = pd.read_csv(SUPERHEROES_POWERS_URL)

# Rename 'name' in info_df to 'hero_names' for consistent merging
info_df.rename(columns={'name': 'hero_names'}, inplace=True)

# Merge the datasets on 'hero_names'
merged_df = pd.merge(info_df, powers_df, on='hero_names', how='inner')

# Calculate and display the total number of matched entries and the percentage match
matched_count = merged_df.shape[0]
total_info_count = info_df.shape[0]
percentage_matched = (matched_count / total_info_count) * 100

print(f"Matched entries: {matched_count}")
print(f"Total entries in Info dataset: {total_info_count}")
print(f"Percentage matched: {percentage_matched:.2f}%")
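
An inner merge counts rows, so repeated keys on either side can inflate the match percentage. The sketch below, on hypothetical toy tables, uses the `indicator=True` option of `merge` to count matched heroes instead of matched rows.

```python
import pandas as pd

# Hypothetical toy tables; 'B' appears twice in the powers table
info = pd.DataFrame({"hero_names": ["A", "B", "C", "D"]})
powers = pd.DataFrame({
    "hero_names": ["A", "B", "B", "Z"],
    "Flight": [True, False, True, False],
})

# Row-based view: the duplicate 'B' produces an extra merged row
inner_rows = len(info.merge(powers, on="hero_names", how="inner"))

# Hero-based view: the _merge column records where each row found a partner
both = info.merge(powers, on="hero_names", how="left", indicator=True)
matched_heroes = both.loc[both["_merge"] == "both", "hero_names"].nunique()
pct = matched_heroes / info["hero_names"].nunique() * 100

print(f"Inner-merge rows: {inner_rows}")
print(f"Matched heroes: {matched_heroes} ({pct:.1f}%)")
```

Here the inner merge yields three rows for only two matched heroes, so the row-based percentage would read 75% where the hero-based figure is 50%.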


### Listing 2-8: Compute OPR/SDR and Merge into Hero Info
This program builds two composite signals for each hero: Offensive Power Rating
(OPR) and Strategic Defense Rating (SDR). Magic and Super Speed are treated as
dual-use with a small defensive weight, and the code avoids double counting while
using only columns that exist. Duplicate power rows are aggregated by hero name,
and a has_powers_source flag is added (1=yes, 0=no) to mark whether a powers
row exists so missing rows are not mistaken for true zeros. The scores and flag are
then merged into the info table and written to INFO_POWERS_FILE.

In [None]:
import pandas as pd

# Helper functions

def normalize_join_key(df):
    if 'hero_names' in df.columns and 'name' not in df.columns:
        return df.rename(columns={'hero_names': 'name'})
    return df

def dedupe_info(df):
    if 'Publisher' in df.columns:
        return (df.sort_values(['name','Publisher'])
                .drop_duplicates(subset=['name','Publisher'], keep='first'))
    return (df.sort_values('name')
            .drop_duplicates(subset=['name'], keep='first'))

def aggregate_powers(df):
    cols = [c for c in df.columns if c != 'name']
    return df.groupby('name', as_index=False)[cols].max()

def prune_present(df, off, deff, dual):
    dual_present = [p for p in dual if p in df.columns]
    off_base = [p for p in off if p in df.columns and p not in dual_present]
    def_base = [p for p in deff if p in df.columns and p not in dual_present]
    pcols = sorted(set(off_base + def_base + dual_present))
    return off_base, def_base, dual_present, pcols

def compute_opr_sdr(df, off_base, def_base, dual_w, pcols):
    out = df[['name']].copy()
    opr = df[off_base].fillna(0).sum(axis=1) if off_base else 0.0
    sdr = df[def_base].fillna(0).sum(axis=1) if def_base else 0.0
    for col in [c for c in dual_w if c in df.columns]:
        opr = opr + dual_w[col]['OPR'] * df[col].fillna(0)
        sdr = sdr + dual_w[col]['SDR'] * df[col].fillna(0)
    out['OPR'] = opr
    out['SDR'] = sdr
    out['has_powers_source'] = False
    if pcols:
        out['has_powers_source'] = ~df[pcols].isna().all(axis=1)
    return out

def merge_and_save(info_df, ratings, out_file):
    res = info_df.merge(ratings, on='name', how='left')
    res['has_powers_source'] = (res['has_powers_source']
                                .fillna(False).astype('int8'))
    res.to_csv(out_file, index=False)
    print(f"Saved {out_file} (has_powers_source: 1=yes, 0=no).")
    return res


In [None]:
import pandas as pd

# Opt in to future behavior to avoid downcasting warnings
pd.set_option('future.no_silent_downcasting', True)

# Load
info_df = pd.read_csv(SUPERHEROES_INFO_CLEAN_URL)
powers_df = pd.read_csv(SUPERHEROES_POWERS_URL)

# Prep
powers_df = normalize_join_key(powers_df)
info_df = dedupe_info(info_df)
powers_df = aggregate_powers(powers_df)

# Power spec
OFFENSIVE_POWERS = [
    'Super Strength','Energy Blasts','Weapons Master','Marksmanship',
    'Telekinesis','Cryokinesis','Fire Control','Power Augmentation',
    'Animal Oriented Powers','Super Speed'
]
DEFENSIVE_POWERS = [
    'Durability','Invulnerability','Force Fields','Energy Absorption',
    'Regeneration','Immortality','Camouflage','Phasing',
    'Enhanced Senses','Teleportation'
]
DUAL_WEIGHTS = {
    'Magic': {'OPR': 0.7, 'SDR': 0.3},
    'Super Speed': {'OPR': 0.7, 'SDR': 0.3},
}

# Build features
off_base, def_base, dual_present, pcols = prune_present(
    powers_df, OFFENSIVE_POWERS, DEFENSIVE_POWERS, DUAL_WEIGHTS
)
ratings = compute_opr_sdr(
    powers_df, off_base, def_base, DUAL_WEIGHTS, pcols
)

# Merge and save (writes has_powers_source as 1/0)
info_with_ratings = merge_and_save(info_df, ratings, INFO_POWERS_FILE)
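
The scoring rule is easier to see on two rows. This sketch uses a hypothetical toy powers table with one offensive, one defensive, and one dual-use column, and applies the same 0.7/0.3 split as `DUAL_WEIGHTS`.

```python
import pandas as pd

# Hypothetical toy powers table for illustration only
powers = pd.DataFrame({
    "name": ["A", "B"],
    "Super Strength": [True, False],
    "Durability": [True, True],
    "Magic": [False, True],
})
off_cols, def_cols = ["Super Strength"], ["Durability"]
dual = {"Magic": {"OPR": 0.7, "SDR": 0.3}}

# Base scores: count of single-purpose powers the hero has
opr = powers[off_cols].sum(axis=1).astype(float)
sdr = powers[def_cols].sum(axis=1).astype(float)

# Dual-use powers contribute a weighted share to each score
for col, w in dual.items():
    opr = opr + w["OPR"] * powers[col]
    sdr = sdr + w["SDR"] * powers[col]

print(pd.DataFrame({"name": powers["name"], "OPR": opr, "SDR": sdr}))
```

Hero `B` has no purely offensive power yet still earns an OPR of 0.7 from Magic. The sanity suite covers this case with its separate "dual" violation checks.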

### Listing 2-8A: OPR/SDR Sanity Test Suite
Validates OPR/SDR by checking coverage, face validity, consistency vs.
source powers, redundancy, sensitivity to small changes, outliers, and
simple leakage. Assumes constants from Listing 2-8 are in memory.

In [None]:
import pandas as pd
import numpy as np

# Requires the constants from Listing 2-8 (INFO_POWERS_FILE, OFFENSIVE_POWERS,
# DEFENSIVE_POWERS, DUAL_WEIGHTS) to be in memory.
# Load the enriched info+powers file written by Listing 2-8
df = pd.read_csv(INFO_POWERS_FILE)

# Load raw powers for cross-checks
pow_df = pd.read_csv(SUPERHEROES_POWERS_URL)
if 'hero_names' in pow_df.columns and 'name' not in pow_df.columns:
    pow_df = pow_df.rename(columns={'hero_names': 'name'})

# Buckets present in this file
present_off = [p for p in OFFENSIVE_POWERS if p in pow_df.columns]
present_def = [p for p in DEFENSIVE_POWERS if p in pow_df.columns]
present_dual = [p for p in DUAL_WEIGHTS if p in pow_df.columns]
off_base = [p for p in present_off if p not in present_dual]
def_base = [p for p in present_def if p not in present_dual]

# 1) Coverage and descriptives
print("OPR/SDR summary:")
print(df[['OPR','SDR']].describe().T[['count','mean','std','min','max']], "\n")
miss = df[['OPR','SDR']].isna().mean().mul(100).rename('missing_%')
print("Missing values (%):\n", miss.to_frame(), "\n")

# 2) Face validity (top/bottom)
def peek(dfin, col, k=10):
    cols = [c for c in ['name','Publisher','Alignment','Species','OPR','SDR']
            if c in dfin.columns]
    print(f"Top {k} by {col}:")
    print(dfin[cols].sort_values(col, ascending=False).head(k)
          .to_string(index=False), "\n")
    print(f"Bottom {k} by {col}:")
    print(dfin[cols].sort_values(col, ascending=True).head(k)
          .to_string(index=False), "\n")

peek(df, 'OPR', 10)
peek(df, 'SDR', 10)

# 3) Consistency vs. source flags
src = pow_df.set_index('name')
have_off = (src[off_base + present_dual].fillna(0).sum(axis=1) > 0
            if (off_base or present_dual) else pd.Series(False, index=src.index))
have_def = (src[def_base + present_dual].fillna(0).sum(axis=1) > 0
            if (def_base or present_dual) else pd.Series(False, index=src.index))

dfi = df.set_index('name')

only_dual_off = ((src[present_dual].fillna(0).sum(axis=1) > 0) &
                 (src[off_base].fillna(0).sum(axis=1) == 0))
only_dual_def = ((src[present_dual].fillna(0).sum(axis=1) > 0) &
                 (src[def_base].fillna(0).sum(axis=1) == 0))

viol_opr_strict = dfi.loc[have_off & ~only_dual_off & (dfi['OPR'] < 1)]
viol_opr_dual   = dfi.loc[only_dual_off & (dfi['OPR'] < 0.3)]
viol_sdr_strict = dfi.loc[have_def & ~only_dual_def & (dfi['SDR'] < 1)]
viol_sdr_dual   = dfi.loc[only_dual_def & (dfi['SDR'] < 0.3)]

print(f"OPR violations (strict): {len(viol_opr_strict)}")
print(f"OPR violations (dual):   {len(viol_opr_dual)}")
print(f"SDR violations (strict): {len(viol_sdr_strict)}")
print(f"SDR violations (dual):   {len(viol_sdr_dual)}\n")

# 4) Redundancy and alignment signal
corr = df[['OPR','SDR']].corr(method='pearson').iloc[0,1]
print(f"Correlation(OPR, SDR): {corr:.3f}")
if 'Alignment' in df.columns:
    means = df.groupby('Alignment')[['OPR','SDR']].mean().round(2)
    counts = df['Alignment'].value_counts()
    print("\nMeans by Alignment:\n", means)
    print("\nCounts by Alignment:\n", counts, "\n")

# 5) Sensitivity: move one offense power to defense, then re-rank
cand = [c for c in off_base if c in pow_df.columns]
tweak = cand[0] if cand else None
if tweak:
    t = src.copy()  # indexed by name so the index merge below aligns with dfi
    cols = [*off_base, *def_base, *present_dual]
    t[cols] = t[cols].fillna(0)
    off_t = [c for c in off_base if c != tweak]
    def_t = def_base + [tweak]
    t['OPR_tmp'] = t[off_t].sum(axis=1)
    t['SDR_tmp'] = t[def_t].sum(axis=1)
    for col in present_dual:
        w = DUAL_WEIGHTS[col]
        if col in t.columns:
            t['OPR_tmp'] += w['OPR'] * t[col]
            t['SDR_tmp'] += w['SDR'] * t[col]
    tmp = dfi[['OPR','SDR']].merge(t[['OPR_tmp','SDR_tmp']],
                                   left_index=True, right_index=True,
                                   how='inner')
    if tmp['OPR'].nunique() > 1 and tmp['OPR_tmp'].nunique() > 1:
        rho_opr = tmp['OPR'].corr(tmp['OPR_tmp'], method='spearman')
        print(f"Sensitivity Spearman(OPR vs OPR_tmp): {rho_opr:.3f}")
    if tmp['SDR'].nunique() > 1 and tmp['SDR_tmp'].nunique() > 1:
        rho_sdr = tmp['SDR'].corr(tmp['SDR_tmp'], method='spearman')
        print(f"Sensitivity Spearman(SDR vs SDR_tmp): {rho_sdr:.3f}\n")
else:
    print("Sensitivity test skipped (no suitable offensive power).\n")

# 6) Outliers and a robust cosmic flag
q99_opr = df['OPR'].quantile(0.99)
q99_sdr = df['SDR'].quantile(0.99)
df['cosmic_flag'] = (df['OPR'] >= q99_opr) | (df['SDR'] >= q99_sdr)
print("Cosmic-flagged heroes:", int(df['cosmic_flag'].sum()))
if df['cosmic_flag'].any():
    print(df.loc[df['cosmic_flag'], ['name','OPR','SDR']]
          .head(15).to_string(index=False), "\n")

# 7) Simple leakage heuristic
maybe = [c for c in ['Good','Evil','Omniscient','Omnipotent']
         if c in pow_df.columns]
print("Potentially leaky power columns:", maybe if maybe else "None")

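### Listing 2-9: Generate Story Plot Dataset
This code generates superhero plot summaries using randomized archetypes and an OpenAI chat model (`gpt-4o-mini`, called through LangChain's `ChatOpenAI`), then saves the results to a CSV file for analysis or reuse. It requires `OPENAI_API_KEY` and the `langchain_openai` package. This notebook installs that package in the `%pip` cell of Listing 2-10, so run that cell first if the import fails.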


In [None]:
import os, re, random, pandas as pd
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Config
NUM_PLOTS = 5
TEMP = 0.4
MODEL = "gpt-4o-mini"  # fast, capable default

# LLM and prompt
llm = ChatOpenAI(model=MODEL, temperature=TEMP)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Write a concise superhero plot. Include a title and three beats."),
    ("user",
     "Hero: {hero}\nVillain: {villain}\nConflict: {conflict}\nSetting: {setting}")
])
chain = prompt | llm | StrOutputParser()

# Archetype pools
heroes = ["reluctant hero","outcast","chosen one","antihero","AI guardian",
          "alien protector","mystic sage","time traveler","reformed villain"]
villains = ["mastermind","rogue AI","dark sorcerer","doppelganger","void entity",
            "time manipulator","cosmic tyrant","corrupt politician"]
conflicts = ["personal vendetta","cosmic invasion","identity crisis",
             "magic vs science","city under siege","revenge plot"]
settings = ["futuristic city","space station","hidden temple","cyberpunk sprawl",
            "parallel universe","post-apocalyptic wasteland"]

# Generate
rows = []
for i in range(NUM_PLOTS):
    args = dict(hero=random.choice(heroes),
                villain=random.choice(villains),
                conflict=random.choice(conflicts),
                setting=random.choice(settings))
    text = chain.invoke(args)
    text = re.sub(r"\s*\n\s*", " | ", text.strip())
    rows.append({**{k.title(): v for k, v in args.items()}, "Plot": text})
    print(f"Plot {i+1}: {text[:120]}...")

# Save
pd.DataFrame(rows).to_csv(PLOTS_FILE, index=False)
print(f"Saved {PLOTS_FILE}")
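
Because the archetype draws above are unseeded, each run picks different combinations. If repeatable prompts are wanted, one option is a private seeded generator. This is a minimal sketch with shortened, hypothetical pools. The names `draw_archetypes`, `hero_pool`, and `villain_pool` are illustrative.

```python
import random

# Shortened pools for illustration
hero_pool = ["reluctant hero", "outcast", "chosen one"]
villain_pool = ["mastermind", "rogue AI", "dark sorcerer"]

def draw_archetypes(n, seed):
    """Draw n hero/villain pairs from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [
        {"hero": rng.choice(hero_pool), "villain": rng.choice(villain_pool)}
        for _ in range(n)
    ]

first = draw_archetypes(3, seed=42)
second = draw_archetypes(3, seed=42)
print(first == second)
```

The same seed reproduces the same archetype combinations. The generated plot text can still vary between runs, because the model samples at a nonzero temperature.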

### Listing 2-10: Comic Story Assistant with RAG
Showcases a Retrieval-Augmented Generation (RAG) approach that uses ChromaDB with hero plot and power data to dynamically generate tailored superhero story arcs.

In [None]:
%pip install -q chromadb langchain_chroma langchain_openai

In [None]:
import pandas as pd, random
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate

# Load data
plots = pd.read_csv(SUPERHEROES_INFO_PLOTS_URL)
heroes = pd.read_csv(SUPERHEROES_INFO_POWERS_URL)

# Vector store from plot texts
emb = OpenAIEmbeddings(model="text-embedding-3-small")
db = Chroma.from_texts(plots['Plot'].astype(str).tolist(), emb)

# Writer intent and retrieval
query = ("Time-traveling hero vs scientific mastermind in a futuristic city.")
doc = db.similarity_search(query, k=1)[0].page_content

# Pick hero/villain with valid power scores
good = heroes.query("Alignment == 'good'").dropna(subset=['OPR','SDR'])
bad  = heroes.query("Alignment == 'bad'").dropna(subset=['OPR','SDR'])
h, v = good.sample(1).iloc[0], bad.sample(1).iloc[0]

# Prompt and model
prompt = ChatPromptTemplate.from_messages([
    ("system", "Write a clear superhero story in 8–12 sentences."),
    ("user",
     "Plot outline:\n{plot}\n\n"
     "Hero: {h} (OPR {ho:.1f}, SDR {hs:.1f}).\n"
     "Villain: {v} (OPR {vo:.1f}, SDR {vs:.1f}).\n"
     "Rewrite the story with these roles. Build tension and a final showdown.")
])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.2)
chain = prompt | llm

# Generate
resp = chain.invoke({
    "plot": doc, "h": h['name'], "ho": h['OPR'], "hs": h['SDR'],
    "v": v['name'], "vo": v['OPR'], "vs": v['SDR']
})
print(resp.content)
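
The retrieval step can be illustrated without any external service. The sketch below swaps in a word-count vector and cosine similarity as a simple stand-in for `OpenAIEmbeddings` and Chroma. It is not how those libraries work internally, and the plot strings are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical plot texts standing in for the 'Plot' column
toy_plots = [
    "A time traveler battles a mastermind in a futuristic city.",
    "A mystic sage guards a hidden temple from a dark sorcerer.",
    "An alien protector defends a space station from a void entity.",
]
toy_query = "Time-traveling hero vs scientific mastermind in a futuristic city."

# k=1 retrieval: keep the plot whose vector is closest to the query vector
q_vec = embed(toy_query)
best = max(toy_plots, key=lambda p: cosine(q_vec, embed(p)))
print(best)
```

A learned embedding model would also match paraphrases that share no words with the query, which this word-count stand-in cannot do.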

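### Listing 2-11: Pseudonymizing Superhero Plots Using spaCy
Demonstrates pseudonymization of entities within superhero plots, using spaCy to replace names, organizations, and locations with generic terms for privacy. It reads the `plots` DataFrame loaded in Listing 2-10. **Note:** Be sure to run the *pip* install before running the code snippet below. `spacy.load('en_core_web_sm')` also needs the `en_core_web_sm` model. If it is missing, install it with `python -m spacy download en_core_web_sm`.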

In [None]:
%pip install -q spacy

In [None]:
import spacy

# Load spaCy's English model for entity recognition
nlp = spacy.load('en_core_web_sm')

# Function to pseudonymize entity names in the plot text
def pseudonymize_entities(text):
    doc = nlp(text)
    for ent in doc.ents:
        # Replace detected entities with generic labels
        if ent.label_ == "PERSON":
            text = text.replace(ent.text, 'Hero A')
        elif ent.label_ == "ORG":
            text = text.replace(ent.text, 'Organization X')
        elif ent.label_ == "GPE":
            text = text.replace(ent.text, 'Location Z')
    return text

# Pseudonymize only the first plot
first_plot = plots['Plot'].iloc[0]
pseudonymized_first_plot = pseudonymize_entities(first_plot)

# Display the pseudonymized first plot
print("Pseudonymized Plot:")
print(pseudonymized_first_plot)
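
The function above maps every person to the same label, so two different characters become indistinguishable. One possible refinement is to give each distinct entity its own numbered placeholder. The sketch below assumes the entities have already been detected and passes them in as hand-written `(text, label)` pairs standing in for spaCy output. The sample sentence and the name `pseudonymize` are hypothetical.

```python
import re

LABEL_PREFIX = {"PERSON": "Person", "ORG": "Organization", "GPE": "Location"}

def pseudonymize(text, entities):
    """Replace each distinct entity with a stable, numbered placeholder."""
    mapping, counters = {}, {}
    for ent_text, label in entities:
        if label in LABEL_PREFIX and ent_text not in mapping:
            counters[label] = counters.get(label, 0) + 1
            mapping[ent_text] = f"{LABEL_PREFIX[label]} {counters[label]}"
    # Replace longer names first so a full name is handled before any shorter part
    for ent_text in sorted(mapping, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(ent_text)}\b", mapping[ent_text], text)
    return text, mapping

text = "Iris Vale leaves Helix Labs in Metro City after Dr. Corvin betrays Iris Vale."
entities = [("Iris Vale", "PERSON"), ("Helix Labs", "ORG"),
            ("Metro City", "GPE"), ("Corvin", "PERSON")]
masked, mapping = pseudonymize(text, entities)
print(masked)
```

The returned `mapping` is what allows re-identification later, so it would need to be stored separately and protected as carefully as the original names.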

### Listing 2-12: Data Masking and Differential Privacy
Demonstrates data masking and a differential-privacy-style perturbation on health records: phone numbers are masked to their last four digits, and Laplace noise is added to age values.

In [None]:
import pandas as pd
import numpy as np

# Sample dataset: health records
data = pd.DataFrame({
    'patient_id': ['A123', 'B456', 'C789'],
    'phone': ['123-456-7890', '987-654-3210', '555-123-4567'],
    'age': [29, 47, 35],
    'diagnosis': ['Condition A', 'Condition B', 'Condition A']
})

# Data Masking: Mask all but the last four digits of phone numbers
data['masked_phone'] = data['phone'].apply(lambda x: 'XXX-XXX-' + x[-4:])

# Differential Privacy: Add noise to age values for anonymization
noise_level = 2  # Adjust noise level as needed
data['age_noisy'] = data['age'] + np.random.laplace(0, noise_level, len(data))

# Display the modified dataset
print("Anonymized Data:\n", data[['patient_id', 'masked_phone',
                                 'age_noisy', 'diagnosis']])
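
In the Laplace mechanism, the noise scale is usually derived from two quantities rather than chosen directly: the query's sensitivity and the privacy budget epsilon, with scale = sensitivity / epsilon. The sketch below applies this to a counting query over the same three ages. The epsilon values and the name `laplace_mechanism` are illustrative assumptions, not recommendations.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value plus Laplace noise with scale = sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(seed=0)
ages = np.array([29, 47, 35])

# Counting query: adding or removing one patient changes the count by at most 1
true_count = int((ages > 30).sum())

# Smaller epsilon means stronger privacy and noisier answers
noisy_strict = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5, rng=rng)
noisy_loose = laplace_mechanism(true_count, sensitivity=1, epsilon=5.0, rng=rng)

print(f"True count: {true_count}")
print(f"epsilon=0.5: {noisy_strict:.2f}   epsilon=5.0: {noisy_loose:.2f}")
```

Under this framing, the listing's `noise_level = 2` plays the role of the scale, which for a sensitivity-1 query would correspond to an epsilon of 0.5.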

### Listing 2-13: Encrypting Sensitive Data
This code encrypts a sensitive dataset with **Fernet** (symmetric encryption), so only holders of the key can decrypt and read it.
**Note:** Be sure to run the following *pip install* first.

In [None]:
%pip -q install cryptography

In [None]:
#Import Fernet
from cryptography.fernet import Fernet
import pandas as pd
import io  # Import io for StringIO

# Generate encryption key and create a cipher suite
key = Fernet.generate_key()
cipher_suite = Fernet(key)

# Sample sensitive data to be encrypted
data = pd.DataFrame({"Patient": ["John Doe", "Jane Smith"],
                     "Diagnosis": ["Diabetes", "Hypertension"]})
data_str = data.to_csv(index=False)
encrypted_data = cipher_suite.encrypt(data_str.encode())

# Show part of the encrypted string
print("Encrypted Data (partial):", encrypted_data[:50], "...")

# Decrypt the data when access is needed
decrypted_data_str = cipher_suite.decrypt(encrypted_data).decode()
secure_data = pd.read_csv(io.StringIO(decrypted_data_str))

print("\nDecrypted Data Accessible (Only by Authorized Users)\n", secure_data)

### Listing 2-14: Generating Synthetic Health Records
Uses **Faker** to create fictional health records, providing a safe, realistic dataset structure for AI training or testing applications. **Note**: Be sure to run the following pip install.

In [None]:
%pip -q install Faker

In [None]:
from faker import Faker
import pandas as pd

# Initialize Faker for synthetic data generation
fake = Faker()

# Generate a synthetic health records dataset
data = pd.DataFrame({
    'Patient_ID': [fake.uuid4() for _ in range(5)],
    'Name': [fake.name() for _ in range(5)],
    'Age': [fake.random_int(min=18, max=90) for _ in range(5)],
    'Diagnosis': [fake.random_element(elements=('Condition A',
                                               'Condition B',
                                               'Condition C'))
                   for _ in range(5)],
    'Phone': [fake.phone_number() for _ in range(5)],
    'Address': [fake.address().replace("\n", ", ") for _ in range(5)],
    'Last_Visit': [fake.date_between(start_date='-2y', end_date='today')
                   for _ in range(5)]
})

# Display the synthetic dataset
print(data)