## Chapter 2 - Prepping Data for AI
This notebook contains examples of data preparation strategies for AI, including data cleaning, feature engineering, and handling data sensitivity. Using open-source tools like Pandas, LangChain, and ChromaDB, it explores design patterns for crafting high-quality datasets. It also covers techniques for ensuring data privacy and security, highlighting methods like data masking and synthetic data generation to safeguard sensitive information.

### Listing 2-1: Defining Dataset Constants
Defines **dataset** URLs as constants for reuse throughout the notebook, so the superhero datasets can be loaded easily across multiple cells.

In [None]:
# Base GitHub repository URL
BASE_URL = "https://opensourceai-book.github.io/code/datasets/"

# Dataset file names
INFO_FILE = "superheroes_info.csv"
INFO_CLEAN_FILE = "superheroes_info_cleansed.csv"
POWERS_FILE = "superheroes_powers.csv"
INFO_POWERS_FILE = "superheroes_info_powers.csv"
PLOTS_FILE = "superheroes_story_plots.csv"

# Construct full dataset URLs
SUPERHEROES_INFO_URL = f"{BASE_URL}{INFO_FILE}"
SUPERHEROES_INFO_CLEAN_URL = f"{BASE_URL}{INFO_CLEAN_FILE}"
SUPERHEROES_POWERS_URL = f"{BASE_URL}{POWERS_FILE}"
SUPERHEROES_INFO_POWERS_URL = f"{BASE_URL}{INFO_POWERS_FILE}"
SUPERHEROES_INFO_PLOTS_URL = f"{BASE_URL}{PLOTS_FILE}"

**Note:** If you see errors about unset constants, rerun the above block. This can happen if your Python runtime disconnects in Colab.

### Listing 2-2: Loading and Displaying Dataset Stats
We load our two superhero-related datasets from the provided URLs into DataFrames and display basic statistics such as the number of rows, the number of columns, and the first few column names.


In [None]:
# Import our good friend Pandas, home of the Dataframe and Numpy
import pandas as pd
import numpy as np

# Using the constants defined earlier in the dictionary for reuse
urls = {
    "Superheroes Info Dataset": SUPERHEROES_INFO_URL,
    "Superheroes Powers Dataset": SUPERHEROES_POWERS_URL
}

# Load datasets and display basic stats
for name, url in urls.items():
    df = pd.read_csv(url)
    print(f"{name}:\n  - Rows: {df.shape[0]}\n  - Columns: {df.shape[1]}"
          f"\n  - Column names: {', '.join(df.columns[:5])}...\n")
    print("="*50 + "\n")

### Listing 2-3: Analyzing and Detecting Duplicates and Sparse Fields
Identifies duplicate rows by superhero name and assesses sparse fields, reporting the percentage of missing data for every dataset column.

In [None]:
# Load the dataset
df = pd.read_csv(SUPERHEROES_INFO_URL)

# Find rows that share the same 'name' (keep=False flags every occurrence)
duplicates = df[df.duplicated(subset=['name'], keep=False)]

# List unique superheroes with duplicate rows
duplicate_names = duplicates['name'].unique()

print("Superheroes with duplicate rows:")
print(duplicate_names)


**Run this cell next** to detect sparse fields by calculating the percentage of missing data for each column.

In [None]:
# Load the dataset
df = pd.read_csv(SUPERHEROES_INFO_URL)

# Replace placeholder values '-' and '-99' with NaN to detect missing data
df.replace('-', np.nan, inplace=True)
df.replace(-99, np.nan, inplace=True)

# Remove the counter column (assuming it's the first column)
df.drop(df.columns[0], axis=1, inplace=True)

# Calculate the percentage of missing data for all columns
total_entries = len(df)
sparse_fields = {
    col: (df[col].isnull().sum() / total_entries) * 100 for col in df.columns
}

# Display all columns with their percentage of missing data
print("\nSparse Fields (Percentage of Missing Data):")
print("{:<20} {:>10}".format("Column", "Missing %"))
print("-" * 30)
for col, percentage in sparse_fields.items():
    print("{:<20} {:>8.2f}%".format(col, percentage))

### Listing 2-4: Infer Superhero Race with LangChain
This function uses LangChain and an open-source model from Hugging Face Hub to infer a superhero's race based on their name and publisher.

⚠️ Make sure you've set your `HF_TOKEN` and `OPENAI_API_KEY` in Colab secrets. The OpenAI key is only used by Listing 2-11, but the loader below raises an error if either key is missing.
Refer to **Chapter 1** (notebook or book text) for setup instructions.  
The next two code cells install the required packages and configure API keys in the environment for use with LangChain.


In [None]:
# Install required packages for Hugging Face and LangChain usage
%pip install -q langchain langchain-community langchain-openai huggingface_hub

In [None]:
# Constants and API Key Configuration
import os
from google.colab import userdata

# === Load API keys securely from Google Colab Secrets ===
def load_api_keys():
    keys = {
        "HF_TOKEN": userdata.get("HF_TOKEN"),
        "OPENAI_API_KEY": userdata.get("OPENAI_API_KEY"),
    }
    for key, value in keys.items():
        if not value:
            raise ValueError(f"❌ Missing {key}. Please set this API key in Colab secrets.")
        os.environ[key] = value
    print("✅ All API keys loaded and configured successfully.")

# Execute API key loading upon running this cell
load_api_keys()

#### Define the default LLM (Text Generation Model) to use from Hugging Face
Run the code cell below to define the `DEFAULT_MODEL` constant.
> ⚠️ If you get an error running LangChain code due to a missing model, welcome to open-source AI development. Models are updated or replaced often. Check Hugging Face’s list of supported text generation models here:  
> https://huggingface.co/docs/api-inference/en/supported-models


In [None]:
DEFAULT_MODEL = "mistralai/Mistral-Nemo-Instruct-2407"

In [None]:
import re
from huggingface_hub import InferenceClient
from langchain_core.prompts import ChatPromptTemplate

client = InferenceClient(model=DEFAULT_MODEL)

# Define the prompt template
prompt_template = ChatPromptTemplate.from_messages([
    ('system', 'Provide only the superhero race in one word, surrounded by '
               'parentheses (). If you don’t know, respond with "".'),
    ('human', 'What is the race of {hero_name} from {publisher}?')
])

def get_race_from_llm(hero_name, publisher):
    # Format the prompt into a single string (system and human parts combined)
    formatted_prompt = prompt_template.format(
        hero_name=hero_name,
        publisher=publisher
    )

    # Send to model
    response = client.chat.completions.create(
        model=DEFAULT_MODEL,
        messages=[{"role": "user", "content": formatted_prompt}],
        max_tokens=100,
        temperature=0.1
    )

    # Raw text of the model's reply (fall back to "" if it is empty)
    content = response.choices[0].message.content or ""

    # Extract race from parentheses
    match = re.search(r'\((\w+)\)', content)
    return match.group(1) if match else ""

# Test it
hero_name = "Spider-Man"
publisher = "Marvel Comics"
race = get_race_from_llm(hero_name, publisher)
print(f"Race of {hero_name}: {race}")
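
The parsing step decides what ends up in the `Race` column, so it helps to see how the regular expression treats different replies. The sketch below applies the same pattern to a few made-up replies without calling any model. The helper name `extract_race` and the sample strings are assumptions for illustration only.

```python
import re

# Same pattern as get_race_from_llm: one word wrapped in parentheses
RACE_PATTERN = re.compile(r'\((\w+)\)')

def extract_race(content):
    """Return the first parenthesized single word, or "" if none is found."""
    match = RACE_PATTERN.search(content)
    return match.group(1) if match else ""

# Hypothetical model replies, for illustration only
samples = [
    "(Human)",
    "The race is (Kryptonian).",
    '""',
    "(Human / Mutant)",
    "Human",
]

for reply in samples:
    print(f"{reply!r:30} -> {extract_race(reply)!r}")
```

Replies that contain spaces or punctuation inside the parentheses, or no parentheses at all, come back as an empty string, which the calling code treats as "could not determine".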


### Listing 2-5: Clean and Normalize Dataset
Cleans and normalizes the dataset by removing unnecessary columns, normalizing placeholder values, filling missing races with the LangChain-based function from the previous listing, and filling missing heights and weights with race-based averages.

In [None]:
# Load dataset
df = pd.read_csv(SUPERHEROES_INFO_URL)

# Step 1: Remove unnecessary columns
df.drop(columns=['Unnamed: 0', 'Eye color', 'Hair color', 'Skin color'],
        inplace=True)

# Step 2: Normalize placeholder values
df.replace('-', np.nan, inplace=True)
df.replace(-99, np.nan, inplace=True)

# Step 3: Use LangChain LLM to fill missing race values
for idx, row in df[df['Race'].isna()].iterrows():
    race = get_race_from_llm(row['name'], row['Publisher'])
    if race:
        df.at[idx, 'Race'] = race
        print(f"Filled race for {row['name']}: {race}")
    else:
        print(f"Could not determine race for {row['name']}")

# Step 4: Fill missing height/weight using race averages
# Round the averages to 1 decimal place
race_grouped = df.groupby('Race')[['Height', 'Weight']].mean().round(1)

for race in race_grouped.index:
    avg_height = race_grouped.loc[race, 'Height']
    avg_weight = race_grouped.loc[race, 'Weight']
    df.loc[(df['Race'] == race) & (df['Height'].isnull()), 'Height'] = avg_height
    df.loc[(df['Race'] == race) & (df['Weight'].isnull()), 'Weight'] = avg_weight

# Save cleansed data to CSV
df.to_csv('superheroes_info_cleansed.csv', index=False)

# Output sample of cleaned dataset
print(df[['name', 'Race', 'Height', 'Weight']].sample(10))
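
Step 4 can only fill a value when the hero has a race and that race has at least one known measurement. The plain-Python sketch below mirrors that logic on a handful of made-up records to show which rows stay empty. The records and variable names are assumptions for illustration and are not taken from the dataset.

```python
from statistics import mean

# Hypothetical records, for illustration only (None marks a missing value)
heroes = [
    {"name": "A", "Race": "Human", "Height": 180.0},
    {"name": "B", "Race": "Human", "Height": 170.0},
    {"name": "C", "Race": "Human", "Height": None},
    {"name": "D", "Race": "Android", "Height": None},
    {"name": "E", "Race": None, "Height": None},
]

# Collect known heights per race, skipping rows without a race
known = {}
for h in heroes:
    if h["Race"] is not None and h["Height"] is not None:
        known.setdefault(h["Race"], []).append(h["Height"])

# Average per race, rounded to 1 decimal place as in Step 4
race_avg = {race: round(mean(vals), 1) for race, vals in known.items()}

# Fill only where a race average exists
for h in heroes:
    if h["Height"] is None and h["Race"] in race_avg:
        h["Height"] = race_avg[h["Race"]]

still_missing = [h["name"] for h in heroes if h["Height"] is None]
print(race_avg)
print("Still missing:", still_missing)
```

Hero C receives the Human average, while D (no known heights for its race) and E (no race) remain empty. The same gaps can survive in the cleansed CSV, which is why a later listing drops rows with missing values before analysis.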

### Listing 2-6: Calculating Quality with Gini Coefficient
We calculate the **Gini coefficient** for the "Alignment" column to assess imbalance between categories, helping us evaluate potential skew in model predictions.

In [None]:
import pandas as pd
import numpy as np

# Load the Superheroes Info dataset
df_info = pd.read_csv(SUPERHEROES_INFO_CLEAN_URL)

# Function to calculate the Gini coefficient
def gini_coefficient(counts):
    sorted_counts = np.sort(counts)  # Sort counts in ascending order
    n = len(counts)
    index = np.arange(1, n + 1)  # 1-based rank of each sorted count
    gini = (np.sum((2 * index - n - 1) * sorted_counts)) / (
        n * np.sum(sorted_counts)
    )
    return gini

# Count occurrences of each alignment category (good, bad, neutral)
alignment_counts = df_info['Alignment'].value_counts()

# Calculate Gini coefficient for the Alignment column
gini_score = gini_coefficient(alignment_counts.values)

# Display the counts and Gini coefficient
print("Alignment Counts:\n", alignment_counts)
print(f"Gini Coefficient for 'Alignment' categories: {gini_score}")
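
To get a feel for the scale of the score, here is a plain-Python sketch of the same rank-weighted formula applied to two made-up count vectors. The function name `gini_from_counts` and the counts are assumptions for illustration, not values from the dataset.

```python
def gini_from_counts(counts):
    """Gini coefficient using the same rank-weighted formula as gini_coefficient."""
    sorted_counts = sorted(counts)
    n = len(sorted_counts)
    total = sum(sorted_counts)
    # Weight each sorted count by (2 * rank - n - 1), with ranks starting at 1
    weighted = sum((2 * i - n - 1) * c
                   for i, c in enumerate(sorted_counts, start=1))
    return weighted / (n * total)

balanced = gini_from_counts([100, 100, 100])  # Every category equally common
skewed = gini_from_counts([98, 1, 1])         # One category dominates

print(f"Balanced: {balanced:.3f}")  # Balanced: 0.000
print(f"Skewed:   {skewed:.3f}")    # Skewed:   0.647
```

A score of 0 means perfectly even categories. With this formula and three categories, the score cannot exceed 2/3, so 0.647 already indicates a heavily lopsided column.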

### Listing 2-7: Data Relevance Using EDA
Measures imbalance across the dataset by computing Gini coefficients for the Gender, Alignment, and Race columns and for binned Height and Weight values. This cell reuses the `gini_coefficient` function defined in Listing 2-6.

In [None]:
# Load the dataset
df = pd.read_csv(SUPERHEROES_INFO_CLEAN_URL)

# Drop rows with missing values in key columns
df = df.dropna(subset=['Gender', 'Alignment', 'Race', 'Height', 'Weight'])

# Analyze imbalance across categorical columns
categories = ['Gender', 'Alignment', 'Race']

gini_results = {}
for category in categories:
    counts = df[category].value_counts()
    gini_results[category] = gini_coefficient(counts.values)

# Analyze imbalance for discretized height and weight
df['Height_bins'] = pd.cut(df['Height'], bins=5)
df['Weight_bins'] = pd.cut(df['Weight'], bins=5)

gini_results['Height'] = gini_coefficient(df['Height_bins'].value_counts().values)
gini_results['Weight'] = gini_coefficient(df['Weight_bins'].value_counts().values)

# Print Gini coefficients for each category
print("Gini Coefficients for Dataset Imbalances:")
for category, gini_score in gini_results.items():
    print(f"{category}: {gini_score:.3f}")

### Listing 2-8: Superhero Dataset Merge Analysis
Analyzes the compatibility of the cleansed superheroes info dataset and the superheroes powers dataset by merging them on the `hero_names` field and calculating the match percentage, a measure of **feature integration depth**.

In [None]:
# Import pandas
import pandas as pd

# Load the datasets from the URLs
info_df = pd.read_csv(SUPERHEROES_INFO_CLEAN_URL)
powers_df = pd.read_csv(SUPERHEROES_POWERS_URL)

# Rename 'name' in info_df to 'hero_names' for consistent merging
info_df.rename(columns={'name': 'hero_names'}, inplace=True)

# Merge the datasets on 'hero_names'
merged_df = pd.merge(info_df, powers_df, on='hero_names', how='inner')

# Calculate and display the total number of matched entries and the percentage match
matched_count = merged_df.shape[0]
total_info_count = info_df.shape[0]
percentage_matched = (matched_count / total_info_count) * 100

print(f"Matched entries: {matched_count}")
print(f"Total entries in Info dataset: {total_info_count}")
print(f"Percentage matched: {percentage_matched:.2f}%")


### Listing 2-9: Hero Power Metrics Enhancement
Calculates Offensive Power Rating (OPR) and Strategic Defense Rating (SDR) for each hero by summing relevant power attributes, enriching the dataset for enhanced power analysis.

In [None]:
import pandas as pd

# Load the datasets
info_dfs = pd.read_csv(SUPERHEROES_INFO_CLEAN_URL)
powers_dfs = pd.read_csv(SUPERHEROES_POWERS_URL)

# Ensure consistent naming by renaming 'hero_names' to 'name' in powers_dfs
powers_dfs.rename(columns={'hero_names': 'name'}, inplace=True)

# Define power attributes for offensive and defensive power calculations
OFFENSIVE_POWERS = [
    'Super Strength', 'Energy Blasts', 'Weapons Master', 'Marksmanship',
    'Magic', 'Telekinesis', 'Cryokinesis', 'Fire Control',
    'Power Augmentation', 'Animal Oriented Powers'
]

DEFENSIVE_POWERS = [
    'Durability', 'Invulnerability', 'Force Fields', 'Energy Absorption',
    'Regeneration', 'Immortality', 'Camouflage', 'Phasing',
    'Enhanced Senses', 'Teleportation'
]

# Calculate OPR and SDR in the powers dataset
powers_dfs['OPR'] = powers_dfs[OFFENSIVE_POWERS].sum(axis=1)
powers_dfs['SDR'] = powers_dfs[DEFENSIVE_POWERS].sum(axis=1)

# Keep only necessary columns (name, OPR, SDR)
powers_ratings = powers_dfs[['name', 'OPR', 'SDR']]

# Merge the calculated fields into the info dataset
info_with_ratings = pd.merge(info_dfs, powers_ratings, on='name', how='left')

# Save the updated info dataset with OPR and SDR
info_with_ratings.to_csv(INFO_POWERS_FILE, index=False)
print(f"Updated dataset saved as {INFO_POWERS_FILE} with OPR and SDR.")

### Listing 2-10: Generate Story Plot Dataset  
This code generates superhero plot summaries using randomized archetypes and a Hugging Face language model, then saves the results to a CSV file for analysis or reuse.


In [None]:
import os
import re
import pandas as pd
import random
from huggingface_hub import InferenceClient
from langchain_core.prompts import ChatPromptTemplate

# Configuration
TEMP = 0.4
NUM_PLOT_SAMPLES = 5
# Local output file (note: this reassigns PLOTS_FILE from Listing 2-1)
PLOTS_FILE = "superhero_plots.csv"

# Initialize Hugging Face inference client
client = InferenceClient(model=DEFAULT_MODEL)

# Define the LangChain prompt template
prompt = ChatPromptTemplate.from_messages([
    ('system', 'Generate a concise superhero plot where:'),
    ('human', '''
    - Hero archetype: {hero}
    - Villain archetype: {villain}
    - Conflict type: {conflict_type}
    - Setting: {setting}
    Include a title and three key plot points.
    ''')
])

# Archetype components
heroes = [
    "reluctant hero", "outcast", "chosen one", "antihero", "reluctant mentor",
    "alien protector", "cyber-enhanced rebel", "fallen warrior", "mystic sage",
    "time-traveling guardian", "reformed villain", "empathic healer",
    "street-level hero", "artificial intelligence (AI)", "mystical guardian",
    "scientific genius", "nature warrior", "cosmic nomad", "avenger of loss",
    "dimension traveler", "energy manipulator"
]

villains = [
    "mastermind", "alien invader", "corrupt hero", "rogue AI", "dark sorcerer",
    "doppelgänger", "genetic experiment", "shadow manipulator", "void entity",
    "cyber overlord", "time manipulator", "mythical creature", "corrupt politician",
    "revenge-seeking nemesis", "intergalactic tyrant", "poisonous rival",
    "mind controller", "techno-terrorist", "corrupted scientist", "necromancer",
    "cosmic parasite"
]

conflicts = [
    "personal vendetta", "cosmic invasion", "technological catastrophe", "race against time",
    "internal struggle", "resource battle", "power corruption", "revenge plot",
    "city under siege", "reality manipulation", "identity crisis", "magic vs. science",
    "clash of ideals", "control of the mind", "forbidden knowledge", "mutant uprising",
    "alien invasion", "technological sabotage", "environmental disaster",
    "lost relic chase", "temporal war"
]

settings = [
    "futuristic city", "ancient ruins", "space station", "post-apocalyptic wasteland",
    "underwater fortress", "cyberpunk metropolis", "forest of lost souls",
    "deserted asylum", "alien planet", "hidden temple", "high-tech lab",
    "floating island", "deep jungle", "mystic caves", "parallel universe",
    "underground society", "cloud city", "dreamscape", "hollow mountain",
    "time-frozen world", "ruins of civilization"
]

# Generate plots
plot_samples = []
for i in range(NUM_PLOT_SAMPLES):
    hero = random.choice(heroes)
    villain = random.choice(villains)
    conflict = random.choice(conflicts)
    setting = random.choice(settings)

    # Format prompt
    formatted_prompt = prompt.format(
        hero=hero,
        villain=villain,
        conflict_type=conflict,
        setting=setting
    )

    # Get response from model
    response_obj = client.chat.completions.create(
        model=DEFAULT_MODEL,
        messages=[{"role": "user", "content": formatted_prompt}],
        max_tokens=250,
        temperature=TEMP
    )

    response = response_obj.choices[0].message.content
    response_cleaned = response.replace("\n", " | ")

    plot_samples.append({
        "Hero": hero,
        "Villain": villain,
        "Conflict Type": conflict,
        "Setting": setting,
        "Plot": response_cleaned
    })

    print(f"\nPlot {i + 1}:")
    print(f"Hero: {hero}")
    print(f"Villain: {villain}")
    print(f"Conflict Type: {conflict}")
    print(f"Setting: {setting}")
    print(f"Plot:\n{response}")

# Save results
df = pd.DataFrame(plot_samples)
df.to_csv(PLOTS_FILE, index=False)
print(f"\nPlot samples saved to {PLOTS_FILE}.")

### Listing 2-11: Comic Story Assistant with RAG
Showcases a Retrieval-Augmented Generation (RAG) approach that uses ChromaDB with hero plot and power data to dynamically generate tailored superhero story arcs.

In [None]:
%pip install -q chromadb

In [None]:
# Import necessary libraries
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.vectorstores import Chroma
import pandas as pd
import random

# Load datasets from URLs
plot_data = pd.read_csv(SUPERHEROES_INFO_PLOTS_URL)
hero_data = pd.read_csv(SUPERHEROES_INFO_POWERS_URL)

# Prepare plot data for embeddings
plot_texts = plot_data['Plot'].tolist()

# Initialize embeddings model and create Chroma vector database for plot data
embeddings = OpenAIEmbeddings()
db = Chroma.from_texts(plot_texts, embeddings)

# Define query from the comic writer
query_text = ("Show me a story where a time-traveling hero faces a scientific "
              "mastermind in a futuristic city.")

# Retrieve the most relevant plot based on similarity
results = db.similarity_search(query_text, 1)
selected_plot = results[0].page_content

# Randomly select hero and villain from hero data (filtered by alignment)
hero_row = hero_data[hero_data['Alignment'] == 'good'].sample(1).iloc[0]
villain_row = hero_data[hero_data['Alignment'] == 'bad'].sample(1).iloc[0]

# Extract hero and villain details for prompt
hero_name, hero_opr, hero_sdr = hero_row['name'], hero_row['OPR'], hero_row['SDR']
villain_name, villain_opr, villain_sdr = villain_row['name'], villain_row['OPR'], villain_row['SDR']

# Define prompt template for story creation
prompt = ChatPromptTemplate.from_messages([
    ("system", "Generate a superhero story based on the following details:"),
    ("human", '''
    Plot Outline: {plot}

    Hero: {hero} with an offense power of {hero_opr}x and defense of {hero_sdr}x.
    Villain: {villain} with an offense power of {villain_opr}x and defense
    of {villain_sdr}x.

    Rewrite the story with {hero} as the main hero and {villain} as the main villain.
    Build tension, introduce challenges, and create a climactic final showdown.
    ''')
])

print(f"""Hero: {hero_name} with OPR of {hero_opr}x and defense of {hero_sdr}x.
Villain: {villain_name} with OPR of {villain_opr}x and defense of {villain_sdr}x.
""")

# Set up LangChain LLM for generating story arcs
llm = ChatOpenAI(temperature=0.2)
chain = prompt | llm

# Invoke the LLM with structured prompt
response = chain.invoke({
    "plot": selected_plot,
    "hero": hero_name,
    "villain": villain_name,
    "hero_opr": hero_opr,
    "hero_sdr": hero_sdr,
    "villain_opr": villain_opr,
    "villain_sdr": villain_sdr
})

# Display the generated story arc
print("Generated Story Arc:")
print(response.content)
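
The retrieval step is the "R" in RAG: the query and every stored plot are turned into vectors, and the plot whose vector lies closest to the query is returned. The sketch below imitates that idea with word counts and cosine similarity, using only the standard library. It is a toy stand-in and not how OpenAI embeddings or ChromaDB work internally. The plot snippets, the query wording, and the function names are assumptions for illustration.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical plot snippets, for illustration only
plots = [
    "A time-traveling guardian battles a scientific genius in a futuristic city.",
    "A nature warrior defends the deep jungle from a dark sorcerer.",
    "An outcast confronts a rogue AI aboard a space station.",
]
query = "time-traveling hero faces a scientific mastermind in a futuristic city"

# Score every plot against the query and keep the closest one (k = 1)
query_vec = bag_of_words(query)
scores = [cosine_similarity(query_vec, bag_of_words(p)) for p in plots]
best_index = max(range(len(plots)), key=scores.__getitem__)

print(plots[best_index])
```

Word counts only match exact terms, whereas learned embeddings can also match related wording such as "mastermind" and "genius". That difference is the main reason the listing uses an embedding model.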



### Listing 2-12: Pseudonymizing Superhero Plots Using SpaCy
Demonstrates pseudonymization of entities within superhero plots, using spaCy to replace names, organizations, and locations with generic terms for privacy. **Note:** Be sure to run the *pip* install before running the code snippet below.

In [None]:
%pip install -q spacy

In [None]:
import spacy
import pandas as pd

# Load spaCy's English model for entity recognition
# (if it is not installed, run: !python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Example dataset with superhero story plots
plot_data = pd.read_csv(SUPERHEROES_INFO_PLOTS_URL)

# Function to pseudonymize entity names in the plot text
def pseudonymize_entities(text):
    doc = nlp(text)
    for ent in doc.ents:
        # Replace detected entities with generic labels
        if ent.label_ == "PERSON":
            text = text.replace(ent.text, 'Hero A')
        elif ent.label_ == "ORG":
            text = text.replace(ent.text, 'Organization X')
        elif ent.label_ == "GPE":
            text = text.replace(ent.text, 'Location Z')
    return text

# Pseudonymize only the first plot
first_plot = plot_data['Plot'].iloc[0]
pseudonymized_first_plot = pseudonymize_entities(first_plot)

# Display the pseudonymized first plot
print("Pseudonymized Plot:")
print(pseudonymized_first_plot)
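
The function above maps every person to the same label, so two different characters both become "Hero A" and the story can get hard to follow. One possible refinement, sketched here under assumptions, gives each distinct entity its own numbered placeholder. To keep the sketch runnable without spaCy, the entities are passed in as `(text, label)` pairs, which could be built from `doc.ents`. The function name, the placeholder scheme, and the sample sentence are all hypothetical.

```python
import re

# Placeholder prefixes per entity label (hypothetical naming scheme)
PREFIXES = {"PERSON": "Person", "ORG": "Organization", "GPE": "Location"}

def pseudonymize(text, entities):
    """Replace each distinct entity with a stable, numbered placeholder.

    `entities` is a list of (entity_text, label) pairs, for example
    [(ent.text, ent.label_) for ent in doc.ents].
    """
    mapping = {}
    counters = {}
    for ent_text, label in entities:
        if label in PREFIXES and ent_text not in mapping:
            counters[label] = counters.get(label, 0) + 1
            mapping[ent_text] = f"{PREFIXES[label]} {counters[label]}"
    # Replace longer names first so a full name is handled before any shorter one
    for ent_text in sorted(mapping, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(ent_text)}\b", mapping[ent_text], text)
    return text, mapping

sample = "Tony Stark met Bruce Banner in New York. Later, Tony Stark left New York."
entities = [
    ("Tony Stark", "PERSON"), ("Bruce Banner", "PERSON"), ("New York", "GPE"),
    ("Tony Stark", "PERSON"), ("New York", "GPE"),
]

masked, mapping = pseudonymize(sample, entities)
print(masked)
```

Keeping `mapping` in a secure place is what makes this pseudonymization rather than anonymization: whoever holds the mapping can restore the original names.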

### Listing 2-13: Data Masking And Differential Privacy
Demonstrates data masking and differential privacy on health records by masking phone numbers and adding noise to age values.

In [None]:
import pandas as pd
import numpy as np

# Sample dataset: health records
data = pd.DataFrame({
    'patient_id': ['A123', 'B456', 'C789'],
    'phone': ['123-456-7890', '987-654-3210', '555-123-4567'],
    'age': [29, 47, 35],
    'diagnosis': ['Condition A', 'Condition B', 'Condition A']
})

# Data Masking: Mask all but the last four digits of phone numbers
data['masked_phone'] = data['phone'].apply(lambda x: 'XXX-XXX-' + x[-4:])

# Differential Privacy: Add Laplace noise to age values for anonymization
noise_level = 2  # Scale of the Laplace distribution; larger values add more noise
data['age_noisy'] = data['age'] + np.random.laplace(0, noise_level, len(data))

# Display the modified dataset
print("Anonymized Data:\n", data[['patient_id', 'masked_phone',
                                 'age_noisy', 'diagnosis']])
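
In the Laplace mechanism, the noise scale is usually derived from two quantities: the sensitivity of the value being protected and a privacy budget epsilon, with scale = sensitivity / epsilon. The standard-library sketch below shows that relationship. The `sensitivity` and `epsilon` values are assumptions for illustration and do not come from the notebook. A real analysis has to derive the sensitivity from the query or the range of the values.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def privatize(values, sensitivity, epsilon, rng):
    """Add Laplace noise with scale = sensitivity / epsilon to each value."""
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]

rng = random.Random(42)  # Seeded only to make the sketch repeatable
ages = [29, 47, 35]

# Assumed settings: smaller epsilon means stronger privacy and more noise
for epsilon in (0.5, 5.0):
    noisy = privatize(ages, sensitivity=1.0, epsilon=epsilon, rng=rng)
    print(f"epsilon={epsilon}: {[round(a, 1) for a in noisy]}")
```

With these assumed settings, `noise_level = 2` in the listing corresponds to epsilon = 0.5 at a sensitivity of 1.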

### Listing 2-14: Encrypting Sensitive Data
This code encrypts a sensitive dataset with **Fernet**, allowing secure decryption and access for authorized users only.
**Note:** Be sure to run the following *pip install* first.

In [None]:
%pip -q install cryptography

In [None]:
# Import Fernet
from cryptography.fernet import Fernet
import pandas as pd
import io  # Import io for StringIO

# Generate encryption key and create a cipher suite
key = Fernet.generate_key()
cipher_suite = Fernet(key)

# Sample sensitive data to be encrypted
data = pd.DataFrame({"Patient": ["John Doe", "Jane Smith"],
                     "Diagnosis": ["Diabetes", "Hypertension"]})
data_str = data.to_csv(index=False)
encrypted_data = cipher_suite.encrypt(data_str.encode())

# Show part of the encrypted string
print("Encrypted Data (partial):", encrypted_data[:50], "...")

# Decrypt the data when access is needed
decrypted_data_str = cipher_suite.decrypt(encrypted_data).decode()
secure_data = pd.read_csv(io.StringIO(decrypted_data_str))

print("\nDecrypted Data Accessible (Only by Authorized Users)\n", secure_data)

### Listing 2-15: Generating Synthetic Health Records
Uses **Faker** to create fictional health records, providing a safe, realistic dataset structure for AI training or testing applications. **Note**: Be sure to run the following pip install.

In [None]:
%pip -q install Faker

In [None]:
from faker import Faker
import pandas as pd

# Initialize Faker for synthetic data generation
fake = Faker()

# Generate a synthetic health records dataset
data = pd.DataFrame({
    'Patient_ID': [fake.uuid4() for _ in range(5)],
    'Name': [fake.name() for _ in range(5)],
    'Age': [fake.random_int(min=18, max=90) for _ in range(5)],
    'Diagnosis': [fake.random_element(elements=('Condition A',
                                               'Condition B',
                                               'Condition C'))
                   for _ in range(5)],
    'Phone': [fake.phone_number() for _ in range(5)],
    'Address': [fake.address().replace("\n", ", ") for _ in range(5)],
    'Last_Visit': [fake.date_between(start_date='-2y', end_date='today')
                   for _ in range(5)]
})

# Display the synthetic dataset
print(data)