# Synthetic HCP-Campaign Interaction Data Generation

#### Step 1: Importing Packages

Start by importing foundational libraries used for **data simulation and manipulation**:

- **Pandas** → for tabular data structures and I/O operations.
- **NumPy** → for efficient numerical computations and random sampling.
- **random** → for non-vectorized but flexible random sampling (used for realistic variations).
- **uuid** → for generating unique identifiers if required in downstream extensions.
- **datetime, timedelta** → for simulating temporal interaction events over a rolling 90-day period.

From a data generation standpoint, these libraries help us replicate **real-world stochastic behavior** observed in **HCP engagement patterns**.

In [1]:
# importing required libraries

import pandas as pd
import numpy as np
import random
import uuid
from datetime import datetime, timedelta

#### Step 2: Loading Source Datasets

Load the two preprocessed datasets:

- *hcp_master* → Contains healthcare professional (HCP) metadata such as hcp_id, specialty, and demographic/behavioral attributes.
- *campaigns_with_content* → Contains scientific content campaigns with topic-model-derived features, specialty alignment, and engagement channels.

These two datasets serve as **base entities** for simulating interactions — analogous to **users and items** in a recommender system context.

In [2]:
# Loading hcp and campaign datasets

hcp_master = pd.read_csv("C:/Users/bhand/Desktop/Data Science - My Collection/Deep Learning Project - 1/Data/Data for Recommender System/HCP_Hybrid_Recommendation_System/data/hcp_master.csv")
campaigns = pd.read_csv("C:/Users/bhand/Desktop/Data Science - My Collection/Deep Learning Project - 1/Data/Data for Recommender System/HCP_Hybrid_Recommendation_System/data/campaigns_with_content.csv")

# Normalize the column name so both datasets use "specialty"
hcp_master.rename(columns={"speciality": "specialty"}, inplace=True)

# View two datasets
print(hcp_master.head())
print(campaigns.head())

       hcp_id  Entity Type Code   last_name first_name credential  \
0  1679576722                 1       WIEBE      DAVID       M.D.   
1  1588667638                 1     PILCHER    WILLIAM         MD   
2  1215930367                 1     GRESSOT    LAURENT       M.D.   
3  1932102084                 1  ADUSUMILLI       RAVI         MD   
4  1841293990                 1    WORTSMAN      SUSAN     MA-CCC   

           city state        zip taxonomy_code   specialty  
0       KEARNEY    NE  688482168    207X00000X       Other  
1  JACKSONVILLE    FL  322044736    207RC0000X  Cardiology  
2       HOUSTON    TX  770901243    174400000X       Other  
3        TOLEDO    OH  436151753    207RC0000X  Cardiology  
4     HARTSDALE    NY  105303455    231H00000X       Other  
  campaign_id campaign_channel content_id  \
0       CMP_1        in-person      C0000   
1       CMP_2        rep_visit      C0001   
2       CMP_3     social-event      C0002   
3       CMP_4        rep_visit      C00

#### Step 3: Simulating Realistic HCP–Campaign Interactions and Engagement Behavior

In this section, a **synthetic interaction dataset** is created that mimics how healthcare professionals (HCPs) engage with scientific content campaigns across multiple channels.
Each record represents a single **HCP–campaign interaction event** enriched with behavioral, temporal, and contextual features.

1. **Iterating through HCPs and Selecting Campaigns**

    For every HCP in the master dataset, a subset of campaigns that match their medical specialty is identified.
    If no match exists, content is sampled from all available campaigns to maintain diversity.
    Each HCP interacts with a **random number of campaigns (5–12)** — replicating natural variability in content exposure frequency.

    Statistically, this process defines a **conditional sampling mechanism**:

                    P(content∣HCP specialty) > P(content)
                    
    reflecting that HCPs are more likely to receive and interact with content aligned with their clinical expertise.
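
The conditional sampling step can be sketched in isolation as follows. This is a minimal illustration rather than the notebook's actual cell: the toy campaign rows and the helper name `sample_campaigns_for_hcp` are assumptions made for the example, and only the `specialty` column name and the 5–12 range come from the text above.

```python
import numpy as np
import pandas as pd

# Toy campaign table (illustrative rows only, not the project data)
toy_campaigns = pd.DataFrame({
    "campaign_id": [f"CMP_{i}" for i in range(1, 9)],
    "specialty": ["Cardiology"] * 3 + ["Oncology"] * 3 + ["Family Medicine"] * 2,
})

def sample_campaigns_for_hcp(campaigns, hcp_specialty, rng, low=5, high=12):
    """Hypothetical helper: sample campaigns for one HCP, preferring specialty matches."""
    # Literal, case-insensitive substring match on the campaign specialty
    mask = campaigns["specialty"].str.contains(
        hcp_specialty, case=False, na=False, regex=False
    )
    # Fall back to the full table when no campaign matches the specialty
    pool = campaigns[mask] if mask.any() else campaigns
    # Draw 5-12 campaigns (inclusive), capped at the size of the pool
    n = min(len(pool), int(rng.randint(low, high + 1)))
    return pool.sample(n=n, random_state=rng)

rng = np.random.RandomState(0)
cardio_sample = sample_campaigns_for_hcp(toy_campaigns, "Cardiology", rng)
fallback_sample = sample_campaigns_for_hcp(toy_campaigns, "Dermatology", rng)
```

Here the cardiology HCP only sees cardiology campaigns, while the dermatology HCP, who has no matching campaigns in the toy table, draws from the full pool.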

---

2. **Simulating Temporal Engagement Patterns**:

    Each interaction is assigned a timestamp within the **past 90 days**. The hour is drawn from 6 AM to 10 PM only (no overnight activity), weighted by **realistic engagement probabilities**:

    - Morning and evening peaks (9–11 AM, 3–7 PM)
    - Lower engagement around lunchtime (12–2 PM) and late evening (8–10 PM)

    This yields a **non-uniform temporal distribution** that captures real-world attention rhythms — crucial for time-aware models such as **sequence-based recommenders** or **temporal decay analysis**.
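
As a rough check of the hour distribution, the sketch below draws a large sample and compares the empirical shares to the target probabilities. The weights are copied from the simulation cell in this notebook; the sample size, the seed, and the use of `np.random.default_rng` are choices made for this example only.

```python
import numpy as np

# Relative weights for hours 6..22 (6 AM to 10 PM); they need not sum to 1
hours = np.arange(6, 23)
weights = np.array([
    0.02, 0.05, 0.09, 0.11, 0.09, 0.07,   # 6 AM - 11 AM
    0.04, 0.03, 0.025,                    # 12 PM - 2 PM
    0.05, 0.07, 0.10, 0.09, 0.08,         # 3 PM - 7 PM
    0.05, 0.035, 0.025,                   # 8 PM - 10 PM
])
probs = weights / weights.sum()  # normalize into a valid probability vector

rng = np.random.default_rng(42)
sampled_hours = rng.choice(hours, size=100_000, p=probs)

# Empirical share of each hour, aligned with the `hours` array
counts = np.bincount(sampled_hours, minlength=23)[6:]
empirical = counts / counts.sum()
```

With a sample this large, `empirical` should track `probs` closely, and the 9 AM share should clearly exceed the 1 PM share.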

---

3. **Modeling Engagement Likelihood**:

    Engagement behavior is simulated using probabilistic rules based on specialty relevance:
    - **65%** chance of engagement when content matches the HCP’s specialty
    - **25%** otherwise

    If engaged, an HCP produces measurable interaction metrics:
    - **Clicks**: Random count (1–3) indicating active interaction
    - **Dwell Time**: Random duration (20–180 seconds) representing reading or viewing time
    - **Conversion**: Binary indicator (10% probability, conditional on engagement) signifying a high-value action such as form submission or follow-up request

    This probabilistic framework captures the **stochastic nature of user engagement** and the **domain relevance effect**, which are essential components in modeling **propensity-to-engage**.
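
The rules above can be sketched as a small function. The name `simulate_engagement`, the seed, and the sample size are assumptions for this illustration; the probabilities and value ranges are the ones listed above.

```python
import numpy as np

def simulate_engagement(is_match, rng):
    """Hypothetical helper: return (clicks, dwell_time, conversion) for one exposure."""
    p_engage = 0.65 if is_match else 0.25
    if rng.random() >= p_engage:
        return 0, 0, 0                         # content ignored
    clicks = int(rng.integers(1, 4))           # 1-3 clicks
    dwell_time = int(rng.integers(20, 181))    # 20-180 seconds
    conversion = int(rng.random() < 0.10)      # 10% chance, given engagement
    return clicks, dwell_time, conversion

rng = np.random.default_rng(7)
n = 50_000
matched = [simulate_engagement(True, rng) for _ in range(n)]
unmatched = [simulate_engagement(False, rng) for _ in range(n)]

# Share of exposures with at least one click
matched_rate = np.mean([clicks > 0 for clicks, _, _ in matched])
unmatched_rate = np.mean([clicks > 0 for clicks, _, _ in unmatched])
```

Because conversion is conditional on engagement, the unconditional conversion rate is about 0.65 × 0.10 = 6.5% for specialty-matched content and 0.25 × 0.10 = 2.5% otherwise.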

---

4. **Computing Engagement Score and Label**:

    Each interaction is transformed into quantifiable outcomes through a weighted engagement score:

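        Score = 0.4 × Clicks + 0.3 × ln(1 + Dwell Time) + 0.3 × Conversion
        
    This formula balances **activity (clicks)**, **attention duration (dwell time)**, and **conversion intent**, providing a continuous measure of engagement strength. The logarithm is the natural log (`np.log1p` in the code), which dampens the influence of very long dwell times.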

    A **binary engagement label** is also created:

    - 1 if there was any meaningful activity (at least one click, a dwell time above 30 seconds, or a conversion).
    - 0 if the content was ignored.

    This label simplifies behavioral outcomes for downstream **classification tasks**, such as predicting whether an HCP will engage with a given campaign.
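
A short worked example of the score and label, using the natural logarithm as in the simulation code. The function names `engagement_score` and `engagement_label` are illustrative and do not appear in the notebook.

```python
import math

def engagement_score(clicks, dwell_time, conversion):
    """Weighted score: 0.4*clicks + 0.3*ln(1 + dwell_time) + 0.3*conversion."""
    return round(0.4 * clicks + 0.3 * math.log1p(dwell_time) + 0.3 * conversion, 3)

def engagement_label(clicks, dwell_time, conversion):
    """1 for any click, dwell time above 30 seconds, or conversion; else 0."""
    return int(clicks > 0 or dwell_time > 30 or conversion == 1)

ignored = engagement_score(0, 0, 0)    # 0.0
light = engagement_score(1, 31, 0)     # 0.4 + 0.3 * ln(32), about 1.44
heavy = engagement_score(3, 180, 1)    # maximum under the simulated ranges

labels = [engagement_label(0, 0, 0), engagement_label(1, 31, 0)]
```

In the simulation every engaged interaction records at least one click, so the label coincides with the engaged flag. The score ranges from 0 for ignored content to just over 3 for three clicks, a 180-second dwell time, and a conversion.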

---

*The result is a realistic, multi-dimensional dataset representing HCP–campaign interactions over time. It reflects real-world dynamics such as*:

- Specialty-driven content targeting
- Daily engagement patterns
- Variable behavioral intensity

This data can be used to train models for **Next Best Content (NBC) prediction**, **propensity scoring**, or **channel effectiveness analysis** in pharmaceutical marketing.

In [3]:
# Building realistic interactions
interactions = []
interaction_id = 1
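# Seed both generators: NumPy drives np.random.* and DataFrame.sample,
# while the standard-library random module drives random.randint
np.random.seed(42)
random.seed(42)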

for _, hcp in hcp_master.iterrows():
    hcp_id = hcp["hcp_id"]
    hcp_specialty = hcp["specialty"]

    # Select subset of content (specialty match preferred)
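    # regex=False treats the specialty as a literal substring, consistent with the match check used for engagement
    matched_content = campaigns[campaigns["specialty"].str.contains(hcp_specialty, case=False, na=False, regex=False)]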
    if matched_content.empty:
        matched_content = campaigns

    sampled_content = matched_content.sample(min(len(matched_content), random.randint(5, 12)))

    for _, row in sampled_content.iterrows():
        content_id = row["content_id"]
        campaign_id = row["campaign_id"]
        campaign_channel = row["campaign_channel"]
                
        # Random timestamp (last 90 days, realistic engagement, no overnight)
        
        random_days = random.randint(1, 90)

        # Hour distribution (6 AM - 10 PM only)
        hour_probs = [
            0.02, 0.05, 0.09, 0.11, 0.09, 0.07,   # 6 AM - 11 AM (morning peak at 9-11)
            0.04, 0.03, 0.025,                    # 12 PM - 2 PM (lunch dip)
            0.05, 0.07, 0.10, 0.09, 0.08,         # 3 PM - 7 PM (evening peak)
            0.05, 0.035, 0.025                    # 8 PM - 10 PM (decline)
        ]
        allowed_hours = list(range(6, 23))  # 6 AM - 10 PM
        hour_probs = np.array(hour_probs) / np.sum(hour_probs)  # normalize
        hour = np.random.choice(allowed_hours, p=hour_probs)
        minute = random.randint(0, 59)
        second = random.randint(0, 59)

        timestamp = datetime.now() - timedelta(days=random_days)
        timestamp = timestamp.replace(hour=hour, minute=minute, second=second, microsecond=0)


        # Engagement probability
        engaged = np.random.rand() < (0.65 if hcp_specialty.lower() in str(row["specialty"]).lower() else 0.25)

        clicks = np.random.randint(1, 4) if engaged else 0
        dwell_time = np.random.randint(20, 181) if engaged else 0
        conversion = 1 if (engaged and np.random.rand() < 0.1) else 0

        # Engagement score
        score = 0.4*clicks + 0.3*np.log1p(dwell_time) + 0.3*conversion
        label = 1 if (clicks > 0 or dwell_time > 30 or conversion == 1) else 0

        interactions.append([
            interaction_id, hcp_id, hcp_specialty,
            content_id, row["specialty"], campaign_id, campaign_channel,
            timestamp.strftime("%Y-%m-%d %H:%M:%S"), clicks,
            dwell_time, conversion, round(score, 3), label
        ])
        interaction_id += 1

#### Step 4: Aggregating and Exporting the Final Dataset

All simulated interaction records are consolidated into a single DataFrame and saved as *hcp_campaign_interaction_data.csv*.

This dataset represents a **multi-relational behavioral log**, connecting:

- HCPs (users)
- Scientific content (items)
- Channels and Campaigns (context)
- Temporal and behavioral features (engagement metrics)

By combining conditional sampling, probabilistic engagement logic, and realistic temporal behavior, this process yields a behaviorally rich synthetic dataset.
It captures multiple dimensions of real-world interactions (**who engaged, with what, through which channel,** and **how strongly**), laying the groundwork for advanced analyses like:

- **Hybrid recommendation models (collaborative + content-based)**
- **Engagement propensity prediction**
- **Channel optimization and Next Best Action (NBA) strategies**

In [4]:
# Final dataframe
interaction_df = pd.DataFrame(interactions, columns=[
    "interaction_id", "hcp_id", "hcp_specialty",
    "content_id", "content_specialty", "campaign_id",
    "campaign_channel", "timestamp", "clicks", "dwell_time",
    "conversion", "engagement_score", "engagement_label"
])

interaction_df.to_csv("hcp_campaign_interaction_data.csv", index=False)
print(f"✅ Created {len(interaction_df)} rows → hcp_campaign_interaction_data.csv")

# View the dataset
interaction_df.head()

✅ Created 320881 rows → hcp_campaign_interaction_data.csv


|   | interaction_id | hcp_id | hcp_specialty | content_id | content_specialty | campaign_id | campaign_channel | timestamp | clicks | dwell_time | conversion | engagement_score | engagement_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1679576722 | Other | C0079 | General Practice | CMP_80 | email | 2025-09-01 16:00:08 | 0 | 0 | 0 | 0.0 | 0 |
| 1 | 2 | 1679576722 | Other | C0317 | Family Medicine | CMP_318 | email | 2025-07-30 17:23:38 | 0 | 0 | 0 | 0.0 | 0 |
| 2 | 3 | 1679576722 | Other | C0486 | Family Medicine | CMP_487 | webinar | 2025-07-18 15:14:16 | 0 | 0 | 0 | 0.0 | 0 |
| 3 | 4 | 1679576722 | Other | C0397 | General Practice | CMP_398 | email | 2025-08-11 11:50:27 | 0 | 0 | 0 | 0.0 | 0 |
| 4 | 5 | 1679576722 | Other | C0167 | Internal Medicine | CMP_168 | email | 2025-08-17 20:45:12 | 1 | 31 | 0 | 1.44 | 1 |
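
One possible sanity check on the exported data is to compare engagement rates for specialty-matched and unmatched rows. The sketch below uses a tiny hand-made frame with the same column names as the export; the rows and the `match_group` column are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows with the same column names as the exported file (values are illustrative)
toy = pd.DataFrame({
    "hcp_specialty": ["Cardiology", "Cardiology", "Other", "Other"],
    "content_specialty": ["Cardiology", "Cardiology", "Family Medicine", "General Practice"],
    "engagement_label": [1, 1, 0, 1],
})

# Flag rows where the HCP specialty appears in the content specialty (case-insensitive)
is_match = [
    hcp.lower() in content.lower()
    for hcp, content in zip(toy["hcp_specialty"], toy["content_specialty"])
]
toy["match_group"] = np.where(is_match, "matched", "unmatched")

# Mean of the binary label = engagement rate per group
rate_by_match = toy.groupby("match_group")["engagement_label"].mean()
```

Applied to the full file, the two rates should land near the 65% and 25% engagement probabilities used in the simulation.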
