# Synthetic Campaign Data Generation

#### Step 1: Importing Required Packages

Import two foundational Python libraries for data manipulation and numerical computing:
- pandas
- NumPy

Together, these libraries form the data preprocessing backbone for building the synthetic “campaigns” dataset used in the HCP hybrid recommender system.

In [1]:
# Importing required libraries

import pandas as pd
import numpy as np

#### Step 2: Loading Scientific Content Data

This step loads the scientific content topics dataset, which contains text fields and topic model outputs derived from scientific articles or educational content for HCPs (healthcare professionals).


In [2]:
# loading scientific content topics data

scientific_content = pd.read_csv('C:/Users/bhand/Desktop/Data Science - My Collection/Deep Learning Project - 1/Data/Data for Recommender System/HCP_Hybrid_Recommendation_System/data/scientific_content.csv')
scientific_content.head()

Unnamed: 0,PMID,Title,Abstract,topic_ids,topic_keywords,content_id,source,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,specialty
0,40887509,Sodium glucose co-transporter 2 inhibitor-asso...,Perioperative euglycaemic diabetic ketoacidosi...,9,"group, control, risk, cardiovascular, disease,...",C0000,PubMed,0.0004,0.000584,0.205027,0.000507,0.000541,0.129902,0.187374,0.000603,0.000469,0.474591,Cardiology
1,40886230,Can Dual Incretin Receptor Agonists Exert Bett...,Despite advances in cardiovascular risk reduct...,5,"management, diabetes, glucose, risk, clinical,...",C0001,PubMed,0.00044,0.000642,0.000446,0.000558,0.000595,0.994736,0.00065,0.000663,0.000516,0.000755,General Practice
2,40885915,The association between diabetes management se...,Self-efficacy emerges as a crucial element tha...,1,"health, care, healthcare, self, model, include...",C0002,PubMed,0.001692,0.53226,0.448801,0.002145,0.002287,0.002878,0.002498,0.002551,0.001984,0.002904,Family Medicine
3,40884731,Intrinsic Motivation Moderates the Effect of F...,Few studies have examined effects of intrinsic...,0,"thyroid, disorder, woman, hormone, disease, hy...",C0003,PubMed,0.546629,0.271952,0.000572,0.000716,0.000763,0.00096,0.175925,0.000851,0.000662,0.000969,Endocrinology
4,40877913,Inhibitory effects of the flavonoids extracted...,"Pollen Typhae (PT), a traditional Chinese medi...",7,"disease, therapy, ckd, treatment, cell, risk, ...",C0004,PubMed,0.209126,0.004637,0.003217,0.004026,0.004293,0.366418,0.004689,0.394418,0.003725,0.005451,General Practice
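
A quick sanity check on the topic columns can catch a malformed load early. The sketch below rebuilds the first three preview rows by hand and tests two properties: that the ten topic weights in each row sum to roughly 1, and that topic_ids points at the column with the largest weight. The second property is an assumption inferred from the preview, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Toy frame built from the first three rows of the preview above.
topic_cols = [f"topic_{i}" for i in range(10)]
sample = pd.DataFrame(
    [
        [0.0004, 0.000584, 0.205027, 0.000507, 0.000541,
         0.129902, 0.187374, 0.000603, 0.000469, 0.474591],
        [0.00044, 0.000642, 0.000446, 0.000558, 0.000595,
         0.994736, 0.00065, 0.000663, 0.000516, 0.000755],
        [0.001692, 0.53226, 0.448801, 0.002145, 0.002287,
         0.002878, 0.002498, 0.002551, 0.001984, 0.002904],
    ],
    columns=topic_cols,
)
sample["topic_ids"] = [9, 5, 1]

# Each row of topic weights should sum to roughly 1
# (rounding in the preview leaves a small gap, hence the tolerance).
row_sums = sample[topic_cols].sum(axis=1)
sums_ok = np.isclose(row_sums, 1.0, atol=1e-3)

# Assumption: topic_ids is the index of the largest topic weight.
argmax_topic = sample[topic_cols].to_numpy().argmax(axis=1)
dominant_ok = argmax_topic == sample["topic_ids"].to_numpy()

print(sums_ok.all(), dominant_ok.all())
```

On the full dataset, the same two checks could be run against scientific_content directly.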


#### Step 3: Defining Channel Types

This list defines the communication or engagement channels through which HCPs receive or interact with scientific content.

These channels simulate the multi-touchpoint marketing ecosystem in pharmaceutical communications:

- **Email**: Digital marketing campaigns.
- **Webinar**: Virtual educational sessions.
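- **Rep Visit** and **In-person**: Face-to-face engagement by medical representatives. The code below lists these as two separate channel values (rep_visit and in-person), giving five channels in total.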
- **Social Event**: Conference or community event interactions.

Channel information acts as a categorical feature, later used for **multi-modal learning or recommendation personalization**.

In [3]:
# Type of channels

channels = ["email", "webinar", "rep_visit", "in-person", "social-event"]
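
One common way to use a channel as a categorical feature is one-hot encoding. The sketch below is only an illustration of that idea, not the encoding used later in the project. The channel assignments in it are hypothetical.

```python
import pandas as pd

channels = ["email", "webinar", "rep_visit", "in-person", "social-event"]

# Hypothetical channel assignments, for illustration only.
demo = pd.DataFrame(
    {"campaign_channel": ["email", "webinar", "email", "social-event"]}
)

# Declaring the full category list keeps a column for every channel,
# including those that do not occur in this small sample.
demo["campaign_channel"] = pd.Categorical(
    demo["campaign_channel"], categories=channels
)
channel_features = pd.get_dummies(demo["campaign_channel"], prefix="channel")

print(channel_features.columns.tolist())
```

Fixing the category list up front keeps the feature width stable across train and inference data, even when a batch happens to miss a channel.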

#### Step 4: Creating Synthetic Campaign Dataset

In this step, campaign-level data is synthesized by combining the original scientific content with simulated channel assignments.

- Each campaign_id identifies a **unique marketing campaign** (e.g., “CMP_102”).
- **campaign_channel** is drawn uniformly at random from the channels list with np.random.choice to mimic a spread across digital and physical channels. No random seed is set, so the assignments change on every run.
- content carries the topic_keywords text, and dominant_topic carries topic_ids from the scientific_content dataset.
- Topic probabilities (topic_0–topic_9) are inherited from the scientific_content dataset, so campaigns retain the **same semantic and topical structure** as their base content.

📊 **Analytical Intent**:
This structure allows future **predictive modeling or recommendation systems** to:

- Learn which types of content perform best per channel.
- Evaluate engagement patterns per specialty and topic distribution.
- Build **multi-dimensional feature embeddings** for **hybrid recommendation models**.
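
As a rough illustration of the aggregations listed above, the sketch below computes a per-channel topic profile and a specialty-by-channel count table on a hypothetical miniature table. Because channels in this notebook are assigned at random, such summaries show only the shape of the analysis, not a real signal.

```python
import pandas as pd

# Hypothetical miniature campaigns table; column names follow this notebook.
toy = pd.DataFrame({
    "campaign_channel": ["email", "email", "webinar", "webinar"],
    "specialty": ["Cardiology", "Cardiology", "Endocrinology", "Cardiology"],
    "topic_0": [0.2, 0.4, 0.6, 0.1],
    "topic_1": [0.8, 0.6, 0.4, 0.9],
})

# Average topic weight per channel: a simple per-channel content profile.
channel_profile = toy.groupby("campaign_channel")[["topic_0", "topic_1"]].mean()

# Number of campaigns for each (specialty, channel) pair.
counts = pd.crosstab(toy["specialty"], toy["campaign_channel"])

print(channel_profile)
print(counts)
```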

In [4]:
# Build one campaign per content row; channels are assigned at random
campaigns = pd.DataFrame({
    "campaign_id": [f"CMP_{i}" for i in range(1, len(scientific_content) + 1)],
    "campaign_channel": np.random.choice(channels, len(scientific_content)),
    "content_id": scientific_content["content_id"],
    "content": scientific_content["topic_keywords"],
    "dominant_topic": scientific_content["topic_ids"],
    # Copy the ten topic-probability columns unchanged
    **{f"topic_{i}": scientific_content[f"topic_{i}"] for i in range(10)},
    "specialty": scientific_content["specialty"],
})

campaigns.head()

Unnamed: 0,campaign_id,campaign_channel,content_id,content,dominant_topic,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,specialty
0,CMP_1,email,C0000,"group, control, risk, cardiovascular, disease,...",9,0.0004,0.000584,0.205027,0.000507,0.000541,0.129902,0.187374,0.000603,0.000469,0.474591,Cardiology
1,CMP_2,email,C0001,"management, diabetes, glucose, risk, clinical,...",5,0.00044,0.000642,0.000446,0.000558,0.000595,0.994736,0.00065,0.000663,0.000516,0.000755,General Practice
2,CMP_3,webinar,C0002,"health, care, healthcare, self, model, include...",1,0.001692,0.53226,0.448801,0.002145,0.002287,0.002878,0.002498,0.002551,0.001984,0.002904,Family Medicine
3,CMP_4,social-event,C0003,"thyroid, disorder, woman, hormone, disease, hy...",0,0.546629,0.271952,0.000572,0.000716,0.000763,0.00096,0.175925,0.000851,0.000662,0.000969,Endocrinology
4,CMP_5,social-event,C0004,"disease, therapy, ckd, treatment, cell, risk, ...",7,0.209126,0.004637,0.003217,0.004026,0.004293,0.366418,0.004689,0.394418,0.003725,0.005451,General Practice
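
The channel draw can be made repeatable with a seeded NumPy generator. The sketch below shows one way to do that. The seed value 42 and the name n_campaigns are arbitrary choices for illustration, and a seeded draw would not reproduce the specific assignments shown in the preview above.

```python
import numpy as np

channels = ["email", "webinar", "rep_visit", "in-person", "social-event"]
n_campaigns = 5  # stand-in for len(scientific_content)

# A seeded Generator makes the draw repeatable; the seed value is arbitrary.
rng = np.random.default_rng(42)
first_draw = rng.choice(channels, size=n_campaigns)

# Re-creating the generator with the same seed yields the same assignment.
second_draw = np.random.default_rng(42).choice(channels, size=n_campaigns)

print((first_draw == second_draw).all())
```

Passing a p argument to choice would allow non-uniform channel weights if a more realistic channel mix were known.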


In [5]:
# removing commas from content (topic_keywords)

campaigns['content'] = campaigns['content'].str.replace(',', '')
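
To see what this replacement does, the sketch below applies it to two short keyword strings taken from the start of the preview rows. It passes regex=False to make the literal match explicit. That argument is an addition here and is not part of the cell above.

```python
import pandas as pd

keywords = pd.Series(["group, control, risk", "management, diabetes, glucose"])

# Literal (non-regex) replacement: only the commas are removed,
# so the space after each comma remains as the word separator.
cleaned = keywords.str.replace(",", "", regex=False)

print(cleaned.tolist())  # ['group control risk', 'management diabetes glucose']
```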

#### Step 5: Exporting the Final Campaign Dataset

The enriched and cleaned dataset is saved as campaigns_with_content.csv for reuse in subsequent modeling workflows.

By storing campaign-level data that integrates channel, topic, and specialty, we effectively create a multi-view dataset.

This dataset bridges **content semantics (via topic modeling)** with **marketing structure** (via campaign-channel linkage), forming the foundation for **personalized scientific engagement analysis**.

In [6]:
# Exporting as csv file

campaigns.to_csv("campaigns_with_content.csv", index = False)
print("Saved Campaigns with linked content -> campaigns_with_content")

# View the campaigns_with_content dataset
campaigns.head()

Saved Campaigns with linked content -> campaigns_with_content


Unnamed: 0,campaign_id,campaign_channel,content_id,content,dominant_topic,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,specialty
0,CMP_1,email,C0000,group control risk cardiovascular disease anal...,9,0.0004,0.000584,0.205027,0.000507,0.000541,0.129902,0.187374,0.000603,0.000469,0.474591,Cardiology
1,CMP_2,email,C0001,management diabetes glucose risk clinical heal...,5,0.00044,0.000642,0.000446,0.000558,0.000595,0.994736,0.00065,0.000663,0.000516,0.000755,General Practice
2,CMP_3,webinar,C0002,health care healthcare self model include fact...,1,0.001692,0.53226,0.448801,0.002145,0.002287,0.002878,0.002498,0.002551,0.001984,0.002904,Family Medicine
3,CMP_4,social-event,C0003,thyroid disorder woman hormone disease hypothy...,0,0.546629,0.271952,0.000572,0.000716,0.000763,0.00096,0.175925,0.000851,0.000662,0.000969,Endocrinology
4,CMP_5,social-event,C0004,disease therapy ckd treatment cell risk inflam...,7,0.209126,0.004637,0.003217,0.004026,0.004293,0.366418,0.004689,0.394418,0.003725,0.005451,General Practice
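
A round-trip read is a cheap way to confirm the export. The sketch below writes a hypothetical two-row table to a temporary directory and reads it back, checking that the columns and row count survive. With index=False, no extra index column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for the campaigns table.
toy = pd.DataFrame({
    "campaign_id": ["CMP_1", "CMP_2"],
    "campaign_channel": ["email", "webinar"],
    "topic_0": [0.0004, 0.00044],
})

# Write to a temporary directory so the real export is not overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "campaigns_with_content.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

same_columns = list(reloaded.columns) == list(toy.columns)
same_rows = len(reloaded) == len(toy)
print(same_columns, same_rows)
```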
