# Synthetic Campaign Dataset Generation

The following code generates a realistic synthetic dataset of 5,000 customers for marketing campaign analysis. It includes customer demographics, campaign exposure, response behavior, and financial metrics. Key highlights:

- **Campaign logic**: Each customer has a 50% chance of receiving the campaign. Response probability starts from a 5% baseline, rises by 10 percentage points for campaign recipients, and shifts slightly with income and age.
- **Customer behavior**: Simulated spend for last month and this month, a binary response flag, and a high-value flag (income above 90,000 and tenure over 5 years).
- **Optional features**: Credit score, days since last purchase, and a customer segment assigned by KMeans clustering (4 clusters) on age and income.
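
The campaign logic can be made concrete with a small worked example. The sketch below mirrors the rules in the `generate_response` function from the Data Generation cell. The helper name `response_probability` and the sample customers are illustrative and not part of the notebook.

```python
from itertools import product


def response_probability(received_campaign: int, income: float, age: int) -> float:
    """Response probability implied by the rules in generate_response."""
    base_prob = 0.05
    uplift = 0.10 if received_campaign == 1 else 0.0
    income_factor = 0.02 if income > 80000 else -0.01
    age_factor = -0.01 if age < 25 else 0.01
    return base_prob + uplift + income_factor + age_factor


# Hypothetical customers on either side of each threshold
for campaign, income, age in product([0, 1], [60000, 100000], [22, 40]):
    p = response_probability(campaign, income, age)
    print(f"campaign={campaign} income={income} age={age} -> p={p:.2f}")
```

Under these rules the probability ranges from 0.03 (no campaign, income at or below 80,000, age under 25) to 0.18 (campaign, income above 80,000, age 25 or older).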

The resulting dataset will be used for exploratory analysis, segmentation, and uplift modeling.
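
Because campaign assignment is random, a first check before any uplift modeling could be the plain difference in response rates between recipients and non-recipients. The following is a minimal sketch, assuming the `received_campaign` and `responded` columns described above. The function name `naive_uplift` and the toy frame are hypothetical.

```python
import pandas as pd


def naive_uplift(df: pd.DataFrame) -> float:
    """Response rate of campaign recipients minus that of non-recipients."""
    rates = df.groupby("received_campaign")["responded"].mean()
    return float(rates.loc[1] - rates.loc[0])


# Tiny hand-made example (hypothetical values, not drawn from the dataset)
toy = pd.DataFrame({
    "received_campaign": [1, 1, 1, 1, 0, 0, 0, 0],
    "responded":         [1, 1, 0, 0, 1, 0, 0, 0],
})
print(naive_uplift(toy))  # 0.5 - 0.25 = 0.25
```

Applied to the generated data, this estimate should land near the 0.10 uplift built into the simulation, up to sampling noise.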

## Data Generation

In [21]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans

# Set random seed for reproducibility
np.random.seed(42)

# Number of samples
n = 5000

# Generate synthetic data
data = pd.DataFrame({
    'customer_id': [f'CUST{i:05d}' for i in range(1, n + 1)],
    'age': np.random.randint(18, 70, size=n),
    'gender': np.random.choice(['Male', 'man','Female', 'female','Other'], size=n, p=[0.26, 0.22, 0.24, 0.25, 0.03]),
    'income': np.random.normal(loc=70000, scale=20000, size=n).astype(int),
    'region': np.random.choice(['North', 'South', 'East', 'West','Unknown'], size=n,  p=[0.24, 0.24, 0.24, 0.24, 0.04]),
    'tenure_years': np.round(np.random.exponential(scale=5, size=n), 1),
    'received_campaign': np.random.choice([0, 1], size=n, p=[0.5, 0.5]),
    'spend_last_month': np.round(np.random.exponential(scale=200, size=n), 2),
    'channel': np.random.choice(['Email','email', 'SMS','sms', 'Phone','phone', 'App Notification'], size=n),
    'product_category': np.random.choice(['Loans', 'Credit Card', 'Savings', 'Insurance', 'None'], size=n)
})

# Add response variable: probability depends on campaign exposure, income, and age
def generate_response(row):
    base_prob = 0.05
    uplift = 0.10 if row['received_campaign'] == 1 else 0
    income_factor = 0.02 if row['income'] > 80000 else -0.01
    age_factor = -0.01 if row['age'] < 25 else 0.01
    prob = base_prob + uplift + income_factor + age_factor
    return np.random.rand() < prob

data['responded'] = data.apply(generate_response, axis=1).astype(int)

# Spend this month is affected by response
data['spend_this_month'] = data['spend_last_month'] + data['responded'] * np.random.normal(loc=100, scale=50, size=n)
data['spend_this_month'] = data['spend_this_month'].round(2)

# Floor income at 15,000, since the normal draw can produce implausibly low or negative values
data['income'] = data['income'].clip(lower=15000)

# Add high value customer flag
data['is_high_value'] = ((data['income'] > 90000) & (data['tenure_years'] > 5)).astype(int)

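# Add optional columns: days since last purchase (1 to 364, since randint's upper bound
# is exclusive) and credit score (clipped to the 300-850 range)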
data['days_since_last_purchase'] = np.random.randint(1, 365, size=n)
data['credit_score'] = np.random.normal(loc=700, scale=50, size=n).astype(int)
data['credit_score'] = data['credit_score'].clip(lower=300, upper=850)

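# Create customer segments using KMeans on age and income
# Note: the features are not scaled, so income (tens of thousands) dominates the
# distance computation over age (18-69)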
kmeans = KMeans(n_clusters=4, random_state=42)
segments = kmeans.fit_predict(data[['age', 'income']])
data['customer_segment'] = segments

# Preview updated dataset
data.head()




| | customer_id | age | gender | income | region | tenure_years | received_campaign | spend_last_month | channel | product_category | responded | spend_this_month | is_high_value | days_since_last_purchase | credit_score | customer_segment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CUST00001 | 56 | female | 111974 | North | 4.5 | 0 | 616.04 | email | Loans | 1 | 771.72 | 0 | 156 | 664 | 1 |
| 1 | CUST00002 | 69 | man | 42291 | South | 5.7 | 0 | 171.21 | email | Loans | 0 | 171.21 | 0 | 185 | 732 | 2 |
| 2 | CUST00003 | 46 | female | 70585 | North | 5.7 | 0 | 280.33 | sms | Loans | 0 | 280.33 | 0 | 41 | 715 | 0 |
| 3 | CUST00004 | 32 | Male | 78626 | East | 9.3 | 0 | 143.79 | sms | | 0 | 143.79 | 0 | 201 | 725 | 0 |
| 4 | CUST00005 | 60 | Female | 64871 | South | 0.0 | 1 | 67.43 | email | | 1 | 221.13 | 0 | 50 | 719 | 3 |
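
The preview mixes label variants such as `female`, `Female`, `man`, and `Male`, and the `channel` column does the same with `email`/`Email` and `sms`/`SMS`. If these variants are meant to mimic messy real-world input, a cleaning step before analysis might look like the sketch below. The canonical labels, the mapping dictionaries, and the function name `normalize_labels` are assumptions, not part of the notebook.

```python
import pandas as pd

# Lowercase variant -> canonical label (an assumed convention)
GENDER_MAP = {"male": "Male", "man": "Male", "female": "Female", "other": "Other"}
CHANNEL_MAP = {
    "email": "Email",
    "sms": "SMS",
    "phone": "Phone",
    "app notification": "App Notification",
}


def normalize_labels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with gender and channel collapsed to canonical labels.

    Values missing from the mappings become NaN, which makes them easy to spot.
    """
    out = df.copy()
    out["gender"] = out["gender"].str.strip().str.lower().map(GENDER_MAP)
    out["channel"] = out["channel"].str.strip().str.lower().map(CHANNEL_MAP)
    return out


# Small stand-in frame using label variants that appear in the generator
sample = pd.DataFrame({
    "gender": ["female", "man", "Male", "Female", "Other"],
    "channel": ["email", "sms", "SMS", "Phone", "App Notification"],
})
clean = normalize_labels(sample)
print(clean)
```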


## Save Data

In [26]:
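from pathlib import Path

# Save to CSV, creating the output directory first if it does not exist
file_path = Path("data/campaign_data.csv")
file_path.parent.mkdir(parents=True, exist_ok=True)
data.to_csv(file_path, index=False)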
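
One thing to watch when loading the file back: `product_category` uses the literal string `'None'` as a category, and depending on the pandas version `read_csv` may treat that string as a missing value. The blank `product_category` cells in the preview above would be consistent with this, although the notebook does not show how that preview was loaded. A minimal sketch of a round trip that keeps the label as text, using an in-memory buffer and a hypothetical two-row frame in place of the real file:

```python
import io

import pandas as pd

# Small stand-in frame (hypothetical rows) with the literal category 'None'
sample = pd.DataFrame({
    "customer_id": ["CUST00001", "CUST00002"],
    "product_category": ["Loans", "None"],
})

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False avoids an extra index column
buffer.seek(0)

# keep_default_na=False stops pandas from converting any strings to NaN,
# so 'None' survives as text
restored = pd.read_csv(buffer, keep_default_na=False)
print(restored["product_category"].tolist())  # ['Loans', 'None']
```

This switches off all default missing-value markers, which is acceptable here because the generated dataset contains no genuinely missing values.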

