### Title

Synthetic Data Generation for E-Commerce A/B Testing

### Description

This notebook generates synthetic datasets that simulate user and session data for an e-commerce website. The data includes user demographics, traffic sources, and A/B test assignments for image layout, CTA button, and social proof. This notebook serves as the foundation for subsequent exploratory data analysis and hypothesis testing.

## Introduction

To analyze the impact of different page elements on user behavior and the conversion rate, we need realistic synthetic datasets. These datasets should mimic actual e-commerce traffic and user interactions. This notebook generates two datasets: `users_data`, containing user-level attributes and A/B test assignments, and `sessions_data`, containing session-level details.

## Objectives

*   Simulate user demographics and traffic sources reflecting typical e-commerce website activity
*   Assign users to control or treatment groups for image layout, CTA button and social proof experiments
*   Generate sessions data corresponding to these user profiles and A/B assignments
*   Validate the generated data to ensure it is suitable for exploratory data analysis and hypothesis testing





## Import Libraries

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
import os
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

## Business Logic and Assumptions

**Business Logic in Users Data Simulation:**

1.   **Experiment duration**: The simulation starts on the specified `start_date` (default: 2024-01-01) and spans `duration_days` (default: 90 days). All generated sessions fall within this time window.
2.   **Randomness and Reproducibility**: Both NumPy's and Python's random number generators are seeded for reproducibility, so the same seed always produces the same synthetic data.

3. **Base Rates and Effect Sizes**:
*   The baseline conversion rate for the site is set at 2.5%.
*   The baseline click-through rate (CTR) for calls-to-action is 15%.
*   There are three experimental factors: image layout, CTA button, and social proof. Their relative effect sizes of 15%, 12%, and 22%, respectively, represent the expected relative uplift in the metric each treatment primarily targets (CTR for the CTA button, conversion rate for social proof, and both for image layout).

4. **User Segmentation**:
*  Device Type: Users are assigned a device type with the following probabilities: 45% desktop, 45% mobile, and 10% tablet. This reflects a balanced desktop and mobile audience with a smaller proportion of tablet users.
* Traffic Source: Users are assigned a traffic source with these probabilities: 35% organic, 25% paid search, 15% each for social and direct, and 10% for email. This is designed to mimic typical e-commerce acquisition channels.
* User Type: 65% of users are new and 35% are returning, which reflects a healthy mix of new and repeat visitors.

5. **User Registration Date**: Each registration date for a user is generated by sampling an exponential distribution (mean = 30 days) and subtracting that many days from the experiment's start date. This simulates a realistic spread of user sign-up times, with more recent registrations being more common.

6. **User IDs**: Each user is assigned a unique ID in the format `user_XXXXXX`, where `XXXXXX` is a zero-padded integer.
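
As a minimal sketch of the user-level rules above (not the notebook's generator itself), the snippet below samples device types with the stated probabilities, draws registration offsets from an exponential distribution with a 30-day mean, and formats zero-padded IDs. The helper name `sketch_users` and the use of `np.random.default_rng` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd


def sketch_users(n_users=1000, start_date="2024-01-01", seed=0):
    """Hypothetical helper illustrating the user-level sampling rules."""
    # Assumption: a local Generator instead of the notebook's global seeding
    rng = np.random.default_rng(seed)
    start = datetime.strptime(start_date, "%Y-%m-%d")

    # Days between registration and experiment start (exponential, mean 30 days)
    days_before = rng.exponential(scale=30, size=n_users)

    return pd.DataFrame({
        "user_id": [f"user_{i:06d}" for i in range(n_users)],
        "device_type": rng.choice(
            ["desktop", "mobile", "tablet"], size=n_users, p=[0.45, 0.45, 0.10]
        ),
        "registration_date": [start - timedelta(days=float(d)) for d in days_before],
    })


users = sketch_users()
print(users.head())
print(users["device_type"].value_counts(normalize=True))
```

Because the offsets are subtracted from the start date, every registration falls on or before the first day of the experiment, and short offsets (recent sign-ups) are the most likely.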






**Business Logic in Session Simulation**:

1. **Sessions per user**: The number of sessions is drawn from a Poisson distribution with mean `avg_sessions_per_user` (default: 2.3), plus one so that every user has at least one session. It reflects real-world analytics where some users are more active than others.

2. **Session Timing**: Session dates are drawn uniformly across the experiment window, and the hour is sampled from a weighted hour-of-day distribution, so traffic is not uniform over the day. The weights simulate higher activity during business hours and evenings.

3. **Device-Specific Behavior**: The device-specific multipliers are based on industry observations.

*   Desktop users are modeled with higher CTR, higher conversion rates, and longer session durations.
*   Mobile users have lower conversion and shorter sessions, reflecting common mobile usage trends.
*   Tablet users fall between the two.

4. **Traffic Source Effects**: Traffic source also affects CTR and conversion rates. Email and paid search users are modeled with the highest CTR multipliers, reflecting their intent-driven nature.

5. **Page Views**: The number of pages viewed in a session is drawn from a Poisson distribution, simulating the variability in browsing depth.

6. **Click and Conversion Simulation**: The probability of clicking a CTA is capped at 95% and simulated as a Bernoulli trial. Conversion depends on whether the CTA was clicked: the conversion probability is capped at 50% after a click and at 5% without one, which reflects the funnel from click to purchase.

7. **Revenue Modeling**: Revenue is generated only for converted sessions. The amount is sampled from a log-normal distribution, which is commonly used to model transaction values in e-commerce due to its positive skewness.

8. **Session Duration**: Session duration is drawn from a gamma distribution, scaled by device type, to reflect realistic session length variability.

9. **Seasonality and Weekly Patterns**: Additional features such as day of week, week number, and month are extracted for time-based analysis. On weekends, page views are scaled up by 20% and session durations by 10% to model heavier browsing. Conversion outcomes are not adjusted for weekends in the current code.
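
The distributional choices in items 1, 7, and 8 imply specific average values, which can be checked analytically. The sketch below assumes the parameters used in this notebook's generator (Poisson rate 2.3 plus one, gamma shape 2, log-normal with `mean=3.5` and `sigma=0.8`) and compares the theoretical means with a quick simulation. The variable names and the local random generator are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: local generator, separate from the notebook's seed
n = 200_000

# Sessions per user: Poisson(2.3) + 1 has mean 2.3 + 1 = 3.3 and is never zero
sessions = rng.poisson(2.3, size=n) + 1

# Desktop session duration: gamma(shape=2, scale=base / 2) has mean shape * scale = base
base_duration = 4.5 * 1.3  # 4.5 minutes times the desktop multiplier
durations = rng.gamma(2, base_duration / 2, size=n)

# Order value: a log-normal with parameters (mu, sigma) has mean exp(mu + sigma^2 / 2)
mu, sigma = 3.5, 0.8
lognormal_mean = np.exp(mu + sigma**2 / 2)
orders = rng.lognormal(mean=mu, sigma=sigma, size=n)

print(f"sessions per user: theory 3.30, simulated {sessions.mean():.2f}")
print(f"desktop duration:  theory {base_duration:.2f}, simulated {durations.mean():.2f}")
print(f"order value:       theory {lognormal_mean:.2f}, simulated {orders.mean():.2f}")
```

Under these parameters the theoretical mean order value is about $45.60, while the median is exp(3.5), or about $33, which illustrates the positive skew of the log-normal distribution.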






**Treatment Assignment Logic**

1. We assign each user to control or treatment groups for three independent A/B tests: image layout, CTA button and social proof.

2. Each user is assigned to either the 'control' or 'treatment' group for each test using random assignment with equal probability (50% control and 50% treatment).

3. Random assignment is unbiased and guards against selection bias, so the two groups are approximately balanced in size and representative of the overall user base.

4. The only systematic difference between groups is the experimental condition, supporting causal inference.
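
The assignment logic above can be sketched in a few lines. The example below is an illustration under assumed names (`assignments`, `balance`) and an assumed seed, not the notebook's own method. It draws an independent 50/50 assignment for each of the three tests and then inspects the resulting balance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumption: illustrative seed
n_users = 10_000
tests = ["image_layout", "cta_button", "social_proof"]

# Each test gets its own independent 50/50 draw per user
assignments = pd.DataFrame({
    f"{test}_group": rng.choice(["control", "treatment"], size=n_users, p=[0.5, 0.5])
    for test in tests
})

# Share of users in each arm, per test (should be close to 0.5)
balance = assignments.apply(lambda col: col.value_counts(normalize=True)).T
print(balance)

# Independence check: each cell of the 2x2 cross-tab should be near 25%
print(pd.crosstab(
    assignments["image_layout_group"],
    assignments["cta_button_group"],
    normalize=True,
))
```

Simple randomization of this kind yields groups that are balanced only approximately, so small deviations from an exact 50/50 split are expected.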

## Schema Overview

**Schema Design for Users Dataset**

| Column Name           | Data Type | Description                                  | Example Value    |
|-----------------------|-----------|----------------------------------------------|-----------------|
| user_id               | object    | Unique user identifier (primary key)         | user_000001     |
| device_type           | object    | Device used by user (mobile/desktop/tablet)  | mobile          |
| traffic_source        | object    | How user arrived (organic/paid_search/social/direct/email) | organic |
| user_type             | object    | User type (new/returning)                    | new             |
| registration_date     | object    | Timestamp when the user registered (before the experiment start) | 2023-11-25 11:17:13 |
| image_layout_group    | object    | A/B group for image layout                   | control         |
| cta_button_group      | object    | A/B group for CTA button                     | treatment       |
| social_proof_group    | object    | A/B group for social proof                   | control         |



**Schema Design for Sessions Dataset**

| Column Name                | Data Type  | Description                                                         | Example Value        |
|----------------------------|------------|---------------------------------------------------------------------|---------------------|
| session_id                 | object     | Unique identifier for the session (primary key)                                   | user_000001_s0      |
| user_id                    | object     | Unique identifier for the user (foreign key to users_data)          | user_000001         |
| session_date               | datetime   | Timestamp for the session (date and time)                           | 2024-01-15 14:30    |
| device_type                | object     | Device used in session (desktop, mobile, tablet)                    | mobile              |
| traffic_source             | object     | Channel through which user arrived (organic, paid_search, etc.)     | organic             |
| user_type                  | object     | User segment (new, returning)                                       | new                 |
| image_layout_group         | object     | A/B test group for image layout (control, treatment)                | treatment           |
| cta_button_group           | object     | A/B test group for CTA button (control, treatment)                  | control             |
| social_proof_group         | object     | A/B test group for social proof (control, treatment)                | treatment           |
| page_views                 | float64    | Number of pages viewed in the session                               | 4                   |
| session_duration_minutes   | float64    | Duration of the session in minutes                                  | 7.5                 |
| clicked_cta                | int64      | Whether the CTA was clicked (1=yes, 0=no)                           | 1                   |
| converted                  | int64      | Whether a conversion occurred (1=yes, 0=no)                         | 0                   |
| revenue                    | float64    | Revenue generated in the session (0 if not converted)               | 49.99               |
| log_revenue                | float64    | Log-transformed revenue, log(1 + revenue), for normalization        | 3.93                |
| day_of_week                | object     | Day of the week for the session                                     | Monday              |
| week_number                | int64      | ISO week number of the session                                      | 3                   |
| month                      | int64      | Month of the session (1-12)                                         | 1                   |
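
The primary-key and foreign-key relationships in the two schemas can be verified with a few pandas checks. The sketch below uses tiny hand-written frames in place of the real `users_df` and `sessions_df`, and the helper name `check_keys` is hypothetical.

```python
import pandas as pd


def check_keys(users, sessions):
    """Hypothetical helper: return a dict of key-integrity checks."""
    return {
        "user_id_unique": bool(users["user_id"].is_unique),
        "session_id_unique": bool(sessions["session_id"].is_unique),
        # Every session must point at a user that exists in the users table
        "no_orphan_sessions": bool(sessions["user_id"].isin(users["user_id"]).all()),
    }


# Toy data standing in for the generated datasets
users = pd.DataFrame({"user_id": ["user_000000", "user_000001"]})
sessions = pd.DataFrame({
    "session_id": ["user_000000_s0", "user_000000_s1", "user_000001_s0"],
    "user_id": ["user_000000", "user_000000", "user_000001"],
})

print(check_keys(users, sessions))
```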


## Class Definition


In [None]:
class EcommerceABTestGenerator:
    """
    Generate realistic synthetic data for e-commerce A/B testing portfolio project
    """

    def __init__(self, start_date='2024-01-01', duration_days=90, seed=42):
        self.start_date = datetime.strptime(start_date, '%Y-%m-%d')
        self.duration_days = duration_days
        self.end_date = self.start_date + timedelta(days=duration_days)
        np.random.seed(seed)
        random.seed(seed)

        # Base conversion rates and effect sizes
        self.base_conversion_rate = 0.025  # 2.5% baseline
        self.base_ctr = 0.15  # 15% click-through rate
        self.effect_sizes = {
            'image_layout': 0.15,    # 15% relative improvement
            'cta_button': 0.12,      # 12% relative improvement
            'social_proof': 0.22     # 22% relative improvement
        }

    def generate_users(self, n_users=100000):
        """Generate user demographic and behavioral data"""

        # User segments with realistic distributions
        device_types = ['desktop', 'mobile', 'tablet']
        device_weights = [0.45, 0.45, 0.10]

        traffic_sources = ['organic', 'paid_search', 'social', 'direct', 'email']
        traffic_weights = [0.35, 0.25, 0.15, 0.15, 0.10]

        user_types = ['new', 'returning']
        user_type_weights = [0.65, 0.35]

        # Generate user data
        users_data = {
            'user_id': [f'user_{i:06d}' for i in range(n_users)],
            'device_type': np.random.choice(device_types, n_users, p=device_weights),
            'traffic_source': np.random.choice(traffic_sources, n_users, p=traffic_weights),
            'user_type': np.random.choice(user_types, n_users, p=user_type_weights),
            'registration_date': [
                self.start_date - timedelta(days=np.random.exponential(30))
                for _ in range(n_users)
            ]
        }

        return pd.DataFrame(users_data)

    def assign_treatments(self, users_df):
        """Randomly assign users to treatment groups"""

        n_users = len(users_df)

        # Treatment assignments (balanced randomization)
        treatments = {
            'image_layout': np.random.choice(['control', 'treatment'], n_users, p=[0.5, 0.5]),
            'cta_button': np.random.choice(['control', 'treatment'], n_users, p=[0.5, 0.5]),
            'social_proof': np.random.choice(['control', 'treatment'], n_users, p=[0.5, 0.5])
        }

        # Add to dataframe
        for test_name, assignments in treatments.items():
            users_df[f'{test_name}_group'] = assignments

        return users_df

    def simulate_sessions(self, users_df, avg_sessions_per_user=2.3):
        """Generate session-level data with realistic patterns"""

        sessions = []

        for _, user in users_df.iterrows():
            # Number of sessions per user (Poisson distribution)
            n_sessions = np.random.poisson(avg_sessions_per_user) + 1

            for session_num in range(n_sessions):
                # Session timing with day-of-week and hour effects
                session_date = self._generate_session_timestamp()

                # Device-specific behavior adjustments
                device_multipliers = {
                    'desktop': {'ctr': 1.2, 'conversion': 1.4, 'time': 1.3},
                    'mobile': {'ctr': 0.9, 'conversion': 0.8, 'time': 0.7},
                    'tablet': {'ctr': 1.0, 'conversion': 1.1, 'time': 1.0}
                }

                device_mult = device_multipliers[user['device_type']]

                # Traffic source effects
                traffic_multipliers = {
                    'organic': {'ctr': 1.1, 'conversion': 1.2},
                    'paid_search': {'ctr': 1.3, 'conversion': 1.1},
                    'social': {'ctr': 1.0, 'conversion': 0.9},
                    'direct': {'ctr': 1.2, 'conversion': 1.3},
                    'email': {'ctr': 1.4, 'conversion': 1.25}
                }

                traffic_mult = traffic_multipliers[user['traffic_source']]

                # Calculate adjusted rates
                base_ctr = self.base_ctr * device_mult['ctr'] * traffic_mult['ctr']
                base_conv = self.base_conversion_rate * device_mult['conversion'] * traffic_mult['conversion']

                # Apply treatment effects
                final_ctr = self._apply_treatment_effects(user, base_ctr, 'ctr')
                final_conv = self._apply_treatment_effects(user, base_conv, 'conversion')

                # Simulate user behavior
                page_views = np.random.poisson(3) + 1
                clicked_cta = np.random.binomial(1, min(final_ctr, 0.95))

                if clicked_cta:
                    converted = np.random.binomial(1, min(final_conv / base_ctr, 0.5))
                else:
                    converted = np.random.binomial(1, min(final_conv * 0.1, 0.05))  # Low conversion without click

                # Revenue simulation (only for converted sessions)
                revenue = 0
                if converted:
                    # Log-normal distribution for order values
                    # (mean = exp(3.5 + 0.8**2 / 2), roughly $46)
                    revenue = np.random.lognormal(mean=3.5, sigma=0.8)
                log_revenue = np.log1p(revenue)

                # Session duration (minutes)
                base_duration = 4.5 * device_mult['time']
                session_duration = np.random.gamma(2, base_duration/2)

                session_data = {
                    'session_id': f"{user['user_id']}_s{session_num}",
                    'user_id': user['user_id'],
                    'session_date': session_date,
                    'device_type': user['device_type'],
                    'traffic_source': user['traffic_source'],
                    'user_type': user['user_type'],
                    'image_layout_group': user['image_layout_group'],
                    'cta_button_group': user['cta_button_group'],
                    'social_proof_group': user['social_proof_group'],
                    'page_views': page_views,
                    'session_duration_minutes': round(session_duration, 2),
                    'clicked_cta': clicked_cta,
                    'converted': converted,
                    'revenue': round(revenue, 2) if revenue > 0 else 0,
                    'log_revenue': log_revenue
                }

                sessions.append(session_data)

        return pd.DataFrame(sessions)

    def _generate_session_timestamp(self):
        """Generate realistic session timestamps with day/hour patterns"""

        # Random day within the test period
        random_day = np.random.randint(0, self.duration_days)
        session_date = self.start_date + timedelta(days=random_day)

        # Hour distribution (higher traffic during business hours and evenings)
        hour_weights = [0.01, 0.01, 0.01, 0.01, 0.01, 0.02,  # 0-5 AM
                       0.03, 0.05, 0.08, 0.10, 0.09, 0.08,  # 6-11 AM
                       0.07, 0.06, 0.08, 0.09, 0.08, 0.07,  # 12-5 PM
                       0.06, 0.07, 0.08, 0.06, 0.04, 0.02]  # 6-11 PM

        hour_weights = np.array(hour_weights)
        hour_weights = hour_weights / hour_weights.sum()

        hour = np.random.choice(range(24), p=hour_weights)
        minute = np.random.randint(0, 60)

        return session_date.replace(hour=hour, minute=minute)

    def _apply_treatment_effects(self, user, base_rate, metric_type):
        """Apply treatment effects based on user's assigned groups"""

        rate = base_rate

        # Image layout effect (affects both CTR and conversion)
        if user['image_layout_group'] == 'treatment':
            if metric_type in ['ctr', 'conversion']:
                rate *= (1 + self.effect_sizes['image_layout'])

        # CTA button effect (primarily affects CTR)
        if user['cta_button_group'] == 'treatment':
            if metric_type == 'ctr':
                rate *= (1 + self.effect_sizes['cta_button'])
            elif metric_type == 'conversion':
                rate *= (1 + self.effect_sizes['cta_button'] * 0.6)  # Smaller conversion effect

        # Social proof effect (primarily affects conversion)
        if user['social_proof_group'] == 'treatment':
            if metric_type == 'conversion':
                rate *= (1 + self.effect_sizes['social_proof'])
            elif metric_type == 'ctr':
                rate *= (1 + self.effect_sizes['social_proof'] * 0.3)  # Smaller CTR effect

        return rate

    def add_seasonal_patterns(self, sessions_df):
        """Add realistic seasonal and weekly patterns"""

        sessions_df['day_of_week'] = sessions_df['session_date'].dt.day_name()
        sessions_df['week_number'] = sessions_df['session_date'].dt.isocalendar().week
        sessions_df['month'] = sessions_df['session_date'].dt.month

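        # Weekend effect: heavier browsing (more page views, longer sessions).
        # Conversion outcomes are not modified here.
        weekend_mask = sessions_df['day_of_week'].isin(['Saturday', 'Sunday'])
        # Cast to float first so the fractional scaling does not clash with the integer dtype
        sessions_df['page_views'] = sessions_df['page_views'].astype(float)
        sessions_df.loc[weekend_mask, 'page_views'] *= 1.2
        sessions_df.loc[weekend_mask, 'session_duration_minutes'] *= 1.1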

        return sessions_df

    def generate_complete_dataset(self, n_users=100000):
        """Generate the complete dataset for A/B testing analysis"""

        print("Generating user data...")
        users_df = self.generate_users(n_users)

        print("Assigning treatments...")
        users_df = self.assign_treatments(users_df)

        print("Simulating sessions...")
        sessions_df = self.simulate_sessions(users_df)

        print("Adding seasonal patterns...")
        sessions_df = self.add_seasonal_patterns(sessions_df)

        print(f"Generated {len(sessions_df)} sessions for {len(users_df)} users")
        print(f"Date range: {sessions_df['session_date'].min()} to {sessions_df['session_date'].max()}")

        return sessions_df, users_df
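
To see how the click and conversion steps in `simulate_sessions` combine, the expected per-session conversion probability can be written out directly. The sketch below mirrors the formulas in that method for a single user profile. `expected_conversion` is a hypothetical helper, and its arguments correspond to the intermediate rates computed inside the session loop.

```python
def expected_conversion(base_ctr, final_ctr, final_conv):
    """Expected P(converted) for one session, following the two-branch funnel.

    base_ctr:   CTR after device/traffic multipliers, before treatment effects
    final_ctr:  CTR after treatment effects
    final_conv: conversion rate after treatment effects
    """
    p_click = min(final_ctr, 0.95)
    p_conv_given_click = min(final_conv / base_ctr, 0.5)
    p_conv_given_no_click = min(final_conv * 0.1, 0.05)
    return p_click * p_conv_given_click + (1 - p_click) * p_conv_given_no_click


# Control-group desktop user arriving via email (multipliers from the generator)
base_ctr = 0.15 * 1.2 * 1.4       # 0.252
base_conv = 0.025 * 1.4 * 1.25    # 0.04375
p = expected_conversion(base_ctr, final_ctr=base_ctr, final_conv=base_conv)
print(f"Expected conversion probability: {p:.4f}")
```

For a control-group user `final_ctr` equals `base_ctr`, so when neither cap binds the click branch contributes exactly `final_conv` and the no-click branch adds a small extra amount. Together with device and traffic multipliers that are mostly above 1, this helps explain why the session-level conversion rate produced by the generator can sit above the 2.5% base rate.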


## Usage Example

In [None]:
if __name__ == "__main__":
    # Initialize generator
    generator = EcommerceABTestGenerator(
        start_date='2024-01-01',
        duration_days=90,
        seed=42
    )

    # Generate dataset
    sessions_df, users_df = generator.generate_complete_dataset(n_users=75000)

    # Basic statistics
    print("\n=== Dataset Summary ===")
    print(f"Total sessions: {len(sessions_df):,}")
    print(f"Total users: {len(users_df):,}")
    print(f"Overall conversion rate: {sessions_df['converted'].mean():.3%}")
    print(f"Overall CTR: {sessions_df['clicked_cta'].mean():.3%}")
    print(f"Total revenue: ${sessions_df['revenue'].sum():,.2f}")

    # Treatment group sizes
    print("\n=== Treatment Group Sizes ===")
    for test in ['image_layout', 'cta_button', 'social_proof']:
        print(f"{test}:")
        print(sessions_df[f'{test}_group'].value_counts())
        print()

Generating user data...
Assigning treatments...
Simulating sessions...
Adding seasonal patterns...
Generated 248033 sessions for 75000 users
Date range: 2024-01-01 00:03:00 to 2024-03-30 23:59:00

=== Dataset Summary ===
Total sessions: 248,033
Total users: 75,000
Overall conversion rate: 4.964%
Overall CTR: 21.687%
Total revenue: $562,410.23

=== Treatment Group Sizes ===
image_layout:
image_layout_group
control      124418
treatment    123615
Name: count, dtype: int64

cta_button:
cta_button_group
control      124068
treatment    123965
Name: count, dtype: int64

social_proof:
social_proof_group
treatment    124436
control      123597
Name: count, dtype: int64
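
A few figures in the summary above can be cross-checked by hand against the generator's parameters. The arithmetic below uses only the printed totals. The expected values (3.3 sessions per user and a log-normal mean of exp(3.5 + 0.8^2 / 2)) follow from the Poisson and log-normal settings in the class, and the conversion count is approximate because it is derived from the rounded rate.

```python
import math

total_sessions = 248_033
total_users = 75_000
conversion_rate = 0.04964      # printed as 4.964%
total_revenue = 562_410.23

sessions_per_user = total_sessions / total_users
approx_conversions = total_sessions * conversion_rate
revenue_per_conversion = total_revenue / approx_conversions
theoretical_order_mean = math.exp(3.5 + 0.8**2 / 2)

print(f"Sessions per user: {sessions_per_user:.2f} (expected 2.3 + 1 = 3.30)")
print(f"Approximate conversions: {approx_conversions:,.0f}")
print(f"Revenue per conversion: ${revenue_per_conversion:.2f} "
      f"(log-normal mean ${theoretical_order_mean:.2f})")
```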



## Save Datasets

In [None]:
# Mount the Drive
from google.colab import drive
drive.mount('/content/drive')

os.makedirs('/content/drive/MyDrive/data', exist_ok=True)
sessions_df.to_csv('/content/drive/MyDrive/data/sessions_data.csv', index=False)
users_df.to_csv('/content/drive/MyDrive/data/users_data.csv', index=False)

print("Datasets saved to 'data/' directory")



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Datasets saved to 'data/' directory


## Loading the Datasets

In [None]:
sessions_df = pd.read_csv('/content/drive/MyDrive/data/sessions_data.csv')
users_df = pd.read_csv('/content/drive/MyDrive/data/users_data.csv')
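
Note that `pd.read_csv` loads date columns as plain strings unless told otherwise. As an optional sketch, the `parse_dates` argument restores datetime columns on load. The in-memory CSV below is a two-row stand-in for the saved `sessions_data.csv`, so the file contents and variable names are assumptions.

```python
import io

import pandas as pd

# Two-row stand-in for sessions_data.csv
csv_text = (
    "session_id,user_id,session_date,converted\n"
    "user_000000_s0,user_000000,2024-01-15 14:30:00,0\n"
    "user_000000_s1,user_000000,2024-02-03 09:05:00,1\n"
)

as_text = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["session_date"])

print(as_text["session_date"].dtype)    # string-like dtype, not a datetime
print(as_dates["session_date"].dtype)   # datetime64 dtype
print(as_dates["session_date"].dt.day_name().tolist())
```

With the column parsed, the `.dt` accessor is available again without an explicit `pd.to_datetime` call.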

## Data Quality Check

**users_data Information**

In [None]:
users_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75000 entries, 0 to 74999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   user_id             75000 non-null  object
 1   device_type         75000 non-null  object
 2   traffic_source      75000 non-null  object
 3   user_type           75000 non-null  object
 4   registration_date   75000 non-null  object
 5   image_layout_group  75000 non-null  object
 6   cta_button_group    75000 non-null  object
 7   social_proof_group  75000 non-null  object
dtypes: object(8)
memory usage: 4.6+ MB


**Unique User IDs**

In [None]:
print("Unique user ids: ", users_df['user_id'].nunique())
print("Total user_ids: ", len(users_df))

Unique user ids:  75000
Total user_ids:  75000


**sessions_data Information**

In [None]:
sessions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248033 entries, 0 to 248032
Data columns (total 18 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   session_id                248033 non-null  object 
 1   user_id                   248033 non-null  object 
 2   session_date              248033 non-null  object 
 3   device_type               248033 non-null  object 
 4   traffic_source            248033 non-null  object 
 5   user_type                 248033 non-null  object 
 6   image_layout_group        248033 non-null  object 
 7   cta_button_group          248033 non-null  object 
 8   social_proof_group        248033 non-null  object 
 9   page_views                248033 non-null  float64
 10  session_duration_minutes  248033 non-null  float64
 11  clicked_cta               248033 non-null  int64  
 12  converted                 248033 non-null  int64  
 13  revenue                   248033 non-null  f

**Unique Session IDs**

In [None]:
print("Unique session_ids:", sessions_df['session_id'].nunique())
print("Total session_ids:", len(sessions_df))

Unique session_ids: 248033
Total session_ids: 248033


**Summary Statistics for users_data**

In [None]:
users_df.describe()

| | user_id | device_type | traffic_source | user_type | registration_date | image_layout_group | cta_button_group | social_proof_group |
|---|---|---|---|---|---|---|---|---|
| count | 75000 | 75000 | 75000 | 75000 | 75000 | 75000 | 75000 | 75000 |
| unique | 75000 | 3 | 5 | 2 | 75000 | 2 | 2 | 2 |
| top | user_074999 | mobile | organic | new | 2023-11-25 11:17:13.519149 | control | control | treatment |
| freq | 1 | 33850 | 26180 | 48732 | 1 | 37641 | 37505 | 37647 |


**Summary Statistics for sessions_data**

In [None]:
sessions_df.describe()

| | page_views | session_duration_minutes | clicked_cta | converted | revenue | log_revenue | week_number | month |
|---|---|---|---|---|---|---|---|---|
| count | 248033.0 | 248033.0 | 248033.0 | 248033.0 | 248033.0 | 248033.0 | 248033.0 | 248033.0 |
| mean | 4.218763 | 4.630426 | 0.216874 | 0.049643 | 2.267481 | 0.17585 | 6.936343 | 1.989441 |
| std | 1.868955 | 3.659792 | 0.412117 | 0.217206 | 13.847722 | 0.788141 | 3.713298 | 0.823857 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 25% | 3.0 | 2.04 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 | 1.0 |
| 50% | 4.0 | 3.674 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 2.0 |
| 75% | 5.0 | 6.17 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 3.0 |
| max | 16.8 | 43.868 | 1.0 | 1.0 | 791.75 | 6.675511 | 13.0 | 3.0 |


**Category Balance**

In [None]:
for col in ['device_type', 'traffic_source', 'user_type']:
    print(f"\n{col} distribution:")
    print(sessions_df[col].value_counts(normalize=True))


device_type distribution:
device_type
mobile     0.451021
desktop    0.450174
tablet     0.098805
Name: proportion, dtype: float64

traffic_source distribution:
traffic_source
organic        0.348208
paid_search    0.249265
direct         0.152939
social         0.150262
email          0.099325
Name: proportion, dtype: float64

user_type distribution:
user_type
new          0.648321
returning    0.351679
Name: proportion, dtype: float64


The category proportions closely match the probabilities specified in our business assumptions.
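
The match can be quantified by comparing the printed proportions with the target probabilities. The sketch below does this for `device_type` using the values shown in the output above, and the same pattern would apply to `traffic_source` and `user_type`. The variable names are illustrative.

```python
import pandas as pd

# Target probabilities from the user segmentation assumptions
target = pd.Series({"desktop": 0.45, "mobile": 0.45, "tablet": 0.10})
# Session-level proportions printed above
observed = pd.Series({"mobile": 0.451021, "desktop": 0.450174, "tablet": 0.098805})

# Subtraction aligns the two Series on their index labels
deviation = (observed - target).abs()
print(deviation.sort_values(ascending=False))
print(f"Largest absolute deviation: {deviation.max():.4f}")
```

These proportions are measured over sessions rather than users, so they also reflect how many sessions each user happened to generate.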

**Temporal Logic for Registration Dates**

In [None]:
merged = sessions_df.merge(users_df[['user_id', 'registration_date']], on='user_id')
invalid_dates = (pd.to_datetime(merged['session_date']) < pd.to_datetime(merged['registration_date'])).sum()
print("Sessions before registration:", invalid_dates)

Sessions before registration: 0


**Check for negative revenue and non-positive page views**

In [None]:
print("Negative revenues:", (sessions_df['revenue'] < 0).sum())
print("Zero or negative page_views:", (sessions_df['page_views'] <= 0).sum())

Negative revenues: 0
Zero or negative page_views: 0


**Group Assignment Balance**

In [None]:
for col in ['image_layout_group', 'cta_button_group', 'social_proof_group']:
    print(f"{col} group balance:\n", sessions_df[col].value_counts(normalize=True))

image_layout_group group balance:
 image_layout_group
control      0.501619
treatment    0.498381
Name: proportion, dtype: float64
cta_button_group group balance:
 cta_button_group
control      0.500208
treatment    0.499792
Name: proportion, dtype: float64
social_proof_group group balance:
 social_proof_group
treatment    0.501691
control      0.498309
Name: proportion, dtype: float64


Each experiment is very close to a 50/50 split between control and treatment, which matches our intended allocation.
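
The proportions above are measured over sessions, while randomization happens per user. One hedged way to formalize the balance check is a chi-square goodness-of-fit test on user-level counts, often called a sample ratio mismatch check. The sketch below takes the size of the larger arm for each test from the `users_df.describe()` output shown earlier. The dictionary names are assumptions.

```python
from scipy import stats

n_users = 75_000
# Size of the larger arm for each test, taken from users_df.describe()
larger_arm = {"image_layout": 37_641, "cta_button": 37_505, "social_proof": 37_647}

p_values = {}
for test, count in larger_arm.items():
    observed = [count, n_users - count]
    # Null hypothesis: both arms have the same expected size (the chisquare default)
    result = stats.chisquare(observed)
    p_values[test] = result.pvalue
    print(f"{test}: chi2 = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

A large p-value here means the observed split is consistent with a 50/50 allocation, while a very small one would suggest a problem in the assignment step.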

## Conclusion

We generated synthetic datasets for A/B hypothesis testing. The simulated distributions follow the business assumptions described above, and the data quality checks confirm unique keys, valid value ranges, and balanced group assignments.

**Limitations:**

The synthetic data is intended to reflect a wide range of real-world scenarios, but it may not capture certain edge cases. The business assumptions used in the simulation may also need revision as technologies and user behavior change.

**Next Steps:**

Next, we will use these datasets for exploratory data analysis before undertaking the hypothesis tests.
We will continue to monitor and improve data quality as the project progresses.