# Exploratory & A/B Testing 

## 1. Subject Line A/B Performance

* Q1: Do variant A vs B subject lines produce statistically significant differences in open rate?

* Q2: Is the effect of subject variant on open rate consistent across segments and channels, or is it explained by confounders?

## 2. Timing & Frequency Effects

* Q3: Does send time (hour/day) affect open/CTR?

* Q4: How does customer engagement fatigue—measured by cumulative opens and inactivity periods—influence unsubscribe rates and purchase likelihood?

## 3. Segment-Level Insights

* Q5: Which customer segments (RFM quintiles, new vs. repeat) respond best to promotions vs. newsletters?

* Q6: What is the interplay between purchase frequency and email engagement?

## 4. Campaign→Order Link

* Q7: For engaged customers (opened/clicked), what is the uplift in next-30-day order rate and average order value vs. non-openers?

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
email_campaign_df = pd.read_csv('../data/email_campaigns.csv')
for col in ['campaign_id', 'channel', 'segment', 'customer_id', 'recipient_name', 'send_timestamp', 'subject_variant', 'subject_line', 'device']:
    email_campaign_df[col] = email_campaign_df[col].astype('string')

email_engagement_df = pd.read_csv('../data/email_engagement.csv')
for col in ['campaign_id', 'customer_id', 'send_timestamp', 'subject_variant']:
    email_engagement_df[col] = email_engagement_df[col].astype('string')




# Convert timestamps
email_campaign_df['send_timestamp'] = pd.to_datetime(email_campaign_df['send_timestamp'], errors='coerce')
email_engagement_df['send_timestamp'] = pd.to_datetime(email_engagement_df['send_timestamp'], errors='coerce')

In [2]:
email_engagement_df.columns

Index(['campaign_id', 'customer_id', 'send_timestamp', 'subject_variant',
       'opened', 'clicked', 'unsubscribed', 'purchase', 'revenue'],
      dtype='object')

## 1. Subject Line A/B Performance

### Q1: Do variant A vs B subject lines produce statistically significant differences in open rate?

#### Statistical test (e.g., two-proportion z-test) for significance.

In [3]:
from statsmodels.stats.proportion import proportions_ztest


# Aggregate open counts and send totals by variant

agg = (email_engagement_df
       .groupby('subject_variant')['opened']
       .agg(['sum', 'count'])
       )


# Variant A vs. variant B

opens = agg.loc[['A', 'B'], 'sum']

totals = agg.loc[['A', 'B'], 'count']


# Two-proportion z-test
z_stat, pval = proportions_ztest(opens, totals)

print(f"Z-Stat: {z_stat}, p-value: {pval}")



Z-Stat: -32.86374315017827, p-value: 7.248358126247467e-237


#### Effect size calculation (difference in open rate) for practical impact

In [4]:
from statsmodels.stats.proportion import confint_proportions_2indep

# Open rates
rate_A = opens.iloc[0] / totals.iloc[0]
rate_B = opens.iloc[1] / totals.iloc[1]

effect_size = rate_A - rate_B


# Confidence interval for the difference (A - B)
# method='wald' or method='score' can be used

lower, upper = confint_proportions_2indep(opens.iloc[0], totals.iloc[0], opens.iloc[1], totals.iloc[1], method='score')

print(f"Effect size (A-B): {effect_size:.6f}")

print(f"95% CI: ({lower:.5f}, {upper:.5f})")



Effect size (A-B): -0.009370
95% CI: (-0.00993, -0.00881)


### *Summary for Subject Line A/B Performance*

The statistical analysis shows a highly significant difference in open rates between subject line variants A and B.

- Two-proportion z-test result: Z = -32.86, p ≈ 7.2e-237, so a gap this large would be extremely unlikely if the two variants had the same open rate.
- Effect size (A - B): -0.00937, meaning variant A's open rate is about 0.94 percentage points lower than variant B's.
- 95% confidence interval for the difference: (-0.00993, -0.00881). The interval is narrow and lies entirely below zero.

Conclusion: Subject line variant has a statistically significant and measurable impact on open rate, with variant B ahead by just under one percentage point. The estimate is precise because of the large dataset.
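
To make the test statistic less of a black box, the sketch below recomputes a pooled two-proportion z-test by hand using only the standard library. The counts are hypothetical placeholders, not the notebook's data, and the function name `two_proportion_ztest` is introduced here purely for illustration. It is meant to mirror the pooled calculation that `proportions_ztest` performs by default, and it reports a simple Wald interval rather than the score interval used above, so the library output should be treated as the reference.

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Pooled two-sided z-test for H0: p1 == p2.

    Returns (z, p_value, diff, wald_ci) where diff = p1 - p2.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error under H0 uses the pooled proportion
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se_pooled
    # Two-sided tail probability of a standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Wald 95% interval uses the unpooled standard error
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    half_width = 1.959963984540054 * se_unpooled
    diff = p1 - p2
    return z, p_value, diff, (diff - half_width, diff + half_width)

# Hypothetical counts (opens, sends) for variants A and B
z, p, diff, ci = two_proportion_ztest(1200, 10000, 1300, 10000)
print(f"z = {z:.3f}, p = {p:.4f}")
print(f"diff = {diff * 100:.2f} percentage points, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Reporting the difference in percentage points, as in the last line, avoids confusing an absolute gap with a relative change.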

### Q2: Is the effect of subject variant on open rate consistent across segments and channels, or is it explained by confounders?

In [5]:
# Merge email_campaign_df with email_engagement_df to add channel and segment info to engagement data
# Drop duplicates to keep unique campaign_id with channel and segment
campaign_agg = email_campaign_df[['campaign_id', 'channel', 'segment']].drop_duplicates(subset=['campaign_id']).copy()

# Merge aggregated campaign info with engagement data
merged_df = email_engagement_df.merge(campaign_agg, on='campaign_id', how='left')
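
`drop_duplicates(subset=['campaign_id'])` keeps only the first channel/segment pair seen for each campaign, so it is worth confirming that no campaign maps to more than one pair before trusting the merge. The sketch below shows that check in plain Python on hypothetical rows. The campaign IDs and the `loyal` segment label are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (campaign_id, channel, segment) rows, NOT the notebook's data
rows = [
    ("c1", "promo", "loyal"),
    ("c1", "promo", "occasional"),   # same campaign, different segment
    ("c2", "newsletter", "inactive"),
    ("c2", "newsletter", "inactive"),
]

# Collect the distinct (channel, segment) pairs seen for each campaign
pairs_by_campaign = defaultdict(set)
for campaign_id, channel, segment in rows:
    pairs_by_campaign[campaign_id].add((channel, segment))

# Campaigns with more than one pair would lose information under drop_duplicates
ambiguous = sorted(cid for cid, pairs in pairs_by_campaign.items() if len(pairs) > 1)
print(ambiguous)  # ['c1']
```

If the list is non-empty for the real data, merging on `campaign_id` alone assigns some recipients the wrong segment, and a finer join key would be needed.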



In [23]:
merged_df.columns

Index(['campaign_id', 'customer_id', 'send_timestamp', 'subject_variant',
       'opened', 'clicked', 'unsubscribed', 'purchase', 'revenue', 'channel',
       'segment'],
      dtype='object')

#### Logistic regression to model open probability

In [7]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Prepare merged_df with the necessary columns,
# ensuring categorical variables are treated as such

merged_df['subject_variant']= merged_df['subject_variant'].astype('category')
merged_df['channel']= merged_df['channel'].astype('category')
merged_df['segment']= merged_df['segment'].astype('category')


# Logistic regression formula with interactions
formula = 'opened ~ subject_variant + channel + segment + subject_variant:channel + subject_variant:segment'


# Fit the model using statsmodels
model = smf.logit(formula=formula, data=merged_df).fit()

print(model.summary())


Optimization terminated successfully.
         Current function value: 0.353206
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:                 opened   No. Observations:              4946692
Model:                          Logit   Df Residuals:                  4946672
Method:                           MLE   Df Model:                           19
Date:                Wed, 06 Aug 2025   Pseudo R-squ.:               0.0009951
Time:                        13:04:23   Log-Likelihood:            -1.7472e+06
converged:                       True   LL-Null:                   -1.7489e+06
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                                     coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------
Intercept                           

#### *Model Significance Order and Conclusion*

Order of predictors by statistical significance (lowest to highest p-value):

1. channel[T.promo] (p ≈ 0.000, z=11.126) - largest positive z-statistic  
2. channel[T.loyalty] (p ≈ 0.000, z=9.011)  
3. channel[T.transactional] (p ≈ 0.000, z=4.392)  
4. subject_variant[T.B] (p ≈ 0.000, z=4.281)  
5. channel[T.survey] (p=0.006, z=2.765)  
6. segment[T.occasional] (p=0.031, z=2.156)  
7. segment[T.inactive] (p=0.041, z=2.045)  
8. channel[T.newsletter] (p=0.588, not significant)  

**Conclusion:**  

The promo, loyalty, and transactional channels are associated with significantly higher odds of an email being opened, with promo showing the strongest evidence (largest z-statistic).  

Subject variant B is also associated with significantly higher open likelihood, though with a smaller z-statistic than these top channels.  

The occasional and inactive segments show small but statistically significant effects, while the newsletter channel shows no significant difference in open rates from the baseline channel.  
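
One direct way to see whether the variant effect is consistent across channels is to compute the B-minus-A open-rate difference separately within each channel and compare the intervals. The sketch below does this with the standard library on hypothetical counts. The numbers and the helper name `stratum_diff` are illustrative assumptions, not values from the dataset.

```python
import math

# Hypothetical (opens, sends) per channel and variant, NOT the notebook's data
counts = {
    "promo":      {"A": (1500, 10000), "B": (1620, 10000)},
    "newsletter": {"A": (1100, 10000), "B": (1190, 10000)},
    "survey":     {"A": (900, 5000),   "B": (880, 5000)},
}

def stratum_diff(a, b):
    """Return the B-minus-A open-rate difference and its Wald standard error."""
    (xa, na), (xb, nb) = a, b
    pa, pb = xa / na, xb / nb
    se = math.sqrt(pa * (1 - pa) / na + pb * (1 - pb) / nb)
    return pb - pa, se

results = {ch: stratum_diff(v["A"], v["B"]) for ch, v in counts.items()}

for channel, (diff, se) in results.items():
    low, high = diff - 1.96 * se, diff + 1.96 * se
    print(f"{channel:<11} diff(B-A) = {diff:+.4f}  95% CI = ({low:+.4f}, {high:+.4f})")
```

If the per-channel intervals overlap heavily and share a sign, the pooled effect is a fair summary. A channel whose interval sits on the other side of zero would point to an interaction worth reading off the `subject_variant:channel` terms in the model.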

## 2. Timing & Frequency Effects

### Q3: Does send time (hour/day) affect open/CTR?

In [None]:
grouped = (merged_df
           .groupby([merged_df['send_timestamp'].dt.dayofweek.rename('day_sent'),
                     merged_df['send_timestamp'].dt.hour.rename('hour_sent')])
           .agg(opened_sum=('opened', 'sum'),
                clicked_sum=('clicked', 'sum'),
                email_sent=('campaign_id', 'count'))
          ).reset_index()

grouped['open_rate'] = grouped['opened_sum'] / grouped['email_sent']
grouped['ctr'] = grouped['clicked_sum'] / grouped['email_sent']



#### Logistic regression to test impact of send day and hour on email open probability

In [17]:
merged_df['day_sent']= merged_df['send_timestamp'].dt.dayofweek
merged_df['hour_sent']= merged_df['send_timestamp'].dt.hour


logit_df = merged_df[['opened', 'day_sent', 'hour_sent']].copy()

In [None]:
import numpy as np
import statsmodels.formula.api as smf

# Random 1% sample (≈ 49k rows, given the ~4.9M rows in merged_df) to fit the model
sample_df = merged_df.sample(frac=0.01, random_state=42)

# Convert to categorical
sample_df['day_sent'] = sample_df['day_sent'].astype('category')
sample_df['hour_sent'] = sample_df['hour_sent'].astype('category')

# Fit logistic regression
formula = 'opened ~ day_sent + hour_sent'
model = smf.logit(formula=formula, data=sample_df)
result = model.fit()

print(result.summary())



                      Generalized Linear Model Regression Results                      
Dep. Variable:     ['clicked_sum', 'failures']   No. Observations:                   44
Model:                                     GLM   Df Residuals:                 10747422
Model Family:                         Binomial   Df Model:                           17
Link Function:                           Logit   Scale:                          1.0000
Method:                                   IRLS   Log-Likelihood:            -5.0037e+07
Date:                         Wed, 06 Aug 2025   Deviance:                   5.2440e+06
Time:                                 13:30:56   Pearson chi2:                 5.25e+06
No. Iterations:                            100   Pseudo R-squ. (CS):              1.000
Covariance Type:                     nonrobust                                         
                      coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------

#### Logistic regression to test impact of send day and hour on click-through rate (CTR)

In [36]:

grouped['failures'] = grouped['email_sent'] - grouped['clicked_sum']

grouped['day_sent'] = grouped['day_sent'].astype('category')
grouped['hour_sent'] = grouped['hour_sent'].astype('category')

formula = 'clicked_sum + failures ~ day_sent + hour_sent'

model = smf.glm(formula=formula, data=grouped, family=sm.families.Binomial())
result = model.fit()

print(result.summary())


                      Generalized Linear Model Regression Results                      
Dep. Variable:     ['clicked_sum', 'failures']   No. Observations:                   44
Model:                                     GLM   Df Residuals:                       26
Model Family:                         Binomial   Df Model:                           17
Link Function:                           Logit   Scale:                          1.0000
Method:                                   IRLS   Log-Likelihood:                -197.07
Date:                         Wed, 06 Aug 2025   Deviance:                       20.307
Time:                                 13:38:04   Pearson chi2:                     20.3
No. Iterations:                              9   Pseudo R-squ. (CS):              1.000
Covariance Type:                     nonrobust                                         
                      coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------

#### *Summary for open and CTR probability*

Conclusion: The model's low pseudo R-squared (0.0025) and its coefficients point to weak predictive power and inconsistent day/hour effects, which suggests the data is noisy, synthetic, or poorly recorded. No send time stands out with a strong positive coefficient, so no clear best send time emerges. The nominal optimum in the data is Sunday at 00:00, which is implausible in practice. The dataset or the feature engineering likely needs improvement before these results can support actionable recommendations.
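
Before reading too much into a "Sunday at 00:00" optimum, two quick sanity checks are worth running: that the day index is decoded correctly, and that midnight is not over-represented because some timestamps carry a date but no time. pandas `dt.dayofweek` numbers Monday as 0 and Sunday as 6, the same convention as `datetime.weekday()`. The sketch below uses hypothetical timestamps, and the helper names are introduced here for illustration. Date-only timestamps are one possible explanation, not something the data has confirmed.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def describe_slot(day_index, hour):
    """Turn a (dayofweek, hour) pair into a readable label; Monday is 0."""
    return f"{DAY_NAMES[day_index]} {hour:02d}:00"

def midnight_share(timestamps):
    """Fraction of timestamps that fall exactly at 00:00:00."""
    at_midnight = sum(
        1 for t in timestamps if (t.hour, t.minute, t.second) == (0, 0, 0)
    )
    return at_midnight / len(timestamps)

# Hypothetical send timestamps, NOT the notebook's data
sends = [
    datetime(2025, 8, 3, 0, 0),    # a Sunday at midnight
    datetime(2025, 8, 4, 9, 30),
    datetime(2025, 8, 6, 0, 0),
]

print(describe_slot(sends[0].weekday(), sends[0].hour))  # Sunday 00:00
print(f"share of sends at exactly midnight: {midnight_share(sends):.2f}")
```

A large midnight share in the real data would suggest the hour feature is partly an artifact of how timestamps were recorded.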

In [37]:
merged_df.columns

Index(['campaign_id', 'customer_id', 'send_timestamp', 'subject_variant',
       'opened', 'clicked', 'unsubscribed', 'purchase', 'revenue', 'channel',
       'segment', 'day_sent', 'hour_sent'],
      dtype='object')

### Q4: How does customer engagement fatigue (measured by cumulative opens and inactivity periods) influence unsubscribe rates and purchase likelihood?

In [39]:
df = merged_df.copy()

In [40]:

df['customer_id'] = df['customer_id'].astype('category')

df['purchase'] = df['purchase'].astype('int8')
df['send_timestamp'] = pd.to_datetime(df['send_timestamp'])

# Sort by customer_id and send timestamp
df.sort_values(['customer_id', 'send_timestamp'], inplace=True)

# Identify the first row per customer
df['is_first'] = df.groupby('customer_id', observed=True).cumcount() == 0

# Has the customer made any purchase before this send?
df['prior_purchase'] = (
    df.groupby('customer_id', observed=True)['purchase']
    .transform(lambda x: x.shift().cummax().fillna(0)))

# Assign segments (this overwrites the campaign-level segment column from the merge)
df['segment'] = 'inactive'
df.loc[df['is_first'], 'segment'] = 'new'

# If the customer is not new and has a prior purchase, they are a repeat customer
df.loc[(~df['is_first']) & (df['prior_purchase'] > 0), 'segment'] = 'repeat'

In [43]:
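# Note: despite the column names, both columns are built from send timestamps, not purchases.
# The first is the gap in days since the customer's previous email (0 for the first send).
df['days_since_last_purchase'] = df.groupby('customer_id',observed=True)['send_timestamp'].diff().dt.days.fillna(0)

# The second is the running total of those gaps per customer.
df['days_since_last_purchase_cum'] = df.groupby('customer_id',observed=True)['days_since_last_purchase'].cumsum()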
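
As a sanity check on what these two columns contain, the sketch below reproduces the same diff-then-cumsum logic for a single hypothetical customer using only the standard library. The send dates are invented for illustration.

```python
from datetime import datetime
from itertools import accumulate

# Hypothetical send timestamps for one customer, already sorted
sends = [
    datetime(2025, 1, 1, 9),
    datetime(2025, 1, 8, 9),
    datetime(2025, 1, 10, 9),
    datetime(2025, 2, 1, 9),
]

# Gap in whole days since the previous send; 0 for the first send (mirrors fillna(0))
gaps = [0] + [(later - earlier).days for earlier, later in zip(sends, sends[1:])]

# Running total of the gaps (mirrors the per-customer cumsum)
elapsed = list(accumulate(gaps))

print(gaps)     # [0, 7, 2, 22]
print(elapsed)  # [0, 7, 9, 31]
```

In this example the running total equals the number of days since the customer's first send, so the cumulative column behaves like time on the mailing list rather than time since a purchase. That is worth keeping in mind when interpreting its coefficient.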


In [48]:
import statsmodels.formula.api as smf

# Create per-customer cumulative opens
df['cumulative_opens'] = (
    df.groupby('customer_id', observed=True)['opened']
      .cumsum()
)

# Scale continuous predictors to avoid overflow
df['cumulative_opens_scaled'] = (
    (df['cumulative_opens'] - df['cumulative_opens'].mean()) /
    df['cumulative_opens'].std()
)
df['days_since_last_purchase_cum_scaled'] = (
    (df['days_since_last_purchase_cum'] - df['days_since_last_purchase_cum'].mean()) /
    df['days_since_last_purchase_cum'].std()
)

# Stratified sample for unsubscribe model
sample_df_unsub = (
    df.groupby('unsubscribed', group_keys=False, observed=True)
      .sample(n=min(50000, df['unsubscribed'].value_counts().min()),
              random_state=42)
)

# Stratified sample for purchase model
sample_df_purchase = (
    df.groupby('purchase', group_keys=False, observed=True)
      .sample(n=min(50000, df['purchase'].value_counts().min()),
              random_state=42)
)

# Ensure binary outcomes are int
sample_df_unsub['unsubscribed'] = sample_df_unsub['unsubscribed'].astype(int)
sample_df_purchase['purchase'] = sample_df_purchase['purchase'].astype(int)

# Logistic regression: unsubscribe ~ fatigue/inactivity
model_unsub = smf.logit(
    formula='unsubscribed ~ days_since_last_purchase_cum_scaled + cumulative_opens_scaled + segment',
    data=sample_df_unsub
).fit()
print(model_unsub.summary())




Optimization terminated successfully.
         Current function value: 0.680744
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:           unsubscribed   No. Observations:                77990
Model:                          Logit   Df Residuals:                    77985
Method:                           MLE   Df Model:                            4
Date:                Wed, 06 Aug 2025   Pseudo R-squ.:                 0.01789
Time:                        14:06:28   Log-Likelihood:                -53091.
converged:                       True   LL-Null:                       -54059.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                          coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                              -0.0476      0.008 

#### *Logistic regression: unsubscribe summary*
* days_since_last_purchase_cum_scaled is strongly negative (p<0.001): customers with longer inactivity periods are significantly less likely to unsubscribe.
* cumulative_opens_scaled is strongly positive (p<0.001): customers with more cumulative opens are significantly more likely to unsubscribe.
* Segment type (new vs repeat) is not significant.

Conclusion: Engagement fatigue is visible. Frequent openers are at higher unsubscribe risk, while long inactive periods lower that risk, so unsubscribe risk is driven more by over-engagement than by inactivity. Purchase recency plays a smaller role than cumulative engagement.
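
Because the unsubscribe model is fitted on a class-balanced sample, its intercept reflects a roughly 50/50 base rate rather than the real one, while the slope coefficients are the part that carries over. A minimal sketch of the standard prior correction for this kind of outcome-based sampling follows. The population unsubscribe rate is a hypothetical placeholder, and the helper names are introduced only for illustration.

```python
import math

def prior_corrected_intercept(intercept, sample_rate, population_rate):
    """Shift a logit intercept fitted on a resampled dataset back to the population base rate."""
    correction = math.log(
        ((1 - population_rate) / population_rate) * (sample_rate / (1 - sample_rate))
    )
    return intercept - correction

def sigmoid(x):
    """Inverse logit: map log-odds to a probability."""
    return 1 / (1 + math.exp(-x))

raw_intercept = -0.0476    # intercept shown in the unsubscribe regression output
sample_rate = 0.5          # balanced sample: equal rows per outcome class
population_rate = 0.01     # HYPOTHETICAL true unsubscribe rate, for illustration only

corrected = prior_corrected_intercept(raw_intercept, sample_rate, population_rate)

print(f"baseline probability in balanced sample: {sigmoid(raw_intercept):.3f}")
print(f"baseline probability after correction:   {sigmoid(corrected):.4f}")
```

The correction only matters when predicted probabilities are needed. The sign and significance of the fatigue and inactivity coefficients are unaffected.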

In [52]:
# Logistic regression: purchase ~ fatigue/inactivity
import statsmodels.formula.api as smf

# Collapse 'new' into 'inactive'
sample_df_purchase['segment'] = sample_df_purchase['segment'].replace({'new': 'inactive'})

# Scale predictors
for col in ['days_since_last_purchase_cum', 'cumulative_opens']:
    sample_df_purchase[col + '_scaled'] = (
        sample_df_purchase[col] - sample_df_purchase[col].mean()
    ) / sample_df_purchase[col].std()

# Fit logistic regression
model_purchase = smf.logit(
    formula='purchase ~ days_since_last_purchase_cum_scaled + cumulative_opens_scaled + C(segment)',
    data=sample_df_purchase
).fit()
print(model_purchase.summary())




Optimization terminated successfully.
         Current function value: 0.600360
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:               purchase   No. Observations:                10092
Model:                          Logit   Df Residuals:                    10088
Method:                           MLE   Df Model:                            3
Date:                Wed, 06 Aug 2025   Pseudo R-squ.:                  0.1339
Time:                        14:14:40   Log-Likelihood:                -6058.8
converged:                       True   LL-Null:                       -6995.2
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                          coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------
Intercept                              -0.0645      0.025 

#### *Summary: Factors Influencing Purchase Likelihood* 

Logistic regression analysis shows that customer engagement, measured by cumulative email opens, and customer segment are both strongly associated with purchase probability. Customers in the "repeat" segment are significantly more likely to purchase than the combined new/inactive baseline (coef = 0.49, p < 0.001). Cumulative opens also positively predict purchase likelihood (coef = 0.88 per standard deviation, p < 0.001), so higher engagement goes hand in hand with more conversions.

Days since last purchase shows no significant effect (p = 0.579), suggesting that recency alone does not predict immediate purchase behavior in this dataset.

Overall, these results point to repeated engagement and customer loyalty as the key correlates of purchase behavior in the email campaign data.
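
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to communicate. The sketch below converts the two coefficients quoted in the summary. The dictionary keys are descriptive labels chosen here, not the exact term names from the model output.

```python
import math

# Coefficients quoted in the purchase summary; keys are illustrative labels
coefs = {
    "repeat segment (vs. baseline)": 0.49,
    "cumulative opens (+1 SD)": 0.88,
}

# exp(coef) is the multiplicative change in the odds of purchase
odds_ratios = {label: math.exp(beta) for label, beta in coefs.items()}

for label, ratio in odds_ratios.items():
    print(f"{label}: odds ratio = {ratio:.2f}")
```

Read this way, repeat customers have roughly 1.6 times the odds of purchasing, and a one-standard-deviation increase in cumulative opens corresponds to roughly 2.4 times the odds, within the balanced sample the model was fitted on.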