# Regression Adjustment Channel Effect

## 1. Objective and Analytical Motivation

The naïve A/B comparison indicated a statistically significant difference in conversion rates between cellular and telephone contact methods. However, the contact method was not randomly assigned, raising concerns about selection bias and confounding.

Therefore, this stage aims to estimate the adjusted channel effect by controlling for observable customer characteristics using logistic regression.

In [72]:
# Import Libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf


# Load data
df = pd.read_csv('../data/preprocessed-bank-data.csv')

# Keep only the two contact channels being compared
df_reg = df[df['contact'].isin(['cellular', 'telephone'])].copy()

## 2. Methodology: Logistic Regression for Confounder Adjustment

Because the target variable (`y`) is binary, logistic regression is used as the confounder adjustment method. The model controls for observable confounders such as age, job, marital status, loan status, prior campaign interactions, and macroeconomic variables in order to estimate the conditional effect of the contact method.

### 2.1 Model Specification

conversion ~ contact + age + job + marital + education + loan + housing + previous + month + macro_vars

- contact: main variable of interest
- other variables: control covariates (observable confounders)

*`macro_vars` refer to external macroeconomic indicators that may influence overall conversion likelihood independently of customer characteristics.*

macro_vars

- emp.var.rate: employment variation rate (economic labor market condition indicator)
- cons.price.idx: consumer price index
- cons.conf.idx: consumer confidence index
- euribor3m: 3-month Euribor rate
- nr.employed: total employment level in the economy (macro-level indicator)
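
The specification above can be written as a formula string for the statsmodels formula interface. The sketch below only builds the string; the helper names `term` and `build_formula` are illustrative, not part of the original analysis. It assumes that categorical columns should be wrapped in `C()` and that column names containing dots need patsy's `Q()` quoting, since an unquoted `emp.var.rate` would be parsed as attribute access.

```python
# Columns in the order given by the model specification: (name, is_categorical)
SPEC = [
    ("contact", True), ("age", False), ("job", True), ("marital", True),
    ("education", True), ("loan", True), ("housing", True),
    ("previous", False), ("month", True),
]
MACRO_VARS = ["emp.var.rate", "cons.price.idx", "cons.conf.idx",
              "euribor3m", "nr.employed"]


def term(col, is_categorical=False):
    """Quote names that are not valid Python identifiers; wrap categoricals in C()."""
    name = col if col.isidentifier() else f"Q('{col}')"
    return f"C({name})" if is_categorical else name


def build_formula(outcome):
    """Assemble 'outcome ~ term + term + ...' from SPEC and MACRO_VARS."""
    terms = [term(col, is_cat) for col, is_cat in SPEC]
    terms += [term(col) for col in MACRO_VARS]
    return f"{outcome} ~ " + " + ".join(terms)


formula = build_formula("conversion")
print(formula)
```

Assuming `conversion` is coded as 0/1, the resulting string could be passed to `smf.logit(formula, data=df_reg)`.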

In [73]:
df_reg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  campaign        41188 non-null  int64  
 11  pdays           41188 non-null  int64  
 12  previous        41188 non-null  int64  
 13  poutcome        41188 non-null  object 
 14  emp.var.rate    41188 non-null  float64
 15  cons.price.idx  41188 non-null  float64
 16  cons.conf.idx   41188 non-null  float64
 17  euribor3m       41188 non-null 

In [74]:
print(df_reg['housing'].unique())
print(df_reg['loan'].unique())

['no' 'yes' 'unknown']
['no' 'yes' 'unknown']


Although `housing` and `loan` are documented as binary, the data also contains the value 'unknown', so both are treated as categorical variables. The categorical variables are therefore `job`, `marital`, `education`, `month`, `housing`, and `loan`.

In [75]:
# Define Target Variable
df_reg.rename(columns={'y':'conversion'}, inplace = True)

df_reg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  campaign        41188 non-null  int64  
 11  pdays           41188 non-null  int64  
 12  previous        41188 non-null  int64  
 13  poutcome        41188 non-null  object 
 14  emp.var.rate    41188 non-null  float64
 15  cons.price.idx  41188 non-null  float64
 16  cons.conf.idx   41188 non-null  float64
 17  euribor3m       41188 non-null 

`y` records whether the client subscribed, so it is renamed to `conversion`.

## 3. Understanding Confounding

A confounder is a variable that influences both:

1) The treatment assignment (contact method), and  
2) The outcome (conversion).

Because the contact channel was not randomly assigned, certain customer characteristics may have influenced which channel was used.

For example:
- Younger or digitally active customers may be more likely to be contacted via cellular.
- Customers with prior campaign engagement may have higher inherent conversion probability.

If these variables are not controlled for, the estimated channel effect may capture both:
- The true channel impact
- Systematic differences between the customers reached through each channel

The result is a biased estimate (confounding bias).

Therefore, a baseline confounder adjustment model is estimated using logistic regression to isolate the conditional effect of contact while holding observable covariates constant.
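
The mechanism can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the bank dataset: every rate below is an assumed value chosen for illustration. The channel has no true effect on conversion, yet the naive comparison shows a gap because prior engagement drives both channel choice and conversion. Comparing channels within each engagement level removes most of that gap.

```python
import random

random.seed(0)


def simulate(n=200_000):
    """Simulate a channel with NO true effect plus one confounder (prior engagement)."""
    rows = []
    for _ in range(n):
        engaged = random.random() < 0.3                      # confounder (assumed share)
        # Engaged customers are more likely to be reached by cellular (assumed rates).
        cellular = random.random() < (0.8 if engaged else 0.4)
        # Conversion depends only on engagement, never on the channel.
        converted = random.random() < (0.30 if engaged else 0.05)
        rows.append((engaged, cellular, converted))
    return rows


def rate(rows):
    """Share of converted customers in a list of (engaged, cellular, converted) rows."""
    return sum(r[2] for r in rows) / len(rows)


rows = simulate()

# Naive comparison: cellular vs telephone, ignoring the confounder.
naive_diff = rate([r for r in rows if r[1]]) - rate([r for r in rows if not r[1]])

# Adjusted comparison: the same difference within each level of the confounder.
adjusted = {}
for level in (False, True):
    stratum = [r for r in rows if r[0] == level]
    adjusted[level] = (rate([r for r in stratum if r[1]])
                       - rate([r for r in stratum if not r[1]]))

print(f"naive difference: {naive_diff:+.3f}")
print({f"engaged={k}": round(v, 3) for k, v in adjusted.items()})
```

Logistic regression generalizes this stratified comparison to many covariates at once, which is why it is used for the adjustment that follows.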

### 3.1 Baseline Confounder Adjustment Model

To isolate the conditional effect of `contact` method, a logistic regression model is trained while controlling for observable confounders.

The model specification is:
$$\text{conversion(y)} \sim \text{contact} + X$$

Where:

- `contact` represents the treatment (cellular vs telephone),
- `X` includes customer characteristics and macroeconomic indicators.

By including these covariates, the model compares customers with similar observable profiles across contact channels.

Thus, the estimated coefficient for `contact` reflects the adjusted channel effect rather than raw group differences.

In [76]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Define Target
y = df_reg['conversion']

# Treatment encoding
df_reg['contact_binary'] = (df_reg['contact'] == 'cellular').astype(int)

# Feature selection
feature_cols = ['contact_binary', 'age', 'previous',
                'job','marital','education', 'housing', 'loan', 'month',
                'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

X = df_reg[feature_cols].copy()

categorical_cols = ['job', 'marital','education','housing','loan', 'month']
numeric_cols = ['contact_binary', 'age','previous',
                'emp.var.rate', 'cons.price.idx', 'cons.conf.idx','euribor3m', 'nr.employed']

# Preprocessing
numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))
])

preprocess = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
    ('cat', categorical_pipe, categorical_cols)
])

model = LogisticRegression(max_iter=2000)

clf = Pipeline([
    ('preprocess', preprocess),
    ('model', model)
])

# Train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, stratify = y)

clf.fit(X_train, y_train)
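
One detail of this pipeline affects how the `contact_binary` coefficient should be read. The column is listed in `numeric_cols`, so it passes through `StandardScaler` along with the continuous variables. A coefficient on a standardized 0/1 indicator is the change in log-odds per one standard deviation of the indicator, not per switch from 0 to 1. The sketch below shows the conversion back to the 0/1 scale. The function name `per_unit_coefficient` and both input values are hypothetical placeholders, not figures from this analysis.

```python
import math


def per_unit_coefficient(beta_standardized, share_treated):
    """Convert a coefficient on a standardized 0/1 indicator back to the 0 -> 1 scale.

    StandardScaler divides by the population standard deviation, which for a
    binary indicator with mean p equals sqrt(p * (1 - p)).
    """
    sd = math.sqrt(share_treated * (1.0 - share_treated))
    return beta_standardized / sd


# Hypothetical placeholders, not values taken from this analysis.
beta_std = 0.30          # coefficient on the standardized indicator
share_cellular = 0.50    # share of training rows with contact_binary == 1

beta_raw = per_unit_coefficient(beta_std, share_cellular)
print(f"per-SD odds ratio:   {math.exp(beta_std):.3f}")
print(f"per-unit odds ratio: {math.exp(beta_raw):.3f}")
```

Because the standard deviation of a binary indicator is at most 0.5, the per-unit coefficient is always at least twice the standardized one. In the fitted pipeline the standard deviation actually used should be available from the scaler's `scale_` attribute.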

## 4. Model Estimation

After training the baseline confounder adjustment model, we:

1. Extract coefficient estimates
2. Convert coefficients to odds ratios 
3. Assess statistical relevance of the contact effect
4. Perform a basic predictive sanity check

The primary objective is inference (adjusted effect estimation), not pure predictive optimisation.

### 4.1 Extract Coefficients & Odds Ratios

In [77]:
# Retrieve feature names
feature_names = clf.named_steps['preprocess'].get_feature_names_out()

coef = clf.named_steps['model'].coef_.ravel()

coef_table = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coef,
    'odds_ratio': np.exp(coef)
}).sort_values('odds_ratio', ascending=False)

coef_table[coef_table['feature'].str.contains('contact', case= False, na=False)]

| | feature | coefficient | odds_ratio |
|---|---------|-------------|------------|
| 0 | num__contact_binary | 0.3643 | 1.439506 |



After fitting the baseline logistic regression model with confounder adjustment, we extract the estimated coefficients and convert them to odds ratios for interpretability.

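For the contact variable (`num__contact_binary`):

- Coefficient (β) = 0.3643  
- Odds Ratio (exp(β)) ≈ 1.44

This suggests that, holding other variables constant, cellular contact is associated with higher conversion odds than telephone contact, the only other channel retained in `df_reg`. The estimate corresponds to approximately **44% higher conversion odds** per one-standard-deviation increase in the indicator. The scale matters because `contact_binary` is listed in `numeric_cols` and is therefore standardized by `StandardScaler`; the odds ratio for a full telephone-to-cellular switch would be larger than 1.44.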

Importantly, this is an adjusted estimate, meaning the effect persists even after controlling for demographic, behavioural, and macroeconomic factors.
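
An odds ratio is easy to misread as a change in conversion probability. The sketch below converts the reported coefficient into an odds ratio and then into an implied probability. The baseline probability `p0` is a hypothetical value chosen for illustration, and `shifted_probability` is an illustrative helper name.

```python
import math


def shifted_probability(p_baseline, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    odds = p_baseline / (1.0 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


beta_contact = 0.3643                 # coefficient reported in the table above
or_contact = math.exp(beta_contact)   # odds ratio

p0 = 0.10                             # hypothetical baseline conversion probability
p1 = shifted_probability(p0, or_contact)

print(f"odds ratio: {or_contact:.3f}")
print(f"probability: {p0:.3f} -> {p1:.3f} ({(p1 / p0 - 1) * 100:.1f}% relative change)")
```

At this assumed baseline, 44% higher odds translates into a relative increase in probability that is smaller than 44%, and the gap widens as the baseline probability rises.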

### 4.2 Assess statistical significance (p-value / CI)

In [79]:
# Preprocess Result
X_train_processed = clf.named_steps['preprocess'].transform(X_train)

# Convert to a dense float64 array to prevent a ValueError
if hasattr(X_train_processed, 'toarray'):
    X_dense = X_train_processed.toarray()
else:
    X_dense = np.asarray(X_train_processed)

X_dense = X_dense.astype(np.float64)

# Get feature names
feature_names = clf.named_steps['preprocess'].get_feature_names_out()
X_df = pd.DataFrame(X_dense, columns=feature_names, index=y_train.index)

# Add a constant
X_df = sm.add_constant(X_df, has_constant='add')

# GLM (Binomial)
glm = sm.GLM(y_train, X_df, family=sm.families.Binomial())
result = glm.fit()

print(result.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:             conversion   No. Observations:                32950
Model:                            GLM   Df Residuals:                    32908
Model Family:                Binomial   Df Model:                           41
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -9379.4
Date:                Tue, 24 Feb 2026   Deviance:                       18759.
Time:                        20:06:44   Pearson chi2:                 3.30e+04
No. Iterations:                     6   Pseudo R-squ. (CS):             0.1261
Covariance Type:            nonrobust                                         
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
cons

In [83]:
print(f"Odds Ratio 95% CI: [{round(np.exp(0.301),2)}, {round(np.exp(0.439),2)}]")

Odds Ratio 95% CI: [1.35, 1.55]


To evaluate whether the observed contact effect is statistically reliable, we examine the p-value and confidence interval from the GLM output.

For `num__contact_binary`:

- p-value < 0.001
- 95% CI (coefficient): [0.301, 0.439]
- 95% CI (odds ratio): approximately [1.35, 1.55]

The extremely small p-value indicates that an association this strong would be very unlikely to appear by chance alone if the true coefficient were zero. More importantly from a business perspective, even the lower bound of the confidence interval implies at least a **35% increase in conversion odds**. Even at the conservative end of the interval, the positive association between **cellular** contact and **conversion** remains materially meaningful.
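
The relationship between these quantities can be checked with a few lines of arithmetic. The sketch assumes the interval is a standard Wald interval (estimate ± 1.96 standard errors), which is the usual default for GLM summaries. Because the reported bounds are rounded, the recovered standard error and z statistic are approximate.

```python
import math

ci_low, ci_high = 0.301, 0.439   # 95% CI for the coefficient, from the GLM output
z_crit = 1.959964                # two-sided 95% quantile of the standard normal

# A Wald interval is symmetric, so the estimate is the midpoint of the bounds.
beta_hat = (ci_low + ci_high) / 2.0
std_err = (ci_high - ci_low) / (2.0 * z_crit)
z_stat = beta_hat / std_err

# Two-sided p-value under the standard normal distribution.
p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))

# Exponentiating the bounds gives the interval on the odds-ratio scale.
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(f"beta = {beta_hat:.3f}, SE = {std_err:.4f}, z = {z_stat:.2f}")
print(f"p-value below 0.001: {p_value < 0.001}")
print(f"Odds Ratio 95% CI: [{or_low:.2f}, {or_high:.2f}]")
```

Exponentiating the bounds, rather than building a symmetric interval around the odds ratio, is what keeps the interval valid on the multiplicative scale.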

### 4.3 Perform a Basic Predictive Sanity Check

Although the primary objective of this model is inference rather than prediction, it is important to confirm that the model demonstrates reasonable predictive performance. We evaluate out-of-sample performance using AUC and accuracy metrics.

In [84]:
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

# Predict probabilities
y_pred_prob = clf.predict_proba(X_test)[:, 1]

# Predict class
y_pred = clf.predict(X_test)

# Metrics
auc = roc_auc_score(y_test, y_pred_prob)
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("AUC:", round(auc, 3))
print("Accuracy:", round(acc, 3))
print("Confusion Matrix:\n", cm)

AUC: 0.792
Accuracy: 0.892
Confusion Matrix:
 [[7209  101]
 [ 785  143]]


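The model achieves an AUC of 0.79, indicating reasonable discrimination between converters and non-converters. Accuracy (0.892) is less informative here because the classes are imbalanced: the confusion matrix shows that the large majority of test cases are non-converters and that most predictions are non-conversions. Predictive performance is not the primary objective, but the check suggests the model captures enough signal for its coefficient estimates to be worth interpreting.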
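
The confusion matrix printed above can be unpacked into the usual summary metrics. The sketch assumes scikit-learn's layout for binary labels, with rows as true classes and columns as predicted classes, so the cells read `[[TN, FP], [FN, TP]]`.

```python
# Cells copied from the confusion matrix printed above
tn, fp = 7209, 101
fn, tp = 785, 143

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
# Accuracy of a trivial model that always predicts "no conversion"
majority_baseline = (tn + fp) / total
# Share of actual converters that the model flags at the default 0.5 threshold
recall = tp / (tp + fn)
# Share of predicted converters that actually converted
precision = tp / (tp + fp)

print(f"accuracy:          {accuracy:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"recall:            {recall:.3f}")
print(f"precision:         {precision:.3f}")
```

Comparing accuracy with the majority baseline shows how much of the headline figure comes from class imbalance alone, which is why AUC is the more useful of the two metrics for this check.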

## 5. Naïve vs Adjusted Comparison

| Method | Estimate | Interpretation |
|--------|----------|----------------|
| Naïve A/B | +9.5 percentage points | Raw difference |
| Logistic Regression (Adjusted) | OR = 1.45 (95% CI: 1.35 - 1.55) | Adjusted effect |

## 6. Robustness and Sensitivity

[Todo]

- Add interaction (e.g., contact x age)
- Remove macro vars
- Compare coefficient stability

## 7. Interpretation and Business Implications

- Does channel remain significant?
- Is effect economically meaningful?
- Should strategy shift?

## 8. Limitations

- Observational data
- Potential unobserved confounding
- Not causal proof
- Experimental validation required for causal inference

## 9. Executive Summary