### Disclaimer

All hypothesis tests in this notebook use a significance level ($\alpha$) of **0.05**.

In [None]:
# constant
ALPHA = 0.05

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pingouin as pg
from scipy.stats import levene
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

In [None]:
train = pd.read_csv(r'https://raw.githubusercontent.com/MohamedMostafa259/Customer-Churn-Prediction-and-Analysis/main/Milestone2_FeatureEng_AdvancedAnalysis/data/train_basicFeatureEng.csv')
train.sample(5, random_state=42)

In [None]:
train_copy = train.copy()

In [None]:
train_copy.info()

In [None]:
train_copy.hist(bins=50, figsize=(10, 7))
plt.tight_layout()
plt.show()

Most of these distributions are clearly not normal. However, because the dataset is large, the Central Limit Theorem tells us that the sampling distribution of each mean is approximately normal, so we can still use parametric tests.
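
As a rough illustration of the Central Limit Theorem argument, the sketch below simulates a hypothetical right-skewed feature (synthetic exponential data, not this notebook's dataset) and compares the skewness of the raw values with the skewness of means computed over samples of size 1,000. The names `population` and `sample_means` are made up for this example.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)

# Hypothetical, strongly right-skewed "feature" (synthetic, for illustration only)
population = rng.exponential(scale=1.0, size=100_000)

# Draw 2,000 samples of size 1,000 and keep the mean of each sample
sample_means = rng.exponential(scale=1.0, size=(2_000, 1_000)).mean(axis=1)

raw_skew = skew(population)
means_skew = skew(sample_means)

print(f'skewness of raw values:   {raw_skew:.2f}')
print(f'skewness of sample means: {means_skew:.2f}')
```

The raw values stay heavily skewed, while the sample means are close to symmetric. That is the property the parametric tests rely on: they need the means, not the raw data, to be approximately normal.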

### **Oneway ANOVA test** (numerical features vs. churn risk score)

ANOVA assumes homogeneity of variance: the variance of the feature should be roughly equal across all groups (here, all churn risk scores). Let's check this for each numeric column using Levene's test.

-	If p > 0.05: no significant difference in variance → safe to proceed with ANOVA.

-	If p <= 0.05: the variances differ significantly → use Welch's correction, which doesn't assume equal variances and adjusts the degrees of freedom accordingly, or use a non-parametric test instead (e.g., Kruskal-Wallis).
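
The decision rule above can be sketched as a small helper. This is only an illustration under stated assumptions: `compare_groups` is a hypothetical function, the toy arrays are synthetic, and the fallback shown is the Kruskal-Wallis option rather than Welch's correction.

```python
import numpy as np
from scipy.stats import levene, f_oneway, kruskal

ALPHA = 0.05

def compare_groups(groups, alpha=ALPHA):
    """Hypothetical helper: check equal variances, then pick the group test."""
    _, levene_p = levene(*groups)
    if levene_p > alpha:
        # Variances look similar: classic one-way ANOVA
        _, p_val = f_oneway(*groups)
        return 'anova', p_val
    # Variances differ: fall back to the rank-based Kruskal-Wallis test
    _, p_val = kruskal(*groups)
    return 'kruskal', p_val

# Synthetic toy data (not the notebook's dataset)
base = np.linspace(-1, 1, 50)
same_spread = [base, base + 5, base + 10]        # shifted copies, equal variance
different_spread = [base, 10 * base, 100 * base]  # same centre, very different variance

print(compare_groups(same_spread)[0])
print(compare_groups(different_spread)[0])
```

Returning the name of the chosen test alongside the p-value makes it easy to report which assumption path each column took.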

In [None]:
for num_col in train_copy.select_dtypes('number').columns[:6]:
	groups = [group[num_col] for _, group in train_copy.groupby('churn_risk_score')]
	stat, p_val = levene(*groups)
	print(f'{num_col}: p-value = {p_val:.4f}', end=', ')
	if p_val > ALPHA:
		print('equal variances')
	else:
		print('unequal variances - WARNING', '!!'*10)

Most columns don't meet the assumption of equal variances, so we'll apply ANOVA to `'age'` and `'days_since_last_login'` columns only.

**Research Question:** Is there a significant difference in the mean value of `'age'` and `'days_since_last_login'` numerical features across the churn risk scores?

$H_0$: The mean of `'age'` and `'days_since_last_login'` numerical features is the same across all churn risk score levels.

$H_1$: At least one group (churn risk score) has a different mean.

In [None]:
for num_col in train_copy.select_dtypes('number').columns[:2]:
	print('-'*30, f'\n{num_col} vs. churn_risk_score')
	print(pg.anova(data=train_copy, dv=num_col, between='churn_risk_score'), '\n')

For `'days_since_last_login'`, let's run a post-hoc test to find out which pairs of churn risk scores differ significantly in their means.

**N.B.** The **family-wise error rate** (FWER) refers to the risk of making one or more **Type I errors** (false positives) when performing multiple statistical tests within a set of comparisons, and methods like **Bonferroni** correction are applied to control this error rate.
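
To make the family-wise error rate concrete, here is a minimal sketch of the Bonferroni adjustment for the pairwise comparisons among five churn risk scores. The raw p-values are hypothetical, `bonferroni` is an illustrative helper rather than pingouin's implementation, and the uncorrected FWER figure assumes independent tests.

```python
from math import comb

ALPHA = 0.05

def bonferroni(p_values, n_tests):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    return [min(1.0, p * n_tests) for p in p_values]

n_groups = 5                          # churn risk scores 1 to 5
n_comparisons = comb(n_groups, 2)     # number of distinct pairs

# Chance of at least one false positive across all pairs without any correction
# (assuming, for simplicity, that the tests are independent)
fwer_uncorrected = 1 - (1 - ALPHA) ** n_comparisons

raw_p = [0.001, 0.004, 0.2]           # hypothetical raw p-values for three of the pairs
adjusted_p = bonferroni(raw_p, n_comparisons)

print(n_comparisons, round(fwer_uncorrected, 2))
print(adjusted_p)
```

With 10 comparisons, the uncorrected chance of at least one false positive is roughly 40%, which is why the adjusted p-values, not the raw ones, should be compared against $\alpha$.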

In [None]:
pg.pairwise_tests(data=train_copy, dv='days_since_last_login', between='churn_risk_score', padjust='bonf')

In [None]:
plt.figure(figsize=(5, 4))
sns.barplot(data=train_copy, x='churn_risk_score', y='days_since_last_login')
plt.show()

pg.pairwise_tests(data=train_copy, dv='days_since_last_login', between='churn_risk_score', padjust='bonf')

In [None]:
num_features = train_copy.select_dtypes('number').columns[:6].tolist()
X = train_copy[num_features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

tsne = TSNE(n_components=2, random_state=42)
tsne_results = tsne.fit_transform(X_scaled)

# add t-SNE components to the DataFrame
train_copy['tsne-2d-one'] = tsne_results[:, 0]
train_copy['tsne-2d-two'] = tsne_results[:, 1]

In [None]:
train_copy['churn_risk_level'] = train_copy['churn_risk_score'].apply(lambda x: 'low' if x <= 2 else 'high')

In [None]:
sns.scatterplot(data=train_copy, x='tsne-2d-one', y='tsne-2d-two', hue='churn_risk_level', alpha=0.3)
plt.title('t-SNE Visualization of Churn Risk Levels')
plt.show()

**Conclusion:**

- For `age`, the p-value > 0.05, so we **fail to reject** $H_0$. There's no significant difference in average age across churn risk scores.

- For `days_since_last_login`, the p-value is less than 0.05, so we **reject** $H_0$. There are statistically significant differences in mean values across churn risk scores.

	-	The pairwise comparisons between churn risk scores and the t-SNE visualization of churn risk levels reveal two distinct customer groups based on their login behavior. Customers with risk scores of 1 and 2 show no significant difference in their `days_since_last_login`, indicating similar recent activity levels and suggesting a low likelihood of churn. Customers with risk scores of 3, 4, and 5 likewise show no significant differences among themselves, but they differ clearly and significantly from the 1 and 2 group. This separation implies that risk scores 3, 4, and 5 are associated with customers who have not logged in for longer periods, pointing to a higher risk of churn. The churn risk score therefore distinguishes between low-risk (scores 1 and 2) and high-risk (scores 3, 4, and 5) customer segments based on login activity.

### **Two-sample t-test**

We'll apply a t-test with Welch's correction to the columns with unequal variances (`avg_time_spent`, `avg_transaction_value`, `avg_frequency_login_days`, and `points_in_wallet`), but now we'll group by the `churn_risk_level` column created above.

**Research Question:** Do behavioral time spent, transaction value, login frequency, and loyalty points significantly differ between customers with high and low churn risk?

$H_0$: There is no difference in the mean values of the `'avg_time_spent'`, `'avg_transaction_value'`, `'avg_frequency_login_days'`, and `'points_in_wallet'` columns between customers with high and low churn risk.

$H_1$: There is a significant difference in at least one of these features between customers with high and low churn risk.
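
For readers who want to see what Welch's correction changes, the sketch below computes the statistic and the Welch-Satterthwaite degrees of freedom by hand and checks them against `scipy.stats.ttest_ind(..., equal_var=False)`. The two arrays are synthetic stand-ins for the high and low churn risk groups, and `welch_t` is a hypothetical helper, not the function pingouin uses internally.

```python
import numpy as np
from scipy import stats

def welch_t(x, y):
    """Two-sided Welch t-test computed from first principles."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Squared standard error of each group mean (sample variance / group size)
    se2_x = x.var(ddof=1) / x.size
    se2_y = y.var(ddof=1) / y.size
    t_stat = (x.mean() - y.mean()) / np.sqrt(se2_x + se2_y)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (x.size - 1) + se2_y ** 2 / (y.size - 1))
    p_val = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_val

rng = np.random.default_rng(0)
high = rng.normal(loc=10.0, scale=1.0, size=200)  # synthetic "high risk" group
low = rng.normal(loc=10.5, scale=4.0, size=80)    # synthetic "low risk" group, larger variance

t_stat, df, p_val = welch_t(high, low)
reference = stats.ttest_ind(high, low, equal_var=False)
print(f't = {t_stat:.3f}, df = {df:.1f}, p = {p_val:.4f}')
```

The degrees of freedom come out below the pooled value of $n_1 + n_2 - 2$, which is how the correction makes the test more conservative when the variances differ.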

In [None]:
train_copy.select_dtypes('number').columns[2:6].tolist()

In [None]:
for num_col in train_copy.select_dtypes('number').columns[2:6]:
	print('-'*30, f'\n{num_col} vs. churn_risk_level')
	print(pg.ttest(
		x=train_copy[train_copy['churn_risk_level'] == 'high'][num_col],
		y=train_copy[train_copy['churn_risk_level'] == 'low'][num_col],
		alternative='two-sided',
		correction=True
	))

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()  # make it easier to index with a single loop

for idx, col in enumerate(train_copy.select_dtypes('number').columns[2:6]):
	sns.boxplot(data=train_copy, hue='churn_risk_level', x=col, ax=axes[idx])
	axes[idx].set_title(f'{col} vs churn_risk_level')

plt.tight_layout()
plt.show()

**Conclusion:** 

- For all four columns (`'avg_time_spent'`, `'avg_transaction_value'`, `'avg_frequency_login_days'`, and `'points_in_wallet'`), the p-value < 0.05, so we **reject** $H_0$ for each of them and conclude that their means differ significantly between the high and low churn risk levels.

- High churn risk is consistently linked to lower engagement metrics (time spent, transaction value, loyalty points) but higher login frequency, which may indicate that these customers are encountering service frustrations.

### **Chi-squared test of independence**

In [None]:
train_copy.select_dtypes(object).columns.tolist()

In [None]:
# extracted from the FeatureEng custom transformer in the FeatureEngineering notebook in milestone 2
positive_feedback = ['Products always in Stock', 'Quality Customer Care', 'Reasonable Price', 'User Friendly Website']
negative_feedback = ['Poor Website', 'Poor Customer Service', 'Poor Product Quality', 'Too many ads']

def get_sentiment(feedback):
	if feedback in positive_feedback:
		return 'positive'
	elif feedback in negative_feedback:
		return 'negative'
	else:
		return 'neutral'
	
train_copy['feedback'] = train_copy['feedback'].apply(get_sentiment)

In [None]:
associated_cols = []
for cat_col in train_copy.select_dtypes(object).columns:
	if cat_col in ['joining_date', 'last_visit_time', 'churn_risk_level']:
		continue
	print('-'*30, f'\n{cat_col} vs. churn_risk_level')
	expected, observed, stats = pg.chi2_independence(data=train_copy, x='churn_risk_level', y=cat_col)
	print(stats[stats['test'] == 'pearson'], '\n')

	if stats[stats['test'] == 'pearson'].loc[0, 'pval'] <= ALPHA:
		associated_cols.append(cat_col)

In [None]:
print(len(associated_cols))
associated_cols

In [None]:
membership_order = ['No Membership', 'Basic Membership', 'Silver Membership',
					'Gold Membership', 'Platinum Membership', 'Premium Membership']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.flatten()  # make it easier to index with a single loop

for idx, associated_col in enumerate(associated_cols):
	props = train_copy.groupby('churn_risk_level')[associated_col].value_counts(normalize=True).unstack()

	if associated_col == 'membership_category':
		props = props[membership_order]

	props.plot(kind='bar', stacked=True, rot=0, ax=axes[idx])
	axes[idx].set_title(f'{associated_col} vs churn_risk_level')
	axes[idx].set_ylabel('Proportion')
	# `bbox_to_anchor=(1.05, 1)` places the legend outside the right of the plot
	axes[idx].legend(
		title=associated_col,
		bbox_to_anchor=(1.05, 1),
		loc='upper left',
		borderaxespad=0.
	)

plt.tight_layout()
plt.show()

**Conclusion**:

Among all the associated columns, `membership_category` and `feedback` show the strongest relationship with the `churn_risk_level`.

-	**Membership Category**:

	-	Notably, 0% of customers with low churn risk hold either ‘No Membership’ or ‘Basic Membership’, while these two categories make up a large portion of customers with high churn risk (around 50%).
	
	-	In contrast, Silver, Gold, Platinum, and Premium Memberships are more common among customers with low churn risk.
	
	-	This suggests that higher-tier memberships are associated with lower churn risk, possibly because such customers are more engaged or receive more value from the service.

-	**Feedback**:

	-	100% of customers with low churn risk left positive feedback, with 0% negative or neutral feedback, indicating a strong satisfaction level.
	
	-	On the other hand, customers with high churn risk provided no positive feedback at all: 80% gave negative feedback and the remaining 20% gave neutral feedback.
	
	-	This highlights a clear association between customer satisfaction and churn risk. Dissatisfied customers (those giving negative or neutral feedback) are far more likely to fall into the high churn risk category.
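
One way to quantify which column has the "strongest relationship" is an effect size such as Cramér's V, which rescales the chi-squared statistic to the range 0 to 1. The sketch below is an illustration on made-up contingency tables, and `cramers_v` is a hypothetical helper rather than part of this notebook's pipeline.

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for a contingency table (no continuity correction)."""
    table = np.asarray(table)
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))

# Made-up 2x2 tables: rows are churn risk levels, columns are categories
perfect_split = [[50, 0], [0, 50]]   # each level maps to exactly one category
no_relation = [[25, 25], [25, 25]]   # categories spread evenly across levels

v_perfect = cramers_v(perfect_split)
v_none = cramers_v(no_relation)
print(v_perfect, v_none)
```

On the notebook's data, the same helper could be applied to a table such as `pd.crosstab(train_copy['churn_risk_level'], train_copy['feedback'])`. Values near 1 indicate that the category almost fully determines the churn risk level, while values near 0 indicate little association even when the p-value is small.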

