# Customer Churn - Telecom
---

### CRISP-DM Methodology
This project follows the CRISP-DM (*Cross-Industry Standard Process for Data Mining*) framework applied to **Customer Retention & Churn Prediction**:
| **Stage** | **Objective** | **Methodological Execution** |
| :--- | :--- | :--- |
| **1. Business Understanding** | Mitigate revenue loss by identifying at-risk customers. | • **Target Definition**: Binary Classification (Churn: Yes/No).<br>• **KPIs**: Maximize **Lift** in retention campaigns & Revenue Saved vs. Cost. |
| **2. Data Understanding** | Detect patterns of friction and dissatisfaction. | • **EDA**: Distribution analysis (Detect Imbalance).<br>• **Hypothesis Testing**: Correlation Matrix & Independence Tests (Chi-Square). |
| **3. Data Preparation** | Construct a robust dataset for parametric modeling. | • **Scaling**: Standardization (Z-score) for coefficient comparability.<br>• **Encoding**: One-Hot Encoding for nominal variables.<br>• **Splitting**: Stratified Train/Test Split to preserve class ratio. |
| **4. Modeling** | Estimate Churn Probability | • **Algorithms**: Logistic Regression, SVM Linear, KNN, Regression, Decision Tree, Random Florest, XGBoost, LightGBM.<br>• **Inference**: Analyze **Odds Ratios** to determine feature elasticity. |
| **5. Evaluation** | Assess model reliability and financial impact. | • **Discrimination**: AUC-ROC & F1-Score (Threshold Tuning).<br>• **Calibration**: Probability Calibration Curve (Reliability Diagram). |
| **6. Deployment** | Integrate insights into the CRM lifecycle. | • **Deliverable**: "High-Risk" Customer List for Marketing Squad.<br>• **Artifact**: Serialize model (`joblib`) for batch inference. |

---

### Installs:

In [0]:
%%capture
%pip install -r '../requirements.txt'
# Command to restart the kernel and update the installed libraries
%restart_python

### Imports:

In [0]:
# Data Analize and Visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import skew
from scipy.stats import mannwhitneyu, pointbiserialr, chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Data Modeling
from sklearn.model_selection import train_test_split

In [0]:
# SRC/ Functions Utils:
import sys
sys.path.append('../src')
from visualization import GraphicsData
from utils import EDATest, optimize_dtypes

### Load the data

In [0]:
df = pd.read_csv('../data/ChurnData.csv')

### Verify successful load with some randomly selected records


In [0]:
df.sample(9)

### 1. Business Understanding
---

##### 1.1 Situation Assessment

> The telecommunications industry operates in a mature, highly saturated environment with intense competitive pressure.

Given elevated **Customer Acquisition Costs (CAC)**, the profitability strategy has shifted from customer acquisition toward **customer retention optimization**.

Internal financial analysis indicates that losing a high-tenure customer is approximately **5x more expensive** than acquiring a new one, primarily due to:

- Loss of predictable recurring revenue  
- Erosion of Customer Lifetime Value (LTV)  
- Increased volatility in revenue forecasting  

Retention, therefore, becomes not merely a marketing initiative but a **core financial lever**.

---

##### 1.2 The Business Problem

> The company is facing an unexplained increase in customer attrition (churn).

Current operational methods rely on reactive, rule-based logic (e.g., cancellation triggers based on usage drops). These approaches:

- Capture symptoms rather than root behavioral drivers  
- Ignore multivariate interaction effects  
- Fail to detect latent risk patterns  

The marketing team requires a **proactive predictive mechanism** capable of identifying high-risk customers **before the cancellation decision materializes**.

---

##### 1.3 Strategic Objectives

1. **Primary Objective – Revenue Protection**  
   Accurately identify customers with high churn probability to reduce preventable revenue leakage.

2. **Strategic Objective – Driver Identification (Explainability)**  
   Understand the structural determinants of churn.  
   Example questions:
   - Is churn price-sensitive (e.g., `wiremon`)?
   - Is it driven by service portfolio composition?
   - Is it linked to lifecycle instability?

   Interpretability will inform product and pricing strategy.

3. **Financial Objective – ROI Optimization**  
   Maximize **Return on Retention Investment (RRI)**.

   **Operational Constraint:**  
   Retention incentives (discounts, offers) cannot be distributed indiscriminately.  
   Targeting must focus on **high-risk and economically valuable customers**.

---

### 2. Data Understanding:
---

#### Dataset: `Telecommunications Churn Database`

- This dataset captures a historical snapshot of customer demographics, service usage patterns, and account status. The primary objective is to correlate these features with the attrition event (**Churn**) to isolate behavioral drivers of dissatisfaction.
---

#### Variables Dictionary:

**1. Target Variable (The Outcome)**
* **churn** *Integer (Binary)* - The classification target. `1` = Customer left (Voluntary Churn); `0` = Stayed.

**2. Customer Demographics & Profile**
* **custcat** *Categorical* - Customer category classification (1-4). Represents the customer segment/cluster.
* **tenure** *Integer* - Months with the company (Proxy for *Loyalty*).
* **age** *Integer* - Customer's age in years.
* **address** *Integer* - Years living at current address (Stability indicator).
* **ed** *Categorical* - Education level (1-5).
* **employ** *Integer* - Years with current employer.
* **income** *Continuous* - Annual household income (in thousands).

**3. Service Subscriptions (Binary Portfolio)**
* **ebill** *Binary* - Electronic billing subscription (0/1). (Digital Adoption indicator).
* **equip** *Binary* - Equipment rental (0/1).
* **callcard** *Binary* - Calling card service (0/1).
* **wireless** *Binary* - Wireless service (0/1).
* **pager** *Binary* - Pager service (Legacy technology).
* **internet** *Binary* - Internet service (0/1).
* **voice** *Binary* - Voice mail service (0/1).
* **callwait** *Binary* - Call waiting service (0/1).
* **confer** *Binary* - Conference calling service (0/1).

**4. Billing & Usage (Monthly Dynamics)**
* **longmon** *Continuous* - Average monthly long-distance usage ($).
* **tollmon** *Continuous* - Average monthly toll-free usage ($).
* **equipmon** *Continuous* - Average monthly equipment rental charges ($).
* **cardmon** *Continuous* - Average monthly calling card charges ($).
* **wiremon** *Continuous* - Average monthly wireless service charges ($).


### 2. Data Understanding
---

#### Dataset Overview: `Telecommunications Churn Database`

> The dataset represents a historical cross-sectional snapshot of customer behavior, demographic structure, service portfolio composition, and billing dynamics.

The primary analytical objective is to model the relationship between these attributes and the attrition event (**Churn**) in order to isolate structural and behavioral drivers of customer disengagement.

---

##### 1. Target Variable (Outcome)

- **churn** *(Binary Integer)*  
  Classification target:
  - `1` → Voluntary churn  
  - `0` → Retained customer  

This variable defines the supervised learning objective.

---

##### 2. Customer Demographics & Profile (Structural Stability Layer)

These variables capture socioeconomic and lifecycle characteristics:

- **custcat** *(Categorical)* – Customer segment (1–4); behavioral cluster indicator  
- **tenure** *(Integer)* – Months with the company (loyalty proxy)  
- **age** *(Integer)* – Age in years  
- **address** *(Integer)* – Years at current residence (residential stability proxy)  
- **ed** *(Categorical)* – Education level (1–5)  
- **employ** *(Integer)* – Years with current employer (professional stability proxy)  
- **income** *(Continuous)* – Annual household income (in thousands)  

These variables form the **stability and demographic backbone** of the churn hypothesis.

---

##### 3. Service Subscriptions (Binary Portfolio Structure)

Binary indicators representing service adoption:

- **ebill** – Electronic billing (digital engagement proxy)  
- **equip** – Equipment rental  
- **callcard** – Calling card subscription  
- **wireless** – Wireless service subscription  
- **pager** – Pager service (legacy technology)  
- **internet** – Internet subscription  
- **voice** – Voicemail service  
- **callwait** – Call waiting feature  
- **confer** – Conference calling feature  

These variables describe **portfolio breadth and technological adoption profile**.

---

##### 4. Billing & Usage Dynamics (Behavioral Intensity Layer)

Continuous variables capturing consumption magnitude:

- **longmon** – Monthly long-distance spending  
- **tollmon** – Monthly toll-free spending  
- **equipmon** – Monthly equipment rental charges  
- **cardmon** – Monthly calling card spending  
- **wiremon** – Monthly wireless service charges  

These features represent **economic engagement intensity** and may capture switching cost dynamics or service anchoring effects.

---


#### Exploratory Data Analysis (EDA):
---

#### 2.1 Univariate Analysis:
---

Examines the behavior of **a single variable** in isolation to understand its distribution, central tendency, and dispersion (e.g., histograms and means), without seeking relationships with other data.

In [0]:
df.head()

In [0]:
df.info()

##### Adjusting the variable types with their respective characteristics. 
---
- In this data, there are both binary and ordinal variables; I will be adjusting them so that there is no invalid statistical aggregation in the analyses.

In [0]:
df = optimize_dtypes(df)

print(f'New dtypes of variables:')
df.info()

print(f'Visual sample:')
df.head()

In [0]:
df.isnull().sum()

In [0]:
df[df.duplicated(keep = False)]

In [0]:
data_describe = df.describe()
data_describe

In [0]:
data_describe.loc['min']

In [0]:
data_describe.loc['max']

In [0]:
data_describe.loc['mean']

In [0]:
numericals_cols = [
    'tenure', 'age', 'employ', 'address', 'income', 'longmon', 'tollmon', 'equipmon', 'cardmon', 'wiremon',
]
GraphicsData(data = df[numericals_cols]).numerical_histograms()

In [0]:

GraphicsData(data = df[numericals_cols]).numerical_boxplots(showfliers = True, showmeans = True)

In [0]:
categorical_cols = [
    'ed', 'equip', 'callcard', 'wireless', 'voice', 'pager', 'internet',
    'callwait', 'confer', 'ebill', 'custcat'
]
GraphicsData(data = df[categorical_cols]).categorical_countplots()

In [0]:
GraphicsData(data = df).plot_target_analysis(target_col='churn', colors=['#1abc9c', '#ff6b6b'])

##### Key Observations
---

##### 1. Data Integrity & Structural Quality
---

- Dataset composed of **200 records and 22 active variables**
- No missing values detected
- No duplicate records identified

> The dataset demonstrates high structural integrity.  
No imputation or deduplication procedures were required, allowing the analysis to proceed directly to statistical exploration and modeling while minimizing bias introduced by preprocessing interventions.

---

##### 2. Numerical Variables Analysis
---

##### Income

- Extremely high skewness (skew ≈ 9.96)
- Approximately 6.5% upper outliers

> The distribution indicates strong income concentration, with the majority of customers clustered within a standard range and a small segment exhibiting substantially higher purchasing power.

The extreme values appear to represent legitimate high-income customers rather than data errors.  
Therefore, mathematical transformations (e.g., logarithmic scaling) are preferable to removal.

---

##### Consumption Variables (`longmon`, `cardmon`, `wiremon`)

- Highly skewed distributions
- Distributional pattern similar to income

> This suggests a structural relationship between purchasing power and consumption intensity.  
However, this assumption must be validated through multivariate analysis to avoid conclusions based solely on distributional similarity.

---

##### Niche Services (`tollmon`, `equipmon`)

- Zero-inflated distributions
- Large concentration of non-users

> These products exhibit low penetration within the customer base, characterizing them as niche services.

While their isolated explanatory power may be limited, they could become relevant when evaluated through interaction effects.

---

##### Demographics (`age`)

- Relatively balanced distribution
- Higher concentration below ~50 years
- Tail extending to 76 years

> The dataset includes a predominance of economically active adults, alongside a meaningful representation of older customers.  
This demographic diversity supports segmentation analysis by age group.

---

##### Loyalty Indicator (`tenure`)

- Broad distribution
- Presence of both new and long-standing customers

> The company demonstrates simultaneous acquisition and retention capacity.  
The coexistence of early- and late-lifecycle customers enables consistent temporal analysis of churn dynamics.

---

##### 3. Outlier Assessment
---

- Outlier volume proportional to sample size
- Primary concentration in `income` and `longmon`

> Extreme values likely represent high-value customers rather than recording errors.  
These observations carry strategic importance and should not be indiscriminately removed.

---

##### 4. Categorical Variables Analysis
---

##### Educational Profile (`ed`)

- Concentration in intermediate levels (2 and 4)
- Level 5 represents a minority segment

> The customer base is predominantly composed of mid-level education profiles.  
Low-support categories should be interpreted cautiously due to reduced statistical power.

---

##### Low-Adoption Products (`wireless`, `voice`, `pager`)

- Penetration below 30%

> These services show limited adoption within the current portfolio.

They may represent:
- Expansion opportunities  
- Niche products  
- Low attractiveness for the dominant customer profile  

Strategic relevance will depend on their association with churn behavior.

---

##### Core Product (`callcard`)

- Approximately 70% adoption

> This appears to be the most penetrated product in the portfolio, potentially functioning as a core contractual or entry-level offering.

---

##### Moderate Adoption Products (`internet`, `ebill`, `confer`)

- Intermediate penetration rates

> These services present growth potential within the existing customer base.  
They are neither niche nor universally adopted, positioning them as potential leverage points for cross-selling strategies.

---

##### 5. Churn Rate (≈29%)
---

- Overall churn rate of approximately 29%

> This level is materially elevated and signals the need for structured retention strategies.

Although telecommunications is inherently competitive, a churn rate near 30% represents a significant revenue risk and a strong opportunity for data-driven intervention.

---


#### 2.2 Bi-Variate Analysis:
---

Explores the statistical relationship between **two variables simultaneously**, aiming to identify associations, correlations, patterns of separation, or structural dependencies (e.g., scatterplot of `Income` vs. `Debt`, boxplot of `tenure` vs. `churn`).

---

##### Methodological Note:
---

- Prior to conducting the bivariate analysis, the dataset will be partitioned into **Train** and **Test** sets.  
The primary objective is to prevent **Data Leakage**, ensuring that all statistical insights, outlier treatments, transformation decisions, and feature engineering strategies are derived exclusively from the training data.

> This approach preserves the integrity of the validation process and guarantees that performance metrics reflect true generalization capacity rather than information contamination.

- The `stratify` parameter will be applied within the `train_test_split` procedure.  
Given that churn prediction is a **binary classification problem**, maintaining the original **class proportion** across both subsets is statistically mandatory.

> Stratified sampling preserves the prior probability distribution of the target variable, avoiding distortions in class balance that could bias model training, threshold calibration, and performance evaluation metrics (e.g., Recall, Precision, ROC-AUC).

---


In [0]:
train_set, test_set = train_test_split(df, test_size = 0.2, stratify = df['churn'], shuffle = True, random_state = 33)

In [0]:
# Checking the proportions of the target variable
print(f'Shape of training: {train_set.shape}')
print(f'Shape of test: {test_set.shape}')

print('\n--- Churn Rate (Stratify Validation) ---')
print(f'Original: {df['churn'].mean():.2%}')
print(f'Train:    {train_set['churn'].mean():.2%}')
print(f'Test:    {test_set['churn'].mean():.2%}')

In [0]:
train_set.head()

##### Checking the correlations between the variables

In [0]:
train_set.corr()['churn'].abs().sort_values(ascending = False)

In [0]:
GraphicsData(data = train_set).correlation_heatmap()


##### Key Observations:
---
##### 1. Retention Factors (Negative Correlation)
---
##### Tenure (`tenure`)
> Presents the strongest negative correlation with Churn (**-0.35**).  
This empirically reinforces the Survival Analysis premise: the probability of cancellation decreases as customer lifetime increases. The highest-risk profile is concentrated in early lifecycle stages.

##### Stability Indicators (`address`, `employ`, `age`)** 
> Variables associated with residential, professional, and life stability act as protective factors against churn.  
The loyal customer profile is statistically more mature, established, and less behaviorally volatile.

##### Voice Consumption (`longmon`, `callcard`)
> Higher expenditure on traditional voice services is negatively associated with churn.  
This suggests stronger product dependency and potential *switching costs* within the legacy voice ecosystem compared to digital services.
---
##### 2. Risk Factors (Positive Correlation)
---
##### Education (`ed`)
> Displays a positive correlation (**0.24**) with Churn.  
More educated customers may exhibit higher price sensitivity, greater market awareness, and lower informational switching barriers.

##### Digital Services (`equip`, `internet`, `wireless`)
> Adoption of modern services is associated with **higher churn probability**.
*Hypothesis:* These services operate in highly competitive and commoditized environments, where price elasticity is higher and differentiation is lower compared to traditional voice offerings.

##### Digital Billing (`ebill`)
> Positive correlation with Churn (**0.21**) suggests a behavioral segmentation effect.  
Users of digital billing tend to be younger and more technologically oriented, typically presenting lower brand attachment and reduced friction to provider switching.

---
##### 3. Neutral and Categorical Variables
---
##### Income (`income`)
> The weak linear correlation (-0.09) must be interpreted cautiously due to extreme skewness (Skew ≈ 9.96).
The relationship between income and churn is likely **non-linear**, requiring logarithmic transformation to uncover its true predictive signal.

##### Customer Category (`custcat`)
> As a nominal variable, its influence cannot be properly captured by Pearson correlation.  
Its effect should be validated through visual inspection (e.g., churn rate per category) or statistical tests appropriate for categorical predictors.

---

##### 4. Multicollinearity Alert

> There are structurally redundant relationships between binary service indicators and their respective monetary counterparts:

- **`equip` vs `equipmon`** (0.95)  
- **`wireless` vs `wiremon`** (0.89)  

> Additional relevant correlations:

- **`longmon` vs `tenure`** (0.77)  
Older customers tend to spend more on long-distance services. 

- **`address` vs `age`** (0.74)  
There is contextual coherence: older customers tend to remain longer at the same address.  

---

##### Analyzing the numerical variables and their relationships with the target variable.

In [0]:
numericals_cols = [
    'tenure', 'age', 'employ', 'address', 'income', 'longmon', 'tollmon', 'equipmon', 'cardmon', 'wiremon', 'churn'
]
GraphicsData(data = df[numericals_cols]).numerical_histograms(hue = 'churn')

In [0]:
GraphicsData(data = df[numericals_cols]).numerical_boxplots(hue = 'churn', showmeans = True)


##### Key Observations:
---

##### 1. Customer Lifecycle (`tenure`)

##### Averages
> Churners **(22.78 months)** vs. Non-Churners **(40.22 months)**.

##### Survival Analysis Perspective

> There is a clear "Critical Risk Zone" concentrated between 0 and 22 months.  
The observed churn concentration within this window suggests that once a customer surpasses the initial ~2-year threshold, the probability of retention increases substantially, effectively doubling long-term loyalty likelihood.

---

##### 2. The Stability Factor (`age`, `employ`, `address`)

##### Observed Pattern
> There is a direct structural relationship between life stability indicators and customer loyalty.

##### Age  
> Loyal customers are, on average, 8 years older (43 years) than churners (35 years).  
The segment above 65 years old exhibits near-zero churn, indicating extremely low behavioral volatility.

##### Employment Stability
> Retained customers present an average employment duration of 12 years, more than **double** that of churners (5.5 years).

##### Residential Stability
> Address permanence follows the same proportional pattern (13 years vs. 7.5 years).

---

##### 3. The Income–Digital Services Paradox (`income`, `wiremon`, `equipmon`)

##### Economic Profile 
> Churners present an average income **approximately 31% lower (56k) than retained customers (82k)**.

##### Observed Paradox
> Despite lower income and reduced stability, churners demonstrate higher engagement with modern and premium services (e.g., wireless usage and equipment rental via `equipmon`).

##### Business Interpretation
> There is evidence of a potential **Product–Market Fit misalignment**.
Higher-cost digital services are being adopted by a financially unstable and younger segment, increasing exposure to default risk or rapid cancellation.  
Conversely, the wealthier and more stable segment remains concentrated in legacy, lower-cost products.

---

##### 4. Loyalty Anchor (`longmon`, `cardmon`)

##### Voice Consumption 
> Traditional voice usage emerges as a primary loyalty differentiator.  
Retained customers spend nearly twice as much on long-distance services (13.6 vs. 7.2).

##### Strategic Insight
> Voice services function as an anchoring product.  
Customers who primarily use telephony for voice communication exhibit higher retention, whereas customers oriented toward internet/data services demonstrate greater churn propensity.

---

##### 5. Low-Signal Variable (`tollmon`)

##### Statistical Irrelevance
> The `tollmon` variable does not show meaningful separation between churners and non-churners in terms of central tendency or distribution.
---

##### Analyzing the categorical variables and their relationships with the target variable.

In [0]:
categorical_cols = [
    'ed', 'equip', 'callcard', 'wireless', 'voice', 'pager', 'internet',
    'callwait', 'confer', 'ebill', 'custcat', 'churn'
]
GraphicsData(data = df[categorical_cols]).categorical_countplots(hue = 'churn')

In [0]:
GraphicsData(data = df[categorical_cols]).categorical_bar_percentages(hue = 'churn')


##### Key Observations:
---

##### 1. Educational Level (`ed`)

##### The Effect of "Smart Switching"
> A consistent linear increase in churn rate is observed as educational level rises, reaching a critical peak of **47.1%** at Level 5.

##### Behavioral Diagnosis
> Highly educated customers tend to treat telecommunications services as a *commodity*.  
They exhibit **low switching costs**, high price sensitivity, and rational decision-making patterns.  
Unlike inert or legacy segments, this profile actively compares market offers and responds to marginal economic advantages.

##### Strategic Implication
> Retention strategies for this segment should emphasize **financial competitiveness and value differentiation**, rather than emotional engagement or loyalty-based messaging.

---

##### 2. Digital Services and Equipment (`equip`, `internet`, `wireless`)

##### The "Toxic Trio"
> The data indicates a structural retention weakness associated with value-added digital services.  
Customer retention drops significantly upon adoption of these products.

##### Internet
> Churn rate increases from **18.8%** (non-users) to **42.0%** (users).

##### Equipment Rental (`equip`)
> Even more critical, churn escalates from **18.3%** to **43.5%**, signaling a potential friction point in pricing or perceived value.

##### Product–Market Fit Alert
> These findings suggest that the company's digital infrastructure and hardware rental policies may be functioning as **churn accelerators**.  
The commercialization of these services among inherently unstable or price-sensitive segments (e.g., younger users) appears to intensify competitive migration rather than strengthen loyalty.

---

##### 3. Voice Services (`callcard`, `confer`, `voice`)

##### Retention Anchors (with an exception)

> Functional voice-related features generally act as retention drivers, though the category presents internal heterogeneity.

##### Phone Card (`callcard`)
> A strong retention factor.  
Users exhibit significantly lower churn (**19.9%**) compared to non-users (**50.9%**), reinforcing the anchoring effect of traditional voice services.

##### Conference (`confer`)
> Confirms the retention pattern, with users demonstrating lower cancellation rates relative to non-users.

##### The "Anomalous" Variable (`voice`)
> Voicemail contradicts the broader voice-service trend.  
Users present higher churn (**39.0%**) than non-users (**24.8%**).

##### Interpretation
> The `voice` feature behaves more like a friction-generating digital add-on than a loyalty-enhancing voice product.  
It may introduce perceived cost without proportional value, resembling the behavioral dynamics observed in digital services.

---

##### 4. Customer Segmentation (`custcat`)

##### Segment Disparity
> There is substantial heterogeneity across customer classes, indicating structurally different risk profiles.

##### Class 3 — "Gold Segment"
> Extremely low churn rate (**8.3%**), representing a high-value, stable segment with strong retention dynamics.

##### Class 4 — "Danger Zone"
> Critically elevated churn rate (**45.6%**), suggesting structural dissatisfaction, misalignment, or financial instability within this segment.
---


#### 2.3 Multi-Variate Analysis:
---

> Analyzes **three or more variables simultaneously** in order to uncover complex interaction effects, hidden dependency structures, and latent behavioral patterns that cannot be identified through univariate or bivariate approaches alone.
This stage is essential for understanding how combinations of demographic, behavioral, and financial variables jointly influence churn probability.

---


##### Multivariate Strategy: Feature Engineering & Statistical Validation
---

##### 1. Feature Engineering Focus

> At this stage, the primary focus will be on **creating new derived features**, exploring interactions, ratios, linear combinations, and non-linear transformations among the original attributes.

The objective is to capture latent patterns that are not evident in isolated analyses, such as:

- Interactions between stability and consumption (`age × longmon`)  
- Proportional relationships (`equipmon / income`)  
- Transformations to reduce skewness (e.g., `log(income)`)  
- Composite indicators of risk or stability  

> Feature engineering aims to enhance the explanatory power of the model by incorporating implicit behavioral structures embedded in the data.

---

##### 2. Statistical Significance Testing

> After constructing the new variables, appropriate **statistical tests** will be applied to evaluate their association with the target variable (`churn`).

Depending on the variable type:

- Numerical variables → Tests such as *Mann-Whitney U* or *t-test*  
- Categorical variables → *Chi-Square Test*  
- Correlation analysis → Linear or non-linear association measures  

> The goal is to verify whether the newly created features demonstrate **statistical significance**, thereby reducing the inclusion of noise and strengthening the robustness of the modeling phase.

---


##### Optimization of Numerical Variables
---

- Certain financial and consumption-related variables exhibit **long-tailed distributions** (a high concentration of low values and a small number of extreme observations), which can distort the learning dynamics of several machine learning algorithms, particularly those sensitive to scale and distributional assumptions.

- **Action:** A logarithmic transformation will be applied to normalize and compress these distributions.

> **Objective:** To stabilize variance, reduce the influence of extreme outliers, and enhance the model’s ability to detect subtle behavioral patterns.  
> This transformation enables better differentiation of at-risk clients, regardless of their income or consumption level.
---

In [0]:
# Columns for transfomations
skew_columns = ['income', 'longmon', 'cardmon', 'wiremon']

for col in skew_columns:
    train_set[f'log_{col}'] = np.log1p(train_set[col])

In [0]:
train_set.head()

In [0]:
skew_columns = ['income', 'longmon', 'cardmon', 'wiremon', 'log_income', 'log_longmon', 'log_cardmon', 'log_wiremon', 'churn']
GraphicsData(data = train_set[skew_columns]).numerical_histograms(hue = 'churn')

In [0]:
skew_columns = ['income', 'longmon', 'cardmon', 'wiremon', 'log_income', 'log_longmon', 'log_cardmon', 'log_wiremon', 'churn']
GraphicsData(data = train_set[skew_columns]).correlation_heatmap()


##### Note:
---
##### Impact Analysis: Log Transformation

##### Skewness Mitigation
>The application of the `log1p` transformation effectively normalized the distributions, substantially reducing skewness in the critical numerical variables.

#####Signal Gain (Pearson Correlation)
> A measurable increase in linear association with the target variable (`churn`) was observed:

  - **Income:** Increased from **0.09** to **0.13** (**+44% relative gain**).
  - **Longmon:** Improved from **0.26** to **0.29**.
  - **Cardmon (Highlight):** Demonstrated the most expressive improvement, increasing from **0.13** to **0.24**, nearly doubling its linear relevance.

##### Latent Pattern Discovery
> The transformation of `cardmon` exposed a previously obscured **bimodal structure** in the original distribution. The logarithmic scaling visually separated two distinct behavioral subpopulations:

  - 1. **Left Peak:** Casual users (*low usage intensity*).
  - 2. **Right Peak:** Heavy users (*high usage intensity*).

> This structural separation enhances the capacity of tree-based algorithms to create more meaningful decision splits, potentially improving segmentation quality and predictive performance.

---

##### Creating News Features

In [0]:
# Aggregations of costs (Total Wallet Share)
# All expenses related to monthly services will be added together.
mon_cols = ['longmon', 'tollmon', 'equipmon', 'cardmon','wiremon']
train_set['total_spend'] = train_set[mon_cols].sum(axis = 1)


# Since 'income' is in thousands (e.g., 20 = 20,000), I adjusted the scale.
# I added +1 to the denominator to avoid division by zero (safety).
# Accessibility Index (Affordability)
train_set['affordability_idx'] = train_set['total_spend'] / ((train_set['income'] * 1000) + 1)
# Longmon x Income
train_set['longmon_inc'] = train_set['longmon'] / ((train_set['income'] * 1000) + 1)
# Equipmon x Income
train_set['equipmon_inc'] = train_set['equipmon'] / ((train_set['income'] * 1000) + 1)
# Cardmon x Income
train_set['cardmon_inc'] = train_set['cardmon'] / ((train_set['income'] * 1000) + 1)
# Wiremonx Income
train_set['wiremon_inc'] = train_set['wiremon'] / ((train_set['income'] * 1000) + 1)

# Risk Feature (Toxicity + Education)
toxic_list = ['internet', 'wireless', 'equip', 'voice', 'pager']
train_set['toxic_score'] = train_set[toxic_list].sum(axis = 1)
train_set['toxic_ed'] = (train_set['toxic_score'] * train_set['ed'].astype('int64')).astype('float32')

# Behavioral Usage Features
# Ternure for longom
train_set['ternure_longmon'] = ((train_set['tenure'] / 12) * (train_set['longmon']) ).astype('float32')
# Ternure for cardmon
train_set['ternure_cardmon'] = ((train_set['tenure'] / 12) * (train_set['cardmon']) ).astype('float32')
# Age for longom
train_set['age_longmon'] = (train_set['age']  * train_set['longmon'] ).astype('float32')
# Age for cardmon
train_set['age_cardmon'] = (train_set['age'] * train_set['cardmon'] ).astype('float32')

# Stability Features
# Ternure for age 
train_set['stability_age'] =  ((train_set['tenure'] / 12) * (train_set['age'] - 18) ).astype('float32')
# Ternure for address 
train_set['stability_address'] = ((train_set['tenure'] / 12) * (train_set['address']) ).astype('float32')
# Ternure for address 
train_set['stability_employ'] = ((train_set['tenure'] / 12) * (train_set['employ']) ).astype('float32')

# Good Score 
train_set['good_score'] = train_set[['callcard', 'confer', 'callwait']].sum(axis=1)

In [0]:
new_features = [ 
    'log_income', 'log_longmon', 'log_cardmon',
    'log_wiremon', 'total_spend', 'affordability_idx', 'toxic_score',
    'toxic_ed', 'ternure_longmon', 'ternure_cardmon', 'stability_age',
    'stability_address', 'stability_employ', 'good_score', 'age_longmon', 'age_cardmon', 
    'longmon_inc', 'equipmon_inc', 'cardmon_inc', 'wiremon_inc','churn',
]
GraphicsData(data = train_set[new_features]).numerical_histograms(hue = 'churn')

#### 2.4 Statistical tests

##### Numerical Features
---

##### Mann-Whitney U Test
---
- For the Feature Selection stage, I adopted the Mann-Whitney U Test. This choice is due to the high skewness (> 1) of the numerical variables, which requires a non-parametric approach robust to outliers, where traditional tests (such as the T-test) would fail.
---

In [0]:
audit_vars = [ 
    'log_income', 'log_longmon', 'log_cardmon',
    'log_wiremon', 'total_spend', 'affordability_idx', 'toxic_score',
    'toxic_ed', 'ternure_longmon', 'ternure_cardmon', 'stability_age',
    'stability_address', 'stability_employ', 'good_score', 'age_longmon',
    'age_cardmon', 'longmon_inc', 'equipmon_inc', 'cardmon_inc', 'wiremon_inc','churn',
]
EDATest(data = train_set).mannwhitney_u_test(audit_vars =  audit_vars, target = 'churn')

In [0]:
train_set[audit_vars].describe()

##### Considerations on Statistical Validation
---

##### 1. Evidence of Predictive Signal

> The statistical testing phase confirmed the presence of structurally relevant predictors.

The validation was supported by two complementary criteria:

- **Statistical Significance (p-value):** Ensuring that observed differences are unlikely to be due to random variation.
- **Effect Size (Practical Impact):** Measuring the magnitude of separation between churners and non-churners, reinforcing business relevance beyond mere statistical detection.

This dual validation framework strengthens the reliability of the selected variables for downstream modeling.

---

##### 2. Feature Retention Strategy

> No variables were prematurely removed at this stage, even those with weaker statistical signals.

The technical rationale is grounded in the intrinsic regularization mechanisms of the selected algorithms:

##### Tree-Based Models

- Perform embedded feature selection via impurity reduction (e.g., Information Gain).
- Irrelevant variables are naturally excluded from meaningful decision splits.
- This reduces the risk of manual selection bias.

##### Linear Models

- Will incorporate **L1 Regularization (Lasso)**.
- Non-informative coefficients are penalized and driven toward zero.
- This shrinks noise variables and mitigates overfitting while preserving interpretability.

---

##### Strategic Implication

> The modeling phase will act as a secondary validation layer, allowing the algorithms themselves to determine final feature relevance.

This approach preserves potential latent signals while maintaining statistical rigor and model robustness.

---

##### Categorical Features

##### Chi Square Test

In [0]:
audit_vars = [
    'ed', 'equip', 'callcard', 'wireless', 'voice', 'pager', 'internet',
    'callwait', 'confer', 'ebill', 'custcat'
]
EDATest(train_set).chi_square_test(audit_vars = audit_vars, target = 'churn')

In [0]:
train_set['ed'].value_counts()

In [0]:
train_set['ed'] = train_set['ed'].astype('int64').replace({5: 4})
train_set['ed'] = train_set['ed'].astype('category')

In [0]:
EDATest(train_set).chi_square_test(audit_vars = audit_vars, target = 'churn')

##### Conclusion of Categorical Variable Testing
---

##### 1. Dominant Predictor

> The variable `custcat` emerged as the primary retention driver.

- Demonstrated **high statistical significance**.
- Exhibited strong **effect size**, indicating substantial separation between churners and retained customers.
- Structurally behaves as a core segmentation variable within the churn framework.

This confirms `custcat` as a central explanatory axis in the modeling phase.

---

##### 2. Sparsity Mitigation Strategy

> To address instability caused by low sample representation, a **feature coarsening** approach was applied to `ed`.

- Categories `4` and `5` were merged.
- Objective: reduce variance inflation and correct weak statistical support in sparse classes.
- Result: improved distributional balance and greater robustness for downstream modeling.

---

##### 3. Feature Selection Philosophy

> Boundary or low-signal variables were intentionally preserved.

The filtering process will be delegated to model-driven mechanisms:

- **Tree-Based Models:** Embedded selection via impurity reduction and Information Gain.
- **Linear Models:** Application of **L1 (Lasso) regularization** to penalize and shrink non-informative coefficients.

This strategy avoids premature elimination of potential latent signals while maintaining control over model complexity and overfitting risk.

---

##### Checking mixed correlations of categorical features and numerical features

In [0]:
audit_pairs = [
    ('equip', 'equipmon'),       
    ('wireless', 'log_wiremon'),     
    ('callcard', 'ternure_cardmon'),     
    ('ed', 'toxic_ed'),         
    ('internet', 'toxic_ed'),
    ('ebill', 'stability_age')
]
EDATest(train_set).mixed_redundancy_test(audit_pairs = audit_pairs)

##### Note: 
---
##### 1. Structural Collinearity Detection

> The application of the *Eta Score (η)* — a canonical metric for measuring variance explained between categorical and continuous variables — revealed strong structural redundancy in the following pairs:

- `equip` vs `equipmon`  
- `wireless` vs `log_wiremon`

The analysis demonstrated that the binary indicators explain nearly all variance of the corresponding continuous variables when values are equal to zero. In practical terms, the presence flag is deterministically embedded within the monetary variable.

This indicates functional duplication rather than complementary information.

---

##### 2. Dimensionality Rationalization

> Based on this redundancy, the categorical flags (`equip`, `wireless`) will be removed.

The continuous variables (`equipmon`, `log_wiremon`):

- Encode **service existence** (value = 0 implies absence).
- Capture **behavioral magnitude** (spending intensity).
- Provide strictly more informational entropy than the binary counterparts.



In [0]:
final_features = [
    'tenure', 'age', 'address', 'employ', 'tollmon', 'equipmon', 'log_income', 'log_longmon', 'log_cardmon',
    'log_wiremon', 'total_spend', 'affordability_idx', 'toxic_score', 'toxic_ed', 'ternure_longmon', 'ternure_cardmon', 'stability_age',
    'stability_address', 'stability_employ','good_score', 'ed', 'callcard', 'voice', 'pager', 'internet','callwait', 
    'confer', 'ebill', 'custcat',  'age_longmon', 'age_cardmon', 'longmon_inc', 'equipmon_inc', 'cardmon_inc', 'wiremon_inc',
    'churn'
]
train_set[final_features].head()

##### 2.5 EDA Conclusion

##### Multivariate & Univariate Synthesis – Churn Drivers
---

##### 1. Socioeconomic Stability Factors

> Statistical testing (Mann–Whitney for numerical variables and Chi-square for categorical variables) indicates that **socioeconomic stability variables exhibit strong association with churn**.

The most statistically robust variables were:

- **Tenure**
- **Age**
- **Employ**
- **Address**

All demonstrated:

- Extremely low p-values  
- Negative correlation with churn  
- Clear median separation between churners and retained customers  

###### Interpretation

Customers with greater relational, professional, and residential stability show **significantly lower churn propensity**.

Important:  
The findings indicate **robust statistical association**, not direct causality.

---

##### 2. Tenure as the Primary Driver

> Tenure emerged as the strongest individual predictor.

Observed characteristics:

- Highest statistical strength  
- Largest effect magnitude  
- Clear distributional separation  

Empirical pattern:

- Churn concentration below ~22 months  
- Substantial retention improvement beyond ~40 months  

###### Interpretation

Churn risk is highest during the early lifecycle phase and declines as the relationship matures.

Methodological note:  
Tenure is cumulative and may reflect a survival bias effect. This will be validated during modeling.

---

##### 3. Age Effect

Results:

- Mean churners: ~35 years  
- Mean retained: ~43 years  
- Significant negative correlation  

###### Interpretation

Younger customers exhibit higher churn rates, whereas older customers demonstrate stronger retention.

Hypotheses (not conclusions):

- Higher mobility  
- Greater price sensitivity  
- Different consumption profiles  

---

##### 4. Employment and Residential Stability

Both variables showed:

- Strong statistical significance  
- Clear intergroup separation  

###### Interpretation

Professional and residential stability are inversely associated with churn.

Address changes may be linked to:

- Service availability shifts  
- Competitive exposure  
- Economic transitions  

---

##### 5. Portfolio Intensity (Toxic Score)

Constructed variables:

`toxic_score = sum(internet, wireless, equip, voice, pager)`

 `toxic_ed = toxic_score * ed`


These measure **intensity of multi-service adoption**.

Results indicate strong association with churn.

###### Interpretation

Customers with higher service bundle intensity show higher churn rates.

Possible explanations:

- Competitive segment  
- Younger/digital demographic  
- Greater exposure to substitutes  

Potential collinearity with tenure and demographics will be evaluated in modeling.

---

##### 6. Anchoring Services

Variables: `longmon`, `cardmon`

Findings:

- Significant association  
- Higher average spend among retained customers  

###### Interpretation

These services may function as **retention anchors**, increasing switching costs.

---

##### 7. Education

Results:

- Significant Chi-square association  
- Category 5 → highest churn  
- Category 2 → lowest churn  

###### Interpretation

Higher education levels are associated with higher churn rates.

Hypotheses:

- Greater price sensitivity  
- Higher market awareness  
- Stronger service expectations  

---

##### 8. Customer Segmentation (`custcat`)

- Highest Cramer’s V among categorical variables  
- Strong separation across behavioral segments  

Category 4 → highest churn  
Category 3 → highest retention  

###### Interpretation

Behavioral segmentation contains substantial explanatory power for churn dynamics.

---

##### 9. Income & Total Spend

Results:

- Non-significant p-values  
- Weak correlation  
- High distributional skewness  

###### Interpretation

Income and total spending do not show robust univariate association with churn.

Technical explanation:

- Strong asymmetry  
- Outlier influence  

---

##### 10. Individual Services (Internet, Wireless, Equip, Voice, Pager)

Results:

- Significant association for Internet, Wireless, Equip  
- Moderate/weak for Voice and Pager  

###### Interpretation

Adoption of certain digital services correlates with higher churn.

Important:  
Association does not imply causation. These services may proxy for demographic or lifecycle factors.

---

##### 11. Tenure–Consumption Interactions

Variables: `tenure_longmon`, `tenure_cardmon`

Findings:

- Strong statistical significance  
- Reduced churn as tenure increases alongside consumption  

###### Interpretation

Retention effects of anchoring services may be amplified among long-tenure customers.

Independence of this effect will be validated in multivariate modeling.

---

##### 12. Log Transformations

Variables: `log_income`, `log_longmon`, `log_cardmon`, `log_wiremon`

Results:

- Reduced skewness  
- Improved statistical stability for some variables  
- No universal gain in predictive strength  

###### Interpretation

Log transformation mitigated outlier distortion but did not universally enhance explanatory power.

---

##### 13. Derived Stability Indices

Variables: `stability_age`, `stability_employ`, `stability_address`

Results:

- Strong statistical significance  
- Greater separation than original isolated variables  

###### Interpretation

Composite stability metrics may capture a latent socioeconomic dimension not fully represented individually.

---

##### 14. Low-Evidence Variables (Voice, Pager, Callwait, Confer)

Results:

- High p-values  
- Low association magnitude  

###### Interpretation

These variables did not demonstrate robust univariate signal.

They will not be discarded yet; relevance will be confirmed in modeling.

---

##### Modeling Hypotheses

1. Stability variables will be dominant churn predictors.  
2. Early lifecycle phase presents elevated churn risk.  
3. High portfolio intensity may increase churn probability.  
4. Higher education levels correlate with increased churn.  
5. Long-distance and card services may act as retention anchors.  
6. Income and total spending may not exert direct independent effects.

---