# Customer Churn - Telecom
---

### CRISP-DM Methodology
This project follows the CRISP-DM (*Cross-Industry Standard Process for Data Mining*) framework applied to **Customer Retention & Churn Prediction**:
| **Stage** | **Objective** | **Methodological Execution** |
| :--- | :--- | :--- |
| **1. Business Understanding** | Mitigate revenue loss by identifying at-risk customers. | • **Target Definition**: Binary Classification (Churn: Yes/No).<br>• **KPIs**: Maximize **Lift** in retention campaigns & Revenue Saved vs. Cost. |
| **2. Data Understanding** | Detect patterns of friction and dissatisfaction. | • **EDA**: Distribution analysis (Detect Imbalance).<br>• **Hypothesis Testing**: Correlation Matrix & Independence Tests (Chi-Square). |
| **3. Data Preparation** | Construct a robust dataset for parametric modeling. | • **Scaling**: Standardization (Z-score) for coefficient comparability.<br>• **Encoding**: One-Hot Encoding for nominal variables.<br>• **Splitting**: Stratified Train/Test Split to preserve class ratio. |
| **4. Modeling** | Estimate Churn Probability:<br>$$P(Y=1 \vert X) = \frac{1}{1+e^{-z}}$$ | • **Algorithm**: Logistic Regression (Baseline).<br>• **Inference**: Analyze **Odds Ratios** to determine feature elasticity. |
| **5. Evaluation** | Assess model reliability and financial impact. | • **Discrimination**: AUC-ROC & F1-Score (Threshold Tuning).<br>• **Calibration**: Probability Calibration Curve (Reliability Diagram). |
| **6. Deployment** | Integrate insights into the CRM lifecycle. | • **Deliverable**: "High-Risk" Customer List for Marketing Squad.<br>• **Artifact**: Serialize model (`joblib`) for batch inference. |
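
As a small illustration of the Modeling row, the sketch below evaluates the logistic function and converts a coefficient into an odds ratio. It assumes the usual linear score $z = \beta_0 + \sum_i \beta_i x_i$, which the table leaves implicit. The intercept and coefficient values are hypothetical placeholders, not estimates from this dataset.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values for illustration only (not fitted on this dataset)
intercept = -1.0
beta_tenure = -0.5  # effect per one standard deviation of tenure

z_low = intercept + beta_tenure * (-1.0)   # tenure one SD below the mean
z_high = intercept + beta_tenure * (1.0)   # tenure one SD above the mean

p_low, p_high = sigmoid(z_low), sigmoid(z_high)
odds_ratio = np.exp(beta_tenure)  # multiplicative change in odds per +1 SD

print(f"P(churn | low tenure)  = {p_low:.3f}")
print(f"P(churn | high tenure) = {p_high:.3f}")
print(f"Odds ratio for tenure  = {odds_ratio:.3f}")
```

An odds ratio below 1 means the odds of churn shrink as the feature grows; because the features are standardized, odds ratios are comparable across variables.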

---
#### Note:

Although the CRISP-DM Modeling phase typically involves comparing several algorithms to select the best performer, this project focuses, by scope definition, on implementing a baseline. Therefore, a Logistic Regression model will be developed, going through all stages of the cycle (analysis, preparation, and modeling) to validate the initial hypothesis.

### Installs:

In [0]:
%%capture
%pip install numpy==2.4.0
%pip install pandas==2.3.3
%pip install scikit-learn==1.8.0
%pip install matplotlib==3.10.8
%pip install seaborn==0.13.2

In [0]:
# Command to restart the kernel and update the installed libraries
%restart_python

### Imports:

In [0]:
# Data Analysis and Visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import skew
from scipy.stats import mannwhitneyu, pointbiserialr, chi2_contingency
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Data Modeling / Linear Model / Metrics / Save Model
from sklearn.model_selection import train_test_split, cross_val_score, KFold, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, ConfusionMatrixDisplay
import joblib

In [0]:
# SRC/ Functions Utils:
import sys
sys.path.append('../src')
from visualization import GraphicsData
from utils import EDATest, optimize_dtypes

### Load the data

In [0]:
df = pd.read_csv('../data/ChurnData.csv')

### Verify successful load with some randomly selected records


In [0]:
df.sample(9)

### 1. Business Understanding: 

#### 1.1 Situation Assessment
The Telecommunications industry faces saturation and fierce competition. With high Customer Acquisition Costs (CAC), the profitability strategy has shifted from **Acquisition** to **Retention**. 
Current analysis shows that losing a long-standing customer ("High Tenure") is 5x more expensive than acquiring a new one due to the loss of predictable recurring revenue (LTV).

#### 1.2 The Business Problem
The company is experiencing an unexplained increase in customer attrition (Churn). Traditional rule-based methods (e.g., "Cancel if usage drops 10%") are reactive and fail to capture complex behavioral patterns. 
The marketing team needs a proactive mechanism to identify at-risk customers **before** the cancellation decision is irreversible.

#### 1.3 Objectives
1.  **Primary Goal:** Mitigate revenue loss by accurately identifying customers with high probability of churning.
2.  **Strategic Goal:** Understand the *drivers* of churn (Explainability). Is it price (`wiremon`) or poor service? This informs the product roadmap.
3.  **Financial Goal:** Maximize the **ROI of Retention**.
    * *Constraint:* We cannot offer discounts to everyone (Cost of Intervention). We must target only high-risk/high-value customers.

#### 1.4 Success Criteria (KPIs)
Instead of purely technical metrics (Accuracy), success is defined by:
* **Lift in Top Decile:** The model must capture at least 30% of total churners within the top 10% risk list.
* **Precision (Cost of False Positives):** Minimizing "False Alarms" to avoid giving discounts to customers who would have stayed anyway.
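
The top-decile criterion can be checked with a few lines of NumPy. The sketch below is a minimal version under the assumption that "lift" means the share of churners captured divided by the share of customers targeted; the function name and the toy arrays are illustrative and not part of the project code.

```python
import numpy as np

def top_decile_capture(y_true, y_score, fraction=0.10):
    """Return (capture_rate, lift) for the top `fraction` of customers by score."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_top = max(1, int(round(fraction * len(y_true))))
    top_idx = np.argsort(-y_score, kind="stable")[:n_top]
    capture_rate = y_true[top_idx].sum() / y_true.sum()
    lift = capture_rate / (n_top / len(y_true))
    return capture_rate, lift

# Synthetic example: 20 customers, 6 churners (illustrative data only)
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0])
y_score = np.linspace(0.95, 0.05, num=20)  # already ordered from highest to lowest risk

capture, lift = top_decile_capture(y_true, y_score)
print(f"Capture in top decile: {capture:.1%} | Lift: {lift:.2f}")
```

With these toy numbers the top decile holds 2 of the 6 churners, so capture is about 33% and lift about 3.3, which would clear the 30% target.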

### 2. Data Understanding:
---

#### Dataset: `Telecommunications Churn Database`

- This dataset captures a historical snapshot of customer demographics, service usage patterns, and account status. The primary objective is to correlate these features with the attrition event (**Churn**) to isolate behavioral drivers of dissatisfaction.
---

#### Variables Dictionary:

**1. Target Variable (The Outcome)**
* **churn** *Integer (Binary)* - The classification target. `1` = Customer left (Voluntary Churn); `0` = Stayed.

**2. Customer Demographics & Profile**
* **custcat** *Categorical* - Customer category classification (1-4). Represents the customer segment/cluster.
* **tenure** *Integer* - Months with the company (Proxy for *Loyalty*).
* **age** *Integer* - Customer's age in years.
* **address** *Integer* - Years living at current address (Stability indicator).
* **ed** *Categorical* - Education level (1-5).
* **employ** *Integer* - Years with current employer.
* **income** *Continuous* - Annual household income (in thousands).

**3. Service Subscriptions (Binary Portfolio)**
* **ebill** *Binary* - Electronic billing subscription (0/1). (Digital Adoption indicator).
* **equip** *Binary* - Equipment rental (0/1).
* **callcard** *Binary* - Calling card service (0/1).
* **wireless** *Binary* - Wireless service (0/1).
* **pager** *Binary* - Pager service (Legacy technology).
* **internet** *Binary* - Internet service (0/1).
* **voice** *Binary* - Voice mail service (0/1).
* **callwait** *Binary* - Call waiting service (0/1).
* **confer** *Binary* - Conference calling service (0/1).

**4. Billing & Usage (Monthly Dynamics)**
* **longmon** *Continuous* - Average monthly long-distance usage ($).
* **tollmon** *Continuous* - Average monthly toll-free usage ($).
* **equipmon** *Continuous* - Average monthly equipment rental charges ($).
* **cardmon** *Continuous* - Average monthly calling card charges ($).
* **wiremon** *Continuous* - Average monthly wireless service charges ($).



#### Exploratory Data Analysis (EDA):
---

#### 2.1 Univariate Analysis:
---

Examines the behavior of **a single variable** in isolation to understand its distribution, central tendency, and dispersion (e.g., histograms and means), without seeking relationships with other data.

In [0]:
df.head()

In [0]:
df.info()

##### Adjusting the variable types with their respective characteristics. 
---
- In this data, there are both binary and ordinal variables; I will be adjusting them so that there is no invalid statistical aggregation in the analyses.
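
The `optimize_dtypes` helper lives in `../src/utils` and its body is not shown in this notebook. Purely as an assumption about what such a step could look like, the sketch below casts an ordinal column to an ordered categorical and binary flags to `int8`, so that `describe()` no longer averages education codes. The function name `cast_dtypes_sketch` and the toy frame are hypothetical.

```python
import pandas as pd

def cast_dtypes_sketch(df, ordinal_cols, binary_cols):
    """Hypothetical stand-in for optimize_dtypes; the real helper is not shown."""
    out = df.copy()
    for col in ordinal_cols:
        # Ordered categorical: keeps the level order but blocks mean/std
        out[col] = pd.Categorical(out[col].astype("int64"), ordered=True)
    for col in binary_cols:
        # Compact integer flags that can still be summed or averaged
        out[col] = out[col].astype("int8")
    return out

demo = pd.DataFrame({
    "ed": [1.0, 2.0, 5.0, 4.0],
    "ebill": [0.0, 1.0, 1.0, 0.0],
    "income": [19.0, 136.0, 33.0, 76.0],
})
typed = cast_dtypes_sketch(demo, ordinal_cols=["ed"], binary_cols=["ebill"])
print(typed.dtypes)
print(typed.describe().columns.tolist())
```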

In [0]:
df = optimize_dtypes(df)

print('New dtypes of variables:')
df.info()

print('Visual sample:')
df.head()

In [0]:
df.isnull().sum()

In [0]:
df[df.duplicated(keep = False)]

In [0]:
data_describe = df.describe()
data_describe

In [0]:
data_describe.loc['min']

In [0]:
data_describe.loc['max']

In [0]:
data_describe.loc['mean']

In [0]:
numericals_cols = [
    'tenure', 'age', 'employ', 'address', 'income', 'longmon', 'tollmon', 'equipmon', 'cardmon', 'wiremon',
]
GraphicsData(data = df[numericals_cols]).numerical_histograms()

In [0]:

GraphicsData(data = df[numericals_cols]).numerical_boxplots(showfliers = True, showmeans = True)

In [0]:
categorical_cols = [
    'ed', 'equip', 'callcard', 'wireless', 'voice', 'pager', 'internet',
    'callwait', 'confer', 'ebill', 'custcat'
]
GraphicsData(data = df[categorical_cols]).categorical_countplots()

In [0]:
GraphicsData(data = df).plot_target_analysis(target_col='churn', colors=['#1abc9c', '#ff6b6b'])

##### Key Observations:
---
**1. Data Integrity and Quality**
- The dataset exhibits high integrity, containing 200 records and 22 active variables. The absence of null or duplicate data eliminates the need for imputation, allowing immediate focus on statistical modeling.
---
**2. Distribution of Numerical Variables**

  - **Income:** There is extreme skewness (**9.96**). The scenario reflects a concentration of income: a vast majority with standard income and an elite of "Super Rich" (representing the **6.5% of outliers**).

  - **Spending Pattern (`longmon`, `cardmon`, `wiremon`):** Consumption variables follow the income curve. Their strong skewness suggests that spending behavior tracks purchasing power.

  - **Niche Services (`tollmon`, `equipmon`):** These distributions are "Zero-Inflated," indicating that a significant portion of the base does not even use these specific services.

  - **Demographics (`age`):** The base is balanced, with a predominance of customers up to 50 years old, but with a tail extending to 76 years old, suggesting demographic maturity.

  - **Loyalty (`tenure`):** The bimodal/flat distribution indicates a healthy life cycle: the company manages to attract new customers (entrants) while retaining a loyal older base.
---
**3. Outlier Analysis**
  - The volume of outliers is not critical to the size of the dataset. The discrepancies in `income` and `longmon` are not errors, but natural characteristics of high-value customers (Whales). *Strategy: Treat with mathematical transformations instead of removal.*
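
The skewness and outlier figures above can be reproduced along the following lines. This is a sketch on synthetic, right-skewed data and it assumes the outlier share is measured with the boxplot rule (1.5 × IQR), which the notebook does not state explicitly.

```python
import numpy as np
from scipy.stats import skew

def iqr_outlier_share(values):
    """Fraction of observations outside the 1.5 * IQR whiskers (Tukey's rule)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return mask.mean()

# Synthetic stand-in for a skewed variable such as income (not the real data)
rng = np.random.default_rng(33)
income_like = rng.lognormal(mean=3.5, sigma=1.0, size=200)

print(f"skewness       : {skew(income_like):.2f}")
print(f"outlier share  : {iqr_outlier_share(income_like):.1%}")
print(f"skew after log : {skew(np.log1p(income_like)):.2f}")
```

Comparing the skewness before and after `np.log1p` gives a quick check on whether a transformation, rather than removal, is enough to tame the tail.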
---
**4. Portfolio Analysis (Categorical Variables)**

  - **Educational Profile (`ed`):** Concentration in middle levels (2 and 4). Level 5 (Postgraduate) is a minority, deserving specific investigation regarding its retention.

  - **Low Adoption:** Products such as `wireless`, `voice`, and `pager` have penetration below **30%**. They are candidates for *Upsell* campaigns or product review.

  - **Flagship Product:** The `callcard` dominates with **70.05%** adoption, proving to be the company's essential entry-level product.

  - **Moderate Adoption:** `internet`, `ebill`, and `confer` are in the intermediate range. They are not niche products, but have great growth potential in the current base.
---
**5. Churn Rate (29%)** 
  - The cancellation rate of **29%** is at the upper limit of the telecommunications market average. Although the sector is volatile, losing almost 1/3 of the customer base annually requires immediate, data-driven retention actions.
---


####  2.2 Bi-Variate Analysis:
---

Explores the mathematical relationship between **two variables** simultaneously to discover associations, correlations, or dependencies (e.g., scatterplot of Income vs. Debt).

##### Note: 
---
- For the execution of the bivariate analysis, I will perform a prior division of the data into **Train** and **Test** sets. The primary objective is to avoid **Data Leakage**, ensuring that all insights, outlier treatments, and feature engineering are derived exclusively from the training set.

- In addition, I will use the `stratify` parameter in the `train_test_split` function. Since this is a classification problem (churn prediction), it is mandatory to maintain the same **proportion of classes** of the target variable in both subsets, preserving the original statistical representativeness.

In [0]:
train_set, test_set = train_test_split(df, test_size = 0.2, stratify = df['churn'], shuffle = True, random_state = 33)

In [0]:
# Checking the proportions of the target variable
print(f'Shape of training: {train_set.shape}')
print(f'Shape of test: {test_set.shape}')

print('\n--- Churn Rate (Stratify Validation) ---')
print(f"Original: {df['churn'].mean():.2%}")
print(f"Train:    {train_set['churn'].mean():.2%}")
print(f"Test:     {test_set['churn'].mean():.2%}")

In [0]:
train_set.head()

##### Checking the correlations between the variables

In [0]:
train_set.corr()['churn'].abs().sort_values(ascending = False)

In [0]:
GraphicsData(data = train_set).correlation_heatmap()


##### Key Observations:
---
**1. Retention Factors (Negative Correlation)**

   - **Tenure (`tenure`):** Shows the strongest negative correlation (**-0.35**) with Churn. This confirms the premise of "Survival Analysis": the probability of cancellation decreases drastically as customer tenure increases. The critical customer is the newcomer.

   - **Stability (`address`, `employ`, `age`):** Variables that denote life stability (time at address, stable employment, and advanced age) are strong protectors against Churn. The profile of a loyal customer is mature and conservative.

   - **Voice Usage (`longmon`, `callcard`):** Customers with high consumption of voice services (long distance and calling cards) tend to be more loyal. This suggests that the "Voice" product generates greater *lock-in* than digital products.
---
**2. Risk Factors (Positive Correlation)**

   - **Education (`ed`):** With a positive correlation (**0.24**), it is observed that customers with higher levels of education tend to cancel more. This suggests that more educated consumers are more demanding, research the competition more, and have a lower informational barrier to switching operators.

   - **Digital Services (`equip`, `internet`, `wireless`):** Paradoxically, the contracting of modern services is associated with **higher Churn**.

   - *Hypothesis:* These services operate in highly competitive markets (commodities), where the customer switches providers for small price differences, unlike traditional voice service.

   - **Digital Billing (`ebill`):** The positive correlation with Churn (**0.21**) suggests a behavioral profile. Users of `ebill` tend to be younger and more digitally savvy, possessing a lower switching cost and lower brand loyalty than users of physical invoices.
---
**3. Neutral and Categorical Variables**

- **Income:** The low linear correlation (-0.09) reflects the high asymmetry (Skewness 9.96). The relationship between money and churn is probably not linear, requiring transformations (Log) to reveal its true predictive power.

- **Category (`custcat`):** Being nominal, its influence will be validated via visual analysis, since Pearson does not capture nuances of non-ordinal categories.
---
**4. Multicollinearity Alert**
There are some critical redundancies between service ownership and monthly cost indicators:
   - **`equip` vs `equipmon`** (0.95)
   - **`wireless` vs `wiremon`** (0.89)
   - *For linear models (Logistic Regression), it will be mandatory to remove the continuous cost variables (`...mon`) in favor of the binary ones, or vice versa, to avoid instability in the coefficients.*

   - **`longmon` vs `tenure`** (0.77): This correlation means that longer-tenured customers spend more on long-distance calling services. Keeping both variables may make the model unstable, because the correlation between them, although not perfect, is significant.

   - **`address` vs `age`** (0.74): There is a hypothesis that they are passing similar information but not the same information. Older customers tend to stay longer at the same address, as this behavior relates to *Stability*.
   Contextually, it makes sense for one variable to have a linear relationship with the other. However, in this case, it is worth evaluating these two variables and their possible influence on the target variable `churn`.
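
Redundant pairs like the ones listed above can be extracted programmatically instead of being read off the heatmap. The sketch below is one possible approach; the helper name `high_corr_pairs`, the 0.7 threshold, and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.7):
    """List variable pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data mimicking an ownership flag and its monthly charge
rng = np.random.default_rng(33)
equip = rng.integers(0, 2, size=200)
demo = pd.DataFrame({
    "equip": equip,
    "equipmon": equip * rng.uniform(20, 60, size=200),  # zero when the service is absent
    "age": rng.integers(18, 77, size=200),
})

redundant = high_corr_pairs(demo)
print(redundant)
```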
---

##### Analyzing the numerical variables and their relationships with the target variable.

In [0]:
numericals_cols = [
    'tenure', 'age', 'employ', 'address', 'income', 'longmon', 'tollmon', 'equipmon', 'cardmon', 'wiremon', 'churn'
]
GraphicsData(data = df[numericals_cols]).numerical_histograms(hue = 'churn')

In [0]:
GraphicsData(data = df[numericals_cols]).numerical_boxplots(hue = 'churn', showmeans = True)

##### Key Observations:
---
**1. Customer Lifecycle (`tenure`)**

  - **Averages:** Churners (22.78 months) vs. Non-Churners (40.22 months).

  - **Survival Analysis:** There is a "Critical Risk Zone" between 0 and 22 months. Churners average roughly 22 months of tenure, which suggests that if the company manages to retain the customer beyond this initial two-year barrier, the probability of loyalty doubles.

  - **Retention Insight:** Customers who exceed the 50-month mark become practically immune to churn ("total lock-in"). The retention strategy should be aggressive in the first year (onboarding) to push the customer base into the safe zone of 40+ months.
---
**2. The Stability Factor (`age`, `employ`, `address`)**

  - **The Pattern:** There is a direct correlation between life stability and loyalty.

  - **Age:** Loyal customers are, on average, 8 years older (43 years) than churners (35 years). The age group over 65 years old shows almost zero churn.

  - **Employment:** The employment time of retained customers (12 years) is more than **double** that of those who cancel (5.5 years).

  - **Residence:** Address stability follows the same proportion (13 years vs. 7.5 years).

  - **Diagnosis:** Churn at this institution is not just a matter of dissatisfaction with the service, but a reflection of the **financial volatility** of the younger and more unstable demographic profile. Cancellation can often be involuntary (non-payment) or due to price sensitivity.
---
**3. The Paradox of Income and Digital Services (`income`, `wiremon`, `equipmon`)**

  - **Economic Scenario:** Customers who cancel have an average income 31% lower (56k) than loyal customers (82k).

  - **The Paradox:** Despite having lower income and less stability, churners are the ones who most often contract (and spend on) "premium" and modern services, such as wireless and equipment rental (`equipmon`).

  - **Business Conclusion:** There is a **Product-Market Fit** misalignment. Expensive products (Wireless/Equipment) are being sold to an audience with lower purchasing power and high instability (young people), generating default or rapid cancellation. Meanwhile, the loyal audience (wealthy and mature) consumes cheap and legacy products.
---
**4. Loyalty Anchor (`longmon`, `cardmon`)**

  - **Voice Products:** Unlike digital services, voice usage (long distance) is the major differentiator for loyalty. Retained customers spend almost twice as much (13.6 vs. 7.2) on long-distance calls.

  - **Insight:** The voice service acts as an "anchoring" product. Those who use the phone to talk, stay. Those who use it for internet/data, leave.
---
**5. Noise Variables (`tollmon`)**

- **Irrelevance:** The `tollmon` variable (toll/extra charges) does not present a clear statistical distinction between the means or distributions of the two groups. It is suggested to evaluate its removal (Feature Selection) to reduce the complexity of the model without loss of information.

---

##### Analyzing the categorical variables and their relationships with the target variable.

In [0]:
categorical_cols = [
    'ed', 'equip', 'callcard', 'wireless', 'voice', 'pager', 'internet',
    'callwait', 'confer', 'ebill', 'custcat', 'churn'
]
GraphicsData(data = df[categorical_cols]).categorical_countplots(hue = 'churn')

In [0]:
GraphicsData(data = df[categorical_cols]).categorical_bar_percentages(hue = 'churn')

##### Key Observations

---

**1. Educational Level (`ed`)**

* **The Effect of "Smart Switching":** A linear increase in the cancellation rate is observed as the educational level increases, reaching a critical peak of **47.1%** at Level 5.
* **Diagnosis:** Customers with a high level of education treat communications as a *commodity*. They have **low switching costs** and make rational decisions. Unlike the inert base, this profile compares market prices. The prevention strategy here should be based on financial advantage, not emotional.

---

**2. Digital Services and Equipment (`equip`, `internet`, `wireless`)**

* **The Toxic Trio:** Data reveals a structural flaw in these value-added services. Retention drops significantly when customers acquire them.

* **Internet:** The cancellation rate jumps from **18.8%** to **42.0%**.

* **Equipment:** Even more critical, churn explodes from **18.3%** to **43.5%**.

* **Product-Market Fit Alert:** This indicates that the company's digital infrastructure and hardware rental policies are acting as **churn accelerators**. Selling these products to at-risk segments (such as young users) effectively pushes them toward the competition.

---

**3. Voice Services (`callcard`, `confer`, `callwait`, `voice`)**

* **Retention Anchors (with an imposter):** Functional voice features generally generate high loyalty (retention), but there is a clear divide.

* **Calling card (`callcard`):** A massive retention factor. Users canceled significantly less (**19.9%**) than non-users (**50.9%**).

* **Conference and Call Waiting:** Both confirm the rule, showing lower churn rates for users.

* **The "Toxic" Intruder (`voice`):** Voicemail defies the category trend. Users canceled significantly more (**39.0%**) than non-users (**24.8%**). This suggests that `voice` behaves more like a deficient "Digital Service" (generating friction/cost) than a useful "Voice Service".

---

**4. Customer Segmentation (`custcat`)**

* **Scalability Alert:** The disparity between segments is extreme.

* **Class 3:** The "Gold Segment" with only **8.3%** cancellation rate.

* **Class 4:** The "Danger Zone" with a **45.6%** cancellation rate.

---

#### 2.3 Multi-Variate Analysis:
---

Analyzes a set of **three or more variables** at the same time to understand complex interactions and latent structures.

##### Optimization of Numerical Variables

- Some financial and consumption variables have "long tails" (many clients with low values, few with extreme values), which distorts the predictive capacity of some algorithms.

- **Action:** A mathematical (Logarithmic) normalization will be applied to balance these distributions. Objective: To "unlock" hidden patterns of behavior, allowing the model to better differentiate at-risk clients, regardless of their income or consumption bracket.

In [0]:
# Columns for transformations
skew_columns = ['income', 'longmon', 'cardmon', 'wiremon']

for col in skew_columns:
    train_set[f'log_{col}'] = np.log1p(train_set[col])

In [0]:
train_set.head()

In [0]:
skew_columns = ['income', 'longmon', 'cardmon', 'wiremon', 'log_income', 'log_longmon', 'log_cardmon', 'log_wiremon', 'churn']
GraphicsData(data = train_set[skew_columns]).numerical_histograms(hue = 'churn')

In [0]:
skew_columns = ['income', 'longmon', 'cardmon', 'wiremon', 'log_income', 'log_longmon', 'log_cardmon', 'log_wiremon', 'churn']
GraphicsData(data = train_set[skew_columns]).correlation_heatmap()


##### Note:
---
**Impact Analysis: Log-Transformation**

- **Mitigation of Skewness:** The application of the `log1p` transformation was effective in normalizing the distributions, drastically reducing the skewness of the critical variables.

- **Signal Gain (Pearson Correlation):** There was a tangible increase in the absolute linear correlation with the target variable (`churn`):

  - **Income:** Increase from **0.09** to **0.13** (+44% relative gain).

  - **Longmon:** Refinement from **0.26** to **0.29**.

  - **Cardmon (Highlight):** The most significant jump, nearly doubling its relevance from **0.13** to **0.24**.

- **Discovery of Latent Patterns:** The transformation in `cardmon` revealed a hidden **bimodal** structure in the raw data. The logarithm visually separated two distinct subgroups of behavior:

  1. **Peak on the Left:** Casual users (*low usage*).

  2. **Peak on the Right:** Heavy users (*high usage*).

*This will facilitate the creation of decision cuts by tree-based models.*

---
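
To see why a log transform can raise the linear correlation with a binary target, consider the deterministic toy case below. It is only a sketch: the data are synthetic, constructed so that churn depends on the logarithm of spend, and the resulting numbers are unrelated to the correlations reported above.

```python
import numpy as np
from scipy.stats import pointbiserialr

# Synthetic, deterministic data: spend grows exponentially along a latent scale,
# and churn flips at the midpoint of that scale (illustrative only).
latent = np.linspace(0.0, 6.0, 200)
spend = np.expm1(latent)              # right-skewed "raw" variable
churn = (latent < 3.0).astype(int)    # low spenders churn

# Point-biserial correlation = Pearson correlation with a binary variable
r_raw, _ = pointbiserialr(churn, spend)
r_log, _ = pointbiserialr(churn, np.log1p(spend))

print(f"|r| raw spend : {abs(r_raw):.2f}")
print(f"|r| log spend : {abs(r_log):.2f}")
```

In the raw scale a handful of extreme spenders dominates the variance, which dilutes the correlation; after `log1p` the relationship with churn becomes closer to linear.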

##### Creating New Features

In [0]:
# Aggregations of costs (Total Wallet Share)
# All expenses related to monthly services will be added together.
mon_cols = ['longmon', 'tollmon', 'equipmon', 'cardmon','wiremon']
train_set['total_spend'] = train_set[mon_cols].sum(axis = 1)

# Accessibility Index (Affordability)
# Since 'income' is in thousands (e.g., 20 = 20,000), I adjusted the scale.
# I added +1 to the denominator to avoid division by zero (safety).
train_set['affordability_idx'] = train_set['total_spend'] / ((train_set['income'] * 1000) + 1)

# Risk Feature (Toxicity + Education)
toxic_list = ['internet', 'wireless', 'equip', 'voice', 'pager']
train_set['toxic_score'] = train_set[toxic_list].sum(axis = 1)
train_set['toxic_ed'] = (train_set['toxic_score'] * train_set['ed'].astype('int64')).astype('float32')

# Behavioral Usage Features
# Tenure (in years) x longmon
train_set['ternure_longmon'] = ((train_set['tenure'] / 12) * (train_set['longmon']) ).astype('float32')
# Tenure (in years) x cardmon
train_set['ternure_cardmon'] = ((train_set['tenure'] / 12) * (train_set['cardmon']) ).astype('float32')

# Stability Features
# Tenure (in years) x adult age (age - 18)
train_set['stability_age'] =  ((train_set['tenure'] / 12) * (train_set['age'] - 18) ).astype('float32')
# Tenure (in years) x address
train_set['stability_address'] = ((train_set['tenure'] / 12) * (train_set['address']) ).astype('float32')
# Tenure (in years) x employ
train_set['stability_employ'] = ((train_set['tenure'] / 12) * (train_set['employ']) ).astype('float32')

# Stability Feature (The Master Feature)
factors = train_set['address'] + train_set['age'] + train_set['employ']
train_set['stability_full'] = (factors * (train_set['tenure'] / 12)).astype('float32')

# Good Score 
train_set['good_score'] = train_set[['callcard', 'confer', 'callwait']].sum(axis=1)

In [0]:
new_features =  [ 'log_income', 'log_longmon', 'log_cardmon',
       'log_wiremon', 'total_spend', 'affordability_idx', 'toxic_score',
       'toxic_ed', 'ternure_longmon', 'ternure_cardmon', 'stability_age',
       'stability_address', 'stability_employ', 'stability_full',
       'good_score', 'churn',]
GraphicsData(data = train_set[new_features]).numerical_histograms(hue = 'churn')

#### 2.4 Feature Selection

##### Numerical Features
---

##### Mann-Whitney U Test
---
- For the Feature Selection stage, I adopted the Mann-Whitney U Test. This choice is due to the high skewness (> 1) of the numerical variables, which requires a non-parametric approach robust to outliers, where traditional tests (such as the T-test) would fail.
---
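
The `EDATest.mannwhitney_u_test` method used below comes from `../src/utils` and is not shown here. As a rough idea of the underlying test, the sketch below calls `scipy.stats.mannwhitneyu` directly on each feature, splitting by the target; the helper name, the 0.05 cut-off as a parameter, and the toy data are assumptions.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

def mannwhitney_by_target(df, features, target, alpha=0.05):
    """Compare each feature's distribution between target == 1 and target == 0."""
    rows = []
    for col in features:
        churned = df.loc[df[target] == 1, col]
        stayed = df.loc[df[target] == 0, col]
        stat, p_value = mannwhitneyu(churned, stayed, alternative="two-sided")
        rows.append({"feature": col, "U": stat, "p_value": p_value,
                     "keep": p_value < alpha})
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)

# Deterministic toy data: 'tenure' differs by group, 'noise' does not
demo = pd.DataFrame({
    "churn": [1] * 60 + [0] * 140,
    "tenure": list(range(1, 61)) + list(range(11, 151)),
    "noise": list(range(20)) * 10,
})

result = mannwhitney_by_target(demo, ["tenure", "noise"], target="churn")
print(result)
```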

In [0]:
audit_vars = [
    'tenure', 'age', 'address', 'income', 'employ', 
    'longmon', 'tollmon', 'equipmon', 'cardmon',
    'wiremon','log_income', 'log_longmon', 'log_cardmon',
    'log_wiremon', 'total_spend', 'affordability_idx', 'toxic_score',
    'toxic_ed', 'ternure_longmon', 'ternure_cardmon', 'stability_age',
    'stability_address', 'stability_employ', 'stability_full',
    'good_score'  
]
EDATest(data = train_set).mannwhitney_u_test(audit_vars =  audit_vars, target = 'churn')

##### Conclusion of Test
---
- In this selection stage, variables with a p-value greater than 0.05 in the Mann-Whitney U test were excluded for lack of statistical significance. The case of the variable `income` is illustrative: even after the logarithmic transformation (`log_income`), the test failed to reject the null hypothesis, suggesting that purchasing power is not a relevant discriminator of churn for this base.

- **In contrast**, the analysis robustly validated that retention is governed by **stability factors** (tenure, age, address) and by **service quality** (toxicity), which demonstrated a direct and statistically significant influence on customer retention.

In [0]:
audit_vars = [
    'tenure', 'age', 'address', 'employ', 
    'longmon', 'equipmon', 'cardmon',
    'wiremon', 'log_longmon', 'log_cardmon',
    'log_wiremon', 'toxic_score',
    'toxic_ed', 'ternure_longmon', 'ternure_cardmon', 'stability_age',
    'stability_address', 'stability_employ', 'stability_full', 'churn'
]
GraphicsData(train_set[audit_vars]).correlation_heatmap(v_size = 15, h_size = 20)

##### Note:
---
- The creation of interaction variables generated mathematical redundancy in the dataset (multicollinearity). To mitigate this, I adopted the **'Kill the Parent'** strategy: remove the original variables when their derived versions ('children') show greater suitability to the Target.

- As the new features better capture customer behavior, the following original variables were removed to avoid noise in the model:

- *Cut-off list:* `tenure`, `age`, `address`, `employ`, `longmon`, `cardmon`, `log_longmon`, `log_cardmon`, `log_wiremon`, `toxic_score`.
---

##### Multicollinearity Diagnosis (VIF Analysis)
---
- During the *Feature Engineering* phase, multiplicative interaction variables were created (e.g., `tenure` * `longmon`) to capture the elasticity of customer behavior over time. However, the introduction of these derived variables generates, by definition, a high degree of linear correlation with their original variables (Linear Dependence).

- To mitigate the risk of redundancy and ensure the parsimony of the model (Occam's Razor), I will apply the **VIF (Variance Inflation Factor)** test. The goal is to identify and remove variables with a VIF > 10, ensuring that the final model prioritizes the real *Feature Importance* and does not suffer from instability in the estimation of the coefficients.
---
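
For reference, the VIF of a feature is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that feature on all the others. The sketch below computes it with plain NumPy on synthetic data in which one column is almost the sum of the other three; it is not the `EDATest.vif_test` implementation (which is not shown), and all column names are made up.

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = df.to_numpy(dtype=float)
    rows = []
    for j, name in enumerate(df.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(y)), others])  # include an intercept
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        r2 = 1.0 - resid.var() / y.var()
        rows.append({"feature": name, "VIF": 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows).sort_values("VIF", ascending=False).reset_index(drop=True)

# Synthetic data: 'total' is nearly the sum of three independent parts
rng = np.random.default_rng(33)
parts = rng.normal(size=(200, 3))
demo = pd.DataFrame(parts, columns=["part_a", "part_b", "part_c"])
demo["total"] = parts.sum(axis=1) + 0.1 * rng.normal(size=200)

full = vif_table(demo)
reduced = vif_table(demo.drop(columns="total"))
print(full)
print(reduced)
```

Dropping the aggregate column brings the remaining VIFs back near 1, the same pattern one would expect when an aggregate feature is built from components that stay in the model.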

In [0]:
audit_vars = [
    'equipmon', 'wiremon', 'toxic_ed', 'ternure_longmon', 'ternure_cardmon', 'stability_age',
    'stability_address', 'stability_employ', 'stability_full',
]

In [0]:
EDATest(train_set).vif_test(audit_vars = audit_vars)

In [0]:
audit_vars = [
    'equipmon', 'wiremon', 'toxic_ed', 'ternure_longmon', 'ternure_cardmon', 'stability_age',
    'stability_address', 'stability_employ', #'stability_full',
]

In [0]:
EDATest(train_set).vif_test(audit_vars = audit_vars)

##### Note:
---
- **Variance Inflation Factor (VIF)** analysis detected severe multicollinearity in the derived feature `stability_full` (**VIF: 149.18**), indicating mathematical redundancy with its component variables (`stability_age`, `stability_address`, `stability_employ`).

  **Action:** Removal of the variable `stability_full`.

  **Result:** After deletion, the system rebalanced, with all remaining features showing **VIF < 10**.

- Although `stability_full` showed a high linear correlation with the target, I chose to keep the component variables (such as `stability_employ`). These demonstrated a superior **Impact (Class Separation)** and offer greater granularity for decision-making. For Decision Tree/Ensemble algorithms, the ability for pure segregation (Information Gain) surpasses the aggregate linear correlation.
---

##### Categorical Features

##### Chi-Square Test

In [0]:
audit_vars = [
    'ed', 'equip', 'callcard', 'wireless', 'voice', 'pager', 'internet',
    'callwait', 'confer', 'ebill', 'custcat'
]
EDATest(train_set).chi_square_test(audit_vars = audit_vars, target = 'churn')

In [0]:
train_set['ed'].value_counts()

In [0]:
train_set['ed'] = train_set['ed'].astype('int64').replace({5: 4})
train_set['ed'] = train_set['ed'].astype('category')

In [0]:
EDATest(train_set).chi_square_test(audit_vars = audit_vars, target = 'churn')
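
The notebook does not say why level 5 of `ed` is merged into level 4 before re-running the test. One plausible reason is the common rule of thumb that chi-square expected cell counts should be at least 5. The sketch below shows how that check could be done with `scipy.stats.chi2_contingency`; the helper name and the counts are invented for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi_square_report(df, feature, target):
    """Chi-square independence test plus the smallest expected cell count."""
    table = pd.crosstab(df[feature], df[target])
    chi2, p_value, dof, expected = chi2_contingency(table)
    return {"chi2": chi2, "p_value": p_value, "dof": dof,
            "min_expected": expected.min()}

# Invented counts: level 5 has only 4 customers
demo = pd.DataFrame({
    "ed": [1] * 40 + [2] * 50 + [3] * 40 + [4] * 26 + [5] * 4,
    "churn": ([0] * 34 + [1] * 6) + ([0] * 38 + [1] * 12) + ([0] * 26 + [1] * 14)
             + ([0] * 14 + [1] * 12) + ([0] * 2 + [1] * 2),
})

before = chi_square_report(demo, "ed", "churn")

merged = demo.assign(ed=demo["ed"].replace({5: 4}))  # collapse the sparse level
after = chi_square_report(merged, "ed", "churn")

print(f"min expected before merge: {before['min_expected']:.2f} (dof={before['dof']})")
print(f"min expected after merge : {after['min_expected']:.2f} (dof={after['dof']})")
```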

##### Checking mixed correlations of categorical features and numerical features

In [0]:
audit_pairs = [
    ('equip', 'equipmon'),       
    ('wireless', 'wiremon'),     
    ('callcard', 'ternure_cardmon'),     
    ('ed', 'toxic_ed'),         
    ('internet', 'toxic_ed'),
    ('ebill', 'stability_age')
]
EDATest(train_set).mixed_redundancy_test(audit_pairs = audit_pairs)