# 01 - Exploratory Data Analysis (EDA)

This notebook explores the E-Commerce Churn dataset to understand its structure, quality, and main patterns. The insights gained here will guide the next steps in cleaning, feature engineering, and modeling.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import os
import sys
from IPython.display import display, Markdown
from pathlib import Path
from sklearn.ensemble import IsolationForest
import plotly.figure_factory as ff

## Loading the Data

In [3]:
data_path = r"C:\Users\luizo\Projetos\churn_analysis\data\raw\E_Commerce_Dataset.xlsx"
df = pd.read_excel(data_path, sheet_name='E Comm')

print(f"✅ Dataset loaded successfully! Shape: {df.shape}")
display(df.head())

✅ Dataset loaded successfully! Shape: (5630, 20)


Unnamed: 0,CustomerID,Churn,Tenure,PreferredLoginDevice,CityTier,WarehouseToHome,PreferredPaymentMode,Gender,HourSpendOnApp,NumberOfDeviceRegistered,PreferedOrderCat,SatisfactionScore,MaritalStatus,NumberOfAddress,Complain,OrderAmountHikeFromlastYear,CouponUsed,OrderCount,DaySinceLastOrder,CashbackAmount
0,50001,1,4.0,Mobile Phone,3,6.0,Debit Card,Female,3.0,3,Laptop & Accessory,2,Single,9,1,11.0,1.0,1.0,5.0,159.93
1,50002,1,,Phone,1,8.0,UPI,Male,3.0,4,Mobile,3,Single,7,1,15.0,0.0,1.0,0.0,120.9
2,50003,1,,Phone,1,30.0,Debit Card,Male,2.0,4,Mobile,3,Single,6,1,14.0,0.0,1.0,3.0,120.28
3,50004,1,0.0,Phone,3,15.0,Debit Card,Male,2.0,4,Laptop & Accessory,5,Single,8,0,23.0,0.0,1.0,3.0,134.07
4,50005,1,0.0,Phone,1,12.0,CC,Male,,3,Mobile,5,Single,3,0,11.0,1.0,1.0,3.0,129.6


## Initial Data Overview

Let's start by examining the overall structure of our dataset. We'll check for duplicated rows, missing values, and review the data types of each column. This initial overview helps us understand the quality of the data and guides the next steps of our cleaning process.

In [4]:
# 📋 Dataset Overview

# Show general info
print("Dataset Info:")
df.info()

# Check for duplicated rows
duplicates = df.duplicated().sum()
print(f"\nDuplicated rows: {duplicates}")

# Check for missing values
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
print("\nMissing Values per Column:")
print(missing_values)

# Data types summary
print("\nData Types:")
print(df.dtypes.value_counts())

# Detect categorical columns (object type)
categorical_cols = df.select_dtypes(include='object').columns.tolist()
print("\nCategorical Columns:", categorical_cols)

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5630 entries, 0 to 5629
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   CustomerID                   5630 non-null   int64  
 1   Churn                        5630 non-null   int64  
 2   Tenure                       5366 non-null   float64
 3   PreferredLoginDevice         5630 non-null   object 
 4   CityTier                     5630 non-null   int64  
 5   WarehouseToHome              5379 non-null   float64
 6   PreferredPaymentMode         5630 non-null   object 
 7   Gender                       5630 non-null   object 
 8   HourSpendOnApp               5375 non-null   float64
 9   NumberOfDeviceRegistered     5630 non-null   int64  
 10  PreferedOrderCat             5630 non-null   object 
 11  SatisfactionScore            5630 non-null   int64  
 12  MaritalStatus                5630 non-null   object 
 13  Numb

### Dataset Overview

This project uses a dataset containing information about 5,630 e-commerce customers, where each row represents a unique user. The main goal is to predict customer churn (whether a customer leaves the platform), based on their activity and profile features.

The dataset includes 20 columns, covering both numeric and categorical features. Here are some important notes from my initial inspection:

- **Rows:** 5,630  
- **Columns:** 20  
- **Target variable:** `Churn` (binary: 1 = left, 0 = retained)  
- **Missing values:** Some features (like `Tenure`, `WarehouseToHome`, and `DaySinceLastOrder`) have missing entries that need to be handled in the cleaning phase.
- **Feature types:**  
  - Numeric: e.g., `Tenure`, `WarehouseToHome`, `HourSpendOnApp`, etc.
  - Categorical: e.g., `PreferredLoginDevice`, `Gender`, `MaritalStatus`, etc.
  - Identifiers: `CustomerID` (unique for each user)

This overview helps me plan my next steps, which will include missing value imputation, understanding feature distributions, and preparing the data for analysis and modeling.
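
The feature-type split described above can be sketched in a few lines. This is only an illustration: `toy` is a small made-up frame that reuses some of the dataset's column names, and the `max_levels` threshold is an assumption chosen for the example. It also shows that integer-coded columns such as `CityTier` land in the numeric group even though they behave like categories.

```python
import pandas as pd

# Toy frame reusing a few of the dataset's column names (values are made up).
toy = pd.DataFrame({
    "CustomerID": [50001, 50002, 50003, 50004, 50005, 50006],
    "Churn": [1, 0, 0, 1, 0, 0],
    "Tenure": [4.0, None, 12.0, 0.0, 23.0, 8.0],
    "CityTier": [3, 1, 1, 3, 2, 1],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
})

id_cols = ["CustomerID"]
target = "Churn"

# Split by dtype, leaving out the identifier and the target.
features = toy.drop(columns=id_cols + [target])
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

# Numeric columns with very few distinct values are often coded categories.
max_levels = 3  # hypothetical threshold, tune for the real data
coded_categories = [c for c in numeric_cols if features[c].nunique() <= max_levels]

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("numeric but category-like:", coded_categories)
```

On the real data, columns flagged this way are worth reviewing by hand before deciding whether to treat them as ordinal, binary, or continuous.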

## Checking for Duplicates and Missing Values

In [5]:
# Check for duplicated rows
duplicates = df.duplicated().sum()
print(f"Duplicated rows: {duplicates}")

# Check for missing values
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
print("Missing Values per Column:")
print(missing_values)

# Data types summary
print("Data Types in the dataset:")
print(df.dtypes.value_counts())

# Detect categorical columns (object type)
categorical_cols = df.select_dtypes(include='object').columns.tolist()
print("Categorical Columns:", categorical_cols)

Duplicated rows: 0
Missing Values per Column:
DaySinceLastOrder              307
OrderAmountHikeFromlastYear    265
Tenure                         264
OrderCount                     258
CouponUsed                     256
HourSpendOnApp                 255
WarehouseToHome                251
dtype: int64
Data Types in the dataset:
float64    8
int64      7
object     5
Name: count, dtype: int64
Categorical Columns: ['PreferredLoginDevice', 'PreferredPaymentMode', 'Gender', 'PreferedOrderCat', 'MaritalStatus']


# Univariate Analysis

In this section, I will explore each feature individually to better understand the data distribution, detect potential outliers, and identify data quality issues. Visualizations will be created using **Plotly** for a modern, interactive experience.

The main steps will include:
- Distribution of the target variable (`Churn`)
- Distribution and summary statistics for all numeric features
- Frequency and distribution of categorical variables

This exploratory analysis will help me spot trends, patterns, and potential data problems before moving on to more advanced modeling steps.

In [6]:
# Plot the distribution of a numeric feature, split by the target
def plot_numeric_feature(df, feature, target='Churn', bins=30, save_dir='../plots/numeric'):
    os.makedirs(save_dir, exist_ok=True)  # create the output folder if it is missing
    desc = df[feature].describe()
    print(f"\nFeature: {feature}\nMean: {desc['mean']:.2f}\nMedian: {desc['50%']:.2f}\nStd: {desc['std']:.2f}")
    fig = px.histogram(
        df, x=feature, color=target,
        marginal="box", nbins=bins,
        barmode="overlay",
        title=f"Distribution of {feature} by {target}"
    )
    fig.update_layout(template='plotly_white')
    # Save as standalone HTML (does not depend on the notebook)
    fig.write_html(f"{save_dir}/{feature}_dist.html")

# Plot the distribution of a categorical feature, split by the target
def plot_categorical_feature(df, feature, target='Churn', max_classes=10, save_dir='../plots/categorical'):
    os.makedirs(save_dir, exist_ok=True)  # create the output folder if it is missing
    top_cats = df[feature].value_counts().nlargest(max_classes).index
    filtered = df[df[feature].isin(top_cats)].copy()
    counts = filtered[feature].value_counts()
    print(f"\nFeature: {feature}\nTop {max_classes} categories: {list(counts.index)}")
    fig = px.histogram(
        filtered, x=feature, color=target, barmode="group",
        title=f"{feature} (Top {max_classes}) by {target}",
        category_orders={feature: list(counts.index)}
    )
    fig.update_layout(template='plotly_white')
    fig.write_html(f"{save_dir}/{feature}_dist.html")

# Select the features
numeric_features = [col for col in df.select_dtypes(include=np.number).columns if col != 'Churn']
categorical_features = [col for col in df.select_dtypes(include='object').columns if col != 'Churn']

# Plot numeric features
for feature in numeric_features:
    plot_numeric_feature(df, feature, target='Churn', save_dir='../plots/numeric')

# Plot categorical features
for feature in categorical_features:
    plot_categorical_feature(df, feature, target='Churn', max_classes=10, save_dir='../plots/categorical')



Feature: CustomerID
Mean: 52815.50
Median: 52815.50
Std: 1625.39

### Univariate Analysis: Insights & Interpretation

After performing a univariate analysis on the main features in our dataset, several patterns and distributions become clear. Here are my main observations and initial interpretations:

##### Numerical Features

- **CustomerID:** As expected, `CustomerID` is simply a unique identifier for each customer and carries no information for the analysis. Its mean and median are identical, which is what sequentially assigned IDs produce.

- **Tenure:** The average tenure is approximately 10 months, with a median slightly below the mean. This mild right skew suggests a larger number of newer customers, but still a fair number with longer tenures. This variable is likely to be important for churn analysis.

- **CityTier:** The mean `CityTier` is 1.65 (median 1), which indicates most customers come from Tier 1 cities, but there is still significant representation from Tier 2 and 3. We might expect different behavior or churn rates based on city tier.

- **WarehouseToHome:** The average distance from warehouse to home is around 15.6 units, with a median of 14 and a relatively high standard deviation. This suggests a fairly wide spread, possibly impacting delivery experience and customer satisfaction.

- **HourSpendOnApp:** On average, users spend about 2.93 hours on the app, with a median of 3. The low standard deviation means most users spend similar amounts of time. Low variance by itself does not make a feature predictive, so its usefulness depends on whether it differs between churned and retained customers.



##### Categorical Features

- **PreferedOrderCat:** The observed categories are 'Laptop & Accessory', 'Mobile Phone', 'Fashion', 'Mobile', 'Grocery', and 'Others'. This indicates that electronics and fashion are the most popular order categories among customers. 'Mobile' and 'Mobile Phone' appear to be two labels for the same category and are candidates for merging during cleaning. The heavy skew towards these categories suggests focusing on them for targeted retention efforts.

- **MaritalStatus:** The main categories are 'Married', 'Single', and 'Divorced', with married customers making up the majority. It's possible that marital status correlates with churn or spending behavior.

##### General Comments
- There is some skewness in tenure and distance-related variables, but nothing that would require immediate transformation.

- Most categorical variables are dominated by a few categories, which is good for modeling (we avoid high cardinality problems).

- These distributions give us a solid starting point for bivariate analysis, where I’ll investigate relationships with churn directly.
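
To put a number on the skewness mentioned above, pandas offers `Series.skew()`. The sketch below uses synthetic data only: `tenure_like` is a hypothetical right-skewed stand-in and not the real `Tenure` column. A mean above the median together with a clearly positive skew value is the pattern to look for.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for a tenure-like feature (not the real data).
tenure_like = pd.Series(rng.exponential(scale=10.0, size=2000))
# Symmetric reference sample for comparison.
symmetric = pd.Series(rng.normal(loc=10.0, scale=2.0, size=2000))

for name, s in [("tenure_like", tenure_like), ("symmetric", symmetric)]:
    print(f"{name}: mean={s.mean():.2f} median={s.median():.2f} skew={s.skew():.2f}")
```

Applied to the real frame, `df[numeric_features].skew()` would give one value per column. A common rule of thumb treats absolute values above roughly 1 as strongly skewed.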

Next Steps:
Based on these insights, the plan is to proceed with bivariate analysis, looking for patterns between these features and churn. I’ll pay particular attention to tenure, order categories, and city tier, as these already show promising variance.

# Bivariate Analysis

Now that we have a solid understanding of the individual distributions for each feature, the next step is to explore how these variables interact with each other. In this section, we'll analyze relationships between pairs of variables, especially focusing on:

- How each feature relates to the target variable (`Churn`)
- Correlation analysis between numerical variables
- Categorical vs numerical relationships (e.g., boxplots/grouped bar charts)
- Identification of patterns, trends, or potential predictive relationships

This step is essential to guide the next phases of feature engineering and model selection.

In [7]:
# Helper: Create folders if not exist
def ensure_dir(path):
    os.makedirs(path, exist_ok=True)

# ====================
# Function for numerical features vs. target
def plot_bivariate_numeric(df, feature, target='Churn', save_dir=None):
    """
    Plots boxplot and violin plot for a numerical feature vs. target.
    Saves the plots as HTML files.
    """
    if save_dir:
        ensure_dir(save_dir)
    # Boxplot
    fig_box = px.box(df, x=target, y=feature, points="all",
                     color=target, title=f'{feature} vs {target} - Boxplot')
    fig_box.update_layout(template='plotly_white')
    if save_dir:
        fig_box.write_html(f"{save_dir}/{feature}_boxplot.html")
    # Violin Plot
    fig_violin = px.violin(df, x=target, y=feature, box=True, points="all",
                           color=target, title=f'{feature} vs {target} - Violin Plot')
    fig_violin.update_layout(template='plotly_white')
    if save_dir:
        fig_violin.write_html(f"{save_dir}/{feature}_violin.html")
    # Comment out for non-notebook use
    # fig_box.show()
    # fig_violin.show()

# ====================
# Function for categorical features vs. target
def plot_bivariate_categorical(df, feature, target='Churn', max_classes=10, save_dir=None):
    """
    Plots a histogram for a categorical feature vs. target.
    Saves the plot as an HTML file.
    """
    if save_dir:
        ensure_dir(save_dir)
    top_cats = df[feature].value_counts().nlargest(max_classes).index
    filtered = df[df[feature].isin(top_cats)].copy()
    fig = px.histogram(filtered, x=feature, color=target, barmode='group',
                       title=f"{feature} vs {target}", category_orders={feature: list(top_cats)})
    fig.update_layout(template='plotly_white')
    if save_dir:
        fig.write_html(f"{save_dir}/{feature}_bivariate.html")
    # fig.show()

# ====================
# Run analysis
numeric_features = [col for col in df.select_dtypes(include='number').columns if col not in ['Churn', 'CustomerID']]
categorical_features = [col for col in df.select_dtypes(include='object').columns if col != 'Churn']

# Numeric plots
for feature in numeric_features:
    plot_bivariate_numeric(df, feature, target='Churn', save_dir='../plots/bivariate/numeric')

# Categorical plots
for feature in categorical_features:
    plot_bivariate_categorical(df, feature, target='Churn', max_classes=10, save_dir='../plots/bivariate/categorical')

print(" Bivariate plots saved! Check my plots folder.")

 Bivariate plots saved! Check my plots folder.


## Bivariate Analysis: Numeric & Categorical Features vs Churn

## Overview

In this section, I explored the relationship between each feature and the churn status (`Churn`). I split the analysis between **categorical** and **numeric** features for better clarity and focus. The main goal is to identify patterns, trends, and potential predictors for churn.

---

## 1. Categorical Features vs Churn

Below, I describe the main insights for each categorical variable regarding churn rates. For each variable, only the top 10 categories were included for clearer visualization.
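
The grouped histograms show counts, so comparing churn rates across categories of different sizes by eye can be misleading. A minimal sketch of computing the rate directly is shown here. The helper name `churn_rate_by_category` and the `toy` frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def churn_rate_by_category(df: pd.DataFrame, feature: str, target: str = "Churn") -> pd.DataFrame:
    """Return the customer count and churn rate per category, highest rate first."""
    return (
        df.groupby(feature)[target]
        .agg(n="count", churn_rate="mean")  # mean of a 0/1 target is the churn rate
        .sort_values("churn_rate", ascending=False)
    )

# Made-up rows, only to show the shape of the result.
toy = pd.DataFrame({
    "MaritalStatus": ["Single", "Single", "Married", "Married", "Married", "Divorced"],
    "Churn": [1, 0, 0, 0, 1, 0],
})
rates = churn_rate_by_category(toy, "MaritalStatus")
print(rates)
```

Keeping the count `n` next to the rate helps avoid over-reading a high rate that comes from a very small category.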

### 1.1 Marital Status

- **Categories analyzed:** Married, Single, Divorced
- **Findings:**  
  Married customers have a slightly lower churn rate, while single and divorced customers show a higher propensity to churn.

---

### 1.2 Preferred Order Category

- **Categories analyzed:** Laptop & Accessory, Mobile Phone, Fashion, Mobile, Grocery, Others
- **Findings:**  
  Customers whose preferred order categories are electronics or fashion show distinct churn behaviors. Notably, the "Mobile Phone" and "Fashion" categories have higher churn rates.

---

### 1.3 Preferred Login Device

- **Categories analyzed:** [Top devices]
- **Findings:**  
  Customers who prefer logging in via mobile devices appear to have a slightly higher churn compared to desktop users.

---

### 1.4 Preferred Payment Mode

- **Categories analyzed:** [Top payment modes]
- **Findings:**  
  Digital payment modes (e.g., credit card, wallet) are associated with lower churn, while other payment modes may have more churners.

---

### 1.5 Gender

- **Categories analyzed:** Male, Female
- **Findings:**  
  The churn rate is relatively balanced between genders, with only minor differences.

---

## 2. Numeric Features vs Churn

For numeric features, I used boxplots and violin plots to compare the distributions for churned vs non-churned customers. Here are the main takeaways:
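
A numeric summary can back up what the plots suggest. The sketch below compares per-group medians. The helper name `summarize_by_churn` and the `toy` values are hypothetical, so the numbers say nothing about the real dataset.

```python
import pandas as pd

def summarize_by_churn(df: pd.DataFrame, features: list, target: str = "Churn") -> pd.DataFrame:
    """Median of each feature for retained (0) and churned (1) customers, plus the gap."""
    summary = df.groupby(target)[features].median().T  # rows: features, columns: 0 and 1
    summary["diff"] = summary[1] - summary[0]           # churned minus retained
    return summary

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Churn": [0, 0, 0, 1, 1, 1],
    "Tenure": [12.0, 20.0, 8.0, 1.0, 0.0, 3.0],
    "CashbackAmount": [180.0, 200.0, 160.0, 120.0, 130.0, 125.0],
})
summary = summarize_by_churn(toy, ["Tenure", "CashbackAmount"])
print(summary)
```

Medians are used here because several features are skewed. A negative `diff` means churned customers sit lower on that feature.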

### 2.1 Tenure

- **Observation:**  
  Customers with shorter tenure have a higher chance of churn. Longer tenure generally indicates greater loyalty.

---

### 2.2 Warehouse To Home

- **Observation:**  
  Customers living further from the warehouse have a slightly increased churn rate, possibly due to longer delivery times.

---

### 2.3 Hour Spend On App

- **Observation:**  
  Churners typically spend fewer hours on the app compared to loyal customers. App engagement appears to be a protective factor.

---

### 2.4 Number Of Devices Registered

- **Observation:**  
  Customers with more devices registered tend to churn less, possibly indicating higher engagement or household usage.

---

### 2.5 Satisfaction Score

- **Observation:**  
  Unsurprisingly, lower satisfaction scores are associated with higher churn rates.

---

### 2.6 Number Of Address

- **Observation:**  
  This feature does not show a strong pattern, but customers with more addresses might be slightly less loyal.

---

### 2.7 Complain

- **Observation:**  
  Customers who have complained are significantly more likely to churn, showing the impact of unresolved issues.

---

### 2.8 Order Amount Hike From Last Year

- **Observation:**  
  Higher order amount hikes are linked to lower churn, indicating that spending growth correlates with retention.

---

### 2.9 Order Count

- **Observation:**  
  A higher order count is associated with reduced churn, reinforcing that frequent buyers are more loyal.

---

### 2.10 Cashback Amount

- **Observation:**  
  Customers who receive more cashback are less likely to churn. This is consistent with cashback incentives supporting retention, although the plot shows an association and not a causal effect.

---

> **Summary:**  
> The bivariate analysis confirms some expected patterns (tenure, satisfaction, order count), but also uncovers subtle trends (impact of complaints, preferred order categories, and payment mode). These findings will guide the feature engineering and modeling phases.

---



# Correlation Matrix Analysis

A correlation matrix helps to identify linear relationships between the numerical features in our dataset. High correlation values (close to +1 or -1) indicate strong linear relationships, which can reveal multicollinearity issues or guide feature selection for predictive modeling.

Below, we visualize the correlations using a modern, interactive heatmap with Plotly. This approach makes it easy to spot relevant patterns and potential redundancies in the data.

In [8]:
# Calculate correlation matrix for numeric features and the target (IDs excluded)
numeric_cols = [col for col in df.select_dtypes(include='number').columns if col != 'CustomerID']
corr_matrix = df[numeric_cols].corr()

# Plot heatmap with Plotly (interactive)
fig = px.imshow(
    corr_matrix,
    text_auto='.2f',
    color_continuous_scale='RdBu',
    zmin=-1, zmax=1,  # center the diverging color scale at zero
    title='Correlation Matrix (Numeric Features)',
    aspect='auto'
)
fig.update_layout(
    autosize=False,
    width=900,
    height=700,
    template='plotly_white'
)
fig.write_html('../plots/correlation_matrix.html')
# fig.show()  # Un-comment to display in notebook

# For static export (if needed)
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5)
plt.title("Correlation Matrix (Numeric Features)")
plt.tight_layout()
plt.savefig("../plots/correlation_matrix_static.png")
plt.close()

## Correlation Matrix Analysis

To better understand the relationships between the numerical features in our dataset, I analyzed the correlation matrix. This visualization allows me to quickly identify which variables are most strongly related to each other, as well as to our target variable, **Churn**.

### Key Insights

- **Low Overall Correlation:** Most features show relatively weak pairwise linear correlations with each other, which suggests that multicollinearity is unlikely to be a major concern in this dataset.
- **Customer Engagement:** Features related to customer engagement and activity, such as `OrderCount`, `CouponUsed`, and `OrderAmountHikeFromlastYear`, show some moderate positive correlation with each other. This may indicate that more active users tend to have higher order amounts and use more coupons.
- **Churn vs. Features:** There is no single feature with a very strong direct correlation to the target variable `Churn`. However, some features (like `SatisfactionScore`, `Complain`, and `OrderCount`) show moderate associations, hinting that customer satisfaction and their complaint history may have an influence on churn.
- **Redundant Features:** No pair of features has a correlation coefficient close to 1, so we are unlikely to have redundant features in our model.

### Next Steps

- I will focus on the features with moderate correlations to `Churn` for further analysis and feature engineering.
- Non-linear relationships and multivariate interactions should be explored in the modeling phase, as correlation analysis only captures linear dependencies.
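
The limitation noted above, that Pearson correlation only captures linear dependence, can be illustrated on synthetic data. In this sketch Spearman correlation is computed as the Pearson correlation of ranks. All variables are made up and unrelated to the churn dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
x = pd.Series(rng.uniform(0, 5, size=500))

# Monotonic but strongly non-linear relationship.
y_monotonic = np.exp(2 * x)
# Symmetric, non-monotonic relationship.
y_ushaped = (x - 2.5) ** 2

results = {}
for name, y in [("exp(2x)", y_monotonic), ("(x-2.5)^2", y_ushaped)]:
    pearson = x.corr(y)                  # linear association
    spearman = x.rank().corr(y.rank())   # rank-based (monotonic) association
    results[name] = (pearson, spearman)
    print(f"{name}: pearson={pearson:.2f} spearman={spearman:.2f}")
```

The exponential case has a perfect rank correlation but a visibly lower Pearson value. The U-shaped case scores near zero on both, even though `y` is fully determined by `x`, which is why plots and model-based checks are still needed.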

> **Summary:**  
> The correlation matrix provides a useful first look at feature relationships. At this stage, our data appears to be well-suited for predictive modeling, with minimal risk of multicollinearity and a few promising signals for churn prediction.


# Outlier & Anomaly Detection

To make the churn-prediction pipeline more robust, this stage looks for outliers and anomalies. Extreme values and hidden anomalous patterns can unduly influence model training, distort performance estimates, and undermine generalization. This section combines two methods:

- **Univariate Detection (IQR method)**: flagging observations outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR] for each numeric feature to catch simple one-dimensional outliers.  
- **Multivariate Detection (Isolation Forest)**: using an unsupervised tree-based model to isolate complex, high-dimensional anomalies that may not surface in marginal distributions.

By systematically identifying and inspecting these data points, we can decide whether to remove them, transform them, or handle them with robust algorithms, which should give cleaner and more trustworthy input for the churn models.
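
As a small worked example of the IQR rule, the sketch below computes the fences for a short made-up series. The helper name `iqr_outlier_mask` is hypothetical. It treats missing values as "not an outlier", which is an assumption about how they should be handled.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a non-missing value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() returns False for NaN, so missing values are excluded explicitly.
    return s.notna() & ~s.between(lower, upper)

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100, np.nan])
mask = iqr_outlier_mask(values)

# Here Q1 = 3 and Q3 = 7, so IQR = 4 and the fences are [-3, 13].
print(values[mask])
```

Only the value 100 falls outside the fences. The missing entry is left unflagged so that it can be handled separately during imputation.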

In [9]:
def resolve_project_root() -> Path:
    """
    Determine the project root directory.

    Priority:
      1. Use PROJECT_ROOT environment variable if set.
      2. Use __file__ location (script lives in PROJECT_ROOT/src/).
      3. Fall back to current working directory heuristics.
    """
    # 1. environment variable
    env = os.getenv("PROJECT_ROOT")
    if env:
        return Path(env).resolve()

    # 2. __file__ (when running as a script)
    try:
        return Path(__file__).resolve().parent.parent
    except NameError:
        pass

    # 3. working directory heuristic
    cwd = Path.cwd().resolve()
    if (cwd / "data" / "raw" / "E_Commerce_Dataset.xlsx").exists():
        return cwd
    return cwd.parent


def load_data() -> pd.DataFrame:
    """
    Load the churn dataset from the 'E Comm' sheet, drop unnamed cols,
    and return a clean DataFrame.
    """
    project_root = resolve_project_root()
    raw_path = project_root / "data" / "raw" / "E_Commerce_Dataset.xlsx"
    if not raw_path.is_file():
        sys.exit(
            f"[ERROR] Dataset not found at {raw_path}\n"
            "Ensure 'E_Commerce_Dataset.xlsx' exists in data/raw "
            "or set PROJECT_ROOT correctly."
        )
    # Read the specific sheet and drop empty/unnamed columns
    df = pd.read_excel(raw_path, sheet_name="E Comm")
    df = df.loc[:, ~df.columns.str.match(r"^Unnamed")]
    print(f"[INFO] Loaded data: {df.shape[0]} rows × {df.shape[1]} columns")
    return df


def ensure_dir(path: Path) -> None:
    """
    Create directory (and parents) if it does not exist.
    """
    path.mkdir(parents=True, exist_ok=True)


def detect_univariate_outliers_iqr(
    df: pd.DataFrame, numeric_cols: list[str]
) -> pd.DataFrame:
    """
    Flag univariate outliers based on the IQR method.

    Adds boolean columns '<col>_outlier' to the DataFrame.
    Missing values are not flagged as outliers.
    """
    result = df.copy()
    for col in numeric_cols:
        q1 = result[col].quantile(0.25)
        q3 = result[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - 1.5 * iqr
        upper = q3 + 1.5 * iqr
        # between() is False for NaN, so exclude missing values explicitly
        result[f"{col}_outlier"] = result[col].notna() & ~result[col].between(lower, upper)
        print(f"[DEBUG] {col}: flagged IQR outliers")
    return result


def plot_univariate_outliers(
    df: pd.DataFrame,
    feature: str,
    output_dir: Path,
    template: str = "plotly_white"
) -> None:
    """
    Create and save a boxplot highlighting univariate outliers for one feature.
    """
    fig = px.box(
        df,
        y=feature,
        points="all",
        title=f"Univariate Outliers: {feature}",
        template=template
    )
    out_file = output_dir / f"{feature}_outliers.html"
    print(f"[DEBUG] Saving univariate plot to: {out_file.resolve()}")
    fig.write_html(str(out_file))


def detect_multivariate_anomalies(
    df: pd.DataFrame,
    numeric_cols: list[str],
    contamination: float = 0.01,
    random_state: int = 42
) -> pd.DataFrame:
    """
    Apply IsolationForest to detect multivariate anomalies.

    Adds 'anomaly_score' (higher = more normal) and 'anomaly_flag' columns.
    """
    iso = IsolationForest(
        n_estimators=100,
        contamination=contamination,
        random_state=random_state
    )
    # IsolationForest does not accept NaN; missing values are filled with 0 here
    X = df[numeric_cols].fillna(0)
    iso.fit(X)
    scores = iso.decision_function(X)
    labels = iso.predict(X)
    df_out = df.copy()
    df_out["anomaly_score"] = scores
    df_out["anomaly_flag"] = labels == -1
    print(f"[DEBUG] Applied IsolationForest (contamination={contamination})")
    return df_out


def plot_multivariate_scatter(
    df: pd.DataFrame,
    x: str,
    y: str,
    output_path: Path,
    template: str = "plotly_white"
) -> None:
    """
    Plot a 2D scatter colored by anomaly flag and save as HTML.
    """
    fig = px.scatter(
        df,
        x=x,
        y=y,
        color="anomaly_flag",
        title=f"Anomaly Scatter: {x} vs {y}",
        template=template,
        hover_data=["anomaly_score"]
    )
    print(f"[DEBUG] Saving multivariate scatter to: {output_path.resolve()}")
    fig.write_html(str(output_path))


def main():
    project_root = resolve_project_root()

    # 1. Load dataset
    df = load_data()

    # 2. Define numeric features
    numeric_features = [
        "Tenure",
        "HourSpendOnApp",
        "OrderCount",
        "CashbackAmount",
        "SatisfactionScore",
        "DaySinceLastOrder",
        "CouponUsed",
        "OrderAmountHikeFromlastYear"
    ]

    # 3. Detect and plot univariate outliers
    df_uni = detect_univariate_outliers_iqr(df, numeric_features)
    uni_dir = project_root / "plots" / "outliers" / "univariate"
    ensure_dir(uni_dir)
    for feat in numeric_features:
        plot_univariate_outliers(df_uni, feat, uni_dir)

    # 4. Detect and plot multivariate anomalies
    df_multi = detect_multivariate_anomalies(df_uni, numeric_features)
    multi_dir = project_root / "plots" / "outliers" / "multivariate"
    ensure_dir(multi_dir)
    scatter_path = multi_dir / "tenure_vs_ordercount.html"
    plot_multivariate_scatter(df_multi, "Tenure", "OrderCount", scatter_path)

    print("[DONE] Outlier & anomaly detection complete.")


if __name__ == "__main__":
    main()

[INFO] Loaded data: 5630 rows × 20 columns
[DEBUG] Tenure: flagged IQR outliers
[DEBUG] HourSpendOnApp: flagged IQR outliers
[DEBUG] OrderCount: flagged IQR outliers
[DEBUG] CashbackAmount: flagged IQR outliers
[DEBUG] SatisfactionScore: flagged IQR outliers
[DEBUG] DaySinceLastOrder: flagged IQR outliers
[DEBUG] CouponUsed: flagged IQR outliers
[DEBUG] OrderAmountHikeFromlastYear: flagged IQR outliers
[DEBUG] Saving univariate plot to: C:\Users\luizo\Projetos\churn_analysis\plots\outliers\univariate\Tenure_outliers.html
[DEBUG] Saving univariate plot to: C:\Users\luizo\Projetos\churn_analysis\plots\outliers\univariate\HourSpendOnApp_outliers.html
[DEBUG] Saving univariate plot to: C:\Users\luizo\Projetos\churn_analysis\plots\outliers\univariate\OrderCount_outliers.html
[DEBUG] Saving univariate plot to: C:\Users\luizo\Projetos\churn_analysis\plots\outliers\univariate\CashbackAmount_outliers.html
[DEBUG] Saving univariate plot to: C:\Users\luizo\Projetos\churn_analysis\plots\outliers\u

## Outlier & Anomaly Detection

In this step, I identified extreme observations that could bias our churn models by combining simple and advanced techniques:

### 1 Univariate Outliers (IQR Method)  
For each numeric feature, I calculated the interquartile range (IQR = Q3 – Q1) and flagged any points below Q1 – 1.5·IQR or above Q3 + 1.5·IQR. Interactive boxplots are available here:

- [Tenure](../plots/outliers/univariate/Tenure_outliers.html)  
- [HourSpendOnApp](../plots/outliers/univariate/HourSpendOnApp_outliers.html)  
- [OrderCount](../plots/outliers/univariate/OrderCount_outliers.html)  
- [CashbackAmount](../plots/outliers/univariate/CashbackAmount_outliers.html)  
- [SatisfactionScore](../plots/outliers/univariate/SatisfactionScore_outliers.html)  
- [DaySinceLastOrder](../plots/outliers/univariate/DaySinceLastOrder_outliers.html)  
- [CouponUsed](../plots/outliers/univariate/CouponUsed_outliers.html)  
- [OrderAmountHikeFromlastYear](../plots/outliers/univariate/OrderAmountHikeFromlastYear_outliers.html)  

These plots reveal customers with values far outside the typical range for each dimension.

### 2 Multivariate Anomalies (Isolation Forest)  
To capture unusual patterns across the eight selected numeric features at once, I trained an Isolation Forest with 1% contamination. Example visualization in the “Tenure vs. OrderCount” plane:

- [Tenure vs. OrderCount](../plots/outliers/multivariate/tenure_vs_ordercount.html)  

Highlighted points are the ~1% of customers the model scored as most anomalous across all eight features. The Tenure vs. OrderCount plane is only one two-dimensional view of them, so some flagged points may look ordinary in this plot.

---

### Key Learnings  
- **Univariate outliers:** Several customers exhibit extreme values (e.g., very high app usage or unusually large cashback), which may indicate special cases or data entry errors.  
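- **Multivariate anomalies:** With contamination set to 1%, the model flags about 56 customers by construction. These are the records it finds most unusual across multiple features, possibly bots, input mistakes, or VIP customers. Because missing values were filled with 0 before fitting, some flags may reflect missingness and not unusual behavior.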



# Churn Distribution

Let's inspect the balance of our target variable `Churn`.

In [10]:
project_root = Path.cwd().parent

# Create the eda plots folder if it doesn't exist
output_dir = project_root / "plots" / "eda"
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Project root: {project_root}")
print(f"Plots will be saved to: {output_dir}")

# count & percentage
counts = df['Churn'].value_counts().reset_index()
counts.columns = ['Churn', 'count']
counts['pct'] = counts['count'] / counts['count'].sum() * 100

fig = px.bar(
    counts,
    x='Churn',
    y='count',
    text=counts['pct'].round(1).astype(str) + '%',
    title='Churn Distribution',
    template='plotly_white'
)
fig.update_traces(textposition='outside')

# save to the eda folder we created
fig.write_html(str(output_dir / 'churn_distribution.html'))
fig.show()


Project root: c:\Users\luizo\Projetos\churn_analysis
Plots will be saved to: c:\Users\luizo\Projetos\churn_analysis\plots\eda


# Missing Values Overview

Check how many missing values we have in each column.

In [11]:
# 1. Compute missing counts and percentages
missing = df.isna().sum().reset_index()
missing.columns = ['feature', 'n_missing']
missing['pct_missing'] = missing['n_missing'] / len(df) * 100

# 2. Show the table inline (plain DataFrame)
print(missing.to_string(index=False))

# 3. Create an interactive bar chart for missingness
fig = px.bar(
    missing,
    x='feature',
    y='pct_missing',
    text=missing['pct_missing'].round(1).astype(str) + '%',
    title='Percentage of Missing Values by Feature',
    template='plotly_white'
)
fig.update_layout(xaxis_tickangle=-45, yaxis_title='% Missing')
fig.update_traces(textposition='outside')

# 4. Save and show
fig.write_html(str(output_dir / 'missing_values_overview.html'))
fig.show()

                    feature  n_missing  pct_missing
                 CustomerID          0     0.000000
                      Churn          0     0.000000
                     Tenure        264     4.689165
       PreferredLoginDevice          0     0.000000
                   CityTier          0     0.000000
            WarehouseToHome        251     4.458259
       PreferredPaymentMode          0     0.000000
                     Gender          0     0.000000
             HourSpendOnApp        255     4.529307
   NumberOfDeviceRegistered          0     0.000000
           PreferedOrderCat          0     0.000000
          SatisfactionScore          0     0.000000
              MaritalStatus          0     0.000000
            NumberOfAddress          0     0.000000
                   Complain          0     0.000000
OrderAmountHikeFromlastYear        265     4.706927
                 CouponUsed        256     4.547069
                 OrderCount        258     4.582593
          Da

## Missing Values Overview

Understanding the extent of missing data is crucial before moving on to feature engineering and modeling. Here’s what we found:

- **Most features are complete**: CustomerID, Churn, PreferredLoginDevice, CityTier, PreferredPaymentMode, Gender, NumberOfDeviceRegistered, PreferedOrderCat, SatisfactionScore, MaritalStatus, NumberOfAddress, Complain, and CashbackAmount have 0% missing values.
- **Partial gaps (4–6%)** appear in several numeric columns:
  - **DaySinceLastOrder**: 307 missing values (5.5%)
  - **OrderAmountHikeFromlastYear**: 265 missing values (4.7%)
  - **Tenure**: 264 missing values (4.7%)
  - **OrderCount**: 258 missing values (4.6%)
  - **CouponUsed**: 256 missing values (4.5%)
  - **HourSpendOnApp**: 255 missing values (4.5%)
  - **WarehouseToHome**: 251 missing values (4.5%)

These gaps are relatively small (< 6%), so we can address them with imputation strategies (e.g., median fill for continuous features) without discarding large portions of the dataset.

An interactive visualization of these percentages has been saved to:  
[Missing Values Overview](../plots/eda/missing_values_overview.html)
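
The median fill mentioned above could look roughly like the following. This is a sketch on a tiny hypothetical frame (`toy`), not the cleaning step used later in the project, and the choice of columns is only illustrative.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with gaps in two numeric columns
toy = pd.DataFrame({
    "Tenure": [4.0, np.nan, 0.0, 12.0, np.nan],
    "OrderCount": [1.0, 2.0, np.nan, 16.0, 1.0],
})

cols_with_gaps = ["Tenure", "OrderCount"]

# median() skips NaN by default, so each column's median uses observed values only
medians = toy[cols_with_gaps].median()

# fillna with a Series fills each column with its own median
toy[cols_with_gaps] = toy[cols_with_gaps].fillna(medians)
print(toy)
```

When this feeds a model, the medians should be computed on the training split only and then reused on validation and test data, so that no information leaks across splits.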

# Data Types & Memory Usage

Verify dtypes and reduce memory footprint if needed.

In [12]:
# dtypes and memory
print(df.dtypes)
print('\nMemory usage (before optimization):')
print(df.memory_usage(deep=True).sum() / 1024**2, 'MB')

CustomerID                       int64
Churn                            int64
Tenure                         float64
PreferredLoginDevice            object
CityTier                         int64
WarehouseToHome                float64
PreferredPaymentMode            object
Gender                          object
HourSpendOnApp                 float64
NumberOfDeviceRegistered         int64
PreferedOrderCat                object
SatisfactionScore                int64
MaritalStatus                   object
NumberOfAddress                  int64
Complain                         int64
OrderAmountHikeFromlastYear    float64
CouponUsed                     float64
OrderCount                     float64
DaySinceLastOrder              float64
CashbackAmount                 float64
dtype: object

Memory usage (before optimization):
2.3987112045288086 MB
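
The cell above only measures memory. If a reduction were needed, one common approach is to downcast integer columns and store repeated strings as categories. The sketch below assumes a small hypothetical frame (`toy`) with columns named after the dataset; it does not reproduce the figures printed above.

```python
import pandas as pd

# Hypothetical frame mimicking a few columns of the dataset
toy = pd.DataFrame({
    "Churn": [1, 0, 0, 1] * 250,
    "CityTier": [3, 1, 1, 2] * 250,
    "Gender": ["Female", "Male", "Male", "Female"] * 250,
    "CashbackAmount": [159.93, 120.9, 120.28, 134.07] * 250,
})
before = toy.memory_usage(deep=True).sum()

slim = toy.copy()
for col in slim.columns:
    s = slim[col]
    if pd.api.types.is_integer_dtype(s):
        # Pick the smallest integer type that can hold the column's values
        slim[col] = pd.to_numeric(s, downcast="integer")
    elif pd.api.types.is_object_dtype(s) or pd.api.types.is_string_dtype(s):
        # Low-cardinality strings are stored once and referenced by integer codes
        slim[col] = s.astype("category")

after = slim.memory_usage(deep=True).sum()
print(f"before: {before / 1024:.1f} KB, after: {after / 1024:.1f} KB")
```

Float columns are left untouched here because downcasting them to `float32` trades precision for space, which may or may not be acceptable for monetary values.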


# Extended Descriptive Statistics

Compute quartiles, skewness, and kurtosis for numeric features.

In [13]:
# define numeric features
num_feats = [
    'Tenure', 'HourSpendOnApp', 'OrderCount', 'CashbackAmount',
    'SatisfactionScore', 'DaySinceLastOrder', 'CouponUsed',
    'OrderAmountHikeFromlastYear', 'NumberOfDeviceRegistered',
    'NumberOfAddress'
]

# 1. Use .describe() to get basic stats plus the quartiles (25%, 50%, 75%)
desc = df[num_feats].describe(percentiles=[.25, .5, .75]).T

# 2. Rename percentile columns for clarity
desc = desc.rename(columns={'25%': 'Q1', '50%': 'Median', '75%': 'Q3'})

# 3. Compute skewness and kurtosis separately
desc['Skew'] = df[num_feats].skew()
desc['Kurtosis'] = df[num_feats].kurtosis()

# 4. Reorder columns if desired
desc = desc[['count', 'mean', 'std', 'min', 'Q1', 'Median', 'Q3', 'max', 'Skew', 'Kurtosis']]

# 5. Display the table
print(desc.to_string(float_format='{:,.2f}'.format))


                               count   mean   std   min     Q1  Median     Q3    max  Skew  Kurtosis
Tenure                      5,366.00  10.19  8.56  0.00   2.00    9.00  16.00  61.00  0.74     -0.01
HourSpendOnApp              5,375.00   2.93  0.72  0.00   2.00    3.00   3.00   5.00 -0.03     -0.67
OrderCount                  5,372.00   3.01  2.94  1.00   1.00    2.00   3.00  16.00  2.20      4.72
CashbackAmount              5,630.00 177.22 49.21  0.00 145.77  163.28 196.39 324.99  1.15      0.97
SatisfactionScore           5,630.00   3.07  1.38  1.00   2.00    3.00   4.00   5.00 -0.14     -1.13
DaySinceLastOrder           5,323.00   4.54  3.65  0.00   2.00    3.00   7.00  46.00  1.19      4.02
CouponUsed                  5,374.00   1.75  1.89  0.00   1.00    1.00   2.00  16.00  2.55      9.13
OrderAmountHikeFromlastYear 5,365.00  15.71  3.68 11.00  13.00   15.00  18.00  26.00  0.79     -0.28
NumberOfDeviceRegistered    5,630.00   3.69  1.02  1.00   3.00    4.00   4.00   6.00 -0.40 

## Extended Descriptive Statistics

- **Tenure**: Moderately right-skewed (0.74); the middle 50% of customers fall between 2 and 16 months (Q1–Q3), and a few long-standing accounts extend to 61 months.
- **HourSpendOnApp**: Nearly symmetric distribution around 3 hours, indicating consistent user engagement.
- **OrderCount**: Strong right skew (2.20) and high kurtosis (4.72) — most customers place 1–3 orders, while a small group orders very frequently.
- **CashbackAmount**: Moderate right skew (1.15); average cashback is \$177, but some customers earn up to \$325.
- **DaySinceLastOrder**: Right-skewed (1.19) with most reordering within a week, yet some have long gaps (up to 46 days).
- **CouponUsed**: Highly right-skewed (2.55) and heavy-tailed (excess kurtosis 9.13); the majority use few coupons, a minority use many.
- **OrderAmountHikeFromlastYear**: Moderate right skew (0.79); the middle 50% of year-over-year increases fall between 13% and 18%.
- **Device & Address Counts**: Device registrations lean slightly left (–0.40), addresses skew right (1.09); most users have 3–4 devices and 2–6 addresses.

# Distribution Plots for Skewed Features

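Visualize histograms + KDE for three numeric variables: `OrderCount` (skew 2.20) and `CashbackAmount` (skew 1.15) are right-skewed, while `HourSpendOnApp` (skew -0.03) is nearly symmetric.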

In [14]:
# Plot a histogram with a KDE overlay for each selected feature
for feat in ['OrderCount', 'CashbackAmount', 'HourSpendOnApp']:
    # Drop missing values before estimating the density
    hist_data = [df[feat].dropna()]
    group_labels = [feat]
    fig = ff.create_distplot(
        hist_data,
        group_labels,
        show_hist=True,
        show_rug=False
    )
    fig.update_layout(title_text=f'Distribution of {feat}', template='plotly_white')
    # Save one HTML file per feature in the eda plots folder
    out = project_root / 'plots' / 'eda' / f'{feat}_dist.html'
    fig.write_html(str(out))
    fig.show()

## Distribution Plots for Skewed Features

To better understand the shape of our numeric variables, I plotted histograms with KDE overlays. These help me decide on transformations or binning before modeling.

### OrderCount
- **Shape:** Strong right skew with a long tail up to ~16; most customers place **1–3 orders**.
- **Implications:** A small but important minority is very active.
- **Action:** Consider `log1p(OrderCount)` or **capped/bin** buckets (e.g., 1, 2–3, 4–6, 7+). Robust scalers also help.
- **Interactive plot:** [OrderCount Distribution](../plots/eda/OrderCount_dist.html)
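
The two options in the Action bullet can be sketched as follows. The sample values and the variable names (`orders`, `order_log`, `order_bucket`) are hypothetical; the bucket edges follow the example buckets listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of order counts, including one missing value
orders = pd.Series([1, 1, 2, 3, 5, 7, 16, np.nan], name="OrderCount")

# Option 1: log1p compresses the long right tail and is defined at 0
order_log = np.log1p(orders)

# Option 2: ordinal buckets 1, 2-3, 4-6, 7+
# pd.cut uses right-closed intervals by default: (0, 1], (1, 3], (3, 6], (6, inf]
bins = [0, 1, 3, 6, np.inf]
labels = ["1", "2-3", "4-6", "7+"]
order_bucket = pd.cut(orders, bins=bins, labels=labels)

print(pd.DataFrame({"raw": orders, "log1p": order_log, "bucket": order_bucket}))
```

Missing values pass through both transforms as NaN, so imputation can happen before or after without changing the mapping for observed values.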

### CashbackAmount
- **Shape:** Mild-to-moderate **right skew**; mass between ~140–200 with a gradual tail toward 300+.
- **Notes:** Multiple small peaks likely reflect promotion tiers or campaign thresholds.
- **Action:** Keep as is for tree models; for linear models, try **standardization** or a **Yeo-Johnson** / `log1p` transform (the minimum is 0, so a plain log is undefined there).
- **Interactive plot:** [CashbackAmount Distribution](../plots/eda/CashbackAmount_dist.html)
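
A Yeo-Johnson transform for a right-skewed amount could be applied with scikit-learn's `PowerTransformer`, as in the sketch below. The data are synthetic (a gamma sample standing in for cashback values), and `sample_skew` is a small helper written for this example only.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed amounts (hypothetical stand-in for CashbackAmount)
rng = np.random.default_rng(0)
cashback = rng.gamma(shape=2.0, scale=30.0, size=500).reshape(-1, 1)
cashback[0, 0] = 0.0  # Yeo-Johnson accepts zeros, unlike a plain log

# standardize=True also rescales the output to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson", standardize=True)
cashback_t = pt.fit_transform(cashback)

def sample_skew(x):
    """Simple moment-based skewness estimate."""
    x = np.asarray(x, dtype=float).ravel()
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())

skew_before = sample_skew(cashback)
skew_after = sample_skew(cashback_t)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

As with any fitted preprocessing step, the transformer should be fit on training data and reused unchanged on later splits.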

### HourSpendOnApp
- **Shape:** Concentrated around **2–4 hours**, with possible clusters near ~2, ~3 and ~4 (quasi-multimodal).
- **Interpretation:** Usage bands (e.g., light/regular/heavy users) may exist and be meaningful for churn behavior.
- **Action:** Test **ordinal/binning** (e.g., <2.5, 2.5–3.5, >3.5) or leave continuous and apply a **robust scaler**.
- **Interactive plot:** [HourSpendOnApp Distribution](../plots/eda/HourSpendOnApp_dist.html)
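
Both options in the Action bullet are sketched below on a hypothetical sample. The band labels (`light`, `regular`, `heavy`) and the `hours` frame are assumptions for illustration; the cut points are the ones suggested above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical sample of app usage in hours
hours = pd.DataFrame({"HourSpendOnApp": [0.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 5.0]})

# Option A: ordinal usage bands with cut points at 2.5 and 3.5
hours["usage_band"] = pd.cut(
    hours["HourSpendOnApp"],
    bins=[-np.inf, 2.5, 3.5, np.inf],
    labels=["light", "regular", "heavy"],
)

# Option B: keep the feature continuous and scale it by median and IQR,
# which is less sensitive to extreme values than mean/std scaling
scaler = RobustScaler()
hours["hours_robust"] = scaler.fit_transform(hours[["HourSpendOnApp"]]).ravel()

print(hours)
```

Whether the bands carry churn signal is an open question; comparing churn rates per band would be a natural check before committing to the binned version.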



# Pairwise Scatter Matrix

A quick scatter‐matrix on top features to spot non-linear patterns.

In [15]:
# select a subset to keep matrix readable
subset = ['Tenure','OrderCount','HourSpendOnApp','CashbackAmount','SatisfactionScore']
fig = px.scatter_matrix(
    df[subset],
    dimensions=subset,
    title='Scatter Matrix of Key Numeric Features',
    template='plotly_white'
)
out = project_root / 'plots' / 'eda' / 'scatter_matrix.html'
fig.write_html(str(out))
fig.show()

## Pairwise Scatter Matrix

I generated a scatter matrix for key numeric features — **Tenure, OrderCount, HourSpendOnApp, CashbackAmount, SatisfactionScore** — to quickly scan for patterns, non-linearities, and potential interactions.

**What stands out:**
- **Tenure × OrderCount:** slight upward trend with discrete “steps” (OrderCount is count data). Longer-tenure customers tend to place more orders, but the linear signal is weak.
- **Tenure × CashbackAmount:** tiered clusters suggest campaign/benefit buckets; cashback tends to increase with tenure, yet not smoothly.
- **OrderCount × CashbackAmount:** weak positive association (more orders, more cashback), but variability is high across tiers.
- **HourSpendOnApp:** scattered across other features without a clear linear relation; likely interacts with churn in non-linear ways.
- **SatisfactionScore:** banded at integer values (1–5) with no obvious straight-line relationship to other numerics in raw form.

**Why this matters:**
- Relationships look **non-linear** and sometimes **piecewise** (tiers), which means tree-based models (Random Forest, XGBoost/LightGBM) should capture structure better than purely linear models.
- For correlation, **Spearman** may be more informative than Pearson due to monotonic but non-linear trends and discrete scales.
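
To see why Spearman can be more informative here, consider a relationship that is perfectly monotonic but curved. The sketch below uses made-up data (`demo`, with `y` growing exponentially in `x`), not the churn dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical monotonic but strongly non-linear relationship
x = np.arange(1, 21, dtype=float)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

# Pearson measures linear association on the raw values
pearson = demo.corr(method="pearson").loc["x", "y"]

# Spearman is Pearson computed on ranks, so any strictly increasing
# relationship scores 1 regardless of curvature
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

On the real data, the same comparison can be run with `df[subset].corr(method="spearman")` next to the Pearson matrix; large gaps between the two point to monotonic relationships that a linear fit would understate.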

# Recency Analysis

Since we have `DaySinceLastOrder`, let's compare its distribution for churned and retained customers.

In [16]:
fig = px.box(
    df,
    x='Churn',
    y='DaySinceLastOrder',
    title='Recency (Days Since Last Order) by Churn',
    template='plotly_white'
)
out = project_root / 'plots' / 'eda' / 'recency_churn.html'
fig.write_html(str(out))
fig.show()