# KDD Cup 2009 – Churn Dataset EDA

This notebook performs an **Exploratory Data Analysis (EDA)** on the **KDD Cup 2009 churn prediction dataset** (`KDDCup09_churn.arff`).

Goals of this notebook:
- Load the ARFF file and convert it into a clean `pandas` DataFrame
- Inspect basic structure (shape, dtypes, target distribution)
- Understand missing values and data quality issues
- Explore numerical and categorical features
- Provide visualizations that will be re-usable in the thesis

> **Note:** This notebook is written in English and is designed to be self-explanatory, so it can be referenced directly in the thesis appendix.

## 1. Imports and configuration

In [1]:
import numpy as np
import pandas as pd
from scipy.io import arff
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 120)

# Make plots a bit larger
plt.rcParams['figure.figsize'] = (8, 5)



## 2. Load the ARFF file

The dataset is provided in **ARFF** format. We use `scipy.io.arff.loadarff` to read it.

Please make sure the file `KDDCup09_churn.arff` is in the same folder as this notebook or update the `data_path` variable accordingly.

In [None]:
data_path = 'KDDCup09_churn.arff'  # change this path if needed

data, meta = arff.loadarff(data_path)
df_raw = pd.DataFrame(data)

print('Raw shape:', df_raw.shape)
df_raw.head()

## 3. Basic cleaning

`scipy.io.arff.loadarff` returns nominal attributes as Python `bytes` objects, and missing nominal values appear as the placeholder `'?'` (missing numeric values are already loaded as `NaN`). The target column is named **`CHURN`** with values `b'1'` and `b'-1'`.

We will:
- Convert all `bytes` columns to regular Python strings
- Replace `'?'` with proper `NaN`
- Map `CHURN` from `{b'1', b'-1'}` to a binary label `{1, 0}`
- Rename the target column to `churn` for convenience.

In [None]:
df = df_raw.copy()

# 3.1 Convert bytes to strings where necessary
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)

# 3.2 Standardize missing values
df.replace('?', np.nan, inplace=True)

# 3.3 Handle target variable
if 'CHURN' in df.columns:
    df['churn'] = df['CHURN'].map({'1': 1, '-1': 0}).astype('Int64')
    df.drop(columns=['CHURN'], inplace=True)
else:
    raise KeyError('CHURN column not found in the dataset')

print('Cleaned shape:', df.shape)
df[['churn']].head()
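
The effect of the three cleaning steps is easiest to see on a tiny example. The sketch below applies the same logic to a made-up frame; the column names `Var1` and `Var200` and all values are assumptions chosen for illustration, not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the loader output (hypothetical names and values)
toy = pd.DataFrame({
    'Var1': [1.0, np.nan, 3.0],
    'Var200': [b'abc', b'?', b'xyz'],
    'CHURN': [b'1', b'-1', b'-1'],
})

# Step 1: decode bytes in object columns
for col in toy.columns:
    if toy[col].dtype == object:
        toy[col] = toy[col].map(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)

# Step 2: turn the '?' placeholder into NaN
toy = toy.replace('?', np.nan)

# Step 3: map the target to {1, 0} and rename it to `churn`
toy['churn'] = toy.pop('CHURN').map({'1': 1, '-1': 0}).astype('Int64')

print(toy)
```

Using the nullable `Int64` dtype means that any label outside `{'1', '-1'}` would show up as `<NA>` instead of silently turning the column into floats.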

## 4. Dataset structure and target distribution

In [None]:
n_rows, n_cols = df.shape
print(f'Number of rows: {n_rows}')
print(f'Number of columns (including target): {n_cols}')

print('\nData types summary:')
print(df.dtypes.value_counts())

print('\nTarget distribution (churn):')
print(df['churn'].value_counts(dropna=False))
print('\nChurn rate:', df['churn'].mean())
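
For an imbalanced target, two derived numbers are often more informative than the raw counts: the majority-to-minority ratio and the accuracy of a trivial majority-class predictor. The following is a minimal sketch on hypothetical labels (the 8:2 split is invented and is not the real churn distribution).

```python
import pandas as pd

# Hypothetical labels for illustration only
churn = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], dtype='Int64', name='churn')

counts = churn.value_counts()
shares = churn.value_counts(normalize=True)

# Ratio of majority to minority class size
imbalance_ratio = counts.max() / counts.min()

# Accuracy of a model that always predicts the majority class
majority_baseline = shares.max()

print(shares)
print('Imbalance ratio:', imbalance_ratio)
print('Majority-class baseline accuracy:', majority_baseline)
```

Any later model should be judged against this baseline, since plain accuracy can look high on imbalanced data without detecting a single churner.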

## 5. Numerical vs categorical features

We separate numerical and categorical predictors. The target `churn` is excluded from the feature lists.

In [None]:
feature_cols = [c for c in df.columns if c != 'churn']

num_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(df[c])]
cat_cols = [c for c in feature_cols if c not in num_cols]

print(f'Numerical features: {len(num_cols)}')
print(f'Categorical features: {len(cat_cols)}')

num_cols[:10], cat_cols[:10]

## 6. Missing values analysis

The KDD churn dataset is known to be **noisy** and to contain many missing values.

We compute the missing rate per column and visualize the top columns with the largest amount of missing data.

In [None]:
missing_counts = df[feature_cols].isna().sum()
missing_pct = (missing_counts / len(df)) * 100

missing_df = pd.DataFrame({'missing_count': missing_counts, 'missing_pct': missing_pct})
missing_df.sort_values('missing_pct', ascending=False).head(20)

In [None]:
# Plot top 20 features by missing percentage
top_missing = missing_df.sort_values('missing_pct', ascending=False).head(20)

plt.figure()
top_missing['missing_pct'].sort_values().plot(kind='barh')
plt.xlabel('Missing percentage (%)')
plt.title('Top 20 features with highest missing rate')
plt.tight_layout()
plt.show()
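
The missing-rate table can also be used to flag columns that are entirely empty or mostly empty. The sketch below shows one possible way on a toy frame; the 50% `threshold` is an arbitrary assumption for illustration, not a value used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are made up
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, np.nan, np.nan],   # entirely missing
    'c': [1.0, np.nan, np.nan, np.nan],      # 75% missing
    'd': ['x', None, 'y', 'z'],              # 25% missing
})

# Fraction of missing values per column
missing_share = toy.isna().mean()

threshold = 0.5  # hypothetical cut-off
empty_cols = missing_share[missing_share == 1.0].index.tolist()
sparse_cols = missing_share[(missing_share > threshold) & (missing_share < 1.0)].index.tolist()

print('Entirely missing:', empty_cols)
print(f'More than {threshold:.0%} missing:', sparse_cols)
```

Entirely empty columns carry no information and can be dropped safely, whereas mostly empty columns may still be useful through a "value is missing" indicator.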

## 7. Numerical features – summary statistics

We compute basic descriptive statistics for numerical variables and inspect a few of them.
Given the large number of features, we show the summary for only the first 10 numerical features by default.

In [None]:
num_summary = df[num_cols].describe().T
num_summary.head(10)

### 7.1 Histograms for selected numerical features

To get an idea of the distribution shapes, we plot histograms for a small subset of numerical variables.
Each plot is generated in a separate figure (no subplots).

In [None]:
sample_num_cols = num_cols[:6]  # adjust if you want more/others

for col in sample_num_cols:
    plt.figure()
    df[col].hist(bins=30)
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()

### 7.2 Correlation heatmap (subset of numerical features)

Because the dataset has many numerical features, a full correlation matrix would be slow to compute and hard to read as a single plot.
We therefore visualize pairwise correlations for the first 30 numerical features only (or all of them, if there are fewer).

In [None]:
subset_num_cols = num_cols[:30]
corr_matrix = df[subset_num_cols].corr()

plt.figure(figsize=(10, 8))
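# Fix the color scale to the full correlation range so that 0 maps to the neutral color
plt.imshow(corr_matrix, aspect='auto', vmin=-1, vmax=1, cmap='coolwarm')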
plt.xticks(range(len(subset_num_cols)), subset_num_cols, rotation=90)
plt.yticks(range(len(subset_num_cols)), subset_num_cols)
plt.colorbar(label='Correlation')
plt.title('Correlation heatmap (subset of numerical features)')
plt.tight_layout()
plt.show()
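
A heatmap gives a visual overview, but a ranked list of the strongest pairs is often easier to act on. The sketch below extracts them from the upper triangle of a correlation matrix; the data is synthetic and the column names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: two columns derived from x, one independent column
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'x_noisy': x + rng.normal(scale=0.1, size=200),
    'neg_x': -x,
    'indep': rng.normal(size=200),
})

corr = toy.corr()

# Keep the strictly upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
top_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(3)
print(top_pairs)
```

Ranking by absolute value keeps strong negative correlations (such as `x` versus `neg_x`) at the top of the list alongside strong positive ones.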

## 8. Categorical features – cardinality and frequency

We inspect categorical features to understand their cardinality (number of distinct levels) and the distribution of the most frequent categories.

In [None]:
cat_cardinality = df[cat_cols].nunique().sort_values(ascending=False)
cat_cardinality.head(20)

In [None]:
# Plot top 15 categorical variables by cardinality
plt.figure()
cat_cardinality.head(15).sort_values().plot(kind='barh')
plt.xlabel('Number of unique categories')
plt.title('Top categorical features by cardinality')
plt.tight_layout()
plt.show()

### 8.1 Category frequency plots

We visualize the top categories for a small set of categorical variables.
This helps to see whether some variables are dominated by a few levels or are more evenly distributed.

In [None]:
sample_cat_cols = cat_cols[:4]  # adjust as needed

for col in sample_cat_cols:
    plt.figure()
    df[col].value_counts(dropna=False).head(10).plot(kind='bar')
    plt.title(f'Top 10 categories for {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()
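
Whether a variable is dominated by a few levels can also be expressed as a number. The sketch below computes the share of rows covered by the most frequent level and by the three most frequent levels; the toy columns `dominated` and `balanced` are invented for illustration.

```python
import pandas as pd

# Toy frame with one skewed and one evenly distributed categorical column
toy = pd.DataFrame({
    'dominated': ['a'] * 8 + ['b', 'c'],
    'balanced': ['a', 'b', 'c', 'd', 'e'] * 2,
})

rows = []
for col in toy.columns:
    # value_counts sorts by frequency, most frequent level first
    shares = toy[col].value_counts(normalize=True, dropna=False)
    rows.append({
        'feature': col,
        'n_levels': toy[col].nunique(),
        'top_share': shares.iloc[0],         # share of the most frequent level
        'top3_share': shares.head(3).sum(),  # share covered by the three most frequent levels
    })

dominance = pd.DataFrame(rows).set_index('feature')
print(dominance)
```

A `top_share` close to 1 suggests a nearly constant feature, while a low `top3_share` combined with many levels points to a high-cardinality variable that may need grouping or encoding.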

## 9. Target vs. features (preliminary insights)

Finally, we perform a few quick checks of how some features differ between churners and non-churners.
Here we only use simple groupby statistics for illustration; detailed modeling will be done in separate notebooks.

In [None]:
# Example: compare mean of a few numerical features by churn status
example_num_cols = num_cols[:5]
df.groupby('churn')[example_num_cols].mean()

In [None]:
# Example: distribution of a categorical variable by churn
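if sample_cat_cols:
    col = sample_cat_cols[0]
    # Row-normalized: each row gives the share of non-churners (0) and churners (1) within one category
    ctab = pd.crosstab(df[col], df['churn'], normalize='index')
    # Inside an `if` block the last expression is not auto-displayed, so print it explicitly
    print(ctab.head(10))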
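
Row-normalized rates can be misleading for rare categories, because a level with a single row always shows a churn rate of 0 or 1. One possible refinement, sketched here on made-up data, is to report the row count next to the rate and to keep only levels with a minimum number of observations; `min_support = 3` is an arbitrary assumption.

```python
import pandas as pd

# Toy data; the column name `Var_cat` and all values are hypothetical
toy = pd.DataFrame({
    'Var_cat': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c'],
    'churn':   [1,   0,   0,   0,   1,   1,   0,   1],
})

# Row count and churn rate per category
summary = toy.groupby('Var_cat')['churn'].agg(n='count', churn_rate='mean')

min_support = 3  # hypothetical minimum number of rows per category
reliable = summary[summary['n'] >= min_support].sort_values('churn_rate', ascending=False)
print(reliable)
```

In this example, category `c` is excluded because its churn rate of 1.0 rests on a single row.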

## 10. Summary

In this notebook we:
- Loaded the KDD Cup 2009 churn dataset from an ARFF file
- Converted raw bytes and placeholder missing values into a clean `pandas` DataFrame
- Explored the dataset structure, including the strong class imbalance of the `churn` label
- Analyzed missing values per feature to quantify how incomplete the dataset is
- Separated numerical and categorical variables and examined their distributions
- Visualized correlations and category frequencies to build intuition for later feature engineering and modeling

This EDA provides the empirical motivation for applying robust, cost-sensitive models and careful pre-processing steps in the main experiments of the thesis.