## Feature Engineering — Telco Customer Churn
### Objective

Transform cleaned raw customer data into model-ready, high-signal features that:

* Improve churn prediction accuracy
* Preserve business interpretability
* Avoid data leakage
* Are reusable in production pipelines

In [1]:
import pandas as pd
import numpy as np
import os

In [3]:
df = pd.read_csv("C:\\Users\\admin\\OneDrive\\Desktop\\CHURN PREDICTION\\customer-churn-prediction\\data\\processed\\churn_clean.csv")
df.head()

|   | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | 0 | 0 | 1 | 0 | 1 | 0 | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | 1 | Electronic check | 29.85 | 29.85 | 0 |
| 1 | 5575-GNVDE | 1 | 0 | 0 | 0 | 34 | 1 | No | DSL | Yes | ... | Yes | No | No | No | One year | 0 | Mailed check | 56.95 | 1889.5 | 0 |
| 2 | 3668-QPYBK | 1 | 0 | 0 | 0 | 2 | 1 | No | DSL | Yes | ... | No | No | No | No | Month-to-month | 1 | Mailed check | 53.85 | 108.15 | 1 |
| 3 | 7795-CFOCW | 1 | 0 | 0 | 0 | 45 | 0 | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | 0 | Bank transfer (automatic) | 42.3 | 1840.75 | 0 |
| 4 | 9237-HQITU | 0 | 0 | 0 | 0 | 2 | 1 | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | 1 | Electronic check | 70.7 | 151.65 | 1 |


In [4]:
df['Churn'].value_counts(dropna=False)

Churn
0    5163
1    1869
Name: count, dtype: int64

In [5]:
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [6]:
df.drop(columns=["customerID"], inplace=True)

In [7]:
df["Churn"].value_counts()

Churn
0    5163
1    1869
Name: count, dtype: int64

In [8]:
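# Binary Feature Encoding (Yes / No)
binary_cols = [
    "Partner",
    "Dependents",
    "PhoneService",
    "PaperlessBilling"
]

for col in binary_cols:
    # df.head() above shows these columns already stored as 0/1. Mapping a
    # numeric column through {"Yes": 1, "No": 0} turns every value into NaN,
    # and fillna(0) would then silently zero the column, so map text only.
    if not pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].map({"Yes": 1, "No": 0})
    df[col] = df[col].astype(int)  # raises if any value failed to map
df[binary_cols].isna().sum()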


Partner             0
Dependents          0
PhoneService        0
PaperlessBilling    0
dtype: int64

In [9]:
# Gender Encoding

df["gender"] = df["gender"].replace({
    "Male": 1,
    "Female": 0
})

In [10]:
# Service Usage Normalization

service_cols = [
    "MultipleLines", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport",
    "StreamingTV", "StreamingMovies"
]

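for col in service_cols:
    # Collapse the "no service" labels into a plain "No" (string to string)
    df[col] = df[col].replace({
        "No internet service": "No",
        "No phone service": "No"
    })
    # map() plus astype(int) gives an explicit integer column and avoids the
    # downcasting FutureWarning that replace() emits on object columns in
    # recent pandas versions; astype(int) raises if a label was not mapped
    df[col] = df[col].map({"Yes": 1, "No": 0}).astype(int)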


In [11]:
# Ordinal Encoding — Contract Type

contract_map = {
    "Month-to-month": 0,
    "One year": 1,
    "Two year": 2
}

df["Contract"] = df["Contract"].map(contract_map)


In [12]:
# One-Hot Encoding — Nominal Variables
df = pd.get_dummies(
    df,
    columns=["InternetService", "PaymentMethod"],
    drop_first=True
)


In [13]:
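# Tenure-Based Feature Engineering
# Tenure Bucket
# Intervals are right-closed: (0, 12], (12, 24], (24, 48], (48, 72] months.
# include_lowest=True also places tenure == 0 in the first bucket instead of NaN.
df["tenure_group"] = pd.cut(
    df["tenure"],
    bins=[0, 12, 24, 48, 72],
    labels=["0-1yr", "1-2yr", "2-4yr", "4-6yr"],
    include_lowest=True
)

# drop_first=True removes the "0-1yr" dummy, which becomes the baseline
df = pd.get_dummies(df, columns=["tenure_group"], drop_first=True)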


In [14]:
# Average Monthly Spend (LTV Proxy)

df["avg_monthly_spend"] = df["TotalCharges"] / df["tenure"]

# tenure == 0 yields inf (or NaN for 0 / 0); set those rows to 0.
# The result is assigned back to the column because chained inplace calls
# such as df[col].replace(..., inplace=True) act on a copy and trigger a
# pandas FutureWarning (the behavior changes in pandas 3.0).
df["avg_monthly_spend"] = (
    df["avg_monthly_spend"]
    .replace([np.inf, -np.inf], 0)
    .fillna(0)
)


In [15]:
# Service Density Feature (High Impact)
df["total_services"] = df[
    [
        "PhoneService", "MultipleLines", "OnlineSecurity",
        "OnlineBackup", "DeviceProtection", "TechSupport",
        "StreamingTV", "StreamingMovies"
    ]
].sum(axis=1)


In [16]:
# Final Validation

df.isnull().sum().sort_values(ascending=False).head()

gender                0
SeniorCitizen         0
avg_monthly_spend     0
tenure_group_4-6yr    0
tenure_group_2-4yr    0
dtype: int64

In [17]:
# Save Engineered Dataset
df.to_csv("../data/processed/featured_telco.csv", index=False)


# Feature Engineering — Telco Customer Churn

## Objective

Transform cleaned customer-level data into **model-ready, high-signal features** that improve churn prediction while maintaining interpretability and preventing data leakage.

---

## Input Schema (Cleaned)

```
customerID, gender, SeniorCitizen, Partner, Dependents, tenure,
PhoneService, MultipleLines, InternetService, OnlineSecurity,
OnlineBackup, DeviceProtection, TechSupport, StreamingTV,
StreamingMovies, Contract, PaperlessBilling, PaymentMethod,
MonthlyCharges, TotalCharges, Churn
```

---

## Feature Engineering Steps

### 1. Identifier Removal

* **Dropped** `customerID`
* Rationale: unique identifier; no predictive value

---

### 2. Target Encoding

* `Churn`: `Yes → 1`, `No → 0` (the cleaned input already stores it as 0/1: 5163 non-churners, 1869 churners)
* A numeric target is needed for correlation analysis and for most classifiers
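
The class balance follows directly from the counts printed in the notebook (5163 zeros, 1869 ones). The sketch below rebuilds a target column with those counts purely for illustration; it does not read the real file.

```python
import pandas as pd

# Rebuild a target column with the class counts reported by value_counts()
churn = pd.Series([0] * 5163 + [1] * 1869, name="Churn")

# With a 0/1 target, the mean is the share of churners
churn_rate = churn.mean()
print(f"rows={len(churn)}, churn rate={churn_rate:.1%}")
```

Roughly one customer in four churned, so a stratified train/test split and metrics other than plain accuracy are worth considering at the modeling stage.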

---

### 3. Binary Encoding (Yes/No)

**Columns**:

* Partner
* Dependents
* PhoneService
* PaperlessBilling

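**Method**:

* Normalize text (`lower()`, `strip()`) and map `yes → 1`, `no → 0` for columns that still hold text
* Leave columns that the cleaned file already stores as 0/1 unchanged; re-mapping them would turn every value into a missing value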

**Validation**:

* All values present (7032 rows)
* No missing values
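
One way to make this step safe to re-run is to encode only columns that still hold text and to fail loudly on anything unexpected. The sketch below is a minimal illustration; the helper name `encode_yes_no` and the demo data are assumptions, not part of the notebook.

```python
import pandas as pd


def encode_yes_no(series: pd.Series) -> pd.Series:
    """Encode a Yes/No column as 0/1, leaving already-numeric 0/1 data intact.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if pd.api.types.is_numeric_dtype(series):
        encoded = series
    else:
        # Normalize case and whitespace before mapping
        cleaned = series.astype(str).str.strip().str.lower()
        encoded = cleaned.map({"yes": 1, "no": 0})
    # Anything outside {0, 1} (including unmapped text, now NaN) is an error
    if not encoded.isin([0, 1]).all():
        raise ValueError(f"Unexpected values in column {series.name!r}")
    return encoded.astype(int)


demo = pd.DataFrame({
    "Partner": ["Yes", " no ", "YES"],   # raw text with messy casing
    "PaperlessBilling": [1, 0, 1],       # already encoded upstream
})
for col in demo.columns:
    demo[col] = encode_yes_no(demo[col])
```

Raising on unexpected values is a deliberate choice here: filling unmapped entries with 0 would hide encoding mistakes behind a clean-looking missing-value count.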

---

### 4. Gender Encoding

* `Male → 1`
* `Female → 0`

---

### 5. Service Usage Normalization

**Columns**:

* MultipleLines
* OnlineSecurity
* OnlineBackup
* DeviceProtection
* TechSupport
* StreamingTV
* StreamingMovies

**Standardization**:

* `No internet service → No`
* `No phone service → No`

**Encoding**:

* `Yes → 1`, `No → 0`
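
The two stages above can be sketched on a small hand-made frame. The rows are invented for illustration; only the column names and labels come from the dataset.

```python
import pandas as pd

services = pd.DataFrame({
    "MultipleLines": ["No phone service", "No", "Yes"],
    "OnlineSecurity": ["No", "Yes", "No internet service"],
})

# Labels that mean "the customer cannot have this add-on"
no_service_labels = {"No internet service": "No", "No phone service": "No"}

for col in services.columns:
    services[col] = (
        services[col]
        .replace(no_service_labels)   # collapse to plain Yes / No
        .map({"Yes": 1, "No": 0})     # encode as 0 / 1
        .astype(int)                  # fails if any label was not mapped
    )
```

Collapsing the labels discards the distinction between "declined the add-on" and "not eligible for it"; that information should remain available through `PhoneService` and `InternetService`.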

---

### 6. Ordinal Encoding — Contract

**Mapping**:

* Month-to-month → 0
* One year → 1
* Two year → 2

**Reason**:

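* Contract length is naturally ordered, so integer codes keep Month-to-month < One year < Two year; this treats the steps as evenly spaced, which is an approximation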
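
`Series.map` returns `NaN` for any label missing from the mapping instead of raising, so a quick check for unmapped labels is cheap insurance. In the sketch below, `"Three year"` is a made-up label used only to trigger the check.

```python
import pandas as pd

contract_map = {"Month-to-month": 0, "One year": 1, "Two year": 2}

# "Three year" is hypothetical; it stands in for any label the map does not cover
contracts = pd.Series(["One year", "Month-to-month", "Two year", "Three year"])
encoded = contracts.map(contract_map)

# Labels missing from the mapping come back as NaN
unmapped = contracts[encoded.isna()].unique().tolist()
print(unmapped)
```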

---

### 7. One-Hot Encoding — Nominal Features

**Columns**:

* InternetService
* PaymentMethod

**Approach**:

* One-hot encoding with `drop_first=True`
* Dropping one level per feature avoids perfect collinearity among the dummy columns (the dummy-variable trap)
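
A small sketch of what `drop_first=True` does to a three-level column. The demo rows are invented, and `dtype=int` is an optional choice made here so the dummies print as 0/1 rather than booleans.

```python
import pandas as pd

demo = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "No", "DSL"],
})

# Levels are sorted alphabetically; the first one ("DSL") is dropped
# and becomes the implicit baseline (all dummy columns equal to 0)
encoded = pd.get_dummies(
    demo, columns=["InternetService"], drop_first=True, dtype=int
)
print(encoded)
```

A DSL customer is represented by zeros in both remaining columns, so coefficients on the dummies are read relative to DSL.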

---

### 8. Tenure-Based Features

#### 8.1 Tenure Buckets

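* 0–1 year: tenure in (0, 12] months
* 1–2 years: tenure in (12, 24] months
* 2–4 years: tenure in (24, 48] months
* 4–6 years: tenure in (48, 72] months

Encoded as one-hot variables, with the first bucket (0–1 year) dropped as the baseline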
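
Bucket edges are easy to get wrong, so the sketch below runs `pd.cut` on boundary values. By default the intervals are right-closed and the lowest edge is excluded, which means a tenure of 0 falls outside every bucket unless `include_lowest=True` is passed. The sample values are chosen for illustration.

```python
import pandas as pd

tenure = pd.Series([0, 1, 12, 13, 24, 48, 72])
bins = [0, 12, 24, 48, 72]
labels = ["0-1yr", "1-2yr", "2-4yr", "4-6yr"]

# Default: intervals are (0, 12], (12, 24], ... so tenure == 0 becomes NaN
default = pd.cut(tenure, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({"tenure": tenure, "default": default, "inclusive": inclusive}))
```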

#### 8.2 Average Monthly Spend (LTV Proxy)

```
avg_monthly_spend = TotalCharges / tenure
```

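* Division by zero (`tenure = 0`) produces `inf` or `NaN`; both are replaced with 0
* Approximates the average monthly bill over the customer's lifetime, so spending can be compared across customers with different tenure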
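
A worked example of the ratio and its zero-tenure guard. The second and third rows reuse values from the `df.head()` output; the first row (tenure 0) is hypothetical and added only to exercise the guard.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "tenure": [0, 2, 34],
    "TotalCharges": [0.0, 108.15, 1889.5],
})

# 0 / 0 gives NaN and x / 0 gives inf; neither is usable as a feature value
ratio = demo["TotalCharges"] / demo["tenure"]

# Assign the cleaned result back instead of calling replace/fillna inplace
demo["avg_monthly_spend"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
print(demo)
```

Replacing with 0 is the notebook's convention; using `MonthlyCharges` as the fallback for brand-new customers would be a reasonable alternative.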

---

### 9. Service Density Feature

**Definition**:
Total number of subscribed services per customer

**Included Services**:

* PhoneService
* MultipleLines
* OnlineSecurity
* OnlineBackup
* DeviceProtection
* TechSupport
* StreamingTV
* StreamingMovies

**Business Meaning**:

* Working hypothesis: customers with more services are more embedded and therefore less likely to churn; the feature lets the model test this
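
The count is a row-wise sum over the eight 0/1 service flags, so it ranges from 0 to 8. The three customers below are invented for illustration.

```python
import pandas as pd

service_cols = [
    "PhoneService", "MultipleLines", "OnlineSecurity", "OnlineBackup",
    "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies",
]

demo = pd.DataFrame(
    [
        [1, 0, 1, 1, 0, 0, 0, 0],  # phone plus two security add-ons
        [0, 0, 0, 0, 0, 0, 0, 0],  # none of the listed services
        [1, 1, 1, 1, 1, 1, 1, 1],  # every listed service
    ],
    columns=service_cols,
)

# Row-wise sum of the 0/1 flags; all columns must already be numeric
demo["total_services"] = demo[service_cols].sum(axis=1)
print(demo["total_services"].tolist())
```

The sum only works once every service column is numeric, so this step has to run after the service usage normalization.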

---

## Final Dataset Validation

* Rows: **7032**
* Missing values: **0**
* Data types: numeric / boolean only
* Target integrity confirmed
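
These checks can be turned into assertions that run before the file is saved. The sketch below assumes a hypothetical helper named `validate_features`; the two small frames are invented to show one passing and one failing case.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def validate_features(df, target="Churn", expected_rows=None):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    problems = []
    if expected_rows is not None and len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values")
    non_numeric = [
        c for c in df.columns
        if not (is_numeric_dtype(df[c]) or is_bool_dtype(df[c]))
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    if not set(df[target].dropna().unique()) <= {0, 1}:
        problems.append(f"{target} contains values other than 0 and 1")
    return problems


good = pd.DataFrame({"tenure": [1, 34], "Contract": [0, 1], "Churn": [0, 1]})
bad = pd.DataFrame({
    "tenure": [1, None],                  # one missing value
    "Contract": ["One year", "Two year"],  # still text
    "Churn": [0, 2],                       # invalid target value
})

good_report = validate_features(good, expected_rows=2)
bad_report = validate_features(bad, expected_rows=3)
print(good_report, bad_report, sep="\n")
```

For the dataset described here, the call would be `validate_features(df, expected_rows=7032)`.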

---

## Output Artifact

* **Saved file**: `data/processed/featured_telco.csv`
* **Used by**: model training, evaluation, explainability

---

## Conclusion

> Feature engineering focused on churn-relevant signals such as tenure segmentation, service density, and normalized spending behavior. Ordinal and nominal variables were encoded appropriately, identifiers were removed, and leakage was avoided by deferring scaling to the modeling pipeline.
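
Deferring scaling matters because scaling statistics computed on the full dataset let information from the test rows influence training. The sketch below shows the idea with plain pandas: the mean and standard deviation come from the training rows only and are then applied to both splits. The `MonthlyCharges` values are taken from the `df.head()` output; the 3/2 split is arbitrary.

```python
import pandas as pd

features = pd.DataFrame({"MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70]})

# Arbitrary split for illustration: first three rows train, last two test
train = features.iloc[:3]
test = features.iloc[3:]

# Fit the scaling statistics on the training rows only
mean = train["MonthlyCharges"].mean()
std = train["MonthlyCharges"].std(ddof=0)

# Apply the same statistics to both splits
train_scaled = (train["MonthlyCharges"] - mean) / std
test_scaled = (test["MonthlyCharges"] - mean) / std
```

In a scikit-learn workflow, placing `StandardScaler` inside a `Pipeline` gives the same guarantee, because the scaler is fitted only on the data passed to `fit`.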
