# 📊 Telco Customer Churn Prediction

## 🔎 Project Overview
This notebook builds a **customer churn prediction system** for a telecom company using the [Telco Customer Churn dataset](https://www.kaggle.com/datasets/blastchar/telco-customer-churn).  

Churn (Yes/No) represents whether a customer **left the company**. Predicting churn is crucial because acquiring new customers is often more expensive than retaining existing ones.  

The project follows a structured data science workflow:

---

## 📋 Workflow
1. **Initial Data Assessment**
   - Data types, missing values, and inconsistencies
   - Target variable analysis (churn rate and class imbalance)
   - Categorization of features into demographics, services, and financials  

2. **Exploratory Data Analysis (EDA)**
   - Outlier detection using IQR and boxplots
   - Univariate, bivariate, and multivariate analysis
   - Churn patterns across customer segments  

3. **Feature Engineering**
   - Derived features such as:
     - `TenureCategory` (New / Established / Loyal)
     - `ServiceCount` (number of services subscribed)
     - `BundleUser` (subscribed to both Internet + Phone)
     - `ChargeCategory` (low / medium / high spenders)
   - Encoding categorical features  

4. **Preprocessing & Transformations**
   - Handling missing values
   - Scaling numeric features
   - Log and power transformations for skewed variables
   - One-hot encoding categorical features  

5. **Modeling**
   - Baseline: Logistic Regression, Decision Tree
   - Ensemble methods: Random Forest, XGBoost, CatBoost
   - **Pipeline integration** with preprocessing  

6. **Model Evaluation**
   - Metrics for imbalanced data: Precision, Recall, F1-score, ROC-AUC
   - Comparison of models
   - Business interpretation of evaluation metrics  

7. **Business Insights & Recommendations**
   - High-risk customer profiles
   - Retention strategies (contract incentives, discounts, bundled services)
   - Estimated revenue impact and ROI from retention  

---

## 🎯 Learning Outcomes
- Ability to handle **class imbalance** in real-world datasets
- Building robust **data and model pipelines**
- Experience with **ensemble methods** (bagging & boosting)  
- Advanced **EDA and feature engineering** for actionable insights  
- Translation of machine learning results into **business value**  

---


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV, cross_validate
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PowerTransformer, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, accuracy_score

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from catboost import CatBoostClassifier

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Load Dataset**

In [None]:
df = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include = "all")

In [None]:
df.isnull().sum()

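The dataset contains 7043 customers, and `isnull()` reports no missing values. Blank strings are not counted as missing, however, so text-typed columns may still hide gaps.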
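
A check that complements `isnull()` is sketched below. It assumes a small made-up frame (`toy`, with invented values) in place of the real data, and counts entries that are empty or whitespace-only, which `isnull()` does not flag.

```python
import pandas as pd

# Toy stand-in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "TotalCharges": ["29.85", " ", "108.15"],  # note the blank string
    "tenure": [1, 0, 2],
})

# isnull() sees no missing values because " " is a valid string
null_counts = toy.isnull().sum()

# Count empty or whitespace-only entries per column (values are cast to str first)
blank_counts = {
    col: int(toy[col].astype(str).str.strip().eq("").sum())
    for col in toy.columns
}

print(null_counts)
print(blank_counts)
```

Any column with a non-zero blank count is a candidate for `pd.to_numeric(..., errors="coerce")` or explicit cleaning.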

# **EDA**

In [None]:
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
for i in categorical_features:
    print(i)

In [None]:
# Identify numerical features
numerical_features = df.select_dtypes(include=['int64','float64']).columns.tolist()
for i in numerical_features:
    print(i)

* Most of the features in the dataset are categorical

In [None]:
for cat in categorical_features:
    print(cat)
    print(df[cat].unique())

The `customerID` column is only an identifier, so it is not needed for EDA.

In [None]:
categorical_features.remove("customerID")

In [None]:
for num in numerical_features:
    print(num)
    print(df[num].unique())

* `SeniorCitizen` is stored as 0/1 integers but should be treated as a categorical feature
* `TotalCharges` is stored as text but should be treated as a numerical feature

In [None]:
numerical_features.remove("SeniorCitizen")  # treat this as categorical
categorical_features.append("SeniorCitizen")

categorical_features.remove("TotalCharges") #treat this as numerical
numerical_features.append("TotalCharges")


* `TotalCharges` must be converted to a numeric dtype

In [None]:
# Convert TotalCharges to numeric, forcing errors to NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Check how many became NaN (were blanks)
print("Missing values in TotalCharges:", df["TotalCharges"].isna().sum())

In [None]:
# Find rows where TotalCharges is NaN
missing_total = df[df["TotalCharges"].isna()]

# Show how many and their tenure values
print(missing_total["tenure"].value_counts())


All of these rows have a tenure of 0, so they are brand-new customers who have not been billed yet. We therefore impute `TotalCharges` with 0 for them.

In [None]:
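# Impute with 0 (new customers with no charges yet)
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df["TotalCharges"] = df["TotalCharges"].fillna(0)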

# Verify
print(df["TotalCharges"].dtype)
print(df.loc[df["tenure"] == 0, ["tenure", "TotalCharges"]].head())

In [None]:
df.info()

In [None]:
categorical_features

In [None]:
numerical_features

### 👤 Demographic Features
- **gender** – Whether the customer is male or female  
- **SeniorCitizen** – Indicates if the customer is a senior citizen (1 = Yes, 0 = No)  
- **Partner** – Whether the customer has a partner  
- **Dependents** – Whether the customer has dependents  

### 🔧 Behavioral / Services Features
- **PhoneService** – Whether the customer has phone service  
- **MultipleLines** – Whether the customer has multiple lines  
- **InternetService** – Type of internet service (DSL, Fiber optic, No)  
- **OnlineSecurity** – Whether the customer has online security service  
- **OnlineBackup** – Whether the customer has online backup service  
- **DeviceProtection** – Whether the customer has device protection service  
- **TechSupport** – Whether the customer has tech support service  
- **StreamingTV** – Whether the customer has streaming TV service  
- **StreamingMovies** – Whether the customer has streaming movies service  
- **Contract** – Type of contract (Month-to-month, One year, Two year)  
- **PaperlessBilling** – Whether the customer uses paperless billing  

### 💰 Financial Features
- **MonthlyCharges** – Amount charged to the customer monthly  
- **TotalCharges** – Total amount charged to the customer during their tenure  
- **PaymentMethod** – Payment method used by the customer  

### 🎯 Target Variable
- **Churn** – Indicates whether the customer left the company (Yes = churned, No = retained)


# **Churn distribution**

In [None]:
# Target distribution
sns.countplot(data=df, x="Churn")
plt.title("Churn Distribution")
plt.show()

In [None]:
churn_rate = df["Churn"].value_counts(normalize=True)
print("Churn Rate:\n", churn_rate)

In [None]:
churn_counts = df['Churn'].value_counts()
imbalance_ratio = churn_counts['No'] / churn_counts['Yes']
print("Imbalance Ratio (No:Yes) =", imbalance_ratio)

The classes are imbalanced: customers who did not churn clearly outnumber those who churned.
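
One common way to account for such an imbalance is class weighting. The sketch below uses made-up labels (`y_toy`) to show the "balanced" heuristic that scikit-learn's `class_weight='balanced'` option is based on, together with the negative-to-positive ratio often passed to XGBoost as `scale_pos_weight`.

```python
import numpy as np

# Toy labels (made up): 6 retained customers (0) and 2 churned customers (1)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])

counts = np.bincount(y_toy)  # samples per class
n_samples, n_classes = len(y_toy), len(counts)

# "Balanced" heuristic: n_samples / (n_classes * count_per_class)
balanced_weights = n_samples / (n_classes * counts)

# Ratio of negatives to positives, a usual choice for scale_pos_weight
neg_pos_ratio = counts[0] / counts[1]

print("Balanced class weights:", balanced_weights)
print("Negative/positive ratio:", neg_pos_ratio)
```

The rarer class receives the larger weight, so errors on churned customers cost more during training.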

# **Categorical Feature distributions**

In [None]:
for col in categorical_features:
    fig, axes = plt.subplots(1, 2, figsize=(12,4))

    # Left: Distribution of feature
    sns.countplot(x=col, data=df, ax=axes[0], palette="Set2")
    axes[0].set_title(f"Distribution of {col}")
    axes[0].tick_params(axis='x', rotation=45)

    # Right: Feature vs Churn
    churn_dist = pd.crosstab(df[col], df['Churn'], normalize='index') * 100
    churn_dist.plot(kind='bar', stacked=True, color=['skyblue','salmon'], ax=axes[1])
    axes[1].set_title(f"Churn % by {col}")
    axes[1].set_ylabel("Percentage")
    axes[1].legend(title="Churn")

    plt.tight_layout()
    plt.show()


# **Numerical Feature Distributions**

In [None]:
for col in numerical_features:
    fig, axes = plt.subplots(1, 2, figsize=(12,4))

    # Left: Distribution (histogram)
    sns.histplot(df[col], bins=30, kde=True, ax=axes[0], color="skyblue")
    axes[0].set_title(f"Distribution of {col}")

    # Right: Boxplot vs Churn
    sns.boxplot(x="Churn", y=col, data=df, palette="Set2", ax=axes[1])
    axes[1].set_title(f"{col} by Churn")

    plt.tight_layout()
    plt.show()

# **Outlier Detection**

In [None]:
for col in numerical_features:
    plt.figure(figsize=(10,4))

    # Boxplot
    sns.boxplot(x=df[col], color="skyblue")
    plt.title(f"Boxplot of {col}")
    plt.show()

    # IQR Calculation
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(f"{col}: {outliers.shape[0]} outliers detected")
    print(f"Lower bound: {lower_bound}, Upper bound: {upper_bound}\n")


No outliers detected
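
The IQR rule used above can also be wrapped in a small helper. This is a minimal sketch with a hypothetical function name (`iqr_outliers`) and made-up values, not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up values with one obvious extreme point
toy_values = pd.Series([1, 2, 3, 4, 100])
flagged = iqr_outliers(toy_values)
print(flagged.tolist())
```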

# **Feature Engineering**

# Tenure_category

* New: 0-1 year
* Established: 1-4 years
* Loyal: 4-6 years

In [None]:
def tenure_category(tenure):
    if tenure <= 12: 
        return "New"          # 0–1 year
    elif tenure <= 48: 
        return "Established"  # 1–4 years
    else: 
        return "Loyal"        # 4–6 years

df["Tenure_category"] = df["tenure"].apply(tenure_category)
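
As an alternative to a row-wise `apply`, the same bins can be expressed with `pd.cut`. The sketch below uses made-up tenure values and assumes the same boundaries (12 and 48 months); note that `pd.cut` returns an ordered categorical rather than plain strings.

```python
import numpy as np
import pandas as pd

toy_tenure = pd.Series([0, 12, 13, 48, 49, 72])  # months, made-up values

# Right-closed bins: (-1, 12], (12, 48], (48, inf)
toy_category = pd.cut(
    toy_tenure,
    bins=[-1, 12, 48, np.inf],
    labels=["New", "Established", "Loyal"],
)
print(toy_category.tolist())
```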

# Charge_category

* `Charge_category` splits `MonthlyCharges` into three equal-sized groups (Low, Medium, High)

In [None]:
df["Charge_category"] = pd.qcut(df["MonthlyCharges"], q=3, labels=["Low","Medium","High"])
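
To see where the tercile boundaries fall, `pd.qcut` can also return the bin edges. A minimal sketch on made-up charges (`toy_monthly`):

```python
import pandas as pd

toy_monthly = pd.Series([10, 20, 30, 40, 50, 60])  # made-up monthly charges

# retbins=True also returns the quantile edges used for the three groups
toy_bins, edges = pd.qcut(
    toy_monthly, q=3, labels=["Low", "Medium", "High"], retbins=True
)

print("Bin edges:", edges)
print(toy_bins.value_counts().sort_index())
```

Because the edges are quantiles, each group holds roughly the same number of customers, regardless of how the charges are distributed.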

# Service_count

* Total number of services a customer has

In [None]:
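# Services recorded as Yes / No (add-ons may also read "No internet service")
service_cols = ['PhoneService','MultipleLines',
                'OnlineSecurity','OnlineBackup','DeviceProtection',
                'TechSupport','StreamingTV','StreamingMovies']

# InternetService holds the service type (DSL / Fiber optic / No),
# so it never equals "Yes" and is counted separately
df["Service_count"] = ((df[service_cols] == "Yes").sum(axis=1)
                       + (df["InternetService"] != "No").astype(int))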
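
One detail worth checking when counting services: `InternetService` stores the service type (DSL, Fiber optic, No) rather than Yes/No, so a comparison against `"Yes"` never matches it. The sketch below illustrates the difference on made-up rows (`toy_services`).

```python
import pandas as pd

toy_services = pd.DataFrame({
    "PhoneService": ["Yes", "Yes", "No"],
    "InternetService": ["DSL", "No", "Fiber optic"],
    "StreamingTV": ["Yes", "No internet service", "No"],
})

# Comparing every column with "Yes" never matches DSL / Fiber optic
yes_only = (toy_services == "Yes").sum(axis=1)

# Treat any InternetService value other than "No" as one subscribed service
has_internet = (toy_services["InternetService"] != "No").astype(int)
yes_no_cols = ["PhoneService", "StreamingTV"]
with_internet = (toy_services[yes_no_cols] == "Yes").sum(axis=1) + has_internet

print(pd.DataFrame({"yes_only": yes_only, "with_internet": with_internet}))
```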

# Bundle_user

* Customers with both Internet + Phone

In [None]:
# Bundle_user: 1 if customer has BOTH InternetService (not 'No') and PhoneService = 'Yes'
df["Bundle_user"] = np.where(
    (df["InternetService"] != "No") & (df["PhoneService"] == "Yes"), 1, 0
)

# **EDA after feature engineering**

In [None]:
categorical_features.extend(["Charge_category", "Tenure_category","Bundle_user"])
numerical_features.append("Service_count")

**Feature distributions**

In [None]:
for col in categorical_features:
    fig, axes = plt.subplots(1, 2, figsize=(12,4))

    # Left: Distribution of feature
    sns.countplot(x=col, data=df, ax=axes[0], palette="Set2")
    axes[0].set_title(f"Distribution of {col}")
    axes[0].tick_params(axis='x', rotation=45)

    # Right: Feature vs Churn
    churn_dist = pd.crosstab(df[col], df['Churn'], normalize='index') * 100
    churn_dist.plot(kind='bar', stacked=True, color=['skyblue','salmon'], ax=axes[1])
    axes[1].set_title(f"Churn % by {col}")
    axes[1].set_ylabel("Percentage")
    axes[1].legend(title="Churn")

    plt.tight_layout()
    plt.show()

In [None]:
for col in numerical_features:
    fig, axes = plt.subplots(1, 2, figsize=(12,4))

    # Left: Distribution (histogram)
    sns.histplot(df[col], bins=30, kde=True, ax=axes[0], color="skyblue")
    axes[0].set_title(f"Distribution of {col}")

    # Right: Boxplot vs Churn
    sns.boxplot(x="Churn", y=col, data=df, palette="Set2", ax=axes[1])
    axes[1].set_title(f"{col} by Churn")

    plt.tight_layout()
    plt.show()

# **Transforming Strategy**

# TotalCharges

In [None]:
feature = "TotalCharges"
x = df[feature].copy()

# Transformations
transforms = {
    "Original": x,
    "Log": np.log1p(x),
    "Sqrt": np.sqrt(x),
    "Cube Root": np.cbrt(x),
    "Reciprocal": 1/(x+1),  # +1 to avoid division by zero
    "Yeo-Johnson": PowerTransformer(method="yeo-johnson").fit_transform(x.values.reshape(-1,1)).flatten()
}

# Plot results
plt.figure(figsize=(12,8))
for i, (name, vals) in enumerate(transforms.items(), 1):
    plt.subplot(3,2,i)
    sns.histplot(vals, bins=30, kde=True, color="skyblue")
    plt.title(f"{feature} - {name}")
plt.tight_layout()
plt.show()


* The square-root transform gives the most symmetric distribution, so it is the one used for `TotalCharges`
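
The visual comparison can be backed by a number. The sketch below computes the sample skewness after each transform on synthetic right-skewed data (`toy_skewed`, drawn from an exponential distribution as a stand-in, not the real `TotalCharges`); values closer to 0 indicate a more symmetric distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for a charges-like variable
toy_skewed = pd.Series(rng.exponential(scale=1000.0, size=2000))

skewness = {
    "Original": toy_skewed.skew(),
    "Log": np.log1p(toy_skewed).skew(),
    "Sqrt": np.sqrt(toy_skewed).skew(),
    "Cube Root": np.cbrt(toy_skewed).skew(),
}

for name, value in skewness.items():
    print(f"{name:<10} skewness = {value:.2f}")
```

Which transform comes out best depends on the data, so the same comparison would need to be run on the actual column before drawing conclusions.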

# MonthlyCharges

In [None]:
feature = "MonthlyCharges"
x = df[feature].copy()

# Transformations
transforms = {
    "Original": x,
    "Log": np.log1p(x),
    "Sqrt": np.sqrt(x),
    "Cube Root": np.cbrt(x),
    "Reciprocal": 1/(x+1),  # +1 to avoid division by zero
    "Yeo-Johnson": PowerTransformer(method="yeo-johnson").fit_transform(x.values.reshape(-1,1)).flatten()
}

# Plot results
plt.figure(figsize=(12,8))
for i, (name, vals) in enumerate(transforms.items(), 1):
    plt.subplot(3,2,i)
    sns.histplot(vals, bins=30, kde=True, color="skyblue")
    plt.title(f"{feature} - {name}")
plt.tight_layout()
plt.show()


* No need to transform, because none of the transformations noticeably changes the shape of the distribution

# tenure

In [None]:
feature = "tenure"
x = df[feature].copy()

# Transformations
transforms = {
    "Original": x,
    "Log": np.log1p(x),
    "Sqrt": np.sqrt(x),
    "Cube Root": np.cbrt(x),
    "Reciprocal": 1/(x+1),  # +1 to avoid division by zero
    "Yeo-Johnson": PowerTransformer(method="yeo-johnson").fit_transform(x.values.reshape(-1,1)).flatten()
}

# Plot results
plt.figure(figsize=(12,8))
for i, (name, vals) in enumerate(transforms.items(), 1):
    plt.subplot(3,2,i)
    sns.histplot(vals, bins=30, kde=True, color="skyblue")
    plt.title(f"{feature} - {name}")
plt.tight_layout()
plt.show()


* No need to transform

In [None]:
categorical_features

In [None]:
# Define x and y
X = df.drop(["Churn"], axis = 1)
y = df["Churn"].map({
    "Yes":1,
    "No":0
})

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)

# Keep customerID from test set separately
customer_id = X_test["customerID"].copy()

# Drop customerID from features (train and test)
X_train = X_train.drop("customerID", axis=1)
X_test = X_test.drop("customerID", axis=1)

# Check shape
X_train.shape


# **Data Preprocessing Pipeline**

In [None]:
# Step 1: Apply sqrt to the selected column
sqrt_obj = FunctionTransformer(np.sqrt,validate = True)
sqrt_features = ['TotalCharges']

sqrt_transformer = ColumnTransformer(
    transformers=[
        ('sqrt',sqrt_obj,sqrt_features),
    ],
    remainder = 'passthrough'
)

# Step 2: Apply the sqrt step, then standardize all numeric columns
numeric_pipeline = Pipeline(
    steps=[
        ('sqrt_transform',sqrt_transformer),
        ('scaler',StandardScaler()),
    ]
)

# Step 3: Define objects for categorical columns
# (Churn is an object column, so it is in categorical_features and must be excluded here)
cat_features = [col for col in categorical_features if col != "Churn"]
nominal_encoder = OneHotEncoder(handle_unknown='ignore',sparse_output=False)

# Step 4: Make the entire preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num',numeric_pipeline,numerical_features),
        ('cat',nominal_encoder,cat_features)
    ]
)
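
To make the behaviour of such a preprocessor concrete, here is a simplified sketch on a made-up frame. The names `toy_preprocessor`, `train`, `new` and `to_dense` are illustrative only, and the structure is reduced to one sqrt-transformed column, one scaled column and one one-hot encoded column. It shows that a category unseen during fitting is encoded as all zeros when `handle_unknown="ignore"` is set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Made-up training rows and one new row with an unseen Contract value
train = pd.DataFrame({
    "TotalCharges": [0.0, 100.0, 400.0, 900.0],
    "tenure": [0, 5, 20, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "One year"],
})
new = pd.DataFrame({"TotalCharges": [225.0], "tenure": [10], "Contract": ["Two year"]})

sqrt_then_scale = Pipeline([
    ("sqrt", FunctionTransformer(np.sqrt)),
    ("scaler", StandardScaler()),
])

toy_preprocessor = ColumnTransformer([
    ("sqrt_num", sqrt_then_scale, ["TotalCharges"]),
    ("num", StandardScaler(), ["tenure"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
])

def to_dense(a):
    """Return a dense array whether the transformer produced sparse or dense output."""
    return a.toarray() if hasattr(a, "toarray") else np.asarray(a)

train_t = to_dense(toy_preprocessor.fit_transform(train))  # statistics learned here
new_t = to_dense(toy_preprocessor.transform(new))          # reused, not re-learned

print(train_t.shape)
print(new_t)
```

The scaler statistics and the category list come from the training rows only, which is what keeps information from new data out of the fitted preprocessing.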

# **Model Building Pipeline**

In [None]:
# Class counts in the training set
n_positive = sum(y_train == 1)  # churned customers
n_negative = sum(y_train == 0)  # non-churned customers

# Weight for the positive class in XGBoost: ratio of negatives to positives
scale_pos_weight = n_negative / n_positive
models = {
    "Logistic Regression": LogisticRegression(max_iter = 100,class_weight = 'balanced'),
    "Decision Tree": DecisionTreeClassifier(class_weight = 'balanced'),
    "Random Forest": RandomForestClassifier(n_estimators= 100,class_weight='balanced'),
    "XGBoost": xgb.XGBClassifier(eval_metric='logloss',scale_pos_weight = scale_pos_weight),
    "CatBoost":CatBoostClassifier(verbose=0,auto_class_weights='Balanced')
}

# Stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}

for name, model in models.items():
    clf = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])
    
    # --- Cross-validation on training data ---
    cv_scores = cross_validate(
        clf, X_train, y_train, cv=cv,
        scoring=["accuracy", "roc_auc"], return_train_score=True
    )
    
    # --- Fit on full training set ---
    clf.fit(X_train, y_train)
    
    # --- Predictions for training and test sets ---
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    
    y_train_proba = clf.predict_proba(X_train)[:,1]
    y_test_proba = clf.predict_proba(X_test)[:,1]
    
    # --- Metrics ---
    train_acc = accuracy_score(y_train, y_train_pred)
    test_acc = accuracy_score(y_test, y_test_pred)
    train_auc = roc_auc_score(y_train, y_train_proba)
    test_auc = roc_auc_score(y_test, y_test_proba)
    
    # --- Confusion matrix (test set) ---
    cm = confusion_matrix(y_test, y_test_pred)
    plt.figure(figsize=(5,4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1])
    plt.title(f"{name} - Confusion Matrix (Test Set)")
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()
    
    # --- Print metrics ---
    print(f"\n===== {name} =====")
    print(f"Training Accuracy : {train_acc:.3f} | ROC-AUC : {train_auc:.3f}")
    print(f"Test Accuracy     : {test_acc:.3f} | ROC-AUC : {test_auc:.3f}")
    print(f"Cross-Validation Accuracy : {cv_scores['test_accuracy'].mean():.3f} ± {cv_scores['test_accuracy'].std():.3f}")
    print(f"Cross-Validation ROC-AUC  : {cv_scores['test_roc_auc'].mean():.3f} ± {cv_scores['test_roc_auc'].std():.3f}")
    print(classification_report(y_test, y_test_pred))
    
    # --- Save results ---
    results[name] = {
        "train_accuracy": train_acc,
        "train_roc_auc": train_auc,
        "test_accuracy": test_acc,
        "test_roc_auc": test_auc,
        "cv_accuracy_mean": cv_scores['test_accuracy'].mean(),
        "cv_accuracy_std": cv_scores['test_accuracy'].std(),
        "cv_roc_auc_mean": cv_scores['test_roc_auc'].mean(),
        "cv_roc_auc_std": cv_scores['test_roc_auc'].std(),
        "confusion_matrix": cm
    }


**According to the above results Logistic Regression is the best model so far**
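
Since accuracy alone can be misleading on imbalanced data, it helps to read precision, recall and F1 for the churn class directly off a confusion matrix. The sketch below uses a made-up matrix (`toy_cm`), not results from this notebook, laid out the way scikit-learn's `confusion_matrix` returns it (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Made-up confusion matrix: rows = actual (0, 1), columns = predicted (0, 1)
toy_cm = np.array([[80, 20],
                   [10, 40]])
tn, fp, fn, tp = toy_cm.ravel()

precision = tp / (tp + fp)   # of the customers flagged as churn, how many really churned
recall = tp / (tp + fn)      # of the customers who churned, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / toy_cm.sum()

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

For retention use cases, recall on the churn class is often the number to watch, because a missed churner is usually more costly than an unnecessary retention offer.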

# **Hyperparameter tuning**

In [None]:
# Pipeline: preprocessing + model
clf_new = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

# Hyperparameter grid
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2'],
    'classifier__solver': ['lbfgs']
}

# Stratified CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# GridSearchCV
grid_search = GridSearchCV(
    clf_new, param_grid, cv=cv, scoring='roc_auc', n_jobs=-1
)

# Fit
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best CV ROC-AUC:", grid_search.best_score_)
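
Beyond the single best score, the full grid can be inspected through `cv_results_`. The following sketch runs a small grid search on synthetic data from `make_classification` (a stand-in for the churn data; `toy_search` and `summary` are illustrative names) and tabulates the mean and standard deviation of the cross-validated ROC-AUC for each value of `C`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, mildly imbalanced stand-in data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, weights=[0.75, 0.25], random_state=0
)

toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
)
toy_search.fit(X_toy, y_toy)

# One row per candidate, best-ranked first
summary = (
    pd.DataFrame(toy_search.cv_results_)
    [["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(summary)
```

If the mean scores differ by less than their standard deviations, the choice of `C` is not doing much, and a simpler (more regularized) setting may be just as good.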


In [None]:
final_model = grid_search.best_estimator_
final_model

In [None]:
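# --- GridSearchCV already refit best_estimator_ on the full training set (refit=True by default),
# --- so this fit is redundant but harmless ---
final_model.fit(X_train, y_train)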

# --- Predictions for training and test sets ---
y_train_pred = final_model.predict(X_train)
y_test_pred = final_model.predict(X_test)

y_train_proba = final_model.predict_proba(X_train)[:,1]
y_test_proba = final_model.predict_proba(X_test)[:,1]

# --- Metrics ---
train_acc = accuracy_score(y_train, y_train_pred)
test_acc = accuracy_score(y_test, y_test_pred)
train_auc = roc_auc_score(y_train, y_train_proba)
test_auc = roc_auc_score(y_test, y_test_proba)

# --- Confusion matrix (test set) ---
cm = confusion_matrix(y_test, y_test_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1])
plt.title("Logistic regression after hyperparameter tuning - Confusion Matrix (Test Set)")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()

# --- Print metrics ---
print("\n===== Logistic regression after hyperparameter tuning =====")
print(f"Training Accuracy : {train_acc:.3f} | ROC-AUC : {train_auc:.3f}")
print(f"Test Accuracy     : {test_acc:.3f} | ROC-AUC : {test_auc:.3f}")
# cv_scores still holds the last model from the comparison loop (CatBoost),
# so report the grid search's own cross-validated ROC-AUC instead
best_cv_std = grid_search.cv_results_['std_test_score'][grid_search.best_index_]
print(f"Cross-Validation ROC-AUC  : {grid_search.best_score_:.3f} ± {best_cv_std:.3f}")
print(classification_report(y_test, y_test_pred))


# **Model Predictions**

In [None]:
# Predict for test data
test_preds = final_model.predict(X_test)

# Add predicted churn
predictions_df = X_test.copy()
predictions_df["Predicted Churn"] = test_preds
predictions_df["customerID"] = customer_id

# Reorder columns: customerID first
cols = ["customerID"] + [col for col in predictions_df.columns if col != "customerID"]
predictions_df = predictions_df[cols]

# View first few rows
predictions_df.head()
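
The output above stores the class label returned by `predict`. If a churn probability is wanted as well, the positive-class column of `predict_proba` can be kept alongside it, and a different decision threshold can then be applied. The sketch below uses made-up IDs and scores (`toy_output`); the 0.40 threshold is a hypothetical choice for illustration.

```python
import pandas as pd

toy_output = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],         # hypothetical IDs
    "churn_probability": [0.12, 0.47, 0.55, 0.91],  # hypothetical model scores
})

# Decision at the conventional 0.5 cut-off
toy_output["label_at_0.50"] = (toy_output["churn_probability"] >= 0.5).astype(int)

# A lower, hypothetical threshold flags more customers for retention outreach
toy_output["label_at_0.40"] = (toy_output["churn_probability"] >= 0.4).astype(int)

print(toy_output)
```

Lowering the threshold trades precision for recall, so the right value depends on the relative cost of a missed churner versus an unnecessary offer.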

# **Conclusion**

In this project, we built a machine learning pipeline to predict customer churn for a telecom company using the Telco Customer Churn dataset. The workflow covered data preprocessing, feature engineering, model training, evaluation, and prediction, with preprocessing and modeling combined in a single reproducible pipeline.

**Key Steps and Insights**

* Data Preprocessing
  * Handled numeric and categorical features separately.
  * Applied a square root transformation to reduce skewness in TotalCharges.
  * Standardized all numeric features using StandardScaler.
  * Encoded categorical features with OneHotEncoder, ignoring unknown categories.
  * Dropped irrelevant identifiers like customerID from the modeling features.
* Feature-Target Split and Data Splitting
  * The target variable Churn was mapped to binary values (Yes=1, No=0).
  * A stratified train-test split preserved the original class proportions in both sets.
* Modeling and Evaluation
  * Multiple classification models were tested: Logistic Regression, Decision Tree, Random Forest, XGBoost, and CatBoost.
  * Cross-validation was performed with stratification to obtain reliable performance estimates.
  * Metrics evaluated included accuracy, ROC-AUC, and confusion matrices to capture both overall and class-level performance.

**Logistic Regression** emerged as the best-performing model, with a test accuracy of about 80% and a high ROC-AUC, suggesting both good predictive power and good generalization.

**Hyperparameter Tuning**

* **GridSearchCV** was used to tune the regularization strength (C) of Logistic Regression.
* The final model combined preprocessing and tuned Logistic Regression in a single pipeline, ensuring that test data can be predicted without additional preprocessing.

**Predictions and Output**

* Predicted probabilities for the positive class (churn) were generated on the test set and used to compute ROC-AUC.
* A final DataFrame was created that retains the test-set features along with customerID and the predicted churn label (0 or 1), ready for analysis or submission.

**Key Takeaways**

* Pipelines and ColumnTransformers ensure that preprocessing and modeling are fully reproducible and avoid data leakage.
* Cross-validation and ROC-AUC are essential for evaluating models on imbalanced datasets like churn prediction.
* Even simple models like Logistic Regression can perform strongly when proper preprocessing and hyperparameter tuning are applied.
* Feature engineering and transformations (e.g., a sqrt transform for skewed data) can improve model stability and performance, although their individual effect was not measured here.

**Future Work / Improvements**

* Explore feature interactions or derived features for better predictive performance.
* Consider class imbalance handling using **SMOTE** if the churn ratio is highly skewed.
* Deploy the pipeline as a predictive service for real-time churn probability estimation.

Conclusion: The project demonstrates a complete end-to-end machine learning workflow. Logistic Regression, combined with thoughtful preprocessing and hyperparameter tuning, provides an effective solution for predicting telecom customer churn with interpretable and reproducible results.

