# Telco Customer Churn

In this workshop, you'll work with a classic telecom dataset from [IBM Sample Data Sets](https://www.kaggle.com/datasets/blastchar/telco-customer-churn), where the goal is to understand and predict **customer churn** — whether a customer left the company within the last month.

Each row corresponds to a customer, with the following columns:

- `customerID` – unique customer identifier
- `gender` – Male or Female
- `SeniorCitizen` – whether the customer is a senior citizen (1 = yes, 0 = no)
- `Partner` – whether the customer has a partner (Yes/No)
- `Dependents` – whether the customer has dependents (Yes/No)
- `tenure` – number of months the customer has been with the company
- `PhoneService` – whether the customer has phone service (Yes/No)
- `MultipleLines` – whether the customer has multiple lines (Yes/No/No phone service)
- `InternetService` – type of internet service (DSL, Fiber optic, No)
- `OnlineSecurity` – whether the customer has online security (Yes/No/No internet service)
- `OnlineBackup` – whether the customer has online backup (Yes/No/No internet service)
- `DeviceProtection` – whether the customer has device protection (Yes/No/No internet service)
- `TechSupport` – whether the customer has tech support (Yes/No/No internet service)
- `StreamingTV` – whether the customer has streaming TV service (Yes/No/No internet service)
- `StreamingMovies` – whether the customer has streaming movie service (Yes/No/No internet service)
- `Contract` – contract term (Month-to-month, One year, Two year)
- `PaperlessBilling` – whether the customer uses paperless billing (Yes/No)
- `PaymentMethod` – payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- `MonthlyCharges` – the monthly charge amount
- `TotalCharges` – the total amount charged over the customer's tenure
- `Churn` – whether the customer churned (Yes/No) — **this is the target variable**

Below you'll find some possible starting points. Pick the level that best suits you, dig in, or ignore them and do your own thing. Happy coding!


##### **Beginner**
- Start by getting oriented with `.head()`, `.info()`, `.describe()`.
    - How many rows/columns do you have?
    - Which columns are numeric vs categorical?
    - Are there missing values?
- What is the overall churn rate? `df["Churn"].value_counts()` gives the counts; pass `normalize=True` to get proportions.
- Explore the categorical columns using `.value_counts()`:
    - `Contract`, `InternetService`, `PaymentMethod`
    - Do any categories stand out as particularly common or rare?
- Make a histogram of `tenure` and another of `MonthlyCharges`. What do the distributions look like?
- Compute the churn rate by `Contract` type using [groupby](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html). Which contract type has the highest churn?
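
As a starting point for the last item, here is a minimal sketch of a per-category churn rate with `groupby`. The helper name `churn_rate_by` and the six-row `toy` frame are made up for illustration; with the real data you would pass `df` instead.

```python
import pandas as pd

# Toy stand-in for the real data: these six rows are invented for illustration.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "One year", "One year", "Two year"],
    "Churn": ["Yes", "Yes", "No", "No", "Yes", "No"],
})


def churn_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the fraction of churned customers within each category of `column`."""
    churned = df["Churn"].eq("Yes")  # True where the customer churned
    # The mean of a boolean Series is the share of True values in each group.
    return churned.groupby(df[column]).mean().sort_values(ascending=False)


rates = churn_rate_by(toy, "Contract")
print(rates)
```

The same helper should work for any categorical column, for example `churn_rate_by(df, "InternetService")`.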


##### **Intermediate**
- Build a summary table: for each `InternetService` type, compute the churn rate, mean `MonthlyCharges`, and mean `tenure`.
- Investigate the relationship between `tenure` and churn. Bin `tenure` into groups (e.g. 0–12, 13–24, 25–48, 49–72 months) using [`pd.cut()`](https://pandas.pydata.org/docs/reference/api/pandas.cut.html) and compute the churn rate per group.
- Make a figure with a few subplots comparing churn rates across different features (e.g. `Contract`, `InternetService`, `TechSupport`, `PaymentMethod`).
- Fit a [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) predicting `Churn` from `tenure`, `MonthlyCharges`, and `Contract` (one-hot encode `Contract` and scale the numeric columns so the coefficients are comparable). Which features matter most?
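
The tenure binning step could be sketched as follows. The helper name `churn_rate_by_tenure` and the `toy` rows are hypothetical; the bin edges follow the example ranges suggested above. `include_lowest=True` is there so that a tenure of exactly 0 falls into the first bin rather than becoming missing.

```python
import pandas as pd

# Toy stand-in for the real data: these rows are invented for illustration.
toy = pd.DataFrame({
    "tenure": [0, 1, 12, 13, 24, 30, 48, 60, 72],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No"],
})


def churn_rate_by_tenure(df: pd.DataFrame) -> pd.Series:
    """Bin `tenure` into month ranges and return the churn rate per bin."""
    bins = [0, 12, 24, 48, 72]                    # right-inclusive bin edges
    labels = ["0-12", "13-24", "25-48", "49-72"]  # one label per interval
    groups = pd.cut(df["tenure"], bins=bins, labels=labels, include_lowest=True)
    churned = df["Churn"].eq("Yes")
    # observed=False keeps empty bins in the result instead of dropping them.
    return churned.groupby(groups, observed=False).mean()


tenure_rates = churn_rate_by_tenure(toy)
print(tenure_rates)
```

If the real data contains tenures above 72 months, the last edge would need to be raised, since values outside the bins come back as missing.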


##### **Advanced**
- Encode all categorical features and build a full predictive model:
    - Split into train/test sets.
    - Try a tree-based model such as [XGBoost](https://xgboost.readthedocs.io/en/stable/), [LightGBM](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html), or [CatBoost](https://catboost.ai/docs/en/concepts/python-quickstart).
    - Evaluate using accuracy, precision, recall, and AUC-ROC. Since the classes are imbalanced, why might accuracy alone be misleading?
- Interpretability: use [SHAP](https://shap.readthedocs.io/en/latest/) to identify which features drive churn predictions the most. Do the results match your intuition from the EDA?
- Think about the business side: if it costs the company $X$ to retain a customer, and a churned customer represents $Y$ in lost revenue, how would you turn your model into a decision rule for who to target with a retention offer?
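
One possible way to frame the business question is as an expected-value rule: make the offer when the expected revenue saved exceeds the cost of the offer. The sketch below assumes a hypothetical `success_rate`, the chance that an offer actually keeps a customer who would otherwise leave. That parameter is not part of the exercise, and the function names and probabilities are invented for illustration. Under these assumptions, a customer with churn probability $p$ is worth targeting when $p \cdot \text{success\_rate} \cdot Y > X$.

```python
def retention_threshold(cost_x: float, lost_revenue_y: float,
                        success_rate: float = 1.0) -> float:
    """Churn probability above which a retention offer has positive expected value.

    Assumes the offer costs `cost_x` for every targeted customer and saves
    `lost_revenue_y` with probability `success_rate` if the customer would
    otherwise have churned.
    """
    if cost_x < 0 or lost_revenue_y <= 0 or not 0 < success_rate <= 1:
        raise ValueError("expected cost_x >= 0, lost_revenue_y > 0, 0 < success_rate <= 1")
    return cost_x / (success_rate * lost_revenue_y)


def customers_to_target(churn_probs, cost_x, lost_revenue_y, success_rate=1.0):
    """Return the positions of customers whose churn probability clears the threshold."""
    threshold = retention_threshold(cost_x, lost_revenue_y, success_rate)
    return [i for i, p in enumerate(churn_probs) if p > threshold]


# Hypothetical predicted churn probabilities for four customers.
probs = [0.05, 0.15, 0.40, 0.90]
targets = customers_to_target(probs, cost_x=50, lost_revenue_y=500, success_rate=0.5)
print(targets)
```

In practice the probabilities would come from your model's `predict_proba` output, and the rule is only as good as the model's calibration, so that is worth checking before trusting the threshold.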

## Code

```python
import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()
```
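
After loading, two small preparation steps are often useful; the sketch below is a suggestion, not part of the original notebook. In the commonly circulated version of this CSV, `TotalCharges` tends to load as text because a few rows hold a blank string, so check `df.info()` before relying on that. The helper name `prepare` and the new column name `ChurnFlag` are hypothetical, and the two `toy` rows are invented to show the behaviour.

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric `TotalCharges` and a 0/1 churn indicator."""
    out = df.copy()
    # Unparseable entries (such as blank strings) become NaN instead of raising.
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    # Hypothetical helper column: 1 if the customer churned, 0 otherwise.
    out["ChurnFlag"] = out["Churn"].map({"Yes": 1, "No": 0})
    return out


# Toy stand-in for the real data: invented rows, one with a blank TotalCharges.
toy = pd.DataFrame({
    "tenure": [0, 2],
    "TotalCharges": [" ", "108.15"],
    "Churn": ["No", "Yes"],
})
clean = prepare(toy)
print(clean)
```

With the real data this would be `df = prepare(df)`, after which `df["ChurnFlag"].mean()` gives the overall churn rate directly.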