# Customer Retention Analysis: Telco Churn Dataset
### Understanding customer behavior and identifying churn drivers for a telecommunications company

---

## Notebook 01: Dataset Overview
This notebook provides an initial overview of the Telco Churn dataset.  
We’ll explore its structure, data types, and feature definitions to prepare for later analysis.

---

## Table of Contents
- [1.0 Dataset Overview](#10-dataset-overview)
- [1.1 Dataset Structure & Datatypes](#11-dataset-structure--datatypes)
- [1.2 Investigating Non-Numeric TotalCharges Entries](#12-investigating-non-numeric-totalcharges-entries)
- [1.3 Feature Definitions & Categories](#13-feature-definitions--categories)
- [1.4 Summary](#14-summary)


## 1.0 Dataset Overview <a class="anchor" id="10-dataset-overview"></a>

The dataset used in this project is the *Telco Customer Churn Dataset*, originally published by IBM and available on [Kaggle](https://www.kaggle.com/blastchar/telco-customer-churn). It contains customer demographics, account information, and service usage details for a telecommunications company. The target variable is `Churn`, which indicates whether a customer has discontinued the service.


#### Import Libraries

In [1]:
# Setup project root path
from setup_paths import add_project_root
add_project_root()

In [2]:
# Import libraries
import pandas as pd

#### Load the Data
Load the raw dataset.

In [3]:
# Load raw dataset from csv file
df = pd.read_csv('../data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv')

## 1.1 Dataset Structure & Datatypes <a class="anchor" id="11-dataset-structure--datatypes"></a>

In [4]:
# Examine dataset shape and size
print(f'Dataset rows: {df.shape[0]}\nDataset columns: {df.shape[1]}')

Dataset rows: 7043
Dataset columns: 21


In [5]:
# Inspect the first 5 rows of the dataset
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [6]:
# Examine dataset characteristics
# Check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [7]:
# Examine numeric columns
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


**Observations:**
 - **Table Structure:** All 21 columns are present, but `TotalCharges` is stored as `object` rather than as a numeric type. It is also absent from the `describe()` output, which covers only numeric columns.
 - **Sample Rows:** Sample rows have expected values and formats.
 - **Missing Values:** No null values were detected, but the unexpected data type may indicate sentinel values standing in for missing data, or simply malformed entries. This should be inspected further.
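
One way to look for such sentinel values is to count whitespace-only strings in each non-numeric column. The sketch below is illustrative rather than part of the original notebook: it runs on a small hypothetical frame (`sample`, with made-up values) and assumes that a sentinel, if present, takes the form of an empty or blank string.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data; values are illustrative only.
sample = pd.DataFrame({
    "tenure": [0, 12, 3],
    "TotalCharges": [" ", "350.5", "89.1"],
    "Contract": ["Two year", "One year", "Month-to-month"],
})

# Keep only the columns that pandas did not parse as numeric.
non_numeric_cols = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

# For each of those columns, count entries that are empty after stripping whitespace.
blank_counts = {
    col: int(sample[col].astype(str).str.strip().eq("").sum())
    for col in non_numeric_cols
}
print(blank_counts)
```

Applied to the real data, the same pattern would use `df` in place of `sample`. Any column with a non-zero count is a candidate for closer inspection.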

## 1.2 Investigating Non-Numeric TotalCharges Entries <a class="anchor" id="12-investigating-non-numeric-totalcharges-entries"></a>
During the data type inspection, the `TotalCharges` column was found to be of type *object* rather than *numeric*. To diagnose the issue, we’ll identify and display the rows where conversion to numeric fails.

In [8]:
# Collect the positional index of every TotalCharges value that cannot be parsed as a float
non_numeric = []
for i, val in enumerate(df['TotalCharges'].values):
    try:
        float(val)
    except (ValueError, TypeError):
        non_numeric.append(i)
df.iloc[non_numeric]


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
488,4472-LVYGI,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,Yes,Bank transfer (automatic),52.55,,No
753,3115-CZMZD,Male,0,No,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.25,,No
936,5709-LVOEQ,Female,0,Yes,Yes,0,Yes,No,DSL,Yes,...,Yes,No,Yes,Yes,Two year,No,Mailed check,80.85,,No
1082,4367-NUYAO,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.75,,No
1340,1371-DWPAZ,Female,0,Yes,Yes,0,No,No phone service,DSL,Yes,...,Yes,Yes,Yes,No,Two year,No,Credit card (automatic),56.05,,No
3331,7644-OMVMY,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.85,,No
3826,3213-VVOLG,Male,0,Yes,Yes,0,Yes,Yes,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,25.35,,No
4380,2520-SGTTA,Female,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,20.0,,No
5218,2923-ARZLG,Male,0,Yes,Yes,0,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,One year,Yes,Mailed check,19.7,,No
6670,4075-WKNIU,Female,0,Yes,Yes,0,Yes,Yes,DSL,No,...,Yes,Yes,Yes,No,Two year,No,Mailed check,73.35,,No


**Observations:** The rows with non-numeric `TotalCharges` values correspond to customers with zero `tenure`, meaning they likely signed up but never completed a billing cycle. These entries can safely be treated as missing values.
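
The loop-based check can also be written in vectorized form with `pd.to_numeric(..., errors="coerce")`, which converts unparseable entries to `NaN`. The following is a minimal sketch on a hypothetical frame (`toy`), assuming the non-numeric entries are blank strings, as the empty cells in the output above suggest. It also verifies the zero-tenure relationship programmatically instead of by eye.

```python
import pandas as pd

# Hypothetical toy data mimicking the pattern observed above; not the real dataset.
toy = pd.DataFrame({
    "tenure": [0, 1, 34, 0],
    "TotalCharges": [" ", "29.85", "1889.5", " "],
})

# errors="coerce" replaces values that cannot be parsed as numbers with NaN.
total_numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Rows where the conversion failed: NaN after coercion, but not null beforehand.
failed = total_numeric.isna() & toy["TotalCharges"].notna()
non_numeric_rows = toy[failed]

# Check whether every failed row belongs to a zero-tenure customer.
all_zero_tenure = bool((non_numeric_rows["tenure"] == 0).all())
print(non_numeric_rows)
print(f"All non-numeric rows have zero tenure: {all_zero_tenure}")
```

With the real data, `toy` would be replaced by `df`. Keeping `total_numeric` as a separate Series leaves the raw column untouched until the data preparation phase.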

## 1.3 Feature Definitions & Categories <a class="anchor" id="13-feature-definitions--categories"></a>
The dataset contains 7,043 records and 21 features, including demographics, account info, services used, and the churn label.

In [9]:
# Summarize data type, number of unique values, and example values for every feature
summary = pd.DataFrame({
    'Data Type': df.dtypes, # Display the column data types
    'Unique Values': df.nunique(), # Display the number of unique values
    'Example Values': [df[c].unique()[:3] for c in df.columns] # Display the first few unique values
})

# Display dataset summary table
summary


Unnamed: 0,Data Type,Unique Values,Example Values
customerID,object,7043,"[7590-VHVEG, 5575-GNVDE, 3668-QPYBK]"
gender,object,2,"[Female, Male]"
SeniorCitizen,int64,2,"[0, 1]"
Partner,object,2,"[Yes, No]"
Dependents,object,2,"[No, Yes]"
tenure,int64,73,"[1, 34, 2]"
PhoneService,object,2,"[No, Yes]"
MultipleLines,object,3,"[No phone service, No, Yes]"
InternetService,object,3,"[DSL, Fiber optic, No]"
OnlineSecurity,object,3,"[No, Yes, No internet service]"


**Feature Descriptions**

- **`customerID`:** Unique customer identifier: a 4-digit number and 5 uppercase characters separated by a dash (e.g., `7590-VHVEG`)
- **`gender`:** Sex of the customer
- **`SeniorCitizen`:** Whether the customer is a senior citizen or not
- **`Partner`:** Whether the customer has a partner or not
- **`Dependents`:** Whether the customer has dependents
- **`tenure`:** The number of months the customer has stayed with the company
- **`PhoneService`:** Whether the customer has phone service
- **`MultipleLines`:** Whether the customer has multiple lines
- **`InternetService`:** Type of internet service the customer has (DSL, Fiber optic, or No)
- **`OnlineSecurity`:** Whether the customer has online security
- **`OnlineBackup`:** Whether the customer has online backup
- **`DeviceProtection`:** Whether the customer has device protection
- **`TechSupport`:** Whether the customer has tech support
- **`StreamingTV`:** Whether the customer has streaming TV
- **`StreamingMovies`:** Whether the customer has streaming movies
- **`Contract`:** The contract term of the customer
- **`PaperlessBilling`:** Whether the customer uses paperless billing
- **`PaymentMethod`:** The customer's payment method
- **`MonthlyCharges`:** Amount charged to the customer monthly
- **`TotalCharges`:** The total amount charged to the customer
- **`Churn`:** Whether the customer has churned
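
The features described above can be grouped into the categories mentioned at the start of this section (demographics, account info, services, and the churn label). The mapping below is one possible grouping, not one defined by the dataset itself. The name `feature_groups` and the assignment of each column to a group are assumptions made for illustration.

```python
# Hypothetical grouping of the 21 columns; the category assignments are an assumption.
feature_groups = {
    "identifier": ["customerID"],
    "demographics": ["gender", "SeniorCitizen", "Partner", "Dependents"],
    "account": [
        "tenure", "Contract", "PaperlessBilling",
        "PaymentMethod", "MonthlyCharges", "TotalCharges",
    ],
    "services": [
        "PhoneService", "MultipleLines", "InternetService",
        "OnlineSecurity", "OnlineBackup", "DeviceProtection",
        "TechSupport", "StreamingTV", "StreamingMovies",
    ],
    "target": ["Churn"],
}

# Flatten the mapping and confirm that each column appears exactly once.
all_features = [col for cols in feature_groups.values() for col in cols]
print(f"Grouped features: {len(all_features)}")
print(f"Duplicates present: {len(all_features) != len(set(all_features))}")
```

A mapping like this makes it easy to select related columns together later, for example `df[feature_groups["services"]]`.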

---

## 1.4 Summary <a class="anchor" id="14-summary"></a>

In this notebook, we examined the structure and content of the Telco Churn dataset to establish a foundational understanding before data cleaning and analysis. 
The dataset contains 7,043 customer records across 21 features encompassing demographic information, account details, service usage, and churn status. 
During the inspection of data types, we identified that the `TotalCharges` column was stored as an object rather than a numeric type due to several non-numeric entries. 
These cases corresponded to customers with zero tenure, indicating new sign-ups who had not yet been billed; such entries will be treated as missing values in the data preparation phase. 
With the dataset structure clarified and feature meanings documented, we are now ready to clean, preprocess, and engineer features for deeper exploration.
