## 1. Business Understanding

###  Objective

The aim of this project is to build a **machine learning classification model** to predict whether a customer will churn. Churn is defined as a customer discontinuing their service with the company.

Churn impacts company growth, customer acquisition costs, and revenue. Reducing churn through predictive analytics can help improve long-term profitability.

---

###  Business Context

Telecommunications companies often struggle with customer churn due to fierce competition, pricing wars, and service dissatisfaction. Retaining existing customers is generally **more cost-effective** than acquiring new ones.

By analyzing customer account and usage data, we aim to provide actionable insights that help the business:
- Identify at-risk customers early
- Understand the drivers of churn
- Develop targeted retention strategies

---

###  Target Variable

- **Column Name:** `churn`
- **Type:** Binary (boolean) label for classification
- **Values:** `True` (customer churned) or `False` (customer retained)
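
The preview in section 2.1 shows the label stored as a boolean column named `churn`. A minimal sketch, on a made-up five-row Series, of turning such a column into a 0/1 target and checking the class balance (the variable names `y` and `class_share` are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy stand-in for df["churn"]; these values are made up for illustration.
churn = pd.Series([False, False, True, False, True], name="churn")

# Booleans map directly onto the 0/1 labels most classifiers expect.
y = churn.astype(int)

# Share of each class, ordered by label (0 = retained, 1 = churned).
class_share = y.value_counts(normalize=True).sort_index()
print(class_share)
```

Checking the class shares early matters because a heavily imbalanced target makes plain accuracy a misleading metric.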

---

###  Key Business Questions

1. What customer behaviors, demographics, or service usage patterns are linked to churn?
2. Can we build a predictive model to flag customers likely to churn?
3. Which features contribute most to customer retention or loss?
4. How can the business leverage these insights to **proactively reduce churn**?

---

###  Success Criteria

- **Technical**: A predictive model with strong recall and F1-score on the churn class. High recall means few false negatives, so fewer churners are missed.
- **Business**: Insights are actionable and help reduce churn by **at least 10%** over the next 6 months through targeted interventions.
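
To make the technical criterion concrete, here is a minimal sketch of how recall, precision, and F1 follow from confusion-matrix counts for the churn class. The helper name `recall_precision_f1` and the counts passed to it are hypothetical, not results from this project:

```python
def recall_precision_f1(tp, fp, fn):
    """Return (recall, precision, F1) for the positive (churn) class."""
    # Recall: share of actual churners the model caught.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of flagged customers who actually churned.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Hypothetical counts: 80 churners caught, 20 missed, 40 false alarms.
recall, precision, f1 = recall_precision_f1(tp=80, fp=40, fn=20)
print(f"recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Every missed churner is a false negative, so lowering `fn` raises recall directly; F1 keeps that gain honest by penalizing a model that simply flags everyone.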


##  2. Data Understanding

In this step, we aim to familiarize ourselves with the dataset by understanding its structure, data types, volume, completeness, and general properties. This is a crucial foundation before cleaning, exploring, and modeling the data.

###  Objectives:
- Understand the shape and structure of the data
- Identify the target and feature variables
- Assess data types (categorical, numerical, etc.)
- Check for missing, duplicate, or inconsistent values
- Begin identifying potential relationships and patterns

###  Dataset Overview:
The dataset used in this project contains customer-level information for a telecommunications company. Each row represents a unique customer and includes attributes such as:

- **Location and identifiers** (e.g., `state`, `area code`, `phone number`)
- **Account information** (e.g., `account length`, `international plan`, `voice mail plan`)
- **Service usage** (e.g., day, evening, night, and international minutes, calls, and charges, plus `customer service calls`)
- **Churn label** (`churn`): whether the customer has churned (`True`/`False`)

###  Importance of this Step:
A clear understanding of the raw data helps prevent poor assumptions and guides how we clean, explore, and model the data. Without it, we may:
- Misinterpret features
- Miss important insights
- Introduce bias into the model

---

In the next section, we will load the dataset and begin our initial inspection.


### 2.1 Load and Preview the Dataset

In this step, we will:
- Load the dataset from the `data/` directory
- Preview the first few rows
- Check the basic shape and column names
- Begin identifying potential issues (e.g., null values, formatting)

This gives us our first glance at what we’re working with.


In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset
file_path = '../data/bigml_59c28831336c6604c800002a.csv'
df = pd.read_csv(file_path)

# Display the first 5 rows
df.head()


| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


In [3]:
# Basic structure of the dataset
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.\n")
print("Column names:\n", df.columns.tolist())


Dataset contains 3333 rows and 21 columns.

Column names:
 ['state', 'account length', 'area code', 'phone number', 'international plan', 'voice mail plan', 'number vmail messages', 'total day minutes', 'total day calls', 'total day charge', 'total eve minutes', 'total eve calls', 'total eve charge', 'total night minutes', 'total night calls', 'total night charge', 'total intl minutes', 'total intl calls', 'total intl charge', 'customer service calls', 'churn']


### 2.2 Data Types and Null Value Check

Before diving deeper into exploration, it's important to understand:
- The data types of each column (e.g., numeric, object)
- Whether there are missing (null) values that could affect analysis or modeling

This helps us determine:
- Which columns need to be converted to appropriate types
- Where data cleaning will be necessary


In [4]:
# Check data types and count of null values per column
df.info()

# Quick summary of nulls
null_summary = df.isnull().sum()
null_summary[null_summary > 0]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

Series([], dtype: int64)

In [5]:
# Show percentage of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_percent[missing_percent > 0].sort_values(ascending=False)


Series([], dtype: float64)
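
With no nulls reported, the remaining type work is about semantics rather than gaps. For example, `area code` is stored as `int64` but takes only three distinct values, and the two plan columns hold `yes`/`no` strings. A minimal sketch on a made-up three-row frame; the conversions chosen here are assumptions for illustration, not steps the notebook performs:

```python
import pandas as pd

# Made-up rows that mimic three of the real columns.
toy = pd.DataFrame({
    "area code": [415, 408, 510],
    "international plan": ["no", "yes", "no"],
    "voice mail plan": ["yes", "no", "no"],
})

# Treat the area code as a label rather than a quantity.
toy["area code"] = toy["area code"].astype("category")

# Map yes/no strings to booleans; any other string would become NaN.
for col in ["international plan", "voice mail plan"]:
    toy[col] = toy[col].map({"yes": True, "no": False})

print(toy.dtypes)
```

Because unmapped strings turn into NaN, rerunning the null check after the mapping is a cheap way to catch unexpected spellings such as `Yes` or `N`.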

### 2.3 Check for Duplicates and Unique Values

Identifying duplicate rows and understanding the uniqueness of columns helps ensure data quality.

In this step, we will:
- Check if the dataset contains duplicate records
- Explore how many unique values exist per column
- Detect columns with potentially constant values that may not add value to the model


In [6]:
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Show number of unique values per column
df.nunique().sort_values()


Number of duplicate rows: 0


churn                        2
international plan           2
voice mail plan              2
area code                    3
customer service calls      10
total intl calls            21
number vmail messages       46
state                       51
total day calls            119
total night calls          120
total eve calls            123
total intl minutes         162
total intl charge          162
account length             212
total night charge         933
total eve charge          1440
total night minutes       1591
total eve minutes         1611
total day charge          1667
total day minutes         1667
phone number              3333
dtype: int64

In [8]:
# Remove duplicates (if found)
if duplicate_count > 0:
    df.drop_duplicates(inplace=True)
    print("Duplicate rows dropped.")
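
The unique-value counts can also be screened automatically. A minimal sketch, on a made-up frame, that flags columns with a single value (no signal) and columns where every row is distinct (identifier-like, as `phone number` is above with 3333 unique values in 3333 rows). The helper name `screen_columns` and the `country` column are hypothetical:

```python
import pandas as pd


def screen_columns(frame):
    """Return (constant_cols, id_like_cols) based on unique-value counts."""
    n_unique = frame.nunique()
    # A single distinct value cannot help separate churners from non-churners.
    constant_cols = n_unique[n_unique <= 1].index.tolist()
    # One distinct value per row behaves like an identifier.
    id_like_cols = n_unique[n_unique == len(frame)].index.tolist()
    return constant_cols, id_like_cols


toy = pd.DataFrame({
    "phone number": ["382-4657", "371-7191", "358-1921", "375-9999"],
    "country": ["US", "US", "US", "US"],  # hypothetical constant column
    "voice mail plan": ["yes", "yes", "no", "no"],
})
constant_cols, id_like_cols = screen_columns(toy)
print("Constant:", constant_cols)
print("Identifier-like:", id_like_cols)
```

Continuous measurements can also be unique on every row, so a column flagged as identifier-like still needs a quick manual look before it is dropped.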


### 2.4 Column-by-Column Review

In this step, we briefly explore each feature to understand:

- Its purpose (e.g., location, account details, service usage)
- Its type (categorical vs. numerical)
- Its potential impact on churn
- Whether it needs cleaning or transformation

This helps guide our decisions in later steps like data preparation, feature engineering, and modeling.
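
One way to gauge a categorical feature's potential impact on churn is to compare churn rates across its levels. A minimal sketch on made-up rows; the numbers are illustrative and not taken from the dataset:

```python
import pandas as pd

# Made-up rows; column names follow the dataset, values do not.
toy = pd.DataFrame({
    "international plan": ["no", "no", "no", "yes", "yes", "no"],
    "churn": [False, False, True, True, True, False],
})

# The mean of a boolean column is the share of True values,
# so the group means are per-level churn rates.
churn_rate = toy.groupby("international plan")["churn"].mean()
print(churn_rate)
```

A large gap between levels suggests the feature is worth keeping; on real data the group sizes should be checked too, since a rate computed from a handful of rows is unreliable.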


In [9]:
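# Summary statistics for numeric columns.
# Note: a notebook cell renders only its last expression, so this result is
# discarded unless it is wrapped in display(...) or run in its own cell.
df.describe()

# Summary for object (categorical) columns; this is the output shown below.
df.describe(include='object')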


| | state | phone number | international plan | voice mail plan |
|---|---|---|---|---|
| count | 3333 | 3333 | 3333 | 3333 |
| unique | 51 | 3333 | 2 | 2 |
| top | WV | 417-9455 | no | no |
| freq | 106 | 1 | 3010 | 2411 |


In [10]:
# Look at unique values for each column (first few)
for col in df.columns:
    print(f"\n{col}:\n{df[col].unique()[:5]}")  # Show first 5 unique values



state:
['KS' 'OH' 'NJ' 'OK' 'AL']

account length:
[128 107 137  84  75]

area code:
[415 408 510]

phone number:
['382-4657' '371-7191' '358-1921' '375-9999' '330-6626']

international plan:
['no' 'yes']

voice mail plan:
['yes' 'no']

number vmail messages:
[25 26  0 24 37]

total day minutes:
[265.1 161.6 243.4 299.4 166.7]

total day calls:
[110 123 114  71 113]

total day charge:
[45.07 27.47 41.38 50.9  28.34]

total eve minutes:
[197.4 195.5 121.2  61.9 148.3]

total eve calls:
[ 99 103 110  88 122]

total eve charge:
[16.78 16.62 10.3   5.26 12.61]

total night minutes:
[244.7 254.4 162.6 196.9 186.9]

total night calls:
[ 91 103 104  89 121]

total night charge:
[11.01 11.45  7.32  8.86  8.41]

total intl minutes:
[10.  13.7 12.2  6.6 10.1]

total intl calls:
[3 5 7 6 4]

total intl charge:
[2.7  3.7  3.29 1.78 2.73]

customer service calls:
[1 0 2 3 4]

churn:
[False  True]
