## 1. Business Understanding

### Objective

The aim of this project is to build a **machine learning classification model** to predict whether a customer will churn. Churn is defined as a customer discontinuing their service with the company.

Churn impacts company growth, customer acquisition costs, and revenue. Reducing churn through predictive analytics can help improve long-term profitability.

---

### Business Context

Telecommunications companies often struggle with customer churn due to fierce competition, pricing wars, and service dissatisfaction. Retaining existing customers is **significantly more cost-effective** than acquiring new ones.

By analyzing customer behavior and demographics, we aim to provide actionable insights that help the business:
- Identify at-risk customers early
- Understand the drivers of churn
- Develop targeted retention strategies

---

### Target Variable

- **Column Name:** `churn`
- **Type:** Binary (boolean) label for classification
- **Values:** `True` (customer churned) or `False` (customer retained)
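
Many classifiers expect a numeric target, so a common first step is to map the label to 0/1 and look at the class balance. The snippet below is a minimal sketch, assuming the label column is named `churn` and holds booleans, as the preview in section 2.1 shows. The `sample` frame and the `churn_flag` column name are made up for illustration and are not the project data.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (values are illustrative only).
sample = pd.DataFrame({"churn": [False, False, True, False, True, False]})

# Map the boolean label to 0/1 so that 1 marks a churned customer.
sample["churn_flag"] = sample["churn"].astype(int)

# Share of each class; a strong imbalance would influence metric choice later.
class_share = sample["churn_flag"].value_counts(normalize=True).sort_index()
print(class_share)
```

On the real data, the same two lines can be applied to `df` once it is loaded.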

---

### Key Business Questions

1. What customer behaviors, demographics, or service usage patterns are linked to churn?
2. Can we build a predictive model to flag customers likely to churn?
3. Which features contribute most to customer retention or loss?
4. How can the business leverage these insights to **proactively reduce churn**?

---

### Success Criteria

- **Technical**: A predictive model with high recall and F1-score on the churn class, which keeps false negatives (missed churners) low.
- **Business**: Actionable insights that help reduce churn by **at least 10%** over the next 6 months through targeted interventions.
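
To make the technical criterion concrete, the sketch below computes recall, precision, and F1 for the churn class from raw predictions. It is a plain-Python illustration rather than the project's evaluation code; the helper name `churn_metrics` and the label lists are hypothetical, and the numbers are not results from this dataset.

```python
def churn_metrics(y_true, y_pred):
    """Return recall, precision, and F1 for the positive (churn = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"recall": recall, "precision": precision,
            "f1": f1, "false_negatives": fn}

# Illustrative labels only (1 = churned, 0 = retained).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = churn_metrics(y_true, y_pred)
print(metrics)
```

Recall is the share of actual churners the model flags, so each false negative lowers it directly. That is why it is the headline metric when missing a churner costs more than a false alarm.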


## 2. Data Understanding

In this step, we aim to familiarize ourselves with the dataset by understanding its structure, data types, volume, completeness, and general properties. This is a crucial foundation before cleaning, exploring, and modeling the data.

### Objectives:
- Understand the shape and structure of the data
- Identify the target and feature variables
- Assess data types (categorical, numerical, etc.)
- Check for missing, duplicate, or inconsistent values
- Begin identifying potential relationships and patterns
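
Most of these checks can be gathered in one small helper. The following is a sketch under stated assumptions: `quick_profile` is a hypothetical name, and the `toy` frame only borrows a few column names from the dataset, with invented values chosen so that one missing value and one duplicate row show up.

```python
import pandas as pd

def quick_profile(frame: pd.DataFrame) -> dict:
    """Collect basic structural facts about a DataFrame."""
    return {
        "shape": frame.shape,
        "numeric_columns": frame.select_dtypes(include="number").columns.tolist(),
        "non_numeric_columns": frame.select_dtypes(exclude="number").columns.tolist(),
        "missing_per_column": frame.isna().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Hypothetical toy data, not the project dataset.
toy = pd.DataFrame({
    "state": ["KS", "OH", "OH", None],
    "account length": [128, 107, 107, 84],
    "churn": [False, False, False, True],
})
profile = quick_profile(toy)
for key, value in profile.items():
    print(f"{key}: {value}")
```

Note that pandas does not count boolean columns as `"number"`, so a boolean label such as `churn` is reported among the non-numeric columns here.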

### Dataset Overview:
The dataset used in this project contains customer-level information for a telecommunications company. Each row represents a unique customer and includes attributes such as:

- **Account Information** (e.g., state, account length, area code, phone number)
- **Plan Subscriptions** (e.g., international plan, voice mail plan, number of voicemail messages)
- **Service Usage** (e.g., total minutes, calls, and charges for day, evening, night, and international calls)
- **Customer Support** (number of customer service calls)
- **Churn Label** – whether the customer has churned (`True`/`False`)

### Importance of this Step:
A clear understanding of the raw data helps prevent poor assumptions and guides how we clean, explore, and model the data. Without it, we may:
- Misinterpret features
- Miss important insights
- Introduce bias into the model

---

In the next section, we will load the dataset and begin our initial inspection.


### 2.1 Load and Preview the Dataset

In this step, we will:
- Load the dataset from the `data/` directory
- Preview the first few rows
- Check the basic shape and column names
- Begin identifying potential issues (e.g., null values, formatting)

This gives us our first glance at what we’re working with.


In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset
file_path = '../data/bigml_59c28831336c6604c800002a.csv'
df = pd.read_csv(file_path)

# Display the first 5 rows
df.head()


| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


In [3]:
# Basic structure of the dataset
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.\n")
print("Column names:\n", df.columns.tolist())


Dataset contains 3333 rows and 21 columns.

Column names:
 ['state', 'account length', 'area code', 'phone number', 'international plan', 'voice mail plan', 'number vmail messages', 'total day minutes', 'total day calls', 'total day charge', 'total eve minutes', 'total eve calls', 'total eve charge', 'total night minutes', 'total night calls', 'total night charge', 'total intl minutes', 'total intl calls', 'total intl charge', 'customer service calls', 'churn']
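
The column names contain spaces, which is one of the formatting issues worth noting at this stage because it rules out attribute-style access and makes typos easier. One optional convention, shown here as a sketch and not as a step the notebook necessarily takes, is to normalize the names to snake_case. The `demo` frame below is an empty, hypothetical stand-in that reuses a few of the column names listed above.

```python
import pandas as pd

# A few of the column names from the output above, on an empty stand-in frame.
columns = ["state", "account length", "phone number",
           "customer service calls", "churn"]
demo = pd.DataFrame(columns=columns)

# Trim whitespace, lowercase, and replace spaces with underscores.
demo.columns = (
    demo.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
print(demo.columns.tolist())
```

If this convention is adopted, any later cell that refers to the original names (for example `df['account length']`) would need the renamed form instead.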
