---
# Business Understanding
---

## Business Objective
The goal of the business is to reduce churn by identifying which customers are likely to leave SyriaTel. By proactively addressing at-risk customers, the company can improve retention, protect revenue, and prioritise its customer service resources.

## Business Problem
Churn (customers leaving SyriaTel) directly affects revenue and profitability. The company wants to predict churn before it happens using historical customer data, so that retention measures (such as personalized offers or service improvements) can be targeted at the customers most likely to leave.

## Data Mining Goal
*Predictive Modeling:* Use historical customer data to build a model that predicts the likelihood of a customer churning.
### Key Business Questions
* Which customers are most likely to churn?
* What factors drive customer churn?
* When is a customer most at risk of leaving?
* What can be done to reduce churn?

## Success Criteria
Business Success: Reduction in churn rate, increased retention, and improved customer satisfaction.

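Data Mining Success: A predictive model that accurately separates likely churners from customers who stay, measured on data the model has not seen during training.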
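
Because churners are usually a minority of customers, accuracy alone can look good even when a model finds no churners at all. The sketch below is illustrative only: the labels are made up, and `evaluate` is a hypothetical helper, not part of the notebook. It computes accuracy, recall, and precision by hand for a model that predicts "stayed" for everyone.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, recall and precision for binary labels (1 = churned)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # loyal customers flagged by mistake
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners the model missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision}


# Made-up labels: 8 customers stayed, 2 churned.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A trivial "model" that never predicts churn.
y_pred = [0] * 10

scores = evaluate(y_true, y_pred)
print(scores)
```

In this toy case accuracy is high while recall is zero, which is why recall and precision on the churn class are worth tracking alongside accuracy.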

---
# Data Understanding
---

## Dataset Overview
The dataset contains historical information about customers, including their usage patterns and their interactions with customer service. The target variable is churn, which indicates whether a customer has left the service. It is stored as a boolean (True = left, False = stayed) and is encoded as 1 and 0 during data preparation.

## Data Quality Checks
1. **Missing Values:** Check for null or missing entries in any column. The SyriaTel dataset has no missing values.

2. **Data Types:** Ensure numeric columns (e.g., minutes, charges, calls) are of numeric types for correlation and modelling.

3. **Outliers:** Identify unusually high or low values (e.g., extremely high day minutes) that may distort the model.

4. **Duplicate Values:** Detect and handle any duplicate records in the dataset.
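
The checks above can be sketched as a single helper. This is a minimal illustration, not the notebook's own code: `quality_report` is a hypothetical name, the 1.5 × IQR rule is one common outlier convention chosen here as an assumption, and the toy values are invented rather than taken from the SyriaTel file.

```python
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise missing values, duplicate rows, and IQR outliers per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Flag values lying more than 1.5 * IQR outside the quartiles (assumed rule).
    outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return {
        "missing": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "outliers": outlier_mask.sum().to_dict(),
    }


# Invented toy data: the last row repeats the first, and 600.0 is an extreme value.
toy = pd.DataFrame(
    {
        "total day minutes": [180.0, 175.5, 190.2, 185.0, 600.0, 180.0],
        "customer service calls": [1, 2, 1, 0, 1, 1],
        "international plan": ["no", "no", "yes", "no", "no", "no"],
    }
)
report = quality_report(toy)
print(report)
```

Flagged values are candidates for inspection rather than automatic removal, since heavy usage may itself be related to churn.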

## Exploration Insights
* Customer service calls, total day charge, and total day minutes show the strongest correlations with churn (each roughly 0.21), which marks them as the main linear indicators of potential churn.
* Other usage and billing features have weak correlation but may still contribute when combined in a predictive model.
* Categorical features such as area code are less likely to impact churn individually but may have subtle effects in combination with other variables.
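
Correlation coefficients say little about categorical columns, so a per-category churn rate is a simple complementary view. The sketch below assumes a DataFrame with a 0/1 `churn` column; `churn_rate_by` is a hypothetical helper and the toy rows are invented for illustration.

```python
import pandas as pd


def churn_rate_by(df: pd.DataFrame, column: str, target: str = "churn") -> pd.DataFrame:
    """Return the customer count and churn rate for each value of a categorical column."""
    return (
        df.groupby(column)[target]
        .agg(customers="count", churn_rate="mean")  # mean of 0/1 labels = share of churners
        .sort_values("churn_rate", ascending=False)
    )


# Invented toy data, not taken from the SyriaTel file.
toy = pd.DataFrame(
    {
        "international plan": ["no", "no", "no", "yes", "yes", "no"],
        "churn": [0, 0, 1, 1, 1, 0],
    }
)
rates = churn_rate_by(toy, "international plan")
print(rates)
```

The same call could be applied to `state`, `area code`, or `voice mail plan` to compare groups side by side.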

---
# Data Preparation
---

In the churn prediction dataset, the main tasks include cleaning, transforming, and structuring the data.

In [1]:
# Import pandas (aliased as pd) for loading and manipulating the data
import pandas as pd

In [2]:
# load the data set
df = pd.read_csv('bigml_59c28831336c6604c800002a.csv')
# display the first 5 rows 
df.head()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


#### **BASIC DATA UNDERSTANDING**

This step gives us key insights into the data and helps us understand the relationships between the features.

##### **Statistical Summary**

In [3]:
df.describe()

| | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |


#### Handling Missing Values

* Checking each column for null or missing values.

In [4]:
# checking for missing values
df.isnull().sum()

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

#### Handling duplicated values
Checking the dataset for duplicated rows. According to our data, there are no duplicate records.

In [5]:
# checking for duplicates
df.duplicated().sum()

0

#### Partial imbalance
* Checking the most frequent value (`top`) and its count (`freq`) in each categorical column, which is useful for spotting dominant categories and imbalance before modelling.

In [6]:
df.describe(include=['object']).T[['top', 'freq']]

| | top | freq |
|---|---|---|
| state | WV | 106 |
| phone number | 382-4657 | 1 |
| international plan | no | 3010 |
| voice mail plan | no | 2411 |
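
The summary above covers the categorical feature columns only. The same idea can be applied to the target itself. The following is a sketch under assumptions: `class_balance` is a hypothetical helper, and the toy series is invented rather than read from the dataset.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class in a binary target column."""
    as_int = target.astype(int)  # accepts boolean or 0/1 labels
    counts = as_int.value_counts()
    share = as_int.value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": share})


# Invented toy target: one churner out of five customers.
toy_churn = pd.Series([False, False, False, True, False], name="churn")
balance = class_balance(toy_churn)
print(balance)
```

On the real data this would be called as `class_balance(df['churn'])`. A strongly skewed share would suggest using stratified splits and class-aware metrics later on.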


In [7]:
df.dtypes

state                      object
account length              int64
area code                   int64
phone number               object
international plan         object
voice mail plan            object
number vmail messages       int64
total day minutes         float64
total day calls             int64
total day charge          float64
total eve minutes         float64
total eve calls             int64
total eve charge          float64
total night minutes       float64
total night calls           int64
total night charge        float64
total intl minutes        float64
total intl calls            int64
total intl charge         float64
customer service calls      int64
churn                        bool
dtype: object

#### Convert Data type
* Convert the `churn` column from boolean to integer (`int64`), so that True becomes 1 and False becomes 0.

In [8]:
df['churn'] = df['churn'].astype('int64')

In [9]:
df.dtypes

state                      object
account length              int64
area code                   int64
phone number               object
international plan         object
voice mail plan            object
number vmail messages       int64
total day minutes         float64
total day calls             int64
total day charge          float64
total eve minutes         float64
total eve calls             int64
total eve charge          float64
total night minutes       float64
total night calls           int64
total night charge        float64
total intl minutes        float64
total intl calls            int64
total intl charge         float64
customer service calls      int64
churn                       int64
dtype: object

In [10]:
# Show the dimensions of the dataset as (rows, columns)
df.shape

(3333, 21)

#### Related Features
* Computing the correlation of each numeric feature with churn, which is useful for feature selection and for understanding the data.

In [11]:
df.select_dtypes(include='number').corr()['churn']

account length            0.016541
area code                 0.006174
number vmail messages    -0.089728
total day minutes         0.205151
total day calls           0.018459
total day charge          0.205151
total eve minutes         0.092796
total eve calls           0.009233
total eve charge          0.092786
total night minutes       0.035493
total night calls         0.006141
total night charge        0.035496
total intl minutes        0.068239
total intl calls         -0.052844
total intl charge         0.068259
customer service calls    0.208750
churn                     1.000000
Name: churn, dtype: float64
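
In the output above, each minutes column has almost exactly the same correlation with churn as its matching charge column (for example, 0.205151 for both `total day minutes` and `total day charge`), which suggests that the charge columns may be near-linear functions of the minutes columns. The sketch below ranks features by absolute correlation and lists highly correlated feature pairs. `rank_by_correlation`, `redundant_pairs`, the 0.99 threshold, and the toy data (including the 0.17 multiplier) are illustrative assumptions, not values confirmed by the source.

```python
import pandas as pd


def rank_by_correlation(df: pd.DataFrame, target: str = "churn") -> pd.Series:
    """Correlation of each numeric feature with the target, strongest first."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def redundant_pairs(df: pd.DataFrame, target: str = "churn", threshold: float = 0.99) -> list:
    """List feature pairs whose absolute correlation is at or above the threshold."""
    corr = df.select_dtypes(include="number").drop(columns=[target]).corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] >= threshold:
                pairs.append((a, b))
    return pairs


# Invented toy data; the charge column is an exact multiple of minutes.
minutes = [100.0, 200.0, 300.0, 150.0, 250.0]
toy = pd.DataFrame(
    {
        "total day minutes": minutes,
        "total day charge": [m * 0.17 for m in minutes],
        "customer service calls": [1, 4, 2, 5, 0],
        "churn": [0, 1, 1, 0, 1],
    }
)
ranked = rank_by_correlation(toy)
pairs = redundant_pairs(toy)
print(ranked)
print(pairs)
```

If a pair like this shows up on the real data, keeping only one column of the pair is a common way to reduce redundancy for linear models.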