# SyriaTel Customer Churn

## Business Understanding

In [3]:
# Placeholder: the business overview goes here

### Problem statement

### Goals and objectives

### Metric of success

## 2: Data Understanding

### 2.1: Data Overview 

The dataset used in this analysis is the **SyriaTel Customer Churn Dataset** from [Kaggle](https://www.kaggle.com/becksddf/churn-in-telecoms-dataset). It contains records for **3,333 customers** of SyriaTel, a telecommunications company. The task is a **binary classification problem**: the **target variable**, `churn`, records whether a customer left the company (`True`/`False`).

The features cover customer account details, usage behavior (calls, minutes, charges), and interactions with customer service. Together, these variables support both the analysis of churn drivers and churn prediction.
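Because the target is a binary `True`/`False` column, a natural first check is how the two classes are split. The snippet below is a minimal sketch on a made-up toy frame (the values are illustrative and are not taken from the SyriaTel data); it relies on the fact that the mean of a boolean column equals the share of `True` values.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame(
    {"churn": [False, False, False, True, False, True, False, False]}
)

# Proportion of each class in the binary target
class_share = toy["churn"].value_counts(normalize=True)
print(class_share)

# The mean of a boolean column is the fraction of True values, i.e. the churn rate
churn_rate = toy["churn"].mean()
print(f"Churn rate in the toy frame: {churn_rate:.2%}")
```

Running the same two lines on the real `data["churn"]` column would show whether the classes are imbalanced, which matters later when choosing a metric of success.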

---

#### Column Name Meanings  

| Column Name              | Meaning                                                                 |
|--------------------------|-------------------------------------------------------------------------|
| `state`                  | The U.S. state the customer resides in.                                 |
| `account length`         | Number of days the account has been active.                             |
| `area code`              | Customer’s assigned telephone area code.                                |
| `phone number`           | Customer’s phone number (unique identifier).            |
| `international plan`     | Whether the customer has an international calling plan (`yes`/`no`).    |
| `voice mail plan`        | Whether the customer has a voicemail plan (`yes`/`no`).                 |
| `number vmail messages`  | Number of voicemail messages recorded.                                  |
| `total day minutes`      | Total number of minutes of calls made during the day.                   |
| `total day calls`        | Total number of calls made during the day.                              |
| `total day charge`       | Total charges for daytime calls.                                        |
| `total eve minutes`      | Total minutes of evening calls.                                         |
| `total eve calls`        | Total number of evening calls.                                          |
| `total eve charge`       | Total charges for evening calls.                                        |
| `total night minutes`    | Total minutes of night calls.                                           |
| `total night calls`      | Total number of night calls.                                            |
| `total night charge`     | Total charges for night calls.                                          |
| `total intl minutes`     | Total minutes of international calls.                                   |
| `total intl calls`       | Total number of international calls.                                    |
| `total intl charge`      | Total charges for international calls.                                  |
| `customer service calls` | Number of calls made to customer service.                               |
| `churn`                  | **Target variable**: Whether the customer has churned (`True`/`False`). |


### 2.2: Data Description

#### 2.2.1: Importing the dataset

In [2]:
#importing the necessary libraries
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
import warnings 
warnings.filterwarnings("ignore")
import re

In [22]:
#Reading the dataset and checking top five rows and last five rows
data = pd.read_csv('original_data/churn_telecom.csv')
data

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3328 | AZ | 192 | 415 | 414-4276 | no | yes | 36 | 156.2 | 77 | 26.55 | ... | 126 | 18.32 | 279.1 | 83 | 12.56 | 9.9 | 6 | 2.67 | 2 | False |
| 3329 | WV | 68 | 415 | 370-3271 | no | no | 0 | 231.1 | 57 | 39.29 | ... | 55 | 13.04 | 191.3 | 123 | 8.61 | 9.6 | 4 | 2.59 | 3 | False |
| 3330 | RI | 28 | 510 | 328-8230 | no | no | 0 | 180.8 | 109 | 30.74 | ... | 58 | 24.55 | 191.9 | 91 | 8.64 | 14.1 | 6 | 3.81 | 2 | False |
| 3331 | CT | 184 | 510 | 364-6381 | yes | no | 0 | 213.8 | 105 | 36.35 | ... | 84 | 13.57 | 139.2 | 137 | 6.26 | 5.0 | 10 | 1.35 | 2 | False |


#### 2.2.2: Basic Structure

In [10]:
#data shape
data.shape

(3333, 21)

In [11]:
#column names
data.columns

Index(['state', 'account length', 'area code', 'phone number',
       'international plan', 'voice mail plan', 'number vmail messages',
       'total day minutes', 'total day calls', 'total day charge',
       'total eve minutes', 'total eve calls', 'total eve charge',
       'total night minutes', 'total night calls', 'total night charge',
       'total intl minutes', 'total intl calls', 'total intl charge',
       'customer service calls', 'churn'],
      dtype='object')
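The column names listed above contain spaces, so they can only be accessed with bracket notation (for example `data["total day minutes"]`). One possible normalization, sketched here under the assumption that snake_case names are wanted, uses the `re` module that is already imported; `to_snake_case` is a hypothetical helper, not part of the original notebook.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lower-case a column name and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


# A few of the column names shown above, placed in an empty illustrative frame
cols = ["account length", "number vmail messages", "total day minutes", "churn"]
frame = pd.DataFrame(columns=cols)

# rename() accepts a function and applies it to every column label
frame = frame.rename(columns=to_snake_case)
print(list(frame.columns))
```

Renaming on a copy rather than on `data` itself keeps the original labels available for cross-checking against the column table in section 2.1.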

#### 2.2.3: Overview of column types and non-null values

In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

#### 2.2.4: Summary statistics (numerical columns)

In [14]:
data.describe(include='number').T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| account length | 3333.0 | 101.064806 | 39.822106 | 1.0 | 74.0 | 101.0 | 127.0 | 243.0 |
| area code | 3333.0 | 437.182418 | 42.37129 | 408.0 | 408.0 | 415.0 | 510.0 | 510.0 |
| number vmail messages | 3333.0 | 8.09901 | 13.688365 | 0.0 | 0.0 | 0.0 | 20.0 | 51.0 |
| total day minutes | 3333.0 | 179.775098 | 54.467389 | 0.0 | 143.7 | 179.4 | 216.4 | 350.8 |
| total day calls | 3333.0 | 100.435644 | 20.069084 | 0.0 | 87.0 | 101.0 | 114.0 | 165.0 |
| total day charge | 3333.0 | 30.562307 | 9.259435 | 0.0 | 24.43 | 30.5 | 36.79 | 59.64 |
| total eve minutes | 3333.0 | 200.980348 | 50.713844 | 0.0 | 166.6 | 201.4 | 235.3 | 363.7 |
| total eve calls | 3333.0 | 100.114311 | 19.922625 | 0.0 | 87.0 | 100.0 | 114.0 | 170.0 |
| total eve charge | 3333.0 | 17.08354 | 4.310668 | 0.0 | 14.16 | 17.12 | 20.0 | 30.91 |
| total night minutes | 3333.0 | 200.872037 | 50.573847 | 23.2 | 167.0 | 201.2 | 235.3 | 395.0 |


#### 2.2.5: Summary statistics (categorical columns)

In [17]:
data.describe(include='O')

| | state | phone number | international plan | voice mail plan |
|---|---|---|---|---|
| count | 3333 | 3333 | 3333 | 3333 |
| unique | 51 | 3333 | 2 | 2 |
| top | WV | 382-4657 | no | no |
| freq | 106 | 1 | 3010 | 2411 |


#### 2.2.6: Missing values

In [18]:
data.isna().mean()*100

state                     0.0
account length            0.0
area code                 0.0
phone number              0.0
international plan        0.0
voice mail plan           0.0
number vmail messages     0.0
total day minutes         0.0
total day calls           0.0
total day charge          0.0
total eve minutes         0.0
total eve calls           0.0
total eve charge          0.0
total night minutes       0.0
total night calls         0.0
total night charge        0.0
total intl minutes        0.0
total intl calls          0.0
total intl charge         0.0
customer service calls    0.0
churn                     0.0
dtype: float64

#### 2.2.7: Duplicates

In [19]:
data.duplicated().sum()

0
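`data.duplicated()` only flags rows that repeat in every column. Since `phone number` is described as a unique identifier, a complementary check is whether any identifier repeats even when other columns differ. The sketch below uses a toy frame built from the first three rows shown in section 2.2.1; the variable names are illustrative.

```python
import pandas as pd

# Toy frame using the first three rows displayed in section 2.2.1
toy = pd.DataFrame(
    {
        "phone number": ["382-4657", "371-7191", "358-1921"],
        "state": ["KS", "OH", "NJ"],
    }
)

# Rows that are identical across all columns
full_row_dupes = int(toy.duplicated().sum())

# Rows whose identifier repeats, regardless of the other columns
id_dupes = int(toy.duplicated(subset=["phone number"]).sum())

print(full_row_dupes, id_dupes)
print(toy["phone number"].is_unique)
```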

### Data Summary
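The dataset contains 3,333 rows and 21 columns. No column has missing values, and there are no duplicate rows. Four columns are stored as text (`state`, `phone number`, `international plan`, `voice mail plan`): `phone number` is unique for every row, `state` has 51 distinct values, most customers have no international plan (3,010 of 3,333), and most have no voicemail plan (2,411 of 3,333). The remaining feature columns describe account length, call usage, charges, and customer service calls, so the data is directly relevant to identifying which customers churn.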

## Data Preparation

### Data Cleaning
Work on a copy of the dataset so that the original DataFrame stays unchanged.

In [None]:
# Make a copy of the original DataFrame
data1 = data.copy()
data1[:2]

# Optional: drop `phone number`, a unique identifier
# (assumption: it carries no predictive information)
#data1 = data1.drop(columns=["phone number"])

# Optional: convert the column names to lower case
#data1.columns = data1.columns.str.lower()
#data1.columns

# Optional: convert area code to a str, since it is a label rather than a quantity
#data1["area code"] = data1["area code"].astype(str)
#data1.info()

# Check the null values (section 2.2.6 found none, so no imputation is needed)
data1.isna().sum()

# Confirm that no missing values remain
data1.isna().sum().any()

# Check duplicates
data1.duplicated().sum()

# Check outliers in the numeric columns
sns.boxplot(data=data1.select_dtypes(include="number"), color="r")
plt.xticks(rotation=90)
plt.tight_layout()
plt.grid(alpha=.3)
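
The boxplot gives a visual check for outliers. A numeric complement is to count the values that fall outside the 1.5 × IQR fences, which is the same rule boxplot whiskers use by default. The helper below is a minimal sketch; `iqr_outlier_counts` and the toy values are hypothetical and are not taken from the SyriaTel data.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the DataFrame columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Made-up values: one obviously extreme entry in the first column
toy = pd.DataFrame(
    {
        "total day minutes": [150.0, 160.0, 170.0, 180.0, 190.0, 600.0],
        "customer service calls": [1, 1, 2, 2, 3, 3],
    }
)

counts = iqr_outlier_counts(toy)
print(counts)
```

Whether flagged values should be removed is a separate decision: heavy usage may be genuine customer behavior rather than a data error.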

In [None]:
#Feature Engineering

## Exploratory Data Analysis

### Univariate Analysis

### Bivariate Analysis

### Multivariate Analysis

## Inferential Analysis

Candidate tests: Pearson correlation test, t-test, one-way ANOVA, or chi-square test.
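
As an illustration of the chi-square option, the sketch below tests the association between two categorical variables, using `international plan` and `churn` as the example pair. It computes the uncorrected Pearson statistic by hand with NumPy on a made-up toy sample; the counts are not real results, and the closed-form p-value used here is valid only for one degree of freedom (a 2 × 2 table). A statistics library such as SciPy would normally be used for the general case.

```python
import math

import numpy as np
import pandas as pd

# Made-up toy sample (10 customers); not taken from the SyriaTel data
toy = pd.DataFrame(
    {
        "international plan": ["no"] * 6 + ["yes"] * 4,
        "churn": [False] * 5 + [True] + [False] + [True] * 3,
    }
)

# Observed contingency table: rows = plan (no, yes), columns = churn (False, True)
observed = pd.crosstab(toy["international plan"], toy["churn"]).to_numpy(dtype=float)

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Pearson chi-square statistic (no continuity correction)
chi2_stat = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# For exactly 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi2_stat / 2))

print(f"chi2 = {chi2_stat:.4f}, dof = {dof}, p = {p_value:.4f}")
```

The toy sample is far too small for the test to be reliable (several expected counts are below 5); it only shows the mechanics.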

## Modeling

### Data Preprocessing

Follow the approach from the instructor (mwalimu).
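
The preprocessing steps are not spelled out here, so the following is only one possible sketch, under the assumption that the yes/no plan columns are mapped to 0/1, `churn` is cast to an integer, and `state` is one-hot encoded. The `preprocess` helper is hypothetical; the toy rows reuse values from the first four rows shown in section 2.2.1.

```python
import pandas as pd

# Toy frame built from the first four rows displayed in section 2.2.1
toy = pd.DataFrame(
    {
        "state": ["KS", "OH", "NJ", "OH"],
        "international plan": ["no", "no", "no", "yes"],
        "voice mail plan": ["yes", "yes", "no", "no"],
        "total day minutes": [265.1, 161.6, 243.4, 299.4],
        "churn": [False, False, False, False],
    }
)


def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Encode the categorical columns as numbers (assumed scheme, see lead-in)."""
    out = frame.copy()
    # Binary yes/no columns become 0/1
    for col in ["international plan", "voice mail plan"]:
        out[col] = out[col].map({"no": 0, "yes": 1})
    # Boolean target becomes 0/1
    out["churn"] = out["churn"].astype(int)
    # One indicator column per state
    return pd.get_dummies(out, columns=["state"], prefix="state", dtype=int)


prepared = preprocess(toy)

# Separate the features from the target
X = prepared.drop(columns=["churn"])
y = prepared["churn"]
print(X.head())
```

Any scaling or train/test splitting would come after this step and should be fitted on the training portion only.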