# 1. SYRIATEL CHURN PREDICTION

## 1.1. Business Understanding
SyriaTel is a telecommunications company that provides services such as voice calls, messaging, and data plans to a wide range of customers. Like many telecom providers, its business relies heavily on maintaining a stable base of long-term subscribers. Customer loyalty is therefore crucial: steady usage over time generates predictable revenue and reduces the high cost of constantly acquiring new users.

Customer churn has long been a challenge for the telecommunications industry, and SyriaTel is no exception. Churn refers to the rate at which customers discontinue their service, either by switching to a competitor or by simply stopping usage altogether. For telecom companies, churn is costly since acquiring a new customer is usually more expensive than retaining an existing one.

This project seeks to identify customers at high risk of churn. With these insights, SyriaTel can implement retention strategies to keep valuable customers engaged.

### 1.1.1. Problem Statement
SyriaTel is losing a portion of its customers to churn, which directly impacts profitability. However, the company lacks a clear framework to identify which customers are most at risk of leaving. Without this knowledge, SyriaTel cannot act proactively to retain these customers. This project aims to develop a predictive model that identifies customers likely to churn, so that the company is alerted before they leave.

### 1.1.2. Objectives
1. To identify the factors that significantly influence churn.
2. To build and validate a predictive model that classifies whether a customer is likely to churn.
3. To determine how much revenue is lost due to customer churn.
4. To provide insights that SyriaTel can use in customer retention strategies.

### 1.1.3. Metric of Success

1. From a business perspective, the project will be considered successful if it leads to:
   * A lower churn rate, meaning fewer customers discontinue their subscriptions.
   * A higher customer retention rate, with more customers choosing to stay with SyriaTel for longer periods.
   * An enhanced customer experience, as the company acts on insights from the model to resolve common pain points.
2. From a technical perspective, the model should achieve an overall accuracy above 75% and a recall above 70% for churn prediction. Recall matters most here because it measures how many of the customers who actually leave are flagged by the model.
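
The technical targets above can be checked with a few lines of Python. The sketch below is a minimal, library-free illustration: the function names `accuracy` and `recall` and the toy labels are assumptions made for this example, not part of the project's code or results.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=True):
    """Share of actual positives (churners) that were predicted positive."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        raise ValueError("no positive examples in y_true")
    hits = sum(p == positive for p in preds_for_positives)
    return hits / len(preds_for_positives)


# Hypothetical labels: True means the customer churned.
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, False, False, False, False, True, False]

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

# Thresholds taken from the success metrics: accuracy > 75%, recall > 70%.
meets_targets = acc > 0.75 and rec > 0.70
print(f"accuracy={acc:.2f}, recall={rec:.2f}, meets targets: {meets_targets}")
```

Tracking both numbers is deliberate: when churners are a minority, a model can score well on accuracy while still missing many of them, and recall exposes that gap.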

## 1.2. Data Understanding
The dataset consists of customer information collected by SyriaTel. Each row represents a single customer, and the columns describe their usage, subscription plans, and service interactions. The target variable is churn, which indicates whether the customer left the company or stayed.

**Columns:**
* state : The U.S. state where the customer is located. 
* account length : The number of days a customer has had their account. 
* area code : The customer’s area code.
* phone number : Each customer’s phone number (unique identifier).
* international plan : Whether the customer has subscribed to an international calling plan (yes or no). 
* voice mail plan : Whether the customer has subscribed to a voicemail plan. 
* number vmail messages : Number of voice mail messages received by the customer. 
* total day minutes : Total minutes used by the customer during daytime.
* total day calls : Total number of calls made during daytime.
* total day charge : Total charges incurred during daytime.
* total eve minutes : Total minutes used during evening hours.
* total eve calls : Total number of calls made during evening hours.
* total eve charge : Total charges incurred during evening hours.
* total night minutes : Total minutes used during night hours.
* total night calls : Total number of calls made during night hours.
* total night charge : Total charges incurred during night hours.
* total intl minutes : Total international minutes used.
* total intl calls : Number of international calls made.
* total intl charge : Total charges from international calls.
* customer service calls : Number of times the customer called customer service. 
* churn : The target variable (True if the customer churned, False if they stayed). 

In [12]:
# Import the libraries
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
import warnings 
warnings.filterwarnings("ignore")


In [13]:
# Load the data
data = pd.read_csv("bigml_59c28831336c6604c800002a.csv")
data.head()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


In [14]:
# Check the tail
data.tail()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3328 | AZ | 192 | 415 | 414-4276 | no | yes | 36 | 156.2 | 77 | 26.55 | ... | 126 | 18.32 | 279.1 | 83 | 12.56 | 9.9 | 6 | 2.67 | 2 | False |
| 3329 | WV | 68 | 415 | 370-3271 | no | no | 0 | 231.1 | 57 | 39.29 | ... | 55 | 13.04 | 191.3 | 123 | 8.61 | 9.6 | 4 | 2.59 | 3 | False |
| 3330 | RI | 28 | 510 | 328-8230 | no | no | 0 | 180.8 | 109 | 30.74 | ... | 58 | 24.55 | 191.9 | 91 | 8.64 | 14.1 | 6 | 3.81 | 2 | False |
| 3331 | CT | 184 | 510 | 364-6381 | yes | no | 0 | 213.8 | 105 | 36.35 | ... | 84 | 13.57 | 139.2 | 137 | 6.26 | 5.0 | 10 | 1.35 | 2 | False |
| 3332 | TN | 74 | 415 | 400-4344 | no | yes | 25 | 234.4 | 113 | 39.85 | ... | 82 | 22.6 | 241.4 | 77 | 10.86 | 13.7 | 4 | 3.7 | 0 | False |


Observation: The first and last five rows share the same columns and value formats, so the file appears consistently structured from top to bottom.
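
The previewed rows also allow a quick sanity check on how related columns behave. As a rough illustration, the sketch below divides `total day charge` by `total day minutes` for the five rows shown by `data.head()`. The variable names are chosen for this example only, and five rows are far too few to draw a conclusion, so any pattern seen here would need to be confirmed on the full dataset.

```python
# (total day minutes, total day charge) pairs copied from the data.head() preview.
day_rows = [
    (265.1, 45.07),
    (161.6, 27.47),
    (243.4, 41.38),
    (299.4, 50.90),
    (166.7, 28.34),
]

# Charge per minute for each previewed customer.
rates = [charge / minutes for minutes, charge in day_rows]
for (minutes, charge), rate in zip(day_rows, rates):
    print(f"{minutes:>6} min -> {charge:>5} charged, {rate:.4f} per minute")

# A near-constant rate would suggest the charge column is a rescaled copy of minutes.
spread = max(rates) - min(rates)
print(f"spread between highest and lowest rate: {spread:.6f}")
```

If the rate stays near-constant across the whole dataset, the minutes and charge columns carry almost the same information, which is worth keeping in mind when selecting features later.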

### 1.2.1. Data Relevance
The dataset is relevant for churn prediction because it is complete, contains a clear target variable (churn), and provides multiple features that capture customer behavior and service usage.

In [15]:
# Check the shape of the dataset
print(f"The dataset has {data.shape[0]} records and {data.shape[1]} columns")

The dataset has 3333 records and 21 columns


In [16]:
# Concise statistical summary of the numeric columns
data.describe()

| | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |


Observation: Every numerical column has a count of 3,333, which matches the number of records and confirms there are no missing values in the numeric data.
The average account length is roughly 101 days, with a minimum of 1 day and a maximum of 243 days.
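
By default, `describe()` reports counts only for numeric columns, so the check above does not cover the text columns. A direct count of missing values per column covers both kinds. The sketch below runs on a small hypothetical frame named `toy`, with one missing value inserted on purpose; on the real dataset the same call would be applied to `data`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for `data`; the None is deliberate.
toy = pd.DataFrame(
    {
        "account length": [128, 107, 137],
        "international plan": ["no", None, "yes"],
        "churn": [False, False, True],
    }
)

# Count missing entries per column, for numeric and non-numeric columns alike.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Total number of missing cells; zero would mean the frame is complete.
total_missing = int(missing_per_column.sum())
print(f"total missing cells: {total_missing}")
```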

In [17]:
# check stat summary for categorical columns
data.describe(include='object').T[['top', 'freq']]

| | top | freq |
|---|---|---|
| state | WV | 106 |
| phone number | 382-4657 | 1 |
| international plan | no | 3010 |
| voice mail plan | no | 2411 |


Observation: Most customers subscribe to neither plan. Of the 3,333 customers, 3,010 (about 90%) have no international plan and 2,411 (about 72%) have no voice mail plan, suggesting these services are less popular among the customer base.
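
The proportions behind this observation can be reproduced from the counts already shown. The sketch below uses only the record count and the `freq` values from the outputs above; the variable names are chosen for this example.

```python
# Figures taken from the outputs above: 3333 records in total,
# 3010 without an international plan, 2411 without a voice mail plan.
total_customers = 3333
no_plan_counts = {"international plan": 3010, "voice mail plan": 2411}

# Convert each count into a percentage of the customer base.
no_plan_share = {
    plan: round(100 * count / total_customers, 1)
    for plan, count in no_plan_counts.items()
}

for plan, share in no_plan_share.items():
    print(f"{plan}: {share}% of customers are not subscribed")
```

On the full DataFrame, `data["international plan"].value_counts(normalize=True)` should give the same proportions directly.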

In [18]:
# Check the unique values in each column
for coln in data:
    uni_vale = data[coln].unique()
    print(f" {coln}\n, {uni_vale}\n")

 state
, ['KS' 'OH' 'NJ' 'OK' 'AL' 'MA' 'MO' 'LA' 'WV' 'IN' 'RI' 'IA' 'MT' 'NY'
 'ID' 'VT' 'VA' 'TX' 'FL' 'CO' 'AZ' 'SC' 'NE' 'WY' 'HI' 'IL' 'NH' 'GA'
 'AK' 'MD' 'AR' 'WI' 'OR' 'MI' 'DE' 'UT' 'CA' 'MN' 'SD' 'NC' 'WA' 'NM'
 'NV' 'DC' 'KY' 'ME' 'MS' 'TN' 'PA' 'CT' 'ND']

 account length
, [128 107 137  84  75 118 121 147 117 141  65  74 168  95  62 161  85  93
  76  73  77 130 111 132 174  57  54  20  49 142 172  12  72  36  78 136
 149  98 135  34 160  64  59 119  97  52  60  10  96  87  81  68 125 116
  38  40  43 113 126 150 138 162  90  50  82 144  46  70  55 106  94 155
  80 104  99 120 108 122 157 103  63 112  41 193  61  92 131 163  91 127
 110 140  83 145  56 151 139   6 115 146 185 148  32  25 179  67  19 170
 164  51 208  53 105  66  86  35  88 123  45 100 215  22  33 114  24 101
 143  48  71 167  89 199 166 158 196 209  16  39 173 129  44  79  31 124
  37 159 194 154  21 133 224  58  11 109 102 165  18  30 176  47 190 152
  26  69 186 171  28 153 169  13  27   3  42 189 156 13