# SyriaTel Customer Churn using Machine Learning

### Problem

SyriaTel faces a significant customer churn problem: subscribers terminate their services, which reduces revenue and market share and raises the cost of acquiring new customers to replace those who leave. Addressing it requires predicting at-risk customers early, framed as binary classification (churn vs. non-churn), so that retention efforts can be targeted, resources allocated efficiently, and the company's competitive position in the telecoms industry maintained.

## ML solution workflow
![image.png](attachment:image.png)

## Business Context 



## Stakeholders

### Internal Stakeholders

- Executive Leadership (Chief Executive Officer (CEO), Chief Technology Officer (CTO), Chief Financial Officer (CFO)):
    Responsible for profitability, shareholder value, and strategic direction. Directly affected by revenue decline resulting from client attrition and expenses associated with customer acquisition. Seek data-driven insights to allocate capital and guide corporate strategy.

- Marketing Department: 
    Responsible for customer acquisition, retention initiatives, and brand perception. Requires churn predictions to formulate tailored offers, optimise marketing expenditure (including customer acquisition cost, CAC), and assess campaign ROI.

- Customer Service & Support Teams:
    The primary interface for client engagement and problem resolution. Need early alerts to prioritise high-risk clients, address pain points proactively, and improve satisfaction.

### External Stakeholders

- Customers (Existing Subscribers):
    Affected by service quality, cost, and support. Churn behaviour depends on how satisfied they are and how much value they perceive in the service. Retention actions directly shape their experience.

- Shareholders & Investors:
    Interested in the overall health of the business and its bottom line. The churn rate affects the company's stock price, growth prospects, and revenue stability.

## Data Understanding

In [9]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

%matplotlib inline


# EDA

In [12]:
Tel = pd.read_csv("Telecom's data.csv")
Tel.head()

Unnamed: 0,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
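
The five rows printed above already hint at how the charge columns relate to the minutes columns. The sketch below re-enters only the day and night values shown in that output and divides charge by minutes; `per_minute_rate` is a hypothetical helper introduced for illustration, and five rows are not proof that the pattern holds across the whole dataset.

```python
import pandas as pd

# Day and night values copied from the Tel.head() output above (5 rows only).
head = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "total day charge": [45.07, 27.47, 41.38, 50.90, 28.34],
    "total night minutes": [244.7, 254.4, 162.6, 196.9, 186.9],
    "total night charge": [11.01, 11.45, 7.32, 8.86, 8.41],
})


def per_minute_rate(df, period):
    """Hypothetical helper: charge per minute for one period ('day', 'night', ...)."""
    minutes = df[f"total {period} minutes"]
    charge = df[f"total {period} charge"]
    # Rows with zero minutes become NaN instead of raising a division warning.
    return (charge / minutes.where(minutes > 0)).round(3)


day_rate = per_minute_rate(head, "day")
night_rate = per_minute_rate(head, "night")

# Pearson correlation between minutes and charge on the same five rows.
day_corr = head["total day minutes"].corr(head["total day charge"])

print(day_rate.unique(), night_rate.unique(), round(day_corr, 4))
```

On these five rows the day rate rounds to 0.17 per minute and the night rate to 0.045 per minute. If the same holds for the full data, which could be checked by calling `per_minute_rate(data, "day")`, each charge column would carry no information beyond its minutes column.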


In [13]:
# creating a copy of the data to avoid any changes to original data
data = Tel.copy()
data.shape

(3333, 21)

In [14]:
# checking the statistical summary of the data
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
account length,3333.0,101.064806,39.822106,1.0,74.0,101.0,127.0,243.0
area code,3333.0,437.182418,42.37129,408.0,408.0,415.0,510.0,510.0
number vmail messages,3333.0,8.09901,13.688365,0.0,0.0,0.0,20.0,51.0
total day minutes,3333.0,179.775098,54.467389,0.0,143.7,179.4,216.4,350.8
total day calls,3333.0,100.435644,20.069084,0.0,87.0,101.0,114.0,165.0
total day charge,3333.0,30.562307,9.259435,0.0,24.43,30.5,36.79,59.64
total eve minutes,3333.0,200.980348,50.713844,0.0,166.6,201.4,235.3,363.7
total eve calls,3333.0,100.114311,19.922625,0.0,87.0,100.0,114.0,170.0
total eve charge,3333.0,17.08354,4.310668,0.0,14.16,17.12,20.0,30.91
total night minutes,3333.0,200.872037,50.573847,23.2,167.0,201.2,235.3,395.0


### Observations from summary statistics

- Service Usage Indicators with a High Churn Risk

    Customer service calls are strongly right-skewed (mean=1.56, 75th percentile=2, max=9). Customers with as many as 9 service calls are likely to be significantly dissatisfied, which makes this feature a candidate churn predictor.

    Voicemail usage is also highly skewed: at least half of the customers have no voicemail messages (median=0), while a separate behavioural segment of heavy users reaches up to 51 messages.

- Pricing Sensitivity Indicated by Usage Patterns

    Although mean usage is similar during the day (≈180 min) and at night (≈201 min), the mean day charge (30.56) is about 3.4 times the mean night charge (9.04), which suggests a risk of bill shock at peak hours.

    Low engagement with international services (mean=4.48 calls, 10.24 min) may reflect high costs or a lack of interest.

- Inefficiencies in Operations

    Call volume is nearly identical across day, evening, and night (all means ≈100 calls, σ≈20), which may point to rigid pricing plans without time-based flexibility.

    Area codes appear concentrated in a few values (408/415/510), which could be hiding regional service gaps.

- Data Redundancy Concerns

    The charge and minutes columns are clearly correlated: for example, the mean day charge is 30.56 against mean day minutes of 179.78, roughly 0.17 per minute. Models may suffer from multicollinearity if both are included.

- Long-Tailed Features

    The long-tailed distributions of international calls (max=20 vs. 75th percentile=6) and voicemail messages (max=51 vs. 75th percentile=20) point to outlier customers who may warrant separate segmentation.
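
The skew and outlier observations above are read off the `describe()` quartiles by eye. A minimal sketch of how they could be quantified is shown below, using sample skewness and the common 1.5 × IQR upper fence. `tail_summary` is a hypothetical helper, the fence multiplier is a convention rather than something the notebook specifies, and the `calls` values are illustrative only, not taken from the dataset.

```python
import pandas as pd


def tail_summary(series):
    """Hypothetical helper: skewness and count of values above the upper IQR fence."""
    q1, q3 = series.quantile([0.25, 0.75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    return {
        "skew": series.skew(),
        "upper_fence": upper_fence,
        "n_above_fence": int((series > upper_fence).sum()),
    }


# Illustrative values only (NOT from the dataset): mostly low counts, one extreme.
calls = pd.Series([0, 1, 1, 1, 1, 2, 2, 2, 3, 9], name="customer service calls")
summary = tail_summary(calls)
print(summary)
```

With the real data the same helper could be applied per column, for example `tail_summary(data["customer service calls"])` or `tail_summary(data["number vmail messages"])`, to compare how heavy each tail is before deciding on segmentation.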





In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

## Observations on the Data Types

- Numerical Data types

    - Account Length
    - Area Code
    - Number vmail messages
    - Total day calls
    - Total eve calls
    - Total night calls
    - Total intl calls
    - Total day minutes
    - Total day charge
    - Total eve minutes
    - Total eve charge
    - Total night minutes
    - Total night charge
    - Total intl minutes
    - Total intl charge
    - Customer service calls

- Categorical Data types

    - State
    - Phone number
    - International plan
    - Voice mail plan
    - Churn
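
The grouping above can also be derived from the DataFrame rather than typed by hand. The sketch below uses a tiny stand-in frame, with values taken from the first two printed rows, and assumes `churn` is stored as a boolean, which the truncated `info()` output does not confirm. It also shows one optional choice that differs from the list above: treating `area code` as a nominal label, since area codes identify regions rather than measure a quantity.

```python
import pandas as pd

# Stand-in frame with a few of the columns; dtypes follow data.info() where shown.
sample = pd.DataFrame({
    "state": ["KS", "OH"],
    "area code": [415, 415],
    "international plan": ["no", "no"],
    "total day minutes": [265.1, 161.6],
    "churn": [False, False],  # assumed boolean
})

# Integer and float columns count as numeric; strings and booleans do not.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Optional: recast area code as a category so it is no longer treated as a number.
as_nominal = sample.assign(**{"area code": sample["area code"].astype("category")})
numeric_after = as_nominal.select_dtypes(include="number").columns.tolist()

print(numeric_cols, categorical_cols, numeric_after)
```

Running the same two `select_dtypes` calls on `data` would reproduce the split from the actual dtypes, which makes it easier to spot identifier-like columns such as `phone number` that may be better dropped than encoded.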