## Introduction
### Business Understanding
SyriaTel Communications, a leading telecommunications company, faces a significant challenge with customer churn, where customers discontinue their services. This project aims to predict and prevent customer churn, providing substantial real-world value for SyriaTel. By addressing customer churn, SyriaTel can:

1. Reduce Financial Losses: Retaining customers helps in maintaining steady revenue streams by avoiding the loss of monthly or yearly payments.
2. Minimize Customer Acquisition Costs: Acquiring new customers is often more expensive than retaining existing ones. By reducing churn, SyriaTel can lower these acquisition costs.
3. Enhance Customer Satisfaction and Loyalty: By understanding and addressing the reasons behind customer churn, SyriaTel can improve customer satisfaction, leading to increased loyalty and long-term engagement.
4. Gain Competitive Advantage: A lower churn rate can position SyriaTel more favorably in the competitive telecommunications market, attracting more customers through positive word-of-mouth and reputation.

Overall, the project helps SyriaTel maintain a stable customer base, optimize operational costs, and improve the overall customer experience.


## Data Understanding

The data used in this project is sourced from SyriaTel’s customer records and includes various attributes that are crucial for understanding customer behavior and predicting churn. The key data properties and their relevance to the real-world problem of customer churn are as follows:

1. Customer Service Calls
   - Source: Customer service logs.
   - Properties: Number of calls each customer made to customer service (the `customer service calls` column).
   - Relevance: Frequent calls to customer service may indicate dissatisfaction or unresolved issues, which are potential indicators of churn. Analyzing these patterns helps identify at-risk customers.
2. Usage Patterns
   - Source: Usage records from customer accounts.
   - Properties: Minutes, number of calls, and charges for day, evening, night, and international usage, plus the number of voice mail messages.
   - Relevance: Understanding how customers use their plans can reveal engagement levels. Low usage might indicate that customers are not finding value in their plans, which could lead to churn.
3. Geographic Data
   - Source: Customer address records.
   - Properties: Customer state and area code.
   - Relevance: Certain regions may have higher churn rates due to factors like network coverage, competition, or regional preferences. Identifying these areas allows for targeted retention strategies.
4. Demographic Information
   - Source: Customer profiles.
   - Properties: Age, gender, income level, etc. Note that none of these fields appear among the columns of the dataset loaded in this notebook.
   - Relevance: Demographic factors can influence customer behavior and preferences. Understanding these can help tailor retention efforts to specific customer segments.

Relating these data properties to the problem of customer churn helps identify the indicators and patterns that contribute to it. This understanding supports both the development of a predictive model and the design of targeted interventions to retain customers.
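
As a rough illustration of how a usage-related property might be related to churn, the sketch below sums the per-period minute columns into a single total and compares its average by churn status. The minute column names match those in the dataset loaded later in this notebook. The four-row frame, its values, and the derived `total minutes` column are made up for illustration and are not part of the original analysis.

```python
import pandas as pd

# Tiny made-up sample; the real data comes from the SyriaTel CSV.
sample = pd.DataFrame({
    "total day minutes": [100.0, 200.0, 300.0, 400.0],
    "total eve minutes": [50.0, 50.0, 50.0, 50.0],
    "total night minutes": [50.0, 50.0, 50.0, 50.0],
    "total intl minutes": [10.0, 10.0, 10.0, 10.0],
    "churn": [False, False, True, True],
})

# Collect every per-period minutes column and add them up row by row.
minute_cols = [c for c in sample.columns if c.endswith("minutes")]
sample["total minutes"] = sample[minute_cols].sum(axis=1)  # hypothetical derived column

# Average total usage for retained (False) and churned (True) customers.
usage_by_churn = sample.groupby("churn")["total minutes"].mean()
print(usage_by_churn)
```

On real data, a difference in these averages would only be a starting point; the distributions would still need to be compared before drawing conclusions.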

## Exploratory Data Analysis (EDA)
In the EDA portion, the following questions were explored to gain insights into customer churn:

1. Customer Service Calls: Is calling customer service a sign of customer unhappiness/potential churn?
2. Usage Patterns: How much are people using their plan? What can this tell us about churn?
3. Geographic Analysis: Are customers in certain areas more likely to churn?

By addressing these questions, the analysis aims to uncover patterns and trends that can inform a predictive model for customer churn and help SyriaTel understand and mitigate the factors driving it.
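
The first and third questions both come down to comparing churn rates across groups. A minimal sketch of that comparison is shown below, assuming the `churn` column is boolean as in the dataset. The helper name `churn_rate_by` and the sample rows are hypothetical and serve only as illustration.

```python
import pandas as pd

def churn_rate_by(df, column):
    """Return the fraction of churned customers for each value of `column`."""
    # The mean of a boolean column is the share of True values.
    return df.groupby(column)["churn"].mean()

# Made-up rows, not taken from the SyriaTel data.
sample = pd.DataFrame({
    "state": ["KS", "KS", "OH", "OH", "NJ", "NJ"],
    "customer service calls": [0, 1, 1, 4, 4, 5],
    "churn": [False, False, True, True, True, True],
})

rate_by_calls = churn_rate_by(sample, "customer service calls")
rate_by_state = churn_rate_by(sample, "state")
print(rate_by_calls)
print(rate_by_state)
```

The same helper can be pointed at `area code` or any other categorical column. Groups with few customers will produce noisy rates, so group sizes are worth checking alongside the rates.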

## Data Preparation

1. Import Necessary Libraries

In [11]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns


In [18]:
## 2. Data Loading
df = pd.read_csv('syriatel_customer_data.csv')
df.head()

|   | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


### Data Source and Properties
The data used in this project is sourced from SyriaTel's customer records. As the preview above shows, it includes the number of customer service calls, usage figures (minutes, calls, and charges for day, evening, night, and international use), plan subscriptions, and geographic information (state and area code). These properties are the basis for understanding customer behavior and predicting churn.

In [30]:
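# Check data types
# Note: this cell ran as In [30], after the cells that follow, so its output appears to
# reflect a later, encoded version of the frame rather than the raw data loaded above.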
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Columns: 3401 entries, account length to voice mail plan_yes
dtypes: bool(3385), float64(8), int64(8)
memory usage: 11.2 MB
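
The 3,401 columns reported above, nearly all of them boolean, are consistent with one-hot encoding having been applied to every text column, including the near-unique `phone number`. This is an inference from the output, not something the notebook states. The sketch below uses the first four rows of the preview to show how `pd.get_dummies` adds one column per distinct value of an identifier, and how dropping the identifier first keeps the width manageable.

```python
import pandas as pd

# First four rows of the preview, restricted to a few columns.
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH"],
    "phone number": ["382-4657", "371-7191", "358-1921", "375-9999"],
    "international plan": ["no", "no", "no", "yes"],
    "account length": [128, 107, 137, 84],
})

# Encoding everything: each distinct phone number becomes its own column.
encoded_all = pd.get_dummies(sample)

# Dropping the identifier first avoids that per-customer column explosion.
encoded_no_id = pd.get_dummies(sample.drop(columns=["phone number"]))

print(encoded_all.shape[1], encoded_no_id.shape[1])
```

Whether `phone number` should be dropped in the actual pipeline is a modeling decision for the author. An identifier that is unique per row carries no generalizable signal, so dropping it is a common choice.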


In [14]:
## 3. Check Data Types
print(df.dtypes)

state                      object
account length              int64
area code                   int64
phone number               object
international plan         object
voice mail plan            object
number vmail messages       int64
total day minutes         float64
total day calls             int64
total day charge          float64
total eve minutes         float64
total eve calls             int64
total eve charge          float64
total night minutes       float64
total night calls           int64
total night charge        float64
total intl minutes        float64
total intl calls            int64
total intl charge         float64
customer service calls      int64
churn                        bool
dtype: object
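
The listing shows `international plan` and `voice mail plan` stored as `object`, holding the strings `yes` and `no` seen in the preview. One possible way to turn them into booleans is sketched below, under the assumption that no other values occur in those columns. The three sample rows are made up.

```python
import pandas as pd

# Made-up rows using the same column names and value format as the dataset.
sample = pd.DataFrame({
    "international plan": ["no", "no", "yes"],
    "voice mail plan": ["yes", "no", "no"],
    "churn": [False, False, True],
})

plan_cols = ["international plan", "voice mail plan"]
for col in plan_cols:
    # map() yields NaN for anything other than "yes"/"no", which makes
    # unexpected entries easy to catch before casting.
    mapped = sample[col].map({"yes": True, "no": False})
    if mapped.isna().any():
        raise ValueError(f"unexpected value in column {col!r}")
    sample[col] = mapped.astype(bool)

# Many estimators expect a numeric target, so cast churn to 0/1.
sample["churn"] = sample["churn"].astype(int)
print(sample.dtypes)
```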


In [16]:
## 4. Check for Null Values
print(df.isnull().sum())


state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64


In [31]:
# Check for duplicates
df.duplicated().sum()

0
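
`df.duplicated()` only flags rows that are identical in every column. A complementary check is to look for repeated customer identifiers, sketched below. Treating `area code` plus `phone number` as the customer key is an assumption made for this example, and the sample rows are made up.

```python
import pandas as pd

# Made-up rows: the first and last share the same area code and phone number
# but differ in usage, so they are not full-row duplicates.
sample = pd.DataFrame({
    "area code": [415, 415, 408, 415],
    "phone number": ["382-4657", "371-7191", "382-4657", "382-4657"],
    "total day minutes": [265.1, 161.6, 243.4, 299.4],
})

# Rows identical across all columns.
full_row_dupes = int(sample.duplicated().sum())

# Rows repeating an earlier (area code, phone number) pair.
key_dupes = int(sample.duplicated(subset=["area code", "phone number"]).sum())

print(full_row_dupes, key_dupes)
```

On the encoded frame, where the identifier columns no longer exist in their original form, this check would need to run before encoding.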

In [33]:
# Check for nonsensical or placeholder values
for col in df.columns:
    print(col)
    print(df[col].unique())
    print('\n-----------------------------------------------\n')

account length
[128 107 137  84  75 118 121 147 117 141  65  74 168  95  62 161  85  93
  76  73  77 130 111 132 174  57  54  20  49 142 172  12  72  36  78 136
 149  98 135  34 160  64  59 119  97  52  60  10  96  87  81  68 125 116
  38  40  43 113 126 150 138 162  90  50  82 144  46  70  55 106  94 155
  80 104  99 120 108 122 157 103  63 112  41 193  61  92 131 163  91 127
 110 140  83 145  56 151 139   6 115 146 185 148  32  25 179  67  19 170
 164  51 208  53 105  66  86  35  88 123  45 100 215  22  33 114  24 101
 143  48  71 167  89 199 166 158 196 209  16  39 173 129  44  79  31 124
  37 159 194 154  21 133 224  58  11 109 102 165  18  30 176  47 190 152
  26  69 186 171  28 153 169  13  27   3  42 189 156 134 243  23   1 205
 200   5   9 178 181 182 217 177 210  29 180   2  17   7 212 232 192 195
 197 225 184 191 201  15 183 202   8 175   4 188 204 221]

-----------------------------------------------

area code
[415 408 510]

-----------------------------------------------
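
Printing every unique value becomes hard to scan for numeric columns such as `account length`. A more compact alternative is sketched below: it reports the number of distinct values per column and lists them only when there are few. The helper name `summarize_unique` and the `max_unique` threshold are hypothetical choices for this example.

```python
import pandas as pd

def summarize_unique(df, max_unique=10):
    """Summarize distinct values per column, listing them only if there are few."""
    rows = []
    for col in df.columns:
        n_unique = df[col].nunique(dropna=False)
        # Only low-cardinality columns get their values listed in full.
        values = df[col].unique().tolist() if n_unique <= max_unique else None
        rows.append({"column": col, "n_unique": n_unique, "values": values})
    return pd.DataFrame(rows)

# Values taken from the first rows of the output above.
sample = pd.DataFrame({
    "area code": [415, 408, 510, 415],
    "account length": [128, 107, 137, 84],
})

summary = summarize_unique(sample, max_unique=3)
print(summary)
```

Placeholder values such as `?` or `-1` would still need to be looked for explicitly in the listed values; the summary only narrows down which columns are worth reading in full.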

