# 📊 SyriaTel Customer Churn Prediction

Welcome to this end-to-end data science project focused on customer churn prediction for **SyriaTel**, a telecommunications company.

## 🧠 Problem Understanding

**Business Objective:**  
SyriaTel wants to proactively identify customers likely to **churn** (stop using their services), so they can intervene and reduce revenue loss.

**Technical Objective:**  
This is a binary classification task: the goal is to train a model that accurately predicts whether a customer will churn based on historical data.




In [1]:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Machine Learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import (classification_report, confusion_matrix, 
                             roc_auc_score, roc_curve, precision_recall_curve,
                             accuracy_score, precision_score, recall_score, f1_score)
from sklearn.feature_selection import SelectKBest, f_classif

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")


All the necessary Python libraries for data manipulation, visualization, and machine learning have been imported at the start. Their usage will become clear as we move through each stage of the pipeline.


## 📥 2. Data Loading and Initial Exploration

Before diving into preprocessing or modeling, we start by loading the dataset and gaining an initial understanding of its structure.

Key steps in this section:
- Load the data from a CSV file
- Inspect the dataset shape and column types
- Preview the first few rows
- Check for any missing values
- Generate basic descriptive statistics


In [None]:
# Load the dataset
df = pd.read_csv('Data/telecom.csv')

In [10]:
df.shape

(3333, 21)

The dataset contains 3,333 rows and 21 columns (20 candidate features plus the `churn` target). That is small enough to explore and model comfortably in memory.


In [8]:
df.dtypes

state                      object
account length              int64
area code                   int64
phone number               object
international plan         object
voice mail plan            object
number vmail messages       int64
total day minutes         float64
total day calls             int64
total day charge          float64
total eve minutes         float64
total eve calls             int64
total eve charge          float64
total night minutes       float64
total night calls           int64
total night charge        float64
total intl minutes        float64
total intl calls            int64
total intl charge         float64
customer service calls      int64
churn                        bool
dtype: object

We observe:
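- Most features are numerical. Four columns are stored as strings (`state`, `phone number`, `international plan`, `voice mail plan`), and `area code` is stored as an integer but is better treated as a categorical label.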
- The target variable is `churn`, stored as a boolean.
- `phone number` and `state` may not be useful for prediction and could be dropped later.
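The observations above can be checked programmatically. The following is a minimal sketch, not the notebook's own code: it rebuilds a few columns from the five rows shown by `df.head()` into a small frame named `sample` (a hypothetical name), and the cardinality threshold of 3 is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Five rows copied from the df.head() preview, restricted to a few columns.
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ", "OH", "OK"],
    "area code": [415, 415, 415, 408, 415],
    "phone number": ["382-4657", "371-7191", "358-1921", "375-9999", "330-6626"],
    "international plan": ["no", "no", "no", "yes", "yes"],
    "voice mail plan": ["yes", "yes", "no", "no", "no"],
    "churn": [False, False, False, False, False],
})

# Columns that are neither numeric nor boolean are the categorical candidates.
categorical_cols = sample.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# A column whose values are all distinct behaves like an identifier.
id_like_cols = [c for c in categorical_cols if sample[c].nunique() == len(sample)]

# Integer columns with very few distinct values may really be categories.
# The threshold of 3 is an assumption for this sketch.
low_cardinality_ints = [
    c for c in sample.columns
    if pd.api.types.is_integer_dtype(sample[c]) and sample[c].nunique() <= 3
]

# Convert the boolean target to 0/1 for estimators that expect numeric labels.
y = sample["churn"].astype(int)

print(categorical_cols)
print(id_like_cols)
print(low_cardinality_ints)
```

On the full dataset the same checks would run against `df` instead of `sample`; the identifier test in particular is only meaningful with all rows present.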


In [11]:
df.head()

|   | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


Looking at the first five rows gives us a glimpse of how the data is structured. We can already spot columns like `total day minutes`, `total intl charge`, and `customer service calls` which may be predictive of churn.


In [12]:
df.isnull().sum()  

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64

✅ Great news — there are **no missing values** in the dataset. This saves us the effort of imputing or removing records at this stage.
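If the data were less clean, raw counts alone would be hard to act on. As a rough sketch, a small helper (here called `missing_report`, a hypothetical name) can report counts and percentages together. The frame below is toy data with a deliberately injected gap, not SyriaTel data. Note that `isnull()` only detects true missing markers such as `NaN`; placeholder values like empty strings would need a separate check.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)

# Toy frame with one injected NaN (illustrative only).
toy = pd.DataFrame({
    "total day minutes": [265.1, np.nan, 243.4, 299.4],
    "customer service calls": [1, 1, 0, 2],
})

report = missing_report(toy)
print(report)
```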


In [13]:
df.describe()

|   | account length | area code | number vmail messages | total day minutes | total day calls | total day charge | total eve minutes | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 | 3333.0 |
| mean | 101.064806 | 437.182418 | 8.09901 | 179.775098 | 100.435644 | 30.562307 | 200.980348 | 100.114311 | 17.08354 | 200.872037 | 100.107711 | 9.039325 | 10.237294 | 4.479448 | 2.764581 | 1.562856 |
| std | 39.822106 | 42.37129 | 13.688365 | 54.467389 | 20.069084 | 9.259435 | 50.713844 | 19.922625 | 4.310668 | 50.573847 | 19.568609 | 2.275873 | 2.79184 | 2.461214 | 0.753773 | 1.315491 |
| min | 1.0 | 408.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 23.2 | 33.0 | 1.04 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 74.0 | 408.0 | 0.0 | 143.7 | 87.0 | 24.43 | 166.6 | 87.0 | 14.16 | 167.0 | 87.0 | 7.52 | 8.5 | 3.0 | 2.3 | 1.0 |
| 50% | 101.0 | 415.0 | 0.0 | 179.4 | 101.0 | 30.5 | 201.4 | 100.0 | 17.12 | 201.2 | 100.0 | 9.05 | 10.3 | 4.0 | 2.78 | 1.0 |
| 75% | 127.0 | 510.0 | 20.0 | 216.4 | 114.0 | 36.79 | 235.3 | 114.0 | 20.0 | 235.3 | 113.0 | 10.59 | 12.1 | 6.0 | 3.27 | 2.0 |
| max | 243.0 | 510.0 | 51.0 | 350.8 | 165.0 | 59.64 | 363.7 | 170.0 | 30.91 | 395.0 | 175.0 | 17.77 | 20.0 | 20.0 | 5.4 | 9.0 |


Descriptive statistics reveal:
- `account length` ranges from 1 to 243 days.
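- `customer service calls` ranges from 0 to 9 with a median of 1; customers at the upper end may be dissatisfied.
- Mean usage is similar across day, evening, and night (about 180, 201, and 201 minutes), yet mean charges differ sharply (about 30.56, 17.08, and 9.04), which suggests different per-minute rates by period.
- Some count features are right-skewed: `total intl calls` has a median of 4 but a maximum of 20, and `number vmail messages` has a median of 0 against a mean of about 8.1.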

These insights help us begin feature selection and transformation decisions in upcoming steps.
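One way to put a number on skew, rather than reading it off the summary table, is to compare each column's mean with its median and compute the sample skewness. The sketch below uses toy values chosen to mimic a zero-heavy count column; they are illustrative only and are not taken from the SyriaTel data.

```python
import pandas as pd

# Toy values (not SyriaTel data): one symmetric column, one zero-heavy column.
toy = pd.DataFrame({
    "symmetric": [8, 9, 10, 11, 12],
    "right_skewed": [0, 0, 0, 20, 30],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),  # sample skewness; values above 0 suggest a long right tail
})

print(summary)
```

A mean well above the median together with positive skewness is a common signal that a log or similar transform may be worth trying for scale-sensitive models, though whether it helps here would need to be tested on the actual data.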
