## Business Understanding:

SyriaTel, a telecommunications company, is losing revenue to customer churn. To tackle this, the company has initiated a predictive analytics project aimed at identifying customers who are likely to churn soon. Using historical customer data such as demographics, usage patterns, and service interactions, the project develops a binary classification model. The model estimates each customer's likelihood of churning, enabling SyriaTel to proactively target at-risk customers with personalized retention strategies. Ultimately, this predictive approach helps SyriaTel optimize its marketing efforts, improve customer retention, and boost overall profitability.

## Data Understanding:

In preparing to use logistic regression for churn prediction at SyriaTel, the initial step involves collecting historical customer data encompassing demographics, usage patterns, subscription details, and service interactions. Subsequently, a thorough exploration of the dataset is conducted to understand its structure, identify relevant features, and preprocess the data to handle missing values and categorical variables. Following data splitting into training and testing sets, logistic regression is trained on the training data and evaluated using metrics like accuracy and F1 score. Through model optimization, hyperparameters are fine-tuned to enhance predictive performance, ensuring the model's robustness and interpretability. Ultimately, this approach enables SyriaTel to gain insights into customer churn drivers and implement targeted retention strategies effectively.
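The workflow described above can be sketched end to end as follows. This is a minimal illustration rather than the project's final model: the data comes from `make_classification` as a synthetic stand-in for the SyriaTel table, and the roughly 85/15 class balance, the grid of `C` values, and the choice of F1 as the tuning metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn table (assumption: roughly 15% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Hold out a stratified test set so both splits keep the same churn rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is fit on the training folds only.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Tune the regularization strength with 5-fold cross-validation.
search = GridSearchCV(
    pipeline, param_grid={"model__C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=5
)
search.fit(X_train, y_train)

# Evaluate the best model on the held-out test set.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"best C: {search.best_params_['model__C']}")
print(f"accuracy: {accuracy:.3f}, F1: {f1:.3f}")
```

When churners are a small minority, accuracy alone can look high even if few churners are caught, which is why the F1 score is reported alongside it.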

## Data Analysis:

The necessary libraries are imported below.

In [26]:
import pandas as pd  # For data manipulation and analysis
import numpy as np  # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For enhanced data visualization
from sklearn.model_selection import train_test_split  # For splitting data into train and test sets
from sklearn.preprocessing import StandardScaler  # For standardizing numerical features
from sklearn.linear_model import LogisticRegression  # For logistic regression model
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # For model evaluation metrics
from sklearn.model_selection import GridSearchCV  # For hyperparameter tuning
import warnings
warnings.filterwarnings("ignore")  # Suppress all warnings to keep the notebook output readable


In [27]:
df = pd.read_csv('Customer Churn.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

The dataset has no null values, but it mixes object (string), integer, and float data types. The object columns therefore need to be converted to a numeric or boolean type, or dropped, before the regression model is fitted later on.
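One way to confirm both observations is to count the missing values and list the non-numeric columns directly. The sketch below uses a tiny hand-made frame (`sample`, with hypothetical values shaped like the dataset) so that it runs on its own; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature of the churn table, for illustration only.
sample = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "account length": [128, 107, 137],
    "international plan": ["no", "no", "yes"],
    "total day minutes": [265.1, 161.6, 243.4],
})

# Total number of missing values across the whole frame.
total_nulls = int(sample.isna().sum().sum())

# Columns that are not numeric and therefore need encoding or dropping.
non_numeric = sample.select_dtypes(exclude="number").columns.tolist()

print(total_nulls)
print(non_numeric)
```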

In [28]:
df.head()

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.7 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |


Next, we look at all the columns with an object data type to see whether, and how, each one should be converted.

In [29]:
object_columns = ['state','voice mail plan','international plan','phone number']
df.loc[:,object_columns]

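| | state | voice mail plan | international plan | phone number |
|---|---|---|---|---|
| 0 | KS | yes | no | 382-4657 |
| 1 | OH | yes | no | 371-7191 |
| 2 | NJ | no | no | 358-1921 |
| 3 | OH | no | yes | 375-9999 |
| 4 | OK | no | yes | 330-6626 |
| ... | ... | ... | ... | ... |
| 3328 | AZ | yes | no | 414-4276 |
| 3329 | WV | no | no | 370-3271 |
| 3330 | RI | no | no | 328-8230 |
| 3331 | CT | no | yes | 364-6381 |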


The international plan and voice mail plan columns hold yes/no values, so they will be converted to booleans. The state and phone number columns are not needed as model features and can be dropped. Before dropping state, however, it is worth looking at how the number of churned customers varies by state, to see whether there is a pattern and which state has the greatest number of churned customers.
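The per-state comparison mentioned above could be done with a `groupby`. The following is a minimal sketch that uses a small hypothetical frame (`sample`) in place of `df`; the numbers it produces are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows; on the real data, use df[["state", "churn"]] instead.
sample = pd.DataFrame({
    "state": ["KS", "OH", "OH", "NJ", "NJ", "NJ"],
    "churn": [False, True, False, True, True, False],
})

# Churned customers, total customers, and churn rate per state.
by_state = (
    sample.groupby("state")["churn"]
    .agg(churned="sum", customers="count", churn_rate="mean")
    .sort_values("churned", ascending=False)
)

print(by_state)
print("State with the most churned customers:", by_state["churned"].idxmax())
```

Raw counts favor states with more customers, so the `churn_rate` column is usually the fairer basis for comparing states.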

In [30]:
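# Vectorized conversion: 'yes' becomes True, every other value becomes False.
# This avoids chained assignment (df[col][index] = ...), which triggers
# SettingWithCopyWarning and does not update df under pandas copy-on-write.
df['voice mail plan'] = df['voice mail plan'] == 'yes'
df['voice mail plan'].head()


0     True
1     True
2    False
3    False
4    False
Name: voice mail plan, dtype: bool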

In [31]:
df['voice mail plan'] = df['voice mail plan'].astype(bool)
df['voice mail plan'].value_counts()

voice mail plan
False    2411
True      922
Name: count, dtype: int64

The international plan column is converted to boolean True and False values in the same way, as shown below.

In [32]:
# Vectorized conversion: 'yes' becomes True, every other value becomes False.
# The comparison already yields a bool column, so no astype(bool) is needed.
df['international plan'] = df['international plan'] == 'yes'
df['international plan'].value_counts()

international plan
False    3010
True      323
Name: count, dtype: int64
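Note that the conversion above treats every value other than 'yes' as False, including typos or differently cased entries. A stricter variant maps both labels explicitly so that anything unexpected is reported instead of being silently counted as False. This is a sketch on hypothetical values, and `to_bool` is a helper name introduced here for illustration.

```python
import pandas as pd


def to_bool(column: pd.Series) -> pd.Series:
    """Map 'yes'/'no' strings to booleans, failing on any other value."""
    mapped = column.map({"yes": True, "no": False})
    if mapped.isna().any():
        # Values missing from the mapping come back as NaN.
        unexpected = column[mapped.isna()].unique().tolist()
        raise ValueError(f"Unexpected values: {unexpected}")
    return mapped.astype(bool)


# Hypothetical values for illustration.
plans = pd.Series(["yes", "no", "no"], name="international plan")
converted = to_bool(plans)
print(converted.dtype)
```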