# Customer Churn Prediction

The project focuses on a dataset that provides a comprehensive view of customer behavior and churn in the telecom industry. It includes detailed information on customer demographics, service usage, and various indicators for analyzing customer retention and churn.

The purpose of this project is to gain insight into customer demographics and to determine which indicators affect customer churn. We will also attempt to build a model that predicts customer churn from these attributes.

The dataset was provided by a user on Kaggle (https://www.kaggle.com/datasets/abdullah0a/telecom-customer-churn-insights-for-analysis).

## Import Libraries

In [20]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
from matplotlib import pyplot as plt 
import seaborn as sns
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import precision_score
from sklearn.compose import ColumnTransformer

## Load Dataset

In [21]:
df = pd.read_csv('customer_churn_data.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerID       1000 non-null   int64  
 1   Age              1000 non-null   int64  
 2   Gender           1000 non-null   object 
 3   Tenure           1000 non-null   int64  
 4   MonthlyCharges   1000 non-null   float64
 5   ContractType     1000 non-null   object 
 6   InternetService  703 non-null    object 
 7   TotalCharges     1000 non-null   float64
 8   TechSupport      1000 non-null   object 
 9   Churn            1000 non-null   object 
dtypes: float64(2), int64(3), object(5)
memory usage: 78.2+ KB


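|   | CustomerID | Age | Gender | Tenure | MonthlyCharges | ContractType | InternetService | TotalCharges | TechSupport | Churn |
|---|------------|-----|--------|--------|----------------|----------------|-----------------|--------------|-------------|-------|
| 0 | 1 | 49 | Male | 4 | 88.35 | Month-to-Month | Fiber Optic | 353.4 | Yes | Yes |
| 1 | 2 | 43 | Male | 0 | 36.67 | Month-to-Month | Fiber Optic | 0.0 | Yes | Yes |
| 2 | 3 | 51 | Female | 2 | 63.79 | Month-to-Month | Fiber Optic | 127.58 | No | Yes |
| 3 | 4 | 60 | Female | 8 | 102.34 | One-Year | DSL | 818.72 | Yes | Yes |
| 4 | 5 | 42 | Male | 32 | 69.01 | Month-to-Month | NaN | 2208.32 | No | Yes |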


The dataset contains 1000 rows and 10 columns, with each row describing one customer. Only the 'InternetService' column has null values (297 of 1000 rows).

Here's a quick summary of the columns:

- **CustomerID**: Unique identifier for each customer.
- **Age**: Age of the customer, reflecting their demographic profile.
- **Gender**: Gender of the customer (Male or Female).
- **Tenure**: Duration (in months) the customer has been with the service provider.
- **MonthlyCharges**: The monthly fee charged to the customer.
- **ContractType**: Type of contract the customer is on (Month-to-Month, One-Year, Two-Year).
- **InternetService**: Type of internet service subscribed to (DSL, Fiber Optic, None).
- **TechSupport**: Whether the customer has tech support (Yes or No).
- **TotalCharges**: Total amount charged to the customer (calculated as MonthlyCharges * Tenure).
- **Churn**: Target variable indicating whether the customer has churned (Yes or No).
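
The stated relationship between 'TotalCharges', 'MonthlyCharges' and 'Tenure' can be checked directly. The sketch below is a minimal illustration that rebuilds the first five rows shown above as a small DataFrame (the name `sample` is hypothetical); on the full dataset the same comparison could be run against `df`.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset as displayed above (subset of columns).
sample = pd.DataFrame({
    "Tenure": [4, 0, 2, 8, 32],
    "MonthlyCharges": [88.35, 36.67, 63.79, 102.34, 69.01],
    "TotalCharges": [353.4, 0.0, 127.58, 818.72, 2208.32],
})

# Recompute the total and compare with a small tolerance for rounding.
expected = sample["MonthlyCharges"] * sample["Tenure"]
matches = np.isclose(sample["TotalCharges"], expected, atol=0.01)
print(matches.all())
```

If the relationship holds for every row, 'TotalCharges' carries no information beyond the other two columns, which is worth keeping in mind when interpreting feature importances.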

## Data Cleaning and Preparation

First, we will inspect the rows with null values in the 'InternetService' column and then fill them.

In [22]:
df[df.isnull().any(axis=1)].head()

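|    | CustomerID | Age | Gender | Tenure | MonthlyCharges | ContractType | InternetService | TotalCharges | TechSupport | Churn |
|----|------------|-----|--------|--------|----------------|----------------|-----------------|--------------|-------------|-------|
| 4  | 5 | 42 | Male | 32 | 69.01 | Month-to-Month | NaN | 2208.32 | No | Yes |
| 6  | 7 | 60 | Male | 14 | 80.32 | One-Year | NaN | 1124.48 | No | Yes |
| 7  | 8 | 52 | Female | 6 | 58.9 | One-Year | NaN | 353.4 | No | Yes |
| 12 | 13 | 47 | Male | 2 | 63.26 | Two-Year | NaN | 126.52 | No | Yes |
| 13 | 14 | 25 | Female | 8 | 71.78 | One-Year | NaN | 574.24 | No | Yes |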


In [23]:
# Filling null values with the last valid observation (forward fill).
# Assigning the result back avoids calling inplace=True on a column accessor.
df['InternetService'] = df['InternetService'].ffill()
df.InternetService.value_counts()

InternetService
Fiber Optic    563
DSL            437
Name: count, dtype: int64
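
The column summary lists 'None' as a valid 'InternetService' value, so the missing entries may simply be customers without internet service whose label was read as missing when the CSV was loaded. This is an assumption, not something the dataset confirms. Under that assumption, an alternative to forward filling is to keep them as their own category. The sketch below uses a hypothetical miniature column (`demo`) and is not what the rest of this notebook does; applying it to `df` would change the counts and results that follow.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the column, with NaN where no service is recorded.
demo = pd.DataFrame({"InternetService": ["Fiber Optic", np.nan, "DSL", np.nan]})

# Keep "no internet service" as its own category instead of copying
# the value from the previous row.
demo["InternetService"] = demo["InternetService"].fillna("None")
counts = demo["InternetService"].value_counts()
print(counts)
```

Forward filling assigns each missing row the service type of whichever customer happens to precede it, so the two approaches can lead to different feature importances for this column.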

Next, we will convert the 'Gender', 'TechSupport' and 'Churn' columns to boolean. 'InternetService' holds the values 'Fiber Optic' and 'DSL' rather than 'Yes'/'No', so it stays a categorical (object) column.

In [24]:
# Converting Yes/No and Male/Female columns to bool
df['Gender'] = df['Gender'].map({'Male': True, 'Female': False})
df['TechSupport'] = df['TechSupport'].map({'Yes': True, 'No': False})
df['Churn'] = df['Churn'].map({'Yes': True, 'No': False})

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerID       1000 non-null   int64  
 1   Age              1000 non-null   int64  
 2   Gender           1000 non-null   bool   
 3   Tenure           1000 non-null   int64  
 4   MonthlyCharges   1000 non-null   float64
 5   ContractType     1000 non-null   object 
 6   InternetService  1000 non-null   object 
 7   TotalCharges     1000 non-null   float64
 8   TechSupport      1000 non-null   bool   
 9   Churn            1000 non-null   bool   
dtypes: bool(3), float64(2), int64(3), object(2)
memory usage: 57.7+ KB


## Feature Importances

We will now find the most important features to be used in our training and testing data. 

### Preprocessing 

In [26]:
# Creating feature and target variables
X = df.drop(columns = ['CustomerID','Churn'])
y = df.Churn

# Separating columns by their data type
cat_cols = X.select_dtypes(include = 'object').columns
num_cols = X.select_dtypes(include = 'number').columns  # all int and float columns, regardless of platform int width
bin_cols = X.select_dtypes(include='bool').columns

preprocessor_importances = ColumnTransformer(
    transformers = [
        ('cat', OneHotEncoder(sparse_output=False, drop='first'), cat_cols),  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
        ('num', StandardScaler(), num_cols),
        ('bin', 'passthrough', bin_cols)
    ]
)

In [27]:
# Apply the transformations to the full feature set (before the train/test split)
X_preprocessed = preprocessor_importances.fit_transform(X)
X_preprocessed = pd.DataFrame(X_preprocessed, columns=preprocessor_importances.get_feature_names_out())

# Split the data into train and test sets
x_train_processed, x_test_processed, y_train_processed, y_test_processed = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)
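
Because the preprocessor above is fitted on all rows before splitting, the scaler's mean and variance also reflect the test rows. For a feature-importance check this matters little, but the usual pattern is to split first and fit the transformer on the training rows only. The sketch below illustrates that order on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins for the notebook's `X` and `y`) and also uses `stratify` to keep the class ratio the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two numeric features and a binary target.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y_demo = rng.integers(0, 2, size=200)

# Split first, preserving the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

Wrapping the transformer and the model in a `Pipeline`, as done later in this notebook, gives the same guarantee automatically.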

In [28]:
# Fitting model for feature importance
clf = DecisionTreeClassifier(random_state = 0, criterion = 'gini')
clf.fit(x_train_processed, y_train_processed)

In [29]:
# Get feature importances
importances = clf.feature_importances_

# Create a DataFrame to view feature importances
feature_importances = pd.DataFrame({'feature': x_train_processed.columns, 'importance': importances}).sort_values(by='importance', ascending=False)

# Print all features, sorted from most to least important
print(feature_importances)

                            feature  importance
1        cat__ContractType_Two-Year    0.256712
4                       num__Tenure    0.224663
5               num__MonthlyCharges    0.198792
0        cat__ContractType_One-Year    0.191267
8                  bin__TechSupport    0.128565
2  cat__InternetService_Fiber Optic    0.000000
3                          num__Age    0.000000
6                 num__TotalCharges    0.000000
7                       bin__Gender    0.000000


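As we can see from above, 'ContractType' has the largest impact on the model: the Two-Year indicator alone accounts for about 26% of the importance, and together with the One-Year indicator the column accounts for about 45%. 'Tenure', 'MonthlyCharges' and 'TechSupport' make up the rest. 'InternetService', 'Age', 'TotalCharges' and 'Gender' have an importance of zero, meaning this tree never split on them, so we remove them from our training and testing data.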
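
Impurity-based importances from a single decision tree can be unstable, and a zero only means that this particular tree never used the feature. One way to cross-check the ranking is permutation importance on held-out data. The sketch below is a minimal illustration on synthetic data from `make_classification`; in this notebook the fitted `clf`, `x_test_processed` and `y_test_processed` would take the place of the hypothetical `tree`, `X_te` and `y_te`.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with six features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in test accuracy.
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.3f} +/- {std:.3f}")
```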

## Model Selection and Evaluation

In [30]:
# Separate column types
feature_columns = ['Tenure', 'ContractType', 'MonthlyCharges', 'TechSupport']
X = df[feature_columns]

cat_cols = X.select_dtypes(include = 'object').columns
num_cols = X.select_dtypes(include = 'number').columns  # all int and float columns, regardless of platform int width
bin_cols = X.select_dtypes(include='bool').columns

# Preprocessing for predictive model
preprocessing_model = ColumnTransformer(
    transformers = [
        ('cat', OneHotEncoder(sparse_output=False, drop='first'), cat_cols),  # `sparse` was renamed to `sparse_output` in scikit-learn 1.2
        ('num', StandardScaler(), num_cols),
        ('bin', 'passthrough', bin_cols)
    ]
)

# Splitting train and test data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [31]:
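# Fitting predictive model with a fresh, unfitted classifier
# (same settings as the tree used for feature importances)
pipeline = Pipeline([
    ('preprocessing', preprocessing_model),
    ('classifier', DecisionTreeClassifier(random_state = 0, criterion = 'gini'))
])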
pipeline.fit(x_train, y_train)

In [32]:
# Predicting y value from test set
y_pred = pipeline.predict(x_test)

# Pipeline accuracy on the train and test sets
train_score = pipeline.score(x_train, y_train)
test_score = pipeline.score(x_test, y_test)
print(f'Train Score: {train_score}')
print(f'Test Score: {test_score}')

Train Score: 1.0
Test Score: 1.0
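
Accuracy alone can hide how a classifier treats each class, and the predictions in `y_pred` are otherwise unused. The sketch below shows how precision, recall and a confusion matrix could be computed from true and predicted labels. The label lists are hypothetical toy values (1 = churned, 0 = stayed), not results from this notebook; here `y_test` and `y_pred` would be passed instead.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels: 1 = churned, 0 = stayed.
y_true_demo = [1, 1, 0, 1, 0, 1]
y_pred_demo = [1, 1, 0, 0, 1, 1]

# Precision: share of predicted churners who actually churned.
precision = precision_score(y_true_demo, y_pred_demo)
# Recall: share of actual churners the model caught.
recall = recall_score(y_true_demo, y_pred_demo)
# Rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(f"Precision: {precision}")  # 0.75
print(f"Recall: {recall}")        # 0.75
print(cm)                         # [[1 1], [1 3]]
```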


It seems that we managed to create a model that accurately predicts customer churn on this dataset. However, it is important to remember that the problem is relatively simple: a perfect score on both the train and test sets suggests that a few features separate the classes cleanly in this data, so the result says more about the dataset than about how the model would perform elsewhere.
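
A single train/test split gives only one estimate of performance. As a further check, the model could be scored with cross-validation and compared against a baseline that always predicts the most frequent class. The sketch below is a minimal illustration on synthetic, imbalanced data (`X_demo`, `y_demo` and the 20/80 class weights are assumptions for the example); in this notebook, `pipeline`, `X` and `y` would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with an imbalanced binary target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.2, 0.8], random_state=0
)

model = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
# Baseline that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")

# Five-fold cross-validated accuracy for both estimators.
model_scores = cross_val_score(model, X_demo, y_demo, cv=5)
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)

print(f"Model accuracy:    {model_scores.mean():.3f}")
print(f"Baseline accuracy: {baseline_scores.mean():.3f}")
```

If the baseline already scores highly, most customers belong to one class and accuracy on its own is a weak indicator of model quality.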