The goal of this project is to predict whether a customer with a set of characteristics will churn or not from our company (an Iranian telecom company). To accomplish this goal, the following steps will be taken:
*   Data Importing
*   Data Cleaning
*   Exploratory Data Analysis
*   Data Preprocessing
*   Training a ML Model
*   Evaluating Model Performance
*   Next Steps







# Data Importing

In [9]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [10]:
# importing dataset
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Customer Churn.csv')
print(df.head())

   Call  Failure  Complains  Subscription  Length  Charge  Amount  \
0              8          0                    38               0   
1              0          0                    39               0   
2             10          0                    37               0   
3             10          0                    38               0   
4              3          0                    38               0   

   Seconds of Use  Frequency of use  Frequency of SMS  \
0            4370                71                 5   
1             318                 5                 7   
2            2453                60               359   
3            4198                66                 1   
4            2393                58                 2   

   Distinct Called Numbers  Age Group  Tariff Plan  Status  Age  \
0                       17          3            1       1   30   
1                        4          2            1       2   25   
2                       24          3    

In [11]:
# understanding dataset structure
print(df.shape)
print(df.dtypes)

(3150, 14)
Call  Failure                int64
Complains                    int64
Subscription  Length         int64
Charge  Amount               int64
Seconds of Use               int64
Frequency of use             int64
Frequency of SMS             int64
Distinct Called Numbers      int64
Age Group                    int64
Tariff Plan                  int64
Status                       int64
Age                          int64
Customer Value             float64
Churn                        int64
dtype: object


By calling the shape and dtype attributes of the dataframe, we verify that our dataset has 3150 observations, each representing a client, and 14 columns, being 13 features and 1 label (column Churn)

# Data Cleaning

In [12]:
# verifying existence of missing values
df.isna().any()

Unnamed: 0,0
Call Failure,False
Complains,False
Subscription Length,False
Charge Amount,False
Seconds of Use,False
Frequency of use,False
Frequency of SMS,False
Distinct Called Numbers,False
Age Group,False
Tariff Plan,False


From the output above, we conclude that there are no missing values in the dataset.

To finish the data cleaning step, we'll perform two tasks:
*   drop 'Age Group' column, since we have a 'Age' column that will give us more granular information than the 'Age Group' one
*   rename columns so that their names become more meaningful




In [13]:
# dropping Age Group column
df = df.drop('Age Group', axis = 1)

In [14]:
# renaming columns
df.columns = ['num_call_failures', 'has_complaint', 'sub_length_months',
              'charge_tier', 'total_call_seconds', 'total_num_calls',
              'total_num_sms', 'distinct_call_num', 'type_plan',
              'status', 'age', 'customer_value', 'churn']

# Exploratory Data Analysis

In [15]:
# getting a summary of the data
df.describe()

Unnamed: 0,num_call_failures,has_complaint,sub_length_months,charge_tier,total_call_seconds,total_num_calls,total_num_sms,distinct_call_num,type_plan,status,age,customer_value,churn
count,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0,3150.0
mean,7.627937,0.076508,32.541905,0.942857,4472.459683,69.460635,73.174921,23.509841,1.077778,1.248254,30.998413,470.972916,0.157143
std,7.263886,0.265851,8.573482,1.521072,4197.908687,57.413308,112.23756,17.217337,0.267864,0.432069,8.831095,517.015433,0.363993
min,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,15.0,0.0,0.0
25%,1.0,0.0,30.0,0.0,1391.25,27.0,6.0,10.0,1.0,1.0,25.0,113.80125,0.0
50%,6.0,0.0,35.0,0.0,2990.0,54.0,21.0,21.0,1.0,1.0,30.0,228.48,0.0
75%,12.0,0.0,38.0,1.0,6478.25,95.0,87.0,34.0,1.0,1.0,30.0,788.38875,0.0
max,36.0,1.0,47.0,10.0,17090.0,255.0,522.0,97.0,2.0,2.0,55.0,2165.28,1.0


After calling the describe method of the dataframe, it's clear the necessity of rescaling the data. For example, column 'total_call_seconds' has a mean and median of 4,472 and 2,990, respectively. In contrast, all the other features are, at max, in the scale of 100s

# Data Preprocessing

In [16]:
# creating feature and label arrays
X = df.drop('churn', axis = 1).values
y = df['churn'].values

In [17]:
# splitting data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

In [18]:
# scaling the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

# Training Machine Learning Models

Since the problem at hand is to determine if a client, given a set of features, is going to churn or not, we are dealing with a classification problem. Therefore, we will use **classification** machine learning models, such as **logistic regression** (applicable to the problem, as we are trying to predict a binary label - churn or not churn) and **random forests**. To measure each model's performance, we will compute and compare the different **mean accuracy** outputs

In [20]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(random_state = 1)
logreg.fit(X_train_scaled, y_train)
logreg_score = logreg.score(X_test_scaled, y_test)
print(logreg_score)

0.8934010152284264


In [21]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(random_state = 1)
random_forest.fit(X_train_scaled, y_train)
rf_score = random_forest.score(X_test_scaled, y_test)
print(rf_score)

0.949238578680203


# Evaluating Model Performance

In order to guarantee that the mean accuracy computed is representative of the model's ability to generalize, we will use cross-validation with 5-folds

In [26]:
# unifying X data in one array to perform cross-validation
X_scaled = np.concatenate([X_train_scaled, X_test_scaled], axis = 0)

In [33]:
# Logistic Regression model's accuracy mean and standard deviation
from sklearn.model_selection import cross_val_score, KFold
logreg_kf = KFold(n_splits = 5, shuffle = True, random_state = 1)
cv_logreg = LogisticRegression()
logreg_accuracy = cross_val_score(cv_logreg, X_scaled, y, cv = logreg_kf)
print(np.mean(logreg_accuracy), np.std(logreg_accuracy))

0.8428571428571429 0.015323832871297996


In [35]:
# Random Forest model's accuracy mean and standard deviation
from sklearn.model_selection import cross_val_score, KFold
rf_kf = KFold(n_splits = 5, shuffle = True, random_state = 2)
cv_random_forest = RandomForestClassifier()
random_forest_accuracy = cross_val_score(cv_random_forest, X_scaled, y, cv = rf_kf)
print(np.mean(random_forest_accuracy), np.std(random_forest_accuracy))

0.8285714285714286 0.011313352178543162


After computing the mean and standard deviation of the accuracy scores arrays for each model cross-validated, we see that, actually, the logistic regression model performs slightly **better**, on average, than the random forest model, without adding too much variability (standard deviations are similar). This is a conclusion that we couldn't reach training the models just once, because, in the specific training and test process performed initially, we saw that logistic regression performed **worse** than random forest. By that, we understand the importance of **cross-validation**