Importing required libraries

In [None]:
import pandas as pd
import numpy as np

from sklearn.utils import resample

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

Importing dataset.

In [None]:
df = pd.read_csv('dataset_telecom_01.csv',delimiter =',') #data added in repo
df.head()

In [None]:
print('Dataset shape:', df.shape)
print('\n')
df.rename(columns={'service_06': 'Service_06'}, inplace=True)

#removing user id from dataset
df.drop('uid', axis = 1, inplace = True)

df[['Factor_00','is_churn']] = df[['Factor_00','is_churn']].astype('category')

df.info()

Checking for missing values. The "Charges" column has 11 missing values.

In [None]:
missing_data = df.isnull()

for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")

The missing values make up 0.15% of the column. Thus, they will be filled with the mean value of the "Charges" column.

In [None]:
avg_charges = df["Charges"].astype("float").mean(axis=0)
print("Avg of Charges: ", avg_charges, "\n")

# Assign the result back; an in-place replace on a selected column may not update df
df["Charges"] = df["Charges"].fillna(avg_charges)

missing_data = df.isnull()
print(missing_data["Charges"].value_counts())
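The missing-value share and the mean fill can be checked on a small example. This is a minimal sketch on made-up numbers; the `toy` frame and its values are assumptions for illustration, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the "Charges" column (values are made up for illustration).
toy = pd.DataFrame({"Charges": [10.0, 20.0, np.nan, 30.0]})

# Share of missing values, as a percentage of the column.
missing_pct = toy["Charges"].isna().mean() * 100

# Mean of the observed values (mean() skips NaN by default).
avg_charges = toy["Charges"].mean()

# Fill the gaps and assign the result back to the column.
toy["Charges"] = toy["Charges"].fillna(avg_charges)

print(f"Missing: {missing_pct:.2f}%, mean used for the fill: {avg_charges}")
```

Assigning the result of `fillna` back to the column avoids relying on in-place modification of a selected column.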

In [None]:
df.describe(include = "all")

#### The data is imbalanced with respect to the target variable. A dataset with an equal number of instances per class will be created.

In [None]:
df['is_churn'].value_counts()

Dealing with the "Class Imbalance" problem: creating a dataset with equally sized target classes.

In [None]:
data1 = df[df['is_churn']=='Yes']
print("Churned-data1:"+ str(data1.shape))
data2 = df[df['is_churn']=='No']
print("Non Churn-data2:"+ str(data2.shape))
print("")
print("As we see, 74% of the data set is non-churners. Therefore, predicting every customer as non-churn (0) would already achieve 74% accuracy.\n")

# Sample Non Churners class
data2 = resample(data2, 
                replace=True,     # sample with replacement
                n_samples=1869,    # to match Churners class
                random_state=237) # reproducible results 123

#Want same sized data on both classes
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([data1, data2[:1869]])
print("Final Dataset :"+ str(df.shape))
df['is_churn'].value_counts()
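The balancing step can be sketched on a small made-up frame. Note one deliberate difference from the cell above: this sketch downsamples the majority class without replacement (`replace=False`), so no row is duplicated. The `toy` frame and the column `x` are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 6 "No" rows and 2 "Yes" rows (made-up data).
toy = pd.DataFrame({
    "x": range(8),
    "is_churn": ["No"] * 6 + ["Yes"] * 2,
})

minority = toy[toy["is_churn"] == "Yes"]
majority = toy[toy["is_churn"] == "No"]

# Downsample the majority class to the minority size, without replacement.
majority_down = resample(majority,
                         replace=False,
                         n_samples=len(minority),
                         random_state=237)

# Stack both classes into one balanced frame.
balanced = pd.concat([minority, majority_down])
print(balanced["is_churn"].value_counts())
```

Using `len(minority)` instead of a hard-coded count keeps the two classes equal even if the input data changes.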

## Which of the Gender, Factor and Service features play an important role in whether a customer will churn?

Some of the insights from the code below:

- Gender does not play a distinctive role in churn.

- When Factor 1 is "No", a customer seems more likely to churn.
- When Factor 2 is "Yes", a customer seems more likely to churn.

- Service 1 doesn't tell much.
- Service 2: The categories don't tell much (0.33 - 0.33 - 0.40), but the "Yes" and "No" counts of Service 2 and Service 1 are the same. Service 2 has the additional "No phone service" info.
- Service 3: "Fiber Optic" customers are more likely to churn than "DSL" and "No" customers.
- Service 4: "No" customers are more likely to churn than "No internet service" and "Yes" customers.
- Service 5: "No" customers are more likely to churn than "No internet service" and "Yes" customers.
- Service 6: "No" customers are more likely to churn than "No internet service" and "Yes" customers.
- Service 7: "No" customers are more likely to churn than "No internet service" and "Yes" customers.
- Service 8: "No" and "Yes" customers are more likely to churn than "No internet service" customers.
- Service 9: "No" and "Yes" customers are more likely to churn than "No internet service" customers.

- C 1: "Month-to-month" customers seem more likely to churn than "One Year" and "Two Year" customers.
- C 2: "Yes" customers are more likely to churn than "No" customers.
- C 3: "Electronic Check" customers are more likely to churn than the other ("Bank Transfer (automatic)", "Credit Card (automatic)", "Mailed Check") customers.


In [None]:
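# Churn counts per category; numeric columns and the target itself are skipped
for column in df.select_dtypes(exclude='number').columns.drop('is_churn'):
    print(df.groupby(column)['is_churn'].value_counts())
    print("")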
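Raw counts per category can be hard to compare across categories of different sizes. One possible alternative, sketched here on made-up rows, is a row-normalised crosstab, which gives the churn rate per category directly. The `toy` data is an assumption for illustration; only the column names `C_01` and `is_churn` come from the notebook.

```python
import pandas as pd

# Made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "C_01": ["Month-to-month", "Month-to-month", "Month-to-month",
             "Two year", "Two year", "One year"],
    "is_churn": ["Yes", "Yes", "No", "No", "No", "Yes"],
})

# normalize="index" makes each row sum to 1, so the "Yes" column
# is the share of churners within each category.
churn_rate = pd.crosstab(toy["C_01"], toy["is_churn"], normalize="index")
print(churn_rate)
```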

Churners pay less than non-churners. The reason could be related to service quality; Factor_00 or Factor_03 may be indicators of churn.

In [None]:
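print(" --- Means"+" "+"---"*7)
# numeric_only=True restricts the statistic to numeric columns (required in pandas 2.x)
print(df.groupby(['is_churn']).mean(numeric_only=True))
print("\n --- Std Deviations"+" "+"---"*7)
print(df.groupby(['is_churn']).std(numeric_only=True))
print("\n --- Counts"+" "+"---"*7)
print(df.groupby(['is_churn']).count())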

## Visualizations

Legend: Churn
- Monthly Charges - Factor 3 comparison: Customers that churn accumulate at higher Monthly Charges.
- Charges - Factor 3 comparison: A positive correlation is observed for both churned (stronger) and non-churned customers. Charges have more effect than Factor 3. As Factor 3 increases, Charges increase, yet churned customers accumulate at increasing Charges.
- As Charges increase, Monthly Charges increase. Churned customers accumulate more at higher Monthly Charges than at higher Charges. This can be due to extra spending on top of the regular tariff.

Legend: Service 3
- Expensive to cheap: Fiber Optic - DSL - No. Fiber Optic is the most expensive option, which can be one reason why most churners in this category come from it.

Legend: Service 4-5-6-7-8-9
- These (Service 4-5-6-7) "No" customers are all over the place, yet they churn more than the others. This can be due to paying the same money but not getting the related additional services.
- These (Service 8-9) "Yes" customers churn more than other customers that use services. Unlike the other services, this can be due to service quality.

Legend: C01 - C02 - C03
- C01: "Month to Month" customers churn more, as they appear to accumulate more at higher Monthly Charges.
- C02: "Yes" customers churn more than "No" customers. This can be because they accumulate more at higher Monthly Charges.
- C03: The same partially applies to "Electronic Check" customers.

In [None]:
df = df.dropna()  # dropna() returns a new frame, so the result must be assigned
g = sns.pairplot(df,
                 x_vars=["MonthlyCharges", "Charges", "Factor_03"], 
                 y_vars=["Factor_03", "MonthlyCharges", "Charges"], 
                 hue = 'is_churn', 
                 markers=["X", "s"], height = 4)
g.fig.suptitle("Data Correlations", y = 1.05)
plt.show()


Checking correlations between numerical columns.

In [None]:
# Calculating the correlation matrix
# NOTE: df1 is the mean-encoded copy created in the "Mean Encoding" section below; run that cell first
corr = df1.corr()
#print(corr)
# Generating a heatmap
fig, ax = plt.subplots(figsize=(20,20))         # Sample figsize in inches

sns.heatmap(corr,xticklabels=corr.columns, yticklabels=corr.columns,
            annot=True, linewidths=.5, ax=ax)
plt.show()

In [None]:
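# NOTE: df1 is the mean-encoded copy created in the "Mean Encoding" section below; run that cell first
sns.pairplot(df1,
             x_vars = ["C_01","C_02","C_03"],
             y_vars = ["C_01","C_02","C_03"],
             height = 5,
             hue = 'is_churn')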
plt.show() #,"Service_06","Service_07","Service_08","Service_09"

Below:
- The 1st graph shows that Factor 3 can be used in predicting churn, as churned customers have lower factor values. Although gender does not play an important role, variables like Service 2-3 do.
- The 2nd graph shows that Monthly Charges can also be distinctive for churn to some extent.
- The 3rd graph shows that Charges are also helpful to some extent in understanding churn.

In [None]:
sns.boxplot(x = 'is_churn', y = 'Factor_03', data = df, hue = 'Service_08').set_title('1st') #sym = "", hue = 'gend' 
plt.show()

In [None]:
sns.boxplot(x = 'is_churn', y = 'MonthlyCharges', data = df, hue = 'Factor_00').set_title('2nd')
plt.show()

In [None]:
sns.boxplot(x = 'is_churn', y = 'Charges', data = df, hue = 'Service_09').set_title('3rd')
plt.show()

## Distributions of the numeric variables

Looking for normal distributions.

In [None]:
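# distplot is deprecated in seaborn; histplot with a KDE overlay is the replacement
sns.histplot(df['MonthlyCharges'], kde=True, stat='density')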
plt.show()

In [None]:
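# distplot is deprecated in seaborn; histplot with a KDE overlay is the replacement
sns.histplot(df['Charges'], kde=True, stat='density')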
plt.show()

In [None]:
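# distplot is deprecated in seaborn; histplot with a KDE overlay is the replacement
sns.histplot(df['Factor_03'], kde=True, stat='density')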
plt.show()

### Feature Selection and Feature Engineering / Encoding Binary Features & One Hot Encoding
- Feature engineering could not be performed, since no domain knowledge is provided and the features are anonymized.
- Features like Gender can be discarded from future models.
- Mean encoding is used as the encoding method. One hot encoding could also be used.

In [None]:
#Preprocessing: yes, no mapping for target column
df[['Factor_00','is_churn']] = df[['Factor_00','is_churn']].astype('object')

# Map the target labels to 1/0 and assign the result back to the column
df['is_churn'] = df['is_churn'].map({'Yes': 1, 'No': 0})

df['is_churn'] = df['is_churn'].astype('int')

df.head()

#### Mean Encoding for the Feature Variables

General formula, using the label Male of the Gender feature as an example:
Encoding for Male = 
[Number of true (1) targets under the label Male / Total number of targets under the label Male]

In [None]:
#Creating new df object
df1 = df.copy(deep=True)

df1.head()

In [None]:
#Replacing feature classes with Target Means

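for column in df1:
    # Encode only non-numeric features (object or categorical dtype).
    # The original test `dtype == (object or category)` only ever compared against `object`.
    if not pd.api.types.is_numeric_dtype(df1[column]):
        means = df1.groupby(column)['is_churn'].mean()
        df1[column] = df1[column].map(means).astype(float)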
    
df1.head()
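The mean-encoding formula can be worked through on a handful of rows. This is a minimal sketch on made-up data; the column name `gender` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: three "Male" customers (one churned), two "Female" (both churned).
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female"],
    "is_churn": [1, 0, 0, 1, 1],
})

# Mean of the 0/1 target per label = churned count / total count for that label.
means = toy.groupby("gender")["is_churn"].mean()

# Replace each label with its target mean.
toy["gender_encoded"] = toy["gender"].map(means)
print(toy)
```

Here "Male" is encoded as 1/3 and "Female" as 1.0. One caveat worth keeping in mind: when the means are computed on the full dataset before the train/test split, the encoded features carry information about the test labels, so computing them on the training portion only is generally the safer choice.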

Mean encoding is applied to the categorical features. For the numeric variables, normalization will be applied with Min-Max scaling.

In [None]:
df1['Factor_03'] = (df1['Factor_03'] - df1['Factor_03'].min())/(df1['Factor_03'].max()-df1['Factor_03'].min())
df1['MonthlyCharges'] = (df1['MonthlyCharges'] - df1['MonthlyCharges'].min())/(df1['MonthlyCharges'].max()-df1['MonthlyCharges'].min())
df1['Charges'] = (df1['Charges'] - df1['Charges'].min())/(df1['Charges'].max()-df1['Charges'].min())

df1.head()
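The three scaling lines above repeat the same formula, (x - min) / (max - min). A possible way to avoid the repetition is a small helper, sketched here on made-up values; `min_max_scale` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

def min_max_scale(series):
    """Rescale a numeric Series to the [0, 1] range."""
    return (series - series.min()) / (series.max() - series.min())

# Made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Factor_03": [1, 13, 25],
    "MonthlyCharges": [20.0, 70.0, 120.0],
})

# Apply the helper to every column at once.
scaled = toy.apply(min_max_scale)
print(scaled)
```

A constant column would make the denominator zero, so such columns would need separate handling.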

Now we are ready for model development and classification!

## Classification - Model Development
- ### SVM

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

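# Features: every column except the target (keeping 'is_churn' in X would leak the label)
X = df1.drop(columns='is_churn')
y = df1['is_churn']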

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3) 

#Train Model
svmodel = SVC(kernel='poly', degree = 5, gamma = 'auto')
svmodel.fit(X_train, y_train)

#Make prediction
y_pred = svmodel.predict(X_test)

#Model evaluation (Conf. mat, precision and F1 scores)
mat = confusion_matrix(y_test, y_pred)

#heatmap visualization
sns.heatmap(mat.T,square=True,annot=True,fmt ='d',cbar=False,
           xticklabels=True,yticklabels=True)
plt.xlabel('true label')
plt.ylabel('predicted label')

print(classification_report(y_test,y_pred),"\n")

### Testing model accuracy with Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(svmodel, X, y, cv=10)
np.average(scores)
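A cross-validation run can also report the spread of the fold scores, not only their average. The sketch below uses synthetic data from `make_classification` and an RBF kernel as stand-ins, both assumptions for illustration. It shuffles the folds explicitly, which may matter when rows are ordered by class, as after concatenating the two class subsets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the encoded feature matrix and the target.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Stratified folds keep the class ratio in every fold; shuffling removes any row ordering.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf", gamma="auto"), X_toy, y_toy, cv=cv)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```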

- #### ROC Curve

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
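# ROC AUC, computed from continuous decision scores rather than hard 0/1 predictions
y_score = svmodel.decision_function(X_test)
auc = roc_auc_score(y_test, y_score)
print('ROC AUC: %f' % auc)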

In [None]:
def plot_roc_curve(fpr, tpr):  
    plt.plot(fpr, tpr, color='orange', label='ROC')
    plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend()
    plt.show()

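# Use continuous decision scores so the curve has more than one operating point
fpr, tpr, thresholds = roc_curve(y_test, svmodel.decision_function(X_test))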
plot_roc_curve(fpr, tpr)  
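A ROC curve built from hard 0/1 predictions has at most one interior point, while continuous scores trace the full trade-off between true and false positive rates. The sketch below compares the two on synthetic data; the dataset, the RBF kernel and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

clf = SVC(kernel="rbf", gamma="auto").fit(X_tr, y_tr)

# Hard labels: the curve collapses to (0,0), one operating point, and (1,1).
fpr_hard, tpr_hard, _ = roc_curve(y_te, clf.predict(X_te))

# Continuous scores: signed distance to the decision boundary.
scores = clf.decision_function(X_te)
fpr_soft, tpr_soft, _ = roc_curve(y_te, scores)
auc_soft = roc_auc_score(y_te, scores)

print(len(fpr_hard), "points from labels,", len(fpr_soft), "points from scores")
print(f"ROC AUC from scores: {auc_soft:.3f}")
```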

### Hyperparameter Tuning
- Different values of the penalty parameter 'C' are tried for the SVM model:
    - [0.001, 0.01, 0.1, 1, 10, 100, 1000]

In [None]:
C_list = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
auc_list = []  # one test-set AUC per value of C

for c in C_list:
    #Train Model
    svmodel = SVC(kernel='poly', degree = 5, gamma = 'auto', C = c)
    svmodel.fit(X_train, y_train)
    #Score the test set and record the AUC for this C
    y_score = svmodel.decision_function(X_test)
    auc_list.append(roc_auc_score(y_test, y_score))

sns.barplot(y=auc_list, x=C_list, palette="Blues_d")
plt.ylabel('AUC Score')
plt.xlabel('C Parameter')
plt.title('C Parameter AUC Scores')
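Selecting C by its score on the test set tends to make the reported test score optimistic. One common alternative is a cross-validated grid search, sketched below on synthetic data. The dataset and the RBF kernel are assumptions for illustration; the C grid is the one used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Same C grid as in the loop above.
param_grid = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# Each C is scored by 5-fold cross-validated ROC AUC on the training data only.
search = GridSearchCV(SVC(kernel="rbf", gamma="auto"),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

With this setup the held-out test set is only touched once, for the final evaluation of the chosen model.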


- Confusion matrix and Precision-Recall scores of C = 1000 model

In [None]:
#Train Model
svmodel = SVC(kernel='poly', degree = 5, gamma = 'auto', C = 1000)
svmodel.fit(X_train, y_train)

#Make prediction
y_pred = svmodel.predict(X_test)

#Model evaluation (Conf. mat, precision and F1 scores)
mat = confusion_matrix(y_test, y_pred)

#heatmap visualization
sns.heatmap(mat.T,square=True,annot=True,fmt ='d',cbar=False,
           xticklabels=True,yticklabels=True)
plt.xlabel('true label')
plt.ylabel('predicted label')

print(classification_report(y_test,y_pred),"\n")
print("Class Counts:")
print(y_test.value_counts())

- ### Decision Tree

Creating model

In [None]:
from sklearn import tree
model = tree.DecisionTreeClassifier()

#Fitting model
model.fit(X_train, y_train)

#Predicting Class
y_predict = model.predict(X_test)

#Model evaluation (Conf. mat, precision and F1 scores)
mat1 = confusion_matrix(y_test, y_predict)

#heatmap visualization
sns.heatmap(mat1.T,square=True,annot=True,fmt ='d',cbar=False,
           xticklabels=True,yticklabels=True)
plt.xlabel('true label')
plt.ylabel('predicted label')

print(classification_report(y_test,y_predict),"\n")
print("Class Counts:")
print(y_test.value_counts())

In [None]:
# ROC AUC
auc = roc_auc_score(y_test, y_predict)
print('ROC AUC: %f' % auc)

fpr, tpr, thresholds = roc_curve(y_test, y_predict)
plot_roc_curve(fpr, tpr)  
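For the decision tree, class probabilities from `predict_proba` can serve as continuous scores for the ROC AUC. The sketch below uses synthetic data and a depth limit of 4; both are assumptions for illustration, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy)

# A depth limit keeps the leaves impure, so probabilities are not only 0 or 1.
tree_clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Column 1 holds the predicted probability of the positive class (label 1).
proba = tree_clf.predict_proba(X_te)[:, 1]
auc_tree = roc_auc_score(y_te, proba)
print(f"ROC AUC: {auc_tree:.3f}")
```

A fully grown tree usually ends in pure leaves, in which case its probabilities are all 0 or 1 and carry no more information than the hard predictions.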