#### Question: Are there specific services or products that are more commonly associated with each churn reason category? And how well can we predict the churn reason category from the services and products a customer has?

##### Expectations:
Analyzing the specific services or products that are commonly associated with each churn reason category can provide valuable insights for the business. For example, if customers are churning due to issues with internet speed, the company may need to invest in improving its network infrastructure. If customers are churning due to high prices, the company may need to consider adjusting its pricing strategy or offering more affordable packages. By understanding which services or products drive churn, the company can make targeted improvements to reduce churn rates and improve customer satisfaction. This information can also inform the development of new products or services that better meet customers' needs and preferences.

#### EDA:

In [None]:
# Importing the libraries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
from collections import Counter

In [None]:
# Load the data from /Dataset/Telco_customer_churn_services.xlsx
dataset1 = pd.read_excel('../Dataset/Telco_customer_churn_services.xlsx')

In [None]:
dataset1.columns

In [None]:
my_columns = ['Customer ID', 'Phone Service', 'Internet Service', 'Multiple Lines',
              'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Unlimited Data']

dataset1 = dataset1[my_columns]


In [None]:
# we need the churn category from a different file
dataset2 = pd.read_excel('../Dataset/Telco_customer_churn_status.xlsx')

In [None]:
dataset2.columns

In [None]:
my_columns = ['Customer ID', 'Churn Category','Churn Label']

dataset2 = dataset2[my_columns]

In [None]:
# merge the two datasets
dataset = pd.merge(dataset1, dataset2, on='Customer ID')
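
Before relying on the merged table, it can help to confirm that `Customer ID` is unique in both files and that no customer is lost in the join. The sketch below uses two small made-up tables (`services_demo` and `status_demo` are hypothetical stand-ins, not the real files) together with the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the services and status tables (hypothetical values)
services_demo = pd.DataFrame({
    'Customer ID': ['A1', 'B2', 'C3'],
    'Phone Service': ['Yes', 'No', 'Yes'],
})
status_demo = pd.DataFrame({
    'Customer ID': ['A1', 'B2', 'D4'],
    'Churn Label': ['Yes', 'No', 'No'],
})

# validate raises a MergeError if a Customer ID is duplicated on either side;
# indicator adds a _merge column showing which table each row came from
merged_demo = pd.merge(services_demo, status_demo, on='Customer ID',
                       how='outer', validate='one_to_one', indicator=True)

# Rows present in only one table would be silently dropped by an inner join
unmatched = merged_demo[merged_demo['_merge'] != 'both']
print(unmatched[['Customer ID', '_merge']])
```

If `unmatched` is empty on the real data, the default inner join used above loses no customers.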

In [None]:
dataset.columns

In [None]:
# Check for missing values
dataset.isnull().sum()

In [None]:
Customer_ID = dataset['Customer ID']

Churn Category has missing values, but it is not yet clear whether that is a data problem or simply because those customers did not churn. We check that next.

In [None]:
# make sure that when churn category is missing the churn label is false
dataset[dataset['Churn Category'].isnull()]['Churn Label'].value_counts()

The missing values in Churn Category do indeed belong to customers who did not churn. Since we are only interested in customers who have churned, we drop all the others.
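
The same check can be written as a cross-tabulation plus a single boolean test. This is a minimal sketch on made-up rows; it assumes `Churn Label` holds the strings `'Yes'` and `'No'`, which should be confirmed against the real file.

```python
import pandas as pd

# Hypothetical rows; assumes Churn Label is coded as 'Yes' / 'No'
demo = pd.DataFrame({
    'Churn Label': ['Yes', 'No', 'Yes', 'No', 'No'],
    'Churn Category': ['Price', None, 'Competitor', None, None],
})

# Cross-tabulate "category is missing" against the churn label
missing_by_label = pd.crosstab(demo['Churn Category'].isnull(), demo['Churn Label'])
print(missing_by_label)

# True only if every missing category belongs to a non-churned customer
# and every churned customer has a category
consistent = (demo['Churn Category'].isnull() == (demo['Churn Label'] == 'No')).all()
print(consistent)
```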

In [None]:
# drop customers who have not churned
dataset = dataset[dataset['Churn Category'].notnull()]

In [None]:
dataset.isnull().sum()

In [None]:
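# drop the customer ID and churn label columns (neither is a feature)
# reassign instead of dropping in place, because dataset is a filtered slice
if 'Customer ID' in dataset.columns:
    dataset = dataset.drop(columns=['Customer ID', 'Churn Label'])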

In [None]:
dataset.dtypes

In [None]:
# clone the dataset
datasetDummies = dataset.copy()
# turn the categorical variables into dummy variables except for the churn category 
dataset = pd.get_dummies(dataset.drop(
    'Churn Category', axis=1), drop_first=True)

# add the churn category to the dataset
dataset['Churn Category'] = datasetDummies['Churn Category']

# check the data types of the columns
dataset.dtypes
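
To see why the encoded columns end in `_Yes`, here is a small sketch of `pd.get_dummies` with `drop_first=True` on two made-up Yes/No columns. The values are hypothetical; only the column names are taken from the dataset.

```python
import pandas as pd

demo = pd.DataFrame({
    'Phone Service': ['Yes', 'No', 'Yes'],
    'Unlimited Data': ['No', 'No', 'Yes'],
})

# With drop_first=True the first level in sorted order ("No") is dropped,
# leaving a single indicator column per Yes/No feature
encoded = pd.get_dummies(demo, drop_first=True)
print(list(encoded.columns))
```

Each remaining column is 1 (or True) when the customer has the service, so no information is lost by dropping the `_No` column.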


In [None]:
dataset.columns

In [None]:
# rename the columns with _Yes to remove the _Yes
if 'Phone Service_Yes' in dataset.columns:
    dataset.rename(columns={'Phone Service_Yes': 'Phone Service',
                            'Internet Service_Yes': 'Internet Service',
                            'Multiple Lines_Yes': 'Multiple Lines',
                            'Online Security_Yes': 'Online Security',
                            'Online Backup_Yes': 'Online Backup',
                            'Device Protection Plan_Yes': 'Device Protection Plan',
                            'Premium Tech Support_Yes': 'Premium Tech Support',
                            'Unlimited Data_Yes': 'Unlimited Data'}, inplace=True)

In [None]:
dataset.head()

In [None]:
# get unique values for churn category
dataset['Churn Category'].unique()

##### We have five categories of churn reasons:
1. Competitor offers
2. Price
3. Dissatisfaction with service
4. Attitude of support person
5. Other

In [None]:
# visualize distribution of churn category (as a percentage of churned customers)
# name the index explicitly so the column name does not depend on the pandas version
counts = dataset['Churn Category'].value_counts(
    normalize=True).rename('Percentage').mul(100).rename_axis('Churn Category').reset_index()
plt.figure(figsize=(8, 6))
sns.barplot(x='Churn Category', y='Percentage', data=counts, width=0.5)
plt.title('Distribution of Churn Category')
plt.ylabel('Percentage', fontsize=12)
plt.xlabel('Churn Category', fontsize=12)
plt.show()

##### Insights:
1. The most common reason for churn is competitor offers, followed by attitude of support person, dissatisfaction with service, and price.

In [None]:
# visualize the distribution of the different services and the churn categories all in one plot
# we are only interested in the customers who have the service hence only consider the 1 values

service_columns = ['Phone Service', 'Internet Service', 'Multiple Lines', 'Online Security',
                   'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Unlimited Data']

plt.figure(figsize=(20, 10))
for position, service in enumerate(service_columns, start=1):
    # share of each churn category among the churned customers who have this service
    # (the index is named explicitly so the column name does not depend on the pandas version)
    counts = dataset[dataset[service] == 1]['Churn Category'].value_counts(
        normalize=True).rename('Percentage').mul(100).rename_axis('Churn Category').reset_index()
    plt.subplot(2, 4, position)
    sns.barplot(x='Churn Category', y='Percentage', data=counts, width=0.5)
    plt.title(service + ' Distribution')
    plt.ylabel('Percentage')
    plt.xlabel('Churn Category')

plt.tight_layout()
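
The bar charts show the shares visually. A chi-square test of independence is one way to ask whether a service and the churn category are associated at all. The sketch below runs `scipy.stats.chi2_contingency` on a made-up table (the rows in `demo` are hypothetical, and far too few for a meaningful result); on the real data the same two lines would be applied to each service column.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical churned customers: service flag plus churn category
demo = pd.DataFrame({
    'Online Security': [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    'Churn Category': ['Competitor', 'Competitor', 'Price', 'Attitude',
                       'Competitor', 'Price', 'Attitude', 'Competitor',
                       'Dissatisfaction', 'Price', 'Competitor', 'Attitude'],
})

# Rows: churn category, columns: service flag (0 = no, 1 = yes)
table = pd.crosstab(demo['Churn Category'], demo['Online Security'])

# Null hypothesis: churn category is independent of having the service
chi2, p_value, dof, expected = chi2_contingency(table)
print(table)
print('chi2 = %.3f, p = %.3f, dof = %d' % (chi2, p_value, dof))
```

A large p-value would be consistent with the service carrying little information about the churn reason.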


In [None]:
# Now reverse the view: for a given churn reason, count how many churned customers have each service
def plot_dist(churn_reason):
    # Filter the DataFrame
    reason = dataset[dataset['Churn Category']
                     == churn_reason].drop("Churn Category", axis=1)
    # Reshape the DataFrame to a long format
    df_long = reason.melt(var_name='Column', value_name='Value')

    # Filter the rows where the value is 1
    df_long_ones = df_long[df_long['Value'] == 1]

    # Create a count plot (one bar per service)
    plt.figure(figsize=(20, 10))
    sns.countplot(x='Column', data=df_long_ones)
    plt.title('Distribution of Services for Churn Reason: ' + churn_reason)
    plt.ylabel('Number of Occurrences')
    plt.xlabel('Service')

In [None]:
# plot the service counts for each churn category
plot_dist('Competitor')
plot_dist('Attitude')
plot_dist('Dissatisfaction')
plot_dist('Price')
plot_dist('Other')
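
Raw counts are hard to compare across churn categories of different sizes. One alternative, sketched here on made-up rows, is the share of customers in each category who have each service: the mean of a 0/1 column within a group is exactly that share. The frame `demo` is hypothetical and uses only two of the service columns.

```python
import pandas as pd

demo = pd.DataFrame({
    'Phone Service':   [1, 1, 0, 1, 1, 1],
    'Online Security': [0, 1, 0, 0, 1, 0],
    'Churn Category':  ['Competitor', 'Competitor', 'Price',
                        'Price', 'Attitude', 'Attitude'],
})

# Mean of a 0/1 column = fraction of customers in the group with that service
uptake = demo.groupby('Churn Category').mean().mul(100)
print(uptake)
```

Rows that look alike in this table indicate churn categories whose customers hold a similar mix of services.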

In [None]:
# turn the churn reason into dummy variables (make a new copy of the dataset)
datasetCorr = pd.get_dummies(dataset, columns=['Churn Category'])
datasetCorr.columns

In [None]:
# draw heatmap to visualize the Pearson correlation between the different features
plt.figure(figsize=(20, 10))
sns.heatmap(datasetCorr.corr(method='pearson'), annot=True, fmt='.2f')
plt.show()


In [None]:
# draw heatmap to visualize the Kendall correlation between the different features
plt.figure(figsize=(20, 10))
sns.heatmap(datasetCorr.corr(method='kendall'), annot=True, fmt='.2f')
plt.show()


In [None]:
# draw heatmap to visualize the Spearman correlation between the different features
plt.figure(figsize=(20, 10))
sns.heatmap(datasetCorr.corr(method='spearman'), annot=True, fmt='.2f')
plt.show()
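
A note on the three heatmaps: when every column is a 0/1 indicator, as in `datasetCorr` here, Pearson, Spearman and Kendall (tau-b) all reduce to the phi coefficient, so the three plots should show the same values. The sketch below checks this on random binary data; the column names are borrowed from the notebook, the values are synthetic.

```python
import numpy as np
import pandas as pd

# Two synthetic 0/1 columns (random values, names borrowed for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Online Backup': rng.integers(0, 2, size=200),
    'Churn Category_Price': rng.integers(0, 2, size=200),
})

pearson = demo.corr(method='pearson').iloc[0, 1]
spearman = demo.corr(method='spearman').iloc[0, 1]
kendall = demo.corr(method='kendall').iloc[0, 1]
print(pearson, spearman, kendall)
```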


In [None]:
# visualize the correlation between the features and the target variable
plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Attitude']].sort_values(
    by='Churn Category_Attitude', ascending=False), annot=True, fmt='.2f')
plt.show()

plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Competitor']].sort_values(
    by='Churn Category_Competitor', ascending=False), annot=True, fmt='.2f')
plt.show()

plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Dissatisfaction']].sort_values(
    by='Churn Category_Dissatisfaction', ascending=False), annot=True, fmt='.2f')
plt.show()


plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Other']].sort_values(
    by='Churn Category_Other', ascending=False), annot=True, fmt='.2f')
plt.show()

plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Price']].sort_values(
    by='Churn Category_Price', ascending=False), annot=True, fmt='.2f')
plt.show()


We need to encode the Churn Category column into numerical values so that we can use it in our model.

In [None]:
dataset['Churn Category'] = dataset['Churn Category'].map(
    {'Competitor': 0, 'Attitude': 1, 'Dissatisfaction': 2, 'Price': 3,'Other':4})
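
`Series.map` returns NaN for any label that is missing from the dictionary, so a typo or an unexpected category would silently become a missing target. A small guard, sketched here on a made-up series with one deliberately misspelled label:

```python
import pandas as pd

category_codes = {'Competitor': 0, 'Attitude': 1, 'Dissatisfaction': 2,
                  'Price': 3, 'Other': 4}

# Hypothetical labels; the last one is misspelled on purpose
demo = pd.Series(['Price', 'Competitor', 'Other', 'competitor'])

encoded = demo.map(category_codes)

# map() yields NaN for labels missing from the dictionary, so list them explicitly
unmapped = demo[encoded.isnull()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list on the real column means every churn category received a code.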

In [None]:
# split the dataset into training and testing sets
X = dataset.drop(['Churn Category'], axis=1)
y = dataset['Churn Category']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)
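
Because the churn categories are not equally frequent, a plain random split can leave a rare category under-represented in the test set. One option, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch uses a made-up three-class target (`y_demo` and `X_demo` are hypothetical).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 60 / 25 / 15 rows over three hypothetical classes
y_demo = pd.Series([0] * 60 + [1] * 25 + [2] * 15)
X_demo = pd.DataFrame({'feature': np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo)

# Class shares in the test set match the full data
print(y_te.value_counts(normalize=True).sort_index())
```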


#### Model Building:

##### Logistic Regression

In [None]:
# this function takes in the model, trains it and evaluates it on the test set
# it also uses SequentialFeatureSelector to select the best features for the model
# note: the subset size is picked by test-set score, so the "best" scores are optimistic
def trainModel(model, X_train, y_train, X_test, y_test):

    # candidate subset sizes: 1 .. n_features - 1
    # (SequentialFeatureSelector must select fewer features than are available)
    feature_range = range(1, len(X_train.columns))

    best_accuracy = 0.0
    best_features_accuracy = None
    best_f1 = 0.0
    best_features_f1 = None

    # loop through all the subset sizes
    for i in feature_range:
        sfs = SequentialFeatureSelector(
            model, n_features_to_select=i, direction='forward')

        # run forward feature selection on the training set
        sfs.fit(X_train, y_train)

        # transform the data sets so that only the selected features are retained
        X_train_sfs = sfs.transform(X_train)
        X_test_sfs = sfs.transform(X_test)

        # Print the selected features
        selected = X_train.columns[sfs.get_support()]
        print("Selected Features for %d Features: %s" % (i, selected))

        # fit the model on the selected features
        model.fit(X_train_sfs, y_train)

        # predict the response for the test set and score it once
        y_pred = model.predict(X_test_sfs)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        # save the best features based on the accuracy
        if best_features_accuracy is None or accuracy > best_accuracy:
            best_accuracy = accuracy
            best_features_accuracy = selected

        # save the best features based on the f1 score (tracked separately)
        if best_features_f1 is None or f1 > best_f1:
            best_f1 = f1
            best_features_f1 = selected

    # Print the selected features based on the accuracy
    print("Selected Features for Best Accuracy: %s" % (best_features_accuracy,))

    # Print best accuracy
    print("Best Accuracy: %f" % (best_accuracy))

    # Print the selected features based on the f1 score
    print("Selected Features for Best F1 Score: %s" % (best_features_f1,))

    # Print best f1 score
    print("Best F1 Score: %f" % (best_f1))
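
`trainModel` chooses the number of features by scoring each candidate on the test set, so the reported best score is selected on the same data it is measured on. A possible alternative, sketched below on synthetic data from `make_classification` (not the Telco data), is to tune the subset size by cross-validation on the training set and touch the test set only once. The pipeline step names `select` and `model` are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data; the real X and y would come from the notebook
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=3,
                                     n_redundant=0, n_classes=3, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7, stratify=y_demo)

# Feature selection and the classifier are fitted together inside each CV fold
pipeline = Pipeline([
    ('select', SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                         direction='forward', cv=3)),
    ('model', LogisticRegression(max_iter=1000)),
])

# The number of features is tuned by cross-validation on the training data only
search = GridSearchCV(pipeline,
                      param_grid={'select__n_features_to_select': [1, 2, 3, 4, 5]},
                      scoring='f1_weighted', cv=3)
search.fit(X_tr, y_tr)

print('Best number of features:', search.best_params_['select__n_features_to_select'])
print('Held-out weighted F1: %.3f' % search.score(X_te, y_te))
```

The held-out score printed at the end is then an estimate that was not used for any selection decision.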



In [None]:
# train the model using Logistic Regression
trainModel(LogisticRegression(), X_train, y_train, X_test, y_test)

In [None]:
# train the model using svm
trainModel(SVC(), X_train, y_train, X_test, y_test)

In [None]:
# train the model using Random Forest
trainModel(RandomForestClassifier(), X_train, y_train, X_test, y_test)

##### Results:
1. The models perform poorly, which is expected given the EDA: every feature had a very low correlation with the target variable. Further analysis and additional features are needed to improve the accuracy.
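
To put the accuracy figures in context, it helps to compare them with a classifier that always predicts the most frequent churn category. The sketch below uses `DummyClassifier` on made-up label counts (the arrays are hypothetical and do not reflect the real class sizes).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels standing in for the encoded churn categories
y_train_demo = np.array([0] * 45 + [1] * 17 + [2] * 17 + [3] * 11 + [4] * 10)
y_test_demo = np.array([0] * 9 + [1] * 4 + [2] * 3 + [3] * 2 + [4] * 2)
X_train_demo = np.zeros((len(y_train_demo), 1))  # features are ignored by the baseline
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predicts the majority class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_demo, y_train_demo)
baseline_accuracy = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print('Majority-class accuracy: %.2f' % baseline_accuracy)
```

A model that does not beat this baseline on the real split has learned nothing useful from the service features.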

In [None]:
# Load the first dataset from ../Dataset/Telco_customer_churn_services.xlsx
services = pd.read_excel('../Dataset/Telco_customer_churn_services.xlsx')

In [None]:
# Load the second dataset from ../Dataset/Telco_customer_churn.xlsx
compound = pd.read_excel('../Dataset/Telco_customer_churn.xlsx')

In [None]:
# Load the data from /Dataset/Telco_customer_churn_demographics.xlsx
demographics = pd.read_excel(
    '../Dataset/Telco_customer_churn_demographics.xlsx')

In [None]:
# rename the column to match the column name in the dataset
compound.rename(columns={'CustomerID': 'Customer ID'}, inplace=True)

In [None]:
# Join the two datasets on the column 'Customer ID'
dataset1 = pd.merge(demographics, compound, on='Customer ID')

In [None]:
# check the data types of the columns
dataset1.dtypes

In [None]:
my_columns = ['Gender_x', 'Age', 'Married',
              'Number of Dependents', 'Churn Value', 'Tenure Months', 'Churn Score', 'Monthly Charges', 'Customer ID']

dataset1 = dataset1[my_columns]

In [None]:
dataset = pd.merge(services, dataset1, on='Customer ID')

In [None]:
dataset.dtypes

In [None]:
my_columns += ['Phone Service', 'Internet Service', 'Multiple Lines',
               'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Unlimited Data', 'Total Revenue', 'Referred a Friend']

dataset = dataset[my_columns]

In [None]:
dataset.dtypes

In [None]:
# rename Gender_x to gender
dataset.rename(columns={'Gender_x' : 'Gender'},inplace=True)

In [None]:
dataset.dtypes

In [None]:
status = pd.read_excel('../Dataset/Telco_customer_churn_status.xlsx')

In [None]:
status.dtypes

In [None]:
columns = ['Customer ID', 'Satisfaction Score', 'CLTV']
status1 = status[columns]

In [None]:
dataset = pd.merge(dataset, status1, on='Customer ID')

In [None]:
dataset.dtypes

In [None]:
Customer_ID = dataset['Customer ID']
if 'Customer ID' in dataset.columns:
    dataset.drop('Customer ID', axis=1, inplace=True)

In [None]:
dataset = pd.get_dummies(dataset,drop_first=True)

In [None]:
dataset.head()

In [None]:
if 'Married_Yes' in dataset.columns:
    dataset.rename(columns={'Married_Yes':'Married','Gender_Male': 'Gender','Phone Service_Yes':'Phone Service', 'Internet Service_Yes':'Internet Service', 'Multiple Lines_Yes':'Multiple Lines',
                        'Online Security_Yes':'Online Security', 'Online Backup_Yes':'Online Backup','Device Protection Plan_Yes':'Device Protection Plan',
                        'Premium Tech Support_Yes':'Premium Tech Support','Unlimited Data_Yes':'Unlimited Data','Referred a Friend_Yes':'Referred a Friend'}, inplace=True)

In [None]:
dataset.head()

In [None]:
columns = ['Customer ID', 'Churn Category']

status2 = status[columns]

In [None]:
# merge the two datasets
dataset['Customer ID'] = Customer_ID
dataset = pd.merge(dataset, status2, on='Customer ID')

In [None]:
if 'Customer ID' in dataset.columns:
    dataset.drop('Customer ID', axis=1, inplace=True)

In [None]:
datasetCorr = pd.get_dummies(dataset)

In [None]:
datasetCorr.head()

In [None]:
# visualize the distribution of all the features in the dataset
datasetCorr.hist(figsize=(20, 20))
plt.show()

In [None]:
#visualize the correlation between the features
plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr(), annot=True, cmap='coolwarm')
plt.show()


In [None]:
# visualize the correlation between the features and the target variable
plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Attitude']].sort_values(
    by='Churn Category_Attitude', ascending=False), annot=True, fmt='.2f')
plt.show()

plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Competitor']].sort_values(
    by='Churn Category_Competitor', ascending=False), annot=True, fmt='.2f')
plt.show()

plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Dissatisfaction']].sort_values(
    by='Churn Category_Dissatisfaction', ascending=False), annot=True, fmt='.2f')
plt.show()


plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Other']].sort_values(
    by='Churn Category_Other', ascending=False), annot=True, fmt='.2f')
plt.show()

plt.figure(figsize=(20, 20))
sns.heatmap(datasetCorr.corr()[['Churn Category_Price']].sort_values(
    by='Churn Category_Price', ascending=False), annot=True, fmt='.2f')
plt.show()

In [None]:
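# keep only churned customers: Churn Category is missing for customers who did not churn
dataset = dataset[dataset['Churn Category'].notnull()]

# split the dataset into training and testing sets
X = dataset.drop(['Churn Category'], axis=1)
y = dataset['Churn Category']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)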


In [None]:
# train the model using Logistic Regression
trainModel(LogisticRegression(), X_train, y_train, X_test, y_test)