#### Question: Can we predict whether a customer will refer a friend?

#### Expectations:

Predicting whether a customer will recommend the company is closely related to the Net Promoter Score (NPS), which can be extremely useful for a telecommunications company. NPS is a widely used metric of customer loyalty and satisfaction that focuses on how likely customers are to recommend the company to others.

There are several reasons why predicting whether a customer will recommend the company can be valuable:

Improving customer retention: Customers who are more likely to recommend the company are also more likely to stay with it. Conversely, customers predicted as unlikely to recommend it may be at risk of leaving, so the prediction lets the company take proactive measures to retain them.

Identifying areas for improvement: Customers who are less likely to recommend the company may have specific pain points or areas of dissatisfaction. Answering this question can help the company identify these areas and prioritize efforts to address them.

Driving growth: Customers who are more likely to recommend the company can help drive new customer acquisition and revenue growth through positive word-of-mouth referrals. Answering this question can help the company identify customers who are likely to be brand advocates and target them for additional promotions or incentives.

In terms of expectations, predicting referrals (used here as a proxy for NPS) can help the company gain a deeper understanding of its customers and their preferences, and can support improvements in customer retention, satisfaction, and revenue growth. The insights gained from answering this question can inform targeted marketing and customer retention strategies, ultimately leading to increased customer loyalty and business success.
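
For reference, NPS is conventionally computed from 0-10 "how likely are you to recommend us" survey answers as the percentage of promoters (scores 9-10) minus the percentage of detractors (scores 0-6). The datasets used in this notebook are not assumed to contain such a survey column, so the sketch below uses made-up scores purely to illustrate the formula; `net_promoter_score` is a hypothetical helper, not part of the analysis that follows.

```python
def net_promoter_score(scores):
    """Return the NPS (between -100 and 100) for 0-10 survey scores."""
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    promoters = sum(1 for s in scores if s >= 9)   # scores 9 and 10
    detractors = sum(1 for s in scores if s <= 6)  # scores 0 to 6
    return 100.0 * (promoters - detractors) / len(scores)


# Illustrative scores only: 4 promoters, 3 detractors, 3 passives
example_scores = [10, 9, 8, 7, 6, 3, 10, 9, 5, 8]
nps = net_promoter_score(example_scores)
print(nps)
```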

#### EDA:

In [None]:
# Importing the libraries
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
from collections import Counter


In [None]:
# Load the first dataset from ../Dataset/Telco_customer_churn_services.xlsx
services = pd.read_excel('../Dataset/Telco_customer_churn_services.xlsx')

In [None]:
# Load the second dataset from ../Dataset/Telco_customer_churn.xlsx
compound = pd.read_excel('../Dataset/Telco_customer_churn.xlsx')

In [None]:
# Load the third dataset from ../Dataset/Telco_customer_churn_demographics.xlsx
demographics = pd.read_excel('../Dataset/Telco_customer_churn_demographics.xlsx')

In [None]:
# rename 'CustomerID' to 'Customer ID' so the key column matches the other datasets
compound.rename(columns={'CustomerID':'Customer ID'}, inplace=True)

In [None]:
# Join the demographics and churn datasets on the column 'Customer ID'
dataset1 = pd.merge(demographics, compound, on='Customer ID')
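
Because an inner merge silently drops customers that appear in only one table, it can be worth checking the join before relying on it. The sketch below is one possible check on made-up toy frames (the column values are invented); it uses the `validate` and `indicator` options of `pd.merge` to flag duplicate keys and unmatched rows.

```python
import pandas as pd

# Toy stand-ins for the demographics and churn tables (made-up values)
left = pd.DataFrame({'Customer ID': ['A1', 'B2', 'C3'], 'Age': [34, 51, 27]})
right = pd.DataFrame({'Customer ID': ['A1', 'B2', 'D4'], 'Churn Value': [0, 1, 0]})

# validate raises MergeError if either side has duplicate keys;
# indicator adds a _merge column showing which table each row came from
checked = pd.merge(left, right, on='Customer ID', how='outer',
                   validate='one_to_one', indicator=True)

# Rows that would be lost by an inner join
unmatched = checked[checked['_merge'] != 'both']
print(unmatched[['Customer ID', '_merge']])
```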

In [None]:
# check the data types of the columns
dataset1.dtypes

In [None]:
my_columns = ['Gender_x', 'Age', 'Married',
              'Number of Dependents', 'Churn Value', 'Tenure Months', 'Churn Score', 'Monthly Charges', 'Customer ID']

dataset1 = dataset1[my_columns]

In [None]:
dataset = pd.merge(services, dataset1, on='Customer ID')

In [None]:
dataset.dtypes

In [None]:
my_columns += ['Phone Service', 'Internet Service', 'Multiple Lines',
               'Online Security', 'Online Backup', 'Device Protection Plan', 'Premium Tech Support', 'Unlimited Data', 'Total Revenue', 'Referred a Friend']

dataset = dataset[my_columns]

In [None]:
dataset.dtypes

In [None]:
# rename Gender_x to Gender
dataset.rename(columns={'Gender_x' : 'Gender'},inplace=True)

In [None]:
if 'Customer ID' in dataset.columns:
    dataset.drop('Customer ID', axis=1, inplace=True)

In [None]:
dataset = pd.get_dummies(dataset,drop_first=True)
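
As a small illustration of what `drop_first=True` does, the toy frame below (made-up values) shows that each two-level text column becomes a single indicator column named after the level that is kept, which is why names such as `Married_Yes` and `Gender_Male` appear afterwards.

```python
import pandas as pd

# Made-up rows with two Yes/No style columns and one numeric column
toy = pd.DataFrame({
    'Married': ['Yes', 'No', 'Yes'],
    'Gender': ['Male', 'Female', 'Female'],
    'Age': [30, 45, 52],
})

# The first level of each text column (alphabetically) is dropped,
# leaving one indicator column per binary feature
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```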

In [None]:
dataset.head()

In [None]:
if 'Married_Yes' in dataset.columns:
    dataset.rename(columns={'Married_Yes':'Married','Gender_Male': 'Gender','Phone Service_Yes':'Phone Service', 'Internet Service_Yes':'Internet Service', 'Multiple Lines_Yes':'Multiple Lines',
                        'Online Security_Yes':'Online Security', 'Online Backup_Yes':'Online Backup','Device Protection Plan_Yes':'Device Protection Plan',
                        'Premium Tech Support_Yes':'Premium Tech Support','Unlimited Data_Yes':'Unlimited Data','Referred a Friend_Yes':'Referred a Friend'}, inplace=True)

In [None]:
dataset.head()

In [None]:
# visualize the distribution of all the features in the dataset
dataset.hist(figsize=(20,20))
plt.show()

In [None]:
#visualize the correlation between the features
plt.figure(figsize=(20,20))
sns.heatmap(dataset.corr(), annot=True, cmap='coolwarm')
plt.show()

In [None]:
# visualize the correlation between the features and the target variable
plt.figure(figsize=(20, 20))
sns.heatmap(dataset.corr()[['Referred a Friend']].sort_values(
    by='Referred a Friend', ascending=False), annot=True, fmt='.2f')
plt.show()

##### Insights:

The only features worth looking at are:
1. Married
2. Tenure Months
3. Total Revenue
4. Number of Dependents
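
The shortlist above was read off the heatmap. A sketch of how the same ranking could be produced programmatically is shown here on made-up data; the values are invented and only the column names mirror the notebook.

```python
import pandas as pd

# Made-up numeric data standing in for the encoded dataset
toy = pd.DataFrame({
    'Tenure Months': [1, 5, 12, 24, 36, 48, 60, 72],
    'Monthly Charges': [70, 20, 95, 50, 85, 30, 60, 45],
    'Referred a Friend': [0, 0, 0, 1, 0, 1, 1, 1],
})

target = 'Referred a Friend'

# Rank features by the absolute value of their correlation with the target
ranking = (toy.corr()[target]
              .drop(target)
              .abs()
              .sort_values(ascending=False))
print(ranking)
```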

In [None]:
features = ['Married','Tenure Months','Total Revenue','Number of Dependents','Referred a Friend']
dataset = dataset[features]

In [None]:
dataset.head()

In [None]:
# visualize the distribution of the features vs the target variable (Referred a Friend)
for feature in features:
    if feature == 'Referred a Friend':
        continue
    if feature == 'Married':
        sns.countplot(x=feature, hue='Referred a Friend', data=dataset)
        plt.show()
    else:
        sns.boxplot(x='Referred a Friend', y=feature, data=dataset)
        plt.show()

##### Insights:
1. Married customers are more likely to recommend the company.
2. Total Revenue and Tenure Months are positively correlated with the likelihood of recommending the company. In other words, customers who have been with the company longer and have generated more total revenue are more likely to recommend it.
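
The first insight can also be expressed as a number rather than a plot. The sketch below computes the referral rate per marital status on made-up rows; the values are illustrative and are not results from the dataset.

```python
import pandas as pd

# Made-up rows: 1 = married / referred, 0 = not married / did not refer
toy = pd.DataFrame({
    'Married': [1, 1, 1, 1, 0, 0, 0, 0],
    'Referred a Friend': [1, 1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column within each group is the referral rate of that group
rates = toy.groupby('Married')['Referred a Friend'].mean()
print(rates)
```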


#### Building Model:

In [None]:
# split the dataset into train and test sets
X = dataset.drop('Referred a Friend', axis=1)
y = dataset['Referred a Friend']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)
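
If the two classes of `Referred a Friend` turn out to be imbalanced, a stratified split keeps the class proportions the same in the train and test sets. The sketch below shows the idea on synthetic data; the 30/70 class ratio is an assumption chosen for illustration, not a property of the dataset.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic features and a 30% / 70% target (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy preserves the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy)

print(Counter(y_tr.tolist()), Counter(y_te.tolist()))
```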

In [None]:
# this function takes in the model, trains it and evaluates it on the test set
# it also uses SequentialFeatureSelector to select the best features for the model
def trainModel(model, X_train, y_train, X_test, y_test):

    # candidate subset sizes: SequentialFeatureSelector requires
    # n_features_to_select to be smaller than the total number of features
    feature_range = range(1, len(X_train.columns))

    best_model_accuracy = None
    best_accuracy = 0.0
    best_model_f1 = None
    best_f1 = 0.0

    # try every allowed subset size
    for i in feature_range:
        sfs = SequentialFeatureSelector(
            model, n_features_to_select=i, direction='forward')

        # fit the selector on the training set (it cross-validates internally)
        sfs.fit(X_train, y_train)

        # transform the data sets so that only the selected features are retained
        X_train_sfs = sfs.transform(X_train)
        X_test_sfs = sfs.transform(X_test)

        # Print the selected features
        print("Selected Features for %d Features: %s" %
              (i, X_train.loc[:, sfs.support_].columns))

        # refit the model on the selected training features only
        model.fit(X_train_sfs, y_train)

        # predict the response for the test set and score it once
        y_pred = model.predict(X_test_sfs)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')

        # keep the selector with the best accuracy
        # (note: picking the subset by test-set score makes these scores optimistic)
        if best_accuracy < accuracy:
            best_accuracy = accuracy
            best_model_accuracy = sfs

        # keep the selector with the best weighted F1 score
        if best_f1 < f1:
            best_f1 = f1
            best_model_f1 = sfs

    # Print the selected features based on the accuracy
    # (X_train is used instead of the global X so the function is self-contained)
    print("Selected Features for Best Accuracy: %s" %
          (X_train.loc[:, best_model_accuracy.get_support()].columns))

    # Print best accuracy
    print("Best Accuracy: %f" % (best_accuracy))

    # Print the selected features based on the f1 score
    print("Selected Features for Best F1 Score: %s" %
          (X_train.loc[:, best_model_f1.get_support()].columns))

    # Print best f1 score
    print("Best F1 Score: %f" % (best_f1))
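
The function above chooses the number of features by looking at test-set scores, so the reported best scores are likely to be somewhat optimistic. One possible alternative, sketched below on synthetic data, is to choose the subset size by cross-validation on the training set only and to touch the test set once at the end. The synthetic data, the `max_iter` value, and the choice of weighted F1 as the selection metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a four-feature dataset (illustrative only)
X_syn, y_syn = make_classification(n_samples=300, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=101)


def build_pipeline(k):
    """Forward selection of k features followed by a logistic regression."""
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=k, direction='forward')
    return make_pipeline(selector, LogisticRegression(max_iter=1000))


# Score each subset size with 5-fold cross-validation on the training set
cv_scores = {}
for k in range(1, X_tr.shape[1]):
    scores = cross_val_score(build_pipeline(k), X_tr, y_tr,
                             cv=5, scoring='f1_weighted')
    cv_scores[k] = scores.mean()

# Refit the best configuration and evaluate on the test set exactly once
best_k = max(cv_scores, key=cv_scores.get)
final_model = build_pipeline(best_k).fit(X_tr, y_tr)
test_accuracy = final_model.score(X_te, y_te)
print(best_k, test_accuracy)
```

Putting the selector inside the pipeline means the feature selection is repeated within each cross-validation fold, so the fold used for scoring never influences which features are chosen.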

In [None]:
# train using a logistic regression model
trainModel(LogisticRegression(), X_train, y_train, X_test, y_test)

In [None]:
# train using a random forest classifier model
trainModel(RandomForestClassifier(), X_train, y_train, X_test, y_test)

In [None]:
# train using an SVM model
trainModel(SVC(), X_train, y_train, X_test, y_test)
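
Features such as `Tenure Months` and `Total Revenue` are on very different scales, and an SVM with the default RBF kernel is generally sensitive to that. A possible variation, sketched below on synthetic data, wraps the classifier in a pipeline with `StandardScaler`; the data and the labelling rule are invented for illustration, and whether scaling helps on the real dataset would need to be checked.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, size=200)
revenue = tenure * rng.uniform(20, 120, size=200)
X_toy = np.column_stack([tenure, revenue])
y_toy = (tenure > 24).astype(int)  # made-up labelling rule

# Standardize each feature to zero mean and unit variance before the SVM
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X_toy, y_toy)
train_accuracy = scaled_svc.score(X_toy, y_toy)
print(train_accuracy)
```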