# Telco customer churn prediction
The objective of this project is to predict whether a customer will leave a fictional telecommunications company. We perform an exploratory data analysis (EDA), preprocess the data, and compare several models and sampling strategies for imbalanced data.

The data set description on Kaggle is given below:
> **Context**
>
> "Predict behavior to retain customers. You can analyze all relevant customer data and develop focused customer retention programs." [IBM Sample Data Sets]
>
> **Content**
>
> Each row represents a customer, each column contains customer’s attributes described on the column Metadata.
> 
> The data set includes information about:
> 
> - Customers who left within the last month – the column is called Churn
> - Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
> - Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
> - Demographic info about customers – gender, age range, and if they have partners and dependents
>

# Import libraries

In [None]:
# python utilities
import random
import os

# general data science
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# ---------- scikit-learn ------------
# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

# models
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator # for custom estimators
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# model selection
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# model evaluation
from sklearn.metrics import classification_report, precision_recall_fscore_support, confusion_matrix, roc_auc_score, roc_curve
from sklearn.metrics import fbeta_score, make_scorer
# -------------------------------------

# imbalanced-learn
from imblearn.pipeline import Pipeline # if using imblearn's sampling we must use this over sklearn's Pipeline 
from imblearn.over_sampling import SMOTE, RandomOverSampler # oversampling
from imblearn.under_sampling import EditedNearestNeighbours, TomekLinks, NearMiss, RandomUnderSampler # undersampling

import warnings  
warnings.filterwarnings('ignore')

def seed_everything(seed = 42):
    random.seed(seed)
    np.random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    
seed_everything()

In [None]:
# load the data
data = pd.read_csv('../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv')
data.shape

# A quick look at the data

In [None]:
data.head()

In [None]:
data.info()

We see there is a mix of data types, but most are categorical variables (in particular, binary yes/no). According to the output above, there are three numeric variables: SeniorCitizen, tenure, and MonthlyCharges; however, SeniorCitizen looks like a binary 0/1 variable. Also, looking at the dataframe, TotalCharges looks numeric but is listed as "object" above. We suspect it contains non-numeric strings that stand in for missing values.

We will deal with these later.

# Creating a train-test split

In [None]:
X = data.drop("Churn", axis=1)
y = data["Churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)

# Exploratory data analysis
Here we take a look at the training data. We will examine the distributions of our predictor variables and target variable. We will look to see how each predictor variable relates with other predictors and with the target variable.

In [None]:
df_train = pd.concat([X_train, y_train], axis=1)

In [None]:
df_train.shape

In [None]:
df_train.head()

## Missing values

In [None]:
df_train.isna().mean()

Based on the above, it looks like there are no missing values. **Hold up** -- the missing values are not encoded as typical NA values. We see below that TotalCharges uses a "space" string for NA values.

In [None]:
df_train["TotalCharges"][df_train["customerID"] == "2775-SEFEE"].values[0]

In [None]:
df_train.apply(lambda x: x==' ', axis=1).mean()

There are only a few missing values here. Let's remove these while we're at it.

In [None]:
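# take an explicit copy so the assignment below does not write to a view of the original frame
df_train = df_train[df_train["TotalCharges"] != ' '].copy()
df_train["TotalCharges"] = df_train["TotalCharges"].astype('float64')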
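
An alternative way to surface these hidden missing values, sketched here on a made-up three-row frame (the values are illustrative and not taken from the data set), is to let `pd.to_numeric` coerce anything non-numeric to `NaN`, after which the usual `isna()` checks work:

```python
import pandas as pd

# Toy stand-in for the TotalCharges column; the values are invented for illustration.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# errors="coerce" turns strings that cannot be parsed as numbers (such as " ") into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

n_missing = int(toy["TotalCharges"].isna().sum())
toy_clean = toy.dropna(subset=["TotalCharges"])
print(n_missing, len(toy_clean))
```

This variant would also catch other unparseable strings, not only the single-space one.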

Let's do a couple sanity checks. First, let's make sure none of the numeric features are negative.

In [None]:
df_train[["tenure", "TotalCharges", "MonthlyCharges"]].describe()

The ranges look reasonable. Let's check to make sure there aren't any different "NA" strings in the categorical variables.

In [None]:
df_train.nunique().sort_values(ascending=False)

Let's drop customerID since it should not contain any relevant information.

In [None]:
df_train = df_train.drop("customerID", axis=1)

In [None]:
low_unq_feats = df_train.columns[df_train.nunique()<10]
for feat in low_unq_feats:
    print(feat, df_train[feat].unique())

There don't appear to be other "NA" strings. We see that the categorical variables are generally either "Yes", "No", or "No phone/internet service." Let's recode SeniorCitizen as a "Yes" or "No" also.

In [None]:
df_train["SeniorCitizen"] = df_train["SeniorCitizen"].replace({0:"No",1:"Yes"})

In [None]:
df_train.dtypes

## Churn frequency

In [None]:
sns.countplot(x="Churn", data=df_train)  # pass the column by keyword; recent seaborn versions do not accept it positionally

In [None]:
(df_train["Churn"]=="No").sum(), (df_train["Churn"]=="Yes").sum()

In our training data, there are 1496 cases of customer churn. The other 4128 customers were retained. There is an **imbalanced data** issue here. To better predict customers that will leave, we will need to use models that allow for class weights, or we will need to use undersampling and/or oversampling.
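
To make the imbalance concrete, here is a small sketch using the two counts quoted above. It computes the accuracy of a trivial "always predict no churn" model and the class weights given by the common "balanced" heuristic `n_samples / (n_classes * class_count)`; the variable names are ours and purely illustrative.

```python
# Class counts reported above for the training split
n_stay, n_churn = 4128, 1496
n_total = n_stay + n_churn

# Accuracy of a trivial model that always predicts "no churn"
majority_baseline_acc = n_stay / n_total

# "Balanced" class weights: n_samples / (n_classes * class_count)
w_stay = n_total / (2 * n_stay)
w_churn = n_total / (2 * n_churn)

print(round(majority_baseline_acc, 3), round(w_stay, 3), round(w_churn, 3))
```

Any model we build should therefore be judged against this majority-class baseline rather than against 50% accuracy.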

## Categorical features

In [None]:
cat_feats = df_train.columns[df_train.dtypes == 'object'][:-1] # :-1 to remove churn
cat_feats

### Barplots for distributions

In [None]:
fig, ax = plt.subplots(4,4,figsize=(14,14))
ax = ax.flatten()
for i,feat in enumerate(cat_feats):
    plt.sca(ax[i])
    df_unq = df_train[feat].value_counts().sort_values(ascending=False)

    sns.barplot(x=df_unq.index, y=df_unq.values, order=df_unq.index)  # keyword args; positional x/y are not accepted by recent seaborn
    
    plt.xlabel(str(feat), color='red', fontsize=14)
    plt.xticks(rotation=45, ha='right')
    
plt.tight_layout(h_pad=2)
plt.show()

### Relation to target variable
For plotting purposes, it will help to convert customer churn to a numeric 0/1 variable.

In [None]:
df_train["Churn"] = df_train["Churn"].replace({"No":0, "Yes":1})

In [None]:
fig, ax = plt.subplots(4,4,figsize=(14,14))
ax = ax.flatten()
for i,feat in enumerate(cat_feats):
    plt.sca(ax[i])
    sns.barplot(x=feat, y="Churn", data=df_train)
                
    plt.xlabel(str(feat), color='red', fontsize=14)
    plt.xticks(rotation=45, ha='right')
    
plt.tight_layout(h_pad=2)
plt.show()

The plot below provides a more succinct view of the factors contributing to churn/no churn. To create it, we recoded "no" as 0 and "yes" as 1. For the few categories that have other levels, we recoded them (somewhat arbitrarily) according to whether we thought they would lead to customer churn.

In [None]:
import plotly.graph_objects as go

for feat in cat_feats:
    df_train[feat] = df_train[feat].replace({"No":0,"Yes":1,"No internet service":0,"No phone service":0})
df_train["gender"] = df_train["gender"].replace({"Female":0,"Male":1})
df_train["InternetService"] = df_train["InternetService"].replace({"DSL":1,"Fiber optic":1})
df_train["Contract"] = df_train["Contract"].replace({"One year":1, "Two year":1, "Month-to-month":0})
df_train["PaymentMethod"] = df_train["PaymentMethod"].replace({"Mailed check":0, "Bank transfer (automatic)":1, 
                                                               "Electronic check":1, "Credit card (automatic)":1})


# based on the plot at https://plotly.com/python/radar-chart/
fig = go.Figure()

fig.add_trace(go.Scatterpolar(
      r=df_train.loc[df_train["Churn"]==0,cat_feats].mean().tolist(),
      theta=cat_feats.tolist(),
      fill='toself',
      name='No churn'
))

fig.add_trace(go.Scatterpolar(
      r=df_train.loc[df_train["Churn"]==1,cat_feats].mean().tolist(),
      theta=cat_feats.tolist(),
      fill='toself',
      name='Churn'
))

fig.update_layout(
  polar=dict(
    radialaxis=dict(
      visible=True,
      range=[0, 1]
    )),
  showlegend=False
)

fig.show()

There are a number of insights to take away here. Churn is relatively independent of gender and PhoneService. InternetService, PaperlessBilling, and SeniorCitizen are all associated with churn. OnlineSecurity, OnlineBackup, DeviceProtection, and TechSupport tend to retain customers. Although correlated, these factors may not be causative -- longer kept customers may simply be more likely to use these services. Finally, customers with dependents or partners are more likely to stay, possibly indicating the company's family plans are well-received.

## Numeric features

In [None]:
num_feats = ["tenure", "TotalCharges", "MonthlyCharges"]
df_train[num_feats].describe()

In [None]:
fig, ax = plt.subplots(1,3,figsize=(14,6))
for i,feat in enumerate(num_feats):
    plt.sca(ax[i])
    sns.boxplot(x="Churn", y=feat, data=df_train, ax=ax[i])
    plt.ylabel("")
    plt.title(feat, color="red", fontsize=14)
    
plt.tight_layout()
plt.show()

We see customer churn tends to be associated with lower tenure and TotalCharges. This makes sense: the longer a customer stays and the more they have paid, the more likely they are to stay. Interestingly, these two distributions are especially right-skewed for the churn class, with a median tenure of only around 10 months. This indicates most people who leave have low tenure and TotalCharges, but some customers with very high tenure and TotalCharges still end up leaving.

We see that monthly charges are higher for the churn class. This may be due to deals that these customers are not getting because they don't stay long enough. It could also be because they didn't initially research different plans, and so they wound up paying higher and then ultimately found a better deal elsewhere.

In [None]:
df_train[num_feats].corr()

As we would expect, total charges are highly correlated with tenure and monthly charges. It may be wise to remove one of these features from our model eventually. Monthly charges are not very correlated with tenure.
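
One way to see why this pattern is expected is a synthetic sketch. We assume (this is our assumption, not something checked against the data) that total charges are roughly tenure times monthly charges, and we draw tenure and monthly charges independently. All numbers below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, independent stand-ins for tenure (months) and monthly charges
tenure = rng.integers(1, 73, size=1000)
monthly = rng.uniform(20, 120, size=1000)

# Assumption: total charges are roughly tenure times monthly charges
total = tenure * monthly

corr_total_tenure = np.corrcoef(total, tenure)[0, 1]
corr_total_monthly = np.corrcoef(total, monthly)[0, 1]
corr_tenure_monthly = np.corrcoef(tenure, monthly)[0, 1]

print(corr_total_tenure, corr_total_monthly, corr_tenure_monthly)
```

Even with independent inputs, the product is strongly correlated with each factor, which is why dropping TotalCharges loses relatively little information.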

# Preprocessing

During our EDA we performed a couple transformations. We'll want to apply these (and any subsequent transformations) to X_train, y_train, X_test, y_test.

In [None]:
def tidy_up(df):
    df = df[df["TotalCharges"] != ' '].copy()  # copy so the assignments below do not write to a view
    df["TotalCharges"] = df["TotalCharges"].astype('float64')
    
    df["SeniorCitizen"] = df["SeniorCitizen"].replace({0:"No",1:"Yes"})
    
    df["Churn"] = df["Churn"].replace({"No":0, "Yes":1})
    
    df.drop("customerID", axis=1, inplace=True)
    
    X = df.drop("Churn", axis=1)
    y = df["Churn"]
    
    return X,y

X_train, y_train = tidy_up(pd.concat([X_train, y_train], axis=1))
X_test, y_test = tidy_up(pd.concat([X_test, y_test], axis=1))

## Encoding categorical features
A common way to encode categorical features is one-hot encoding. One-hot encoding preserves all information about the categories; the downside is that it increases the dimension and sparsity of the data, especially for high-cardinality features. Since our feature space is relatively small (16 categorical features with only 2-4 unique values each), we will simply one-hot encode our categorical features.

In [None]:
# define categorical features & initialize encoder
cat_feats = X_train.columns[X_train.dtypes == 'object']
onehot_encoder = OneHotEncoder() 
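
As a quick illustration of what the encoder produces, here is a sketch on a made-up one-column frame (the rows are invented; only the level names mirror the Contract feature). Note that the output columns follow the encoder's fitted `categories_`, which are sorted, not the order in which levels appear in the data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame; the rows are invented for illustration
toy = pd.DataFrame({"Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"]})

enc = OneHotEncoder()
encoded = enc.fit_transform(toy).toarray()  # fit_transform returns a sparse matrix by default

print(list(enc.categories_[0]))  # column order of the encoded matrix
print(encoded)
```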

## Rescaling numerical features
Here we rescale numeric features. This helps several classifiers -- in particular, SVM and KNN -- perform better.

In [None]:
# define numeric features & initialize scaler
num_feats = ["tenure", "TotalCharges", "MonthlyCharges"]
scaler = StandardScaler()
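
For reference, standardization subtracts the mean and divides by the standard deviation of each feature. A minimal NumPy sketch on made-up tenure values:

```python
import numpy as np

x = np.array([1.0, 12.0, 24.0, 72.0])  # made-up tenure values

# z-score using the population standard deviation (ddof=0), which is what StandardScaler uses
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())
```

After rescaling, the feature has mean 0 and unit variance, so distance-based models such as KNN and SVM are not dominated by the feature with the largest raw range.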

# Modeling

In this section we will:
- Try several methods for correcting imbalanced data (when applicable) such as: Synthetic Minority Oversampling Technique (SMOTE), Near-Miss undersampling, Tomek Links undersampling.
- Try several models such as: Naive Bayes, Logistic Regression, SVM, KNN, Random Forest. If time permits we will look at more advanced models like Gradient Boosted Trees (XGBoost) and Neural Networks.
- Perform cross-validation to optimize hyperparameters and prevent overfitting.
- Our model evaluation on the test set will be guided by *recall* and the *f1 score*. Recall is an appropriate metric for customer churn prediction. We want to be able to identify customers that are going to leave so that we can talk to them (e.g. send them email offers). If we accidentally predict a user will leave when they don't plan on it -- well -- talking to them won't incur a great cost.
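
As a reminder of how these metrics are computed, here is a small worked example. The confusion-matrix counts are hypothetical and are not results from this notebook.

```python
# Hypothetical counts for the "churn" class
tp, fp, fn = 80, 80, 20

recall = tp / (tp + fn)        # share of actual churners we catch
precision = tp / (tp + fp)     # share of flagged customers who actually churn
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(recall, precision, round(f1, 3))
```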

## Naive Bayes
As a baseline model we will use Naive Bayes. Unfortunately, Naive Bayes is not straightforward to implement when there is a combination of categorical and numeric features. We solve this by fitting a categorical Naive Bayes model to the categorical features, a Gaussian Naive Bayes model to the numeric features, and aggregating them with -- you guessed it -- a third, Gaussian Naive Bayes model.

Two notes:
- We have not used any cross-validation yet.
- Naive Bayes does not suffer as much from imbalanced data, so we did not employ undersampling/oversampling. Naive Bayes does suffer from small datasets, though. This is because it's more likely that an arbitrary train/test split will have differently distributed features, and thus have incorrectly specified priors. To compound that, the impact of these incorrect priors is larger for small datasets.

In [None]:
# fit CategoricalNB to categorical features
ord_encoder = OrdinalEncoder()
X_train_c = X_train[cat_feats]
X_train_c = ord_encoder.fit_transform(X_train_c)
nb_c = CategoricalNB()
nb_c.fit(X_train_c, y_train)

# fit GaussianNB to numeric features
X_train_n = X_train[num_feats]
nb_n = GaussianNB()
nb_n.fit(X_train_n, y_train)

# get predicted class probabilities, P(Y=1|X),from each model. Then stack predictions and train another GaussianNB. 
train_preds_c = nb_c.predict_proba(X_train_c)[:,1]
train_preds_n = nb_n.predict_proba(X_train_n)[:,1]
train_preds_cn = np.vstack((train_preds_c, train_preds_n)).T
nb_cn = GaussianNB()
nb_cn.fit(train_preds_cn, y_train)

# test set predictions (use class probabilities, as in training, so the stacked model sees the same kind of input)
X_test_c = X_test[cat_feats]
X_test_c = ord_encoder.transform(X_test_c)
test_preds_c = nb_c.predict_proba(X_test_c)[:,1]
X_test_n = X_test[num_feats]
test_preds_n = nb_n.predict_proba(X_test_n)[:,1]
test_preds_cn = np.vstack((test_preds_c, test_preds_n)).T
test_preds = nb_cn.predict(test_preds_cn)

# evaluate model
print("Train accuracy...")
print(classification_report(y_train, nb_cn.predict(train_preds_cn)))
print("Test accuracy...")
print(classification_report(y_test, test_preds))

## Logistic Regression
Below we train a logistic regression model and optimize the regularization parameter using cross-validation. We chose to use $\ell_1$ (LASSO) regularization over $\ell_2$ because the $\ell_2$ solution was not converging. 

We illustrate the issue with imbalanced data.

In [None]:
ct = ColumnTransformer([('cat_feats', onehot_encoder, cat_feats),
                        ('num_feats', scaler, num_feats)])

model = LogisticRegression(penalty="l1", solver="liblinear")

# no undersampling/oversampling
pipe = Pipeline([("preprocessing", ct),
                 ("logreg", model)])

kf = StratifiedKFold(n_splits=5)

grid = GridSearchCV(pipe, param_grid={'logreg__C': [0.01, 0.1, 1, 10]}, cv=kf, scoring='f1')
grid.fit(X_train, y_train)
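
To illustrate why we use `StratifiedKFold` here, the sketch below runs it on made-up labels with a 25% positive rate and records the positive rate in each validation fold. The toy arrays are ours and purely illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 15 "no churn" and 5 "churn"; the features are irrelevant for the split
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.zeros((20, 1))

fold_churn_rates = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_toy, y_toy):
    # each validation fold should keep roughly the overall class ratio
    fold_churn_rates.append(y_toy[val_idx].mean())

print(fold_churn_rates)
```

With an imbalanced target this keeps every fold representative, so the f1 score is not computed on a fold with very few churn cases.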

In [None]:
grid.best_estimator_["logreg"]

In [None]:
predict = grid.best_estimator_.predict(X_test)
print(classification_report(y_test, predict))

With no sampling, the recall score is only 0.58. This means we are only correctly identifying 58% of all customer churn. The accuracy score of 0.81 is misleading because there are many more "no churn" examples than "churn" examples.

Below we employ several different sampling methods and compare their performance.

In [None]:
def brief_classification_report(y_test, predict):
    rep = np.array(precision_recall_fscore_support(y_test, predict))
#     print("       precision       recall            f1")
#     print("0\t", "\t\t".join(["%0.02f" % x for x in rep[:-1,0]]) )
    print("1\t", "\t\t".join(["%0.02f" % x for x in rep[:-1,1]]))

sm_names = ["Edited NN", "Tomek Links", "Random Undersampling", "Near-Miss", "SMOTE", "Random Oversampling"] 
sms = [EditedNearestNeighbours(), TomekLinks(), RandomUnderSampler(), NearMiss(), SMOTE(), RandomOverSampler()]

ct = ColumnTransformer([('cat_feats', onehot_encoder, cat_feats),
                        ('num_feats', scaler, num_feats)])

model = LogisticRegression(penalty="l1", solver="liblinear")

print("       precision       recall            f1")
for sm_name, sm in zip(sm_names, sms):
    
    pipe = Pipeline([("preprocessing", ct),
                     ("sampling", sm),
                    ("logreg", model)])

    kf = StratifiedKFold(n_splits=5)

    grid = GridSearchCV(pipe, param_grid={'logreg__C': [0.01, 0.1, 1, 10]}, cv=kf, scoring='f1')
    grid.fit(X_train, y_train)
    
    predict = grid.best_estimator_.predict(X_test)
    print(sm_name)
    brief_classification_report(y_test, predict)

The sampling methods worked -- we've sacrificed precision for recall!

The f1 scores of the methods are relatively close, except for Near-Miss, which has a good recall score but very poor precision, ultimately lowering the f1 score. Tomek links provides a good balance between precision and recall. Edited NN, random undersampling, SMOTE, and random oversampling all perform approximately the same (precision: 0.51-0.53, recall: 0.82-0.84).

From what I've read, there is no principled way to choose a sampling method, and the "best" sampling method may itself be algorithm dependent. We think the performance of SMOTE and other complex sampling algorithms may be limited due to the fact most of our data is categorical. **Based on our observations we will use random oversampling.** 
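
For intuition, random oversampling simply duplicates minority-class rows (sampled with replacement) until the classes are balanced. The NumPy sketch below shows the idea on toy data; `random_oversample` is a hypothetical helper written for illustration and is not imblearn's implementation.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Hypothetical helper: resample smaller classes with replacement up to the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        # draw the missing number of rows from this class, with replacement
        extra = idx[rng.integers(0, len(idx), size=n_max - count)]
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# Toy data: 8 majority rows and 2 minority rows
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0] * 8 + [1] * 2)
X_res, y_res = random_oversample(X_toy, y_toy)
print(np.bincount(y_res))
```

Because it only repeats existing rows, this approach never creates implausible category combinations, which may be one reason it holds up well on mostly categorical data.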

## SVM, KNN, Random Forests

We look at three other algorithms: SVM, KNN, and random forests. After searching a few parameters, we will assess which model seems to be performing the best.

In [None]:
# This is a custom estimator. Code courtesy of: https://stackoverflow.com/a/53926097/7638741 .
class ClfSwitcher(BaseEstimator):
    def __init__(
        self, 
        estimator = SVC(),
    ):
        """
        A Custom BaseEstimator that can switch between classifiers.
        :param estimator: sklearn object - The classifier
        """ 

        self.estimator = estimator


    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y)
        return self


    def predict(self, X, y=None):
        return self.estimator.predict(X)


    def predict_proba(self, X):
        return self.estimator.predict_proba(X)


    def score(self, X, y):
        return self.estimator.score(X, y)

In [None]:
ct = ColumnTransformer([('cat_feats', onehot_encoder, cat_feats),
                    ('num_feats', scaler, num_feats)])

pipe = Pipeline([("preprocessing", ct),
                 ("sampling", RandomOverSampler()),
                 ("clf", ClfSwitcher())])

parameters = [
    {
        'clf__estimator': [SVC()],
        'clf__estimator__C': [0.1, 1, 10, 20],
        'clf__estimator__kernel': ['rbf', 'poly']
    },
    {
        'clf__estimator': [KNeighborsClassifier()],
        'clf__estimator__n_neighbors':[3,5,10,20]
    },
    {
        'clf__estimator': [RandomForestClassifier()],
        'clf__estimator__n_estimators': [100,200], 
        'clf__estimator__max_depth': [15,30], 
        'clf__estimator__max_features': [5,10],
        'clf__estimator__min_samples_leaf': [4,8]
    },
]

kf = StratifiedKFold(n_splits=5)

grid = GridSearchCV(estimator = pipe, param_grid=parameters, cv=kf, scoring='f1')

grid.fit(X_train, y_train)

In [None]:
def format_cv_results(search):
    # use the search object passed in rather than the global `grid`
    df = pd.concat([pd.DataFrame(search.cv_results_["params"]),pd.DataFrame(search.cv_results_["mean_test_score"], columns=["Score"])],axis=1)
    df = df.sort_values("Score", ascending=False)
    return df.fillna(value="")
df_res = format_cv_results(grid)
df_res

We can see that random forest scored the best, SVM scored second best, and KNN scored last. Essentially all random forest models scored above SVM and KNN. Note that we may simply not be exploring the parameter space for each model enough. In particular, KNN's performance appears to continually increase with the `n_neighbors` parameter. The best model selected during validation was a random forest model with `max_depth=15, max_features=5, min_samples_leaf=8, n_estimators=100`. We could likely improve this model further by exploring more parameters, but we will stop here.

Below is the testing report for the best model (out of RF, SVM, KNN).

In [None]:
predict = grid.best_estimator_.predict(X_test)
print(classification_report(y_test, predict))

The random forest model slightly outperforms the previous logistic regression model in f1 score. Another model diagnostic we can use is the ROC curve and the AUC score, which we plot below.

In [None]:
probs = grid.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_auc = roc_auc_score(y_test, probs) # AUC is computed from predicted probabilities, not hard 0/1 predictions

plt.plot(fpr, tpr, lw=1, label='AUC = %0.2f'%(roc_auc))
plt.plot([0, 1], [0, 1], '--k', lw=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc="lower right", frameon = True).get_frame().set_edgecolor('black')

We see that with an appropriate classification threshold we can achieve around an 85% recall with only a 20% false positive rate.
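
To show what moving the classification threshold does, here is a sketch on made-up labels and scores (not output from our model). `recall_and_fpr` is a hypothetical helper that computes one point on the ROC curve for a given threshold.

```python
import numpy as np

# Made-up labels and predicted churn probabilities
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.35, 0.6, 0.7, 0.4, 0.65, 0.8, 0.9])

def recall_and_fpr(y_true, scores, threshold):
    """Hypothetical helper: recall (TPR) and false positive rate at one threshold."""
    pred = scores >= threshold
    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

for t in (0.5, 0.4):
    print(t, recall_and_fpr(y_true, scores, t))
```

In this toy case, lowering the threshold from 0.5 to 0.4 catches one more churner without adding any false positives; in general, lowering the threshold trades a higher false positive rate for higher recall.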

## XGBoost
We now try to improve on the previous random forest model by applying XGBoost, a gradient-boosted tree ensemble that often performs as well as or better than random forests on tabular data. We use a larger hyperparameter space, but only perform a randomized search.

In [None]:
ct = ColumnTransformer([('cat_feats', onehot_encoder, cat_feats),
                        ('num_feats', scaler, num_feats)])

model = XGBClassifier(learning_rate=0.02, 
                    n_estimators=200,
                    booster = 'gbtree',
                    objective='binary:logistic')

pipe = Pipeline([("preprocessing", ct),
                ("sampling", RandomOverSampler()),
                ("xgb", model)])

tuned_parameters = {
        'xgb__min_child_weight': [1, 5, 10],
        'xgb__gamma': [0.5, 1, 1.5, 2, 5, 10],
        'xgb__subsample': [0.6, 0.8, 1.0],
        'xgb__colsample_bytree': [0.6, 0.8, 1.0],
        'xgb__max_depth': [3, 5, 8]
        }

kf = StratifiedKFold(n_splits=5)

grid = RandomizedSearchCV(estimator = pipe, 
                                   param_distributions=tuned_parameters, 
                                   cv=kf,
                                   n_iter=20, 
                                   scoring='f1', 
                                   n_jobs=-1, 
                                   verbose=3)

grid.fit(X_train, y_train)
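
To see why a randomized search is attractive here, the sketch below counts how many model fits an exhaustive grid search over the same candidate values would need, compared with the 20 sampled configurations. The dictionary just restates the candidate lists above without the `xgb__` prefix.

```python
from math import prod

# Same candidate values as the search space above
space = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0.5, 1, 1.5, 2, 5, 10],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 5, 8],
}

n_combinations = prod(len(values) for values in space.values())
n_iter, n_folds = 20, 5

grid_fits = n_combinations * n_folds  # exhaustive grid search with 5-fold CV
random_fits = n_iter * n_folds        # randomized search with 5-fold CV

print(n_combinations, grid_fits, random_fits)
```

The randomized search therefore covers only a small fraction of the grid, so the best parameters it finds should be read as a reasonable candidate and not as the optimum.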

In [None]:
preds = grid.best_estimator_.predict(X_test)
print(classification_report(y_test, preds))

In [None]:
probs = grid.best_estimator_.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
roc_auc = roc_auc_score(y_test, probs) # AUC is computed from predicted probabilities, not hard 0/1 predictions

plt.plot(fpr, tpr, lw=1, label='AUC = %0.2f'%(roc_auc))
plt.plot([0, 1], [0, 1], '--k', lw=1)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve')
plt.legend(loc="lower right", frameon = True).get_frame().set_edgecolor('black')

The XGBoost model performs about the same as the random forest model (for the hyperparameters we searched). The recall of the XGBoost model is slightly higher but the f1 score is slightly lower.

## Feature Importances

In [None]:
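# OneHotEncoder orders levels by its fitted categories_ (sorted), not by order of appearance,
# so read the level names from the fitted encoder to keep names aligned with the columns
ohe = grid.best_estimator_["preprocessing"].named_transformers_["cat_feats"]
feat_names = []
for feat, levels in zip(cat_feats, ohe.categories_):
    for level in levels:
        feat_names.append("%s_%s" % (feat,level))
feat_names.extend(num_feats)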

importances = grid.best_estimator_["xgb"].feature_importances_
importances_dict = {f:i for f,i in zip(feat_names, importances)}

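n = 20 # only plot the n most important features
importances = pd.DataFrame.from_dict(importances_dict, orient='index').rename(columns={0: 'Importance'})

# sort first, then keep the top n (calling head(n) before sorting keeps the first n features, not the most important ones)
importances.sort_values(by='Importance', ascending=False).head(n).plot(kind='bar', rot=45, figsize=(14,6), fontsize=14)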
plt.xticks(ha='right')
plt.show()

OnlineSecurity and InternetService appear at the top of the feature importances. Perhaps more resources should be invested into improving the company's internet and online services.

## Save the best model (optional)

In [None]:
import pickle
with open('customer-churn_XGBoost','wb') as f:
    pickle.dump(grid.best_estimator_, f)
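
For completeness, here is a sketch of the save/load round trip. It uses a plain dictionary as a stand-in for the fitted pipeline and a temporary file path, both of which are placeholders of ours. Loading a real pickled model would additionally require the same libraries (and ideally the same versions) to be installed, and pickle files should only be loaded from trusted sources.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the fitted pipeline
model_stand_in = {"name": "toy-model", "threshold": 0.5}

# Temporary path used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

# Reading the object back in a later session
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)
```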

## Conclusion & future work
We were able to achieve above 80% recall, 50% precision, and 0.65 f1 with XGBoost. This equates to correctly identifying 80% of customer churn cases, while about half of the customers we flag would not actually have churned. Assuming reasonable actions are taken, e.g. emailing the customer an offer, this model could be leveraged to improve customer satisfaction.

Among the other models, logistic regression and random forest also performed well. Naive Bayes, SVM, and KNN did not perform well, possibly in part due to the large number of categorical variables. Our EDA indicated that factors like gender and phone service have little association with customer churn. Internet service, paperless billing, and whether the customer was a senior citizen were all related to churn. The feature importances from our model further indicate that internet service (or not) is a dominating factor in why customers stay or leave.

Future work should explore multivariate relationships relevant to churn (are senior citizens and no-internet separate causes of churn or do senior citizens not use internet as often?). More advanced models could be built, such as neural network models. Furthermore, advanced hyperparameter optimization methods (see hypopt package) should be preferred over random/grid-search. 

Please let me know if you have any suggestions! 😊