# Business problem

Companies usually focus more on customer acquisition than on customer retention. However, attracting a new customer can cost anywhere from five to twenty-five times more than retaining an existing one. According to research by Bain & Company, increasing customer retention rates by 5% can increase profits by 25%.

Churn is a metric that measures the number of customers who stop doing business with a company. Most businesses use this metric to understand the reasons behind their churn numbers and then tackle those factors with reactive action plans.
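As a small illustration of the metric, the churn rate over a period is the share of customers who left during it. The sketch below assumes a toy DataFrame with a `Churn` column holding `Yes`/`No` labels; the rows are invented for the example and are not taken from the project data.

```python
import pandas as pd

# Hypothetical toy data: one row per customer, Churn marks who left
customers = pd.DataFrame({
    "customer": ["a", "b", "c", "d", "e"],
    "Churn": ["No", "Yes", "No", "No", "Yes"],
})

# Churn rate = churned customers / all customers in the period
churned = int((customers["Churn"] == "Yes").sum())
churn_rate = churned / len(customers)

print(f"{churned} of {len(customers)} customers churned ({churn_rate:.0%})")
```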

But what if you could identify a customer who is likely to churn and take appropriate steps to prevent it? The reasons that lead customers to cancel can be numerous, ranging from poor service quality to new competitors entering the market. Usually there is no single reason, but rather a combination of factors that results in churn.

Even after customers have churned, their data is still available. With machine learning, we can sift through this data to discover patterns and understand which combinations of factors lead to customer churn.

Our goal in this project is to identify behavior patterns among customers who are likely to churn. We then need to train a machine learning model to detect these signals before a customer churns. Once deployed, the model will flag customers who might churn and alert us so that we can take the necessary steps to retain them.

# Initialisation

In [8]:
###############################################################################
#
#Importing libraries
#
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler, LabelBinarizer
from sklearn.model_selection import train_test_split as tts
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from imblearn.under_sampling import RandomUnderSampler
from numpy import random
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix, recall_score as R
import warnings
import pickle
###############################################################################
#
#Notebook options
#
pd.options.display.max_columns = 100
warnings.filterwarnings('ignore')
###############################################################################
#
#Reading the data
#
df = pd.read_csv(r"../Data/Telco-Customer-Churn.csv")
df.drop(["customerID"], axis=1, inplace=True)
###############################################################################

# Data preparation

In [9]:
###############################################################################
#
# def preprocess(df)
# Input:
# df = input dataframe 
#
# 1. Prepare X and y as feature and target matrix
# 2. Binarize the target feature y
# 3. Segregate columns into binary, numeric and categorical features
# 4. Binarize the binary features
# 5. Convert categorical features to dummy variables
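# 6. Standard-scale the numeric features (tenure, MonthlyCharges, TotalCharges)
# 7. Convert the numpy arrays to dataframes for further processing
# 8. Save the fitted scaler to ../deployment/scalar for use at deployment
# 9. Return the formatted data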
###############################################################################


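def preprocess(df):
    df = df.copy()
    # Coerce TotalCharges to float; unparseable entries (e.g. blanks) become NaN
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    # Assumption: rows without a usable TotalCharges value are dropped
    df = df.dropna(subset=['TotalCharges']).reset_index(drop=True)

    X = df.drop('Churn', axis=1)
    y = df['Churn']

    # Binarize the target
    y = LabelBinarizer().fit_transform(y)

    # Segregate columns by type; two-valued columns count as binary
    n_unique = X.nunique()
    binary_feat = n_unique[n_unique == 2].index.tolist()
    numeric_feat = [col for col in X.select_dtypes(['float', 'int']).columns
                    if col not in binary_feat]
    categorical_feat = [col for col in X.select_dtypes('object').columns
                        if col not in binary_feat + numeric_feat]

    # Encode each binary feature as 0/1
    for col in binary_feat:
        X[col] = LabelBinarizer().fit_transform(X[col]).ravel()

    # One-hot encode the remaining categorical features
    X = pd.get_dummies(X, columns=categorical_feat)

    # Scale the numeric features, keeping their names and the row index
    num_cols = ['TotalCharges', 'MonthlyCharges', 'tenure']
    sc = StandardScaler()
    scaled = pd.DataFrame(sc.fit_transform(X[num_cols]),
                          columns=num_cols, index=X.index)
    X = pd.concat([X.drop(num_cols, axis=1), scaled], axis=1)
    y = pd.DataFrame(y)

    # Persist the fitted scaler for use at deployment
    with open('../deployment/scalar', 'wb') as f:
        pickle.dump(sc, f)

    return X, y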
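
To make the column segregation concrete, here is a minimal sketch on a made-up four-row frame. The column names mirror the Telco data but the values are invented. The snippet is a simplified variant of the splitting and one-hot steps only: it treats every non-numeric, non-binary column as categorical instead of selecting by `object` dtype, and it skips binarization and scaling.

```python
import pandas as pd

# Hypothetical toy frame; values are invented for illustration
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],   # 2 distinct values
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

# Columns with exactly two distinct values are treated as binary
n_unique = toy.nunique()
binary_feat = n_unique[n_unique == 2].index.tolist()

# Numeric columns that are not binary
numeric_feat = [c for c in toy.select_dtypes("number").columns
                if c not in binary_feat]

# Everything else is treated as categorical (simplification for this sketch)
categorical_feat = [c for c in toy.columns
                    if c not in binary_feat + numeric_feat]

# One-hot encode only the categorical columns
encoded = pd.get_dummies(toy, columns=categorical_feat)

print("binary:     ", binary_feat)
print("numeric:    ", numeric_feat)
print("categorical:", categorical_feat)
print(encoded.columns.tolist())
```

Note that a numeric column with only two distinct values (such as a 0/1 flag) lands in `binary_feat`, which is why the numeric list is filtered against it.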

## Undersampling the over-represented class

In [10]:
###############################################################################
#
# def undersample(X,y)
# Input
# X = feature vectors
# y = response vector
#
# 1. Split the data into train and test sets
# 2. Undersample the over-represented class in the training set only
# 3. Return X_train, X_test, y_train, y_test
#
# Scaling is not repeated here; the features were already scaled in preprocess()
#
###############################################################################

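def undersample(X, y):
    # Hold out 20% of the data as a test set that is never resampled
    X_train_full, X_test, y_train_full, y_test = tts(X, y, test_size=0.2, random_state=123456)
    # Balance the classes in the training set only
    # Assumption: the seed is fixed here so that repeated runs are reproducible
    rus = RandomUnderSampler(random_state=123456)
    X_train, y_train = rus.fit_resample(X_train_full, y_train_full)
    return X_train, X_test, y_train, y_test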
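
The effect of random undersampling can be sketched without `imblearn`: sample every class down to the size of the smallest one. The snippet below is a plain-pandas stand-in on invented data, not the `RandomUnderSampler` implementation, and the seed value is arbitrary.

```python
import pandas as pd

# Hypothetical imbalanced training set: 8 non-churners, 2 churners
train = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 72, 2, 3],
    "Churn":  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Size of the smallest class
minority_size = int(train["Churn"].value_counts().min())

# Draw minority_size rows at random from each class
balanced = (
    train.groupby("Churn", group_keys=False)
         .sample(n=minority_size, random_state=0)
)

print(train["Churn"].value_counts().to_dict())
print(balanced["Churn"].value_counts().to_dict())
```

In `undersample`, only the training split is resampled, so the test set keeps the original class proportions.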