## **Client Credit Card Churn Prediction**
---

- Task : Classification
- Objective : Predict customer churn

<img src="https://storage.googleapis.com/kaggle-datasets-images/2590623/4422779/632f2da59bc9ca5e85e9aa5e1101457e/dataset-cover.jpg?t=2022-10-30-13-24-17" alt="Dataset cover image" width="700" align=center/>

### **Data description:**

"Churn Rate" is a business term describing the rate at which customers leave or cease paying for a product or service. It's a critical figure in many businesses, as it's often the case that acquiring new customers is a lot more costly than retaining existing ones (in some cases, 5 to 20 times more expensive).

Predicting churn is particularly important for businesses with subscription models such as cell phone, cable, or merchant credit card processing plans.
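
As a worked illustration of the definition above, churn rate can be computed as the share of customers lost over a period. This is a minimal sketch: the function name `churn_rate` and the numbers are hypothetical, chosen only to make the arithmetic concrete.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Fraction of customers lost during a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


# Hypothetical numbers, for illustration only
rate = churn_rate(customers_at_start=1000, customers_lost=150)
print(f"{rate:.1%}")
```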

**Features**
There are 21 variables: 
1. State
2. Account Length
3. Area Code
4. Phone
5. Int'l Plan
6. VMail Plan
7. VMail Message
8. Day Mins
9. Day Calls
10. Day Charge
11. Eve Mins
12. Eve Calls
13. Eve Charge
14. Night Mins
15. Night Calls
16. Night Charge
17. Intl Mins
18. Intl Calls
19. Intl Charge
20. CustServ Calls
21. Churn? : Output


In [46]:
import pandas as pd
import numpy as np

In [47]:
data = pd.read_csv("churn.csv")
data.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


In [48]:
data.shape

(3333, 21)

In [49]:
data.duplicated().sum()

0

**Wrap everything in a function**

1. Import the data
2. Check the **number of observations** and the **number of columns**
3. Drop duplicates
4. Check the **number of observations** and the **number of columns** after dropping
5. Return the data after dropping

In [50]:
def importData(filename, dropped_column):
    """
    Import the data, drop a column, and remove duplicate rows
    :param filename: <string> input file name (.csv format)
    :param dropped_column: <string> name of the feature to drop
    :return df: <pandas dataframe> sample data
    """

    # read data
    df = pd.read_csv(filename)
    print("Original data        : ", df.shape, "- (#observations, #columns)")

    # drop column
    df = df.drop(dropped_column, axis=1)

    # drop duplicates
    df = df.drop_duplicates()
    print("Data after dropping  : ", df.shape, "- (#observations, #columns)")

    return df

In [51]:
# input
file_credit = "churn.csv"

# call the function
data = importData(filename = file_credit,
                  dropped_column = "Phone")

Original data        :  (3333, 21) - (#observations, #columns)
Data after dropping  :  (3333, 20) - (#observations, #columns)


In [52]:
data.head()

Unnamed: 0,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


#### <b><font> Data Preprocessing:</font></b>
* Input-Output Split, Train-Test Split
* Processing Categorical
* Imputation, Normalization, Drop Duplicates

In [53]:
def extractInputOutput(data,
                       output_column_name):
    """
    Split the data into input and output
    :param data: <pandas dataframe> data for all samples
    :param output_column_name: <string or list> name of the output column
    :return input_data: <pandas dataframe> input data
    :return output_data: <pandas series, or dataframe when a list is passed> output data
    """
    # build the output
    output_data = data[output_column_name]

    # build the input
    input_data = data.drop(output_column_name,
                           axis = 1)
    
    return input_data, output_data


In [54]:
output_column_name = ["Churn?"]

X, y = extractInputOutput(data = data,
                          output_column_name = output_column_name)

#### **Train-Test Split**

In [55]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.25,
                                                    random_state = 123)

In [56]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(2499, 19)
(2499, 1)
(834, 19)
(834, 1)


In [57]:
# Ratio
X_test.shape[0] / X.shape[0]

0.2502250225022502
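
The split above is purely random. Because the churn classes are imbalanced, one option is a stratified split, which keeps the class proportions roughly equal in train and test. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are assumptions, not the notebook's variables) and the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn target: 85% "False.", 15% "True."
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array(["False."] * 850 + ["True."] * 150)

# stratify=y_demo asks for the same class mix in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=123, stratify=y_demo
)

train_share = np.mean(y_tr == "True.")
test_share = np.mean(y_te == "True.")
print(train_share, test_share)
```

Whether stratification changes the results for this dataset is not something the notebook tests; it is shown only as an option.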

#### **Data Imputation**

- The process of filling in missing (NaN) values
- There are 2 things to consider:
  - Numerical Imputation
  - Categorical Imputation
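
In case a dataset does contain NaN values, the two kinds of imputation could look like the minimal sketch below. It uses scikit-learn's `SimpleImputer` on a toy frame; the column names are borrowed from the dataset but the values and gaps are invented, and the choice of median and most-frequent strategies is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with gaps; values are invented for illustration
toy = pd.DataFrame({
    "Day Mins": [265.1, np.nan, 243.4, 299.4],
    "Int'l Plan": pd.Series(["no", "no", np.nan, "yes"], dtype=object),
})
num_cols = ["Day Mins"]
cat_cols = ["Int'l Plan"]

# Numerical columns: fill with the median; categorical: fill with the mode
num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

toy_imputed = toy.copy()
toy_imputed[num_cols] = num_imputer.fit_transform(toy[num_cols])
toy_imputed[cat_cols] = cat_imputer.fit_transform(toy[cat_cols])
print(toy_imputed)
```

As with the standardizer later on, the imputers would be fitted on the training data only and then reused to transform the test data.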

In [58]:
X_train.isnull().sum()

State             0
Account Length    0
Area Code         0
Int'l Plan        0
VMail Plan        0
VMail Message     0
Day Mins          0
Day Calls         0
Day Charge        0
Eve Mins          0
Eve Calls         0
Eve Charge        0
Night Mins        0
Night Calls       0
Night Charge      0
Intl Mins         0
Intl Calls        0
Intl Charge       0
CustServ Calls    0
dtype: int64

**Separate the categorical and numerical data**

In [59]:
X_train.columns

Index(['State', 'Account Length', 'Area Code', 'Int'l Plan', 'VMail Plan',
       'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins',
       'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge',
       'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls'],
      dtype='object')

In [60]:
# select_dtypes(include="number") keeps only the integer and float columns
X_train_numerical = X_train.select_dtypes(include = "number")

In [61]:
X_train_numerical.head()

Unnamed: 0,Account Length,Area Code,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
1066,117,510,25,216.0,140,36.72,224.1,69,19.05,267.9,112,12.06,11.8,4,3.19,0
1553,86,415,0,217.8,93,37.03,214.7,95,18.25,228.7,70,10.29,11.3,7,3.05,0
2628,37,415,0,221.0,126,37.57,204.5,110,17.38,118.0,98,5.31,6.8,3,1.84,4
882,130,415,0,162.8,113,27.68,290.3,111,24.68,114.9,140,5.17,7.2,3,1.94,1
984,77,415,0,142.3,112,24.19,306.3,111,26.04,196.5,82,8.84,9.9,1,2.67,1


In [62]:
# Area Code is stored as an integer but is really a categorical label, so drop it from the numerical set
num_categorical = ["Area Code"]
X_train_numerical = X_train_numerical.drop(num_categorical, axis = 1)
X_train_numerical.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
1066,117,25,216.0,140,36.72,224.1,69,19.05,267.9,112,12.06,11.8,4,3.19,0
1553,86,0,217.8,93,37.03,214.7,95,18.25,228.7,70,10.29,11.3,7,3.05,0
2628,37,0,221.0,126,37.57,204.5,110,17.38,118.0,98,5.31,6.8,3,1.84,4
882,130,0,162.8,113,27.68,290.3,111,24.68,114.9,140,5.17,7.2,3,1.94,1
984,77,0,142.3,112,24.19,306.3,111,26.04,196.5,82,8.84,9.9,1,2.67,1


In [63]:
numerical_column = list(X_train_numerical.columns.values)
numerical_column

['Account Length',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls']

**Categorical Imputation**

In [64]:
X_train_categorical = X_train.drop(list(X_train_numerical.columns.values), axis=1)
X_train_categorical.head()

Unnamed: 0,State,Area Code,Int'l Plan,VMail Plan
1066,KS,510,no,yes
1553,CO,415,no,no
2628,TN,415,no,no
882,FL,415,no,no
984,NV,415,no,no


In [65]:
categorical_column = list(X_train_categorical.columns.values)

#### **Preprocessing Categorical Variables**

In [66]:
categorical_ohe = pd.get_dummies(X_train_categorical)

In [67]:
categorical_ohe.head(2)

Unnamed: 0,Area Code,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_DE,...,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1066,510,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True
1553,415,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,True,False


In [68]:
def extractCategorical(data, categorical_column):
    """
    Extract the categorical data with One Hot Encoding
    :param data: <pandas dataframe> sample data
    :param categorical_column: <list> list of categorical columns
    :return categorical_ohe: <pandas dataframe> sample data with one-hot encoding
    """
    data_categorical = data[categorical_column]
    categorical_ohe = pd.get_dummies(data_categorical)

    return categorical_ohe

In [69]:
X_train_categorical_ohe = extractCategorical(data = X_train,
                                             categorical_column = categorical_column)

In [70]:
X_train_categorical_ohe.head()

Unnamed: 0,Area Code,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_DE,...,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1066,510,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True
1553,415,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,True,False
2628,415,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False
882,415,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False
984,415,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False


In [71]:
ohe_columns = X_train_categorical_ohe.columns
ohe_columns

Index(['Area Code', 'State_AK', 'State_AL', 'State_AR', 'State_AZ', 'State_CA',
       'State_CO', 'State_CT', 'State_DC', 'State_DE', 'State_FL', 'State_GA',
       'State_HI', 'State_IA', 'State_ID', 'State_IL', 'State_IN', 'State_KS',
       'State_KY', 'State_LA', 'State_MA', 'State_MD', 'State_ME', 'State_MI',
       'State_MN', 'State_MO', 'State_MS', 'State_MT', 'State_NC', 'State_ND',
       'State_NE', 'State_NH', 'State_NJ', 'State_NM', 'State_NV', 'State_NY',
       'State_OH', 'State_OK', 'State_OR', 'State_PA', 'State_RI', 'State_SC',
       'State_SD', 'State_TN', 'State_TX', 'State_UT', 'State_VA', 'State_VT',
       'State_WA', 'State_WI', 'State_WV', 'State_WY', 'Int'l Plan_no',
       'Int'l Plan_yes', 'VMail Plan_no', 'VMail Plan_yes'],
      dtype='object')
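
Note that `Area Code` appears in the encoded frame as a single integer column: by default `pd.get_dummies` encodes only object, string, and category columns. The sketch below, on a toy frame with invented rows, shows the default behaviour and one possible way to one-hot encode the area code by casting it to string first. Applying this to the notebook would change the number of columns, so it is shown only as an option.

```python
import pandas as pd

toy_cat = pd.DataFrame({
    "State": ["KS", "OH", "NJ"],
    "Area Code": [415, 408, 510],
})

# Default: the integer column passes through unchanged
default_ohe = pd.get_dummies(toy_cat)
print(list(default_ohe.columns))

# Casting to string makes get_dummies treat the area code as a category
as_category = pd.get_dummies(toy_cat.astype({"Area Code": str}))
print(list(as_category.columns))
```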

#### **Join the Numerical and Categorical Data**

In [72]:
X_train_concat = pd.concat([X_train_numerical,
                            X_train_categorical_ohe],
                           axis = 1)

In [73]:
X_train_concat.shape

(2499, 71)

#### **Standardizing Variables**

In [74]:
from sklearn.preprocessing import StandardScaler

# Build the function
def standardizerData(data):
    """
    Standardize the data
    :param data: <pandas dataframe> sample data
    :return standardized_data: <pandas dataframe> standardized sample data
    :return standardizer: fitted scaler used to standardize the data
    """
    data_columns = data.columns  # keep the column names
    data_index = data.index  # keep the index

    # create and fit the standardizer
    standardizer = StandardScaler()
    standardizer.fit(data)

    # transform data
    standardized_data_raw = standardizer.transform(data)
    standardized_data = pd.DataFrame(standardized_data_raw)
    standardized_data.columns = data_columns
    standardized_data.index = data_index

    return standardized_data, standardizer

In [75]:
X_train_clean, standardizer = standardizerData(data = X_train_concat)

In [76]:
X_train_clean.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,...,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1066,0.422132,1.249874,0.665609,1.957851,0.665555,0.451895,-1.551391,0.452184,1.316807,0.612775,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,-1.633892,1.633892
1553,-0.360628,-0.586231,0.698711,-0.377054,0.699089,0.265705,-0.244133,0.265761,0.543769,-1.551279,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,0.612036,-0.612036
2628,-1.597894,-0.586231,0.757558,1.262347,0.757504,0.063668,0.510055,0.063026,-1.639273,-0.108576,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,0.612036,-0.612036
882,0.750387,-0.586231,-0.312731,0.616522,-0.312356,1.76315,0.560334,1.764138,-1.700406,2.055478,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,0.612036,-0.612036
984,-0.587881,-0.586231,-0.689724,0.566844,-0.68989,2.08007,0.560334,2.081057,-0.091226,-0.932977,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,0.612036,-0.612036


#### <b><font> Training Machine Learning:</font></b>

#### **Benchmark / Baseline**

- Baseline for the later evaluation
- Because this is classification, we can take it from the proportion of the largest target class
- In other words, predict "False." (no churn) for every customer, without any modeling

In [77]:
y_train.value_counts(normalize = True)

# baseline accuracy = 85%

Churn?
False.    0.85114
True.     0.14886
Name: proportion, dtype: float64
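
The same majority-class baseline can be expressed as a model with scikit-learn's `DummyClassifier`, which makes it easy to score with the same API as the real models. A minimal sketch on a synthetic target with an 85/15 split (`X_demo` and `y_demo` are stand-ins, not the notebook's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

y_demo = np.array(["False."] * 85 + ["True."] * 15)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)
print(baseline_acc)
```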

In [78]:
# Import from sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [79]:
# KNN model
knn = KNeighborsClassifier()
knn.fit(X_train_clean, y_train.values.ravel())  # flatten the one-column target to 1-D
knn.score(X_train_clean, y_train)

0.8759503801520608

In [80]:
# Logistic Regression model
logreg = LogisticRegression(random_state = 123)
logreg.fit(X_train_clean, y_train.values.ravel())  # flatten the one-column target to 1-D
logreg.score(X_train_clean, y_train)

0.8639455782312925

In [81]:
# Model Decision Tree
decTree = DecisionTreeClassifier(random_state = 123)
decTree.fit(X_train_clean, y_train)
decTree.score(X_train_clean, y_train)

1.0

In [82]:
# Random Forest model
RF = RandomForestClassifier(random_state = 123)
RF.fit(X_train_clean, y_train.values.ravel())  # flatten the one-column target to 1-D
RF.score(X_train_clean, y_train)

0.9995998399359743
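
The scores above are computed on the same data the models were trained on, so values at or near 1.0 for the tree-based models mostly indicate memorization. A hedged sketch of comparing training accuracy with cross-validated accuracy, on synthetic data from `make_classification` (an assumption standing in for the churn data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced, slightly noisy data for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     weights=[0.85, 0.15], flip_y=0.05,
                                     random_state=123)

tree = DecisionTreeClassifier(random_state=123)

# Accuracy on the data the tree was fitted on
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean accuracy over 5 held-out folds
cv_acc = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy").mean()
print(train_acc, cv_acc)
```

The gap between the two numbers is a rough indicator of overfitting; the grid search in the next section relies on the cross-validated figure.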

**Experimentation**

In [83]:
# Import grid search with cross-validation
from sklearn.model_selection import GridSearchCV

In [84]:
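# Parameter grids for the experiments

# KNN parameters
knn_param = {"n_neighbors": [3, 5, 7, 9, 11, 13, 15],
             "weights": ["uniform", "distance"],
             "metric": ["euclidean", "manhattan", "minkowski"]}

# Logistic Regression parameters
# Only valid penalty/solver pairs are listed: lbfgs does not support "l1",
# "elasticnet" would also need l1_ratio, and the string "none" is not accepted by recent scikit-learn
log_reg_param = [{"penalty": ["l1", "l2"],
                  "C": [0.01, 0.1, 1, 10, 100],
                  "solver": ["liblinear", "saga"]},
                 {"penalty": ["l2"],
                  "C": [0.01, 0.1, 1, 10, 100],
                  "solver": ["lbfgs"]}]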

# Parameter Decision Tree
decTree_param = {"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
                 "criterion": ["gini", "entropy", "log_loss"]}

# Parameter Random Forest
rf_param = {"n_estimators": [100, 200, 300, 400, 500],
            "max_depth": [None, 10, 20, 30, 40, 50],
            "criterion": ["gini", "entropy", "log_loss"]}

In [85]:
random_knn = GridSearchCV(estimator = KNeighborsClassifier(),
                          param_grid = knn_param,
                          cv = 5,
                          scoring = "accuracy")
random_log_reg = GridSearchCV(estimator = LogisticRegression(random_state=123, max_iter=1000),
                              param_grid = log_reg_param,
                              cv = 5,
                              scoring = "accuracy")
random_decTree = GridSearchCV(estimator = DecisionTreeClassifier(random_state=123),
                              param_grid = decTree_param,
                              cv = 5,
                              scoring = "accuracy") 
random_rf = GridSearchCV(estimator = RandomForestClassifier(random_state=123),
                         param_grid = rf_param,
                         cv = 5,
                         scoring = "accuracy")
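
Before running a grid search it can help to estimate its cost: each parameter combination is fitted once per fold. The sketch below counts the candidates with `ParameterGrid` for a copy of the Random Forest grid (`rf_param_demo` is a hypothetical name for that copy).

```python
from sklearn.model_selection import ParameterGrid

# Copy of the Random Forest grid defined in the notebook
rf_param_demo = {"n_estimators": [100, 200, 300, 400, 500],
                 "max_depth": [None, 10, 20, 30, 40, 50],
                 "criterion": ["gini", "entropy", "log_loss"]}

n_candidates = len(ParameterGrid(rf_param_demo))  # 5 * 6 * 3 combinations
n_fits = n_candidates * 5                         # one fit per candidate per fold (cv = 5)
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on all the data passed to `fit`.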

In [88]:
# Run the grid search fits
#random_knn.fit(X_train_clean, y_train.values.ravel())
#random_log_reg.fit(X_train_clean, y_train.values.ravel())
random_decTree.fit(X_train_clean, y_train.values.ravel())
#random_rf.fit(X_train_clean, y_train.values.ravel())

In [89]:
# Evaluate the model (accuracy on the training data)
#random_knn.score(X_train_clean, y_train)
#random_log_reg.score(X_train_clean, y_train)
random_decTree.score(X_train_clean, y_train)
#random_rf.score(X_train_clean, y_train)

0.9595838335334134

In [90]:
# Best parameters
#random_knn.best_params_
#random_log_reg.best_params_
random_decTree.best_params_
#random_rf.best_params_

{'criterion': 'entropy', 'max_depth': 6}
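
Two attributes of a fitted `GridSearchCV` are worth noting here. `best_score_` is the mean cross-validated score of the best candidate, which is usually a more honest estimate than the training score. `best_estimator_` is, with the default `refit=True`, already refitted on all the data passed to `fit`. A minimal sketch on synthetic data (the data and the small grid are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=123)

search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=123),
                      param_grid={"max_depth": [2, 4, 6]},
                      cv=5,
                      scoring="accuracy")
search.fit(X_demo, y_demo)

# Mean cross-validated accuracy of the best parameter combination
print(search.best_params_, round(search.best_score_, 3))

# Already fitted on all of X_demo, y_demo because refit=True by default
refit_tree = search.best_estimator_
print(refit_tree.score(X_demo, y_demo))
```

Rebuilding the model by hand from `best_params_` gives an equivalent estimator when the same `random_state` and data are used.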

**Build the models with the best parameters and fit them on the full training data**

In [None]:
# Build the KNN model with the best parameters
best_knn = KNeighborsClassifier(n_neighbors=random_knn.best_params_["n_neighbors"],
                                weights=random_knn.best_params_["weights"],
                                metric=random_knn.best_params_["metric"])  # the grid tunes "metric", not "algorithm"

# Fit model
best_knn.fit(X_train_clean, y_train.values.ravel())

In [None]:
# Build the Logistic Regression model with the best parameters
best_log_reg = LogisticRegression(C=random_log_reg.best_params_["C"],
                                  penalty=random_log_reg.best_params_["penalty"],
                                  solver=random_log_reg.best_params_["solver"],
                                  max_iter=1000,  # same setting as in the grid search
                                  random_state=123)

# Fit model
best_log_reg.fit(X_train_clean, y_train.values.ravel())

In [92]:
# Build the Decision Tree model with the best parameters
best_decTree = DecisionTreeClassifier(max_depth = random_decTree.best_params_["max_depth"],
                                      criterion= random_decTree.best_params_['criterion'],
                                      random_state = 123)
# Fit model
best_decTree.fit(X_train_clean, y_train.values.ravel())

In [None]:
# Build the Random Forest model with the best parameters
best_rf = RandomForestClassifier(n_estimators=random_rf.best_params_["n_estimators"],
                                 max_depth=random_rf.best_params_["max_depth"],
                                 criterion=random_rf.best_params_["criterion"],
                                 random_state=123)

# Fit model
best_rf.fit(X_train_clean, y_train.values.ravel())

#### **Test Prediction**

1. Prepare the test dataset
2. Apply the same preprocessing that was applied to the train dataset
3. Use the `standardizer` and the one-hot column list (`ohe_columns`) that were fitted on the train dataset

In [93]:
def extractTest(data,
                numerical_column, categorical_column, ohe_column,
                standardizer):
    """
    Extract and clean the test data
    :param data: <pandas dataframe> test sample data
    :param numerical_column: <list> numerical columns
    :param categorical_column: <list> categorical columns
    :param ohe_column: <list> one-hot-encoded columns from the categorical data
    :param standardizer: <sklearn method> fitted data standardizer
    :return cleaned_data: <pandas dataframe> final data
    """
    # Select the columns
    numerical_data = data[numerical_column]
    categorical_data = data[categorical_column]

    # One-hot encode, then align to the training columns
    # (reindex returns a new frame, so the result must be assigned)
    categorical_data = pd.get_dummies(categorical_data)
    categorical_data = categorical_data.reindex(columns = ohe_column,
                                                fill_value = 0)

    # Join and standardize, keeping the column names and the index
    concat_data = pd.concat([numerical_data, categorical_data],
                             axis = 1)
    cleaned_data = pd.DataFrame(standardizer.transform(concat_data),
                                columns = concat_data.columns,
                                index = concat_data.index)

    return cleaned_data

In [94]:
X_test_clean = extractTest(data = X_test,
                           numerical_column = numerical_column,
                           categorical_column = categorical_column,
                           ohe_column = ohe_columns,
                           standardizer = standardizer)

In [95]:
X_test_clean.shape

(834, 71)
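
The shape matches the training matrix here because every category seen in training also occurs in the test split. When that is not the case, aligning the encoded test columns to the training columns matters. The sketch below, on toy frames with invented rows, shows `reindex` adding missing dummy columns (filled with 0) and dropping unseen ones; note that `reindex` returns a new frame rather than modifying in place.

```python
import pandas as pd

train_cat = pd.DataFrame({"State": ["KS", "OH", "NJ"]})
test_cat = pd.DataFrame({"State": ["OH", "TX"]})  # "TX" is unseen; "KS" and "NJ" are absent

train_ohe = pd.get_dummies(train_cat)
test_ohe = pd.get_dummies(test_cat)

# Align to the training layout: same columns, same order
test_aligned = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)

print(list(test_ohe.columns))
print(list(test_aligned.columns))
```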

In [97]:
# Check on the test data
#best_knn.score(X_test_clean, y_test)
#best_log_reg.score(X_test_clean, y_test)
best_decTree.score(X_test_clean, y_test)
#best_rf.score(X_test_clean, y_test)

0.9532374100719424
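
The test accuracy is well above the roughly 85% majority-class baseline. With imbalanced classes, though, accuracy alone can hide how well the churners themselves are detected, so per-class metrics are a useful complement. A minimal sketch with invented labels (these are not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration only
y_true = np.array(["False."] * 8 + ["True."] * 2)
y_pred = np.array(["False."] * 8 + ["True.", "False."])

# Rows are true classes, columns are predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=["False.", "True."])

# Share of actual churners that were caught, and share of churn predictions that were right
recall = recall_score(y_true, y_pred, pos_label="True.")
precision = precision_score(y_true, y_pred, pos_label="True.")
print(cm)
print(recall, precision)
```

In this toy case accuracy is 0.9 while recall on the churn class is only 0.5, which is the kind of gap these metrics are meant to expose.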