<div style="text-align:center;">
<img src="https://upload.wikimedia.org/wikipedia/en/thumb/1/1e/Institute_of_Business_Administration%2C_Karachi_%28logo%29.png/300px-Institute_of_Business_Administration%2C_Karachi_%28logo%29.png" width="100px">

# **MACHINE LEARNING I - FINAL PROJECT**
## **CUSTOMER CHURN PREDICTION**
### Syed Asad Rizvi  ERP 25365
### Fareed Hassan Khan  ERP 25367

</div>

_____

Importing Libraries

In [None]:
import pandas as pd
import numpy as np
from numpy import mean
from sklearn.model_selection import train_test_split, RepeatedKFold, KFold, GridSearchCV, cross_val_score
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              BaggingClassifier, VotingClassifier, StackingClassifier)
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from mixed_naive_bayes import MixedNB
from keras.models import Sequential
from keras.layers import Dense

# For removing any warnings
import warnings
warnings.filterwarnings('ignore')

### **Importing dataset**

In [None]:
df = pd.read_csv('Telecom_customer churn.csv')

Printing the first few rows and the shape of dataset

In [None]:
df.head()

In [None]:
df.shape

### **Cleaning** the dataset (Removing NaN values)

In [None]:
df.dropna(subset=['rev_Mean', 'kid11_15', 'dualband', 'area', 'hnd_price', 'change_mou'], inplace=True)
df.drop(['avg6mou', 'avg6qty', 'avg6rev', 'prizm_social_one', 'ownrent', 'lor', 'dwlltype', 'adults', 'infobase', 'numbcars', 
'HHstatin', 'dwllsize', 'income', 'hnd_webcap'], axis=1, inplace=True)
df.isna().sum().sum()

In [None]:
df.shape
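
The two cleaning steps above can be sketched on a made-up frame. The column names `a`, `b`, `c` and all values are hypothetical, and the missing-share rule of thumb is an assumption, since the notebook does not state how the dropped columns were chosen. `dropna(subset=...)` removes rows with gaps in the listed columns, while `drop(..., axis=1)` removes whole columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; not the telecom dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],        # few gaps: drop the affected rows
    "b": [np.nan, np.nan, np.nan, 8.0],  # mostly missing: drop the column
    "c": ["x", "y", "z", "x"],
})

# Share of missing values per column (one possible way to pick a rule per column)
missing_share = toy.isna().mean()

# Row-wise removal for column a, column-wise removal for column b
cleaned = toy.dropna(subset=["a"]).drop(["b"], axis=1)
remaining_nans = int(cleaned.isna().sum().sum())

print(missing_share)
print(cleaned.shape, remaining_nans)
```

Checking `isna().sum().sum()` afterwards, as the notebook does, confirms that no gaps remain.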

### **One hot encoding** on dataset

In [None]:
df_onehot = pd.get_dummies(df)

In [None]:
df_onehot
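
As a small illustration of what `pd.get_dummies` does to a mixed frame, the sketch below uses a hypothetical two-column table (the names `months` and `area` are placeholders chosen for the example). Numeric columns pass through unchanged and each text column becomes one indicator column per category.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative only
toy = pd.DataFrame({
    "months": [12, 30, 7],
    "area": ["north", "south", "north"],
})

# Object columns are expanded into one indicator column per category;
# numeric columns are kept as they are
toy_onehot = pd.get_dummies(toy)

print(list(toy_onehot.columns))
print(toy_onehot)
```

Because every category becomes its own column, a frame with many high-cardinality text columns can grow considerably wider after encoding.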

### **Fitting model code** for calculating the **ROC AUC** value of each model 

In [None]:
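def fit_model(model, model_name):
    # Fits on the global trainX/trainy split and prints the ROC AUC on testX/testy
    model.fit(trainX, trainy)
    md_probs = model.predict_proba(testX)[:, 1]  # probability of the positive class (churn = 1)
    md_auc = roc_auc_score(testy, md_probs)
    print(model_name, " : ", md_auc)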
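
For intuition about the number that `roc_auc_score` reports, the sketch below computes ROC AUC from its ranking interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The helper `pairwise_auc` is a hypothetical name written for this illustration, and it is a quadratic-time sketch rather than the library's implementation.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why only the predicted probabilities, not the hard class labels, are passed to the metric.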

### **Classification Models**

In the next step, several **machine learning classification** methods were applied, one by one, to the **one-hot**
encoded dataset.
These methods include:

* K-Nearest Neighbors

* Logistic Regression

* Decision Trees

* Random Forest

* Gradient Boosting

* Naïve Bayes 
  
  - Gaussian NB

  - Categorical NB
  
  - Mixed NB

* Neural Networks

In [None]:
df_onehot = df_onehot.loc[:, df_onehot.columns != 'churn']
y = df[['churn']]

trainX, testX, trainy, testy = train_test_split(df_onehot, y, test_size=0.3, random_state=2)

1) K-Nearest Neighbors (kNN)

In [None]:
pipe_kn = Pipeline([("scaler", MinMaxScaler()),("KNN Classifier", KNeighborsClassifier(n_neighbors=5))])
fit_model(pipe_kn, "KNN")
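
The pipeline above scales the features before kNN. A minimal sketch of why that matters, using two hypothetical training points and one query (all numbers invented for the example): with raw values the large-scale feature dominates the Euclidean distance, and after min-max scaling the nearest neighbour changes.

```python
import numpy as np

train = np.array([[1000.0, 0.9],   # point A
                  [1100.0, 0.1]])  # point B
query = np.array([1040.0, 0.1])


def nearest(train_pts, q):
    # Index of the training point with the smallest Euclidean distance to q
    return int(np.argmin(np.linalg.norm(train_pts - q, axis=1)))


raw_nn = nearest(train, query)

# Min-max scaling fitted on the training points only, then applied to the query
lo, hi = train.min(axis=0), train.max(axis=0)
scaled_nn = nearest((train - lo) / (hi - lo), (query - lo) / (hi - lo))

print(raw_nn, scaled_nn)
```

Placing the scaler inside a `Pipeline` means its minimum and maximum are learned from the training split only, so no information from the test split leaks into the scaling.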

2) Logistic Regression

In [None]:
pipe_lg = Pipeline([("scaler", MinMaxScaler()),("Logistic", LogisticRegression())])
fit_model(pipe_lg, "Logistic")

3) Decision Tree

In [None]:
dt = DecisionTreeClassifier(max_depth=5)  
fit_model(dt, "Decision Tree") 

4) Random Forest Classifier

In [None]:
rf = RandomForestClassifier(max_depth=20,n_estimators=1000)
fit_model(rf, "Random Forest Classifier")

5) Gradient Boosting Classifier

In [None]:
gb = GradientBoostingClassifier(max_depth=5,n_estimators=200)
fit_model(gb, "Gradient Boosting Classifier")

In [None]:
# Including Learning rate parameter
gb = GradientBoostingClassifier(max_depth=5,n_estimators=200, learning_rate=0.05)
fit_model(gb, "Gradient Boosting Classifier With Learning Rate")

6) Naïve Bayes

In [None]:
# Gaussian NB
nb_g = GaussianNB()
fit_model(nb_g, "Gaussian")

In [None]:
# Categorical NB

categorical_columns = list(df.columns[df.dtypes == 'object'])
categorical_columns.append('churn')


def convert_categorical(df1):
    df_q = pd.DataFrame()
    label_encoder = LabelEncoder()
    for col in df1:
        if col not in categorical_columns:
            df_q[col] = pd.qcut(df1[col], 5, duplicates='drop')            
            df_q[col]= label_encoder.fit_transform(df_q[col])
            df_q[col] = df_q[col].astype('str')

    X_cat = df1[categorical_columns[:-1]]
    df_cat = pd.concat([df_q,X_cat],axis=1)
    return df_cat

 
temp_df1 = convert_categorical(df) 
temp_df1.head()

X_cat = convert_categorical(df)
trainX, testX, trainy, testy = train_test_split(X_cat, y, test_size=0.3, random_state=2)

# Model Code
nb_c = CategoricalNB(min_categories = 100)
fit_model(nb_c, "Naive Bayes Categorical")
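
The quantile binning used in `convert_categorical` can be seen on a small hypothetical column. This sketch swaps the `LabelEncoder` step for the categorical accessor `.cat.codes`, which also yields one integer per bin; that substitution is made for brevity and is not the notebook's own code.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])  # hypothetical numeric column

# Five equal-frequency bins; duplicate bin edges are dropped if they occur
bins = pd.qcut(values, 5, duplicates="drop")

# Integer index of the bin each value falls into
codes = bins.cat.codes

print(bins.cat.categories)
print(codes.tolist())
```

Each numeric column is thereby reduced to at most five ordered categories, which is the form `CategoricalNB` expects.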

In [None]:
# Mixed NB
nb_mix = MixedNB(categorical_features=[1,2,3])
fit_model(nb_mix, "Naive Bayes Mixed")

7) Neural Networks

In [None]:
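# Re-create the one-hot split: the Naïve Bayes cells above overwrote trainX/testX
trainX, testX, trainy, testy = train_test_split(df_onehot, y, test_size=0.3, random_state=2)

scaler = StandardScaler()
trainX = scaler.fit_transform(trainX)
testX = scaler.transform(testX)

n_features = trainX.shape[1]  # input width is read from the data instead of being hard-coded
model = Sequential()
model.add(Dense(187, input_dim=n_features, activation='relu'))
model.add(Dense(187, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(trainX, trainy, epochs=5, batch_size=10)
_, accuracy = model.evaluate(testX, testy)
print('Accuracy: %.2f' % (accuracy*100))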

### **Ensemble Methods**

After applying the **classification models**, ensemble methods were used to determine whether they improve model performance.

Several ensemble methods were applied on the **one-hot encoded** dataset one-by-one.

These methods include:

* Bagging Classifier

* Stacking Classifier

* Voting Classifier


1. **Bagging** Classifier

In [None]:
cv = RepeatedKFold(n_splits=10, n_repeats=1)
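# The wrapped model is passed positionally: its keyword is `base_estimator` in older scikit-learn and `estimator` in newer releases
reg_bg = BaggingClassifier(GradientBoostingClassifier(max_depth=5, n_estimators=200),
                        n_estimators=20, random_state=0)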
scores = cross_val_score(reg_bg, df_onehot, y, cv=cv)
score = format(mean(scores), '.4f')
print(score)
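
Bagging trains each of its members on a bootstrap sample, that is, rows drawn with replacement from the training data. The sketch below shows only that sampling step in plain Python; `bootstrap_indices` is a hypothetical helper and the sample size of 1000 is arbitrary.

```python
import random


def bootstrap_indices(n_samples, seed):
    """Draw n_samples row indices with replacement, as one bagging member would."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


n = 1000
sample = bootstrap_indices(n, seed=0)

# Share of distinct rows that made it into this bootstrap sample
unique_share = len(set(sample)) / n
print(round(unique_share, 3))
```

On average roughly 63% of the rows appear in any one bootstrap sample, so the 20 members each see a somewhat different view of the data, which is what reduces the variance of the averaged prediction.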

2. **Stacking** Classifier

In [None]:
cv = RepeatedKFold(n_splits=10, n_repeats=1)
estimators = [
('lr', LogisticRegression()),
('dt', DecisionTreeClassifier(max_depth=5)),
('rf', RandomForestClassifier(max_depth=20, n_estimators=1000))
]

reg_sr = StackingClassifier(estimators=estimators, final_estimator=GradientBoostingClassifier(max_depth=5, n_estimators=200, random_state=42))
scores = cross_val_score(reg_sr, df_onehot, y, cv=cv)
score = format(mean(scores), '.4f')
print(score)

3. **Voting** Classifier

In [None]:
cv = RepeatedKFold(n_splits=10, n_repeats=1)
r1 = DecisionTreeClassifier(max_depth=5)
r2 = RandomForestClassifier(max_depth=20,n_estimators=1000)
r3 = GradientBoostingClassifier(max_depth=5,n_estimators=200)

reg_vr = VotingClassifier([('dt', r1), ('rf', r2),('gb', r3)])
scores = cross_val_score(reg_vr, df_onehot, y, cv=cv)
score = format(mean(scores), '.4f')
print(score)
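
`VotingClassifier` uses hard (majority) voting by default. A minimal sketch of that rule, with invented predictions for five customers from three models (the dictionary keys simply mirror the estimator names used above):

```python
from collections import Counter

# Hypothetical class predictions of three models for five customers
preds = {
    "dt": [0, 1, 1, 0, 1],
    "rf": [0, 1, 0, 0, 1],
    "gb": [1, 1, 0, 0, 0],
}


def majority_vote(predictions):
    """Return, per sample, the class label predicted by most models."""
    per_sample = zip(*predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]


voted = majority_vote(preds)
print(voted)
```

Note that `cross_val_score` without a `scoring` argument reports the estimator's default score, which for classifiers is accuracy, so the figures printed by the ensemble cells are not directly comparable with the ROC AUC values printed by `fit_model`.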

### **Data Pre-Processing** (Filling Missing Data)

Importing dataset

In [None]:
df = pd.read_csv('Telecom_customer churn.csv')

In [None]:
# Names of the columns that contain missing values

categorical_columns = ['prizm_social_one', 'hnd_webcap', 'ownrent', 'infobase', 'HHstatin', 'dwllsize', 'dwlltype']
numeric_columns = ['avg6mou', 'avg6qty', 'avg6rev', 'lor', 'adults', 'income', 'numbcars']

Filling Missing categorical columns data with **mode()**

In [None]:
for each in categorical_columns:
    mode_value = df[each].mode()
    # Assign the result back; in-place fillna on a column selection is unreliable in recent pandas
    df[each] = df[each].fillna(mode_value[0])

Filling Missing numerical columns data 

1. Using **mean()**

In [None]:
for each in numeric_columns:
    mean_value = df[each].mean()
    # Assign the result back; in-place fillna on a column selection is unreliable in recent pandas
    df[each] = df[each].fillna(mean_value)

In [None]:
# -------------------- Best three models Results --------------------# 

# One hot encoding on dataset
df_onehot = pd.get_dummies(df)

# X and y Split
df_onehot = df_onehot.loc[:, df_onehot.columns != 'churn']
y = df[['churn']]

# Train Test Split
trainX, testX, trainy, testy = train_test_split(df_onehot, y, test_size=0.3, random_state=2)

# Logistic Regression
pipe_kn = Pipeline([("scaler", MinMaxScaler()),("Logistic", LogisticRegression())])
fit_model(pipe_kn, "Logistic")

# Random Forest Classifier
rf = RandomForestClassifier(max_depth=20,n_estimators=1000)
fit_model(rf, "Random Forest Classifier")

# Gradient Boosting Classifier
gb = GradientBoostingClassifier(max_depth=5,n_estimators=200)
fit_model(gb, "Gradient Boosting Classifier")

2. Using **Median()**

In [None]:
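# Start again from the raw data: the mean fill above left no NaNs to replace
df = pd.read_csv('Telecom_customer churn.csv')
for each in categorical_columns:
    df[each] = df[each].fillna(df[each].mode()[0])
for each in numeric_columns:
    df[each] = df[each].fillna(df[each].median())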

In [None]:
# -------------------- Best three models Results --------------------# 

# One hot encoding on dataset
df_onehot = pd.get_dummies(df)

# X and y Split
df_onehot = df_onehot.loc[:, df_onehot.columns != 'churn']
y = df[['churn']]

# Train Test Split
trainX, testX, trainy, testy = train_test_split(df_onehot, y, test_size=0.3, random_state=2)

# Logistic Regression
pipe_kn = Pipeline([("scaler", MinMaxScaler()),("Logistic", LogisticRegression())])
fit_model(pipe_kn, "Logistic")

# Random Forest Classifier
rf = RandomForestClassifier(max_depth=20,n_estimators=1000)
fit_model(rf, "Random Forest Classifier")

# Gradient Boosting Classifier
gb = GradientBoostingClassifier(max_depth=5,n_estimators=200)
fit_model(gb, "Gradient Boosting Classifier")

3. Using **Mode()**

In [None]:
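# Start again from the raw data: the earlier fills left no NaNs to replace
df = pd.read_csv('Telecom_customer churn.csv')
for each in categorical_columns + numeric_columns:
    df[each] = df[each].fillna(df[each].mode()[0])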

In [None]:
# -------------------- Best three models Results --------------------# 

# One hot encoding on dataset
df_onehot = pd.get_dummies(df)

# X and y Split
df_onehot = df_onehot.loc[:, df_onehot.columns != 'churn']
y = df[['churn']]

# Train Test Split
trainX, testX, trainy, testy = train_test_split(df_onehot, y, test_size=0.3, random_state=2)

# Logistic Regression
pipe_kn = Pipeline([("scaler", MinMaxScaler()),("Logistic", LogisticRegression())])
fit_model(pipe_kn, "Logistic")

# Random Forest Classifier
rf = RandomForestClassifier(max_depth=20,n_estimators=1000)
fit_model(rf, "Random Forest Classifier")

# Gradient Boosting Classifier
gb = GradientBoostingClassifier(max_depth=5,n_estimators=200)
fit_model(gb, "Gradient Boosting Classifier")
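
The three imputation strategies compared in this section can behave quite differently on a skewed column. The sketch below uses a hypothetical column with one outlier and two gaps; the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with two gaps
col = pd.Series([1.0, 2.0, 2.0, 3.0, 100.0, np.nan, np.nan])

fill_values = {
    "mean": col.mean(),      # pulled up by the outlier
    "median": col.median(),  # robust to the outlier
    "mode": col.mode()[0],   # most frequent value
}

# fillna returns a new Series, so each strategy starts from the same gaps
filled = {name: col.fillna(value) for name, value in fill_values.items()}

print(fill_values)
```

Starting every strategy from the unfilled data is essential: once the gaps are filled, a later `fillna` call has nothing left to replace and the comparison would silently repeat the first strategy.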

### **Grid Search**
Since the **gradient boosting classifier** yields the highest **ROC AUC** value, the last part is to find the optimal hyperparameters for this model.

In [None]:
df = pd.read_csv('Telecom_customer churn.csv')
# Dropping NAN values
df.dropna(subset=['rev_Mean', 'kid11_15', 'dualband', 'area', 'hnd_price', 'change_mou'], inplace=True)

df.drop(['avg6mou', 'avg6qty', 'avg6rev', 'prizm_social_one', 'ownrent', 
        'lor', 'dwlltype', 'adults', 'infobase', 'numbcars', 
        'HHstatin', 'dwllsize', 'income', 'hnd_webcap'], 
         axis=1, inplace=True)

In [None]:
# Rebuild the one-hot features from the freshly cleaned data, keeping churn out of X
df_onehot = pd.get_dummies(df.drop(columns='churn'))
X = df_onehot
y = df['churn']

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
gb_base = GradientBoostingClassifier(max_depth=5, random_state=0)
# 'bootstrap' is a random-forest option; GradientBoostingClassifier does not accept it
param_grid = {
    'max_depth': [5, 10, 15],
    'max_features': [2, 3, 4],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300]
}
grid_search = GridSearchCV(estimator=gb_base, param_grid=param_grid, cv=cv, n_jobs=-1, verbose=2)
grid_search.fit(X, y)
best_grid = grid_search.best_estimator_
print(best_grid)
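
Before launching a search of this size it helps to count the model fits it implies. The sketch below does the arithmetic for the four multi-valued parameters of the grid above together with the `RepeatedKFold` settings; it only counts combinations and trains nothing.

```python
from itertools import product

# The four multi-valued parameters from the grid above
param_grid = {
    "max_depth": [5, 10, 15],
    "max_features": [2, 3, 4],
    "min_samples_split": [8, 10, 12],
    "n_estimators": [100, 200, 300],
}
n_splits, n_repeats = 10, 3  # RepeatedKFold settings used above

# Every combination is evaluated on every fold of every repeat
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * n_splits * n_repeats

print(len(combinations), n_fits)
```

That is 81 parameter combinations and 2430 cross-validation fits, plus one final refit of the best combination on the full data, which explains why `n_jobs=-1` is used to spread the work across all cores.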

### **Hyperparameter Tuning with Cross validation**

To estimate the skill of our best model, the **Gradient Boosting Classifier**, on unseen data, we used **cross validation**. The 10-fold procedure is repeated with 30 different shuffles.

In [None]:
score_onehot = []
s_no = []
for i in range(0,30):
    # prepare the cross-validation procedure
    cv = KFold(n_splits=10, random_state=i, shuffle=True)
    GB = GradientBoostingClassifier(max_depth=5, random_state=0)
    

    scores = cross_val_score(GB, df_onehot, y, scoring='roc_auc', cv=cv) 
    score_onehot.append(mean(scores))
    
    s_no.append(i)
    
scores_df = pd.DataFrame(
    {'S #': s_no,
     'onehot': score_onehot
    })
scores_df.head(10)
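
To show what one pass of shuffled k-fold cross-validation does with the row indices, here is a plain-Python sketch. `kfold_indices` is a hypothetical helper: it mirrors the idea behind `KFold(shuffle=True)` (every sample lands in exactly one test fold) but does not reproduce scikit-learn's exact shuffling.

```python
import random


def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # The first n_samples % n_splits folds get one extra sample
    sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
             for i in range(n_splits)]
    start = 0
    for size in sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size


folds = list(kfold_indices(n_samples=23, n_splits=10, seed=0))
fold_sizes = [len(test) for _, test in folds]
tested = sorted(i for _, test in folds for i in test)

print(fold_sizes)
```

Changing the seed, as the loop above does with `random_state=i`, produces a different partition each time, so the spread of the 30 mean scores indicates how sensitive the estimate is to the particular split.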

### **Feature Importance**

Since we have fitted our best model, the **Gradient Boosting Classifier**, we can now extract the **feature importance**. This is stored in an attribute called `feature_importances_`.

We sorted the importances from greatest to least, removed the least important features, and analyzed the resulting **ROC AUC** value.

In [None]:
X = df_onehot.loc[:, df_onehot.columns != 'churn']
y = df['churn']  # df_onehot holds the features only; the target stays in df

In [None]:
clf = GradientBoostingClassifier(max_depth=5, n_estimators=200, random_state=0)

clf.fit(X,y)

feature_scores = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Printing features scores
feature_scores

In [None]:
# Features with an importance value greater than 0.0069
features_gt = feature_scores[feature_scores > 0.0069]

# Keep only those features
X = df_onehot.loc[:, df_onehot.columns != 'churn']
X = X[features_gt.index]

# Re-split so that fit_model (which reads trainX/testX) sees the reduced feature set
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.3, random_state=2)

# Gradient Boosting Classifier (best model)
gb = GradientBoostingClassifier(max_depth=5, n_estimators=200)
fit_model(gb, "Gradient Boosting Classifier")

The following are the results of feature importance with Gradient Boosting for different threshold values:

| Threshold Value | Total Features | ROC AUC value |
|-----------------|----------------|---------------|
| 0               | 155            | 68.14%        |
| 0.0001          | 151            | 68.51%        |
| 0.001           | 71             | 68.11%        |
| 0.01            | 19             | 68.32%        |
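
The "Total Features" column can be produced by counting how many importances exceed each threshold. The sketch below does this on six hypothetical importances; the feature names and values are invented, whereas in the notebook they would come from `clf.feature_importances_`.

```python
import pandas as pd

# Hypothetical importances; real values come from clf.feature_importances_
feature_scores = pd.Series(
    {"f1": 0.40, "f2": 0.30, "f3": 0.20, "f4": 0.0995, "f5": 0.0005, "f6": 0.0}
)

thresholds = [0, 0.0001, 0.001, 0.01]

# Number of features whose importance strictly exceeds each threshold
kept = {t: int((feature_scores > t).sum()) for t in thresholds}

print(kept)
```

Looping over the thresholds this way, and refitting the model on `feature_scores[feature_scores > t].index` each time, would regenerate a table of this shape in one pass.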

____

### **Winner Model (Gradient Boosting Classifier)**

After implementing different **classification methods** and **ensemble approaches** with
our best models, we concluded that our winner model is the **Gradient Boosting Classifier** with

* max_depth = 5
* n_estimators = 200
* learning_rate = 0.05

It generates a **ROC AUC** value of around **69%**.

In [None]:
gb = GradientBoostingClassifier(max_depth=5,n_estimators=200, learning_rate=0.05)
fit_model(gb, "Gradient Boosting Classifier")