# To Churn or not to Churn?
## Naive Bayes Classifier
### Author: E. Thompson-Becker
##### Code edited from the Python Weka tutorial received from Toronto Metropolitan University, CIND 119 - Introduction to Big Data, by Syed Shariyar Murtaza, Ph.D.

Install Weka's Python wrapper package.

In [1]:
! pip install python-weka-wrapper3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Set up Java and launch the JVM in the Python environment.

In [2]:
import os
import sys
# Point the environment at the Java 11 JDK (Colab path)
sys.path.append("/usr/lib/jvm/java-11-openjdk-amd64/bin/")
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"

import weka.core.jvm as jvm
jvm.start()

DEBUG:weka.core.jvm:Adding bundled jars
DEBUG:weka.core.jvm:Classpath=['/usr/local/lib/python3.7/dist-packages/javabridge/jars/rhino-1.7R4.jar', '/usr/local/lib/python3.7/dist-packages/javabridge/jars/runnablequeue.jar', '/usr/local/lib/python3.7/dist-packages/javabridge/jars/cpython.jar', '/usr/local/lib/python3.7/dist-packages/weka/lib/python-weka-wrapper.jar', '/usr/local/lib/python3.7/dist-packages/weka/lib/weka.jar']
DEBUG:weka.core.jvm:MaxHeapSize=default
DEBUG:weka.core.jvm:Package support disabled


Upload the data file into the Colab environment.

In [3]:
from google.colab import files
uploaded = files.upload()


Saving churn.arff to churn (3).arff


Import the required Weka classes.

In [4]:
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation
from weka.filters import Filter

Load the ARFF file into memory.

In [5]:
loader = Loader(classname="weka.core.converters.ArffLoader")
data_file = "churn.arff"
data = loader.load_file(data_file)

Set the class attribute by its 0-based index.

In [6]:
class_idx = 20
print("Data will be classified on",data.attribute(class_idx))
data.class_index = class_idx

Data will be classified on @attribute Churn {FALSE,TRUE}


## 1. Naive Bayes classifier including all attributes

Split the data into training and test sets (70/30).

In [7]:
train, test = data.train_test_split(70.0, Random(1))
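
As a rough illustration of what a seeded percentage split does, here is a minimal pure-Python sketch. It is not Weka's implementation and will not reproduce Weka's row ordering; the helper `percentage_split` and the 3333-row input are assumptions for illustration (3333 rows would be consistent with the 1000 test instances reported in the evaluation output).

```python
import random

def percentage_split(rows, percentage, seed):
    """Shuffle a copy of rows with a fixed seed and cut it at the given percentage."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    train_size = int(round(len(shuffled) * percentage / 100.0))
    return shuffled[:train_size], shuffled[train_size:]

# Hypothetical row IDs standing in for the instances of the data set
rows = list(range(3333))
train_rows, test_rows = percentage_split(rows, 70.0, seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed is what makes the split repeatable, which matters when two models are compared on the same test rows.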

Build a Naive Bayes classifier, which handles both numeric and categorical attributes.

In [8]:
nb = Classifier(classname="weka.classifiers.bayes.NaiveBayes")
nb.build_classifier(train)
#understand the NB model by printing it
#print(nb)
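
To make the idea of mixing numeric and categorical attributes concrete, the following is a small hand-rolled sketch of one common approach: smoothed frequency counts for a categorical attribute and a Gaussian density for a numeric one. It is not Weka's code, and the toy rows and attribute names (`plan`, `minutes`) are hypothetical, not taken from churn.arff.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy rows: (plan, minutes, churn label)
rows = [
    ("no", 120.0, "FALSE"), ("no", 150.0, "FALSE"),
    ("no", 180.0, "FALSE"), ("yes", 160.0, "FALSE"),
    ("yes", 260.0, "TRUE"), ("yes", 300.0, "TRUE"),
    ("no", 280.0, "TRUE"),
]

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, used for the numeric attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

def fit(rows):
    """Collect per-class priors, category counts, and mean/std of the numeric attribute."""
    by_class = defaultdict(list)
    for plan, minutes, label in rows:
        by_class[label].append((plan, minutes))
    model = {}
    for label, items in by_class.items():
        values = [m for _, m in items]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        model[label] = {
            "prior": len(items) / len(rows),
            "plan_counts": Counter(p for p, _ in items),
            "n": len(items),
            "mean": mean,
            "std": math.sqrt(var),
        }
    return model

def predict_proba(model, plan, minutes, plan_values=("no", "yes")):
    """Multiply prior and per-attribute likelihoods, then normalise across classes."""
    scores = {}
    for label, m in model.items():
        # Add-one (Laplace) smoothing for the categorical attribute
        p_plan = (m["plan_counts"][plan] + 1) / (m["n"] + len(plan_values))
        scores[label] = m["prior"] * p_plan * gaussian_pdf(minutes, m["mean"], m["std"])
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

model = fit(rows)
proba = predict_proba(model, "yes", 270.0)
print(proba)
```

The "naive" part is the product in `predict_proba`: each attribute contributes its likelihood independently, given the class.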

Evaluate on the test set.

In [9]:
evl_nb = Evaluation(train)
evl_nb.test_model(nb, test)
print(evl_nb.summary())


Correctly Classified Instances         877               87.7    %
Incorrectly Classified Instances       123               12.3    %
Kappa statistic                          0.4753
Mean absolute error                      0.1971
Root mean squared error                  0.3124
Relative absolute error                 79.5751 %
Root relative squared error             88.9798 %
Total Number of Instances             1000     



Evaluation metrics for both False and True classes. 

In [10]:
#Here are all the metrics for Naive Bayes

print("Classes at different positions are ",data.attribute(class_idx))

print("confusion Matrix")
#Note that the TP here will be for the class at the first position printed by the previous line
print(evl_nb.confusion_matrix)

###############
# Print metrics for the first class: False
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl_nb.true_positive_rate(class_position))
print("FP",evl_nb.false_positive_rate(class_position))
print("Precision ",evl_nb.precision(class_position))
print("Recall ",evl_nb.recall(class_position))


###############
# Print metrics for the second class: True
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl_nb.true_positive_rate(class_position))
print("FP",evl_nb.false_positive_rate(class_position))
print("Precision ",evl_nb.precision(class_position))
print("Recall ",evl_nb.recall(class_position))

Classes at different positions are  @attribute Churn {FALSE,TRUE}
confusion Matrix
[[803.  53.]
 [ 70.  74.]]

Evaluation from the perspective of class at position 0
TP  0.9380841121495327
FP 0.4861111111111111
Precision  0.9198167239404352
Recall  0.9380841121495327

Evaluation from the perspective of class at position 1
TP  0.5138888888888888
FP 0.06191588785046729
Precision  0.5826771653543307
Recall  0.5138888888888888
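
The per-class figures above can be recomputed by hand from the confusion matrix, which helps when checking which class each number refers to. The sketch below assumes rows are actual classes and columns are predicted classes (consistent with the rates printed above); `class_metrics` is a helper written for this illustration, not part of the Weka wrapper.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
confusion = [[803, 53],
             [70, 74]]

def class_metrics(matrix, positive):
    """Treat the class at index `positive` as the positive class and derive its metrics."""
    n = len(matrix)
    tp = matrix[positive][positive]
    fn = sum(matrix[positive]) - tp                        # actual positive, predicted other
    fp = sum(matrix[r][positive] for r in range(n)) - tp   # actual other, predicted positive
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

for position, label in enumerate(["FALSE", "TRUE"]):
    print(label, class_metrics(confusion, position))
```

The true positive rate and recall are the same quantity, which is why the two values match in each block of output.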


## 2. Naive Bayes Classification on selected attributes

Create a subset of the data containing the features selected in the initial analysis.

In [11]:
#create a new data set from the selected columns (col_range is 1-based)
sel_data=data.subset(col_range='5,6,8,11,17,18,20,21')

Set the new class index: Churn is now the last of the eight selected columns.

In [12]:
class_idx2 = 7
print("Data will be classified on",sel_data.attribute(class_idx2))
sel_data.class_index = class_idx2

Data will be classified on @attribute Churn {FALSE,TRUE}
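
The new index of 7 can be derived rather than counted by hand. This short sketch assumes the `col_range` string is 1-based while attribute indices are 0-based, which is consistent with Churn sitting at index 20 in the full data and at position 21 in the range.

```python
col_range = "5,6,8,11,17,18,20,21"   # 1-based columns kept in the subset
original_class_idx = 20              # 0-based index of Churn in the full data

# Convert the 1-based range to 0-based indices of the original attributes
kept = [int(c) - 1 for c in col_range.split(",")]

# Position of the original class column within the subset
new_class_idx = kept.index(original_class_idx)
print(kept)           # [4, 5, 7, 10, 16, 17, 19, 20]
print(new_class_idx)  # 7
```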


Create training and test sets using a 70/30 split.

In [13]:
train2, test2 = sel_data.train_test_split(70.0, Random(1))

Build a new Naive Bayes classifier on the selected attributes.

In [14]:
nb2 = Classifier(classname="weka.classifiers.bayes.NaiveBayes")
nb2.build_classifier(train2)
#understand the NB model by printing it
#print(nb2)

Evaluate on the test set.

In [15]:
evl_nb2 = Evaluation(train2)
evl_nb2.test_model(nb2, test2)
print(evl_nb2.summary())


Correctly Classified Instances         872               87.2    %
Incorrectly Classified Instances       128               12.8    %
Kappa statistic                          0.3298
Mean absolute error                      0.1862
Root mean squared error                  0.3072
Relative absolute error                 75.1766 %
Root relative squared error             87.5014 %
Total Number of Instances             1000     



Evaluation metrics for both the False and True classes of the second classifier.

In [16]:
print("Classes at different positions are ",sel_data.attribute(class_idx2))

print("confusion Matrix")
#Note that the TP here will be for the class at the first position printed by the previous line
print(evl_nb2.confusion_matrix)

###############
# Print metrics for the first class: False
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl_nb2.true_positive_rate(class_position))
print("FP",evl_nb2.false_positive_rate(class_position))
print("Precision ",evl_nb2.precision(class_position))
print("Recall ",evl_nb2.recall(class_position))


###############
# Print metrics for the second class: True
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl_nb2.true_positive_rate(class_position))
print("FP",evl_nb2.false_positive_rate(class_position))
print("Precision ",evl_nb2.precision(class_position))
print("Recall ",evl_nb2.recall(class_position))


Classes at different positions are  @attribute Churn {FALSE,TRUE}
confusion Matrix
[[831.  25.]
 [103.  41.]]

Evaluation from the perspective of class at position 0
TP  0.9707943925233645
FP 0.7152777777777778
Precision  0.8897216274089935
Recall  0.9707943925233645

Evaluation from the perspective of class at position 1
TP  0.2847222222222222
FP 0.029205607476635514
Precision  0.6212121212121212
Recall  0.2847222222222222
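
The two models have nearly identical accuracy (87.7% vs 87.2%), yet their Kappa statistics differ noticeably. As a sketch of why, Cohen's kappa can be recomputed from the two confusion matrices printed in this notebook; `cohen_kappa` is a helper written for this illustration rather than a Weka call.

```python
def cohen_kappa(matrix):
    """Agreement between actual and predicted labels, corrected for chance agreement."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of row total and column total for each class
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1 - expected)

all_attributes = [[803, 53], [70, 74]]        # section 1 confusion matrix
selected_attributes = [[831, 25], [103, 41]]  # section 2 confusion matrix

print(round(cohen_kappa(all_attributes), 4))       # 0.4753
print(round(cohen_kappa(selected_attributes), 4))  # 0.3298
```

Because most customers do not churn, a model that leans toward predicting FALSE keeps a high accuracy while its recall on TRUE falls (0.51 to 0.28 here), and kappa reflects that drop.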


In [17]:
#stop the JVM (Java Virtual Machine)
jvm.stop()