# To Churn or not to Churn?
## Decision Tree Classifier
### Author: E. Thompson-Becker
##### Code edited from the Python Weka tutorial received from Toronto Metropolitan University, CIND 119 - Introduction to Big Data, by Syed Shariyar Murtaza, Ph.D.

Install Weka's Python Package

In [None]:
#install weka's python package
! pip install python-weka-wrapper3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting python-weka-wrapper3
  Downloading python-weka-wrapper3-0.2.10.tar.gz (14.4 MB)
     |████████████████████████████████| 14.4 MB 20.9 MB/s
Collecting python-javabridge>=4.0.0
  Downloading python-javabridge-4.0.3.tar.gz (1.3 MB)
     |████████████████████████████████| 1.3 MB 36.3 MB/s
Building wheels for collected packages: python-weka-wrapper3, python-javabridge
  Building wheel for python-weka-wrapper3 (setup.py) ... done
  Created wheel for python-weka-wrapper3: filename=python_weka_wrapper3-0.2.10-py3-none-any.whl size=12993854 sha256=2f178bfac69d6f53f5f9a7aa22d49ca658f3084cccb7fccd76638e1c502d425e
  Stored in directory: /root/.cache/pip/wheels/a4/e9/93/c8dc5119f22ea38aa2bfbd02c33f4b2a6c6293f1a86283fd91
  Building wheel for python-javabridge (setup.py) ... done
  Created wheel for python-javabridge: filename=python_javabridge-4.0.3-cp37-

Point the environment at the Java installation and start the JVM from Python.

In [5]:
import os
import sys
# Point the environment at the Java 11 installation available on Colab.
sys.path.append("/usr/lib/jvm/java-11-openjdk-amd64/bin/")
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"

import weka.core.jvm as jvm
jvm.start()

DEBUG:weka.core.jvm:Adding bundled jars
DEBUG:weka.core.jvm:Classpath=['/usr/local/lib/python3.7/dist-packages/javabridge/jars/rhino-1.7R4.jar', '/usr/local/lib/python3.7/dist-packages/javabridge/jars/runnablequeue.jar', '/usr/local/lib/python3.7/dist-packages/javabridge/jars/cpython.jar', '/usr/local/lib/python3.7/dist-packages/weka/lib/python-weka-wrapper.jar', '/usr/local/lib/python3.7/dist-packages/weka/lib/weka.jar']
DEBUG:weka.core.jvm:MaxHeapSize=default
DEBUG:weka.core.jvm:Package support disabled


Upload the data set into the Colab environment.

In [6]:
from google.colab import files
uploaded = files.upload()

Saving churn.arff to churn (1).arff


Import packages.

In [7]:
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation
from weka.filters import Filter

Load the ARFF file into memory as a Weka data set.

In [8]:
loader = Loader(classname="weka.core.converters.ArffLoader")
data_file = "churn.arff"
data = loader.load_file(data_file)

Identify the class attribute, Churn, which sits at zero-based index 20 (the 21st attribute).  
Churn has two outcomes: TRUE or FALSE.

In [9]:
class_idx = 20
print("Data will be classified on",data.attribute(class_idx))
data.class_index = class_idx

Data will be classified on @attribute Churn {FALSE,TRUE}


## 1. Decision Tree Classifier including all attributes 


Split the data into training and test sets using a 70/30 split. The random number generator is seeded with 1 so that the split is reproducible.

In [10]:
train, test = data.train_test_split(70.0, Random(1))
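
The idea behind a seeded percentage split can be sketched in plain Python. This is only an illustration, not Weka's own shuffling algorithm, so it will not select the same rows as `data.train_test_split`. The helper name `seeded_percentage_split` is hypothetical, and the row count of 3333 is an assumption chosen so that the 30% share matches the 1000 test instances reported in the evaluation summaries.

```python
import random

def seeded_percentage_split(rows, train_percent, seed):
    """Shuffle a copy of rows with a seeded RNG, then cut at train_percent."""
    rng = random.Random(seed)      # same seed -> same shuffle -> same split
    shuffled = list(rows)          # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * train_percent / 100.0))
    return shuffled[:cut], shuffled[cut:]

rows = list(range(3333))           # assumed row count (see lead-in)
train_rows, test_rows = seeded_percentage_split(rows, 70.0, seed=1)
print(len(train_rows), len(test_rows))
```

Re-running the function with the same seed returns the same two lists, which is what makes later comparisons between models fair.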

Create the decision tree using Weka's J48, which builds a pruned C4.5 decision tree. At each node, C4.5 splits on the attribute with the highest gain ratio (information gain normalized by the entropy of the split itself).

In [11]:
# -C sets the confidence factor used for pruning; smaller values prune more and give a smaller tree.
cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.25"])
cls.build_classifier(train)
# See the tree below. 
print(cls)

J48 pruned tree
------------------

Total Day Min <= 264.3
|   No of Calls Customer Service <= 3
|   |   Inter Plan = no
|   |   |   Total Day Min <= 224.8: FALSE (1576.0/39.0)
|   |   |   Total Day Min > 224.8
|   |   |   |   Total Evening Charge <= 20.5: FALSE (193.0/13.0)
|   |   |   |   Total Evening Charge > 20.5
|   |   |   |   |   VoiceMail Plan = yes: FALSE (9.0)
|   |   |   |   |   VoiceMail Plan = no: TRUE (47.0/17.0)
|   |   Inter Plan = yes
|   |   |   Total Int Calls <= 2: TRUE (36.0)
|   |   |   Total Int Calls > 2
|   |   |   |   Total Int Min <= 13.1: FALSE (124.0/5.0)
|   |   |   |   Total Int Min > 13.1: TRUE (34.0)
|   No of Calls Customer Service > 3
|   |   Total Day Min <= 159.7: TRUE (69.0/9.0)
|   |   Total Day Min > 159.7
|   |   |   Total Evening Min <= 141.6
|   |   |   |   Total Int Calls <= 6: TRUE (11.0)
|   |   |   |   Total Int Calls > 6: FALSE (2.0)
|   |   |   Total Evening Min > 141.6: FALSE (86.0/17.0)
Total Day Min > 264.3
|   VoiceMail Plan = yes
|
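
How a tree learner scores a candidate split can be sketched with a few lines of plain Python. The example below computes entropy, information gain, and gain ratio for a single nominal attribute. It is a simplified sketch: the six rows are made-up toy values, not taken from `churn.arff`, the function names are hypothetical, and J48's handling of numeric thresholds, missing values, and pruning is not covered.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def split_scores(values, labels):
    """Return (information gain, gain ratio) for splitting labels by values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Toy data (assumed for illustration only).
plan = ["yes", "yes", "no", "no", "no", "no"]
churn = ["TRUE", "TRUE", "FALSE", "FALSE", "FALSE", "TRUE"]
gain, ratio = split_scores(plan, churn)
print(round(gain, 4), round(ratio, 4))
```

Dividing by the split information penalizes attributes that fragment the data into many small groups, which plain information gain tends to favor.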

Plot the tree as a graph.

In [20]:
import weka.plot.graph as graph  # If pygraphviz is installed, the tree can also be plotted, but this may not work in every environment
graph.plot_dot_graph(cls.graph)


Evaluate the decision tree on the test set.

In [21]:
evl = Evaluation(train)
evl.test_model(cls, test)
print(evl.summary())


Correctly Classified Instances         930               93      %
Incorrectly Classified Instances        70                7      %
Kappa statistic                          0.7005
Mean absolute error                      0.1073
Root mean squared error                  0.2484
Relative absolute error                 43.3111 %
Root relative squared error             70.7577 %
Total Number of Instances             1000     



Evaluate the decision tree classifier on the test set from the perspective of each class: first FALSE, then TRUE.

In [22]:
# Here are all the metrics
#print ("Class Index ", class_idx)
print("Classes at different positions are ",data.attribute(class_idx))

print("confusion Matrix")
#Note that the rows of the confusion matrix are the actual classes and the columns are the predicted classes, both in the order printed by the previous line
print(evl.confusion_matrix)

###############
# Print metrics for the first class
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl.true_positive_rate(class_position))
print("FP",evl.false_positive_rate(class_position))
print("Precision ",evl.precision(class_position))
print("Recall ",evl.recall(class_position))


###############
# Print metrics for the second class
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl.true_positive_rate(class_position))
print("FP",evl.false_positive_rate(class_position))
print("Precision ",evl.precision(class_position))
print("Recall ",evl.recall(class_position))


Classes at different positions are  @attribute Churn {FALSE,TRUE}
confusion Matrix
[[830.  26.]
 [ 44. 100.]]

Evaluation from the perspective of class at position 0
TP  0.969626168224299
FP 0.3055555555555556
Precision  0.9496567505720824
Recall  0.969626168224299

Evaluation from the perspective of class at position 1
TP  0.6944444444444444
FP 0.030373831775700934
Precision  0.7936507936507936
Recall  0.6944444444444444
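
The per-class figures above can be re-derived by hand from the confusion matrix, which is a useful check on how the rows and columns are read. The sketch below assumes rows are actual classes and columns are predicted classes, which is consistent with the printed rates. The function name `per_class_metrics` is hypothetical.

```python
def per_class_metrics(matrix, k):
    """Metrics for class k, where matrix[i][j] counts actual class i predicted as j."""
    n = len(matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k][j] for j in range(n) if j != k)   # class k missed
    fp = sum(matrix[i][k] for i in range(n) if i != k)   # others labelled as k
    tn = sum(matrix[i][j] for i in range(n) for j in range(n)
             if i != k and j != k)
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # recall is the same quantity as the TP rate
    }

# Confusion matrix printed above for the model with all attributes.
confusion = [[830, 26], [44, 100]]
for k, name in enumerate(["FALSE", "TRUE"]):
    print(name, per_class_metrics(confusion, k))
```

Read this way, the model misses 44 of the 144 actual churners, which is why recall for TRUE is much lower than for FALSE even though overall accuracy is 93%.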


## 2. Decision Tree Classifier with select attributes

Create a new data set containing only the attributes selected in the initial analysis.

In [23]:
#create new data set; col_range uses 1-based attribute numbers, so 21 is Churn
sel_data=data.subset(col_range='5,6,8,11,17,18,20,21')

List the attributes of the new data set to identify the new class index.

In [24]:
#Let's look at the attributes and their types
# We have two data types here: categorical and numeric.
for i in range(sel_data.num_attributes):
  print ("index ",i)
  print(sel_data.attribute(i))

index  0
@attribute 'Inter Plan' {no,yes}
index  1
@attribute 'VoiceMail Plan' {yes,no}
index  2
@attribute 'Total Day Min' numeric
index  3
@attribute 'Total Evening Min' numeric
index  4
@attribute 'Total Int Min' numeric
index  5
@attribute 'Total Int Calls' numeric
index  6
@attribute 'No of Calls Customer Service' numeric
index  7
@attribute Churn {FALSE,TRUE}
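
The attribute numbers passed to `col_range` are 1-based, while `attribute(i)` and `class_index` are zero-based, which is easy to mix up. The sketch below converts a range string into zero-based indices to make the mapping explicit. The helper `range_to_zero_based` is hypothetical and handles only comma-separated numbers and simple `a-b` spans, not keywords such as `first` or `last`.

```python
def range_to_zero_based(col_range):
    """Expand a 1-based range string such as '5,6,8-10' into zero-based indices."""
    indices = []
    for part in col_range.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(p) for p in part.split("-"))
            indices.extend(range(start - 1, end))   # inclusive span, shifted by one
        else:
            indices.append(int(part) - 1)
    return indices

selected = range_to_zero_based("5,6,8,11,17,18,20,21")
print(selected)            # zero-based positions in the original data set
print(selected.index(20))  # position of Churn (original index 20) in the subset
```

The second value printed is the index of Churn inside the subset, which is the value the new class index has to take.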


In [25]:
class_idx2 = 7
print("Data will be classified on",sel_data.attribute(class_idx2))
sel_data.class_index = class_idx2

Data will be classified on @attribute Churn {FALSE,TRUE}


Split data into train and test sets, using the 70/30 split with the random seed.

In [26]:
train2, test2 = sel_data.train_test_split(70.0, Random(1))

Create the new decision tree classifier.

In [27]:
# We are generating a pruned C4.5 decision tree, with a confidence factor used for pruning of 0.25.
# You can change it to different threshold values to change the size of the tree.
cls2 = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.25"])
cls2.build_classifier(train2)
# See the tree below. 
print(cls2)

J48 pruned tree
------------------

Total Day Min <= 264.3
|   No of Calls Customer Service <= 3
|   |   Inter Plan = no
|   |   |   Total Day Min <= 224.8: FALSE (1576.0/39.0)
|   |   |   Total Day Min > 224.8
|   |   |   |   Total Evening Min <= 241.2: FALSE (193.0/13.0)
|   |   |   |   Total Evening Min > 241.2
|   |   |   |   |   VoiceMail Plan = yes: FALSE (9.0)
|   |   |   |   |   VoiceMail Plan = no
|   |   |   |   |   |   Total Evening Min <= 269.3
|   |   |   |   |   |   |   Total Day Min <= 235.9: FALSE (13.0/2.0)
|   |   |   |   |   |   |   Total Day Min > 235.9
|   |   |   |   |   |   |   |   Total Int Min <= 7.3: FALSE (2.0)
|   |   |   |   |   |   |   |   Total Int Min > 7.3
|   |   |   |   |   |   |   |   |   Total Int Calls <= 6: TRUE (14.0/1.0)
|   |   |   |   |   |   |   |   |   Total Int Calls > 6: FALSE (3.0/1.0)
|   |   |   |   |   |   Total Evening Min > 269.3: TRUE (15.0/1.0)
|   |   Inter Plan = yes
|   |   |   Total Int Calls <= 2: TRUE (36.0)
|   |   |   Total
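
Each leaf in the printed trees ends with an annotation such as `(1576.0/39.0)`: the first number is the weight of training instances that reach the leaf, and the optional second number is how many of those the leaf misclassifies. A small parser makes it easy to rank leaves by their training error. This is a sketch; the regular expression and the name `parse_leaf` are assumptions based on the output format shown in this notebook.

```python
import re

# Matches ": LABEL (N)" or ": LABEL (N/E)" at the end of a tree line.
LEAF = re.compile(
    r":\s*(?P<label>\S+)\s*\((?P<n>[\d.]+)(?:/(?P<errors>[\d.]+))?\)\s*$"
)

def parse_leaf(line):
    """Return (label, instances, errors, error_rate), or None for non-leaf lines."""
    match = LEAF.search(line)
    if match is None:
        return None
    n = float(match.group("n"))
    errors = float(match.group("errors") or 0.0)   # no "/E" means no errors
    return match.group("label"), n, errors, errors / n

lines = [
    "|   |   |   Total Day Min <= 224.8: FALSE (1576.0/39.0)",
    "|   |   |   Total Int Calls <= 2: TRUE (36.0)",
    "|   |   Inter Plan = no",
]
for line in lines:
    print(parse_leaf(line))
```

Leaves with few instances and a non-zero error count, such as `FALSE (3.0/1.0)`, are the ones most likely to reflect noise rather than a real pattern.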

Evaluate on the test set.

In [28]:
evl2 = Evaluation(train2)
evl2.test_model(cls2, test2)
print(evl2.summary())


Correctly Classified Instances         947               94.7    %
Incorrectly Classified Instances        53                5.3    %
Kappa statistic                          0.7608
Mean absolute error                      0.0787
Root mean squared error                  0.2203
Relative absolute error                 31.7809 %
Root relative squared error             62.7539 %
Total Number of Instances             1000     



Compute the evaluation metrics for each class individually.

In [29]:
# Here are all the metrics
#print ("Class Index ", class_idx2)
print("Classes at different positions are ",sel_data.attribute(class_idx2))

print("confusion Matrix")
#Note that the rows of the confusion matrix are the actual classes and the columns are the predicted classes, both in the order printed by the previous line
print(evl2.confusion_matrix)

###############
# Print metrics for the first class
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl2.true_positive_rate(class_position))
print("FP",evl2.false_positive_rate(class_position))
print("Precision ",evl2.precision(class_position))
print("Recall ",evl2.recall(class_position))


###############
# Print metrics for the second class
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP ",evl2.true_positive_rate(class_position))
print("FP",evl2.false_positive_rate(class_position))
print("Precision ",evl2.precision(class_position))
print("Recall ",evl2.recall(class_position))

Classes at different positions are  @attribute Churn {FALSE,TRUE}
confusion Matrix
[[847.   9.]
 [ 44. 100.]]

Evaluation from the perspective of class at position 0
TP  0.9894859813084113
FP 0.3055555555555556
Precision  0.9506172839506173
Recall  0.9894859813084113

Evaluation from the perspective of class at position 1
TP  0.6944444444444444
FP 0.010514018691588784
Precision  0.9174311926605505
Recall  0.6944444444444444
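
The Kappa statistic reported in the two evaluation summaries can be reproduced from the confusion matrices, which helps when comparing the two models. The sketch below uses the standard Cohen's kappa formula, observed agreement corrected for the agreement expected by chance. The function name `cohen_kappa` is hypothetical.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows actual, columns predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals for each class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

all_attributes = [[830, 26], [44, 100]]      # model 1 confusion matrix
selected_attributes = [[847, 9], [44, 100]]  # model 2 confusion matrix
print(round(cohen_kappa(all_attributes), 4))
print(round(cohen_kappa(selected_attributes), 4))
```

Both models find the same 100 churners. The improvement in the second model comes entirely from fewer false alarms on the FALSE class (9 instead of 26), which is also what raises precision for TRUE.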


In [30]:
#stop the JVM (Java Virtual Machine)
jvm.stop()