**Tutorial 2**

This tutorial uses the Weka package for machine learning. Weka is a well-known machine learning workbench and a set of libraries that can be called from a programming language. Weka was created at the University of Waikato, New Zealand (https://www.cs.waikato.ac.nz/ml/weka/). It is accompanied by a data mining textbook that is taught in schools around the world (https://www.cs.waikato.ac.nz/ml/weka/book.html). The advantage of using Weka's Python package is that its implementations of the algorithms are complete, comprehensive and easy to use. Let's see below.


First install Weka's Python package.

In [None]:
! pip install python-weka-wrapper3

Weka is built on Java, so below we set up Java and launch it from the Python environment. Don't worry about understanding this code.

In [None]:
import os
import sys
sys.path
sys.path.append("/usr/lib/jvm/java-11-openjdk-amd64/bin/")
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64/"



In [None]:

import weka.core.jvm as jvm
jvm.start()

We shall now upload a dataset file. Weka works most easily with the ARFF format, though it can load CSV too. We shall upload an .arff file because I have already defined the correct data types of the variables (categorical or numerical) in it.

In [None]:
from google.colab import files
uploaded = files.upload()
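
For readers who have not seen ARFF before, the sketch below builds a tiny, made-up ARFF file as a string and reads back its attribute declarations with plain Python. The relation name, attribute names and rows are hypothetical and are not taken from german_credit.arff. The small parser handles only this simple layout and is not a replacement for Weka's loader.

```python
# A tiny, hypothetical ARFF file: the header declares the type of each attribute,
# so Weka knows which columns are categorical (nominal) and which are numeric.
arff_text = """@relation toy_credit
@attribute creditability {0,1}
@attribute duration numeric
@attribute purpose {car,furniture,education}
@data
1,18,car
0,36,education
"""

def read_attribute_types(text):
    """Return (name, type) pairs from the @attribute lines of a simple ARFF header."""
    attributes = []
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("@attribute"):
            _, name, declared = line.split(None, 2)
            # A list in curly braces declares a categorical (nominal) attribute
            kind = "categorical" if declared.startswith("{") else declared.lower()
            attributes.append((name, kind))
        elif line.lower().startswith("@data"):
            break  # the header ends where the data section begins
    return attributes

attribute_types = read_attribute_types(arff_text)
for name, kind in attribute_types:
    print(name, "->", kind)
```

Because the types are stated in the header, nothing has to be guessed from the values, which is the reason given above for preferring .arff over CSV.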

Let's load our dataset into memory using the following code. The dataset file that I have uploaded is german_credit.arff. Note that the loaded data in memory is not a Pandas data frame.

In [None]:
from weka.core.converters import Loader
from weka.core.classes import Random
from weka.classifiers import Classifier, Evaluation

In [None]:
loader = Loader(classname="weka.core.converters.ArffLoader")
data_file = 'german_credit.arff'
#data_file="churn.arff"
#data_file="bank.arff"
data = loader.load_file(data_file)

print('Data set size: ', data.num_instances)

In [None]:
#Let's look at the attributes and their types
# We have two data types here: categorical and numeric.
for i in range(data.num_attributes):
  print ("index ",i)
  print(data.attribute(i))

The index of the class attribute in our data is 0 (Creditability), as can be observed in the output above. I am setting up the class attribute here.

In [None]:

# index of class attribute is 0 (Creditability) for the German credit data set
# index of class attribute is 20 (Churn) for the Churn data set
# index of class attribute is 16 (y) for the bank data set
# Again, you can see all the index numbers for attributes by running the previous cell
class_idx=0
print('Will be classifying on: ', data.attribute(class_idx))
data.class_index = class_idx


Time to split dataset into train and test set.

In [None]:
# Splitting 66% for training and 34% for testing using a seed of 1 for random number generator
train, test = data.train_test_split(66.0, Random(1))
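
To make the idea of a seeded percentage split concrete, here is a minimal plain-Python sketch. The function name `percentage_split` and the stand-in list of 1000 records are assumptions made for illustration. Python's random number generator is not the one Weka uses, so the rows that end up on each side will differ from Weka's split even with the same seed.

```python
import random

def percentage_split(rows, train_percent, seed):
    """Shuffle a copy of rows with a fixed seed, then cut it at train_percent."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same shuffle every run
    cut = round(len(shuffled) * train_percent / 100.0)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))  # hypothetical stand-in for a data set of 1000 records
train_rows, test_rows = percentage_split(rows, 66.0, seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed is what makes the experiment repeatable: rerunning the cell gives the same training and test records.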

We are now going to train a decision tree. It is the C4.5 decision tree, and its name in Weka is J48. The good thing about J48 is that it implements the C4.5 algorithm as described in theory and as we studied it. The C4.5 algorithm can handle numeric and categorical attributes by itself, so there is no need to convert categorical features (or variables) to numeric features using one-hot encoding.

In [None]:
# We are generating a pruned C4.5 decision tree, with a confidence factor used for pruning of 0.25.
# You can change it to different threshold values to change the size of the tree.
cls = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.25"])
cls.build_classifier(train)
# See the tree below. 
print(cls)
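
C4.5 chooses splits using the gain ratio. The following is a minimal sketch of that calculation for one categorical attribute, written in plain Python rather than with Weka. The toy values and labels are hypothetical, and J48 adds further refinements (numeric thresholds, minimum leaf sizes, pruning) that are not shown here.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # penalises attributes with many small branches
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: one categorical attribute and a binary class
purpose = ["car", "car", "car", "car", "education", "education", "education", "education"]
credit = [1, 1, 1, 0, 0, 0, 0, 1]
ratio = gain_ratio(purpose, credit)
print(round(ratio, 4))
```

The categorical attribute is used directly as a set of branches, which is why no one-hot encoding is needed.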

In the tree above, a leaf such as ": 1 (8.0/2.0)" means that the class predicted at the leaf is 1, that 8 training records reached this leaf when the finished tree was evaluated on the training set, and that 2 of those 8 records were predicted incorrectly.
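
As a small worked example of reading that notation, the sketch below pulls the numbers out of a leaf string and computes the error at that leaf. The helper name `parse_leaf` is hypothetical, and the pattern assumes the leaf text looks like the example quoted above (with the second number absent when no record at the leaf is misclassified).

```python
import re

def parse_leaf(leaf_text):
    """Extract (predicted_class, reached, misclassified) from text like ': 1 (8.0/2.0)'."""
    match = re.search(r":\s*(\S+)\s*\(([\d.]+)(?:/([\d.]+))?\)", leaf_text)
    if match is None:
        raise ValueError("not a leaf line: " + leaf_text)
    predicted, reached, wrong = match.groups()
    return predicted, float(reached), float(wrong) if wrong else 0.0

predicted, reached, wrong = parse_leaf(": 1 (8.0/2.0)")
print("predicted class:", predicted)
print("correctly predicted at this leaf:", reached - wrong)   # 8 - 2 = 6
print("training error at this leaf:", wrong / reached)        # 2 / 8 = 0.25
```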

In [None]:
#import weka.plot.graph as graph  # If pygraphviz is installed, you can plot the graph of the tree too, but it may not work
#graph.plot_dot_graph(cls.graph)

In [None]:
# Let's evaluate it on the test set

evl = Evaluation(train)
evl.test_model(cls, test)
print(evl.summary())

Here "Correctly Classified Instances" is the accuracy, and "Total Number of Instances" is the total number of records in the test set. Ignore everything else, as we have not studied those measures yet.

In [None]:
# Here are all the metrics
#print ("Class Index ", class_idx)
print("Classes at different positions are ",data.attribute(class_idx))

print("confusion Matrix")
#Note that the TP here will be for the class at the first position printed by the previous line and TN will be for the class at second position
print(evl.confusion_matrix)

###############
# Print metrics for the first class
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP rate ",evl.true_positive_rate(class_position))
print("FP rate ",evl.false_positive_rate(class_position))
print("Precision ",evl.precision(class_position))
print("Recall ",evl.recall(class_position))


###############
# Print metrics for the second class
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP rate ",evl.true_positive_rate(class_position))
print("FP rate ",evl.false_positive_rate(class_position))
print("Precision ",evl.precision(class_position))
print("Recall ",evl.recall(class_position))
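
To see where these numbers come from, here is a minimal sketch that derives the same kinds of metrics by hand from a 2x2 confusion matrix. The counts are hypothetical and are not results from the German credit data. The sketch assumes rows are actual classes and columns are predicted classes.

```python
def class_metrics(matrix, k):
    """Metrics for class k, treating class k as 'positive' and all others as 'negative'."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]                           # actual k, predicted k
    fn = sum(matrix[k]) - tp                    # actual k, predicted something else
    fp = sum(row[k] for row in matrix) - tp     # actual something else, predicted k
    tn = total - tp - fn - fp
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),               # recall is the same quantity as the TP rate
    }

# Hypothetical counts: rows = actual class, columns = predicted class
matrix = [[60, 40],
          [20, 80]]
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(sum(row) for row in matrix)
first = class_metrics(matrix, 0)
second = class_metrics(matrix, 1)
print("accuracy", accuracy)
print("class 0", first)
print("class 1", second)
```

Switching `class_position` only changes which class is treated as positive; the underlying confusion matrix stays the same.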


**Naive Bayes**

Below is the code to run the Naive Bayes algorithm. Weka's version of Naive Bayes is suited to both numeric and categorical features (attributes or variables).
 (https://weka.sourceforge.io/doc.dev/weka/classifiers/bayes/NaiveBayes.html)

In [None]:

nb = Classifier(classname="weka.classifiers.bayes.NaiveBayes")
nb.build_classifier(train)
#let's understand the NB model by printing it
print(nb)
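
The sketch below illustrates how a Naive Bayes model can combine a categorical and a numeric feature: a smoothed relative frequency for the categorical one and a normal (Gaussian) density for the numeric one. It is a simplified illustration in plain Python, not Weka's code. The records, the add-one smoothing and the function names are assumptions made for this example.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Hypothetical training records: (purpose, duration in months, class)
records = [
    ("car", 12, "good"), ("car", 18, "good"), ("education", 24, "good"),
    ("car", 36, "bad"), ("education", 48, "bad"), ("education", 42, "bad"),
]

def score(purpose, duration, label):
    """Unnormalised P(class) * P(purpose | class) * p(duration | class)."""
    rows = [r for r in records if r[2] == label]
    prior = len(rows) / len(records)
    # Categorical feature: relative frequency with add-one smoothing (an assumption)
    purposes = {r[0] for r in records}
    p_purpose = (sum(r[0] == purpose for r in rows) + 1) / (len(rows) + len(purposes))
    # Numeric feature: normal density fitted to this class's values
    durations = [r[1] for r in rows]
    p_duration = gaussian_pdf(duration, mean(durations), stdev(durations))
    return prior * p_purpose * p_duration

scores = {label: score("car", 15, label) for label in ("good", "bad")}
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
print(posteriors)
```

The "naive" part is the multiplication: each feature contributes its own factor as if the features were independent given the class.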

In [None]:
# Time for evaluation on the test set
evl_nb = Evaluation(train)
evl_nb.test_model(nb, test)
print(evl_nb.summary())

In [None]:
#Here are all the metrics for Naive Bayes

print("Classes at different positions are ",data.attribute(class_idx))

print("confusion Matrix")
#Note that the TP here will be for the class at the first position printed by the previous line
print(evl_nb.confusion_matrix)

###############
# Print metrics for the first class
##############
class_position=0
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP rate ",evl_nb.true_positive_rate(class_position))
print("FP rate ",evl_nb.false_positive_rate(class_position))
print("Precision ",evl_nb.precision(class_position))
print("Recall ",evl_nb.recall(class_position))


###############
# Print metrics for the second class
##############
class_position=1
print("")
print ("Evaluation from the perspective of class at position "+ str(class_position))
print("TP rate ",evl_nb.true_positive_rate(class_position))
print("FP rate ",evl_nb.false_positive_rate(class_position))
print("Precision ",evl_nb.precision(class_position))
print("Recall ",evl_nb.recall(class_position))


**Appendix**

Using the following code, you can find the best attributes with the BestFirst search algorithm in Weka. Again, it is not necessary to understand the whole code below, but if you want to learn more about BestFirst and CfsSubsetEval, you can go here: https://weka.sourceforge.io/doc.dev/weka/attributeSelection/package-summary.html. You can also replace them with other options available on that site.


In [None]:
from weka.attribute_selection import ASSearch, ASEvaluation, AttributeSelection
search = ASSearch(classname="weka.attributeSelection.BestFirst", options=["-D", "1", "-N", "5"])
evaluator = ASEvaluation(classname="weka.attributeSelection.CfsSubsetEval", options=["-P", "1", "-E", "1"])
attsel = AttributeSelection()
attsel.search(search)
attsel.evaluator(evaluator)
attsel.select_attributes(data)

print("# attributes: " + str(attsel.number_attributes_selected))
print("attributes: " + str(attsel.selected_attributes))
print("result string:\n" + attsel.results_string)
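
For intuition about what a correlation-based subset evaluator rewards, here is a minimal sketch of the CFS merit heuristic: a subset scores well when its attributes correlate with the class but not with each other. The function name and the correlation values are hypothetical. How Weka estimates those correlations from the data is not shown here.

```python
from math import sqrt

def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """CFS merit of a subset of k features, given its two average correlations."""
    return (k * avg_feature_class_corr) / sqrt(k + k * (k - 1) * avg_feature_feature_corr)

# Two hypothetical subsets of 3 features, equally correlated with the class
low_redundancy = cfs_merit(3, 0.4, 0.1)   # features barely correlated with each other
high_redundancy = cfs_merit(3, 0.4, 0.8)  # features strongly correlated with each other
print(round(low_redundancy, 4), round(high_redundancy, 4))
```

A search method such as best-first then explores candidate subsets and keeps the one with the highest merit it finds.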

Weka's best-first search resulted in the attribute selection above. Let's create a new copy of the dataset with those attributes only.

In [None]:
# As you see above, only attributes 2, 3 and 4 are judged important by Weka for the German credit data set. So we are going to load the
# data again and remove attributes 5-21. The -R range of the Remove filter is 1-based, so attribute 1 here is the class attribute (index 0 in the earlier cells); we'll keep that too.
from weka.filters import Filter

data2 = loader.load_file(data_file)
# Filtering method 1
remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "5-21"])
remove.inputformat(data2)
filtered_data = remove.filter(data2)

print(filtered_data.subset(row_range="1-10"))

In [None]:
# Filtering method 2
# Another way of filtering columns is the following code. Here we are keeping only features 1-4 and 7.
filtered_data=data2.subset(col_range='1-4,7')
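
Range strings such as '5-21' and '1-4,7' count attributes from 1, while `class_idx` and `data.attribute(i)` earlier count from 0. The hypothetical helper below, which is not part of Weka, converts a simple 1-based range string into 0-based indices so the two numbering schemes can be compared. It does not handle keywords such as first or last.

```python
def range_to_indices(range_text):
    """Turn a 1-based range string such as '1-4,7' into sorted 0-based indices."""
    indices = set()
    for part in range_text.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            # 1-based inclusive "first-last" becomes 0-based first-1 .. last-1
            indices.update(range(int(first) - 1, int(last)))
        else:
            indices.add(int(part) - 1)
    return sorted(indices)

kept = range_to_indices("1-4,7")
print(kept)  # [0, 1, 2, 3, 6]
```

So keeping '1-4,7' keeps the class attribute (index 0) together with the attributes at indices 1, 2, 3 and 6.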


Now you can use the above filtered data set as the input data set in the code examples shown above and repeat the experiments. Remember to set its class attribute (class_index) first, because data2 was loaded without one.

More examples on the use of different functionalities of Weka's Python package are here for curious readers:
http://fracpete.github.io/python-weka-wrapper3/examples.html

In [None]:
#If you are done stop the JVM (Java Virtual Machine)
jvm.stop()

It turns out that Weka's Python package is easier to use and more comprehensive than many other Python packages.



```
For CIND 119 course at Ryerson
  by Syed Shariyar Murtaza,Ph.D.
```

