
### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2020 Semester 1

## Assignment 1: Naive Bayes Classifiers

###### Submission deadline: 7 pm, Monday 20 Apr 2020

**Student Name(s):**    Shuyang Fan, Yiran Wang 

**Student ID(s):**     988301, 987751


This iPython notebook is a template which you will use for your Assignment 1 submission.

Marking will be applied on the four functions that are defined in this notebook, and to your responses to the questions at the end of this notebook (Submitted in a separate PDF file).

**NOTE: YOU SHOULD ADD YOUR RESULTS, DIAGRAMS AND IMAGES FROM YOUR OBSERVATIONS IN THIS FILE TO YOUR REPORT (the PDF file).**

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find.

**Adding proper comments to your code is MANDATORY.**

In [1]:
import pandas as pd
import numpy as np
from pandas.api.types import is_numeric_dtype
from scipy.stats import mode 
from collections import Counter, defaultdict
import math
from sklearn.cluster import KMeans

In [2]:
# Read data from csv
def read_data(fileName):
    data = pd.read_csv(fileName, header=None)
    return data

In [3]:
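def handle_missing_value(data):
    # Make a copy of raw data
    copy = data.copy()
    # Drop rows that contain a question mark (used as the missing-value marker)
    copy = copy[(copy.astype(str) != '?').all(axis=1)]
    # dropna() returns a new frame, so the result has to be kept
    copy = copy.dropna()
    return copy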

In [4]:
def equal_width_binning(data, column_indexes, bin_num):
    copy = data.copy()
    for column in column_indexes:
        copy.iloc[:,column] = pd.cut(copy.iloc[:,column], bin_num)
    return copy

In [5]:
def equal_frequency_binning(data, column_indexes, bin_num):
    copy = data.copy()
    for column in column_indexes:
        copy.iloc[:,column] = pd.qcut(copy.iloc[:,column], bin_num)
    return copy
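
The two binning helpers above compute bin edges from the whole dataset, before the train/test split. A minimal sketch of one alternative, assuming the edges should be learned from the training column only and then reused on the test column. `fit_equal_width_edges` and `apply_edges` are hypothetical names that do not appear elsewhere in this notebook, and the data is made up.

```python
import numpy as np
import pandas as pd

def fit_equal_width_edges(train_column, bin_num):
    # Learn equal-width bin edges from the training values only
    _, edges = pd.cut(train_column, bin_num, retbins=True, labels=False)
    edges = np.asarray(edges, dtype=float).copy()
    # Open the outer bins so test values outside the training range still get a bin
    edges[0], edges[-1] = -np.inf, np.inf
    return edges

def apply_edges(column, edges):
    # Return integer bin codes (0 .. bin_num-1) instead of Interval objects
    return pd.cut(column, bins=edges, labels=False)

# Made-up data for illustration
train = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
test = pd.Series([-5.0, 4.0, 100.0])
edges = fit_equal_width_edges(train, 3)
train_codes = apply_edges(train, edges)
test_codes = apply_edges(test, edges)
```

Integer codes can be passed to the classifier as nominal values in the same way as the interval labels produced by `pd.cut`.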

In [6]:
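def kmeans_binning(data, column_indexes, k):
    copy = data.copy()
    for column in column_indexes:
        # KMeans expects a 2-D array of shape (n_samples, n_features), so reshape the single column
        values = copy.iloc[:,column].to_numpy().reshape(-1, 1)
        kmeans = KMeans(n_clusters=k, random_state=0).fit(values)
        # Use the cluster index of each row as its bin label
        copy.iloc[:,column] = kmeans.labels_
    return copy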



In [7]:
# Hold out a random test set; with the default test_size the train:test ratio is 8:2
def train_test_split(X, y, test_size=0.2):
    X_total = X.shape[0]
    assert(X_total == y.size)
    arr_rand = np.random.rand(X.shape[0])
    split = arr_rand < np.percentile(arr_rand, test_size*100)
    X_test = X[split]
    y_test = y[split]
    X_train =  X[~split]
    y_train = y[~split]
    return X_train, X_test, y_train, y_test



In [8]:
# Naive Bayes: for each class, multiply the prior by the conditional probabilities and take the class with the maximum.
# Then predict by passing X_test through the model,
# and finally check the accuracy.

In [9]:
def find_key_with_max_value(dic):
    # Start from -inf so that any real-valued log-probability can become the maximum
    max_value = float("-inf")
    max_key = None
    for key in dic:
        if dic[key] > max_value:
            max_key = key
            max_value = dic[key]
    return max_key
print(find_key_with_max_value({'<=50K': -15.768737460661638, '>50K': -18.479883639470405}))

<=50K


In [10]:
class BayesClassifier():

    def train(self, X_train, y_train, type_dict):
        self.X_train = X_train.copy()
        self.y_train = y_train.copy()
        #extract the possible class labels from training data
        self.possible_labels = np.unique(self.y_train)
        #get the prior probability of training data
        self.prior = self.get_prior(self.y_train)
        self.type_dict = type_dict
        self.categorical_prob = self.compute_categorical(self.type_dict['nominal'],self.X_train, self.y_train)
        self.numeric_prob = self.compute_numeric(self.type_dict['numeric'], self.X_train, self.y_train)
    
    #function calculates the prior probability for all labels
    def get_prior(self, y_train):
        train_inputs = y_train
        #counts is a list that stores number of each label accordingly
        labels, counts = np.unique(train_inputs, return_counts=True)
        prior = {}
        for i, label in enumerate(labels):
            prior[label] = float(counts[i])/len(train_inputs)
        return prior      
    
    #function examines whether an attribute is numeric 
    def is_numeric_attribute(self, data, column_index):
        return is_numeric_dtype(data.iloc[:,column_index])

    #function that computes the conditional probability of categorical attributes
    def compute_categorical(self, categorical_indexes, X_train, y_train):
        #create a dictionary to store results
        categorical_prob = defaultdict(lambda: defaultdict(dict))
        # Separate training instances by label
        grouped = X_train.groupby(y_train)
        
        for label in self.possible_labels:
            separated = grouped.get_group(label)
            #for each attribute in the same class
            for column_index in categorical_indexes:
                # Only process categorical attribute
                #if (self.is_numeric_attribute(X_train, column_index) == False):
                # Extract one attribute from group
                attribute = separated.iloc[:,column_index]
                total_rows = attribute.shape[0]
                # Find all possible values of this attribute
                possible_values = np.unique(X_train.iloc[:,column_index])
                # Call Counter to count the frequency of each value
                counts = Counter(attribute)
                for value in possible_values:
                    if value in counts:
                        categorical_prob[column_index][str(value)][label] = counts[value]/total_rows
        return categorical_prob
    
    #function that calculates the per-class mean and standard deviation of numeric attributes
    def compute_numeric(self, numeric_indexes, X_train, y_train):
        numeric_prob = defaultdict(lambda: defaultdict(dict))
        # Separate training instances by label
        grouped = X_train.groupby(y_train)
        
        for label in self.possible_labels:
            separated = grouped.get_group(label)
            for column_index in numeric_indexes:
                
                # if self.is_numeric_attribute(X_train, column_index):
                # Extract the attribute from group
                attribute = separated.iloc[:,column_index]
                # print(column_index, attribute)
                #calculate the mean and standard deviation of each attribute
                numeric_prob[column_index]['mean'][label] = attribute.mean()
                numeric_prob[column_index]['std'][label] = attribute.std()
                # print(label, column_index, numeric_prob[column_index]['std'][label])
        # print(numeric_prob)
        return numeric_prob
    
    
    def guassian(self, value, mean, stdev):
        # Gaussian probability density of value under N(mean, stdev**2)
        exponent = math.exp(-((value-mean)**2 / (2 * stdev**2)))
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    def predict(self, X_test):
        X_test_copy = X_test.copy()
        categorical_prob = self.categorical_prob
        numeric_prob = self.numeric_prob 
        priors = self.prior
        possible_labels = self.possible_labels
        predicted_outputs = []
    
        row, column = X_test_copy.shape
        for row_index in range(row):
            probability = defaultdict(float)
            
            for label in possible_labels:
                probability[label] = safe_log(priors[label])
                
            for column_index in range(column):
                # Get the value of this attribute
                value = X_test_copy.iloc[row_index, column_index]
                # Get conditional probability
                for label in possible_labels:
                    #calculate the conditional probability of a numeric attribute
                    if column_index in self.type_dict['numeric']:
                        #print("numeric",column_index)
                        try:
                            conditional_probability = self.guassian(value, numeric_prob[column_index]['mean'][label], numeric_prob[column_index]['std'][label])
                        except ZeroDivisionError:
                            print("Error", column_index, numeric_prob[column_index], label)
                    elif column_index in self.type_dict['nominal']:
                        #print("nomial",column_index)
                        if label not in categorical_prob[column_index][str(value)]:
                            #assign a tiny probability to each event even if events are unlikely (epsilon smoothing)
                            conditional_probability = 1e-9
                        else:
                            conditional_probability = categorical_prob[column_index][str(value)][label]
                    #take the log of each probability and add it to the running sum
                    probability[label] += safe_log(conditional_probability)
            # The predicted outcome is the label with the highest log-probability
            predicted_outputs.append(find_key_with_max_value(probability))
        return predicted_outputs
    


def safe_log(x):
    if x <= 0:
        return 0
    return math.log(x)
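
When every training instance of a class shares one value for a numeric attribute, the per-class standard deviation is 0 (or NaN for a class with a single instance), and the density in `guassian` is not defined. One common workaround, sketched here under the assumption that a small floor on the standard deviation is acceptable, is to evaluate the log-density directly. `gaussian_log_pdf` and `min_std` are hypothetical names, and the floor value is an arbitrary choice rather than something taken from this notebook.

```python
import math

def gaussian_log_pdf(value, mean, stdev, min_std=1e-3):
    # Replace a zero, tiny or NaN standard deviation with a small floor (an assumption)
    if math.isnan(stdev) or stdev < min_std:
        stdev = min_std
    # log N(value; mean, stdev^2), computed without exp() so tiny densities do not underflow
    return (-0.5 * math.log(2 * math.pi)
            - math.log(stdev)
            - (value - mean) ** 2 / (2 * stdev ** 2))

# A degenerate class (stdev == 0) still yields a finite score
score_at_mean = gaussian_log_pdf(2.0, 2.0, 0.0)
score_off_mean = gaussian_log_pdf(1.0, 2.0, 0.0)
```

Because the function already returns a log value, it would be added to the running sum directly rather than passed through `safe_log`.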

In [11]:
def oridinal_to_integer(X, replacement_map):
    copy = X.copy()
    for column_index in replacement_map:
        # Map each ordinal value to its position in the ordered list (avoids shadowing the built-in map)
        mapping = {key:index for index, key in enumerate(replacement_map[column_index])}
        copy.iloc[:,column_index] = copy.iloc[:,column_index].map(mapping)
    return copy
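
`Series.map` yields NaN for any value that is missing from the mapping, which silently turns the column into floats. A small guard can list such values before the conversion. This is only a sketch: `find_unmapped_values` is a hypothetical helper and the frame below is made-up data.

```python
import pandas as pd

def find_unmapped_values(X, replacement_map):
    # Collect, per column, the observed values that have no entry in the ordered list
    unmapped = {}
    for column_index, ordered_values in replacement_map.items():
        seen = set(X.iloc[:, column_index].unique())
        missing = seen - set(ordered_values)
        if missing:
            unmapped[column_index] = sorted(missing)
    return unmapped

# Made-up example: the second column contains a spelling that the map does not list
demo = pd.DataFrame({0: ["low", "med", "high"], 1: ["2", "3", "5more"]})
demo_map = {0: ["low", "med", "high"], 1: ["2", "3", "5-more"]}
report = find_unmapped_values(demo, demo_map)
```

An empty dictionary means every observed value has an integer position, so the converted column keeps an integer dtype.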



In [12]:
def accuracy(y_predicted, y_truth):
    assert(y_predicted.size==y_truth.size)
    return np.sum(y_predicted == y_truth)/y_predicted.size

In [13]:
# data_set ="./datasets/wdbc.data"
# data = read_data(data_set)
# #exlude the first column of ID
# data = data.iloc[:,1:]
# #exclude the label column and then discretize the data
# wbdc_freq = []
# for i in range(3,10):
#     X, y = preprocess(data,0,list(range(1,31)), i)
#     X_train, X_test, y_train, y_test = train_test_split(X,y)
#     # Nominal index split is wrong
#     type_dict = {"numeric":[],"nominal":list(range(30))}
#     bayes = BayesClassifier()
#     bayes.train(X_train, y_train,type_dict)
#     result = bayes.predict(X_test)
#     print(accuracy(np.array(result), y_test))
#     wbdc_freq.append(accuracy(np.array(result), y_test))


In [14]:
# data_set ="./datasets/wine.data"
# data = read_data(data_set)
# #exclude the label column and then discretize the data
# wine_freq = []
# 
# for i in range(3,10):
#     X, y = preprocess(data,0,list(range(1,13)), i)
#     X_train, X_test, y_train, y_test = train_test_split(X,y)
#     # Nominal index split is wrong
#     type_dict = {"numeric":[],"nominal":list(range(13))}
#     bayes = BayesClassifier()
#     bayes.train(X_train, y_train,type_dict)
#     result = bayes.predict(X_test)
#     print(accuracy(np.array(result), y_test))
#     wine_freq.append(accuracy(np.array(result), y_test))
#X_train, X_test, y_train, y_test = train_test_split(X,y)
#X_train

In [15]:
# data_set ="./datasets/car.data"
# data = read_data(data_set)
# #data = binning(data)
# X, y = preprocess(data,1)
# X_train, X_test, y_train, y_test = train_test_split(X,y)
# bayes = BayesClassifier()
# bayes.train(X_train, y_train)
# result = bayes.predict(X_test)
# print(accuracy(np.array(result), y_test))

In [16]:
# data_set ="./datasets/wine.data"
# data = read_data(data_set)
# #data = binning(data)
# X, y = preprocess(data,0)
# print(X_train.head())
# X_train, X_test, y_train, y_test = train_test_split(X,y)
# bayes = BayesClassifier()
# bayes.train(X_train, y_train)
# result = bayes.predict(X_test)
# print(accuracy(np.array(result), y_test))

In [17]:
# data_set ="./datasets/mushroom.data"
# data = read_data(data_set)
# X_train, X_test, y_train, y_test = train_test_split(X,y)
# bayes = BayesClassifier()
# bayes.train(X_train, y_train)
# result = bayes.predict(X_test)
# 
# print(accuracy(np.array(result), y_test))
# print(accuracy(np.array(baseline), y_test))

In [18]:
# #random_baseline
# data_set ="./datasets/adult.data"
# data = read_data(data_set)
# X_train, X_test, y_train, y_test = train_test_split(X,y)
# # result = random_baseline(X_test, y_test)
# 
# bayes = BayesClassifier()
# type_dict=defaultdict(list)
# type_list = [2, 0, 2, 0, 1, 0, 0, 0, 0, 0, 2, 2, 2, 0]
# for index, type in enumerate(type_list):
#     if (type==0):
#         type_dict["nominal"].append(index)
#     elif type==1:
#         type_dict["nominal"].append(index)    
#         # type_dict["numeric"].append(index)    
#     else:
#         type_dict["numeric"].append(index)
# 
# bayes.train(X_train, y_train, type_dict)
# result = bayes.predict(X_test)
# baseline = random_baseline(X_test, y_test)
# 
# print(accuracy(np.array(result), y_test))

In [19]:
def build_type_dict(type_list, treat_ordinal_as):
    type_dict=defaultdict(list)
    for index, type in enumerate(type_list):
        if (type==0):
            type_dict["nominal"].append(index)
        elif type==1:
            type_dict[treat_ordinal_as].append(index)    
        else:
            type_dict["numeric"].append(index)
    return type_dict


In [20]:
def random_baseline(X_train, y_train):
    labels = np.unique(y_train)
    y_baseline = np.array([np.random.choice(labels) for i in range(y_train.size)])
    return y_baseline

In [21]:
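def zero_r_baseline(X_train, y_train):
    # Zero-R: always predict the most frequent label in y_train
    labels, counts = np.unique(y_train, return_counts=True)
    # argmax returns the first maximum, so ties go to the smallest label
    label = labels[np.argmax(counts)]
    return np.repeat(label, y_train.size)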

In [22]:
def label_confusion_matrix(y_predicted, y_truth, label):
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    for index in range(y_predicted.size):
        result =  y_predicted[index]
        if (result==label):
            if ((y_truth[index]) == label):
                TP +=1
            else:
                FP +=1
        else:
            if ((y_truth[index]) == label):
                FN += 1
            else:
                TN += 1
    return TP, TN, FP, FN
                
  
# (TP, TN, FP, FN) = label_confusion_matrix(np.array([1, 0, 0, 1, 2]), np.array([1, 0, 1, 0, 2]), 2)  
# print(TP, TN, FP, FN)  
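
The four counts returned by `label_confusion_matrix` are enough to derive per-label precision, recall and F1. A small sketch follows; `precision_recall_f1` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption rather than a convention used elsewhere in this notebook.

```python
def precision_recall_f1(TP, TN, FP, FN):
    # TN is accepted so the tuple from label_confusion_matrix can be unpacked directly;
    # it is not needed for these three metrics
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Counts for label 2 with predictions [1, 0, 0, 1, 2] and truth [1, 0, 1, 0, 2]
scores = precision_recall_f1(1, 4, 0, 0)
```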

## Questions 


If you are in a group of 1, you will respond to question (1), and **one** other of your choosing (two responses in total).

If you are in a group of 2, you will respond to question (1) and question (2), and **two** others of your choosing (four responses in total). 

A response to a question should take about 100–250 words, and make reference to the data wherever possible.

#### NOTE: you may develop code or functions in response to the questions, but your formal answers should be added to a separate file.

1. Try discretising the numeric attributes in these datasets and treating them as discrete variables
in the naïve Bayes classifier. You can use a discretisation method of your choice and group the
numeric values into any number of levels (but around 3 to 5 levels would probably be a good
starting point). Does discretizing the variables improve classification performance, compared
to the Gaussian naïve Bayes approach? Why or why not?

2. Implement a baseline model (e.g., random or 0R) and compare the performance of the naïve
Bayes classifier to this baseline on multiple datasets. Discuss why the baseline performance
varies across datasets, and to what extent the naïve Bayes classifier improves on the baseline
performance.
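
For question 2, it can help to see what each baseline should score given only the class distribution. Zero-R scores the share of the majority class, while uniform random guessing scores about 1/k for k classes, whatever the class balance. A minimal sketch with made-up labels; `expected_baselines` is a hypothetical helper.

```python
import numpy as np

def expected_baselines(y):
    # Count how often each label occurs
    labels, counts = np.unique(y, return_counts=True)
    # Zero-R is right exactly as often as the majority label occurs
    zero_r = counts.max() / counts.sum()
    # Picking uniformly among k labels is right with probability 1/k
    uniform_random = 1.0 / len(labels)
    return zero_r, uniform_random

# Made-up, imbalanced label vector: 7 x "a", 2 x "b", 1 x "c"
y_demo = np.array(["a"] * 7 + ["b"] * 2 + ["c"])
zero_r, uniform_random = expected_baselines(y_demo)
```

This is why Zero-R varies with class imbalance across datasets while the random baseline depends mainly on the number of classes.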

# NOMINAL ATTRIBUTES DATASETS
## breast-cancer-wisconsin

In [23]:
data_set ="./datasets/breast-cancer-wisconsin.data"
data = pd.read_csv(data_set, header=None)
data = handle_missing_value(data)
label_index = 10
y = data.iloc[:,label_index]  
X = data.drop(data.columns[label_index], axis=1, inplace=False)
X = X.iloc[:,1:]

In [24]:
X.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9
0,5,1,1,1,2,1,3,1,1
1,5,4,4,5,7,10,3,2,1
2,3,1,1,1,2,2,3,1,1
3,6,8,8,1,3,4,3,7,1
4,4,1,1,3,2,1,3,1,1


In [25]:
y.head()

0    2
1    2
2    2
3    2
4    2
Name: 10, dtype: int64

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
# result = random_baseline(X_test, y_test)

bayes = BayesClassifier()
type_list = [0 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list, "nominal"))

result = bayes.predict(X_test)

print("bayes result:",accuracy(np.array(result), np.array(y_test)))

ran_baseline = random_baseline(X_test, np.array(y_test))
print("random baseline result:", accuracy(ran_baseline, y_test))

zeror_baseline = zero_r_baseline(X_test, np.array(y_test))
print("zeroR baseline result:", accuracy(zeror_baseline, y_test))

bayes result: 0.9416058394160584
random baseline result: 0.5036496350364964
zeroR baseline result: 0.6861313868613139
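
The classifier assigns a fixed probability of 1e-9 to value/class pairs that never occur in training. A commonly used alternative is add-one (Laplace) smoothing, sketched below on made-up data. `laplace_conditional` and `alpha` are hypothetical names, and this is not the method that produced the results above.

```python
from collections import Counter

def laplace_conditional(values, possible_values, alpha=1.0):
    # values: attribute values observed for one class
    # possible_values: every value the attribute can take across all classes
    counts = Counter(values)
    total = len(values) + alpha * len(possible_values)
    # Each value gets alpha pseudo-counts, so unseen values keep a non-zero probability
    return {v: (counts[v] + alpha) / total for v in possible_values}

# Made-up example: "z" never occurs for this class
probs = laplace_conditional(["x", "x", "y"], ["x", "y", "z"])
```

Unlike a fixed epsilon, the smoothed estimates still sum to 1 over the possible values of the attribute.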


# NUMERIC ATTRIBUTES DATASETS
## wdbc

In [27]:
label_index = 1
data_set ="./datasets/wdbc.data"
data = pd.read_csv(data_set, header=None)
y = data.iloc[:,label_index]  


In [28]:
width_wdbc = []
for i in range(3,10):
    discretized_data = equal_width_binning(data, list(range(2,32)), i)
    #y = discretized_data.iloc[:,label_index]  
    X = discretized_data.drop(data.columns[label_index], axis=1, inplace=False)
    # Drop ID column
    X = X.iloc[:,1:]
    X_train, X_test, y_train, y_test = train_test_split(X,y)

    bayes = BayesClassifier()
    type_list = [0 for i in range(X_train.shape[1])]
    bayes.train(X_train, y_train, build_type_dict(type_list, treat_ordinal_as="numeric"))

    result = bayes.predict(X_test)
    width_wdbc.append(result)

    print("Discretized bayes result is {:f} with {:d} of bins :".format(accuracy(np.array(result), np.array(y_test)), i))

Discretized bayes result is 0.929825 with 3 of bins :
Discretized bayes result is 0.964912 with 4 of bins :
Discretized bayes result is 0.938596 with 5 of bins :
Discretized bayes result is 0.947368 with 6 of bins :
Discretized bayes result is 0.938596 with 7 of bins :
Discretized bayes result is 0.973684 with 8 of bins :
Discretized bayes result is 0.982456 with 9 of bins :


In [29]:
X.head()

Unnamed: 0,2,3,4,5,6,7,8,9,10,11,...,22,23,24,25,26,27,28,29,30,31
0,"(16.372, 18.719]","(9.68, 12.996]","(108.106, 124.184]","(929.333, 1191.278]","(0.114, 0.126]","(0.273, 0.309]","(0.285, 0.332]","(0.134, 0.156]","(0.238, 0.26]","(0.0763, 0.0816]",...,"(23.547, 26.67]","(16.189, 20.358]","(184.27, 206.58]","(1993.556, 2445.644]","(0.155, 0.172]","(0.6, 0.714]","(0.696, 0.835]","(0.259, 0.291]","(0.438, 0.495]","(0.106, 0.123]"
1,"(18.719, 21.067]","(16.281, 19.567]","(124.184, 140.263]","(1191.278, 1453.222]","(0.0772, 0.0896]","(0.0556, 0.0918]","(0.0474, 0.0948]","(0.0671, 0.0894]","(0.172, 0.194]","(0.0552, 0.0605]",...,"(23.547, 26.67]","(20.358, 24.527]","(139.65, 161.96]","(1541.467, 1993.556]","(0.122, 0.138]","(0.142, 0.256]","(0.139, 0.278]","(0.162, 0.194]","(0.269, 0.326]","(0.0889, 0.106]"
2,"(18.719, 21.067]","(19.567, 22.852]","(124.184, 140.263]","(1191.278, 1453.222]","(0.102, 0.114]","(0.128, 0.164]","(0.19, 0.237]","(0.112, 0.134]","(0.194, 0.216]","(0.0552, 0.0605]",...,"(23.547, 26.67]","(24.527, 28.696]","(139.65, 161.96]","(1541.467, 1993.556]","(0.138, 0.155]","(0.371, 0.485]","(0.417, 0.556]","(0.226, 0.259]","(0.326, 0.382]","(0.072, 0.0889]"
3,"(9.329, 11.676]","(19.567, 22.852]","(75.948, 92.027]","(141.142, 405.444]","(0.139, 0.151]","(0.273, 0.309]","(0.237, 0.285]","(0.0894, 0.112]","(0.238, 0.26]","(0.0922, 0.0974]",...,"(14.177, 17.3]","(24.527, 28.696]","(95.03, 117.34]","(181.131, 637.289]","(0.206, 0.223]","(0.829, 0.943]","(0.556, 0.696]","(0.226, 0.259]","(0.607, 0.664]","(0.157, 0.174]"
4,"(18.719, 21.067]","(12.996, 16.281]","(124.184, 140.263]","(1191.278, 1453.222]","(0.0896, 0.102]","(0.128, 0.164]","(0.19, 0.237]","(0.0894, 0.112]","(0.172, 0.194]","(0.0552, 0.0605]",...,"(20.423, 23.547]","(16.189, 20.358]","(139.65, 161.96]","(1541.467, 1993.556]","(0.122, 0.138]","(0.142, 0.256]","(0.278, 0.417]","(0.162, 0.194]","(0.213, 0.269]","(0.072, 0.0889]"


In [30]:
y.head()

0    M
1    M
2    M
3    M
4    M
Name: 1, dtype: object

In [31]:
X = data.drop(data.columns[label_index], axis=1, inplace=False)
# Drop ID column
X = X.iloc[:,1:]
X_train, X_test, y_train, y_test = train_test_split(X,y)
# result = random_baseline(X_test, y_test)

bayes = BayesClassifier()
type_list = [2 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list, treat_ordinal_as="numeric"))

result = bayes.predict(X_test)

print("Gaussian bayes result:",accuracy(np.array(result), np.array(y_test)))

ran_baseline = random_baseline(X_test, np.array(y_test))
print("random baseline result:", accuracy(ran_baseline, y_test))

zeror_baseline = zero_r_baseline(X_test, np.array(y_test))
print("zeroR baseline result:", accuracy(zeror_baseline, y_test))

Gaussian bayes result: 0.956140350877193
random baseline result: 0.4298245614035088
zeroR baseline result: 0.5526315789473685




# Wine

In [32]:
label_index = 0
data_set ="./datasets/wine.data"
data = pd.read_csv(data_set, header=None)
y = data.iloc[:,label_index]  



In [33]:
width_wine = []
for i in range(3,10):
    discretized_data = equal_width_binning(data, list(range(1,14)), i)
    #y = discretized_data.iloc[:,label_index]  
    X = discretized_data.drop(data.columns[label_index], axis=1, inplace=False)
    # wine.data has no ID column, so only the label column is dropped
    X_train, X_test, y_train, y_test = train_test_split(X,y)

    bayes = BayesClassifier()
    type_list = [0 for i in range(X_train.shape[1])]
    bayes.train(X_train, y_train, build_type_dict(type_list, treat_ordinal_as="numeric"))

    result = bayes.predict(X_test)
    width_wine.append(result)

    print("Discretized bayes result is {:f} with {:d} of bins :".format(accuracy(np.array(result), np.array(y_test)), i))

Discretized bayes result is 0.916667 with 3 of bins :
Discretized bayes result is 1.000000 with 4 of bins :
Discretized bayes result is 0.972222 with 5 of bins :
Discretized bayes result is 0.861111 with 6 of bins :
Discretized bayes result is 1.000000 with 7 of bins :
Discretized bayes result is 0.944444 with 8 of bins :
Discretized bayes result is 0.944444 with 9 of bins :


In [34]:
X.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,"(13.986, 14.408]","(1.302, 1.864]","(2.399, 2.607]","(14.911, 17.067]","(121.111, 131.333]","(2.591, 2.913]","(2.973, 3.5]","(0.248, 0.307]","(2.171, 2.523]","(5.187, 6.489]","(1.027, 1.163]","(3.697, 4.0]","(1056.889, 1212.667]"
1,"(13.141, 13.563]","(1.302, 1.864]","(1.983, 2.191]","(10.581, 12.756]","(90.444, 100.667]","(2.591, 2.913]","(2.447, 2.973]","(0.248, 0.307]","(1.114, 1.467]","(3.884, 5.187]","(1.027, 1.163]","(3.393, 3.697]","(901.111, 1056.889]"
2,"(13.141, 13.563]","(1.864, 2.427]","(2.607, 2.814]","(17.067, 19.222]","(100.667, 110.889]","(2.591, 2.913]","(2.973, 3.5]","(0.248, 0.307]","(2.523, 2.876]","(5.187, 6.489]","(1.027, 1.163]","(3.09, 3.393]","(1056.889, 1212.667]"
3,"(13.986, 14.408]","(1.864, 2.427]","(2.399, 2.607]","(14.911, 17.067]","(110.889, 121.111]","(3.558, 3.88]","(2.973, 3.5]","(0.189, 0.248]","(2.171, 2.523]","(7.791, 9.093]","(0.753, 0.89]","(3.393, 3.697]","(1368.444, 1524.222]"
4,"(13.141, 13.563]","(2.427, 2.989]","(2.814, 3.022]","(19.222, 21.378]","(110.889, 121.111]","(2.591, 2.913]","(2.447, 2.973]","(0.366, 0.424]","(1.819, 2.171]","(3.884, 5.187]","(1.027, 1.163]","(2.787, 3.09]","(589.556, 745.333]"


In [35]:
X = data.drop(data.columns[label_index], axis=1, inplace=False)
X_train, X_test, y_train, y_test = train_test_split(X,y)
# result = random_baseline(X_test, y_test)

bayes = BayesClassifier()
type_list = [2 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list, treat_ordinal_as="numeric"))

result = bayes.predict(X_test)

print("Gaussian bayes result:",accuracy(np.array(result), np.array(y_test)))

ran_baseline = random_baseline(X_test, np.array(y_test))
print("random baseline result:", accuracy(ran_baseline, y_test))

zeror_baseline = zero_r_baseline(X_test, np.array(y_test))
print("zeroR baseline result:", accuracy(zeror_baseline, y_test))

Gaussian bayes result: 0.9722222222222222
random baseline result: 0.2222222222222222
zeroR baseline result: 0.4444444444444444


# Nursery

In [36]:
label_index = 8
data_set ="./datasets/nursery.data"
data = pd.read_csv(data_set, header=None)
y = data.iloc[:,label_index]  
X = data.drop(data.columns[label_index], axis=1, inplace=False)
replacement_map = {
    0: ["usual", "pretentious", "great_pret"],
    1: ["proper", "less_proper", "improper", "critical", "very_crit"],
    2: ["complete", "completed", "incomplete", "foster"],
    3: ["1", "2","3", "more"],
    4: ["convenient", "less_conv", "critical"],
    5: ["convenient", "inconv"],
    6: ["nonprob", "slightly_prob", "problematic"],
    7: ["recommended", "priority", "not_recom"],
}

In [37]:
last = np.unique(X.iloc[:,-1])
last

array(['not_recom', 'priority', 'recommended'], dtype=object)

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
# result = random_baseline(X_test, y_test)

bayes = BayesClassifier()

#consider all the ordinal attributes as nominal
type_list_nominal = [0 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list_nominal, treat_ordinal_as="nominal"))
nominal_result = bayes.predict(X_test)

print("treat ordinal as nominal naive bayes result:",accuracy(np.array(nominal_result), np.array(y_test)))


ran_baseline = random_baseline(X_test, np.array(y_test))
print("random baseline result:", accuracy(ran_baseline, y_test))

zeror_baseline = zero_r_baseline(X_test, np.array(y_test))
print("zeroR baseline result:", accuracy(zeror_baseline, y_test))

treat ordinal as nominal naive bayes result: 0.8935185185185185
random baseline result: 0.22723765432098766
zeroR baseline result: 0.3314043209876543




In [39]:
X_numeric = oridinal_to_integer(X, replacement_map)
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y)
bayes = BayesClassifier()

#consider all the ordinal attributes as numeric

X_numeric.head()


Unnamed: 0,0,1,2,3,4,5,6,7
0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,2
3,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,1,1


In [40]:
type_list_numeric = [2 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list_numeric, treat_ordinal_as="numeric"))
numeric_result = bayes.predict(X_test)

print("treat ordinal as numeric naive bayes result:",accuracy(np.array(numeric_result), np.array(y_test)))

Error 7 defaultdict(<class 'dict'>, {'mean': {'not_recom': 2.0, 'priority': 0.42950430394775896, 'recommend': 0.0, 'spec_prior': 0.6112814224402208, 'very_recom': 0.0}, 'std': {'not_recom': 0.0, 'priority': 0.4950788920155539, 'recommend': nan, 'spec_prior': 0.4875339076907105, 'very_recom': 0.0}}) not_recom
Error 7 defaultdict(<class 'dict'>, {'mean': {'not_recom': 2.0, 'priority': 0.42950430394775896, 'recommend': 0.0, 'spec_prior': 0.6112814224402208, 'very_recom': 0.0}, 'std': {'not_recom': 0.0, 'priority': 0.4950788920155539, 'recommend': nan, 'spec_prior': 0.4875339076907105, 'very_recom': 0.0}}) very_recom
Error 7 defaultdict(<class 'dict'>, {'mean': {'not_recom': 2.0, 'priority': 0.42950430394775896, 'recommend': 0.0, 'spec_prior': 0.6112814224402208, 'very_recom': 0.0}, 'std': {'not_recom': 0.0, 'priority': 0.4950788920155539, 'recommend': nan, 'spec_prior': 0.4875339076907105, 'very_recom': 0.0}}) not_recom
Error 7 defaultdict(<class 'dict'>, {'mean': {'not_recom': 2.0, 'prio



## Car

In [41]:
label_index = 6
data_set ="./datasets/car.data"
data = pd.read_csv(data_set, header=None)
y = data.iloc[:,label_index]  
X = data.drop(data.columns[label_index], axis=1, inplace=False)


In [42]:
X.head()


y.head()

0    unacc
1    unacc
2    unacc
3    unacc
4    unacc
Name: 6, dtype: object

In [43]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
# result = random_baseline(X_test, y_test)

bayes = BayesClassifier()
type_list = [0 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list, "nominal"))

nominal_result = bayes.predict(X_test)

print("treat ordinal as nominal naive bayes result:",accuracy(np.array(nominal_result), np.array(y_test)))


ran_baseline = random_baseline(X_test, np.array(y_test))
print("random baseline result:", accuracy(ran_baseline, y_test))

zeror_baseline = zero_r_baseline(X_test, np.array(y_test))
print("zeroR baseline result:", accuracy(zeror_baseline, y_test))

treat ordinal as nominal naive bayes result: 0.869942196531792
random baseline result: 0.27167630057803466
zeroR baseline result: 0.7023121387283237


In [44]:
replacement_map = {
    0: ["low", "med", "high", "vhigh"],
    1: ["low", "med", "high", "vhigh"],
    2: ["2", "3", "4", "5-more"],
    3: ["2", "4", "more"],
    4: ["small", "med", "big"],
    5: ["low", "med", "high"]
}

X_numeric = oridinal_to_integer(X, replacement_map)

In [45]:
X_numeric.head()

Unnamed: 0,0,1,2,3,4,5
0,3,3,0.0,0,0,0
1,3,3,0.0,0,0,1
2,3,3,0.0,0,0,2
3,3,3,0.0,0,1,0
4,3,3,0.0,0,1,1


In [46]:
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y)
bayes = BayesClassifier()

#consider all the ordinal attributes as numeric
type_list_numeric = [2 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list_numeric, "numeric"))
numeric_result = bayes.predict(X_test)

print("treat ordinal as numeric naive bayes result:",accuracy(np.array(numeric_result), np.array(y_test)))

Error 5 defaultdict(<class 'dict'>, {'mean': {'acc': 1.5488958990536277, 'good': 1.4137931034482758, 'unacc': 0.7554858934169278, 'vgood': 2.0}, 'std': {'acc': 0.49839017290186827, 'good': 0.49681380460171537, 'unacc': 0.8006167203521827, 'vgood': 0.0}}) vgood
Error 5 defaultdict(<class 'dict'>, {'mean': {'acc': 1.5488958990536277, 'good': 1.4137931034482758, 'unacc': 0.7554858934169278, 'vgood': 2.0}, 'std': {'acc': 0.49839017290186827, 'good': 0.49681380460171537, 'unacc': 0.8006167203521827, 'vgood': 0.0}}) vgood
Error 5 defaultdict(<class 'dict'>, {'mean': {'acc': 1.5488958990536277, 'good': 1.4137931034482758, 'unacc': 0.7554858934169278, 'vgood': 2.0}, 'std': {'acc': 0.49839017290186827, 'good': 0.49681380460171537, 'unacc': 0.8006167203521827, 'vgood': 0.0}}) vgood
Error 5 defaultdict(<class 'dict'>, {'mean': {'acc': 1.5488958990536277, 'good': 1.4137931034482758, 'unacc': 0.7554858934169278, 'vgood': 2.0}, 'std': {'acc': 0.49839017290186827, 'good': 0.49681380460171537, 'unacc'



# Nursery

In [48]:
label_index = 8
data_set ="./datasets/nursery.data"
data = pd.read_csv(data_set, header=None)
y = data.iloc[:,label_index]  
X = data.drop(data.columns[label_index], axis=1, inplace=False)
replacement_map = {
    0: ["usual", "pretentious", "great_pret"],
    1: ["proper", "less_proper", "improper", "critical", "very_crit"],
    2: ["complete", "completed", "incomplete", "foster"],
    3: ["1", "2","3", "more"],
    4: ["convenient", "less_conv", "critical"],
    5: ["convenient", "inconv"],
    6: ["nonprob", "slightly_prob", "problematic"],
    7: ["recommended", "priority", "not_recom"],
}

In [49]:
last = np.unique(X.iloc[:,-1])
last

array(['not_recom', 'priority', 'recommended'], dtype=object)

In [50]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
# result = random_baseline(X_test, y_test)

bayes = BayesClassifier()

#consider all the ordinal attributes as nominal
type_list_nominal = [0 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list_nominal, treat_ordinal_as="nominal"))
nominal_result = bayes.predict(X_test)

print("treat ordinal as nominal naive bayes result:",accuracy(np.array(nominal_result), np.array(y_test)))


ran_baseline = random_baseline(X_test, np.array(y_test))
print("random baseline result:", accuracy(ran_baseline, y_test))

zeror_baseline = zero_r_baseline(X_test, np.array(y_test))
print("zeroR baseline result:", accuracy(zeror_baseline, y_test))

In [None]:
X_numeric = oridinal_to_integer(X, replacement_map)
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y)
bayes = BayesClassifier()

#consider all the ordinal attributes as numeric

X_numeric.head()


In [None]:
type_list_numeric = [2 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list_numeric, treat_ordinal_as="numeric"))
numeric_result = bayes.predict(X_test)

print("treat ordinal as numeric naive bayes result:",accuracy(np.array(numeric_result), np.array(y_test)))

## Car

In [None]:
label_index = 6
data_set ="./datasets/car.data"
data = pd.read_csv(data_set, header=None)
y = data.iloc[:,label_index]  
X = data.drop(data.columns[label_index], axis=1, inplace=False)
replacement_map = {
    0: ["low", "med", "high", "vhigh"],
    1: ["low", "med", "high", "vhigh"],
    2: ["2", "3", "4", "5-more"],
    3: ["2", "4", "more"],
    4: ["small", "med", "big"],
    5: ["low", "med", "high"]
}


In [None]:
X.head()

In [None]:
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
# result = random_baseline(X_test, y_test)

bayes = BayesClassifier()
type_list = [0 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list, "nominal"))

nominal_result = bayes.predict(X_test)

print("treat ordinal as nominal naive bayes result:",accuracy(np.array(nominal_result), np.array(y_test)))


ran_baseline = random_baseline(X_test, np.array(y_test))
print("random baseline result:", accuracy(ran_baseline, y_test))

zeror_baseline = zero_r_baseline(X_test, np.array(y_test))
print("zeroR baseline result:", accuracy(zeror_baseline, y_test))

In [None]:
X_numeric = oridinal_to_integer(X, replacement_map)
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y)
bayes = BayesClassifier()

#consider all the ordinal attributes as numeric
type_list_numeric = [2 for i in range(X_train.shape[1])]
bayes.train(X_train, y_train, build_type_dict(type_list_numeric, "numeric"))
numeric_result = bayes.predict(X_test)

print("treat ordinal as numeric naive bayes result:",accuracy(np.array(numeric_result), np.array(y_test)))

# Mixed data type
## Adult

In [None]:
label_index = 14
data_set ="./datasets/adult.data"

In [None]:
data = pd.read_csv(data_set, header=None)
data = handle_missing_value(data)

y = data.iloc[:,label_index]  
X = data.drop(data.columns[label_index], axis=1, inplace=False)



In [None]:
X.head()

In [None]:
y.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y)
# result = random_baseline(X_test, y_test)

bayes = BayesClassifier()
type_list = [2, 0, 2, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2, 0]
bayes.train(X_train, y_train, build_type_dict(type_list, treat_ordinal_as="nominal"))

result = bayes.predict(X_test)

print(accuracy(np.array(result), np.array(y_test)))

ran_baseline = random_baseline(X_test, np.array(y_test))
print(accuracy(ran_baseline, y_test))

zeror_baseline = zero_r_baseline(X_test, np.array(y_test))
print(accuracy(zeror_baseline, y_test))

3. Since it’s difficult to model the probabilities of ordinal data, ordinal attributes are often treated as
either nominal variables or numeric variables. Compare these strategies on the ordinal datasets
provided. Determine which approach gives higher classification accuracy and discuss why.

4. Evaluating the model on the same data that we use to train the model is considered to be a major
mistake in Machine Learning. Implement a hold–out or cross–validation evaluation strategy
(you should implement this yourself and do not simply call existing implementations from
scikit-learn). How does your estimate of effectiveness change, compared to testing on
the training data? Explain why. (The result might surprise you!)
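
For question 4, a k-fold split can be built from a shuffled index array without scikit-learn. A minimal sketch: `k_fold_indices` is a hypothetical helper, and the fixed `seed` is an assumption added for reproducibility.

```python
import numpy as np

def k_fold_indices(n_rows, k=5, seed=0):
    # Shuffle the row positions once, then cut them into k nearly equal folds
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    folds = np.array_split(shuffled, k)
    for i in range(k):
        # Fold i is the test set; the remaining folds form the training set
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: 10 rows split into 3 folds
splits = list(k_fold_indices(10, k=3))
```

With the objects in this notebook, each pair could be used as `X.iloc[train_idx]` and `y.iloc[train_idx]` for `bayes.train`, and the per-fold values of `accuracy` averaged at the end.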