# CSC-321: Data Mining and Machine Learning
# Matthew Caulfield
## Assignment 7: Classification with probability

### Part 1: Naive Bayes

Everything so far has been a linear classifier. Now we'll move up a gear, and implement some non-linear classifiers. The first, as we saw in class, is Naive Bayes, that makes use of proability to make predictions.

We make use of Bayes Theorem, that allows us to calculate the probability of a piece of data belonging to a given class, given our prior knowledge. Bayes Theorem is stated as:

P(class|data) = (P(data|class) * P(class)) / P(data)

Where P(class|data) is the probability of class given the provided data

We're going to break this down into several steps. Again, I've given you a contrived data set for you to test your functions.

#### (a) Separate by class

Just as in class, we need to calculate the probability of data by the class they belong to. We'll need to separate our data by the class. Create a dictionary, where the key is class, and the values is a list of all instances with that class value. 

In [1]:
# Contrived data set
import statistics
import random
import csv
import math
from collections import Counter

dataset = [[3.393533211,2.331273381,0],
    [3.110073483,1.781539638,0],
    [1.343808831,3.368360954,0],
    [3.582294042,4.67917911,0],
    [2.280362439,2.866990263,0],
    [7.423436942,4.696522875,1],
    [5.745051997,3.533989803,1],
    [9.172168622,2.511101045,1],
    [7.792783481,3.424088941,1],
    [7.939820817,0.791637231,1]]


# implement separateByClass(dataset) here

def seperateByClass(dataset):
    classDict = {}
    for dataPoint in dataset:
        currClass = dataPoint[-1]
        if currClass not in classDict:
            classDict[currClass] = [dataPoint[:-1]]
        else:
            classDict[currClass].append(dataPoint[:-1])
    return classDict
testDict = seperateByClass(dataset)
print('0 data:', testDict[0])
print('1 data:', testDict[1])

0 data: [[3.393533211, 2.331273381], [3.110073483, 1.781539638], [1.343808831, 3.368360954], [3.582294042, 4.67917911], [2.280362439, 2.866990263]]
1 data: [[7.423436942, 4.696522875], [5.745051997, 3.533989803], [9.172168622, 2.511101045], [7.792783481, 3.424088941], [7.939820817, 0.791637231]]


#### (b) Summarize the data

We need two statistics from the data, the mean and the standard deviation. You should have these functions in a previous assignment, remembering the standard deviation is simply the square root of the variance. We need the mean and standard deviation for each of our attributes, i.e. for each column of our data. Create a function that summarizes a given data set, by gathering all of the information for each column, and calculating the mean and standard deviation on that columns data. We'll collect this information into a tuple, one per column, comprising the mean, the standard deviation and the number of elements in each column). Return a list of these tuples. 

In [2]:

# implement summarizeDataset(dataset) here, and copy across any functions you need to help you

def mean(listOfValues):
    total = 0
    for num in listOfValues:
        total += num
    return total/len(listOfValues)

def variance(listOfValues, meanValue):
    total = 0
    for num in listOfValues:
       total +=  (num - meanValue)**2/(len(listOfValues)-1)
    return total

def summarizeDataset(dataset):
    summaryData = []
    for col in range(len(dataset[0])):
        currCol = []
        for dataPoint in dataset:
            currCol.append(dataPoint[col])
        colMean = mean(currCol)
        colVar = variance(currCol, colMean)
        colStDev = colVar**0.5
        summaryData.append((colMean, colStDev, len(currCol)))
    return summaryData

testSummary = summarizeDataset(dataset)
print(testSummary)



[(5.178333386499999, 2.7665845055177263, 10), (2.9984683241, 1.218556343617447, 10), (0.5, 0.5270462766947299, 10)]


#### (c) Summarize data by class

We now need to combine the functions from (a) and (b) above. Create a summarizeByClass function, that splits the data by class, and then caluclates statistics for each row of the data for each class. The results - the list of tuples of statistics, one per column - should then be stored in a dictionary by their class value. summarizeByClass should return such a dictionary.

In [3]:

# implement summarizeByClass(dataset) here
def summarizeByClass(dataset):
    classDict = seperateByClass(dataset)
    summaryDict = {}
    for currClass in classDict:
        classData = classDict[currClass]
        classSummary = summarizeDataset(classData)
        summaryDict[currClass] = classSummary
    return summaryDict


# The dictionary for the contrived data should look like:
# {0: [(2.7420144012, 0.9265683289298018, 5), (3.0054686692, 1.1073295894898725, 5)], 1: [(7.6146523718, 1.2344321550313704, 5), (2.9914679790000003, 1.4541931384601618, 5)]}

testSummaryClass = summarizeByClass(dataset)
print(testSummaryClass)

{0: [(2.7420144012, 0.9265683289298018, 5), (3.0054686692, 1.1073295894898725, 5)], 1: [(7.6146523718, 1.2344321550313704, 5), (2.9914679790000003, 1.4541931384601618, 5)]}


#### (d) Guassiaun Probability Density

We're working with numerical data here, so we need to implement the gaussian probability density function (PDF) we talked about in class, so we can attach probabilities to real values. A gaussian distribution can be summarized from two values - guess which two? If you guessed mean and standard deviation, you were correct. The gaussian PDF is calculated as follows:

probability(x) = (1 / (sqrt(2 * pi) * std_dev)) * exp(-((x-mean) ** 2 / 2 * std_dev ** 2 )))

Hopefully, you can see why we're going to need the mean and the std_dev from function (c)

Create a function that:
- takes a value
- takes a mean
- takes a standard deviation

and returns the probability of seeing that value, given that distribution, using the formula above.

In [4]:

# Implement calcProb(value, mean, std_dev) here
def calcProb(value, mean, std_dev):
    probability = ((1 / math.sqrt(2 * math.pi * std_dev)) * math.exp(-((value-mean) ** 2) / (2 * std_dev ** 2 )))
    return probability

print(calcProb(3, 2, 1))


0.24197072451914337


#### (e) Class Probabilities

We can now use probabilites calculated from our training data to calculate probabilities for an instance of new data, by creating a function called calcClassProbs. Probabilites have to be calculated separately for each possible class in our data, so for each class we have to calculate the likelihood the new instance of data belongs to that class. The probability that a piece of data belongs to a class is calculated by:

p(class|data) = p(X|class) * P(class)

The divison has been removed, because we're just trying to maximize the result of the formula above. The largest value we get for each class above determines which class we assign. Each input value is treated separately, so in the case where we have TWO input values in our data (X1 and X2), the probablility that an instance belongs to class 0 is calculated by:

P(class=0|X1,X2) = P(X1|class=0) * P(X2|class=0) * P(class=0)

We have to repeat this for each class, and then choose the class with the highest score. We should not assume a fixed number of input features, X, the above was just an illustration. 

We'll start by creating a function that will return the probabilities of predicting each class for a given instance. This function will take a dictionary of summaries (as returned by (c), above) and an instance, and will generate a dictionary of probabilites, with one entry per class. The steps are as follows:

- We need to calculate the total number of training instances, by counting the counts stored in the summary statistics. So if there are 9 instances with one label, and 5 with another (as in the weather data) then we need to know there are 14 instances. 

- This will help us calculate the probability of a given class, the prior probability P(class), as the ratio of rows with a given class divided by all rows in the training data

- Next probabilities are calculated for each input value in the instance, using the gaussian PDF, and the statistics for that column and of that class. Probabilites are multiplied together as they are accumulated with the formula given above. 

- The process is repeated for each class in the data

- Return the dictionary of probabilities for each class for the new instance

Some things that might help with implementation. 

- Dictionaries are your friend here
- The data returned by (c) above is already divided by class. You can:
    - discover the prior probability from this data (how many instances for this class, divided by the total instances)
    - iterate over the tuples, which give you the information (mean, std_dev, count) on a per column basis
    - calculate probability given the attribute value corresponding to that column using your function from (d)

Try this out on the contrived data. 

NOTE: If you want to output ACTUAL probabilities by class, we divide each score in the dictionary for an instance, by the sum of the values. You don't need to do this, it's just a reminder.


In [5]:

# Implement calcClassProbs(summaries, instance) here

def calcClassProbs(summaries, instance):
    probDict = {}
    totalLen = 0
    for currClass in summaries:
        totalLen += summaries[currClass][0][-1]
    for currClass in summaries:
        priorProb = summaries[currClass][0][-1]/totalLen
        #print(priorProb)
        classProb = priorProb
        for i in range(len(summaries[currClass])):
            colMean = summaries[currClass][i][0]
            colStdDev = summaries[currClass][i][1]
            #print('mean', colMean, 'stdDev', colStdDev)
            colProb = calcProb(instance[i],colMean,colStdDev)
            classProb *= colProb
        probDict[currClass] = classProb
    return probDict
            

# Test it out here

summaries = summarizeByClass(dataset)
probabilities = calcClassProbs(summaries, dataset[0])
print('Probabilities are:',probabilities)

# I think if everything works, it should be:
# {0: 0.05032427673372075, 1: 0.00011557718379945765}
# which according to the percentage calculation give above should be:
# 99.77% in favour of class 0 

sumProbs = sum([v for _,v in probabilities.items()])
for k,v in probabilities.items():
    print('The probability of the instance belonging to class %d is %.2f' % (k,v/sumProbs*100))

Probabilities are: {0: 0.050974704886547935, 1: 0.00015485198134648}
The probability of the instance belonging to class 0 is 99.70
The probability of the instance belonging to class 1 is 0.30


#### (f) Tying it all together

You need to create a predict function. This function works very much as the example above, in that it takes a dictionary of summaries and a single row, and uses calcClassProbabilites to get the dictionary of probabilities. From this dictionary, find the largest value and corresponding class. Return this class. 

You also need a naiveBayes function, that takes a training set and a test set. It needs to generate summary statistics from the training set (using (c), above), then make predictions for each instance in the test set, by calling your predict function above for each instance, using the summaries generated. Append these predictions to a list you return.

In [6]:

# Implement predict(summaries,instance) here
import operator
def predict(summaries, instance):
    probDict = calcClassProbs(summaries, instance)
    return max(probDict.items(), key=operator.itemgetter(1))[0]




# Implement naiveBayes(train,test) here

def naiveBayes(train, test):
    classSummary = summarizeByClass(train)
    return[predict(classSummary, instance) for instance in test]



### Applying to real data

You've seen bits of the iris dataset in class. It's one of the most well known data sets in machine learning and data mining. So you might as well have a go at it! You can find out more about it here: http://archive.ics.uci.edu/ml/datasets/Iris

You'll need to:

- Load the data
- convert all but the last column to floats
- convert the last column to an int. There are THREE classes, so convert them to 0, 1 and 2 accordingly
- call evaluate algorithm, using a 5-fold cross-validation
- print the mean, min and max scores
- compare this to some reasonable baseline
- give me a very short write up of the results

In [7]:

def load_data(filename):
    csvTxt = csv.reader(open(filename))
    data = []
    for row in csvTxt:
        data.append(row)
    return data

def column2Float(dataset,column):
    for instance in dataset:
        instance[column] = float(instance[column])
    return dataset

def rmse_eval(actual, predicted):
    error = 0.0
    for i in range(len(actual)):
        error += (predicted[i] - actual[i])**2
    error = error/len(actual)
    error = error**0.5
    return error

def minmax(dataset):
    listMinMax = []
    for column in range(len(dataset[0])):
        columnData = [dataset[i][column] for i in range(len(dataset))]
        listMinMax.append([min(columnData), max(columnData)])
    return listMinMax

def normalize(dataset, minmax):
    for row in range(len(dataset)):
        for column in range(len(dataset[row])):
            dataset[row][column] = (dataset[row][column] - minmax[column][0]) / (minmax[column][1] - minmax[column][0])

def accuracy(actual, predicted):
    counter = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            counter += 1
    return counter*100/len(actual)

def zeroRC(train, test):
    trainY = [i[-1] for i in train]
    count = Counter(trainY)
    dataMode = count.most_common(1)[0][0]
    return [dataMode for i in test]

random.seed(1)

def cross_validation_data(dataset, folds):
    dataCopy = dataset[:]
    foldLen = len(dataCopy)//folds
    crossData = []
    for i in range(folds - 1):
        currFold = []
        for j in range(foldLen):
            currData = random.choice(dataCopy)
            currFold.append(currData)
            dataCopy.pop(dataCopy.index(currData))
        crossData.append(currFold)
    currFold = []
    for i in range(len(dataCopy)):
            currData = random.choice(dataCopy)
            currFold.append(currData)
            dataCopy.pop(dataCopy.index(currData))
    crossData.append(currFold)
    return crossData

def evaluate_algorithm(dataset, algorithm, folds, metric, *args):
    foldedData = cross_validation_data(dataset, folds)
    scores = []
    for i in range(len(foldedData)):
        copyFolded = foldedData[:]
        test_data = copyFolded.pop(i)
        test = [test_data[j][:-1] for j in range(len(test_data))]
        for j in test:
            j.append(None)
        train = []
        for fold in copyFolded:
            train += fold
        predicted = algorithm(train,test, *args)
        actual = [j[-1] for j in test_data]
        result = metric(actual,predicted)
        scores.append(result)
    return scores


filename = 'iris.csv'
irisData = load_data(filename)
print('Number of Instances:', len(irisData), 'Number of Features:', len(irisData[0]))
for column in range(len(irisData[0])-1):
    column2Float(irisData, column)
    
def irisClass(data):
    for i in data: 
        if i[-1] == 'Iris-setosa':
            i[-1] = 1
        elif i[-1] == 'Iris-virginica':
            i[-1] = 0
        elif i[-1] == 'Iris-versicolor':
            i[-1] = 2
            
irisClass(irisData)

irisCopy = irisData[:]
print('irisData first row', irisCopy[0])
folds = 5

scores = evaluate_algorithm(irisCopy, naiveBayes, folds, accuracy)
zeroRCScores = evaluate_algorithm(irisCopy, zeroRC, folds, accuracy)
print('Bayes:', scores)
print('Bayes Min: %.3f' % min(scores), 'Bayes Max: %.3f' % max(scores), 'Bayes Mean: %.3f' % mean(scores))
print('zeroRC: ', zeroRCScores)
print('zeroRC Min: %.3f' % min(zeroRCScores), 'zeroRC Max: %.3f' % max(zeroRCScores), 'zeroRC Mean: %.3f' % mean(zeroRCScores))


Number of Instances: 150 Number of Features: 5
irisData first row [5.1, 3.5, 1.4, 0.2, 1]
Bayes: [96.66666666666667, 96.66666666666667, 100.0, 93.33333333333333, 93.33333333333333]
Bayes Min: 93.333 Bayes Max: 100.000 Bayes Mean: 96.000
zeroRC:  [13.333333333333334, 30.0, 30.0, 20.0, 33.333333333333336]
zeroRC Min: 13.333 zeroRC Max: 33.333 zeroRC Mean: 25.333


Write up your observations on the experiment here

The baseline I used was zeroRC because it is a good based line for categorization. The baseline was lower than expected because I thought it should be correctly predict the answer more than 1/3 of the time when it only worked correctly 25% of the time. This could be due to how the folds are split. Naive Bayes was a good classifier and was able to classif the data with a mean score of 96%. This shows that the data had clear splits for certain attributes. 