### QUESTION 1

Text classification is an example of a Naïve Bayes application. You are required to classify the
following statements, “a cup of hot coffee” and “a cone of ice cream”, given the categories Sunny
and Rainy.

Given training data, the objective of Naïve Bayes is to compute P(sunny|a cone of
ice cream) and P(rainy|a cup of hot coffee) and to classify each statement as the category with the
higher probability.

1. P(sunny|a cone of ice cream) = ?
2. P(rainy|a cup of hot coffee) = ?

#### Answer:

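Bayes' theorem states that $ P(A|B) = \frac{P(B|A) * P(A)}{P(B)} $

Naïve Bayes assumes that the words of a statement are conditionally independent given the category, so the likelihood of each statement factorises as follows:

$ P(a\, cone\, of\, ice\, cream\,|\,sunny) = P(a|sunny) * P(cone|sunny) * P(of|sunny) * P(ice|sunny) * P(cream|sunny) $
$ P(a\, cup\, of\, hot\,\, coffee\,|\,rainy) = P(a|rainy) * P(cup|rainy) * P(of|rainy) * P(hot|rainy) * P(coffee|rainy) $

Substituting these likelihoods into Bayes' theorem gives the posterior probabilities:

$ P(sunny\,|\,a\, cone\, of\, ice\, cream\,) = \frac{P(sunny) * P(a|sunny) * P(cone|sunny) * P(of|sunny) * P(ice|sunny) * P(cream|sunny)}{P(a\, cone\, of\, ice\, cream\,)} $
$ P(rainy\,|\,a \, cup\, of\, hot\,\, coffee\,) = \frac{P(rainy) * P(a|rainy) * P(cup|rainy) * P(of|rainy) * P(hot|rainy) * P(coffee|rainy)}{P(a\, cup\, of\, hot\,\, coffee\,)} $

The denominator does not depend on the category, so each statement is assigned to the category with the larger numerator. The numerical values depend on the word and category counts in the training data.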
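
To make the computation concrete, here is a minimal Python sketch. The training sentences are invented for illustration, since none are listed in the question. Add-one (Laplace) smoothing is also an assumption, used so that a word unseen in one category does not force the whole product to zero. The names `train` and `posterior` are hypothetical helpers. Each score is the category prior times the per-word likelihoods, normalised over both categories.

```python
from collections import Counter, defaultdict

# Hypothetical training data: invented purely for illustration.
TRAINING = [
    ("a cone of ice cream", "sunny"),
    ("a day at the beach", "sunny"),
    ("a cup of hot coffee", "rainy"),
    ("a bowl of hot soup", "rainy"),
]


def train(examples):
    """Count how often each category and each (category, word) pair occurs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sentence, label in examples:
        label_counts[label] += 1
        for word in sentence.split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def posterior(sentence, label_counts, word_counts, vocabulary, k=1):
    """Return normalised P(category | sentence) using add-k smoothing."""
    total_examples = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total_examples  # prior P(category)
        total_words = sum(word_counts[label].values())
        for word in sentence.split():
            # Smoothed likelihood P(word | category).
            score *= (word_counts[label][word] + k) / (total_words + k * len(vocabulary))
        scores[label] = score
    evidence = sum(scores.values())  # P(sentence), shared by all categories
    return {label: score / evidence for label, score in scores.items()}


model = train(TRAINING)
for statement in ("a cone of ice cream", "a cup of hot coffee"):
    probs = posterior(statement, *model)
    print(statement, "->", max(probs, key=probs.get), probs)
```

With this toy data the first statement is assigned to sunny and the second to rainy. Different training data would give different probabilities.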


### QUESTION 2


In [1]:
import mostFrequent
import naiveBayes
import samples
import sys
import util

TEST_SET_SIZE = 100
DIGIT_DATUM_WIDTH=28
DIGIT_DATUM_HEIGHT=28
FACE_DATUM_WIDTH=60
FACE_DATUM_HEIGHT=70

In [2]:
def basicFeatureExtractorDigit(datum):
    """
    Returns a set of pixel features indicating whether
    each pixel in the provided datum is white (0) or gray/black (1)
    """
    features = util.Counter()
    
    for x in range(DIGIT_DATUM_WIDTH):
        for y in range(DIGIT_DATUM_HEIGHT):
            if datum.getPixel(x, y) > 0:
                features[(x,y)] = 1
            else:
                features[(x,y)] = 0
    return features
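
The thresholding performed by the extractor can be tried without the course modules. The sketch below is a stand-in under stated assumptions: a nested list indexed as `pixels[x][y]` replaces `samples.Datum`, a plain dict replaces `util.Counter`, and `binary_pixel_features` is a hypothetical name.

```python
def binary_pixel_features(pixels):
    """Map a grid indexed as pixels[x][y] to {(x, y): 0 or 1}."""
    return {
        (x, y): 1 if value > 0 else 0
        for x, column in enumerate(pixels)
        for y, value in enumerate(column)
    }


# Toy grid with 3 columns and 2 rows: 0 is white, any positive value
# stands for a gray or black pixel (an assumed encoding).
toy = [[0, 2], [1, 0], [0, 0]]
features = binary_pixel_features(toy)
print(features)
```

Every pixel becomes one binary feature keyed by its coordinates, which is the representation the classifier above is trained on.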

In [6]:
def analysis(classifier, guesses, testLabels, testData, rawTestData, printImage):
    """
    This function is called after learning.
    Include any code that you want here to help you analyze your results.

    Use the printImage(<list of pixels>) function to visualize features.

    An example of use has been given to you.

    - classifier is the trained classifier
    - guesses is the list of labels predicted by your classifier on the test set
    - testLabels is the list of true labels
    - testData is the list of test datapoints (as util.Counter of features)
    - rawTestData is the list of test datapoints (as samples.Datum)
    - printImage is a method to visualize the features 
    (see its use in the odds ratio part in runClassifier method)

    This code won't be evaluated. It is for your own optional use
    (and you can modify the signature if you want).
    """
  
    # Put any code here...
    # Example of use:
    for i in range(len(guesses)):
        prediction = guesses[i]
        truth = testLabels[i]
        if (prediction != truth):
            print("===================================")
            print(("Mistake on example %d" % i)) 
            print(("Predicted %d; truth is %d" % (prediction, truth)))
            print("Image: ")
            print((rawTestData[i]))
            break
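
Beyond printing the first mistake, it can help to see which digits are confused most often. The following is an optional sketch, not part of the original assignment code. `confusion_counts` and `accuracy` are hypothetical helper names, and the label lists are made up for illustration.

```python
from collections import Counter


def accuracy(guesses, truths):
    """Fraction of predictions that match the true labels."""
    return sum(g == t for g, t in zip(guesses, truths)) / len(truths)


def confusion_counts(guesses, truths):
    """Count (truth, prediction) pairs for misclassified examples only."""
    return Counter((t, g) for g, t in zip(guesses, truths) if g != t)


# Made-up labels for illustration.
guesses = [3, 5, 3, 7, 1]
truths = [5, 5, 5, 7, 1]
print(accuracy(guesses, truths))
print(confusion_counts(guesses, truths).most_common(1))
```

Inside `analysis`, the same helpers could be called with `guesses` and `testLabels` to list the most frequent (truth, prediction) mistakes.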

In [None]:
## =====================
## You don't have to modify any code below.
## =====================


class ImagePrinter:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def printImage(self, pixels):
        """
        Prints a Datum object that contains all pixels in the 
        provided list of pixels.    This will serve as a helper function
        to the analysis function you write.
      
        Pixels should take the form 
        [(2,2), (2, 3), ...] 
        where each tuple represents a pixel.
        """
        image = samples.Datum(None,self.width,self.height)
        for pix in pixels:
            try:
                # New features that are not of the form (x, y)
                # should not break this image printer.
                x,y = pix
                image.pixels[x][y] = 2
            except Exception:
                print(("new features:", pix))
                continue
        print(image)

def default(text):
    return text + ' [Default: %default]'

def readCommand( argv ):
    "Processes the command used to run from the command line."
    from optparse import OptionParser  
    parser = OptionParser(USAGE_STRING)
  
    parser.add_option('-c', '--classifier', help=default('The type of classifier'), choices=['mostFrequent', 'nb', 'naiveBayes', 'perceptron', 'mira', 'minicontest'], default='mostFrequent')
    parser.add_option('-d', '--data', help=default('Dataset to use'), choices=['digits', 'faces'], default='digits')
    parser.add_option('-t', '--training', help=default('The size of the training set'), default=100, type="int")
    parser.add_option('-a', '--autotune', help=default("Whether to automatically tune hyperparameters"), default=False, action="store_true")
    parser.add_option('-i', '--iterations', help=default("Maximum iterations to run training"), default=3, type="int")

    options, otherjunk = parser.parse_args(argv)
    if len(otherjunk) != 0: raise Exception('Command line input not understood: ' + str(otherjunk))
    args = {}
  
    # Set up variables according to the command line input.
    print("Doing classification")
    print("--------------------")
    print(("data:\t\t" + options.data))
    print(("classifier:\t\t" + options.classifier))
    print(("training set size:\t" + str(options.training)))
    if(options.data=="digits"):
        printImage = ImagePrinter(DIGIT_DATUM_WIDTH, DIGIT_DATUM_HEIGHT).printImage
        featureFunction = basicFeatureExtractorDigit    
    else:
        print(("Unknown dataset", options.data))
        print(USAGE_STRING)
        sys.exit(2)
    
    if(options.data=="digits"):
        legalLabels = list(range(10))
    else:
        legalLabels = list(range(2))
    
    if options.training <= 0:
        print(("Training set size should be a positive integer (you provided: %d)" % options.training))
        print(USAGE_STRING)
        sys.exit(2)

    if(options.classifier == "mostFrequent"):
        classifier = mostFrequent.MostFrequentClassifier(legalLabels)
    elif(options.classifier == "naiveBayes" or options.classifier == "nb"):
        classifier = naiveBayes.NaiveBayesClassifier(legalLabels)
        # Automatic tuning only applies to the naive Bayes classifier.
        if (options.autotune):
            print("using automatic tuning for naivebayes")
            classifier.automaticTuning = True
    else:
        print(("Unknown classifier:", options.classifier))
        print(USAGE_STRING)
        sys.exit(2)

    args['classifier'] = classifier
    args['featureFunction'] = featureFunction
    args['printImage'] = printImage
  
    return args, options


USAGE_STRING = """
    USAGE:      python dataClassifier.py <options>
    EXAMPLES:   (1) python dataClassifier.py
                  - trains the default mostFrequent classifier on the digit dataset
                  using the default 100 training examples and
                  then test the classifier on test data
    """

# Main harness code

def runClassifier(args, options):

    featureFunction = args['featureFunction']
    classifier = args['classifier']
    printImage = args['printImage']
      
    # Load data  
    numTraining = options.training

    rawTrainingData = samples.loadDataFile("digitdata/trainingimages", numTraining,DIGIT_DATUM_WIDTH,DIGIT_DATUM_HEIGHT)
    trainingLabels = samples.loadLabelsFile("digitdata/traininglabels", numTraining)
    rawValidationData = samples.loadDataFile("digitdata/validationimages", TEST_SET_SIZE,DIGIT_DATUM_WIDTH,DIGIT_DATUM_HEIGHT)
    validationLabels = samples.loadLabelsFile("digitdata/validationlabels", TEST_SET_SIZE)
    rawTestData = samples.loadDataFile("digitdata/testimages", TEST_SET_SIZE,DIGIT_DATUM_WIDTH,DIGIT_DATUM_HEIGHT)
    testLabels = samples.loadLabelsFile("digitdata/testlabels", TEST_SET_SIZE)
    
  
    # Extract features
    print("Extracting features...")
    trainingData = list(map(featureFunction, rawTrainingData))
    validationData = list(map(featureFunction, rawValidationData))
    testData = list(map(featureFunction, rawTestData))
  
    # Conduct training and testing
    print("Training...")
    classifier.train(trainingData, trainingLabels, validationData, validationLabels)
    print("Validating...")
    guesses = classifier.classify(validationData)
    correct = [guesses[i] == validationLabels[i] for i in range(len(validationLabels))].count(True)
    print((str(correct), ("correct out of " + str(len(validationLabels)) + " (%.1f%%).") % (100.0 * correct / len(validationLabels))))
    print("Testing...")
    guesses = classifier.classify(testData)
    correct = [guesses[i] == testLabels[i] for i in range(len(testLabels))].count(True)
    print((str(correct), ("correct out of " + str(len(testLabels)) + " (%.1f%%).") % (100.0 * correct / len(testLabels))))
    analysis(classifier, guesses, testLabels, testData, rawTestData, printImage)

if __name__ == '__main__':
    # Read input
    args, options = readCommand( sys.argv[1:] )
    # Run classifier
    runClassifier(args, options)

In [10]:
%run dataClassifier.py -c naiveBayes -a

Doing classification
--------------------
data:		digits
classifier:		naiveBayes
training set size:	100
using automatic tuning for naivebayes
trying to read: digitdata/trainingimages
testing
trying to read: digitdata/traininglabels
testing
trying to read: digitdata/validationimages
testing
trying to read: digitdata/validationlabels
testing
trying to read: digitdata/testimages
testing
trying to read: digitdata/testlabels
testing
Extracting features...
Training...
Validating...
('74', 'correct out of 100 (74.0%).')
Testing...
('65', 'correct out of 100 (65.0%).')
Mistake on example 3
Predicted 3; truth is 5
Image: 
                            
                            
                            
                            
                            
          +#########+       
         +###########+      
         ############+      
         ############       
         ####+++#####       
         +##+     +++       
         +###++++           
          ########+         
   

### References for Question No 2.

- http://ai.berkeley.edu/projects/release/classification/v1/001/docs/naiveBayes.html
- https://github.com/anthony-niklas/cs188/blob/master/p5/naiveBayes.py
- https://github.com/anthony-niklas/cs188/blob/341f854af50863f6f30e09ca32910ee3025ec5b2/p5/dataClassifier.py
- https://www.youtube.com/watch?v=FgaM-TzT7qk&feature=emb_imp_woyt