Question 1 [10 points]

1. 
P(sunny | a cone of ice cream) = P(a cone of ice cream | sunny) x P(sunny) / P(a cone of ice cream)

   - Given the Naive Bayes assumption:
   
   P(a cone of ice cream | sunny) = [P(a | sunny) x P(cone | sunny) x P(of | sunny) x P(ice | sunny) x P(cream | sunny)]

Therefore, the full expression becomes:

P(sunny | a cone of ice cream) = [P(a | sunny) x P(cone | sunny) x P(of | sunny) x P(ice | sunny) x P(cream | sunny)] x P(sunny) / P(a cone of ice cream)

The second expression follows the same pattern:
2. 
P(rainy | a cup of hot coffee) = P(a cup of hot coffee | rainy) x P(rainy) / P(a cup of hot coffee)

   - Given the Naive Bayes assumption:
   
   P(a cup of hot coffee | rainy) = [P(a | rainy) x P(cup | rainy) x P(of | rainy) x P(hot | rainy) x P(coffee | rainy)]

Therefore, the full expression becomes:

P(rainy | a cup of hot coffee) = [P(a | rainy) x P(cup | rainy) x P(of | rainy) x P(hot | rainy) x P(coffee | rainy)] x P(rainy) / P(a cup of hot coffee)
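
The two expressions above can be checked numerically with a minimal sketch. The per-word likelihoods and priors below are invented for illustration and are not part of the assignment. The sketch also assumes that sunny and rainy are the only two weather labels, so that the denominator P(a cone of ice cream) can be obtained by summing the numerator over both labels.

```python
import math

# Hypothetical values of P(word | weather) and P(weather).
# These numbers are made up for illustration only.
word_given_label = {
    "sunny": {"a": 0.10, "cone": 0.05, "of": 0.10, "ice": 0.04, "cream": 0.04},
    "rainy": {"a": 0.10, "cone": 0.01, "of": 0.10, "ice": 0.01, "cream": 0.02},
}
prior = {"sunny": 0.6, "rainy": 0.4}


def joint(words, label):
    """P(words | label) x P(label), using the Naive Bayes factorization."""
    likelihood = math.prod(word_given_label[label][w] for w in words)
    return likelihood * prior[label]


def posterior(words, label):
    """P(label | words); the denominator P(words) sums the joint over all labels."""
    evidence = sum(joint(words, other) for other in prior)
    return joint(words, label) / evidence


words = "a cone of ice cream".split()
p_sunny = posterior(words, "sunny")
p_rainy = posterior(words, "rainy")
print(p_sunny, p_rainy)
```

Because the denominator is the same for every label, comparing the numerators alone is enough to pick the most likely label.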


Question 2 [25 points]

Please review the following cells of markdown, code, and outputs: 

In [117]:
# Here we are executing the dataClassifier.py code provided with the assignment

import mostFrequent
import naiveBayes
import samples
import util

TEST_SET_SIZE = 100
DIGIT_DATUM_WIDTH = 28
DIGIT_DATUM_HEIGHT = 28
FACE_DATUM_WIDTH = 60
FACE_DATUM_HEIGHT = 70


def basicFeatureExtractorDigit(datum):
    """
    Returns a set of pixel features indicating whether
    each pixel in the provided datum is white (0) or gray/black (1)
    """
    a = datum.getPixels()

    features = util.Counter()
    for x in range(DIGIT_DATUM_WIDTH):
        for y in range(DIGIT_DATUM_HEIGHT):
            if datum.getPixel(x, y) > 0:
                features[(x, y)] = 1
            else:
                features[(x, y)] = 0
    return features


def analysis(classifier, guesses, testLabels, testData, rawTestData, printImage):
    """
    This function is called after learning.
    Include any code that you want here to help you analyze your results.

    Use the printImage(<list of pixels>) function to visualize features.

    An example of use has been given to you.

    - classifier is the trained classifier
    - guesses is the list of labels predicted by your classifier on the test set
    - testLabels is the list of true labels
    - testData is the list of test datapoints (as util.Counter of features)
    - rawTestData is the list of test datapoints (as samples.Datum)
    - printImage is a method to visualize the features
    (see its use in the odds ratio part in runClassifier method)

    This code won't be evaluated. It is for your own optional use
    (and you can modify the signature if you want).
    """

    # Put any code here...
    # Example of use:
    for i in range(len(guesses)):
        prediction = guesses[i]
        truth = testLabels[i]
        if (prediction != truth):
            print("===================================")
            print("Mistake on example %d" % i)
            print("Predicted %d; truth is %d" % (prediction, truth))
            print("Image: ")
            print(rawTestData[i])
            break


## =====================
## You don't have to modify any code below.
## =====================


class ImagePrinter:
    def __init__(self, width, height):
        self.width = width
        self.height = height


class Options:
    def __init__(self):
        self.classifier = 'mostFrequent'  # Set your default classifier
        self.data = 'digits'  # Set your default dataset
        self.training = 100  # Set your default training set size
        self.autotune = False  # Set autotune option as needed
        self.iterations = 3  # Set the maximum iterations


options = Options()
args = {}

# Set up variables according to the command line input.
print("Doing classification")
print("--------------------")
print("data:\t\t" + options.data)
print("classifier:\t\t" + options.classifier)
print("training set size:\t" + str(options.training))
if (options.data == "digits"):
    printImage = ImagePrinter(DIGIT_DATUM_WIDTH, DIGIT_DATUM_HEIGHT)
    featureFunction = basicFeatureExtractorDigit
else:
    print("Unknown dataset", options.data)

if (options.data == "digits"):
    legalLabels = list(range(10))

# Load data
numTraining = options.training

rawTrainingData = samples.loadDataFile("trainingimages", numTraining, DIGIT_DATUM_WIDTH, DIGIT_DATUM_HEIGHT)
trainingLabels = samples.loadLabelsFile("traininglabels", numTraining)
rawValidationData = samples.loadDataFile("validationimages", TEST_SET_SIZE, DIGIT_DATUM_WIDTH, DIGIT_DATUM_HEIGHT)
validationLabels = samples.loadLabelsFile("validationlabels", TEST_SET_SIZE)
rawTestData = samples.loadDataFile("testimages", TEST_SET_SIZE, DIGIT_DATUM_WIDTH, DIGIT_DATUM_HEIGHT)
testLabels = samples.loadLabelsFile("testlabels", TEST_SET_SIZE)

# Extract features
print("Extracting features...")
trainingData = list(map(featureFunction, rawTrainingData))
validationData = list(map(featureFunction, rawValidationData))
testData = list(map(featureFunction, rawTestData))

# Create classifier
if (options.classifier == "mostFrequent"):
    classifier = mostFrequent.MostFrequentClassifier(legalLabels)
elif (options.classifier == "naiveBayes" or options.classifier == "nb"):
    classifier = naiveBayes.NaiveBayesClassifier(legalLabels)
    if (options.autotune):
        print("using automatic tuning for naivebayes")
        classifier.automaticTuning = True
else:
    print("Unknown classifier:", options.classifier)

args['classifier'] = classifier
args['featureFunction'] = featureFunction
args['printImage'] = printImage

# Conduct training and testing
print("Training...")
classifier.train(trainingData, trainingLabels, validationData, validationLabels)
print("Validating...")
guesses = classifier.classify(validationData)
correct = [guesses[i] == validationLabels[i] for i in range(len(validationLabels))].count(True)
print(str(correct), ("correct out of " + str(len(validationLabels)) + " (%.1f%%).") % (
        100.0 * correct / len(validationLabels)))
print("Testing...")
guesses = classifier.classify(testData)
correct = [guesses[i] == testLabels[i] for i in range(len(testLabels))].count(True)
print(str(correct), ("correct out of " + str(len(testLabels)) + " (%.1f%%).") % (100.0 * correct / len(testLabels)))
analysis(classifier, guesses, testLabels, testData, rawTestData, printImage)


Doing classification
--------------------
data:		digits
classifier:		mostFrequent
training set size:	100
Extracting features...
Training...
Validating...
14 correct out of 100 (14.0%).
Testing...
14 correct out of 100 (14.0%).
Mistake on example 0
Predicted 1; truth is 9
Image: 
                            
                            
                            
                            
                            
                            
                            
             ++###+         
             ######+        
            +######+        
            ##+++##+        
           +#+  +##+        
           +##++###+        
           +#######+        
           +#######+        
            +##+###         
              ++##+         
              +##+          
              ###+          
            +###+           
            +##+            
           +##+             
          +##+              
         +##+               
         ##+            

^ Above is the output of dataClassifier.py running the default classifier, mostFrequent.py ^
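
The 14% accuracy is what one would expect from a baseline that ignores the pixels entirely. The implementation of mostFrequent.py is not shown here, so the following is only a sketch of how such a baseline could work, under the assumption that it always predicts the most common training label. The function names are hypothetical.

```python
from collections import Counter


def most_frequent_label(training_labels):
    """Return the label that occurs most often in the training set."""
    return Counter(training_labels).most_common(1)[0][0]


def classify_all(data, label):
    """Predict the same label for every datum, regardless of its features."""
    return [label for _ in data]


def accuracy(guesses, true_labels):
    """Fraction of guesses that match the true labels."""
    correct = sum(g == t for g, t in zip(guesses, true_labels))
    return correct / len(true_labels)


# Tiny made-up example: label 1 is the most common training label.
train_labels = [1, 1, 9, 3, 1, 7]
test_labels = [9, 1, 4, 1]
guess = most_frequent_label(train_labels)
guesses = classify_all(test_labels, guess)
print(guess, accuracy(guesses, test_labels))
```

A baseline like this gives a floor against which the Naive Bayes results can be compared.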

In [118]:
# Here we are attempting to optimize the naiveBayes classifier

import util
import classificationMethod
import math

class NaiveBayesClassifier(classificationMethod.ClassificationMethod):
    """
    See the project description for the specifications of the Naive Bayes classifier.
    
    Note that the variable 'datum' in this code refers to a counter of features
    (not to a raw samples.Datum).
    """
    def __init__(self, legalLabels):
        self.legalLabels = legalLabels
        self.type = "naivebayes"
        self.k = 1  # this is the smoothing parameter, ** use it in your train method **
        self.automaticTuning = True  # Look at this flag to decide whether to choose k automatically ** use this in your train method **

    def setSmoothing(self, k):
        """
        This is used by the main method to change the smoothing parameter before training.
        Do not modify this method.
        """
        self.k = k

    def train(self, trainingData, trainingLabels, validationData, validationLabels):
        """
        Train the Naive Bayes classifier.
        
        Args:
        trainingData: A list of feature Counters for the training data.
        trainingLabels: A list of labels for the training data.
        validationData: A list of feature Counters for the validation data.
        validationLabels: A list of labels for the validation data.
        """
        # Every datum has the same feature keys, so read them from the first one.
        self.features = list(trainingData[0].keys())

        if self.automaticTuning:
            kgrid = [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 20, 50, 100, 200, 500, 1000]
        else:
            kgrid = [self.k]

        # Collect raw counts over the training data once; they do not depend on k.
        labelCounts = util.Counter()
        featureCounts = {label: util.Counter() for label in self.legalLabels}
        for datum, label in zip(trainingData, trainingLabels):
            labelCounts[label] += 1
            for feature, value in datum.items():
                featureCounts[label][feature, value] += 1

        bestAccuracy = -1
        bestK = None
        bestPrior = None
        bestCondProb = None

        for k in kgrid:
            self.k = k
            self.prior = util.Counter()
            self.condProb = {label: util.Counter() for label in self.legalLabels}

            for label in self.legalLabels:
                # Smoothed prior P(label)
                self.prior[label] = (labelCounts[label] + k) / (len(trainingLabels) + k * len(self.legalLabels))
                # Laplace-smoothed P(feature = value | label); each pixel feature is binary (0 or 1)
                total = labelCounts[label] + 2 * k
                for feature in self.features:
                    for value in (0, 1):
                        self.condProb[label][feature, value] = (featureCounts[label][feature, value] + k) / total

            # Classify the validation data
            guesses = self.classify(validationData)
            correct = [guesses[i] == validationLabels[i] for i in range(len(validationLabels))].count(True)
            accuracy = correct / len(validationLabels)

            # Keep the model with the best validation accuracy
            if accuracy > bestAccuracy:
                bestAccuracy = accuracy
                bestK = k
                bestPrior = self.prior
                bestCondProb = self.condProb

        # Restore the best smoothing parameter and its probability tables
        self.k = bestK
        self.prior = bestPrior
        self.condProb = bestCondProb
    
        # Your code for model selection or tuning goes here
        # Any code added here won't be evaluated, it's for your own analysis

    def classify(self, testData):
        # Classify the data based on the posterior distribution over labels.
        guesses = []
        self.posteriors = []  # Log posteriors are stored for later data analysis (autograder).
        for datum in testData:
            posterior = self.calculateLogJointProbabilities(datum)
            guesses.append(posterior.argMax())
            self.posteriors.append(posterior)
        return guesses

    def calculateLogJointProbabilities(self, datum):
        logJoint = util.Counter()
        for label in self.legalLabels:
            logJoint[label] = math.log(self.prior[label])
            for feature, value in datum.items():
                if (feature, value) in self.condProb[label]:
                    logJoint[label] += math.log(self.condProb[label][feature, value])
        return logJoint

    def findHighOddsFeatures(self, label1, label2):
        featuresOdds = []
        odds = util.Counter()

        for feature in self.features:
            p_label1 = self.condProb[label1][feature, 1]
            p_label2 = self.condProb[label2][feature, 1]

            if p_label2 == 0:
                odds[feature] = float('inf')
            else:
                odds[feature] = p_label1 / p_label2

        # Find the top 100 features with the highest odds ratio
        featuresOdds = odds.sortedKeys()[:100]

        return featuresOdds

!python dataClassifier.py -c naiveBayes -a -t 900


Doing classification
--------------------
data:		digits
classifier:		naiveBayes
training set size:	900
using automatic tuning for naivebayes
Extracting features...
Training...
Validating...
60 correct out of 100 (60.0%).
Testing...
52 correct out of 100 (52.0%).
Mistake on example 0
Predicted 7; truth is 9
Image: 
                            
                            
                            
                            
                            
                            
                            
             ++###+         
             ######+        
            +######+        
            ##+++##+        
           +#+  +##+        
           +##++###+        
           +#######+        
           +#######+        
            +##+###         
              ++##+         
              +##+          
              ###+          
            +###+           
            +##+            
           +##+             
          +##+              
         +##+    

^ Above is the output of dataClassifier.py running the optimized naiveBayes.py ^
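
To make the role of the smoothing parameter k concrete, here is a small standalone sketch of add-k (Laplace) smoothing for binary features. It uses plain dictionaries rather than util.Counter, and the function name and toy data are hypothetical. It is meant as an illustration of the estimate, not as the code that produced the output above.

```python
from collections import Counter


def smoothed_conditionals(data, labels, legal_labels, k):
    """Estimate P(feature = value | label) for binary features with add-k smoothing."""
    label_counts = Counter(labels)
    on_counts = {label: Counter() for label in legal_labels}
    for datum, label in zip(data, labels):
        for feature, value in datum.items():
            if value == 1:
                on_counts[label][feature] += 1

    cond = {}
    for label in legal_labels:
        for feature in data[0]:
            # k is added to the count of each of the two possible values (0 and 1),
            # so the denominator grows by 2 * k.
            p_on = (on_counts[label][feature] + k) / (label_counts[label] + 2 * k)
            cond[label, feature, 1] = p_on
            cond[label, feature, 0] = 1.0 - p_on
    return cond


# Toy data: one binary feature "p", two examples of label 0 and one of label 1.
toy_data = [{"p": 1}, {"p": 1}, {"p": 0}]
toy_labels = [0, 0, 1]
cond = smoothed_conditionals(toy_data, toy_labels, legal_labels=[0, 1], k=1)
print(cond[0, "p", 1], cond[1, "p", 1])
```

With k = 1, a feature value never seen with a label still receives a nonzero probability, which keeps the logarithm in the log-joint computation finite. Larger values of k pull every estimate toward 0.5.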