Digit Classification

Digit classification task using Naive Bayes, Perceptron, and MIRA.

Report

Check out the report here.

Getting Started

To try out the classification pipeline, run dataClassifier.py from the command line. This classifies the digit data using the default classifier (mostFrequent), which blindly assigns every example the most frequent label.

python dataClassifier.py

As usual, you can learn more about the possible command line options by running:

python dataClassifier.py -h

Our simple feature set includes one feature for each pixel location, which can take values 0 or 1 (off or on). The features are encoded as a Counter where keys are feature locations (represented as (column,row)) and values are 0 or 1. The face recognition data set has value 1 only for those pixels identified by a Canny edge detector.
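For concreteness, here is a minimal sketch of how such a feature Counter might be built from a raw image datum. The getPixel accessor and the width/height arguments are assumptions about the project's samples.py and dataClassifier.py; treat this as illustrative rather than the skeleton's exact code.

import util

def basicFeatureExtractor(datum, width, height):
    """Build the binary pixel features described above."""
    features = util.Counter()
    for col in range(width):
        for row in range(height):
            # Key is the (column, row) location; value is 1 if the pixel is on.
            features[(col, row)] = 1 if datum.getPixel(col, row) > 0 else 0
    return features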

Naive Bayes

Smoothing

The smoothing logic lives in trainAndTune and calculateLogJointProbabilities in naiveBayes.py. In trainAndTune, estimate conditional probabilities from the training data for each possible value of k given in the list kgrid, evaluate accuracy on the held-out validation set for each k, and choose the value with the highest validation accuracy; in case of ties, prefer the lowest value of k. The classifier is tested with:

python dataClassifier.py -c naiveBayes --autotune
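One plausible shape for trainAndTune is sketched below. This is not the skeleton's exact code: self.legalLabels, self.features, self.classify, and the util.Counter helpers are assumed to match the project skeleton, and kgrid is assumed to be sorted in ascending order so that a strict comparison prefers the lowest k on ties.

import util

def trainAndTune(self, trainingData, trainingLabels, validationData, validationLabels, kgrid):
    """Estimate P(y) and smoothed P(F_i = 1 | y); pick k by validation accuracy."""
    # Empirical label prior P(y), shared across all values of k.
    self.prior = util.Counter()
    for label in trainingLabels:
        self.prior[label] += 1
    self.prior.normalize()

    # Count feature occurrences in a single pass and reuse them for every k.
    counts = dict((label, util.Counter()) for label in self.legalLabels)
    totals = util.Counter()
    for datum, label in zip(trainingData, trainingLabels):
        totals[label] += 1
        for feat, value in datum.items():
            if value > 0:
                counts[label][feat] += 1

    bestAccuracy, bestProbs, bestK = -1, None, None
    for k in kgrid:
        # Laplace smoothing: add k to each of the two outcomes of a binary feature.
        condProbs = {}
        for label in self.legalLabels:
            condProbs[label] = util.Counter()
            for feat in self.features:
                condProbs[label][feat] = (counts[label][feat] + k) / float(totals[label] + 2 * k)
        self.conditionalProbs = condProbs
        guesses = self.classify(validationData)
        accuracy = sum(1 for g, t in zip(guesses, validationLabels) if g == t)
        if accuracy > bestAccuracy:  # strict '>' keeps the lowest k on ties
            bestAccuracy, bestProbs, bestK = accuracy, condProbs, k

    self.conditionalProbs, self.k = bestProbs, bestK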

Hints and observations:

  • The method calculateLogJointProbabilities uses the conditional probability tables constructed by trainAndTune to compute the log posterior probability for each label y given a feature vector; a sketch follows this list. The method's comments describe the data structures of the input and output.
  • You can add code to the analysis method in dataClassifier.py to explore the mistakes that your classifier is making. This is optional.
  • When trying different values of the smoothing parameter k, think about how many times you scan the training data. Your code should read the training data once, cache the counts, and reuse them for every k.
  • To run your classifier with only one particular value of k, remove the --autotune option. This will ensure that kgrid has only one value, which you can change with -k.
  • Using a fixed value of k=2 and 100 training examples, you should get a validation accuracy of about 69% and a test accuracy of 55%.
  • Using --autotune, which tries different values of k, you should get a validation accuracy of about 74% and a test accuracy of 65%.
  • Accuracies may vary slightly because of implementation details. For instance, ties are not deterministically broken in the Counter.argMax() method.
  • To run on the face recognition dataset, use -d faces (optional).
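Per the first hint above, a minimal sketch of the log-joint computation follows, reusing the self.prior and self.conditionalProbs attributes assumed in the trainAndTune sketch. Working in log space avoids numerical underflow from multiplying many small probabilities.

import math
import util

def calculateLogJointProbabilities(self, datum):
    """Return a Counter mapping each label y to log P(y) + sum_i log P(F_i = f_i | y)."""
    logJoint = util.Counter()
    for label in self.legalLabels:
        logJoint[label] = math.log(self.prior[label])
        for feat, value in datum.items():
            p = self.conditionalProbs[label][feat]  # smoothed P(F_i = 1 | y); strictly in (0, 1) for k > 0
            # Binary features: use p when the pixel is on and (1 - p) when it is off.
            logJoint[label] += math.log(p) if value > 0 else math.log(1.0 - p)
    return logJoint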

Odds Ratios

The function findHighOddsFeatures(self, label1, label2) returns a list of the 100 features with highest odds ratios for label1 over label2. The option -o activates an odds ratio analysis. Use the options -1 label1 -2 label2 to specify which labels to compare. Running the following command will show you the 100 pixels that best distinguish between a 3 and a 6.

python dataClassifier.py -a -d digits -c naiveBayes -o -1 3 -2 6
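A plausible implementation is short, given the smoothed conditional probabilities from the sketches above (the attribute names are assumptions carried over from those sketches):

def findHighOddsFeatures(self, label1, label2):
    """Return the 100 features F_i with the highest P(F_i = 1 | label1) / P(F_i = 1 | label2)."""
    ratios = []
    for feat in self.features:
        # Smoothing keeps the denominator nonzero.
        ratios.append((self.conditionalProbs[label1][feat] / self.conditionalProbs[label2][feat], feat))
    ratios.sort(reverse=True)
    return [feat for ratio, feat in ratios[:100]]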

Perceptron

Learning Weights

Run the code with:

python dataClassifier.py -c perceptron
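For reference, the core multiclass perceptron loop is sketched below. It assumes self.weights is a dict of per-label util.Counter weight vectors, that util.Counter supports * as a dot product as in the project's util.py, and that self.max_iterations comes from the -i option; treat these details as assumptions, not the skeleton's exact code.

import util

def train(self, trainingData, trainingLabels, validationData, validationLabels):
    """Multiclass perceptron: predict argmax_y (w_y . f); on a mistake, correct both weight vectors."""
    for iteration in range(self.max_iterations):
        for datum, trueLabel in zip(trainingData, trainingLabels):
            # Score each label by the dot product of its weights with the features.
            scores = util.Counter()
            for label in self.legalLabels:
                scores[label] = self.weights[label] * datum
            guess = scores.argMax()
            if guess != trueLabel:
                # Move weights toward the true label and away from the wrong guess.
                self.weights[trueLabel] += datum
                self.weights[guess] -= datum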

Hints and observations:

  • The command above should yield validation and test accuracies between 40% and 70% (with the default 3 iterations). These ranges are wide because the perceptron is much more sensitive to the specific choice of tie-breaking than naive Bayes.
  • One weakness of the perceptron is that its performance depends on several practical details, such as how many iterations you train it for and the order of the training examples (in practice, a randomized order works better than a fixed one). The code defaults to 3 training iterations; you can change this with the -i iterations option. Try different numbers of iterations and see how they influence performance. In practice you would use validation-set performance to decide when to stop training, but you do not need to implement this stopping criterion for this assignment.

Visualizing Weights

The function findHighWeightFeatures(self, label) in perceptron.py returns a list of the 100 features with the highest weights for that label. You can display the 100 pixels with the largest weights using the command:

python dataClassifier.py -c perceptron -w
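A minimal sketch, assuming the per-label weight Counters from the training sketch above:

def findHighWeightFeatures(self, label):
    """Return the 100 features with the largest weights for the given label."""
    weighted = [(weight, feat) for feat, weight in self.weights[label].items()]
    weighted.sort(reverse=True)
    return [feat for weight, feat in weighted[:100]]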

Links

https://inst.eecs.berkeley.edu//~cs188/sp11/projects/classification/classification.html
