exceptions, inner classes, k-means clustering
- Extend the design of the Cluster, and implement a new class, KMeans, which performs K-means clustering on multidimensional real-valued data.
- Learn implement and use exceptions to fortify your program's interface.
- Practice input/output within the streams abstraction.
- Complete the design and implementation of a fully functional C++ program.
- Continue using git and Github.
- Develop good coding style.
PA3 brings you to the culmination of your work with points, asking you to implement the K-means algorithm for clustering multidimensional real-valued points. Most of the functionality of your Point
and Cluster
classes will be used as is, with some key extensions to Cluster
. Extensive exception coverage will fortify the program's interface. The actual clustering algorithm will be implemented in a new class, KMeans
. You have to write four files, Point.cpp, Cluster.cpp, KMeans.cpp, and Exceptions.cpp. See the Detailed Instructions at the bottom of this file.
PA3 is in the test-driven-development (TDD) style, just like PA2. It has 392 tests that your implementation should pass for full points. The tests from PA2 are a subset of the PA3 test suite. Start by adding your original Point.cpp and Cluster.cpp files and creating and implementing the Exceptions.cpp file, which should be fairly straightforward. When you pass all the Point
and Cluster
tests, with all the Exceptions
checking, uncomment and move on to the KMeans
tests. (Once again, don't forget to git add
whatever files you add to your project.) There are some much larger data files contained in the repository than for PA2.
PA3 is commensurate to PA2. You should expect to write about 400 lines of code and spend about 15-25 hours on it, depending on how well you did on PA2. Once again, it is advisable that you start early and do it in stages. If you get stuck on anything for more than one hour, seek help.
You don't need to submit anything. Once you fork the repository (this is your remote repository on Github, aka origin), you will clone it to your development machine (this is your local repository), and start work on it. Commit your changes to your local repository often and push them up to the remote repository occasionally. Make sure you push at least once before the due date. At the due date, your remote repository will be cloned and tested automatically by the grading script. Note: Your code should be in the master branch of your remote repository.
An autograding script will run the test suite against your files. Your grade will be based on the number of tests passed. (E.g. if your code passes 3 out of 6 test cases, your score will be 50% and the grade will be the corresponding letter grade in the course's grading scale). The test suite for PA3 has 392 tests. Note: The testing and grading will be done with fresh original copies of all the provided files. In the course of development, you can modify them, if you need to, but your changes will not be used. Only your Point.cpp, Cluster.cpp, KMeans.cpp, and Exceptions.cpp files will be used.
Your program should run on GCC 4.9.0 or later, or Clang 3.3 or later. No other compilers are supported.
The assignment is due on Wed, Mar 30, at 23:59 Mountain time. No late work. The last commit to your PA3 repository before the deadline will be graded.
Free Github repositories are public so you can look at each other's code. Please, don't do that. You can discuss any programming topics and the assignments in general but sharing of solutions diminishes the individual learning experience of many people. Assignments might be randomly checked for plagiarism and a plagiarism claim may be raised against you.
For this assignment, no external libraries should be used, except for the Standard Library minus the Standard Template Library (STL). You should reuse your own implementation of the linked list from PA2. You are encouraged to use STL classes in your implementation of KMeans::pickCentroids
and Kmeans::run
. We will use the STL more extensively in PA4-PA5.
Familiarize yourself with and start following coding style guidelines. There are others on the Web. Pick one and be consistent. Note: If you stumble on the Google C++ Style Guide, be advised that it has been heavily criticized by many leading C++ programmers. I don't advise you to follow it, especially the more advanced features. This Guide is for entry-level coders at Google who need to be able to work with their legacy code. It is not advisable for new projects and novice programmers.
Operator overloading guidelines.
A very good C++ tutorial, including many topics we are covering.
Two websites with C++ Reference, here and here.
-
All the exceptions that you need to implement are declared in the Exceptions.h file. They are very simple classes, which hold some environment data, if any, to report back to the handler when thrown. They also hold their
name
so that the test suite handlers can make sure the correct exception is thrown in expected anomalous situations. They also print themselves to a stream, so an informative message can be reported and/or logged. -
OutOfBoundsEx
. -
Attributes:
__current
is the size of the current object,__rhs
is the offending argument. -
Thrown by
Point::setValue
,Point::getValue
,Point::operator[]
,Cluster::operator[]
. -
Caught by user code (test suite).
-
DimensionalityMismatchEx
. -
Attributes:
__current
is the dimensionality of the current object,__rhs
is the offending argument. -
Thrown by
Point::distanceTo
,Point::operator+=
,Point::operator-=
,Point::operator<
(and the other comparison operators through it),Point::operator=
,Point::operator==
(and theoperator!=
through it),Point::operator>>
;Cluster::add
andCluster::remove
(and, through them, any other functions or operators which use them),Cluster::operator+=
andCluster::operator-=
(for bothPoint
andCluster
right-hand sides) (and, through them, the correspoinding simple-arithmetic operators),Cluster::operator==
(and theoperator!=
through it). -
Caught by user code (test suite) and by
Cluster::operator>>
when reading in a data file. -
ZeroClustersEx
. -
Attributes: none.
-
Thrown by
KMeans::KMeans
constructor when k=0. -
Caught by user code (test suite).
-
DataFileOpenEx
. -
Attributes: the data file's
filename
. -
Thrown by
KMeans::KMeans
constructor when the file with such a name cannot be found or can otherwise not be opened. -
Caught by user code (test suite).
-
ZeroDimensionsEx
. -
Attributes: none.
-
Thrown by
Point::Point
constructor when d=0. Note that this exception can be thrown by bothCluster::Cluster
andKMeans::KMeans
but, since the initialization of bothCluster
andKMeans
involves initialization of at least onePoint
, and since this is a fatal condition after which the clustering program cannot proceed, throwing it fromPoint::Point
only is minimalistic design. -
Caught by user code (test suite).
-
EmptyClusterEx
. -
Attributes: none.
-
Thrown by
Cluster::operator[]
when the cluster is empty. -
Caught by user code (test suite).
-
Throw the indicated exceptions from the indicated methods in the Exceptions section.
-
Define and initialize the
static const char POINT_VALUE_DELIM
to','
. Use it in yourPoint::operator>>
. -
Define the
static void Point::rewindIdGen
method. Use it to rewind thePoint
id generator when aDimensionalityMismatchEx
is caught inCluster::operator>>
when aPoint
cannot read itself. This will ensure thePoint
is-s are sequential and there are no gaps in the sequence due to data file reading problems. -
Define the
const Point::operator[]
. It is needed by theKMeans
class.
-
This class is a private inner class of class
Cluster
. This means that: -
Objects of class
Centroid
can only be declared within the scope of classCluster
(declarations and definitions). -
An object of class
Centroid
has private access to an object of classCluster
. For this to be useful, a reference to aCluster
object has to be passed as an argument to theCentroid::Centroid
constructor. See theCluster
header. -
Implement the methods of the inner class
Centroid
. All implementations will be identified by theCluster::Centroid::
qualifier. -
Cluster::Centroid::equal
only compares thedouble
values and ignores the id of thePoint
argument. -
Cluster::Centroid::compute
uses the private access to itsCluster
to compute the average of its points. -
Cluster::Centroid::set
andCluster::Centroid::compute
set the centroid to valid. -
An "infinite"
Point
is used to indicate the centroid of an emptyCluster
. For the purposes of this assignment, this is defined as aPoint
with all values set tostd::numeric_limits<double>::max()
. -
Cluster::Centroid::toInfinity
sets the centroid to an infinite point.
-
This class is a very simple public inner class of class
Cluster
. This means that objects of classMove
can be declared in any scope by using theCluster::Move
type. -
This class is used to encapsulate the indivisible operation of adding a
Point
from oneCluster
after removing it from another. In our clustering paradigm, allPoint
objects have to "belong" toCluster
objects, and do not make sense outside of them. -
Implement the
Cluster::Move::perform
method and use it in the implementation ofKMeans::run
.
Note: This method should invalidate the centroids of both clusters.
-
Throw the indicated exceptions from the indicated methods in the Exceptions section.
-
Define and initialize the
static const char POINT_CLUSTER_ID_DELIM
to':'
. Use it in yourCluster::operator<<
. -
Initialize the new
public Centroid
member.
Note: You can now use calls like c7.centroid.isValid();
in your code.
Note: A Cluster
constructed with a copy ctor has to compute its centroid. Notice that Centroid::Centroid(const Centroid &)
is explicitly deleted disallowing copying of Centoid
objects. (This is due to the fact that the private Cluster
reference in the Centroid
is explicitly constant and cannot be reassigned. C++ references can only be assigned upon initialization (i.e. in a ctor).)
Note: A Cluster
assigned to with a operator=
cannot assignconstructed with a copy ctor has to set its centroid to the centroid of the argument Cluster
and also set it to valid. Notice that Centroid::operator=(const Centroid &)
is explicitly deleted disallowing copying and assignment of Centoid
objects.
-
When a
KMeans
object is initializing itself (i.e. in theKMeans::KMeans
constructor), it createsk
number of clusters (in a dynamic array) and uses the firstCluster
array element to read the specified data file in. Then it picksk
number of centroids from thePoint
s of thisCluster
to set as initial centroids of allk
Cluster
array elements. Implement theCluster::pickCentroids
method. -
The arguments to this method are
k
, the number of clusters, and an already allocated array ofPoint
s where to put the selected centroids.
Note: The caller, which in this case is the KMeans::KMeans
constructor, is respoinsible for allocation of this array. The double pointer Point **
is because Point
has no default constructor and we cannot declare a static array of Point
s. KMeans::KMeans
constructor or KMeans::~KMeans
destructor should deallocate this array.
-
When k >= __size (the size of the
Cluster
) the firstk
elements of the array should be set to thePoint
s of theCluster
and the rest should be set to infinite points. See the Centroid section for an explanation of infinite points. ThePoint::Point(const Point &)
operator should be used to set the centroids. -
When k < __size
k
points should be picked from the cluster to set as initial centroids. The algorithm is up to you and you should start by just picking the firstk
points to get things going. Eventually, you might need to pick your centroids better to be able to finish each K-means clustering test within 20 iterations. Hint: The best performance is achieved by pickeing points that are far away from each other. You can loop throughk
iterations of an algorithm that picks the farthest point from the set of points already picked. -
This is a great use case for the
std::set
class. -
Implement cluster id-s through the
static unsigned int
id generator, just like you did forPoint
. -
Modify
Cluster::operator<<
to output the cluster id after each point, as follows:2.30000, 5.60000, 0.00000, 5.60000, 7.90000 : 6093
-
The
KMeans
class does three things: -
Initializes itself.
-
Runs.
-
Outputs the results to a file.
-
Throw the indicated exceptions from the indicated methods in the Exceptions section.
-
Implement the
KMeans::KMeans
constructor. This is where all the initialization and anomaly checking has to happen. Either the constructor throws an exception or after its execution theKMeans::run
method can safely be called. If any exceptions are to be thrown, any already allocated memory should be deallocated before thethrow
statement. -
Implement the getters and simple getters. The variables are as follows:
-
__maxIter
is the maximum number of iterations the algorithm can run. This is a constructor parameter and has to be observed. Note that this is a termination condition for the K-means algorithm. -
__numIter
is the actual number of iterations the algorithm ran. Note that this can be less than__maxIter
as the points may stop moving between clusters before__maxIter
iterations have been performed. Note that this is another termination condition for the K-means algorithm. -
__numNonemptyClusters
is the number of non-empty clusters before or after the K-means algorithm has run. Note that before running, it's equal to 1, if the initialization in theKMeans::KMeans
constructor was successful. -
__numMovesLastIter
is the number of moves that were performed during the last iteration of the algorithm. Note that this can be larger than zero if__maxIter
was set to a number smaller than the minimum iterations necessary for the point movement to stop. -
Implement the
KMeans::run
method. This is the heart of the algorithm. Here it is in pseudocode: -
Check if the centroids have been assigned. You can use
assert
. -
Initialize locals
moves = 100
aiter = 0
. These will be used for the termination condition. -
while (moves > 0 and iter < __maxIter)
* `moves = 0` // restart the count for the new iteration
* `for (c : k clusters)`
* `for (p : cluster points)`
* Find closest centroid
* If it is **not** the centroid of `c`
* Move `p` to the other cluster. Use a local `Cluster::Move`
* `moves ++`
* Recompute all invalidated centroids.
* `iter ++`
- Set the metadata:
* `__numIter = iter`
* `__numMovesLastIter = moves`
* Count non-empty clusters in `nonempty`
* `__numNonempty = nonempty`
-
Implement
KMeans::operator<<
. This should write out the results to a file. A file output stream is supplied by the caller. Only nonempty clusters should be written out, in no particular order, in the following format:48.9819, 47.4258, 52.0498 : 10440 49.7811, 49.1227, 51.7090 : 10440 50.2934, 48.7180, 51.8880 : 10440 50.4146, 48.4508, 51.4070 : 10440 51.4925, 48.3226, 51.9171 : 10440 13.1473, 45.7105, 46.8457 : 10441 12.6223, 51.8332, 48.9889 : 10442 13.4091, 51.1114, 51.8555 : 10442 13.4349, 49.4999, 51.6522 : 10442 13.4117, 47.1157, 51.4740 : 10443 13.6196, 45.1573, 52.3948 : 10443 13.8807, 47.5944, 55.9948 : 10444 15.6523, 46.7538, 55.6698 : 10444 14.6511, 48.6109, 45.2570 : 10445 14.7062, 48.2472, 47.5892 : 10445 14.7980, 48.4135, 47.7455 : 10445 15.7419, 48.8970, 47.0186 : 10445 14.7341, 46.1872, 42.4663 : 10446 14.7637, 47.3621, 49.4916 : 10447 14.9784, 49.5267, 49.5483 : 10447 15.1096, 50.0164, 48.2296 : 10447 15.7378, 49.5870, 50.3862 : 10447 15.7588, 50.3209, 49.7688 : 10447 16.7001, 49.3881, 50.4152 : 10447 15.1601, 49.5503, 53.4565 : 10448 15.9976, 49.1623, 51.5173 : 10448 16.0302, 50.3806, 52.0563 : 10448 16.4739, 49.4711, 52.6637 : 10448 15.2441, 56.2196, 48.3763 : 10449 15.9123, 54.1316, 48.0258 : 10449 17.3475, 54.4616, 48.3485 : 10449 15.3573, 43.8995, 47.8981 : 10450 15.5129, 49.0171, 42.9764 : 10451 16.1129, 48.3082, 43.9640 : 10451 17.0414, 48.9724, 44.4231 : 10451