ClusteredLabeledData

Using clustering algorithms and metrics, create labeled data to be used in classification algorithms.

What is this?

This is a code repository where I explore a topic I wrote about in my final essay for my AI/ML Special Topics class at UAH. To see the actual essay, here it is.

But I don't want to read an essay!

Well here you go!

What are Clustering Algorithms?

Clustering algorithms are a part of the data science field. A clustering algorithm takes in unlabeled data and tries to partition it into groups (clusters) so that similar data points end up together. A classic example of a clustering algorithm is the k-means algorithm.
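
To make that concrete, here is a minimal sketch of running k-means on some toy data. It assumes scikit-learn and NumPy are available; it is purely illustrative and not code from this repository.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose blobs of points.
rng = np.random.default_rng(seed=0)
data = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2)),
])

# Partition the data into 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])       # cluster assignment for each point
print(kmeans.cluster_centers_)   # the 2 cluster centroids
```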

The k-means clustering algorithm isn't the greatest, however. This is in part because of its random initialization at the start of the algorithm, and because it looks for spherical clusters; most data is not spherical, as any experienced data scientist will tell you. The textbook referenced above explains this in much more detail. The link above is also a very good resource on most things related to data science; the author has a textbook available for purchase if you don't want to use the free online version or would prefer a hard copy. I personally recommend it.

What is Category Utility?

Category utility is an information-gain metric that is best used with clustered data. An information-gain metric is a formula that tells you how well you can predict which category (or cluster, for our purposes) a data point belongs to, given its attribute values. The material covered in the textbook that is referenced within my essay can be found here, and the equation for Category Utility can be found on slide 118 of this slide deck. Or you can use this Wikipedia article.
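
For reference, the usual form of the Category Utility equation (as given by Gluck and Corter) for clusters C_1 through C_K over attributes A_i with values V_ij looks roughly like this; this is my own transcription, so double-check it against the slide deck or the Wikipedia article:

```latex
CU = \frac{1}{K} \sum_{k=1}^{K} P(C_k)
     \left[ \sum_{i} \sum_{j} P(A_i = V_{ij} \mid C_k)^2
          - \sum_{i} \sum_{j} P(A_i = V_{ij})^2 \right]
```

Intuitively, it rewards clusterings where knowing the cluster makes the attribute values much easier to guess than they are overall.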

How do the two relate together?

Well, by using Category Utility we can essentially find the number of clusters that is likely to be the total number of possible classifications within the data. If we treat each cluster as its own classification, we can then take unlabeled data and turn it into labeled data that can be used in classification algorithms, such as neural networks. We're assuming that as our number of clusters approaches the total number of classifications, our category utility will reach its maximum value.
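
As a quick illustration of the second half of that idea, here is a sketch (again assuming scikit-learn; the cluster count of 3 is just a stand-in for whatever number maximizes category utility) of using cluster assignments as labels for a classifier:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

data = np.random.default_rng(1).normal(size=(200, 4))  # unlabeled data

# Pretend we already found that 3 clusters maximizes category utility.
pseudo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)

# Train a classifier (here a small neural network) on the pseudo-labels.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
clf.fit(data, pseudo_labels)
print(clf.predict(data[:5]))
```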

How will you address overfitting?

For those unaware, overfitting is when a model "memorizes" the data that is passed in. This is bad, as we want our model to perform well not just on the training data but on data it has never seen before.

We address overfitting by building stopping measures into the algorithm. For instance, we can stop once the number of clusters reaches some fraction of the total number of data points. As an example, if we had 100,000 data points and wanted to stop once the number of clusters reached a fifth of that, we would stop at 20,000 clusters. In practice we would likely use a lower cap: the ideal value depends on how many data points you expect in each classification. We don't expect only 5 data points per classification in our data set, which is why 20,000 might be too large a number.

Other possible overfitting measures include stopping once the category utility starts decreasing, or stopping after a certain number of iterations of the algorithm. These are just ideas and are by no means the only measures that can be taken to prevent overfitting. If anyone has more ideas for addressing overfitting, please contribute to the discussion!
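
Here is a rough sketch of what these stopping conditions could look like in code. The function name and parameters are my own placeholders, not anything defined in this repository:

```python
def should_stop(num_clusters, num_points, cu_history,
                max_cluster_fraction=0.2, max_iterations=100):
    """Return True once any of the stopping conditions described above is met."""
    # Stop once clusters reach a fixed fraction of the data points
    # (e.g. 1/5 of 100,000 points = 20,000 clusters).
    if num_clusters >= max_cluster_fraction * num_points:
        return True
    # Stop once category utility starts decreasing.
    if len(cu_history) >= 2 and cu_history[-1] < cu_history[-2]:
        return True
    # Stop after a fixed number of iterations of the algorithm.
    if len(cu_history) >= max_iterations:
        return True
    return False
```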

How does the algorithm work?

  1. Select an initial number of clusters.
    • A good start might be 2 at the very least.
  2. Run the clustering algorithm on our data.
    • The clustering algorithm can be any that takes the number of clusters as a parameter.
  3. Compute the category utility of the generated clusters.
  4. Check whether any of our stopping conditions is true.
    • If not, increase the number of clusters and repeat from step 2.
  5. Your clusters now give you (potentially) labeled data that can be used in classification algorithms.
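
Putting the steps together, here is a minimal end-to-end sketch. It assumes scikit-learn, uses k-means as the clustering algorithm, and includes a deliberately simplified toy category_utility function (bin the features, then apply the CU formula); none of these names come from this repository, and a real implementation should follow the textbook/slide-deck formula exactly:

```python
import numpy as np
from sklearn.cluster import KMeans

def category_utility(data, labels):
    """Toy category utility over coarsely binned features (illustrative only)."""
    k = labels.max() + 1
    binned = np.digitize(data, np.linspace(data.min(), data.max(), 10))
    # Baseline term: sum of squared attribute-value probabilities over all data.
    overall = sum(np.sum((np.unique(col, return_counts=True)[1] / len(col)) ** 2)
                  for col in binned.T)
    cu = 0.0
    for c in range(k):
        mask = labels == c
        within = sum(np.sum((np.unique(col, return_counts=True)[1] / mask.sum()) ** 2)
                     for col in binned[mask].T)
        cu += (mask.sum() / len(labels)) * (within - overall)
    return cu / k

# 1. Select an initial number of clusters (start at 2).
data = np.random.default_rng(0).normal(size=(300, 2))
k, cu_history, best = 2, [], None
while True:
    # 2. Run the clustering algorithm (here k-means) with k clusters.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    # 3. Compute the category utility of the generated clusters.
    cu_history.append(category_utility(data, labels))
    # 4. Stopping conditions: CU started decreasing, or k hit a cap (1/5 of the points).
    if (len(cu_history) >= 2 and cu_history[-1] < cu_history[-2]) or k >= len(data) // 5:
        break
    best = (k, labels)  # clustering seen before any drop in category utility
    k += 1

# 5. The cluster assignments are the (pseudo-)labels for a classifier.
best_k, pseudo_labels = best
print(f"chose {best_k} clusters; first labels: {pseudo_labels[:10]}")
```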
