# [CPSC 322](https://github.com/GonzagaCPSC322) Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# PA8 BONUS Clustering (25 BONUS pts)

## Learner Objectives
At the conclusion of this programming assignment, participants should be able to:
* Implement k-means clustering

## Prerequisites
Before starting this programming assignment, participants should be able to:
* Implement test-driven development
* Evaluate classifiers using train/test sets
* Tell a data science story using Jupyter Notebook
* Understand Bramer Chapter 19 (Clustering)

## Acknowledgments
Content used in this assignment is based upon information in the following sources:
* [k-Means Clustering](https://www.engage-csedu.org/find-resources/k-means-clustering) assignment by Chris Bailey-Kellogg
* [Spotify Multi-Genre Playlists Data](https://www.kaggle.com/datasets/siropo/spotify-multigenre-playlists-data) on Kaggle

## Github Classroom Setup
For this assignment, you will use GitHub Classroom to create a private code repository to track code changes and submit your assignment. Open this PA8 link to accept the assignment and create your private repository in GitHub Classroom: https://classroom.github.com/a/lJVBCjrN

Your repo will be named, for example, GonzagaCPSC322/pa8-yourusername (where yourusername is your GitHub username). I highly recommend committing/pushing regularly so your work is always backed up.

## Overview and Requirements
This assignment involves implementing a k-means clusterer. It has two main parts:
1. `mysklearn`: Test and implement a general and re-usable unsupervised learning algorithm for k-means clustering
1. Spotify Songs Dataset Mining (pa8.ipynb): Write a Jupyter Notebook that uses `mysklearn` to perform clustering tasks on a Spotify songs dataset

I highly encourage you to design functions that are generic and re-usable for future programming assignments and data mining tasks.

Note: we are learning data science from scratch! The only non-standard Python libraries you should need for this assignment are `tabulate`, `numpy` (math functions, random number generation, etc.), and `scipy` (sparingly). Do not `pip install` any libraries beyond what is included in the [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker image, and do not use `pandas`, `sklearn`, etc. (exceptions are made for testing purposes only!).

## Part 1: `mysklearn` (15 BONUS pts)
The k-means clusterer we are going to implement for PA8 will:
* Generate a set of clusters using the k-means clustering algorithm from Bramer
* Be parameterized to accept different values of k (`n_clusters`) and initial cluster centers (`init`)
    * If initial cluster centers are not provided, select random instances as the initial cluster centers
* "Pretty print" the discovered clusters (see the example pretty print below)
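
As a rough illustration of the behavior listed above, here is a minimal pure-Python sketch of the assign-and-update loop. It is not the required `MyKMeansClusterer` design: the function name `kmeans_sketch`, the `max_iter` parameter, the choice to pass `init` as centroid values, and the use of the standard `random` module are all assumptions made for this example.

```python
import math
import random


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_sketch(X, n_clusters, init=None, random_state=None, max_iter=100):
    """Return (labels, centroids) for the instances in X.

    Assumption: init is a list of starting centroid values. When it is None,
    n_clusters distinct instances are sampled at random from X.
    """
    if init is None:
        rng = random.Random(random_state)
        centroids = [list(row) for row in rng.sample(X, n_clusters)]
    else:
        centroids = [list(row) for row in init]

    labels = None
    for _ in range(max_iter):
        # Assignment step: each instance joins its nearest centroid
        # (ties go to the lowest cluster index).
        new_labels = [
            min(range(n_clusters),
                key=lambda c: euclidean_distance(row, centroids[c]))
            for row in X
        ]
        if new_labels == labels:
            break  # no instance changed cluster, so the result is stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = [row for row, lab in zip(X, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data (hypothetical, not from the assignment datasets).
points = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
labels, centroids = kmeans_sketch(points, 2, init=[[1.0, 1.0], [8.0, 8.0]])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[1.25, 1.5], [8.5, 8.75]]
```

Stopping when no instance changes cluster is one common convergence check; comparing centroids between iterations is an equally reasonable alternative.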

### Step 1: Implement a k-Means Clustering Unit Test for `myclusterer.py`
Finish the k-means clustering unit test `test_kmeans_clusterer_fit()` in `test_myclusterer.py` for testing the `MyKMeansClusterer` method `fit(X_train)` by implementing the following test cases:
1. Use the 16-instance example from Bramer self-assessment 19.5 exercise 1, asserting against the exercise solution in Bramer Appendix E
1. Use the 6-instance "simple gene" dataset below, asserting against the following solution cluster labels: `[0, 1, 0, 2, 1, 2]`
        
For convenience, I've provided the Bramer and simple gene datasets as Python lists below.

```python
# bramer self assessment 19.5.1 example
table = [
    [10.9, 12.6], #0
    [2.3, 8.4], #1
    [8.4, 12.6], #2
    [12.1, 16.2], #3
    [7.3, 8.9], #4
    [23.4, 11.3], #5
    [19.7, 18.5], #6
    [17.1, 17.2], #7
    [3.2, 3.4], #8
    [1.3, 22.8], #9
    [2.4, 6.9], #10
    [2.4, 7.1], #11
    [3.1, 8.3], #12
    [2.9, 6.9], #13
    [11.2, 4.4], #14
    [8.3, 8.7] #15
]
initial_centroids = [1, 2, 7]

# simple gene data
simple_gene_table = [
    ["g0",0,0.1,0.2,0,0.4,0.5,0.6,0.7,0.8,0.9], #0
    ["g1",1.0,0.9,0.8,0.7,0.6,0.5,0.4,0.3,0,0.1], #1
    ["g2",0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0], #2
    ["g3",0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4,0.4], #3
    ["g4",0.9,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1,0.0], #4
    ["g5",0.5,0,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5] #5
]
simple_gene_initial_centroids = [0, 1, 3]
```
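
The simple gene table carries a name in its first column, and the centroid lists appear to hold row indexes (the inline `#0`, `#1`, ... comments suggest this, but treat it as an assumption). A minimal sketch of preparing such a table for a test is shown here; the helper names `split_ids_and_features` and `rows_at` and the toy rows are hypothetical.

```python
def split_ids_and_features(table):
    """Separate a leading identifier column from the numeric columns."""
    ids = [row[0] for row in table]
    features = [row[1:] for row in table]
    return ids, features


def rows_at(instances, indexes):
    """Copy the instances found at the given row indexes."""
    return [list(instances[i]) for i in indexes]


# Toy table in the same layout as the simple gene data (values are made up).
toy_table = [
    ["a", 1.0, 2.0],
    ["b", 3.0, 4.0],
    ["c", 5.0, 6.0],
]
ids, X_train = split_ids_and_features(toy_table)
init = rows_at(X_train, [0, 2])  # assumed: indexes select starting centroids
print(ids)      # ['a', 'b', 'c']
print(X_train)  # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(init)     # [[1.0, 2.0], [5.0, 6.0]]
```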

### Step 2: `fit()`
Complete the `mysklearn.myclusterer.MyKMeansClusterer` method `fit()` and test your code for functional correctness against the `test_kmeans_clusterer_fit()` unit test.

### Step 3: `print_clusters()`
Finish the `print_clusters()` method of `MyKMeansClusterer` that "pretty prints" the clusters created from a call to `fit()`. Display the discovered clusters with their total sums of squares (TSS) and their instances (original table index and values). For example, for the simple gene expression dataset:

```
Cluster #0 (TSS=??)
Original index #0: [0, 0.1, 0.2, 0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
Original index #2: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

Cluster #1 (TSS=??)
Original index #1: [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0, 0.1]
Original index #4: [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]

Cluster #2 (TSS=??)
Original index #3: [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
Original index #5: [0.5, 0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
```
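
One way to produce output in this shape is sketched here on made-up data. It assumes that a cluster's TSS is the sum of squared Euclidean distances from each member to the cluster centroid, and that two decimal places is an acceptable display format; the function names are hypothetical, so check the definition in Bramer before relying on this.

```python
def centroid_of(members):
    """Column-wise mean of a non-empty list of instances."""
    return [sum(col) / len(members) for col in zip(*members)]


def cluster_tss(members):
    """Sum of squared distances from each member to the cluster centroid."""
    center = centroid_of(members)
    return sum(
        sum((x - c) ** 2 for x, c in zip(row, center))
        for row in members
    )


def format_cluster(cluster_id, indexed_members):
    """Build the pretty-print text for one cluster.

    indexed_members is a list of (original_index, instance) pairs.
    """
    rows = [row for _, row in indexed_members]
    lines = [f"Cluster #{cluster_id} (TSS={cluster_tss(rows):.2f})"]
    lines += [f"Original index #{i}: {row}" for i, row in indexed_members]
    return "\n".join(lines)


# Toy cluster (hypothetical values): centroid is [2.0, 2.0], so TSS = 2 + 2.
text = format_cluster(0, [(0, [1.0, 1.0]), (3, [3.0, 3.0])])
print(text)
# Cluster #0 (TSS=4.00)
# Original index #0: [1.0, 1.0]
# Original index #3: [3.0, 3.0]
```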

## Part 2: 🎸 Spotify Song Clustering 🎸 (10 BONUS pts)
### Step 1: k-Means Clustering
We are going to cluster songs in a Spotify song dataset and see if the clusters formed seem to correspond to music genres. I preprocessed and trimmed the original dataset to form the dataset we will use, songs.csv. Please see the [original dataset's description on Kaggle](https://www.kaggle.com/datasets/siropo/spotify-multigenre-playlists-data) and the [Spotify API](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features) for details on the song attributes in the dataset. At a minimum, remove the categorical attributes "Artist Name", "Track Name", and "Genre" before clustering. When I made songs.csv, I kept the first two for context about a song, and I kept "Genre" so you can investigate how the various genres are distributed among your clusters. There are seven genres, with 100 randomly sampled songs from each genre.

Run your k-means clustering algorithm over songs.csv with several values of k (start at 2, increment by 1, up to a reasonable max k depending on how long this takes to run on your machine). For reproducible results, set your algorithm's `random_state` (you can/should try different values of this). Create a chart showing the clusters' TSS for each value of k. Add a vertical line and/or annotation for your k = 7 results (because there are 7 genres). Is there an "elbow" point signifying a clear best number of clusters? Does this match the supposed 7 genres represented in the dataset? Based on this chart, choose a number of clusters to work with for step 2.

### Step 2: Data Visualization
To determine which attributes best characterize each cluster, visualize the clusters in at least two different ways:
1. Genre: Show the distribution of the 7 genres across your k clusters
    * Do the genres matter in terms of the clusters?
1. Attributes: Show the relationship between several attributes and the clusters
    * Which are the most discriminating attributes?
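
Before charting the genre distribution, it can help to tally genres per cluster. This is a minimal sketch using only the standard library; the function name `genre_counts_by_cluster` and the genre values are hypothetical, and it assumes the cluster labels and the "Genre" column are parallel lists in the same row order.

```python
from collections import Counter


def genre_counts_by_cluster(labels, genres):
    """Map each cluster label to a Counter of the genres assigned to it."""
    counts = {}
    for label, genre in zip(labels, genres):
        counts.setdefault(label, Counter())[genre] += 1
    return counts


# Toy labels and genres (made up for illustration).
labels = [0, 0, 1, 1, 1]
genres = ["rock", "rock", "jazz", "rock", "jazz"]
counts = genre_counts_by_cluster(labels, genres)
for cluster_id in sorted(counts):
    print(cluster_id, dict(counts[cluster_id]))
# 0 {'rock': 2}
# 1 {'jazz': 2, 'rock': 1}
```

The resulting counts can feed a grouped or stacked bar chart with one bar group per cluster.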

You are free to come up with any chart or collection of charts to create these visualizations. An example of the relationship between energy and acousticness for k = 4 clusters is shown below.

<img src="https://raw.githubusercontent.com/GonzagaCPSC322/PAs/master/figures/energy_acousticness_scatter.png" width="400" />

## Submitting Assignments
1. Turn in your assignment files via a Github Classroom repo. See the "Github Classroom Setup" section at the beginning of this document for details on how to do this.
    1. Your repo should contain all of the files needed to run and test your solution (e.g. .py file(s), input files, etc.). 
    1. Double-check that this is the case by "pretending to be the grader": clone (or download a zip of) your submission repo and run your code in a fresh [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container, as we will when we grade your code.
1. Submit this PA’s associated assignment in Canvas to mark your PA as "done" and ready for grading. We will then pull your Github repo and grade your PA as soon as possible. The date and time you submit the PA assignment in Canvas will be used for marking your assignment as "late" or "on-time."

## Grading Guidelines
This assignment is worth 25 BONUS points. Your assignment will be evaluated based on a successful execution in the [continuumio/anaconda3:2024.06-1](https://hub.docker.com/r/continuumio/anaconda3) Docker container and adherence to the program requirements. We will grade according to the following criteria:
* 3 pts for correct part 1 step 1 (define `test_kmeans_clusterer_fit()` unit test)
* 8 pts for correct part 1 step 2 (finish `fit()` and pass test)
* 2 pts for correct part 1 step 3 (finish `print_clusters()`)
* 3 pts for correct part 2 step 1 (songs.csv clustering)
* 3 pts for correct part 2 step 2 (songs.csv visualization of genre distributions)
* 3 pts for correct part 2 step 2 (songs.csv visualization of attribute/cluster relationships)
* 3 pts for adherence to course [coding standard](https://nbviewer.jupyter.org/github/GonzagaCPSC322/PAs/blob/master/Coding%20Standard.ipynb), including data storytelling (narrative is clear and grammatically correct, Notebook is organized with headers, formulas are typeset with LaTeX, code receives a "good" `pylint` rating, etc.).
    * Note: these bonus points are only awarded on completion of the entirety of this assignment
    * Note: these bonus points include having a linting score of at least 8/10