# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Team

- Xiaoxuan Zhang
- Yunxiang Chi
- Xiaoyan He
- Jiayi Dong
- Yilin Ge

# Abstract 

The project is designed to be a Music Genre Classification Tool...

This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents and how they are measured
- what you will be doing with the data
- how performance/success will be measured

# Background

Remember you are trying to explain why someone would want to answer your question or why your hypothesis is in the form that you've stated. 

Our project focuses on developing a music genre classification system that music platforms and music companies can use to improve their personalized recommendation systems, giving listeners a more precise way of navigating and discovering music.

To achieve this, we will apply machine learning algorithms and data analysis techniques to a diverse dataset of labeled music samples that spans different genres and includes various features. We will also compare the performance of different algorithms to assess the system's accuracy, robustness, and ability to generalize across a wide range of music styles.

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

# Data

The dataset we are using is the GTZAN Dataset - Music Genre Classification <a name="gtzan"></a>[<sup>[1]</sup>](#gtzannote)

- https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification
- Content:
    - original soundfile: Collection of 10 genres with 100 30-second audio files each.
    - images: Power spectra of each audio file, in image format.
    - 2 CSV files: Containing multiple features of all the songs.

For our model, we plan to use both CSV files, which contain 1000 observations and the following 10 variables (with both mean and variance):
- pitch (chroma in the dataset)
- RMS in audio signals
- spectral centroid
- spectral bandwidth
- roll-off frequency
- zero crossing rate
- harmony
- perceptr (as the column is named in the dataset)
- tempo
- 20 groups of Mel Frequency Cepstral Coefficients
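As a minimal sketch of how a feature table like the one described above could be separated into a feature matrix and genre labels, the snippet below uses only the Python standard library. The column names (`filename`, `length`, `label`, `chroma_stft_mean`, and so on) are assumptions about the CSV layout, and the two rows are made-up stand-in values, not real GTZAN measurements.

```python
import csv
import io


def load_features(csv_file, label_col="label", skip_cols=("filename", "length")):
    """Read a feature table and return (feature_names, X, y).

    label_col and skip_cols are assumed column names; adjust them
    to match the actual CSV header.
    """
    reader = csv.DictReader(csv_file)
    # Every column that is neither the label nor metadata is treated as a feature.
    feature_names = [
        c for c in reader.fieldnames if c != label_col and c not in skip_cols
    ]
    X, y = [], []
    for row in reader:
        X.append([float(row[c]) for c in feature_names])
        y.append(row[label_col])
    return feature_names, X, y


# Tiny in-memory stand-in for the real file (values are illustrative only).
sample = io.StringIO(
    "filename,length,chroma_stft_mean,chroma_stft_var,tempo,label\n"
    "track_a.wav,661794,0.35,0.08,123.0,blues\n"
    "track_b.wav,661794,0.41,0.09,95.7,rock\n"
)
feature_names, X, y = load_features(sample)
```

With the real data, the same function could be called on a file opened with `open(path, newline="")`, where `path` is wherever the CSV is stored. `pandas.read_csv` would be a more convenient alternative once the column layout is confirmed.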


Special handling, transformations, and cleaning will be decided and listed here later.
 
 

(We're aiming to find an extra dataset that can be combined with the GTZAN Dataset.)

# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Why might your solution work? Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

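Since this is a multi-class classification problem, we will use the following metrics to evaluate both our benchmark model and solution model. TP, TN, FP, and FN are counted per genre (one genre versus the rest), so the formulas below apply to each genre: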
- Accuracy (mathematical representation: (TP + TN) / (TP + TN + FP + FN))
- Precision (mathematical representation: TP / (TP + FP))
- Recall (mathematical representation: TP / (TP + FN))
- F1 Score (mathematical representation: 2 * (Precision * Recall) / (Precision + Recall))
- Confusion Matrix (mathematical representation: [[TP, FP], [FN, TN]])
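The formulas above are written for the binary case. A minimal sketch of how they might extend to several genres is shown below, assuming each genre is scored one-vs-rest and the per-genre F1 scores are averaged with equal weight (macro averaging). Both choices are assumptions rather than a decided part of the plan, and the function name `per_class_metrics` and the toy labels are hypothetical.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return (accuracy, macro_f1, per-class report) for multi-class labels."""
    labels = sorted(set(y_true) | set(y_pred))
    # Count each (true, predicted) pair; this is the confusion matrix in sparse form.
    pair_counts = Counter(zip(y_true, y_pred))
    report = {}
    for c in labels:
        tp = pair_counts[(c, c)]
        fp = sum(pair_counts[(t, c)] for t in labels if t != c)  # predicted c, truly other
        fn = sum(pair_counts[(c, p)] for p in labels if p != c)  # truly c, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    # Multi-class accuracy is the share of correctly labeled samples.
    accuracy = sum(pair_counts[(c, c)] for c in labels) / len(y_true)
    macro_f1 = sum(m["f1"] for m in report.values()) / len(labels)
    return accuracy, macro_f1, report


# Toy labels for illustration only.
y_true = ["blues", "blues", "rock", "rock", "jazz", "jazz"]
y_pred = ["blues", "rock", "rock", "rock", "jazz", "blues"]
accuracy, macro_f1, report = per_class_metrics(y_true, y_pred)
```

In practice, `sklearn.metrics.confusion_matrix` and `sklearn.metrics.precision_recall_fscore_support` compute the same quantities. The hand-written version is only meant to make the per-genre counting explicit.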

If applicable (that is, if we use a deep learning model), we may also use the cross-entropy loss function to evaluate our model and track how its performance improves with each learning iteration.

The cross-entropy loss for a single sample is:

$$ -\sum_{i=1}^{n} y_i \log(p_i) $$

where $n$ is the number of classes, $y_i$ is 1 if class $i$ is the true label and 0 otherwise, and $p_i$ is the predicted probability of class $i$.
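The cross-entropy formula above can be checked on a small example. The sketch below assumes the true label is one-hot encoded and that the predicted probabilities sum to 1. The clipping constant `eps` is an added safeguard against `log(0)` and is not part of the formula itself.

```python
import math


def cross_entropy(y_true_one_hot, p_pred, eps=1e-12):
    """Cross-entropy for one sample: -sum_i y_i * log(p_i)."""
    # Clip probabilities away from zero so that log() is always defined.
    return -sum(
        y * math.log(max(p, eps)) for y, p in zip(y_true_one_hot, p_pred)
    )


# Three hypothetical classes; the true class is the second one.
y = [0, 1, 0]
confident = [0.1, 0.8, 0.1]  # most probability on the true class
unsure = [0.4, 0.3, 0.3]     # probability spread across classes

loss_confident = cross_entropy(y, confident)  # equals -log(0.8)
loss_unsure = cross_entropy(y, unsure)        # equals -log(0.3)
```

Because only the true class has $y_i = 1$, the loss reduces to the negative log of the probability assigned to the true class, so the confident prediction receives the lower loss.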



# Ethics & Privacy

The dataset we use is GTZAN, a free dataset from Kaggle that is often called the MNIST of sounds. It should not raise any privacy concerns, since many music genre recognition ML models are trained on it. However, if we later train a larger model on more data, copyright may become one of the issues we need to address.

As music creation develops, the boundaries between genres are becoming vague. More and more music mixes styles, and new genres will continue to be created. This makes classification harder, and assigning songs to specific genres may also raise social issues. Many songwriters may not be happy with their songs being classified into a specific genre. Labeling the dataset may also become harder in the future.

If a powerful music genre recognition (MGR) model is developed and used in a music recommendation system, it may bias people's music taste: because the system can always find music that fits a user's current taste, it reduces the chance for users to discover new types of music that they might like.

# Team Expectations 

* *Weekly meetings on Sundays for a general progress check*
* *Bi-weekly quick meetings on Wednesdays before each checkpoint submission*
* *Frequent discussion through online platforms (text, zoom meetings, etc.)*

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting (all or assigned in the previous meeting)  | Discuss at Meeting (all or team member) |
|---|---|---|---|
| 5/14  |  7 PM |  Determine best form of communication; Brainstorm topics/questions  |  Decide on final project topic (all); discuss ideal datasets and ethics (all); do background research (Xiaoxuan) | 
| 5/16  |  10 PM |  Do background research on topic | Draft project proposal (Xiaoxuan, Jiayi) | 
| 5/17  |  6 PM |  Draft project proposal | Edit, finalize, and submit proposal (all) | 
| TBD  | 10 AM  | Search for extra datasets | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| TBD  | 6 PM  | Import & wrangle data; do some EDA () | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| TBD  | 12 PM  | Finalize wrangling/EDA; Begin programming for project () | Discuss/edit project code; Complete project |
| TBD  | 12 PM  | Complete analysis; Draft results/conclusion/discussion ()| Discuss/edit full project |
| 6/14  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes
<a name="gtzannote"></a>1.[^](#gtzan): GTZAN Dataset - Music Genre Classification. https://www.kaggle.com/datasets/andradaolteanu/gtzan-dataset-music-genre-classification<br> 

