# COGS 118A - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training a supervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

### PLANNING NOTES:

Official Idea: Using a dataset of songs and their details to predict what range of views a song gets.

https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information

### Peer Review

You will all have an opportunity to look at the Project Proposals of other groups to fuel your creativity and get more ideas for how you can improve your own projects. 

Both the project proposal and project checkpoint will have peer review.

# Names

- Raunit Kohli
- Saarthak Trivedi 
- Dhaval Jani
- Mohammad Alkhalifah

# Abstract 
This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents and how they are measured
- what you will be doing with the data
- how performance/success will be measured

# Background

Music platforms such as Spotify and Apple Music have used recommender systems for years. Built on a series of algorithms, these platforms maintain a massive database of songs and their attributes, and recommend music selections to individuals based on past music taste and popular streams. However, as these algorithms grow more advanced, song recommendations are becoming highly specialized, based on clustering and clouds of similarity<a name="recsystem"></a>[<sup>[1]</sup>](#recsystem). 

Song popularity is constantly changing and is usually based on all-time views, but also on the number of shares, pre-saves, and recent popularity<a name="popularity"></a>[<sup>[2]</sup>](#popularity). When recommending new music, platforms try to surface new uploads, both to build a profile of each individual and to help releases gain popularity. Yet because recommender algorithms base profiles on clustering techniques, and because the popularity of a song is measured by more factors than all-time views alone, it is possible that the songs recommended as popular are in fact not the most popular songs<a name="popularity"></a>[<sup>[2]</sup>](#popularity). 

Several studies have examined music recommender systems and how they choose the popular songs they recommend. One paper shows that, in the United States, songs in languages other than English or from non-American artists are less likely to be recommended to individuals, even if all other attributes of the song's metadata match the user's preferences<a name="survey"></a>[<sup>[3]</sup>](#survey). This suggests that music recommender systems need to be improved so that total views have a more balanced influence on music choices. 

Having a good algorithm for predicting the view count of a song based on its other features allows recommender systems to more effectively suggest relevant songs. It can also help music artists and record labels better understand the features necessary in a song for it to be popular. 

# Problem Statement

Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

(Saarthak) The problem is to develop a machine learning model that accurately predicts the range of views/plays a song will receive, based on factors such as language, genre, lyrics, and other relevant attributes. The goal is to leverage a database of over 3 million songs and their corresponding view/play counts to build a robust predictive model. The problem is quantifiable: the target variable (number of views/plays) is a continuous numerical value, and the model's predicted range can be expressed as a lower-bound and an upper-bound estimate. It is measurable: because we have access to a vast database of songs with their historical view/play counts, we can train and evaluate the model on this data, comparing the predicted ranges against the actual counts to determine its efficacy. The model's performance can also be measured and compared across different subsets of the dataset, or on new unseen songs, to validate its generalizability. Finally, it is replicable: the model can be trained and evaluated multiple times using various subsets of the dataset or different feature combinations to explore different hypotheses and improve its predictive capabilities.
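One way to make the "range of views" target concrete is to bucket raw view counts by order of magnitude. The sketch below is only an illustration under that assumption: the function name `view_range` and the power-of-ten boundaries are hypothetical choices, not something fixed by the dataset or by our final design.

```python
from typing import Tuple


def view_range(views: int) -> Tuple[int, int]:
    """Return hypothetical (lower, upper) power-of-ten bounds containing `views`.

    Buckets are [0, 10), [10, 100), [100, 1000), and so on.
    """
    if views < 0:
        raise ValueError("view counts cannot be negative")
    if views < 10:
        return (0, 10)
    # The digit count gives the order of magnitude exactly, which avoids
    # floating-point rounding problems with log10 near bucket edges.
    exponent = len(str(views)) - 1
    return (10 ** exponent, 10 ** (exponent + 1))
```

For example, `view_range(3_500_000)` returns `(1000000, 10000000)`. Other bucketings, such as quantile-based ranges, would fit the same interface.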

# Data

You should have a strong idea of what dataset(s) will be used to accomplish this project. 

If you know what (some) of the data you will use, please give the following information for each dataset:
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc will be needed

If you don't yet know what your dataset(s) will be, you should describe what you desire in terms of the above bullets.


---
Note: 

> Before training the machine learning models, we will preprocess the data. For numerical features, we will apply standard scaling to bring them to a similar scale. For categorical features like genre, artist, and language, we will use one-hot encoding to convert them into numerical format. For the lyrics, we will employ Natural Language Processing (NLP) techniques, like word embeddings, to extract meaningful features.
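To illustrate what standard scaling and one-hot encoding compute, here is a dependency-free sketch on made-up rows. The helper names (`standard_scale`, `one_hot`) and the toy values are assumptions for illustration only; in the actual pipeline we would expect to use scikit-learn's `StandardScaler` and `OneHotEncoder` rather than hand-written helpers, and the lyrics embedding step is not covered here.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Rescale numbers to zero mean and unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(labels):
    """Map each label to a 0/1 row with one position per distinct category."""
    categories = sorted(set(labels))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return categories, rows


# Toy, made-up rows: release year and genre for four songs.
years = [1999, 2005, 2011, 2017]
genres = ["rap", "pop", "rap", "rock"]

scaled_years = standard_scale(years)
genre_names, genre_rows = one_hot(genres)

# Each feature row is the scaled year followed by the one-hot genre columns.
features = [[y] + g for y, g in zip(scaled_years, genre_rows)]
```

Each row in `features` has one scaled numeric column plus one column per genre, which is the shape a downstream regressor would consume.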


# Proposed Solution

(Mohammad) In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Why might your solution work? Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

---

The proposed solution for this problem is to employ a supervised machine learning approach, specifically a regression model, to predict lower and upper bounds for song views based on given song features. The model will leverage the dataset of over 3 million songs, which includes features like genre, artist, language, title, lyrics, and year of release. 

We plan to combine deep learning with ensemble methods. For lyrics analysis, we will use Recurrent Neural Networks (RNNs). To combine the lyrics analysis with the rest of the features, we propose to use Random Forest algorithms because of their ability to handle complex, non-linear relationships and to mitigate overfitting. The model will be implemented in Python using the Scikit-learn, TensorFlow, and Keras libraries. Scikit-learn will be used for pre-processing, feature selection, and training traditional machine learning models. TensorFlow and Keras will be used to implement the deep learning part. 

The performance of the models will be evaluated using metrics such as Mean Absolute Error (MAE) and the R-squared (R²) score, which are common performance metrics for regression problems. We will perform k-fold cross-validation to assess the model's performance and its robustness across different subsets of the data. We think this solution might work because it leverages both traditional machine learning and deep learning, capturing both linear and non-linear patterns in the data. Furthermore, the use of a large dataset increases the likelihood that the model generalizes well to unseen data. 
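The k-fold evaluation can be sketched without any libraries. The example below scores a deliberately simple baseline that always predicts the median of the training fold. That baseline is our assumption for illustration, not a benchmark we have committed to, and the function names are hypothetical. In practice we would expect to shuffle the data first and use scikit-learn's `KFold` or `cross_val_score` instead.

```python
from statistics import median


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validated_baseline_mae(y, k=5):
    """Average MAE over k folds of a baseline predicting the training median."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        guess = median(y[i] for i in train_idx)
        y_test = [y[i] for i in test_idx]
        scores.append(mean_absolute_error(y_test, [guess] * len(y_test)))
    return sum(scores) / len(scores)
```

Any model we train should beat the score this baseline produces on the same folds; otherwise the extra features are not adding predictive value.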

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

**Some ideas:** The performance of the model can be measured using various evaluation metrics, such as mean absolute error (MAE) or root mean square error (RMSE), to assess its accuracy in predicting the view/play counts.
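For reference, the standard definitions of these metrics, along with the R-squared score mentioned in the proposed solution, are given below. The notation is our assumption: $y_i$ is the actual view count of song $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean of the actual counts, and $n$ is the number of evaluated songs.

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
$$

MAE and RMSE are in the same units as the view counts, and RMSE penalizes large errors more heavily than MAE. $R^2$ is unitless and equals 1 for perfect predictions.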

# Ethics & Privacy

(Mohammad)
If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination. Get creative!

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

---

While the goal of this project is to predict views, the model might ultimately be used in recommendation systems or by artists. Hence, it is important to consider potential ethical issues and implications related to data privacy, as well as possible unintended consequences. If the model prioritizes songs predicted to have high view counts, it might stifle diversity and reinforce a popularity echo chamber. To address this concern, we could incorporate some measure of diversity or novelty into our model. Another potential ethical concern is bias in our model. For instance, if the training data is skewed towards certain genres, languages, or artists, our model may inadvertently favor those groups when predicting views. We will carefully examine our dataset for such biases and consider techniques such as resampling or weighting to mitigate them. We will ensure transparency by documenting our methodology, acknowledging limitations, and being open to feedback. 
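As a rough sketch of the weighting idea, each song could be weighted inversely to how common its group (for example, its language) is in the training data. The function name and the toy labels below are hypothetical. The formula mirrors the "balanced" heuristic scikit-learn uses for class weights, applied here to a grouping column as an assumption rather than a settled design choice.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """Weight each sample by n_samples / (n_groups * count of its group)."""
    counts = Counter(groups)
    n_samples = len(groups)
    n_groups = len(counts)
    return [n_samples / (n_groups * counts[g]) for g in groups]


# Toy, made-up language labels: English is over-represented.
languages = ["en", "en", "en", "es"]
weights = inverse_frequency_weights(languages)
```

With these weights, every group contributes the same total weight, and the weights sum to the number of samples. They could then be passed as `sample_weight` when fitting a scikit-learn estimator that supports it.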


# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...
* We will communicate often using the established text group chat we have for our group
* We will reply within 24 hours in the text group if asked specifically
* We will attend regularly scheduled meetings and be proactive to ensure our attendance at those meetings
* We will divide the work for the project equitably and ensure everyone completes a fair amount of work
* We will use a democratic process to vote on decisions in the project (and use a coin-flip to break ties)
* We will send an email outlining the requirements of a member to any team member who is non-cooperative

# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/11  |  11:30 AM |  Determine communication medium; Brainstorm topics + questions | Research, discuss, and decide on final project topic; Discuss problem statement; Begin background research; Assign everyone a part of the proposal | 
| 5/17  |  10:30 AM |  Everyone completes their specific tasks for the proposal (all)  | Finish proposal; Discuss issues or problems with topic; Finalize submission for proposal | 
| 5/24  | 11:30 AM  | Read advice from TA and Peer reviews; Discuss if topic needs to change or requires a rewrite | Discuss data cleaning and lyric embedding techniques; Assign everyone roles for checkpoint  |
| 5/28  | 8 PM  | Finalize data wrangling and preparation (combine encodings + embeddings) | Finalize data and start ML techniques; Assign everyone roles/checkpoints for training and testing |
| 5/31  | 10:30 AM  | Everyone completes their specific tasks for the checkpoint (all) | Finish checkpoint; Discuss issues or problems with results; Finalize submission for checkpoint |
| 6/04  | 8 PM  | Read advice from TA and Peer reviews; Discuss if ML techniques need to change | Discuss/edit full project; Continue with more eval metrics and training|
| 6/08  | 11:30 AM  | Everyone completes their specific tasks for the final (all) | Discuss last minute edits; Check in with TA if any questions; Finalize testing and evaluation; Finalize conclusions |
| 6/14  | Before 11:59 PM  | NA | Turn in Final Project  |

# Footnotes

<a name="recsystem"></a>1.[^](#recsystem): Pastukhov, Dmitry. “Inside Spotify’s Recommender System: A Complete Guide to Spotify Recommendation Algorithms.” https://www.music-tomorrow.com/blog/how-spotify-recommendation-system-works-a-complete-guide-2022#:~:text=%22We%20can%20understand%20songs%20to,recommend%20song%20Z%20to%20them.<br>

<a name="popularity"></a>2.[^](#popularity): “Song Popularity on Spotify – How It Works: Pansentient League.” Pansentient League | Spotify and Synthpop, 15 Nov. 2016, pansentient.com/2009/09/spotify-song-popularity/.<br>

<a name="survey"></a>3.[^](#survey): Song, Yading, Simon Dixon, and Marcus Pearce. “A Survey of Music Recommendation Systems and Future Perspectives.” Proceedings of the 9th International Symposium on Computer Music Modeling and Retrieval, 2012. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e0080299afae01ad796060abcf602abff6024754