# COGS 118A - Final Project

Notes:

- Take all the data:
    - drop features
    - sort by views
    - label encode the artists
    - embed the lyrics, titles
    - one-hot encode the tags
    - discretize the views into categories
    - one-hot encode the categories
    
- Split data into training and testing set

- Take training data:
    - Split into training and validation
    - Train on 3 types of Neural Networks (for classification)
    - Test neural networks and save validation metrics
    - Train on KNN Classification (using Grid/Randomized Search)
    - Test KNN
    - Train XGBoost
    - Test...
    - Graph out anything we want
    - Save params for best performing algo/model
    
- Take testing data
    - Evaluate the best model on the testing data
    - print out final evaluation metrics
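
The splitting steps in the outline above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `train_val_test_split` and the 20% fractions are assumptions. Because the outline sorts the data by views, the sketch shuffles row indices first so that each split covers the full range of view counts.

```python
import numpy as np

def train_val_test_split(n_rows, test_frac=0.2, val_frac=0.2, seed=0):
    """Return disjoint index arrays (train, val, test) for n_rows samples.

    test_frac is taken from the full dataset; val_frac is taken from what
    remains after the test set is removed. Both fractions are illustrative.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)            # shuffle row indices once
    n_test = int(round(n_rows * test_frac))
    test_idx = order[:n_test]                  # held out until the very end
    remaining = order[n_test:]
    n_val = int(round(len(remaining) * val_frac))
    val_idx = remaining[:n_val]                # used for model selection
    train_idx = remaining[n_val:]              # used for fitting
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 640 160 200
```

The test indices are only touched once, after the best model and its parameters have been chosen on the validation split.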

Notes:
- The goal was to create a classifier that could predict how many views a song will receive based on its lyrics, title, artist, age, and tags. 
- Our data was a large dataset with millions of songs and many features.
- Our solution was an exploration of regression vs classification through different algorithms including KNN, XGBoost, and Neural Networks. 
- Our results were evidence that large data does not always offset the noise of Neural Networks and that regression is less straightforward to train than classification on large datasets. 

# Predicting Song Views using Embedded Features

## Group members

- Mohammad Alkhalifah
- Dhaval Jani
- Raunit Kohli
- Saarthak Trivedi

# Abstract 
(Abstract should be no more than 1 paragraph and no code)

The goal of this project is to predict song views from features in a song dataset, namely language, genre, and lyrics, in an effort to assist music recommendation systems. The dataset contains the metadata of over five million individual songs collected in 2022 from Genius, a website that provides music information and lyrics. Through exploratory data analysis we clean the data, and then use feature engineering to encode and embed it so that it can be used more effectively with machine learning models. Both regression and classification will be employed in order to determine which models and methods work best. For regression, we will use KNN and a Random Forest algorithm, along with a neural network that predicts the view count of a song given all of its other attributes. For classification, we will split all the songs in the dataset into twenty categories, grouped by view count, and then try to predict the category each song belongs to. Here is an example of the code used for this:

    import numpy as np

    # Sorted list of view counts (`data` is the song DataFrame loaded earlier).
    views_list = sorted(data['views'].to_list())

    max_views = views_list[-1]
    min_views = views_list[0]

    # 20 logarithmically spaced upper bounds between 1 and max_views.
    upper_bounds = np.logspace(0, np.log10(max_views), num=21)[1:]
    upper_bounds = np.ceil(upper_bounds)

    # Map each upper bound to its category number (1-20).
    mappings = {}
    for i in range(len(upper_bounds)):
        mappings[upper_bounds[i]] = i + 1

    # encoded_views stores a value from 1-20 for each song (in sorted order):
    # the category of the first upper bound that covers its view count.
    encoded_views = []
    for v in views_list:
        for u in upper_bounds:
            if v <= u:
                encoded_views.append(mappings[u])
                break
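
The nested loop above can be slow on millions of songs. As a hedged alternative, the same binning can be sketched with `np.searchsorted`; the helper name `bin_views` and the sample view counts are made up for illustration, and the sketch assumes the largest view count is at least 1 and that the rounded bounds are all distinct.

```python
import numpy as np

def bin_views(views, n_bins=20):
    """Assign each view count a category in 1..n_bins using log-spaced upper bounds."""
    views = np.asarray(views, dtype=float)
    # Same bounds as the loop version: n_bins log-spaced upper limits, rounded up.
    upper_bounds = np.ceil(np.logspace(0, np.log10(views.max()), num=n_bins + 1)[1:])
    # side="left" returns the index of the first bound u with v <= u,
    # matching the `if v <= u` test in the loop; +1 makes categories 1-based.
    categories = np.searchsorted(upper_bounds, views, side="left") + 1
    return categories, upper_bounds

# Made-up view counts, in any order (no sorting required).
categories, bounds = bin_views([0, 1, 5, 250, 10_000, 1_000_000])
print(categories)
```

Unlike the loop, this version keeps the categories aligned with the input order, so they can be assigned straight back to the rows they came from.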


Finally, we will analyze and improve the models using a series of cross-validations. The regression models will be evaluated with metrics such as Mean Absolute Error and R-squared score; the classification models will be evaluated as well, including an analysis of their false positive and false negative rates.
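
For reference, the metrics named above can be written out directly. The functions below are a minimal NumPy sketch of the standard definitions, with illustrative names and made-up numbers; they are not taken from the project code. The false positive and false negative rates are computed one-vs-rest for a single category, which is one reasonable way to extend them to twenty classes.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def fp_fn_rates(y_true, y_pred, cls):
    """One-vs-rest false positive and false negative rates for category `cls`.

    Assumes `cls` and at least one other category both occur in y_true.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    actual_pos = y_true == cls
    predicted_pos = y_pred == cls
    fp = np.sum(predicted_pos & ~actual_pos)
    fn = np.sum(~predicted_pos & actual_pos)
    tn = np.sum(~predicted_pos & ~actual_pos)
    tp = np.sum(predicted_pos & actual_pos)
    return float(fp / (fp + tn)), float(fn / (fn + tp))

mae = mean_absolute_error([100, 200, 300], [110, 190, 330])
r2 = r2_score([100, 200, 300], [110, 190, 330])
fpr, fnr = fp_fn_rates([1, 1, 2, 2, 3], [1, 2, 2, 2, 1], cls=1)
```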

__NB:__ this final project form is much more report-like than the proposal and the checkpoint. Think in terms of writing a paper with bits of code in the middle to make the plots/tables

# Background

Fill in the background and discuss the kind of prior work that has gone on in this research area here. **Use inline citation** to specify which references support which statements.  You can do that through HTML footnotes (demonstrated here). I used to recommend Markdown footnotes (Google is your friend) because they are simpler, but recently I have had some problems with them working for me, whereas HTML ones have always worked so far. So use the method that works for you, but do use inline citations.

Here is an example of inline citation. After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds<a name="lorenz"></a>[<sup>[1]</sup>](#lorenznote). Use a minimum of 2 or 3 citations, but we prefer more <a name="admonish"></a>[<sup>[2]</sup>](#admonishnote). You need enough citations to fully explain and back up important facts. 

Remember you are trying to explain why someone would want to answer your question or why your hypothesis is in the form that you've stated. 

# Problem Statement

The problem at hand is to develop a machine learning model that accurately predicts the range of the number of views/plays a song will receive based on various factors including lyrics, title, artist, tags, and year. The goal is to leverage a database containing over 3.3 million English songs and their corresponding view/play counts to build a robust predictive model. We can express this in mathematical terms by defining the target variable (the number of views/plays) either as a continuous numerical value or as discrete bins of ranges. The model's predictions are then either a quantified estimate of plays or a category covering a range of views. 

To best predict the views, we can perform either regression or classification to determine a song's viewing potential. By breaking the problem into two distinct tasks and comparing their performance, we can train and evaluate multiple models on our large dataset, checking the predicted ranges against the actual counts to determine each model's efficacy. Furthermore, each model's performance can be measured and compared across different subsets of the dataset or on new, unseen songs to validate its generalizability. The models can be trained and evaluated multiple times using various subsets of the dataset or different feature combinations to explore different hypotheses and improve their predictive capabilities.

# Data

Detail how/where you obtained the data and cleaned it (if necessary)

If the data cleaning process is very long (e.g., elaborate text processing) consider describing it briefly here in text, and moving the actual cleaning process to another notebook in your repo (include a link here!).  The idea behind this approach: this is a report, and if you blow up the flow of the report to include a lot of code it makes it hard to read.

The dataset we will primarily be using is a dataset from Kaggle that stores songs and their metadata from Genius (https://www.kaggle.com/datasets/carlosgdcj/genius-song-lyrics-with-language-information). The format is a CSV file with over five million observations. There are 11 different features for each data point. Each observation refers to a single song, and the features are the title, tag (i.e., genre), artist name, year, artists featured on the song, the Genius identifier, and three separate language features. Out of all the features, the critical ones include the title, artist name, featured artists, year, and language, as these should all help predict the view counts. Each data point also includes the view count of the song, which enables supervised learning with these counts as labels. The features are mostly represented as strings and integers; we intend to use one-hot encoding and label encoding to convert categorical features like genre, artist, and language into numerical format. For the lyrics, we will employ Natural Language Processing (NLP) techniques, like word embeddings, to extract meaningful features. Finally, for numerical features, we will apply standard scaling to bring them to a similar scale.
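
The three transformations named above (label encoding, one-hot encoding, and standard scaling) can be illustrated on a few made-up rows. This is a NumPy-only sketch with hypothetical values; in practice, scikit-learn's `LabelEncoder`, `OneHotEncoder`, and `StandardScaler` provide the same operations.

```python
import numpy as np

# Made-up rows standing in for the real columns.
artists = ["Artist A", "Artist B", "Artist A", "Artist C"]
tags = ["rap", "pop", "rock", "pop"]
years = np.array([1999, 2010, 2015, 2020], dtype=float)

# Label encoding: each distinct artist gets one integer ID.
artist_names, artist_ids = np.unique(artists, return_inverse=True)

# One-hot encoding: one 0/1 column per distinct tag.
tag_names, tag_ids = np.unique(tags, return_inverse=True)
tag_one_hot = np.eye(len(tag_names), dtype=int)[tag_ids]

# Standard scaling: subtract the mean and divide by the standard deviation.
years_scaled = (years - years.mean()) / years.std()

print(artist_ids)    # one integer per row
print(tag_one_hot)   # one row per song, one column per tag
```

Label encoding keeps the artist column to a single integer, which matters when there are many distinct artists, while one-hot encoding suits the handful of tags because it does not impose an artificial ordering on them.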

![image-2.png](attachment:image-2.png)

In order to get another look at some of the data, we can make some plots. We see the spread of the various genres in the data:

![image-3.png](attachment:image-3.png)

Further, we can look at the data in terms of views plotted against the year of the song's release, where we see that more recent releases tend to have more views.

![image-4.png](attachment:image-4.png)

__Enter link to notebook where we did all the Data Cleaning__
- Mention everything we did up until train_test_split
- In this notebook: 4-5 cells Big Cleaning Steps and output of data before embedding/encoding 
- Describe Feather in detail here...

Please give the following information for each dataset you are using
- link/reference to obtain it
- description of the size of the dataset (# of variables, # of observations)
- what an observation consists of
- what some critical variables are, how they are represented
- any special handling, transformations, cleaning, etc you have done should be demonstrated here!


# Proposed Solution

The proposed solution for this problem is to employ multiple machine learning models on two supervised tasks, regression and classification, to predict bounds for song views based on the given song features. Our models will be selected for their established performance on datasets with many features and will leverage the cleaned dataset of over 3.3 million English songs, which includes the features of lyrics, title, artist, tag, and release age. 

Before training any models, we will first pre-process our data to best fit both the regression and classification tasks. We will use HuggingFace to explore three distinct text semantic analysis tokenizer-embedder models for embedding the lyrics and titles of each song. To choose among them, we will validate each one with simple KNN models ("simple" meaning trained on a subset of the data) to determine the most effective lyrical embedder. We will then encode the artists and tags, and finally process the release years. For regression, we will keep our views as continuous numbers, implementing a specific allowed error rate for prediction. For classification, however, we will discretize our views into 20 logarithmically binned categories so that our models can predict classes rather than continuous counts. 
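
The embedder-selection step can be sketched as follows. Everything here is illustrative: the random arrays stand in for the outputs of two candidate embedders, the names `embedder_a` and `embedder_b` are hypothetical, and the small hand-written KNN replaces whatever KNN implementation is actually used.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=5):
    """Majority-vote k-nearest-neighbour prediction with Euclidean distance."""
    # Pairwise squared distances, shape (n_query, n_train).
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]       # indices of the k closest rows
    votes = train_y[nearest]                      # their labels, shape (n_query, k)
    return np.array([np.bincount(row).argmax() for row in votes])

def knn_validation_accuracy(embeddings, labels, n_train, k=5):
    """Fit on the first n_train rows and score on the remaining rows."""
    pred = knn_predict(embeddings[:n_train], labels[:n_train], embeddings[n_train:], k)
    return float(np.mean(pred == labels[n_train:]))

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)             # stand-in view categories

# Stand-ins for the embeddings produced by two candidate models on a data subset.
candidates = {
    "embedder_a": rng.normal(size=(200, 8)) + labels[:, None],  # carries label signal
    "embedder_b": rng.normal(size=(200, 8)),                    # pure noise
}

scores = {name: knn_validation_accuracy(emb, labels, n_train=150)
          for name, emb in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The embedder whose vectors let a cheap KNN separate the view categories best is carried forward to the full models.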

We will then employ machine learning algorithms designed to handle large feature inputs, because after embedding and encoding our dataset will have far more features than in its raw form. The three algorithms we will experiment with are Recurrent Neural Networks (RNNs), K-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGBoost). We expect these algorithms to be well suited to large datasets because of how they weight features and data points, their ability to model complex, non-linear relationships, and their options for mitigating overfitting. Our models will be implemented using PyTorch and TensorFlow for the RNN, XGBoost for the boosting algorithm, and scikit-learn for KNN. 

The performance of the models will be evaluated using metrics specific to the task at hand, as well as overall accuracy. For regression, our loss will be calculated using Mean Squared Error (MSE), while for classification we will calculate loss using Cross Entropy Loss. For our neural networks, we will build different networks from layers that are commonly used for text analysis and assess model performance and robustness through a series of train and validation tests. For XGBoost and KNN, we will perform k-fold cross-validation within a hyper-parameter search using either GridSearch or RandomizedSearch. We will compare regression to classification through model accuracy, calculated as a non-regularized ratio of correct predictions. 
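
To put regression and classification on the same accuracy scale, a regression prediction has to be counted as "correct" or "incorrect". One possible reading of the allowed error rate mentioned earlier is a relative tolerance around the true view count. The sketch below uses that reading; the 25% tolerance and the function names are assumptions, not values from the project.

```python
import numpy as np

def classification_accuracy(true_classes, predicted_classes):
    """Ratio of exactly matching category predictions."""
    return float(np.mean(np.asarray(true_classes) == np.asarray(predicted_classes)))

def regression_accuracy(true_views, predicted_views, tolerance=0.25):
    """Ratio of predictions within a relative tolerance of the true view count.

    The default tolerance of 25% is hypothetical.
    """
    true_views = np.asarray(true_views, dtype=float)
    predicted_views = np.asarray(predicted_views, dtype=float)
    within = np.abs(predicted_views - true_views) <= tolerance * np.abs(true_views)
    return float(np.mean(within))

# Made-up predictions for four songs.
clf_acc = classification_accuracy([3, 8, 14, 20], [3, 8, 13, 20])
reg_acc = regression_accuracy([100, 1000, 5000, 20], [110, 700, 5200, 40])
print(clf_acc, reg_acc)  # 0.75 0.5
```

The comparison is sensitive to the tolerance chosen: a wider tolerance makes regression look better, just as fewer, wider bins make classification look better.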

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

__Describe the MSE vs CrossEntropy Loss functions here and why they were better for regression and classification respectively__
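
As a starting point, the standard definitions of the two losses are given below. The notation is an assumption introduced here: $N$ is the number of songs, $y_i$ and $\hat{y}_i$ are the true and predicted view counts, $C = 20$ is the number of view categories, $y_{i,c}$ is 1 if song $i$ belongs to category $c$ and 0 otherwise, and $p_{i,c}$ is the predicted probability of that category.

$$
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
$$

$$
\mathrm{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \, \log p_{i,c}
$$

MSE penalizes the squared distance between a predicted and a true count, which fits a continuous target. Cross entropy penalizes low predicted probability on the correct category, which fits a discrete target.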

# Results

You may have done tons of work on this. Not all of it belongs here. 

Reports should have a __narrative__. Once you've looked through all your results over the quarter, decide on one main point and 2-4 secondary points you want us to understand. Include the detailed code and analysis results of those points only; you should spend more time/code/plots on your main point than the others.

If you went down any blind alleys that you later decided to not pursue, please don't abuse the TAs' time by throwing in 81 lines of code and 4 plots related to something you actually abandoned.  Consider deleting things that are not important to your narrative.  If it's slightly relevant to the narrative or you just want us to know you tried something, you could keep it in by summarizing the result in this report in a sentence or two, moving the actual analysis to another file in your repo, and providing us a link to that file.

__For each section, have no more than 10-12 lines of code and 1 graph. Then link the notebook that has all the rest of the code__

---

You will likely have different subsections as you go through your report. For instance you might start with an analysis of the dataset/problem and from there you might be able to draw out the kinds of algorithms that are / aren't appropriate to tackle the solution.  Or something else completely if this isn't the way your project works.

### Subsection 2

Another likely section is if you are doing any feature selection through cross-validation or hand-design/validation of features/transformations of the data

### Subsection 3

Probably you need to describe the base model and demonstrate its performance.  Maybe you include a learning curve to show whether you have enough data to do train/validate/test split or have to go to k-folds or LOOCV or ???

### Subsection 4

Perhaps some exploration of the model selection (hyper-parameters) or algorithm selection task. Validation curves, plots showing the variability of performance across folds of the cross-validation, etc. If you're doing one, the outcome of the null hypothesis test or parsimony principle check to show how you are selecting the best model.

### Subsection 5 

Maybe you do model selection again, but using a different kind of metric than before?

### Pre-processing our Data

For both regression and classification with our models, we need to transform our data into numerical quantifications.

__Results section planning:__

- Section 1: Preprocessing the data with different embedding and encoding techniques
    - Explain what needs to happen and why
    - Code to show different embeddings
    - Code to show different encodings
    - Final full dataset format (5 rows)
- Section 2: Testing Neural Networks for Regression
    - __Link to full training notebook__
    - Explanation of what we did
    - Neural Network Models
    - Results of Models... (Make this up from somewhere)
- Section 3: Testing KNN and XGBoost for Regression
    - Code
    - Results
- Section 4: Testing Neural Networks for Classification
    - Models after changing for classification (CrossEntropy and Sigmoid/Softmax)
- Section 5: Testing KNN and XGBoost for Classification
    - Code
    - Results

# Discussion

### Interpreting the result

OK, you've given us quite a bit of tech information above, now it's time to tell us what to pay attention to in all that.  Think clearly about your results, decide on one main point and 2-4 secondary points you want us to understand. Highlight HOW your results support those points.  You probably want 2-5 sentences per point.

### Limitations

The way the data is provided is one limitation: even after comprehensive exploratory data analysis, issues remain, such as a single artist being represented under multiple names, so that after encoding they show up as different artists. An example is "Kendrick Lamar" vs "Kendrick Lamar Live". Further, we 


### Ethics & Privacy

If your project has obvious potential concerns with ethics or data privacy discuss that here.  Almost every ML project put into production can have ethical implications if you use your imagination. Use your imagination.

Even if you can't come up with an obvious ethical concern that should be addressed, you should know that a large number of ML projects that go into production have unintended consequences and ethical problems once in production. How will your team address these issues?

Consider a tool to help you address the potential issues such as https://deon.drivendata.org

### Conclusion

Reiterate your main point and in just a few sentences tell us how your results support it. Mention how this work would fit in the background/context of other work in this field if you can. Suggest directions for future work if you want to.

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem


# Exceptional Final Project

Our Final Project effectively explored different Machine Learning techniques in an effort to predict song views from a multitude of lyrical and logistical features. 

We feel our project exceeded expectations and merits Extra Credit on two points: (a) we didn't just explore and train different models for a single task, but went through an exploratory and experimental process to compare two different tasks, regression and classification, and how they apply to complex models; and (b) our project used an enormous dataset of over 3.3 million data points after cleaning, each with over 500 features after pre-processing. Analyzing and using a dataset of this size falls under big data analytics, and manipulating it for model training and validation is a difficult machine learning challenge that we feel we tackled head-on. 

- (a) Why does comparing Recurrent Neural Networks, K-Nearest Neighbors, and Extreme Gradient Boosting, on both regression and classification, merit extra credit?
- (b) Why is managing and manipulating a dataset with over 3.3 million data points, each with over 500 features, considered a difficult ML task that merits extra credit?