In [None]:
%load_ext autoreload
%autoreload 2

# GenReco - The Lyrics-based Genre Recognizer

The primary objective of this project is to investigate the feasibility of predicting a song's genre based on its lyrical features. With the vast amount of music available today, categorizing songs into genres manually can be time-consuming and subjective.

Using data science techniques, we explore whether the lyrical characteristics of a song can serve as reliable indicators of its genre. Through this research, we intend to contribute to the field of music analysis and improve our understanding of the relationship between lyrics and musical genres.

# Procedure

The project will follow a systematic procedure to investigate the predictability of song genres based on lyrical features.

## Study
Initially, extensive research will be conducted to gain a comprehensive understanding of the subject matter, exploring existing studies and theories related to music genre classification.

## Dataset Generation
To generate a suitable dataset for analysis, we will utilize the Spotify API to retrieve relevant song metadata such as artist, album, and genre information. Additionally, we will crawl lyrics from Genius.com, a popular lyrics website, to obtain the lyrical features necessary for our analysis. By combining these sources, we aim to create a diverse and representative dataset that encompasses a wide range of genres and artists.

## Pre-EDA
Once the dataset is obtained, we will focus on understanding its composition and characteristics, mainly the data types and their distributions. Exploratory data analysis (EDA) techniques, including visualizations and statistical summaries such as measures of central tendency and variability, will be employed to gain insights into the distribution of the lyrical features and their relationship to song genres. This exploratory phase will allow us to draw preliminary conclusions and identify initial patterns or trends within the data.
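
As a rough illustration of the summaries meant here, the sketch below inspects data types and computes measures of central tendency and variability on a tiny made-up frame. The column names mirror the dataset's, but the values are invented for the example.

```python
import pandas as pd

# Toy data: invented values, column names borrowed from the dataset
toy = pd.DataFrame({
    "genre": ["pop", "rock", "pop", "rap"],
    "word_cnt": [200, 150, 250, 600],
})

print(toy.dtypes)  # data type of every column

# Central tendency
mean_words = toy["word_cnt"].mean()
median_words = toy["word_cnt"].median()
mode_genre = toy["genre"].mode()[0]

# Variability
std_words = toy["word_cnt"].std()  # sample standard deviation
iqr_words = toy["word_cnt"].quantile(0.75) - toy["word_cnt"].quantile(0.25)

print(mean_words, median_words, mode_genre, std_words, iqr_words)
```

Comparing the mean with the median is a quick way to spot skew: a single very long song pulls the mean well above the median.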

## Data Curation
Next, data curation steps will be applied to ensure the quality and reliability of the dataset. This process will involve the identification and removal of duplicate entries, as well as the detection and handling of outliers or inconsistencies in the data. By carefully curating the dataset, we aim to enhance the accuracy and integrity of our subsequent analysis.
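
The two curation steps named above can be sketched as follows. This is a minimal example on invented data, assuming the common 1.5 × IQR rule. The helper `drop_iqr_outliers` is hypothetical and is not the project's `outlier_detection_iqr`, whose implementation is not shown in this notebook.

```python
import pandas as pd

def drop_iqr_outliers(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `col` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]

# Toy data: one exact duplicate row and one extreme duration
toy = pd.DataFrame({
    "name": ["a", "a", "b", "c", "d", "e"],
    "duration": [180, 180, 200, 210, 190, 2000],
})

deduped = toy.drop_duplicates()                    # removes the repeated ("a", 180) row
cleaned = drop_iqr_outliers(deduped, "duration")   # removes the 2000 outlier

print(len(toy), len(deduped), len(cleaned))
```

Removing duplicates first matters, because repeated rows shift the quartiles that the outlier rule depends on.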

## EDA
Following data curation, an in-depth exploratory data analysis will be conducted. This phase will involve examining the distributions, correlations, and other relevant statistical properties of the lyrical features across different genres. Visualizations such as histograms, scatter plots, and box plots will be employed to facilitate a comprehensive understanding of the dataset and unveil potential insights regarding the relationship between lyrical features and song genres.
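
To make the idea concrete, here is a small sketch of a per-genre summary and a correlation matrix. The frame is invented for illustration; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy data: invented values
toy = pd.DataFrame({
    "genre": ["pop", "pop", "rap", "rap"],
    "word_cnt": [200, 220, 600, 640],
    "unique_word_cnt": [90, 100, 250, 270],
})

# Per-genre summary: how a lyrical feature differs between genres
per_genre_mean = toy.groupby("genre")["word_cnt"].mean()

# Pairwise Pearson correlation between numeric features
corr = toy[["word_cnt", "unique_word_cnt"]].corr()

print(per_genre_mean)
print(corr)
```

Two features that correlate strongly with each other carry largely the same information, which is worth knowing before feeding both to a model.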

## Machine Learning
With a solid grasp of the dataset and its characteristics, we will proceed to the model selection stage. Various machine learning algorithms, such as decision trees, random forests, or KNN, will be evaluated and compared to identify the most suitable model for genre prediction based on lyrical features. Model performance metrics, including accuracy, precision, and recall, will be assessed to determine the effectiveness of each algorithm in capturing the underlying patterns within the data.
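
The comparison described above might look roughly like the sketch below. It assumes scikit-learn and uses synthetic data from `make_classification` as a stand-in for the lyrical features. The hyperparameters are arbitrary placeholders, not the values chosen in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the lyrical features; not the project's data
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Macro averaging weights every class equally
        "precision": precision_score(y_test, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="macro", zero_division=0),
    }

for name, metrics in scores.items():
    print(name, metrics)
```

Macro-averaged precision and recall are used here because genre classes are often imbalanced, and accuracy alone can hide poor performance on the rarer genres.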

## Conclusions
Finally, based on the chosen model and its performance, we will draw final conclusions regarding the project's main objective: the predictability of song genres using lyrical features. The findings of this research endeavor will shed light on the relationship between lyrics and musical genres, providing valuable insights for music analysis and potentially influencing future advancements in genre classification methodologies.

# Some Imports...
Here we import the standard and 3rd-party libraries required for the notebook and for demonstration purposes,
as well as some of our own functionality that we've moved into modules to avoid cluttering the notebook.

In [None]:
# IMPORT 3RD PARTY LIBRARIES

import os.path as osp
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import math

In [None]:
# IMPORT UTILS

from utils.ml_utils import split_to_train_and_test, get_classifier_obj, calc_evaluation_val, find_best_k_for_KNN, \
    find_best_model
from utils.curation_utils import transfer_to_categorical, remove_duplicates_and_drop_na, repair_numeric_missing_vals, \
    outlier_detection_iqr, my_dist_to_avg
from utils.plot_utils import plot_frequent_elements, plot_cross_tabulation, get_highly_correlated_cols, \
    transfer_str_to_numeric_vals, plot_frequencies, plot_continuous_feature_relations, plot_histograms
from utils.general_utls import load_dataset
from utils.lyrics_utils import reget_lyrics_df, remove_non_english_songs

# Dataset Generation

The generation of the dataset was carried out separately using a regular Python script execution. This approach was chosen to maintain a clear distinction between the dataset and its analysis within this notebook. By treating the dataset as pre-existing data, it allows for better organization of our thought process and prevents mixing the data generation process with its subsequent analysis.

>Having said that, during the process of the data analysis, we will not shy away from manipulating the data to fit our needs.

<i>Refer to generate_dataset.py</i>

# Loading the Dataset

In [None]:
file_name = "huge_dataset.csv"
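# "__file__" is a plain string here (notebooks do not define __file__), so this resolves to the current working directory
dataset_file_path = osp.join(osp.dirname(osp.abspath("__file__")), file_name)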
dataset = load_dataset(dataset_file_path)

# Pre-EDA: Data Visualization

Let us examine the individual columns in the data.
The data contains the following fields:

| Column Name      | Data Type   | Variable type |
|------------------|-------------|----|
| source_genre     | object      | Free text |
| name             | object      | Free text |
| artists          | object      | Free text |
| release_year     | int64       | Ordinal |
| release_month    | float64     | Ordinal |
| genres           | object      | List |
| genre            | object      | Categorical |
| duration         | int64       | Numeric|
| popularity       | int64       | Ordinal|
| lyrics_file      | object      | Free text (metadata) |
| lyrics_url       | object      | Free text (metadata) |
| intro_cnt        | float64     | Ordinal|
| outro_cnt        | float64     | Ordinal|
| verse_cnt        | float64     | Ordinal|
| chorus_cnt       | float64     | Ordinal|
| line_cnt         | float64     | Numeric|
| word_cnt         | float64     | Numeric|
| unique_word_cnt  | float64     | Ordinal |
| stop_word_cnt    | float64     | Ordinal |
| slang_word_cnt   | float64     | Ordinal |
| positive         | float64     | Ordinal |
| negative         | float64     | Ordinal |
| neutral          | float64     | Ordinal |
| compound         | float64     | Ordinal |

## Top Elements
With pie charts, we can see the relative share of the most frequent values of each feature.

In [None]:
pie_cols = ['genre', 'artists', 'release_year']
plot_frequencies(dataset, pie_cols, 'pie', 7)

The graphs show that, among the leading values:
- there are significantly more pop songs than songs of any other genre
- the artists are mostly balanced
- the large majority of the songs were released recently
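
For reference, the shares shown in such a pie chart can be computed as in the sketch below. `top_n_shares` is a hypothetical helper written for this illustration; the actual `plot_frequencies` lives in `utils.plot_utils` and may work differently.

```python
import pandas as pd

def top_n_shares(series: pd.Series, n: int) -> pd.Series:
    """Return the n most frequent values and their share among those n values."""
    counts = series.value_counts().head(n)
    return counts / counts.sum()

# Toy data: invented genre labels
genres = pd.Series(["pop", "pop", "pop", "rock", "rap", "rap", "jazz"])
shares = top_n_shares(genres, 2)

print(shares)
```

Note that the shares are relative to the top `n` values only, so they say nothing about how much of the full dataset those values cover.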

## Ordinal Variables Frequency
With bar charts, it is easier to see how often each value of an ordinal variable occurs.

In [None]:
bars_cols = ['release_year', 'release_month',
             'intro_cnt', 'outro_cnt', 'verse_cnt', 'chorus_cnt']
plot_frequencies(dataset, bars_cols, 'bar')

The graphs show that:
- as seen before, most of the data is indeed new songs. In addition, the number of songs decreases the further back the release year is
- suspiciously, some songs have a chorus or verse count of 0, which might indicate an issue with the data
- songs are released at a relatively constant rate across the months of the year, with a slight increase at the beginning and middle of the year

## Numeric Variables Distribution

In [None]:
histogram_cols = ['popularity', 'unique_word_cnt', 'slang_word_cnt',
                  'positive', 'negative', 'neutral', 'compound', 'stop_word_cnt']
plot_histograms(dataset, histogram_cols)

# Conclusions - 1D visualizations

From looking at the graphs we can identify the following:
- Pop is the most frequent genre in the dataset
- Many songs have a popularity of 0, while the next "spike" in the distribution is around 80

# FEATURE RELATIONS VISUALIZATION

In [None]:
continuous_vars = ["duration", "line_cnt", "word_cnt", "unique_word_cnt", "stop_word_cnt", "slang_word_cnt"]
plot_continuous_feature_relations(dataset, continuous_vars)

# MISSING VALUES & OUTLIERS HUNT
As part of the data manipulation process, let's check how many values are missing in each column and look for outliers.

In [None]:
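# Ratio of missing values per column (isna and isnull are aliases, so one call suffices)
missing_values = dataset.isna().mean()
# Keep only the columns that actually have missing values
missing_values = missing_values[missing_values > 0]
plt.figure(figsize=(18, 6))
plt.bar(missing_values.index, missing_values.values, width=0.2)
# Rotate the labels after plotting so the rotation applies to the final ticks
plt.xticks(rotation=45)
plt.title("Missing values per Column (ratio, non-zero)")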

## Analysis: missing values
Some records don't have the release_month field set. This is because their original release date included the release year only. Since these records make up only a small share of the data, we will drop them.
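
The effect of dropping rows on a single column can be checked on a small example first. The sketch below uses invented values and only standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy data: one record lacks a release month
toy = pd.DataFrame({
    "release_year": [2020, 2021, 1999, 2015],
    "release_month": [5.0, np.nan, 11.0, 3.0],
})

# Share of missing values per column (the mean of a boolean mask is a ratio)
missing_ratio = toy.isna().mean()

# Drop only the rows that lack a release month
kept = toy.dropna(subset=["release_month"])
kept_ratio = len(kept) / len(toy)

print(missing_ratio)
print(kept_ratio)
```

Restricting `dropna` with `subset` keeps rows that are missing values in other columns, which can then be repaired separately.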

In [None]:
dataset_1 = dataset.dropna(axis="index", subset=["release_month"])
len(dataset_1)/len(dataset)

## CLEAN DATASET

In [None]:
dup_na_removed = remove_duplicates_and_drop_na(dataset)
non_english_removed = remove_non_english_songs(dup_na_removed)
outliers = outlier_detection_iqr(non_english_removed, my_dist_to_avg)
repaired = repair_numeric_missing_vals(outliers, dup_na_removed.select_dtypes('number'))
# Fix cnt values to be int
for col in repaired.columns:
    if "_cnt" in col or col in ['duration', 'popularity', 'release_year', 'release_month']:
        repaired[col] = repaired[col].astype(int)

repaired.drop(repaired[repaired.genre == 'classical'].index, inplace=True)
repaired.genre = repaired.genre.astype('category')
repaired.info()

In [None]:
pie_cols = ['genre', 'artists', 'release_year']
plot_frequencies(repaired, pie_cols, 'pie', 7)

In [None]:
continuous_vars = ["duration", "line_cnt", "word_cnt", "unique_word_cnt", "stop_word_cnt", "slang_word_cnt"]
plot_continuous_feature_relations(repaired, continuous_vars)

## EDA

In [None]:
for_heat = repaired.copy()
for_heat.genre = pd.factorize(for_heat.genre)[0]
plt.figure(figsize=(15, 10))
sub_df = for_heat.select_dtypes('number')
sns.heatmap(sub_df.corr())

In [None]:
ratios = ["slang_word", "unique_word", "stop_word"]
# Ratio of each word category to the song's total word count (vectorized per column)
for field in ratios:
    repaired[f"{field}_ratio"] = repaired[f"{field}_cnt"] / repaired["word_cnt"]

In [None]:
df_params = pd.DataFrame({'plot_type': ['bar', 'line', 'pie'],
                          'col_name': ['genre', 'release_year', 'artists'],
                          'num_top_elements': [6,6,6]})
plot_frequent_elements(repaired, df_params)

In [None]:
plt.rcParams["figure.figsize"] = (18,6)

In [None]:
cols_to_bin = ['release_year', 'duration', 'line_cnt', 'word_cnt', 'positive', 'negative', 'neutral', 'compound']
categorical_cols = ['chorus_cnt', 'verse_cnt', 'intro_cnt', 'outro_cnt']
transferred = transfer_to_categorical(repaired, cols_to_bin, categorical_cols)

## MACHINE LEARNING SECTION

In [None]:
transferred = transfer_str_to_numeric_vals(transferred)
transferred.info()

In [None]:
# 'popularity', 'compound', 'neutral', 'line_cnt'
debug_cols = ['artists', 'genres', 'source_genre', 'name', 'lyrics_file', 'lyrics_url', 'popularity', 'release_month']
# `columns=` already selects the axis, so `axis=1` is redundant
prepared = transferred.drop(columns=debug_cols)

In [None]:
X_train, X_test, y_train, y_test = split_to_train_and_test(prepared, 'genre', 0.2, 5)
best_k, best_score = find_best_k_for_KNN(X_train, y_train)
print(best_k, best_score)

In [None]:
params = {'n_neighbors': best_k}
knn_clf = get_classifier_obj("KNN", params)
knn_clf.fit(X_train, y_train)
y_predicted = knn_clf.predict(X_test)
accuracy_val = calc_evaluation_val("accuracy", y_test, y_predicted)
cm_val = calc_evaluation_val("confusion_matrix", y_test, y_predicted)
print(accuracy_val)
print(cm_val)

In [None]:
max_dep = 4
min_smpl_split = 5
best_clf, best_recall_val = find_best_model(X_train, y_train, max_dep, min_smpl_split)
print(best_clf, best_recall_val)

In [None]:
best_clf.fit(X_train, y_train)
y_predicted = best_clf.predict(X_test)
accuracy_val = calc_evaluation_val("accuracy", y_test, y_predicted)
print(accuracy_val)