## Predicting the Beats-per-Minute of Songs

### Data Loading and Initial Inspection

#### - Load datasets: import the training and testing datasets using pandas.
#### - Duplicate the data: create a copy of the training set so we can explore it freely without touching the original.


In [None]:
import pandas as pd

train_data = pd.read_csv("/kaggle/input/playground-series-s5e9/train.csv")
test_data = pd.read_csv("/kaggle/input/playground-series-s5e9/test.csv")

explore = train_data.copy()

### Feature Descriptions
#### **id**: Unique identifier for each track.
#### **RhythmScore**: A measure of the rhythmic complexity or regularity of a track. A higher score might indicate a more defined, clearer rhythm.
#### **AudioLoudness**: The overall average loudness of the audio track. Loudness values often sit on a logarithmic scale, which is why they can be negative.
#### **VocalContent**: A proportion or score describing how much of the track is dominated by vocals. A high value suggests a primarily vocal song, a low value an instrumental track.
#### **AcousticQuality**: This could be a measure of the sound quality or fidelity of the recording.
#### **InstrumentalScore**: This is likely a measure of how much of the track is purely instrumental. It is often inversely correlated with VocalContent.
#### **LivePerformanceLikelihood**: A score, often between 0 and 1, that quantifies the probability that the recording is from a live performance rather than a studio recording.
#### **MoodScore**: A numerical score attempting to quantify the emotional tone or mood of the song. It could represent a spectrum (e.g., from sad to happy) or a specific emotion.
#### **TrackDurationMs**: The duration of the song in milliseconds.
#### **Energy**: A perceptual measure of a track's intensity and activity. Songs with a lot of movement, a loud feel, and a fast tempo usually have a higher energy score.
#### **BeatsPerMinute**: This is a direct measure of the song's tempo. A higher number means a faster song.
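
#### To illustrate why a logarithmic loudness scale produces negative values, here is a minimal sketch. It assumes loudness is expressed in decibels relative to a reference amplitude, which the dataset description does not confirm; `amplitude_to_db` is a hypothetical helper written only for this illustration.

```python
import math


def amplitude_to_db(amplitude_ratio: float) -> float:
    """Convert a linear amplitude ratio (relative to a reference) to decibels."""
    return 20.0 * math.log10(amplitude_ratio)


# Any amplitude below the reference level (ratio < 1) maps to a negative value.
print(round(amplitude_to_db(1.0), 2))  # equal to the reference level
print(round(amplitude_to_db(0.5), 2))  # half the reference amplitude
print(round(amplitude_to_db(0.1), 2))  # a tenth of the reference amplitude
```

#### Under this assumption, a track that never reaches the reference level always has a negative loudness.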

#### Let's begin our Exploratory Data Analysis (EDA) to understand the dataset's characteristics.

In [None]:
explore.info()

### Initial findings

#### - No missing values.
#### - The target is BeatsPerMinute, a continuous numeric variable, so this is a regression problem.
#### - All predictor features are numeric.
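
#### The findings above are read off the `info()` output. A minimal sketch of how they could be checked programmatically is shown below; `demo` is a small synthetic stand-in (an assumption for illustration) and would be replaced by `explore` in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `explore`, with a few of the real column names.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "id": np.arange(5),
    "RhythmScore": rng.random(5),
    "Energy": rng.random(5),
    "BeatsPerMinute": rng.normal(120, 25, 5),
})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = demo.isna().sum()

# Collect columns that are not numeric; an empty list means every column is numeric.
non_numeric = demo.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("Non-numeric columns:", non_numeric)
```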
 

In [None]:
explore.describe()

### We will now use visualizations to further explore the dataset.


In [None]:
import matplotlib.pyplot as plt 

fig, axes = plt.subplots(nrows = 2, ncols = 5, figsize = (20,12), layout = 'tight')
axes = axes.flatten()

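# Plot every feature against the target; 'id' and the target itself are skipped.
features = explore.columns.drop(['id', 'BeatsPerMinute'])

for ax, col in zip(axes, features):
    ax.scatter(explore['BeatsPerMinute'], explore[col], alpha = 0.1)
    ax.set_xlabel('BeatsPerMinute')
    ax.set_ylabel(col)

# Hide any unused panels left over in the 2x5 grid.
for ax in axes[len(features):]:
    ax.set_visible(False)

plt.show()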

### No obvious relationship between any feature and BeatsPerMinute is visible in the plots.
### The features are on very different scales, so the data will be scaled before training.
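
#### The visual impression can be backed up with numbers. The sketch below computes Pearson (linear) and Spearman (monotonic) correlations between each feature and the target. It runs on synthetic data, so `demo` and its values are assumptions for illustration; in the notebook it would be replaced by `explore.drop(columns='id')`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: independent random features and target.
rng = np.random.default_rng(0)
n_rows = 1_000
demo = pd.DataFrame({
    "RhythmScore": rng.random(n_rows),
    "Energy": rng.random(n_rows),
    "BeatsPerMinute": rng.normal(120, 25, n_rows),
})

# Pearson correlation of every feature with the target.
pearson = demo.corr()["BeatsPerMinute"].drop("BeatsPerMinute")

# Spearman correlation is the Pearson correlation of the ranks.
spearman = demo.rank().corr()["BeatsPerMinute"].drop("BeatsPerMinute")

summary = pd.DataFrame({"pearson": pearson, "spearman": spearman})
print(summary)
```

#### Values close to zero in both columns would be consistent with the scatter plots. Note that neither coefficient detects every kind of non-linear dependence.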

### Building & Training the Model
#### We split the dataset into training and testing sets so the model can be evaluated on unseen data.
#### We scale the features so they are all on a similar scale. Tree-based models such as random forests do not strictly need this, but it keeps the pipeline reusable with other estimators.
#### Since the EDA showed no obvious linear relationships, I will use a RandomForestRegressor, which can capture non-linear relationships.

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.preprocessing import StandardScaler


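# Separate predictors from the target. 'id' stays in X, which matches the
# columns later passed to model.predict(test_data).
X = train_data.drop(["BeatsPerMinute"], axis = 1)
y = train_data["BeatsPerMinute"]

# Restrict to the first 70,000 rows.
X = X[:70000]
y = y[:70000]

# Hold out 20% of those rows as a test split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 23)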

model = make_pipeline(StandardScaler(),RandomForestRegressor(max_depth = 10, random_state = 23))

train_size, train_score, test_score = learning_curve(
    estimator=model,
    X = X_train,
    y = y_train,
    train_sizes = np.linspace(0.1, 1.0, 10),
    cv = 3,
    scoring = "neg_mean_squared_error",
    n_jobs = -1
)

train_rmse_score = np.sqrt(-train_score)
test_rmse_score = np.sqrt(-test_score)

train_mean_rmse = np.mean(train_rmse_score, axis = 1)
test_mean_rmse = np.mean(test_rmse_score, axis = 1)

plt.figure(figsize = (10,7))
plt.title("Learning Curve (RMSE)")
plt.xlabel("Training Examples")
plt.ylabel("Root Mean Squared Error")
plt.grid()

# Both curves are errors (lower is better), so they are labelled as RMSE rather than scores.
plt.plot(train_size, train_mean_rmse, 'o-', color = 'r', label = 'Training RMSE')
plt.plot(train_size, test_mean_rmse, 'o-', color = 'g', label = 'Cross-Validation RMSE')

plt.legend(loc = "best")
plt.show()
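
#### The hold-out split (`X_test`, `y_test`) created by `train_test_split` is never scored in this notebook. Below is a minimal sketch of how it could be used, run on synthetic data so it is self-contained. The `DummyRegressor` baseline, the `rmse` helper, the `n_estimators = 50` setting, and all `*_demo` names are additions for illustration and not part of the original workflow.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; in the notebook these would be X and y.
rng = np.random.default_rng(23)
X_demo = rng.random((500, 4))
y_demo = rng.normal(120, 25, 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=23
)

# Same pipeline shape as the notebook, with fewer trees to keep the demo fast.
forest = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, max_depth=10, random_state=23),
)
# Baseline that always predicts the mean of the training target.
baseline = DummyRegressor(strategy="mean")

forest.fit(X_tr, y_tr)
baseline.fit(X_tr, y_tr)


def rmse(y_true, y_pred):
    """Root mean squared error, the same metric used in the learning curve."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


forest_rmse = rmse(y_te, forest.predict(X_te))
baseline_rmse = rmse(y_te, baseline.predict(X_te))
print(f"Random forest RMSE: {forest_rmse:.2f}")
print(f"Mean baseline RMSE: {baseline_rmse:.2f}")
```

#### If the forest's hold-out RMSE is not clearly below the baseline's, the features carry little usable signal for the model as configured.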

In [17]:
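# Fit on the training split, then predict the tempo for the competition test set.
model.fit(X_train, y_train)
y_pred = model.predict(test_data)

# Write the predictions in the id / BeatsPerMinute submission format.
output = pd.DataFrame({'id': test_data['id'], "BeatsPerMinute": y_pred})

output.to_csv("submission.csv", index = False)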