First we import all necessary libraries. We'll be using the already processed dataset from the previous notebook.

In [19]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error

processed_data = pd.read_csv('processed_dataset.csv')

processed_data.head()
copy = processed_data.copy()

# Feature Selection
We will use the SelectKBest class from scikit-learn to select the best features from the dataset. SelectKBest takes a score function as a parameter and keeps the k features with the highest scores. We will use the f_regression score function, which runs a univariate linear regression test of each input feature against the target variable and returns an F-value and a p-value for each feature. The F-value is derived from the squared Pearson correlation between the feature and the target, so a larger F-value indicates a stronger linear relationship. The p-value is the probability of observing an F-value at least this large if the feature and the target had no linear relationship. Since we are only interested in the features that best predict the target variable, we display only the selected features.
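As a minimal sketch of what f_regression returns, the snippet below runs it on synthetic data and recomputes the F-values from the Pearson correlation. The arrays `X_demo` and `y_demo`, their sizes and the coefficients are made up for illustration and are not the notebook's dataset.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (assumption, for illustration only): feature 0 is strongly
# related to the target, feature 1 weakly, feature 2 not at all.
rng = np.random.default_rng(0)
n_samples = 200
X_demo = rng.normal(size=(n_samples, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n_samples)

# f_regression returns one F-value and one p-value per feature
f_values, p_values = f_regression(X_demo, y_demo)

# Recompute the F-values from the Pearson correlation r of each feature with y:
# F = r^2 / (1 - r^2) * (n - 2)
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])])
f_from_r = r ** 2 / (1 - r ** 2) * (n_samples - 2)

for j in range(X_demo.shape[1]):
    print(f"feature {j}: r={r[j]:.3f}, F={f_values[j]:.2f}, p={p_values[j]:.3g}")
```

SelectKBest simply ranks the features by these F-values and keeps the top k.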

In [20]:
# Feature Selection
# Exclude the target variable from input features
X = copy.drop(columns=['danceability'])
y = copy['danceability']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Keep the 4 features with the highest F-scores; fit the selector on the training set only
selector = SelectKBest(score_func=f_regression, k=4)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
best_features = selector.get_support(indices=True)

# Display the selected feature names
selected_feature_names = X.columns[best_features].tolist()
print("Selected Features:", selected_feature_names)

Selected Features: ['loudness', 'instrumentalness', 'valence', 'time_signature']


# Split the data into training and testing sets

In [21]:
X = processed_data[selected_feature_names]
y = processed_data['danceability']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)


# Model Selection and training
We have opted for ordinary linear regression because it is simple, easy to implement and easy to interpret. Lasso and Ridge are regularized variants of the same model that are designed to reduce overfitting; with only four input features, overfitting is a limited risk here, so the unregularized model is a reasonable starting point. We will train the model on the training data, use it to make predictions on the test data, and then calculate the R-squared and mean squared error values to evaluate its performance.

In [22]:
model_lnn = LinearRegression()
model_lnn.fit(X_train, y_train)

predictions_lnn = model_lnn.predict(X_test)

# joblib.dump(model_lnn, 'model_lnn.pkl')

mse = mean_squared_error(y_test, predictions_lnn)
r_squared = r2_score(y_test, predictions_lnn)

print(f'R-squared: {r_squared:.2f}')
print(f'Mean Squared Error: {mse:.2f}')

R-squared: 0.26
Mean Squared Error: 0.02
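To make the two metrics above more concrete, here is a small sketch that computes them by hand and checks the results against scikit-learn. The `y_true` and `y_pred` values are invented for illustration and are not taken from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions (assumption, for illustration only)
y_true = np.array([0.60, 0.45, 0.80, 0.30, 0.70])
y_pred = np.array([0.55, 0.50, 0.65, 0.45, 0.60])

residuals = y_true - y_pred

# MSE is the mean of the squared residuals (squared units of the target)
mse_manual = np.mean(residuals ** 2)

# RMSE brings the error back to the units of the target
rmse_manual = np.sqrt(mse_manual)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.4f} (sklearn: {mean_squared_error(y_true, y_pred):.4f})")
print(f"RMSE: {rmse_manual:.4f}")
print(f"R-squared: {r2_manual:.4f} (sklearn: {r2_score(y_true, y_pred):.4f})")
```

Because the MSE is in squared units, its square root is usually the easier number to compare against the range of the target.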


# Conclusion
The R-squared value of 0.26 indicates that the model explains 26% of the variance in the target variable, leaving about 74% unexplained. The mean squared error of 0.02 is expressed in squared units of danceability; its square root, roughly 0.14, is the typical size of a prediction error in the target's own units. Overall, the model captures part of the signal but has clear room for improvement, for example by using more data, by trying other regression models such as Lasso and Ridge, or by combining several models in an ensemble.
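One way such a comparison could be set up is sketched below, using 5-fold cross-validation on synthetic data. The dataset from `make_regression`, the `alpha` values and the names `X_demo`, `y_demo` and `candidates` are assumptions chosen for illustration; they are not tuned for, or derived from, the notebook's data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the four selected features (assumption)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=4, noise=10.0, random_state=101
)

# Candidate models; the alpha values are arbitrary starting points
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge(alpha=1.0)": Ridge(alpha=1.0),
    "Lasso(alpha=0.1)": Lasso(alpha=0.1),
}

# Mean R-squared over 5 folds for each candidate
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R-squared = {cv_scores[name]:.3f}")
```

On the real data, the same loop could be run with `X` and `y` from the notebook in place of the synthetic arrays.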