Spotify 2023 songs: Predictive modelling

Task 1: Data Exploration and Cleaning
1. Understand the Dataset:

   - Load the data and display the first few rows.
   - Check for null values and handle them appropriately (e.g., remove rows, impute values, etc.).
   - Identify the numerical and categorical features.
2. Explore Key Statistics:

   - Use .describe() and .info() to summarize the dataset.
   - Visualize the distribution of key metrics (e.g., popularity, streams, duration) using histograms or box plots.
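A minimal sketch of the null-handling and feature-type steps above. The `toy` frame and its values are made up for illustration; only the column names echo the real dataset. Median imputation and the "Unknown" placeholder are one possible choice, not the only reasonable one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration).
toy = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "key": ["C#", None, "F", "C#"],
    "bpm": [120.0, np.nan, 95.0, 140.0],
    "streams": [1000, 2500, 400, 3200],
})

# Split columns by dtype: numeric columns vs. everything else.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Impute instead of dropping rows: median for numbers, a placeholder for text.
imputed = toy.copy()
imputed[numerical_cols] = imputed[numerical_cols].fillna(imputed[numerical_cols].median())
imputed[categorical_cols] = imputed[categorical_cols].fillna("Unknown")

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
print("Remaining nulls:", int(imputed.isnull().sum().sum()))
```

Dropping rows with `dropna()` is simpler, but imputing keeps every song in the sample, which matters when the dataset is small.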

Task 2: Feature Engineering
1. Correlation Analysis:

   - Identify which features are most correlated with song popularity or streams.
   - Create a heatmap of the correlation matrix.
2. Create New Features:

   - Derive features like "average streams per day" or "duration in minutes."
   - Group songs by genre or artist and calculate aggregated metrics like average popularity per artist.
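The derived features above could be sketched as follows. The data is invented, `duration_ms` and `artist` are hypothetical column names, and the snapshot date used to count days since release is an assumption chosen only to make the example concrete.

```python
import pandas as pd

# Invented rows; duration_ms and artist are hypothetical column names.
toy = pd.DataFrame({
    "artist": ["X", "X", "Y"],
    "streams": [3_000_000, 1_000_000, 500_000],
    "duration_ms": [180_000, 240_000, 210_000],
    "released_year": [2023, 2022, 2023],
    "released_month": [1, 6, 3],
    "released_day": [1, 15, 10],
})

# Assumed date on which the stream counts were collected.
snapshot = pd.Timestamp("2023-12-31")

# pd.to_datetime can assemble dates from columns named year/month/day.
release_date = pd.to_datetime(
    toy[["released_year", "released_month", "released_day"]].rename(
        columns={"released_year": "year", "released_month": "month", "released_day": "day"}
    )
)

# Days since release, floored at 1 to avoid division by zero.
days_out = (snapshot - release_date).dt.days.clip(lower=1)
toy["streams_per_day"] = toy["streams"] / days_out
toy["duration_min"] = toy["duration_ms"] / 60_000

# Aggregated metrics per artist.
per_artist = toy.groupby("artist")["streams"].agg(["mean", "count"])
print(toy[["artist", "streams_per_day", "duration_min"]])
print(per_artist)
```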

Task 3: Predictive Modeling
1. Regression Task: Predict Song Popularity

   - Define the target variable: popularity.
   - Select features like streams, duration, genre, or any other relevant attributes.
   - Train a regression model (e.g., Linear Regression, Random Forest Regressor, or XGBoost).
   - Evaluate performance using metrics like RMSE and R².
2. Classification Task: Identify Hit Songs

   - Create a binary classification label: A "hit song" is one with popularity above a threshold (e.g., 80).
   - Use features like streams, release year, and duration to train a classification model (e.g., Logistic Regression, Decision Tree, or SVM).
   - Evaluate using metrics like accuracy, precision, and recall.
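The classification task above might look like the sketch below. Everything here is synthetic: the `popularity` score is generated from log-streams plus noise so that some songs cross the example threshold of 80, and `duration_ms` is a hypothetical column. It illustrates the workflow, not results on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic data: popularity rises with log-streams, plus noise.
log_streams = rng.normal(13, 1, size=n)
toy = pd.DataFrame({
    "streams": np.exp(log_streams),
    "duration_ms": rng.normal(200_000, 30_000, size=n),
})
toy["popularity"] = (70 + 10 * (log_streams - 13) + rng.normal(0, 5, size=n)).clip(0, 100)

# Binary label: a "hit" has popularity above the threshold.
toy["is_hit"] = (toy["popularity"] > 80).astype(int)

X = pd.DataFrame({"log_streams": np.log(toy["streams"]), "duration_ms": toy["duration_ms"]})
y = toy["is_hit"]

# stratify keeps the hit / non-hit ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is fitted on the training split only.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")
```

With a high threshold, hits are the minority class, so accuracy alone can look good even for a model that never predicts a hit; precision and recall are the more informative pair.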

Task 4: Clustering
1. Group Similar Songs:
   - Use unsupervised learning (e.g., K-Means or DBSCAN) to cluster songs based on attributes like tempo, danceability, and energy.
   - Visualize the clusters using PCA or t-SNE to reduce dimensions.
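A small sketch of the clustering step, assuming K-Means with two clusters and PCA for the 2-D projection. The two groups of songs are synthetic and deliberately well separated; on real data the number of clusters would need to be chosen, for example with an elbow plot or silhouette scores.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic groups of songs; columns stand for [bpm, danceability_%, energy_%].
slow = rng.normal([80, 40, 35], [5, 5, 5], size=(50, 3))
fast = rng.normal([140, 75, 80], [5, 5, 5], size=(50, 3))
X = np.vstack([slow, fast])

# K-Means is distance-based, so put all features on the same scale first.
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Project to two dimensions for plotting (e.g. plt.scatter(coords[:, 0], coords[:, 1], c=labels)).
coords = PCA(n_components=2).fit_transform(X_scaled)
print(coords.shape, np.bincount(labels))
```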

Task 5: Recommendation System
1. Content-Based Filtering:

   - Build a recommendation system that suggests songs based on a user’s preferences for specific attributes like genre, tempo, or energy.
2. Collaborative Filtering:

   - If the dataset includes user data (e.g., user IDs and song IDs), implement a collaborative filtering algorithm to recommend songs to users.
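Content-based filtering can be sketched with cosine similarity over scaled audio features. The four songs and their values are invented, and `recommend` is a hypothetical helper written for this example. Collaborative filtering is not shown because it requires user-song interaction data.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

# Invented songs described by three audio attributes.
songs = pd.DataFrame(
    {
        "bpm": [120, 122, 70, 128],
        "energy_%": [80, 78, 30, 85],
        "danceability_%": [75, 70, 40, 80],
    },
    index=["Song A", "Song B", "Song C", "Song D"],
)

# Bring every attribute into [0, 1] so bpm does not dominate the similarity.
features = MinMaxScaler().fit_transform(songs)

# Pairwise cosine similarity between all songs.
similarity = pd.DataFrame(
    cosine_similarity(features), index=songs.index, columns=songs.index
)

def recommend(title, top_n=2):
    """Return the top_n songs most similar to `title` (hypothetical helper)."""
    scores = similarity[title].drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()

print(recommend("Song A"))
```

A user profile could be handled the same way: average the feature vectors of the songs the user likes, then rank all songs by similarity to that average.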

Exploratory Data Analysis (EDA):

Step 1: Load and Preview the Data
  - Load the dataset and examine its structure
  - Check basic information
  
Step 2: Check for Missing Values
  - Identify and handle missing data

Step 3: Analyze Data Distributions
  - Visualize numerical and categorical data distributions

Step 4: Correlation Analysis
  - Analyze relationships between numerical features
  - Look for features that have strong positive or negative correlations with the target variable (e.g., popularity)
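One way to rank features by their correlation with a target is sketched below. The data is synthetic, with `streams` standing in as the target and built to depend on `in_spotify_playlists`; `numeric_only=True` keeps text columns out of the calculation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200

# Synthetic data: streams is constructed to depend on in_spotify_playlists only.
toy = pd.DataFrame({
    "in_spotify_playlists": rng.normal(size=n),
    "bpm": rng.normal(size=n),
})
toy["streams"] = 3 * toy["in_spotify_playlists"] + rng.normal(scale=0.5, size=n)
toy["track_name"] = "x"  # text column, ignored by numeric_only=True

# Correlation of every numeric feature with the target, strongest first.
corr_with_target = (
    toy.corr(numeric_only=True)["streams"]
    .drop("streams")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative correlations near the top, where a plain descending sort would push them to the bottom.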

Step 5: Feature Exploration
  - Identify outliers using boxplots
  - Analyze trends
  
Step 6: Data Preparation for ML
  - Convert categorical features (e.g., genre) to numerical using one-hot encoding or label encoding
  - Scale numerical features for ML models
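Encoding and scaling can be combined in a single preprocessing object, as in this sketch with scikit-learn's `ColumnTransformer`. The rows are invented. The transformer is fitted on the training split only, so no information from the test split leaks into the scaling statistics.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows with one categorical and two numerical features.
toy = pd.DataFrame({
    "mode": ["Major", "Minor", "Major", "Minor", "Major", "Major", "Minor", "Major"],
    "bpm": [120, 95, 140, 100, 128, 110, 90, 150],
    "energy_%": [80, 40, 85, 55, 70, 65, 35, 90],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["bpm", "energy_%"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["mode"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training data only, then apply the same transformation to the test data.
train_prepared = preprocess.fit_transform(train)
test_prepared = preprocess.transform(test)
print(train_prepared.shape, test_prepared.shape)
```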

Next Steps: Machine Learning Models
Once EDA is complete, you can proceed with these tasks:

  - Regression Task: Predict popularity using features like streams, duration_ms, and tempo.
  - Classification Task: Classify songs as "hit" (popularity > 80) or "non-hit" (popularity ≤ 80).
  - Clustering: Group songs into clusters based on features like danceability, energy, and tempo.

Machine Learning model:

Step 1: Define the Problem
  - We aim to predict the popularity of songs using features such as streams, duration_ms, and tempo.

Step 2: Prepare the Dataset
  - Select Features and Target Variable
  - Split the Dataset: Divide the dataset into training and testing sets

Step 3: Train the Linear Regression Model
  - Import the Model
  - Initialize and Fit the Model

Step 4: Make Predictions
  - Use the trained model to predict the popularity of songs in the test set

Step 5: Evaluate the Model
  Assess how well your regression model performs using evaluation metrics:
  - R² Score
  - Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
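To make these metrics concrete, here is a worked example on four made-up values, computed by hand with NumPy. It uses the standard definitions: MSE is the mean of the squared residuals, RMSE is its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred                      # [1, 0, -1, 0]

mse = np.mean(residuals ** 2)                    # (1 + 0 + 1 + 0) / 4 = 0.5
rmse = np.sqrt(mse)                              # sqrt(0.5), about 0.707

ss_res = np.sum(residuals ** 2)                  # 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # mean is 6, so 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot                         # 1 - 2/20 = 0.9

print(f"MSE: {mse}, RMSE: {rmse:.3f}, R²: {r2}")
```

RMSE is in the same units as the target, which makes it easier to interpret than MSE. R² is unitless and compares the model against always predicting the mean.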

Step 6: Visualize Results
  - Compare Predictions vs. Actual Values
  - Residual Plot: Visualize errors in predictions

Next Steps
If the performance isn't satisfactory:

  - Feature Engineering: Add or transform features (e.g., log-transform streams or normalize features).
  - Regularization: Use models like Ridge or Lasso regression to reduce overfitting.
  - Experiment: Try other regression models like Decision Trees, Random Forests, or XGBoost.
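A rough sketch of how these options could be compared, assuming a skewed, streams-like target that is log-transformed before fitting. The data is synthetic and the `alpha` values are arbitrary starting points, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 300

# Synthetic features; only the first two actually influence the target.
X = rng.normal(size=(n, 5))
log_y = 12 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.3, size=n)
y = np.exp(log_y)  # heavily right-skewed, like raw stream counts

# Log-transform the target so a linear model fits it better.
y_log = np.log1p(y)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),   # L2 penalty shrinks coefficients
    "lasso": Lasso(alpha=0.01),  # L1 penalty can zero out weak features
}

# 5-fold cross-validated R² for each model.
scores = {
    name: cross_val_score(model, X, y_log, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: mean R² = {score:.3f}")
```

Predictions made on the log scale can be converted back to the original scale with `np.expm1`.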


In [3]:
import kagglehub
import pandas as pd
import numpy

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

ModuleNotFoundError: No module named 'kagglehub'

In [None]:
path = kagglehub.dataset_download("nelgiriyewithana/top-spotify-songs-2023")
print("Path to dataset files:", path)

In [None]:
# Build the file path from the download location instead of hard-coding a user-specific path
import os

df = pd.read_csv(os.path.join(path, "spotify-2023.csv"), encoding='ISO-8859-1')
df.head()

In [None]:
#Checking basic information (data types and null values)
df.info()
#summary of statistics for numerical columns
df.describe().round(1)

In [None]:
#auto adjusting the Dtype if required
df = df.infer_objects()
df.info()

In [None]:
missing_values = df.isnull().sum()
print(missing_values)

In [None]:
df = df.dropna()
df.info()

In [None]:
#Histogram of a numerical column
df['bpm'].hist()
plt.title('BPM Distribution')
plt.xlabel('bpm')
plt.ylabel('Frequency')
plt.show()

In [None]:
#Bar plot of chart presence by release year
x = df['released_year']
y = df['in_spotify_charts']
plt.title('Charts by year released')
plt.xlabel('released_year')
plt.ylabel('in_spotify_charts')
plt.bar(x, y, align='center', width=0.8)

plt.show()

In [None]:
#Plotting two different scatter plots side by side in the same figure
#(separate axes, so the second title and labels do not overwrite the first)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

#plot 1
x = df['bpm']
y = df['in_spotify_charts']

print("Min: ", y.min(), "Max: ", y.max())
ax1.scatter(x, y, color='blue')
ax1.set_title('Relation between bpm and in_spotify_charts')
ax1.set_xlabel('bpm')
ax1.set_ylabel('in_spotify_charts')

#plot 2
x = df['bpm']
y = df['energy_%']

print("Min: ", y.min(), "Max: ", y.max())
ax2.scatter(x, y, color='green')
ax2.set_title('Relation between bpm and energy_%')
ax2.set_xlabel('bpm')
ax2.set_ylabel('energy_%')

plt.tight_layout()
plt.show()

In [None]:
#Correlation matrix

# Double brackets select a list of columns; df['bpm','energy_%'] would raise a KeyError
correlation_matrix = df[['bpm', 'energy_%']].corr()

#Heatmap of the correlation matrix
plt.figure(figsize=(10,8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

#Need to clean the data and convert it into integers or floats first,
# then we can compare which features have the best correlation with each other.

In [None]:
#identify outliers using boxplots
sns.boxplot(x=df['bpm'])
plt.title('Boxplot of bpm')
plt.show()

In [None]:
#Scatter plot to explore relationships

sns.scatterplot(data=df[['streams','bpm']], x='streams', y='bpm')
plt.title('Streams vs BPM')
plt.show()

In [None]:
#Data preparation for ML
#one hot encoding
df = pd.get_dummies(df, columns=['mode'], drop_first=True)
#get_dummies replaces 'mode' with prefixed dummy columns, so select them by prefix
print(df.filter(like='mode_'))

In [None]:
scaler = StandardScaler()
# Scale the feature columns used by the model below; every listed column must exist in df
numerical_columns = ['bpm', 'energy_%']
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

In [None]:
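# ML model

# Target variable: streams is used here as a stand-in for popularity.
# Coerce to numeric in case the column was read as text; unparseable values become NaN and are dropped.
df['streams'] = pd.to_numeric(df['streams'], errors='coerce')
df = df.dropna(subset=['streams'])
y = df['streams']

# Feature columns
X = df[['bpm','energy_%']]

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)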

In [None]:
# Initialize the model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [None]:
# Calculate R² score to evaluate the model
r2 = r2_score(y_test, y_pred)
print(f'R² Score: {r2}')

In [None]:
# Calculate MSE and RMSE
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
print(f'MSE: {mse}, RMSE: {rmse}')

In [None]:
# Scatter plot for actual vs predicted values (the target here is streams)
plt.scatter(y_test, y_pred, alpha=0.3, color='red')
plt.xlabel('Actual Streams')
plt.ylabel('Predicted Streams')
plt.title('Actual vs Predicted Streams')
plt.show()

In [None]:
residuals = y_test - y_pred

# Plot residuals
plt.hist(residuals, bins=20, color='blue', edgecolor='black')
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residuals Distribution')
plt.show()

