# Music Data Analysis: A Deep Dive into Song Characteristics

## 1. Introduction

In this notebook, we will conduct a comprehensive exploratory data analysis (EDA) of a music dataset. Our primary goals are to understand the factors that contribute to a song's popularity, explore the relationships between different audio features, and build a simple model to predict popularity. We will cover the following steps:

*   **Data Loading and Cleaning:** We'll start by loading the dataset and preparing it for analysis by handling missing values, duplicates, and irrelevant columns.
*   **Exploratory Data Analysis (EDA):** We'll use visualizations to explore the distributions of key variables, analyze the characteristics of different genres, and uncover correlations between audio features.
*   **Predictive Modeling:** We'll build a linear regression model to predict song popularity based on its audio features and evaluate its performance.

### 1.1. Importing Libraries

First, let's import the necessary libraries for our analysis.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Set plot style
sns.set_style('whitegrid')

## 2. Data Loading and Cleaning

Now, we'll load the dataset and perform some initial cleaning to ensure our data is ready for analysis.

In [5]:
df = pd.read_csv('dataset.csv', encoding='latin1')

ParserError: Error tokenizing data. C error: Expected 1 fields in line 15, saw 2
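
The `ParserError` above means pandas inferred a single column from the first rows and then found two fields on line 15. This often happens when the file begins with non-tabular lines (a title, an export banner, or a web page saved as `.csv`) or uses a delimiter other than a comma; the message alone does not tell us which. Below is a minimal diagnostic sketch that uses a small made-up text sample in place of `dataset.csv`. The sample text, `peek_lines`, and `header_row` are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

def peek_lines(source, n=20):
    """Return the first n raw lines of a text source without parsing them."""
    lines = []
    for i, line in enumerate(source):
        if i >= n:
            break
        lines.append(line.rstrip("\n"))
    return lines

# Hypothetical stand-in for a file that has two banner lines before the real header.
sample = "Exported report\nGenerated by tool\ntrack_id,popularity\na1,50\nb2,70\n"

preview = peek_lines(io.StringIO(sample), n=5)
for number, line in enumerate(preview, start=1):
    print(number, repr(line))

# Treat the first line containing a comma as the header row (an assumption
# that only holds when the banner lines contain no commas).
header_row = next(i for i, line in enumerate(preview) if "," in line)
demo_df = pd.read_csv(io.StringIO(sample), skiprows=header_row)
print(demo_df)
```

For the real file, the same idea would be to open `dataset.csv` with `encoding='latin1'`, look at the first 20 or so lines, and then pass a suitable `skiprows` or `sep` to `pd.read_csv`. The cells below assume `df` has been loaded successfully.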


### 2.1. Initial Data Inspection

Let's start by looking at the first few rows of the dataframe, its shape, and some basic information about the columns.

In [None]:
df.head()

In [None]:
df.info()

### 2.2. Cleaning the Data

From the initial inspection, we can see an `Unnamed: 0` column, which appears to be an index. We'll drop this column as it's redundant. We also need to check for any missing values and duplicates.

In [None]:
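# Drop the redundant index column (errors='ignore' avoids a KeyError if it is absent)
df = df.drop(columns='Unnamed: 0', errors='ignore')

# Check for missing values in each column
print(f"Missing values per column:\n{df.isnull().sum()}")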

# Check for duplicate tracks
print('Number of duplicate tracks:', df.duplicated(subset=['track_id']).sum())

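Where the checks above report missing values or duplicate tracks, we remove them to clean up the dataset. Note that deduplicating on `track_id` keeps only the first row for each track, so a track listed under several genres will count toward only one of them in the genre analysis below.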

In [None]:
# Remove duplicates and missing values
df.dropna(inplace=True)
df.drop_duplicates(subset=['track_id'], inplace=True)

# Verify the cleaning
print('Missing values after cleaning:', df.isnull().sum().sum())
print('Number of duplicate tracks after cleaning:', df.duplicated(subset=['track_id']).sum())
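
To see how much data each cleaning step removes, it can help to count rows before and after. The sketch below does this on a small made-up dataframe; `cleaning_report` and the `toy` data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def cleaning_report(frame, key):
    """Count rows removed by dropna and then by de-duplicating on `key`."""
    after_na = frame.dropna()
    after_dupes = after_na.drop_duplicates(subset=[key])
    return {
        "rows_start": len(frame),
        "dropped_missing": len(frame) - len(after_na),
        "dropped_duplicates": len(after_na) - len(after_dupes),
        "rows_end": len(after_dupes),
    }

# Made-up data: track "a" appears under two genres, track "c" has a missing genre.
toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "c", "d"],
    "track_genre": ["pop", "rock", "pop", None, "jazz"],
    "popularity": [50, 50, 70, 20, 35],
})

report = cleaning_report(toy, key="track_id")
print(report)
```

In this toy example, track `a` survives only under its first genre, which is the side effect of deduplicating on `track_id` alone.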

## 3. Exploratory Data Analysis (EDA)

With our data cleaned, we can now dive into the exploratory analysis. We'll start by looking at the distribution of song popularity.

### 3.1. Distribution of Popularity

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['popularity'], bins=30, kde=True)
plt.title('Distribution of Song Popularity')
plt.xlabel('Popularity')
plt.ylabel('Frequency')
plt.show()

The popularity score is fairly evenly distributed, with a slight skew towards less popular songs. Now, let's see which genres are the most popular.
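
A histogram can be read more reliably alongside a few summary numbers. As a hedged sketch on made-up values (the real `df['popularity']` column would take the place of `toy_popularity`), `describe()` gives the quartiles and `skew()` gives the sample skewness; a positive value means a longer tail toward high scores, with most songs clustered at the low end.

```python
import pandas as pd

# Made-up popularity scores, used only to illustrate the calls.
toy_popularity = pd.Series([0, 0, 5, 10, 20, 35, 50, 90], name="popularity")

summary = toy_popularity.describe()        # count, mean, std, min, quartiles, max
skewness = toy_popularity.skew()           # > 0: long tail toward high popularity
zero_share = (toy_popularity == 0).mean()  # fraction of tracks with a score of 0

print(summary)
print(f"Skewness: {skewness:.2f}")
print(f"Share of zero-popularity tracks: {zero_share:.0%}")
```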

### 3.2. Top 10 Most Popular Genres

In [None]:
top_genres = df.groupby('track_genre')['popularity'].mean().sort_values(ascending=False).head(10)

plt.figure(figsize=(12, 7))
sns.barplot(x=top_genres.values, y=top_genres.index, palette='viridis')
plt.title('Top 10 Most Popular Genres')
plt.xlabel('Average Popularity')
plt.ylabel('Genre')
plt.show()

Pop, rock, and dance-related genres seem to dominate the top of the popularity charts. Because the ranking uses the mean popularity per genre, it describes the typical track in each genre rather than individual hits.
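
A mean can be misleading when a genre contains only a handful of tracks, so it is worth looking at the track count next to the average. The sketch below shows one way to do that with named aggregation on a made-up dataframe; the `toy` data and the `genre_stats` name are assumptions for illustration.

```python
import pandas as pd

# Made-up data: "tango" has the highest mean but only a single track.
toy = pd.DataFrame({
    "track_genre": ["pop", "pop", "pop", "rock", "rock", "tango"],
    "popularity": [60, 70, 80, 50, 60, 90],
})

genre_stats = (
    toy.groupby("track_genre")["popularity"]
    .agg(mean_popularity="mean", n_tracks="count")
    .sort_values("mean_popularity", ascending=False)
)
print(genre_stats)
```

Here `tango` ranks first on the strength of one track, which is the kind of case the count column makes visible.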

Next, let's examine the correlation between the different audio features.

### 3.3. Correlation Matrix of Audio Features

In [None]:
audio_features = ['popularity', 'danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
correlation_matrix = df[audio_features].corr()

plt.figure(figsize=(12, 8))
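# A diverging colormap centered at 0 makes positive and negative correlations easy to tell apart
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1, center=0, fmt='.2f')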
plt.title('Correlation Matrix of Audio Features')
plt.show()

From the heatmap, we can see some interesting correlations:

*   `energy` and `loudness` are strongly positively correlated, which makes sense as high-energy songs are often louder.
*   `acousticness` and `energy` are strongly negatively correlated, meaning acoustic tracks tend to be lower in energy.
*   `popularity` has a moderate positive correlation with `loudness` and a slight positive correlation with `energy`. This suggests that louder, more energetic songs tend to be more popular, although correlations of this size leave most of the variation in popularity unexplained.
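
Reading the strongest relationships off a heatmap by eye is error-prone, so the pairs can also be ranked numerically. The following is a minimal sketch on made-up columns; `strongest_pairs` is a hypothetical helper, and with the real data `correlation_matrix` would be passed in instead of `toy_corr`.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr, top_n=5):
    """Return up to top_n feature pairs ranked by absolute correlation."""
    # Keep the upper triangle only (k=1 excludes the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(top_n)

# Made-up columns: "b" is almost a multiple of "a", "c" is unrelated noise.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
toy_corr = toy.corr()
top_pairs = strongest_pairs(toy_corr)
print(top_pairs)
```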

## 4. Predictive Modeling

Now, let's build a simple linear regression model to predict a song's popularity based on its audio features. This will help us understand how much of a song's popularity can be explained by these features.

### 4.1. Feature Selection and Data Splitting

In [None]:
features = ['danceability', 'energy', 'loudness', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo']
X = df[features]
y = df['popularity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 4.2. Training the Model

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
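
Once the model is fitted, its coefficients show the direction and size of each feature's estimated effect. The sketch below pairs `coef_` with the column names on made-up data where the true relationship is known; `toy_X`, `toy_y`, and `toy_model` are assumptions for illustration, and with the real data `model` and `features` would be used instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data generated as y = 3*x1 - 2*x2 + 5.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
toy_y = 3 * toy_X["x1"] - 2 * toy_X["x2"] + 5

toy_model = LinearRegression().fit(toy_X, toy_y)

# Label each coefficient with its feature name and sort for readability.
coefficients = pd.Series(toy_model.coef_, index=toy_X.columns).sort_values()
print(coefficients)
print(f"Intercept: {toy_model.intercept_:.2f}")
```

Raw coefficients are not directly comparable when features sit on different scales (for example, `tempo` versus `danceability`), so standardizing the features first may be worth considering before ranking them.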

### 4.3. Model Evaluation

In [None]:
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')

The R-squared value is quite low, which indicates that our model explains little of the variance in song popularity. This suggests that while audio features have some influence, a linear combination of them is not enough: the relationships may be non-linear, and factors not included in our model (like artist fame, marketing, and cultural trends) likely play a much larger role in determining a song's popularity.
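
R-squared measures how much the model improves on a constant prediction, so it helps to compute that baseline explicitly and to report the error in popularity units. The sketch below does both with made-up numbers. It uses the training-set mean as the constant, which is closely related to, but not identical to, what `r2_score` computes (that function uses the mean of the test targets). All names and values here are illustrative assumptions.

```python
import numpy as np

def mse(y_true, y_hat):
    """Mean squared error between two equal-length sequences."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_hat)) ** 2))

# Made-up targets and predictions standing in for y_train, y_test, and y_pred.
toy_train = np.array([10.0, 20.0, 30.0, 40.0])
toy_test = np.array([15.0, 25.0, 35.0])
toy_pred = np.array([18.0, 24.0, 33.0])

# Baseline: always predict the mean popularity seen in training.
baseline_pred = np.full_like(toy_test, toy_train.mean())

model_mse = mse(toy_test, toy_pred)
baseline_mse = mse(toy_test, baseline_pred)
model_rmse = model_mse ** 0.5  # same units as the popularity score

print(f"Model MSE: {model_mse:.2f}  (RMSE: {model_rmse:.2f})")
print(f"Baseline MSE: {baseline_mse:.2f}")
```

If the model's error is barely below the baseline's, the audio features are adding little beyond the average popularity.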

## 5. Conclusion

In this analysis, we cleaned the music dataset, explored the relationships between various audio features, and built a simple predictive model. Our key findings include:

*   **Data Quality:** The initial dataset was relatively clean but required some preprocessing to handle missing values, duplicate tracks, and a redundant index column.
*   **Popular Music Trends:** Popular songs tend to be higher in energy and loudness, and often fall into genres like pop, rock, and dance.
*   **Feature Relationships:** We confirmed intuitive relationships between audio features, such as the strong positive correlation between energy and loudness.
*   **Predictive Power:** While audio features have some predictive power, they are not sufficient to fully explain song popularity, highlighting the importance of external factors.

This analysis provides a solid foundation for further investigation. Future work could involve incorporating more features (such as artist information or album data), exploring more complex models, or analyzing trends over time.