# **1. Business Understanding**

## 1. What is the business problem that you are trying to solve?

## 2. What data do you need to answer the above problem?

To answer these questions, we need data that includes:
- **Song metadata:** Track name, artist, album, and release type.
- **Spotify streaming data:** Number of streams, danceability, energy, loudness, speechiness, tempo, acousticness, instrumentalness, etc.
- **YouTube performance metrics:** Views, likes, comments, and official video status.
- **Musical characteristics:** Attributes like key, liveness, and valence to understand the song’s emotional and structural appeal.




## 3. What are the different sources of data?

Dataset is taken from Kaggle and it combines data from **Spotify** and **YouTube**. The data is likely collected using:
- **Spotify API**
- **YouTube API**
- **Web scraping or third-party repositories**

****



## 4. What kind of analytics task are you performing?

This project involves:
- **Exploratory Data Analysis (EDA):** Understanding patterns and trends in song popularity across Spotify and YouTube.
- **Feature Engineering:** Identifying key characteristics that impact song performance.
- **Predictive Modeling:** Building a machine learning model to classify whether a song is more likely to be a **Spotify Hit** or a **YouTube Hit**.
- **Clustering:** Grouping songs based on similar characteristics to understand common patterns in song popularity.

# **Importing Packages**

In [16]:
import pandas as pd
from scipy.stats import zscore
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_selection import mutual_info_classif, chi2
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score

# **2. Data Acquisition**

For the problem identified, we will use the **Spotify and YouTube Dataset** that contains various statistics of songs from both platforms.

---



## 2.1 Download the data directly



In [13]:
# Download dataset from Kaggle
!kaggle datasets download -d salvatorerastelli/spotify-and-youtube

# Unzip the dataset
!unzip -o spotify-and-youtube.zip

print("Dataset successfully downloaded!")

'kaggle' is not recognized as an internal or external command,
operable program or batch file.


Dataset successfully downloaded!


'unzip' is not recognized as an internal or external command,
operable program or batch file.


## 2.2 Code for converting the above downloaded data into a dataframe

In [15]:
# Load the dataset
df = pd.read_csv("Spotify_Youtube.csv")
print("Dataset successfully loaded!")

FileNotFoundError: [Errno 2] No such file or directory: 'Spotify_Youtube.csv'

## 2.3 Confirm the data has been correctly by displaying the first 5 and last 5 records.

In [None]:
# Display the first 5 rows
print("First 5 records:")
display(df.head())

# Display the last 5 rows
print("\nLast 5 records:")
display(df.tail())

## 2.4 Display the column headings, statistical information, description and statistical summary of the data.

In [None]:
# Display column names
print("Column Names:")
print(df.columns)

# Show basic information about dataset (data types, missing values, etc.)
print("\nDataset Info:")
df.info()

# Display statistical summary for numerical columns
print("\nStatistical Summary:")
display(df.describe())

# Display summary for categorical columns
print("\nCategorical Data Summary:")
display(df.describe(include=['object']))

# 2.5 Observations from the Dataset
#### **1. Size of the Dataset**
- The dataset contains **`20718` rows** and **`28` columns**.

#### **2. Types of Data Attributes**

**Data types of attributes:**

float64 -   15,
object  -   12,
int64    -   1

Name: count, dtype: int64

The dataset consists of **numerical** and **categorical** attributes.

**Numerical Attributes (Continuous & Discrete)**  
These represent measurable quantities:
- `streams` → Number of streams on Spotify  
- `views` → Number of views on YouTube  
- `likes` → Number of likes on YouTube  
- `comments` → Number of comments on YouTube  
- `danceability` → Suitability for dancing (0 to 1)  
- `energy` → Measure of intensity (0 to 1)  
- `loudness` → Overall loudness (in dB)  
- `speechiness` → Presence of spoken words (0 to 1)  
- `acousticness` → Likelihood of being acoustic (0 to 1)  
- `instrumentalness` → Likelihood of being instrumental (0 to 1)  
- `liveness` → Probability of a live performance (0 to 1)  
- `valence` → Musical positivity (0 to 1)  
- `tempo` → Tempo of the track (beats per minute)  
- `duration_ms` → Track duration (in milliseconds)  

**Categorical Attributes (Nominal & Ordinal)**  
These represent categories or labels:
- `track` → Name of the song  
- `artist` → Name of the artist  
- `album` → Name of the album  
- `album_type` → Single or Album  
- `channel` → Name of the YouTube channel  
- `official_video` → Boolean (True/False) indicating if it's the official video  
- `url_spotify` → Spotify URL of the track  
- `url_youtube` → YouTube URL of the video  
- `description` → Description of the YouTube video  
- `licensed` → Boolean indicating if the video is officially licensed  

---

#### **3. Missing Data Check**

The table below shows the number of missing values for each column:

| **Feature**            | **Missing Values** |
|------------------------|-------------------|
| Danceability          | 2   |
| Energy                | 2   |
| Key                   | 2   |
| Loudness              | 2   |
| Speechiness           | 2   |
| Acousticness          | 2   |
| Instrumentalness      | 2   |
| Liveness              | 2   |
| Valence               | 2   |
| Tempo                 | 2   |
| Duration_ms           | 2   |
| Url_youtube           | 470 |
| Title                 | 470 |
| Channel               | 470 |
| Views                 | 470 |
| Likes                 | 541 |
| Comments              | 569 |
| Description           | 876 |
| Licensed              | 470 |
| Official_video        | 470 |
| Stream               | 576 |

#### **🔹 Percentage of Missing Values**
The following table displays the percentage of missing values for each column:

| **Feature**            | **Missing %** |
|------------------------|--------------|
| Danceability          | 0.0097%  |
| Energy                | 0.0097%  |
| Key                   | 0.0097%  |
| Loudness              | 0.0097%  |
| Speechiness           | 0.0097%  |
| Acousticness          | 0.0097%  |
| Instrumentalness      | 0.0097%  |
| Liveness              | 0.0097%  |
| Valence               | 0.0097%  |
| Tempo                 | 0.0097%  |
| Duration_ms           | 0.0097%  |
| Url_youtube           | 2.27%  |
| Title                 | 2.27%  |
| Channel               | 2.27%  |
| Views                 | 2.27%  |
| Likes                 | 2.61%  |
| Comments              | 2.75%  |
| Description           | 4.23%  |
| Licensed              | 2.27%  |
| Official_video        | 2.27%  |
| Stream               | 2.78%  |


In [None]:
# 1. Size of the dataset
num_rows, num_cols = df.shape
print(f"Dataset contains {num_rows} rows and {num_cols} columns.\n")

# 2. Data types of attributes
print("Data types of attributes:")
print(df.dtypes.value_counts(), "\n")  # Count of different data types

# 3. Checking for missing values
missing_values = df.isnull().sum()
missing_values = missing_values[missing_values > 0]  # Only show columns with missing values

if missing_values.empty:
    print("No missing values in the dataset.")
else:
    print("Missing values in the dataset:")
    print(missing_values)

    # Percentage of missing values
    missing_percentage = (missing_values / len(df)) * 100
    print("\nPercentage of missing values:")
    print(missing_percentage)



# 3. Data Preparation

## 3.1 Check for

* duplicate data
* missing data
* data inconsistencies


In [None]:
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Total Duplicate Rows: {duplicate_count}")

# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100
print(f"Missing values:, {missing_values}")
print(f"Missing percentage:, {missing_percentage}")

print('\n Check for inconsistent data')
# Identify columns where negative values should not exist
num_cols_to_check = ['Views', 'Likes', 'Comments', 'Duration_ms', 'Stream']

# Filter rows with negative values
inconsistent_rows = df[(df[num_cols_to_check] < 0).any(axis=1)]

# Display inconsistent rows
print(f"Rows with Negative Values:\n{inconsistent_rows}")

# Checking unique values in categorical columns
cat_cols_to_check = ['official_video', 'Licensed']

for col in cat_cols_to_check:
    print(f"Unique values in {col}: {df[col].unique()}")

# Compute Z-scores for numerical columns
z_scores = df[num_cols_to_check].apply(zscore)

# Define a threshold (e.g., |z| > 3 indicates a potential outlier)
outliers = df[(z_scores > 3).any(axis=1)]

print(f"Potential Outliers (Z-score > 3):\n{outliers}")


## 3.2 Apply techiniques
* to remove duplicate data
* to impute or remove missing data
* to remove data inconsistencies


In [None]:
# Remove duplicate rows
df = df.drop_duplicates()
print("Duplicates removed successfully!")

# Calculate missing values percentage
missing_percentage = df.isnull().sum() / len(df) * 100
print("Missing Data Percentage:\n", missing_percentage)

# Define threshold: If more than 30% of data is missing, drop the column
missing_threshold = 0.3 * len(df)
df = df.dropna(thresh=missing_threshold, axis=1)

# Impute missing numerical data with median
num_cols = df.select_dtypes(include=['number']).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Impute missing categorical data with mode
cat_cols = df.select_dtypes(include=['object']).columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

print("Missing data handled successfully!")

# Replace negative values with NaN (for later imputation)
num_cols_to_check = ['Views', 'Likes', 'Comments', 'Duration_ms', 'Stream']
df[num_cols_to_check] = df[num_cols_to_check].apply(lambda x: x.mask(x < 0, None))

# Impute the newly created NaNs with the median
df[num_cols_to_check] = df[num_cols_to_check].fillna(df[num_cols_to_check].median())

print("Negative values corrected!")

# Convert categorical values to lowercase
df['official_video'] = df['official_video'].astype(str).str.lower()
df['Licensed'] = df['Licensed'].astype(str).str.lower()

print("Categorical inconsistencies fixed!")

print("Final dataset shape:", df.shape)
print("Final missing values:\n", df.isnull().sum().sum())

## 3.3 Encode categorical data

In [None]:
# Only for text data

## 3.4 Report

Mention and justify the method adopted
* to remove duplicate data, if present
* to impute or remove missing data, if present
* to remove data inconsistencies, if present

OR for textdata
* How many tokens after step 3?
* how may tokens after stop words filtering?

If the any of the above are not present, then also add in the report below.

Score: 2 Marks (based on the dataset you have, the data prepreation you had to do and report typed, marks will be distributed between 3.1, 3.2, 3.3 and 3.4)

#### **1. Handling Duplicate Data**  
- **Method Used:** Removed duplicate rows using `df.drop_duplicates()`.  
- **Justification:**  
  - Prevents redundant data that could skew analysis.  
  - Ensures unique and accurate data representation.  

<br>

#### **2. Handling Missing Data**  
- **Method Used:**  
  - Dropped columns with **>30% missing values** (`df.dropna(thresh=missing_threshold, axis=1)`).  
  - Imputed **numerical values with the median** (`df[num_cols].fillna(df[num_cols].median())`).  
  - Imputed **categorical values with the mode** (`df[cat_cols].fillna(df[cat_cols].mode().iloc[0])`).  

- **Justification:**  
  - Retains most of the data while eliminating excessive missing values.  
  - Median imputation prevents skewing by outliers.  
  - Mode imputation maintains categorical data consistency.  

<br>

#### **3. Handling Data Inconsistencies**  
- **Method Used:**  
  - Replaced **negative values** in numerical columns with `NaN`, then imputed them with the median.  
  - Standardized categorical variables by converting them to **lowercase**.  

- **Justification:**  
  - Negative values in attributes like **views, likes, comments, and duration** are logically incorrect.  
  - Standardizing text data ensures uniform representation for analysis.  

<br>

### **Final Observations**  
**Duplicates removed** – Ensures a clean dataset.  
**Missing values handled effectively** – Prevents loss of significant data.  
**Data inconsistencies corrected** – Enhances data integrity for further analysis.  
**Final dataset is cleaned and ready for exploratory data analysis (EDA) and modeling.**  


## 3.5 Identify the target variables.

* Separate the data from the target such that the dataset is in the form of (X,y) or (Features, Label)

* Discretize / Encode the target variable or perform one-hot encoding on the target or any other as and if required.

* Report the observations

Score: 1 Mark

In [None]:
# Identify the Target Variable
y = df['Views']

# Separate Features (X) from Target (y)
X = df.drop(columns=['Views'])
print(f".Features Shape: {X.shape}, Target Shape: {y.shape}")

# Identify categorical columns
categorical_cols = ['Key', 'Title', 'Channel', 'Licensed', 'official_video']

# Apply OneHotEncoding
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)

print(f"Shape after encoding: {X.shape}")

### **Observations:**
1. **Target Variable Selection:**  
   - The target variable chosen is **`Views`**, as we are likely trying to predict the number of views based on song attributes.

2. **Feature Separation:**  
   - The dataset is now split into **features (`X`)** and **target (`y`)**.  
   - Shape of `X`: `(num_samples, num_features)`  
   - Shape of `y`: `(num_samples,)`  

3. **Encoding Categorical Variables:**  
   - Columns like **`Key`**, **`Title`**, **`Channel`**, **`Licensed`**, and **`official_video`** were categorical.  
   - **One-Hot Encoding** was applied to convert them into numerical format.  
   - This increased the number of features in `X`.

4. **Impact of Encoding:**  
   - Some categorical columns may have a large number of unique values (e.g., `Title`, `Channel`), which may lead to **high dimensionality**.  
   - If necessary, we can **drop `Title` or `Channel`** if they do not contribute significantly to the model.

5. **Next Steps:**  
   - **Feature Scaling** may be required for numerical attributes.  
   - **Feature Selection** can help reduce dimensionality and improve model performance.

# 4. Data Exploration using various plots



## 4.1 Scatter plot of each quantitative attribute with the target.

Score: 1 Mark

In [None]:
# Define numerical features
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns

# Set up the plotting grid
plt.figure(figsize=(15, 20))
for i, feature in enumerate(numerical_features, 1):
    plt.subplot(5, 3, i)
    sns.scatterplot(x=X[feature], y=y, alpha=0.5)
    plt.xlabel(feature)
    plt.ylabel("Views")
    plt.title(f"{feature} vs Views")

plt.tight_layout()
plt.show()

## 4.2 EDA using visuals
* Use (minimum) 2 plots (pair plot, heat map, correlation plot, regression plot...) to identify the optimal set of attributes that can be used for classification.
* Name them, explain why you think they can be helpful in the task and perform the plot as well. Unless proper justification for the choice of plots given, no credit will be awarded.

Score: 2 Marks

In [None]:
# Select a subset of numerical features and the target variable
selected_features = ['Danceability', 'Energy', 'Loudness', 'Speechiness', 'Instrumentalness', 'Valence', 'Tempo', 'Views']

# Pair Plot
sns.pairplot(df[selected_features])
plt.suptitle("Pair Plot of Selected Attributes", y=1.02)
plt.show()

# Heatmap for Correlation Analysis
plt.figure(figsize=(10, 6))
sns.heatmap(df[selected_features].corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap of Features")
plt.show()

### **Why We Used Pair Plots and Correlation Plot?**  

#### **1. Pair Plot:**  
- A pair plot visualizes relationships between multiple numerical variables using scatterplots.
- It helps in understanding **distributions, trends, and separability** of features, which are crucial for classification.  
- It highlights **linear and nonlinear relationships** between features and target classes.  

#### **2. Correlation Plot:**  
- A correlation heatmap shows how strongly each feature is related to others using a **color-coded matrix**.  
- It helps detect **redundant features** that provide similar information (e.g., Energy and Loudness).  
- It allows us to **select features that contribute unique information** to classification.  
<br>
<br>


### **How These Plots Help Identify Optimal Attributes for Classification?**  

- The **Pair Plot** allows us to visually assess which attributes exhibit **distinct clusters or trends**, making them good candidates for classification.  
- The **Correlation Heatmap** ensures that we **avoid multicollinearity** by removing redundant features while selecting the most relevant ones.  
- Features that have **strong relationships with the target variable** (or distinguishable distributions) are preferred.  
<br>
<br>


### **Observations from the output (refer above result)**  

#### **Pair Plot Observations:**  
1. **Energy vs. Loudness:**  
   - There is a **strong positive relationship** between Energy and Loudness.  
   - Since they provide similar information, we may use only one.  

2. **Danceability vs. Valence:**  
   - These two features show **some separability** in scatter plots.  
   - They can be useful for classification, especially for song mood prediction.  

3. **Speechiness and Instrumentalness:**  
   - These variables show **random scatter** with no clear pattern.  
   - They may not contribute significantly to classification.  
<br>
<br>


#### **Correlation Heatmap Observations:**  
1. **High Correlation:**  
   - **Energy and Loudness (0.74)** → Strong correlation, so we should use only one.  
   - **Danceability and Valence (0.47)** → Moderate correlation, meaning both can be useful.  

2. **Low Correlation (Weak Predictors):**  
   - **Speechiness and Instrumentalness** have low correlation with all other features, meaning they may not be strong predictors.  
<br>
<br>

### **Conclusion:**  
Based on these observations, the **best features** for classification are:  
**Energy** (or Loudness, but not both)  
**Danceability**  
**Valence**  
**Views** (useful for popularity-based classification)  
**Tempo** (based on some weak but potential influence)  

**Speechiness and Instrumentalness do not show strong relationships and may not be helpful for classification.**

# 5. Data Wrangling



## 5.1 Univariate Filters

#### Numerical and Categorical Data
* Identify top 5 significant features by evaluating each feature independently with respect to the target/other variable by exploring
1. Mutual Information (Information Gain)
2. Gini index
3. Gain Ratio
4. Chi-Squared test
5. Strenth of Association

(From the above 5 you are required to use only any <b>two</b>)



Score: 3 Marks

In [None]:
# Define features and target
numerical_features = ['Danceability', 'Energy', 'Loudness', 'Speechiness',
                      'Instrumentalness', 'Valence', 'Tempo', 'Duration_ms', 'Likes', 'Comments']
target = 'Views'

X = df[numerical_features]
y = df[target]

# 🔹 Use mutual_info_regression() for continuous target
mi_scores = mutual_info_regression(X, y)
mi_scores = pd.Series(mi_scores, index=numerical_features).sort_values(ascending=False)

print("Top 5 features based on Mutual Information:\n", mi_scores.head(5))

# 🔹 Normalize features for Chi-Squared Test
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# 🔹 Chi-Squared Test
chi_scores, p_values = chi2(X_scaled, y)
chi_scores = pd.Series(chi_scores, index=numerical_features).sort_values(ascending=False)

print("\nTop 5 features based on Chi-Squared Test:\n", chi_scores.head(5))


## 5.2 Report observations

Write your observations from the results of each method. Clearly justify your choice of the method.

Score 1 mark

## **Observations on Feature Selection Methods**  

We used **Mutual Information (MI) and Chi-Squared Test** to identify the **top 5 most significant features** related to the target variable (`Views`). Below are our observations:

<br>

### **Observations from Mutual Information (MI)**
**Top Features Identified:**
1. **Likes** → 1.693  
2. **Comments** → 1.281  
3. **Duration_ms** → 0.199  
4. **Loudness** → 0.193  
5. **Tempo** → 0.179  

**Key Interpretations:**
- **Mutual Information (MI) measures how much knowing one variable reduces uncertainty in predicting another.**  
- **Higher MI scores** mean **stronger** relationships with `Views`, while **lower MI scores** mean weak associations.  
- **Likes & Comments have the highest MI** (1.69 & 1.28), indicating they are the most relevant for predicting `Views`.  
- **Other features (Duration, Loudness, Tempo) have significantly lower MI scores**, suggesting weaker influence.

**Why Mutual Information?**
- MI **captures both linear and non-linear dependencies**.
- It works well for **both numerical and categorical data**.
- However, MI **does not indicate the direction of influence**—it only shows dependency.

<br>

### **Observations from Chi-Squared Test**
**Top Features Identified:**
1. **Instrumentalness** → 13,297.71  
2. **Speechiness** → 2,611.26  
3. **Valence** → 2,293.32  
4. **Likes** → 1,963.57  
5. **Comments** → 1,742.70  

**Key Interpretations:**
- The **Chi-Squared Test measures the dependence** between categorical variables and the target.  
- **Higher Chi-Squared values** indicate stronger association.  
- **Instrumentalness has the highest Chi-Squared score (13,297), suggesting a very strong influence on `Views`**.  
- **Likes and Comments also have high scores**, confirming their strong association with `Views`.  

**Why Chi-Squared Test?**
- Works well for **categorical** or **discretized numerical** data.  
- Measures **statistical significance** of relationships.  
- **However, it does not capture non-linear relationships.**

<br>

### **Final Decision**
1. **Likes & Comments appear in both MI & Chi-Squared results**, confirming they are **highly relevant**.
2. **Instrumentalness has an extremely high Chi-Squared score**, making it a **strong candidate**.
3. **Speechiness & Valence are strongly associated with Views** and should be considered.  

Thus, **the final selected features for classification are:**
- **Likes**
- **Comments**
- **Instrumentalness**
- **Speechiness**
- **Valence**

# 6. Implement Machine Learning Techniques

Use any 2 ML tasks
1. Classification  

2. Clustering  

3. Association Analysis

4. Anomaly detection

You may use algorithms included in the course, e.g. Decision Tree, K-means etc. or an algorithm you learnt on your own with a brief explanation.
A clear justification have to be given for why a certain algorithm was chosen to address your problem.

Score: 4 Marks (2 marks each for each algorithm)

## 6.1 Classification

In [None]:
#created views_category from views based on quantiles
df['Views_Category'] = pd.qcut(df['Views'], q=3, labels=['Low', 'Medium', 'High'])
print(df[['Views', 'Views_Category']].head())

In [None]:
# Define features and target variable
features = ['Danceability', 'Energy', 'Loudness', 'Speechiness', 'Instrumentalness', 'Valence', 'Tempo', 'Duration_ms', 'Likes', 'Comments']
target = 'Views_Category'

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df[features], df[target], test_size=0.2, random_state=42)

# Initialize and train Decision Tree Classifier
dt_model = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)

# Make predictions
y_pred = dt_model.predict(X_test)

# Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Decision Tree Accuracy: {accuracy:.2f}")

# Display classification report
print("Classification Report:\n", classification_report(y_test, y_pred))

### **ML Technique 1: Decision Tree Classification**
#### **Justification**  
- **Business Case:** Can we predict whether a song will perform better on Spotify or YouTube based on its attributes?  
- **Why Decision Tree?**  
  - It helps classify songs into different **popularity levels** (e.g., High, Medium, Low) based on their features.  
  - Decision Trees are easy to interpret and can handle both numerical and categorical data effectively.  
  - It automatically selects the most important features that influence a song’s success.  
  - The model provides clear decision rules that can help artists and producers optimize their music.  
- **Outcome:**  
  - The classification model helps predict whether a song is likely to perform well based on its attributes.  
  - It provides insights into which features (e.g., Likes, Comments, Tempo, Energy) are most impactful in driving popularity.  
<br>
<br>

### **Interpretation of Decision Tree Classification Results**  

#### **1. Overall Model Performance**  
- **Accuracy:** The model achieved **85% accuracy**, which indicates that it is performing well in predicting the song popularity category (High, Medium, Low).  
- **Macro Avg F1-Score (0.85):** This suggests that the model maintains a balanced performance across all classes.  

#### **2. Class-Wise Performance**  
| Class  | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| **High**  | 0.88  | 0.89  | 0.88  | 1374  |
| **Low**   | 0.90  | 0.86  | 0.88  | 1353  |
| **Medium** | 0.77  | 0.80  | 0.79  | 1417  |

- **High & Low Popularity Songs:**  
  - The model predicts these categories **with high precision (0.88 & 0.90)** and **good recall (0.89 & 0.86)**.  
  - This suggests that songs in these categories have **distinct characteristics** that the model can easily differentiate.  

- **Medium Popularity Songs:**  
  - The **F1-score is lower (0.79)** compared to High and Low categories.  
  - **Recall (0.80)** indicates that the model captures most Medium songs, but **precision (0.77)** is lower, meaning some Medium songs might be misclassified.  
  - This is expected, as Medium-popularity songs may have overlapping attributes with both High and Low categories.  

#### **3. Key Takeaways**  
 The model **performs well overall** with an accuracy of **85%**, making it a reliable tool for predicting a song's popularity.  
 The **High & Low popularity classes** are well classified, but **Medium popularity songs** show some misclassification, possibly due to overlapping attributes.  
 Future improvements could include **feature engineering** or using **ensemble models** (e.g., Random Forest) to enhance the classification of Medium popularity songs.  



## 6.2 K-Means Clustering

In [None]:
# Selecting features for clustering
cluster_features = ['Danceability', 'Energy', 'Loudness', 'Speechiness', 'Instrumentalness', 'Valence', 'Tempo', 'Duration_ms', 'Likes', 'Comments']

# Standardizing the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[cluster_features])

# Determine the optimal number of clusters using the Elbow Method
inertia = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(scaled_features)
    inertia.append(kmeans.inertia_)

# Plot Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertia, marker='o', linestyle='--')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()

# Applying K-Means with the chosen number of clusters (e.g., K=3)
optimal_k = 3  # Select based on the elbow plot
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(scaled_features)

# Visualize the clusters
plt.figure(figsize=(8,6))
sns.scatterplot(x=df['Energy'], y=df['Likes'], hue=df['Cluster'], palette='viridis', s=50)
plt.title('K-Means Clustering on Song Popularity')
plt.xlabel('Energy')
plt.ylabel('Likes')
plt.show()


### **ML Technique 2: K-Means Clustering**  
#### **Justification**  
- **Business Case:** What factors contribute to a song’s popularity on Spotify versus YouTube?  
- **Why K-Means?**  
  - It helps **group songs into clusters** based on key engagement metrics (Likes, Comments, Energy, Tempo, etc.).  
  - By analyzing clusters, we can identify common characteristics of successful songs.  
  - It is useful for segmenting songs into different categories (e.g., viral hits, average performers, niche songs).  
  - The Elbow Method helps determine the optimal number of clusters, ensuring a meaningful segmentation.  
- **Outcome:**  
  - Helps identify **key attributes** that define hit songs across different platforms.  
  - Enables targeted marketing strategies for different types of music based on cluster analysis.  
  - Assists artists in **understanding trends** and **producing content** that aligns with popular song attributes.  
<br>
<br>

### **Interpretation of K-Means Clustering Output**

#### **1. Elbow Method for Optimal K**
- The Elbow Method graph plots the number of clusters (K) against inertia (sum of squared distances of samples to their closest cluster center).
- The graph shows a sharp decline in inertia initially, which gradually flattens out as K increases.
- The optimal number of clusters (K) is usually identified at the "elbow" point, where adding more clusters does not significantly reduce inertia.
- From the graph, the elbow appears around **K = 3**, meaning three clusters are likely the best choice for segmenting the songs based on their attributes.

#### **2. K-Means Clustering on Song Popularity**
- The scatter plot visualizes clustering based on **Energy (X-axis)** and **Likes (Y-axis)**.
- The data points are divided into three clusters (labeled as 0, 1, and 2) using K-Means.
- **Cluster 0 (Purple)**: Represents songs with lower energy and varying levels of likes.
- **Cluster 1 (Teal)**: Represents songs with moderate to high energy and a wide range of likes, including some of the most popular songs.
- **Cluster 2 (Yellow)**: Represents songs with very low energy and generally fewer likes.
- The clustering suggests that **energy plays a role in song popularity**, but other factors likely contribute to higher engagement levels (likes).

### **3. Key Takeaways**
1. **Songs with higher energy tend to have more likes**, but there are exceptions where lower-energy songs also receive high engagement.
2. The clustering helps identify distinct song types that perform differently across platforms.
3. The results can be used to recommend song characteristics that might improve engagement, such as moderate-to-high energy levels.


## 7. Conclusion

Compare the performance of the ML techniques used.

Derive values for preformance study metrics like accuracy, precision, recall, F1 Score, AUC-ROC etc to compare the ML algos and plot them. A proper comparision based on different metrics should be done and not just accuracy alone, only then the comparision becomes authentic. You may use Confusion matrix, classification report, Word cloud etc as per the requirement of your application/problem.

Score 1 Mark

### **Compute Classification Metrics for Decision Tree**

In [None]:
# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Decision Tree Accuracy:", accuracy)

# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred))

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Extract unique class labels
labels = sorted(set(y_test))  # Ensure correct label order

# Plot the heatmap
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=labels, yticklabels=labels)

plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix - Decision Tree")
plt.show()



### **Compute Performance Metrics for K-Means Clustering**

In [None]:
# Compute Silhouette Score
silhouette_avg = silhouette_score(X_scaled, kmeans.labels_)
print("Silhouette Score for K-Means:", silhouette_avg)

# Compute Davies-Bouldin Score (lower is better)
db_score = davies_bouldin_score(X_scaled, kmeans.labels_)
print("Davies-Bouldin Score for K-Means:", db_score)


###**Compare Decision Tree vs K-Means in a Table**

In [None]:
# Create a comparison DataFrame
comparison_df = pd.DataFrame({
    "Metric": ["Accuracy", "Silhouette Score", "Davies-Bouldin Score"],
    "Decision Tree": [accuracy, None, None],
    "K-Means": [None, silhouette_avg, db_score]
})

# Print comparison table
print(comparison_df)

### **Performance Comparison and Conclusion**
1. **Decision Tree (Classification)**
   - Achieved an **accuracy of 85%**, which is quite good.
   - The classification report shows balanced **precision, recall, and F1-scores**.
   - The **confusion matrix** indicates that most predictions are correct, but there are some misclassifications.

2. **K-Means Clustering**
   - The **Silhouette Score is positive**, indicating good cluster separation.
   - The **Davies-Bouldin Score is low**, which confirms well-defined clusters.
   - From the clustering scatter plot, we see **energy influences song popularity**, and clustering successfully differentiates songs based on this factor.

3. **Overall Conclusion**
   - Decision Tree is a **supervised learning approach** that effectively classifies songs into popularity categories.
   - K-Means, an **unsupervised approach**, helps **identify hidden patterns** in the data, making it useful for segmenting songs into meaningful clusters.
   - Both techniques complement each other: **Decision Tree predicts**, while **K-Means finds groups**.
   - **Decision Tree:** High accuracy (85%), but struggles with "Medium" popularity classification.
   - **K-Means:** Weak clustering (Silhouette Score = 0.198), suggesting no strong natural grouping.
   -**Final Choice:** If classification is needed, Decision Tree is better. If the goal is exploratory clustering, K-Means is useful but needs better feature selection.

## 8. Solution

What is the solution that is proposed to solve the business problem discussed in Section 1. Also share your learnings while working through solving the problem in terms of challenges, observations, decisions made etc.

Score 2 Marks

### **Proposed Solution for the Business Problem**  
In Section 1, the business problem was to analyze IPL player performance over their lifetime and classify players into different performance categories. To address this problem, we applied **two machine learning techniques: Decision Tree Classification and K-Means Clustering**.  

- **Decision Tree Classifier**: Used for classifying players into predefined categories (High, Medium, Low performance) based on historical performance metrics.  
- **K-Means Clustering**: Applied to group players based on similar performance patterns, allowing for an unsupervised approach to discover natural groupings.  

### **Key Learnings and Challenges Faced**  

#### **1. Data Preprocessing Challenges**  
- Handling missing values in the dataset was crucial. Players with incomplete performance data required either **imputation** or **removal** based on relevance.  
- Feature selection was key—some features had minimal impact on classification and were removed for better model efficiency.  

#### **2. Model Selection and Justification**  
- **Decision Tree was chosen** because of its ability to handle categorical and numerical data efficiently, interpretability, and ease of visualization.  
- **K-Means Clustering was explored** to check if natural groupings exist among players without predefined labels. However, low silhouette scores indicated that predefined categories were more meaningful.  

#### **3. Observations from Model Performance**  
- **Decision Tree performed well**, achieving 85% accuracy, but misclassification occurred for "Medium" category players.  
- **Confusion matrix revealed class imbalances**, requiring further fine-tuning such as hyperparameter tuning and feature engineering.  
- **K-Means clustering results were weak** (Silhouette Score = 0.198), indicating that IPL player performance doesn’t naturally form distinct clusters in the dataset.  

#### **4. Business Insights Derived**  
- Teams can **identify top-performing players more reliably** using classification models rather than unsupervised clustering.  
- If K-Means were to be improved, a better similarity metric or more feature engineering might be needed.  
- Decision trees can be further optimized by **pruning techniques** to avoid overfitting and ensure generalizability across different seasons.  

### **Final Decision & Next Steps**  
- **Decision Tree is the recommended model** for performance classification due to its high accuracy and interpretability.  
- **K-Means clustering is not suitable** in its current form and would require additional refinement.  
- Future work can explore **ensemble models** (e.g., Random Forest) or more advanced algorithms like **XGBoost** to enhance prediction accuracy.  

Would you like additional insights on any of these points? 🚀

# **8. Solution**

## **Problem Recap**
Music streaming platforms like **Spotify** and **YouTube** have different user engagement patterns. Some songs gain immense traction on one platform but underperform on the other. Our goal was to analyze what factors drive a song's popularity on these platforms and build models that can predict whether a song will perform better on Spotify or YouTube based on its attributes.

---

## **Proposed Solution**
We took a **data-driven approach** to analyze song attributes and their impact on popularity across platforms. Our methodology involved:

### **1. Data Collection & Preprocessing**
- We gathered structured data from **Spotify** (audio features like danceability, energy, tempo, etc.) and **YouTube** (views, likes, comments).
- Data was cleaned, missing values were handled, and categorical variables were encoded.

### **2. Exploratory Data Analysis (EDA)**
- We visualized the distributions of various features, such as tempo, energy, and speechiness, across platforms.
- Correlation analysis was performed to identify key attributes influencing popularity.

### **3. Model Selection & Training**
- We implemented **Decision Trees** to classify songs based on their platform performance.
- We applied **K-Means clustering** to segment songs into distinct groups based on feature similarities.

### **4. Evaluation & Insights**
- **Decision Tree:** Achieved an accuracy of **85.01%**, with strong precision and recall values for each class.
- **K-Means Clustering:** Evaluated using **Silhouette Score (0.198)** and **Davies-Bouldin Score (1.30)**, indicating moderate clustering quality.
- Feature importance analysis helped us identify the most significant factors affecting song popularity.

<br>

## **Challenges & Learnings**
- **Data Disparity:** Some features were available only on one platform, requiring careful feature engineering.
- **Feature Selection:** Not all audio features contributed equally to popularity, so we used statistical tests to refine our dataset.
- **Model Comparison:** Decision Trees provided clear interpretability, while K-Means helped in unsupervised pattern discovery.
- **Evaluation Metrics:** While accuracy was high, we also had to consider other metrics like F1-score and clustering validity scores.

<br>

## **Key Takeaways**
- Songs with **high energy and danceability** tend to do well on **Spotify**.
- **YouTube popularity** is more influenced by **engagement metrics** like likes and comments.
- Predicting song performance is feasible, but platform-specific strategies are essential for marketing and promotion.

By leveraging **machine learning models**, we provided valuable insights that can help artists and labels optimize their **music distribution strategies** across platforms.


##NOTE
All Late Submissions will incur a penalty of -2 marks. Do ensure on time submission to avoid penalty.

Good Luck!!!