# 1. Descriptive Statistics by Song

----

This analysis aims to understand the individual behavior of each of the 7 Christmas songs on Spotify between 2017 and 2025. Through descriptive statistics (mean, median, standard deviation, minimum and maximum), we seek to identify consumption patterns and variability in the streams of each track. The Coefficient of Variation (CV%) allows us to compare the stability of the songs, revealing which ones show more consistent or volatile behavior over the period. With these metrics, we can establish a popularity ranking and understand the unique characteristics of each song in the Christmas streaming scenario.

In [None]:
#installing the necessary libraries
!pip install pandas numpy matplotlib seaborn

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('Data_Collection/spotify_christmas_streams_kworb_2017_2025.csv')  # loads the dataset; a forward slash avoids the invalid '\s' escape sequence
df.head()  # shows the first 5 rows

In [None]:
df['date'].dtype #checks the declared type of the date column

- We see that it is declared as object, so we will need to convert to datetime with pandas.

In [None]:
df['date'] = pd.to_datetime(df['date'], errors='coerce')  #pd.to_datetime converts to date
df['date'].dtype #<M8[ns] is NumPy's internal format for datetime64 with nanosecond precision.
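Because `errors='coerce'` silently turns any unparseable value into `NaT`, it can be worth counting how many rows were affected before moving on. The snippet below is a minimal sketch on hypothetical toy data (the `raw` series is invented for illustration and is not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data: the middle value is not a valid date.
raw = pd.Series(['2017-12-24', 'not a date', '2018-12-25'])

# With errors='coerce', values that cannot be parsed become NaT instead of raising.
parsed = pd.to_datetime(raw, errors='coerce')

# Count how many entries failed to parse.
n_invalid = int(parsed.isna().sum())
print(f"Unparseable dates: {n_invalid}")
```

On the real data, the equivalent check would be `df['date'].isna().sum()`; a non-zero result would suggest inspecting the original strings before continuing.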

- Starting descriptive analysis by song:

In [48]:
pip install tabulate

In [None]:
stats = df.groupby('track')['streams'].describe()  # groups the dataframe by the 'track' column and computes descriptive statistics of 'streams' for each song

# organizing the results in a table:
from tabulate import tabulate  # prints the result as a formatted table

stats_table = stats.reset_index()  # turns the 'track' index back into a regular column
print(tabulate(stats_table, headers='keys', tablefmt='fancy_grid', showindex=False, floatfmt=".2f"))
# headers='keys' uses the column names of the dataframe returned by describe() (count, mean, std, min, 25%, 50%, 75%, max)
# 'fancy_grid' draws borders around rows and columns (table visual style)

In [None]:
cv = (df.groupby('track')['streams'].std() / df.groupby('track')['streams'].mean()) * 100 #calculates coefficient of variation as standard deviation over mean
cv_sorted = cv.sort_values(ascending=False) #orders from highest to lowest
print("Coefficient of Variation (CV%):")
cv_sorted

The analyzed songs show significant variation in streams, indicating that some are more popular in certain periods.

- Rockin' Around the Christmas Tree leads the coefficient of variation (73.36%), showing large relative fluctuation in its audience.
- Last Christmas and All I Want for Christmas Is You also have high CVs, reflecting stream peaks in specific periods.
- Songs like It's Beginning to Look a Lot Like Christmas show less variation, with more consistent streams over time.
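To make the CV comparison concrete, here is a minimal sketch using only the standard library. It assumes the sample standard deviation (which is also what pandas' `.std()` computes by default, with `ddof=1`); the helper `cv_percent` and the toy numbers are hypothetical and not taken from the dataset. The point is that the CV does not depend on scale, which is why songs with very different stream volumes can be compared directly.

```python
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation in percent: sample std / mean * 100."""
    return stdev(values) / mean(values) * 100

# Hypothetical stream counts: same relative spread, very different scale.
small = [10, 20, 30]
large = [v * 1_000_000 for v in small]

cv_small = cv_percent(small)
cv_large = cv_percent(large)
print(f"CV small: {cv_small:.2f}%  CV large: {cv_large:.2f}%")
```

Both series come out with the same CV, since multiplying every value by a constant scales the standard deviation and the mean by the same factor.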

In [None]:
total_streams = df.groupby(['track'])['streams'].sum().sort_values(ascending=False) #sums the streams of each song and orders from highest to lowest
print("Total Streams by Song:")
total_streams

- Between 2017 and 2025, Christmas songs showed enormous popularity in streaming, with "All I Want for Christmas Is You" leading with 1.77 billion plays.
- Other classics like "Last Christmas" and "Rockin' Around the Christmas Tree" also surpassed 1 billion streams, showing a strong and sustained audience.
- Even songs with fewer streams, like "Feliz Navidad" (around 633 million), maintain relevance, evidencing the lasting impact of these songs over the years.

In [None]:
#dataframe with the main analysis metrics
comparison = pd.DataFrame({
    'Mean': df.groupby('track')['streams'].mean(), 
    'Median': df.groupby('track')['streams'].median(),
    'Standard Deviation': df.groupby('track')['streams'].std(),
    'Minimum': df.groupby('track')['streams'].min(),
    'Maximum': df.groupby('track')['streams'].max(),
    'CV%': cv
})
comparison = comparison.sort_values('Mean', ascending=False) #the songs are ordered by mean from highest to lowest
comparison

Between 2017 and November 2025, "All I Want for Christmas Is You" leads in average streams, with approximately 25.7 million, followed closely by "Last Christmas" (≈23.9 million). Songs like "Rockin' Around the Christmas Tree" and "Jingle Bell Rock" also show high averages, above 20 million, indicating sustained public interest. "Santa Tell Me" and "It's Beginning to Look a Lot Like Christmas" have averages close to 19 million and remain very relevant for Christmas playlists. "Feliz Navidad" has the lowest average, at around 16.2 million, which still points to continued international popularity. Overall, all of these songs combine high averages with a consistent presence in the Top 200.
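The comparison table above repeats the same `groupby` once per metric. As an alternative sketch (not the notebook's original approach), the same metrics can be computed in a single pass with `agg`. The `toy` dataframe below is hypothetical and only mirrors the `track` and `streams` column names used in this analysis:

```python
import pandas as pd

# Hypothetical toy data with the same column names as the real dataset.
toy = pd.DataFrame({
    'track': ['A', 'A', 'A', 'B', 'B', 'B'],
    'streams': [10, 20, 30, 100, 100, 400],
})

# One groupby, several statistics at once.
summary = toy.groupby('track')['streams'].agg(['mean', 'median', 'std', 'min', 'max'])

# CV% derived from the columns already computed (sample std over mean).
summary['CV%'] = summary['std'] / summary['mean'] * 100

# Order songs by mean, highest first.
summary = summary.sort_values('mean', ascending=False)
print(summary)
```

In the toy data, track B has a mean well above its median because a single large value pulls the mean up, which is the kind of gap worth checking when seasonal peaks dominate a song's streams.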

In [None]:
plt.figure(figsize=(12, 6))
comparison['Mean'].sort_values().plot(kind='barh', color='skyblue', edgecolor='black') #bar chart
plt.title('Average Streams (in Top 200) by Song, 2017 - Nov 2025', fontsize=14, fontweight='bold')
plt.xlabel('Average Streams, 2017 - Nov 2025')
plt.ylabel('Song')
plt.ticklabel_format(style='plain', axis='x')
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
cv_sorted.sort_values().plot(kind='barh', color='coral', edgecolor='black')
plt.title('Coefficient of Variation (CV%) by Song', fontsize=14, fontweight='bold')
plt.xlabel('CV%')
plt.ylabel('Song')
plt.tight_layout()
plt.show()

In [None]:
total_by_track = df.groupby('track')['streams'].sum().sort_values(ascending=False) #total streams by song

plt.figure(figsize=(12, 8))
colors = plt.cm.Set3(range(len(total_by_track)))
plt.pie(total_by_track, labels=total_by_track.index, autopct='%1.1f%%', startangle=90, colors=colors) #pie chart
plt.title('Total Streams Distribution by Song (2017-2025)', fontsize=14, fontweight='bold')
plt.axis('equal')
plt.tight_layout()
plt.show()