# **PROJECT PROPOSAL**

### INTRODUCTION

Statistics and data science are interrelated fields that focus on the collection, organization, analysis, interpretation, and presentation of data. Data science applies statistical methods and models to extract valuable insights from large data sets. These insights can inform decision-making and strategy and improve our understanding of complex systems.

In this project, we focus on the "Top Spotify Songs 2023" dataset, available on Kaggle. For each song, the dataset records its artists, its chart rankings, and features such as year of release, tempo, energy, and danceability. Spotify is one of the world's leading music streaming service providers, with over 345 million users, including 155 million subscribers, across 178 countries.

#### Question: "Do the danceability and energy of a song contribute to its popularity on Spotify?"

This question lets us assess two random variables of interest, *danceability* and *energy*, against a popularity measure (the response variable), which we take to be the number of streams. Because both features are measured on a continuous percentage scale, we also group songs into low, medium, and high categories of each feature so that popularity can be compared across the spectrum.

For the **location parameter**, we will use the *mean*. The mean gives the average danceability and energy rating across songs on Spotify, which we can then compare with popularity. This lets us see the average tendencies in the dataset.

For the **scale parameter**, we will use the *standard deviation*. Because the standard deviation measures the dispersion of a set of values, it will help us evaluate how much the energy and danceability of songs vary and how this variation relates to song popularity. By combining the average rating (mean) with the spread of ratings (standard deviation), we can make a more informed assessment of the connection between a song's energy, its danceability, and its popularity on Spotify.
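
For reference, the two summaries can be written out explicitly. This is a sketch in our own notation (the symbols $x_i$, $n$, $\bar{x}$, and $s$ are assumptions and do not come from the dataset documentation). For $n$ observed values $x_1, \dots, x_n$ of a variable such as energy:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$

The $n - 1$ divisor is the one used by R's `sd()` function.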

In conclusion, using basic statistical concepts like mean and standard deviation, we seek to better understand this dataset and uncover possibly hidden relationships, contributing to a more nuanced understanding of what traits may lead to a song's success on Spotify.

### PRELIMINARY RESULTS

In [None]:
library(tidyverse)
library(repr)
library(datateachr)
library(digest)
library(infer)
library(gridExtra)
library(cowplot)

spotify_original <- read_csv("https://drive.google.com/uc?export=download&id=1UQy2DuHB0IszFK4ZVgA20xDIWBHI4eTe")

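# Keep only the columns needed for the analysis and give them simpler names
spotify <- spotify_original %>%
    select("track_name", "streams", "danceability_%", "energy_%") %>%
    rename("danceability" = "danceability_%", "energy" = "energy_%") %>%
    # Express streams in millions; entries that cannot be parsed as numbers become NA
    mutate(streams = as.double(streams) / 1000000,
           danceability = as.integer(danceability),
           energy = as.integer(energy)) %>%
    # Drop rows with a missing value in any column
    filter(if_all(everything(), ~ !is.na(.)))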

# Tertile boundaries: the values at or below which one third and two thirds of songs fall
bottom_third_boundary_energy <- spotify %>%
    pull(energy) %>%
    quantile(1/3)

top_third_boundary_energy <- spotify %>%
    pull(energy) %>%
    quantile(2/3)

bottom_third_boundary_danceability <- spotify %>%
    pull(danceability) %>%
    quantile(1/3)

top_third_boundary_danceability <- spotify %>%
    pull(danceability) %>%
    quantile(2/3)

# Use -Inf as the lowest break so that a score of exactly 0 is not turned into NA
# (cut() intervals are open on the left by default)
spotify_energy_categories <- spotify %>%
    mutate(energy = cut(energy,
                        breaks = c(-Inf, bottom_third_boundary_energy, top_third_boundary_energy, Inf),
                        labels = c("Low Energy", "Medium Energy", "High Energy")))

spotify_danceability_categories <- spotify %>%
    mutate(danceability = cut(danceability,
                              breaks = c(-Inf, bottom_third_boundary_danceability, top_third_boundary_danceability, Inf),
                              labels = c("Low Danceability", "Medium Danceability", "High Danceability")))
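
To make the grouping step easier to follow, here is a minimal sketch of the same tertile binning on a small made-up vector. The values in `toy_energy` are hypothetical and are not taken from the Spotify data; the names `toy_energy`, `boundaries`, and `toy_category` are ours.

```r
# Hypothetical toy scores standing in for energy percentages (not real data)
toy_energy <- c(20, 35, 41, 50, 58, 63, 70, 77, 85, 92)

# Tertile boundaries: the 1/3 and 2/3 sample quantiles
boundaries <- quantile(toy_energy, probs = c(1/3, 2/3))

# Bin each score; -Inf and Inf as outer breaks guarantee that every value
# lands in exactly one of the three categories
toy_category <- cut(toy_energy,
                    breaks = c(-Inf, boundaries, Inf),
                    labels = c("Low Energy", "Medium Energy", "High Energy"))

# Number of toy scores in each category
table(toy_category)
```

Checking the category counts this way is a quick test that the boundaries split the songs into groups of roughly equal size.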

In [None]:
options(repr.plot.width = 8.5, repr.plot.height = 5)

spotify_vis_danceability_streams <- spotify %>%
    ggplot(aes(x = danceability, y = streams)) +
      geom_point(alpha = 0.3, shape = 16) +
      labs(x = "Danceability (%)", y = "Streams (Millions)") +
      theme_minimal()

spotify_vis_energy_streams <- spotify %>%
    ggplot(aes(x = energy, y = streams)) +
      geom_point(alpha = 0.3, shape = 16) +
      labs(x = "Energy (%)", y = "Streams (Millions)") +
      theme_minimal()

spotify_vis_energy <- spotify %>%
    ggplot(aes(y = energy)) +
        geom_histogram(bins = 15, color = "lightgray", linewidth = 0.2) +
        ylab("Energy (%)") +
        theme_minimal()

spotify_vis_danceability <- spotify %>%
    ggplot(aes(y = danceability)) +
        geom_histogram(bins = 15, color = "lightgray", linewidth = 0.2) +
        ylab("Danceability (%)") +
        theme_minimal()

spotify_vis_streams <- spotify %>%
    ggplot(aes(y = streams)) +
        geom_histogram(bins = 15, color = "lightgray", linewidth = 0.2) +
        ylab("Streams (Millions)") +
        theme_minimal()

combined_plot_scatter <- plot_grid(spotify_vis_danceability_streams, spotify_vis_energy_streams, 
                                   labels = c("Danceability vs Streams", "Energy vs Streams"), ncol = 1)

combined_plot_hist <- plot_grid(spotify_vis_streams, spotify_vis_energy, spotify_vis_danceability,
                                labels = c("Streams Histogram", "Energy Histogram", 
                                           "Danceability Histogram"), ncol = 3)

streams_energy <- spotify_energy_categories %>%
    group_by(energy) %>%
    summarize(mean_streams = mean(streams),
              standard_deviation_streams = sd(streams))

streams_danceability <- spotify_danceability_categories %>%
    group_by(danceability) %>%
    summarize(mean_streams = mean(streams),
              standard_deviation_streams = sd(streams))

combined_plot_scatter

combined_plot_hist

streams_energy

streams_danceability
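
The group means above are point estimates only. As one possible next step, the sketch below shows how the `infer` package (already loaded in this notebook) could attach a bootstrap confidence interval to the difference in mean streams between two groups. It runs on simulated data: the `toy_songs` table, its exponential stream counts, and the group sizes are all hypothetical, and this is not a committed analysis plan.

```r
library(dplyr)
library(infer)

set.seed(2023)  # make the resampling reproducible

# Hypothetical toy data: 40 songs in each of two energy groups
# (simulated values, not real Spotify streams)
toy_songs <- tibble(
    energy = rep(c("Low Energy", "High Energy"), each = 40),
    streams = c(rexp(40, rate = 1 / 500), rexp(40, rate = 1 / 450))
)

# Observed difference in mean streams (High minus Low)
observed_diff <- toy_songs %>%
    specify(response = streams, explanatory = energy) %>%
    calculate(stat = "diff in means", order = c("High Energy", "Low Energy"))

# Bootstrap distribution of that difference and a 95% percentile interval
bootstrap_ci <- toy_songs %>%
    specify(response = streams, explanatory = energy) %>%
    generate(reps = 1000, type = "bootstrap") %>%
    calculate(stat = "diff in means", order = c("High Energy", "Low Energy")) %>%
    get_confidence_interval(level = 0.95, type = "percentile")

observed_diff
bootstrap_ci
```

An interval that contains zero would suggest that the data do not clearly separate the two groups' mean streams.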



### METHODS: PLAN

### REFERENCES