# DATA SCIENCE PROJECT
# DONE BY ANDREI PAHADAYEU

# Music genres and preferences worldwide research: finding connections between music preferences in different regions of the world, as well as analyzing most popular regional artists and songs/tracks (original topic)

# NEW TOPIC 
# Finding critical difficulties in getting required data for projects closely related to Spotify data

## Table of Contents
* [1. Project overview](#chapter1)
    * [1.1 Introduction](#section_1_1)
    * [1.2 Used data and tools](#section_1_2)
* [2. Initial project](#chapter2)
    * [2.1 Spotify API](#section_2_1)
    * [2.2 Python and libraries](#section_2_2)
    * [2.3 Code used](#section_2_3)
* [3. Changing the direction of research](#chapter3)
* [4. Conclusion](#chapter4)

# 1. Project overview <a class="anchor" id="chapter1"></a>

## 1.1 Introduction  <a class="anchor" id="section_1_1"></a>
Music has the power to transcend boundaries and connect people across the globe. It reflects diverse cultural expressions, influences, and individual preferences. The aim of this data science project is to explore music genres and preferences worldwide, specifically focusing on finding connections between music preferences in different regions of the world. Additionally, we will analyze the most popular regional artists and songs/tracks, shedding light on the unique musical landscapes in various parts of the globe.
![image-5.png](attachment:image-5.png)

Understanding the relationship between music genres, preferences, and regions is vital for multiple stakeholders, including music streaming platforms, artists, record labels, and marketing agencies. By uncovering these connections, we can develop insights that contribute to personalized music recommendations, targeted marketing strategies, cross-cultural collaborations, and even cultural preservation efforts.

However, we will also look at the difficulties that can arise in music-related projects, and specifically in projects built on Spotify data.
![image-4.png](attachment:image-4.png)

In recent years, Spotify has emerged as one of the leading music streaming platforms, revolutionizing the way people discover, listen to, and share music. With its vast collection of songs, comprehensive user data, and advanced recommendation systems, Spotify's data has become a goldmine for data scientists, researchers, and music enthusiasts. However, accessing and obtaining the required data for projects closely related to Spotify can present significant challenges and hurdles.

This project aims to explore the critical difficulties that researchers and data scientists encounter when acquiring the necessary data for projects related to Spotify. By identifying and addressing these obstacles, we can gain a better understanding of the limitations and possibilities of working with Spotify data, ultimately enabling more effective and meaningful analyses in the future.

## 1.2 Used data and tools <a class="anchor" id="section_1_2"></a>

For this project I focused on extracting data directly from Spotify through the Spotify API.
I registered my project at https://developer.spotify.com/, which allowed me to request data through API calls.
![image.png](attachment:image.png)

To extract the data I used Python, specifically the Spotipy and Pandas libraries, to compile the data into a useful and readable form.
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

For the research into the difficulties of working with Spotify data, I relied on my personal experience during this project.
![image-6.png](attachment:image-6.png)

# 2. Initial project <a class="anchor" id="chapter2"></a>

## 2.1 Spotify API <a class="anchor" id="section_2_1"></a>
The Spotify API (Application Programming Interface) is a powerful tool that allows developers to interact with Spotify's vast music catalog and access a wide range of features and data. It enables developers to build applications, services, and integrations that leverage Spotify's music streaming capabilities, user data, playlists, and more. By integrating with the Spotify API, developers can create innovative music-related applications and enhance the Spotify experience for users.
![image.png](attachment:image.png)

The Spotify API offers a rich set of functionalities that developers can utilize to enhance their applications and services. Some of the key features and functionalities provided by the Spotify API include:
<ul>
<li>Music Metadata: Access comprehensive information about tracks, albums, artists, genres, and playlists available on Spotify. Retrieve details such as track names, release dates, artist information, album covers, and more.</li>

<li>Audio Features: Retrieve audio analysis and features of tracks, such as tempo, key, danceability, energy, and acousticness. These audio features can be utilized for music recommendation, playlist generation, and other data-driven music applications.</li>

<li>User Data: Access and manage user-related data, including user profiles, playlists, saved tracks, and listening history. This functionality allows developers to create personalized experiences, recommend music based on user preferences, and synchronize user data across platforms.</li>

<li>Search and Recommendations: Utilize the Spotify API's search capabilities to retrieve tracks, albums, playlists, and artists based on various parameters such as keywords, genres, and popularity. The API also provides recommendation endpoints that offer personalized song suggestions based on a user's listening history and preferences.</li>

<li>Playback Control: Control the playback of music on Spotify's web player, mobile apps, or other connected devices. Developers can pause, resume, skip tracks, and control the volume through the API, enabling seamless integration of Spotify playback into their own applications.</li>
</ul>
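
As a small illustration of the "Music Metadata" item above, the sketch below flattens a nested track object into one row of fields. The sample dictionary is hand-written and only shaped like an API response (the field names `artists`, `album`, `release_date`, `popularity` and `available_markets` follow the track object as I understand it); the helper `flatten_track` is a hypothetical name, not part of any library.

```python
from typing import Any

# Hand-written sample shaped like a Web API track object.
# All values are illustrative, not real API output.
sample_track: dict[str, Any] = {
    "id": "track-id-placeholder",
    "name": "Example Song",
    "popularity": 71,
    "artists": [{"name": "Example Artist"}, {"name": "Guest Artist"}],
    "album": {"name": "Example Album", "release_date": "2019-05-17"},
    "available_markets": ["US", "GB", "DE"],
}


def flatten_track(track: dict[str, Any]) -> dict[str, Any]:
    """Reduce a nested track object to one flat row of metadata."""
    album = track.get("album") or {}
    return {
        "id": track.get("id"),
        "name": track.get("name"),
        # Join all credited artists into a single readable string.
        "artists": ", ".join(a["name"] for a in track.get("artists", [])),
        "album": album.get("name"),
        "release_date": album.get("release_date"),
        "popularity": track.get("popularity"),
        # Count of markets rather than the full list, to keep the row short.
        "n_markets": len(track.get("available_markets") or []),
    }


row = flatten_track(sample_track)
print(row)
```

Using `.get` with defaults means a missing field produces `None` or an empty value instead of an exception, which helps when a response does not contain every field.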

To access and interact with the Spotify API, developers need to authenticate their applications and obtain access tokens. Spotify supports the OAuth 2.0 authorization framework, which allows users to grant limited access to their Spotify accounts to third-party applications securely. By obtaining the required permissions from users, developers can access user-related data and perform authorized actions on behalf of the user.

The authentication process involves obtaining an access token, which is used to make authorized API requests. Access tokens have expiration times and can be refreshed using refresh tokens, allowing long-term access to user data.
![image-2.png](attachment:image-2.png)
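
To make the token step more concrete, here is a minimal sketch of how a token request for the Client Credentials flow can be assembled with the standard library only. It builds the request but does not send it. The helper names `build_token_request` and `is_expired`, and the 60-second safety margin, are my own assumptions for illustration; in the project itself Spotipy handles this step.

```python
import base64
import urllib.parse
from typing import Optional
import time

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id: str, client_secret: str) -> tuple[str, dict[str, str], bytes]:
    """Return (url, headers, body) for a Client Credentials token request."""
    # The client ID and secret are sent as HTTP Basic credentials.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {basic}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return TOKEN_URL, headers, body


def is_expired(issued_at: float, expires_in: int,
               now: Optional[float] = None, margin: int = 60) -> bool:
    """True if a token issued at `issued_at` should be replaced.

    `expires_in` is the lifetime in seconds reported with the token;
    `margin` (an assumed value) renews slightly early to avoid races.
    """
    current = time.time() if now is None else now
    return current - issued_at >= expires_in - margin


url, headers, body = build_token_request("my-client-id", "my-client-secret")
print(url, body)
```

The Client Credentials flow only gives access to non-user data and does not come with a refresh token; when the token expires, the application simply requests a new one.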

To start working with the Spotify API, developers first have to register a Spotify account and create a project on the dashboard at https://developer.spotify.com/dashboard. The dashboard shows user activity, the number of calls made, and other statistics. Since I was working on the project by myself, there is only one active user connected to it.
![image-3.png](attachment:image-3.png)

Here, in the application statistics, we can also retrieve the IDs and keys that are later required to authorize the user and make calls for data.
![image-4.png](attachment:image-4.png)

After getting the required keys, we can begin the data science process itself, starting with data collection.

## 2.2 Python and libraries <a class="anchor" id="section_2_2"></a>
In this project I used Python to access the data, although there are other options, since the API can also be used from web pages and other applications. One of the most popular approaches is to write a web application in JavaScript. Since I had no experience with it, I looked for other ways to get the data. With the help of Professor Trihinas, I learned about the Spotipy library for Python, which lets me make the proper calls and retrieve the required data.
![image-5.png](attachment:image-5.png)

For this project I used the Spyder IDE, mainly for testing the code and inspecting the results. I preferred this IDE because it lets me install additional libraries and provides a user-friendly development environment.
![image-6.png](attachment:image-6.png)

To compile the data I used the Pandas library, which organizes records into tables (DataFrames) and makes the data more readable and usable for research.
![image-7.png](attachment:image-7.png)
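
A minimal sketch of what compiling records with Pandas looks like is shown below. The rows are made-up placeholders rather than data retrieved from Spotify, and the column names are assumptions chosen for the example.

```python
import pandas as pd

# Placeholder records; in the project these would come from API responses.
rows = [
    {"name": "Song A", "artist": "Artist 1", "popularity": 64},
    {"name": "Song B", "artist": "Artist 2", "popularity": 82},
    {"name": "Song C", "artist": "Artist 1", "popularity": 47},
]

# A list of dictionaries becomes a table with one column per key.
df = pd.DataFrame(rows)

# Rank tracks from most to least popular.
ranked = df.sort_values("popularity", ascending=False).reset_index(drop=True)

# Average popularity per artist.
by_artist = df.groupby("artist")["popularity"].mean()

print(ranked)
print(by_artist)
```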


## 2.3 Code used <a class="anchor" id="section_2_3"></a>
Here I will show different versions of the code used throughout the project, to illustrate the development process.
#### NOTICE: I can't show every version of the code. I rewrote the code for this project 6 times and was not able to restore all of the versions.
![image.png](attachment:image.png)

I will start with the required libraries. I will skip their installation and go straight to the code itself. Spotipy is the library that handles authentication with the API and gives access to the data gathered by Spotify through calls.

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd

For the next step I had to retrieve my client ID and client secret in order to authenticate properly.

In [2]:
client_id = '56d3570df22c40339c679a173cbdd917'
client_secret = '96c06f0548ef47229643d69f32a81481'

client_credentials_manager = SpotifyClientCredentials(client_id, client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

The code above is nearly identical in all versions of my code, since the process of setting up credentials stays the same. The only things that changed a few times were the client ID and secret, because Spotify discarded my project in the dashboard a couple of times. I will address this issue further below in chapter 3.
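
Because the client ID and secret had to be replaced several times, one option is to keep them outside the source code and read them at run time. The sketch below is an assumption about how this could be organized, not what the project did. `load_credentials` is a hypothetical helper; the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones Spotipy itself looks for in the environment.

```python
import os
from typing import Mapping


def load_credentials(env: Mapping[str, str] = os.environ) -> tuple[str, str]:
    """Read the client ID and secret from environment variables."""
    try:
        return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a clear message instead of a confusing API error later.
        raise RuntimeError(f"Missing environment variable: {missing}") from None
```

With this in place, `client_id, client_secret = load_credentials()` can replace the hard-coded strings, and new credentials from the dashboard only require changing the environment, not the code.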

Now I will show a couple of test programs that I used mainly to test the capabilities of the calls and of data retrieval.
#### NOTICE: Not every piece of code works. Some were written purely for testing, and some fail because of issues I faced with the API. I'll mark the code that does not work. Also, to avoid overloading the notebook and this documentation, I will run my code in Spyder and attach screenshots of the results, so that it is easier to read.

## Version 1.
Testing capabilities of calls. Used to check how data is retrieved and in which form the results are.

![image.png](attachment:image.png)

Result:
![image-2.png](attachment:image-2.png)

As we can see, this code returns an error because of improper authorization. At the time, I wasn't yet aware of how this system works. A few versions later, I found out how to handle authorization correctly, so this error appeared much less often.
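
When an authorization error like this appears, the HTTP status code in the response is usually the quickest clue. The lookup below is a small sketch based on standard HTTP semantics; the hint texts are my own suggestions and `explain_status` is a hypothetical helper.

```python
# Hints for the status codes most likely to show up while experimenting.
# The wording is a suggestion, not an official description.
STATUS_HINTS = {
    400: "Bad request: check the parameters and IDs sent with the call.",
    401: "Unauthorized: the access token is missing, invalid, or expired.",
    403: "Forbidden: the token is valid but not allowed to do this.",
    404: "Not found: the requested ID or endpoint does not exist.",
    429: "Too many requests: rate limit reached, wait before retrying.",
}


def explain_status(status: int) -> str:
    """Return a short hint for an HTTP status code."""
    if status in STATUS_HINTS:
        return STATUS_HINTS[status]
    if 500 <= status < 600:
        return "Server error: the problem is on the service side, retry later."
    return f"Unexpected status {status}."


print(explain_status(401))
```

Spotipy reports failed calls as `spotipy.SpotifyException`, whose `http_status` attribute should carry this code, so the helper can be called from an `except` block.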

## Version 2.

Trying to retrieve statistics on a track's popularity in different countries, in this case the USA and Russia.

![image.png](attachment:image.png)

Result:
![image-2.png](attachment:image-2.png)

In this version an error appears because of invalid track IDs. The reason is a change in how the IDs are retrieved, which took place around late March. This change made it impossible to use the old IDs that I had prepared for this code.
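
Whatever the cause of an invalid ID, a cheap format check before calling the API can catch malformed values early. The sketch below assumes that track IDs are 22-character base-62 strings and accepts a bare ID, a `spotify:track:` URI, or an `open.spotify.com` link. `extract_track_id` is a hypothetical helper, and the ID used in the example is synthetic.

```python
import re
from typing import Optional

# Assumption: a track ID is 22 characters from 0-9, A-Z, a-z.
_ID = r"[0-9A-Za-z]{22}"
_PATTERNS = [
    re.compile(rf"^({_ID})$"),                        # bare ID
    re.compile(rf"^spotify:track:({_ID})$"),          # Spotify URI
    re.compile(rf"^https?://open\.spotify\.com/track/({_ID})(?:\?.*)?$"),  # share link
]


def extract_track_id(value: str) -> Optional[str]:
    """Return the track ID contained in `value`, or None if the format is wrong."""
    value = value.strip()
    for pattern in _PATTERNS:
        match = pattern.match(value)
        if match:
            return match.group(1)
    return None


fake_id = "0123456789abcdefABCDEF"  # synthetic, not a real track
print(extract_track_id(f"spotify:track:{fake_id}"))
print(extract_track_id("not-an-id"))
```

A check like this only validates the shape of the ID; it cannot tell whether the track still exists.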

## Version 3.

An attempt to access the number of listens from all available countries for the playlist 'Top-50 USA'.

![image.png](attachment:image.png)

Result:
![image-2.png](attachment:image-2.png)

Apparently, the method this version uses to retrieve the available countries became out of date, which results in an error.
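
One way to get country information without a separate call is to read it from the track objects themselves. The sketch below assumes each track carries an `available_markets` list of two-letter country codes, and checks a few countries against it. The sample track is hand-written, and `market_availability` is a hypothetical helper. Note that this tells us where a track can be played, not how often it is played there.

```python
from typing import Any, Iterable


def market_availability(track: dict[str, Any], countries: Iterable[str]) -> dict[str, bool]:
    """Map each requested country code to whether the track is available there."""
    markets = {m.upper() for m in track.get("available_markets") or []}
    return {c: c.upper() in markets for c in countries}


# Hand-written sample, not real API output.
example_track = {"name": "Example Song", "available_markets": ["US", "GB", "DE"]}

result = market_availability(example_track, ["US", "RU"])
print(result)
```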

## Version 4.

This version of the code was the most promising at the time, since I spent around 2 weeks compiling all the available information into one program. Here I tried using a different authentication method.

![image.png](attachment:image.png)

Result:
![image-2.png](attachment:image-2.png)

As we can see, the code doesn't produce any results. At the time, I wasn't sure what kind of issue I was facing. Only after a couple of weeks did I find out online that the main issue was with authorization, which essentially required me to run a local server in order to log in. I'll describe this issue in more detail in chapter 3.
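
The need for a local server comes from the Authorization Code flow: the user logs in on Spotify's page, and Spotify then redirects the browser to a redirect URI with a one-time code. The sketch below shows the two ends of that exchange using only the standard library. The redirect URI `http://127.0.0.1:8888/callback`, the helper names, and the example values are assumptions for illustration; the redirect URI has to match one registered in the dashboard.

```python
import urllib.parse

AUTHORIZE_URL = "https://accounts.spotify.com/authorize"


def build_authorize_url(client_id: str, redirect_uri: str,
                        scopes: list[str], state: str) -> str:
    """Build the URL the user opens in a browser to log in and grant access."""
    query = urllib.parse.urlencode({
        "client_id": client_id,
        "response_type": "code",
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),
        "state": state,  # random value echoed back, used to detect forged callbacks
    })
    return f"{AUTHORIZE_URL}?{query}"


def parse_callback(url: str, expected_state: str) -> str:
    """Extract the authorization code from the URL the browser was redirected to."""
    params = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
    if params.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch")
    if "error" in params:
        raise PermissionError(params["error"][0])
    return params["code"][0]


login_url = build_authorize_url(
    "my-client-id", "http://127.0.0.1:8888/callback", ["user-top-read"], "xyz"
)
print(login_url)
```

The local server only has to receive that single redirect so that the code can be read from the URL; as far as I can tell, Spotipy's `SpotifyOAuth` helper can start such a server itself when the redirect URI points to the local machine.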

## Versions 5-current

Unfortunately, I don't have the code for some versions, so I will describe broadly what I was doing here.
After version 4, later versions were changed to use proper authorization, and there were a few changes in data retrieval (track and artist IDs were changed to an acceptable form, so that I could receive at least some real data).

Here I will show the latest version, which actually works and shows some results.

![image.png](attachment:image.png)

Result:
![image-3.png](attachment:image-3.png)

As we can see, this code returns some actual data, which is a giant step forward. However, there are issues with the resulting table, since Spotify doesn't provide popularity data in a usable form. The issue is that the call 'get_track_popularity' assigns the same value to the whole table, so this data is of little actual use. This is consistent with how the Web API describes a track's popularity: a single score from 0 to 100 for the track as a whole, not a per-country figure.
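
The repeated value can be reproduced without calling the API at all. The sketch below assumes, since the original code is only available as a screenshot, that each country row was filled from the track's single `popularity` field; under that assumption every row necessarily gets the same number. Both helper names are hypothetical.

```python
from typing import Any


def popularity_by_country(track: dict[str, Any], countries: list[str]) -> list[dict[str, Any]]:
    """One row per country, each carrying the track's single popularity score."""
    return [{"country": c, "popularity": track["popularity"]} for c in countries]


def is_constant(rows: list[dict[str, Any]], key: str) -> bool:
    """True if every row has the same value for `key`."""
    return len({r[key] for r in rows}) <= 1


# Hand-written sample; the score is illustrative.
track = {"name": "Example Song", "popularity": 71}
table = popularity_by_country(track, ["US", "GB", "DE", "FR"])

print(table)
print(is_constant(table, "popularity"))
```

A check such as `is_constant` is a quick way to notice that a column carries no regional information before building any analysis on it.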

In the latest versions I also tried to switch from track popularity to artist popularity, but the call for artist popularity works the same way as the one for tracks. I also tried changing the code to return the popularity for a specific artist. Here is the code for this version:

![image.png](attachment:image.png)

Result:
![image-2.png](attachment:image-2.png)

Here we can see the popularity of songs by Elton John, but it has the same issue as the code above: the popularity value is the same for every country. We can also see that some tracks are unavailable in some regions.

### Conclusion for this chapter
As you can see, even though I managed to extract data from Spotify, there are serious issues with the retrieved data, since it's barely usable for my project. Considering how long it took me just to get working code, it's a shame that the data is of such poor quality. I will address these issues in the next chapter of my research, which covers the problems of doing research with the data Spotify provides.

# 3. Changing the direction of research  <a class="anchor" id="chapter3"></a>

## Premises and reasons for the change. Issues uncovered during project development

As you saw in the chapter above, for various reasons, the most important of which is data quality, the initial project couldn't be finished with any good insight. That is the main reason why I decided to finish this project by describing the issues that developers may face during research and that may greatly affect the outcome.


Here is a list of what are, in my opinion, the biggest difficulties in a project that uses the Spotify API and its data:
<ol>
    <li><b>Limitations</b>: Spotify's data accessibility is limited compared to other platforms, and accessing granular user-level data is generally not feasible. This restriction makes it difficult to conduct in-depth analyses that require personalized user information. Additionally, while Spotify provides a developer API that allows access to certain data points, it comes with limitations on the amount and type of data that can be retrieved. These limitations can hinder the scope and depth of projects that heavily rely on extensive and diverse Spotify data.</li>
    <li><b>Changes in the API and the syntax of calls</b>: Spotify's data is continuously evolving, and the platform may modify its API or data structures over time, as happened during my research. This can lead to compatibility issues and necessitate constant adjustments to data retrieval and analysis methods. It requires data scientists to stay up to date with changes in the Spotify API and adapt their projects accordingly, which adds another layer of complexity.</li>
    <li><b>Understanding data</b>: Spotify's data ecosystem often lacks comprehensive metadata and context, making it challenging to fully understand the intricacies of the available data. This limitation can hinder the interpretation and meaningful analysis of the collected data, as the absence of critical contextual information can limit the insights that can be extracted.</li>
    <li><b>Data Cleaning and Preprocessing</b>: Raw Spotify data often requires significant cleaning and preprocessing before it can be used effectively. This process involves dealing with missing values, standardizing formats, resolving inconsistencies, and ensuring data quality. These tasks can be time-consuming and resource-intensive, potentially delaying the project's progress.</li>
    <li><b>Privacy and Security</b>: Spotify must adhere to strict privacy regulations and protect user data. As a result, accessing individual-level data or sensitive user information may be restricted or inaccessible. This limitation can pose challenges for projects that require personalized data for detailed analysis or modeling.</li>
    <li><b>Validation and Reliability</b>: Validating the accuracy and reliability of the obtained data is crucial for drawing meaningful conclusions. However, verifying the authenticity of Spotify data and ensuring its integrity can be challenging, especially when working with third-party data providers or aggregated data sources.</li>
    <li><b>Limited Historical Data</b>: Spotify's data access may be limited to a specific timeframe. This restriction can be problematic for projects that require long-term trend analysis, studying user behavior changes over time, or conducting retrospective studies. In my case, I couldn't find datasets produced by other researchers that contain recent data, which may bias the results of the research itself.</li>
</ol>
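
As one concrete example of the cleaning step from the list above, release dates do not always come with the same precision: a value may contain only a year, a year and a month, or a full date. The sketch below normalizes all three forms into a `date`, filling missing parts with 1. That default and the helper name `parse_release_date` are my own choices for illustration.

```python
from datetime import date
from typing import Optional


def parse_release_date(value: Optional[str]) -> Optional[date]:
    """Parse 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'; return None for missing or bad input."""
    if not value:
        return None
    try:
        numbers = [int(part) for part in value.split("-")]
        # Pad with 1 so that a missing month or day becomes the first one.
        year, month, day = (numbers + [1, 1])[:3]
        return date(year, month, day)
    except ValueError:
        # Covers non-numeric text and impossible dates such as month 13.
        return None


for raw in ["1973", "1973-10", "1973-10-05", "", "unknown"]:
    print(repr(raw), "->", parse_release_date(raw))
```

Returning `None` instead of raising keeps one bad record from stopping a whole batch, while still leaving the gap visible in the resulting table.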



I would say that items 1, 2 and 5 are the most difficult, since they prevent researchers from getting really useful data. Spotify does not store data in a readily usable form. In principle this can be worked around with data preprocessing and cleaning, but because of the limitations this is very hard, if not impossible, to do. The terms of service set by Spotify don't allow developers and researchers to extract essentially any user-related data. That means researchers are unable to get the number of streams of a given track, regional statistics, and other useful information.

# 4. Conclusion  <a class="anchor" id="chapter4"></a>

In conclusion, working on projects closely related to Spotify data presents a range of difficulties that data scientists and researchers must overcome. These challenges include limited data accessibility, evolving API structures, insufficient metadata and context, data privacy and security concerns, sampling biases, data cleaning and preprocessing requirements, data integration complexities, limited historical data availability, licensing and copyright restrictions, and the need for data validation and reliability.

Despite these obstacles, it is important to approach these projects with creativity, adaptability, and persistence. Researchers can explore alternative data sources, leverage publicly available aggregate data, employ web scraping techniques, and utilize third-party tools and services to gather insights from Spotify-related data. Rigorous data cleaning, preprocessing, and validation processes are crucial to ensure the integrity of the findings. Collaboration and interdisciplinary approaches can also aid in addressing these challenges by combining expertise from different domains such as music analytics, data engineering, and legal compliance.

By acknowledging and navigating these difficulties, researchers can unlock valuable insights into music consumption patterns, user preferences, and industry trends. These findings can contribute to the advancement of music analytics, personalized recommendations, and user-centric services in the digital music landscape. Ultimately, overcoming the challenges associated with Spotify data empowers researchers to uncover meaningful connections and drive innovation at the intersection of data science and music.

## Links to resources
### Youtube video: https://youtu.be/BWTcZKbds-Q
### GitHub: 