# 9CT200 - Assessment Task 2
The dataset I am analysing is the top 2500 most streamed songs on Spotify for 2024 so far, with the goal of finding and displaying statistics for specific artists. The data is publicly available as a .csv file at:
* https://www.kaggle.com/datasets/nelgiriyewithana/most-streamed-spotify-songs-2024
---

## ***Functional Requirements***
* **Data Loading:**
    * Description: The program should load .csv files, detect errors in a file and clearly inform the user when an error occurs
    * Input: The user accidentally inputs a .txt file instead of a .csv file
    * Output: The program rejects the file and displays an error message, before providing a prompt to retry
# 
* **Data Cleaning:**
    * Description: The system must remove duplicate songs (these occur because both the album and single versions of a song appear in the dataset)
    * Input: The user will input a dataset into the system
    * Output: The system will detect identical songs by matching both name and artist, and remove the entry with fewer streams
# 
* **Data Analysis:**
    * Description: The system must allow for mode and mean data analysis, as well as correlation and range.
    * Input: The user will search for a certain artist
    * Output: The system will be able to generate general information about the artist, such as best performing song, average streams or range of activity
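
As a rough illustration of the analysis requirement, the sketch below computes a few statistics for one artist with Pandas. The function name `artist_summary` is hypothetical, and the column names (`name`, `artist`, `album`, `spotify streams`, `youtube views`) are assumed to match the data dictionary in this document; the raw Kaggle file may use different headers. "Range" is interpreted here as the gap between the artist's highest and lowest stream counts.

```python
import pandas as pd


def artist_summary(df, artist):
    """Return basic statistics for one artist's songs (illustrative only)."""
    songs = df[df["artist"] == artist]
    if songs.empty:
        raise ValueError(f"No songs found for artist '{artist}'")
    streams = songs["spotify streams"]
    # Row with the highest stream count for this artist.
    best = songs.loc[streams.idxmax()]
    return {
        "best_song": best["name"],
        "mean_streams": streams.mean(),
        "stream_range": streams.max() - streams.min(),
        # Mode: the album that appears most often for this artist.
        "most_common_album": songs["album"].mode().iloc[0],
        # Pearson correlation between Spotify streams and YouTube views.
        "streams_views_corr": streams.corr(songs["youtube views"]),
    }
```

The correlation is only meaningful when the artist has several songs in the dataset; with a single song Pandas returns `NaN`.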
# 
* **Data Visualisation:**
    * Description: The program should display the information in a Pandas dataframe
    * Input: The user will ask to display the best performing songs of a specific artist
    * Output: The system will analyse the data, order the songs from most streamed to least and display the information in a table
# 
* **Data Reporting:**
    * Description: The program should have a visual confirmation of a successful/unsuccessful analysis and visualisation and save the final dataset in a .csv file for easier access later on
    * Input: The user will load a dataset, then search for a new one
    * Output: The system will display the first dataset, using text to visually confirm the success. Upon searching for something different, the program will store the old dataset in a .csv file, then analyse and display the next dataset.
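
One possible way to meet the reporting requirement is sketched below: the result table is written to a .csv file and a text message confirms success or failure. The function name `save_report` and the message wording are assumptions for illustration, not part of the specification. The caller would print the returned message to give the user visual confirmation.

```python
import pandas as pd


def save_report(df, path):
    """Write a result table to .csv and return a confirmation message."""
    try:
        # index=False keeps the row index out of the saved file.
        df.to_csv(path, index=False)
    except OSError as error:
        return f"Failed to save {path}: {error}"
    return f"Saved {len(df)} rows to {path}"
```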
---

## ***Use Cases***
*Data Loading*
* **Actor**: User
* **Goal**: To load a dataset into the system.
* **Preconditions**: User has a dataset, with the correct format, ready to load.
* **Main Flow**:
    * User places the dataset for reading into the correct folder.
    * System validates the file format as .csv
    * System loads the dataset and displays the information in a dataframe.
* **Postconditions**: Dataset is loaded and ready for analysis.
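
The loading flow above could be sketched roughly as follows. The function name `load_dataset` is hypothetical, and the check only looks at the file extension, so it is a minimal starting point rather than full validation of the file contents.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Load a .csv file into a DataFrame, rejecting other file types."""
    file_path = Path(path)
    # Validate the extension before attempting to read anything.
    if file_path.suffix.lower() != ".csv":
        raise ValueError(f"Expected a .csv file, got '{file_path.name}'")
    if not file_path.is_file():
        raise FileNotFoundError(f"No file found at '{file_path}'")
    return pd.read_csv(file_path)
```

A menu loop could call `load_dataset` inside a `try`/`except` block, print the error message and prompt the user to try again.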
#
*Data Cleaning*
* **Actor**: Program
* **Goal**: To clean the dataset provided.
* **Preconditions**: User has already loaded a dataset correctly.
* **Main Flow**:
    * System briefly analyses 'name' and 'artist' columns, looking for duplicates
    * System identifies duplicate songs, as well as identifying the duplicate with the highest streams
    * System removes duplicate entries, leaving only the copy with the highest streams
* **Postconditions**: Dataset is cleaned and ready to be displayed.
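
A minimal sketch of the cleaning steps above, assuming the columns are named `name`, `artist` and `spotify streams` as in the data dictionary. The function name `remove_duplicate_songs` is hypothetical.

```python
import pandas as pd


def remove_duplicate_songs(df):
    """Keep only the most-streamed row for each (name, artist) pair."""
    # Sort so the highest stream count comes first within each duplicate group.
    ordered = df.sort_values("spotify streams", ascending=False)
    # keep="first" retains the top-streamed copy and drops the rest.
    cleaned = ordered.drop_duplicates(subset=["name", "artist"], keep="first")
    return cleaned.reset_index(drop=True)
```

Sorting before dropping duplicates means the returned table is also ordered from most to least streamed.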
#
*Data Analysis*
* **Actor**: User
* **Goal**: To filter results into a specific sample.
* **Preconditions**: System is already displaying base dataset.
* **Main Flow**:
    * User filters to most streamed songs of artist 'Kanye West'
    * System removes all entries without 'Kanye West' in the 'artist' column
    * System sorts entries in descending order based on number of streams
    * System loads and displays new dataset, with filtered results
* **Postconditions**: Dataset is loaded and displays filtered information.
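
The filtering flow above might look something like the sketch below. The function name `top_songs_by_artist` is hypothetical, and the column names are assumed to follow the data dictionary. A substring match is used so that collaborations listing several artists are kept as well; an exact match would be a reasonable alternative.

```python
import pandas as pd


def top_songs_by_artist(df, artist):
    """Return one artist's songs, most streamed first."""
    # Plain-text, case-insensitive match; rows with a missing artist are skipped.
    mask = df["artist"].str.contains(artist, case=False, regex=False, na=False)
    matches = df[mask].sort_values("spotify streams", ascending=False)
    return matches.reset_index(drop=True)
```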
#
*Data Visualisation*
* **Actor**: Program
* **Goal**: To appropriately visualise the data.
* **Preconditions**: User has loaded a dataset into the system.
* **Main Flow**:
    * System validates the file format as .csv
    * System loads the dataset and displays the information in a Pandas dataframe.
* **Postconditions**: Dataset is loaded and displayed in a Pandas dataframe.
#
*Data Reporting*
* **Actor**: Program
* **Goal**: To clearly communicate a successful process.
* **Preconditions**: User has input a command into the system.
* **Main Flow**:
    * System executes command
    * System confirms that the command has been successfully executed
    * System outputs text as confirmation that the command has been executed successfully
* **Postconditions**: System has successfully executed a command and given visual confirmation of success
#
---

## ***Non-Functional Requirements***
* **Usability**:
    * User Interface:
        * The user interface should be as simple to use and navigate as possible for newer users, while also allowing for more experienced users to filter data into more specific samples
    * README document:
        * The README document should clearly and concisely describe the purpose of the program, as well as explain how to set up and use the program properly
#
* **Reliability**:
    * Reliability depends on ensuring data integrity, for which there are 5 principles:
        * Attributable: The data can be traced back to the person or system that generated it
        * Legible: The data can be easily read and understood
        * Contemporaneous: The data is recorded at the time it is generated
        * Original: The data is the original record, not an altered copy
        * Accurate: The data is correct and free from errors
---

## ***Research of Issue***
*Purpose*
* The purpose of this project is to create an easy way to find information about specific artists, in order to deepen our understanding of currently trending artists and musical genres. This is important to understand as music plays an important role in bringing people together, through popular songs, national anthems and other widely known songs.
#
*Missing data*
* This analysis is necessary because it gives a more focused view of specific artists and aspects of their albums and songs, such as average streams.
#
*Stakeholders*
* Beneficiaries of this information could be music enthusiasts, as well as people who engage with music as a profession.
#
*Use*
* The data can be used to track trends, which helps music enthusiasts find newer emerging genres and allows professional musicians to suit their work to the current trends.
---

## ***Privacy and Security***
*Privacy of Source*
* My data is sourced from Kaggle, an online data sharing platform. The user who collected the data I am using is responsible for ensuring that it was collected on an appropriate and legal basis, such as confirming that artists consented to and acknowledged this use of the data, and that the owners of the listed songs can have their songs removed without question. The data that should be protected is the names of artists and the names of any Spotify users listed as listeners of the songs.
#
*Application of Privacy*
* My responsibility is almost the same, ensuring that the data privacy rights of the artists of the songs are respected. If this application were to be pushed out to the public space, I'd need to respect the rights of the original creator of the data, by crediting them as the original source and allowing them to have the data they collected be removed from my system if they want.
#
*Cyber Security*
* An application should have at least one user authentication process. Common methods include:
    * Passwords and strong password policies
    * Two-factor authentication
    * Biometrics
* An application may also have more measures after the user has entered a password, such as:
    * Password Hashing
    * Encryption, typically through use of SSL/TLS
    * Web Application Firewalls (WAFs)
* ***User Authentication*** refers to the process of verifying the identity of someone attempting to access a website. This is done to ensure that the application knows who is accessing what and can take appropriate action if required. Most websites that involve interaction with other users require you to log in to an account before interacting.
* ***Password Hashing*** is a one-way transformation rather than a form of encryption. The password that a user has entered into a website is sent through a hash function, which scrambles it into a fixed-length string that cannot practically be reversed. The purpose of password hashing is to add an extra layer of defense in the case of a data breach. While information may be leaked, the passwords will be leaked in hashed form, so an attacker would have to guess each password and hash the guess to find a match.
* ***Encryption*** is a broad term for transforming information so that only someone or something holding the correct decryption key can decipher and process it. Encryption is one of the most effective ways to protect data privacy, as encrypted data is unreadable to anyone without the key.
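
To make the idea of password hashing more concrete, here is a minimal sketch using only the Python standard library. The function names `hash_password` and `verify_password` are hypothetical, and the choice of PBKDF2 with SHA-256, a 16-byte random salt and 600,000 iterations is an assumption for illustration; a real application would normally rely on a maintained authentication library. The application would store the salt and digest, never the password itself, and check a login attempt by hashing it again and comparing the results.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=600_000):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        # A fresh random salt makes identical passwords hash differently.
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password, salt, expected_digest, iterations=600_000):
    """Re-hash a login attempt with the stored salt and compare digests."""
    _, digest = hash_password(password, salt, iterations)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(digest, expected_digest)
```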
---

## ***Data Dictionary***
| Field | Datatype | Format for Display | Description | Example | Validation |
|-------|----------|--------------------|-------------|---------|------------|
| name | object | XX..XX | name of the song | Not Like Us | Any number of characters in the Modern English Alphabet |
| artist | object | XX..XX | name of the artist who made the song | Kendrick Lamar | Any number of characters in the Modern English Alphabet |
| album | object | XX..XX | name of the album the song is from | Not Like Us - Single | Any number of characters in the Modern English Alphabet |
| spotify streams | int64 | NN..NN | the number of streams the song has on Spotify | 323703884 | Any number of characters, but digits only |
| youtube views | int64 | NN..NN | the number of views the song has on YouTube | 116347040 | Any number of characters, but digits only |
| tiktok views | int64 | NN..NN | the number of views the song has on TikTok | 208339025 | Any number of characters, but digits only |
| release date | datetime64 | YYYY-MM-DD | the release date of the song | 2024-05-04 | Must be in standard international date format; Year-Month-Day (YYYY-MM-DD) |
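
The types in the table above could be applied to a loaded dataset along the lines of the sketch below. The function name `apply_data_dictionary` is hypothetical, and the assumption that counts may arrive as text with thousands separators (for example `"1,234"`) is a guess about the raw file. Values that fail validation become missing rather than raising an error, so the sketch uses Pandas' nullable `Int64` type instead of plain `int64`.

```python
import pandas as pd

COUNT_COLUMNS = ["spotify streams", "youtube views", "tiktok views"]


def apply_data_dictionary(df):
    """Coerce columns to the types listed in the data dictionary."""
    typed = df.copy()
    for column in COUNT_COLUMNS:
        # Strip thousands separators before converting to numbers.
        cleaned = typed[column].astype(str).str.replace(",", "", regex=False)
        # Non-numeric values become missing (<NA>) instead of raising.
        typed[column] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")
    # Unparseable dates become NaT.
    typed["release date"] = pd.to_datetime(typed["release date"], errors="coerce")
    return typed
```

Counting the missing values in each column afterwards (`typed.isna().sum()`) gives a quick view of how many entries failed validation.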
#
---