# **Data Science Module**

## **Identifying and Defining**

### **Data Scenario**

- **Data**: A list of the top Spotify songs and statistics about each song
- **Goal**: Understanding the similarities between top-performing songs
- **Source**: https://github.com/jivanjotk/Most-Streamed-Spotify-Songs-2023-Analysis-/blob/main/spotify-2023.csv
- **Access**: It is a publicly accessible file on GitHub
- **Access Method**: GitHub → .csv file



---

### **Functional Requirements**

**Data Loading**: 
- Description: The program should load the data file reliably. If a problem occurs, it should identify the problem and respond to it, for example by reporting it to the user.
- Input: The user supplies the correct dataset.
- Output: If the data is in an incorrect format or the file is missing, an error is shown. If the dataset is correct, it loads without any message.

**Data Cleaning**:
- Description: Unnecessary columns are removed, such as the date of release and the Spotify/Deezer/Apple playlist counts.
- Input: The pandas DataFrame produced by data loading.
- Output: A shortened view of the data that contains only the needed categories and no errors. Commas are added to large numbers to make them more readable.
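
The comma formatting mentioned in the output can be sketched with Python's built-in format specification. This is a minimal illustration rather than the project's actual code, and the helper name `format_count` is hypothetical.

```python
def format_count(value):
    """Return a whole number with thousands separators, e.g. for stream counts."""
    return f"{int(value):,}"


print(format_count(141381703))  # 141,381,703
```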

**Data Analysis**:
- Description: The median, mean or mode is found for applicable columns.
- Input: The output of the data cleaning stage.
- Output: The averages for BPM, artist count and the song creation statistics, which show the typical value of each and so reveal similarities between top-performing Spotify songs.

**Data Visualisation**: 
- Description: The data is plotted in a Matplotlib graph, which makes it easier to visualise and to distinguish the similarities that the most successful songs share.
- Input: The analysed data and the cleansed data.
- Output: The user gets the data in a visualised form, such as a pandas DataFrame or a Matplotlib chart.

**Data Reporting**:
- Description: The new dataset is saved as a .csv file and the visualised version of the data is saved as a .pdf file.
- Input: The dataset and the visualised data.
- Output: One file containing the visualised data and one containing the dataset it was built from.



---

### **Use Cases** 

**Data Loading**:
- Actor: User
- Goal: Load the dataset without problems. Any incorrectly formatted or missing file should be reported with an error message.
- Pre-conditions: Data is ready for loading.
- Mainflow:
1. User places the file, in the right format, in the location the program specifies so that it can run properly.
2. Program checks the dataset for errors or formatting problems, either removing them or displaying an error message.
- Post-conditions: Data is loaded, ready for cleaning.
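
The loading flow above can be sketched as follows. This is a minimal sketch, not the project's actual code: the function name `load_dataset` and the column names in `REQUIRED_COLUMNS` are assumptions and should be checked against the header of the real CSV. Depending on how the file was saved, `pd.read_csv` may also need an `encoding` argument.

```python
import io

import pandas as pd

# Assumed column names; check them against the header of the actual CSV.
REQUIRED_COLUMNS = ["track_name", "streams", "bpm", "key", "mode"]


def load_dataset(source):
    """Load the CSV and raise a readable error if loading fails."""
    try:
        df = pd.read_csv(source)
    except FileNotFoundError as exc:
        raise ValueError(f"Dataset not found: {source}") from exc
    except (pd.errors.EmptyDataError, pd.errors.ParserError) as exc:
        raise ValueError(f"Dataset could not be parsed: {exc}") from exc

    # Report missing columns instead of failing later with a KeyError.
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Dataset is missing columns: {missing}")
    return df


# Usage with a small in-memory sample instead of the real file.
sample = io.StringIO("track_name,streams,bpm,key,mode\nSong A,1000,120,C#,Major\n")
songs = load_dataset(sample)
print(songs.shape)  # (1, 5)
```

Raising one error type with a plain-language message keeps the reporting step simple: the caller only needs to catch `ValueError` and show the message to the user.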

**Data Cleaning**:
- Actor: User
- Goal: Remove any columns that are unnecessary for the data analysis section. Any song without a key defaults to C.
- Pre-conditions: The data has been loaded into a pandas DataFrame.
- Mainflow:
1. Removes the date of release column
2. Removes the in spotify playlists column
3. Removes the in deezer playlists column
4. Removes the in apple playlists column
5. Adds C to the key of any song without a key.
- Post-conditions: All unnecessary columns are deleted and every row has a key.
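
The cleaning steps above can be sketched with pandas. This is a sketch under assumptions: the column names in `COLUMNS_TO_DROP` are guesses at how the CSV labels these fields, and since the release year is kept in the data dictionary, the "date of release" is assumed here to mean the month and day columns only.

```python
import pandas as pd

# Assumed column names for the fields listed in the main flow.
COLUMNS_TO_DROP = [
    "released_month",
    "released_day",
    "in_spotify_playlists",
    "in_deezer_playlists",
    "in_apple_playlists",
]


def clean_dataset(df):
    """Drop the unneeded columns and fill missing keys with C."""
    # errors="ignore" skips any listed column that is not in the DataFrame.
    cleaned = df.drop(columns=COLUMNS_TO_DROP, errors="ignore")
    cleaned["key"] = cleaned["key"].fillna("C")
    return cleaned


sample = pd.DataFrame({
    "track_name": ["Song A", "Song B"],
    "released_day": [14, 2],
    "in_spotify_playlists": [553, 1474],
    "key": ["C#", None],
})
cleaned = clean_dataset(sample)
print(list(cleaned.columns))    # ['track_name', 'key']
print(cleaned["key"].tolist())  # ['C#', 'C']
```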

**Data Analysis**:
- Actor: User
- Goal: Add a section at the bottom that reports the median, mean or mode of applicable columns.
- Pre-conditions: All data is loaded and cleaned, with unnecessary columns removed.
- Mainflow:
1. Finding the median, mean and mode of the average bpm column.
2. Finding the median, mean and mode of the artist count column.
3. Finding the median, mean and mode of the song statistics columns: the danceability, valence, energy, acousticness, instrumentalness, liveness and speechiness percentages.
4. Finding the mode of the key column.
5. Finding the mode of the major/minor column.
- Post-conditions: All statistics and analysis on the dataset are completed.
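
The analysis steps above can be sketched with pandas' built-in statistics. The helper names `summarise_column` and `most_common` and the sample column names are hypothetical, not taken from the project's code.

```python
import pandas as pd


def summarise_column(values):
    """Return the mean, median and mode of a numeric column."""
    # Non-numeric entries become NaN and are dropped before calculating.
    numbers = pd.to_numeric(values, errors="coerce").dropna()
    modes = numbers.mode()
    return {
        "mean": numbers.mean(),
        "median": numbers.median(),
        # mode() can return several values on a tie; take the smallest one.
        "mode": modes.iloc[0] if not modes.empty else None,
    }


def most_common(values):
    """Return the most frequent value of a categorical column such as key or mode."""
    modes = values.dropna().mode()
    return modes.iloc[0] if not modes.empty else None


sample = pd.DataFrame({
    "bpm": [120, 120, 130, 140],
    "mode": ["Major", "Major", "Minor", "Major"],
})
bpm_summary = summarise_column(sample["bpm"])
print(bpm_summary)
print(most_common(sample["mode"]))  # Major
```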

**Data Visualisation**:
- Actor: User
- Goal: To visualise the data in order to make it easier to see.
- Pre-conditions: Data is loaded and cleaned and the data is also analysed.
- Mainflow:
1. The program imports Matplotlib.
2. A chart is created on the artist count.
3. A chart is created on the bpm.
4. A chart is created on the key.
5. A chart is created on whether it is major or minor.
6. A chart is created on the streams.
7. A chart is created on the release year.
8. All charts are displayed to the user.
- Post-conditions: All data is visualised and displayed to the user.
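
One way to sketch the chart steps above is a single helper that counts the values in a column and draws them as a bar chart. The helper name `plot_counts` and the sample column names are assumptions. The `Agg` backend is used only so the sketch runs without a display; in a notebook, that line can be dropped and `plt.show()` called to display the charts.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_counts(df, column, title):
    """Draw a bar chart of how often each value appears in one column."""
    counts = df[column].value_counts().sort_index()
    fig, ax = plt.subplots()
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(title)
    ax.set_xlabel(column)
    ax.set_ylabel("Number of songs")
    return fig


sample = pd.DataFrame({"mode": ["Major", "Minor", "Major"], "artist_count": [1, 2, 1]})
figures = [
    plot_counts(sample, "mode", "Mode Chart"),
    plot_counts(sample, "artist_count", "Artist Count Chart"),
]
```

Returning the figure instead of showing it immediately lets the same objects be displayed and later saved to a file.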

**Data Reporting**:
- Actor: User
- Goal: Both the visualised data and the dataset are saved to files.
- Pre-conditions: The data has been visualised and is available as a dataset.
- Mainflow:
1. Files and their respective folders are created.
2. The dataset and charts are all stored in the created files or folder.
3. The user is told where the files are saved.
- Post-conditions: The dataset and the visualised version of the data are both in a known file location.
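
The reporting flow above can be sketched with `DataFrame.to_csv` and Matplotlib's `PdfPages`, which writes several figures into one PDF. The function name `save_report` and both file names are hypothetical, and the usage writes to a temporary folder purely for illustration.

```python
import tempfile
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages


def save_report(df, figures, output_dir):
    """Write the dataset to CSV and all charts to one PDF; return both paths."""
    folder = Path(output_dir)
    folder.mkdir(parents=True, exist_ok=True)

    csv_path = folder / "cleaned_songs.csv"  # hypothetical file name
    pdf_path = folder / "charts.pdf"         # hypothetical file name

    df.to_csv(csv_path, index=False)
    with PdfPages(str(pdf_path)) as pdf:
        for fig in figures:
            pdf.savefig(fig)  # one page per chart

    # Tell the user where the files are.
    print(f"Report saved in {folder.resolve()}")
    return csv_path, pdf_path


# Usage with a tiny sample, written to a temporary folder.
sample = pd.DataFrame({"bpm": [120, 126]})
fig, ax = plt.subplots()
ax.hist(sample["bpm"])
with tempfile.TemporaryDirectory() as tmp:
    csv_path, pdf_path = save_report(sample, [fig], tmp)
    files_created = csv_path.exists() and pdf_path.exists()
```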

---

## **Non-Functional Requirements** 

- Usability: The GUI should be informative and concise. The README file should explain how the program works and describe any files the program creates.
- Reliability: The program should be reliable and produce correct results. Data should be kept secure, and any errors should be reported.

---

## **Researching and Planning**

### **Research of Chosen Issue**:
- Purpose: The purpose of this dataset is to find the similarities between popular songs, which can be difficult to find.
- Missing Information: The only existing statistic is the ranking of songs, which tells us nothing about their similarities.
- Stakeholders: The main stakeholders who will benefit are artists and listeners.
- Use: Listeners can use these statistics to find new songs with similar characteristics, while artists can compare their own songs with top-performing songs to find what they need to improve.

### **Privacy and Security**

- Data Privacy of Source: I sourced my information from a GitHub user who obtained it from Spotify. Data taken from users should be minimised, meaning that only the data that is needed is collected, and secured or encrypted, meaning that a third party who got hold of the data would not be able to do anything with it.
- Application Data Security: All personal information will be de-identified and minimised. Only the needed data will be used, and anything that identifies a user who streamed or otherwise interacted with Spotify will be removed.
- Cyber Security: User authentication requires users to enter a username and password to keep their account safe. Two-factor authentication goes further by requiring a second form of verification, so an attacker needs more than just the username and password. To protect stored passwords, there will be password hashing: instead of storing the password itself, the system stores a hash, a fixed-length value computed from the password that cannot practically be reversed. Encryption is where data is scrambled mathematically so that only those who hold the key can decrypt it.
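
Password hashing as described above can be sketched with Python's standard library. This is an illustration only, not part of the project's code: the function names and the iteration count are assumptions, and PBKDF2 is just one of several suitable algorithms.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # illustrative work factor, an assumption


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # a random salt makes equal passwords hash differently
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


salt, stored = hash_password("correct horse")
print(verify_password("correct horse", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))    # False
```

Only the salt and the digest are stored, so the original password is never kept anywhere.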

---

### **Data Dictionary**

| Field | Datatype | Format for Display | Description | Example | Validation |
| --- | --- | --- | --- | --- | --- |
| Track Name | Object | XX...XX | The name of the song | Seven (feat. Latto) (Explicit Ver.) | Can be any character | 
| Artist Name | Object | XX...XX | The name of the artist of the song | Latto, Jung Kook | Can be any character |
| Artist Count | Float64 | N | The amount of artists on the song | 2 | Must be a single or double digit number |
| Release Year | Datetime64 | YYYY | The year the song was released | 2023 | Must be a year |
| Streams | Float64 | N | The amount of streams the song has | 141381703 | Can be any number but can't have any letters |
| BPM | Float64 | N | How fast the song is | 126 | Can be any number but can't have any letters |
| Key | Object | XX...XX | The pitch of the song | C | Must be a letter from A-G, optionally followed by b or # |
| Mode | Object | XX...XX | Whether the song is in major or minor key | Major | Is either major or minor |

---

### **Testing and Evaluating**

#### **Analyse and Conclude**

#### **Data Visualisation**

Streams Chart
![image.png](attachment:image.png)

Artist Count Chart
![image-2.png](attachment:image-2.png)

BPM Chart
![Screenshot 2024-08-22 at 21.31.01.png](<attachment:Screenshot 2024-08-22 at 21.31.01.png>)

Mode Chart
![Screenshot 2024-08-22 at 21.33.20.png](<attachment:Screenshot 2024-08-22 at 21.33.20.png>)

Key Chart
![image-3.png](attachment:image-3.png)

Released Year Chart
![Screenshot 2024-08-22 at 21.40.42.png](<attachment:Screenshot 2024-08-22 at 21.40.42.png>)

#### **Calculations**

The only calculation done within my code is the average of a specific category. It is accurate to the nearest number because it is calculated with an existing function.

####  **Accuracy**

The information in the dataset comes from Spotify itself, as it lists Spotify's most streamed songs, which supports its accuracy.

#### **Conclusion**

Ultimately, from the dataset and program, we can see that there are many nuances to music. The most popular songs are those released within the previous few years. The major mode is used more often than the minor mode, but both are used. Faster-paced songs may be catchier and tend to be more popular, averaging about 123 bpm. The most popular songs are by either a solo artist or a duo. This means that, to be most statistically successful, with no regard to how good the song is or how it is marketed, a song should be a solo song, have been released around two years ago and be in a major mode rather than a minor one. The key should be C#. 

---

### **Peer Evaluation**



Shawn: The program was made well with all functional and non-functional requirements completed. The GUI could be better and made better looking. I had no problems with most of the program as it worked successfully for me, creating a .pdf file and a .csv file when I ran it. From what I can see, it met the outlined requirements but could be improved a little in some areas. 

Evaluation: The evaluation was similar to what I had envisioned as my problems. I believe the only crucial part that I have not yet completed is making it look better. This evaluation reinforces the idea of a more aesthetically pleasing program, providing insight into what can be improved in the future.

Rufus: The program met most functional and non-functional requirements. I had no problems with most of the program as it worked successfully for me, creating a .pdf file and a .csv file when I ran it. From what I can see, it met the outlined requirements but could be improved a little in some areas. It would be better with more functions.

Evaluation: Similar to the first evaluation, from my standpoint, I have completed most of the functional and non-functional requirements. Rufus cared less about the aesthetics and more about the running and functionality of the task. In the future, I should add more functions to make it more useful. However, from the statistics provided by my program, I can find the answer to my question.


---

### **Evaluation in Relation to Functional Requirements**

Data Loading: Data loading was tested to find out whether the data would load successfully. With the correct dataset, the code works correctly. Without the correct dataset, it cannot perform the wanted functions, and an error message or similar indication shows that there is no dataset. If there is no error message, the data has loaded successfully.

Data Cleaning: All unnecessary data is cleansed. This can be seen in the code, where the unnecessary columns are dropped. We can see that the data is successfully cleansed from the difference between the original DataFrame and the cleansed DataFrame, shown when the code is running.

Data Analysis: Data analysis refers to the means, medians and modes of a column, which can be easily found for a specific column. This helps to understand the key metrics related to song creation and to reveal patterns and similarities among top-performing Spotify songs. Each result can be found by pressing a number.

Data Visualisation: Data visualisation takes the processed data and graphs it using Matplotlib. The visual representation makes it easier for users to interpret the data and draw conclusions. Done successfully, it shows the visualised versions of a dataset, presenting the key aspects of song metrics in an easy-to-understand graphic.

Data Reporting: Data reporting takes the results of all of the above and makes them easy to access and transfer. The visualised versions of the dataset were saved and shown to the user, so this requirement works.

---

### **Evaluation in Relation to Non-Functional Requirements**

Usability: The GUI is fairly usable and straightforward, as the user can understand what to do and can easily do what they want with the program. Although it is easy to understand, it isn't aesthetically pleasing as it is just a block of text. A better looking GUI would give a better user experience, so this needs to be further refined in the future.

Reliability: The program itself is reliable, with no errors occurring in normal use. Nothing actively causes error messages to pop up. However, when an error does occur, the program can show which part is causing it, though not very effectively. Because it cannot fully handle all errors, the reliability of the code needs further refinement.

---

### **Evaluation in Relation to Testing and Project Management**

Time Management: The majority of the work was completed in class and on time, within the durations outlined by the teacher. I was prepared to finish some of the work at home and ensured this was possible by repeatedly pushing and pulling all files used in the assessment. The main thing that needs to improve is the theory, as only some of it was completed in the given class time. Peer evaluations also need to be completed more quickly next time, as they were the final thing I needed to complete before my assessment was due. Overall, all the deadlines were met, so the project was generally well managed for time.

Testing/Challenges: I faced challenges such as some frames being blank, but these were mostly fixed with some programming. The blank spaces were resolved with a simple line of code. I also had trouble turning graphs into a .pdf file, which was resolved by adding an extension. Furthermore, I tried to use pie graphs but was unsuccessful, which indicates a need for additional improvement in the future. There were a few small bugs here and there, but none were challenging enough to require hours of work to understand and resolve.

Effort: I put an ample amount of effort into the project, working hard during class and at home. However, I could have been more efficient in the way I did things. More functions could have been implemented, and the low line count suggests that the functions could be developed and optimised further, which can be resolved with more time.

---