# Analysis of Spotify Top-2000 songs

### Content:
* 1. Introduction
* 2. Description of project
* 3. Research questions
* 4. Data preparation: shaping and cleaning

## Part 1 - Introduction

In the modern world, almost everyone interacts with one art form: music. Music is listened to, created, reproduced and sung. Over the years, the way people listen to music has changed enormously, and today one of the most popular music services is the streaming service Spotify, with nearly 300 million users worldwide. The service contains a database of 60 million tracks of all genres. According to its users, the service has many advantages over other platforms: a simple and convenient interface, reasonable rates for access to catalogs of various tracks, a high-quality algorithm for selecting individual playlists, easy registration and payment methods, integration with social networks, cross-platform support, and streaming of tracks and albums. Streaming in particular increases an artist's popularity all over the world: the more followers and listens an artist's tracks have, the more recognizable the artist becomes.

Songs that achieve great success with the audience remain in memory for a long time. Their influence on people can be so great that they leave their mark on history. Songs can sometimes show us the development of entire generations. The main goal of this project is therefore to explore the power of music.

## Part 2 - Description of project

The popularity factor is central to this research.
The analysis is based on a ready-made dataset taken from kaggle.com:
(https://www.kaggle.com/iamsumat/spotify-top-2000s-mega-dataset).
The dataset contains audio statistics of the top 2000 tracks on Spotify. It has 15 columns, each describing the track and its qualities. The tracks were released between 1956 and 2019 and include songs from notable artists such as Queen, The Beatles and Guns N' Roses.
The data was extracted from the Spotify playlist "Top 2000s" on PlaylistMachinery (@plamere) using Selenium with Python. More specifically, it was scraped from http://sortyourmusic.playlistmachinery.com/. It contains audio features such as Danceability, BPM, Liveness, Valence (Positivity) and many more.
Each feature is described below.
* Index - the ID of the track.
* Title - the name of the track.
* Artist - the name of the artist.
* Top Genre - the genre of the track.
* Year - the year the track was released.
* Beats Per Minute (BPM) - the tempo of the song.
* Energy - the higher the value, the more energetic the song is.
* Danceability - the higher the value, the easier it is to dance to the song.
* Loudness (dB) - the higher the value, the louder the song is.
* Liveness - the higher the value, the more likely the song is a live recording.
* Valence - the higher the value, the more positive the mood of the song is.
* Length (Duration) - the duration of the song.
* Acousticness - the higher the value, the more acoustic the song is.
* Speechiness - the higher the value, the more spoken words the song contains.
* Popularity - the higher the value, the more popular the song is.

## Part 3 - Formulation of research questions

#### For a detailed analysis, we need to answer the following questions:
1. Who are the best-known artists of all time?
2. Which genres are the most listened to of all time?
3. In which years did people listen to danceable and energetic songs?
4. Were acoustic songs more popular in the 1960s than they are today?
5. Which words contained in songs are the most popular?
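
As a rough sketch of how questions 1 to 3 might be approached with pandas, counting and grouping are usually enough. The small `sample` frame below reuses a few rows shown in this notebook and stands in for the full dataset; the variable names are illustrative and are not taken from the actual analysis.

```python
import pandas as pd

# Small stand-in for the full dataset, built from rows shown in this notebook
sample = pd.DataFrame(
    {
        "Artist": ["Norah Jones", "Muse", "The Killers", "Maroon 5"],
        "Top Genre": ["adult standards", "modern rock", "modern rock", "pop"],
        "Year": [2004, 2006, 2004, 2002],
        "Energy": [30, 96, 92, 71],
        "Danceability": [53, 37, 36, 71],
    }
)

# Questions 1 and 2: number of tracks per artist and per genre, most frequent first
artist_counts = sample["Artist"].value_counts()
genre_counts = sample["Top Genre"].value_counts()

# Question 3: average danceability and energy of the tracks released in each year
yearly = sample.groupby("Year")[["Danceability", "Energy"]].mean()

print(genre_counts)
print(yearly)
```

Counting tracks is only one possible proxy for "best known"; averaging the Popularity column per artist or genre would be an alternative.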

We start the analysis by loading the dataset, checking its shape and finding missing values.

In [1]:
# importing useful packages
import pandas as pd
import csv
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
import random

In [2]:
# making data frame from csv file  
top = pd.read_csv(r"C:\Users\User\Desktop\data analytics\Project\Analysis-Spotify-Top-2000\Spotify-2000.csv",delimiter = ",",
            encoding = "UTF-8", doublequote=True, engine="python", quotechar='"', quoting=csv.QUOTE_ALL)
top.head(60)

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,adult standards,2004.0,157.0,30.0,53.0,-14.0,11.0,68.0,201.0,94.0,3.0,71.0
1,2,Black Night,Deep Purple,album rock,2000.0,135.0,79.0,50.0,-11.0,17.0,81.0,207.0,17.0,7.0,39.0
2,3,Clint Eastwood,Gorillaz,alternative hip hop,2001.0,168.0,69.0,66.0,-9.0,7.0,52.0,341.0,2.0,17.0,69.0
3,4,The Pretender,Foo Fighters,alternative metal,2007.0,173.0,96.0,43.0,-4.0,3.0,37.0,269.0,0.0,4.0,76.0
4,5,Waitin' On A Sunny Day,Bruce Springsteen,classic rock,2002.0,106.0,82.0,58.0,-5.0,10.0,87.0,256.0,1.0,3.0,59.0
5,6,The Road Ahead (Miles Of The Unknown),City To City,alternative pop rock,2004.0,99.0,46.0,54.0,-9.0,14.0,14.0,247.0,0.0,2.0,45.0
6,7,She Will Be Loved,Maroon 5,pop,2002.0,102.0,71.0,71.0,-6.0,13.0,54.0,257.0,6.0,3.0,74.0
7,8,Knights of Cydonia,Muse,modern rock,2006.0,137.0,96.0,37.0,-5.0,12.0,21.0,366.0,0.0,14.0,69.0
8,9,Mr. Brightside,The Killers,modern rock,2004.0,148.0,92.0,36.0,-4.0,10.0,23.0,223.0,0.0,8.0,77.0
9,10,Without Me,Eminem,detroit hip hop,2002.0,112.0,67.0,91.0,-3.0,24.0,66.0,290.0,0.0,7.0,82.0


In [3]:
# show the size of the table as (rows, columns)
top.shape

(1994, 15)

In [4]:
# show the name, non-null count and dtype of every column
top.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1994 entries, 0 to 1993
Data columns (total 15 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Index                   1994 non-null   object 
 1   Title                   1936 non-null   object 
 2   Artist                  1936 non-null   object 
 3   Top Genre               1936 non-null   object 
 4   Year                    1936 non-null   float64
 5   Beats Per Minute (BPM)  1936 non-null   float64
 6   Energy                  1936 non-null   float64
 7   Danceability            1936 non-null   float64
 8   Loudness (dB)           1936 non-null   float64
 9   Liveness                1936 non-null   float64
 10  Valence                 1936 non-null   float64
 11  Length (Duration)       1936 non-null   float64
 12  Acousticness            1936 non-null   float64
 13  Speechiness             1936 non-null   float64
 14  Popularity              1936 non-null   

Here we can see that the Index column has 1994 non-null values, while every other column has only 1936.
Next, we need to locate the missing records.

In [5]:
# count the missing values in each column
top.isnull().sum()

Index                      0
Title                     58
Artist                    58
Top Genre                 58
Year                      58
Beats Per Minute (BPM)    58
Energy                    58
Danceability              58
Loudness (dB)             58
Liveness                  58
Valence                   58
Length (Duration)         58
Acousticness              58
Speechiness               58
Popularity                58
dtype: int64

At this point we know that the table has missing values.
To find the cause, we look at the rows that contain NaN values.

In [6]:
# boolean Series that is True where "Artist" is NaN
is_artist_null = pd.isnull(top["Artist"])

# display only the rows where "Artist" is NaN
top[is_artist_null]


Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
48,"49,""You're The First, The Last, My Everything""...",,,,,,,,,,,,,,
57,"58,""Listen (From the Motion Picture """"Dreamgir...",,,,,,,,,,,,,,
103,"104,""Trapped - Live at Meadowlands Arena, E. R...",,,,,,,,,,,,,,
113,"114,""Bloed, Zweet En Tranen"",Andre Hazes,dutch...",,,,,,,,,,,,,,
114,"115,""Lose Yourself - From """"8 Mile"""" Soundtrac...",,,,,,,,,,,,,,
117,"118,""Harder, Better, Faster, Stronger"",Daft Pu...",,,,,,,,,,,,,,
170,"171,""Comptine d'un autre été, l'après-midi"",Ya...",,,,,,,,,,,,,,
173,"174,""Free Fallin' - Live at the Nokia Theatre,...",,,,,,,,,,,,,,
181,"182,""(I've Had) The Time of My Life - From """"D...",,,,,,,,,,,,,,
411,"412,Peter Gunn Theme,""Emerson, Lake & Palmer"",...",,,,,,,,,,,,,,


Teacher, I really tried to solve the problem with these double quotes, but after spending a lot of time on it I could not find a solution. I tried many approaches and none of them worked.
The main cause is probably that these records contain a comma (which is the delimiter) inside strings. As the output above shows, the whole row ends up in the Index column and every other column is NaN, so pandas could not parse these rows correctly.
I chose to delete these records.
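
The truncated rows above suggest that each affected line was wrapped in one extra pair of quotes, with its inner quotes doubled, so the whole line is read as a single field. A minimal sketch of one possible repair, under that assumption, is to remove the extra layer of quoting before parsing. The sample text, the row `3,"Song, With Comma",...` and the helper name `unwrap_line` are made up for illustration; only the standard `csv` and `io` modules are used.

```python
import csv
import io

# Hypothetical sample imitating the damaged file: the last line is wrapped
# in an extra pair of quotes and its inner quotes are doubled.
raw_text = (
    "Index,Title,Artist,Top Genre,Year\n"
    "1,Sunrise,Norah Jones,adult standards,2004\n"
    '"3,""Song, With Comma"",Some Artist,pop,1999"\n'
)


def unwrap_line(line):
    """Remove one extra layer of CSV quoting if the whole line is one quoted field."""
    if line.startswith('"') and line.endswith('"'):
        fields = next(csv.reader([line]))
        if len(fields) == 1:
            # The only field is the original, correctly quoted line
            return fields[0]
    return line


repaired_text = "\n".join(unwrap_line(line) for line in raw_text.splitlines())

# Parse the repaired text as ordinary CSV
rows = list(csv.reader(io.StringIO(repaired_text)))
header, records = rows[0], rows[1:]
print(records)
```

The repaired text could then be passed to `pd.read_csv(io.StringIO(repaired_text))` instead of dropping the 58 rows. Whether this recovers all of them depends on the actual file contents, which are not shown in full here.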

In [7]:
new_top = top.dropna(axis = 0, how ='any')

In [8]:
new_top.head(59)

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,adult standards,2004.0,157.0,30.0,53.0,-14.0,11.0,68.0,201.0,94.0,3.0,71.0
1,2,Black Night,Deep Purple,album rock,2000.0,135.0,79.0,50.0,-11.0,17.0,81.0,207.0,17.0,7.0,39.0
2,3,Clint Eastwood,Gorillaz,alternative hip hop,2001.0,168.0,69.0,66.0,-9.0,7.0,52.0,341.0,2.0,17.0,69.0
3,4,The Pretender,Foo Fighters,alternative metal,2007.0,173.0,96.0,43.0,-4.0,3.0,37.0,269.0,0.0,4.0,76.0
4,5,Waitin' On A Sunny Day,Bruce Springsteen,classic rock,2002.0,106.0,82.0,58.0,-5.0,10.0,87.0,256.0,1.0,3.0,59.0
5,6,The Road Ahead (Miles Of The Unknown),City To City,alternative pop rock,2004.0,99.0,46.0,54.0,-9.0,14.0,14.0,247.0,0.0,2.0,45.0
6,7,She Will Be Loved,Maroon 5,pop,2002.0,102.0,71.0,71.0,-6.0,13.0,54.0,257.0,6.0,3.0,74.0
7,8,Knights of Cydonia,Muse,modern rock,2006.0,137.0,96.0,37.0,-5.0,12.0,21.0,366.0,0.0,14.0,69.0
8,9,Mr. Brightside,The Killers,modern rock,2004.0,148.0,92.0,36.0,-4.0,10.0,23.0,223.0,0.0,8.0,77.0
9,10,Without Me,Eminem,detroit hip hop,2002.0,112.0,67.0,91.0,-3.0,24.0,66.0,290.0,0.0,7.0,82.0


Next, we convert the columns to suitable data types.

In [9]:
# new_top was derived from top, so take an explicit copy(); the assignment below
# then modifies new_top itself and pandas does not raise SettingWithCopyWarning
new_top = new_top[new_top['Year'].notnull()].copy()
# convert 'Year' from float to int
new_top['Year'] = new_top['Year'].astype(int)
new_top.head(5)

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,adult standards,2004,157.0,30.0,53.0,-14.0,11.0,68.0,201.0,94.0,3.0,71.0
1,2,Black Night,Deep Purple,album rock,2000,135.0,79.0,50.0,-11.0,17.0,81.0,207.0,17.0,7.0,39.0
2,3,Clint Eastwood,Gorillaz,alternative hip hop,2001,168.0,69.0,66.0,-9.0,7.0,52.0,341.0,2.0,17.0,69.0
3,4,The Pretender,Foo Fighters,alternative metal,2007,173.0,96.0,43.0,-4.0,3.0,37.0,269.0,0.0,4.0,76.0
4,5,Waitin' On A Sunny Day,Bruce Springsteen,classic rock,2002,106.0,82.0,58.0,-5.0,10.0,87.0,256.0,1.0,3.0,59.0
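
The same conversion could be applied to the other numeric columns, since the values shown above are whole numbers stored as floats. A minimal sketch, assuming the rows with missing values have already been removed; the `sample` frame is a small stand-in built from two rows of the output above.

```python
import pandas as pd

# Small stand-in for new_top, built from two rows of the output above
sample = pd.DataFrame(
    {
        "Title": ["Sunrise", "Black Night"],
        "Year": [2004.0, 2000.0],
        "Energy": [30.0, 79.0],
        "Loudness (dB)": [-14.0, -11.0],
    }
)

# Cast a float column to int only if every value is a whole number;
# columns that still contain NaN fail this check and stay float
for col in sample.select_dtypes(include="float").columns:
    if (sample[col] % 1 == 0).all():
        sample[col] = sample[col].astype(int)

print(sample.dtypes)
```

Checking for whole numbers first avoids silently truncating a column that contains real fractional values.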
