#### **Importing modules and libraries**

In [7]:
#importing primary modules
import pandas as pd
import numpy as np

#importing visualisation libraries
import matplotlib.pyplot as plt
import seaborn as sns

#matplotlib for inline plot
%matplotlib inline

#import and setting warning
import warnings
warnings.filterwarnings('ignore')

#### Loading the dataset

In [19]:
#load the dataset
Spotify = pd.read_csv(r'C:\Users\DELL\Desktop\Spotify.csv')

#set maximum viewable columns
pd.set_option("display.max_columns", 30)

#create a copy of the original dataset
original_copy = Spotify.copy()

#viewing the first five rows
Spotify.head()

Unnamed: 0.1,Unnamed: 0,track_id,artists,album_name,track_name,popularity,duration_ms,explicit,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,track_genre
0,0,5SuOikwiRyPMVoIQDJUgSV,Gen Hoshino,Comedy,Comedy,73,230666,False,0.676,0.461,1,-6.746,0,0.143,0.0322,1e-06,0.358,0.715,87.917,4,acoustic
1,1,4qPNDBW1i3p13qLCt0Ki3A,Ben Woodward,Ghost (Acoustic),Ghost - Acoustic,55,149610,False,0.42,0.166,1,-17.235,1,0.0763,0.924,6e-06,0.101,0.267,77.489,4,acoustic
2,2,1iJBSr7s7jYXzM8EGcbK5b,Ingrid Michaelson;ZAYN,To Begin Again,To Begin Again,57,210826,False,0.438,0.359,0,-9.734,1,0.0557,0.21,0.0,0.117,0.12,76.332,4,acoustic
3,3,6lfxq3CG4xtTiEg7opyCyx,Kina Grannis,Crazy Rich Asians (Original Motion Picture Sou...,Can't Help Falling In Love,71,201933,False,0.266,0.0596,0,-18.515,1,0.0363,0.905,7.1e-05,0.132,0.143,181.74,3,acoustic
4,4,5vjLSffimiIP26QG5WcN2K,Chord Overstreet,Hold On,Hold On,82,198853,False,0.618,0.443,2,-9.681,1,0.0526,0.469,0.0,0.0829,0.167,119.949,4,acoustic


## **Column description**

| Column     | Description              |
|------------|--------------------------|
| `track_id` | The Spotify ID number of the track. |
| `artists` | Names of the artists who performed the track, separated by a `;` if there's more than one.|
| `album_name` | The name of the album that includes the track.|
| `track_name` | The name of the track.|
| `popularity` | Numerical value ranges from `0` to `100`, with `100` being the highest popularity. This is calculated based on the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently.|
| `duration_min` | The length of the track, measured in minutes.|
| `explicit` | Indicates whether the track contains explicit lyrics. `true` means it does, `false` means it does not or it's unknown.|
| `danceability` | A score ranges between `0.0` and `1.0` that represents the track's suitability for dancing. This is calculated by algorithm and is determined by factors like tempo, rhythm stability, beat strength, and regularity.|
| `energy` | A score ranges between `0.0` and `1.0` indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy.|
| `key` | The key the track is in. Integers map to pitches using standard Pitch class notation. E.g.`0 = C`, `1 = C♯/D♭`, `2 = D`, and so on. If no key was detected, the value is `-1`.| 
| `loudness` | The overall loudness, measured in decibels (dB).|
| `mode` |  The modality of a track, represented as `1` for major and `0` for minor.| 
| `speechiness` | Measures the amount of spoken words in a track. A value close to `1.0` denotes speech-based content, while `0.33` to `0.66` indicates a mix of speech and music like rap. Values below `0.33` are usually music and non-speech tracks.| 
| `acousticness` | A confidence measure ranges from `0.0` to `1.0`, with `1.0` representing the highest confidence that the track is acoustic.|
| `instrumentalness` | Instrumentalness estimates the likelihood of a track being instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken word tracks are classified as "vocal". A value closer to `1.0` indicates a higher probability that the track lacks vocal content.|
| `liveness` | A measure of the probability that the track was performed live. Scores above `0.8` indicate a high likelihood of the track being live.|
| `valence` | A score from `0.0` to `1.0` representing the track's positiveness. High scores suggest a more positive or happier track.|
| `tempo` | The track's estimated tempo, measured in beats per minute (BPM).|
| `time_signature` | An estimate of the track's time signature (meter), which is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from `3` to `7` indicating time signatures of `3/4`, to `7/4`.|
| `track_genre` |  The genre of the track.|


In [None]:
#size of the dataset
Spotify.shape

We have `114000` rows and `20` columns of music tracks in our dataset

#### Checking info summary of the data

In [None]:
Spotify.info()

Our dataset contains `20 columns and 114000 rows`. The summary also shows that there are some null values in the `artists`, `album_name` and `track_name` columns. We will now proceed to clean the data.

#### **Handling missing values**

In [None]:
#Checking for null values
Spotify.isnull().sum()

The result shows that the `artists`, `album_name` and `track_name` columns each have a single null value. We'll look further into these nulls to understand why the data is missing.

In [None]:
#A look at the artists null values
Spotify[Spotify.artists.isna()]

It turns out that the 3 null values fall in a single row. After careful observation of the other values in the same row, there is no pointer to what the missing values could have been, which suggests an entry error.

Therefore, we will drop this row from the dataset.

#### drop null row

In [None]:
#dropping null values
Spotify.dropna(inplace=True)

#### **Addressing duplicate values**

In [None]:
#view the total number of duplicate values
Spotify.duplicated().sum()

In [None]:
#Observing the duplicate rows further and keeping the first value
duplicates = Spotify[Spotify.duplicated(keep='first')]

duplicates

The table above lists the rows that exactly repeat an earlier row in every column, so their `track_id` values are repeated as well.

#### drop duplicate values

In [None]:
#dropping all duplicate values
Spotify.drop_duplicates( inplace=True)

#checking the new shape of the dataset
print(Spotify.shape)

After dropping duplicate values, we now have `113549 rows`.

#### Checking unique values in the track_id
We do this because the track ID is unique to each track. Different tracks can share the same artists and album name, but not the same ID.

In [None]:
#unique values 
Spotify['track_id'].nunique()

In [None]:
#checking for duplicates in the track_id
duplicate_track_ids = Spotify[Spotify['track_id'].duplicated(keep = 'first')]

#check the size of duplicate values
print(duplicate_track_ids.shape)

#checking counts of duplicate track_id
duplicate_track_ids.groupby('track_id')['track_id'].count()

There are only `89,740` unique track IDs among the `113,549` remaining rows, which means `23,809` rows repeat a track ID that has already appeared. To proceed with our analysis, these repeated instances must be removed.

#### drop duplicate track using their track_ids

In [None]:
#drop duplicate track_id
Spotify = Spotify[~Spotify['track_id'].duplicated(keep = 'first')].copy()

#check shape of new data
Spotify.shape

After dropping all duplicate track_id, we now have a total of `89740 songs` in our data set.
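
As a minimal sketch of what this step does, the toy frame below (hypothetical values, not taken from the dataset) shows that the boolean-mask approach and pandas' built-in `drop_duplicates(subset=...)` keep the same rows: the first occurrence of each `track_id`.

```python
import pandas as pd

# Toy data (hypothetical values): the same track_id listed under two genres
toy = pd.DataFrame({
    "track_id": ["a1", "a1", "b2", "c3", "c3"],
    "track_genre": ["pop", "dance", "rock", "jazz", "blues"],
})

# Boolean-mask approach: keep rows that do not repeat an earlier track_id
by_mask = toy[~toy["track_id"].duplicated(keep="first")]

# Built-in alternative: drop_duplicates restricted to the track_id column
by_method = toy.drop_duplicates(subset="track_id", keep="first")

print(by_method)
```

Note that keeping the first occurrence also keeps whichever genre happened to be listed first for a track that appears under several genres.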

### **Correcting inconsistencies**

The song duration is given in milliseconds. We'll convert it to minutes and drop the `duration_ms` column.

In [None]:
#convert time from milliseconds to minutes
Spotify['duration_min'] = round(Spotify['duration_ms'] / 60000,1)

#Drop duration_ms columns
Spotify.drop('duration_ms', axis=1, inplace=True)

In [None]:
#confirming changes
Spotify.head(2)

The `duration_ms` column has been replaced by `duration_min`, and the Spotify data frame still holds the `89740` rows left after removing duplicates.

## **STUDYING INDIVIDUAL VARIABLE SEPARATELY**

> 1. **Artists**

In [None]:
Spotify['artists'].nunique()

#### Top 5 contributing Artists


In [None]:
#setting the plot size
plt.figure(figsize = (10,5), dpi = 100)

#plotting the chart
sns.barplot( x = Spotify['artists'].value_counts()[0:5].values, y = Spotify['artists'].value_counts()[0:5].index, orient='h')

#setting the plot title and axis labels
plt.title('Top 5 Contributing Artists')
plt.xlabel('No. of tracks')
plt.ylabel('Artists name')

#display the plot
plt.show()

The chart shows the top 5 contributing artists: George Jones ranks highest, followed by my little airport and then the others.

> 2. **Album Name**

Most common album names 

In [None]:
#setting the plot size
plt.figure(figsize = (8,5), dpi = 100)

#plotting the chart
sns.barplot(y = Spotify['album_name'].value_counts()[0:5].index, x = Spotify['album_name'].value_counts()[0:5].values, orient="h")

#setting the plot title and axis labels
plt.title('Top 5 Most common Album names')
plt.ylabel('Album name')
plt.xlabel('No. of times')

#display the plot
plt.show()


`The complete Hank Williams` appears over `100` times, followed by `Greatest Hits` and the others. This could mean that songs from these two albums make up a large share of the dataset.
It could also mean that there are many songs with the same album name by different artists. We will take a further look at this in the bivariate analysis section.

> 3. **Popularity**

In [None]:
#Setting the plot size
plt.figure(figsize = (8,5), dpi=100)

#Plotting the popularity chart
sns.histplot(Spotify['popularity'], kde=False, fill = True)

#display the plot
plt.show()

Our Spotify data covers a wide range of songs, and the largest single share of it consists of tracks with popularity ratings from `0 to 1`. Beyond that spike, popularity is most concentrated between `20 and 60`, where we have a significant number of tracks.

While these tracks may not enjoy the same level of mainstream popularity, they present a rich and varied assortment for the audience in search of distinctive and less-familiar music. Whether it's hidden treasures or specialized genres, this collection appeals to a broad spectrum of music enthusiasts by curating a selection that extends beyond the most well-known chart-toppers. 

> 4. **Duration**



In [None]:
#plotting for time duration with tracks in minutes
plt.figure(figsize = (10,3), dpi = 100)

#define the plot
sns.kdeplot(Spotify['duration_min'], fill = True)

#set label values
plt.xlabel('Track duration (minutes)')

#display the plot
plt.show() 

The graph shows that track durations cluster around the 5-6 minute mark, which may reflect a common standard within the music industry.

The graph also shows a long tail of tracks running from above `5 minutes up to 80 minutes`.

This is an indication of outliers in the `duration_min` column. We investigate the `duration_min` column further to decide what to do with these values.

#### Checking for outliers in the `duration_min` column.

We will make use of a boxplot because it makes outliers easy to detect.

In [None]:
#figuresize
plt.figure(figsize=(15,8))

#define the plot
sns.boxplot(x = Spotify['duration_min'])

In [None]:
Spotify.describe()

The plot and summary statistics show that durations below `2.9` minutes (down to `0.1` min) and above `3.8` minutes (up to `87` min) are flagged as outliers. After manually checking some of the tracks in this category, the values appear to be genuine variation in the dataset rather than errors, so we will keep these outliers.
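
For reference, a boxplot with default settings draws its whiskers at 1.5 × IQR beyond the first and third quartiles, and points outside that range are plotted as outliers. The sketch below applies that rule to a small series of hypothetical durations (the values are invented for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical durations in minutes; 87.0 mimics the very long tracks seen above
durations = pd.Series([2.5, 3.0, 3.2, 3.5, 3.8, 4.0, 4.4, 87.0])

q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by a default boxplot
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the whiskers are the points a boxplot draws individually
outliers = durations[(durations < lower) | (durations > upper)]
print(outliers.tolist())
```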

5. **Explicit**

In [None]:
#setting the figure size
plt.figure(figsize = (4,4), dpi = 100)

#defining the plot
sns.barplot(x = Spotify['explicit'].value_counts().index, y = Spotify['explicit'].value_counts().values)

#Setting the title and labels
plt.xlabel('Contains profanity')
plt.ylabel('No. of tracks')

#display the plots
plt.show()

Including songs with explicit content in a company party playlist can create a different impression than it would at a personal or casual party. Company parties are meant to foster a professional and inclusive environment, so explicit songs may be seen as unprofessional and inappropriate. The music should align with the company's values and the expectations of the attendees. The graph shows that about 5,000 tracks contain explicit lyrics, so we will drop them.

#### drop explicit rows

In [None]:
#dropping all rows with explicit contents
Spotify.drop(Spotify[Spotify['explicit'] == True].index, inplace=True)

6. **Danceability**

In [None]:
#set figure size
plt.figure(figsize = (3,3), dpi = 100)

#define the plot
sns.kdeplot(Spotify['danceability'], fill = True)
plt.axvline(x = 0.5, color = 'red')  #reference line at the 0.5 midpoint

#set plot label
plt.xlabel('Danceability')

#display the plot
plt.show()

The density plot shows that danceability scores peak in the range of 0.5 to 0.6, which we treat as an average level of danceability. Tracks scoring above 0.6 are treated as highly danceable. The red line marks the 0.5 midpoint of the scale.

Since our primary aim is to get danceable songs for the summer party, we will drop the songs that fall below the average danceability.
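
A minimal sketch of such a filter is shown below. The track names and scores are hypothetical, and the `0.5` cut-off is an assumption based on the reference line drawn on the plot; the notebook's real data frame is `Spotify`.

```python
import pandas as pd

# Hypothetical tracks standing in for the Spotify data frame
tracks = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "danceability": [0.35, 0.50, 0.62, 0.81],
})

# Assumed cut-off: the 0.5 reference line used in the danceability plot
DANCE_THRESHOLD = 0.5

# Keep only tracks at or above the threshold
danceable = tracks[tracks["danceability"] >= DANCE_THRESHOLD]
print(danceable)
```

A different threshold, such as the column mean or median, could be substituted without changing the structure of the filter.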


7. **Energy**



In [None]:
plt.figure(figsize = (3,3), dpi = 100)
sns.kdeplot(Spotify['energy'], fill = True)
plt.axvline(x = 0.5, color = 'red')  #reference line at the 0.5 midpoint
plt.xlabel('Energy')
plt.show()

The density plot shows that most tracks have energy scores of 0.5 and above, which we treat as an average level of energy. Tracks scoring above 0.6 are treated as highly energetic.

8. **Key**



In [None]:
#Creating a categorical mapping named key names
Spotify['key_names'] = Spotify.key.replace({0: 'C', 1: "C#/Db",  2: 'D', 3: 'D#/Eb', 4: 'E',5:'F', 6: 'F#/Gb', 7: 'G',8: 'G#/Ab', 9: 'A',   10: 'A#/Bb', 11: 'B'})

In [None]:
#plotting key variation
plt.figure(figsize = (10,3), dpi = 100)
sns.barplot(x = Spotify['key_names'].value_counts().index, y = Spotify['key_names'].value_counts().values)
plt.xlabel('Key_names')
plt.ylabel('No. of tracks')
plt.show()

Upon analyzing the dataset, it becomes evident that the majority of songs are in the key of G, followed by the keys C and D. Conversely, the key D#/Eb appeared the least frequently in the dataset. This dataset provides valuable insights into the distribution of musical keys within the analyzed songs. The prevalence of songs in the key of G suggests its popularity among musicians, potentially due to its tonal qualities or ease of playability. Similarly, the occurrence of songs in the keys of C and D highlights their significance in the musical landscape. On the other hand, the infrequent appearance of the key D#/Eb indicates its lower prevalence compared to the other keys in the dataset.

9. **Loudness**



In [None]:
#plotting loudness value
plt.figure(figsize = (3,3), dpi = 100)

#plot definition
sns.kdeplot(Spotify['loudness'], fill = True)
plt.xlabel('Loudness')

#display plot
plt.show()

A KDE plot of the "loudness" column with a higher concentration of values between -20 and 0 suggests that a significant proportion of the songs in the dataset have a moderate to high loudness level. This likely indicates that a substantial portion of the songs are relatively loud or have a strong auditory presence.

In summary, the KDE plot of the "loudness" column with values concentrated between -20 and 0 highlights the prevalence of moderately loud songs in the dataset, offering valuable information for music analysis and genre classification.

10. **Mode**

#### converting the Mode from figures to major/minor



In [None]:
mode_key = {'1':'major', '0':'minor' }

#convert type
Spotify['mode'] = Spotify['mode'].astype(str) 

In [None]:
Spotify['mode'] = Spotify['mode'].replace(mode_key)

Within the dataset, there are several songs that are composed in the major mode. These songs exhibit a distinct tonal quality and convey a sense of brightness and positivity. The major mode is characterized by a specific pattern of intervals that create a harmonically pleasing and uplifting sound. By identifying the songs in the major mode, we can gain insights into the prevalence and popularity of this musical structure within the dataset. Furthermore, the presence of songs in the major mode suggests that musicians often utilize this mode to evoke emotions such as joy, optimism, and triumph in their compositions.

In [None]:
minor = Spotify[Spotify['mode'] =='minor']
minor

Within the dataset, there are several songs that are composed in the minor mode. These songs possess a distinct tonal quality that evokes a sense of melancholy, introspection, or even darkness. The minor mode is characterized by a specific pattern of intervals that create a somber and emotional atmosphere. By identifying the songs in the minor mode, we can gain insights into the prevalence and popularity of this musical structure within the dataset. Furthermore, the presence of songs in the minor mode suggests that musicians often utilize this mode to convey feelings of sadness, longing, or introspection in their compositions. The minor mode offers a rich and diverse range of emotions, allowing artists to explore and express a wide array of moods and sentiments in their music.

In [None]:
plt.figure(figsize = (4,4), dpi = 100)
sns.barplot(x = Spotify['mode'].value_counts().index, y = Spotify['mode'].value_counts().values)
plt.xlabel('Mode')
plt.ylabel('No. of tracks')
plt.show()


According to the chart, it can be observed that a significant number of songs, over 50,000, are composed in the major mode, while approximately 29,000 songs are in the minor mode. This indicates that a majority of the songs in the dataset convey a sense of positivity and evoke uplifting emotions. The prevalence of songs in the major mode suggests that musicians often utilize this mode to create a cheerful and optimistic atmosphere in their compositions. However, it is worth noting that there is still a substantial number of songs in the minor mode, which signifies that artists also recognize the power of evoking deeper emotions and introspection through their music. Overall, the dataset showcases a diverse range of musical expressions, with a significant focus on positivity and soul-lifting themes.

11. **Speechiness**



In [None]:
plt.figure(figsize = (10,3), dpi = 100)
sns.kdeplot(Spotify['speechiness'], fill = True)
plt.xlabel('speechiness')
plt.show()

The vast majority of tracks in the dataset have a speechiness score below 0.33, which indicates music rather than speech-based content. Note that speechiness measures spoken words, not singing, so a low score does not mean that a track is instrumental.

12. **Acousticness**

In [None]:
plt.figure(figsize = (3,3), dpi = 100)
sns.kdeplot(Spotify['acousticness'], fill = True)
plt.xlabel('acousticness')
plt.show()

Most tracks have low acousticness scores, which suggests that a large portion of the music relies on electronic amplification and digital processing (synthesizers, digital effects, or other electronic techniques) rather than purely acoustic instruments.

13. **Instrumentalness**

In [None]:
plt.figure(figsize = (3,3), dpi = 100)
sns.kdeplot(Spotify['instrumentalness'], fill = True)
plt.xlabel('Instrumentalness')
plt.show()

 The graph reveals that a significant portion of the songs in the dataset have a value of 0.0 for instrumentalness. This suggests that the majority of these songs contain vocals in their composition. The prevalence of songs with instrumentalness value of 0.0 indicates that vocals play a prominent role in the dataset, highlighting that the songs are primarily driven by vocal performances and lyrics. It implies that most of the songs in the dataset are not purely instrumental, but rather feature vocals as a key component of their musical arrangement.

14. **Liveness**


In [None]:
plt.figure(figsize = (3,3), dpi = 100)
sns.kdeplot(Spotify['liveness'], fill = True)
plt.xlabel('liveness')
plt.show()

The data indicates that a majority of the songs in the dataset have a low liveness value. This suggests that most of these songs are not performed live and are more likely to be studio recordings. The prevalence of songs with low liveness values implies that the dataset consists primarily of tracks that lack the characteristic ambiance, audience interaction, or live performance energy associated with live recordings. It suggests that the majority of the songs in the dataset are not intended to be experienced as live performances but rather as studio-produced tracks.

15. **Valence**

In [None]:
plt.figure(figsize = (3,3), dpi = 100)
sns.histplot(Spotify['valence'], bins =25)
plt.xlabel('Valence')
plt.show()

16. **Tempo**

In [None]:
plt.figure(figsize = (3,3), dpi = 100)
sns.kdeplot(Spotify['tempo'], fill = True)
plt.xlabel('tempo')
plt.show()

The majority of the songs in the dataset have a tempo that falls within the range of approximately 90 to 150. This indicates that most of these songs have a moderate to moderately fast pace. The prevalence of songs within this tempo range suggests that it is a commonly preferred tempo for the analyzed dataset. It implies that the songs in this range are likely to have a similar energetic feel and rhythmic characteristics, contributing to a cohesive musical experience.

17. **Time Signature**




In [None]:
plt.figure(figsize = (5,4), dpi = 100)
sns.barplot(x = Spotify['time_signature'].value_counts().index, y = Spotify['time_signature'].value_counts().values)
plt.xlabel('Beats per Bar/Measure')
plt.ylabel('No. of Tracks')
plt.show()

The majority of the songs in the dataset are written in a 4/4 time signature. This means that these songs have four beats per measure, with a quarter note receiving one beat. The prevalence of songs with this time signature suggests that it is a commonly used and familiar rhythmic structure within the analyzed dataset. The consistent use of 4/4 time signature contributes to a cohesive and easily recognizable rhythmic framework for these songs.

18. **Track Genre**

In [None]:
#setting plot size
plt.figure(figsize = (10,3), dpi = 100)

#Define the plot
sns.barplot(x = Spotify['track_genre'].value_counts()[0:20].index, y = Spotify['track_genre'].value_counts()[0:20].values)

#set title and label
plt.xlabel('Genre')
plt.ylabel('No. of Tracks')
plt.xticks(rotation = 90)
plt.title('Top 20 Genres')

#display the plot
plt.show()

# **Analysis of two Variables & Multivariate Analysis**

#### understanding the correlation between variables

In [None]:
#plot size
plt.figure(figsize=(20,8))

# setting the plot
sns.heatmap(Spotify.drop(columns='explicit').corr(numeric_only=True), vmax=1, vmin=-1, center=0,
            linewidths=.5, square=True, annot = True,
            fmt='.1f', cmap='BrBG_r',
            cbar_kws = dict(use_gridspec=False,location="top", shrink=0.9))


#set plot title
plt.title('Correlation plot')

#display the plot
plt.show()

There is a **strong positive correlation** between `energy and loudness` and a **strong negative correlation** between `energy and acousticness`, while `valence and danceability` have a **moderate positive correlation**. `loudness and acousticness` have a **moderate negative correlation**, and the remaining columns show minimal correlation with one another.
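
Reading the strongest pairs off a large heatmap can be error-prone. As a sketch, the correlation matrix can be flattened and ranked by absolute value. The data below is synthetic (generated only so the example runs on its own); in the notebook the same steps would apply to the numeric columns of `Spotify`.

```python
import numpy as np
import pandas as pd

# Synthetic numeric features standing in for the Spotify columns
rng = np.random.default_rng(0)
energy = rng.random(200)
features = pd.DataFrame({
    "energy": energy,
    "loudness": energy * 20 - 25 + rng.normal(0, 1, 200),
    "acousticness": 1 - energy + rng.normal(0, 0.1, 200),
    "tempo": rng.normal(120, 20, 200),
})

corr = features.corr()

# Keep only the upper triangle so each pair appears once, then flatten
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation strength, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```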

 **Energy VS Loudness**

In [None]:
#set plot size
plt.figure(figsize=(15,8))

#define plot
plt.scatter(x=Spotify['energy'], y=Spotify['loudness'])

#set labels
plt.title('Relationship between energy and loudness')
plt.xlabel('Energy')
plt.ylabel('Loudness')

#display the plot
plt.show()

1. **Track_name vs Danceability**

In [None]:
Spotify.query('track_name == "Bitches"') 

#Track has explicit value of False despite having an explicit track_name and content

In [None]:
grouped = Spotify.groupby('track_name')['danceability'].mean()

#set plot size
plt.figure(figsize = (10,5), dpi = 100)

#define the plot
sns.barplot(y= grouped.sort_values(ascending = False)[0:10].index, x = grouped.sort_values(ascending = False)[0:10].values)

#setting plot labels
plt.xlabel('Danceability')
plt.ylabel('Track')
# plt.xticks(rotation = 90)
plt.title('Top 10 Most Danceable Songs')

#display plot
plt.show()

 By leveraging the track name and danceability attributes, we successfully curated a selection of the dataset's top 10 danceable songs. Through this process, we were able to identify and sort out the tracks that exhibited the highest levels of danceability, allowing us to present a refined and captivating collection of music for dance enthusiasts.
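
One caveat with grouping by `track_name` alone is that different songs sharing a title are averaged together. The sketch below uses hypothetical rows (the artist names are placeholders) to show the difference between grouping by title only and grouping by title and artist.

```python
import pandas as pd

# Hypothetical rows: two different songs that happen to share a title
tracks = pd.DataFrame({
    "track_name": ["Hold On", "Hold On", "Comedy"],
    "artists": ["Artist X", "Artist Y", "Artist Z"],
    "danceability": [0.40, 0.80, 0.68],
})

# Grouping by title alone blends both "Hold On" tracks into one average
by_title = tracks.groupby("track_name")["danceability"].mean()

# Grouping by title and artist keeps the two songs apart
by_title_artist = tracks.groupby(["track_name", "artists"])["danceability"].mean()

print(by_title)
print(by_title_artist)
```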

2. **Track Name VS Popularity**

In [None]:
grouped = Spotify.groupby('track_name')['popularity'].mean()

plt.figure(figsize = (10,3), dpi = 100)
sns.barplot(x = grouped.sort_values(ascending = False)[0:10].index, y = grouped.sort_values(ascending = False)[0:10].values)
plt.xlabel('Tracks')
plt.ylabel('Popularity')
plt.xticks(rotation = 90)
plt.title('Top 10 Most popular Tracks')
plt.show()

By utilizing the track name and popularity metrics, we were able to effectively categorize and identify the top 10 popular songs within the dataset. Through this analysis, we were able to sort the songs based on their level of popularity, allowing us to present a curated list of the most widely recognized and well-received tracks. This approach provides valuable insights into the preferences and trends of music listeners, enabling us to highlight the songs that have garnered significant attention and acclaim.

3. **Track Name VS Energy**

In [None]:
#grouped still holds the mean popularity per track name from the previous section
grouped.sort_values(ascending = False)[0:30]

In [None]:
grouped = Spotify.groupby('track_name')['energy'].mean()

plt.figure(figsize = (10,3), dpi = 100)
sns.barplot(x = grouped.sort_values(ascending = False)[0:10].index, y = grouped.sort_values(ascending = False)[0:10].values)
plt.xlabel('Tracks')
plt.ylabel('Energy')
plt.xticks(rotation = 90)
plt.title('Top 10 Most Energetic Tracks')
plt.show()

By employing the track name and energy attributes, we successfully organized the dataset to identify the top 10 energetic songs. This process involved analyzing the energy levels of each track and sorting them accordingly, allowing us to present a curated selection of high-energy songs. By focusing on the energy aspect, we were able to highlight the tracks that exude vibrancy, excitement, and a dynamic musical experience. This compilation provides an electrifying playlist for those seeking a boost of energy and enthusiasm in their music.

In [None]:
# len(grouped)
# grouped['tempo'].sort_values(ascending=False).plot(kind='bar')

4. **Track_Name VS Tempo**

In [None]:
groupTempo = Spotify[(Spotify['tempo'] >= 120) & (Spotify['tempo'] <= 140)]
grouped = groupTempo.groupby('track_name')['tempo'].mean()
plt.figure(figsize = (10,3), dpi = 100)
sns.barplot(  x = grouped.sort_values( ascending = False)[0:10].index, y = grouped.sort_values( ascending = False)[0:10].values)
plt.xlabel('Tracks')
plt.ylabel('Tempo')
plt.xticks(rotation = 90)
plt.title('Top 10 Tracks with Danceable Tempo')
plt.show()

We selected the top 10 tracks that fall within the danceable tempo range of 120 to 140 beats per minute (BPM). This tempo range is known to be conducive to dancing, providing a lively and energetic rhythm. By focusing on tracks within this range, we ensured that the chosen songs would have a tempo that aligns with the desired danceability criteria.
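
As a small illustration of the range filter described here, `Series.between` expresses the same condition in one call and is inclusive on both ends by default. The tracks below are hypothetical.

```python
import pandas as pd

# Hypothetical tracks with tempo in beats per minute
tracks = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "tempo": [87.9, 120.0, 133.5, 181.7],
})

# Keep tracks whose tempo lies in the 120-140 BPM range (both ends included)
in_range = tracks[tracks["tempo"].between(120, 140)]
print(in_range)
```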

5. **Track_Name VS Valence**

In [None]:
grouped = Spotify.groupby('track_name')['valence'].mean()

plt.figure(figsize = (10,3), dpi = 100)
sns.barplot(x = grouped.sort_values(ascending = False)[0:10].index, y = grouped.sort_values(ascending = False)[0:10].values)
plt.xlabel('Tracks')
plt.ylabel('valence')
plt.xticks(rotation = 90)
plt.title('Top 10 Tracks with positive valence')
plt.show()

Songs with a positive valence have the power to uplift the mood and create an atmosphere that is lively and happy. These songs emit a sense of positivity and joy, infusing the room with an uplifting energy. With their upbeat melodies, catchy rhythms, and optimistic lyrics, songs with positive valence can bring a smile to people's faces and encourage them to dance, sing along, or simply enjoy the moment. Whether it's a social gathering, a party, or even a personal listening experience, these songs have the ability to enhance the overall mood and create a vibrant and cheerful ambiance.

6. **Popularity VS Genre**

In [None]:
grouped = Spotify.groupby('track_genre')['popularity'].mean()

plt.figure(figsize=(10, 3), dpi=100)
sns.barplot(x=grouped.sort_values(ascending=False)[0:10].index, y=grouped.sort_values(ascending=False)[0:10].values)
plt.xlabel('Genres')
plt.ylabel('Avg popularity')
plt.xticks(rotation=90)
plt.title('Top 10 Most popular Genres')
plt.show()

This analysis reveals the top 10 most popular genres in the dataset, ranked by the average popularity score of their tracks. Note that this ranking reflects how popular a genre's tracks are on average, not how often the genre occurs in the dataset. It gives insight into the genres that resonate most with listeners.
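
The distinction between a genre's average popularity and its frequency can be made explicit by computing both in one aggregation. The sketch below uses hypothetical rows; the genre labels and scores are for illustration only.

```python
import pandas as pd

# Hypothetical tracks: "pop" is the most frequent genre, "k-pop" the most popular
tracks = pd.DataFrame({
    "track_genre": ["pop", "pop", "pop", "k-pop", "sleep"],
    "popularity": [40, 50, 60, 85, 10],
})

# Mean popularity and number of tracks per genre, side by side
summary = tracks.groupby("track_genre")["popularity"].agg(["mean", "count"])

most_popular = summary["mean"].idxmax()    # highest average popularity
most_frequent = summary["count"].idxmax()  # most tracks in the data
print(summary)
```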

7. **Danceability VS Genre**

In [None]:
genre_spotify = Spotify.groupby('track_genre')['danceability'].mean()

plt.figure(figsize = (12,5), dpi = 100)
sns.barplot(x = genre_spotify.sort_values(ascending = False)[0:10].index,
            y = genre_spotify.sort_values(ascending = False)[0:10].values)
plt.xlabel('Genre')
plt.ylabel('Avg Danceability')
plt.title('Top 10 Danceable Genres')

plt.show()

After analyzing the dataset, we have identified the top 10 most danceable genres. These genres stand out for their high danceability scores, indicating their suitability for getting people on their feet and moving to the rhythm. By considering factors such as tempo, rhythm, and beat, the danceability metric provides insights into the genres that are most likely to inspire and encourage dancing. These top 10 genres represent the ones with the highest danceability ratings within the dataset, showcasing their ability to create an energetic and lively atmosphere that is perfect for dancing and enjoying the music.

In [None]:
Spotify.sort_values(by = 'popularity', ascending = False)[['track_name', 'artists', 'danceability','popularity','track_genre','valence','tempo','energy']].head(50)

# **Data Preprocessing**

Create a copy of the dataframe

In [None]:
spotify_copy = Spotify.copy()

spotify_copy.head(2)

#### Checking and removing outliers in the data using boxplot

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(data=spotify_copy, orient='h')

The spread of outliers varies from column to column. Each column will be investigated and its outliers removed. The `popularity` and `tempo` columns will be left as they are because they are secondary features in determining whether a track is danceable.

## Removing Outliers using Loudness
Using the boxplot's interquartile range (IQR), we keep values between the 25th percentile (first quartile) and the 75th percentile (third quartile).

In [None]:
spotify_copy.loudness.describe()

In [None]:
spotify_copy = spotify_copy[(spotify_copy.loudness >= -10.56) & (spotify_copy.loudness <= -5.17)]

In [None]:
spotify_copy.head(2)

In [None]:
spotify_copy.shape
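
The loudness bounds above are typed in by hand from the `describe()` output. As a minimal sketch, the same kind of bounds can be computed with `Series.quantile`. The `loudness` values below are made up for illustration, and the 1.5 × IQR whisker rule is shown only as a common alternative; it is not what this notebook uses.

```python
import pandas as pd

# Made-up loudness values in dB, for illustration only
loudness = pd.Series([-30.0, -12.0, -10.0, -8.0, -7.0, -6.0, -5.0, -4.0, -1.0])

q1 = loudness.quantile(0.25)   # 25th percentile (first quartile)
q3 = loudness.quantile(0.75)   # 75th percentile (third quartile)
iqr = q3 - q1

# Option A: keep only the middle 50% of values, i.e. the range [Q1, Q3]
middle_half = loudness[loudness.between(q1, q3)]

# Option B: conventional boxplot whiskers, a much looser filter
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
within_whiskers = loudness[loudness.between(lower, upper)]

print(len(middle_half), len(within_whiskers))
```

Option A discards half of the rows by construction, so it is a strict selection rather than outlier removal in the usual sense; Option B removes only the points a boxplot would draw as outliers.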

## Removing Outliers using Duration

In [None]:
spotify_copy.duration_min.describe()

Since this is a company party, events will be scheduled. Assuming the playlist takes the place of a DJ, long tracks are unsuitable.

Shorter tracks tend to keep the mood lively as the playlist moves from one genre to another. We use the `25%` and `75%` values from the summary above as the minimum and maximum reference points, and the filter below keeps tracks shorter than 4.2 minutes.

In [None]:
spotify_copy = spotify_copy.query('duration_min  < 4.2')

In [None]:
spotify_copy

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(data=spotify_copy, orient='h')

## Observing the track genres for genres that might be considered undanceable

In [None]:
genres = spotify_copy.track_genre.values

In [None]:
genres_unique = set(genres.tolist())

### Number of genres in our dataset

In [None]:
len(genres_unique)

Observing genre categories

From the above result, genres like `kids, children, sleep and sad` contain noise (based on practical examination) and are not suitable for creating an electrifying mood. These genres will be dropped.

In [None]:
spotify_copy

Looking at the tail of the dataframe, artists like `Hillsong, Tenth avenue, Bryan & katie Torwait` stood out. A further look at their profiles showed that they are Gospel/Christian artists. We will look into the `world-music` genre and examine its artists' profiles.

In [None]:
world_music = spotify_copy.query('track_genre == "world-music"')

In [None]:
print(set(world_music.artists.tolist()))

`world-music` is a broad and inclusive term used to describe a genre of music that encompasses a wide range of musical styles and traditions from around the world. It is not a specific genre but rather a category that serves as an umbrella term for music that originates from various cultures, regions, and traditions.  

From the above list, there are more Christian artists than traditional and other cultural artists. Therefore, the `world-music, kids, children, sleep and sad` genres will be dropped.

In [None]:
genres_to_drop = ['world-music', 'kids', 'children', 'sleep', 'sad']
spotify_copy = spotify_copy[~spotify_copy['track_genre'].isin(genres_to_drop)]

## Dropping tracks based on liveness

Live tracks tend to be performed at the discretion of the artist, unlike studio recordings, and are best suited to their immediate audience. Live tracks, i.e. tracks with a liveness of 0.8 or above, will therefore be dropped.

In [None]:
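live_tracks = spotify_copy.query('liveness >= 0.8')

# remove the live tracks from the working dataframe
spotify_copy = spotify_copy.drop(live_tracks.index)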

#### **Drop unneeded columns**

In [None]:
spotify_df = spotify_copy.drop(columns=['track_id', 'album_name', 'explicit', 'mode', 'key', 'key_names'])

### **Dimensionality reduction**

#### Reducing the dataset by filtering rows on mean-based thresholds

In [None]:
spotify_df = spotify_df.query('instrumentalness < 0.11 & valence > 0.56 & tempo > 0.4 & energy > 0.67')

In [None]:
spotify_df.tempo.describe()

In [None]:
spotify_df.energy.mean()
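
The cut-offs in the `query` above are written as constants. Assuming they were chosen from column means, as the heading suggests, a minimal sketch of deriving such thresholds from the data could look like this. The `toy` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Made-up values standing in for two audio features
toy = pd.DataFrame({
    "valence": [0.2, 0.5, 0.9, 0.8],
    "energy":  [0.3, 0.8, 0.9, 0.4],
})

cols = ["valence", "energy"]

# Column means used as cut-offs (assumption: the constants were picked this way)
thresholds = toy[cols].mean()

# Keep only the rows that exceed the mean in every listed column
mask = (toy[cols] > thresholds).all(axis=1)
filtered = toy[mask]

print(thresholds)
print(filtered)
```

Computing the thresholds keeps them in step with the data if an earlier filtering step changes, at the cost of making the cut-offs less visible than literal numbers.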

In [None]:
rap = spotify_df.query('speechiness >= 0.33 & speechiness <= 0.66')

From our data description, tracks with a speechiness value between `0.33 and 0.66` can be regarded as rap songs.

In [None]:
rap_tracks = spotify_df.query('speechiness >= 0.33 & speechiness <= 0.66')
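
As a rough sketch, the same speechiness ranges can be turned into labelled bands with `pd.cut`, which makes it easy to count how many tracks fall in each one. The values and the band names (`low`, `mid (rap range)`, `high`) are assumptions made for illustration.

```python
import pandas as pd

# Made-up speechiness values
speechiness = pd.Series([0.05, 0.20, 0.40, 0.60, 0.90])

# Bin edges follow the 0.33 and 0.66 cut-offs used in this notebook
bands = pd.cut(
    speechiness,
    bins=[0.0, 0.33, 0.66, 1.0],
    labels=["low", "mid (rap range)", "high"],
    include_lowest=True,
)

print(bands.value_counts(sort=False))
```

Note that `pd.cut` uses right-closed intervals by default, so a value of exactly 0.33 lands in the `low` band here, whereas the `query` above includes it in the rap range.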

#### Observe columns by variance to understand which requires scaling

In [None]:
spotify_df.var(numeric_only=True)

In [None]:
spotify_df.head(2)

## Scaling columns

The variances of `popularity, loudness, tempo, and duration_min` are far out of line with those of the other columns. We will now scale these values to put them on the same scale.

In [None]:
from sklearn.preprocessing import MinMaxScaler

columns_to_scale = ['popularity', 'tempo', 'loudness', 'duration_min']
scaler = MinMaxScaler()

# Apply Min-Max scaling to the selected columns and replace the original values
spotify_df[columns_to_scale] = scaler.fit_transform(spotify_df[columns_to_scale])
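
With its default settings, `MinMaxScaler` maps each column to the range 0 to 1 using x' = (x - min) / (max - min). The sketch below applies that formula directly in pandas on made-up values, to show what the scaler does to each column.

```python
import pandas as pd

# Made-up values for two columns on very different scales
toy = pd.DataFrame({
    "tempo":    [60.0, 120.0, 180.0],
    "loudness": [-10.0, -7.5, -5.0],
})

# Min-max scaling applied column by column: x' = (x - min) / (max - min)
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1, so columns measured in BPM and in dB end up directly comparable.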

In [None]:
spotify_df.shape

In [None]:
#rechecking the variance of the scaled values
spotify_df.var(numeric_only=True)

## Training and evaluating our RandomForestRegressor model

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Use the preprocessed dataframe
data = spotify_df

# Features and target variable
features = ['energy', 'valence', 'liveness', 'speechiness', 'tempo', 'acousticness']
target = 'danceability'

# Split the data into training and testing sets
X = data[features]
y = data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Random Forest Regressor model
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
random_forest_model.fit(X_train, y_train)

# Predict danceability scores
y_pred = random_forest_model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R2): {r2}")
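
To help read the two metrics, here is a small worked sketch that computes them by hand with NumPy on made-up numbers. The arrays and the `_manual` names are illustrative assumptions and are not outputs of the model above.

```python
import numpy as np

# Made-up true and predicted danceability scores
y_true = np.array([0.50, 0.60, 0.70, 0.80])
y_hat = np.array([0.55, 0.60, 0.65, 0.80])

# MSE: mean of the squared prediction errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, r2_manual)
```

A model that always predicts the mean of `y_true` scores an R2 of 0, so values near 1 mean the features explain most of the variation in danceability, while values near 0 mean they explain little.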

Energy was selected because a relationship was observed between energy and danceability while practically testing the songs (i.e. if energy and danceability are both above 0.56, the song gives danceable vibes). Valence was also included because it determines how happy and lively a track is.

## Predict and recommend danceability using our model

In [None]:
# Predict danceability scores for all songs 
predicted_danceability = random_forest_model.predict(data[features])

# Add the predicted danceability scores to the data
data['predicted_danceability'] = predicted_danceability

# Sort the dataset by predicted danceability in descending order
recommended_songs = data.sort_values(by='predicted_danceability', ascending=False)

# Select the top 50 recommended songs
top_50_recommendations = recommended_songs.head(50)

# Display the top 50 recommended songs
print("Top 50 Recommended Songs based on Predicted Danceability:")
top_50_recommendations

In [None]:
plt.figure(figsize=(15,8))
recommended_songs.track_genre.value_counts().plot(kind='bar')

The metrics i used to overall

In [None]:
# Valence: a factor for identifying happy and/or cheerful songs.
# The closer the value is to 1, the more positive (cheerful, happy, euphoric) the song is.
# energy_dance = Spotify[(Spotify['danceability'] > 0.56) & (Spotify['energy'] > 0.64)]

# Liveness: from observation, liveness does not affect the danceability of the song.
# There is gospel music in various categories like alternate, world-music, children, sad, emo, goth.
# energy_dance = energy_dance[energy_dance['track_genre'] != 'world-music']

# Instrumentalness: songs with instrumentalness above average (0.15) tend to be slow
# despite having a high tempo and a high danceability value.
# energy_dance[energy_dance['instrumentalness'] > 0.15].head(23)
# Although it doesn't show in the correlation matrix, instrumentalness is a cogent factor of fast and danceable music.

# Key and mode: drop the key and mode columns.

# Speechiness: drop songs with low speechiness; average speechiness is rap.

# Acousticness: there is a negative correlation between acousticness and energy.