# Portfolio Task 2: Programming Test


# Task 2

The exercises below require these data files:

- **songs.csv** contains details on individual songs, e.g. the song name, artist names, genre, duration, and further variables.
- **charts_xx.csv**: there are 4 files, one for each of the countries Germany (de), United States (us), Great Britain (gb) and Spain (es). Each file contains the Top 200 Spotify charts of that country for December 24 of the years 2017 to 2022, e.g. a song identifier, the position in the ranking, and the number of streams.
- **charts_backup.csv**: _a backup file that combines all the charts_xx.csv files; to be used ONLY in exercise 6 in case you did not manage to solve exercise 5_


Import packages


In [144]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import glob

# Exercise 1 (15 points)


Read the songs data and display the first 5 rows.


In [145]:
df_songs = pd.read_csv('task2/data/songs.csv')
df_songs.head(5)

|   | song_id | song_name | artist_names | genre | duration_ms | danceability | valence | tempo | explicit | lyrics |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5XM67HdnPub0nMDqJEUPdy | The Best Side Of Life | Sarah Connor | pop | 229333 | 0.631 | 0.554 | 90.947 | False | Is calling me home \n The best time of the yea... |
| 1 | 1xK1Gg9SxG8fy2Ya373oqb | Bandido | Myke Towers, Juhn | Reggaeton | 232853 | 0.713 | 0.682 | 168.021 | False | Ella e' buena, pero le gustan lo' malo' (Malo'... |
| 2 | 1mlGScrDQqHqmhdIqE8MmA | Besos En Guerra | Morat, Juanes | pop | 231533 | 0.69 | 0.772 | 143.957 | False | ¿Quién te dijo esa mentira? \n Que eras fácil ... |
| 3 | 4J2HLNTxiVxxs6kWgTIN43 | Porfa no te vayas | Beret, Morat | spanish | 209712 | 0.589 | 0.792 | 171.976 | False | Recuerdo aquel verano que pasé contigo \n Y ca... |
| 4 | 0e8nrvls4Qqv5Rfa2UhqmO | THATS WHAT I WANT | Lil Nas X | Hip-Hop | 143901 | 0.737 | 0.546 | 87.981 | True | One, two, three, four \n \n Need a boy who ca... |


The column `duration_ms` contains the duration of songs in milliseconds. Derive from it a new column `duration` that contains the duration in minutes. Then delete the column `duration_ms`. Display the `song_name`, `artist_names` and `duration` of the 5 longest songs.


In [146]:
df_songs['duration'] = df_songs.duration_ms / 1000 / 60
df_songs.drop(labels='duration_ms', axis=1, inplace=True)

df_songs.sort_values(by='duration', ascending=False)[['song_name','artist_names','duration']].head(5)

|   | song_name | artist_names | duration |
|---|---|---|---|
| 274 | All Too Well (10 Minute Version) (Taylor's Ver... | Taylor Swift | 10.217117 |
| 1346 | Weihnachtsoratorium, BWV 248: Part I: For the ... | Johann Sebastian Bach, Nikolaus Harnoncourt | 7.97755 |
| 789 | 105 F Remix | KEVVO, Farruko, Chencho Corleone, Arcangel, Da... | 7.736667 |
| 620 | Only Fans - Remix | Young Martino, Lunay, Myke Towers, Jhay Cortez... | 7.04295 |
| 1256 | Te Boté - Remix | Nio Garcia, Casper Magico, Bad Bunny, Darell, ... | 6.965333 |


# Exercise 2 (15 points)


Songs can be uniquely identified via their `song_id`, while a given `song_name` may be performed by several different artists.
How many distinct `song_id` values and how many distinct `song_name` values does the data set contain? Print a short answer text.


In [147]:
num_songs_ids = df_songs.song_id.nunique()
num_songs_names = df_songs.song_name.nunique()

print(f"There are {num_songs_ids} different song_ids and {num_songs_names} different song_names.")

There are 2132 different song_ids and 1851 different song_names.


Which `song_name` is the most frequent? Display a subset of the songs data set that contains only rows with this song name.


In [156]:
most_frequent_song_name = df_songs.song_name.mode()[0]
# Alternatives that give the same result when a single name is the most frequent:
# most_frequent_song_name = df_songs.value_counts('song_name').index[0]
# most_frequent_song_name = df_songs.groupby('song_name').size().idxmax()

df_songs[df_songs.song_name == most_frequent_song_name]
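
The three ways of finding the most frequent value can be compared on a small example. This is only a sketch on hypothetical toy data (the song names and ids below are invented for illustration), with a single most frequent name so that ties play no role.

```python
import pandas as pd

# Hypothetical toy data; names and ids are illustrative only.
toy = pd.DataFrame({
    "song_id": ["a", "b", "c", "d"],
    "song_name": ["Home", "Stay", "Home", "Home"],
})

# Three routes to the most frequent song name.
by_mode = toy.song_name.mode()[0]
by_counts = toy.song_name.value_counts().index[0]
by_group = toy.groupby("song_name").size().idxmax()

# Subset with only the rows that carry this name.
subset = toy[toy.song_name == by_mode]
print(by_mode, by_counts, by_group, len(subset))
```

If several names share the top count, the three approaches are not guaranteed to pick the same one, so it may be worth checking for ties first.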

# Exercise 3 (15 points)


Calculate the average `tempo` per genre, and display the 5 genres with the highest average tempo.


In [149]:
pd.DataFrame(df_songs.groupby('genre').tempo.mean().nlargest(5))

| genre | tempo |
|---|---|
| Corridos tumbados | 200.156 |
| Rock and Roll | 190.448 |
| Psychedelic Rock | 182.008 |
| Classical | 177.171 |
| hardstyle | 172.0 |


If a song is marked as `explicit`, it means that it contains strong language, reference to violence, sexual activity, etc. What percentage of songs are marked as `explicit`? Print a short answer text.


In [150]:
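# astype(str) makes the comparison work whether the flags are stored as booleans or as strings
is_explicit = df_songs.explicit.astype(str) == 'True'
percentage_explicit = is_explicit.sum() / df_songs.explicit.notna().sum() * 100
print(f"{percentage_explicit:.2f}% of songs are marked as explicit")

29.22% of songs are marked as explicit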
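
The share of a flag with missing values can also be computed as the mean of a boolean Series. The following is a minimal sketch on hypothetical data, assuming the flags are real Python booleans with some missing entries; if the column held the strings `'True'` and `'False'` instead, `astype(bool)` would treat every non-empty string as `True` and the result would be wrong.

```python
import pandas as pd

# Hypothetical flags: two explicit, one clean, one missing.
flags = pd.Series([True, False, None, True], name="explicit")

# Drop missing entries, then average the booleans (True counts as 1).
known = flags.dropna().astype(bool)
share_explicit = known.mean() * 100

print(f"{share_explicit:.2f}% of songs with a known flag are marked as explicit")
```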


# Exercise 4 (15 points)


How many songs use the word "Christmas" in the lyrics? Print a short answer text.


In [151]:
number_christmas_songs = df_songs.lyrics.str.contains("CHRISTMAS",case=False).sum()
print(f"The number of songs that use the word \"Christmas\" is {number_christmas_songs}")

The number of songs that use the word "Christmas" is 282


Count for each song how often the word "love" occurs in the `lyrics`. Display the `song_name`, `artist_names` and the number of occurrences of "love" for the 5 songs with the highest number of occurrences.


In [152]:
df_songs["love_count"] = df_songs.lyrics.str.upper().str.count("LOVE")
df_songs[['song_name','love_count']].sort_values('love_count',ascending=False).head(5)

|   | song_name | love_count |
|---|---|---|
| 2099 | My Lover | 55.0 |
| 929 | Lose You To Love Me | 48.0 |
| 1628 | Bring Me Love | 34.0 |
| 1456 | Bring Me Love | 34.0 |
| 581 | Bring Me Love | 34.0 |
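
Counting the substring "LOVE" also matches words such as "lover" or "gloves". If only the whole word should count, a regular expression with word boundaries is one option. The sketch below uses hypothetical toy data (song and artist names are placeholders), fills missing lyrics with a count of 0, and includes `artist_names` in the displayed columns.

```python
import re

import pandas as pd

# Hypothetical toy data; names and lyrics are illustrative only.
toy = pd.DataFrame({
    "song_name": ["Song A", "Song B", "Song C"],
    "artist_names": ["Artist X", "Artist Y", "Artist Z"],
    "lyrics": ["Love me, lover, I love gloves", "No match here", None],
})

# Substring count: also hits "lover" and "gloves".
substring_hits = toy.lyrics.str.upper().str.count("LOVE")

# Whole-word count: \b marks a word boundary; missing lyrics count as 0.
toy["love_count"] = (
    toy.lyrics.str.count(r"\blove\b", flags=re.IGNORECASE).fillna(0).astype(int)
)

top = toy.nlargest(2, "love_count")[["song_name", "artist_names", "love_count"]]
print(top)
```

Which of the two counts is intended depends on how the exercise defines "the word"; the substring version is the more permissive reading.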


# Exercise 5 (25 points)


The files `charts_xx.csv` contain the top 200 songs of the respective country (de: Germany, us: United States, gb: Great Britain, es: Spain) on December 24 of each year from 2017 to 2022.

Read in the four data sets and combine them appropriately into a single "tidy" data frame. Avoid repetitions in your code as much as possible.


In [153]:
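from pathlib import Path

# Two-letter country codes only, so charts_backup.csv is not picked up
csv_files = sorted(glob.glob('task2/data/charts_??.csv'))

dataframes = []
for file in csv_files:
    df_results_country = pd.read_csv(file)
    # 'charts_de.csv' -> 'de'; Path.stem avoids depending on the OS path separator
    df_results_country['country'] = Path(file).stem.split('_')[1]
    dataframes.append(df_results_country)

df_results = pd.concat(objs=dataframes, ignore_index=True)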
df_results.dtypes

date             object
position        float64
song_id          object
song_name        object
artist_names     object
streams           int64
country          object
dtype: object
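
A loop-free variant of the loading step can be sketched with a dict comprehension and `pd.concat`. To keep the example self-contained it first writes two tiny hypothetical chart files to a temporary folder; the folder, the values and the `row` index name are assumptions made for illustration, while the `charts_xx.csv` naming follows the task description.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write two tiny hypothetical chart files so the sketch runs on its own.
tmp_dir = Path(tempfile.mkdtemp())
for code, streams in {"de": [10, 20], "us": [30, 40]}.items():
    pd.DataFrame({"position": [1, 2], "streams": streams}).to_csv(
        tmp_dir / f"charts_{code}.csv", index=False
    )

# Map each country code (parsed from the file name) to its DataFrame.
frames = {
    path.stem.split("_")[1]: pd.read_csv(path)
    for path in sorted(tmp_dir.glob("charts_??.csv"))
}

# The dict keys become an index level, which is then turned into a column.
charts = (
    pd.concat(frames, names=["country", "row"])
    .reset_index(level="country")
    .reset_index(drop=True)
)
print(charts)
```

The pattern `charts_??.csv` matches exactly two characters after the underscore, which keeps a file such as `charts_backup.csv` out of the result.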

Display the total number of streams per country on the date 2022-12-24.


In [154]:
pd.DataFrame(df_results[df_results.date=='2022-12-24'].groupby('country').streams.sum())

| country | streams |
|---|---|
| de | 108837012 |
| es | 31182666 |
| gb | 59695507 |
| us | 161514161 |


# Exercise 6 (15 points)


Calculate the average `danceability` and the total number of `streams` per country.

Notes:

- For this task you need to combine the songs and charts DataFrames appropriately into a single DataFrame.
- If you did not manage to solve exercise 5, you can use the file `charts_backup.csv`.


In [155]:
df = pd.merge(left=df_results, right=df_songs, on='song_id', how='left')

# Named aggregation: one output column per (input column, function) pair
df.groupby('country').agg(
    mean_danceability=('danceability', 'mean'),
    total_streams=('streams', 'sum'),
)

| country | mean_danceability | total_streams |
|---|---|---|
| de | 0.553947 | 491345726 |
| es | 0.679203 | 134341837 |
| gb | 0.572602 | 286075531 |
| us | 0.598899 | 747236256 |
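
A left merge can silently inflate stream totals if a `song_id` appears more than once in the songs table, and chart rows without a matching song get a missing `danceability`. The sketch below, on hypothetical toy data, shows how the `validate` and `indicator` parameters of `pd.merge` could be used to check both; whether the real files need these checks is not known from the notebook.

```python
import pandas as pd

# Hypothetical toy data; ids and values are illustrative only.
charts = pd.DataFrame({
    "song_id": ["a", "b", "a", "z"],
    "country": ["de", "de", "us", "us"],
    "streams": [100, 50, 300, 20],
})
songs = pd.DataFrame({
    "song_id": ["a", "b"],
    "danceability": [0.6, 0.8],
})

# validate raises MergeError if song_id is duplicated in `songs`;
# indicator adds a `_merge` column that marks unmatched chart rows.
merged = pd.merge(charts, songs, on="song_id", how="left",
                  validate="many_to_one", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())

summary = merged.groupby("country").agg(
    mean_danceability=("danceability", "mean"),
    total_streams=("streams", "sum"),
)
print(unmatched)
print(summary)
```

Here the unmatched chart row still contributes to `total_streams` but is skipped by the mean, since `mean` ignores missing values by default.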
