#__Analysing Music Preferences__
This project analyses the music preferences of Yandex.Music customers. The goal is to draw insights from a dataset that contains information about the listening habits of customers in Moscow and Saint Petersburg.

##__Gather the data__


In [0]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

In [3]:
# Read dataset
music_df = pd.read_csv('/content/music_project.csv')
music_df.head()

Unnamed: 0.1,Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday


In [4]:
# Get general info about the dataset
music_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  65079 non-null  int64 
 1     userID    65079 non-null  object
 2   Track       63848 non-null  object
 3   artist      57876 non-null  object
 4   genre       63881 non-null  object
 5     City      65079 non-null  object
 6   time        65079 non-null  object
 7   Day         65079 non-null  object
dtypes: int64(1), object(7)
memory usage: 4.0+ MB


**Conclusions:**

The dataset has 8 columns:
* *Unnamed: 0* - appears to simply duplicate the index
* *userID* - user identifier
* *Track* - track name
* *artist* - name of the singer or band
* *genre* - genre name
* *City* - city where the track was played
* *time* - time at which the user listened to the track
* *Day* - day of the week

The non-null counts differ between columns, which means that some columns (*Track*, *artist* and *genre*) have missing values.

#__Data Preprocessing__

In [5]:
music_df.columns

Index(['Unnamed: 0', '  userID', 'Track', 'artist', 'genre', '  City  ',
       'time', 'Day'],
      dtype='object')
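
The output above shows that some headers carry stray spaces and mixed case (`'  userID'`, `'  City  '`). As a small sketch of one way to tidy such headers before any renaming, the example below uses a made-up toy frame (`toy_df` and its values are assumptions for illustration, not the real dataset):

```python
import pandas as pd

# Toy frame that mimics the untidy headers shown above (values are made up)
toy_df = pd.DataFrame({
    '  userID': ['A1', 'B2'],
    'Track': ['Song A', 'Song B'],
    '  City  ': ['Moscow', 'Saint-Petersburg'],
})

# Strip surrounding whitespace and lower-case every header
toy_df.columns = toy_df.columns.str.strip().str.lower()

print(list(toy_df.columns))
```

This only normalises whitespace and case; giving columns more descriptive names still needs an explicit rename.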

In [0]:
# Drop 'Unnamed: 0' as this column is not needed
music_df.drop(columns='Unnamed: 0', inplace=True)

In [7]:
# Rename columns for convenience
# Assign the new names directly; set_axis(..., inplace=True) is not available in recent pandas versions
music_df.columns = ['user_id', 'track_name', 'artist_name', 'genre', 'city', 'time', 'weekday']
music_df.head()

Unnamed: 0,user_id,track_name,artist_name,genre,city,time,weekday
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday


In [8]:
# Check dataset for missing values
music_df.isnull().sum()

user_id           0
track_name     1231
artist_name    7203
genre          1198
city              0
time              0
weekday           0
dtype: int64

In [9]:
# Fill in missing values for convenience
music_df['track_name'] = music_df['track_name'].fillna('unknown')
music_df['artist_name'] = music_df['artist_name'].fillna('unknown')
music_df['genre'] = music_df['genre'].fillna('unknown')

music_df.isnull().sum()

user_id        0
track_name     0
artist_name    0
genre          0
city           0
time           0
weekday        0
dtype: int64
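
The three `fillna` calls can also be written as a single assignment over a list of columns. The sketch below shows the idea on a made-up toy frame (`toy_df` and `text_columns` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Toy frame with gaps in the same three text columns (values are made up)
toy_df = pd.DataFrame({
    'track_name': ['Song A', None],
    'artist_name': [None, 'Band B'],
    'genre': ['rock', None],
    'city': ['Moscow', 'Moscow'],
})

# Fill all text columns in one step with the same placeholder
text_columns = ['track_name', 'artist_name', 'genre']
toy_df[text_columns] = toy_df[text_columns].fillna('unknown')

# Count remaining missing values per column
missing_after = toy_df.isnull().sum()
print(missing_after)
```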

In [10]:
music_df[music_df['genre'] == 'unknown']

Unnamed: 0,user_id,track_name,artist_name,genre,city,time,weekday
15,E3C5756F,unknown,unknown,unknown,Moscow,09:24:51,Monday
35,A8AE9169,unknown,unknown,unknown,Moscow,08:56:10,Monday
54,3FA9A6A8,Inside Out,unknown,unknown,Moscow,10:00:41,Friday
161,364C85C0,unknown,unknown,unknown,Moscow,20:06:58,Monday
182,4AFB623B,My Name Is Love,unknown,unknown,Moscow,20:23:47,Wednesday
...,...,...,...,...,...,...,...
64792,7D9627FD,unknown,unknown,unknown,Moscow,08:57:15,Monday
64837,6E12D163,Fantasy Boy,NEW BACCARA,unknown,Moscow,20:51:44,Friday
64901,90B5E5A2,Hearts & Silence,Myon x Late Night Alumni,unknown,Moscow,09:23:42,Friday
64930,A8AE9169,unknown,unknown,unknown,Moscow,08:54:17,Friday


In [17]:
# Check for duplicates
print('Total number of duplicated rows is {}'.format(music_df.duplicated().sum()))
music_df[music_df.duplicated()]

Total number of duplicated rows is 3826


Unnamed: 0,user_id,track_name,artist_name,genre,city,time,weekday
575,E7F07B46,Crazy,The Manhattans,rnb,Moscow,13:39:46,Monday
832,7671A47A,Миражи,Восток,ruspop,Moscow,21:59:33,Wednesday
1216,69467B01,Change It All,Harrison Storm,singer,Moscow,20:53:06,Wednesday
1754,13B1A573,Te Adoramos Jesús,Athenas,spiritual,Moscow,13:19:37,Monday
1964,B24668A0,Mad over You Mashup,Nana Fofie,singer,Moscow,20:36:51,Monday
...,...,...,...,...,...,...,...
65042,83E9C8C4,Buddhist Beat,Asian Zen Spa Music Meditation,ambient,Moscow,13:25:29,Monday
65056,2E25BCD2,Psychobitch,Easter,pop,Moscow,14:53:07,Friday
65059,F231C47E,All Summer in a Day,VHS Or BETA,electronic,Moscow,20:31:23,Friday
65067,F1B93F29,Poison Kiss,Centerstone,rock,Saint-Petersburg,22:00:29,Monday


In [20]:
# Remove duplicated rows
music_df = music_df.drop_duplicates().reset_index(drop=True)
music_df.shape

(61253, 7)

In [21]:
# Check for duplicates again
print('Total number of duplicated rows is {}'.format(music_df.duplicated().sum()))

Total number of duplicated rows is 0


In [23]:
# Save genre unique values into a separate list
genres_list = music_df['genre'].unique()
genres_list

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'unknown', 'alternative', 'children', 'rnb', 'hip',
       'jazz', 'postrock', 'latin', 'classical', 'metal', 'reggae',
       'tatar', 'blues', 'instrumental', 'rusrock', 'dnb', 'türk', 'post',
       'country', 'psychedelic', 'conjazz', 'indie', 'posthardcore',
       'local', 'avantgarde', 'punk', 'videogame', 'techno', 'house',
       'christmas', 'melodic', 'caucasian', 'reggaeton', 'soundtrack',
       'singer', 'ska', 'shanson', 'ambient', 'film', 'western', 'rap',
       'beats', "hard'n'heavy", 'progmetal', 'minimal', 'contemporary',
       'new', 'soul', 'holiday', 'german', 'tropical', 'fairytail',
       'spiritual', 'urban', 'gospel', 'nujazz', 'folkmetal', 'trance',
       'miscellaneous', 'anime', 'hardcore', 'progressive', 'chanson',
       'numetal', 'vocal', 'estrada', 'russian', 'classicmetal',
       'dubstep', 'club', 'deep', 'southern', 'black', 'folkrock',
       'fitness', '

In [0]:
# Count how many times a genre name appears in the list of unique genres
# (0 means the spelling is not used in the dataset, 1 means it is)
def find_genre(genre_name):
    counter = 0

    for i in range(len(genres_list)):
        if genres_list[i] == genre_name:
            counter += 1
    
    return counter

As an example, we can check whether genre names are spelled correctly by comparing them with the list of genres from this source: https://www.musicgenreslist.com/

In [25]:
find_genre('indie rock')

0

In [26]:
find_genre('rock')

1

In [27]:
find_genre('indie')

1

In [28]:
find_genre('hip-hop')

0

In [29]:
find_genre('hip')

1

In [31]:
find_genre('hiphop')

1

As an example, we can write a function that replaces an incorrect genre name with the correct one.

In [0]:
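def find_hip_hop(df, wrong):
  # Replace the wrong spelling with the correct one
  correct = 'hip-hop'
  df['genre'] = df['genre'].replace(wrong, correct)
  # Count rows that still carry the wrong spelling (should be 0)
  wrong_count = df[df['genre'] == wrong]['genre'].count()
  
  return wrong_count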

In [33]:
find_hip_hop(music_df, 'hip')

0

In [34]:
find_hip_hop(music_df, 'hiphop')

0
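
When several spellings need fixing, calling the function once per spelling can be replaced by a single mapping. The following is only a sketch on a made-up series; `genre_fixes` is a hypothetical mapping that would have to be extended by hand for other genres:

```python
import pandas as pd

# Toy genre column containing two misspellings of 'hip-hop' (values are made up)
toy_genres = pd.Series(['hip', 'hiphop', 'rock', 'hip-hop', 'pop'], name='genre')

# Hypothetical mapping from wrong spelling to correct spelling
genre_fixes = {'hip': 'hip-hop', 'hiphop': 'hip-hop'}

# Replace every key of the mapping with its value in one pass
cleaned = toy_genres.replace(genre_fixes)

# Count values that still match one of the wrong spellings
remaining = cleaned.isin(list(genre_fixes)).sum()
print(cleaned.tolist(), remaining)
```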

In [36]:
# Check the whole dataset after preprocessing
music_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61253 entries, 0 to 61252
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      61253 non-null  object
 1   track_name   61253 non-null  object
 2   artist_name  61253 non-null  object
 3   genre        61253 non-null  object
 4   city         61253 non-null  object
 5   time         61253 non-null  object
 6   weekday      61253 non-null  object
dtypes: object(7)
memory usage: 3.3+ MB


Now we can check the hypothesis that users in Moscow and Saint Petersburg listen to music differently. The data covers Monday, Wednesday and Friday, so the hypothesis is tested on those three days. For this purpose we count the number of tracks played in each city and compare the results. Note that missing genres were filled with 'unknown' above, so every row is counted.

In [47]:
# Group the data by city and count songs 
music_df.groupby('city')['genre'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64

It looks like users from Moscow are more active listeners. However, Moscow has more users than Saint-Petersburg, and a larger population as well, so this proportion makes sense.
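
One way to check whether the difference comes from the number of users rather than from listening activity is to divide plays by unique users per city. This is a minimal sketch on made-up rows; the column names follow the renamed dataset, while `toy_df` and `per_city` are hypothetical:

```python
import pandas as pd

# Toy listening log: 4 plays by 3 users in Moscow, 2 plays by 2 users in Saint-Petersburg
toy_df = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'D', 'E'],
    'city': ['Moscow'] * 4 + ['Saint-Petersburg'] * 2,
    'genre': ['pop', 'rock', 'pop', 'dance', 'pop', 'folk'],
})

# Count plays and unique users per city
per_city = toy_df.groupby('city').agg(
    plays=('genre', 'count'),
    users=('user_id', 'nunique'),
)

# Average number of plays per user in each city
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```

If the per-user figures are close, the gap in raw counts mostly reflects the size of the user base.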

In [49]:
# Group by weekdays and count 
music_df.groupby('weekday')['genre'].count()

weekday
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

It looks like Monday and Friday are the days when people listen to music the most. One possible explanation is that on Mondays people are still in a weekend mood, and on Fridays they feel more relaxed ahead of the weekend, while Wednesday is the middle of the week and people are more focused on work.

In [0]:
# Create a function for calculating number of songs per city and weekday
def number_tracks(df, day, city):
    track_list = df[(df['weekday'] == day) & (df['city'] == city)]
    track_list_count = track_list['genre'].count()
    return track_list_count

In [0]:
list_days = ['Monday', 'Wednesday', 'Friday']

In [54]:
# Count for Moscow
for i in list_days:
  results = number_tracks(music_df, i, 'Moscow')
  print('Number of listenings for Moscow on {0} is {1}'.format(i, results))

Number of listenings for Moscow on Monday is 15740
Number of listenings for Moscow on Wednesday is 11056
Number of listenings for Moscow on Friday is 15945


In [56]:
# Count for Saint-Petersburg
for i in list_days:
  results = number_tracks(music_df, i, 'Saint-Petersburg')
  print('Number of listenings for Saint-Petersburg on {0} is {1}'.format(i, results))

Number of listenings for Saint-Petersburg on Monday is 5614
Number of listenings for Saint-Petersburg on Wednesday is 7003
Number of listenings for Saint-Petersburg on Friday is 5895


Create a table of our results

In [57]:
music_data = [
    ['Moscow', 15740, 11056, 15945],
    ['Saint-Petersburg', 5614, 7003, 5895]
]

table_results = pd.DataFrame(data=music_data, columns=['city', 'monday', 'wednesday', 'friday'])
table_results

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
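
The table above was typed in by hand from the printed counts. As a sketch of an alternative, `pd.crosstab` can build the same city-by-weekday layout directly from the rows. The toy frame below is made up; `day_order` and `plays_table` are hypothetical names:

```python
import pandas as pd

# Toy listening log (values are made up)
toy_df = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'weekday': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Cross-tabulate plays by city and weekday, then put the days in calendar order
day_order = ['Monday', 'Wednesday', 'Friday']
plays_table = pd.crosstab(toy_df['city'], toy_df['weekday']).reindex(
    columns=day_order, fill_value=0
)
print(plays_table)
```

This avoids copying numbers between cells, so the table stays in sync if the preprocessing changes.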


**Conclusion:**
Interesting results - users in Saint-Petersburg listen to more music on Wednesday than on Monday or Friday. Well, it's the cultural capital of Russia, and as I always say, music gives energy to work and live :)

Create a function that returns the top 20 genres for a given weekday and time interval

In [0]:
moscow_general = music_df[music_df['city'] == 'Moscow']
spb_general = music_df[music_df['city'] == 'Saint-Petersburg']

In [0]:
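def genre_weekday(df, day, time1, time2):
    # Extract the hour from the 'HH:MM:SS' strings in the time column
    hours = df['time'].str[:2].astype(int)
    # Keep rows for the given weekday whose hour lies in [time1, time2)
    genre_list = df[(df['weekday'] == day) & (hours >= time1) & (hours < time2)]
    
    # Count plays per genre and keep the 20 most frequent genres
    genre_list_sorted = genre_list.groupby('genre')['genre'].count().sort_values(ascending=False).head(20)
    genres_list_dataframe = pd.DataFrame(genre_list_sorted, columns=['genre'])

    return genres_list_dataframe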

Compare results for both cities for 7am-11am (morning hours) on Monday

In [0]:
genres_moscow_7_11 = pd.DataFrame(genre_weekday(moscow_general, 'Monday', 7, 11))
genres_spb_7_11 = pd.DataFrame(genre_weekday(spb_general, 'Monday', 7, 11))

In [136]:
# Visualize the results
fig1 = px.bar(genres_moscow_7_11, x=genres_moscow_7_11.index, y='genre',
            labels={'x':'Genres for 7am-11am Moscow', 'genre':'Count'}, height=400)
fig2 = px.bar(genres_spb_7_11, x=genres_spb_7_11.index, y='genre',
            labels={'x':'Genres for 7am-11am Spb', 'genre':'Count'}, height=400)

fig1.show()
fig2.show()

Compare results for both cities for 5pm-11pm (evening hours) on Friday

In [0]:
genres_moscow_17_23 = pd.DataFrame(genre_weekday(moscow_general, 'Friday', 17, 23))
genres_spb_17_23 = pd.DataFrame(genre_weekday(spb_general, 'Friday', 17, 23))

In [137]:
# Visualize the results
fig1 = px.bar(genres_moscow_17_23, x=genres_moscow_17_23.index, y='genre',
            labels={'x':'Genres for 5pm-11pm Moscow', 'genre':'Count'}, height=400)
fig2 = px.bar(genres_spb_17_23, x=genres_spb_17_23.index, y='genre',
            labels={'x':'Genres for 5pm-11pm Spb', 'genre':'Count'}, height=400)

fig1.show()
fig2.show()

**Conclusion:**
Users' musical preferences are quite similar in both cities: the top 5 genres are the same. Below the top 5 the ranking of genres differs slightly, but not significantly.
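
The comparison of top genres can also be checked in code rather than by reading the charts. The sketch below uses two made-up frames and a hypothetical helper `top_genres`; with the real data one would pass the per-city frames and `n=5`:

```python
import pandas as pd

def top_genres(df, n):
    """Return the n most frequent genres in df, most frequent first."""
    return df['genre'].value_counts().head(n).index.tolist()

# Toy per-city frames (values are made up)
moscow_toy = pd.DataFrame({'genre': ['pop', 'pop', 'pop', 'rock', 'rock', 'dance']})
spb_toy = pd.DataFrame({'genre': ['pop', 'pop', 'pop', 'rock', 'rock', 'folk']})

# Genres that appear in the top 2 of both cities
shared = set(top_genres(moscow_toy, 2)) & set(top_genres(spb_toy, 2))
print(shared)
```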

#__Final Conclusions__
Musical preferences in both cities are similar. In Moscow as well as in Saint-Petersburg, pop music prevails. In addition, there is no apparent relationship between genre and weekday in either city. The only significant observation is that people in Saint-Petersburg tend to listen to more music on Wednesday, whereas people in Moscow tend to listen to more music on Monday and Friday than on Wednesday.