# Description of the project

Music search service

In [1]:
# import library
import pandas as pd 

In [2]:
# Read
df = pd.read_csv('music_project.csv', index_col=0)

In [3]:
# First 10 rows
df.head(10) 

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
# Info
df.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 4.0+ MB


Let's consider the information received in more detail.

The table has 7 columns, and every column has the data type object.

Let's take a closer look at what *df* columns are and what information they contain:

* userID: user identifier;
* Track: track name;
* artist: name of the artist;
* genre: genre name;
* City: the city in which the track was played;
* time: time at which the user listened to the track;
* Day: day of the week.

The non-null counts differ between columns, which indicates missing values in *Track*, *artist*, and *genre*.

## Summary

Each row of the table describes a track of a certain genre by a certain artist that a user listened to in one of the cities at a certain time and day of the week. Two problems need to be addressed: missing values and poor-quality column names. The *time*, *Day*, and *City* columns are especially valuable for testing the working hypotheses. The *genre* column will show which genres are most popular.

In [5]:
# Columns
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Some column names contain leading or trailing spaces, and the capitalization is inconsistent, which makes the columns awkward to access.

In [6]:
# Rename columns
df = df.set_axis(['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time', 'weekday'], axis='columns')

In [7]:
# checking the results - list of column names
df.columns 

Index(['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time',
       'weekday'],
      dtype='object')
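
As a complement to renaming every column by hand, the whitespace and capitalization problems can also be cleaned up programmatically. The following is only a minimal sketch on a one-row toy frame (the row is copied from the preview above; the name `raw` and the explicit mapping are assumptions made for illustration):

```python
import pandas as pd

# Toy frame that reproduces part of the raw header from the notebook.
raw = pd.DataFrame(
    [["FFB692EC", "Kamigata To Boots", "Saint-Petersburg"]],
    columns=["  userID", "Track", "  City  "],
)

# Strip surrounding whitespace and lower-case every name in one pass.
cleaned = raw.copy()
cleaned.columns = cleaned.columns.str.strip().str.lower()

# Names that need more than whitespace cleanup are mapped explicitly.
cleaned = cleaned.rename(columns={"userid": "user_id", "track": "track_name"})

print(list(cleaned.columns))
```

This approach scales better when a table has many columns, while the explicit list used in the notebook is easier to read for seven names.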

In [8]:
# total number of gaps
df.isnull().sum()

user_id           0
track_name     1231
artist_name    7203
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

Empty values indicate that not all information is available for some tracks. The reasons may be different: for example, a specific performer of a folk song has not been named. It's worse if there are problems with data recording. Each individual case must be analyzed and the cause identified.

We replace missing values in the track name and artist columns with the string 'unknown' and then check that these two columns no longer contain gaps. Rows with a missing genre are dropped afterwards, because the genre is required for the analysis.

In [9]:
# replacing missing values in the 'track_name' column with the string 'unknown' using a special replacement method
df['track_name'] = df['track_name'].fillna('unknown') 

In [10]:
# replacing missing values in the 'artist_name' column with the string 'unknown' using a special replacement method
df['artist_name'] = df['artist_name'].fillna('unknown') 

In [11]:
# total number of gaps
df.isnull().sum() 

user_id           0
track_name        0
artist_name       0
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

In [12]:
# removing missing values in column 'genre_name'
df.dropna(subset = ['genre_name'], inplace = True) 

In [13]:
# Check
df.isnull().sum() 

user_id        0
track_name     0
artist_name    0
genre_name     0
city           0
time           0
weekday        0
dtype: int64
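
The same cleanup can be expressed more compactly. This is a hedged sketch on an invented three-row frame (the name `sample` and its rows are assumptions, not data from the project): `fillna` accepts a column-to-value mapping, so both text columns are filled in one call before the rows without a genre are dropped.

```python
import pandas as pd

# Small illustrative frame; the combination of values is invented.
sample = pd.DataFrame(
    {
        "track_name": ["Soul People", None, "True"],
        "artist_name": [None, "Space Echo", "Roman Messer"],
        "genre_name": ["dance", None, "dance"],
    }
)

# One fillna call with a column -> value mapping covers both text columns.
filled = sample.fillna({"track_name": "unknown", "artist_name": "unknown"})

# Rows without a genre are dropped, as in the notebook.
filled = filled.dropna(subset=["genre_name"])

print(filled.isnull().sum())
```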

In [14]:
# Check duplicates
df.duplicated().sum() 

3755

In [15]:
# Drop duplicates
df = df.drop_duplicates().reset_index(drop=True) 

In [16]:
# Check
df.duplicated().sum() 

0

Duplicates may have appeared because of a failure in data recording. It is worth investigating the cause of such junk records.
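
To investigate where duplicates come from, it can help to look at them before dropping them. A minimal sketch, assuming an invented frame named `sample`: `duplicated(keep=False)` marks every member of a duplicate group, and grouping by all columns shows how often each record repeats.

```python
import pandas as pd

# Invented records; the first two rows are exact copies of each other.
sample = pd.DataFrame(
    {
        "user_id": ["A1", "A1", "B2", "C3"],
        "track_name": ["True", "True", "Pessimist", "True"],
        "time": ["13:00:07", "13:00:07", "21:20:49", "08:34:34"],
    }
)

# keep=False marks every member of a duplicate group, not only the repeats.
dupes = sample[sample.duplicated(keep=False)]

# Counting identical rows shows how often each record was repeated.
dupe_counts = dupes.groupby(list(sample.columns)).size()
print(dupe_counts)
```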

We store the unique values of the genre column in the *genres_list* variable.

Let's declare a function *find_genre()* to find implicit duplicates in the genre column, that is, cases where the same genre is spelled in different ways. Because *genres_list* holds unique values, the function returns 1 if the spelling occurs and 0 otherwise.

In [17]:
# storing in the genres_list variable a list of unique values identified by a special method in the 'genre_name' column
genres_list = df['genre_name'].unique() 

In [18]:
# creating the find_genre() function 
def find_genre(genre):  
    count = 0 
    for row in genres_list:
        if row == genre:
            count += 1
    return count 

Call the *find_genre()* function to search for different variations of the hip-hop genre name in a table.

The correct name is *hiphop*. Let's look for other options:

* hip
* hop
* hip-hop

In [19]:
# calling the find_genre() function checks for the presence of the 'hip' variant
find_genre('hip') 

1

In [20]:
# checks for the presence of the 'hop' option
find_genre('hop')

0

In [21]:
# checks for the presence of the 'hip-hop' option
find_genre('hip-hop')

0
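
The three checks can also be run in one step. The sketch below uses an invented series `genres` and counts rows rather than unique spellings, so it additionally shows how many records each variant affects; `reindex` with `fill_value=0` reports absent variants as zero.

```python
import pandas as pd

# Invented genre values for illustration only.
genres = pd.Series(["hiphop", "hip", "rock", "pop", "hip", "hiphop"])

variants = ["hip", "hop", "hip-hop"]

# value_counts gives the number of rows per genre;
# reindex adds a zero for variants that never occur.
variant_counts = genres.value_counts().reindex(variants, fill_value=0)
print(variant_counts)
```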

Let's declare a function *find_hip_hop()* that replaces the incorrect name of this genre in the *'genre_name'* column with *'hiphop'* and checks that the replacement was successful.

So we correct all spelling variants that the check revealed.

In [22]:
def find_hip_hop(df, wrong):
    df['genre_name'] = df['genre_name'].replace(wrong, 'hiphop')
    result = df[df['genre_name'] == wrong]['genre_name'].count()
    return result

In [23]:
# replacing one incorrect option with hiphop by calling the find_hip_hop() function
find_hip_hop(df, 'hip') 

0
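
If several wrong spellings had been found, they could be replaced together. A minimal sketch under that assumption (the frame `sample` and the list `wrong_variants` are hypothetical): `Series.replace` accepts a list of values to substitute with a single replacement.

```python
import pandas as pd

# Invented column containing every spelling variant discussed above.
sample = pd.DataFrame({"genre_name": ["hip", "hop", "hip-hop", "hiphop", "rock"]})

wrong_variants = ["hip", "hop", "hip-hop"]

# replace accepts a list of values to substitute with a single replacement.
sample["genre_name"] = sample["genre_name"].replace(wrong_variants, "hiphop")

# Number of rows still carrying one of the wrong spellings.
remaining = int(sample["genre_name"].isin(wrong_variants).sum())
print(remaining)
```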

In [24]:
# Info
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60126 entries, 0 to 60125
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      60126 non-null  object
 1   track_name   60126 non-null  object
 2   artist_name  60126 non-null  object
 3   genre_name   60126 non-null  object
 4   city         60126 non-null  object
 5   time         60126 non-null  object
 6   weekday      60126 non-null  object
dtypes: object(7)
memory usage: 3.2+ MB


## Summary

During the preprocessing stage, the data revealed not only gaps and problems with column names, but also all kinds of duplicates. Removing them will allow for more accurate analysis. Since information about genres is important to preserve for analysis, we do not simply remove all missing values, but fill in the missing artist names and track titles. Column names are now correct and convenient for further work.

It was hypothesized that in Moscow and St. Petersburg users listen to music differently. We check this assumption using data on three days of the week - Monday, Wednesday and Friday.

For each city, we determine the number of songs with a known genre listened to these days, and compare the results.

We group the data by city and call the *count()* method to count the compositions for which the genre is known.

In [25]:
# grouping df table data by 'city' column and counting the number of values in 'genre_name' column
df.groupby('city')['genre_name'].count() 

city
Moscow              41892
Saint-Petersburg    18234
Name: genre_name, dtype: int64

Moscow has more plays than St. Petersburg, but this does not mean that Moscow users are more active. Yandex.Music simply has more users in Moscow, so the gap reflects audience size rather than engagement.

Let's group the data by day of the week and count the songs listened to on Monday, Wednesday and Friday, for which the genre is known.

In [26]:
# grouping data by column 'weekday' and counting the number of values in column 'genre_name'
df.groupby('weekday')['genre_name'].count() 

weekday
Friday       21482
Monday       20866
Wednesday    17778
Name: genre_name, dtype: int64

Monday and Friday are the main days for music; on Wednesday the number of plays is noticeably lower.

We create a function *number_tracks()* that takes a table, a day of the week, and a city name, and returns the number of plays for which the genre is known. We call it for each city on Monday, Wednesday, and Friday.

In [27]:
def number_tracks(df, day, city):
    track_list = df[(df['weekday'] == day) & (df['city'] == city)]         
    track_list_count = track_list['genre_name'].count()
    return track_list_count

In [28]:
# number of plays in Moscow on Monday
number_tracks(df, 'Monday', 'Moscow') 

15347

In [29]:
# number of plays in St. Petersburg on Monday
number_tracks(df, 'Monday', 'Saint-Petersburg')

5519

In [30]:
# number of plays in Moscow on Wednesday
number_tracks(df, 'Wednesday', 'Moscow')

10865

In [31]:
# number of plays in St. Petersburg on Wednesday
number_tracks(df, 'Wednesday', 'Saint-Petersburg')

6913

In [32]:
# number of plays in Moscow on Friday
number_tracks(df, 'Friday', 'Moscow')

15680

In [33]:
# number of plays in St. Petersburg on Friday
number_tracks(df, 'Friday', 'Saint-Petersburg')

5802

Let's summarize the received information into one table, where ['city', 'monday', 'wednesday', 'friday'] are the names of the columns.

In [34]:
data = [['Moscow', 15347, 10865, 15680],
        ['Saint-Petersburg', 5519, 6913, 5802]]

columns = ['city', 'monday', 'wednesday', 'friday']
table = pd.DataFrame(data = data, columns = columns)
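
The six separate calls and the hand-built table can, as an alternative, be produced by a single cross-tabulation. This is only a sketch on invented play records (the frame `plays` is an assumption); on the real data the same call would be applied to the *city* and *weekday* columns.

```python
import pandas as pd

# Invented play records.
plays = pd.DataFrame(
    {
        "city": ["Moscow", "Moscow", "Moscow", "Saint-Petersburg", "Saint-Petersburg"],
        "weekday": ["Monday", "Friday", "Monday", "Wednesday", "Monday"],
    }
)

# crosstab counts rows for every (city, weekday) pair in one call.
counts = pd.crosstab(plays["city"], plays["weekday"])

# Reorder the columns chronologically instead of alphabetically.
counts = counts[["Monday", "Wednesday", "Friday"]]
print(counts)
```

Building the table from the data avoids copying numbers by hand, which removes one source of transcription errors.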


## Summary

The results show that the two cities mirror each other around Wednesday: in Moscow, plays peak on Monday and Friday and drop on Wednesday, whereas in St. Petersburg Wednesday is the busiest day and Monday and Friday are lower, at roughly the same level.
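
Because Moscow has far more plays overall, the pattern is easier to compare when each city's counts are converted to shares of its own total. A minimal sketch using the counts from the table above (the variable name `shares` is an assumption):

```python
import pandas as pd

# Counts copied from the summary table built in the notebook.
table = pd.DataFrame(
    [["Moscow", 15347, 10865, 15680], ["Saint-Petersburg", 5519, 6913, 5802]],
    columns=["city", "monday", "wednesday", "friday"],
).set_index("city")

# Divide each row by its own total so the two cities share a common scale.
shares = table.div(table.sum(axis=1), axis=0)
print(shares.round(3))
```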

We are looking for an answer to the question of what genres predominate in different cities on Monday morning and Friday evening. There is an assumption that on Monday morning users listen to more uplifting music (for example, pop), and on Friday evening - more dance music (for example, electronica).

We create separate data tables for Moscow (*moscow_general*) and St. Petersburg (*spb_general*).

In [35]:
moscow_general = df[(df['city'] == 'Moscow')]

In [36]:
spb_general = df[(df['city'] == 'Saint-Petersburg')] 

We create a function *genre_weekday()* that returns the number of plays per genre for a given day of the week and a time window between two timestamps.

In [37]:
# declaration of the function genre_weekday() with parameters df, day, time1, time2
def genre_weekday(df, day, time1, time2):
    genre_list = df[(df['weekday'] == day) & (df['time'] > time1) & (df['time'] < time2)]
    genre_list_sorted = genre_list.groupby('genre_name')['genre_name'].count()
    return genre_list_sorted
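
The filter in *genre_weekday()* compares times as strings. This works because zero-padded `HH:MM:SS` strings sort in chronological order. The sketch below illustrates this on a few sample times and, as an assumed alternative, repeats the filter on parsed timestamps; note that both bounds are excluded, so a play at exactly 07:00:00 or 11:00:00 is not counted.

```python
import pandas as pd

# A few sample times, including both window boundaries.
times = pd.Series(["06:59:59", "07:00:00", "08:34:34", "11:00:00", "20:28:33"])

# Lexicographic comparison of zero-padded strings matches chronological order.
in_window = times[(times > "07:00:00") & (times < "11:00:00")]
print(in_window.tolist())

# Same filter on parsed times; both bounds are excluded here as well.
parsed = pd.to_datetime(times, format="%H:%M:%S")
start = pd.to_datetime("07:00:00", format="%H:%M:%S")
end = pd.to_datetime("11:00:00", format="%H:%M:%S")
in_window_parsed = times[(parsed > start) & (parsed < end)]
print(in_window_parsed.tolist())
```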

We compare the results obtained in the table for Moscow and St. Petersburg on Monday morning (from 7 to 11) and on Friday evening (from 17 to 23).

In [38]:
# calling the function for Monday morning in Moscow (moscow_general is passed instead of df)
genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00') 

genre_name
adult            1
africa           2
alternative    164
ambient         22
americana        1
              ... 
videogame        7
vocal            8
western         10
world          181
worldbeat        1
Name: genre_name, Length: 152, dtype: int64

In [39]:
# calling the function for Monday morning in St. Petersburg (spb_general is passed instead of df)
genre_weekday(spb_general,'Monday','07:00:00','11:00:00')  

genre_name
adult           1
alternative    58
ambient         5
balkan          1
beats           1
               ..
variété         3
videogame       1
vocal           2
western         7
world          36
Name: genre_name, Length: 107, dtype: int64

In [40]:
# calling a function for Friday evening in Moscow
genre_weekday(moscow_general,'Friday','17:00:00','23:00:00') 

genre_name
adult            1
africa           1
alternative    163
ambient         18
americana        2
              ... 
variété          6
videogame        5
vocal            9
western          4
world          208
Name: genre_name, Length: 163, dtype: int64

In [41]:
# calling a function for Friday evening in St. Petersburg
genre_weekday(spb_general,'Friday','17:00:00','23:00:00') 

genre_name
acoustic        1
adult           1
alternative    63
ambient         7
americana       1
               ..
variété         4
videogame       3
vocal           2
western         5
world          54
Name: genre_name, Length: 126, dtype: int64

Popular genres on Monday morning in St. Petersburg and Moscow turned out to be similar: pop was popular everywhere, as expected. Despite this, the end of the top 10 for the two cities is different: in St. Petersburg the top 10 includes jazz and Russian rap, and in Moscow the *world* genre is included.

At the end of the week the situation does not change. Pop music is still in first place. Again, the difference is noticeable only at the end of the top 10, where the *world* genre is also present in St. Petersburg on a Friday evening.
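
The outputs above are ordered alphabetically by genre, so a top-10 comparison needs the counts sorted by frequency. A minimal sketch of such a variant follows; the function name `top_genres`, its `n` parameter, and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def top_genres(df, day, time1, time2, n=10):
    """Hypothetical variant of genre_weekday() returning the n most played genres."""
    selected = df[(df["weekday"] == day) & (df["time"] > time1) & (df["time"] < time2)]
    # value_counts sorts by frequency in descending order.
    return selected["genre_name"].value_counts().head(n)

# Invented rows for illustration.
plays = pd.DataFrame(
    {
        "weekday": ["Monday", "Monday", "Monday", "Friday", "Monday"],
        "time": ["08:00:00", "09:30:00", "10:15:00", "18:00:00", "12:00:00"],
        "genre_name": ["pop", "rock", "pop", "dance", "pop"],
    }
)

top = top_genres(plays, "Monday", "07:00:00", "11:00:00", n=2)
print(top)
```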

## Summary

The pop genre is the undisputed leader, and the top 5 is the same in both capitals. The tail of the list is more variable: each city has its own characteristic genres, whose positions shift with the day of the week and the time of day.

Hypothesis: St. Petersburg is rich in its rap culture, so this genre is listened to more often there, and Moscow is a city of contrasts, but the bulk of users listen to pop music.

Let's group the *moscow_general* table by genre, count the number of compositions of each genre using the *count()* method, sort in descending order and save the result in the *moscow_genres* table.

Let's look at the first 10 rows of this new table.

In [42]:
moscow_genres = moscow_general.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)

In [43]:
# First 10 rows
moscow_genres.head(10) 

genre_name
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre_name, dtype: int64

Let's group the *spb_general* table by genre, count the number of compositions of each genre using the *count()* method, sort in descending order and save the result in the *spb_genres* table.

Let's look at the first 10 rows of this table. Now you can compare the two cities.

In [44]:
spb_genres = spb_general.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)

In [45]:
# First 10 rows
spb_genres.head(10)

genre_name
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre_name, dtype: int64

## Summary

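In Moscow, Russian pop (*ruspop*) ranks ninth, just ahead of Russian rap (*rusrap*) in tenth place, so interest in Russian-language pop is somewhat broader there. In St. Petersburg the order is reversed: *rusrap* is eighth and *ruspop* ninth. Contrary to the assumption, *hiphop* holds fifth place in both cities, so rap occupies close positions overall.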
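
Raw counts favor Moscow because of its larger audience, so one way to compare tastes is by each genre's share of the city's plays. The sketch below uses counts taken from the two top-10 outputs and the per-city totals reported earlier; the selection of four genres and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Per-city totals and genre counts copied from the outputs in the notebook.
totals = {"Moscow": 41892, "Saint-Petersburg": 18234}
counts = pd.DataFrame(
    {
        "Moscow": {"pop": 5892, "hiphop": 2096, "ruspop": 1372, "rusrap": 1161},
        "Saint-Petersburg": {"pop": 2431, "hiphop": 960, "ruspop": 538, "rusrap": 564},
    }
)

# Share of each genre in its city's total number of plays
# (the Series index is aligned with the DataFrame columns).
shares = counts / pd.Series(totals)
print(shares.round(4))
```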

# Summary


Working hypotheses:

* listening patterns differ between the two cities, Moscow and St. Petersburg;

* the lists of the ten most popular genres on Monday morning and Friday evening have characteristic differences;

* residents of the two cities prefer different musical genres.

**General results**

Moscow and St. Petersburg have similar tastes: pop music predominates in both. Within each city, genre preferences do not depend on the day of the week: people consistently listen to what they like. Listening activity, however, mirrors across the cities around Wednesday: Moscow listens more on Monday and Friday, while St. Petersburg listens more on Wednesday and less on Monday and Friday.