# Import data

Let's examine the data provided by Yandex.Music for this project.

## Import libraries

In [1]:
import pandas as pd

Read the *music_project.csv* file and store it in the *df* variable.

In [2]:
df = pd.read_csv('/datasets/music_project.csv')

Display the first 10 rows of the table.

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


Let's get a general overview of the data.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


There are 7 columns in the table. The data type of each column is object.

Let's take a closer look at each column in the dataset:

* userID - id number of each user;
* Track - track name;
* artist - artist name;
* genre - the name of the genre;
* City - the city in which the listening session took place;
* time - the time at which the user listened to the track;
* Day - day of the week.

The number of non-null values differs between columns, which indicates that there are missing values in the data.

**Findings**

Each row of the table describes one listening session: a track of a certain genre that a user played in one of the cities at a certain time and day of the week.

Two problems need to be addressed: missing values and non-standardized column names.

# Data cleaning

Let's handle the missing values, rename the columns, and check the data for duplicates.

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Some column names contain leading or trailing spaces, and the capitalization is inconsistent. This makes the data harder to access and invites typos.

Let's rename the columns for the convenience of further work and check the result.

In [6]:
# The `inplace` argument of set_axis was removed in pandas 2.0,
# so the result is reassigned to df instead.
df = df.set_axis(['user_id', 'track_name',
                  'artist_name', 'genre_name',
                  'city', 'time', 'weekday'],
                 axis='columns')

In [7]:
df.columns

Index(['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time',
       'weekday'],
      dtype='object')
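The renaming above lists every new name by hand. As a minimal sketch of an alternative, the stray spaces and mixed capitalization can be cleaned programmatically first, leaving only the real renames to spell out. The toy frame `raw` and the rename mapping below are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame with messy headers similar to the raw file (the row is illustrative).
raw = pd.DataFrame(
    [["FFB692EC", "rock", "Moscow"]],
    columns=["  userID", "genre", "  City  "],
)

# Strip surrounding whitespace and lowercase every header in one pass.
cleaned = raw.copy()
cleaned.columns = cleaned.columns.str.strip().str.lower()

# Spell out only the names that still need changing (hypothetical mapping).
cleaned = cleaned.rename(columns={"userid": "user_id", "genre": "genre_name"})

print(list(cleaned.columns))
```

This keeps the rename robust if the column order in the source file changes, since names are matched by key rather than by position.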

In [8]:
df.isnull().sum()

user_id           0
track_name     1231
artist_name    7203
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

Missing values indicate that not all information is available for some tracks.

I will replace the missing values in the 'track_name' and 'artist_name' columns with the placeholder 'unknown'.

In [9]:
df['track_name'] = df['track_name'].fillna('unknown')

In [10]:
df['artist_name'] = df['artist_name'].fillna('unknown')

In [11]:
df.isnull().sum()

user_id           0
track_name        0
artist_name       0
genre_name     1198
city              0
time              0
weekday           0
dtype: int64
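The two `fillna` calls above can also be written as a single step over a list of columns. The sketch below shows this on a small made-up frame (`toy` is not the project data), leaving the genre column untouched as in the notebook.

```python
import pandas as pd

# Small illustrative frame with gaps in all three descriptive columns.
toy = pd.DataFrame({
    "track_name": ["True", None, "Pessimist"],
    "artist_name": ["Roman Messer", None, None],
    "genre_name": ["dance", "ruspop", None],
})

# Fill only the columns where a placeholder is acceptable.
cols = ["track_name", "artist_name"]
toy[cols] = toy[cols].fillna("unknown")

# Count the gaps that remain per column.
missing = toy.isnull().sum()
print(missing)
```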

I will drop the rows with a missing value in the genre_name column so that no missing values are left in the dataset.

In [12]:
df.dropna(subset = ['genre_name'], inplace = True)

In [13]:
df.isnull().sum()

user_id        0
track_name     0
artist_name    0
genre_name     0
city           0
time           0
weekday        0
dtype: int64

Now let's check the dataframe for duplicates and delete them if necessary.

In [14]:
df.duplicated().sum()

3755

In [15]:
df = df.drop_duplicates().reset_index(drop = True)

In [16]:
df.duplicated().sum()

0

The duplicates may have appeared because of problems in data recording. I should report this finding to the data engineers so that it can be avoided in the future.

I will save the unique values of the genre_name column in the *genres_list* variable.
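Before reporting duplicates, it can help to look at the repeated rows themselves. The sketch below, on a made-up frame, contrasts `duplicated()` (which flags only the extra copies) with `duplicated(keep=False)` (which flags every member of a duplicate group); the data is illustrative only.

```python
import pandas as pd

# Illustrative frame: the first row appears three times in total.
toy = pd.DataFrame({
    "user_id": ["A1", "A1", "B2", "A1"],
    "track_name": ["True", "True", "Pessimist", "True"],
    "time": ["13:00:07", "13:00:07", "21:20:49", "13:00:07"],
})

# Number of rows that repeat an earlier row (the count that drop_duplicates removes).
n_extra = int(toy.duplicated().sum())

# All rows that belong to a duplicate group, including the first occurrence.
all_copies = toy[toy.duplicated(keep=False)]

# Deduplicated frame with a fresh index, as in the cell above.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_extra, len(all_copies), len(deduped))
```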

In [17]:
genres_list = df['genre_name'].unique()

In [18]:
genres_list

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', 'alternative', 'children', 'rnb', 'hip', 'jazz',
       'postrock', 'latin', 'classical', 'metal', 'reggae', 'tatar',
       'blues', 'instrumental', 'rusrock', 'dnb', 'türk', 'post',
       'country', 'psychedelic', 'conjazz', 'indie', 'posthardcore',
       'local', 'avantgarde', 'punk', 'videogame', 'techno', 'house',
       'christmas', 'melodic', 'caucasian', 'reggaeton', 'soundtrack',
       'singer', 'ska', 'shanson', 'ambient', 'film', 'western', 'rap',
       'beats', "hard'n'heavy", 'progmetal', 'minimal', 'contemporary',
       'new', 'soul', 'holiday', 'german', 'tropical', 'fairytail',
       'spiritual', 'urban', 'gospel', 'nujazz', 'folkmetal', 'trance',
       'miscellaneous', 'anime', 'hardcore', 'progressive', 'chanson',
       'numetal', 'vocal', 'estrada', 'russian', 'classicmetal',
       'dubstep', 'club', 'deep', 'southern', 'black', 'folkrock',
       'fitness', 'french', 'd

In [19]:
# searches for a specific genre in the genres_list
def find_genre(genre):
    counter = 0
    for i in genres_list:
        if i == genre:
            counter += 1
    return counter

Let's look for alternative spellings of the hiphop genre:
* hip
* hop
* hip-hop

In [20]:
print(find_genre('hip'))

1


In [21]:
print(find_genre('hop'))

0


In [22]:
print(find_genre('hip-hop'))

0
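The checks above test one exact spelling at a time. As a rough sketch, a substring search can surface all candidate spellings in one call. The helper `find_genre_candidates` and the sample array are hypothetical; the sample holds only a few names taken from the list printed earlier.

```python
import numpy as np

# A handful of genre names from the list above, used as a stand-in for genres_list.
genres_sample = np.array(["rock", "pop", "hip", "jazz", "postrock"])

def find_genre_candidates(genres, fragments):
    """Return the genre names that contain any of the given fragments."""
    return [str(g) for g in genres if any(f in g for f in fragments)]

candidates = find_genre_candidates(genres_sample, ["hip", "hop"])
print(candidates)
```

A substring search can return unrelated genres too, so the result is a list to review by eye rather than a list to replace automatically.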


Let's declare a function *find_hip_hop()* that replaces an incorrect name of this genre in the *'genre_name'* column with *'hiphop'* and returns the number of rows that still contain the incorrect name, so that a result of 0 confirms the replacement.

In [23]:
def find_hip_hop(df, wrong):
    df['genre_name'] = df['genre_name'].replace(wrong, 'hiphop')
    count = df[df['genre_name'] == wrong]['genre_name'].count()
    return count

In [24]:
print(find_hip_hop(df,'hip'))

0
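Only one wrong spelling was found in this dataset, but the same idea extends to several spellings at once. The sketch below is a hypothetical generalization (`replace_wrong_genres` is not part of the notebook) run on made-up values.

```python
import pandas as pd

# Illustrative column with several spellings of the same genre.
toy = pd.DataFrame({"genre_name": ["hip", "hop", "hip-hop", "rock", "hiphop"]})

def replace_wrong_genres(frame, wrong_names, correct_name):
    """Replace every name in wrong_names with correct_name.

    Returns the number of rows that still hold a wrong name afterwards.
    """
    frame["genre_name"] = frame["genre_name"].replace(wrong_names, correct_name)
    return int(frame["genre_name"].isin(wrong_names).sum())

remaining = replace_wrong_genres(toy, ["hip", "hop", "hip-hop"], "hiphop")
print(remaining)
```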


Let's double check that the data cleaning was completed successfully.

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60126 entries, 0 to 60125
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      60126 non-null  object
 1   track_name   60126 non-null  object
 2   artist_name  60126 non-null  object
 3   genre_name   60126 non-null  object
 4   city         60126 non-null  object
 5   time         60126 non-null  object
 6   weekday      60126 non-null  object
dtypes: object(7)
memory usage: 3.2+ MB


**Results**

During the preprocessing stage, I found and handled missing values and duplicates in the dataset. In addition, I corrected an inconsistent genre name and fixed the column names.

# Is it true that different cities listen to music differently?

There is a hypothesis that in Moscow and St. Petersburg users listen to music differently. I will check this assumption on the data for three days of the week - Monday, Wednesday and Friday.

For each city, I'll count the number of tracks with a known genre that were played on these days and compare the results.

In [26]:
df.groupby('city')['genre_name'].count()

city
Moscow              41892
Saint-Petersburg    18234
Name: genre_name, dtype: int64

There are more sessions in Moscow than in St. Petersburg, but this does not mean that Moscow users are more active. Yandex.Music simply has more users in Moscow, so the raw counts should be read with the difference in audience size in mind.

In [27]:
df.groupby('weekday')['genre_name'].count()

weekday
Friday       21482
Monday       20866
Wednesday    17778
Name: genre_name, dtype: int64

Overall, people listen to more music on Monday and Friday than on Wednesday.

In [28]:
def number_tracks(df, day, city):
    # returns the number of tracks played in the city on a specified day
    track_list = df[(df['weekday'] == day) & (df['city'] == city)]
    track_list_count = track_list['genre_name'].count()
    return track_list_count

In [29]:
print(number_tracks(df, 'Monday', 'Moscow'))

15347


In [30]:
print(number_tracks(df, 'Monday', 'Saint-Petersburg'))

5519


In [31]:
print(number_tracks(df, 'Wednesday', 'Moscow'))

10865


In [32]:
print(number_tracks(df, 'Wednesday', 'Saint-Petersburg'))

6913


In [33]:
print(number_tracks(df, 'Friday', 'Moscow'))

15680


In [34]:
print(number_tracks(df, 'Friday', 'Saint-Petersburg'))

5802


Let's summarize the received information in one table.

In [35]:
data = [['Moscow', 15347, 10865, 15680], 
        ['Saint Petersburg', 5519, 6913, 5802]]

table = pd.DataFrame(columns = 
                     ['city', 'monday', 
                      'wednesday', 'friday'], 
                     data = data)
table

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15347,10865,15680
1,Saint Petersburg,5519,6913,5802
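The summary table above was typed in by hand from the six printed counts. As a sketch of an alternative, `pd.crosstab` can build the same kind of city-by-weekday table directly from the data, which avoids copying errors. The frame below is a tiny made-up sample, so its numbers do not match the project results.

```python
import pandas as pd

# Tiny illustrative sample of listening sessions.
toy = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Saint-Petersburg", "Moscow", "Saint-Petersburg"],
    "weekday": ["Monday", "Friday", "Monday", "Monday", "Wednesday"],
})

# Count sessions per city and weekday, then put the days in calendar order.
day_order = ["Monday", "Wednesday", "Friday"]
summary = pd.crosstab(toy["city"], toy["weekday"]).reindex(
    columns=day_order, fill_value=0
)
print(summary)
```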


**Findings**

The table above shows that Moscow users listen to more music on Monday and Friday, with a dip on Wednesday. St. Petersburg users, on the contrary, listen more on Wednesday than on Monday or Friday.

# Do people listen to the same music on Monday mornings and Friday evenings?

We are looking for an answer to the question of what genres prevail in different cities on Monday morning and Friday evening. There is an assumption that on Monday mornings users listen to more invigorating music (for example the 'pop' genre) and on Friday evenings - more dance music.

In [36]:
moscow_general = df[df['city'] == 'Moscow']

In [37]:
spb_general = df[df['city'] == 'Saint-Petersburg']

In [38]:
def genre_weekday(df, day, time1, time2):
    # returns number of sessions for each genre at the specified time
    genre_list = df[(df['weekday'] == day) & (df['time'] > time1) & (df['time'] < time2)]
    genre_list_sorted = genre_list.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)
    return genre_list_sorted.head(10)
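The function above compares times as strings. This works because the values are zero-padded `HH:MM:SS`, so alphabetical order matches chronological order. The sketch below shows the same filter on made-up rows, using `value_counts()` as a shorter way to count and sort; `top_genres` is a hypothetical name, not the notebook's function.

```python
import pandas as pd

# Illustrative sessions, including values on and outside the window boundaries.
toy = pd.DataFrame({
    "weekday": ["Monday", "Monday", "Monday", "Friday", "Monday"],
    "time": ["08:34:34", "08:37:09", "11:00:00", "08:00:00", "06:59:59"],
    "genre_name": ["dance", "folk", "pop", "rock", "dance"],
})

def top_genres(frame, day, start, end, n=10):
    """Count genres played on `day` strictly between `start` and `end`."""
    in_window = (
        (frame["weekday"] == day)
        & (frame["time"] > start)
        & (frame["time"] < end)
    )
    # value_counts sorts by frequency in descending order by default.
    return frame.loc[in_window, "genre_name"].value_counts().head(n)

result = top_genres(toy, "Monday", "07:00:00", "11:00:00")
print(result)
```

Both bounds are exclusive, so a session logged at exactly 11:00:00 is left out, as in the original function.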

Let's compare the results for Moscow and St. Petersburg on Monday morning (7 AM to 11 AM) and on Friday evening (5 PM to 11 PM).

In [39]:
print(genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00'))

genre_name
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
classical      157
Name: genre_name, dtype: int64


In [40]:
print(genre_weekday(spb_general, 'Monday', '07:00:00', '11:00:00'))

genre_name
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre_name, dtype: int64


In [41]:
print(genre_weekday(moscow_general, 'Friday', '17:00:00', '23:00:00'))

genre_name
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre_name, dtype: int64


In [42]:
print(genre_weekday(spb_general, 'Friday', '17:00:00', '23:00:00'))

genre_name
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre_name, dtype: int64


The popular genres on Monday morning in St. Petersburg and Moscow turned out to be similar: as expected, pop leads in both cities. The differences appear only near the bottom of the top 10: the St. Petersburg list includes jazz, while the Moscow list includes the *world* genre instead.

On Friday evening the situation does not change much. Pop music is still in first place. Again, the difference is noticeable only near the tail of the top 10: the *world* genre now appears in both cities, while Moscow keeps ruspop and St. Petersburg keeps jazz.

**Findings**

The pop genre is the undisputed leader in both cities.
Overall, the same five genres make up the top 5 in both cities, although their order varies slightly.

# Do Moscow and St. Petersburg have different directions in music?

Hypothesis: St. Petersburg has a rich rap culture, so rap is listened to more often there.

Moscow is a city of contrasts, but the majority of users listen to pop music.

In [43]:
moscow_genres = \
moscow_general.groupby('genre_name')\
['genre_name'].count().sort_values(ascending = False)

In [44]:
moscow_genres.head(10)

genre_name
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre_name, dtype: int64

Let's group the *spb_general* table by genre, count the number of songs of each genre using the *count()* method, sort it in descending order, and store the result in the *spb_genres* table.

In [45]:
spb_genres = \
spb_general.groupby('genre_name')\
['genre_name'].count().sort_values(ascending = False)

In [46]:
spb_genres.head(10)

genre_name
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre_name, dtype: int64

**Findings**

In Moscow, in addition to the most popular pop genre, Russian pop music (ruspop) is also among the top 10 genres, which indicates that Moscow listeners are interested in Russian pop. The same genre also appears in the St. Petersburg top 10.

Rap, contrary to the assumption, holds nearly identical positions in both cities.
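Because Moscow has more than twice as many sessions, raw counts are hard to compare between cities. As a rough check of the rap hypothesis, the sketch below converts a few of the counts printed above into shares of each city's total. The numbers are copied from the outputs in this notebook; the dictionary layout is only an assumption made for illustration.

```python
# Session totals per city and selected genre counts, copied from the outputs above.
totals = {"Moscow": 41892, "Saint-Petersburg": 18234}
counts = {
    "Moscow": {"pop": 5892, "hiphop": 2096, "rusrap": 1161},
    "Saint-Petersburg": {"pop": 2431, "hiphop": 960, "rusrap": 564},
}

# Share of each genre in its city's sessions, in percent.
shares = {
    city: {genre: round(100 * n / totals[city], 2) for genre, n in genres.items()}
    for city, genres in counts.items()
}

for city, genres in shares.items():
    print(city, genres)
```

With these numbers, rusrap accounts for roughly 2.8% of sessions in Moscow and 3.1% in St. Petersburg, and hiphop for about 5.0% and 5.3%. The gap is small, which is consistent with the finding that rap holds similar positions in both cities.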

# Conclusions

Working hypotheses:

* People listen to music differently in Moscow and St. Petersburg

* The music that people listen to differs between Monday mornings and Friday evenings

* People from these two cities prefer different musical genres

**Final results**

Pop music is the most popular genre in both Moscow and St. Petersburg. At the same time, genre preferences in each city do not depend on the day of the week: people tend to listen to what they like no matter what day it is.

However, if we look at each city separately, we can see that users from Moscow listen more on Monday and Friday, while users from St. Petersburg listen more on Wednesday.