#  Music Service

Comparisons of Moscow and St. Petersburg are surrounded by myths. For example:
 * Moscow is a metropolis with a hectic pace of life;
 * St. Petersburg is a cultural capital, with its own tastes.

This project uses Yandex.Music data to compare the behavior of users in the two cities.

**Aim of the research:** to check three hypotheses.
1. User activity depends on the day of the week, and the pattern differs between Moscow and St. Petersburg.
2. On Monday morning, certain musical genres dominate in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music: pop is listened to more often in Moscow, and Russian rap in St. Petersburg.

**Research progress**

The data is read from the file `yandex_music_project.csv`. Because the quality of the file is unknown, the data must be examined and cleaned before the analysis.

Therefore, the research is divided into three steps:
 1. Exploring the data
 2. Pre-processing the data
 3. Performing an analysis



## Exploring the data


In [2]:
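import pandas as pd # import the pandas library

In [3]:
# read the dataset and assign it to df
# (raw string, so the backslashes in the Windows path are not treated as escape sequences)
df = pd.read_csv(r'C:\YandexPracticumProjects\Project_1_YandexMusic\GitProject/yandex_music_project.csv')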

In [4]:
display(df.head(10)) # display 1st 10 rows

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [5]:
df.info() # get general information about dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The dataframe has seven columns, all of type `object`.

The column headings violate good naming style in three ways:
1. Lowercase and uppercase letters are mixed within a name (`userID`).
2. Some names contain extra spaces.
3. Capitalization is inconsistent: some names start with an uppercase letter, others with a lowercase one.


The non-null counts differ between columns, which means that some values are missing.


**To sum up**

Each row of the table describes one play of a track. Some columns describe the track itself: title, artist and genre. The rest describe the user: which city they are in and when they listened to the track.

At first glance, there is enough data to check the hypotheses. However, some values are missing, and the column names do not follow good naming style.

## Pre-processing

### Heading style

In [6]:
display(df.columns) # list of column names of dataframe

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [7]:
# rename columns
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'}) 

In [8]:
display(df.columns) # check the results - list column names

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
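
Renaming each column by hand works for seven columns. For wider tables the same clean-up could be automated. The sketch below is one possible approach, not the method used in this notebook: `to_snake_case` is a hypothetical helper, and its rule (strip spaces, insert an underscore before an uppercase letter that follows a lowercase letter or digit, then lowercase everything) is an assumption that happens to fit these seven headings.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip spaces and convert camelCase to snake_case."""
    stripped = name.strip()
    # Insert "_" at each boundary between a lowercase letter/digit and an uppercase letter
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()

# Column names exactly as reported by df.columns before renaming
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(column) for column in raw_columns]
print(clean_columns)
```

On a dataframe, the result could be applied with `df.columns = [to_snake_case(c) for c in df.columns]`.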

### Missing values

In [9]:
display(df.isna().sum()) # sum missing values

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the research. The gaps in `track` and `artist` are not important for the analysis, so it is enough to replace them with an explicit placeholder.

Gaps in `genre`, however, can interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, the right approach would be to find out why the values are missing and restore them. Here the data cannot be restored, so it is necessary to:
* fill in these gaps with an explicit placeholder as well,
* evaluate how much they distort the results.

In [10]:
# replace missing values in the listed columns with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [11]:
display(df.isna().sum()) # summarize missing values

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
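
To judge how much the gaps might matter, the missing counts reported before filling can be expressed as shares of all rows. This is a minimal sketch that reuses the numbers printed by `df.info()` and `df.isna().sum()` above; the variable names are illustrative only.

```python
# Counts copied from the df.info() and df.isna().sum() outputs shown earlier
total_rows = 65079
missing_counts = {'track': 1231, 'artist': 7203, 'genre': 1198}

# Share of missing values per column, in percent
missing_share = {
    column: round(100 * count / total_rows, 2)
    for column, count in missing_counts.items()
}
print(missing_share)
```

On a live dataframe, an equivalent figure can be obtained with `df.isna().mean() * 100`, run before `fillna`. These shares describe the raw file, before duplicates are removed.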

### Duplicates

In [12]:
print(df.duplicated().sum()) # count explicit duplicates

3826


In [13]:
# remove explicit duplicates (with the removal of old indexes and formation of new ones)
df = df.drop_duplicates().reset_index(drop=True)

In [14]:
print(df.duplicated().sum()) # check for duplicates

0
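
The behavior of `duplicated()` and `drop_duplicates()` is easier to see on a small table. The data below is invented purely for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy data (invented for illustration): rows 0 and 2 are identical
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'A1', 'C3'],
    'track': ['Song X', 'Song Y', 'Song X', 'Song Z'],
})

# duplicated() marks repeats only, not the first occurrence of a row
n_duplicates = toy.duplicated().sum()

# Drop the repeats and rebuild a gap-free index
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

Without `reset_index(drop=True)` the surviving rows would keep their old labels (0, 1, 3 in this toy case), which leaves gaps in the index.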


Next, remove the implicit duplicates in the `genre` column. For example, the name of the same genre can be spelled in slightly different ways. Such errors also affect the results of the research.

In [15]:
# view unique genre names
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

The list contains implicit duplicates of the genre name `hiphop`: misspellings and alternative names for the same genre. They are:
* *hip*,
* *hop*,
* *hip-hop*.

In [16]:
# function to replace implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre): # the input is a list of incorrect values and a string with the correct value
    for wrong_genre in wrong_genres: # loop through the incorrect values
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre) # the replace() method is called for each incorrect name 

In [17]:
# Eliminate implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop'] # list of invalid names
name = 'hiphop' # valid name
replace_wrong_genres(duplicates, name) # replace() function inside is called 3 times
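
As an alternative to the loop, `Series.replace` also accepts a list of values to replace together with a single replacement value. The sketch below shows this on a toy column. The data is invented for illustration, and `'triphop'` is included only to show that whole values are matched rather than substrings.

```python
import pandas as pd

# Toy genre column (invented for illustration)
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'triphop'])

# A list of values to replace and one replacement value, in a single call
normalized = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')
print(normalized.tolist())
```

Because only exact matches are replaced, a value such as `'triphop'` stays unchanged even though it contains `hop`.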

In [18]:
# Check for implicit duplicates
display(df['genre'].sort_values().unique())

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**To sum up:**

Pre-processing revealed three problems in the dataset:

- column names that violate naming style,
- missing values,
- duplicates, both explicit and implicit.

## Hypothesis checking

### Comparison of user behavior in the cities

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. It is checked on the three days of the week present in the data (Monday, Wednesday and Friday) in two steps:

* Separate the users of Moscow and St. Petersburg.
* Compare how many tracks each group of users listened to on Monday, Wednesday and Friday.


In [19]:
# Convert time strings from hh:mm:ss to seconds since midnight
def time_convert(x):
    hours, minutes, seconds = x.split(':')
    return 3600 * int(hours) + 60 * int(minutes) + int(seconds)

df_convert = df.copy()
df_convert['time'] = df_convert['time'].apply(time_convert)
display(df_convert['time'].head(10))

0    73713
1    50829
2    75487
3    31029
4    30874
5    47381
6    46807
7    74869
8    33460
9    76849
Name: time, dtype: int64
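
For reference, a time of day `hh:mm:ss` corresponds to `3600*hh + 60*mm + ss` seconds since midnight. A vectorised way to get the same quantity is `pd.to_timedelta`, which parses such strings as durations. This is a sketch of an alternative rather than the approach taken in the notebook, and the three sample values are the first three `time` entries from the table at the top.

```python
import pandas as pd

# First three time values from the table shown at the top of the notebook
times = pd.Series(['20:28:33', '14:07:09', '20:58:07'])

# Parse "hh:mm:ss" as a duration since midnight and express it in whole seconds
seconds = pd.to_timedelta(times).dt.total_seconds().astype(int)
print(seconds.tolist())
```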

In [20]:
# Count the number of plays in each city
df_grouped_by_city = df.groupby('city')['time'].count()
display(df_grouped_by_city) 

city
Moscow              42741
Saint-Petersburg    18512
Name: time, dtype: int64

More tracks were played in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often: there are simply more users in Moscow.
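
The notebook does not verify the number of users per city. One way to do so would be to count distinct `user_id` values next to the number of plays. The sketch below uses an invented toy play log, so its numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy play log (invented for illustration)
plays = pd.DataFrame({
    'user_id': ['U1', 'U1', 'U2', 'U3', 'U3', 'U3'],
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
})

# 'count' is the number of plays, 'nunique' the number of distinct users
per_city = plays.groupby('city')['user_id'].agg(['count', 'nunique'])
per_city['plays_per_user'] = per_city['count'] / per_city['nunique']
print(per_city)
```

Comparing `plays_per_user` between cities separates "more users" from "more active users".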

In [21]:
# Count how many times songs were listened to on each week day
time_grouped_by_day = df_convert.groupby('day')['city'].count()
display(time_grouped_by_day)

day
Friday       21840
Monday       21354
Wednesday    18059
Name: city, dtype: int64

Across the two cities combined, users are least active on Wednesday. This pattern may change when each city is evaluated separately.

In [22]:
# Count how many tracks were played on a given day of the week in a given city
def number_tracks(day, city):
    track_list = df.loc[df['day'] == day]
    track_list = track_list.loc[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [23]:
# Plays in Moscow on Monday
number_tracks('Monday', 'Moscow')

15740

In [24]:
# Plays in St. Petersburg on Monday
number_tracks('Monday', 'Saint-Petersburg')

5614

In [25]:
# Plays in Moscow on Wednesday
number_tracks('Wednesday', 'Moscow')

11056

In [26]:
# Plays in St. Petersburg on Wednesday
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [27]:
# Plays in Moscow on Friday
number_tracks('Friday', 'Moscow')

15945

In [28]:
# Plays in St. Petersburg on Friday
number_tracks('Friday', 'Saint-Petersburg')

5895

Create a table using the `pd.DataFrame` constructor, where:
* the column names are `['city', 'monday', 'wednesday', 'friday']`;
* the data are the results obtained with the `number_tracks` function.

In [29]:
# Table with the results
data = [['Moscow', 15740, 11056, 15945],
       ['Saint-Petersburg', 5614, 7003, 5895]]
columns = ['city', 'monday', 'wednesday', 'friday']

research_tracks_result = pd.DataFrame(data=data, columns=columns)
display(research_tracks_result)

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
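
The same city-by-day table could also be built in one step instead of six separate calls to `number_tracks`. The sketch below uses `pd.crosstab` on an invented toy play log; the column names match the notebook, while the data does not come from the dataset.

```python
import pandas as pd

# Toy play log (invented for illustration)
plays = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# One row per city, one column per day, each cell holds the number of plays
day_order = ['Monday', 'Wednesday', 'Friday']
plays_by_day = pd.crosstab(plays['city'], plays['day']).reindex(
    columns=day_order, fill_value=0
)
print(plays_by_day)
```

The `reindex` call only puts the days in calendar order, since `crosstab` sorts column labels alphabetically.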


**To sum up:**

The table shows a difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, users listen to music most on Wednesday. Monday and Friday are lower than Wednesday by a similar margin.

Therefore, the data support the first hypothesis.
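
Because Moscow has far more plays overall, the two cities are easier to compare when each day is expressed as a share of that city's own total. The sketch below does this with the counts from the results table above. It is an illustrative addition, and the variable names are not part of the notebook.

```python
# Play counts copied from the results table above
counts = {
    'Moscow': {'monday': 15740, 'wednesday': 11056, 'friday': 15945},
    'Saint-Petersburg': {'monday': 5614, 'wednesday': 7003, 'friday': 5895},
}

# Share of each day within a city's own total, in percent
shares = {}
for city, by_day in counts.items():
    total = sum(by_day.values())
    shares[city] = {day: round(100 * n / total, 1) for day, n in by_day.items()}

# Day with the fewest and the most plays in each city
quietest = {city: min(by_day, key=by_day.get) for city, by_day in counts.items()}
busiest = {city: max(by_day, key=by_day.get) for city, by_day in counts.items()}

print(shares)
print(quietest)
print(busiest)
```

Wednesday comes out as the quietest day in Moscow and the busiest day in St. Petersburg, which matches the reading of the table.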

### Listening to music at the beginning and end of the week

According to the second hypothesis, certain genres dominate on Monday morning in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings can be spent by listening to different genres, depending on the city.

Save the data for each city in a separate variable:
* for Moscow, in `moscow_general`;
* for St. Petersburg, in `spb_general`.

In [30]:
# keep only the rows where city is 'Moscow'
moscow_general = df[df['city'] == 'Moscow']
display(moscow_general)

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | unknown | ruspop | Moscow | 09:17:40 | Friday |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 61247 | 83A474E7 | I Worship Only What You Bleed | The Black Dahlia Murder | extrememetal | Moscow | 21:07:12 | Monday |
| 61248 | 729CBB09 | My Name | McLean | rnb | Moscow | 13:32:28 | Wednesday |
| 61250 | C5E3A0D5 | Jalopiina | unknown | industrial | Moscow | 20:09:26 | Friday |
| 61251 | 321D0506 | Freight Train | Chas McDevitt | rock | Moscow | 21:43:59 | Friday |


In [31]:
# keep only the rows where city is 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']
display(spb_general)

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 9 | E772D5C0 | Pessimist | unknown | dance | Saint-Petersburg | 21:20:49 | Wednesday |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 61239 | D94F810B | Theme from the Walking Dead | Proyecto Halloween | film | Saint-Petersburg | 21:14:40 | Monday |
| 61240 | BC8EC5CF | Red Lips: Gta (Rover Rework) | Rover | electronic | Saint-Petersburg | 21:06:50 | Monday |
| 61241 | 29E04611 | Bre Petrunko | Perunika Trio | world | Saint-Petersburg | 13:56:00 | Monday |
| 61242 | 1B91C621 | (Hello) Cloud Mountain | sleepmakeswaves | postrock | Saint-Petersburg | 09:22:13 | Monday |


Create function `genre_weekday()` with 4 parameters:
* dataframe
* week day
* start timestamp in the format 'hh:mm'
* end timestamp in the format 'hh:mm'

The function returns information about the top 10 genres of those tracks that were listened to on the specified day, in the interval between two timestamps.

In [32]:
# return a Series with the top 10 most popular genres
# on the given day within the given time interval
def genre_weekday(table, day, time1, time2):
    genre_df = table.loc[table['day'] == day]
    genre_df = genre_df.loc[genre_df['time'] > time1]
    genre_df = genre_df.loc[genre_df['time'] < time2]
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted.head(10)    
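
The function compares the `time` column with the boundary strings directly, without converting them to numbers. This works because all values are zero-padded `hh:mm:ss` strings, and such strings sort character by character in the same order as the clock. A small illustration with one time value from the dataset:

```python
# Times in the dataset are zero-padded "hh:mm:ss" strings
play_time = '08:34:34'

# Zero-padded strings compare in the same order as the clock times they represent
in_morning_window = '07:00' < play_time < '11:00'

# Without zero-padding the comparison breaks: the character '7' sorts after '0'
unpadded_start_works = '7:00' < play_time

# A string sorts after its own prefix, so a play at exactly 07:00:00 passes the lower bound
exact_start_included = '07:00' < '07:00:00'

print(in_morning_window, unpadded_start_works, exact_start_included)
```

The boundaries passed to `genre_weekday()` therefore need to be zero-padded as well, for example `'07:00'` rather than `'7:00'`.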

Compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [33]:
# call the function for Monday morning in Moscow
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64

In [34]:
# call the function for Monday morning in St. Petersburg
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64

In [35]:
# call the function for Friday evening in Moscow
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64

In [36]:
# call the function for Friday evening in St. Petersburg
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64

**To sum up:**

By comparing the top 10 genres on Monday morning, the following conclusions can be drawn:

1. Users in Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow ranking includes the “world” genre, while the St. Petersburg ranking includes jazz and classical.

2. There are so many missing values in Moscow that the value `'unknown'` took tenth place among the most popular genres. These missing values therefore reduce the accuracy of the results.

Friday evening does not change this picture. Some genres move up a little, others move down, but overall the top 10 stays largely the same.

Thus, the second hypothesis is only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and St. Petersburg is not significant. Russian pop is listened to more often in Moscow, while jazz is preferred in St. Petersburg.

However, missing values play a very important role in the analysis. There are so many of them in Moscow that the top 10 ranking could look different if the genre data had not been lost.
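
The differences between the four top 10 lists can be made explicit with set differences. The sketch below copies the genre names from the outputs above. `only_in` is a hypothetical helper introduced for illustration.

```python
# Top 10 genre lists copied from the four outputs above
top10 = {
    ('Monday', 'Moscow'): ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                           'ruspop', 'world', 'rusrap', 'alternative', 'unknown'],
    ('Monday', 'Saint-Petersburg'): ['pop', 'dance', 'rock', 'electronic', 'hiphop',
                                     'ruspop', 'alternative', 'rusrap', 'jazz', 'classical'],
    ('Friday', 'Moscow'): ['pop', 'rock', 'dance', 'electronic', 'hiphop',
                           'world', 'ruspop', 'alternative', 'classical', 'rusrap'],
    ('Friday', 'Saint-Petersburg'): ['pop', 'electronic', 'rock', 'dance', 'hiphop',
                                     'alternative', 'jazz', 'classical', 'rusrap', 'world'],
}

def only_in(day, city, other_city):
    """Hypothetical helper: genres in one city's top 10 that are absent from the other's."""
    return sorted(set(top10[(day, city)]) - set(top10[(day, other_city)]))

for day in ('Monday', 'Friday'):
    print(day, 'only in Moscow:', only_in(day, 'Moscow', 'Saint-Petersburg'))
    print(day, 'only in St. Petersburg:', only_in(day, 'Saint-Petersburg', 'Moscow'))
```

Set differences ignore rank, so they show which genres enter or leave a top 10 but not how positions shift inside it.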

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg prefers rap, and music of this genre is listened to there more often than in Moscow. Moscow, by contrast, is dominated by pop music.

In [37]:
# count the number of values in 'genre' column and sort them out in descending order
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [38]:
# show the first 10 rows of moscow_genres
display(moscow_genres.head(10))

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [39]:
# count the number of values in 'genre' column and sort them out in descending order
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [40]:
# show the first 10 rows of spb_genres
display(spb_genres.head(10))

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**To sum up:**

The hypothesis is partially confirmed:

* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a related genre, Russian pop, is also in the top 10.
* Contrary to the hypothesis, rap is about equally popular in Moscow and St. Petersburg.
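
Raw counts are hard to compare because Moscow has more than twice as many plays. A minimal sketch that converts the `pop` and `rusrap` counts into shares of each city's total plays, using only numbers printed earlier in this notebook:

```python
# Counts copied from the outputs above: total plays per city and plays per genre
total_plays = {'Moscow': 42741, 'Saint-Petersburg': 18512}
genre_plays = {
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
}

# Share of each genre within a city's plays, in percent
genre_share = {
    genre: {city: round(100 * n / total_plays[city], 2) for city, n in by_city.items()}
    for genre, by_city in genre_plays.items()
}
print(genre_share)
```

The `rusrap` shares come out at roughly 2.7% for Moscow and 3.0% for St. Petersburg, and the `pop` shares at roughly 13.8% and 13.1%. Whether gaps of this size are meaningful is not tested here.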

## Research results

Three hypotheses were checked, with the following results:

1. The day of the week affects the activity of users in Moscow and St. Petersburg in different ways.

The first hypothesis is fully confirmed.

2. Musical preferences do not change much during the week in either Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* the “world” genre appears in the Moscow top 10,
* jazz and classical music appear in the St. Petersburg top 10.

Thus, the second hypothesis is only partly confirmed. This result could have been different if the dataset had no missing values.

3. The tastes of users in Moscow and St. Petersburg have more similarities than differences. Genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis is not confirmed. If there are differences in preferences, they are not noticeable for the majority of users.