# Yandex Music in Russia

Comparisons of Moscow and St. Petersburg are surrounded by myths. For example:
 * Moscow is a metropolis subject to the rigid rhythm of the work week;
 * St. Petersburg is a cultural capital with its own taste.

Using Yandex Music data, let us compare the behavior of users in the two cities.

**The purpose of the study** is to test three hypotheses:
1. User activity depends on the day of the week. Moreover, in Moscow and St. Petersburg this manifests itself in different ways.
2. On Monday mornings, certain genres dominate in Moscow, while different ones dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. Pop music is listened to more often in Moscow, and Russian rap in St. Petersburg.

**How do we approach this research?**

I will get user behavior data from the file `yandex_music_project.csv`. Nothing is known about the quality of the data, so a review is required before testing the hypotheses.

I will then check the data for errors and assess their impact on the study. During the preprocessing phase, I will look for ways to correct the most critical errors.

The research will take place in three stages:
 1. Data review.
 2. Data preprocessing and cleaning.
 3. Hypothesis testing.



## Data overview

Let's get a first impression of the Yandex Music data.




**Step 1**

The main analytics tool is pandas. Import this library.

In [None]:
import pandas as pd  # import pandas under the conventional alias pd

**Step 2**

Read the file `yandex_music_project.csv` from the folder `/datasets` and store it in a variable `df`:

In [6]:
df = pd.read_csv('/datasets/yandex_music_project.csv')  # read the data and store it in df

**Step 3**


Display the first 10 rows of the table to get acquainted with the data:

In [7]:
df.head(10) # show the first 10 rows of df

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


**Step 4**


Get general information about the table using the method info():

In [8]:
df.info() # getting general information about the data in the df table

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So the table has seven columns, and every column has the data type `object`.

According to the data documentation:
* `userID` — user ID;
* `Track` — name of the track;  
* `artist` — the name of the artist;
* `genre` — the name of the genre;
* `City` — user's city;
* `time` — start time of listening;
* `Day` — day of the week.

Note that the number of non-null values differs between columns. This means there are missing values in the data.

**Step 5**

**Problems identified from first glance**

There are three style violations in the column headings:
1. Lowercase letters are mixed with uppercase letters.
2. Some names contain leading or trailing spaces.
3. Multi-word names are not written in snake_case (for example, `userID`).

**Conclusions**

Each row of the table contains data about a track that a user listened to. Some of the columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

Preliminarily, the data appears sufficient to test the hypotheses. However, there are missing values, and the column names do not follow a consistent style.

To move forward, I'll need to fix problems in the data.

## Data cleaning and preprocessing
Correct the style of the column headings and handle the missing values. Then check the data for duplicates.

### Heading style

**Step 6**

Let's take a look at the column names:

In [11]:
df.columns # list of column names of table df

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

**Step 7**


Establish a good heading style:
* write multi-word names in snake_case,
* make all characters lowercase,
* remove leading and trailing spaces.

To do this, we rename the columns like this:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [14]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'}) # renames the columns

**Step 8**


We trust the code, but let's check the result by printing the column names again:

In [15]:
df.columns # check that the columns were renamed

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
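The renaming above is done by hand with an explicit dictionary. For tables with many columns, the same three style rules could be applied programmatically. The following is a minimal sketch, not the method used in this study: `to_snake_case` is a hypothetical helper, and the empty dataframe `raw` only reproduces the original column names.

```python
import re
import pandas as pd

def to_snake_case(name):
    """Hypothetical helper: strip spaces and convert a name like 'userID' to 'user_id'."""
    name = name.strip()
    # insert an underscore between a lowercase letter (or digit) and an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# empty dataframe that only carries the original column names
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# rename() accepts a function that is applied to every column name
clean = raw.rename(columns=to_snake_case)
clean_columns = list(clean.columns)
```

The explicit dictionary remains the safer choice when only a few columns need fixing, because every change is visible in the code.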

### Missing values

**Step 9**

First, count how many missing values are in the table.

In [17]:
df.isna().sum() # counts the missing values

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the research. Missing values in the `track` and `artist` columns are not important for our work, so it suffices to replace them with an explicit marker.

But missing values in `genre` can make it difficult to compare musical tastes in Moscow and St. Petersburg. In practice, I would try to determine the cause of the gaps and restore the data. That option is not available here, so we will have to:
* fill in these gaps with an explicit marker;
* estimate how much they will distort our calculations.
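One way to estimate the impact is to look at the share of missing values rather than the raw count. In the full table, `genre` is missing in 1198 of 65079 rows, which is about 1.8%. The sketch below shows how the share could also be computed per city. It uses a small made-up dataframe `toy`, not the real data.

```python
import pandas as pd

# made-up data for illustration only
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', None, 'rock', 'jazz', None],
})

# isna() gives True/False per row; the mean of booleans is the share of True values
missing_share = toy['genre'].isna().groupby(toy['city']).mean()
```

If the share differs a lot between the cities, the comparison of genre rankings is more likely to be affected.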

**Step 10**

Let's replace the missing values in the columns `track`, `artist` and `genre` with the string `'unknown'`. To do this, we create a list `columns_to_replace`, iterate over its elements in a `for` loop, and replace the missing values in each column:

In [20]:
columns_to_replace = ['track', 'artist', 'genre'] # list of columns to loop through

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown') # replace missing values with 'unknown'

**Step 11**

Make sure no missing values remain in the table by counting them again:

In [21]:
df.isnull().sum() # count missing values again; every column should show 0

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
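The loop from Step 10 can also be written without explicit iteration, since `fillna()` works on several columns at once. This is a sketch of that alternative on a small made-up dataframe `toy`. It is not a change to the approach used above.

```python
import pandas as pd

# made-up data for illustration only
toy = pd.DataFrame({
    'track': ['a', None],
    'artist': [None, 'b'],
    'genre': ['pop', None],
    'city': ['Moscow', 'Moscow'],
})

columns_to_replace = ['track', 'artist', 'genre']

# select the columns as a sub-table and fill all of them in one call
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

remaining_missing = int(toy.isna().sum().sum())
```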

### Dealing with duplicates

**Step 12**

Count the explicit duplicates in the table:

In [23]:
df.duplicated().sum() # counting explicit duplicates

3826

**Step 13**

pandas provides the `drop_duplicates()` method for removing explicit duplicates:

In [26]:
df = df.drop_duplicates() # remove explicit duplicates

**Step 14**

Once again, count the explicit duplicates in the table to make sure they are gone:

In [28]:
df.duplicated().sum() # check for duplicates

0
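After `drop_duplicates()` the remaining rows keep their old index labels, so the index has holes. That does not affect the counts in this study, but it can be confusing when rows are accessed by label. A common pattern is to reset the index right after removing duplicates. The sketch below shows this on a small made-up dataframe `toy`.

```python
import pandas as pd

# made-up data: rows 1 and 3 repeat row 0
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'A'],
    'track': ['x', 'x', 'y', 'x'],
})

n_duplicates = int(toy.duplicated().sum())  # rows that repeat an earlier row

# drop the duplicates and renumber the remaining rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)
```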

Now we account for typing errors and get rid of the implicit duplicates in the `genre` column. For example, the name of the same genre can be spelled in slightly different ways. Such errors also affect the result of the study.

**Step 15**

Let's display a list of unique genre names sorted alphabetically. To do this:
* we retrieve the desired dataframe column;
* apply a sort method to it;
* return unique values from the column.

In [37]:
genre_names = df['genre'].sort_values().unique()
genre_names # to view our unique genre names

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Step 16**

Look through the list and find implicit duplicates of the genre name `hiphop`. These may be misspellings or alternative names for the same genre.

The list contains the following implicit duplicates:
* *hip*,
* *hop*,
* *hip-hop*.

To clear them from the table, we use the `replace()` method with two arguments: a list of duplicate strings (`hip`, `hop` and `hip-hop`) and a string with the correct value. In the `genre` column of `df`, each value from the list of duplicates is replaced with `hiphop`:

In [42]:
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop') # replace the implicit duplicates of hiphop

**Step 17**

As usual, check the result. Make sure the following names no longer appear in the table:

*   *hip*,
*   *hop*,
*   *hip-hop*.

Output a sorted list of the unique values in the `genre` column:

In [44]:
df['genre'].sort_values().unique() # check for implicit duplicates

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
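Scanning a long array by eye is error-prone, so the check could also be done programmatically. The sketch below uses a small made-up Series `genres` instead of the real column. Note that `replace()` with a list matches whole values only, so `hiphop` itself is not affected by the entry `hip`.

```python
import pandas as pd

# made-up genre values for illustration only
genres = pd.Series(['hiphop', 'hip', 'hop', 'hip-hop', 'rock'])
wrong_names = ['hip', 'hop', 'hip-hop']

# how many rows carry one of the wrong names before the replacement
wrong_before = int(genres.isin(wrong_names).sum())

# replace every wrong name with the correct one
cleaned = genres.replace(wrong_names, 'hiphop')

# the same count after the replacement should be zero
wrong_after = int(cleaned.isin(wrong_names).sum())
```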

**Conclusions**

In preprocessing we found three problems in the data:

- column heading style violations,
- missing values,
- both explicit and implicit duplicates.

We fixed the headers for easier reference in the future, and without duplicates the research is likely to be more accurate.

Missing values have been replaced with `'unknown'`. We should check later whether the gaps we replaced in the `genre` column distort our results.

Let's go ahead and check the hypotheses.

## Hypothesis testing

### Comparing user behavior in the two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. We can check this assumption against the data for three days of the week: Monday, Wednesday and Friday. To do this:

* Separate the users of Moscow and St. Petersburg.
* Compare how many tracks each user group listened to on Monday, Wednesday and Friday.


**Step 18**

So let's estimate user activity in each city. Group the data by city and count the number of track plays in each group.



In [46]:
df.groupby('city')['user_id'].count() # counts the 'plays' in each city

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

More tracks were played in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; there are simply more users in Moscow.

**Step 19**

Now we can group the data by day of the week and count the plays on Monday, Wednesday, and Friday. Please note that the data contains information about the plays for these days only.


In [50]:
df.groupby('day')['user_id'].count() # counts 'plays' on each of these 3 days

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

Across the two cities combined, users seem to be less active on Wednesdays. But this could change if we consider each city separately.

**Step 20**


We have seen how grouping by city and by day of the week works. Now let's write a function that combines these two calculations.

Our function, `number_tracks()`, will count the plays for a given day and city. It has two parameters:
* day of the week,
* city name.

Inside the function, keep only the rows of the table in which:
  * the value in the column `day` is equal to the parameter `day`,
  * the value in the column `city` is equal to the parameter `city`.

To do this, combine the two conditions with logical indexing.

Then count the values in the column `user_id` of the resulting table, save the result to a new variable, and return this variable from the function.

In [51]:
def number_tracks(day, city): # count the plays for a given day and city
    track_list = df[(df['day'] == day) & (df['city'] == city)] # keep rows that match both the day and the city
    track_list_count = track_list['user_id'].count() # count the plays in the filtered table
    return track_list_count # return the number of plays
    
    

**Step 21**

Now call `number_tracks()` six times, changing the parameter values so that we get data for each city on each of the three days.

In [52]:
number_tracks('Monday', 'Moscow') # number of plays in Moscow on Monday

15740

In [53]:
number_tracks('Monday', 'Saint-Petersburg') # number of plays in St. Petersburg on Monday

5614

In [54]:
number_tracks('Wednesday', 'Moscow') # number of plays in Moscow on Wednesday

11056

In [55]:
number_tracks('Wednesday', 'Saint-Petersburg') # number of plays in St. Petersburg on Wednesday

7003

In [57]:
number_tracks('Friday', 'Moscow') # number of plays in Moscow on Friday

15945

In [56]:
number_tracks('Friday', 'Saint-Petersburg') # number of plays in St. Petersburg on Friday

5895

**Step 22**

Using `pd.DataFrame` we create a table with: 
* column names — `['city', 'monday', 'wednesday', 'friday']`;
* data — our results from our function `number_tracks`.

Sometimes simple tables are the visual aid we need!

In [67]:
info = pd.DataFrame(data=[['Moscow', 15740, 11056, 15945], #Moscow data from our function
                          ['Saint-Petersburg', 5614, 7003, 5895]], #St Pete data from our function
                    columns=['city', 'monday', 'wednesday', 'friday']) #column names assigned
info #prints our table

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
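The six function calls and the hand-built table could also be replaced by a single cross-tabulation of city against day. The sketch below illustrates the idea on a small made-up dataframe `toy`. It is an alternative under that assumption, not the approach used in this study.

```python
import pandas as pd

# made-up plays for illustration only
toy = pd.DataFrame({
    'user_id': ['A', 'B', 'C', 'D', 'E', 'F'],
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Monday', 'Friday', 'Wednesday', 'Wednesday', 'Monday'],
})

# count rows for every combination of city and day
table = pd.crosstab(toy['city'], toy['day'])

# crosstab sorts the columns alphabetically, so put the days in calendar order
table = table.reindex(columns=['Monday', 'Wednesday', 'Friday'])
```

Combinations that never occur in the data are filled with 0.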


**Conclusions**

The data shows a difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, and there is a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, users listen to more music on Wednesday. Activity on Monday and Friday is lower than on Wednesday by a similar margin.

These results support the first hypothesis.
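Because Moscow has far more plays overall, the two rows are easier to compare as shares of each city's total. The sketch below rebuilds the table from the counts obtained in Step 21 and divides each row by its sum. The names `counts` and `shares` are introduced here for illustration.

```python
import pandas as pd

# the counts returned by number_tracks() in Step 21
info = pd.DataFrame(data=[['Moscow', 15740, 11056, 15945],
                          ['Saint-Petersburg', 5614, 7003, 5895]],
                    columns=['city', 'monday', 'wednesday', 'friday'])

counts = info.set_index('city')

# divide every row by its own total to get the share of plays per day
shares = counts.div(counts.sum(axis=1), axis=0)

# the quietest day in Moscow and the busiest day in St. Petersburg
moscow_low = shares.loc['Moscow'].idxmin()
spb_high = shares.loc['Saint-Petersburg'].idxmax()
```

With these counts, Wednesday is the day with the smallest share in Moscow and the largest share in St. Petersburg.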

### Music at the beginning and at the end of the week

According to the second hypothesis, on Monday morning certain genres predominate in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.

**Step 23**

Let us first save the tables with data in two variables:
* Moscow data — in `moscow_general`;
* St. Petersburg data — in `spb_general`.

In [68]:
moscow_general = df[df['city']=='Moscow'] # creates our moscow_general table from df,
# takes the rows where 'city' is 'Moscow'


In [70]:
spb_general = df[df['city']== 'Saint-Petersburg']# creates our spb_general table from df,
#takes rows where 'city' is 'Saint-Petersburg'


**Step 24**

Let's make another function `genre_weekday()` with these four parameters:
* table (dataframe) with data,
* day of the week,
* start time in the format 'hh:mm',
* end time in the format 'hh:mm'.

The function should return information about the top 10 genres of those tracks that were listened to on the specified day, in the interval between two timestamps.

In [88]:

def genre_weekday(df, day, time1, time2): 
    # sequential filtering 
    # leave in genre_df only those df lines whose day is equal to day 
    genre_df = df[df['day'] == day]
    # leave in genre_df only those genre_df lines whose time is less than time2 
    genre_df = genre_df[genre_df['time'] < time2] 
    # leave in genre_df only those rows genre_df whose time is greater than time1 
    genre_df = genre_df[genre_df['time'] > time1]
    # group the filtered dataframe by the genre column, take the genre column and count the number of rows for each genre using count()
    genre_df_grouped = genre_df.groupby('genre')['genre'].count() 
    # sort the result in descending order (so that the most popular genres are at the beginning of the Series) 
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False) 
    # return a Series with the 10 most popular genres in the specified time period of the given day 
    return genre_df_sorted[:10]
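The function compares the `time` column with `time1` and `time2` as plain strings. This works only because the times are stored in a zero-padded `hh:mm:ss` format, so alphabetical order coincides with chronological order. The sketch below demonstrates this on a few made-up time values, and also shows how an unpadded value would break the comparison.

```python
import pandas as pd

# made-up times in the same zero-padded format as the `time` column
times = pd.Series(['06:59:59', '07:00:00', '08:34:34',
                   '10:59:59', '11:00:00', '20:28:33'])

# the same pair of string comparisons that genre_weekday() uses
in_window = (times > '07:00') & (times < '11:00')
selected = times[in_window].tolist()

# pitfall: without zero-padding, '9:15:00' sorts after '11:00'
unpadded_is_before_eleven = '9:15:00' < '11:00'
```

Note that `'07:00:00'` passes the strict `>` check because a longer string that starts with `'07:00'` is considered greater than `'07:00'` itself, while `'11:00:00'` is excluded for the same reason.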


**Step 25**


Compare the results of the function `genre_weekday()` for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [89]:

genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [90]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [92]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00') 

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [93]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00') 

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. In Moscow and St. Petersburg they listen to similar music. The only difference is that the Moscow ranking includes the “world” genre, while the St. Petersburg ranking includes jazz and classical.

2. There were so many missing values in Moscow that the `'unknown'` genre we added took tenth place among the most popular genres. This means that missing values make up a significant part of the data and threaten the reliability of the research.

The picture does not change much on Friday evening. Some genres rise a little, others fall, but overall the top 10 stays the same.

Thus, the second hypothesis can only be partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very big. In Moscow, they listen to Russian popular music more often, and in St. Petersburg, jazz.

However, gaps in the data cast doubt on this result. There are so many of them in the Moscow data that the top 10 ranking could look different if it were not for the lost genre data.
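The Monday morning comparison can be made explicit with set operations on the two top-10 lists. The sketch below copies the genre names from the outputs in Step 25. The variable names are introduced here for illustration.

```python
# top 10 genres on Monday morning, copied from the outputs in Step 25
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# genres that appear in both rankings
common = set(moscow_monday) & set(spb_monday)

# genres that appear in only one of the rankings
only_moscow = set(moscow_monday) - set(spb_monday)
only_spb = set(spb_monday) - set(moscow_monday)
```

Eight of the ten genres are shared. Only `world` and the `unknown` placeholder are specific to Moscow, and only `jazz` and `classical` to St. Petersburg.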

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to there more often than in Moscow. Moscow is a city of contrasts that is nevertheless dominated by pop music.

**Step 26**

We will need to group the table `moscow_general` by genre and count the plays of each genre using `count()`. Then sort the result in descending order and store it in `moscow_genres`.

In [97]:
mg_grouped = moscow_general.groupby('genre')['genre'] # group the Moscow plays by genre
mg_grouped_count = mg_grouped.count() # count the plays of each genre
moscow_genres = mg_grouped_count.sort_values(ascending=False) # most popular genres first

**Step 27**

Display the first ten rows of `moscow_genres`:

In [98]:
moscow_genres.head(10) # view the first 10 rows of moscow_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

**Step 28**


We will do the same for St. Petersburg:


In [102]:
sbg_grouped = spb_general.groupby('genre')['genre'] 
sbg_groups_counted = sbg_grouped.count() 
spb_genres = sbg_groups_counted.sort_values(ascending=False) 

**Step 29**

Show the first 10 rows of `spb_genres`:

In [103]:
spb_genres.head(10) 

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also includes a closely related genre, Russian pop (`ruspop`).
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg.
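Raw counts are hard to compare between cities of different size, so the claim about rap can be checked with shares. The sketch below uses the genre counts from Steps 27 and 29 and the city totals from Step 18. The names `moscow_total`, `spb_total` and `shares` are introduced here for illustration.

```python
import pandas as pd

# total plays per city, from Step 18
moscow_total = 42741
spb_total = 18512

# selected genre counts, from Steps 27 and 29
moscow_counts = pd.Series({'pop': 5892, 'hiphop': 2096, 'rusrap': 1161})
spb_counts = pd.Series({'pop': 2431, 'hiphop': 960, 'rusrap': 564})

# share of each genre in all plays of the city
shares = pd.DataFrame({
    'Moscow': moscow_counts / moscow_total,
    'Saint-Petersburg': spb_counts / spb_total,
})
```

With these numbers, Russian rap accounts for roughly 2.7% of plays in Moscow and 3.0% in St. Petersburg, and pop for roughly 13.8% and 13.1%. The differences are small in both cases.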


## Analysis results

We tested three hypotheses and found:

1. The day of the week affects the activity of users in Moscow and St. Petersburg in different ways.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week, be it in Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow they listen to music of the “world” genre,
* in St. Petersburg, to jazz and classical music.

Thus, the second hypothesis was only partly confirmed. This result could have been different were it not for gaps in the data.

3. The tastes of Moscow and St. Petersburg users have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are not visible in the listening habits of most users.

**In practice, studies contain tests of statistical hypotheses.**
From the data of one service, it is not always possible to draw a conclusion about all the inhabitants of a city. Statistical hypothesis tests show how reliable such conclusions are, given the available data.
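As an illustration of what such a test might look like for the first hypothesis, the sketch below computes a chi-square statistic of independence for the city by day table from Step 22. This is only a sketch under a strong assumption: it treats every play as an independent observation, which is unlikely to hold because one user produces many plays. The value 5.991 is the standard 5% critical value of the chi-square distribution with 2 degrees of freedom.

```python
import numpy as np

# plays per city (rows) and day (columns: Monday, Wednesday, Friday), from Step 22
observed = np.array([[15740, 11056, 15945],
                     [5614, 7003, 5895]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# counts expected if the day of the week did not depend on the city
expected = row_totals * col_totals / observed.sum()

# chi-square statistic: sum of squared deviations scaled by the expected counts
chi2 = ((observed - expected) ** 2 / expected).sum()

# degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

critical_value = 5.991  # 5% level, 2 degrees of freedom
rejects_independence = chi2 > critical_value
```

Under the stated assumption the statistic is far above the critical value, which is consistent with the conclusion that the weekly rhythm differs between the cities. A more careful analysis would aggregate plays per user first.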