# Yandex.Music

Comparisons between Moscow and St. Petersburg are surrounded by myths. For example:

- Moscow is a metropolis subject to the rigours of the working week;
- St. Petersburg is a cultural capital, with its own tastes.

Using Yandex.Music data, you will compare the behaviour of users of the two capitals.

**The purpose of the study** is to test three hypotheses:

1. User activity depends on the day of the week, and this dependence looks different in Moscow and St. Petersburg.
2. On Monday morning some genres dominate in Moscow and others in St. Petersburg. Friday evenings are also dominated by different genres depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow people listen to pop music more often; in St. Petersburg, to Russian rap.

**Course of the study**

You will get data on user behaviour from the file `yandex_music_project.csv`. Nothing is known about the quality of the data, so you will need to review it before testing the hypotheses.

You will check the data for errors and assess their impact on the study. Then, in the preprocessing phase, you will look for opportunities to correct the most critical data errors.

The study will therefore take place in three phases:

1. Data review.
2. Data preprocessing.
3. Hypothesis testing.


## Data review

First impression of the Yandex.Music data.

In [124]:
import pandas as pd # import pandas library

Read the file `yandex_music_project.csv` from `/datasets` directory into `df` variable:

In [None]:
df = pd.read_csv('/datasets/yandex_music_project.csv') # reading and saving file into df

In [126]:
df.head(10) # obtain the first 10 rows of table

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [127]:
df.info() # getting general information about the data in table 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


So there are seven columns in the table. The data type in all columns is `object`.

According to the data documentation:

- `userID` - user ID;
- `Track` - track name;
- `artist` - artist's name;
- `genre` - name of the genre;
- `City` - user's city;
- `time` - time the listening started;
- `Day` - day of the week.

There are three style violations visible in the column names:

1. Lowercase letters are mixed with uppercase letters.
2. Some names contain leading or trailing spaces.
3. The `userID` column does not use snake_case. Simply lowercasing it gives `userid`, which is hard to read; `user_id` is the better name.

The number of non-null values differs between columns, so the data contain missing values.

**Conclusions**

Each row in the table describes one track play. Some columns describe the track itself: the title, the artist, and the genre. The rest describe the user: which city they are from and when they listened to the music.

At first glance there is enough data to test the hypotheses. However, the data contain missing values, and the column names deviate from good style.

In order to move forward, the problems in the data need to be eliminated.

## Data preprocessing

### Heading style

In [128]:
df.columns # the list of columns in df table

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Bring the column names in line with good style:

- write multi-word names in snake_case,
- make all characters lowercase,
- remove the spaces.

To do this, rename the columns as follows:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [129]:
df = df.rename(columns={'  userID':'user_id', 'Track':'track', '  City  ':'city', 'Day':'day'}) 

In [130]:
df.columns # result check - list of columns name

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
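
With more columns, listing every rename by hand becomes tedious. As a rough sketch, the same three rules could be applied programmatically. The helper `to_snake_case` and the empty `toy` table are illustrative assumptions, not part of the study code:

```python
import re
import pandas as pd

def to_snake_case(name):
    """Hypothetical helper: trim spaces and turn camelCase into snake_case."""
    name = name.strip()
    # Put an underscore wherever a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# Empty toy table with the same raw column names as in the study
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
toy.columns = [to_snake_case(col) for col in toy.columns]
print(list(toy.columns))
```

The explicit `rename()` used above remains the safer choice when only a few known columns need fixing, because it changes nothing else.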

### Missing values

In [131]:
df.isna().sum() # counting missing values

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. The gaps in `track` and `artist` are not important for our work; it is enough to replace them with an explicit placeholder.

But missing values in `genre` can interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, it would be correct to establish the reason for the missing data and restore it. That is not possible in this project, so we will have to:

- fill in these gaps with an explicit placeholder,
- assess to what extent they will affect the calculations.

Replace the missing values in the `track`, `artist` and `genre` columns with the string `'unknown'`.

In [132]:
columns_to_replace = ['track', 'artist', 'genre'] # creating a list of column names in which to replace the missing values

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')
        

In [133]:
df.isna().sum() # missing values count

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
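
One possible way to assess how much the filled gaps matter is to look at the share of `'unknown'` genres in each city. The sketch below uses a small made-up table (`toy`), so the resulting numbers say nothing about the real data:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', None, 'rock', 'jazz', None],
})
toy['genre'] = toy['genre'].fillna('unknown')

# The mean of a boolean mask is the fraction of True values, here computed per city
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

If the share differs noticeably between cities, genre comparisons between them deserve extra caution.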

### Duplicates

In [134]:
df.duplicated().sum() # duplicates count

3826

Removal of explicit duplicates.

In [135]:
df = df.drop_duplicates().reset_index(drop=True) # removal of explicit duplicates (with old indexes removed and new indexes generated)

In [136]:
df.duplicated().sum() # check that no duplicates remain

0
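
Before dropping rows, it can help to look at what counts as an explicit duplicate. The sketch below, on a made-up table, shows that `duplicated()` counts only the repeats, while `keep=False` marks every copy so they can be inspected:

```python
import pandas as pd

# Made-up rows: the first, second and fourth are identical
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['True', 'True', 'Pessimist', 'True'],
    'time': ['13:00:07', '13:00:07', '21:20:49', '13:00:07'],
})

n_repeats = toy.duplicated().sum()            # repeats only; the first copy is not counted
all_copies = toy[toy.duplicated(keep=False)]  # every row that has a twin, first copy included
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_repeats, len(all_copies), len(deduped))
```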

Now get rid of the implicit duplicates in the `genre` column. For example, the name of the same genre may be written slightly differently. Such mistakes will also affect the result of the study.


In [137]:
# Viewing unique genre titles

df['genre'].sort_values().unique() 

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Go through the list and look for implicit duplicates of the `hiphop` genre. These can be misspelled names or alternative names of the same genre.

We see the following implicit duplicates:

- hip,
- hop,
- hip-hop.

To clear them from the table, we will write a `replace_wrong_genres()` function with two parameters:

- `wrong_genres` - list of duplicates,
- `correct_genre` - the string with the correct value.

The function should correct the `genre` column in table `df`: replace each value from `wrong_genres` list with a value from `correct_genre`.

In [138]:
def replace_wrong_genres(wrong_genres, correct_genre):  
    for wrong_genre in wrong_genres: 
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre) 


In [139]:
# replacement of implicit duplicates

wrong_genres = ['hip', 'hop', 'hip-hop']  
correct_genre = 'hiphop' 

replace_wrong_genres(wrong_genres, correct_genre) 
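
As a side note, `Series.replace()` also accepts a list of values, so the loop is not strictly necessary. The sketch below is a hypothetical variant (`replace_genres`) that takes the table as a parameter and returns a corrected copy; the `toy` table is made up:

```python
import pandas as pd

def replace_genres(data, wrong_genres, correct_genre):
    """Hypothetical variant: returns a corrected copy instead of changing a global table."""
    result = data.copy()
    # replace() matches whole values, so 'pop' is not touched by 'hop'
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
fixed = replace_genres(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(fixed['genre'].unique())
```

Passing the table explicitly makes the function easier to test on small examples like this one.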

Check that we've replaced the wrong names:

- hip
- hop
- hip-hop

Output a sorted list of unique values of the `genre` column:

In [140]:
df['genre'].sort_values().unique() 

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

Preprocessing revealed three problems in the data:

- header style irregularities,
- missing values,
- duplicates - explicit and implicit.

We have corrected the headings to make the table easier to work with. Without duplicates, the study will be more accurate.

We have replaced the missing values with `unknown`. It remains to be seen if the omissions in the `genre` column will harm the study.

We can now move on to hypothesis testing.

## Hypothesis testing

### Comparison of user behaviour in the two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Test this hypothesis with data from three days of the week - Monday, Wednesday and Friday. To do this we:

- split the users into Moscow and St. Petersburg groups;
- compare how many tracks each group listened to on Monday, Wednesday and Friday.

In [141]:
df.groupby('city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more track plays in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; there are simply more users in Moscow.

Now group the data by day of the week and count the plays on Monday, Wednesday and Friday. Note that the data cover only those three days.

In [142]:
df.groupby('day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

On average, users from the two cities are less active on Wednesdays. But the picture can change if you look at each city separately.

You have seen how grouping by city and by day of the week works. Now write a function that combines these two calculations.

Create a function, `number_tracks()`, which counts the track plays for a given day and city. It will need two parameters:

- day of the week,
- city name.

In the function, save to a variable the rows of the source table where:

- the value in the `day` column equals the `day` parameter,
- the value in the `city` column equals the `city` parameter.

To do this, apply sequential filtering with logical indexing.

Then count the values in the `user_id` column of the resulting table. Save the result to a new variable. Return this variable from the function.

In [143]:
def number_tracks(day, city): 
    track_list = df[df['day'] == day] 
    track_list = track_list[track_list['city'] == city]  
    track_list_count = track_list['user_id'].count() 
    return track_list_count 
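
The two sequential filters can also be written as a single boolean mask. The sketch below is a hypothetical variant (`count_plays`) that takes the table as a parameter; it counts matching rows, which equals the `user_id` count as long as `user_id` has no missing values. The `toy` table is made up:

```python
import pandas as pd

def count_plays(data, day, city):
    """Hypothetical variant of number_tracks() that takes the table as a parameter."""
    # Parentheses are required: & binds tighter than ==
    mask = (data['day'] == day) & (data['city'] == city)
    return int(mask.sum())  # each True is one matching row

toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3', 'D4'],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday'],
})
print(count_plays(toy, 'Monday', 'Moscow'))
```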


In [144]:
# number of track plays in Moscow on Mondays

number_tracks('Monday', 'Moscow')

15740

In [145]:
# number of track plays in St. Petersburg on Mondays

number_tracks('Monday', 'Saint-Petersburg')

5614

In [146]:
number_tracks('Wednesday', 'Moscow')

11056

In [147]:
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [148]:
number_tracks('Friday', 'Moscow')

15945

In [149]:
number_tracks('Friday', 'Saint-Petersburg')

5895

We use the `pd.DataFrame` constructor to create a table where

- column names are `['city', 'monday', 'wednesday', 'friday']`;
- the data are the results we got with `number_tracks`.

In [150]:
columns = ['city', 'monday', 'wednesday', 'friday']
data = [
    ['Moscow', number_tracks('Monday', 'Moscow'), number_tracks('Wednesday', 'Moscow'), number_tracks('Friday', 'Moscow')],
    ['Saint-Petersburg', number_tracks('Monday', 'Saint-Petersburg'), number_tracks('Wednesday', 'Saint-Petersburg'), number_tracks('Friday', 'Saint-Petersburg')],
]

user_behaviour = pd.DataFrame(data = data, columns = columns)
display(user_behaviour)

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
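
A similar city-by-day table could be built in one step with `pd.crosstab`, without calling a function six times. This is only a sketch on a made-up `toy` table, not the method used in the study:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are row counts (0 where a combination never occurs)
plays = pd.crosstab(toy['city'], toy['day'])
# Put the days in calendar order instead of alphabetical order
plays = plays[['Monday', 'Wednesday', 'Friday']]
print(plays)
```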


**Conclusions**

The data show a difference in user behaviour:

- In Moscow, listening peaks on Mondays and Fridays, with a noticeable decline on Wednesdays.
- In St. Petersburg, on the contrary, listening peaks on Wednesdays. Activity on Mondays and Fridays is lower than on Wednesdays by roughly the same margin.

So the data are in favour of the first hypothesis.
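
Because Moscow has far more plays overall, the two cities are easier to compare as shares of each city's own total. A minimal sketch using the counts from the table above:

```python
import pandas as pd

# Counts copied from the table above
user_behaviour = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total so that every city sums to 1
day_share = user_behaviour.div(user_behaviour.sum(axis=1), axis=0)
print(day_share.round(3))
```

With these numbers Wednesday accounts for roughly 26% of the Moscow plays but roughly 38% of the St. Petersburg plays, which matches the pattern described above.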

### Music at the beginning and at the end of the week

According to the second hypothesis, on a Monday morning some genres dominate in Moscow and others in St. Petersburg. Likewise, Friday evenings are dominated by different genres - depending on the city.

Save the table data into two variables:

- for Moscow - in `moscow_general`;
- for St. Petersburg - in `spb_general`.

In [151]:
moscow_general = df[df['city'] == 'Moscow'] # create a table with data for Moscow only

In [152]:
spb_general = df[df['city'] == 'Saint-Petersburg'] # create a table with data for St Petersburg only

Create a `genre_weekday()` function with four parameters:

- a dataframe with data,
- weekday,
- start time in 'hh:mm' format,
- end time in 'hh:mm' format.

The function should return the 10 most popular genres among the tracks played on the specified day between the two times.

In [153]:
def genre_weekday(df, day, time1, time2):
    genre_df = df[df['day'] == day] 
    genre_df = genre_df[genre_df['time'] > time1] 
    genre_df = genre_df[genre_df['time'] < time2] 
    
    genre_df_count = genre_df.groupby('genre')['user_id'].count() 
    genre_df_sorted = genre_df_count.sort_values(ascending=False) 
    
    return genre_df_sorted.head(10)
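
The time filter works on plain strings: zero-padded `hh:mm:ss` values compare in chronological order. The sketch below illustrates this with a hypothetical variant, `top_genres`, built on `value_counts()`; the `toy` table is made up:

```python
import pandas as pd

def top_genres(data, day, time1, time2, n=10):
    """Hypothetical variant of genre_weekday() built on value_counts()."""
    mask = (data['day'] == day) & (data['time'] > time1) & (data['time'] < time2)
    # value_counts() counts rows per genre and sorts them in descending order
    return data.loc[mask, 'genre'].value_counts().head(n)

toy = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:37:09', '08:34:34', '10:15:00', '11:00:00', '09:17:40'],
    'genre': ['folk', 'dance', 'dance', 'rock', 'ruspop'],
})
# '11:00:00' is not less than '11:00' as a string, so the 'rock' row is excluded
result = top_genres(toy, 'Monday', '07:00', '11:00')
print(result)
```

This relies on every time value being zero-padded; a value such as `'8:37:09'` would break the ordering.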

Compare the results of `genre_weekday()` for Moscow and St. Petersburg on Monday morning (7:00-11:00) and Friday evening (17:00-23:00):

In [154]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64

In [155]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64

In [156]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64

In [157]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday mornings we can draw the following conclusions:

1. In Moscow and St. Petersburg people listen to similar music. The only difference is that Moscow's top 10 includes the 'world' genre, while St. Petersburg's includes jazz and classical.

2. In Moscow there were so many missing values that `'unknown'` took tenth place among the most popular genres. So the missing values take up a significant proportion of the data and threaten the credibility of the study.

Friday evening does not change this picture. Some genres move slightly up, others move down, but overall the top 10 stays much the same.

Thus, the second hypothesis is only partially confirmed:

- Users listen to similar music at the beginning of the week and at the end of the week.
- The difference between Moscow and St. Petersburg is not very pronounced. In Moscow they listen to Russian popular music more often; in St. Petersburg, to jazz.

However, the missing values cast doubt on this result. There are so many of them in Moscow that the top 10 could look different if the genre data were complete.

### Genre preferences in Moscow and St Petersburg

Hypothesis: St. Petersburg is the rap capital, where this genre is listened to more often than in Moscow. Moscow is a city of contrasts that is nevertheless dominated by pop music.

Group the `moscow_general` table by genre and count the track plays in each genre with `count()`. Then sort the result in descending order and store it in `moscow_genres`.

In [158]:
moscow_genres = moscow_general.groupby('genre')['track'].count().sort_values(ascending=False)

In [159]:
# view first 10 lines of moscow_genres

moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: track, dtype: int64

Now repeat the same for St Petersburg.

In [160]:
spb_genres = spb_general.groupby('genre')['track'].count().sort_values(ascending=False)

In [161]:
# view first 10 lines of spb_genres

spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: track, dtype: int64

**Conclusions**

The hypothesis was partly confirmed:

- Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a closely related genre, Russian popular music, is also in the top 10.
- Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg.
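
Raw counts favour Moscow simply because it has more plays, so the claim about rap is easier to check with shares. A minimal sketch using the genre counts and city totals shown earlier:

```python
import pandas as pd

# Counts copied from the outputs above
genre_counts = pd.DataFrame(
    {'Moscow': [5892, 1161], 'Saint-Petersburg': [2431, 564]},
    index=['pop', 'rusrap'],
)
city_totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Divide each city's column by that city's total number of plays
genre_share = genre_counts.div(city_totals, axis=1)
print(genre_share.round(3))
```

With these numbers `rusrap` takes about 2.7% of the plays in Moscow and about 3.0% in St. Petersburg, a small gap compared with what the hypothesis implied.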

## Study results

We have tested three hypotheses and established:

1. The day of the week affects user activity differently in Moscow and St. Petersburg.

The first hypothesis is fully confirmed.

2. Musical preferences do not change much during the week, whether in Moscow or St. Petersburg. Slight differences are noticeable at the beginning of the week, on Mondays:
- in Moscow they listen to "world" music,
- in St. Petersburg they listen to jazz and classical music.

Thus, the second hypothesis was only partly confirmed. This result could have been different had it not been for the missing data.

3. The tastes of Moscow and St. Petersburg users have more similarities than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

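The third hypothesis was only partly confirmed: pop music does lead in Moscow, but rap is not noticeably more popular in St. Petersburg. If differences in preferences exist, they are not noticeable among the bulk of users.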
