# Yandex Music

## Data overview.

Import libraries for analysis

In [1]:
import pandas as pd
import numpy as np

We load the data into `music_df`, then print general information about the table and display the first 10 rows:

In [2]:
music_df = pd.read_csv('datasets/yandex_music_project.csv')
music_df.info()
music_df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | | dance | Saint-Petersburg | 21:20:49 | Wednesday |


So, the table has seven columns. The data type in all columns is `object`.

According to the data documentation:  

* `userID` — User identification number;
* `Track` —	Name of a music track;
* `artist` — Name of an artist;
* `genre` — Name of a music genre;
* `City` — City where a song was played;
* `time` — Time at which a user started listening to a song;
* `Day` — Day of the week;

The number of non-null values differs between columns, which means the data contains missing values.

Each row of the DataFrame describes one track play. Some columns describe the track itself: title, artist and genre. The rest describe the user: which city they are in and when they listened to the music.

At first glance, there is enough data to test the working hypotheses. However, there are gaps in the data, and the column names do not follow good style: they mix upper and lower case, and some contain extra spaces.

To move forward, we need to preprocess the data.

## Data preprocessing.

Let's correct the style in the column names and eliminate missing values. Let's check for duplicate data.

### Headers style.

In [3]:
print('Old names', music_df.columns.to_list())
music_df = music_df.rename(columns = {'  userID':'user_id', 'Track':'track', '  City  ':'city', 'Day':'day'})
print('New names', music_df.columns.to_list())

Old names ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
New names ['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day']
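The explicit mapping above works well for a table with seven columns. For wider tables, the same cleanup could be expressed as a rule. The sketch below is one possible approach, not the notebook's method: `to_snake_case` is a hypothetical helper (not part of pandas) that strips whitespace, splits camelCase words, and lowercases the result.

```python
import re

def to_snake_case(name):
    """Hypothetical helper: normalize one column name to snake_case."""
    name = name.strip()                                   # drop leading/trailing spaces
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)   # userID -> user_ID
    name = re.sub(r'\s+', '_', name)                      # inner spaces -> underscores
    return name.lower()

# Column names as they appear in the raw table
old_names = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
new_names = [to_snake_case(name) for name in old_names]
print(new_names)
# With a DataFrame, the result could be assigned back: music_df.columns = new_names
```

For these seven names the rule produces the same result as the manual `rename` call.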


### Missing values.

First, let's count the number of missing values:

In [4]:
music_df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. Missing values in `track` and `artist` are not important for our work, so it is enough to replace them with an explicit placeholder.

However, missing values in `genre` may interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, the right approach would be to determine why the values are missing and restore the data, but that is not possible here. Instead, we have to:

- replace them with an explicit placeholder as well;
- assess how much they will harm the calculations.
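One way to assess the potential harm is to check what share of plays in each city has no genre. The sketch below uses a small made-up table (`toy_df` is illustrative, not the project data); on the real data the same expression could be applied to `music_df` before the gaps are filled.

```python
import pandas as pd

# Illustrative toy data, not the project dataset
toy_df = pd.DataFrame({
    'city':  ['Moscow', 'Moscow', 'Moscow', 'Moscow',
              'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', None, 'rock', None, 'dance', 'pop'],
})

# Share of rows with a missing genre, computed separately for each city
missing_share = toy_df['genre'].isna().groupby(toy_df['city']).mean()
print(missing_share)
```

If the shares differ a lot between cities, genre comparisons between them deserve extra caution.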

In [5]:
# Replace missing values in 'track', 'artist' and 'genre' with the string 'unknown', looping over the column names
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    music_df[column] = music_df[column].fillna('unknown')

music_df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates.

In [6]:
# Let's count and remove explicit duplicates
print('Before: ', music_df.duplicated().sum())
music_df = music_df.drop_duplicates()
print('After: ', music_df.duplicated().sum())

Before:  3826
After:  0


Now we need to get rid of implicit duplicates. Let's start by displaying the unique genre names in alphabetical order.

In [7]:
music_df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Looking through the list, we find the following implicit duplicates of `hiphop`:
- `hip`;
- `hop`;
- `hip-hop`.

We need to get rid of these duplicates by replacing them with the single common name `hiphop`.

In [8]:
music_df['genre'] = music_df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')
music_df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

Data preprocessing identified three problems in the data:

- poor header style;
- missing values;
- duplicates, both explicit and implicit.

The headers have been corrected to make the table easier to work with. Removing duplicates makes the research more accurate. Missing values were replaced with `unknown`. It remains to be seen whether the gaps in the `genre` column will harm the research.

Now we can move on to testing hypotheses.

## Hypotheses testing.

### Differences in music preferences between two cities.

The first hypothesis states that users in Moscow and Saint-Petersburg listen to music differently. We can test it using data for three weekdays: Monday, Wednesday and Friday. To do this, we will:

- separate the users of Moscow and St. Petersburg;
- compare how many tracks each user group listened to on Monday, Wednesday and Friday.

First, estimate user activity in each city: group the data by city and count the plays in each group.

In [9]:
music_df.groupby('city')['time'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: time, dtype: int64

There are more plays in Moscow than in St. Petersburg. This does not necessarily mean that Moscow users listen to music more often; most likely there are simply more users in Moscow.
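Whether the gap comes from the number of users or from how much each user listens could be checked by counting unique users per city alongside plays. This is a minimal sketch on made-up data (the `toy_df` values are illustrative only); on the real table the same grouping would be applied to `music_df`.

```python
import pandas as pd

# Illustrative toy data: one row per track play
toy_df = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'D4', 'D4'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
})

# Total plays and number of distinct users in each city
activity = toy_df.groupby('city')['user_id'].agg(plays='count', users='nunique')
# Average number of plays per user
activity['plays_per_user'] = activity['plays'] / activity['users']
print(activity)
```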

Now let's group the data by weekday and count the track plays on Monday, Wednesday and Friday. The data covers these three days only.

In [10]:
music_df.groupby('day')['time'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: time, dtype: int64

Overall, users in the two cities are less active on Wednesday, but the picture may change once we examine each city separately.

To distinguish between cities and weekdays, we will write a `number_tracks()` function that combines the above calculations and shows how often users in each city listen to music on each day.

Besides the table itself, the function takes two parameters:
- day of the week;
- city name.

Inside the function, save into a variable the rows of the source table in which:
- the value in the `day` column equals the `day` parameter;
- the value in the `city` column equals the `city` parameter.

To do this, filter the table with logical indexing, combining both conditions.

Then count the values in the `user_id` column of the resulting table.

In [11]:
def number_tracks(df, day, city):
    '''
    Computes day- and city-specific number of track plays.
    Given the data with users' musical preferences, function
    returns the number of tracks played in a particular city
    on a particular day.
    '''
    # Keep only the rows for the given day and city (combined boolean mask)
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    # Count the track plays for that day and city
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [12]:
# Defining unique cities and weekdays
city_list = music_df['city'].unique()
day_list = music_df['day'].unique()

In [13]:
# Filling in info about track plays for each city/weekday
track_plays = np.empty([city_list.size, day_list.size])
print('Number of plays in each city by day of the week:')
for i, city in enumerate(city_list):
    print(f'\n {city}')
    for j, day in enumerate(day_list):
        # Compute the count once and reuse it for the table and the printout
        plays = number_tracks(music_df, day, city)
        track_plays[i][j] = plays
        print(f'\t {day.rjust(9)}: {plays}')

Number of plays in each city by day of the week:

 Saint-Petersburg
	 Wednesday: 7003
	    Friday: 5895
	    Monday: 5614

 Moscow
	 Wednesday: 11056
	    Friday: 15945
	    Monday: 15740


Put the resulting data into a table using the `pd.DataFrame` constructor:

In [14]:
info = pd.DataFrame(data=track_plays, columns=day_list, index=city_list)
info

| | Wednesday | Friday | Monday |
|---|---|---|---|
| Saint-Petersburg | 7003.0 | 5895.0 | 5614.0 |
| Moscow | 11056.0 | 15945.0 | 15740.0 |
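The nested loop above can also be expressed with a single pandas call. As a sketch of an alternative (not the notebook's method), `pd.crosstab` counts rows for every city/day pair and returns integer counts directly. The data below is made up for illustration.

```python
import pandas as pd

# Illustrative toy data: one row per track play
toy_df = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'day':  ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are play counts
plays_by_city_day = pd.crosstab(toy_df['city'], toy_df['day'])
print(plays_by_city_day)
```

Pairs that never occur in the data get a count of 0 rather than being left out.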


**Conclusions**

The data shows a difference in user behavior:
- in Moscow, the number of track plays peaks on Monday and Friday and drops on Wednesday;
- in Saint-Petersburg, conversely, users listen to music most on Wednesday.

Thus, the data confirms the first hypothesis.
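Because Moscow has far more plays overall, the comparison is easier to read as shares of each city's total. The sketch below takes the counts from the city/day table above and converts them to percentages; the variable names are illustrative.

```python
# Play counts copied from the city/day table
plays = {
    'Saint-Petersburg': {'Monday': 5614, 'Wednesday': 7003, 'Friday': 5895},
    'Moscow': {'Monday': 15740, 'Wednesday': 11056, 'Friday': 15945},
}

shares = {}
for city, by_day in plays.items():
    total = sum(by_day.values())
    # Percentage of the city's plays that fall on each day
    shares[city] = {day: round(100 * count / total, 1)
                    for day, count in by_day.items()}
    print(city, shares[city])

# Day with the most plays in each city
busiest_day = {city: max(by_day, key=by_day.get) for city, by_day in plays.items()}
print(busiest_day)
```

Wednesday accounts for roughly 38% of plays in Saint-Petersburg but only about 26% in Moscow, which is the contrast the conclusion relies on.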

### Music at the beginning and end of the week.

According to the second hypothesis, different genres prevail on Monday morning in Moscow and in St. Petersburg. Likewise, different genres predominate on Friday evening, depending on the city.

Save the data for each city in its own variable:
- Moscow in `moscow_general`;
- St. Petersburg in `spb_general`.


In [15]:
# DataFrame for Moscow users
moscow_general = music_df[music_df['city'] == 'Moscow']

# DataFrame for Saint-Petersburg users
spb_general = music_df[music_df['city'] == 'Saint-Petersburg']

Create a function `genre_weekday()` with four parameters:

- table (dataframe) with data;
- day of the week;
- start timestamp in "hh:mm" format;
- end timestamp in "hh:mm" format.

The function should return information about the top-10 genres of those tracks that were listened to on a specified day, in the interval between two timestamps.
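The `time` column holds strings such as `20:28:33`, and the interval bounds are given as "hh:mm" strings, so the simplest filter compares them as text. That is a reasonable shortcut only under the assumption that all times are zero-padded, because then lexicographic order matches chronological order. The sketch below illustrates the assumption and shows where it would break; the sample values are made up.

```python
# Zero-padded time strings sort the same way as the times they represent
assert '07:00' < '08:34:34' < '11:00'

# Without zero-padding the string comparison gives the wrong answer:
# '8:34:34' is not below '11:00' because '8' > '1' character by character
assert not ('8:34:34' < '11:00')

# A prefix compares as smaller than any longer string that starts with it,
# so a play at exactly 07:00:00 still passes the strict check time > '07:00',
# while a play at 11:00:00 fails the strict check time < '11:00'
in_window = [t for t in ['06:59:59', '07:00:00', '10:59:59', '11:00:00']
             if '07:00' < t < '11:00']
print(in_window)
```

If the time format were not guaranteed, converting the column to a proper time type before filtering would be the safer choice.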

In [16]:
def genre_weekday(df, day, start_time, end_time):
    '''
    Compiles genres rating for a specific day.
    Function returns a rating of top-10 most popular
    genres on a given day at a given time period.
    '''
    # Filtering the data by day and time
    genre_df = df[(df['day'] == day) & (df['time'] < end_time) & (df['time'] > start_time)]
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    # Compiling a rating of the most popular genres
    genre_df_top_10 = genre_df_grouped.sort_values(ascending = False).head(10)
    return genre_df_top_10


Compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [17]:
# Computing top-10 genres for Moscow (Monday morning)
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [18]:
# Computing top-10 genres for Saint-Petersburg (Monday morning)
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [19]:
# Computing top-10 genres for Moscow (Friday evening)
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [20]:
# Computing top-10 genres for Saint-Petersburg (Friday evening)
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

Comparing the top-10 genres on Monday morning, we can draw the following conclusions:

1. Users in Moscow and Saint-Petersburg listen to similar music. The only difference is that the Moscow rating includes the `world` genre, while the Saint-Petersburg rating includes `jazz` and `classical`.
2. In Moscow there are so many missing values that the `unknown` genre took 10th place in the rating of the most popular genres. Hence, missing values account for a substantial fraction of the data and are likely to affect the results.

Friday evening does not change the picture: some genres move up or down within the top 10, but overall the rating stays the same.

Thus, the second hypothesis has been partially confirmed:

- Users listen to similar music at the beginning and the end of the week.
- We were not able to detect a distinct difference between Moscow and Saint-Petersburg. Russian popular music (`ruspop`) appears in the Moscow top 10 in both time slots, while `jazz` appears only in the Saint-Petersburg ratings.

However, missing values prevent us from confirming these results with confidence. The Moscow data has so many of them that the top-10 rating might have looked different if all the information had been available.

### Preferences by genre in Moscow and St. Petersburg.

Hypothesis: St. Petersburg is the capital of rap (`rusrap`), and music of this genre is listened to more often there than in Moscow. Moscow, in turn, is a city of contrasts in which pop music (`pop`) nevertheless prevails.

We can start by grouping the table by genre and counting the number of tracks played in Moscow.

In [21]:
# Computing top-10 genres for Moscow
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [22]:
# Computing top-10 genres for Saint-Petersburg
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusion**

The hypothesis was partially confirmed:

- `pop` is the most popular genre in Moscow, as the hypothesis predicted. Moreover, the top 10 includes a related genre, Russian popular music (`ruspop`).
- contrary to expectations, rap (`rusrap`) holds a similar position in both cities: it is near the bottom of the top 10 in Moscow (10th) and in St. Petersburg (8th), so St. Petersburg does not stand out.
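Raw counts are hard to compare because Moscow has more than twice as many plays as St. Petersburg. A rough check is to convert counts into shares of all plays in the city, using the `pop` and `rusrap` counts from the two rankings above and the per-city totals from the earlier grouping by city. Variable names are illustrative.

```python
# Total plays per city (from the grouping by 'city')
total_plays = {'Moscow': 42741, 'Saint-Petersburg': 18512}
# Genre counts copied from the two top-10 lists
genre_plays = {
    'Moscow': {'pop': 5892, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'rusrap': 564},
}

# Share of each genre in all plays of the city
genre_share = {
    city: {genre: count / total_plays[city] for genre, count in counts.items()}
    for city, counts in genre_plays.items()
}

for city, by_genre in genre_share.items():
    for genre, share in by_genre.items():
        print(f'{city:<17} {genre:<7} {share:.1%}')
```

On these numbers `rusrap` accounts for about 2.7% of plays in Moscow and about 3.0% in St. Petersburg, so the difference between the cities is small.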

## Research results.

We tested three hypotheses and found:

1. The day of the week affects user activity differently in Moscow and St. Petersburg.

The first hypothesis was completely confirmed.

2. Musical preferences do not change much during the week, be it in Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:

- in Moscow they listen to music of the `world` genre;
- in St. Petersburg, to `jazz` and `classical`.

Thus, the second hypothesis was only partially confirmed. This result might have been different if not for the missing values in the data.

3. The tastes of Moscow and St. Petersburg users have more similarities than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was only partially confirmed: pop music does prevail in Moscow, but rap is not noticeably more popular in St. Petersburg. If differences in preferences exist, they are not noticeable for the majority of users.