# Yandex.Music

**The purpose of the study** is to test the hypotheses:
1. User activity depends on the day of the week. Moreover, in Moscow and St. Petersburg this manifests itself in different ways.
2. On Monday morning in Moscow, some genres prevail, and in St. Petersburg, others. Likewise, on Friday evenings, different genres predominate, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music: pop is heard more often in Moscow, Russian rap in St. Petersburg.

## Data overview

In [1]:
# import pandas library
import pandas as pd

In [2]:
# reading data file and saving to df
df = pd.read_csv('yandex_music_project.csv')

In [3]:
# getting the first 10 rows of the df table
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [4]:
# get general information about the data in the df table
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, the table has seven columns. The data type in all columns is `object`.

According to the data documentation:
* `userID` — user identifier;
* `Track` — track name;
* `artist` — name of the artist;
* `genre` — genre name;
* `City` — user’s city;
* `time` — start time of listening;
* `Day` - day of the week.

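There are three style violations visible in the column titles:
1. Lowercase letters are combined with uppercase ones.
2. There are leading and trailing spaces.
3. `userID` joins two words without an underscore instead of using snake_case.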

The number of non-null values differs between columns, which means the data contains missing values.


**Conclusions**

Each row of the table describes one played track. Some columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

Preliminarily, there appears to be enough data to test the hypotheses. However, the data contains missing values, and the column names do not follow good style.

To move forward, we need to fix the problems in the data.

## Data preprocessing


### Heading style


In [5]:
# list of df table column names
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
# renaming columns
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})

In [7]:
# checking the results - list of column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
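The manual `rename()` above works well for seven columns. As a hedged alternative, the same cleanup could be expressed as a reusable rule. The sketch below is not part of the original study: `to_snake_case` is a hypothetical helper, and it runs on an empty toy frame that only reuses the raw column names shown earlier.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip spaces and convert camelCase to snake_case."""
    name = name.strip()
    # put an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# raw column names copied from the df.columns output above
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
toy = pd.DataFrame(columns=raw_columns)

# rename() accepts a function and applies it to every column name
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

A rule like this scales to wider tables, but an explicit mapping stays easier to audit when only a few columns are involved.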

### Missing values

In [8]:
# counting missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

In [9]:
# loop through column names and replace missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [10]:
# counting missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
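The loop over `columns_to_replace` can also be written as one assignment, because `fillna()` works on a multi-column selection. This is a minimal sketch on a made-up toy frame (the rows are illustrative, not taken from the dataset as a whole):

```python
import pandas as pd

# toy frame with gaps in the same three columns as the real table
toy = pd.DataFrame({
    'track': ['Soul People', None],
    'artist': [None, 'Space Echo'],
    'genre': ['dance', None],
    'city': ['Moscow', 'Moscow'],
})

columns_to_replace = ['track', 'artist', 'genre']
# fill all selected columns at once instead of looping
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

missing_after = int(toy.isna().sum().sum())
print(missing_after)
```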

### Duplicates

In [11]:
# counting explicit duplicates
df.duplicated().sum()

3826

In [12]:
# removing explicit duplicates (dropping the old index and building a new one)
df = df.drop_duplicates().reset_index(drop=True)

In [13]:
# checking for duplicates
df.duplicated().sum()

0
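To make the effect of `duplicated()`, `drop_duplicates()` and `reset_index(drop=True)` concrete, here is a small sketch on an invented three-row frame. The rows are hypothetical and only mimic the shape of the real data.

```python
import pandas as pd

# the first two rows are identical, so the second one is an explicit duplicate
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2'],
    'track': ['Soul People', 'Soul People', 'Pessimist'],
    'time': ['08:34:34', '08:34:34', '21:20:49'],
})

# duplicated() marks rows that repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop the repeats and rebuild a gap-free 0..n-1 index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

Without `reset_index(drop=True)` the surviving rows would keep their old labels (0 and 2 here), which can be confusing in later positional work.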

In [14]:
# Viewing unique genre names
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [15]:
# Function to replace implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [16]:
# Eliminating implicit duplicates
replace_wrong_genres(['hip', 'hop', 'hip-hop'], 'hiphop')
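The helper above replaces the wrong spellings one at a time. `Series.replace()` also accepts a list of values to replace, so the same result can be reached in a single call. The sketch below uses a short made-up Series rather than the real `genre` column:

```python
import pandas as pd

# toy genre column containing the three variant spellings
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'], name='genre')

# every value from the list is replaced with the single correct name
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned.unique())
```

This version also avoids relying on a global `df` inside the function body, which makes it easier to reuse on another table.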

In [17]:
# Checking for implicit duplicates
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

Preprocessing found three problems in the data:

- violations in the style of headings,
- missing values,
- duplicates - explicit and implicit.

The headings have been corrected to make working with the table easier. Without duplicates, the research will be more accurate.

Missing values were replaced with `'unknown'`. It remains to be seen whether the missing values in the `genre` column will harm the research.

Now we can move on to testing hypotheses.

## Testing hypotheses

### Comparison of user behavior in two capitals

In [18]:
# Counting plays in each city
city_group = df.groupby('city')
city_group['track'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: track, dtype: int64

In [19]:
# Counting plays on each of three days
day_group = df.groupby('day')
day_group['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

In [20]:
# number_tracks() counts plays for a given day and city.
# track_list holds the rows of df where the 'day' column equals the day
# parameter and the 'city' column equals the city parameter (both conditions
# are combined in one boolean mask).
# track_list_count is the number of values in the 'user_id' column of
# track_list, computed with count(); the function returns this number.
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [21]:
# number of plays in Moscow on Mondays
number_tracks('Monday', 'Moscow')

15740

In [22]:
# number of plays in St. Petersburg on Mondays
number_tracks('Monday', 'Saint-Petersburg')

5614

In [23]:
# number of plays in Moscow on Wednesdays
number_tracks('Wednesday', 'Moscow')

11056

In [24]:
# number of plays in St. Petersburg on Wednesdays
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [25]:
# number of plays in Moscow on Fridays
number_tracks('Friday', 'Moscow')

15945

In [26]:
# number of plays in St. Petersburg on Fridays
number_tracks('Friday', 'Saint-Petersburg')

5895

In [27]:
# Table with results
columns = ['city', 'monday', 'wednesday', 'friday']
data = [['Moscow', number_tracks('Monday', 'Moscow'), number_tracks('Wednesday', 'Moscow'), number_tracks('Friday', 'Moscow')],
       ['Saint-Petersburg', number_tracks('Monday', 'Saint-Petersburg'), number_tracks('Wednesday', 'Saint-Petersburg'), number_tracks('Friday', 'Saint-Petersburg')]]
pd.DataFrame(data=data, columns=columns)

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
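The six separate `number_tracks()` calls build the table by hand. As a hedged alternative, `pd.crosstab()` can produce a city-by-day table of counts in one step. The sketch below runs on a tiny invented frame, so its numbers are illustrative and unrelated to the real results.

```python
import pandas as pd

# toy play log: one row per play
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Wednesday', 'Monday', 'Monday'],
})

# count rows for every (city, day) combination
plays = pd.crosstab(toy['city'], toy['day'])

# crosstab sorts columns alphabetically; restore calendar order
plays = plays.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(plays)
```

On the real data the call would presumably be `pd.crosstab(df['city'], df['day'])`, which scans the table once instead of six times.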


**Conclusions**

The data shows the difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable dip on Wednesday.
- In St. Petersburg, on the contrary, listening peaks on Wednesday, while Monday and Friday trail it by similar margins.

The data therefore supports the first hypothesis.
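Moscow has more than twice as many plays as St. Petersburg, so raw counts are hard to compare across cities. One way to check the pattern is to convert each city's counts into shares of that city's total. The sketch below reuses the six counts from the results table; the variable names `plays` and `shares` are introduced here for illustration.

```python
import pandas as pd

# counts copied from the results table above
plays = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# divide every row by its own total and express the result in percent
shares = plays.div(plays.sum(axis=1), axis=0).mul(100).round(1)
print(shares)

# day with the largest share in each city
print(shares.idxmax(axis=1))
```

In shares, Wednesday accounts for roughly a quarter of Moscow plays but well over a third of St. Petersburg plays, which matches the conclusion drawn from the raw counts.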

### Music at the beginning and end of the week

In [28]:
# getting the moscow_general table from those rows of the df table,
# for which the value in the 'city' column is 'Moscow'
moscow_general = df[df['city'] == 'Moscow']

In [29]:
# getting the spb_general table from those rows of the df table,
# for which the value in the 'city' column is 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']

In [30]:
# genre_weekday() takes the parameters table, day, time1, time2 and returns
# the most popular genres on the given day within the given time window:
# 1) genre_df holds the rows of table for which, at the same time:
#    - the value in the 'day' column equals the day argument
#    - the value in the 'time' column is greater than the time1 argument
#    - the value in the 'time' column is less than the time2 argument
#    (the three conditions are combined in one boolean mask)
# 2) genre_df is grouped by the 'genre' column and count() gives the number
#    of records per genre; the resulting Series is saved as genre_df_count
# 3) genre_df_count is sorted in descending order and saved as genre_df_sorted
# 4) the first 10 values of genre_df_sorted are returned: the top 10 genres
#    for the given day and time window
def genre_weekday(table, day, time1, time2):
    genre_df = table[(table['day'] == day) & (time1 < table['time']) & (table['time'] < time2)]
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted.head(10)
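`genre_weekday()` compares times as plain strings. That works because strings are compared character by character, which gives chronological order as long as every value is zero-padded in the `HH:MM:SS` form. The sample rows suggest this holds, but it is an assumption about the full column. The sketch below illustrates the boundary behaviour with a few made-up times:

```python
# hypothetical sample of 'time' values in zero-padded HH:MM:SS form
times = ['06:59:59', '07:00:00', '08:34:34', '11:00:00', '20:28:33']

# same strict comparison as in genre_weekday(), with the bounds '07:00' and '11:00'
in_window = [t for t in times if '07:00' < t < '11:00']
print(in_window)

# a shorter string that is a prefix of a longer one sorts before it,
# so '07:00:00' passes the lower bound while '11:00:00' fails the upper bound
print('07:00' < '07:00:00', '11:00:00' < '11:00')
```

If the column ever contained unpadded values such as `7:05:00`, this ordering would break, and converting the column with `pd.to_datetime` or `pd.to_timedelta` would be the safer route.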

In [31]:
# calling the function for Monday morning in Moscow (moscow_general is passed instead of df)
# times are stored as strings and are compared as strings
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [32]:
# calling the function for Monday morning in St. Petersburg (spb_general is passed instead of df)
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [33]:
# function call for Friday evening in Moscow
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [34]:
# function call for Friday evening in St. Petersburg
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow top 10 includes the `world` genre, while the St. Petersburg top 10 includes jazz and classical.

2. In Moscow there were so many missing values that `'unknown'` took tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the study.
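The comparison of the two Monday-morning rankings can be reproduced with set operations. The sketch below copies the genre names from the two outputs above; the list and variable names are introduced here for illustration only.

```python
# genres from the Monday-morning top 10 outputs above, in ranking order
moscow_top = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
              'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
           'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# genres that appear in only one of the two rankings
only_moscow = sorted(set(moscow_top) - set(spb_top))
only_spb = sorted(set(spb_top) - set(moscow_top))

# genres shared by both rankings
shared = sorted(set(moscow_top) & set(spb_top))

print(only_moscow)  # ['unknown', 'world']
print(only_spb)     # ['classical', 'jazz']
print(len(shared))  # 8
```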

Friday evening does not change this picture. Some genres go a little higher, others go down, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow people listen to Russian popular music more often, in St. Petersburg they listen to jazz.

However, gaps in the data cast doubt on this result. There are so many of them in Moscow that the top 10 ranking might look different if not for the lost data on genres.
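One way to quantify this concern would be to measure the share of `'unknown'` genres per city. The study does not report these shares, so the sketch below only shows the technique on an invented toy frame; its percentages say nothing about the real data.

```python
import pandas as pd

# toy data: half of the Moscow rows have an unknown genre
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'unknown', 'pop', 'jazz'],
})

# the mean of a boolean mask is the fraction of True values in each group
unknown_share = toy['genre'].eq('unknown').groupby(toy['city']).mean().mul(100)
print(unknown_share)
```

Applied to the filtered morning or evening rows, the same expression would show how much of each time window lacks genre information.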

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to more often there than in Moscow. Moscow is a city of contrasts in which pop music nevertheless prevails.

In [35]:
# in one line: group the moscow_general table by the 'genre' column,
# count the 'user_id' values in each group with the count() method,
# sort the resulting Series in descending order and save it in moscow_genres
moscow_genres = moscow_general.groupby('genre')['user_id'].count().sort_values(ascending=False)

In [36]:
# viewing the first 10 rows of moscow_genres
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: user_id, dtype: int64
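The group, count and sort chain used for `moscow_genres` has a shorter equivalent: `value_counts()` counts each distinct value and returns the result already sorted in descending order. A minimal sketch on a made-up column:

```python
import pandas as pd

# toy genre column
toy = pd.DataFrame({'genre': ['pop', 'rock', 'pop', 'dance', 'pop', 'rock']})

# counts per genre, most frequent first; head(2) keeps the top two
top = toy['genre'].value_counts().head(2)
print(top)
```

The two approaches should agree on the counts. The order of genres with equal counts may differ between them.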

In [37]:
# in one line: group the spb_general table by the 'genre' column,
# count the 'user_id' values in each group with the count() method,
# sort the resulting Series in descending order and save it in spb_genres
spb_genres = spb_general.groupby('genre')['user_id'].count().sort_values(ascending=False)

In [38]:
# viewing the first 10 rows of spb_genres
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: user_id, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis predicted. Moreover, in the top 10 genres there is a similar genre - Russian popular music.
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg: `hiphop` ranks fifth in both cities, and `rusrap` sits near the bottom of both top 10 lists.
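Since the two cities differ in size, the comparison is easier to read as a share of each city's plays. The sketch below reuses counts that appear earlier in the notebook: the per-city totals from the city grouping and the `pop`, `hiphop` and `rusrap` rows from the two top 10 lists. The dictionary names are introduced here for illustration.

```python
# total plays per city, from the city_group output
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}

# genre counts per city, from moscow_genres.head(10) and spb_genres.head(10)
counts = {
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'hiphop': {'Moscow': 2096, 'Saint-Petersburg': 960},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
}

# share of each genre in its city's plays, in percent
shares = {
    genre: {city: round(100 * n / totals[city], 1) for city, n in by_city.items()}
    for genre, by_city in counts.items()
}

for genre, by_city in shares.items():
    print(genre, by_city)
```

For all three genres, the shares differ between the cities by less than one percentage point. That is consistent with the conclusion that rap is not noticeably more popular in St. Petersburg.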