# Yandex.Music

 **Table of contents:**
 1. Introduction
 
 2. Data preprocessing
 
 3. Hypothesis testing
 
**The aim of this project** is to compare the behavior of Yandex.Music users in Saint-Petersburg and Moscow. Bearing in mind that Moscow is a megalopolis with an intense working life, while Saint-Petersburg is one of the country's main cultural centers, we test three hypotheses:

1. User activity depends on the day of the week, and the pattern differs between Moscow and Saint-Petersburg;


2. On Monday mornings, Moscow users prefer different genres from Saint-Petersburg users. The same holds for Friday evenings;


3. Moscow and Saint-Petersburg users prefer different genres of music: Moscow users prefer pop music, while Saint-Petersburg users prefer Russian rap.

## 1. Introduction

Let's get into Yandex.Music data.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('yandex_music_project.csv')

Let's look at the first 10 rows and the general information.

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in the table. The data type in all columns is `object`.

According to the data documentation:
* userID - the user ID;
* Track — the name of the track;
* artist — artist's name;
* genre — the name of the genre;
* City — the user's city;
* time — the start time of listening;
* Day — the day of the week.

Three style violations can be found in the column names:
1. Lowercase letters are mixed with uppercase.
2. Some names contain spaces.
3. Snake case is not used.

The number of non-null values differs between columns, so the data contains missing values.

**Intermediate conclusion**

Each row of the table contains data about one track that was played. Some of the columns describe the track itself: its name, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

At this point there appears to be enough data to test the hypotheses. However, there are missing values in the data, and the column names deviate from good style.

To move on, we need to fix the problems in the data.

## 2. Data preprocessing

### Style of headers

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
# columns renaming
df = df.rename(columns = {'  userID':'user_id', 'Track': 'track', '  City  ': 'city', 'Day':'day'})

In [7]:
# let's check the results
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
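
The manual rename above can also be written generically. This is a minimal sketch, not the method used in this project: the helper `to_snake_case` and the empty frame `raw` are hypothetical names introduced only for illustration, and the rule assumes that headers differ from snake case only by surrounding spaces and camelCase capitals.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding whitespace and convert camelCase to snake_case."""
    stripped = name.strip()
    # Insert an underscore at every lowercase/digit -> uppercase boundary.
    with_underscores = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", stripped)
    return with_underscores.lower()


# Hypothetical empty frame that only reproduces the raw header names.
raw = pd.DataFrame(
    columns=["  userID", "Track", "artist", "genre", "  City  ", "time", "Day"]
)

# rename() accepts a function and applies it to every column label.
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

Passing a function to `rename` avoids listing every header by hand, at the cost of having to check that the rule gives the intended result for each name.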

### Missing values

In [8]:
# let's check out missing values
print(df.isna().sum())

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64


Not all missing values affect further analysis. In `track` and `artist` the missing values are not important and can be replaced with an explicit placeholder.

But missing values in `genre` may interfere with the comparison of musical tastes in Moscow and St. Petersburg.
Since we cannot find out the true reasons for those omissions, let's:

* fill in these gaps with explicit notation,

* assess how much they will damage the calculations.
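
One possible way to assess the damage is to measure what share of the `genre` values is missing in each city before the gaps are filled. The sketch below runs on a hypothetical six-row sample (the name `sample` and its values are invented for illustration), so the printed shares say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature sample, not taken from the project data.
sample = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Moscow", "Moscow",
             "Saint-Petersburg", "Saint-Petersburg"],
    "genre": ["pop", np.nan, "rock", "dance", np.nan, "jazz"],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing genres per city.
missing_share = sample["genre"].isna().groupby(sample["city"]).mean()
print(missing_share)
```

If the shares differ noticeably between cities, any genre comparison between them deserves extra caution.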

In [9]:
# replacing the missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [10]:
# checking out the results
print(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64


### Duplicates

In [11]:
# let's count explicit duplicates
display(df.duplicated().sum())
df[df.duplicated()].head(10)

3826

Unnamed: 0,user_id,track,artist,genre,city,time,day
575,E7F07B46,Crazy,The Manhattans,rnb,Moscow,13:39:46,Monday
832,7671A47A,Миражи,Восток,ruspop,Moscow,21:59:33,Wednesday
1216,69467B01,Change It All,Harrison Storm,singer,Moscow,20:53:06,Wednesday
1754,13B1A573,Te Adoramos Jesús,Athenas,spiritual,Moscow,13:19:37,Monday
1964,B24668A0,Mad over You Mashup,Nana Fofie,singer,Moscow,20:36:51,Monday
2457,B896FACC,No Pussy Blues,Grinderman,pop,Saint-Petersburg,09:08:49,Wednesday
2770,6573A983,Суета-муета,Дюмин А.,shanson,Moscow,21:15:40,Friday
2885,512B0511,Purgatory,Scientific Harmony,dance,Moscow,09:55:06,Monday
2895,2696EA88,Daddy D,Kung Fu,funk,Moscow,08:33:34,Wednesday
2979,3C8ECBFE,Ederlezi,Coq Au Vin,world,Moscow,13:25:15,Monday


In [12]:
# let's delete explicit duplicates
df = df.drop_duplicates().reset_index(drop=True)

In [13]:
# checking for the absence of duplicates
display(df.duplicated().sum())

0
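
To make the mechanics of the two steps above explicit, here is a small sketch on a hypothetical four-row sample (the values are invented). It assumes the default behavior of `duplicated()`, which compares all columns and marks every copy except the first.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 2 are identical in every column.
sample = pd.DataFrame({
    "user_id": ["A1", "B2", "A1", "C3"],
    "track": ["Crazy", "True", "Crazy", "Crazy"],
    "time": ["13:39:46", "13:00:07", "13:39:46", "09:08:49"],
})

# duplicated() flags the second and later copies of a fully identical row.
n_duplicates = int(sample.duplicated().sum())

# Dropping rows leaves holes in the index, so it is rebuilt from zero.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Row 3 shares its track name with rows 0 and 2 but differs in user and time, so it is not counted as a duplicate.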

Let's get rid of implicit duplicates.

In [14]:
# unique genres
display(df['genre'].sort_values().unique())

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [15]:
# replacement of implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [16]:
correct_genre1 = 'hiphop'
correct_genre2 = 'electronic'
wrong_genres1 = ['hip', 'hop', 'hip-hop']
wrong_genres2 = ['электроника']

replace_wrong_genres(wrong_genres1, correct_genre1)
replace_wrong_genres(wrong_genres2, correct_genre2)
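
As an alternative sketch, the same replacements can be expressed as a single mapping that is applied to a series and returns a new one, without modifying a global `df`. The names `GENRE_ALIASES`, `normalize_genres` and `sample` are hypothetical; the alias pairs themselves are the ones used above.

```python
import pandas as pd

# Mapping built from the replacements made above: variant -> canonical name.
GENRE_ALIASES = {
    "hip": "hiphop",
    "hop": "hiphop",
    "hip-hop": "hiphop",
    "электроника": "electronic",
}


def normalize_genres(genres: pd.Series) -> pd.Series:
    """Return a copy of the series with alias genre names unified."""
    # With a dict, replace() substitutes whole values only, not substrings.
    return genres.replace(GENRE_ALIASES)


# Hypothetical sample values for illustration.
sample = pd.Series(["hip", "rock", "hip-hop", "электроника", "hop"])
normalized = normalize_genres(sample)
print(normalized.tolist())
```

In the notebook this would be used as `df['genre'] = normalize_genres(df['genre'])`. Keeping all aliases in one dictionary makes it easy to add further variants later.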


In [17]:
# Let's check out the results
display(df['genre'].sort_values().unique())

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Intermediate conclusion**

Preprocessing revealed three problems in the data:

- style violations in the column headers,

- missing values,

- duplicates — explicit and implicit.

We have corrected the headers to make it easier to work with our dataframe. Without duplicates, the study will become more accurate.

We have replaced the missing values with "unknown". It remains to be seen whether omissions in the `genre` column will harm the study.

Now we can proceed to hypothesis testing.

## 3. Hypothesis testing

### Comparing user behavior in the two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Let's check this assumption using data for three days of the week: Monday, Wednesday and Friday. To do this:

* Separate the users of Moscow and St. Petersburg

* Compare how many tracks each user group listened to on Monday, Wednesday and Friday.


In [18]:
# Let's count the number of tracks played in each city
display(df.groupby('city')['genre'].count())

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64

More tracks were played in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often. A more likely explanation is that there are simply more users in Moscow.
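
The explanation about the number of users is not verified in this notebook. A hedged sketch of how it could be checked: count plays and distinct `user_id` values per city and compare plays per user. The sample below is hypothetical (invented IDs), so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical sample: one row per play, as in the project table.
sample = pd.DataFrame({
    "user_id": ["A1", "A1", "B2", "C3", "C3", "D4"],
    "city": ["Moscow", "Moscow", "Moscow", "Moscow", "Moscow",
             "Saint-Petersburg"],
})

# Named aggregation: total plays and distinct users for each city.
per_city = sample.groupby("city")["user_id"].agg(plays="count", users="nunique")

# Plays per user puts cities of different size on a comparable scale.
per_city["plays_per_user"] = per_city["plays"] / per_city["users"]
print(per_city)
```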

In [19]:
# Let's count the number of tracks played on Monday, Wednesday and Friday
day_split_users = df.groupby('day')['genre'].count()
display(day_split_users)

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

Overall, users in the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

In [20]:
def number_tracks(day, city):
    track_list = df[(df['day'] == day)&(df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count
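
As an alternative to calling a function once per day and city, the same kind of counts can be tabulated in one pass with `pd.crosstab`. This is a sketch on a hypothetical five-row sample, not a replacement for the cells in this notebook.

```python
import pandas as pd

# Hypothetical sample with the same `city` and `day` columns as `df`.
sample = pd.DataFrame({
    "city": ["Moscow", "Moscow", "Moscow",
             "Saint-Petersburg", "Saint-Petersburg"],
    "day": ["Monday", "Friday", "Monday", "Wednesday", "Monday"],
})

# Rows are cities, columns are days, cells are play counts.
counts = pd.crosstab(sample["city"], sample["day"])

# Put the days in calendar order instead of alphabetical order.
counts = counts.reindex(columns=["Monday", "Wednesday", "Friday"], fill_value=0)
print(counts)
```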

In [21]:
# number of tracks played in Moscow on Mondays
moscow_monday_tracks = number_tracks('Monday', 'Moscow')
print(moscow_monday_tracks)

15740


In [22]:
# number of tracks played in St. Petersburg on Mondays
spb_monday_tracks = number_tracks('Monday', 'Saint-Petersburg')
print(spb_monday_tracks)

5614


In [23]:
# number of tracks played in Moscow on Wednesdays
moscow_wednesday_tracks = number_tracks('Wednesday', 'Moscow')
print(moscow_wednesday_tracks)

11056


In [24]:
# number of tracks played in St. Petersburg on Wednesdays
spb_wednesday_tracks = number_tracks('Wednesday', 'Saint-Petersburg')
print(spb_wednesday_tracks)

7003


In [25]:
# number of tracks played in Moscow on Fridays
moscow_friday_tracks = number_tracks('Friday', 'Moscow')
print(moscow_friday_tracks)

15945


In [26]:
# number of tracks played in St. Petersburg on Fridays
spb_friday_tracks = number_tracks('Friday', 'Saint-Petersburg')
print(spb_friday_tracks)

5895


In [27]:
# Results
data = [
    ['Moscow', moscow_monday_tracks, moscow_wednesday_tracks, moscow_friday_tracks],
    ['Saint-Petersburg', spb_monday_tracks, spb_wednesday_tracks, spb_friday_tracks]
]
columns = ['city', 'monday', 'wednesday', 'friday']

df_city_day = pd.DataFrame(data = data, columns = columns)
display(df_city_day)

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895


The data shows the difference in user behavior:

- In Moscow, the number of plays peaks on Monday and Friday, with a noticeable dip on Wednesday.

- In St. Petersburg, on the contrary, users listen to more music on Wednesday. Monday and Friday activity is lower than Wednesday by a similar margin.

So, the data speak in favor of the first hypothesis.
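
Because Moscow has more than twice as many plays overall, the pattern is easier to compare when each city's counts are expressed as a share of that city's three-day total. The sketch below re-enters the counts from the summary table; the names `city_day_counts` and `day_share` are introduced here for illustration.

```python
import pandas as pd

# Play counts copied from the summary table for the two cities.
city_day_counts = pd.DataFrame(
    {
        "monday": [15740, 5614],
        "wednesday": [11056, 7003],
        "friday": [15945, 5895],
    },
    index=["Moscow", "Saint-Petersburg"],
)

# Divide each row by its own total so both cities are on the same scale.
day_share = city_day_counts.div(city_day_counts.sum(axis=1), axis=0)
print(day_share.round(3))
```

On these numbers, Wednesday accounts for roughly 26% of the Moscow plays and roughly 38% of the St. Petersburg plays, which is the same contrast described in the bullets.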

### Music at the beginning and end of the week

According to the second hypothesis, some genres prevail in Moscow on Monday morning, and others in St. Petersburg. Similarly, on Friday evening, different genres prevail — depending on the city.

In [28]:
moscow_general = df[df['city'] == 'Moscow']
display(moscow_general.head())

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,unknown,ruspop,Moscow,09:17:40,Friday


In [29]:
spb_general = df[df['city'] == 'Saint-Petersburg']
display(spb_general.head())

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Saint-Petersburg,21:20:49,Wednesday


In [30]:
# Return the top 10 genres among the tracks played on the given day
# between two timestamps (passed as zero-padded 'HH:MM' strings).
def genre_weekday(table, day, time1, time2):
    genre_df = table[(table['day'] == day)&(table['time'] > time1)&(table['time'] < time2)]
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending = False)
    return genre_df_sorted.head(10)
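
The filter above compares the `time` column with the bounds as plain strings. That works here because the values are zero-padded `HH:MM:SS` strings, which sort in the same order as the times they represent. The following sketch illustrates this on a hypothetical series `times` (values invented for the example), including one case where the assumption fails.

```python
import pandas as pd

# Hypothetical sample in the same "HH:MM:SS" format as the `time` column.
times = pd.Series(["06:59:59", "07:00:00", "08:34:34", "11:00:00", "20:28:33"])

# Strings compare character by character, and a string is greater than
# any of its proper prefixes, so "07:00" < "07:00:00" < "08:34:34".
in_morning = times[(times > "07:00") & (times < "11:00")]
print(in_morning.tolist())

# Without the leading zero the comparison breaks: "9" sorts after "1".
print("9:17:40" < "11:00")
```

Note that `07:00:00` passes the lower bound even though the comparison is strict, because the bound `'07:00'` is a shorter prefix, while `11:00:00` fails the upper bound for the same reason.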

Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [31]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64

In [32]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64

In [33]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64

In [34]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64

**Intermediate conclusion**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. Users in Moscow and St. Petersburg listen to similar music. The only difference is that the “world” genre made the Moscow top 10, while jazz and classical music made the St. Petersburg top 10.

2. There were so many missing values in Moscow that the value "unknown" took the tenth place among the most popular genres. This means that the missing values occupy a significant share in the data and threaten the reliability of the study.

Friday night does not change this picture. Some genres rise a little higher, others go down, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:

* Users listen to similar music at the beginning of the week and at the end.

* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, Russian pop music is played more often; in St. Petersburg, jazz.

However, omissions in the data cast doubt on this result. There are so many of them in Moscow that the top-10 rating could look different if not for the lost data on genres.
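
The overlap between the rankings can be checked mechanically with set differences. In this sketch the genre names are copied from the four top-10 outputs in this section; the helper `only_in_each` is a hypothetical name introduced for illustration.

```python
# Top-10 genre names copied from the four outputs in this section.
moscow_monday = {"pop", "dance", "electronic", "rock", "hiphop",
                 "ruspop", "world", "rusrap", "alternative", "unknown"}
spb_monday = {"pop", "dance", "rock", "electronic", "hiphop",
              "ruspop", "alternative", "rusrap", "jazz", "classical"}
moscow_friday = {"pop", "rock", "dance", "electronic", "hiphop",
                 "world", "ruspop", "alternative", "classical", "rusrap"}
spb_friday = {"pop", "electronic", "rock", "dance", "hiphop",
              "alternative", "jazz", "classical", "rusrap", "world"}


def only_in_each(first, second):
    """Return the genres unique to each of the two top-10 sets."""
    return sorted(first - second), sorted(second - first)


monday_diff = only_in_each(moscow_monday, spb_monday)
friday_diff = only_in_each(moscow_friday, spb_friday)
print("Monday morning:", monday_diff)
print("Friday evening:", friday_diff)
```

Eight of the ten genres coincide on Monday morning and nine of ten on Friday evening. This only compares membership in the top 10, not the positions or play counts.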

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, music of this genre is listened to there more often than in Moscow. And Moscow is a city of contrasts, in which, nevertheless, pop music prevails.

In [35]:
moscow_genres = moscow_general.groupby('genre')['user_id'].count().sort_values(ascending = False)

In [36]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: user_id, dtype: int64

In [37]:
spb_genres = spb_general.groupby('genre')['user_id'].count().sort_values(ascending = False)

In [38]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1737
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: user_id, dtype: int64

**Intermediate conclusion**

The hypothesis was partially confirmed:

* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, in the top 10 genres there is a similar genre - Russian popular music.


* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg.
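
Raw counts are hard to compare between cities of different size, so the claim can be restated as shares of each city's plays. The sketch below re-enters the `pop` and `rusrap` counts from the two rankings above and the per-city totals from the first hypothesis; the frame names `plays` and `share` are introduced here for illustration.

```python
import pandas as pd

# Genre counts copied from the two top-10 outputs above;
# totals are the per-city play counts computed earlier.
plays = pd.DataFrame(
    {"pop": [5892, 2431], "rusrap": [1161, 564], "total": [42741, 18512]},
    index=["Moscow", "Saint-Petersburg"],
)

# Share of each genre in the city's total number of plays.
share = plays[["pop", "rusrap"]].div(plays["total"], axis=0)
print((share * 100).round(2))
```

On these numbers, `rusrap` makes up about 2.7% of plays in Moscow and about 3.0% in St. Petersburg, and `pop` about 13.8% and 13.1% respectively. This is a descriptive comparison only; no statistical test is applied.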


## Final conclusion

We have tested three hypotheses and established:

1. The day of the week has different effects on user activity in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week — whether it is Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:

* in Moscow, users listen to the “world” genre,

* in St. Petersburg, to jazz and classical music.

Thus, the second hypothesis was only partially confirmed. This result could have been different if not for the omissions in the data.

3. There are more similarities than differences in the tastes of users in Moscow and St. Petersburg. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable across the bulk of users.