# Yandex Music

Comparisons of Moscow and Saint Petersburg are surrounded by myths. For example:

* Moscow is a metropolis governed by the strict rhythm of the workweek;
* Saint Petersburg is a cultural capital with its own tastes.

Using Yandex Music data, compare the behavior of users in the two capitals.

**Research goal** — test three hypotheses:

1. User activity depends on the day of the week, and this manifests differently in Moscow and Saint Petersburg.
2. On Monday morning, different genres dominate in Moscow and Saint Petersburg. The same applies to Friday evening, with genres varying by city.
3. Moscow and Saint Petersburg prefer different music genres. Pop music is more popular in Moscow, while Russian rap is more frequent in Saint Petersburg.

**Research process**

You will receive user behavior data from the file `yandex_music_project.csv`. Nothing is known about data quality, so a data overview is required before hypothesis testing.

Check the data for errors and assess their impact on the study. Then, during preprocessing, find ways to fix the most critical data errors.

Thus, the study will proceed in three stages:

1. Data overview.
2. Data preprocessing.
3. Hypothesis testing.




## Data Overview

Form an initial understanding of the Yandex Music data.



In [None]:
import pandas as pd # Importing the pandas library

In [None]:
df = pd.read_csv('/datasets/yandex_music_project.csv') # Reading the data file and saving it to a DataFrame (df)

In [None]:
df.head(10) # Getting the first 10 rows of the DataFrame (df)

|   | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой |  | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist |  | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [None]:
df.info() # Getting the overall information about the data in the DataFrame (df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The number of values in the columns varies. This means the data contains missing values.

In [None]:
# The column names show three style issues: uppercase and lowercase letters are mixed, some names have leading or trailing spaces, and multi-word names are not separated by underscores (`userID` rather than `user_id`).

**Conclusions**

Each row in the table represents data about a listened track. Some columns describe the track itself: title, artist, and genre. Other columns provide information about the user: their city and when they listened to the music.

Preliminarily, the data appears sufficient for testing the hypotheses. However, there are missing values in the data and inconsistencies in column naming styles.

To proceed, these data issues need to be addressed.

## Data Preprocessing

### Renaming Columns

In [None]:
df.columns # List of column names in the DataFrame (df)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [None]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'}) # Renaming columns

In [None]:
df.columns # Verification of results — list of column names

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
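The manual mapping above works well for seven columns. For wider tables, the same cleanup could be expressed as a rule. The sketch below is one possible approach, not part of the original project: `to_snake_case` is a hypothetical helper that strips surrounding spaces, inserts an underscore where a lowercase letter is followed by an uppercase one, and lowercases the result.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Hypothetical helper: strip spaces and convert camelCase to snake_case."""
    name = name.strip()
    # Insert an underscore wherever a lowercase letter or digit is followed by a capital.
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# Column names exactly as reported by df.columns before renaming.
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(name) for name in raw_columns]
print(clean_columns)

# The same function can be passed to rename() to apply it to a whole DataFrame.
empty = pd.DataFrame(columns=raw_columns).rename(columns=to_snake_case)
print(list(empty.columns))
```

For the seven names in this dataset the rule produces the same result as the explicit dictionary.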

### Handling Missing Values

In [None]:
df.isna().sum() # Counting missing values

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64
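To judge how serious these gaps are, it helps to express the counts as shares of all rows. The sketch below simply reuses the figures printed above and the 65079 rows reported by `df.info()`; the variable names are chosen for illustration.

```python
import pandas as pd

total_rows = 65079  # row count from the df.info() output
missing = pd.Series({'track': 1231, 'artist': 7203, 'genre': 1198})

# Share of rows with a missing value in each column, in percent.
missing_pct = (missing / total_rows * 100).round(2)
print(missing_pct)
```

On a live DataFrame the same shares can be obtained directly with `df.isna().mean() * 100`.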

In [None]:
# Replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [None]:
df.isna().sum() # Checking for absence of missing values

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Handling Duplicates

In [None]:
df.duplicated().sum() # Counting explicit duplicates

3826

In [None]:
df = df.drop_duplicates().reset_index(drop=True) # Removing explicit duplicates, resetting the index, and dropping the old index

In [None]:
df.duplicated().sum() # Checking for absence of explicit duplicates

0
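A small illustration of what these calls do may be useful. The `sample` table below is invented for the example: `duplicated()` marks every repeat of a row except its first occurrence, and `reset_index(drop=True)` renumbers the remaining rows without keeping the old index as a column.

```python
import pandas as pd

# Hypothetical data: the first row appears three times in total.
sample = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['True', 'True', 'Pessimist', 'True'],
})

# Only the second and third copies count as duplicates.
n_duplicates = sample.duplicated().sum()

# Drop the repeats and renumber the rows from zero.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```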

In [None]:
df['genre'].sort_values().unique() # Viewing unique sorted genre names

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [None]:
# Removing implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
correct_genre_name = 'hiphop'
df['genre'] = df['genre'].replace(duplicates, correct_genre_name)

In [None]:
df['genre'].sort_values().unique() # Checking for absence of implicit duplicates

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
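Because the printed array is long, scanning it by eye is an unreliable check. A direct membership test is easier to read. The sketch below wraps the replacement in a hypothetical helper, `replace_wrong_genres`, and runs it on a small invented sample rather than on the real data.

```python
import pandas as pd


def replace_wrong_genres(frame, wrong_genres, correct_genre):
    """Return a copy of `frame` with each name in `wrong_genres` replaced by `correct_genre`."""
    result = frame.copy()
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result


# Invented sample containing the three misspellings handled above.
sample = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})
wrong = ['hip', 'hop', 'hip-hop']

fixed = replace_wrong_genres(sample, wrong, 'hiphop')

# Number of rows that still carry one of the wrong names; zero means the fix worked.
remaining = fixed['genre'].isin(wrong).sum()
print(fixed['genre'].unique(), remaining)
```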

**Conclusions**

Data preprocessing revealed three issues:

* inconsistencies in header naming style,
* missing values,
* duplicates — both explicit and implicit.

You corrected the headers to simplify working with the table. Removing duplicates will make the analysis more accurate.

Missing values were replaced with `'unknown'`. It remains to be seen whether missing data in the `genre` column will affect the study.

Now, you can proceed to hypothesis testing.

## Hypothesis Testing

### Comparison of User Behavior in Two Capitals

The first hypothesis states that users listen to music differently in Moscow and Saint Petersburg. Let's test this assumption using data for three days of the week — Monday, Wednesday, and Friday. To do this:

* Separate users from Moscow and Saint Petersburg.
* Compare how many tracks each group listened to on Monday, Wednesday, and Friday.

In [None]:
df.groupby('city')['user_id'].count() # Counting the number of listens in each city

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

In [None]:
df.groupby('day')['city'].count() # Counting the number of listens on each of the three days

day
Friday       21840
Monday       21354
Wednesday    18059
Name: city, dtype: int64

In [None]:
def number_tracks(day, city):
    # Keep only the rows of `df` whose `day` column equals the `day` argument.
    track_list = df[df['day'] == day]
    # Of those, keep only the rows whose `city` column equals the `city` argument.
    track_list = track_list[track_list['city'] == city]
    # Count the non-null values in `user_id`, which gives the number of listens.
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [None]:
number_tracks('Monday', 'Moscow') # Number of listens in Moscow on Mondays

15740

In [None]:
number_tracks('Monday', 'Saint-Petersburg') # Number of listens in Saint Petersburg on Mondays

5614

In [None]:
number_tracks('Wednesday', 'Moscow') # Number of listens in Moscow on Wednesdays

11056

In [None]:
number_tracks('Wednesday', 'Saint-Petersburg') # Number of listens in Saint Petersburg on Wednesdays

7003

In [None]:
number_tracks('Friday', 'Moscow') # Number of listens in Moscow on Fridays

15945

In [None]:
number_tracks('Friday', 'Saint-Petersburg') # Number of listens in Saint Petersburg on Fridays

5895

In [None]:
# Creating a table with the results
columns = ['city', 'monday', 'wednesday', 'friday']
# The rows are stored in `data` so that the `number_tracks` function is not overwritten; counts are kept as integers.
data = [['Moscow', 15740, 11056, 15945], ['Saint-Petersburg', 5614, 7003, 5895]]
info = pd.DataFrame(data=data, columns=columns)

# Displaying the table on the screen
info

|   | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
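A city-by-day table like this one can also be built in a single step instead of six separate function calls. The sketch below uses `pd.crosstab` on a small invented sample; it is an alternative to the approach in this project, not a replacement for it.

```python
import pandas as pd

# Invented sample with the same `city` and `day` columns as the real table.
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# crosstab counts the rows for every (city, day) pair; reindex puts the days in calendar order.
day_order = ['Monday', 'Wednesday', 'Friday']
counts = pd.crosstab(sample['city'], sample['day']).reindex(columns=day_order, fill_value=0)
print(counts)
```

Applied to the full data, the call would be `pd.crosstab(df['city'], df['day'])`.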


**Conclusions**

The data shows differences in user behavior:

* In Moscow, listening peaks on Monday and Friday, with a noticeable drop on Wednesday.
* In Saint Petersburg, users listen more on Wednesdays. Activity on Monday and Friday is roughly equal but lower than on Wednesday.

Thus, the data supports the first hypothesis.
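Moscow has more than twice as many listens overall, so raw counts are hard to compare between the cities. Expressing each day as a share of its city's total makes the pattern easier to see. The sketch below reuses the six counts obtained above.

```python
import pandas as pd

# Listen counts per city and day, copied from the results above.
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide every row by its own total so that each city sums to 100%.
shares = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
print(shares)
```

With these counts, Wednesday holds roughly 26% of Moscow listens and roughly 38% of Saint Petersburg listens, which is the contrast the conclusions describe.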

### Music at the Beginning and End of the Week

According to the second hypothesis, on Monday mornings, different genres dominate in Moscow and Saint Petersburg. Similarly, on Friday evenings, the prevailing genres vary depending on the city.

In [None]:
# Get the table `moscow_general` by selecting rows from `df` where the value in the `'city'` column equals `'Moscow'`.
moscow_general = df[df['city'] == 'Moscow']

In [None]:
# Get the table `spb_general` by selecting rows from `df` where the value in the `'city'` column equals `'Saint-Petersburg'`.
spb_general = df[df['city'] == 'Saint-Petersburg']

In [None]:
def genre_weekday(df, day, time1, time2):
    # Sequential filtering
    # Keep only those rows in `genre_df` from `df` where the day equals `day`.
    genre_df = df[df['day'] == day]
    # Keep only the rows in `genre_df` where the time is less than `time2`.
    genre_df = genre_df[genre_df['time'] < time2]
    # Keep only the rows in `genre_df` where the time is greater than `time1`.
    genre_df = genre_df[genre_df['time'] > time1]
    # Group the filtered DataFrame by the column with genre names (`genre`), then count the number of rows for each genre using the `count()` method.
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    # Sort the result in descending order so that the most popular genres appear at the top of the Series.
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    # Return a Series with the top 10 most popular genres during the specified time interval on the given day.
    return genre_df_sorted[:10]
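The function compares times as strings. This works because the `time` column holds zero-padded `HH:MM:SS` values, which sort in the same order as the times they represent. The sketch below demonstrates this on an invented sample and uses `value_counts()`, which counts and sorts in one call, as a shorter alternative to grouping and sorting separately.

```python
import pandas as pd

# Invented sample with times just inside and just outside the 07:00-11:00 window.
sample = pd.DataFrame({
    'genre': ['pop', 'rock', 'pop', 'jazz'],
    'time': ['06:59:59', '07:00:01', '10:59:59', '11:00:00'],
})

# Plain string comparison selects the interval because the strings are zero-padded.
in_window = sample[(sample['time'] > '07:00') & (sample['time'] < '11:00')]

# value_counts() returns the counts already sorted in descending order.
top_genres = in_window['genre'].value_counts().head(10)
print(in_window)
print(top_genres)
```

Note that the comparison would break for times stored without leading zeros, such as `'7:00:01'`.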

In [None]:
# Call the function for Monday morning in Moscow, using the table `moscow_general` instead of `df`.
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [None]:
# Call the function for Monday morning in Saint Petersburg, using the table `spb_general` instead of `df`.
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [None]:
# Call the function for Friday evening in Moscow.
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [None]:
# Call the function for Friday evening in Saint Petersburg.
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

Comparing the top 10 genres on Monday morning leads to the following insights:

1. Users in Moscow and Saint Petersburg listen to similar music. The only difference is that the Moscow ranking includes the genre “world” while the Saint Petersburg ranking features jazz and classical.

2. In Moscow, there are so many missing values that the `'unknown'` genre ranks tenth among the most popular genres. This indicates that missing data constitute a significant portion of the dataset and threaten the study’s reliability.

Friday evening doesn’t change this picture much. Some genres rise slightly, others fall, but overall the top 10 remains the same.

Thus, the second hypothesis is only partially confirmed:

* Users listen to similar music at the beginning and end of the week.
* The difference between Moscow and Saint Petersburg is not very pronounced. Moscow listeners prefer Russian pop music more often, while jazz is more common in Saint Petersburg.

However, data gaps cast doubt on this result. In Moscow, the number of missing genre entries is so large that the top 10 ranking might look different if these lost data were available.
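The comparison of the rankings can be made explicit with set operations. The sketch below copies the four top-10 lists from the outputs above; `compare_tops` is a helper name chosen for illustration.

```python
def compare_tops(first, second):
    """Return the genres found only in `first` and only in `second`, sorted by name."""
    return sorted(set(first) - set(second)), sorted(set(second) - set(first))


# Top-10 genres copied from the four function calls above.
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']
moscow_friday = ['pop', 'rock', 'dance', 'electronic', 'hiphop',
                 'world', 'ruspop', 'alternative', 'classical', 'rusrap']
spb_friday = ['pop', 'electronic', 'rock', 'dance', 'hiphop',
              'alternative', 'jazz', 'classical', 'rusrap', 'world']

monday_only_moscow, monday_only_spb = compare_tops(moscow_monday, spb_monday)
friday_only_moscow, friday_only_spb = compare_tops(moscow_friday, spb_friday)

print('Monday:', monday_only_moscow, monday_only_spb)
print('Friday:', friday_only_moscow, friday_only_spb)
```

On both days the two cities share most of the top 10. The lists differ by two genres on Monday morning and by one on Friday evening.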


### Genre Preferences in Moscow and Saint Petersburg

Hypothesis: Saint Petersburg is the capital of rap, where this genre is listened to more often than in Moscow. Meanwhile, Moscow is a city of contrasts, where pop music nevertheless predominates.

In [None]:
# Group `moscow_general` by the `genre` column, count the listens in each group with `count()`,
# sort the counts in descending order, and save the result to `moscow_genres`.
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [None]:
moscow_genres.head(10) # Viewing the first 10 rows of `moscow_genres`.

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [None]:
# Group `spb_general` by the `genre` column, count the listens in each group with `count()`,
# sort the counts in descending order, and save the result to `spb_genres`.
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [None]:
spb_genres.head(10) # Viewing the first 10 rows of `spb_genres`.

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:

* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a related genre—Russian pop music—is also in the top 10.
* Contrary to expectations, rap is about as popular in Moscow as in Saint Petersburg: `rusrap` ranks 10th in Moscow and 8th in Saint Petersburg, and `hiphop` ranks 5th in both cities.
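Ranks alone hide the difference in city size, so it is worth comparing shares as well. The sketch below takes three genre counts from the outputs above and divides them by the total number of listens per city obtained earlier (42741 for Moscow and 18512 for Saint Petersburg).

```python
import pandas as pd

# Total listens per city, from the groupby('city') output earlier in the study.
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Listen counts for three genres, copied from the two top-10 lists above.
genre_counts = pd.DataFrame(
    {'Moscow': [5892, 2096, 1161], 'Saint-Petersburg': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# Share of each genre in its city's listens, in percent.
genre_shares = genre_counts.div(totals, axis=1).mul(100).round(1)
print(genre_shares)
```

Under these figures `rusrap` accounts for about 2.7% of listens in Moscow and about 3.0% in Saint Petersburg, while pop holds about 13.8% and 13.1% respectively.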


## Research Summary

You tested three hypotheses and found:

1. The day of the week affects user activity differently in Moscow and Saint Petersburg.

The first hypothesis is fully confirmed.

2. Musical preferences do not change much throughout the week — whether in Moscow or Saint Petersburg. Small differences are noticeable at the start of the week, on Mondays:

* In Moscow, the “world” genre makes the top 10,
* In Saint Petersburg, jazz and classical make the top 10 instead.

Thus, the second hypothesis is only partially confirmed. This result might have been different if not for missing data.

3. Users in Moscow and Saint Petersburg share more similarities than differences in musical tastes. Contrary to expectations, genre preferences in Saint Petersburg resemble those in Moscow.

The third hypothesis is only partially confirmed: pop does lead in Moscow, but rap is not more popular in Saint Petersburg. If differences in preferences exist, they are not noticeable in the main user base.

**In practice, studies involve statistical hypothesis testing.**
Without statistical methods, data on only a subset of users cannot justify broad conclusions about all users of a service.
Statistical hypothesis tests will show how reliable these conclusions are given the available data.
You will learn about hypothesis testing methods in upcoming topics.
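As a preview of what such a test might look like, the sketch below computes a chi-square statistic of independence for the city-by-day counts from the first hypothesis. It is an illustration under stated assumptions, not the method this course will necessarily teach: it treats every listen as an independent observation, which is not strictly true because one user contributes many listens, so the result likely overstates the evidence. The critical value 5.991 is the standard 5% threshold for two degrees of freedom.

```python
import pandas as pd

# Observed listen counts per city and day, copied from the first hypothesis.
observed = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
grand_total = observed.to_numpy().sum()

# Counts expected if the distribution over days were identical in both cities.
expected = pd.DataFrame(
    row_totals[:, None] * col_totals[None, :] / grand_total,
    index=observed.index,
    columns=observed.columns,
)

# Chi-square statistic: squared deviations from the expected counts, scaled by those counts.
chi2 = ((observed - expected) ** 2 / expected).to_numpy().sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

CRITICAL_5PCT_DOF2 = 5.991  # chi-square critical value for alpha = 0.05 and 2 degrees of freedom
print(chi2, dof, chi2 > CRITICAL_5PCT_DOF2)
```

A statistic above the critical value means the observed difference between the cities would be unlikely if both followed the same weekly pattern, subject to the independence caveat above.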