# Yandex.Music

The comparison between Moscow and Saint Petersburg is surrounded by myths. For example:
 * Moscow is a megacity governed by the strict rhythm of the workweek;
 * Saint Petersburg is the cultural capital, with its own unique flavors.

Using the data from Yandex.Music, I will compare the user behavior in the two capitals.

**Research objective** —  to test three hypotheses:
1. User activity depends on the day of the week. Moreover, this is manifested differently in Moscow and Saint Petersburg.
2. Different genres prevail on Monday morning in Moscow and in Saint Petersburg. Likewise, the prevailing genres on Friday evening differ between the cities.
3. Moscow and Saint Petersburg prefer different music genres. Pop music is more often listened to in Moscow, while Russian rap is more popular in Saint Petersburg.

**Course of the investigation**

I will obtain user behavior data from the `yandex_music_project.csv` file. Nothing is known about the data quality. Therefore, a data review will be required before testing the hypotheses.

I will check the data for errors and assess their impact on the research. Then, during the data preprocessing stage, I will look for opportunities to correct the most critical data errors.
 
Thus, the investigation will proceed in three stages:
 1. Data review.
 2. Data preprocessing.
 3. Hypothesis testing.



## Data Review

Let's form an initial understanding of the Yandex.Music data.

The main tool for an analyst is `pandas`. Let's import this library.

In [2]:
# importing the pandas library
import pandas as pd

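Let's read the file `yandex_music_project.csv` and store it in the variable `df`. The call below expects the file in the working directory; if it is stored in the `/datasets` folder, the path has to be adjusted accordingly: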

In [4]:
# reading the file with data and saving to df
df = pd.read_csv('yandex_music_project.csv')

Displaying the first ten rows of the table:

In [5]:
# retrieving the first 10 rows of the df table
display(df.head(10))

|  | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой |  | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist |  | dance | Saint-Petersburg | 21:20:49 | Wednesday |


The first look at raw data is an important part of any research. We can get general information about the table with one command:

In [6]:
# obtaining general information about the data in the df table
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


<div class="alert alert-success">
<h2> Reviewer's comment ✔️ <a class="tocSkip"></a> </h2>

The initial data-review pipeline can be strengthened by adding [describe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html).

</div>
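
As a minimal sketch of the reviewer's suggestion, `describe()` can be called right after `info()`. The miniature table `sample` below is a hypothetical stand-in for `df`, not the real data. For text columns, `describe()` reports the number of non-missing values, the number of unique values, the most frequent value, and its frequency.

```python
import pandas as pd

# Hypothetical miniature table standing in for df (not the real data)
sample = pd.DataFrame({
    'genre': ['rock', 'pop', 'rock', None],
    'City': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow'],
})

# include='all' summarizes every column regardless of its dtype
summary = sample.describe(include='all')
print(summary)
```

A `count` lower than the number of rows is a quick hint that a column contains missing values.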


So, there are seven columns in the table. The data type in all columns is `object`.

According to the data documentation:

* `userID` — user identifier;
* `Track` — track title;
* `artist` — artist's name;
* `genre` — genre name;
* `City` — user's city;
* `time` — time of starting to listen;
* `Day` — day of the week.

There are three violations of style in the column names:

1. Lowercase letters are mixed with uppercase.
2. Spaces are present.
3. Some words need to be written in “snake_case”.

The number of values in the columns varies. This means there are missing values in the data.

**Conclusions**

Each row of the table contains data about one track that was listened to. Some columns describe the track itself: its title, artist, and genre. The rest describe the user: which city they are from and when they listened to the music.

At first glance, there is enough data to test the hypotheses. However, there are missing values in the data, and the column names violate good style.

To move forward, it's necessary to address these issues in the data.

## Data Preprocessing
Let's correct the style of the column headers and eliminate the gaps, then check the data for duplicates.

### Column Header Style
Let's display the column names:

In [7]:
# list of df table column names
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Let's bring the names in line with good style:

* write multi-word names in “snake_case”,
* make all characters lowercase,
* eliminate spaces.

To do this, we'll rename the columns as follows:

* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [8]:
# renaming columns
df = df.rename(columns = {'  userID':'user_id', 
                          'Track':'track', 
                          '  City  ':'city', 
                          'Day':'day'})

Let's check the result. To do this, we'll display the column names again:

In [9]:
# checking results - list of column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
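
The stripping and lowercasing could also be done programmatically, which helps when there are many columns. A minimal sketch, assuming an empty hypothetical frame `raw` that only carries the original header names; the multi-word name still needs an explicit mapping, because the word boundary in `userID` cannot be derived by lowercasing alone:

```python
import pandas as pd

# Hypothetical empty frame that only reproduces the original column names
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre',
                            '  City  ', 'time', 'Day'])

# Step 1: strip surrounding spaces and lowercase every name
cleaned = raw.rename(columns=lambda name: name.strip().lower())

# Step 2: fix the one multi-word name explicitly
cleaned = cleaned.rename(columns={'userid': 'user_id'})

print(list(cleaned.columns))
```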

### Missing Values
First, we'll count how many missing values there are in the table. For this, two `pandas` methods are sufficient:


In [10]:
# counting missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

The methods `isna()` and `isnull()` do the same thing: `isnull()` is an alias for `isna()`. Using `isna()` is generally preferred.

Not all missing values affect the research. For example, gaps in `track` and `artist` are not important for our work. It's enough to replace them with explicit placeholders.

However, gaps in `genre` may hinder the comparison of musical tastes in Moscow and Saint Petersburg. In practice, the right approach would be to establish the cause of the gaps and restore the data. That is not possible here, so I will have to:

* fill in these gaps with explicit designations,
* assess how much they will harm the calculations.
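
One way to gauge the potential harm is to look at the share of missing values per column rather than the raw counts. In the real table, `genre` is missing in 1198 of 65079 rows, which is roughly 1.8%. The sketch below shows the calculation on a hypothetical miniature table `sample`, not on the real data:

```python
import pandas as pd

# Hypothetical miniature table standing in for df (not the real data)
sample = pd.DataFrame({
    'track': ['a', None, 'c', 'd'],
    'genre': ['rock', 'pop', None, None],
})

# isna() marks missing cells with True; the mean of a boolean column
# is the fraction of True values, i.e. the share of missing values
missing_share = sample.isna().mean()
print(missing_share)
```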

Let's replace the missing values in the `track`, `artist`, and `genre` columns with the string `'unknown'`. For this, I'll create a list `columns_to_replace` and fill the missing values in all of these columns with a single `fillna()` call:

In [11]:
# replacing missing values with 'unknown' in the listed columns
columns_to_replace = ['track', 'artist', 'genre']
df[columns_to_replace] = df[columns_to_replace].fillna('unknown')

Let's make sure there are no more gaps in the table. To do this, we'll count the missing values again.



In [12]:
# counting missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates
Let's count the explicit duplicates in the table with one command:

In [13]:
# counting explicit duplicates
df.duplicated().sum()

3826

<div class="alert alert-success">

Wow, that's a lot...

![](https://i.ibb.co/s13m30m/C2pL.gif)

</div>


Let's invoke a special `pandas` method to remove explicit duplicates:

In [14]:
# removing explicit duplicates (deleting old indexes and forming new ones)
df = df.drop_duplicates().reset_index(drop=True)

Let's count the explicit duplicates in the table again to ensure they are completely gone:

In [15]:
# checking for the absence of duplicates
df.duplicated().sum()

0
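
To make the mechanics concrete, here is a minimal sketch of the same three steps on a hypothetical three-row table `sample` (not the real data): counting fully identical rows, dropping them, and rebuilding the index.

```python
import pandas as pd

# Hypothetical table in which the second row repeats the first one
sample = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2'],
    'track': ['Song', 'Song', 'Tune'],
})

# duplicated() flags every repeat of an earlier row with True
n_duplicates = sample.duplicated().sum()

# drop_duplicates() keeps the first occurrence; reset_index(drop=True)
# renumbers the rows from 0 and discards the old index
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```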

Now, let's eliminate the implicit duplicates in the `genre` column. For example, the name of the same genre might be recorded slightly differently. Such mistakes will also affect the research outcome.

Let's display a list of unique genre names, sorted in alphabetical order. To do this, we'll:

* extract the necessary dataframe column,
* apply the sorting method to it,
* for the sorted column, invoke the method that will return unique values from the column.

In [16]:
# Viewing unique genre names
genre_column = df['genre'].sort_values().unique()
genre_column

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Let's review the list and find implicit duplicates of the name `hiphop`. These could be misspelled names or alternative names for the same genre.

I can see the following implicit duplicates:

* *hip*,
* *hop*,
* *hip-hop*.

To remove them from the table, let's write a function `replace_wrong_genres()` with two parameters:

* `wrong_genres` — a list of duplicates,
* `correct_genre` — a string with the correct value.

The function should correct the `genre` column in the `df` table: replace each value from the `wrong_genres` list with the value from `correct_genre`.

In [17]:
# function for eliminating implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    df['genre'] = df['genre'].replace(wrong_genres, correct_genre)
    return df

<div class="alert alert-success">

To make things simpler, we can use a single `replace` call:

```python
duplicates = ['hip', 'hop', 'hip-hop']
correct_name = 'hiphop'  

df['genre'] = df['genre'].replace(duplicates, correct_name)
```

The same replacement can be written in one line:

```python
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')
```

</div>


Let's call `replace_wrong_genres()` with arguments such that the values `hip`, `hop`, and `hip-hop` are replaced with `hiphop`:

In [18]:
# eliminating the implicit duplicates: the whole list is passed in one call
list_of_wrong_genres = ['hip', 'hop', 'hip-hop']
df = replace_wrong_genres(list_of_wrong_genres, 'hiphop')

Let's check that we have replaced the incorrect names:

*   hip
*   hop
*   hip-hop

And let's output the sorted list of unique values from the `genre` column:

In [19]:
# Checking for implicit duplicates
genre_column = df['genre'].sort_values()
genre_column.unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
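
Because the full list of genres is long, a targeted check may be easier to read than scanning the output by eye. A minimal sketch, assuming a hypothetical Series `genres` in place of `df['genre']`:

```python
import pandas as pd

# Hypothetical genre column containing the three implicit duplicates
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'])
wrong_genres = ['hip', 'hop', 'hip-hop']

# Replace every wrong spelling with the correct name
fixed = genres.replace(wrong_genres, 'hiphop')

# isin() flags rows whose value is still one of the wrong spellings
remaining = fixed.isin(wrong_genres).sum()
print(remaining)
```

A result of zero means none of the listed spellings is left in the column.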

**Conclusions**

The preprocessing revealed three problems with the data:

- violations in header style,
- missing values,
- duplicates — both explicit and implicit.

We corrected the headers to simplify working with the table. The research will be more accurate without duplicates.

We replaced missing values with `'unknown'`. It remains to be seen whether the missing data in the genre column will harm the research.

Now we can move on to hypothesis testing.

## Hypothesis Testing

### Comparison of User Behavior in Two Capitals

The first hypothesis claims that users listen to music differently in Moscow and Saint Petersburg. Let's verify this assumption based on data from three weekdays: Monday, Wednesday, and Friday. To do this, we'll:

* Split the users into two groups: Moscow and Saint Petersburg.
* Compare how many tracks each group of users listened to on Monday, Wednesday, and Friday.

For practice, let's first perform each of the calculations separately. To evaluate user activity in each city, we'll group the data by city and count the listens in each group.

In [20]:
# Counting listens in each city
music_in_cities = df.groupby('city')['track'].count()
music_in_cities

city
Moscow              42741
Saint-Petersburg    18512
Name: track, dtype: int64

There are more listens in Moscow than in Saint Petersburg. This doesn't necessarily mean that users in Moscow listen to music more often; there may simply be more users in Moscow.

Now we'll group the data by weekday and count the listens on Monday, Wednesday, and Friday. Note that the data only contains listens for these three days.


In [21]:
# Counting listens on each of the three days
music_by_day = df.groupby('day')['track'].count()
music_by_day

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Taken together, users from the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

We've seen how grouping works by city and by days of the week. Now we'll write a function that combines these two calculations.

Let's create a function `number_tracks()`, which will count the listens for a given day and city. It will need two parameters:

* the day of the week,
* the name of the city.

In the function, we'll save to a variable the rows of the original table where the value:

* in the `day` column equals the `day` parameter,
* in the `city` column equals the `city` parameter.

To do this, we'll apply sequential filtering with logical indexing.

Then we'll count the values in the `user_id` column of the resulting table and save the result to a new variable. Then we'll return this variable from the function.

In [22]:
# creating the function number_tracks()
def number_tracks(day, city):
    '''Return the number of listens for the given day and city.'''
    # Sequential filtering with logical indexing:
    # first keep the rows of df with the required day,
    # then keep the rows of the result with the required city.
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]

    # Count the values in the 'user_id' column of the filtered table
    # and return this number.
    track_list_count = track_list['user_id'].count()
    return track_list_count

   



Let's call `number_tracks()` six times, changing the parameter values — to get data for each city on each of the three days.

In [23]:
# number of listens in Moscow on Mondays
number_tracks('Monday','Moscow')

15740

In [24]:
# number of listens in Saint Petersburg on Mondays
number_tracks('Monday', 'Saint-Petersburg')

5614

In [23]:
# number of listens in Moscow on Wednesdays
number_tracks('Wednesday', 'Moscow')

11056

In [25]:
# number of listens in Saint Petersburg on Wednesdays
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [27]:
# number of listens in Moscow on Fridays
number_tracks('Friday', 'Moscow')

15945

In [28]:
# number of listens in Saint Petersburg on Fridays
number_tracks('Friday', 'Saint-Petersburg')

5895

<div class="alert alert-success"> 

We can also do this in a loop:

```python
for weekday, city in zip(['Monday', 'Wednesday', 'Friday']*2, sorted(['Moscow', 'Saint-Petersburg']*3)):
    print(f'The number of listens in {city} on {weekday} is {number_tracks(weekday, city)}')
```

</div>

Let's now create a table using the `pd.DataFrame` constructor, where:
* the column names are `['city', 'monday', 'wednesday', 'friday']`;
* the data are the results obtained with `number_tracks()`.

In [29]:
# Table with results
column_names = ['city', 'monday', 'wednesday', 'friday']
data = [['Moscow', 15740, 11056, 15945], ['Saint-Petersburg', 5614, 7003, 5895]]
table_city_days = pd.DataFrame(data = data, columns = column_names)
display(table_city_days)

|  | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
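
A table of this shape could also be built without typing the numbers by hand, by grouping on both columns at once. A minimal sketch, assuming a hypothetical miniature table `sample` with the same `city`, `day`, and `user_id` columns as `df` (not the real data):

```python
import pandas as pd

# Hypothetical miniature table standing in for df (not the real data)
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
    'user_id': ['a', 'b', 'c', 'd', 'e'],
})

# Count listens for every (city, day) pair, then move the days into columns;
# pairs that never occur get 0 instead of a missing value
table = sample.groupby(['city', 'day'])['user_id'].count().unstack(fill_value=0)

# Put the days in calendar order instead of alphabetical order
table = table[['Monday', 'Wednesday', 'Friday']]
print(table)
```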


**Conclusions**

The data shows a difference in user behavior:

- In Moscow, there's a peak in music listening on Monday and Friday, with a noticeable decrease on Wednesday.
- In contrast, in Saint Petersburg, more music is listened to on Wednesdays. Activity on Monday and Friday falls behind that of Wednesday almost equally.

This means the data supports the first hypothesis.
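
Because Moscow has far more listens overall, the pattern is easier to compare when each day is expressed as a share of the city's total. A minimal sketch using the counts obtained with `number_tracks()`; the name `shares` is introduced here only for illustration:

```python
import pandas as pd

# Counts copied from the results of number_tracks() in this study
table_city_days = pd.DataFrame(
    [['Moscow', 15740, 11056, 15945], ['Saint-Petersburg', 5614, 7003, 5895]],
    columns=['city', 'monday', 'wednesday', 'friday'],
).set_index('city')

# Divide every row by its own total so that each row sums to 1
shares = table_city_days.div(table_city_days.sum(axis=1), axis=0).round(3)
print(shares)

# Day with the smallest and the largest share in each city
print(shares.idxmin(axis=1))
print(shares.idxmax(axis=1))
```

Each row of `shares` sums to about 1, so the days can be compared within a city regardless of how many users it has.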

### Music at the beginning and end of the week

According to the second hypothesis, the genres that prevail on Monday morning in Moscow differ from those in Saint Petersburg. Likewise, the prevailing genres on Friday evening differ between the cities.

Let's save the tables with the data into two variables:

* for Moscow — in `moscow_general`;
* for Saint Petersburg — in `spb_general`.

In [30]:
# getting the moscow_general table from those rows of the df table, 
# for which the value in the 'city' column equals 'Moscow'
moscow_general = df[df['city'] == 'Moscow']

In [31]:
# getting the spb_general table from those rows of the df table,
# for which the value in the 'city' column equals 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']

Let's create a function `genre_weekday()` with four parameters:

* a table (dataframe) with data,
* the day of the week,
* the initial time stamp in the 'hh:mm' format,
* the final time stamp in the 'hh:mm' format.

The function should return information about the top 10 genres of the tracks that were listened to on the specified day, in the period between the two timestamps.

In [32]:
def genre_weekday(table, day, time1, time2):
    # 1) The genre_df variable stores those rows of the passed dataframe 'table' for which:
    #    - the value in the 'day' column is equal to the 'day' argument value
    #    - the value in the 'time' column is greater than the 'time1' argument value
    #    - the value in the 'time' column is less than the 'time2' argument value
    #    Sequential filtering is used with logical indexing.
    genre_df = table[table['day'] == day]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]
    
    # 2) Group the 'genre_df' dataframe by the 'genre' column, take one of its columns, 
    #    and use the count() method to count the number of records for each of the present genres. 
    #    The resulting Series is recorded in the 'genre_df_count' variable.
    genre_df_count = genre_df.groupby('genre')['track'].count()
    
    # 3) Sort 'genre_df_count' in descending order of frequency and save it 
    #    in the 'genre_df_sorted' variable.
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    
    # 4) Return a Series of the first 10 values of 'genre_df_sorted', which will be the top-10
    #    popular genres (on the specified day, at the specified time).
    return genre_df_sorted.head(10)
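
The time filter in `genre_weekday()` compares strings, not time objects. This works because the values are zero-padded, so alphabetical order matches chronological order. A minimal sketch of the idea; the helper `in_window()` is hypothetical and exists only for this illustration:

```python
def in_window(time_str, start, end):
    """Check start < time_str < end using plain string comparison."""
    return start < time_str < end

# Zero-padded strings sort the same way as the times they represent
print(in_window('08:34:34', '07:00', '11:00'))  # True
print(in_window('20:28:33', '07:00', '11:00'))  # False

# Without zero-padding the comparison breaks: '9' sorts after '1',
# so 9:15 is wrongly treated as later than 11:00
print(in_window('9:15:00', '07:00', '11:00'))   # False
```

If the `time` column ever contained values without a leading zero, converting it to a proper time type before filtering would be the safer choice.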


Let's compare the results of the `genre_weekday()` function for Moscow and Saint Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [33]:
# function call for Monday morning in Moscow (instead of df — the moscow_general table)
# objects holding time are strings and are compared as strings
# example call: genre_weekday(moscow_general, 'Monday', '07:00', '11:00')
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: track, dtype: int64

In [34]:
# function call for Monday morning in Saint Petersburg (instead of df — the spb_general table)
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: track, dtype: int64

In [35]:
# function call for Friday evening in Moscow
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: track, dtype: int64

In [36]:
# function call for Friday evening in Saint Petersburg
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: track, dtype: int64

**Conclusions**

Upon comparing the top 10 genres on Monday morning, the following conclusions can be drawn:

1. Both Moscow and Saint Petersburg listeners enjoy similar music. The only difference is the inclusion of the "world" genre in the Moscow ranking, while jazz and classical music made it to the Saint Petersburg list.

2. In Moscow, there were so many missing values that the `'unknown'` category ranked tenth among the most popular genres. This suggests that missing values constitute a significant portion of the data and pose a threat to the reliability of the study.

Friday evening does not alter this picture significantly. Some genres move slightly up, others go down, but overall, the top 10 remains the same.

Thus, the second hypothesis is only partially confirmed:

* Users listen to similar music at the beginning and the end of the week.
* The difference between Moscow and Saint Petersburg is not too pronounced. Russian pop music is more frequently listened to in Moscow, while jazz is more popular in Saint Petersburg.

However, gaps in the data cast doubt on this result. There are so many gaps in Moscow that the top 10 ranking could look different if the genre data had not been lost.
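
The overlap between two rankings can be checked with set operations instead of by eye. A minimal sketch using the Monday-morning top 10 lists from the outputs above; the variable names are introduced here only for illustration:

```python
# Top 10 genres on Monday morning, copied from the outputs of genre_weekday()
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# Genres that appear in both rankings and in only one of them
common = sorted(set(moscow_monday) & set(spb_monday))
only_moscow = sorted(set(moscow_monday) - set(spb_monday))
only_spb = sorted(set(spb_monday) - set(moscow_monday))

print(len(common), only_moscow, only_spb)
```

Eight of the ten genres are shared; the Moscow-only entries include the `'unknown'` placeholder, which is not a real genre.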

### Genre Preferences in Moscow and Saint Petersburg

Hypothesis: Saint Petersburg is the capital of rap, and music of this genre is listened to more frequently there than in Moscow. Moscow, on the other hand, is a city of contrasts, where pop music nonetheless prevails.

Let's group the `moscow_general` table by genre and count the number of track plays for each genre using the `count()` method. Then we'll sort the results in descending order and save it in the `moscow_genres` table.

In [37]:
# in one line: grouping the moscow_general table by the 'genre' column,
# counting the number of 'genre' values in this grouping using the count() method,
# sorting the resulting Series in descending order and saving it in moscow_genres
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)

Let's display the first ten rows of `moscow_genres`:

In [38]:
# view the first 10 rows of moscow_genres
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [39]:
moscow_genres.to_frame()

| genre | genre |
|---|---|
| pop | 5892 |
| dance | 4435 |
| rock | 3965 |
| electronic | 3786 |
| hiphop | 2096 |
| ... | ... |
| neoklassik | 1 |
| mood | 1 |
| metalcore | 1 |
| marschmusik | 1 |

Now let's repeat the same process for Saint Petersburg.

Let's group the `spb_general` table by genre and count the number of times tracks of each genre were played. Then we'll sort the results in descending order and save them in the `spb_genres` table:

In [40]:
# in one line: grouping the spb_general table by the 'genre' column,
# counting the number of 'genre' values in this group using the count() method,
# sorting the resulting Series in descending order and saving it in spb_genres
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)

Let's display the first ten rows of `spb_genres`:

In [41]:
# viewing the first 10 rows of spb_genres
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis is partially confirmed:

* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Furthermore, a related genre - Russian pop music - is also in the top 10.
* Contrary to expectations, rap is equally popular in both Moscow and Saint Petersburg.
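
Raw counts are hard to compare because Moscow has more than twice as many listens, so it helps to look at each genre's share of all listens in a city. A minimal sketch using the totals and genre counts from the outputs above; the function `genre_share()` is hypothetical and written only for this illustration:

```python
# Total listens per city and per-genre counts, copied from the outputs in this study
total_listens = {'Moscow': 42741, 'Saint-Petersburg': 18512}
genre_listens = {
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
}

def genre_share(genre, city):
    """Return the share of a genre among all listens in a city, in percent."""
    return 100 * genre_listens[genre][city] / total_listens[city]

for genre in genre_listens:
    for city in total_listens:
        print(f'{genre} in {city}: {genre_share(genre, city):.2f}%')
```

On these numbers, the shares of pop and of Russian rap differ between the two cities by well under one percentage point.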

## Research Findings

We have tested three hypotheses and determined:

1. The day of the week affects user activity differently in Moscow and Saint Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences don't change much throughout the week — be it Moscow or Saint Petersburg. Minor differences are noticeable at the beginning of the week, on Mondays:
* in Moscow, they listen to “world” music,
* in Saint Petersburg — jazz and classical.

Thus, the second hypothesis was only partially confirmed. This result could have been different if not for the missing data.

3. The tastes of users in Moscow and Saint Petersburg have more in common than differences. Contrary to expectations, genre preferences in Saint Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If differences in preferences do exist, they are not noticeable among the majority of users.

**In practice, research involves testing statistical hypotheses.**

It's not always possible to draw conclusions about all the residents of a city from the data of one service. 

Testing statistical hypotheses shows how reliable such conclusions are, given the available data.
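
As an illustration only, here is a sketch of what such a test could look like for the first hypothesis: a chi-square test of independence on the city-by-day counts, computed by hand. It is not part of the original study and it treats every listen as an independent observation. That assumption is questionable here, because one user produces many listens, so the result should be read with caution.

```python
# Observed listens from the city-by-day table in this study:
# rows are cities, columns are Monday, Wednesday, Friday
observed = [
    [15740, 11056, 15945],  # Moscow
    [5614, 7003, 5895],     # Saint-Petersburg
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected counts if the share of each day were the same in both cities
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Critical value of the chi-square distribution with
# (2 - 1) * (3 - 1) = 2 degrees of freedom at the 0.05 significance level
CRITICAL_VALUE = 5.991
print(round(chi2, 1), chi2 > CRITICAL_VALUE)
```

A statistic above the critical value would indicate that the distribution of listens across days differs between the cities more than chance alone would explain, under the stated assumption.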