# Yandex.Music

This study uses Yandex.Music data to compare the behavior of users in two capitals: Moscow and St. Petersburg.

**Research objective** — test three hypotheses:
1. User activity depends on the day of the week. Moreover, in Moscow and St. Petersburg, this manifests itself in different ways.
2. On Monday morning, some genres prevail in Moscow, and others in St. Petersburg. Similarly, on Friday evening, different genres prevail — depending on the city. 
3. Moscow and St. Petersburg prefer different genres of music. Pop music is listened to more often in Moscow, and Russian rap in St. Petersburg.

**Course of the study**

User behavior data is stored in the file `yandex_music_project.csv`. Nothing is known about the quality of the data, so it needs to be reviewed before the hypotheses are tested.

I check the data for errors and evaluate their impact on the study, then look for a way to correct the most critical ones.
 
Thus, the study will take place in three stages:
 1. Data overview.
 2. Data preprocessing.
 3. Hypothesis testing.

## Data overview

I form a first impression of the Yandex.Music data.

In [1]:
import pandas as pd #importing the pandas library

In [2]:
df=pd.read_csv('/datasets/yandex_music_project.csv') # reading a data file and saving it to df

In [3]:
df.head(10)#getting the first 10 rows of the df table

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
df.tail(10) #getting the last 10 rows of the df table

Unnamed: 0,userID,Track,artist,genre,City,time,Day
65069,BE1AAD74,Waterwalk,Eduardo Gonzales,electronic,Moscow,20:38:59,Monday
65070,49F35D53,Ass Up,Rameez,dance,Moscow,14:08:58,Friday
65071,92378E24,Swing it Like You Mean it,OJOJOJ,techno,Moscow,21:12:56,Friday
65072,C532021D,We Can Not Be Silenced,Pänzer,extrememetal,Moscow,08:38:24,Friday
65073,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Moscow,21:07:12,Monday
65074,729CBB09,My Name,McLean,rnb,Moscow,13:32:28,Wednesday
65075,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hip,Saint-Petersburg,10:00:00,Monday
65076,C5E3A0D5,Jalopiina,,industrial,Moscow,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Moscow,21:43:59,Friday
65078,3A64EF84,Tell Me Sweet Little Lies,Monica Lopez,country,Moscow,21:59:46,Friday


In [5]:
df.info() #getting general information about the data in the df table with one command

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in the table. The data type of every column is `object`.

According to the data documentation:
* `'userID'`: the user ID;
* `'Track'`: the name of the track;
* `'artist'`: the artist's name;
* `'genre'`: the name of the genre;
* `'City'`: the user's city;
* `'time'`: the time when listening started;
* `'Day'`: the day of the week.

Three style violations are visible in the column names:
1. Lowercase letters are mixed with uppercase.
2. Some names contain leading or trailing spaces.
3. Multi-word names are not written in snake_case (words joined by underscores).



The number of values in the columns varies. So there are missing values in the data.

**Conclusions**

Each row of the table holds data about one track that was played. Some columns describe the track itself: its name, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

At first glance, there is enough data to test the hypotheses. But there are missing values in the data, and the column names deviate from good style.

Before moving on, the problems in the data need to be fixed.

## Data preprocessing

### Header style

In [6]:
df.columns #list of column names in the df table

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

I bring the names in line with good style:
* multi-word names are written in snake_case,
* all characters are lowercase,
* spaces are removed.

To do this, I will rename the existing columns in this way:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [7]:
df=df.rename(columns={'  userID':'user_id', 'Track':'track', '  City  ':'city', 'Day':'day'}) #renaming columns

In [8]:
df.columns #checking the results - a list of column names

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
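As a side note, the same renaming could be derived programmatically instead of being listed by hand. The sketch below is only an illustration: `to_snake_case` is a hypothetical helper, not part of the project, and it assumes that word boundaries are marked by a lowercase letter followed by an uppercase one, as in `userID`.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip spaces, put '_' before an uppercase run that follows a lowercase letter, lowercase all."""
    stripped = name.strip()
    # Insert an underscore between a lowercase letter or digit and an uppercase letter.
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()


# Empty toy frame with the same raw headers as the dataset.
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
```

For seven columns the explicit dictionary used above is just as readable; a helper like this mainly pays off for wide tables.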

### Missing values

In [9]:
print(df.isna().sum())


user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64


Not all missing values affect the study. In `track` and `artist`, the missing values do not matter for this work. It is enough to replace them with an explicit placeholder.

But missing values in `genre` may interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, it would be right to establish why the values are missing and restore the data. That is not possible in this training project, so I will have to:
* fill in these gaps with an explicit placeholder,
* assess how much they distort the calculations.

I replace the missing values in the columns `track`, `artist` and `genre` with the string `'unknown'`.
To do this, I create a list `columns_to_replace`, iterate over its elements with a `for` loop, and replace the missing values in each column:

In [10]:
columns_to_replace=['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column]=df[column].fillna('unknown')
    # iterating over column names in a loop and replacing missing values with 'unknown'

I need to make sure that no missing values are left in the table. To do this, I count them again.

In [11]:
df.isna().sum() #counting missing values

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
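One possible way to assess how much the `'unknown'` genres could distort the comparison is to look at their share in each city. Before duplicates were removed, 1198 of the 65079 rows had no genre, which is about 1.8% overall. The sketch below shows the per-city calculation on a made-up toy table; the rows and the name `unknown_share` are assumptions for illustration, not results from the dataset.

```python
import pandas as pd

# Toy data standing in for df after fillna('unknown'); the values are made up.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'unknown', 'pop'],
})

# The mean of a boolean mask is the fraction of True values in each group.
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

If the shares differed strongly between the cities, genre rankings for the city with more gaps would deserve extra caution.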

### Duplicates

In [12]:
df.duplicated().sum() #counting explicit duplicates

3826

In [13]:
df = df.drop_duplicates().reset_index(drop=True) #removal of explicit duplicates (with the removal of old indexes and the formation of new ones)

In [14]:
df.duplicated().sum() #checking for the absence of duplicates

0
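To make clear what counts as an explicit duplicate, here is a small sketch on made-up rows. With default arguments, `duplicated()` marks a row only if every column matches an earlier row, so the same track played by a different user is not a duplicate.

```python
import pandas as pd

# Made-up rows: the second row repeats the first one exactly.
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['Song', 'Song', 'Song', 'Other'],
    'time': ['08:00:00', '08:00:00', '08:00:00', '08:00:00'],
})

n_duplicates = toy.duplicated().sum()  # rows identical to an earlier row in every column
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```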

I display a list of unique genre names, sorted alphabetically. To do this, I:
* extract the desired dataframe column,
* apply the sorting method to it,
* call a method on the sorted column that returns its unique values.

In [15]:
sorted_df = df['genre'].sort_values().unique()
display(sorted_df) #View unique genre names

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Now I need to get rid of implicit duplicates in the `genre` column. For example, the name of the same genre may be spelled slightly differently. Such errors also affect the result of the study.

The following implicit duplicates are present:
* *hip*,
* *hop*,
* *hip-hop*,
* *электроника*.

To clear the table of them, I write the function `replace_wrong_genres()` with two parameters:
* `wrong_genres`: a list of duplicates,
* `correct_genre`: a string with the correct value.

The function should correct the `genre` column in the `df` table: replace each value from the `wrong_genres` list with the value from `correct_genre`.

In [16]:
def replace_wrong_genres(wrong_genres, correct_genre):
    # Function for replacing implicit duplicates in the 'genre' column
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

I call `replace_wrong_genres()` with arguments that eliminate the implicit duplicates: instead of `hip`, `hop` and `hip-hop`, the table should contain the value `hiphop`:

In [17]:
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'
replace_wrong_genres(wrong_genres, correct_genre) #Elimination of implicit duplicates

In [18]:
df['genre'] = df['genre'].replace('электроника','electronic') #the table should have the value `electronic`

In [19]:
sorted_df = df['genre'].sort_values().unique() #Checking for implicit duplicates
display(sorted_df)

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
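An alternative to replacing the variants one by one would be a single mapping passed to `Series.replace()`. The sketch below runs on a toy column; the dictionary name `genre_aliases` is an assumption, and its keys are only the variants handled in this section.

```python
import pandas as pd

# Hypothetical mapping: each misspelled or alternative name points to the canonical one.
genre_aliases = {
    'hip': 'hiphop',
    'hop': 'hiphop',
    'hip-hop': 'hiphop',
    'электроника': 'electronic',
}

toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'электроника', 'rock']})
# With a dict, replace() matches whole values, so 'hiphop' itself is left untouched.
toy['genre'] = toy['genre'].replace(genre_aliases)
print(sorted(toy['genre'].unique()))
```

Keeping all aliases in one dictionary also documents in one place which spellings were merged.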

**Conclusions**

Preprocessing revealed three problems in the data:

- style violations in the column headers,
- missing values,
- duplicates, both explicit and implicit.

I fixed the headers to make the table easier to work with. Without duplicates, the study will be more accurate.

The missing values are replaced with `'unknown'`. It remains to be seen whether the missing values in the `genre` column will harm the study.

Now I can proceed to hypothesis testing.

## Hypothesis testing

### Comparison of user behavior of two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. I check this assumption using data for three days of the week: Monday, Wednesday and Friday. To do this, I:

* separate the users of Moscow and St. Petersburg,
* compare how many tracks each group of users listened to on Monday, Wednesday and Friday.


In [20]:
activity = df.groupby('city')['day'].count() #Counting plays in each city
print(activity) 

city
Moscow              42741
Saint-Petersburg    18512
Name: day, dtype: int64


There are more plays in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; there are simply more users in Moscow.

Now I group the data by day of the week and count the plays on Monday, Wednesday and Friday. The data contains plays for these three days only.

In [21]:
activity_day=df.groupby('day')['time'].count() #Counting plays on each of the three days
print(activity_day) 

day
Friday       21840
Monday       21354
Wednesday    18059
Name: time, dtype: int64


Taken together, users from the two cities are less active on Wednesdays. But the picture may change if each city is considered separately.

To do this, I create a function `number_tracks()` that counts the plays for a given day and city. It needs two parameters:
* day of the week,
* name of the city.

In the function, I save to a variable the rows of the source table in which:
* the value in the `day` column equals the parameter `day`,
* the value in the `city` column equals the parameter `city`.

To do this, I filter the table with logical indexing, combining both conditions.

Then I count the values in the `user_id` column of the resulting table, save the result to a new variable, and return this variable from the function.

In [22]:
#<creating the number_tracks() function>
def number_tracks(day, city):
    # A function with two parameters: day, city.
    # It counts the plays for a specific city and day.

    # track_list stores the rows of df for which the value in the 'day' column
    # equals the day parameter and, at the same time, the value in the
    # 'city' column equals the city parameter (logical indexing with both
    # conditions combined by &).
    track_list = df[(df['day'] == day) & (df['city'] == city)]

    # track_list_count stores the number of values in the 'user_id' column,
    # calculated with the count() method for the track_list table.
    track_list_count = track_list['user_id'].count()

    # The function returns a number: the value of track_list_count.
    return track_list_count

I call `number_tracks()` six times, changing the parameter values to get data for each city on each of the three days.

In [23]:
number_tracks('Monday', 'Moscow') #number of plays in Moscow on Mondays

15740

In [24]:
number_tracks('Monday', 'Saint-Petersburg') #number of plays in St. Petersburg on Mondays

5614

In [25]:
number_tracks('Wednesday', 'Moscow') #number of plays in Moscow on Wednesdays

11056

In [26]:
number_tracks('Wednesday', 'Saint-Petersburg') #number of plays in St. Petersburg on Wednesdays

7003

In [27]:
number_tracks('Friday', 'Moscow') #number of plays in Moscow on Fridays

15945

In [28]:
number_tracks('Friday', 'Saint-Petersburg') #number of plays in St. Petersburg on Fridays

5895

In [29]:
data = [
    ['Moscow', 15740, 11056, 15945],
    ['Saint-Petersburg', 5614, 7003, 5895]
]
columns = ['city', 'monday', 'wednesday', 'friday']
table = pd.DataFrame(data = data, columns = columns) #Results table
display(table)

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
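The six separate calls and the hand-built table could, as one alternative, be replaced by a single cross-tabulation of city against day. The sketch below uses a made-up toy table rather than the real `df`; `day_order` is an illustrative name for the desired column order.

```python
import pandas as pd

# Made-up plays; in the notebook this would be the cleaned df.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Monday', 'Friday', 'Wednesday', 'Monday'],
})

day_order = ['Monday', 'Wednesday', 'Friday']
# crosstab counts rows for every (city, day) pair; reindex puts the days in calendar order.
counts = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)
print(counts)
```

This avoids copying numbers by hand, which removes one source of transcription errors.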


**Conclusions**

The data shows a difference in user behavior:

- In Moscow, plays peak on Monday and Friday, and there is a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, people listen to music more on Wednesdays. Activity on Monday and Friday is lower than on Wednesday by a similar margin.

So, the data speaks in favor of the first hypothesis.
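Because Moscow has far more plays overall, the two rows are easier to compare as shares of each city's own total. The sketch below uses the counts from the results table; the variable name `shares` is illustrative.

```python
import pandas as pd

# Play counts copied from the results table above.
table = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total so that both cities are on the same scale.
shares = table.div(table.sum(axis=1), axis=0).round(3)
print(shares)
```

On these numbers, Wednesday accounts for roughly 26% of the plays in Moscow and roughly 38% in St. Petersburg, which is the contrast described in the conclusions.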

### Music at the beginning and end of the week

According to the second hypothesis, some genres prevail in Moscow on Monday morning, and others in St. Petersburg. Similarly, on Friday evening, different genres prevail — depending on the city.

I save the tables with data in two variables:
* for Moscow, in `moscow_general`;
* for St. Petersburg, in `spb_general`.

In [31]:
moscow_general = df[df['city'] == 'Moscow'] #getting the moscow_general table from those rows of the df table, 
#for which the value in the 'city' column is 'Moscow'

spb_general = df[df['city'] == 'Saint-Petersburg'] #getting the spb_general table from those rows of the df table,
#for which the value in the 'city' column is 'Saint-Petersburg'

I create the `genre_weekday()` function with four parameters:
* table (dataframe) with data,
* day of the week,
* initial timestamp in the format 'hh:mm', 
* the last timestamp in the format 'hh:mm'.

The function should return information about the top 10 genres of those tracks that were listened to on the specified day, in the interval between two timestamps.

In [32]:
def genre_weekday(table, day, time1, time2):
    # Returns the top 10 genres on the specified day within the specified time interval.

    # 1) genre_df stores the rows of the passed table for which, at the same time:
    #    - the value in the day column equals the day argument,
    #    - the value in the time column is greater than the time1 argument,
    #    - the value in the time column is less than the time2 argument
    #    (logical indexing with the conditions combined by &).
    genre_df = table[(table['day'] == day) & (table['time'] > time1) & (table['time'] < time2)]

    # 2) I group genre_df by the genre column and count the records for each genre;
    #    the resulting Series is saved to genre_df_count.
    genre_df_count = genre_df.groupby('genre')['user_id'].count()

    # 3) I sort genre_df_count in descending order and save it to genre_df_sorted.
    genre_df_sorted = genre_df_count.sort_values(ascending=False)

    # 4) I return the first 10 values of genre_df_sorted: the top 10 genres
    #    on the specified day at the specified time.
    return genre_df_sorted.head(10)
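The function compares times as strings, which works because the `time` column holds zero-padded values in the form `hh:mm:ss`. The sketch below illustrates this in plain Python, including how the bounds `'07:00'` and `'11:00'` behave at the edges; the variable names are illustrative.

```python
# Zero-padded time strings sort in the same order as the times they encode.
assert '07:00' < '07:00:00' < '08:34:34' < '11:00'

# A string that is a prefix of a longer one compares as smaller, so with the
# strict comparisons used in genre_weekday() an exact '07:00:00' passes the
# lower bound, while an exact '11:00:00' does not pass the upper bound.
lower_bound_included = '07:00:00' > '07:00'
upper_bound_included = '11:00:00' < '11:00'
print(lower_bound_included, upper_bound_included)

# Without zero padding the order breaks: '9:15:00' sorts after '11:00'.
unpadded_breaks = '9:15:00' > '11:00'
print(unpadded_breaks)
```

If the column ever contained unpadded times, converting it to a proper time type before filtering would be the safer choice.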

I compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [33]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')
# function call for Monday morning in Moscow (instead of df — moscow_general table)
# objects storing time are strings and are compared as strings

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64

In [34]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')
# function call for Monday morning in St. Petersburg (spb_general table instead of df)

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64

In [35]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')
# function call for Friday evening in Moscow

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64

In [36]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')
# function call for Friday evening in St. Petersburg

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. People in Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow top 10 includes the “world” genre, while the St. Petersburg top 10 includes jazz and classical music.

2. There were so many missing values in Moscow that the value `'unknown'` took tenth place among the most popular genres. This means that missing values make up a noticeable share of the data and threaten the reliability of the study.

Friday evening does not change this picture. Some genres rise a little higher, others go down, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, people listen to Russian popular music more often; in St. Petersburg, jazz.

However, missing values in the data cast doubt on this result. There are so many of them in Moscow that the top 10 could look different if the genre data had not been lost.

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, music of this genre is listened to there more often than in Moscow. And Moscow is a city of contrasts, in which, nevertheless, pop music prevails.

I group the `moscow_general` table by genre and count the plays of tracks of each genre using the `count()` method. Then I sort the result in descending order and save it in `moscow_genres`.

In [37]:
moscow_genres = moscow_general.groupby('genre')['user_id'].count().sort_values(ascending=False)
# in one line: grouping the moscow_general table by the 'genre' column,
# counting the 'user_id' values in each group with the count() method,
# sorting the resulting Series in descending order and saving it to moscow_genres

In [38]:
moscow_genres.head(10) #view the first 10 lines of moscow_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: user_id, dtype: int64

I repeat the same steps for St. Petersburg.

In [39]:
spb_genres = spb_general.groupby('genre')['user_id'].count().sort_values(ascending=False)
# in one line: grouping the spb_general table by the 'genre' column,
# counting the 'user_id' values in each group with the count() method,
# sorting the resulting Series in descending order and saving it to spb_genres

In [40]:
spb_genres.head(10) #view the first 10 lines of spb_genres

genre
pop            2431
dance          1932
rock           1879
electronic     1737
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: user_id, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also includes a related genre, Russian popular music.
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg.
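Since the cities differ in the total number of plays, the comparison is clearer in percentages than in raw counts. The sketch below uses the counts printed earlier in the notebook (total plays per city and the `pop` and `rusrap` rows); the dictionary names are illustrative.

```python
# Counts copied from the outputs above: total plays per city and plays per genre.
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}
plays = {
    'Moscow': {'pop': 5892, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'rusrap': 564},
}

# Share of each genre in its city's plays, in percent, rounded to one decimal.
shares = {
    city: {genre: round(100 * n / totals[city], 1) for genre, n in genres.items()}
    for city, genres in plays.items()
}
print(shares)
```

On these numbers, pop makes up about 13.8% of plays in Moscow and 13.1% in St. Petersburg, and Russian rap about 2.7% and 3.0%, so the gap between the cities is small for both genres.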

## Conclusion of the study

Three hypotheses were tested, with the following results:

1. The day of the week has different effects on user activity in Moscow and St. Petersburg. For example, **on Wednesday**, Moscow has the least activity, while St. Petersburg has the greatest.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week, whether in Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow, people listen to music of the “world” genre,
* in St. Petersburg, jazz and classical music.

Thus, the second hypothesis was only partially confirmed. This result could have been different if not for the missing values in the data.

3. The tastes of users in Moscow and St. Petersburg have more similarities than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable for the bulk of users.

**In practice, studies contain tests of statistical hypotheses.**
From the data of one service, it is not always possible to draw a conclusion about all residents of the city.
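As a rough illustration of what such a test could look like, the sketch below computes a chi-square statistic of independence for the city-by-day play counts from the first hypothesis. It is not part of the original study, and it assumes that plays are independent observations, which is questionable because one user contributes many plays. The result should therefore be read as a demonstration of the mechanics only.

```python
import numpy as np

# Observed plays per city (rows: Moscow, St. Petersburg) and day
# (columns: Monday, Wednesday, Friday), copied from the results table.
observed = np.array([[15740, 11056, 15945],
                     [5614, 7003, 5895]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
# Counts expected in each cell if city and day were independent.
expected = row_totals @ col_totals / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)  # (2 - 1) * (3 - 1) = 2

# 5.991 is the 0.05 critical value of the chi-square distribution with 2 degrees of freedom.
print(chi2_stat > 5.991, dof)
```

A statistic above the critical value would argue against independence of city and day at the 5% level, under the stated assumption.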