# Yandex.Music

The comparison between Moscow and St. Petersburg is surrounded by myths. For example:
 * Moscow is a megalopolis, subject to the rigid rhythm of the working week;
 * St. Petersburg is a cultural capital with its own tastes.

Using Yandex.Music data, we will compare the behavior of users of the two capitals.

**Research objective** - test three hypotheses:
1. User activity depends on the day of the week, and this dependence manifests itself differently in Moscow and St. Petersburg.
2. On Monday morning, some genres prevail in Moscow and others in St. Petersburg. Similarly, on Friday evening, different genres prevail depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow, pop music is listened to more often; in St. Petersburg, Russian rap.

**Study progress**

We will get data on user behavior from the `yandex_music_project.csv` file. Nothing is known about the quality of the data, so we need to review it before testing the hypotheses.

We will check the data for errors and assess their impact on the study. Then, in the preprocessing phase, we will look for opportunities to correct the most critical data errors.
 
Thus, the study will proceed in three phases:
 1. Data review.
 2. Data preprocessing.
 3. Hypothesis testing.

## Data overview

Let's take a first look at the Yandex.Music data.

The analyst's primary tool is `pandas`, so we start by importing this library.

In [1]:
# import pandas library
import pandas as pd

Read the `yandex_music_project.csv` file from the `/datasets` folder and save it in the `df` variable:

In [2]:
# read the data file and save it to df
df = pd.read_csv('/datasets/yandex_music_project.csv')

Display the first ten rows of the table:

In [3]:
# get the first 10 rows of table df
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Saint-Petersburg | 21:20:49 | Wednesday |


Get general information about the table with one command:

In [4]:
# get general information about the data in table df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in the table. The data type in all columns is `object`.

According to the data documentation:
* `userID` - user ID;
* `Track` - track name;  
* `artist` - artist name;
* `genre` - genre name;
* `City` - user's city;
* `time` - start time of listening;
* `Day` - day of the week.

Three style violations are visible in the column names:
1. Lowercase letters are mixed with uppercase letters.
2. Some names contain leading or trailing spaces.
3. snake_case is not used for multi-word names (`userID`).

The number of non-null values differs between columns, which means the data contains missing values.

**Findings**

Each row of the table contains data about one listen of a track. Some of the columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

Preliminarily, we can say that the data is sufficient to test the hypotheses. However, there are missing values in the data, and the column names deviate from good style.

To move forward, we need to fix the problems in the data.

## Data preprocessing
Let's correct the style of the column names and handle the missing values. Then let's check the data for duplicates.

### Heading Style
Display the column titles on the screen:

In [5]:
# list of df table column names
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Bring the column names into line with good style:
* write multi-word names in snake_case,
* make all characters lowercase,
* remove spaces.

To do this, rename the columns as follows:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [6]:
# column renaming
df = df.rename(columns = {df.columns[0]:'user_id',
                         df.columns[1]:'track',
                         df.columns[-3]:'city',
                         df.columns[-1]:'day'})

Check the result. To do this, display the column names again:

In [7]:
# check results - list of column names
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
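
Renaming by position works here, but it silently breaks if the column order ever changes. As a hedged alternative, the sketch below normalizes the names with a hypothetical helper `to_snake_case()` (not part of the original project). It is applied to an empty toy table that only reuses the column names shown above.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip spaces, split camelCase with an underscore, and lowercase."""
    name = name.strip()
    # Insert "_" between a lowercase letter or digit and an uppercase letter:
    # "userID" -> "user_ID"
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# Toy table with the original (unclean) column names and no rows
toy = pd.DataFrame(
    columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
)

# rename() accepts a function that is applied to every column name
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

The helper is deliberately simple: it assumes names like `userID`, not arbitrary acronyms or names with inner spaces.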

### Missing values
First, let's count how many missing values there are in the table. Two `pandas` methods are sufficient for this purpose:

In [8]:
# count missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. In `track` and `artist`, missing values are not important for our work. It is enough to replace them with an explicit placeholder.

But missing values in `genre` may interfere with the comparison of musical tastes in Moscow and St. Petersburg. In practice, it would be correct to establish the reason for the gaps and restore the data. That is not possible in this study project, so we will have to:
* fill in these gaps with an explicit placeholder,
* assess how much they will distort the calculations.

Replace the missing values in the `track`, `artist` and `genre` columns with the string `'unknown'`. To do this, create a `columns_to_replace` list, iterate over its elements in a `for` loop and replace the missing values in each column:

In [9]:
# loop through the column names and replace missing values with 'unknown'.
columns_to_replace = ['track','artist','genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

Make sure that there are no gaps left in the table. To do this, count the missing values again.

In [10]:
# count missing values again
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
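
One possible way to assess how much the `'unknown'` placeholder could distort the genre comparison is to look at its share of rows in each city. The sketch below uses a small invented table, not the project data, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy data (invented for illustration); None marks a missing genre
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', None, 'rock', 'dance', 'jazz', None],
})
toy['genre'] = toy['genre'].fillna('unknown')

# Share of rows per city whose genre is the 'unknown' placeholder:
# the mean of a boolean column is the fraction of True values
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

If the share differs noticeably between the cities, genre rankings for the city with more gaps deserve extra caution.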

### Duplicates
Count explicit duplicates in a table with a single command:

In [11]:
# counting obvious duplicates
df.duplicated().sum()

3826

Call the special `pandas` method to remove obvious duplicates:

In [12]:
# removal of obvious duplicates (with deletion of old indexes and formation of new ones)
df = df.drop_duplicates().reset_index(drop=True)

Count the obvious duplicates in the table again to make sure we got rid of them completely:

In [13]:
# check for duplicates
df.duplicated().sum()

0
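
To see what these two calls do, here is a minimal sketch on an invented four-row table. `duplicated()` marks every repeat of a row after its first occurrence, and `reset_index(drop=True)` renumbers the surviving rows.

```python
import pandas as pd

# Toy data (invented): rows 1 and 3 repeat row 0 exactly
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['x', 'x', 'y', 'x'],
})

# Number of rows that are exact copies of an earlier row
n_dupes = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild the index as 0..n-1
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```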

Now let's get rid of implicit duplicates in the `genre` column. For example, the name of the same genre may be spelled in slightly different ways. Such errors will also affect the result of the study.

Display a list of unique genre titles sorted alphabetically. To do this:
* extract the required dataframe column, 
* apply the sorting method to it,
* for the sorted column call the method that will return unique values from the column.

In [14]:
# View unique genre titles
df.sort_values(by='genre')['genre'].unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Browse the list and look for implicit duplicates of the title `hiphop`. These may be misspelled titles or alternative titles of the same genre.

We will see the following implicit duplicates:
* *hip*,
* *hop*,
* *hip-hop*.

To clear them from the table, write the `replace_wrong_genres()` function with two parameters: 
* `wrong_genres` - list of duplicates,
* `correct_genre` - the string with the correct value.

The function should correct the `genre` column in the `df` table: replace each value from the `wrong_genres` list with a value from `correct_genre`.

In [15]:
# Function for replacing implicit duplicates
def replace_wrong_genres(wrong_genres,correct_genre):
    for genre in wrong_genres:
        df['genre'] = df['genre'].replace(genre,correct_genre)

Call `replace_wrong_genres()` with arguments that eliminate the implicit duplicates: instead of `hip`, `hop` and `hip-hop`, the table should contain the value `hiphop`:

In [16]:
# Eliminating implicit duplicates
replace_wrong_genres(['hip','hop','hip-hop'],'hiphop')

Check that we've replaced the wrong names:

* hip
* hop
* hip-hop

Print the sorted list of unique values of the `genre` column:

In [17]:
# Checking for implicit duplicates
df.sort_values(by='genre')['genre'].unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
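
The loop inside `replace_wrong_genres()` can also be written as a single call, because `Series.replace()` accepts a list of values to replace. This is only a sketch on an invented series, shown as an alternative rather than as the project's method.

```python
import pandas as pd

# Toy genre column (invented) with three spellings of the same genre
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'], name='genre')

# Replace every value from the list with the single correct value
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')

print(sorted(cleaned.unique()))
```

Only whole values are matched, so a genre such as `triphop` would not be affected.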

**Conclusions**.

Pre-processing found three problems in the data:

- irregularities in header style,
- missing values,
- duplicates, explicit and implicit.

We have corrected the headings to make the table easier to work with. Without duplicates, the study will be more accurate.

We have replaced the missing values with `'unknown'`. It remains to be seen whether the missing values in the `genre` column will harm the study.

Now we can move on to hypothesis testing. 

## Hypothesis testing

### Comparison of user behavior of two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Let's test this assumption using data on three days of the week: Monday, Wednesday and Friday. To do this:

* Let's split the users into two groups: Moscow and St. Petersburg.
* Let's compare how many tracks each user group listened to on Monday, Wednesday and Friday.


For practice, first perform each of the calculations separately. 

Estimate the user activity in each city. Group the data by city and count the listens in each group.



In [18]:
# Counting listens in each city
df.groupby('city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more listens in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; more likely, there are simply more users in Moscow.
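
Whether a city has more listens because it has more users can be checked by counting unique user IDs alongside the listens. The sketch below does this on an invented table; the column names match the cleaned project table, but the data and the `listens_per_user` column are assumptions for illustration.

```python
import pandas as pd

# Toy data (invented): two Moscow users, one St. Petersburg user
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'C3', 'C3'],
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
})

# For each city: number of listens and number of distinct users
stats = toy.groupby('city')['user_id'].agg(listens='count', users='nunique')

# Average number of listens per user
stats['listens_per_user'] = stats['listens'] / stats['users']
print(stats)
```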

Now let's group the data by day of the week and count the listens on Monday, Wednesday and Friday. Keep in mind that the data contains information about the listens for these days only.


In [19]:
# Counting listens on each of the three days
df.groupby('day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

On average, users from the two cities are less active on Wednesdays. But the picture may change if we look at each city separately.

We have seen how grouping by city and by day of the week works. Now let's write a function that combines these two calculations.

Let's create a function `number_tracks()` that will count the listens for a given day and city. It will need two parameters:
* day of the week,
* the name of the city.

In the function, save to a variable the rows of the source table for which:
  * the value in the `day` column equals the `day` parameter,
  * the value in the `city` column equals the `city` parameter.

To do this, apply sequential filtering with logical indexing.

Then count the values in the `user_id` column of the resulting table, save the result in a new variable, and return this variable from the function.

In [20]:
# Function that counts listens for a given day and city.
# Using sequential filtering with logical indexing, it first
# selects the rows with the required day from the source table,
# then keeps only the rows with the required city,
# and finally counts the values in the user_id column with count().
# The function returns this number.

def number_tracks(day,city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count


Call `number_tracks()` six times, changing the value of the parameters - so that we get data for each city on each of the three days.

In [21]:
# the number of listens in Moscow on Mondays
number_tracks('Monday','Moscow')

15740

In [22]:
# the number of listens in Saint-Petersburg on Mondays
number_tracks('Monday','Saint-Petersburg')

5614

In [23]:
# the number of listens in Moscow on Wednesdays
number_tracks('Wednesday','Moscow')

11056

In [24]:
# the number of listens in Saint-Petersburg on Wednesdays
number_tracks('Wednesday','Saint-Petersburg')

7003

In [25]:
# the number of listens in Moscow on Fridays
number_tracks('Friday','Moscow')

15945

In [26]:
# the number of listens in Saint-Petersburg on Fridays
number_tracks('Friday','Saint-Petersburg')

5895

Create a table using the `pd.DataFrame` constructor, where
* the column names are `['city', 'monday', 'wednesday', 'friday']`;
* the data is the results you got with `number_tracks`.

In [27]:
# Table of results
pd.DataFrame(data = [['Moscow',number_tracks('Monday','Moscow'),number_tracks('Wednesday','Moscow'),number_tracks('Friday','Moscow')],
                     ['Saint-Petersburg',number_tracks('Monday','Saint-Petersburg'),number_tracks('Wednesday','Saint-Petersburg'),number_tracks('Friday','Saint-Petersburg')]]
                      ,columns = ['city', 'monday', 'wednesday', 'friday'])

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
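
A table of this shape can also be built in one step with `pd.crosstab()`, which counts rows for every combination of two columns. The sketch below runs on an invented five-row table, so the numbers are not the project's.

```python
import pandas as pd

# Toy data (invented for illustration)
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are numbers of listens
table = pd.crosstab(toy['city'], toy['day'])

# crosstab sorts the columns alphabetically; restore the weekday order
table = table[['Monday', 'Wednesday', 'Friday']]
print(table)
```

Combinations that never occur get a count of 0 instead of being omitted.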


**Findings**.

The data shows the difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, listening peaks on Wednesday. Activity on Monday and Friday is lower than on Wednesday by a similar margin.

So the data are in favor of the first hypothesis.
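
Because the two cities differ greatly in the total number of listens, the pattern is easier to compare as shares within each city. The sketch below takes the counts from the results table above and divides each row by its total; the variable names are chosen here for illustration.

```python
import pandas as pd

# Listen counts copied from the results table above
counts = pd.DataFrame(
    {
        'monday': [15740, 5614],
        'wednesday': [11056, 7003],
        'friday': [15945, 5895],
    },
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total to get the share of each day per city
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))

# Day with the largest and the smallest share in each city
print(shares.idxmax(axis=1))
print(shares.idxmin(axis=1))
```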

### Music at the beginning and end of the week

According to the second hypothesis, on Monday morning in Moscow certain genres prevail, and in St. Petersburg other genres prevail. In the same way, different genres prevail on Friday evenings, depending on the city.

Save the tables with data into two variables:
* for Moscow - in `moscow_general`;
* for St. Petersburg - in `spb_general`.

In [28]:
# get table moscow_general from those rows of table df, 
# for which the value in the 'city' column is equal to 'Moscow'
moscow_general = df[df['city'] == 'Moscow']

In [29]:
# get table spb_general from those rows of table df, 
# for which the value in the 'city' column is equal to 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']

Create a `genre_weekday()` function with four parameters:
* a table (dataframe) with the data,
* day of the week,
* initial timestamp in the format 'hh:mm', 
* last timestamp in the format 'hh:mm'.

The function should return information about the top 10 genres of those tracks listened to on the specified day, between the two time stamps.

In [30]:
# Declare a genre_weekday() function with parameters table, day, time1, time2,
# which returns information about the most popular genres on a given day
# within the specified time interval:
def genre_weekday(table, day, time1, time2):

    genre_df = table[table['day'] == day]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]

    genre_df_count = genre_df.groupby('genre')['day'].count()

    genre_df_sorted = genre_df_count.sort_values(ascending = False)

    return genre_df_sorted.head(10)
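
The comparisons with `time1` and `time2` work on strings: zero-padded times such as `'08:34:34'` sort alphabetically in the same order as chronologically. The sketch below shows an equivalent variant with a single combined mask and `value_counts()`. The function name `genre_weekday_sketch()` and the toy data are assumptions for illustration.

```python
import pandas as pd


def genre_weekday_sketch(table, day, time1, time2):
    """Top 10 genres on `day` strictly between time1 and time2."""
    # Combine the three conditions into one boolean mask
    mask = (
        (table['day'] == day)
        & (table['time'] > time1)
        & (table['time'] < time2)
    )
    # value_counts() counts each genre and sorts in descending order
    return table.loc[mask, 'genre'].value_counts().head(10)


# Toy data (invented for illustration)
toy = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Friday', 'Monday'],
    'time': ['08:34:34', '08:37:09', '20:28:33', '08:00:00', '10:59:59'],
    'genre': ['dance', 'folk', 'rock', 'pop', 'dance'],
})

top = genre_weekday_sketch(toy, 'Monday', '07:00', '11:00')
print(top)
```

The string comparison relies on every time being stored in the same zero-padded format; a value like `'8:34:34'` would be ordered incorrectly.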

Compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [31]:
# function call for Monday morning in Moscow
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: day, dtype: int64

In [32]:
# function call for Monday morning in St. Petersburg
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: day, dtype: int64

In [33]:
# function call for Friday evening in Moscow
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: day, dtype: int64

In [34]:
# function call for Friday evening in St. Petersburg
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: day, dtype: int64

**Findings**.

If we compare the top 10 genres on Monday morning, we can draw these conclusions:

1. People in Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow ranking includes the "world" genre, while the St. Petersburg ranking includes jazz and classical.

2. In Moscow there were so many missing values that `'unknown'` took the tenth place among the most popular genres. This means that the missing values take up a significant proportion of the data and threaten the validity of the study.

Friday night doesn't change this picture. Some genres rise a little higher, others come down, but overall the top 10 remains the same.

Thus, the second hypothesis is only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not too pronounced. In Moscow, Russian popular music is listened to more often, in St. Petersburg - jazz.

However, omissions in the data cast doubt on this result. There are so many of them in Moscow that the top-10 ranking could look different if it were not for the missing data on genres.
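
The differences between the rankings can be listed mechanically with set operations. The sketch below copies the genre names from the four top-10 outputs above; the variable and function names are chosen here for illustration.

```python
# Top-10 genre names copied from the outputs above
moscow_monday = {'pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown'}
spb_monday = {'pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical'}
moscow_friday = {'pop', 'rock', 'dance', 'electronic', 'hiphop',
                 'world', 'ruspop', 'alternative', 'classical', 'rusrap'}
spb_friday = {'pop', 'electronic', 'rock', 'dance', 'hiphop',
              'alternative', 'jazz', 'classical', 'rusrap', 'world'}


def only_in_each(first, second):
    """Return genres found only in the first set and only in the second."""
    return sorted(first - second), sorted(second - first)


monday_diff = only_in_each(moscow_monday, spb_monday)
friday_diff = only_in_each(moscow_friday, spb_friday)

print('Monday morning:', monday_diff)
print('Friday evening:', friday_diff)
```

Eight of the ten genres coincide on Monday morning and nine of the ten on Friday evening.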

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap music, and music of this genre is listened to there more often than in Moscow.  And Moscow is a city of contrasts, in which, however, pop music prevails.

Group the `moscow_general` table by genre and count the track listens of each genre using the `count()` method. Then sort the result in descending order and store it in the `moscow_genres` table.

In [35]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)

Print the first ten lines of `moscow_genres`:

In [36]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now repeat the same for St. Petersburg.

Group the `spb_general` table by genre. Count the listens of tracks of each genre. Sort the result in descending order and save it in the `spb_genres` table:

In [37]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)

Print the first ten lines of `spb_genres`:

In [38]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**.

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a close genre - Russian popular music - is found in the top 10 genres.
* Contrary to expectations, rap is about equally popular in the two cities: `rusrap` accounts for roughly 2.7% of listens in Moscow (1161 of 42741) and 3.0% in St. Petersburg (564 of 18512).
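
Raw counts are hard to compare because Moscow has more than twice as many listens overall. The sketch below converts the `pop` and `rusrap` counts shown above into percentages of each city's total listens (42741 and 18512, from the count by city earlier in the study). The variable names are chosen here for illustration.

```python
# Total listens per city, from the groupby('city') output earlier
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}

# Listens per genre, copied from the two top-10 outputs above
genre_counts = {
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
}

# Percentage of each city's listens that fall on each genre
genre_shares = {
    genre: {city: 100 * n / totals[city] for city, n in by_city.items()}
    for genre, by_city in genre_counts.items()
}

for genre, by_city in genre_shares.items():
    for city, pct in by_city.items():
        print(f'{genre:7s} {city:17s} {pct:.1f}%')
```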

## Study results

We have tested three hypotheses and established:

1. The day of the week has a different effect on user activity in Moscow and St. Petersburg. 

The first hypothesis is completely confirmed.

2. Music preferences do not change much during the week, whether in Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow they listen to music of the "world" genre,
* in St. Petersburg - jazz and classical.

Thus, the second hypothesis was only partially confirmed. This result could have been different if there had not been omissions in the data.

3. There are more similarities than differences in the tastes of Moscow and St. Petersburg users. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are unnoticeable for the bulk of users.
