## Yandex.Music

The comparison between Moscow and St. Petersburg is surrounded by myths. For example:
 * Moscow is a metropolis, subject to the rigid rhythm of the working week;
 * St. Petersburg is a cultural capital, with its own tastes.

Using Yandex Music data, you will compare the behavior of users in the two capitals.

**The purpose of the task** is to test three hypotheses:
1. User activity depends on the day of the week. Moreover, in Moscow and St. Petersburg this manifests itself in different ways.
2. On Monday morning in Moscow, some genres prevail, and in St. Petersburg, others. Likewise, on Friday evenings, different genres predominate, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow people listen to pop music more often, in St. Petersburg they listen to Russian rap.

**Progress of research**

You will receive data on user behavior from the file `yandex_music_project.csv`. Nothing is known about the quality of the data. Therefore, a review of the data will be needed before testing hypotheses.

You will check the data for errors and assess how they affect the study. Then, during the preprocessing stage, you will fix the most critical ones.

Thus, the research will take place in three stages:
 1. Review of data.
 2. Data preprocessing.
 3. Testing hypotheses.

## Data overview

Get your first impression of Yandex Music data.

**Task 1**

The main analyst tool is `pandas`. Import this library.

In [1]:
import pandas as pd

**Task 2**

Read the file `yandex_music_project.csv` from the `/datasets` folder and save it in the `df` variable:

In [2]:
df = pd.read_csv('yandex_music_project.csv')

**Task 3**


Display the first ten rows of the table:

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


**Task 4**


With one command, get general information about the table using the `info()` method:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, the table has seven columns. The data type in all columns is `object`.

According to the data documentation:
* `userID` — user identifier;
* `Track` — track name;
* `artist` — name of the artist;
* `genre` — genre name;
* `City` — user’s city;
* `time` — start time of listening;
* `Day` - day of the week.

The number of non-null values differs between columns, which means the data contains missing values.
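
As a small worked check, the missing-value count for each column is the total row count minus its non-null count. The sketch below only re-uses the numbers printed by `df.info()`; the dictionary names are illustrative and not part of the assignment.

```python
# Row count and non-null counts copied from the df.info() output above.
total_rows = 65079
non_null = {'Track': 63848, 'artist': 57876, 'genre': 63881}

# Missing values per column = total rows minus non-null entries.
missing = {column: total_rows - count for column, count in non_null.items()}
print(missing)
```

The other four columns have 65079 non-null entries each, so they contain no gaps.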

## Data preprocessing
Fix the style of the column headings and fill in the missing values. Then check the data for duplicates.

**Task 5**

Display the column names:

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

**Task 6**


Bring the column names into good style:
* write multi-word names in snake_case,
* make all characters lowercase,
* remove the spaces.

To do this, rename the columns like this:
* `'userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'City'` → `'city'`;
* `'Day'` → `'day'`.

In [6]:
df = df.rename(columns = {'  userID' : 'user_id', 'Track' : 'track', '  City  ' : 'city', 'Day'  : 'day'})
df

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
...,...,...,...,...,...,...,...
65074,729CBB09,My Name,McLean,rnb,Moscow,13:32:28,Wednesday
65075,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hip,Saint-Petersburg,10:00:00,Monday
65076,C5E3A0D5,Jalopiina,,industrial,Moscow,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Moscow,21:43:59,Friday


**Task 7**


Check the result. To do this, display the column names again:

In [7]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
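
The three style rules can also be applied programmatically instead of listing each header by hand. The following is only a sketch: `to_snake_case` is a hypothetical helper, and the empty `toy` table merely reproduces the original headers.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding spaces, split camelCase with '_', and lowercase."""
    name = name.strip()
    # Put an underscore between a lowercase letter or digit and an uppercase letter:
    # 'userID' -> 'user_ID'
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# Hypothetical empty table with the same headers as the raw data.
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

An explicit dictionary, as used in the task, is easier to read for seven columns; a helper like this is mainly useful for wide tables.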

### Missing values

**Task 8**

First, count how many missing values there are in the table. Two `pandas` methods are enough for this:

In [8]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

**Task 9**

Replace the missing values in the `track`, `artist` and `genre` columns with the string `'unknown'`. To do this, create a list `columns_to_replace`, iterate through its elements using a `for` loop and replace the missing values in each column:

In [9]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')  # replace missing values in this column with 'unknown'

**Task 10**

Make sure there are no gaps left in the table. To do this, count the missing values again.

In [10]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
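
The same replacement can be written without an explicit loop by passing `fillna()` a dictionary that maps column names to fill values. This is a sketch on a small made-up table (`toy` and its rows are hypothetical), not the solution the task asks for.

```python
import pandas as pd

# Hypothetical miniature of the table with gaps in three columns.
toy = pd.DataFrame({
    'track': ['Song A', None],
    'artist': [None, 'Artist B'],
    'genre': ['rock', None],
    'city': ['Moscow', 'Saint-Petersburg'],
})

columns_to_replace = ['track', 'artist', 'genre']
# fillna() accepts a dict: each listed column gets its own fill value.
toy = toy.fillna({column: 'unknown' for column in columns_to_replace})
print(toy.isna().sum())
```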

### Duplicates

**Task 11**

Count obvious duplicates in a table with one command:

In [11]:
df.duplicated().sum()

3826

**Task 12**

Call a special `pandas` method to remove obvious duplicates:

In [12]:
df = df.drop_duplicates()

**Task 13**

Count the obvious duplicates in the table again to make sure you have removed them all:

In [13]:
df.duplicated().sum()

0
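
Note that `drop_duplicates()` keeps the original row labels, so the index has gaps after the removal. A minimal sketch on made-up rows (the `toy` table is hypothetical) shows the count, the removal, and an optional `reset_index(drop=True)` that renumbers the rows:

```python
import pandas as pd

# Hypothetical rows: the third row repeats the first one exactly.
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'A1', 'C3'],
    'track': ['x', 'y', 'x', 'z'],
})

n_dup = toy.duplicated().sum()  # counts rows that repeat an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # drop them and renumber
print(n_dup, list(deduped.index))
```

Resetting the index is optional here, since none of the later steps in this notebook rely on row labels.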

**Task 14**

Display a list of unique genre names, sorted in alphabetical order. To do this:
* extract the desired dataframe column,
* apply a sorting method to it,
* on the sorted column, call a method that returns its unique values.

In [14]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Task 15**

Go through the list and look for implicit duplicates of the name `hiphop`. These may be misspelled or alternative names of the same genre.

You will see the following implicit duplicates:
* *hip*,
* *hop*,
* *hip-hop*.

To clear the table of them, use the `replace()` method with two arguments: a list of duplicate strings (including *hip*, *hop* and *hip-hop*) and a string with the correct value. You need to fix the `genre` column in the `df` table: replace each value from the list of duplicates with the correct one. Instead of `hip`, `hop` and `hip-hop` the table should have the value `hiphop`:


In [15]:
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')  # fix the genre column only
df

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
...,...,...,...,...,...,...,...
65074,729CBB09,My Name,McLean,rnb,Moscow,13:32:28,Wednesday
65075,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hiphop,Saint-Petersburg,10:00:00,Monday
65076,C5E3A0D5,Jalopiina,unknown,industrial,Moscow,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Moscow,21:43:59,Friday
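
The task asks to fix the `genre` column specifically. The sketch below, on a made-up two-row table (`toy` and its values are hypothetical), illustrates the difference between calling `replace()` on the whole table and on a single column: the table-wide call also rewrites any matching value in other columns, such as a track that happens to be titled `hop`.

```python
import pandas as pd

# Hypothetical rows: a track titled 'hop' and a genre spelled 'hip'.
toy = pd.DataFrame({
    'track': ['hop', 'Freight Train'],
    'genre': ['hip', 'rock'],
})
duplicates = ['hip', 'hop', 'hip-hop']

# Table-wide replace touches every column.
whole_table = toy.replace(duplicates, 'hiphop')

# Column-level replace leaves the track title alone.
genre_only = toy.copy()
genre_only['genre'] = genre_only['genre'].replace(duplicates, 'hiphop')

print(whole_table['track'].tolist())
print(genre_only['track'].tolist())
```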


**Task 16**

Check that the incorrect names have been replaced:

* hip
* hop
* hip-hop

Print a sorted list of unique values for the `genre` column:

In [16]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

Preprocessing revealed three problems in the data:

- inconsistent header style,
- missing values,
- duplicates, both obvious and implicit.

You fixed the headers to make the table easier to work with. Without duplicates, the study will be more accurate.

You replaced the missing values with `'unknown'`. It remains to be seen whether the gaps in the `genre` column will harm the study.

Now you can move on to testing the hypotheses.

## Testing hypotheses

### Comparison of user behavior in two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Test this assumption using data on three days of the week: Monday, Wednesday and Friday. To do this:

* Separate the users in Moscow and St. Petersburg.
* Compare how many tracks each user group listened to on Monday, Wednesday and Friday.

**Task 17**

To practice, first perform each of the calculations separately.

Evaluate user activity in each city. Group the data by city and count the plays in each group.

In [17]:
df.groupby('city')['city'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: city, dtype: int64

There are more plays in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often; Moscow simply has more users.
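
One way to separate "more plays" from "more users" is to count unique users next to plays and divide one by the other. This is only a sketch on made-up rows (the `toy` table and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical plays: three Moscow users with four plays, one St. Petersburg user with two.
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'D', 'D'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
})

# Count plays and distinct users per city, then compute plays per user.
per_city = toy.groupby('city')['user_id'].agg(plays='count', users='nunique')
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```

In this toy example Moscow has more plays in total, yet fewer plays per user.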

**Task 18**

Now group the data by day of the week and count the plays on Monday, Wednesday and Friday. Note that the data contains plays for these three days only.

In [18]:
df.groupby('day')['day'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: day, dtype: int64

On average, users from the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

**Task 19**


You've seen how grouping works by city and by day of the week. Now write a function that combines these two calculations.

Create a `number_tracks()` function that counts plays for a given day and city. It needs two parameters:
* day of the week,
* city name.

In the function, save into a variable the rows of the source table in which:
 * the value in the `day` column equals the `day` parameter,
 * the value in the `city` column equals the `city` parameter.

To do this, use sequential filtering with Boolean indexing (or a single compound Boolean expression if you are already familiar with them).

Then count the values in the `user_id` column of the resulting table. Save the result in a new variable and return it from the function.

In [19]:
def number_tracks(day,city):
    track_list = df[(df['day']==day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

**Task 20**

Call `number_tracks()` six times, changing the parameter values so that you get data for each city on each of the three days.

In [20]:
number_tracks('Monday', 'Moscow')

15740

In [21]:
number_tracks('Monday', 'Saint-Petersburg')

5614

In [22]:
number_tracks('Wednesday', 'Moscow')

11056

In [23]:
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [24]:
number_tracks('Friday', 'Moscow')

15945

In [25]:
number_tracks('Friday', 'Saint-Petersburg')

5895

**Task 21**

Create a table using the `pd.DataFrame` constructor, where
* column names - `['city', 'monday', 'wednesday', 'friday']`;
* data - the results you got with `number_tracks`.

In [26]:
data_dict = {'city': ['Moscow', 'Saint-Petersburg'],
             'monday': [15740, 5614],
             'wednesday': [11056, 7003],
             'friday': [15945, 5895]}
pd.DataFrame(data = data_dict, columns = ['city', 'monday', 'wednesday', 'friday'])

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
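
The same city-by-day table can be built in one step with `pd.crosstab`, without calling a counting function six times. The sketch uses a handful of made-up rows (`toy` is hypothetical); on the real data the call would take `df['city']` and `df['day']`.

```python
import pandas as pd

# Hypothetical plays, one row per play.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Friday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are play counts.
counts = pd.crosstab(toy['city'], toy['day'])
# crosstab sorts the columns alphabetically; restore calendar order.
counts = counts[['Monday', 'Wednesday', 'Friday']]
print(counts)
```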


**Conclusions**

The data shows the difference in user behavior:

- In Moscow, the peak of listening occurs on Monday and Friday, and on Wednesday there is a noticeable decline.
- In St. Petersburg, on the contrary, people listen to music more on Wednesdays. Activity on Monday and Friday is lower than on Wednesday by roughly the same amount.

This means that the data speaks in favor of the first hypothesis.
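
Because the two cities differ in size, the pattern is easier to compare as each day's share of a city's plays. The sketch below re-uses the six counts obtained above; the variable names are illustrative.

```python
import pandas as pd

# Play counts per city and day, copied from the table above.
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total to get the share of plays per day.
shares = counts.div(counts.sum(axis=1), axis=0).round(3)
print(shares)
```

Wednesday accounts for roughly 26% of the Moscow plays but about 38% of the St. Petersburg plays, which is consistent with the conclusions above.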

### Music at the beginning and end of the week

According to the second hypothesis, on Monday morning in Moscow some genres prevail, and in St. Petersburg others. Likewise, on Friday evenings, different genres predominate, depending on the city.

**Task 22**

Save the data for each city in a separate variable:
* for Moscow, in `moscow_general`;
* for St. Petersburg, in `spb_general`.

In [27]:
moscow_general = df[df['city']== 'Moscow']

In [28]:
spb_general = df[df['city']== 'Saint-Petersburg']

**Task 23**

Create a function `genre_weekday()` with four parameters:
* table (dataframe) with data,
* day of the week,
* initial timestamp in 'hh:mm' format,
* last timestamp in 'hh:mm' format.

The function should return information about the top 10 genres of those tracks that were listened to on a specified day, in the interval between two timestamps.

In [29]:
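def genre_weekday(df, day, time1, time2):
    # keep only the rows for the requested day
    genre_df = df[df['day'] == day]
    # keep plays strictly between time1 and time2 (zero-padded time strings compare chronologically)
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]
    # count plays per genre and sort from most to least popular
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    # return the ten most popular genres
    return genre_df_sorted[:10]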
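
The function compares the `time` column, which holds `'hh:mm:ss'` strings, with `'hh:mm'` bounds as plain text. This works because zero-padded time strings sort in the same order as the times they represent. A small sketch with made-up timestamps (the `times` series is hypothetical) shows which values pass the filter:

```python
import pandas as pd

# Hypothetical timestamps around the 07:00-11:00 window.
times = pd.Series(['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00'])

# String comparison is lexicographic; a longer string that starts with
# the bound (e.g. '07:00:00' vs '07:00') counts as greater than the bound.
mask = (times > '07:00') & (times < '11:00')
print(times[mask].tolist())
```

So a play at exactly `07:00:00` is kept, while one at `11:00:00` is dropped. The approach relies on the hours being zero-padded; a value such as `'7:05:00'` would not compare correctly.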

**Task 24**


Compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [30]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [31]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [32]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [33]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. In Moscow and St. Petersburg they listen to similar music. The only difference is that the Moscow ranking included the “world” genre, while the St. Petersburg ranking included jazz and classical.

2. In Moscow there were so many missing values that the value `'unknown'` took tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the results.

Friday evening does not change this picture. Some genres go a little higher, others go down, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow people listen to Russian popular music more often, in St. Petersburg they listen to jazz.

However, gaps in the data cast doubt on this result. There are so many of them in Moscow that the top 10 ranking might look different if not for the lost data on genres.

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to more often there than in Moscow. Moscow is a city of contrasts in which pop music nevertheless prevails.

**Task 25**

Group the `moscow_general` table by genre and count the plays of tracks of each genre using the `count()` method. Then sort the result in descending order and store it in the `moscow_genres` table.

In [34]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

**Task 26**

Print the first ten lines of `moscow_genres`:

In [35]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

**Task 27**


Now repeat the same for St. Petersburg.

Group the `spb_general` table by genre. Count the plays of tracks of each genre. Sort the result in descending order and save it in the `spb_genres` table:

In [36]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

**Task 28**

Print the first ten lines of `spb_genres`:

In [37]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis predicted. Moreover, in the top 10 genres there is a similar genre - Russian popular music.
* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg.
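
Since Moscow has more plays overall, raw counts are not directly comparable between the cities. The sketch below converts the `pop` and `rusrap` counts from the two top-10 lists into percentages of each city's total plays (the totals are the per-city counts obtained earlier); the dictionary names are illustrative.

```python
# Total plays per city and genre counts, copied from the outputs above.
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}
plays = {
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
}

# Share of each genre in its city's plays, in percent.
shares = {
    genre: {city: round(100 * count / totals[city], 1) for city, count in by_city.items()}
    for genre, by_city in plays.items()
}
print(shares)
```

Pop takes about 13.8% of the plays in Moscow and 13.1% in St. Petersburg; Russian rap takes about 2.7% and 3.0% respectively, so the shares are close in both cities.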

## Results of the task

1. The day of the week affects user activity differently in Moscow and St. Petersburg.

The first hypothesis was completely confirmed.

2. Musical preferences do not change much during the week - be it Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow they listen to music of the “world” genre,
* in St. Petersburg, to jazz and classical music.

Thus, the second hypothesis was only partially confirmed. This result might have been different if not for gaps in the data.

3. The tastes of users in Moscow and St. Petersburg have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If differences in preferences exist, they are not noticeable for the majority of users.