# Yandex.Music

**Study objective** - test three hypotheses:

1. User activity depends on the day of the week, and the pattern differs between Moscow and St. Petersburg.
2. On Monday mornings, some genres dominate in Moscow and others in St. Petersburg. Friday evenings are likewise dominated by different genres, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. Pop music is played more often in Moscow, Russian rap in St. Petersburg.

**Study progress**

The study will take place in three phases:
1. Data review. The data are available in the file `yandex_music_project.csv`. Nothing is known about the data quality in advance, so the data must be reviewed before the hypotheses are tested.
 
2. Pre-processing of the data (finding missing values, duplicates, etc.).
 
3. Hypothesis testing.

## Data review

In [1]:
import pandas as pd 

# Actual path
data_path = 'C:/Users/Churiulin/Desktop/Yandex/Projects'

# Get data
df = pd.read_csv(f'{data_path}/music_project.csv')

# Print top 10 lines
display(df.head(10))
print('')

# Get common information about df
df.info() 

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, there are seven columns in df. The data type in all columns is `object`.

According to the data documentation:
* `userID` - user ID;
* `Track` - track name;  
* `artist` - artist name;
* `genre` - genre name;
* `City` - user's city;
* `time` - start time of listening;
* `Day` - day of the week.

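The column names show three style violations:
1. Lowercase letters are mixed with uppercase ones.
2. Some names contain leading or trailing spaces.
3. `userID` is written as a single word instead of snake_case (`user_id`).

In addition, the number of non-null values differs between columns, so the data contain missing values.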

**Conclusions:**

Each row of the table describes one played track. Some of the columns describe the track itself: its title, artist and genre. The rest describe the user: the city the user is from and the time the user listened to the music. Preliminarily, there is enough data to test the hypotheses. However, the data contain missing values, and the column names deviate from good style.

## Data pre-processing

### Headline style

In [2]:
df.columns   # get columns name

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [3]:
df = df.rename(columns = {'  userID' : 'user_id',
                          'Track'    : 'track'  ,
                          '  City  ' : 'city'   ,
                          'Day'      : 'day'})
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
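
The manual mapping above works well for seven columns. For wider tables, the same clean-up could be sketched as a small helper. The function name `to_snake_case` is hypothetical and not part of the original notebook; it assumes that every violation is either surrounding whitespace or camelCase.

```python
import re

def to_snake_case(name):
    """Hypothetical helper: strip spaces and convert camelCase to snake_case."""
    name = name.strip()
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# Column names exactly as printed by df.columns before renaming
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(col) for col in raw_columns]
print(clean_columns)
```

Since `DataFrame.rename` accepts a callable for `columns`, the helper could also be applied as `df.rename(columns=to_snake_case)`.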

### Missing values

In [4]:
print(df.isnull().sum()) # count missing values

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64


Replace the missing values in the `track`, `artist` and `genre` columns with the string `'unknown'`.

In [5]:
columns_to_replace = ['track', 'artist', 'genre']

for col in columns_to_replace:
    df[col] = df[col].fillna('unknown')

# Check missing values, again
print(df.isnull().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
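
The loop above can also be written as a single `fillna` call that takes a column-to-value mapping. The sketch below runs on a small invented frame (`toy`), not on the real data, and only illustrates the technique.

```python
import pandas as pd

# Toy data with gaps in the same three columns (values are illustrative only)
toy = pd.DataFrame({
    'track':  ['True', None, 'Pessimist'],
    'artist': ['Roman Messer', None, None],
    'genre':  ['dance', 'ruspop', None],
    'city':   ['Moscow', 'Moscow', 'Saint-Petersburg'],
})

columns_to_replace = ['track', 'artist', 'genre']

# One call: fill each listed column with the placeholder string
toy = toy.fillna({col: 'unknown' for col in columns_to_replace})
print(toy.isna().sum())
```

A placeholder keeps the rows available for counting by city and day, but it also means that `'unknown'` will appear as a regular value in any later grouping by genre.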


### Duplicates

In [6]:
print(df.duplicated().sum())# count of obvious duplicates

3826


In [7]:
# removal of obvious duplicates (with old indexes deleted and new indexes formed)
df = df.drop_duplicates().reset_index(drop = True)

# Check duplicates, again
print(df.duplicated().sum())

0
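
A quick sanity check for this step is that the row count drops by exactly the number of duplicates found (here 65079 - 3826 = 61253, which matches the later `df.info()` output). The sketch below shows the same bookkeeping on an invented frame `toy`.

```python
import pandas as pd

# Toy data: the second row repeats the first one exactly
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track':   ['True', 'True', 'True', 'Pessimist'],
    'time':    ['13:00:07', '13:00:07', '13:00:07', '21:20:49'],
})

n_before = len(toy)
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_before, n_duplicates, len(deduped))
```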


Now get rid of the implicit duplicates in the `genre` column. For example, the name of the same genre may be written slightly differently. Such mistakes will also affect the result of the study.

In [8]:
# Get unique genres
genres_list = df['genre'].unique() 
print(sorted(genres_list))

['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans', 'alternative', 'alternativepunk', 'ambient', 'americana', 'animated', 'anime', 'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook', 'author', 'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill', 'chinese', 'choral', 'christian', 'christmas', 'classical', 'classicmetal', 'club', 'colombian', 'comedy', 'conjazz', 'contemporary', 'country', 'cuban', 'dance', 'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic', 'electropop', 'emo', 'entehno', 'epicmetal', 'estrada', 'ethnic', 'eurofolk', 'european', 'experimental', 'extrememetal', 'fado', 'fairyt

There are implicit duplicates of `hiphop` genre:
* *hip*,
* *hop*,
* *hip-hop*.
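
Scanning a long sorted list by eye is error-prone. One possible way to surface candidate spellings is a substring search over the unique genre names. The sketch below assumes the search pattern `'hip|hop'` and uses a short invented list instead of the real column.

```python
import pandas as pd

# Toy list of genre names (illustrative only)
genres = pd.Series(['rock', 'hip', 'pop', 'hop', 'hip-hop', 'hiphop', 'jazz'])

# Keep every name that contains "hip" or "hop"
candidates = sorted(genres[genres.str.contains('hip|hop')].unique())
print(candidates)
```

The search only proposes candidates. Deciding which of them really are the same genre remains a manual step.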

In [9]:
# 1. Function for replacing implicit duplicates
#    Input parameters: `wrong_genres`  — duplicate list,
#                      `correct_genre` — string with the correct value.
#
# task: The function should correct the `genre` column in table `df`: 
# replace each value from the `wrong_genres` list with a value from `correct_genre`.

def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong in wrong_genres:
        df['genre'] = df['genre'].replace(wrong, correct_genre)

Start corrections:

In [10]:
duplicates = ['hip', 'hop', 'hip-hop'] # list of incorrect genre names
correct = 'hiphop'                     # correct genre name

# Eliminating implicit duplicates
replace_wrong_genres(duplicates, correct)

# Check implicit duplicates, again
genres_list = df['genre'].unique() 
print(sorted(genres_list))  

['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans', 'alternative', 'alternativepunk', 'ambient', 'americana', 'animated', 'anime', 'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook', 'author', 'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill', 'chinese', 'choral', 'christian', 'christmas', 'classical', 'classicmetal', 'club', 'colombian', 'comedy', 'conjazz', 'contemporary', 'country', 'cuban', 'dance', 'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic', 'electropop', 'emo', 'entehno', 'epicmetal', 'estrada', 'ethnic', 'eurofolk', 'european', 'experimental', 'extrememetal', 'fado', 'fairyt
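
The function above modifies the global `df`. A variant that takes the table as an argument and returns a corrected copy may be easier to test. The name `replace_wrong_genres_in` is hypothetical, and the data are invented.

```python
import pandas as pd

def replace_wrong_genres_in(frame, wrong_genres, correct_genre):
    """Hypothetical variant: return a copy with the genre spellings unified."""
    result = frame.copy()
    # Series.replace accepts a list of exact values to replace with one value
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
fixed = replace_wrong_genres_in(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(fixed['genre'].tolist())
```

Because `replace` matches whole values here, `'hiphop'` and `'pop'` are left untouched even though they contain `'hip'` or `'op'`.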

In [11]:
# Check data, again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61253 entries, 0 to 61252
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  61253 non-null  object
 1   track    61253 non-null  object
 2   artist   61253 non-null  object
 3   genre    61253 non-null  object
 4   city     61253 non-null  object
 5   time     61253 non-null  object
 6   day      61253 non-null  object
dtypes: object(7)
memory usage: 3.3+ MB


**Conclusions:**

Pre-processing detected three problems in the data:

- header style irregularities,
- missing values,
- duplicates - explicit and implicit.

Corrected the headers to make the table easier to work with, removed duplicates, and replaced missing values with `'unknown'`. 

## Hypothesis testing

### Comparing user behaviour in the two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Check this assumption with data from three days of the week - Monday, Wednesday and Friday. 

In [12]:
city = df.groupby('city')['genre'].count() # Count plays in each city
print(city)

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64


There are more listens in Moscow than in St. Petersburg. It doesn't follow that Moscow users listen to music more often. It's just that there are more users in Moscow.
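
The notebook does not verify the statement about the number of users. A possible check is to count unique users per city and divide plays by users. The sketch below uses invented data and hypothetical column names `plays`, `users` and `plays_per_user`.

```python
import pandas as pd

# Toy data: three plays in each city, but a different number of users
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'C3', 'C3'],
    'city':    ['Moscow', 'Moscow', 'Moscow',
                'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
})

per_city = toy.groupby('city').agg(
    plays=('user_id', 'size'),      # number of rows, i.e. plays
    users=('user_id', 'nunique'),   # number of distinct users
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```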

Now group the data by day of the week and count the plays on Monday, Wednesday and Friday. Note that the data cover only these three days.


In [13]:
day = df.groupby('day')['genre'].count() # Count plays on each of the three days
print(day)

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64


On average, users from the two cities are less active on Wednesdays. But the picture can change if we look at each city separately. We have seen how grouping by city and by day of the week works. Now we can write a function that combines these two calculations.

In [14]:
# 2. Function `number_tracks()` counts the plays for a given day and city.
#    Input  parameters: df, weekday and city name
#    Output parameters: track_list_count

def number_tracks(df, day, city): 
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [15]:
# get data for Moscow on Monday
msk_mon  = number_tracks(df, 'Monday','Moscow')
print(f'number of plays in Moscow on Mondays: {msk_mon}', '\n')

# get data for Moscow on Wednesday
msk_wen  = number_tracks(df, 'Wednesday','Moscow')
print(f'number of plays in Moscow on Wednesdays: {msk_wen}', '\n')

# get data for Moscow on Friday
msk_fri  = number_tracks(df, 'Friday','Moscow')
print(f'number of plays in Moscow on Fridays: {msk_fri}', '\n')

# get data for Saint-Petersburg on Monday
spb_mon = number_tracks(df, 'Monday','Saint-Petersburg') 
print(f'number of plays in St Petersburg on Mondays: {spb_mon}', '\n')

# get data for Saint-Petersburg on Wednesday
spb_wen = number_tracks(df, 'Wednesday','Saint-Petersburg')
print(f'number of plays in St Petersburg on Wednesdays: {spb_wen}', '\n')

# get data for Saint-Petersburg on Friday
spb_fri = number_tracks(df, 'Friday','Saint-Petersburg')
print(f'number of plays in St Petersburg on Fridays: {spb_fri}', '\n')

number of plays in Moscow on Mondays: 15740 

number of plays in Moscow on Wednesdays: 11056 

number of plays in Moscow on Fridays: 15945 

number of plays in St Petersburg on Mondays: 5614 

number of plays in St Petersburg on Wednesdays: 7003 

number of plays in St Petersburg on Fridays: 5895 



Create table based on `pd.DataFrame`, where
* column names — `['city', 'monday', 'wednesday', 'friday']`;
* data - the results from `number_tracks`.

In [16]:
columns = ['city', 'monday', 'wednesday', 'friday'] 

data    = [['Moscow'          , msk_mon, msk_wen, msk_fri],
           ['Saint-Petersburg', spb_mon, spb_wen, spb_fri]]

table = pd.DataFrame(data=data, columns=columns)  # table with the collected counts
display(table)  # show the results table

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
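
The six calls to `number_tracks` and the manual table assembly could also be replaced by a single cross-tabulation. This is an alternative sketch on invented data, not the approach used in the notebook. `day_order` is a hypothetical helper list that puts the columns in calendar order.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day':  ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

day_order = ['Monday', 'Wednesday', 'Friday']

# Rows are cities, columns are days, cells are play counts
plays_by_city_day = (
    pd.crosstab(toy['city'], toy['day'])
      .reindex(columns=day_order, fill_value=0)
)
print(plays_by_city_day)
```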


**Conclusions**

The data show a difference in user behaviour:

- In Moscow, listening peaks on Mondays and Fridays, with a noticeable decline on Wednesdays.
- In St. Petersburg, on the contrary, listening peaks on Wednesdays. Activity on Mondays and Fridays is lower than on Wednesdays by a similar margin.

So the data are in favour of the first hypothesis.
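
Because the two cities differ in size, the comparison is easier to read when each row is converted to shares of the city's total. The sketch below re-enters the counts from the table above and divides each row by its sum. Using the city as the index is a choice made for this example only.

```python
import pandas as pd

# Counts copied from the results table above
table = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Share of each day in the city's total number of plays
shares = table.div(table.sum(axis=1), axis=0)
print(shares.round(3))
```

In these shares Wednesday accounts for about 26% of the Moscow plays and about 38% of the St. Petersburg plays, which is the same contrast the absolute counts show.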

### Music at the beginning and at the end of the week

According to the second hypothesis, on a Monday morning some genres dominate in Moscow and others in St. Petersburg. Likewise, Friday evenings are dominated by different genres - depending on the city.

In [17]:
moscow_general = df[df['city'] == 'Moscow'] 
moscow_general                              

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,unknown,ruspop,Moscow,09:17:40,Friday
...,...,...,...,...,...,...,...
61247,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Moscow,21:07:12,Monday
61248,729CBB09,My Name,McLean,rnb,Moscow,13:32:28,Wednesday
61250,C5E3A0D5,Jalopiina,unknown,industrial,Moscow,20:09:26,Friday
61251,321D0506,Freight Train,Chas McDevitt,rock,Moscow,21:43:59,Friday


In [18]:
spb_general = df[df['city'] == 'Saint-Petersburg']
spb_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Saint-Petersburg,21:20:49,Wednesday
...,...,...,...,...,...,...,...
61239,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Saint-Petersburg,21:14:40,Monday
61240,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Saint-Petersburg,21:06:50,Monday
61241,29E04611,Bre Petrunko,Perunika Trio,world,Saint-Petersburg,13:56:00,Monday
61242,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Saint-Petersburg,09:22:13,Monday


Let's compare the results of `genre_weekday()` for **Moscow** and **St. Petersburg** on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [19]:
# 3. Function `genre_weekday` returns the top 10 genres
#    of the tracks played on a given day between two timestamps.
#
#    Input parameters:  table - table with data
#                       day   - weekday
#                       time1 - initial timestamp in 'hh:mm:ss' format (exclusive),
#                       time2 - last timestamp in 'hh:mm:ss' format (inclusive).
#    Output parameters: genre_df_sorted - top 10 genres
           
def genre_weekday(table, day, time1, time2):
    genre_df = table[(table['day'] == day) & (table['time'] > time1) & (table['time'] <= time2)]
    genre_df_count  = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending = False).head(10)
    return genre_df_sorted    
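
`genre_weekday` compares the `time` column as strings. This works because zero-padded `hh:mm:ss` strings sort in chronological order. The sketch below checks that assumption on a few invented timestamps by repeating the filter on parsed durations.

```python
import pandas as pd

# Toy timestamps around the interval borders (illustrative only)
times = pd.Series(['06:59:59', '07:00:00', '07:00:01', '11:00:00', '11:00:01'])

# String comparison, as in genre_weekday(): the interval is (time1, time2]
as_strings = (times > '07:00:00') & (times <= '11:00:00')

# The same filter on parsed durations
parsed = pd.to_timedelta(times)
as_durations = (parsed > pd.Timedelta('07:00:00')) & (parsed <= pd.Timedelta('11:00:00'))

print(as_strings.tolist())
print(as_strings.equals(as_durations))
```

Note that a play at exactly `07:00:00` falls outside the interval, while one at exactly `11:00:00` falls inside it.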

In [20]:
print('Moscow Monday')
print(genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00'), '\n')

print('St. Petersburg Monday')
print(genre_weekday(spb_general   , 'Monday', '07:00:00', '11:00:00'), '\n')

print('Moscow Friday')
print(genre_weekday(moscow_general, 'Friday', '17:00:00', '23:00:00'), '\n')

print('St. Petersburg Friday')
print(genre_weekday(spb_general   , 'Friday', '17:00:00', '23:00:00'), '\n')

Moscow Monday
genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64 

St. Petersburg Monday
genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64 

Moscow Friday
genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64 

St. Petersburg Friday
genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64 



**Conclusions:**

If we compare the top 10 genres on a Monday morning, we can draw these conclusions:

1. In Moscow and St. Petersburg people listen to similar music. The only difference is that the Moscow top 10 includes the `world` genre, while the St. Petersburg top 10 includes `jazz` and `classical`.

2. In Moscow, there were so many missing values that `'unknown'` ranked tenth among the most popular genres. So, the missing values occupy a significant share of the data and threaten the credibility of the study.

3. Friday night doesn't change this picture. Some genres go up a little higher, others go down, but overall the top 10 remains the same. 

Thus, the second hypothesis is only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow they listen to Russian popular music more often, in St. Petersburg - to jazz.

However, the missing data call this result into question. There are so many of them in Moscow that the top-10 ranking could look different if not for the missing data on genres.
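
To judge how much the placeholder affects each city, the share of `'unknown'` genres can be computed per city. The notebook does not do this; the sketch below shows one possible way on invented data.

```python
import pandas as pd

# Toy data (illustrative only)
toy = pd.DataFrame({
    'city':  ['Moscow', 'Moscow', 'Moscow', 'Moscow',
              'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'unknown', 'pop', 'jazz'],
})

# The mean of a boolean flag per group is the share of flagged rows
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```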

### Genre choices in Moscow and St Petersburg

*Hypothesis:* St. Petersburg is the rap capital, music of this genre is listened to there more often than in Moscow. And Moscow is a city of contrasts, which, however, is dominated by pop music.

Group the `moscow_general` and `spb_general` tables by genre and count the tracks of each genre with `count`. Then sort the result in descending order with `sort_values` and store it in `moscow_genres` and `spb_genres`.

In [21]:
# Moscow
moscow_genres = (moscow_general
                              .groupby('genre')['genre']
                              .count()
                              .sort_values(ascending = False)
                )
# St. Petersburg
spb_genres    = (spb_general
                            .groupby('genre')['genre']
                            .count()
                            .sort_values(ascending = False)
                )

# get the first 10 genres
print('Moscow genres')
display(moscow_genres.head(10)) 

print('\n', 'St. Petersburg genres')
display(spb_genres.head(10))


Moscow genres


genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64


 St. Petersburg genres


genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64


**Conclusions**

The hypothesis was partly confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a closely related genre, Russian pop (`ruspop`), is also in the top 10.
* Contrary to expectations, Russian rap is about equally popular in both cities: `rusrap` ranks tenth in Moscow and eighth in St. Petersburg.
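
Ranks alone hide how large the gap is. The sketch below turns the counts printed above into shares of each city's total number of plays (42741 for Moscow and 18512 for St. Petersburg, taken from the per-city counts earlier). The dictionary layout is chosen for this example only.

```python
# Totals and genre counts copied from the outputs above
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}
counts = {
    'Moscow':           {'pop': 5892, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'rusrap': 564},
}

# Share of each genre in the city's total number of plays
shares = {
    city: {genre: n / totals[city] for genre, n in by_genre.items()}
    for city, by_genre in counts.items()
}

for city, by_genre in shares.items():
    for genre, share in by_genre.items():
        print(f'{city:<17} {genre:<7} {share:.1%}')
```

The `rusrap` share is slightly higher in St. Petersburg (about 3.0% against about 2.7%), and the `pop` share is slightly higher in Moscow (about 13.8% against about 13.1%). Both gaps are under one percentage point.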


## Results of the study

We tested three hypotheses and established the following:

1. The day of the week affects user activity differently in Moscow and St. Petersburg. The first hypothesis is fully confirmed.

2. Musical preferences don't change much during the week - whether in Moscow or St Petersburg. Slight differences are noticeable at the beginning of the week, on Mondays:
    * In Moscow they listen to world music,
    * In St. Petersburg they listen to jazz and classical music.
    
Thus, the second hypothesis was only partially confirmed. This result might have been different had it not been for the missing values in the data.

3. The tastes of Moscow and St. Petersburg users have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was confirmed only in part: pop does lead in Moscow, but rap is not noticeably more popular in St. Petersburg than in Moscow. If differences in preferences do exist, they are not noticeable among the bulk of users.