# Music_of_big_cities

Two cities take part in this research:

 * City M: considered to be subject to the rigid rhythm of the working week;
 * City P: considered to be cultural, with its own particular tastes.

**Research Purpose** - Test three hypotheses:
1. User activity depends on the day of the week. It shows itself differently in City M and City P.
2. On Monday morning, certain genres prevail in City M, while others prevail in City P. Similarly, Friday evenings are dominated by different genres, depending on the city.
3. City M and City P prefer different genres of music. In City M, they listen to pop music more often, in City P - Russian rap.

**Research Progress**

The data arrives in raw form, and nothing is known about its quality. It will be checked for errors and missing values.

Thus, the study will take place in three stages:
 1. Data review.
 2. Data preprocessing.
 3. Hypothesis testing.

## Data review

In [1]:
import pandas as pd 

In [2]:
df = pd.read_csv('/datasets/yandex_music_project.csv') 

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


**Conclusions**

Each row of the table describes one track play. Some columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the track.

Preliminarily, there appears to be enough data to test the hypotheses. However, the data has missing values, and the column names are inconsistent: lowercase is mixed with uppercase, some names contain leading or trailing spaces, and `userID` is not in snake case.

To move forward, these problems in the data will be fixed.
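The column-name problems listed above can also be fixed programmatically rather than by hand. The sketch below is only an illustration: `to_snake_case` is a hypothetical helper, not part of the original notebook, and it assumes that names only need trimming, splitting at a lowercase-to-uppercase boundary, and lowercasing.

```python
import re

def to_snake_case(name: str) -> str:
    """Hypothetical helper: trim spaces, split camelCase, lowercase."""
    name = name.strip()
    # Put an underscore between a lowercase letter/digit and an uppercase letter,
    # e.g. 'userID' -> 'user_ID'
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# Column names exactly as they appear in the raw file
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(c) for c in raw_columns]
print(clean_columns)
# ['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day']
```

On a real DataFrame the same helper could be passed to `df.rename(columns=to_snake_case)`.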

## Data preprocessing

### Heading style

In [5]:
df.columns 

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
df = df.rename(columns = {'  userID':'user_id', 'Track':'track', '  City  ':'city', 'Day':'day'}) # renaming columns

In [7]:
df.columns # check 

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

### Missing data

The missing values in the `track`, `artist` and `genre` columns will be replaced with the string `'unknown'`.

In [9]:
columns_to_replace = ['track','artist','genre'] 
for column in columns_to_replace:  
    df[column] = df[column].fillna('unknown') 

In [10]:
df.isna().sum() # missing data count

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
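As a possible alternative to the loop, the same replacement can be applied to all three columns at once. This is a minimal sketch on hypothetical toy data, not the notebook's dataset.

```python
import pandas as pd

# Toy data (hypothetical) with gaps in the same three columns as the notebook
toy = pd.DataFrame({
    'track': ['Soul People', None],
    'artist': [None, 'Space Echo'],
    'genre': ['dance', None],
    'city': ['Moscow', 'Moscow'],
})

columns_to_replace = ['track', 'artist', 'genre']
# fillna on a column subset fills every selected column in one call
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

print(toy.isna().sum().sum())  # total number of remaining missing values
```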

### Duplicates

In [11]:
df.duplicated().sum() 

3826

Complete duplicates will be deleted. 

In [12]:
df = df.drop_duplicates().reset_index(drop = True) 

In [13]:
df.duplicated().sum() # check

0
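To make clear what counts as a complete duplicate, here is a small sketch on hypothetical toy data. A row is flagged by `duplicated()` only if every column matches an earlier row, and `drop_duplicates()` keeps the first occurrence.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 are identical, row 2 differs in user_id
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B'],
    'track': ['True', 'True', 'True'],
    'day': ['Monday', 'Monday', 'Monday'],
})

n_duplicates = toy.duplicated().sum()          # counts only the repeated row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```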

Next, the `genre` column is checked for implicit duplicates.

In [14]:
df['genre'].sort_values().unique() #sort unique values in 'genre'

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Implicit duplicates of `hiphop`:
* *hip*,
* *hop*,
* *hip-hop*

Implicit duplicates of `french`:

* *frankreich*,
* *französisch*

As well as `electronic`:
* *электроника*.
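One way to express these corrections is a single mapping from each variant spelling to its canonical genre. The sketch below uses the spellings that appear in the notebook's code cells, applied to hypothetical toy data; the name `genre_aliases` is an assumption made for illustration.

```python
import pandas as pd

# Toy genre column (hypothetical) containing the variants listed above
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop',
                    'frankreich', 'französisch', 'электроника', 'rock'])

# Variant spelling -> canonical genre name
genre_aliases = {
    'hip': 'hiphop',
    'hop': 'hiphop',
    'hip-hop': 'hiphop',
    'frankreich': 'french',
    'französisch': 'french',
    'электроника': 'electronic',
}

# Series.replace with a dict swaps exact matches and leaves other values untouched
normalized = genres.replace(genre_aliases)
print(normalized.unique())
```

Keeping all aliases in one dictionary makes it easy to review which spellings are merged and to extend the list later.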

In [15]:
def replace_wrong_genres(wrong_genres, correct_genre): # function to change value names
    for wrong_genre in wrong_genres: 
        df['genre'] = df['genre'].replace(wrong_genre,correct_genre) 

In [16]:
duplicated_genres = ['hip','hop','hip-hop'] 
new_genre = 'hiphop' 
replace_wrong_genres(duplicated_genres, new_genre) 

In [17]:
duplicated_genres = ['frankreich','französisch'] 
new_genre = 'french' 
replace_wrong_genres(duplicated_genres, new_genre) 

In [18]:
replace_wrong_genres(['электроника'], 'electronic') 

In [19]:
df['genre'].sort_values().unique() # Check

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusions**

Preprocessing found three problems in the data:

 - heading style violations
 - missing values
 - duplicates - explicit and implicit.

The problems were fixed; the next step is hypothesis testing.

## Hypothesis testing

### 1. User activity depends on the day of the week. It shows itself differently in City M and City P.

In [20]:
df.groupby('city')['user_id'].count() 

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

City M has more plays than City P. This does not necessarily mean that City M users listen to music more often; there are simply more users in City M.
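The distinction between total plays and plays per user can be checked by counting unique users alongside plays. The sketch below uses hypothetical toy data chosen so that the city with more plays has fewer plays per user; it does not reproduce figures from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): one row per play
plays = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'D', 'D', 'E'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
})

# Plays and distinct users per city
per_city = plays.groupby('city')['user_id'].agg(plays='count', users='nunique')
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```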

In [21]:
df.groupby('day')['user_id'].count() # group by day

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

Across both cities combined, users are less active on Wednesdays. But the picture may change if each city is considered separately.

In [22]:
# Function for counting plays for a specific city and day
def number_tracks(day, city): 
    track_list = df[df['day'] == day] 
    track_list = track_list[track_list['city'] == city] 
    track_list_count = track_list['user_id'].count() 
    return (track_list_count) 

In [23]:
monday_moscow = number_tracks('Monday', 'Moscow') 
monday_moscow

15740

In [24]:
monday_spb = number_tracks('Monday', 'Saint-Petersburg') 
monday_spb

5614

In [25]:
wednesday_moscow = number_tracks('Wednesday', 'Moscow') 
wednesday_moscow

11056

In [26]:
wednesday_spb = number_tracks('Wednesday', 'Saint-Petersburg') 
wednesday_spb

7003

In [27]:
friday_moscow = number_tracks('Friday', 'Moscow') 
friday_moscow

15945

In [28]:
friday_spb = number_tracks('Friday', 'Saint-Petersburg') 
friday_spb

5895

The results are presented in table form below.

In [29]:
columns = ['city', 'monday', 'wednesday', 'friday'] 
number_tracks_res = [['Moscow', monday_moscow, wednesday_moscow, friday_moscow],
                ['Saint_Petersburg', monday_spb, wednesday_spb, friday_spb]]                      
first_guess_result = pd.DataFrame(data = number_tracks_res, columns = columns) 
first_guess_result

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint_Petersburg,5614,7003,5895
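A city-by-day table like this one can also be produced in a single step with `pd.crosstab`, instead of six separate function calls. The sketch below is an assumption-labeled alternative on hypothetical toy data, not the method used in the notebook.

```python
import pandas as pd

# Toy data (hypothetical): one row per play
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Monday', 'Friday', 'Wednesday', 'Monday'],
})

day_order = ['Monday', 'Wednesday', 'Friday']
# crosstab counts rows for every (city, day) pair; reindex puts the days in calendar order
table = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)
print(table)
```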


**Conclusions**

The data does not allow an unambiguous conclusion in favor of the first hypothesis, but it does not allow rejecting it either.
User activity does depend on the day of the week.
However, activity in City M and City P differs noticeably only on Wednesdays.

The data shows the difference in user behavior:

- In Moscow, the peak of listening is on Monday and Friday.
- In St. Petersburg, on the contrary, users listen to music more on Wednesdays. Monday and Friday activity trails Wednesday by a similar margin.

At the same time, there is a similarity in user behavior:
- In both cities, the number of plays on Mondays is about the same as on Fridays.
- In both cities, Friday has a slight edge over Monday.
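Because City M has far more plays overall, the comparison is easier to read as each day's share of a city's total. The sketch below takes the play counts reported in the table above and divides each row by its sum; the variable names are assumptions made for illustration.

```python
import pandas as pd

# Play counts taken from the results table above
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Share of each day within a city's total plays over the three days
shares = counts.div(counts.sum(axis=1), axis=0).round(3)
print(shares)
#                   monday  wednesday  friday
# Moscow             0.368      0.259   0.373
# Saint-Petersburg   0.303      0.378   0.318
```

In shares, Wednesday is the weakest day in Moscow and the strongest in Saint Petersburg, which matches the conclusions above.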

### 2. On Monday morning, certain genres prevail in City M, while others prevail in City P. Similarly, Friday evenings are dominated by different genres, depending on the city.

In [30]:
moscow_general = df[df['city']=='Moscow'] 

In [31]:
spb_general = df[df['city']=='Saint-Petersburg'] 

In [32]:
# Return the 10 most popular genres for a given table, day and time window.
# Times are compared as zero-padded 'HH:MM' / 'HH:MM:SS' strings.
def genre_weekday(table, day, time1, time2):
    # keep only plays on the given day
    genre_df = table[table['day'] == day]
    # keep only plays later than time1 and earlier than time2
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]
    # count plays per genre and sort from most to least popular
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted.head(10)
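The function filters on `time` by comparing strings, which works because the timestamps are zero-padded, so lexicographic order matches chronological order. The sketch below, on hypothetical toy timestamps, shows which boundary values pass a `'07:00'` to `'11:00'` window.

```python
import pandas as pd

# Toy timestamps (hypothetical), zero-padded like the 'time' column
times = pd.Series(['06:59:59', '07:00:00', '08:37:09', '10:59:59', '11:00:00'])

# Same strict comparisons as in genre_weekday
mask = (times > '07:00') & (times < '11:00')
print(times[mask].tolist())
# ['07:00:00', '08:37:09', '10:59:59']
```

Note that `'07:00:00'` passes the lower bound because a longer string with the same prefix compares as greater than `'07:00'`, while `'11:00:00'` fails the upper bound for the same reason.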


Monday morning comparison

In [33]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00') 

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64

In [34]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00') # call the function for Monday morning in Saint Petersburg (spb_general instead of df)

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64

Friday evening comparison

In [35]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00') 

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64

In [36]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00') 

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. Users in City M and City P listen to similar music. The only difference is that the City M ranking includes the “world” genre, while the City P ranking includes jazz and classical.

2. Missing values in City M (`'unknown'`) took tenth place among the most popular genres. Missing values reduce the reliability of the study. Even so, their impact is not expected to be critical, since the top 10 rankings in both cities contain mostly the same genres. Still, the missing values could hide a new genre or a combination of genres that would change the positions of popular genres in the rankings.

Friday evening does not change this picture. Some genres rise a little higher, others go down, but overall the top 10 stays the same. 

Thus, the second hypothesis was not confirmed. The difference between City M and City P is not very pronounced. 
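The overlap between the rankings can be quantified with simple set operations. The sketch below copies the four top 10 lists from the outputs above; `compare_tops` is a hypothetical helper introduced only for this illustration.

```python
def compare_tops(top_a, top_b):
    """Return (number of shared genres, genres only in A, genres only in B)."""
    set_a, set_b = set(top_a), set(top_b)
    return len(set_a & set_b), sorted(set_a - set_b), sorted(set_b - set_a)

# Top 10 lists copied from the outputs above
monday_m_top = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
monday_p_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
                'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']
friday_m_top = ['pop', 'rock', 'dance', 'electronic', 'hiphop',
                'world', 'ruspop', 'alternative', 'classical', 'rusrap']
friday_p_top = ['pop', 'rock', 'electronic', 'dance', 'hiphop',
                'alternative', 'jazz', 'classical', 'rusrap', 'world']

monday_overlap = compare_tops(monday_m_top, monday_p_top)
friday_overlap = compare_tops(friday_m_top, friday_p_top)
print(monday_overlap)  # (8, ['unknown', 'world'], ['classical', 'jazz'])
print(friday_overlap)  # (9, ['ruspop'], ['jazz'])
```

Eight of ten genres are shared on Monday morning and nine of ten on Friday evening.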

### 3. City M and City P prefer different genres of music. In City M, they listen to pop music more often, in City P - Russian rap.

In [37]:
moscow_genres = moscow_general.groupby('genre')['track'].count().sort_values(ascending=False)

In [38]:
moscow_genres.head(10) # first 10 rows moscow_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: track, dtype: int64

In [39]:
spb_genres = spb_general.groupby('genre')['track'].count().sort_values(ascending=False)

In [40]:
spb_genres.head(10) # first 10 rows of spb_genres

genre
pop            2431
dance          1932
rock           1879
electronic     1737
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: track, dtype: int64

The position that `rap` holds in each ranking:

In [41]:
(moscow_genres[:'rap']).count() 

18

In [42]:
(spb_genres[:'rap']).count() 

15
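The slice-and-count idiom works because label slicing on a Series includes the end label, so the number of rows up to and including `'rap'` equals its rank. The sketch below demonstrates this on a hypothetical toy ranking and compares it with a direct position lookup; it assumes the index labels are unique.

```python
import pandas as pd

# Toy ranking (hypothetical), already sorted by play count in descending order
toy_genres = pd.Series([50, 30, 20, 10], index=['pop', 'dance', 'rock', 'rap'])

# Label-based slice includes the 'rap' row itself, so count() gives its rank
rank_by_slice = toy_genres.loc[:'rap'].count()

# Equivalent: zero-based position of the label plus one
rank_by_position = toy_genres.index.get_loc('rap') + 1

print(rank_by_slice, rank_by_position)
```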

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in City M, as the hypothesis suggested. Moreover, the top 10 also includes a closely related genre, Russian pop.
* Contrary to expectations, the rap genre (excluding Russian rap) is not in the top 10 in either city, ranking only 18th in City M and 15th in City P.
* At the same time, Russian rap is similarly popular in both cities, ranking 10th in City M and 8th in City P.
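Ranks alone hide how close the two cities are, so it can help to compare each genre's share of a city's plays. The sketch below uses the `pop` and `rusrap` counts from the outputs above and assumes the per-city totals are the play counts reported earlier for each city (42741 and 18512).

```python
import pandas as pd

# Genre counts copied from the top 10 outputs above
genre_counts = pd.DataFrame(
    {'Moscow': [5892, 1161], 'Saint-Petersburg': [2431, 564]},
    index=['pop', 'rusrap'],
)

# Assumption: total plays per city as reported by the earlier groupby on 'city'
city_totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Percentage of each city's plays that belongs to each genre
share_percent = (genre_counts / city_totals * 100).round(1)
print(share_percent)
#         Moscow  Saint-Petersburg
# pop       13.8              13.1
# rusrap     2.7               3.0
```

Under this assumption, the shares differ by less than one percentage point for both genres.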

## Final conclusion

The test of three hypotheses has shown the following:

1. The first hypothesis was partially confirmed. Users in both cities listen to music in approximately equal volumes on Mondays and Fridays, while Wednesday affects the two cities differently.

2. The second hypothesis was not confirmed. Genres are almost the same for both cities for Monday morning and Friday evening.

3. The third hypothesis was partially confirmed. In City M, users do prefer pop music to other genres. However, if differences in preferences between the two cities exist, they are not noticeable among the bulk of users.