# Yandex.music

**Main goal**: test three hypotheses:
1. User activity depends on the day of the week.
2. The top genres depend on the city and the time of day (the morning and evening top genres differ).
3. Users in Moscow and St. Petersburg (SPB) prefer different genres: Moscow listeners prefer pop, while SPB listeners prefer Russian rap.

## Check data






In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/Users/andrejlesov/Desktop/yandex_music_project.csv')

In [3]:
display(df.head(10))

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB



The table has 7 columns, all of dtype `object`.
Column names:
* `userID`
* `Track`   
* `artist`
* `genre`
* `City` 
* `time` 
* `Day` 

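The naming style is inconsistent: some names are capitalized, `userID` uses camelCase, and the `df.info()` output shows extra spaces around `userID` and `City`. We will fix this.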



The non-null counts also differ between columns, which means some values are missing.


**Conclusion**

The dataframe has 7 columns, and some of them contain missing values.


## Data preprocessing


### Columns naming


In [5]:
display(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
df = df.rename(columns={'  userID':'user_id','Track':'track','  City  ':'city','Day':'day'})  # rename columns

In [7]:
display(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
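
The explicit `rename` above works well for seven columns. For wider tables, the same cleanup could be done with a small helper. The sketch below is only an illustration: `to_snake_case` is a hypothetical helper, not part of pandas, and it is applied to an empty demo frame that carries the raw column names.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Strip surrounding spaces and convert camelCase or Capitalized names to snake_case."""
    name = name.strip()
    # Put an underscore between a lowercase letter (or digit) and the uppercase letter after it
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# Raw column names as reported by df.columns
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
demo = pd.DataFrame(columns=raw_columns)
demo.columns = [to_snake_case(c) for c in demo.columns]
print(list(demo.columns))
```

On these seven names the helper gives the same result as the manual mapping.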

### Missing values

In [8]:
print(df.isna().sum())  # number of missing values per column

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64


The `track`, `artist` and `genre` columns contain missing values. We can replace them with the placeholder 'unknown'.

In [9]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [10]:
print(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
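
The loop above fills one column at a time. As a possible alternative, `fillna` can be applied to several columns in one assignment. This is a sketch on a small made-up frame (`demo` is not the project data):

```python
import pandas as pd

# Toy frame with gaps in the same three columns as the real data
demo = pd.DataFrame({
    'track': ['Soul People', None],
    'artist': [None, 'Space Echo'],
    'genre': ['dance', None],
    'city': ['Moscow', 'Moscow'],
})

columns_to_replace = ['track', 'artist', 'genre']
# Fill all listed columns at once instead of looping over them
demo[columns_to_replace] = demo[columns_to_replace].fillna('unknown')

missing_after = int(demo.isna().sum().sum())
print(missing_after)
```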


### Duplicates


In [11]:
df.duplicated().sum()

3826

Let's delete the duplicates.

In [12]:
df = df.drop_duplicates().reset_index(drop=True)

In [13]:
df.duplicated().sum()

0
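
Note that `duplicated()` marks only the repeats, not the first occurrence of each row. To look at every copy before dropping, `keep=False` can be used. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy frame: the first row appears three times
demo = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['True', 'True', 'Pessimist', 'True'],
})

n_repeats = int(demo.duplicated().sum())        # rows that repeat an earlier row
all_copies = demo[demo.duplicated(keep=False)]   # every copy, including the first one
deduplicated = demo.drop_duplicates().reset_index(drop=True)

print(n_repeats, len(all_copies), len(deduplicated))
```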

The `genre` column also needs checking, because it can contain implicit duplicates: the same genre written in different ways. For example, rap may appear as hip-hop, Rap, raP, etc.

In [14]:
unique_genre = df['genre']
sort_genre = unique_genre.sort_values()
sort_genre = sort_genre.unique()
display(sort_genre)



array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [15]:
def replace_wrong_genres(wrong_genres, correct_genre):
    # Replace every variant spelling in wrong_genres with correct_genre (modifies the global df)
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [16]:
wrong_name = ['hip', 'hop', 'hip-hop']
correct_name = 'hiphop'
replace_wrong_genres(wrong_name,correct_name)

In [17]:
print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 
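
Scanning the full list of unique values by eye is error-prone. One way to confirm that the replacement worked is to check that none of the wrong spellings remain. The sketch below uses a small made-up Series and passes the whole list to `replace` in one call, which is an alternative to the loop in `replace_wrong_genres`:

```python
import pandas as pd

# Toy genre column containing the three variant spellings
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'], name='genre')
wrong_names = ['hip', 'hop', 'hip-hop']

# Series.replace accepts a list of values to replace with a single value
cleaned = genres.replace(wrong_names, 'hiphop')

# Any wrong spelling still present after the replacement
remaining = sorted(set(cleaned) & set(wrong_names))
print(remaining)
```

An empty `remaining` list means all variant spellings were merged.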

**Conclusion**

I fixed three problems in the data:

- bad naming,
- missing data,
- duplicates.


## Hypotheses

### Different behavior of users from Moscow and SPB


First hypothesis: users in different cities listen to music differently across the week.
* So I need to separate users by city.
* Then compare how many tracks people listen to on Monday, Wednesday and Friday.


In [18]:
display(df.groupby('city')['time'].count())

city
Moscow              42741
Saint-Petersburg    18512
Name: time, dtype: int64

At first glance, people in Moscow seem to like music more. That is not necessarily true: Moscow has more plays simply because more people live there.



In [19]:
df.groupby('day')['time'].count()


day
Friday       21840
Monday       21354
Wednesday    18059
Name: time, dtype: int64

Wednesday has noticeably fewer plays than Monday or Friday.

In [20]:
def number_tracks(day, city):
    # Count the tracks played on the given day in the given city
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count


In [21]:
number_tracks('Monday','Moscow')  # Moscow on Monday

15740

In [22]:
number_tracks('Monday','Saint-Petersburg')  # SPB on Monday

5614

In [23]:
number_tracks('Wednesday', 'Moscow')  # Moscow on Wednesday

11056

In [24]:
number_tracks('Wednesday', 'Saint-Petersburg')  # SPB on Wednesday

7003

In [25]:
number_tracks('Friday', 'Moscow')  # Moscow on Friday

15945

In [26]:
number_tracks('Friday', 'Saint-Petersburg')  # SPB on Friday

5895

In [27]:
data = [['Moscow', 15740, 11056, 15945 ],['Saint-Petersburg', 5614, 7003, 5895 ]]
columns = ['city', 'monday', 'wednesday', 'friday']
display(pd.DataFrame(data=data, columns=columns))
# Results table

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
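
The table above was assembled by hand from six calls to `number_tracks`. The same city-by-day counts could probably be produced in one step with `pd.crosstab`. A sketch on a few made-up rows (the `demo` frame is not the project data):

```python
import pandas as pd

# Toy listening log with the same two columns used by number_tracks
demo = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day':  ['Monday', 'Friday', 'Wednesday', 'Monday', 'Wednesday'],
})

day_order = ['Monday', 'Wednesday', 'Friday']
# Count rows for every (city, day) pair, then put the days in calendar order
counts = pd.crosstab(demo['city'], demo['day']).reindex(columns=day_order, fill_value=0)
print(counts)
```

This avoids copying numbers by hand, which is where typos tend to creep in.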


**Conclusions**



-  In Moscow, more tracks are played on Monday and Friday than on Wednesday.
-  In SPB, more tracks are played on Wednesday than on the other two days.

So the first hypothesis is confirmed: users in the two cities behave differently.
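
Because Moscow has far more plays overall, raw counts are hard to compare between cities. One way to make the pattern easier to see is to convert each city's counts to shares of its own total. The sketch below uses the counts from the results table; the variable names are illustrative.

```python
import pandas as pd

# Play counts per city and day, taken from the results table
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total so that every city sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)

busiest_day = shares.idxmax(axis=1)    # day with the largest share in each city
quietest_day = shares.idxmin(axis=1)   # day with the smallest share in each city
print(shares.round(3))
```

With these numbers, Wednesday is the quietest day in Moscow and the busiest in Saint Petersburg.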

### Music on Monday and Friday

Second hypothesis: people in Moscow and SPB listen to different genres, depending on the time of day.

In [28]:
moscow_general = df[df['city'] == 'Moscow']



In [29]:
spb_general = df[df['city'] == 'Saint-Petersburg']


In [30]:
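def genre_weekday(df, day, time1, time2):
    # Keep rows for the given day with time strictly between time1 and time2 ('HH:MM:SS' strings)
    genre_list =  df[(df['day'] == day) & (df['time']>time1) & (time2>df['time'])]
    # Count plays per genre and return the 10 most frequent genres
    genre_list_sorted = genre_list.groupby('genre')['genre'].count().sort_values(ascending = False).head(10) 
    return genre_list_sorted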
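
`genre_weekday` compares times as strings. This works here because the values are zero-padded `HH:MM:SS`, so alphabetical order matches chronological order. The sketch below checks that on a few made-up timestamps by comparing the string filter with one based on parsed durations:

```python
import pandas as pd

times = pd.Series(['06:59:59', '07:00:01', '10:59:59', '11:00:00', '20:28:33'])

# String comparison, as in genre_weekday (both bounds are exclusive)
mask_str = (times > '07:00:00') & (times < '11:00:00')

# The same filter after parsing the strings into durations
parsed = pd.to_timedelta(times)
mask_parsed = (parsed > pd.Timedelta('07:00:00')) & (parsed < pd.Timedelta('11:00:00'))

print(mask_str.tolist() == mask_parsed.tolist())
```

If the times were stored without leading zeros (for example `7:00:00`), the string comparison would no longer be reliable and parsing would be the safer choice.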



Comparison of genres on Monday morning and Friday evening:

In [31]:
genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [32]:
genre_weekday(spb_general, 'Monday', '07:00:00', '11:00:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [33]:
genre_weekday(moscow_general, 'Friday', '17:00:00', '23:00:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [34]:
genre_weekday(spb_general, 'Friday', '17:00:00', '23:00:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**


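The second hypothesis is not confirmed. In general, people listen to the same music in both cities: the top five genres (pop, dance, rock, electronic and hiphop) are the same in every list. The differences are at the bottom of the top 10: jazz appears in both SPB lists but in neither Moscow list, while world ranks higher in Moscow.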
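
To make the comparison of the top-10 lists less dependent on reading them by eye, the lists can be compared as sets. The sketch below copies the genre names from the four outputs above; `compare_tops` is a hypothetical helper introduced only for this illustration.

```python
# Top-10 genres copied from the four genre_weekday outputs
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']
moscow_friday = ['pop', 'rock', 'dance', 'electronic', 'hiphop',
                 'world', 'ruspop', 'alternative', 'classical', 'rusrap']
spb_friday = ['pop', 'electronic', 'rock', 'dance', 'hiphop',
              'alternative', 'jazz', 'classical', 'rusrap', 'world']

def compare_tops(moscow, spb):
    """Return genres shared by both lists and genres unique to each city."""
    shared = sorted(set(moscow) & set(spb))
    only_moscow = sorted(set(moscow) - set(spb))
    only_spb = sorted(set(spb) - set(moscow))
    return shared, only_moscow, only_spb

monday = compare_tops(moscow_monday, spb_monday)
friday = compare_tops(moscow_friday, spb_friday)
print(len(monday[0]), monday[1], monday[2])
print(len(friday[0]), friday[1], friday[2])
```

The two cities share 8 of 10 genres on Monday morning and 9 of 10 on Friday evening.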

### Genres at Moscow and SPB

Third hypothesis: SPB listeners prefer rap, while Moscow listeners prefer pop.

In [35]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)


In [36]:
display(moscow_genres.head(10))# top 10 rows moscow_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [37]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [38]:
display(spb_genres.head(10))# top 10 rows spb_genres

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis is only partly confirmed:
*  Pop is the most popular genre in Moscow, and the closely related Russian pop (`ruspop`) is also in the Moscow top 10.
*  Rap is popular in both SPB and Moscow: `hiphop` is fifth in each city, and `rusrap` is in both top-10 lists.
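
Since the two cities differ in size, genre popularity is easier to compare as a percentage of each city's plays. The sketch below uses the city totals from the `groupby('city')` output and the genre counts from the two top-10 lists; the variable names are illustrative.

```python
import pandas as pd

# Total plays per city and counts for three genres, taken from the outputs above
city_totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})
genre_counts = pd.DataFrame(
    {'Moscow': [5892, 2096, 1161], 'Saint-Petersburg': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# Dividing a DataFrame by a Series aligns the Series index with the column names
genre_share_pct = (genre_counts / city_totals * 100).round(1)
print(genre_share_pct)
```

For all three genres the shares in the two cities differ by less than one percentage point.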


## Research conclusions



1. User activity depends on the day of the week.



2. People listen to the same music during the week.


3. People from Moscow and SPB listen to the same genres.

