### The goal of the project is to determine whether people's music preferences differ between cities and depend on the day of the week

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/datasets/music_project.csv')

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


The dataset has 7 columns:

* userID — user's ID;
* Track — track name;
* artist — artist name;
* genre — genre name;
* City — the city in which users listened to the track;
* time — the time at which users listened to the track;
* Day — the day of the week.

Some columns contain missing values: for example, the artist is empty in rows 8 and 9 above.

Each row describes one track of a given genre that a user listened to in Moscow or Saint-Petersburg on a particular day of the week at a particular time.

# EDA

Let's check for missing values and duplicates.

The column names also need cleaning up: some contain stray spaces, and the capitalization is inconsistent.

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
df.columns = ['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time', 'weekday']

In [7]:
df.columns

Index(['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time',
       'weekday'],
      dtype='object')
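Assigning a full list of names depends on the column order. As a hedged alternative, the sketch below strips the stray whitespace first and then renames by name. The frame `raw` and the mapping `new_names` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame with the same messy headers as in the df.columns output above
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Hypothetical mapping from the stripped original names to the new names
new_names = {
    'userID': 'user_id',
    'Track': 'track_name',
    'artist': 'artist_name',
    'genre': 'genre_name',
    'City': 'city',
    'Day': 'weekday',
}

# Strip whitespace from every header, then rename by name rather than by position
cleaned = raw.rename(columns=str.strip).rename(columns=new_names)
print(list(cleaned.columns))
```

Columns missing from the mapping (here *time*) keep their names, so the mapping only has to list what actually changes.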

Let's count the missing values in each column.

In [8]:
df.isnull().sum()

user_id           0
track_name     1231
artist_name    7203
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

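We can replace the missing values in *track_name* and *artist_name* with 'unknown', since the analysis below relies on genres rather than on tracks or artists.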

In [10]:
df['track_name'] = df['track_name'].fillna('unknown')

In [11]:
df['artist_name'] = df['artist_name'].fillna('unknown')

In [12]:
df.isnull().sum()

user_id           0
track_name        0
artist_name       0
genre_name     1198
city              0
time              0
weekday           0
dtype: int64

Rows without a genre cannot be used in a genre analysis, so we drop them.

In [13]:
df.dropna(subset = ['genre_name'], inplace = True)

In [14]:
df.isnull().sum()

user_id        0
track_name     0
artist_name    0
genre_name     0
city           0
time           0
weekday        0
dtype: int64
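The two strategies used above (fill what is not essential, drop what is) can be sketched on a toy frame. The data below is made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Made-up rows with gaps in each of the three columns
toy = pd.DataFrame({
    'track_name': ['True', None, 'Pessimist'],
    'artist_name': ['Roman Messer', 'Mario Lanza', None],
    'genre_name': ['dance', None, 'dance'],
})

# Fill the non-essential text columns with a placeholder
for column in ['track_name', 'artist_name']:
    toy[column] = toy[column].fillna('unknown')

# Drop rows where the genre, which the analysis depends on, is missing
toy = toy.dropna(subset=['genre_name']).reset_index(drop=True)
print(toy)
```

Looping over the column list keeps the fill step in one place if more placeholder columns are added later.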

Checking for and removing duplicates.

In [15]:
df.duplicated().sum()

3755

In [16]:
df = df.drop_duplicates().reset_index(drop=True)

In [17]:
df.duplicated().sum()

0
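As a small illustration of what these calls do, `duplicated()` marks every row that repeats an earlier row in all columns, and `drop_duplicates()` keeps the first copy. The toy data below is an assumption made for the example.

```python
import pandas as pd

# The second row repeats the first one in every column
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2'],
    'track_name': ['True', 'True', 'True'],
    'time': ['13:00:07', '13:00:07', '13:00:07'],
})

n_dupes = int(toy.duplicated().sum())                    # rows that repeat an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)   # keep first copies, renumber rows
print(n_dupes, len(deduped))
```

The third row shares a track and a time with the others but has a different *user_id*, so it is not counted as a duplicate.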

Let's create a variable with all the unique genres and a function that helps us find implicit duplicates (for example, the same genre spelled in different ways).

In [19]:
genres_list = df['genre_name'].unique()

In [20]:
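def find_genre(genre_name):
    # Count exact matches in the list of unique genres:
    # 1 if this spelling occurs in the data, 0 otherwise
    count = 0
    for i in genres_list:
        if i == genre_name:
            count += 1
    return count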

Let's check whether other spellings of the *hiphop* genre occur:

* hip
* hop
* hip-hop


In [22]:
search_hip = find_genre('hip') 

In [23]:
search_hop = find_genre('hop')

In [25]:
search_hip_hop = find_genre('hip-hop')

Let's write a function *find_hip_hop()* that replaces a wrong genre name with *'hiphop'* and checks that the replacement succeeded by counting the rows that still carry the wrong name.

In [35]:
def find_hip_hop(df, wrong):
    df['genre_name'] = df['genre_name'].replace(wrong, 'hiphop')
    count = df[df['genre_name'] == wrong]['genre_name'].count()
    return count

In [36]:
find_hip_hop(df, 'hip') 

0
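If several spellings turn up, they can be replaced in one pass. The sketch below is a hedged variant on toy data: the list `wrong_spellings` is an assumption, and whether *hop* or *hip-hop* actually occur in the dataset is not shown in the notebook.

```python
import pandas as pd

# Toy genre column containing the candidate spellings
toy = pd.DataFrame({'genre_name': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})

# Hypothetical list of spellings to normalize
wrong_spellings = ['hip', 'hop', 'hip-hop']

# Series.replace with a list matches whole values, not substrings
toy['genre_name'] = toy['genre_name'].replace(wrong_spellings, 'hiphop')

# Check: no row should still carry one of the wrong spellings
remaining = int(toy['genre_name'].isin(wrong_spellings).sum())
print(toy['genre_name'].tolist(), remaining)
```

Because whole values are matched, a genre such as *triphop* would be left untouched.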

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60126 entries, 0 to 60125
Data columns (total 7 columns):
user_id        60126 non-null object
track_name     60126 non-null object
artist_name    60126 non-null object
genre_name     60126 non-null object
city           60126 non-null object
time           60126 non-null object
weekday        60126 non-null object
dtypes: object(7)
memory usage: 3.2+ MB


**Results**

As a result of the EDA, we removed duplicates and handled missing values in two ways: filling them with a placeholder and dropping rows. We also renamed the columns to make them easier to work with.

# Hypothesis 1: Whether the music preferences of people from Moscow differ from those of people from Saint-Petersburg.

Let's compare listening activity on Monday, Wednesday, and Friday in both cities.

In [48]:
genre_grouping = df.groupby('city')['genre_name']
genre_counting = genre_grouping.count()
genre_counting

city
Moscow              41892
Saint-Petersburg    18234
Name: genre_name, dtype: int64

Moscow accounts for more than twice as many plays as Saint-Petersburg (41,892 vs. 18,234). Note that these are counts of played tracks, not of unique users.

Now let's count how many songs users listened to on a particular day of the week.

In [50]:
genre_counting_ = df.groupby('weekday')['genre_name'].count()
genre_counting_

weekday
Friday       21482
Monday       20866
Wednesday    17778
Name: genre_name, dtype: int64

Overall, users listened to more tracks on Monday and Friday than on Wednesday.

Let's write a function that counts the tracks played in a given city on a given day.

In [78]:
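def number_tracks(df, day, city):
    # Keep only the rows for the given day and city
    track_list = df[(df['weekday'] == day) & (df['city'] == city)]
    # Every remaining row has a genre, so this counts the plays
    track_list_count = track_list['genre_name'].count()
    return track_list_count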

In [79]:
number_tracks(df, 'Monday', 'Moscow')

15347

In [80]:
number_tracks(df, 'Monday', 'Saint-Petersburg') 

5519

In [81]:
number_tracks(df, 'Wednesday', 'Moscow')  

10865

In [82]:
number_tracks(df, 'Wednesday', 'Saint-Petersburg')

6913

In [83]:
number_tracks(df, 'Friday', 'Moscow') 

15680

In [84]:
number_tracks(df, 'Friday', 'Saint-Petersburg') 

5802

Now we can collect the results in a table.

In [86]:
data = [['Moscow', 15347, 10865, 15680],
       ['Saint-Petersburg', 5519, 6913, 5802]]

columns = ['city','monday','wednesday','friday']
table = pd.DataFrame(data = data, columns = columns)
table

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15347,10865,15680
1,Saint-Petersburg,5519,6913,5802
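The table above was typed in by hand from six separate calls. As a hedged alternative, `pd.crosstab` can build the same kind of city-by-day table directly from the rows. The toy frame and the `day_order` list below are assumptions for illustration.

```python
import pandas as pd

# Toy play log: one row per played track
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'weekday': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Count rows for every (city, weekday) pair in one call
counts = pd.crosstab(toy['city'], toy['weekday'])

# crosstab orders columns alphabetically, so put the days in calendar order
day_order = ['Monday', 'Wednesday', 'Friday']
counts = counts.reindex(columns=day_order, fill_value=0)
print(counts)
```

This avoids copying numbers between cells, so the table stays correct if the cleaning steps change.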


**The results**

The results show that activity in the two cities follows opposite patterns around Wednesday. In Moscow, the number of plays peaks on Monday and Friday and drops on Wednesday. In St. Petersburg, Wednesday is the busiest day, while Monday and Friday are quieter and close to each other.
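Since Moscow has far more plays overall, the pattern is easier to compare as shares within each city. The sketch below reuses the counts from the table above; the name `shares` is just an illustrative choice.

```python
import pandas as pd

# Counts copied from the table above
table = pd.DataFrame(
    {'monday': [15347, 5519], 'wednesday': [10865, 6913], 'friday': [15680, 5802]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total to get the share of plays per day
shares = table.div(table.sum(axis=1), axis=0).round(3)
print(shares)

# Busiest day in each city
print(shares.idxmax(axis=1))
```

In this form Wednesday makes up roughly a quarter of Moscow's plays but well over a third of St. Petersburg's, which matches the opposite patterns described above.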

# Hypothesis 2: Whether users listen to the same music on Monday mornings and Friday evenings.

Let's find out which genres are popular on Monday mornings and Friday evenings. The assumption is that users listen to invigorating music (for example, the alternative genre) on Monday mornings and to dance music on Friday evenings.

Let's create separate tables *moscow_general* and *spb_general* for both cities.

In [89]:
moscow_general = df[df['city'] == 'Moscow']
moscow_general

Unnamed: 0,user_id,track_name,artist_name,genre_name,city,time,weekday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,unknown,ruspop,Moscow,09:17:40,Friday
...,...,...,...,...,...,...,...
60120,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Moscow,21:07:12,Monday
60121,729CBB09,My Name,McLean,rnb,Moscow,13:32:28,Wednesday
60123,C5E3A0D5,Jalopiina,unknown,industrial,Moscow,20:09:26,Friday
60124,321D0506,Freight Train,Chas McDevitt,rock,Moscow,21:43:59,Friday


In [90]:
spb_general = df[df['city'] == 'Saint-Petersburg']
spb_general

Unnamed: 0,user_id,track_name,artist_name,genre_name,city,time,weekday
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Saint-Petersburg,21:20:49,Wednesday
...,...,...,...,...,...,...,...
60112,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Saint-Petersburg,21:14:40,Monday
60113,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Saint-Petersburg,21:06:50,Monday
60114,29E04611,Bre Petrunko,Perunika Trio,world,Saint-Petersburg,13:56:00,Monday
60115,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Saint-Petersburg,09:22:13,Monday


Let's write a function *genre_weekday()* that returns the number of plays per genre for a given day within a given time window.

In [110]:
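def genre_weekday(df, day, time1, time2):
    # Times are zero-padded 'HH:MM:SS' strings, so string comparison matches chronological order
    genre_list = df[(df['weekday'] == day) & (df['time'] > time1) & (df['time'] < time2)]
    # groupby().count() returns the genres ordered by name, not by count,
    # so head(10) yields the first ten genres alphabetically
    genre_list_sorted = genre_list.groupby('genre_name')['genre_name'].count()
    return genre_list_sorted.head(10)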

In [111]:
genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00')

genre_name
adult            1
africa           2
alternative    164
ambient         22
americana        1
anime            7
arabesk          1
audiobook        1
avantgarde       4
balkan           1
Name: genre_name, dtype: int64

In [112]:
genre_weekday(spb_general, 'Monday', '07:00:00', '11:00:00')

genre_name
adult           1
alternative    58
ambient         5
balkan          1
beats           1
blues          11
brazilian       1
breakbeat       1
caucasian       2
chamber         1
Name: genre_name, dtype: int64

In [113]:
genre_weekday(moscow_general, 'Friday', '17:00:00', '23:00:00')

genre_name
adult               1
africa              1
alternative       163
ambient            18
americana           2
anime               7
arabesk             2
arena               1
argentinetango      3
art                 1
Name: genre_name, dtype: int64

In [114]:
genre_weekday(spb_general, 'Friday', '17:00:00', '23:00:00')

genre_name
acoustic           1
adult              1
alternative       63
ambient            7
americana          1
anime              3
arabesk            2
argentinetango     2
avantgarde         2
balkan             1
Name: genre_name, dtype: int64

The listings for Monday morning in St. Petersburg and Moscow look similar: in both cities, alternative has by far the highest count among the genres shown, as expected. The smaller genres differ: after alternative, the St. Petersburg listing is led by blues and ambient, while the Moscow listing is led by ambient and anime. Note that the output is ordered by genre name, so it shows the first ten genres alphabetically rather than a ranked top 10.

At the end of the week the situation does not change. Alternative still has the highest count among the genres shown, and it is played about as often as on Monday morning (163 vs. 164 plays in Moscow, 63 vs. 58 in St. Petersburg). Again, the differences are only noticeable among the smaller genres; ambient, for example, is still present in St. Petersburg on Friday evening.

**The results**

Among the genres shown, alternative leads in both cities and in both time windows, and its count barely changes between Monday morning and Friday evening. The low-count part of the list is more "live": each city has its own characteristic niche genres, and these shift depending on the day of the week and the time.
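The outputs of *genre_weekday()* above are ordered by genre name, so a ranked top 10 needs an extra sorting step. A minimal sketch of such a variant follows. The function name `top_genres`, the parameter `n`, and the toy data are assumptions, not part of the original notebook.

```python
import pandas as pd

def top_genres(df, day, time1, time2, n=10):
    """Hypothetical ranked variant: the n most played genres in a time window."""
    # Zero-padded 'HH:MM:SS' strings compare in chronological order
    mask = (df['weekday'] == day) & (df['time'] > time1) & (df['time'] < time2)
    # value_counts() sorts by count in descending order
    return df.loc[mask, 'genre_name'].value_counts().head(n)

# Made-up plays: six fall into the Monday morning window
toy = pd.DataFrame({
    'genre_name': ['pop', 'alternative', 'pop', 'rock', 'pop', 'alternative', 'rock', 'pop'],
    'weekday': ['Monday'] * 7 + ['Friday'],
    'time': ['08:00:00', '09:30:00', '10:15:00', '07:45:00',
             '10:45:00', '08:20:00', '12:00:00', '18:00:00'],
})

top = top_genres(toy, 'Monday', '07:00:00', '11:00:00', n=2)
print(top)
```

Rerunning the four calls above with a ranked function like this would show whether alternative really leads, or whether genres later in the alphabet (such as pop or rock) outrank it.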

# Hypothesis 3: Users' preferences differ between the two cities.

Assumption: users in St. Petersburg prefer the rusrap genre, while Moscow users prefer pop.

In [124]:
moscow_genres = (moscow_general.groupby('genre_name')['genre_name'].count()).sort_values(ascending = False)

In [125]:
moscow_genres.head(10)

genre_name
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre_name, dtype: int64

In [126]:
spb_genres = (spb_general.groupby('genre_name')['genre_name'].count()).sort_values(ascending = False)

In [127]:
spb_genres.head(10)

genre_name
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre_name, dtype: int64

**The results**

In both Moscow and St. Petersburg, pop is at the top. Rusrap, contrary to the assumption, holds a similar position in both cities: 10th in Moscow and 8th in St. Petersburg.
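Raw counts are hard to compare because Moscow has more than twice as many plays. As a rough check, the sketch below converts the pop and rusrap counts from the outputs above into percentages of each city's total plays (41,892 and 18,234, from the city counts in the first hypothesis). The variable names are illustrative.

```python
import pandas as pd

# Total plays per city, from the groupby('city') output earlier
totals = pd.Series({'Moscow': 41892, 'Saint-Petersburg': 18234})

# Genre counts copied from the two top-10 outputs above
counts = pd.DataFrame(
    {'Moscow': [5892, 1161], 'Saint-Petersburg': [2431, 564]},
    index=['pop', 'rusrap'],
)

# The Series index aligns with the DataFrame columns, so each column is divided by its city total
shares = (counts / totals * 100).round(1)
print(shares)
```

Under these numbers pop takes about 14% of plays in Moscow and about 13% in St. Petersburg, while rusrap takes about 3% in both, so the shares do not support the assumed city-specific preferences either.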

# Final results


Hypotheses:

* Moscow and St. Petersburg users listen to music in different ways;

* the lists of the ten most popular genres on Monday morning and Friday evening are different;

* people from the two cities prefer different music genres.

**Overall results**

Users from Moscow and St. Petersburg have similar preferences: pop music prevails in both cities. Within each city, genre preferences do not depend on the day of the week: people consistently listen to what they like. Between the cities, however, listening activity differs by day of the week: people in Moscow listen to more music on Monday and Friday, while people in St. Petersburg listen more on Wednesday and less on Monday and Friday.

As a result, the first hypothesis is confirmed, the second is not confirmed, and the third is not confirmed.