# Y.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: user activity in the two cities](#activity)
    * [3.2 Hypothesis 2: music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
This project compares the music preferences of users in Springfield and Shelbyville. Three hypotheses frame the comparison.

### Goal
Test three hypotheses:
1. User activity differs depending on the day of the week and from city to city.
2. Springfield and Shelbyville residents listen to different genres on Monday mornings. The same holds for Friday evenings.
3. Springfield and Shelbyville listeners have different preferences. In Shelbyville, they prefer rap, while Springfield has more pop fans.

### Stages 
Data on user behavior is stored in the file `/datasets/music_project_en.csv`. 
 
The project will consist of three stages:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses
 
[Back to Contents](#back)

## Stage 1. Data overview <a id='data_review'></a>

In [5]:
import pandas as pd # importing pandas

In [4]:
df = pd.read_csv('/datasets/music_project_en.csv') # reading the file and storing it to df
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,08:14:07,Friday
freq,76,136,136,8850,45360,14,23149


In [6]:
df.head(10) # obtaining the first 10 rows from the df table

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [8]:
df.info()# obtaining general information about the data in df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


### Conclusions <a id='data_review_conclusions'></a>


Each row in the table describes one played track. Some columns describe the track itself (title, artist, genre), while the others describe the user and the playback: the user's city, the time, and the day of the week the track was played.

Two problems are already visible: the column names mix upper and lower case and contain stray spaces, and the `Track`, `artist`, and `genre` columns contain missing values.


[Back to Contents](#back)

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>

### Header style <a id='header_style'></a>

In [9]:
print(df.columns)# the list of column names in the df table

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


In [10]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})  # renaming columns

In [11]:
print(df.columns)# checking result: the list of column names

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
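
The rename above lists each column by hand. A minimal sketch of a more general cleanup, assuming the only problems are stray whitespace and capital letters; the empty frame `raw` is a hypothetical stand-in for `df`:

```python
import pandas as pd

# Hypothetical stand-in with the same messy headers as the real file
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Strip surrounding whitespace and lowercase every header in one pass
raw.columns = raw.columns.str.strip().str.lower()

# 'userid' still needs an explicit rename to reach snake_case
raw = raw.rename(columns={'userid': 'user_id'})

print(list(raw.columns))
```

The explicit mapping used in the notebook is easier to audit; the vectorized version is mainly useful when a file has many columns.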


[Back to Contents](#back)

### Missing values <a id='missing_values'></a>

In [12]:
df.isna().sum() # calculating missing values

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

In [13]:
columns_to_replace = ["track", "artist", "genre"]

for column in columns_to_replace:  # replace missing values in each column with the string 'unknown'
    df[column] = df[column].fillna('unknown')

In [14]:
df.isna().sum()# counting missing values

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
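
The loop above fills one column at a time. As a sketch of an equivalent approach, `fillna` can be applied to a column subset in a single statement; the frame `toy` below is made-up data, not the project file:

```python
import pandas as pd

# Hypothetical toy data with gaps in the same three columns
toy = pd.DataFrame({
    'track': ['Chains', None],
    'artist': [None, 'Mario Lanza'],
    'genre': ['rusrap', None],
    'city': ['Shelbyville', 'Springfield'],
})

columns_to_replace = ['track', 'artist', 'genre']

# fillna on a column subset replaces the gaps in all three columns at once
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

# Total number of missing values left in the whole frame
missing_after = int(toy.isna().sum().sum())
print(missing_after)
```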

[Back to Contents](#back)

### Duplicates <a id='duplicates'></a>

In [15]:
df.duplicated().sum()# counting clear duplicates

3826

In [19]:
df["genre"].nunique() # checking how many unique values are in the column

269

In [17]:
df.isna().sum()#counting missing values

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

In [18]:
df.shape

(65079, 7)

In [21]:
df = df.drop_duplicates().reset_index(drop=True)# dropping duplicates and resetting the index

In [22]:
df.duplicated().sum()# checking for duplicates

0
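
To illustrate what the two calls do, here is a small sketch on made-up rows (the frame `toy` is hypothetical): `duplicated()` flags every repeat after the first occurrence, and `reset_index(drop=True)` renumbers the remaining rows without keeping the old index as a column.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are exact copies of each other
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'A1'],
    'track': ['True', 'Chains', 'True'],
    'time': ['13:00:07', '13:09:41', '13:00:07'],
})

# Number of rows that repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# Keep the first copy of each row and renumber the index from 0
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped), list(deduped.index))
```

A quick sanity check for the real table: 65079 rows minus 3826 duplicates should leave 61253 rows.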

In [23]:
unique_genres = sorted(df['genre'].unique())# viewing unique genre names
for genre in unique_genres:
    print(genre) 

acid
acoustic
action
adult
africa
afrikaans
alternative
ambient
americana
animated
anime
arabesk
arabic
arena
argentinetango
art
audiobook
avantgarde
axé
baile
balkan
beats
bigroom
black
bluegrass
blues
bollywood
bossa
brazilian
breakbeat
breaks
broadway
cantautori
cantopop
canzone
caribbean
caucasian
celtic
chamber
children
chill
chinese
choral
christian
christmas
classical
classicmetal
club
colombian
comedy
conjazz
contemporary
country
cuban
dance
dancehall
dancepop
dark
death
deep
deutschrock
deutschspr
dirty
disco
dnb
documentary
downbeat
downtempo
drum
dub
dubstep
eastern
easy
electronic
electropop
emo
entehno
epicmetal
estrada
ethnic
eurofolk
european
experimental
extrememetal
fado
film
fitness
flamenco
folk
folklore
folkmetal
folkrock
folktronica
forró
frankreich
französisch
french
funk
future
gangsta
garage
german
ghazal
gitarre
glitch
gospel
gothic
grime
grunge
gypsy
handsup
hard'n'heavy
hardcore
hardstyle
hardtechno
hip
hip-hop
hiphop
historisch
holiday
hop
horror
house
idm
i

In [24]:
def replace_wrong_genres(wrong_genres,correct_genre): #function for replacing implicit duplicates
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [25]:
duplicates = ['hip','hop','hip-hop'] # removing implicit duplicates
genre = 'hiphop' 
replace_wrong_genres(duplicates, genre)

In [26]:
unique_genres = sorted(df['genre'].unique())
for genre in unique_genres:
    print(genre) # checking for implicit duplicates

acid
acoustic
action
adult
africa
afrikaans
alternative
ambient
americana
animated
anime
arabesk
arabic
arena
argentinetango
art
audiobook
avantgarde
axé
baile
balkan
beats
bigroom
black
bluegrass
blues
bollywood
bossa
brazilian
breakbeat
breaks
broadway
cantautori
cantopop
canzone
caribbean
caucasian
celtic
chamber
children
chill
chinese
choral
christian
christmas
classical
classicmetal
club
colombian
comedy
conjazz
contemporary
country
cuban
dance
dancehall
dancepop
dark
death
deep
deutschrock
deutschspr
dirty
disco
dnb
documentary
downbeat
downtempo
drum
dub
dubstep
eastern
easy
electronic
electropop
emo
entehno
epicmetal
estrada
ethnic
eurofolk
european
experimental
extrememetal
fado
film
fitness
flamenco
folk
folklore
folkmetal
folkrock
folktronica
forró
frankreich
französisch
french
funk
future
gangsta
garage
german
ghazal
gitarre
glitch
gospel
gothic
grime
grunge
gypsy
handsup
hard'n'heavy
hardcore
hardstyle
hardtechno
hiphop
historisch
holiday
horror
house
idm
independent
india
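
The helper above calls `replace` once per wrong spelling. As a sketch of an alternative, `Series.replace` also accepts a list of values to map onto one replacement; the series `genres` below is made-up data:

```python
import pandas as pd

# Hypothetical toy column containing the variants seen in the genre list
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'pop'])

# A list of values is mapped onto a single replacement in one call;
# only whole values are matched, so 'pop' and 'hiphop' stay untouched
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')

print(sorted(cleaned.unique()))
```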

[Back to Contents](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. We test this using the data for three days of the week: Monday, Wednesday, and Friday.

In [27]:
df.groupby('city')["track"].count()#count number of track by city

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

In [28]:
# Calculating tracks played on each of the three days
print(df[df['day']=='Monday']['day'].count())
print(df[df['day']=='Wednesday']['day'].count())
print(df[df['day']=='Friday']['day'].count())

21354
18059
21840


In [37]:
df.groupby('day')["track"].count()# counting the tracks played on each day with groupby

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

In [38]:
def number_tracks(day, city):
    # count the tracks played on a given day in a given city
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [39]:
# the number of songs played in Springfield on Monday
springfield_monday = number_tracks('Monday', 'Springfield')
print(f'Number of songs played in Springfield on Monday: {springfield_monday}')

Number of songs played in Springfield on Monday: 15740


In [40]:
# the number of songs played in Shelbyville on Monday
shelbyville_monday = number_tracks('Monday', 'Shelbyville')
print(f'Number of songs played in Shelbyville on Monday: {shelbyville_monday}')

Number of songs played in Shelbyville on Monday: 5614


In [41]:
# the number of songs played in Springfield on Wednesday
springfield_wednesday = number_tracks('Wednesday', 'Springfield')
print(f'Number of songs played in Springfield on Wednesday: {springfield_wednesday}')

Number of songs played in Springfield on Wednesday: 11056


In [42]:
# the number of songs played in Shelbyville on Wednesday
shelbyville_wednesday = number_tracks('Wednesday', 'Shelbyville')
print(f'Number of songs played in Shelbyville on Wednesday: {shelbyville_wednesday}')

Number of songs played in Shelbyville on Wednesday: 7003


In [43]:
# the number of songs played in Springfield on Friday
springfield_friday = number_tracks('Friday', 'Springfield')
print(f'Number of songs played in Springfield on Friday: {springfield_friday}')

Number of songs played in Springfield on Friday: 15945


In [44]:
# the number of songs played in Shelbyville on Friday
shelbyville_friday = number_tracks('Friday', 'Shelbyville')
print(f'Number of songs played in Shelbyville on Friday: {shelbyville_friday}')

Number of songs played in Shelbyville on Friday: 5895


In [46]:
# table with results
data = {
    'city': ['Springfield', 'Shelbyville'],
    'monday': [number_tracks('Monday', 'Springfield'), number_tracks('Monday', 'Shelbyville')],
    'wednesday': [number_tracks('Wednesday', 'Springfield'), number_tracks('Wednesday', 'Shelbyville')],
    'friday': [number_tracks('Friday', 'Springfield'), number_tracks('Friday', 'Shelbyville')]
}
data = pd.DataFrame(data)
data

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
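
The table above is assembled from six separate calls to `number_tracks`. A sketch of how the same kind of city-by-day table could be built in one step with `pd.crosstab`; the frame `toy` is made-up data with the cleaned column names:

```python
import pandas as pd

# Hypothetical toy play log with the cleaned column names
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day':  ['Monday', 'Friday', 'Wednesday', 'Monday', 'Wednesday'],
})

# Count plays for every city/day pair in a single call
plays = pd.crosstab(toy['city'], toy['day'])

# Put the days in calendar order instead of alphabetical order
plays = plays[['Monday', 'Wednesday', 'Friday']]

print(plays)
```

Pairs that never occur, such as Shelbyville on Monday in this toy data, show up as 0.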


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the most songs are played on Mondays and Fridays, with a noticeable drop on Wednesdays.
- In Shelbyville, by contrast, listening peaks on Wednesdays, while activity is lower on Mondays and Fridays.

So the first hypothesis seems to be correct.
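
Because Springfield has more than twice as many plays as Shelbyville, the raw counts are hard to compare directly. A short sketch that converts the counts from the results table into each day's share of the city's plays:

```python
# Counts copied from the results table above
plays = {
    'Springfield': {'Monday': 15740, 'Wednesday': 11056, 'Friday': 15945},
    'Shelbyville': {'Monday': 5614, 'Wednesday': 7003, 'Friday': 5895},
}

shares = {}
for city, by_day in plays.items():
    total = sum(by_day.values())
    # Percentage of the city's plays that fall on each day
    shares[city] = {day: round(100 * n / total, 1) for day, n in by_day.items()}

for city, by_day in shares.items():
    print(city, by_day)
```

Wednesday accounts for roughly 26% of plays in Springfield but roughly 38% in Shelbyville, which is consistent with the conclusion above.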

[Back to Contents](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>


According to the second hypothesis, residents of Springfield and Shelbyville listen to different genres on Monday mornings and on Friday evenings.

In [48]:
spr_general = df[df['city'] == 'Springfield']

In [49]:
shel_general = df[df['city'] == 'Shelbyville']

In [51]:
def genre_weekday(df, day, time1, time2):
    genre_df = df[df['day'] == day]  # keep only the rows where the day equals `day`
    genre_df = genre_df[genre_df['time'] < time2]  # keep only the rows where the time is earlier than `time2`
    genre_df = genre_df[genre_df['time'] > time1]  # keep only the rows where the time is later than `time1`
    genre_df_count = genre_df.groupby(by='genre')['genre'].count()  # count the rows for each genre
    genre_df_sorted = genre_df_count.sort_values(ascending=False)  # sort the counts in descending order
    return genre_df_sorted[:15]  # return the 15 most popular genres on the given day in the given time window
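
The function compares times as plain strings. That works here because zero-padded `HH:MM:SS` strings sort in the same order as the times they represent. A compact sketch of the same filter with one combined mask and `value_counts`; the name `top_genres` and the frame `toy` are hypothetical:

```python
import pandas as pd

# Hypothetical toy plays for one city
toy = pd.DataFrame({
    'genre': ['pop', 'rock', 'pop', 'dance', 'pop'],
    'day':   ['Monday', 'Monday', 'Monday', 'Friday', 'Monday'],
    'time':  ['08:14:07', '09:17:40', '10:59:59', '08:34:34', '11:00:00'],
})

def top_genres(data, day, time1, time2, n=15):
    """Most played genres on `day`, strictly between time1 and time2."""
    # Zero-padded 'HH:MM:SS' strings compare the same way as the times they encode
    mask = (data['day'] == day) & (data['time'] > time1) & (data['time'] < time2)
    # value_counts already returns the counts sorted in descending order
    return data.loc[mask, 'genre'].value_counts().head(n)

result = top_genres(toy, 'Monday', '07:00:00', '11:00:00')
print(result)
```

Both bounds are exclusive, as in `genre_weekday`, so the play at exactly `11:00:00` is not counted.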

In [58]:
# Monday morning in Springfield
genre_weekday(spr_general,'Monday','07:00:00','11:00:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [63]:
#Monday morning in Shelbyville 
genre_weekday(shel_general,'Monday','07:00:00','11:00:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [64]:
#Friday evening in Springfield
genre_weekday(spr_general,'Friday','17:00:00','23:00:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [62]:
#Friday evening in Shelbyville
genre_weekday(shel_general,'Friday','17:00:00','23:00:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Conclusion**

Comparing the top 15 genres on Monday morning shows that:

- Users in Springfield and Shelbyville have very similar preferences. The top five genres are the same in both cities; only rock and electronic swap places.

The picture is much the same on Friday evening. Individual genres shift position, but the top 15 lists largely overlap in both cities.

Thus, the second hypothesis has been only partially proven true:
* Users listen to similar music at the start and end of the week.
* The difference between Springfield and Shelbyville is negligible: pop is the most popular genre in both cities.
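
One way to put a number on "similar" is to count how many genres the two top-15 lists share. A small sketch using the Monday-morning lists exactly as they appear in the outputs above:

```python
# Top-15 genre lists copied from the Monday-morning outputs above
springfield = ['pop', 'dance', 'electronic', 'rock', 'hiphop', 'ruspop', 'world',
               'rusrap', 'alternative', 'unknown', 'classical', 'metal', 'jazz',
               'folk', 'soundtrack']
shelbyville = ['pop', 'dance', 'rock', 'electronic', 'hiphop', 'ruspop',
               'alternative', 'rusrap', 'jazz', 'classical', 'world', 'rap',
               'soundtrack', 'rnb', 'metal']

# Genres present in both lists, and those unique to each city
shared = set(springfield) & set(shelbyville)
only_springfield = sorted(set(springfield) - shared)
only_shelbyville = sorted(set(shelbyville) - shared)

print(len(shared), only_springfield, only_shelbyville)
```

Thirteen of the fifteen genres appear in both lists. Note that `unknown`, the placeholder for missing genres, is one of the two entries unique to Springfield, so the missing values do affect the ranking.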

[Back to Contents](#back)

### Hypothesis 3: genre preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

In [65]:
df.head()

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday


In [66]:
spr_genres_count = spr_general.groupby(by='genre')['genre'].count()# count plays per genre in Springfield
spr_genres = spr_genres_count.sort_values(ascending=False)# sort the resulting Series in descending order

In [68]:
print(spr_genres.head(10)) #first 10 rows of spr_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64


In [None]:
shel_genres_count = shel_general.groupby(by='genre')['genre'].count()# count plays per genre in Shelbyville
shel_genres = shel_genres_count.sort_values(ascending=False)# sort the resulting Series in descending order

In [70]:
print(shel_genres.head(10)) # first 10 rows from shel_genres

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64


**Conclusion**

The hypothesis has been partially proven true:
* Pop is indeed the most popular genre in Springfield, as expected.
* However, pop is just as dominant in Shelbyville, where it also ranks first, and the `rap` genre does not appear in the top 10 of either city.
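
The claim that pop is about equally popular in both cities can be checked by dividing the pop counts by each city's total number of plays. A short sketch using the figures from the outputs above:

```python
# Totals per city and pop counts copied from the outputs above
total_plays = {'Springfield': 42741, 'Shelbyville': 18512}
pop_plays = {'Springfield': 5892, 'Shelbyville': 2431}

# Pop's share of all plays in each city, in percent
pop_share = {city: round(100 * pop_plays[city] / total_plays[city], 1)
             for city in total_plays}
print(pop_share)
```

Pop makes up roughly 13.8% of plays in Springfield and 13.1% in Shelbyville, so the two cities differ by less than one percentage point.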

[Back to Contents](#back)

# Findings <a id='end'></a>


We have tested the following three hypotheses:

* User activity varies with both the day of the week and the city. This was confirmed: Springfield is most active on Mondays and Fridays, Shelbyville on Wednesdays.
* Genre preferences differ between Springfield and Shelbyville on Monday mornings and Friday evenings. This was only partially confirmed: the top genres are nearly the same in both cities at both times.
* Shelbyville prefers rap, while Springfield prefers pop. This was only partially confirmed: pop is the most popular genre in both cities.