# Yandex.Music preferences

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Step 1. Description of the data](#data_review)
* [Step 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
* [Step 3. Hypotheses testing](#hypotheses)
    * [3.1 Hypothesis 1: compare user behavior in the two cities](#activity)
    * [3.2 Hypothesis 2: music at the beginning and end of the week](#week)
    * [3.3 Hypothesis 3: genre preferences in two cities](#genre)
* [Conclusions](#end)

## Introduction <a id='intro'></a>

In this project, we compare the music preferences of the cities of Springfield and Shelbyville. Real data from Yandex.Music will be studied to test the hypotheses below and compare user behavior in the two cities.

### Objective: 
Test three hypotheses: 
1. User activity differs by day of the week and depending on the city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. The same is true for Friday nights. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield they prefer pop while in Shelbyville there are more fans of rap.
 
[Back to Contents](#back)

## Step 1. Description of the data <a id='data_review'></a>

Open the data and browse it.

In [2]:
# import pandas
import pandas as pd

In [3]:
# the file is read and stored in the variable df
df=pd.read_csv('/datasets/music_project_en.csv')
df.describe()

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| count | 65079 | 63736 | 57512 | 63881 | 65079 | 65079 | 65079 |
| unique | 41748 | 39666 | 37806 | 268 | 2 | 20392 | 3 |
| top | A8AE9169 | Brand | Kartvelli | pop | Springfield | 21:51:22 | Friday |
| freq | 76 | 136 | 136 | 8850 | 45360 | 14 | 23149 |


In [4]:
# The first 10 rows of the table df are obtained
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Shelbyville | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Springfield | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Shelbyville | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Shelbyville | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Springfield | 08:34:34 | Monday |
| 5 | 842029A1 | Chains | Obladaet | rusrap | Shelbyville | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Springfield | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Springfield | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | L’estate | Julia Dalia | ruspop | Springfield | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Shelbyville | 21:20:49 | Wednesday |


In [5]:
# General information about df data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


### Conclusions <a id='data_review_conclusions'></a> 

The table contains seven columns, all stored with the same data type: `object`.

The column names have style problems: some are capitalized while others are lowercase, and some contain leading or trailing spaces. The table also has missing values, which show up as `NaN`.

Each row of the table stores data about one track that was played. Some columns describe the track itself: its title, artist and genre. The rest describe the playback: the city the user comes from, the time the track was played and the day of the week.

At a glance, the data appear sufficient to test the hypotheses. However, the missing values have to be handled before the data can be processed reliably.

To continue, we need to preprocess the data.

[Back to Contents](#back)

## Step 2. Data preprocessing <a id='data_preprocessing'></a>


### 2.1. Header style <a id='header_style'></a>


In [6]:
# List of column names in the df table
print(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


In [7]:
# Change column names and rename
df=df.rename(
    columns={
        '  userID':'user_id',
        'Track':'track', 
        '  City  ':'city', 
        'Day':'day',
    }

)

In [8]:
# The result is checked: the list of column names.
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
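As an alternative to listing every header by hand, the cleanup can be sketched with the string methods of the column index. This is only an illustration on a made-up one-row table (`sample` is not part of the project data); note that stripping and lowercasing alone yields `userid`, so the snake-case name still needs an explicit mapping.

```python
import pandas as pd

# Toy frame reproducing the original header problems (the row is illustrative).
sample = pd.DataFrame(
    [["FFB692EC", "Kamigata To Boots", "Shelbyville", "Wednesday"]],
    columns=["  userID", "Track", "  City  ", "Day"],
)

# Strip surrounding spaces and lowercase every header in one pass.
sample.columns = sample.columns.str.strip().str.lower()

# Snake case cannot be inferred automatically: map 'userid' to 'user_id'.
sample = sample.rename(columns={"userid": "user_id"})

print(list(sample.columns))  # ['user_id', 'track', 'city', 'day']
```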


[Back to Contents](#back)

### 2.2. Missing values <a id='missing_values'></a>


In [9]:
# First find the number of missing values in the table.
# First method is with isna()
print(df.isna().sum())
# Second method is with isnull()
print(df.isnull().sum())

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64
user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64
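Absolute counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a small made-up table (`sample` and its values are illustrative): the mean of the boolean mask returned by `isna()` gives the fraction of missing values per column.

```python
import pandas as pd

# Toy data with gaps in 'track' and 'artist' (values are illustrative).
sample = pd.DataFrame({
    "track": ["Chains", None, "True", "Pessimist"],
    "artist": ["Obladaet", None, "Roman Messer", None],
    "genre": ["rusrap", "dance", "dance", "dance"],
})

# isna() yields True/False per cell; the column mean is the missing share.
missing_share = sample.isna().mean()
print(missing_share)
```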


The columns `'track'`, `'artist'`, and `'genre'` contain missing values, which we replace with the string `'unknown'`. To do this, we create the `columns_to_replace` list, loop through it with a `for` loop and replace the missing values in each of the columns:

In [10]:
# Loop over the column names and replace the missing values with 'unknown'
columns_to_replace=['track', 'artist', 'genre']
for element in columns_to_replace: # iterate only over the columns known to contain missing values
    df[element]=df[element].fillna('unknown') # fill the missing values in the current column

In [11]:
# counting missing values
print(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
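The same replacement can also be written without an explicit loop by selecting all the affected columns at once. A short sketch on made-up data (`sample` is illustrative; the list name follows the cell above):

```python
import pandas as pd

# Toy data with missing values in the three text columns.
sample = pd.DataFrame({
    "track": ["Chains", None, "True"],
    "artist": ["Obladaet", None, None],
    "genre": ["rusrap", "dance", None],
    "city": ["Shelbyville", "Springfield", "Springfield"],
})

columns_to_replace = ["track", "artist", "genre"]

# Fill the gaps in the selected columns in a single assignment.
sample[columns_to_replace] = sample[columns_to_replace].fillna("unknown")

print(sample.isna().sum())
```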


[Back to Contents](#back)

### 2.3. Duplicates <a id='duplicates'></a>


In [12]:
# Counting obvious duplicate data
print(df.duplicated().sum())

3826


In [13]:
# Eliminating obvious duplicates
df=df.drop_duplicates()

In [14]:
# Checking duplicates
print(df.duplicated().sum())

0
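`drop_duplicates()` keeps the original index labels of the surviving rows, which is why later outputs show 61253 rows with labels running up to 65078. If a contiguous index is preferred, it can be rebuilt as in this sketch (the toy `sample` table is an assumption for illustration):

```python
import pandas as pd

# Toy data: the second row repeats the first one.
sample = pd.DataFrame({
    "user_id": ["A", "A", "B"],
    "track": ["x", "x", "y"],
})

# Drop full-row duplicates, then renumber the index from 0 without gaps.
deduped = sample.drop_duplicates().reset_index(drop=True)

print(deduped)
```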


Now get rid of implicit duplicates in the 'genre' column.

In [15]:
# We sort the data in alphabetical order and search for unique values
column_genre_sorted=df['genre'].sort_values()
print(column_genre_sorted.unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisch' 

In [16]:
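# A function is designed for replacing implicit duplicates.
# It returns a new Series with the replacements applied; df itself is left unchanged.
def replace_wrong_genres(wrong_genres,correct_genre):
    # wrong_genres is a list and correct_genre a string, so this check is always true here
    if wrong_genres!=correct_genre:
        replace_wrongs_genres=df['genre'].replace(wrong_genres,correct_genre)
        return replace_wrongs_genres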


In [17]:
# The implicit duplicates are replaced in the returned Series (the result is not assigned back to df)
replace_wrong_genres(['hip','hop','hip-hop'],'hiphop')

0              rock
1              rock
2               pop
3              folk
4             dance
            ...    
65074           rnb
65075        hiphop
65076    industrial
65077          rock
65078       country
Name: genre, Length: 61253, dtype: object

In [18]:
# Checking implicit duplicates
replace_wrong_genres_sorted=replace_wrong_genres(['hip','hop','hip-hop'],'hiphop').sort_values()
print(replace_wrong_genres_sorted.unique())



['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisch' 
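The function above returns a new Series, so the `'genre'` column of `df` keeps its original labels. If the goal is to persist the replacement in the table, one possible variant assigns the result back. This is a sketch on a made-up table, with a hypothetical `frame` parameter added so the function does not rely on a global variable:

```python
import pandas as pd

def replace_wrong_genres(frame, wrong_genres, correct_genre):
    """Replace every label in wrong_genres with correct_genre in frame['genre']."""
    # Assigning back to the column is what makes the change persistent.
    frame["genre"] = frame["genre"].replace(wrong_genres, correct_genre)
    return frame

# Toy data with several spellings of the same genre.
sample = pd.DataFrame({"genre": ["hip", "hop", "hip-hop", "pop", "hiphop"]})
sample = replace_wrong_genres(sample, ["hip", "hop", "hip-hop"], "hiphop")

print(sorted(sample["genre"].unique()))  # ['hiphop', 'pop']
```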

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>

We detected three problems with the data:

- Incorrect header styles, where column names presented an inadequate presentation of the data, where capitalization and spaces were present.
- Missing values
- Obvious and implicit duplicates.

Header problems have been eliminated to make table processing easier.

All missing values have been replaced by `'unknown'`. But we still have to see if the missing values in `'genre'` affect our calculations.

The absence of duplicates will make the results more accurate and easier to understand.

Now we can continue testing the hypotheses.

[Back to Contents](#back)

## Step 3. Hypotheses testing <a id='hypotheses'></a>

### 3.1. Hypothesis 1: compare user behavior in the two cities <a id='activity'></a>

According to the first hypothesis, Springfield and Shelbyville users listen to music differently. Test this using data from three days of the week: Monday, Wednesday, and Friday.

* Divide the users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday and Friday.


We evaluate user activity in each city by grouping the data by city and counting the number of tracks played in each group.

In [19]:
# Counting the tracks played in each city
print(df.groupby('city')['track'].count())

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64



Springfield has played more tracks than Shelbyville. But that doesn't imply that Springfield citizens listen to music more often. This city is simply bigger and there are more users.
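One way to separate city size from listening intensity is to divide the number of plays by the number of distinct users in each city. The sketch below uses a made-up table (`sample` and its values are illustrative, not project results) and only shows the calculation:

```python
import pandas as pd

# Toy data: Springfield has more users, Shelbyville has one very active user.
sample = pd.DataFrame({
    "user_id": ["A", "A", "B", "C", "D", "D", "D"],
    "city": ["Springfield"] * 4 + ["Shelbyville"] * 3,
})

# Count plays and distinct users per city, then compute plays per user.
activity = sample.groupby("city")["user_id"].agg(plays="count", users="nunique")
activity["plays_per_user"] = activity["plays"] / activity["users"]

print(activity)
```

In this toy example Springfield has more plays in total, yet fewer plays per user than Shelbyville.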

Now group the data by day of the week and find the number of tracks played on Monday, Wednesday and Friday.

In [20]:
# calculating the tracks played on each of the three days
print(df.groupby('day')['track'].count())

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64


Wednesday was the quietest day of all. Next, we write a function that counts the tracks played on a given day in a given city.

In [21]:
# number_tracks() counts the tracks played on a given day in a given city.
# It takes two parameters: day= and city=.
# It keeps the df rows where the value in the 'day' column equals day= and,
# at the same time, the value in the 'city' column equals city=
# (logical indexing with a combined condition),
# then counts the 'user_id' values in the filtered rows with count()
# and returns that number.
# To see what it returns, wrap the function call in print().

def number_tracks(day,city):
    track_list_count=df[(df['day']==day)&(df['city']==city)]['user_id'].count()
    return track_list_count


In [22]:
# number of songs played in Springfield on Monday
print(number_tracks('Monday','Springfield'))

15740


In [23]:
# number of songs played in Shelbyville on Monday
print(number_tracks('Monday','Shelbyville'))

5614


In [24]:
# number of songs played in Springfield on Wednesday
print(number_tracks('Wednesday','Springfield'))

11056


In [25]:
# number of songs played in Shelbyville on Wednesday
print(number_tracks('Wednesday','Shelbyville'))

7003


In [26]:
# number of songs played in Springfield on Friday
print(number_tracks('Friday','Springfield'))

15945


In [27]:
# number of songs played in Shelbyville on Friday
print(number_tracks('Friday','Shelbyville'))

5895


`pd.DataFrame` is used to create a table, where
* The column names are: `['city', 'monday', 'wednesday', 'friday']`.
* The data are the results returned by `number_tracks()`.

In [28]:
result_number_tracks=[
    ['Springfield',number_tracks('Monday','Springfield'),number_tracks('Wednesday','Springfield'),number_tracks('Friday','Springfield')],
    ['Shelbyville',number_tracks('Monday','Shelbyville'),number_tracks('Wednesday','Shelbyville'),number_tracks('Friday','Shelbyville')],
    
]
first_hypothesis_columns=['city','monday','wednesday','friday']
first_hypothesis=pd.DataFrame(data=result_number_tracks,columns=first_hypothesis_columns)

In [29]:
# Results
print(first_hypothesis)

          city  monday  wednesday  friday
0  Springfield   15740      11056   15945
1  Shelbyville    5614       7003    5895


**Conclusions**

The data reveal differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays while there is a drop in activity on Wednesdays.
- In Shelbyville, on the contrary, users listen to more music on Wednesdays. User activity on Mondays and Fridays is lower.

So the first hypothesis seems to be correct.

[Back to Contents](#back)

### 3.2. Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday nights Springfield citizens listen to genres that differ from those that Shelbyville users enjoy.

Two tables are created:
* For Springfield - `spr_general`.
* For Shelbyville - `shel_general`.


In [30]:
# getting the spr_general table from the rows of df, 
# where the values in the 'city' column is 'Springfield'.
spr_general=df[df['city']=='Springfield']
spr_general.head()

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Springfield | 14:07:09 | Friday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Springfield | 08:34:34 | Monday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Springfield | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Springfield | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | L’estate | Julia Dalia | ruspop | Springfield | 09:17:40 | Friday |


In [31]:
# getting shel_general from rows df,
# where the value of the column 'city' is 'Shelbyville'.
shel_general=df[df['city']=='Shelbyville']
shel_general.head()

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Shelbyville | 20:28:33 | Wednesday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Shelbyville | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Shelbyville | 08:37:09 | Monday |
| 5 | 842029A1 | Chains | Obladaet | rusrap | Shelbyville | 13:09:41 | Friday |
| 9 | E772D5C0 | Pessimist | unknown | dance | Shelbyville | 21:20:49 | Wednesday |


A function with four parameters is designed:
* A table with the data (`data_frame_general`).
* The day of the week (`day`).
* The start of the time window in 'hh:mm' format (`time1`).
* The end of the time window in 'hh:mm' format (`time2`).

The function returns the 15 most popular genres on a given day within the period between the two timestamps.

In [34]:

def genre_weekday(data_frame_general,day, time1, time2):
   
    # consecutive filtering
    # genre_df will only store those rows of data_frame_general where the day is equal to day=
    genre_df =data_frame_general[(data_frame_general['day']==day)]

    # genre_df will only store those rows where the time is less than time2=
    genre_df =genre_df[(genre_df['time']<time2)] 

    # genre_df will only store those rows in which the time is greater than time1=
    genre_df =genre_df[(genre_df['time']>time1)] 
         
    # group the filtered DataFrame by the 'genre' column, take the 'user_id' column, and count the rows for each genre with the count() method
    genre_df_grouped = genre_df.groupby('genre')['user_id'].count() 

    # sort the result in descending order (so the most popular genres appear first in the Series object)
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False) 

    # return the Series object that stores the 15 most popular genres on a given day in a given time period
    return genre_df_sorted[:15]
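The filtering works because the `'time'` values are strings in 'hh:mm:ss' format, and such strings compare lexicographically in chronological order. A compact sketch of the same idea with a single combined mask is shown below; the name `top_genres`, the parameter `n` and the toy rows are assumptions for illustration. One edge case worth knowing: `'07:00:00' > '07:00'` is true in a string comparison, while `'11:00:00' < '11:00'` is false, so a play at exactly the lower bound is kept and one at exactly the upper bound is dropped.

```python
import pandas as pd

def top_genres(frame, day, time1, time2, n=15):
    """Return the n most played genres on `day` between time1 and time2."""
    # The times are strings, so '>' and '<' compare them character by character.
    mask = (frame["day"] == day) & (frame["time"] > time1) & (frame["time"] < time2)
    return frame.loc[mask, "genre"].value_counts().head(n)

# Toy playback log (values are illustrative).
sample = pd.DataFrame({
    "day": ["Monday", "Monday", "Monday", "Monday", "Friday", "Monday"],
    "time": ["08:34:34", "08:37:09", "09:00:00", "11:00:00", "08:00:00", "07:00:00"],
    "genre": ["dance", "folk", "dance", "pop", "rock", "rock"],
})

result = top_genres(sample, "Monday", "07:00", "11:00")
print(result)
```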

In [35]:
# function for Monday morning in Springfield (using spr_general instead of the df table)
genre_weekday(spr_general,'Monday','07:00','11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hip            281
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: user_id, dtype: int64

In [36]:
# function for Monday morning in Shelbyville (using shel_general instead of the df table)
genre_weekday(shel_general,'Monday','07:00','11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hip             79
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: user_id, dtype: int64

In [37]:
# function for Friday evening in Springfield
genre_weekday(spr_general,'Friday','17:00','23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hip            267
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: user_id, dtype: int64

In [38]:
# function for Friday evening in Shelbyville
genre_weekday(shel_general,'Friday','17:00','23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hip             94
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: user_id, dtype: int64

**Conclusions**

Having compared the 15 most popular Monday morning genres we can conclude the following:


1. Springfield and Shelbyville users listen to similar music. The five most popular genres are the same, only rock and electronic have swapped positions.


2. In Springfield the number of missing values turned out to be so high that the `'unknown'` value took tenth place. This means that missing values make up a considerable part of the data, which calls the reliability of our conclusions into question.


For Friday evening, the situation is similar. The individual genres vary somewhat but, in general, the 15 most popular genres are similar in the two cities.


Thus, the second hypothesis has been partially proven:
* Users listen to similar music at the beginning and end of the week.
* There is not a big difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.


However, the number of missing values makes this result somewhat questionable. In Springfield, there are so many that they affect our top 15 most popular genres. Had we not been missing those values, things might look different.
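The comparison of the two rankings can be made explicit with a small check. The sketch below copies the Monday morning top 5 of each city from the outputs above and reports which genres are shared and which sit at the same rank (the variable names are illustrative):

```python
# Top 5 genres on Monday morning, copied from the outputs above.
spr_top = ["pop", "dance", "electronic", "rock", "hip"]
shel_top = ["pop", "dance", "rock", "electronic", "hip"]

# Genres that appear in both lists, regardless of position.
shared = set(spr_top) & set(shel_top)

# Genres that occupy the same position in both lists.
same_rank = [g for g, h in zip(spr_top, shel_top) if g == h]

print(sorted(shared))
print(same_rank)  # ['pop', 'dance', 'hip']
```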

[Back to Contents](#back)

### 3.3. Hypothesis 3: genre preferences in two cities <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. The citizens of Springfield like pop more.

In [39]:
# in one line: group the spr_general table by the 'genre' column, 
# count the 'genre' values with count() in the grouping, 
# sort the resulting series in descending order, and store it in spr_genres
spr_genres=spr_general.groupby('genre')['genre'].count().sort_values(ascending=False)

**Prints the first 10 rows of `spr_genres`:**

In [40]:
# printing the first 10 rows of spr_genres
print(spr_genres.head(10))

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hip            2041
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64


In [41]:
# in one line: group the shel_general table by the 'genre' column, 
# count the 'genre' values in the grouping with count(), 
# sort the resulting Series in descending order and store it in shel_genres
shel_genres=shel_general.groupby('genre')['genre'].count().sort_values(ascending=False)

**Prints the first 10 rows of  `shel_genres`:**

In [42]:
# printing the first 10 rows of shel_genres
print(shel_genres.head(10))

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hip             934
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64


**Conclusion**

The hypothesis has been partially proven:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music has turned out to be just as popular in Springfield as in Shelbyville and rap was not in the top 5 most popular in either city.
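Because Springfield has more plays overall, raw counts are hard to compare between the cities. A rough sketch of the comparison in relative terms, using the pop counts from the two outputs above and the per-city totals from section 3.1 (the dictionary names are illustrative):

```python
# Pop plays per city (from the genre tables above).
pop_plays = {"Springfield": 5892, "Shelbyville": 2431}

# Total plays per city (from the count by city in section 3.1).
total_plays = {"Springfield": 42741, "Shelbyville": 18512}

# Share of all plays in each city that are pop tracks.
pop_share = {city: pop_plays[city] / total_plays[city] for city in pop_plays}

for city, share in pop_share.items():
    print(f"{city}: {share:.1%}")
```

Both shares fall between 13% and 14%, which is consistent with pop being about equally popular in the two cities.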


[Back to Contents](#back)

# Conclusions <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and on the city. 
2. On Monday mornings Springfield and Shelbyville residents listen to different genres. The same is true for Friday nights.
3. Springfield and Shelbyville listeners have different preferences. In Springfield they prefer pop while in Shelbyville there are more fans of rap.

After analyzing the data, we conclude:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the pattern differs between the cities. 

The first hypothesis is fully accepted. In Springfield, the number of songs played peaks on Mondays and Fridays and drops on Wednesdays. In Shelbyville, on the contrary, users listen to more music on Wednesdays than on Mondays and Fridays.

2. Music preferences do not vary significantly over the course of the week in Springfield and Shelbyville. We can observe small differences in the order on Mondays, but:
* In Springfield and Shelbyville people listen most to pop music.

So we cannot accept this hypothesis. The result could have been different if it were not for the missing values. Still, on Monday mornings the two most played genres in both cities are pop and dance, and the remaining genres in the top 5 are the same, only in a different order. Therefore, there are no significant differences in musical preferences.


3. It turns out that the musical preferences of Springfield and Shelbyville users are quite similar.


The third hypothesis is rejected. If there is any difference in preferences, it cannot be seen in the data, since rap is not in the top 5 in either city.



[Back to Contents](#back)