# Music Recommendation System in YMusic

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Step 1. Data Review](#data_review)
    * [1.1 Data Review Conclusions](#data_review_conclusions)
* [Step 2. Data Pre-processing](#data_preprocessing)
    * [2.1 Header Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Data Pre-processing Conclusions](#data_preprocessing_conclusions)
* [Step 3. Hypothesis](#hypotheses)
    * [3.1 Hypothesis 1: User activity between 2 cities](#activity)
    * [3.2 Hypothesis 2: Music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: Genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## **Introduction** <a id='intro'></a>
Whenever we do research, we need to formulate a hypothesis that we can test. Sometimes we accept the hypothesis, and sometimes we reject it. To make the right decisions, a business must be able to tell whether its assumptions were correct.

In this project, I will compare the music preferences of users in Springfield and Shelbyville. I will study actual Y.Music data to test the hypotheses below and compare user behavior in the two cities.

### Goals:
Test these hypotheses:
1. User activity varies depending on the day and city.
2. On Monday morning, Springfield and Shelbyville residents tune in to different genres. This also applies to Friday night.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, they prefer pop music, while in Shelbyville, rap music has more fans.

### Steps
Data on user behavior is stored in `/content/music_project_en.csv`. There is no information about the quality of the data, so it needs to be checked before testing the hypotheses.

First, I will evaluate the quality of the data and see whether any problems are significant. Then, during data pre-processing, I will try to account for the most serious problems.

This project will consist of:
 1. Data Review
 2. Data Pre-processing
 3. Hypothesis


[Back to Contents](#back)

## **Step 1. Data Review** <a id='data_review'></a>

Open the Y.Music data and explore it.

In [None]:
# Import Pandas
import pandas

# for importing files to Google Colab
#from google.colab import files

In [None]:
#upload datasets
#uploaded = files.upload()

Read the `music_project_en.csv` file from the `/content/` folder and store it in the `df` variable:

In [None]:
# Read the file and store it to df variable
df = pandas.read_csv('/content/music_project_en.csv')

Show first 10 table rows:

In [None]:
# Show 10 rows data from df table
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


Get general information about the table with a single command:

In [None]:
# Get general information from the data using df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


This table consists of 7 columns. All of them are stored with the same data type, `object`.

According to the documentation:
- `'user_id'`: user identifier
- `'track'`: track title
- `'artist'`: artist name
- `'genre'`: music genre
- `'city'`: city where the user is located
- `'time'`: time at which the track was played
- `'day'`: day of the week

There are three problems with the style of the column names:
1. Some names are uppercase, some are lowercase.
2. Some names contain spaces.
3. `userID` joins two words without using `snake_case`.

The non-null counts also differ between columns: `Track`, `artist`, and `genre` have fewer values than the other columns, which means the dataset contains missing values.

### **1.1 Data Review Conclusions** <a id='data_review_conclusions'></a>

Each row in the table stores data on one track that was played. Several columns describe the track itself: its title, `artist`, and `genre`. The rest convey information about the user: their city, and the day and time at which they played the track.

The data appears sufficient to test the hypotheses. However, there are some missing values.

Next, we need to pre-process the data.

[Back to Content](#back)

## **Step 2. Data Pre-processing** <a id='data_preprocessing'></a>
We need to correct the formatting of the column headings and deal with the missing values. Then we check the data for duplicates.

### **2.1 Header Style** <a id='header_style'></a>
Show the column names:

In [None]:
# list of column name in df
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Rename the columns according to the rules of good writing style:
* If the name has several words, use `snake_case`
* All characters must be lowercase
* Remove spaces

In [None]:
# rename column name
df = df.rename(
    columns={
        '  userID': 'user_id',
        'Track': 'track',
        '  City  ': 'city',
        'Day': 'day'
    }
)

Check the results. Show the column names once again:

In [None]:
# check result: column name list
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
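
For tables with many columns, writing the mapping by hand gets tedious. As a rough sketch, the same three rules can be applied programmatically. The helper `to_snake_case()` and the empty `sample` table below are hypothetical and only mirror the column names shown above:

```python
import re
import pandas as pd

def to_snake_case(name):
    # Rule 3: remove surrounding spaces
    name = name.strip()
    # Rule 1: insert an underscore where a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    # Rule 2: lowercase everything
    return name.lower()

# Hypothetical empty table with the raw column names from this notebook
sample = pd.DataFrame(
    columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
)

# rename() accepts a function and applies it to every column name
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```

An explicit dictionary, as used above, remains the safer choice when only a few names need to change, because every rename is visible in the code.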

[Back to Content](#back)

### **2.2 Missing Values** <a id='missing_values'></a>
First, we have to find the number of missing values in the table. To do so, chain two `Pandas` methods:

In [None]:
# sum up the missing values
df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64
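
Raw counts are easier to judge as shares of all rows. The sketch below is illustrative only: it uses a small made-up `sample` table rather than the project data, and then redoes the arithmetic for `genre` with the counts printed above (1198 missing values out of 65079 rows).

```python
import pandas as pd

# Made-up sample, not the project data
sample = pd.DataFrame({
    'track': ['a', None, 'c', 'd'],
    'genre': ['pop', 'rock', None, None],
    'city': ['Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
})

# isna() gives True/False per cell; the mean of a boolean column
# is the share of True values, i.e. the share of missing entries
missing_share = sample.isna().mean()
print(missing_share)

# The same arithmetic with the counts from the notebook output above
genre_missing_share = 1198 / 65079
print(f'{genre_missing_share:.1%}')
```

By this arithmetic, a little under 2% of the rows have no genre.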

Not all missing values affect this research. For example, the missing values in `track` and `artist` are not that important, so we can simply replace them with a clear marker.
Missing values in `'genre'`, however, can influence the comparison of musical preferences in Springfield and Shelbyville. In real life, it would be useful to find out why the data is missing and try to recover it. We don't have that opportunity in this project, so we have to:
* Fill these missing values with a marker
* Evaluate how much the missing values can affect our calculations

Replace missing values in `'track'`, `'artist'` and `'genre'` with the string `'unknown'`. To do this, create a `columns_to_replace` list, *loop* over it with `for`, and replace the missing values in each column:

In [None]:
# loop over the column names and replace missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')


Make sure that no columns still contain missing values. To do so, count the missing values again.

In [None]:
# sum up the missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

[Back to Content](#back)

### **2.3 Duplicates** <a id='duplicates'></a>
Find the number of explicit duplicates in the table using a single command:

In [None]:
# sum up explicit duplicates
df.duplicated().sum()

3826

Call `Pandas` method to remove explicit duplicates:

In [None]:
# remove explicit duplicates
df = df.drop_duplicates().reset_index(drop=True)

Sum up the explicit duplicates again to make sure we've removed them all:

In [None]:
# check the duplicates
df.duplicated().sum()

0
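
To see what counts as an explicit duplicate, here is a small sketch on a made-up table (the `toy` rows are hypothetical). `duplicated()` marks a row only when every column matches an earlier row, so the first occurrence is kept:

```python
import pandas as pd

# Made-up rows: the second row repeats the first one exactly
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2'],
    'track': ['Chains', 'Chains', 'True'],
    'time': ['13:09:41', '13:09:41', '13:00:07'],
})

# Number of rows that repeat an earlier row in every column
n_duplicates = toy.duplicated().sum()

# Drop the repeats and renumber the index from 0
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```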

Now we remove the implicit duplicates in the `genre` column. For example, the same genre name can be written in different ways. Errors like these will also affect the results.

Show a list of unique genre names, sorted in ascending order. To do this:
* Fetch the DataFrame column that holds the data
* Call the method that returns all unique values in the column
* Sort the resulting values

In [None]:
# show all unique genre names
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

Look through the *list* to find implicit duplicates of the `hiphop` genre. These can be misspelled names or alternative names for the same genre.

We'll see the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To remove them, write a `replace_wrong_genres()` function with two parameters:
* `wrong_genres=` — list of duplicates
* `correct_genre=` — string with the correct value

The function must correct the name in the `'genre'` column of the `df` table, i.e. replace each value from the `wrong_genres` list with the value in `correct_genre`.

In [None]:
# function to replace implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    # replace each wrong spelling in turn; this modifies the global df
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)


Call `replace_wrong_genres()` and pass it the arguments so that it removes the implicit duplicates (`hip`, `hop`, and `hip-hop`) and replaces them with `hiphop`:

In [None]:
# remove implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
name = 'hiphop'
replace_wrong_genres(duplicates, name)
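
As an aside, `Series.replace()` also accepts a list of values to replace, so the loop can be avoided. A minimal sketch on a made-up `genres` Series (not the project data):

```python
import pandas as pd

# Made-up genre values containing the three alternative spellings
genres = pd.Series(['hip', 'pop', 'hop', 'hip-hop', 'rock', 'hiphop'])

# Every value found in the list is replaced with 'hiphop';
# only whole values are matched, so 'hiphop' itself is left alone
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')

unique_sorted = sorted(cleaned.unique())
print(unique_sorted)
```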

Ensure duplicate names have been removed. Show a list of unique values from the `'genre'` column:

In [None]:
# check implicit duplicates
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

[Back to Content](#back)

### **2.4 Data Pre-processing Conclusions** <a id='data_preprocessing_conclusions'></a>

We detected three problems with the data:

- Inconsistent column name style
- Missing values
- Explicit and implicit duplicates

The column names have now been cleaned up to make the table easier to work with.
All missing values have been replaced with `'unknown'`. But we still have to see whether the missing values in `'genre'` will affect our calculations.

With the duplicates removed, the results will be more precise and easier to understand.

We can now proceed to hypothesis testing.

[Back to Content](#back)

## **Step 3. Hypothesis** <a id='hypotheses'></a>

### **3.1 Hypothesis 1: User activity between 2 cities** <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville differ in how they listen to music. This test uses data for three days: Monday, Wednesday, and Friday.

* Divide the users into groups by city.
* Compare how many songs each group played on Monday, Wednesday and Friday.



For practice, do each calculation separately.

Evaluate user activity in each city. Group the data by city and find the number of songs played in each group.


In [None]:
# Count the songs played in each city
number_song = df.groupby('city')['track'].count()
number_song

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

Springfield has more songs played than Shelbyville. But that doesn't mean that Springfield residents listen to music more often. This city is bigger, and has more users.
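
One way to check this is to count unique users next to plays for each city. The sketch below uses a made-up `toy` table and hypothetical output column names (`plays`, `users`, `plays_per_user`); it has not been run on the project data:

```python
import pandas as pd

# Made-up plays: two users in Springfield, one user in Shelbyville
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'C3', 'C3'],
    'track': ['t1', 't2', 't3', 't4', 't5', 't6'],
    'city': ['Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville', 'Shelbyville'],
})

# Named aggregation: count plays and distinct users per city
per_city = toy.groupby('city').agg(
    plays=('track', 'count'),
    users=('user_id', 'nunique'),
)

# Average number of plays per user in each city
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```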

Now, group the data by day and find the number of songs played on Monday, Wednesday and Friday.



In [None]:
# Count the tracks played on each day
# (named tracks_per_day so it does not clash with the number_tracks() function defined later)
tracks_per_day = df.groupby('day')['track'].count()
tracks_per_day

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

You've seen how grouping by city or day works. Now write a function that filters by both.

Create a `number_tracks()` function to count the number of songs played for a given day and city. It takes two parameters:
* name of the day
* city name


Within the function, use a variable to store the rows from the original table, where:
* The `'day'` column value is equal to the `day` parameter
* The `'city'` column value is equal to the `city` parameter

Apply sequential filtering with logical indexing.

Then count the values in the `'user_id'` column of the resulting table. Save the result to a new variable and return it from the function.

In [None]:
def number_tracks(day, city):
    # keep only the rows for the given day
    track_list = df[df['day'] == day]
    # of those, keep only the rows for the given city
    track_list = track_list[track_list['city'] == city]
    # count the plays that remain
    track_list_count = track_list['user_id'].count()
    return track_list_count

Call `number_tracks()` six times and change the value of the parameter, so that you can retrieve data for both cities for each of these days.

In [None]:
# number of songs played at Springfield on Monday
number_tracks('Monday', 'Springfield')

15740

In [None]:
# number of songs played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614

In [None]:
# number of songs played at Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056

In [None]:
# number of songs played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003

In [None]:
# number of songs played at Springfield on Friday
number_tracks('Friday', 'Springfield')

15945

In [None]:
# number of songs played in Shelbyville on Friday
number_tracks('Friday', 'Shelbyville')

5895
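
Calling the function once per combination works well for practice. As an alternative sketch, grouping by both columns and unstacking the result gives all the counts at once. The `toy` table below is made up and stands in for `df`:

```python
import pandas as pd

# Made-up plays standing in for the project data
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3', 'D4', 'E5'],
    'city': ['Springfield', 'Springfield', 'Shelbyville',
             'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Monday', 'Monday', 'Friday', 'Friday'],
})

# Count plays for every (city, day) pair, then move the days into columns;
# fill_value=0 covers any pair that has no rows
counts = (
    toy.groupby(['city', 'day'])['user_id']
    .count()
    .unstack(fill_value=0)
)
print(counts)
```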

Use `pandas.DataFrame` to create a table, where
* Column names are: `['city', 'monday', 'wednesday', 'friday']`
* Data is the result you get from `number_tracks()`

In [None]:
# table with results

# values returned by the six number_tracks() calls above
data1 = [
    ['Springfield', 15740, 11056, 15945],
    ['Shelbyville', 5614, 7003, 5895]
]


titles = ['city', 'monday', 'wednesday', 'friday']

cities_df = pandas.DataFrame(data=data1, columns=titles)

cities_df

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Monday and Friday, while activity drops on Wednesday.
- In Shelbyville, on the other hand, users listen to more music on Wednesday, and there is less activity on Monday and Friday.

[Back to Content](#back)

### **3.2 Hypothesis 2: Music preferences on Monday and Friday** <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, Springfield residents listen to different genres from those enjoyed by the residents of Shelbyville.

Create two tables (make sure the table names match those used in the two code blocks below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [None]:
# get table spr_general from df rows,
# where the value of 'city' column is 'Springfield'

spr_general = df[df['city']=='Springfield'].reset_index(drop=True)
spr_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
1,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
2,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
3,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
4,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
42736,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
42737,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
42738,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
42739,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [None]:
# get table shel_general from df rows,
# where the value of 'city' column is 'Shelbyville'

shel_general = df[df['city']=='Shelbyville'].reset_index(drop=True)
shel_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
2,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
3,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
4,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
18507,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
18508,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
18509,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
18510,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


Write a `genre_weekday()` function with four parameters:
* A table with the data
* Name of the day
* Start time, in 'hh:mm' format
* End time, in 'hh:mm' format

The function should return the 15 most popular genres on a given day in the period between the two times.

In [None]:
def genre_weekday(data, day, time1, time2):

    # consecutive filters
    # keep only the rows where the day matches the day parameter
    genre_df = data[data['day'] == day]

    # keep only the rows where time is earlier than time2
    genre_df = genre_df[genre_df['time'] < time2]

    # keep only the rows where time is later than time1
    genre_df = genre_df[genre_df['time'] > time1]

    # group the filtered rows by genre and count the rows in each group
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()

    # sort in descending order so that the most popular genre comes first
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)

    # return the 15 most popular genres for the given day and time window
    return genre_df_sorted[:15]
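
The time filter above compares strings rather than time objects. This works because the `time` values are zero-padded `hh:mm:ss` strings, which sort alphabetically in the same order as the times they represent. The sketch below shows this on a made-up `toy` table, using a hypothetical compact variant, `genre_weekday_compact()`, that combines the three filters into one mask and counts genres with `value_counts()`:

```python
import pandas as pd

def genre_weekday_compact(data, day, time1, time2, top_n=15):
    # One boolean mask instead of three consecutive filters;
    # the time comparisons are plain string comparisons
    mask = (
        (data['day'] == day)
        & (data['time'] > time1)
        & (data['time'] < time2)
    )
    # value_counts() counts each genre and sorts in descending order
    return data.loc[mask, 'genre'].value_counts().head(top_n)

# Made-up plays: three fall inside Monday 07:00 to 11:00
toy = pd.DataFrame({
    'genre': ['pop', 'pop', 'rock', 'pop', 'jazz'],
    'time': ['08:34:34', '10:59:59', '07:00:01', '11:30:00', '09:00:00'],
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
})

result = genre_weekday_compact(toy, 'Monday', '07:00', '11:00')
print(result)
```

The order of genres with equal counts may differ between this variant and the `groupby()` version, so the two are not guaranteed to return identical rankings when there are ties.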


Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (7:00 am to 11:00 am) and Friday night (5:00 pm to 11:00 pm):

In [None]:
# call function for Monday morning in Springfield (using spr_general instead of df table)
genre_weekday(spr_general, 'Monday', '07:00', '11:00')


genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [None]:
# call function for Monday morning in Shelbyville (use shel_general instead of df table)
genre_weekday(shel_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [None]:
# call function for Friday night in Springfield
genre_weekday(spr_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64

In [None]:
# call the function for Friday night in Shelbyville
genre_weekday(shel_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64

**Conclusions**

After comparing the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to the same genres. The top five genres are the same; only rock and electronic have switched places.

2. In Springfield, there were so many missing values that `'unknown'` came in 10th place. Missing values therefore make up a considerable share of the data, which may be grounds for questioning the reliability of our conclusions.

On Friday night, the situation is similar. The positions of individual genres vary somewhat, but overall, the top 15 genres in the two cities are almost the same.

Thus, the second hypothesis is only partially supported:
* Users listen to similar music at the beginning and at the end of the week.
* There are no notable differences between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. Without these missing values, the results might have been different.



[Back to Content](#back)

### **3.3 Hypothesis 3: Genre preferences in Springfield and Shelbyville** <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield residents prefer pop music.


Group the `spr_general` table by genre and find the number of songs played for each genre using `count()` method. Then sort the results in descending order and save to `spr_genres`.

In [None]:
# group the spr_general table by the 'genre' column,
# count the 'genre' values in each group with count(),
# sort the resulting Series in descending order, then save it to spr_genres

spr_genres = spr_general.groupby('genre')['genre'].count()
spr_genres = spr_genres.sort_values(ascending=False)
spr_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
               ... 
metalcore         1
marschmusik       1
malaysian         1
lovers            1
ïîï               1
Name: genre, Length: 250, dtype: int64

Show 10 first rows from `spr_genres`:

In [None]:
# show 10 first rows from `spr_genres`:
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now, do the same with Shelbyville data.

Group the `shel_general` table by genre and find the number of songs played for each genre. Then sort the results in descending order and save them to the `shel_genres` table:

In [None]:
# group the shel_general table by the 'genre' column,
# count the 'genre' values in each group with count(),
# sort the resulting Series in descending order and save it to shel_genres
shel_genres = shel_general.groupby('genre')['genre'].count()
shel_genres = shel_genres.sort_values(ascending=False)
shel_genres

genre
pop           2431
dance         1932
rock          1879
electronic    1736
hiphop         960
              ... 
mandopop         1
leftfield        1
laiko            1
jungle           1
worldbeat        1
Name: genre, Length: 202, dtype: int64

Show 10 first rows from `shel_genres`:

In [None]:
# show 10 first rows from `shel_genres`:
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64
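
Because Springfield has more than twice as many plays as Shelbyville, raw counts are hard to compare directly. As a rough sketch, the top five counts printed above can be turned into shares of each city's total plays (42741 and 18512, taken from the city counts earlier in this notebook). The variable names are hypothetical:

```python
import pandas as pd

# Top five genre counts copied from the outputs above
spr_top = pd.Series(
    {'pop': 5892, 'dance': 4435, 'rock': 3965, 'electronic': 3786, 'hiphop': 2096}
)
shel_top = pd.Series(
    {'pop': 2431, 'dance': 1932, 'rock': 1879, 'electronic': 1736, 'hiphop': 960}
)

# Total plays per city, from the groupby('city') output earlier
spr_total = 42741
shel_total = 18512

# Share of each city's plays that falls into each genre
shares = pd.DataFrame({
    'springfield': spr_top / spr_total,
    'shelbyville': shel_top / shel_total,
})
print(shares.round(3))
```

On the full tables, `value_counts(normalize=True)` on the `genre` column would give the same kind of shares without copying numbers by hand.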

**Conclusions**

The hypothesis is partially proven:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be the most popular genre in Shelbyville as well, and rap was not in the top 5 in either city.

[Back to Content](#back)

# **Findings** <a id='end'></a>


We have tested the following hypotheses:

1. User activity varies depending on the day and city.
2. On Monday morning, residents of Springfield and Shelbyville tune in to different genres. This also applies to Friday night.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, they prefer pop music, while in Shelbyville, rap music has more fans.

After analyzing the data, we can conclude:

1. User activity in Springfield and Shelbyville depends on the day of the week, and the pattern differs between the two cities.

The first hypothesis can be fully accepted.

2. Musical preferences do not vary much over the week in either Springfield or Shelbyville. There is a small difference in the order of genres on Monday, but:
* In both Springfield and Shelbyville, pop is the most popular genre.

So we cannot accept this hypothesis. We also have to remember that the results might have been different had it not been for the missing values.

3. The music preferences of users from Springfield and Shelbyville turn out to be very similar.

The third hypothesis is rejected. If there is a difference in preferences, it cannot be seen in this data.

[Back to Content](#back)