# Y.Music

# Content <a id='back'></a>

* [Intro](#intro)
* [Stage 1. Data Overview](#data_review)
 * [Conclusion](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
 * [2.1 Heading Style](#header_style)
 * [2.2 Missing Values](#missing_values)
 * [2.3 Duplicates](#duplicates)
 * [2.4 Conclusion](#data_preprocessing_conclusions)
* [Stage 3. Hypothesis Testing](#hypotheses)
 * [3.1 Hypothesis 1: User activity in both cities](#activity)
 * [3.2 Hypothesis 2: Music preference on Mondays and Fridays](#week)
 * [3.3 Hypothesis 3: Genre preferences in the cities of Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Every time we carry out an analysis, we need to formulate several hypotheses and then test them. Sometimes the tests lead us to accept a hypothesis, and sometimes we have to reject it. To make the right decisions in business, we must understand whether the assumptions we make are correct or not.

In this project, you will compare the music preferences of users in the cities of Springfield and Shelbyville. You'll study real Y.Music data to test the hypotheses below and compare user behavior in these two cities.

### Objective:
Testing three hypotheses:
1. User activity varies depending on the day and city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This also applies on Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, users prefer pop music, while in Shelbyville rap music has more fans.

### Stages
Data about user behavior is stored in the *file* `/datasets/music_project_en.csv`. There is no information regarding the quality of the data, so you need to check it first before testing your hypothesis.

First, you'll evaluate the quality of the data and see whether any problems are significant. Then, during data pre-processing, you will try to address the most serious ones.

The project will consist of three stages:
 1. Data Overview
 2. Data Pre-processing
 3. Hypothesis testing


[Back to Content](#back)

## Stage 1. Data Overview <a id='data_review'></a>


In [2]:
# import pandas
import pandas as pd

In [3]:
df=pd.read_csv('/datasets/music_project_en.csv')

In [4]:
df

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
...,...,...,...,...,...,...,...
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65075,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hip,Shelbyville,10:00:00,Monday
65076,C5E3A0D5,Jalopiina,,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [5]:
# summary statistics for each column of df
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,08:14:07,Friday
freq,76,136,136,8850,45360,14,23149


In [6]:
# get the first 10 rows of the df table
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [7]:
# get general information about the data in df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table has seven columns, and they all store the same data type: `object`. The number of non-null values differs between columns.

Based on the documentation:
- `'userID'` — User ID
- `'Track'` — song track title
- `'artist'` — artist name
- `'genre'` — track genre
- `'City'` — the city where the user is located
- `'time'` — the time of day the track was played
- `'Day'` — day of the week

We can see three problems with the column name style:
1. Some names are written in uppercase, some in lowercase.
2. Some names contain spaces.
3. `userID` joins two words without a separator instead of using snake_case.

The names that need to change:
- `'userID'` is padded with spaces and is not in snake_case; rename it to `'user_id'`.
- `'Track'` becomes `'track'`.
- `'City'` is padded with spaces; rename it to `'city'`.
- `'Day'` becomes `'day'`.

The non-null counts differ between columns, which indicates that the data contains missing values.

### Conclusion <a id='data_review_conclusions'></a>

Each row in the table stores data related to the song track being played. Many columns hold data that describes the track itself: track title, artist, and genre. The rest stores data related to user information: their hometown, and the time they played the song track.

It is clear that the data we have is sufficient to test the hypothesis. Nevertheless, we have missing values.

To continue the analysis, we need to pre-process the data first.

## Stage 2. Data Pre-processing <a id='data_preprocessing'></a>


### Heading Style <a id='header_style'></a>


In [8]:
# list of column names in the df table
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Rename columns according to the rules of good writing style:
* If the name has multiple words, use snake_case
* All characters must be in lower case
* Remove spaces

In [9]:
# rename the columns
df = df.rename(columns={'  userID' : 'user_id', 'Track' : 'track', '  City  ' : 'city', 'Day' : 'day'})

df

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
...,...,...,...,...,...,...,...
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65075,D08D4A55,Maybe One Day (feat. Black Spade),Blu & Exile,hip,Shelbyville,10:00:00,Monday
65076,C5E3A0D5,Jalopiina,,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [10]:
# check the result: display the list of column names once more
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
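The three naming rules above can also be applied programmatically instead of through an explicit mapping. The following is a minimal sketch, not the notebook's method: `to_snake_case` is a hypothetical helper, and the empty `raw` table only reproduces the original column names.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Hypothetical helper: strip spaces, split camelCase, and lowercase."""
    name = name.strip()
    # insert an underscore where a lowercase letter or digit meets an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# an empty table with the same column names as the raw data
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# rename() accepts a function and applies it to every column name
cleaned = raw.rename(columns=to_snake_case)
print(list(cleaned.columns))
```

An explicit mapping, as used in the notebook, is easier to audit for a handful of columns; a helper like this becomes more useful when a table has many columns.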

[Back to Content](#back)

### Missing Values <a id='missing_values'></a>

In [11]:
# count the missing values
df.isna().sum()

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the research. For example, missing values in the `track` and `artist` columns are not that important. You can simply replace them with a clear marker.
However, missing values in the `'genre'` column could affect the comparison of music preferences in Springfield and Shelbyville. In real life, it is useful to learn why the data was lost and try to fix it. Unfortunately, we do not have that opportunity in this project. Therefore, you must:
* Fill in missing values with markers
* Evaluate how much the missing values may impact your calculations

Replace missing values in the `'track'`, `'artist'`, and `'genre'` columns with the *string* `'unknown'`. To do this, create a *list* `columns_to_replace`, apply a `for` *loop* to the *list*, and replace the missing values in each column:

In [12]:
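# loop over the column names and replace missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    # assign the result back: fillna(inplace=True) on a selected column
    # is deprecated in recent pandas versions
    df[column] = df[column].fillna('unknown')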

In [13]:
# count the missing values
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
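To evaluate how much the missing values might matter, it can help to look at their share per column rather than the raw counts. The sketch below uses a small hypothetical table `toy` in place of the real data; it also shows that the same fill can be done without a loop by passing a dictionary to `fillna()`.

```python
import pandas as pd

# hypothetical toy data standing in for the real table
toy = pd.DataFrame({
    'track': ['a', None, 'c', 'd'],
    'artist': [None, None, 'x', 'y'],
    'genre': ['pop', 'rock', None, 'pop'],
    'city': ['Springfield', 'Shelbyville', 'Springfield', 'Springfield'],
})

# share of missing values in each column (isna() gives booleans, mean() gives the fraction)
missing_share = toy.isna().mean()
print(missing_share)

# fill several columns at once: the dictionary maps column name -> fill value
columns_to_replace = ['track', 'artist', 'genre']
filled = toy.fillna({column: 'unknown' for column in columns_to_replace})
print(filled.isna().sum())
```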

### Duplicates <a id='duplicates'></a>

In [14]:
# count the explicit duplicates
df.duplicated().sum()

3826

Call the pandas method that removes explicit duplicates:

In [15]:
# remove the explicit duplicates
df = df.drop_duplicates()
df.duplicated().sum()

0

Count the explicit duplicates once more to make sure you have removed them all:

In [16]:
# check for duplicates
df.duplicated().sum()

0
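As a small illustration of what counts as an explicit duplicate, the sketch below uses a hypothetical three-column table. `duplicated()` marks every repeat of an earlier row, and `drop_duplicates()` keeps the first occurrence. Resetting the index afterwards is optional; the notebook does not do it, so its row labels keep their original values.

```python
import pandas as pd

# hypothetical toy data: rows 1 and 3 repeat row 0 exactly
plays = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'A'],
    'track': ['t1', 't1', 't2', 't1'],
    'time': ['08:00:00', '08:00:00', '09:00:00', '08:00:00'],
})

# number of rows that repeat an earlier row in every column
n_duplicates = plays.duplicated().sum()

# keep the first occurrence of each row and renumber the index from 0
deduplicated = plays.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```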

Now remove the implicit duplicates in the `genre` column. For example, a genre name can be written in different ways. Errors like this will also affect your results.

Display a *list* containing unique genre names, then sort the *list* alphabetically. To do so:
* Retrieve the column of the DataFrame in question
* Apply the sorting method to the column
* For a sorted column, call a method that returns all unique values of the column

In [17]:
# view the unique genre names
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo

Look carefully at the *list* to find implicit duplicates of the `hiphop` genre. The duplicate could be an incorrectly written name or an alternative name of the same genre.

You will see the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To remove them, use the `replace_wrong_genres()` function with two parameters:
* `wrong_genres=` — *list* of duplicates to replace
* `correct_genre=` — *string* with the correct value

The function should correct the names in the `'genre'` column of the `df` table by replacing each value from the *list* `wrong_genres` with a value from `correct_genre`.

In [18]:
# function that replaces implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

Call `replace_wrong_genres()` and pass arguments to the function, so it can remove implicit duplicates (`hip`, `hop`, and `hip-hop`) and replace them with `hiphop`:

In [43]:
# call the function that replaces implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
genre = 'hiphop'
replace_wrong_genres(duplicates, genre)
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['genre'] = df['genre'].replace(wrong_genre, correct_genre)


Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday


Make sure that duplicate values have been removed. Display a *list* of unique values from the `'genre'` column:

In [20]:
# check for implicit duplicates
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'ambient', 'americana', 'animated', 'anime',
       'arabesk', 'arabic', 'arena', 'argentinetango', 'art', 'audiobook',
       'avantgarde', 'axé', 'baile', 'balkan', 'beats', 'bigroom',
       'black', 'bluegrass', 'blues', 'bollywood', 'bossa', 'brazilian',
       'breakbeat', 'breaks', 'broadway', 'cantautori', 'cantopop',
       'canzone', 'caribbean', 'caucasian', 'celtic', 'chamber',
       'children', 'chill', 'chinese', 'choral', 'christian', 'christmas',
       'classical', 'classicmetal', 'club', 'colombian', 'comedy',
       'conjazz', 'contemporary', 'country', 'cuban', 'dance',
       'dancehall', 'dancepop', 'dark', 'death', 'deep', 'deutschrock',
       'deutschspr', 'dirty', 'disco', 'dnb', 'documentary', 'downbeat',
       'downtempo', 'drum', 'dub', 'dubstep', 'eastern', 'easy',
       'electronic', 'electropop', 'emo', 'entehno', 'epicmetal',
       'estrada', 'ethnic', 'eurofo
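Because the printed array is cut off, it is hard to confirm by eye that `hip`, `hop`, and `hip-hop` are gone. The sketch below shows a more direct check on a hypothetical `genres` Series. It also uses the fact that `Series.replace()` accepts a list of values, so the loop inside `replace_wrong_genres()` is one option rather than a requirement.

```python
import pandas as pd

# hypothetical toy column with the three implicit duplicates mixed in
genres = pd.Series(['hip', 'pop', 'hop', 'hip-hop', 'hiphop', 'rock'], name='genre')

wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

# replace() maps every value in the list to the single correct value
cleaned = genres.replace(wrong_genres, correct_genre)

# direct check: none of the wrong spellings should remain
remaining = cleaned.isin(wrong_genres).sum()
print(sorted(cleaned.unique()), remaining)
```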

### Conclusions <a id='data_preprocessing_conclusions'></a>
We have detected three problems in our data:

- Inconsistent column heading style
- Missing values
- Explicit and implicit duplicates

Column headings have now been cleaned up to make table processing easier.
All missing values have been replaced with `'unknown'`. However, we still have to see whether missing values in the `'genre'` column will affect our calculations.

The absence of duplicates will make the results we get more precise and easier to understand.

Now, we can move on to hypothesis testing.

## Stage 3. Hypothesis Testing <a id='hypotheses'></a>

### Hypothesis 1: Comparing User Behavior in Two Cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville have different music-listening behaviors. This test uses data taken from three days of the week: Monday, Wednesday, and Friday.

* Divide users into groups by city.
* Compare how many song tracks each group played on Monday, Wednesday, and Friday.

Just as an exercise, do each calculation separately.

Evaluate user activity in each city. Group the data by city and find the number of song tracks played in each group.

In [21]:
# count the tracks played in each city
df.groupby('city')['track'].count()

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

Users from Springfield played more tracks than users from Shelbyville. However, this does not imply that Springfield residents listen to music more often. The city is bigger, and there are more users.
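One way to account for the difference in city size is to divide the number of tracks by the number of distinct users in each city. This is a minimal sketch on hypothetical data, not a result from the real table; the column names match the cleaned `df`.

```python
import pandas as pd

# hypothetical toy data: Springfield has more plays but also more users
plays = pd.DataFrame({
    'city': ['Springfield'] * 6 + ['Shelbyville'] * 3,
    'user_id': ['A', 'A', 'B', 'C', 'C', 'D', 'X', 'X', 'Y'],
    'track': ['t1', 't2', 't3', 't4', 't5', 't6', 't7', 't8', 't9'],
})

# per city: total tracks played and number of distinct users
per_city = plays.groupby('city').agg(
    tracks=('track', 'count'),
    users=('user_id', 'nunique'),
)

# average number of tracks per user makes the two cities comparable
per_city['tracks_per_user'] = per_city['tracks'] / per_city['users']
print(per_city)
```

In this toy example the larger city plays twice as many tracks, yet the per-user activity is identical.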

Now, group the data by day and find the number of song tracks played on Monday, Wednesday, and Friday.

In [22]:
# count the tracks played on each day
df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

You've seen how grouping by city or day works. Now, write a function that will group the data by city and day.

Create a `number_tracks()` function to count the number of song tracks played for a given day and city. The function will require two parameters:
* name of the day of the week
* city name

In the function we created, use variables to store rows from the original table, where:
 * The value of the `'day'` column is the same as the `day` parameter
 * The value of the `'city'` column is the same as the `city` parameter

Implement sequential filtering with logical indexing.

Then count the values in the `'user_id'` column of the resulting table. Save the result to a new variable and return this variable from the function.

In [44]:
# number_tracks() counts the tracks played in a given city on a given day.
# It takes two parameters: day= and city=.
def number_tracks(day, city):
    # keep the rows of df where 'day' equals day= and, at the same time,
    # 'city' equals city= (logical indexing)
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    # count the 'user_id' values in the filtered rows
    track_list_count = track_list['user_id'].count()
    # return the resulting number
    return track_list_count

# To see the result outside a notebook cell, wrap the function call in print().

Call `number_tracks()` six times and change the parameter value on each call, so that you can retrieve data for both cities for each day (Monday, Wednesday, and Friday).

In [45]:
# number of tracks played in Springfield on Monday
number_tracks('Monday', 'Springfield')

15740

In [25]:
# number of tracks played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614

In [26]:
# number of tracks played in Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056

In [27]:
# number of tracks played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003

In [28]:
# number of tracks played in Springfield on Friday
number_tracks('Friday', 'Springfield')

15945

In [29]:
# number of tracks played in Shelbyville on Friday
number_tracks('Friday', 'Shelbyville')

5895

Use `pd.DataFrame` to create a table where:
* The column names are `['city', 'monday', 'wednesday', 'friday']`
* The data is the results returned by `number_tracks()`

In [30]:
# table with the results
column = ['city', 'monday', 'wednesday', 'friday']
number_track = [
    ['Springfield', number_tracks('Monday', 'Springfield'), number_tracks('Wednesday', 'Springfield'), number_tracks('Friday', 'Springfield')],
    ['Shelbyville', number_tracks('Monday', 'Shelbyville'), number_tracks('Wednesday', 'Shelbyville'), number_tracks('Friday', 'Shelbyville')]
]
df_table = pd.DataFrame(data=number_track, columns=column)
df_table

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
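The six separate calls are an exercise; the same kind of city-by-day table can be produced with a single grouping on both columns. The sketch below assumes a hypothetical toy table `plays` with the same column names as the cleaned `df`.

```python
import pandas as pd

# hypothetical toy data with the same columns as the cleaned table
plays = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville',
             'Springfield', 'Shelbyville', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday', 'Wednesday'],
    'user_id': ['A', 'B', 'C', 'D', 'E', 'F'],
})

counts = (
    plays.groupby(['city', 'day'])['user_id']
    .count()                       # one count per (city, day) pair
    .unstack(fill_value=0)         # days become columns, missing pairs become 0
    .reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)  # weekday order
)
print(counts)
```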


**Conclusion**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays, while activity declines on Wednesdays.
- In Shelbyville, on the other hand, users listen to music more on Wednesdays; activity on Mondays and Fridays is lower.

### Hypothesis 2: Music at the Beginning and End of the Week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday evenings, Springfield residents listen to a different genre of music than the residents of Shelbyville enjoy.

Create a table for each city (make sure the table names match those used in the two code blocks below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [31]:
# build the spr_general table from the rows of df
# whose 'city' value is 'Springfield'
spr_general = df[df['city'] == 'Springfield']
spr_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
65073,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65076,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [32]:
# build the shel_general table from the rows of df
# whose 'city' value is 'Shelbyville'
shel_general = df[df['city'] == 'Shelbyville']
shel_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
65063,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
65064,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
65065,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
65066,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


Write a function `genre_weekday()` with four parameters:
* A table for data
* Name of the day
* First timestamp, in 'hh:mm' format
* Last timestamp, in 'hh:mm' format

The function should produce information about the 15 most popular genres on a given day in the period between two timestamps.
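The time filter can work on plain strings because the `'time'` column stores zero-padded `hh:mm:ss` values, and such strings sort lexicographically in chronological order. The sketch below illustrates this on a hypothetical `times` Series; it is a demonstration of the comparison, not part of the notebook.

```python
import pandas as pd

# hypothetical toy values in the same zero-padded 'hh:mm:ss' format as the 'time' column
times = pd.Series(['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00', '20:28:33'])

# string comparison: '07:00' is a prefix of '07:00:00', so '07:00:00' > '07:00' is True,
# while '11:00:00' < '11:00' is False
in_window = times[(times > '07:00') & (times < '11:00')]
print(in_window.tolist())
```

This only holds while every value is zero-padded; a value such as `'7:05:00'` would break the ordering, in which case converting the column to a time type first would be safer.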

In [33]:
# genre_weekday() returns the most popular genres for a given table,
# day, and time window. Parameters: table=, day=, time1=, time2=.
def genre_weekday(table, day, time1, time2):
    # 1) sequential filtering with logical indexing:
    #    keep only the rows whose 'day' equals day=
    genre_df = table[table['day'] == day]
    #    keep only the rows whose 'time' is earlier than time2=
    genre_df = genre_df[genre_df['time'] < time2]
    #    keep only the rows whose 'time' is later than time1=
    genre_df = genre_df[genre_df['time'] > time1]
    # 2) group by 'genre' and count the entries for each genre
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    # 3) sort in descending order so the most popular genres come first
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    # 4) return a Series with the 15 most popular genres
    #    (on the given day, within the given time window)
    return genre_df_sorted.head(15)


Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 7:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [34]:
# call the function for Monday morning in Springfield (use spr_general instead of df)
genre_weekday(spr_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: user_id, dtype: int64

In [35]:
# call the function for Monday morning in Shelbyville (use shel_general instead of df)
genre_weekday(shel_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: user_id, dtype: int64

In [36]:
# call the function for Friday evening in Springfield
genre_weekday(spr_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: user_id, dtype: int64

In [37]:
# call the function for Friday evening in Shelbyville
genre_weekday(shel_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: user_id, dtype: int64

**Conclusion**

After comparing the top 15 genres for Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to the same genres. The top five genres in both cities are the same; only rock and electronic have swapped places.

2. In Springfield, the number of missing values is large enough that the value `'unknown'` is in 10th place. Missing values therefore account for a fairly large proportion of the data, which gives reason to question the reliability of our conclusions.

For Friday evening, the situation is similar. The positions of individual genres vary, but overall the top genres in both cities are largely the same.

Thus, the second hypothesis was partially proven correct:
* Users listen to the same music at the start and end of the week.
* There are no significant differences between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are many missing values, and they affect our top 15 genre results. Without these missing values, the results might have been different.
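The weight of the `'unknown'` marker can be quantified instead of being read off a ranking. The sketch below computes the share of plays with an unknown genre in each city; the `plays` table is hypothetical toy data, and only the column names are taken from the notebook.

```python
import pandas as pd

# hypothetical toy data: genre already filled with the 'unknown' marker
plays = pd.DataFrame({
    'city': ['Springfield'] * 5 + ['Shelbyville'] * 4,
    'genre': ['pop', 'unknown', 'rock', 'pop', 'dance',
              'pop', 'rock', 'unknown', 'unknown'],
})

# boolean column (True where the genre is unknown), averaged within each city,
# gives the fraction of plays with a missing genre
unknown_share = (plays['genre'] == 'unknown').groupby(plays['city']).mean()
print(unknown_share)
```

A share of a few percent would suggest the rankings are fairly robust, while a large share would support the caution expressed above.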

### Hypothesis 3: Genre Preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield residents prefer pop.

Group the `spr_general` table by genre and find the number of tracks played for each genre with the `count()` method. Then, sort the results in descending order and save them to `spr_genres`.

In [38]:
# group the spr_general table by the 'genre' column,
# count the tracks in each group with count(),
# sort the resulting Series in descending order, and save it to spr_genres
spr_genres = spr_general.groupby('genre')['track'].count()
spr_genres = spr_genres.sort_values(ascending=False)
spr_genres

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
               ... 
metalcore         1
marschmusik       1
malaysian         1
lovers            1
ïîï               1
Name: track, Length: 250, dtype: int64

In [39]:
# display the first 10 rows of spr_genres
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: track, dtype: int64

Now do the same thing with the data from Shelbyville.

Group the `shel_general` table by genre and find the number of tracks played for each genre. Then, sort the results in descending order and save them to the `shel_genres` table:

In [46]:
# group the shel_general table by the 'genre' column,
# count the tracks in each group with count(),
# sort the resulting Series in descending order, and save it to shel_genres
shel_genres = shel_general.groupby('genre')['track'].count()
shel_genres = shel_genres.sort_values(ascending=False)
shel_genres

genre
pop           2431
dance         1932
rock          1879
electronic    1736
hiphop         960
              ... 
mandopop         1
leftfield        1
laiko            1
jungle           1
worldbeat        1
Name: track, Length: 202, dtype: int64

In [42]:
# display the first 10 rows of shel_genres
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: track, dtype: int64

**Conclusion**

This hypothesis was proven partially correct:
* Pop music is the most popular genre in Springfield, as we would expect.
* However, pop music turned out to be the most popular genre in Shelbyville as well, and rap music did not make the top 5 list of genres for either city.
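Raw counts are hard to compare between cities of different size, so one option is to compare each genre's share of plays within its city. The sketch below does this with `pd.crosstab` on hypothetical toy data; the numbers are illustrative only and are not taken from the real table.

```python
import pandas as pd

# hypothetical toy data with the same column names as the cleaned table
plays = pd.DataFrame({
    'city': ['Springfield'] * 4 + ['Shelbyville'] * 4,
    'genre': ['pop', 'pop', 'rock', 'rap', 'pop', 'pop', 'rap', 'dance'],
})

# rows are cities, columns are genres; normalize='index' turns counts
# into shares that sum to 1 within each city
genre_share = pd.crosstab(plays['city'], plays['genre'], normalize='index')

# most popular genre in each city
top_genre = genre_share.idxmax(axis=1)
print(genre_share)
print(top_genre)
```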

[Back to Content](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity in Springfield and Shelbyville depends on the day of the week, although these two cities vary in different ways.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This also applies on Friday nights.
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, users prefer pop music, while in Shelbyville rap music has more fans.

After analyzing the available data, we can conclude that:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the pattern differs between the two cities.

The first hypothesis can be fully accepted.

2. Music preferences do not vary significantly across the week in Springfield and Shelbyville. There is a small difference in the genre order on Monday, but in both cities users listen to pop music the most.

Therefore, we cannot accept this hypothesis. It is also important to remember that the results could have been different without the missing values.

3. It turns out that the music preferences of users from Springfield and Shelbyville are very similar.

The third hypothesis is rejected. If there are differences in preferences, we cannot detect them from this data.

### Notes
In a real project, research involves statistical hypothesis testing, which is more precise and more quantitative. Also note that you can't always draw conclusions about an entire city based on data from a single source.

You will learn hypothesis testing in the statistical data analysis sprint.
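As a preview of what a statistical test could look like, the sketch below computes a chi-square statistic for independence between city and day, using the six counts from the Hypothesis 1 table. This is an illustration under a strong assumption, namely that individual plays are independent observations, which is doubtful because one user contributes many plays. It is not part of the project's method.

```python
import pandas as pd

# observed track counts taken from the Hypothesis 1 results table
observed = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
total = observed.to_numpy().sum()

# expected counts if day and city were independent:
# row total * column total / grand total
expected = pd.DataFrame(
    row_totals[:, None] * col_totals[None, :] / total,
    index=observed.index,
    columns=observed.columns,
)

# chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_statistic = ((observed - expected) ** 2 / expected).to_numpy().sum()
degrees_of_freedom = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# critical value of the chi-square distribution for 2 degrees of freedom at the 0.05 level
critical_value_5pct = 5.991
print(chi2_statistic > critical_value_5pct)
```

A statistic above the critical value would mean the day-of-week pattern differs between the cities by more than chance alone would explain, subject to the independence caveat.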