# Youtube Music Analysis

## Introduction <a id='intro'></a>
Whenever an analysis is conducted, several hypotheses need to be formulated that require further testing. Sometimes, the testing leads to the acceptance of these hypotheses. However, at other times, they need to be rejected as well. To make the right business decisions, it is necessary to understand whether the assumptions made are correct or not.

In this project, the music preferences of listeners in Akita and Newcastle will be compared. Actual data from Y.Music will be reviewed to test the hypotheses below and compare user behavior in both cities.

### Objectives:
Three hypotheses will be tested:
1. User activity varies depending on the day and city.
2. On Monday mornings, residents of Akita and Newcastle listen to different genres. This also applies to Friday evenings.
3. Listeners in Akita and Newcastle have different preferences. In Akita, users prefer pop music, while in Newcastle, rap music has more fans.

### Steps
Data related to user behavior is stored in the file `datasets/music_project_en.csv`. There is no information regarding the quality of this data, so it needs to be checked first before testing the hypotheses.

First, the data quality will be evaluated, and it will be determined whether the problems are significant. Then, during data preprocessing, the most serious issues will be addressed.

The project consists of three stages:
 1. Data review
 2. Data preprocessing
 3. Hypothesis testing

## Step 1. Data Review

Open the Y.Music data and study it.

Import `pandas`, then read the data file and store the result in the variable `df`. The project description refers to `music_project_en.csv` in the `datasets/` folder; the cell below reads a local copy named `music_project_en_Aki.csv`.

In [None]:
import pandas as pd
df = pd.read_csv('music_project_en_Aki.csv')

The first 10 rows of the table are displayed.

In [None]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Newcastle,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Akita,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Newcastle,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Newcastle,8:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Akita,8:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Newcastle,13:09:41,Friday
6,4CB90AA5,TRUE,Roman Messer,dance,Akita,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Akita,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Akita,9:17:40,Friday
9,E772D5C0,Pessimist,,dance,Newcastle,21:20:49,Wednesday


General information about the table

In [None]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57505 non-null  object
 3   genre     63866 non-null  object
 4     City    65064 non-null  object
 5   time      65064 non-null  object
 6   Day       65064 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB
None


The table contains seven columns. All columns have the same data type: `object`.

According to the documentation:
- `'userID'` — user ID
- `'Track'` — track title
- `'artist'` — artist name
- `'genre'`
- `'City'` — user's city of origin
- `'time'` — the time the track was played
- `'Day'` — the day of the week

Three issues with the column names can be observed:
1. Some names contain uppercase letters, while others are entirely lowercase.
2. Some names contain leading or trailing spaces.
3. The `genre` column lacks a description in the documentation, such as 'type of music'.

It can also be observed that the number of values differs between columns. This indicates that the data contains missing values.

In [None]:
missing_values= df.isna().sum()
print(missing_values)

  userID       0
Track       1343
artist      7574
genre       1213
  City        15
time          15
Day           15
dtype: int64
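
Raw counts are easier to judge as shares of the table. For example, 7,574 missing artists out of 65,079 rows is about 11.6%. The following is a minimal sketch of that calculation on a small made-up table; the name `sample` and its values are illustrative and not taken from the real data.

```python
import pandas as pd

# Toy stand-in for the real table (hypothetical values)
sample = pd.DataFrame({
    'Track': ['a', None, 'c', 'd'],
    'genre': ['rock', 'pop', None, None],
})

# isna() marks missing cells; the mean of booleans is the share of True values
missing_share = sample.isna().mean().mul(100).round(1)
print(missing_share)
```

Applied to `df`, the same expression would give the percentage of missing values per column.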


### Conclusion

Each row in the table stores data related to a track that was played. Some columns store data describing the track itself: track title, artist, and genre. The rest store data related to user information: their city of origin and the time they played the track.

It is clear that the data is sufficient to test the hypotheses. Unfortunately, there are a number of missing values.

To proceed with the analysis, data preprocessing will need to be performed first.

## Step 2. Data Preprocessing

The format of the column headers should be corrected, and missing values should be addressed. After that, it should be checked whether the data contains duplicates.

### Column Header Formatting

The column headers should be displayed:

In [None]:
#A list containing the column names in the `df` table is displayed
print(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


Column names should be changed according to good formatting rules:
* If the column name consists of multiple words, use `snake_case`.
* All characters should be in lowercase.
* Remove spaces.
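
The three rules above can also be applied programmatically instead of listing every column by hand. The sketch below is one possible approach, not the method used in this notebook; `to_snake_case` is a hypothetical helper introduced only for illustration.

```python
import re

def to_snake_case(name):
    """Strip spaces, separate camelCase words with '_', and lowercase."""
    stripped = name.strip()
    # Insert '_' wherever a lowercase letter or digit is followed by an uppercase letter
    separated = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', stripped)
    return separated.lower()

print([to_snake_case(c) for c in ['  userID', 'Track', '  City  ', 'Day']])
```

With a DataFrame, the helper could be applied as `df.columns = [to_snake_case(c) for c in df.columns]`.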

In [None]:
# Change column name
df = df.rename(columns={
    '  userID': 'user_id',
    'Track': 'track',
    '  City  ': 'city',
    'time': 'time',
    'Day': 'day',
})

print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')



### Missing Values

First, the number of missing values in the table is determined. To do this, two `Pandas` methods are used:

In [None]:
# Count missing values
missing_values = df.isna().sum()
print(missing_values)

user_id       0
track      1343
artist     7574
genre      1213
city         15
time         15
day          15
dtype: int64


Not all missing values affect the analysis equally. For instance, missing values in the `track` and `artist` columns are not critical and can simply be replaced with a clear marker.

However, missing values in the `'genre'` column could affect the comparison of music preferences in Akita and Newcastle. In real-world scenarios, it is useful to investigate the reasons behind missing data and attempt to address them. Unfortunately, this opportunity is not available in this project. Therefore, two things should be done:
* Fill the missing values with markers
* Evaluate the impact of the missing values on the calculations

Missing values in the `'track'`, `'artist'`, and `'genre'` columns should be replaced with the string `'unknown'`. To do this, a list named `columns_to_replace` should be created, a `for` loop should be applied to the list, and missing values in each column should be replaced.

In [None]:
#A loop should be applied to the column names to replace missing values with `'unknown'`.
columns_to_replace = ['track', 'artist', 'genre']

for replace in columns_to_replace:
    df[replace] = df[replace].fillna('unknown')

df.head()

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Newcastle,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Akita,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Newcastle,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Newcastle,8:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Akita,8:34:34,Monday


Make sure these columns no longer contain missing values by recalculating the number of missing values.

In [None]:
# Count missing value
missing_values = df.isna().sum()
print(missing_values)

user_id     0
track       0
artist      0
genre       0
city       15
time       15
day        15
dtype: int64
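
To evaluate how much the `'unknown'` marker could distort a comparison between cities, its share can be measured per city. This is a minimal sketch on made-up rows; `sample` is a hypothetical table, and only the column names `city` and `genre` come from this project.

```python
import pandas as pd

# Toy rows (hypothetical values)
sample = pd.DataFrame({
    'city': ['Akita', 'Akita', 'Akita', 'Akita', 'Newcastle', 'Newcastle'],
    'genre': ['pop', 'unknown', 'rock', 'pop', 'unknown', 'dance'],
})

# Flag placeholder rows, then average the flag within each city
unknown_share = (
    sample['genre'].eq('unknown')
    .groupby(sample['city'])
    .mean()
)
print(unknown_share)
```

A noticeably higher share in one city would suggest that genre rankings for that city are less reliable.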


### Duplicates


In [None]:
# Count explicit duplicates
explicit_duplicates = df.duplicated().sum()
print(explicit_duplicates)

3826


In [None]:
# Drop explicit duplicates. The result is stored in a new variable,
# so `df` itself is left unchanged.
delete_explicit_duplicate = df.drop_duplicates().reset_index(drop=True)
print(delete_explicit_duplicate)

        user_id                              track             artist  \
0      FFB692EC                  Kamigata To Boots   The Mass Missile   
1      55204538        Delayed Because of Accident   Andreas Rönnberg   
2        20EC38                  Funiculì funiculà        Mario Lanza   
3      A3DD03C9              Dragons in the Sunset         Fire + Ice   
4      E2DC1FAE                        Soul People         Space Echo   
...         ...                                ...                ...   
61248  729CBB09                            My Name             McLean   
61249  D08D4A55  Maybe One Day (feat. Black Spade)        Blu & Exile   
61250  C5E3A0D5                          Jalopiina            unknown   
61251  321D0506                      Freight Train      Chas McDevitt   
61252  3A64EF84          Tell Me Sweet Little Lies       Monica Lopez   

            genre       city      time        day  
0            rock  Newcastle  20:28:33  Wednesday  
1            rock  

In [None]:
# Check and count the number of duplicate
tracking_duplicates= delete_explicit_duplicate.duplicated().sum()
print(tracking_duplicates)

0
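
Note that `drop_duplicates()` returns a new table rather than modifying the original. In the cells above the cleaned result lives in `delete_explicit_duplicate`, while `df` is unchanged. The sketch below illustrates the difference on a made-up table and shows one way to carry the cleaned table forward; the name `sample` is hypothetical.

```python
import pandas as pd

# Toy table with one repeated row (hypothetical values)
sample = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2'],
    'track': ['Song', 'Song', 'Other'],
})

# drop_duplicates() returns a new table; the original is left untouched
deduplicated = sample.drop_duplicates().reset_index(drop=True)
rows_before = len(sample)
rows_after = len(deduplicated)

# Reassigning keeps later steps working on the cleaned table
sample = deduplicated
print(rows_before, rows_after, sample.duplicated().sum())
```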


Now, implicit duplicates in the `genre` column should be removed. For example, different spellings of the same genre name are examples of implicit duplicates. Such errors can also affect the results of the analysis.

**To display a list of unique genre names sorted alphabetically:**
* The desired DataFrame column should be selected.
* The sorting method should be applied to that column.
* The method that returns all unique values of the column should be called on the sorted column.

In [None]:
# Showing Unique Genre Name
genre_columns = df['genre']
unique_genre = genre_columns.sort_values().unique()
print(unique_genre)

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisc

**Examine the displayed list carefully to find implicit duplicates of the genre `hiphop`. These duplicates could be misspelled names or alternative names for the same genre.**

You will see the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To remove them, the `replace_wrong_genres()` function should be used with two parameters:
* `wrong_genres=` — a list of the duplicates to be replaced
* `correct_genre=` — a string with the correct value

The function should correct the names in the `'genre'` column of the `df` table, replacing each value in the `wrong_genres` list with the `correct_genre` value.

In [None]:
# Insert a function that replaces implicit duplicates.
def replace_wrong_genres(wrong_genres, correct_genre):
    df['genre'] = df['genre'].replace(wrong_genres, correct_genre)

wrong_genres = ['hip', 'hop', 'hip-hop']

correct_genre = 'hiphop'

Call `replace_wrong_genres()` and pass the arguments to the function so that the implicit duplicates (`hip`, `hop`, and `hip-hop`) are replaced with `hiphop`.

In [None]:
# Delete implicit duplicates
replace_wrong_genres(wrong_genres, correct_genre)
print(df)

        user_id                              track             artist  \
0      FFB692EC                  Kamigata To Boots   The Mass Missile   
1      55204538        Delayed Because of Accident   Andreas Rönnberg   
2        20EC38                  Funiculì funiculà        Mario Lanza   
3      A3DD03C9              Dragons in the Sunset         Fire + Ice   
4      E2DC1FAE                        Soul People         Space Echo   
...         ...                                ...                ...   
65074  729CBB09                            My Name             McLean   
65075  D08D4A55  Maybe One Day (feat. Black Spade)        Blu & Exile   
65076  C5E3A0D5                          Jalopiina            unknown   
65077  321D0506                      Freight Train      Chas McDevitt   
65078  3A64EF84          Tell Me Sweet Little Lies       Monica Lopez   

            genre       city      time        day  
0            rock  Newcastle  20:28:33  Wednesday  
1            rock  

Ensure that duplicated values have been removed. Display the list of unique values from the `'genre'` column.

In [None]:
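# Count rows whose genre repeats an earlier row. This number is large by design
# (there are far more rows than genres) and does not detect implicit duplicates.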
check_duplicates = df['genre'].duplicated().sum()
print(check_duplicates)

64813
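
A more direct check is to ask whether any of the old spellings are still present. The sketch below shows this on a small made-up column; `sample` and `leftovers` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy column containing the three alternative spellings (hypothetical rows)
sample = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})
wrong_genres = ['hip', 'hop', 'hip-hop']

sample['genre'] = sample['genre'].replace(wrong_genres, 'hiphop')

# True would mean at least one old spelling survived the replacement
leftovers = sample['genre'].isin(wrong_genres).any()
print(leftovers, sample['genre'].nunique())
```

On the real table, `df['genre'].isin(wrong_genres).any()` should return `False` after the replacement.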


In [None]:
unique_name = df['genre'].unique()
unique_name.sort()
print(unique_name)

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'ambient' 'americana' 'animated' 'anime' 'arabesk' 'arabic' 'arena'
 'argentinetango' 'art' 'audiobook' 'avantgarde' 'axé' 'baile' 'balkan'
 'beats' 'bigroom' 'black' 'bluegrass' 'blues' 'bollywood' 'bossa'
 'brazilian' 'breakbeat' 'breaks' 'broadway' 'cantautori' 'cantopop'
 'canzone' 'caribbean' 'caucasian' 'celtic' 'chamber' 'children' 'chill'
 'chinese' 'choral' 'christian' 'christmas' 'classical' 'classicmetal'
 'club' 'colombian' 'comedy' 'conjazz' 'contemporary' 'country' 'cuban'
 'dance' 'dancehall' 'dancepop' 'dark' 'death' 'deep' 'deutschrock'
 'deutschspr' 'dirty' 'disco' 'dnb' 'documentary' 'downbeat' 'downtempo'
 'drum' 'dub' 'dubstep' 'eastern' 'easy' 'electronic' 'electropop' 'emo'
 'entehno' 'epicmetal' 'estrada' 'ethnic' 'eurofolk' 'european'
 'experimental' 'extrememetal' 'fado' 'film' 'fitness' 'flamenco' 'folk'
 'folklore' 'folkmetal' 'folkrock' 'folktronica' 'forró' 'frankreich'
 'französisc

### Conclusion

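We have detected three issues in our data:

- Incorrect column header formatting
- Missing values
- Explicit and implicit duplicates

Column names have been cleaned to facilitate table processing. Missing values in the `'track'`, `'artist'`, and `'genre'` columns have been replaced with `'unknown'`; 15 missing values remain in each of the `'city'`, `'time'`, and `'day'` columns. We still need to check whether missing values in the `'genre'` column will affect our calculations.

Removing duplicates makes results more accurate and easier to understand. Note, however, that the de-duplicated table was stored in the separate variable `delete_explicit_duplicate`, while the following steps continue to use `df`, which still contains the 3,826 explicit duplicates. The counts below therefore include them.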

Let's proceed to the hypothesis testing phase.

## Step 3. Hypotheses Testing

### Hypothesis 1: Comparing User Behavior in Two Cities

According to the first hypothesis, users from Akita and Newcastle exhibit different music listening behaviors. This test uses data collected from three days of the week: Monday, Wednesday, and Friday.

* Divide users into groups based on the city.
* Compare the number of tracks played by each group on Monday, Wednesday, and Friday.

**Perform each calculation separately to practice.**

Evaluate user activity in each city. Group the data by city and find the number of tracks played in each group.

In [None]:
#Calculate the number of tracks played in each city.
city_play_count = df.groupby('city')['track'].count()
print(city_play_count)

city
Akita        45352
Newcastle    19712
Name: track, dtype: int64


Users from Akita play more tracks than users from Newcastle. However, this does not necessarily imply that Akita residents listen to music more frequently. The city is indeed larger and has more users, so this is to be expected.
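
One way to separate "more users" from "more listening per user" is to divide the number of plays by the number of distinct users in each city. The following is a minimal sketch on made-up rows, assuming `user_id` identifies a listener; `sample` and `per_city` are hypothetical names.

```python
import pandas as pd

# Toy play log (hypothetical values)
sample = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u3', 'u3'],
    'city': ['Akita', 'Akita', 'Akita', 'Newcastle', 'Newcastle', 'Newcastle'],
    'track': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# Count plays and distinct users per city, then compute plays per user
per_city = sample.groupby('city').agg(
    plays=('track', 'count'),
    users=('user_id', 'nunique'),
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```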

Now, group the data by day and find the number of tracks played on Monday, Wednesday, and Friday.

In [None]:
#Calculate the number of tracks played on Monday, Wednesday, and Friday.
day_play_count = df.groupby('day')['track'].count()
print(day_play_count)

day
Friday       23146
Monday       22689
Wednesday    19229
Name: track, dtype: int64


Wednesday is the quietest day overall. However, if we consider the two cities separately, we might reach a different conclusion.

The way grouping by city or day works has been observed. Now, a function should be written to group the data by city and day.

Create the `number_tracks()` function to calculate the number of tracks played for a specific day and city. The function will require two parameters:
* The name of the day of the week
* The name of the city

In the function created, a variable should be used to store rows from the original table where:
  * The value of the `'day'` column matches the `day` parameter,
  * The value of the `'city'` column matches the `city` parameter

Sequential filtering with logical indexing should be applied.

Then, the value of the `'user_id'` column in the resulting table should be counted. The result should be saved into a new variable. This variable should be returned from the function.

In [None]:
# number_tracks() counts the tracks played in a given city on a given day.
# It keeps the rows of df where 'day' equals the `day` argument and
# 'city' equals the `city` argument, then counts the 'user_id' values
# in the filtered table and returns that number.
# To see the result, wrap the function call in print().
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

Call `number_tracks()` six times and change the parameter values for each call, so that data can be retrieved from both cities for each day (Monday, Wednesday, and Friday).

In [None]:
# The number of tracks played in Akita on Monday.
number_tracks('Monday','Akita')

16712

In [None]:
# The number of tracks played in Newcastle on Monday.
number_tracks('Monday','Newcastle')

5977

In [None]:
# The number of tracks played in Akita on Wednesday.
number_tracks('Wednesday','Akita')

11753

In [None]:
# The number of tracks played in Newcastle on Wednesday.
number_tracks('Wednesday','Newcastle')

7476

In [None]:
# The number of tracks played in Akita on Friday
number_tracks('Friday','Akita')

16887

In [None]:
# The number of tracks played in Newcastle on Friday
number_tracks('Friday','Newcastle')

6259

Use `pd.DataFrame` to create a table where:

* The column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the result obtained from `number_tracks()`

In [None]:
# Result table

columns = ['city', 'monday', 'wednesday', 'friday']

# Use a separate name for the rows so the number_tracks() function is not overwritten
rows = [
    ['Akita', number_tracks('Monday', 'Akita'), number_tracks('Wednesday', 'Akita'), number_tracks('Friday', 'Akita')],
    ['Newcastle', number_tracks('Monday', 'Newcastle'), number_tracks('Wednesday', 'Newcastle'), number_tracks('Friday', 'Newcastle')]
]

result_table = pd.DataFrame(data=rows, columns=columns)
print(result_table)

        city  monday  wednesday  friday
0      Akita   16712      11753   16887
1  Newcastle    5977       7476    6259
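
The same kind of city-by-day table can be produced in a single step with `pd.crosstab`, which counts rows for every combination of two columns. This is an alternative sketch on made-up rows, not the approach used above; note that it counts rows, whereas `number_tracks()` counts non-missing `user_id` values.

```python
import pandas as pd

# Toy play log (hypothetical values)
sample = pd.DataFrame({
    'city': ['Akita', 'Akita', 'Newcastle', 'Akita', 'Newcastle'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# Rows are cities, columns are days, cells are row counts
day_order = ['Monday', 'Wednesday', 'Friday']
counts = pd.crosstab(sample['city'], sample['day']).reindex(
    columns=day_order, fill_value=0
)
print(counts)
```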


**Conclusion**

The data obtained has revealed some differences in user behavior:

- In Akita, the number of tracks played peaks on Monday and Friday, while there is a decrease in activity on Wednesday.
- In Newcastle, on the other hand, users listen to music more on Wednesday. User activity is lower on Monday and Friday.

Therefore, it can be concluded that the first hypothesis appears to be correct.

### Hypothesis 2: Music at the Beginning and End of the Week

According to the second hypothesis, on Monday mornings and Friday nights, Akita residents listen to different music genres compared to those enjoyed by Newcastle residents.

Obtain the tables below by filtering `df` by city (make sure the table names match):

* For Akita: `akita_general`
* For Newcastle: `newcastle_general`

In [None]:
# Rows of df where the city is Akita
akita_general = df[df['city'] == 'Akita']

In [None]:
# Rows of df where the city is Newcastle
newcastle_general = df[df['city'] == 'Newcastle']


A function named `genre_weekday()` should be created with four parameters:
* A table for the data
* The name of the day
* A start timestamp, in the format 'hh:mm:ss'
* An end timestamp, in the format 'hh:mm:ss'

The function is expected to return the 15 most popular genres on a specific day within the period between the two timestamps.

In [None]:
# genre_weekday() returns the 15 most popular genres for a given table,
# day, and time window:
# 1) keep the rows where 'day' equals `day` and 'time' lies strictly
#    between `time1` and `time2` (sequential filtering with logical indexing);
# 2) group the filtered rows by 'genre' and count the entries per genre;
# 3) sort the counts in descending order;
# 4) return the first 15 values.
def genre_weekday(dataframe, day, time1, time2):
    # Filter on the table that was passed in, not on the global df
    genre_df = dataframe.loc[dataframe['day'] == day]
    genre_df = genre_df.loc[genre_df['time'] < time2]
    genre_df = genre_df.loc[genre_df['time'] > time1]
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted[:15]

# Call the function
result = genre_weekday(dataframe=df, day='Friday', time1='08:30:50', time2='12:55:00')
result

Unnamed: 0_level_0,user_id
genre,Unnamed: 1_level_1
rock,10
pop,10
hiphop,8
dance,8
electronic,7
unknown,4
rusrap,4
metal,3
classical,3
world,3


Compare the results from the `genre_weekday()` function for Akita and Newcastle on Monday morning (from 07:00 to 11:00) and on Friday evening (from 17:00 to 23:00).
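
One caveat: the `time` column is stored as text, and the table preview shows morning values without a leading zero (for example `8:37:09`). Text is compared character by character, so such values do not fall between `'07:00:00'` and `'11:00:00'`. The sketch below illustrates the effect on made-up rows and one possible remedy, left-padding with `str.zfill`; it is not part of the original notebook, and `sample` is a hypothetical table.

```python
import pandas as pd

# Toy timestamps, two of them without a leading zero (hypothetical values)
sample = pd.DataFrame({'time': ['8:37:09', '10:15:00', '20:28:33', '6:59:59']})

# Plain string comparison: '8:37:09' is not below '11:00:00' because '8' > '1'
naive = sample[(sample['time'] > '07:00:00') & (sample['time'] < '11:00:00')]

# Left-pad to the fixed 'hh:mm:ss' width so text order matches clock order
padded = sample['time'].str.zfill(8)
fixed = sample[(padded > '07:00:00') & (padded < '11:00:00')]

print(list(naive['time']), list(fixed['time']))
```

Converting the column to a time-like type would be another option; padding is shown here only because it keeps the existing string comparisons unchanged.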

In [None]:
# Create akita_general
akita_general = df.loc[df['city']=='Akita']

In [None]:
# Create newcastle_general
newcastle_general = df.loc[df['city']=='Newcastle']

In [None]:
# The function is called for Monday morning in Akita (using `akita_general` instead of the `df` table).
monday_morning_akita = genre_weekday(akita_general , day='Monday', time1='07:00:00', time2='11:00:00')
print(monday_morning_akita)

genre
dance            6
pop              6
hiphop           4
rock             4
miscellaneous    3
electronic       3
alternative      2
soundtrack       2
rusrap           2
local            2
latin            2
folk             2
blues            1
rnb              1
jazz             1
Name: user_id, dtype: int64


In [None]:
# The function is called for Monday morning in Newcastle (using `newcastle_general` instead of the `df` table).
monday_morning_newcastle = genre_weekday(newcastle_general, day='Monday', time1='07:00:00', time2='11:00:00')
print(monday_morning_newcastle)

genre
electronic    3
pop           3
classical     2
hiphop        1
rusrap        1
Name: user_id, dtype: int64


In [None]:
# The function is called for Friday evening in Akita.
friday_night_akita = genre_weekday(akita_general, day='Friday', time1='17:00:00', time2='23:00:00')
print(friday_night_akita)

genre
pop            761
rock           546
dance          521
electronic     510
hiphop         282
world          220
ruspop         184
alternative    176
classical      171
rusrap         151
jazz           121
unknown        117
soundtrack     112
metal           92
rnb             92
Name: user_id, dtype: int64


In [None]:
# The function is called for Friday evening in Newcastle.
friday_night_newcastle = genre_weekday(newcastle_general, day='Friday', time1='17:00:00', time2='23:00:00')
print(friday_night_newcastle)

genre
pop            279
rock           230
electronic     227
dance          221
hiphop         103
alternative     67
rusrap          66
jazz            66
classical       64
world           60
unknown         49
ruspop          49
soundtrack      40
rap             39
metal           39
Name: user_id, dtype: int64


**Conclusion**

Comparing the top genres for Monday morning and Friday evening leads to the following observations:

  * On Friday evening, users from Akita and Newcastle listen to the same genres. The top five genres are the same in both cities (pop, rock, dance, electronic, hiphop), with only dance and electronic swapping places.

  * On Monday morning, the samples are very small (at most 6 plays per genre in Akita and 3 in Newcastle), so the rankings are not reliable. A likely cause is that `time` is stored as text without zero-padding (for example `8:37:09`), so the string comparison with `'07:00:00'` and `'11:00:00'` excludes plays between 7:00 and 9:59. Even so, pop is tied for first place in both cities.

  * Missing genres are noticeable: on Friday evening `'unknown'` ranks 12th in Akita and 11th in Newcastle. Missing values therefore make up a visible share of the data, which gives reason to question the reliability of the conclusions.

Thus, the second hypothesis is only partially confirmed:

  * Individual genres change position, but users listen to largely the same music at the beginning and end of the week.
  * There is no significant difference between Akita and Newcastle. In both cities, pop is the most popular genre.

The number of missing values casts some doubt on these results. If the missing genres were known, the rankings might differ.

### Hypothesis 3: Genre Preferences in Akita and Newcastle

**Hypothesis**: Listeners in Newcastle prefer rap music, while listeners in Akita have a stronger preference for pop music.

The `akita_general` table should be grouped by genre, and the number of tracks played for each genre should be found using the `count()` method. The results should then be sorted in descending order and saved to `akita_genres`.

In [None]:
# Group the `akita_general` table by the 'genre' column,
# count the values in the 'track' column with `count()` within each group,
# sort the resulting Series in descending order, and save the result to `akita_genres`.
akita_genres = akita_general.groupby('genre')['track'].count()
akita_genres = akita_genres.sort_values(ascending=False)
print(akita_genres)

genre
pop           6252
dance         4707
rock          4186
electronic    4009
hiphop        2215
              ... 
rave             1
regional         1
roots            1
showtunes        1
ïîï              1
Name: track, Length: 250, dtype: int64


In [None]:
akita_genres.head(10)

Unnamed: 0_level_0,track
genre,Unnamed: 1_level_1
pop,6252
dance,4707
rock,4186
electronic,4009
hiphop,2215
classical,1710
world,1516
alternative,1466
ruspop,1453
rusrap,1239


Now, perform the same operation on the data from Newcastle.

Group the `newcastle_general` table by genre, find the number of tracks played for each genre, sort the results in descending order, and save the result to `newcastle_genres`.

In [None]:
# Group the `newcastle_general` table by the 'genre' column,
# count the values in the 'track' column using `count()` within each group,
# sort the resulting Series in descending order, and save the result to `newcastle_genres`.
newcastle_genres = newcastle_general.groupby('genre')['track'].count()
newcastle_genres = newcastle_genres.sort_values(ascending=False)
print(newcastle_genres)

genre
pop           2597
dance         2054
rock          2001
electronic    1842
hiphop        1020
              ... 
mexican          1
mandopop         1
leftfield        1
laiko            1
worldbeat        1
Name: track, Length: 202, dtype: int64


In [None]:
newcastle_genres.head(10)  # display the first 10 rows of newcastle_genres

Unnamed: 0_level_0,track
genre,Unnamed: 1_level_1
pop,2597
dance,2054
rock,2001
electronic,1842
hiphop,1020
alternative,700
classical,683
rusrap,604
ruspop,565
world,552


**Conclusion**

This hypothesis is only partially confirmed:
* Pop music is the most popular genre in Akita, as anticipated.
* However, pop is also the most popular genre in Newcastle, and rap did not make it into the top 5 in either city: `hiphop` ranks fifth in both, and `rusrap` ranks 10th in Akita and 8th in Newcastle.
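
Because Akita has more than twice as many plays as Newcastle, raw counts are not directly comparable, and shares give a fairer picture. The sketch below divides the pop counts by the per-city totals reported earlier (45,352 and 19,712). It assumes those totals match the row counts of `akita_general` and `newcastle_general`; the table `plays` is a hypothetical helper built only for this illustration.

```python
import pandas as pd

# Play counts copied from the outputs above
plays = pd.DataFrame(
    {'pop': [6252, 2597], 'total': [45352, 19712]},
    index=['Akita', 'Newcastle'],
)

# Share of pop among all plays in each city
plays['pop_share'] = plays['pop'] / plays['total']
print(plays['pop_share'].round(3))
```

Under these assumptions, pop accounts for roughly 13 to 14 percent of plays in each city, which is consistent with the conclusion that preferences are similar.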

# Findings

We have tested the following three hypotheses:

1. User activity varies depending on the day and the city.
2. On Monday morning, residents of Akita and Newcastle listen to different genres. This also applies to Friday evening.
3. Listeners in Akita and Newcastle have different preferences. In Akita, users prefer pop music, while in Newcastle, rap music has more fans.

After analyzing the available data, we can conclude:

1. User activity in Akita and Newcastle depends on the day of the week, although the two cities vary in different ways: Akita peaks on Monday and Friday, while Newcastle peaks on Wednesday. The first hypothesis can be accepted.

2. Music preferences do not vary significantly throughout the week in Akita and Newcastle. Small differences can be observed in the rankings on Monday, but in both cities users predominantly listen to pop music. Therefore, this hypothesis cannot be accepted. It is also important to note that the results might differ if missing values were not present.

3. The music preferences of users from Akita and Newcastle are very similar. The third hypothesis is rejected. If there are differences in preferences, this data does not reveal them.