# Y.Music

## Introduction <a id='intro'></a>
Every analysis starts with hypotheses that need to be tested. Some tests lead us to accept a hypothesis, and others lead us to reject it. To make sound business decisions, we must know whether the assumptions we make actually hold.

In this project, you will compare the music preferences of listeners in the cities of Springfield and Shelbyville. You'll review real data from Y.Music to test some of the hypotheses below and compare user behavior in both cities.

### Objective: 
Testing three hypotheses:
1. User activity varies depending on the day and city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This also applies to Friday nights.
3. Listeners in the cities of Springfield and Shelbyville have different preferences. In Springfield, users prefer pop music, while in Shelbyville rap music has more fans.

### Stages
Data on user behavior is stored in this folder in the file `music_project_en.csv`. There is no information about the quality of the data, so you need to check it before testing the hypotheses.

First, you'll evaluate the quality of the data and see whether any problems are significant. Then, during data pre-processing, you will try to address the most serious problems.
 
The project consists of three stages:
 1. Data review
 2. Data pre-processing
 3. Hypothesis testing

# Table of Content <a id='back'></a>

* [Intro](#intro)
* [Stage 1. Data Overview](#data_review)
    * [Conclusion](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Heading Style](#header_style)
    * [2.2 Missing Values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusion](#data_preprocessing_conclusions)
* [Stage 3. Hypothesis Testing](#hypotheses)
    * [3.1 Hypothesis 1: User activity in both cities](#activity)
    * [3.2 Hypothesis 2: Music preference on Mondays and Fridays](#week)
    * [3.3 Hypothesis 3: Genre preferences in the cities of Springfield and Shelbyville](#genre)
* [Findings](#end)

## Stage 1. Data review <a id='data_review'></a>

Open the data related to Y.Music, then study the data.

You will need the `Pandas` library, so feel free to import it.

In [1]:
import pandas as pd 

In [2]:
try:
    # Try loading the file from your laptop path
    df = pd.read_csv('C:/Users/Eugene/Documents/GitHub/TripleTen-Projects/1. Y.Music Preferences Analysis/music_project_en.csv')
except FileNotFoundError:
    # If the file is not found, try loading from the PC path
    df = pd.read_csv('C:/Users/user/OneDrive/Documents/GitHub/TripleTen-Projects/1. Y.Music Preferences Analysis/music_project_en.csv')

The `music_project_en.csv` file is now stored in the `df` variable. Get summary statistics for its columns:

In [3]:
df.describe()

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| count | 65079 | 63736 | 57512 | 63881 | 65079 | 65079 | 65079 |
| unique | 41748 | 39666 | 37806 | 268 | 2 | 20392 | 3 |
| top | A8AE9169 | Brand | Kartvelli | pop | Springfield | 08:14:07 | Friday |
| freq | 76 | 136 | 136 | 8850 | 45360 | 14 | 23149 |


Show the first 10 rows of the table:

In [7]:
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Shelbyville | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Springfield | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Shelbyville | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Shelbyville | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Springfield | 08:34:34 | Monday |
| 5 | 842029A1 | Chains | Obladaet | rusrap | Shelbyville | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Springfield | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Springfield | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | L’estate | Julia Dalia | ruspop | Springfield | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Shelbyville | 21:20:49 | Wednesday |


Get general information about the table with a single command:

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


This table contains seven columns. All columns have the same data type, namely: `object`.

Based on the documentation:
- `'userID'` — User ID
- `'Track'` — song title
- `'artist'` — artist name
- `'genre'`
- `'City'` — user's home city
- `'time'` — the time when the song was played
- `'Day'` — day of the week

We can see three problems with the column naming style:
1. Some names are written in uppercase, some in lowercase.
2. Some names contain leading or trailing spaces (note the extra indentation of `userID` and `City` in the `info()` output).
3. `userID` joins two words without a separator instead of using snake_case.

The number of non-null values also differs between columns, which means the data contains missing values.

### Conclusion <a id='data_review_conclusions'></a> 

Each row in the table stores data related to the song track being played. Several columns store data that describes the track itself: song title, artist and genre. The rest stores data related to user information: their hometown, the time they played the song track.

It is clear that the data we have is sufficient to test the hypotheses. Unfortunately, some values are missing.

To continue the analysis, we need to pre-process the data first.

[Return to Table of Content](#back)

## Stage 2. Data pre-processing <a id='data_preprocessing'></a>
Correct the formatting of column headings and resolve missing values. Then, check whether your data contains duplicates.

### Title writing style <a id='header_style'></a>
Show column headings:

In [9]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Rename columns according to the rules of good writing style:
* If the column name consists of multiple words, use snake_case
* All characters must be in lower case
* Remove spaces

In [10]:
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

Check the results. Display the column names again:

In [11]:
# check the result: display the list of column names once more
df.columns

Index(['userid', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
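The result above still spells `userid` as one word. The following is a minimal sketch, on a toy frame with made-up values, of how a two-word name such as `userID` could be brought to snake_case. The target name `user_id` is an assumption for illustration; the rest of this notebook keeps using `userid`.

```python
import pandas as pd

# Toy frame with the same raw header style as the dataset (values are made up)
toy = pd.DataFrame({'  userID': ['A1'], 'Track': ['Song'], '  City  ': ['Springfield']})

# Strip surrounding spaces and lowercase every name first
toy.columns = toy.columns.str.strip().str.lower()

# Then rename the one multi-word column explicitly
toy = toy.rename(columns={'userid': 'user_id'})

print(list(toy.columns))
```

An explicit `rename` is used here because lowercasing alone cannot tell where one word ends and the next begins.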

[Return to Table of Content](#back)

### Missing values <a id='missing_values'></a>
First, find the number of missing values in the table. To do this, use two `Pandas` methods:

In [12]:
df.isnull().sum()

userid       0
track     1343
artist    7567
genre     1198
city         0
time         0
day          0
dtype: int64
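Absolute counts are easier to judge as shares of the table. A small sketch on toy data (the frame and its values are made up) shows one way to get the fraction of missing values per column:

```python
import pandas as pd

# Toy data; None marks a missing value
toy = pd.DataFrame({
    'track': ['a', None, 'c', 'd'],
    'genre': ['pop', 'rock', None, None],
})

# isna() gives a boolean mask; the mean of booleans is the share of True values
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to the counts above and the 65079 rows reported by `info()`, this corresponds to roughly 11.6% missing in `artist`, 2.1% in `track`, and 1.8% in `genre`.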

Not all missing values affect your research. For example, missing values in the `track` and `artist` columns are not that important. You just need to replace them with a clear marker.
However, missing values in the `'genre'` column could affect the comparison of music preferences in Springfield and Shelbyville. In real life, it is very useful to learn the reasons for such data loss and try to fix them. We don't have that opportunity in this project. Therefore, you must:
* Fill in missing values with markers
* Evaluate how much the missing values affect your calculations

Replace missing values in the `'track'`, `'artist'`, and `'genre'` columns with the string `'unknown'`. To do this, create a list called `columns_to_replace`, loop over it with `for`, and replace the missing values in each column:

In [13]:
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    # assign the result back; fillna(..., inplace=True) on df[column]
    # acts on an intermediate object and raises a FutureWarning in recent pandas
    df[column] = df[column].fillna('unknown')
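As an alternative to the loop, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values, so all three columns can be filled in one call. A minimal sketch on a toy frame (values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'track': ['a', None],
    'artist': [None, 'b'],
    'genre': ['pop', None],
    'city': ['Springfield', 'Shelbyville'],
})

columns_to_replace = ['track', 'artist', 'genre']

# Build {column: fill_value} and fill all listed columns in a single call
toy = toy.fillna({column: 'unknown' for column in columns_to_replace})
print(toy)
```

Columns that are not in the dictionary, such as `city` here, are left untouched.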


Make sure the table no longer contains missing values. Count the missing values again.

In [14]:
df.isnull().sum()

userid    0
track     0
artist    0
genre     0
city      0
time      0
day       0
dtype: int64

[Return to Table of Content](#back)

### Duplicates <a id='duplicates'></a>
Find the number of explicit duplicates in a table using one command:

In [15]:
df.duplicated().sum()

3826

Call one of the `Pandas` methods to remove explicit duplicates:

In [16]:
df.drop_duplicates(inplace=True)

Count the explicit duplicates again to make sure you've successfully removed them all:

In [17]:
df.duplicated().sum()

0
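Dropping rows leaves gaps in the index, because the surviving rows keep their original labels. A short sketch on toy data (values are made up) shows the same two calls plus an optional `reset_index(drop=True)` that renumbers the rows; this notebook does not reset the index, so the step is shown only as an option.

```python
import pandas as pd

toy = pd.DataFrame({
    'userid': ['A1', 'A1', 'B2'],
    'track': ['Song', 'Song', 'Other'],
})

# Number of rows that are identical to an earlier row
print(toy.duplicated().sum())

# Drop the repeats and renumber the index so it has no gaps
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```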

Now, remove the implicit duplicates in the `genre` column. A genre name spelled in several different ways is an example of an implicit duplicate. Errors like this will also affect the results of your analysis.

Display a list of unique genre names, sorted alphabetically. To do so:
* Retrieve the desired DataFrame column
* Call a method that returns all unique values of the column
* Sort the resulting values

In [18]:
unique_genres = sorted(df['genre'].unique())
unique_genres

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

Look carefully at the list that has been displayed to find implicit duplicates of the `hiphop` genre. The duplicate could be an incorrectly written name or an alternative name of the same genre.

You will see the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To remove it, use the `replace_wrong_genres()` function with two parameters:
* `wrong_genres=` — list with duplicates to replace
* `correct_genre=` — string with correct value

The function should correct the names in the `'genre'` column of the `df` table, replacing each value from the `wrong_genres` list with a value from `correct_genre`.

In [19]:
# function that replaces implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    # assign the result back instead of using inplace=True on a column selection
    df['genre'] = df['genre'].replace(wrong_genres, correct_genre)

Call `replace_wrong_genres()` and pass arguments to the function, so it can remove implicit duplicates (`hip`, `hop`, and `hip-hop`) and replace them with `hiphop`:

In [20]:
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'

In [21]:
replace_wrong_genres(wrong_genres, correct_genre)
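The function above modifies the global `df`. A possible variant, sketched here under the hypothetical name `replace_wrong_genres_in`, takes the table as an argument and returns a corrected copy, which makes it easy to try on a small example first:

```python
import pandas as pd

def replace_wrong_genres_in(frame, wrong_genres, correct_genre):
    """Return a copy of `frame` with every value in `wrong_genres` replaced."""
    result = frame.copy()
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

# Toy data with the three implicit duplicates and two clean values
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
fixed = replace_wrong_genres_in(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')

print(sorted(fixed['genre'].unique()))
```

`Series.replace` matches whole values here, so `pop` is not affected even though it contains the letters of `hop`.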


Make sure that the duplicate values have been removed. Display a list of unique values from the `'genre'` column:

In [22]:
unique_genres_after_correction = sorted(df['genre'].unique())
unique_genres_after_correction

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

[Return to Table of Content](#back)

### Conclusion <a id='data_preprocessing_conclusions'></a>
We have detected three problems in our data:

- Inconsistent column header style
- Missing values
- Explicit and implicit duplicates

Now, the column names have been cleaned up to make table processing easier.
All missing values have also been replaced with `'unknown'`. However, we still have to see whether missing values in the `'genre'` column will affect our calculations.

The absence of duplicates will make the results we get more precise and easier to understand.

Let's move on to the hypothesis testing stage.

[Return to Table of Content](#back)

## Stage 3. Hypothesis testing <a id='hypotheses'></a>

### Hypothesis 1: compare user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from the cities of Springfield and Shelbyville have different behavior in listening to music. This test uses data taken from three days of the week: Monday, Wednesday, and Friday.

* Divide users into groups by city.
* Compare how many song tracks each group played on Monday, Wednesday, and Friday.

Do each calculation separately so you can practice.

Evaluate user activity in each city. Group the data by city and find the number of song tracks played in each group.

In [23]:
tracks_per_city = df.groupby('city')['track'].count()
tracks_per_city

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

Users from Springfield played more tracks than users from Shelbyville. However, this does not necessarily mean that Springfield residents listen to music more often. Springfield is the bigger city and has more users, so a higher total is to be expected.
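One way to separate city size from listening intensity is to divide the number of plays by the number of distinct users in each city. The sketch below does this on toy data (all values are made up); it is an illustration of the idea, not a result from the Y.Music dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'userid': ['A', 'A', 'B', 'C', 'C', 'C'],
    'city': ['Springfield', 'Springfield', 'Springfield',
             'Shelbyville', 'Shelbyville', 'Shelbyville'],
    'track': ['t1', 't2', 't3', 't4', 't5', 't6'],
})

per_city = toy.groupby('city').agg(
    plays=('track', 'size'),       # rows per city
    users=('userid', 'nunique'),   # distinct listeners per city
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```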

Now, group the data by day and find the number of song tracks played on Monday, Wednesday, and Friday.

In [25]:
df_filtered = df[df['day'].isin(['Monday', 'Wednesday', 'Friday'])]

tracks_per_day = df_filtered.groupby('day')['track'].count()

tracks_per_day

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

In [26]:
tracks_per_city_day = df_filtered.groupby(['city', 'day'])['track'].count()

tracks_per_city_day


city         day      
Shelbyville  Friday        5895
             Monday        5614
             Wednesday     7003
Springfield  Friday       15945
             Monday       15740
             Wednesday    11056
Name: track, dtype: int64

Wednesday is the most "quiet" day overall. But if we consider the two cities separately, we might come to a different conclusion.

You've seen how grouping by city or day works. Now, write a function that will group the data by city and day.

Create a `number_tracks()` function to count the number of song tracks played for a given day and city. The function will require two parameters:
* name of the day of the week
* city name

Inside the function, use a variable to store the rows of the original table where:
  * The value of the `'day'` column equals the `day` parameter
  * The value of the `'city'` column equals the `city` parameter

Apply both conditions with logical indexing.

Then, count the values in the `'userid'` column of the resulting table. Save the result in a new variable and return it from the function.

In [29]:
df

| | userid | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Shelbyville | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Springfield | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Shelbyville | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Shelbyville | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Springfield | 08:34:34 | Monday |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 65074 | 729CBB09 | My Name | McLean | rnb | Springfield | 13:32:28 | Wednesday |
| 65075 | D08D4A55 | Maybe One Day (feat. Black Spade) | Blu & Exile | hiphop | Shelbyville | 10:00:00 | Monday |
| 65076 | C5E3A0D5 | Jalopiina | unknown | industrial | Springfield | 20:09:26 | Friday |
| 65077 | 321D0506 | Freight Train | Chas McDevitt | rock | Springfield | 21:43:59 | Friday |


In [30]:
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    
    track_list_count = track_list['userid'].count()
    
    return track_list_count

Call `number_tracks()` six times and change the parameter value on each call, so that you can retrieve data from both cities for each day (Monday, Wednesday, and Friday).

In [33]:
tracks_springfield_monday = number_tracks('Monday', 'Springfield')
tracks_springfield_monday

15740

In [34]:
tracks_shelbyville_monday = number_tracks('Monday', 'Shelbyville')
tracks_shelbyville_monday

5614

In [35]:
tracks_springfield_wednesday = number_tracks('Wednesday', 'Springfield')
tracks_springfield_wednesday

11056

In [36]:
tracks_shelbyville_wednesday = number_tracks('Wednesday', 'Shelbyville')
tracks_shelbyville_wednesday

7003

In [37]:
tracks_springfield_friday = number_tracks('Friday', 'Springfield')
tracks_springfield_friday

15945

In [38]:
tracks_shelbyville_friday = number_tracks('Friday', 'Shelbyville')
tracks_shelbyville_friday 

5895
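The six calls above differ only in their arguments, so they can also be generated in a loop. The sketch below uses a hypothetical helper `count_plays` that takes the table as a parameter, and toy data with made-up rows:

```python
import itertools
import pandas as pd

def count_plays(frame, day, city):
    """Count rows of `frame` that match both the given day and city."""
    mask = (frame['day'] == day) & (frame['city'] == city)
    return int(mask.sum())

toy = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Friday', 'Wednesday'],
    'city': ['Springfield', 'Shelbyville', 'Springfield', 'Springfield'],
})

days = ['Monday', 'Wednesday', 'Friday']
cities = ['Springfield', 'Shelbyville']

# One entry per (city, day) pair, six in total
counts = {(city, day): count_plays(toy, day, city)
          for city, day in itertools.product(cities, days)}
print(counts)
```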

Use `pd.DataFrame` to create a table, which
* The column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the result you get from `number_tracks()`

In [39]:
data = {
    'city': ['Springfield', 'Shelbyville'],
    'monday': [tracks_springfield_monday, tracks_shelbyville_monday],
    'wednesday': [tracks_springfield_wednesday, tracks_shelbyville_wednesday],
    'friday': [tracks_springfield_friday, tracks_shelbyville_friday]
}

tracks_table = pd.DataFrame(data)

tracks_table

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Springfield | 15740 | 11056 | 15945 |
| 1 | Shelbyville | 5614 | 7003 | 5895 |
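A table of this shape can also be produced in a single step with `pd.crosstab`, which counts rows for every combination of two columns. A minimal sketch on toy data (rows are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Shelbyville', 'Springfield'],
    'day': ['Monday', 'Friday', 'Wednesday', 'Monday', 'Monday'],
})

# Rows are cities, columns are days, cells are row counts
table = pd.crosstab(toy['city'], toy['day'])

# crosstab sorts labels alphabetically; reindex puts the days in weekday order
table = table.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(table)
```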


**Conclusion**

The data reveals some differences in user behavior:

- In Springfield, the number of tracks played peaks on Monday and Friday, while activity drops on Wednesday.
- In Shelbyville, by contrast, users listen to more music on Wednesday. Activity on Monday and Friday is lower.

Thus, the first hypothesis appears to be correct.

[Return to Table of Content](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday mornings and Friday evenings, Springfield residents listen to a different genre of music than the residents of Shelbyville enjoy.

Build one table per city (make sure the table names match the DataFrames used in the two code blocks below):
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [40]:
spr_general = df[df['city'] == 'Springfield']

In [41]:
shel_general = df[df['city'] == 'Shelbyville']

Create a `genre_weekday()` function with four parameters:
* A table for data
* Day of the week
* Initial timestamp, in 'hh:mm' format
* End timestamp, in 'hh:mm' format

The function should produce information about the 15 most popular genres on a given day in the period between two timestamps.

In [42]:
def genre_weekday(data, day, time1, time2):
    # Filter by day
    genre_df = data[data['day'] == day]
    
    # Filter by time (later than time1 and earlier than time2)
    genre_df = genre_df[(genre_df['time'] > time1) & (genre_df['time'] < time2)]
    
    # Group by genre and count the rows in each group
    genre_df_grouped = genre_df.groupby('genre').size()
    
    # Sort from the most popular genre down
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    
    # Return the 15 most popular genres
    return genre_df_sorted[:15]
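The time filter in `genre_weekday()` compares strings: the `'time'` column holds text such as `'08:34:34'`, and the bounds are passed as `'hh:mm'`. This works because zero-padded times sort the same way alphabetically as they do chronologically. The sketch below checks the boundary cases on a few made-up values and compares them with a filter on parsed durations; the parsed variant is an illustration, not what the notebook uses.

```python
import pandas as pd

times = pd.Series(['06:59:59', '07:00:00', '10:59:59', '11:00:00'])

# String comparison: '07:00:00' > '07:00' because it extends the same prefix
by_string = (times > '07:00') & (times < '11:00')

# Same window on parsed values (elapsed time since midnight)
parsed = pd.to_timedelta(times)
by_value = (parsed >= pd.Timedelta(hours=7)) & (parsed < pd.Timedelta(hours=11))

print(by_string.tolist())
print(by_value.tolist())
```

In effect, the string filter includes plays at exactly 07:00:00 and excludes plays at 11:00:00 or later.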

Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 7:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [43]:
print("Springfield, Monday Morning:")
print(genre_weekday(spr_general, 'Monday', '07:00', '11:00'))

Springfield, Monday Morning:
genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
dtype: int64


In [44]:
print("Shelbyville, Monday Morning:")
print(genre_weekday(shel_general, 'Monday', '07:00', '11:00'))

Shelbyville, Monday Morning:
genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
dtype: int64


In [45]:
print("Springfield, Friday Evening:")
print(genre_weekday(spr_general, 'Friday', '17:00', '23:00'))

Springfield, Friday Evening:
genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
dtype: int64


In [46]:
print("Shelbyville, Friday Evening:")
print(genre_weekday(shel_general, 'Friday', '17:00', '23:00'))

Shelbyville, Friday Evening:
genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
dtype: int64


**Conclusion**

After comparing the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to music of the same genres. The top five genres in both cities are the same; only rock and electronic swap places.

2. In Springfield, missing values are numerous enough that `'unknown'` takes 10th place. Missing values therefore make up a noticeable share of the data, which gives grounds to question the reliability of our conclusions.

On Friday evening, the situation is similar. Individual genres shift by a few places, but overall the top 15 genres in both cities are largely the same.

Thus, the second hypothesis is at best partially supported:
* Users listen to similar music at the start and end of the week.
* There are no significant differences between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, `'unknown'` appears in the top 15 on both Monday morning and Friday evening. Without those missing values, the results might have been different.
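To put the `'unknown'` count in context, its share can be computed from the Springfield Monday-morning output. The sketch below copies the 15 counts from that output; note that the denominator covers only the top 15 genres, so it is not the share among all Monday-morning plays.

```python
# Counts copied from the "Springfield, Monday Morning" output
springfield_monday_top15 = {
    'pop': 781, 'dance': 549, 'electronic': 480, 'rock': 474, 'hiphop': 286,
    'ruspop': 186, 'world': 181, 'rusrap': 175, 'alternative': 164,
    'unknown': 161, 'classical': 157, 'metal': 120, 'jazz': 100,
    'folk': 97, 'soundtrack': 95,
}

top15_total = sum(springfield_monday_top15.values())
unknown_share = springfield_monday_top15['unknown'] / top15_total

print(f"{unknown_share:.1%} of top-15 plays have no genre")
```

This comes to about 4% of the top-15 plays. Whether that is enough to change the ranking depends on how the missing genres would be distributed, which this data cannot tell us.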

[Return to Table of Content](#back)

### Hypothesis 3: genre preferences in the cities of Springfield and Shelbyville <a id='genre'></a>

Hypothesis: listeners in the city of Shelbyville like rap music, while listeners in the city of Springfield prefer pop.

Group the `spr_general` table by genre and find the number of tracks played for each genre with the `size()` method. Then, sort the results in descending order and save them to `spr_genres`.

In [47]:
spr_genres = spr_general.groupby('genre').size().sort_values(ascending=False)

Show the first 10 rows of `spr_genres`:

In [48]:
print("Top 10 Genres in Springfield:")
print(spr_genres.head(10))

Top 10 Genres in Springfield:
genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
dtype: int64


Now, do the same thing with the data from Shelbyville.

Group the `shel_general` table by genre and find the number of tracks played for each genre. Then, sort the results in descending order and save them to the `shel_genres` table:

In [49]:
shel_genres = shel_general.groupby('genre').size().sort_values(ascending=False)

Show the first 10 rows of `shel_genres`:

In [50]:
print("\nTop 10 Genres in Shelbyville:")
print(shel_genres.head(10))


Top 10 Genres in Shelbyville:
genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
dtype: int64
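Because Springfield has more than twice as many plays as Shelbyville, raw counts are hard to compare directly. The sketch below converts counts to shares, using the per-city totals from the first `groupby` in Hypothesis 1 and the pop and hiphop counts from the two top-10 outputs:

```python
# Totals per city and genre counts are copied from the outputs in this notebook
total_plays = {'Springfield': 42741, 'Shelbyville': 18512}
pop_plays = {'Springfield': 5892, 'Shelbyville': 2431}
hiphop_plays = {'Springfield': 2096, 'Shelbyville': 960}

shares = {}
for city in total_plays:
    shares[city] = {
        'pop': pop_plays[city] / total_plays[city],
        'hiphop': hiphop_plays[city] / total_plays[city],
    }
    print(f"{city}: pop {shares[city]['pop']:.1%}, hiphop {shares[city]['hiphop']:.1%}")
```

Pop accounts for a similar share of plays in both cities (a little under 14% in Springfield and a little over 13% in Shelbyville), which is consistent with the conclusion below.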


**Conclusion**

This hypothesis is only partially confirmed:
* Pop is the most popular genre in Springfield, as expected.
* However, pop is also the most popular genre in Shelbyville, and rap does not make the top 5 in either city.

[Return to Table of Content](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity varies depending on the day and city.
2. On Monday mornings, residents of Springfield and Shelbyville listen to different genres. This also applies to Friday nights.
3. Listeners in the cities of Springfield and Shelbyville have different preferences. In Springfield, users prefer pop music, while in Shelbyville rap music has more fans.

After analyzing the available data, we can conclude that:

1. User activity in Springfield and Shelbyville depends on the day of the week, though the pattern differs between the two cities.

The first hypothesis can be fully accepted.

2. Music preferences do not vary significantly across the week in Springfield and Shelbyville. The order of genres differs slightly on Monday, but in both cities users listen most to pop music.

Therefore, we cannot accept this hypothesis. It is also important to remember that the results could have been different without the missing values.

3. The music preferences of users from Springfield and Shelbyville turn out to be very similar.

The third hypothesis is rejected. If there really are differences in preferences, we cannot detect them from this data.

### Notes
In real projects, research involves statistical hypothesis testing, which is of course more precise and more quantitative. Also note that you can't always draw conclusions about an entire city based on data from just one source.

You will learn hypothesis testing in the statistical data analysis sprint.
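As a preview of what such a test could look like, the sketch below computes a chi-square statistic for the city-by-day table from Hypothesis 1, using only NumPy. It is an illustration under a strong assumption: the test treats every play as an independent observation, which is not true here because one user can play many tracks. The result should therefore not be read as a formal conclusion.

```python
import numpy as np

# Observed plays: rows are Springfield, Shelbyville; columns are Monday, Wednesday, Friday
observed = np.array([
    [15740, 11056, 15945],
    [5614, 7003, 5895],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Counts expected if the split across days were the same in both cities
expected = row_totals * col_totals / grand_total

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(f"chi-square = {chi2:.1f}, degrees of freedom = {dof}")
```

With 2 degrees of freedom, the 5% critical value of the chi-square distribution is about 5.99. The statistic computed here is far above it, which points the same way as the informal comparison in Hypothesis 1.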

[Return to Table of Content](#back)