# Y.Music

## Introduction <a id='intro'></a>
Whenever we do research, we need to formulate a hypothesis that we can then test. Sometimes we accept these hypotheses; sometimes we reject them. To make the right choices, a business must be able to understand whether it is making the right assumptions or not.

In this project, you will compare the musical preferences of the inhabitants of Springfield and Shelbyville. You will study real data from Y.Music to test the hypotheses below and compare user behavior in these two cities.

### Objective: 
Test three hypotheses:
1. User activity is different depending on the day of the week and the city. 
2. During Monday mornings, residents of Springfield and Shelbyville listen to different genres. This is also true for Friday nights. 
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, people prefer pop, while Shelbyville has more rap fans.

### Steps 
The data on user behavior is stored in the file `/datasets/music_project_en.csv`. There is no information on the quality of the data, so you will need to examine it before testing the hypotheses. 

First, you'll assess the quality of the data and see if its problems are significant. Then, during data pre-processing, you will try to account for the most critical problems.
 
Your project will consist of three stages:
 1. Data overview
 2. Data pre-processing
 3. Testing hypotheses


## Step 1. Overview of the data 

Open the Y.Music data and explore it.

You'll need `pandas`, so import it.

In [1]:
import pandas as pd




Read the file `music_project_en.csv` from the `/datasets/` folder and save it in the variable `df`:

In [2]:
df = pd.read_csv('/datasets/music_project_en.csv')
df.describe()

Unnamed: 0,userID,Track,artist,genre,City,time,Day
count,65079,63736,57512,63881,65079,65079,65079
unique,41748,39666,37806,268,2,20392,3
top,A8AE9169,Brand,Kartvelli,pop,Springfield,21:51:22,Friday
freq,76,136,136,8850,45360,14,23149


Display the first 10 rows of the table:

In [3]:
print(df.head(10))

     userID                        Track            artist   genre  \
0  FFB692EC            Kamigata To Boots  The Mass Missile    rock   
1  55204538  Delayed Because of Accident  Andreas Rönnberg    rock   
2    20EC38            Funiculì funiculà       Mario Lanza     pop   
3  A3DD03C9        Dragons in the Sunset        Fire + Ice    folk   
4  E2DC1FAE                  Soul People        Space Echo   dance   
5  842029A1                       Chains          Obladaet  rusrap   
6  4CB90AA5                         True      Roman Messer   dance   
7  F03E1C1F             Feeling This Way   Polina Griffith   dance   
8  8FA1D3BE                     L’estate       Julia Dalia  ruspop   
9  E772D5C0                    Pessimist               NaN   dance   

        City        time        Day  
0  Shelbyville  20:28:33  Wednesday  
1  Springfield  14:07:09     Friday  
2  Shelbyville  20:58:07  Wednesday  
3  Shelbyville  08:37:09     Monday  
4  Springfield  08:34:34     Monday  
5

Get general information about the table with a single command:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns. All of them store the same data type: `object`.

According to the documentation:
- `'userID'` - user ID
- `'Track'` - song title
- `'artist'` - artist name
- `'genre'` - the genre
- `'City'` - user's city
- `'time'` - exact time the song was played
- `'Day'` - day of the week

We can see three problems with the style of the column names:
1. Some names are capitalized, some are lowercase.
2. There are spaces in some names.
3. `userID` joins two words without an underscore; in snake_case it would be `user_id`.

The number of non-null values differs between columns. This means that the data contains missing values.




### Conclusions

Each row in the table stores data about a song that was played. Some columns describe the song itself: its title, artist and genre. The rest contain information about the user and the play: the user's ID, the city they come from, and the time and day of the week the song was played.

It's clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to pre-process the data.

## Step 2. Pre-process data 
Correct the formatting of the column headers and deal with the missing values. Then check the data for duplicates.

### Header style 
Display the column header:

In [5]:
print(df.columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


Change the column names according to the rules of good style practice:
* If the name has several words, use snake_case
* All characters must be lowercase
* Delete spaces

In [6]:
df = df.rename(columns={'  userID': 'user_id',
                        'Track': 'track',
                        '  City  ': 'city',
                        'Day': 'day'})

Check the result. Display the column names once more:

In [7]:
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
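
The same clean-up can also be done programmatically, which helps when a table has many columns. The sketch below is only an illustration: it rebuilds the raw header on a one-row frame named `raw` (a hypothetical name, with the first row of the table as sample data), strips and lowercases every name, and then fixes `userid` by hand, since no string method can tell where the underscore belongs.

```python
import pandas as pd

# One-row frame that reuses the raw header and the first row shown earlier
raw = pd.DataFrame(
    [['FFB692EC', 'Kamigata To Boots', 'The Mass Missile', 'rock',
      'Shelbyville', '20:28:33', 'Wednesday']],
    columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'],
)

# Strip surrounding spaces and lowercase every name in one pass
raw.columns = raw.columns.str.strip().str.lower()

# 'userid' still lacks the underscore, so that name is fixed explicitly
raw = raw.rename(columns={'userid': 'user_id'})

print(list(raw.columns))
```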


### Missing values 
First, find the number of missing values in the table. To do this, use two pandas methods:

In [8]:
print(df.isna().sum())

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64


Not all missing values affect the research. For example, missing values in `track` and `artist` are not critical. You can simply replace them with clear markers.

But missing values in 'genre' can affect the comparison of musical preferences of Springfield and Shelbyville. In real life, it would be useful to find out why the data is missing and try to compensate for it. But we don't have that possibility in this project. So you'll have to:
* Fill in the missing values with markers
* Evaluate how the missing values might affect your calculations
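
One possible way to evaluate the impact is to look at the share of missing values rather than their raw number. The sketch below uses a small made-up table called `toy` (not the Y.Music data) and assumes the columns are already named `genre` and `city`.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    'genre': ['rock', None, 'pop', None],
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield'],
})

# The mean of a boolean mask is the share of True values,
# so this gives the share of missing values in each column
missing_share = toy.isna().mean()

# Share of missing genres within each city
missing_by_city = toy['genre'].isna().groupby(toy['city']).mean()

print(missing_share)
print(missing_by_city)
```

If the share of missing genres is very different between the two cities, comparisons of genre rankings deserve extra caution.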

Replace the missing values in 'track', 'artist', and 'genre' with the string 'unknown'. To do this, create the columns_to_replace list, loop through it with the for loop, and replace the missing values in each of the columns:

In [9]:
columns_to_replace = ['track', 'artist', 'genre']

for c in columns_to_replace:
    # Assign the result back: fillna(inplace=True) on a selected column may not update df in recent pandas versions
    df[c] = df[c].fillna('unknown')

print(df)

        user_id                              track            artist  \
0      FFB692EC                  Kamigata To Boots  The Mass Missile   
1      55204538        Delayed Because of Accident  Andreas Rönnberg   
2        20EC38                  Funiculì funiculà       Mario Lanza   
3      A3DD03C9              Dragons in the Sunset        Fire + Ice   
4      E2DC1FAE                        Soul People        Space Echo   
...         ...                                ...               ...   
65074  729CBB09                            My Name            McLean   
65075  D08D4A55  Maybe One Day (feat. Black Spade)       Blu & Exile   
65076  C5E3A0D5                          Jalopiina           unknown   
65077  321D0506                      Freight Train     Chas McDevitt   
65078  3A64EF84          Tell Me Sweet Little Lies      Monica Lopez   

            genre         city      time        day  
0            rock  Shelbyville  20:28:33  Wednesday  
1            rock  Springfi

Make sure that the table no longer contains any missing values. Count the missing values again.

In [10]:
print(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64


### Duplicates 
Find the number of obvious duplicates in the table with a single command:

In [11]:
df.duplicated().sum()

3826

Call the pandas method that removes obvious duplicates:

In [12]:
df.drop_duplicates(inplace=True)


Count the obvious duplicates again and make sure you've removed all of them:

In [13]:
df.duplicated().sum()

0
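
To see what these two methods do, here is a minimal sketch on a made-up table called `toy`. It also shows `reset_index(drop=True)`, an optional extra step that renumbers the remaining rows; the notebook itself keeps the original index.

```python
import pandas as pd

# Made-up rows: the third row repeats the first one exactly
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'A1', 'C3'],
    'track': ['Song A', 'Song B', 'Song A', 'Song C'],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row;
# reset_index(drop=True) renumbers the remaining rows from 0
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```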

Now get rid of the implicit duplicates in the genre column. For example, the name of a genre can be written in different ways. Such errors can also affect the results.

Display the list of unique genre names in alphabetical order. To do this:
* Retrieve the desired DataFrame column
* Call the method that returns all of the column's unique values
* Sort the result alphabetically

In [14]:
sorted(df['genre'].unique())


['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

Look through the list and find implicit duplicates of the hiphop genre. These could be misspelled names, or alternative names for the same genre.

You'll see the following implicit duplicates:
* hip
* hop
* hip-hop

To get rid of them, declare the replace_wrong_genres() function with two parameters: 
* wrong_genres= - the list of duplicates
* correct_genre= - the string with the correct value

The function should correct the names in the 'genre' column of the df table by replacing each value from the wrong_genres list with the value of correct_genre.

In [15]:
def replace_wrong_genres(wrong_genres, correct_genre):
    # Replace every name from wrong_genres with correct_genre in df['genre']
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

Call replace_wrong_genres() and pass it arguments so that it can eliminate the implicit duplicates (hip, hop, and hip-hop) and replace them with hiphop:

In [16]:
replace_wrong_genres(['hip', 'hop', 'hip-hop'], 'hiphop')
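
As a side note, `Series.replace()` also accepts a list of values to replace. The sketch below tries this on a made-up Series named `genres`. Without `regex=True`, only whole values are matched, so the correct spelling `hiphop` is left untouched.

```python
import pandas as pd

# Made-up genre values, for illustration only
genres = pd.Series(['hip', 'rock', 'hop', 'hip-hop', 'hiphop', 'pop'])

# Passing a list as the first argument replaces every listed value at once
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')

print(sorted(cleaned.unique()))
```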

Make sure that duplicate names have been removed. Display the list of unique column values:

In [17]:
sorted(df['genre'].unique())

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

### Conclusions 
We detected three problems with the data:

- Incorrect header style
- Missing values
- Obvious and implicit duplicates

The header was cleaned up to make table processing simpler.

All missing values have been replaced with 'unknown'. But we still have to see if missing values in 'genre' will affect our calculations.

The absence of duplicates will make the results more accurate and easier to understand.

Now you can move on to hypothesis testing.

## Step 3. Testing hypotheses 

### Hypothesis 1: comparing user behavior in two cities 

According to the first hypothesis, users in Springfield and Shelbyville listen to music differently. Test this hypothesis using data from three days of the week: Monday, Wednesday and Friday.

* Divide users from each city into groups.
* Compare how many songs each group listened to on Monday, Wednesday and Friday.


For the purposes of practice, do each of these calculations separately. 

Evaluate user activity in each city. Group the data by city and find the number of songs played in each group.




In [18]:
df.groupby('city').count()

Unnamed: 0_level_0,user_id,track,artist,genre,time,day
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Shelbyville,18512,18512,18512,18512,18512,18512
Springfield,42741,42741,42741,42741,42741,42741


Springfield has more music played than Shelbyville. But that doesn't mean that the citizens of Springfield listen to music more often. This city is just bigger, and has more users.

Now group the data by day of the week and find the number of songs played on Monday, Wednesday and Friday.


In [19]:
df.groupby('day').count()

Unnamed: 0_level_0,user_id,track,artist,genre,city,time
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Friday,21840,21840,21840,21840,21840,21840
Monday,21354,21354,21354,21354,21354,21354
Wednesday,18059,18059,18059,18059,18059,18059


Wednesday is the quietest day in general. But if we consider the two cities separately, we might come to a different conclusion.

You've seen how grouping by city or day of the week works. Now write the function that groups the data by the two criteria.

Create the number_tracks() function to calculate the number of songs played on a given day of the week and in each city. You will need two parameters:
* day of the week
* city name

In the function, use a variable to store the rows of the original table, where:
  * the value of the 'day' column is equal to the day parameter
  * the value of the 'city' column is equal to the city parameter

Apply consecutive filters with logical indexing.

Then count the values of the 'user_id' column in the resulting table. Store the result in a new variable. Return this variable from the function.

In [20]:
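def number_tracks(day, city):
    # Keep the rows for the given day, then narrow them down to the given city
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]

    # Each remaining row is one played song
    track_list_count = track_list['user_id'].count()

    return track_list_count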

Call `number_tracks()` six times, changing the parameter values, so that you get the data for both cities for the three days.

In [21]:
springfield_monday = number_tracks(day='Monday', city='Springfield')
springfield_monday

15740

In [22]:
shelbyville_monday = number_tracks(day='Monday', city='Shelbyville')
shelbyville_monday


5614

In [23]:
springfield_wednesday = number_tracks(day='Wednesday', city='Springfield')
springfield_wednesday

11056

In [24]:
shelbyville_wednesday = number_tracks(day='Wednesday', city='Shelbyville')
shelbyville_wednesday

7003

In [25]:
springfield_friday = number_tracks(day='Friday', city='Springfield')
springfield_friday

15945

In [26]:
shelbyville_friday = number_tracks(day='Friday', city='Shelbyville')
shelbyville_friday

5895

Use pd.DataFrame to create a table, where
* The column names are: `['city', 'Monday', 'Wednesday', 'Friday']`
* The data is the results you got from number_tracks()

In [27]:
data = [
    ['Springfield', springfield_monday, springfield_wednesday, springfield_friday],
    ['Shelbyville', shelbyville_monday, shelbyville_wednesday, shelbyville_friday]
]
pd.DataFrame(data=data, columns=['city', 'Monday', 'Wednesday', 'Friday'])

Unnamed: 0,city,Monday,Wednesday,Friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
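
For comparison, a table of this shape can be built in a single step with `pd.crosstab()`. This is only a sketch on a made-up table called `toy`, not a replacement for the exercise above; it assumes the columns are named `city` and `day`.

```python
import pandas as pd

# Made-up plays, for illustration only
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# Rows are cities, columns are days, cells are the number of plays
plays = pd.crosstab(toy['city'], toy['day'])

# Put the days in calendar order instead of alphabetical order
plays = plays[['Monday', 'Wednesday', 'Friday']]

print(plays)
```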


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the amount of music played peaks on Mondays and Fridays, while on Wednesday there is a decrease in activity.
- In Shelbyville, on the other hand, users listen to more music on Wednesdays. There is little activity on Mondays and Fridays.

So the first hypothesis seems to be correct.

### Hypothesis 2: music at the beginning and end of the week 

According to the second hypothesis, on Monday morning and Friday night, Springfield residents listen to genres that differ from those Shelbyville residents listen to.

Create a table for each city (make sure the table names match those used in the two code blocks below):
* For Springfield - `spr_general`
* For Shelbyville - `shel_general`

In [28]:
spr_general = df.loc[df['city']=='Springfield']
spr_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
...,...,...,...,...,...,...,...
65073,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Springfield,21:07:12,Monday
65074,729CBB09,My Name,McLean,rnb,Springfield,13:32:28,Wednesday
65076,C5E3A0D5,Jalopiina,unknown,industrial,Springfield,20:09:26,Friday
65077,321D0506,Freight Train,Chas McDevitt,rock,Springfield,21:43:59,Friday


In [29]:
shel_general = df.loc[df['city']=='Shelbyville']
shel_general

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday
...,...,...,...,...,...,...,...
65063,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Shelbyville,21:14:40,Monday
65064,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Shelbyville,21:06:50,Monday
65065,29E04611,Bre Petrunko,Perunika Trio,world,Shelbyville,13:56:00,Monday
65066,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Shelbyville,09:22:13,Monday


Write the function genre_weekday() with four parameters:
* A table for data (`df`)
* The day of the week (`day`)
* The first timestamp, in 'HH:MM' format (`time1`)
* The last timestamp, in 'HH:MM' format (`time2`)

The function should return information about the 15 most popular genres on a given day, within the period between the two timestamps.

In [30]:
def genre_weekday(df, day, time1, time2):
    # Keep the rows for the given day and strictly between the two timestamps
    genre_df = df.loc[df['day'] == day]
    genre_df = genre_df.loc[genre_df['time'] < time2]
    genre_df = genre_df.loc[genre_df['time'] > time1]

    # Count the plays of each genre
    genre_df_grouped = genre_df[['user_id', 'genre']].groupby('genre').count()

    # Most popular genres first
    genre_df_sorted = genre_df_grouped.sort_values(by='user_id', ascending=False)

    return genre_df_sorted[:15]
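
The time filter in this function compares strings, not time objects. That works because zero-padded 'HH:MM:SS' strings sort in the same order as the times they represent. The sketch below checks this on a made-up Series named `times` and contrasts the strict comparison with `Series.between()`, which includes both boundaries by default.

```python
import pandas as pd

# Made-up timestamps in the same 'HH:MM:SS' format as the 'time' column
times = pd.Series(['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00'])

# Strict comparison: both boundary values are excluded
strict = times[(times > '07:00:00') & (times < '11:00:00')]

# between() keeps both boundary values by default
inclusive = times[times.between('07:00:00', '11:00:00')]

print(list(strict))
print(list(inclusive))
```

Whether the boundaries belong in the window is a design choice; with timestamps recorded to the second, it rarely changes the counts much.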


Compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 7 a.m. to 11 a.m.) and Friday evening (from 5 p.m. to 11 p.m.):

In [31]:
spr_mon= genre_weekday(spr_general, day='Monday', time1='07:00:00', time2='11:00:00')
spr_mon


Unnamed: 0_level_0,user_id
genre,Unnamed: 1_level_1
pop,781
dance,549
electronic,480
rock,474
hiphop,286
ruspop,186
world,181
rusrap,175
alternative,164
unknown,161


In [32]:
shel_mon= genre_weekday(shel_general, day='Monday', time1='07:00:00', time2='11:00:00')
shel_mon

Unnamed: 0_level_0,user_id
genre,Unnamed: 1_level_1
pop,218
dance,182
rock,162
electronic,147
hiphop,80
ruspop,64
alternative,58
rusrap,55
jazz,44
classical,40


In [33]:
spr_fri= genre_weekday(spr_general, day='Friday', time1='17:00:00', time2='23:00:00')
spr_fri

Unnamed: 0_level_0,user_id
genre,Unnamed: 1_level_1
pop,713
rock,517
dance,495
electronic,482
hiphop,273
world,208
ruspop,170
classical,163
alternative,163
rusrap,142


In [34]:
shel_fri= genre_weekday(shel_general, day='Friday', time1='17:00:00', time2='23:00:00')
shel_fri

Unnamed: 0_level_0,user_id
genre,Unnamed: 1_level_1
pop,256
rock,216
electronic,216
dance,210
hiphop,97
alternative,63
jazz,61
classical,60
rusrap,59
world,54


**Conclusion**

Having compared the 15 most listened to genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The five most listened to genres are the same; only rock and electronic swap places.

2. In Springfield, there were so many missing values that the value 'unknown' came in 10th. This means that missing values made up a considerable portion of the data, which could be the basis for questioning the reliability of the conclusions.

For Friday evening, the situation is similar. Individual genres vary slightly, but on the whole, the 15 most listened to genres are similar for the two cities.

Thus, the second hypothesis was partially proven:
* Users listen to similar genres of music at the beginning and end of the week.
* There is not much difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affected the top 15. If we weren't missing these values, things might be different.
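
The first observation in this conclusion can be checked with a few lines of code instead of by eye. The sketch below copies the top five Monday-morning genres from the outputs above into two lists (the names `spr_top5` and `shel_top5` are introduced here for illustration) and compares them as sets and position by position.

```python
# Top five genres on Monday morning, copied from the outputs above
spr_top5 = ['pop', 'dance', 'electronic', 'rock', 'hiphop']
shel_top5 = ['pop', 'dance', 'rock', 'electronic', 'hiphop']

# Same genres, regardless of order?
same_genres = set(spr_top5) == set(shel_top5)

# 1-based positions where the two rankings disagree
moved = [i + 1 for i, (a, b) in enumerate(zip(spr_top5, shel_top5)) if a != b]

print(same_genres, moved)
```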

### Hypothesis 3: preferences in Springfield and Shelbyville 

Hypothesis: Shelbyville loves rap. Springfield citizens are more into pop.

Group the spr_general table by genre and find the number of songs played for each genre with the count() method. Then organize the result in descending order and store it in spr_genres.

In [35]:
spr_genres = spr_general.groupby('genre').count().sort_values(by='track', ascending=False)
spr_genres

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,5892,5892,5892,5892,5892,5892
dance,4435,4435,4435,4435,4435,4435
rock,3965,3965,3965,3965,3965,3965
electronic,3786,3786,3786,3786,3786,3786
hiphop,2096,2096,2096,2096,2096,2096
...,...,...,...,...,...,...
metalcore,1,1,1,1,1,1
marschmusik,1,1,1,1,1,1
malaysian,1,1,1,1,1,1
lovers,1,1,1,1,1,1


Display the first 10 lines of spr_genres:

In [36]:
spr_genres.head(10)

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,5892,5892,5892,5892,5892,5892
dance,4435,4435,4435,4435,4435,4435
rock,3965,3965,3965,3965,3965,3965
electronic,3786,3786,3786,3786,3786,3786
hiphop,2096,2096,2096,2096,2096,2096
classical,1616,1616,1616,1616,1616,1616
world,1432,1432,1432,1432,1432,1432
alternative,1379,1379,1379,1379,1379,1379
ruspop,1372,1372,1372,1372,1372,1372
rusrap,1161,1161,1161,1161,1161,1161


Now do the same with the Shelbyville data.

Group the shel_general table by genre and find the number of songs played from each genre. Then organize the result in descending order and store it in the shel_genres table:


In [37]:
shel_genres = shel_general.groupby('genre').count().sort_values(by='track', ascending=False)
shel_genres

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,2431,2431,2431,2431,2431,2431
dance,1932,1932,1932,1932,1932,1932
rock,1879,1879,1879,1879,1879,1879
electronic,1736,1736,1736,1736,1736,1736
hiphop,960,960,960,960,960,960
...,...,...,...,...,...,...
mandopop,1,1,1,1,1,1
leftfield,1,1,1,1,1,1
laiko,1,1,1,1,1,1
jungle,1,1,1,1,1,1


Display the first 10 lines of shel_genres:

In [38]:
shel_genres.head(10)

Unnamed: 0_level_0,user_id,track,artist,city,time,day
genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
pop,2431,2431,2431,2431,2431,2431
dance,1932,1932,1932,1932,1932,1932
rock,1879,1879,1879,1879,1879,1879
electronic,1736,1736,1736,1736,1736,1736
hiphop,960,960,960,960,960,960
alternative,649,649,649,649,649,649
classical,646,646,646,646,646,646
rusrap,564,564,564,564,564,564
ruspop,538,538,538,538,538,538
world,515,515,515,515,515,515


**Conclusion**

The hypothesis was partially proven:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop music turned out to be equally popular in Springfield and Shelbyville, and rap was not in the top 5 in either city.
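
Because Springfield has more than twice as many plays as Shelbyville, raw counts are hard to compare directly. One way around this is to convert counts to shares of each city's plays. The sketch below copies a few counts from the outputs above; the variable names `totals`, `counts` and `shares` are introduced here for illustration.

```python
import pandas as pd

# Total plays per city and plays of three genres, copied from the outputs above
totals = pd.Series({'Springfield': 42741, 'Shelbyville': 18512})
counts = pd.DataFrame(
    {'Springfield': [5892, 2096, 1161], 'Shelbyville': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# Divide each city's column by that city's total number of plays
shares = counts.div(totals, axis='columns')

# Show the shares as percentages
print((shares * 100).round(2))
```

With the full tables, `value_counts(normalize=True)` on the `genre` column of each city's table would produce the same kind of shares for every genre.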


# Conclusion 

We tested the following three hypotheses:

1. User activity varies depending on the day of the week and the city. 
2. On Monday mornings, the inhabitants of Springfield and Shelbyville listen to different genres. This is also true for Friday nights. 
3. Listeners in Springfield and Shelbyville have different preferences. In Springfield, people prefer pop, while Shelbyville has more rap fans.

After analyzing the data, we conclude:

1. User activity in Springfield and Shelbyville depends on the day of the week, although the weekly pattern differs between the two cities.

The first hypothesis is fully accepted.

2. Music preferences do not vary significantly over the course of the week in either Springfield or Shelbyville. We can see small differences in the order on Mondays, but:
* In both Springfield and Shelbyville, pop is the most popular genre.

So we can't accept this hypothesis. We should also bear in mind that the result might have been different if it hadn't been for the missing values.

3. It turns out that the musical preferences of users in Springfield and Shelbyville are quite similar.

The third hypothesis was rejected. If there is a difference in preferences, it can't be seen in this data.

### Observation
In real projects, research involves statistical hypothesis testing, which is more precise and more quantitative. Also keep in mind that you can't always draw conclusions about an entire city based on data from just one source.

You will study hypothesis testing in the sprint on statistical data analysis.
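
As a preview of what such a test can look like, the sketch below computes Pearson's chi-square statistic for the city-by-day table from Hypothesis 1. This is only an illustration under a strong assumption: the test treats every play as an independent observation, which is not really true here because one user can play many songs. The variable names are introduced for this example.

```python
import pandas as pd

# Observed play counts, copied from the Hypothesis 1 table
observed = pd.DataFrame(
    {'Monday': [15740, 5614], 'Wednesday': [11056, 7003], 'Friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.to_numpy().sum()

# Expected counts if the day of the week were independent of the city
expected = pd.DataFrame(
    row_totals.to_numpy()[:, None] * col_totals.to_numpy()[None, :] / grand_total,
    index=observed.index,
    columns=observed.columns,
)

# Pearson's chi-square statistic; the table has (2 - 1) * (3 - 1) = 2 degrees of freedom
chi2 = ((observed - expected) ** 2 / expected).to_numpy().sum()

# 5.991 is the 5% critical value of the chi-square distribution with 2 degrees of freedom
print(chi2, chi2 > 5.991)
```

A statistic above the critical value means the observed counts are unlikely under independence, which is in line with the conclusion drawn for the first hypothesis.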