# Yandex.Music

# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [2.1 Header style](#header_style)
    * [2.2 Missing values](#missing_values)
    * [2.3 Duplicates](#duplicates)
    * [2.4 Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [3.1 Hypothesis 1: user activity in the two cities](#activity)
    * [3.2 Hypothesis 2: music preferences on Monday and Friday](#week)
    * [3.3 Hypothesis 3: genre preferences in Springfield and Shelbyville](#genre)
* [Findings](#end)

## Introduction <a id='intro'></a>
Whenever we're doing research, we need to formulate hypotheses that we can then test. Sometimes we accept these hypotheses; other times, we reject them. To make the right decisions, a business must be able to understand whether or not it's making the right assumptions.

In this project, you'll compare the music preferences of the cities of Springfield and Shelbyville. You'll study real Yandex.Music data to test the hypotheses below and compare user behavior for these two cities.

### Goal: 
Test three hypotheses:
1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

### Stages 
Data on user behavior is stored in the file `/datasets/music_project_en.csv`. There is no information about the quality of the data, so you will need to explore it before testing the hypotheses. 

First, you'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, you will try to account for the most critical problems.
 
Your project will consist of three stages:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses
 
[Back to Contents](#back)

## Stage 1. Data overview <a id='data_review'></a>

In [1]:
import pandas as pd

In [3]:
try:
    df = pd.read_csv('/datasets/music_project_en.csv')
except FileNotFoundError:
    # fall back to a local copy when the platform path is unavailable
    df = pd.read_csv(r'C:\Users\wolff\Desktop\Practicom_projects\P_1\music_project_en.csv')

The first 10 rows of the table:

In [4]:
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Shelbyville | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Springfield | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Shelbyville | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Shelbyville | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Springfield | 08:34:34 | Monday |
| 5 | 842029A1 | Chains | Obladaet | rusrap | Shelbyville | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Springfield | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Springfield | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | L’estate | Julia Dalia | ruspop | Springfield | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Shelbyville | 21:20:49 | Wednesday |


In [5]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains seven columns. They all store the same data type: `object`.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist's name
- `'genre'`
- `'City'` — user's city
- `'time'` — the exact time the track was played
- `'Day'` — day of the week

We can see three issues with style in the column names:
1. Some names are uppercase, some are lowercase.
2. There are spaces in some names.
3. One name, `userID`, joins two words without snake_case.

The non-null counts differ from column to column. This means the data contains missing values.


### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data on a track that was played. Some columns describe the track itself: its title, artist and genre. The rest convey information about the user: the city they come from, the time they played the track. 

It's clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

[Back to Contents](#back)

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>
I will correct the formatting of the column headers and deal with the missing values. Then I will check whether the data contains duplicates.

### Header style <a id='header_style'></a>
The column headers:

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Change column names according to the rules of good style:
* If the name has several words, use snake_case
* All characters must be lowercase
* Delete spaces

In [6]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})


Check the result. Print the names of the columns once more:

In [7]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
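
The renaming rules above can also be applied programmatically instead of listing every column by hand. The following is only a sketch: `to_snake_case` is a hypothetical helper, not part of the project, and it assumes that a word boundary is marked by a lowercase letter or digit followed by an uppercase letter.

```python
import re


def to_snake_case(name):
    """Hypothetical helper: strip spaces, split camelCase, lowercase."""
    stripped = name.strip()
    # insert an underscore wherever a lowercase letter or digit is followed by an uppercase letter
    with_underscores = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', stripped)
    return with_underscores.lower()


# column names exactly as printed by df.columns before renaming
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
clean_columns = [to_snake_case(name) for name in raw_columns]
print(clean_columns)
```

Since `DataFrame.rename` accepts a function for `columns`, the same helper could be passed as `df.rename(columns=to_snake_case)`.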

[Back to Contents](#back)

### Missing values <a id='missing_values'></a>
First, I found the number of missing values in the table. To do so, I chained two `pandas` methods, `isna()` and `sum()`:

In [6]:
print(df.isna().sum())
print(df.isna().sum()/len(df))

  userID       0
Track       1343
artist      7567
genre       1198
  City         0
time           0
Day            0
dtype: int64
  userID    0.000000
Track       0.020636
artist      0.116274
genre       0.018408
  City      0.000000
time        0.000000
Day         0.000000
dtype: float64


**About 11.6% of the `artist` values are missing; replacing or dropping them might affect the overall outcome.**

Not all missing values affect the research. For instance, the missing values in `track` and `artist` are not critical. You can simply replace them with clear markers.

But missing values in `'genre'` can affect the comparison of music preferences in Springfield and Shelbyville. In real life, it would be useful to learn the reasons why the data is missing and try to make up for them. But we do not have that opportunity in this project. So you will have to:
* Fill in these missing values with markers
* Evaluate how much the missing values may affect your computations
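
One way to evaluate the possible effect is to check whether the missing genres are spread evenly between the two cities. The sketch below uses a small made-up table (`toy`) rather than the project data, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy data standing in for the real table (values are made up)
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Shelbyville'],
    'genre': ['pop', None, 'rock', 'rap'],
})

# isna() yields True/False per row; the mean of booleans is the share of True values
missing_share = toy['genre'].isna().groupby(toy['city']).mean()
print(missing_share)
```

If the shares were very different between the cities, the `'unknown'` marker would distort one city's genre ranking more than the other's.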

I replaced the missing values in `'track'`, `'artist'`, and `'genre'` with the string `'unknown'`. To do this, I created the `columns_to_replace` list, looped over it with `for`, and replaced the missing values in each of the columns:

In [9]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    # fill the gaps in each listed column with a clear marker
    df[column] = df[column].fillna('unknown')
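
The loop can also be written as a single assignment, because `fillna()` works on several columns at once. This is a sketch on a made-up two-row table (`toy`), not the project data.

```python
import pandas as pd

# Toy table with gaps in the three text columns (values are made up)
toy = pd.DataFrame({
    'track': ['Chains', None],
    'artist': [None, 'Obladaet'],
    'genre': ['rusrap', None],
    'city': ['Shelbyville', 'Springfield'],
})

columns_to_replace = ['track', 'artist', 'genre']
# select the columns as a sub-table, fill the gaps, and write the result back
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')
print(toy)
```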

To make sure the table contains no more missing values, I counted them again:

In [10]:
print(df.isna().sum()) 

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64


[Back to Contents](#back)

### Duplicates <a id='duplicates'></a>
I found the number of obvious duplicates in the table using one command:

In [11]:
print(df.duplicated().sum())

3826


Call the `pandas` method for getting rid of obvious duplicates:

In [12]:
# drop exact duplicate rows and rebuild the index so it has no gaps
df = df.drop_duplicates().reset_index(drop=True)

I counted the obvious duplicates once more to make sure I had removed all of them:

In [13]:
print(df.duplicated().sum())

0


Next, I got rid of implicit duplicates in the `genre` column. The name of a genre can be written in different ways, and such errors will also affect the result.

I printed a list of unique genre names, sorted in alphabetical order. To do so, I:
* Retrieved the intended DataFrame column
* Applied a sorting method to it
* For the sorted column, called the method that returns all unique column values

In [7]:
# viewing unique genre names
df.sort_values(by='genre', ascending=True) 
df['genre'].unique()

array(['rock', 'pop', 'folk', 'dance', 'rusrap', 'ruspop', 'world',
       'electronic', nan, 'alternative', 'children', 'rnb', 'hip', 'jazz',
       'postrock', 'latin', 'classical', 'metal', 'reggae', 'triphop',
       'blues', 'instrumental', 'rusrock', 'dnb', 'türk', 'post',
       'country', 'psychedelic', 'conjazz', 'indie', 'posthardcore',
       'local', 'avantgarde', 'punk', 'videogame', 'techno', 'house',
       'christmas', 'melodic', 'caucasian', 'reggaeton', 'soundtrack',
       'singer', 'ska', 'salsa', 'ambient', 'film', 'western', 'rap',
       'beats', "hard'n'heavy", 'progmetal', 'minimal', 'tropical',
       'contemporary', 'new', 'soul', 'holiday', 'german', 'jpop',
       'spiritual', 'urban', 'gospel', 'nujazz', 'folkmetal', 'trance',
       'miscellaneous', 'anime', 'hardcore', 'progressive', 'korean',
       'numetal', 'vocal', 'estrada', 'tango', 'loungeelectronic',
       'classicmetal', 'dubstep', 'club', 'deep', 'southern', 'black',
       'folkrock', 'fitne
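
Note that in the cell above `sort_values()` returns a new DataFrame that is not stored anywhere, so `unique()` still lists the genres in order of first appearance. A minimal sketch of getting an alphabetical list, shown on a made-up column (`toy`) rather than the project data:

```python
import pandas as pd

# Toy genre column with one missing value (values are made up)
toy = pd.DataFrame({'genre': ['pop', 'hip', None, 'hiphop', 'hop', 'hip-hop', 'pop']})

# drop the missing values, take the unique names, then sort them alphabetically
unique_genres = sorted(toy['genre'].dropna().unique())
print(unique_genres)
```

Sorting makes spelling variants of the same genre appear next to each other, which is what makes implicit duplicates easy to spot.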

I looked through the list to find implicit duplicates of the genre `hiphop`. These could be names written incorrectly or alternative names of the same genre.

The list contains the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To get rid of them, I declared the function `replace_wrong_genres()` with two parameters: 
* `wrong_genres=` — the list of duplicates
* `correct_genre=` — the string with the correct value

The function should correct the names in the `'genre'` column from the `df` table, i.e. replace each value from the `wrong_genres` list with the value in `correct_genre`.

In [15]:
# function for replacing implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    # replace every variant from the list with the correct name in the global df
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

I called `replace_wrong_genres()` and passed it arguments so that it clears the implicit duplicates (`hip`, `hop`, and `hip-hop`) and replaces them with `hiphop`:

In [16]:
# removing implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
genre = 'hiphop'
replace_wrong_genres(duplicates, genre)
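
The function above modifies the global `df`. As an alternative, here is a sketch of a variant that receives the table as an argument and returns a cleaned copy. The name `replace_wrong_genres_in` and the `toy` table are hypothetical and used only for illustration; the sketch relies on `Series.replace()` accepting a list of values to replace.

```python
import pandas as pd


def replace_wrong_genres_in(table, wrong_genres, correct_genre):
    """Hypothetical variant: return a copy of table with genre names unified."""
    result = table.copy()
    # replace() accepts a list, so no explicit loop is needed
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result


# Toy data (values are made up)
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'pop']})
cleaned = replace_wrong_genres_in(toy, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned['genre'].unique())
```

Passing the table explicitly keeps the original untouched and makes the function reusable for other DataFrames.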

I made sure the duplicate names were removed by printing the list of unique values from the `'genre'` column:

In [17]:
# checking for implicit duplicates
print(df['genre'].unique())

['rock' 'pop' 'folk' 'dance' 'rusrap' 'ruspop' 'world' 'electronic'
 'unknown' 'alternative' 'children' 'rnb' 'hiphop' 'jazz' 'postrock'
 'latin' 'classical' 'metal' 'reggae' 'triphop' 'blues' 'instrumental'
 'rusrock' 'dnb' 'türk' 'post' 'country' 'psychedelic' 'conjazz' 'indie'
 'posthardcore' 'local' 'avantgarde' 'punk' 'videogame' 'techno' 'house'
 'christmas' 'melodic' 'caucasian' 'reggaeton' 'soundtrack' 'singer' 'ska'
 'salsa' 'ambient' 'film' 'western' 'rap' 'beats' "hard'n'heavy"
 'progmetal' 'minimal' 'tropical' 'contemporary' 'new' 'soul' 'holiday'
 'german' 'jpop' 'spiritual' 'urban' 'gospel' 'nujazz' 'folkmetal'
 'trance' 'miscellaneous' 'anime' 'hardcore' 'progressive' 'korean'
 'numetal' 'vocal' 'estrada' 'tango' 'loungeelectronic' 'classicmetal'
 'dubstep' 'club' 'deep' 'southern' 'black' 'folkrock' 'fitness' 'french'
 'disco' 'religious' 'drum' 'extrememetal' 'türkçe' 'experimental' 'easy'
 'metalcore' 'modern' 'argentinetango' 'old' 'swing' 'breaks' 'eurofolk'
 'stone

[Back to Contents](#back)

### Conclusions <a id='data_preprocessing_conclusions'></a>
I detected three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All missing values have been replaced with `'unknown'`. But I still have to see whether the missing values in `'genre'` will affect our calculations.

The absence of duplicates will make the results more precise and easier to understand.

Now I can move on to testing hypotheses. 

[Back to Contents](#back)

## Stage 3. Testing hypotheses <a id='hypotheses'></a>

### Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. I will test this using the data on three days of the week: Monday, Wednesday, and Friday.

* Dividing the users into groups by city.
* Comparing how many tracks each group played on Monday, Wednesday, and Friday.


For the sake of practice, I performed each computation separately.

To evaluate user activity in each city, I grouped the data by city and counted the songs played in each group.

In [18]:
print(df.groupby('city')['track'].count()) 

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64


Springfield has more tracks played than Shelbyville. But that does not imply that citizens of Springfield listen to music more often. This city is simply bigger, and there are more users.

Next, I grouped the data by day of the week and found the number of tracks played on Monday, Wednesday, and Friday.


In [19]:
# Calculating tracks played on each of the three days
print(df.groupby('day')['track'].count()) 

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64


Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

We have seen how grouping by city or day works. Now I will write a function that will group by both.

I created the `number_tracks()` function to calculate the number of songs played for a given day and city. It takes three parameters:
* the table with the data
* day of the week
* name of the city

In the function, I use a variable to store the rows from the original table, where:
  * `'day'` column value is equal to the `day` parameter
  * `'city'` column value is equal to the `city` parameter

I applied both conditions at once, combining them with `&` in logical indexing.

Then I counted the `'user_id'` column values in the resulting table, stored the result in a new variable, and returned this variable from the function.

In [20]:
# The function counts tracks played for a certain city and day.
# It keeps the rows that match both the intended day and the intended city,
# finds the number of 'user_id' values in the filtered table,
# and returns that number.
# To see what it returns, wrap the function call in print().

def number_tracks(df, day, city):
    '''Function with 3 parameters:
    1. df - the table with the data
    2. day - the day of the week
    3. city - the city from the data'''

    # keep only the rows where both the day and the city match
    track_list = df[(df['day'] == day) & (df['city'] == city)]

    # count the 'user_id' values to get the number of tracks played
    track_list_count = track_list['user_id'].count()

    return track_list_count

In [21]:
print(number_tracks.__doc__)

Function with 3 parameters:
    1. df - the table with the data
    2. day - the day of the week
    3. city - the city from the data


I called `number_tracks()` six times, changing the parameter values, to retrieve the data on both cities for each of the three days.

In [22]:
# the number of songs played in Springfield on Monday
sp_m = number_tracks(df, 'Monday', 'Springfield')
print(sp_m)

15740


In [23]:
# the number of songs played in Shelbyville on Monday
sh_m = number_tracks(df, 'Monday', 'Shelbyville')
print(sh_m)

5614


In [24]:
# the number of songs played in Springfield on Wednesday
sp_w = number_tracks(df, 'Wednesday', 'Springfield')
print(sp_w)

11056


In [25]:
# the number of songs played in Shelbyville on Wednesday
sh_w = number_tracks(df, 'Wednesday', 'Shelbyville')
print(sh_w)

7003


In [26]:
# the number of songs played in Springfield on Friday
sp_f = number_tracks(df, 'Friday', 'Springfield')
print(sp_f)

15945


In [27]:
# the number of songs played in Shelbyville on Friday
sh_f = number_tracks(df, 'Friday', 'Shelbyville')
print(sh_f)

5895


I used `pd.DataFrame` to create a table, where
* Column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the results I got from `number_tracks()`

In [28]:
number_tracks_data = [['Springfield', sp_m, sp_w, sp_f], ['Shelbyville', sh_m, sh_w, sh_f]]
new_table_column = ['city', 'monday', 'wednesday', 'friday']
new_table = pd.DataFrame(data=number_tracks_data, columns=new_table_column) 

print(new_table)

          city  monday  wednesday  friday
0  Springfield   15740      11056   15945
1  Shelbyville    5614       7003    5895
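
The same kind of city-by-day table can be built in one step with `pd.crosstab`, which counts rows for every combination of two columns. The sketch below runs on a small made-up table (`toy`), so its counts are not the project's results.

```python
import pandas as pd

# Toy play log (values are made up)
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# rows are cities, columns are days, cells are the number of plays
counts = pd.crosstab(toy['city'], toy['day'])
print(counts)
```

Combinations that never occur are filled with zero, so the table is always rectangular.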


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays, while on Wednesday there is a decrease in activity.
- In Shelbyville, on the contrary, users listen to music more on Wednesday. User activity on Monday and Friday is smaller.

So the first hypothesis seems to be correct.
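
Because Springfield has far more plays overall, the pattern is easier to compare as shares of each city's own total. A small sketch using the counts from the table above; the variable names `plays` and `shares` are illustrative and not part of the project code.

```python
# play counts per city and day, copied from the table above
plays = {
    'Springfield': {'monday': 15740, 'wednesday': 11056, 'friday': 15945},
    'Shelbyville': {'monday': 5614, 'wednesday': 7003, 'friday': 5895},
}

# divide each day's count by the city's total over the three days
shares = {
    city: {day: count / sum(counts.values()) for day, count in counts.items()}
    for city, counts in plays.items()
}

for city, by_day in shares.items():
    print(city, {day: round(share, 3) for day, share in by_day.items()})
```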

[Back to Contents](#back)

### Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, citizens of Springfield listen to genres that differ from ones users from Shelbyville enjoy.

I created a separate table for each city:
* For Springfield — `spr_general`
* For Shelbyville — `shel_general`

In [29]:
# created the spr_general table from the df rows,
# where the value in the 'city' column is 'Springfield'
spr_general = df[df['city'] == 'Springfield']

In [30]:
# created the shel_general table from the df rows,
# where the value in the 'city' column is 'Shelbyville'
shel_general = df[df['city'] == 'Shelbyville']

I wrote the `genre_weekday()` function with four parameters:
* A table for data (`df`)
* The day of the week (`day`)
* The first timestamp, in 'hh:mm' format (`time1`)
* The last timestamp, in 'hh:mm' format (`time2`)

The function should return info on the 15 most popular genres on a given day within the period between the two timestamps.

In [31]:
# 1) Let the genre_df variable store the rows that meet several conditions:
#    - the value in the 'day' column is equal to the value of the day= argument
#    - the value in the 'time' column is greater than the value of the time1= argument
#    - the value in the 'time' column is smaller than the value of the time2= argument
#    Use consecutive filtering with logical indexing.

# 2) Group genre_df by the 'genre' column, take one of its columns, 
#    and use the count() method to find the number of entries for each of 
#    the represented genres; store the resulting Series to the
#    genre_df_count variable

# 3) Sort genre_df_count in descending order of frequency and store the result
#    to the genre_df_sorted variable

# 4) Return a Series object with the first 15 genre_df_sorted value - the 15 most
#    popular genres (on a given day, within a certain timeframe)

def genre_weekday(df, day, time1, time2):
    # consecutive filtering
    # Create the variable genre_df which will store only those df rows where the day is equal to day=
    genre_df = df[(df['day'] == day)] 

    # filter again so that genre_df will store only those rows where the time is smaller than time2=
    genre_df = genre_df[(genre_df['time'] < time2)] 

    # filter once more so that genre_df will store only rows where the time is greater than time1=
    genre_df = genre_df[(genre_df['time'] > time1)] 

    # group the filtered DataFrame by the column with the names of genres, take the genre column, and find the number of rows for each genre with the count() method
    genre_df_count = genre_df.groupby('genre')['genre'].count()

    # sort the result in descending order (so that the most popular genres come first in the Series object)
    genre_df_sorted = genre_df_count.sort_values(ascending = False)

    # we will return the Series object storing the 15 most popular genres on a given day in a given timeframe
    return genre_df_sorted[:15]

I compared the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 07:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [32]:
# calling the function for Monday morning in Springfield (use spr_general instead of the df table)
spr_mm = genre_weekday(spr_general, 'Monday','07:00','11:00')
print(spr_mm)

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64


In [33]:
# calling the function for Monday morning in Shelbyville (use shel_general instead of the df table)
sh_mm = genre_weekday(shel_general, 'Monday','07:00','11:00')
print(sh_mm)

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64


In [34]:
# calling the function for Friday evening in Springfield
spr_fe = genre_weekday(spr_general, 'Friday','17:00','23:00')
print(spr_fe)

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
classical      163
alternative    163
rusrap         142
jazz           111
unknown        110
soundtrack     105
rnb             90
metal           88
Name: genre, dtype: int64


In [35]:
# calling the function for Friday evening in Shelbyville
sh_fe = genre_weekday(shel_general, 'Friday','17:00','23:00')
print(sh_fe)

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
unknown         47
ruspop          47
soundtrack      40
metal           39
rap             36
Name: genre, dtype: int64


**Conclusion**

Having compared the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same, only rock and electronic have switched places.

2. In Springfield, the number of missing values turned out to be so big that the value `'unknown'` came in 10th. This means that missing values make up a considerable portion of the data, which may be a basis for questioning the reliability of our conclusions.
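
The overlap between the two Monday-morning lists can also be checked directly with sets. In this sketch the genre names are copied from the two Monday outputs above; the variable names are illustrative.

```python
# top-15 genres on Monday morning, copied from the outputs above
springfield_top = [
    'pop', 'dance', 'electronic', 'rock', 'hiphop', 'ruspop', 'world', 'rusrap',
    'alternative', 'unknown', 'classical', 'metal', 'jazz', 'folk', 'soundtrack',
]
shelbyville_top = [
    'pop', 'dance', 'rock', 'electronic', 'hiphop', 'ruspop', 'alternative', 'rusrap',
    'jazz', 'classical', 'world', 'rap', 'soundtrack', 'rnb', 'metal',
]

common = set(springfield_top) & set(shelbyville_top)
only_springfield = set(springfield_top) - set(shelbyville_top)
only_shelbyville = set(shelbyville_top) - set(springfield_top)

print(len(common), sorted(only_springfield), sorted(only_shelbyville))
```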

For Friday evening, the situation is similar. Individual genres vary somewhat, but on the whole, the top 15 is similar for the two cities.

Thus, the second hypothesis has been partially proven true:
* Users listen to similar music at the beginning and end of the week.
* There is no major difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. Were we not missing these values, things might look different.

[Back to Contents](#back)

### Hypothesis 3: genre preferences in Springfield and Shelbyville <a id='genre'></a>

Hypothesis: Shelbyville loves rap music. Springfield's citizens are more into pop.

I grouped the `spr_general` table by genre and found the number of songs played for each genre with the `count()` method. Then I sorted the result in descending order and stored it in `spr_genres`.

In [36]:
# on one line: group the spr_general table by the 'genre' column, 
# count the 'genre' values with count() in the grouping, 
# sort the resulting Series in descending order, and store it to spr_genres
spr_genres = spr_general.groupby('genre')['genre'].count().sort_values(ascending = False)
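
The group, count, and sort chain has a common shorthand in `pandas`: `value_counts()` counts each distinct value and sorts the counts in descending order by default. A sketch on made-up data (`toy`) comparing the two approaches:

```python
import pandas as pd

# Toy genre column without ties in the counts (values are made up)
toy = pd.DataFrame({'genre': ['pop', 'rock', 'pop', 'rap', 'rock', 'pop']})

# the approach used in the project: group, count, sort
by_groupby = toy.groupby('genre')['genre'].count().sort_values(ascending=False)

# the shorthand: count and sort in one call
by_value_counts = toy['genre'].value_counts()

print(by_groupby)
print(by_value_counts)
```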

Printed the first 10 rows from `spr_genres`:

In [37]:
# printing the first 10 rows of spr_genres
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Next, I did the same with the data on Shelbyville.

I grouped the `shel_general` table by genre and found the number of songs played for each genre. Then I sorted the result in descending order and stored it in `shel_genres`:


In [8]:
# on one line: grouped the shel_general table by the 'genre' column, 
# counted the 'genre' values in the grouping with count(), 
# sorted the resulting Series in descending order and store it to shel_genres
shel_genres = shel_general.groupby('genre')['genre'].count().sort_values(ascending = False)

Printed the first 10 rows of `shel_genres`:

In [39]:
# printing the first 10 rows from shel_genres
shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusion**

The hypothesis has been partially proven true:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop turned out to be the most popular genre in Shelbyville as well, and rap wasn't in the top 5 for either city.
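
To compare the two cities on an equal footing, the pop counts can be expressed as shares of each city's total plays. This sketch uses the pop counts from the two top-10 lists and the per-city totals from the Hypothesis 1 grouping; the variable names are illustrative.

```python
# pop plays per city, copied from the top-10 outputs above
pop_plays = {'Springfield': 5892, 'Shelbyville': 2431}
# total plays per city, copied from the grouping by city in Hypothesis 1
total_plays = {'Springfield': 42741, 'Shelbyville': 18512}

# share of pop among all plays in each city
pop_share = {city: pop_plays[city] / total_plays[city] for city in pop_plays}

for city, share in pop_share.items():
    print(city, round(share, 4))
```

In both cities pop accounts for roughly 13 to 14 percent of plays, which supports the point that the preference for pop is similar in the two cities.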


[Back to Contents](#back)

# Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

After analyzing the data, we concluded:

1. User activity in Springfield and Shelbyville depends on the day of the week, though the cities vary in different ways. 

The first hypothesis is fully accepted.

2. Musical preferences do not vary significantly over the course of the week in both Springfield and Shelbyville. We can see small differences in order on Mondays, but:
* In Springfield and Shelbyville, people listen to pop music most.

So we can't accept this hypothesis. We must also keep in mind that the result could have been different if not for the missing values.

3. It turns out that the musical preferences of users from Springfield and Shelbyville are quite similar.

The third hypothesis is rejected. If there is any difference in preferences, it cannot be seen from this data.

### Note 
In real projects, research involves statistical hypothesis testing, which is more precise and more quantitative. Also note that you cannot always draw conclusions about an entire city based on the data from just one source.

You will study hypothesis testing in the sprint on statistical data analysis.
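
As a rough illustration of what a more quantitative check could look like, the sketch below computes a chi-square statistic for the city-by-day table from Hypothesis 1 in plain Python. It is only a sketch under a strong assumption: the test treats every play as an independent observation, which is unlikely to hold here because one user can play many tracks. The critical value 5.991 is the standard chi-square threshold for 2 degrees of freedom at the 0.05 level.

```python
# observed play counts, copied from the Hypothesis 1 table
observed = {
    'Springfield': {'monday': 15740, 'wednesday': 11056, 'friday': 15945},
    'Shelbyville': {'monday': 5614, 'wednesday': 7003, 'friday': 5895},
}
days = ['monday', 'wednesday', 'friday']

row_totals = {city: sum(counts.values()) for city, counts in observed.items()}
col_totals = {day: sum(observed[city][day] for city in observed) for day in days}
grand_total = sum(row_totals.values())

# sum (observed - expected)^2 / expected over all cells,
# where expected assumes the day pattern is the same in both cities
chi2 = 0.0
for city in observed:
    for day in days:
        expected = row_totals[city] * col_totals[day] / grand_total
        chi2 += (observed[city][day] - expected) ** 2 / expected

degrees_of_freedom = (len(observed) - 1) * (len(days) - 1)
critical_value = 5.991  # chi-square threshold for 2 degrees of freedom at alpha = 0.05

print(round(chi2, 1), degrees_of_freedom, chi2 > critical_value)
```

A statistic far above the critical value would be consistent with the conclusion that the weekly activity pattern differs between the two cities, subject to the independence caveat above.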

[Back to Contents](#back)