# Importing libraries

In [1]:
import pandas as pd

# Stage 1: Data overview

In [2]:
try:
    df = pd.read_csv('/datasets/music_project.csv')
except FileNotFoundError:
    df = pd.read_csv(r'C:\Users\ASUS\Desktop\Практикум\Data\yandex_music_project.csv')

Let's print the first 10 rows of the table:

In [3]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


Let's get general information about the table:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


### Conclusions

The column names show three style violations:

- Lowercase and uppercase letters are mixed.
- Some names contain leading or trailing spaces.
- Multi-word names are not written in snake_case (for example, `userID` instead of `user_id`).

Each row of the table describes one track play. Some columns describe the song itself: title, artist and genre. The rest describe the user: which city they are in and when they listened to the track.

At first glance, there is enough data to test the hypotheses. However, some values are missing, and the column names do not follow good style.

To move forward, we need to fix these problems in the data.

# Stage 2: Data preprocessing 

Let's correct the style of the column names and fill in the missing values. Then we will check the data for duplicates.

## Heading style

Let's display the column names:

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Let's rename the columns so that they follow good style:

In [6]:
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})

Check the result.

In [7]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
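
The renaming above is done by hand. For a table with many columns, the same rules (strip spaces, split camelCase, lowercase) could be applied automatically. Below is a minimal sketch; `to_snake_case` is a hypothetical helper that is not part of the original notebook, and the empty `toy` table only reuses the column names shown above.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip spaces, split camelCase, lowercase."""
    name = name.strip()
    # Insert an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    # Replace any remaining inner whitespace with a single underscore
    name = re.sub(r'\s+', '_', name)
    return name.lower()


# Empty toy table that only carries the original column names
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
toy = pd.DataFrame(columns=raw_columns)

# rename() accepts a function and applies it to every column name
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

An explicit dictionary, as in the cell above, is easier to review for a handful of columns; a function like this is mainly useful when the list of columns is long.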

## Missing values

Let's count how many missing values there are in the table.

In [8]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Replace the missing values in the `track`, `artist` and `genre` columns with the string `'unknown'`:

In [9]:
columns_to_replace = ['track','artist','genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

Make sure there are no missing values left in the table:

In [10]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
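
Absolute counts are easier to judge as shares: in the table above, 7203 of 65079 rows (about 11%) had no artist. The sketch below shows how such shares could be computed and how the fill could be done without a loop. The `toy` table is invented for illustration and is not the project data.

```python
import pandas as pd

# Toy table with a few missing values (None), invented for illustration
toy = pd.DataFrame({
    'track': ['True', None, 'Pessimist'],
    'artist': ['Roman Messer', None, None],
    'genre': ['dance', 'ruspop', None],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg'],
})

# Share of missing values per column (mean of a boolean mask)
missing_share = toy.isna().mean()
print(missing_share)

# Fill several columns at once instead of looping over them
columns_to_replace = ['track', 'artist', 'genre']
filled = toy.copy()
filled[columns_to_replace] = filled[columns_to_replace].fillna('unknown')

# Total number of missing values left in the whole table
remaining = int(filled.isna().sum().sum())
print(remaining)
```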

## Duplicates

Let's count the explicit duplicate rows and then remove them:

In [11]:
df.duplicated().sum()

3826

In [12]:
df = df.drop_duplicates()

Make sure there are no duplicates left

In [13]:
df.duplicated().sum()

0
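
After `drop_duplicates()` the surviving rows keep their old index labels, so the index has gaps. This does not affect the counts below, but resetting the index is a common follow-up step. A minimal sketch on an invented `toy` table:

```python
import pandas as pd

# Toy table in which rows 1 and 3 repeat row 0 exactly
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'A'],
    'track': ['x', 'x', 'y', 'x'],
    'time': ['08:00:00', '08:00:00', '09:00:00', '08:00:00'],
})

# duplicated() marks every repeat of an earlier row, not the first occurrence
n_duplicates = int(toy.duplicated().sum())

# Drop the repeats and renumber the remaining rows from 0
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```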

Now let's get rid of the implicit duplicates in the genre column, that is, the same genre written in different ways.

Let's display the list of unique genre names, sorted in alphabetical order.

In [14]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Replace the incorrect names. Here the variants `hip-hop`, `hop` and `hip` are treated as spellings of `hiphop`:

In [15]:
# Map every variant spelling to the single canonical name 'hiphop'
df['genre'] = df['genre'].replace(['hip-hop', 'hop', 'hip'], 'hiphop')

Let's check the result

In [16]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
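
If more groups of implicit duplicates turn up, the replacement could be wrapped in a small function. The sketch below assumes a hypothetical helper `replace_wrong_genres` and an invented `toy_genres` series; neither comes from the original notebook. `Series.replace` matches whole values, so `'hiphop'` and `'pop'` are not affected by replacing `'hop'`.

```python
import pandas as pd


def replace_wrong_genres(genres: pd.Series, wrong_genres: list, correct_genre: str) -> pd.Series:
    """Hypothetical helper: map every name in wrong_genres to correct_genre."""
    # replace() compares whole values, not substrings
    return genres.replace(wrong_genres, correct_genre)


# Toy series with three variant spellings of the same genre
toy_genres = pd.Series(['hip-hop', 'hop', 'hip', 'hiphop', 'pop', 'rock'])
cleaned = replace_wrong_genres(toy_genres, ['hip-hop', 'hop', 'hip'], 'hiphop')

print(sorted(cleaned.unique()))
```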

### Conclusions

Preprocessing revealed three problems in the data:

- Column name style violations.
- Missing values.
- Duplicates, both explicit and implicit.

# Stage 3: Testing hypotheses

## Comparison of user behavior in two capitals

The first hypothesis states that users in Moscow and St. Petersburg listen to music differently. Let's test this assumption using data for three days of the week: Monday, Wednesday and Friday. To do this:

- Separate users in Moscow and St. Petersburg.
- Compare how many tracks each user group listened to on Monday, Wednesday and Friday.

We will perform each of the calculations separately.

Let's evaluate user activity in each city. We group the data by city and count the plays in each group.

In [17]:
df.groupby('city')['track'].count() 

city
Moscow              42741
Saint-Petersburg    18512
Name: track, dtype: int64

There are more plays in Moscow than in St. Petersburg. This does not mean that Moscow users listen to music more often: there are simply more users in Moscow.

Now let's group the data by day of the week and count the plays for each day.

In [18]:
df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Taken together, users from the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

Let's create a function number_tracks() that will count plays for a given day and city.

In [19]:
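def number_tracks(day, city):
    """Return the number of plays for the given day of the week and city."""
    # Keep only the rows that match both the day and the city
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    # user_id has no missing values, so count() equals the number of rows
    track_list_count = track_list['user_id'].count()
    return track_list_count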

Let's call number_tracks() six times, once for each combination of day and city:

In [20]:
number_tracks('Monday','Moscow')

15740

In [21]:
number_tracks('Monday', 'Saint-Petersburg')

5614

In [22]:
number_tracks('Wednesday', 'Moscow')

11056

In [23]:
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [24]:
number_tracks('Friday', 'Moscow')

15945

In [25]:
number_tracks('Friday', 'Saint-Petersburg')

5895
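
The six separate calls can also be collapsed into one table of plays by city and day. The sketch below uses `pd.crosstab` on an invented `toy` table; it is an alternative to number_tracks(), not the approach used in this notebook.

```python
import pandas as pd

# Toy play log, invented for illustration: one row per play
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# crosstab counts the rows for every (city, day) pair;
# reindex puts the days in calendar order instead of alphabetical order
day_order = ['Monday', 'Wednesday', 'Friday']
plays = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)

print(plays)
```

Each cell of `plays` corresponds to one call of number_tracks() with the matching day and city.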

### Conclusions

The data shows a difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable drop on Wednesday.
- In St. Petersburg, on the contrary, users listen to music most on Wednesday. Activity on Monday and Friday is lower than on Wednesday by a similar amount.

This means that the data supports the first hypothesis.
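
Because Moscow has far more plays overall, the two cities are easier to compare as shares of each city's own total. The sketch below rebuilds the table from the six counts returned by number_tracks() above and divides every row by its sum; the variable names are chosen for illustration.

```python
import pandas as pd

# Play counts copied from the six number_tracks() results above
plays = pd.DataFrame(
    {
        'Monday': [15740, 5614],
        'Wednesday': [11056, 7003],
        'Friday': [15945, 5895],
    },
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total to get the share of plays per day
shares = plays.div(plays.sum(axis=1), axis=0).round(3)

# Day with the largest share in each city
busiest_day = shares.idxmax(axis=1)

print(shares)
print(busiest_day)
```

Under this normalization Moscow splits roughly 37% / 26% / 37% across Monday, Wednesday and Friday, while St. Petersburg splits roughly 30% / 38% / 32%, which is the same pattern as in the conclusions above.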

## Music at the beginning and end of the week

According to the second hypothesis, different genres prevail in Moscow and in St. Petersburg on Monday morning. Likewise, different genres prevail on Friday evening, depending on the city.

Save the tables with the data for each city in two variables:
- moscow_general
- spb_general

In [26]:
moscow_general = df[df['city'] == 'Moscow']

In [27]:
spb_general = df[df['city'] == 'Saint-Petersburg']

Let's create a function genre_weekday() with four parameters:

- table (dataframe) with data,
- day of the week,
- start of the time window in 'hh:mm' format,
- end of the time window in 'hh:mm' format.

In [28]:
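def genre_weekday(table, day, time1, time2):
    """Return the 10 most played genres for the given day and time window."""
    # Times are zero-padded strings, so they can be compared lexicographically
    genre_df = table[(table['day'] == day) & (table['time'] > time1) & (table['time'] <= time2)]
    # Count plays per genre and keep the ten largest counts
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False).head(10)
    return genre_df_sorted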
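
The filter in genre_weekday() compares strings, not time objects: the `time` column holds values like `'20:28:33'`, while the bounds are passed as `'hh:mm'`. The sketch below applies the same comparison to single strings to show where the window starts and ends; `in_window` is a hypothetical helper written only for this illustration.

```python
def in_window(timestamp: str, time1: str, time2: str) -> bool:
    """Hypothetical helper: the same comparison as in genre_weekday(), for one string."""
    return time1 < timestamp <= time2


# A play in the middle of the window is kept
print(in_window('08:34:34', '07:00', '11:00'))

# '07:00' is a prefix of '07:00:00', so the longer string is greater: kept
print(in_window('07:00:00', '07:00', '11:00'))

# For the same reason '11:00:00' is greater than '11:00': dropped
print(in_window('11:00:00', '07:00', '11:00'))

# Without the leading zero the comparison breaks: '7:00' sorts after '08:34:34'
print(in_window('08:34:34', '7:00', '11:00'))
```

In effect the window covers 07:00:00 up to, but not including, 11:00:00, and the bounds must be zero-padded for the comparison to behave like a comparison of times.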

Let's compare the results of the genre_weekday() function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [29]:
genre_weekday(moscow_general, 'Monday','07:00','11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [30]:
genre_weekday(spb_general, 'Monday','07:00','11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [31]:
genre_weekday(moscow_general, 'Friday','17:00','23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [32]:
genre_weekday(spb_general, 'Friday','17:00','23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

### Conclusions

Comparing the top 10 genres on Monday morning leads to the following conclusions:

1. Users in Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow ranking includes the “world” genre, while the St. Petersburg ranking includes jazz and classical.

2. Moscow has so many missing values that `'unknown'` ranks tenth among the most popular genres. This means that missing values make up a noticeable share of the data and threaten the reliability of the study.
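
The comparison of the two Monday-morning rankings can be checked with set operations. The lists below are copied from the genre_weekday() outputs above; the variable names are chosen for illustration.

```python
# Top-10 genres on Monday morning, copied from the outputs above
moscow_top10 = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_top10 = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
             'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# Genres that appear in only one of the two rankings
only_moscow = sorted(set(moscow_top10) - set(spb_top10))
only_spb = sorted(set(spb_top10) - set(moscow_top10))

# Genres that appear in both rankings
shared = sorted(set(moscow_top10) & set(spb_top10))

print(only_moscow)
print(only_spb)
print(len(shared))
```

Eight of the ten genres are shared; the Moscow list differs by “world” and `'unknown'`, the St. Petersburg list by jazz and classical.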

Friday evening does not change this picture. Some genres move up a little and others move down, but overall the top 10 stays much the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning and at the end of the week.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow users listen to Russian pop more often, while in St. Petersburg they listen to jazz more often.

However, the missing values cast doubt on this result. Moscow has so many of them that the top 10 might look different if the genre data had not been lost.

## Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to more often there than in Moscow. Moscow is a city of contrasts in which pop music nevertheless prevails.

Let's group the moscow_general table by genre and count the plays of each genre:

In [33]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [34]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now the same for St. Petersburg.

In [35]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [36]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

### Conclusions

The hypothesis was partially confirmed:

- Pop music is the most popular genre in Moscow, as the hypothesis predicted. Moreover, the top 10 also contains a related genre, Russian pop.
- Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg.
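
The claim about rap can be made more concrete by converting counts into shares of each city's plays. The sketch below copies the pop, hiphop and rusrap counts from the two tables above and takes the per-city totals from the play counts by city reported in Stage 3; treating those totals as denominators is an assumption of this illustration.

```python
import pandas as pd

# Total plays per city, from the groupby('city') output in Stage 3
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Plays for selected genres, copied from the two top-10 tables above
top_counts = pd.DataFrame({
    'Moscow': {'pop': 5892, 'hiphop': 2096, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'hiphop': 960, 'rusrap': 564},
})

# Dividing by a Series aligns it with the columns: each city by its own total
shares_pct = (top_counts / totals * 100).round(1)

print(shares_pct)
```

On these numbers hiphop and rusrap take a slightly larger share in St. Petersburg, but the gap is well under one percentage point for each genre, and pop leads in both cities.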

# Stage 4: Study results

Three hypotheses were tested, with the following conclusions:

1. The day of the week affects user activity differently in Moscow and St. Petersburg.

   The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week, in either Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
   - in Moscow, users listen to music of the “world” genre,
   - in St. Petersburg, to jazz and classical.

   Thus, the second hypothesis was only partially confirmed. This result might have been different if not for the missing values in the data.

3. The tastes of Moscow and St. Petersburg users have more in common than not. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

   The third hypothesis was not confirmed. If differences in preferences exist, they are not noticeable for the majority of users.

In practice, research involves testing statistical hypotheses. Data from one service does not always support conclusions about all residents of a city. Statistical hypothesis tests show how reliable such conclusions are given the available data.
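
As one illustration of such a test, the sketch below computes Pearson's chi-square statistic for the city-by-day table of plays from Stage 3. This test was not part of the study. It also assumes that plays are independent observations, which is doubtful because one user produces many plays, so the result should be read as an illustration of the mechanics only.

```python
# Play counts copied from the number_tracks() results in Stage 3
observed = {
    'Moscow': {'Monday': 15740, 'Wednesday': 11056, 'Friday': 15945},
    'Saint-Petersburg': {'Monday': 5614, 'Wednesday': 7003, 'Friday': 5895},
}
days = ['Monday', 'Wednesday', 'Friday']

# Row, column and grand totals of the contingency table
row_totals = {city: sum(counts.values()) for city, counts in observed.items()}
col_totals = {day: sum(observed[city][day] for city in observed) for day in days}
grand_total = sum(row_totals.values())

# Pearson's statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected is the count implied by "day and city are independent"
chi2 = 0.0
for city in observed:
    for day in days:
        expected = row_totals[city] * col_totals[day] / grand_total
        chi2 += (observed[city][day] - expected) ** 2 / expected

# Degrees of freedom for a 2 x 3 table: (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(days) - 1)

# Standard critical value of the chi-square distribution for dof = 2 at the 5% level
CRITICAL_VALUE_5PCT_DOF2 = 5.991

print(dof)
print(chi2 > CRITICAL_VALUE_5PCT_DOF2)
```

A statistic above the critical value would mean that, under the stated assumptions, the distribution of plays over days differs between the two cities by more than chance alone would explain.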