# Stage 1. Importing data 

## Import libraries

In [253]:
import pandas as pd

Read the *music_project.csv* file and store it in the *df* variable.

In [254]:
df = pd.read_csv('datasets/music_project.csv', sep=';')

Display the first 10 rows of the table.

In [255]:
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 8:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 8:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | TRUE | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Moscow | 9:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Saint-Petersburg | 21:20:49 | Wednesday |


Get general information about the *df* table.

In [256]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   userID  3001 non-null   object
 1   Track   2916 non-null   object
 2   artist  2662 non-null   object
 3   genre   2934 non-null   object
 4   City    3001 non-null   object
 5   time    3001 non-null   object
 6   Day     3001 non-null   object
dtypes: object(7)
memory usage: 164.2+ KB


The number of values in the columns varies. This indicates that there are missing values in the data.
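To see how large the problem is in each column, the share of missing cells can be computed directly. The snippet below is a minimal sketch on a made-up frame named `sample` (a stand-in for *df*, not the real data); the same expression should work on *df* itself.

```python
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
sample = pd.DataFrame({
    'Track': ['a', None, 'c', 'd'],
    'artist': [None, None, 'x', 'y'],
    'genre': ['rock', 'pop', None, 'pop'],
})

# isna() marks missing cells; mean() turns the boolean mask into a share per column.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Sorting in descending order puts the column with the most missing values first.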

**Conclusions**

Two problems need to be addressed: missing values in the *Track*, *artist* and *genre* columns, and inconsistently styled column names.

# Stage 2. Data preprocessing

Handle the missing values, rename the columns, and check the data for duplicates.

In [257]:
df.columns

Index(['userID', 'Track', 'artist', 'genre', 'City', 'time', 'Day'], dtype='object')

The column names are inconsistent: they mix capitalization styles (*userID*, *Track*, *City*) and are not very descriptive.
Let's rename the columns to make further work easier.

In [258]:
# set_axis returns a new DataFrame; its inplace argument was removed in pandas 2.0
df = df.set_axis(['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time', 'weekday'], axis='columns')

In [259]:
df.columns

Index(['user_id', 'track_name', 'artist_name', 'genre_name', 'city', 'time',
       'weekday'],
      dtype='object')
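Passing a full list of names relies on the column order staying the same. As an alternative sketch, `rename` with an explicit old-to-new mapping ties each new name to its source column. The empty frame `sample` and the dictionary `new_names` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Empty frame with the original column names, used only to demonstrate the renaming.
sample = pd.DataFrame(columns=['userID', 'Track', 'artist', 'genre', 'City', 'time', 'Day'])

# Explicit mapping: columns that are not listed (here 'time') keep their names.
new_names = {
    'userID': 'user_id',
    'Track': 'track_name',
    'artist': 'artist_name',
    'genre': 'genre_name',
    'City': 'city',
    'Day': 'weekday',
}
renamed = sample.rename(columns=new_names)
print(list(renamed.columns))
```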

In [260]:
df.isnull().sum()

user_id          0
track_name      85
artist_name    339
genre_name      67
city             0
time             0
weekday          0
dtype: int64

Replace the missing values in the track name and artist name columns with the string 'unknown'.

In [261]:
df['track_name'] = df['track_name'].fillna('unknown')

In [262]:
df['artist_name'] = df['artist_name'].fillna('unknown')

In [263]:
df.isnull().sum()

user_id         0
track_name      0
artist_name     0
genre_name     67
city            0
time            0
weekday         0
dtype: int64
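Both columns can also be filled in a single call by passing a dictionary to `DataFrame.fillna`. This is a small sketch on an invented frame `sample`; columns that are not in the dictionary (here the genre column) are left untouched.

```python
import pandas as pd

# Toy frame with missing values in all three columns.
sample = pd.DataFrame({
    'track_name': ['a', None],
    'artist_name': [None, 'x'],
    'genre_name': ['rock', None],
})

# Keys are column names, values are the replacement for missing cells in that column.
filled = sample.fillna({'track_name': 'unknown', 'artist_name': 'unknown'})
print(filled)
```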

Drop the rows with missing values in the genre column.

In [264]:
df = df.dropna(subset=['genre_name'])

In [265]:
df.isnull().sum()

user_id        0
track_name     0
artist_name    0
genre_name     0
city           0
time           0
weekday        0
dtype: int64

Check for fully duplicated rows.

In [266]:
df.duplicated().sum()

616

In [267]:
df = df.drop_duplicates().reset_index(drop=True)

In [268]:
df.duplicated().sum()

0

Let's declare the function *find_genre()* to find implicit duplicates in the column with genres.

In [269]:
genres_list = df['genre_name'].unique()

In [270]:
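def find_genre(x):
    # Count how many entries of genres_list equal x.
    # genres_list holds unique values, so the result is 0 or 1.
    count = 0
    for name1 in genres_list:
        if name1 == x:
            count += 1
    return count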

Calling *find_genre()* lets us check whether different spellings of the hip-hop genre occur in the table.

The correct name is *hiphop*. We check the following variants:

* hip
* hop
* hip-hop

In [271]:
find_genre('hip')

1

In [272]:
find_genre('hop')

0

In [273]:
find_genre('hip-hop')

0
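The three checks above can also be done in one vectorized step. The sketch below uses a made-up series `genres` in place of `df['genre_name']`. Unlike *find_genre()*, which looks at the list of unique names, it counts rows, so it also shows how many records each variant affects.

```python
import pandas as pd

# Toy genre column; in the notebook this would be df['genre_name'].
genres = pd.Series(['rock', 'hip', 'hiphop', 'pop', 'hiphop'])

variants = ['hip', 'hop', 'hip-hop']

# value_counts() counts rows per genre; reindex() keeps only the variants
# of interest and reports 0 for those that never occur.
counts = genres.value_counts().reindex(variants, fill_value=0)
print(counts)
```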

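Let's declare a function *find_hip_hop()* that replaces an incorrect name of this genre with *hiphop* and returns the number of rows that still carry the incorrect name.

In [274]:
def find_hip_hop(df, wrong):
    # Replace the wrong spelling in the genre column only, not in the whole table
    df['genre_name'] = df['genre_name'].replace(wrong, 'hiphop')
    # Count the rows that still have the wrong name (expected: 0)
    return df[df['genre_name'] == wrong]['genre_name'].count()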

In [275]:
find_hip_hop(df, 'hip')

0

In [276]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2318 entries, 0 to 2317
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      2318 non-null   object
 1   track_name   2318 non-null   object
 2   artist_name  2318 non-null   object
 3   genre_name   2318 non-null   object
 4   city         2318 non-null   object
 5   time         2318 non-null   object
 6   weekday      2318 non-null   object
dtypes: object(7)
memory usage: 126.9+ KB


**Conclusion**

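During preprocessing we found missing values, inconsistent column names, and duplicates. We renamed the columns, filled the missing track and artist names with 'unknown', dropped the rows without a genre, removed 616 full duplicates, and merged the *hip* variant into *hiphop*. The cleaned table has 2318 rows.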

# Are there differences in tastes between cities?

We test this assumption using data for three days of the week: Monday, Wednesday and Friday.

For each city, we count the tracks with a known genre that were played on each of these days and compare the results.

In [277]:
df.groupby('city')['genre_name'].count()

city
City                   1
Moscow              1131
Saint-Petersburg    1186
Name: genre_name, dtype: int64

In [278]:
df.groupby('weekday')['genre_name'].count()

weekday
Day             1
Friday        343
Monday       1688
Wednesday     286
Name: genre_name, dtype: int64
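Both groupings above contain one extra category (*City* and *Day*, one row each), which looks like a header line that ended up among the data rows. A minimal sketch of filtering such a row out is shown below on an invented frame `sample`. Whether the row should be removed from *df* is an assumption that would need checking against the source file.

```python
import pandas as pd

# Toy frame with a stray header-like row in the middle.
sample = pd.DataFrame({
    'city': ['Moscow', 'City', 'Saint-Petersburg'],
    'weekday': ['Monday', 'Day', 'Friday'],
})

# Flag rows whose values repeat the original column headers.
is_header_row = (sample['city'] == 'City') & (sample['weekday'] == 'Day')

# Keep everything else and renumber the index.
cleaned = sample[~is_header_row].reset_index(drop=True)
print(cleaned)
```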

Most plays fall on Monday (1688), followed by Friday (343) and Wednesday (286).

We create a *number_tracks()* function that takes a table, day of the week, and city name as parameters, and returns the number of listened tracks for which the genre is known.

In [279]:
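def number_tracks(df, day, city_name):
    # Select the rows for the given weekday and city
    track_list = df[(df['weekday'] == day) & (df['city'] == city_name)]
    # count() skips missing values, so only tracks with a known genre are counted
    track_list_count = track_list['genre_name'].count()
    return track_list_count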


In [280]:
number_tracks(df, 'Monday', 'Moscow')

715

In [281]:
number_tracks(df, 'Monday', 'Saint-Petersburg')

973

In [282]:
number_tracks(df, 'Wednesday', 'Moscow')

174

In [283]:
number_tracks(df, 'Wednesday', 'Saint-Petersburg')

112

In [284]:
number_tracks(df, 'Friday', 'Moscow')

242

In [285]:
number_tracks(df, 'Friday', 'Saint-Petersburg')

101
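The six calls above can also be collapsed into a single cross-tabulation, which avoids copying numbers by hand. The sketch below runs `pd.crosstab` on a small invented frame `sample`; on the real data the same call would take `df['city']` and `df['weekday']`.

```python
import pandas as pd

# Toy play log; each row is one listened track.
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow', 'Saint-Petersburg'],
    'weekday': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
    'genre_name': ['pop', 'rock', 'pop', 'dance', 'jazz'],
})

days = ['Monday', 'Wednesday', 'Friday']

# crosstab counts rows for every (city, weekday) pair;
# reindex puts the days in calendar order instead of alphabetical order.
summary = pd.crosstab(sample['city'], sample['weekday']).reindex(columns=days, fill_value=0)
print(summary)
```

Note that `crosstab` counts rows, so it matches *number_tracks()* only when the genre column has no missing values, which is the case after preprocessing.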

Let's summarize the results in one table.

In [286]:
# Monday values are taken from the number_tracks() outputs above (715 and 973)
data = [['Moscow', 715, 174, 242],
        ['Saint-Petersburg', 973, 112, 101]]
columns = ['city', 'monday', 'wednesday', 'friday']
table = pd.DataFrame(data=data, columns=columns)
table

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 715 | 174 | 242 |
| 1 | Saint-Petersburg | 973 | 112 | 101 |


**Conclusion**

* In both cities Monday is by far the busiest day: 715 plays in Moscow and 973 in Saint-Petersburg.
* In Moscow, Friday (242) is busier than Wednesday (174).
* In Saint-Petersburg, Wednesday (112) is slightly busier than Friday (101).

# Do people listen to the same music on Monday morning and Friday evening?

We want to find out which genres prevail in each city on Monday morning and Friday evening.

First we create separate tables for Moscow (*moscow_general*) and for Saint-Petersburg (*spb_general*).

In [287]:
moscow_general = df[df['city'] == 'Moscow']

In [288]:
spb_general = df[df['city'] == 'Saint-Petersburg']

We create a function *genre_weekday()* that returns the number of plays per genre for the requested day of the week and time window.

In [289]:
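def genre_weekday(df, day, time1, time2):
    # Keep the rows for the given day whose time lies strictly between time1 and time2
    # (the values are compared as strings)
    genre_list = df[(df['weekday'] == day) & (time1 < df['time']) & (df['time'] < time2)]
    # Count plays per genre; the result is ordered alphabetically by genre name
    genre_counts = genre_list.groupby('genre_name')['genre_name'].count()
    return genre_counts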

In [290]:
genre_weekday(moscow_general, 'Monday', '07:00:00', '11:00:00')

genre_name
hiphop    1
pop       1
rock      1
Name: genre_name, dtype: int64

In [291]:
genre_weekday(spb_general, 'Monday', '07:00:00', '11:00:00')

genre_name
electronic    1
pop           1
Name: genre_name, dtype: int64
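The very small Monday-morning counts may partly be an effect of comparing times as strings. The first rows of the table show times without a leading zero, such as 8:37:09, and as text such a value does not sort below '11:00:00'. The sketch below illustrates the difference on three invented values and shows one possible fix with `pd.to_timedelta`. It is an illustration of the issue, not a rerun of the analysis.

```python
import pandas as pd

# Two morning times (one without a leading zero) and one afternoon time.
times = pd.Series(['8:37:09', '07:15:00', '14:07:09'])

# Plain string comparison: '8:37:09' is not below '11:00:00' because '8' > '1'.
as_text = (times > '07:00:00') & (times < '11:00:00')

# Parsing to timedeltas compares the actual time elapsed since midnight.
parsed = pd.to_timedelta(times)
as_time = (parsed > pd.Timedelta('07:00:00')) & (parsed < pd.Timedelta('11:00:00'))

print(as_text.tolist())
print(as_time.tolist())
```

With the parsed values both morning times fall inside the window, while the string comparison keeps only the zero-padded one.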

In [292]:
genre_weekday(moscow_general, 'Friday', '17:00:00', '23:00:00')

genre_name
alternative      3
blues            1
caucasian        1
classical        3
conjazz          1
country          1
dance           10
deep             1
electronic       6
eurofolk         1
fitness          1
gospel           1
hiphop           6
instrumental     1
jazz             2
latin            1
metal            2
modern           1
new              1
pop             12
psychedelic      1
reggae           3
religious        1
rnb              1
rock            11
ruspop           3
rusrap           3
rusrock          2
ska              1
spiritual        1
trance           1
türkçe           1
world            1
Name: genre_name, dtype: int64

In [293]:
genre_weekday(spb_general, 'Friday', '17:00:00', '23:00:00')

genre_name
classical       2
dance           4
easy            1
electronic      1
folk            1
hiphop          1
holiday         1
indie           1
instrumental    1
jazz            4
metal           1
pop             1
punk            1
rap             2
reggae          1
rock            2
ruspop          1
rusrock         1
shanson         1
soundtrack      1
world           1
Name: genre_name, dtype: int64

**Conclusion**

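Few records match the Monday morning window (three in Moscow and two in Saint-Petersburg), so no genre stands out there.

On Friday evening people listen to a wide range of genres. In Moscow pop leads (12 plays), closely followed by rock (11) and dance (10). In Saint-Petersburg dance and jazz lead with 4 plays each.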

# Do Moscow and Saint-Petersburg have different music preferences?

In [294]:
moscow_genres = moscow_general.groupby('genre_name')['genre_name'].count().sort_values(ascending=False)

In [295]:
moscow_genres.head(10)

genre_name
pop           164
dance         125
rock          115
electronic    112
hiphop         64
classical      42
ruspop         39
world          30
rusrap         28
jazz           26
Name: genre_name, dtype: int64

In [296]:
spb_genres = spb_general.groupby('genre_name')['genre_name'].count().sort_values(ascending = False)

In [297]:
spb_genres.head(10)

genre_name
pop            168
dance          134
rock           125
electronic     108
hiphop          50
alternative     43
world           42
classical       41
jazz            41
rusrap          30
Name: genre_name, dtype: int64
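One way to compare the two rankings is to check which genres appear in both top-10 lists. The sketch below copies the genre names from the two outputs above into plain lists (`moscow_top` and `spb_top` are illustrative names); in the notebook they could be taken from `moscow_genres.head(10).index` and `spb_genres.head(10).index` instead.

```python
# Top-10 genre names copied from the two outputs above, in ranking order.
moscow_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'classical', 'ruspop', 'world', 'rusrap', 'jazz']
spb_top = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
           'alternative', 'world', 'classical', 'jazz', 'rusrap']

# Genres present in both lists, and genres unique to each city.
shared = [g for g in moscow_top if g in spb_top]
only_moscow = [g for g in moscow_top if g not in spb_top]
only_spb = [g for g in spb_top if g not in moscow_top]

print(len(shared), only_moscow, only_spb)
```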

**Conclusion** 

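The most popular genres are nearly identical in both cities: pop, dance, rock, electronic and hiphop hold the first five places in the same order. Nine of the top ten genres overlap; *ruspop* appears only in Moscow's top ten and *alternative* only in Saint-Petersburg's.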