# 1. Import libraries

In [1]:
import pandas as pd

# 2. Import the data

In [2]:
music_df = pd.read_csv('music_project_en.csv')

# 3. Check the data

In [3]:
#See the first 5 rows of the dataframe
print(music_df.head())

     userID                        Track            artist  genre  \
0  FFB692EC            Kamigata To Boots  The Mass Missile   rock   
1  55204538  Delayed Because of Accident  Andreas Rönnberg   rock   
2    20EC38            Funiculì funiculà       Mario Lanza    pop   
3  A3DD03C9        Dragons in the Sunset        Fire + Ice   folk   
4  E2DC1FAE                  Soul People        Space Echo  dance   

        City        time        Day  
0  Shelbyville  20:28:33  Wednesday  
1  Springfield  14:07:09     Friday  
2  Shelbyville  20:58:07  Wednesday  
3  Shelbyville  08:37:09     Monday  
4  Springfield  08:34:34     Monday  


In [4]:
columns = music_df.columns
print(columns)

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


Column names:
* userID and City contain unnecessary spaces that may hinder analysis; remove them.
* The names use different styles; change them all to snake_case.

In [5]:
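#snake_case
# Strip surrounding spaces and lowercase all column names
columns = music_df.columns.str.strip().str.lower()
columns = columns.to_list() # Convert the Index to a list so single items can be reassigned
columns[0] = 'user_id' # Lowercasing alone gives 'userid', so rename the first column explicitly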
music_df.columns = columns
print(columns)

['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day']
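
The manual rename of the first column works for this file, but it depends on the column position. As a rough sketch of a more general approach, the hypothetical helper `to_snake_case` below (not part of the original notebook) strips spaces, splits camelCase names, and lowercases them. It is applied to an empty demo frame that reuses the raw column names printed above.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip spaces and convert a column name to snake_case."""
    name = name.strip()
    # Insert an underscore where a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    # Replace any inner whitespace with underscores
    name = re.sub(r'\s+', '_', name)
    return name.lower()


# Raw column names as printed by music_df.columns before cleaning
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
demo_df = pd.DataFrame(columns=raw_columns)

# rename() accepts a callable that is applied to every column label
demo_df = demo_df.rename(columns=to_snake_case)
print(demo_df.columns.to_list())
```

On the real data, the same call would be `music_df = music_df.rename(columns=to_snake_case)`.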


## 3.1 Data description
* 'user_id': uniquely identifies each user
* 'track': song title
* 'artist': artist name
* 'genre': music genre
* 'city': user's city
* 'time': the time of day when a track was played (HH:MM:SS)
* 'day': the day of the week when a track was played

In [6]:
#See the main data statistics
music_df.describe(include='all')

         user_id  track     artist  genre         city      time     day
count      65079  63736      57512  63881        65079     65079   65079
unique     41748  39666      37806    268            2     20392       3
top     A8AE9169  Brand  Kartvelli    pop  Springfield  08:14:07  Friday
freq          76    136        136   8850        45360        14   23149


Initial analysis:
* Some columns have missing values (count row: track, artist and genre have fewer entries than the other columns); these should be handled before moving forward with the analysis.
* The dataframe only contains data from 2 cities and 3 days (unique row), so these columns may be treated as categories.
* Pop seems to be the most listened-to genre (top row).
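
The number of missing values per column can be read off the count row. The sketch below copies the counts from the describe() output into a Series and subtracts them from the total row count. It assumes the total equals the largest count, which holds here because user_id, city, time and day are complete.

```python
import pandas as pd

# Counts copied from the "count" row of the describe() output
counts = pd.Series({
    'user_id': 65079,
    'track': 63736,
    'artist': 57512,
    'genre': 63881,
    'city': 65079,
    'time': 65079,
    'day': 65079,
})

# Assumption: at least one column has no missing values, so max() is the row count
total_rows = counts.max()
missing = total_rows - counts

# Show only the columns that actually have gaps, largest first
print(missing[missing > 0].sort_values(ascending=False))
```

These figures are taken before duplicates are removed, so they will differ from the null counts computed on the deduplicated dataframe.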

## 3.2 Check duplicates

In [7]:
duplicates = music_df.duplicated().sum()
print(f'The number of duplicate rows in the dataset is: {duplicates}')  

The number of duplicate rows in the dataset is: 3826


Remove the duplicated rows, then check the null values.

In [8]:
#Remove the duplicates and reset the index (drop=True discards the previous index)
music_df = music_df.drop_duplicates().reset_index(drop=True)  
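
As a quick sanity check on this step, the sketch below repeats it on a small made-up frame (`toy_df` and its values are illustrative only, not taken from the dataset) and confirms that the number of rows drops by exactly the number of duplicates reported by duplicated().

```python
import pandas as pd

# Illustrative toy data: the second row is an exact copy of the first
toy_df = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'B2'],
    'track': ['x', 'x', 'y', 'z'],
})

n_before = len(toy_df)
n_duplicates = toy_df.duplicated().sum()  # counts repeats, not the first occurrence

deduped = toy_df.drop_duplicates().reset_index(drop=True)

# The row count should drop by exactly the number of duplicates
print(n_before, n_duplicates, len(deduped))
assert len(deduped) == n_before - n_duplicates
assert deduped.duplicated().sum() == 0
```

The same relation holds for the figures in this notebook: 65079 rows minus 3826 duplicates leaves 61253 rows.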

## 3.3 Null values

In [9]:
#See the main data attributes
# info() prints its report directly and returns None, so it is not wrapped in print()
music_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61253 entries, 0 to 61252
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  61253 non-null  object
 1   track    59991 non-null  object
 2   artist   54156 non-null  object
 3   genre    60126 non-null  object
 4   city     61253 non-null  object
 5   time     61253 non-null  object
 6   day      61253 non-null  object
dtypes: object(7)
memory usage: 3.3+ MB


In [10]:
#See the number of null values in each column
# Sorted in descending order
# Framed output to be more readable
music_df.isna().sum().sort_values(ascending=False).to_frame('Null Values')

         Null Values
artist          7097
track           1262
genre           1127
user_id            0
city               0
time               0
day                0
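
The notebook does not yet decide how to handle these nulls. One possible option, shown here only as a sketch on made-up rows, is to replace missing text values with a placeholder so the rows are kept. Both the placeholder `'unknown'` and the toy frame are assumptions, not choices made in the original analysis.

```python
import pandas as pd

# Illustrative toy data with gaps in the same three columns
toy_df = pd.DataFrame({
    'track': ['Soul People', None, 'Brand'],
    'artist': [None, 'Space Echo', None],
    'genre': ['dance', 'pop', None],
})

# Assumed placeholder; dropping the rows instead would be another valid choice
for column in ['track', 'artist', 'genre']:
    toy_df[column] = toy_df[column].fillna('unknown')

# All null counts should now be zero
print(toy_df.isna().sum())
```

Filling keeps every play event in the data, which matters if later steps count plays per city or day. Dropping rows with `dropna(subset=[...])` would be more appropriate if the missing field is essential to the question being asked.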


## 3.4 Data types

Change the following data types:
* 'city' can be a category instead of object (only 2 possible values)
* 'time' should be a datetime type instead of object
* 'day' can be a category instead of object (only 3 possible values)

In [12]:
music_df['city'] = music_df['city'].astype('category')
music_df['time'] = pd.to_datetime(music_df['time'], format='%H:%M:%S')
music_df['day'] = music_df['day'].astype('category')

print(music_df.dtypes)

user_id            object
track              object
artist             object
genre              object
city             category
time       datetime64[ns]
day              category
dtype: object
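
One thing to keep in mind with this conversion: when the format contains only a time, pd.to_datetime fills in a default date of 1900-01-01. The sketch below shows this on three time strings taken from the head() output, and two ways to work with the time part only (the variable names are illustrative).

```python
import pandas as pd

# Sample values from the 'time' column shown in the head() output
times = pd.Series(['20:28:33', '14:07:09', '08:37:09'])

parsed = pd.to_datetime(times, format='%H:%M:%S')
# The date part defaults to 1900-01-01 because the strings carry no date
print(parsed.iloc[0])

# Extract only what is needed for time-of-day analysis
hours = parsed.dt.hour      # integer hour, handy for grouping
time_only = parsed.dt.time  # datetime.time objects (dtype becomes object again)

print(hours.to_list())
print(time_only.iloc[0])
```

The placeholder date is harmless as long as only the time component is compared or grouped, but it should not be read as the real date of the play.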
