# Music Industry Sales Kaggle Project

## Introduction

### Main Stakeholders

Because this dataset covers music sales across different formats (CDs, cassettes, vinyl, etc.), the main stakeholder is a record label. The label ultimately wants to know: what is the best (most profitable) format in which to release its artists' songs?


### Business Tasks

To help guide my analysis, I want to answer these main questions:

- How many units were sold per song format?
- How much money did each format make in total?
- Adjusting for inflation, how much money is that today and which format was the most profitable?
- Are there any discrepancies or unusual trends (e.g., a sudden peak for a certain format)?
- What's the best (most profitable) format for new song releases?

In [1]:
# Imports
import numpy as np
import pandas as pd

## Exploratory Data Analysis (EDA)

In [2]:
# Read csv file
df = pd.read_csv('musicdata.csv')
df

Unnamed: 0,format,metric,year,number_of_records,value_actual
0,CD,Units,1973,1,
1,CD,Units,1974,1,
2,CD,Units,1975,1,
3,CD,Units,1976,1,
4,CD,Units,1977,1,
...,...,...,...,...,...
3003,Vinyl Single,Value (Adjusted),2015,1,6.205390
3004,Vinyl Single,Value (Adjusted),2016,1,5.198931
3005,Vinyl Single,Value (Adjusted),2017,1,6.339678
3006,Vinyl Single,Value (Adjusted),2018,1,5.386197


### Introductory Look

In [3]:
df.head()

Unnamed: 0,format,metric,year,number_of_records,value_actual
0,CD,Units,1973,1,
1,CD,Units,1974,1,
2,CD,Units,1975,1,
3,CD,Units,1976,1,
4,CD,Units,1977,1,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3008 entries, 0 to 3007
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   format             3008 non-null   object 
 1   metric             3008 non-null   object 
 2   year               3008 non-null   int64  
 3   number_of_records  3008 non-null   int64  
 4   value_actual       1351 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 117.6+ KB


In [5]:
# Statistical summary of numerical columns
df.describe()

Unnamed: 0,year,number_of_records,value_actual
count,3008.0,3008.0,1351.0
mean,1996.0,1.0,781.291237
std,13.566915,0.0,2246.837672
min,1973.0,1.0,-7.650944
25%,1984.0,1.0,3.700228
50%,1996.0,1.0,63.9
75%,2008.0,1.0,448.9
max,2019.0,1.0,19667.327786


In [6]:
df['format'].value_counts()

CD                                    141
DVD Audio                             141
Ringtones & Ringbacks                 141
Download Music Video                  141
Kiosk                                 141
CD Single                             141
Download Single                       141
SACD                                  141
Download Album                        141
Music Video (Physical)                141
Other Tapes                           141
8 - Track                             141
Vinyl Single                          141
LP/EP                                 141
Cassette Single                       141
Cassette                              141
Paid Subscriptions                     94
Limited Tier Paid Subscription         94
On-Demand Streaming (Ad-Supported)     94
Other Ad-Supported Streaming           94
Other Digital                          94
Paid Subscription                      94
SoundExchange Distributions            94
Synchronization                   

In [7]:
df['metric'].value_counts()

Value               1081
Value (Adjusted)    1081
Units                846
Name: metric, dtype: int64

So, from an introductory look at the data, there are:

- 3 numerical columns: 'year', 'number_of_records', 'value_actual'
- 2 text columns: 'format', 'metric'

Since 'format' and 'metric' both have the generic object dtype, I convert them to pandas' dedicated string dtype so that the columns are explicitly typed as text.

In [8]:
df['format'] = df['format'].astype('string')
df['metric'] = df['metric'].astype('string')
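
As a small aside, here is a sketch of what the conversion does on a toy Series (the values below are made up for illustration and are not taken from musicdata.csv). As far as I understand, the main practical differences are that the column is explicitly typed as text and that missing entries are kept as missing rather than turned into text.

```python
import pandas as pd

# Toy column standing in for df['format']; values are illustrative only.
raw = pd.Series(["CD", "Cassette", None], name="format")

# Same call as in the notebook cell above.
converted = raw.astype("string")

# Compare the inferred dtype with the explicit string dtype.
print(raw.dtype, "->", converted.dtype)

# The missing entry is still reported as missing after the conversion.
n_missing = int(converted.isna().sum())
print(n_missing)

# The .str accessor works on the converted column; missing stays missing.
upper = converted.str.upper()
print(upper.tolist())
```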

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3008 entries, 0 to 3007
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   format             3008 non-null   string 
 1   metric             3008 non-null   string 
 2   year               3008 non-null   int64  
 3   number_of_records  3008 non-null   int64  
 4   value_actual       1351 non-null   float64
dtypes: float64(1), int64(2), string(2)
memory usage: 117.6 KB


In [10]:
df['number_of_records'].value_counts()

1    3008
Name: number_of_records, dtype: int64

The 'number_of_records' column description states: "Unit (all rows are 1)." I assume this means that each observation (row) is 1 sale (e.g., 1 CD sale, 1 cassette sale, 1 paid subscription sale, etc.).
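
Whatever the intended meaning, a column that is 1 in every row cannot distinguish one observation from another. The sketch below checks this on a toy frame (the values and the names `toy`, `is_constant`, and `trimmed` are my own for illustration) and drops the column only if it really is constant.

```python
import pandas as pd

# Toy stand-in for the notebook's df; values are illustrative only.
toy = pd.DataFrame({
    "format": ["CD", "CD", "Cassette"],
    "metric": ["Units", "Value", "Units"],
    "year": [1990, 1990, 1990],
    "number_of_records": [1, 1, 1],
    "value_actual": [10.0, 100.0, 20.0],
})

# A column with a single distinct value carries no information for analysis.
is_constant = bool(toy["number_of_records"].nunique(dropna=False) == 1)

# Drop the column only when it is constant; otherwise keep the frame as is.
trimmed = toy.drop(columns="number_of_records") if is_constant else toy

print(is_constant)
print(list(trimmed.columns))
```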

In [22]:
num_null_values = df['value_actual'].isnull().groupby(df['format']).sum()
num_non_null_values = df['value_actual'].notnull().groupby(df['format']).sum()
num_null_values / num_non_null_values

format
8 - Track                              1.389831
CD                                     0.270270
CD Single                              0.468750
Cassette                               0.258929
Cassette Single                        1.473684
DVD Audio                              1.473684
Download Album                         1.937500
Download Music Video                   2.133333
Download Single                        1.937500
Kiosk                                  2.133333
LP/EP                                  0.000000
Limited Tier Paid Subscription        10.750000
Music Video (Physical)                 0.516129
On-Demand Streaming (Ad-Supported)     4.222222
Other Ad-Supported Streaming          10.750000
Other Digital                         10.750000
Other Tapes                            2.000000
Paid Subscription                      2.133333
Paid Subscriptions                     2.241379
Ringtones & Ringbacks                  2.133333
SACD                             

In [18]:
df['value_actual'].notnull().groupby(df['format']).sum()

format
8 - Track                              59
CD                                    111
CD Single                              96
Cassette                              112
Cassette Single                        57
DVD Audio                              57
Download Album                         48
Download Music Video                   45
Download Single                        48
Kiosk                                  45
LP/EP                                 141
Limited Tier Paid Subscription          8
Music Video (Physical)                 93
On-Demand Streaming (Ad-Supported)     18
Other Ad-Supported Streaming            8
Other Digital                           8
Other Tapes                            47
Paid Subscription                      30
Paid Subscriptions                     29
Ringtones & Ringbacks                  45
SACD                                   51
SoundExchange Distributions            32
Synchronization                        22
Vinyl Single               

In [20]:
num_null_values

format
8 - Track                             82
CD                                    30
CD Single                             45
Cassette                              29
Cassette Single                       84
DVD Audio                             84
Download Album                        93
Download Music Video                  96
Download Single                       93
Kiosk                                 96
LP/EP                                  0
Limited Tier Paid Subscription        86
Music Video (Physical)                48
On-Demand Streaming (Ad-Supported)    76
Other Ad-Supported Streaming          86
Other Digital                         86
Other Tapes                           94
Paid Subscription                     64
Paid Subscriptions                    65
Ringtones & Ringbacks                 96
SACD                                  90
SoundExchange Distributions           62
Synchronization                       72
Vinyl Single                           0
Name: val

Years in which non-null values are present, by format:

- 8-track: 1973-1982
- CD: 1983-2019
- CD Single: 1988-2019
  - There are negative numbers
  - Open question: do they occur under Units, Value, or Value (Adjusted)?
- Cassette: 1973-2008
- Cassette Single: 1987-2008
- DVD Audio: 2001-2019

The null values correspond to the years before a format was introduced or after it was discontinued (e.g., the 8-track format only lasted from 1973 to 1982, so the rows from 1983 onward are null).
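
One way to check this reading is to compute, for each format, the first and last year with a recorded value, and then count how many nulls fall inside that range. The sketch below does this on a toy frame (formats "A" and "B" and their values are invented; `active` and `gaps` are names I chose). If the explanation holds for the real data, the gap count should be 0 for every format.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: "A" stops after 1974, "B" starts in 1975 with one gap.
toy = pd.DataFrame({
    "format": ["A"] * 5 + ["B"] * 5,
    "metric": ["Units"] * 10,
    "year": [1973, 1974, 1975, 1976, 1977] * 2,
    "value_actual": [1.0, 2.0, np.nan, np.nan, np.nan,
                     np.nan, np.nan, 3.0, np.nan, 4.0],
})

# First and last year with a recorded (non-null) value, per format.
observed = toy.dropna(subset=["value_actual"])
active = observed.groupby("format")["year"].agg(first_year="min", last_year="max")

# Attach each format's active range to its rows.
merged = toy.merge(active, left_on="format", right_index=True)

# A null inside the active range is a gap the explanation does not cover.
inside = merged["year"].between(merged["first_year"], merged["last_year"])
gaps = (merged["value_actual"].isna() & inside).groupby(merged["format"]).sum()

print(active)
print(gaps)
```

On the real data it may be worth grouping by both 'format' and 'metric', since the three metrics of one format need not cover the same years.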

Because the dataset did not provide any further explanation of what the metrics mean, I am assuming they are defined as follows:

- Units: The quantity sold (eg. 1 CD Single, 2 Cassette)
- Value: The monetary price, in dollars
- Value (Adjusted): The adjusted monetary price to 2019, in dollars
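
To make these assumed definitions concrete, the sketch below reshapes a toy frame so that each metric becomes its own column, then derives two ratios. The numbers are invented, and the column names `value_per_unit` and `inflation_factor` are mine. The ratios are only meaningful if my assumptions above hold and if Units and Value are reported on the same scale, which the dataset does not confirm.

```python
import pandas as pd

# Toy stand-in for df: one format, one year, all three metrics (made-up values).
toy = pd.DataFrame({
    "format": ["CD", "CD", "CD"],
    "metric": ["Units", "Value", "Value (Adjusted)"],
    "year": [1990, 1990, 1990],
    "value_actual": [10.0, 100.0, 200.0],
})

# Long to wide: one row per (format, year), one column per metric.
wide = toy.pivot_table(
    index=["format", "year"],
    columns="metric",
    values="value_actual",
    aggfunc="first",
)

# Under the assumed definitions, Value / Units is revenue per unit sold.
wide["value_per_unit"] = wide["Value"] / wide["Units"]

# Adjusted / nominal value is the implied inflation factor for that year.
wide["inflation_factor"] = wide["Value (Adjusted)"] / wide["Value"]

print(wide)
```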

I treat the negative numbers as invalid values, because a negative unit count or price does not make sense under the definitions above. There are only 12 instances of these, so I will replace them with 0. In the case of Cassette Single, the negative numbers appear at the tail end, right before the format is discontinued, which suggests that replacing them with 0 is reasonable.
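
A minimal sketch of this replacement, shown on a toy frame rather than the real data (formats "X" and "Y" and their values are invented). It counts the negative entries first, so the number can be compared with the 12 mentioned above, and then clips them to 0 while leaving nulls untouched.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "format": ["X", "X", "X", "Y"],
    "value_actual": [5.0, -0.5, np.nan, 3.0],
})

# Flag negative entries; comparisons with NaN evaluate to False.
negative = toy["value_actual"] < 0
n_negative = int(negative.sum())

# Count of negative entries per format, to see where they occur.
per_format = negative.groupby(toy["format"]).sum()

# Replace negatives with 0 on a copy; clip leaves NaN as NaN.
cleaned = toy.copy()
cleaned["value_actual"] = cleaned["value_actual"].clip(lower=0)

print(n_negative)
print(per_format)
print(cleaned)
```

Working on a copy keeps the original values available in case the negatives turn out to be meaningful (for example, returns), which I have not been able to confirm.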