# MLC Project - Napster
Group Focus: Data provided by DSPs (digital service providers)

Considerations: Which DSPs send incomplete data, or fail to send even the minimum data required for matching?

Which DSPs have the most streams? Do songs get more air time on certain DSPs?

Does song length correlate with DSP in any way?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re

## Read data into notebook

In [2]:
mlc = pd.read_csv('../data/MLC_sample.csv')
mlc.head()

Unnamed: 0,If,Country Code,Registrant Code,Year of Reference,Usage Period,Streaming Platform (DSP),Streams,Recording Artist,Recording Label,Recording Title,Release Title,Recording Duration (Seconds),"Songwriter's Listed (1 = ""Yes"", 0 = ""No"")"
0,USUG12101043,US,UG1,21.0,,AudioMack,10175,Future,,FUTURE FT DEJ LOAF HEY THERE PROD BY DDS,#unknown#,181,0
1,USRC12100543,US,RC1,21.0,,SoundCloud,8597,LUCKY3RD,LUCKY3RD,Keep It Cool LUCKY3RD,Keep It Cool LUCKY3RD,133,0
2,USSM12102263,US,SM1,21.0,,SoundCloud,261280,LUCKY3RD,LUCKY3RD,Life Goes On LUCKY3RD,Life Goes On LUCKY3RD,171,0
3,USLD91731547,US,LD9,17.0,2/1/2021,Trebel,5,Bachata & Merengue Mix,Orchard,No dudes de mi- Merengue & Bachata Mix,Mega Mix 2010,1250,0
4,USAT22007048,US,AT2,20.0,,AudioMack,62105,Foolio,,WHEN I SEE YOU REMIX,#unknown#,187,0


## Rename columns

In [3]:
mlc.columns = ['ISRC', 'Country Code', 'Registrant Code', 'Year of Reference', 'Usage Period', 'DSP', 'Streams', 'Recording Artist', 'Recording Label', 'Recording Title', 'Release Title', 'Recording Duration (Seconds)', 'Songwriters Listed (1=Y, 0=N)']
mlc.head()

Unnamed: 0,ISRC,Country Code,Registrant Code,Year of Reference,Usage Period,DSP,Streams,Recording Artist,Recording Label,Recording Title,Release Title,Recording Duration (Seconds),"Songwriters Listed (1=Y, 0=N)"
0,USUG12101043,US,UG1,21.0,,AudioMack,10175,Future,,FUTURE FT DEJ LOAF HEY THERE PROD BY DDS,#unknown#,181,0
1,USRC12100543,US,RC1,21.0,,SoundCloud,8597,LUCKY3RD,LUCKY3RD,Keep It Cool LUCKY3RD,Keep It Cool LUCKY3RD,133,0
2,USSM12102263,US,SM1,21.0,,SoundCloud,261280,LUCKY3RD,LUCKY3RD,Life Goes On LUCKY3RD,Life Goes On LUCKY3RD,171,0
3,USLD91731547,US,LD9,17.0,2/1/2021,Trebel,5,Bachata & Merengue Mix,Orchard,No dudes de mi- Merengue & Bachata Mix,Mega Mix 2010,1250,0
4,USAT22007048,US,AT2,20.0,,AudioMack,62105,Foolio,,WHEN I SEE YOU REMIX,#unknown#,187,0


## Check for null values in each column

In [4]:
mlc.isnull().sum()

ISRC                             1760
Country Code                     1697
Registrant Code                  1697
Year of Reference                1761
Usage Period                     8102
DSP                              3999
Streams                             0
Recording Artist                    0
Recording Label                  1008
Recording Title                     0
Release Title                      69
Recording Duration (Seconds)        0
Songwriters Listed (1=Y, 0=N)       0
dtype: int64

## Check data type in each column

In [5]:
mlc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 13 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   ISRC                           98240 non-null   object 
 1   Country Code                   98303 non-null   object 
 2   Registrant Code                98303 non-null   object 
 3   Year of Reference              98239 non-null   float64
 4   Usage Period                   91898 non-null   object 
 5   DSP                            96001 non-null   object 
 6   Streams                        100000 non-null  int64  
 7   Recording Artist               100000 non-null  object 
 8   Recording Label                98992 non-null   object 
 9   Recording Title                100000 non-null  object 
 10  Release Title                  99931 non-null   object 
 11  Recording Duration (Seconds)   100000 non-null  int64  
 12  Songwriters Listed (1=Y, 0=N)  100000 non-null  int64  

## Count of DSPs

In [6]:
mlc['DSP'].value_counts()

Spotify               32268
Apple                 22200
Amazon                14438
Pandora               13777
Tidal                  3521
YouTube                2752
SoundCloud             2122
GTL                    1090
Melodyv                 829
Trebel                  817
iHeart Radio            707
AudioMack               550
NugsNet                 316
LiveXLive               200
Qoboz                   104
Midwest Tape            102
Deezer                   51
Anghami                  46
Sonos                    23
Recisio                  22
Smithsonian              21
Ultimate Guitar          19
PowerMusic                8
Wolfgangs                 4
Fan Label                 4
MixCloud                  4
Pacemaker                 3
Classical Archives        2
MonkingMe                 1
Name: DSP, dtype: int64

The dataset contains 29 distinct DSPs, each with at least one entry. A further 3,999 entries (about 4% of the 100,000 rows) have a null value for DSP.
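To keep the rows with no DSP visible next to the named platforms, the same count can be run with `dropna=False`. The following is a minimal sketch on a small made-up frame; `sample` and its values are hypothetical and are not drawn from the MLC file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the DSP column of `mlc`
sample = pd.DataFrame({"DSP": ["Spotify", "Apple", "Spotify", np.nan, "Tidal", np.nan]})

# dropna=False keeps NaN as its own category instead of dropping it
dsp_counts = sample["DSP"].value_counts(dropna=False)

# Distinct named DSPs (nunique ignores NaN by default) and rows with no DSP
named_dsps = sample["DSP"].nunique()
missing_rows = int(sample["DSP"].isna().sum())

print(dsp_counts)
print(named_dsps, missing_rows)
```

On the full dataset the same two numbers would be expected to match the 29 DSPs and 3,999 null rows reported above.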

## DSP DataFrames for further analysis

In [7]:
deezer = mlc.loc[mlc['DSP'] == 'Deezer']
deezer.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  5
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [8]:
anghami = mlc.loc[mlc['DSP'] == 'Anghami']
anghami.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  3
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [9]:
sonos = mlc.loc[mlc['DSP'] == 'Sonos']
sonos.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  0
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [10]:
recisio = mlc.loc[mlc['DSP'] == 'Recisio']
recisio.isnull().sum()

ISRC                              0
Country Code                      0
Registrant Code                   0
Year of Reference                 0
Usage Period                      0
DSP                               0
Streams                           0
Recording Artist                  0
Recording Label                  22
Recording Title                   0
Release Title                     0
Recording Duration (Seconds)      0
Songwriters Listed (1=Y, 0=N)     0
dtype: int64

In [11]:
smithsonian = mlc.loc[mlc['DSP'] == 'Smithsonian']
smithsonian.isnull().sum()

ISRC                             1
Country Code                     1
Registrant Code                  1
Year of Reference                1
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  0
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [12]:
ultimate_guitar = mlc.loc[mlc['DSP'] == 'Ultimate Guitar']
ultimate_guitar.isnull().sum()

ISRC                              0
Country Code                      0
Registrant Code                   0
Year of Reference                 0
Usage Period                      0
DSP                               0
Streams                           0
Recording Artist                  0
Recording Label                  19
Recording Title                   0
Release Title                    19
Recording Duration (Seconds)      0
Songwriters Listed (1=Y, 0=N)     0
dtype: int64

In [13]:
powermusic = mlc.loc[mlc['DSP'] == 'PowerMusic']
powermusic.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  8
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [14]:
wolfgangs = mlc.loc[mlc['DSP'] == 'Wolfgangs']
wolfgangs.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  4
Recording Title                  0
Release Title                    4
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [15]:
fan_label = mlc.loc[mlc['DSP'] == 'Fan Label']
fan_label.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  0
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [16]:
mixcloud = mlc.loc[mlc['DSP'] == 'MixCloud']
mixcloud.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  4
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [17]:
pacemaker = mlc.loc[mlc['DSP'] == 'Pacemaker']
pacemaker.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  0
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [18]:
classical_archives = mlc.loc[mlc['DSP'] == 'Classical Archives']
classical_archives.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  0
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [19]:
monkingme = mlc.loc[mlc['DSP'] =='MonkingMe']
monkingme.isnull().sum()

ISRC                             0
Country Code                     0
Registrant Code                  0
Year of Reference                0
Usage Period                     0
DSP                              0
Streams                          0
Recording Artist                 0
Recording Label                  1
Recording Title                  0
Release Title                    0
Recording Duration (Seconds)     0
Songwriters Listed (1=Y, 0=N)    0
dtype: int64

In [20]:
unknownDSP = mlc.loc[mlc['DSP'].isna()]
unknownDSP.isnull().sum()

ISRC                               58
Country Code                       54
Registrant Code                    54
Year of Reference                  58
Usage Period                     3999
DSP                              3999
Streams                             0
Recording Artist                    0
Recording Label                    47
Recording Title                     0
Release Title                      35
Recording Duration (Seconds)        0
Songwriters Listed (1=Y, 0=N)       0
dtype: int64
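The per-DSP checks above repeat the same two lines for each platform. As an alternative, one grouped summary can report nulls for every DSP at once, including the rows with no DSP. This is only a sketch under assumptions: the `sample` frame and the "Unknown DSP" label are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column names as `mlc`
sample = pd.DataFrame({
    "DSP": ["Deezer", "Deezer", "Recisio", "Recisio", np.nan],
    "ISRC": ["A1", "A2", "B1", "B2", np.nan],
    "Recording Label": ["X", np.nan, np.nan, np.nan, "Y"],
})

# Give missing DSPs an explicit label so they form their own group
dsp_key = sample["DSP"].fillna("Unknown DSP")

# 1 where a value is missing, 0 otherwise, for every column except DSP
is_null = sample.drop(columns="DSP").isna().astype(int)

# Null count and null share per column within each DSP
nulls_by_dsp = is_null.groupby(dsp_key).sum()
null_share_by_dsp = is_null.groupby(dsp_key).mean()

print(nulls_by_dsp)
print(null_share_by_dsp)
```

The share table is the more useful one for comparing platforms, because the DSPs differ widely in row count.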

## Deeper dive into song length details
The dataset provides song duration in seconds.

In [21]:
mlc['Recording Duration (Seconds)'].describe()

count    100000.000000
mean       1016.893690
std       15565.692133
min           0.000000
25%         149.000000
50%         190.000000
75%         236.000000
max      818738.000000
Name: Recording Duration (Seconds), dtype: float64

In [22]:
mlc.nlargest(10, ['Recording Duration (Seconds)'])

Unnamed: 0,ISRC,Country Code,Registrant Code,Year of Reference,Usage Period,DSP,Streams,Recording Artist,Recording Label,Recording Title,Release Title,Recording Duration (Seconds),"Songwriters Listed (1=Y, 0=N)"
19931,QZES81895037,QZ,ES8,18.0,2/1/2021,Trebel,6464,Megan Thee Stallion,Warner,Cry Baby (feat. DaBaby),Good News,818738,0
34584,DEN962000842,DE,N96,20.0,2/1/2021,Trebel,4797,SZA,Sony Music,Good Days,Good Days,770091,0
20399,JMA272011862,JM,A27,20.0,2/1/2021,Trebel,3492,H.E.R.,Sony Music,Damage,Damage,714437,0
48381,USAT22006112,US,AT2,20.0,2/1/2021,Trebel,2965,Kevo Muney,Warner,Leave Some Day,Leave Some Day,463500,0
138,USUYG1360358,US,UYG,13.0,4/1/2021,Deezer,2,Unknown,,{Cartagena},Silvestre Dangond - Grandes Éxitos,356461,0
139,USUM72105617,US,UM7,21.0,4/1/2021,Deezer,2,Unknown,,{Esa Mujer},Silvestre Dangond - Grandes Éxitos,356461,0
140,USAT22007802,US,AT2,20.0,5/1/2021,Deezer,2,Unknown,,{Esa Mujer},Silvestre Dangond - Grandes Éxitos,356461,0
141,USUM72024668,US,UM7,20.0,4/1/2021,Deezer,2,Unknown,,"{Me Gusta, Me Gusta}",Silvestre Dangond - Grandes Éxitos,356461,0
142,USUG12004700,US,UG1,20.0,4/1/2021,Deezer,2,Unknown,,"{Me Gusta, Me Gusta}",Silvestre Dangond - Grandes Éxitos,356461,0
258,QZMGB2100021,QZ,MGB,21.0,3/1/2021,Pandora,160,Bizzy Bone,Bizzy Bone,0.44,The Mantra,356461,1


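The longest recording in this dataset is listed at 818,738 seconds, which is about 227.4 hours (roughly 9.5 days). The four longest entries all come from Trebel. The value 356,461 seconds (about 99 hours) also appears repeatedly, across Deezer and Pandora rows with different titles. Both patterns deserve additional attention.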
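The conversions behind these figures are plain arithmetic. A short worked check, where the constant names are chosen here for illustration:

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24

# Durations taken from the nlargest() output above
longest_seconds = 818_738
repeated_seconds = 356_461

longest_hours = longest_seconds / SECONDS_PER_HOUR
longest_days = longest_hours / HOURS_PER_DAY
repeated_hours = repeated_seconds / SECONDS_PER_HOUR

print(round(longest_hours, 1))   # 227.4
print(round(longest_days, 1))    # 9.5
print(round(repeated_hours, 1))  # 99.0
```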

In [23]:
mlc.nsmallest(10, ['Recording Duration (Seconds)'])

Unnamed: 0,ISRC,Country Code,Registrant Code,Year of Reference,Usage Period,DSP,Streams,Recording Artist,Recording Label,Recording Title,Release Title,Recording Duration (Seconds),"Songwriters Listed (1=Y, 0=N)"
14080,USAB02001303,US,AB0,20.0,3/1/2021,Apple,28962,Mac-K the K Baby,OneWay DumbWay,Butterflies,Butterflies - Single,0,1
14081,GBKQU2111667,GB,KQU,21.0,3/1/2021,Apple,28962,Mac-K the K Baby,OneWay DumbWay,Butterflies,Butterflies - Single,0,1
14916,AUHS01912771,AU,HS0,19.0,2/1/2021,iHeart Radio,2192,Jelly Roll & Struggle Jennings,Struggle Jelly Roll,Can't Go Home,Waylon & Willie 2,0,0
33960,USRH12000094,US,RH1,20.0,1/1/2021,Apple,43995,T9ine,Columbia,Go Harder,Go Harder - Single,0,1
51916,QM24S1705172,QM,24S,17.0,2/1/2021,iHeart Radio,1045,Jelly Roll & Struggle Jennings,Struggle Jelly Roll,Love Won (feat. Shooter Jennings),Waylon & Willie 2,0,0
56892,USA6B0446310,US,A6B,4.0,3/1/2021,Apple,103842,"RH Music, Eduardo Luzquiños",Eduardo Luzquiños,Motive X Promiscuous,Motive X Promiscuous (Remix) - Single,0,1
56893,USY251719876,US,Y25,17.0,3/1/2021,Apple,20793,"RH Music, Eduardo Luzquiños",Eduardo Luzquiños,Motive X Promiscuous,Motive X Promiscuous (Remix) - Single,0,1
56894,USLR50100122,US,LR5,1.0,3/1/2021,Apple,20793,"RH Music, Eduardo Luzquiños",Eduardo Luzquiños,Motive X Promiscuous,Motive X Promiscuous (Remix) - Single,0,1
81337,USWB12004709,US,WB1,20.0,2/1/2021,iHeart Radio,1582,Thom Rotella,Thom Rotella,Street Talk,Street Talk,0,0
81634,QZMEN2057850,QZ,MEN,20.0,3/1/2021,Apple,40593,"Wovy, LowkeyLuke, Alejandro Lema",WOVY,Stuntin' On My Ex,Stuntin' On My Ex - Single,0,1


At least ten entries list a recording duration of 0 seconds, and all ten shown here come from Apple or iHeart Radio. These entries need additional clarification.

In [24]:
mlc['Recording Duration (Seconds)'].value_counts().loc[lambda x : x<60]

327      59
338      56
351      56
349      56
58       55
         ..
3794      1
33633     1
3894      1
15937     1
8755      1
Name: Recording Duration (Seconds), Length: 2000, dtype: int64

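This filter is applied to the counts returned by `value_counts()`, not to the durations themselves: it shows that 2,000 distinct duration values each occur fewer than 60 times. It does not give the number of entries shorter than 60 seconds.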
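To count entries shorter than 60 seconds, the comparison can be applied to the duration column itself rather than to the output of `value_counts()`. A minimal sketch on made-up durations (the `durations` Series is hypothetical):

```python
import pandas as pd

# Hypothetical durations in seconds
durations = pd.Series([0, 45, 59, 60, 190, 190, 236],
                      name="Recording Duration (Seconds)")

# value_counts() is indexed by duration and holds counts,
# so `x < 60` here selects durations that occur fewer than 60 times
rare_values = durations.value_counts().loc[lambda x: x < 60]

# Comparing the column itself tests the durations
short_mask = durations < 60
short_entries = int(short_mask.sum())

# Several conditions are combined with & and parentheses, not `and`
one_to_two_min = int(((durations > 60) & (durations < 120)).sum())

print(len(rare_values), short_entries, one_to_two_min)
```

In this toy data every distinct duration is "rare", so the first filter returns all six distinct values, while only three entries are actually shorter than 60 seconds.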

In [25]:
# mlc['Recording Duration (Seconds)'].value_counts().loc[lambda x : x>60 and x<120]
# `and` cannot combine two Series comparisons (pandas raises a ValueError),
# so moving to a function with .apply in order to evaluate multiple conditions in a cleaner format

In [26]:
def assign_length(seconds):
    # Bucket a duration given in seconds; each upper bound is inclusive
    if seconds <= 120:
        return "Up to 2 min"
    elif seconds <= 240:
        return "2-4 min"
    elif seconds <= 420:
        return "4-7 min"
    elif seconds <= 600:
        return "7-10 min"
    elif seconds <= 900:
        return "10-15 min"
    elif seconds <= 3600:
        return "15-60 min"
    elif seconds <= 86400:
        return "1-24 hours"
    else:
        return "More than 1 day"

mlc['Song Length'] = mlc['Recording Duration (Seconds)'].apply(assign_length)
mlc.head()

Unnamed: 0,ISRC,Country Code,Registrant Code,Year of Reference,Usage Period,DSP,Streams,Recording Artist,Recording Label,Recording Title,Release Title,Recording Duration (Seconds),"Songwriters Listed (1=Y, 0=N)",Song Length
0,USUG12101043,US,UG1,21.0,,AudioMack,10175,Future,,FUTURE FT DEJ LOAF HEY THERE PROD BY DDS,#unknown#,181,0,2-4 min
1,USRC12100543,US,RC1,21.0,,SoundCloud,8597,LUCKY3RD,LUCKY3RD,Keep It Cool LUCKY3RD,Keep It Cool LUCKY3RD,133,0,2-4 min
2,USSM12102263,US,SM1,21.0,,SoundCloud,261280,LUCKY3RD,LUCKY3RD,Life Goes On LUCKY3RD,Life Goes On LUCKY3RD,171,0,2-4 min
3,USLD91731547,US,LD9,17.0,2/1/2021,Trebel,5,Bachata & Merengue Mix,Orchard,No dudes de mi- Merengue & Bachata Mix,Mega Mix 2010,1250,0,15-60 min
4,USAT22007048,US,AT2,20.0,,AudioMack,62105,Foolio,,WHEN I SEE YOU REMIX,#unknown#,187,0,2-4 min


In [27]:
mlc['Song Length'].value_counts()

2-4 min            64179
4-7 min            17204
Up to 2 min        12487
7-10 min            2289
1-24 hours          1636
15-60 min           1257
10-15 min            751
More than 1 day      197
Name: Song Length, dtype: int64
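The same buckets can also be produced without a row-by-row function by using `pd.cut`. This is a sketch of an alternative, not the approach used in this notebook; the `durations` Series is made up, and the bin edges are copied from `assign_length`.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in seconds, chosen to sit on and around the bin edges
durations = pd.Series([0, 120, 121, 240, 1250, 3601, 86400, 818738])

# Edges mirror assign_length; -inf and inf catch everything outside the named bounds
bins = [-np.inf, 120, 240, 420, 600, 900, 3600, 86400, np.inf]
labels = ["Up to 2 min", "2-4 min", "4-7 min", "7-10 min",
          "10-15 min", "15-60 min", "1-24 hours", "More than 1 day"]

# right=True (the default) makes each upper edge inclusive, matching the <= checks
song_length = pd.cut(durations, bins=bins, labels=labels)

print(song_length.value_counts())
```

The result is an ordered categorical, so `value_counts(sort=False)` or a plot would list the buckets in duration order instead of by frequency.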

In [29]:
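# Median duration in seconds (190, the 50% value from describe() above)
median_duration = mlc['Recording Duration (Seconds)'].median()

def compared_to_median(seconds):
    # Label a duration as below, equal to, or above the dataset median
    if seconds < median_duration:
        return "Lower"
    elif seconds == median_duration:
        return "Median"
    else:
        return "Higher"

mlc['Length Compared to Median'] = mlc['Recording Duration (Seconds)'].apply(compared_to_median)
mlc.head()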

Unnamed: 0,ISRC,Country Code,Registrant Code,Year of Reference,Usage Period,DSP,Streams,Recording Artist,Recording Label,Recording Title,Release Title,Recording Duration (Seconds),"Songwriters Listed (1=Y, 0=N)",Song Length,Length Compared to Median
0,USUG12101043,US,UG1,21.0,,AudioMack,10175,Future,,FUTURE FT DEJ LOAF HEY THERE PROD BY DDS,#unknown#,181,0,2-4 min,Lower
1,USRC12100543,US,RC1,21.0,,SoundCloud,8597,LUCKY3RD,LUCKY3RD,Keep It Cool LUCKY3RD,Keep It Cool LUCKY3RD,133,0,2-4 min,Lower
2,USSM12102263,US,SM1,21.0,,SoundCloud,261280,LUCKY3RD,LUCKY3RD,Life Goes On LUCKY3RD,Life Goes On LUCKY3RD,171,0,2-4 min,Lower
3,USLD91731547,US,LD9,17.0,2/1/2021,Trebel,5,Bachata & Merengue Mix,Orchard,No dudes de mi- Merengue & Bachata Mix,Mega Mix 2010,1250,0,15-60 min,Higher
4,USAT22007048,US,AT2,20.0,,AudioMack,62105,Foolio,,WHEN I SEE YOU REMIX,#unknown#,187,0,2-4 min,Lower


In [30]:
mlc['Length Compared to Median'].value_counts()

Lower     49891
Higher    49470
Median      639
Name: Length Compared to Median, dtype: int64
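One way to relate this label back to the opening question about song length and DSP is a cross-tabulation of the two columns. The sketch below uses a made-up `sample` frame and `np.select` in place of `.apply`; both are assumptions for illustration rather than part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column names as `mlc`
sample = pd.DataFrame({
    "DSP": ["Spotify", "Spotify", "Apple", "Apple", "Trebel"],
    "Recording Duration (Seconds)": [150, 190, 190, 236, 818738],
})

duration = sample["Recording Duration (Seconds)"]
median_duration = duration.median()

# Vectorised version of the Lower / Median / Higher labelling
sample["Length Compared to Median"] = np.select(
    [duration < median_duration, duration == median_duration],
    ["Lower", "Median"],
    default="Higher",
)

# Rows: DSP, columns: label, cells: number of entries
by_dsp = pd.crosstab(sample["DSP"], sample["Length Compared to Median"])
print(by_dsp)
```

Passing `normalize="index"` to `pd.crosstab` would turn the counts into per-DSP shares, which is easier to compare across platforms of very different sizes.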