## Databricks Recommendation System - Music

## How to create a music recommender?

### We have two approaches: 

####  Approach 1:

* User reviews and likes

####  Approach 2:

* Similarities between songs
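As a rough illustration of the second approach, the sketch below ranks songs by the Euclidean distance between their audio-feature vectors. The four features used (valence, acousticness, danceability, energy) and the distance metric are assumptions chosen for the example, not necessarily what the notebook ends up using; the values are copied from three rows of `data.csv`.

```python
import math

# Feature vectors (valence, acousticness, danceability, energy) copied from
# three rows of data.csv. The choice of these four features is an assumption.
songs = {
    "Piano Concerto No. 3 in D Minor": (0.0594, 0.982, 0.279, 0.211),
    "Clancy Lowered the Boom": (0.963, 0.732, 0.819, 0.341),
    "Morceaux de fantaisie, Op. 3: No. 2": (0.0731, 0.993, 0.389, 0.088),
}


def euclidean(a, b):
    """Straight-line distance between two equally sized feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(title, catalog):
    """Return the title in `catalog` whose features are closest to `title`."""
    target = catalog[title]
    others = {name: vec for name, vec in catalog.items() if name != title}
    return min(others, key=lambda name: euclidean(target, others[name]))


closest = most_similar("Piano Concerto No. 3 in D Minor", songs)
print(closest)
```

A smaller distance means the two tracks have more similar feature values. Features on different scales (for example tempo or loudness) would need to be normalized before being mixed into such a distance.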


## Features summary

* Acousticness: numeric variable; a confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence that the track is acoustic ("unplugged" instruments), and 0.0 represents the opposite.

* Danceability: numerical variable; danceability describes how suitable a track is for dancing based on a combination of musical elements, including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is the least danceable and 1.0 is the most danceable.

* Duration_ms: numeric variable; the length of the track in milliseconds.

* Duration_min: numeric variable; the length of the track in minutes.

* Energy: numeric variable; a measure from 0.0 to 1.0 that represents a perception of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual characteristics that contribute to this attribute include dynamic range, perceived loudness, timbre, onset rate, and overall entropy.

* Explicit: categorical variable; whether the track has explicit lyrics (1 = yes; 0 = no or unknown).

* Id: The Spotify ID for the track.

* Instrumentalness: numeric variable; predicts whether a track contains no vocals. The sounds “ooh” and “aah” are treated as instrumental in this context. Rap or spoken-word tracks are clearly “vocal.” The closer the instrumentalness value is to 1.0, the more likely the track is to contain no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is greater as the value approaches 1.0.

* Key: numeric variable; the estimated overall key of the track. Integers are mapped to pitches using standard Pitch Class notation. For example, 0 = C, 1 = C#/Db, 2 = D, and so on. If no key was detected, the value is -1.

* Liveness: numeric variable; detects the presence of an audience in the recording. Higher liveness values represent a greater likelihood that the track was performed live. A value above 0.8 indicates a strong likelihood that the track is live.

* Loudness: numeric variable; the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the perceived strength of a sound and correlates with its amplitude, that is, the size of the peaks and troughs of the sound wave. Typical values range between -60 and 0 dB.

* Mode: numeric variable; indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0.

* Popularity: numeric variable; the popularity of a track is a value between 0 and 100, with 100 being the most popular. Popularity is calculated algorithmically and is largely based on the total number of plays the track has had and how recent those plays are.

* Speechiness: numeric variable; detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audiobook, poetry), the closer the attribute value will be to 1.0. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including cases such as rap music. Values below 0.33 most likely represent music and other non-speech tracks.

* Tempo: numeric variable; the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

* Valence: numeric variable; a measure from 0.0 to 1.0 describing the musical positivity conveyed by a track. Tracks with high valence sound more positive (e.g., happy, joyful, euphoric), while tracks with low valence sound more negative (e.g., sad, depressed, angry).

* Year: the year in which the track was released.

Source: https://developer.spotify.com/documentation/web-api/reference/get-audio-features
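Several of the features above are encoded values that need decoding before they are readable. The sketch below is a minimal illustration of the thresholds and mappings described in the list; the function names are hypothetical helpers and are not part of the Spotify API or of this notebook.

```python
# Pitch Class notation: index 0 = C, 1 = C#/Db, 2 = D, and so on up to 11 = B.
PITCH_CLASSES = [
    "C", "C#/Db", "D", "D#/Eb", "E", "F",
    "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B",
]


def key_name(key):
    """Translate the integer `key` feature; -1 means no key was detected."""
    if key == -1:
        return None
    if not 0 <= key <= 11:
        raise ValueError(f"key must be between -1 and 11, got {key}")
    return PITCH_CLASSES[key]


def mode_name(mode):
    """Translate the `mode` feature: 1 is major, 0 is minor."""
    return "major" if mode == 1 else "minor"


def speechiness_band(value):
    """Bucket `speechiness` using the 0.33 and 0.66 thresholds from the list."""
    if value > 0.66:
        return "spoken word"
    if value >= 0.33:
        return "music and speech"
    return "music"


def duration_minutes(duration_ms):
    """Convert `duration_ms` to minutes (the derived Duration_min feature)."""
    return duration_ms / 60000


# Values taken from the first row of data.csv.
print(key_name(10), mode_name(1), speechiness_band(0.0366),
      round(duration_minutes(831667), 2))
```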

## Creating Directory

In [0]:
dbutils.fs.ls('/FileStore/')

[FileInfo(path='dbfs:/FileStore/tables/', name='tables/', size=0, modificationTime=0)]

In [0]:
display(dbutils.fs.ls('/FileStore/tables/'))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/,databricks-classes/,0,0


In [0]:
dbutils.fs.mkdirs('/FileStore/tables/databricks-classes/Recommendation-System-Music/')
display(dbutils.fs.ls('/FileStore/tables/databricks-classes/'))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/,Recommendation-System-Music/,0,0
dbfs:/FileStore/tables/databricks-classes/data-analysis/,data-analysis/,0,0
dbfs:/FileStore/tables/databricks-classes/file-formats/,file-formats/,0,0
dbfs:/FileStore/tables/databricks-classes/wine-quality/,wine-quality/,0,0


In [0]:
display(dbutils.fs.ls('/FileStore/tables/databricks-classes/Recommendation-System-Music/'))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data.csv,data.csv,29654587,1706255487000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_artist.csv,data_by_artist.csv,4315607,1706255482000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_genres.csv,data_by_genres.csv,576456,1706255482000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_year.csv,data_by_year.csv,21194,1706255483000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data_w_genres.csv,data_w_genres.csv,5224673,1706255484000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/,processed-data/,0,0
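The `size` and `modificationTime` columns in the listing are raw numbers. The sketch below shows one way to make them readable, assuming that `size` is in bytes and `modificationTime` is in milliseconds since the Unix epoch (an assumption consistent with the magnitude of the values). The tuples are copied from the listing; the helper names are hypothetical.

```python
from datetime import datetime, timezone

# (name, size, modificationTime) copied from the listing above.
listing = [
    ("data.csv", 29654587, 1706255487000),
    ("data_by_year.csv", 21194, 1706255483000),
]


def human_size(size_bytes):
    """Format a byte count in mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return f"{size_bytes / 1024 ** 2:.2f} MiB"


def modified_utc(modification_ms):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(modification_ms / 1000, tz=timezone.utc)


for name, size, modified in listing:
    print(name, human_size(size), modified_utc(modified).isoformat())
```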


## Imports:

In [0]:
from pyspark.sql.types import IntegerType, DoubleType, StringType
from pyspark.sql import functions as f
import pyspark.pandas as ps

## Loading the project Data

In [0]:
display(dbutils.fs.ls('/FileStore/tables/databricks-classes/Recommendation-System-Music/'))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data.csv,data.csv,29654587,1706255487000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_artist.csv,data_by_artist.csv,4315607,1706255482000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_genres.csv,data_by_genres.csv,576456,1706255482000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_year.csv,data_by_year.csv,21194,1706255483000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/data_w_genres.csv,data_w_genres.csv,5224673,1706255484000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/,processed-data/,0,0


In [0]:
for item in dbutils.fs.ls('/FileStore/tables/databricks-classes/Recommendation-System-Music/'):
    if item.size != 0:
        print(f'Head {item.name}: \n')
        display(dbutils.fs.head(f'/FileStore/tables/databricks-classes/Recommendation-System-Music/{item.name}'))

Head data.csv: 

[Truncated to first 65536 bytes]


'valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo\n0.0594,1921,0.982,"[\'Sergei Rachmaninoff\', \'James Levine\', \'Berliner Philharmoniker\']",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",4,1921,0.0366,80.954\n0.963,1921,0.732,[\'Dennis Day\'],0.8190000000000001,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.93600000000001\n0.0394,1921,0.961,[\'KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat\'],0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339\n0.165,1921,0.967,[\'Frank Parker\'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.77e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109\n0.253,1921,0.957,[\'Phil Regan\'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,1.68e-06,3,0.229,

Head data_by_artist.csv: 

[Truncated to first 65536 bytes]


'mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key\n1,9,0.5901111111111111,"""Cats"" 1981 Original London Cast",0.4672222222222222,250318.5555555556,0.3940033333333333,0.011399851111111107,0.2908333333333333,-14.448,0.21038888888888888,117.51811111111112,0.3895,38.333333333333336,5\n1,26,0.8625384615384617,"""Cats"" 1983 Broadway Cast",0.4417307692307693,287280.0,0.4068076923076923,0.08115826423076923,0.3152153846153846,-10.69,0.17621153846153847,103.04415384615385,0.2688653846153846,30.57692307692308,5\n1,7,0.8565714285714285,"""Fiddler On The Roof” Motion Picture Chorus",0.34828571428571425,328920.0,0.2865714285714285,0.024592948571428568,0.3257857142857143,-15.230714285714285,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0\n1,27,0.884925925925926,"""Fiddler On The Roof” Motion Picture Orchestra",0.4250740740740739,262890.96296296304,0.2457703703703704,0.073587279259

Head data_by_genres.csv: 

[Truncated to first 65536 bytes]


"mode,genres,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key\n1,21st century classical,0.9793333333333332,0.16288333333333335,160297.66666666663,0.07131666666666665,0.60683367,0.3616,-31.514333333333337,0.04056666666666667,75.3365,0.10378333333333334,27.83333333333333,6\n1,432hz,0.49478,0.2993333333333333,1048887.333333333,0.4506783333333333,0.4777616666666668,0.131,-16.854,0.07681666666666667,120.28566666666666,0.22175,52.5,5\n1,8-bit,0.762,0.7120000000000001,115177.0,0.818,0.8759999999999999,0.126,-9.18,0.047,133.444,0.975,48.0,7\n1,[],0.6514170195595453,0.5290925603549332,232880.8902503945,0.4191460727353524,0.2053091895111363,0.21869585415040735,-12.288964675489455,0.10787155868681396,112.8573524318416,0.5136042963588958,20.859882191849056,7\n1,a cappella,0.676557304985755,0.5389612464387464,190628.5408867521,0.3164335701566952,0.003003441440420227,0.1722541371082621,-12.479387421652426,0.08285143981481483,112

Head data_by_year.csv: 



'mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key\n1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.23181513333333334,0.34487805886666656,0.20571,-17.04866666666665,0.073662,101.53149333333329,0.37932666666666665,0.6533333333333333,2\n1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.23781535211267596,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.14084507042253522,10\n1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.37173272502702703,0.2274621621621621,-14.129210810810811,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0\n1,1924,0.940199860169493,0.5498940677966102,191046.70762711862,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.09208940677966099,120.68957203389822,0.6637254237288139,0.6610169491525424,10\n1,1

Head data_w_genres.csv: 

[Truncated to first 65536 bytes]


'genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count\n[\'show tunes\'],"""Cats"" 1981 Original London Cast",0.5901111111111111,0.4672222222222222,250318.5555555556,0.3940033333333333,0.011399851111111107,0.2908333333333333,-14.448,0.21038888888888888,117.51811111111112,0.3895,38.333333333333336,5,1,9\n[],"""Cats"" 1983 Broadway Cast",0.8625384615384617,0.4417307692307693,287280.0,0.4068076923076923,0.08115826423076923,0.3152153846153846,-10.69,0.17621153846153847,103.04415384615385,0.2688653846153846,30.57692307692308,5,1,26\n[],"""Fiddler On The Roof” Motion Picture Chorus",0.8565714285714285,0.34828571428571425,328920.0,0.2865714285714285,0.024592948571428568,0.3257857142857143,-15.230714285714285,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0,1,7\n[],"""Fiddler On The Roof” Motion Picture Orchestra",0.884925925925926,0.4250740740740739,262890.96296296304,0.245
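The raw heads show that fields such as `artists` and `name` contain commas and are therefore wrapped in double quotes, and that `artists` holds a Python-style list written as text. The sketch below uses only the standard library to illustrate how a quote-aware CSV parser keeps those fields in one column; the sample is the header and first row of `data.csv`, and `ast.literal_eval` is just one possible way to turn the `artists` text into a list.

```python
import ast
import csv
import io

# Header and first data row of data.csv, copied from the output above.
sample = '''valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",4,1921,0.0366,80.954'''

reader = csv.DictReader(io.StringIO(sample))
row = next(reader)

# The quoted field stays in a single column despite its embedded commas.
print(len(reader.fieldnames), "columns")
print(row["name"])

# `artists` is a list literal stored as text; parse it into a real list.
artists = ast.literal_eval(row["artists"])
print(artists)
```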

## Data.csv

In [0]:
# File location and type
# Insert the directory of the data.csv file in this field:
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/data.csv' 
file_type = 'csv'

# Options 
infer_schema = 'True'
first_row_is_header = 'True'
delimiter = ','

# Read the Data
df_data_csv = spark.read.format(file_type) \
    .option('inferSchema', infer_schema) \
    .option('header', first_row_is_header) \
    .option('sep', delimiter) \
    .load(file_location)

# Viewing the loaded data
df_data_csv.limit(10).display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",4,1921,0.0366,80.954
0.963,1921,0.732,['Dennis Day'],0.8190000000000001,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.93600000000001
0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.77e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,1.68e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665
0.196,1921,0.579,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.6970000000000001,395076,0.346,0,4pyw9DVHGStUre4J6hPngr,0.168,2,0.13,-12.505999999999998,1,Gati Mardika,6,1921,0.07,119.824
0.406,1921,0.996,['John McCormack'],0.518,159507,0.203,0,5uNZnElqOS3W4fRmRYPk4T,0.0,0,0.115,-10.589,1,The Wearing of the Green,4,1921,0.0615,66.221
0.0731,1921,0.993,['Sergei Rachmaninoff'],0.389,218773,0.088,0,02GDntOXexBFUvSgaXLPkd,0.527,1,0.363,-21.091,0,"Morceaux de fantaisie, Op. 3: No. 2, Prélude in C-Sharp Minor. Lento",2,1921,0.0456,92.867
0.721,1921,0.996,['Ignacio Corsini'],0.485,161520,0.13,0,05xDjWH9ub67nJJk82yfGf,0.151,5,0.104,-21.50800000000001,0,La Mañanita - Remasterizado,0,1921-03-20,0.0483,64.678
0.7709999999999999,1921,0.982,['Fortugé'],0.684,196560,0.257,0,08zfJvRLp7pjAb94MA9JmF,0.0,8,0.504,-16.415,1,Il Etait Syndiqué,0,1921,0.399,109.378


## Data_by_artist.csv

In [0]:
# File location and type
# Insert the directory of the data_by_artist.csv file in this field:
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_artist.csv'
file_type = 'csv'

# Options
infer_schema = 'True'
first_row_is_header = 'True'
delimiter = ','

# Read the Data
df_data_by_artist_csv = spark.read.format(file_type) \
    .option('inferSchema', infer_schema) \
    .option('header', first_row_is_header) \
    .option('sep', delimiter) \
    .load(file_location)

# Viewing the loaded data
df_data_by_artist_csv.limit(10).display()

mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,9,0.5901111111111111,"""""""Cats"""" 1981 Original London Cast""",0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5
1,26,0.8625384615384617,"""""""Cats"""" 1983 Broadway Cast""",0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5
1,7,0.8565714285714285,"""""""Fiddler On The Roof” Motion Picture Chorus""",0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0
1,27,0.884925925925926,"""""""Fiddler On The Roof” Motion Picture Orchestra""",0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0
1,7,0.5107142857142857,"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1991 London Cast""",0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5
1,36,0.6095555555555557,"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1992 Canadian Cast""",0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5
1,2,0.725,"""""""Mama"""" Helen Teagarden""",0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8
1,2,0.927,"""""""Test for Victor Young""""""",0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10
1,122,0.1731450819672131,"""""""Weird Al"""" Yankovic""",0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9
1,15,0.5444666666666668,$NOT,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1


## Data_by_genres.csv

In [0]:
# File location and type
# Insert the directory of the data_by_genres.csv file in this field:
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_genres.csv'
file_type = 'csv'

# Options
infer_schema = 'True'
first_row_is_header = 'True'
delimiter = ','

# Read the Data
df_data_by_genres_csv = spark.read.format(file_type) \
    .option('inferSchema', infer_schema) \
    .option('header', first_row_is_header) \
    .option('sep', delimiter) \
    .load(file_location)

# Viewing the loaded data
df_data_by_genres_csv.limit(10).display()

mode,genres,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,21st century classical,0.9793333333333332,0.1628833333333333,160297.66666666663,0.0713166666666666,0.60683367,0.3616,-31.514333333333337,0.0405666666666666,75.3365,0.1037833333333333,27.83333333333333,6
1,432hz,0.49478,0.2993333333333333,1048887.333333333,0.4506783333333333,0.4777616666666668,0.131,-16.854,0.0768166666666666,120.28566666666666,0.22175,52.5,5
1,8-bit,0.762,0.7120000000000001,115177.0,0.818,0.8759999999999999,0.126,-9.18,0.047,133.444,0.975,48.0,7
1,[],0.6514170195595453,0.5290925603549332,232880.8902503945,0.4191460727353524,0.2053091895111363,0.2186958541504073,-12.288964675489456,0.1078715586868139,112.8573524318416,0.5136042963588958,20.859882191849056,7
1,a cappella,0.676557304985755,0.5389612464387464,190628.5408867521,0.3164335701566952,0.0030034414404202,0.1722541371082621,-12.479387421652426,0.0828514398148148,112.1103620014245,0.4482486545584045,45.82007122507122,7
1,abstract,0.45921,0.5161666666666667,343196.5,0.4424166666666666,0.8496666666666667,0.1180666666666667,-15.472083333333332,0.0465166666666666,127.88575000000002,0.307325,43.5,1
1,abstract beats,0.3421466666666667,0.623,229936.2,0.5277999999999999,0.3336026120000001,0.0996533333333333,-7.918000000000001,0.1163733333333333,112.4138,0.4935066666666666,58.93333333333332,10
1,abstract hip hop,0.2438540633608816,0.6945709366391184,231849.2341597796,0.6462346418732783,0.0242312629201102,0.1685429201101929,-7.349327823691461,0.2142576997245178,108.2449865013774,0.5713909090909091,39.79070247933884,2
0,accordeon,0.3229999999999999,0.588,164000.0,0.392,0.441,0.0794,-14.899,0.0727,109.131,0.7090000000000001,39.0,2
1,accordion,0.446125,0.6248125,167061.5625,0.3734375,0.19373839375,0.1603,-14.4870625,0.0785375,112.8724375,0.6586875000000001,21.9375,2


## Data_by_year.csv

In [0]:
# File location and type
# Insert the directory of the data_by_year.csv file in this field:
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/data_by_year.csv'
file_type = 'csv'

# Options
infer_schema = 'True'
first_row_is_header = 'True'
delimiter = ','

# Read the Data
df_data_by_year_csv = spark.read.format(file_type) \
    .option('inferSchema', infer_schema) \
    .option('header', first_row_is_header) \
    .option('sep', delimiter) \
    .load(file_location)

# Viewing the loaded data
df_data_by_year_csv.limit(10).display()

mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.2318151333333333,0.3448780588666665,0.20571,-17.04866666666665,0.073662,101.53149333333327,0.3793266666666666,0.6533333333333333,2
1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.2378153521126759,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.1408450704225352,10
1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.371732725027027,0.2274621621621621,-14.129210810810813,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0
1,1924,0.940199860169493,0.5498940677966102,191046.70762711865,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.0920894067796609,120.68957203389822,0.6637254237288139,0.6610169491525424,10
1,1925,0.9626070503597138,0.5738633093525181,184986.92446043165,0.2785935251798561,0.4182973612230215,0.2376679856115108,-14.14641366906474,0.1119179856115108,115.5219208633093,0.6219287769784171,2.6043165467625897,5
1,1926,0.660817216981134,0.5998802612481859,156881.65747460088,0.2114670907111756,0.3330931111175616,0.2323695936139332,-18.492538461538462,0.4837036284470243,109.64803265602328,0.4369104571843251,1.4223512336719883,9
1,1927,0.9361794552845558,0.6482682926829262,184993.59837398367,0.2643213008130081,0.3913284986504065,0.1684502439024389,-14.422373983739831,0.113609593495935,114.84652357723554,0.6597004878048782,0.8016260162601626,7
1,1928,0.9386165035685952,0.5342878667724027,214827.90642347344,0.2079477954004757,0.4948354801348136,0.1752893735130848,-17.191982553528927,0.1599114988104679,106.77226169706591,0.4957126883425853,1.5257731958762886,1
1,1929,0.6014265861344558,0.6476698529411761,168999.41281512607,0.2418007352941172,0.2152040310609246,0.2360002100840333,-16.530376050420152,0.4900007352941176,110.94835714285716,0.6365298319327733,0.3403361344537815,7
1,1930,0.936714937370057,0.5181758835758836,195150.28534303536,0.3335239189189189,0.3522059281652805,0.2213108627858629,-12.869221413721428,0.1199096673596674,109.87119438669428,0.6162376299376306,0.9267151767151768,2


## Data_w_genres.csv

In [0]:
# File location and type
# Insert the directory of the data_w_genres.csv file in this field:
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/data_w_genres.csv'
file_type = 'csv'

# Options
infer_schema = 'True'
first_row_is_header = 'True'
delimiter = ','

# Read the Data
df_data_w_genres_csv = spark.read.format(file_type) \
    .option('inferSchema', infer_schema) \
    .option('header', first_row_is_header) \
    .option('sep', delimiter) \
    .load(file_location)

# Viewing the loaded data
df_data_w_genres_csv.limit(10).display()

genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
['show tunes'],"""""""Cats"""" 1981 Original London Cast""",0.5901111111111111,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5.0,1.0,9.0
[],"""""""Cats"""" 1983 Broadway Cast""",0.8625384615384617,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5.0,1.0,26.0
[],"""""""Fiddler On The Roof” Motion Picture Chorus""",0.8565714285714285,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0.0,1.0,7.0
[],"""""""Fiddler On The Roof” Motion Picture Orchestra""",0.884925925925926,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0.0,1.0,27.0
[],"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1991 London Cast""",0.5107142857142857,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5.0,1.0,7.0
[],"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1992 Canadian Cast""",0.6095555555555557,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5.0,1.0,36.0
[],"""""""Mama"""" Helen Teagarden""",0.725,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8.0,1.0,2.0
[],"""""""Test for Victor Young""""""",0.927,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10.0,1.0,2.0
"['comedy rock', 'comic', 'parody']","""""""Weird Al"""" Yankovic""",0.1731450819672131,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9.0,1.0,122.0
"['emo rap', 'florida rap', 'sad rap', 'underground hip hop', 'vapor trap']",$NOT,0.5444666666666668,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1.0,1.0,15.0


## Checking and processing the files

## Data

In [0]:
df_data_csv.limit(10).display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",4,1921,0.0366,80.954
0.963,1921,0.732,['Dennis Day'],0.8190000000000001,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.93600000000001
0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.77e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,1.68e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665
0.196,1921,0.579,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.6970000000000001,395076,0.346,0,4pyw9DVHGStUre4J6hPngr,0.168,2,0.13,-12.505999999999998,1,Gati Mardika,6,1921,0.07,119.824
0.406,1921,0.996,['John McCormack'],0.518,159507,0.203,0,5uNZnElqOS3W4fRmRYPk4T,0.0,0,0.115,-10.589,1,The Wearing of the Green,4,1921,0.0615,66.221
0.0731,1921,0.993,['Sergei Rachmaninoff'],0.389,218773,0.088,0,02GDntOXexBFUvSgaXLPkd,0.527,1,0.363,-21.091,0,"Morceaux de fantaisie, Op. 3: No. 2, Prélude in C-Sharp Minor. Lento",2,1921,0.0456,92.867
0.721,1921,0.996,['Ignacio Corsini'],0.485,161520,0.13,0,05xDjWH9ub67nJJk82yfGf,0.151,5,0.104,-21.50800000000001,0,La Mañanita - Remasterizado,0,1921-03-20,0.0483,64.678
0.7709999999999999,1921,0.982,['Fortugé'],0.684,196560,0.257,0,08zfJvRLp7pjAb94MA9JmF,0.0,8,0.504,-16.415,1,Il Etait Syndiqué,0,1921,0.399,109.378


In [0]:
df_data_csv.printSchema()

root
 |-- valence: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- artists: string (nullable = true)
 |-- danceability: string (nullable = true)
 |-- duration_ms: string (nullable = true)
 |-- energy: string (nullable = true)
 |-- explicit: string (nullable = true)
 |-- id: string (nullable = true)
 |-- instrumentalness: string (nullable = true)
 |-- key: string (nullable = true)
 |-- liveness: string (nullable = true)
 |-- loudness: string (nullable = true)
 |-- mode: string (nullable = true)
 |-- name: string (nullable = true)
 |-- popularity: string (nullable = true)
 |-- release_date: string (nullable = true)
 |-- speechiness: string (nullable = true)
 |-- tempo: string (nullable = true)



In [0]:
df_data_csv.columns

['valence',
 'year',
 'acousticness',
 'artists',
 'danceability',
 'duration_ms',
 'energy',
 'explicit',
 'id',
 'instrumentalness',
 'key',
 'liveness',
 'loudness',
 'mode',
 'name',
 'popularity',
 'release_date',
 'speechiness',
 'tempo']

In [0]:
columns_string = [
    'artists',
    'id',
    'name',
    'release_date'
]
    
columns_integer = [
    'year',
    'duration_ms',
    'explicit',
    'key',
    'mode',
    'popularity'
]

columns_double = [
    'valence',
    'acousticness',
    'danceability',
    'energy',
    'instrumentalness',
    'liveness',
    'loudness',
    'speechiness',
    'tempo'
]

In [0]:
len(columns_string) + len(columns_integer) + len(columns_double)

19

In [0]:
len(df_data_csv.columns)

19
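Matching counts (19 and 19) do not by themselves rule out a column that is listed twice while another is missing. A slightly stricter check, sketched below with plain Python sets, compares the names directly. The `check_type_lists` helper is hypothetical, and `columns` is a hard-coded copy of `df_data_csv.columns` so the example runs without Spark.

```python
# Copy of df_data_csv.columns, hard-coded so the sketch runs without Spark.
columns = [
    'valence', 'year', 'acousticness', 'artists', 'danceability',
    'duration_ms', 'energy', 'explicit', 'id', 'instrumentalness', 'key',
    'liveness', 'loudness', 'mode', 'name', 'popularity', 'release_date',
    'speechiness', 'tempo',
]
columns_string = ['artists', 'id', 'name', 'release_date']
columns_integer = ['year', 'duration_ms', 'explicit', 'key', 'mode', 'popularity']
columns_double = ['valence', 'acousticness', 'danceability', 'energy',
                  'instrumentalness', 'liveness', 'loudness', 'speechiness',
                  'tempo']


def check_type_lists(columns, *type_lists):
    """Return (duplicates, missing, unknown) column names across the lists."""
    assigned = [name for names in type_lists for name in names]
    duplicates = sorted({name for name in assigned if assigned.count(name) > 1})
    missing = sorted(set(columns) - set(assigned))
    unknown = sorted(set(assigned) - set(columns))
    return duplicates, missing, unknown


result = check_type_lists(columns, columns_string, columns_integer, columns_double)
print(result)
```

Three empty lists mean every column is assigned to exactly one type.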

In [0]:
for column in df_data_csv.columns:
    # String columns
    if column in columns_string:
        df_data_csv = df_data_csv.withColumn(column, df_data_csv[column].cast(StringType()))
    # Integer columns
    elif column in columns_integer:
        df_data_csv = df_data_csv.withColumn(column, df_data_csv[column].cast(IntegerType()))
    # Every remaining column is cast to double
    else:
        df_data_csv = df_data_csv.withColumn(column, df_data_csv[column].cast(DoubleType()))
df_data_csv.printSchema()

root
 |-- valence: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- artists: string (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: integer (nullable = true)
 |-- energy: double (nullable = true)
 |-- explicit: integer (nullable = true)
 |-- id: string (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- key: integer (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- popularity: integer (nullable = true)
 |-- release_date: string (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
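As a possible alternative to casting after the load, the same type lists could be turned into a DDL-formatted schema string, which `spark.read.schema(...)` accepts in place of a `StructType`. The sketch below only builds the string and does not call Spark; `build_ddl_schema` is a hypothetical helper, and this route has not been run against the dataset here.

```python
# Copy of df_data_csv.columns, hard-coded so the sketch runs without Spark.
columns = [
    'valence', 'year', 'acousticness', 'artists', 'danceability',
    'duration_ms', 'energy', 'explicit', 'id', 'instrumentalness', 'key',
    'liveness', 'loudness', 'mode', 'name', 'popularity', 'release_date',
    'speechiness', 'tempo',
]
columns_string = ['artists', 'id', 'name', 'release_date']
columns_integer = ['year', 'duration_ms', 'explicit', 'key', 'mode', 'popularity']


def build_ddl_schema(columns, string_cols, integer_cols):
    """Build a DDL schema string; columns in neither list become DOUBLE."""
    parts = []
    for column in columns:
        if column in string_cols:
            spark_type = 'STRING'
        elif column in integer_cols:
            spark_type = 'INT'
        else:
            spark_type = 'DOUBLE'
        # Backticks keep names such as `key` or `mode` from being read as keywords.
        parts.append(f'`{column}` {spark_type}')
    return ', '.join(parts)


ddl_schema = build_ddl_schema(columns, columns_string, columns_integer)
print(ddl_schema)
```

In the notebook, the string could then replace the `inferSchema` option, for example `spark.read.format('csv').schema(ddl_schema).option('header', 'True').load(file_location)`. With an explicit schema, values that fail to parse may surface as nulls instead of strings, so the result is worth comparing against the cast version.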



In [0]:
df_data_csv.limit(10).display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",4,1921,0.0366,80.954
0.963,1921,0.732,['Dennis Day'],0.8190000000000001,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.93600000000001
0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.77e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,1.68e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665
0.196,1921,0.579,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.6970000000000001,395076,0.346,0,4pyw9DVHGStUre4J6hPngr,0.168,2,0.13,-12.505999999999998,1,Gati Mardika,6,1921,0.07,119.824
0.406,1921,0.996,['John McCormack'],0.518,159507,0.203,0,5uNZnElqOS3W4fRmRYPk4T,0.0,0,0.115,-10.589,1,The Wearing of the Green,4,1921,0.0615,66.221
0.0731,1921,0.993,['Sergei Rachmaninoff'],0.389,218773,0.088,0,02GDntOXexBFUvSgaXLPkd,0.527,1,0.363,-21.091,0,"Morceaux de fantaisie, Op. 3: No. 2, Prélude in C-Sharp Minor. Lento",2,1921,0.0456,92.867
0.721,1921,0.996,['Ignacio Corsini'],0.485,161520,0.13,0,05xDjWH9ub67nJJk82yfGf,0.151,5,0.104,-21.50800000000001,0,La Mañanita - Remasterizado,0,1921-03-20,0.0483,64.678
0.7709999999999999,1921,0.982,['Fortugé'],0.684,196560,0.257,0,08zfJvRLp7pjAb94MA9JmF,0.0,8,0.504,-16.415,1,Il Etait Syndiqué,0,1921,0.399,109.378


## Processing the artist names and track names

In [0]:
# Work on a small sample first to avoid corrupting or losing the original data
sample_df = df_data_csv.sample(withReplacement=False, fraction=0.001, seed=333)
sample_df.display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.907,1928,0.894,['Louis Armstrong & His Hot Five'],0.782,182600,0.409,0,7ui11djxhp7WL4NCHlET8O,0.278,3,0.309,-10.191,1,Hotter Than That,23,1928,0.0676,106.03
0.925,1932,0.855,"['Jack Teagarden', 'Red Nichols & His Five Pennies']",0.573,187773,0.218,0,0jLpTcgYYVVF4bHvjdJaxU,0.499,3,0.0483,-14.546,1,Sweet and Hot,3,1932-12-06,0.48,207.265
0.416,1940,0.983,['Count Basie'],0.557,157133,0.0664,0,6YNb280zh4hME3t1C7Or5Q,0.000242,0,0.246,-17.137,1,I Want a Little Girl,3,1940,0.0446,90.397
0.282,1941,0.908,['Rise Stevens'],0.449,151560,0.289,0,3D2qr2RFnn2XxhsJMeqogr,0.0,0,0.695,-15.108,1,One Life to Live,2,1941,0.0586,135.567
0.516,1956,0.7659999999999999,['Judy Garland'],0.456,221293,0.285,0,1rtzABisO0N3DL9S3WQ2Z1,0.0,7,0.0786,-13.675999999999998,1,Come Rain Or Come Shine,37,1956,0.0858,83.435
0.116,1964,0.979,['Ella Fitzgerald'],0.183,230800,0.146,0,43R3lzkHJG7YomRMDtEEI1,0.000705,7,0.268,-17.155,1,Early Autumn,39,1964,0.0337,175.639
0.7290000000000001,1965,0.6990000000000001,['Smokey Robinson & The Miracles'],0.502,174360,0.333,0,6QyQmdvQ1ywNccYa0pwLNQ,0.0,7,0.222,-10.914,1,The Tracks Of My Tears,63,1965-11-01,0.0264,96.982
0.489,1966,0.6559999999999999,['Simon & Garfunkel'],0.423,131747,0.34,0,0qSITuCPLxjoDtESBy70WO,0.0,9,0.127,-14.175999999999998,1,Flowers Never Bend with the Rainfall,49,1966-10-10,0.0356,110.142
0.825,1969,0.347,['Frank Zappa'],0.37,563400,0.787,0,4fb4QSuQt8FLghjG0Ipd7g,0.6459999999999999,4,0.113,-12.46,0,Willie The Pimp,44,1969-10-10,0.0435,88.064
0.741,1973,0.34,['The Wailers'],0.8009999999999999,279720,0.6409999999999999,0,5uBKhKWTJ4E47rcLQqu3YH,4.5e-05,8,0.0693,-9.349,1,I Shot The Sheriff,61,1973-10-19,0.173,96.96


In [0]:
# Remove brackets, quotes and similar punctuation ('|' is a literal inside a character class)
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r"[\[\]|'\"!^´`()]", ''))
# Replace commas with periods
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r',', '.'))
# Replace '$' with 'S'
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r'\$', 'S'))
# Replace '?' with 'Q' (commas were already replaced above)
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r'\?', 'Q'))

In [0]:
sample_df.display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.907,1928,0.894,Louis Armstrong & His Hot Five,0.782,182600,0.409,0,7ui11djxhp7WL4NCHlET8O,0.278,3,0.309,-10.191,1,Hotter Than That,23,1928,0.0676,106.03
0.925,1932,0.855,Jack Teagarden. Red Nichols & His Five Pennies,0.573,187773,0.218,0,0jLpTcgYYVVF4bHvjdJaxU,0.499,3,0.0483,-14.546,1,Sweet and Hot,3,1932-12-06,0.48,207.265
0.416,1940,0.983,Count Basie,0.557,157133,0.0664,0,6YNb280zh4hME3t1C7Or5Q,0.000242,0,0.246,-17.137,1,I Want a Little Girl,3,1940,0.0446,90.397
0.282,1941,0.908,Rise Stevens,0.449,151560,0.289,0,3D2qr2RFnn2XxhsJMeqogr,0.0,0,0.695,-15.108,1,One Life to Live,2,1941,0.0586,135.567
0.516,1956,0.7659999999999999,Judy Garland,0.456,221293,0.285,0,1rtzABisO0N3DL9S3WQ2Z1,0.0,7,0.0786,-13.675999999999998,1,Come Rain Or Come Shine,37,1956,0.0858,83.435
0.116,1964,0.979,Ella Fitzgerald,0.183,230800,0.146,0,43R3lzkHJG7YomRMDtEEI1,0.000705,7,0.268,-17.155,1,Early Autumn,39,1964,0.0337,175.639
0.7290000000000001,1965,0.6990000000000001,Smokey Robinson & The Miracles,0.502,174360,0.333,0,6QyQmdvQ1ywNccYa0pwLNQ,0.0,7,0.222,-10.914,1,The Tracks Of My Tears,63,1965-11-01,0.0264,96.982
0.489,1966,0.6559999999999999,Simon & Garfunkel,0.423,131747,0.34,0,0qSITuCPLxjoDtESBy70WO,0.0,9,0.127,-14.175999999999998,1,Flowers Never Bend with the Rainfall,49,1966-10-10,0.0356,110.142
0.825,1969,0.347,Frank Zappa,0.37,563400,0.787,0,4fb4QSuQt8FLghjG0Ipd7g,0.6459999999999999,4,0.113,-12.46,0,Willie The Pimp,44,1969-10-10,0.0435,88.064
0.741,1973,0.34,The Wailers,0.8009999999999999,279720,0.6409999999999999,0,5uBKhKWTJ4E47rcLQqu3YH,4.5e-05,8,0.0693,-9.349,1,I Shot The Sheriff,61,1973-10-19,0.173,96.96
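
The four substitutions on the sample can also be tried outside Spark. The following is a minimal sketch using Python's `re` module, assuming the same character set as the sample cell; `clean_artists` is a hypothetical helper name, and Spark itself uses Java regular expressions, which behave the same way for a simple character class like this one.

```python
import re

# Characters removed outright: the same set as the first regexp_replace on the sample
REMOVE = re.compile("[" + re.escape("[]|'\"!^´`()") + "]")


def clean_artists(value):
    """Apply the four artist-name substitutions in the same order as the notebook."""
    value = REMOVE.sub('', value)      # drop brackets, quotes, etc.
    value = value.replace(',', '.')    # list separator becomes a period
    value = value.replace('$', 'S')
    value = value.replace('?', 'Q')
    return value


print(clean_artists("['Jack Teagarden', 'Red Nichols & His Five Pennies']"))
# Jack Teagarden. Red Nichols & His Five Pennies
```

The printed value matches the second row of the cleaned sample shown above.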


In [0]:
df_data_csv.limit(10).display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.0594,1921,0.982,"['Sergei Rachmaninoff', 'James Levine', 'Berliner Philharmoniker']",0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,"Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve",4,1921,0.0366,80.954
0.963,1921,0.732,['Dennis Day'],0.8190000000000001,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.93600000000001
0.0394,1921,0.961,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
0.165,1921,0.967,['Frank Parker'],0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.77e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
0.253,1921,0.957,['Phil Regan'],0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,1.68e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665
0.196,1921,0.579,['KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat'],0.6970000000000001,395076,0.346,0,4pyw9DVHGStUre4J6hPngr,0.168,2,0.13,-12.505999999999998,1,Gati Mardika,6,1921,0.07,119.824
0.406,1921,0.996,['John McCormack'],0.518,159507,0.203,0,5uNZnElqOS3W4fRmRYPk4T,0.0,0,0.115,-10.589,1,The Wearing of the Green,4,1921,0.0615,66.221
0.0731,1921,0.993,['Sergei Rachmaninoff'],0.389,218773,0.088,0,02GDntOXexBFUvSgaXLPkd,0.527,1,0.363,-21.091,0,"Morceaux de fantaisie, Op. 3: No. 2, Prélude in C-Sharp Minor. Lento",2,1921,0.0456,92.867
0.721,1921,0.996,['Ignacio Corsini'],0.485,161520,0.13,0,05xDjWH9ub67nJJk82yfGf,0.151,5,0.104,-21.50800000000001,0,La Mañanita - Remasterizado,0,1921-03-20,0.0483,64.678
0.7709999999999999,1921,0.982,['Fortugé'],0.684,196560,0.257,0,08zfJvRLp7pjAb94MA9JmF,0.0,8,0.504,-16.415,1,Il Etait Syndiqué,0,1921,0.399,109.378


In [0]:
# Remove brackets, quotes, colons and similar punctuation ('|' is a literal inside a character class)
df_data_csv = df_data_csv.withColumn('artists', f.regexp_replace('artists', r"[\[\]|'\"!^´`():”]", ''))
# Replace commas with periods
df_data_csv = df_data_csv.withColumn('artists', f.regexp_replace('artists', r',', '.'))
# Replace '$' with 'S'
df_data_csv = df_data_csv.withColumn('artists', f.regexp_replace('artists', r'\$', 'S'))
# Replace '?' with 'Q' (commas were already replaced above)
df_data_csv = df_data_csv.withColumn('artists', f.regexp_replace('artists', r'\?', 'Q'))

In [0]:
# Remove brackets, quotes, periods, colons and similar punctuation from track names
df_data_csv = df_data_csv.withColumn('name', f.regexp_replace('name', r"[\[\]|'\"!^´`().”:]", ''))
# Replace commas with periods
df_data_csv = df_data_csv.withColumn('name', f.regexp_replace('name', r',', '.'))
# Replace '/' with '-'
df_data_csv = df_data_csv.withColumn('name', f.regexp_replace('name', r'/', '-'))
# Replace '?' with 'Q' (commas were already replaced above)
df_data_csv = df_data_csv.withColumn('name', f.regexp_replace('name', r'\?', 'Q'))

In [0]:
df_data_csv.limit(20).display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.0594,1921,0.982,Sergei Rachmaninoff. James Levine. Berliner Philharmoniker,0.279,831667,0.211,0,4BJqT0PrAfrxzMOxytFOIz,0.878,10,0.665,-20.096,1,Piano Concerto No 3 in D Minor. Op 30 III Finale Alla breve,4,1921,0.0366,80.954
0.963,1921,0.732,Dennis Day,0.8190000000000001,180533,0.341,0,7xPhfUan2yNtyFG0cUWkt8,0.0,7,0.16,-12.441,1,Clancy Lowered the Boom,5,1921,0.415,60.93600000000001
0.0394,1921,0.961,KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat,0.328,500062,0.166,0,1o6I8BglA6ylDMrIELygv1,0.913,3,0.101,-14.85,1,Gati Bali,5,1921,0.0339,110.339
0.165,1921,0.967,Frank Parker,0.275,210000,0.309,0,3ftBPsC5vPBKxYSee08FDH,2.77e-05,5,0.381,-9.316,1,Danny Boy,3,1921,0.0354,100.109
0.253,1921,0.957,Phil Regan,0.418,166693,0.193,0,4d6HGyGT8e121BsdKmw9v6,1.68e-06,3,0.229,-10.096,1,When Irish Eyes Are Smiling,2,1921,0.038,101.665
0.196,1921,0.579,KHP Kridhamardawa Karaton Ngayogyakarta Hadiningrat,0.6970000000000001,395076,0.346,0,4pyw9DVHGStUre4J6hPngr,0.168,2,0.13,-12.505999999999998,1,Gati Mardika,6,1921,0.07,119.824
0.406,1921,0.996,John McCormack,0.518,159507,0.203,0,5uNZnElqOS3W4fRmRYPk4T,0.0,0,0.115,-10.589,1,The Wearing of the Green,4,1921,0.0615,66.221
0.0731,1921,0.993,Sergei Rachmaninoff,0.389,218773,0.088,0,02GDntOXexBFUvSgaXLPkd,0.527,1,0.363,-21.091,0,Morceaux de fantaisie. Op 3 No 2. Prélude in C-Sharp Minor Lento,2,1921,0.0456,92.867
0.721,1921,0.996,Ignacio Corsini,0.485,161520,0.13,0,05xDjWH9ub67nJJk82yfGf,0.151,5,0.104,-21.50800000000001,0,La Mañanita - Remasterizado,0,1921-03-20,0.0483,64.678
0.7709999999999999,1921,0.982,Fortugé,0.684,196560,0.257,0,08zfJvRLp7pjAb94MA9JmF,0.0,8,0.504,-16.415,1,Il Etait Syndiqué,0,1921,0.399,109.378
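
Because every track-name rule either deletes a single character or swaps it for another single character, the whole chain can be expressed as one translation table. The sketch below shows this in plain Python with `str.maketrans`; `NAME_TABLE` and `clean_name` are hypothetical names. Spark also offers a `translate` column function in `pyspark.sql.functions`, but whether it is a drop-in replacement for this pipeline is an assumption that has not been checked here.

```python
# One-to-one replacements: ',' -> '.', '/' -> '-', '?' -> 'Q'
# The third argument lists the characters that are deleted.
NAME_TABLE = str.maketrans(',/?', '.-Q', "[]|'\"!^´`().”:")


def clean_name(value):
    """Clean a track name in a single pass over its characters."""
    return value.translate(NAME_TABLE)


print(clean_name('Piano Concerto No. 3 in D Minor, Op. 30: III. Finale. Alla breve'))
# Piano Concerto No 3 in D Minor. Op 30 III Finale Alla breve
```

The result is the same as the first row of the cleaned output above: original periods are deleted, while commas are turned into periods in the same pass.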


In [0]:
dbutils.fs.mkdirs('/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data')

True

In [0]:
# File location and type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data'
file_type = 'parquet'

# Mode
mode = 'overwrite'

# Writing Data
df_data_csv.write.format(file_type) \
    .mode(mode) \
    .save(file_location)
display(dbutils.fs.ls(file_location))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/_committed_3177130763914826764,_committed_3177130763914826764,1607,1707232088000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/_committed_4818330780914731346,_committed_4818330780914731346,1607,1707247202000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/_committed_825971256025279735,_committed_825971256025279735,1607,1707232419000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/_committed_8775830171136009925,_committed_8775830171136009925,1626,1707220077000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/_started_4818330780914731346,_started_4818330780914731346,0,1707247187000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/_started_6474467209837860103,_started_6474467209837860103,0,1707170758000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/part-00000-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-76-1-c000.snappy.parquet,part-00000-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-76-1-c000.snappy.parquet,1922693,1707247202000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/part-00001-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-77-1-c000.snappy.parquet,part-00001-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-77-1-c000.snappy.parquet,1965917,1707247202000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/part-00002-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-78-1-c000.snappy.parquet,part-00002-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-78-1-c000.snappy.parquet,2016846,1707247202000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data/part-00003-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-79-1-c000.snappy.parquet,part-00003-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-79-1-c000.snappy.parquet,2004281,1707247202000
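
The listing mixes the Parquet data files with `_started_` and `_committed_` entries, which on Databricks are typically commit markers rather than data. A minimal sketch for summarizing only the data files follows; `summarize_parts` is a hypothetical helper, and the `(name, size)` pairs are copied from the listing above (on the cluster they could presumably be built from the `name` and `size` fields returned by `dbutils.fs.ls`).

```python
def summarize_parts(listing):
    """Return (number of Parquet part files, their total size in bytes)."""
    parts = [
        (name, size)
        for name, size in listing
        if name.startswith('part-') and name.endswith('.parquet')
    ]
    return len(parts), sum(size for _, size in parts)


# (name, size) pairs taken from the directory listing
listing = [
    ('_committed_4818330780914731346', 1607),
    ('_started_4818330780914731346', 0),
    ('part-00000-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-76-1-c000.snappy.parquet', 1922693),
    ('part-00001-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-77-1-c000.snappy.parquet', 1965917),
    ('part-00002-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-78-1-c000.snappy.parquet', 2016846),
    ('part-00003-tid-4818330780914731346-18d38a58-857d-4df8-9ea1-9110f24247c0-79-1-c000.snappy.parquet', 2004281),
]

count, total_bytes = summarize_parts(listing)
print(count, total_bytes)
```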


## Data_by_artist

In [0]:
df_data_by_artist_csv.limit(10).display()

mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,9,0.5901111111111111,"""""""Cats"""" 1981 Original London Cast""",0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5
1,26,0.8625384615384617,"""""""Cats"""" 1983 Broadway Cast""",0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5
1,7,0.8565714285714285,"""""""Fiddler On The Roof” Motion Picture Chorus""",0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0
1,27,0.884925925925926,"""""""Fiddler On The Roof” Motion Picture Orchestra""",0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0
1,7,0.5107142857142857,"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1991 London Cast""",0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5
1,36,0.6095555555555557,"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1992 Canadian Cast""",0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5
1,2,0.725,"""""""Mama"""" Helen Teagarden""",0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8
1,2,0.927,"""""""Test for Victor Young""""""",0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10
1,122,0.1731450819672131,"""""""Weird Al"""" Yankovic""",0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9
1,15,0.5444666666666668,$NOT,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1


In [0]:
df_data_by_artist_csv.printSchema()

root
 |-- mode: integer (nullable = true)
 |-- count: integer (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- artists: string (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- key: integer (nullable = true)



In [0]:
df_data_by_artist_csv.columns

['mode',
 'count',
 'acousticness',
 'artists',
 'danceability',
 'duration_ms',
 'energy',
 'instrumentalness',
 'liveness',
 'loudness',
 'speechiness',
 'tempo',
 'valence',
 'popularity',
 'key']

In [0]:
columns_string = [
    'artists'
]
    
columns_integer = [
    'mode',
    'count',
    'key'
]

columns_double = [
    'acousticness',
    'danceability',
    'duration_ms',
    'energy',
    'instrumentalness',
    'liveness',
    'loudness',
    'speechiness',
    'tempo',
    'valence',
    'popularity'
]

In [0]:
len(df_data_by_artist_csv.columns)

15

In [0]:
len(columns_string) + len(columns_integer) + len(columns_double)

15
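
Equal counts alone do not prove that every column appears in exactly one list: a misspelled name would still be counted, and the real column would silently fall into the `else` branch of the casting loop and become a double. A minimal sketch of a stricter check, using the lists defined above, is shown here; `check_column_groups` is a hypothetical helper, not part of the notebook.

```python
def check_column_groups(columns, *groups):
    """Return (missing, extra, duplicated) column names for the given groups."""
    listed = [name for group in groups for name in group]
    missing = sorted(set(columns) - set(listed))      # in the DataFrame, in no list
    extra = sorted(set(listed) - set(columns))        # in a list, not in the DataFrame
    duplicated = sorted({name for name in listed if listed.count(name) > 1})
    return missing, extra, duplicated


columns = ['mode', 'count', 'acousticness', 'artists', 'danceability',
           'duration_ms', 'energy', 'instrumentalness', 'liveness', 'loudness',
           'speechiness', 'tempo', 'valence', 'popularity', 'key']
columns_string = ['artists']
columns_integer = ['mode', 'count', 'key']
columns_double = ['acousticness', 'danceability', 'duration_ms', 'energy',
                  'instrumentalness', 'liveness', 'loudness', 'speechiness',
                  'tempo', 'valence', 'popularity']

missing, extra, duplicated = check_column_groups(
    columns, columns_string, columns_integer, columns_double
)
print(missing, extra, duplicated)
```

Three empty lists mean the groups partition the columns exactly; on the cluster, `columns` would simply be `df_data_by_artist_csv.columns`.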

In [0]:
for column in df_data_by_artist_csv.columns:
    # String columns
    if column in columns_string:
        df_data_by_artist_csv = df_data_by_artist_csv.withColumn(column, df_data_by_artist_csv[column].cast(StringType()))
    # Integer columns
    elif column in columns_integer:
        df_data_by_artist_csv = df_data_by_artist_csv.withColumn(column, df_data_by_artist_csv[column].cast(IntegerType()))
    # All remaining columns are cast to double
    else:
        df_data_by_artist_csv = df_data_by_artist_csv.withColumn(column, df_data_by_artist_csv[column].cast(DoubleType()))
df_data_by_artist_csv.printSchema()

root
 |-- mode: integer (nullable = true)
 |-- count: integer (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- artists: string (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- key: integer (nullable = true)



In [0]:
df_data_by_artist_csv.limit(10).display()

mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,9,0.5901111111111111,"""""""Cats"""" 1981 Original London Cast""",0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5
1,26,0.8625384615384617,"""""""Cats"""" 1983 Broadway Cast""",0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5
1,7,0.8565714285714285,"""""""Fiddler On The Roof” Motion Picture Chorus""",0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0
1,27,0.884925925925926,"""""""Fiddler On The Roof” Motion Picture Orchestra""",0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0
1,7,0.5107142857142857,"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1991 London Cast""",0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5
1,36,0.6095555555555557,"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1992 Canadian Cast""",0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5
1,2,0.725,"""""""Mama"""" Helen Teagarden""",0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8
1,2,0.927,"""""""Test for Victor Young""""""",0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10
1,122,0.1731450819672131,"""""""Weird Al"""" Yankovic""",0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9
1,15,0.5444666666666668,$NOT,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1


## Processing the names of the artists

In [0]:
# Work on a small sample first to avoid corrupting or losing the original data
sample_df = df_data_by_artist_csv.sample(withReplacement=False, fraction=0.001, seed=333)
sample_df.display()

mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,8,0.5789999999999998,Angela Carrasco,0.41475,307573.5,0.40175,0.0013625,0.143475,-13.77375,0.03695,151.63324999999998,0.31675,48.0,9
1,2,0.865,Bani Tagore,0.525,169093.0,0.15,0.0,0.187,-12.001,0.0338,121.988,0.444,0.0,9
1,16,0.3296625,Candi Staton,0.6296249999999999,199941.875,0.5469999999999999,4.09125e-06,0.190775,-11.836125,0.0430874999999999,116.870875,0.83625,42.625,9
1,5,0.4964,Carl Zittrer,0.3704,144373.2,0.19622,0.147,0.13814,-19.413,0.03938,104.232,0.30046,25.4,7
1,4,0.7130000000000001,Dylan Thomas,0.726,139240.0,0.08505,6.1e-05,0.2175,-28.157,0.917,80.6515,0.201,4.5,2
1,2,0.906,Frente!,0.738,119813.0,0.198,0.0,0.105,-10.919,0.0625,123.158,0.682,54.0,1
1,4,0.0787499999999999,Fruit Bats,0.6579999999999999,227753.5,0.8445,0.0318149999999999,0.327,-5.863500000000001,0.02905,110.486,0.7655,60.0,7
0,2,0.000122,Genix,0.542,271628.0,0.97,0.841,0.414,-6.327999999999999,0.0441,129.009,0.0893,2.0,11
1,4,0.5345000000000001,Greg & Steve,0.8555,191768.5,0.675,0.0,0.1198999999999999,-7.811,0.09045,130.7665,0.9000000000000001,26.0,2
1,2,0.3339999999999999,Herbs,0.825,277360.0,0.649,0.0,0.623,-9.554,0.0418,122.155,0.97,61.0,7


In [0]:
# Remove brackets, quotes and similar punctuation ('|' is a literal inside a character class)
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r"[\[\]|'\"!^´`()]", ''))
# Replace commas with periods
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r',', '.'))
# Replace '$' with 'S'
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r'\$', 'S'))
# Replace '?' with 'Q' (commas were already replaced above)
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r'\?', 'Q'))

In [0]:
sample_df.display()

mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,8,0.5789999999999998,Angela Carrasco,0.41475,307573.5,0.40175,0.0013625,0.143475,-13.77375,0.03695,151.63324999999998,0.31675,48.0,9
1,2,0.865,Bani Tagore,0.525,169093.0,0.15,0.0,0.187,-12.001,0.0338,121.988,0.444,0.0,9
1,16,0.3296625,Candi Staton,0.6296249999999999,199941.875,0.5469999999999999,4.09125e-06,0.190775,-11.836125,0.0430874999999999,116.870875,0.83625,42.625,9
1,5,0.4964,Carl Zittrer,0.3704,144373.2,0.19622,0.147,0.13814,-19.413,0.03938,104.232,0.30046,25.4,7
1,4,0.7130000000000001,Dylan Thomas,0.726,139240.0,0.08505,6.1e-05,0.2175,-28.157,0.917,80.6515,0.201,4.5,2
1,2,0.906,Frente,0.738,119813.0,0.198,0.0,0.105,-10.919,0.0625,123.158,0.682,54.0,1
1,4,0.0787499999999999,Fruit Bats,0.6579999999999999,227753.5,0.8445,0.0318149999999999,0.327,-5.863500000000001,0.02905,110.486,0.7655,60.0,7
0,2,0.000122,Genix,0.542,271628.0,0.97,0.841,0.414,-6.327999999999999,0.0441,129.009,0.0893,2.0,11
1,4,0.5345000000000001,Greg & Steve,0.8555,191768.5,0.675,0.0,0.1198999999999999,-7.811,0.09045,130.7665,0.9000000000000001,26.0,2
1,2,0.3339999999999999,Herbs,0.825,277360.0,0.649,0.0,0.623,-9.554,0.0418,122.155,0.97,61.0,7


In [0]:
# Remove brackets, quotes, colons and similar punctuation ('|' is a literal inside a character class)
df_data_by_artist_csv = df_data_by_artist_csv.withColumn('artists', f.regexp_replace('artists', r"[\[\]|'\"!^´`():”]", ''))
# Replace commas with periods
df_data_by_artist_csv = df_data_by_artist_csv.withColumn('artists', f.regexp_replace('artists', r',', '.'))
# Replace '$' with 'S'
df_data_by_artist_csv = df_data_by_artist_csv.withColumn('artists', f.regexp_replace('artists', r'\$', 'S'))
# Replace '?' with 'Q' (commas were already replaced above)
df_data_by_artist_csv = df_data_by_artist_csv.withColumn('artists', f.regexp_replace('artists', r'\?', 'Q'))

In [0]:
df_data_by_artist_csv.limit(20).display()

mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,9,0.5901111111111111,Cats 1981 Original London Cast,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5
1,26,0.8625384615384617,Cats 1983 Broadway Cast,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5
1,7,0.8565714285714285,Fiddler On The Roof Motion Picture Chorus,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0
1,27,0.884925925925926,Fiddler On The Roof Motion Picture Orchestra,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0
1,7,0.5107142857142857,Joseph And The Amazing Technicolor Dreamcoat 1991 London Cast,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5
1,36,0.6095555555555557,Joseph And The Amazing Technicolor Dreamcoat 1992 Canadian Cast,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5
1,2,0.725,Mama Helen Teagarden,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8
1,2,0.927,Test for Victor Young,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10
1,122,0.1731450819672131,Weird Al Yankovic,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9
1,15,0.5444666666666668,SNOT,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1


In [0]:
# File location and type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist'
file_type = 'parquet'

# Mode
mode = 'overwrite'

# Writing Data
df_data_by_artist_csv.write.format(file_type) \
    .mode(mode) \
    .save(file_location)
display(dbutils.fs.ls(file_location))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist/_committed_2159493154797974667,_committed_2159493154797974667,433,1707220088000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist/_committed_6670886875046542341,_committed_6670886875046542341,422,1707247215000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist/_committed_8203027676418423527,_committed_8203027676418423527,422,1707232429000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist/_committed_8967852966477912603,_committed_8967852966477912603,421,1707232101000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist/_started_6670886875046542341,_started_6670886875046542341,0,1707247213000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist/part-00000-tid-6670886875046542341-9bf832ce-8522-409a-a459-664a17e0b722-99-1-c000.snappy.parquet,part-00000-tid-6670886875046542341-9bf832ce-8522-409a-a459-664a17e0b722-99-1-c000.snappy.parquet,1918054,1707247215000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist/part-00001-tid-6670886875046542341-9bf832ce-8522-409a-a459-664a17e0b722-100-1-c000.snappy.parquet,part-00001-tid-6670886875046542341-9bf832ce-8522-409a-a459-664a17e0b722-100-1-c000.snappy.parquet,72517,1707247213000


## Data_by_genres

In [0]:
df_data_by_genres_csv.limit(10).display()

mode,genres,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,21st century classical,0.9793333333333332,0.1628833333333333,160297.66666666663,0.0713166666666666,0.60683367,0.3616,-31.514333333333337,0.0405666666666666,75.3365,0.1037833333333333,27.83333333333333,6
1,432hz,0.49478,0.2993333333333333,1048887.333333333,0.4506783333333333,0.4777616666666668,0.131,-16.854,0.0768166666666666,120.28566666666666,0.22175,52.5,5
1,8-bit,0.762,0.7120000000000001,115177.0,0.818,0.8759999999999999,0.126,-9.18,0.047,133.444,0.975,48.0,7
1,[],0.6514170195595453,0.5290925603549332,232880.8902503945,0.4191460727353524,0.2053091895111363,0.2186958541504073,-12.288964675489456,0.1078715586868139,112.8573524318416,0.5136042963588958,20.859882191849056,7
1,a cappella,0.676557304985755,0.5389612464387464,190628.5408867521,0.3164335701566952,0.0030034414404202,0.1722541371082621,-12.479387421652426,0.0828514398148148,112.1103620014245,0.4482486545584045,45.82007122507122,7
1,abstract,0.45921,0.5161666666666667,343196.5,0.4424166666666666,0.8496666666666667,0.1180666666666667,-15.472083333333332,0.0465166666666666,127.88575000000002,0.307325,43.5,1
1,abstract beats,0.3421466666666667,0.623,229936.2,0.5277999999999999,0.3336026120000001,0.0996533333333333,-7.918000000000001,0.1163733333333333,112.4138,0.4935066666666666,58.93333333333332,10
1,abstract hip hop,0.2438540633608816,0.6945709366391184,231849.2341597796,0.6462346418732783,0.0242312629201102,0.1685429201101929,-7.349327823691461,0.2142576997245178,108.2449865013774,0.5713909090909091,39.79070247933884,2
0,accordeon,0.3229999999999999,0.588,164000.0,0.392,0.441,0.0794,-14.899,0.0727,109.131,0.7090000000000001,39.0,2
1,accordion,0.446125,0.6248125,167061.5625,0.3734375,0.19373839375,0.1603,-14.4870625,0.0785375,112.8724375,0.6586875000000001,21.9375,2


In [0]:
df_data_by_genres_csv.printSchema()

root
 |-- mode: integer (nullable = true)
 |-- genres: string (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- key: integer (nullable = true)



In [0]:
df_data_by_genres_csv.columns

['mode',
 'genres',
 'acousticness',
 'danceability',
 'duration_ms',
 'energy',
 'instrumentalness',
 'liveness',
 'loudness',
 'speechiness',
 'tempo',
 'valence',
 'popularity',
 'key']

In [0]:
columns_string = [
    'genres'
]

columns_integer = [
    'mode',
    'key'
]

columns_double = [
    'acousticness',
    'danceability',
    'duration_ms',
    'energy',
    'instrumentalness',
    'liveness',
    'loudness',
    'speechiness',
    'tempo',
    'valence',
    'popularity'
]

In [0]:
len(columns_string) + len(columns_integer) + len(columns_double)

14

In [0]:
len(df_data_by_genres_csv.columns)

14

In [0]:
for column in df_data_by_genres_csv.columns:
    # String columns
    if column in columns_string:
        df_data_by_genres_csv = df_data_by_genres_csv.withColumn(column, df_data_by_genres_csv[column].cast(StringType()))
    # Integer columns
    elif column in columns_integer:
        df_data_by_genres_csv = df_data_by_genres_csv.withColumn(column, df_data_by_genres_csv[column].cast(IntegerType()))
    # All remaining columns are cast to double
    else:
        df_data_by_genres_csv = df_data_by_genres_csv.withColumn(column, df_data_by_genres_csv[column].cast(DoubleType()))
df_data_by_genres_csv.printSchema()

root
 |-- mode: integer (nullable = true)
 |-- genres: string (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- key: integer (nullable = true)
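
The same casting loop is repeated for each dataset in this notebook. One possible way to factor it out is sketched below. The helper names `build_cast_plan` and `cast_columns` are hypothetical, and the sketch relies on `Column.cast` accepting type names as strings (`'string'`, `'int'`, `'double'`), so no type classes need to be imported.

```python
def build_cast_plan(columns, columns_string, columns_integer):
    """Pair each column with a Spark SQL type name, defaulting to double."""
    plan = []
    for column in columns:
        if column in columns_string:
            plan.append((column, 'string'))
        elif column in columns_integer:
            plan.append((column, 'int'))
        else:
            plan.append((column, 'double'))
    return plan


def cast_columns(df, columns_string, columns_integer):
    """Apply the plan in one select; `df` is assumed to be a Spark DataFrame."""
    plan = build_cast_plan(df.columns, columns_string, columns_integer)
    return df.select([df[column].cast(type_name).alias(column)
                      for column, type_name in plan])


# The plan itself is plain Python and can be inspected without a cluster
plan = build_cast_plan(['mode', 'genres', 'tempo'], ['genres'], ['mode', 'key'])
```

On the cluster, the loop could then be replaced by `df_data_by_genres_csv = cast_columns(df_data_by_genres_csv, columns_string, columns_integer)`. A single `select` also avoids chaining one `withColumn` call per column.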



In [0]:
df_data_by_genres_csv.limit(10).display()

mode,genres,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,21st century classical,0.9793333333333332,0.1628833333333333,160297.66666666663,0.0713166666666666,0.60683367,0.3616,-31.514333333333337,0.0405666666666666,75.3365,0.1037833333333333,27.83333333333333,6
1,432hz,0.49478,0.2993333333333333,1048887.333333333,0.4506783333333333,0.4777616666666668,0.131,-16.854,0.0768166666666666,120.28566666666666,0.22175,52.5,5
1,8-bit,0.762,0.7120000000000001,115177.0,0.818,0.8759999999999999,0.126,-9.18,0.047,133.444,0.975,48.0,7
1,[],0.6514170195595453,0.5290925603549332,232880.8902503945,0.4191460727353524,0.2053091895111363,0.2186958541504073,-12.288964675489456,0.1078715586868139,112.8573524318416,0.5136042963588958,20.859882191849056,7
1,a cappella,0.676557304985755,0.5389612464387464,190628.5408867521,0.3164335701566952,0.0030034414404202,0.1722541371082621,-12.479387421652426,0.0828514398148148,112.1103620014245,0.4482486545584045,45.82007122507122,7
1,abstract,0.45921,0.5161666666666667,343196.5,0.4424166666666666,0.8496666666666667,0.1180666666666667,-15.472083333333332,0.0465166666666666,127.88575000000002,0.307325,43.5,1
1,abstract beats,0.3421466666666667,0.623,229936.2,0.5277999999999999,0.3336026120000001,0.0996533333333333,-7.918000000000001,0.1163733333333333,112.4138,0.4935066666666666,58.93333333333332,10
1,abstract hip hop,0.2438540633608816,0.6945709366391184,231849.2341597796,0.6462346418732783,0.0242312629201102,0.1685429201101929,-7.349327823691461,0.2142576997245178,108.2449865013774,0.5713909090909091,39.79070247933884,2
0,accordeon,0.3229999999999999,0.588,164000.0,0.392,0.441,0.0794,-14.899,0.0727,109.131,0.7090000000000001,39.0,2
1,accordion,0.446125,0.6248125,167061.5625,0.3734375,0.19373839375,0.1603,-14.4870625,0.0785375,112.8724375,0.6586875000000001,21.9375,2


In [0]:
# File location and type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_genres'
file_type = 'parquet'

# Mode
mode = 'overwrite'

# Writing Data
df_data_by_genres_csv.write.format(file_type) \
    .mode(mode) \
    .save(file_location)
display(dbutils.fs.ls(file_location))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_genres/_committed_2964239454199687273,_committed_2964239454199687273,223,1707247221000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_genres/_committed_4399768952818124269,_committed_4399768952818124269,223,1707232433000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_genres/_committed_4528280514220052370,_committed_4528280514220052370,233,1707220094000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_genres/_committed_5445022735537102827,_committed_5445022735537102827,223,1707232108000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_genres/_started_2964239454199687273,_started_2964239454199687273,0,1707247220000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_genres/part-00000-tid-2964239454199687273-fb1447fd-ed0b-4d07-ab2c-9a93acb161e3-111-1-c000.snappy.parquet,part-00000-tid-2964239454199687273-fb1447fd-ed0b-4d07-ab2c-9a93acb161e3-111-1-c000.snappy.parquet,277208,1707247220000
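
The listing mixes `_committed_*` and `_started_*` entries, which appear to be commit markers left by successive runs, with the single Parquet data file. A small sketch for keeping only the data files is shown below. `parquet_parts` is a hypothetical helper, and the `Entry` tuple is a stand-in for the objects returned by `dbutils.fs.ls`, which expose `name` and `size` attributes.

```python
from collections import namedtuple


def parquet_parts(entries):
    """Keep only the Parquet data files from a directory listing."""
    return [entry for entry in entries
            if entry.name.startswith('part-') and entry.name.endswith('.parquet')]


# Stand-in for dbutils.fs.ls(file_location), using names and sizes from the listing above
Entry = namedtuple('Entry', ['name', 'size'])
listing = [
    Entry('_committed_2964239454199687273', 223),
    Entry('_started_2964239454199687273', 0),
    Entry('part-00000-tid-2964239454199687273-fb1447fd-ed0b-4d07-ab2c-'
          '9a93acb161e3-111-1-c000.snappy.parquet', 277208),
]

parts = parquet_parts(listing)
total_bytes = sum(entry.size for entry in parts)
```

On Databricks, the same function could be called as `parquet_parts(dbutils.fs.ls(file_location))`.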


## Data_by_year

In [0]:
df_data_by_year_csv.limit(10).display()

mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.2318151333333333,0.3448780588666665,0.20571,-17.04866666666665,0.073662,101.53149333333327,0.3793266666666666,0.6533333333333333,2
1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.2378153521126759,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.1408450704225352,10
1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.371732725027027,0.2274621621621621,-14.129210810810813,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0
1,1924,0.940199860169493,0.5498940677966102,191046.70762711865,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.0920894067796609,120.68957203389822,0.6637254237288139,0.6610169491525424,10
1,1925,0.9626070503597138,0.5738633093525181,184986.92446043165,0.2785935251798561,0.4182973612230215,0.2376679856115108,-14.14641366906474,0.1119179856115108,115.5219208633093,0.6219287769784171,2.6043165467625897,5
1,1926,0.660817216981134,0.5998802612481859,156881.65747460088,0.2114670907111756,0.3330931111175616,0.2323695936139332,-18.492538461538462,0.4837036284470243,109.64803265602328,0.4369104571843251,1.4223512336719883,9
1,1927,0.9361794552845558,0.6482682926829262,184993.59837398367,0.2643213008130081,0.3913284986504065,0.1684502439024389,-14.422373983739831,0.113609593495935,114.84652357723554,0.6597004878048782,0.8016260162601626,7
1,1928,0.9386165035685952,0.5342878667724027,214827.90642347344,0.2079477954004757,0.4948354801348136,0.1752893735130848,-17.191982553528927,0.1599114988104679,106.77226169706591,0.4957126883425853,1.5257731958762886,1
1,1929,0.6014265861344558,0.6476698529411761,168999.41281512607,0.2418007352941172,0.2152040310609246,0.2360002100840333,-16.530376050420152,0.4900007352941176,110.94835714285716,0.6365298319327733,0.3403361344537815,7
1,1930,0.936714937370057,0.5181758835758836,195150.28534303536,0.3335239189189189,0.3522059281652805,0.2213108627858629,-12.869221413721428,0.1199096673596674,109.87119438669428,0.6162376299376306,0.9267151767151768,2


In [0]:
df_data_by_year_csv.printSchema()

root
 |-- mode: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- key: integer (nullable = true)



In [0]:
df_data_by_year_csv.columns

['mode',
 'year',
 'acousticness',
 'danceability',
 'duration_ms',
 'energy',
 'instrumentalness',
 'liveness',
 'loudness',
 'speechiness',
 'tempo',
 'valence',
 'popularity',
 'key']

In [0]:
columns_string = []
    
columns_integer = [
    'mode',
    'year',
    'key'
]

columns_double = [
    'acousticness',
    'danceability',
    'duration_ms',
    'energy',
    'instrumentalness',
    'liveness',
    'loudness',
    'speechiness',
    'tempo',
    'valence',
    'popularity'
]

In [0]:
len(columns_string) + len(columns_integer) + len(columns_double)

14

In [0]:
len(df_data_by_year_csv.columns)

14

In [0]:
for column in df_data_by_year_csv.columns:
    # String columns
    if column in columns_string:
        df_data_by_year_csv = df_data_by_year_csv.withColumn(column, df_data_by_year_csv[column].cast(StringType()))
    # Integer columns
    elif column in columns_integer:
        df_data_by_year_csv = df_data_by_year_csv.withColumn(column, df_data_by_year_csv[column].cast(IntegerType()))
    # All remaining columns are cast to double
    else:
        df_data_by_year_csv = df_data_by_year_csv.withColumn(column, df_data_by_year_csv[column].cast(DoubleType()))
df_data_by_year_csv.printSchema()

root
 |-- mode: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- key: integer (nullable = true)



In [0]:
df_data_by_year_csv.limit(10).display()

mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.2318151333333333,0.3448780588666665,0.20571,-17.04866666666665,0.073662,101.53149333333327,0.3793266666666666,0.6533333333333333,2
1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.2378153521126759,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.1408450704225352,10
1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.371732725027027,0.2274621621621621,-14.129210810810813,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0
1,1924,0.940199860169493,0.5498940677966102,191046.70762711865,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.0920894067796609,120.68957203389822,0.6637254237288139,0.6610169491525424,10
1,1925,0.9626070503597138,0.5738633093525181,184986.92446043165,0.2785935251798561,0.4182973612230215,0.2376679856115108,-14.14641366906474,0.1119179856115108,115.5219208633093,0.6219287769784171,2.6043165467625897,5
1,1926,0.660817216981134,0.5998802612481859,156881.65747460088,0.2114670907111756,0.3330931111175616,0.2323695936139332,-18.492538461538462,0.4837036284470243,109.64803265602328,0.4369104571843251,1.4223512336719883,9
1,1927,0.9361794552845558,0.6482682926829262,184993.59837398367,0.2643213008130081,0.3913284986504065,0.1684502439024389,-14.422373983739831,0.113609593495935,114.84652357723554,0.6597004878048782,0.8016260162601626,7
1,1928,0.9386165035685952,0.5342878667724027,214827.90642347344,0.2079477954004757,0.4948354801348136,0.1752893735130848,-17.191982553528927,0.1599114988104679,106.77226169706591,0.4957126883425853,1.5257731958762886,1
1,1929,0.6014265861344558,0.6476698529411761,168999.41281512607,0.2418007352941172,0.2152040310609246,0.2360002100840333,-16.530376050420152,0.4900007352941176,110.94835714285716,0.6365298319327733,0.3403361344537815,7
1,1930,0.936714937370057,0.5181758835758836,195150.28534303536,0.3335239189189189,0.3522059281652805,0.2213108627858629,-12.869221413721428,0.1199096673596674,109.87119438669428,0.6162376299376306,0.9267151767151768,2


In [0]:
# File location and type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_year'
file_type = 'parquet'

# Mode
mode = 'overwrite'

# Writing Data
df_data_by_year_csv.write.format(file_type) \
    .mode(mode) \
    .save(file_location)
display(dbutils.fs.ls(file_location))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_year/_committed_238674918752713338,_committed_238674918752713338,222,1707232113000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_year/_committed_3149989720632713406,_committed_3149989720632713406,222,1707232437000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_year/_committed_3687273619894278790,_committed_3687273619894278790,234,1707220099000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_year/_committed_4224027635365993426,_committed_4224027635365993426,223,1707247226000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_year/_started_4224027635365993426,_started_4224027635365993426,0,1707247226000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_year/part-00000-tid-4224027635365993426-05ea61af-8396-40f2-9e1f-6f297c6931db-122-1-c000.snappy.parquet,part-00000-tid-4224027635365993426-05ea61af-8396-40f2-9e1f-6f297c6931db-122-1-c000.snappy.parquet,13390,1707247226000


## Data_w_genres

In [0]:
df_data_w_genres_csv.limit(10).display()

genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
['show tunes'],"""""""Cats"""" 1981 Original London Cast""",0.5901111111111111,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5.0,1.0,9.0
[],"""""""Cats"""" 1983 Broadway Cast""",0.8625384615384617,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5.0,1.0,26.0
[],"""""""Fiddler On The Roof” Motion Picture Chorus""",0.8565714285714285,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0.0,1.0,7.0
[],"""""""Fiddler On The Roof” Motion Picture Orchestra""",0.884925925925926,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0.0,1.0,27.0
[],"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1991 London Cast""",0.5107142857142857,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5.0,1.0,7.0
[],"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1992 Canadian Cast""",0.6095555555555557,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5.0,1.0,36.0
[],"""""""Mama"""" Helen Teagarden""",0.725,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8.0,1.0,2.0
[],"""""""Test for Victor Young""""""",0.927,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10.0,1.0,2.0
"['comedy rock', 'comic', 'parody']","""""""Weird Al"""" Yankovic""",0.1731450819672131,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9.0,1.0,122.0
"['emo rap', 'florida rap', 'sad rap', 'underground hip hop', 'vapor trap']",$NOT,0.5444666666666668,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1.0,1.0,15.0


In [0]:
df_data_w_genres_csv.printSchema()

root
 |-- genres: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- acousticness: string (nullable = true)
 |-- danceability: string (nullable = true)
 |-- duration_ms: string (nullable = true)
 |-- energy: string (nullable = true)
 |-- instrumentalness: string (nullable = true)
 |-- liveness: string (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- key: double (nullable = true)
 |-- mode: double (nullable = true)
 |-- count: double (nullable = true)
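
In this schema, `acousticness` through `liveness` were inferred as strings, which suggests that at least some rows hold non-numeric text in those columns. Before casting, it can help to count how many values would be lost. The sketch below builds Spark SQL expressions for `DataFrame.selectExpr`. The helper name `lost_on_cast_exprs` is hypothetical, and the approach assumes ANSI mode is off, so that a failed `CAST` yields null instead of raising an error.

```python
def lost_on_cast_exprs(columns, target_type='DOUBLE'):
    """Build one aggregate per column that counts non-null values
    which become null when cast to `target_type`."""
    template = (
        "sum(CASE WHEN `{c}` IS NOT NULL AND CAST(`{c}` AS {t}) IS NULL "
        "THEN 1 ELSE 0 END) AS `{c}`"
    )
    return [template.format(c=column, t=target_type) for column in columns]


# Columns reported as string in the schema above
string_typed = ['acousticness', 'danceability', 'duration_ms',
                'energy', 'instrumentalness', 'liveness']
exprs = lost_on_cast_exprs(string_typed)

# On the cluster, before the casting cell (assumes ANSI mode is off):
# df_data_w_genres_csv.selectExpr(*exprs).display()
```

A non-zero count for a column points to rows worth inspecting before the cast replaces their values with null.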



In [0]:
df_data_w_genres_csv.columns

['genres',
 'artists',
 'acousticness',
 'danceability',
 'duration_ms',
 'energy',
 'instrumentalness',
 'liveness',
 'loudness',
 'speechiness',
 'tempo',
 'valence',
 'popularity',
 'key',
 'mode',
 'count']

In [0]:
columns_string = [
    'genres',
    'artists'
]
    
columns_integer = [
    'key',
    'mode',
    'count'
]

columns_double = [
    'acousticness',
    'danceability',
    'duration_ms',
    'energy',
    'instrumentalness',
    'liveness',
    'loudness',
    'speechiness',
    'tempo',
    'valence',
    'popularity'
]

In [0]:
len(columns_string) + len(columns_integer) + len(columns_double)

16

In [0]:
len(df_data_w_genres_csv.columns)

16

In [0]:
for column in df_data_w_genres_csv.columns:
    # String columns
    if column in columns_string:
        df_data_w_genres_csv = df_data_w_genres_csv.withColumn(column, df_data_w_genres_csv[column].cast(StringType()))
    # Integer columns
    elif column in columns_integer:
        df_data_w_genres_csv = df_data_w_genres_csv.withColumn(column, df_data_w_genres_csv[column].cast(IntegerType()))
    # All remaining columns are cast to double
    else:
        df_data_w_genres_csv = df_data_w_genres_csv.withColumn(column, df_data_w_genres_csv[column].cast(DoubleType()))
df_data_w_genres_csv.printSchema()

root
 |-- genres: string (nullable = true)
 |-- artists: string (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: double (nullable = true)
 |-- energy: double (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)
 |-- valence: double (nullable = true)
 |-- popularity: double (nullable = true)
 |-- key: integer (nullable = true)
 |-- mode: integer (nullable = true)
 |-- count: integer (nullable = true)



In [0]:
df_data_w_genres_csv.limit(10).display()

genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
['show tunes'],"""""""Cats"""" 1981 Original London Cast""",0.5901111111111111,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5,1,9
[],"""""""Cats"""" 1983 Broadway Cast""",0.8625384615384617,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5,1,26
[],"""""""Fiddler On The Roof” Motion Picture Chorus""",0.8565714285714285,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0,1,7
[],"""""""Fiddler On The Roof” Motion Picture Orchestra""",0.884925925925926,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0,1,27
[],"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1991 London Cast""",0.5107142857142857,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5,1,7
[],"""""""Joseph And The Amazing Technicolor Dreamcoat"""" 1992 Canadian Cast""",0.6095555555555557,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5,1,36
[],"""""""Mama"""" Helen Teagarden""",0.725,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8,1,2
[],"""""""Test for Victor Young""""""",0.927,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10,1,2
"['comedy rock', 'comic', 'parody']","""""""Weird Al"""" Yankovic""",0.1731450819672131,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9,1,122
"['emo rap', 'florida rap', 'sad rap', 'underground hip hop', 'vapor trap']",$NOT,0.5444666666666668,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1,1,15


## Processing the names of the artists

In [0]:
# Creating a sample of the dataset to avoid corrupting or losing the original data
sample_df = df_data_w_genres_csv.sample(withReplacement=False, fraction=0.001, seed=333)
sample_df.display()

genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
['pop romantico'],Angela Carrasco,0.5789999999999998,0.41475,307573.5,0.40175,0.0013625,0.143475,-13.77375,0.03695,151.63324999999998,0.31675,48.0,9,1,8
[],Bani Tagore,0.865,0.525,169093.0,0.15,0.0,0.187,-12.001,0.0338,121.988,0.444,0.0,9,1,2
"['classic soul', 'disco', 'motown', 'quiet storm', 'soul', 'southern soul']",Candi Staton,0.3296625,0.6296249999999999,199941.875,0.5469999999999999,4.09125e-06,0.190775,-11.836125,0.0430874999999999,116.870875,0.83625,42.625,9,1,16
['canadian soundtrack'],Carl Zittrer,0.4964,0.3704,144373.2,0.19622,0.147,0.13814,-19.413,0.03938,104.232,0.30046,25.4,7,1,5
['poetry'],Dylan Thomas,0.7130000000000001,0.726,139240.0,0.08505,6.1e-05,0.2175,-28.157,0.917,80.6515,0.201,4.5,2,1,4
"['chicha', 'cumbia', 'nu-cumbia']",Frente!,0.906,0.738,119813.0,0.198,0.0,0.105,-10.919,0.0625,123.158,0.682,54.0,1,1,2
"['deep new americana', 'freak folk', 'funk', 'indie folk', 'indie pop', 'indie rock', 'modern rock', 'new americana', 'shimmer pop', 'stomp and holler']",Fruit Bats,0.0787499999999999,0.6579999999999999,227753.5,0.8445,0.031815,0.327,-5.863500000000001,0.02905,110.486,0.7655,60.0,7,1,4
"['progressive house', 'progressive trance', 'trance', 'uplifting trance']",Genix,0.000122,0.542,271628.0,0.97,0.841,0.414,-6.327999999999999,0.0441,129.009,0.0893,2.0,11,0,2
"""[""""children's music""""","""""preschool children's music""""]""",,0.5345000000000001,0.8555,191768.5,0.675,0.0,0.1198999999999999,-7.811,0.09045,130.7665,0.9000000000000001,26,2,1
[],Herbs,0.334,0.825,277360.0,0.649,0.0,0.623,-9.554,0.0418,122.155,0.97,61.0,7,1,2


In [0]:
# Remove brackets, pipes, quotes and similar punctuation from the artist names
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r"[\[\]|'\"!^´`()]", ''))
# Replace commas with periods
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r'[,]', '.'))
# Replace '$' with 'S'
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r'[$]', 'S'))
# Replace '?' with 'Q' (commas were already replaced above)
sample_df = sample_df.withColumn('artists', f.regexp_replace('artists', r'[?]', 'Q'))

In [0]:
sample_df.display()

genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
['pop romantico'],Angela Carrasco,0.5789999999999998,0.41475,307573.5,0.40175,0.0013625,0.143475,-13.77375,0.03695,151.63324999999998,0.31675,48.0,9,1,8
[],Bani Tagore,0.865,0.525,169093.0,0.15,0.0,0.187,-12.001,0.0338,121.988,0.444,0.0,9,1,2
"['classic soul', 'disco', 'motown', 'quiet storm', 'soul', 'southern soul']",Candi Staton,0.3296625,0.6296249999999999,199941.875,0.5469999999999999,4.09125e-06,0.190775,-11.836125,0.0430874999999999,116.870875,0.83625,42.625,9,1,16
['canadian soundtrack'],Carl Zittrer,0.4964,0.3704,144373.2,0.19622,0.147,0.13814,-19.413,0.03938,104.232,0.30046,25.4,7,1,5
['poetry'],Dylan Thomas,0.7130000000000001,0.726,139240.0,0.08505,6.1e-05,0.2175,-28.157,0.917,80.6515,0.201,4.5,2,1,4
"['chicha', 'cumbia', 'nu-cumbia']",Frente,0.906,0.738,119813.0,0.198,0.0,0.105,-10.919,0.0625,123.158,0.682,54.0,1,1,2
"['deep new americana', 'freak folk', 'funk', 'indie folk', 'indie pop', 'indie rock', 'modern rock', 'new americana', 'shimmer pop', 'stomp and holler']",Fruit Bats,0.0787499999999999,0.6579999999999999,227753.5,0.8445,0.031815,0.327,-5.863500000000001,0.02905,110.486,0.7655,60.0,7,1,4
"['progressive house', 'progressive trance', 'trance', 'uplifting trance']",Genix,0.000122,0.542,271628.0,0.97,0.841,0.414,-6.327999999999999,0.0441,129.009,0.0893,2.0,11,0,2
"""[""""children's music""""",preschool childrens music,,0.5345000000000001,0.8555,191768.5,0.675,0.0,0.1198999999999999,-7.811,0.09045,130.7665,0.9000000000000001,26,2,1
[],Herbs,0.334,0.825,277360.0,0.649,0.0,0.623,-9.554,0.0418,122.155,0.97,61.0,7,1,2


In [0]:
# Remove brackets, pipes, quotes, colons and similar punctuation from the artist names
df_data_w_genres_csv = df_data_w_genres_csv.withColumn('artists', f.regexp_replace('artists', r"[\[\]|'\"!^´`():”]", ''))
# Replace commas with periods
df_data_w_genres_csv = df_data_w_genres_csv.withColumn('artists', f.regexp_replace('artists', r'[,]', '.'))
# Replace '$' with 'S'
df_data_w_genres_csv = df_data_w_genres_csv.withColumn('artists', f.regexp_replace('artists', r'[$]', 'S'))
# Replace '?' with 'Q' (commas were already replaced above)
df_data_w_genres_csv = df_data_w_genres_csv.withColumn('artists', f.regexp_replace('artists', r'[?]', 'Q'))
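
To see what these four replacements do to individual names, the same substitutions can be tried locally. The sketch below uses Python's `re` module as a stand-in for Spark's `regexp_replace` (which uses Java regular expressions); for these simple character classes the two are assumed to behave the same. `clean_artist` is a hypothetical helper, not part of the notebook.

```python
import re

# Characters removed by the first replacement
REMOVE = re.compile(r"[\[\]|'\"!^´`():”]")


def clean_artist(name):
    """Apply the four replacements in the same order as the Spark code."""
    name = REMOVE.sub('', name)      # drop brackets, quotes and similar punctuation
    name = name.replace(',', '.')    # commas become periods
    name = name.replace('$', 'S')    # '$' becomes 'S'
    name = name.replace('?', 'Q')    # '?' becomes 'Q'
    return name


examples = ['"Weird Al" Yankovic', '$NOT', 'Frente!']
cleaned = [clean_artist(name) for name in examples]
```

The results match the names shown in the cleaned output of this section, for example `$NOT` becoming `SNOT`.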

In [0]:
df_data_w_genres_csv.limit(20).display()

genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
['show tunes'],Cats 1981 Original London Cast,0.5901111111111111,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5,1,9
[],Cats 1983 Broadway Cast,0.8625384615384617,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5,1,26
[],Fiddler On The Roof Motion Picture Chorus,0.8565714285714285,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0,1,7
[],Fiddler On The Roof Motion Picture Orchestra,0.884925925925926,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0,1,27
[],Joseph And The Amazing Technicolor Dreamcoat 1991 London Cast,0.5107142857142857,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5,1,7
[],Joseph And The Amazing Technicolor Dreamcoat 1992 Canadian Cast,0.6095555555555557,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5,1,36
[],Mama Helen Teagarden,0.725,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8,1,2
[],Test for Victor Young,0.927,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10,1,2
"['comedy rock', 'comic', 'parody']",Weird Al Yankovic,0.1731450819672131,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9,1,122
"['emo rap', 'florida rap', 'sad rap', 'underground hip hop', 'vapor trap']",SNOT,0.5444666666666668,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1,1,15


In [0]:
# File location and type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres'
file_type = 'parquet'

# Mode
mode = 'overwrite'

# Writing Data
df_data_w_genres_csv.write.format(file_type) \
    .mode(mode) \
    .save(file_location)
display(dbutils.fs.ls(file_location))

path,name,size,modificationTime
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres/_committed_3922870288229815722,_committed_3922870288229815722,434,1707220110000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres/_committed_395116854890766511,_committed_395116854890766511,421,1707232126000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres/_committed_440591130866655678,_committed_440591130866655678,419,1707232447000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres/_committed_5712057867446360260,_committed_5712057867446360260,421,1707247238000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres/_started_5712057867446360260,_started_5712057867446360260,0,1707247236000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres/part-00000-tid-5712057867446360260-45ffbb21-4fe2-47cf-923a-9333d01b1295-138-1-c000.snappy.parquet,part-00000-tid-5712057867446360260-45ffbb21-4fe2-47cf-923a-9333d01b1295-138-1-c000.snappy.parquet,1873908,1707247238000
dbfs:/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres/part-00001-tid-5712057867446360260-45ffbb21-4fe2-47cf-923a-9333d01b1295-139-1-c000.snappy.parquet,part-00001-tid-5712057867446360260-45ffbb21-4fe2-47cf-923a-9333d01b1295-139-1-c000.snappy.parquet,493090,1707247236000


## Analyzing the data

### Reading preprocessed file: data

In [0]:
# Location and file type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data'
file_type = 'parquet'

# Reading file
Df_data = spark.read.format(file_type) \
    .load(file_location)

Df_data.limit(10).display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
0.917,1970,0.096,The Velvet Underground,0.624,201440,0.774,0,60ZyiL4lmWzZyGfqyECTqp,0.0309,7,0.096,-10.390999999999998,1,Train Round the Bend - 2015 Remaster,24,1970,0.0315,117.006
0.511,1970,0.0019,Ten Years After,0.405,458463,0.5429999999999999,0,6DYyyUdHzI6RdSx0swUR1i,0.72,2,0.186,-9.313,1,Love Like a Man - 2017 Remaster,34,1970-04-01,0.029,107.598
0.466,1970,0.0528,The Mothers Of Invention,0.444,105587,0.568,0,6HJAS8XZO0ctUcN2KsbLRa,1.02e-05,11,0.512,-8.8,0,Oh No,24,1970-08-10,0.0327,124.319
0.523,1970,0.0811,Three Dog Night,0.502,174707,0.669,0,7sZ74qmKb1nyGKUgHROJ1n,0.000945,7,0.0906,-11.725,1,One Man Band,19,1970-01-01,0.0912,121.089
0.501,1970,0.000128,The Rolling Stones,0.273,246413,0.866,0,095WtNlSHE8TMB2gQ1fdTx,0.79,11,0.961,-7.598,1,Street Fighting Man - Live,25,1970-09-04,0.0347,134.891
0.8859999999999999,1970,0.25,Sly & The Family Stone,0.693,178360,0.6409999999999999,0,0aI5KoqucjqXjPi7bFENFQ,0.000528,0,0.0826,-9.99,1,Life,18,1970-11-21,0.0517,121.823
0.119,1970,0.13,William S. Fischer,0.231,273520,0.326,0,187c6h1frKYjnqKEoKPQQ6,0.287,4,0.106,-18.219,0,Chains,26,1970,0.0425,82.45200000000001
0.61,1970,0.6659999999999999,Yusuf / Cat Stevens,0.397,297107,0.634,0,27adiexGtJvf2NbH0GletP,0.0006799999999999999,7,0.134,-7.787000000000001,1,On The Road To Find Out - Live At KCET-TV. 1971,27,1970-11-23,0.0343,179.516
0.556,1970,0.00674,The Meters,0.601,164000,0.439,0,2SJ5On3CJUop2H2uxTlovf,0.7609999999999999,6,0.0827,-13.213,0,Oh. Calcutta,23,1970,0.0311,102.344
0.317,1970,0.746,Linda Ronstadt,0.341,172707,0.357,0,2XCtrbGnYP8inx76lmxXyt,4.47e-06,2,0.454,-13.245,1,Are My Thoughts With YouQ,20,1970,0.0385,123.604


In [0]:
display(Df_data.describe())

summary,valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo
count,170653.0,170653.0,170653.0,170653,170080.0,170454.0,170573.0,170606.0,170653,170270.0,170451.0,170613.0,170621.0,170635.0,170653,169496.0,170653,170042.0,170404.0
mean,0.5285872111424917,1976.78724077514,0.5021147637067068,419.12048192771084,0.5376399623706487,230404.3570171425,533.269539857587,161.70805833323564,15004.793020397112,36.46453885074412,20.310634727869008,3.5752400090098906,-11.417017795610027,7.382781961496762,Infinity,31.543546750365792,1929.7153378618455,5.064387185348326,118.22609112814231
stddev,0.2631714639897204,25.91785256455829,0.3760317251620433,417.014253648609,0.1759587953141558,126366.513988664,13500.95790521312,6984.208830168787,67448.47403485543,3139.1221317034187,2119.3151514089577,981.3336826644396,5.729826602416083,1867.3487851378695,,21.80303129615027,253.4321782547515,98.12127900967268,67.00273009940193
min,0.0,1921.0,0.0,*NSYNC,0.0,0.0,0.0,0.0,"""""Orchestra Sinfonica dell'EIAR di Torino""""]""",0.0,0.0,0.0,-60.0,-36.0,Cello Song,-39.0,"""""L'amour est un oiseau rebelle"""" (Carmen",-30.073,-8.331
max,1.0,2020.0,0.996,조정현,0.988,5403500.0,1622000.0,693933.0,7zzuPsjj9L3M7ikqGmjN0D,475880.0,503320.0,290627.0,11.0,701187.0,텅 빈 마음 Empty Heart,100.0,Whipped into Shape,2009.0,2010.0
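
Several maxima in the summary above fall far outside the ranges given in the feature summary. For example, `energy` is defined on 0.0 to 1.0 but reports a maximum of 1622000.0, and the minimum `id` looks like an artist name. This may indicate that some rows had their columns shifted during parsing. Below is a minimal plain-Python sketch of a range check that could flag such rows. `EXPECTED_BOUNDS` and `out_of_range` are illustrative names that are not part of the notebook, and the second sample row is hypothetical, assembled from the reported maxima.

```python
# Expected value ranges taken from the feature summary (illustrative subset).
EXPECTED_BOUNDS = {
    "energy": (0.0, 1.0),
    "explicit": (0, 1),
    "key": (-1, 11),
}


def out_of_range(row, bounds=EXPECTED_BOUNDS):
    """Return the names of columns whose values fall outside their bounds."""
    bad = []
    for column, (low, high) in bounds.items():
        value = row.get(column)
        # Missing values are skipped here; they are a separate concern.
        if value is not None and not (low <= value <= high):
            bad.append(column)
    return bad


# Values from the first sample row shown earlier (all within range).
clean_row = {"energy": 0.774, "explicit": 0, "key": 7}
# Hypothetical row assembled from the maxima in the describe() output.
shifted_row = {"energy": 1622000.0, "explicit": 693933, "key": 503320}

print(out_of_range(clean_row))
print(out_of_range(shifted_row))
```

The same bounds could be turned into a Spark filter to count or drop suspicious rows before any aggregation.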


In [0]:
Df_data.printSchema()

root
 |-- valence: double (nullable = true)
 |-- year: integer (nullable = true)
 |-- acousticness: double (nullable = true)
 |-- artists: string (nullable = true)
 |-- danceability: double (nullable = true)
 |-- duration_ms: integer (nullable = true)
 |-- energy: double (nullable = true)
 |-- explicit: integer (nullable = true)
 |-- id: string (nullable = true)
 |-- instrumentalness: double (nullable = true)
 |-- key: integer (nullable = true)
 |-- liveness: double (nullable = true)
 |-- loudness: double (nullable = true)
 |-- mode: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- popularity: integer (nullable = true)
 |-- release_date: string (nullable = true)
 |-- speechiness: double (nullable = true)
 |-- tempo: double (nullable = true)



### Counting the distinct years in our music database

In [0]:
Df_data \
    .select('year') \
    .distinct() \
    .count() 

100

### Checking the number of songs per year

In [0]:
Df_data \
    .groupBy('year')\
    .count() \
    .withColumnRenamed('count', 'year x count') \
    .orderBy('year') \
    .display()

year,year x count
1921,150
1922,71
1923,185
1924,236
1925,278
1926,1378
1927,615
1928,1261
1929,952
1930,1924


## Viewing the number of songs per year

In [0]:
Df_data \
    .groupBy('year')\
    .count() \
    .withColumnRenamed('count', 'year x count') \
    .orderBy('year') \
    .display()

year,year x count
1921,150
1922,71
1923,185
1924,236
1925,278
1926,1378
1927,615
1928,1261
1929,952
1930,1924


Databricks visualization. Run in Databricks to view.

## Separating years by decades

In [0]:
Df_data \
    .withColumn('decade', (f.floor(f.col('year') / 10) * 10).cast(IntegerType())) \
    .limit(20) \
    .display()

valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,decade
0.917,1970,0.096,The Velvet Underground,0.624,201440,0.774,0,60ZyiL4lmWzZyGfqyECTqp,0.0309,7,0.096,-10.390999999999998,1,Train Round the Bend - 2015 Remaster,24,1970,0.0315,117.006,1970
0.511,1970,0.0019,Ten Years After,0.405,458463,0.5429999999999999,0,6DYyyUdHzI6RdSx0swUR1i,0.72,2,0.186,-9.313,1,Love Like a Man - 2017 Remaster,34,1970-04-01,0.029,107.598,1970
0.466,1970,0.0528,The Mothers Of Invention,0.444,105587,0.568,0,6HJAS8XZO0ctUcN2KsbLRa,1.02e-05,11,0.512,-8.8,0,Oh No,24,1970-08-10,0.0327,124.319,1970
0.523,1970,0.0811,Three Dog Night,0.502,174707,0.669,0,7sZ74qmKb1nyGKUgHROJ1n,0.000945,7,0.0906,-11.725,1,One Man Band,19,1970-01-01,0.0912,121.089,1970
0.501,1970,0.000128,The Rolling Stones,0.273,246413,0.866,0,095WtNlSHE8TMB2gQ1fdTx,0.79,11,0.961,-7.598,1,Street Fighting Man - Live,25,1970-09-04,0.0347,134.891,1970
0.8859999999999999,1970,0.25,Sly & The Family Stone,0.693,178360,0.6409999999999999,0,0aI5KoqucjqXjPi7bFENFQ,0.000528,0,0.0826,-9.99,1,Life,18,1970-11-21,0.0517,121.823,1970
0.119,1970,0.13,William S. Fischer,0.231,273520,0.326,0,187c6h1frKYjnqKEoKPQQ6,0.287,4,0.106,-18.219,0,Chains,26,1970,0.0425,82.45200000000001,1970
0.61,1970,0.6659999999999999,Yusuf / Cat Stevens,0.397,297107,0.634,0,27adiexGtJvf2NbH0GletP,0.0006799999999999999,7,0.134,-7.787000000000001,1,On The Road To Find Out - Live At KCET-TV. 1971,27,1970-11-23,0.0343,179.516,1970
0.556,1970,0.00674,The Meters,0.601,164000,0.439,0,2SJ5On3CJUop2H2uxTlovf,0.7609999999999999,6,0.0827,-13.213,0,Oh. Calcutta,23,1970,0.0311,102.344,1970
0.317,1970,0.746,Linda Ronstadt,0.341,172707,0.357,0,2XCtrbGnYP8inx76lmxXyt,4.47e-06,2,0.454,-13.245,1,Are My Thoughts With YouQ,20,1970,0.0385,123.604,1970
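
The `decade` expression divides the year by 10, floors the result, and multiplies by 10 again, so every year maps to the first year of its decade. The same arithmetic in plain Python, as a small sketch for checking edge cases (`decade_of` is an illustrative name, not a notebook function):

```python
def decade_of(year: int) -> int:
    """Floor a year to the first year of its decade, e.g. 1921 -> 1920."""
    return (year // 10) * 10


for year in (1921, 1929, 1930, 1970, 2020):
    print(year, decade_of(year))
```

Years ending in 0 start a new decade, which is why 1930 appears under 1930 rather than 1920 in the tables below.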


In [0]:
Df_data = Df_data \
    .withColumn('decade', (f.floor(f.col('year') / 10) * 10).cast(IntegerType()))

Df_data.limit(20).display()


valence,year,acousticness,artists,danceability,duration_ms,energy,explicit,id,instrumentalness,key,liveness,loudness,mode,name,popularity,release_date,speechiness,tempo,decade
0.917,1970,0.096,The Velvet Underground,0.624,201440,0.774,0,60ZyiL4lmWzZyGfqyECTqp,0.0309,7,0.096,-10.390999999999998,1,Train Round the Bend - 2015 Remaster,24,1970,0.0315,117.006,1970
0.511,1970,0.0019,Ten Years After,0.405,458463,0.5429999999999999,0,6DYyyUdHzI6RdSx0swUR1i,0.72,2,0.186,-9.313,1,Love Like a Man - 2017 Remaster,34,1970-04-01,0.029,107.598,1970
0.466,1970,0.0528,The Mothers Of Invention,0.444,105587,0.568,0,6HJAS8XZO0ctUcN2KsbLRa,1.02e-05,11,0.512,-8.8,0,Oh No,24,1970-08-10,0.0327,124.319,1970
0.523,1970,0.0811,Three Dog Night,0.502,174707,0.669,0,7sZ74qmKb1nyGKUgHROJ1n,0.000945,7,0.0906,-11.725,1,One Man Band,19,1970-01-01,0.0912,121.089,1970
0.501,1970,0.000128,The Rolling Stones,0.273,246413,0.866,0,095WtNlSHE8TMB2gQ1fdTx,0.79,11,0.961,-7.598,1,Street Fighting Man - Live,25,1970-09-04,0.0347,134.891,1970
0.8859999999999999,1970,0.25,Sly & The Family Stone,0.693,178360,0.6409999999999999,0,0aI5KoqucjqXjPi7bFENFQ,0.000528,0,0.0826,-9.99,1,Life,18,1970-11-21,0.0517,121.823,1970
0.119,1970,0.13,William S. Fischer,0.231,273520,0.326,0,187c6h1frKYjnqKEoKPQQ6,0.287,4,0.106,-18.219,0,Chains,26,1970,0.0425,82.45200000000001,1970
0.61,1970,0.6659999999999999,Yusuf / Cat Stevens,0.397,297107,0.634,0,27adiexGtJvf2NbH0GletP,0.0006799999999999999,7,0.134,-7.787000000000001,1,On The Road To Find Out - Live At KCET-TV. 1971,27,1970-11-23,0.0343,179.516,1970
0.556,1970,0.00674,The Meters,0.601,164000,0.439,0,2SJ5On3CJUop2H2uxTlovf,0.7609999999999999,6,0.0827,-13.213,0,Oh. Calcutta,23,1970,0.0311,102.344,1970
0.317,1970,0.746,Linda Ronstadt,0.341,172707,0.357,0,2XCtrbGnYP8inx76lmxXyt,4.47e-06,2,0.454,-13.245,1,Are My Thoughts With YouQ,20,1970,0.0385,123.604,1970


In [0]:
Df_data \
    .select('decade') \
    .distinct() \
    .orderBy('decade') \
    .display()

decade
1920
1930
1940
1950
1960
1970
1980
1990
2000
2010


## Checking the number of songs per decade

In [0]:
Df_data \
    .groupBy('decade') \
    .count() \
    .orderBy('decade') \
    .display()

decade,count
1920,5126
1930,9549
1940,15378
1950,19850
1960,19549
1970,20000
1980,19850
1990,19901
2000,19646
2010,19774
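
The displayed table stops at ten rows, so it is worth checking whether the listed decades account for the whole dataset. The sketch below sums the counts shown above and compares them with the `year` count of 170653 reported by `describe()`, assuming that count covers every row. The remainder presumably belongs to a decade that is not displayed, such as the 2020 decade that appears in a later table.

```python
# Counts copied from the decade table above (only ten rows are displayed).
songs_per_decade = {
    1920: 5126, 1930: 9549, 1940: 15378, 1950: 19850, 1960: 19549,
    1970: 20000, 1980: 19850, 1990: 19901, 2000: 19646, 2010: 19774,
}

# Row count reported for `year` by describe() earlier in the notebook.
total_rows = 170653

shown_total = sum(songs_per_decade.values())
not_shown = total_rows - shown_total

# Share of the whole dataset held by each displayed decade.
shares = {decade: count / total_rows for decade, count in songs_per_decade.items()}

print(shown_total, not_shown)
for decade, share in shares.items():
    print(decade, f"{share:.1%}")
```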


## Viewing the number of songs per decade

In [0]:
Df_data \
    .groupBy('decade') \
    .count() \
    .orderBy('decade') \
    .display()

decade,count
1920,5126
1930,9549
1940,15378
1950,19850
1960,19549
1970,20000
1980,19850
1990,19901
2000,19646
2010,19774


Databricks visualization. Run in Databricks to view.

## Reading preprocessed file: data_by_year

In [0]:
# File location and file type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_year'
file_type = 'parquet'

# Reading file
Df_by_year = spark.read.format(file_type) \
    .load(file_location)

Df_by_year.limit(10).display()

mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.2318151333333333,0.3448780588666665,0.20571,-17.04866666666665,0.073662,101.53149333333327,0.3793266666666666,0.6533333333333333,2
1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.2378153521126759,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.1408450704225352,10
1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.371732725027027,0.2274621621621621,-14.129210810810813,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0
1,1924,0.940199860169493,0.5498940677966102,191046.70762711865,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.0920894067796609,120.68957203389822,0.6637254237288139,0.6610169491525424,10
1,1925,0.9626070503597138,0.5738633093525181,184986.92446043165,0.2785935251798561,0.4182973612230215,0.2376679856115108,-14.14641366906474,0.1119179856115108,115.5219208633093,0.6219287769784171,2.6043165467625897,5
1,1926,0.660817216981134,0.5998802612481859,156881.65747460088,0.2114670907111756,0.3330931111175616,0.2323695936139332,-18.492538461538462,0.4837036284470243,109.64803265602328,0.4369104571843251,1.4223512336719883,9
1,1927,0.9361794552845558,0.6482682926829262,184993.59837398367,0.2643213008130081,0.3913284986504065,0.1684502439024389,-14.422373983739831,0.113609593495935,114.84652357723554,0.6597004878048782,0.8016260162601626,7
1,1928,0.9386165035685952,0.5342878667724027,214827.90642347344,0.2079477954004757,0.4948354801348136,0.1752893735130848,-17.191982553528927,0.1599114988104679,106.77226169706591,0.4957126883425853,1.5257731958762886,1
1,1929,0.6014265861344558,0.6476698529411761,168999.41281512607,0.2418007352941172,0.2152040310609246,0.2360002100840333,-16.530376050420152,0.4900007352941176,110.94835714285716,0.6365298319327733,0.3403361344537815,7
1,1930,0.936714937370057,0.5181758835758836,195150.28534303536,0.3335239189189189,0.3522059281652805,0.2213108627858629,-12.869221413721428,0.1199096673596674,109.87119438669428,0.6162376299376306,0.9267151767151768,2


## Checking the number of years in the database

In [0]:
Df_by_year \
    .select('year') \
    .distinct() \
    .count()

100

## Viewing song duration by year

In [0]:
Df_by_year.display()

mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.2318151333333333,0.3448780588666665,0.20571,-17.04866666666665,0.073662,101.53149333333327,0.3793266666666666,0.6533333333333333,2
1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.2378153521126759,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.1408450704225352,10
1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.371732725027027,0.2274621621621621,-14.129210810810813,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0
1,1924,0.940199860169493,0.5498940677966102,191046.70762711865,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.0920894067796609,120.68957203389822,0.6637254237288139,0.6610169491525424,10
1,1925,0.9626070503597138,0.5738633093525181,184986.92446043165,0.2785935251798561,0.4182973612230215,0.2376679856115108,-14.14641366906474,0.1119179856115108,115.5219208633093,0.6219287769784171,2.6043165467625897,5
1,1926,0.660817216981134,0.5998802612481859,156881.65747460088,0.2114670907111756,0.3330931111175616,0.2323695936139332,-18.492538461538462,0.4837036284470243,109.64803265602328,0.4369104571843251,1.4223512336719883,9
1,1927,0.9361794552845558,0.6482682926829262,184993.59837398367,0.2643213008130081,0.3913284986504065,0.1684502439024389,-14.422373983739831,0.113609593495935,114.84652357723554,0.6597004878048782,0.8016260162601626,7
1,1928,0.9386165035685952,0.5342878667724027,214827.90642347344,0.2079477954004757,0.4948354801348136,0.1752893735130848,-17.191982553528927,0.1599114988104679,106.77226169706591,0.4957126883425853,1.5257731958762886,1
1,1929,0.6014265861344558,0.6476698529411761,168999.41281512607,0.2418007352941172,0.2152040310609246,0.2360002100840333,-16.530376050420152,0.4900007352941176,110.94835714285716,0.6365298319327733,0.3403361344537815,7
1,1930,0.936714937370057,0.5181758835758836,195150.28534303536,0.3335239189189189,0.3522059281652805,0.2213108627858629,-12.869221413721428,0.1199096673596674,109.87119438669428,0.6162376299376306,0.9267151767151768,2


Databricks visualization. Run in Databricks to view.

## Viewing song characteristics by year

In [0]:
Df_by_year.display()

mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.2318151333333333,0.3448780588666665,0.20571,-17.04866666666665,0.073662,101.53149333333327,0.3793266666666666,0.6533333333333333,2
1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.2378153521126759,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.1408450704225352,10
1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.371732725027027,0.2274621621621621,-14.129210810810813,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0
1,1924,0.940199860169493,0.5498940677966102,191046.70762711865,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.0920894067796609,120.68957203389822,0.6637254237288139,0.6610169491525424,10
1,1925,0.9626070503597138,0.5738633093525181,184986.92446043165,0.2785935251798561,0.4182973612230215,0.2376679856115108,-14.14641366906474,0.1119179856115108,115.5219208633093,0.6219287769784171,2.6043165467625897,5
1,1926,0.660817216981134,0.5998802612481859,156881.65747460088,0.2114670907111756,0.3330931111175616,0.2323695936139332,-18.492538461538462,0.4837036284470243,109.64803265602328,0.4369104571843251,1.4223512336719883,9
1,1927,0.9361794552845558,0.6482682926829262,184993.59837398367,0.2643213008130081,0.3913284986504065,0.1684502439024389,-14.422373983739831,0.113609593495935,114.84652357723554,0.6597004878048782,0.8016260162601626,7
1,1928,0.9386165035685952,0.5342878667724027,214827.90642347344,0.2079477954004757,0.4948354801348136,0.1752893735130848,-17.191982553528927,0.1599114988104679,106.77226169706591,0.4957126883425853,1.5257731958762886,1
1,1929,0.6014265861344558,0.6476698529411761,168999.41281512607,0.2418007352941172,0.2152040310609246,0.2360002100840333,-16.530376050420152,0.4900007352941176,110.94835714285716,0.6365298319327733,0.3403361344537815,7
1,1930,0.936714937370057,0.5181758835758836,195150.28534303536,0.3335239189189189,0.3522059281652805,0.2213108627858629,-12.869221413721428,0.1199096673596674,109.87119438669428,0.6162376299376306,0.9267151767151768,2


Databricks visualization. Run in Databricks to view.

## Viewing song characteristics by decade

In [0]:
Df_by_year \
    .withColumn('decade', (f.floor(f.col('year') / 10) * 10).cast(IntegerType())) \
    .limit(20) \
    .display()

mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,decade
1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.2318151333333333,0.3448780588666665,0.20571,-17.04866666666665,0.073662,101.53149333333327,0.3793266666666666,0.6533333333333333,2,1920
1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.2378153521126759,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.1408450704225352,10,1920
1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.371732725027027,0.2274621621621621,-14.129210810810813,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0,1920
1,1924,0.940199860169493,0.5498940677966102,191046.70762711865,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.0920894067796609,120.68957203389822,0.6637254237288139,0.6610169491525424,10,1920
1,1925,0.9626070503597138,0.5738633093525181,184986.92446043165,0.2785935251798561,0.4182973612230215,0.2376679856115108,-14.14641366906474,0.1119179856115108,115.5219208633093,0.6219287769784171,2.6043165467625897,5,1920
1,1926,0.660817216981134,0.5998802612481859,156881.65747460088,0.2114670907111756,0.3330931111175616,0.2323695936139332,-18.492538461538462,0.4837036284470243,109.64803265602328,0.4369104571843251,1.4223512336719883,9,1920
1,1927,0.9361794552845558,0.6482682926829262,184993.59837398367,0.2643213008130081,0.3913284986504065,0.1684502439024389,-14.422373983739831,0.113609593495935,114.84652357723554,0.6597004878048782,0.8016260162601626,7,1920
1,1928,0.9386165035685952,0.5342878667724027,214827.90642347344,0.2079477954004757,0.4948354801348136,0.1752893735130848,-17.191982553528927,0.1599114988104679,106.77226169706591,0.4957126883425853,1.5257731958762886,1,1920
1,1929,0.6014265861344558,0.6476698529411761,168999.41281512607,0.2418007352941172,0.2152040310609246,0.2360002100840333,-16.530376050420152,0.4900007352941176,110.94835714285716,0.6365298319327733,0.3403361344537815,7,1920
1,1930,0.936714937370057,0.5181758835758836,195150.28534303536,0.3335239189189189,0.3522059281652805,0.2213108627858629,-12.869221413721428,0.1199096673596674,109.87119438669428,0.6162376299376306,0.9267151767151768,2,1930


In [0]:
Df_by_year = Df_by_year \
    .withColumn('decade', (f.floor(f.col('year') / 10) * 10).cast(IntegerType()))
Df_by_year.limit(10).display()

mode,year,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,decade
1,1921,0.8868960000000005,0.4185973333333336,260537.16666666663,0.2318151333333333,0.3448780588666665,0.20571,-17.04866666666665,0.073662,101.53149333333327,0.3793266666666666,0.6533333333333333,2,1920
1,1922,0.9385915492957748,0.4820422535211267,165469.74647887325,0.2378153521126759,0.4341948697183099,0.2407197183098592,-19.275281690140844,0.1166549295774648,100.88452112676056,0.5355492957746479,0.1408450704225352,10,1920
1,1923,0.9572467913513516,0.5773405405405401,177942.36216216214,0.2624064864864865,0.371732725027027,0.2274621621621621,-14.129210810810813,0.0939486486486487,114.0107297297297,0.6254924324324328,5.389189189189189,0,1920
1,1924,0.940199860169493,0.5498940677966102,191046.70762711865,0.3443466101694912,0.5817009136440677,0.2352190677966101,-14.231343220338989,0.0920894067796609,120.68957203389822,0.6637254237288139,0.6610169491525424,10,1920
1,1925,0.9626070503597138,0.5738633093525181,184986.92446043165,0.2785935251798561,0.4182973612230215,0.2376679856115108,-14.14641366906474,0.1119179856115108,115.5219208633093,0.6219287769784171,2.6043165467625897,5,1920
1,1926,0.660817216981134,0.5998802612481859,156881.65747460088,0.2114670907111756,0.3330931111175616,0.2323695936139332,-18.492538461538462,0.4837036284470243,109.64803265602328,0.4369104571843251,1.4223512336719883,9,1920
1,1927,0.9361794552845558,0.6482682926829262,184993.59837398367,0.2643213008130081,0.3913284986504065,0.1684502439024389,-14.422373983739831,0.113609593495935,114.84652357723554,0.6597004878048782,0.8016260162601626,7,1920
1,1928,0.9386165035685952,0.5342878667724027,214827.90642347344,0.2079477954004757,0.4948354801348136,0.1752893735130848,-17.191982553528927,0.1599114988104679,106.77226169706591,0.4957126883425853,1.5257731958762886,1,1920
1,1929,0.6014265861344558,0.6476698529411761,168999.41281512607,0.2418007352941172,0.2152040310609246,0.2360002100840333,-16.530376050420152,0.4900007352941176,110.94835714285716,0.6365298319327733,0.3403361344537815,7,1920
1,1930,0.936714937370057,0.5181758835758836,195150.28534303536,0.3335239189189189,0.3522059281652805,0.2213108627858629,-12.869221413721428,0.1199096673596674,109.87119438669428,0.6162376299376306,0.9267151767151768,2,1930


In [0]:
Df_by_year \
    .groupBy('decade') \
    .agg(f.mean('acousticness').alias('mean_acousticness'),
         f.mean('danceability').alias('mean_danceability'),
         f.mean('duration_ms').alias('mean_duration_ms'),
         f.mean('energy').alias('mean_energy'),
         f.mean('instrumentalness').alias('mean_instrumentalness'),
         f.mean('liveness').alias('mean_liveness'),
         f.mean('loudness').alias('mean_loudness'),
         f.mean('speechiness').alias('mean_speechiness'),
         f.mean('valence').alias('mean_valence'),
         f.mean('popularity').alias('mean_popularity')) \
    .display()

decade,mean_acousticness,mean_danceability,mean_duration_ms,mean_energy,mean_instrumentalness,mean_liveness,mean_loudness,mean_speechiness,mean_valence,mean_popularity
1990,0.3075080874881892,0.5660955727990249,248580.4799912688,0.5861257366457823,0.1097194094431119,0.196551042023223,-10.002534825152129,0.0805385465550868,0.545034417546426,44.193422649791685
1930,0.874072398938074,0.5418884140937485,208295.7623723605,0.2841575777945339,0.2726491375056954,0.2256678114214279,-14.19092997269453,0.1849156608200228,0.5645996179857298,2.7772641757469207
1950,0.8401896509024345,0.4768392989743589,220517.63587307688,0.2867833829230769,0.2476891484893846,0.2091408138461538,-14.730952001282066,0.0935320042307692,0.4785541158203846,10.7228
2020,0.2199308880935964,0.6929043349753701,193728.39753694585,0.6312316354679793,0.0163755243054187,0.1785354187192117,-6.595066995073878,0.1413836945812805,0.5010478078817729,64.30197044334976
1960,0.6263446178567353,0.4946423073370347,211668.9313360052,0.4134146979457628,0.1582295493744963,0.2086816441528107,-12.694151058502138,0.057821232526372,0.5515988656860983,26.44616941961425
1970,0.400161569561,0.5249274100000003,254051.7055,0.5337163684000003,0.1160311129009999,0.216406105,-11.42496555,0.0597639199999999,0.5850979400000005,35.0558
1920,0.8691756681272306,0.5590937531320911,189520.6091647152,0.253390447722291,0.3983627832714221,0.2176542616659591,-16.16313190069438,0.1928331585183144,0.5616528956495046,1.5043097410136013
1980,0.2986903958913592,0.546381672820513,252124.0169153846,0.59466316743718,0.1222215808658974,0.2046381143589744,-11.227398025641026,0.0620278153846154,0.5644021430512822,37.530730769230765
2000,0.2696770128304137,0.5741148147601711,239515.4798909466,0.6516750500136158,0.0837800712790397,0.195597979872152,-7.499271505628014,0.0877201834601508,0.5302409151735066,49.74032968537266
2010,0.2642777013590903,0.5971840941008854,227118.10745224747,0.6287041673483448,0.0876612272699647,0.1894875775322422,-7.517522136832882,0.0988010617513747,0.4560693465651335,57.64469028726741


In [0]:
Df_by_year \
    .groupBy('decade') \
    .agg(f.mean('acousticness').alias('mean_acousticness'),
         f.mean('danceability').alias('mean_danceability'),
         f.mean('duration_ms').alias('mean_duration_ms'),
         f.mean('energy').alias('mean_energy'),
         f.mean('instrumentalness').alias('mean_instrumentalness'),
         f.mean('liveness').alias('mean_liveness'),
         f.mean('loudness').alias('mean_loudness'),
         f.mean('speechiness').alias('mean_speechiness'),
         f.mean('valence').alias('mean_valence'),
         f.mean('popularity').alias('mean_popularity')) \
    .orderBy(f.col('decade').asc()) \
    .display()

decade,mean_acousticness,mean_danceability,mean_duration_ms,mean_energy,mean_instrumentalness,mean_liveness,mean_loudness,mean_speechiness,mean_valence,mean_popularity
1920,0.8691756681272306,0.5590937531320911,189520.6091647152,0.253390447722291,0.3983627832714221,0.2176542616659591,-16.16313190069438,0.1928331585183144,0.5616528956495046,1.5043097410136013
1930,0.874072398938074,0.5418884140937485,208295.7623723605,0.2841575777945339,0.2726491375056954,0.2256678114214279,-14.19092997269453,0.1849156608200228,0.5645996179857298,2.7772641757469207
1940,0.8779975949514116,0.4732805578052879,221154.0150086203,0.2566394435643606,0.3776731719070574,0.2223245862450512,-15.196490387530504,0.1458022490019897,0.4900113948115973,1.865367444064331
1950,0.8401896509024345,0.4768392989743589,220517.63587307688,0.2867833829230769,0.2476891484893846,0.2091408138461538,-14.730952001282066,0.0935320042307692,0.4785541158203846,10.7228
1960,0.6263446178567353,0.4946423073370347,211668.9313360052,0.4134146979457628,0.1582295493744963,0.2086816441528107,-12.694151058502138,0.057821232526372,0.5515988656860983,26.44616941961425
1970,0.400161569561,0.5249274100000003,254051.7055,0.5337163684000003,0.1160311129009999,0.216406105,-11.42496555,0.0597639199999999,0.5850979400000005,35.0558
1980,0.2986903958913592,0.546381672820513,252124.0169153846,0.59466316743718,0.1222215808658974,0.2046381143589744,-11.227398025641026,0.0620278153846154,0.5644021430512822,37.530730769230765
1990,0.3075080874881892,0.5660955727990249,248580.4799912688,0.5861257366457823,0.1097194094431119,0.196551042023223,-10.002534825152129,0.0805385465550868,0.545034417546426,44.193422649791685
2000,0.2696770128304137,0.5741148147601711,239515.4798909466,0.6516750500136158,0.0837800712790397,0.195597979872152,-7.499271505628014,0.0877201834601508,0.5302409151735066,49.74032968537266
2010,0.2642777013590903,0.5971840941008854,227118.10745224747,0.6287041673483448,0.0876612272699647,0.1894875775322422,-7.517522136832882,0.0988010617513747,0.4560693465651335,57.64469028726741
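
`Df_by_year` appears to hold one row per year, so `f.mean` per decade averages the yearly averages and gives each year equal weight, however many songs it contains. The sketch below contrasts that with a song-weighted mean for 1921 to 1929. It assumes the per-year song counts from `Df_data` and the per-year popularity means in `Df_by_year` describe the same songs. Under that assumption the unweighted figure reproduces the 1920 `mean_popularity` of about 1.50 shown above, while the song-weighted figure comes out near 1.30.

```python
# Per-year values for 1921-1929, copied from the tables earlier in this notebook.
songs_per_year = {
    1921: 150, 1922: 71, 1923: 185, 1924: 236, 1925: 278,
    1926: 1378, 1927: 615, 1928: 1261, 1929: 952,
}
mean_popularity_by_year = {
    1921: 0.6533333333333333, 1922: 0.1408450704225352, 1923: 5.389189189189189,
    1924: 0.6610169491525424, 1925: 2.6043165467625897, 1926: 1.4223512336719883,
    1927: 0.8016260162601626, 1928: 1.5257731958762886, 1929: 0.3403361344537815,
}

years = sorted(songs_per_year)

# Mean of yearly means: every year counts equally (what the groupBy above computes).
unweighted = sum(mean_popularity_by_year[y] for y in years) / len(years)

# Song-weighted mean: every song counts equally.
total_songs = sum(songs_per_year[y] for y in years)
weighted = sum(songs_per_year[y] * mean_popularity_by_year[y] for y in years) / total_songs

print(round(unweighted, 4), round(weighted, 4))
```

Which of the two is appropriate depends on the question: trends across years favor the unweighted mean, while statements about a typical song in the decade favor the weighted one.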


Databricks visualization. Run in Databricks to view.

## Answering questions about the data

### Loading files: data_by_genres, data_by_artist, data_w_genres

In [0]:
# File location and file type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_artist'
file_type = 'parquet'

# Reading file
Df_by_artist = spark.read.format(file_type) \
    .load(file_location)

Df_by_artist.limit(10).display()

mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,9,0.5901111111111111,Cats 1981 Original London Cast,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5
1,26,0.8625384615384617,Cats 1983 Broadway Cast,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5
1,7,0.8565714285714285,Fiddler On The Roof Motion Picture Chorus,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0
1,27,0.884925925925926,Fiddler On The Roof Motion Picture Orchestra,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0
1,7,0.5107142857142857,Joseph And The Amazing Technicolor Dreamcoat 1991 London Cast,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5
1,36,0.6095555555555557,Joseph And The Amazing Technicolor Dreamcoat 1992 Canadian Cast,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5
1,2,0.725,Mama Helen Teagarden,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8
1,2,0.927,Test for Victor Young,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10
1,122,0.1731450819672131,Weird Al Yankovic,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9
1,15,0.5444666666666668,SNOT,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1
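
The load cells in this section differ only in the dataset name at the end of the path. A small helper can remove the repetition. This is a sketch: `PROCESSED_DIR` and `load_processed` are illustrative names, and the `spark` session is passed in explicitly, so the function relies only on the `spark.read.format(...).load(...)` call already used in this notebook.

```python
# Folder that holds the preprocessed datasets used in this notebook.
PROCESSED_DIR = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data'


def load_processed(spark, name, file_type='parquet'):
    """Load one dataset from the processed-data folder as a DataFrame.

    `spark` is the active SparkSession; `name` is the dataset folder name,
    for example 'data_by_artist'.
    """
    return spark.read.format(file_type).load(f'{PROCESSED_DIR}/{name}')
```

In a Databricks cell this would be used as `Df_by_genres = load_processed(spark, 'data_by_genres')`.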


In [0]:
# File location and file type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_by_genres'
file_type = 'parquet'

# Reading file
Df_by_genres = spark.read.format(file_type) \
    .load(file_location)

Df_by_genres.limit(10).display()

mode,genres,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,21st century classical,0.9793333333333332,0.1628833333333333,160297.66666666663,0.0713166666666666,0.60683367,0.3616,-31.514333333333337,0.0405666666666666,75.3365,0.1037833333333333,27.83333333333333,6
1,432hz,0.49478,0.2993333333333333,1048887.333333333,0.4506783333333333,0.4777616666666668,0.131,-16.854,0.0768166666666666,120.28566666666666,0.22175,52.5,5
1,8-bit,0.762,0.7120000000000001,115177.0,0.818,0.8759999999999999,0.126,-9.18,0.047,133.444,0.975,48.0,7
1,[],0.6514170195595453,0.5290925603549332,232880.8902503945,0.4191460727353524,0.2053091895111363,0.2186958541504073,-12.288964675489456,0.1078715586868139,112.8573524318416,0.5136042963588958,20.859882191849056,7
1,a cappella,0.676557304985755,0.5389612464387464,190628.5408867521,0.3164335701566952,0.0030034414404202,0.1722541371082621,-12.479387421652426,0.0828514398148148,112.1103620014245,0.4482486545584045,45.82007122507122,7
1,abstract,0.45921,0.5161666666666667,343196.5,0.4424166666666666,0.8496666666666667,0.1180666666666667,-15.472083333333332,0.0465166666666666,127.88575000000002,0.307325,43.5,1
1,abstract beats,0.3421466666666667,0.623,229936.2,0.5277999999999999,0.3336026120000001,0.0996533333333333,-7.918000000000001,0.1163733333333333,112.4138,0.4935066666666666,58.93333333333332,10
1,abstract hip hop,0.2438540633608816,0.6945709366391184,231849.2341597796,0.6462346418732783,0.0242312629201102,0.1685429201101929,-7.349327823691461,0.2142576997245178,108.2449865013774,0.5713909090909091,39.79070247933884,2
0,accordeon,0.3229999999999999,0.588,164000.0,0.392,0.441,0.0794,-14.899,0.0727,109.131,0.7090000000000001,39.0,2
1,accordion,0.446125,0.6248125,167061.5625,0.3734375,0.19373839375,0.1603,-14.4870625,0.0785375,112.8724375,0.6586875000000001,21.9375,2


In [0]:
# File location and file type
file_location = '/FileStore/tables/databricks-classes/Recommendation-System-Music/processed-data/data_w_genres'
file_type = 'parquet'

# Reading file
Df_w_genres = spark.read.format(file_type) \
    .load(file_location)

Df_w_genres.limit(10).display()

genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
['show tunes'],Cats 1981 Original London Cast,0.5901111111111111,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5,1,9
[],Cats 1983 Broadway Cast,0.8625384615384617,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5,1,26
[],Fiddler On The Roof Motion Picture Chorus,0.8565714285714285,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0,1,7
[],Fiddler On The Roof Motion Picture Orchestra,0.884925925925926,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0,1,27
[],Joseph And The Amazing Technicolor Dreamcoat 1991 London Cast,0.5107142857142857,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5,1,7
[],Joseph And The Amazing Technicolor Dreamcoat 1992 Canadian Cast,0.6095555555555557,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5,1,36
[],Mama Helen Teagarden,0.725,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8,1,2
[],Test for Victor Young,0.927,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10,1,2
"['comedy rock', 'comic', 'parody']",Weird Al Yankovic,0.1731450819672131,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9,1,122
"['emo rap', 'florida rap', 'sad rap', 'underground hip hop', 'vapor trap']",SNOT,0.5444666666666668,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1,1,15


## Questions:
* 1 - Who are the top 10 artists?

* 2 - What is the musical genre of the top 10 artists?

* 3 - What are the top 10 genres?

* 4 - Which artists are in the top 10 genres?

## Who are the top 10 artists?

In [0]:
Df_by_artist.limit(10).display()

mode,count,acousticness,artists,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key
1,9,0.5901111111111111,Cats 1981 Original London Cast,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5
1,26,0.8625384615384617,Cats 1983 Broadway Cast,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5
1,7,0.8565714285714285,Fiddler On The Roof Motion Picture Chorus,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0
1,27,0.884925925925926,Fiddler On The Roof Motion Picture Orchestra,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0
1,7,0.5107142857142857,Joseph And The Amazing Technicolor Dreamcoat 1991 London Cast,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5
1,36,0.6095555555555557,Joseph And The Amazing Technicolor Dreamcoat 1992 Canadian Cast,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5
1,2,0.725,Mama Helen Teagarden,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8
1,2,0.927,Test for Victor Young,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10
1,122,0.1731450819672131,Weird Al Yankovic,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9
1,15,0.5444666666666668,SNOT,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1


In [0]:
# Sum the song counts per artist and keep the 10 artists with the most songs
top_10_artists = Df_by_artist \
    .groupBy('artists') \
    .agg(f.sum('count').alias('total_songs')) \
    .orderBy(f.col('total_songs').desc()) \
    .limit(10)

In [0]:
top_10_artists.display()

artists,total_songs
Francisco Canaro,3169
Эрнест Хемингуэй,2422
Эрих Мария Ремарк,2136
Frank Sinatra,1459
Ignacio Corsini,1256
Vladimir Horowitz,1200
Arturo Toscanini,1146
Billie Holiday,1103
Johnny Cash,1061
Elvis Presley,1023
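
The aggregation above follows a group, sum, sort, and limit pattern. As a minimal sketch of that same logic in plain Python (not Spark), the example below uses a few `(artists, count)` pairs copied from the `Df_by_artist` preview; the function name `top_n_artists` is hypothetical and only serves the illustration.

```python
from collections import defaultdict

# (artists, count) pairs copied from the Df_by_artist preview above
rows = [
    ("Cats 1981 Original London Cast", 9),
    ("Cats 1983 Broadway Cast", 26),
    ("Fiddler On The Roof Motion Picture Chorus", 7),
    ("Fiddler On The Roof Motion Picture Orchestra", 27),
    ("Weird Al Yankovic", 122),
    ("SNOT", 15),
]


def top_n_artists(rows, n):
    """Hypothetical helper: total the counts per artist and return the n largest."""
    totals = defaultdict(int)
    for artist, count in rows:
        totals[artist] += count  # mirrors groupBy('artists') + sum('count')
    # mirrors orderBy(total_songs desc)
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]  # mirrors limit(n)


top_3 = top_n_artists(rows, 3)
print(top_3)
```

On the full dataset Spark does the same work in a distributed way, so only the final 10 rows need to reach the driver.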


## Visualizing the top 10 Artists

In [0]:
top_10_artists.display()

Databricks visualization. Run in Databricks to view.

## What is the musical genre of the top 10 artists?

In [0]:
Df_w_genres.limit(10).display()

genres,artists,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,popularity,key,mode,count
['show tunes'],Cats 1981 Original London Cast,0.5901111111111111,0.4672222222222222,250318.5555555556,0.3940033333333333,0.0113998511111111,0.2908333333333333,-14.448,0.2103888888888888,117.51811111111112,0.3895,38.333333333333336,5,1,9
[],Cats 1983 Broadway Cast,0.8625384615384617,0.4417307692307693,287280.0,0.4068076923076923,0.0811582642307692,0.3152153846153846,-10.69,0.1762115384615384,103.04415384615385,0.2688653846153846,30.57692307692308,5,1,26
[],Fiddler On The Roof Motion Picture Chorus,0.8565714285714285,0.3482857142857142,328920.0,0.2865714285714285,0.0245929485714285,0.3257857142857143,-15.230714285714283,0.1185142857142857,77.37585714285714,0.3548571428571429,34.857142857142854,0,1,7
[],Fiddler On The Roof Motion Picture Orchestra,0.884925925925926,0.4250740740740739,262890.96296296304,0.2457703703703704,0.0735872792592592,0.2754814814814815,-15.639370370370369,0.1232,88.66762962962959,0.3720296296296296,34.85185185185185,0,1,27
[],Joseph And The Amazing Technicolor Dreamcoat 1991 London Cast,0.5107142857142857,0.4671428571428572,270436.14285714284,0.4882857142857143,0.0094002914285714,0.195,-10.236714285714289,0.0985428571428571,122.83585714285714,0.4822857142857143,43.0,5,1,7
[],Joseph And The Amazing Technicolor Dreamcoat 1992 Canadian Cast,0.6095555555555557,0.4872777777777778,205091.9444444445,0.3099055555555556,0.0046956566666666,0.2747666666666666,-18.26638888888889,0.0980222222222221,118.64894444444442,0.4415555555555557,32.77777777777778,5,1,36
[],Mama Helen Teagarden,0.725,0.637,135533.0,0.512,0.186,0.426,-20.615,0.21,134.819,0.885,0.0,8,1,2
[],Test for Victor Young,0.927,0.7340000000000001,175693.0,0.474,0.0762,0.737,-10.544,0.256,132.78799999999998,0.902,3.0,10,1,2
"['comedy rock', 'comic', 'parody']",Weird Al Yankovic,0.1731450819672131,0.6627868852459013,218948.19672131148,0.6953934426229511,4.980262295081966e-05,0.1611016393442622,-9.768704918032787,0.0845360655737704,133.03118032786878,0.7513442622950824,34.22950819672131,9,1,122
"['emo rap', 'florida rap', 'sad rap', 'underground hip hop', 'vapor trap']",SNOT,0.5444666666666668,0.7898,137910.46666666667,0.5329333333333333,0.0230625833333333,0.1803,-9.14926666666667,0.2936866666666666,112.34479999999998,0.4807,67.53333333333333,1,1,15


In [0]:
top_10_artists.select('artists').display()

artists
Francisco Canaro
Эрнест Хемингуэй
Эрих Мария Ремарк
Frank Sinatra
Ignacio Corsini
Vladimir Horowitz
Arturo Toscanini
Billie Holiday
Johnny Cash
Elvis Presley


In [0]:
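# Collect the top 10 artist names to the driver, then keep only their rows
top_10_artist_names = [row['artists'] for row in top_10_artists.select('artists').collect()]

top_10_artists_genres = Df_w_genres \
    .filter(f.col('artists').isin(top_10_artist_names))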

In [0]:
top_10_artists_genres \
    .select(
        'artists',
        'genres'
    ) \
    .display()

artists,genres
Arturo Toscanini,"['classical performance', 'historic orchestral performance', 'orchestral performance']"
Billie Holiday,"['adult standards', 'harlem renaissance', 'jazz blues', 'lounge', 'soul', 'torch song', 'vocal jazz']"
Elvis Presley,"['rock-and-roll', 'rockabilly']"
Francisco Canaro,"['tango', 'vintage tango']"
Frank Sinatra,"['adult standards', 'easy listening', 'lounge']"
Ignacio Corsini,"['tango', 'vintage tango']"
Johnny Cash,"['arkansas country', 'outlaw country']"
Vladimir Horowitz,"['classical', 'classical performance', 'classical piano', 'russian classical piano']"
Эрих Мария Ремарк,[]
Эрнест Хемингуэй,[]


## What are the top 10 genres?

In [0]:
Df_w_genres \
    .groupBy('genres') \
    .agg((f.count('genres')).alias('amount')) \
    .orderBy(f.col('amount').desc()) \
    .limit(10) \
    .display()

genres,amount
[],9857
['movie tunes'],69
['show tunes'],63
['hollywood'],56
['orchestral performance'],50
"['broadway', 'hollywood', 'show tunes']",48
"['disney', 'movie tunes']",45
['sleep'],42
['gospel'],41
"['contemporary country', 'country', 'country road', 'modern country rock']",41


In [0]:
# 'genres' is stored as a string, so '[]' marks artists with no genre; drop those before ranking
top_10_genres = Df_w_genres \
    .filter(f.col('genres') != '[]') \
    .groupBy('genres') \
    .agg(f.count('genres').alias('amount')) \
    .orderBy(f.col('amount').desc()) \
    .limit(10)

In [0]:
top_10_genres.display()

genres,amount
['movie tunes'],69
['show tunes'],63
['hollywood'],56
['orchestral performance'],50
"['broadway', 'hollywood', 'show tunes']",48
"['disney', 'movie tunes']",45
['sleep'],42
['gospel'],41
"['contemporary country', 'country', 'country road', 'modern country rock']",41
['classical soprano'],40
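
Because the grouping above works on the whole `genres` value, it ranks genre combinations: `['show tunes']` and `['broadway', 'hollywood', 'show tunes']` are counted as two separate entries. The sketch below shows one way to count individual genres instead. It is plain Python (not Spark) and assumes `genres` holds a stringified Python list, as the comparison with `'[]'` suggests. The sample rows are copied from outputs in this notebook, and the function name `count_individual_genres` is hypothetical.

```python
import ast
from collections import Counter

# (genres, artists) pairs copied from rows shown in this notebook
rows = [
    ("['show tunes']", "Cats 1981 Original London Cast"),
    ("[]", "Cats 1983 Broadway Cast"),
    ("['broadway', 'hollywood', 'show tunes']", "Legally Blonde Ensemble"),
    ("['show tunes']", "Legally Blonde Greek Chorus"),
    ("['movie tunes']", "Adriana Caselotti"),
    ("['comedy rock', 'comic', 'parody']", "Weird Al Yankovic"),
]


def count_individual_genres(rows):
    """Hypothetical helper: count how many artists carry each single genre."""
    counts = Counter()
    for genres_str, _artist in rows:
        # Assumption: the column is a stringified list such as "['a', 'b']"
        genres = ast.literal_eval(genres_str)
        counts.update(genres)  # an empty list adds nothing
    return counts


genre_counts = count_individual_genres(rows)
print(genre_counts.most_common(1))
```

In this sample, `show tunes` is counted three times, although it appears in two different combinations.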


## Which artists are in the top 10 genres?

In [0]:
genres_list = [row['genres'] for row in top_10_genres.select('genres').collect()]
genres_list

["['movie tunes']",
 "['show tunes']",
 "['hollywood']",
 "['orchestral performance']",
 "['broadway', 'hollywood', 'show tunes']",
 "['disney', 'movie tunes']",
 "['sleep']",
 "['gospel']",
 "['contemporary country', 'country', 'country road', 'modern country rock']",
 "['classical soprano']"]

In [0]:
Df_w_genres \
    .select('genres', 'artists') \
    .filter(f.col('genres').isin(genres_list)) \
    .limit(10) \
    .display()

genres,artists
['show tunes'],Cats 1981 Original London Cast
"['broadway', 'hollywood', 'show tunes']",Legally Blonde Ensemble
['show tunes'],Legally Blonde Greek Chorus
"['broadway', 'hollywood', 'show tunes']",Aaron Tveit
"['broadway', 'hollywood', 'show tunes']",Adam Pascal
['movie tunes'],Adriana Caselotti
['classical soprano'],Adrianne Pieczonka
['classical soprano'],Agnes Giebel
['gospel'],Albertina Walker
"['broadway', 'hollywood', 'show tunes']",Alex Brightman


In [0]:
top_10_genres_artists = Df_w_genres \
    .select('genres', 'artists') \
    .filter(f.col('genres').isin(genres_list)) \
    .limit(10)

In [0]:
top_10_genres_artists.display()

genres,artists
['show tunes'],Cats 1981 Original London Cast
"['broadway', 'hollywood', 'show tunes']",Legally Blonde Ensemble
['show tunes'],Legally Blonde Greek Chorus
"['broadway', 'hollywood', 'show tunes']",Aaron Tveit
"['broadway', 'hollywood', 'show tunes']",Adam Pascal
['movie tunes'],Adriana Caselotti
['classical soprano'],Adrianne Pieczonka
['classical soprano'],Agnes Giebel
['gospel'],Albertina Walker
"['broadway', 'hollywood', 'show tunes']",Alex Brightman
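
The result above lists one row per artist. To read it as "which artists belong to each top genre", the rows can be grouped by their `genres` value. The following is a minimal plain Python sketch (not Spark) using the ten rows shown above; the function name `artists_by_genre` is hypothetical.

```python
from collections import defaultdict

# (genres, artists) pairs copied from the output above
rows = [
    ("['show tunes']", "Cats 1981 Original London Cast"),
    ("['broadway', 'hollywood', 'show tunes']", "Legally Blonde Ensemble"),
    ("['show tunes']", "Legally Blonde Greek Chorus"),
    ("['broadway', 'hollywood', 'show tunes']", "Aaron Tveit"),
    ("['broadway', 'hollywood', 'show tunes']", "Adam Pascal"),
    ("['movie tunes']", "Adriana Caselotti"),
    ("['classical soprano']", "Adrianne Pieczonka"),
    ("['classical soprano']", "Agnes Giebel"),
    ("['gospel']", "Albertina Walker"),
    ("['broadway', 'hollywood', 'show tunes']", "Alex Brightman"),
]


def artists_by_genre(rows):
    """Hypothetical helper: map each genres value to the artists that carry it."""
    grouped = defaultdict(list)
    for genres, artist in rows:
        grouped[genres].append(artist)
    return dict(grouped)


grouped = artists_by_genre(rows)
for genres, artists in grouped.items():
    print(f"{genres}: {', '.join(artists)}")
```

Note that these ten rows come from `.limit(10)`, so they are only a sample of the artists in the top 10 genres, not the complete list.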
