MusicBox

Step 1. Installing Spark

Source: https://medium.com/@GalarnykMichael/install-spark-on-mac-pyspark-453f395f240b

Step 2. Loading Raw Data with PySpark DataFrame

I have to say that I find the Spark DataFrame much less handy than the pandas DataFrame. To skip malformed lines, I first parse and filter the raw lines as an RDD and then convert the RDD to a DataFrame.

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, FloatType, StringType

# `lines` is assumed to be an RDD of raw, tab-separated log lines
# (for example, one created with spark.sparkContext.textFile).

def parseLine(line):
    """Parse one log line into a 10-field Row; return Row(None) if it is malformed."""
    fields = line.split('\t')
    if len(fields) == 10:
        try:
            uid = float(fields[0])
            device = str(fields[1])
            song_id = str(fields[2])
            song_type = float(fields[3])
            song_name = str(fields[4])
            singer = str(fields[5])
            play_time = str(fields[6])
            song_length = float(fields[7])
            paid_flag = float(fields[8])
            fn = str(fields[9])
            return Row(uid, device, song_id, song_type, song_name, singer, play_time, song_length, paid_flag, fn)
        except ValueError:
            # A numeric field could not be converted: treat the line as malformed.
            return Row(None)
    else:
        # Wrong number of fields: treat the line as malformed.
        return Row(None)


schema = StructType([StructField('uid', FloatType(), False),
                     StructField('device', StringType(), True),
                     StructField('song_id', StringType(), False),
                     StructField('song_type', FloatType(), True),
                     StructField('song_name', StringType(), True),
                     StructField('singer', StringType(), True),
                     StructField('play_time', StringType(), False),
                     StructField('song_length', FloatType(), True),
                     StructField('paid_flag', FloatType(), True),
                     StructField('fn', StringType(), True)])

# Malformed lines come back as a 1-field Row, so the length check drops them.
songs = lines.map(parseLine).filter(lambda x: len(x) == len(schema))
# Convert the filtered RDD to a DataFrame and cache it for reuse
songDataset = spark.createDataFrame(songs, schema).cache()
songDataset.show()

Here is the result:

+------------+------+---------+---------+--------------------+--------------------+---------+-----------+---------+------------------+
|         uid|device|  song_id|song_type|           song_name|              singer|play_time|song_length|paid_flag|                fn|
+------------+------+---------+---------+--------------------+--------------------+---------+-----------+---------+------------------+
|1.54422688E8|   ar |20870993 |      1.0|                 用情 |              狮子合唱团 |   22013 |      332.0|      0.0| 20170301_play.log|
|1.54421904E8|   ip | 6560858 |      0.0|             表情不要悲伤 |    伯贤&D.O.&张艺兴&朴灿烈 |      96 |      161.0|      0.0| 20170301_play.log|
|1.54422624E8|   ar | 3385963 |      1.0|Baby, Don't Cry(人...|                EXO |  235868 |      235.0|      0.0| 20170301_play.log|
|1.54410272E8|   ar | 6777172 |      0.0|   3D-环绕音律1(3D Mix) |             McTaiM |     164 |      237.0|      0.0| 20170301_play.log|
|1.54407792E8|   ar |19472465 |      0.0|              刚好遇见你 |                曲肖冰 |      24 |      201.0|      0.0| 20170301_play.log|
|1.54422624E8|   ar | 3198036 |      1.0|              只唱给你听 |            SpeXial |  275249 |        0.0|      0.0| 20170301_play.log|
|1.54422688E8|   ar |  891952 |      0.0|   老男孩-(电影《老男孩》主题曲) |               筷子兄弟 |     300 |      300.0|      0.0| 20170301_play.log|
+------------+------+---------+---------+--------------------+--------------------+---------+-----------+---------+------------------+
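
The uid values above print in scientific notation because the schema stores them as FloatType, which is a 32-bit float. A 32-bit float has a 24-bit significand, so integers above 2^24 (16,777,216) may be rounded, and distinct user IDs could end up sharing one stored value. The sketch below reproduces that rounding with Python's struct module, without Spark. The uid 154422682 is a made-up value chosen for illustration, not one taken from the data. If exact IDs matter, declaring uid as LongType and parsing it with int might be one way to avoid the rounding; that is a suggestion, not what the code above does.

```python
import struct

def to_float32(value):
    """Round a Python float to the nearest 32-bit float, as 32-bit storage would."""
    return struct.unpack('f', struct.pack('f', value))[0]

# Hypothetical uid, chosen only to illustrate the rounding.
uid = 154422682
stored = to_float32(float(uid))

print(int(stored))         # the value after a round trip through float32
print(int(stored) == uid)  # False: the original ID is not recoverable
```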

Check some statistics of the data:

songDataset.describe().show()

+-------+--------------------+---------+--------------------+-------------------+---------+--------------------+-------------------+------------------+---------+--------------------+
|summary|                 uid|   device|             song_id|          song_type|song_name|              singer|          play_time|       song_length|paid_flag|                  fn|
+-------+--------------------+---------+--------------------+-------------------+---------+--------------------+-------------------+------------------+---------+--------------------+
|  count|           164264529|164264529|           164264529|          164264529|164264529|           164264529|          164264529|         164264529|164264529|           164264529|
|   mean|1.3238275802376163E8|     null|1.233773654943951...|0.14990355586749954| Infinity|2.069222247928341...| 204343.58310764717|-66.93764578220485|      0.0|                null|
| stddev|6.4977108791913636E7|     null|3.724137957677398...| 0.3858627542831314|      NaN|1.842783835874818...|5.244159580504907E8|1066716.9038908319|      0.0|                null|
|    min|                 0.0|         |                    |                0.0|         |                    |                   |     -2.14748365E9|      0.0|   20170301_play.log|
|    max|         1.6926232E8|       wp|             9999985|                3.0|        🙄|               😝😝😝😝😝|               nan |      1.34396621E9|      0.0| 20170512_3_play.log|
+-------+--------------------+---------+--------------------+-------------------+---------+--------------------+-------------------+------------------+---------+--------------------+
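
The summary above hints at a few data-quality problems: play_time is stored as a string whose minimum is blank and whose maximum is the literal text nan, and song_length has a large negative minimum. A minimal sketch of a validity check follows, written in plain Python so it can be tried without a Spark session. The function names, and the rule that a usable duration must be finite and non-negative, are assumptions made for illustration and are not part of the original pipeline.

```python
import math

def parse_play_time(raw):
    """Return play_time as a float, or None if it is blank, non-numeric, NaN, infinite, or negative."""
    try:
        value = float(raw.strip())
    except (AttributeError, ValueError):
        return None
    if not math.isfinite(value) or value < 0:
        return None
    return value

def is_valid_record(play_time_raw, song_length):
    """Assumed rule: keep a record only if both durations are finite and non-negative."""
    if parse_play_time(play_time_raw) is None:
        return False
    return song_length is not None and math.isfinite(song_length) and song_length >= 0

print(is_valid_record("22013 ", 332.0))         # well-formed values
print(is_valid_record("nan ", 235.0))           # play_time is the text "nan"
print(is_valid_record("300 ", -2.14748365e9))   # negative song_length
```

A predicate like this could be applied with filter on the parsed RDD before the DataFrame is created.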

songDataset.groupBy('uid').count().orderBy('count', ascending = False).show(truncate=False)

+------------+-------+
|uid         |count  |
+------------+-------+
|1685126.0   |8124398|
|3.7025504E7 |5903930|
|751824.0    |4554232|
|1791497.0   |3376118|
|497685.0    |3031361|
|1062806.0   |2354592|
|736305.0    |1848836|
|0.0         |1201159|
|1749320.0   |835164 |
|4.6532272E7 |500025 |
|1679121.0   |488577 |
|2.8638488E7 |469655 |
|637650.0    |243074 |
|1.5594824E8 |217992 |
|533817.0    |173401 |
|3.2166204E7 |156643 |
|6.4268008E7 |150171 |
|2.6036032E7 |114145 |
|3.2104144E7 |99175  |
|1.67982848E8|82687  |
+------------+-------+
only showing top 20 rows
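
The counts above are heavily skewed: a few uid values account for millions of plays each, and uid 0.0 appears more than a million times, which may indicate a placeholder ID rather than a real user. One possible follow-up is to flag such accounts before any per-user analysis. The sketch below uses plain Python with collections.Counter on a small made-up list. The function name and the max_plays threshold are hypothetical, and a real threshold would need to be chosen from the actual distribution.

```python
from collections import Counter

def flag_heavy_users(uids, max_plays):
    """Return the set of uids that appear more than max_plays times."""
    counts = Counter(uids)
    return {uid for uid, n in counts.items() if n > max_plays}

# Made-up play log with one entry per play event.
sample_uids = [1685126, 1685126, 1685126, 751824, 751824, 497685]
print(flag_heavy_users(sample_uids, max_plays=2))
```

On the full DataFrame, the same idea would correspond to filtering the result of the groupBy('uid').count() call above on its count column.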
