# Reading and Writing Data with Spark

The data set is read in from a local file.
First, let's import SparkConf and SparkSession.

In [1]:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

Since we're using Spark locally, we already have both a SparkContext and a SparkSession running. We can update some of the parameters, such as our application's name. Let's just call it "Songs".

In [2]:
spark=SparkSession\
    .builder\
    .appName('Songs')\
    .getOrCreate()

Let's check if the change went through.

In [12]:
spark.sparkContext.getConf().getAll()

<pyspark.conf.SparkConf at 0x1e6fdbe2070>

In [4]:
spark

We can see the app name is exactly how we set it.

Let's create our first dataframe from a fairly small sample data set. Throughout the file we'll work with a tracks data set from a music streaming service. Each record describes one track: its name, popularity, duration, artists, release date, and audio features such as danceability, energy, and tempo.

In [5]:
path = "data/tracks.csv"
user_log = spark.read.csv(path)

In [6]:
user_log.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)



In [7]:
user_log.describe()

DataFrame[summary: string, _c0: string, _c1: string, _c2: string, _c3: string, _c4: string, _c5: string, _c6: string, _c7: string, _c8: string, _c9: string, _c10: string, _c11: string, _c12: string, _c13: string, _c14: string, _c15: string, _c16: string, _c17: string, _c18: string, _c19: string]

In [8]:
user_log.show(n=2,vertical=True)

-RECORD 0--------------------
 _c0  | id                   
 _c1  | name                 
 _c2  | popularity           
 _c3  | duration_ms          
 _c4  | explicit             
 _c5  | artists              
 _c6  | id_artists           
 _c7  | release_date         
 _c8  | danceability         
 _c9  | energy               
 _c10 | key                  
 _c11 | loudness             
 _c12 | mode                 
 _c13 | speechiness          
 _c14 | acousticness         
 _c15 | instrumentalness     
 _c16 | liveness             
 _c17 | valence              
 _c18 | tempo                
 _c19 | time_signature       
-RECORD 1--------------------
 _c0  | 35iwgR4jXetI318WE... 
 _c1  | Carve                
 _c2  | 6                    
 _c3  | 126903               
 _c4  | 0                    
 _c5  | ['Uli']              
 _c6  | ['45tIt06XoI0Iio4... 
 _c7  | 1922-02-22           
 _c8  | 0.645                
 _c9  | 0.445                
 _c10 | 0                    
 _c11 | -1

In [9]:
user_log.take(5)

[Row(_c0='id', _c1='name', _c2='popularity', _c3='duration_ms', _c4='explicit', _c5='artists', _c6='id_artists', _c7='release_date', _c8='danceability', _c9='energy', _c10='key', _c11='loudness', _c12='mode', _c13='speechiness', _c14='acousticness', _c15='instrumentalness', _c16='liveness', _c17='valence', _c18='tempo', _c19='time_signature'),
 Row(_c0='35iwgR4jXetI318WEWsa1Q', _c1='Carve', _c2='6', _c3='126903', _c4='0', _c5="['Uli']", _c6="['45tIt06XoI0Iio4LBEVpls']", _c7='1922-02-22', _c8='0.645', _c9='0.445', _c10='0', _c11='-13.338', _c12='1', _c13='0.451', _c14='0.674', _c15='0.744', _c16='0.151', _c17='0.127', _c18='104.851', _c19='3'),
 Row(_c0='021ht4sdgPcrDgSk7JTbKY', _c1='Capítulo 2.16 - Banquero Anarquista', _c2='0', _c3='98200', _c4='0', _c5="['Fernando Pessoa']", _c6="['14jtPCOoNZwquk5wd9DxrY']", _c7='1922-06-01', _c8='0.695', _c9='0.263', _c10='0', _c11='-22.136', _c12='1', _c13='0.957', _c14='0.797', _c15='0.0', _c16='0.148', _c17='0.655', _c18='102.009', _c19='1'),
 Ro

In [10]:
out_path = "data/spotify1"
#user_log.write.save(out_path, mode='overwrite',format="csv", header=True)

In [11]:
user_log.select("_c5").show()

+-------------------+
|                _c5|
+-------------------+
|            artists|
|            ['Uli']|
|['Fernando Pessoa']|
|['Ignacio Corsini']|
|['Ignacio Corsini']|
|    ['Dick Haymes']|
|    ['Dick Haymes']|
|  ['Francis Marty']|
|    ['Mistinguett']|
|    ['Greg Fieler']|
|['Ignacio Corsini']|
|['Fernando Pessoa']|
|['Fernando Pessoa']|
|            ['Uli']|
|   ['Lucien Boyer']|
|    ['Félix Mayol']|
|['Fernando Pessoa']|
|['Fernando Pessoa']|
|['Fernando Pessoa']|
| ['Victor Boucher']|
+-------------------+
only showing top 20 rows

