# Data wrangling with Spark SQL

This notebook explores some Spark SQL methods. To do so, we work with the `sparkify_log_small.json` data, which contains logs from a digital music service.

In [2]:
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession

In [4]:
# Create Spark session and name it
spark = SparkSession\
        .builder\
        .appName("Using Spark SQL in Python")\
        .getOrCreate()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/08/10 19:48:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


## Read data using DataFrames


In [5]:
# set data path
data_path = '../../data/sparkify_log_small.json'

In [8]:
log_songs = spark.read.json(data_path)
# show the columns of the df
log_songs.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: long (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: double (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: long (nullable = true)
 |-- sessionId: long (nullable = true)
 |-- song: string (nullable = true)
 |-- status: long (nullable = true)
 |-- ts: long (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



## View a sample of data

### Using SQL-like methods

In [26]:
log_songs.select(["artist", "song", 'length']).show(n = 5)

+--------------------+--------------------+---------+
|              artist|                song|   length|
+--------------------+--------------------+---------+
|       Showaddywaddy|Christmas Tears W...|232.93342|
|          Lily Allen|       Cheryl Tweedy|195.23873|
|Cobra Starship Fe...|Good Girls Go Bad...|196.20526|
|          Alex Smoke| Don't See The Point|405.99465|
|                NULL|                NULL|     NULL|
+--------------------+--------------------+---------+
only showing top 5 rows



### Using temp view

In [28]:
log_songs.createOrReplaceTempView("some_songs")
spark.sql("SELECT artist, song, length FROM some_songs LIMIT 5").show()

+--------------------+--------------------+---------+
|              artist|                song|   length|
+--------------------+--------------------+---------+
|       Showaddywaddy|Christmas Tears W...|232.93342|
|          Lily Allen|       Cheryl Tweedy|195.23873|
|Cobra Starship Fe...|Good Girls Go Bad...|196.20526|
|          Alex Smoke| Don't See The Point|405.99465|
|                NULL|                NULL|     NULL|
+--------------------+--------------------+---------+



### Using `take` method

Another way to view a sample of the data is the `take` method, which **returns the first `num` rows as a list of `Row` objects.**

In [30]:
print(log_songs.take(5))

[Row(artist='Showaddywaddy', auth='Logged In', firstName='Kenneth', gender='M', itemInSession=112, lastName='Matthews', length=232.93342, level='paid', location='Charlotte-Concord-Gastonia, NC-SC', method='PUT', page='NextSong', registration=1509380319284, sessionId=5132, song='Christmas Tears Will Fall', status=200, ts=1513720872284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"', userId='1046'), Row(artist='Lily Allen', auth='Logged In', firstName='Elizabeth', gender='F', itemInSession=7, lastName='Chase', length=195.23873, level='free', location='Shreveport-Bossier City, LA', method='PUT', page='NextSong', registration=1512718541284, sessionId=5027, song='Cheryl Tweedy', status=200, ts=1513720878284, userAgent='"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='1000'), Row(artist='Cobra Starship Featuring Leighton Meester', auth='Logged In', firstName

## `describe()` method

Now we will compute summary statistics for a numeric column, `length`.

In [22]:
# show() prints the table itself and returns None, so it should not be wrapped in print()
log_songs.describe(['length']).show()

+-------+-----------------+
|summary|           length|
+-------+-----------------+
|  count|             8347|
|   mean|249.6486587492503|
| stddev|95.00437130781461|
|    min|          1.12281|
|    max|        1806.8371|
+-------+-----------------+


## Writing data from DataFrames

Spark can write DataFrames to many formats. In this example, we write the DataFrame loaded from the JSON file out as CSV, essentially converting JSON to CSV. Note that Spark writes a directory of part files at the given path rather than a single CSV file.

In [15]:
out_path = "../../data/sparkify_log_small.csv"

In [32]:
log_songs.write.mode('overwrite').save(out_path, format = 'csv', header = True)



## Reading CSV

In [41]:
log_song_csv_reading = spark.read.option('header', True).csv(out_path)

In [42]:
log_song_csv_reading.printSchema()

root
 |-- artist: string (nullable = true)
 |-- auth: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- itemInSession: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- length: string (nullable = true)
 |-- level: string (nullable = true)
 |-- location: string (nullable = true)
 |-- method: string (nullable = true)
 |-- page: string (nullable = true)
 |-- registration: string (nullable = true)
 |-- sessionId: string (nullable = true)
 |-- song: string (nullable = true)
 |-- status: string (nullable = true)
 |-- ts: string (nullable = true)
 |-- userAgent: string (nullable = true)
 |-- userId: string (nullable = true)



In [43]:
log_song_csv_reading.show(n = 5)

+--------------------+---------+---------+------+-------------+---------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|              artist|     auth|firstName|gender|itemInSession| lastName|   length|level|            location|method|    page| registration|sessionId|                song|status|           ts|           userAgent|userId|
+--------------------+---------+---------+------+-------------+---------+---------+-----+--------------------+------+--------+-------------+---------+--------------------+------+-------------+--------------------+------+
|       Showaddywaddy|Logged In|  Kenneth|     M|          112| Matthews|232.93342| paid|Charlotte-Concord...|   PUT|NextSong|1509380319284|     5132|Christmas Tears W...|   200|1513720872284|"Mozilla/5.0 (Win...|  1046|
|          Lily Allen|Logged In|Elizabeth|     F|            7|    Chase|195.23873| free|Shreveport-Bossie...|   PUT