# Loading data into RDDs and DataFrames


Most of the tasks in this section will be done in the notebook; some will be done on the cluster. You can find the necessary files and submission scripts here:

```bash
cd 2_LoadingData
```

By default, Spark will not overwrite an existing output folder. For `RDD`s, one can pass a config parameter to change that, `.config("spark.hadoop.validateOutputSpecs", "false")`; for `DataFrame`s we will explicitly set the write mode.

First, initialize the Spark session and Spark context:

In [7]:
import pyspark
try:
    sc
except NameError:    
    spark = pyspark.sql.SparkSession.builder.master("local[*]").appName("BD course").config("spark.hadoop.validateOutputSpecs", "false").getOrCreate()
    sc = spark.sparkContext

## Load unstructured data

One of the first things we need to learn is how to read data into Spark RDDs and DataFrames. Spark provides a batteries-included API for reading structured data in most common formats (CSV, JSON, Parquet) as well as unstructured data (plain text files, server logs, etc.).

### Reading plain text

`textFile` and `wholeTextFiles` are functions to read in plain unstructured text.

1. `textFile` reads data line by line, creating an RDD where each entry corresponds to a line (similar to `readlines()` in Python)
1. `wholeTextFiles` reads each file as a whole into a pair RDD: (file path, content of the whole file as a string)
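
To make the difference in record shape concrete, here is a plain-Python analogy that needs no Spark. The two small files are hypothetical stand-ins for the course data, and the loops only mimic what the two readers return; they are not how Spark implements them.

```python
import os
import tempfile

# Two tiny hypothetical files standing in for ./data/unstructured/
folder = tempfile.mkdtemp()
for name, text in [("a.txt", "first line\nsecond line\n"), ("b.txt", "third line\n")]:
    with open(os.path.join(folder, name), "w") as f:
        f.write(text)

paths = sorted(os.path.join(folder, name) for name in os.listdir(folder))

# textFile-like view: one record per line, file boundaries are lost
line_records = []
for path in paths:
    with open(path) as f:
        line_records.extend(f.read().splitlines())

# wholeTextFiles-like view: one (path, content) record per file
file_records = []
for path in paths:
    with open(path) as f:
        file_records.append((path, f.read()))

print(len(line_records), len(file_records))
```

The first view suits per-line processing such as log parsing; the second is useful when the file name or the whole document matters, at the cost of holding each file in memory as a single string.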


The following code demonstrates both functions on a word count example.

In [8]:
#from pyspark import SparkContext
import sys
import time
import os

def main1(args):
    start = time.time()
    #sc = SparkContext(appName="LoadUnstructured")

    #By default the path is resolved against the default file system (HDFS on a cluster),
    #but prefixing it with "file://" makes Spark search the local file system.
    #You can specify a folder, pass a comma-separated list of paths, or use wildcards
    input_rdd = sc.textFile("./data/unstructured/")

    counts = input_rdd.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

    print ("\nTaking the 10 most frequent words in the text and corresponding frequencies:")
    print (counts.takeOrdered(10, key=lambda x: -x[1]))

    counts.map(lambda x: (x[1],x[0])).sortByKey(0).map(lambda x: (x[1],x[0])).repartition(1).saveAsTextFile("./output_loadunstructured1/")
 
    end = time.time()
    print ("Elapsed time: ", (end-start))

In [9]:
# Try the record-per-line-input
main1(sys.argv)


Taking the 10 most frequent words in the text and corresponding frequencies:
[('the', 22635), ('of', 11167), ('and', 11086), ('to', 10707), ('a', 10433), ('I', 10183), ('in', 7006), ('that', 6911), ('was', 6779), ('his', 4955)]
Elapsed time:  1.7705183029174805


In [10]:
def main2(args):
    start = time.time()

    #Use alternative approach: load the initial file into a pair RDD
    input_pair_rdd = sc.wholeTextFiles("./data/unstructured/")

    counts = input_pair_rdd.flatMap(lambda line: line[1].split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

    print ("\nTaking the 10 most frequent words in the text and corresponding frequencies:")
    print (counts.takeOrdered(10, key=lambda x: -x[1]))
    counts.map(lambda x: (x[1],x[0])).sortByKey(0).map(lambda x: (x[1],x[0])).repartition(1).saveAsTextFile("./output_loadunstructured2/")

    end = time.time()
    print ("Elapsed time: ", (end-start))

In [11]:
#Use alternative approach: load the initial file into a pair RDD
main2(sys.argv)


Taking the 10 most frequent words in the text and corresponding frequencies:
[('the', 22635), ('of', 11167), ('and', 11086), ('to', 10707), ('a', 10433), ('I', 10183), ('in', 7006), ('that', 6911), ('was', 6779), ('his', 4955)]
Elapsed time:  1.3675246238708496
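
Both jobs above run the same `flatMap`, `map`, `reduceByKey` pipeline and differ only in how the input is read. As a rough illustration of what each step does, here is the same logic in plain Python on a hypothetical two-line sample. The helper names `word_count` and `top_n` are made up for this sketch.

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split every line into words and flatten into one sequence
    words = [word for line in lines for word in line.split()]
    # map: pair every word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: add up the counts that share the same key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def top_n(counts, n):
    # Same ordering rule as takeOrdered(n, key=lambda x: -x[1])
    return sorted(counts.items(), key=lambda kv: -kv[1])[:n]

sample = ["the cat and the hat", "the end"]
print(top_n(word_count(sample), 1))
```

Running a small sample through a sketch like this is a cheap way to check the expected result before submitting the Spark job on the full data.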


## Loading CSV

Next, we are going to learn how to load data stored in a structured format like CSV. There are at least two ways to do that:

1. Read the files line by line with the `textFile()` method and split on the delimiter (not recommended). This will produce an RDD, which is a data structure optimized for row-oriented analysis and functional primitives like `map` and `filter`
1. Read the CSV files using the built-in `DataFrameReader` (recommended). This will produce a DataFrame, which is a data structure optimized for column-oriented analysis and relational primitives
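
One reason the manual split is fragile is quoting: a delimiter inside a quoted field must not start a new column. The sketch below uses a made-up record (not taken from `person_nodes.csv`) to compare a naive `split` with Python's standard `csv` module.

```python
import csv
from io import StringIO

# Hypothetical record whose second field contains the delimiter inside quotes
line = '42|"Hanks | Tom"|actor'

# Naive approach: split on the delimiter, ignoring CSV quoting rules
naive = line.split("|")

# csv module: honours the quotes, so the delimiter inside the field is kept
parsed = next(csv.reader(StringIO(line), delimiter="|"))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields four pieces, while the CSV-aware parser returns the three intended fields. A proper CSV reader handles cases like this for you.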

In [12]:
import csv
import sys
import os
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

#this helper is used when the file is read with textFile
def loadRecord(line,header,delimiter):
    """Parse a CSV line"""
    input = StringIO(line)
    reader = csv.DictReader(input, delimiter=delimiter, fieldnames=header)
    return next(reader)

def main_rdd(args):
    #sc = SparkContext(appName="LoadCsv")
    delimiter = "|"

    # Try the record-per-line-input
    input = sc.textFile("./data/csv/person_nodes.csv")
    header_line = input.first()
    header = header_line.split(delimiter)
    # Drop only the header row (exact match), then parse each remaining line
    data = input.filter(lambda x: x != header_line).map(lambda x: loadRecord(x,header,delimiter))
    data.repartition(1).saveAsTextFile("./output_csv/")

def main_dataframe(args):
    delimiter = "|"

    #csv into spark dataframe   
    input_df = spark.read.options(header='true', inferschema='true',delimiter=delimiter).csv('./data/csv/person_nodes.csv')
    input_df.write.mode("overwrite").option("header", "true").csv("./output_csv2/")


In [13]:
#Load into a regular RDD using textFile and parsing the CSV file line by line
main_rdd(sys.argv)

In [14]:
#Load into a DataFrame using Spark's built-in CSV reader
main_dataframe(sys.argv)

### Example: analyzing a diamonds dataset

In a similar way as before, we are going to read the CSV file into a DataFrame.
The Spark DataFrameReader can handle delimiters and escaping, and can optionally treat the first line of a CSV file as a header.

In [15]:
# Read csv data as DataFrame using spark csv dataframe reader
diamonds = spark.read.options(header='true', inferSchema='true').csv('./data/csv/diamonds.csv')

In [45]:
diamonds.show(10)

+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
|  6| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
|  7| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
|  8| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
|  9| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 10| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
only showing top 10 rows



In [17]:
diamonds.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: integer (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)
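
The schema above was inferred, which requires Spark to scan the data before loading it. As a hedged alternative, the schema can be declared up front. The sketch below assumes a recent Spark version in which `DataFrameReader.csv` accepts a DDL-formatted schema string; `DIAMONDS_SCHEMA` and `read_diamonds` are names invented for this example, and the column list is copied from the `printSchema()` output.

```python
# DDL-style schema string; names and types copied from the printSchema() output.
# `table` is back-quoted because it is also an SQL keyword.
DIAMONDS_SCHEMA = (
    "_c0 INT, carat DOUBLE, cut STRING, color STRING, clarity STRING, "
    "depth DOUBLE, `table` DOUBLE, price INT, x DOUBLE, y DOUBLE, z DOUBLE"
)

def read_diamonds(spark, path="./data/csv/diamonds.csv"):
    """Read the diamonds CSV with a declared schema instead of inferSchema.

    `spark` is an existing SparkSession, such as the one created earlier.
    """
    return spark.read.csv(path, header=True, schema=DIAMONDS_SCHEMA)
```

With the session from the beginning of this section, `read_diamonds(spark).printSchema()` should report the declared types. Declaring the schema also guards against a column being inferred differently when the data changes.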



Let's try doing some basic queries to understand the dataset better.

In [18]:
diamonds.count()

53940

In [22]:
diamonds.select('color').distinct().show()

+-----+
|color|
+-----+
|    F|
|    E|
|    D|
|    J|
|    G|
|    I|
|    H|
+-----+



Next, let us estimate the average price per carat. As you may have noticed, the price column is an integer. Spark's `avg` returns a double even for integer input, so the cast below is not strictly required, but it makes the column type explicit before averaging:

In [23]:
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import desc

# Convert Price column to type DoubleType
diamondsdf = diamonds.withColumn("price", diamonds["price"].cast(DoubleType()))

We will use `groupBy` followed by an aggregate function to calculate the average. This creates a column with the default name `avg(price)`, which we rename to something easier to type. Finally, we order the output by average price in descending order:

In [25]:
# Calculate average price per carat value
carat_avgPrice = (diamondsdf
                  .groupBy("carat")
                  .avg("price")
                  .withColumnRenamed("avg(price)", "avgPrice")
                  .orderBy(desc("avgPrice")))

# View top10 highest average prices and corresponding carat value
carat_avgPrice.show(10)

+-----+------------------+
|carat|          avgPrice|
+-----+------------------+
| 3.51|           18701.0|
| 2.67|           18686.0|
|  4.5|           18531.0|
| 5.01|           18018.0|
| 2.57|17841.666666666668|
|  2.6|           17535.0|
| 2.64|           17407.0|
| 4.13|           17329.0|
| 2.39|17182.428571428572|
| 2.71|           17146.0|
+-----+------------------+
only showing top 10 rows



### Analyzing CSV files in Python as RDDs (not recommended approach)

In principle, one can use an `RDD` to analyze structured data as well, but it is less convenient, especially if the logic of your analysis can be expressed using SQL-like relational primitives.

We will now convert our diamonds DataFrame into RDD:

In [26]:
# We can convert the DataFrame directly into an RDD
diamonds_rdd = diamonds.rdd

In [27]:
# View first 3 rows of the diamonds RDD
diamonds_rdd.take(3)

[Row(_c0=1, carat=0.23, cut='Ideal', color='E', clarity='SI2', depth=61.5, table=55.0, price=326, x=3.95, y=3.98, z=2.43),
 Row(_c0=2, carat=0.21, cut='Premium', color='E', clarity='SI1', depth=59.8, table=61.0, price=326, x=3.89, y=3.84, z=2.31),
 Row(_c0=3, carat=0.23, cut='Good', color='E', clarity='VS1', depth=56.9, table=65.0, price=327, x=4.05, y=4.07, z=2.31)]

You can now use RDD operations to analyze the data:

In [28]:
# Diamond counts by cuts
countByGroup = diamonds_rdd.map(lambda x: (x.cut, 1)).reduceByKey(lambda x,y: x+y)
print (countByGroup.collect())

[('Ideal', 21551), ('Premium', 13791), ('Good', 4906), ('Very Good', 12082), ('Fair', 1610)]


In [29]:
# Distinct diamond clarities in dataset
distinctClarity = diamonds_rdd.map(lambda x: x.clarity).distinct()
print (distinctClarity.collect())

['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF']


In [30]:
# Average price per diamond cut: reduce (sum, count) pairs and divide once at the end
# (averaging pairwise with (x+y)/2 inside reduceByKey does not give the mean)
sumCount = diamonds_rdd.map(lambda x: (x.cut, (float(x.price), 1))).reduceByKey(lambda a,b: (a[0]+b[0], a[1]+b[1]))
avgPrice = sumCount.mapValues(lambda s: s[0]/s[1])
print (avgPrice.collect())
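
A word of caution on per-key averages with `reduceByKey`: a reducer of the form `lambda x, y: (x + y) / 2` averages values pairwise, which is not associative and therefore does not produce the mean. The plain-Python sketch below shows the effect with `functools.reduce`, using the five prices from the first rows of the table shown earlier purely as sample numbers.

```python
from functools import reduce

# Prices of the first five rows of diamonds.show(), used only as sample numbers
prices = [326.0, 326.0, 327.0, 334.0, 335.0]

# Pairwise averaging: later values get far more weight than earlier ones
pairwise = reduce(lambda x, y: (x + y) / 2, prices)

# Carry (sum, count) through the reduction and divide once at the end
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                      [(p, 1) for p in prices])
mean = total / count

print(pairwise, mean)
```

The two results differ (332.625 versus 329.6). In a distributed reduction the pairwise value also depends on how the data is partitioned, so only the (sum, count) form is safe.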


### Exercise: load a CSV file and analyze it

Use what you have learned to load the following set of `CSV` datasets. Open **load_csv_exercise.py** and follow the assignment therein.

1. Actor
1. Movie
1. Actor playing in movie (relationships)

Then find the movies in which **Tom Hanks** played.

Save the answer in the `JSON` format.

## Loading JSON

The best, and probably the only reasonable, way to load JSON files is to use the Spark DataFrameReader.
Spark SQL has built-in support for reading JSON files that contain a separate, self-contained JSON object per line.

**Note: By default, the reader does not handle multi-line JSON files; Spark 2.2 and later can read them when the `multiLine` option is enabled.**
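
The one-object-per-line layout matters because every line is parsed on its own. The plain-Python sketch below uses the standard `json` module to show why a pretty-printed file cannot be processed that way. The records are hypothetical, loosely modeled on two fields of the test file.

```python
import json

# Hypothetical records, loosely modeled on two fields of test.json
records = [{"int": 1, "string": "string1"}, {"int": 2, "string": "string2"}]

# JSON Lines: one compact, self-contained object per line
json_lines = "\n".join(json.dumps(r) for r in records)
parsed = [json.loads(line) for line in json_lines.splitlines()]

# Pretty-printed JSON spreads one value over several lines, so parsing
# it line by line already fails on the first line
pretty = json.dumps(records, indent=2)
try:
    json.loads(pretty.splitlines()[0])
    first_line_ok = True
except json.JSONDecodeError:
    first_line_ok = False

print(parsed == records, first_line_ok)
```

The line-delimited layout also lets a large file be split at line boundaries, which is what makes it convenient for parallel reading.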

In [34]:
testJsonData = spark.read.json("./data/json/test.json")

In [32]:
testJsonData.printSchema()

root
 |-- array: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- dict: struct (nullable = true)
 |    |-- extra_key: string (nullable = true)
 |    |-- key: string (nullable = true)
 |-- int: long (nullable = true)
 |-- string: string (nullable = true)



In [33]:
testJsonData.show()

+---------+--------------------+---+-------+
|    array|                dict|int| string|
+---------+--------------------+---+-------+
|[1, 2, 3]|       [null,value1]|  1|string1|
|[2, 4, 6]|       [null,value2]|  2|string2|
|[3, 6, 9]|[extra_value3,val...|  3|string3|
+---------+--------------------+---+-------+



Spark SQL can infer the schema automatically from your JSON data. To view the schema, use `printSchema`.

Let's now try doing some basic queries to understand the dataset better.

In [35]:
# Count number of rows in dataset
print (testJsonData.count())

3


JSON data can contain nested data structures, whose fields can be accessed with dot notation (".").

In [37]:
testJsonData.select('dict.key').show()

+------+
|   key|
+------+
|value1|
|value2|
|value3|
+------+



We can also perform DataFrame operations such as filtering queries according to some criteria:

In [38]:
testJsonData.filter(testJsonData["int"] > 1).show()

+---------+--------------------+---+-------+
|    array|                dict|int| string|
+---------+--------------------+---+-------+
|[2, 4, 6]|       [null,value2]|  2|string2|
|[3, 6, 9]|[extra_value3,val...|  3|string3|
+---------+--------------------+---+-------+



### Analyzing JSON files in Python with SQL
Any DataFrame, including those created with JSON data, can be registered as a Spark SQL table to query with SQL.

In [39]:
# Create a Spark SQL temporary view (registerTempTable is deprecated since Spark 2.0)
# Note that temporary views are tied to this SparkSession and do not persist after it stops
testJsonData.createOrReplaceTempView("test_json")

We can run any SQL queries on that table with Spark SQL:

In [40]:
spark.sql("SELECT * FROM test_json").show()

+---------+--------------------+---+-------+
|    array|                dict|int| string|
+---------+--------------------+---+-------+
|[1, 2, 3]|       [null,value1]|  1|string1|
|[2, 4, 6]|       [null,value2]|  2|string2|
|[3, 6, 9]|[extra_value3,val...|  3|string3|
+---------+--------------------+---+-------+
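
SQL queries can combine the nested-field access and filtering shown earlier. The following is a minimal sketch, assuming the `test_json` table registered above. `QUERY` and `run_query` are names invented for this example, and the column names are back-quoted because `int` and `string` are also SQL type names.

```python
# Select a top-level column and a nested field, keeping rows with int > 1
QUERY = """
SELECT `string`, `dict`.`key` AS key_value
FROM test_json
WHERE `int` > 1
"""

def run_query(spark):
    """Run QUERY on an existing SparkSession and return the resulting DataFrame."""
    return spark.sql(QUERY)
```

In the notebook, `run_query(spark).show()` would display the result. This is the SQL counterpart of chaining `filter` and `select` on the DataFrame.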



### Mini-exercise

Switch to your work directory on the Adroit cluster, open the file **load_json.py**, and follow the inline instructions. Submit the job to the cluster using the **slurm_for_json.cmd** file.