# Loading data into RDDs and DataFrames


## Load unstructured data

Most of the tasks will be done in the notebook, and the exercises will be done on the cluster. The files and submission scripts needed for the cluster portion of the exercise are here:

```bash
cd 2_LoadingData
```

Note that if you run the code below more than once, it will fail with an "Output directory already exists" exception.
If you must rerun it, clean the output folders first:

In [1]:
def output_cleaner():
    import os
    os.system("rm -rf ./output*")
    print "Output folders removed!"

In [2]:
#from pyspark import SparkContext
import sys
import time
import os

def main1(args):
    start = time.time()
    #sc = SparkContext(appName="LoadUnstructured")

    #By default the path is resolved against the default file system (HDFS on the cluster);
    #prefix it with "file://" to read from the local file system instead.
    #The path can be a folder, a comma-separated list of paths, or a wildcard pattern
    input_rdd = sc.textFile("./data/unstructured/")

    counts = input_rdd.flatMap(lambda line: line.split()) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

    print "\nTaking the 10 most frequent words in the text and corresponding frequencies:"
    #print counts.takeOrdered(10, key=lambda x: -x[1])

    # Sort by descending count; numPartitions=1 writes the sorted result to a single part file
    counts.sortBy(lambda kv: kv[1], ascending=False, numPartitions=1).saveAsTextFile("./output_loadunstructured1/")

    end = time.time()
    print "Elapsed time: ", (end-start)

def main2(args):
    start = time.time()
    #sc = SparkContext(appName="LoadUnstructured")

    #Use alternative approach: load each input file into a pair RDD of (path, content)
    input_pair_rdd = sc.wholeTextFiles("./data/unstructured/")

    counts = input_pair_rdd.flatMap(lambda line: line[1].split()) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

    print "\nTaking the 10 most frequent words in the text and corresponding frequencies:"
    #print counts.takeOrdered(10, key=lambda x: -x[1])
    # Sort by descending count; numPartitions=1 writes the sorted result to a single part file
    counts.sortBy(lambda kv: kv[1], ascending=False, numPartitions=1).saveAsTextFile("./output_loadunstructured2/")

    end = time.time()
    print "Elapsed time: ", (end-start)
    #sc.stop()

In [3]:
# Try the record-per-line-input
#output_cleaner()
main1(sys.argv)



Taking the 10 most frequent words in the text and corresponding frequencies:
Elapsed time:  4.95878505707


In [4]:
#Use alternative approach: load the initial file into a pair RDD
#output_cleaner()
main2(sys.argv)


Taking the 10 most frequent words in the text and corresponding frequencies:
Elapsed time:  1.03143596649


## Loading CSV

First, we are going to learn how to load data stored in the structured CSV format. There are at least two ways to do that:

1) Read the files line by line with the textFile() method and split each line on the delimiter

2) Read the files into a DataFrame using the spark-csv module from Databricks
https://github.com/databricks/spark-csv

As with Pandas DataFrames in Python, Spark has a data structure designed for working with structured data. It is also called the DataFrame and is closely linked to Spark SQL. The second approach reads the CSV directly into a Spark DataFrame.

Note that since Spark 2.0.0, spark-csv has been merged into core Spark, with the API kept very close to the original.
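
The line-by-line approach can be tried without a cluster. The following is a minimal Python 3 sketch of the per-line parsing step, assuming a pipe-delimited file. The `load_record` name and the sample rows are hypothetical and are not taken from person_nodes.csv.

```python
import csv
import io

def load_record(line, header, delimiter):
    """Parse one delimited line into a dict keyed by the header names."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=header, delimiter=delimiter)
    return next(reader)

# Hypothetical sample rows in a pipe-delimited layout
lines = ["id|name", "1|Ada", "2|Grace"]
delimiter = "|"
header_line = lines[0]
header = header_line.split(delimiter)

# Skip the header by comparing whole lines rather than searching for a substring
records = [load_record(l, header, delimiter) for l in lines if l != header_line]
print(records[0]["name"])
```

Using the csv module instead of a bare `split` keeps quoted fields that contain the delimiter intact.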

In [5]:
import csv
import sys
import StringIO
import os

#this one is used when you load the file with textFile
def loadRecord(line,header,delimiter):
    """Parse a CSV line"""
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, delimiter=delimiter, fieldnames=header)
    return reader.next()

def main_rdd(args):
    #sc = SparkContext(appName="LoadCsv")
    delimiter = "|"

    # Try the record-per-line-input
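    input = sc.textFile("./data/csv/person_nodes.csv")
    header_line = input.first()
    header = header_line.split(delimiter)
    # Drop only the header row itself, not every row that happens to contain the first column name
    data = input.filter(lambda x: x != header_line).map(lambda x: loadRecord(x,header,delimiter))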
    data.repartition(1).saveAsTextFile("./output_csv/")

def main_dataframe(args):
    delimiter = "|"

    #csv into spark dataframe   
    input_df = spark.read.options(header='true', inferschema='true',delimiter=delimiter).csv('./data/csv/person_nodes.csv')
    input_df.write.option("header", "true").csv("./output_csv2/")


In [6]:
#Load into a regular RDD using textFile and parsing the CSV file line by line
output_cleaner()
main_rdd(sys.argv)

Output folders removed!


In [7]:
#Load into a DataFrame using the built-in CSV reader (the former spark-csv from Databricks)
main_dataframe(sys.argv)

### CSV files to Dataframes

The Spark CSV DataFrame reader can handle custom delimiters, escaping, and header lines in CSV files.

In [8]:
# Read csv data as DataFrame using spark csv dataframe reader
diamonds = spark.read.options(header='true', inferSchema='true').csv('./data/csv/diamonds.csv')

In [9]:
diamonds.show()

+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+---+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
|  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
|  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
|  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
|  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
|  6| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
|  7| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
|  8| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
|  9| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 10| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
| 11|  0.3|     Good|    J|    SI1| 64.0| 55.0|  339|4.25|4.28|2.73|
| 12| 0.23|    Ideal|    J|    VS1

In [10]:
diamonds.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- carat: double (nullable = true)
 |-- cut: string (nullable = true)
 |-- color: string (nullable = true)
 |-- clarity: string (nullable = true)
 |-- depth: double (nullable = true)
 |-- table: double (nullable = true)
 |-- price: integer (nullable = true)
 |-- x: double (nullable = true)
 |-- y: double (nullable = true)
 |-- z: double (nullable = true)



### Analyzing CSV files in Python as DataFrames

Let's try doing some basic queries to understand the dataset better.

In [11]:
diamonds.count()

53940

In [12]:
diamonds.select('color').distinct().collect()

[Row(color=u'F'),
 Row(color=u'E'),
 Row(color=u'D'),
 Row(color=u'J'),
 Row(color=u'G'),
 Row(color=u'I'),
 Row(color=u'H')]

In [13]:
from pyspark.sql.types import DoubleType
# Import only what is used; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.sql.functions import desc

# Convert Price column to type DoubleType
diamondsdf = diamonds.withColumn("price", diamonds["price"].cast(DoubleType()))

# Calculate average price per carat value
carat_avgPrice = (diamondsdf
                  .groupBy("carat")
                  .avg("price")
                  .withColumnRenamed("avg(price)", "avgPrice")
                  .orderBy(desc("avgPrice")))

# View top10 highest average prices and corresponding carat value
carat_avgPrice.show(10)

+-----+------------------+
|carat|          avgPrice|
+-----+------------------+
| 3.51|           18701.0|
| 2.67|           18686.0|
|  4.5|           18531.0|
| 5.01|           18018.0|
| 2.57|17841.666666666668|
|  2.6|           17535.0|
| 2.64|           17407.0|
| 4.13|           17329.0|
| 2.39|17182.428571428572|
| 2.71|           17146.0|
+-----+------------------+
only showing top 10 rows



### Analyzing CSV files in Python as RDDs

You can also convert your DataFrame to RDDs and perform RDD operations.

In [14]:
# We can convert the DataFrame directly into an RDD
diamonds_rdd = diamonds.rdd

In [15]:
# View first 3 rows of the diamonds RDD
diamonds_rdd.take(3)

[Row(_c0=1, carat=0.23, cut=u'Ideal', color=u'E', clarity=u'SI2', depth=61.5, table=55.0, price=326, x=3.95, y=3.98, z=2.43),
 Row(_c0=2, carat=0.21, cut=u'Premium', color=u'E', clarity=u'SI1', depth=59.8, table=61.0, price=326, x=3.89, y=3.84, z=2.31),
 Row(_c0=3, carat=0.23, cut=u'Good', color=u'E', clarity=u'VS1', depth=56.9, table=65.0, price=327, x=4.05, y=4.07, z=2.31)]

You can now use RDD operations to analyze the data.

In [16]:
# Diamond counts by cuts
countByGroup = diamonds_rdd.map(lambda x: (x.cut, 1)).reduceByKey(lambda x,y: x+y)
print countByGroup.collect()

[(u'Ideal', 21551), (u'Good', 4906), (u'Premium', 13791), (u'Very Good', 12082), (u'Fair', 1610)]


In [17]:
# Distinct diamond clarities in dataset
distinctClarity = diamonds_rdd.map(lambda x: x.clarity).distinct()
print distinctClarity.collect()

[u'SI2', u'SI1', u'VS1', u'I1', u'VS2', u'VVS1', u'VVS2', u'IF']


In [18]:
# Average price per diamond cut
# reduceByKey needs an associative function, and (x+y)/2 is not one,
# so accumulate (sum, count) per cut and divide once at the end
sumCount = diamonds_rdd.map(lambda x: (x.cut, (float(x.price), 1))).reduceByKey(lambda a,b: (a[0]+b[0], a[1]+b[1]))
avgPrice = sumCount.mapValues(lambda s: s[0]/s[1])
print avgPrice.collect()
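
Averaging through a reduce needs care, because the reduce function is applied pairwise and in no fixed grouping. Repeatedly taking `(x + y) / 2` weights later values more heavily and does not give the mean. The following is a minimal plain-Python sketch, with made-up prices, comparing that with a (sum, count) accumulator:

```python
from functools import reduce

# Hypothetical prices for a single cut
prices = [100.0, 200.0, 600.0]

# Pairwise "average of averages": the result depends on the order of combination
pairwise = reduce(lambda x, y: (x + y) / 2, prices)

# (sum, count) accumulator: associative and commutative, so any grouping gives the same result
total, n = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                  ((p, 1) for p in prices))
mean = total / n

print(pairwise, mean)  # 375.0 300.0
```

In Spark the same accumulator can be passed to reduceByKey, followed by mapValues to divide the sum by the count.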


### Mini-exercise on loading CSV

Use what you have learned to load a set of CSV datasets. Open load_csv_exercise.py and follow the assignment therein. Load the following datasets:

-- Actor

-- Movie

-- Actor playing in movie (relationships)

Then find the movies in which **Tom Hanks** played.

Save the answer in JSON format.

### JSON files with Python

This notebook shows an example of how to load JSON data in Python notebooks and best practices for working with JSON data.

#### Loading JSON data with Spark SQL into a DataFrame

Spark SQL has built-in support for reading JSON files that contain a separate, self-contained JSON object per line. A file in which a single object spans several lines is not read correctly by default; Spark 2.2 and later accept such files through the multiLine read option.

In [19]:
testJsonData = spark.read.json("./data/json/test.json")

In [20]:
testJsonData.printSchema()

root
 |-- array: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- dict: struct (nullable = true)
 |    |-- extra_key: string (nullable = true)
 |    |-- key: string (nullable = true)
 |-- int: long (nullable = true)
 |-- string: string (nullable = true)



In [21]:
testJsonData.show()

+---------+--------------------+---+-------+
|    array|                dict|int| string|
+---------+--------------------+---+-------+
|[1, 2, 3]|       [null,value1]|  1|string1|
|[2, 4, 6]|       [null,value2]|  2|string2|
|[3, 6, 9]|[extra_value3,val...|  3|string3|
+---------+--------------------+---+-------+



Spark SQL can infer the schema automatically from your JSON data. To view the schema, use printSchema.
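
The schema above can be understood as the union of the fields seen across all records, with missing fields filled in as null. The following is a minimal plain-Python sketch of that idea. The JSON lines are an assumed reconstruction from the output shown above, not the actual contents of test.json, and this is not how Spark implements inference internally.

```python
import json

# Assumed JSON Lines input: one self-contained object per line
raw = """\
{"string": "string1", "int": 1, "array": [1, 2, 3], "dict": {"key": "value1"}}
{"string": "string2", "int": 2, "array": [2, 4, 6], "dict": {"key": "value2"}}
{"string": "string3", "int": 3, "array": [3, 6, 9], "dict": {"key": "value3", "extra_key": "extra_value3"}}
"""

records = [json.loads(line) for line in raw.splitlines() if line.strip()]

# The struct for "dict" gets every key that appears in any record
dict_fields = sorted({k for r in records for k in r["dict"]})

# Records lacking a field get None, which Spark displays as null
rows = [[r["dict"].get(f) for f in dict_fields] for r in records]
print(dict_fields)  # ['extra_key', 'key']
print(rows[0])      # [None, 'value1']
```

This is why the first two rows show a null extra_key even though only the third record defines it.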

### Analyzing JSON files in Python as DataFrames

Let's try doing some basic queries to understand the dataset better.

In [22]:
# Count number of rows in dataset
print testJsonData.count()

3


JSON data can contain nested data structures which can be accessed with a "."

In [23]:
print testJsonData.select('dict.key').collect()

[Row(key=u'value1'), Row(key=u'value2'), Row(key=u'value3')]


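We can also perform DataFrame operations such as filtering rows according to some criteria. Transformations like filter are lazy, so printing the result shows only the schema of the new DataFrame; call show() or collect() on it to see the rows.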

In [24]:
print testJsonData.filter(testJsonData["int"] > 1)

DataFrame[array: array<bigint>, dict: struct<extra_key:string,key:string>, int: bigint, string: string]


### Analyzing JSON files in Python as RDDs
You can also convert your DataFrame to RDDs and perform RDD operations.

In [25]:
# Convert DataFrame directly into an RDD
testJsonDataRDD = testJsonData.rdd

In [26]:
# View first 3 rows of the RDD
testJsonDataRDD.take(3)

[Row(array=[1, 2, 3], dict=Row(extra_key=None, key=u'value1'), int=1, string=u'string1'),
 Row(array=[2, 4, 6], dict=Row(extra_key=None, key=u'value2'), int=2, string=u'string2'),
 Row(array=[3, 6, 9], dict=Row(extra_key=u'extra_value3', key=u'value3'), int=3, string=u'string3')]

In [27]:
# View distinct values in the 'array' column
testJsonDataRDD.flatMap(lambda r: r.array).distinct().collect()

[1, 2, 3, 4, 6, 9]

### Analyzing JSON files in Python with SQL
Any DataFrame, including those created with JSON data, can be registered as a Spark SQL table to query with SQL.

In [29]:
# Create a Spark SQL temp table
# Note that temp tables are tied to the current Spark session and will not persist across cluster restarts
# (registerTempTable is deprecated since Spark 2.0 in favor of createOrReplaceTempView)
testJsonData.registerTempTable("test_json")

We can run any SQL queries on that table with Spark SQL:

In [30]:
spark.sql("SELECT * FROM test_json").show()

+---------+--------------------+---+-------+
|    array|                dict|int| string|
+---------+--------------------+---+-------+
|[1, 2, 3]|       [null,value1]|  1|string1|
|[2, 4, 6]|       [null,value2]|  2|string2|
|[3, 6, 9]|[extra_value3,val...|  3|string3|
+---------+--------------------+---+-------+



### Mini-exercise

Switch to the Adroit cluster work directory, open the file load_json.py, and follow the inline instructions. Submit the job to the cluster using the slurm_for_json.cmd file.


### Parquet Files in Python

This notebook describes how to register a table in Spark SQL from Parquet files.
Parquet files are a great format for storing large tables in Spark SQL.
Consider converting text files with a schema into Parquet files for more efficient storage.
Parquet is a columnar format, so Spark can skip the columns a query does not need and push filters down to the file scan, which speeds up queries.
Just call `.write.parquet` on a DataFrame to encode it into Parquet.

In [33]:
from pyspark.sql import Row

array = [Row(key="a", group="vowels", value=1, someints=[1], map = {"a" : 1}),
         Row(key="b", group="consonants", value=2, someints=[2, 2], map = {"b" : 2}),
         Row(key="c", group="consonants", value=3, someints=[3, 3, 3], map = {"c" : 3}),
         Row(key="d", group="consonants", value=4, someints=[4, 4, 4, 4], map = {"d" : 4}),
         Row(key="e", group="vowels", value=5, someints=[5, 5, 5, 5, 5], map = {"3" : 5})]
dataframe = spark.createDataFrame(sc.parallelize(array))
dataframe.show()
# now that it's created, let's write it to disk
dataframe.write.parquet("./output_parquet/testParquetFiles")

+----------+---+-----------+---------------+-----+
|     group|key|        map|       someints|value|
+----------+---+-----------+---------------+-----+
|    vowels|  a|Map(a -> 1)|            [1]|    1|
|consonants|  b|Map(b -> 2)|         [2, 2]|    2|
|consonants|  c|Map(c -> 3)|      [3, 3, 3]|    3|
|consonants|  d|Map(d -> 4)|   [4, 4, 4, 4]|    4|
|    vowels|  e|Map(3 -> 5)|[5, 5, 5, 5, 5]|    5|
+----------+---+-----------+---------------+-----+



### Registering a Temp Table from parquet files

Taking Parquet files and registering them as a temp table is super easy in Spark SQL.

In [34]:
dataframeFromParquet = spark.read.parquet("./output_parquet/testParquetFiles")  
dataframeFromParquet.registerTempTable("parquetTable1")

In [35]:
spark.sql("SELECT * FROM parquetTable1").show()

+----------+---+-----------+---------------+-----+
|     group|key|        map|       someints|value|
+----------+---+-----------+---------------+-----+
|consonants|  d|Map(d -> 4)|   [4, 4, 4, 4]|    4|
|    vowels|  e|Map(3 -> 5)|[5, 5, 5, 5, 5]|    5|
|consonants|  b|Map(b -> 2)|         [2, 2]|    2|
|consonants|  c|Map(c -> 3)|      [3, 3, 3]|    3|
|    vowels|  a|Map(a -> 1)|            [1]|    1|
+----------+---+-----------+---------------+-----+

