Load and Join Employees and Departments data

In [1]:
# SparkSession is the entry point for DataFrames and Datasets; RDDs are created through its SparkContext.
import logging
from pyspark.sql import SparkSession

In [6]:
def rdd_to_dataframe(data, schema):
    """
    Example: create a Spark RDD from `data`, load it into a Spark DataFrame
    using the column names in `schema`, and return the DataFrame.
    """
        
    # Create a SparkSession
    spark = SparkSession.builder.appName("RDDToDataFrame").getOrCreate()

    try:
        # Create an RDD from the input data, using Spark Context not Session!
        rdd = spark.sparkContext.parallelize(data)

        # Convert RDD to DataFrame
        df = spark.createDataFrame(rdd, schema)

        # Return the DataFrame, without stopping the SparkSession
        return df

    except Exception as e:
        # Log the error, stop the SparkSession, and re-raise so the caller never silently receives None
        logging.error('Error while transforming RDD to DF: {}'.format(e))
        spark.stop()
        raise

In [7]:
# Data sample
dept_data = [(1,"Big Data"), (2, "Finance"), (3,"Marketing")]
dept_schema = ["department_id", "department_name"]

In [8]:
# Data sample
emp_data = [(1,"Sita", 17), (1,"Gita", 30), (2,"Mohan", 26)]
emp_schema = ["department_id","employee_name", "age"]

Using a Spark RDD as a Spark DataFrame

In [9]:
# Call the function to turn each sample list into a DataFrame (via an RDD)
df_emp = rdd_to_dataframe(emp_data, emp_schema)
df_dept = rdd_to_dataframe(dept_data, dept_schema)

24/03/17 08:11:35 WARN Utils: Your hostname, sasa-1-2 resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
24/03/17 08:11:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/03/17 08:11:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

In [10]:
# Show the department rows
df_dept.show()

+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|            1|       Big Data|
|            2|        Finance|
|            3|      Marketing|
+-------------+---------------+



In [11]:
df_emp.printSchema()

root
 |-- department_id: long (nullable = true)
 |-- employee_name: string (nullable = true)
 |-- age: long (nullable = true)



Using Spark SQL to join two datasets: dept and emp

In [12]:
# Do we have a session running?
# Gets an existing SparkSession or, if there is no existing one,
# creates a new one based on the options set in this builder.
spark = SparkSession.builder.appName("RDDToDataFrame").getOrCreate()

In [13]:
# Register as views for spark.sql
'''
    Some key points to note:

    createOrReplaceTempView() registers a DataFrame as a named view for the current Spark session.
    Once created, the view can be queried with spark.sql().
    Temporary views are session-scoped, i.e. valid only within the Spark session that created them.
    They cannot be shared between sessions.
    They are dropped when the session ends; to keep the data beyond the session, save it as a table.
    Use saveAsTable() to materialize the contents of the DataFrame and create a pointer to the data in the metastore.
'''
df_emp.createOrReplaceTempView('employees')
df_dept.createOrReplaceTempView('departments')

In [14]:
# Query sample, using Spark SQL
spark.sql('''
            select emp.*, dept.*
            from employees as emp
                inner join departments as dept on (emp.department_id = dept.department_id) 
            where age >= 18
            ''').show()



+-------------+-------------+---+-------------+---------------+
|department_id|employee_name|age|department_id|department_name|
+-------------+-------------+---+-------------+---------------+
|            1|         Gita| 30|            1|       Big Data|
|            2|        Mohan| 26|            2|        Finance|
+-------------+-------------+---+-------------+---------------+




In [16]:
# Let's now save the JOINED RESULTSET into a new Temporary View
spark.sql('''
        select emp.employee_name, emp.age, emp.department_id, dept.department_name
        from employees as emp
            inner join departments as dept on (emp.department_id = dept.department_id)
        where emp.age >= 18
        ''').createOrReplaceTempView('dept_employees')

In [17]:
spark.sql('''
        select * from dept_employees where department_id is not null
        ''').show()



+-------------+---+-------------+---------------+
|employee_name|age|department_id|department_name|
+-------------+---+-------------+---------------+
|         Gita| 30|            1|       Big Data|
|        Mohan| 26|            2|        Finance|
+-------------+---+-------------+---------------+




Save this output for business data consumers.

In [21]:
# Define output location
output_location = '/home/sasa/output/dept_employees/'

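# Save the joined result set to local storage; the target could equally be Amazon S3 or another store.
# mode('append') adds new part files on each run instead of overwriting existing ones.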
spark.sql('''
        select * from dept_employees where department_id is not null
        ''').write.mode('append').csv(output_location)

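Description of the output: Spark writes a directory rather than a single file. Two files are generated at output_location:
1) _SUCCESS - an empty marker file indicating that the write completed.
2) part00000-mhfjdkl-c000.csv - contains the output rows, without a header row because the header option was not set.
Spark writes one part file per partition, so larger results (or repeated runs in append mode) produce several part files.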
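A consumer without Spark can read the output directory using only the Python standard library. The helper below is a hypothetical sketch, not part of the notebook: `read_spark_csv_dir` is an invented name, and the column names have to be supplied by the caller because the files carry no header row. It assumes the part files match the pattern part*.csv.

```python
import csv
import glob
import os


def read_spark_csv_dir(output_dir, column_names):
    """Read every part file in a Spark CSV output directory into a list of dicts."""
    rows = []
    # The pattern matches only the part files, so _SUCCESS is skipped
    for path in sorted(glob.glob(os.path.join(output_dir, "part*.csv"))):
        with open(path, newline="") as handle:
            for record in csv.reader(handle):
                # All values are returned as strings; cast them as needed
                rows.append(dict(zip(column_names, record)))
    return rows
```

For the directory written above, a call might look like `read_spark_csv_dir(output_location, ["employee_name", "age", "department_id", "department_name"])`.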