This notebook analyzes profile test data stored in 20 JSON files. It loads the data into a Spark DataFrame and analyzes it with Spark SQL.

1. Import PySpark Library

In [1]:
from pyspark.sql import SparkSession

2. Create SparkSession

In [2]:
spark = SparkSession \
    .builder \
    .appName("Profile Analyse") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

3. Load JSON Files

In [3]:
df = spark.read.json("test_data/*.json")
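
By default, `spark.read.json` expects JSON Lines input, that is, one complete JSON object per line. As a rough illustration of the record shape implied by the schema printed in the next step, the snippet below parses one such line with the standard `json` module. The values are invented for illustration; only the field names come from the schema.

```python
import json

# One hypothetical line from a file under test_data/ (values are invented;
# only the field names are taken from the printed schema).
sample_line = (
    '{"id": "00000000-0000-0000-0000-000000000000", '
    '"profile": {"firstName": "Ada", "lastName": "Example", '
    '"jobHistory": [{"title": "engineer", "location": "Sydney", '
    '"salary": 90000, "fromDate": "2015-01-01", "toDate": "2018-01-01"}]}}'
)

record = json.loads(sample_line)

# Nested fields are reached by walking the structure, which is what
# dotted paths such as "profile.firstName" express in Spark.
first_name = record["profile"]["firstName"]
salaries = [job["salary"] for job in record["profile"]["jobHistory"]]
print(first_name, salaries)
```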

4. Print DataFrame Schema

In [4]:
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- profile: struct (nullable = true)
 |    |-- firstName: string (nullable = true)
 |    |-- jobHistory: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- fromDate: string (nullable = true)
 |    |    |    |-- location: string (nullable = true)
 |    |    |    |-- salary: long (nullable = true)
 |    |    |    |-- title: string (nullable = true)
 |    |    |    |-- toDate: string (nullable = true)
 |    |-- lastName: string (nullable = true)



5. Count Records in the Dataset

In [6]:
df.count()

17139693

6. Flatten Nested DataFrame (Data Preparation for Analysis)

First, promote the nested fields firstName, lastName and jobHistory to top-level columns.

In [10]:
nested_df = df.select("id", "profile.firstName", "profile.lastName", "profile.jobHistory")

Second, use the explode function to turn each element of the jobHistory array into its own row.

In [11]:
from pyspark.sql.functions import explode
exploded_df = nested_df.withColumn("jobHistory", explode(nested_df.jobHistory))
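
To make the effect of `explode` concrete, here is a plain-Python sketch of its row semantics that needs no Spark. The rows and the helper `explode_rows` are hypothetical and only mimic the shape of `nested_df`. Note that profiles with a null or empty jobHistory produce no output rows at all.

```python
# Hypothetical rows shaped like nested_df: (id, firstName, lastName, jobHistory).
rows = [
    ("a1", "Ada", "Example", [{"title": "engineer", "salary": 90000},
                              {"title": "manager", "salary": 120000}]),
    ("b2", "Bob", "Sample", []),         # empty job history
    ("c3", "Cy", "Placeholder", None),   # missing job history
]


def explode_rows(rows):
    """Yield one output row per jobHistory element, mimicking explode()."""
    for person_id, first, last, jobs in rows:
        # explode() emits nothing for a null or empty array,
        # so such profiles vanish from the result.
        for job in jobs or []:
            yield (person_id, first, last, job)


exploded = list(explode_rows(rows))
for row in exploded:
    print(row)
```

If profiles without any jobs should be kept, PySpark also provides `explode_outer`, which emits a single row with a null element for them instead of dropping them.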

Third, promote fromDate, location, salary, title and toDate from the jobHistory struct to top-level columns.

In [12]:
flattened_df = exploded_df.select("id", "firstName", "lastName", "jobHistory.fromDate", "jobHistory.location", "jobHistory.salary", "jobHistory.title", "jobHistory.toDate")

7. Create a Temporary View for the DataFrame

In [13]:
flattened_df.createOrReplaceTempView("people")

8. Average Salary for Each Profile (first 10 rows, ordered by lastName descending)

In [14]:
sqlDF = spark.sql("SELECT id, firstName, lastName, avg(salary) as avgSalary FROM people group by id, firstName, lastName order by lastName desc limit 10")
sqlDF.show()

+--------------------+---------+--------+------------------+
|                  id|firstName|lastName|         avgSalary|
+--------------------+---------+--------+------------------+
|5894afab-574f-429...|  Richard|  Zywiec|           69625.0|
|82dab74c-3946-45b...|   Robert|  Zywiec| 66833.33333333333|
|ba24222d-6e39-40d...|  Matthew|  Zywiec|           65500.0|
|3462e81b-a9cc-4ac...|   Sandra| Zywicki|43333.333333333336|
|4e26c80a-8e84-46f...|    Bobby| Zywicki| 78166.66666666667|
|eeb15ed5-fb0d-4d6...|    Donna| Zywicki|          115000.0|
|b8a6bf60-03da-4e0...|     Paul| Zywicki| 98166.66666666667|
|f643f39c-e18a-430...|   Calvin| Zywicki|125428.57142857143|
|56cc651c-0bbd-492...|    Katie| Zywicki|           55250.0|
|920e1983-c8da-458...|    Lovie| Zywicki| 63571.42857142857|
+--------------------+---------+--------+------------------+
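
As a rough guide to what the query above computes, the following plain-Python sketch groups hypothetical flattened rows by (id, firstName, lastName), averages the salaries in each group, sorts by lastName in descending order and keeps at most 10 groups. The data is invented, and rows that share a lastName may come back in any order, as in the SQL version.

```python
from collections import defaultdict

# Hypothetical flattened rows: (id, firstName, lastName, salary).
rows = [
    ("a1", "Ada", "Example", 90000),
    ("a1", "Ada", "Example", 120000),
    ("b2", "Bob", "Sample", 60000),
    ("b2", "Bob", "Sample", None),   # missing salary
]

# Collect salaries per (id, firstName, lastName) group.
groups = defaultdict(list)
for person_id, first, last, salary in rows:
    bucket = groups[(person_id, first, last)]  # group exists even without a salary
    if salary is not None:                     # SQL avg() ignores NULL values
        bucket.append(salary)

# A group with no non-null salaries gets None, like a NULL average in SQL.
avg_salary = {
    key: (sum(vals) / len(vals) if vals else None)
    for key, vals in groups.items()
}

# Sort by lastName descending and keep at most 10 groups.
top = sorted(avg_salary.items(), key=lambda kv: kv[0][2], reverse=True)[:10]
for key, avg in top:
    print(key, avg)
```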



9. Average Salary Across the Whole Dataset

In [15]:
sqlDF = spark.sql("SELECT avg(salary) as avgSalary FROM people")
sqlDF.show()

+----------------+
|       avgSalary|
+----------------+
|97473.6229416272|
+----------------+
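
Because the `people` view holds one row per job, the figure above is an average over job records, so a profile with many jobs contributes more than a profile with one. The sketch below, using invented numbers, contrasts that with an average of per-profile averages, where every profile counts once. Which of the two is appropriate depends on the question being asked.

```python
from statistics import mean

# Hypothetical salaries per profile id (invented numbers).
salaries_by_profile = {
    "a1": [50000, 50000, 50000, 50000],  # four jobs
    "b2": [100000],                      # one job
}

# What "SELECT avg(salary) FROM people" computes: every job row counts once.
all_salaries = [s for jobs in salaries_by_profile.values() for s in jobs]
avg_over_jobs = mean(all_salaries)

# Alternative: average each profile first, then average those results,
# so every profile counts once regardless of how many jobs it has.
avg_over_profiles = mean(mean(jobs) for jobs in salaries_by_profile.values())

print(avg_over_jobs, avg_over_profiles)
```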

