This notebook analyzes profile test data. The test data consists of 20 JSON files, which are loaded into a DataFrame and analyzed with Spark SQL.

1. Import PySpark Library

In [1]:
from pyspark.sql import SparkSession

2. Create SparkSession

In [2]:
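# Create a new session, or reuse the active one if it already exists.
# Note: "spark.some.config.option" is a placeholder key, not a real Spark setting.
spark = SparkSession \
    .builder \
    .appName("Profile Analyse") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()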

3. Load JSON Files

In [3]:
df = spark.read.json("test_data/*.json")

4. Print DataFrame Schema

In [4]:
df.printSchema()

root
 |-- id: string (nullable = true)
 |-- profile: struct (nullable = true)
 |    |-- firstName: string (nullable = true)
 |    |-- jobHistory: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- fromDate: string (nullable = true)
 |    |    |    |-- location: string (nullable = true)
 |    |    |    |-- salary: long (nullable = true)
 |    |    |    |-- title: string (nullable = true)
 |    |    |    |-- toDate: string (nullable = true)
 |    |-- lastName: string (nullable = true)
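
For reference, a record matching this schema might look like the one below. This is a minimal sketch in plain Python (no Spark needed). The field values are invented for illustration, and it assumes the input files are line-delimited JSON (one object per line), which is what `spark.read.json` expects by default.

```python
import json

# Hypothetical sample line: the values are invented, only the shape follows the schema.
sample_line = (
    '{"id": "abc-123", "profile": {"firstName": "Ada", "lastName": "Smith", '
    '"jobHistory": [{"title": "engineer", "location": "Perth", "salary": 90000, '
    '"fromDate": "2015-01-01", "toDate": "2017-06-30"}]}}'
)

record = json.loads(sample_line)

# jobHistory is an array of structs, so each entry is a dict with five fields.
first_job = record["profile"]["jobHistory"][0]
print(record["id"], record["profile"]["firstName"], first_job["salary"])
```

If the files held one pretty-printed object spanning several lines instead, the reader would need the `multiLine` option set to true.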



5. Count Records in the Dataset

In [6]:
df.count()

17139693

6. Flatten Nested DataFrame (Data Preparation for Analysis)

First, promote firstName, lastName and jobHistory from the profile struct to top-level columns

In [10]:
nested_df = df.select("id", "profile.firstName", "profile.lastName", "profile.jobHistory")

Second, use the explode function to turn each element of the jobHistory array into its own row

In [8]:
from pyspark.sql.functions import explode
exploded_df = nested_df.withColumn("jobHistory", explode(nested_df.jobHistory))
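
To make the row-level effect of `explode` concrete, the sketch below mimics it in plain Python on two invented records. The helper `explode_rows` is hypothetical and is not part of PySpark. Like Spark's `explode`, it produces no output row for a record whose array is empty or null; `explode_outer` from `pyspark.sql.functions` keeps such records with a null value instead.

```python
def explode_rows(rows, column):
    """Yield one output row per element of row[column] (hypothetical helper)."""
    for row in rows:
        # A missing, None, or empty array contributes no output rows.
        for element in row.get(column) or []:
            yield {**row, column: element}


# Invented sample data: the first record has two jobs, the second has none.
rows = [
    {"id": "1", "jobHistory": [{"title": "a"}, {"title": "b"}]},
    {"id": "2", "jobHistory": []},
]

exploded = list(explode_rows(rows, "jobHistory"))
print(len(exploded))  # 2: both rows come from id "1", id "2" is dropped
```

Because of this behaviour, the exploded DataFrame holds one row per job rather than one row per profile, so its row count will generally differ from the count in step 5.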

Third, promote fromDate, location, salary, title and toDate from the jobHistory struct to top-level columns

In [9]:
flattern_df = exploded_df.select("id", "firstName", "lastName", "jobHistory.fromDate", "jobHistory.location", "jobHistory.salary", "jobHistory.title", "jobHistory.toDate")
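
Selecting a nested field by its dotted path, as in `jobHistory.salary`, yields a column named after the last path component (`salary`). The plain-Python sketch below shows that projection on one invented exploded row; `select_paths` is a hypothetical helper, not a PySpark function.

```python
def select_paths(row, paths):
    """Project dotted paths out of a nested dict, naming each by its leaf key."""
    out = {}
    for path in paths:
        keys = path.split(".")
        value = row
        for key in keys:
            value = value[key]  # walk one level down per path component
        out[keys[-1]] = value
    return out


# Invented sample row, shaped like one row of the exploded DataFrame.
exploded_row = {
    "id": "1",
    "firstName": "Ada",
    "lastName": "Smith",
    "jobHistory": {
        "fromDate": "2015-01-01",
        "location": "Perth",
        "salary": 90000,
        "title": "engineer",
        "toDate": "2017-06-30",
    },
}

flat_row = select_paths(exploded_row, [
    "id", "firstName", "lastName",
    "jobHistory.fromDate", "jobHistory.location", "jobHistory.salary",
    "jobHistory.title", "jobHistory.toDate",
])
print(list(flat_row))
```

The result is a single flat level of eight columns, which is the shape the Spark SQL queries can work with directly.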