In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PySparkLearning').getOrCreate()

Since we don’t have a Parquet file yet, let’s start by writing one from a DataFrame. First, create a PySpark DataFrame from a list of tuples using the `spark.createDataFrame()` method.



In [19]:
data =[   ("James ","","Smith","36636","M",3000),
          ("Michael ","Rose","","40288","M",4000),
          ("Robert ","","Williams","42114","M",5000),
          ("Maria ","Anne","Jones","39192","F",4000),
          ("Jen","Mary","Brown","","F",-1)
      ]

columns=["firstname","middlename","lastname","dob","gender","salary"]

df=spark.createDataFrame(data,columns)

In [20]:
# The default save mode is "errorifexists": this call fails if the path already exists.
df.write.parquet("../Resources/people.parquet")


### PySpark: Read a Parquet file into a DataFrame

In [21]:
parDF = spark.read.parquet("../Resources/people.parquet")

In [22]:
parDF.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|  Robert |          |Williams|42114|     M|  5000|
| Michael |      Rose|        |40288|     M|  4000|
|   James |          |   Smith|36636|     M|  3000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



### Append or Overwrite an existing Parquet file

In [23]:
df.write.mode('append').parquet("../Resources/people.parquet")
parDF = spark.read.parquet("../Resources/people.parquet")
parDF.show()

df.write.mode('overwrite').parquet("../Resources/people.parquet")
parqDF = spark.read.parquet("../Resources/people.parquet")
parqDF.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|  Robert |          |Williams|42114|     M|  5000|
|  Robert |          |Williams|42114|     M|  5000|
| Michael |      Rose|        |40288|     M|  4000|
| Michael |      Rose|        |40288|     M|  4000|
|   James |          |   Smith|36636|     M|  3000|
|   James |          |   Smith|36636|     M|  3000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|  Robert |          |Williams|42114|     M|  5000|
| Michael |      Rose|        |40288|     M|  4000|
|   James |
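
The doubled rows in the first table come from `append`, while `overwrite` replaces the existing data with a single copy. As a rough mental model only (plain Python, not Spark code), the save modes accepted by `DataFrameWriter.mode()` can be sketched against an in-memory dict standing in for the file system. The names `save` and `storage` are hypothetical, and Spark itself raises its own exception type rather than `FileExistsError`.

```python
def save(storage, path, rows, mode="errorifexists"):
    """Hypothetical model of Spark save modes; `storage` maps path -> list of rows."""
    exists = path in storage
    if mode == "append":
        # Add the new rows to whatever is already stored at the path.
        storage[path] = storage.get(path, []) + list(rows)
    elif mode == "overwrite":
        # Discard existing rows and keep only the new ones.
        storage[path] = list(rows)
    elif mode == "ignore":
        # Write only if nothing exists yet; otherwise silently do nothing.
        if not exists:
            storage[path] = list(rows)
    elif mode in ("error", "errorifexists"):
        # Default behaviour: refuse to touch an existing path.
        if exists:
            raise FileExistsError(f"path {path} already exists")
        storage[path] = list(rows)
    else:
        raise ValueError(f"unknown save mode: {mode}")


storage = {}
rows = ["James", "Michael", "Robert", "Maria", "Jen"]

save(storage, "people.parquet", rows)                    # first write succeeds
save(storage, "people.parquet", rows, mode="append")     # rows are now duplicated
after_append = len(storage["people.parquet"])

save(storage, "people.parquet", rows, mode="overwrite")  # back to a single copy
after_overwrite = len(storage["people.parquet"])

print(after_append, after_overwrite)
```

Under this model, writing twice with `append` doubles the row count, which matches the first table above.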

### Executing SQL queries on a DataFrame

PySpark SQL lets you create temporary views on Parquet data so that you can run SQL queries against them. These views remain available until the SparkSession that created them ends, which is typically when your program exits.

In [24]:
parqDF.createOrReplaceTempView("ParquetTable")
parkSQL = spark.sql("select * from ParquetTable where salary >= 4000 ")
parkSQL.show()

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|  Robert |          |Williams|42114|     M|  5000|
| Michael |      Rose|        |40288|     M|  4000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
+---------+----------+--------+-----+------+------+



### Creating a table on a Parquet file

Now let’s walk through executing SQL queries directly on a Parquet file. To do so, create a temporary view or table on the Parquet file itself instead of creating it from a DataFrame.

In [29]:
spark.sql("CREATE OR REPLACE TEMP VIEW PERSON USING parquet OPTIONS (path \"../Resources/people.parquet\")")
spark.sql("SELECT * FROM PERSON").show()
# Here, we created a temporary view PERSON from “people.parquet” file.

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|  dob|gender|salary|
+---------+----------+--------+-----+------+------+
|  Robert |          |Williams|42114|     M|  5000|
| Michael |      Rose|        |40288|     M|  4000|
|   James |          |   Smith|36636|     M|  3000|
|   Maria |      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown|     |     F|    -1|
+---------+----------+--------+-----+------+------+



### Create a partitioned Parquet file

When we execute a query on the PERSON table, Spark scans through all the rows and returns the results, much like query execution in a traditional database. In PySpark, we can speed up such queries by partitioning the data when writing it, using the `partitionBy()` method. The following is an example of `partitionBy()`.



In [30]:
df.write.partitionBy("gender","salary").mode("overwrite").parquet("../Resources/people_partition.parquet")

When you check the people_partition.parquet directory, it has two levels of partitions: “gender”, followed by “salary” inside each gender directory.

![Screen Shot](../Reference%20Images/people_partition.png)
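
To make the layout concrete, here is a minimal pure-Python sketch (no Spark required) that derives the `column=value` directory names that `partitionBy("gender","salary")` would be expected to produce for the sample rows. The helper name `partition_dir` is hypothetical, and the sketch ignores the part files and metadata files that Spark also writes.

```python
from pathlib import PurePosixPath

columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
data = [
    ("James ", "", "Smith", "36636", "M", 3000),
    ("Michael ", "Rose", "", "40288", "M", 4000),
    ("Robert ", "", "Williams", "42114", "M", 5000),
    ("Maria ", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]


def partition_dir(row, partition_cols, root="people_partition.parquet"):
    """Hypothetical helper: build the key=value directory path for one row."""
    record = dict(zip(columns, row))
    parts = [f"{col}={record[col]}" for col in partition_cols]
    return str(PurePosixPath(root, *parts))


# One directory per distinct (gender, salary) combination in the data.
partition_dirs = sorted({partition_dir(r, ["gender", "salary"]) for r in data})
for d in partition_dirs:
    print(d)
```

Each distinct combination of partition values gets its own directory, so the five sample rows map to five leaf directories under this assumption.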

In [34]:
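# Retrieving from a partitioned Parquet file.
# Reading the gender=M subdirectory directly returns only those rows; the gender
# column is absent from the result because its value is encoded in the path.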

parDF = spark.read.parquet('../Resources/people_partition.parquet/gender=M')
parDF.show()

+---------+----------+--------+-----+------+
|firstname|middlename|lastname|  dob|salary|
+---------+----------+--------+-----+------+
|  Robert |          |Williams|42114|  5000|
| Michael |      Rose|        |40288|  4000|
|   James |          |   Smith|36636|  3000|
+---------+----------+--------+-----+------+



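### Creating a table on a partitioned Parquet file

Here we create a temporary view on a single partition of the partitioned Parquet file. Because the view points at the gender=F directory, a query against it reads only that partition’s files, which should be faster than scanning the whole unpartitioned table when the data is large.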



In [38]:
spark.sql("CREATE OR REPLACE TEMP VIEW PERSON_PART using PARQUET OPTIONS (path '../Resources/people_partition.parquet/gender=F')")
spark.sql("SELECT * FROM PERSON_PART").show()

+---------+----------+--------+-----+------+
|firstname|middlename|lastname|  dob|salary|
+---------+----------+--------+-----+------+
|   Maria |      Anne|   Jones|39192|  4000|
|      Jen|      Mary|   Brown|     |    -1|
+---------+----------+--------+-----+------+
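
The view above is faster mainly because fewer files are read. A rough, Spark-free sketch of that idea: given the partition directories, a filter on a partition column only needs to keep the directories whose names match, and everything else is skipped without being opened. The names `parse_partition` and `prune` are hypothetical; Spark’s actual pruning happens inside its query planner and is not shown here.

```python
def parse_partition(path):
    """Hypothetical helper: extract {column: value} from key=value path segments."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(paths, **wanted):
    """Keep only the directories whose partition values match every requested value."""
    return [
        p for p in paths
        if all(parse_partition(p).get(col) == val for col, val in wanted.items())
    ]


# Directory names assumed from partitioning the sample data by gender, then salary.
all_dirs = [
    "people_partition.parquet/gender=F/salary=-1",
    "people_partition.parquet/gender=F/salary=4000",
    "people_partition.parquet/gender=M/salary=3000",
    "people_partition.parquet/gender=M/salary=4000",
    "people_partition.parquet/gender=M/salary=5000",
]

female_dirs = prune(all_dirs, gender="F")
print(female_dirs)
```

Note that partition values parsed from paths are strings in this sketch, so a salary filter would be written as `prune(all_dirs, salary="4000")`.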

