# DataFrame in PySpark: Overview

In Apache Spark, a DataFrame is a distributed collection of rows organised under named columns. Conceptually, it resembles a table in a relational database or an Excel sheet with column headers. It also shares some characteristics with RDDs:
- Immutable: once created, a DataFrame or RDD can’t be changed. Applying a transformation returns a new DataFrame or RDD and leaves the original untouched.
- Lazily evaluated: transformations are not executed until an action is performed.
- Distributed: both RDDs and DataFrames are partitioned across the nodes of a cluster.

## Why are DataFrames useful?
- DataFrames are designed for processing large collections of structured or semi-structured data.
- Observations in a Spark DataFrame are organised under named columns, which tells Spark the schema of the DataFrame. Spark uses the schema to optimize the execution plan of queries on it.
- DataFrames in Apache Spark can handle petabytes of data.
- DataFrames support a wide range of data formats and sources.
- The DataFrame API is available in several languages: Python, R, Scala and Java.

## How to create a DataFrame

A DataFrame in Apache Spark can be created in multiple ways:
- By loading data stored in different formats, for example JSON or CSV files.
- By converting an existing RDD.
- By programmatically specifying the schema.

## Creating a DataFrame from an RDD

- Create a list of tuples. Each tuple contains a person's name and age.
- Create an RDD from the list above.
- Convert each tuple to a Row.
- Create a DataFrame by calling createDataFrame on the RDD through sqlContext.

In [1]:
# On Windows: locate the Spark installation and make PySpark importable
import findspark
findspark.init()
findspark.find()

'C:\\Tools\\spark-3.3.0-bin-hadoop3'

In [2]:
import pyspark
sc = pyspark.SparkContext(appName='Spark DataFrames')

In [3]:
from pyspark.sql import SQLContext
# SQLContext is deprecated since Spark 3.0 in favour of SparkSession, but still works here
sqlContext = SQLContext(sc)



In [4]:
from pyspark.sql import Row
lst = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = sc.parallelize(lst)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = sqlContext.createDataFrame(people)

In [5]:
# Let's check the type of schemaPeople.
type(schemaPeople)

pyspark.sql.dataframe.DataFrame

In [6]:
schemaPeople.collect()

[Row(name='Ankit', age=25),
 Row(name='Jalfaizy', age=22),
 Row(name='saurabh', age=20),
 Row(name='Bala', age=26)]

In [7]:
schemaPeople.show()

+--------+---+
|    name|age|
+--------+---+
|   Ankit| 25|
|Jalfaizy| 22|
| saurabh| 20|
|    Bala| 26|
+--------+---+



In [None]:
# Create the Departments
dept1 = Row(id='123456', name='Computer Science')
dept2 = Row(id='789012', name='Mechanical Engineering')
dept3 = Row(id='345678', name='Theater and Drama')
dept4 = Row(id='901234', name='Indoor Recreation')

In [None]:
Employee = Row("firstName", "lastName", "email", "salary")

In [None]:
emp1 = Employee('ramesh', 'armbrust', 'no-reply@spds.edu', 100000)
emp2 = Employee('suresh', 'meng', 'no-reply@spds.edu', 120000)
emp3 = Employee('naresh', None, 'no-reply@spds.edu', 140000)
emp4 = Employee(None, 'kamesh', 'no-reply@spds.edu', 160000)

In [None]:
# Associate each department with a list of employees
deptWithEmp12 = Row(department=dept1, employees=[emp1, emp2])
deptWithEmp34 = Row(department=dept2, employees=[emp3, emp4])
deptWithEmp14 = Row(department=dept3, employees=[emp1, emp4])
deptWithEmp23 = Row(department=dept4, employees=[emp2, emp3])

In [None]:
df1 = sqlContext.createDataFrame([deptWithEmp12,deptWithEmp34])

In [None]:
df1.show()

In [None]:
df2 = sqlContext.createDataFrame([deptWithEmp14,deptWithEmp23])

In [None]:
# union matches columns by position, so df1 and df2 must have the same schema
unionDF = df1.union(df2)

In [None]:
unionDF.show()

## Parquet

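Parquet is a columnar file format that stores the schema alongside the data. Because a query can read only the columns it needs and columnar data compresses well, Parquet is usually far more efficient than CSV or JSON for analytical queries.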

In [None]:
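# mode('overwrite') lets this cell be re-run; without it the write fails if the path already exists
unionDF.write.mode('overwrite').parquet('df.parquet2')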

In [None]:
parquetDF = sqlContext.read.parquet('df.parquet2')

In [None]:
parquetDF.show()

## Creating a DataFrame with toDF

In [None]:
from pyspark.sql import Row
# Build an RDD of Row objects; the field names become the column names
rdd = sc.parallelize([
    Row(name='a', age=99, height=100),
    Row(name='b', age=50, height=100),
    Row(name='b', age=50, height=178),
])

In [None]:
# toDF() is available on RDDs once a SQLContext or SparkSession has been created
df1 = rdd.toDF()

In [None]:
df1.show()